[ https://issues.apache.org/jira/browse/SPARK-38058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484281#comment-17484281 ]
Hyukjin Kwon commented on SPARK-38058:
--------------------------------------

Is this an issue from Spark, Azure SQL Server, or somewhere else? It doesn't look like a problem specific to Spark. It would be great to post some evidence and logs showing that the issue comes from Spark.

> Writing a spark dataframe to Azure Sql Server is causing duplicate records intermittently
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-38058
>                 URL: https://issues.apache.org/jira/browse/SPARK-38058
>             Project: Spark
>          Issue Type: Bug
>      Components: PySpark, Spark Core
>    Affects Versions: 3.1.0
>            Reporter: john
>            Priority: Major
>
> We are using the JDBC option to insert transformed data from a Spark DataFrame into a table in Azure SQL Server. Below is the code snippet we are using for this insert. However, we have noticed on a few occasions that some records are duplicated in the destination table. This happens with large tables: for example, if a DataFrame has 600K records, after inserting the data into the table we find around 620K records. We would like to understand why this is happening.
>
> {{DataToLoad.write.jdbc(url = jdbcUrl, table = targetTable, mode = "overwrite", properties = jdbcConnectionProperties)}}
>
> The only explanation we can think of is that, while the inserts happen in a distributed fashion, if one of the executors fails mid-write, its task is retried and could insert duplicate records. This may be completely off base, but we wanted to check whether it could be the issue.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
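The retry theory in the report can be illustrated with a minimal, self-contained sketch. This is plain Python, not Spark or JDBC: `write_partition`, `write_with_retry`, and the in-memory `sink` list are all hypothetical stand-ins for a partitioned write against the destination table. It shows how a task that fails part-way through an append and is then retried from the start re-inserts rows it already wrote, inflating the destination count (the 600K-vs-620K symptom), and how deduplicating on a key column recovers the intended count.

```python
# Hypothetical simulation (plain Python, not Spark) of a non-idempotent
# partitioned write: one task fails mid-append, is retried from scratch,
# and its already-written rows end up duplicated in the sink.

def write_partition(rows, sink, fail_at=None):
    """Append rows to sink; raise mid-way to simulate an executor failure."""
    for i, row in enumerate(rows):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated executor failure")
        sink.append(row)

def write_with_retry(partitions, sink):
    """Retry each failed partition from the beginning (non-idempotent)."""
    for rows in partitions:
        try:
            # The first partition fails half-way through its first attempt.
            fail_at = len(rows) // 2 if rows is partitions[0] else None
            write_partition(rows, sink, fail_at=fail_at)
        except RuntimeError:
            write_partition(rows, sink)  # retry re-inserts rows already written

partitions = [[("id%d" % i, i) for i in range(10)],
              [("id%d" % i, i) for i in range(10, 20)]]
sink = []
write_with_retry(partitions, sink)

total_source = sum(len(p) for p in partitions)   # 20 source rows
assert len(sink) > total_source                  # duplicates present (25 rows)

# Deduplicating on the key column restores the expected count.
deduped = {key: val for key, val in sink}
assert len(deduped) == total_source
```

If this mechanism is the cause, possible mitigations on the Spark side include writing to a staging table and merging into the target on a primary key, or deduplicating the loaded table on a unique key after the write; `DataFrame.dropDuplicates` only helps if the duplication exists before the write, not when it is introduced by task retries during the write itself.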