[ https://issues.apache.org/jira/browse/SPARK-38058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484281#comment-17484281 ]
Hyukjin Kwon commented on SPARK-38058:
--------------------------------------

Is this an issue from Spark, Azure SQL Server, or somewhere else? It doesn't look like a problem specific to Spark. It would be great to post some evidence and logs showing that the issue comes from Spark.

> Writing a spark dataframe to Azure Sql Server is causing duplicate records intermittently
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-38058
>                 URL: https://issues.apache.org/jira/browse/SPARK-38058
>             Project: Spark
>          Issue Type: Bug
>      Components: PySpark, Spark Core
>    Affects Versions: 3.1.0
>            Reporter: john
>            Priority: Major
>
> We are using the JDBC option to insert transformed data from a Spark DataFrame into a table in Azure SQL Server. Below is the code snippet we are using for this insert. However, we have noticed on a few occasions that some records are duplicated in the destination table. This happens with large tables: for example, if a DataFrame has 600K records, after inserting the data into the table we find around 620K records. We would like to understand why this is happening.
>
> {{DataToLoad.write.jdbc(url = jdbcUrl, table = targetTable, mode = "overwrite", properties = jdbcConnectionProperties)}}
>
> The only explanation we can think of is that, while the inserts happen in a distributed fashion, if one of the executors fails mid-write, its task is retried and could insert duplicate records. This may be completely off base, but we wanted to check whether it could be the issue.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
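The retry theory in the report can be illustrated with a minimal, self-contained sketch. This is plain Python, not Spark or JDBC: `write_partition`, `write_with_retry`, and the in-memory `sink` list are all hypothetical stand-ins for a partitioned write against the destination table. It shows how a task that fails part-way through an append and is then retried from the start re-inserts rows it already wrote, inflating the destination count (the 600K-vs-620K symptom), and how deduplicating on a key column recovers the intended count.

```python
# Hypothetical simulation (plain Python, not Spark) of a non-idempotent
# partitioned write: one task fails mid-append, is retried from scratch,
# and its already-written rows end up duplicated in the sink.

def write_partition(rows, sink, fail_at=None):
    """Append rows to sink; raise mid-way to simulate an executor failure."""
    for i, row in enumerate(rows):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated executor failure")
        sink.append(row)

def write_with_retry(partitions, sink):
    """Retry each failed partition from the beginning (non-idempotent)."""
    for rows in partitions:
        try:
            # The first partition fails half-way through its first attempt.
            fail_at = len(rows) // 2 if rows is partitions[0] else None
            write_partition(rows, sink, fail_at=fail_at)
        except RuntimeError:
            write_partition(rows, sink)  # retry re-inserts rows already written

partitions = [[("id%d" % i, i) for i in range(10)],
              [("id%d" % i, i) for i in range(10, 20)]]
sink = []
write_with_retry(partitions, sink)

total_source = sum(len(p) for p in partitions)   # 20 source rows
assert len(sink) > total_source                  # duplicates present (25 rows)

# Deduplicating on the key column restores the expected count.
deduped = {key: val for key, val in sink}
assert len(deduped) == total_source
```

If this mechanism is the cause, possible mitigations on the Spark side include writing to a staging table and merging into the target on a primary key, or deduplicating the loaded table on a unique key after the write; `DataFrame.dropDuplicates` only helps if the duplication exists before the write, not when it is introduced by task retries during the write itself.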