[ https://issues.apache.org/jira/browse/SPARK-38058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486263#comment-17486263 ]
john commented on SPARK-38058:
------------------------------

Since I am working in a production environment, I cannot share any documents here. This may be a bug in Spark: it happens intermittently, in roughly three out of every five runs. In the other two runs all records are inserted correctly; otherwise duplicates are inserted. We have tried all the known workarounds and none of them help.

> Writing a spark dataframe to Azure Sql Server is causing duplicate records intermittently
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-38058
>                 URL: https://issues.apache.org/jira/browse/SPARK-38058
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.1.0
>            Reporter: john
>            Priority: Major
>
> We are using the JDBC option to insert transformed data from a Spark DataFrame into a table in Azure SQL Server; the snippet below is the code we use for the insert. On a few occasions we have noticed that some records are duplicated in the destination table. This happens with large tables: for example, a DataFrame with 600K records ends up as roughly 620K rows in the table after the insert. We would like to understand why this happens.
>
> {{DataToLoad.write.jdbc(url = jdbcUrl, table = targetTable, mode = "overwrite", properties = jdbcConnectionProperties)}}
>
> The only explanation we could think of is that, since the inserts happen in a distributed fashion, a failed executor task may be retried and insert its rows a second time. This could be completely off the mark, but we mention it in case it is the cause.
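
A quick check that helps localize the problem is to compare the source DataFrame's total row count with its distinct-key count before the write. This is a minimal sketch; the natural-key column {{id}} is an assumption, since the report does not name one:

{code:python}
# Minimal sketch: rule out the source DataFrame as the origin of duplicates.
# "id" is a hypothetical natural-key column; the report does not name one.
total_rows = DataToLoad.count()
distinct_rows = DataToLoad.dropDuplicates(["id"]).count()
print(f"total={total_rows}, distinct={distinct_rows}")

# If these counts match but the SQL Server table ends up larger, the extra
# rows were introduced during the JDBC write itself, which is consistent
# with the retried-task hypothesis above.
{code}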
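
If retried executor tasks are indeed re-inserting rows, one common mitigation pattern is to let Spark overwrite a staging table and then move the rows into the final table in a single server-side transaction, collapsing duplicates on the way. This is a sketch under assumptions, not a verified fix for this ticket: the staging table name and the pyodbc connection string are hypothetical.

{code:python}
import pyodbc  # assumption: pyodbc is available on the driver node

# 1) Spark overwrites a staging table; duplicates from retried tasks land
#    here instead of in the final table.
DataToLoad.write.jdbc(url=jdbcUrl,
                      table="dbo.targetTable_staging",  # hypothetical name
                      mode="overwrite",
                      properties=jdbcConnectionProperties)

# 2) Move the rows into the final table in one transaction, deduplicating
#    with SELECT DISTINCT.
odbc_conn_string = "..."  # hypothetical Azure SQL ODBC connection string
conn = pyodbc.connect(odbc_conn_string, autocommit=False)
cur = conn.cursor()
cur.execute("TRUNCATE TABLE dbo.targetTable;")
cur.execute("INSERT INTO dbo.targetTable "
            "SELECT DISTINCT * FROM dbo.targetTable_staging;")
conn.commit()
conn.close()
{code}

Keeping speculative execution disabled ({{spark.speculation=false}}, its default) and reducing write parallelism via the {{numPartitions}} JDBC option are also commonly suggested, but neither makes a plain JDBC insert truly idempotent.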