[ https://issues.apache.org/jira/browse/SPARK-38058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486263#comment-17486263 ]

john commented on SPARK-38058:
------------------------------

Since I am working in a production environment, I cannot disclose any documents 
here. This may be a bug in Spark. It happens roughly three out of every five 
runs: two runs insert all the records correctly, while the others insert 
duplicates. We have tried all the suggested workarounds and none of them work.
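
The kind of workaround we have tried centres on ruling out re-executed write 
tasks, i.e. disabling speculation so that no duplicate copy of a slow task 
runs, and stopping Spark from retrying a partially written partition. A 
minimal sketch of that configuration (the option names are standard Spark 
settings; the values and app name are illustrative, not from our production 
jobs):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("jdbc-write-no-retries")       # hypothetical app name
    # Do not launch speculative duplicates of slow write tasks.
    .config("spark.speculation", "false")
    # Fail the job on the first task failure instead of retrying,
    # so a partially written partition is never inserted twice.
    .config("spark.task.maxFailures", "1")
    .getOrCreate())
{code}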

> Writing a Spark DataFrame to Azure SQL Server is causing duplicate records 
> intermittently
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-38058
>                 URL: https://issues.apache.org/jira/browse/SPARK-38058
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.1.0
>            Reporter: john
>            Priority: Major
>
> We are using the JDBC option to insert transformed data in a Spark DataFrame 
> into a table in Azure SQL Server. Below is the code snippet we are using for 
> this insert. However, we have noticed on a few occasions that some records 
> are duplicated in the destination table. This happens for large tables, e.g. 
> if a DataFrame has 600K records, then after inserting the data into the table 
> we end up with around 620K records. We would like to understand why this is 
> happening.
>  {{DataToLoad.write.jdbc(url = jdbcUrl, table = targetTable, mode = "overwrite", properties = jdbcConnectionProperties)}}
>  
> The only explanation we can think of is that, while the inserts happen in a 
> distributed fashion, the tasks of a failed executor are retried and could be 
> inserting duplicate records. This may be completely off the mark, but we 
> mention it in case it is relevant.
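>
> If retried tasks are the cause, one way to make the load idempotent would be 
> to write to a staging table and swap it into place only after the 
> distributed write has fully succeeded. A sketch of that idea; the staging 
> table name and the pyodbc connection string are assumptions for 
> illustration, not part of our actual pipeline:
>
> {code:python}
> import pyodbc
>
> stagingTable = targetTable + "_staging"  # hypothetical staging table
>
> # The distributed, non-transactional write targets the staging table,
> # so a retried task can only ever affect the staging copy.
> DataToLoad.write.jdbc(url=jdbcUrl, table=stagingTable,
>                       mode="overwrite",
>                       properties=jdbcConnectionProperties)
>
> # Swap the fully loaded staging table into place in one short step.
> # odbcConnectionString is assumed here; sp_rename is metadata-only.
> conn = pyodbc.connect(odbcConnectionString)
> cur = conn.cursor()
> cur.execute(f"DROP TABLE IF EXISTS {targetTable}")
> cur.execute(f"EXEC sp_rename '{stagingTable}', '{targetTable}'")
> conn.commit()
> conn.close()
> {code}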



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
