john created SPARK-38058:
----------------------------

             Summary: Writing a Spark DataFrame to Azure SQL Server is causing duplicate records intermittently
                 Key: SPARK-38058
                 URL: https://issues.apache.org/jira/browse/SPARK-38058
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core
    Affects Versions: 3.1.0
            Reporter: john


We are using the JDBC option to insert transformed data from a Spark DataFrame 
into a table in Azure SQL Server; the code snippet we are using for the insert 
is below. However, we have noticed on a few occasions that some records are 
duplicated in the destination table. This happens with large tables: for 
example, when a DataFrame has 600K records, after inserting into the table we 
end up with around 620K. We would like to understand why this is happening.
{code:python}
DataToLoad.write.jdbc(url=jdbcUrl, table=targetTable, mode="overwrite",
                      properties=jdbcConnectionProperties)
{code}
 
The only explanation we could think of is that, since the inserts happen in a 
distributed fashion, a failed executor's tasks are retried, and a retried task 
could re-insert records its first attempt had already committed. This may be 
off the mark, but we wanted to raise it in case it is the cause.
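
If the retry theory holds, a possible workaround (a sketch only, not a 
confirmed fix; {{stagingTable}} and {{row_id}} are hypothetical names) is to 
land the data in a staging table and deduplicate on the way into the real 
target:
{code:python}
# Sketch: write to a staging table first, then deduplicate by a unique
# key while loading the final table, so rows re-inserted by a retried
# task are dropped. "stagingTable" / "row_id" are hypothetical names.
DataToLoad.write.jdbc(url=jdbcUrl, table="stagingTable", mode="overwrite",
                      properties=jdbcConnectionProperties)

staged = spark.read.jdbc(url=jdbcUrl, table="stagingTable",
                         properties=jdbcConnectionProperties)

(staged.dropDuplicates(["row_id"])
       .write.jdbc(url=jdbcUrl, table=targetTable, mode="overwrite",
                   properties=jdbcConnectionProperties))
{code}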



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
