[jira] [Commented] (SPARK-38058) Writing a spark dataframe to Azure Sql Server is causing duplicate records intermittently

2022-02-02 Thread john (Jira)


[ https://issues.apache.org/jira/browse/SPARK-38058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486263#comment-17486263 ]

john commented on SPARK-38058:
--

Since I am working in a production environment, I cannot share any documents 
here. This may be a bug in Spark. It happens in roughly three out of five 
runs: twice all the records are inserted correctly, and the other times 
duplicates are inserted. We have tried all the workarounds and none of them 
work.

> Writing a spark dataframe to Azure Sql Server is causing duplicate records 
> intermittently
> -
>
> Key: SPARK-38058
> URL: https://issues.apache.org/jira/browse/SPARK-38058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.1.0
>Reporter: john
>Priority: Major
>
> We are using the JDBC option to insert transformed data in a Spark DataFrame 
> into a table in Azure SQL Server. Below is the code snippet we are using for 
> this insert. However, we noticed on a few occasions that some records are 
> being duplicated in the destination table. This happens for large tables: 
> e.g., if a DataFrame has 600K records, after inserting the data into the 
> table we get around 620K records. We still want to understand why that is 
> happening.
>  {{DataToLoad.write.jdbc(url = jdbcUrl, table = targetTable, mode = 
> "overwrite", properties = jdbcConnectionProperties)}}
>  
> The only reason we could think of is that, while inserts are happening in a 
> distributed fashion, if one of the executors fails in between, its tasks are 
> retried and could be inserting duplicate records. This could be totally off 
> the mark, but we wanted to check whether it could be an issue.
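The retry hypothesis in the description can be illustrated with a toy simulation (plain Python, no Spark; `write_partition` and all names here are hypothetical illustration, not Spark internals): a plain INSERT-based write is not idempotent, so a task that fails after inserting some rows and is retried will insert those rows twice.

```python
target = []  # stands in for the SQL Server table

def write_partition(rows, fail_after=None):
    """Append rows to the target; optionally 'crash' partway through."""
    for i, row in enumerate(rows):
        target.append(row)
        if fail_after is not None and i + 1 == fail_after:
            raise RuntimeError("executor lost")

partitions = [[1, 2, 3], [4, 5, 6]]

write_partition(partitions[0])                    # partition 0 succeeds
try:
    write_partition(partitions[1], fail_after=2)  # partition 1 dies mid-write
except RuntimeError:
    write_partition(partitions[1])                # scheduler retries the whole task

source_count = sum(len(p) for p in partitions)
print(source_count, len(target))                  # 6 source rows, 8 in the target
```

This mirrors the reported symptom: the target ends up with more rows than the source whenever a partially-completed insert is replayed in full.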



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38058) Writing a spark dataframe to Azure Sql Server is causing duplicate records intermittently

2022-02-02 Thread Hyukjin Kwon (Jira)


[ https://issues.apache.org/jira/browse/SPARK-38058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486201#comment-17486201 ]

Hyukjin Kwon commented on SPARK-38058:
--

spark.speculation has been disabled by default for many years, so it should 
not be the cause. Did you enable it? It is difficult to debug further without 
more details. Do you have more information, e.g. logs or a Spark UI 
screenshot? And are you able to reproduce this against another DBMS?







[jira] [Commented] (SPARK-38058) Writing a spark dataframe to Azure Sql Server is causing duplicate records intermittently

2022-01-30 Thread john (Jira)


[ https://issues.apache.org/jira/browse/SPARK-38058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17484318#comment-17484318 ]

john commented on SPARK-38058:
--

It seems this is not specific to SQL Server; it is a problem with Spark 
itself. 
https://issues.apache.org/jira/browse/SPARK-16741 - this link suggests 
disabling spark.speculation, but in recent Spark versions it is disabled by 
default. I have tried that as well, and the duplicate rows were still there 
in SQL Server when writing via JDBC in Spark.

I have tried with a small amount of data, around 10K rows, and it works fine 
with no duplicates. When I load millions of rows, the duplicates appear.

Because of this issue, we are using an intermediate staging-layer table to 
receive all the data, including duplicates, and then inserting into the 
landing zone with a DISTINCT clause.
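The staging-then-DISTINCT workaround described above can be sketched in plain Python, with sqlite3 standing in for Azure SQL Server (table and column names are hypothetical):

```python
import sqlite3

# Land everything (including duplicates from retried tasks) in a staging
# table, then promote only distinct rows to the final landing table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER, val TEXT)")
conn.execute("CREATE TABLE landing (id INTEGER, val TEXT)")

# Rows as they might arrive after a retried task: (2, 'b') written twice.
rows = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)

# Deduplicate while promoting to the landing zone.
conn.execute("INSERT INTO landing SELECT DISTINCT id, val FROM staging")

landing_count = conn.execute("SELECT COUNT(*) FROM landing").fetchone()[0]
print(landing_count)  # 3
```

Note this pattern only removes rows that are duplicated in their entirety; it does not help if a retried task writes rows that differ in any column.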







[jira] [Commented] (SPARK-38058) Writing a spark dataframe to Azure Sql Server is causing duplicate records intermittently

2022-01-29 Thread Hyukjin Kwon (Jira)


[ https://issues.apache.org/jira/browse/SPARK-38058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17484281#comment-17484281 ]

Hyukjin Kwon commented on SPARK-38058:
--

Is this an issue in Spark, in Azure SQL Server, or somewhere else? It does 
not look like a problem specific to Spark. It would be great to post some 
evidence and logs showing that it is an issue in Spark.
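One way to produce the kind of evidence asked for here is to query the target table for keys that appear more than once after a load; a minimal sketch, again with sqlite3 standing in for SQL Server and hypothetical names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, val TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (2, "b"), (3, "c")])

# Keys inserted more than once are direct evidence of duplicated writes.
dupes = conn.execute(
    "SELECT id, COUNT(*) FROM target GROUP BY id HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # [(2, 2)]
```

Correlating such a query's output with task-retry events in the executor logs or the Spark UI would help pin the duplication on Spark rather than on the database.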



