[ 
https://issues.apache.org/jira/browse/SPARK-34537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292809#comment-17292809
 ] 

angerszhu commented on SPARK-34537:
-----------------------------------

gentle ping [~cloud_fan] [~jiangxb1987] 

I saw your work on https://issues.apache.org/jira/browse/SPARK-23207 and 
https://issues.apache.org/jira/browse/SPARK-23243.

Our users hit a similar issue on Spark 3.0.1 too when the shuffle data is huge. 

 

> Repartition miss/duplicated data
> --------------------------------
>
>                 Key: SPARK-34537
>                 URL: https://issues.apache.org/jira/browse/SPARK-34537
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: angerszhu
>            Priority: Major
>         Attachments: image-2021-02-25-19-43-49-687.png, 
> image-2021-02-25-19-46-52-809.png, image-2021-02-25-19-47-10-005.png
>
>
> We have a SQL statement:
> {code:java}
> INSERT OVERWRITE TABLE t1 
> SELECT /*+ repartition(300) */ * FROM t2;
> {code}
> Below are the SQL metrics of the repartition ShuffleExchange. We can see that the 
> number of shuffle records written is not the same as the number of records read. 
> In the result table, some data is missing and some data is duplicated.
> !image-2021-02-25-19-43-49-687.png!
> !image-2021-02-25-19-46-52-809.png|width=408,height=654!!image-2021-02-25-19-47-10-005.png|width=282,height=414!
> We can see that *InsertIntoHadoopFsRelationCommand's output is the same as the 
> repartition Exchange's records read (reducer side)*
> *and the repartition Exchange's shuffle records written (mapper side) is the 
> same as the Filter's output.*
> *So the repartition Exchange returns wrong data.*
>  
> *In our environment, AQE and speculation are both enabled.*
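
To confirm the "some data missing and some data duplicated" symptom independently of the SQL metrics, one option is a multiset comparison between the source rows and the rows that ended up in the result table. The following is only a minimal sketch in plain Python over small in-memory samples. The function name `diff_rows` and the sample rows are hypothetical and do not come from the issue. On real tables, the same idea could likely be expressed in Spark SQL with `EXCEPT ALL` between t2 and t1 in both directions.

```python
from collections import Counter


def diff_rows(source_rows, result_rows):
    """Compare two collections of hashable rows as multisets.

    Returns (missing, extra):
      missing: rows that occur fewer times in the result than in the source
      extra:   rows that occur more times in the result than in the source
    """
    src = Counter(source_rows)
    res = Counter(result_rows)
    # Counter subtraction keeps only strictly positive counts.
    missing = src - res
    extra = res - src
    return missing, extra


# Hypothetical sample data: row ("b", 2) was lost, row ("a", 1) was duplicated.
source = [("a", 1), ("b", 2), ("c", 3)]
result = [("a", 1), ("a", 1), ("c", 3)]

missing, extra = diff_rows(source, result)
print("missing:", dict(missing))
print("extra:", dict(extra))
```

If both returned counters are empty, the result contains exactly the same rows as the source, regardless of row order.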



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
