[ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:
--------------------------------
    Attachment: D18F4.png
                95330.png
                91ADA.png

> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-28849
>                 URL: https://issues.apache.org/jira/browse/SPARK-28849
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.1
>            Reporter: Saisai Shao
>            Priority: Major
>         Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
> when calling {{transferTo}}. What we saw is that, while merging the shuffle 
> temp files, the task hung for several hours until it was killed manually. 
> The log shows no output for several hours after the shuffle files were 
> spilled to disk.
> The thread dump shows that the thread is calling the native method 
> {{size0}}.
> Using strace to trace the process, we found that this thread is constantly 
> calling {{fstat}}, as the screenshot shows. 
> We didn't find the root cause; I guess it might be related to a file system 
> or disk issue. Either way, we should figure out a way to fail fast in such 
> a scenario.
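
To illustrate what "fail fast" could look like, here is a minimal Java sketch of a copy loop around {{FileChannel.transferTo}} that throws once the call stops making progress. It is not Spark's code and not a confirmed fix: the class name {{FailFastTransfer}}, the method {{copyOrFail}}, and the {{maxStalls}} threshold are all hypothetical. It relies only on the documented behavior that {{transferTo}} may transfer fewer bytes than requested, including zero (for example, when the position is past the end of the file). Whether that is what happened in this report is not known.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper, for illustration only; not part of Spark.
final class FailFastTransfer {
    private FailFastTransfer() {}

    /**
     * Copies bytesToCopy bytes from input, starting at startPosition, to output.
     * Throws an IOException if transferTo returns 0 for maxStalls consecutive
     * calls, instead of looping forever.
     */
    static long copyOrFail(FileChannel input,
                           WritableByteChannel output,
                           long startPosition,
                           long bytesToCopy,
                           int maxStalls) throws IOException {
        long copied = 0L;
        int stalls = 0;
        while (copied < bytesToCopy) {
            long n = input.transferTo(startPosition + copied, bytesToCopy - copied, output);
            if (n > 0) {
                copied += n;
                stalls = 0; // progress was made, so reset the stall counter
            } else if (++stalls >= maxStalls) {
                // Assumption: repeated zero-byte transfers mean the copy cannot finish.
                throw new IOException("transferTo made no progress after " + stalls
                    + " attempts: copied " + copied + " of " + bytesToCopy
                    + " bytes, position " + (startPosition + copied)
                    + ", file size " + input.size());
            }
        }
        return copied;
    }
}
```

The stall counter is one possible design. Another option would be to compare the requested range against {{input.size()}} before entering the loop and fail immediately if the file is shorter than expected.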



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

