[ https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Saisai Shao updated SPARK-28849:
--------------------------------
    Attachment: D18F4.png
                95330.png
                91ADA.png

> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-28849
>                 URL: https://issues.apache.org/jira/browse/SPARK-28849
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.1
>            Reporter: Saisai Shao
>            Priority: Major
>         Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop
> when calling {{transferTo}}. What we saw is that, while merging the shuffle
> temp files, the task hung for several hours until it was killed manually.
> In the attached log you can see that nothing is logged for several hours
> after the shuffle files are spilled to disk.
> The attached thread dump shows that the thread is calling the native method
> {{size0}}.
> We also used strace to trace the system calls and found that this thread
> keeps calling {{fstat}}; see the attached screenshot.
> We did not find the root cause. My guess is that it is related to a file
> system or disk issue. In any case, we should figure out a way to fail fast
> in such a scenario.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
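To make the "fail fast" suggestion in the description concrete, here is a minimal sketch of a copy loop that gives up when {{transferTo}} stops making progress. It is not Spark's code: the class name {{BoundedTransfer}}, the method {{copy}} and the {{maxStalledCalls}} parameter are all hypothetical. It relies on one documented property of {{FileChannel.transferTo}}: the call may transfer fewer bytes than requested, and transfers none when the given position is beyond the file's current size. That is one case in which a naive "loop until all bytes are copied" never terminates; whether it is the root cause here is not established.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper, for illustration only; not part of Spark.
final class BoundedTransfer {
    private BoundedTransfer() {}

    /**
     * Copies bytesToCopy bytes from input, starting at startPosition, to output.
     * Throws IOException after maxStalledCalls consecutive transferTo calls
     * that move zero bytes, instead of retrying forever.
     */
    static void copy(FileChannel input, WritableByteChannel output,
                     long startPosition, long bytesToCopy,
                     int maxStalledCalls) throws IOException {
        long copied = 0L;
        int stalled = 0;
        while (copied < bytesToCopy) {
            // transferTo may legitimately return less than requested, so we loop.
            long n = input.transferTo(startPosition + copied, bytesToCopy - copied, output);
            if (n > 0) {
                copied += n;
                stalled = 0; // progress was made, reset the stall counter
            } else if (++stalled >= maxStalledCalls) {
                throw new IOException("transferTo made no progress after " + stalled
                    + " calls: copied " + copied + " of " + bytesToCopy
                    + " bytes, source size is " + input.size());
            }
        }
    }
}
```

With a guard like this, a spill file that is shorter than the recorded partition length surfaces as a task failure with a diagnostic message rather than a silent hang. The threshold is a trade-off: too small and a transient zero-byte transfer to a non-blocking target fails the task, too large and the task still spins for a while before failing.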