[ https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-28849.
----------------------------------
    Resolution: Won't Fix

> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-28849
>                 URL: https://issues.apache.org/jira/browse/SPARK-28849
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.1
>            Reporter: Saisai Shao
>            Priority: Major
>         Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop
> when calling {{transferTo}}. What we saw is that, while merging shuffle temp
> files, the task hung for several hours until it was killed manually. As the
> log below shows, nothing is logged after the shuffle data is spilled to
> disk, yet the executor is still alive.
> !95330.png!
> Here is the thread dump; it shows that the thread is always calling the
> native method {{size0}}.
> !91ADA.png!
> We also used strace to trace the system calls and found that this thread is
> constantly calling {{fstat}}, and the system usage is pretty high. Here is
> the screenshot.
> !D18F4.png!
> We didn't find the root cause; I guess it might be related to a file system
> or disk issue. In any case, we should figure out a way to fail fast in such
> a scenario.
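
The report does not identify a root cause, so what follows is only a sketch of what "fail fast" could look like for a {{transferTo}}-based copy loop. It is not Spark's actual code or a proposed patch. It assumes the hang comes from {{FileChannel.transferTo}} returning 0 over and over (for example, because the source file is shorter than expected), which would be consistent with the repeated {{size0}} / {{fstat}} calls but is not confirmed by the report. The class name {{FailFastTransfer}}, the method {{copy}}, and the parameter {{maxStalledCalls}} are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

/** Hypothetical helper: a transferTo loop that gives up when it stops making progress. */
public final class FailFastTransfer {
  private FailFastTransfer() {}

  /**
   * Copies bytesToCopy bytes from input, starting at startPosition, into output.
   * Throws IOException once transferTo has returned 0 for maxStalledCalls
   * consecutive calls, instead of retrying forever.
   */
  public static void copy(
      FileChannel input,
      long startPosition,
      long bytesToCopy,
      WritableByteChannel output,
      int maxStalledCalls) throws IOException {
    if (maxStalledCalls <= 0) {
      throw new IllegalArgumentException("maxStalledCalls must be positive");
    }
    long copied = 0L;
    int stalled = 0; // consecutive calls that transferred nothing
    while (copied < bytesToCopy) {
      long n = input.transferTo(startPosition + copied, bytesToCopy - copied, output);
      if (n > 0) {
        copied += n;
        stalled = 0;
      } else if (++stalled >= maxStalledCalls) {
        // Assumption: a long run of zero-byte transfers means the copy is stuck.
        throw new IOException(
            "transferTo made no progress for " + stalled + " consecutive calls; copied "
                + copied + " of " + bytesToCopy + " bytes (source size " + input.size() + ")");
      }
    }
  }
}
```

With a guard like this, a task that cannot finish the copy would fail with an {{IOException}} and could be retried by the scheduler, rather than spinning until it is killed manually. A suitable threshold would depend on the workload, since a zero-byte transfer can also occur legitimately, for instance when the target is a non-blocking channel with a full buffer.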