hdaikoku commented on PR #42426: URL: https://github.com/apache/spark/pull/42426#issuecomment-1685547323

> To make sure I understand correctly - there is an OOM which is thrown, which happens to be within `initiateRetry` and so shuffle fetch stalled indefinitely, and so task appeared to be hung ?

Yes, correct. To be more precise, it is not a heap-memory issue, as the log shows: "unable to create new native thread". We suspect the executor is running out of either threads or file descriptors and is therefore unable to spawn a new thread at `executorService.submit()` in `RetryingBlockTransferor#initiateRetry()`.

> In meantime, you can simply run with `-XX:OnOutOfMemoryError` to kill the executor in case of OOM if this is blocking you ? This is what Spark on Yarn does (see `YarnSparkHadoopUtil.addOutOfMemoryErrorArgument`) - looks like this is not done in other resource managers.

Actually, we are running Spark on YARN and do have `-XX:OnOutOfMemoryError='kill %p'` in place. It does not seem to be working, though, probably because the JVM is "unable to create new native thread" and hence cannot even fork the `kill` process. Instead, we have enabled `spark.speculation` to proactively kill any stuck tasks.
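To illustrate the failure mode being discussed: if `executorService.submit()` itself throws (for example an `OutOfMemoryError` while the JVM tries to spawn a worker thread), an unguarded call leaves the transfer with no scheduled retry and no failure callback, so the fetch stalls forever. Below is a minimal, self-contained sketch of guarding the submission so the error surfaces as a fetch failure instead. This is illustrative only, not the actual Spark code; the class, `Listener` interface, and method names are made up, and the demo provokes the throw with a shut-down executor (`RejectedExecutionException`) rather than a real OOM, since the catch path is the same for any `Throwable`:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RetrySketch {
    // Hypothetical stand-in for a block-fetch failure listener.
    interface Listener {
        void onFailure(Throwable t);
    }

    // Guarded retry submission: if submit() itself throws (OutOfMemoryError,
    // RejectedExecutionException, ...), report failure instead of stalling.
    static void initiateRetry(ExecutorService pool, Runnable retryTask, Listener listener) {
        try {
            pool.submit(retryTask);
        } catch (Throwable t) {        // Error subclasses included deliberately
            listener.onFailure(t);     // fail fast so the task can be retried/killed
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.shutdown();               // force submit() to throw for the demo
        initiateRetry(pool, () -> { }, t ->
            System.out.println("retry failed: " + t.getClass().getSimpleName()));
    }
}
```

With the guard in place, the task fails promptly and normal retry/blacklisting machinery can react, rather than the task appearing hung until speculation kills it.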
> To make sure I understand correctly - there is an OOM which is thrown, which happens to be within `initiateRetry` and so shuffle fetch stalled indefinitely, and so task appeared to be hung ? Yes correct. To be more precise, it's not a memory issue as shown in the log - "unable to create new native thread". We suspect it's running out of either threads or file descriptors and thus is unable to spawn a new thread at `executorService.submit()` in `RetryingBlockTransferor#initiateRetry()`. > In meantime, you can simply run with `-XX:OnOutOfMemoryError` to kill the executor in case of OOM if this is blocking you ? This is what Spark on Yarn does (see `YarnSparkHadoopUtil.addOutOfMemoryErrorArgument`) - looks like this is not done in other resource managers. Actually we are running Spark on YARN, and having `-XX:OnOutOfMemoryError='kill %p'` in place. But it seems it's not working probably because it's "unable to create new native thread" and hence even unable to spawn `kill`. Instead, we have enabled `spark.speculation` to proactively kill any stuck tasks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org