hdaikoku commented on PR #42426:
URL: https://github.com/apache/spark/pull/42426#issuecomment-1685547323

   > To make sure I understand correctly - there is an OOM which is thrown, 
which happens to be within `initiateRetry`, and so the shuffle fetch stalled 
indefinitely, and so the task appeared to be hung?
   
   Yes, correct. To be more precise, it's not a memory issue, as the log 
shows - "unable to create new native thread". We suspect the process is 
running out of either threads or file descriptors and is therefore unable to 
spawn a new thread at `executorService.submit()` in 
`RetryingBlockTransferor#initiateRetry()`.
   
   
   > In meantime, you can simply run with `-XX:OnOutOfMemoryError` to kill the 
executor in case of OOM if this is blocking you ? This is what Spark on Yarn 
does (see `YarnSparkHadoopUtil.addOutOfMemoryErrorArgument`) - looks like this 
is not done in other resource managers.
   
   Actually, we are running Spark on YARN and have 
`-XX:OnOutOfMemoryError='kill %p'` in place. However, it does not seem to be 
working, probably because the JVM is "unable to create new native thread" and 
therefore cannot even fork the `kill` process.
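
   As an aside (a config sketch, not part of this PR): on JDK 8u92 and later, 
`-XX:+ExitOnOutOfMemoryError` makes the JVM exit in-process on OOM rather 
than forking an external command, so it may still fire in situations where 
spawning `kill` fails. If applicable, it could be set via executor Java 
options, e.g.:

   ```properties
   spark.executor.extraJavaOptions=-XX:+ExitOnOutOfMemoryError
   ```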
   
   Instead, we have enabled `spark.speculation` to proactively kill any stuck 
tasks.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

