[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
huangweiyi updated SPARK-30246:
-------------------------------
Description:

In our large, busy YARN cluster, which deploys the Spark external shuffle service as part of the YARN NM aux service, we encountered OOMs in some NMs. After dumping the heap memory, I found some StreamState objects still on the heap even though the applications those StreamStates belong to had already finished.

Here are some related figures:

!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!

The heap dump below shows that the memory consumption mainly consists of two parts:
*(1) OneForOneStreamManager: 4,429,796,424 bytes (77.11%)*
*(2) PoolChunk: 1,059,201,712 bytes (18.44%)*

!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!

Digging into the OneForOneStreamManager, some StreamStates still remain (a sketch of the registration/cleanup pattern involved follows below):

!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!

> Spark on Yarn External Shuffle Service Memory Leak
> --------------------------------------------------
>
>                 Key: SPARK-30246
>                 URL: https://issues.apache.org/jira/browse/SPARK-30246
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 2.4.3
>         Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>            Reporter: huangweiyi
>            Priority: Major
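For context, here is a minimal, self-contained sketch of the stream registration/cleanup pattern that OneForOneStreamManager follows in Spark 2.4.x. It is not the actual Spark source: the Channel and ManagedBuffer types are simplified stand-ins for the real Netty/Spark classes, the class name is hypothetical, and the path that removes a stream once its last chunk has been fetched is omitted. The point it illustrates is that for a stream that is never fully consumed, cleanup is keyed off connection termination, so if an application finishes while its channel to the shuffle service stays open (or the termination event is never delivered), its StreamStates are never removed from the map, matching the retention seen in the heap dumps above.

{code:java}
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Simplified stand-in for Spark's OneForOneStreamManager; not the real source.
public class LeakySketchStreamManager {

  // Stand-in for io.netty.channel.Channel.
  interface Channel {}

  // Stand-in for org.apache.spark.network.buffer.ManagedBuffer.
  interface ManagedBuffer { void release(); }

  // Per-stream state kept for the lifetime of the stream: the owning app,
  // the remaining chunk buffers, and the channel that opened the stream.
  static class StreamState {
    final String appId;
    final Iterator<ManagedBuffer> buffers;
    final Channel associatedChannel;

    StreamState(String appId, Iterator<ManagedBuffer> buffers, Channel channel) {
      this.appId = appId;
      this.buffers = buffers;
      this.associatedChannel = channel;
    }
  }

  private final AtomicLong nextStreamId = new AtomicLong(0);
  private final Map<Long, StreamState> streams = new ConcurrentHashMap<>();

  // Called when a client registers a chunked fetch. The state stays in the
  // map until the channel it is associated with is reported terminated
  // (the real manager also removes it once the last chunk is fetched,
  // which this sketch omits).
  public long registerStream(String appId, Iterator<ManagedBuffer> buffers, Channel channel) {
    long streamId = nextStreamId.getAndIncrement();
    streams.put(streamId, new StreamState(appId, buffers, channel));
    return streamId;
  }

  // The cleanup path shown here is keyed on the connection, not the app:
  // if this is never invoked for a channel, every StreamState registered
  // against that channel stays in `streams` indefinitely -- the kind of
  // retention reported in this issue.
  public void connectionTerminated(Channel channel) {
    streams.entrySet().removeIf(entry -> {
      if (entry.getValue().associatedChannel == channel) {
        // Release any chunks the client never fetched.
        entry.getValue().buffers.forEachRemaining(ManagedBuffer::release);
        return true;
      }
      return false;
    });
  }
}
{code}

Given this structure, one plausible direction for a fix is an additional application-scoped cleanup path, e.g. releasing all StreamStates registered under a finished appId when the shuffle service is told the application was removed; what the actual fix should look like is the subject of this issue.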