[jira] [Commented] (SPARK-17380) Spark streaming with a multi shard Kinesis freezes after several days (memory/resource leak?)
[ https://issues.apache.org/jira/browse/SPARK-17380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824564#comment-15824564 ]

Xeto commented on SPARK-17380:
------------------------------

Hi. We switched to StorageLevel.MEMORY_AND_DISK_SER for consumption from Kinesis as suggested above, and also upgraded to EMR 5.2.0 (Spark 2.0.2). The job now looks stable even on a multi-shard Kinesis stream (it survived high load without executors being killed). Thanks!

> Spark streaming with a multi shard Kinesis freezes after several days (memory/resource leak?)
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17380
>                 URL: https://issues.apache.org/jira/browse/SPARK-17380
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 2.0.0
>            Reporter: Xeto
>         Attachments: exec_Leak_Hunter.zip, memory-after-freeze.png, memory.png
>
> Running Spark Streaming 2.0.0 on AWS EMR 5.0.0, consuming from Kinesis (125 shards).
> Used memory keeps growing all the time according to Ganglia.
> The application works properly for about 3.5 days, until all free memory has been used.
> Then micro-batches start queuing up but none is served. Spark freezes.
> You can see in Ganglia that some memory is being freed, but it doesn't help the job recover.
> Is it a memory/resource leak?
> The job uses backpressure and Kryo.
> The code has a mapToPair(), groupByKey(), flatMap(), persist(StorageLevel.MEMORY_AND_DISK_SER_2()) and repartition(19), then stores to S3 using foreachRDD().
> Cluster size: 20 machines
> Spark configuration:
> spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:PermSize=256M -XX:MaxPermSize=256M -XX:OnOutOfMemoryError='kill -9 %p'
> spark.driver.extraJavaOptions -Dspark.driver.log.level=INFO -XX:+UseConcMarkSweepGC -XX:PermSize=256M -XX:MaxPermSize=256M -XX:OnOutOfMemoryError='kill -9 %p'
> spark.master yarn-cluster
> spark.executor.instances 19
> spark.executor.cores 7
> spark.executor.memory 7500M
> spark.driver.memory 7500M
> spark.default.parallelism 133
> spark.yarn.executor.memoryOverhead 2950
> spark.yarn.driver.memoryOverhead 2950
> spark.eventLog.enabled false
> spark.eventLog.dir hdfs:///spark-logs/
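For reference, a minimal Java sketch of the change described in this comment, assuming the Spark 2.0 spark-streaming-kinesis-asl API: the receiver is created with a serialized, non-replicated storage level. The app/stream names, endpoint, region, intervals, and the extractKey helper are placeholder assumptions, and the pipeline merely mirrors the shape given in the issue description above; it is not the reporter's actual code.

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kinesis.KinesisUtils;
import scala.Tuple2;

public class KinesisPoc {

  // Hypothetical key extractor; the report does not say how records are keyed.
  private static String extractKey(byte[] record) {
    return String.valueOf(record.length);
  }

  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("kinesis-poc");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(10000));

    // The fix: store received blocks serialized and non-replicated. The
    // replicated levels (*_2) exercise the BlockManager replication path
    // implicated in the Netty buffer leak reported below.
    JavaReceiverInputDStream<byte[]> stream = KinesisUtils.createStream(
        jssc,
        "kinesis-poc",                              // KCL app name (DynamoDB checkpoint table)
        "my-stream",                                // placeholder stream name
        "https://kinesis.us-east-1.amazonaws.com",  // placeholder endpoint
        "us-east-1",                                // placeholder region
        InitialPositionInStream.LATEST,
        new Duration(10000),                        // checkpoint interval
        StorageLevel.MEMORY_AND_DISK_SER());

    // Pipeline shape from the issue description:
    // mapToPair -> groupByKey -> flatMap -> persist -> repartition -> foreachRDD.
    JavaPairDStream<String, byte[]> keyed =
        stream.mapToPair((byte[] r) -> new Tuple2<String, byte[]>(extractKey(r), r));
    JavaDStream<byte[]> flattened =
        keyed.groupByKey().flatMap(kv -> kv._2().iterator());
    flattened.persist(StorageLevel.MEMORY_AND_DISK_SER_2());
    flattened.repartition(19).foreachRDD(rdd -> {
      // write to S3 here, e.g. rdd.saveAsObjectFile("s3://bucket/prefix/...")
    });

    jssc.start();
    jssc.awaitTermination();
  }
}

Note that the MEMORY_AND_DISK_SER_2 persist inside the job is left as reported; only the receiver-side storage level changes here.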
[jira] [Commented] (SPARK-17380) Spark streaming with a multi shard Kinesis freezes after several days (memory/resource leak?)
[ https://issues.apache.org/jira/browse/SPARK-17380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680276#comment-15680276 ]

Udit Mehrotra commented on SPARK-17380:
---------------------------------------

The above leak was seen with Spark 2.0 running on EMR. I noticed that the code path causing the leak is the block replication code, so I switched the received Kinesis blocks from StorageLevel.MEMORY_AND_DISK_2 to StorageLevel.MEMORY_AND_DISK. After switching I no longer observe the memory leak in the logs, but the application still freezes after 3-3.5 days: Spark Streaming stops processing records, and the input queue of records received from Kinesis keeps growing until the executor runs out of memory.
[jira] [Commented] (SPARK-17380) Spark streaming with a multi shard Kinesis freezes after several days (memory/resource leak?)
[ https://issues.apache.org/jira/browse/SPARK-17380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680270#comment-15680270 ]

Udit Mehrotra commented on SPARK-17380:
---------------------------------------

We came across this memory leak in the executor logs by using the JVM option '-Dio.netty.leakDetectionLevel=advanced'. It looks like good evidence of a memory leak, and it reports the location where the leaked buffer was created:

16/11/09 06:03:28 ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 0
Created at:
    io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:103)
    io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:335)
    io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:247)
    org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:69)
    org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1161)
    org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:976)
    org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
    org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
    org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
    org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:700)
    org.apache.spark.streaming.receiver.BlockManagerBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:80)
    org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)
    org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)
    org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)
    org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:282)
    org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:352)
    org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297)
    org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269)
    org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110)

Can we please have some action on this JIRA?
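For anyone trying to reproduce the trace above: the leak detector is stock Netty functionality and can be enabled through the executor JVM options already used in this job's configuration, e.g. in spark-defaults.conf (illustrative line; append the flag to whatever options are already set, and note that 'paranoid' is the more exhaustive level if 'advanced' samples too few allocations):

spark.executor.extraJavaOptions  -Dio.netty.leakDetectionLevel=advanced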
[jira] [Commented] (SPARK-17380) Spark streaming with a multi shard Kinesis freezes after several days (memory/resource leak?)
[ https://issues.apache.org/jira/browse/SPARK-17380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15470399#comment-15470399 ]

Sean Owen commented on SPARK-17380:
-----------------------------------

Weirdly, this might be related to SPARK-17379, where we're upgrading Netty and finding some problems with its memory pool. It's similar to what you're showing here, with a lot of memory held by Netty pooled byte buffers. CC [~zsxwing] FYI
[jira] [Commented] (SPARK-17380) Spark streaming with a multi shard Kinesis freezes after several days (memory/resource leak?)
[ https://issues.apache.org/jira/browse/SPARK-17380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15470322#comment-15470322 ]

Xeto commented on SPARK-17380:
------------------------------

We decided to try more executors with less memory each, to reduce full GC time:

spark.executor.memory 2612M
spark.driver.memory 2612M
spark.yarn.executor.memoryOverhead 2612
spark.yarn.driver.memoryOverhead 2612

We also increased the network timeout:

spark.network.timeout 1800s

The cluster has been running without freezing for 1 day and 16 hours. However, used memory kept growing until almost all available memory was filled. Then 3 executors were killed and 3 new ones started. Judging from the container logs of the removed executors, the full GC failed and an OutOfMemoryError occurred. The executor stdout log follows; the same message can be seen on all 3 executors. Since I'm not storing any data of my own in long-term memory, it seems that Spark itself (or the Kinesis connector in spark-streaming-kinesis-asl) is leaking. We need Spark Streaming to run without freezing or killing executors for at least a week. Any input is appreciated. Thanks in advance.

2016-09-07T09:42:35.110+: [CMS-concurrent-mark-start]
2016-09-07T09:42:35.116+: [Full GC (Allocation Failure) 2016-09-07T09:42:35.116+: [CMS2016-09-07T09:42:36.090+: [CMS-concurrent-mark: 0.978/0.979 secs] [Times: user=1.98 sys=0.00, real=0.98 secs] (concurrent mode failure): 1993151K->1993151K(1993152K), 4.6558419 secs] 2606437K->2606098K(2606592K), [Metaspace: 49338K->49338K(1093632K)], 4.6559435 secs] [Times: user=5.63 sys=0.00, real=4.66 secs]
2016-09-07T09:42:39.772+: [Full GC (Allocation Failure) 2016-09-07T09:42:39.772+: [CMS: 1993151K->1993151K(1993152K), 2.9516622 secs] 2606098K->2606090K(2606592K), [Metaspace: 49338K->49338K(1093632K)], 2.9517595 secs] [Times: user=2.95 sys=0.00, real=2.95 secs]
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 18903"...
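One way to distinguish a leak from an undersized heap in this situation (a suggestion, not something from the thread): sample the executor JVM with the standard JDK tools and check whether full GCs ever reclaim old-generation space. Occupancy that climbs monotonically across full GCs, as in the log above, points to live references being retained. Here <executor-pid> is a placeholder for the executor process id.

# sample heap/GC utilization every 5 seconds; OU is old-gen utilization (%)
jstat -gcutil <executor-pid> 5000
# histogram of live objects (forces a full GC first), to see what accumulates
jmap -histo:live <executor-pid> | head -n 30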
[jira] [Commented] (SPARK-17380) Spark streaming with a multi shard Kinesis freezes after several days (memory/resource leak?)
[ https://issues.apache.org/jira/browse/SPARK-17380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15458862#comment-15458862 ]

Sean Owen commented on SPARK-17380:
-----------------------------------

This doesn't show evidence of a memory leak. You may simply be low on memory and experiencing full GCs, which would be consistent with these observations. That just means you need more memory, or to tune your GC better. Huge pauses can make any Java app appear to hang for a while.
[jira] [Commented] (SPARK-17380) Spark streaming with a multi shard Kinesis freezes after several days (memory/resource leak?)
[ https://issues.apache.org/jira/browse/SPARK-17380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15458669#comment-15458669 ]

Xeto commented on SPARK-17380:
------------------------------

Could you advise how to obtain such evidence? We're not storing anything in memory besides the persist with replication. It's very straightforward POC code, with no third-party cache or anything like that. The events processed are objects with nested HashMap/LinkedList structures. I noticed that after the freeze, used memory starts going down (eventual GC?), but that doesn't help the application recover.
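One standard way to gather such evidence, offered here as a suggestion rather than something from the thread: have the executor JVMs write a heap dump on OutOfMemoryError, or take one on demand, then inspect the dominator tree in a heap analyzer such as Eclipse MAT. The dump paths below are placeholders.

# in spark-defaults.conf, appended to the existing executor options:
spark.executor.extraJavaOptions -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mnt/heap-dumps

# or on demand, against a running executor:
jmap -dump:live,format=b,file=/mnt/heap-dumps/executor.hprof <executor-pid>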