[ https://issues.apache.org/jira/browse/SPARK-17380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xeto updated SPARK-17380:
-------------------------
    Attachment: exec_Leak_Hunter.zip

I simplified the POC on Spark 2.0.0 by removing persist with replicas and 
groupByKey.
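For context, the original chain (described in the quoted issue below) was roughly the
following. This is only a sketch: the app/stream names, endpoint, region, batch
interval, key extraction, parsing and S3 path are placeholders I made up, not the
actual code, and the receiver storage level isn't stated in the report.

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kinesis.KinesisUtils;
import scala.Tuple2;
import java.nio.charset.StandardCharsets;

public class OriginalPipeline {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("kinesis-original");
    // Batch interval is a guess, not taken from the real job
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    // Receiver-based Kinesis stream; storage level assumed, not stated in the report
    JavaReceiverInputDStream<byte[]> stream = KinesisUtils.createStream(
        jssc, "kinesis-original", "my-stream",
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, Durations.seconds(10),
        StorageLevel.MEMORY_AND_DISK_2());

    // mapToPair() with a placeholder key
    JavaPairDStream<String, String> keyed = stream
        .map(bytes -> new String(bytes, StandardCharsets.UTF_8))
        .mapToPair(line -> new Tuple2<>(Integer.toString(line.length()), line));

    // groupByKey() followed by a placeholder flatMap()
    JavaDStream<String> flattened = keyed
        .groupByKey()
        .flatMap(kv -> kv._2().iterator());

    // persist with serialized 2x replication, then repartition(19)
    JavaDStream<String> output = flattened
        .persist(StorageLevel.MEMORY_AND_DISK_SER_2())
        .repartition(19);

    // Store each micro-batch to S3 via foreachRDD()
    output.foreachRDD(rdd ->
        rdd.saveAsTextFile("s3://bucket/prefix/" + System.currentTimeMillis()));

    jssc.start();
    jssc.awaitTermination();
  }
}
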
The simplified job now has one dummy filter, several map() and mapPartitions() 
calls, and then writes its output to S3.
There is no cache or persist of any kind; the only storage level involved is 
StorageLevel.MEMORY_AND_DISK_2, passed to KinesisUtils.createStream.
I fed it a load of about 50k events/s this time (since the cluster no longer does 
groupBy/persist, it has enough CPU and network headroom to handle the increase).
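
The simplified POC itself looks roughly like this (same caveats: names, endpoint,
region, batch interval and the map bodies are placeholders, not the real code):

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kinesis.KinesisUtils;
import java.nio.charset.StandardCharsets;

public class SimplifiedPoc {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("kinesis-leak-poc");
    // Batch interval is a guess, not taken from the real job
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    // Receiver-based Kinesis stream; MEMORY_AND_DISK_2 is the only storage level in play
    JavaReceiverInputDStream<byte[]> stream = KinesisUtils.createStream(
        jssc, "kinesis-leak-poc", "my-stream",
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, Durations.seconds(10),
        StorageLevel.MEMORY_AND_DISK_2());

    // One dummy filter, a few map()/mapPartitions() steps, no cache/persist anywhere
    JavaDStream<String> events = stream
        .map(bytes -> new String(bytes, StandardCharsets.UTF_8))
        .filter(line -> true)          // dummy filter
        .mapPartitions(it -> it);      // placeholder mapPartitions

    // Output each micro-batch straight to S3
    events.foreachRDD(rdd ->
        rdd.saveAsTextFile("s3://bucket/prefix/" + System.currentTimeMillis()));

    jssc.start();
    jssc.awaitTermination();
  }
}
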
After a couple of hours I took a memory dump of one of the executors (note that 
the cluster was still functioning properly at that point, with plenty of free 
memory, etc.; I just wanted to capture its state).
The leak-suspects report produced by running Eclipse Memory Analyzer on that dump 
is attached.
To view it, unzip exec_Leak_Hunter.zip and open index.html in a browser.
The report shows two leak suspects: one in the contents of Spark's MemoryStore 
(38.17%), the other in io.netty.buffer.PoolChunk (53.83%).

This time I copied the JVM/GC settings from the default Spark configuration on 
the EMR master node:
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails 
-XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
-XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'

spark.driver.extraJavaOptions -XX:+UseConcMarkSweepGC 
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
-XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p' 
-Dspark.driver.log.level=INFO 

The cluster is still running.
Let me know if more data is needed.


> Spark streaming with a multi shard Kinesis freezes after several days 
> (memory/resource leak?)
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17380
>                 URL: https://issues.apache.org/jira/browse/SPARK-17380
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 2.0.0
>            Reporter: Xeto
>         Attachments: exec_Leak_Hunter.zip, memory-after-freeze.png, memory.png
>
>
> Running Spark Streaming 2.0.0 on AWS EMR 5.0.0 consuming from Kinesis (125 
> shards).
> Used memory keeps growing all the time according to Ganglia.
> The application works properly for about 3.5 days, until all free memory has 
> been used.
> Then micro-batches start queuing up but none are processed.
> Spark freezes. You can see in Ganglia that some memory is being freed, but it 
> doesn't help the job recover.
> Is it a memory/resource leak?
> The job uses back pressure and Kryo.
> The code has a mapToPair(), groupByKey(), flatMap(), 
> persist(StorageLevel.MEMORY_AND_DISK_SER_2()) and repartition(19), then stores 
> to S3 using foreachRDD().
> Cluster size: 20 machines
> Spark configuration:
> spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:PermSize=256M -XX:MaxPermSize=256M -XX:OnOutOfMemoryError='kill -9 %p' 
> spark.driver.extraJavaOptions -Dspark.driver.log.level=INFO 
> -XX:+UseConcMarkSweepGC -XX:PermSize=256M -XX:MaxPermSize=256M 
> -XX:OnOutOfMemoryError='kill -9 %p' 
> spark.master yarn-cluster
> spark.executor.instances 19
> spark.executor.cores 7
> spark.executor.memory 7500M
> spark.driver.memory 7500M
> spark.default.parallelism 133
> spark.yarn.executor.memoryOverhead 2950
> spark.yarn.driver.memoryOverhead 2950
> spark.eventLog.enabled false
> spark.eventLog.dir hdfs:///spark-logs/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
