[ https://issues.apache.org/jira/browse/SPARK-13288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157725#comment-15157725 ]
JESSE CHEN commented on SPARK-13288:
------------------------------------

The code is simple, but I can't share the data due to legal restrictions. It basically computes sentiment scores for tweets from a Twitter stream:

----
// tweets is the input DStream from the Kafka receiver; getSentimentScore
// returns a (positive, negative) score pair for a tweet body.
val sentiTweets = tweets.map(tweet => (tweet, getSentimentScore(sentiMap, tweet.body)))
val numTweets = tweets.count()
val posThreshold = 0.25
val negThreshold = -0.25
// Classify each tweet as +1 (positive), -1 (negative), or 0 (neutral), then
// keep incremental per-class counts over a sliding window.
val sentiCount = sentiTweets
  .map(t => t._2._1 - t._2._2)
  .map(score => if (score > posThreshold) 1 else if (score < negThreshold) -1 else 0)
  .map(score => (score, 1))
  .reduceByKeyAndWindow(_ + _, _ - _, Seconds(ReduceWindow * 60), Seconds(BatchWindow))
sentiCount.print(5)
numTweets.map(count => "Batch rate: " + count / BatchWindow + " tweets per second")
  .print(1)
----

(A sketch of the getSentimentScore helper, for illustration, appears after the quoted issue below.)

The same code ran on both 1.5.1 and 1.6.0. Spark 1.6 introduced the automatic (unified) memory management feature, which I think may have an impact on this. I am going to enable the legacy setting:

  spark.storage.memoryFraction  0.6  (deprecated)
    Read only if spark.memory.useLegacyMode is enabled. Fraction of the Java
    heap to use for Spark's memory cache. This should not be larger than the
    "old" generation of objects in the JVM, which by default is given 0.6 of
    the heap, but you can increase it if you configure your own old-generation
    size.

I will see if this changes the behavior. Also, if the heap dumps are not helpful, what do you recommend I collect for you? DEBUG traces would change the timing, so I am not sure how helpful they would be... I need some ideas, please.

Jesse
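For reference, a minimal sketch of how the legacy memory settings mentioned above could be wired into the driver. This assumes a plain SparkConf-based setup; the app name is a placeholder, and none of this is taken from the reporter's actual job:

----
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical driver setup; the two memory settings are the point here.
val conf = new SparkConf()
  .setAppName("TweetSentimentStreaming")       // placeholder app name
  .set("spark.memory.useLegacyMode", "true")   // fall back to the pre-1.6 memory manager
  .set("spark.storage.memoryFraction", "0.6")  // honored only in legacy mode
val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batches, as in the report
----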
> [1.6.0] Memory leak in Spark streaming
> --------------------------------------
>
>                 Key: SPARK-13288
>                 URL: https://issues.apache.org/jira/browse/SPARK-13288
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.6.0
>         Environment: Bare metal cluster
>                      RHEL 6.6
>            Reporter: JESSE CHEN
>              Labels: streaming
>
> Streaming in 1.6 seems to have a memory leak.
> Running the same streaming app in Spark 1.5.1 and 1.6.0, all things equal,
> 1.6.0 showed gradually increasing processing time.
> The app is simple: 1 Kafka receiver of a tweet stream and 20 executors
> processing the tweets in 5-second batches.
> Spark 1.5.1 handled this smoothly and did not show increasing processing time
> in the 40-minute test; 1.6.0 showed increasing time about 8 minutes into the
> test. Please see the chart here:
> https://ibm.box.com/s/7q4ulik70iwtvyfhoj1dcl4nc469b116
> I captured heap dumps in the two versions and compared them. I noticed byte
> arrays (class [B) using about 50X more space in 1.6.0.
> Here are some top classes in the heap histogram and references.
>
> Heap Histogram (all classes, excluding platform)
>
>   1.6.0 Streaming                                        1.5.1 Streaming
>   Class                       Instances     Total Size   Class                   Instances  Total Size
>   class [B                         8453  3,227,649,599   class [B                     5095  62,938,466
>   class [C                        44682      4,255,502   class [C                   130482  12,844,182
>   class java.lang.reflect.Method   9059      1,177,670   class java.lang.String     130171   1,562,052
>
> References by Type
>
>   1.6.0: class [B [0x640039e38]                          1.5.1: class [B [0x6c020bb08]
>
> Referrers by Type
>
>   1.6.0 Class                        Count               1.5.1 Class                         Count
>   java.nio.HeapByteBuffer             3239               sun.security.util.DerInputBuffer     1233
>   sun.security.util.DerInputBuffer    1233               sun.security.util.ObjectIdentifier    620
>   sun.security.util.ObjectIdentifier   620               [[B                                   397
>   [Ljava.lang.Object;                  408               java.lang.reflect.Method              326
> ----
> The total size of class [B is about 3 GB in 1.6.0 but only about 60 MB in 1.5.1.
> The java.nio.HeapByteBuffer referrer class did not show up near the top in 1.5.1.
> I have also placed jstack output for 1.5.1 and 1.6.0 online; you can get them here:
> https://ibm.box.com/sparkstreaming-jstack160
> https://ibm.box.com/sparkstreaming-jstack151
> Jesse
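The snippet in the comment above calls getSentimentScore and sentiMap without defining them. A minimal hypothetical sketch, assuming sentiMap maps lower-cased words to (positive, negative) weight pairs; this is an illustration only, not the reporter's implementation:

----
// Hypothetical helper: the original implementation is not shown in the report.
// Assumes sentiMap maps lower-cased words to (positiveWeight, negativeWeight).
def getSentimentScore(sentiMap: Map[String, (Double, Double)],
                      body: String): (Double, Double) = {
  val words = body.toLowerCase.split("\\s+")
  // Sum the positive and negative weights of each word found in the lexicon.
  words.foldLeft((0.0, 0.0)) { case ((pos, neg), word) =>
    sentiMap.get(word) match {
      case Some((p, n)) => (pos + p, neg + n)
      case None         => (pos, neg)
    }
  }
}
----

With this shape, t._2._1 - t._2._2 in the streaming job is the net (positive minus negative) score that gets compared against the ±0.25 thresholds.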