I'm seeing a similar (same?) problem on Spark 1.4.1 running on YARN (Amazon EMR, Java 8). I'm running a Spark Streaming app 24/7; system memory is eventually exhausted after about 3 days and the JVM process dies with:
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1442933070871_0002/container_1442933070871_0002_01_000002/hs_err_pid19082.log
[thread 139846843156224 also had an error]

To reiterate what Sea said: the heap is fine. This is NOT a heap memory issue (I've monitored it with scripts and also observed it via VisualVM); this is an off-heap issue. I ran pmap twice on the pid of the CoarseGrainedExecutorBackend process, about 5 hours apart, and saw several 64mb chunks of off-heap memory allocated in that time:

00007fccd4000000 65500K rw--- [ anon ]
00007fccd7ff7000 36K ----- [ anon ]
00007fccd8000000 65528K rw--- [ anon ]
00007fccdbffe000 8K ----- [ anon ]
00007fccdc000000 65504K rw--- [ anon ]
00007fccdfff8000 32K ----- [ anon ]
00007fcce0000000 65536K rw--- [ anon ]
00007fcce4000000 65508K rw--- [ anon ]
00007fcce7ff9000 28K ----- [ anon ]
00007fcce8000000 65524K rw--- [ anon ]
00007fccebffd000 12K ----- [ anon ]
00007fccec000000 65532K rw--- [ anon ]
00007fcceffff000 4K ----- [ anon ]
00007fccf0000000 65496K rw--- [ anon ]
00007fccf3ff6000 40K ----- [ anon ]
00007fccf4000000 65496K rw--- [ anon ]
00007fccf7ff6000 40K ----- [ anon ]
00007fccf8000000 65532K rw--- [ anon ]
00007fccfbfff000 4K ----- [ anon ]
00007fccfc000000 65520K rw--- [ anon ]
00007fccffffc000 16K ----- [ anon ]
00007fcd00000000 65508K rw--- [ anon ]
00007fcd03ff9000 28K ----- [ anon ]

Over that period, total memory usage by the JVM (as reported by top) had grown ~786mb, or basically the sum of those 13 64mb chunks. I dumped the memory from /proc/pid/ and was able to see a bunch of lines from the data files that my Spark job is processing, but I couldn't figure out what was actually creating these 64mb chunks.
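For anyone who wants to repeat the measurement, here is a minimal sketch of how pmap output like the listing above could be tallied. It is not the script used for the numbers in this message. The names `parse_pmap`, `count_64m_reservations` and `SAMPLE` are hypothetical, and it assumes the default pmap column layout (address, size in K, permissions, mapping name). `SAMPLE` is just the first seven lines of the listing.

```python
import re

# One pmap line: hex address, size in KiB, permission flags, mapping name.
PMAP_LINE = re.compile(r"^([0-9a-f]+)\s+(\d+)K\s+(\S+)\s+(.*)$")
CHUNK_KIB = 64 * 1024  # 64 MiB expressed in KiB


def parse_pmap(text):
    """Return (address, size_kib, perms, name) for every mapping line in text."""
    regions = []
    for line in text.splitlines():
        match = PMAP_LINE.match(line.strip())
        if match:
            addr, size_kib, perms, name = match.groups()
            regions.append((int(addr, 16), int(size_kib), perms, name.strip()))
    return regions


def count_64m_reservations(regions):
    """Count anonymous rw regions that, together with the inaccessible
    region directly after them (if any), span exactly 64 MiB."""
    count = 0
    for i, (addr, size_kib, perms, name) in enumerate(regions):
        if "anon" not in name or not perms.startswith("rw"):
            continue
        total_kib = size_kib
        if i + 1 < len(regions):
            next_addr, next_kib, next_perms, next_name = regions[i + 1]
            # Only absorb a '-----' region that starts where this one ends.
            if (next_perms == "-----" and "anon" in next_name
                    and next_addr == addr + size_kib * 1024):
                total_kib += next_kib
        if total_kib == CHUNK_KIB:
            count += 1
    return count


SAMPLE = """\
00007fccd4000000  65500K rw---   [ anon ]
00007fccd7ff7000     36K -----   [ anon ]
00007fccd8000000  65528K rw---   [ anon ]
00007fccdbffe000      8K -----   [ anon ]
00007fccdc000000  65504K rw---   [ anon ]
00007fccdfff8000     32K -----   [ anon ]
00007fcce0000000  65536K rw---   [ anon ]
"""

if __name__ == "__main__":
    n = count_64m_reservations(parse_pmap(SAMPLE))
    print(n, "reservations,", n * CHUNK_KIB // 1024, "MiB")
```

On those seven sample lines it counts 4 reservations (256 MiB): three rw---/----- pairs that add up to exactly 65536K each, plus one region that is 65536K on its own. Running it on two pmap snapshots and comparing the counts gives the growth between them.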
I thought it might be netty, so I set spark.shuffle.io.preferDirectBufs to false, but that hasn't changed anything. The only thing I see in the config page regarding "64mb" is spark.kryoserializer.buffer.max, which defaults to 64mb. I'll try setting that to something different, but as far as I know, kryo is not doing anything off heap. Still wondering if this could be netty, or maybe something Akka is doing if it's using off-heap mem?

There were no ERROR messages in the executor's (or driver's) logs during this time.

Any help would be greatly appreciated. This issue continues to cause our streaming apps to die every few days, which is... less than ideal! :)

On Wed, Aug 5, 2015 at 9:10 AM, Sea <261810...@qq.com> wrote:

> No one helped me... I helped myself: I split the cluster into two clusters,
> 1.4.1 and 1.3.0
>
> ------------------ Original Message ------------------
> *From:* "Ted Yu";<yuzhih...@gmail.com>;
> *Sent:* Tuesday, August 4, 2015, 10:28 PM
> *To:* "Igor Berman"<igor.ber...@gmail.com>;
> *Cc:* "Sea"<261810...@qq.com>; "Barak Gitsis"<bar...@similarweb.com>; "user@spark.apache.org"<user@spark.apache.org>; "rxin"<r...@databricks.com>; "joshrosen"<joshro...@databricks.com>; "davies"<dav...@databricks.com>;
> *Subject:* Re: About memory leak in spark 1.4.1
>
> w.r.t. spark.deploy.spreadOut, here is the scaladoc:
>
> // As a temporary workaround before better ways of configuring memory, we allow users to set
> // a flag that will perform round-robin scheduling across the nodes (spreading out each app
> // among all the nodes) instead of trying to consolidate each app onto a small # of nodes.
> private val spreadOutApps = conf.getBoolean("spark.deploy.spreadOut", true)
>
> Cheers
>
> On Tue, Aug 4, 2015 at 4:13 AM, Igor Berman <igor.ber...@gmail.com> wrote:
>
>> sorry, can't disclose info about my prod cluster
>>
>> nothing jumps into my mind regarding your config
>> we don't use lz4 compression, don't know what spark.deploy.spreadOut is (there is no documentation regarding this)
>>
>> If you are sure that you don't have a memory leak in your business logic, I would try to reset each property to default (or just remove it from your config) and try to run your job to see if it's not somehow connected
>>
>> my config (nothing special really)
>> spark.shuffle.consolidateFiles true
>> spark.speculation false
>> spark.executor.extraJavaOptions -XX:+UseStringCache -XX:+UseCompressedStrings -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log -verbose:gc
>> spark.executor.logs.rolling.maxRetainedFiles 1000
>> spark.executor.logs.rolling.strategy time
>> spark.worker.cleanup.enabled true
>> spark.logConf true
>> spark.rdd.compress true
>>
>> On 4 August 2015 at 12:59, Sea <261810...@qq.com> wrote:
>>
>>> How many machines are there in your standalone cluster?
>>> I am not using tachyon.
>>>
>>> GC can not help me... Can anyone help?
>>>
>>> my configuration:
>>>
>>> spark.deploy.spreadOut false
>>> spark.eventLog.enabled true
>>> spark.executor.cores 24
>>>
>>> spark.ui.retainedJobs 10
>>> spark.ui.retainedStages 10
>>> spark.history.retainedApplications 5
>>> spark.deploy.retainedApplications 10
>>> spark.deploy.retainedDrivers 10
>>> spark.streaming.ui.retainedBatches 10
>>> spark.sql.thriftserver.ui.retainedSessions 10
>>> spark.sql.thriftserver.ui.retainedStatements 100
>>>
>>> spark.file.transferTo false
>>> spark.driver.maxResultSize 4g
>>> spark.sql.hive.metastore.jars=/spark/spark-1.4.1/hive/*
>>>
>>> spark.eventLog.dir hdfs://mycluster/user/spark/historylog
>>> spark.history.fs.logDirectory hdfs://mycluster/user/spark/historylog
>>>
>>> spark.driver.extraClassPath=/spark/spark-1.4.1/extlib/*
>>> spark.executor.extraClassPath=/spark/spark-1.4.1/extlib/*
>>>
>>> spark.sql.parquet.binaryAsString true
>>> spark.serializer org.apache.spark.serializer.KryoSerializer
>>> spark.kryoserializer.buffer 32
>>> spark.kryoserializer.buffer.max 256
>>> spark.shuffle.consolidateFiles true
>>> spark.io.compression.codec org.apache.spark.io.LZ4CompressionCodec
>>>
>>> ------------------ Original Message ------------------
>>> *From:* "Igor Berman";<igor.ber...@gmail.com>;
>>> *Sent:* Monday, August 3, 2015, 7:56 PM
>>> *To:* "Sea"<261810...@qq.com>;
>>> *Cc:* "Barak Gitsis"<bar...@similarweb.com>; "Ted Yu"<yuzhih...@gmail.com>; "user@spark.apache.org"<user@spark.apache.org>; "rxin"<r...@databricks.com>; "joshrosen"<joshro...@databricks.com>; "davies"<dav...@databricks.com>;
>>> *Subject:* Re: About memory leak in spark 1.4.1
>>>
>>> in general, what is your configuration? use --conf "spark.logConf=true"
>>>
>>> we have 1.4.1 in a production standalone cluster and haven't experienced what you are describing
>>> can you verify in the web UI that spark indeed got your 50g per executor limit? I mean in the configuration page..
>>>
>>> maybe you are using off-heap storage (Tachyon)?
>>>
>>> On 3 August 2015 at 04:58, Sea <261810...@qq.com> wrote:
>>>
>>>> "spark uses a lot more than heap memory, it is the expected behavior."
>>>> It didn't exist in spark 1.3.x
>>>> What does "a lot more than" mean? It means that I lose control of it!
>>>> I tried requesting 31g, but it still grows to 55g and continues to grow!!!
>>>> That is the point!
>>>> I have tried setting memoryFraction to 0.2, but it didn't help.
>>>> I don't know whether it will still exist in the next release, 1.5. I hope not.
>>>>
>>>> ------------------ Original Message ------------------
>>>> *From:* "Barak Gitsis";<bar...@similarweb.com>;
>>>> *Sent:* Sunday, August 2, 2015, 9:55 PM
>>>> *To:* "Sea"<261810...@qq.com>; "Ted Yu"<yuzhih...@gmail.com>;
>>>> *Cc:* "user@spark.apache.org"<user@spark.apache.org>; "rxin"<r...@databricks.com>; "joshrosen"<joshro...@databricks.com>; "davies"<dav...@databricks.com>;
>>>> *Subject:* Re: About memory leak in spark 1.4.1
>>>>
>>>> spark uses a lot more than heap memory, it is the expected behavior.
>>>> in 1.4 off-heap memory usage is supposed to grow in comparison to 1.3
>>>>
>>>> Better use as little memory as you can for heap, and since you are not utilizing it already, it is safe for you to reduce it.
>>>> memoryFraction helps you optimize heap usage for your data/application profile while keeping it tight.
>>>>
>>>> On Sun, Aug 2, 2015 at 12:54 PM Sea <261810...@qq.com> wrote:
>>>>
>>>>> spark.storage.memoryFraction is in heap memory, but my situation is that the memory used is more than the heap memory!
>>>>>
>>>>> Anyone else use spark 1.4.1 in production?
>>>>>
>>>>> ------------------ Original Message ------------------
>>>>> *From:* "Ted Yu";<yuzhih...@gmail.com>;
>>>>> *Sent:* Sunday, August 2, 2015, 5:45 PM
>>>>> *To:* "Sea"<261810...@qq.com>;
>>>>> *Cc:* "Barak Gitsis"<bar...@similarweb.com>; "user@spark.apache.org"<user@spark.apache.org>; "rxin"<r...@databricks.com>; "joshrosen"<joshro...@databricks.com>; "davies"<dav...@databricks.com>;
>>>>> *Subject:* Re: About memory leak in spark 1.4.1
>>>>>
>>>>> http://spark.apache.org/docs/latest/tuning.html does mention spark.storage.memoryFraction in two places.
>>>>> One is under the Cache Size Tuning section.
>>>>>
>>>>> FYI
>>>>>
>>>>> On Sun, Aug 2, 2015 at 2:16 AM, Sea <261810...@qq.com> wrote:
>>>>>
>>>>>> Hi, Barak
>>>>>> It is ok with spark 1.3.0, the problem is with spark 1.4.1.
>>>>>> I don't think spark.storage.memoryFraction will make any difference, because it is still in heap memory.
>>>>>>
>>>>>> ------------------ Original Message ------------------
>>>>>> *From:* "Barak Gitsis";<bar...@similarweb.com>;
>>>>>> *Sent:* Sunday, August 2, 2015, 4:11 PM
>>>>>> *To:* "Sea"<261810...@qq.com>; "user"<user@spark.apache.org>;
>>>>>> *Cc:* "rxin"<r...@databricks.com>; "joshrosen"<joshro...@databricks.com>; "davies"<dav...@databricks.com>;
>>>>>> *Subject:* Re: About memory leak in spark 1.4.1
>>>>>>
>>>>>> Hi,
>>>>>> reducing spark.storage.memoryFraction did the trick for me. Heap doesn't get filled because it is reserved..
>>>>>> My reasoning is:
>>>>>> I give the executor all the memory I can give it, so that makes it a boundary.
>>>>>> From here I try to make the best use of memory I can. storage.memoryFraction is in a sense user data space. The rest can be used by the system.
>>>>>> If you don't have so much data that you MUST store in memory for performance, better give spark more space..
>>>>>> ended up setting it to 0.3
>>>>>>
>>>>>> All that said, it is on spark 1.3 on cluster
>>>>>>
>>>>>> hope that helps
>>>>>>
>>>>>> On Sat, Aug 1, 2015 at 5:43 PM Sea <261810...@qq.com> wrote:
>>>>>>
>>>>>>> Hi, all
>>>>>>> I upgraded spark to 1.4.1 and many applications failed... I find the heap memory is not full, but the CoarseGrainedExecutorBackend process takes more memory than I expect, and it increases as time goes on, finally exceeding the server's limit, and the worker dies.....
>>>>>>>
>>>>>>> Can anyone help?
>>>>>>>
>>>>>>> Mode: standalone
>>>>>>>
>>>>>>> spark.executor.memory 50g
>>>>>>>
>>>>>>> 25583 xiaoju 20 0 75.5g 55g 28m S 1729.3 88.1 2172:52 java
>>>>>>>
>>>>>>> 55g is more than the 50g I requested
>>>>>>>
>>>>>> --
>>>>>> *-Barak*
>>>>>>
>>>> --
>>>> *-Barak*
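The gap described in the original post at the bottom of the thread (55g resident against spark.executor.memory 50g) can be computed from a top line with a rough sketch like the one below. The function names are hypothetical, and it assumes the default top column order (RES is the sixth field) and that sizes carry a k/m/g/t suffix, with a bare number meaning KiB.

```python
# Multipliers to MiB for the size suffixes seen in top output and in
# settings such as spark.executor.memory.
UNIT_TO_MIB = {"k": 1 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}


def to_mib(size):
    """Convert a size such as '55g', '28m' or '75.5g' to MiB."""
    size = size.strip().lower()
    if size[-1] in UNIT_TO_MIB:
        return float(size[:-1]) * UNIT_TO_MIB[size[-1]]
    # Assumption: a bare number is KiB, which is what top prints by default.
    return float(size) / 1024


def resident_minus_heap_mib(top_line, executor_memory):
    """Resident size from a top line minus the configured executor heap, in MiB."""
    # Assumption: default top layout, so RES is the sixth whitespace-separated field.
    res = top_line.split()[5]
    return to_mib(res) - to_mib(executor_memory)


# The top line quoted in the original post.
TOP_LINE = "25583 xiaoju 20 0 75.5g 55g 28m S 1729.3 88.1 2172:52 java"

if __name__ == "__main__":
    extra = resident_minus_heap_mib(TOP_LINE, "50g")
    print(extra, "MiB resident above the configured heap")
```

For the quoted line this works out to 5120 MiB (5 GiB) of resident memory beyond the 50g heap setting. Logging that number periodically shows whether the non-heap part keeps growing.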