Not sure if anyone else here can help you, but if I were you, I would adjust SPARK_DAEMON_MEMORY to 2g to bump the worker to 2G. Even though the worker's responsibility is very limited, in today's world, who knows. Give 2g a try and see if the problem goes away.
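For reference, a minimal sketch of that change, assuming the standard conf/spark-env.sh on each node running a worker (SPARK_DAEMON_MEMORY controls the heap of the standalone Master and Worker daemons and defaults to 1g; the daemon needs a restart to pick up the new value):

    # conf/spark-env.sh (sketch): raise the standalone daemon heap
    # from the 1g default to 2g for the Master/Worker processes.
    export SPARK_DAEMON_MEMORY=2g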
BTW, in our production, I set the worker to 2g and have never experienced any OOM from the workers. Our cluster has been live for more than a year, and we also use Spark 1.6.2 in production.

Yong

________________________________
From: Behroz Sikander <behro...@gmail.com>
Sent: Friday, March 24, 2017 9:29 AM
To: Yong Zhang
Cc: user@spark.apache.org
Subject: Re: [Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

Yeah, we also didn't find anything related to this online. Are you aware of any memory leaks in the worker in Spark 1.6.2 that might be causing this? Do you know of any documentation that explains all the tasks a worker performs? Maybe we can get some clue from there.

Regards,
Behroz

On Fri, Mar 24, 2017 at 2:21 PM, Yong Zhang <java8...@hotmail.com> wrote:

I have never experienced a worker OOM, and have very rarely seen one reported online. So my guess is that you will have to generate a heap dump file and analyze it (a sketch of the JVM options for doing that follows at the end of this thread).

Yong

________________________________
From: Behroz Sikander <behro...@gmail.com>
Sent: Friday, March 24, 2017 9:15 AM
To: Yong Zhang
Cc: user@spark.apache.org
Subject: Re: [Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

Thank you for the response. Yes, I am sure, because the driver was working fine. Only 2 workers went down with OOM.

Regards,
Behroz

On Fri, Mar 24, 2017 at 2:12 PM, Yong Zhang <java8...@hotmail.com> wrote:

I am not 100% sure, but normally an OOM in "dispatcher-event-loop" means the driver OOMed. Are you sure it was your workers that OOMed?

Yong

________________________________
From: bsikander <behro...@gmail.com>
Sent: Friday, March 24, 2017 5:48 AM
To: user@spark.apache.org
Subject: [Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

Spark version: 1.6.2
Hadoop: 2.6.0

Cluster: all VMs are deployed on AWS.
1 Master (t2.large)
1 Secondary Master (t2.large)
5 Workers (m4.xlarge)
Zookeeper (t2.large)

Recently, 2 of our workers went down with an out-of-memory error:

java.lang.OutOfMemoryError: GC overhead limit exceeded (max heap: 1024 MB)

Both worker processes were hung; we restarted them to bring them back to a normal state.

Here is the complete exception:
https://gist.github.com/bsikander/84f1a0f3cc831c7a120225a71e435d91

Master's spark-default.conf file:
https://gist.github.com/bsikander/4027136f6a6c91eabad576495c4d797d

Master's spark-env.sh:
https://gist.github.com/bsikander/42f76d7a8e4079098d8a2df3cdee8ee0

Slave's spark-default.conf file:
https://gist.github.com/bsikander/54264349b49e6227c6912eb14d344b8c

So, what could be the reason for our workers crashing with OutOfMemoryError, and how can we avoid it in the future?
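On the heap-dump suggestion above: a minimal sketch of JVM options that make the worker JVM write a dump automatically on OOM, using the standard HotSpot flags passed through SPARK_DAEMON_JAVA_OPTS (the dump path is an assumption; the resulting .hprof file can be opened in a tool such as Eclipse MAT or jhat):

    # conf/spark-env.sh (sketch): dump the daemon heap on OutOfMemoryError.
    # /var/log/spark is an assumed, writable location; adjust as needed.
    export SPARK_DAEMON_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/spark/worker.hprof"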