I do have other non-XML tasks and I was getting the same SIGTERM on all of them. I think the issue might be due to me processing small files via binaryFiles or wholeTextFiles. Initially I had issues with Xmx memory because I've got more than 1 million files (and in one case it is 5 million files). I sorted that out by processing them in batches of 32k. But then this started happening. I've set the memoryOverhead to 4g for most of the tasks and it is OK now. But 4g is too much for tasks that process small files. I do have 32 threads per executor on some tasks, but 32 MB for stack & thread overhead should do. Maybe the issue is sockets or a memory leak in the network communication.
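For reference, a minimal sketch of the kind of batched job I mean (the path-listing file, the 32k batch size and the 4g overhead value are illustrative, not my exact job):

import org.apache.spark.{SparkConf, SparkContext}

object BatchedSmallFiles {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("batched-small-files")
      // off-heap headroom for the YARN container, on top of the executor's Xmx
      .set("spark.yarn.executor.memoryOverhead", "4096") // MB in Spark 1.x
    val sc = new SparkContext(conf)

    // hypothetical listing of the ~1 million small file paths, one per line
    val allPaths = sc.textFile("hdfs:///lists/file-paths.txt").collect()

    // process the files 32k at a time; binaryFiles accepts comma-separated paths
    allPaths.grouped(32 * 1024).zipWithIndex.foreach { case (batch, i) =>
      val processed = sc.binaryFiles(batch.mkString(","))
        .map { case (path, stream) => stream.toArray().length } // read each file fully
        .count()
      println(s"batch $i: processed $processed files")
    }
    sc.stop()
  }
}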

On 13/07/15 09:15, Ewan Higgs wrote:
It depends on how large the xml files are and how you're processing them.

If you're using !ENTITY tags, then you don't need a very large piece of XML to consume a lot of memory, e.g. the billion laughs XML:
https://en.wikipedia.org/wiki/Billion_laughs
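If you parse the files yourself, a common guard is to disable DTD processing and external entity resolution. A minimal sketch using StAX (the sample document and loop are illustrative, not from the job discussed here):

import java.io.StringReader
import javax.xml.stream.XMLInputFactory

// Refuse DTDs and external entities so entity-expansion bombs
// (e.g. billion laughs) fail fast instead of exhausting memory.
val factory = XMLInputFactory.newInstance()
factory.setProperty(XMLInputFactory.SUPPORT_DTD, false)
factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false)

val reader = factory.createXMLStreamReader(new StringReader("<doc>hello</doc>"))
while (reader.hasNext) reader.next() // a document carrying a DTD would throw here
reader.close()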

-Ewan

On 13/07/15 10:11, Konstantinos Kougios wrote:
it was the memoryOverhead. It runs OK with more of that, but do you know which libraries could affect this? I find it strange that it needs 4g for a task that processes some XML files. The tasks themselves require less Xmx.
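In case it helps, one way to see where that off-heap memory goes from inside a running executor is the JMX buffer-pool beans; direct buffers (e.g. allocated by networking libraries) count against memoryOverhead but never show up in the GC logs. A diagnostic sketch, nothing Spark-specific assumed:

import java.lang.management.{BufferPoolMXBean, ManagementFactory}
import scala.collection.JavaConverters._

// Direct and mapped buffer pools live outside the heap, so they are
// invisible to -Xmx and the GC logs but still count toward the container limit.
val pools = ManagementFactory.getPlatformMXBeans(classOf[BufferPoolMXBean]).asScala
pools.foreach { p =>
  println(f"${p.getName}%-8s count=${p.getCount}%6d used=${p.getMemoryUsed / 1024 / 1024}%6d MB")
}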

Cheers

On 13/07/15 06:29, Jong Wook Kim wrote:
Based on my experience, YARN containers can receive a SIGTERM when:

- they produce too many logs and use up the hard drive
- they use more off-heap memory than is given by the spark.yarn.executor.memoryOverhead configuration. That might be due to too many classes loaded (less than MaxPermGen but more than memoryOverhead), or some other off-heap memory allocated by a networking library, etc.
- they open too many file descriptors, which you can check on the executor node under /proc/<executor jvm's pid>/fd/ (a quick way to count them is sketched below)

Does any of these apply to your situation?
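For the file descriptor check, a quick sketch you can run inside the executor JVM (the pid@hostname trick via RuntimeMXBean is HotSpot-specific and /proc is Linux-only, so treat both as assumptions):

import java.io.File
import java.lang.management.ManagementFactory

// On HotSpot the runtime name looks like "12345@hostname"; take the pid part.
val pid = ManagementFactory.getRuntimeMXBean.getName.split("@")(0)

// Each entry under /proc/<pid>/fd is one open file descriptor.
val fds = Option(new File(s"/proc/$pid/fd").list()).map(_.length).getOrElse(-1)
println(s"open file descriptors: $fds")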

Jong Wook

On Jul 7, 2015, at 19:16, Kostas Kougios <kostas.koug...@googlemail.com> wrote:

I am still receiving these weird SIGTERMs on the executors. The driver claims it lost the executor; the executor receives a SIGTERM (from whom???)

It doesn't seem to be a memory-related issue, though increasing memory takes the job a bit further or completes it. But why? There is no memory pressure on either the driver or the executor, and nothing in the logs indicates otherwise.

driver:

15/07/07 10:47:04 INFO scheduler.TaskSetManager: Starting task 14762.0 in stage 0.0 (TID 14762, cruncher03.stratified, PROCESS_LOCAL, 13069 bytes)
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Finished task 14517.0 in stage 0.0 (TID 14517) in 15950 ms on cruncher03.stratified (14507/42240)
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 ERROR cluster.YarnClusterScheduler: Lost executor 1 on cruncher05.stratified: remote Rpc client disassociated
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@cruncher05.stratified:32976] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

15/07/07 10:47:04 WARN scheduler.TaskSetManager: Lost task 14591.0 in stage 0.0 (TID 14591, cruncher05.stratified): ExecutorLostFailure (executor 1 lost)

GC log for the driver; it doesn't look like it ran out of memory:

2015-07-07T10:45:19.887+0100: [GC (Allocation Failure) 1764131K->1391211K(3393024K), 0.0102839 secs]
2015-07-07T10:46:00.934+0100: [GC (Allocation Failure) 1764971K->1391867K(3405312K), 0.0099062 secs]
2015-07-07T10:46:45.252+0100: [GC (Allocation Failure) 1782011K->1392596K(3401216K), 0.0167572 secs]

executor:

15/07/07 10:47:03 INFO executor.Executor: Running task 14750.0 in stage 0.0 (TID 14750)
15/07/07 10:47:03 INFO spark.CacheManager: Partition rdd_493_14750 not found, computing it
15/07/07 10:47:03 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
15/07/07 10:47:03 INFO storage.DiskBlockManager: Shutdown hook called

executor GC log (no out-of-memory, by the looks of it):
2015-07-07T10:47:02.332+0100: [GC (GCLocker Initiated GC) 24696750K->23712939K(33523712K), 0.0416640 secs]
2015-07-07T10:47:02.598+0100: [GC (GCLocker Initiated GC) 24700520K->23722043K(33523712K), 0.0391156 secs]
2015-07-07T10:47:02.862+0100: [GC (Allocation Failure) 24709182K->23726510K(33518592K), 0.0390784 secs]





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-tp23668.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org