I also have other, non-XML tasks, and I was getting the same SIGTERM on all of them. I think the issue might be that I am processing many small files via binaryFiles or wholeTextFiles. Initially I had problems with Xmx memory because I have more than 1 million files (and on one occasion 5 million). I sorted that out by processing them in batches of 32k. But then this started happening. I've set memoryOverhead to 4g for most of the tasks and they are OK now, but 4g is too much for tasks that process small files. I do have 32 threads per executor on some tasks, but 32 MB for stack and thread overhead should be enough. Maybe the issue is sockets, or some memory leak in the network communication.
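As a rough illustration of the two knobs mentioned above, here is a minimal Python sketch: splitting a large list of input paths into fixed-size batches, and adding the executor heap to the memory overhead to get the per-executor memory requested from YARN. The function names are hypothetical, "32k" is assumed to mean 32,000, and the 32 GB heap in the usage note is only an example value; none of this is the poster's actual code.

```python
from typing import Iterator, List, Sequence


def batches(paths: Sequence[str], batch_size: int = 32_000) -> Iterator[List[str]]:
    """Yield consecutive slices of `paths`, each at most `batch_size` long.

    The default of 32,000 is an assumption about what "32k" means.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


def container_request_mb(executor_memory_mb: int, memory_overhead_mb: int) -> int:
    """Memory requested from YARN per executor: JVM heap (Xmx) plus overhead.

    YARN may round this up to its own allocation increment; that step is
    not modelled here.
    """
    return executor_memory_mb + memory_overhead_mb
```

For example, `container_request_mb(32 * 1024, 4 * 1024)` gives 36864 MB for a 32 GB heap with 4g of overhead, and each list yielded by `batches(...)` could be handed to one binaryFiles or wholeTextFiles job.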

On 13/07/15 09:15, Ewan Higgs wrote:
It depends on how large the xml files are and how you're processing them.

If you're using !ENTITY tags then you don't need a very large piece of xml to consume a lot of memory. e.g. the billion laughs xml:


On 13/07/15 10:11, Konstantinos Kougios wrote:
It was the memoryOverhead. It runs OK with more of that, but do you know which libraries could affect this? I find it strange that it needs 4g for a task that processes some XML files. The tasks themselves require less Xmx.


On 13/07/15 06:29, Jong Wook Kim wrote:
Based on my experience, YARN containers can get SIGTERM when

- it produces too many logs and uses up the hard drive
- it uses more off-heap memory than is allowed by the spark.yarn.executor.memoryOverhead configuration. This might be due to too many classes being loaded (less than MaxPermGen but more than memoryOverhead), or off-heap memory allocated by a networking library, etc.
- it opens too many file descriptors, which you can check on the executor node's /proc/<executor jvm's pid>/fd/
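
The file-descriptor check in the last point can be sketched in Python as below. This is only an illustration, assuming a Linux /proc layout; the function name and the `proc_root` parameter (there so the code can be pointed at a test directory) are hypothetical. Grouping by target type also gives a quick view of how many descriptors are sockets.

```python
import os
from collections import Counter
from typing import Dict


def open_fd_summary(pid: int, proc_root: str = "/proc") -> Dict[str, int]:
    """Count a process's open descriptors, grouped by the kind of target."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    kinds: Counter = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir and readlink,
            # or the entry is not a symlink.
            kinds["unknown"] += 1
            continue
        if target.startswith("socket:"):
            kinds["socket"] += 1
        elif target.startswith("pipe:"):
            kinds["pipe"] += 1
        elif target.startswith("anon_inode:"):
            kinds["anon_inode"] += 1
        else:
            kinds["file"] += 1
    return dict(kinds)
```

Run on the executor node as a user allowed to read that directory, `sum(open_fd_summary(pid).values())` gives the total, which can be compared against the process's open-file limit.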

Do any of these apply to your situation?

Jong Wook

On Jul 7, 2015, at 19:16, Kostas Kougios <kostas.koug...@googlemail.com> wrote:

I am still receiving these weird sigterms on the executors. The driver claims
it lost the executor, the executor receives a SIGTERM (from whom???)

It doesn't seem to be a memory-related issue, though increasing memory takes the job a bit further or lets it complete. But why? There is no memory pressure on
either the driver or the executor, and nothing in the logs indicates any.


15/07/07 10:47:04 INFO scheduler.TaskSetManager: Starting task 14762.0 in stage 0.0 (TID 14762, cruncher03.stratified, PROCESS_LOCAL, 13069 bytes)
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Finished task 14517.0 in stage 0.0 (TID 14517) in 15950 ms on cruncher03.stratified (14507/42240)
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 ERROR cluster.YarnClusterScheduler: Lost executor 1 on
cruncher05.stratified: remote Rpc client disassociated
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Re-queueing tasks for 1
from TaskSet 0.0
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@cruncher05.stratified:32976] has
failed, address is now gated for [5000] ms. Reason is: [Disassociated].

15/07/07 10:47:04 WARN scheduler.TaskSetManager: Lost task 14591.0 in stage
0.0 (TID 14591, cruncher05.stratified): ExecutorLostFailure (executor 1

GC log for the driver; it doesn't look like it ran out of memory:

2015-07-07T10:45:19.887+0100: [GC (Allocation Failure)
1764131K->1391211K(3393024K), 0.0102839 secs]
2015-07-07T10:46:00.934+0100: [GC (Allocation Failure)
1764971K->1391867K(3405312K), 0.0099062 secs]
2015-07-07T10:46:45.252+0100: [GC (Allocation Failure)
1782011K->1392596K(3401216K), 0.0167572 secs]


15/07/07 10:47:03 INFO executor.Executor: Running task 14750.0 in stage 0.0
(TID 14750)
15/07/07 10:47:03 INFO spark.CacheManager: Partition rdd_493_14750 not
found, computing it
15/07/07 10:47:03 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
15/07/07 10:47:03 INFO storage.DiskBlockManager: Shutdown hook called

Executor GC log (no out-of-memory, it seems):
2015-07-07T10:47:02.332+0100: [GC (GCLocker Initiated GC)
24696750K->23712939K(33523712K), 0.0416640 secs]
2015-07-07T10:47:02.598+0100: [GC (GCLocker Initiated GC)
24700520K->23722043K(33523712K), 0.0391156 secs]
2015-07-07T10:47:02.862+0100: [GC (Allocation Failure)
24709182K->23726510K(33518592K), 0.0390784 secs]
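
To make the "no out of memory" reading of these GC entries concrete, here is a small Python sketch that extracts the `before->after(capacity)` figures and reports how full the heap is after each collection. The regular expression only targets the short GC line format shown above, and the function name is hypothetical. Note that this covers JVM heap only: it says nothing about off-heap usage, which is what counts against memoryOverhead.

```python
import re
from typing import List, Tuple

# Matches e.g. "24709182K->23726510K(33518592K)":
# heap used before GC, heap used after GC, current heap capacity.
GC_SIZES = re.compile(r"(\d+)K->(\d+)K\((\d+)K\)")


def heap_after_gc(log_text: str) -> List[Tuple[int, int, float]]:
    """Return (used_after_kb, capacity_kb, used_fraction) for each GC entry."""
    results = []
    for _before, after, capacity in GC_SIZES.findall(log_text):
        after_kb, capacity_kb = int(after), int(capacity)
        results.append((after_kb, capacity_kb, after_kb / capacity_kb))
    return results
```

Applied to the three executor entries above, the heap is roughly 71% full after each collection, which is consistent with there being no heap exhaustion.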

View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-tp23668.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
