Hi Gerard,

What version of Marathon are you running? I ran into similar behavior some time back. My problem turned out to be a compatibility issue between Marathon and Mesos: https://github.com/mesosphere/marathon/issues/595
Regards,
Shijun

On Jan 8, 2015, at 9:28 AM, Gerard Maas <gerard.m...@gmail.com> wrote:

Hi again,

I finally found a clue to this issue. It looks like Marathon is the one behind the job-killing spree. I still don't know *why*, but it looks like Marathon's task reconciliation finds a discrepancy with Mesos and decides to kill the instance.

  INFO|2015-01-08 10:05:35,491|pool-1-thread-1173|MarathonScheduler.scala:299|Requesting task reconciliation with the Mesos master
  INFO|2015-01-08 10:05:35,493|Thread-188479|MarathonScheduler.scala:138|Received status update for task core-compute-jobs-actualvalues-st.be0e36cc-9714-11e4-9e7c-3e6ce77341aa: TASK_RUNNING (Reconciliation: Latest task state)
  INFO|2015-01-08 10:05:35,494|pool-1-thread-1171|MarathonScheduler.scala:338|Need to scale core-compute-jobs-actualvalues-st from 0 up to 1 instances

#### According to Mesos, at this point there's already an instance of this job running, so it's actually scaling from 1 to 2, and not from 0 to 1 as the logs say ####

  INFO|2015-01-08 10:05:35,878|Thread-188483|TaskBuilder.scala:38|No matching offer for core-compute-jobs-actualvalues-st (need 1.0 CPUs, 1000.0 mem, 0.0 disk, 1 ports)
  (... offers ...)
  ...

#### Killing ####

  INFO|2015-01-08 10:06:05,494|pool-1-thread-1172|MarathonScheduler.scala:353|Scaling core-compute-jobs-actualvalues-st from 2 down to 1 instances
  INFO|2015-01-08 10:06:05,494|pool-1-thread-1172|MarathonScheduler.scala:357|Killing tasks: Set(core-compute-jobs-actualvalues-st.e458118d-971d-11e4-9e7c-3e6ce77341aa)

Any ideas why this happens and how to fix it?

-kr, Gerard.

On Tue, Dec 2, 2014 at 1:15 AM, Gerard Maas <gerard.m...@gmail.com> wrote:

Thanks! I'll try that and report back once I have some interesting evidence.

-kr, Gerard.

On Tue, Dec 2, 2014 at 12:54 AM, Tim Chen <t...@mesosphere.io> wrote:

Hi Gerard,

I see.
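To make the "scale from 0 up to 1, then from 2 down to 1" pattern concrete, here is a minimal sketch (plain Python, not Marathon's actual code) of the reconciliation arithmetic the logs suggest: if the scheduler's view of running tasks is stale (it believes 0) while the cluster really has 1, it launches an extra instance and the next pass kills the surplus.

```python
# Hedged illustration, NOT Marathon's real implementation: shows how a stale
# task count during reconciliation can trigger a spurious launch-then-kill.

def scale_decision(target_instances, believed_running, actually_running):
    """Return the actions a naive reconciler would take.

    target_instances: instances the app definition asks for
    believed_running: tasks the scheduler *thinks* are running (may be stale)
    actually_running: tasks really running on the cluster
    """
    actions = []
    if believed_running < target_instances:
        # The scheduler sees a shortfall and launches the difference...
        launched = target_instances - believed_running
        actions.append(("launch", launched))
        actually_running += launched
    if actually_running > target_instances:
        # ...then the next pass sees a surplus and kills the extra tasks.
        actions.append(("kill", actually_running - target_instances))
    return actions

# The situation in the logs: target is 1 instance, the scheduler's stale view
# says 0 are running, but Mesos really has 1 running.
print(scale_decision(1, 0, 1))  # [('launch', 1), ('kill', 1)]
```

With an up-to-date view (`believed_running == actually_running == target_instances`) no action is taken, which is why this only bites when the two systems disagree.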
What would help diagnose your problem is if you can enable verbose logging (GLOG_v=1) before running the slave, and share the slave logs when it happens.

Tim

On Mon, Dec 1, 2014 at 3:23 PM, Gerard Maas <gerard.m...@gmail.com> wrote:

Hi Tim,

It's quite hard to reproduce. It just "happens"... sometimes worse than others, mostly when the system is under load. We notice because the framework starts 'jumping' from one slave to another, but so far we have no clue why this is happening. What I'm currently looking for is a list of potential conditions that could cause Mesos to kill the executor (not the task), so I can check whether any of them apply to our case and narrow the problem down to some reproducible subset.

-kr, Gerard.

On Mon, Dec 1, 2014 at 11:57 PM, Tim Chen <t...@mesosphere.io> wrote:

There are different reasons, but the most common is that the framework asked to kill the task. Can you provide some easy repro steps/artifacts? I've been working on Spark on Mesos these days and can help try this out.

Tim

On Mon, Dec 1, 2014 at 2:43 PM, Gerard Maas <gerard.m...@gmail.com> wrote:

Hi,

Sorry if this has been discussed before; I'm new to the list. We are currently running our Spark + Spark Streaming jobs on Mesos, submitting our jobs through Marathon. We see with some regularity that the Spark Streaming driver gets killed by Mesos and then restarted on some other node by Marathon. I have no clue why Mesos is killing the driver, and looking at both the Mesos and Spark logs didn't make me any wiser.
In the Spark Streaming driver logs, I find this entry of Mesos "signing off" my driver:

  Shutting down
  Sending SIGTERM to process tree at pid 17845
  Killing the following process trees:
  [
  -+- 17845 sh -c sh ./run-mesos.sh application-ts.conf
   \-+- 17846 sh ./run-mesos.sh application-ts.conf
    \--- 17847 java -cp core-compute-job.jar -Dconfig.file=application-ts.conf com.compute.job.FooJob 31326
  ]
  Command terminated with signal Terminated (pid: 17845)

What would be the reasons for Mesos to kill an executor? Has anybody seen something similar? Any hints on where to start digging?

-kr, Gerard.
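One small thing that makes this kind of kill easier to attribute: if the driver process logs nothing when it receives the SIGTERM, the last log line tells you little about who killed it. A minimal sketch (plain Python, not Spark- or Mesos-specific) of installing a SIGTERM handler that records the signal before shutdown, so the external kill shows up in the driver's own logs:

```python
# Minimal sketch: record a SIGTERM before shutting down, so the logs show the
# process was killed externally rather than crashing on its own.
import os
import signal
import time

shutdown_requested = False

def handle_sigterm(signum, frame):
    global shutdown_requested
    shutdown_requested = True
    print(f"Received signal {signum} (SIGTERM) -- shutting down cleanly")

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate the executor's "Sending SIGTERM to process tree" by signalling
# ourselves; in production the signal arrives from outside.
os.kill(os.getpid(), signal.SIGTERM)
time.sleep(0.1)  # let the handler run before we inspect the flag
print("shutdown_requested =", shutdown_requested)
```

In a real driver the handler would flush logs and stop the streaming context instead of just setting a flag, but even the one-line log message is enough to distinguish "killed from outside" from "died on its own".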