Hi Gerard, I see. To help diagnose your problem, could you enable verbose logging (GLOG_v=1) before running the slave, and share the slave logs from when it happens?

Tim

On Mon, Dec 1, 2014 at 3:23 PM, Gerard Maas <gerard.m...@gmail.com> wrote:

> Hi Tim,
>
> It's quite hard to reproduce. It just "happens"... some times worse than
> others, mostly when the system is under load. We notice b/c the framework
> starts 'jumping' from one slave to another, but so far we have no clue why
> this is happening.
>
> What I'm currently looking for is some potential conditions that could
> cause Mesos to kill the executor (not the task) to validate whether any of
> those conditions apply to our case and try to narrow down the problem to
> some reproducible subset.
>
> -kr, Gerard.
>
>
> On Mon, Dec 1, 2014 at 11:57 PM, Tim Chen <t...@mesosphere.io> wrote:
>
>> There are different reasons, but the most common is when the framework
>> asks to kill the task.
>>
>> Can you provide some easy repro steps/artifacts? I've been working on
>> Spark on Mesos these days and can help try this out.
>>
>> Tim
>>
>> On Mon, Dec 1, 2014 at 2:43 PM, Gerard Maas <gerard.m...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Sorry if this has been discussed before. I'm new to the list.
>>>
>>> We are currently running our Spark + Spark Streaming jobs on Mesos,
>>> submitting our jobs through Marathon.
>>>
>>> We see with some regularity that the Spark Streaming driver gets killed
>>> by Mesos and then restarted on some other node by Marathon.
>>>
>>> I've no clue why Mesos is killing the driver and looking at both the
>>> Mesos and Spark logs didn't make me any wiser.
>>>
>>> On the Spark Streaming driver logs, I find this entry of Mesos "signing
>>> off" my driver:
>>>
>>> Shutting down
>>>> Sending SIGTERM to process tree at pid 17845
>>>> Killing the following process trees:
>>>> [
>>>> -+- 17845 sh -c sh ./run-mesos.sh application-ts.conf
>>>>  \-+- 17846 sh ./run-mesos.sh application-ts.conf
>>>>    \--- 17847 java -cp core-compute-job.jar
>>>> -Dconfig.file=application-ts.conf com.compute.job.FooJob 31326
>>>> ]
>>>> Command terminated with signal Terminated (pid: 17845)
>>>
>>>
>>> What would be the reasons for Mesos to kill an executor?
>>> Has anybody seen something similar? Any hints on where to start digging?
>>>
>>> -kr, Gerard.
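
For reference, the verbose-logging suggestion at the top of this thread might look roughly like the following when the slave is started by hand. This is only a sketch under assumptions: the master address and log directory are placeholders and do not come from this thread, and a slave launched from an init script or service wrapper would need GLOG_v set in that script's environment instead.

```shell
# Raise glog verbosity for processes started from this shell
# (GLOG_v=1 is the setting suggested in the reply above).
export GLOG_v=1

# Placeholder values, not taken from the thread: substitute your own
# master address and a log directory the slave can write to.
mesos-slave --master=zk://zk.example.com:2181/mesos --log_dir=/var/log/mesos
```

With the slave restarted this way, the logs to share would be the ones written around the time the executor receives SIGTERM.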