Thanks! I'll try that and report back once I have some interesting evidence.
-kr, Gerard.

On Tue, Dec 2, 2014 at 12:54 AM, Tim Chen <t...@mesosphere.io> wrote:

> Hi Gerard,
>
> I see. What would help diagnose your problem is if you can enable verbose
> logging (GLOG_v=1) before running the slave, and share the slave logs when
> it happens.
>
> Tim
>
> On Mon, Dec 1, 2014 at 3:23 PM, Gerard Maas <gerard.m...@gmail.com> wrote:
>
>> Hi Tim,
>>
>> It's quite hard to reproduce. It just "happens"... sometimes worse than
>> others, mostly when the system is under load. We notice because the
>> framework starts 'jumping' from one slave to another, but so far we have
>> no clue why this is happening.
>>
>> What I'm currently looking for is the set of conditions that could cause
>> Mesos to kill the executor (not the task), to validate whether any of
>> those conditions apply to our case and try to narrow the problem down to
>> some reproducible subset.
>>
>> -kr, Gerard.
>>
>> On Mon, Dec 1, 2014 at 11:57 PM, Tim Chen <t...@mesosphere.io> wrote:
>>
>>> There are different reasons, but the most common is that the framework
>>> asks to kill the task.
>>>
>>> Can you provide some easy repro steps/artifacts? I've been working on
>>> Spark on Mesos these days and can help try this out.
>>>
>>> Tim
>>>
>>> On Mon, Dec 1, 2014 at 2:43 PM, Gerard Maas <gerard.m...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Sorry if this has been discussed before. I'm new to the list.
>>>>
>>>> We are currently running our Spark + Spark Streaming jobs on Mesos,
>>>> submitting our jobs through Marathon.
>>>>
>>>> We see with some regularity that the Spark Streaming driver gets
>>>> killed by Mesos and then restarted on some other node by Marathon.
>>>>
>>>> I have no clue why Mesos is killing the driver, and looking at both
>>>> the Mesos and Spark logs didn't make me any wiser.
>>>>
>>>> On the Spark Streaming driver logs, I find this entry of Mesos
>>>> "signing off" my driver:
>>>>
>>>>> Shutting down
>>>>> Sending SIGTERM to process tree at pid 17845
>>>>> Killing the following process trees:
>>>>> [
>>>>> -+- 17845 sh -c sh ./run-mesos.sh application-ts.conf
>>>>>  \-+- 17846 sh ./run-mesos.sh application-ts.conf
>>>>>    \--- 17847 java -cp core-compute-job.jar
>>>>>         -Dconfig.file=application-ts.conf com.compute.job.FooJob 31326
>>>>> ]
>>>>> Command terminated with signal Terminated (pid: 17845)
>>>>
>>>> What would be the reasons for Mesos to kill an executor?
>>>> Has anybody seen something similar? Any hints on where to start
>>>> digging?
>>>>
>>>> -kr, Gerard.
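[Editor's note: the log excerpt above shows the Mesos containerizer sending SIGTERM to the task's process tree. A minimal shell illustration (not Mesos code; the `sleep` child and variable names are purely for demonstration) of what "Command terminated with signal Terminated" means at the process level:]

```shell
# Not Mesos code -- a minimal illustration of SIGTERM semantics.
# A process killed by SIGTERM exits with status 128 + 15 = 143,
# which is what "Command terminated with signal Terminated" reports.
sleep 30 &
pid=$!
kill -TERM "$pid"            # same signal Mesos sends to the process tree
wait "$pid"
status=$?
echo "exit status: $status"  # 143 on POSIX shells
```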
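[Editor's note: Tim's suggestion of setting GLOG_v=1 before running the slave might look like the sketch below. The master URL and log directory are placeholders, not values from this thread:]

```shell
# Hypothetical invocation sketch: enable verbose glog output for the slave.
# GLOG_v=1 is the environment variable Tim mentions; the flag values shown
# here are illustrative placeholders.
export GLOG_v=1
mesos-slave --master=zk://zk-host:2181/mesos --log_dir=/var/log/mesos
```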