Hi Gerard,

I see. To help diagnose your problem, it would be useful if you could
enable verbose logging (GLOG_v=1) before running the slave, and share
the slave logs from when it happens.
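
If it helps, here is a rough Python sketch of what I mean. It is only an
illustration: the helper names (verbose_env, matching_lines) are made up,
and the search terms are guesses based on the "Shutting down" and "SIGTERM"
lines in your driver log, not a list of messages the slave is known to emit.

```python
import os


def verbose_env(base_env=None, level=1):
    """Return a copy of the environment with GLOG_v set to the given level."""
    env = dict(os.environ if base_env is None else base_env)
    env["GLOG_v"] = str(level)
    return env


def matching_lines(lines, needles=("executor", "Shutting down", "SIGTERM")):
    """Yield log lines containing any search term (case-insensitive).

    The default terms are assumptions; adjust them to whatever shows up
    in your own slave logs.
    """
    lowered = [needle.lower() for needle in needles]
    for line in lines:
        text = line.lower()
        if any(needle in text for needle in lowered):
            yield line.rstrip("\n")
```

You would pass verbose_env() as the env argument of subprocess.Popen (or
simply export GLOG_v=1 in the shell) for whatever command you normally use
to start the slave, then run matching_lines over the resulting log file to
narrow down what to share.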

Tim

On Mon, Dec 1, 2014 at 3:23 PM, Gerard Maas <gerard.m...@gmail.com> wrote:

> Hi Tim,
>
> It's quite hard to reproduce. It just "happens"... sometimes worse than
> others, mostly when the system is under load. We notice b/c the framework
> starts 'jumping' from one slave to another, but so far we have no clue why
> this is happening.
>
> What I'm currently looking for is some potential conditions that could
> cause Mesos to kill the executor (not the task) to validate whether any of
> those conditions apply to our case and try to narrow down the problem to
> some reproducible subset.
>
> -kr, Gerard.
>
>
> On Mon, Dec 1, 2014 at 11:57 PM, Tim Chen <t...@mesosphere.io> wrote:
>
>> There are different reasons, but the most common one is that the
>> framework asks to kill the task.
>>
>> Can you provide some easy repro steps/artifacts? I've been working on
>> Spark on Mesos these days and can help try this out.
>>
>> Tim
>>
>> On Mon, Dec 1, 2014 at 2:43 PM, Gerard Maas <gerard.m...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Sorry if this has been discussed before. I'm new to the list.
>>>
>>> We are currently running our Spark + Spark Streaming jobs on Mesos,
>>> submitting our jobs through Marathon.
>>>
>>> We see with some regularity that the Spark Streaming driver gets killed
>>> by Mesos and then restarted on some other node by Marathon.
>>>
>>> I've no clue why Mesos is killing the driver and looking at both the
>>> Mesos and Spark logs didn't make me any wiser.
>>>
>>> In the Spark Streaming driver logs, I find this entry of Mesos "signing
>>> off" my driver:
>>>
>>>> Shutting down
>>>> Sending SIGTERM to process tree at pid 17845
>>>> Killing the following process trees:
>>>> [
>>>> -+- 17845 sh -c sh ./run-mesos.sh application-ts.conf
>>>>  \-+- 17846 sh ./run-mesos.sh application-ts.conf
>>>>    \--- 17847 java -cp core-compute-job.jar
>>>> -Dconfig.file=application-ts.conf com.compute.job.FooJob 31326
>>>> ]
>>>> Command terminated with signal Terminated (pid: 17845)
>>>
>>>
>>> What would be the reasons for Mesos to kill an executor?
>>> Has anybody seen something similar? Any hints on where to start digging?
>>>
>>> -kr, Gerard.
