Thanks! I'll try that and report back once I have some interesting evidence.

-kr, Gerard.

On Tue, Dec 2, 2014 at 12:54 AM, Tim Chen <t...@mesosphere.io> wrote:

> Hi Gerard,
>
> I see. What would help diagnose your problem is enabling verbose
> logging (GLOG_v=1) before running the slave, and sharing the slave
> logs when it happens.
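>
> For example, something along these lines (a minimal sketch; the master
> address and log directory are placeholders for your own setup):
>
>     # GLOG_v=1 raises glog verbosity for this run only
>     GLOG_v=1 mesos-slave \
>       --master=zk://<zk-host>:2181/mesos \
>       --log_dir=/var/log/mesos
>
> The verbose output then lands in the glog files under --log_dir (e.g.
> the mesos-slave.INFO symlink).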
>
> Tim
>
> On Mon, Dec 1, 2014 at 3:23 PM, Gerard Maas <gerard.m...@gmail.com> wrote:
>
>> Hi Tim,
>>
>> It's quite hard to reproduce. It just "happens"... sometimes worse than
>> others, mostly when the system is under load. We notice because the
>> framework starts 'jumping' from one slave to another, but so far we have
>> no clue why this is happening.
>>
>> What I'm currently looking for are potential conditions that could
>> cause Mesos to kill the executor (not the task), so I can check whether
>> any of them apply to our case and narrow the problem down to some
>> reproducible subset.
>>
>> -kr, Gerard.
>>
>>
>> On Mon, Dec 1, 2014 at 11:57 PM, Tim Chen <t...@mesosphere.io> wrote:
>>
>>> There are different reasons, but the most common is when the framework
>>> asks to kill the task.
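>>>
>>> One way to check whether that is what's happening is to look at which
>>> framework owns the executor and what state its tasks end up in; the
>>> master's state endpoint dumps all of that as JSON (assuming the
>>> default master port):
>>>
>>>     # frameworks, executors, and task states, as seen by the master
>>>     curl http://<master-host>:5050/master/state.json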
>>>
>>> Can you provide some easy repro steps/artifacts? I've been working on
>>> Spark on Mesos these days and can help try it out.
>>>
>>> Tim
>>>
>>> On Mon, Dec 1, 2014 at 2:43 PM, Gerard Maas <gerard.m...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Sorry if this has been discussed before. I'm new to the list.
>>>>
>>>> We are currently running our Spark + Spark Streaming jobs on Mesos,
>>>> submitting our jobs through Marathon.
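>>>>
>>>> (For context, the driver is deployed as a Marathon app roughly like
>>>> the following; the exact values here are illustrative.)
>>>>
>>>>     curl -X POST http://<marathon-host>:8080/v2/apps \
>>>>       -H 'Content-Type: application/json' \
>>>>       -d '{
>>>>             "id": "/spark-streaming-driver",
>>>>             "cmd": "sh ./run-mesos.sh application-ts.conf",
>>>>             "cpus": 1.0,
>>>>             "mem": 2048,
>>>>             "instances": 1
>>>>           }'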
>>>>
>>>> We see with some regularity that the Spark Streaming driver gets killed
>>>> by Mesos and then restarted on some other node by Marathon.
>>>>
>>>> I've no clue why Mesos is killing the driver, and looking at both the
>>>> Mesos and Spark logs didn't make me any wiser.
>>>>
>>>> In the Spark Streaming driver logs, I find this entry of Mesos "signing
>>>> off" my driver:
>>>>
>>>> Shutting down
>>>>> Sending SIGTERM to process tree at pid 17845
>>>>> Killing the following process trees:
>>>>> [
>>>>> -+- 17845 sh -c sh ./run-mesos.sh application-ts.conf
>>>>>  \-+- 17846 sh ./run-mesos.sh application-ts.conf
>>>>>    \--- 17847 java -cp core-compute-job.jar
>>>>> -Dconfig.file=application-ts.conf com.compute.job.FooJob 31326
>>>>> ]
>>>>> Command terminated with signal Terminated (pid: 17845)
>>>>
>>>>
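>>>> (That excerpt comes from the task sandbox on the slave. Assuming the
>>>> default work_dir of /tmp/mesos, the full stdout/stderr of a run can be
>>>> pulled with something like the command below, substituting the concrete
>>>> slave/framework/executor IDs from the slave log or the web UI.)
>>>>
>>>>     # sandbox layout: <work_dir>/slaves/<S>/frameworks/<F>/executors/<E>/runs/latest
>>>>     cat /tmp/mesos/slaves/*/frameworks/*/executors/*/runs/latest/stderr
>>>>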
>>>> What would be the reasons for Mesos to kill an executor?
>>>> Has anybody seen something similar? Any hints on where to start
>>>> digging?
>>>>
>>>> -kr, Gerard.
