[ 
https://issues.apache.org/jira/browse/MESOS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277399#comment-14277399
 ] 

Steven Schlansker edited comment on MESOS-1949 at 1/14/15 6:33 PM:
-------------------------------------------------------------------

Well, it's not quite as urgent as I thought.  But there's still a lot of 
information that is hidden in log files and is very hard to correlate.  For 
example, I had a task die with 
{code}
I0106 20:08:04.998108  1625 docker.cpp:928] Starting container 
'78065406-449e-4103-85c1-bbfab09d7372' for task 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a'
 (and executor 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a')
 of framework 'Singularity'
E0106 20:08:05.221181  1624 slave.cpp:2787] Container 
'78065406-449e-4103-85c1-bbfab09d7372' for executor 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a'
 of framework 'Singularity' failed to start: Port [4111] not included in 
resources
E0106 20:08:05.277864  1622 slave.cpp:2882] Termination of executor 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a'
 of framework 'Singularity' failed: Unknown container: 
78065406-449e-4103-85c1-bbfab09d7372
{code}
but the task's "message" field only contains "Abnormal executor termination".

Whenever something like this happens, application developers come to me -- they 
don't have the knowledge to trawl through Mesos logs (arguably a developer 
education problem, but the tools could help much more!).  You can find the 
Mesos slave logs through the UI, but you have to do a lot of correlation 
yourself -- you have to find the right slave, then dig through the messages 
picking out only the ones relevant to your task, etc.

If all of the logs relevant to one task were collected in one place, this 
would be much easier.  Does that make sense?
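For what it's worth, the manual correlation I describe above boils down to 
something like the following.  This is just a rough sketch of the search a 
human currently performs by hand, not a proposed implementation -- the 
{{correlate}} helper and the sample log lines (taken from the snippet above) 
are purely illustrative:

```python
# Sketch of the manual per-task log correlation described above.
# The container ID comes from the sample log lines in this comment; in
# practice you first have to discover it by searching for the task name.

def correlate(log_lines, task_id, container_id):
    """Return only the log lines that mention this task or its container."""
    return [line for line in log_lines
            if task_id in line or container_id in line]

# A few lines as they might appear in one slave's log (abridged).
slave_log = [
    "I0106 20:08:04.998108  1625 docker.cpp:928] Starting container "
    "'78065406-449e-4103-85c1-bbfab09d7372' for task 'ci-http-redirector-...'",
    "I0106 20:08:05.000000  1625 slave.cpp:100] Message about some other task",
    "E0106 20:08:05.221181  1624 slave.cpp:2787] Container "
    "'78065406-449e-4103-85c1-bbfab09d7372' failed to start: "
    "Port [4111] not included in resources",
]

relevant = correlate(
    slave_log,
    task_id="ci-http-redirector-",
    container_id="78065406-449e-4103-85c1-bbfab09d7372",
)
# Only the two lines mentioning this task/container survive the filter.
```

The point is that this filtering (and first locating the right slave's log at 
all) is exactly the busywork that per-task log collection would eliminate.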


was (Author: stevenschlansker):
Well, it's not quite as urgent as I thought.  But there's still a lot of 
information that is hidden in log files and is very hard to correlate.  For 
example, I had a task die with 
{code}
I0106 20:08:04.998108  1625 docker.cpp:928] Starting container 
'78065406-449e-4103-85c1-bbfab09d7372' for task 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a'
 (and executor 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a')
 of framework 'Singularity'
E0106 20:08:05.221181  1624 slave.cpp:2787] Container 
'78065406-449e-4103-85c1-bbfab09d7372' for executor 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a'
 of framework 'Singularity' failed to start: Port [4111] not included in 
resources
E0106 20:08:05.277864  1622 slave.cpp:2882] Termination of executor 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a'
 of framework 'Singularity' failed: Unknown container: 
78065406-449e-4103-85c1-bbfab09d7372
{code}
but the "message" field only has "Abnormal executor termination"

Whenever something like this happens, application developers come to me -- they 
don't have any way to see the Mesos slave logs (no login permissions in 
general).  You can find the Mesos slave logs through the UI, but you have to do 
a lot of correlation yourself -- you have to find the right slave, dig through 
the messages, etc.

If all of the relevant logs to one task were collected in one place, this would 
be much easier.  Makes sense?

> All log messages from master, slave, executor, etc. should be collected on a 
> per-task basis
> -------------------------------------------------------------------------------------------
>
>                 Key: MESOS-1949
>                 URL: https://issues.apache.org/jira/browse/MESOS-1949
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master, slave
>    Affects Versions: 0.20.1
>            Reporter: Steven Schlansker
>
> Currently through a task's lifecycle, various debugging information is 
> created at different layers of the Mesos ecosystem.  The framework will log 
> task information, the master deals with resource allocation, the slave 
> actually allocates those resources, and the executor does the work of 
> launching the task.
> If anything through that pipeline fails, the end user is left with little but 
> a "TASK_FAILED" or "TASK_LOST" -- the actually interesting / useful 
> information (for example a "Docker pull failed because repository didn't 
> exist") is hidden in one of four or five different places, potentially spread 
> across as many different machines.  This leads to unpleasant and repetitive 
> searching through logs looking for a clue to what went wrong.
> Collating logs on a per-task basis would give the end user a much friendlier 
> way of figuring out exactly where in this process something went wrong, and 
> likely much faster resolution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)