[ 
https://issues.apache.org/jira/browse/MESOS-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354096#comment-17354096
 ] 

Charles Natali commented on MESOS-10221:
----------------------------------------

>  In addition, according to the framework's log, the accept is sent 
>immediately after the offer is received, but the accept appears in the master 
>log long after the offer was sent. Is the accept not being processed 
>immediately, or am I misunderstanding the time at which the offer is sent?

 

Yeah, that looks suspicious. It'd be good to have the full logs of the master 
and framework so we can compare the timestamps of:
 * the offer being sent by the master
 * the offer being received by the framework
 * the accept being sent by the framework
 * the accept being received by the master
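
For the master side of that comparison, something like the following could 
help. It is only a rough sketch: it assumes the raw, unwrapped master log in 
the usual glog layout, and it keys on the message prefixes visible in the 
excerpts quoted below. The names offer_timeline and offer_delays are made up 
for illustration, and the year is an assumption because glog lines do not 
carry one. The framework log format is not shown in this issue, so it is not 
parsed here.

```python
import re
from datetime import datetime

# glog prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, file:line]
GLOG = re.compile(r"^[IWEF](\d{4} \d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ \S+\] (.*)$")
# Offer IDs in the excerpts look like <uuid>-O<number>.
OFFER_ID = re.compile(r"[0-9a-f-]{36}-O\d+")

# Message prefixes copied from the master log excerpts in this issue.
EVENTS = {
    "Sending offers": "sent",
    "Removing offer": "removed",
    "Processing ACCEPT call": "accept",
    "Processing DECLINE call": "decline",
    "Ignoring accept of offer": "accept_ignored",
}


def offer_timeline(lines, year=2021):
    """Map offer ID -> list of (timestamp, event) in log order.

    `year` is an assumption: glog timestamps do not include it.
    """
    timeline = {}
    for line in lines:
        match = GLOG.match(line.rstrip("\n"))
        if not match:
            continue
        stamp, msg = match.groups()
        event = next(
            (name for prefix, name in EVENTS.items() if msg.startswith(prefix)),
            None,
        )
        if event is None:
            continue
        when = datetime.strptime(f"{year} {stamp}", "%Y %m%d %H:%M:%S.%f")
        for offer in OFFER_ID.findall(msg):
            timeline.setdefault(offer, []).append((when, event))
    return timeline


def offer_delays(timeline):
    """For each offer, seconds elapsed from 'sent' to every other event."""
    delays = {}
    for offer, events in timeline.items():
        sent = next((t for t, e in events if e == "sent"), None)
        if sent is None:
            continue  # the offer was sent before this log excerpt starts
        delays[offer] = [
            (e, (t - sent).total_seconds()) for t, e in events if e != "sent"
        ]
    return delays
```

Feeding it the master log (for example, the lines of an open file) gives, per 
offer, how long after "Sending offers" the removal and the accept or decline 
were logged, which can then be set against the framework's own timestamps.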

 

> A large number of TASK_LOST causes the task to be unable to run
> ---------------------------------------------------------------
>
>                 Key: MESOS-10221
>                 URL: https://issues.apache.org/jira/browse/MESOS-10221
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.9.0, 1.11.0
>         Environment: Ubuntu 16.04
>            Reporter: clancyhuang
>            Priority: Major
>
> Recently, we found that the Mesos master frequently reports TASK_LOST after 
> task submission. Retrying shortly afterwards does not help, and the problem 
> is becoming more and more frequent.
>  We selected two abnormal logs:
> {code:java}
> I0528 15:09:55.367336   964 master.cpp:9579] Sending offers [ 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13236, 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] to framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:10:25.369561   969 master.cpp:11878] Removing offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237
> I0528 15:10:43.383028   959 http.cpp:1436] HTTP POST for 
> /master/api/v1/scheduler from 10.118.28.66:50484 with 
> User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> I0528 15:10:43.383656   959 master.cpp:5434] Processing DECLINE call for 
> offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] for framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) with 5 
> seconds filter
> I0528 15:10:03.385080   971 master.cpp:9579] Sending offers [ 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ] to framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:10:33.386322   972 master.cpp:11878] Removing offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238
> I0528 15:10:57.181581   967 http.cpp:1436] HTTP POST for 
> /master/api/v1/scheduler from 10.118.28.66:50484 with 
> User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> W0528 15:10:57.183194   967 master.cpp:3959] Ignoring accept of offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 since it is no longer valid
> W0528 15:10:57.183265   967 master.cpp:3964] ACCEPT call used invalid offers 
> '[ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ]': Offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid
> I0528 15:10:57.184392   967 master.cpp:8212] Sending status update TASK_LOST 
> for task data_rename-ebad5d27-df72-4106-96ab-ba6432befba9 of framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 'Task launched with invalid offers: 
> Offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid'
> {code}
> The following is a log of normal execution
> {code:java}
> I0528 15:17:03.690855   959 master.cpp:9579] Sending offers [ 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529, 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13530 ] to framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:17:03.742848   970 http.cpp:1436] HTTP POST for 
> /master/api/v1/scheduler from 10.118.28.66:50484 with 
> User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> I0528 15:17:03.745221   970 master.cpp:4356] Processing ACCEPT call for 
> offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529 ] on agent 
> cbe540a8-c894-4655-a899-cec7463d00c9-S2 at slave(1)@ip:5053 (ip) for 
> framework 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:17:03.745889   970 master.cpp:11878] Removing offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529
> {code}
> We found that when the exception occurred, the offer was cancelled before the 
> accept, and the interval is exactly the configured offer-timeout. Our 
> framework communicates with Mesos over HTTP, and I am sure that it sends the 
> accept message immediately after receiving the offer and that the request 
> succeeds.
>  The question is why the master sometimes processes the accept message after 
> the offer has timed out. In addition, we tried increasing the offer-timeout, 
> but the problem was not resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
