[
https://issues.apache.org/jira/browse/MESOS-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354096#comment-17354096
]
Charles Natali commented on MESOS-10221:
----------------------------------------
> In addition, according to the framework's log, the accept is sent
> immediately after the offer is received, but the accept appears in the
> master log long after the offer was sent. Is the accept not being
> processed immediately, or am I misunderstanding the time at which the
> offer is sent?
Yeah, that looks suspicious. It would be good to have the full logs of the
master and the framework so we can compare the timestamps of:
* the offer being sent by the master
* the offer being received by the framework
* the accept being sent by the framework
* the accept being received by the master
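
For the master side, the comparison could be scripted roughly as below. This is
only a sketch: it assumes each log entry sits on a single line in the usual
glog layout seen in the excerpts in this issue, the helper names (parse_line,
offer_events, delay_seconds) are made up for illustration, and the year is an
assumption because glog lines do not carry one. The framework log format is
not shown in the issue, so it is not handled here.

```python
import re
from datetime import datetime

# glog prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, file:line]
GLOG = re.compile(
    r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ \S+\] (.*)$"
)
# Offer IDs in the excerpts look like <36-char master UUID>-O<number>.
OFFER_ID = re.compile(r"[0-9a-f-]{36}-O\d+")

# Entries copied from the issue description, re-joined onto single lines.
SAMPLE = [
    "I0528 15:10:03.385080 971 master.cpp:9579] Sending offers [ "
    "2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ] to framework "
    "24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)",
    "I0528 15:10:33.386322 972 master.cpp:11878] Removing offer "
    "2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238",
    "W0528 15:10:57.183194 967 master.cpp:3959] Ignoring accept of offer "
    "2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 since it is no longer valid",
]


def parse_line(line, year=2021):
    """Return (timestamp, message) for a glog line, or None if it does not match.

    The year is an assumption: glog only records month and day.
    """
    m = GLOG.match(line.strip())
    if m is None:
        return None
    mmdd, clock, message = m.groups()
    ts = datetime.strptime(f"{year}{mmdd} {clock}", "%Y%m%d %H:%M:%S.%f")
    return ts, message


def offer_events(lines):
    """Map each offer ID to its (timestamp, message) entries in log order."""
    events = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, message = parsed
        for offer in set(OFFER_ID.findall(message)):
            events.setdefault(offer, []).append((ts, message))
    return events


def delay_seconds(events, offer, first_marker, second_marker):
    """Seconds between the first entries containing each marker string."""
    first = next(ts for ts, msg in events[offer] if first_marker in msg)
    second = next(ts for ts, msg in events[offer] if second_marker in msg)
    return (second - first).total_seconds()


if __name__ == "__main__":
    events = offer_events(SAMPLE)
    for offer in events:
        removed = delay_seconds(events, offer, "Sending offers", "Removing offer")
        accept = delay_seconds(events, offer, "Sending offers", "Ignoring accept")
        print(f"{offer}: removed after {removed:.3f}s, accept seen after {accept:.3f}s")
```

On the three sample entries this reports the offer being removed about 30.0s
after it was sent and the accept reaching the master about 53.8s after it was
sent. Lining these up against the framework's own send and receive timestamps
would show where the delay is introduced.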
> A large number of TASK_LOST causes the task to be unable to run
> ---------------------------------------------------------------
>
> Key: MESOS-10221
> URL: https://issues.apache.org/jira/browse/MESOS-10221
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.9.0, 1.11.0
> Environment: Ubuntu 16.04
> Reporter: clancyhuang
> Priority: Major
>
> Recently, we found that the Mesos master frequently generates TASK_LOST
> errors after task submission. Retrying within a short period of time is
> not feasible, and the problem is becoming more and more frequent.
> We selected two abnormal logs:
> {code:java}
> I0528 15:09:55.367336 964 master.cpp:9579] Sending offers [
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13236,
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] to framework
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:10:25.369561 969 master.cpp:11878] Removing offer
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237
> I0528 15:10:43.383028 959 http.cpp:1436] HTTP POST for
> /master/api/v1/scheduler from 10.118.28.66:50484 with
> User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> I0528 15:10:43.383656 959 master.cpp:5434] Processing DECLINE call for
> offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] for framework
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) with 5
> seconds filter
> I0528 15:10:03.385080 971 master.cpp:9579] Sending offers [
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ] to framework
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:10:33.386322 972 master.cpp:11878] Removing offer
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238
> I0528 15:10:57.181581 967 http.cpp:1436] HTTP POST for
> /master/api/v1/scheduler from 10.118.28.66:50484 with
> User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> W0528 15:10:57.183194 967 master.cpp:3959] Ignoring accept of offer
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 since it is no longer valid
> W0528 15:10:57.183265 967 master.cpp:3964] ACCEPT call used invalid offers
> '[ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ]': Offer
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid
> I0528 15:10:57.184392 967 master.cpp:8212] Sending status update TASK_LOST
> for task data_rename-ebad5d27-df72-4106-96ab-ba6432befba9 of framework
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 'Task launched with invalid offers:
> Offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid'
> {code}
> The following is a log of normal execution:
> {code:java}
> I0528 15:17:03.690855 959 master.cpp:9579] Sending offers [
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529,
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13530 ] to framework
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:17:03.742848 970 http.cpp:1436] HTTP POST for
> /master/api/v1/scheduler from 10.118.28.66:50484 with
> User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> I0528 15:17:03.745221 970 master.cpp:4356] Processing ACCEPT call for
> offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529 ] on agent
> cbe540a8-c894-4655-a899-cec7463d00c9-S2 at slave(1)@ip:5053 (ip) for
> framework 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:17:03.745889 970 master.cpp:11878] Removing offer
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529
> {code}
> We found that, when the error occurred, the offer was removed before the
> accept was processed, and the interval is exactly the configured
> offer-timeout. Our framework communicates with Mesos over HTTP, and I am sure
> that it sends the accept message immediately after receiving the offer and
> that the request succeeds.
> The question is why the master sometimes processes the accept message after
> the offer has timed out. In addition, we tried increasing the offer-timeout,
> but the problem was not resolved.