clancyhuang created MESOS-10221:
-----------------------------------

             Summary: A large number of TASK_LOST causes the task to be unable 
to run
                 Key: MESOS-10221
                 URL: https://issues.apache.org/jira/browse/MESOS-10221
             Project: Mesos
          Issue Type: Bug
          Components: master
    Affects Versions: 1.11.0, 1.9.0
         Environment: Ubuntu 16.04
            Reporter: clancyhuang


Recently we found that the Mesos master frequently sends TASK_LOST status 
updates right after task submission. Retrying shortly afterwards does not 
help, and the failures are becoming more and more frequent.
 Below are two log excerpts from failing cases:
{code:java}
I0528 15:09:55.367336   964 master.cpp:9579] Sending offers [ 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13236, 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] to framework 
24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
I0528 15:10:25.369561   969 master.cpp:11878] Removing offer 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237
I0528 15:10:43.383028   959 http.cpp:1436] HTTP POST for 
/master/api/v1/scheduler from 10.118.28.66:50484 with 
User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
I0528 15:10:43.383656   959 master.cpp:5434] Processing DECLINE call for 
offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] for framework 
24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) with 5 seconds 
filter

I0528 15:10:03.385080   971 master.cpp:9579] Sending offers [ 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ] to framework 
24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
I0528 15:10:33.386322   972 master.cpp:11878] Removing offer 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238
I0528 15:10:57.181581   967 http.cpp:1436] HTTP POST for 
/master/api/v1/scheduler from 10.118.28.66:50484 with 
User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
W0528 15:10:57.183194   967 master.cpp:3959] Ignoring accept of offer 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 since it is no longer valid
W0528 15:10:57.183265   967 master.cpp:3964] ACCEPT call used invalid offers '[ 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ]': Offer 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid
I0528 15:10:57.184392   967 master.cpp:8212] Sending status update TASK_LOST 
for task data_rename-ebad5d27-df72-4106-96ab-ba6432befba9 of framework 
24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 'Task launched with invalid offers: 
Offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid'
{code}
The following is a log of a normal execution:
{code:java}
I0528 15:17:03.690855   959 master.cpp:9579] Sending offers [ 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529, 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13530 ] to framework 
24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
I0528 15:17:03.742848   970 http.cpp:1436] HTTP POST for 
/master/api/v1/scheduler from 10.118.28.66:50484 with 
User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
I0528 15:17:03.745221   970 master.cpp:4356] Processing ACCEPT call for offers: 
[ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529 ] on agent 
cbe540a8-c894-4655-a899-cec7463d00c9-S2 at slave(1)@ip:5053 (ip) for framework 
24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
I0528 15:17:03.745889   970 master.cpp:11878] Removing offer 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529
{code}
We found that, in the failing cases, the offer was removed before the ACCEPT 
call was processed, and the interval between sending and removing the offer 
matches the configured offer timeout. Our framework communicates with Mesos 
over HTTP, and I am sure that it sends the ACCEPT call immediately after 
receiving the offer and that the request succeeds.
 The question is why the master sometimes processes the ACCEPT call only after 
the offer has timed out. In addition, we tried increasing the offer timeout, 
but that did not resolve the problem.
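
To make the timing visible, here is a minimal Python sketch that rebuilds the 
wrapped master log entries and reports, per offer, how long the offer was 
outstanding and how long after its removal the master handled the 
ACCEPT/DECLINE call. The helper names (join_wrapped, offer_timeline, 
late_calls) are hypothetical and not part of Mesos. The script assumes the 
glog line format shown in the excerpts above.

```python
import re
from datetime import datetime

# glog entry prefix: severity letter, MMDD, HH:MM:SS.microseconds, thread id.
ENTRY_START = re.compile(r"^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d{6}\s+\d+ ")
# Offer IDs end in "-O<number>"; framework ("-0005") and agent ("-S2") IDs do not match.
OFFER_ID = re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}-O\d+")
# Master messages that show a scheduler call touching an offer being handled.
CALL_MARKERS = ("Processing ACCEPT call", "Processing DECLINE call",
                "Ignoring accept of offer")

# Three entries copied from the failing excerpt in this report, still wrapped.
SAMPLE = """\
I0528 15:10:03.385080   971 master.cpp:9579] Sending offers [
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ] to framework
24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
I0528 15:10:33.386322   972 master.cpp:11878] Removing offer
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238
W0528 15:10:57.183194   967 master.cpp:3959] Ignoring accept of offer
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 since it is no longer valid
"""


def join_wrapped(lines):
    """Merge wrapped continuation lines back into whole log entries."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if ENTRY_START.match(line):
            entries.append(line)
        elif entries:
            entries[-1] += " " + line
    return entries


def timestamp(entry):
    # glog omits the year, so a placeholder year is prepended; only
    # differences between timestamps are used.
    return datetime.strptime("2000" + entry[1:21], "%Y%m%d %H:%M:%S.%f")


def offer_timeline(entries):
    """Map offer id -> first 'sent', 'removed' and 'call' timestamps."""
    timeline = {}
    for entry in entries:
        if "Sending offers" in entry:
            kind = "sent"
        elif "Removing offer" in entry:
            kind = "removed"
        elif any(marker in entry for marker in CALL_MARKERS):
            kind = "call"
        else:
            continue
        for offer_id in OFFER_ID.findall(entry):
            timeline.setdefault(offer_id, {}).setdefault(kind, timestamp(entry))
    return timeline


def late_calls(timeline):
    """Yield (offer_id, seconds_held, seconds_late) for calls handled after removal."""
    for offer_id, t in sorted(timeline.items()):
        if all(k in t for k in ("sent", "removed", "call")) and t["call"] > t["removed"]:
            held = (t["removed"] - t["sent"]).total_seconds()
            late = (t["call"] - t["removed"]).total_seconds()
            yield offer_id, held, late


if __name__ == "__main__":
    for offer_id, held, late in late_calls(offer_timeline(join_wrapped(SAMPLE.splitlines()))):
        print(f"{offer_id}: outstanding {held:.3f}s, call handled {late:.3f}s after removal")
```

For offer O13238 this reports roughly 30.0 seconds between "Sending offers" and 
"Removing offer", and the accept being handled roughly 23.8 seconds after the 
removal. In the normal excerpt the ACCEPT is processed before the offer is 
removed, so nothing is reported. Running the same script over the full master 
log should show whether the delay after removal is constant or growing.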



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
