[ 
https://issues.apache.org/jira/browse/AURORA-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212492#comment-16212492
 ] 

Dmitry Zhuk commented on AURORA-1952:
-------------------------------------

[~mohitjaggi] there's a code that handles rescinds outside executor queue by 
banning an offer until it lands to host offers.
So assume scheduler is busy and executor queue is long enough.
1. Mesos sends {{offer_1}}, {{offer_2}} for the same host, and they are added 
to the executor queue.
For example, {{offer_1}} - {{cpus:5,cpus\{REV\}:5}}. {{offer_2}} - {{cpus:2}} 
(e.g. some task terminated, and released resources are made available to 
scheduler in additional offer).
2. Agent updates oversubscribable resources. In this case Mesos rescinds all 
previous offers, which have revocable resources from this agent. 
{{handleRescind}} tries to cancel {{offer_1}}, but since {{offer_1}} is still 
in executor queue, it bans the offer and adds {{cancelOffer}} call to executor 
queue.
3. Mesos sends {{offer_3}} with new resources.
4. Executor starts processing {{offer_1}} and just adds the offer.
5. Executor starts processing {{offer_2}}. 
{{hostOffers.get(offer.getOffer().getAgentId())}} returns {{Optional.absent()}} 
because {{offer_1}} is banned. So this offer is added and it replaces 
{{offer_1}} in {{offersBySlave}} and {{offersByHost}}.
6. Executor starts processing cancel of {{offer_1}}. And {{remove}} clears 
{{offer_2}} from {{offersBySlave}} and {{offersByHost}}, making internal 
structures inconsistent.
7. Executor starts processing {{offer_3}}. 
{{hostOffers.get(offer.getOffer().getAgentId())}} returns {{Optional.absent()}} 
because {{offersBySlave}} does not have this agent after 6. So offer is added 
to {{offers}} and there are now 2 offers from the same host, and none of them 
is banned.

> race condition in offers by agent id map (and potentially others) 
> caused(probably) a crash
> ------------------------------------------------------------------------------------------
>
>                 Key: AURORA-1952
>                 URL: https://issues.apache.org/jira/browse/AURORA-1952
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 0.18.0
>         Environment: nothing special
>            Reporter: Mohit Jaggi
>            Assignee: Mohit Jaggi
>             Fix For: 0.18.0
>
>
> Crashed here
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/preemptor/PendingTaskProcessor.java#L145
> due to duplicates in map. Most likely a concurrency issue. [~wfarner] pointed 
> out the following code:
> I'm looking at this chunk here, where a concurrent map would not help.
> {code:java}
>       Optional<HostOffer> sameSlave = 
> hostOffers.get(offer.getOffer().getAgentId());
>       if (sameSlave.isPresent()) {
>         // If there are existing offers for the slave, decline all of them so 
> the master can
>         // compact all of those offers into a single offer and send them back.
>         LOG.info("Returning offers for " + 
> offer.getOffer().getAgentId().getValue()
>             + " for compaction.");
>         decline(offer.getOffer().getId());
>         removeAndDecline(sameSlave.get().getOffer().getId());
>       } else {
>         hostOffers.add(offer);
> {code}
> --------- logs ----------
> {code:java}
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: Sep 28, 2017 6:09:00 PM 
> com.google.common.util.concurrent.ServiceManager$ServiceListener failed
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: SEVERE: Service 
> PreemptorService [FAILED] has failed in the RUNNING state.
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: 
> java.lang.IllegalArgumentException: Multiple entries with same key: 
> 1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102=HostOffer{offer=id {
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: value: 
> "f7b84805-a0c5-4405-be77-f7f1b7110405-O56597202"
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: }
> ...
> ...
> ep 28 18:09:00 machine1163 aurora-scheduler[14266]: , 
> hostAttributes=IHostAttributes{host=compute606-dca1.prod.uber.internal, 
> attributes=[IAttribute{name=host, values=[compute606-dca1]}, 
> IAttribute{name=rack, values=[as13]}, IAttribute{name=pod, values=[d]}, 
> IAttribute{name=dedicated, values=[infra/cassandra]}], mode=NONE, 
> slaveId=1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102}}. To index multiple values 
> under a key, use Multimaps.index.
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> com.google.common.collect.Maps.uniqueIndex(Maps.java:1251)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> com.google.common.collect.Maps.uniqueIndex(Maps.java:1208)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.lambda$run$0(PendingTaskProcessor.java:146)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStorage.java:147)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.storage.log.LogStorage.read(LogStorage.java:562)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.read(CallOrderEnforcingStorage.java:113)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.run(PendingTaskProcessor.java:135)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.preemptor.PreemptorModule$PreemptorService.runOneIteration(PreemptorModule.java:161)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:188)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> java.lang.Thread.run(Thread.java:748)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: E0928 18:09:00.316 
> [PreemptorService RUNNING, GuavaUtils$LifecycleShutdownListener:55] Service: 
> PreemptorService [FAILED] failed unexpectedly. Triggering shutdown.
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316 
> [qtp1000734462-3068369, Slf4jRequestLog:60] 10.187.28.19 - - 
> [28/Sep/2017:18:09:00 +0000] "POST //10.188.43.6:8082/api HTTP/1.1" 200 95
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316 
> [PreemptorService RUNNING, Lifecycle:84] Shutting down application
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316 
> [PreemptorService RUNNING, ShutdownRegistry$ShutdownRegistryImpl:77] 
> Executing 4 shutdown commands.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to