[ https://issues.apache.org/jira/browse/MESOS-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331176#comment-16331176 ]
Vinod Kone commented on MESOS-8460:
-----------------------------------

Debugged the issue with [~mcypark]. The problem comes from the way we capture `this` implicitly via the `=` capture in this piece of code:

{code}
slave->garbageCollect(path)
  .onAny(defer(slave->self(), [=](const Future<Nothing>& future) {
    slave->detachFile(path);

    if (executor->info.has_type() &&
        executor->info.type() == ExecutorInfo::DEFAULT) {
      foreachvalue (const Task* task, executor->launchedTasks) {
        executor->detachTaskVolumeDirectory(*task);
      }

      foreachvalue (const Task* task, executor->terminatedTasks) {
        executor->detachTaskVolumeDirectory(*task);
      }

      foreach (const shared_ptr<Task>& task, executor->completedTasks) {
        executor->detachTaskVolumeDirectory(*task);
      }
    }
  }));
{code}

Specifically, the `slave` pointer inside the onAny lambda actually refers to `this->slave`, which is a member variable of `Framework`. Since the Framework struct can be deleted before the onAny callback is executed, the `slave` pointer can be invalid by the time the callback reads it.

The proposed fix here is to explicitly capture the member variables of `Framework` instead of using `=` in the lambda. Note that there is more than one place in the code where we have to fix this.
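To illustrate the difference between the two capture styles, here is a minimal standalone sketch. The `Slave` and `Framework` types below are hypothetical stand-ins, not the Mesos classes, and the helper names (`implicitCapture`, `explicitCapture`, `capturesBehaveAsDescribed`) are made up for this example. It only demonstrates that `[=]` inside a member function reads members through `this`, while capturing a local copy does not; it is not the actual patch.

```cpp
#include <functional>
#include <string>

// Hypothetical stand-ins for the Mesos types; only the shape matters here.
struct Slave {
  std::string name;
};

struct Framework {
  Slave* slave;

  // Mirrors the problematic pattern: `[=]` inside a member function
  // captures `this`, so `slave` in the body means `this->slave`.
  std::function<std::string()> implicitCapture() {
    return [=]() { return slave->name; };
  }

  // Mirrors the proposed direction: copy the member into a local and
  // capture that copy, so the callback no longer depends on `this`.
  std::function<std::string()> explicitCapture() {
    Slave* slave_ = slave;
    return [slave_]() { return slave_->name; };
  }
};

// Returns true if the two captures behave as described above.
bool capturesBehaveAsDescribed() {
  Slave first{"first"};
  Slave second{"second"};
  Framework framework{&first};

  auto implicitCallback = framework.implicitCapture();
  auto explicitCallback = framework.explicitCapture();

  // Re-point the member after both callbacks were created.
  framework.slave = &second;

  // The implicit capture reads through `this` and sees the new value;
  // the explicit capture kept its own copy of the original pointer.
  return implicitCallback() == "second" && explicitCallback() == "first";
}
```

Because the implicit version dereferences `this` at call time, running it after the `Framework` has been deleted would be a use-after-free, which is why the sketch only re-points the member rather than deleting the object. The explicit version still requires the pointed-to `Slave` to outlive the callback.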

> `Slave::detachFile` can segfault because it could use invalid Framework*
> ------------------------------------------------------------------------
>
>                 Key: MESOS-8460
>                 URL: https://issues.apache.org/jira/browse/MESOS-8460
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Vinod Kone
>            Assignee: Vinod Kone
>            Priority: Major
>
> Observed this SEGV in an internal cluster
> {code}
> {noformat}
> 2018-01-18 19:00:54: *** SIGSEGV (@0x0) received by PID 26410 (TID 0x7fe9e4f65700) from PID 0; stack trace: ***
> 2018-01-18 19:00:54: @ 0x7fe9ea2c85e0 (unknown)
> 2018-01-18 19:00:54: @ 0x7fe9ec4cc855 mesos::internal::Files::detach()
> 2018-01-18 19:00:54: @ 0x7fe9ec8cb5b0 mesos::internal::slave::Slave::detachFile()
> 2018-01-18 19:00:54: @ 0x7fe9ec8ccadb _ZZN5mesos8internal5slave9Framework15recoverExecutorERKNS1_5state13ExecutorStateEbRK7hashsetINS_6TaskIDESt4hashIS8_ESt8equal_toIS8_EEENKUlRKN7process6FutureI7NothingEEE0_clESL_.isra.2000
> 2018-01-18 19:00:54: @ 0x7fe9ec37e4e4 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchIvEclINS0_IFvvEEEEEvRKNS1_4UPIDEOT_EUlOSE_S3_E_JSE_St12_PlaceholderILi1EEEEEEclEOS3_
> 2018-01-18 19:00:54: @ 0x7fe9ed455ea1 process::ProcessBase::consume()
> 2018-01-18 19:00:54: @ 0x7fe9ed464bcc process::ProcessManager::resume()
> 2018-01-18 19:00:54: @ 0x7fe9ed46a136 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 2018-01-18 19:00:54: @ 0x7fe9ea7a0230 (unknown)
> 2018-01-18 19:00:54: @ 0x7fe9ea2c0e25 start_thread
> 2018-01-18 19:00:54: @ 0x7fe9e9fee34d __clone
> 2018-01-18 19:00:54: dcos-mesos-slave.service: main process exited, code=killed, status=11/SEGV
> {noformat}
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)