[jira] [Commented] (MESOS-9657) Launching a command task twice can crash the agent

Aram (Jira) Mon, 07 Dec 2020 04:58:06 -0800


    [ 
https://issues.apache.org/jira/browse/MESOS-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245188#comment-17245188
 ]


Aram  commented on MESOS-9657:
------------------------------

We are also hitting this issue. Is there an ETA, or at least a target release 
milestone, for a fix? 
We are running version 1.9.0, but the code is the same in master. 


{code:java}
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: I1207 11:16:05.194779 29877 
slave.cpp:2130] Got assigned task 'ct:1607336160000:0:TAKS_NAME:' for framework 
7c85369d-7ed2-4634-afe1-f59ba555e427-0274
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: I1207 11:16:05.196017 29877 
slave.cpp:2504] Authorizing task 'ct:1607336160000:0:TAKS_NAME:' for framework 
7c85369d-7ed2-4634-afe1-f59ba555e427-0274
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: I1207 11:16:05.198606 29877 
slave.cpp:2977] Launching task 'ct:1607336160000:0:TAKS_NAME:' for framework 
7c85369d-7ed2-4634-afe1-f59ba555e427-0274
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: F1207 11:16:05.198640 29877 
slave.cpp:2990] Check failed: executor == nullptr
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: *** Check failure stack 
trace: ***
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac89eb7b1d 
google::LogMessage::Fail()
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac89eb9dfd 
google::LogMessage::SendToLog()
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac89eb76ab 
google::LogMessage::Flush()
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac89eba859 
google::LogMessageFatal::~LogMessageFatal()
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac8900a27d 
mesos::internal::slave::Slave::__run()
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac8902db71 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSK_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaISU_EERKSK_IbESG_SJ_SO_SS_SY_S11_EEvRKNS1_3PIDIT_EEMS13_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSE_OSH_OSM_OSQ_OSW_OSZ_S3_E_JSE_SH_SM_SQ_SW_SZ_St12_PlaceholderILi1EEEEEEclEOS3_
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac89df5b91 
process::ProcessBase::consume()
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac89e1af77 
process::ProcessManager::resume()
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac89e1eb36 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac8a0d54d0 
execute_native_thread_routine
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac8621240b 
start_thread
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac859aee7f 
__GI___clone
Dec 07 11:16:05 ip-10-103-6-133 systemd[1]: mesos-slave.service: main process 
exited, code=killed, status=6/ABRT
{code}

> Launching a command task twice can crash the agent
> --------------------------------------------------
>
>                 Key: MESOS-9657
>                 URL: https://issues.apache.org/jira/browse/MESOS-9657
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benno Evers
>            Priority: Major
>
> When launching a command task, we verify that the framework has no existing 
> executor for that task:
> {noformat}
>       // We are dealing with command task; a new command executor will be
>       // launched.
>       CHECK(executor == nullptr);
> {noformat}
> and afterwards an executor is created with the same executor id as the task 
> id:
> {noformat}
>   // (slave.cpp)
>   // Either the master explicitly requests launching a new executor
>   // or we are in the legacy case of launching one if there wasn't
>   // one already. Either way, let's launch executor now.
>   if (executor == nullptr) {
>     Try<Executor*> added = framework->addExecutor(executorInfo);
>   [...]
> {noformat}
> This means that if we relaunch the task with the same task id before the 
> executor is removed, it will crash the agent:
> {noformat}
> F0315 16:39:32.822818 38112 slave.cpp:2865] Check failed: executor == nullptr 
> *** Check failure stack trace: ***
>     @     0x7feb29a407af  google::LogMessage::Flush()
>     @     0x7feb29a43c3f  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7feb28a5a886  mesos::internal::slave::Slave::__run()
>     @     0x7feb28af4f0e  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSK_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaISU_EERKSK_IbESG_SJ_SO_SS_SY_S11_EEvRKNS1_3PIDIT_EEMS13_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSE_OSH_OSM_OSQ_OSW_OSZ_S3_E_JSE_SH_SM_SQ_SW_SZ_St12_PlaceholderILi1EEEEEEclEOS3_
>     @     0x7feb2998a620  process::ProcessBase::consume()
>     @     0x7feb29987675  process::ProcessManager::resume()
>     @     0x7feb299a2d2b  
> _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvE3$_8EEEEE6_M_runEv
>     @     0x7feb2632f523  (unknown)
>     @     0x7feb25e40594  start_thread
>     @     0x7feb25b73e6f  __GI___clone
> Aborted (core dumped)
> {noformat}
> Instead of crashing, the agent should just drop the task with an appropriate 
> error in this case.
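
The behaviour requested in the description above (drop the task rather than abort) can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the Mesos implementation or a proposed patch: `Framework`, `Executor`, `LaunchResult` and `launchCommandTask` are hypothetical stand-ins, and the only thing taken from the issue is that a command executor reuses the task id as its executor id.

```cpp
#include <iostream>
#include <map>
#include <string>

// Hypothetical stand-in for an executor; NOT the Mesos Executor struct.
struct Executor {
  std::string id;
};

// Hypothetical outcome of a launch attempt in this toy model.
enum class LaunchResult { LAUNCHED, DROPPED_DUPLICATE };

// Hypothetical stand-in for the per-framework executor bookkeeping.
class Framework {
public:
  // Returns the executor with the given id, or nullptr if there is none.
  Executor* getExecutor(const std::string& id) {
    auto it = executors.find(id);
    return it == executors.end() ? nullptr : &it->second;
  }

  // Registers a new executor under the given id.
  Executor* addExecutor(const std::string& id) {
    return &executors.emplace(id, Executor{id}).first->second;
  }

  // Forgets the executor with the given id, if present.
  void removeExecutor(const std::string& id) { executors.erase(id); }

private:
  std::map<std::string, Executor> executors;
};

// A command executor uses the task id as its executor id, so a second launch
// with the same task id finds the old executor until it has been removed.
// Instead of a fatal CHECK, this sketch logs the conflict and drops the task.
LaunchResult launchCommandTask(Framework& framework, const std::string& taskId) {
  if (framework.getExecutor(taskId) != nullptr) {
    std::cerr << "Dropping task '" << taskId
              << "': an executor with the same id still exists" << std::endl;
    return LaunchResult::DROPPED_DUPLICATE;
  }

  framework.addExecutor(taskId);
  return LaunchResult::LAUNCHED;
}
```

In this model, calling `launchCommandTask` twice with the same id returns `LAUNCHED` and then `DROPPED_DUPLICATE`; once `removeExecutor` has run for that id, the same task id can be launched again. How a real agent would report the dropped task back to the framework is outside the scope of this sketch.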



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
