YZ sun created MESOS-10211:
------------------------------

             Summary: mesos agent crashes every time tensorboard is launched in a horovod image with mesos container
                 Key: MESOS-10211
                 URL: https://issues.apache.org/jira/browse/MESOS-10211
             Project: Mesos
          Issue Type: Bug
          Components: agent
    Affects Versions: 1.11.0
         Environment: agent:ubuntu18.04
            Reporter: YZ sun


When a task is launched using the image 
"horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1"
and tensorboard is started inside that image, the agent node crashes 
immediately, every time.

If the task command does not start tensorboard, Mesos works as expected.

The agent log looks like this:
{code:java}
//agent crash
I0127 16:07:21.860065 30960 slave.cpp:3181] Launching task 'baseEnvSingle_gpunode1' for framework baseDevEnv_root_1611734806
F0127 16:07:21.860143 30960 slave.cpp:3194] Check failed: executor == nullptr
*** Check failure stack trace: ***
    @     0x7f2bcc4221fc  google::LogMessage::Fail()
    @     0x7f2bcc422145  google::LogMessage::SendToLog()
    @     0x7f2bcc421ad1  google::LogMessage::Flush()
    @     0x7f2bcc4251e8  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f2bca4cb10b  mesos::internal::slave::Slave::__run()
    @     0x7f2bca570ac6  
_ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSB_INS1_13TaskGroupInfoEERKSt6vectorINS2_19ResourceVersionUUIDESaISL_EERKSB_IbEbS7_SA_SF_SJ_SP_SS_bEEvRKNS_3PIDIT_EEMSU_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_ENKUlOS5_OS8_OSD_OSH_OSN_OSQ_ObPNS_11ProcessBaseEE_clES1L_S1M_S1N_S1O_S1P_S1Q_S1R_S1T_
    @     0x7f2bca663b01  
_ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS3_13FrameworkInfoERKNS3_12ExecutorInfoERK6OptionINS3_8TaskInfoEERKSD_INS3_13TaskGroupInfoEERKSt6vectorINS4_19ResourceVersionUUIDESaISN_EERKSD_IbEbS9_SC_SH_SL_SR_SU_bEEvRKNS1_3PIDIT_EEMSW_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS7_OSA_OSF_OSJ_OSP_OSS_ObPNS1_11ProcessBaseEE_JS7_SA_SF_SJ_SP_SS_bS1V_EEEDTclcl7forwardISW_Efp_Espcl7forwardIT0_Efp0_EEEOSW_DpOS1X_
    @     0x7f2bca6555dc  
_ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS4_13FrameworkInfoERKNS4_12ExecutorInfoERK6OptionINS4_8TaskInfoEERKSE_INS4_13TaskGroupInfoEERKSt6vectorINS5_19ResourceVersionUUIDESaISO_EERKSE_IbEbSA_SD_SI_SM_SS_SV_bEEvRKNS2_3PIDIT_EEMSX_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS8_OSB_OSG_OSK_OSQ_OST_ObPNS2_11ProcessBaseEE_JS8_SB_SG_SK_SQ_ST_bSt12_PlaceholderILi1EEEE13invoke_expandIS1X_St5tupleIJS8_SB_SG_SK_SQ_ST_bS1Z_EES22_IJOS1W_EEJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7EEEEDTcl6invokecl7forwardISX_Efp_Espcl6expandcl3getIXT2_EEcl7forwardIS11_Efp0_EEcl7forwardIS12_Efp2_EEEEOSX_OS11_N5cpp1416integer_sequenceImJXspT2_EEEEOS12_
    @     0x7f2bca64da94  
_ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS4_13FrameworkInfoERKNS4_12ExecutorInfoERK6OptionINS4_8TaskInfoEERKSE_INS4_13TaskGroupInfoEERKSt6vectorINS5_19ResourceVersionUUIDESaISO_EERKSE_IbEbSA_SD_SI_SM_SS_SV_bEEvRKNS2_3PIDIT_EEMSX_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS8_OSB_OSG_OSK_OSQ_OST_ObPNS2_11ProcessBaseEE_JS8_SB_SG_SK_SQ_ST_bSt12_PlaceholderILi1EEEEclIJS1W_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7EEEE_Ecl16forward_as_tuplespcl7forwardIT_Efp_EEEEDpOS25_
    @     0x7f2bca647e56  
_ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS6_13FrameworkInfoERKNS6_12ExecutorInfoERK6OptionINS6_8TaskInfoEERKSG_INS6_13TaskGroupInfoEERKSt6vectorINS7_19ResourceVersionUUIDESaISQ_EERKSG_IbEbSC_SF_SK_SO_SU_SX_bEEvRKNS4_3PIDIT_EEMSZ_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSA_OSD_OSI_OSM_OSS_OSV_ObPNS4_11ProcessBaseEE_JSA_SD_SI_SM_SS_SV_bSt12_PlaceholderILi1EEEEEJS1Y_EEEDTclcl7forwardISZ_Efp_Espcl7forwardIT0_Efp0_EEEOSZ_DpOS23_
    @     0x7f2bca645145  
_ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS7_13FrameworkInfoERKNS7_12ExecutorInfoERK6OptionINS7_8TaskInfoEERKSH_INS7_13TaskGroupInfoEERKSt6vectorINS8_19ResourceVersionUUIDESaISR_EERKSH_IbEbSD_SG_SL_SP_SV_SY_bEEvRKNS5_3PIDIT_EEMS10_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSB_OSE_OSJ_OSN_OST_OSW_ObPNS5_11ProcessBaseEE_JSB_SE_SJ_SN_ST_SW_bSt12_PlaceholderILi1EEEEEJS1Z_EEEvOS10_DpOT0_
    @     0x7f2bca641d60  
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSK_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaISU_EERKSK_IbEbSG_SJ_SO_SS_SY_S11_bEEvRKNS1_3PIDIT_EEMS13_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSE_OSH_OSM_OSQ_OSW_OSZ_ObS3_E_JSE_SH_SM_SQ_SW_SZ_bSt12_PlaceholderILi1EEEEEEclEOS3_
    @     0x7f2bcc2f7a59  
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
    @     0x7f2bcc2baae8  process::ProcessBase::consume()
    @     0x7f2bcc2e475c  
_ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
    @     0x55bb81f997ae  process::ProcessBase::serve()
    @     0x7f2bcc2b7486  process::ProcessManager::resume()
    @     0x7f2bcc2b3878  
_ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
    @     0x7f2bcc2c2c1d  
_ZSt13__invoke_implIvZN7process14ProcessManager12init_threadsEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
    @     0x7f2bcc2bf84c  
_ZSt8__invokeIZN7process14ProcessManager12init_threadsEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS4_DpOS5_
    @     0x7f2bcc2dddca  
_ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEE9_M_invokeIJLm0EEEEDTcl8__invokespcl10_S_declvalIXT_EEEEESt12_Index_tupleIJXspT_EEE
    @     0x7f2bcc2dce3e  
_ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEclEv
    @     0x7f2bcc2dbc7e  
_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEEE6_M_runEv
    @     0x7f2bbda376df  (unknown)
    @     0x7f2bbd54a6db  start_thread
    @     0x7f2bbd27371f  clone
Aborted (core dumped)

{code}
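
For triage, the fatal line in a log like the one above can be picked out mechanically. The following is only a minimal sketch, assuming the usual glog line prefix (severity letter, month and day, time, thread id, source file and line); `parse_glog_line` and `first_fatal` are hypothetical helper names and not part of Mesos. An `F` severity marks the failed check that aborts the agent process.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Assumed glog prefix: "Lmmdd hh:mm:ss.uuuuuu threadid file:line] message"
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] "
    r"(?P<message>.*)$"
)


class GlogRecord(NamedTuple):
    severity: str  # I, W, E or F
    time: str
    thread: int
    file: str
    line: int
    message: str


def parse_glog_line(text: str) -> Optional[GlogRecord]:
    """Parse one log line; return None for stack frames and wrapped text."""
    m = GLOG_LINE.match(text.rstrip())
    if m is None:
        return None
    return GlogRecord(
        severity=m.group("severity"),
        time=m.group("time"),
        thread=int(m.group("thread")),
        file=m.group("file"),
        line=int(m.group("line")),
        message=m.group("message"),
    )


def first_fatal(lines: Iterable[str]) -> Optional[GlogRecord]:
    """Return the first record with FATAL severity, if any."""
    for text in lines:
        record = parse_glog_line(text)
        if record is not None and record.severity == "F":
            return record
    return None
```

Applied to the log in this report, `first_fatal` would return the record for `slave.cpp:3194` with the message `Check failed: executor == nullptr`.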



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
