[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents
[ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15993958#comment-15993958 ]

Michael Park commented on MESOS-7389:

[~neilc]: Pushing this off to target 1.4.0 and 1.3.1.

> Mesos 1.2.0 crashes with pre-1.0 Mesos agents
> ---------------------------------------------
>
>                 Key: MESOS-7389
>                 URL: https://issues.apache.org/jira/browse/MESOS-7389
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>        Environment: Ubuntu 14.04
>           Reporter: Nicholas Studt
>           Assignee: Benjamin Mahler
>           Priority: Critical
>             Labels: mesosphere
>
> During an upgrade from 1.0.1 to 1.2.0, a single mesos-slave re-registering with
> the running leader caused the leader to terminate. All 3 of the masters
> suffered the same failure as the same slave node re-registered against the new
> leader; this continued across the entire cluster until the offending slave
> node was removed and fixed. The fix to the slave node was to remove the mesos
> directory and then start the slave node back up.
>
> F0412 17:24:42.736600  6317 master.cpp:5701] Check failed: frameworks_.contains(task.framework_id())
> *** Check failure stack trace: ***
>     @     0x7f59f944f94d  google::LogMessage::Fail()
>     @     0x7f59f945177d  google::LogMessage::SendToLog()
>     @     0x7f59f944f53c  google::LogMessage::Flush()
>     @     0x7f59f9452079  google::LogMessageFatal::~LogMessageFatal()
> I0412 17:24:42.750300  6316 replica.cpp:693] Replica received learned notice for position 6896 from @0.0.0.0:0
>     @     0x7f59f88f2341  mesos::internal::master::Master::_reregisterSlave()
>     @     0x7f59f88f488f  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERKSt6vectorINS5_8ResourceESaISG_EERKSF_INS5_12ExecutorInfoESaISL_EERKSF_INS5_4TaskESaISQ_EERKSF_INS5_13FrameworkInfoESaISV_EERKSF_INS6_17Archive_FrameworkESaIS10_EERKSsRKSF_INS5_20SlaveInfo_CapabilityESaIS17_EERKNS0_6FutureIbEES9_SC_SI_SN_SS_SX_S12_SsS19_S1D_EEvRKNS0_3PIDIT_EEMS1H_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_T19_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>     @     0x7f59f93c3eb1  process::ProcessManager::resume()
>     @     0x7f59f93ccd57  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
>     @     0x7f59f77cfa60  (unknown)
>     @     0x7f59f6fec184  start_thread
>     @     0x7f59f6d19bed  (unknown)
[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents
[ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975401#comment-15975401 ]

Benjamin Mahler commented on MESOS-7389:

One thing that could be simple to cherry-pick into the 1.2.x branch is to have the master log and drop registrations from pre-1.0 agents.
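For illustration, a minimal sketch of that log-and-drop idea in plain C++: {{handleReregisterSlave}}, {{AgentInfo}}, and the version parsing below are hypothetical stand-ins, not the actual master code paths.

{code:cpp}
// Sketch only: drop (re-)registrations from agents older than 1.0 instead
// of proceeding into code paths that CHECK-fail on missing FrameworkInfos.
#include <iostream>
#include <sstream>
#include <string>

// Parse the leading "major.minor" out of a version string like "0.28.2".
static bool parseMajorMinor(const std::string& version, int* major, int* minor) {
  char dot = '\0';
  std::istringstream in(version);
  return static_cast<bool>(in >> *major >> dot >> *minor) && dot == '.';
}

struct AgentInfo {            // Hypothetical: stands in for SlaveInfo etc.
  std::string id;
  std::string version;        // Version the agent reports when registering.
};

// Returns false (and logs) when the registration should be dropped.
bool handleReregisterSlave(const AgentInfo& agent) {
  int major = 0, minor = 0;
  if (!parseMajorMinor(agent.version, &major, &minor) || major < 1) {
    std::cerr << "Dropping re-registration from agent " << agent.id
              << " at version '" << agent.version
              << "': pre-1.0 agents are not supported by this master\n";
    return false;  // Ignore the agent rather than crash the master.
  }
  return true;     // Proceed with normal re-registration handling.
}

int main() {
  std::cout << handleReregisterSlave({"S1", "0.28.2"}) << "\n";  // 0: dropped
  std::cout << handleReregisterSlave({"S2", "1.0.1"}) << "\n";   // 1: accepted
}
{code}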
[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents
[ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975396#comment-15975396 ]

Benjamin Mahler commented on MESOS-7389:

Sorry for the confusion, I just meant that I'm inclined not to support pre-1.0 agents against a 1.2 master, given the complexity of the solution. However, I totally agree that incompatible versions of the master and agent should not lead to crashes (especially vague ones). Explicit handling of incompatible agents (i.e. just MESOS-6976 within the MESOS-6975 epic) is long overdue.

In the interim, we can start by explicitly stating the upgrade requirements for getting onto 1.x from 0.y, since they aren't captured [here|http://mesos.apache.org/documentation/latest/versioning/] and you need to reach a certain 0.y release before you can upgrade to 1.x.

cc [~vinodkone]
[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents
[ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975256#comment-15975256 ]

Neil Conway commented on MESOS-7389:

[~bmahler] Leaving it unfixed seems a bit concerning to me, because it is a pretty trivial way to crash the master. It would be great if we had time to ship agent version checks (MESOS-6975) in 1.3.0 and backport them to 1.2, but I think both of those will be challenging. Not impossible, though.

We could allow only agents with the {{MULTI_ROLE}} capability to register with 1.2.0 masters, but that prevents using 1.0 and 1.1 agents with a 1.2 master, which is unfortunate.
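To make the trade-off concrete, a hedged sketch of such a capability gate; {{Capability}} and {{allowRegistration}} here are simplified stand-ins, not the real {{SlaveInfo::Capability}} handling.

{code:cpp}
// Illustrative only: admit an agent iff it advertises MULTI_ROLE.
#include <algorithm>
#include <vector>

enum class Capability { MULTI_ROLE /* , ... */ };

bool allowRegistration(const std::vector<Capability>& agentCapabilities) {
  // Admit only agents that advertise MULTI_ROLE, i.e. 1.2+ agents. The
  // downside noted above: 1.0 and 1.1 agents, which are otherwise
  // supported against a 1.2 master, would be rejected as well.
  return std::find(agentCapabilities.begin(), agentCapabilities.end(),
                   Capability::MULTI_ROLE) != agentCapabilities.end();
}

int main() {
  bool ok = allowRegistration({Capability::MULTI_ROLE});  // true: 1.2+ agent
  bool no = allowRegistration({});                        // false: older agent
  return (ok && !no) ? 0 : 1;
}
{code}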
[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents
[ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973908#comment-15973908 ]

Nicholas Studt commented on MESOS-7389:

Having hit this in a production system, I found it a little shocking that the leader's behavior was to suicide rather than ignore the agent. It made for an interesting afternoon until we found the source.
[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents
[ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973853#comment-15973853 ]

Benjamin Mahler commented on MESOS-7389:

Ah, I'm sorry! I misread that code.

With regard to this ticket, I'm inclined not to fix it, given the difficulty and the fact that we don't support pre-1.0 agents against a 1.2 master. Any objections?
[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents
[ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973612#comment-15973612 ]

Neil Conway commented on MESOS-7389:

[~bmahler] In the scenario you describe, the master has failed over. This means {{slaveWasRemoved}} will be {{false}}, in which case all tasks for non-completed frameworks will be re-added by the master; i.e., the master's re-registration handling should not drop the task. So I believe it is not broken, but maybe I am the one who is missing something :)
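To spell that branch out, a pseudocode-level C++ sketch of the logic as described; {{recoverAgentTasks}} and its parameters paraphrase the master's re-registration handling rather than quote it.

{code:cpp}
// Sketch of the slaveWasRemoved branch Neil describes; not Mesos code.
#include <string>
#include <vector>

struct Task {
  std::string frameworkId;
  bool frameworkCompleted;  // Did the framework already terminate?
};

void recoverAgentTasks(bool slaveWasRemoved, const std::vector<Task>& tasks) {
  if (slaveWasRemoved) {
    // This master instance previously removed the agent, so its tasks
    // were already accounted for and nothing is re-added here.
    return;
  }
  // Master failed over: slaveWasRemoved is false, so every task of a
  // non-completed framework is re-added, even if that framework has not
  // re-registered with the new master yet.
  for (const Task& task : tasks) {
    if (!task.frameworkCompleted) {
      // ... re-add the task to the master's in-memory state ...
    }
  }
}
{code}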
[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents
[ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973514#comment-15973514 ]

Benjamin Mahler commented on MESOS-7389:

As far as I can tell, fixing this to support pre-1.0 agents is complicated and is likely to produce its own subtle bugs. 1.2+ masters maintain an invariant that each task / executor has a known allocation role (and the master can determine this because 1.0+ agents report their frameworks). If we were to support pre-1.0 agents against a 1.2+ master, the master would have to be updated to handle tasks that have an unknown allocation role (i.e. what used to be called "orphaned" tasks). A partial fix here would be to handle the case where the framework is already re-registered, leaving the "orphaned" task case still triggering this check (a toy reproduction of the check appears after this comment).

[~neilc] [~vinodkone] The handling of pre-1.0 agents in the context of "orphaned tasks" already appears to have issues, e.g.:

* Master upgraded to 1.2.x.
* Pre-1.0 agent re-registers with a task and the task's framework id, but doesn't send the FrameworkInfos.
* This task's framework hasn't re-registered yet, so this is what we used to call an "orphan task".
* The re-registration handling drops the task, see [here|https://github.com/apache/mesos/blob/1.2.0/src/master/master.cpp#L5784-L5807].
* Later, when this framework re-registers, the task is absent in the master but known to the agent.

Is this broken, or am I missing something? If broken, given that fixing this ticket requires a complicated solution, and we didn't originally intend to support pre-1.0 agents against masters newer than 1.0.x, I'd be inclined to not support it (and possibly cherry-pick safety checks like MESOS-6975).
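A toy reproduction of the failing invariant, with the master's hashmap and protobuf types reduced to std containers; {{reregisterTasks}} is a hypothetical stand-in for the code around {{master.cpp:5701}}.

{code:cpp}
// Toy model of why the CHECK fires for pre-1.0 agents; not Mesos code.
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct Task { std::string frameworkId; };

void reregisterTasks(const std::set<std::string>& frameworks_,
                     const std::vector<Task>& agentTasks) {
  for (const Task& task : agentTasks) {
    // Stand-in for "Check failed: frameworks_.contains(task.framework_id())":
    // a 1.2 master assumes every re-registered task's framework is known,
    // which a pre-1.0 agent cannot guarantee since it doesn't send
    // FrameworkInfos -- exactly the "orphaned task" case.
    assert(frameworks_.count(task.frameworkId) == 1);
    // ... re-add the task under its framework's allocation role ...
  }
}

int main() {
  // A framework known to the master, then a task from an unknown one:
  // the second call aborts, mirroring the master's CHECK-induced suicide.
  reregisterTasks({"fw-1"}, {{"fw-1"}});  // OK
  reregisterTasks({"fw-1"}, {{"fw-2"}});  // assert fires
}
{code}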
[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents
[ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971933#comment-15971933 ]

Adam B commented on MESOS-7389:

[~bmahler] Will you have time this week to fix this for 1.2.1/1.3.0? Who's the shepherd?