Awesome! We've reverted to the previous port and all our slaves have recovered nicely.
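For anyone else who hits this before 0.23.0: as I understand it, the recovery compatibility check Vinod linked below compares the checkpointed SlaveInfo against the current one, but prior to the fix it ignored the port. Here is a rough, self-contained sketch of the idea (SlaveInfoSketch and the example hostname/resources are made up for illustration; this is not the actual Mesos type or the actual patch):

// Simplified stand-in for mesos::SlaveInfo -- only the fields relevant
// to the recovery compatibility check discussed in this thread.
#include <cstdint>
#include <iostream>
#include <string>

struct SlaveInfoSketch {
  std::string hostname;
  uint16_t port;          // the field that was not compared before 0.23.0
  std::string resources;  // flattened to a string for the sketch
  std::string attributes; // flattened to a string for the sketch
  bool checkpoint;
};

// Pre-0.23.0 behaviour (roughly): the port is not part of the comparison,
// so a slave restarted on a new port still looks "compatible" and
// recovery proceeds.
bool compatibleOld(const SlaveInfoSketch& previous, const SlaveInfoSketch& current) {
  return previous.hostname == current.hostname &&
         previous.resources == current.resources &&
         previous.attributes == current.attributes &&
         previous.checkpoint == current.checkpoint;
}

// Post-fix behaviour (roughly): a port change makes the checkpointed info
// incompatible, so the slave bails out of recovery early instead of
// re-registering and only failing later.
bool compatibleNew(const SlaveInfoSketch& previous, const SlaveInfoSketch& current) {
  return compatibleOld(previous, current) && previous.port == current.port;
}

int main() {
  SlaveInfoSketch before{"slave1.example.com", 5050, "cpus:8;mem:16384", "rack:a", true};
  SlaveInfoSketch after = before;
  after.port = 5051;  // the only change we made during the upgrade

  std::cout << "old check: " << compatibleOld(before, after) << "\n";  // 1 (compatible)
  std::cout << "new check: " << compatibleNew(before, after) << "\n";  // 0 (incompatible)
  return 0;
}

With the old comparison the restarted slave still looks compatible, recovery proceeds, and the mismatch only surfaces later when the health checks fail; with the port included in the check, the slave would bail out of recovery right away, which is the behaviour we were hoping for.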
Thanks for looking into this,
Philippe

On Fri, Jul 3, 2015 at 3:27 PM, Vinod Kone <vinodk...@gmail.com> wrote:

> Looks like this is due to a bug in versions < 0.23.0, where slave recovery
> didn't check for changes in 'port' when considering compatibility
> <https://github.com/apache/mesos/blob/0.21.0/src/common/type_utils.cpp#L137>.
> It has since been fixed in the upcoming 0.23.0 release.
>
> On Thu, Jul 2, 2015 at 8:45 PM, Philippe Laflamme <phili...@hopper.com> wrote:
>
>> Checkpointing has been enabled since 0.18 on these slaves. The only other
>> setting that changed during the upgrade was that we added --gc_delay=1days.
>> Otherwise, it's an in-place upgrade without any changes to the work
>> directory...
>>
>> Philippe
>>
>> On Thu, Jul 2, 2015 at 8:59 PM, Vinod Kone <vinodk...@gmail.com> wrote:
>>
>>> It is surprising that the slave didn't bail out during the initial phase
>>> of recovery when the port changed. I'm assuming you enabled checkpointing
>>> in 0.20.0 and that you didn't wipe the meta data directory or anything
>>> when upgrading to 0.21.0?
>>>
>>> On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme <phili...@hopper.com> wrote:
>>>
>>>> Here you are:
>>>>
>>>> https://gist.github.com/plaflamme/9cd056dc959e0597fb1c
>>>>
>>>> You can see in the mesos-master.INFO log that it re-registers the slave
>>>> using port :5050 (line 9) and fails the health checks on port :5051
>>>> (line 10). So it might be the slave that re-uses the old configuration?
>>>>
>>>> Thanks,
>>>> Philippe
>>>>
>>>> On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone <vinodk...@gmail.com> wrote:
>>>>
>>>>> Can you paste some logs?
>>>>>
>>>>> On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme <phili...@hopper.com> wrote:
>>>>>
>>>>>> Ok, that's reasonable, but I'm not sure why it would successfully
>>>>>> re-register with the master if it's not supposed to in the first place.
>>>>>> I think changing the resources (for example) will dump the old
>>>>>> configuration in the logs and tell you why recovery is bailing out.
>>>>>> It's not doing that in this case.
>>>>>>
>>>>>> It looks as though this doesn't work only because the master can't
>>>>>> ping the slave on the old port; the whole recovery process was
>>>>>> successful otherwise.
>>>>>>
>>>>>> I'm not sure if the slave could have picked up its configuration
>>>>>> change and failed the recovery early, but that would definitely be a
>>>>>> better experience.
>>>>>>
>>>>>> Philippe
>>>>>>
>>>>>> On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone <vinodk...@gmail.com> wrote:
>>>>>>
>>>>>>> For slave recovery to work, the slave is expected not to change its config.
>>>>>>>
>>>>>>> On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme <phili...@hopper.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
>>>>>>>> configured with checkpointing and with "reconnect" recovery.
>>>>>>>>
>>>>>>>> I was investigating why the slaves would successfully re-register
>>>>>>>> with the master and recover, but would subsequently be asked to shut
>>>>>>>> down ("health check timeout").
>>>>>>>>
>>>>>>>> It turns out that our slaves had been unintentionally configured to
>>>>>>>> use port 5050 in the previous configuration. We decided to fix that
>>>>>>>> during the upgrade and have them use the default 5051 port.
>>>>>>>>
>>>>>>>> This change seems to make the health checks fail and eventually gets
>>>>>>>> the slave killed due to inactivity.
>>>>>>>>
>>>>>>>> I've confirmed that leaving the port at what it was in the previous
>>>>>>>> configuration makes the slave successfully re-register without being
>>>>>>>> asked to shut down later on.
>>>>>>>>
>>>>>>>> Is this a known issue? I haven't been able to find a JIRA ticket for
>>>>>>>> this. Maybe it's the expected behaviour? Should I create a ticket?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Philippe