It is surprising that the slave didn't bail out during the initial phase of
recovery when the port changed. I'm assuming you enabled checkpointing in
0.20.0 and that you didn't wipe the metadata directory or anything when
upgrading to 0.21.0?
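
To illustrate the kind of early check being discussed, here is a minimal
sketch in Python. It is not the Mesos implementation. The names `SlaveInfo`,
`changed_fields` and `check_recovery`, the field set, and the error message
are all hypothetical. The idea is only that the checkpointed configuration is
compared with the current one, and recovery stops if anything differs
(including the port).

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SlaveInfo:
    """Hypothetical subset of the configuration a slave would checkpoint."""
    hostname: str
    port: int
    resources: str


def changed_fields(checkpointed: SlaveInfo, current: SlaveInfo) -> list:
    """Return the sorted names of fields that differ between the two configs."""
    old, new = asdict(checkpointed), asdict(current)
    return sorted(name for name in old if old[name] != new[name])


def check_recovery(checkpointed: SlaveInfo, current: SlaveInfo) -> None:
    """Raise early if the config changed, instead of re-registering."""
    changed = changed_fields(checkpointed, current)
    if changed:
        # Hypothetical message; a real slave would log its own wording.
        raise RuntimeError(
            "Config changed since last checkpoint: " + ", ".join(changed)
        )


if __name__ == "__main__":
    old = SlaveInfo("host1", 5050, "cpus:4;mem:8192")
    new = SlaveInfo("host1", 5051, "cpus:4;mem:8192")
    try:
        check_recovery(old, new)
    except RuntimeError as err:
        print(err)
```

With a check along these lines, the port change from 5050 to 5051 would be
reported during recovery, before any re-registration with the master.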

On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme <phili...@hopper.com>
wrote:

> Here you are:
>
> https://gist.github.com/plaflamme/9cd056dc959e0597fb1c
>
> You can see in the mesos-master.INFO log that it re-registers the slave
> using port :5050 (line 9) and fails the health checks on port :5051 (line
> 10). So it might be the slave that re-uses the old configuration?
>
> Thanks,
> Philippe
>
> On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone <vinodk...@gmail.com> wrote:
>
>> Can you paste some logs?
>>
>> On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme <phili...@hopper.com>
>> wrote:
>>
>>> Ok, that's reasonable, but I'm not sure why it would successfully
>>> re-register with the master if it's not supposed to in the first place. I
>>> think changing the resources (for example) will dump the old configuration
>>> in the logs and tell you why recovery is bailing out. It's not doing that
>>> in this case.
>>>
>>> It looks as though this fails only because the master can't ping the
>>> slave on the old port; the rest of the recovery process was otherwise
>>> successful.
>>>
>>> I'm not sure if the slave could have picked up its configuration change
>>> and failed the recovery early, but that would definitely be a better
>>> experience.
>>>
>>> Philippe
>>>
>>> On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone <vinodk...@gmail.com> wrote:
>>>
>>>> For slave recovery to work, the slave's config is expected not to change.
>>>>
>>>> On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme <phili...@hopper.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
>>>>> configured with checkpointing and with "reconnect" recovery.
>>>>>
>>>>> I was investigating why the slaves would successfully re-register with
>>>>> the master and recover, but would subsequently be asked to shut down
>>>>> ("health check timeout").
>>>>>
>>>>> It turns out that our slaves had been unintentionally configured to
>>>>> use port 5050 in the previous configuration. We decided to fix that during
>>>>> the upgrade and have them use the default 5051 port.
>>>>>
>>>>> This change seems to make the health checks fail and eventually kills
>>>>> the slave due to inactivity.
>>>>>
>>>>> I've confirmed that leaving the port as it was in the previous
>>>>> configuration lets the slave re-register successfully, and it is not
>>>>> asked to shut down later on.
>>>>>
>>>>> Is this a known issue? I haven't been able to find a JIRA ticket for
>>>>> this. Maybe it's the expected behaviour? Should I create a ticket?
>>>>>
>>>>> Thanks,
>>>>> Philippe
>>>>>
>>>>
>>>>
>>>
>>
>
