Awesome! We've reverted to the previous port and all our slaves have recovered nicely.
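For anyone else who hits this before 0.23.0: as I understand it, the recovery compatibility check Vinod linked below compares the checkpointed SlaveInfo against the current one, but prior to the fix it ignored the port. Here is a rough, self-contained sketch of the idea (SlaveInfoSketch and the example hostname/resources are made up for illustration; this is not the actual Mesos type or the actual patch):

// Simplified stand-in for mesos::SlaveInfo -- only the fields relevant
// to the recovery compatibility check discussed in this thread.
#include <cstdint>
#include <iostream>
#include <string>

struct SlaveInfoSketch {
  std::string hostname;
  uint16_t port;          // the field that was not compared before 0.23.0
  std::string resources;  // flattened to a string for the sketch
  std::string attributes; // flattened to a string for the sketch
  bool checkpoint;
};

// Pre-0.23.0 behaviour (roughly): the port is not part of the comparison,
// so a slave restarted on a new port still looks "compatible" and
// recovery proceeds.
bool compatibleOld(const SlaveInfoSketch& previous, const SlaveInfoSketch& current) {
  return previous.hostname == current.hostname &&
         previous.resources == current.resources &&
         previous.attributes == current.attributes &&
         previous.checkpoint == current.checkpoint;
}

// Post-fix behaviour (roughly): a port change makes the checkpointed info
// incompatible, so the slave bails out of recovery early instead of
// re-registering and only failing later.
bool compatibleNew(const SlaveInfoSketch& previous, const SlaveInfoSketch& current) {
  return compatibleOld(previous, current) && previous.port == current.port;
}

int main() {
  SlaveInfoSketch before{"slave1.example.com", 5050, "cpus:8;mem:16384", "rack:a", true};
  SlaveInfoSketch after = before;
  after.port = 5051;  // the only change we made during the upgrade

  std::cout << "old check: " << compatibleOld(before, after) << "\n";  // 1 (compatible)
  std::cout << "new check: " << compatibleNew(before, after) << "\n";  // 0 (incompatible)
  return 0;
}

With the old comparison the restarted slave still looks compatible, recovery proceeds, and the mismatch only surfaces later when the health checks fail; with the port included in the check, the slave would bail out of recovery right away, which is the behaviour we were hoping for.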
Thanks for looking into this,
Philippe

On Fri, Jul 3, 2015 at 3:27 PM, Vinod Kone <vinodk...@gmail.com> wrote:

> Looks like this is due to a bug in versions < 0.23.0, where slave recovery
> didn't check for changes in 'port' when considering compatibility
> <https://github.com/apache/mesos/blob/0.21.0/src/common/type_utils.cpp#L137>.
> It has since been fixed in the upcoming 0.23.0 release.
>
> On Thu, Jul 2, 2015 at 8:45 PM, Philippe Laflamme <phili...@hopper.com> wrote:
>
>> Checkpointing has been enabled since 0.18 on these slaves. The only other
>> setting that changed during the upgrade was that we added --gc_delay=1days.
>> Otherwise, it's an in-place upgrade without any changes to the work
>> directory...
>>
>> Philippe
>>
>> On Thu, Jul 2, 2015 at 8:59 PM, Vinod Kone <vinodk...@gmail.com> wrote:
>>
>>> It is surprising that the slave didn't bail out during the initial phase
>>> of recovery when the port changed. I'm assuming you enabled checkpointing
>>> in 0.20.0 and that you didn't wipe the meta data directory or anything
>>> when upgrading to 0.21.0?
>>>
>>> On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme <phili...@hopper.com> wrote:
>>>
>>>> Here you are:
>>>>
>>>> https://gist.github.com/plaflamme/9cd056dc959e0597fb1c
>>>>
>>>> You can see in the mesos-master.INFO log that it re-registers the slave
>>>> using port :5050 (line 9) and fails the health checks on port :5051
>>>> (line 10). So it might be the slave that re-uses the old configuration?
>>>>
>>>> Thanks,
>>>> Philippe
>>>>
>>>> On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone <vinodk...@gmail.com> wrote:
>>>>
>>>>> Can you paste some logs?
>>>>>
>>>>> On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme <phili...@hopper.com> wrote:
>>>>>
>>>>>> Ok, that's reasonable, but I'm not sure why it would successfully
>>>>>> re-register with the master if it's not supposed to in the first place.
>>>>>> I think changing the resources (for example) will dump the old
>>>>>> configuration in the logs and tell you why recovery is bailing out.
>>>>>> It's not doing that in this case.
>>>>>>
>>>>>> It looks as though this doesn't work only because the master can't
>>>>>> ping the slave on the old port; the whole recovery process was
>>>>>> successful otherwise.
>>>>>>
>>>>>> I'm not sure if the slave could have picked up its configuration
>>>>>> change and failed the recovery early, but that would definitely be a
>>>>>> better experience.
>>>>>>
>>>>>> Philippe
>>>>>>
>>>>>> On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone <vinodk...@gmail.com> wrote:
>>>>>>
>>>>>>> For slave recovery to work, the slave is expected not to change its config.
>>>>>>>
>>>>>>> On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme <phili...@hopper.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
>>>>>>>> configured with checkpointing and with "reconnect" recovery.
>>>>>>>>
>>>>>>>> I was investigating why the slaves would successfully re-register
>>>>>>>> with the master and recover, but would subsequently be asked to shut
>>>>>>>> down ("health check timeout").
>>>>>>>>
>>>>>>>> It turns out that our slaves had been unintentionally configured to
>>>>>>>> use port 5050 in the previous configuration. We decided to fix that
>>>>>>>> during the upgrade and have them use the default 5051 port.
>>>>>>>>
>>>>>>>> This change seems to make the health checks fail and eventually gets
>>>>>>>> the slave killed due to inactivity.
>>>>>>>>
>>>>>>>> I've confirmed that leaving the port at what it was in the previous
>>>>>>>> configuration makes the slave successfully re-register without being
>>>>>>>> asked to shut down later on.
>>>>>>>>
>>>>>>>> Is this a known issue? I haven't been able to find a JIRA ticket for
>>>>>>>> this. Maybe it's the expected behaviour? Should I create a ticket?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Philippe