Can you paste some logs?

On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme <phili...@hopper.com> wrote:
> Ok, that's reasonable, but I'm not sure why it would successfully
> re-register with the master if it's not supposed to in the first place. I
> think changing the resources (for example) will dump the old configuration
> in the logs and tell you why recovery is bailing out. It's not doing that
> in this case.
>
> It looks as though this fails only because the master can't ping the
> slave on the old port, because the whole recovery process was successful
> otherwise.
>
> I'm not sure if the slave could have picked up its configuration change
> and failed the recovery early, but that would definitely be a better
> experience.
>
> Philippe
>
> On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone <vinodk...@gmail.com> wrote:
>
>> For slave recovery to work, it is expected to not change its config.
>>
>> On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme <phili...@hopper.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
>>> configured with checkpointing and with "reconnect" recovery.
>>>
>>> I was investigating why the slaves would successfully re-register with
>>> the master and recover, but would subsequently be asked to shut down
>>> ("health check timeout").
>>>
>>> It turns out that our slaves had been unintentionally configured to use
>>> port 5050 in the previous configuration. We decided to fix that during the
>>> upgrade and have them use the default 5051 port.
>>>
>>> This change seems to make the health checks fail, and the slave is
>>> eventually killed due to inactivity.
>>>
>>> I've confirmed that leaving the port as it was in the previous
>>> configuration lets the slave re-register successfully, and it is not
>>> asked to shut down later on.
>>>
>>> Is this a known issue? I haven't been able to find a JIRA ticket for
>>> this. Maybe it's the expected behaviour? Should I create a ticket?
>>>
>>> Thanks,
>>> Philippe