G'day!

I upgraded our production 4-node Riak cluster from 1.2.1 to 1.3.1 on the weekend. It didn't go as smoothly as expected.

After starting Riak on the first upgraded node, node01, I started getting error messages on two nodes that hadn't been upgraded yet, node02 and node03:

2013-06-08 21:22:50.596 [error] <0.149.0> gen_server riak_core_handoff_manager terminated with reason: no match of right hand value {error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},{supervisor,do_start_child_i,3,[{file,"supervisor.erl"},{line,319}]},{supervisor,handle_call,3,[{file,"supervisor.erl"},{line,344}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}} in riak_core_handoff_manager:receive_handoff/1 line 492
2013-06-08 21:22:50.604 [error] <0.149.0> CRASH REPORT Process riak_core_handoff_manager with 0 neighbours exited with reason: no match of right hand value {error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},{supervisor,do_start_child_i,3,[{file,"supervisor.erl"},{line,319}]},{supervisor,handle_call,3,[{file,"supervisor.erl"},{line,344}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}} in riak_core_handoff_manager:receive_handoff/1 line 492 in gen_server:terminate/6 line 747
2013-06-08 21:22:50.605 [error] <0.143.0> Supervisor riak_core_handoff_sup had child riak_core_handoff_manager started with riak_core_handoff_manager:start_link() at <0.149.0> exit with reason no match of right hand value {error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},{supervisor,do_start_child_i,3,[{file,"supervisor.erl"},{line,319}]},{supervisor,handle_call,3,[{file,"supervisor.erl"},{line,344}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}} in riak_core_handoff_manager:receive_handoff/1 line 492 in context child_terminated
2013-06-08 21:22:50.606 [error] <0.147.0> gen_server <0.147.0> terminated with reason: {{{badmatch,{error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},{supervisor,do_start_child_i,3,[{file,"supervisor.erl"},{line,319}]},{supervisor,handle_call,3,[{file,"supervisor.erl"},{line,344}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}}},[{riak_core_handoff_manager,receive_handoff,1,[{file,"src/riak_core_handoff_manager.erl"},{line,492}]},{riak_core_handoff_manager,handle_call,3,[{...},...]},...]},...}
2013-06-08 21:22:50.608 [error] <0.147.0> CRASH REPORT Process riak_core_handoff_listener with 1 neighbours exited with reason: {{{badmatch,{error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},{supervisor,do_start_child_i,3,[{file,"supervisor.erl"},{line,319}]},{supervisor,handle_call,3,[{file,"supervisor.erl"},{line,344}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}}},[{riak_core_handoff_manager,receive_handoff,1,[{file,"src/riak_core_handoff_manager.erl"},{line,492}]},{riak_core_handoff_manager,handle_call,3,[{...},...]},...]},...} in gen_server:terminate/6 line 747

About five minutes later, Riak on node02 and node03 crashed completely with:

2013-06-08 21:27:47.029 [error] <0.17586.989> Supervisor riak_core_handoff_listener_sup had child riak_core_handoff_listener started with riak_core_handoff_listener:start_link() at undefined exit with reason bad return value: {error,eaddrinuse} in context start_error
2013-06-08 21:27:47.030 [error] <0.17583.989> Supervisor riak_core_handoff_sup had child riak_core_handoff_listener_sup started with riak_core_handoff_listener_sup:start_link() at undefined exit with reason shutdown in context start_error
2013-06-08 21:27:47.031 [error] <0.139.0> Supervisor riak_core_sup had child riak_core_handoff_sup started with riak_core_handoff_sup:start_link() at {restarting,<0.17470.989>} exit with reason shutdown in context start_error
2013-06-08 21:27:47.032 [error] <0.17594.989> CRASH REPORT Process riak_core_handoff_listener with 1 neighbours exited with reason: bad return value: {error,eaddrinuse} in gen_server:init_it/6 line 332
2013-06-08 21:27:47.033 [error] <0.17593.989> Supervisor riak_core_handoff_listener_sup had child riak_core_handoff_listener started with riak_core_handoff_listener:start_link() at undefined exit with reason bad return value: {error,eaddrinuse} in context start_error
2013-06-08 21:27:47.034 [error] <0.17590.989> Supervisor riak_core_handoff_sup had child riak_core_handoff_listener_sup started with riak_core_handoff_listener_sup:start_link() at undefined exit with reason shutdown in context start_error
2013-06-08 21:27:47.034 [error] <0.139.0> Supervisor riak_core_sup had child riak_core_handoff_sup started with riak_core_handoff_sup:start_link() at {restarting,<0.17470.989>} exit with reason shutdown in context start_error
2013-06-08 21:27:47.035 [error] <0.139.0> Supervisor riak_core_sup had child riak_core_handoff_sup started with riak_core_handoff_sup:start_link() at {restarting,<0.17470.989>} exit with reason reached_max_restart_intensity in context shutdown
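
A note on the {error,eaddrinuse} in those entries: it corresponds to the POSIX EADDRINUSE error, which a bind fails with when the address and port are already held by another socket. One possible reading, and it is only a guess, is that the handoff port was still held when the supervisor tried to restart the listener. The Python sketch below is not Riak code; it just reproduces the same OS-level error on the loopback interface, and the function name is made up for illustration.

```python
import errno
import socket


def bind_same_port_twice():
    """Bind and listen on a port, then try to bind it again.

    Returns the errno of the second bind, or None if it succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))  # same address and port again
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    err = bind_same_port_twice()
    print(errno.errorcode.get(err, "second bind unexpectedly succeeded"))
```

If that reading is right, checking what is holding the handoff port on the affected node would be a sensible next step before restarting Riak.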

And on node01 I finally got some messages indicating something was wrong:

2013-06-08 21:27:43.849 [error] <0.9407.0>@riak_core_handoff_sender:start_fold:226 hinted_handoff transfer of riak_kv_vnode from 'riaknode01@10.1.1.13' 685078892498860742907977265335757665463718379520 to 'riaknode02@10.1.1.10' 685078892498860742907977265335757665463718379520 failed because of exit:{noproc,{riak_core_gen_server,call,[{riak_kv_handoff_listener,'riaknode02@10.1.1.10'},handoff_port,infinity]}} [{riak_core_gen_server,call,3,[{file,"src/riak_core_gen_server.erl"},{line,214}]},{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,84}]}]
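
To work out which partition that long number refers to: Riak divides the 2^160 SHA-1 hash space evenly among the partitions, so a partition index is a multiple of 2^160 / ring_size. The ring size of this cluster isn't stated anywhere above, so the sketch below tries a few values as assumptions (64 being the default); the helper name is hypothetical.

```python
RING_TOP = 2 ** 160  # size of the SHA-1 hash space the ring is divided over


def partition_number(index, ring_size):
    """Return k if index == k * (RING_TOP // ring_size), otherwise None."""
    step = RING_TOP // ring_size
    k, remainder = divmod(index, step)
    return k if remainder == 0 and k < ring_size else None


if __name__ == "__main__":
    # Partition index copied from the handoff error above.
    index = 685078892498860742907977265335757665463718379520
    # Candidate ring sizes are assumptions; the real value isn't given.
    for ring_size in (64, 128, 256):
        print(ring_size, partition_number(index, ring_size))
```

With a ring size of 64 this index comes out as partition 30, but the index alone can't tell you the ring size, since the same value is also a valid index for larger power-of-two rings.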

The fourth node, node04, kept running fine. I assume this is because it doesn't hold any of node01's vnode replicas, so it wasn't involved in any handoffs.

Anyway, I continued with the upgrades without further incident, upgrading the crashed nodes next and node04 last.

Everything seems to be running fine. Thankfully we were in a maintenance window and I wasn't relying on the rolling upgrade capability to ensure service continuity. But should I be worried that something might be messed up because of the crash? Or that something is messed up that caused the crash?

I have crash dumps if they're of any use.

Thanks!

Shane.

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
