Riak 1.2.1 Crash During Rolling Upgrade to 1.3.1

Shane McEwan Tue, 11 Jun 2013 01:35:47 -0700

G'day!

I upgraded our production 4-node Riak cluster from 1.2.1 to 1.3.1 on the weekend. It didn't go as smoothly as expected.

After starting Riak on the first upgraded node, node01, I started getting error messages on two as yet unupgraded nodes, node02 and node03:

2013-06-08 21:22:50.596 [error] <0.149.0> gen_server riak_core_handoff_manager terminated with reason: no match of right hand value {error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},{supervisor,do_start_child_i,3,[{file,"supervisor.erl"},{line,319}]},{supervisor,handle_call,3,[{file,"supervisor.erl"},{line,344}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}} in riak_core_handoff_manager:receive_handoff/1 line 492

2013-06-08 21:22:50.604 [error] <0.149.0> CRASH REPORT Process riak_core_handoff_manager with 0 neighbours exited with reason: no match of right hand value {error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},{supervisor,do_start_child_i,3,[{file,"supervisor.erl"},{line,319}]},{supervisor,handle_call,3,[{file,"supervisor.erl"},{line,344}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}} in riak_core_handoff_manager:receive_handoff/1 line 492 in gen_server:terminate/6 line 747

2013-06-08 21:22:50.605 [error] <0.143.0> Supervisor riak_core_handoff_sup had child riak_core_handoff_manager started with riak_core_handoff_manager:start_link() at <0.149.0> exit with reason no match of right hand value {error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},{supervisor,do_start_child_i,3,[{file,"supervisor.erl"},{line,319}]},{supervisor,handle_call,3,[{file,"supervisor.erl"},{line,344}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}} in riak_core_handoff_manager:receive_handoff/1 line 492 in context child_terminated

2013-06-08 21:22:50.606 [error] <0.147.0> gen_server <0.147.0> terminated with reason: {{{badmatch,{error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},{supervisor,do_start_child_i,3,[{file,"supervisor.erl"},{line,319}]},{supervisor,handle_call,3,[{file,"supervisor.erl"},{line,344}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}}},[{riak_core_handoff_manager,receive_handoff,1,[{file,"src/riak_core_handoff_manager.erl"},{line,492}]},{riak_core_handoff_manager,handle_call,3,[{...},...]},...]},...}

2013-06-08 21:22:50.608 [error] <0.147.0> CRASH REPORT Process riak_core_handoff_listener with 1 neighbours exited with reason: {{{badmatch,{error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},{supervisor,do_start_child_i,3,[{file,"supervisor.erl"},{line,319}]},{supervisor,handle_call,3,[{file,"supervisor.erl"},{line,344}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}}},[{riak_core_handoff_manager,receive_handoff,1,[{file,"src/riak_core_handoff_manager.erl"},{line,492}]},{riak_core_handoff_manager,handle_call,3,[{...},...]},...]},...} in gen_server:terminate/6 line 747

Eventually, after 5 minutes, Riak on node02 and node03 crashed completely with:

2013-06-08 21:27:47.029 [error] <0.17586.989> Supervisor riak_core_handoff_listener_sup had child riak_core_handoff_listener started with riak_core_handoff_listener:start_link() at undefined exit with reason bad return value: {error,eaddrinuse} in context start_error

2013-06-08 21:27:47.030 [error] <0.17583.989> Supervisor riak_core_handoff_sup had child riak_core_handoff_listener_sup started with riak_core_handoff_listener_sup:start_link() at undefined exit with reason shutdown in context start_error

2013-06-08 21:27:47.031 [error] <0.139.0> Supervisor riak_core_sup had child riak_core_handoff_sup started with riak_core_handoff_sup:start_link() at {restarting,<0.17470.989>} exit with reason shutdown in context start_error

2013-06-08 21:27:47.032 [error] <0.17594.989> CRASH REPORT Process riak_core_handoff_listener with 1 neighbours exited with reason: bad return value: {error,eaddrinuse} in gen_server:init_it/6 line 332

2013-06-08 21:27:47.033 [error] <0.17593.989> Supervisor riak_core_handoff_listener_sup had child riak_core_handoff_listener started with riak_core_handoff_listener:start_link() at undefined exit with reason bad return value: {error,eaddrinuse} in context start_error

2013-06-08 21:27:47.034 [error] <0.17590.989> Supervisor riak_core_handoff_sup had child riak_core_handoff_listener_sup started with riak_core_handoff_listener_sup:start_link() at undefined exit with reason shutdown in context start_error

2013-06-08 21:27:47.034 [error] <0.139.0> Supervisor riak_core_sup had child riak_core_handoff_sup started with riak_core_handoff_sup:start_link() at {restarting,<0.17470.989>} exit with reason shutdown in context start_error

2013-06-08 21:27:47.035 [error] <0.139.0> Supervisor riak_core_sup had child riak_core_handoff_sup started with riak_core_handoff_sup:start_link() at {restarting,<0.17470.989>} exit with reason reached_max_restart_intensity in context shutdown


And on node01 I finally got some messages indicating something was wrong:

2013-06-08 21:27:43.849 [error] <0.9407.0>@riak_core_handoff_sender:start_fold:226 hinted_handoff transfer of riak_kv_vnode from 'riaknode01@10.1.1.13' 685078892498860742907977265335757665463718379520 to 'riaknode02@10.1.1.10' 685078892498860742907977265335757665463718379520 failed because of exit:{noproc,{riak_core_gen_server,call,[{riak_kv_handoff_listener,'riaknode02@10.1.1.10'},handoff_port,infinity]}} [{riak_core_gen_server,call,3,[{file,"src/riak_core_gen_server.erl"},{line,214}]},{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,84}]}]

The fourth node, node04, kept running fine. I assume this is because it doesn't have any of node01's vnode replicas on it, so it wasn't involved in any handoffs.

Anyway, I continued with the upgrades without any further incident, upgrading the crashed nodes next and, finally, node04.

Everything seems to be running fine. Thankfully we were in a maintenance window and I wasn't relying on the rolling upgrade capability to ensure service continuity. But should I be worried that something might be messed up because of the crash? Or that something is messed up that caused the crash?


I have crash dumps if they're of any use.

Thanks!

Shane.

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
