No more system_limit crashes, but now I'm seeing things like this:

2013-07-20 08:23:10 UTC =CRASH REPORT====
  crasher:
    initial call: mochiweb_acceptor:init/3
    pid: <0.232.0>
    registered_name: []
    exception error: 
{function_clause,[{webmachine_request,peer_from_peername,[{error,enotconn},{webmachine_request,{wm_reqstate,#Port<0.6191>,{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},undefined,undefined,{wm_reqdata,'GET',http,{1,1},"defined_in_wm_req_srv_init",defined_on_call,defined_in_load_dispatch_data,"/riak/config/config","/riak/config/config",{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},defined_in_load_dispatch_data,"defined_in_load_dispatch_data",500,1073741824,67108864,[],[],{2,{"connection",{'Connection',"Close"},nil,{"host",{'Host',"10.46.109.204"},nil,nil}}},not_fetched_yet,false,{0,nil},<<>>,undefined,undefined,[]},undefined,undefined,undefined}}],[{file,"src/webmachine_request.erl"},{line,133}]},{webmachine_request,get_peer,1,[{file,"src/webmachine_request.erl"},{line,124}]},{webmachine,new_request,2,[{file,"src/webmachine.erl"},{line,59}]},{webmachine_mochiweb,loop,1,[{file,"src/webmachine_mochiweb.erl"},{line,75}]},{mochiweb_http,parse_headers,5,[{file,"src/mochiweb_http.erl"},{line,180}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
    ancestors: ['http_0.0.0.0:8098',riak_core_sup,<0.128.0>]
    messages: []
    links: [<0.230.0>,#Port<0.6191>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 987
    stack_size: 24
    reductions: 918
  neighbours:


and this is also weird:

2013-07-20 08:52:35.388 UTC [warning] <0.22537.120>@riak_kv_stat:update:87 
error:badarg updating stat 
{read_repairs,[{1404411729622664522961353393938303214200622678016,notfound}],[{{1398702738851840683437120250060505233655091691520,'riak@10.46.109.206'},primary},{{1404411729622664522961353393938303214200622678016,'riak@10.47.109.209'},primary},{{1410120720393488362485586537816101194746153664512,'riak@10.47.109.202'},primary}]}.


What does it mean? A badarg inside a read repair?

Should I try to disable AAE for some time to see if the whole cluster cools 
down?
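
If I do, my reading of the 1.4 docs is that it's a switch in the
riak_kv section of app.config plus a rolling restart. A sketch (so
please correct me if I got the setting name wrong):

    %% app.config, riak_kv section: temporarily disable AAE
    {riak_kv, [
        {anti_entropy, {off, []}}   %% was {on, []}
    ]}

I'd then watch riak-admin aae-status and the handoff situation before
turning it back on.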

Cheers
Simon


On Fri, 19 Jul 2013 16:44:44 +0200
Simon Effenberg <seffenb...@team.mobile.de> wrote:

> I'm again getting crash reports about system_limit:
> 
> 2013-07-19 14:30:58 UTC =CRASH REPORT====
>   crasher:
>     initial call: riak_kv_exchange_fsm:init/1
>     pid: <0.25883.24>
>     registered_name: []
>     exception exit: 
> {{{system_limit,[{erlang,spawn,[riak_kv_get_put_monitor,spawned,[gets,<0.11717.31>]],[]},{riak_kv_get_put_monitor,get_fsm_spawned,1,[{file,"src/riak_kv_get_put_monitor.erl"},{line,53}]},{riak_kv_get_fsm,init,1,[{file,"src/riak_kv_get_fsm.erl"},{line,135}]},{gen_fsm,init_it,6,[{file,"gen_fsm.erl"},{line,361}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]},{gen_server,call,[<0.1187.0>,{compare,{856348615623575928634971581669697081829647974400,3},#Fun<riak_kv_exchange_fsm.0.49629222>,#Fun<riak_kv_exchange_fsm.1.49629222>},infinity]}},[{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,611}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
>     ancestors: [riak_kv_entropy_manager,riak_kv_sup,<0.569.0>]
>     messages: 
> [{'DOWN',#Ref<0.0.26.196075>,process,<0.1187.0>,{system_limit,[{erlang,spawn,[riak_kv_get_put_monitor,spawned,[gets,<0.11717.31>]],[]},{riak_kv_get_put_monitor,get_fsm_spawned,1,[{file,"src/riak_kv_get_put_monitor.erl"},{line,53}]},{riak_kv_get_fsm,init,1,[{file,"src/riak_kv_get_fsm.erl"},{line,135}]},{gen_fsm,init_it,6,[{file,"gen_fsm.erl"},{line,361}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}]
>     links: []
>     dictionary: []
>     trap_exit: false
>     status: running
>     heap_size: 1597
>     stack_size: 24
>     reductions: 380
>   neighbours:
> 
> I'm now trying to increase the Erlang process limit, but "system_limit" always 
> sounds like a "system" (OS) limit rather than an "Erlang" limit?!
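> A quick test in a scratch Erlang shell (a sketch, definitely not for
> the production node) suggests it really is the VM's own process table
> filling up, though:
> 
>     1> erlang:system_info(process_limit).
>     32768
>     2> [spawn(fun() -> receive after infinity -> ok end end)
>            || _ <- lists:seq(1, 40000)].
>     ** exception error: a system limit has been reached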
> 
> These are the limits for the process:
> 
> [root@kriak47-9:/var/log/riak]# cat /proc/17313/limits
> Limit                     Soft Limit           Hard Limit           Units     
> Max cpu time              unlimited            unlimited            seconds   
> Max file size             unlimited            unlimited            bytes     
> Max data size             unlimited            unlimited            bytes     
> Max stack size            8388608              unlimited            bytes     
> Max core file size        0                    unlimited            bytes     
> Max resident set          unlimited            unlimited            bytes     
> Max processes             unlimited            unlimited            processes 
> Max open files            30000                30000                files     
> Max locked memory         65536                65536                bytes     
> Max address space         unlimited            unlimited            bytes     
> Max file locks            unlimited            unlimited            locks     
> Max pending signals       16382                16382                signals   
> Max msgqueue size         819200               819200               bytes     
> Max nice priority         0                    0                    
> Max realtime priority     0                    0                    
> Max realtime timeout      unlimited            unlimited            us        
> 
> 
> 
> On Fri, 19 Jul 2013 16:08:44 +0200
> Simon Effenberg <seffenb...@team.mobile.de> wrote:
> 
> > only after restarting the Riak instance on this node were the awaiting
> > handoffs processed.. this is weird :(
> > 
> > On Fri, 19 Jul 2013 15:55:43 +0200
> > Simon Effenberg <seffenb...@team.mobile.de> wrote:
> > 
> > > It looked good for some hours, but now we again got 
> > > 
> > > 2013-07-19 13:27:07.800 UTC [error] 
> > > <0.18747.29>@riak_core_handoff_sender:start_fold:216 hinted_handoff 
> > > transfer of riak_kv_vnode from 'riak@10.46.109.207' 
> > > 1136089163393944065322395631681798128560666312704 to 'riak@10.47.109.202' 
> > > 1136089163393944065322395631681798128560666312704 failed because of TCP 
> > > recv timeout
> > > 
> > > and on the destination host I see:
> > > 
> > > 
> > > 2013-07-19 13:25:04.455 UTC [error] 
> > > <0.28632.25>@riak_core_handoff_receiver:handle_info:80 Handoff receiver 
> > > for partition 1136089163393944065322395631681798128560666312704 exited 
> > > abnormally after processing 2 objects: 
> > > {timeout,{gen_fsm,sync_send_all_state_event,[<0.1107.0>,{handoff_data,<<141,146,205,110,211,64,20,133,237,4,211,132,2,170,80,69,37,150,22,203,186,216,249,105,210,172,42,149,95,137,162,2,5,177,129,232,120,102,156,153,137,61,78,237,113,72,10,172,186,101,195,51,176,224,1,120,12,158,130,55,97,198,173,68,83,177,192,35,223,197,55,231,156,185,158,235,27,155,36,87,115,86,148,208,34,87,227,146,145,130,233,242,206,173,46,153,204,59,60,18,125,61,91,208,123,223,188,51,190,70,157,86,49,206,99,201,136,206,28,199,249,167,209,110,172,122,83,67,92,222,164,78,187,24,27,135,102,74,243,54,117,174,81,65,52,60,108,152,213,194,17,66,190,33,175,60,220,189,204,108,78,195,150,117,123,198,205,139,168,64,47,103,12,26,12,11,83,31,96,134,20,128,128,170,245,91,86,186,254,46,120,37,48,13,222,30,99,130,1,158,152,213,67,132,199,168,240,26,7,72,12,123,134,23,198,25,154,247,33,30,225,16,18,39,56,56,63,210,173,139,205,241,132,162,108,33,175,226,205,139,248,231,40,117,112,152,83,145,8,70,121,51,54,134,15,177,211,252,252,59,118,218,223,127,94,114,93,183,174,53,194,81,148,76,227,13,142,77,43,1,134,82,90,254,227,147,111,238,212,31,69,219,126,44,168,63,242,211,124,206,210,101,86,149,130,116,250,251,147,12,34,221,33,121,230,111,251,101,189,207,243,100,63,143,89,161,4,83,59,148,25,30,151,6,79,39,162,43,62,46,79,213,105,181,103,181,150,173,140,197,64,208,58,33,234,134,123,195,97,212,11,13,210,70,23,117,7,189,78,103,216,31,12,118,67,211,6,169,69,187,211,98,113,50,226,18,75,213,77,184,255,229,252,115,120,195,246,220,58,186,251,244,236,101,182,117,159,55,224,42,207,193,215,247,191,110,203,191,67,118,255,127,200,114,229,122,169,227,145,148,65,153,32,93,84,76,74,243,19,85,102,8,137,80,140,254,1>>},60000]}}
> > > 
> > > So both sides show a timeout. How can I track this down?
> > > 
> > > - Could this happen when many read repairs occur (through AAE)?
> > > 
> > > Also, our FSM PUT time is going up, but not really the GET time.. is 
> > > this normal behavior under load/read-repair conditions?
> > > 
> > > Also, is this a bigger problem with eLevelDB, or would it be the same 
> > > with Bitcask?
> > > 
> > > Cheers
> > > Simon
> > > 
> > > 
> > > On Fri, 19 Jul 2013 10:25:05 +0200
> > > Simon Effenberg <seffenb...@team.mobile.de> wrote:
> > > 
> > > > once again with the list included... argh
> > > > 
> > > > Hey Christian,
> > > > 
> > > > so it could also be an Erlang limit? I found out why my Riak instances
> > > > all have different process limits: my mcollectived daemons have
> > > > different limits, and when I triggered a Puppet run through
> > > > mcollective, the Riak nodes inherited those limits as well.
> > > > 
> > > > Also in the crash log I see:
> > > > 
> > > > exception exit: {{system_limit,[{erlang,spawn
> > > > 
> > > > for the "too many processes" case. So it doesn't look like an Erlang
> > > > limit, does it? But I will keep this +P in mind!! Thanks a lot.
> > > > 
> > > > The zdbbl is now at 100MB.
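> > > > (That is, in vm.args. If I got the units right, +zdbbl is specified
> > > > in kilobytes, so 100MB would be:
> > > > 
> > > >     ## sketch: distribution buffer busy limit, in KB (= 100MB)
> > > >     +zdbbl 102400
> > > > 
> > > > Please shout if that's wrong.)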
> > > > 
> > > > Cheers
> > > > Simon
> > > > 
> > > > On Fri, 19 Jul 2013 08:49:50 +0100
> > > > Christian Dahlqvist <christ...@basho.com> wrote:
> > > > 
> > > > > Hi Simon,
> > > > > 
> > > > > If you have objects that can be as big as 15MB, it is probably wise 
> > > > > to increase the size of +zdbbl in order to avoid filling up buffers 
> > > > > when these large objects need to be transferred between nodes. What 
> > > > > an appropriate level is depends a lot on the size distribution of 
> > > > > your data and your access patterns, so I would recommend benchmarking 
> > > > > to find a suitable value.
> > > > > 
> > > > > Erlang also has a default process limit of 32768 (at least in 
> > > > > R15B01), which may be what you are hitting. You can override this to 
> > > > > 256k by adding the following line to the vm.args file:
> > > > > 
> > > > >     +P 262144
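> > > > > 
> > > > > After a restart you can verify that the new limit took effect by
> > > > > attaching to the node, for example (detach with Ctrl-D; q() would
> > > > > stop the node):
> > > > > 
> > > > >     $ riak attach
> > > > >     1> erlang:system_info(process_limit).
> > > > >     262144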
> > > > > 
> > > > > Best regards,
> > > > > 
> > > > > Christian
> > > > > 
> > > > > 
> > > > > 
> > > > > On 19 Jul 2013, at 08:24, Simon Effenberg <seffenb...@team.mobile.de> 
> > > > > wrote:
> > > > > 
> > > > > > The +zdbbl parameter helped a lot, but the hinted handoffs didn't
> > > > > > disappear completely. I no longer see busy dist port errors in the
> > > > > > _console.log_ (why aren't they in the error.log? this looks to me 
> > > > > > like a serious problem.. at least our cluster was not behaving 
> > > > > > that nicely).
> > > > > > 
> > > > > > I'll try to increase the buffer size further, because my suspicion 
> > > > > > is that the objects sent from one node to another are also stored 
> > > > > > therein, and we sometimes have objects of up to 15MB.
> > > > > > 
> > > > > > But in the last 6 hours I also saw some crashes, on only two 
> > > > > > machines, complaining about too many processes:
> > > > > > 
> > > > > > ----------------
> > > > > > console.log
> > > > > > 2013-07-19 02:04:21.962 UTC [error] <0.12813.29> CRASH REPORT 
> > > > > > Process <0.12813.29> with 15 neighbours exited with reason: 
> > > > > > {system_limit
> > > > > > 
> > > > > > crash.log
> > > > > > 2013-07-19 02:04:21 UTC =ERROR REPORT====
> > > > > > Too many processes
> > > > > > ----------------
> > > > > > 
> > > > > > The process has a process limit of 95142. I will increase it now, 
> > > > > > but I never saw any mention of such problems on the Linux tuning 
> > > > > > page. Am I missing something?
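> > > > > > 
> > > > > > To see whether it's the OS or the VM that is complaining, I plan
> > > > > > to compare both limits (a sketch of what I'll run):
> > > > > > 
> > > > > >     # OS side: limits of the running beam.smp process
> > > > > >     grep processes /proc/$(pgrep -o beam.smp)/limits
> > > > > >     # Erlang side: via riak attach, then
> > > > > >     #   erlang:system_info(process_limit).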
> > > > > > 
> > > > > > Cheers
> > > > > > Simon
> > > > > > 
> > > > > > 
> > > > > > On Thu, 18 Jul 2013 19:34:18 +0100
> > > > > > Guido Medina <guido.med...@temetra.com> wrote:
> > > > > > 
> > > > > >> If what you are describing is happening on 1.4, run riak-admin 
> > > > > >> diag and check the newly recommended kernel parameters; also, in 
> > > > > >> vm.args, uncomment the +zdbbl 32768 parameter, since what you are 
> > > > > >> describing is similar to what happened to us when we upgraded 
> > > > > >> to 1.4.
> > > > > >> 
> > > > > >> HTH,
> > > > > >> 
> > > > > >> Guido.
> > > > > >> 
> > > > > >> On 18/07/13 19:21, Simon Effenberg wrote:
> > > > > >>> Hi @list,
> > > > > >>> 
> > > > > >>> I sometimes see logs talking about "hinted_handoff transfer of .. 
> > > > > >>> failed because of TCP recv timeout".
> > > > > >>> Also, riak-admin transfers shows me many handoffs (is it possible 
> > > > > >>> to get some insight into "how many" handoffs have happened via 
> > > > > >>> "riak-admin status"?).
> > > > > >>> 
> > > > > >>> - Is it normal behavior to have up to 30 handoffs from/to 
> > > > > >>> different nodes?
> > > > > >>> - How can I get to the bottom of the TCP recv timeout? I'm not 
> > > > > >>> sure whether this is a network problem or whether the other node 
> > > > > >>> is too slow. The load on the machines is OK (some iowait, but not 
> > > > > >>> 100%). Maybe it's interference with AAE?
> > > > > >>> 
> > > > > >>> Here is the log information about the TCP recv timeout. That one 
> > > > > >>> is not that frequent, but handoffs happen really often:
> > > > > >>> 
> > > > > >>> 2013-07-18 16:22:05.654 UTC [error] 
> > > > > >>> <0.28933.14>@riak_core_handoff_sender:start_fold:216 
> > > > > >>> hinted_handoff transfer of riak_kv_vnode from 
> > > > > >>> 'riak@10.46.109.207' 
> > > > > >>> 1118962191081472546749696200048404186924073353216 to 
> > > > > >>> 'riak@10.46.109.205' 
> > > > > >>> 1118962191081472546749696200048404186924073353216 failed because 
> > > > > >>> of TCP recv timeout
> > > > > >>> 2013-07-18 16:22:05.673 UTC [error] 
> > > > > >>> <0.202.0>@riak_core_handoff_manager:handle_info:282 An outbound 
> > > > > >>> handoff of partition riak_kv_vnode 
> > > > > >>> 1118962191081472546749696200048404186924073353216 was terminated 
> > > > > >>> for reason: {shutdown,timeout}
> > > > > >>> 
> > > > > >>> 
> > > > > >>> Thanks in advance
> > > > > >>> Simon
> > > > > >>> 
-- 
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Fon:     + 49-(0)30-8109 - 7173
Fax:     + 49-(0)30-8109 - 7131

Mail:     seffenb...@team.mobile.de
Web:    www.mobile.de

Marktplatz 1 | 14532 Europarc Dreilinden | Germany


Geschäftsführer: Malte Krüger
HRB Nr.: 18517 P, Amtsgericht Potsdam
Sitz der Gesellschaft: Kleinmachnow 

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
