Hi, all. We are in the testing phase of an application subsystem that uses Riak 2.1.1 and a bucket type with consistency enabled and n_val=5 (per the Basho docs).
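For concreteness, a strongly consistent bucket type of that shape would typically be created and activated along these lines (illustrative only; the type name "strong" is a placeholder, and strong consistency also has to be enabled in riak.conf):

```
riak-admin bucket-type create strong '{"props":{"consistent":true,"n_val":5}}'
riak-admin bucket-type activate strong
riak-admin bucket-type status strong
```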
Client side is .NET, using the Basho .NET client. As the testing volume has risen, we are encountering get failures that seem to have a communication-related flavor. Things like:

"CommunicationError: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond"

or

"CommunicationError: Riak returned an error. Code '0'. Message timeout"

or even

"ClusterOffline: unable to access functioning Riak node".

The cluster itself is 5 nodes, 2.1.1, CentOS 6, fronted with HAProxy. The HAProxy stats with respect to retries, redispatches, errors, and warnings are completely clean. There are actually 2 HAProxy nodes polling the Riak nodes, set up with a keepalived (VRRP) VIP, but there have been no failovers.

The only things in any of my crash logs are entries like this:

  2015-07-16 01:13:59 =CRASH REPORT====
    crasher:
      initial call: mochiweb_acceptor:init/3
      pid: <0.4628.6768>
      registered_name: []
      exception error: {function_clause,
        [{webmachine_request,peer_from_peername,
          [{error,enotconn},
           {webmachine_request,
            {wm_reqstate,#Port<0.12171987>,[],undefined,undefined,undefined,
             {wm_reqdata,'GET',http,{1,0},
              "defined_in_wm_req_srv_init","defined_in_wm_req_srv_init",
              defined_on_call,defined_in_load_dispatch_data,
              "/ping","/ping",[],
              defined_in_load_dispatch_data,"defined_in_load_dispatch_data",
              500,1073741824,67108864,[],[],{0,nil},not_fetched_yet,false,
              {0,nil},<<>>,follow_request,undefined,undefined,[]},
             undefined,undefined,undefined}}],
          [{file,"src/webmachine_request.erl"},{line,150}]},
         {webmachine_request,get_peer,1,
          [{file,"src/webmachine_request.erl"},{line,124}]},
         {webmachine,new_request,2,[{file,"src/webmachine.erl"},{line,69}]},
         {webmachine_mochiweb,loop,2,
          [{file,"src/webmachine_mochiweb.erl"},{line,49}]},
         {mochiweb_http,headers,5,[{file,"src/mochiweb_http.erl"},{line,96}]},
         {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
      ancestors: ['http://0.0.0.0:8098_mochiweb',riak_api_sup,<0.368.0>]
      messages: []
      links: [<0.374.0>,#Port<0.12171987>]
      dictionary: []
      trap_exit: false
      status: running
      heap_size: 987
      stack_size: 27
      reductions: 1249
      neighbours:

There are a number of these over time, but all of the application's interaction with the cluster is via PB, not REST. The client side is running inside an ASP.NET application pool under IIS 7.5. All of the Windows machines have had the TCP parameters TcpTimedWaitDelay and MaxUserPort changed (to 30 or 60, and 65534, respectively) to support the connection rate that the .NET client generates.

Prior to going to consistent buckets, we did not see these failures. We've implemented put retry logic to work around similar, transient failures on the write side, which are more understandable given the way the consistency subsystem operates. We were not expecting to have to implement this type of retry on the get side.

Any recommendations from the group as to how to start unraveling this one? A number of the status-related riak-admin commands do not function once consistency has been used, which has limited my ability to peer too closely into the nodes themselves. The volume here is likely trivial compared to most of the use cases I'm reading about, but the data model is relatively complex, and the traffic rate and object sizes are quite variable.

Thank you for any insight.

Gregg Siegfried
Language Logic, LLC

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
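P.S. One thing I notice in the crash reports: they are all webmachine handling GET /ping on the HTTP port, even though our traffic is PB. If the HAProxy health checks are HTTP checks against /ping, the {error,enotconn} crashes may just be health-check connections being closed before webmachine can read the peer address — noise, possibly unrelated to the get failures. A TCP-mode connect check against the PB port would sidestep that; a rough sketch, in which every name, address, and timeout is a placeholder rather than our actual config:

```
# Sketch of a TCP-mode HAProxy section for Riak protocol buffers traffic.
# Server names, addresses, and timeouts below are placeholders.
listen riak_pb
    bind *:8087
    mode tcp
    balance leastconn
    option tcp-check           # plain TCP connect check, no GET /ping
    timeout client 60s         # keep >= the client-side request timeout
    timeout server 60s
    server riak1 10.0.1.1:8087 check inter 2s
    server riak2 10.0.1.2:8087 check inter 2s
```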
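For completeness, the Windows TCP changes mentioned above are the standard registry values under Tcpip\Parameters; roughly (these are the values we chose, not general recommendations, and they take effect only after a reboot):

```
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" ^
    /v TcpTimedWaitDelay /t REG_DWORD /d 30 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" ^
    /v MaxUserPort /t REG_DWORD /d 65534 /f
```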
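In case it helps anyone else, the shape of the retry logic we use on the put side (and are now weighing for gets) is just bounded retry with exponential backoff and jitter. A minimal sketch in Python rather than .NET, with a made-up exception type and timing values standing in for the client's CommunicationError/timeout errors:

```python
import random
import time

class TransientRiakError(Exception):
    """Stand-in for the client's transient CommunicationError / timeout errors."""

def get_with_retry(fetch, attempts=3, base_delay=0.05):
    """Call fetch() until it succeeds, retrying transient failures with
    exponential backoff plus jitter. Re-raises after the final attempt."""
    for attempt in range(attempts):
        try:
            return fetch()
        except TransientRiakError:
            if attempt == attempts - 1:
                raise
            # 0.05s, 0.1s, ... doubled each attempt, with up to 2x jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Example: a fetch that fails twice, then succeeds on the third call.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientRiakError("timeout")
    return "value"

print(get_with_retry(flaky_fetch))  # prints "value"
print(calls["n"])                   # prints 3 (two retries were needed)
```

At least on the read side this is safe to apply blindly, since a repeated get has no side effects; the open question for us is why it is needed at all.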