Hi, all.

We are in the testing phase of an application subsystem that uses Riak
2.1.1 and a bucket type with consistent=true and n_val=5, per the Basho docs.
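
For reference, the type was created along these lines (the type name
"strong" is just a placeholder here; the props are the ones from the
strong-consistency docs):

  riak-admin bucket-type create strong '{"props":{"consistent":true,"n_val":5}}'
  riak-admin bucket-type activate strong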

The client side is .NET, using the Basho .NET client.

As the testing volume has risen, we are encountering get failures that
seem to have a communication-related flavor.

Things like: "CommunicationError: Unable to read data from the transport
connection: A connection attempt failed because the connected party did
not properly respond after a period of time, or established connection
failed because connected host has failed to respond", or
"CommunicationError: Riak returned an error. Code '0'. Message timeout",
or even "ClusterOffline: unable to access functioning Riak node".

The cluster itself is 5 nodes running 2.1.1 on CentOS 6, fronted with haproxy.
The haproxy stats with respect to retries, redispatches, errors, and
warnings are completely clean.
There are actually 2 haproxy nodes polling the Riak nodes, set up behind a
keepalived (VRRP) VIP, but there have been no failovers.
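
The relevant frontend is just a TCP-mode listener on the protocol buffers
port, roughly like this (names, addresses, and timeouts below are
illustrative, not pasted from the live config):

  listen riak_pb
      bind *:8087
      mode tcp
      balance leastconn
      timeout client 60s
      timeout server 60s
      server riak1 10.0.1.1:8087 check
      server riak2 10.0.1.2:8087 check
      server riak3 10.0.1.3:8087 check
      server riak4 10.0.1.4:8087 check
      server riak5 10.0.1.5:8087 check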

The only things in any of my "crash" logs are:

2015-07-16 01:13:59 =CRASH REPORT====
  crasher:
    initial call: mochiweb_acceptor:init/3
    pid: <0.4628.6768>
    registered_name: []
    exception error:
      {function_clause,
       [{webmachine_request,peer_from_peername,
         [{error,enotconn},
          {webmachine_request,
           {wm_reqstate,#Port<0.12171987>,[],undefined,undefined,undefined,
            {wm_reqdata,'GET',http,{1,0},
             "defined_in_wm_req_srv_init","defined_in_wm_req_srv_init",
             defined_on_call,defined_in_load_dispatch_data,"/ping","/ping",[],
             defined_in_load_dispatch_data,"defined_in_load_dispatch_data",
             500,1073741824,67108864,[],[],{0,nil},not_fetched_yet,false,
             {0,nil},<<>>,follow_request,undefined,undefined,[]},
            undefined,undefined,undefined}}],
         [{file,"src/webmachine_request.erl"},{line,150}]},
        {webmachine_request,get_peer,1,
         [{file,"src/webmachine_request.erl"},{line,124}]},
        {webmachine,new_request,2,[{file,"src/webmachine.erl"},{line,69}]},
        {webmachine_mochiweb,loop,2,
         [{file,"src/webmachine_mochiweb.erl"},{line,49}]},
        {mochiweb_http,headers,5,[{file,"src/mochiweb_http.erl"},{line,96}]},
        {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
    ancestors: ['http://0.0.0.0:8098_mochiweb',riak_api_sup,<0.368.0>]
    messages: []
    links: [<0.374.0>,#Port<0.12171987>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 1249
  neighbours:


There are a number of these over time, but all of the application's
interaction with the cluster is via PB (protocol buffers), not REST.

The client side is running inside an ASP.NET application pool under IIS 7.5.
All of the Windows machines have had the TCP registry parameters
TcpTimedWaitDelay and MaxUserPort adjusted (to 30 or 60, and 65534,
respectively) to support the connection rate that the .NET client generates.
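
Concretely, that is the standard registry tuning, e.g. (30 shown for
TcpTimedWaitDelay; some machines use 60):

  reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v TcpTimedWaitDelay /t REG_DWORD /d 30 /f
  reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v MaxUserPort /t REG_DWORD /d 65534 /f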

Prior to going to consistent buckets, we did not see these.

We've implemented put retry logic in order to work around similar,
transient failures on the write side, which are more understandable due to
the way the consistency subsystem operates.  We were not expecting to have
to implement this type of retry on the get side.
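
If we do end up mirroring the put-side logic on gets, it would look
something like the sketch below (rough C#; the RiakResult/ResultCode
member names are from memory for the 2.x .NET client, and the attempt
count and delay are arbitrary):

  // Hypothetical get-with-retry wrapper: retries only the transient,
  // communication-flavored result codes and gives up after maxAttempts.
  static RiakResult<RiakObject> GetWithRetry(IRiakClient client,
                                             RiakObjectId id,
                                             int maxAttempts = 3)
  {
      RiakResult<RiakObject> result = null;
      for (int attempt = 1; attempt <= maxAttempts; attempt++)
      {
          result = client.Get(id);
          if (result.IsSuccess)
              return result;

          // Anything other than a communication-style failure (e.g. not
          // found) goes straight back to the caller.
          if (result.ResultCode != ResultCode.CommunicationError &&
              result.ResultCode != ResultCode.ClusterOffline)
              return result;

          System.Threading.Thread.Sleep(100 * attempt);  // crude backoff
      }
      return result;
  }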

Any recommendations from the group as to how to start unraveling this one?
 A number of the status-related riak-admin commands do not function once
consistency has been used, which has limited my ability to peer too
closely into the nodes themselves.

The volume here is likely trivial compared to most of the use cases I'm
reading about, but the data model is relatively complex, and the traffic
rate and object sizes are quite variable.

Thank you for any insight.
Gregg Siegfried
Language Logic, LLC

