Hi,

There was a known eleveldb bug with handoff receiving that could cause a timeout, but it does not sound like that bug fits your symptoms. However, I am willing to verify my diagnosis. I would need you to gather the LOG files from all vnodes on the RECEIVING side (or at least from the vnode transfer that is failing).
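A sketch of one way to gather those per-vnode LOG files in one place. The data root and directory layout are assumptions (adjust to your eleveldb data_root, e.g. /var/lib/riak/leveldb); the mock directory below just stands in for a live node so the commands can be shown end to end:

```shell
# Gather per-vnode leveldb LOG files right after the failure and BEFORE
# restarting Riak (the LOG files are reset on restart).
# Assumption: one directory per vnode under the data root, named after the
# partition index, each containing a LOG file.

# Mock data root standing in for e.g. /var/lib/riak/leveldb on a live node.
DATA_ROOT="$(mktemp -d)"
mkdir -p "$DATA_ROOT/331121464707782692405522344912282871640797216768"
echo "example leveldb log line" > "$DATA_ROOT/331121464707782692405522344912282871640797216768/LOG"

DEST="$(mktemp -d)"   # on a live node: any safe place to collect the copies
for log in "$DATA_ROOT"/*/LOG; do
  part="$(basename "$(dirname "$log")")"   # vnode directory name = partition index
  cp "$log" "$DEST/LOG.$part"
done
ls "$DEST"
```

On a live node you would point DATA_ROOT at the real eleveldb data_root and DEST at a scratch directory, then attach the collected files to your reply.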
I will check them for the symptoms of the known bug. Note: the LOG files are reset on each restart of Riak, so you must gather them right after the failure, without restarting Riak.

Matthew

> On Oct 29, 2015, at 11:11 AM, Vladyslav Zakhozhai <v.zakhoz...@smartweb.com.ua> wrote:
>
> Hi,
>
> I want to make a small update. Jon, your hint about problems on the sender side is correct. As I've already said, there are problems with available resources on the sender nodes: there is not enough free RAM, which causes swapping and load on the disks. Restarting the sender nodes helps me (at least temporarily).
>
> On Thu, Oct 29, 2015 at 12:19 PM Vladyslav Zakhozhai <v.zakhoz...@smartweb.com.ua> wrote:
> Hi,
>
> The average size of objects in Riak is 300 KB. These objects are images, and they are updated very rarely (there are almost no updates).
>
> I have GC turned on and working:
>
> root@python:~# riak-cs-gc status
> There is no garbage collection in progress
> The current garbage collection interval is: 900
> The current garbage collection leeway time is: 86400
> Last run started at: 20151029T100600Z
> Next run scheduled for: 20151029T102100Z
>
> No network misconfiguration was detected. The result of your script shows correct info.
>
> But I see that almost all nodes with bitcask suffer from low free memory, and they are swapping. I think that could be the issue; my question is what the workaround for this problem is.
>
> I wrote in my first post that I tuned handoff_timeout and handoff_receive_timeout (these values are now 300000 and 600000), but the situation is the same.
>
> On Tue, Oct 27, 2015 at 4:06 PM Jon Meredith <jmered...@basho.com> wrote:
> Hi,
>
> Handoff problems without obvious disk issues can be due to the database containing large objects. Do you frequently update objects in CS, and if so, have you had garbage collection running?
>
> The timeout happens on the receiver side after not receiving any TCP data for handoff_receive_timeout *milli*seconds. I know you said you increased it, but not how high. I would bump it up to 300000 to give the sender a chance to read larger objects off disk.
>
> To check whether the sender is transmitting, on the source node you could run:
>
> redbug:start("riak_core_handoff_sender:visit_item",
>              [{arity, true},
>               {print_file, "/tmp/visit_item.log"},
>               {time, 3600000},
>               {msgs, 1000000}]).
>
> That file should fill fairly fast, with an entry for every object the sender tries to transmit.
>
> There's a long shot it could be network misconfiguration. Run this from the source node having problems:
>
> rpc:multicall(erlang, apply,
>               [fun() ->
>                    TargetNode = node(),
>                    [_Name, Host] = string:tokens(atom_to_list(TargetNode), "@"),
>                    {ok, Port} = riak_core_gen_server:call({riak_core_handoff_listener, TargetNode},
>                                                           handoff_port),
>                    HandoffIP = riak_core_handoff_listener:get_handoff_ip(),
>                    TNHandoffIP = case HandoffIP of
>                                      error -> Host;
>                                      {ok, "0.0.0.0"} -> Host;
>                                      {ok, Other} -> Other
>                                  end,
>                    {node(), HandoffIP, TNHandoffIP, inet:gethostbyname(TNHandoffIP), Port}
>                end, []]).
>
> It will print out a list of remote nodes and IP addresses (and hopefully an empty list of failed nodes):
>
> {[{'dev1@127.0.0.1',                                    <---- node name
>    {ok,"0.0.0.0"},                                      <---- handoff IP address configured in app.config
>    "127.0.0.1",                                         <---- hostname passed to socket open
>    {ok,{hostent,"127.0.0.1",[],inet,4,[{127,0,0,1}]}},  <---- DNS entry for hostname
>    10019}],                                             <---- handoff port
>  []}                                                    <---- empty list of errors
>
> Good luck,
> Jon
>
> On Tue, Oct 27, 2015 at 3:55 AM Vladyslav Zakhozhai <v.zakhoz...@smartweb.com.ua> wrote:
> Hi,
>
> Jon, thank you for the answer. While my mail to this list was awaiting approval, I troubleshot my issue more deeply, and yes, you are right: neither {error, enotconn} nor max_concurrency is my problem.
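One quick way to confirm that the redbug trace file suggested earlier in the thread is actually growing (i.e. the sender really is transmitting). This is a sketch with the two samples mocked; on a live sender you would simply run the two wc calls a few seconds apart against /tmp/visit_item.log:

```shell
F="$(mktemp)"                              # stands in for /tmp/visit_item.log on a live sender
printf 'visit_item\nvisit_item\n' > "$F"   # first sample (mocked)
BEFORE=$(wc -l < "$F")
printf 'visit_item\n' >> "$F"              # second sample, taken later (mocked)
AFTER=$(wc -l < "$F")
if [ "$AFTER" -gt "$BEFORE" ]; then
  echo "sender is transmitting ($BEFORE -> $AFTER entries)"
else
  echo "sender appears stalled"
fi
```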
>
> I'm going to migrate my cluster entirely to eleveldb, i.e. I need to stop using bitcask. I have talked with Basho support, and they said that it is tricky to tune bitcask on servers with 32 GB RAM (and I guess that it is not just tricky but impossible, because bitcask loads all keys into memory regardless of how much RAM is free). With LevelDB I have the opportunity to tune RAM usage on the servers.
>
> So I have 15 nodes with a multibackend (bitcask for data and leveldb for metadata). 2 additional servers are without the multibackend - leveldb only. Now I'm not sure whether I still need to use the multibackend on the leveldb-only nodes.
>
> And my problem is (as I mentioned earlier) the following: on the leveldb-only nodes I see handoffs time out and make no further progress.
>
> On the multibackend hosts I have this configuration:
>
> {riak_kv, [
>     {add_paths, ["/usr/lib/riak-cs/lib/riak_cs-1.5.0/ebin"]},
>     {storage_backend, riak_cs_kv_multi_backend},
>     {multi_backend_prefix_list, [{<<"0b:">>, be_blocks}]},
>     {multi_backend_default, be_default},
>     {multi_backend, [
>         {be_default, riak_kv_eleveldb_backend, [
>             {max_open_files, 50},
>             {data_root, "/var/lib/riak/leveldb"}
>         ]},
>         {be_blocks, riak_kv_bitcask_backend, [
>             {data_root, "/var/lib/riak/bitcask"}
>         ]}
>     ]},
>
> And for the hosts with the leveldb-only backend:
>
> {riak_kv, [
>     {storage_backend, riak_kv_eleveldb_backend},
>     ...
> {eleveldb, [
>     {data_root, "/var/lib/riak/leveldb"}
>     (default values for leveldb)
>
> In the leveldb logs I see nothing that could help me (no errors).
>
> On Mon, Oct 26, 2015 at 3:57 PM Jon Meredith <jmered...@basho.com> wrote:
> Hi,
>
> I suspect your {error,enotconn} messages are unrelated - they are likely caused by an HTTP client closing the connection while Riak looks up some networking information about the requestor.
>
> The max_concurrency message you are seeing is related to the handoff transfer limit - it should really be labelled as informational.
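The bitcask RAM concern raised earlier in the thread can be sanity-checked with back-of-envelope arithmetic. This is a sketch assuming roughly 45 bytes of per-key keydir overhead on a 64-bit system (a commonly cited Basho capacity-planning figure; check the bitcask capacity docs for your version), with example key counts that are not figures from this thread:

```shell
KEYS=100000000        # example: keys stored on one node (not a figure from this thread)
AVG_KEY_BYTES=35      # example: average bitcask key length
OVERHEAD=45           # assumed per-key keydir overhead on 64-bit

BYTES=$(( KEYS * (AVG_KEY_BYTES + OVERHEAD) ))
echo "approx keydir RAM needed: $(( BYTES / 1024 / 1024 / 1024 )) GiB"
# prints: approx keydir RAM needed: 7 GiB
```

If the figure for a real node approaches the 32 GB available (after the OS page cache and leveldb take their share), the swapping described earlier in the thread is the expected failure mode.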
> When a node has data to hand off, it starts the handoff sender process, and if there are either too many local handoff processes or too many on the remote side, it exits with max_concurrency. You could increase the limit with riak-admin transfer-limit, but that probably won't help if you're timing out.
>
> As you're using the multi-backend, you're transferring data from both bitcask and leveldb. The next place I would look is the leveldb LOG files, to see if any leveldb vnodes are having problems that prevent repair.
>
> Jon
>
> On Mon, Oct 26, 2015 at 7:15 AM Vladyslav Zakhozhai <v.zakhoz...@smartweb.com.ua> wrote:
> Hello,
>
> I have a problem with persistent timeouts during ownership handoffs. I've searched the Internet and the current mailing list, but with no success.
>
> I have a Riak 1.4.12 cluster with 17 nodes. Almost all nodes use a multibackend with bitcask and eleveldb as storage backends (we need multiple backends for Riak CS 1.5.0 integration).
>
> I'm now working on migrating the Riak cluster to eleveldb as the primary and only backend. For now I have 2 nodes with the eleveldb backend in the same cluster.
>
> During the ownership handoff process I permanently see errors about timed-out handoff receivers and senders.
>
> Here is partial output of riak-admin transfers:
>
> ...
> transfer type: ownership_transfer
> vnode type: riak_kv_vnode
> partition: 331121464707782692405522344912282871640797216768
> started: 2015-10-21 08:32:55 [46.66 min ago]
> last update: no updates seen
> total size: unknown
> objects transferred: unknown
>
> unknown
> riak@taipan.pleiad.uaprom =======> r...@eggeater.pleiad.uaprom
> |                                                    |   0%
> unknown
>
> transfer type: ownership_transfer
> vnode type: riak_kv_vnode
> partition: 336830455478606531929755488790080852186328203264
> started: 2015-10-21 08:32:54 [46.68 min ago]
> last update: no updates seen
> total size: unknown
> objects transferred: unknown
> ...
>
> Some of the partition handoffs never update their state; some of them terminate after partially handing off their objects and never start again.
>
> I see nothing in the logs but the following.
>
> On the receiver side:
>
> 2015-10-21 11:33:55.131 [error] <0.25390.1266>@riak_core_handoff_receiver:handle_info:105 Handoff receiver for partition 331121464707782692405522344912282871640797216768 timed out after processing 0 objects.
>
> On the sender side:
>
> 2015-10-21 11:01:58.879 [error] <0.13177.1401> CRASH REPORT Process <0.13177.1401> with 0 neighbours crashed with reason: no function clause matching webmachine_request:peer_from_peername({error,enotconn}, {webmachine_request,{wm_reqstate,#Port<0.50978116>,[],undefined,undefined,undefined,{wm_reqdata,...},...}}) line 150
> 2015-10-21 11:32:50.055 [error] <0.207.0> Supervisor riak_core_handoff_sender_sup had child riak_core_handoff_sender started with {riak_core_handoff_sender,start_link,undefined} at <0.22312.1090> exit with reason max_concurrency in context child_terminated
>
> {error, enotconn} seems to be a network issue, but I have no problems with the network. All hosts resolve their neighbors correctly, and /etc/hosts on each node is correct.
>
> I've tried increasing handoff_timeout and handoff_receive_timeout, but with no success.
>
> Forcing handoffs helped me, but only for a short period of time:
>
> rpc:multicall([node() | nodes()], riak_core_vnode_manager, force_handoffs, []).
>
> I see progress in the handoffs (riak-admin transfers), but then I see handoffs time out again.
>
> A week ago I joined 4 nodes with bitcask, and there were no such problems.
>
> I'm a little confused and need to understand my next steps in troubleshooting this issue.
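For reference, the handoff timeouts mentioned above are riak_core settings. A sketch of the relevant app.config fragment, using the values quoted in this thread (both in milliseconds; verify the exact setting names against the docs for your Riak 1.4.x release):

```erlang
{riak_core, [
    %% how long a handoff sender may run before being timed out (ms)
    {handoff_timeout, 300000},
    %% how long the receiver waits without TCP data before giving up (ms)
    {handoff_receive_timeout, 600000}
]}
```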
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com