Hi,

The average size of objects in Riak is 300 KB. These objects are images. This data is updated very rarely (there are almost no updates).
I have GC turned on and it works:

root@python:~# riak-cs-gc status
There is no garbage collection in progress
The current garbage collection interval is: 900
The current garbage collection leeway time is: 86400
Last run started at: 20151029T100600Z
Next run scheduled for: 20151029T102100Z

Network misconfigurations were not detected. The result of your script shows correct info.

But I see that almost all nodes with bitcask suffer from low free memory and are swapping. I think that may be the issue. My question is: what is the workaround for this problem?

I wrote in my first post that I tuned handoff_timeout and handoff_receive_timeout (these values are now 300000 and 600000), but the situation is the same.

On Tue, Oct 27, 2015 at 4:06 PM Jon Meredith <jmered...@basho.com> wrote:

> Hi,
>
> Handoff problems without obvious disk issues can be due to the database
> containing large objects. Do you frequently update objects in CS, and if
> so have you had garbage collection running?
>
> The timeout is happening on the receiver side after not receiving any tcp
> data for handoff_receive_timeout *milli*seconds. I know you said you
> increased it, but not how high. I would bump that up to 300000 to give the
> sender a chance to read larger objects off disk.
>
> To check if the sender is transmitting, on the source node you could run
> redbug:start("riak_core_handoff_sender:visit_item", [{arity,
> true},{print_file,"/tmp/visit_item.log"},{time, 3600000},{msgs, 1000000}]).
>
> That file should fill fairly fast with an entry for every object the
> sender tries to transmit.
>
> There's a long shot it could be network misconfiguration. Run this from
> the source node having problems
>
> rpc:multicall(erlang, apply, [fun() -> TargetNode = node(), [_Name,Host] =
> string:tokens(atom_to_list(TargetNode), "@"), {ok, Port} =
> riak_core_gen_server:call({riak_core_handoff_listener, TargetNode},
> handoff_port), HandoffIP = riak_core_handoff_listener:get_handoff_ip(),
> TNHandoffIP = case HandoffIP of error -> Host; {ok, "0.0.0.0"} -> Host;
> {ok, Other} -> Other end, {node(), HandoffIP, TNHandoffIP,
> inet:gethostbyname(TNHandoffIP), Port} end, []]).
>
> and it will print out a list of remote nodes and IP addresses (and
> hopefully an empty list of failed nodes)
>
> {[{'dev1@127.0.0.1', <---- node name
>
> {ok,"0.0.0.0"}, <---- handoff ip address configured in
> app.config
>
> "127.0.0.1", <---- hostname passed to socket open
>
> {ok,{hostent,"127.0.0.1",[],inet,4,[{127,0,0,1}]}}, <--- DNS entry for
> hostname
>
> 10019}], <---- handoff port
>
> []} <--- empty list of errors
>
> Good luck, Jon.
>
> On Tue, Oct 27, 2015 at 3:55 AM Vladyslav Zakhozhai <
> v.zakhoz...@smartweb.com.ua> wrote:
>
>> Hi,
>>
>> Jon, thank you for the answer. While my mail to this list was awaiting
>> approval, I troubleshot my issue more deeply. And yes, you are right.
>> Neither {error, enotconn} nor max_concurrency is my problem.
>>
>> I'm going to migrate my cluster entirely to eleveldb only, i.e. I need to
>> stop using bitcask. I had a talk with Basho support and they said that
>> it is tricky to tune bitcask on servers with 32 GB RAM (and I guess that it
>> is not tricky but impossible, because bitcask loads all keys into
>> memory regardless of available RAM). With LevelDB I have the opportunity
>> to tune RAM usage on the servers.
>>
>> So I have 15 nodes with multibackend (bitcask for data and leveldb for
>> metadata). 2 additional servers are without multibackend - only with
>> leveldb. Now I'm not sure whether I still need to use multibackend with a
>> leveldb-only backend.
>>
>> And my problem is (as I mentioned earlier) the following. On leveldb-only
>> nodes I see handoffs timed out and no further progress.
>>
>> On multibackend hosts I have configuration:
>>
>> {riak_kv, [
>>     {add_paths, ["/usr/lib/riak-cs/lib/riak_cs-1.5.0/ebin"]},
>>     {storage_backend, riak_cs_kv_multi_backend},
>>     {multi_backend_prefix_list, [{<<"0b:">>, be_blocks}]},
>>     {multi_backend_default, be_default},
>>     {multi_backend, [
>>         {be_default, riak_kv_eleveldb_backend, [
>>             {max_open_files, 50},
>>             {data_root, "/var/lib/riak/leveldb"}
>>         ]},
>>         {be_blocks, riak_kv_bitcask_backend, [
>>             {data_root, "/var/lib/riak/bitcask"}
>>         ]}
>>     ]},
>>
>> And for hosts with leveldb-only backend:
>>
>> {riak_kv, [
>>     {storage_backend, riak_kv_eleveldb_backend},
>>     ...
>> {eleveldb, [
>>     {data_root, "/var/lib/riak/leveldb"}
>>     (default values for leveldb)
>>
>> In leveldb logs I see nothing that could help me (no errors in logs).
>>
>>
>> On Mon, Oct 26, 2015 at 3:57 PM Jon Meredith <jmered...@basho.com> wrote:
>>
>>> Hi,
>>>
>>> I suspect your {error,enotconn} messages are unrelated - that's likely
>>> to be caused by an HTTP client closing the connection while Riak looks up
>>> some networking information about the requestor.
>>>
>>> The max_concurrency message you are seeing is related to the handoff
>>> transfer limit - it should be labelled as informational. When a node has
>>> data to handoff it starts the handoff sender process and if there are
>>> either too many local handoff processes or too many on the remote side it
>>> exits with max_concurrency. You could increase with riak-admin
>>> transfer-limit but that probably won't help if you're timing out.
>>>
>>> As you're using the multi-backend you're transferring data from bitcask
>>> and leveldb. The next place I would look is in the leveldb LOG files to
>>> see if there are any leveldb vnodes that are having problems that's
>>> preventing repair.
>>>
>>> Jon
>>>
>>> On Mon, Oct 26, 2015 at 7:15 AM Vladyslav Zakhozhai <
>>> v.zakhoz...@smartweb.com.ua> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a problem with persistent timeouts during ownership handoffs.
>>>> I've tried to surf over Internet and current mail list but no success.
>>>>
>>>> I have Riak 1.4.12 cluster with 17 nodes. Almost all nodes use
>>>> multibackend with bitcask and eleveldb as storage backends (we need
>>>> multiple backend for Riak CS 1.5.0 integration).
>>>>
>>>> Now I'm working to migrate Riak cluster to eleveldb as primary and only
>>>> backend. For now I have 2 nodes with eleveldb backend in the same cluster.
>>>>
>>>> During ownership handoff process I permanently see errors of timed out
>>>> handoff receivers and sender.
>>>>
>>>> Here is partial output of riak-admin transfers:
>>>> ...
>>>> transfer type: ownership_transfer
>>>> vnode type: riak_kv_vnode
>>>> partition: 331121464707782692405522344912282871640797216768
>>>> started: 2015-10-21 08:32:55 [46.66 min ago]
>>>> last update: no updates seen
>>>> total size: unknown
>>>> objects transferred: unknown
>>>>
>>>> unknown
>>>> riak@taipan.pleiad.uaprom =======> r...@eggeater.pleiad.uaprom
>>>> | | 0%
>>>> unknown
>>>>
>>>> transfer type: ownership_transfer
>>>> vnode type: riak_kv_vnode
>>>> partition: 336830455478606531929755488790080852186328203264
>>>> started: 2015-10-21 08:32:54 [46.68 min ago]
>>>> last update: no updates seen
>>>> total size: unknown
>>>> objects transferred: unknown
>>>> ...
>>>>
>>>> Some of partition handoffs state never updates, some of them terminates
>>>> after partial handoff objects and never starts again.
>>>>
>>>> I see nothing in logs but following:
>>>>
>>>> On receiver side:
>>>>
>>>> 2015-10-21 11:33:55.131 [error]
>>>> <0.25390.1266>@riak_core_handoff_receiver:handle_info:105 Handoff receiver
>>>> for partition 331121464707782692405522344912282871640797216768 timed out
>>>> after processing 0 objects.
>>>>
>>>> On sender side:
>>>>
>>>> 2015-10-21 11:01:58.879 [error] <0.13177.1401> CRASH REPORT Process
>>>> <0.13177.1401> with 0 neighbours crashed with reason: no function clause
>>>> matching webmachine_request:peer_from_peername({error,enotconn},
>>>> {webmachine_request,{wm_reqstate,#Port<0.50978116>,[],undefined,undefined,undefined,{wm_reqdata,...},...}})
>>>> line 150
>>>> 2015-10-21 11:32:50.055 [error] <0.207.0> Supervisor
>>>> riak_core_handoff_sender_sup had child riak_core_handoff_sender started
>>>> with {riak_core_handoff_sender,start_link,undefined} at <0.22312.1090> exit
>>>> with reason max_concurrency in context child_terminated
>>>>
>>>> {error, enotconn} - seems to be network issue. But I have no any
>>>> problems with network. All hosts resolve their neighbors correctly and
>>>> /etc/hosts on each node are correct.
>>>>
>>>> I've tried to increase handoff_timeout and handoff_receive_timeout. But
>>>> no success.
>>>>
>>>> Forcing handoff helped me but for short period of time:
>>>>
>>>> rpc:multicall([node() | nodes()], riak_core_vnode_manager, force_handoffs,
>>>> []).
>>>>
>>>>
>>>> I see progress of handoffs (riak-admin transfers) but then I see handoff
>>>> timed out again.
>>>>
>>>>
>>>> A week ago I've joined 4 nodes with bitcask. And there was no such
>>>> problems.
>>>>
>>>>
>>>> I'm confused a little bit and need to understand my next steps in
>>>> troubleshooting this issue.
>>>>
>>>>
>>>> _______________________________________________
>>>> riak-users mailing list
>>>> riak-users@lists.basho.com
>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>
>>>