Sean,

This could be anything from hardware to a leveldb block size problem to a 
single bad .sst file causing an infinite loop.

Standard questions:

- would you send a copy of your app.config file?
- would you describe the hardware characteristics of your node?
- would you describe roughly the size of your keys and the size of the 
data/values you write?
- would you email, or post for download, a copy of your leveldb LOG files? 
(tar -czf LOGs.tgz /var/lib/riak/leveldb/*/LOG)
- would you run "du -hs /var/lib/riak/leveldb/*" and email the results? (a 
snippet covering these last two items is sketched below)
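
To save you a round trip, the last two items can be gathered in one go.  A 
minimal sketch, assuming the default /var/lib/riak layout from the platform 
packages (adjust the paths if yours differ):

    # bundle every vnode's LevelDB LOG file for upload
    tar -czf LOGs.tgz /var/lib/riak/leveldb/*/LOG

    # record per-vnode on-disk sizes
    du -hs /var/lib/riak/leveldb/* > leveldb_du.txt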

At the moment, my bet is a bad .sst file.  The LOG files will prove or 
disprove that immediately.  If you do not want to wait for my reply, you 
could run a riak repair on each vnode as an attempted fix; a sketch of the 
procedure follows.
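
If you do go the repair route now, here is a minimal sketch of the 
attach-console procedure; the node name in the prompt is a placeholder, and 
you should double-check against the repair documentation for your release:

    $ riak attach
    %% list the partitions this node owns, then kick off a repair on each
    (riak@node4)1> {ok, Ring} = riak_core_ring_manager:get_my_ring().
    (riak@node4)2> Partitions = [P || {P, N} <- riak_core_ring:all_owners(Ring),
                                      N =:= node()].
    (riak@node4)3> [riak_kv_vnode:repair(P) || P <- Partitions].

Repairs are I/O heavy, so do one node at a time.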

The block size problem only occurs once you get a large dataset … and I 
cannot give you a threshold for "large" without seeing your app.config and 
"du" results.  A discussion of this problem is here:

https://github.com/basho/leveldb/wiki/mv-dynamic-block-size
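
For orientation, the setting at issue is eleveldb's block_size in 
app.config.  A sketch of where it lives; the values shown are illustrative, 
not a recommendation for your cluster:

    %% app.config (excerpt)
    {eleveldb, [
        {data_root, "/var/lib/riak/leveldb"},
        {block_size, 4096}    %% bytes per block; 4096 is the leveldb default
    ]},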

A hardware problem is unlikely, but the LOG files would carry clues.

Matthew




On Jan 9, 2014, at 9:33 PM, Sean McKibben <grap...@graphex.com> wrote:

> We have a 5-node cluster using eLevelDB (1.4.2) and 2i, and this afternoon it 
> started responding extremely slowly. CPU on member 4 was extremely high and 
> we restarted that process, but it didn’t help. We temporarily shut down 
> member 4 and cluster speed returned to normal, but as soon as we boot member 
> 4 back up, cluster performance goes to shit.
> 
> We’ve run into this before, but back then we were able to start over with a 
> fresh set of data after wiping the machines, as that was before we migrated 
> to this bare-metal cluster. Now it is causing some pretty significant issues 
> and we’re not sure what we can do to get it back to normal; many of our 
> queues are filling up, and we’ll probably have to take node 4 off again just 
> so we can provide a regular quality of service.
> 
> We’ve turned off AAE on node 4 but it hasn’t helped. We have some transfers 
> that need to happen but they are going very slowly.
> 
> 'riak-admin top' on node 4 reports this:
> Load:  cpu       610               Memory:  total      503852    binary      231544
>        procs     804                        processes  179850    code         11588
>        runq      134                        atom          533    ets           4581
> 
> Pid                 Name or Initial Func         Time       Reds     Memory     MsgQ Current Function
> -------------------------------------------------------------------------------------------------------------------------------
> <6175.29048.3>      proc_lib:init_p/5             '-'     462231   51356760        0 mochijson2:json_bin_is_safe/1
> <6175.12281.6>      proc_lib:init_p/5             '-'     307183   64195856        1 gen_fsm:loop/7
> <6175.1581.5>       proc_lib:init_p/5             '-'     286143   41085600        0 mochijson2:json_bin_is_safe/1
> <6175.6659.0>       proc_lib:init_p/5             '-'     281845      13752        0 sext:decode_binary/3
> <6175.6666.0>       proc_lib:init_p/5             '-'     209113      21648        0 sext:decode_binary/3
> <6175.12219.6>      proc_lib:init_p/5             '-'     168832   16829200        0 riak_client:wait_for_query_results/4
> <6175.8403.0>       proc_lib:init_p/5             '-'     133333      13880        1 eleveldb:iterator_move/2
> <6175.8813.0>       proc_lib:init_p/5             '-'     119548       9000        1 eleveldb:iterator/3
> <6175.8411.0>       proc_lib:init_p/5             '-'     115759      34472        0 riak_kv_vnode:'-result_fun_ack/2-fun-0-'
> <6175.5679.0>       proc_lib:init_p/5             '-'     109577       8952        0 riak_kv_vnode:'-result_fun_ack/2-fun-0-'
> Output server crashed: connection_lost
> 
> Based on that, is there anything anyone can think to do to try to bring 
> performance back into the land of usability? Does this appear to be 
> something that may have been resolved in 1.4.6 or 1.4.7?
> 
> The only thing we can think of at this point is to remove or force-remove 
> the member and join a freshly built one, but the last time we attempted that 
> (on a different cluster) our secondary indexes got irreparably damaged and 
> only regained consistency when we copied every individual key to (this) new 
> cluster! Not a good experience :( but I’m hopeful that 1.4.6 may have 
> addressed some of our issues.
> 
> Any help is appreciated.
> 
> Thank you,
> Sean McKibben
> 


_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
