Re: erlang go boom

2013-08-06 Thread Paul Ingalls
I added logging to the resolvers to see how frequently I am receiving siblings, and how many I get each time a resolver is called. Almost every call has only two siblings, and, although I am definitely creating them, about 10 or so per minute, it seems to be handling that OK. It's not a perfect test though

Re: erlang go boom

2013-08-06 Thread Kresten Krab Thorup
The ring state looks OK; the ring does not look polluted with random state. The strange thing is why the get_fsm process 0.83.0 has a 100M+ heap. It would be interesting to figure out what's on that heap, which you can learn from the crash dump. Perhaps you can load the crash dump into the

Re: erlang go boom

2013-08-05 Thread Kresten Krab Thorup
I'd think the large number of buckets could be the issue, especially if there are any bucket properties being set, because that would cause the ring data structure to be enormous. Could you provide an ls -l output of the riak data/ring directory? Sent from my iPhone On 05/08/2013, at 21.52, Paul

Re: erlang go boom

2013-08-05 Thread Paul Ingalls
Hey Kresten, Thanks for the response! I learned my lesson on setting bucket properties, so all buckets currently use the defaults. Here is the output from one of our nodes: total 40 drwxr-xr-x 2 root root 4096 Aug 5 21:10 ./ drwxr-xr-x 6 root root 4096 Aug 4 17:26 ../ -rw-r--r-- 1 root

Re: erlang go boom

2013-08-05 Thread Paul Ingalls
I watched top on all the instances when things started to fall apart. This is what I saw… Everything was jamming along just fine. CPU usage was about 25%, RAM usage was about 25% (3 of the 7 were at about 15%). Suddenly, CPU usage spikes to over 50% and RAM usage spikes to 80-90% (and I'm

Re: erlang go boom

2013-08-05 Thread Evan Vigil-McClanahan
Given your leveldb settings, I think that compaction is an unlikely culprit. But check this out: 2013-08-05 18:01:15.878 [info] 0.83.0@riak_core_sysmon_handler:handle_event:92 monitor large_heap 0.14832.557

Re: erlang go boom

2013-08-05 Thread Paul Ingalls
Interesting. I have sibling resolution code on the client side. Would sibling explosion take out the entire cluster all at once? Within 5 minutes of my last email, the rest of the cluster died. Is there a way to quickly figure out whether the cluster is full of siblings? Paul Ingalls

Re: erlang go boom

2013-08-05 Thread Jeremy Ong
On the client you could extract the value_count of the objects you read and just log them. Feel free to post code too, in particular, how you are writing out updated values. On Mon, Aug 5, 2013 at 9:20 PM, Paul Ingalls p...@fanzo.me wrote: Interesting. I have sibling resolution code on the
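The suggestion to log sibling counts on the client can be sketched as follows. This is a language-neutral illustration in Python, not the Riak Java client API: `record_sibling_count`, `sibling_histogram`, and `WARN_THRESHOLD` are hypothetical names, and the caller is assumed to already hold the list of sibling values returned by a read.

```python
import logging
from collections import Counter

logger = logging.getLogger("sibling_audit")

# Hypothetical threshold chosen for illustration; it is not a Riak setting.
WARN_THRESHOLD = 5

# Tally of how often each sibling count was observed across reads.
sibling_histogram = Counter()


def record_sibling_count(key, siblings):
    """Log how many sibling values a read returned and tally the count."""
    count = len(siblings)
    sibling_histogram[count] += 1
    level = logging.WARNING if count >= WARN_THRESHOLD else logging.INFO
    logger.log(level, "key=%s sibling_count=%d", key, count)
    return count
```

Calling this on every read and dumping `sibling_histogram` periodically would show whether counts stay near two or grow over time.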

Re: erlang go boom

2013-08-05 Thread Paul Ingalls
I'm currently using the Java client and its ConflictResolver and Mutator interfaces. In some cases I am just doing a store, letting the client do an implicit fetch and the mutator make the actual change. In other cases I'm doing an explicit fetch, modifying the result, and then doing a store
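The fetch, resolve, mutate, store cycle described here can be sketched roughly as below. This is an illustration in Python under stated assumptions, not the Java client's interfaces: `resolve` and `fetch_modify_store` are hypothetical names, values are modeled as sets merged by union, and a plain dict stands in for a bucket. The dict does not model vector clocks; against a real cluster the store would need to carry the vector clock from the fetch, otherwise each write can add a sibling instead of replacing them.

```python
def resolve(siblings):
    """Merge sibling sets into one value by union (one possible strategy)."""
    merged = set()
    for sibling in siblings:
        merged |= sibling
    return merged


def fetch_modify_store(store, key, mutate):
    """Read all siblings, resolve them, apply the change, write one value."""
    siblings = store.get(key, [])   # fetch: may return zero or more siblings
    resolved = resolve(siblings)    # resolve: collapse to a single value
    updated = mutate(resolved)      # mutate: must return the new value
    store[key] = [updated]          # store: one value replaces all siblings
    return updated
```

For example, `fetch_modify_store(store, "k", lambda s: s | {"c"})` resolves whatever siblings exist under `"k"` before adding `"c"`, so the change is applied to the merged value and not to just one sibling.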