Re: erlang go boom
I added logging into the resolvers to see how frequently I am receiving siblings, and how many I get when the resolver is called. Almost every call has only two siblings, and, although I am definitely creating them, about 10 or so per minute, it seems to be handling that OK. It's not a perfect test, though, as my throughput is shot since I restarted the cluster after the last crash…

Paul

Paul Ingalls
Founder/CEO, Fanzo
p...@fanzo.me
@paulingalls
http://www.linkedin.com/in/paulingalls

On Aug 5, 2013, at 9:50 PM, Paul Ingalls <p...@fanzo.me> wrote: […]
Re: erlang go boom
The ring state looks OK; the ring does not look polluted with random state. The strange thing is why the riak_kv_get_fsm process <0.14832.557> has a 100MB+ heap. It would be interesting to figure out what's on that heap, which you can learn from the crash dump. Perhaps you can load the crash dump into the CrashDumpViewer (start it by typing crashdump_viewer:start() in a fresh Erlang VM; see http://www.erlang.org/doc/apps/observer/crashdump_ug.html), go to that process, and get a stack dump. That could give a hint as to what is using all that memory.

Kresten

Mobile: +45 2343 4626 | Skype: krestenkrabthorup | Twitter: @drkrab
Trifork A/S | Margrethepladsen 4 | DK-8000 Aarhus C | Phone: +45 8732 8787 | www.trifork.com
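For reference, the steps above look like this in practice (a minimal sketch; the file name assumes the default erl_crash.dump written to the node's working directory, and pids will differ between dumps):

    %% In a fresh Erlang VM (not on the live node), start the viewer:
    crashdump_viewer:start().
    %% When prompted, load the node's dump, e.g. "erl_crash.dump",
    %% then open the process list, find the pid flagged by the
    %% large_heap monitor, and inspect its stack and heap.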
On Aug 5, 2013, at 11:58 PM, Paul Ingalls <p...@fanzo.me> wrote: […]
Re: erlang go boom
I'd think the large number of buckets could be the issue, especially if any bucket properties are being set, because that would cause the ring data structure to be enormous. Could you provide ls -l output of the riak data/ring directory?

Sent from my iPhone

On 05/08/2013, at 21.52, Paul Ingalls <p...@fanzo.me> wrote:

As promised in my previous email, I hit a fairly big problem over the weekend and then reproduced it this morning, and I was wondering if I could get some help.

Basically, I was running my code against our Riak cluster and everything was moving along just fine. However, at some point Riak just seems to hit a wall. I get to a certain scale of content, about 9-10 GB per node, and about half the cluster gives up the ghost. I've done this twice from scratch now. The first time I thought maybe I was pushing too many transactions per second, but the second time I reduced the speed and changed some settings in app.config that I thought would reduce memory usage. I ran it again, and 24 hours later I still hit the wall (at almost the same place). So I figure I'm still doing something wrong, I'm just not sure what. I'm obviously hitting some kind of heap issue, but I'm not sure what else I can do about it. I'm hoping there is something obvious I'm missing in my ignorance, because at the moment I'm stuck...

Some details:
- 7 node cluster running 1.4
- each VM has 7GB RAM and 4 CPUs
- the data directory is on a RAID0 with 750GB of space
- 128 partitions, LevelDB backend
- using links and secondary indexes
- lots of buckets (over 10 million); a couple of buckets have lots of keys (one was around 6.6 million keys when it crashed, the other around 3.7 million)
- values are pretty small, almost all just a few bytes; the one exception is the largest bucket (6.6 million keys), with value sizes between 1-2k

Primary custom app.config settings:

    {kernel, [
      {inet_dist_listen_min, 6000},
      {inet_dist_listen_max, 7999}
    ]},
    {riak_core, [
      {default_bucket_props, [
        {allow_mult, true},
        {r, 1},
        {w, 1},
        {dw, 1}
      ]},
      {ring_creation_size, 128}
    ]},
    {riak_kv, [
      {storage_backend, riak_kv_eleveldb_backend}
    ]},
    {eleveldb, [
      {max_open_files, 32}
    ]}

Custom vm.args settings:

    +A 16

Here is some of the error information I am seeing when it all goes boom. On one of the nodes that crashes, the last lines of console.log:

2013-08-05 17:59:08.071 [info] <0.29251.556>@riak_kv_exchange_fsm:key_exchange:206 Repaired 1 keys during active anti-entropy exchange of {468137243207554840987117797979434404733540892672,3} between {479555224749202520035584085735030365824602865664,riak@riak001} and {490973206290850199084050373490626326915664838656,riak@riak002}
2013-08-05 18:01:10.234 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,47828850},{mbuf_size,0},{stack_size,10},{old_heap_size,0},{heap_size,40978448}]
2013-08-05 18:01:10.672 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.12133.557> [{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{riak_pb_kv_codec,'-encode_content_meta/3-lc$^0/1-1-',1}},{message_queue_len,0}] [{old_heap_block_size,0},{heap_block_size,47828850},{mbuf_size,0},{stack_size,45},{old_heap_size,0},{heap_size,40978360}]
2013-08-05 18:01:12.993 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{dict,fold_bucket,3}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,50},{old_heap_size,0},{heap_size,40978626}]
2013-08-05 18:01:13.816 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{dict,fold_bucket,3}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,50},{old_heap_size,0},{heap_size,40978626}]
2013-08-05 18:01:13.819 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.12133.557> [{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{riak_kv_pb,pack,5}},{message_queue_len,0}] [{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,200971},{old_heap_size,0},{heap_size,47627275}]
2013-08-05 18:01:14.594 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557>
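Given the default_bucket_props above, it is worth confirming what a bucket actually resolves to at runtime; only buckets with non-default properties get written into the ring. A sketch from riak attach on any node (<<"tweets">> is a hypothetical bucket name):

    %% Effective properties for one bucket (defaults merged in):
    riak_core_bucket:get_bucket(<<"tweets">>).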
Re: erlang go boom
Hey Kresten,

Thanks for the response! I learned my lesson on setting bucket properties, so all buckets currently use the defaults.

Here is the output from one of our nodes:

    total 40
    drwxr-xr-x 2 root root  4096 Aug 5 21:10 ./
    drwxr-xr-x 6 root root  4096 Aug 4 17:26 ../
    -rw-r--r-- 1 root root 14187 Aug 4 17:37 riak_core_ring.default.20130804173753
    -rw-r--r-- 1 root root 14586 Aug 5 21:10 riak_core_ring.default.20130805211043

And another node:

    total 72
    drwxr-xr-x 2 root root  4096 Aug 5 21:10 ./
    drwxr-xr-x 6 root root  4096 Aug 4 17:26 ../
    -rw-r--r-- 1 root root 14187 Aug 4 17:37 riak_core_ring.default.20130804173753
    -rw-r--r-- 1 root root 14358 Aug 5 20:56 riak_core_ring.default.20130805205650
    -rw-r--r-- 1 root root 14529 Aug 5 21:09 riak_core_ring.default.20130805210924
    -rw-r--r-- 1 root root 14586 Aug 5 21:10 riak_core_ring.default.20130805211043

It looks like the largest number of files in that directory on any node is 5.

Paul

Paul Ingalls
Founder/CEO, Fanzo
p...@fanzo.me
@paulingalls
http://www.linkedin.com/in/paulingalls
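Those ring files are only about 14KB each, which also argues against a bloated ring. As a cross-check, the in-memory ring can be inspected from riak attach (a sketch; the serialized size is only a rough proxy for the in-memory footprint):

    %% Fetch this node's ring and measure its serialized size in bytes.
    {ok, Ring} = riak_core_ring_manager:get_my_ring(),
    byte_size(term_to_binary(Ring)).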
On Aug 5, 2013, at 2:52 PM, Kresten Krab Thorup <k...@trifork.com> wrote: […]
Re: erlang go boom
I watched top on all the instances when things started to fall apart. This is what I saw: everything was jamming along just fine, with CPU usage at about 25% and RAM usage at about 25% (3 of the 7 nodes were at about 15%). Suddenly, CPU usage spikes to over 50% and RAM usage spikes to 80-90% (and I'm guessing that on the nodes that crashed, it went over 100%). This happens within seconds; it's not a gradual growth in RAM or CPU but a spike. The nodes that survived stayed at this high-water mark for RAM (and CPU keeps running at the incremental difference, since I killed the client).

Is there any easy way to figure out what the process is working on? Could compaction cause this? Basically, have I hit a place where it needs to merge SSTables to a higher level and there isn't enough RAM to pull the data set into memory?

Paul

Paul Ingalls
Founder/CEO, Fanzo
p...@fanzo.me
@paulingalls
http://www.linkedin.com/in/paulingalls
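One way to see which processes are holding memory when the spike hits is to rank them by heap size from riak attach (a sketch using only standard erlang module calls; heap sizes are reported in machine words, and running this on a heavily loaded node has some cost):

    %% Top 5 processes by total heap size, plus what each is running.
    Tops = lists:sublist(
             lists:reverse(lists:keysort(2,
               [{P, H} || P <- erlang:processes(),
                          {total_heap_size, H} <-
                              [erlang:process_info(P, total_heap_size)]])),
             5),
    [{P, H, erlang:process_info(P, current_function)} || {P, H} <- Tops].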
On Aug 5, 2013, at 2:58 PM, Paul Ingalls <p...@fanzo.me> wrote: […]
Re: erlang go boom
Given your leveldb settings, I think that compaction is an unlikely culprit. But check this out:

2013-08-05 18:01:15.878 [info] <0.83.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.14832.557> [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}] [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,52},{old_heap_size,0},{heap_size,81956791}]

That's a 78MB heap in riak_object:encode_maybe_binary/1... Unless your objects are big, I would suspect sibling explosion caused by rapid updates at w = 1.
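To see the mechanism, two blind writes to the same key (no vector clock) yield two siblings when allow_mult is true, and with many writers at w = 1 this compounds quickly. A sketch from riak attach, using a hypothetical throwaway bucket and key, and assuming the riak_client API returned by riak:local_client/0:

    {ok, C} = riak:local_client(),
    C:put(riak_object:new(<<"demo">>, <<"k1">>, <<"v1">>)),
    C:put(riak_object:new(<<"demo">>, <<"k1">>, <<"v2">>)),
    {ok, Obj} = C:get(<<"demo">>, <<"k1">>),
    riak_object:value_count(Obj).  %% expect 2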
Re: erlang go boom
Interesting. I have sibling resolution code on the client side. Would sibling explosion take out the entire cluster all at once? Within 5 minutes of my last email, the rest of the cluster died. Is there a way to quickly figure out whether the cluster is full of siblings?

Paul Ingalls
Founder/CEO, Fanzo
p...@fanzo.me
@paulingalls
http://www.linkedin.com/in/paulingalls

On Aug 5, 2013, at 8:07 PM, Evan Vigil-McClanahan <emcclana...@basho.com> wrote: […]
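One quick signal is the sibling statistics each node already tracks for its get FSMs. A sketch from riak attach, assuming riak_kv_status:statistics/0 (the call behind riak-admin status):

    %% Pull node stats and keep only the sibling-related entries,
    %% e.g. node_get_fsm_siblings_mean / _median / _95 / _99 / _100.
    Stats = riak_kv_status:statistics(),
    [S || {K, _} = S <- Stats, is_atom(K),
          lists:prefix("node_get_fsm_siblings", atom_to_list(K))].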
Re: erlang go boom
On the client you could extract the value_count of the objects you read and just log them. Feel free to post code too, in particular how you are writing out updated values.

On Mon, Aug 5, 2013 at 9:20 PM, Paul Ingalls <p...@fanzo.me> wrote: […]
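On the server side, a single suspect key can also be spot-checked from riak attach while the cluster is unreachable for the Java client (a sketch; bucket and key names are hypothetical):

    %% Fetch one object locally and count its sibling values.
    {ok, C} = riak:local_client(),
    {ok, Obj} = C:get(<<"tweets">>, <<"some-key">>),
    riak_object:value_count(Obj).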
Re: erlang go boom
I'm currently using the Java client and its ConflictResolver and Mutator interfaces. In some cases I am just doing a store, letting the client do an implicit fetch and the mutator make the actual change. In other cases I'm doing an explicit fetch, modifying the result, and then a store without fetch. For both, I am using a ConflictResolver, and I have added a field for the vclock and used the annotation in my beans.

I'll try pointing a local client at the cluster to see how many siblings there are. That is, if I can get the cluster running again... :)

Paul

Paul Ingalls
Founder/CEO, Fanzo
p...@fanzo.me
@paulingalls
http://www.linkedin.com/in/paulingalls

On Aug 5, 2013, at 9:36 PM, Jeremy Ong <jer...@quarkgames.com> wrote: […]