Re: erlang go boom

2013-08-06 Thread Paul Ingalls
I added logging into the resolvers to see how frequently I am receiving 
siblings, and how many I get each time it's called.

Almost every call has only two siblings, and although I am definitely creating 
them (about 10 or so per minute), it seems to be handling that OK.  It's not a 
perfect test, though, as my throughput is shot since I restarted the cluster 
after the last crash…

Paul

Paul Ingalls
Founder & CEO Fanzo
p...@fanzo.me
@paulingalls
http://www.linkedin.com/in/paulingalls



On Aug 5, 2013, at 9:50 PM, Paul Ingalls p...@fanzo.me wrote:

 I'm currently using the Java client and its ConflictResolver and Mutator 
 interfaces.  In some cases I am just doing a store, letting the client do an 
 implicit fetch and the mutator make the actual change.  In other cases I'm 
 doing an explicit fetch, modifying the result, and then doing a store without 
 fetch.  For both, I am using a ConflictResolver, and I have added a field for 
 the vclock and used the annotation in my beans.
 
 I'll try pointing a local client at the cluster to see if I can see how many 
 siblings there are.  That is, if I can get the cluster running again...:)
 
 Paul
 
 Paul Ingalls
 Founder & CEO Fanzo
 p...@fanzo.me
 @paulingalls
 http://www.linkedin.com/in/paulingalls
 
 On Aug 5, 2013, at 9:36 PM, Jeremy Ong jer...@quarkgames.com wrote:
 
 On the client you could extract the value_count of the objects you
 read and just log them. Feel free to post code too, in particular, how
 you are writing out updated values.
 
 On Mon, Aug 5, 2013 at 9:20 PM, Paul Ingalls p...@fanzo.me wrote:
 Interesting.  I have sibling resolution code on the client side.  Would
 sibling explosion take out the entire cluster all at once?  Within 5 minutes
 of my last email, the rest of the cluster died.
 
 Is there a way to quickly figure out whether the cluster is full of
 siblings?
 
 Paul Ingalls
 Founder & CEO Fanzo
 p...@fanzo.me
 @paulingalls
 http://www.linkedin.com/in/paulingalls
 
 On Aug 5, 2013, at 8:07 PM, Evan Vigil-McClanahan emcclana...@basho.com
 wrote:
 
 Given your leveldb settings, I think that compaction is an unlikely
 culprit.  But check this out:
 
 2013-08-05 18:01:15.878 [info] 0.83.0@riak_core_sysmon_
 handler:handle_event:92 monitor large_heap 0.14832.557
 [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}]
 [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,52},{old_heap_size,0},{heap_size,81956791}]
 
 That's a 78MB heap in encode object... Unless your objects are big, I
 would suspect sibling explosion caused by rapid updates at w = 1.
 
 
 
 ___
 riak-users mailing list
 riak-users@lists.basho.com
 http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
 
 

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: erlang go boom

2013-08-06 Thread Kresten Krab Thorup
The ring state looks OK; the ring does not look polluted with random state. The 
strange thing is why the get_fsm process 0.83.0 has a +100M heap.  It would be 
interesting to figure out what's on that heap, which you can learn from the 
crash dump.

Perhaps you can load the crash dump into the CrashDumpViewer (start it by 
typing crashdump_viewer:start() in a fresh Erlang VM; see 
http://www.erlang.org/doc/apps/observer/crashdump_ug.html) and then go to the 
process named 0.83.0 and get a stack dump.  That could maybe give a hint as to 
what is using all that memory.
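
A minimal version of that workflow (a sketch; the dump location below is just 
an example):

    $ erl
    1> crashdump_viewer:start().
    %% point the viewer at e.g. /var/log/riak/erl_crash.dump, then open the
    %% process list and look at the stack, heap and message queue of the
    %% process that was holding the large heap.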

Kresten

Mobile: +45 2343 4626 | Skype: krestenkrabthorup | Twitter: @drkrab
Trifork A/S | Margrethepladsen 4 | DK-8000 Aarhus C | Phone: +45 8732 8787 | 
www.trifork.com

On Aug 5, 2013, at 11:58 PM, Paul Ingalls p...@fanzo.me 
wrote:

Hey Kresten,

Thanks for the response!

I learned my lesson on setting bucket properties.  So all buckets currently use 
the defaults.

here is the output from one of our nodes:

total 40
drwxr-xr-x 2 root root  4096 Aug  5 21:10 ./
drwxr-xr-x 6 root root  4096 Aug  4 17:26 ../
-rw-r--r-- 1 root root 14187 Aug  4 17:37 riak_core_ring.default.20130804173753
-rw-r--r-- 1 root root 14586 Aug  5 21:10 riak_core_ring.default.20130805211043

another node:

total 72
drwxr-xr-x 2 root root  4096 Aug  5 21:10 ./
drwxr-xr-x 6 root root  4096 Aug  4 17:26 ../
-rw-r--r-- 1 root root 14187 Aug  4 17:37 riak_core_ring.default.20130804173753
-rw-r--r-- 1 root root 14358 Aug  5 20:56 riak_core_ring.default.20130805205650
-rw-r--r-- 1 root root 14529 Aug  5 21:09 riak_core_ring.default.20130805210924
-rw-r--r-- 1 root root 14586 Aug  5 21:10 riak_core_ring.default.20130805211043

looks like the largest number of files in that directory on any node is 5

Paul

Paul Ingalls
Founder & CEO Fanzo
p...@fanzo.me
@paulingalls
http://www.linkedin.com/in/paulingalls



On Aug 5, 2013, at 2:52 PM, Kresten Krab Thorup k...@trifork.com wrote:

I'd think the large #buckets could be the issue, especially if any bucket 
properties are being set, because that would cause the ring data structure to 
be enormous.

Could you provide an ls -l output of the riak data/ring directory?

Sent from my iPhone

On 05/08/2013, at 21.52, Paul Ingalls p...@fanzo.me 
wrote:

As promised in a previous email, I hit a fairly big problem over the weekend and 
then reproduced it this morning, and I was wondering if I could get some help.

Basically, I was running my code against our Riak cluster and everything was 
moving along just fine.  However, at some point Riak just seems to hit a wall.  
I get to a certain scale of content, about 9-10 GB per node, and about half the 
cluster gives up the ghost.

I've done this twice from scratch now.  The first time I thought maybe I was 
trying to push too many transactions per second.  But the second time I reduced 
the speed and changed some settings in the app.config that I thought would 
reduce memory usage.  Ran it again, and 24 hrs later I still hit the wall 
(almost the same place).

So, I figure I'm still doing something wrong; I'm just not sure what.  I'm 
obviously hitting some kind of heap issue, but I'm not sure what else I can do 
about it.  I'm hoping there is something obvious I'm missing in my ignorance, 
because at the moment I'm stuck...

Some details:

7 node cluster running 1.4
each VM has 7GB RAM and 4 CPUs
the data directory is on a RAID0 with 750GB of space
128 partitions, levelDB backend
using links and secondary indexes
lots of buckets (over 10 million); a couple of buckets have lots of keys (one was 
around 6.6 million keys when it crashed, the other around 3.7 million).
Values are pretty small, almost all are just a few bytes.  There is one bucket, 
the largest (6.6 million keys), with value sizes between 1-2k.

primary custom app config settings:
{kernel,
  [
  {inet_dist_listen_min, 6000},
  {inet_dist_listen_max, 7999}
  ]},

{riak_core, [
  {default_bucket_props, [
{allow_mult, true},
{r, 1},
{w, 1},
{dw, 1},
{dw, 1}
   ]},
 {ring_creation_size, 128},
]},

{riak_kv, [
   {storage_backend, riak_kv_eleveldb_backend},
]}

{eleveldb, [
{max_open_files, 32}
   ]},

custom vm.args settings
+A 16

Here is some of the error information I am seeing when it all goes boom:

*
* On one of the nodes that crashes:

last lines of console.log
---
2013-08-05 17:59:08.071 [info] 
0.29251.556@riak_kv_exchange_fsm:key_exchange:206 Repaired 1 keys during 
active anti-entropy exchange of 
{468137243207554840987117797979434404733540892672,3} between 

Re: erlang go boom

2013-08-05 Thread Kresten Krab Thorup
I'd think the large #buckets could be the issue, especially if any bucket 
properties are being set, because that would cause the ring data structure to 
be enormous.

Could you provide an ls -l output of the riak data/ring directory?

Sent from my iPhone

On 05/08/2013, at 21.52, Paul Ingalls p...@fanzo.me 
wrote:

As promised in a previous email, I hit a fairly big problem over the weekend and 
then reproduced it this morning, and I was wondering if I could get some help.

Basically, I was running my code against our Riak cluster and everything was 
moving along just fine.  However, at some point Riak just seems to hit a wall.  
I get to a certain scale of content, about 9-10 GB per node, and about half the 
cluster gives up the ghost.

I've done this twice from scratch now.  The first time I thought maybe I was 
trying to push too many transactions per second.  But the second time I reduced 
the speed and changed some settings in the app.config that I thought would 
reduce memory usage.  Ran it again, and 24 hrs later I still hit the wall 
(almost the same place).

So, I figure I'm still doing something wrong; I'm just not sure what.  I'm 
obviously hitting some kind of heap issue, but I'm not sure what else I can do 
about it.  I'm hoping there is something obvious I'm missing in my ignorance, 
because at the moment I'm stuck...

Some details:

7 node cluster running 1.4
each VM has 7GB RAM and 4 CPUs
the data directory is on a RAID0 with 750GB of space
128 partitions, levelDB backend
using links and secondary indexes
lots of buckets (over 10 million); a couple of buckets have lots of keys (one was 
around 6.6 million keys when it crashed, the other around 3.7 million).
Values are pretty small, almost all are just a few bytes.  There is one bucket, 
the largest (6.6 million keys), with value sizes between 1-2k.

primary custom app config settings:
{kernel,
   [
   {inet_dist_listen_min, 6000},
   {inet_dist_listen_max, 7999}
   ]},

{riak_core, [
   {default_bucket_props, [
 {allow_mult, true},
 {r, 1},
 {w, 1},
 {dw, 1},
 {dw, 1}
]},
  {ring_creation_size, 128},
]},

 {riak_kv, [
{storage_backend, riak_kv_eleveldb_backend},
]}

 {eleveldb, [
 {max_open_files, 32}
]},

custom vm.args settings
+A 16

Here is some of the error information I am seeing when it all goes boom:

*
* On one of the nodes that crashes:

last lines of console.log
---
2013-08-05 17:59:08.071 [info] 
0.29251.556@riak_kv_exchange_fsm:key_exchange:206 Repaired 1 keys during 
active anti-entropy exchange of 
{468137243207554840987117797979434404733540892672,3} between 
{479555224749202520035584085735030365824602865664,riak@riak001} and 
{490973206290850199084050373490626326915664838656,riak@riak002}
2013-08-05 18:01:10.234 [info] 
0.83.0@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
0.14832.557 
[{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}]
 
[{old_heap_block_size,0},{heap_block_size,47828850},{mbuf_size,0},{stack_size,10},{old_heap_size,0},{heap_size,40978448}]
2013-08-05 18:01:10.672 [info] 
0.83.0@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
0.12133.557 
[{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{riak_pb_kv_codec,'-encode_content_meta/3-lc$^0/1-1-',1}},{message_queue_len,0}]
 
[{old_heap_block_size,0},{heap_block_size,47828850},{mbuf_size,0},{stack_size,45},{old_heap_size,0},{heap_size,40978360}]
2013-08-05 18:01:12.993 [info] 
0.83.0@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
0.14832.557 
[{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{dict,fold_bucket,3}},{message_queue_len,1}]
 
[{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,50},{old_heap_size,0},{heap_size,40978626}]
2013-08-05 18:01:13.816 [info] 
0.83.0@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
0.14832.557 
[{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{dict,fold_bucket,3}},{message_queue_len,1}]
 
[{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,50},{old_heap_size,0},{heap_size,40978626}]
2013-08-05 18:01:13.819 [info] 
0.83.0@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
0.12133.557 
[{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{riak_kv_pb,pack,5}},{message_queue_len,0}]
 
[{old_heap_block_size,0},{heap_block_size,59786060},{mbuf_size,0},{stack_size,200971},{old_heap_size,0},{heap_size,47627275}]
2013-08-05 18:01:14.594 [info] 
0.83.0@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
0.14832.557 

Re: erlang go boom

2013-08-05 Thread Paul Ingalls
Hey Kresten,

Thanks for the response!

I learned my lesson on setting bucket properties.  So all buckets currently use 
the defaults.

here is the output from one of our nodes:

total 40
drwxr-xr-x 2 root root  4096 Aug  5 21:10 ./
drwxr-xr-x 6 root root  4096 Aug  4 17:26 ../
-rw-r--r-- 1 root root 14187 Aug  4 17:37 riak_core_ring.default.20130804173753
-rw-r--r-- 1 root root 14586 Aug  5 21:10 riak_core_ring.default.20130805211043

another node:

total 72
drwxr-xr-x 2 root root  4096 Aug  5 21:10 ./
drwxr-xr-x 6 root root  4096 Aug  4 17:26 ../
-rw-r--r-- 1 root root 14187 Aug  4 17:37 riak_core_ring.default.20130804173753
-rw-r--r-- 1 root root 14358 Aug  5 20:56 riak_core_ring.default.20130805205650
-rw-r--r-- 1 root root 14529 Aug  5 21:09 riak_core_ring.default.20130805210924
-rw-r--r-- 1 root root 14586 Aug  5 21:10 riak_core_ring.default.20130805211043

looks like the largest number of files in that directory on any node is 5

Paul

Paul Ingalls
Founder & CEO Fanzo
p...@fanzo.me
@paulingalls
http://www.linkedin.com/in/paulingalls



On Aug 5, 2013, at 2:52 PM, Kresten Krab Thorup k...@trifork.com wrote:

 I'd think the large #buckets could be the issue, especially if any bucket 
 properties are being set, because that would cause the ring data structure to 
 be enormous.
 
 Could you provide an ls -l output of the riak data/ring directory?
 
 Sent from my iPhone
 
 On 05/08/2013, at 21.52, Paul Ingalls p...@fanzo.me 
 wrote:
 
 As promised in a previous email, I hit a fairly big problem over the weekend 
 and then reproduced it this morning, and I was wondering if I could get some 
 help.
 
 Basically, I was running my code against our Riak cluster and everything was 
 moving along just fine.  However, at some point Riak just seems to hit a 
 wall.  I get to a certain scale of content, about 9-10 GB per node, and about 
 half the cluster gives up the ghost.
 
 I've done this twice from scratch now.  The first time I thought maybe I was 
 trying to push too many transactions per second.  But the second time I 
 reduced the speed and changed some settings in the app.config that I thought 
 would reduce memory usage.  Ran it again, and 24 hrs later I still hit the 
 wall (almost the same place).
 
 So, I figure I'm still doing something wrong; I'm just not sure what.  I'm 
 obviously hitting some kind of heap issue, but I'm not sure what else I can 
 do about it.  I'm hoping there is something obvious I'm missing in my 
 ignorance, because at the moment I'm stuck...
 
 Some details:
 
 7 node cluster running 1.4
 each VM has 7GB RAM and 4 CPUs
 the data directory is on a RAID0 with 750GB of space
 128 partitions, levelDB backend
 using links and secondary indexes
 lots of buckets (over 10 million); a couple of buckets have lots of keys (one 
 was around 6.6 million keys when it crashed, the other around 3.7 million).
 Values are pretty small, almost all are just a few bytes.  There is one 
 bucket, the largest (6.6 million keys), with value sizes between 1-2k.
 
 primary custom app config settings:
 {kernel,
   [
   {inet_dist_listen_min, 6000},
   {inet_dist_listen_max, 7999}
   ]},
 
 {riak_core, [
   {default_bucket_props, [
 {allow_mult, true},
 {r, 1},
 {w, 1},
 {dw, 1},
 {dw, 1}
]},
  {ring_creation_size, 128},
 ]},
 
 {riak_kv, [
{storage_backend, riak_kv_eleveldb_backend},
 ]}
 
 {eleveldb, [
 {max_open_files, 32}
]},
 
 custom vm.args settings
 +A 16
 
 Here is some of the error information I am seeing when it all goes boom:
 
 *
 * On one of the nodes that crashes:
 
 last lines of console.log
 ---
 2013-08-05 17:59:08.071 [info] 
 0.29251.556@riak_kv_exchange_fsm:key_exchange:206 Repaired 1 keys during 
 active anti-entropy exchange of 
 {468137243207554840987117797979434404733540892672,3} between 
 {479555224749202520035584085735030365824602865664,riak@riak001} and 
 {490973206290850199084050373490626326915664838656,riak@riak002}
 2013-08-05 18:01:10.234 [info] 
 0.83.0@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
 0.14832.557 
 [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}]
  
 [{old_heap_block_size,0},{heap_block_size,47828850},{mbuf_size,0},{stack_size,10},{old_heap_size,0},{heap_size,40978448}]
 2013-08-05 18:01:10.672 [info] 
 0.83.0@riak_core_sysmon_handler:handle_event:92 monitor large_heap 
 0.12133.557 
 [{initial_call,{riak_api_pb_server,init,1}},{almost_current_function,{riak_pb_kv_codec,'-encode_content_meta/3-lc$^0/1-1-',1}},{message_queue_len,0}]
  
 

Re: erlang go boom

2013-08-05 Thread Paul Ingalls
I watched top on all the instances when things started to fall apart.  This is 
what I saw…

Everything was jamming along just fine.  CPU usage was about 25%, and RAM usage 
was about 25% (3 of the 7 nodes were at about 15%).

Suddenly, CPU usage spikes to over 50% and RAM usage spikes to 80-90% (and I'm 
guessing that on the nodes that crashed, it went over 100%).  This happens within 
seconds; it's not a gradual growth in RAM or CPU but a spike.  The nodes that 
survived stayed at this higher water mark for RAM (and CPU keeps running at the 
incremental difference since I killed the client).  Is there any easy way to 
figure out what the process is working on?

Could compaction cause this?  Basically, have I hit a point where it needs to 
merge SSTables to a higher level, and there isn't enough RAM to pull the data 
set into memory?
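
One rough way to see what a busy process is working on, from riak attach on an 
affected node (a sketch; nothing here is Riak-specific, and the heap sizes are 
in words, not bytes):

    %% rank live processes by total heap size, then ask the biggest ones
    %% what they are currently running
    Heaps = [{P, H} || P <- erlang:processes(),
                       {total_heap_size, H} <- [erlang:process_info(P, total_heap_size)]].
    Top = lists:sublist(lists:reverse(lists:keysort(2, Heaps)), 5).
    [{P, erlang:process_info(P, [current_function, initial_call, message_queue_len])}
     || {P, _} <- Top].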

Paul



Paul Ingalls
Founder & CEO Fanzo
p...@fanzo.me
@paulingalls
http://www.linkedin.com/in/paulingalls



On Aug 5, 2013, at 2:58 PM, Paul Ingalls p...@fanzo.me wrote:

 Hey Kresten,
 
 Thanks for the response!
 
 I learned my lesson on setting bucket properties.  So all buckets currently 
 use the defaults.
 
 here is the output from one of our nodes:
 
 total 40
 drwxr-xr-x 2 root root  4096 Aug  5 21:10 ./
 drwxr-xr-x 6 root root  4096 Aug  4 17:26 ../
 -rw-r--r-- 1 root root 14187 Aug  4 17:37 
 riak_core_ring.default.20130804173753
 -rw-r--r-- 1 root root 14586 Aug  5 21:10 
 riak_core_ring.default.20130805211043
 
 another node:
 
 total 72
 drwxr-xr-x 2 root root  4096 Aug  5 21:10 ./
 drwxr-xr-x 6 root root  4096 Aug  4 17:26 ../
 -rw-r--r-- 1 root root 14187 Aug  4 17:37 
 riak_core_ring.default.20130804173753
 -rw-r--r-- 1 root root 14358 Aug  5 20:56 
 riak_core_ring.default.20130805205650
 -rw-r--r-- 1 root root 14529 Aug  5 21:09 
 riak_core_ring.default.20130805210924
 -rw-r--r-- 1 root root 14586 Aug  5 21:10 
 riak_core_ring.default.20130805211043
 
 looks like the largest number of files in that directory on any node is 5
 
 Paul
 
 Paul Ingalls
 Founder & CEO Fanzo
 p...@fanzo.me
 @paulingalls
 http://www.linkedin.com/in/paulingalls
 
 
 
 On Aug 5, 2013, at 2:52 PM, Kresten Krab Thorup k...@trifork.com wrote:
 
 I'd think the large #buckets could be the issue, especially if any bucket 
 properties are being set, because that would cause the ring data structure to 
 be enormous.
 
 Could you provide an ls -l output of the riak data/ring directory?
 
 Sent from my iPhone
 
 On 05/08/2013, at 21.52, Paul Ingalls 
 p...@fanzo.me wrote:
 
 As promised in a previous email, I hit a fairly big problem over the weekend 
 and then reproduced it this morning, and I was wondering if I could get some 
 help.
 
 Basically, I was running my code against our Riak cluster and everything was 
 moving along just fine.  However, at some point Riak just seems to hit a 
 wall.  I get to a certain scale of content, about 9-10 GB per node, and 
 about half the cluster gives up the ghost.
 
 I've done this twice from scratch now.  The first time I thought maybe I was 
 trying to push too many transactions per second.  But the second time I 
 reduced the speed and changed some settings in the app.config that I thought 
 would reduce memory usage.  Ran it again, and 24 hrs later I still hit the 
 wall (almost the same place).
 
 So, I figure I'm still doing something wrong; I'm just not sure what.  I'm 
 obviously hitting some kind of heap issue, but I'm not sure what else I can 
 do about it.  I'm hoping there is something obvious I'm missing in my 
 ignorance, because at the moment I'm stuck...
 
 Some details:
 
 7 node cluster running 1.4
 each VM has 7GB RAM and 4 CPUs
 the data directory is on a RAID0 with 750GB of space
 128 partitions, levelDB backend
 using links and secondary indexes
 lots of buckets (over 10 million); a couple of buckets have lots of keys (one 
 was around 6.6 million keys when it crashed, the other around 3.7 million).
 Values are pretty small, almost all are just a few bytes.  There is one 
 bucket, the largest (6.6 million keys), with value sizes between 1-2k.
 
 primary custom app config settings:
 {kernel,
   [
   {inet_dist_listen_min, 6000},
   {inet_dist_listen_max, 7999}
   ]},
 
 {riak_core, [
   {default_bucket_props, [
 {allow_mult, true},
 {r, 1},
 {w, 1},
 {dw, 1},
 {dw, 1}
]},
  {ring_creation_size, 128},
 ]},
 
 {riak_kv, [
{storage_backend, riak_kv_eleveldb_backend},
 ]}
 
 {eleveldb, [
 {max_open_files, 32}
]},
 
 custom vm.args settings
 +A 16
 
 Here is some of the error information I am seeing when it all goes boom:
 
 *
 * On one of the nodes that crashes:
 
 last lines of console.log
 ---
 2013-08-05 17:59:08.071 [info] 

Re: erlang go boom

2013-08-05 Thread Evan Vigil-McClanahan
Given your leveldb settings, I think that compaction is an unlikely
culprit.  But check this out:

2013-08-05 18:01:15.878 [info] 0.83.0@riak_core_sysmon_
handler:handle_event:92 monitor large_heap 0.14832.557
[{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}]
[{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,52},{old_heap_size,0},{heap_size,81956791}]

That's a 78MB heap in encode object... Unless your objects are big, I
would suspect sibling explosion caused by rapid updates at w = 1.
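
For illustration, this is roughly how siblings pile up when a key is updated 
rapidly without the fetched vclock while allow_mult is true (a sketch using the 
Erlang client; bucket, key and values are made up):

    {ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),
    [begin
         Obj = riakc_obj:new(<<"bucket">>, <<"hot-key">>, term_to_binary({update, N})),
         ok = riakc_pb_socket:put(Pid, Obj)            %% no vclock supplied
     end || N <- lists:seq(1, 50)],
    {ok, Got} = riakc_pb_socket:get(Pid, <<"bucket">>, <<"hot-key">>),
    riakc_obj:value_count(Got).                         %% roughly one sibling per blind write

Each sibling keeps its own copy of the value inside the same riak_object, so a 
hot key written this way grows quickly, and every read of that key has to encode 
the whole thing, which would match the large heaps showing up in the get FSM and 
encode path above.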

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: erlang go boom

2013-08-05 Thread Paul Ingalls
Interesting.  I have sibling resolution code on the client side.  Would sibling 
explosion take out the entire cluster all at once?  Within 5 minutes of my last 
email, the rest of the cluster died.

Is there a way to quickly figure out whether the cluster is full of siblings?
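
One quick spot check, from riak attach on a surviving node (bucket and key below 
are placeholders):

    {ok, C} = riak:local_client(),
    {ok, Obj} = C:get(<<"bucket">>, <<"some-busy-key">>),
    riak_object:value_count(Obj).   %% much more than 1-2 means siblings are piling up

If memory serves, riak-admin status also exposes the sibling counts seen by the 
get FSMs (the node_get_fsm_siblings_* stats), which gives a cluster-level view 
without poking at individual keys.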

Paul Ingalls
Founder & CEO Fanzo
p...@fanzo.me
@paulingalls
http://www.linkedin.com/in/paulingalls

On Aug 5, 2013, at 8:07 PM, Evan Vigil-McClanahan emcclana...@basho.com wrote:

 Given your leveldb settings, I think that compaction is an unlikely
 culprit.  But check this out:
 
 2013-08-05 18:01:15.878 [info] 0.83.0@riak_core_sysmon_
 handler:handle_event:92 monitor large_heap 0.14832.557
 [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}]
 [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,52},{old_heap_size,0},{heap_size,81956791}]
 
 That's a 78MB heap in encode object... Unless your objects are big, I
 would suspect sibling explosion caused by rapid updates at w = 1.

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: erlang go boom

2013-08-05 Thread Jeremy Ong
On the client you could extract the value_count of the objects you
read and just log them. Feel free to post code too, in particular, how
you are writing out updated values.
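
In Erlang-client terms (Paul's code is Java, so this is just a sketch of the 
idea; the module and names are made up), that logging could look like:

    -module(sibling_log).
    -export([logged_get/3]).

    %% fetch a key and log how many sibling values came back with it
    logged_get(Pid, Bucket, Key) ->
        {ok, Obj} = riakc_pb_socket:get(Pid, Bucket, Key),
        case riakc_obj:value_count(Obj) of
            N when N > 1 -> error_logger:warning_msg("~p/~p has ~p siblings~n",
                                                     [Bucket, Key, N]);
            _            -> ok
        end,
        {ok, Obj}.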

On Mon, Aug 5, 2013 at 9:20 PM, Paul Ingalls p...@fanzo.me wrote:
 Interesting.  I have sibling resolution code on the client side.  Would
 sibling explosion take out the entire cluster all at once?  Within 5 minutes
 of my last email, the rest of the cluster died.

 Is there a way to quickly figure out whether the cluster is full of
 siblings?

 Paul Ingalls
 Founder & CEO Fanzo
 p...@fanzo.me
 @paulingalls
 http://www.linkedin.com/in/paulingalls

 On Aug 5, 2013, at 8:07 PM, Evan Vigil-McClanahan emcclana...@basho.com
 wrote:

 Given your leveldb settings, I think that compaction is an unlikely
 culprit.  But check this out:

 2013-08-05 18:01:15.878 [info] 0.83.0@riak_core_sysmon_
 handler:handle_event:92 monitor large_heap 0.14832.557
 [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}]
 [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,52},{old_heap_size,0},{heap_size,81956791}]

 That's a 78MB heap in encode object... Unless your objects are big, I
 would suspect sibling explosion caused by rapid updates at w = 1.



 ___
 riak-users mailing list
 riak-users@lists.basho.com
 http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: erlang go boom

2013-08-05 Thread Paul Ingalls
I'm currently using the Java client and its ConflictResolver and Mutator 
interfaces.  In some cases I am just doing a store, letting the client do an 
implicit fetch and the mutator make the actual change.  In other cases I'm doing 
an explicit fetch, modifying the result, and then doing a store without fetch.  
For both, I am using a ConflictResolver, and I have added a field for the 
vclock and used the annotation in my beans.
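
For comparison, the same fetch / resolve / mutate / store-with-vclock cycle 
sketched with the Erlang client (pick_winner/1 stands in for real resolution 
logic):

    -module(resolve_store).
    -export([update/4]).

    update(Pid, Bucket, Key, MutateFun) ->
        {ok, Obj0} = riakc_pb_socket:get(Pid, Bucket, Key),
        Resolved = case riakc_obj:get_values(Obj0) of
                       [Single] -> Single;
                       Siblings -> pick_winner(Siblings)
                   end,
        [MD | _] = riakc_obj:get_metadatas(Obj0),        %% reuse one sibling's metadata
        Obj1 = riakc_obj:update_metadata(
                 riakc_obj:update_value(Obj0, MutateFun(Resolved)), MD),
        %% Obj1 still carries the vclock from the fetch, so this write supersedes
        %% the siblings instead of adding to them
        riakc_pb_socket:put(Pid, Obj1).

    %% placeholder resolver: deterministic but meaningless; a real one merges domain data
    pick_winner(Siblings) -> hd(lists:sort(Siblings)).

The important part is that the store carries the vclock from the fetch; storing 
without it is what leaves siblings behind when allow_mult is true.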

I'll try pointing a local client at the cluster to see if I can see how many 
siblings there are.  That is, if I can get the cluster running again...:)

Paul

Paul Ingalls
Founder & CEO Fanzo
p...@fanzo.me
@paulingalls
http://www.linkedin.com/in/paulingalls

On Aug 5, 2013, at 9:36 PM, Jeremy Ong jer...@quarkgames.com wrote:

 On the client you could extract the value_count of the objects you
 read and just log them. Feel free to post code too, in particular, how
 you are writing out updated values.
 
 On Mon, Aug 5, 2013 at 9:20 PM, Paul Ingalls p...@fanzo.me wrote:
 Interesting.  I have sibling resolution code on the client side.  Would
 sibling explosion take out the entire cluster all at once?  Within 5 minutes
 of my last email, the rest of the cluster died.
 
 Is there a way to quickly figure out whether the cluster is full of
 siblings?
 
 Paul Ingalls
 Founder & CEO Fanzo
 p...@fanzo.me
 @paulingalls
 http://www.linkedin.com/in/paulingalls
 
 On Aug 5, 2013, at 8:07 PM, Evan Vigil-McClanahan emcclana...@basho.com
 wrote:
 
 Given your leveldb settings, I think that compaction is an unlikely
 culprit.  But check this out:
 
 2013-08-05 18:01:15.878 [info] 0.83.0@riak_core_sysmon_
 handler:handle_event:92 monitor large_heap 0.14832.557
 [{initial_call,{riak_kv_get_fsm,init,1}},{almost_current_function,{riak_object,encode_maybe_binary,1}},{message_queue_len,1}]
 [{old_heap_block_size,0},{heap_block_size,116769640},{mbuf_size,0},{stack_size,52},{old_heap_size,0},{heap_size,81956791}]
 
 That's a 78MB heap in encode object... Unless your objects are big, I
 would suspect sibling explosion caused by rapid updates at w = 1.
 
 
 
 ___
 riak-users mailing list
 riak-users@lists.basho.com
 http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
 

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com