Hi Johnny,

Since this seems to happen regularly on one node in your cluster (not
necessarily the same node each time), do you have a repetitive process
that performs a *lot* of updates or deletes on a single key and that
could be correlated with these merges?
--
Luke Bakken
Engineer
lbak...@basho.com


On Wed, Jun 15, 2016 at 10:22 AM, Johnny Tan <johnnyd...@gmail.com> wrote:
> We're running riak-1.4.2
>
> Every few weeks, one of our riak nodes slowly fills up its disk over
> several days, and then suddenly gains that space back again.
>
> In looking into this more today, I think I see what's going on.
>
> Per the console.log on a node where it's happening right now, there is an
> unusually large number of merges in progress. There are 6 nodes in our
> cluster, and today it's only happening to this node. (In previous weeks,
> it's been other nodes, but it's always been one node at a time.)
>
> Normally, we get 50-70 merges per day per node (according to various nodes'
> console.log, including the node in question). Yesterday and today, the node
> in question has logged several hundred merges.
>
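
To put numbers on the "50-70 per day" baseline versus the bad days, the
merge entries can be counted per day (or per hour, to line them up against
any periodic job). A minimal Python sketch, assuming each merge entry starts
on a line shaped like the two samples quoted further down in this message.
count_merges is a made-up helper name, not part of Riak:

```python
import re
from collections import Counter
from typing import Iterable

# Matches the start of a merge entry, e.g.
# "2016-06-10 05:27:39.426 UTC [info] <0.15230.160> Merged"
# The exact layout is an assumption based on the samples in this thread.
MERGE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2}\.\d+ UTC \[info\] <[\d.]+> Merged\b"
)


def count_merges(lines: Iterable[str], by_hour: bool = False) -> Counter:
    """Count merge entries per day, or per hour when by_hour is True."""
    counts: Counter = Counter()
    for line in lines:
        m = MERGE_RE.match(line)
        if m:
            day, hour = m.group(1), m.group(2)
            key = f"{day} {hour}:00" if by_hour else day
            counts[key] += 1
    return counts
```

Feeding it an open console.log (for example count_merges(open(path)), with
path pointing at wherever the node writes its log) gives one counter entry
per day. With by_hour=True, a job that runs on a schedule should show up as
spikes at the same hours.
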
> When I look inside the bitcask directory, I see a lot of files with this set
> of permissions:
> -rwSrw-r--
>
> My understanding is that those are files marked for deletion after bitcask
> merging.
>
> The number of those files is currently growing, and from a spot-check, they
> indeed match up as the files that have been merged.
>
> So it seems the two are related: a lot of merges are happening, which then
> causes a large number of files to be marked for deletion, and those marked
> files are piling up and not getting deleted for some reason.
>
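
The capital S in -rwSrw-r-- is the setuid bit set on a file without owner
execute permission, so the pile-up can be tracked by counting regular files
with that bit. A small Python sketch; that the bit means "marked for
deletion" is the understanding stated above, and both function names are
invented for illustration:

```python
import os
import stat
from typing import List


def is_marked_for_deletion(mode: int) -> bool:
    """True for a regular file with the setuid bit set (the 'S' in -rwSrw-r--)."""
    return stat.S_ISREG(mode) and bool(mode & stat.S_ISUID)


def marked_files(root: str) -> List[str]:
    """Return sorted paths under root whose mode has the setuid bit set."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except FileNotFoundError:
                continue  # file was removed between listing and stat
            if is_marked_for_deletion(mode):
                hits.append(path)
    return sorted(hits)
```

Running len(marked_files("/var/lib/riak/bitcask")) every few minutes and
logging the result would show whether the marked files grow steadily or in
bursts.
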
> If I don't do anything, those files eventually get deleted, and everything
> is good again for another couple weeks until it happens to another node. But
> the disk usage does get high enough to alert us, and obviously we don't want
> it to get anywhere near 100%.
>
>
> I'm trying to figure out why there are times when this happens. One thing I
> noticed is a difference in the merge log entries.
>
> Here's one from a "normal" day; nearly all the entries for that day are
> roughly this same length and took about the same amount of time to merge:
> 2016-06-10 05:27:39.426 UTC [info] <0.15230.160> Merged
> {["/var/lib/riak/bitcask/890602560248518965780370444936484965102833893376/84000.bitcask.data","/var/lib/riak/bitcask/890602560248518965780370444936484965102833893376/83999.bitcask.data"],[]}
> in 11.902028 seconds.
>
> But here's one from today on the problematic node:
> 2016-06-15 17:13:40.626 UTC [info] <0.17903.500> Merged
> {["/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83633.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83632.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83631.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83630.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83629.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83628.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83627.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83626.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83625.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83624.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83623.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83622.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83621.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83620.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83619.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83618.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83617.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83616.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83615.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83614.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83613.bitcask.data","/var/lib/riak/bitcask/12331420064979493372343590776043637978346930...",...],...}
> in 220.186043 seconds.
>
> It's not just that the merge takes roughly 20x longer; it also seems to be
> merging a lot more files at once.
>
> What is going on?
>
> I'm not sure how much of the app.config is relevant, but I'll at least paste
> just the bitcask and merge sections for now:
>         {bitcask, [
>                 {data_root, "/var/lib/riak/bitcask"},
>                 {dead_bytes_merge_trigger, 268435456},
>                 {dead_bytes_threshold, 67108864},
>                 {frag_merge_trigger, 60},
>                 {frag_threshold, 40},
>                 {io_mode, erlang},
>                 {max_file_size, 1073741824},
>                 {small_file_threshold, 134217728}
>         ]},
>         {merge_index, [
>                 {buffer_rollover_size, 1048576},
>                 {data_root, "/var/lib/riak/merge_index"},
>                 {max_compact_segments, 20}
>         ]},
>
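
For reference, the byte values in the bitcask section work out to 256 MiB
(dead_bytes_merge_trigger), 64 MiB (dead_bytes_threshold), 1 GiB
(max_file_size) and 128 MiB (small_file_threshold). The Python sketch below
models how these settings interact as the Bitcask configuration docs
describe them: a "trigger" decides whether a merge starts, and a "threshold"
decides which files get pulled into it. It is a simplified model, not
bitcask's actual code. FileStats and both functions are hypothetical names,
and the exact boundary comparisons (>= versus >) are an assumption.

```python
from dataclasses import dataclass

MIB = 2**20

# Values copied from the bitcask section of the app.config quoted above.
BITCASK = {
    "dead_bytes_merge_trigger": 268435456,  # 256 MiB
    "dead_bytes_threshold": 67108864,       # 64 MiB
    "frag_merge_trigger": 60,               # percent
    "frag_threshold": 40,                   # percent
    "max_file_size": 1073741824,            # 1 GiB
    "small_file_threshold": 134217728,      # 128 MiB
}


@dataclass
class FileStats:
    """Hypothetical per-data-file statistics used by this model."""
    size: int         # total bytes in the data file
    dead_bytes: int   # bytes occupied by dead (overwritten or deleted) keys
    frag_pct: float   # dead keys as a percentage of all keys in the file


def triggers_merge(f: FileStats, cfg: dict = BITCASK) -> bool:
    """A single file crossing either trigger is enough to start a merge."""
    return (f.frag_pct >= cfg["frag_merge_trigger"]
            or f.dead_bytes >= cfg["dead_bytes_merge_trigger"])


def included_in_merge(f: FileStats, cfg: dict = BITCASK) -> bool:
    """Once a merge runs, files over a threshold or under the small-file size join it."""
    return (f.frag_pct >= cfg["frag_threshold"]
            or f.dead_bytes >= cfg["dead_bytes_threshold"]
            or f.size < cfg["small_file_threshold"])
```

Under this model, one file crossing a trigger can pull many other files
into the same merge, for example every file smaller than 128 MiB. That may
be part of why a single merge entry lists so many files, though the log
alone doesn't confirm it.
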
>
> Thanks for any insight,
> johnny
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
