Hi Johnny, Since this seems to happen regularly on one node on your cluster (not necessarily the same node), do you have a repetitive process that performs a *lot* of updates or deletes on a single key that could be correlated to these merges? -- Luke Bakken Engineer lbak...@basho.com
On Wed, Jun 15, 2016 at 10:22 AM, Johnny Tan <johnnyd...@gmail.com> wrote: > We're running riak-1.4.2 > > Every few weeks, we have a riak node that starts to slowly fill up on disk > space for several days, and then suddenly gain that space back again. > > In looking into this more today, I think I see what's going on. > > Per the console.log on a node that it's happening to right now, there are an > unusually large amount of merges happening right now. There are 6 total > nodes in our cluster, it's only happening to this node today. (In previous > weeks, it's been other nodes, but it's always been one node at a time.) > > Normally, we get 50-70 merges per day per node (according to various nodes' > console.log, including the node in question). Yesterday and today, the node > in question has several hundred merges happening. > > When I look inside the bitcask directory, I see a lot of files with this set > of permissions: > -rwSrw-r-- > > My understanding is that those are files marked for deletions after bitcask > merging. > > The number of those files is currently growing, and from a spot-check, they > indeed match up as the files that have been merged. > > So it seems the two are related: a lot of merges are happening, which then > causes a large number of files to be marked for deletion, and those marked > files are piling up and not getting deleted for some reason. > > If I don't do anything, those files eventually get deleted, and everything > is good again for another couple weeks until it happens to another node. But > the disk usage does get high enough to alert us, and obviously we don't want > it to get anywhere near 100%. > > > I'm trying to figure out why there are times when this happens. One thing I > noticed is a difference in the merge log entries. > > Here's one from a "normal" day, nearly all the entries for that day are > roughly this same length and same amount of time merging: > 2016-06-10 05:27:39.426 UTC [info] <0.15230.160> Merged > {["/var/lib/riak/bitcask/890602560248518965780370444936484965102833893376/84000.bitcask.data","/var/lib/riak/bitcask/890602560248518965780370444936484965102833893376/83999.bitcask.data"],[]} > in 11.902028 seconds. > > But here's one from today on the problematic node: > 2016-06-15 17:13:40.626 UTC [info] <0.17903.500> Merged > {["/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83633.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83632.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83631.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83630.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83629.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83628.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83627.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83626.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83625.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83624.bitcask.data","/var/lib/riak/bitcask/12331420064979493372343590776043637978346 93083136/83623.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83622.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83621.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83620.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83619.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83618.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83617.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83616.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83615.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83614.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83613.bitcask.data","/var/lib/riak/bitcask/123314200649794 93372343590776043637978346930...",...],...} > in 220.186043 seconds. > > It's not just that it takes 20x longer to merge, but it seems to be doing a > lot more files at once. > > What is going on? > > I'm not sure how much of the app.config is relevant, but I'll at least paste > just the bitcask and merge sections for now: > {bitcask, [ > {data_root, "/var/lib/riak/bitcask"}, > {dead_bytes_merge_trigger, 268435456}, > {dead_bytes_threshold, 67108864}, > {frag_merge_trigger, 60}, > {frag_threshold, 40}, > {io_mode, erlang}, > {max_file_size, 1073741824}, > {small_file_threshold, 134217728} > ]}, > {merge_index, [ > {buffer_rollover_size, 1048576}, > {data_root, "/var/lib/riak/merge_index"}, > {max_compact_segments, 20} > ]}, > > > Thanks for any insight, > johnny > > _______________________________________________ > riak-users mailing list > riak-users@lists.basho.com > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com > _______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com