Hi Matthew,

On Sunday, 6 July 2014 at 3:04, Matthew Von-Maszewski wrote: 
> Tom,
> 
> Basho prides itself on quickly responding to all user queries. I have failed 
> that tradition in this case. Please accept my apologies.
No problem; I appreciate you taking the time to look into our LOG.
 
> 
> The LOG data suggests leveldb is not stalling, especially not for 4 hours. 
> Therefore the problem is related to disk utilization.

That matches our experience: leveldb itself is working hard on disk 
operations whilst Riak fails to respond to anything, causing an apparent 
'stall' from the client application's perspective.

> You appear to have large values. I see .sst files where the average value is 
> 100K to 1Mbyte in size. Is this intentional, or might you have a sibling 
> problem?
Yes, we have a split between very small (headers only, no body) items and 1MB 
binary chunks.  If we had our time again we'd probably use multi-backend to 
store these 1MB chunks in bitcask and keep leveldb for the small body-less 
items which require 2i.

> My assessment is that your lower levels are full and therefore cascading 
> regularly. "cascading" is like the typical champagne glass pyramid you see at 
> weddings. Once all the glasses are full, new champagne at the top causes each 
> subsequent layer to overflow into the one below that. You have the same 
> problem, but with data. 
> 
> Your large values have filled each of the lower levels and regularly cause 
> cascading data between multiple levels. The cascading is causing each 100K 
> value write to become the equivalent of a 300K or 500K value as levels 
> overflow. This cascading is chewing up your hard disk performance (by 
> reducing the amount of time the hard drive has available for read requests).
By increasing the size of the lower levels (as you show below), does this mean 
there's more capacity for writes to accumulate in those levels before compaction 
is triggered, and hence they are compacted less frequently?

I guess this turns your champagne fountain analogy into more of a 'tipping 
bucket', where the data no longer 'flows' through the levels but instead builds 
up in each level before tipping into the next once it reaches capacity?
(pictorial representation: 
http://4.bp.blogspot.com/_DUDhlpPD8X8/SIcN8D66j9I/AAAAAAAAASs/2Va3_n3vamk/s400/23157087_261a5da413.jpg)
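
To check my own understanding of the cascade, here's a toy back-of-the-envelope 
sketch in C++. The level limits are completely made-up round numbers (not the 
version_set.cc values), and it ignores that real compactions also merge the 
overlapping .sst files in the level below, so real write amplification is higher 
than this suggests. The point is just that once the lower levels are already 
sitting at their limits, a single 100K write gets rewritten at every level it 
overflows through:

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Toy model only: limits are invented, and real leveldb compactions also
    // rewrite overlapping files in the target level, amplifying writes further.
    int main() {
        std::vector<uint64_t> limit = {10000000ULL, 100000000ULL, 1000000000ULL};
        std::vector<uint64_t> bytes = {10000000ULL, 100000000ULL, 500000000ULL}; // L0/L1 already full

        uint64_t incoming = 100 * 1024;   // one 100K value arrives at level 0
        uint64_t disk_writes = incoming;  // it is written once into L0
        bytes[0] += incoming;

        for (size_t lvl = 0; lvl + 1 < bytes.size(); ++lvl) {
            if (bytes[lvl] <= limit[lvl]) break;      // under the limit: no cascade
            uint64_t overflow = bytes[lvl] - limit[lvl];
            bytes[lvl] -= overflow;
            bytes[lvl + 1] += overflow;               // pushed down a level ...
            disk_writes += overflow;                  // ... and rewritten to disk again
        }

        std::cout << incoming << " bytes accepted, "
                  << disk_writes << " bytes written to disk\n";
        return 0;
    }

If that mental model is right, then raising the per-level limits simply lets each 
level absorb more data before anything overflows, which I take to be the point of 
the 2.0 table below.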

> The leveldb code for Riak 2.0 has increased the size of all the levels. The 
> table of sizes is found at the top of leveldb's db/version_set.cc. You could 
> patch your current code if desired with this table from 2.0:
> 
> { 
>   {10485760, 262144000, 57671680, 209715200, 0, 420000000, true}, 
>   {10485760, 82914560, 57671680, 419430400, 0, 209715200, true}, 
>   {10485760, 314572800, 57671680, 3082813440, 200000000, 314572800, false}, 
>   {10485760, 419430400, 57671680, 6442450944ULL, 4294967296ULL, 419430400, false}, 
>   {10485760, 524288000, 57671680, 128849018880ULL, 85899345920ULL, 524288000, false}, 
>   {10485760, 629145600, 57671680, 2576980377600ULL, 1717986918400ULL, 629145600, false}, 
>   {10485760, 734003200, 57671680, 51539607552000ULL, 34359738368000ULL, 734003200, false} 
> }; 
> 
> 
> You cannot take the entire 2.0 leveldb into your 1.4 code base due to various 
> option changes.
I assume leveldb will just 'handle' growing the levels once nodes are restarted 
with the patched table?  I also assume it would not be wise to roll back to the 
smaller level sizes after this has been done?
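
For anyone else reading along, the raw numbers above are hard to eyeball, so 
here's a throwaway snippet that just reprints the numeric columns of the quoted 
table in megabytes. It makes no claim about what each column controls; that's 
documented in db/version_set.cc itself. Nothing clever, just unit conversion:

    #include <cstdint>
    #include <cstdio>

    // The Riak 2.0 table quoted above, minus the trailing bool column.
    // Only converts byte counts to MB so the per-level growth is easier to read.
    int main() {
        const uint64_t levels[7][6] = {
            {10485760, 262144000, 57671680, 209715200, 0, 420000000},
            {10485760, 82914560, 57671680, 419430400, 0, 209715200},
            {10485760, 314572800, 57671680, 3082813440ULL, 200000000, 314572800},
            {10485760, 419430400, 57671680, 6442450944ULL, 4294967296ULL, 419430400},
            {10485760, 524288000, 57671680, 128849018880ULL, 85899345920ULL, 524288000},
            {10485760, 629145600, 57671680, 2576980377600ULL, 1717986918400ULL, 629145600},
            {10485760, 734003200, 57671680, 51539607552000ULL, 34359738368000ULL, 734003200}};

        for (int lvl = 0; lvl < 7; ++lvl) {
            std::printf("level %d:", lvl);
            for (int col = 0; col < 6; ++col)
                std::printf(" %14.1f MB", levels[lvl][col] / (1024.0 * 1024.0));
            std::printf("\n");
        }
        return 0;
    }
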
> Let me know if this helps. I have previously hypothesized that "grooming" 
> compactions should be limited to one thread total. However my test datasets 
> never demonstrated a benefit. Your dataset might be the case that proves the 
> benefit. I will go find the grooming patch to hot_threads for you if the 
> above table proves insufficient.

Do I understand correctly that this would mean compactions continue, but limited 
to a single thread, so that the rest of the application can still respond to 
client requests?  If so, that sounds like it may help a situation like ours, 
although I'd wonder whether the rate-limited compaction would ever "keep up" 
with the incoming data.
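
Thinking out loud about that last point, this is the kind of back-of-the-envelope 
arithmetic I'd use to judge whether a single grooming thread could keep pace. 
Every figure in it is invented purely for illustration; none are measurements 
from our cluster:

    #include <cstdio>

    // Back-of-envelope only: every figure below is invented for illustration,
    // not measured from our cluster.
    int main() {
        double ingest_mb_per_s = 20.0;     // client writes arriving at one node
        double write_amplification = 4.0;  // each byte rewritten ~4x by cascading compactions
        double disk_mb_per_s = 120.0;      // sequential bandwidth left for compaction

        double compaction_load = ingest_mb_per_s * write_amplification;
        std::printf("compaction needs %.0f MB/s, one thread has ~%.0f MB/s -> %s\n",
                    compaction_load, disk_mb_per_s,
                    compaction_load <= disk_mb_per_s ? "keeps up" : "falls behind");
        return 0;
    }

If the ingest rate times the write amplification exceeds what one thread can push 
through the disk, the single groomer falls behind and the backlog grows, so we'd 
want to measure both figures before trying that patch.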

Thanks,
Tom



_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
