I was in the Riak 1.2 webinar earlier today and asked a LevelDB question about insertion order and durability vs. Bitcask's WOL architecture. Joe was not able to get to my question then but took the time to write me a detailed answer. Great engineers at Basho taking time to answer questions is a great thing. Thanks Joe!
-Alexander Sicular
@siculars

Begin forwarded message:

> From: Joseph Blomstedt <j...@basho.com>
> Subject: LevelDB
> Date: August 21, 2012 3:45:45 PM EDT
> To: sicul...@gmail.com
>
> Alexander,
>
> I noticed your LevelDB question in the webinar as Reem was closing
> things out, so I figured I'd follow up via email.
>
> As you know, Bitcask maintains a strict set of write-logs and an
> in-memory hash table that maps keys to (file, offset). Pretty
> straightforward. Compaction is a separate thing that happens based on
> independent triggers.
>
> LevelDB is rather different. LevelDB does maintain a WAL, but it's
> short-lived and only for crash recovery. LevelDB writes to the WAL,
> but also keeps the object in an in-memory write buffer (configurable
> size, increased in Riak 1.2 by 10x from Riak 1.1). After the buffer
> becomes full, LevelDB writes the data to disk as a Level-0 SST (data
> in sorted order + sorted index at the end of the file).
>
> There can be multiple Level-0 SSTs. To read a key, LevelDB looks at
> the index in each SST starting from newest file to oldest. For
> performance, there's an LRU cache of indexes so you're not always
> hitting disk. LevelDB now also includes bloom filters (used in Riak
> 1.2) to make it easier to skip non-interesting SSTs.
>
> To make things more efficient, LevelDB does compaction/merging in a
> background thread. A set of Level-0 files will be selected and merged
> together into a larger Level-1 file. The format is the same, but the
> file is now larger and includes the data from multiple Level-0 files.
> The original Level-0 files are then removed. Likewise, Level-1 files
> are merged into Level-2 files, and Level-2 into Level-3, etc. Each
> Level having larger files with a greater chunk of adjacent, sorted
> data.
>
> To read, you check newest to oldest on Level 0, then Level 1, then Level 2,
> etc.
>
> While compaction is a background thing, LevelDB limits the number of
> Level-0 files you can have.
> If you hit the limit, LevelDB will block
> writes until files have been merged into Level-1. With a single
> compaction thread, it was easy to max out LevelDB in Riak 1.1, and
> these stalls were fairly frequent and hurt latencies at the 95th
> percentile and up, as well as greatly hurt throughput. Our change to
> use multiple compaction threads has greatly improved how quickly
> compaction occurs, and writes rarely (if ever) end up stalling. To
> further improve things, there's the adaptive write throttling that I
> mentioned that will slow down writes (increased latency) in order to
> ensure compaction isn't heavily affected and remains ahead of write
> traffic -- thus, further preventing stalls. Net effect is somewhat
> higher latency and lower throughput that is more consistent (i.e.,
> latencies at the 95th percentile and up are tighter around the
> average latency).
>
> I hope this answers your question.
>
> -Joe
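
For readers who think better in code, the write and read path described in the message above can be sketched as a toy model in Python. This is only an illustration of the flow (WAL plus write buffer, flush to a Level-0 SST, newest-to-oldest reads, merge into Level-1, stall when Level-0 is full). It is not LevelDB's or Riak's API: the class name, method names, and the `buffer_limit` and `l0_limit` parameters are all made up for the example, everything lives in memory, and a stall is modeled as an exception rather than a blocked writer. The index, LRU cache, bloom filters, and write throttling are left out.

```python
# Toy, in-memory model of the flow described in the email.
# Every name below is hypothetical; none of this is LevelDB's actual API.

class WriteStall(Exception):
    """Stands in for the point where real LevelDB would block the writer."""


class ToyLSM:
    def __init__(self, buffer_limit=2, l0_limit=2):
        self.buffer_limit = buffer_limit  # keys held in memory before a flush
        self.l0_limit = l0_limit          # Level-0 files allowed before writes stall
        self.wal = []                     # short-lived log, only for crash recovery
        self.buffer = {}                  # in-memory write buffer
        self.levels = [[], []]            # levels[0] = Level-0 SSTs (newest last), levels[1] = Level-1

    def put(self, key, value):
        if len(self.levels[0]) >= self.l0_limit:
            raise WriteStall("Level-0 is full; compaction has to catch up first")
        self.wal.append((key, value))     # log first, then buffer
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_limit:
            self._flush()

    def _flush(self):
        # A full buffer becomes one Level-0 "SST": its entries in sorted key order.
        self.levels[0].append(sorted(self.buffer.items()))
        self.buffer.clear()
        self.wal.clear()                  # flushed data no longer needs the log

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for level in self.levels:         # Level 0 first, then Level 1
            for sst in reversed(level):   # newest file to oldest
                for k, v in sst:
                    if k == key:
                        return v
        return None

    def compact(self):
        # Merge Level-0 files into one larger Level-1 file. Files are applied
        # oldest first so that newer values overwrite older ones.
        merged = {}
        for sst in self.levels[1] + self.levels[0]:
            merged.update(sst)
        self.levels[1] = [sorted(merged.items())] if merged else []
        self.levels[0] = []               # the original Level-0 files are removed
```

With the default limits, four `put` calls produce two Level-0 files, a fifth `put` raises `WriteStall`, and calling `compact()` folds both files into a single Level-1 file so writes can proceed again. In the real system the compaction runs in background threads and the writer blocks instead of failing.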
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com