Greetings,
I am currently responsible for tuning Google's leveldb implementation for Riak.
I have read through most of the thread and have a couple of information
requests. Then I will try to address various questions and comments from the
thread. In general, you are filling leveldb faster than its background
compaction (optimization) can keep up. I am willing to work with you to figure
out why and what can be done about it.
Questions / requests:
1. Execute the following on one of the servers:
sort /home/riak/leveldb/*/LOG* >log_jan.txt
Tar/gzip the log_jan.txt and email it back.
2. Execute the following on one of the servers:
grep -i flags /proc/cpuinfo
Include the output (actually just one line will do) in a reply.
3. On a running server that is processing data, execute:
grep -i swap /proc/meminfo
Include the full output (3 lines) in a reply.
4. Pick a server, then one directory in /home/riak/leveldb. Select 3 of the
largest *.sst files. Tar/gzip those and email back.
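For request 4, a minimal sketch of one way to find the three largest files. The helper name largest_sst is made up for illustration and is not part of Riak or leveldb. It assumes the directory holds at least one *.sst file and that file names contain no newlines.

```shell
# Hypothetical helper: print the names of the three largest .sst files
# in the directory given as $1, largest first.
largest_sst() {
    # ls -S sorts by file size, largest first; head keeps the top three.
    # The subshell keeps the cd from affecting the caller.
    ( cd "$1" && ls -S -- *.sst | head -n 3 )
}

# Example use (substitute one real subdirectory of /home/riak/leveldb for DIR;
# -T - makes tar read the file names from standard input):
#   largest_sst "DIR" | tar -czf sst_samples.tar.gz -C "DIR" -T -
```

The function only lists names, so the selection can be checked by eye before anything is archived.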
Notes about other messages on this thread:
a. the gdb stack traces are nice! They clearly indicate that leveldb has
intentionally entered a "stall" state because compaction is not keeping up with
the input stream. Riak 1.2.1rc1 contains code that attempts to slow the write
rate to allow the background compactions to catch up. It is not working in
your case.
b. there is a performance bug in the cache code, though it is not your main
problem. This is why Evan asked you to reduce the cache size from 377,487,360.
Yes, I created the bug and will get it addressed soon.
c. the compaction process is disk and CPU intensive. The fact that your CPUs
are not heavily loaded, yet the client/request code is stalled waiting for
compaction to catch up, suggests the disk is thrashing / could use some help.
Again, this is why Evan had you adjust some configuration settings there.
d. your comment about using O_NOATIME is valid. The issue is that the flag is
relatively new. We are supporting some really old compilers and Linux/Solaris
versions. It is easier to ask everyone to set noatime at the mount level than
to have conditional code for some and mount level tuning for others. But your
comment is still correct.
e. a non-zero sized lost/BLOCKS.bad means data corruption. It looks like you
already figured that out. Either the crc code or the decompression code found
an issue during compaction and moved the bad data to the side.
f. max_open_files in 1.1 was a hard limit on the number of open files per
vnode (per subdirectory in /home/riak/leveldb). 1.2 treats the number more as
a suggestion for per-file memory consumption. A future release will drop the
option and substitute something like "file_cache_size". Memory is the critical
resource, not file handles (at least for Riak … I am told Google uses this code
in Android, so it might be critical there).
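Regarding item d, a minimal sketch for checking whether a mount point already has noatime set. The helper name has_noatime is made up for illustration. It assumes the Linux /proc/mounts layout (device, mount point, filesystem type, options), and you need to supply whichever mount point actually holds /home/riak/leveldb.

```shell
# Hypothetical helper: has_noatime MOUNTS_FILE MOUNT_POINT
# Succeeds (exit status 0) if MOUNT_POINT is listed in MOUNTS_FILE with the
# noatime option, and fails (exit status 1) otherwise.
has_noatime() {
    # Field 2 is the mount point, field 4 the comma-separated option list.
    # The pattern matches noatime only as a whole option.
    awk -v mp="$2" '
        $2 == mp && $4 ~ /(^|,)noatime(,|$)/ { found = 1 }
        END { exit !found }
    ' "$1"
}

# Example use (the mount point shown is an assumption, adjust to your layout):
#   has_noatime /proc/mounts /home && echo "noatime is set"
```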
What issues did I miss?
Matthew