Vladimir,

I have created a branch off the 1.3.2 release tag: mv-error-logging-hack
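If you want to try it, the checkout should look roughly like this (a sketch only; the paths assume your eleveldb tree vendors leveldb under c_src and tracks the basho/leveldb repository — adjust to your layout):

    # hypothetical layout; adjust to wherever your eleveldb checkout lives
    cd eleveldb/c_src/leveldb
    git fetch origin
    git checkout mv-error-logging-hack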
This has two changes:

- removes a late fix for database-level locking that was added in 1.3.2 (to see if that code was the source of the problem prior to its fix)
- adds tests of all background file operations and logs errors to syslog (since the LOG handle is not available to them)

When I build a new version of leveldb, I make sure eleveldb also rebuilds. I do this via:

    rm eleveldb/c_src/*.o
    cd eleveldb/c_src/leveldb; make clean

There is a pull request from another community user that makes the entire process cleaner. I just have not had time to review and approve it.

I typically "grep beam /var/log/syslog" on my Debian system. The exact system log file may vary with your Linux distribution.
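Something like the following should cover the common cases (the non-Debian path is an assumption on my part; check your syslog configuration):

    grep beam /var/log/syslog             # Debian/Ubuntu
    grep beam /var/log/messages           # RHEL/CentOS-style syslog
    tail -f /var/log/syslog | grep beam   # watch for new errors live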
Let me know if this finds any bugs.

Matthew


On Jul 25, 2013, at 8:12 PM, Vladimir Shabanov <vshaban...@gmail.com> wrote:

> I prefer the second option, since it will show whether the corrupted blocks are related to the race condition. The first option would need to run for a long time to be completely sure that it really fixes the issue.
>
>
> 2013/7/26 Matthew Von-Maszewski <matth...@basho.com>
> Vladimir,
>
> I apologize for not recognizing your name and previous contribution. I just tend to think in terms of code and performance bottlenecks, not people.
>
> Your June contribution resulted in changes that were released in 1.4 and 1.3.2. I and the team thank you. However, we have not isolated the source of the corruption. We only know today that it does not happen very often. We have a second, high-transaction site that has seen the same issue.
>
> I can offer you two non-release options:
>
> - I have a branch to 1.4.0 that fixes a potential, but unproven, race condition. Details are here:
>
> https://github.com/basho/leveldb/wiki/mv-sst-fadvise
>
> You would have to build eleveldb locally and copy it into your executable tree. The 1.4 leveldb and eleveldb work fine with Riak 1.3.x, should you desire to limit changes to your production environment.
>
> - I have code, soon to be a branch against 1.3.2, that only adds syslog error messages to prove / disprove the race condition. You could take this code and see if it reports problems. This route would help the community (and mostly me) know whether the root cause is within the race condition addressed by the mv-sst-fadvise branch.
>
> The two options above are what I currently have to offer. I am actively working to find the corruption source. The good news is that Riak will naturally recover from a "bad CRC" when detected. The bad news is that the Google defaults let some bad CRCs become good CRCs. Riak 1.4 and 1.3.2 cannot identify those bad CRCs that became good CRCs.
>
> Matthew
>
>
> On Jul 25, 2013, at 4:32 PM, Vladimir Shabanov <vshaban...@gmail.com> wrote:
>
>> Good. Will wait for the doctor.
>>
>> A month ago I mailed about a segmentation fault:
>> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-June/012245.html
>> After looking at the core dumps, you found this problem with CRC checks being skipped. I enabled paranoid_checks and got my node up and running.
>>
>> I've also found that lost/BLOCKS.bad sometimes appears in partitions and have sent you these blocks for further analysis.
>>
>> It's very interesting why the corrupted data appears in the first place. Nodes didn't crash, hardware didn't fail. As I mentioned previously, all my machines have ECC memory and the Riak data is kept on a ZFS filesystem (which also checks CRCs for all the data and doesn't report any CRC errors). So it looks like the data is somehow corrupted by Riak itself.
>>
>> The lost/BLOCKS.bad files are usually small (2-8 KB) and appear very infrequently (once a week, once a month, or never for many partitions). I found these BLOCKS.bad in both data/leveldb and data/anti_entropy. So I suspect there is a bug in LevelDB.
>>
>> Looking at the LOGs, they are created during compactions:
>> "Moving corrupted block to lost/BLOCKS.bad (size 2393)"
>> but there is no more information about what kind of block it is or where it was found.
>>
>> Is it possible to somehow find the source of those BLOCKS.bad files? I'm building Riak from sources; maybe it's possible to enable some additional logging to find out what these BLOCKS.bad are?
>>
>>
>> 2013/7/25 Matthew Von-Maszewski <matth...@basho.com>
>> Vladimir,
>>
>> I can explain what happened, but not how to correct the problem. The gentleman who can walk you through a repair is tied up on another project, but he intends to respond as soon as he is able.
>>
>> We recently discovered / realized that Google's leveldb code does not check the CRC of each block rewritten during a compaction. This means that blocks with bad CRCs get read without being flagged as bad, then rewritten to a new file with a new, valid CRC. The corruption is now hidden.
>>
>> A more thorough discussion of the problem is found here:
>>
>> https://github.com/basho/leveldb/wiki/mv-verify-compactions
>>
>> We added code to the 1.3.2 and 1.4 Riak releases to have the block CRC checked during both read (Get) requests and compaction rewrites. This prevents future corruption from hiding. Unfortunately, it does NOTHING for blocks already corrupted and rewritten with valid CRCs. You are encountering this latter condition. We have a developer advocate / client services person who has walked others through a fix via the Riak data replicas …
>>
>> … please hold and the doctor will be with you shortly.
>>
>> Matthew
>>
>>
>> On Jul 24, 2013, at 9:39 PM, Vladimir Shabanov <vshaban...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Recently I started expanding my Riak cluster and found that handoffs were continuously retried for one partition.
>>>
>>> Here are the logs from two nodes:
>>> https://gist.github.com/vshabanov/41282e622479fbe81974
>>>
>>> The most interesting parts of the logs are
>>> "Handoff receiver for partition ... exited abnormally after processing 2860338 objects: {{badarg,[{erlang,binary_to_term,..."
>>> and
>>> "bad argument in call to erlang:binary_to_term(<<131,104,...."
>>>
>>> Both nodes are running Riak 1.3.2 (the old one was previously running 1.3.1).
>>>
>>> When I printed the corrupted binary string, I found that it corresponds to one value.
>>>
>>> When I tried to "get" it, the read succeeded, but the node with the corrupted value showed the same binary_to_term error.
>>>
>>> When I tried to delete the corrupted value, I got a timeout.
>>>
>>> I'm running machines with ECC memory and a ZFS filesystem (which doesn't report any checksum failures), so I doubt the data was silently corrupted on disk.
>>>
>>> The LOG from the corresponding LevelDB partition doesn't show any errors. But there is a lost/BLOCKS.bad file in this partition (7 KB, created more than a month ago, and it looks like it doesn't contain the corrupted value).
>>>
>>> At the moment I've stopped handoffs using "riak-admin transfer-limit 0".
>>>
>>> Why was the value corrupted? Is there any way to remove it or fix it?
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com