Vladimir,

I have created a branch off the 1.3.2 release tag: mv-error-logging-hack
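If you want to try it, the checkout should look roughly like this (a sketch only; the paths assume your eleveldb tree vendors leveldb under c_src and tracks the basho/leveldb repository — adjust to your layout):

    # hypothetical layout; adjust to wherever your eleveldb checkout lives
    cd eleveldb/c_src/leveldb
    git fetch origin
    git checkout mv-error-logging-hack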
This has two changes:

- removes a late fix for database-level locking that was added in 1.3.2 (to see if that code was the source of the problem prior to its fix)
- adds tests of all background file operations and logs errors to syslog (since the LOG handle is not available to them)

When I build a new version of leveldb, I make sure eleveldb also rebuilds. I do this via:

    rm eleveldb/c_src/*.o
    cd eleveldb/c_src/leveldb; make clean

There is a pull request from another community user that makes the entire process cleaner. I just have not had time to review and approve it.

I typically "grep beam /var/log/syslog" on my Debian system. The exact system log file may vary with your Linux distribution.
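Something like the following should cover the common cases (the non-Debian path is an assumption on my part; check your syslog configuration):

    grep beam /var/log/syslog             # Debian/Ubuntu
    grep beam /var/log/messages           # RHEL/CentOS-style syslog
    tail -f /var/log/syslog | grep beam   # watch for new errors live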
Let me know if this finds any bugs.

Matthew


On Jul 25, 2013, at 8:12 PM, Vladimir Shabanov <vshaban...@gmail.com> wrote:

> I prefer the second option, since it will show whether the corrupted blocks are related to the race condition. The first option would need to run for a long time to be completely sure that it really fixes the issue.
>
>
> 2013/7/26 Matthew Von-Maszewski <matth...@basho.com>
> Vladimir,
>
> I apologize for not recognizing your name and previous contribution. I just tend to think in terms of code and performance bottlenecks, not people.
>
> Your June contribution resulted in changes that were released in 1.4 and 1.3.2. I and the team thank you. However, we have not isolated the source of the corruption. We only know today that it does not happen very often. We have a second, high-transaction site that has seen the same issue.
>
> I can offer you two non-release options:
>
> - I have a branch to 1.4.0 that fixes a potential, but unproven, race condition. Details are here:
>
> https://github.com/basho/leveldb/wiki/mv-sst-fadvise
>
> You would have to build eleveldb locally and copy it into your executable tree. The 1.4 leveldb and eleveldb work fine with Riak 1.3.x, should you desire to limit changes to your production environment.
>
> - I have code, soon to be a branch against 1.3.2, that only adds syslog error messages to prove / disprove the race condition. You could take this code and see if it reports problems. This route would help the community (and mostly me) know whether the root cause is within the race condition addressed by the mv-sst-fadvise branch.
>
> The two options above are what I currently have to offer. I am actively working to find the corruption source. The good news is that Riak will naturally recover from a "bad CRC" when detected. The bad news is that the Google defaults let some bad CRCs become good CRCs. Riak 1.4 and 1.3.2 cannot identify those bad CRCs that became good CRCs.
>
> Matthew
>
>
> On Jul 25, 2013, at 4:32 PM, Vladimir Shabanov <vshaban...@gmail.com> wrote:
>
>> Good. Will wait for the doctor.
>>
>> A month ago I mailed about a segmentation fault:
>> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-June/012245.html
>> After looking at the core dumps, you found this problem with CRC checks being skipped. I enabled paranoid_checks and got my node up and running.
>>
>> I've also found that lost/BLOCKS.bad sometimes appears in partitions and have sent you these blocks for further analysis.
>>
>> It's very interesting why the corrupted data appears in the first place. Nodes didn't crash, hardware didn't fail. As I mentioned previously, all my machines have ECC memory and the Riak data is kept on a ZFS filesystem (which also checks CRCs for all the data and doesn't report any CRC errors). So it looks like the data is somehow corrupted by Riak itself.
>>
>> The lost/BLOCKS.bad files are usually small (2-8 KB) and appear very infrequently (once a week, once a month, or never for many partitions). I found these BLOCKS.bad in both data/leveldb and data/anti_entropy. So I suspect there is a bug in LevelDB.
>>
>> Looking at the LOGs, they are created during compactions:
>> "Moving corrupted block to lost/BLOCKS.bad (size 2393)"
>> but there is no more information about what kind of block it is or where it was found.
>>
>> Is it possible to somehow find the source of those BLOCKS.bad files? I'm building Riak from sources; maybe it's possible to enable some additional logging to find out what these BLOCKS.bad are?
>>
>>
>> 2013/7/25 Matthew Von-Maszewski <matth...@basho.com>
>> Vladimir,
>>
>> I can explain what happened, but not how to correct the problem. The gentleman who can walk you through a repair is tied up on another project, but he intends to respond as soon as he is able.
>>
>> We recently discovered / realized that Google's leveldb code does not check the CRC of each block rewritten during a compaction. This means that blocks with bad CRCs get read without being flagged as bad, then rewritten to a new file with a new, valid CRC. The corruption is now hidden.
>>
>> A more thorough discussion of the problem is found here:
>>
>> https://github.com/basho/leveldb/wiki/mv-verify-compactions
>>
>> We added code to the 1.3.2 and 1.4 Riak releases to have the block CRC checked during both read (Get) requests and compaction rewrites. This prevents future corruption from hiding. Unfortunately, it does NOTHING for blocks already corrupted and rewritten with valid CRCs. You are encountering this latter condition. We have a developer advocate / client services person who has walked others through a fix via the Riak data replicas …
>>
>> … please hold and the doctor will be with you shortly.
>>
>> Matthew
>>
>>
>> On Jul 24, 2013, at 9:39 PM, Vladimir Shabanov <vshaban...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Recently I started expanding my Riak cluster and found that handoffs were continuously retried for one partition.
>>>
>>> Here are the logs from two nodes:
>>> https://gist.github.com/vshabanov/41282e622479fbe81974
>>>
>>> The most interesting parts of the logs are
>>> "Handoff receiver for partition ... exited abnormally after processing 2860338 objects: {{badarg,[{erlang,binary_to_term,..."
>>> and
>>> "bad argument in call to erlang:binary_to_term(<<131,104,...."
>>>
>>> Both nodes are running Riak 1.3.2 (the old one was previously running 1.3.1).
>>>
>>> When I printed the corrupted binary string, I found that it corresponds to one value.
>>>
>>> When I tried to "get" it, the read succeeded, but the node with the corrupted value showed the same binary_to_term error.
>>>
>>> When I tried to delete the corrupted value, I got a timeout.
>>>
>>> I'm running machines with ECC memory and a ZFS filesystem (which doesn't report any checksum failures), so I doubt the data was silently corrupted on disk.
>>>
>>> The LOG from the corresponding LevelDB partition doesn't show any errors. But there is a lost/BLOCKS.bad file in this partition (7 KB, created more than a month ago, and it looks like it doesn't contain the corrupted value).
>>>
>>> At the moment I've stopped handoffs using "riak-admin transfer-limit 0".
>>>
>>> Why was the value corrupted? Is there any way to remove it or fix it?
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com