Re: data recovery tool progress

Adam Kocoloski Tue, 10 Aug 2010 19:52:47 -0700

Excellent, thanks for testing.  I caught Jason Smith saying on IRC that he had 
packaged the whole thing up as an escript + some .beams.  If we can get it down 
to a single file a la rebar that would be a pretty sweet way to deliver the 
repair tool in my opinion.


Adam

On Aug 10, 2010, at 10:40 PM, Mikeal Rogers wrote:

> Ok, latest code has been tested against every db that I have and it works
> great.
> 
> What are our next steps here?
> 
> I'd like to get this out to all the people who didn't feel comfortable
> sending me their db to test against before we release it more widely.
> 
> -Mikeal
> 
> On Tue, Aug 10, 2010 at 6:11 PM, Mikeal Rogers <mikeal.rog...@gmail.com> wrote:
> 
>> Found one issue: we weren't picking up design docs because the repair
>> didn't run with admin privileges.
>> 
>> Adam fixed it and pushed and I've verified that it works now.
>> 
>> I wrote a little node script to show all recovered documents and expose any
>> documents that didn't make it into lost+found.
>> 
>> http://github.com/mikeal/couchtest/blob/master/validate.js
>> 
>> Requires request, `npm install request`.
>> 
>> I'm now running recover on all the test dbs I have and running the
>> validation script against them.
>> 
>> -Mikeal
>> 
>> 
>> On Tue, Aug 10, 2010 at 1:34 PM, Mikeal Rogers <mikeal.rog...@gmail.com> wrote:
>> 
>>> I have some timing numbers for the new code.
>>> 
>>> multi_conflict has 200 lost documents and 201 documents total after
>>> recovery.
>>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["multi_conflict"]).
>>> {25217069,ok}
>>> 25 seconds
>>> 
>>> Something funky is going on here. Investigating.
>>> 1> timer:tc(couch_db_repair, make_lost_and_found,
>>> ["multi_conflict_with_attach"]).
>>> {654782,ok}
>>> .6 seconds
>>> 
>>> This db has 124969 documents in it.
>>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["testwritesdb"]).
>>> {1381969304,ok}
>>> 23 minutes
>>> 
>>> This database is about 500 MB and has 46660 documents before recovery and 46801 after.
>>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["prod"]).
>>> {2329669113,ok}
>>> 38.8 minutes
>>> 
>>> -Mikeal
>>> 
>>> On Tue, Aug 10, 2010 at 12:06 PM, Adam Kocoloski <kocol...@apache.org> wrote:
>>> 
>>>> Good idea.  Now we've got
>>>> 
>>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576 bytes at 1380102
>>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576 bytes at 331526
>>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 331526 bytes at 0
>>>>> [info] [<0.33.0>] couch_db_repair writing 12 updates to lost+found/testwritesdb
>>>>> [info] [<0.33.0>] couch_db_repair writing 9 updates to lost+found/testwritesdb
>>>>> [info] [<0.33.0>] couch_db_repair writing 8 updates to lost+found/testwritesdb
>>>> 
>>>> Adam
>>>> 
>>>> On Aug 10, 2010, at 2:29 PM, Robert Newson wrote:
>>>> 
>>>>> It took 20 minutes before the first 'update' line came out, but now
>>>>> seems to be recovering smoothly. machine load is back down to sane
>>>>> levels.
>>>>> 
>>>>> Suggest feedback during the hunting phase.
>>>>> 
>>>>> B.
>>>>> 
>>>>> On Tue, Aug 10, 2010 at 7:11 PM, Adam Kocoloski <kocol...@apache.org> wrote:
>>>>>> Thanks for the crosscheck.  I'm not aware of anything in the node
>>>>>> finder that would cause it to struggle mightily with healthy DBs.  It
>>>>>> pretty much ignores the health of the DB, in fact.  Would be interested
>>>>>> to hear more.
>>>>>> 
>>>>>> On Aug 10, 2010, at 1:59 PM, Robert Newson wrote:
>>>>>> 
>>>>>>> I verified the new code's ability to repair the testwritesdb. system
>>>>>>> load was smooth from start to finish.
>>>>>>> 
>>>>>>> I started a further test on a different (healthy) database and system
>>>>>>> load was severe again, just collecting the roots (the lost+found db
>>>>>>> was not yet created when I aborted the attempt). I suspect the fact
>>>>>>> that it's healthy is the issue, so if I'm right, perhaps a warning is
>>>>>>> useful.
>>>>>>> 
>>>>>>> B.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Aug 10, 2010 at 6:53 PM, Adam Kocoloski <kocol...@apache.org> wrote:
>>>>>>>> Another update.  This morning I took a different tack and, rather
>>>>>>>> than try to find root nodes, I just looked for all kv_nodes in the
>>>>>>>> file and treated each of those as a separate virtual DB to be
>>>>>>>> replicated.  This reduces the algorithmic complexity of the repair,
>>>>>>>> and it looks like testwritesdb repairs in ~30 minutes or so.  Also,
>>>>>>>> this method results in the lost+found DB containing every document,
>>>>>>>> not just the missing ones.
>>>>>>>>
>>>>>>>> My branch does not currently include Randall's parallelization of
>>>>>>>> the replications.  It's still CPU-limited, so that may be a
>>>>>>>> worthwhile optimization.  On the other hand, I think we may be
>>>>>>>> reaching a stage at which performance for this repair tool is 'good
>>>>>>>> enough', and pmaps can make error handling a bit dicey.
>>>>>>>> 
>>>>>>>> In short, I think this tool is now in good shape.
>>>>>>>> 
>>>>>>>> http://github.com/kocolosk/couchdb/tree/db_repair
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>>
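
For reference, the "scanning N bytes at Offset" log lines quoted in the thread look like a walk backward from the end of the file in 1 MiB (1048576-byte) pieces, with a shorter final piece at offset 0. The sketch below only reproduces that offset arithmetic. It is not taken from couch_db_repair: the module name scan_chunks, the function chunks/2, and the file size of 2428678 bytes (1380102 + 1048576) are assumptions made for illustration.

```erlang
-module(scan_chunks).
-export([chunks/2]).

%% Hypothetical helper, not part of couch_db_repair.
%% Returns [{Offset, Length}] for a file of Size bytes, walking backward
%% from the end of the file in pieces of at most ChunkSize bytes.
chunks(Size, ChunkSize) when Size >= 0, ChunkSize > 0 ->
    chunks(Size, ChunkSize, []).

%% End is the exclusive upper bound of the next piece to emit.
chunks(0, _ChunkSize, Acc) ->
    lists:reverse(Acc);
chunks(End, ChunkSize, Acc) ->
    Start = erlang:max(0, End - ChunkSize),
    chunks(Start, ChunkSize, [{Start, End - Start} | Acc]).
```

Under the assumed file size, scan_chunks:chunks(2428678, 1048576) returns [{1380102,1048576},{331526,1048576},{0,331526}], which lines up with the three scanning lines in the log.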
