Excellent, thanks for testing. I caught Jason Smith saying on IRC that he had packaged the whole thing up as an escript + some .beams. If we can get it down to a single file a la rebar that would be a pretty sweet way to deliver the repair tool in my opinion.
Adam

On Aug 10, 2010, at 10:40 PM, Mikeal Rogers wrote:

> Ok, latest code has been tested against every db that I have and it works
> great.
>
> What are our next steps here?
>
> I'd like to get this out to all the people who didn't feel comfortable sending
> me their db to test against before we release it more widely.
>
> -Mikeal
>
> On Tue, Aug 10, 2010 at 6:11 PM, Mikeal Rogers <mikeal.rog...@gmail.com> wrote:
>
>> Found one issue, we weren't picking up design docs because it didn't have
>> admin privileges.
>>
>> Adam fixed it and pushed and I've verified that it works now.
>>
>> I wrote a little node script to show all recovered documents and expose any
>> documents that didn't make it into lost+found.
>>
>> http://github.com/mikeal/couchtest/blob/master/validate.js
>>
>> Requires request, `npm install request`.
>>
>> I'm now running recover on all the test db's I have and running the
>> validation script against them.
>>
>> -Mikeal
>>
>>
>> On Tue, Aug 10, 2010 at 1:34 PM, Mikeal Rogers <mikeal.rog...@gmail.com> wrote:
>>
>>> I have some timing numbers for the new code.
>>>
>>> multi_conflict has 200 lost documents and 201 documents total after
>>> recovery.
>>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["multi_conflict"]).
>>> {25217069,ok}
>>> 25 seconds
>>>
>>> Something funky is going on here. Investigating.
>>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["multi_conflict_with_attach"]).
>>> {654782,ok}
>>> .6 seconds
>>>
>>> This db has 124969 documents in it.
>>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["testwritesdb"]).
>>> {1381969304,ok}
>>> 23 minutes
>>>
>>> This database is about 500megs and 46660 before recovery and 46801 after.
>>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["prod"]).
>>> {2329669113,ok}
>>> 38.8 minutes
>>>
>>> -Mikeal
>>>
>>> On Tue, Aug 10, 2010 at 12:06 PM, Adam Kocoloski <kocol...@apache.org> wrote:
>>>
>>>> Good idea.
>>>> Now we've got
>>>>
>>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576 bytes at 1380102
>>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576 bytes at 331526
>>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 331526 bytes at 0
>>>>> [info] [<0.33.0>] couch_db_repair writing 12 updates to lost+found/testwritesdb
>>>>> [info] [<0.33.0>] couch_db_repair writing 9 updates to lost+found/testwritesdb
>>>>> [info] [<0.33.0>] couch_db_repair writing 8 updates to lost+found/testwritesdb
>>>>
>>>> Adam
>>>>
>>>> On Aug 10, 2010, at 2:29 PM, Robert Newson wrote:
>>>>
>>>>> It took 20 minutes before the first 'update' line came out, but now
>>>>> seems to be recovering smoothly. machine load is back down to sane
>>>>> levels.
>>>>>
>>>>> Suggest feedback during the hunting phase.
>>>>>
>>>>> B.
>>>>>
>>>>> On Tue, Aug 10, 2010 at 7:11 PM, Adam Kocoloski <kocol...@apache.org> wrote:
>>>>>> Thanks for the crosscheck. I'm not aware of anything in the node
>>>>>> finder that would cause it to struggle mightily with healthy DBs. It
>>>>>> pretty much ignores the health of the DB, in fact. Would be interested
>>>>>> to hear more.
>>>>>>
>>>>>> On Aug 10, 2010, at 1:59 PM, Robert Newson wrote:
>>>>>>
>>>>>>> I verified the new code's ability to repair the testwritesdb. system
>>>>>>> load was smooth from start to finish.
>>>>>>>
>>>>>>> I started a further test on a different (healthy) database and system
>>>>>>> load was severe again, just collecting the roots (the lost+found db
>>>>>>> was not yet created when I aborted the attempt). I suspect the fact
>>>>>>> that it's healthy is the issue, so if I'm right, perhaps a warning is
>>>>>>> useful.
>>>>>>>
>>>>>>> B.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 10, 2010 at 6:53 PM, Adam Kocoloski <kocol...@apache.org> wrote:
>>>>>>>> Another update.
>>>>>>>> This morning I took a different tack and, rather than try to find
>>>>>>>> root nodes, I just looked for all kv_nodes in the file and treated
>>>>>>>> each of those as a separate virtual DB to be replicated. This
>>>>>>>> reduces the algorithmic complexity of the repair, and it looks like
>>>>>>>> testwritesdb repairs in ~30 minutes or so. Also, this method results
>>>>>>>> in the lost+found DB containing every document, not just the missing
>>>>>>>> ones.
>>>>>>>>
>>>>>>>> My branch does not currently include Randall's parallelization of
>>>>>>>> the replications. It's still CPU-limited, so that may be a worthwhile
>>>>>>>> optimization. On the other hand, I think we may be reaching a stage
>>>>>>>> at which performance for this repair tool is 'good enough', and pmaps
>>>>>>>> can make error handling a bit dicey.
>>>>>>>>
>>>>>>>> In short, I think this tool is now in good shape.
>>>>>>>>
>>>>>>>> http://github.com/kocolosk/couchdb/tree/db_repair