A user (#herman) on IRC today reported slow startups for couchdb. I speculated that he'd hit the data loss bug and that couchdb was scanning backwards for a header, which turned out to be the case. Interestingly, this was verified with strace, by watching the read calls use earlier and earlier offsets.
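For anyone who wants to picture what that backward scan looks like, here is a rough Python sketch. It is not couchdb's actual code. It assumes the file is laid out in 4096-byte blocks and that a header block is marked by a leading 0x01 byte; the real implementation also validates the header contents, which this skips. The names scan_back_for_header and scanned_too_far, and the 1% threshold, are made up for illustration.

```python
import io

BLOCK_SIZE = 4096  # assumed block size, not taken from this thread


def scan_back_for_header(f, block_size=BLOCK_SIZE):
    """Return (header_offset, bytes_scanned_back), or None if no marker is found.

    Walks block boundaries from the end of the file towards the start and
    stops at the first block whose leading byte is 0x01 (assumed marker).
    """
    f.seek(0, io.SEEK_END)
    size = f.tell()
    if size == 0:
        return None
    # Last block boundary that lies strictly before end-of-file.
    offset = ((size - 1) // block_size) * block_size
    while offset >= 0:
        f.seek(offset)
        if f.read(1) == b"\x01":
            return offset, size - offset
        offset -= block_size
    return None


def scanned_too_far(scanned, file_size, max_fraction=0.01):
    """Hypothetical heuristic: True if the scan covered more than
    max_fraction of the file before a header turned up."""
    return scanned > file_size * max_fraction
```

The second value returned by scan_back_for_header is the distance from end-of-file back to the header, which is the sort of number a "seeked back very far" check could be based on. The one-seek-and-read-per-block pattern is also what shows up in strace as reads at ever smaller offsets.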
Should we consider a tweak to the tool, or couchdb itself, to report a warning if we have to seek back very far to find a header? Obviously it would be a heuristic, but there would be no real downside to the odd false positive, since the recovery tool and subsequent replication will amount to a no-op.

fyi: Reading the couchdb database (90G) with 'dd' took 22 minutes, but couchdb's backward scanning took 3 hours.

B.

On Fri, Aug 13, 2010 at 3:05 PM, J Chris Anderson <jch...@apache.org> wrote:
>
> On Aug 12, 2010, at 11:38 PM, Mikeal Rogers wrote:
>
>> I tested the latest code in recover-couchdb and it looks great.
>
> We need to package this so that it is useable by end-users, and put a link to
> it on http://couchdb.apache.org/notice/1.0.1.html
>
> I'm the last guy who knows what that would mean... anyone? I think we should
> do this today.
>
> Do we need to do anything formal and time consuming before linking to the
> recovery tool / process from that page?
>
> Also, someone needs to write up the how-to instructions, along with a
> description of what to expect.
>
> Chris
>
>>
>> -Mikeal
>>
>> On Thu, Aug 12, 2010 at 2:33 PM, J Chris Anderson <jch...@apache.org> wrote:
>>
>>>
>>> On Aug 12, 2010, at 2:15 PM, J Chris Anderson wrote:
>>>
>>>>
>>>> On Aug 12, 2010, at 12:36 PM, Adam Kocoloski wrote:
>>>>
>>>>> Right, and jchris' db_repair branch includes my patches for DB reader
>>>>> _admin access and a more useful progress report in the replication phase of
>>>>> the repair.
>>>>>
>>>>
>>>> I've updated the repair branch with everyone's code. I think it is
>>>> faster, due to Adam's idea that if we run the merges in reverse order, those
>>>> near the front of the file are more likely to be no-ops, so less work is
>>>> done overall.
>>>>
>>>> Mikeal will be testing for correctness. Could others please use it and
>>>> test for usability as well.
>>>> Latest code (with instructions) is here:
>>>>
>>>> http://github.com/jhs/recover-couchdb/
>>>>
>>>> Which points at http://github.com/jchris/couchdb/tree/db_repair for the
>>>> repair code.
>>>>
>>>> One thing I am not clear about (need better docs) is, do we need to
>>>> replicate the original db to the lost+found db (or vice-versa), after
>>>> recovery is complete?
>>>>
>>>
>>> Also, we should be clear about what the semantics for this are. It can
>>> potentially introduce conflicts if some writes were repeated after restarts.
>>> Should it always be a noop on dbs that are clean w/r/t the bug?
>>>
>>> Chris
>>>
>>>> Chris
>>>>
>>>>> Adam
>>>>>
>>>>> On Aug 12, 2010, at 3:14 PM, Jason Smith wrote:
>>>>>
>>>>>> The code is updated with the following changes:
>>>>>> 1. Adhere to the lost+found/databasename custom...
>>>>>> 2. ...except databases starting with _, which go into
>>>>>> _system/databasename
>>>>>> 3. Sync up with jchris's db_repair branch
>>>>>>
>>>>>> (About #2, I started with _/database but I think it's too easy to miss at
>>>>>> the command line.)
>>>>>>
>>>>>> On Fri, Aug 13, 2010 at 00:52, J Chris Anderson <jch...@gmail.com> wrote:
>>>>>>
>>>>>>> A few bug reports from my testing:
>>>>>>>
>>>>>>> I launched with this command, as specified in the README:
>>>>>>>
>>>>>>> find ~/code/couchdb/tmp/lib -type f -name '*.couch' -exec ./recover_couchdb {} \;
>>>>>>>
>>>>>>> First of all, it chokes on my _users and _replicator db:
>>>>>>>
>>>>>>> [info] [<0.2.0>] couch_db_repair for _users - scanning 335961 bytes at 0
>>>>>>> [error] [<0.2.0>] couch_db_repair merge node at 332061 {case_clause,
>>>>>>> {error,illegal_database_name}}
>>>>>>>
>>>>>>> That second [error] line is repeated many many times (once per merge I
>>>>>>> think). I think the issue is that _users is hard-coded to be OK, but
>>>>>>> _users_lost+found is not.
>>>>>>> So we should do something about that, maybe if a
>>>>>>> db-name starts with _ we should call the lost and found a_users_lost+found
>>>>>>> (_ sorts at the top, so "a" will be near it and legal).
>>>>>>>
>>>>>>> When a database has readers defined in the security object, the tool is
>>>>>>> unable to open it (the reading part of the repair tool needs to have the
>>>>>>> _admin userCtx, not just the writer).
>>>>>>>
>>>>>>> [debug] [<0.2.0>] Not a reader: UserCtx {user_ctx,null,[],undefined} vs Names [<<"joe">>] Roles [<<"_admin">>]
>>>>>>> escript: exception throw: {unauthorized,<<"You are not authorized to access this db.">>}
>>>>>>> in function couch_db:open/2
>>>>>>> in call from couch_db_repair:make_lost_and_found/3
>>>>>>> in call from recover_couchdb:main/1
>>>>>>> in call from escript:run/2
>>>>>>> in call from escript:start/1
>>>>>>> in call from init:start_it/1
>>>>>>> in call from init:start_em/1
>>>>>>>
>>>>>>> It would also be helpful if the status lines could say something more than
>>>>>>>
>>>>>>> [info] [<0.2.0>] couch_db_repair writing 15 updates to bench_lost+found
>>>>>>>
>>>>>>> Like maybe add a note like "about 23% complete" if at all possible.
>>>>>>>
>>>>>>> I will patch the first few, I'd love help from someone on the last one.
>>>>>>> I'll be on IRC.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Chris
>>>>>>>
>>>>>>> On Aug 12, 2010, at 10:18 AM, J Chris Anderson wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 11, 2010, at 2:14 PM, Jason Smith wrote:
>>>>>>>>
>>>>>>>>> Hi, Jason.
>>>>>>>>>
>>>>>>>>> On Thu, Aug 12, 2010 at 04:14, Jason Smith <j...@couch.io> wrote:
>>>>>>>>>
>>>>>>>>>> On Wed, Aug 11, 2010 at 09:52, Adam Kocoloski <kocol...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Excellent, thanks for testing.
>>>>>>>>>>> I caught Jason Smith saying on IRC that he
>>>>>>>>>>> had packaged the whole thing up as an escript + some .beams. If we can get
>>>>>>>>>>> it down to a single file a la rebar that would be a pretty sweet way to
>>>>>>>>>>> deliver the repair tool in my opinion.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Please check out http://github.com/jhs/repair-couchdb
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think you mean http://github.com/jhs/recover-couchdb
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think it is important that we package and release this, if it is ready.
>>>>>>>> We should link to it from the bug description page, the project home page,
>>>>>>>> as well as blog about it, etc. What is the point of working feverishly on a
>>>>>>>> recovery tool if we don't go the last mile?
>>>>>>>>
>>>>>>>> I am testing it now on my database directory to make sure it doesn't harm
>>>>>>>> anything (I was never subject to the bug, which is probably where most
>>>>>>>> people are, but they might run it anyway.)
>>>>>>>>
>>>>>>>> As it stands the submodules thing can't be part of the release, we need
>>>>>>>> to package it up as a single zip file or something.
>>>>>>>>
>>>>>>>> Is there anything else that needs to be done before we can release this?
>>>>>>>>
>>>>>>>> Chris
>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jason Smith
>>>>>>>>> Couchio Hosting
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jason Smith
>>>>>> Couchio Hosting
>>>>>
>>>>
>>>
>