To revive an old thread: I needed a script to dump a large (>30G) couchdb database on a nightly basis for backup purposes, to be performed while couchdb is running, and noticed couchdb-python issue 58 <http://code.google.com/p/couchdb-python/issues/detail?id=58>: couchdb-dump fails if the size of the dump file exceeds memory.

I looked at the scripts attached in the comments and realized that I had similar but different needs: I just needed to stream the response to _all_docs_by_seq?include_docs=true directly to stdout, without doing any json decoding along the way. (The complementary load does have to do json decoding, for instance to check the "deleted":true flag, but this is fine.)

I thought I would share the script with the community in case it's helpful to anyone else, as well as to solicit feedback:
https://svn.openplans.org/melk/util/streamcouch.py

Here is the usage:

    Usage: ./streamcouch.py [dump | load] DBURL

    "dump" requests "_all_docs_by_seq?include_docs=true" from DBURL and
    streams the response to stdout.

    "load" creates a database at DBURL with documents read from stdin in
    the format output by "_all_docs_by_seq?include_docs=true".

    Requires couchdb-python <http://code.google.com/p/couchdb-python/>.

    Ex.
      ./streamcouch.py dump http://localhost:5984/backedup > dump
      ./streamcouch.py load http://localhost:5984/restored < dump

I have tested a dump/load roundtrip locally and it worked. I should note that none of our documents have inline attachments. There was talk earlier in this thread that documents with attachments cannot be posted with a _rev. It would be trivial to remove the _rev during the load; I just haven't needed to.

I should also note that my next version of the script will allow you to perform incremental backups by passing a startkey to _all_docs_by_seq. The complementary load will be able to accept multiple responses to _all_docs_by_seq on stdin.

Looking forward to your responses.

On Tue, Apr 7, 2009 at 8:49 AM, Matt Goodall <[email protected]> wrote:
>
> 2009/4/7 Jeff Hinrichs - DM&T <[email protected]>:
> >
> >
> > On Tue, Apr 7, 2009 at 4:37 AM, Matt Goodall <[email protected]> wrote:
> >>
> >> 2009/4/7 Jeff Hinrichs - DM&T <[email protected]>:
> >> >
> >> >
> >> > On Mon, Apr 6, 2009 at 2:03 PM, Matt Goodall <[email protected]>
> >> > wrote:
> >> >>
> >> >> 2009/4/6 Matt Goodall <[email protected]>:
> >> >> > 2009/4/5 Jeff Hinrichs - DM&T <[email protected]>:
> >> >> >>> So personally, for now, I would write the dump/load tools using
> >> >> >>> plain old httplib from the standard library. It's more than
> >> >> >>> capable.
> >> >> >>> The only bit that might involve decoding from JSON to Python is
> >> >> >>> removing the _rev; everything else is a matter of streaming data
> >> >> >>> from HTTP to disk and vice-versa.
> >> >> >>
> >> >> >> Although that is done currently, I don't believe that it is
> >> >> >> strictly required by couchdb. In fact I just tested and you can
> >> >> >> insert (while loading to an empty database) with the _rev property
> >> >> >> containing information. The _rev is then updated and the resulting
> >> >> >> _rev returned. So there is no need to remove the _rev when loading
> >> >> >> a dump file.
> >> >> >
> >> >> > Unfortunately, you do still have to remove the _rev for those
> >> >> > documents with inline attachments, otherwise you get a conflict
> >> >> > error from couchdb. I don't know if it's a bug to create a document
> >> >> > with a _rev or if the _rev + inline attachments is just an
> >> >> > inconsistency. I'll ask on the couchdb list later and post back
> >> >> > here.
> >> >>
> >> >> Accepting a new document containing a _rev appears to be a bug in
> >> >> CouchDB. So it definitely needs removing, which is arguably the
> >> >> correct thing to do anyway.
> >> >>
> >> >> - Matt
> >> >
> >> > Matt,
> >> > Saw your post and then reviewed my test scripts only to realize that
> >> > you are correct about couchdb freaking out with _rev+attachment.
> >> > However, I don't agree with Damien about which is the bug. I see his
> >> > point of view -- he is seeing this as standard couchdb operation.
> >> >
> >> > Quite frankly, to be a proper dump/load mechanism you should be able
> >> > to dump dbA, then create a dbB, then load the dump from dbA into dbB,
> >> > and when you replicate from one to the other, they should appear to
> >> > be already synchronized (no replication events occur). If a dump/load
> >> > cycle causes dbA to transform into dbA' then it is not a dump/load --
> >> > it's a fetch and insert.
> >>
> >> Ah, I think that's a different sort of dump/load than couchdb-python
> >> provides. I see couchdb-python's dump/load as a sort of
> >> snapshot/bootstrap tool. You're talking about a disconnected
> >> replication process. Both probably have their uses.
> >
> > Now that you put it that way, yes. I was using replication to
> > demonstrate that a database reloaded from a dump should equal the
> > originally dumped database. Replication is inherent w/ couchdb, so if a
> > reloaded database will respond differently to replication than the
> > original, something is lost.
> >
> > Couchdb essentially uses the _id + _rev as a unique index to the data.
> > The inability to recreate that unique index is the problem that, I
> > think, needs to be corrected.
> >
> >> > couchdb proper really needs to correct this situation. Dump and load
> >> > need to put couchdb into a different mode so that this can be
> >> > accomplished, i.e. /database/_dump -- which would dump json documents
> >> > out -- and then a /database/_load where you would post the contents
> >> > of the _dump to reload a database. And I mean load -- not insert.
> >>
> >> I'm sure this can already be achieved using HTTP but, AFAICT, it's not
> >> fully documented yet.
> >
> > There is a thread on couchdb-dev,
> > http://mail-archives.apache.org/mod_mbox/couchdb-dev/200904.mbox/%[email protected]%3e
> > that appears to be talking about such a method.
> > However, it would require the _rev as a parameter instead of part of
> > the json document.
> >
> >> And yes, it might be nice for CouchDB to provide support for a
> >> disconnected replication mode (i.e. replicate to disk) that assumes
> >> the data will be loaded into an empty database (or at least a database
> >> that has not seen those documents before). However, wouldn't that dump
> >> basically be a compacted database?
> >
> > It would be, but currently none of the scripts (haven't looked at yours
> > yet) request conflicts. I haven't written the test for this yet, but my
> > analysis is that only the current "winning" _rev of the data is dumped.
> > A dump/load cycle currently loses that information, if present. It has
> > not bitten me yet, but I can see it being a problem.
>
> My version is as lossy as the original couchdb-dump/load ... just a
> couple of orders of magnitude more efficient ;-).
>
> >> > There are times when you need to be able to dump/load a database.
> >> > Sometimes for error recovery, sometimes for debugging, and sometimes
> >> > for legal reasons, but without a proper couchdb api for it, we are
> >> > whistling in the wind.
> >>
> >> That already works - just copy the .couch file for your database.
> >> CouchDB's append-only model should mean you never get a copy of a
> >> partly written database.
> >
> > You do have a point. However, that .couch file is dependent on the
> > running version of couchdb - so that would need to be backed up too.
> > Maybe I still need to shake off the dust of current RDBMS, but all of
> > them support the idea of dump/load. Something just feels very '80s-ish
> > about backing up the actual data file.
>
> You're right, there should be a version-independent format and it
> should be part of the CouchDB distribution.
>
> Every time a .couch file format-breaking change is made you have to
> replicate to a new CouchDB server.
> For the 0.9 release, where the .couch file format *and* the replication
> stream format changed, not even that was possible, and someone (jchris,
> I think) wrote a script to help.
>
> >> Note: you can actually perform a disconnected replication using a
> >> copy: copy a .couch file to a new CouchDB instance giving it a
> >> temporary database name, replicate from the temporary database to the
> >> real database, delete the temporary database. Not especially nice, but
> >> not too bad either. All CouchDB needs to do to improve that is to
> >> provide a way to replicate from a file on disk rather than a locally
> >> installed database.
> >
> > Agreed, and then this whole thread becomes a bike shed. ;)
>
> Still some use, I think. I still think the ability to bulk load a
> bunch of documents from a simple file is useful. For instance, when
> setting up a new database it's not uncommon to need to bootstrap the
> data in it before pointing a web server at it.
>
> > Regards,
> >
> > Jeff Hinrichs
> >>
> >> - Matt
> >
> > --
> > Jeff Hinrichs
> > Dundee Media & Technology, Inc
> > [email protected]
> > 402.218.1473
> > web: www.dundeemt.com
> > blog: inre.dundeemt.com
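P.S. For anyone who wants the gist without reading the script, here is a minimal sketch of the streaming approach described above. This is not the actual streamcouch.py code: the function names are mine, and the exact row layout of the _all_docs_by_seq response is my assumption, so adjust to taste.

```python
import urllib.request


def stream_dump(dburl, out, chunk_size=64 * 1024):
    # Stream the raw _all_docs_by_seq response straight to `out` without
    # decoding any JSON along the way, so memory use stays flat no matter
    # how large the database is.
    url = dburl.rstrip("/") + "/_all_docs_by_seq?include_docs=true"
    with urllib.request.urlopen(url) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)


def doc_for_load(row):
    # Given one decoded row from the dump, return the document to insert,
    # or None for deleted documents. The row layout assumed here --
    # {"id": ..., "value": {"rev": ..., "deleted": ...}, "doc": {...}} --
    # is my reading of the _all_docs_by_seq output of this era.
    if row.get("value", {}).get("deleted"):
        return None
    doc = dict(row["doc"])
    # Dropping _rev sidesteps the _rev + inline-attachment conflict
    # discussed in the thread; CouchDB assigns a fresh revision on insert.
    doc.pop("_rev", None)
    return doc
```

A load script would then walk the rows on stdin, call doc_for_load on each, and post the surviving documents back to the target database.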

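P.P.S. Matt's copy-the-.couch-file trick (copy the file to a new instance under a temporary name, replicate into the real database, delete the temporary one) can be scripted too. A rough sketch, with the caveat that the server URL and database names are placeholders and it presumes you have already copied the .couch file into the server's data directory under the temporary name:

```python
import json
import urllib.request

COUCH = "http://localhost:5984"  # placeholder server URL


def replicate_body(source, target):
    # JSON body for POST /_replicate; source and target may be local
    # database names or full database URLs.
    return json.dumps({"source": source, "target": target})


def load_via_temp_db(tmp_name, real_name, server=COUCH):
    # Disconnected replication: replicate the temporary database (backed
    # by the copied .couch file) into the real one, then delete it.
    req = urllib.request.Request(
        server + "/_replicate",
        data=replicate_body(tmp_name, real_name).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()
    delete = urllib.request.Request(server + "/" + tmp_name, method="DELETE")
    urllib.request.urlopen(delete).read()
```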
