To revive an old thread: I needed a script to dump a large (>30G) couchdb database on a nightly basis for backup purposes, to be performed while couchdb is running, and noticed couchdb-python issue 58 <http://code.google.com/p/couchdb-python/issues/detail?id=58>: couchdb-dump fails if the size of the dump file exceeds memory.

I looked at the scripts attached in the comments and realized that I had similar but different needs: I just needed to stream the response to _all_docs_by_seq?include_docs=true directly to stdout, without doing any json decoding along the way. (The complementary load does have to do json decoding, for instance to check the "deleted":true flag, but this is fine.)

I thought I would share the script with the community in case it's helpful to anyone else, as well as to solicit feedback:
https://svn.openplans.org/melk/util/streamcouch.py

Here is the usage:

    Usage: ./streamcouch.py [dump | load] DBURL

    "dump" requests "_all_docs_by_seq?include_docs=true" from DBURL and
    streams the response to stdout.

    "load" creates a database at DBURL with documents read from stdin in
    the format output by "_all_docs_by_seq?include_docs=true".

    Requires couchdb-python <http://code.google.com/p/couchdb-python/>.

    Ex.
      ./streamcouch.py dump http://localhost:5984/backedup > dump
      ./streamcouch.py load http://localhost:5984/restored < dump

I have tested a dump/load roundtrip locally and it worked. I should note that none of our documents have inline attachments. There was talk earlier in this thread that documents with attachments cannot be posted with a _rev. It would be trivial to remove the _rev during the load; I just haven't needed to.

I should also note that my next version of the script will allow you to perform incremental backups by passing a startkey to _all_docs_by_seq. The complementary load will be able to accept multiple responses to _all_docs_by_seq on stdin.

Looking forward to your responses.

On Tue, Apr 7, 2009 at 8:49 AM, Matt Goodall <[email protected]> wrote:
>
> 2009/4/7 Jeff Hinrichs - DM&T <[email protected]>:
> >
> >
> > On Tue, Apr 7, 2009 at 4:37 AM, Matt Goodall <[email protected]> wrote:
> >>
> >> 2009/4/7 Jeff Hinrichs - DM&T <[email protected]>:
> >> >
> >> >
> >> > On Mon, Apr 6, 2009 at 2:03 PM, Matt Goodall <[email protected]>
> >> > wrote:
> >> >>
> >> >> 2009/4/6 Matt Goodall <[email protected]>:
> >> >> > 2009/4/5 Jeff Hinrichs - DM&T <[email protected]>:
> >> >> >>> So personally, for now, I would write the dump/load tools using
> >> >> >>> plain old httplib from the standard library. It's more than
> >> >> >>> capable.
> >> >> >>> The only bit that might involve decoding from JSON to Python is
> >> >> >>> removing the _rev; everything else is a matter of streaming data
> >> >> >>> from HTTP to disk and vice-versa.
> >> >> >>
> >> >> >> Although that is done currently, I don't believe that it is
> >> >> >> strictly required by couchdb. In fact I just tested and you can
> >> >> >> insert (while loading to an empty database) with the _rev property
> >> >> >> containing information. The _rev is then updated and the resulting
> >> >> >> _rev returned. So there is no need to remove the _rev when loading
> >> >> >> a dump file.
> >> >> >
> >> >> > Unfortunately, you do still have to remove the _rev for those
> >> >> > documents with inline attachments, otherwise you get a conflict
> >> >> > error from couchdb. I don't know if it's a bug to create a document
> >> >> > with a _rev or if the _rev + inline attachments is just an
> >> >> > inconsistency. I'll ask on the couchdb list later and post back
> >> >> > here.
> >> >>
> >> >> Accepting a new document containing a _rev appears to be a bug in
> >> >> CouchDB. So it definitely needs removing, which is arguably the
> >> >> correct thing to do anyway.
> >> >>
> >> >> - Matt
> >> >
> >> > Matt,
> >> > Saw your post and then reviewed my test scripts only to realize that
> >> > you are correct about couchdb freaking out with _rev+attachment.
> >> > However, I don't agree with Damien about which is the bug. I see his
> >> > point of view -- he is seeing this as standard couchdb operation.
> >> >
> >> > Quite frankly, to be a proper dump/load mechanism you should be able
> >> > to dump dbA, then create a dbB, then load the dump from dbA into dbB,
> >> > and when you replicate from one to the other, they should appear to
> >> > be already synchronized (no replication events occur). If a dump/load
> >> > cycle causes dbA to transform into dbA' then it is not a dump/load --
> >> > it's a fetch and insert.
> >>
> >> Ah, I think that's a different sort of dump/load than couchdb-python
> >> provides. I see couchdb-python's dump/load as a sort of
> >> snapshot/bootstrap tool. You're talking about a disconnected
> >> replication process. Both probably have their uses.
> >
> > Now that you put it that way, yes. I was using replication to
> > demonstrate that a database reloaded from a dump should equal the
> > originally dumped database. Replication is inherent w/ couchdb, so if a
> > reloaded database will respond differently to replication than the
> > original, something is lost.
> >
> > Couchdb essentially uses the _id + _rev as a unique index to the data.
> > The inability to recreate that unique index is the problem that, I
> > think, needs to be corrected.
> >
> >> > couchdb proper really needs to correct this situation. Dump and load
> >> > need to put couchdb into a different mode so that this can be
> >> > accomplished, i.e. /database/_dump -- which would dump json documents
> >> > out -- and then a /database/_load where you would post the contents
> >> > of the _dump to reload a database. And I mean load -- not insert.
> >>
> >> I'm sure this can already be achieved using HTTP but, AFAICT, it's not
> >> fully documented yet.
> >
> > There is a thread on couchdb-dev,
> > http://mail-archives.apache.org/mod_mbox/couchdb-dev/200904.mbox/%[email protected]%3e
> > that appears to be talking about such a method.
> > However, it would require the _rev as a parameter instead of part of
> > the json document.
> >
> >> And yes, it might be nice for CouchDB to provide support for a
> >> disconnected replication mode (i.e. replicate to disk) that assumes
> >> the data will be loaded into an empty database (or at least a database
> >> that has not seen those documents before). However, wouldn't that dump
> >> basically be a compacted database?
> >
> > It would be, but currently none of the scripts (haven't looked at yours
> > yet) request conflicts. I haven't written the test for this yet, but my
> > analysis is that only the current "winning" _rev of the data is dumped.
> > A dump/load cycle currently loses that information, if present. It has
> > not bitten me yet, but I can see it being a problem.
>
> My version is as lossy as the original couchdb-dump/load ... just a
> couple of orders of magnitude more efficient ;-).
>
> >> > There are times when you need to be able to dump/load a database.
> >> > Sometimes for error recovery, sometimes for debugging, and sometimes
> >> > for legal reasons, but without a proper couchdb api for it, we are
> >> > whistling in the wind.
> >>
> >> That already works - just copy the .couch file for your database.
> >> CouchDB's append-only model should mean you never get a copy of a
> >> partly written database.
> >
> > You do have a point. However, that .couch file is dependent on the
> > running version of couchdb - so that would need to be backed up too.
> > Maybe I still need to shake off the dust of current RDBMS, but all of
> > them support the idea of dump/load. Something just feels very '80s-ish
> > about backing up the actual data file.
>
> You're right, there should be a version-independent format and it
> should be part of the CouchDB distribution.
>
> Every time a .couch file format-breaking change is made you have to
> replicate to a new CouchDB server.
> For the 0.9 release, where the .couch file format *and* the replication
> stream format changed, not even that was possible, and someone (jchris,
> I think) wrote a script to help.
>
> >> Note: you can actually perform a disconnected replication using a
> >> copy: copy a .couch file to a new CouchDB instance giving it a
> >> temporary database name, replicate from the temporary database to the
> >> real database, delete the temporary database. Not especially nice, but
> >> not too bad either. All CouchDB needs to do to improve that is to
> >> provide a way to replicate from a file on disk rather than a locally
> >> installed database.
> >
> > Agreed, and then this whole thread becomes a bike shed. ;)
>
> Still some use, I think. I still think the ability to bulk load a
> bunch of documents from a simple file is useful. For instance, when
> setting up a new database it's not uncommon to need to bootstrap the
> data in it before pointing a web server at it.
>
> > Regards,
> >
> > Jeff Hinrichs
> >>
> >> - Matt
> >
> > --
> > Jeff Hinrichs
> > Dundee Media & Technology, Inc
> > [email protected]
> > 402.218.1473
> > web: www.dundeemt.com
> > blog: inre.dundeemt.com
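P.S. For anyone who wants the gist without reading the script, here is a minimal sketch of the streaming approach described above. This is not the actual streamcouch.py code: the function names are mine, and the exact row layout of the _all_docs_by_seq response is my assumption, so adjust to taste.

```python
import urllib.request


def stream_dump(dburl, out, chunk_size=64 * 1024):
    # Stream the raw _all_docs_by_seq response straight to `out` without
    # decoding any JSON along the way, so memory use stays flat no matter
    # how large the database is.
    url = dburl.rstrip("/") + "/_all_docs_by_seq?include_docs=true"
    with urllib.request.urlopen(url) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)


def doc_for_load(row):
    # Given one decoded row from the dump, return the document to insert,
    # or None for deleted documents. The row layout assumed here --
    # {"id": ..., "value": {"rev": ..., "deleted": ...}, "doc": {...}} --
    # is my reading of the _all_docs_by_seq output of this era.
    if row.get("value", {}).get("deleted"):
        return None
    doc = dict(row["doc"])
    # Dropping _rev sidesteps the _rev + inline-attachment conflict
    # discussed in the thread; CouchDB assigns a fresh revision on insert.
    doc.pop("_rev", None)
    return doc
```

A load script would then walk the rows on stdin, call doc_for_load on each, and post the surviving documents back to the target database.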

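P.P.S. Matt's copy-the-.couch-file trick (copy the file to a new instance under a temporary name, replicate into the real database, delete the temporary one) can be scripted too. A rough sketch, with the caveat that the server URL and database names are placeholders and it presumes you have already copied the .couch file into the server's data directory under the temporary name:

```python
import json
import urllib.request

COUCH = "http://localhost:5984"  # placeholder server URL


def replicate_body(source, target):
    # JSON body for POST /_replicate; source and target may be local
    # database names or full database URLs.
    return json.dumps({"source": source, "target": target})


def load_via_temp_db(tmp_name, real_name, server=COUCH):
    # Disconnected replication: replicate the temporary database (backed
    # by the copied .couch file) into the real one, then delete it.
    req = urllib.request.Request(
        server + "/_replicate",
        data=replicate_body(tmp_name, real_name).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()
    delete = urllib.request.Request(server + "/" + tmp_name, method="DELETE")
    urllib.request.urlopen(delete).read()
```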
