Heya Ralf,
On Apr 13, 2008, at 02:33, Ralf Nieuwenhuijsen wrote:

Because the storage system is pretty wasteful and you'd end up with
several gigabytes of database files for just a few hundred megabytes of
actual data. So we do need compaction in one form or another. A compaction
that retains revisions is a lot harder to write. Also, dealing with
revisions in a distributed setup is less than trivial and would complicate
the replication system quite a bit.


The gigabytes versus hundred megabytes seem acceptable to me. Especially when we can scale that easily. Also, it seems to depend on how often data
changes. A simple solution to compact revisions would be to store each
revision as a reverse-diff as well. The normal data can then be compacted, whereas the reverse-diff is just kept. From the most recent version, the
older versions can be reconstructed.
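
Roughly what I have in mind, sketched in Python (the diff format here is
purely illustrative, nothing CouchDB actually implements):

# Illustrative only: a reverse diff records, per changed field, the value the
# field had in the *previous* revision (or DELETED if it did not exist there).

DELETED = object()  # sentinel for "field did not exist in the older revision"

def reverse_diff(old_doc, new_doc):
    """What you would need in order to turn new_doc back into old_doc."""
    diff = {}
    for key in set(old_doc) | set(new_doc):
        old_val = old_doc.get(key, DELETED)
        if old_val != new_doc.get(key, DELETED):
            diff[key] = old_val
    return diff

def rollback(doc, diff):
    """Apply one reverse diff, producing the previous revision."""
    older = dict(doc)
    for key, old_val in diff.items():
        if old_val is DELETED:
            older.pop(key, None)
        else:
            older[key] = old_val
    return older

# After compaction only the latest revision is kept as full data; an older
# revision is rebuilt by folding the stored reverse diffs (newest first) onto it.
def revision(latest, reverse_diffs, steps_back):
    doc = latest
    for diff in reverse_diffs[:steps_back]:
        doc = rollback(doc, diff)
    return doc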

Question 1: How would manual revisions be any more space efficient?

Manual revisions as top-level documents could be compacted down to under 2N, not 100N, where N is the size of your actual data. Of course, the amount of data you want to store won't magically decrease. It is rather that CouchDB's storage engine trades disk space for speed and consistency at runtime, with asynchronous compaction to regain wasted space. And, just to make sure, quoting myself: "The revisions are not, at least at this point, meant to implement revision control systems; they rather exist for the optimistic concurrency control that allows any number of parallel readers while serialised writes are happening, and to power replication."
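
To make the "manual revisions as top-level documents" idea concrete, a small
Python sketch; the 'name:rev-N' ID scheme and the helper are just my own
invention, not anything CouchDB prescribes:

import json
import requests  # any HTTP client works; CouchDB speaks plain HTTP + JSON

COUCH = "http://localhost:5984/mydb"  # assumed local database

def save_revision(name, seq, data):
    """Store one 'manual revision' as its own top-level document."""
    doc_id = "%s:rev-%d" % (name, seq)  # made-up naming convention
    resp = requests.put(
        "%s/%s" % (COUCH, doc_id),
        data=json.dumps(data),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()

# Each revision is a separate document, so compaction only ever throws away
# CouchDB's internal MVCC copies (roughly the 2N overhead), never your history.
save_revision("shopping-list", 1, {"items": ["milk"]})
save_revision("shopping-list", 2, {"items": ["milk", "eggs"]})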


Compacting is a manual process at the moment. If we were to introduce a
scheduling mechanism, it would certainly be more general purpose and you
could hook in all sorts of operations, including compaction.

Question 2: In which case 'compacting' (a.k.a. destroying the revisions)
would still be optional; something we can turn off?

You need to run it explicitly at the moment. So by default, everything is kept. This might change in the future, but you will be able to disable it on a per-database level.
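
For reference, running it is a single HTTP call against the database;
something along these lines (the exact endpoint may differ between versions):

import requests  # compaction is triggered over plain HTTP

# Assumed local setup; adjust host and database name as needed.
resp = requests.post(
    "http://localhost:5984/mydb/_compact",
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
print(resp.json())  # CouchDB acknowledges with something like {"ok": true}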


Question 3: Can we use older revisions in views?

No.


See http://damienkatz.net/2008/02/incremental_map.html
and http://damienkatz.net/2008/02/incremental_map_1.html

Question 4: It appears from the comments that this will behave much like a
combinator. So the algorithmic complexity of adding one new document would be
O(1)?

I think so, but I am not the definitive source here. Damien?


You don't merge, at least at the moment, but declare one revision to be the winner when resolving the conflict. Since this is a manual process, you can make sure you don't lose revision trees. Merge might be in at some point,
but no (public) thought has gone into that yet.

Question 5: Is manually implementing a conflict resolver possible at the moment (didn't find it on the wiki), and if so, why not let that function just return the winning _data_? That way we could easily implement a merger
(which would be a much saner approach for most documents).

It is not possible at the moment. You need to resolve a conflict from your application. There you can do all the merging you need :)
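
To give you an idea of what that looks like from the application side, a rough
Python sketch; it assumes the document is fetched with the conflicts=true
query parameter (which lists conflicting revision IDs under "_conflicts"), so
check what your CouchDB version exposes:

import requests  # plain HTTP against CouchDB; adjust URL and names to taste

DB = "http://localhost:5984/mydb"

def resolve(doc_id, merge):
    """Merge conflicting revisions in the application and delete the losers."""
    winner = requests.get("%s/%s" % (DB, doc_id),
                          params={"conflicts": "true"}).json()
    for rev in winner.get("_conflicts", []):
        loser = requests.get("%s/%s" % (DB, doc_id), params={"rev": rev}).json()
        winner = merge(winner, loser)                       # your merge logic
        requests.delete("%s/%s" % (DB, doc_id), params={"rev": rev})
    winner.pop("_conflicts", None)
    requests.put("%s/%s" % (DB, doc_id), json=winner)       # save the result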


I don't understand what you mean here :) What is 'doc-is' in this context?

Oops, I meant 'doc-IDs' .. if I have several revisions of the same document as separate documents, then the doc-ID can no longer be some nice name,
since doc-IDs have to be unique.

Correct. Pretty names can get hairy in a distributed setup, so you might want to stick to UUIDs and provide your own "pretty-name". You wouldn't be able to use that programmatically, except in views, though.
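
As a tiny sketch of what I mean (the "pretty_name" field is just my own choice
of name; a view could then map documents by that field for lookups):

import uuid
import requests

DB = "http://localhost:5984/mydb"  # assumed local database

# Opaque UUID as the document ID, human-friendly name as an ordinary field.
doc = {"pretty_name": "shopping-list", "items": ["milk", "eggs"]}
requests.put("%s/%s" % (DB, uuid.uuid4().hex), json=doc).raise_for_status()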


The alternative of a cron-like system could work much like the
view-documents. These documents could contain a source URL (possibly
local),
a schedule parameter, and a function that maps a document to an array of documents that is treated as a batch-put. This way we could easily set up replication, but also all kinds of delayed and/or scheduled processing
of data.
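
Purely hypothetical -- nothing like this exists in CouchDB today -- but such a
schedule document could look roughly like this:

# Hypothetical structure, analogous to how view documents carry a map function.
schedule_doc = {
    "_id": "_schedule/mirror-feed",          # made-up naming convention
    "source": "http://example.com/otherdb",  # source URL, possibly local
    "every": "15min",                        # schedule parameter
    # A function, given as code the same way views are, that maps one
    # incoming document to an array of documents, applied as a batch-put.
    "map": "function(doc) { return [doc]; }",
}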

Indeed. No planning has gone into such a thing at the moment. You might want to open a feature request at https://issues.apache.org/jira/browse/COUCHDB or
come up with a patch.

Perhaps I will look into it myself if it turns out I need this desperately. I don't have any Erlang experience, but I think my experience with Haskell
will pull me through ;-)

Awesome! May I recommend Joe Armstrong's 'Programming Erlang' book at http://www.pragprog.com/titles/jaerlang, in case you are looking for literature. Thanks for considering helping out. Contributions are very important for the project.


Conflict resolution and merge functions do sound interesting; I don't
understand the "not guaranteeing scalability" remark, though. In the current implementation, this feature actually makes CouchDB scalable by ensuring that all nodes participating in a cluster eventually end up with the same data. If you really do need two-phase commit (if I understand correctly, you want that), that would need to be part of your application or an intermediate
storage layer.


No, no need for two-phase commits. Rather, I would suggest the complete opposite extreme: no failed inserts/updates ever, including batch puts. Just a
generic merging conflict solver.

JSON seems very merge-friendly to me ;-) It would seem that 99% of all
documents and use cases could be treated with the same generic merge
function.
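
A generic merge could look something like the following Python sketch: a
three-way merge over JSON-ish values, assuming a common ancestor revision is
available, with an arbitrary prefer-ours rule for genuine conflicts:

def merge_json(base, ours, theirs):
    """Three-way merge of JSON-like values; purely illustrative."""
    if isinstance(base, dict) and isinstance(ours, dict) and isinstance(theirs, dict):
        # Merge objects key by key, recursing into nested structures.
        return {key: merge_json(base.get(key), ours.get(key), theirs.get(key))
                for key in set(base) | set(ours) | set(theirs)}
    if ours == base:
        return theirs   # only the other side changed this value
    if theirs == base:
        return ours     # only our side changed this value
    return ours         # genuine conflict: arbitrarily prefer our version

# Example: two replicas edit different fields of the same document.
base   = {"title": "Notes", "tags": ["a"]}
ours   = {"title": "Notes", "tags": ["a", "b"]}
theirs = {"title": "Meeting notes", "tags": ["a"]}
print(merge_json(base, ours, theirs))
# -> {'title': 'Meeting notes', 'tags': ['a', 'b']}  (key order may vary)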

I think this idea is worth pursuing.

Cheers
Jan
--
