> Because the storage system is pretty wasteful and you'd end up with
> several Gigabytes of database files for just a few hundred Megabytes of
> actual data. So we do need compaction in one form or another. A
> compaction that retains revisions is a lot harder to write. Also,
> dealing with revisions in a distributed setup is less than trivial and
> would complicate the replication system quite a bit.
The gigabytes versus hundred megabytes seem acceptable to me, especially
when we can scale that easily. It also seems to depend on how often the
data changes. A simple way to compact revisions would be to additionally
store each revision as a reverse-diff. The normal data can then be
compacted as usual, whereas the reverse-diffs are just kept; from the most
recent version the older versions can be reconstructed. (A sketch of what
I mean is in the postscript below.)

Question 1: How would manual revisions be any more space efficient?

> Compacting is a manual process at the moment. If we would introduce a
> scheduling mechanism, it would certainly be more general purpose and you
> could hook in all sorts of operations, including compaction.

Question 2: In that case, would 'compacting' (a.k.a. destroying the
revisions) still be optional, something we can turn off?

Question 3: Can we use older revisions in views?

> See http://damienkatz.net/2008/02/incremental_map.html
> and http://damienkatz.net/2008/02/incremental_map_1.html

Question 4: It appears from the comments that this behaves much like a
combinator, so the complexity of adding one new document would be O(1)?
(A toy sketch of the property I have in mind is in the postscript.)

> You don't merge, at least at the moment, but declare one revision to be
> the winner when resolving the conflict. Since this is a manual process,
> you can make sure you don't lose revision trees. Merge might be in at
> some point, but no thoughts (at least public) went into that.

Question 5: Is manually implementing a conflict resolver possible at the
moment (I didn't find it on the wiki), and if so, why not let that
function simply return the winning _data_? That way we could easily
implement a merger, which would be a much saner approach for most
documents. (See the merge sketch in the postscript.)

> I don't understand what you mean here :) What is 'doc-is' in this
> context?

Oops, I meant 'doc-IDs': if I keep several revisions of the same document
as separate documents, then the doc-ID can no longer be some nice name,
since doc-IDs have to be unique.

> > The alternative of a cron-like system could work much like the
> > view-documents. These documents could contain a source URL (possibly
> > local), a schedule parameter and a function that maps a document to an
> > array of documents that is treated as a batch-put. This way we could
> > easily set up replication, but also all kinds of delayed and/or
> > scheduled processing of data.
>
> Indeed. No planning went into such a thing at the moment. You might want
> to open a feature request at
> https://issues.apache.org/jira/browse/COUCHDB or come up with a patch.

Perhaps I will look into it myself, if it turns out I need this
desperately. I don't have any Erlang experience, but I think my experience
with Haskell will pull me through ;-) (A sketch of what such schedule
documents could look like is also in the postscript.)

> Conflict resolution and merge functions do sound interesting, I don't
> understand the "not guaranteeing scalability" remark though. In the
> current implementation, this feature actually makes CouchDB scalable by
> ensuring that all nodes participating in a cluster eventually end up
> with the same data. If you really do need two-phase-commit (if I
> understand correctly, you want that), that would need to be part of your
> application or an intermediate storage layer.

No, there is no need for two-phase commits. Rather, I would suggest the
complete opposite extreme: no failed inserts/updates ever, including
batch puts, just a generic merging conflict resolver. JSON seems very
merge friendly to me ;-) It would seem that 99% of all documents and use
cases could be handled by the same generic merge function.

Greetings,
Ralf
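
P.S. A few Python sketches of the ideas above, purely to make them
concrete. Everything here is hypothetical: none of these functions or
document fields exist in CouchDB.

First, the reverse-diff idea, for flat JSON documents: only the newest
revision is stored in full, and each older revision is kept as a diff
that turns revision N into revision N-1.

    MISSING = object()  # marks a key absent from the older revision

    def reverse_diff(newer, older):
        """Compute the diff that reconstructs `older` from `newer`."""
        diff = {}
        for key in set(newer) | set(older):
            if key not in older:
                diff[key] = MISSING        # key was added in `newer`
            elif newer.get(key, MISSING) != older[key]:
                diff[key] = older[key]     # key changed or was removed
        return diff

    def apply_reverse_diff(doc, diff):
        older = dict(doc)
        for key, value in diff.items():
            if value is MISSING:
                older.pop(key, None)
            else:
                older[key] = value
        return older

    def nth_previous_revision(latest, reverse_diffs, n):
        """Walk back n revisions; reverse_diffs[0] is the newest diff."""
        doc = latest
        for diff in reverse_diffs[:n]:
            doc = apply_reverse_diff(doc, diff)
        return doc

Compaction would then only need to rewrite the full copy of the newest
revision, while the per-revision diffs (typically much smaller) are kept
verbatim.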
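Second, what I mean by "behaves much like a combinator" in question 4.
This is only my reading of the two blog posts, not CouchDB's actual code:
the reduce function must also accept already-reduced values, so cached
partial results can be recombined instead of recomputed from scratch.

    def reduce_count(values, rereduce=False):
        # counting documents; on rereduce the inputs are partial counts
        return sum(values) if rereduce else len(values)

    # Partial results for two halves of the data, computed once and cached:
    left = reduce_count(["doc1", "doc2", "doc3"])        # 3
    right = reduce_count(["doc4", "doc5"])               # 2

    # Adding one document only invalidates one partition; the other cached
    # partial result is reused, and the total is recombined from the parts.
    right = reduce_count(["doc4", "doc5", "doc6"])       # 3
    total = reduce_count([left, right], rereduce=True)   # 6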
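Third, what I imagine the cron-like "schedule documents" could look like.
The field names and the fetch/put helpers are all invented; this only
illustrates the shape of the proposal.

    import time

    # An invented schedule document: a source URL, a schedule parameter,
    # and a function mapping one source document to an array of output
    # documents that is stored as a single batch-put.
    schedule_doc = {
        "_id": "_schedule/mirror-and-tag",
        "source": "http://example.com/remote_db",  # possibly a local db
        "interval_seconds": 3600,                   # the schedule parameter
        # In CouchDB this would presumably be a JavaScript function stored
        # as text; a Python callable stands in for it here.
        "map": lambda doc: [dict(doc, mirrored=True)],
    }

    def run_schedule(sched, fetch_all_docs, batch_put):
        """fetch_all_docs and batch_put stand in for the HTTP calls."""
        while True:
            batch = [out
                     for doc in fetch_all_docs(sched["source"])
                     for out in sched["map"](doc)]
            batch_put(batch)                        # one bulk insert per run
            time.sleep(sched["interval_seconds"])

With `map` as the identity function this degenerates to plain pull
replication; with other functions it covers the delayed and scheduled
processing mentioned above.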
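Finally, the generic merge function from question 5 and the last
paragraph: a three-way merge of two conflicting revisions against their
common ancestor. Again just a sketch of the policy, not an existing API.

    MISSING = object()  # marks a key absent from a revision

    def merge(base, ours, theirs):
        """Three-way JSON merge; on a genuine conflict, `ours` wins (the
        one policy a user-supplied resolver would want to override)."""
        if ours == theirs:
            return ours
        if ours == base:
            return theirs                  # only the other side changed
        if theirs == base:
            return ours                    # only our side changed
        if all(isinstance(v, dict) for v in (base, ours, theirs)):
            merged = {}
            for key in set(base) | set(ours) | set(theirs):
                value = merge(base.get(key, MISSING),
                              ours.get(key, MISSING),
                              theirs.get(key, MISSING))
                if value is not MISSING:
                    merged[key] = value
            return merged
        return ours  # both sides changed the same scalar: pick a winner

    # Non-overlapping edits merge cleanly, with no failed update:
    # merge({"n": 1}, {"n": 1, "a": 2}, {"n": 5}) == {"n": 5, "a": 2}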
