Heya Ralf,
Thanks for your input and engaging in this discussion!
On Apr 12, 2008, at 04:36, Ralf Nieuwenhuijsen wrote:
Hi,
I've joined this mailing list because I wanted to reply to this discussion specifically.
I was hoping you could clear a number of things up for me.
1. Why make compacting the default? Isn't it more likely that in this day and age, most will prefer revisions for all data?
Because the storage system is pretty wasteful and you'd end up with
several Gigabytes of database files for just a few hundred Megabytes
of actual data. So we do need compaction in one form or another. A
compaction that retains revisions is a lot harder to write. Also,
dealing with revisions in a distributed setup is less than trivial and
would complicate the replication system quite a bit.
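To illustrate the waste compaction removes, here is a minimal sketch (in Python, purely for illustration; CouchDB itself is written in Erlang and compacts by copying live data into a new database file): only the latest revision of each document is carried over, and the bodies of older revisions are dropped.

```python
# Sketch only: a toy store maps doc_id -> list of revisions (oldest
# first). Compaction keeps just the most recent revision per document.

def compact(store):
    compacted = {}
    for doc_id, revisions in store.items():
        # retain only the newest revision's body
        compacted[doc_id] = [revisions[-1]]
    return compacted

store = {
    "doc1": [{"_rev": "1-a", "v": 1}, {"_rev": "2-b", "v": 2}],
    "doc2": [{"_rev": "1-c", "v": 9}],
}
print(compact(store))
```

The older bodies of "doc1" are what pile up into gigabytes of file for megabytes of live data.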
2. Compacting seems like very specific behavior; wouldn't a built-in cron-like system be much more generic? It could allow for all kinds of background processing, like replication, full-text search using JavaScript, compacting, searching for dead URLs, etc.
Compacting is a manual process at the moment. If we introduced a scheduling mechanism, it would certainly be more general-purpose and you could hook in all sorts of operations, including compaction.
3. Is support for some sort of reduce behavior, as part of the views, planned, and if so, what can we expect?
See http://damienkatz.net/2008/02/incremental_map.html
and http://damienkatz.net/2008/02/incremental_map_1.html
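Conceptually, the map/reduce behavior described in those posts amounts to something like the following sketch (in Python for illustration; actual CouchDB view functions are written in JavaScript and are computed incrementally): the map emits key/value pairs per document, and the reduce folds the values per key.

```python
# Conceptual sketch of a map/reduce view, not CouchDB's API.

def map_fun(doc):
    # emit one (key, value) row per tag on the document
    for tag in doc.get("tags", []):
        yield (tag, 1)

def reduce_fun(values):
    return sum(values)

docs = [
    {"_id": "a", "tags": ["couchdb", "erlang"]},
    {"_id": "b", "tags": ["couchdb"]},
]

rows = [kv for doc in docs for kv in map_fun(doc)]
index = {}
for key, value in rows:
    index.setdefault(key, []).append(value)
result = {key: reduce_fun(vals) for key, vals in index.items()}
print(result)  # {'couchdb': 2, 'erlang': 1}
```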
4. What is the default conflict behavior? Most recent version wins?
There's no 'recent' in a distributed system. At the moment, the
revision with the most changes wins, if I remember correctly.
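A rough sketch of why such a rule works without clocks (hedged: this is an illustrative simplification, not CouchDB's exact algorithm): pick the conflicting revision with the longer edit history, breaking ties by comparing the revision ids, so every node independently arrives at the same winner.

```python
# Deterministic winner picking among conflicting leaf revisions.
# Each revision is a (num_edits, rev_id) tuple; tuple comparison
# orders by edit count first, then rev id as the tie-breaker.

def pick_winner(revisions):
    return max(revisions)

conflicts = [(3, "aaa"), (2, "zzz"), (3, "bbb")]
print(pick_winner(conflicts))  # (3, 'bbb')
```

Because the rule depends only on the revisions themselves, no "most recent" timestamp is needed.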
5. Is it possible to merge on conflicts, or if not, how could attachments possibly model revisions properly? Wouldn't we lose a whole revision tree?
You don't merge, at least at the moment, but declare one revision to
be the winner when resolving the conflict. Since this is a manual
process, you can make sure you don't lose revision trees. Merging might come in at some point, but no thought (at least publicly) has gone into that.
6. Without merging, we need to store revisions in separate documents, thereby prohibiting useful doc-is for documents under revision.
I don't understand what you mean here :) What is 'doc-is' in this
context?
7. What added benefit do manual revisions have when we can just store extra revision data in each document anyway?
I'm quite sure my understanding of CouchDB can be lacking, but to me it seems like guaranteed revisions are the killer feature.
The revisions are not, at least at this point, meant to implement revision control systems. Rather, they exist for the optimistic concurrency control that allows any number of parallel readers while serialised writes are happening, and to power replication.
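The optimistic concurrency check boils down to this (a hypothetical in-memory stand-in for what CouchDB does over HTTP with the _rev field): a write must name the revision it is based on, and a stale revision is rejected rather than silently overwriting a parallel update.

```python
# Sketch of optimistic concurrency control via revision checking.

class Conflict(Exception):
    pass

class Store:
    def __init__(self):
        self.docs = {}  # doc_id -> (rev_number, body)

    def put(self, doc_id, body, based_on_rev=None):
        current = self.docs.get(doc_id)
        current_rev = current[0] if current else None
        if based_on_rev != current_rev:
            # writer was working from an outdated revision
            raise Conflict(f"stale rev {based_on_rev}, current is {current_rev}")
        new_rev = (current_rev or 0) + 1
        self.docs[doc_id] = (new_rev, body)
        return new_rev

s = Store()
rev1 = s.put("doc", {"n": 1})          # create: based on no revision
rev2 = s.put("doc", {"n": 2}, rev1)    # update against current rev: ok
try:
    s.put("doc", {"n": 3}, rev1)       # stale rev: rejected
except Conflict as e:
    print("conflict:", e)
```

Readers never block on this check; only conflicting writers are turned away.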
The alternative of a cron-like system could work much like the view documents. These documents could contain a source URL (possibly local), a schedule parameter, and a function that maps a document to an array of documents that is treated as a batch put. This way we could easily set up replication, but also all kinds of delayed and/or scheduled processing of data.
Indeed. No planning has gone into such a thing at the moment. You might
want to open a feature request at https://issues.apache.org/jira/browse/COUCHDB
or come up with a patch.
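For concreteness, such a hypothetical job document could be sketched like this (nothing of the sort exists in CouchDB; every name and field below is invented for illustration): a job names a source, a schedule, and a map from one input document to an array of output documents, which is then applied as a batch put.

```python
# Sketch of the proposed scheduled-map job documents (hypothetical).

def run_job(job, fetch, batch_put):
    """fetch(source) yields input docs; batch_put(docs) stores outputs."""
    out = []
    for doc in fetch(job["source"]):
        out.extend(job["map"](doc))
    batch_put(out)
    return len(out)

job = {
    "source": "local/mydb",
    "schedule": "every 10 minutes",  # would be honoured by a scheduler
    "map": lambda doc: [{"_id": doc["_id"] + ":copy", "v": doc["v"]}],
}

stored = []
n = run_job(job,
            fetch=lambda src: [{"_id": "a", "v": 1}, {"_id": "b", "v": 2}],
            batch_put=stored.extend)
print(n, stored)
```

With an identity map and a remote source, the same machinery would amount to simple replication.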
Likewise, being able to define a conflict function that could merge data or decide who wins seems like a much better alternative to the 'atomic' batch-put operations, which break down when distributed (thereby no longer guaranteeing scalability; another killer feature).
Conflict resolution and merge functions do sound interesting; I don't understand the "not guaranteeing scalability" remark, though. In the current implementation, this feature actually makes CouchDB scalable by ensuring that all nodes participating in a cluster eventually end up with the same data. If you really do need two-phase commit (if I understand correctly, you want that), it would need to be part of your application or an intermediate storage layer.
Cheers
Jan
--