Hello, In 0.9 CouchDB removed the transactional bulk docs feature in favour of simplifying sharding/replication.
The priorities behind this decision as I understood them are: 1. ensure that applications developed in a single server don't suffer from a degradation of guarantees if deployed using sharding 2. avoid the issues involving transactional I apologize if this proposal has already dismissed before. I did search the mailing list archives, but mostly found a discussion on why this stuff should not be done on IRC. I blame Jan for encouraging me to post ;-) So anyway, I think that we can have both features without needing to implement something like Paxos, and without silently breaking apps when they move to a sharding setup from a single machine setup. The basic idea is to keep the conflicts-are-data approach, keeping the current user visible replication and conflict resolution, but to allow the user to ask for stricter conflict checking. The api I propose is for the bulk docs operation to have an optional 'atomic' flag. When this flag is set CouchDB would atomically verify that all documents were committed without conflict (with respect to the supplied _rev), and if any one document conflicts, mark all of them as conflicting. Transaction recovery, conflict resolution etc is still the responsibility of the application, but provides an atomic guarantee that an inconsistent transaction will fail as a whole if it tries to write inconsistent data to the database, a guarantee that cannot be made using a client library (there are race conditions). Now the hard parts: 1. Replication The way I understand it replication currently works on a per document approach. If 'atomic' was specified in a bulk operation I propose that all the revisions created in that bulk operation were kept linked. If these linked revisions are being replicated, the same conflict resolution must be applied (the replication of the document itself is executed as bulk operation with aotmic=true, replicating all associated documents as well). The caveat is that even if you always use bulk docs with the atomic flag, if you a switch replica you could lose the D out of ACID: documents which are marked as non conflicting in your replica might be conflicting in the replica you switch to, in which case transactions that have already been committed appear to be rolled back from the application's point of view. This problem obviously already exists in the current implementation, but when 'atomic' is specified it could potentially happen a lot more often. 2. Data sharding This one is tricker. Two solutions both of which I think are acceptable, and either or both of which could be used: The easy way is to ignore this problem. Well not really: The client must ensure that all the documents affected by a single transaction are in the same shard, by using a partitioning scheme that allows this. If a bulk_docs operation with atomic set to true would affect multiple shards, that is an error (the data could still be written as a conflict, of course). If you want to enable the 'atomic' flag you'll need to be careful about how you use sharding. You can still use it for some of the transactions, but not all the time. I think this is a flexible and pragmatic solution. This means that if you choose to opt in to fully atomic bulk doc operations your app might not be deployable unmodified to a sharded setup, but it's never unsafe (no data inconsistencies). In my opinion this satisfies the requirement for no degredation of guarantees. It might not Just Work, but you can't have your cake and eat it too at the end of the day. The second way is harder but potentially still interesting. I've included it mostly for the sake of discussion. The core idea is to provide low level primitives on top of which a client or proxy can implement a multi phase commit protocol. The number of nodes involved is in the transaction depends on the data in the transaction (it doesn't need to coordinate all the nodes in the cluster). Basically this would breakdown bulk doc calls into several steps. First all the data is inserted to the backend, but it's set as conflicting so that it's not accidentally visible. This operation returns an identifier for the bulk doc operation (essentially a ticket for a prepared transaction). Once the data is available on all the shards it must be made live atomically. A two phase commit starts by acquiring locks on all the the transaction tickets and trying to apply them (the 'promise' phase), and then finalizing that application atomically (the 'accept' phase). To keep things simple the two phases should be scoped to a single keep alive connection. If the connection drops the locks should be released. Obviously Paxos ensues, but here's the catch: - The synchronization can be implemented first as a 3rd party component, it doesn't need to affect CouchDB's core - The atomic primitives are also useful for writing safe conflict resolution tools that handle conflicts that span multiple documents. So even if no one ends up implementing real Multi Paxos in the end CouchDB still benefits from having reusable synchronization primitives. (If this is interesting to anyone, see below [1]) I'd like to stress that this is all possible to do on top of the current 0.9 semantics. The default behavior in 0.9 does not change at all. You have to opt in to this more complex behavior. The problem with 0.9 is that there is no way to ensure atomicity and isolation from a client library, it must be done on the server, so by removing the ability to do this at all, couchdb is essentially no longer transaction. It's protected from internal data corruption, and it's protected from data loss (unlike say MyISAM which will happily overwrite your correct data), but it's still a potentially lossy model since conflicting revisions are not correlated. This means that you cannot have a transactional graph model, it's either or. Finally, you should know that I have no personal stake in this. I don't rely on CouchDB (yet), but I think it's somewhat relevant for a project I work on, and that the requirements for this project are not that far fetched. I'm the author of an object graph storage engine for Perl called KiokuDB. It serializes every object to a document unless told otherwise but the data is still a highly interconnected graph. As a user of this system I care a lot about transactions, but not at all about sharding (this might not hold for all the users of KiokuDB). Basically I already have what I need from KiokuDB; there are numerous backends for this system that satisfy me (BerkeleyDB, a transactional plain file backend, DBI (PostgreSQL, SQLite, MySQL)), and a number that don't fit my personal needs due to lack of transactions (SimpleDB, MongoDB, and CouchDB since 0.9). If things stays this way, then CouchDB is simply not intended for users like me (though I'll certainly still maintain KiokuDB::Backend::CouchDB). However, I do *want* to use CouchDB. I think that under many scenarios it has clear advantages compared to the other backends (mostly the fact that it's so easy, but also views support is nice). I think it'd be a shame if what was preventing me was a fix that ended up being a low hanging fruit to which no one objected. Regards, Yuval [1] in Paxos terms the CouchDB shards would do the Acceptor role and the client (be it the actual client or a sharding proxy, whatever delegates and consolidates the views) performs the the Learner role. Only the Learner is considered authoritative with respect to the final status of a transaction. Hard crashes of a shard during the 'accept' phase may produce inconsistent results if more than one Learner is used to proxy write operations. Focusing on this scenario is a *HUGE* pessimization. High availability of the Learner role can still be achieved using BerkeleyDB style master failover (voting). This transactional sharding proxy could of course also guarantee redundancy of shards. My point is that Paxos can be implemented as a 3rd party component if anyone actually wants/needs it, by providing comparatively simple primitives.