Great sleuthing, Michael! In addition to the recommendation to upgrade to {minor_version, 1}, which could be a good first step, how about going the extra mile to make _rev generation easier across platforms? This would benefit PouchDB and others.
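
For anyone following along, here is a rough Python sketch of the two float encodings discussed in the quoted mail below. It is not CouchDB's code and it covers only the float term itself, not the full term_to_binary output. The function names encode_float_v0 and encode_float_v1 are made up for illustration; the tag values 99 (FLOAT_EXT) and 70 (NEW_FLOAT_EXT) are taken from the Erlang external term format.

```python
import struct

FLOAT_EXT = 99      # tag used with {minor_version, 0}: text encoding
NEW_FLOAT_EXT = 70  # tag used with {minor_version, 1}: IEEE 754 encoding


def encode_float_v0(value: float) -> bytes:
    """Text form: '%.20e' output, zero-padded to 31 bytes.

    The digits depend on the local printf-style formatter, which is
    why this form is hard to reproduce on other platforms.
    """
    text = ("%.20e" % value).encode("ascii")
    return bytes([FLOAT_EXT]) + text.ljust(31, b"\x00")


def encode_float_v1(value: float) -> bytes:
    """Binary form: 8-byte big-endian IEEE 754 double.

    The same double produces the same 8 bytes on every platform.
    """
    return bytes([NEW_FLOAT_EXT]) + struct.pack(">d", value)


if __name__ == "__main__":
    print(encode_float_v0(3.14159))
    print(encode_float_v1(3.14159).hex())
```

The v0 bytes feed whatever digits the local formatter prints into the md5, while the v1 bytes are fixed by the IEEE 754 value, so any language that can write a big-endian double should be able to match them.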
Best
Jan
--

> On 23 Mar 2016, at 01:30, Michael Fair <mich...@daclubhouse.net> wrote:
>
> Greetings CouchDBers!
>
> I've been modifying a BERT library to recreate the md5 calc of a RevisionID
> in Java.
>
> I haven't tackled attachments yet, however with the awesome help of rnewson
> on the IRC channel, I've succeeded in recreating the md5 for all the
> documents I've tried so far which includes docs with values of strings, big
> and small integers, lists of big integers, lists of small integers, true,
> false, null, and objects; however the glaring exception is floats.
>
> The {minor_version, 0} format used for floats (A 31 byte string based
> representation in %.20e format) is dependent on the host environment doing
> the encoding and can't be reliably duplicated in other machines and
> languages.
>
> For instance, here are examples of encoding 3.14159 as %.20e string on this
> laptop:
> erlang: 3.1415899999999999000e+00 (This is what term_to_binary is using)
> python: 3.14158999999999988262e+00
> java: 3.14159000000000000000e+00
>
> These minor numerical differences unfortunately make the md5 computation
> untenable. And further, it seems that even different OTP versions and
> different hardware will encode the {minor_version, 0} format slightly
> differently on different Couch instances (A couple people on IRC shared
> with me what their OTP produced).
>
>
> To make a long story short and spare folks reading the mind-numbing
> details, without changing something, replicating the md5 for the revision
> id of documents with floats just can't be done sanely.
>
> As things are now, like I mentioned, even different installations of
> CouchDB can disagree on the MD5 revision id for the document {"pi":3.14159}.
>
>
> So where does this create an issue?
>
> It shows up by creating a conflict document during replication when the two
> servers calculated different revision ids for the same document update
> (which only happens if it was a multi-master update (an update where both
> sides were updated before replicating -- like separate laptops on separate
> planes each doing the same thing)).
>
> If only one side or the other was updated, it doesn't cause a problem.
>
> My goal is enabling people to upload documents from multiple server
> applications using JSON and Couch to handle the replication bits.
>
> To give this heterogeneous environment the same multi-master intelligence
> that Couch has, they need to be able to compute the same revision id that
> Couch would compute; otherwise documents modified directly in couch could
> create these kinds of multi-master type conflicts.
>
>
> ----
>
> What to do (aside from simply do nothing)?
>
> At the least I recommend changing the term_to_binary computation to use the
> {minor_version, 1} option in the rev_id calculation.
>
> This changes how floats are encoded to the 64-bit IEEE format. It became
> the standard way of encoding floats in OTP 17.0+ and is available as an
> option all the way back to OTP 11. As long as it's explicitly provided as
> a requested option in the term_to_binary call, all currently deployed OTP
> installations for Couch can do it.
>
> Doing this normalizes the md5 calculation for floats regardless of the OTP
> platform, and should make it feasible for third party applications to
> replicate the encoding.
>
>
>
> I have some other ideas beyond that, but they would require changes to the
> replication protocol to support.
>
>
> ----
>
> For anyone interested I'd be happy to share the code I have. It's still a
> bit rough in the document construction part, but once constructed, getting
> the binary encoding and revision id are each just a single call.
>
>
> Thanks,
> Mike

--
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/