Great sleuthing, Michael!

In addition to the recommendation to switch to {minor_version, 1}, which could
be a good first step, how about going the extra mile to make _rev generation
easier across platforms? This would benefit PouchDB and others.

Best
Jan
-- 

> On 23 Mar 2016, at 01:30, Michael Fair <mich...@daclubhouse.net> wrote:
> 
> Greetings CouchDBers!
> 
> I've been modifying a BERT library to recreate the md5 calc of a RevisionID
> in Java.
> 
> I haven't tackled attachments yet, however with the awesome help of rnewson
> on the IRC channel, I've succeeded in recreating the md5 for all the
> documents I've tried so far which includes docs with values of strings, big
> and small integers, lists of big integers, lists of small integers, true,
> false, null, and objects; however the glaring exception is floats.
> 
> The {minor_version, 0} format used for floats (a 31-byte string
> representation in %.20e format) depends on the host environment doing
> the encoding and can't be reliably duplicated on other machines and
> in other languages.
> 
> For instance, here are examples of encoding 3.14159 as %.20e string on this
> laptop:
> erlang: 3.1415899999999999000e+00  (This is what term_to_binary is using)
> python: 3.14158999999999988262e+00
> java:   3.14159000000000000000e+00
> 
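
To illustrate why these last-digit differences matter for the hash, here is a
minimal Python sketch. It hashes only the three strings quoted above, not a
full term_to_binary encoding of a document, so the digests are not real
revision ids. The `observed` table and the `digest` helper are names made up
for this example.

```python
import hashlib

# The three %.20e renderings of 3.14159 quoted in the message above.
observed = {
    "erlang": "3.1415899999999999000e+00",
    "python": "3.14158999999999988262e+00",
    "java": "3.14159000000000000000e+00",
}


def digest(text):
    """Return the hex md5 of the ASCII bytes of text."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


# One digest per runtime; any differing digit gives an unrelated hash.
digests = {name: digest(text) for name, text in observed.items()}

# What this interpreter produces for the same float.
local = "%.20e" % 3.14159

for name, value in sorted(digests.items()):
    print(name, observed[name], value)
print("local ", local)
```

All three digests differ, so any input to the md5 that embeds this string
differs as well.
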
> These minor numerical differences unfortunately make the md5 computation
> impossible to reproduce.  Further, it seems that even different OTP
> versions and different hardware will encode the {minor_version, 0} format
> slightly differently on different Couch instances (a couple of people on
> IRC shared with me what their OTP produced).
> 
> 
> To make a long story short and spare folks reading the mind-numbing
> details, without changing something, replicating the md5 for the revision
> id of documents with floats just can't be done sanely.
> 
> As things are now, like I mentioned, even different installations of
> CouchDB can disagree on the MD5 revision id for the document {"pi":3.14159}.
> 
> 
> So where does this create an issue?
> 
> It shows up as a conflict during replication when the two servers have
> calculated different revision ids for the same document update.  That only
> happens with a multi-master update, where both sides were updated before
> replicating (like separate laptops on separate planes each making the
> same change).
> 
> If only one side or the other was updated, it doesn't cause a problem.
> 
> My goal is to enable people to upload documents from multiple server
> applications using JSON, with Couch handling the replication bits.
> 
> To give this heterogeneous environment the same multi-master intelligence
> that Couch has, they need to be able to compute the same revision id that
> Couch would compute; otherwise documents modified directly in couch could
> create these kinds of multi-master type conflicts.
> 
> 
> ----
> 
> What to do (aside from simply doing nothing)?
> 
> At the least I recommend changing the term_to_binary computation to use the
> {minor_version, 1} option in the rev_id calculation.
> 
> This changes how floats are encoded to the 64-bit IEEE format.  It became
> the standard way of encoding floats in OTP 17.0+ and is available as an
> option all the way back to OTP 11.  As long as it's explicitly provided as
> a requested option in the term_to_binary call, all currently deployed OTP
> installations for Couch can do it.
> 
> Doing this normalizes the md5 calculation for floats regardless of the OTP
> platform, and should make it feasible for third party applications to
> replicate the encoding.
> 
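
For anyone reimplementing this outside Erlang, the two float encodings can be
sketched in Python as below. This assumes the external term format layout as
I understand it: a leading version byte 131, then tag 70 (NEW_FLOAT_EXT) plus
an 8-byte big-endian IEEE 754 double for {minor_version, 1}, or tag 99
(FLOAT_EXT) plus a 31-byte zero-padded %.20e string for {minor_version, 0}.
The function names are hypothetical, and this covers a bare float only, not a
whole document body.

```python
import struct

ETF_VERSION = 131    # leading version byte of the external term format
NEW_FLOAT_EXT = 70   # tag: 8-byte big-endian IEEE 754 double
FLOAT_EXT = 99       # tag: legacy 31-byte text representation


def encode_float_v1(x):
    """Encode a bare float as {minor_version, 1} would (assumed layout)."""
    return bytes([ETF_VERSION, NEW_FLOAT_EXT]) + struct.pack(">d", x)


def encode_float_v0(x):
    """Encode a bare float the legacy way, using this runtime's %.20e.

    The text part is what varies between platforms, so the result is
    not guaranteed to match what a given Erlang node produces.
    """
    text = ("%.20e" % x).encode("ascii")
    return bytes([ETF_VERSION, FLOAT_EXT]) + text.ljust(31, b"\x00")


print(encode_float_v1(3.14159).hex())
print(encode_float_v0(3.14159))
```

The binary form leaves no room for formatting choices: any runtime that can
write an IEEE 754 double in big-endian order yields the same 10 bytes.
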
> 
> 
> I have some other ideas beyond that, but they would require changes to the
> replication protocol to support.
> 
> 
> ----
> 
> For anyone interested I'd be happy to share the code I have.  It's still a
> bit rough in the document construction part, but once constructed, getting
> the binary encoding and revision id are each just a single call.
> 
> 
> Thanks,
> Mike

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
