On Sun, Jun 21, 2009 at 4:40 PM, Antony Blakey <[email protected]> wrote:
>
> On 22/06/2009, at 7:26 AM, Paul Davis wrote:
>
>> Also +lots on deterministic revisions. As a side note, we've been
>> worrying a bit about how to calculate the revision id's in the face of
>> JSON differences in clients. I make a motion that we just stop caring
>> and define how we calculate the signature. Ie, instead of calling it
>> canonical JSON just call it, "The CouchDB deterministic revision
>> algorithm" or some such. Then we can feel free to punt on any of the
>> awesome issues we'd run into with a full canonical JSON standard.
>
> I haven't seen the recent discussions about canonicalisation, but IMO a
> minimum requirement is that the infoset <-> serialisation mapping must be
> 1:1, which requires completeness and prescription. Doing unicode
> normalisation (NFC probably) is IMO also an absolute requirement - it's
> virtually impossible to construct documents by hand with repeatable results
> without it.
>
My gut reaction is that normalizing strings using NFC [1] is not
appropriate for a database. Here's why we should treat strings as binary
and not worry about unicode normalization at all:

First of all, I'm certain we can't require that all input already be NFC
normalized. The real-life failure condition would be: "your language /
operating system is not supported by CouchDB." A normal user is not going
to understand the first bit of the fact that the underlying binary
representation of their text could be subtly different in a way that is
invisible to them. And even if they did understand that, they'd be hard
pressed to change it. So rejecting non-normalized strings is unacceptable.

Secondly, we're a database, so I find the notion that we should quietly
auto-normalize user input highly suspicious. Maybe normalization is not
lossy, but one particular use case (however slim) that we can't support if
we auto-normalize is a document which lists variations on the same string,
to illustrate how non-normalized forms look the same but have different
binary representations. A database which can't store that seems flawed, to
me.

So we can't require normalized input and we can't auto-normalize. Where
does this leave us?

Under the current (raw binary) string handling, two variations on a
document which would, when NFC normalized, be binary identical, could have
different deterministic revs. Since 99+% of content is already normalized,
we're looking at a very small set of cases where we'd have distinct revs
for documents that have similar (or identical, depending on your pov)
content. The fact that there are rare pairs of documents out there which
one could argue are the same, but which have different revs, strikes me as
ever so slightly non-optimal, but not really a big deal.
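To make the scenario concrete, here's a minimal Python sketch. The SHA-1
of the UTF-8 bytes stands in for whatever deterministic rev function we
end up with (it is an illustration, not CouchDB's actual algorithm); the
two strings render identically but differ at the codepoint level:

```python
import hashlib
import unicodedata

# Two visually identical strings: "e-acute" as the precomposed codepoint
# U+00E9 vs "e" followed by the combining acute accent U+0301.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# They look the same on screen but are not equal as codepoint sequences...
assert precomposed != decomposed

# ...so any rev computed over the raw bytes differs too. SHA-1 over UTF-8
# is just a stand-in for a deterministic rev function.
rev_a = hashlib.sha1(precomposed.encode("utf-8")).hexdigest()
rev_b = hashlib.sha1(decomposed.encode("utf-8")).hexdigest()
assert rev_a != rev_b

# NFC collapses the decomposed form to the precomposed one, so after
# normalization the bytes (and hence the revs) would match.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

This is exactly the rare divergence described above: same-looking content,
different revs, unless clients normalize before writing.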
I think the potential optimization can be nicely accounted for by a simple
recommendation:

* If you are doing independent updates (from distinct client software) of
strings in a document, and relying on deterministic revs to avoid
conflict-on-replication, you should NFC normalize your content.

The antecedents in that clause show how the case where normalization
matters for deterministic revs is even rarer than the existence of
non-normalized unicode.

A secondary recommendation for people relying on deterministic revs to
avoid conflicts on multi-node updates would be: "don't mutilate strings
you didn't edit." As long as client software doesn't go jiggling forms to
other random look-alike codepoints without asking, any potential trouble
is confined to fields actually affected by an update.

The common use case for these revs is not lots of distinct client software
all doing identical updates by hand and then pushing them to different
eventually-replicating CouchDB cluster members. (Which is the use case
where any of the above discussion is relevant.) The paradigm use case of
deterministic revs is a single piece of software, running on a single box,
creating a document and saving it to multiple cluster members using the
same rev. Treating strings as binary completely and totally serves this
use case.

Chris

[1] http://www.macchiato.com/unicode/nfc-faq

> Unicode normalisation is an issue for clients because it requires they have
> access to a Unicode NFC function.
>
> Antony Blakey
> --------------------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> It is as useless to argue with those who have renounced the use of reason as
> to administer medication to the dead.
>   -- Thomas Jefferson
>

--
Chris Anderson
http://jchrisa.net
http://couch.io
