Re: Compatibility Casefold Equivalence

Carl via Unicode Mon, 26 Nov 2018 23:49:57 -0800

Thanks for the reply. Responses inline:

> On November 24, 2018 at 5:33 PM Asmus Freytag via Unicode 
> <[email protected]> wrote: 
>  
> 
> On 11/22/2018 11:58 AM, Carl via Unicode wrote: 
> > (It looks like my HTML email got scrubbed, sorry for the double post)
> > 
> > Hi,
> > 
> > 
> > In Chapter 3 Section 13, the Unicode spec defines D146:
> > 
> > 
> > "A string X is a compatibility caseless match for a string Y if and only 
> > if: NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = 
> > NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))"
> > 
> > 
> > I am trying to understand the "if and only if" part of this.
> > Specifically, why is the outermost NFKD necessary? Could it also be an NFKC
> > normalization? Is wrapping the outer NFKD in an NFC or NFKC on both sides
> > of the equation okay?
> > 
> > 
> > My use case is that I am trying to store user-provided tags in a database.  
> > I would like the tags to be deduplicated based on compatibility and 
> > caseless equivalence, which is how I ended up looking at D146.  However, 
> > because decomposition can result in much larger strings, I would prefer to
> > keep the stored version in NFC or NFKC (I *think* this doesn't matter
> > after doing the casefolding as described above).
> 
> 
> Carl,
> 
> 
> you may find that some of the complications are limited to a small number of 
> code points. In particular, classical (polytonic) Greek has some gnarly 
> behavior wrt case; and some compatibility characters have odd edge cases.
> 
>


I suspected that the number of edge cases would be small, but I lack a way of
enumerating them (i.e., I don't know what I don't know).
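
One rough way to get a partial list is to scan single code points and flag
those where a single fold-then-decompose pass gives a different result than the
full D146 transform. The sketch below assumes Python's str.casefold() is an
acceptable stand-in for toCasefold and that the Unicode version bundled with
the unicodedata module is recent enough. The names single_pass, d146_key, and
needs_second_pass are made up for illustration, and the scan only looks at
individual code points, not at multi-character sequences.

```python
import sys
import unicodedata


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def nfkd(s: str) -> str:
    return unicodedata.normalize("NFKD", s)


def single_pass(s: str) -> str:
    # Hypothetical shortcut: NFKD(toCasefold(NFD(s))), one pass only.
    return nfkd(nfd(s).casefold())


def d146_key(s: str) -> str:
    # NFKD(toCasefold(NFKD(toCasefold(NFD(s))))) as written in D146,
    # with str.casefold() assumed to approximate toCasefold.
    return nfkd(nfkd(nfd(s).casefold()).casefold())


def needs_second_pass(start: int = 0, stop: int = sys.maxunicode + 1) -> list:
    """Return the characters in [start, stop) where one pass is not enough."""
    hits = []
    for cp in range(start, stop):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not valid standalone characters
        ch = chr(cp)
        if single_pass(ch) != d146_key(ch):
            hits.append(ch)
    return hits
```

Calling needs_second_pass() with no arguments scans the whole code space;
printing ord() and unicodedata.name() for each hit gives a list to review by
hand. The result depends on the Unicode data shipped with the interpreter.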

> I'm personally not a fan of allowing every single Unicode code point in 
> things like usernames (or other types of identifiers). Especially, if 
> including some code points makes the "general case" that much more complex, 
> my personal recommendation would be to simply disallow / reject a small set 
> of troublesome characters; especially if they aren't part of some widespread 
> modern orthography. 
> 
> 
> While Unicode is about being able to digitally represent all written text, 
> identifiers don't follow the same rules. The main reason why people often 
> allow "anything" is because it's easy in terms of specification. Sometimes, 
> you may not have control over what to accept; for example if tags are 
> generated from headers in a document, it would require some transform to 
> handle disallowed code points.
> 
> 

The identifiers doc was what I had originally planned on using, but some of the
rules there are too restrictive for my case.  For example, IIUC variation
selectors are not allowed (scrubbed?), which prevents use of some emoji
sequences.  Also, the ID_Start and XID_Start properties are too strict (since
I'm not using this in a programming language or otherwise security-sensitive
environment), as they forbid leading digits.  Hashtags are close to what I
want, but again, they specify a leading "#".

Really, the problem for me is that I don't know what liberties I can take with
restricting or allowing certain characters.  Being too restrictive might be
culturally insensitive, but being too lax might open the system to abuse.
Would it be overkill to render the tag text to a picture, hash the picture, and
store that instead?  It seems like it would force visually identical strings to
the same set of bytes.


> Case is also only one of the types of duplication you may encounter. In many 
> South and South East Asian scripts you may encounter cases where two 
> sequences of characters, while different, will normally render identical. 
> Arabic also has instances of that. Finally, you may ask yourself whether your 
> system should treat simplified and traditional Chinese ideographs as separate 
> or as a variant not unlike the way you treat case.
> 
> 

Ideally I would like the same kind of matching as my browser does when I press 
Ctrl-F.  If simplified and traditional Chinese match, that's probably good 
enough.  



> About storing your tag data: you can obviously store them as NFC, if you 
> like: in that case, you will have to run the operations both on the stored 
> and on the new tag.
> 
> 
> Finally, there are some cases where you can tell that two strings are 
> identical without actually carrying out the full set of operations:
> 
> 
> Y = X
> 
> 
> NFC(Y) = NFC(X)
> 
> 
> and so on. (If these conditions are true, the full condition above must also 
> be true). For example, let's apply 
> 
> NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
> 
> 
> on both sides of
> 
> 
> NFC(Y) = NFC(X)
> 
> 
> First:
> 
> 
> NFD(NFC(Y)) = NFD(NFC(X))
> 
> 
> Because the two sides are equal, applying toCasefold results in equal 
> strings, and so on all the way to the outer NFKD.

As a minor followup, TR 15 section 7 says:

"NFKC(NFKD(x)) == NFKC(x)"

which implies that the outer NFKD can be replaced:

NFKC(toCasefold(NFKD(toCasefold(NFD(X)))))
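
A minimal Python sketch of that substitution, assuming str.casefold() is an
acceptable stand-in for toCasefold and that the Unicode version bundled with
the unicodedata module is recent enough. The names d146_key and stored_key are
made up for illustration; stored_key is the composed form that would go into
the database.

```python
import unicodedata


def _inner(s: str) -> str:
    # toCasefold(NFKD(toCasefold(NFD(s)))): everything except the outer step.
    nfd = unicodedata.normalize("NFD", s)
    nfkd = unicodedata.normalize("NFKD", nfd.casefold())
    return nfkd.casefold()


def d146_key(s: str) -> str:
    # Outer NFKD, exactly as in D146.
    return unicodedata.normalize("NFKD", _inner(s))


def stored_key(s: str) -> str:
    # Outer NFKC instead; usually shorter, so friendlier to store.
    return unicodedata.normalize("NFKC", _inner(s))


def same_tag(a: str, b: str) -> bool:
    # Two tags are duplicates when their stored keys are equal.
    return stored_key(a) == stored_key(b)
```

Since stored_key(s) is NFKC applied to the same string that d146_key(s)
decomposes, the identity quoted above gives
NFKC(d146_key(s)) == stored_key(s), so tags with equal D146 keys get equal
stored keys.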


> 
> 
> In other words, you can stop the comparison at any point where the two sides 
> are equal. From that point on, the outer operations cannot add anything.


That's a good point.  In my case, since one side of the equation will be stored 
in a DB, I believe I need to do the full transform.  That said, the early exit 
would be useful for in-memory comparisons. 
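
For the in-memory case, the early exit could be sketched as below, again
assuming str.casefold() approximates toCasefold. STEPS and
compat_caseless_match are hypothetical names. The loop applies the D146
operations one at a time to both strings and stops as soon as the two sides
agree, since every later step would map equal inputs to equal outputs.

```python
import unicodedata
from functools import partial

# The D146 operations, innermost first.
STEPS = (
    partial(unicodedata.normalize, "NFD"),
    str.casefold,
    partial(unicodedata.normalize, "NFKD"),
    str.casefold,
    partial(unicodedata.normalize, "NFKD"),
)


def compat_caseless_match(x: str, y: str) -> bool:
    if x == y:
        return True  # identical input, nothing to compute
    for step in STEPS:
        x, y = step(x), step(y)
        if x == y:
            return True  # the remaining steps cannot make them differ
    # All five steps applied and the results still differ: no match.
    return False
```

Reaching the end of the loop without a match means the full transform was
computed for both strings, so a False result is exactly the D146 condition
failing.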

> 
> 
> A./
