On 15.08.2013 17:53, Olemis Lang wrote:
> - An intermediate step of the NFC normalization process seems to be
>   the NFD form. Considering performance, isn't it better to use the
>   latter (i.e. NFD) as a reference?
That is an implementation detail of the normalization library; Python's
unicodedata.normalize, for example, doesn't expose it. Although I have
to say that, in my experience, the usual approach is indeed to convert
to a stable decomposed form (NFD) first.

In any case, performance should not be much of an issue here. As long
as you decide on one normalization form for the database keys and apply
it consistently, the queries themselves will be relatively fast.
Normalization only becomes expensive once you start playing with
collation algorithms in the databases, and you certainly don't want to
do that. (The first sketch at the end of this message shows the kind of
key normalization I mean.)

> - Is it more accurate to match «compatible» PKs (i.e. NFKC, NFKD) ?
> [...]

That depends on the application. For text processing, the answer is a
definite "no": the compatibility forms discard distinctions that are
significant in the original text. For Bloodhound, it's not so clear
that a compatibility form would be unacceptable; although if I were a
Russian or Greek speaker, I'd take a dim view of Bloodhound converting
my "А" (U+0410) or "Α" (U+0391) to "A" (U+0041). (The second sketch
below shows the sort of distinctions compatibility folding erases.)
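To make the first point concrete, here is a minimal sketch using only
the standard library (normalize_key is just an illustrative name, not
anything Bloodhound has today):

import unicodedata

def normalize_key(text):
    # Pick one canonical form (NFC here) and apply it on every write
    # and every lookup, so equality in the database stays a plain
    # codepoint-for-codepoint comparison -- no collation tricks needed.
    return unicodedata.normalize('NFC', text)

# "cafe" with a composed e-acute (U+00E9) vs. a decomposed one
# (e + combining U+0301): different codepoint sequences, but the same
# key once both sides are normalized.
composed = u'caf\u00e9'
decomposed = u'cafe\u0301'
assert composed != decomposed
assert normalize_key(composed) == normalize_key(decomposed)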
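And for the second point, a quick demonstration of what the
compatibility forms actually fold (the sample characters are my own
picks, just to show the flavour of the transformation):

from __future__ import print_function
import unicodedata

# NFKC folds "compatibility" characters into their plain equivalents,
# while NFC leaves them alone. The folding is lossy, which is exactly
# why it is wrong for general text processing.
samples = [
    u'\ufb01',  # LATIN SMALL LIGATURE FI    -> 'fi' under NFKC
    u'\u00b2',  # SUPERSCRIPT TWO            -> '2'  under NFKC
    u'\uff21',  # FULLWIDTH LATIN CAPITAL A  -> 'A'  under NFKC
]
for s in samples:
    print(repr(unicodedata.normalize('NFC', s)),
          repr(unicodedata.normalize('NFKC', s)))

A Japanese user typing a fullwidth Ａ would presumably be just as
unhappy about that last folding as I would be about my Cyrillic А.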
-- Brane

--
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. [email protected]