On 15.08.2013 17:53, Olemis Lang wrote:
> - An intermediate step of the NFC normalization process seems to be
> the NFD form. Considering performance, isn't it better to use the
> latter (i.e. NFD) as a reference?

That is an implementation detail of the normalization library. Python,
for example, doesn't expose it: unicodedata.normalize() only takes the
target form. Although I have to say that, in my experience, the usual
approach is indeed to convert to a stable decomposed form (NFD) first.
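
For what it's worth, the Python API looks like the sketch below. It
only exposes the target form, but you can check that composing the
NFD result gives the same answer as going straight to NFC:

import unicodedata

s = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT (decomposed "e-acute")

nfc = unicodedata.normalize("NFC", s)   # composed form, a single U+00E9
nfd = unicodedata.normalize("NFD", s)   # decomposed form, U+0065 U+0301

print([hex(ord(c)) for c in nfc])   # ['0xe9']
print([hex(ord(c)) for c in nfd])   # ['0x65', '0x301']

# Composing the decomposed form yields the same result as NFC directly,
# and normalizing an already-normalized string changes nothing.
assert unicodedata.normalize("NFC", nfd) == nfc
assert unicodedata.normalize("NFC", nfc) == nfc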

In any case, performance should not be much of an issue here. As long as
you decide on one normalization form for the database keys, the queries
themselves will be relatively fast. Normalization becomes expensive once
you start playing with collation algorithms in the databases, and you
certainly don't want to do that.
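
To make that concrete, here is a minimal sketch of "pick one form and
apply it at the boundary". The normalize_key helper and the table and
column names in the commented queries are made up for illustration:

import unicodedata

KEY_FORM = "NFC"  # the single form chosen for all database keys

def normalize_key(key):
    """Bring a key into the agreed-upon form before any INSERT or SELECT."""
    return unicodedata.normalize(KEY_FORM, key)

# Hypothetical usage with a DB-API cursor; with every key funneled
# through the same function, plain equality in SQL is enough:
#   cursor.execute("INSERT INTO ticket (summary) VALUES (%s)",
#                  (normalize_key(summary),))
#   cursor.execute("SELECT id FROM ticket WHERE summary = %s",
#                  (normalize_key(summary),))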

> - Is it more accurate to match «compatible» PKs (i.e. NFKC, NFKD) ? [...]

That depends on the application. For text processing, the answer is a
definite "no". For Bloodhound, it's not so clear that the compatibility
forms would be unacceptable. Although if I were a Russian or Greek
speaker I'd take a dim view of Bloodhound folding my "А" (U+0410) or
"Α" (U+0391) into "A" (U+0041); strictly speaking, though, that kind of
cross-script folding goes beyond what NFKC and NFKD actually do.

-- Brane


-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. [email protected]
