On 8/15/13, Branko Čibej <[email protected]> wrote:
> On 15.08.2013 17:53, Olemis Lang wrote:
>> - An intermediate step of the NFC normalization process seems to be
>> the NFD form. Considering performance, isn't it better to use the
>> latter (i.e. NFD) as a reference?
>
> That is an implementation detail of the normalization library. Python
> for example (unicodedata.normalize) doesn't expose that. Although I have
> to say that, in my experience, the normal approach is indeed to convert
> to a stable decomposed form (NFD) first.
>
It seems so; that's why I asked. If a call to unicodedata.normalize
performs the transformations (anything => NFD => NFC) under the hood,
as the definition seems to suggest, then relying upon NFD would skip
the last conversion step and might yield a slight performance
improvement (a quick way to check this is sketched below). Notice that
one of the consequences of adopting bep:0003 is that product prefixes
(and therefore normalization) will be used quite often, almost
everywhere, for almost everything.
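Just as an illustration, a micro-benchmark along those lines might look
like the sketch below; the sample string is made up, and any real
difference would have to be measured on actual product prefixes:
{{{#!py
# Hypothetical micro-benchmark: compare the cost of NFD vs NFC
# normalization on the same input. Sample string is arbitrary.
import timeit

setup = (
    "import unicodedata; "
    "s = u'Cafe\\u0301 r\\u00e9sum\\u00e9 \\u0410bc' * 100"
)
for form in ('NFD', 'NFC'):
    t = timeit.timeit(
        "unicodedata.normalize(%r, s)" % form,
        setup=setup, number=10000)
    print('%s: %.3fs' % (form, t))
}}}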
> In any case, performance should not be much of an issue here. As long as
> you decide on one normalization form for the database keys, the queries
> themselves will be relatively fast.
Even though I was thinking about the conversions mentioned above, it's
good to know this. I had not really thought about the query side; I
just assumed that the impact at the DB level would not be significant.
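Just to make the intended usage explicit, here is a minimal sketch of
normalizing keys once in application code so the database only ever
compares already-normalized values; the `product` table and `prefix`
column are made up for the example, not the actual Bloodhound schema:
{{{#!py
# Sketch: normalize on write and on lookup, so no DB-side collation
# is ever involved in comparing keys.
import sqlite3
import unicodedata

def norm(key):
    # One normalization form for every DB key (NFC in this sketch).
    return unicodedata.normalize('NFC', key)

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE product (prefix TEXT PRIMARY KEY)')
db.execute('INSERT INTO product VALUES (?)', (norm(u'Cafe\u0301'),))

# Look the key up using a differently-composed but canonically
# equivalent spelling; after norm() both sides are byte-identical.
row = db.execute('SELECT prefix FROM product WHERE prefix = ?',
                 (norm(u'Caf\u00e9'),)).fetchone()
print(row is not None)  # True
}}}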
> Normalization becomes expensive once
> you start playing with collation algorithms in the databases, and you
> certainly don't want to do that.
>
Test cases for this (if needed) will be written later.
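For the record, such a test might look like the sketch below; the
`normalize_key` helper is hypothetical, standing in for whatever
function ends up applying the chosen form to DB keys:
{{{#!py
# Sketch of a possible test case for normalized key matching.
import unicodedata
import unittest

def normalize_key(key):
    # Hypothetical helper; the chosen form here is NFC.
    return unicodedata.normalize('NFC', key)

class NormalizedKeyTestCase(unittest.TestCase):
    def test_canonically_equivalent_keys_match(self):
        composed = u'Caf\u00e9'      # precomposed e with acute
        decomposed = u'Cafe\u0301'   # e + combining acute accent
        self.assertEqual(normalize_key(composed),
                         normalize_key(decomposed))

if __name__ == '__main__':
    unittest.main()
}}}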
>> - Is it more accurate to match «compatible» PKs (i.e. NFKC, NFKD)? [...]
>
> That depends on the application. For text processing, the answer is a
> definite "no". For Bloodhound, it's not so clear that the compatibility
> form would be unacceptable. Although if I were a Russian or Greek
> speaker I'd take a dim view if Bloodhound converted my "А" (U+0410) or
> "Α" (U+0391) to "A" (U+0041).
>
For other applications (e.g. free-text search) I do see a reason for
considering NFK* forms. However, for PKs I'm not really sure, mainly
because of the concern above. Nevertheless:
{{{#!py
>>> import unicodedata
>>> s = u'\u0410bc'
>>> print s
Аbc
>>> nfc = unicodedata.normalize('NFC', s)
>>> nfc
u'\u0410bc'
>>> nfd = unicodedata.normalize('NFD', s)
>>> nfd
u'\u0410bc'
>>> nfkd = unicodedata.normalize('NFKD', s)
>>> nfkd
u'\u0410bc'
>>> nfkc = unicodedata.normalize('NFKC', s)
>>> nfkc
u'\u0410bc'
>>> unicodedata.normalize('NFKC', u'Abc')
u'Abc'
>>> unicodedata.normalize('NFKC', u'Abc') == unicodedata.normalize('NFKC', s)
False
}}}
So it seems they still belong in different equivalence classes, even
under the compatibility forms...
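Indeed, the two characters are distinct code points with no
compatibility decomposition at all, so no normalization form will ever
unify them; detecting visually confusable identifiers would need
something beyond normalization (e.g. the Unicode TR39 confusables data,
which unicodedata does not expose):
{{{#!py
>>> import unicodedata
>>> unicodedata.name(u'\u0410')
'CYRILLIC CAPITAL LETTER A'
>>> unicodedata.name(u'A')
'LATIN CAPITAL LETTER A'
>>> unicodedata.decomposition(u'\u0410')
''
}}}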
--
Regards,
Olemis - @olemislc