On 8/15/13, Branko Čibej <[email protected]> wrote:
> On 15.08.2013 17:53, Olemis Lang wrote:
>> - An intermediate step of the NFC normalization process seems to be
>> the NFD form. Considering performance, isn't it better to use the
>> latter (i.e. NFD) as a reference?
>
> That is an implementation detail of the normalization library. Python
> for example (unicodedata.normalize) doesn't expose that. Although I have
> to say that, in my experience, the normal approach is indeed to convert
> to a stable decomposed form (NFD) first.
>
It seems so; that's why I asked. If a call to unicodedata.normalize
performs the transformations (anything => NFD => NFC) under the hood,
as the definition seems to suggest, then relying upon NFD would skip
the last conversion step and might yield a slight performance
improvement (a quick way to check this is sketched below). Notice that
one of the consequences of adopting bep:0003 is that product prefixes
(and therefore normalization) will be used quite often, almost
everywhere, for almost everything.
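Just as an illustration, a micro-benchmark along those lines might look
like the sketch below; the sample string is made up, and any real
difference would have to be measured on actual product prefixes:
{{{#!py
# Hypothetical micro-benchmark: compare the cost of NFD vs NFC
# normalization on the same input. Sample string is arbitrary.
import timeit

setup = (
    "import unicodedata; "
    "s = u'Cafe\\u0301 r\\u00e9sum\\u00e9 \\u0410bc' * 100"
)
for form in ('NFD', 'NFC'):
    t = timeit.timeit(
        "unicodedata.normalize(%r, s)" % form,
        setup=setup, number=10000)
    print('%s: %.3fs' % (form, t))
}}}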
> In any case, performance should not be much of an issue here. As long as
> you decide on one normalization form for the database keys, the queries
> themselves will be relatively fast.
Even though I was thinking about the conversions mentioned above, it's
good to know this. I had not really thought about the query side; I
just assumed that the impact at the DB level would not be significant.
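Just to make the intended usage explicit, here is a minimal sketch of
normalizing keys once in application code so the database only ever
compares already-normalized values; the `product` table and `prefix`
column are made up for the example, not the actual Bloodhound schema:
{{{#!py
# Sketch: normalize on write and on lookup, so no DB-side collation
# is ever involved in comparing keys.
import sqlite3
import unicodedata

def norm(key):
    # One normalization form for every DB key (NFC in this sketch).
    return unicodedata.normalize('NFC', key)

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE product (prefix TEXT PRIMARY KEY)')
db.execute('INSERT INTO product VALUES (?)', (norm(u'Cafe\u0301'),))

# Look the key up using a differently-composed but canonically
# equivalent spelling; after norm() both sides are byte-identical.
row = db.execute('SELECT prefix FROM product WHERE prefix = ?',
                 (norm(u'Caf\u00e9'),)).fetchone()
print(row is not None)  # True
}}}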
> Normalization becomes expensive once
> you start playing with collation algorithms in the databases, and you
> certainly don't want to do that.
>
Test cases for this (if needed) will be written later.
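For the record, such a test might look like the sketch below; the
`normalize_key` helper is hypothetical, standing in for whatever
function ends up applying the chosen form to DB keys:
{{{#!py
# Sketch of a possible test case for normalized key matching.
import unicodedata
import unittest

def normalize_key(key):
    # Hypothetical helper; the chosen form here is NFC.
    return unicodedata.normalize('NFC', key)

class NormalizedKeyTestCase(unittest.TestCase):
    def test_canonically_equivalent_keys_match(self):
        composed = u'Caf\u00e9'      # precomposed e with acute
        decomposed = u'Cafe\u0301'   # e + combining acute accent
        self.assertEqual(normalize_key(composed),
                         normalize_key(decomposed))

if __name__ == '__main__':
    unittest.main()
}}}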
>> - Is it more accurate to match «compatible» PKs (i.e. NFKC, NFKD)? [...]
>
> That depends on the application. For text processing, the answer is a
> definite "no". For Bloodhound, it's not so clear that the compatibility
> form would be unacceptable. Although if I were a Russian or Greek
> speaker I'd take a dim view if Bloodhound converted my "А" (U+0410) or
> "Α" (U+0391) to "A" (U+0041).
>
For other applications (e.g. free-text search) I do see a reason for
considering NFK* forms. However, for PKs I'm not really sure, mainly
because of the concern above. Nevertheless:
{{{#!py
>>> import unicodedata
>>> s = u'\u0410bc'
>>> print s
Аbc
>>> nfc = unicodedata.normalize('NFC', s)
>>> nfc
u'\u0410bc'
>>> nfd = unicodedata.normalize('NFD', s)
>>> nfd
u'\u0410bc'
>>> nfkd = unicodedata.normalize('NFKD', s)
>>> nfkd
u'\u0410bc'
>>> nfkc = unicodedata.normalize('NFKC', s)
>>> nfkc
u'\u0410bc'
>>> unicodedata.normalize('NFKC', u'Abc')
u'Abc'
>>> unicodedata.normalize('NFKC', u'Abc') == unicodedata.normalize('NFKC', s)
False
}}}
So it seems they still belong in different equivalence classes, even
under the compatibility forms...
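Indeed, the two characters are distinct code points with no
compatibility decomposition at all, so no normalization form will ever
unify them; detecting visually confusable identifiers would need
something beyond normalization (e.g. the Unicode TR39 confusables data,
which unicodedata does not expose):
{{{#!py
>>> import unicodedata
>>> unicodedata.name(u'\u0410')
'CYRILLIC CAPITAL LETTER A'
>>> unicodedata.name(u'A')
'LATIN CAPITAL LETTER A'
>>> unicodedata.decomposition(u'\u0410')
''
}}}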
--
Regards,
Olemis - @olemislc