> > We have normalization similar to the one you're talking about in our
> > Internet Keywords system. It is built on top of NFKC. It is good for
> > users, but then it is also very specific.
>
> Details, details! (Or do you consider that stuff a proprietary
> advantage?)
I don't really. That would be too fragile an advantage to build on. But
as my signature shows, I may be mistaken :)

For a year-old explanation of the use of Unicode in our system, from the
16th IUC, see http://www.internetkeywords.org/iuc/realnames-iuc16-paper.htm.

Basically, we have two normalization forms. The first one is only for
presentation, and it is a very lightweight cleanup (remove invisible
characters, compress whitespace runs, map half-width characters to
full-width ones...). The second one is used to define uniqueness and is
more restrictive; it builds on the cleaned-up form. We do the following:
- Put the string in NFKC.
- Put the string in the lowercase of its uppercase (uppercase it first,
then lowercase the result).
- Map some characters to take alternate spellings into account (German, for
example; when there is a conflict between languages, oops).
- Undo some ligatures that NFKC didn't undo (as in French "qui vole un oeuf
vole un boeuf").
- Map some characters that are visually very similar to their lowest common
denominator (ASCII) counterpart. For example, the prime and fancy
apostrophes (sorry, don't feel like fetching my Unicode book to get their
proper names) are considered the same as a vanilla apostrophe.
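
To make the steps above concrete, here is a rough Python sketch of the two
forms. It is only an illustration, not our actual code: the function names
and all four mapping tables (INVISIBLE, ALT_SPELLING, LIGATURES, LOOKALIKES)
are made up for this message, and the real tables are larger and tuned per
language.

```python
import re
import unicodedata

# All four tables are illustrative guesses, not the real Keywords mappings.
INVISIBLE = {"\u00ad", "\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
ALT_SPELLING = {"ä": "ae", "ö": "oe", "ü": "ue"}           # German umlauts
LIGATURES = {"œ": "oe", "æ": "ae"}                          # untouched by NFKC
LOOKALIKES = {"\u2018": "'", "\u2019": "'", "\u2032": "'"}  # quotes, prime


def widen(ch):
    """Map a half-width character to its full-width counterpart, if any."""
    parts = unicodedata.decomposition(ch).split()
    if parts and parts[0] == "<narrow>":
        return "".join(chr(int(cp, 16)) for cp in parts[1:])
    return ch


def presentation_form(s):
    """Lightweight cleanup used for display only."""
    s = "".join(ch for ch in s if ch not in INVISIBLE)
    s = re.sub(r"\s+", " ", s)          # compress whitespace runs
    return "".join(widen(ch) for ch in s)


def uniqueness_form(s):
    """Stricter form used to decide whether two names are the same."""
    s = presentation_form(s)
    s = unicodedata.normalize("NFKC", s)
    s = s.upper().lower()               # lowercase of the uppercase
    for table in (ALT_SPELLING, LIGATURES, LOOKALIKES):
        s = "".join(table.get(ch, ch) for ch in s)
    return s


print(uniqueness_form("Straße"))        # strasse
print(uniqueness_form("L\u2019Œuf"))    # l'oeuf
print(uniqueness_form("Müller") == uniqueness_form("Mueller"))  # True
```

Note that the sharp s needs no table entry in this sketch: uppercasing turns
it into "SS", so the lowercase of the uppercase already yields "ss". The
umlaut table is also where the language conflict shows up, since "ae" for
a-umlaut is right for German but not necessarily for other languages.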
That's about it. We regularly consider doing new things, and we are (or will
be) doing specific things to overcome limitations of our distribution
channels (for example, Kana mapping).

As I've said, it's specific to the user experience we want to present to
users of Keywords (fancy display, simpler input). There are obvious
limitations, and each time we start getting a fair number of names in a
given language, I look at these again and try to do the "right thing"
(fortunately, this is a subjective and very adaptable notion ;-)). Any
pointers to problems that we may encounter, smart things to do, etc. are
of great interest to me; please send them!
YA
--
My opinions do not necessarily reflect my company's.
The opposite is also true.