Re: [CODE4LIB] find more like this one

Andrew Nagy Tue, 24 May 2005 14:43:19 -0700

Binkley, Peter wrote:

Bear in mind that even in UTF-8 there is more than one way to encode an
accented character. It can be precomposed (using a single character,
e.g. U+00E9 for lower-case e-acute: this is normalization form C) or
decomposed (using a base character and a non-spacing diacritic, e.g.
U+0065 and U+0301, lower-case e plus the acute accent: this is
normalization form D). If you're searching at the byte level, you have
to be sure that your index and your search term have been normalized the
same way or they won't match. I've found this FAQ useful for this stuff:
http://pipin.tmd.ns.ac.yu/unicode/unicode-faq.html. In a Java context,
we've used ICU4J (http://icu.sourceforge.net) to normalize stuff
(including stripping accents and normalizing case for different scripts)
for indexing and searching in UTF-8. There's also a C API, which could
presumably be incorporated into a Perl process, but no doubt there are
similar native Perl tools.
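
To make the precomposed/decomposed point concrete, here is a minimal
sketch in Python using the standard library's unicodedata module. It is
only an illustration, not the ICU4J setup described above; the helper
name to_nfc is made up for this example.

```python
import unicodedata

precomposed = "\u00e9"   # e-acute as a single code point (form C)
decomposed = "e\u0301"   # e followed by a combining acute accent (form D)

# Both render the same on screen, but the UTF-8 bytes differ.
print(precomposed.encode("utf-8"))  # b'\xc3\xa9'
print(decomposed.encode("utf-8"))   # b'e\xcc\x81'
print(precomposed == decomposed)    # False


def to_nfc(text: str) -> str:
    """Hypothetical helper: normalize text to form C before indexing or searching."""
    return unicodedata.normalize("NFC", text)


# Once both sides are normalized the same way, they match.
print(to_nfc(precomposed) == to_nfc(decomposed))  # True
```

Either form works as long as the index and the search term both go
through the same normalization.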


In general I think we've got to include i18n from the beginning: pay
attention to character sets of incoming data, normalize as early in the
process as possible (especially if ANSEL is involved!), use
UTF-8-compliant tools, and be consistent. Deliver UTF-8 to the browser
(this site helps with the html:
http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html). This is still
not as easy as it ought to be but at least there are good open-source
tools out there.

Wow, it looks like there are some Unicode experts in our midst.  I am in
the middle of developing an international bibliographic database where
most of the titles are in languages other than EN-US.

Our database will store citations entered via a web form, since the
bibliography is in card format.  I am using MySQL 4 because of the
Unicode support and collations.  I normally use Postgres, but I figured
that for a database that will mainly be used for searching (very few
writes after the data has been populated) I'd give MySQL a try.

One feature we would like to offer is searching via the collations.  For
example, if I enter the phrase francais, I would hope that any items
containing the term français would be returned.  Is it correct to use
MySQL's collations for this?  Does anyone have experience with this?
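
To show the matching behaviour I am after, here is a rough sketch in
Python. It is not how MySQL's collations work internally; it just folds
case and strips combining marks on both sides before comparing. The
function name search_key and the sample titles are made up for the
example.

```python
import unicodedata


def search_key(text: str) -> str:
    """Hypothetical helper: fold case, decompose, and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Sample titles invented for illustration only.
titles = ["Le français moderne", "Deutsche Bibliographie"]
query = "francais"

# Compare folded query against folded titles.
matches = [t for t in titles if search_key(query) in search_key(t)]
print(matches)  # ['Le français moderne']
```

A folding like this treats every accented letter as its base letter,
which may be too aggressive for languages where the accented form is a
distinct letter.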

I am still learning the uses of UTF-8 characters, so I am glad there are
so many of you who know so much about this on this list!

Andrew
