Re: [CODE4LIB] find more like this one
Binkley, Peter wrote:

Bear in mind that even in UTF-8 there is more than one way to encode an accented character. It can be precomposed (using a single character, e.g. U+00E9 for lower-case e-acute: this is normalization form C) or decomposed (using a base character and a non-spacing diacritic, e.g. U+0065 and U+0301, lower-case e plus the acute accent: this is normalization form D). If you're searching at the byte level, you have to be sure that your index and your search term have been normalized the same way or they won't match. I've found this FAQ useful for this stuff: http://pipin.tmd.ns.ac.yu/unicode/unicode-faq.html. In a Java context, we've used ICU4J (http://icu.sourceforge.net) to normalize text (including stripping accents and normalizing case for different scripts) for indexing and searching in UTF-8. There's also a C API, which could presumably be incorporated into a Perl process, but no doubt there are similar native Perl tools.

In general I think we've got to include i18n from the beginning: pay attention to the character sets of incoming data, normalize as early in the process as possible (especially if ANSEL is involved!), use UTF-8-compliant tools, and be consistent. Deliver UTF-8 to the browser (this site helps with the HTML: http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html). This is still not as easy as it ought to be, but at least there are good open-source tools out there.

Wow, it looks like there are some Unicode experts in our midst. I am in the middle of developing an international bibliographic database where most of the titles are in languages other than EN-US. Our database will store citations entered via a web form, since the bibliography is currently in card format. I am using MySQL 4 because of its Unicode support and collations. I normally use Postgres, but I figured that for a database that will mainly be used for searching (very few writes after the data has been populated) I'd give MySQL a try.

One feature we would like to offer is searching via the collations. For example, if I enter the phrase "francais", I would hope that any items containing the term "français" would be returned. Is it correct to use MySQL's collations for this? Does anyone have experience with this? I am still learning the uses of UTF-8 characters, so I am glad there are so many people on this list who know so much about this!

Andrew
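A minimal sketch of the collation approach Andrew asks about, assuming MySQL 4.1 or later with a utf8 column; the database, table, and column names and the DBI credentials are hypothetical placeholders. Under the utf8_general_ci collation, accented and unaccented letters compare as equal, so "francais" should match "français":

    #!/usr/bin/perl
    # Hypothetical sketch: accent-insensitive search via a MySQL
    # Unicode collation. Table/column names are placeholders.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'DBI:mysql:database=biblio', 'user', 'password',
                            { RaiseError => 1 } );

    # utf8_general_ci compares accented and base letters as equal,
    # so 'francais' should match rows containing 'francais' with a cedilla.
    my $sth = $dbh->prepare(q{
        SELECT title
        FROM   citations
        WHERE  title COLLATE utf8_general_ci LIKE ?
    });
    $sth->execute('%francais%');

    while ( my ($title) = $sth->fetchrow_array ) {
        print "$title\n";
    }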
Re: [CODE4LIB] find more like this one
Eric and Mike wrote:

> > Maybe I should draw search results from MyLibrary and not swish-e to
> > display characters correctly? If I draw content from many global
> > sources, then how do I know what character set to use for display?
>
> This is definitely the best thing to do. Search the normalized data
> and display the original. Also, if you store the documents UTF-8
> encoded you won't need to worry about the character set; you just need
> to set the encoding for the page to UTF-8 and the browser will take
> care of the rest.

Bear in mind that even in UTF-8 there is more than one way to encode an accented character. It can be precomposed (using a single character, e.g. U+00E9 for lower-case e-acute: this is normalization form C) or decomposed (using a base character and a non-spacing diacritic, e.g. U+0065 and U+0301, lower-case e plus the acute accent: this is normalization form D). If you're searching at the byte level, you have to be sure that your index and your search term have been normalized the same way or they won't match. I've found this FAQ useful for this stuff: http://pipin.tmd.ns.ac.yu/unicode/unicode-faq.html.

In a Java context, we've used ICU4J (http://icu.sourceforge.net) to normalize text (including stripping accents and normalizing case for different scripts) for indexing and searching in UTF-8. There's also a C API, which could presumably be incorporated into a Perl process, but no doubt there are similar native Perl tools.

In general I think we've got to include i18n from the beginning: pay attention to the character sets of incoming data, normalize as early in the process as possible (especially if ANSEL is involved!), use UTF-8-compliant tools, and be consistent. Deliver UTF-8 to the browser (this site helps with the HTML: http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html). This is still not as easy as it ought to be, but at least there are good open-source tools out there.

Peter

Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: [EMAIL PROTECTED]
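Peter's point about the two normalization forms can be demonstrated in a few lines of Perl with the core Unicode::Normalize module (shipped with Perl since 5.8): the same word in form C and form D compares unequal byte-for-byte until both sides are normalized the same way. A minimal sketch:

    #!/usr/bin/perl
    # Sketch: NFC and NFD encodings of the same string differ at the
    # byte level, so index and query must be normalized consistently.
    use strict;
    use warnings;
    use Unicode::Normalize qw(NFC NFD);

    my $nfc = "\x{00E9}";     # e-acute, precomposed (form C)
    my $nfd = "e\x{0301}";    # e + combining acute accent (form D)

    print 'raw: ', ( $nfc eq $nfd           ? 'equal' : 'not equal' ), "\n";  # not equal
    print 'NFC: ', ( NFC($nfc) eq NFC($nfd) ? 'equal' : 'not equal' ), "\n";  # equal
    print 'NFD: ', ( NFD($nfc) eq NFD($nfd) ? 'equal' : 'not equal' ), "\n";  # equal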
Re: [CODE4LIB] find more like this one
On 5/24/05, Eric Lease Morgan <[EMAIL PROTECTED]> wrote:

> On May 23, 2005, at 6:27 PM, Steven C. Perkins wrote:
>
> > I did a search on indigenous. The first item was a French article.
> > The display of diacritics was messed up. I added French to the
> > languages in IE, but the display was still bad. I don't know if this
> > is a WinXP problem or a problem with your page. I did not see a
> > language encoding on your source. Perhaps UTF-8 will fix this? Or it
> > may be a problem from the document retrieved.
>
> Yes, I do not know how to handle the extended ASCII characters, and I'm
> hoping someone here can point me in the right direction.
>
> As I said earlier, I use Net::OAI::Harvester to... harvest the data. I
> use MyLibrary to save the data to a MySQL database. I then write
> reports against the database in the form of a simple XML stream and
> feed the stream to swish-e for indexing. I know swish-e is unable to
> index multi-byte characters, and search results come directly from
> swish-e, not MyLibrary.

Will swish-e index the actual bytes of non-diacritic multibyte characters? If so, you can do what we do with Open-ILS (we use Postgres' tsearch2 full-text indexing module). When indexing data, we strip it of diacritical combining characters using 's/\p{M}//go'. When a search is submitted we do the same thing, because a linked search may contain the diacritics, or the searching user may be typing in a non-US locale. This searches the simplified strings and "does the right thing", at least with our data. We display the original document (or a portion thereof) so that multibyte characters are displayed.

For scripts that are entirely outside ASCII (Arabic, Kanji, etc.) we just index and search using the original bytes, because they are not matched by /\p{M}/. In our testing this seems to work fine (of course, we'd appreciate any tips on making this smarter).

> Maybe I should draw search results from MyLibrary and not swish-e to
> display characters correctly? If I draw content from many global
> sources, then how do I know what character set to use for display?

This is definitely the best thing to do. Search the normalized data and display the original. Also, if you store the documents UTF-8 encoded you won't need to worry about the character set; you just need to set the encoding for the page to UTF-8 and the browser will take care of the rest.

--
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org
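Mike's strip-then-match approach can be expressed as one helper applied to both indexed text and incoming queries. A minimal sketch in Perl: decomposing to NFD first ensures that precomposed characters such as "é" also split into a base letter plus a combining mark before the marks are removed.

    #!/usr/bin/perl
    # Sketch of the s/\p{M}//g normalization described above. NFD
    # decomposition splits precomposed characters into base + mark,
    # so the regex catches both encodings of an accented letter.
    use strict;
    use warnings;
    use utf8;
    use Unicode::Normalize qw(NFD);
    binmode STDOUT, ':encoding(UTF-8)';

    sub strip_marks {
        my $text = NFD(shift);    # 'é' becomes 'e' + U+0301
        $text =~ s/\p{M}//g;      # drop all combining marks
        return $text;
    }

    print strip_marks('français'), "\n";   # prints 'francais'
    print strip_marks('naïve'),    "\n";   # prints 'naive'
    # Kanji and similar scripts pass through unchanged: their
    # characters are not combining marks, so /\p{M}/ never matches.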
Re: [CODE4LIB] find more like this one
On May 23, 2005, at 6:27 PM, Steven C. Perkins wrote:

> I did a search on indigenous. The first item was a French article. The
> display of diacritics was messed up. I added French to the languages
> in IE, but the display was still bad. I don't know if this is a WinXP
> problem or a problem with your page. I did not see a language encoding
> on your source. Perhaps UTF-8 will fix this? Or it may be a problem
> from the document retrieved.

Yes, I do not know how to handle the extended ASCII characters, and I'm hoping someone here can point me in the right direction.

As I said earlier, I use Net::OAI::Harvester to... harvest the data. I use MyLibrary to save the data to a MySQL database. I then write reports against the database in the form of a simple XML stream and feed the stream to swish-e for indexing. I know swish-e is unable to index multi-byte characters, and search results come directly from swish-e, not MyLibrary.

Maybe I should draw search results from MyLibrary and not swish-e to display characters correctly? If I draw content from many global sources, then how do I know what character set to use for display?

--
Eric "Really Feeling Like The 'Ugly American'" Morgan
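On the "what character set to use for display" question: if everything is stored as UTF-8, the answer Mike gives above (declare UTF-8 and let the browser do the rest) amounts to a one-line header. A hypothetical sketch with CGI.pm; the page content is illustrative only:

    #!/usr/bin/perl
    # Sketch: deliver stored UTF-8 records as-is and declare the
    # charset in the HTTP header plus a belt-and-braces meta tag.
    use strict;
    use warnings;
    use CGI;

    my $q = CGI->new;
    print $q->header( -type => 'text/html', -charset => 'UTF-8' );
    print '<html><head>',
          '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">',
          '</head><body>Recherche : fran&ccedil;ais</body></html>';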
[CODE4LIB] find more like this one
As a part of Project Ockham, I have started integrating Find More Like This One functionality into one of my bibliographic indexes. See: http://tinyurl.com/c5htp

The link above will search an index of about 250,000 OAI records for the term "hemogloban". It should find zero hits because the word is spelled incorrectly, but the interface will suggest a number of alternative spellings. Once these spellings are used, the system will respond with synonyms as well as a suggested abbreviation, Hb. The system is not perfect; right now it only really works with single-word queries.

I am using Net::OAI::Harvester to collect the OAI data. MyLibrary is used to manage the harvested content and create reports. I use swish-e to index/search the data. Spelling corrections employ Aspell, and the synonyms are generated through the use of WordNet. Fun!

--
Eric Lease Morgan
University Libraries of Notre Dame
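Eric mentions Aspell for the spelling corrections. One way that step could be wired up from Perl is the Text::Aspell binding to GNU Aspell; a minimal sketch, not necessarily how the Ockham index actually does it:

    #!/usr/bin/perl
    # Sketch: suggest alternative spellings for a zero-hit query term
    # via GNU Aspell and the Text::Aspell CPAN module.
    use strict;
    use warnings;
    use Text::Aspell;

    my $speller = Text::Aspell->new;
    $speller->set_option( 'lang', 'en_US' );

    my $term = 'hemogloban';
    if ( !$speller->check($term) ) {
        my @suggestions = $speller->suggest($term);
        print "Did you mean: ", join( ', ', @suggestions ), "?\n";
    }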