Re: [CODE4LIB] find more like this one

2005-05-24 Thread Eric Lease Morgan

On May 23, 2005, at 6:27 PM, Steven C. Perkins wrote:


I did a search on indigenous.  The first item was a French article.
The display of diacritics was messed up.  I added French to the
languages in IE, but the display was still bad.  I don't know if this
is a WinXP problem or a problem with your page.  I did not see a
language encoding on your source.  Perhaps UTF-8 will fix this?  Or it
may be a problem from the document retrieved.


Yes, I do not know how to handle the extended ASCII characters, and I
am hoping someone here can point me in the right direction.

As I said earlier, I use Net::OAI::Harvester to... harvest the data. I
use MyLibrary to save the data to a MySQL database. I then write
reports against the database in the form of a simple XML stream and
feed the stream to swish-e for indexing. I know swish-e is unable to
index multi-byte characters, and search results come directly from
swish-e, not MyLibrary.

Maybe I should draw search results from MyLibrary and not swish-e to
display characters correctly? If I draw content from many global
sources, then how do I know what character set to use for display?

--
Eric "Really Feeling Like The 'Ugly American'" Morgan


Re: [CODE4LIB] find more like this one

2005-05-24 Thread Mike Rylander
On 5/24/05, Eric Lease Morgan <[EMAIL PROTECTED]> wrote:
> On May 23, 2005, at 6:27 PM, Steven C. Perkins wrote:
>
> > I did a search on indigenous.  The first item was a French article.
> > The display of diacritics was messed up.  I added French to the
> > languages in IE, but the display was still bad.  I don't know if this
> > is a WinXP problem or a problem with your page.  I did not see a
> > language encoding on your source.  Perhaps UTF-8 will fix this?  Or it
> > may be a problem from the document retrieved.
>
> Yes, I do not know how to handle the extended ASCII characters, and I
> am hoping someone here can point me in the right direction.
>
> As I said earlier, I use Net::OAI::Harvester to... harvest the data. I
> use MyLibrary to save the data to a MySQL database. I then write
> reports against the database in the form of a simple XML stream and
> feed the stream to swish-e for indexing. I know swish-e is unable to
> index multi-byte characters, and search results come directly from
> swish-e, not MyLibrary.
>

Will swish-e index the actual bytes of non-diacritic multibyte
characters?  If so, you can do what we do with Open-ILS (we use
Postgres' tsearch2 full-text indexing module).  When indexing data, we
strip it of diacritical combining characters using 's/\p{M}//go'.
When a search is submitted we do the same thing, because a linked
search may contain the diacritics, or the searching user may be typing
in a non-US locale.  This will search the simplified strings and "does
the right thing", at least with our data.  We display the original
document (or a portion thereof) so that multibyte characters are
displayed.

For scripts that are entirely outside ASCII (Arabic, Kanji, etc) we
just index and search using the original bytes because they are not
matched by /\p{M}/.  In our testing this seems to work fine (of
course, we'd appreciate any tips on making this smarter).
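
A minimal sketch of the strip-on-index, strip-on-search idea above,
written in Python rather than the Perl used at Open-ILS. The function
name strip_marks is hypothetical. One assumption is made explicit here:
the text is decomposed (NFD) first, because a precomposed character such
as U+00E9 contains no separate combining mark for \p{M} to remove.

```python
import unicodedata


def strip_marks(text):
    """Return text with combining marks removed (hypothetical helper).

    Decomposing to NFD first splits precomposed letters into a base
    letter plus combining marks, so the marks can then be dropped.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # Unicode general categories Mn, Mc and Me all start with "M",
    # which is the same set that \p{M} matches in a Perl regex.
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


# Apply the same function to the indexed data and to the query.
indexed = strip_marks("fran\u00e7ais")   # precomposed c-cedilla
query = strip_marks("franc\u0327ais")    # c + combining cedilla
```

Both values come out as plain "francais", so they match. Note that
category M also covers marks such as Arabic vowel points and Devanagari
vowel signs, so for some scripts this may strip more than intended.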

> Maybe I should draw search results from MyLibrary and not swish-e to
> display characters correctly? If I draw content from many global
> sources, then how do I know what character set to use for display?
>

This is definitely the best thing to do.  Search the normalized data
and display the original.  Also, if you store the documents UTF-8
encoded you won't need to worry about the character set, you just need
to set the encoding for the page to UTF-8 and the browser will take
care of the rest.
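
A rough sketch of "search the normalized data, display the original",
assuming a tiny in-memory word index instead of swish-e or MyLibrary.
The names fold, build_index and search are hypothetical, and the sample
titles are made up for illustration.

```python
import unicodedata


def fold(text):
    """Strip combining marks and case so lookups ignore both."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    return bare.casefold()


def build_index(titles):
    """Map each folded word to the original titles that contain it."""
    index = {}
    for title in titles:
        for word in fold(title).split():
            index.setdefault(word, []).append(title)
    return index


def search(index, query):
    """Fold the query the same way, then return the original titles."""
    return index.get(fold(query).strip(), [])


titles = ["Les peuples indig\u00e8nes", "Indigenous peoples"]
index = build_index(titles)
hits = search(index, "INDIG\u00c8NES")

# Encode the page body as UTF-8 and declare that in the response header.
content_type = "text/html; charset=utf-8"
body = "\n".join(hits).encode("utf-8")
```

The index only ever sees folded words, while the hit list carries the
untouched titles, diacritics included, ready to be sent as UTF-8.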

--
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org


Re: [CODE4LIB] find more like this one

2005-05-24 Thread Binkley, Peter
Eric and Mike wrote:

> > Maybe I should draw search results from MyLibrary and not swish-e to
> > display characters correctly? If I draw content from many global
> > sources, then how do I know what character set to use for display?
>
> This is definitely the best thing to do.  Search the normalized data
> and display the original.  Also, if you store the documents UTF-8
> encoded you won't need to worry about the character set, you just need
> to set the encoding for the page to UTF-8 and the browser will take
> care of the rest.

Bear in mind that even in UTF-8 there is more than one way to encode an
accented character. It can be precomposed (using a single character,
e.g. U+00E9 for lower-case e-acute: this is normalization form C) or
decomposed (using a base character and a non-spacing diacritic, e.g.
U+0065 and U+0301, lower-case e plus the acute accent: this is
normalization form D). If you're searching at the byte level, you have
to be sure that your index and your search term have been normalized the
same way or they won't match. I've found this FAQ useful for this stuff:
http://pipin.tmd.ns.ac.yu/unicode/unicode-faq.html. In a Java context,
we've used ICU4J (http://icu.sourceforge.net) to normalize stuff
(including stripping accents and normalizing case for different scripts)
for indexing and searching in UTF-8. There's also a C API, which could
presumably be incorporated into a Perl process, but no doubt there are
similar native Perl tools.
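
As an illustration of the precomposed/decomposed point, here is a short
Python sketch using the standard unicodedata module (not ICU4J). The
variable names are just for the example.

```python
import unicodedata

precomposed = "\u00e9"   # e-acute as a single code point (form C)
decomposed = "e\u0301"   # e followed by combining acute (form D)

# The two look identical on screen but differ at the byte level,
# so a byte-wise index lookup would treat them as different terms.
nfc_bytes = precomposed.encode("utf-8")   # b'\xc3\xa9'
nfd_bytes = decomposed.encode("utf-8")    # b'e\xcc\x81'
bytes_differ = nfc_bytes != nfd_bytes

# Normalizing both sides to the same form makes them comparable.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed
```

Whichever form is chosen, the point is to apply it to both the indexed
text and the search term before comparing.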

In general I think we've got to include i18n from the beginning: pay
attention to character sets of incoming data, normalize as early in the
process as possible (especially if ANSEL is involved!), use
UTF-8-compliant tools, and be consistent. Deliver UTF-8 to the browser
(this site helps with the html:
http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html). This is still
not as easy as it ought to be but at least there are good open-source
tools out there.
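
The "normalize early, deliver UTF-8" advice might look like this in
Python. This is only a sketch: the function name to_utf8_nfc is
hypothetical, the charset of each incoming record is assumed to be known
from its source, and ANSEL is not covered here.

```python
import unicodedata


def to_utf8_nfc(raw, declared_charset):
    """Decode incoming bytes, normalize to NFC, re-encode as UTF-8.

    declared_charset is whatever the source says it is sending; a wrong
    declaration raises UnicodeDecodeError or yields garbled text, which
    is why it pays to check character sets at ingest time.
    """
    text = raw.decode(declared_charset)
    return unicodedata.normalize("NFC", text).encode("utf-8")


# Two sources sending the same word in different encodings and forms.
latin1_record = "fran\u00e7ais".encode("latin-1")
utf8_nfd_record = "franc\u0327ais".encode("utf-8")

stored_a = to_utf8_nfc(latin1_record, "latin-1")
stored_b = to_utf8_nfc(utf8_nfd_record, "utf-8")
```

After ingest both records hold identical UTF-8 bytes, so everything
downstream (indexing, searching, display) deals with one representation.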

Peter

Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: [EMAIL PROTECTED]


[CODE4LIB] browser toolbars

2005-05-24 Thread Eric Lease Morgan

How does one go about creating a browser toolbar? You know. Things like
the Google or Yahoo toolbars.

--
Eric Lease Morgan
(574) 631-8604


Re: [CODE4LIB] find more like this one

2005-05-24 Thread Andrew Nagy

Binkley, Peter wrote:


> Bear in mind that even in UTF-8 there is more than one way to encode an
> accented character. It can be precomposed (using a single character,
> e.g. U+00E9 for lower-case e-acute: this is normalization form C) or
> decomposed (using a base character and a non-spacing diacritic, e.g.
> U+0065 and U+0301, lower-case e plus the acute accent: this is
> normalization form D). If you're searching at the byte level, you have
> to be sure that your index and your search term have been normalized the
> same way or they won't match. I've found this FAQ useful for this stuff:
> http://pipin.tmd.ns.ac.yu/unicode/unicode-faq.html. In a Java context,
> we've used ICU4J (http://icu.sourceforge.net) to normalize stuff
> (including stripping accents and normalizing case for different scripts)
> for indexing and searching in UTF-8. There's also a C API, which could
> presumably be incorporated into a Perl process, but no doubt there are
> similar native Perl tools.
>
> In general I think we've got to include i18n from the beginning: pay
> attention to character sets of incoming data, normalize as early in the
> process as possible (especially if ANSEL is involved!), use
> UTF-8-compliant tools, and be consistent. Deliver UTF-8 to the browser
> (this site helps with the html:
> http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html). This is still
> not as easy as it ought to be but at least there are good open-source
> tools out there.



Wow, it looks like there are some Unicode experts in our midst.  I am in
the middle of developing an international bibliographic database where
most of the titles are in languages other than EN-US.

Our database will store citations entered via a web form, since the
bibliography is in card format.  I am using MySQL 4 because of the
Unicode support and collations.  I normally use Postgres, but I figured
for a database that will mainly be used for searching only (very few
writes after the data has been populated) I'd give MySQL a try.

One feature we would like to offer is searching via the collations.  For
example, if I enter the phrase francais, I would hope that any items
with the term français would be returned.  Is it correct to use MySQL's
collations for this?  Does anyone have experience with this?
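
To make the idea of collation-based matching concrete, here is a sketch
that does NOT use MySQL. It registers a custom accent-insensitive
collation in SQLite through Python's standard sqlite3 module, purely to
show what such a collation does to an equality test. The collation name
accent_insensitive, the table and the sample rows are all made up for
the example, and a MySQL collation may behave differently in detail.

```python
import sqlite3
import unicodedata


def fold(text):
    """Strip combining marks and case before comparing."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    return bare.casefold()


def accent_insensitive(a, b):
    """Collation callback: return -1, 0 or 1 based on folded values."""
    fa, fb = fold(a), fold(b)
    return (fa > fb) - (fa < fb)


conn = sqlite3.connect(":memory:")
conn.create_collation("accent_insensitive", accent_insensitive)
conn.execute("CREATE TABLE citations (title TEXT)")
conn.executemany(
    "INSERT INTO citations VALUES (?)",
    [("fran\u00e7ais",), ("francais",), ("anglais",)],
)

# The COLLATE clause makes "=" compare folded values, so a query typed
# without the cedilla still finds the row stored with it.
rows = conn.execute(
    "SELECT title FROM citations "
    "WHERE title = ? COLLATE accent_insensitive ORDER BY rowid",
    ("francais",),
).fetchall()
```

Here the comparison is on the whole column value; matching a term inside
a longer title would still need a word index or full-text search on top.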

I am still learning the uses of UTF-8 characters, so I am glad there are
so many of you who know so much about this on this list!

Andrew


Re: [CODE4LIB] browser toolbars

2005-05-24 Thread Jeremy Dunck
On 5/24/05, Eric Lease Morgan <[EMAIL PROTECTED]> wrote:
> How does one go about creating a browser toolbar? You know. Things like
> the Google or Yahoo toolbars.

There's not much of a dev community around this but:

"Creating Custom Explorer Bars, Tool Bands, and Desk Bands"


"Adding Explorer Bars"


"Explorer Bar Style Guide"



Re: [CODE4LIB] browser toolbars

2005-05-24 Thread Vishwam Annam
I haven't played much with browser toolbars, but I wrote search plugins
for Mozilla browsers. These are like the Google, Amazon, Yahoo, etc.
plugins that come with Firefox by default. Our catalog search plugins are at
http://www.libraries.wright.edu/download/plugins/

If you are looking for something like this, I'd be happy to share my
experience.

Vishwam



- Original Message -
From: Eric Lease Morgan <[EMAIL PROTECTED]>
Date: Tuesday, May 24, 2005 5:07 pm
Subject: [CODE4LIB] browser toolbars

> How does one go about creating a browser toolbar? You know. Things
> like the Google or Yahoo toolbars.
>
> --
> Eric Lease Morgan
> (574) 631-8604
>