Re: [CODE4LIB] find more like this one
Binkley, Peter wrote:

Bear in mind that even in UTF-8 there is more than one way to encode an accented character. It can be precomposed (using a single character, e.g. U+00E9 for lower-case e-acute: this is normalization form C) or decomposed (using a base character and a non-spacing diacritic, e.g. U+0065 and U+0301, lower-case e plus the acute accent: this is normalization form D). If you're searching at the byte level, you have to be sure that your index and your search term have been normalized the same way or they won't match. I've found this FAQ useful for this stuff: http://pipin.tmd.ns.ac.yu/unicode/unicode-faq.html. In a Java context, we've used ICU4J (http://icu.sourceforge.net) to normalize text (including stripping accents and normalizing case for different scripts) for indexing and searching in UTF-8. There's also a C API, which could presumably be incorporated into a Perl process, but no doubt there are similar native Perl tools.

In general I think we've got to include i18n from the beginning: pay attention to the character sets of incoming data, normalize as early in the process as possible (especially if ANSEL is involved!), use UTF-8-compliant tools, and be consistent. Deliver UTF-8 to the browser (this site helps with the HTML: http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html). This is still not as easy as it ought to be, but at least there are good open-source tools out there.

Wow, it looks like there are some Unicode experts in our midst. I am in the middle of developing an international bibliographic database where most of the titles are in languages other than EN-US. Our database will store citations entered via a web form, since the bibliography is currently in card format. I am using MySQL 4 because of its Unicode support and collations. I normally use Postgres, but I figured that for a database that will mainly be used for searching (very few writes after the data has been populated) I'd give MySQL a try.

One feature we would like to offer is searching via the collations. For example, if I enter the phrase "francais", I would hope that any items containing the term "français" would be returned. Is it correct to use MySQL's collations for this? Does anyone have experience with this? I am still learning the uses of UTF-8 characters, so I am glad there are so many people on this list who know so much about this!

Andrew
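A minimal sketch of the collation approach Andrew asks about, assuming MySQL 4.1 or later with a utf8 column; the database, table, and column names and the DBI credentials are hypothetical placeholders. Under the utf8_general_ci collation, accented and unaccented letters compare as equal, so "francais" should match "français":

    #!/usr/bin/perl
    # Hypothetical sketch: accent-insensitive search via a MySQL
    # Unicode collation. Table/column names are placeholders.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'DBI:mysql:database=biblio', 'user', 'password',
                            { RaiseError => 1 } );

    # utf8_general_ci compares accented and base letters as equal,
    # so 'francais' should match rows containing 'francais' with a cedilla.
    my $sth = $dbh->prepare(q{
        SELECT title
        FROM   citations
        WHERE  title COLLATE utf8_general_ci LIKE ?
    });
    $sth->execute('%francais%');

    while ( my ($title) = $sth->fetchrow_array ) {
        print "$title\n";
    }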
Re: [CODE4LIB] find more like this one
Eric and Mike wrote:

> > Maybe I should draw search results from MyLibrary and not swish-e to
> > display characters correctly? If I draw content from many global
> > sources, then how do I know what character set to use for display?
>
> This is definitely the best thing to do. Search the normalized data
> and display the original. Also, if you store the documents UTF-8
> encoded you won't need to worry about the character set; you just need
> to set the encoding for the page to UTF-8 and the browser will take
> care of the rest.

Bear in mind that even in UTF-8 there is more than one way to encode an accented character. It can be precomposed (using a single character, e.g. U+00E9 for lower-case e-acute: this is normalization form C) or decomposed (using a base character and a non-spacing diacritic, e.g. U+0065 and U+0301, lower-case e plus the acute accent: this is normalization form D). If you're searching at the byte level, you have to be sure that your index and your search term have been normalized the same way or they won't match. I've found this FAQ useful for this stuff: http://pipin.tmd.ns.ac.yu/unicode/unicode-faq.html.

In a Java context, we've used ICU4J (http://icu.sourceforge.net) to normalize text (including stripping accents and normalizing case for different scripts) for indexing and searching in UTF-8. There's also a C API, which could presumably be incorporated into a Perl process, but no doubt there are similar native Perl tools.

In general I think we've got to include i18n from the beginning: pay attention to the character sets of incoming data, normalize as early in the process as possible (especially if ANSEL is involved!), use UTF-8-compliant tools, and be consistent. Deliver UTF-8 to the browser (this site helps with the HTML: http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html). This is still not as easy as it ought to be, but at least there are good open-source tools out there.

Peter

Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: [EMAIL PROTECTED]
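Peter's point about the two normalization forms can be demonstrated in a few lines of Perl with the core Unicode::Normalize module (shipped with Perl since 5.8): the same word in form C and form D compares unequal byte-for-byte until both sides are normalized the same way. A minimal sketch:

    #!/usr/bin/perl
    # Sketch: NFC and NFD encodings of the same string differ at the
    # byte level, so index and query must be normalized consistently.
    use strict;
    use warnings;
    use Unicode::Normalize qw(NFC NFD);

    my $nfc = "\x{00E9}";     # e-acute, precomposed (form C)
    my $nfd = "e\x{0301}";    # e + combining acute accent (form D)

    print 'raw: ', ( $nfc eq $nfd           ? 'equal' : 'not equal' ), "\n";  # not equal
    print 'NFC: ', ( NFC($nfc) eq NFC($nfd) ? 'equal' : 'not equal' ), "\n";  # equal
    print 'NFD: ', ( NFD($nfc) eq NFD($nfd) ? 'equal' : 'not equal' ), "\n";  # equal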
Re: [CODE4LIB] find more like this one
On 5/24/05, Eric Lease Morgan <[EMAIL PROTECTED]> wrote:

> On May 23, 2005, at 6:27 PM, Steven C. Perkins wrote:
>
> > I did a search on indigenous. The first item was a French article.
> > The display of diacritics was messed up. I added French to the
> > languages in IE, but the display was still bad. I don't know if this
> > is a WinXP problem or a problem with your page. I did not see a
> > language encoding on your source. Perhaps UTF-8 will fix this? Or it
> > may be a problem from the document retrieved.
>
> Yes, I do not know how to handle the extended ASCII characters, and I'm
> hoping someone here can point me in the right direction.
>
> As I said earlier, I use Net::OAI::Harvester to... harvest the data. I
> use MyLibrary to save the data to a MySQL database. I then write
> reports against the database in the form of a simple XML stream and
> feed the stream to swish-e for indexing. I know swish-e is unable to
> index multi-byte characters, and search results come directly from
> swish-e, not MyLibrary.

Will swish-e index the actual bytes of non-diacritic multibyte characters? If so, you can do what we do with Open-ILS (we use Postgres' tsearch2 full-text indexing module). When indexing data, we strip it of diacritical combining characters using 's/\p{M}//go'. When a search is submitted we do the same thing, because a linked search may contain the diacritics, or the searching user may be typing in a non-US locale. This searches the simplified strings and "does the right thing", at least with our data. We display the original document (or a portion thereof) so that multibyte characters are displayed.

For scripts that are entirely outside ASCII (Arabic, Kanji, etc.) we just index and search using the original bytes, because they are not matched by /\p{M}/. In our testing this seems to work fine (of course, we'd appreciate any tips on making this smarter).

> Maybe I should draw search results from MyLibrary and not swish-e to
> display characters correctly? If I draw content from many global
> sources, then how do I know what character set to use for display?

This is definitely the best thing to do. Search the normalized data and display the original. Also, if you store the documents UTF-8 encoded you won't need to worry about the character set; you just need to set the encoding for the page to UTF-8 and the browser will take care of the rest.

--
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org
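Mike's strip-then-match approach can be expressed as one helper applied to both indexed text and incoming queries. A minimal sketch in Perl: decomposing to NFD first ensures that precomposed characters such as "é" also split into a base letter plus a combining mark before the marks are removed.

    #!/usr/bin/perl
    # Sketch of the s/\p{M}//g normalization described above. NFD
    # decomposition splits precomposed characters into base + mark,
    # so the regex catches both encodings of an accented letter.
    use strict;
    use warnings;
    use utf8;
    use Unicode::Normalize qw(NFD);
    binmode STDOUT, ':encoding(UTF-8)';

    sub strip_marks {
        my $text = NFD(shift);    # 'é' becomes 'e' + U+0301
        $text =~ s/\p{M}//g;      # drop all combining marks
        return $text;
    }

    print strip_marks('français'), "\n";   # prints 'francais'
    print strip_marks('naïve'),    "\n";   # prints 'naive'
    # Kanji and similar scripts pass through unchanged: their
    # characters are not combining marks, so /\p{M}/ never matches.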
Re: [CODE4LIB] find more like this one
On May 23, 2005, at 6:27 PM, Steven C. Perkins wrote:

> I did a search on indigenous. The first item was a French article. The
> display of diacritics was messed up. I added French to the languages
> in IE, but the display was still bad. I don't know if this is a WinXP
> problem or a problem with your page. I did not see a language encoding
> on your source. Perhaps UTF-8 will fix this? Or it may be a problem
> from the document retrieved.

Yes, I do not know how to handle the extended ASCII characters, and I'm hoping someone here can point me in the right direction.

As I said earlier, I use Net::OAI::Harvester to... harvest the data. I use MyLibrary to save the data to a MySQL database. I then write reports against the database in the form of a simple XML stream and feed the stream to swish-e for indexing. I know swish-e is unable to index multi-byte characters, and search results come directly from swish-e, not MyLibrary.

Maybe I should draw search results from MyLibrary and not swish-e to display characters correctly? If I draw content from many global sources, then how do I know what character set to use for display?

--
Eric "Really Feeling Like The 'Ugly American'" Morgan
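On the "what character set to use for display" question: if everything is stored as UTF-8, the answer Mike gives above (declare UTF-8 and let the browser do the rest) amounts to a one-line header. A hypothetical sketch with CGI.pm; the page content is illustrative only:

    #!/usr/bin/perl
    # Sketch: deliver stored UTF-8 records as-is and declare the
    # charset in the HTTP header plus a belt-and-braces meta tag.
    use strict;
    use warnings;
    use CGI;

    my $q = CGI->new;
    print $q->header( -type => 'text/html', -charset => 'UTF-8' );
    print '<html><head>',
          '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">',
          '</head><body>Recherche : fran&ccedil;ais</body></html>';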
[CODE4LIB] find more like this one
As a part of Project Ockham, I have started integrating Find More Like This One functionality into one of my bibliographic indexes. See: http://tinyurl.com/c5htp

The link above will search an index of about 250,000 OAI records for the term "hemogloban". It should find zero hits because the word is spelled incorrectly, but the interface will suggest a number of alternative spellings. Once these spellings are used, the system will respond with synonyms as well as a suggested abbreviation, Hb. The system is not perfect; right now it only really works with single-word queries.

I am using Net::OAI::Harvester to collect the OAI data. MyLibrary is used to manage the harvested content and create reports. I use swish-e to index/search the data. Spelling corrections employ Aspell, and the synonyms are generated through the use of WordNet. Fun!

--
Eric Lease Morgan
University Libraries of Notre Dame
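Eric mentions Aspell for the spelling corrections. One way that step could be wired up from Perl is the Text::Aspell binding to GNU Aspell; a minimal sketch, not necessarily how the Ockham index actually does it:

    #!/usr/bin/perl
    # Sketch: suggest alternative spellings for a zero-hit query term
    # via GNU Aspell and the Text::Aspell CPAN module.
    use strict;
    use warnings;
    use Text::Aspell;

    my $speller = Text::Aspell->new;
    $speller->set_option( 'lang', 'en_US' );

    my $term = 'hemogloban';
    if ( !$speller->check($term) ) {
        my @suggestions = $speller->suggest($term);
        print "Did you mean: ", join( ', ', @suggestions ), "?\n";
    }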