RE: Searching user-private annotations associated with indexed documents

2007-11-27 Thread Binkley, Peter
One approach would be to take advantage of Lucene's ability to handle different kinds of documents in a single index. You could put the annotations in the same index as the main articles, but with extra fields, like this:

Article document:
  Id: article1
  Type: article
  Text: blah blah blah
Annotati
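
The message is cut off before the annotation document is shown, so the following is only a rough sketch of the single-index idea, not the layout from the original post. It is a toy in-memory model in Python, not the Lucene API. The `user` and `article_id` fields and the `search` helper are assumptions made up for illustration.

```python
# Toy in-memory stand-in for one index holding two kinds of documents.
# This is NOT the Lucene API. The "user" and "article_id" fields are
# hypothetical; the original message is truncated before it lists them.
INDEX = [
    {"id": "article1", "type": "article", "text": "blah blah blah"},
    {"id": "article2", "type": "article", "text": "something else"},
    {"id": "note1", "type": "annotation", "user": "alice",
     "article_id": "article1", "text": "follow up on this"},
    {"id": "note2", "type": "annotation", "user": "bob",
     "article_id": "article2", "text": "follow up later"},
]


def search(index, term, user):
    """Return sorted ids of articles that match `term` either in their own
    text or in an annotation owned by `user`."""
    hits = set()
    for doc in index:
        # Whole-word match on whitespace-separated tokens.
        if term not in doc["text"].split():
            continue
        if doc["type"] == "article":
            hits.add(doc["id"])
        elif doc["type"] == "annotation" and doc["user"] == user:
            # Only the searching user's own annotations count as matches.
            hits.add(doc["article_id"])
    return sorted(hits)
```

Filtering on the type field plus an owner field keeps one user's annotations from surfacing in another user's results, while article text stays searchable by everyone.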

RE: Analysis/tokenization of compound words

2006-09-21 Thread Binkley, Peter
Aspell has some support for compound words that might be useful to look at: http://aspell.sourceforge.net/man-html/Compound-Words.html#Compound-Words Peter Peter Binkley Digital Initiatives Technology Librarian Information Technology Services 4-30 Cameron Library University of Alberta Libraries

RE: UTF8 accents & umlauts filter?

2006-09-14 Thread Binkley, Peter
We use ICU4J to do the filtering based on Unicode blocks. See http://icu.sourceforge.net/userguide/Transform.html for a sense of what you can do. It's worth it for us because we need to normalize Cyrillic as well as Roman text; it might be overkill for other situations. But it does good work. The f
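
For the simpler situations where ICU4J might be overkill, here is a minimal sketch of accent and umlaut stripping. It uses Python's standard `unicodedata` module rather than ICU4J, so it is a swapped-in technique and not the filter described in the message. The `strip_accents` name is made up for illustration.

```python
import unicodedata


def strip_accents(text):
    """Remove accents and umlauts by decomposing each character (NFD)
    and dropping the combining marks that result."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, `strip_accents("Müller café")` gives `"Muller cafe"`. This only removes combining marks. It does no transliteration, so it would not cover the Cyrillic normalization mentioned above.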