One approach would be to take advantage of Lucene's ability to handle
different kinds of documents in a single index. You could put the
annotations in the same index as the main articles, but with extra
fields, like this:

Article document:
  Id: article1
  Type: article
  Text: blah blah blah
Annotati
Aspell has some support for compound words that might be useful to look
at:
http://aspell.sourceforge.net/man-html/Compound-Words.html#Compound-Words
Peter
Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
We use ICU4J to do the filtering based on Unicode blocks. See
http://icu.sourceforge.net/userguide/Transform.html for a sense of what
you can do. It's worth it for us because we need to normalize Cyrillic
as well as Roman text; it might be overkill for other situations. But it
does good work. The f
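
For a rough sense of what filtering by Unicode block can look like, here
is a minimal sketch. It is a stand-in that uses only JDK classes
(java.text.Normalizer and Character.UnicodeBlock), not ICU4J, and it is
not the filter described above. The class name BlockFilterSketch and the
method stripCombiningMarks are hypothetical, chosen for illustration. It
decomposes the text and drops every code point in the Combining
Diacritical Marks block, which affects Cyrillic as well as Roman letters.

```java
import java.text.Normalizer;

// Hypothetical illustration only: JDK classes, not the ICU4J transform API.
class BlockFilterSketch {

    // Decompose to NFD so accents become separate combining code points,
    // then drop anything in the Combining Diacritical Marks block
    // (U+0300 to U+036F).
    static String stripCombiningMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        decomposed.codePoints()
                .filter(cp -> Character.UnicodeBlock.of(cp)
                        != Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS)
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the final e.
        System.out.println(stripCombiningMarks("caf\u00e9")); // prints "cafe"
        // U+0439 (Cyrillic short i) decomposes to U+0438 plus a combining
        // breve; the breve is dropped and U+0438 remains.
        System.out.println(stripCombiningMarks("\u0439\u043e\u0433\u0430"));
    }
}
```

Whether dropping the breve from a Cyrillic short i is desirable depends
on the search behaviour you want; a real setup would probably need a
narrower or script-specific rule set.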