Hi Ryan,

Why not preprocess your documents with tools like Apache UIMA, GATE or
OpenNLP before indexing them in Lucene? GATE, for instance, has FST-based
gazetteers which would be perfect for your place names. AFAIK there is also
a Dictionary component for UIMA which would be a good match.
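
To give a rough idea of what a gazetteer lookup does, here is a minimal
sketch in plain Java. It is not GATE, UIMA or Lucene code: the class
PlaceGazetteer and its methods are hypothetical names, a simple token-level
trie stands in for a real FST, and tokenize() is only a crude substitute for
whatever Analyzer you would actually use. It follows the same idea as the
plan quoted below: normalize each place name into tokens, store the token
sequences, then walk the incoming text and keep the longest match at each
position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical token-level gazetteer: a plain trie keyed by normalized tokens (not a Lucene FST). */
class PlaceGazetteer {
    private static final class Node {
        final Map<String, Node> next = new HashMap<>();
        String place; // non-null when a complete place name ends at this node
    }

    private final Node root = new Node();

    /** Crude stand-in for an Analyzer: lowercase, drop ASCII apostrophes, split on non-alphanumerics. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String cleaned = text.toLowerCase(Locale.ROOT).replace("'", "");
        for (String t : cleaned.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Registers a place name under its normalized token sequence. */
    void add(String place) {
        Node node = root;
        for (String token : tokenize(place)) {
            node = node.next.computeIfAbsent(token, k -> new Node());
        }
        node.place = place;
    }

    /** Scans left to right and reports the longest known place name starting at each position. */
    List<String> findPlaces(String text) {
        List<String> tokens = tokenize(text);
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Node node = root;
            String best = null;
            int bestEnd = i;
            for (int j = i; j < tokens.size(); j++) {
                node = node.next.get(tokens.get(j));
                if (node == null) {
                    break; // no known place name continues with this token
                }
                if (node.place != null) {
                    best = node.place;
                    bestEnd = j + 1;
                }
            }
            if (best != null) {
                found.add(best);
                i = bestEnd; // skip past the matched phrase
            } else {
                i++;
            }
        }
        return found;
    }
}
```

With "The People's Republic of China", "Rome" and "New York" added, a text
containing "the Peoples Republic of China" matches, while "The Peoples
Republic of Korea" does not, because the walk fails on the last token. A
real gazetteer or FST would be far more compact for ~100K names, but the
matching logic is essentially the same.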

Julien

On 3 January 2012 21:30, Ryan McKinley <ryan...@gmail.com> wrote:

> Happy new year!
>
> I'm working on a way to do simple geocoding of documents as they are indexed.
> I'm hoping to use existing Lucene infrastructure to do this as much as
> possible.  My plan is to build an index of known place names then look
> for matches in incoming text.  When there is a match, some extra
> fields will get added to the index.
>
> The known place list will include things like:
>  * The People's Republic of China
>  * Rome
>  * New York
>
> I want to match documents where this phrase (normalized for
> capitalization/punctuation/etc) appears in the document.  It looks
> like MemoryIndex was made to do something like this: Create a
> MemoryIndex for each item you want to match, then run the document
> against each possible value and see if it matches.  Without testing
> this approach, it seems kinda crazy if we have ~100K+ placenames.  I
> am also concerned how this would work with long phrases and things
> that may match with "The Peoples Republic of *"
>
> Just brainstorming, it seems like an FST could be a good/efficient way
> to match documents.  My plan would be to:
>
> 1. Use an Analyzer to create a TokenStream for each place name.  From
> the TokenStream create an FST<docid> -- this would have to pick some
> impossible character for the token separator.
> 2. While indexing, create a TokenStream from the input text.  For each
> token, try to follow the Arc to a match.  If there is a match, add it
> to the document.
>
> Does this approach seem reasonable?
> Is there some standard way to do this that I am missing?
>
> thanks for any pointers!
>
> ryan
>
>
> The two approaches I am considering:
>
> 1. MemoryIndex -- build a MemoryIndex for each place name.  Check every
> index
>
> 2. FST -- Use an Analyzer to get a TokenStream for each input name and
> build an FST<docid> based on the input.  Then analyze the text while
> indexing and use the TokenStream to
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
