Hi Ryan,

Why not preprocess your documents with tools like Apache UIMA, GATE or OpenNLP before indexing them in Lucene? GATE, for instance, has FST-based gazetteers, which would be perfect for your place names. AFAIK there is also a Dictionary component for UIMA, which would be a good match.
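
To make the gazetteer idea a bit more concrete, here is a rough sketch in plain Java of a token-level dictionary lookup with longest-match semantics. It does not use the GATE, UIMA or Lucene FST APIs: the class name PlaceNameMatcher, its methods, the place ids and the crude tokenize() helper (a stand-in for a real Analyzer) are all made up for illustration, and a HashMap-based trie is used where an FST would be the more compact choice.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical token-level trie for spotting known place names in text. */
class PlaceNameMatcher {

    /** One trie node; placeId is non-null when a place name ends here. */
    private static final class Node {
        final Map<String, Node> next = new HashMap<>();
        String placeId;
    }

    private final Node root = new Node();

    /**
     * Crude stand-in for an Analyzer (an assumption, not Lucene code):
     * lowercases, drops ASCII apostrophes, splits on non-alphanumerics.
     */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String normalized = text.toLowerCase(Locale.ROOT).replace("'", "");
        for (String t : normalized.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Registers a place name, one trie edge per normalized token. */
    void add(String placeName, String placeId) {
        Node node = root;
        for (String token : tokenize(placeName)) {
            node = node.next.computeIfAbsent(token, k -> new Node());
        }
        node.placeId = placeId;
    }

    /** Scans left to right and reports the longest place name at each position. */
    List<String> match(String text) {
        List<String> tokens = tokenize(text);
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Node node = root;
            String bestId = null;
            int bestEnd = i;
            // Follow edges for as long as the tokens keep matching.
            for (int j = i; j < tokens.size(); j++) {
                node = node.next.get(tokens.get(j));
                if (node == null) {
                    break;
                }
                if (node.placeId != null) {
                    bestId = node.placeId;
                    bestEnd = j + 1;
                }
            }
            if (bestId != null) {
                found.add(bestId);
                i = bestEnd; // skip past the matched phrase
            } else {
                i++;
            }
        }
        return found;
    }
}
```

Each lookup follows at most as many edges as the longest place name has tokens, so the cost per document should not grow with the size of the place list. The returned ids could then be added to the document as extra fields before indexing.
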

Julien

On 3 January 2012 21:30, Ryan McKinley <ryan...@gmail.com> wrote:
> Happy new year!
>
> I'm working on a way to simply geocode documents as they are indexed.
> I'm hoping to use existing Lucene infrastructure to do this as much as
> possible. My plan is to build an index of known place names, then look
> for matches in incoming text. When there is a match, some extra
> fields will get added to the index.
>
> The known place list will include things like:
> * The People's Republic of China
> * Rome
> * New York
>
> I want to match documents where this phrase (normalized for
> capitalization/punctuation/etc) appears in the document. It looks
> like MemoryIndex was made to do something like this: Create a
> MemoryIndex for each item you want to match, then run the document
> against each possible value and see if it matches. Without testing
> this approach, it seems kinda crazy if we have ~100K+ placenames. I
> am also concerned how this would work with long phrases and things
> that may match with "The Peoples Republic of *"
>
> Just brainstorming, it seems like an FST could be a good/efficient way
> to match documents. My plan would be to:
>
> 1. Use an Analyzer to create a TokenStream for each place name. From
> the TokenStream create an FST<docid> -- this would have to pick some
> impossible character for the token separator.
> 2. While indexing, create a TokenStream from the input text. For each
> token, try to follow the Arc to a match. If there is a match, add it
> to the document.
>
> Does this approach seem reasonable?
> Is there some standard way to do this that I am missing?
>
> thanks for any pointers!
>
> ryan
>
>
> The two approaches I am considering:
>
> 1. MemoryIndex -- build a MemoryIndex for each place name. Check every
> index
>
> 2. FST -- Use an Analyzer to get a TokenStream for each input name and
> build an FST<docid> based on the input. Then analyze the text while
> indexing and use the TokenStream to

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com