Thanks for your feedback, Ian. I have written a first implementation of this service, and it works well. You mentioned that there are probably some Lucene techniques for speeding this up, which I would like to know more about. Would you, or anyone else, mind elaborating a bit or giving me some pointers?
For the record, I am using the in-memory RAMDirectory instead of a file-based index. I don't know if that is relevant in terms of speeding things up, but I thought I'd mention it just to be safe.

Thank you,
Geir

2010/7/23 Ian Lea <ian....@gmail.com>

> So, if I've understood this correctly, you've got some text and want
> to loop through a list of words and/or phrases, and see which of those
> match the text.
>
> e.g.
>
> text "some random article about something or other of some random length"
>
> words
>
> some - matches
> many - no match
> article - matches
> word - no match
>
> You can certainly do that with lucene. Load the text into a document
> and loop round the words or phrases searching for each. You are
> likely to need to look into analyzers depending on your requirements
> around stop words, punctuation, case, etc. And phrase/span queries
> for phrases.
>
> There are also probably some lucene techniques for speeding this up,
> but as ever, start simple - lucene is usually plenty fast enough.
>
>
> --
> Ian.
>
>
> On Thu, Jul 22, 2010 at 11:30 PM, Geir Gullestad Pettersen
> <gei...@gmail.com> wrote:
> > Hi,
> >
> > I'm about to write an application that does very simple text analysis,
> > namely dictionary-based entity extraction. The alternative is to do
> > in-memory matching with substring search:
> >
> > String text; // could be any size, but normally "newspaper length"
> > List<String> matches = new ArrayList<String>();
> > for( String wordOrPhrase : dictionary ) {
> >   if ( text.indexOf( wordOrPhrase ) >= 0 ) {
> >     matches.add( wordOrPhrase );
> >   }
> > }
> >
> > I am concerned the above code will be quite CPU intensive. It will also
> > be case sensitive and not leave any room for fuzzy matching.
> >
> > I thought this task could also be solved by indexing every bit of text
> > that is to be analyzed, and then executing a query per dictionary entry:
> >
> > (pseudo)
> >
> > lucene.index(text)
> > List matches
> > for( String wordOrPhrase : dictionary ) {
> >   if( lucene.search( wordOrPhrase, text_id ) gives hit ) {
> >     matches.add(wordOrPhrase)
> >   }
> > }
> >
> > I have not used lucene very much, so I don't know if it is a good idea or
> > not to use lucene for this task at all. Could anyone please share their
> > thoughts on this?
> >
> > Thanks,
> > Geir
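
For comparison with the substring loop quoted above, here is a minimal plain-Java sketch of the in-memory alternative, made case-insensitive and restricted to whole words. It is not Lucene and not code from this thread: the class name DictionaryMatcher and the methods normalize and findMatches are hypothetical names chosen for illustration, and the normalization rule (lower-case everything, treat anything that is not a letter or digit as a separator) is an assumption about what "matching" should mean here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or of the original application.
final class DictionaryMatcher {

    // Lower-cases the input, turns every run of non-letter/non-digit
    // characters into a single space, and pads the result with one space
    // on each side so that whole-word matches can be found with contains().
    static String normalize(String s) {
        String cleaned = s.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}]+", " ")
                .trim();
        return " " + cleaned + " ";
    }

    // Returns the dictionary entries (words or phrases) that occur in the
    // text as whole words, ignoring case and punctuation.
    static List<String> findMatches(String text, List<String> dictionary) {
        String haystack = normalize(text);
        List<String> matches = new ArrayList<>();
        for (String wordOrPhrase : dictionary) {
            String needle = normalize(wordOrPhrase);
            if (needle.trim().isEmpty()) {
                continue; // skip entries with no letters or digits
            }
            if (haystack.contains(needle)) {
                matches.add(wordOrPhrase);
            }
        }
        return matches;
    }
}
```

With the example text from Ian's message and the dictionary some, many, article, word, findMatches returns some and article. It also accepts a phrase such as "Random Article" and rejects "thing", which only occurs inside "something". It still scans the text once per dictionary entry, so it does not address the CPU concern, and it offers no fuzzy matching.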