You could also look at MemoryIndex or InstantiatedIndex, both in Lucene's contrib area. I was also wondering if you might gain from using TermDocs or TermVectors or something similar directly.
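For example, something along these lines might be a starting point - a rough, untested sketch assuming the contrib MemoryIndex and QueryParser from Lucene 3.x (the "content" field name and the phrase-quoting of each entry are just illustrative, not required):

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.util.Version;

// Match each dictionary entry against a single piece of text using
// an in-memory, single-document index (contrib/memory).
public class DictionaryMatcher {

    public static List<String> match(String text, List<String> dictionary)
            throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // One throw-away index per text to be analyzed.
        MemoryIndex index = new MemoryIndex();
        index.addField("content", text, analyzer);

        QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
        List<String> matches = new ArrayList<String>();
        for (String wordOrPhrase : dictionary) {
            // Quote the entry so multi-word entries become phrase queries;
            // a score above zero means the entry matched the text.
            String quoted = "\"" + QueryParser.escape(wordOrPhrase) + "\"";
            if (index.search(parser.parse(quoted)) > 0.0f) {
                matches.add(wordOrPhrase);
            }
        }
        return matches;
    }
}

MemoryIndex is intended for exactly this kind of "does this query match this one document" workload, so it may save you building a RAMDirectory and IndexWriter for every text you analyze.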
--
Ian.

On Tue, Jul 27, 2010 at 9:34 PM, Geir Gullestad Pettersen <gei...@gmail.com> wrote:
> Thanks for your feedback, Ian.
>
> I have written a first implementation of this service that works well. You
> mentioned something about techniques for speeding up Lucene, something I
> am interested in knowing more about. Would you, or anyone, please mind
> elaborating a bit, or giving me some pointers?
>
> For the record, I am using the in-memory RAMDirectory instead of a
> file-based index. I don't know if it is relevant in terms of speeding
> things up, but thought I'd mention it just to be safe.
>
> Thank you,
>
> Geir
>
> 2010/7/23 Ian Lea <ian....@gmail.com>
>
>> So, if I've understood this correctly, you've got some text and want
>> to loop through a list of words and/or phrases and see which of those
>> match the text.
>>
>> e.g.
>>
>> text: "some random article about something or other of some random length"
>>
>> words:
>>
>> some - matches
>> many - no match
>> article - matches
>> word - no match
>>
>> You can certainly do that with Lucene. Load the text into a document
>> and loop round the words or phrases, searching for each. You are
>> likely to need to look into analyzers, depending on your requirements
>> around stop words, punctuation, case, etc., and phrase/span queries
>> for phrases.
>> There are also probably some Lucene techniques for speeding this up,
>> but as ever, start simple - Lucene is usually plenty fast enough.
>>
>>
>> --
>> Ian.
>>
>>
>> On Thu, Jul 22, 2010 at 11:30 PM, Geir Gullestad Pettersen
>> <gei...@gmail.com> wrote:
>> > Hi,
>> >
>> > I'm about to write an application that does very simple text analysis,
>> > namely dictionary-based entity extraction. The alternative is to do
>> > in-memory substring matching with String.indexOf:
>> >
>> > String text; // could be any size, but normally "newspaper length"
>> > List<String> matches = new ArrayList<String>();
>> > for (String wordOrPhrase : dictionary) {
>> >     if (text.indexOf(wordOrPhrase) >= 0) {
>> >         matches.add(wordOrPhrase);
>> >     }
>> > }
>> >
>> > I am concerned the above code will be quite CPU intensive; it will
>> > also be case sensitive and not leave any room for fuzzy matching.
>> >
>> > I thought this task could also be solved by indexing every bit of text
>> > that is to be analyzed, and then executing a query per dictionary entry:
>> >
>> > (pseudo)
>> >
>> > lucene.index(text)
>> > List<String> matches
>> > for (String wordOrPhrase : dictionary) {
>> >     if (lucene.search(wordOrPhrase, text_id) gives a hit) {
>> >         matches.add(wordOrPhrase)
>> >     }
>> > }
>> >
>> > I have not used Lucene very much, so I don't know whether it is a good
>> > idea to use Lucene for this task at all. Could anyone please share
>> > their thoughts on this?
>> >
>> > Thanks,
>> > Geir
>> >
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org