Hi, I'm about to write an application that does very simple text analysis, namely dictionary-based entity extraction. One alternative is to do in-memory matching with substring search:
    String text; // could be any size, but normally "newspaper length"
    List<String> matches = new ArrayList<>();
    for (String wordOrPhrase : dictionary) {
        if (text.indexOf(wordOrPhrase) >= 0) {
            matches.add(wordOrPhrase);
        }
    }

I am concerned the above code will be quite CPU intensive; it is also case sensitive and leaves no room for fuzzy matching.

I thought this task could also be solved by indexing every piece of text that is to be analyzed, and then executing a query per dictionary entry (pseudo code):

    lucene.index(text)
    List matches
    for (String wordOrPhrase : dictionary) {
        if (lucene.search(wordOrPhrase, text_id) gives hit) {
            matches.add(wordOrPhrase)
        }
    }

I have not used Lucene very much, so I don't know whether it is a good idea to use it for this task at all. Could anyone please share their thoughts on this?

Thanks,
Geir
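
P.S. To make the question more concrete, here is a rough sketch of what I had in mind for the Lucene variant, using Lucene's MemoryIndex (which, as I understand it, is meant for the "one document, many queries" case). The class name and the "content" field name are just illustrative; I have not verified this against a particular Lucene version:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.queryparser.classic.QueryParser;

    public class DictionaryMatcher {

        public static List<String> match(String text, List<String> dictionary) throws Exception {
            Analyzer analyzer = new StandardAnalyzer();

            // Index the single text entirely in memory, then run one query per dictionary entry.
            MemoryIndex index = new MemoryIndex();
            index.addField("content", text, analyzer);

            QueryParser parser = new QueryParser("content", analyzer);
            List<String> matches = new ArrayList<>();
            for (String wordOrPhrase : dictionary) {
                // Escape the entry and quote it, so multi-word entries are matched as phrases.
                String phrase = "\"" + QueryParser.escape(wordOrPhrase) + "\"";
                if (index.search(parser.parse(phrase)) > 0.0f) {
                    matches.add(wordOrPhrase);
                }
            }
            return matches;
        }
    }

If I understand the analyzers correctly, StandardAnalyzer lowercases tokens, so this would also take care of the case-sensitivity problem, and fuzzy queries could presumably be swapped in later. Does this look like a sensible direction?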