Hi,

I'm about to write an application that does very simple text analysis,
namely dictionary-based entity extraction. One alternative is to do
in-memory substring matching:

String text;   // could be any size, but normally "newspaper length"
List<String> matches = new ArrayList<String>();
for (String wordOrPhrase : dictionary) {
   if (text.indexOf(wordOrPhrase) >= 0) {   // indexOf, not substring, for a containment check
      matches.add(wordOrPhrase);
   }
}

I am concerned that the above code will be quite CPU intensive; it will also be
case sensitive and leave no room for fuzzy matching.
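
A case-insensitive variant of that loop is simple enough, for example by
lowercasing both sides before the comparison, but it still scans the whole
text once per dictionary entry and does nothing for fuzzy matching:

String lowerText = text.toLowerCase();
for (String wordOrPhrase : dictionary) {
   if (lowerText.indexOf(wordOrPhrase.toLowerCase()) >= 0) {
      matches.add(wordOrPhrase);
   }
}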

I thought this task could also be solved by indexing each piece of text that
is to be analyzed and then executing one query per dictionary entry:

(pseudo)

lucene.index(text)
List<String> matches
for (String wordOrPhrase : dictionary) {
   if (lucene.search(wordOrPhrase, text_id) gives a hit) {
      matches.add(wordOrPhrase)
   }
}
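
To make the idea concrete, here is a minimal sketch of what I imagine, assuming
Lucene's MemoryIndex (from the lucene-memory module) and the classic QueryParser.
Exact constructors differ between Lucene versions, and the field name "body" is
just a placeholder:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class DictionaryMatcher {

    // Returns the dictionary entries that occur in the given text.
    public static List<String> match(String text, List<String> dictionary) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();

        // Index the single text entirely in memory; nothing is written to disk.
        MemoryIndex index = new MemoryIndex();
        index.addField("body", text, analyzer);

        // One query per dictionary entry; quoting makes multi-word entries phrase queries.
        QueryParser parser = new QueryParser("body", analyzer);
        List<String> matches = new ArrayList<String>();
        for (String wordOrPhrase : dictionary) {
            Query query = parser.parse("\"" + QueryParser.escape(wordOrPhrase) + "\"");
            if (index.search(query) > 0.0f) {   // non-zero score means the entry was found
                matches.add(wordOrPhrase);
            }
        }
        return matches;
    }
}

Since the analyzer lowercases tokens, this would also take care of the
case-sensitivity concern, and a fuzzy query (e.g. wordOrPhrase~) could be
substituted where approximate matching is wanted.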

I have not used Lucene very much, so I don't know whether it is a good idea
to use Lucene for this task at all. Could anyone please share their
thoughts on this?

Thanks,
Geir
