Hi,

I'm about to write an application that does very simple text analysis,
namely dictionary-based entity extraction. One alternative is to do
in-memory substring matching:

String text;   // could be any size, but normally "newspaper length"
List<String> matches = new ArrayList<String>();
for (String wordOrPhrase : dictionary) {
   if (text.indexOf(wordOrPhrase) >= 0) {   // indexOf, not substring, for a containment check
      matches.add(wordOrPhrase);
   }
}

I am concerned that the above code will be quite CPU intensive; it will also be
case sensitive and leave no room for fuzzy matching.
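
A case-insensitive variant of that loop is simple enough, for example by
lowercasing both sides before the comparison, but it still scans the whole
text once per dictionary entry and does nothing for fuzzy matching:

String lowerText = text.toLowerCase();
for (String wordOrPhrase : dictionary) {
   if (lowerText.indexOf(wordOrPhrase.toLowerCase()) >= 0) {
      matches.add(wordOrPhrase);
   }
}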

I thought this task could also be solved by indexing each piece of text that
is to be analyzed and then executing one query per dictionary entry:

(pseudo)

lucene.index(text)
List<String> matches
for (String wordOrPhrase : dictionary) {
   if (lucene.search(wordOrPhrase, text_id) gives a hit) {
      matches.add(wordOrPhrase)
   }
}
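
To make the idea concrete, here is a minimal sketch of what I imagine, assuming
Lucene's MemoryIndex (from the lucene-memory module) and the classic QueryParser.
Exact constructors differ between Lucene versions, and the field name "body" is
just a placeholder:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class DictionaryMatcher {

    // Returns the dictionary entries that occur in the given text.
    public static List<String> match(String text, List<String> dictionary) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();

        // Index the single text entirely in memory; nothing is written to disk.
        MemoryIndex index = new MemoryIndex();
        index.addField("body", text, analyzer);

        // One query per dictionary entry; quoting makes multi-word entries phrase queries.
        QueryParser parser = new QueryParser("body", analyzer);
        List<String> matches = new ArrayList<String>();
        for (String wordOrPhrase : dictionary) {
            Query query = parser.parse("\"" + QueryParser.escape(wordOrPhrase) + "\"");
            if (index.search(query) > 0.0f) {   // non-zero score means the entry was found
                matches.add(wordOrPhrase);
            }
        }
        return matches;
    }
}

Since the analyzer lowercases tokens, this would also take care of the
case-sensitivity concern, and a fuzzy query (e.g. wordOrPhrase~) could be
substituted where approximate matching is wanted.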

I have not used Lucene very much, so I don't know whether it is a good idea
to use Lucene for this task at all. Could anyone please share their
thoughts on this?

Thanks,
Geir
