You could also look at MemoryIndex or InstantiatedIndex, both in Lucene's contrib area. I was also wondering if you might gain from using TermDocs or TermVectors or something similar directly.
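For example, something along these lines might be a starting point - a rough, untested sketch assuming the contrib MemoryIndex and QueryParser from Lucene 3.x (the "content" field name and the phrase-quoting of each entry are just illustrative, not required):

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.util.Version;

// Match each dictionary entry against a single piece of text using
// an in-memory, single-document index (contrib/memory).
public class DictionaryMatcher {

    public static List<String> match(String text, List<String> dictionary)
            throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // One throw-away index per text to be analyzed.
        MemoryIndex index = new MemoryIndex();
        index.addField("content", text, analyzer);

        QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
        List<String> matches = new ArrayList<String>();
        for (String wordOrPhrase : dictionary) {
            // Quote the entry so multi-word entries become phrase queries;
            // a score above zero means the entry matched the text.
            String quoted = "\"" + QueryParser.escape(wordOrPhrase) + "\"";
            if (index.search(parser.parse(quoted)) > 0.0f) {
                matches.add(wordOrPhrase);
            }
        }
        return matches;
    }
}

MemoryIndex is intended for exactly this kind of "does this query match this one document" workload, so it may save you building a RAMDirectory and IndexWriter for every text you analyze.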
--
Ian.

On Tue, Jul 27, 2010 at 9:34 PM, Geir Gullestad Pettersen <gei...@gmail.com> wrote:
> Thanks for your feedback, Ian.
>
> I have written a first implementation of this service that works well. You
> mentioned something about techniques for speeding up Lucene, something I
> am interested in knowing more about. Would you, or anyone, please mind
> elaborating a bit, or giving me some pointers?
>
> For the record, I am using the in-memory RAMDirectory instead of a
> file-based index. I don't know if it is relevant in terms of speeding
> things up, but thought I'd mention it just to be safe.
>
> Thank you,
>
> Geir
>
> 2010/7/23 Ian Lea <ian....@gmail.com>
>
>> So, if I've understood this correctly, you've got some text and want
>> to loop through a list of words and/or phrases and see which of those
>> match the text.
>>
>> e.g.
>>
>> text: "some random article about something or other of some random length"
>>
>> words:
>>
>> some - matches
>> many - no match
>> article - matches
>> word - no match
>>
>> You can certainly do that with Lucene. Load the text into a document
>> and loop round the words or phrases, searching for each. You are
>> likely to need to look into analyzers, depending on your requirements
>> around stop words, punctuation, case, etc., and phrase/span queries
>> for phrases.
>> There are also probably some Lucene techniques for speeding this up,
>> but as ever, start simple - Lucene is usually plenty fast enough.
>>
>>
>> --
>> Ian.
>>
>>
>> On Thu, Jul 22, 2010 at 11:30 PM, Geir Gullestad Pettersen
>> <gei...@gmail.com> wrote:
>> > Hi,
>> >
>> > I'm about to write an application that does very simple text analysis,
>> > namely dictionary-based entity extraction. The alternative is to do
>> > in-memory substring matching with String.indexOf:
>> >
>> > String text; // could be any size, but normally "newspaper length"
>> > List<String> matches = new ArrayList<String>();
>> > for (String wordOrPhrase : dictionary) {
>> >     if (text.indexOf(wordOrPhrase) >= 0) {
>> >         matches.add(wordOrPhrase);
>> >     }
>> > }
>> >
>> > I am concerned the above code will be quite CPU intensive; it will
>> > also be case sensitive and not leave any room for fuzzy matching.
>> >
>> > I thought this task could also be solved by indexing every bit of text
>> > that is to be analyzed, and then executing a query per dictionary entry:
>> >
>> > (pseudo)
>> >
>> > lucene.index(text)
>> > List<String> matches
>> > for (String wordOrPhrase : dictionary) {
>> >     if (lucene.search(wordOrPhrase, text_id) gives a hit) {
>> >         matches.add(wordOrPhrase)
>> >     }
>> > }
>> >
>> > I have not used Lucene very much, so I don't know whether it is a good
>> > idea to use Lucene for this task at all. Could anyone please share
>> > their thoughts on this?
>> >
>> > Thanks,
>> > Geir
>> >
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org