RAMDirectory instances seem useful, but as the index gets larger, Java heap
sizes can become a problem because of garbage collection pauses. Some
customers are looking at data grid products such as IBM WebSphere
eXtreme Scale or Oracle Coherence to act as the Directory for the
index. This stores the index in memory across a set of remote JVMs. It
avoids expensive disk I/O and considerably reduces the memory needed
by each JVM running the Lucene engine. It's not as fast as a
RAMDirectory; in our tests it performs similarly to a fast-disk setup.

Once the index has been copied from disk to the grid directory, Lucene
JVMs can be cycled and reconnected to that remote grid, avoiding the
need to copy the index into RAM on every JVM start.
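
As a rough illustration of the copy-once idea, here is a minimal sketch of
loading an on-disk index into memory and then searching it, assuming a Lucene
3.0-era API (the class name and index path are illustrative; a grid-backed
setup would substitute its own Directory implementation for the RAMDirectory):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class LoadIndexIntoMemory {
    public static void main(String[] args) throws IOException {
        // Copy the existing on-disk index into an in-memory Directory once.
        // With a remote data grid, restarted JVMs would instead reconnect to
        // the grid-backed Directory and skip this copy.
        Directory onDisk = FSDirectory.open(new File("/path/to/index"));
        Directory inMemory = new RAMDirectory(onDisk); // copies all index files
        onDisk.close();

        // Search against the in-memory copy exactly as with any other Directory.
        IndexSearcher searcher = new IndexSearcher(inMemory);
        // ... run queries ...
        searcher.close();
    }
}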

I work for IBM and am the chief architect for IBM's data grid product,
WebSphere eXtreme Scale.


On Jul 27, 2010, at 3:34 PM, Geir Gullestad Pettersen <gei...@gmail.com> wrote:

> Thanks for your feedback, Ian.
>
> I have written a first implementation of this service that works well. You
> mentioned something about technologies for speeding up Lucene, something I
> am interested in knowing more about. Would you, or anyone else, mind
> elaborating a bit or giving me some pointers?
>
> For the record, I am using the in-memory RAMDirectory instead of a file-based
> index. I don't know if it is relevant in terms of speeding things up, but I
> thought I'd mention it just to be safe.
>
> Thank you,
>
> Geir
>
> 2010/7/23 Ian Lea <ian....@gmail.com>
>
>> So, if I've understood this correctly, you've got some text and want
>> to loop through a list of words and/or phrases, and see which of those
>> match the text.
>>
>> e.g.
>>
>> text "some random article about something or other of some random length"
>>
>> words
>>
>> some - matches
>> many - no match
>> article - matches
>> word - no match
>>
>> You can certainly do that with Lucene.  Load the text into a document
>> and loop round the words or phrases, searching for each.  You are
>> likely to need to look into analyzers, depending on your requirements
>> around stop words, punctuation, case, etc., and phrase/span queries
>> for phrases.
>> There are also probably some Lucene techniques for speeding this up,
>> but as ever, start simple - Lucene is usually plenty fast enough.
>>
>>
>> --
>> Ian.
>>
>>
>> On Thu, Jul 22, 2010 at 11:30 PM, Geir Gullestad Pettersen
>> <gei...@gmail.com> wrote:
>>> Hi,
>>>
>>> I'm about to write an application that does very simple text analysis,
>>> namely dictionary-based entity extraction. The alternative is to do
>>> in-memory matching with String.indexOf():
>>>
>>> String text;  // could be any size, but normally newspaper-article length
>>> List<String> matches = new ArrayList<String>();
>>> for ( String wordOrPhrase : dictionary ) {
>>>   if ( text.indexOf( wordOrPhrase ) >= 0 ) {  // case-sensitive substring match
>>>      matches.add( wordOrPhrase );
>>>   }
>>> }
>>>
>>> I am concerned the above code will be quite CPU intensive; it will also
>>> be case sensitive and not leave any room for fuzzy matching.
>>>
>>> I thought this task could also be solved by indexing every bit of text
>>> that is to be analyzed, and then executing a query per dictionary entry:
>>>
>>> (pseudo)
>>>
>>> lucene.index(text)
>>> List matches
>>> for ( String wordOrPhrase : dictionary ) {
>>>   if ( lucene.search( wordOrPhrase, text_id ) gives a hit ) {
>>>      matches.add( wordOrPhrase )
>>>   }
>>> }
>>>
>>> I have not used Lucene very much, so I don't know whether it is a good idea
>>> to use Lucene for this task at all. Could anyone please share their
>>> thoughts on this?
>>>
>>> Thanks,
>>> Geir
>>>
>>
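
Here is a minimal sketch of the approach Ian describes above (index the text
as a single document, then run one query per dictionary entry), roughly what
Geir's pseudocode outlines, assuming a Lucene 3.0-era API. The class name, the
"body" field, and the use of QueryParser phrase quoting are illustrative
choices, not anything prescribed in this thread:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class DictionaryMatcher {
    public static List<String> match(String text, List<String> dictionary) throws Exception {
        // Index the text as a single document in an in-memory directory.
        RAMDirectory dir = new RAMDirectory();
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("body", text, Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Run one query per dictionary entry; surrounding quotes turn
        // multi-word entries into phrase queries.
        IndexSearcher searcher = new IndexSearcher(dir);
        QueryParser parser = new QueryParser(Version.LUCENE_30, "body", analyzer);
        List<String> matches = new ArrayList<String>();
        for (String wordOrPhrase : dictionary) {
            Query query = parser.parse("\"" + QueryParser.escape(wordOrPhrase) + "\"");
            if (searcher.search(query, 1).totalHits > 0) {
                matches.add(wordOrPhrase);
            }
        }
        searcher.close();
        return matches;
    }
}

Note that StandardAnalyzer lower-cases tokens and drops common English stop
words, so the choice of analyzer matters, as Ian points out; fuzzy matching
would need fuzzy or span queries on top of this.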

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
