Hi Jay,

Sorry for the confusion. I wrote NgramStemFilter at an early stage of the 
project; it is essentially the same as Otis's NGramTokenFilter, with the 
addition that I add begin and end markers to each token (e.g. "word" becomes 
_word_ and thus yields the bigrams _w wo or rd d_).
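To make that concrete, here is a minimal standalone sketch of the marker-plus-bigram 
behavior (plain Java for illustration; this is not the actual filter source):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the begin/end-marker bigram idea; not the actual NgramStemFilter code.
    public class MarkerBigrams {
        static List<String> bigrams(String token) {
            String marked = "_" + token + "_";         // add begin and end markers
            List<String> grams = new ArrayList<String>();
            for (int i = 0; i + 2 <= marked.length(); i++) {
                grams.add(marked.substring(i, i + 2)); // emit every 2-gram
            }
            return grams;
        }

        public static void main(String[] args) {
            System.out.println(bigrams("word")); // [_w, wo, or, rd, d_]
        }
    }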

While migrating a lot of our Lucene code, which we had been developing since 
Lucene 1.2, to a 2.x version, I did not notice the existence of 
NGramTokenFilter.

Stemming is not useful for our problem domain (product catalogs) anyway.
We chain a WhitespaceTokenizer with a modified version of 
ISOLatin1AccentFilter to normalize some character-based language aspects 
(e.g. ß -> ss, ö -> oe), then lowercase the tokens before generating the bigrams.
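For concreteness, that chain looks roughly like the Analyzer below. 
ISOLatin2AccentFilter and NGramStemFilter are our own in-house classes (the raw 
chain also appears in my earlier mail quoted further down), so treat this as a sketch:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Sketch only: ISOLatin2AccentFilter and NGramStemFilter are in-house classes.
    public class CatalogBigramAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream result = new WhitespaceTokenizer(reader); // split on whitespace
            result = new ISOLatin2AccentFilter(result);           // normalize: ß -> ss, ö -> oe
            result = new LowerCaseFilter(result);                 // lowercase tokens
            result = new NGramStemFilter(result, 2);              // bigrams with _ markers
            return result;
        }
    }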
The real advantage for us, though, is the TolerantPhraseQuery (see my other post "AW: 
Implement a relaxed PhraseQuery?"), which gives us a first step toward less 
language-dependent searching.

Regards Uwe

-----Original Message-----
From: yu [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 26 March 2008 05:26
To: java-user@lucene.apache.org
Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Sorry for my ignorance; I am looking for NgramStemFilter specifically.
Are you suggesting that it's the same as NGramTokenFilter? Does it include 
stemming?

Thanks again.

Jay


Otis Gospodnetic wrote:
> Sorry, I wrote this stuff, but forgot the naming.
> Look: 
> http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/org/apache/lucene/analysis/ngram/package-summary.html
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: yu <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Wednesday, March 26, 2008 12:04:33 AM
> Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1
>
> Hi Otis,
> I checked that contrib before and could not find NgramStemFilter. Am I 
> missing another contrib?
> Thanks for the link!
>
> Jay
>
> Otis Gospodnetic wrote:
>   
>> Hi Jay,
>>
>> Sorry, lapsus calami, that would be Lucene *contrib*.
>> Have a look:
>> http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/index.html
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>> ----- Original Message ----
>> From: Jay <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Tuesday, March 25, 2008 6:15:54 PM
>> Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1
>>
>> Sorry, I could not find the filter in the 2.3 API class list (core + 
>> contrib + test). I am not aware of a Lucene config file either. Could you 
>> please tell me where it is in the 2.3 release?
>>
>> Thanks!
>>
>> Jay
>>
>> Otis Gospodnetic wrote:
>>   
>>     
>>> Jay,
>>>
>>> Have a look at Lucene config, it's all there, including tests.  This filter 
>>> will take a token such as "foobar" and chop it up into n-grams (e.g. foobar 
>>> -> fo oo ob ba ar would be a set of bi-grams).  You can specify the n-gram 
>>> size and even min and max n-gram size.
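>>> As a plain-Java illustration of that chopping (the real filter lives in 
>>> contrib/analyzers, so this is just a sketch):
>>>
>>>     import java.util.ArrayList;
>>>     import java.util.List;
>>>
>>>     // Illustration of min/max n-gram chopping; not the contrib code itself.
>>>     public class NGramDemo {
>>>         static List<String> ngrams(String token, int minGram, int maxGram) {
>>>             List<String> grams = new ArrayList<String>();
>>>             for (int n = minGram; n <= maxGram; n++) {          // each requested size
>>>                 for (int i = 0; i + n <= token.length(); i++) { // slide over the token
>>>                     grams.add(token.substring(i, i + n));
>>>                 }
>>>             }
>>>             return grams;
>>>         }
>>>
>>>         public static void main(String[] args) {
>>>             System.out.println(ngrams("foobar", 2, 2)); // [fo, oo, ob, ba, ar]
>>>         }
>>>     }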
>>>
>>> Otis
>>> --
>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>
>>> ----- Original Message ----
>>> From: Jay <[EMAIL PROTECTED]>
>>> To: java-user@lucene.apache.org
>>> Sent: Tuesday, March 25, 2008 1:32:24 PM
>>> Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1
>>>
>>> Hi Uwe,
>>>
>>> I am curious what NGramStemFilter is. Is it a combination of Porter 
>>> stemming and word n-gram identification?
>>>
>>> Thanks!
>>>
>>> Jay
>>>
>>> Uwe Goetzke wrote:
>>>     
>>>       
>>>> Hi Ivan,
>>>> No, we do not use StandardAnalyser or StandardTokenizer.
>>>>
>>>> Most data is processed by:
>>>>     fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
>>>>     result = new ISOLatin2AccentFilter(result); // ISOLatin1AccentFilter modified so that ö -> oe
>>>>     result = new org.apache.lucene.analysis.LowerCaseFilter(result);
>>>>     result = new org.apache.lucene.analysis.NGramStemFilter(result, 2); // just a bigram tokenizer
>>>>
>>>> We use our own query parser. The bigrams are searched with a tolerant 
>>>> phrase query, which scores, per document, the largest clusters of 
>>>> bigrams covering the phrase tokens.
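>>>> (As an illustration of the "largest cluster" idea only, not our actual 
>>>> TolerantPhraseQuery code, which presumably works on indexed term 
>>>> positions, the scoring amounts to something like:)
>>>>
>>>>     import java.util.List;
>>>>     import java.util.Set;
>>>>
>>>>     // Simplified illustration of the cluster idea; not the real TolerantPhraseQuery.
>>>>     public class ClusterScoreSketch {
>>>>         // Longest consecutive run of query bigrams that a document contains.
>>>>         static int largestCluster(List<String> queryGrams, Set<String> docGrams) {
>>>>             int best = 0, run = 0;
>>>>             for (String gram : queryGrams) {
>>>>                 run = docGrams.contains(gram) ? run + 1 : 0; // extend or break the run
>>>>                 if (run > best) best = run;
>>>>             }
>>>>             return best; // longer runs -> better phrase coverage
>>>>         }
>>>>     }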
>>>>
>>>> Best Regards
>>>>
>>>> Uwe
>>>>
>>>> -----Original Message-----
>>>> From: Ivan Vasilev [mailto:[EMAIL PROTECTED] 
>>>> Sent: Friday, 21 March 2008 16:25
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1
>>>>
>>>> Hi Uwe,
>>>>
>>>> Could you tell us which Analyzer you were using when you measured such a 
>>>> big indexing speedup?
>>>> If you use StandardAnalyzer (which uses StandardTokenizer), the reason may 
>>>> lie there. See the second-to-last report in the thread "Indexing Speed: 
>>>> 2.3 vs 2.2 (real world numbers)". According to the reporter, Jake Mannix, 
>>>> this is because StandardTokenizer now uses StandardTokenizerImpl, which is 
>>>> generated by JFlex instead of JavaCC.
>>>> I am asking because I noticed a great speedup in adding documents to the 
>>>> index in our system. We time this in debug mode, and documents are now 
>>>> added 5 TIMES FASTER!
>>>> At the same time, however, the total indexing process in our case improved 
>>>> by only about 8%. As our system is very big and complex, I am wondering 
>>>> whether adding documents really is that much faster and our own system 
>>>> causes the remaining slowdown, or whether Lucene performs optimizations on 
>>>> the index, merges, or something else, and that is why the total indexing 
>>>> time does not improve as remarkably.
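>>>> (Our debug-mode timing is essentially wall-clock measurement around the 
>>>> addDocument calls, along these lines; the names here are illustrative:)
>>>>
>>>>     import java.io.IOException;
>>>>     import org.apache.lucene.document.Document;
>>>>     import org.apache.lucene.index.IndexWriter;
>>>>
>>>>     // Illustrative sketch: accumulates time spent only in addDocument,
>>>>     // separately from the total run (merges, optimize, our own code).
>>>>     public class TimedAdds {
>>>>         static long addMillis = 0;
>>>>
>>>>         static void addTimed(IndexWriter writer, Document doc) throws IOException {
>>>>             long start = System.currentTimeMillis();
>>>>             writer.addDocument(doc);
>>>>             addMillis += System.currentTimeMillis() - start;
>>>>         }
>>>>     }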
>>>>
>>>> Best Regards,
>>>> Ivan
>>>>
>>>>
>>>>
>>>> Uwe Goetzke wrote:
>>>>       
>>>>         
>>>>> This week I switched the Lucene library version on one customer system.
>>>>> The indexing time went down from 46m32s to 16m20s for the complete task,
>>>>> including optimization. Great job!
>>>>> We index product catalogs from several suppliers; in this case, around
>>>>> 56,000 product groups and 360,000 products, including descriptions, were
>>>>> indexed.
>>>>>
>>>>> Regards
>>>>>
>>>>> Uwe
>>>>>
>>>>>
>>>>>



