Re: solr 3.5 and indexing performance

Jan Høydahl Tue, 13 Mar 2012 16:54:43 -0700

Hi,

Thanks a lot for your detailed problem description. It definitely is an error.
Would you be so kind as to register it as a bug ticket, including the
descriptions from this email?
http://wiki.apache.org/solr/HowToContribute#JIRA_tips_.28our_issue.2BAC8-bug_tracker.29
Also, please attach your Polish Hunspell dictionaries to the issue. Then we'll
try to reproduce the error.


I wonder if this performance decrease is also seen for English dictionaries?
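
A minimal sketch of a field type for such an English comparison run, assuming
en_US.dic and en_US.aff dictionary files placed under dict/. The field type
name and both file names are illustrative assumptions; only the filter
attributes (dictionary, affix, ignoreCase) are taken from the configuration
quoted below.

```xml
<!-- Hypothetical field type for an English comparison run. The name
     text_en_hunspell and the en_US.dic / en_US.aff file names are assumed. -->
<fieldType name="text_en_hunspell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Same Hunspell filter as in the Polish setup, pointed at English files -->
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="dict/en_US.dic" affix="dict/en_US.aff" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Indexing the same documents with this field type on both 3.4 and 4.0 would
indicate whether the slowdown depends on the Polish dictionary or on the
filter itself.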

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 13. mars 2012, at 16:42, Agnieszka Kukałowicz wrote:

> Hi,
> 
> I did some more tests for Hunspell in Solr 3.4 and 4.0:
> 
> Solr 3.4, full import 489017 documents:
> 
> StempelPolishStemFilterFactory -  2908 seconds, 168 docs/sec
> HunspellStemFilterFactory - 3922 seconds, 125 docs/sec
> 
> Solr 4.0, full import 489017 documents:
> 
> StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec
> HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec
> 
> Server specification and Java settings are the same as before.
> 
> Cheers
> Agnieszka
> 
> 
>> -----Original Message-----
>> From: Agnieszka Kukałowicz [mailto:agnieszka.kukalow...@usable.pl]
>> Sent: Tuesday, March 13, 2012 10:39 AM
>> To: 'solr-user@lucene.apache.org'
>> Subject: RE: solr 3.5 and indexing performance
>> 
>> Hi,
>> 
>> Yes, I confirmed that without Hunspell indexing has normal speed.
>> I did tests in solr 4.0 with Hunspell and PolishStemmer.
>> With StempelPolishStemFilterFactory the speed is normal.
>> 
>> My schema is quite simple. For Hunspell I have one text field that 14
>> text fields are copied to:
>> 
>> "<field name="text" type="text_pl_hunspell" indexed="true"
>> stored="false" multiValued="true"/>"
>> 
>> 
>> <copyField source="field1" dest="text"/>
>> <copyField source="field2" dest="text"/>
>> <copyField source="field3" dest="text"/>
>> <copyField source="field4" dest="text"/>
>> <copyField source="field5" dest="text"/>
>> <copyField source="field6" dest="text"/>
>> <copyField source="field7" dest="text"/>
>> <copyField source="field8" dest="text"/>
>> <copyField source="field9" dest="text"/>
>> <copyField source="field10" dest="text"/>
>> <copyField source="field11" dest="text"/>
>> <copyField source="field12" dest="text"/>
>> <copyField source="field13" dest="text"/>
>> <copyField source="field14" dest="text"/>
>> 
>> The "text_pl_hunspell" configuration:
>> 
>> <fieldType name="text_pl_hunspell" class="solr.TextField"
>> positionIncrementGap="100">
>>      <analyzer type="index">
>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>        <filter class="solr.StopFilterFactory"
>>                ignoreCase="true"
>>                words="dict/stopwords_pl.txt"
>>                enablePositionIncrements="true"
>>                />
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>        <filter class="solr.HunspellStemFilterFactory"
>> dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
>>        <!--filter class="solr.KeywordMarkerFilterFactory"
>> protected="protwords_pl.txt"/-->
>>      </analyzer>
>>      <analyzer type="query">
>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>        <filter class="solr.SynonymFilterFactory"
>> synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
>>        <filter class="solr.StopFilterFactory"
>>                ignoreCase="true"
>>                words="dict/stopwords_pl.txt"
>>                enablePositionIncrements="true"
>>                />
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>        <filter class="solr.HunspellStemFilterFactory"
>> dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
>>        <filter class="solr.KeywordMarkerFilterFactory"
>> protected="dict/protwords_pl.txt"/>
>>      </analyzer>
>>    </fieldType>
>> 
>> I use the Polish dictionary files pl_PL.dic and pl_PL.aff (the files
>> stopwords_pl.txt, protwords_pl.txt and synonyms_pl.txt are empty). These
>> are the same files I used in version 3.4.
>> 
>> For the Polish stemmer the only difference is in the definition of the
>> text field:
>> 
>> "<field name="text" type="text_pl" indexed="true" stored="false"
>> multiValued="true"/>"
>> 
>>    <fieldType name="text_pl" class="solr.TextField"
>> positionIncrementGap="100">
>>      <analyzer type="index">
>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>        <filter class="solr.StopFilterFactory"
>>                ignoreCase="true"
>>                words="dict/stopwords_pl.txt"
>>                enablePositionIncrements="true"
>>                />
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>        <filter class="solr.StempelPolishStemFilterFactory"/>
>>        <filter class="solr.KeywordMarkerFilterFactory"
>> protected="dict/protwords_pl.txt"/>
>>      </analyzer>
>>      <analyzer type="query">
>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>        <filter class="solr.SynonymFilterFactory"
>> synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
>>        <filter class="solr.StopFilterFactory"
>>                ignoreCase="true"
>>                words="dict/stopwords_pl.txt"
>>                enablePositionIncrements="true"
>>                />
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>        <filter class="solr.StempelPolishStemFilterFactory"/>
>>        <filter class="solr.KeywordMarkerFilterFactory"
>> protected="dict/protwords_pl.txt"/>
>>      </analyzer>
>>    </fieldType>
>> 
>> One document has 23 fields:
>> - 14 text fields copied to one text field (above) that is only indexed
>> - 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat)
>> 
>> The size of one document is 3-4 kB.
>> So I think this is not a very complicated schema.
>> 
>> My environment is:
>> - Linux, RedHat 6.2, kernel 2.6.32
>> - 2 physical CPU Xeon 5606 (4 cores each)
>> - 32 GB RAM
>> - 2 SSD disks in RAID 0
>> - java version:
>> 
>> java -version
>> java version "1.6.0_26"
>> Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
>> Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
>> 
>> - java is running with -server -Xms4096M -Xmx4096M (I tried a lot of
>> other settings and always got the same result)
>> - solr has the default configuration except for ramBufferSizeMB (128 MB)
>> - solr 4.0 from nightly builds (2012-02-21 build).
>> 
>> If you need more information, please let me know.
>> I will also try to use a profiler to see what happens.
>> 
>> Agnieszka
>> 
>> 
>>> -----Original Message-----
>>> From: Jan Høydahl [mailto:jan....@cominvent.com]
>>> Sent: Tuesday, March 13, 2012 9:47 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: solr 3.5 and indexing performance
>>> 
>>> Hi,
>>> 
>>> Have you confirmed that disabling Hunspell in your schema gets you back
>>> to normal speed?
>>> What Hunspell configuration and dictionaries do you have?
>>> Can you share more about your environment and documents?
>>> Do you have a chance to run a profiler on your Solr instance? Try e.g.
>>> VisualVM and run the profiler to see what part of the code takes up
>>> the time:
>>> http://docs.oracle.com/javase/6/docs/technotes/tools/share/jvisualvm.html
>>> 
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> Solr Training - www.solrtraining.com
>>> 
>>> On 12. mars 2012, at 16:42, Agnieszka Kukałowicz wrote:
>>> 
>>>> Hi guys,
>>>> 
>>>> I have hit the same problem with Hunspell.
>>>> Doing a few tests with 500 000 documents, I got:
>>>> 
>>>> Hunspell from http://code.google.com/p/lucene-hunspell/ with version 3.4 -
>>>> 125 documents per second
>>>> Hunspell built from 4.0 trunk - 11 documents per second.
>>>> 
>>>> All the tests were made on 8 core CPU with 32 GB RAM and index on
>>>> SSD disks.
>>>> For Solr 3.5 I've tried to change JVM heap size, rambuffersize,
>>>> mergefactor but the speed of indexing was about 10 -20 documents per
>>>> second.
>>>> 
>>>> Is it possible that there is a performance bug in Solr 4.0?
>>>> According to the previous post the problem also exists in version 3.5.
>>>> 
>>>> Best regards
>>>> Agnieszka Kukałowicz
>>>> 
>>>> 
>>>>> -----Original Message-----
>>>>> From: mizayah [mailto:miza...@gmail.com]
>>>>> Sent: Thursday, February 23, 2012 10:19 AM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: solr 3.5 and indexing performance
>>>>> 
>>>>> OK, I found it.
>>>>> 
>>>>> It's because of Hunspell, which is now in Solr. Somehow, when I'm using
>>>>> it by myself in 3.4, it is a lot faster than the one from 3.5.
>>>>> 
>>>>> I don't know about the differences, but is there any way I can use my
>>>>> old Google Hunspell jar?
>>>>> 
>>>>> --
>>>>> View this message in context:
>>>>> http://lucene.472066.n3.nabble.com/solr-3-5-and-indexing-performance-tp3766653p3769139.html
>>>>> Sent from the Solr - User mailing list archive at Nabble.com.
