RE: solr 3.5 and indexing performance

Agnieszka Kukałowicz Wed, 14 Mar 2012 06:37:05 -0700

Bug ticket created:
https://issues.apache.org/jira/browse/SOLR-3245


I also ran the test you asked for with the English dictionary.
The results are in the ticket.
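
As a quick sanity check on the figures quoted below, here is a minimal sketch (plain Python, not part of Solr) that recomputes the docs/sec values from the document count and import times in my 13 March message. The names TOTAL_DOCS, IMPORT_SECONDS, docs_per_sec and slowdown are made up for this illustration.

```python
# Figures copied from the benchmark message quoted further down the thread.
TOTAL_DOCS = 489017

# (Solr version, stem filter) -> full-import time in seconds
IMPORT_SECONDS = {
    ("3.4", "StempelPolishStemFilterFactory"): 2908,
    ("3.4", "HunspellStemFilterFactory"): 3922,
    ("4.0", "StempelPolishStemFilterFactory"): 3016,
    ("4.0", "HunspellStemFilterFactory"): 44580,
}


def docs_per_sec(total_docs, seconds):
    """Average indexing throughput over a full import."""
    return total_docs / seconds


def slowdown(seconds_new, seconds_old):
    """How many times longer the newer import took than the older one."""
    return seconds_new / seconds_old


if __name__ == "__main__":
    for (version, stemmer), seconds in sorted(IMPORT_SECONDS.items()):
        rate = docs_per_sec(TOTAL_DOCS, seconds)
        print(f"Solr {version} {stemmer}: {rate:.0f} docs/sec")

    for stemmer in ("StempelPolishStemFilterFactory", "HunspellStemFilterFactory"):
        factor = slowdown(IMPORT_SECONDS[("4.0", stemmer)],
                          IMPORT_SECONDS[("3.4", stemmer)])
        print(f"{stemmer}: 4.0 import took {factor:.2f}x the 3.4 time")
```

By this arithmetic the Hunspell import in 4.0 takes roughly 11 times as long as in 3.4, while Stempel stays within a few percent between the two versions.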

Agnieszka

> -----Original Message-----
> From: Jan Høydahl [mailto:jan....@cominvent.com]
> Sent: Wednesday, March 14, 2012 12:54 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr 3.5 and indexing performance
>
> Hi,
>
> Thanks a lot for your detailed problem description. It definitely is an
> error. Would you be so kind to register it as a bug ticket, including
> your descriptions from this email?
> http://wiki.apache.org/solr/HowToContribute#JIRA_tips_.28our_issue.2BAC8-bug_tracker.29
> Also please attach your Polish Hunspell dictionaries to the issue.
> Then we'll try to reproduce the error.
>
> I wonder if this performance decrease is also seen for English
> dictionaries?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 13 March 2012, at 16:42, Agnieszka Kukałowicz wrote:
>
> > Hi,
> >
> > I ran some more tests with Hunspell in Solr 3.4 and 4.0:
> >
> > Solr 3.4, full import 489017 documents:
> >
> > StempelPolishStemFilterFactory -  2908 seconds, 168 docs/sec
> > HunspellStemFilterFactory - 3922 seconds, 125 docs/sec
> >
> > Solr 4.0, full import 489017 documents:
> >
> > StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec
> > HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec
> >
> > Server specification and Java settings are the same as before.
> >
> > Cheers
> > Agnieszka
> >
> >
> >> -----Original Message-----
> >> From: Agnieszka Kukałowicz [mailto:agnieszka.kukalow...@usable.pl]
> >> Sent: Tuesday, March 13, 2012 10:39 AM
> >> To: 'solr-user@lucene.apache.org'
> >> Subject: RE: solr 3.5 and indexing performance
> >>
> >> Hi,
> >>
> >> Yes, I confirmed that without Hunspell indexing runs at normal speed.
> >> I ran tests in Solr 4.0 with both Hunspell and the Polish stemmer.
> >> With StempelPolishStemFilterFactory the speed is normal.
> >>
> >> My schema is quite simple. For Hunspell I have one text field that 14
> >> text fields are copied into:
> >>
> >> <field name="text" type="text_pl_hunspell" indexed="true"
> >>        stored="false" multiValued="true"/>
> >>
> >>
> >> <copyField source="field1" dest="text"/>
> >> <copyField source="field2" dest="text"/>
> >> <copyField source="field3" dest="text"/>
> >> <copyField source="field4" dest="text"/>
> >> <copyField source="field5" dest="text"/>
> >> <copyField source="field6" dest="text"/>
> >> <copyField source="field7" dest="text"/>
> >> <copyField source="field8" dest="text"/>
> >> <copyField source="field9" dest="text"/>
> >> <copyField source="field10" dest="text"/>
> >> <copyField source="field11" dest="text"/>
> >> <copyField source="field12" dest="text"/>
> >> <copyField source="field13" dest="text"/>
> >> <copyField source="field14" dest="text"/>
> >>
> >> The "text_pl_hunspell" configuration:
> >>
> >> <fieldType name="text_pl_hunspell" class="solr.TextField"
> >> positionIncrementGap="100">
> >>      <analyzer type="index">
> >>        <tokenizer class="solr.StandardTokenizerFactory"/>
> >>        <filter class="solr.StopFilterFactory"
> >>                ignoreCase="true"
> >>                words="dict/stopwords_pl.txt"
> >>                enablePositionIncrements="true"
> >>                />
> >>        <filter class="solr.LowerCaseFilterFactory"/>
> >>        <filter class="solr.HunspellStemFilterFactory"
> >>                dictionary="dict/pl_PL.dic"
> >>                affix="dict/pl_PL.aff"
> >>                ignoreCase="true"/>
> >>        <!--filter class="solr.KeywordMarkerFilterFactory"
> >> protected="protwords_pl.txt"/-->
> >>      </analyzer>
> >>      <analyzer type="query">
> >>        <tokenizer class="solr.StandardTokenizerFactory"/>
> >>        <filter class="solr.SynonymFilterFactory"
> >> synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
> >>        <filter class="solr.StopFilterFactory"
> >>                ignoreCase="true"
> >>                words="dict/stopwords_pl.txt"
> >>                enablePositionIncrements="true"
> >>                />
> >>        <filter class="solr.LowerCaseFilterFactory"/>
> >>        <filter class="solr.HunspellStemFilterFactory"
> >>                dictionary="dict/pl_PL.dic"
> >>                affix="dict/pl_PL.aff"
> >>                ignoreCase="true"/>
> >>        <filter class="solr.KeywordMarkerFilterFactory"
> >> protected="dict/protwords_pl.txt"/>
> >>      </analyzer>
> >>    </fieldType>
> >>
> >> I use the Polish dictionary files pl_PL.dic and pl_PL.aff (the files
> >> stopwords_pl.txt, protwords_pl.txt and synonyms_pl.txt are empty).
> >> These are the same files I used with version 3.4.
> >>
> >> For the Polish stemmer the only difference is in the definition of the
> >> text field:
> >>
> >> <field name="text" type="text_pl" indexed="true" stored="false"
> >>        multiValued="true"/>
> >>
> >>    <fieldType name="text_pl" class="solr.TextField"
> >> positionIncrementGap="100">
> >>      <analyzer type="index">
> >>        <tokenizer class="solr.StandardTokenizerFactory"/>
> >>        <filter class="solr.StopFilterFactory"
> >>                ignoreCase="true"
> >>                words="dict/stopwords_pl.txt"
> >>                enablePositionIncrements="true"
> >>                />
> >>        <filter class="solr.LowerCaseFilterFactory"/>
> >>        <filter class="solr.StempelPolishStemFilterFactory"/>
> >>        <filter class="solr.KeywordMarkerFilterFactory"
> >> protected="dict/protwords_pl.txt"/>
> >>      </analyzer>
> >>      <analyzer type="query">
> >>        <tokenizer class="solr.StandardTokenizerFactory"/>
> >>        <filter class="solr.SynonymFilterFactory"
> >> synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
> >>        <filter class="solr.StopFilterFactory"
> >>                ignoreCase="true"
> >>                words="dict/stopwords_pl.txt"
> >>                enablePositionIncrements="true"
> >>                />
> >>        <filter class="solr.LowerCaseFilterFactory"/>
> >>        <filter class="solr.StempelPolishStemFilterFactory"/>
> >>        <filter class="solr.KeywordMarkerFilterFactory"
> >> protected="dict/protwords_pl.txt"/>
> >>      </analyzer>
> >>    </fieldType>
> >>
> >> One document has 23 fields:
> >> - 14 text fields copied to one text field (above) that is only indexed
> >> - 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat)
> >>
> >> The size of one document is 3-4 kB.
> >> So I think this is not a very complicated schema.
> >>
> >> My environment is:
> >> - Linux, RedHat 6.2, kernel 2.6.32
> >> - 2 physical CPU Xeon 5606 (4 cores each)
> >> - 32 GB RAM
> >> - 2 SSD disks in RAID 0
> >> - java version:
> >>
> >> java -version
> >> java version "1.6.0_26"
> >> Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
> >> Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
> >>
> >> - Java is running with -server -Xms4096M -Xmx4096M (I tried many
> >> other settings and always got the same result)
> >> - Solr has the default configuration except for ramBufferSizeMB (128 MB)
> >> - Solr 4.0 from nightly builds (2012-02-21 build).
> >>
> >> If you need more information, please let me know.
> >> I will also try to use a profiler to see what happens.
> >>
> >> Agnieszka
> >>
> >>
> >>> -----Original Message-----
> >>> From: Jan Høydahl [mailto:jan....@cominvent.com]
> >>> Sent: Tuesday, March 13, 2012 9:47 AM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: solr 3.5 and indexing performance
> >>>
> >>> Hi,
> >>>
> >>> Have you confirmed that disabling Hunspell in solrconfig gets you back
> >>> to normal speed?
> >>> What Hunspell configuration and dictionaries do you have?
> >>> Can you share more about your environment and documents?
> >>> Do you have a chance to run a profiler on your Solr instance? Try e.g.
> >>> VisualVM and run the profiler to see what part of the code takes up
> >>> the time:
> >>> http://docs.oracle.com/javase/6/docs/technotes/tools/share/jvisualvm.html
> >>>
> >>> --
> >>> Jan Høydahl, search solution architect
> >>> Cominvent AS - www.cominvent.com
> >>> Solr Training - www.solrtraining.com
> >>>
> >>> On 12 March 2012, at 16:42, Agnieszka Kukałowicz wrote:
> >>>
> >>>> Hi guys,
> >>>>
> >>>> I have hit the same problem with Hunspell.
> >>>> Running a few tests on 500 000 documents, I got:
> >>>>
> >>>> Hunspell from http://code.google.com/p/lucene-hunspell/ with Solr 3.4 -
> >>>> 125 documents per second
> >>>> Hunspell built from 4.0 trunk - 11 documents per second.
> >>>>
> >>>> All the tests were run on an 8-core machine with 32 GB RAM and the
> >>>> index on SSD disks.
> >>>> For Solr 3.5 I tried changing the JVM heap size, ramBufferSizeMB and
> >>>> mergeFactor, but the indexing speed stayed at about 10-20 documents
> >>>> per second.
> >>>>
> >>>> Is it possible that there is a performance bug in Solr 4.0?
> >>>> According to the previous post, the problem also exists in version 3.5.
> >>>>
> >>>> Best regards
> >>>> Agnieszka Kukałowicz
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: mizayah [mailto:miza...@gmail.com]
> >>>>> Sent: Thursday, February 23, 2012 10:19 AM
> >>>>> To: solr-user@lucene.apache.org
> >>>>> Subject: Re: solr 3.5 and indexing performance
> >>>>>
> >>>>> OK, I found it.
> >>>>>
> >>>>> It's because of Hunspell, which is now included in Solr. Somehow, when
> >>>>> I was using it on my own in 3.4, it was a lot faster than the one
> >>>>> from 3.5.
> >>>>>
> >>>>> I don't know what the differences are, but is there any way I can use
> >>>>> my old Google Hunspell jar?
> >>>>>
> >>>>> --
> >>>>> View this message in context:
> >>>>> http://lucene.472066.n3.nabble.com/solr-3-5-and-indexing-performance-tp3766653p3769139.html
> >>>>> Sent from the Solr - User mailing list archive at Nabble.com.
