Hi,

Thanks a lot for your detailed problem description. It definitely is an error. Would you be so kind as to register it as a bug ticket, including your descriptions from this email? See http://wiki.apache.org/solr/HowToContribute#JIRA_tips_.28our_issue.2BAC8-bug_tracker.29

Please also attach your Polish Hunspell dictionaries to the issue. Then we'll try to reproduce the error.

I wonder if this performance decrease is also seen for English dictionaries?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 13 March 2012, at 16:42, Agnieszka Kukałowicz wrote:

> Hi,
>
> I did some more tests for Hunspell in Solr 3.4 and 4.0:
>
> Solr 3.4, full import of 489017 documents:
>
> StempelPolishStemFilterFactory - 2908 seconds, 168 docs/sec
> HunspellStemFilterFactory - 3922 seconds, 125 docs/sec
>
> Solr 4.0, full import of 489017 documents:
>
> StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec
> HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec
>
> Server specification and Java settings are the same as before.
>
> Cheers
> Agnieszka
>
>
>> -----Original Message-----
>> From: Agnieszka Kukałowicz [mailto:agnieszka.kukalow...@usable.pl]
>> Sent: Tuesday, March 13, 2012 10:39 AM
>> To: 'solr-user@lucene.apache.org'
>> Subject: RE: solr 3.5 and indexing performance
>>
>> Hi,
>>
>> Yes, I confirmed that without Hunspell indexing has normal speed.
>> I did tests in Solr 4.0 with Hunspell and PolishStemmer.
>> With StempelPolishStemFilterFactory the speed is normal.
>>
>> My schema is quite simple.
>> For Hunspell I have one text field that I copy 14 text fields to:
>>
>> <field name="text" type="text_pl_hunspell" indexed="true"
>>        stored="false" multiValued="true"/>
>>
>> <copyField source="field1" dest="text"/>
>> <copyField source="field2" dest="text"/>
>> <copyField source="field3" dest="text"/>
>> <copyField source="field4" dest="text"/>
>> <copyField source="field5" dest="text"/>
>> <copyField source="field6" dest="text"/>
>> <copyField source="field7" dest="text"/>
>> <copyField source="field8" dest="text"/>
>> <copyField source="field9" dest="text"/>
>> <copyField source="field10" dest="text"/>
>> <copyField source="field11" dest="text"/>
>> <copyField source="field12" dest="text"/>
>> <copyField source="field13" dest="text"/>
>> <copyField source="field14" dest="text"/>
>>
>> The "text_pl_hunspell" configuration:
>>
>> <fieldType name="text_pl_hunspell" class="solr.TextField"
>>            positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory"
>>             ignoreCase="true"
>>             words="dict/stopwords_pl.txt"
>>             enablePositionIncrements="true"
>>     />
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.HunspellStemFilterFactory"
>>             dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
>>     <!--filter class="solr.KeywordMarkerFilterFactory"
>>             protected="protwords_pl.txt"/-->
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory"
>>             synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory"
>>             ignoreCase="true"
>>             words="dict/stopwords_pl.txt"
>>             enablePositionIncrements="true"
>>     />
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.HunspellStemFilterFactory"
>>             dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
>>     <filter class="solr.KeywordMarkerFilterFactory"
>>             protected="dict/protwords_pl.txt"/>
>>   </analyzer>
>> </fieldType>
>>
>> I use the Polish dictionary files pl_PL.dic and pl_PL.aff (the files
>> stopwords_pl.txt, protwords_pl.txt and synonyms_pl.txt are empty).
>> These are the same files I used in version 3.4.
>>
>> For the Polish stemmer the only difference is in the definition of the
>> text field:
>>
>> <field name="text" type="text_pl" indexed="true" stored="false"
>>        multiValued="true"/>
>>
>> <fieldType name="text_pl" class="solr.TextField"
>>            positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory"
>>             ignoreCase="true"
>>             words="dict/stopwords_pl.txt"
>>             enablePositionIncrements="true"
>>     />
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.StempelPolishStemFilterFactory"/>
>>     <filter class="solr.KeywordMarkerFilterFactory"
>>             protected="dict/protwords_pl.txt"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory"
>>             synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory"
>>             ignoreCase="true"
>>             words="dict/stopwords_pl.txt"
>>             enablePositionIncrements="true"
>>     />
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.StempelPolishStemFilterFactory"/>
>>     <filter class="solr.KeywordMarkerFilterFactory"
>>             protected="dict/protwords_pl.txt"/>
>>   </analyzer>
>> </fieldType>
>>
>> One document has 23 fields:
>> - 14 text fields copied to one text field (above) that is only indexed
>> - 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat)
>> The size of one document is 3-4 kB.
>> So I think this is not a very complicated schema.
>>
>> My environment is:
>> - Linux, RedHat 6.2, kernel 2.6.32
>> - 2 physical Xeon 5606 CPUs (4 cores each)
>> - 32 GB RAM
>> - 2 SSD disks in RAID 0
>> - Java version:
>>
>> java -version
>> java version "1.6.0_26"
>> Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
>> Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
>>
>> - Java is running with -server -Xms4096M -Xmx4096M (I tried a lot of
>> other settings and always got the same effect)
>> - Solr has the default configuration except for the RAM buffer size (128 MB)
>> - Solr 4.0 from nightly builds (2012-02-21 build).
>>
>> If you need more information, please let me know.
>> I will also try to use a profiler to see what happens.
>>
>> Agnieszka
>>
>>
>>> -----Original Message-----
>>> From: Jan Høydahl [mailto:jan....@cominvent.com]
>>> Sent: Tuesday, March 13, 2012 9:47 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: solr 3.5 and indexing performance
>>>
>>> Hi,
>>>
>>> Have you confirmed that disabling Hunspell in solrconfig gets you back
>>> to normal speed?
>>> What Hunspell configuration and dictionaries do you have?
>>> Can you share more about your environment and documents?
>>> Do you have a chance to run a profiler on your Solr instance? Try e.g.
>>> VisualVM and run the profiler to see what part of the code takes up
>>> the time:
>>> http://docs.oracle.com/javase/6/docs/technotes/tools/share/jvisualvm.html
>>>
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> Solr Training - www.solrtraining.com
>>>
>>> On 12 March 2012, at 16:42, Agnieszka Kukałowicz wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I have hit the same problem with Hunspell.
>>>> Doing a few tests for 500 000 documents, I've got:
>>>>
>>>> Hunspell from http://code.google.com/p/lucene-hunspell/ with version 3.4 -
>>>> 125 documents per second
>>>> Hunspell built from 4.0 trunk - 11 documents per second.
>>>>
>>>> All the tests were made on a machine with 8 CPU cores, 32 GB RAM and
>>>> the index on SSD disks.
>>>> For Solr 3.5 I've tried changing the JVM heap size, rambuffersize and
>>>> mergefactor, but the indexing speed stayed at about 10-20 documents per
>>>> second.
>>>>
>>>> Is it possible that there is some performance bug in Solr 4.0?
>>>> According to the previous post the problem exists in version 3.5.
>>>>
>>>> Best regards
>>>> Agnieszka Kukałowicz
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: mizayah [mailto:miza...@gmail.com]
>>>>> Sent: Thursday, February 23, 2012 10:19 AM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: solr 3.5 and indexing performance
>>>>>
>>>>> OK, I found it.
>>>>>
>>>>> It's because of Hunspell, which is now in Solr. Somehow, when I'm using
>>>>> it by myself in 3.4, it is a lot faster than the one from 3.5.
>>>>>
>>>>> I don't know about the differences, but is there any way I can use my
>>>>> old Google Hunspell jar?
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://lucene.472066.n3.nabble.com/solr-3-5-and-indexing-performance-tp3766653p3769139.html
>>>>> Sent from the Solr - User mailing list archive at Nabble.com.
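
For anyone writing up the ticket: the docs/sec figures quoted in this thread follow directly from the reported import times, so they are easy to recompute and compare. Below is a minimal sketch in Python that uses only the numbers reported above. The names `runs`, `docs_per_second` and `slowdown` are illustrative and not part of Solr.

```python
# Import times as reported in the thread, keyed by (Solr version, stem filter).
# Each run was a full import of 489017 documents.
DOCS = 489_017

runs = {
    ("3.4", "StempelPolishStemFilterFactory"): 2908,
    ("3.4", "HunspellStemFilterFactory"): 3922,
    ("4.0", "StempelPolishStemFilterFactory"): 3016,
    ("4.0", "HunspellStemFilterFactory"): 44580,
}


def docs_per_second(seconds, docs=DOCS):
    """Average indexing throughput for one full import."""
    return docs / seconds


def slowdown(filter_name, old="3.4", new="4.0"):
    """How many times longer the import took on `new` than on `old`."""
    return runs[(new, filter_name)] / runs[(old, filter_name)]


if __name__ == "__main__":
    for (version, name), seconds in runs.items():
        rate = docs_per_second(seconds)
        print(f"Solr {version} {name}: {rate:.0f} docs/sec")

    for name in ("StempelPolishStemFilterFactory", "HunspellStemFilterFactory"):
        print(f"{name}: {slowdown(name):.2f}x import time on 4.0 vs 3.4")
```

Run as a script, this reproduces the 168, 125, 162 and 11 docs/sec figures. It also shows that the Stempel import time is nearly unchanged between 3.4 and 4.0 (about 1.04x), while the Hunspell import takes about 11.4 times longer.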