Hi,

Yes, I confirmed that without Hunspell, indexing runs at normal speed. I ran the tests in Solr 4.0 with both Hunspell and the Polish stemmer. With StempelPolishStemFilterFactory the speed is normal.
My schema is quite simple. For Hunspell I have one text field that 14 text fields are copied to:

  <field name="text" type="text_pl_hunspell" indexed="true" stored="false" multiValued="true"/>

  <copyField source="field1" dest="text"/>
  <copyField source="field2" dest="text"/>
  <copyField source="field3" dest="text"/>
  <copyField source="field4" dest="text"/>
  <copyField source="field5" dest="text"/>
  <copyField source="field6" dest="text"/>
  <copyField source="field7" dest="text"/>
  <copyField source="field8" dest="text"/>
  <copyField source="field9" dest="text"/>
  <copyField source="field10" dest="text"/>
  <copyField source="field11" dest="text"/>
  <copyField source="field12" dest="text"/>
  <copyField source="field13" dest="text"/>
  <copyField source="field14" dest="text"/>

The "text_pl_hunspell" configuration:

  <fieldType name="text_pl_hunspell" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="dict/stopwords_pl.txt" enablePositionIncrements="true" />
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
      <!--filter class="solr.KeywordMarkerFilterFactory" protected="protwords_pl.txt"/-->
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="dict/stopwords_pl.txt" enablePositionIncrements="true" />
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
    </analyzer>
  </fieldType>

I use the Polish dictionary files pl_PL.dic and pl_PL.aff (the files stopwords_pl.txt, protwords_pl.txt and synonyms_pl.txt are empty). These are the same files I used with version 3.4.

For the Polish stemmer, the only difference is in the definition of the text field:

  <field name="text" type="text_pl" indexed="true" stored="false" multiValued="true"/>

  <fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="dict/stopwords_pl.txt" enablePositionIncrements="true" />
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StempelPolishStemFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="dict/stopwords_pl.txt" enablePositionIncrements="true" />
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StempelPolishStemFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
    </analyzer>
  </fieldType>

One document has 23 fields:
- 14 text fields copied to one text field (above) that is only indexed
- 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat)

The size of one document is 3-4 kB. So I think this is not a very complicated schema.

My environment is:
- Linux, RedHat 6.2, kernel 2.6.32
- 2 physical CPUs, Xeon 5606 (4 cores each)
- 32 GB RAM
- 2 SSD disks in RAID 0
- java version:
    java -version
    java version "1.6.0_26"
    Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
    Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
- java is running with -server -Xms4096M -Xmx4096M (I tried many other settings and always got the same effect)
- solr has the default configuration except for the RAM buffer size (128 MB)
- solr 4.0 from nightly builds (2012-02-21 build).

If you need more information, please let me know. I will also try to use a profiler to see what happens.

Agnieszka

> -----Original Message-----
> From: Jan Høydahl [mailto:jan....@cominvent.com]
> Sent: Tuesday, March 13, 2012 9:47 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr 3.5 and indexing performance
>
> Hi,
>
> Have you confirmed that disabling Hunspell in solrconfig gets you back
> to normal speed?
> What Hunspell configuration and dictionaries do you have?
> Can you share more about your environment and documents?
> Do you have a chance to run a profiler on your Solr instance? Try i.e.
> VisualVM and run the profiler to see what part of the code takes up the
> time
> http://docs.oracle.com/javase/6/docs/technotes/tools/share/jvisualvm.html
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 12. mars 2012, at 16:42, Agnieszka Kukałowicz wrote:
>
> > Hi guys,
> >
> > I have hit the same problem with Hunspell.
> > Doing a few tests for 500 000 documents, I've got:
> >
> > Hunspell from http://code.google.com/p/lucene-hunspell/ with 3.4
> > version - 125 documents per second
> > Build Hunspell from 4.0 trunk - 11 documents per second.
> >
> > All the tests were made on 8 core CPU with 32 GB RAM and index on SSD
> > disks.
> > For Solr 3.5 I've tried to change JVM heap size, rambuffersize,
> > mergefactor but the speed of indexing was about 10 -20 documents per
> > second.
> >
> > Is it possible that there is some performance bug with Solr 4.0?
> > According to previous post the problem exists in 3.5 version.
> >
> > Best regards
> > Agnieszka Kukałowicz
> >
> >
> >> -----Original Message-----
> >> From: mizayah [mailto:miza...@gmail.com]
> >> Sent: Thursday, February 23, 2012 10:19 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: solr 3.5 and indexing performance
> >>
> >> Ok i found it.
> >>
> >> Its becouse of Hunspell which now is in solr. Somehow when im using
> >> it by myself in 3.4 it is a lot of faster then one from 3.5.
> >>
> >> Dont know about differences, but is there any way i use my old Google
> >> Hunspell jar?
> >>
> >> --
> >> View this message in context:
> >> http://lucene.472066.n3.nabble.com/solr-3-5-and-indexing-performance-tp3766653p3769139.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
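
For scale, the two throughput figures quoted in this thread (125 and 11 documents per second over a 500 000 document test set) can be turned into total indexing times. The sketch below is only back-of-the-envelope arithmetic: it assumes a constant indexing rate for the whole run, and the function names are made up for this illustration, not part of Solr.

```python
# Rough estimate of total indexing time from the rates reported in the thread.
# Assumption: the rate stays constant for the whole run (real runs vary with
# merges, commits and GC pauses).

def indexing_time_seconds(num_docs, docs_per_second):
    """Return the seconds needed to index num_docs at a steady rate."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return num_docs / docs_per_second


def format_duration(seconds):
    """Format a duration as hours and whole minutes (leftover seconds dropped)."""
    hours, rem = divmod(int(round(seconds)), 3600)
    minutes = rem // 60
    return f"{hours}h {minutes:02d}m"


NUM_DOCS = 500_000  # size of the test set mentioned in the thread

# Documents per second, as reported in the thread.
RATES = {
    "lucene-hunspell with 3.4": 125.0,
    "Hunspell built from 4.0 trunk": 11.0,
}

for label, rate in RATES.items():
    total = indexing_time_seconds(NUM_DOCS, rate)
    print(f"{label}: {format_duration(total)}")

# Ratio between the two reported rates.
slowdown = RATES["lucene-hunspell with 3.4"] / RATES["Hunspell built from 4.0 trunk"]
print(f"slowdown factor: {slowdown:.1f}x")
```

Under that constant-rate assumption, the same test set goes from roughly an hour to roughly half a day of indexing, which is why the difference is hard to attribute to ordinary tuning noise.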