RE: solr 3.5 and indexing performance
Bug ticket created: https://issues.apache.org/jira/browse/SOLR-3245

I also ran the test you asked for with the English dictionary. The results are in the ticket.

Agnieszka
Re: solr 3.5 and indexing performance
Hi,

Have you confirmed that disabling Hunspell in solrconfig gets you back to normal speed? What Hunspell configuration and dictionaries do you have? Can you share more about your environment and documents?

Do you have a chance to run a profiler on your Solr instance? Try e.g. VisualVM and run the profiler to see which part of the code takes up the time: http://docs.oracle.com/javase/6/docs/technotes/tools/share/jvisualvm.html

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
RE: solr 3.5 and indexing performance
Hi,

Yes, I confirmed that without Hunspell indexing runs at normal speed. I did tests in Solr 4.0 with Hunspell and PolishStemmer. With StempelPolishStemFilterFactory the speed is normal.

My schema is quite simple. For Hunspell I have one text field that I copy 14 text fields to:

  <field name="text" type="text_pl_hunspell" indexed="true" stored="false" multiValued="true"/>

  <copyField source="field1" dest="text"/>
  <copyField source="field2" dest="text"/>
  <copyField source="field3" dest="text"/>
  <copyField source="field4" dest="text"/>
  <copyField source="field5" dest="text"/>
  <copyField source="field6" dest="text"/>
  <copyField source="field7" dest="text"/>
  <copyField source="field8" dest="text"/>
  <copyField source="field9" dest="text"/>
  <copyField source="field10" dest="text"/>
  <copyField source="field11" dest="text"/>
  <copyField source="field12" dest="text"/>
  <copyField source="field13" dest="text"/>
  <copyField source="field14" dest="text"/>

The text_pl_hunspell configuration:

  <fieldType name="text_pl_hunspell" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="dict/stopwords_pl.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
      <!--<filter class="solr.KeywordMarkerFilterFactory" protected="protwords_pl.txt"/>-->
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="dict/stopwords_pl.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
    </analyzer>
  </fieldType>

I use the Polish dictionary files pl_PL.dic and pl_PL.aff (the files stopwords_pl.txt, protwords_pl.txt and synonyms_pl.txt are empty). These are the same files I used with version 3.4.

For the Polish stemmer, the only difference is in the definition of the text field:

  <field name="text" type="text_pl" indexed="true" stored="false" multiValued="true"/>

  <fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="dict/stopwords_pl.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StempelPolishStemFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="dict/stopwords_pl.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StempelPolishStemFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
    </analyzer>
  </fieldType>

One document has 23 fields:
- 14 text fields copied to one text field (above) that is only indexed
- 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat)

The size of one document is 3-4 kB. So I think this is not a very complicated schema.

My environment is:
- Linux, RedHat 6.2, kernel 2.6.32
- 2 physical CPUs, Xeon 5606 (4 cores each)
- 32 GB RAM
- 2 SSD disks in RAID 0
- Java version (output of java -version):
    java version "1.6.0_26"
    Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
    Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
- Java is running with -server -Xms4096M -Xmx4096M (I tried a lot of other settings and always got the same effect)
- Solr has the default configuration except for the RAM buffer size (ramBufferSizeMB, 128 MB)
- Solr 4.0 from nightly builds (2012-02-21 build)

If you need more information, please let me know. I will also try to use a profiler to see what happens.

Agnieszka
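The two field types quoted in this message can be compared mechanically. The following is a minimal sketch, assuming an abbreviated schema fragment modelled on the configuration above (attributes such as words= and dictionary= are left out, and the wrapping types element and the helper name index_chain are only illustrative). It lists the index-time analysis chain of each field type and prints what differs.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated schema fragment modelled on the two field types
# in this message; only the index-time analyzers are included.
SCHEMA = """
<types>
  <fieldType name="text_pl_hunspell" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.HunspellStemFilterFactory"/>
      <!--<filter class="solr.KeywordMarkerFilterFactory"/>-->
    </analyzer>
  </fieldType>
  <fieldType name="text_pl" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StempelPolishStemFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory"/>
    </analyzer>
  </fieldType>
</types>
"""


def index_chain(root, type_name):
    """Return the ordered tokenizer/filter class names of the index analyzer."""
    for field_type in root.iter("fieldType"):
        if field_type.get("name") == type_name:
            analyzer = field_type.find("analyzer[@type='index']")
            # XML comments are dropped by the default parser, so a
            # commented-out filter does not appear in the chain.
            return [el.get("class") for el in analyzer
                    if el.tag in ("tokenizer", "filter")]
    raise KeyError(type_name)


root = ET.fromstring(SCHEMA)
hunspell = index_chain(root, "text_pl_hunspell")
stempel = index_chain(root, "text_pl")

only_hunspell = [c for c in hunspell if c not in stempel]
only_stempel = [c for c in stempel if c not in hunspell]
print("only in text_pl_hunspell:", only_hunspell)
print("only in text_pl:", only_stempel)
```

Run against the fragment above, the comparison shows that besides the stemmer itself, the index-time chain for text_pl also has an active KeywordMarkerFilterFactory, which is commented out in text_pl_hunspell.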
RE: solr 3.5 and indexing performance
Hi,

I did some more tests for Hunspell in Solr 3.4 and 4.0.

Solr 3.4, full import of 489017 documents:
  StempelPolishStemFilterFactory - 2908 seconds, 168 docs/sec
  HunspellStemFilterFactory - 3922 seconds, 125 docs/sec

Solr 4.0, full import of 489017 documents:
  StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec
  HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec

Server specification and Java settings are the same as before.

Cheers
Agnieszka
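The throughput figures reported in this message can be cross-checked from the import times alone. The sketch below uses only the numbers given in the message (489017 documents and the four durations in seconds); the variable and function names are illustrative.

```python
# Figures as reported in the message: (Solr version, stemmer) -> seconds
# for a full import of 489017 documents.
DOCS = 489017
SECONDS = {
    ("3.4", "Stempel"): 2908,
    ("3.4", "Hunspell"): 3922,
    ("4.0", "Stempel"): 3016,
    ("4.0", "Hunspell"): 44580,
}


def docs_per_sec(seconds):
    """Average indexing throughput for the full import."""
    return DOCS / seconds


rates = {key: docs_per_sec(sec) for key, sec in SECONDS.items()}
for (version, stemmer), rate in rates.items():
    print(f"Solr {version} {stemmer}: {rate:.0f} docs/sec")

# How much longer the same import takes on 4.0 than on 3.4.
hunspell_slowdown = SECONDS[("4.0", "Hunspell")] / SECONDS[("3.4", "Hunspell")]
stempel_slowdown = SECONDS[("4.0", "Stempel")] / SECONDS[("3.4", "Stempel")]
print(f"Hunspell, 4.0 vs 3.4: {hunspell_slowdown:.1f}x the import time")
print(f"Stempel, 4.0 vs 3.4: {stempel_slowdown:.2f}x the import time")
```

The computed rates round to the reported 168, 125, 162 and 11 docs/sec. The Hunspell import takes roughly 11 times as long on 4.0 as on 3.4, while the Stempel import time changes by only a few percent, which points at the Hunspell filter rather than at indexing in general.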
Re: solr 3.5 and indexing performance
Hi,

Thanks a lot for your detailed problem description. It definitely is an error. Would you be so kind as to register it as a bug ticket, including your descriptions from this email? http://wiki.apache.org/solr/HowToContribute#JIRA_tips_.28our_issue.2BAC8-bug_tracker.29

Also, please attach your Polish Hunspell dictionaries to the issue. Then we'll try to reproduce the error. I wonder if this performance decrease is also seen for English dictionaries?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
RE: solr 3.5 and indexing performance
Hi guys,

I have hit the same problem with Hunspell. Running a few tests on 500 000 documents, I got:

  Hunspell from http://code.google.com/p/lucene-hunspell/ with version 3.4 - 125 documents per second
  Hunspell built from 4.0 trunk - 11 documents per second

All the tests were run on an 8-core CPU with 32 GB RAM and the index on SSD disks. For Solr 3.5 I tried changing the JVM heap size, rambuffersize and mergefactor, but the indexing speed stayed at about 10-20 documents per second.

Is it possible that there is some performance bug in Solr 4.0? According to the previous post, the problem exists in version 3.5.

Best regards
Agnieszka Kukałowicz
Re: solr 3.5 and indexing performance
OK, I found it. It's because of Hunspell, which is now included in Solr. Somehow, when I use it on my own in 3.4, it is a lot faster than the one from 3.5. I don't know about the differences, but is there any way I can use my old Google Hunspell jar?

--
View this message in context: http://lucene.472066.n3.nabble.com/solr-3-5-and-indexing-performance-tp3766653p3769139.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr 3.5 and indexing performance
I wanted to switch to a new version of Solr, specifically 3.5, but I'm getting a big drop in indexing speed. Could it be the autoCommit configuration in solrconfig.xml?
Re: solr 3.5 and indexing performance
I have it all commented out in updateHandler, and I'm pretty sure there is no default autocommit:

  <updateHandler class="solr.DirectUpdateHandler2">

--
View this message in context: http://lucene.472066.n3.nabble.com/solr-3-5-and-indexing-performance-tp3766653p3767843.html
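Whether an autoCommit block is really inactive can be checked by parsing the config instead of reading it by eye. This is a minimal sketch, assuming a hypothetical solrconfig.xml fragment in which the block is commented out as described in the message above; the maxDocs and maxTime values and the helper name active_autocommit are made up for illustration. It only inspects the file and says nothing about defaults built into Solr.

```python
import xml.etree.ElementTree as ET

# Hypothetical solrconfig.xml fragment: the autoCommit block is commented out.
# The maxDocs/maxTime values are invented for this example.
SOLRCONFIG = """
<config>
  <updateHandler class="solr.DirectUpdateHandler2">
    <!--
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>60000</maxTime>
    </autoCommit>
    -->
  </updateHandler>
</config>
"""


def active_autocommit(config_xml):
    """Return the autoCommit settings as a dict, or None if no active block exists."""
    root = ET.fromstring(config_xml)
    # The default parser discards XML comments, so a commented-out
    # block is simply not part of the parsed tree.
    block = root.find("updateHandler/autoCommit")
    if block is None:
        return None
    return {child.tag: (child.text or "").strip() for child in block}


print(active_autocommit(SOLRCONFIG))
```

With the block commented out, the function returns None; if the comment markers are removed, it returns the configured limits instead.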