Re: Regarding improving performance of the solr
Hi

I tried to reindex Solr, and I ran into a regular-expression problem. The steps I followed were:

1. I started Solr with java -jar start.jar and cleared the index:

   http://localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete>
   http://localhost:8983/solr/update?stream.body=<commit/>

2. I stopped the Solr server.

3. I changed the indexed and stored attributes to false for some of the fields in schema.xml:

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="revision" type="sint" indexed="false" stored="false"/>
  <field name="user" type="string" indexed="false" stored="false"/>
  <field name="userId" type="int" indexed="false" stored="false"/>
  <field name="text" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="pagerank" type="text_general" indexed="true" stored="false"/>
  <field name="anchor_text" type="text_general" indexed="true" stored="false" multiValued="true" compressed="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="freebase" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="timestamp" type="date" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="titleText" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="category" type="string" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<copyField source="title" dest="titleText"/>

My data-config.xml:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="page"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/mediawiki/page/"
            url="/home/prabu/wikipedia_full_indexed_dump.xml"
            transformer="RegexTransformer,DateFormatTransformer,HTMLStripTransformer">
      <field column="id" xpath="/mediawiki/page/id" stripHTML="true"/>
      <field column="title" xpath="/mediawiki/page/title" stripHTML="true"/>
      <field column="category" xpath="/mediawiki/page/category" stripHTML="true"/>
      <field column="revision" xpath="/mediawiki/page/revision/id" stripHTML="true"/>
      <field column="user" xpath="/mediawiki/page/revision/contributor/username" stripHTML="true"/>
      <field column="userId" xpath="/mediawiki/page/revision/contributor/id" stripHTML="true"/>
      <field column="text" xpath="/mediawiki/page/revision/text" stripHTML="true"/>
      <field column="freebase" xpath="/mediawiki/page/freebase" stripHTML="true"/>
      <field column="pagerank" xpath="/mediawiki/page/pagerank" stripHTML="true"/>
      <field column="anchor_text" xpath="/mediawiki/page/anchor_text/" stripHTML="true"/>
      <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'"/>
      <field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
      <field column="category" regex="((\[\[.*Category:.*\]\]\W?)+)" sourceColName="text" stripHTML="true"/>
      <field column="$skipDoc" regex="^Template:.*" replaceWith="true" sourceColName="title"/>
    </entity>
  </document>
</dataConfig>

I then ran http://localhost:8983/solr/dataimport?command=full-import. At about 50,000 documents, I get an error related to regular expressions:

  at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
  at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
  at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
  at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
  at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
  at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
  at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
  at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
  at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
  at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
  at java.util.regex.Pattern$Branch.match(Pattern.java:4114)

I do not know how to proceed. Please help me out.
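The frames in this trace (Pattern$Loop and Pattern$GroupTail cycling) suggest catastrophic backtracking rather than an invalid pattern: the category rule ((\[\[.*Category:.*\]\]\W?)+) nests unbounded .* quantifiers inside a repeated group, so a long article that almost matches can take effectively forever. A hedged sketch of an alternative that avoids the nesting (the class and method names are illustrative, not a drop-in fix for this exact config):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryRegex {
    // The original pattern repeats a group that itself contains unbounded
    // ".*", which can backtrack exponentially on long wiki text. This
    // alternative forbids "]" inside the link body, so the matcher never
    // has competing ways to split the input.
    static final Pattern SAFER = Pattern.compile("\\[\\[Category:[^\\]]*\\]\\]");

    // Returns the first [[Category:...]] link found, or null if none.
    public static String firstCategory(String wikiText) {
        Matcher m = SAFER.matcher(wikiText);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "Some article text [[Category:Search engines]] more text";
        System.out.println(firstCategory(text));
    }
}
```

To collect every category rather than one combined blob, the same pattern can be applied in a find() loop; each match is then a single well-delimited link instead of one greedy span over the whole page.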
Thanks and Regards
Prabu

On Wed, Sep 11, 2013 at 11:31 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> Be a little careful when extrapolating from disk to memory. [...]
Re: Regarding improving performance of the solr
Hi Prabu,

It's difficult to tell what's going wrong without the full exception stack trace, including what the exception is. If you can provide the specific input that triggers the exception, that might also help.

Steve

On Sep 12, 2013, at 4:14 AM, prabu palanisamy <pr...@serendio.com> wrote:
> Hi I tried to reindex the solr. I get the regular expression problem. [...]
Re: Regarding improving performance of the solr
@Shawn: Correct, I am trying to reduce the index size. I am working on reindexing Solr with some of the fields set to indexed but not stored.

@Jean: I tried with different caches. It did not show much improvement.

On Fri, Sep 6, 2013 at 3:17 PM, Shawn Heisey <s...@elyograg.org> wrote:
> If the size of all your Solr indexes on disk is in the 50GB range of your wikipedia dump, then for ideal performance, you'll want to have 50GB of free memory so the OS can cache your index. [...]
Re: Regarding improving performance of the solr
Be a little careful when extrapolating from disk to memory. Any fields where you've set stored="true" will put data in segment files with the extensions .fdt and .fdx (see the file-extension summary linked below). These are the compressed verbatim copy of the data for stored fields and have very little impact on the memory required for searching. I've seen indexes where 75% of the data is stored and indexes where 5% of the data is stored.

Summary of File Extensions here:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html

Best,
Erick

On Wed, Sep 11, 2013 at 2:57 AM, prabu palanisamy <pr...@serendio.com> wrote:
> @Shawn: Correctly I am trying to reduce the index size. I am working on reindex the solr with some of the features as indexed and not stored [...]
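Erick's point can be checked directly on disk before deciding how much RAM the index really needs. A rough sketch that sums stored-field files against the whole index directory (the class name and default path are placeholders; extensions follow the Lucene file-format docs he links):

```java
import java.io.File;

public class StoredFieldFootprint {
    // Sums the bytes in stored-field files (.fdt data, .fdx index) versus the
    // whole index directory, to estimate how much of the on-disk size is
    // stored data that barely affects search-time memory needs.
    public static long[] sizes(File indexDir) {
        long stored = 0, total = 0;
        File[] files = indexDir.listFiles();
        if (files != null) {
            for (File f : files) {
                if (!f.isFile()) continue;
                total += f.length();
                String name = f.getName();
                if (name.endsWith(".fdt") || name.endsWith(".fdx")) {
                    stored += f.length();
                }
            }
        }
        return new long[] { stored, total };
    }

    public static void main(String[] args) {
        // Path is a placeholder; point it at your core's data/index directory.
        long[] s = sizes(new File(args.length > 0 ? args[0] : "data/index"));
        System.out.println("stored-field bytes: " + s[0] + " of " + s[1] + " total");
    }
}
```

If the .fdt/.fdx share turns out to be large, the index's memory footprint for searching is much smaller than its disk size suggests, which changes the RAM sizing discussed elsewhere in this thread.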
RE: Regarding improving performance of the solr
Have you checked the hit ratio of the different caches? Try to tune them to get rid of all evictions if possible. Tuning the size of the caches and warming your searcher can give you a pretty good improvement. You might want to check your analysis chain as well, to see if you're not doing anything that is unnecessary.

-----Original Message-----
From: prabu palanisamy [mailto:pr...@serendio.com]
Sent: September-06-13 4:55 AM
To: solr-user@lucene.apache.org
Subject: Regarding improving performance of the solr

Hi

I am currently using Solr 3.5.0, with a Wikipedia dump (50 GB) indexed under Java 1.6. I am searching Solr with text (which is actually Twitter tweets). Currently it takes an average of 210 milliseconds for each post, of which 200 milliseconds are consumed by the Solr server (QTime). I used the jconsole monitoring tool. The stats are:

Heap usage: 10-50 MB
No of threads: 10-20
No of classes: 3800
CPU usage: 10-15%

Currently I am loading all the fields of the Wikipedia dump. I only need the freebase category and wikipedia category. I want to know how to optimize the Solr server to improve the performance. Could you please help me out in optimizing the performance?

Thanks and Regards
Prabu
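The cache sizes and warming mentioned above are configured in solrconfig.xml. A sketch of the relevant section (all size, initialSize, and autowarmCount values are illustrative assumptions to tune against your observed hit ratios, not recommendations):

```xml
<!-- solrconfig.xml fragment; numeric values are placeholders -->
<query>
  <!-- caches filter queries (fq); evictions here usually hurt the most -->
  <filterCache class="solr.FastLRUCache" size="4096" initialSize="1024" autowarmCount="256"/>
  <!-- caches ordered result windows for repeated queries -->
  <queryResultCache class="solr.LRUCache" size="2048" initialSize="512" autowarmCount="128"/>
  <!-- caches stored fields for documents being returned; cannot autowarm -->
  <documentCache class="solr.LRUCache" size="4096" initialSize="1024" autowarmCount="0"/>
  <!-- warm each new searcher with a representative query before it serves traffic -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">*:*</str><str name="rows">10</str></lst>
    </arr>
  </listener>
</query>
```

Watching the cache statistics on the Solr admin page after a change shows whether evictions actually dropped; growing a cache that already has a high hit ratio mostly wastes heap.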
Re: Regarding improving performance of the solr
On 9/6/2013 2:54 AM, prabu palanisamy wrote:
> I am currently using solr -3.5.0, indexed wikipedia dump (50 gb) with java 1.6. I am searching the solr with text (which is actually twitter tweets). Currently it takes average time of 210 millisecond for each post, out of which 200 millisecond is consumed by solr server (QTime). I used the jconsole monitor tool.

If the size of all your Solr indexes on disk is in the 50GB range of your wikipedia dump, then for ideal performance, you'll want to have 50GB of free memory so the OS can cache your index. You might be able to get by with 25-30GB of free memory, depending on your index composition.

Note that this is memory over and above what you allocate to the Solr JVM, and memory used by other processes on the machine. If you do have other services on the same machine, note that those programs might ALSO require OS disk cache RAM.

http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache

Thanks,
Shawn