Re: Regarding improving performance of the solr

2013-09-12 Thread prabu palanisamy
Hi

I tried to reindex Solr and ran into a regular-expression problem. The
steps I followed are:

I started Solr with java -jar start.jar
http://localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete>
http://localhost:8983/solr/update?stream.body=<commit/>
I stopped the Solr server
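
The wipe-and-commit steps above can be sketched with curl. The Solr URL is the one from the message; URL-encoding the stream.body payload is my assumption about how to pass it safely from a shell:

```shell
# Sketch of the delete-all + commit steps, assuming Solr 3.x at localhost:8983.
# The XML bodies are URL-encoded so the shell and HTTP layer pass them intact.
SOLR="http://localhost:8983/solr"
DELETE_ALL='%3Cdelete%3E%3Cquery%3E*:*%3C%2Fquery%3E%3C%2Fdelete%3E'
COMMIT='%3Ccommit%2F%3E'
# Run these against a live instance:
# curl "$SOLR/update?stream.body=$DELETE_ALL"
# curl "$SOLR/update?stream.body=$COMMIT"
# Print the full delete URL for inspection:
echo "$SOLR/update?stream.body=$DELETE_ALL"
```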

I changed the indexed and stored attributes to false for some of the
fields in schema.xml:
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="string" indexed="true" stored="true"
         multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="revision" type="sint" indexed="false" stored="false"/>
  <field name="user" type="string" indexed="false" stored="false"/>
  <field name="userId" type="int" indexed="false" stored="false"/>
  <field name="text" type="text_general" indexed="true" stored="true"
         multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="pagerank" type="text_general" indexed="true" stored="false"/>
  <field name="anchor_text" type="text_general" indexed="true" stored="false"
         multiValued="true" compressed="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="freebase" type="text_general" indexed="true" stored="true"
         multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="timestamp" type="date" indexed="true" stored="true"
         multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="titleText" type="text_general" indexed="true" stored="true"
         multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="category" type="string" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<copyField source="title" dest="titleText"/>

My data-config.xml:
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="page"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/mediawiki/page/"
            url="/home/prabu/wikipedia_full_indexed_dump.xml"
            transformer="RegexTransformer,DateFormatTransformer,HTMLStripTransformer">
      <field column="id" xpath="/mediawiki/page/id" stripHTML="true"/>
      <field column="title" xpath="/mediawiki/page/title" stripHTML="true"/>
      <field column="category" xpath="/mediawiki/page/category" stripHTML="true"/>
      <field column="revision" xpath="/mediawiki/page/revision/id" stripHTML="true"/>
      <field column="user" xpath="/mediawiki/page/revision/contributor/username" stripHTML="true"/>
      <field column="userId" xpath="/mediawiki/page/revision/contributor/id" stripHTML="true"/>
      <field column="text" xpath="/mediawiki/page/revision/text" stripHTML="true"/>
      <field column="freebase" xpath="/mediawiki/page/freebase" stripHTML="true"/>
      <field column="pagerank" xpath="/mediawiki/page/pagerank" stripHTML="true"/>
      <field column="anchor_text" xpath="/mediawiki/page/anchor_text" stripHTML="true"/>
      <field column="timestamp" xpath="/mediawiki/page/revision/timestamp"
             dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
      <field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
      <field column="category" regex="((\[\[.*Category:.*\]\]\W?)+)" sourceColName="text" stripHTML="true"/>
      <field column="$skipDoc" regex="^Template:.*" replaceWith="true" sourceColName="title"/>
    </entity>
  </document>
</dataConfig>

I tried http://localhost:8983/solr/dataimport?command=full-import. At
around 50,000 documents, I get an error related to regular expressions:

at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
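
The repeating Loop/GroupTail/Branch frames in that trace are typical of a regex backtracking blow-up: the category pattern ((\[\[.*Category:.*\]\]\W?)+) nests greedy .* inside a repeated group, so a long article with many brackets can take a very long time or overflow the stack. As a sketch (class name and sample text are made up), an equivalent pattern using negated character classes keeps backtracking bounded inside each [[...]] link, at the cost of not matching category links that contain a literal "]":

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryRegex {
    // [^\]]* replaces the nested greedy .*, so a failed match attempt cannot
    // re-try across "]]" boundaries the way the original pattern does.
    static final Pattern SAFE =
        Pattern.compile("((\\[\\[[^\\]]*Category:[^\\]]*\\]\\]\\W?)+)");

    public static void main(String[] args) {
        String text = "Intro text [[Category:Search engines]] [[Category:Java]]";
        Matcher m = SAFE.matcher(text);
        if (m.find()) {
            // Group 1 is the whole run of adjacent category links.
            System.out.println(m.group(1));
        }
    }
}
```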

I do not know how to proceed. Please help me out.

Thanks and Regards
Prabu


On Wed, Sep 11, 2013 at 11:31 AM, Erick Erickson erickerick...@gmail.com wrote:


Re: Regarding improving performance of the solr

2013-09-12 Thread Steve Rowe
Hi Prabu,

It's difficult to tell what's going wrong without the full exception stack 
trace, including what the exception is.

If you can provide the specific input that triggers the exception, that might 
also help.

Steve

On Sep 12, 2013, at 4:14 AM, prabu palanisamy pr...@serendio.com wrote:


Re: Regarding improving performance of the solr

2013-09-11 Thread prabu palanisamy
@Shawn: Correct, I am trying to reduce the index size. I am working on
reindexing Solr with some of the fields indexed but not stored.

@Jean: I tried different caches. It did not show much improvement.


On Fri, Sep 6, 2013 at 3:17 PM, Shawn Heisey s...@elyograg.org wrote:


Re: Regarding improving performance of the solr

2013-09-11 Thread Erick Erickson
Be a little careful when extrapolating from disk to memory.
Any fields where you've set stored=true will put data in
segment files with extensions .fdt and .fdx (see the file
extension summary linked below). These are the compressed
verbatim copy of the data for stored fields and have very
little impact on the memory required for searching. I've
seen indexes where 75% of the data is stored and indexes
where 5% of the data is stored.

Summary of File Extensions here:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html

Best,
Erick


On Wed, Sep 11, 2013 at 2:57 AM, prabu palanisamy pr...@serendio.com wrote:


RE: Regarding improving performance of the solr

2013-09-06 Thread Jean-Sebastien Vachon
Have you checked the hit ratios of the different caches? Try to tune them to get
rid of all evictions if possible.

Tuning the cache sizes and warming your searcher can give you a pretty
good improvement. You might also want to check your analysis chain to make
sure you're not doing anything unnecessary.
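
The caches in question are configured in solrconfig.xml. A hypothetical sketch (the sizes are illustrative only, not recommendations; tune them against the hit/eviction stats on the admin page):

```xml
<!-- solrconfig.xml: illustrative cache settings. Raise size until evictions
     stop; autowarmCount controls how many entries are copied into a newly
     opened searcher. -->
<filterCache      class="solr.FastLRUCache" size="4096" initialSize="1024" autowarmCount="256"/>
<queryResultCache class="solr.LRUCache"     size="1024" initialSize="256"  autowarmCount="128"/>
<documentCache    class="solr.LRUCache"     size="4096" initialSize="1024" autowarmCount="0"/>
```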



 -Original Message-
 From: prabu palanisamy [mailto:pr...@serendio.com]
 Sent: September-06-13 4:55 AM
 To: solr-user@lucene.apache.org
 Subject: Regarding improving performance of the solr
 
  Hi
 
 I am currently using Solr 3.5.0 with Java 1.6, and have indexed a
 Wikipedia dump (50 GB).
 I am searching Solr with text (which is actually Twitter tweets).
 Currently it takes an average of 210 milliseconds per post, of which
 200 milliseconds are consumed by the Solr server (QTime). I used the
 jconsole monitoring tool.
 
 The stats are:
    Heap usage: 10-50 MB
    Number of threads: 10-20
    Number of classes: 3800
    CPU usage: 10-15%
 
 Currently I am loading all the fields of the Wikipedia dump.

 I only need the freebase category and the Wikipedia category. I want to
 know how to optimize the Solr server to improve performance.

 Could you please help me optimize the performance?
 
 Thanks and Regards
 Prabu
 


Re: Regarding improving performance of the solr

2013-09-06 Thread Shawn Heisey
On 9/6/2013 2:54 AM, prabu palanisamy wrote:
 I am currently using Solr 3.5.0 with Java 1.6, and have indexed a
 Wikipedia dump (50 GB).
 I am searching Solr with text (which is actually Twitter tweets).
 Currently it takes an average of 210 milliseconds per post, of which
 200 milliseconds are consumed by the Solr server (QTime). I used the
 jconsole monitoring tool.

If the size of all your Solr indexes on disk is in the 50GB range of
your wikipedia dump, then for ideal performance, you'll want to have
50GB of free memory so the OS can cache your index.  You might be able
to get by with 25-30GB of free memory, depending on your index composition.

Note that this is memory over and above what you allocate to the Solr
JVM, and memory used by other processes on the machine.  If you do have
other services on the same machine, note that those programs might ALSO
require OS disk cache RAM.

http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache
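
The sizing rule above can be eyeballed with a quick shell sketch; the index path below is a placeholder, substitute your own Solr data directory:

```shell
# Compare on-disk index size against memory available for the OS page cache.
# INDEX_DIR is an assumed placeholder path; point it at your index directory.
INDEX_DIR="${INDEX_DIR:-./solr/data/index}"
du -sh "$INDEX_DIR" 2>/dev/null || echo "set INDEX_DIR to your index directory"
# On Linux, 'free' shows free memory plus reclaimable cache; fall back to
# vm_stat elsewhere.
free -h 2>/dev/null || vm_stat 2>/dev/null || true
```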

Thanks,
Shawn