Hi,

I tried to reindex Solr and ran into a regular expression problem. The steps I followed are:

I started Solr with java -jar start.jar

I deleted the existing index and committed:

http://localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete>
http://localhost:8983/solr/update?stream.body=<commit/>

I stopped the Solr server.

I changed indexed and stored to false for some of the fields in schema.xml:

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="revision" type="sint" indexed="false" stored="false"/>
  <field name="user" type="string" indexed="false" stored="false"/>
  <field name="userId" type="int" indexed="false" stored="false"/>
  <field name="text" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="pagerank" type="text_general" indexed="true" stored="false"/>
  <field name="anchor_text" type="text_general" indexed="true" stored="false" multiValued="true" compressed="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="freebase" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="timestamp" type="date" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="titleText" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="category" type="string" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<copyField source="title" dest="titleText"/>

My data-config.xml:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="page"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/mediawiki/page/"
            url="/home/prabu/wikipedia_full_indexed_dump.xml"
            transformer="RegexTransformer,DateFormatTransformer,HTMLStripTransformer" >
      <field column="id" xpath="/mediawiki/page/id" stripHTML="true"/>
      <field column="title" xpath="/mediawiki/page/title" stripHTML="true"/>
      <field column="category" xpath="/mediawiki/page/category" stripHTML="true"/>
      <field column="revision" xpath="/mediawiki/page/revision/id" stripHTML="true"/>
      <field column="user" xpath="/mediawiki/page/revision/contributor/username" stripHTML="true"/>
      <field column="userId" xpath="/mediawiki/page/revision/contributor/id" stripHTML="true"/>
      <field column="text" xpath="/mediawiki/page/revision/text" stripHTML="true"/>
      <field column="freebase" xpath="/mediawiki/page/freebase" stripHTML="true"/>
      <field column="pagerank" xpath="/mediawiki/page/pagerank" stripHTML="true"/>
      <field column="anchor_text" xpath="/mediawiki/page/anchor_text/" stripHTML="true"/>
      <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
      <field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
      <field column="category" regex="((\[\[.*Category:.*\]\]\W?)+)" sourceColName="text" stripHTML="true"/>
      <field column="$skipDoc" regex="^Template:.*" replaceWith="true" sourceColName="title"/>
    </entity>
  </document>
</dataConfig>

I then ran http://localhost:8983/solr/dataimport?command=full-import. At 50,000 documents, I get an error related to a regular expression.

    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)

I do not know how to proceed. Please help me out.

Thanks and Regards
Prabu

On Wed, Sep 11, 2013 at 11:31 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Be a little careful when extrapolating from disk to memory.
> Any fields where you've set stored="true" will put data in
> segment files with extensions .fdt and .fdx, see
> These are the compressed verbatim copy of the data
> for stored fields and have very little impact on
> memory required for searching. I've seen indexes where
> 75% of the data is stored and indexes where 5% of the
> data is stored.....
>
> "Summary of File Extensions" here:
>
> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html
>
> Best,
> Erick
>
>
> On Wed, Sep 11, 2013 at 2:57 AM, prabu palanisamy <pr...@serendio.com> wrote:
>
> > @Shawn: Correctly I am trying to reduce the index size. I am working on
> > reindex the solr with some of the features as indexed and not stored
> >
> > @Jean: I tried with different caches. It did not show much improvement.
> >
> >
> > On Fri, Sep 6, 2013 at 3:17 PM, Shawn Heisey <s...@elyograg.org> wrote:
> >
> > > On 9/6/2013 2:54 AM, prabu palanisamy wrote:
> > > > I am currently using solr -3.5.0, indexed wikipedia dump (50 gb) with
> > > > java 1.6.
> > > > I am searching the solr with text (which is actually twitter tweets).
> > > > Currently it takes average time of 210 millisecond for each post, out of
> > > > which 200 millisecond is consumed by solr server (QTime). I used the
> > > > jconsole monitor tool.
> > >
> > > If the size of all your Solr indexes on disk is in the 50GB range of
> > > your wikipedia dump, then for ideal performance, you'll want to have
> > > 50GB of free memory so the OS can cache your index. You might be able
> > > to get by with 25-30GB of free memory, depending on your index
> > > composition.
> > >
> > > Note that this is memory over and above what you allocate to the Solr
> > > JVM, and memory used by other processes on the machine. If you do have
> > > other services on the same machine, note that those programs might ALSO
> > > require OS disk cache RAM.
> > >
> > > http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache
> > >
> > > Thanks,
> > > Shawn
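
For reference, a minimal sketch of how category links could be pulled out of page text with a flat pattern applied repeatedly, rather than with the repeated group ((\[\[.*Category:.*\]\]\W?)+) from the data-config.xml above. The alternating Pattern$Loop and Pattern$GroupTail frames in the trace are consistent with deep recursion inside java.util.regex on a repeated group, although the exception type itself is not shown. Everything named here is an assumption: the class CategoryLinks and the method extract are hypothetical, the pattern guesses at how the links appear in the text field, and this is standalone Java, not a RegexTransformer setting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Solr or the DataImportHandler.
final class CategoryLinks {

    // Flat pattern: matches one [[...Category:...]] link at a time.
    // [^\]\n]* stops at the first ']' or newline, so a single match
    // cannot span several links, and there is no repeated group.
    private static final Pattern LINK =
            Pattern.compile("\\[\\[[^\\]\\n]*Category:[^\\]\\n]*\\]\\]");

    // Collects every category link in the text by calling find() in a loop.
    static List<String> extract(String text) {
        List<String> links = new ArrayList<String>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // Assumed sample input: wiki text with one category link per line.
        String text = "Some article text.\n[[Category:Physics]]\n[[Category:Science]]\n";
        System.out.println(extract(text));
    }
}
```

The loop over find() keeps each match short, so the amount of work per match does not grow with the number of category links on a page. Whether the same pattern gives the wanted result inside RegexTransformer would need to be checked separately.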