Hi Prabu,

It's difficult to tell what's going wrong without the full exception stack 
trace, including what the exception is.

If you can provide the specific input that triggers the exception, that might 
also help.

Steve

On Sep 12, 2013, at 4:14 AM, prabu palanisamy <pr...@serendio.com> wrote:

> Hi
> 
> I tried to reindex Solr and ran into a regular-expression error. The
> steps I followed are:
> 
> I started Solr with: java -jar start.jar
> http://localhost:8983/solr/update?stream.body=
> <delete><query>*:*</query></delete>
> http://localhost:8983/solr/update?stream.body=<commit/>
> I stopped the Solr server
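One thing worth checking in the steps above: the stream.body value must be well-formed XML, with closing </query> and </delete> tags, and should be URL-encoded. A minimal Python sketch (not from the thread; it assumes the default local Solr port) of building both requests:

```python
# Sketch: constructing the delete-all and commit update requests described
# above, with the XML well-formed and URL-encoded. Hypothetical helper, not
# part of the original steps.
from urllib.parse import urlencode

base = "http://localhost:8983/solr/update"

# Note the closing tags: </query> and </delete>.
delete_all = "<delete><query>*:*</query></delete>"
commit = "<commit/>"

delete_url = base + "?" + urlencode({"stream.body": delete_all})
commit_url = base + "?" + urlencode({"stream.body": commit})

print(delete_url)
print(commit_url)
```

Sending the encoded URLs (e.g. with curl or a browser) should behave the same as the unencoded ones, but avoids ambiguity with the `<` and `>` characters in the query string.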
> 
> I changed the indexed and stored attributes to false for some of the
> fields in schema.xml:
> <fields>
> <field name="id"        type="string"  indexed="true" stored="true"
> required="true"/>
> <field name="title"     type="string"  indexed="true" stored="true"
> multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true"/>
> <field name="revision"  type="sint"    indexed="false" stored="false"/>
> <field name="user"      type="string"  indexed="false" stored="false"/>
> <field name="userId"    type="int"     indexed="false" stored="false"/>
> <field name="text"      type="text_general" indexed="true" stored="true"
> multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true"/>
> <field name="pagerank"  type="text_general"    indexed="true"
> stored="false"/>
> <field name="anchor_text" type="text_general" indexed="true"
> stored="false"  multiValued="true" compressed="true" termVectors="true"
> termPositions="true" termOffsets="true"/>
> <field name="freebase" type="text_general" indexed="true" stored="true"
> multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true"/>
> <field name="timestamp" type="date"    indexed="true" stored="true"
> multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true"/>
> <field name="titleText" type="text_general"    indexed="true"
> stored="true"  multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true"/>
> <field name="category" type="string" indexed="true" stored="true"/>
> </fields>
> <uniqueKey>id</uniqueKey>
> <copyField source="title" dest="titleText"/>
> 
> My data-config.xml
> <dataConfig>
>        <dataSource type="FileDataSource" encoding="UTF-8" />
>        <document>
>        <entity name="page"
>                processor="XPathEntityProcessor"
>                stream="true"
>                forEach="/mediawiki/page/"
>                url="/home/prabu/wikipedia_full_indexed_dump.xml"
> 
> transformer="RegexTransformer,DateFormatTransformer,HTMLStripTransformer"
>> 
>            <field column="id"        xpath="/mediawiki/page/id"
> stripHTML="true"/>
>            <field column="title"     xpath="/mediawiki/page/title"
> stripHTML="true"/>
>        <field column="category"  xpath="/mediawiki/page/category"
> stripHTML="true"/>
>            <field column="revision"  xpath="/mediawiki/page/revision/id"
> stripHTML="true"/>
>            <field column="user"
> xpath="/mediawiki/page/revision/contributor/username" stripHTML="true"/>
>            <field column="userId"
> xpath="/mediawiki/page/revision/contributor/id" stripHTML="true"/>
>            <field column="text"      xpath="/mediawiki/page/revision/text"
> stripHTML="true"/>
>            <field column="freebase"  xpath="/mediawiki/page/freebase"
> stripHTML="true"/>
>        <field column="pagerank"  xpath="/mediawiki/page/pagerank"
> stripHTML="true"/>
>        <field column="anchor_text" xpath="/mediawiki/page/anchor_text/"
> stripHTML="true"/>
>            <field column="timestamp"
> xpath="/mediawiki/page/revision/timestamp"
> dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
>            <field column="$skipDoc"  regex="^#REDIRECT .*"
> replaceWith="true" sourceColName="text"/>
>        <field column="category" regex="((\[\[.*Category:.*\]\]\W?)+)"
> sourceColName="text" stripHTML="true"/>
>        <field column="$skipDoc" regex="^Template:.*" replaceWith="true"
> sourceColName="title"/>
>       </entity>
>        </document>
> </dataConfig>
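A side note on the config above, as an assumption worth double-checking rather than a confirmed bug: in Java's SimpleDateFormat, `hh` is the 12-hour clock and `HH` is the 24-hour clock, so a dateTimeFormat of `yyyy-MM-dd'T'hh:mm:ss'Z'` can misinterpret afternoon hours in MediaWiki timestamps. A sketch of the intended parse in Python, where `%H` is the 24-hour directive:

```python
# Sketch (sample timestamp is hypothetical): MediaWiki revision timestamps
# look like 2013-09-12T14:30:00Z. Parsing with a 24-hour directive keeps
# afternoon hours intact; the Java equivalent would use HH rather than hh.
from datetime import datetime

ts = "2013-09-12T14:30:00Z"
parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")  # %H = 24-hour clock
print(parsed.hour)  # → 14
```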
> 
> I then ran http://localhost:8983/solr/dataimport?command=full-import. At
> around 50,000 documents, I get an error related to the regular expression:
> 
> at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
>       at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
>       at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
>       at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
>       at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
>       at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
>       at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
>       at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
>       at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
>       at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
>       at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
> 
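A guess, not confirmed anywhere in the thread: repeating Pattern$Loop / GroupTail / Branch frames like those above are what Java's regex engine produces during heavy backtracking, and the category pattern in the data-config nests `.*` inside a repeated group, which can blow up on long article text. A sketch of a tamer rewrite using negated character classes so the inner match can never cross a `]]`; shown in Python, but the same pattern change applies to the Java regex in data-config.xml:

```python
import re

# Original pattern from the data-config (nested quantifiers over .* can
# backtrack heavily on long inputs): ((\[\[.*Category:.*\]\]\W?)+)
# Rewrite sketch: [^\]]* stops at the first "]", so each [[...]] link is
# matched without the engine retrying overlapping splits of the text.
category_re = re.compile(r"(\[\[[^\]]*Category:[^\]]*\]\]\W?)+")

text = "Intro text ... [[Category:Physics]] [[Category:Mathematics]]"
m = category_re.search(text)
print(m.group(0))
```

If the rewritten pattern extracts the same category links on representative articles, it should be a drop-in replacement for the `regex` attribute of the category field.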
> I do not know how to proceed. Please help me out.
> 
> Thanks and Regards
> Prabu
> 
> 
> On Wed, Sep 11, 2013 at 11:31 AM, Erick Erickson
> <erickerick...@gmail.com> wrote:
> 
>> Be a little careful when extrapolating from disk to memory.
>> Any fields where you've set stored="true" will put data in
>> segment files with extensions .fdt and .fdx. These are the
>> compressed verbatim copies of the data for stored fields and
>> have very little impact on the memory required for searching.
>> I've seen indexes where 75% of the data is stored and indexes
>> where 5% of the data is stored.
>> 
>> See the "Summary of File Extensions" section here:
>> 
>> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html
>> 
>> Best,
>> Erick
>> 
>> 
>> On Wed, Sep 11, 2013 at 2:57 AM, prabu palanisamy
>> <pr...@serendio.com> wrote:
>> 
>>> @Shawn: Correct, I am trying to reduce the index size. I am working on
>>> reindexing Solr with some of the fields indexed but not stored.
>>> 
>>> @Jean: I tried different cache settings. They did not show much improvement.
>>> 
>>> 
>>> On Fri, Sep 6, 2013 at 3:17 PM, Shawn Heisey <s...@elyograg.org> wrote:
>>> 
>>>> On 9/6/2013 2:54 AM, prabu palanisamy wrote:
>>>>> I am currently using Solr 3.5.0, with an indexed Wikipedia dump
>>>>> (50 GB), on Java 1.6.
>>>>> I am searching Solr with text (which is actually Twitter tweets).
>>>>> Currently each post takes an average of 210 milliseconds, of which
>>>>> 200 milliseconds is consumed by the Solr server (QTime). I used the
>>>>> jconsole monitoring tool.
>>>> 
>>>> If the size of all your Solr indexes on disk is in the 50GB range of
>>>> your wikipedia dump, then for ideal performance, you'll want to have
>>>> 50GB of free memory so the OS can cache your index.  You might be able
>>>> to get by with 25-30GB of free memory, depending on your index
>>> composition.
>>>> 
>>>> Note that this is memory over and above what you allocate to the Solr
>>>> JVM, and memory used by other processes on the machine.  If you do have
>>>> other services on the same machine, note that those programs might ALSO
>>>> require OS disk cache RAM.
>>>> 
>>>> http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache
>>>> 
>>>> Thanks,
>>>> Shawn
>>>> 
>>>> 
>>> 
>> 
