Hi

I tried to reindex Solr and I am getting a regular-expression error. The
steps I followed are:

1. I started Solr with java -jar start.jar
2. I deleted all documents:
   http://localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete>
3. I committed the deletion:
   http://localhost:8983/solr/update?stream.body=<commit/>
4. I stopped the Solr server.
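
For what it's worth, the delete-all body only works if it is well-formed XML
with proper closing tags, and it needs percent-encoding when passed through
stream.body. A quick standalone sanity check (Python just for illustration;
host and port are the defaults used above):

```python
import urllib.parse
import xml.etree.ElementTree as ET

# The delete-all body: note the "/" in the closing </query> and </delete> tags.
delete_body = "<delete><query>*:*</query></delete>"
commit_body = "<commit/>"

# Both bodies must parse as XML, otherwise Solr rejects the update.
assert ET.fromstring(delete_body).tag == "delete"
assert ET.fromstring(commit_body).tag == "commit"

# stream.body is a URL parameter, so the XML should be percent-encoded.
base = "http://localhost:8983/solr/update?stream.body="
delete_url = base + urllib.parse.quote(delete_body)
commit_url = base + urllib.parse.quote(commit_body)
print(delete_url)
print(commit_url)
```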

I changed the indexed and stored attributes to false for some of the fields
in schema.xml:
<fields>
    <field name="id"          type="string"       indexed="true"  stored="true"  required="true"/>
    <field name="title"       type="string"       indexed="true"  stored="true"  multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
    <field name="revision"    type="sint"         indexed="false" stored="false"/>
    <field name="user"        type="string"       indexed="false" stored="false"/>
    <field name="userId"      type="int"          indexed="false" stored="false"/>
    <field name="text"        type="text_general" indexed="true"  stored="true"  multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
    <field name="pagerank"    type="text_general" indexed="true"  stored="false"/>
    <field name="anchor_text" type="text_general" indexed="true"  stored="false" multiValued="true" compressed="true" termVectors="true" termPositions="true" termOffsets="true"/>
    <field name="freebase"    type="text_general" indexed="true"  stored="true"  multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
    <field name="timestamp"   type="date"         indexed="true"  stored="true"  multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
    <field name="titleText"   type="text_general" indexed="true"  stored="true"  multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
    <field name="category"    type="string"       indexed="true"  stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<copyField source="title" dest="titleText"/>

My data-config.xml:

<dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8" />
    <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/home/prabu/wikipedia_full_indexed_dump.xml"
                transformer="RegexTransformer,DateFormatTransformer,HTMLStripTransformer">
            <field column="id"          xpath="/mediawiki/page/id" stripHTML="true"/>
            <field column="title"       xpath="/mediawiki/page/title" stripHTML="true"/>
            <field column="category"    xpath="/mediawiki/page/category" stripHTML="true"/>
            <field column="revision"    xpath="/mediawiki/page/revision/id" stripHTML="true"/>
            <field column="user"        xpath="/mediawiki/page/revision/contributor/username" stripHTML="true"/>
            <field column="userId"      xpath="/mediawiki/page/revision/contributor/id" stripHTML="true"/>
            <field column="text"        xpath="/mediawiki/page/revision/text" stripHTML="true"/>
            <field column="freebase"    xpath="/mediawiki/page/freebase" stripHTML="true"/>
            <field column="pagerank"    xpath="/mediawiki/page/pagerank" stripHTML="true"/>
            <field column="anchor_text" xpath="/mediawiki/page/anchor_text/" stripHTML="true"/>
            <field column="timestamp"   xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
            <field column="$skipDoc"    regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
            <field column="category"    regex="((\[\[.*Category:.*\]\]\W?)+)" sourceColName="text" stripHTML="true"/>
            <field column="$skipDoc"    regex="^Template:.*" replaceWith="true" sourceColName="title"/>
        </entity>
    </document>
</dataConfig>
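
To make the three RegexTransformer rules concrete, here is what they do on
made-up sample values (a standalone Python sketch; java.util.regex and
Python's re behave the same way for these patterns):

```python
import re

# Rule 1: mark #REDIRECT pages so $skipDoc drops them.
redirect = re.compile(r"^#REDIRECT .*")
assert redirect.match("#REDIRECT [[Some page]]")

# Rule 2: pull the block of category links out of the article text.
category = re.compile(r"((\[\[.*Category:.*\]\]\W?)+)")
text = "Some article text [[Category:Birds]] [[Category:Animals]]"
m = category.search(text)
print(m.group(1))  # → [[Category:Birds]] [[Category:Animals]]

# Rule 3: skip template pages by title.
template = re.compile(r"^Template:.*")
assert template.match("Template:Infobox")
```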

I then ran http://localhost:8983/solr/dataimport?command=full-import. At
around 50,000 documents, the import fails with an error related to regular
expression matching:

at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
        at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
        at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
        at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
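
The repeating Pattern$Loop / Pattern$GroupTail frames look like deep
backtracking in the category regex: the nested quantifiers in
((\[\[.*Category:.*\]\]\W?)+) can blow up on long articles that almost
match. One variant I am considering (assuming, on my side, that category
links never contain "]" inside the brackets) replaces .* with [^\]]* so the
match cannot run across neighbouring links:

```python
import re

text = "intro [[Category:Birds]] [[Category:Animals]] trailing text"

original = re.compile(r"((\[\[.*Category:.*\]\]\W?)+)")
# [^\]]* stops at the first "]", so it never backtracks across
# neighbouring links the way the greedy .* does.
safer = re.compile(r"((\[\[[^\]]*Category:[^\]]*\]\]\W?)+)")

# Both extract the same category block on well-formed input.
assert original.search(text).group(1).strip() == safer.search(text).group(1).strip()
print(safer.search(text).group(1).strip())
```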

I do not know how to proceed. Please help me out.

Thanks and Regards
Prabu


On Wed, Sep 11, 2013 at 11:31 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Be a little careful when extrapolating from disk to memory.
> Any fields where you've set stored="true" will put data in
> segment files with extensions .fdt and .fdx. These are the
> compressed verbatim copies of the data for stored fields and
> have very little impact on the memory required for searching.
> I've seen indexes where 75% of the data is stored and indexes
> where 5% of the data is stored.
>
> "Summary of File Extensions" here:
>
> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html
>
> Best,
> Erick
>
>
> On Wed, Sep 11, 2013 at 2:57 AM, prabu palanisamy <pr...@serendio.com>
> wrote:
>
> > @Shawn: Correct, I am trying to reduce the index size. I am working on
> > reindexing Solr with some of the fields indexed but not stored.
> >
> > @Jean: I tried different caches. It did not show much improvement.
> >
> >
> > On Fri, Sep 6, 2013 at 3:17 PM, Shawn Heisey <s...@elyograg.org> wrote:
> >
> > > On 9/6/2013 2:54 AM, prabu palanisamy wrote:
> > > > I am currently using Solr 3.5.0, and indexed a Wikipedia dump
> > > > (50 GB) with Java 1.6.
> > > > I am searching Solr with text (which is actually Twitter tweets).
> > > > Currently it takes an average of 210 milliseconds per post, out of
> > > > which 200 milliseconds are consumed by the Solr server (QTime). I
> > > > used the jconsole monitoring tool.
> > >
> > > If the size of all your Solr indexes on disk is in the 50GB range of
> > > your wikipedia dump, then for ideal performance, you'll want to have
> > > 50GB of free memory so the OS can cache your index.  You might be able
> > > to get by with 25-30GB of free memory, depending on your index
> > > composition.
> > >
> > > Note that this is memory over and above what you allocate to the Solr
> > > JVM, and memory used by other processes on the machine.  If you do have
> > > other services on the same machine, note that those programs might ALSO
> > > require OS disk cache RAM.
> > >
> > > http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> >
>
