You could try the fc facet method (facet.method=fc) in place of enum, and
maybe increase the filterCache size.
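
As a rough sketch of what I mean (not something I have run against your index), the snippet below rebuilds the sample query from the quoted message with facet.method=fc swapped in for enum. The host, path, field names and mincount are copied from that query; the helper name facet_url is made up for illustration. The filterCache size itself is set on the filterCache element in solrconfig.xml, which is not shown here.

```python
from urllib.parse import urlencode

# Host and path copied from the sample query in the quoted message.
SOLR_SELECT = "http://localhost:8080/solr/select"


def facet_url(site, method="fc", mincount=25):
    """Build a facet request for one site (hypothetical helper).

    Same parameters as the quoted query, except that facet.method
    defaults to "fc" instead of "enum".
    """
    params = [
        ("q", "*:*"),
        ("fq", "site:" + site),          # restrict to a single site
        ("rows", 0),                     # only facet counts are needed
        ("facet", "true"),
        ("facet.field", "autoSuggestContent"),
        ("facet.mincount", mincount),
        ("facet.limit", -1),             # return all terms above mincount
        ("facet.method", method),
        ("facet.sort", "index"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


if __name__ == "__main__":
    print(facet_url("www.abc.com"))
```

Timing one site with method="fc" and again with method="enum" should show whether the switch helps on your data.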

On Thu, Nov 22, 2012 at 2:53 PM, Pravin Agrawal <
pravin_agra...@persistent.co.in> wrote:

> Hi All,
>
> We are using Solr 3.4 with the following schema fields.
>
>
> <schema.xml>---------------------------------------------------------------------------------------
>
> <fieldType name="autosuggest_text" class="solr.TextField"
>             positionIncrementGap="100">
>             <analyzer type="index">
>                 <tokenizer class="solr.StandardTokenizerFactory"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <filter class="solr.ShingleFilterFactory"
>                     maxShingleSize="5" outputUnigrams="true"/>
>                 <filter class="solr.PatternReplaceFilterFactory"
>                     pattern="^([0-9. ])*$" replacement=""
>                     replace="all"/>
>                 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>             </analyzer>
>             <analyzer type="query">
>                 <tokenizer class="solr.StandardTokenizerFactory"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>             </analyzer>
>         </fieldType>
>
> <field name="id" type="string" stored="true" indexed="true"/>
> <field name="autoSuggestContent" type="autosuggest_text" stored="true"
>        indexed="true" multiValued="true"/>
>         <copyField source="content" dest="autoSuggestContent"/>
>         <copyField source="original_title" dest="autoSuggestContent"/>
>
> <field name="content" type="text" stored="true" indexed="true"/>
> <field name="original_title" type="text" stored="true" indexed="true"/>
> <field name="site" type="site" stored="false" indexed="true"/>
>
>
> </schema.xml>---------------------------------------------------------------------------------------
>
> The index on the above schema is distributed across two Solr shards, each
> holding about 1.2 million documents and about 195GB on disk.
>
> We want to retrieve (site, autoSuggestContent term, frequency of the term)
> information from the main Solr index above. The site field in each document
> contains the name of the site to which that document belongs. The terms are
> retrieved from the multivalued field autoSuggestContent, which is created
> using shingles from the content and title of the web page.
>
> As of now, we are using a facet query to retrieve (term, frequency of term)
> for each site. Below is a sample query (you may ignore the initial part of
> the query):
>
>
> http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index
>
> The problem is that as the index has grown, this method has started
> taking a huge amount of time. It used to take 7 minutes per site with an
> index size of 0.4 million docs but takes around 60-90 minutes with an index
> size of 2.5 million docs. At this speed, it will take around 5-6 days to
> index the complete set of 1500 sites. We also expect the index to grow with
> more documents and more sites, so the time to get the above information
> will increase further.
>
> Please let us know if there is any better way to extract (site, term,
> frequency) information compared to the current method.
>
> Thanks,
> Pravin Agrawal
>