You could always try the fc facet method (facet.method=fc instead of facet.method=enum) and maybe increase the filterCache size.
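A minimal sketch of the first suggestion, assuming the query URL from the original message below. The helper name with_facet_method is hypothetical, not part of any Solr client; it only swaps the facet.method parameter so the two variants can be timed against each other. The filterCache size itself is set on the filterCache element in solrconfig.xml and is not shown here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_facet_method(url, method="fc"):
    """Return `url` with its facet.method parameter set to `method`.

    Hypothetical helper for illustration: any existing facet.method
    parameter is dropped and the new value is appended at the end.
    """
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "facet.method"
    ]
    params.append(("facet.method", method))
    # Keep '*' and ':' unescaped so q=*:* and fq=site:... stay readable.
    return urlunsplit(parts._replace(query=urlencode(params, safe="*:")))


# Query taken from the original message (facet.method=enum).
original = (
    "http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com"
    "&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent"
    "&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index"
)

fc_url = with_facet_method(original)
print(fc_url)
```

Whether fc is actually faster for a shingled, multivalued field of this size is something to measure on one site first; this only changes the request, not the index.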
On Thu, Nov 22, 2012 at 2:53 PM, Pravin Agrawal <pravin_agra...@persistent.co.in> wrote:
> Hi All,
>
> We are using solr 3.4 with following schema fields.
>
> <schema.xml>---------------------------------------------------------------------------------------
>
> <fieldType name="autosuggest_text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="^([0-9. ])*$" replacement="" replace="all"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> <field name="id" type="string" stored="true" indexed="true"/>
> <field name="autoSuggestContent" type="autosuggest_text" stored="true" indexed="true" multiValued="true"/>
> <copyField source="content" dest="autoSuggestContent"/>
> <copyField source="original_title" dest="autoSuggestContent"/>
>
> <field name="content" type="text" stored="true" indexed="true"/>
> <field name="original_title" type="text" stored="true" indexed="true"/>
> <field name="site" type="site" stored="false" indexed="true"/>
>
> </schema.xml>---------------------------------------------------------------------------------------
>
> The index on above schema is distributed on two solr shards with each
> index size of about 1.2 million, and size on disk of about 195GB per shard.
>
> We want to retrieve (site, autoSuggestContent term, frequency of the term)
> information from our above main solr index. The site is a field in document
> and contains name of site to which that document belongs.
> The terms are retrieved from multivalued field autoSuggestContent which is
> created using shingles from content and title of the web page.
>
> As of now, we are using facet query to retrieve (term, frequency of term)
> for each site. Below is a sample query (you may ignore initial part of
> query)
>
> http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index
>
> The problem is that with increase in index size, this method has started
> taking huge time. It used to take 7 minutes per site with index size of
> 0.4 million docs but takes around 60-90 minutes for index size of 2.5
> million. With this speed, it will take around 5-6 days to index complete
> 1500 sites. Also we are expecting the index size to grow with more
> documents and more sites and as such time to get the above information will
> increase further.
>
> Please let us know if there is any better way to extract (site, term,
> frequency) information compare to current method.
>
> Thanks,
> Pravin Agrawal