Hi,

We have a large, sharded SolrCloud index of over 300 million documents
that we use to explore our web archives. We want to facet on fields
with very large numbers of distinct values, e.g. the host names and
domain names of pages and links, so overall we expect millions of
distinct terms in those fields. We also want to sort on other fields
(e.g. date of harvest).
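
For illustration, our facet queries are roughly of the following shape
(the field names here are simplified/illustrative rather than our exact
schema):

  q=*:*&rows=10
  &facet=true
  &facet.field=host
  &facet.field=domain
  &facet.limit=100
  &facet.mincount=1
  &sort=crawl_date asc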

We have experimented with various RAM and facet configurations, and
currently find facet.method=enum with facet.enum.cache.minDf to be more
stable than facet.method=fc. We have eight shards, and although queries
are slow, the individual shards are fairly reliable with a few GB of
RAM each (about 5GB per shard at present). This seems consistent with
the usual guidelines for estimating RAM usage (e.g.
http://stackoverflow.com/questions/4499630/solr-faceted-navigation-on-large-index).
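
Concretely, on top of the query shape above, we are adding parameters
along these lines (the exact minDf value below is just an example, not
our tuned setting):

  facet.method=enum
  &facet.enum.cache.minDf=30

As far as we can tell, the minDf cut-off stops the filterCache filling
up with entries for the very large number of rare terms, which is what
seems to keep the individual shards stable.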

However, the Solr instance we direct our client queries to is consuming
significantly more RAM (10GB) and still fails after a few queries when
it runs out of heap space. Presumably this is because of the role it
plays in aggregating the results from each shard. Is there any way we
can estimate the amount of RAM that server will need?

Alternatively, given our dataset, should we be pursuing a different
approach? Should we re-index with the facet partition size set to
something smaller (e.g. 10,000 rather than Integer.MAX_VALUE)? Should
we be using facet.method=fc and buying more RAM?

Best wishes,
Andy Jackson

--
Dr Andrew N Jackson
Web Archiving Technical Lead
The British Library

Tel: 01937 546602
Mobile: 07765 897948
Web: www.webarchive.org.uk
Twitter: @UKWebArchive