Hi,
We have a large, sharded SolrCloud index of more than 300 million documents which we use to explore our web archives. We want to facet on fields that have very large numbers of distinct values, e.g. the host names and domain names of pages and links, so overall we expect millions of distinct terms in those fields. We also want to sort on other fields (e.g. date of harvest).

We have experimented with various RAM and facet configurations, and are currently finding facet.method=enum plus a minDf cache threshold to be more stable than fc. We currently have eight shards, and although the queries are slow, the individual shards are fairly reliable with a few GB of RAM each (about 5GB per shard right now). This seems to be consistent with guidelines for estimating RAM usage (e.g. http://stackoverflow.com/questions/4499630/solr-faceted-navigation-on-large-index).

However, the Solr instance we direct our client queries to is consuming significantly more RAM (10GB) and still fails after a few queries when it runs out of heap space, presumably because of the role it plays in aggregating the results from each shard. Is there any way we can estimate the amount of RAM that server will need?

Alternatively, given our dataset, should we be pursuing a different approach? Should we re-index with the facet partition size set to something smaller (e.g. 10,000 rather than Integer.MAX_VALUE)? Should we be using facet.method=fc and buying more RAM?

Best wishes,
Andy Jackson

--
Dr Andrew N Jackson
Web Archiving Technical Lead
The British Library

Tel: 01937 546602
Mobile: 07765 897948
Web: www.webarchive.org.uk <http://www.webarchive.org.uk/>
Twitter: @UKWebArchive
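P.S. In case it helps, the facet requests we send look roughly like the sketch below. The field names (host, domain, crawl_date), the minDf value, and the limits are illustrative rather than our exact settings:

    q=*:*&rows=10
    &sort=crawl_date asc
    &facet=true
    &facet.field=host
    &facet.field=domain
    &facet.limit=20
    &facet.mincount=1
    &facet.method=enum
    &facet.enum.cache.minDf=30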