On Thu, 2013-12-12 at 02:46 +0100, Joel Bernstein wrote:
> Curious how many documents per shard you were planning?

350-500 million, optimized to a single segment as the data are not
changing.
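
For completeness, the single-segment optimize would be something like
the following SolrJ sketch. The shard URL and core name "netarchive"
are placeholders, not our actual setup:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class OptimizeShard {
    public static void main(String[] args) throws Exception {
        // Placeholder shard URL. Arguments are
        // (waitFlush, waitSearcher, maxSegments): force-merge the
        // index down to a single segment.
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/netarchive");
        server.optimize(true, true, 1);
        server.shutdown();
    }
}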

> The number of documents per shard and field type will drive the amount
> of RAM needed to sort and facet.

Very true. It makes a lot of sense to separate the RAM requirements for
the Lucene/Solr structures from those for OS caching.

It seems that Gil is working on roughly the same project as we are, so I
will elaborate in this thread:

We would like to perform some sort of grouping on URL, so that the same
page, harvested at different points in time, is only displayed once. This
is probably the heaviest piece of functionality, as the cardinality of
the field will be near the number of documents.
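
Purely as a sketch of what we have in mind, a grouped request via SolrJ
might look like the code below. The core name "netarchive" and the
field name "url" are hypothetical placeholders for whatever the actual
schema uses, and this is untested:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupByUrl {
    public static void main(String[] args) throws SolrServerException {
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/netarchive");
        SolrQuery query = new SolrQuery("some search terms");
        query.set("group", true);        // enable result grouping
        query.set("group.field", "url"); // collapse harvests of the same page
        query.set("group.limit", 1);     // show one version per URL
        QueryResponse response = server.query(query);
        System.out.println(response.getGroupResponse().getValues());
    }
}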

For plain(er) faceting, things like MIME-type, harvest date and site
seem relevant. Those fields have lower cardinality and they are
single-valued, so the memory requirements are something like
  #docs * log2(#unique_values) bits
With 500M documents and 1000 unique values, that is about 600MB per
shard. With 20 shards, we are looking at 12GB per simple facet field.
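
Spelled out as a runnable back-of-the-envelope check (the
one-packed-ordinal-per-document model is my assumption; the actual
Lucene/Solr structures will differ somewhat):

public class FacetMemEstimate {
    public static void main(String[] args) {
        long docs = 500_000_000L;  // documents per shard
        long uniqueValues = 1000;  // e.g. MIME-type cardinality
        // One packed ordinal per document: ceil(log2(#unique_values)) bits.
        int bitsPerDoc = 64 - Long.numberOfLeadingZeros(uniqueValues - 1);
        long bytesPerShard = docs * bitsPerDoc / 8;
        System.out.printf("%d MB/shard, %d GB across 20 shards%n",
            bytesPerShard / (1024 * 1024),
            bytesPerShard * 20 / (1024L * 1024 * 1024));
    }
}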

Regards,
Toke Eskildsen