Just based on our experience: we have a large collection (350M documents,
about 1.2TB on disk, spread across 4 shards/machines with multiple
replicas, and we may well need more), and the first thing we needed to do
for size estimation was to work out how big a set number of documents
would be on disk.  So we built a test collection, indexed 1000 documents
and measured the size of the index.  You've told us it's 2 billion docs,
but 2 billion of what?  Is it lots of fields, lots of text, how many
fields are stored, how many are indexed, etc.?  As Shawn says, you need
to do some empirical analysis yourself on what your collection will look
like.  There is a spreadsheet (size-estimator-lucene-solr.xls) in the Solr
source distribution, but that only gives a very rough rule of thumb, and I
don't know whether it handles the new compressed stored fields.  Only you
can know how big your collection will be; there is no hard and fast rule.
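
For what it's worth, the extrapolation step can be as simple as the sketch
below (Python, purely illustrative: the index path, sample size and target
count are made up, and it assumes you've already indexed and committed the
sample into a core of its own).  Linear extrapolation from a small sample
is only ever a rough figure, since dictionary and per-segment overhead
don't grow linearly, but it gets you in the right ballpark.

import os

SAMPLE_DOCS = 1000                                # docs indexed for the test
TARGET_DOCS = 2000000000                          # the 2 billion being planned
INDEX_DIR = "/var/solr/data/collection1/index"    # hypothetical data dir

def dir_size_bytes(path):
    # Total size of all files under the core's index directory.
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

sample_bytes = dir_size_bytes(INDEX_DIR)
bytes_per_doc = sample_bytes / float(SAMPLE_DOCS)
projected_tb = bytes_per_doc * TARGET_DOCS / 1024.0 ** 4

print("Sample index size: %.1f MB" % (sample_bytes / 1024.0 ** 2))
print("Bytes per doc    : %.0f" % bytes_per_doc)
print("Projected size   : %.2f TB for %d docs" % (projected_tb, TARGET_DOCS))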

As an aside, the hardware suggestion seemed odd: a beefy (and expensive)
Sun box, but with relatively little memory (I know PCs that have 16GB, so
32GB isn't much these days).  Again, I can only comment on our experience,
but we are going for the horizontal scaling approach that SolrCloud is
better suited to, so we have smaller Linux/Intel-based machines, but with
256GB of RAM (and we may well need 512GB) to try to maximize use of the OS
page cache.

Depending on your eventual collection size, and how frequently your
updates come in (you said 1M per day, but is that a bulk job or documents
arriving ad hoc?), you may want to investigate SSD storage.  If you are
doing lots of segment merges because the collection is continually
changing, then disk IO could be an issue.  We went the SSD route, since we
have an NRT setup and we want to keep the number of segments down so that
search times stay good.
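
A crude way to see whether merges are becoming a problem during an
indexing run is to watch the number of segments on disk, e.g. with a quick
script like the one below (a rough sketch: the path is hypothetical, and
counting segment file name prefixes is only a proxy for what the
IndexWriter is really doing, but the trend tells you whether merges are
keeping up).

import os
import re
import time

INDEX_DIR = "/var/solr/data/collection1/index"    # hypothetical data dir
SEG_FILE = re.compile(r"^(_[0-9a-z]+)[._]")       # matches _0.si, _0.cfs, ...

def segment_count(index_dir):
    # Count distinct Lucene segment name prefixes currently on disk.
    names = set()
    for f in os.listdir(index_dir):
        m = SEG_FILE.match(f)
        if m:
            names.add(m.group(1))
    return len(names)

for _ in range(20):                               # sample every 30s for 10 min
    print(time.strftime("%H:%M:%S"), "segments:", segment_count(INDEX_DIR))
    time.sleep(30)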

On 15 May 2013 07:38, Shawn Heisey <s...@elyograg.org> wrote:

> On 5/15/2013 12:31 AM, Shawn Heisey wrote:
> > If we assume that you've taken every possible step to reduce Solr's Java
> > heap requirements, you might be able to do a heap of 8 to 16GB per
> > server, but the actual heap requirement could be significantly higher.
> > Adding this up, you get a bare minimum memory requirement of 32GB for
> > each of those four servers.  Ideally, you'd need to have 48GB for each
> > of them.  If you plan to put it on two Solr servers instead of four,
> > double the per-server memory requirement.
> >
> > Remember that all the information in the previous paragraph assumes a
> > total index size of 100GB, and your index has the potential to be a lot
> > bigger than 100GB.  If you have a 300GB index size instead of 100GB,
> > triple those numbers.  Scale up similarly for larger sizes.
>
> I should have made something clear here: These are just possible
> estimates, not hard reliable numbers.  In particular, if you end up
> needing a larger per-server heap size than 16GB, then my estimates are
> wrong.
>
> Thanks,
> Shawn
>
>
