Just going on our own experience: we have a large collection (350M documents, about 1.2TB on disk, spread across 4 shards/machines with multiple replicas, and we may well need more), and the first thing we needed to do for size estimation was to work out how big a set number of documents would be on disk. So we built a test collection, inserted 1000 representative documents and measured the size of the index. You've told us it's 2 billion docs, but 2 billion of what? Is it lots of fields, lots of text, how many fields are stored, how many are indexed, etc.? As Shawn says, you need to do some empirical analysis yourself on what your collection will look like. There is a spreadsheet (size-estimator-lucene-solr.xls) in the Solr source distribution, but that is only a very rough rule of thumb, and I don't know whether it accounts for the newer compressed stored fields. Only you can know how big your collection will be; there is no hard and fast rule.
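
For what it's worth, here is roughly how you could script that kind of test. This is just a sketch (untested) that assumes a local Solr at http://localhost:8983 with an empty test core I've called "sizetest" (a made-up name), the Python "requests" library, and *_t dynamic text fields in your schema. Swap in your real documents and field names, since the whole point is to measure your own data:

    import requests

    SOLR = "http://localhost:8983/solr"
    CORE = "sizetest"            # hypothetical empty test core
    NUM_DOCS = 1000              # sample size to extrapolate from
    TARGET_DOCS = 2_000_000_000  # the 2 billion docs you mentioned

    # Index a batch of documents -- use real ones pulled from your own data,
    # not synthetic text like this, or the per-doc size will be meaningless.
    docs = [{"id": str(i),
             "title_t": "example title %d" % i,
             "body_t": "replace this with a realistic body of text"}
            for i in range(NUM_DOCS)]
    requests.post("%s/%s/update?commit=true" % (SOLR, CORE),
                  json=docs).raise_for_status()

    # Ask the Core Admin STATUS API how big the index is on disk.
    status = requests.get("%s/admin/cores" % SOLR,
                          params={"action": "STATUS", "core": CORE,
                                  "wt": "json"}).json()
    size_bytes = status["status"][CORE]["index"]["sizeInBytes"]

    per_doc = size_bytes / float(NUM_DOCS)
    print("~%.1f KB per doc, ~%.2f TB for %d docs"
          % (per_doc / 1024, per_doc * TARGET_DOCS / 1024**4, TARGET_DOCS))

Bear in mind that 1000 documents will understate things like compression ratios and per-segment overhead, so index as many representative documents as you can afford before extrapolating.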
As an aside, the hardware suggestion seemed odd to me: a beefy (and expensive) Sun box, but with relatively low memory (I know PCs that have 16GB, so 32GB isn't much these days). Again, I can only comment on our experience, but we are going for the horizontal scaling approach that SolrCloud is better suited to, so we have smaller Linux/Intel-based machines, but with 256GB of RAM (and we may well need 512GB) to try and maximize the use of the OS page cache. Depending on your eventual collection size, and how frequently your updates come in (you said 1M per day, but is that a bulk job or ad hoc throughout the day?), you may want to investigate SSD storage. If you are doing lots of segment merges because the collection is continually changing, then disk IO could be an issue. We went that route, since we have an NRT setup and we want to keep the number of segments down so that search times stay good.

On 15 May 2013 07:38, Shawn Heisey <s...@elyograg.org> wrote:

> On 5/15/2013 12:31 AM, Shawn Heisey wrote:
> > If we assume that you've taken every possible step to reduce Solr's Java
> > heap requirements, you might be able to do a heap of 8 to 16GB per
> > server, but the actual heap requirement could be significantly higher.
> > Adding this up, you get a bare minimum memory requirement of 32GB for
> > each of those four servers.  Ideally, you'd need to have 48GB for each
> > of them.  If you plan to put it on two Solr servers instead of four,
> > double the per-server memory requirement.
> >
> > Remember that all the information in the previous paragraph assumes a
> > total index size of 100GB, and your index has the potential to be a lot
> > bigger than 100GB.  If you have a 300GB index size instead of 100GB,
> > triple those numbers.  Scale up similarly for larger sizes.
>
> I should have made something clear here: These are just possible
> estimates, not hard reliable numbers.  In particular, if you end up
> needing a larger per-server heap size than 16GB, then my estimates are
> wrong.
>
> Thanks,
> Shawn
>
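
To make the arithmetic in Shawn's quoted message concrete, here is a trivial back-of-envelope calculator. It is only my reading of his reasoning (page cache for this server's share of the index, plus the Java heap), not an official formula, and the real numbers depend entirely on your index and query load:

    def per_server_ram_gb(total_index_gb, num_servers, heap_gb, cached_fraction=1.0):
        """Rough per-server RAM estimate: this server's share of the index
        held in the OS page cache, plus the Java heap. Set cached_fraction
        below 1.0 if you are willing to gamble on only part of the index
        staying cached."""
        index_per_server = total_index_gb / float(num_servers)
        return index_per_server * cached_fraction + heap_gb

    # Shawn's example: 100GB total index, 4 servers, 8GB heap
    print(per_server_ram_gb(100, 4, 8))    # ~33, i.e. his 32GB bare minimum
    # With a 16GB heap it climbs towards ~41, which is why 48GB is the
    # more comfortable figure; a 300GB index roughly triples the cache part.
    print(per_server_ram_gb(100, 4, 16))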