Shawn's right that if you're going to scale this big, you'd be well
served to spend time getting the index as small as possible.  In my
experience, if your searches require real-time random-access reads (that
is, the entire index needs to be fast), you don't want to wait on HDD
reads.

Getting everything in RAM is best, but 6TB per replica (and perhaps
you'll want more than one replica?) is a tall order.  SSDs are coming
down in price, and flash memory tech is advancing quickly (Fusion-io and
the like).
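
For a sense of scale, here's the arithmetic behind the 12-server figure
mentioned below, assuming you want one full copy of the index in RAM:

  12 servers x 512 GB = 6,144 GB ~= 6 TB   (one copy of the 6 TB index)
  x2 if a second replica must also stay RAM-resident ~= 12 TB cluster-wide

SSD-backed nodes can get by with far less RAM, at some cost in latency.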

Sounds like an interesting use case!

Thanks, Ryan


On Tue, Dec 10, 2013 at 9:37 AM, Shawn Heisey <s...@elyograg.org> wrote:

> On 12/10/2013 9:51 AM, Hoggarth, Gil wrote:
> > We're probably going to be building a Solr service to handle a dataset
> > of ~60TB, which for our data and schema typically gives a Solr index
> > size of 1/10th of that - i.e., 6TB. Given the general rule that the
> > amount of hardware memory should exceed the size of the Solr index
> > (exceed it, to also allow for the operating system etc.), how have
> > people handled this situation? Do I really need, for example, 12
> > servers with 512GB of RAM each, or are there other techniques for
> > handling this?
>
> That really depends on what kind of query volume you'll have and what
> kind of performance you want.  If your query volume is low and you can
> deal with slow individual queries, then you won't need that much memory.
>  If either of those requirements increases, you'd probably need more
> memory, up to the 6TB total -- or 12TB if you need to double the total
> index size for redundancy purposes.  If your index is constantly growing
> like most are, you need to plan for that too.
>
> Putting the entire index into RAM is required for *top* performance, but
> not for base functionality.  It might be possible to put only a fraction
> of your index into RAM.  Only testing can determine what you really need
> to obtain the performance you're after.
>
> Perhaps you've already done this, but you should try as much as possible
> to reduce your index size.  Store as few fields as possible -- just
> enough to build a search result list/grid and retrieve the full document
> from the canonical data store.  Enable termvectors and docvalues on as
> few fields as possible.  If you can, reduce the number of terms produced
> by your analysis chains.  (See the sketch below this message.)
>
> Thanks,
> Shawn
>
>
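
To make Shawn's schema-trimming advice concrete, here's a minimal
schema.xml sketch.  The field names and the text_en fieldType are
assumptions borrowed from the Solr example schema; whether each setting
is safe depends on your own queries, so treat this as a starting point
rather than a drop-in config:

  <!-- Unique key: stored so results can point back to the canonical
       data store. -->
  <field name="id" type="string" indexed="true" stored="true"/>

  <!-- Shown in the result list/grid, so it stays stored. -->
  <field name="title" type="text_en" indexed="true" stored="true"/>

  <!-- Searchable but never returned: stored="false" keeps the full text
       out of the index, and termVectors stays off because nothing here
       needs term-vector-based highlighting. -->
  <field name="body" type="text_en" indexed="true" stored="false"
         termVectors="false"/>

Pair this with fetching the full record from your canonical store by id,
and audit every stored="true", termVectors="true", and docValues="true"
you do keep -- each one adds per-document data across the whole 6TB index.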
