Hi Shawn,

Thanks for your response; I wanted to clarify a few things.

*Does that mean that for smooth querying we need memory at least equal to
or greater than the size of the index? In my case the index will be very
large (~2TB), and practically speaking that amount of memory is not
possible. Even if it is split across multiple shards, say around 10
shards, 200GB of RAM per machine would still not be a feasible option.

*With CloudSolrServer, can we specify which shard a particular document
should go to and reside on? I can do that with EmbeddedSolrServer by
indexing into different directories and then moving them to the
appropriate shard directories.
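
For context, the sketch below is roughly what I would like to end up
with. It is only an illustration: the ZooKeeper hosts, collection name
and field names are placeholders, and it assumes the default compositeId
router, where a "prefix!" on the document id decides which shard the
document hashes to (all documents sharing a prefix land together).

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RoutedIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("mycollection");

        SolrInputDocument doc = new SolrInputDocument();
        // With the compositeId router, the "prefix!" part of the id
        // controls which shard the document is routed to.
        doc.addField("id", "groupA!12345");
        doc.addField("title_t", "example document");

        server.add(doc);
        server.commit();
        server.shutdown();
    }
}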

Thanks!



On Wed, Jun 4, 2014 at 12:43 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 6/4/2014 12:45 AM, Vineet Mishra wrote:
> > Thanks all for your response.
> > I presume this conversation concludes that indexing around 1Billion
> > documents per shard won't be a problem, as I have 10 Billion docs to
> index,
> > so approx 10 shards with 1 Billion each should be fine with it and how
> > about Memory, what size of RAM should be fine for this amount of data?
>
> Figure out the heap requirements of the operating system and every
> program on the machine (Solr especially).  Then you would add that
> number to the total size of the index data on the machine.  That is the
> ideal minimum RAM.
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Unfortunately, if you are dealing with a huge index with billions of
> documents, it is likely to be prohibitively expensive to buy that much
> RAM.  If you are running Solr on Amazon's cloud, the cost for that much
> RAM would be astronomical.
>
> Exactly how much RAM would actually be required is very difficult to
> predict.  If you had only 25% of the ideal, your index might have
> perfectly acceptable performance, or it might not.  It might do fine
> under a light query load, but if you increase to 50 queries per second,
> performance may drop significantly ... or it might be good.  It's
> generally not possible to know how your hardware will perform until you
> actually build and use your index.
>
>
> http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> A general rule of thumb for RAM that I have found to be useful is that
> if you've got less than half of the ideal memory size, you might have
> performance problems.
>
> > Moreover what should be the indexing technique for this huge data set, as
> > currently I am indexing with EmbeddedSolrServer but its going
> pathetically
> > slow after some 20Gb of indexing. Comparatively SolrHttpPost was slow due
> > to network delays and response but after this long running the indexing
> > with EmbeddedSolrServer I am getting a different notion.
> > Any good indexing technique for this huge dataset would be highly
> > appreciated.
>
> EmbeddedSolrServer is not recommended.  Run Solr in the traditional way
> with HTTP connectivity.  HTTP overhead on a LAN is usually quite small.
>  Solr is fully thread-safe, so you can have several indexing threads all
> going at the same time.
>
> Indexes at this scale should normally be built with SolrCloud, with
> enough servers so that each machine is only handling one shard replica.
>  The ideal indexing program would be written in Java, using
> CloudSolrServer.
>
> Thanks,
> Shawn
>
>
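
P.S. Regarding running Solr over HTTP and indexing with several threads
through CloudSolrServer: below is a rough sketch of what I plan to try.
The ZooKeeper hosts, collection name, thread count and batch size are
placeholder values, not a tested implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper hosts and collection name.
        final CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("mycollection");

        // One shared, thread-safe client; several workers sending batches.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int t = 0; t < 8; t++) {
            final int worker = t;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                        for (int i = 0; i < 100000; i++) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", "worker" + worker + "-doc" + i);
                            batch.add(doc);
                            if (batch.size() == 1000) {
                                server.add(batch);  // batched adds, not one doc per request
                                batch.clear();
                            }
                        }
                        if (!batch.isEmpty()) {
                            server.add(batch);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);

        server.commit();  // single commit at the end rather than per batch
        server.shutdown();
    }
}

The idea is to share a single CloudSolrServer instance across the worker
threads (since the SolrJ clients are thread-safe) and to send reasonably
large batches instead of one document per request.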
