Thanks for your detailed answer, @Shawn.

Yes, I run the query in SolrCloud mode. My collection has 20 shards, and
each shard is 30~50GB.
There are 4 Solr servers, and each Solr JVM uses a 6GB heap. There are also
4 HDFS datanodes, and each datanode JVM uses 2.5GB.
The Linux hosts are 4 nodes as well; each node has 16 cores, 32GB RAM, and a 1600GB SSD.
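
A quick sketch of what that leaves for caching on each host, using only the
figures above (it ignores other processes and OS overhead, so the real
headroom is smaller):

    # Current per-node memory headroom, from the numbers in this thread.
    ram_gb = 32                 # RAM per Linux host
    solr_heap_gb = 6            # Solr JVM heap per host
    datanode_heap_gb = 2.5      # HDFS datanode JVM heap per host

    index_total_gb = 787        # index size reported by HDFS
    hosts = 4

    free_for_cache_gb = ram_gb - solr_heap_gb - datanode_heap_gb   # ~23.5 GB
    index_per_host_gb = index_total_gb / hosts                      # ~197 GB

    print(f"RAM free for OS/HDFS caching: ~{free_for_cache_gb:.1f} GB per host")
    print(f"Index data per host:          ~{index_per_host_gb:.0f} GB")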

So, in order to search 2 billion docs quickly (HDFS shows 787GB), I should
turn on autowarming. How much RAM per Solr node, and how many Solr nodes,
would I need? Is there a rough formula for budgeting this?
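
As a rough back-of-envelope sketch (this assumes the usual advice that the
OS page cache or HDFS block cache should hold a large fraction of the
index; it is not an official Solr sizing formula, and the target fractions
are just examples):

    # How much free RAM (beyond the JVM heaps) is needed to cache a chosen
    # fraction of the index, and how many hosts of the current size that
    # implies. The fractions are assumptions, not recommendations.
    index_total_gb = 787
    free_cache_per_host_gb = 32 - 6 - 2.5   # RAM minus Solr and datanode heaps

    for fraction in (0.25, 0.50, 1.00):
        cache_needed_gb = index_total_gb * fraction
        hosts_needed = cache_needed_gb / free_cache_per_host_gb
        print(f"cache {fraction:.0%} of the index -> ~{cache_needed_gb:.0f} GB free RAM"
              f" -> ~{hosts_needed:.0f} hosts of the current size")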

Thanks again ~
TinsWzy



Shawn Heisey <apa...@elyograg.org> wrote on Thursday, August 23, 2018, at 6:19 PM:

> On 8/23/2018 4:03 AM, Shawn Heisey wrote:
> > Configuring caches cannot speed up the first time a query runs.  That
> > speeds up later runs.  To speed up the first time will require two
> > things:
> >
> > 1) Ensuring that there is enough memory in the system for the
> > operating system to effectively cache the index.  This is memory
> > *beyond* the java heap that is not allocated to any program.
>
> Followup, after fully digesting the latest reply:
>
> HDFS changes things a little bit.  You would need to talk to somebody
> about caching HDFS data effectively.  I think that in that case, you
> *do* need to use the heap to create a large HDFS client cache, but I
> have no personal experience with HDFS, so I do not know for sure.  Note
> that having a very large heap can make garbage collection pauses become
> extreme.
>
> With 2 billion docs, I'm assuming that you're running SolrCloud and that
> the index is sharded.  SolrCloud gives you query load balancing for
> free.  But I think you're probably going to need a lot more than 4
> servers, and each server is probably going to need a lot of memory.  You
> haven't indicated how many shards or replicas are involved here.  For
> optimal performance, every shard needs to be on a separate server.
>
> Searching 2 billion docs, especially with wildcards, may not be possible
> to get working REALLY fast.  Without a LOT of hardware, particularly
> memory, it can be completely impractical to cache that much data.
> Terabytes of memory is *very* expensive, especially if it's scattered
> across many servers.
>
> Thanks,
> Shawn
>
>
