Thanks for your detailed answer @Shawn. Yes, I run the query in SolrCloud mode, and my collection has 20 shards; each shard is 30-50GB. There are 4 Solr servers, and each Solr JVM uses 6GB of heap. There are also 4 HDFS datanodes, each with a 2.5GB JVM heap. The Linux hosts are 4 nodes as well; each node has 16 cores / 32GB RAM / 1600GB SSD.
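To make Shawn's point about memory concrete, here is a rough back-of-the-envelope sketch for the cluster described above. The OS-overhead figure is my own assumption, not something stated in the thread; adjust all of the constants to your actual setup:

```python
# Back-of-the-envelope memory budget for the 4-node cluster described above.
NODES = 4
RAM_PER_NODE_GB = 32
SOLR_HEAP_GB = 6            # Solr JVM heap per node (from the post)
DATANODE_HEAP_GB = 2.5      # HDFS datanode JVM heap per node (from the post)
OS_OVERHEAD_GB = 2          # assumed allowance for the OS and other processes

INDEX_SIZE_GB = 787         # total index size reported by HDFS

# RAM left over per node is what the OS (or an HDFS client cache) can
# actually use to cache index data.
cache_per_node = RAM_PER_NODE_GB - SOLR_HEAP_GB - DATANODE_HEAP_GB - OS_OVERHEAD_GB
total_cache = NODES * cache_per_node

print(f"RAM left for caching per node: {cache_per_node} GB")
print(f"Cluster-wide cache capacity:  {total_cache} GB")
print(f"Fraction of index cacheable:  {total_cache / INDEX_SIZE_GB:.0%}")
```

On these numbers only about 86GB of RAM is available cluster-wide to cache a 787GB index (roughly 11%), which illustrates why Shawn says caching that much data takes a lot more hardware than 4 servers.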
So, in order to search 2 billion docs fast (HDFS shows 787GB), should I turn on autowarming, and how much RAM / how many Solr nodes would I need? Is there a rough formula for budgeting? Thanks again ~ TinsWzy

Shawn Heisey <apa...@elyograg.org> wrote on Thu, Aug 23, 2018 at 6:19 PM:
> On 8/23/2018 4:03 AM, Shawn Heisey wrote:
> > Configuring caches cannot speed up the first time a query runs. That
> > speeds up later runs. To speed up the first time will require two
> > things:
> >
> > 1) Ensuring that there is enough memory in the system for the
> > operating system to effectively cache the index. This is memory
> > *beyond* the java heap that is not allocated to any program.
>
> Followup, after fully digesting the latest reply:
>
> HDFS changes things a little bit. You would need to talk to somebody
> about caching HDFS data effectively. I think that in that case, you
> *do* need to use the heap to create a large HDFS client cache, but I
> have no personal experience with HDFS, so I do not know for sure. Note
> that having a very large heap can make garbage collection pauses become
> extreme.
>
> With 2 billion docs, I'm assuming that you're running SolrCloud and that
> the index is sharded. SolrCloud gives you query load balancing for
> free. But I think you're probably going to need a lot more than 4
> servers, and each server is probably going to need a lot of memory. You
> haven't indicated how many shards or replicas are involved here. For
> optimal performance, every shard needs to be on a separate server.
>
> Searching 2 billion docs, especially with wildcards, may not be possible
> to get working REALLY fast. Without a LOT of hardware, particularly
> memory, it can be completely impractical to cache that much data.
> Terabytes of memory is *very* expensive, especially if it's scattered
> across many servers.
>
> Thanks,
> Shawn
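For anyone following the thread: autowarming and the HDFS block cache are both configured in solrconfig.xml. Below is a minimal sketch, not a tuned recommendation; the autowarmCount values and slab count are assumptions you would need to size against your own heap (each block-cache slab is about 128MB by default, and it comes out of the Solr JVM's direct memory, so a 6GB heap will not accommodate many slabs):

```xml
<!-- Sketch only: illustrative values, not a recommendation. -->

<!-- Autowarming repopulates caches from the old searcher after a commit,
     so the *first* queries against a new searcher are not cold. -->
<filterCache class="solr.FastLRUCache"
             size="512" initialSize="512"
             autowarmCount="128"/>   <!-- assumed value; tune for your load -->
<queryResultCache class="solr.LRUCache"
                  size="512" initialSize="512"
                  autowarmCount="64"/>

<!-- HDFS client-side block cache, per Shawn's note that caching must
     happen in the Solr process when the index lives on HDFS. -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">4</int>  <!-- assumed; ~128MB per slab -->
</directoryFactory>
```

Note that autowarming only helps queries repeat faster after commits; it does not change the fundamental RAM-versus-index-size problem Shawn describes above.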