You're missing the point of my comment. Since those fields already have docValues enabled, you can use the /export functionality to get the results back as a _stream_ and avoid all the overhead of the aggregator node doing a merge sort and all of that.
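A minimal SolrJ sketch of that streaming approach (a sketch only: the zkHost, collection, and field names are placeholders, and the exact CloudSolrStream constructor and StreamContext wiring vary a bit between Solr releases):

```java
import java.io.IOException;

import org.apache.solr.client.solrj.io.SolrClientCache;
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
import org.apache.solr.client.solrj.io.stream.StreamContext;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ExportStreamExample {
    public static void main(String[] args) throws IOException {
        String zkHost = "localhost:9983";     // placeholder: your ZK ensemble
        String collection = "mycollection";   // placeholder: your collection

        // qt=/export hits the export handler; fl and sort must use
        // docValues fields, which is already the case here.
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("q", "*:*");
        params.set("qt", "/export");
        params.set("fl", "id,field_s,field_l");  // placeholder field names
        params.set("sort", "id asc");

        CloudSolrStream stream = new CloudSolrStream(zkHost, collection, params);
        StreamContext context = new StreamContext();
        context.setSolrClientCache(new SolrClientCache());
        stream.setStreamContext(context);
        try {
            stream.open();
            Tuple tuple = stream.read();
            while (!tuple.EOF) {
                // One document at a time off the wire -- no 100K-row pages
                // buffered on an aggregator node.
                System.out.println(tuple.getString("id"));
                tuple = stream.read();
            }
        } finally {
            stream.close();
        }
    }
}
```

StreamingTest.java in the Solr source shows the same pattern with more variations.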
You'll have to do this from SolrJ, but see CloudSolrStream. You can see examples of its usage in StreamingTest.java. This should:
1> complete much, much faster. The design goal is 400K rows/second, but YMMV.
2> use vastly less memory on your Solr instances.
3> only require _one_ query.

Best,
Erick

On Wed, Apr 12, 2017 at 7:36 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 4/12/2017 5:19 PM, Chetas Joshi wrote:
>> I am getting back 100K results per page.
>> The fields have docValues enabled and I am getting sorted results based on
>> "id" and 2 more fields (String: 32 bytes and Long: 8 bytes).
>>
>> I have a Solr Cloud of 80 nodes. There will be one shard that will get the
>> top 100K docs from each shard and apply a merge sort. So, the max memory
>> usage of any shard could be 40 bytes * 100K * 80 = 320 MB. Why would heap
>> memory usage shoot up from 8 GB to 17 GB?
>
> From what I understand, the Java overhead for a String object is 56 bytes
> above the actual byte size of the string itself, and each character in
> the string takes two bytes -- Java uses UTF-16 for character
> representation internally. If I'm right about these numbers, each of
> those id values will take 120 bytes -- and that doesn't include the size
> of the actual response (xml, json, etc).
>
> I don't know what the overhead for a long is, but you can be sure that
> it's going to take more than eight bytes of total memory for each one.
>
> Then there is overhead for all the Lucene memory structures required to
> execute the query and gather results, plus Solr memory structures to
> keep track of everything. I have absolutely no idea how much memory
> Lucene and Solr use to accomplish a query, but it's not going to be
> small when you have 200 million documents per shard.
>
> Speaking of Solr memory requirements, under normal query circumstances
> the aggregating node is going to receive at least 100K results from
> *every* shard in the collection, which it will condense down to the
> final result with 100K entries. The behavior during a cursor-based
> request may be more memory-efficient than what I have described, but I
> am unsure whether that is the case.
>
> If the cursor behavior is not more efficient, then each entry in those
> results will contain the uniqueKey value and the score. That's going to
> be many megabytes for every shard. If there are 80 shards, it would
> probably be over a gigabyte for one request.
>
> Thanks,
> Shawn
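Putting rough numbers on Shawn's estimates (a back-of-envelope sketch only, assuming his 56-byte String overhead figure and an assumed 8 bytes per score; real JVM figures vary by version and architecture):

```python
# Per-id cost of a 32-character Java String, per Shawn's estimate:
# 56 bytes of object overhead + 2 bytes per UTF-16 character.
string_overhead = 56
id_chars = 32
bytes_per_id = string_overhead + 2 * id_chars    # 120 bytes per id

# Aggregation cost: 100K entries from each of 80 shards, each entry
# holding the uniqueKey plus a score (8 bytes is an assumption).
entries_per_shard = 100_000
shards = 80
bytes_per_entry = bytes_per_id + 8               # 128 bytes per entry

per_shard = entries_per_shard * bytes_per_entry  # bytes from one shard
total = per_shard * shards                       # bytes for one request

print(bytes_per_id)  # 120
print(per_shard)     # 12800000 -- "many megabytes for every shard"
print(total)         # 1024000000 -- "over a gigabyte for one request"
```

This lines up with the numbers in the thread: roughly 12.8 MB per shard and about 1 GB on the aggregating node for a single request, before any Lucene or Solr bookkeeping overhead.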