On 4/12/2017 5:19 PM, Chetas Joshi wrote:
> I am getting back 100K results per page. The fields have docValues
> enabled and I am getting sorted results based on "id" and 2 more
> fields (String: 32 Bytes and Long: 8 Bytes).
>
> I have a solr Cloud of 80 nodes. There will be one shard that will
> get top 100K docs from each shard and apply merge sort. So, the max
> memory usage of any shard could be 40 bytes * 100K * 80 = 320 MB.
> Why would heap memory usage shoot up from 8 GB to 17 GB?
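The quoted arithmetic counts only raw field widths; JVM object overhead changes the picture considerably. A back-of-envelope sketch (the ~56-byte per-String overhead and 2 bytes per UTF-16 char are the figures discussed below; exact overheads vary by JVM version and settings):

```java
// Sketch of the memory math from this thread. Assumes ~56 bytes of JVM
// overhead per String (object header, char[] header, fields) and 2 bytes
// per character (UTF-16). These are estimates, not exact JVM figures.
public class MemoryEstimate {
    static final int ROWS = 100_000;  // results per page
    static final int SHARDS = 80;

    // Naive estimate from the question: raw field widths only
    // (32-byte string + 8-byte long = 40 bytes per entry).
    static long naiveBytes() {
        return 40L * ROWS * SHARDS;
    }

    // Estimate for the String field alone, including object overhead:
    // 56 bytes overhead + 32 chars * 2 bytes = 120 bytes per id value.
    static long stringBytes() {
        return (56L + 32 * 2) * ROWS * SHARDS;
    }

    public static void main(String[] args) {
        System.out.println("naive estimate:  " + naiveBytes() / 1_000_000 + " MB");
        System.out.println("with overhead:   " + stringBytes() / 1_000_000 + " MB");
    }
}
```

The string field alone triples the naive estimate before counting the long values, the response representation, or any of the Lucene/Solr bookkeeping described below.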
From what I understand, the Java overhead for a String object is 56 bytes on top of the actual byte size of the string itself, and each character in the string takes two bytes -- Java uses UTF-16 for character representation internally. If I'm right about these numbers, each of those id values will take 120 bytes (56 + 32*2) -- and that doesn't include the size of the actual response (xml, json, etc). I don't know what the overhead for a long is, but you can be sure that it's going to take more than eight bytes of total memory for each one.

Then there is overhead for all the Lucene memory structures required to execute the query and gather results, plus the Solr memory structures needed to keep track of everything. I have absolutely no idea how much memory Lucene and Solr use to accomplish a query, but it's not going to be small when you have 200 million documents per shard.

Speaking of Solr memory requirements: under normal query circumstances, the aggregating node is going to receive at least 100K results from *every* shard in the collection, which it will condense down to the final result with 100K entries. The behavior during a cursor-based request may be more memory-efficient than what I have described, but I am unsure whether that is the case. If the cursor behavior is not more efficient, then each entry in those results will contain the uniqueKey value and the score. That's going to be many megabytes for every shard -- with 80 shards, it would probably be over a gigabyte for one request.

Thanks,
Shawn
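To make the aggregation step above concrete, here is a minimal sketch (hypothetical code, not Solr's actual implementation) of how an aggregating node can merge pre-sorted per-shard results down to a single top-N page with a k-way merge:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: k-way merge of sorted per-shard result lists.
// Each shard has already sorted its own results; the aggregator only
// needs to interleave them and keep the first `rows` entries.
public class ShardMerge {
    static List<String> mergeTopN(List<List<String>> shardResults, int rows) {
        // Heap entry: {shard index, position within that shard's list},
        // ordered by the sort key at that position.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> shardResults.get(e[0]).get(e[1])));
        for (int s = 0; s < shardResults.size(); s++) {
            if (!shardResults.get(s).isEmpty()) {
                heap.add(new int[]{s, 0});
            }
        }
        List<String> merged = new ArrayList<>(rows);
        while (!heap.isEmpty() && merged.size() < rows) {
            int[] top = heap.poll();
            List<String> shard = shardResults.get(top[0]);
            merged.add(shard.get(top[1]));
            if (top[1] + 1 < shard.size()) {
                heap.add(new int[]{top[0], top[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> shards = List.of(
            List.of("a", "d", "g"), List.of("b", "e"), List.of("c", "f"));
        System.out.println(mergeTopN(shards, 4)); // [a, b, c, d]
    }
}
```

Note that even though only `rows` entries survive, the aggregator still has to hold candidate entries from every shard in memory at once, which is where the per-request gigabyte estimate above comes from.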