You're missing the point of my comment. Since those fields already
have docValues, you can use the /export handler to get the results
back as a _stream_ and avoid all of the overhead of the aggregator
node doing a merge sort and all of that.

You'll have to do this from SolrJ, but see CloudSolrStream. You can
see examples of its usage in StreamingTest.java.

This should:
1> complete much, much faster. The design goal is 400K rows/second, but YMMV.
2> use vastly less memory on your Solr instances.
3> only require _one_ query.
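A rough (untested) sketch of what that looks like, modeled on
StreamingTest.java; the zkHost, collection, and field names below are
placeholders for your own values, and note that /export requires
docValues on every field in fl and sort:

```java
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
import org.apache.solr.common.params.ModifiableSolrParams;

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("q", "*:*");
params.set("qt", "/export");                       // stream straight off docValues, no paging
params.set("fl", "id,fieldA,fieldB");              // all fields must have docValues
params.set("sort", "id asc,fieldA asc,fieldB asc");

CloudSolrStream stream = new CloudSolrStream("zkhost:2181", "collection1", params);
try {
  stream.open();
  while (true) {
    Tuple tuple = stream.read();
    if (tuple.EOF) {
      break;                                       // end of the stream
    }
    // process tuple.getString("id"), etc.
  }
} finally {
  stream.close();
}
```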

Best,
Erick

On Wed, Apr 12, 2017 at 7:36 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 4/12/2017 5:19 PM, Chetas Joshi wrote:
>> I am getting back 100K results per page.
>> The fields have docValues enabled and I am getting sorted results based on 
>> "id" and 2 more fields (String: 32 Bytes and Long: 8 Bytes).
>>
>> I have a SolrCloud of 80 nodes. There will be one aggregator node that will 
>> get the top 100K docs from each shard and apply a merge sort. So, the max 
>> memory usage of that node could be 40 bytes * 100K * 80 = 320 MB. Why would 
>> heap memory usage shoot up from 8 GB to 17 GB?
>
> From what I understand, Java overhead for a String object is 56 bytes
> above the actual byte size of the string itself.  And each character in
> the string will be two bytes -- Java uses UTF-16 for character
> representation internally.  If I'm right about these numbers, it means
> that each of those id values will take 120 bytes -- and that doesn't
> include the size of the actual response (XML, JSON, etc).
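For a 32-character id, that estimate works out to 56 + 32*2 = 120 bytes. A
quick sanity check of the arithmetic (the 56-byte overhead is Shawn's
estimate above, not a measured value):

```java
// Back-of-envelope per-String memory estimate:
// ~56 bytes of JVM object overhead plus 2 bytes per char (UTF-16).
public class StringFootprint {
    static long estimateBytes(int chars) {
        final long OBJECT_OVERHEAD = 56; // estimated object + char[] overhead
        final long BYTES_PER_CHAR = 2;   // one UTF-16 code unit per char
        return OBJECT_OVERHEAD + BYTES_PER_CHAR * chars;
    }

    public static void main(String[] args) {
        System.out.println(estimateBytes(32)); // 32-char id -> 120
    }
}
```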
>
> I don't know what the overhead for a long is, but you can be sure that
> it's going to take more than eight bytes total memory usage for each one.
>
> Then there is overhead for all the Lucene memory structures required to
> execute the query and gather results, plus Solr memory structures to
> keep track of everything.  I have absolutely no idea how much memory
> Lucene and Solr use to accomplish a query, but it's not going to be
> small when you have 200 million documents per shard.
>
> Speaking of Solr memory requirements, under normal query circumstances
> the aggregating node is going to receive at least 100K results from
> *every* shard in the collection, which it will condense down to the
> final result with 100K entries.  The behavior during a cursor-based
> request may be more memory-efficient than what I have described, but I
> am unsure whether that is the case.
>
> If the cursor behavior is not more efficient, then each entry in those
> results will contain the uniqueKey value and the score.  That's going to
> be many megabytes for every shard.  If there are 80 shards, it would
> probably be over a gigabyte for one request.
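As a back-of-envelope check, counting the ids alone (ignoring the score and
any per-entry container overhead, so the real total would be higher):

```java
// Aggregate estimate: ~120 bytes per id (the per-String figure above),
// 100K entries per shard, 80 shards, all held on the aggregating node.
public class MergeMemoryEstimate {
    static long totalBytes(long bytesPerId, long entriesPerShard, long shards) {
        return bytesPerId * entriesPerShard * shards;
    }

    public static void main(String[] args) {
        long total = totalBytes(120L, 100_000L, 80L);
        // ids alone are ~915 MB; score and overhead push it past a gigabyte
        System.out.println(total / (1024 * 1024) + " MB for ids alone");
    }
}
```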
>
> Thanks,
> Shawn
>