On 18/11/14 14:24, Ferenczi, Jim | EURHQ wrote:
> Your third factoid: A high number of hits/shard suggests the possibility
> that all of the final top-1000 hits originate from a single shard.
> In fact, if you ask for 1000 hits in a distributed SolrCloud, each shard has to
> retrieve 1000 hits to get the unique key of each match and send it back to the
> shard responsible for the merge.
Yes, at least if each shard has 1000 hits. It is when each shard has a
lot of actual hits that this "issue" becomes a problem.
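A minimal sketch of that first-phase merge (hypothetical function and variable names, not Solr's actual code): each shard returns its own top-`rows` candidates, and the coordinating shard merges them into the global top-`rows`.

```python
import heapq

def merge_shard_results(shard_results, rows=1000):
    """Merge per-shard (score, unique_key) lists into the global top-`rows`.

    Each shard must contribute `rows` candidates, because in the worst
    case all of the final hits come from a single shard. Each shard's
    list is assumed to be sorted by descending score already.
    """
    merged = heapq.merge(*shard_results, key=lambda t: -t[0])
    return list(merged)[:rows]

# Hypothetical data: 3 shards, each returning (score, unique_key) pairs.
shards = [
    [(0.9, "a1"), (0.5, "a2")],
    [(0.8, "b1"), (0.7, "b2")],
    [(0.6, "c1"), (0.4, "c2")],
]
top = merge_shard_results(shards, rows=4)
# top == [(0.9, "a1"), (0.8, "b1"), (0.7, "b2"), (0.6, "c1")]
```

Note that even though only four results are kept here, every shard had to resolve the unique key for each of its candidates, which is where the stored-field decompression cost comes from.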
> This means that even if your data is fairly distributed among the 1000 shards,
> they still have to decompress 1000 documents during the first phase of the
> search.
Exactly!
> There are ways to avoid this, for instance you can check this JIRA where the
> idea is discussed:
> https://issues.apache.org/jira/browse/SOLR-5478
I guess our solution (described in my previous mail) is kind of an
alternative to SOLR-5478.
> Bottom line is that if you have 1000 shards the GET_FIELDS stage should be fast
> (if your data is fairly distributed) but the GET_TOP_IDS is not.
Exactly!
> You could avoid a lot of decompression/reads by using the field cache to
> retrieve the unique key in the first stage.
We have so much data and relatively little RAM that we cannot use the
field cache: it requires an amount of memory linearly dependent on the
number of docs in the index, and we can never fulfill that requirement.
DocValues is a valid approach for us, but our id field is unfortunately
not currently a docValues field, and it is not easy for us to just
re-index all documents with id as docValues. Besides that, our solution
is orthogonal to a field-cache/docValues solution: one does not prevent
the other, and if you do one of them you will still be able to benefit
from doing the other one.
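For reference, making the unique key a docValues field would just be a schema.xml change like the following (a sketch; as noted above, the catch is that all documents must then be re-indexed):

```xml
<!-- docValues="true" lets the unique key be read column-wise in the
     first phase, instead of decompressing stored documents -->
<field name="id" type="string" indexed="true" stored="true" docValues="true"/>
```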
> Cheers,
> Jim
Thanks, Jim