Limiting my answer to the #shards*rows issue, you can have a look at
https://issues.apache.org/jira/browse/SOLR-5611.

The odds that all the top docs land in the same shard of a uniformly
distributed index are negligible, so you can exploit that to request far
fewer docs per shard.
There's a nice discrete equation that gives you the shards.rows you should
request, depending on the #shards, rows, and the confidence level that all
the top rows are returned (confidence=0.95 means 95% of the responses will
contain exactly the same top rows as if shards.rows were equal to rows).

Our use case: 36 shards, 2000 rows, conf=99% --> shards.rows=49, which gave
a good performance boost.
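
If you want to play with the idea, here is a minimal sketch of how such a
number can be derived. It is my own illustration, not the exact equation
from SOLR-5611 (so it will not necessarily reproduce the figure above):
assume the top `rows` docs spread uniformly across the shards, so each
shard's share is roughly Binomial(rows, 1/#shards), and pick the smallest
shards.rows k whose binomial CDF, raised to the power #shards (treating
shards as independent), clears the confidence level.

import java.util.stream.IntStream;

public class ShardsRowsEstimator {

  /** Smallest k such that P(every shard holds <= k of the top `rows` docs) >= confidence. */
  public static int shardsRows(int shards, int rows, double confidence) {
    double perShardTarget = Math.pow(confidence, 1.0 / shards); // independence approximation
    double p = 1.0 / shards;
    double cdf = 0.0;
    for (int k = 0; k <= rows; k++) {
      // binomial pmf computed in log space so large `rows` does not overflow
      double logPmf = logChoose(rows, k) + k * Math.log(p) + (rows - k) * Math.log1p(-p);
      cdf += Math.exp(logPmf);
      if (cdf >= perShardTarget) return k;
    }
    return rows;
  }

  private static double logChoose(int n, int k) {
    return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
  }

  private static double logFactorial(int n) {
    return IntStream.rangeClosed(2, n).mapToDouble(Math::log).sum();
  }

  public static void main(String[] args) {
    System.out.println(shardsRows(36, 2000, 0.99)); // e.g. 36 shards, 2000 rows, 99% confidence
  }
}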

On Tue, Nov 18, 2014 at 3:24 PM, Ferenczi, Jim | EURHQ <
jim.feren...@mail.rakuten.com> wrote:

> > Your third factoid: A high number of hits/shard suggests that there is
> > a possibility of all the final top-1000 hits originating from a single
> > shard.
> In fact, if you ask for 1000 hits in a distributed SolrCloud, each shard
> has to retrieve 1000 hits to get the unique key of each match and send
> them back to the shard responsible for the merge. This means that even if
> your data is fairly distributed among the 1000 shards, each shard still
> has to decompress 1000 documents during the first phase of the search.
> There are ways to avoid this; for instance you can check this JIRA where
> the idea is discussed:
> https://issues.apache.org/jira/browse/SOLR-5478
> The bottom line is that if you have 1000 shards, the GET_FIELDS stage
> should be fast (if your data is fairly distributed) but the GET_TOP_IDS
> stage is not. You could avoid a lot of decompression/reads by using the
> field cache to retrieve the unique key in the first stage.
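>
> Roughly, the difference looks like this (a sketch only, not the SOLR-5478
> patch itself; it uses the newer doc values API, which superseded the
> field cache, and it assumes the "id" field is stored with doc values):
>
> import java.io.IOException;
> import org.apache.lucene.index.DocValues;
> import org.apache.lucene.index.LeafReader;
> import org.apache.lucene.index.SortedDocValues;
>
> final class UniqueKeyLookup {
>
>   // Expensive path: decompresses the whole stored-fields block around docId.
>   static String idFromStoredFields(LeafReader reader, int docId) throws IOException {
>     return reader.document(docId).get("id");
>   }
>
>   // Cheap path: reads the key column-wise from doc values, no decompression.
>   static String idFromDocValues(LeafReader reader, int docId) throws IOException {
>     SortedDocValues dv = DocValues.getSorted(reader, "id");
>     if (dv.advanceExact(docId)) {
>       return dv.lookupOrd(dv.ordValue()).utf8ToString();
>     }
>     return null; // no value for this doc
>   }
> }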
>
> Cheers,
> Jim
>
>
>
> ________________________________________
> From: Per Steffensen <st...@liace.dk>
> Sent: Tuesday, November 18, 2014 19:26
> To: dev@lucene.apache.org
> Subject: Re: Slow searching limited but high rows across many shards all
> with high hits
>
> On 17/11/14 18:47, Toke Eskildsen wrote:
> > Per Steffensen [st...@liace.dk] wrote:
> > I understand that the request is for rows * #shards IDs+score in
> > total, but if you have presented your alternative, I have failed to
> > see that.
> I deliberately did not present the "solution" we implemented, so that you
> guys would not focus on whether or not this particular solution to the
> problem has already been implemented after 4.4.0 (the version of Apache
> Solr we currently base our version of Solr on). I guess the "problem" can
> be solved in numerous ways, so I just wanted you to focus on whether or
> not it has been solved in some way (I do not care which way).
> > Your third factoid: A high number of hits/shard suggests that there is
> > a possibility of all the final top-1000 hits originating from a single
> > shard.
> I'm not sure what you are aiming at with this comment. But I can say that
> it is very, very unlikely that the overall-top-1000 all originate from a
> single shard. It is likely (since we are not routing on anything that has
> to do with the "content" text-field) that the overall-top-1000 is fairly
> evenly distributed among the 1000 shards.
> > I was about to suggest collapsing to 2 or 3 months/shard, but that
> > would be ruining a logistically nice setup.
> Yes, we are also considering options in that area, but we would really
> prefer not to have to go that way.
>
> There are many additional reasons (besides the ones I mentioned in my
> previous mail). E.g. we are (maybe) about to introduce a bloom filter at
> shard level, which will help us reduce the cost of indexing
> significantly. The bloom filter will help us quickly say "a document with
> this particular id definitely does not exist" when doing optimistic
> locking (including version-lookup). First-iteration tests have shown that
> it can reduce the resources/time spent on indexing by up to 80%.
> Bloom-filter data does not merge very well.
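>
> The idea, roughly (a sketch using Guava's BloomFilter, not our actual
> implementation; the capacity and error rate below are made-up numbers):
>
> import java.nio.charset.StandardCharsets;
> import com.google.common.hash.BloomFilter;
> import com.google.common.hash.Funnels;
>
> final class VersionLookupGate {
>
>   private final BloomFilter<String> seenIds =
>       BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8),
>                          10_000_000, // expected ids per shard (made-up)
>                          0.01);      // 1% false positives (made-up)
>
>   // false means "definitely not indexed", so the version lookup in the
>   // index can be skipped entirely for brand-new documents.
>   boolean mightExist(String docId) {
>     return seenIds.mightContain(docId);
>   }
>
>   void onIndexed(String docId) {
>     seenIds.put(docId);
>   }
> }
>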
> > 5-50 billion records/server? That seems very high, but after hearing
> > about many different Solr setups at Lucene/Solr Revolution, I try to
> > adopt a "sounds insane, but it's probably correct"-mindset.
> We are not in the business of ms response-times for thousands of searches
> per sec/min. We can accept response-times measured in seconds, and there
> are not thousands of searches performed per minute. We are in the
> business of being able to index enormous amounts of data per second,
> though. But this issue is about searches - we really do not like
> 10-30-60 min response-times on searches that ought to run much faster.
> > Anyway, setup accepted, problem acknowledged, your possibly re-usable
> > solution not understood.
> What we did in our solution is the following.
>
> We introduced the concept of a "distributed query algorithm", controlled
> by the request-param "dqa". We named the existing (default) query
> algorithm (not knowing about SOLR-5768)
> "find-id-relevance_fetch-by-ids" (short alias "firfbi"), and we
> introduced a new alternative distributed query algorithm called
> "find-relevance_find-ids-limited-rows_fetch-by-ids" (short alias
> "frfilrfbi" :-) )
> * find-id-relevance_fetch-by-ids does what has always been done
> ** Find (by query) the id and score (score being the measure of
> relevance) for the top-X (1000 in my example) documents on each shard
> ** Sort out the ids of the overall-top-X and group them by shard; ids(S)
> is the set of ids among the overall-top-X that live on shard S
> ** For each shard S, fetch by the ids in ids(S) the full documents (or
> whatever is pointed out by the fl-parameter)
> * find-relevance_find-ids-limited-rows_fetch-by-ids does it differently
> ** Find (by query) the score only (score being the measure of relevance)
> for the top-X (1000 in my example) documents on each shard
> ** Sort out how many documents count(S) of the overall-top-X live on
> each individual shard S (see the sketch just below this list)
> ** For each shard S, fetch (by query) the ids ids(S) of the count(S)
> most relevant documents
> ** For each shard S, fetch by the ids in ids(S) the full documents (or
> whatever is pointed out by the fl-parameter)
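>
> To make step 2 concrete, the merge bookkeeping could be sketched like
> this (a simplified illustration, not our actual patch, which of course
> lives inside Solr's distributed search code):
>
> import java.util.Comparator;
> import java.util.HashMap;
> import java.util.Map;
> import java.util.PriorityQueue;
>
> final class CountPerShard {
>
>   record ScoredShard(float score, String shard) {}
>
>   // Input: each shard's local top-X scores. Output: how many of the
>   // overall top-`rows` docs live on each shard, i.e. count(S).
>   static Map<String, Integer> countPerShard(Map<String, float[]> topScoresByShard, int rows) {
>     PriorityQueue<ScoredShard> all = new PriorityQueue<>(
>         Comparator.comparingDouble((ScoredShard s) -> s.score()).reversed());
>     topScoresByShard.forEach((shard, scores) -> {
>       for (float score : scores) all.add(new ScoredShard(score, shard));
>     });
>     Map<String, Integer> counts = new HashMap<>();
>     for (int i = 0; i < rows && !all.isEmpty(); i++) {
>       counts.merge(all.poll().shard(), 1, Integer::sum);
>     }
>     return counts; // shard S is then asked for its count(S) most relevant ids
>   }
> }
>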
> Since "find score only" (step 1 of
> find-relevance_find-ids-limited-rows_fetch-by-ids) actually does not
> have to go into the store to fetch anything (id not needed), it can be
> optimized to perform much much better than step 1 in
> find-id-relevance_fetch-by-ids (id needed). I step 3 of
> find-relevance_find-ids-limited-rows_fetch-by-ids, when you have to go
> to store, we are not asking for 1000 docs per shard, but only the number
> of documents among the overall-top-1000 documents that live on this
> particular shard. This way we go from potentially visiting the store for
> 1 mio docs across the cluster, to never visiting the store for more than
> 1000 docs across the cluster. In our particular test-setup (which
> simulates our production environment pretty well) it has given us an
> total response-time reduction of a factor 60
>
> I believe SOLR-5768 (without having looked at it yet) has made the
> existing distributed query algorithm (what we call
> find-id-relevance_fetch-by-ids) do the following when the
> distrib.singlePass parameter is sent
> * Find (by query) the full documents (or whatever is pointed out by the
> fl-parameter) for the top-X (1000 in my example) documents on each shard
> In the same way, we plan to see if we can use the SOLR-5768 solution and
> make find-relevance_find-ids-limited-rows_fetch-by-ids do the following
> when the distrib.singlePass parameter is sent
> ** Find (by query) the score only (score being the measure of relevance)
> for the top-X (1000 in my example) documents on each shard
> ** Sort out how many documents count(S) of the overall-top-X live on
> each individual shard S
> ** For each shard S, fetch (by query) the full documents (or whatever is
> pointed out by the fl-parameter) for the count(S) most relevant documents
> In this case distrib.singlePass will be a misnomer, because it makes
> find-relevance_find-ids-limited-rows_fetch-by-ids go from 3 to 2 passes
> rather than to 1. So we might want to rename it to distrib.skipIdFetch or
> something.
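>
> For illustration, selecting between these could look like this from
> SolrJ. Note that the "dqa" parameter and its values exist only in our
> in-house patch (hypothetical as far as stock Solr is concerned), while
> distrib.singlePass is the parameter from SOLR-5768:
>
> import org.apache.solr.client.solrj.SolrClient;
> import org.apache.solr.client.solrj.SolrQuery;
> import org.apache.solr.client.solrj.impl.HttpSolrClient;
> import org.apache.solr.client.solrj.response.QueryResponse;
>
> public class DqaQueryExample {
>   public static void main(String[] args) throws Exception {
>     try (SolrClient client =
>              new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build()) {
>       SolrQuery q = new SolrQuery("content:foo");
>       q.setRows(1000);
>       q.set("dqa", "frfilrfbi");         // hypothetical: our in-house parameter
>       q.set("distrib.singlePass", true); // SOLR-5768: skip the separate id fetch
>       QueryResponse rsp = client.query(q);
>       System.out.println(rsp.getResults().getNumFound());
>     }
>   }
> }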
>
> I hope you get the idea, and see why it makes us perform much, much
> better!
>
> Regards, Per Steffensen
