Limiting my answer to the #shards*rows issue: have a look at
https://issues.apache.org/jira/browse/SOLR-5611.

The odds that all top docs sit in the same shard of a uniformly distributed
index are negligible, so you can use that fact to request far fewer docs per
shard. There is a nice discrete equation that gives you the shards.rows you
should request, depending on #shards, rows, and a confidence level that the
exact top rows are returned (confidence=0.95 means 95% of the responses will
contain exactly the top rows you would get if shards.rows were equal to
rows). Our use case: 36 shards, 2000 rows, conf=99% --> shards.rows=49,
which gave us a good performance boost.
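For a feel of what such an equation computes, here is a minimal sketch. It
assumes the global top-`rows` docs spread uniformly, so one shard's share is
Binomial(rows, 1/#shards), and it bounds the failure probability with a
union bound over shards. All class/method names are made up, and the actual
equation attached to SOLR-5611 may well be tighter:

```java
// Illustrative only: the shape of a shards.rows bound under a uniform-
// distribution assumption. NOT the SOLR-5611 equation itself.
public final class ShardsRowsSketch {

    // P(X > k) for X ~ Binomial(n, p): the chance a single shard holds
    // more than k of the global top-n documents.
    static double binomialTail(int n, double p, int k) {
        double logP = Math.log(p);
        double logQ = Math.log1p(-p);
        double tail = 0.0;
        double logC = 0.0;                     // log C(n, 0)
        for (int i = 0; i <= n; i++) {
            if (i > k) {
                tail += Math.exp(logC + i * logP + (n - i) * logQ);
            }
            if (i < n) {                       // advance to log C(n, i + 1)
                logC += Math.log(n - i) - Math.log(i + 1);
            }
        }
        return tail;
    }

    // Smallest per-shard k such that, by a union bound over all shards,
    // the probability that ANY shard holds more than k of the global
    // top-`rows` docs stays below 1 - confidence.
    static int shardsRows(int shards, int rows, double confidence) {
        for (int k = 0; k <= rows; k++) {
            if (shards * binomialTail(rows, 1.0 / shards, k) <= 1.0 - confidence) {
                return k;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // The inputs quoted above; this sketch need not reproduce 49.
        System.out.println(shardsRows(36, 2000, 0.99));
    }
}
```

Note that the union bound makes this version conservative; run with the
numbers above it lands well above 49, so treat it as the shape of the
computation, not as the JIRA's formula.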
On Tue, Nov 18, 2014 at 3:24 PM, Ferenczi, Jim | EURHQ
<jim.feren...@mail.rakuten.com> wrote:

> > Your third factoid: A high number of hits/shard, suggests that there is
> > a possibility of all the final top-1000 hits to originate from a single
> > shard.
>
> In fact, if you ask for 1000 hits in a distributed SolrCloud, each shard
> has to retrieve 1000 hits to get the unique key of each match and send it
> back to the shard responsible for the merge. This means that even if your
> data is fairly distributed among the 1000 shards, they still have to
> decompress 1000 documents during the first phase of the search. There are
> ways to avoid this; for instance you can check this JIRA where the idea
> is discussed:
> https://issues.apache.org/jira/browse/SOLR-5478
> Bottom line is that if you have 1000 shards the GET_FIELDS stage should
> be fast (if your data is fairly distributed) but the GET_TOP_IDS stage is
> not. You could avoid a lot of decompression/reads by using the field
> cache to retrieve the unique key in the first stage.
>
> Cheers,
> Jim
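To make Jim's point concrete, here is roughly what the two first-phase
reads look like against the (modern) Lucene API - a sketch, not Solr's
actual code; the uniqueKey field name "id" and the use of SortedDocValues
are assumptions:

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;

final class UniqueKeyLookup {

    // Stored-fields route (what GET_TOP_IDS effectively pays for today):
    // fetching even one field decompresses a whole stored-fields block.
    static String keyFromStoredFields(IndexSearcher searcher, ScoreDoc hit)
            throws IOException {
        return searcher.doc(hit.doc).get("id");
    }

    // Field-cache/docValues route (the SOLR-5478 idea): a columnar read of
    // just the unique key, with no stored-field decompression at all.
    static String keyFromDocValues(IndexSearcher searcher, ScoreDoc hit)
            throws IOException {
        List<LeafReaderContext> leaves = searcher.getIndexReader().leaves();
        LeafReaderContext leaf = leaves.get(ReaderUtil.subIndex(hit.doc, leaves));
        SortedDocValues ids = DocValues.getSorted(leaf.reader(), "id");
        if (!ids.advanceExact(hit.doc - leaf.docBase)) {
            return null; // no "id" on this doc (should not happen for a uniqueKey)
        }
        return ids.lookupOrd(ids.ordValue()).utf8ToString();
    }
}
```

The second variant never touches the compressed stored-fields blocks, which
is exactly the saving SOLR-5478 is after in the first phase.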
> ________________________________________
> From: Per Steffensen <st...@liace.dk>
> Sent: Tuesday, November 18, 2014 19:26
> To: dev@lucene.apache.org
> Subject: Re: Slow searching limited but high rows across many shards all
> with high hits
>
> On 17/11/14 18:47, Toke Eskildsen wrote:
> > Per Steffensen [st...@liace.dk] wrote:
> > I understand that the request is for rows * #shards IDs+score in
> > total, but if you have presented your alternative, I have failed to
> > see that.
> I deliberately did not present the "solution" we did, in order for you
> guys not to focus on whether or not this particular solution to the
> problem has already been implemented after 4.4.0 (the version of Apache
> Solr we currently base our version of Solr on). I guess the "problem" can
> be solved in numerous ways, so I just wanted you to focus on whether or
> not it has been solved in some way (I do not care which way).
>
> > Your third factoid: A high number of hits/shard, suggests that there
> > is a possibility of all the final top-1000 hits to originate from a
> > single shard.
> I'm not sure what you are aiming at with this comment. But I can say
> that it is very, very unlikely that the overall-top-1000 all originate
> from a single shard. It is likely (since we are not routing on anything
> that has to do with the "content" text-field) that the overall-top-1000
> is fairly evenly distributed among the 1000 shards.
>
> > I was about to suggest collapsing to 2 or 3 months/shard, but that
> > would be ruining a logistically nice setup.
> Yes, we are also considering options in that area, but we would really
> like not to have to go this way. There are many additional reasons
> (besides the ones I mentioned in my previous mail). E.g. we are (maybe)
> about to introduce a bloom-filter on shard-level, which will help us
> improve indexing performance significantly. The bloom-filter will help
> quickly say "a document with this particular id definitely does not
> exist" when doing optimistic locking (including version-lookup).
> First-iteration tests have shown that it can reduce the resources/time
> spent on indexing by up to 80%. Bloom-filter data does not merge very
> well.
>
> > 5-50 billion records/server? That seems very high, but after hearing
> > about many different Solr setups at Lucene/Solr Revolution, I try to
> > adopt a "sounds insane, but it's probably correct"-mindset.
> We are not in the business of ms-response-times or thousands of searches
> per sec/min. We can accept response-times measured in secs, and there
> are not thousands of searches performed per minute. We are in the
> business of being able to index enormous amounts of data per second,
> though. But this issue is about searches - we really do not like
> 10-30-60 min response-times on searches that ought to run much faster.
>
> > Anyway, setup accepted, problem acknowledged, your possibly re-usable
> > solution not understood.
> What we did in our solution is the following:
>
> We introduced the concept of a "distributed query algorithm", controlled
> by the request-param "dqa". We name the existing (default)
> query-algorithm (not knowing about SOLR-5768)
> "find-id-relevance_fetch-by-ids" (short-alias "firfbi") and we introduce
> a new alternative distributed query algorithm called
> "find-relevance_find-ids-limited-rows_fetch-by-ids" (short-alias
> "frfilrfbi" :-) ).
>
> * find-id-relevance_fetch-by-ids does as always:
> ** Find (by query) id and score (score is the measurement for relevance)
> for the top-X (1000 in my example) documents on each shard
> ** Sort out the ids of the overall-top-X and group them by shard; ids(S)
> is the set of ids among the overall-top-X that live on shard S
> ** For each shard S, fetch by the ids in ids(S) the full documents (or
> whatever is pointed out by the fl-parameter)
>
> * find-relevance_find-ids-limited-rows_fetch-by-ids does it in a
> different way:
> ** Find (by query) score only (score is the measurement for relevance)
> for the top-X (1000 in my example) documents on each shard
> ** Sort out how many documents count(S) of the overall-top-X documents
> live on each individual shard S
> ** For each shard S, fetch (by query) the ids ids(S) of the count(S)
> most relevant documents
> ** For each shard S, fetch by the ids in ids(S) the full documents (or
> whatever is pointed out by the fl-parameter)
>
> Since "find score only" (step 1 of
> find-relevance_find-ids-limited-rows_fetch-by-ids) does not have to go
> into the store to fetch anything (the id is not needed), it can be
> optimized to perform much, much better than step 1 of
> find-id-relevance_fetch-by-ids (where the id is needed). In step 3 of
> find-relevance_find-ids-limited-rows_fetch-by-ids, where you do have to
> go to the store, we are not asking for 1000 docs per shard, but only for
> the number of documents among the overall-top-1000 that live on this
> particular shard. This way we go from potentially visiting the store for
> 1 million docs across the cluster to never visiting the store for more
> than 1000 docs across the cluster.
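The merge step of the new algorithm, as I read the description above, boils
down to something like the sketch below (my reconstruction with invented
names, not Per's actual code): phase 1 ships only (shard, score) pairs, and
the merger derives count(S) so that phase 3 can ask shard S for exactly
count(S) ids.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class LimitedRowsMerge {

    // One phase-1 response entry: which shard it came from, and the score.
    record ShardScore(String shard, float score) {}

    // Keep the global top-x of all shards' scores and count, per shard,
    // how many of those top-x hits it owns: this is count(S).
    static Map<String, Integer> countsPerShard(List<List<ShardScore>> perShardTopX, int x) {
        // Min-heap on score: polling always evicts the weakest kept hit.
        PriorityQueue<ShardScore> top =
                new PriorityQueue<>(Comparator.comparingDouble((ShardScore s) -> s.score()));
        for (List<ShardScore> shardHits : perShardTopX) {
            for (ShardScore hit : shardHits) {
                top.add(hit);
                if (top.size() > x) {
                    top.poll();
                }
            }
        }
        Map<String, Integer> counts = new HashMap<>();
        for (ShardScore hit : top) {
            counts.merge(hit.shard(), 1, Integer::sum);
        }
        return counts;
    }
}
```

Since the counts sum to X, the store access in the later phases is capped
at X docs across the whole cluster, no matter how many shards participate.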
> In our particular test-setup (which simulates our production environment
> pretty well) it has given us a total response-time reduction by a factor
> of 60.
>
> I believe SOLR-5768 (without having looked at it yet) has made the
> existing distributed query algorithm (what we call
> find-id-relevance_fetch-by-ids) do the following when sending the
> distrib.singlePass parameter:
> * Find (by query) the full documents (or whatever is pointed out by the
> fl-parameter) for the top-X (1000 in my example) documents on each shard
>
> In the same way we plan to see if we can use the SOLR-5768 solution and
> make find-relevance_find-ids-limited-rows_fetch-by-ids do the following
> when sending the distrib.singlePass parameter:
> ** Find (by query) score only (score is the measurement for relevance)
> for the top-X (1000 in my example) documents on each shard
> ** Sort out how many documents count(S) of the overall-top-X documents
> live on each individual shard S
> ** For each shard S, fetch (by query) the full documents (or whatever is
> pointed out by the fl-parameter) for the count(S) most relevant documents
>
> In this case distrib.singlePass will be a bad name, because it makes
> find-relevance_find-ids-limited-rows_fetch-by-ids go from 3 passes to 2.
> So we might want to rename it to distrib.skipIdFetch or something.
>
> Hope you get the idea, and see why it makes us perform much, much better!
>
> Regards, Per Steffensen