> Your third factoid: A high number of hits/shard, suggests that there is a
> possibility of all the final top-1000 hits to originate from a single shard.

In fact, if you ask for 1000 hits in a distributed SolrCloud, each shard has to retrieve 1000 hits to get the unique key of each match and send it back to the shard responsible for the merge. This means that even if your data is fairly distributed among the 1000 shards, each of them still has to decompress 1000 documents during the first phase of the search. There are ways to avoid this; for instance, you can check this JIRA where the idea is discussed: https://issues.apache.org/jira/browse/SOLR-5478

Bottom line: if you have 1000 shards, the GET_FIELDS stage should be fast (if your data is fairly distributed), but the GET_TOP_IDS stage is not. You could avoid a lot of decompression/reads by using the field cache to retrieve the unique key in the first stage.
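To put numbers on the two stages, here is a toy cost model (the function names and the "one stored-field read per hit" abstraction are mine, not Solr API; the field-cache optimization mentioned above would remove the phase-1 reads entirely):

```python
# Rough cost model for the two phases of a distributed Solr query.
# One "read" stands for one stored-field (decompression) access.

def get_top_ids_reads(num_shards: int, rows: int) -> int:
    # Phase 1 (GET_TOP_IDS): every shard must resolve the unique key of
    # each of its local top-`rows` hits, i.e. `rows` reads per shard --
    # unless the key is served from the field cache.
    return num_shards * rows

def get_fields_reads(num_shards: int, rows: int) -> int:
    # Phase 2 (GET_FIELDS): only the merged overall top-`rows` docs are
    # fetched, spread across the shards, so at most `rows` reads total.
    return rows

print(get_top_ids_reads(1000, 1000))  # → 1000000
print(get_fields_reads(1000, 1000))   # → 1000
```

With 1000 shards and rows=1000 the first phase touches a thousand times more stored documents than the second, which is why GET_TOP_IDS dominates.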
Cheers,
Jim

________________________________________
From: Per Steffensen <st...@liace.dk>
Sent: Tuesday, November 18, 2014 19:26
To: dev@lucene.apache.org
Subject: Re: Slow searching limited but high rows across many shards all with high hits

On 17/11/14 18:47, Toke Eskildsen wrote:
> Per Steffensen [st...@liace.dk] wrote:
> I understand that the request is for rows * #shards IDs+score in
> total, but if you have presented your alternative, I have failed to
> see that.

I deliberately did not present the "solution" we did, so that you guys would not focus on whether or not this particular solution to the problem has already been implemented after 4.4.0 (the version of Apache Solr we currently base our version of Solr on). I guess the "problem" can be solved in numerous ways, so I just wanted you to focus on whether or not it has been solved in some way (I do not care which way).

> Your third factoid: A high number of hits/shard, suggests that there is a
> possibility of all the final top-1000 hits to originate from a single shard.

I am not sure what you are aiming at with this comment. But I can say that it is very, very unlikely that the overall-top-1000 all originate from a single shard. It is likely (since we are not routing on anything that has to do with the "content" text-field) that the overall-top-1000 is fairly evenly distributed among the 1000 shards.

> I was about to suggest collapsing to 2 or 3 months/shard, but that would be
> ruining a logistically nice setup.

Yes, we are also considering options in that area, but we would really prefer not to go this way. There are many additional reasons (besides the ones I mentioned in my previous mail). E.g. we are (maybe) about to introduce a bloom filter on shard-level, which will help us reduce the resources spent on indexing significantly. The bloom filter will help quickly say "a document with this particular id definitely does not exist" when doing optimistic locking (including version-lookup).
First-iteration tests have shown that it can reduce the resources/time spent on indexing by up to 80%. Bloom-filter data does not merge very well, though.

> 5-50 billion records/server? That seems very high, but after hearing
> about many different Solr setups at Lucene/Solr Revolution, I try to
> adopt a "sounds insane, but it's probably correct"-mindset.

We are not in the business of ms-response-times for thousands of searches per sec/min. We can accept response-times measured in seconds, and we do not perform thousands of searches per minute. We are in the business of being able to index enormous amounts of data per second, though. But this issue is about searches - we really do not like 10-30-60 min response-times on searches that ought to run much faster.

> Anyway, setup accepted, problem acknowledged, your possibly re-usable
> solution not understood.

What we did in our solution is the following. We introduced the concept of a "distributed query algorithm", controlled by the request-param "dqa". We named the existing (default) query algorithm (not knowing about SOLR-5768) "find-id-relevance_fetch-by-ids" (short alias "firfbi"), and we introduced a new alternative "distributed query algorithm" called "find-relevance_find-ids-limited-rows_fetch-by-ids" (short alias "frfilrfbi" :-) ).

* find-id-relevance_fetch-by-ids does as always:
** Find (by query) id and score (score is the measurement for relevance) for the top-X (1000 in my example) documents on each shard
** Sort out the ids of the overall-top-X and group them by shard. ids(S) is the set of ids among the overall-top-X that live on shard S
** For each shard S, fetch by the ids in ids(S) the full documents (or whatever is pointed out by the fl-parameter)

* find-relevance_find-ids-limited-rows_fetch-by-ids does it in a different way:
** Find (by query) score (score is the measurement for relevance) for the top-X (1000 in my example) documents on each shard
** Sort out how many documents count(S) of the overall-top-X documents live on each individual shard S
** For each shard S, fetch (by query) the ids ids(S) of the count(S) most relevant documents
** For each shard S, fetch by the ids in ids(S) the full documents (or whatever is pointed out by the fl-parameter)

Since "find score only" (step 1 of find-relevance_find-ids-limited-rows_fetch-by-ids) does not actually have to go into the store to fetch anything (the id is not needed), it can be optimized to perform much, much better than step 1 of find-id-relevance_fetch-by-ids (where the id is needed). In step 3 of find-relevance_find-ids-limited-rows_fetch-by-ids, when you do have to go to the store, we are not asking for 1000 docs per shard, but only for the number of documents among the overall-top-1000 that live on this particular shard. This way we go from potentially visiting the store for 1 million docs across the cluster to never visiting the store for more than 1000 docs across the cluster.
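The difference in store traffic between the two algorithms can be sketched with a toy simulation (all names here are illustrative, not Solr code; each shard is just a list of scores, and we count how often a stored document or its id must be fetched):

```python
import heapq
from collections import Counter

def firfbi_store_reads(shards, rows):
    # find-id-relevance_fetch-by-ids: phase 1 fetches id+score for the
    # top-`rows` hits on *every* shard -> up to `rows` reads per shard.
    phase1 = sum(min(rows, len(s)) for s in shards)
    # Phase 2 fetches only the overall top-`rows` full documents.
    return phase1 + rows

def frfilrfbi_store_reads(shards, rows):
    # find-relevance_find-ids-limited-rows_fetch-by-ids: phase 1 needs
    # scores only (no store access). The merged score list yields
    # count(S) per shard, so phases 2 and 3 each touch <= `rows` docs.
    top = heapq.nlargest(
        rows, ((score, i) for i, s in enumerate(shards) for score in s))
    counts = Counter(i for _, i in top)       # count(S) for each shard S
    return sum(counts.values()) * 2           # id fetch + doc fetch

# 10 shards, 2000 docs each, rows=1000:
shards = [[float(d) for d in range(2000)] for _ in range(10)]
print(firfbi_store_reads(shards, 1000))    # → 11000 (10*1000 + 1000)
print(frfilrfbi_store_reads(shards, 1000)) # → 2000  (2*1000)
```

The gap widens with the shard count: with 1000 shards the first algorithm would touch the store about a million times while the second stays bounded by 2*rows, which matches the "1 million docs vs. 1000 docs" figure above.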
In our particular test-setup (which simulates our production environment pretty well) it has given us a total response-time reduction of a factor of 60.

I believe SOLR-5768 (without having looked at it yet) has made the existing distributed query algorithm (what we call find-id-relevance_fetch-by-ids) do the following when sending the distrib.singlePass parameter:

* Find (by query) the full documents (or whatever is pointed out by the fl-parameter) for the top-X (1000 in my example) documents on each shard

In the same way we plan to see if we can use the SOLR-5768 solution, and make find-relevance_find-ids-limited-rows_fetch-by-ids do the following when sending the distrib.singlePass parameter:

** Find (by query) score (score is the measurement for relevance) for the top-X (1000 in my example) documents on each shard
** Sort out how many documents count(S) of the overall-top-X documents live on each individual shard S
** For each shard S, fetch (by query) the full documents (or whatever is pointed out by the fl-parameter) for the count(S) most relevant documents

In this case distrib.singlePass would be a bad name, because it makes find-relevance_find-ids-limited-rows_fetch-by-ids go from 3 to 2 passes. So we might want to rename it to distrib.skipIdFetch or something.

Hope you get the idea, and why it makes us perform much, much better!

Regards, Per Steffensen

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org