On 17/11/14 18:47, Toke Eskildsen wrote:
> Per Steffensen [st...@liace.dk] wrote:
> I understand that the request is for rows * #shards IDs+score in total, but if you have presented your alternative, I have failed to see that.
I deliberately did not present the "solution" we did, so that you would not focus on whether this particular solution has already been implemented after 4.4.0 (the version of Apache Solr we currently base our version of Solr on). I guess the "problem" can be solved in numerous ways, so I just wanted you to focus on whether it has been solved in some way (I do not care which way).
> Your third factoid: A high number of hits/shard, suggests that there is a
> possibility of all the final top-1000 hits to originate from a single shard.
I'm not sure what you are aiming at with this comment. But I can say that it is very, very unlikely that the overall-top-1000 all originate from a single shard. It is likely (since we are not routing on anything that has to do with the "content" text-field) that the overall-top-1000 is fairly evenly distributed among the 1000 shards.
> I was about to suggest collapsing to 2 or 3 months/shard, but that would be
> ruining a logistically nice setup.
Yes, we are also considering options in that area, but we would really prefer not to go this way.

There are many additional reasons (besides the ones I mentioned in my previous mail). E.g. we are (maybe) about to introduce a bloom-filter on shard-level, which will help us improve indexing performance significantly. The bloom-filter lets us quickly say "a document with this particular id definitely does not exist" when doing optimistic locking (including version-lookup). First-iteration tests have shown that it can reduce the resources/time spent on indexing by up to 80%. Bloom-filter data does not merge very well.
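To illustrate the idea (just a sketch, not our actual code - it uses Guava's BloomFilter, and the class/method names are made up; the real wiring into the Solr update path looks different):

    import java.nio.charset.StandardCharsets;
    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    public class ShardIdBloomFilter {
        // Sized for the expected number of docs on one shard, 1% false positives
        private final BloomFilter<String> seenIds =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

        /** Called for every id added to this shard. */
        public void recordId(String id) {
            seenIds.put(id);
        }

        /**
         * Optimistic-locking fast path: if the filter says the id was never
         * indexed on this shard, the (expensive) version-lookup can be skipped
         * entirely. A "might contain" answer still requires the real lookup,
         * since bloom filters give false positives but never false negatives.
         */
        public boolean mightNeedVersionLookup(String id) {
            return seenIds.mightContain(id);
        }
    }

The 80% reduction comes from the fact that for our workload the vast majority of adds are genuinely new ids, so most version-lookups can be skipped.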
> 5-50 billion records/server? That seems very high, but after hearing about many different Solr setups at Lucene/Solr Revolution, I try to adopt a "sounds insane, but it's probably correct"-mindset.
We are not in the business of ms-response-times for thousands of searches per sec/min. We can accept response-times measured in secs, and there are not thousands of searches performed per minute. We are, though, in the business of being able to index enormous amounts of data per second. But this issue is about searches - we really do not like 10-30-60 min response-times on searches that ought to run much faster.
> Anyway, setup accepted, problem acknowledged, your possibly re-usable solution not understood.
What we did in our solution is the following:

Introduced the concept of a "distributed query algorithm", controlled by the request-param "dqa". We are naming the existing (default) query-algorithm (not knowing about SOLR-5768) "find-id-relevance_fetch-by-ids" (short-alias "firfbi"), and we introduce a new alternative "distributed query algorithm" called "find-relevance_find-ids-limited-rows_fetch-by-ids" (short-alias "frfilrfbi" :-) )
* find-id-relevance_fetch-by-ids does as always
** Find (by query) id and score (score is the measurement for relevance) for the top-X (1000 in my example) documents on each shard
** Sort out the ids of the overall-top-X and group them by shard. ids(S) is the set of ids among the overall-top-X that live on shard S
** For each shard S fetch by ids in ids(S) the full documents (or whatever is pointed out by the fl-parameter)
* find-relevance_find-ids-limited-rows_fetch-by-ids does it in a different way
** Find (by query) score (score is the measurement for relevance) for the top-X (1000 in my example) documents on each shard
** Sort out how many documents count(S) of the overall-top-X documents live on each individual shard S (see the sketch after this list)
** For each shard S fetch (by query) the ids (ids(S)) for the count(S) most relevant documents
** For each shard S fetch by ids in ids(S) the full documents (or whatever is pointed out by the fl-parameter)

Since "find score only" (step 1 of find-relevance_find-ids-limited-rows_fetch-by-ids) does not actually have to go into the store to fetch anything (id not needed), it can be optimized to perform much, much better than step 1 of find-id-relevance_fetch-by-ids (id needed). In step 3 of find-relevance_find-ids-limited-rows_fetch-by-ids, where you do have to go to the store, we are not asking for 1000 docs per shard, but only for the number of documents among the overall-top-1000 that live on this particular shard. This way we go from potentially visiting the store for 1 million docs across the cluster to never visiting the store for more than 1000 docs across the cluster. In our particular test-setup (which simulates our production environment pretty well) this has given us a total response-time reduction of a factor of 60.
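To make the count(S) step concrete, here is a minimal sketch (not our actual code; the method name and map shapes are made up for illustration) of the merge that turns the per-shard score lists from step 1 into the per-shard fetch counts used in step 3:

    import java.util.*;

    public class CountPerShard {

        /**
         * Given the per-shard top-X score lists from the score-only pass,
         * work out how many of the overall top-X documents live on each
         * shard. The id-fetch pass then asks shard S for only count(S) ids
         * instead of X ids.
         */
        static Map<String, Integer> countPerShard(
                Map<String, List<Float>> topScoresByShard, int x) {
            // Flatten to (score, shard) pairs and sort best-first
            List<Map.Entry<Float, String>> all = new ArrayList<>();
            for (Map.Entry<String, List<Float>> e : topScoresByShard.entrySet()) {
                for (Float score : e.getValue()) {
                    all.add(new AbstractMap.SimpleEntry<>(score, e.getKey()));
                }
            }
            all.sort((a, b) -> Float.compare(b.getKey(), a.getKey()));

            // Count how many of the overall top-X come from each shard
            Map<String, Integer> count = new HashMap<>();
            for (Map.Entry<Float, String> e : all.subList(0, Math.min(x, all.size()))) {
                count.merge(e.getValue(), 1, Integer::sum);
            }
            return count;
        }
    }

With 1000 shards and X=1000 the sum of all count(S) is exactly 1000, which is why the store is never visited for more than 1000 docs in total.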

I believe SOLR-5768 (without having looked at it yet) has made the existing distributed query algorithm (what we call find-id-relevance_fetch-by-ids) do the following when sending the distrib.singlePass parameter:
* Find (by query) the full documents (or whatever is pointed out by the fl-parameter) for the top-X (1000 in my example) documents on each shard

In the same way we plan to see if we can use the SOLR-5768 solution, and make find-relevance_find-ids-limited-rows_fetch-by-ids do the following when sending the distrib.singlePass parameter (a usage sketch follows after this list):
* Find (by query) score (score is the measurement for relevance) for the top-X (1000 in my example) documents on each shard
* Sort out how many documents count(S) of the overall-top-X documents live on each individual shard S
* For each shard S fetch (by query) the full documents (or whatever is pointed out by the fl-parameter) for the count(S) most relevant documents

In this case distrib.singlePass will be a bad name, because it makes find-relevance_find-ids-limited-rows_fetch-by-ids go from 3 passes to 2. So we might want to rename it to distrib.skipIdFetch or something.
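For illustration, roughly how the two knobs would be combined from a SolrJ 4.x client. dqa is our custom parameter (taking the short-alias) and distrib.singlePass is the SOLR-5768 parameter; the URL, collection and query are made up, and combining the two is only what we plan, not something that exists yet:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DqaExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery q = new SolrQuery("content:foo");
            q.setRows(1000);                    // overall top-X
            q.set("dqa", "frfilrfbi");          // our alternative algorithm
            q.set("distrib.singlePass", true);  // skip the separate id-fetch pass
            QueryResponse rsp = solr.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }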

Hope you get the idea, and see why it makes us perform much, much better!

Regards, Per Steffensen


