On 17/11/14 18:47, Toke Eskildsen wrote:
> Per Steffensen [st...@liace.dk] wrote:
> I understand that the request is for rows * #shards IDs+score in total, but if you have presented your alternative, I have failed to see that.
I deliberately did not present the "solution" we did, so that you would not focus on whether this particular solution has already been implemented after 4.4.0 (the version of Apache Solr we currently base our version of Solr on). I guess the "problem" can be solved in numerous ways, so I just wanted you to focus on whether it has been solved in some way (I do not care which way).
> Your third factoid: A high number of hits/shard, suggests that there is a
> possibility of all the final top-1000 hits to originate from a single shard.
I'm not sure what you are aiming at with this comment. But I can say that it is very, very unlikely that the overall-top-1000 all originate from a single shard. It is likely (since we are not routing on anything that has to do with the "content" text-field) that the overall-top-1000 is fairly evenly distributed among the 1000 shards.
> I was about to suggest collapsing to 2 or 3 months/shard, but that would be
> ruining a logistically nice setup.
Yes, we are also considering options in that area, but we would really prefer not to go this way.

There are many additional reasons (besides the ones I mentioned in my previous mail). E.g. we are (maybe) about to introduce a bloom-filter on shard-level, which will help us improve indexing performance significantly. The bloom-filter lets us quickly say "a document with this particular id definitely does not exist" when doing optimistic locking (including version-lookup). First-iteration tests have shown that it can reduce the resources/time spent on indexing by up to 80%. Bloom-filter data does not merge very well.
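To illustrate the idea (just a sketch, not our actual code - it uses Guava's BloomFilter, and the class/method names are made up; the real wiring into the Solr update path looks different):

    import java.nio.charset.StandardCharsets;
    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    public class ShardIdBloomFilter {
        // Sized for the expected number of docs on one shard, 1% false positives
        private final BloomFilter<String> seenIds =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

        /** Called for every id added to this shard. */
        public void recordId(String id) {
            seenIds.put(id);
        }

        /**
         * Optimistic-locking fast path: if the filter says the id was never
         * indexed on this shard, the (expensive) version-lookup can be skipped
         * entirely. A "might contain" answer still requires the real lookup,
         * since bloom filters give false positives but never false negatives.
         */
        public boolean mightNeedVersionLookup(String id) {
            return seenIds.mightContain(id);
        }
    }

The 80% reduction comes from the fact that for our workload the vast majority of adds are genuinely new ids, so most version-lookups can be skipped.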
> 5-50 billion records/server? That seems very high, but after hearing about many different Solr setups at Lucene/Solr Revolution, I try to adopt a "sounds insane, but it's probably correct"-mindset.
We are not in the business of ms-response-times for thousands of searches per sec/min. We can accept response-times measured in secs, and there are not thousands of searches performed per minute. We are, though, in the business of being able to index enormous amounts of data per second. But this issue is about searches - we really do not like 10-30-60 min response-times on searches that ought to run much faster.
> Anyway, setup accepted, problem acknowledged, your possibly re-usable solution not understood.
What we did in our solution is the following:

Introduced the concept of a "distributed query algorithm", controlled by the request-param "dqa". We are naming the existing (default) query-algorithm (not knowing about SOLR-5768) "find-id-relevance_fetch-by-ids" (short-alias "firfbi"), and we introduce a new alternative "distributed query algorithm" called "find-relevance_find-ids-limited-rows_fetch-by-ids" (short-alias "frfilrfbi" :-) )
* find-id-relevance_fetch-by-ids does as always
** Find (by query) id and score (score is the measurement for relevance) for the top-X (1000 in my example) documents on each shard
** Sort out the ids of the overall-top-X and group them by shard. ids(S) is the set of ids among the overall-top-X that live on shard S
** For each shard S fetch by ids in ids(S) the full documents (or whatever is pointed out by the fl-parameter)
* find-relevance_find-ids-limited-rows_fetch-by-ids does it in a different way
** Find (by query) score (score is the measurement for relevance) for the top-X (1000 in my example) documents on each shard
** Sort out how many documents count(S) of the overall-top-X documents live on each individual shard S (see the sketch after this list)
** For each shard S fetch (by query) the ids (ids(S)) for the count(S) most relevant documents
** For each shard S fetch by ids in ids(S) the full documents (or whatever is pointed out by the fl-parameter)

Since "find score only" (step 1 of find-relevance_find-ids-limited-rows_fetch-by-ids) does not actually have to go into the store to fetch anything (id not needed), it can be optimized to perform much, much better than step 1 of find-id-relevance_fetch-by-ids (id needed). In step 3 of find-relevance_find-ids-limited-rows_fetch-by-ids, where you do have to go to the store, we are not asking for 1000 docs per shard, but only for the number of documents among the overall-top-1000 that live on this particular shard. This way we go from potentially visiting the store for 1 million docs across the cluster to never visiting the store for more than 1000 docs across the cluster. In our particular test-setup (which simulates our production environment pretty well) this has given us a total response-time reduction of a factor of 60.
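To make the count(S) step concrete, here is a minimal sketch (not our actual code; the method name and map shapes are made up for illustration) of the merge that turns the per-shard score lists from step 1 into the per-shard fetch counts used in step 3:

    import java.util.*;

    public class CountPerShard {

        /**
         * Given the per-shard top-X score lists from the score-only pass,
         * work out how many of the overall top-X documents live on each
         * shard. The id-fetch pass then asks shard S for only count(S) ids
         * instead of X ids.
         */
        static Map<String, Integer> countPerShard(
                Map<String, List<Float>> topScoresByShard, int x) {
            // Flatten to (score, shard) pairs and sort best-first
            List<Map.Entry<Float, String>> all = new ArrayList<>();
            for (Map.Entry<String, List<Float>> e : topScoresByShard.entrySet()) {
                for (Float score : e.getValue()) {
                    all.add(new AbstractMap.SimpleEntry<>(score, e.getKey()));
                }
            }
            all.sort((a, b) -> Float.compare(b.getKey(), a.getKey()));

            // Count how many of the overall top-X come from each shard
            Map<String, Integer> count = new HashMap<>();
            for (Map.Entry<Float, String> e : all.subList(0, Math.min(x, all.size()))) {
                count.merge(e.getValue(), 1, Integer::sum);
            }
            return count;
        }
    }

With 1000 shards and X=1000 the sum of all count(S) is exactly 1000, which is why the store is never visited for more than 1000 docs in total.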

I believe SOLR-5768 (without having looked at it yet) has made the existing distributed query algorithm (what we call find-id-relevance_fetch-by-ids) do the following when sending the distrib.singlePass parameter:
* Find (by query) the full documents (or whatever is pointed out by the fl-parameter) for the top-X (1000 in my example) documents on each shard

In the same way we plan to see if we can use the SOLR-5768 solution, and make find-relevance_find-ids-limited-rows_fetch-by-ids do the following when sending the distrib.singlePass parameter (a usage sketch follows after this list):
* Find (by query) score (score is the measurement for relevance) for the top-X (1000 in my example) documents on each shard
* Sort out how many documents count(S) of the overall-top-X documents live on each individual shard S
* For each shard S fetch (by query) the full documents (or whatever is pointed out by the fl-parameter) for the count(S) most relevant documents

In this case distrib.singlePass will be a bad name, because it makes find-relevance_find-ids-limited-rows_fetch-by-ids go from 3 passes to 2. So we might want to rename it to distrib.skipIdFetch or something.
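For illustration, roughly how the two knobs would be combined from a SolrJ 4.x client. dqa is our custom parameter (taking the short-alias) and distrib.singlePass is the SOLR-5768 parameter; the URL, collection and query are made up, and combining the two is only what we plan, not something that exists yet:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DqaExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery q = new SolrQuery("content:foo");
            q.setRows(1000);                    // overall top-X
            q.set("dqa", "frfilrfbi");          // our alternative algorithm
            q.set("distrib.singlePass", true);  // skip the separate id-fetch pass
            QueryResponse rsp = solr.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }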

Hope you get the idea, and see why it makes us perform much, much better!

Regards, Per Steffensen


