> Your third factoid: A high number of hits/shard suggests that there is a
> possibility that all of the final top-1000 hits originate from a single shard.
In fact, if you ask for 1000 hits in a distributed SolrCloud, each shard has
to retrieve its top 1000 hits in order to get the unique key of each match
and send it back to the shard responsible for the merge. This means that
even if your data is fairly evenly distributed among the 1000 shards, each
shard still has to decompress 1000 documents during the first phase of the
search. There are ways to avoid this; for instance you can check this JIRA
where the idea is discussed:
https://issues.apache.org/jira/browse/SOLR-5478
The bottom line is that if you have 1000 shards, the GET_FIELDS stage should
be fast (if your data is fairly distributed) but the GET_TOP_IDS stage is
not. You could avoid a lot of decompression/reads by using the field cache
to retrieve the unique key in the first stage.
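
For illustration, the difference between the two routes looks roughly like
this (a sketch assuming a recent Lucene docValues API rather than the 4.x
FieldCache; leafCtx and the segment-local doc are assumed to be in scope):

    import org.apache.lucene.index.SortedDocValues;
    import org.apache.lucene.util.BytesRef;

    // Stored-field route: decompresses a whole stored-field block just to
    // read one unique key.
    String idViaStore = leafCtx.reader().document(doc).get("id");

    // Columnar route (docValues; the FieldCache played this role in 4.x):
    // a per-document read with no stored-field decompression at all.
    SortedDocValues dv = leafCtx.reader().getSortedDocValues("id");
    if (dv != null && dv.advanceExact(doc)) {
        BytesRef idViaColumn = dv.lookupOrd(dv.ordValue());
    }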

Cheers,
Jim



________________________________________
From: Per Steffensen <st...@liace.dk>
Sent: Tuesday, November 18, 2014 19:26
To: dev@lucene.apache.org
Subject: Re: Slow searching limited but high rows across many shards all with 
high hits

On 17/11/14 18:47, Toke Eskildsen wrote:
> Per Steffensen [st...@liace.dk] wrote:
> I understand that the request is for rows * #shards IDs+score in
> total, but if you have presented your alternative, I have failed to
> see that.
I deliberately did not present the "solution" we did, so that you guys
would not focus on whether or not this particular solution to the problem
has already been implemented after 4.4.0 (the version of Apache Solr we
currently base our version of Solr on). I guess the "problem" can be solved
in numerous ways, so I just wanted you to focus on whether or not it has
been solved in some way (I do not care which way).
>   Your third factoid: A high number of hits/shard suggests that there is a
> possibility that all of the final top-1000 hits originate from a single shard.
I'm not sure what you are aiming at with this comment. But I can say that
it is very, very unlikely that the overall-top-1000 all originate from a
single shard. It is likely (since we are not routing on anything that has
to do with the "content" text-field) that the overall-top-1000 is fairly
evenly distributed among the 1000 shards.
> I was about to suggest collapsing to 2 or 3 months/shard, but that would be 
> ruining a logistically nice setup.
Yes, we are also considering options in that area, but we would really
prefer not to go that way.

There are many additional reasons (besides the ones I mentioned in my
previous mail). E.g. we are (maybe) about to introduce a bloom-filter at
shard-level, which will help us improve indexing performance significantly.
The bloom-filter helps us quickly say "a document with this particular id
definitely does not exist" when doing optimistic locking (including
version-lookup). First-iteration tests have shown that it can reduce the
resources/time spent on indexing by up to 80%. Bloom-filter data does not
merge very well.
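
As a rough illustration of that shortcut, here is a sketch using Guava's
BloomFilter (not our actual patch; the sizing numbers and the docId
variable are made up):

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    // One filter per shard, sized for the expected number of unique keys.
    BloomFilter<CharSequence> seenIds = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

    // At index time, record every unique key added to the shard.
    seenIds.put(docId);

    // During optimistic locking, a negative answer is definitive, so the
    // expensive version/term lookup in the index can be skipped entirely.
    if (!seenIds.mightContain(docId)) {
        // id definitely absent: treat as a brand-new document
    }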
> 5-50 billion records/server? That seems very high, but after hearing
> about many different Solr setups at Lucene/Solr Revolution, I try to
> adopt a "sounds insane, but it's probably correct"-mindset.
We are not in the business of millisecond response-times for thousands of
searches per second or minute. We can accept response-times measured in
seconds, and there are not thousands of searches performed per minute. We
are in the business of being able to index enormous amounts of data per
second, though. But this issue is about searches - we really do not like
10-30-60 min response-times on searches that ought to run much faster.
> Anyway, setup accepted, problem acknowledged, your possibly re-usable
> solution not understood.
What we did in our solution is the following:

We introduced the concept of a "distributed query algorithm", controlled
by the request-param "dqa". We named the existing (default)
query-algorithm (not knowing about SOLR-5768)
"find-id-relevance_fetch-by-ids" (short-alias "firfbi"), and we introduced
a new alternative distributed query algorithm called
"find-relevance_find-ids-limited-rows_fetch-by-ids" (short-alias
"frfilrfbi" :-) ). A SolrJ sketch follows the two step-lists below.
* find-id-relevance_fetch-by-ids does as always:
** Find (by query) the id and score (score being the measure of relevance)
for the top-X (1000 in my example) documents on each shard
** Sort out the ids of the overall-top-X and group them by shard; ids(S)
is the set of ids among the overall-top-X that live on shard S
** For each shard S, fetch by the ids in ids(S) the full documents (or
whatever is pointed out by the fl-parameter)
* find-relevance_find-ids-limited-rows_fetch-by-ids does it in a
different way:
** Find (by query) the score (score being the measure of relevance) for
the top-X (1000 in my example) documents on each shard
** Sort out how many documents, count(S), of the overall-top-X live on
each individual shard S
** For each shard S, fetch (by query) the ids ids(S) of the count(S) most
relevant documents
** For each shard S, fetch by the ids in ids(S) the full documents (or
whatever is pointed out by the fl-parameter)
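
To make the request shape concrete, here is a hypothetical SolrJ 4.x call
against our patched Solr (the "dqa" parameter and its aliases exist only
in our patch, not in stock Solr; host and collection names are made up):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;

    SolrQuery q = new SolrQuery("content:foo");
    q.setRows(1000);
    // Select the alternative three-phase algorithm by its short alias.
    q.set("dqa", "frfilrfbi");

    CloudSolrServer server = new CloudSolrServer("zkhost1:2181");
    server.setDefaultCollection("collection1");
    server.query(q);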
Since "find score only" (step 1 of
find-relevance_find-ids-limited-rows_fetch-by-ids) actually does not
have to go into the store to fetch anything (id not needed), it can be
optimized to perform much much better than step 1 in
find-id-relevance_fetch-by-ids (id needed). I step 3 of
find-relevance_find-ids-limited-rows_fetch-by-ids, when you have to go
to store, we are not asking for 1000 docs per shard, but only the number
of documents among the overall-top-1000 documents that live on this
particular shard. This way we go from potentially visiting the store for
1 mio docs across the cluster, to never visiting the store for more than
1000 docs across the cluster. In our particular test-setup (which
simulates our production environment pretty well) it has given us an
total response-time reduction of a factor 60
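
At the Lucene level, the reason the score-only first phase is so cheap is
that the top-X scores come straight out of the collector's priority queue,
with no per-hit id lookup at all. A minimal sketch (plain Lucene calls,
not our actual patch; searcher and query are assumed to be in scope):

    import org.apache.lucene.search.TopDocs;

    // Per shard, phase 1: collect the top-1000 scores only. No stored
    // field, docValues or FieldCache read happens for any hit.
    TopDocs top = searcher.search(query, 1000);
    float[] scores = new float[top.scoreDocs.length];
    for (int i = 0; i < top.scoreDocs.length; i++) {
        scores[i] = top.scoreDocs[i].score;
    }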

I believe SOLR-5768 (without having looked at it yet) has made the
existing distributed query algorithm (what we call
find-id-relevance_fetch-by-ids) do the following when the
distrib.singlePass parameter is sent:
* Find (by query) the full documents (or whatever is pointed out by the
fl-parameter) for the top-X (1000 in my example) documents on each shard
In the same way, we plan to see if we can use the SOLR-5768 solution and
make find-relevance_find-ids-limited-rows_fetch-by-ids do the following
when the distrib.singlePass parameter is sent:
** Find (by query) the score (score being the measure of relevance) for
the top-X (1000 in my example) documents on each shard
** Sort out how many documents, count(S), of the overall-top-X live on
each individual shard S
** For each shard S, fetch (by query) the full documents (or whatever is
pointed out by the fl-parameter) for the count(S) most relevant documents
In this case distrib.singlePass would be a bad name, because it makes
find-relevance_find-ids-limited-rows_fetch-by-ids go from 3 to 2 passes.
So we might want to rename it to distrib.skipIdFetch or something.
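
For reference, selecting the SOLR-5768 behaviour from SolrJ looks roughly
like this (distrib.singlePass is the real parameter from that issue; the
query and field names are just a sketch):

    import org.apache.solr.client.solrj.SolrQuery;

    SolrQuery q = new SolrQuery("content:foo");
    q.setRows(1000);
    q.setFields("id", "content");
    // SOLR-5768: skip the separate GET_FIELDS round-trip; each shard
    // returns the requested fl fields for its candidate top-1000 directly.
    q.set("distrib.singlePass", true);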

Hope you get the idea, and why it makes us perform much much better?!

Regards, Per Steffensen
