On 17/11/14 12:39, Toke Eskildsen wrote:
> On Thu, 2014-11-13 at 11:58 +0100, Per Steffensen wrote:
>
> What is the core problem? The two-phase request system
> (https://issues.apache.org/jira/browse/SOLR-5768) seems to solve that?

No, that's not it. We have solved that as well in our PoC solution (didn't know about SOLR-5758), but that is just a minor thing.

> That the IDs in the second phase are sent to all shards? Something third?

Yes, something third. The problem is that the "main query" (STAGE_EXECUTE_QUERY) asks each shard for the id and score of its top 1000 documents. When you have 1000 shards, that means asking for a total of 1 million id/score pairs across 20 machines - 50,000 for each machine. The problem is that fetching 50,000 ids from the store is slow.
This problem grows when the following factors grow:
* the (limited) number of documents you ask for in your query (e.g. 1000) - hence the "limited but high rows" in the subject
* the number of shards searched - hence the "across many shards" in the subject
* the number of actual hits (e.g. 1000+) on each individual shard - hence the "all with high hits" in the subject
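To illustrate why all those ids have to be fetched in the first place, here is a rough sketch (not the actual Solr code, just the principle) of the merge the aggregating node performs in this stage:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: each of the N shards returns its own top-'rows' (id, score)
// pairs, and the aggregator keeps the global top-'rows'. This is why
// N * rows ids are fetched even though only 'rows' survive the merge.
public class TopKMergeSketch {

    static class ShardDoc {
        final String id;
        final float score;
        ShardDoc(String id, float score) { this.id = id; this.score = score; }
    }

    static List<ShardDoc> mergeTopK(List<List<ShardDoc>> perShardResults, int rows) {
        // Min-heap of size 'rows'; the root is the weakest candidate so far.
        PriorityQueue<ShardDoc> heap =
                new PriorityQueue<>(rows, Comparator.comparingDouble((ShardDoc d) -> d.score));
        for (List<ShardDoc> shardDocs : perShardResults) {  // e.g. 1000 shards ...
            for (ShardDoc doc : shardDocs) {                // ... x 1000 docs each
                if (heap.size() < rows) {
                    heap.offer(doc);
                } else if (doc.score > heap.peek().score) {
                    heap.poll();     // drop the weakest, keep the newcomer
                    heap.offer(doc);
                }
            }
        }
        List<ShardDoc> top = new ArrayList<>(heap);
        top.sort(Comparator.comparingDouble((ShardDoc d) -> d.score).reversed());
        return top;
    }
}

The merge itself is cheap - the expensive part is that every shard first has to fetch its 1000 ids from the store just to participate.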
> Related to that, why are you running 50+ shards on each machine, when
> you're doing search across all shards?

We are not always searching across all shards - see below.

> Why not fewer shards/machine and less distribution overhead?

We have a system receiving data all the time, and we index the data as it arrives. All data entities/rows/documents have a timestamp. We almost only get recent data - data with timestamps within the last few hours from "now" - but sometimes we get older data, up to a month old.
We have to retain two years of data - so e.g. today, 17/11-2014, we are allowed/supposed to throw away data from 16/11-2012 (and older).
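Because we keep one collection per month (described below), that clean-up is a single Collections API call. A minimal sketch - the host name "solr-host" and the collection name are just examples:

import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: dropping an expired monthly collection via the Collections API.
public class DropExpiredMonth {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://solr-host:8983/solr/admin/collections"
                + "?action=DELETE&name=coll_2012_11");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // A 200 response means a whole month of data is gone - no
        // delete-by-query, just index folders being removed on the servers.
        System.out.println("DELETE coll_2012_11 -> HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}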
We are using one collection for each month - collection "coll_2014_11" contains all data with timestamps from November 2014, collection "coll_2014_10" contains all data with timestamps from October 2014, ..., collection "coll_2012_11" contains all data with timestamps from November 2012. This approach with a collection for each month has numerous advantages, and it has saved our a.... lots of times:
* We can change indexing strategy (e.g. schema, config etc.) every month, without having to re-index all historical data (that is not feasible, because we have such huge amounts of data)
* We can easily delete data when it is more than two years old - we just delete an entire collection instead of doing delete-by-query or something (we are allowed to only delete data every month) - basically just deleting folders on the file-system
* When doing searches including timestamp:[xxx TO yyy], we internally in Solr calculate the set of collections that can possibly contain matching documents, and "redirect" the search to only those collections (see the sketch below). This allows us not to search across all data for every search
* Many other reasons

We want all machines involved in all collections - or at least in the collection for the current month (because that is where all the heavy indexing is going on). We have between 20 and 40 machines, depending on the prod environment.
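To illustrate the timestamp-routing bullet above, a rough sketch (not our actual code) of computing the collection set for a timestamp range:

import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;

// Sketch of the "redirect" step: given the timestamp:[from TO to] range
// of a query, compute the monthly collections that can possibly hold
// matching documents, following our coll_YYYY_MM naming convention.
public class CollectionRouter {

    static List<String> collectionsFor(Instant from, Instant to) {
        YearMonth first = YearMonth.from(from.atZone(ZoneOffset.UTC));
        YearMonth last  = YearMonth.from(to.atZone(ZoneOffset.UTC));
        List<String> collections = new ArrayList<>();
        for (YearMonth m = first; !m.isAfter(last); m = m.plusMonths(1)) {
            collections.add(String.format("coll_%d_%02d", m.getYear(), m.getMonthValue()));
        }
        return collections;
    }

    public static void main(String[] args) {
        // A query for timestamp:[2014-10-20 TO 2014-11-17] only has to
        // touch two of the 24 collections:
        System.out.println(collectionsFor(
                Instant.parse("2014-10-20T00:00:00Z"),
                Instant.parse("2014-11-17T00:00:00Z")));
        // -> [coll_2014_10, coll_2014_11]
    }
}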
We want at least 2 shards on each machine for each collection, so that we can easily double the number of Solr servers and have them fully involved in doing "the work", by moving one shard (from each collection) onto one of the new servers. It is very easy and quick, because you just have to move the shards' data-folders and adjust core.properties and clusterstate.json a little (see the example below). I am aware of the "split shard" feature, but we are currently not using it because it is not as quick and flawless. Besides, this scheme with multiple shards per Solr server (so that we can easily scale horizontally) was invented and established before the "split shard" feature existed, and we haven't prioritized looking into changing it.
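For illustration, the moved core's core.properties on the new server ends up looking something like this - all values are made-up examples, and the exact properties vary a bit with the Solr version:

# core.properties of the moved shard on the new server (example values)
name=coll_2014_11_shard3_replica1
collection=coll_2014_11
shard=shard3
coreNodeName=core_node42

Besides that, the replica's entry in clusterstate.json needs its base_url/node_name updated so it points at the new server.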
So we have 24 collections with 2 shards on each of 20+ Solr servers - 24 x 2 x 20+ = 960+ shards in total, carrying somewhere between 100 and 1000 billion documents.
Regards, Steff