Indeed, the distribution across shards should be transparent. In fact, as a
client I should not need to know anything about any shard. But since the
current state of Solr (1.4) dictates an interface where you - as a client -
must provide a list of shards, the responsibility has been shifted over
to the client.
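
To make concrete what "providing a list of shards" means for the client, here is a rough sketch of building a distributed request. The `shards`, `start` and `rows` request parameters are Solr's own; the host names, the base URL and the helper name `distributed_query_url` are placeholders I made up for illustration.

```python
from urllib.parse import urlencode


def distributed_query_url(base_url, query, shards, start=0, rows=10):
    """Build a select URL that asks Solr to fan the query out to `shards`.

    `shards` is a list of "host:port/path" strings (hypothetical values here).
    Solr expects them as one comma-separated `shards` parameter.
    """
    params = {
        "q": query,
        "shards": ",".join(shards),  # the client must name every shard itself
        "start": start,
        "rows": rows,
    }
    return base_url + "/select?" + urlencode(params)


url = distributed_query_url(
    "http://host1:8983/solr",                    # placeholder coordinator node
    "title:solr",
    ["host1:8983/solr", "host2:8983/solr"],      # placeholder shard addresses
)
print(url)
```

Every shard named in that parameter takes part in the request, which is why trimming the list matters.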

Since we get so much data that we must add a new shard per month, we have to
be shard-aware on the client side. My understanding of Solr is that the
final response to a query is only finished when every shard in the query's
shard list has been consulted. This means that the slowest ship defines the
speed, so to speak. Or worse - if any shard in the list fails, then the
response fails!

What I hope to achieve is a way of cutting shards off the list for a query.
If I more or less know how many hits a given query has in each shard, then I
could control paging myself, and only include in the query's shard list the
shards I know will hold the documents. Otherwise I'm afraid of
performance when we get to have dozens of shards.

So to summarise: We are developing a system where a given search will be
performed again and again over time on an ever-increasing document base. The
first time a search is done, it will be distributed across every shard in
order to get a total from the beginning of time up to the timestamp of
the query's debut. This total is cached and thereafter maintained by querying
only the most recent shards, from the date of the last update until now.
Mostly the documents come in chronological order, but occasionally they
arrive out of order. The shards are organised by date intervals, and this
means that every shard will from time to time be the target of more
documents. This will induce a slight discrepancy between the cached total
and the actual total. But this is a discrepancy that we can live with.
But I would also like to know how many hits there are in each individual
shard. If I know this, then I can tailor-make a precise shard list for the
query: Because I know the offset and page size of the query, and I know how
many documents are in each shard, I can calculate which shards to
include. This is a lot of client-side administration - I know, but I guess -
I hope - it will perform quite well...
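
The calculation I have in mind could look roughly like the sketch below. It assumes that results are sorted by date, so that all hits from one shard come before all hits from the next, and that the cached per-shard counts are close enough to the truth. The function name `shards_for_page` and the shard names are invented for the example.

```python
from typing import List, Tuple


def shards_for_page(shard_hits: List[Tuple[str, int]],
                    offset: int,
                    page_size: int) -> Tuple[List[str], int]:
    """Pick the shards that overlap the window [offset, offset + page_size).

    `shard_hits` lists (shard name, cached hit count) in result order.
    Assumption: results are sorted so that shards do not interleave.
    Returns the shards to query and the offset to use within that subset.
    """
    selected: List[str] = []
    local_offset = None
    window_end = offset + page_size
    start = 0  # global index of the first hit in the current shard
    for name, hits in shard_hits:
        end = start + hits
        if hits > 0 and end > offset and start < window_end:
            if local_offset is None:
                # Hits skipped inside the first selected shard.
                local_offset = offset - start
            selected.append(name)
        start = end
        if start >= window_end:
            break  # later shards cannot contribute to this page
    return selected, (local_offset or 0)


# Hypothetical monthly shards, newest first, with cached hit counts.
counts = [("shard-2011-01", 40), ("shard-2010-12", 25), ("shard-2010-11", 100)]
print(shards_for_page(counts, offset=50, page_size=20))
# → (['shard-2010-12', 'shard-2010-11'], 10)
```

The returned shard names would go into the query's shard list, with the returned offset as `start` and the page size as `rows`. With relevance-ranked results the shards interleave and this shortcut no longer holds.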

Is this idea crazy or what?