Yeah, Erick confused me a bit too, but I think what he's talking about takes for granted that you'd have your various indexes directly set up as individual collections.

If instead you're considering one big collection, or a few collections based on aggregations of your individual indexes, having big, multisharded collections using compositeId should work, unless there's a use case we're not discussing.

Michael

On 11/11/14 10:27, Michal Krajňanský wrote:
Hi Erick, Michael,

thank you both for your comments.

2014-11-11 5:05 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:

bq: - the documents are organized in "shards" according to date (integer) and language (a possibly extensible discrete set)

bq: - the indexes are disjunct

OK, I'm having a hard time getting my head around these two statements.

If the indexes are disjunct in the sense that you only search one at a
time, then they are different "collections" in SolrCloud jargon.


I just meant that every document is contained in a single one of the
indexes. I have a lot of Lucene indexes for various [language X timespan],
but logically we are speaking about a single huge index. That is why I
thought it would be natural to represent it as a single SolrCloud
collection.

If, on the other hand, these are a big collection and you want to search
them all with a single query, I suggest that in SolrCloud land you don't
want them to be discrete shards. My reasoning here is that let's say you
have a bunch of documents for October, 2014 in Spanish. By putting these
all on a single shard, your queries all have to be serviced by that one
shard. You don't get any parallelism.


That is right. Actually the parallelization is not the main issue right
now. The queries are very sparse, and currently our system does not support
load balancing at all. I imagined that in the future it could be achievable
via SolrCloud replication.

The main consideration is to be able to plug the indexes in and out on
demand. The total size of the data is in terabytes. We usually want to
search only the latest indexes, but occasionally we need to plug in
one of the older ones.

Maybe (probably) I still have some misconceptions about the uses of
SolrCloud...

If it really does make sense in your case to route all the docs to a
single shard, then Michael's comment is spot-on: use the compositeId router.


You confuse me here. I was not thinking about a single shard, on the
contrary, any [language X timespan] index would be itself a shard. I agree
that the compositeId router seems natural for what I need. I am currently
looking for a way to convert my indexes so that my document
IDs have the composite format. Currently these are just unique integers,
so I would like to prefix all the document IDs of an index with its
language and timespan. I do not know how, but I believe this should be
possible, as it is a constant operation that would not change the structure
of the index.
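
For illustration, the prefixing described above could look like the following minimal sketch. It assumes the compositeId convention of a route prefix and the original ID separated by "!"; the function name, the "language_timespan" prefix layout, and the sample values are hypothetical, and the sketch only builds the ID string. It does not rewrite IDs inside an existing Lucene index.

```python
def composite_id(language: str, timespan: int, doc_id: int) -> str:
    """Build a compositeId-style key: '<language>_<timespan>!<doc_id>'.

    The prefix layout is an assumption made for this example; only the
    '!' separator comes from the compositeId convention.
    """
    prefix = f"{language}_{timespan}"
    # '!' separates the route prefix from the ID, so it must not
    # appear inside the prefix itself.
    if "!" in prefix:
        raise ValueError("route prefix must not contain '!'")
    return f"{prefix}!{doc_id}"


if __name__ == "__main__":
    # Hypothetical document 12345 from a Spanish, October 2014 index.
    print(composite_id("es", 201410, 12345))
```

Note that the router hashes the prefix to pick a shard, so documents sharing a prefix end up together, but a given prefix is not tied to a shard of your choosing.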

Best,

Michal



Best,
Erick

On Mon, Nov 10, 2014 at 11:50 AM, Michael Della Bitta
<michael.della.bi...@appinions.com> wrote:
Hi Michal,

Is there a particular reason to shard your collections like that? If it
was mainly for ease of operations, I'd consider just using compositeId to
prevent specific types of queries from hotspotting particular nodes.

If your ingest rate is fast, you might also consider making each
"collection" an alias that points to many actual collections, and
periodically closing off a collection and starting a new one. This
prevents cache churn and the impact of large merges.
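
As a rough sketch of the alias idea, the snippet below only builds the request URL for the Collections API CREATEALIAS action; it does not contact a server. The host, the alias name, and the per-month collection names are hypothetical.

```python
from urllib.parse import urlencode


def create_alias_url(base_url: str, alias: str, collections: list[str]) -> str:
    """Return a Collections API URL that points `alias` at `collections`."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        # The alias target is a comma-separated list of collection names.
        "collections": ",".join(collections),
    }
    # Keep commas unescaped so the URL stays readable.
    return f"{base_url}/admin/collections?{urlencode(params, safe=',')}"


if __name__ == "__main__":
    # Hypothetical setup: one collection per month behind a single alias.
    print(create_alias_url(
        "http://localhost:8983/solr",
        "docs_recent",
        ["docs_201409", "docs_201410"],
    ))
```

Issuing the same action again with a different collection list repoints the alias, which is one way to rotate older collections in and out of what gets searched.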

Michael



On 11/10/14 08:03, Michal Krajňanský wrote:
Hi All,

I have been working on a project that has long employed the Lucene indexer.

Currently, the system implements proprietary document routing and index
plugging/unplugging on top of Lucene and of course contains a great
body of indexes. Recently an idea came up to migrate from Lucene to
SolrCloud, which appears to be more powerful than our proprietary system.
Could you suggest the best way to seamlessly migrate the system to
SolrCloud when reindexing is not an option?

- all the existing indexes represent a single collection in terms of
  SolrCloud
- the documents are organized in "shards" according to date (integer)
  and language (a possibly extensible discrete set)
- the indexes are disjunct

I have been able to convert the existing indexes to the newest Lucene
version and plug them individually into SolrCloud. However, there is
the question of routing, sharding etc.

Any insight appreciated.

Best,


Michal Krajnansky

