Re: SolrCloud collection design considerations / best practice
"The main motivation is to support a geo-specific relevancy model which can easily be customized without stepping into each other" Is your relevancy tuning massively index time based ? i.e. will create massively different index content based on the geo location ? If it is just query time based or lightly index based ( few fields of difference across region), you don't need different collections at all to have a customized relevancy model per use case. In Solr you can define different request handlers with different query parsers and search components specifications. If you go in deep with relevancy tuning and for example you experiment Learning To Rank, it supports passing the model name at query time, which means you can use a different relevancy mode just passing it as a request parameter. Regards - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: SolrCloud collection design considerations / best practice
On 11/13/2017 12:33 PM, Shamik Bandopadhyay wrote:
> I'm looking for some input on design considerations for defining
> collections in a SolrCloud cluster. Right now, our cluster consists of two
> collections in a 2 shard / 2 replica mode. Each collection has a dedicated
> set of sources that don't overlap, which made it an easy decision.
> Recently, we have a requirement to index a bunch of new sources that are
> region based. The search results corresponding to those regions need to come
> from their specific source as well as sources from one of our existing
> collections. Here's an example of our existing collections and their
> corresponding source(s).

You haven't defined in *ANY* way exactly what a "source" is or how that data actually gets into Solr. Without that information, it'll be difficult to even understand your requirements.

If I make one assumption, that for all of the data sources the config and schema are going to be identical, then I can give you this information: If you set up each source as a collection in your SolrCloud, you can create collection aliases that let you query multiple collections with one query. Whether or not this will work correctly will depend on a few factors, but most of all on whether or not all the data is using the same (or an extremely similar) Solr config/schema.

> The other consideration is the hardware design. Right now, both shards and
> their replicas run on their dedicated instances. With two collections, we
> sometimes run into OOM scenarios, so I'm a little bit worried about adding
> more collections. Does the best practice (I know it's subjective) in
> scenarios like this call for a dedicated Solr cluster per collection? From
> an index size perspective, Source_C, Source_D, and Source_E combine to close
> to 10 million documents with a 60 GB volume size. Each geo-based source is
> small, and won't exceed 500k documents.
10 million documents producing 60GB of index data means that the documents are relatively large, but aren't super huge -- or that the data in them is duplicated several times. For contrast, I have an index where each shard has about 30 million docs, and each of those shards is 36GB in size. The entire index has six of these large shards and one tiny hot shard.

I always get a little anxious when somebody wants best practice information about Solr configurations and hardware. Any recommendation that we make will be COMPLETELY wrong for some use cases, indexes, and/or query patterns. Solr configurations and hardware must be tailored specifically for the use case, index data, and query patterns that actually exist. Typically, this means that you have to actually set up a full system and try it to make any determinations about how much hardware you need.

https://lucidworks.com/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Regarding your hardware sizing, the only general advice I can give you is this: good performance usually ends up requiring significantly more RAM than users plan on.

Thanks,
Shawn
Re: SolrCloud collection design considerations / best practice
Have you considered collection aliasing? You can create an alias that points to multiple collections, so you could keep source-specific collections and have aliases that encompass your regions. The one caveat here is that sorting the final result set by score will require that the collections be roughly similar in terms of TF/IDF.

Best,
Erick

On Mon, Nov 13, 2017 at 11:33 AM, Shamik Bandopadhyay wrote:
> Hi,
>
> I'm looking for some input on design considerations for defining
> collections in a SolrCloud cluster. Right now, our cluster consists of two
> collections in a 2 shard / 2 replica mode. Each collection has a dedicated
> set of sources that don't overlap, which made it an easy decision.
> Recently, we have a requirement to index a bunch of new sources that are
> region based. The search results corresponding to those regions need to come
> from their specific source as well as sources from one of our existing
> collections. Here's an example of our existing collections and their
> corresponding source(s).
>
> Existing Collections:
>
> Collection A --> Source_A, Source_B
> Collection B --> Source_C, Source_D, Source_E
>
> Proposed Collections:
>
> Collection_Asia --> Source_Asia, Source_C, Source_D, Source_E
> Collection_Europe --> Source_Europe, Source_C, Source_D, Source_E
> Collection_Australia --> Source_Australia, Source_C, Source_D, Source_E
>
> The proposed collections show that each geo has its dedicated source
> as well as source(s) from existing Collection B.
>
> Just wondering if creating a dedicated collection for each geo is the right
> approach here. The main motivation is to support a geo-specific relevancy
> model which can easily be customized without stepping into each other. On
> the downside, I'm not sure if it's a good idea to replicate data from the
> same source across various collections. Moreover, the data within the
> sources are not relational, so joining across collections might not be
> an easy proposition.
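To sketch the aliasing idea against the collections named in the question (the alias names here are hypothetical): rather than re-indexing the shared sources into every geo collection, you could keep Source_C/D/E in Collection B, give each geo its own small collection, and point a per-region alias at both via the Collections API. Roughly:

```shell
# Hypothetical alias setup: one small collection per geo plus the shared
# Collection_B, queried together through a per-region alias.
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=asia&collections=Collection_Asia,Collection_B'
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=europe&collections=Collection_Europe,Collection_B'

# A query against the alias fans out across all collections behind it:
curl 'http://localhost:8983/solr/asia/select?q=*:*'
```

The score-sorting caveat above still applies: the alias merges results from separately built indexes, so cross-collection ranking is only sensible if their TF/IDF statistics are roughly similar.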
> The other consideration is the hardware design. Right now, both shards and
> their replicas run on their dedicated instances. With two collections, we
> sometimes run into OOM scenarios, so I'm a little bit worried about adding
> more collections. Does the best practice (I know it's subjective) in
> scenarios like this call for a dedicated Solr cluster per collection? From
> an index size perspective, Source_C, Source_D, and Source_E combine to close
> to 10 million documents with a 60 GB volume size. Each geo-based source is
> small, and won't exceed 500k documents.
>
> Any pointers will be appreciated.
>
> Thanks,
> Shamik
SolrCloud collection design considerations / best practice
Hi,

I'm looking for some input on design considerations for defining collections in a SolrCloud cluster. Right now, our cluster consists of two collections in a 2 shard / 2 replica mode. Each collection has a dedicated set of sources that don't overlap, which made it an easy decision.

Recently, we have a requirement to index a bunch of new sources that are region based. The search results corresponding to those regions need to come from their specific source as well as sources from one of our existing collections. Here's an example of our existing collections and their corresponding source(s).

Existing Collections:

Collection A --> Source_A, Source_B
Collection B --> Source_C, Source_D, Source_E

Proposed Collections:

Collection_Asia --> Source_Asia, Source_C, Source_D, Source_E
Collection_Europe --> Source_Europe, Source_C, Source_D, Source_E
Collection_Australia --> Source_Australia, Source_C, Source_D, Source_E

The proposed collections show that each geo has its dedicated source as well as source(s) from existing Collection B.

Just wondering if creating a dedicated collection for each geo is the right approach here. The main motivation is to support a geo-specific relevancy model which can easily be customized without the models stepping on each other. On the downside, I'm not sure if it's a good idea to replicate data from the same source across various collections. Moreover, the data within the sources are not relational, so joining across collections might not be an easy proposition.

The other consideration is the hardware design. Right now, both shards and their replicas run on their dedicated instances. With two collections, we sometimes run into OOM scenarios, so I'm a little bit worried about adding more collections. Does the best practice (I know it's subjective) in scenarios like this call for a dedicated Solr cluster per collection? From an index size perspective, Source_C, Source_D, and Source_E combine to close to 10 million documents with a 60 GB volume size.
Each geo-based source is small, and won't exceed 500k documents.

Any pointers will be appreciated.

Thanks,
Shamik