RE: Best practice to support multi-tenant with Solr

Petersen, Robert Sat, 15 Mar 2014 15:33:25 -0700

Hi 

Overall I think you are mixing up your terminology.  What used to be called a 
'core' is now called a 'collection' in SolrCloud.  In the old master/slave 
setup, you made separate cores and replicated them to all slaves.  Now they 
want you to think of them as collections and let the cloud manage the 
distribution over the physical machines and their cores.  
https://wiki.apache.org/solr/SolrTerminology

On the multi-tenancy front, I have one core/collection with thousands of 
tenants.  I manage the separation of concerns with dynamic fields, using the 
tenant IDs as prefixes.  Thus I can have one schema that allows searches across 
all tenants or restricted to one tenant's data.  This is secure because I use a 
wrapper web service to present a simpler API to the web clients, and the wrapper 
constructs the actual queries to Solr behind the scenes, so nobody can send 
malicious queries.  Security: done.
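
A minimal sketch of what such a wrapper might do, in Python. Everything named
here is an assumption for illustration, not Robi's actual code: the field
naming scheme (`<tenant id>_<field>`), the `ALLOWED_FIELDS` whitelist, and the
helper names are hypothetical. The idea is only that clients pass plain text,
and the wrapper validates the tenant ID, escapes query syntax, and builds the
Solr parameters itself.

```python
import re
from urllib.parse import urlencode

# Characters with special meaning in the Lucene/Solr query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')
# Hypothetical rule: tenant IDs are plain alphanumeric strings.
_TENANT_ID = re.compile(r'[A-Za-z0-9]+')
# Hypothetical whitelist of logical field names clients may search.
ALLOWED_FIELDS = {"title", "body"}


def escape_term(text):
    """Backslash-escape query-syntax characters in user-supplied text."""
    return _SPECIAL.sub(r'\\\1', text)


def build_tenant_params(tenant_id, field, user_text, rows=10):
    """Build /select parameters restricted to one tenant's prefixed field."""
    if not _TENANT_ID.fullmatch(tenant_id):
        raise ValueError("invalid tenant id")
    if field not in ALLOWED_FIELDS:
        raise ValueError("unknown field")
    # Assumed naming scheme: tenant ID as prefix, e.g. "t42_title".
    solr_field = f"{tenant_id}_{field}"
    return {
        "q": f'{solr_field}:"{escape_term(user_text)}"',
        "rows": rows,
        "wt": "json",
    }


if __name__ == "__main__":
    params = build_tenant_params("t42", "title", 'report" OR *:*')
    # The wrapper, not the client, would send this to Solr.
    print("/select?" + urlencode(params))
```

Because the client never supplies a field name or raw query string, an input
such as `report" OR *:*` ends up as escaped text inside one tenant's field
rather than as query syntax.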

On the performance front, one big index for all the tenants works fine.  It's 
probably just as good as having thousands of collections and much simpler to 
maintain.

Hope that helps a bit,
Robi

-----Original Message-----
From: shushuai zhu [mailto:ss...@yahoo.com] 
Sent: Saturday, March 15, 2014 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Best practice to support multi-tenant with Solr

Hi Lajos, thanks again. 

Your suggestion is to support multi-tenant via collection in a Solr Cloud: 
putting small tenants in one collection and big tenants in their own 
collections. 

My original question was to find out which approach is better: supporting 
multi-tenant at collection level or core level. Based on the links below and a 
few comments there, it seems people prefer the core level. A collection is 
logical and a core is physical. I am trying to figure out the trade-offs between 
the approaches regarding scalability, security, performance, and 
flexibility. My understanding might be wrong; below is a rough 
comparison:

1) Scalability
A core is more scalable than a collection by number: can we have many more cores 
than collections in one Solr Cloud? Or is a collection more scalable than a core 
by size: could a collection be much bigger than a core? I am not sure which is 
better: having ~1000 cores or ~1000 collections in a Solr Cloud.

2) Security
A core is more isolated than a collection: a core is physical and has its own 
index, but a collection is logical, so multiple collections may contain the same 
cores?

3) Performance
A core gives better performance control since it has its own index? A collection 
index is bigger, so performance is not as good as with a smaller core index?

4) Flexibility
A core is more flexible since it has its own schema/config, but one collection 
may have multiple cores and hence multiple schemas/configs? Or does it not matter 
since we can set the same schema/config for the whole collection?

Basically, I just want to get opinions about which approach might be better for 
the given use case.

Regards.

Shushuai

________________________________
From: Lajos <la...@protulae.com>
To: solr-user@lucene.apache.org 
Sent: Saturday, March 15, 2014 1:19 PM
Subject: Re: Best practice to support multi-tenant with Solr

Hi Shushuai,

> ---------------------------
> Finally, I would (in general) argue for cloud-based implementations to give 
> you data redundancy ...
> ---------------------------
> Do you mean using multi-sharding to have multiple replicas of cores 
> (corresponding to tenants) across nodes?
>
> Shushuai
>
>

What I mean first and foremost is that using SolrCloud with replication 
ensures that your data isn't lost if you lose a node. So in a hosted 
solution, that's a good thing.

If you are using SolrCloud, then it's up to you to choose whether to have 
one collection per tenant, or one collection that supports multiple 
tenants via document routing.
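
For the shared-collection option, a rough sketch of document routing in
Python. It assumes the collection uses Solr's compositeId router, where a
document ID of the form `<shard key>!<document id>` groups documents with the
same key on the same shard, and where (as I understand it) the `_route_` query
parameter limits a query to the shard for that key. The `tenant_id` field and
the helper names are hypothetical.

```python
def routed_id(tenant_id, doc_id):
    """Build a compositeId-style ID: '<shard key>!<document id>'."""
    if "!" in tenant_id:
        raise ValueError("tenant id must not contain '!'")
    return f"{tenant_id}!{doc_id}"


def tenant_query_params(tenant_id, query):
    """Build query parameters that target one tenant's shard."""
    return {
        "q": query,
        # Route the request to the shard that holds this tenant's key.
        "_route_": f"{tenant_id}!",
        # Hypothetical field: still filter by tenant, since several
        # tenants can hash to the same shard.
        "fq": f"tenant_id:{tenant_id}",
    }


if __name__ == "__main__":
    print(routed_id("acme", "doc1"))
    print(tenant_query_params("acme", "title:report"))
```

Routing only co-locates a tenant's documents; it does not isolate them, which
is why the sketch keeps an explicit tenant filter on every query.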

Obviously, one collection per tenant has implications for the number of 
shards you'll have. For example, if you have a 3-node cluster (one shard 
per node) with a replication factor of 2, that's 6 shard replicas per 
collection. If you have 1,000 tenant collections, that's 6,000 shard 
replicas. Hence my argument for putting multiple low-end tenants in one 
collection and giving only your higher-end tenants their own 
collections. Just to make things simpler for you ;)
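
The arithmetic above can be sketched as follows. This assumes one shard per
node (so 3 shards on the 3-node cluster), which the message does not state
explicitly; the function names are made up for the example. Each shard replica
is hosted as a core on some node.

```python
def replicas_per_collection(num_shards, replication_factor):
    """Shard replicas (cores) needed for one collection."""
    return num_shards * replication_factor


def total_replicas(num_collections, num_shards, replication_factor):
    """Shard replicas across the cluster with one collection per tenant."""
    return num_collections * replicas_per_collection(
        num_shards, replication_factor
    )


if __name__ == "__main__":
    print(replicas_per_collection(3, 2))   # 6
    print(total_replicas(1000, 3, 2))      # 6000
```

On 3 nodes, 6,000 replicas works out to 2,000 cores per node, which is the
management burden the blended approach is meant to avoid.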

Regards, 

Lajos

>
> ________________________________
> From: Lajos <la...@protulae.com>
> To: solr-user@lucene.apache.org
> Sent: Saturday, March 15, 2014 5:37 AM
> Subject: Re: Best practice to support multi-tenant with Solr
>
>
> Hi Shushuai,
>
> Just a few thoughts.
>
> I would guess that most people would argue for implementing
> multi-tenancy within your core (via some unique filter ID) or collection
> (via document routing) because of the headache of managing individual
> cores at the scale you are talking about.
>
> There are disadvantages the other way too: having a core/collection
> support multiple tenants does affect scoring, since TF-IDF is calculated
> across the index, and can open up security implications that you have to
> address (i.e. making sure a malicious query cannot get another tenant's
> documents).
>
> The most important thing you have to lock down is whether there is a
> need to customize the schema/solrconfig for each tenant. If there is,
> then having individual cores per tenant is going to be a stronger
> argument. If I were to guess, based on my own multi-tenant
> experience, you'll have some high-end tenants who need their own
> cores/collections, and a larger number that can all share a
> configuration. It's like any kind of hosted solution: the cheapest
> version is one-size-fits-all and involves the minimum of management
> overhead, while the higher-end versions are more expensive and require
> more management.
>
> My own preference is for a blended environment. While the management of
> individual cores/collections is not to be taken lightly, I've done it in
> a variety of hosting situations and it all comes down to smart
> management and the intelligent use of administrative scripts. I've
> developed my own set of tools over the years and they work quite well.
>
> Finally, I would (in general) argue for cloud-based implementations to
> give you data redundancy, but that decision would require more information.
>
> HTH,
>
> Lajos Moczar
>
>
> theconsultantcto.com
> Enterprise Lucene/Solr
>
>
>
>
> On 14/03/2014 23:10, shushuai zhu wrote:
>> Hi,
>>
>> I am looking into Solr 4.7 for best practice of multi-tenancy support. Our 
>> use cases require support of thousands of tenants (say 10,000) and the 
>> incoming data rate could be more than 10k documents per second. I did some 
>> research and found people talked about scaling tenants at all four levels:
>>
>> Solr Cloud
>> Collection
>> Shard
>> Core
>>
>> I am listing them plus some quoted comments from the links.
>>
>> 1) Solr Cloud and Collection
>>
>> http://find.searchhub.org/document/c7caa34d807a8a1b#c7caa34d807a8a1b
>>
>> -----------
>> Are you trying to do "multi-tenant"? If so, you should be talking
>>        "multi-cluster" where you externally manage your "tenants",
>>        assigning them to clusters, but keeping tenants per cluster down in
>>        the dozens/hundreds, and "archiving" inactive tenants and spinning
>>        up (and down) clusters as inactive tenants become active or fall
>>        into inactivity. But keeping 1,000 or more tenants active in a
>>        single cluster as separate collections is... a no-go.
>> -----------
>>
>> 2) Shard
>>
>> http://searchhub.org/2013/06/13/solr-cloud-document-routing/
>>
>> -----------
>> Document routing can be used to achieve a more efficient
>>        multi-tenant environment. This can be done by making the tenant id
>>        the shard key, which would group all documents from the same tenant
>>        on the same shard.
>> -----------
>>
>> 3) Core
>>
>> http://find.searchhub.org/document/4312991db2dd90e9#4312991db2dd90e9
>>
>> -----------
>> Every multitenant situation is going to be different, but at the
>>        extreme a single core per tenant is the cleanest and provides the
>>        best separation, optimal performance, and supports full tf-idf
>>        relevancy of document fields for each tenant.
>> -----------
>>
>> http://find.searchhub.org/document/fc5b734fba135e83#fc5b734fba135e83
>>
>> -----------
>> Well, we try to use Solr to run a multi-tenant index/search
>>        service.  We assigns each client a different core with their own
>>        config and schema. It would be good for us if we can just let the
>>        customer to be able to create cores with their own schema and
>>        config.
>> -----------
>>
>> I also saw slides talking about scaling time along Collection: timed
>>        collections (slides 50 ~ 58)
>>
>> http://www.slideshare.net/sematext/solr-for-indexing-and-searching-logs
>>
>> According to these, I am thinking about the following approach:
>>
>> In a single Solr Cloud, the multi-tenant support is at Core level
>>        (one or more cores per tenant), and for better performance, we will
>>        create a collection every day. When a tenant grows too big, we will
>>        migrate it from this Solr Cloud to a new Solr Cloud.
>>
>> Any potential issues with this approach? Is there a better approach
>>        based on your experience?
>>
>> A few questions related to the proposed approach:
>>
>> 1) When a core is replicated to multiple nodes via multiple shards,
>>        the query submitted against a particular core (tenant) should be
>>        executed as a distributed query, right?
>> 2) What is the best way to move a core from one Solr Cloud to
>>        another?
>> 3) If we create one collection per day and want to keep data for
>>        three years for example, is it OK to have so many collections? If
>>        yes, is it cheap to maintain the collection alias for easy querying?
>>
>> Thanks.
>>
>> Shushuai
>>
