Re: Best practice to support multi-tenant with Solr

Lajos Sat, 15 Mar 2014 02:38:29 -0700

Hi Shushuai,

Just a few thoughts.

I would guess that most people would argue for implementing multi-tenancy within your core (via some unique filter ID) or collection (via document routing) because of the headache of managing individual cores at the scale you are talking about.
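
As a rough illustration of those two options, the sketch below builds request parameters for a shared core (tenant filter via `fq`) and for document routing (compositeId-style IDs of the form `tenant!doc`). The field name `tenant_id`, the tenant ID format, and the helper names are assumptions made for the example, not anything from this thread. The `_route_` parameter is the name used in later Solr 4.x releases; older releases used `shard.keys`, so check which one your version expects.

```python
import re
from urllib.parse import urlencode

# Assumption: tenant IDs are simple alphanumeric tokens.
TENANT_ID = re.compile(r"[A-Za-z0-9_]+")


def check_tenant(tenant_id):
    """Reject tenant IDs that could alter the query or the routing prefix."""
    if not TENANT_ID.fullmatch(tenant_id):
        raise ValueError(f"unexpected tenant id: {tenant_id!r}")
    return tenant_id


def shared_core_params(tenant_id, user_query):
    """Shared core: the user query plus a filter on a hypothetical tenant_id field."""
    return {"q": user_query, "fq": "tenant_id:" + check_tenant(tenant_id)}


def composite_id(tenant_id, doc_id):
    """Document routing: prefix the document ID with the tenant as shard key."""
    return check_tenant(tenant_id) + "!" + doc_id


def routed_params(tenant_id, user_query):
    """Routed query: same filter, plus a hint to query only the tenant's shard."""
    params = shared_core_params(tenant_id, user_query)
    params["_route_"] = check_tenant(tenant_id) + "!"
    return params


def select_url(base_url, params):
    """Build a /select URL for a core or collection base URL."""
    return base_url + "/select?" + urlencode(params)
```

Several tenants can hash to the same shard, so the `fq` filter is kept even when routing is used.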

There are disadvantages the other way too: having a core/collection support multiple tenants does affect scoring, since TF-IDF is calculated across the index, and can open up security implications that you have to address (i.e. making sure a malicious query cannot get another tenant's documents).
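
One way to approach the security point is to never pass client parameters straight through to Solr: the application layer keeps a whitelist of parameters and appends a tenant filter derived from the authenticated session. The following is a minimal sketch under those assumptions; the whitelist, the `tenant_id` field name, and the function names are hypothetical.

```python
# Assumption: only these client-supplied parameters are forwarded to Solr.
ALLOWED_PARAMS = {"q", "start", "rows", "sort", "fl"}


def quote_term(value):
    """Quote a value as a Solr phrase, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def tenant_scoped_params(authenticated_tenant, client_params):
    """Drop non-whitelisted parameters and append a server-side tenant filter.

    authenticated_tenant must come from the session or login, never from
    the request itself.
    """
    params = [(k, v) for k, v in client_params.items() if k in ALLOWED_PARAMS]
    params.append(("fq", "tenant_id:" + quote_term(authenticated_tenant)))
    return params
```

Because a filter query only narrows the result set, a clause such as `tenant_id:other` smuggled into `q` cannot return documents outside the authenticated tenant's filter.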

The most important thing you have to lock down is whether there is a need to customize the schema/solrconfig for each tenant. If there is, then the argument for individual cores per tenant is stronger. If I were to guess, based on my own multi-tenant experience, you'll have some high-end tenants who need their own cores/collections, and a larger number that can all share a configuration. It's like any kind of hosted solution: the cheapest version is one-size-fits-all and involves the minimum of management overhead, while the higher-end versions are more expensive and require more management.
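
A mixed setup like this usually needs a small lookup that decides, per tenant, which collection to query and whether a tenant filter is required. The sketch below assumes a hand-maintained mapping; the collection names, the `tenant_id` field, and the tenant names are all made up for illustration, and tenant IDs are assumed to be validated identifiers.

```python
# Hypothetical mapping: high-end tenants with their own collection and config.
DEDICATED_COLLECTIONS = {"bigcorp": "tenant_bigcorp"}

# Hypothetical collection shared by all tenants on the standard configuration.
SHARED_COLLECTION = "shared_tenants"


def resolve(tenant_id):
    """Return (collection, filter_query) for a tenant.

    Dedicated tenants need no filter; shared tenants are isolated by a
    filter on the tenant_id field.
    """
    if tenant_id in DEDICATED_COLLECTIONS:
        return DEDICATED_COLLECTIONS[tenant_id], None
    return SHARED_COLLECTION, f'tenant_id:"{tenant_id}"'
```

Promoting a tenant from the shared tier to a dedicated one then becomes a reindex plus a one-line change to the mapping.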

My own preference is for a blended environment. While the management of individual cores/collections is not to be taken lightly, I've done it in a variety of hosting situations and it all comes down to smart management and the intelligent use of administrative scripts. I've developed my own set of tools over the years and they work quite well.

Finally, I would (in general) argue for cloud-based implementations to give you data redundancy, but that decision would require more information.


HTH,

Lajos Moczar


theconsultantcto.com
Enterprise Lucene/Solr



On 14/03/2014 23:10, shushuai zhu wrote:

Hi,

I am looking into Solr 4.7 for best practice of multi-tenancy support. Our use 
cases require support of thousands of tenants (say 10,000) and the incoming 
data rate could be more than 10k documents per second. I did some research and 
found people talked about scaling tenants at all four levels:

Solr Cloud
Collection
Shard
Core

I am listing them plus some quoted comments from the links.

1) Solr Cloud and Collection

http://find.searchhub.org/document/c7caa34d807a8a1b#c7caa34d807a8a1b

-----------
Are you trying to do "multi-tenant"? If so, you should be talking
     "multi-cluster" where you externally manage your "tenants",
     assigning them to clusters, but keeping tenants per cluster down in
     the dozens/hundreds, and "archiving" inactive tenants and spinning
     up (and down) clusters as inactive tenants become active or fall
     into inactivity. But keeping 1,000 or more tenants active in a
     single cluster as separate collections is... a no-go.
-----------

2) Shard

http://searchhub.org/2013/06/13/solr-cloud-document-routing/

-----------
Document routing can be used to achieve a more efficient
     multi-tenant environment. This can be done by making the tenant id
     the shard key, which would group all documents from the same tenant
     on the same shard.
-----------

3) Core

http://find.searchhub.org/document/4312991db2dd90e9#4312991db2dd90e9

-----------
Every multitenant situation is going to be different, but at the
     extreme a single core per tenant is the cleanest and provides the
     best separation, optimal performance, and supports full tf-idf
     relevancy of document fields for each tenant.
-----------

http://find.searchhub.org/document/fc5b734fba135e83#fc5b734fba135e83

-----------
Well, we try to use Solr to run a multi-tenant index/search
     service.  We assigns each client a different core with their own
     config and schema. It would be good for us if we can just let the
     customer to be able to create cores with their own schema and
     config.
-----------

I also saw slides talking about scaling time along Collection: timed
     collections (slides 50 ~ 58)

http://www.slideshare.net/sematext/solr-for-indexing-and-searching-logs

According to these, I am thinking about the following approach:

In a single Solr Cloud, multi-tenant support is at the Core level
     (one or more cores per tenant), and for better performance, we will
     create a collection every day. When a tenant grows too big, we will
     migrate it from this Solr Cloud to a new Solr Cloud.

Any potential issues with this approach? Is there a better approach
     based on your experience?

A few questions related to the proposed approach:

1) When a core is replicated to multiple nodes via multiple shards,
     the query submitted against a particular core (tenant) should be
     executed distributed, right?
2) What is the best way to move a core from one Solr Cloud to
     another?
3) If we create one collection per day and want to keep data for
     three years for example, is it OK to have so many collections? If
     yes, is it cheap to maintain the collection alias for easy querying?

Thanks.

Shushuai
