Re: Best practice to support multi-tenant with Solr

shushuai zhu Sat, 15 Mar 2014 21:18:07 -0700

Lajos/Robi, thanks for the answers.

For others' convenience, I copied Robi's reply below so this thread contains 
all discussions.


Based on Lajos' detailed comments, there seems to be no single answer to this 
question. There are trade-offs between the collection level and the core 
level, and the choice depends on several factors and on the situation. 
Basically, a POC is needed for the intended use cases, as a few previous 
discussions also suggested.

Robi prefers to have one single collection to handle many tenants as described 
in the quoted mail. As Lajos said, this may be good for certain use cases.

Just one thing Robi mentioned I find a little confusing: "what used to be 
called a 'core' is now called a 'collection' in solr cloud". I think it might 
be better to say "what used to be done via a 'core' is now better done 
via a 'collection' in solr cloud" to take advantage of the distributed 
execution in cloud. In the referenced wiki, the Core concept is still there 
for Solr Cloud, and it is quite different from a Collection: a core is 
physical, a collection is logical.
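
To make the physical vs. logical distinction concrete, here is a minimal 
sketch of the arithmetic behind Lajos' example further down (a 3-node cluster 
with replication factor 2). It assumes one shard per node, which the thread 
does not state, and it counts cores, i.e. the physical shard replicas; what 
Lajos calls "6 shards" corresponds to 6 cores under that assumption. The 
function names are only illustrative.

```python
def cores_per_collection(num_shards, replication_factor):
    """Physical cores backing one logical collection.

    Each shard is stored replication_factor times, and every stored
    copy is one core on some node.
    """
    return num_shards * replication_factor


def total_cores(num_collections, num_shards, replication_factor):
    """Cores in the whole cluster if every collection has the same layout."""
    return num_collections * cores_per_collection(num_shards, replication_factor)


if __name__ == "__main__":
    # Assumed layout: 3 shards (one per node), replication factor 2.
    print(cores_per_collection(3, 2))   # one collection
    print(total_cores(1000, 3, 2))      # one collection per tenant, 1,000 tenants
    print(total_cores(1, 3, 2))         # one shared collection for all tenants
```

Under these assumptions, one collection per tenant multiplies the core count 
by the number of tenants, while a shared collection keeps it constant.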

Regards.

Shushuai 


-----------------------------------
Hi

Overall I think you are mixing up your terminology.  What used to be called a 
'core' is now called a 'collection' in solr cloud.  In the old master slave 
setup, you made separate cores and replicated them to all slaves.  Now they 
want you to think of them as collections and let the cloud manage the 
distribution over the physical machines and their cores.  
https://wiki.apache.org/solr/SolrTerminology

On the multi-tenancy front, I have one core/collection with thousands of 
tenants.  I manage the separation of concerns with dynamic fields using the 
tenant ids as prefixes.  Thus I can have one schema allowing searches across 
all tenants or restricted to one tenant's data.  This is secure because I use a 
wrapper web service to present a simpler API to the web clients and the wrapper 
constructs the actual queries to solr behind the curtains, thus nobody can make 
any malicious queries.  Secure done.

On the performance front, one big index for all the tenants works fine.  It's 
probably just as good as having thousands of collections and much simpler to 
maintain.

Hope that helps a bit,
Robi
------------------------------------
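
The wrapper idea Robi describes might look roughly like the sketch below. It 
is not Robi's code: it assumes the "special field" variant Lajos mentions, 
with a hypothetical `tenant_id` field, rather than Robi's prefixed dynamic 
fields, and the function names and the id pattern are made up for 
illustration. The point is only that the tenant restriction is added on the 
server side as a filter query (`fq`), so the web client never supplies it.

```python
import re
from urllib.parse import urlencode

# Hypothetical rule for valid tenant ids: letters, digits, '_' and '-'.
TENANT_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def build_tenant_query(tenant_id, user_query, rows=10):
    """Build Solr request parameters restricted to a single tenant.

    The tenant id comes from the authenticated session, never from the
    client request, and is validated before it is placed in the filter.
    """
    if not TENANT_ID_PATTERN.fullmatch(tenant_id):
        raise ValueError("invalid tenant id: %r" % (tenant_id,))
    return [
        ("q", user_query),                       # what the client asked for
        ("fq", 'tenant_id:"%s"' % tenant_id),    # restriction added by the wrapper
        ("rows", str(rows)),
        ("wt", "json"),
    ]


def to_query_string(params):
    """URL-encode the parameters for a request to a Solr select handler."""
    return urlencode(params)


if __name__ == "__main__":
    print(to_query_string(build_tenant_query("acme", "title:solr")))
```

Because the filter query is intersected with whatever the client puts in `q`, 
a client query cannot widen the result set beyond its own tenant, provided 
the wrapper is the only path to Solr.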

--------------------------------------------
On Sat, 3/15/14, Lajos <la...@protulae.com> wrote:

 Subject: Re: Best practice to support multi-tenant with Solr
 To: solr-user@lucene.apache.org
 Date: Saturday, March 15, 2014, 5:59 PM
 
 Hi Shushuai,
 
 Yes, as Robi noted, you have to be careful with terminology: core
 generally refers to the traditional Solr configuration of a single index
 + configuration on a single node (optionally replicated to others). A
 collection is a distributed index that is associated with a
 configuration (but multiple collections can be associated with the same
 configuration).
 
 A collection is still a single index, however, just like a core - it's
 just spread out across however many nodes you have and replicated
 according to your chosen replication factor. You can do multi-tenancy
 with cores and collections, but via different strategies.
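
A minimal sketch of the collection-side strategy, document routing, assuming
the collection uses Solr's compositeId router: a document id of the form
`tenant!doc` places all of a tenant's documents on the same shard, and the
`_route_` request parameter limits a query to that shard. The helper names
are hypothetical.

```python
def routed_id(tenant_id, doc_id):
    """Compose a document id whose prefix acts as the shard key."""
    if "!" in tenant_id:
        raise ValueError("tenant id must not contain '!'")
    return "%s!%s" % (tenant_id, doc_id)


def routed_query_params(tenant_id, user_query):
    """Query parameters that send the request only to the tenant's shard."""
    if "!" in tenant_id:
        raise ValueError("tenant id must not contain '!'")
    return {"q": user_query, "_route_": "%s!" % tenant_id}


if __name__ == "__main__":
    print(routed_id("acme", "12345"))
    print(routed_query_params("acme", "title:solr"))
```

Routing only controls placement. Other tenants can land on the same shard, so
a tenant filter is still needed to keep results separate.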
 
 More inline ...
 
 
 On 15/03/2014 19:17, shushuai zhu wrote:
 > Hi Lajos, thanks again.
 >
 > Your suggestion is to support multi-tenant via collection in a Solr
 > Cloud: putting small tenants in one collection and big tenants in
 > their own collections.
 >
 > My original question was to find out which approach is better:
 > supporting multi-tenant at collection level or core level. Based on
 > the links below and a few comments there, it seems people prefer the
 > core level. Collection is logical and core is physical. I am trying to
 > figure out the trade-offs between the approaches regarding
 > scalability, security, performance, and flexibility. My understanding
 > might be wrong; below is a rough comparison:
 >
 > 1) Scalability
 > Core is more scalable than collection by number: we can have many more
 > cores than collections in one Solr Cloud? Or collection is more
 > scalable than core by size: a collection could be much bigger than a
 > core? Not sure which one is better: having ~1000 cores or ~1000
 > collections in a Solr Cloud.
 >
 
 SolrCloud is more scalable in terms of index size. Plus you get
 redundancy, which shouldn't be underestimated in a hosted solution.
 
 
 > 2) Security
 > Core is more isolated than collection: core is physical and has its
 > own index, but collection is logical so multiple collections may
 > contain the same cores?
 >
 
 No: cores are not less or more isolated than collections. Both support
 multi-tenancy, albeit in different ways. If you do it in a core with
 some prefix or special field, you just have to be aware of the security
 implications. As Robi said, this is easily enforced by the middle tier;
 I use Spring for this, in my case.
 
 > 3) Performance
 > Core has better performance control since it has its own index?
 > Collection index is bigger so performance is not as good as smaller
 > core index?
 >
 
 Not really. You might want to test this, however, to verify with your
 specific hardware configuration.
 
 > 4) Flexibility
 > Core is more flexible since it has its own schema/config, but one
 > collection may have multiple cores hence multiple schemas/configs? Or
 > it does not matter since we can set same schema/config for the whole
 > collection?
 >
 
 One could argue that the easiest configuration will be one big
 collection (or maybe divided up intelligently amongst several big
 collections). More complex is 1000s of cores or collections.
 
 The issue is management. 1000s of cores/collections require a level of
 automation. On the other hand, having a single core/collection means if
 you make one change to the schema or solrconfig, it affects everyone.
 That might not work if you have frequent changes or differing tenant
 needs.
 
 This is a decision you'll have to make yourself, based on your client
 needs, change management, index sizes, management system, etc, etc.
 
 
 Regards,
 
 Lajos
 
 
 > Basically, I just want to get opinions about which approach might be
 > better for the given use case.
 >
 > Regards.
 >
 > Shushuai
 >
 >
 > ________________________________
 > From: Lajos <la...@protulae.com>
 > To: solr-user@lucene.apache.org
 > Sent: Saturday, March 15, 2014 1:19 PM
 > Subject: Re: Best practice to support multi-tenant with Solr
 >
 >
 > Hi Shushuai,
 >
 >
 >> ---------------------------
 >> Finally, I would (in general) argue for cloud-based implementations
 >> to give you data redundancy ...
 >> ---------------------------
 >> Do you mean using multi-sharding to have multiple replicas of cores
 >> (corresponding to tenants) across nodes?
 >>
 >> Shushuai
 >>
 >>
 >
 >
 > What I mean first and foremost is that using SolrCloud with
 > replication ensures that your data isn't lost if you lose a node. So
 > in a hosted solution, that's a good thing.
 >
 > If you are using SolrCloud, then it's up to you to choose whether to
 > have one collection per tenant, or one collection that supports
 > multiple tenants via document routing.
 >
 > Obviously the former has implications on the number of shards you'll
 > have. For example, if you have a 3-node cluster with replication
 > factor of 2, that's 6 shards per collection. If you have 1,000 tenant
 > collections, that's 6,000 shards. Hence my argument for multiple
 > low-end tenants per collection, and then only give your higher-end
 > tenants their own collections. Just to make things simpler for you ;)
 >
 > Regards,
 >
 >
 > Lajos
 >
 >
 >>
 >> ________________________________
 >> From: Lajos <la...@protulae.com>
 >> To: solr-user@lucene.apache.org
 >> Sent: Saturday, March 15, 2014 5:37 AM
 >> Subject: Re: Best practice to support multi-tenant with Solr
 >>
 >>
 >> Hi Shushuai,
 >>
 >> Just a few thoughts.
 >>
 >> I would guess that most people would argue for implementing
 >> multi-tenancy within your core (via some unique filter ID) or
 >> collection (via document routing) because of the headache of managing
 >> individual cores at the scale you are talking about.
 >>
 >> There are disadvantages the other way too: having a core/collection
 >> support multiple tenants does affect scoring, since TF-IDF is
 >> calculated across the index, and can open up security implications
 >> that you have to address (i.e. making sure a malicious query cannot
 >> get another tenant's documents).
 >>
 >> The most important thing you have to lock down is whether there is a
 >> need to customize the schema/solrconfig for each tenant. If there is,
 >> then having individual cores per tenant is going to be a stronger
 >> argument. If I was to guess, and based on my own multi-tenant
 >> experience, you'll have some high-end tenants who need their own
 >> cores/collections, and a larger number that can all share a
 >> configuration. It's like any kind of hosted solution: the cheapest
 >> version is one-size-fits-all and involves the minimum of management
 >> overhead, while the higher end are more expensive and require more
 >> management.
 >>
 >> My own preference is for a blended environment. While the management
 >> of individual cores/collections is not to be taken lightly, I've done
 >> it in a variety of hosting situations and it all comes down to smart
 >> management and the intelligent use of administrative scripts. I've
 >> developed my own set of tools over the years and they work quite
 >> well.
 >>
 >> Finally, I would (in general) argue for cloud-based implementations
 >> to give you data redundancy, but that decision would require more
 >> information.
 >>
 >> HTH,
 >>
 >> Lajos Moczar
 >>
 >>
 >> theconsultantcto.com
 >> Enterprise Lucene/Solr
 >>
 >>
 >>
 >>
 >> On 14/03/2014 23:10, shushuai zhu wrote:
 >>> Hi,
 >>>
 >>> I am looking into Solr 4.7 for best practice of multi-tenancy
 >>> support. Our use cases require support of thousands of tenants (say
 >>> 10,000) and the incoming data rate could be more than 10k documents
 >>> per second. I did some research and found people talked about
 >>> scaling tenants at all four levels:
 >>>
 >>> Solr Cloud
 >>> Collection
 >>> Shard
 >>> Core
 >>>
 >>> I am listing them plus some quoted comments from the links.
 >>>
 >>> 1) Solr Cloud and Collection
 >>>
 >>> http://find.searchhub.org/document/c7caa34d807a8a1b#c7caa34d807a8a1b
 >>>
 >>> -----------
 >>> Are you trying to do "multi-tenant"? If so, you should be talking
 >>> "multi-cluster" where you externally manage your "tenants",
 >>> assigning them to clusters, but keeping tenants per cluster down in
 >>> the dozens/hundreds, and "archiving" inactive tenants and spinning
 >>> up (and down) clusters as inactive tenants become active or fall
 >>> into inactivity. But keeping 1,000 or more tenants active in a
 >>> single cluster as separate collections is... a no-go.
 >>> -----------
 >>>
 >>> 2) Shard
 >>>
 >>> http://searchhub.org/2013/06/13/solr-cloud-document-routing/
 >>>
 >>> -----------
 >>> Document routing can be used to achieve a more efficient
 >>> multi-tenant environment. This can be done by making the tenant id
 >>> the shard key, which would group all documents from the same tenant
 >>> on the same shard.
 >>> -----------
 >>>
 >>> 3) Core
 >>>
 >>> http://find.searchhub.org/document/4312991db2dd90e9#4312991db2dd90e9
 >>>
 >>> -----------
 >>> Every multitenant situation is going to be different, but at the
 >>> extreme a single core per tenant is the cleanest and provides the
 >>> best separation, optimal performance, and supports full tf-idf
 >>> relevancy of document fields for each tenant.
 >>> -----------
 >>>
 >>> http://find.searchhub.org/document/fc5b734fba135e83#fc5b734fba135e83
 >>>
 >>> -----------
 >>> Well, we try to use Solr to run a multi-tenant index/search
 >>> service.  We assign each client a different core with their own
 >>> config and schema. It would be good for us if we can just let the
 >>> customer create cores with their own schema and config.
 >>> -----------
 >>>
 >>> I also saw slides talking about scaling time along Collection: timed
 >>> collections (slides 50 ~ 58)
 >>>
 >>> http://www.slideshare.net/sematext/solr-for-indexing-and-searching-logs
 >>>
 >>> According to these, I am thinking about the following approach:
 >>>
 >>> In a single Solr Cloud, the multi-tenant support is at Core level
 >>> (one or more cores per tenant), and for better performance, will
 >>> create a collection every day. When a tenant grows too big, will
 >>> migrate it from this Solr cloud to a new Solr Cloud.
 >>>
 >>> Any potential issue with this approach? Is there better approach
 >>> based on your experience?
 >>>
 >>> A few questions related to proposed approach:
 >>>
 >>> 1) When a core is replicated to multiple nodes via multiple shards,
 >>> the query submitted against a particular core (tenant) should be
 >>> executed distributed, right?
 >>> 2) What is the best way to move a core from one Solr Cloud to
 >>> another?
 >>> 3) If we create one collection per day and want to keep data for
 >>> three years for example, is it OK to have so many collections? If
 >>> yes, is it cheap to maintain the collection alias for easy querying?
 >>>
 >>> Thanks.
 >>>
 >>> Shushuai
 >>>
