Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Jack Krupansky
Just to give a specific answer to the original question, I would say that
dozens of cores (collections) is certainly fine (assuming the total data
load and query rate is reasonable), maybe 50 or even 100. Low hundreds of
cores/collections MAY work, but that isn't advisable. Thousands, if it works at
all, is probably just asking for trouble and likely to be far more hassle
than it could possibly be worth.

Whether the number for you ends up being 37, 50, 75, 100, 237, or 1273, you
will have to do a proof of concept implementation to validate it.

I'm not sure where we are at these days for lazy-loading of cores. That may
work for you with hundreds (thousands?!) of cores/collections for tenants
who are mostly idle or dormant, but if the server is running long enough,
it may build up a lot of memory usage for collections that were active but
have gone idle after days or weeks.


-- Jack Krupansky

On Wed, Mar 25, 2015 at 2:49 AM, Shai Erera  wrote:

> While it's hard to answer this question because as others have said, "it
> depends", I think it will be good of we can quantify or assess the cost of
> running a SolrCore.
>
> For instance, let's say that a server can handle a load of 10M indexed
> documents (I omit search load on purpose for now) in a single SolrCore.
> Would the same server be able to handle the same number of documents if we
> indexed 1000 docs per SolrCore, in a total of 10,000 SolrCores? If the answer
> is no, then it means there is some cost that comes w/ each SolrCore, and we
> may at least be able to give an upper bound --- on a server with X amount
> of storage, Y GB RAM and Z cores you can run up to maxSolrCores(X, Y, Z).
>
> Another way to look at it, if I were to create empty SolrCores, would I be
> able to create an infinite number of cores if storage was infinite? Or even
> empty cores have their toll on CPU and RAM?
>
> I know from the Lucene side of things that each SolrCore (which carries a
> Lucene index) has a per-index toll -- the lexicon, IW's RAM buffer, Codecs
> that store things in memory, etc. For instance, one downside of splitting a
> 10M-doc core into 10,000 cores is that the cost of holding the total
> lexicon (dictionary of indexed words) goes up drastically, since now every
> word (just the byte[] of the word) is potentially represented in memory
> 10,000 times.
>
> What other RAM/CPU/Storage costs does a SolrCore carry with it? There are
> the caches of course, which really depend on how many documents are
> indexed. Any other non-trivial or constant cost?
>
> So yes, there isn't a single answer to this question. It's just like
> someone would ask how many documents can a single Lucene index handle
> efficiently. But if we can come up with basic numbers as I outlined above,
> it might help people doing rough estimates. That doesn't mean people
> shouldn't benchmark, as that upper bound may be way too high for their
> data set, query workload and search needs.
>
> Shai
>
> On Wed, Mar 25, 2015 at 5:25 AM, Damien Kamerman 
> wrote:
>
> > From my experience on a high-end server (256GB memory, 40 core CPU)
> testing
> > collection numbers with one shard and two replicas, the maximum that
> would
> > work is 3,000 cores (1,500 collections). I'd recommend much less (perhaps
> > half of that), depending on your startup-time requirements. (Though I
> have
> > settled on 6,000 collection maximum with some patching. See SOLR-7191).
> You
> > could create multiple clouds after that, and choose the cloud least used
> to
> > create your collection.
> >
> > Regarding memory usage I'd pencil in 6MB overhead (no docs) Java heap
> per
> > collection.
> >
> > On 25 March 2015 at 13:46, Ian Rose  wrote:
> >
> > > First off thanks everyone for the very useful replies thus far.
> > >
> > > Shawn - thanks for the list of items to check.  #1 and #2 should be
> fine
> > > for us and I'll check our ulimit for #3.
> > >
> > > To add a bit of clarification, we are indeed using SolrCloud.  Our
> > current
> > > setup is to create a new collection for each customer.  For now we
> allow
> > > SolrCloud to decide for itself where to locate the initial shard(s) but
> > in
> > > time we expect to refine this such that our system will automatically
> > > choose the least loaded nodes according to some metric(s).
> > >
> > > Having more than one business entity controlling the configuration of a
> > > > single (Solr) server is a recipe for disaster. Solr works well if
> there
> > > is
> > > > an architect for the system.
> > >
> > >
> > > Jack, can you explain a bit what you mean here?  It looks like Toke
> > caught
> > > your meaning but I'm afraid it missed me.  What do you mean by
> "business
> > > entity"?  Is your concern that with automatic creation of collections
> > they
> > > will be distributed willy-nilly across the cluster, leading to uneven
> > load
> > > across nodes?  If it is relevant, the schema and solrconfig are
> > controlled
> > > entirely by me and are the same for all collections

Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Per Steffensen

On 25/03/15 15:03, Ian Rose wrote:

Per - Wow, 1 trillion documents stored is pretty impressive.  One
clarification: when you say that you have 2 replica per collection on each
machine, what exactly does that mean?  Do you mean that each collection is
sharded into 50 shards, divided evenly over all 25 machines (thus 2 shards
per machine)?

Yes

   Or are some of these slave replicas (e.g. 25x sharding with
1 replica per shard)?
No replication. It does not work very well, at least in 4.4.0. Besides
that, I am not a big fan of two (or more) machines having to do all the
indexing work and having to keep themselves synchronized. Use a distributed
file-system that supports multiple copies of every piece of data (like
HDFS) for HA at the data level. Have only one Solr-node handle the indexing
into a particular shard - if this Solr-node breaks down, let another
Solr-node take over the indexing "leadership" on this shard. Besides the
indexing Solr-node, several other Solr-nodes can serve data from this
shard - just watching the data-folder (and commits) written by the
indexing-leader of this particular shard. That will give you HA at the
service level. That is probably how we are going to do HA - pretty soon.
But that is another story.


Thanks!

No problem



Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Ian Rose
Per - Wow, 1 trillion documents stored is pretty impressive.  One
clarification: when you say that you have 2 replica per collection on each
machine, what exactly does that mean?  Do you mean that each collection is
sharded into 50 shards, divided evenly over all 25 machines (thus 2 shards
per machine)?  Or are some of these slave replicas (e.g. 25x sharding with
1 replica per shard)?

Thanks!

On Wed, Mar 25, 2015 at 5:13 AM, Per Steffensen  wrote:

> In one of our production environments we use 32GB, 4-core, 3T RAID0
> spinning disk Dell servers (do not remember the exact model). We have about
> 25 collections with 2 replicas (shard-instances) per collection on each
> machine - 25 machines. Total of 25 coll * 2 replicas/coll/machine * 25
> machines = 1250 replicas. Each replica contains about 800 million pretty
> small documents - that's about 1,000 billion (a trillion) documents all in
> all. We index about 1.5 billion new documents every day (mainly into one of
> the collections = 50 replicas across 25 machines) and keep a history of 2
> years on the data. We shift the "index into" collection every month. We can
> fairly easily keep up with the indexing load. We have almost none of the
> data on the heap, but of course a small fraction of the data in the files
> will at any time be in the OS file-cache.
> Compared to our indexing frequency we do not do a lot of searches. We have
> about 10 users searching the system from time to time - anything from major
> extracts to small quick searches. Depending on the nature of the search we
> have response-times between 1 sec and 5 min. But of course that is very
> dependent on "clever" choices for each field wrt index, store, doc-values etc.
> BUT we are not using out-of-the-box Apache Solr. We have made quite a lot of
> performance tweaks ourselves.
> Please note that, even though you disable all Solr caches, each replica
> will use heap-memory linearly dependent on the number of documents (and
> their size) in that replica. But not much, so you can get pretty far with
> relatively little RAM.
> Our version of Solr is based on Apache Solr 4.4.0, but I expect/hope it
> did not get worse in newer releases.
>
> Just to give you some idea of what can at least be achieved - in the
> high-end of #replica and #docs, I guess
>
> Regards, Per Steffensen
>
>
> On 24/03/15 14:02, Ian Rose wrote:
>
>> Hi all -
>>
>> I'm sure this topic has been covered before but I was unable to find any
>> clear references online or in the mailing list.
>>
>> Are there any rules of thumb for how many cores (aka shards, since I am
>> using SolrCloud) is "too many" for one machine?  I realize there is no one
>> answer (depends on size of the machine, etc.) so I'm just looking for a
>> rough idea.  Something like the following would be very useful:
>>
>> * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
>> server without any problems.
>> * I have never heard of anyone successfully running X cores/shards on a
>> single machine, even if you throw a lot of hardware at it.
>>
>> Thanks!
>> - Ian
>>
>>
>


Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Per Steffensen
In one of our production environments we use 32GB, 4-core, 3T RAID0
spinning disk Dell servers (do not remember the exact model). We have
about 25 collections with 2 replicas (shard-instances) per collection on
each machine - 25 machines. Total of 25 coll * 2 replicas/coll/machine *
25 machines = 1250 replicas. Each replica contains about 800 million
pretty small documents - that's about 1,000 billion (a trillion)
documents all in all. We index about 1.5 billion new documents every day
(mainly into one of the collections = 50 replicas across 25 machines)
and keep a history of 2 years on the data. We shift the "index into"
collection every month. We can fairly easily keep up with the indexing
load. We have almost none of the data on the heap, but of course a small
fraction of the data in the files will at any time be in the
OS file-cache.
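
As a quick back-of-envelope check of those numbers, a small Java sketch; every
figure is taken from the paragraph above except the 2 x 365-day retention
window, which is an assumption:

    public class ClusterArithmetic {
        public static void main(String[] args) {
            long collections = 25, replicasPerCollectionPerMachine = 2, machines = 25;
            long replicas = collections * replicasPerCollectionPerMachine * machines; // 1250
            long docsPerReplica = 800_000_000L;
            long totalDocs = replicas * docsPerReplica;           // ~1.0e12, i.e. ~1 trillion
            long ingestedOverTwoYears = 1_500_000_000L * 365 * 2; // ~1.1e12, consistent with the above
            System.out.printf("replicas=%d totalDocs=%.1e ingested=%.1e%n",
                    replicas, (double) totalDocs, (double) ingestedOverTwoYears);
        }
    }
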
Compared to our indexing frequency we do not do a lot of searches. We
have about 10 users searching the system from time to time - anything
from major extracts to small quick searches. Depending on the nature of
the search we have response-times between 1 sec and 5 min. But of course
that is very dependent on "clever" choices for each field wrt index,
store, doc-values etc.
BUT we are not using out-of-the-box Apache Solr. We have made quite a lot
of performance tweaks ourselves.
Please note that, even though you disable all Solr caches, each replica 
will use heap-memory linearly dependent on the number of documents (and 
their size) in that replica. But not much, so you can get pretty far 
with relatively little RAM.
Our version of Solr is based on Apache Solr 4.4.0, but I expect/hope it 
did not get worse in newer releases.


Just to give you some idea of what can at least be achieved - in the 
high-end of #replica and #docs, I guess


Regards, Per Steffensen

On 24/03/15 14:02, Ian Rose wrote:

Hi all -

I'm sure this topic has been covered before but I was unable to find any
clear references online or in the mailing list.

Are there any rules of thumb for how many cores (aka shards, since I am
using SolrCloud) is "too many" for one machine?  I realize there is no one
answer (depends on size of the machine, etc.) so I'm just looking for a
rough idea.  Something like the following would be very useful:

* People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
server without any problems.
* I have never heard of anyone successfully running X cores/shards on a
single machine, even if you throw a lot of hardware at it.

Thanks!
- Ian





Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Toke Eskildsen
On Wed, 2015-03-25 at 03:46 +0100, Ian Rose wrote:
> Thus theoretically we could actually just use one single collection for
> all of our customers (adding a 'customer:' type fq to all
> queries) but since we never need to query across customers it seemed
> more performant (as well as safer - less chance of accidentally
> leaking data across customers) to use separate collections.

If only a few customers are active at a given time, it is more
performant to use a collection per customer. If many of them are active,
the more performant option is to lump them together and filter on a
field, due to the redundancy-reduction of larger indexes.

The 1 collection/customer solution has another edge as ranking will be
calculated based on the corpus of the customer and not based on all
customers. If the number of customers is low enough to get the
individual collections solution to work, that would be the preferable
solution.

- Toke Eskildsen, State and University Library, Denmark
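
To make the "lump them together and filter on a field" option concrete, a
minimal SolrJ sketch; the collection name and the customer_id field are
illustrative assumptions, not something from this thread (in SolrJ 5+,
HttpSolrServer becomes HttpSolrClient):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SharedCollectionQuery {
        public static void main(String[] args) throws Exception {
            // One shared collection for all tenants; isolation is done per query.
            HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/shared_collection");
            SolrQuery q = new SolrQuery("some search terms");
            q.addFilterQuery("customer_id:42"); // constrain every query to one tenant
            QueryResponse rsp = solr.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
            solr.shutdown();
        }
    }

Note that Toke's caveat still applies here: ranking statistics are then
computed over the whole shared corpus, not per customer.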




Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Damien Kamerman
I've tried (very simplistically) hitting a collection with a good variety
of searches and looking at the collection's heap memory and working out the
bytes / doc. I've seen results around 100 bytes / doc, and as low as 3
bytes / doc for collections with small docs. It's still a work in progress
- not sure yet whether it will scale with doc count, or whether it is too simplistic.

On 25 March 2015 at 17:49, Shai Erera  wrote:

> While it's hard to answer this question because as others have said, "it
> depends", I think it will be good of we can quantify or assess the cost of
> running a SolrCore.
>
> For instance, let's say that a server can handle a load of 10M indexed
> documents (I omit search load on purpose for now) in a single SolrCore.
> Would the same server be able to handle the same number of documents if we
> indexed 1000 docs per SolrCore, in a total of 10,000 SolrCores? If the answer
> is no, then it means there is some cost that comes w/ each SolrCore, and we
> may at least be able to give an upper bound --- on a server with X amount
> of storage, Y GB RAM and Z cores you can run up to maxSolrCores(X, Y, Z).
>
> Another way to look at it, if I were to create empty SolrCores, would I be
> able to create an infinite number of cores if storage was infinite? Or even
> empty cores have their toll on CPU and RAM?
>
> I know from the Lucene side of things that each SolrCore (which carries a
> Lucene index) has a per-index toll -- the lexicon, IW's RAM buffer, Codecs
> that store things in memory, etc. For instance, one downside of splitting a
> 10M-doc core into 10,000 cores is that the cost of holding the total
> lexicon (dictionary of indexed words) goes up drastically, since now every
> word (just the byte[] of the word) is potentially represented in memory
> 10,000 times.
>
> What other RAM/CPU/Storage costs does a SolrCore carry with it? There are
> the caches of course, which really depend on how many documents are
> indexed. Any other non-trivial or constant cost?
>
> So yes, there isn't a single answer to this question. It's just like
> someone would ask how many documents can a single Lucene index handle
> efficiently. But if we can come up with basic numbers as I outlined above,
> it might help people doing rough estimates. That doesn't mean people
> shouldn't benchmark, as that upper bound may be way too high for their
> data set, query workload and search needs.
>
> Shai
>
> On Wed, Mar 25, 2015 at 5:25 AM, Damien Kamerman 
> wrote:
>
> > From my experience on a high-end server (256GB memory, 40 core CPU)
> testing
> > collection numbers with one shard and two replicas, the maximum that
> would
> > work is 3,000 cores (1,500 collections). I'd recommend much less (perhaps
> > half of that), depending on your startup-time requirements. (Though I
> have
> > settled on 6,000 collection maximum with some patching. See SOLR-7191).
> You
> > could create multiple clouds after that, and choose the cloud least used
> to
> > create your collection.
> >
> > Regarding memory usage I'd pencil in 6MB overhead (no docs) Java heap
> per
> > collection.
> >
> > On 25 March 2015 at 13:46, Ian Rose  wrote:
> >
> > > First off thanks everyone for the very useful replies thus far.
> > >
> > > Shawn - thanks for the list of items to check.  #1 and #2 should be
> fine
> > > for us and I'll check our ulimit for #3.
> > >
> > > To add a bit of clarification, we are indeed using SolrCloud.  Our
> > current
> > > setup is to create a new collection for each customer.  For now we
> allow
> > > SolrCloud to decide for itself where to locate the initial shard(s) but
> > in
> > > time we expect to refine this such that our system will automatically
> > > choose the least loaded nodes according to some metric(s).
> > >
> > > Having more than one business entity controlling the configuration of a
> > > > single (Solr) server is a recipe for disaster. Solr works well if
> there
> > > is
> > > > an architect for the system.
> > >
> > >
> > > Jack, can you explain a bit what you mean here?  It looks like Toke
> > caught
> > > your meaning but I'm afraid it missed me.  What do you mean by
> "business
> > > entity"?  Is your concern that with automatic creation of collections
> > they
> > > will be distributed willy-nilly across the cluster, leading to uneven
> > load
> > > across nodes?  If it is relevant, the schema and solrconfig are
> > controlled
> > > entirely by me and are the same for all collections.  Thus theoretically
> > we
> > > could actually just use one single collection for all of our customers
> > > (adding a 'customer:' type fq to all queries) but since we
> > never
> > > need to query across customers it seemed more performant (as well as
> > safer
> > > - less chance of accidentally leaking data across customers) to use
> > > separate collections.
> > >
> > > Better to give each tenant a separate Solr instance that you spin up
> and
> > > > spin down based on demand.
> > >
> > >
> > > Regarding this, if by tenant you mean "cus

Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Shai Erera
While it's hard to answer this question because as others have said, "it
depends", I think it will be good of we can quantify or assess the cost of
running a SolrCore.

For instance, let's say that a server can handle a load of 10M indexed
documents (I omit search load on purpose for now) in a single SolrCore.
Would the same server be able to handle the same number of documents if we
indexed 1000 docs per SolrCore, in a total of 10,000 SolrCores? If the answer
is no, then it means there is some cost that comes w/ each SolrCore, and we
may at least be able to give an upper bound --- on a server with X amount
of storage, Y GB RAM and Z cores you can run up to maxSolrCores(X, Y, Z).

Another way to look at it, if I were to create empty SolrCores, would I be
able to create an infinite number of cores if storage was infinite? Or even
empty cores have their toll on CPU and RAM?

I know from the Lucene side of things that each SolrCore (which carries a
Lucene index) has a per-index toll -- the lexicon, IW's RAM buffer, Codecs
that store things in memory, etc. For instance, one downside of splitting a
10M-doc core into 10,000 cores is that the cost of holding the total
lexicon (dictionary of indexed words) goes up drastically, since now every
word (just the byte[] of the word) is potentially represented in memory
10,000 times.
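
To put a rough number on that duplication, a back-of-envelope sketch; the term
count and average term length are invented for illustration, and real Lucene
term dictionaries are prefix-compressed and only partly memory-resident, so
treat this as a worst case:

    public class LexiconDuplicationEstimate {
        public static void main(String[] args) {
            long uniqueTerms = 5_000_000L; // assumed unique terms in the whole data set
            int avgTermBytes = 8;          // assumed average term length in bytes
            int cores = 10_000;            // SolrCores the same data is split into
            long oneCore = uniqueTerms * avgTermBytes; // ~38 MB
            long allCores = oneCore * cores;           // worst case: every term in every core
            System.out.printf("1 core: ~%d MB, %d cores: ~%d GB (worst case)%n",
                    oneCore / (1 << 20), cores, allCores / (1L << 30));
        }
    }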

What other RAM/CPU/Storage costs does a SolrCore carry with it? There are
the caches of course, which really depend on how many documents are
indexed. Any other non-trivial or constant cost?

So yes, there isn't a single answer to this question. It's just like
someone would ask how many documents can a single Lucene index handle
efficiently. But if we can come up with basic numbers as I outlined above,
it might help people doing rough estimates. That doesn't mean people
shouldn't benchmark, as that upper bound may be way too high for their
data set, query workload and search needs.

Shai

On Wed, Mar 25, 2015 at 5:25 AM, Damien Kamerman  wrote:

> From my experience on a high-end server (256GB memory, 40 core CPU) testing
> collection numbers with one shard and two replicas, the maximum that would
> work is 3,000 cores (1,500 collections). I'd recommend much less (perhaps
> half of that), depending on your startup-time requirements. (Though I have
> settled on 6,000 collection maximum with some patching. See SOLR-7191). You
> could create multiple clouds after that, and choose the cloud least used to
> create your collection.
>
> Regarding memory usage I'd pencil in 6MB overhead (no docs) Java heap per
> collection.
>
> On 25 March 2015 at 13:46, Ian Rose  wrote:
>
> > First off thanks everyone for the very useful replies thus far.
> >
> > Shawn - thanks for the list of items to check.  #1 and #2 should be fine
> > for us and I'll check our ulimit for #3.
> >
> > To add a bit of clarification, we are indeed using SolrCloud.  Our
> current
> > setup is to create a new collection for each customer.  For now we allow
> > SolrCloud to decide for itself where to locate the initial shard(s) but
> in
> > time we expect to refine this such that our system will automatically
> > choose the least loaded nodes according to some metric(s).
> >
> > Having more than one business entity controlling the configuration of a
> > > single (Solr) server is a recipe for disaster. Solr works well if there
> > is
> > > an architect for the system.
> >
> >
> > Jack, can you explain a bit what you mean here?  It looks like Toke
> caught
> > your meaning but I'm afraid it missed me.  What do you mean by "business
> > entity"?  Is your concern that with automatic creation of collections
> they
> > will be distributed willy-nilly across the cluster, leading to uneven
> load
> > across nodes?  If it is relevant, the schema and solrconfig are
> controlled
> > entirely by me and are the same for all collections.  Thus theoretically
> we
> > could actually just use one single collection for all of our customers
> > (adding a 'customer:' type fq to all queries) but since we
> never
> > need to query across customers it seemed more performant (as well as
> safer
> > - less chance of accidentally leaking data across customers) to use
> > separate collections.
> >
> > Better to give each tenant a separate Solr instance that you spin up and
> > > spin down based on demand.
> >
> >
> > Regarding this, if by tenant you mean "customer", this is not viable for
> us
> > from a cost perspective.  As I mentioned initially, many of our customers
> > are very small so dedicating an entire machine to each of them would not
> be
> > economical (or efficient).  Or perhaps I am not understanding what your
> > definition of "tenant" is?
> >
> > Cheers,
> > Ian
> >
> >
> >
> > On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen 
> > wrote:
> >
> > > Jack Krupansky [jack.krupan...@gmail.com] wrote:
> > > > I'm sure that I am quite unqualified to describe his hypothetical
> > setup.
> > > I
> > > > mean, he's the one using the term multi-tenancy, 

Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Damien Kamerman
From my experience on a high-end server (256GB memory, 40 core CPU) testing
collection numbers with one shard and two replicas, the maximum that would
work is 3,000 cores (1,500 collections). I'd recommend much less (perhaps
half of that), depending on your startup-time requirements. (Though I have
settled on 6,000 collection maximum with some patching. See SOLR-7191). You
could create multiple clouds after that, and choose the cloud least used to
create your collection.

Regarding memory usage I'd pencil in 6MB overhead (no docs) Java heap per
collection.
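
To turn that 6MB figure into a rough capacity estimate, a small sketch; the
heap size and the fraction of heap you are willing to spend on per-collection
overhead are assumptions, not measurements:

    public class CollectionOverheadBudget {
        public static void main(String[] args) {
            double overheadMbPerCollection = 6.0; // empty-collection heap overhead cited above
            double heapMb = 31 * 1024;            // assumed ~31GB heap (compressed-oops friendly)
            double budgetFraction = 0.25;         // assume 1/4 of the heap may go to this overhead
            long maxCollections = (long) (heapMb * budgetFraction / overheadMbPerCollection);
            System.out.println("~" + maxCollections
                    + " empty collections before overhead alone eats 25% of the heap");
        }
    }

Which lands in the same ballpark as the 1,500-collection ceiling observed above.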

On 25 March 2015 at 13:46, Ian Rose  wrote:

> First off thanks everyone for the very useful replies thus far.
>
> Shawn - thanks for the list of items to check.  #1 and #2 should be fine
> for us and I'll check our ulimit for #3.
>
> To add a bit of clarification, we are indeed using SolrCloud.  Our current
> setup is to create a new collection for each customer.  For now we allow
> SolrCloud to decide for itself where to locate the initial shard(s) but in
> time we expect to refine this such that our system will automatically
> choose the least loaded nodes according to some metric(s).
>
> Having more than one business entity controlling the configuration of a
> > single (Solr) server is a recipe for disaster. Solr works well if there
> is
> > an architect for the system.
>
>
> Jack, can you explain a bit what you mean here?  It looks like Toke caught
> your meaning but I'm afraid it missed me.  What do you mean by "business
> entity"?  Is your concern that with automatic creation of collections they
> will be distributed willy-nilly across the cluster, leading to uneven load
> across nodes?  If it is relevant, the schema and solrconfig are controlled
> entirely by me and are the same for all collections.  Thus theoretically we
> could actually just use one single collection for all of our customers
> (adding a 'customer:' type fq to all queries) but since we never
> need to query across customers it seemed more performant (as well as safer
> - less chance of accidentally leaking data across customers) to use
> separate collections.
>
> Better to give each tenant a separate Solr instance that you spin up and
> > spin down based on demand.
>
>
> Regarding this, if by tenant you mean "customer", this is not viable for us
> from a cost perspective.  As I mentioned initially, many of our customers
> are very small so dedicating an entire machine to each of them would not be
> economical (or efficient).  Or perhaps I am not understanding what your
> definition of "tenant" is?
>
> Cheers,
> Ian
>
>
>
> On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen 
> wrote:
>
> > Jack Krupansky [jack.krupan...@gmail.com] wrote:
> > > I'm sure that I am quite unqualified to describe his hypothetical
> setup.
> > I
> > > mean, he's the one using the term multi-tenancy, so it's for him to be
> > > clear.
> >
> > It was my understanding that Ian used them interchangeably, but of course
> > Ian is the only one who knows.
> >
> > > For me, it's a question of who has control over the config and schema
> and
> > > collection creation. Having more than one business entity controlling
> the
> > > configuration of a single (Solr) server is a recipe for disaster.
> >
> > Thank you. Now your post makes a lot more sense. I will not argue against
> > that.
> >
> > - Toke Eskildsen
> >
>



-- 
Damien Kamerman


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Ian Rose
First off thanks everyone for the very useful replies thus far.

Shawn - thanks for the list of items to check.  #1 and #2 should be fine
for us and I'll check our ulimit for #3.

To add a bit of clarification, we are indeed using SolrCloud.  Our current
setup is to create a new collection for each customer.  For now we allow
SolrCloud to decide for itself where to locate the initial shard(s) but in
time we expect to refine this such that our system will automatically
choose the least loaded nodes according to some metric(s).
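
For reference, a minimal sketch of what that per-customer collection creation
can look like against the Collections API; the host, collection naming scheme
and config-set name are hypothetical, and SolrJ's CollectionAdminRequest can
issue the same call:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CreateCustomerCollection {
        public static void main(String[] args) throws Exception {
            String customerId = "42"; // hypothetical tenant id
            // Collections API CREATE; SolrCloud chooses the nodes unless createNodeSet is given.
            String url = "http://localhost:8983/solr/admin/collections?action=CREATE"
                    + "&name=customer_" + customerId
                    + "&numShards=1&replicationFactor=1"
                    + "&collection.configName=shared_config"; // same config set for every tenant
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            try (InputStream in = conn.getInputStream()) {
                System.out.println("CREATE returned HTTP " + conn.getResponseCode());
            }
        }
    }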

Having more than one business entity controlling the configuration of a
> single (Solr) server is a recipe for disaster. Solr works well if there is
> an architect for the system.


Jack, can you explain a bit what you mean here?  It looks like Toke caught
your meaning but I'm afraid it missed me.  What do you mean by "business
entity"?  Is your concern that with automatic creation of collections they
will be distributed willy-nilly across the cluster, leading to uneven load
across nodes?  If it is relevant, the schema and solrconfig are controlled
entirely by me and are the same for all collections.  Thus theoretically we
could actually just use one single collection for all of our customers
(adding a 'customer:' type fq to all queries) but since we never
need to query across customers it seemed more performant (as well as safer
- less chance of accidentally leaking data across customers) to use
separate collections.

Better to give each tenant a separate Solr instance that you spin up and
> spin down based on demand.


Regarding this, if by tenant you mean "customer", this is not viable for us
from a cost perspective.  As I mentioned initially, many of our customers
are very small so dedicating an entire machine to each of them would not be
economical (or efficient).  Or perhaps I am not understanding what your
definition of "tenant" is?

Cheers,
Ian



On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen 
wrote:

> Jack Krupansky [jack.krupan...@gmail.com] wrote:
> > I'm sure that I am quite unqualified to describe his hypothetical setup.
> I
> > mean, he's the one using the term multi-tenancy, so it's for him to be
> > clear.
>
> It was my understanding that Ian used them interchangeably, but of course
> Ian is the only one who knows.
>
> > For me, it's a question of who has control over the config and schema and
> > collection creation. Having more than one business entity controlling the
> > configuration of a single (Solr) server is a recipe for disaster.
>
> Thank you. Now your post makes a lot more sense. I will not argue against
> that.
>
> - Toke Eskildsen
>


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Erick Erickson
Test Test:

From Hossman's apache page:

When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.

Also, please format your stack trace for readability. On a quick
glance, you probably
have mis-matched jars in your classpath.

On Tue, Mar 24, 2015 at 1:35 PM, Test Test  wrote:
> Hi there,
> I'm trying to create my own TokenizerFactory (from the Taming Text book). After
> setting up schema.xml and adding the path in solrconfig.xml, I start Solr and
> get this error message: Caused by: org.apache.solr.common.SolrException:
> Plugin init failure for [schema.xml] fieldType "text": Plugin init failure 
> for [schema.xml] analyzer/tokenizer: class 
> com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is 
> .../conf/schema.xml at 
> org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595) at 
> org.apache.solr.schema.IndexSchema.(IndexSchema.java:166) at 
> org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55) 
> at 
> org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
>  at 
> org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
>  at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62) 
> ... 7 moreCaused by: org.apache.solr.common.SolrException: Plugin init 
> failure for [schema.xml] fieldType "text": Plugin init failure for 
> [schema.xml] analyzer/tokenizer: class 
> com.tamingtext.texttamer.solr.SentenceTokenizerFactory at 
> org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
>  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486) ... 
> 12 moreCaused by: org.apache.solr.common.SolrException: Plugin init failure 
> for [schema.xml] analyzer/tokenizer: class 
> com.tamingtext.texttamer.solr.SentenceTokenizerFactory at 
> org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
>  at 
> org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
>  at 
> org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
>  at 
> org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
>  at 
> org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
>  ... 13 moreCaused by: java.lang.ClassCastException: class 
> com.tamingtext.texttamer.solr.SentenceTokenizerFactory at 
> java.lang.Class.asSubclass(Class.java:3208) at 
> org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
>  at 
> org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
>  at 
> org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
>  at 
> org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
>  at 
> org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
> Someone can help?
> Thanks. Regards.
>
>
> On Tuesday, 24 March 2015 at 21:24, Jack Krupansky  wrote:
>
>
>  I'm sure that I am quite unqualified to describe his hypothetical setup. I
> mean, he's the one using the term multi-tenancy, so it's for him to be
> clear.
>
> For me, it's a question of who has control over the config and schema and
> collection creation. Having more than one business entity controlling the
> configuration of a single (Solr) server is a recipe for disaster. Solr
> works well if there is an architect for the system. Ever hear the old
> saying "Too many cooks spoil the stew"?
>
> -- Jack Krupansky
>
> On Tue, Mar 24, 2015 at 3:54 PM, Toke Eskildsen 
> wrote:
>
>> Jack Krupansky [jack.krupan...@gmail.com] wrote:
>> > Don't confuse customers and tenants.
>>
>> Perhaps you could explain what you mean by multi-tenant in the context of
>> Ian's setup? It is not clear to me what the distinction is in this case.
>>
>> - Toke Eskildsen
>>
>
>
>


RE: rough maximum cores (shards) per machine?

2015-03-24 Thread Toke Eskildsen
Jack Krupansky [jack.krupan...@gmail.com] wrote:
> I'm sure that I am quite unqualified to describe his hypothetical setup. I
> mean, he's the one using the term multi-tenancy, so it's for him to be
> clear.

It was my understanding that Ian used them interchangeably, but of course Ian
is the only one who knows.

> For me, it's a question of who has control over the config and schema and
> collection creation. Having more than one business entity controlling the
> configuration of a single (Solr) server is a recipe for disaster.

Thank you. Now your post makes a lot more sense. I will not argue against that.

- Toke Eskildsen


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Test Test
Hi there,
I'm trying to create my own TokenizerFactory (from the Taming Text book). After
setting up schema.xml and adding the path in solrconfig.xml, I start Solr and
get this error message:

Caused by: org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] fieldType "text": Plugin init failure for [schema.xml]
analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory.
Schema file is .../conf/schema.xml
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
  at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
  at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
  at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
  at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
  at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
  ... 7 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] fieldType "text": Plugin init failure for [schema.xml]
analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
  ... 12 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] analyzer/tokenizer: class
com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
  ... 13 more
Caused by: java.lang.ClassCastException: class
com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at java.lang.Class.asSubclass(Class.java:3208)
  at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
  at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)

Can someone help?
Thanks. Regards.


 On Tuesday, 24 March 2015 at 21:24, Jack Krupansky  wrote:
   

 I'm sure that I am quite unqualified to describe his hypothetical setup. I
mean, he's the one using the term multi-tenancy, so it's for him to be
clear.

For me, it's a question of who has control over the config and schema and
collection creation. Having more than one business entity controlling the
configuration of a single (Solr) server is a recipe for disaster. Solr
works well if there is an architect for the system. Ever hear the old
saying "Too many cooks spoil the stew"?

-- Jack Krupansky

On Tue, Mar 24, 2015 at 3:54 PM, Toke Eskildsen 
wrote:

> Jack Krupansky [jack.krupan...@gmail.com] wrote:
> > Don't confuse customers and tenants.
>
> Perhaps you could explain what you mean by multi-tenant in the context of
> Ian's setup? It is not clear to me what the distinction is in this case.
>
> - Toke Eskildsen
>


  

Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Jack Krupansky
I'm sure that I am quite unqualified to describe his hypothetical setup. I
mean, he's the one using the term multi-tenancy, so it's for him to be
clear.

For me, it's a question of who has control over the config and schema and
collection creation. Having more than one business entity controlling the
configuration of a single (Solr) server is a recipe for disaster. Solr
works well if there is an architect for the system. Ever hear the old
saying "Too many cooks spoil the stew"?

-- Jack Krupansky

On Tue, Mar 24, 2015 at 3:54 PM, Toke Eskildsen 
wrote:

> Jack Krupansky [jack.krupan...@gmail.com] wrote:
> > Don't confuse customers and tenants.
>
> Perhaps you could explain what you mean by multi-tenant in the context of
> Ian's setup? It is not clear to me what the distinction is in this case.
>
> - Toke Eskildsen
>


RE: rough maximum cores (shards) per machine?

2015-03-24 Thread Toke Eskildsen
Jack Krupansky [jack.krupan...@gmail.com] wrote:
> Don't confuse customers and tenants.

Perhaps you could explain what you mean by multi-tenant in the context of Ian's 
setup? It is not clear to me what the distinction is in this case.

- Toke Eskildsen


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Shawn Heisey
On 3/24/2015 11:22 AM, Ian Rose wrote:
> Let me give a bit of background.  Our Solr cluster is multi-tenant, where
> we use one collection for each of our customers.  In many cases, these
> customers are very tiny, so their collection consists of just a single
> shard on a single Solr node.  In fact, a non-trivial number of them are
> totally empty (e.g. trial customers that never did anything with their
> trial account).  However there are also some customers that are larger,
> requiring their collection to be sharded.  Our strategy is to try to keep
> the total documents in any one shard under 20 million (honestly not sure
> where my coworker got that number from - I am open to alternatives but I
> realize this is heavily app-specific).
>
> So my original question is not related to indexing or query traffic, but
> just the sheer number of cores.  For example, if I have 10 active cores on
> a machine and everything is working fine, should I expect that everything
> will still work fine if I add 10 nearly-idle cores to that machine?  What
> about 100?  1000?  I figure the overhead of each core is probably fairly
> low but at some point starts to matter.

One resource that may be exhausted faster than any other when you have a
lot of cores on a Solr instance (especially when they are not idle) is
Java heap memory, so you might need to increase the Java heap.  Memory
in the server is one of the most important resources you have for Solr
performance, and here I am talking about memory that is *not* used in
the Java heap (or any other program) -- the OS must be able to
effectively cache your index data or Solr performance will be terrible.

You have said "Solr cluster" and "collection" ... so that makes me think
you're running SolrCloud.  In cloud mode, you can't really use the
LotsOfCores functionality, where you mark cores transient and tell Solr
how many cores you'd like to have resident at the same time.  If you are
NOT in cloud mode, then you can use this feature:

http://wiki.apache.org/solr/LotsOfCores

In general, there are three resources other than memory which might
become exhausted with a large number of cores:

One resource is the "maximum open files" limit in the operating system,
which typically defaults to 1024.  Each core will typically have several
dozen files in its index, so it's very easy to reach 1024 open files.
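
As a rough way to see how quickly that limit is hit, a small sketch; the core
count, segments per core and files per segment are all assumptions, and the
real numbers vary with merge policy and codec:

    public class OpenFileEstimate {
        public static void main(String[] args) {
            int cores = 100;           // cores hosted by one Solr instance (assumed)
            int segmentsPerCore = 30;  // assumed typical segment count
            int filesPerSegment = 10;  // assumed files per segment
            int defaultUlimit = 1024;  // common "max open files" default
            int estimate = cores * segmentsPerCore * filesPerSegment;
            System.out.println(estimate + " potential index files vs. a limit of "
                    + defaultUlimit + (estimate > defaultUlimit ? " -> raise the limit" : " -> OK"));
        }
    }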

The second resource is the maximum allowed threads in your servlet
container config -- each core you add requires more threads.  The
default maxThreads value in most containers is 200.  The Jetty container
included in the Solr download is preconfigured with a maxThreads value
of 10000, effectively removing the limit for most setups.

The third resource is related to the second -- some operating systems
implement threads as hidden processes, and many operating systems will
limit the number of processes that a user may start.  On Linux, this
limit is typically 1024, and may need to be increased.

I really need to add this kind of info to the wiki.

Thanks,
Shawn



Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Jack Krupansky
Don't confuse customers and tenants.

-- Jack Krupansky

On Tue, Mar 24, 2015 at 2:24 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> Sorry Jack. That doesn't scale when you have millions of customers. And
> these are good problems to have!
>
> On Tue, Mar 24, 2015 at 10:47 AM, Jack Krupansky  >
> wrote:
>
> > Multi-tenancy is a bad idea for a single solr Cluster. Better to give
> each
> > tenant a separate Solr instance that you spin up and spin down based on
> > demand.
> >
> > Think about it: If there are a small number of tenants, just giving each
> > their own machine will be cheaper than the effort spent managing a
> > multi-tenant cluster, and if there are a large number of tenants of even
> a
> > moderate number of large tenants, you can't expect them to all run
> > reasonably on a relatively small cluster. Think about scalability.
> >
> >
> > -- Jack Krupansky
> >
> > On Tue, Mar 24, 2015 at 1:22 PM, Ian Rose  wrote:
> >
> > > Let me give a bit of background.  Our Solr cluster is multi-tenant,
> where
> > > we use one collection for each of our customers.  In many cases, these
> > > customers are very tiny, so their collection consists of just a single
> > > shard on a single Solr node.  In fact, a non-trivial number of them are
> > > totally empty (e.g. trial customers that never did anything with their
> > > trial account).  However there are also some customers that are larger,
> > > requiring their collection to be sharded.  Our strategy is to try to
> keep
> > > the total documents in any one shard under 20 million (honestly not
> sure
> > > where my coworker got that number from - I am open to alternatives but
> I
> > > realize this is heavily app-specific).
> > >
> > > So my original question is not related to indexing or query traffic,
> but
> > > just the sheer number of cores.  For example, if I have 10 active cores
> > on
> > > a machine and everything is working fine, should I expect that
> everything
> > > will still work fine if I add 10 nearly-idle cores to that machine?
> What
> > > about 100?  1000?  I figure the overhead of each core is probably
> fairly
> > > low but at some point starts to matter.
> > >
> > > Does that make sense?
> > > - Ian
> > >
> > >
> > > On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky <
> > jack.krupan...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Shards per collection, or across all collections on the node?
> > > >
> > > > It will all depend on:
> > > >
> > > > 1. Your ingestion/indexing rate. High, medium or low?
> > > > 2. Your query access pattern. Note that a typical query fans out to
> all
> > > > shards, so having more shards than CPU cores means less parallelism.
> > > > 3. How many collections you will have per node.
> > > >
> > > > In short, it depends on what you want to achieve, not some limit of
> > Solr
> > > > per se.
> > > >
> > > > Why are you even sharding the node anyway? Why not just run with a
> > single
> > > > shard per node, and do sharding by having separate nodes, to maximize
> > > > parallel processing and availability?
> > > >
> > > > Also be careful to be clear about using the Solr term "shard" (a
> slice,
> > > > across all replica nodes) as distinct from the Elasticsearch term
> > "shard"
> > > > (a single slice of an index for a single replica, analogous to a Solr
> > > > "core".)
> > > >
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose 
> > wrote:
> > > >
> > > > > Hi all -
> > > > >
> > > > > I'm sure this topic has been covered before but I was unable to
> find
> > > any
> > > > > clear references online or in the mailing list.
> > > > >
> > > > > Are there any rules of thumb for how many cores (aka shards, since
> I
> > am
> > > > > using SolrCloud) is "too many" for one machine?  I realize there is
> > no
> > > > one
> > > > > answer (depends on size of the machine, etc.) so I'm just looking
> > for a
> > > > > rough idea.  Something like the following would be very useful:
> > > > >
> > > > > * People commonly run up to X cores/shards on a mid-sized (4 or 8
> > core)
> > > > > server without any problems.
> > > > > * I have never heard of anyone successfully running X cores/shards
> > on a
> > > > > single machine, even if you throw a lot of hardware at it.
> > > > >
> > > > > Thanks!
> > > > > - Ian
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Shalin Shekhar Mangar
Sorry Jack. That doesn't scale when you have millions of customers. And
these are good problems to have!

On Tue, Mar 24, 2015 at 10:47 AM, Jack Krupansky 
wrote:

> Multi-tenancy is a bad idea for a single solr Cluster. Better to give each
> tenant a separate Solr instance that you spin up and spin down based on
> demand.
>
> Think about it: If there are a small number of tenants, just giving each
> their own machine will be cheaper than the effort spent managing a
> multi-tenant cluster, and if there are a large number of tenants of even a
> moderate number of large tenants, you can't expect them to all run
> reasonably on a relatively small cluster. Think about scalability.
>
>
> -- Jack Krupansky
>
> On Tue, Mar 24, 2015 at 1:22 PM, Ian Rose  wrote:
>
> > Let me give a bit of background.  Our Solr cluster is multi-tenant, where
> > we use one collection for each of our customers.  In many cases, these
> > customers are very tiny, so their collection consists of just a single
> > shard on a single Solr node.  In fact, a non-trivial number of them are
> > totally empty (e.g. trial customers that never did anything with their
> > trial account).  However there are also some customers that are larger,
> > requiring their collection to be sharded.  Our strategy is to try to keep
> > the total documents in any one shard under 20 million (honestly not sure
> > where my coworker got that number from - I am open to alternatives but I
> > realize this is heavily app-specific).
> >
> > So my original question is not related to indexing or query traffic, but
> > just the sheer number of cores.  For example, if I have 10 active cores
> on
> > a machine and everything is working fine, should I expect that everything
> > will still work fine if I add 10 nearly-idle cores to that machine?  What
> > about 100?  1000?  I figure the overhead of each core is probably fairly
> > low but at some point starts to matter.
> >
> > Does that make sense?
> > - Ian
> >
> >
> > On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky <
> jack.krupan...@gmail.com
> > >
> > wrote:
> >
> > > Shards per collection, or across all collections on the node?
> > >
> > > It will all depend on:
> > >
> > > 1. Your ingestion/indexing rate. High, medium or low?
> > > 2. Your query access pattern. Note that a typical query fans out to all
> > > shards, so having more shards than CPU cores means less parallelism.
> > > 3. How many collections you will have per node.
> > >
> > > In short, it depends on what you want to achieve, not some limit of
> Solr
> > > per se.
> > >
> > > Why are you even sharding the node anyway? Why not just run with a
> single
> > > shard per node, and do sharding by having separate nodes, to maximize
> > > parallel processing and availability?
> > >
> > > Also be careful to be clear about using the Solr term "shard" (a slice,
> > > across all replica nodes) as distinct from the Elasticsearch term
> "shard"
> > > (a single slice of an index for a single replica, analogous to a Solr
> > > "core".)
> > >
> > >
> > > -- Jack Krupansky
> > >
> > > On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose 
> wrote:
> > >
> > > > Hi all -
> > > >
> > > > I'm sure this topic has been covered before but I was unable to find
> > any
> > > > clear references online or in the mailing list.
> > > >
> > > > Are there any rules of thumb for how many cores (aka shards, since I
> am
> > > > using SolrCloud) is "too many" for one machine?  I realize there is
> no
> > > one
> > > > answer (depends on size of the machine, etc.) so I'm just looking
> for a
> > > > rough idea.  Something like the following would be very useful:
> > > >
> > > > * People commonly run up to X cores/shards on a mid-sized (4 or 8
> core)
> > > > server without any problems.
> > > > * I have never heard of anyone successfully running X cores/shards
> on a
> > > > single machine, even if you throw a lot of hardware at it.
> > > >
> > > > Thanks!
> > > > - Ian
> > > >
> > >
> >
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Jack Krupansky
Multi-tenancy is a bad idea for a single solr Cluster. Better to give each
tenant a separate Solr instance that you spin up and spin down based on
demand.

Think about it: If there are a small number of tenants, just giving each
their own machine will be cheaper than the effort spent managing a
multi-tenant cluster, and if there are a large number of tenants of even a
moderate number of large tenants, you can't expect them to all run
reasonably on a relatively small cluster. Think about scalability.


-- Jack Krupansky

On Tue, Mar 24, 2015 at 1:22 PM, Ian Rose  wrote:

> Let me give a bit of background.  Our Solr cluster is multi-tenant, where
> we use one collection for each of our customers.  In many cases, these
> customers are very tiny, so their collection consists of just a single
> shard on a single Solr node.  In fact, a non-trivial number of them are
> totally empty (e.g. trial customers that never did anything with their
> trial account).  However there are also some customers that are larger,
> requiring their collection to be sharded.  Our strategy is to try to keep
> the total documents in any one shard under 20 million (honestly not sure
> where my coworker got that number from - I am open to alternatives but I
> realize this is heavily app-specific).
>
> So my original question is not related to indexing or query traffic, but
> just the sheer number of cores.  For example, if I have 10 active cores on
> a machine and everything is working fine, should I expect that everything
> will still work fine if I add 10 nearly-idle cores to that machine?  What
> about 100?  1000?  I figure the overhead of each core is probably fairly
> low but at some point starts to matter.
>
> Does that make sense?
> - Ian
>
>
> On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky  >
> wrote:
>
> > Shards per collection, or across all collections on the node?
> >
> > It will all depend on:
> >
> > 1. Your ingestion/indexing rate. High, medium or low?
> > 2. Your query access pattern. Note that a typical query fans out to all
> > shards, so having more shards than CPU cores means less parallelism.
> > 3. How many collections you will have per node.
> >
> > In short, it depends on what you want to achieve, not some limit of Solr
> > per se.
> >
> > Why are you even sharding the node anyway? Why not just run with a single
> > shard per node, and do sharding by having separate nodes, to maximize
> > parallel processing and availability?
> >
> > Also be careful to be clear about using the Solr term "shard" (a slice,
> > across all replica nodes) as distinct from the Elasticsearch term "shard"
> > (a single slice of an index for a single replica, analogous to a Solr
> > "core".)
> >
> >
> > -- Jack Krupansky
> >
> > On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose  wrote:
> >
> > > Hi all -
> > >
> > > I'm sure this topic has been covered before but I was unable to find
> any
> > > clear references online or in the mailing list.
> > >
> > > Are there any rules of thumb for how many cores (aka shards, since I am
> > > using SolrCloud) is "too many" for one machine?  I realize there is no
> > one
> > > answer (depends on size of the machine, etc.) so I'm just looking for a
> > > rough idea.  Something like the following would be very useful:
> > >
> > > * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
> > > server without any problems.
> > > * I have never heard of anyone successfully running X cores/shards on a
> > > single machine, even if you throw a lot of hardware at it.
> > >
> > > Thanks!
> > > - Ian
> > >
> >
>


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Ian Rose
Let me give a bit of background.  Our Solr cluster is multi-tenant, where
we use one collection for each of our customers.  In many cases, these
customers are very tiny, so their collection consists of just a single
shard on a single Solr node.  In fact, a non-trivial number of them are
totally empty (e.g. trial customers that never did anything with their
trial account).  However there are also some customers that are larger,
requiring their collection to be sharded.  Our strategy is to try to keep
the total documents in any one shard under 20 million (honestly not sure
where my coworker got that number from - I am open to alternatives but I
realize this is heavily app-specific).

So my original question is not related to indexing or query traffic, but
just the sheer number of cores.  For example, if I have 10 active cores on
a machine and everything is working fine, should I expect that everything
will still work fine if I add 10 nearly-idle cores to that machine?  What
about 100?  1000?  I figure the overhead of each core is probably fairly
low but at some point starts to matter.

Does that make sense?
- Ian


On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky 
wrote:

> Shards per collection, or across all collections on the node?
>
> It will all depend on:
>
> 1. Your ingestion/indexing rate. High, medium or low?
> 2. Your query access pattern. Note that a typical query fans out to all
> shards, so having more shards than CPU cores means less parallelism.
> 3. How many collections you will have per node.
>
> In short, it depends on what you want to achieve, not some limit of Solr
> per se.
>
> Why are you even sharding the node anyway? Why not just run with a single
> shard per node, and do sharding by having separate nodes, to maximize
> parallel processing and availability?
>
> Also be careful to be clear about using the Solr term "shard" (a slice,
> across all replica nodes) as distinct from the Elasticsearch term "shard"
> (a single slice of an index for a single replica, analogous to a Solr
> "core".)
>
>
> -- Jack Krupansky
>
> On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose  wrote:
>
> > Hi all -
> >
> > I'm sure this topic has been covered before but I was unable to find any
> > clear references online or in the mailing list.
> >
> > Are there any rules of thumb for how many cores (aka shards, since I am
> > using SolrCloud) is "too many" for one machine?  I realize there is no
> one
> > answer (depends on size of the machine, etc.) so I'm just looking for a
> > rough idea.  Something like the following would be very useful:
> >
> > * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
> > server without any problems.
> > * I have never heard of anyone successfully running X cores/shards on a
> > single machine, even if you throw a lot of hardware at it.
> >
> > Thanks!
> > - Ian
> >
>


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Jack Krupansky
Shards per collection, or across all collections on the node?

It will all depend on:

1. Your ingestion/indexing rate. High, medium or low?
2. Your query access pattern. Note that a typical query fans out to all
shards, so having more shards than CPU cores means less parallelism.
3. How many collections you will have per node.

In short, it depends on what you want to achieve, not some limit of Solr
per se.

Why are you even sharding the node anyway? Why not just run with a single
shard per node, and do sharding by having separate nodes, to maximize
parallel processing and availability?

Also be careful to be clear about using the Solr term "shard" (a slice,
across all replica nodes) as distinct from the Elasticsearch term "shard"
(a single slice of an index for a single replica, analogous to a Solr
"core".)


-- Jack Krupansky

On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose  wrote:

> Hi all -
>
> I'm sure this topic has been covered before but I was unable to find any
> clear references online or in the mailing list.
>
> Are there any rules of thumb for how many cores (aka shards, since I am
> using SolrCloud) is "too many" for one machine?  I realize there is no one
> answer (depends on size of the machine, etc.) so I'm just looking for a
> rough idea.  Something like the following would be very useful:
>
> * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
> server without any problems.
> * I have never heard of anyone successfully running X cores/shards on a
> single machine, even if you throw a lot of hardware at it.
>
> Thanks!
> - Ian
>


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Erick Erickson
Well, there's a ticket out there for "thousands of collections on a
single machine",
although this is way out there. I often see 10-20 small cores on a
4-8 core machine
if they're reasonably small (a few million docs). I see a single
replica strain a 128G, 16-core
machine if it has 300M docs.

Which is a way of saying "ya gotta test with your data/query mix".

Wish there was a better answer.
Erick

On Tue, Mar 24, 2015 at 6:02 AM, Ian Rose  wrote:
> Hi all -
>
> I'm sure this topic has been covered before but I was unable to find any
> clear references online or in the mailing list.
>
> Are there any rules of thumb for how many cores (aka shards, since I am
> using SolrCloud) is "too many" for one machine?  I realize there is no one
> answer (depends on size of the machine, etc.) so I'm just looking for a
> rough idea.  Something like the following would be very useful:
>
> * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
> server without any problems.
> * I have never heard of anyone successfully running X cores/shards on a
> single machine, even if you throw a lot of hardware at it.
>
> Thanks!
> - Ian