Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread darren
This is a good explanation and makes sense. The one inconsistency is referring 
to a replica of a shard that has no replication. But it's not that big of a 
problem. If you wove the term 'core' into your writeup below it would be 
complete and should be posted on the wiki.




 Original message 
From: Jack Krupansky j...@basetechnology.com 
Date:  
To: solr-user@lucene.apache.org 
Subject: Re: Terminology question: Core vs. Collection vs... 
 
Replication makes perfect sense even if our explanations so far do not.

A shard is an abstraction of a subset of the data for a collection.

A replica is an instance of the data of the shard and instances of Solr 
servers that have indicated a readiness to service queries and updates for 
the data. Alternatively, a replica is a node which has indicated a readiness 
to receive and serve the data of a shard, but may not have any data at the 
moment.

Let's describe it operationally for SolrCloud: If data comes in to any 
replica of a shard it will automatically and quickly be replicated to all 
other replicas of the shard. If a new replica of a shard comes up it will be 
streamed all of the data from another replica of the shard. If an 
existing replica of a shard restarts or reconnects to the cluster, it will 
be streamed updates of any new data since it was last updated from another 
replica of the shard.
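
A minimal sketch of the three cases just described, as a toy in-memory model. This is not Solr code: the class and method names are hypothetical, updates are applied directly instead of being routed through the cluster, and a reconnecting replica simply copies a peer's full state, not only the updates it missed.

```python
class Replica:
    """Toy stand-in for one replica of a shard (hypothetical, not Solr API)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}       # doc id -> document
        self.online = True


class Shard:
    """Toy shard that keeps all of its online replicas in sync."""

    def __init__(self, name):
        self.name = name
        self.replicas = []

    def add_replica(self, replica):
        # A new replica is streamed all of the data from another replica.
        peers = [r for r in self.replicas if r.online]
        if peers:
            replica.docs = dict(peers[0].docs)
        self.replicas.append(replica)

    def update(self, doc_id, doc):
        # Data coming in is replicated to every online replica of the shard.
        for r in self.replicas:
            if r.online:
                r.docs[doc_id] = doc

    def reconnect(self, replica):
        # A returning replica catches up from another replica.
        # Assumes at least one other replica is online.
        peer = next(r for r in self.replicas if r.online and r is not replica)
        replica.docs.update(peer.docs)
        replica.online = True


shard = Shard("shard1")
a, b = Replica("a"), Replica("b")
shard.add_replica(a)
shard.add_replica(b)
shard.update("1", {"title": "first"})
b.online = False                      # b drops out of the cluster
shard.update("2", {"title": "second"})
shard.reconnect(b)                    # b catches up on document "2"
print(a.docs == b.docs)               # True
```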

Replication is simply the process of assuring that all replicas are kept up 
to date. That's the same abstract meaning as for Master/Slave even though 
the operational details are somewhat different. The goal remains the same.

Replication factor is the number of instances of the data of the shard and 
instances of Solr servers that can service queries and updates for the data. 
Alternatively, the replication factor is the number of nodes of the 
SolrCloud cluster which have indicated a readiness to receive and serve the 
data of a shard, but may not have any data at the moment.
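
As a small worked example of that definition, assuming the replication factor counts every copy of a shard including the first one, the number of replicas a collection needs is the shard count multiplied by the factor. The function name is hypothetical and only illustrates the arithmetic.

```python
def total_replicas(num_shards: int, replication_factor: int) -> int:
    """Replicas needed across the cluster for one collection."""
    if num_shards < 1 or replication_factor < 1:
        raise ValueError("both values must be at least 1")
    return num_shards * replication_factor


print(total_replicas(num_shards=1, replication_factor=1))  # 1
print(total_replicas(num_shards=3, replication_factor=2))  # 6
```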

A node is an instance of Solr running in a JVM that has indicated to 
the Zookeeper ensemble of a SolrCloud cluster that it is ready to be a 
replica for a shard of a collection. [The latter part of that is a bit too 
fuzzy - I'm not sure what the node tells Zookeeper and who does shard 
assignment. I mean, does a node explicitly say what shard it wants to be, or 
is that assigned by Zookeeper, or is that a node's choice/option? But none 
of that changes the fact that a node registers with Zookeeper and then 
somehow becomes a replica for a shard.]

A node (instance of a Solr server) can be a replica of shards from multiple 
collections (potentially multiple shards per collection). A node is not a 
replica per se, but a container that can serve multiple collections. A node 
can serve as multiple replicas, each of a different collection.
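
To illustrate a node acting as a container for several replicas, here is a rough sketch that groups replica placements by node. The node, collection, and shard names are made up for the example, and the flat list of triples is an assumption of this sketch, not how the cluster state is actually stored.

```python
from collections import defaultdict

# Each (node, collection, shard) triple is one replica hosted on a node.
# All names below are hypothetical.
placements = [
    ("node1", "products", "shard1"),
    ("node1", "products", "shard2"),
    ("node1", "logs", "shard1"),
    ("node2", "products", "shard1"),
    ("node2", "logs", "shard1"),
]


def replicas_by_node(placements):
    """Group replicas by the node that hosts them."""
    hosted = defaultdict(list)
    for node, collection, shard in placements:
        hosted[node].append((collection, shard))
    return dict(hosted)


def collections_on(hosted, node):
    """Collections that a given node serves at least one replica for."""
    return {collection for collection, _ in hosted.get(node, [])}


hosted = replicas_by_node(placements)
print(len(hosted["node1"]))                     # 3
print(sorted(collections_on(hosted, "node1")))  # ['logs', 'products']
```

Here node1 hosts three replicas: two shards of one collection and one shard of another.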

My only interest here on this user list is to understand and explain the 
terms we have today and that SEEM to be working for the most part, even 
though we may not have defined them carefully enough and used them 
consistently enough.

If somebody wants to propose an alternative terminology - fine, discuss that 
on the dev list and/or file a Jira.

I won't claim that my definitions are perfect (yet), but perfecting the 
definitions (for users) should be separated from changing the terms 
themselves.

-- Jack Krupansky

-Original Message- 
From: Per Steffensen
Sent: Friday, January 04, 2013 2:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

On 1/3/13 5:58 PM, Walter Underwood wrote:
 A factor is multiplied, so multiplying the leader by a replicationFactor 
 of 1 means you have exactly one copy of that shard.

 I think that recycling the term replication within Solr was confusing, 
 but it is a bit late to change that.

 wunder

Yes, the term factor is not misleading, but the term replication is.
If we keep calling shard instances replicas, I guess replicaFactor
will be ok - at least much better than replicationFactor. But it would
still be better with e.g. ShardInstance and InstancesPerShard.



Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread Jack Krupansky
I thought about adding Solr core, but it only muddies the water. Yes, it 
needs to be added, but carefully.


In the context of SolrCloud, a Solr core is the underlying representation of 
a replica. Alternatively, a replica of a shard of a collection is 
implemented as a Solr core. [Need to factor in the potential for multiple 
shards on a single node.] Or, a Solr core is capable of serving as a replica 
of a shard. A Solr core has a collection name but can exist without being 
registered with Zookeeper, so it may not be a replica of a 
zookeeper-registered collection.


Something like that. Not quite there yet.
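
The core/replica relationship described above can be sketched as a small data type. This is only an illustration under stated assumptions: the class, its fields, and the use of None to mean "not registered with Zookeeper" are hypothetical and do not mirror Solr's actual classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Core:
    """Hypothetical model of a Solr core as the implementation of a replica."""
    name: str
    collection: str                          # a core always carries a collection name
    registered_shard: Optional[str] = None   # None: not registered with Zookeeper

    def is_replica(self) -> bool:
        # Only a registered core acts as a replica of a shard of a collection.
        return self.registered_shard is not None


standalone = Core(name="core0", collection="collection1")
clustered = Core(name="core1", collection="collection1", registered_shard="shard1")
print(standalone.is_replica())  # False
print(clustered.is_replica())   # True
```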

The main point, I think, is that when we talk about SolrCloud or a Solr 
cluster it would be better for people to speak of replicas, shards, and 
collections rather than cores, since the core is the implementation rather 
than the abstraction. I mean, cores know only of documents and fields, not 
shards, replicas, and the overall structure of collections and the cluster. 
Sure, the core has the name of the collection, but cores on other nodes can 
use that same name.


-- Jack Krupansky

-Original Message- 
From: darren

Sent: Friday, January 04, 2013 9:00 AM
To: j...@basetechnology.com ; solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

This is a good explanation and makes sense. The one inconsistency is 
referring to a replica of a shard that has no replication. But it's not that 
big of a problem. If you wove the term 'core' into your writeup below it 
would be complete and should be posted on the wiki.





Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread darren
Yes. That's it. It's clear if we separate logical terms from physical terms. A 
simple cake diagram on the wiki, along with perhaps a UML diagram, will solidify 
these concepts.



Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread Yonik Seeley
On Fri, Jan 4, 2013 at 2:26 AM, Per Steffensen st...@designware.dk wrote:
 Our biggest problem is that we really haven't decided once and for all and
 made sure to reflect the decision consistently across code and
 documentation. As long as we haven't, I believe it is still ok to change our
 minds.

IMO, I *think* it's settled: a collection consists of one or more
shards, each of which consists of one or more replicas.
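
The containment rule stated above (a collection has at least one shard, and a shard has at least one replica) can be written down as a small sketch. The class and field names are hypothetical and chosen only to mirror the three terms.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    name: str


@dataclass
class Shard:
    name: str
    replicas: List[Replica]

    def __post_init__(self):
        # A shard consists of one or more replicas.
        if not self.replicas:
            raise ValueError(f"shard {self.name!r} needs at least one replica")


@dataclass
class Collection:
    name: str
    shards: List[Shard]

    def __post_init__(self):
        # A collection consists of one or more shards.
        if not self.shards:
            raise ValueError(f"collection {self.name!r} needs at least one shard")


# The smallest valid layout: one collection, one shard, one replica.
minimal = Collection("collection1", [Shard("shard1", [Replica("replica1")])])
print(len(minimal.shards), len(minimal.shards[0].replicas))  # 1 1
```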

A *long* time ago (3 years actually), I tried to get "slice" used in
place of "shard" just because "shard" was already used ambiguously by
people for both physical and logical shards, but it never caught on,
and as I recall no one could really agree on a set of terms that
satisfied everyone.  Attempting to replace "Replica" with something
like "Shard Instance" could actually end up being worse since it's a
mouthful and people would tend to shorten it to "shard" when talking
about it.

From a practical standpoint, I don't think people will be confused by
the current terminology once we document it well (we should probably
start with collection/shard/replica).  It's mostly an issue of when
one goes looking for inconsistencies or things that might not make
sense.  And as has been pointed out, others use the exact same
terminology: http://www.datastax.com/docs/1.0/cluster_architecture/replication

In the *code* I have been migrating away from "shard" as the physical
kind.  I've also used "slice" as a synonym for logical shard in the
code because of this mixed history of "shard" and since removing all
remnants of the use of "shard" as physical all at once would be
impractical.  Anyone who works on the code should not be bothered by
an extra synonym, and things will continue to be cleaned up over time.

-Yonik
http://lucidworks.com


Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread darren
Agreed. But for completeness can it be node/collection/shard/replica/core?




Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread darren
Actually. Node/collection/shard/replica/core/index




Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread Alexandre Rafalovitch
Can I just start by saying that this was AMAZING. :-) When I asked the
question, I certainly did not expect this level of detail.

And I vote for the cake diagram on the wiki as well. Perhaps two, with the
first one showing the trivial collapsed state of a single
collection/shard/replica/core. The trivial one will also help to explain
why the example is now called 'collection1'.

I think I followed everything, except for the just-added term 'index'. Isn't
that the same as 'core'? Or can we have several indexes in one core?

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Fri, Jan 4, 2013 at 10:11 AM, darren dar...@ontrenet.com wrote:

 This is the containment hierarchy I understand, but it includes both physical
 and logical.





Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread Jack Krupansky
The entire collection does have an index - a distributed index - which 
consists of a Lucene index on each core/replica for the subset of the data 
in that shard.
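
A rough sketch of that idea: each shard holds its own small index over its subset of the documents, and a collection-level search is the merge of the per-shard results. The tiny inverted index here is a stand-in for a Lucene index, and every name in the example is hypothetical.

```python
def build_index(docs):
    """Build a toy inverted index (term -> set of doc ids) for one shard."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def search_collection(shard_indexes, term):
    """Query every shard's index and merge the hits."""
    hits = set()
    for index in shard_indexes.values():
        hits |= index.get(term.lower(), set())
    return sorted(hits)


# Two shards, each indexing only its own subset of the collection's data.
shard_indexes = {
    "shard1": build_index({"1": "Solr core basics", "2": "Lucene index internals"}),
    "shard2": build_index({"3": "SolrCloud shard and replica", "4": "Lucene index on each core"}),
}

print(search_collection(shard_indexes, "lucene"))  # ['2', '4']
print(search_collection(shard_indexes, "core"))    # ['1', '4']
```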


-- Jack Krupansky

-Original Message- 
From: Alexandre Rafalovitch

Sent: Friday, January 04, 2013 1:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...







Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread darren
My understanding is that core is a logical Solr term. Index is a physical Lucene 
term. A Solr core is backed by a physical Lucene index. One index per core. 
The Solr team can correct me if it's not accurate. :)






Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread Alexandre Rafalovitch
Hmm. Doesn't that make (logical) index=collection? And (physical)
index=core? Which creates duplication of terminology and at the same time
can cause confusion between highest logical and lowest physical level.

Regards,
   Alex.
P.s. Hoping not to start a new terminology war.



On Fri, Jan 4, 2013 at 1:21 PM, Jack Krupansky j...@basetechnology.com wrote:

 The entire collection does have an index - a distributed index - which
 consists of a Lucene index on each core/replica for the subset of the data
 in that shard.

 -- Jack Krupansky







Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread Yonik Seeley
On Fri, Jan 4, 2013 at 1:35 PM, Alexandre Rafalovitch
arafa...@gmail.com wrote:
 Hmm. Doesn't that make (logical) index=collection? And (physical)
 index=core? Which creates duplication of terminology and at the same time
 can cause confusion between highest logical and lowest physical level.

That's why I've avoided using "index" to mean anything other than the lowest
level physical Lucene index, and used "collection" for the logical
meaning instead.

A Solr core is essentially a replica (currently... core is more of
an implementation thing), and it has a Lucene index.

-Yonik
http://lucidworks.com


Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread darren
I agree. In my opinion index is a low-level Lucene thing. I never say a 
collection has an index directly. That mixes levels and creates confusion, 
to me at least. I think the terminology discussed is good. There are just some 
lingering usage inconsistencies.


Sent from my Verizon Wireless 4G LTE Smartphone

 Original message 
From: Alexandre Rafalovitch arafa...@gmail.com 
Date:  
To: solr-user@lucene.apache.org 
Subject: Re: Terminology question: Core vs. Collection vs... 
 
Hmm. Doesn't that make (logical) index=collection? And (physical)
index=core? Which creates duplication of terminology and at the same time
can cause confusion between highest logical and lowest physical level.

Regards,
   Alex.
P.s. Hoping not to start a new terminology war.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Fri, Jan 4, 2013 at 1:21 PM, Jack Krupansky j...@basetechnology.comwrote:

 The entire collection does have an index - a distributed index - which
 consists of a Lucene index on each core/replica for the subset of the data
 in that shard.

 -- Jack Krupansky

 -Original Message- From: Alexandre Rafalovitch
 Sent: Friday, January 04, 2013 1:12 PM
 To: solr-user@lucene.apache.org

 Subject: Re: Terminology question: Core vs. Collection vs...

 Can I just start by saying that this was AMAZING. :-) When I asked the
 question, I certainly did not expect this level of details.

 And I vote on the cake diagram for WIKI as well. Perhaps, two with the
 first one showing the trivial collapsed state of single
 collection/shard/replica/core. The trivial one will also help to explain
 why the example is now called 'collection1'.

 I think I followed everything, except for just added term of 'index'. Isn't
 that the same as 'core'? Or can we have several indexes in one core?

 Regards,
   Alex.
 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: 
 http://www.linkedin.com/in/**alexandrerafalovitchhttp://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Fri, Jan 4, 2013 at 10:11 AM, darren dar...@ontrenet.com wrote:

  This is the containment hierarchy i understand but includes both physical
 and logical.

  Original message 
 From: darren dar...@ontrenet.com
 Date:
 To: dar...@ontrenet.com, yo...@lucidworks.com,
 solr-user@lucene.apache.org
 Subject: Re: Terminology question: Core vs. Collection vs...

 Actually. Node/collection/shard/replica/core/index



  Original message 
 From: darren dar...@ontrenet.com
 Date:
 To: yo...@lucidworks.com, solr-user@lucene.apache.org
 Subject: Re: Terminology question: Core vs. Collection vs...


 Agreed. But for completeness can it be node/collection/shard/replica/core?






Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread Upayavira
Using your terminology, I'd say core is a physical Solr term, and index
is a physical Lucene term. A collection or a shard is a logical Solr
term.

Upayavira

On Fri, Jan 4, 2013, at 06:28 PM, darren wrote:
 My understanding is core is a logical Solr term. Index is a physical
 Lucene term. A Solr core is backed by a physical Lucene index. One index
 per core. Solr team can correct me if it's not accurate. :)
 
 
 
 


Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread darren
Good point. Agree.


 
 


Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread Mark Miller
Currently a SolrCore is 1:1 with a low-level Lucene index. There is no reason 
that needs to always be that way. It's possible that we may at some point add 
built-in micro-sharding support that means a SolrCore could have multiple 
underlying Lucene indexes. Or we may not.

- Mark


 
 



Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread Darren Govoni
Yes. In that case, core should best be described as a logical solr 
entity with various managed attributes
and qualities above the physical layer (sorry, not trying to perpetuate 
this thread so much).








Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread Per Steffensen
 to be working for the most part, 
even though we may not have defined them carefully enough and used 
them consistently enough.


If somebody wants to propose an alternative terminology - fine, discuss 
that on the dev list and/or file a Jira.


I won't claim that my definitions are perfect (yet), but perfecting 
the definitions (for users) should be separated from changing the 
terms themselves.


-- Jack Krupansky

-Original Message- From: Per Steffensen
Sent: Friday, January 04, 2013 2:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

On 1/3/13 5:58 PM, Walter Underwood wrote:
A factor is multiplied, so multiplying the leader by a 
replicationFactor of 1 means you have exactly one copy of that shard.


I think that recycling the term replication within Solr was 
confusing, but it is a bit late to change that.


wunder

Yes, the term factor is not misleading, but the term replication is.
If we keep calling shard-instances for Replica I guess replicaFactor
will be ok - at least much better than replicationFactor. But it would
still be better with e.g. ShardInstance and InstancesPerShard





Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread Alexandre Rafalovitch
 it wants to be,
 or is that assigned by Zookeeper, or is that a node's choice/option? But
 none of that changes the fact that a node registers with Zookeeper and
 then somehow becomes a replica for a shard.]

 A node (instance of a Solr server) can be a replica of shards from
 multiple collections (potentially multiple shards per collection). A node
 is not a replica per se, but a container that can serve multiple
 collections. A node can serve as multiple replicas, each of a different
 collection.

 My only interest here on this user list is to understand and explain the
 terms we have today and that SEEM to be working for the most part, even
 though we may not have defined them carefully enough and used them
 consistently enough.

 If somebody wants to propose an alternative terminology - fine, discuss
 that on the dev list and/or file a Jira.

 I won't claim that my definitions are perfect (yet), but perfecting the
 definitions (for users) should be separated from changing the terms
 themselves.

 -- Jack Krupansky

 -Original Message- From: Per Steffensen
 Sent: Friday, January 04, 2013 2:49 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Terminology question: Core vs. Collection vs...

 On 1/3/13 5:58 PM, Walter Underwood wrote:

 A factor is multiplied, so multiplying the leader by a
 replicationFactor of 1 means you have exactly one copy of that shard.

 I think that recycling the term replication within Solr was confusing,
 but it is a bit late to change that.

 wunder

 Yes, the term factor is not misleading, but the term replication is.
 If we keep calling shard-instances for Replica I guess replicaFactor
 will be ok - at least much better than replicationFactor. But it would
 still be better with e.g. ShardInstance and InstancesPerShard
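
As a rough illustration of the "a factor is multiplied" reading above, the number of cores a collection needs can be sketched as shards times copies per shard. The function name and the example numbers are made up for illustration, and the sketch assumes that replicationFactor counts every copy of a shard, the leader included, as Walter describes.

```python
def total_cores(num_shards: int, replication_factor: int) -> int:
    """Cores needed when every shard has `replication_factor` copies (leader included)."""
    if num_shards < 1 or replication_factor < 1:
        raise ValueError("both values must be at least 1")
    return num_shards * replication_factor


# replicationFactor=1: exactly one copy of each shard, so no additional replicas.
print(total_cores(num_shards=3, replication_factor=1))
# replicationFactor=2: each of the 3 shards exists twice.
print(total_cores(num_shards=3, replication_factor=2))
```

Under that reading, a name such as InstancesPerShard describes the same number without implying that anything has been copied yet.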





Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread Mark Miller

On Jan 4, 2013, at 2:14 PM, Per Steffensen st...@designware.dk wrote:

 I'm not sure what the node tells Zookeeper and who does shard assignment. I 
 mean, does a node explicitly say what shard it wants to be, or is that 
 assigned by Zookeeper, or is that a node's choice/option? 

It's basically both. If you don't explicitly specify a shard assignment on 
SolrCore creation, the Overseer will use ZooKeeper to assign a shard for you. 
It's up to the user which road to take.
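
A minimal sketch of the two roads, assuming the CoreAdmin CREATE action with its name, collection and shard parameters as found in Solr 4.x. The host, port, core names and shard name are hypothetical. Leaving out shard leaves the assignment to the cluster; passing it pins the new core to that shard.

```python
from urllib.parse import urlencode


def core_create_url(host, port, core_name, collection, shard=None):
    """Build a CoreAdmin CREATE URL; `shard` is optional (assumed Solr 4.x parameters)."""
    params = {"action": "CREATE", "name": core_name, "collection": collection}
    if shard is not None:
        # Explicit assignment: the new core becomes a replica of this shard.
        params["shard"] = shard
    return f"http://{host}:{port}/solr/admin/cores?{urlencode(params)}"


# Road 1: no shard given, the shard is assigned for you.
auto_url = core_create_url("localhost", 8983, "core_a", "collection1")
# Road 2: the user names the shard explicitly.
explicit_url = core_create_url("localhost", 8983, "core_b", "collection1", shard="shard2")
print(auto_url)
print(explicit_url)
```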

- mark

Terminology question: Core vs. Collection vs...

2013-01-03 Thread Alexandre Rafalovitch
Hello,

I am trying to understand the core Solr terminology. I am looking for
correct rather than loose meaning as I am trying to teach an example that
starts from an easy scenario and may scale to a multi-core, multi-machine
situation.

Here are the terms that seem to be all overlapping and/or crossing over in
my mind at the moment.

1) Index
2) Core
3) Collection
4) Instance
5) Replica (Replica of _what_?)
6) Others?

I tried looking through documentation, but either there is a terminology
drift or I am having trouble understanding the distinctions.

If anybody has a clear picture in their mind, I would appreciate a
clarification.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Aloke Ghoshal
Hi,

If you haven't already, please refer to:

http://www.ngdata.com/site/blog/57-ng.html
http://lucene.472066.n3.nabble.com/solr-cloud-concepts-td3726292.html
http://wiki.apache.org/solr/SolrCloud#FAQ

Regards,
Aloke




Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Alexandre Rafalovitch
Haven't seen these yet. These look like a great start, though now I see
even more terms to figure out.

Thank you,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 



Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Jack Krupansky
Collection is the more modern term and incorporates the fact that the 
collection may be sharded, with each shard on one or more cores, with each 
core being a replica of the other cores within that shard of that 
collection.


Instance is a general term, but is commonly used to refer to a running Solr 
server, each of which can service any number of cores. A sharded collection 
would typically require multiple instances of Solr, each with a shard of the 
collection.


Multiple collections can be supported on a single instance of Solr. They 
don't have to be sharded or replicated. But if they are, each Solr instance 
will have a copy or replica of the data (index) of one shard of each sharded 
collection - to the degree that each collection needs that many shards.


At the API level, you talk to a Solr instance, using a host and port, and 
giving the collection name. Some operations will refer only to the portion 
of a multi-shard collection on that Solr instance, but typically Solr will 
distribute the operation, whether it be an update or a query, to all of 
the shards of the named collection. In the case of update, the update will 
be distributed to all replicas as well, but in the case of query only one 
replica of each shard of the collection is needed.
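
As a small sketch of "host and port plus the collection name", the following builds query URLs for one Solr instance. It assumes the usual /solr/<collection>/select path and the distrib=false request parameter for restricting a query to the core that receives it; the host, port and collection name are placeholders.

```python
from urllib.parse import urlencode


def select_url(host, port, collection, query, local_only=False):
    """Build a /select URL addressed to one Solr instance for a named collection."""
    params = {"q": query, "wt": "json"}
    if local_only:
        # Ask the receiving core to answer from its own portion of the index only.
        params["distrib"] = "false"
    return f"http://{host}:{port}/solr/{collection}/select?{urlencode(params)}"


# Default: the instance fans the query out to one replica of every shard.
distributed = select_url("localhost", 8983, "collection1", "*:*")
# Local: only the portion of the collection held on this instance is searched.
local = select_url("localhost", 8983, "collection1", "*:*", local_only=True)
print(distributed)
print(local)
```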


Before SolrCloud, Solr had master and slave, and the slaves were replicas 
of the master. With SolrCloud there is no master and all the replicas of 
the shard are peers, although at any moment in time one of them will be 
considered the leader for coordination purposes, but not in the sense that 
it is a master of the other replicas in that shard. A SolrCloud replica is a 
replica of the data, in an abstract sense, for a single shard of a 
collection. A SolrCloud replica is more of an instance of the data/index.


An index exists at two levels: the portion of a collection on a single Solr 
core will have a Lucene index, but collectively the Lucene indexes for the 
shards of a collection can be referred to as the index of the collection. Each 
replica is a copy or instance of a portion of the collection's index.


The term slice is sometimes used to refer collectively to all of the 
cores/replicas of a single shard, or sometimes to a single replica as it 
contains only a slice of the full collection data.
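
The containment described above can be sketched as a small data model. This is only an illustration of the terms, not Solr's internal classes; every class, field and core name here is made up, and it assumes one core per replica and one Lucene index per core.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    core_name: str  # the Solr core holding this copy (backed by one Lucene index)
    instance: str   # host:port of the Solr instance running that core


@dataclass
class Shard:
    name: str
    replicas: list = field(default_factory=list)  # peer copies of the same subset of data


@dataclass
class Collection:
    name: str
    shards: list = field(default_factory=list)

    def cores(self):
        """Every core that holds part of the collection's distributed index."""
        return [r.core_name for s in self.shards for r in s.replicas]

    def query_targets(self):
        """A query needs only one replica of each shard (here simply the first)."""
        return [s.replicas[0].core_name for s in self.shards]


coll = Collection("collection1", [
    Shard("shard1", [Replica("c1_s1_r1", "host1:8983"), Replica("c1_s1_r2", "host2:8983")]),
    Shard("shard2", [Replica("c1_s2_r1", "host1:8983"), Replica("c1_s2_r2", "host2:8983")]),
])
print(coll.cores())          # an update has to reach all of these
print(coll.query_targets())  # a query can be answered by these alone
```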


-- Jack Krupansky




RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni
Good write up. 


And what about node?

I think there needs to be an official glossary of terms that is sanctioned by the Solr 
team, and some terms still in use may need to be labeled deprecated. After so 
many years, it's still confusing.



Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Jack Krupansky
A node is a machine in a cluster or cloud (graph). It could be a real 
machine or a virtualized machine. Technically, you could have multiple 
virtual nodes on the same physical box. Each Solr replica would be on a 
different node.


Technically, you could have multiple Solr instances running on a single 
hardware node, each with a different port. They are simply instances of 
Solr, although you could consider each Solr instance a node in a Solr cloud 
as well, a virtual node. So, technically, you could have multiple replicas 
on the same node, but that sort of defeats most of the purpose of having 
replicas in the first place - to distribute the data for performance and 
fault tolerance. But, you could have replicas of different shards on the 
same node/box for a partial improvement of performance and fault tolerance.
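
To make the placement point concrete, here is a small sketch that flags boxes holding more than one replica of the same shard. The (shard, box) pairs and all names are invented for illustration; it says nothing about how Solr itself places replicas.

```python
from collections import Counter


def colocated(placements):
    """Return the (shard, box) pairs where one box hosts several replicas of that shard."""
    counts = Counter(placements)  # (shard, box) -> number of replicas placed there
    return sorted(pair for pair, n in counts.items() if n > 1)


placements = [
    ("shard1", "boxA"), ("shard1", "boxB"),  # shard1 survives the loss of either box
    ("shard2", "boxA"), ("shard2", "boxA"),  # both copies of shard2 die with boxA
]
print(colocated(placements))
```

Note that boxA holding replicas of both shard1 and shard2 is not flagged: replicas of different shards on one box are the acceptable case described above.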


A Solr "cloud" is really a cluster.

-- Jack Krupansky




RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

Thanks again. (And sorry to jump into this convo)

But I had a question on your statement:

On 1/3/2013 08:07 AM Jack Krupansky wrote:
  Collection is the more modern term and incorporates the fact that the
  collection may be sharded, with each shard on one or more cores, with each
  core being a replica of the other cores within that shard of that
  collection.


A collection is sharded, meaning it is distributed across cores. A shard itself 
is not distributed across cores in the same sense. Rather, a shard exists on a 
single core and is replicated on other cores. Is that right? The way it's worded 
above, it sounds like a shard can also be sharded...


--- Original Message ---
On 1/3/2013  08:28 AM Jack Krupansky wrote:
A node is a machine in a cluster or cloud (graph). It could be a real
machine or a virtualized machine. Technically, you could have multiple
virtual nodes on the same physical box. Each Solr replica would be on a
different node.

Technically, you could have multiple Solr instances running on a single
hardware node, each with a different port. They are simply instances of
Solr, although you could consider each Solr instance a node in a Solr cloud
as well, a virtual node. So, technically, you could have multiple replicas
on the same node, but that sort of defeats most of the purpose of having
replicas in the first place - to distribute the data for performance and
fault tolerance. But, you could have replicas of different shards on the
same node/box for a partial improvement of performance and fault tolerance.

A Solr "cloud" is really a cluster.

-- Jack Krupansky
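
As an illustration of the placement point above, here is a minimal Python sketch. It is not Solr code: the function name `co_located_replicas`, the `placement` list, and the sample host:port values are all made up for the example. It flags shards that have more than one replica on the same node, which is the case described as defeating the purpose of replication, and it ignores replicas of different shards that share a node.

```python
from collections import defaultdict

def co_located_replicas(placement):
    """Return {(shard, node): count} for shards with 2+ replicas on one node.

    `placement` is a list of (shard, node) pairs, one pair per replica.
    """
    counts = defaultdict(int)
    for shard, node in placement:
        counts[(shard, node)] += 1
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical layout: two Solr instances (nodes), identified by host:port.
placement = [
    ("shard1", "host1:8983"),
    ("shard1", "host1:8983"),  # second replica of shard1 on the same node
    ("shard2", "host1:8983"),  # a different shard on that node is fine
    ("shard2", "host2:8983"),
]

print(co_located_replicas(placement))  # {('shard1', 'host1:8983'): 2}
```

An empty result would mean that losing any single node still leaves at least one replica of every shard that has more than one replica.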

-Original Message-
From: Darren Govoni
Sent: Thursday, January 03, 2013 8:16 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Good write up.

And what about node?

I think there needs to be an official glossary of terms that is sanctioned
by the Solr team, and some terms still in use may need to be labeled
deprecated. After so many years, it's still confusing.

--- Original Message ---
On 1/3/2013  08:07 AM Jack Krupansky wrote:
Collection is the more modern term and incorporates the fact that the
collection may be sharded, with each shard on one or more cores, with each
core being a replica of the other cores within that shard of that
collection.

Instance is a general term, but is commonly used to refer to a running Solr
server, each of which can service any number of cores. A sharded collection
would typically require multiple instances of Solr, each with a shard of the
collection.

Multiple collections can be supported on a single instance of Solr. They
don't have to be sharded or replicated. But if they are, each Solr instance
will have a copy or replica of the data (index) of one shard of each sharded
collection - to the degree that each collection needs that many shards.

At the API level, you talk to a Solr instance, using a host and port, and
giving the collection name. Some operations will refer only to the portion
of a multi-shard collection on that Solr instance, but typically Solr will
distribute the operation, whether it be an update or a query, to all of
the shards of the named collection. In the case of update, the update will
be distributed to all replicas as well, but in the case of query only one
replica of each shard of the collection is needed.
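
To make the API-level description a little more concrete, here is a small Python sketch that only builds request URLs and never contacts a server. The host, port, and collection name are placeholders, and `select_url` is a hypothetical helper. The `/solr/<collection>/select` path and the `distrib=false` parameter reflect common Solr usage, but treat them as assumptions and check them against the version you run.

```python
from urllib.parse import urlencode

def select_url(host, port, collection, query, local_only=False):
    """Build a Solr select URL for a collection on one Solr instance."""
    params = {"q": query}
    if local_only:
        # Ask this instance to answer from its own core only,
        # instead of fanning the query out to every shard.
        params["distrib"] = "false"
    return f"http://{host}:{port}/solr/{collection}/select?{urlencode(params)}"

# Distributed by default: any instance can receive the request.
print(select_url("localhost", 8983, "collection1", "*:*"))

# Restricted to the portion of the collection held by this instance.
print(select_url("localhost", 8983, "collection1", "*:*", local_only=True))
```

The point of the sketch is that the client names a host, a port, and a collection, not a shard or a replica; which replica of each shard answers is left to Solr.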

Before SolrCloud, Solr had master and slave and the slaves were replicas
of the master, but with SolrCloud there is no master and all the replicas of
the shard are peers, although at any moment of time one of them will be
considered the leader for coordination purposes, but not in the sense that
it is a master of the other replicas in that shard. A SolrCloud replica is a
replica of the data, in an abstract sense, for a single shard of a
collection. A SolrCloud replica is more of an instance of the data/index.

An index exists at two levels: the portion of a collection on a single Solr
core will have a Lucene index, but collectively the Lucene indexes for the
shards of a collection can be referred to as the index of the collection. Each
replica is a copy or instance of a portion of the collection's index.

The term slice is sometimes used to refer collectively to all of the
cores/replicas of a single shard, or sometimes to a single replica as it
contains only a slice of the full collection data.

-- Jack Krupansky

Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Jack Krupansky
No, a shard is a subset (or slice) of the collection. Sharding is a way of 
slicing the original data, before we talk about how the shards get stored 
and replicated on actual Solr cores. Replicas are instances of the data for 
a shard.


Sometimes people may loosely speak of a replica as being a shard, but 
that's just loose use of the terminology.


So, we're not sharding shards, but we are replicating shards.

-- Jack Krupansky
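
The distinction between sharding and replicating can be sketched as a tiny data model. This is illustrative Python only, not Solr's internal representation; the class names, field names, and sample core names are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    core_name: str  # the Solr core holding this copy of the shard's data
    node: str       # the Solr instance (host:port) serving that core

@dataclass
class Shard:
    name: str                                     # logical subset of the collection
    replicas: list = field(default_factory=list)  # physical copies of that subset

@dataclass
class Collection:
    name: str
    shards: list = field(default_factory=list)

    def replication_counts(self):
        """Number of replicas per shard; shards are replicated, not re-sharded."""
        return {s.name: len(s.replicas) for s in self.shards}

docs = Collection("docs", [
    Shard("shard1", [Replica("docs_shard1_r1", "host1:8983"),
                     Replica("docs_shard1_r2", "host2:8983")]),
    Shard("shard2", [Replica("docs_shard2_r1", "host2:8983")]),
])
print(docs.replication_counts())  # {'shard1': 2, 'shard2': 1}
```

In this model a shard never contains other shards. It only contains replicas, and each replica is one core on one node.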


Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Jack Krupansky

Oops... let me word that a little more carefully:

...we are replicating the data of each shard.





-- Jack Krupansky

RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

Thanks. I got that part.

A group of shards (and therefore cores) represents a collection, yes. But does a single shard exist only on a single core?



Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Jack Krupansky
And I would revise node to note that in SolrCloud a node is simply an 
instance of a Solr server.


And, technically, you can have multiple shards in a single instance of Solr, 
separating the logical sharding of keys from the distribution of the data.


-- Jack Krupansky
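
A rough way to picture "separating the logical sharding of keys from the distribution of the data" is two independent lookups: key to shard, then shard to node. The Python below is a sketch under stated assumptions. `shard_for_key` and `shard_to_node` are hypothetical names, and the CRC32-modulo hash is chosen only because it is deterministic and in the standard library; it is not Solr's actual document routing.

```python
import zlib

def shard_for_key(key, num_shards):
    """Logical sharding: map a document key to a shard index.

    CRC32 modulo the shard count is used purely for illustration;
    this is not how Solr itself routes documents.
    """
    return zlib.crc32(key.encode("utf-8")) % num_shards

# Physical distribution is a separate table. Here both shards
# live in one (hypothetical) Solr instance.
shard_to_node = {0: "host1:8983", 1: "host1:8983"}

for key in ["doc-1", "doc-2", "doc-3"]:
    shard = shard_for_key(key, num_shards=2)
    print(key, "-> shard", shard, "on", shard_to_node[shard])
```

Moving a shard to another instance later would only change `shard_to_node`; the key-to-shard mapping stays the same.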


RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

I think what's confusing about your explanation below is when you have a 
situation where there is no replication factor. That's possible too, yes?

So in that case, is each core of a shard of a collection still referred to as a replica?

To me, a replica is a duplicate/backup of a shard's core, not the sharded core
itself. Or is there just no difference? Is even a non-replicated core called a
replica?



RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

Yes. And it's worth noting that when a single node (@deprecated) holds
multiple shards, they are shards of different collections...

brbrbr--- Original Message ---
On 1/3/2013  09:16 AM Jack Krupansky wrote:brAnd I would revise node to note that in SolrCloud a node is simply an 
brinstance of a Solr server.

br
brAnd, technically, you can have multiple shards in a single instance of Solr, 
brseparating the logical sharding of keys from the distribution of the data.

br
br-- Jack Krupansky
br
br-Original Message- 
brFrom: Jack Krupansky

brSent: Thursday, January 03, 2013 9:08 AM
brTo: solr-user@lucene.apache.org
brSubject: Re: Terminology question: Core vs. Collection vs...
br
brOops... let me word that a little more carefully:
br
br...we are replicating the data of each shard.
br
br
br
br
br
br-- Jack Krupansky

Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Jack Krupansky
A single shard MAY exist on a single core, but only if it is not replicated. 
Generally, a single shard will exist on multiple cores, each a replica of 
the source data as it comes into the update handler.


-- Jack Krupansky

-Original Message- 
From: Darren Govoni

Sent: Thursday, January 03, 2013 9:10 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Thanks. I got that part.

A group of shards (and therefore cores) represents a collection, yes. But 
does a single shard exist only on a single core?


--- Original Message ---
On 1/3/2013 09:03 AM Jack Krupansky wrote:
No, a shard is a subset (or slice) of the collection. Sharding is a way of
slicing the original data, before we talk about how the shards get stored
and replicated on actual Solr cores. Replicas are instances of the data for
a shard.

Sometimes people may loosely speak of a replica as being a shard, but
that's just loose use of the terminology.

So, we're not sharding shards, but we are replicating shards.

-- Jack Krupansky

-Original Message-
From: Darren Govoni
Sent: Thursday, January 03, 2013 8:51 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Thanks again. (And sorry to jump into this convo)

But I had a question on your statement:

On 1/3/2013 08:07 AM Jack Krupansky wrote:
   Collection is the more modern term and incorporates the fact that the
   collection may be sharded, with each shard on one or more cores, with
   each core being a replica of the other cores within that shard of that
   collection.

A collection is sharded, meaning it is distributed across cores. A shard
itself is not distributed across cores in the same sense. Rather, a shard
exists on a single core and is replicated on other cores. Is that right? The
way it's worded above, it sounds like a shard can also be sharded...

--- Original Message ---
On 1/3/2013 08:28 AM Jack Krupansky wrote:
A node is a machine in a cluster or cloud (graph). It could be a real
machine or a virtualized machine. Technically, you could have multiple
virtual nodes on the same physical box. Each Solr replica would be on a
different node.

Technically, you could have multiple Solr instances running on a single
hardware node, each with a different port. They are simply instances of
Solr, although you could consider each Solr instance a node in a Solr cloud
as well, a virtual node. So, technically, you could have multiple replicas
on the same node, but that sort of defeats most of the purpose of having
replicas in the first place - to distribute the data for performance and
fault tolerance. But, you could have replicas of different shards on the
same node/box for a partial improvement of performance and fault tolerance.

A Solr 'cloud' is really a cluster.

-- Jack Krupansky

-Original Message-
From: Darren Govoni
Sent: Thursday, January 03, 2013 8:16 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Good write up.

And what about node?

I think there needs to be an official glossary of terms that is sanctioned
by the solr team and some terms still in use may need to be labeled
deprecated. After so many years, it's still confusing.

--- Original Message ---
On 1/3/2013 08:07 AM Jack Krupansky wrote:
Collection is the more modern term and incorporates the fact that the
collection may be sharded, with each shard on one or more cores, with each
core being a replica of the other cores within that shard of that
collection.

Instance is a general term, but is commonly used to refer to a running Solr
server, each of which can service any number of cores. A sharded collection
would typically require multiple instances of Solr, each with a shard of the
collection.

Multiple collections can be supported on a single instance of Solr. They
don't have to be sharded or replicated. But if they are, each Solr instance
will have a copy or replica of the data (index) of one shard of each sharded
collection - to the degree that each collection needs that many shards.

At the API level, you talk to a Solr instance, using a host and port, and
giving the collection name. Some operations will refer only to the portion
of a multi-shard collection on that Solr instance, but typically Solr will
distribute the operation, whether it be an update or a query, to all of
the shards of the named collection. In the case of update, the update will
be distributed to all replicas

RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

Ah, ok. Good. Makes sense.

I think I will draw all this up in a UML diagram that includes the distinction 
between the logical terms and the physical terms (and their mapping), as they 
do get intertwined. I'll post it here when I'm done.

--- Original Message ---
On 1/3/2013 09:19 AM Jack Krupansky wrote:
A single shard MAY exist on a single core, but only if it is not replicated. 
Generally, a single shard will exist on multiple cores, each a replica of 
the source data as it comes into the update handler.


Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Mark Miller

On Jan 3, 2013, at 9:17 AM, Darren Govoni dar...@ontrenet.com wrote:

  Even a non-replicated core is called a replica?

To some :) Forcing agreement on terminology has been … challenging…

And even if there is some agreement, new people come, old people that were not 
around for the agreement come back, etc.

Usually you have to figure it out by context.

I started trying to put a stake in the ground on the wiki - but it's still 
solidifying and does not include everything yet - e.g. I don't think it makes 
a call about whether "replica" means just the copies or also the leader. 
There was some discussion about this very thing recently.

Because all cores in a slice / logical shard are pretty much equal (anyone can 
become a leader), it doesn't seem crazy to consider them all replicas. If a 
leader goes down briefly and comes back - perhaps it just lost its connection 
for a moment - it will come back and no longer be a leader. Did it change from 
a non replica to a replica then? Gosh I don't know. Stick a fork in my eye :)
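
The scenario above can be sketched as a toy model. This is only an illustration of the terminology argument, not how SolrCloud elects leaders (that is coordinated through ZooKeeper); the class name `ShardGroup` and the core names are hypothetical.

```python
from typing import List, Optional


class ShardGroup:
    """Toy model: every core in a shard is a replica; one of them is the leader."""

    def __init__(self, replicas: List[str]) -> None:
        self.live: List[str] = list(replicas)
        self.leader: Optional[str] = self.live[0] if self.live else None

    def disconnect(self, replica: str) -> None:
        """Drop a replica; if it was the leader, another live replica takes over."""
        self.live.remove(replica)
        if replica == self.leader:
            # Any remaining replica may become leader; here we simply pick the first.
            self.leader = self.live[0] if self.live else None

    def reconnect(self, replica: str) -> None:
        """Bring a replica back; it only becomes leader if there is none."""
        self.live.append(replica)
        if self.leader is None:
            self.leader = replica


if __name__ == "__main__":
    group = ShardGroup(["core_a", "core_b", "core_c"])
    group.disconnect("core_a")   # the leader loses its connection for a moment
    group.reconnect("core_a")    # it comes back as an ordinary member
    print(group.leader, group.live)
```

In this model the returning core never stopped being "a replica of the shard"; only the leader role moved, which is one way to read the argument for calling all of them replicas.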

- Mark

Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Jack Krupansky
Ah... the multiple shards (of the same collection) in a single node is about 
planning for future expansion of your cluster - create more shards than you 
need today, put more of them on a single node and then migrate them to their 
own nodes as the data outgrows the smaller number of nodes. In other words, 
add nodes incrementally without having to reindex all the data.
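
A rough sketch of that expansion idea, under stated assumptions: the hash below is a stand-in and not Solr's actual routing, and the helper names `shard_for` and `place_shards` and the node names are hypothetical. The point it illustrates is that the document-to-shard mapping depends only on the number of shards, so moving shards to new nodes does not require reindexing.

```python
import hashlib
from typing import Dict, List


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard index with a stable hash (illustrative only)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def place_shards(num_shards: int, nodes: List[str]) -> Dict[str, List[int]]:
    """Spread shard indexes over the given nodes round-robin."""
    placement: Dict[str, List[int]] = {node: [] for node in nodes}
    for shard in range(num_shards):
        placement[nodes[shard % len(nodes)]].append(shard)
    return placement


if __name__ == "__main__":
    num_shards = 8  # more shards than today's two nodes strictly need
    today = place_shards(num_shards, ["node1", "node2"])
    later = place_shards(num_shards, ["node1", "node2", "node3", "node4"])
    print(today)
    print(later)
    # The shard of a document is the same before and after adding nodes.
    print(shard_for("doc-42", num_shards))
```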


-- Jack Krupansky

-Original Message- 
From: Darren Govoni

Sent: Thursday, January 03, 2013 9:18 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Yes. And it's worth noting that when there are multiple shards in a single 
node (@deprecated), they are shards of different collections...



Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Per Steffensen

Hi

Here is my version - I do not believe the explanations have been very clear.

We have the following concepts (here I will try to explain what each 
concept covers without naming it - it's hard):
1) Machines (virtual or physical) running Solr server JVMs (one machine 
can run several Solr server JVMs if you like)

2) Solr server JVMs
3) Logical stores where you can add/update/delete data-instances 
(closest to logical tables in an RDBMS)
4) Logical slices of a store (closest to non-overlapping logical 
sets of rows for the logical table in an RDBMS)
5) Physical instances of slices (a physical (disk/memory) instance of 
a logical slice). This is where data actually goes on disk - the 
logical stores and slices above are just non-physical concepts


Terminology
1) I believe we have no name for this (except of course machine :-) ), 
even though Jack claims that this is called a node. Maybe sometimes it 
is called a node, but I believe node is more often used to refer to 
a Solr server JVM.

2) Node
3) Collection
4) Shard. Used to be called Slice, but I believe it is now officially 
called Shard. I agree with that change, because I believe most of the 
industry also uses the term Shard for this logical/non-physical 
concept - it just needs to be reflected across documentation and code.
5) Replica. Used to be called Shard, but I believe it is now 
officially called Replica. I certainly do not agree with the name 
Replica, because it suggests that it is a copy of an original, but 
it isn't. I would prefer Shard-instance here, to avoid the confusion. I 
understand that you can argue (if you argue long enough) that Replica 
is a fine name, but you really need the explanation to understand why 
Replica can be defended as the name for this. It is not immediately 
obvious what this is as long as it is called Replica. A Replica is 
basically a Solr Cloud managed Core, and behind every Replica/Core lives 
a physical Lucene index. So a Replica (= Core) contains/maintains a Lucene 
index behind the scenes. The term Replica also needs to be reflected 
across documentation and code.
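
The five concepts above can be sketched as a small data model. This is only an illustration of the logical/physical split; the class and field names, the `host:port` node strings, and the `books` collection are assumptions made for the example, not Solr's own classes or naming rules.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Replica:
    """Physical instance of a shard: a core with its own Lucene index."""
    core_name: str
    node: str  # hypothetical "host:port" of the Solr server JVM hosting the core


@dataclass
class Shard:
    """Logical slice of a collection; it holds no data itself."""
    name: str
    replicas: List[Replica] = field(default_factory=list)


@dataclass
class Collection:
    """Logical store, split into non-overlapping shards."""
    name: str
    shards: List[Shard] = field(default_factory=list)

    def cores_by_node(self) -> Dict[str, List[str]]:
        """Group the physical cores by the node (Solr server JVM) that hosts them."""
        placement: Dict[str, List[str]] = {}
        for shard in self.shards:
            for replica in shard.replicas:
                placement.setdefault(replica.node, []).append(replica.core_name)
        return placement


def build_example() -> Collection:
    """Two shards, each with two replicas, spread over two nodes."""
    nodes = ["host1:8983", "host2:8983"]
    shards = []
    for s in (1, 2):
        replicas = [Replica(f"books_shard{s}_replica{r + 1}", nodes[r]) for r in range(2)]
        shards.append(Shard(f"shard{s}", replicas))
    return Collection("books", shards)


if __name__ == "__main__":
    print(build_example().cores_by_node())
```

Only `Replica` carries a location; `Collection` and `Shard` exist purely as groupings, which mirrors the point that data only ever lives in the physical instances.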


Regards, Per Steffensen

On 1/3/13 10:42 AM, Alexandre Rafalovitch wrote:

Hello,

I am trying to understand the core Solr terminology. I am looking for the
correct rather than the loose meaning, as I am trying to teach an example
that starts from an easy scenario and may scale to a multi-core,
multi-machine situation.

Here are the terms that seem to be all overlapping and/or crossing over in
my mind at the moment.

1) Index
2) Core
3) Collection
4) Instance
5) Replica (Replica of _what_?)
6) Others?

I tried looking through documentation, but either there is a terminology
drift or I am having trouble understanding the distinctions.

If anybody has a clear picture in their mind, I would appreciate a
clarification.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)





Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Per Steffensen
For the same reasons that Replica shouldn't be called Replica (it 
requires too long an explanation to agree that it is an OK name), 
replicationFactor shouldn't be called replicationFactor as long as 
it refers to the TOTAL number of cores you get for your Shard. 
replicationFactor would be an OK name if replicationFactor=0 meant one 
core, replicationFactor=1 meant two cores, etc., but as long as 
replicationFactor=1 means one core and replicationFactor=2 means two cores, 
it is bad naming (you will not get any replication with 
replicationFactor=1 - WTF!?!?). If we want to insist that you specify 
the total number of cores, at least use replicaPerShard instead of 
replicationFactor, or even better, rename Replica to Shard-instance 
and use instancesPerShard instead of replicationFactor.
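
The arithmetic behind this complaint, as a minimal sketch. The function names are made up for illustration; the only thing taken from the thread is that replicationFactor counts the total number of cores per shard.

```python
def cores_per_shard(replication_factor: int) -> int:
    """Total cores hosting one shard, under the 'total count' convention."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return replication_factor


def extra_copies_per_shard(replication_factor: int) -> int:
    """Copies beyond the first core, i.e. what a '0 means one core' convention would count."""
    return cores_per_shard(replication_factor) - 1


def total_cores(num_shards: int, replication_factor: int) -> int:
    """Cores needed across the whole collection."""
    return num_shards * cores_per_shard(replication_factor)


if __name__ == "__main__":
    for rf in (1, 2, 3):
        print(rf, cores_per_shard(rf), extra_copies_per_shard(rf), total_cores(4, rf))
```

With replicationFactor=1 the number of extra copies is zero, which is exactly the "no replication with replicationFactor=1" case described above.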


Regards, Per Steffensen





Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Jack Krupansky

Yes, in the context of SolrCloud, Node = Solr server JVM.

So, a node is an instance of Solr, which can support multiple cores and 
multiple collections - or at least shards of multiple collections.


-- Jack Krupansky

-Original Message- 
From: Per Steffensen

Sent: Thursday, January 03, 2013 9:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...






Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Mark Miller
This has pretty much become the standard across other distributed systems and 
in the literat…err…books.

I first implemented it as you mention you'd like, but Yonik correctly pointed 
out that we were going against the grain.

- Mark

On Jan 3, 2013, at 10:01 AM, Per Steffensen st...@designware.dk wrote:

 For the same reasons that Replica shouldn't be called Replica (it requires 
 too long an explanation to agree that it is an OK name), replicationFactor 
 shouldn't be called replicationFactor as long as it refers to the TOTAL 
 number of cores you get for your Shard. replicationFactor would be an OK 
 name if replicationFactor=0 meant one core, replicationFactor=1 meant two 
 cores, etc., but as long as replicationFactor=1 means one core and 
 replicationFactor=2 means two cores, it is bad naming (you will not get any 
 replication with replicationFactor=1 - WTF!?!?). If we want to insist that 
 you specify the total number of cores, at least use replicaPerShard instead 
 of replicationFactor, or even better, rename Replica to Shard-instance 
 and use instancesPerShard instead of replicationFactor.
 
 Regards, Per Steffensen
 
 



Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Per Steffensen

On 1/3/13 4:33 PM, Mark Miller wrote:

This has pretty much become the standard across other distributed systems and 
in the literat…err…books.
Hmmm, I'm not sure you are right about that. Maybe more than one 
distributed system calls them Replica, but there are also a lot that 
don't. But if you are right, that's at least a good valid argument to do 
it this way, even though I generally prefer good logical naming over 
following bad naming from the industry :-) Just because there is a lot 
of crap out there doesn't mean that we also want to make crap. Maybe 
good logical naming could even be a small entry in the "Why Solr is 
better than its competitors" list :-)


RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

Great point.



Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Mark Miller

On Jan 3, 2013, at 10:42 AM, Per Steffensen st...@designware.dk wrote:

 Why Solr is better than its competitors list :-)

The problem is that it's not just Solr competitors. It seems to be pretty much 
everyone. If you can provide counter examples, I'd be interested to see them, 
but I've found confirmation examples in projects and books left and right.

Trying to forge our own path here seems more confusing than helpful IMO. We 
have enough issues with terminology right now - where we can go with the 
industry standard, I think we should.

- Mark

Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Mark Miller

On Jan 3, 2013, at 10:55 AM, Mark Miller markrmil...@gmail.com wrote:

 
 On Jan 3, 2013, at 10:42 AM, Per Steffensen st...@designware.dk wrote:
 
 Why Solr is better than its competitors list :-)
 
 The problem is that it's not just Solr competitors. It seems to be pretty 
 much everyone. If you can provide counter examples, I'd be interested to see 
 them, but I've found confirmation examples in projects and books left and 
 right.
 
 Trying to forge our own path here seems more confusing than helpful IMO. We 
 have enough issues with terminology right now - where we can go with the 
 industry standard, I think we should.
 
 - Mark


P.S. I'm referring specifically to replication factor and not replica. While 
I think it's probably a similar deal, I've only researched replication factor 
specifically.

- Mark

Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Walter Underwood
A factor is multiplied, so multiplying the leader by a replicationFactor of 1 
means you have exactly one copy of that shard.

I think that recycling the term replication within Solr was confusing, but it 
is a bit late to change that. 

wunder

On Jan 3, 2013, at 7:33 AM, Mark Miller wrote:

 This has pretty much become the standard across other distributed systems and 
 in the literat…err…books.
 
 I first implemented it as you mention you'd like, but Yonik correctly pointed 
 out that we were going against the grain.
 
 - Mark
 
 On Jan 3, 2013, at 10:01 AM, Per Steffensen st...@designware.dk wrote:
 
 For the same reasons that Replica shouldn't be called Replica (it 
 requires too long an explanation to agree that it is an ok name), 
 replicationFactor shouldn't be called replicationFactor as long as it 
 refers to the TOTAL number of cores you get for your Shard. 
 replicationFactor would be an ok name if replicationFactor=0 meant one 
 core, replicationFactor=1 meant two cores etc., but as long as 
 replicationFactor=1 means one core, replicationFactor=2 means two cores, it 
 is bad naming (you will not get any replication with replicationFactor=1 - 
 WTF!?!?). If we want to insist that you specify the total number of cores, at 
 least use replicaPerShard instead of replicationFactor, or even better 
 rename Replica to Shard-instance and use instancesPerShard instead of 
 replicationFactor.
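
The two readings of replicationFactor discussed above differ only by an off-by-one, which a few lines of Python can make explicit. This is just arithmetic on the semantics described in this thread; the function names are invented for the example.

```python
def total_cores(num_shards: int, replication_factor: int) -> int:
    """Reading used in the thread: the factor is the TOTAL number of cores
    per shard, so replication_factor=1 means one core and no redundancy."""
    if num_shards < 1 or replication_factor < 1:
        raise ValueError("need at least one shard and a factor of at least 1")
    return num_shards * replication_factor


def total_cores_extra_copies(num_shards: int, extra_copies: int) -> int:
    """Alternative reading Per says the name suggests: the value counts
    ADDITIONAL copies, so 0 means one core and 1 means two cores."""
    if num_shards < 1 or extra_copies < 0:
        raise ValueError("need at least one shard and a non-negative count")
    return num_shards * (1 + extra_copies)


# Compare both readings for a two-shard collection.
for value in (1, 2, 3):
    print(value, total_cores(2, value), total_cores_extra_copies(2, value))
```

Under the first reading a two-shard collection with a factor of 1 has two cores in total, one per shard; under the second it would have four.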
 
 Regards, Per Steffensen
 

--
Walter Underwood
wun...@wunderwood.org





RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

And based on the previous explanation there is never a copy of a shard. A 
shard represents and contains only replicas for itself, replicas being copies of cores 
within the shard.



Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Lance Norskog
Also, searching can be much faster if you put all of the shards, and the 
search distributor, on one machine. That way, you search with multiple 
simultaneous threads inside one machine. I've seen this make searches 
several times faster.
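
As a rough illustration of what "search distributor plus several shards in one process" means, here is a small Python sketch that fans a query out to in-memory toy shards on a thread pool and merges the per-shard top hits. Everything in it (the SHARDS data, the term-frequency score, the function names) is an assumption for the example and has nothing to do with Solr's real distributed search code.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Toy in-memory "shards": doc id -> text. Real shards would be Lucene indexes.
SHARDS: List[Dict[str, str]] = [
    {"d1": "solr cloud shard", "d2": "lucene index"},
    {"d3": "shard replica shard", "d4": "collection"},
    {"d5": "replica core"},
]


def search_shard(shard: Dict[str, str], term: str, k: int) -> List[Tuple[int, str]]:
    """Return the top-k (score, doc_id) hits of one shard.
    The score is plain term frequency, a stand-in for real relevance scoring."""
    hits = [(text.split().count(term), doc_id) for doc_id, text in shard.items()]
    return heapq.nlargest(k, [hit for hit in hits if hit[0] > 0])


def distributed_search(term: str, k: int = 3) -> List[Tuple[int, str]]:
    """Query every shard on its own thread, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        per_shard = pool.map(lambda shard: search_shard(shard, term, k), SHARDS)
        merged = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(k, merged)


print(distributed_search("shard"))
```

The sketch only shows the fan-out and merge structure. It does not demonstrate the speedup itself: pure-Python threads will not run this CPU-bound work in parallel, whereas threads inside a single Solr JVM can.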


On 01/03/2013 06:36 AM, Jack Krupansky wrote:
Ah... the multiple shards (of the same collection) in a single node is 
about planning for future expansion of your cluster - create more 
shards than you need today, put more of them on a single node and then 
migrate them to their own nodes as the data outgrows the smaller 
number of nodes. In other words, add nodes incrementally without 
having to reindex all the data.
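
The over-sharding idea can be sketched in a few lines of Python: the number of shards is fixed up front, and growing the cluster only changes which node hosts each shard. The helper names and the simple balancing rule below are assumptions for illustration, not how SolrCloud actually assigns or moves shards.

```python
from typing import Dict, List


def assign_round_robin(shards: List[str], nodes: List[str]) -> Dict[str, str]:
    """Initial placement: spread more shards than nodes round-robin."""
    return {shard: nodes[i % len(nodes)] for i, shard in enumerate(shards)}


def add_nodes(placement: Dict[str, str], new_nodes: List[str]) -> Dict[str, str]:
    """Move whole shards from overloaded nodes to the least loaded node.
    The shard count never changes, so documents keep their shard and
    nothing has to be reindexed."""
    result = dict(placement)
    load = {node: 0 for node in list(placement.values()) + new_nodes}
    for node in result.values():
        load[node] += 1
    target = -(-len(result) // len(load))  # ceiling of shards per node
    for shard, node in sorted(result.items()):
        if load[node] > target:
            dest = min(load, key=lambda n: (load[n], n))
            result[shard] = dest
            load[node] -= 1
            load[dest] += 1
    return result


shards = ["shard1", "shard2", "shard3", "shard4"]
before = assign_round_robin(shards, ["node1", "node2"])
after = add_nodes(before, ["node3", "node4"])
print(before)
print(after)
```

With four shards on two nodes today, adding two nodes later leaves one shard per node while the set of shards stays exactly the same.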


-- Jack Krupansky

-Original Message- From: Darren Govoni
Sent: Thursday, January 03, 2013 9:18 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Yes. And it's worth noting that when you have multiple shards in a 
single node (@deprecated), they are shards of different collections...


--- Original Message ---
On 1/3/2013 09:16 AM Jack Krupansky wrote:
And I would revise node to note that in SolrCloud a node is simply an 
instance of a Solr server.

And, technically, you can have multiple shards in a single instance of 
Solr, separating the logical sharding of keys from the distribution of 
the data.

-- Jack Krupansky

-Original Message- From: Jack Krupansky
Sent: Thursday, January 03, 2013 9:08 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

Oops... let me word that a little more carefully:

...we are replicating the data of each shard.

-- Jack Krupansky

-Original Message- From: Jack Krupansky
Sent: Thursday, January 03, 2013 9:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

No, a shard is a subset (or slice) of the collection. Sharding is a way 
of slicing the original data, before we talk about how the shards get 
stored and replicated on actual Solr cores. Replicas are instances of 
the data for a shard.

Sometimes people may loosely speak of a replica as being a shard, but 
that's just loose use of the terminology.

So, we're not sharding shards, but we are replicating shards.

-- Jack Krupansky

-Original Message- From: Darren Govoni
Sent: Thursday, January 03, 2013 8:51 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Thanks again. (And sorry to jump into this convo)

But I had a question on your statement:

On 1/3/2013 08:07 AM Jack Krupansky wrote:
 Collection is the more modern term and incorporates the fact that the 
 collection may be sharded, with each shard on one or more cores, with 
 each core being a replica of the other cores within that shard of that 
 collection.

A collection is sharded, meaning it is distributed across cores. A shard 
itself is not distributed across cores in the same sense. Rather, a shard 
exists on a single core and is replicated on other cores. Is that right? 
The way it's worded above, it sounds like a shard can also be sharded...

--- Original Message ---
On 1/3/2013 08:28 AM Jack Krupansky wrote:
A node is a machine in a cluster or cloud (graph). It could be a real 
machine or a virtualized machine. Technically, you could have multiple 
virtual nodes on the same physical box. Each Solr replica would be on a 
different node.

Technically, you could have multiple Solr instances running on a single 
hardware node, each with a different port. They are simply instances of 
Solr, although you could consider each Solr instance a node in a Solr 
cloud as well, a virtual node. So, technically, you could have multiple 
replicas on the same node, but that sort of defeats most of the purpose 
of having replicas in the first place - to distribute the data for 
performance and fault tolerance. But, you could have replicas of 
different shards on the same node/box for a partial improvement of 
performance and fault tolerance.

A "Solr cloud" is really a cluster.
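
The placement rule described above (replicas of the same shard belong on different nodes, while replicas of different shards may share one) can be checked mechanically. Here is a minimal Python sketch; the (shard, node) pair representation and the function name are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One (shard, node) pair per replica. All names are made up.
Placement = List[Tuple[str, str]]


def colocated_replicas(placement: Placement) -> Dict[str, List[str]]:
    """Return node -> shards for which that node hosts more than one replica,
    i.e. the placements that add no fault tolerance."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for shard, node in placement:
        counts[node][shard] += 1
    return {
        node: sorted(shard for shard, n in shards.items() if n > 1)
        for node, shards in counts.items()
        if any(n > 1 for n in shards.values())
    }


# Replicas of different shards share nodes: fine.
good = [("shard1", "nodeA"), ("shard1", "nodeB"),
        ("shard2", "nodeA"), ("shard2", "nodeB")]
# Both replicas of shard1 sit on nodeA: losing nodeA loses shard1.
bad = [("shard1", "nodeA"), ("shard1", "nodeA"), ("shard2", "nodeB")]

print(colocated_replicas(good))
print(colocated_replicas(bad))
```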

-- Jack Krupansky

-Original Message- From: Darren Govoni
Sent: Thursday, January 03, 2013 8:16 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Good write up.

And what about node?

I think there needs to be an official glossary of terms that is 
sanctioned by the solr team, and some terms still in use may need to be 
labeled deprecated. After so many years, it's still confusing.

--- Original Message ---
On 1/3/2013 08:07 AM Jack Krupansky wrote:
Collection is the more modern term and incorporates the fact that the 
collection

Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni
I see. So sharding and distributing/replicating can have separate and 
different advantages.





Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Per Steffensen

On 1/3/13 4:55 PM, Mark Miller wrote:
Trying to forge our own path here seems more confusing than helpful 
IMO. We have enough issues with terminology right now - where we can 
go with the industry standard, I think we should. - Mark 

Fair enough.

I don't think our biggest problem is whether we decide to call it 
Replica/replicationFactor or ShardInstance/InstancesPerShard. Our 
biggest problem is that we really haven't decided once and for all and 
made sure to reflect the decision consistently across code and 
documentation. As long as we haven't, I believe it is still ok to change 
our minds.




Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Per Steffensen

On 1/3/13 5:58 PM, Walter Underwood wrote:

A factor is multiplied, so multiplying the leader by a replicationFactor of 1 
means you have exactly one copy of that shard.

I think that recycling the term replication within Solr was confusing, but it 
is a bit late to change that.

wunder
Yes, the term factor is not misleading, but the term replication is. 
If we keep calling shard-instances Replica, I guess replicaFactor 
would be ok - at least much better than replicationFactor. But it would 
still be better with e.g. ShardInstance and InstancesPerShard.