[jira] [Commented] (SOLR-14927) Remove Overseer

2021-01-23 Thread Ilan Ginzburg (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17270728#comment-17270728
 ] 

Ilan Ginzburg commented on SOLR-14927:
--

As I'm working on the child work item for distributing the cluster state 
updates, I realize that some changes to the Collection API might be required 
earlier than I hoped.
See [comment on 
SOLR-14928|https://issues.apache.org/jira/browse/SOLR-14928?focusedCommentId=17270726=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17270726].

> Remove Overseer
> ---
>
> Key: SOLR-14927
> URL: https://issues.apache.org/jira/browse/SOLR-14927
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Ilan Ginzburg
>Assignee: Ilan Ginzburg
>Priority: Major
>  Labels: cluster, collection-api, overseer, solrcloud, zookeeper
>
> This Jira is intended to capture sub jiras on the path to remove the Overseer 
> component from SolrCloud and move to all nodes being able to do the work 
> currently done by Overseer.
> See detailed description in [this 
> doc|https://docs.google.com/document/d/1u4QHsIHuIxlglIW6hekYlXGNOP0HjLGVX5N6inkj6Ok/].
> Copying (edited) from the above doc:
> The motivation for removing Overseer include:
>  * Mono threaded state change is slow and doesn’t scale,
>  * Communication between cluster nodes and the Overseer use Zookeeper as a 
> queueing mechanism, this is not a good idea,
>  * Nodes talking to Overseer (then Overseer talking to itself) via Zookeeper 
> is inefficient and adds latency,
>  * Collection API scalability is poor, because not only a single node 
> processes commands for all Collections, but it also depends on the mono 
> threaded state change queue consumption,
>  * The code supporting Overseer in SolrCloud is complex (election, queue 
> management, recovery etc).
> The general idea is that there’s already a central point in the SolrCloud 
> cluster and it’s Zookeeper. It might not be necessary to have a second 
> central point (Overseer) because nodes can interact directly with Zookeeper 
> and synchronize more efficiently by optimistic locking using “conditional 
> updates” (a.k.a compare and swap or CAS).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14927) Remove Overseer

2020-11-10 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229452#comment-17229452
 ] 

David Smiley commented on SOLR-14927:
-

+1 
I like this plan most of all because SolrCloud becomes simpler -- ZK + Overseer 
is fundamentally more complex than ZK alone.  This is your point at the end of 
the issue description; I just wanted to amplify it.  Perhaps it's more scalable 
as well, but that's secondary to me.

> Remove Overseer
> ---
>
> Key: SOLR-14927
> URL: https://issues.apache.org/jira/browse/SOLR-14927
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Ilan Ginzburg
>Assignee: Ilan Ginzburg
>Priority: Major
>  Labels: cluster, collection-api, overseer, solrcloud, zookeeper
>
> This Jira is intended to capture sub jiras on the path to remove the Overseer 
> component from SolrCloud and move to all nodes being able to do the work 
> currently done by Overseer.
> See detailed description in [this 
> doc|https://docs.google.com/document/d/1u4QHsIHuIxlglIW6hekYlXGNOP0HjLGVX5N6inkj6Ok/].
> Copying (edited) from the above doc:
> The motivation for removing Overseer include:
>  * Mono threaded state change is slow and doesn’t scale,
>  * Communication between cluster nodes and the Overseer use Zookeeper as a 
> queueing mechanism, this is not a good idea,
>  * Nodes talking to Overseer (then Overseer talking to itself) via Zookeeper 
> is inefficient and adds latency,
>  * Collection API scalability is poor, because not only a single node 
> processes commands for all Collections, but it also depends on the mono 
> threaded state change queue consumption,
>  * The code supporting Overseer in SolrCloud is complex (election, queue 
> management, recovery etc).
> The general idea is that there’s already a central point in the SolrCloud 
> cluster and it’s Zookeeper. It might not be necessary to have a second 
> central point (Overseer) because nodes can interact directly with Zookeeper 
> and synchronize more efficiently by optimistic locking using “conditional 
> updates” (a.k.a compare and swap or CAS).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14927) Remove Overseer

2020-11-10 Thread Ilan Ginzburg (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229336#comment-17229336
 ] 

Ilan Ginzburg commented on SOLR-14927:
--

I see three types of (cluster state) updates done by Overseer and a different 
behavior for each by moving to CAS (Compare and Swap):
 # State change for a single collection done directly by the Collection API 
command execution
 # Replicas advertising themselves as having been created by nodes
 # Replica up/down updates for multiple collection when nodes go up/down.

I believe CAS should be a relatively easy win for case 1. There is likely low 
contention on updates to the corresponding {{state.json}}.

Case 2 (as [~thelabdude] 
[commented)|https://docs.google.com/document/d/1u4QHsIHuIxlglIW6hekYlXGNOP0HjLGVX5N6inkj6Ok/edit?disco=HGxplOw]
 might be a tricky one. There will no longer be batching of the updates as 
Overseer currently does. Each replica will try to do the update to ZK directly. 
We could then introduce batching by SolrCloud node for large collections. We 
might also split the collection Zookeeper state into a collection level state 
({{state.json}}) and per shard sub states ({{state-__.json}} for 
example) so that replicas of different shards do not compete (nor watch...) 
state changes related to other shards.

Case 3 has to be addressed in a slightly different way: make replica state (in 
its collection's {{state.json}} or its shard state if one eventually exists) 
not depend on (i.e. change with) its node state, but rather check node state 
and replica state to determine the "real" replica state: replica considered up 
if replica advertised as up and node up as well.

> Remove Overseer
> ---
>
> Key: SOLR-14927
> URL: https://issues.apache.org/jira/browse/SOLR-14927
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Ilan Ginzburg
>Assignee: Ilan Ginzburg
>Priority: Major
>  Labels: cluster, collection-api, overseer, solrcloud, zookeeper
>
> This Jira is intended to capture sub jiras on the path to remove the Overseer 
> component from SolrCloud and move to all nodes being able to do the work 
> currently done by Overseer.
> See detailed description in [this 
> doc|https://docs.google.com/document/d/1u4QHsIHuIxlglIW6hekYlXGNOP0HjLGVX5N6inkj6Ok/].
> Copying (edited) from the above doc:
> The motivation for removing Overseer include:
>  * Mono threaded state change is slow and doesn’t scale,
>  * Communication between cluster nodes and the Overseer use Zookeeper as a 
> queueing mechanism, this is not a good idea,
>  * Nodes talking to Overseer (then Overseer talking to itself) via Zookeeper 
> is inefficient and adds latency,
>  * Collection API scalability is poor, because not only a single node 
> processes commands for all Collections, but it also depends on the mono 
> threaded state change queue consumption,
>  * The code supporting Overseer in SolrCloud is complex (election, queue 
> management, recovery etc).
> The general idea is that there’s already a central point in the SolrCloud 
> cluster and it’s Zookeeper. It might not be necessary to have a second 
> central point (Overseer) because nodes can interact directly with Zookeeper 
> and synchronize more efficiently by optimistic locking using “conditional 
> updates” (a.k.a compare and swap or CAS).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14927) Remove Overseer

2020-11-10 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229252#comment-17229252
 ] 

Mark Robert Miller commented on SOLR-14927:
---

It’s the bad impl (due to tech debt and variety of reasons), not the design. 
You are right that the zookeeper already owns the state, and that is why our 
overseer is so silly. The solution is not to embrace zk more, that’s actually 
the non scalable solution. The Overseer actually has the advantage for state 
updates, the cas approach with zk as the state owner is actually the non 
scalable approach. The disadvantage is actually what’s stated as the advantage. 
Zookeeper owning the state, state updates are not scalable.

The approach will not compete well with ab overseer approach in multiple areas, 
including cluster scalability. 

> Remove Overseer
> ---
>
> Key: SOLR-14927
> URL: https://issues.apache.org/jira/browse/SOLR-14927
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Ilan Ginzburg
>Assignee: Ilan Ginzburg
>Priority: Major
>  Labels: cluster, collection-api, overseer, solrcloud, zookeeper
>
> This Jira is intended to capture sub jiras on the path to remove the Overseer 
> component from SolrCloud and move to all nodes being able to do the work 
> currently done by Overseer.
> See detailed description in [this 
> doc|https://docs.google.com/document/d/1u4QHsIHuIxlglIW6hekYlXGNOP0HjLGVX5N6inkj6Ok/].
> Copying (edited) from the above doc:
> The motivation for removing Overseer include:
>  * Mono threaded state change is slow and doesn’t scale,
>  * Communication between cluster nodes and the Overseer use Zookeeper as a 
> queueing mechanism, this is not a good idea,
>  * Nodes talking to Overseer (then Overseer talking to itself) via Zookeeper 
> is inefficient and adds latency,
>  * Collection API scalability is poor, because not only a single node 
> processes commands for all Collections, but it also depends on the mono 
> threaded state change queue consumption,
>  * The code supporting Overseer in SolrCloud is complex (election, queue 
> management, recovery etc).
> The general idea is that there’s already a central point in the SolrCloud 
> cluster and it’s Zookeeper. It might not be necessary to have a second 
> central point (Overseer) because nodes can interact directly with Zookeeper 
> and synchronize more efficiently by optimistic locking using “conditional 
> updates” (a.k.a compare and swap or CAS).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14927) Remove Overseer

2020-10-27 Thread Ilan Ginzburg (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221554#comment-17221554
 ] 

Ilan Ginzburg commented on SOLR-14927:
--

Thanks [~gezapeti]. I use the doc linked in the description to describe what 
I'm currently doing (last section) and questions I have (that I ask myself... 
or others). Would love to collaborate on this!

> Remove Overseer
> ---
>
> Key: SOLR-14927
> URL: https://issues.apache.org/jira/browse/SOLR-14927
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Ilan Ginzburg
>Assignee: Ilan Ginzburg
>Priority: Major
>  Labels: cluster, collection-api, overseer, solrcloud, zookeeper
>
> This Jira is intended to capture sub jiras on the path to remove the Overseer 
> component from SolrCloud and move to all nodes being able to do the work 
> currently done by Overseer.
> See detailed description in [this 
> doc|https://docs.google.com/document/d/1u4QHsIHuIxlglIW6hekYlXGNOP0HjLGVX5N6inkj6Ok/].
> Copying (edited) from the above doc:
> The motivation for removing Overseer include:
>  * Mono threaded state change is slow and doesn’t scale,
>  * Communication between cluster nodes and the Overseer use Zookeeper as a 
> queueing mechanism, this is not a good idea,
>  * Nodes talking to Overseer (then Overseer talking to itself) via Zookeeper 
> is inefficient and adds latency,
>  * Collection API scalability is poor, because not only a single node 
> processes commands for all Collections, but it also depends on the mono 
> threaded state change queue consumption,
>  * The code supporting Overseer in SolrCloud is complex (election, queue 
> management, recovery etc).
> The general idea is that there’s already a central point in the SolrCloud 
> cluster and it’s Zookeeper. It might not be necessary to have a second 
> central point (Overseer) because nodes can interact directly with Zookeeper 
> and synchronize more efficiently by optimistic locking using “conditional 
> updates” (a.k.a compare and swap or CAS).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14927) Remove Overseer

2020-10-27 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221480#comment-17221480
 ] 

Gézapeti commented on SOLR-14927:
-

I think this is a good direction even if the complete removal of the Overseer 
is far or even unreachable.
I'd be happy to grab pieces of this work and help with it.

> Remove Overseer
> ---
>
> Key: SOLR-14927
> URL: https://issues.apache.org/jira/browse/SOLR-14927
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Ilan Ginzburg
>Assignee: Ilan Ginzburg
>Priority: Major
>  Labels: cluster, collection-api, overseer, solrcloud, zookeeper
>
> This Jira is intended to capture sub jiras on the path to remove the Overseer 
> component from SolrCloud and move to all nodes being able to do the work 
> currently done by Overseer.
> See detailed description in [this 
> doc|https://docs.google.com/document/d/1u4QHsIHuIxlglIW6hekYlXGNOP0HjLGVX5N6inkj6Ok/].
> Copying (edited) from the above doc:
> The motivation for removing Overseer include:
>  * Mono threaded state change is slow and doesn’t scale,
>  * Communication between cluster nodes and the Overseer use Zookeeper as a 
> queueing mechanism, this is not a good idea,
>  * Nodes talking to Overseer (then Overseer talking to itself) via Zookeeper 
> is inefficient and adds latency,
>  * Collection API scalability is poor, because not only a single node 
> processes commands for all Collections, but it also depends on the mono 
> threaded state change queue consumption,
>  * The code supporting Overseer in SolrCloud is complex (election, queue 
> management, recovery etc).
> The general idea is that there’s already a central point in the SolrCloud 
> cluster and it’s Zookeeper. It might not be necessary to have a second 
> central point (Overseer) because nodes can interact directly with Zookeeper 
> and synchronize more efficiently by optimistic locking using “conditional 
> updates” (a.k.a compare and swap or CAS).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org