[jira] [Comment Edited] (SOLR-5872) Eliminate overseer queue

2017-03-09 Thread albert vico oton (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890021#comment-15890021
 ] 

albert vico oton edited comment on SOLR-5872 at 3/9/17 11:51 AM:
-

Hello, we are currently trying to deploy around 200 collections and SolrCloud 
can't handle it: it dies from the propagation of update_status messages every 
time we try to add a new collection. Each collection has 3 replicas, and sizes 
are not very large. Also, I do not see why collection A should be aware of 
collection B's state.

But getting back to the topic: the overseer node dies because it cannot handle 
the stress from the flood of messages. IMHO we have a single point of failure 
here in a distributed system, which is not recommended.

Since the current behavior would still be useful for big fat shards, my 
suggestion would be to make it optional, so people like us, who need a more 
distributed approach, can make use of SolrCloud. Right now that is impossible. 
And I'm not talking about "thousands" of collections; with as few as 100 we 
are already seeing very bad performance.




was (Author: alvico):
Hello, we are currently trying to do a deploy of around 200 collections and 
solrcloud can't handle it, it just  dies due update_status messages propagation 
everytime we try to add a new collection, each collection has 3 replicas, and 
sizes are not very large. Also, I do not see why collection A should be aware 
of collection B state.  

But moving to the topic, overseer node dies since he can not handle all the 
stress due the flooding of messages. IMHO we have here a single point of 
failure in a distributed system, which is not very recommended. 

since it would be useful for big fat shards, my suggestion would be to make 
this optional behavior, so people like use who need to have a more distributed 
approach can make use of solrcloud. Since right now it is impossible to. and 
I'm not talking about "thousands" of collections actually with as few as 100 we 
are seeing very bad performance.



> Eliminate overseer queue 
> -
>
> Key: SOLR-5872
> URL: https://issues.apache.org/jira/browse/SOLR-5872
> Project: Solr
> Issue Type: Improvement
> Components: SolrCloud
> Reporter: Noble Paul
> Assignee: Noble Paul
>
> The overseer queue is one of the busiest points in the entire system. The 
> raison d'être of the queue is to
>  * Provide batching of operations for the main clusterstate.json so that 
> state updates are minimized 
>  * Avoid race conditions and ensure order
> Now, as we move the individual collection states out of the main 
> clusterstate.json, the batching is not useful anymore.
> Race conditions can easily be solved by using a compare and set in ZooKeeper. 
> The proposed solution is: whenever an operation needs to be performed on the 
> clusterstate, the same thread (and of course the same JVM) should
>  # read the fresh state and version of the zk node 
>  # construct the new state 
>  # perform a compare and set
>  # if the compare and set fails, go to step 1
> This should be limited to operations performed on external collections, 
> because batching would still be required for the others.
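
The four steps in the proposal above can be sketched as an optimistic retry 
loop. This is only an illustration, not Solr or ZooKeeper code: VersionedNode 
is a hypothetical in-memory stand-in for a znode whose write succeeds only when 
the caller passes the version it last read, and update_state, mark_active and 
the "core0".."core7" names are made up for the example.

```python
import threading


class BadVersionError(Exception):
    """Raised when the expected version does not match the node's version."""


class VersionedNode:
    """In-memory stand-in for a znode (hypothetical, for illustration only)."""

    def __init__(self, data):
        self._lock = threading.Lock()
        self._data = data
        self._version = 0

    def get(self):
        with self._lock:
            return self._data, self._version

    def set(self, data, expected_version):
        # Compare and set: write only if nobody has written since our read.
        with self._lock:
            if expected_version != self._version:
                raise BadVersionError(expected_version)
            self._data = data
            self._version += 1
            return self._version


def update_state(node, build_new_state):
    """Apply build_new_state with optimistic retries; return the attempt count."""
    attempts = 0
    while True:
        attempts += 1
        state, version = node.get()         # step 1: read fresh state and version
        new_state = build_new_state(state)  # step 2: construct the new state
        try:
            node.set(new_state, version)    # step 3: compare and set
            return attempts
        except BadVersionError:
            continue                        # step 4: lost the race, go to step 1


def mark_active(replica):
    """Build an update function that marks one replica as active."""
    def build(state):
        new_state = dict(state)  # never mutate the state that was read
        new_state[replica] = "active"
        return new_state
    return build


node = VersionedNode({})
threads = [
    threading.Thread(target=update_state, args=(node, mark_active(f"core{i}")))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_state, final_version = node.get()
print(len(final_state), final_version)
```

Because a write is rejected whenever the version has moved, no update is lost 
even though the eight writers run concurrently; the cost is that a writer that 
loses the race has to re-read and try again.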



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-5872) Eliminate overseer queue

2015-08-15 Thread Ramkumar Aiyengar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698435#comment-14698435
 ] 

Ramkumar Aiyengar edited comment on SOLR-5872 at 8/15/15 8:29 PM:
--

Though I haven't done serious experiments on this yet, I see the lack of 
batching in stateFormat=2 as a potential blocker to its adoption. We need some 
benchmarks on a single collection with lots of cores (at least 1000) to see 
how it works with stateFormat=1, stateFormat=2, and this new approach. Keep in 
mind that hundreds of cores might change state at the same time; that's the 
real benefit of batching. I fear that without a batching approach, the system 
might choke due to the contention at that point.

My point here is that stateFormat=2 not doing batching isn't a convincing 
enough argument to eliminate the overseer queue; maybe the effort should be 
directed more towards getting batching for stateFormat=2 if that's more useful.
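
To make the contention concern concrete, here is a rough back-of-the-envelope 
model, not a benchmark. It assumes the worst case for a compare-and-set 
approach: all pending writers read the same version and retry in lockstep, so 
only one write lands per round. The function names and the batch size are 
hypothetical; real numbers would depend on timing and would need the 
measurements asked for above.

```python
def cas_attempts_worst_case(writers):
    """Count compare-and-set attempts when all pending writers retry in lockstep.

    Each round, every pending writer reads the same version and attempts a
    write; exactly one succeeds and the rest must re-read and retry.
    """
    attempts = 0
    pending = writers
    while pending > 0:
        attempts += pending  # everyone still pending tries once this round
        pending -= 1         # only one write can match the current version
    return attempts


def batched_writes(updates, batch_size):
    """Number of state writes if a single writer applies updates in batches."""
    return -(-updates // batch_size)  # ceiling division


for n in (10, 100, 1000):
    # Assumption: the batching writer can fold all n updates into one write.
    print(n, cas_attempts_worst_case(n), batched_writes(n, batch_size=n))
```

Under this model the number of attempts grows as n(n+1)/2 for n simultaneous 
state changes, while a single batching writer issues one write per batch.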


was (Author: andyetitmoves):
Though I haven't done serious experiments on this as yet, I see the lack of 
batching in stateFormat=2 is a potential blocker to it's adoption. We need some 
benchmarks on a single collection with lots of cores (at least 1000), and see 
how it works with stateFormat=1, stateFormat=2, and this new approach. Keep in 
mind that hundreds of cores might change state at the same time, that's the 
real benefit to batching. I fear that without a batching approach, the system 
might choke due to the contention at that point.




[jira] [Comment Edited] (SOLR-5872) Eliminate overseer queue

2014-03-18 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939422#comment-13939422
 ] 

Noble Paul edited comment on SOLR-5872 at 3/18/14 4:04 PM:
---

bq.That is also how I first implemented the clusterstate

Can you shed some light on what the ZK schema was for your initial impl? If all 
nodes of a given slice are under one zk directory, one watch on the parent 
should be fine, right?


was (Author: noble.paul):
bq.That is also how I first implemented the clusterstate

Can you throw some light on how was the ZK schema for your initial impl? If all 
nodes of a given slice is in one watch on the parent should be fine, right?




[jira] [Comment Edited] (SOLR-5872) Eliminate overseer queue

2014-03-17 Thread Jessica Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938670#comment-13938670
 ] 

Jessica Cheng edited comment on SOLR-5872 at 3/18/14 1:44 AM:
--

{quote}For further discussion around the change, there should be background if 
you search the archives.{quote}
If you wouldn't mind terribly, will you please paste the link of a few relevant 
threads in the archive? (Sorry, I'm not familiar with all the keywords and 
archives, etc., yet.)

{quote}There is a strong argument to be made that we should first investigate 
the performance issues with the current strategy. ZooKeeper is pretty fast - 
these state updates are tiny and batched. It seems like we should be able to do 
a lot better without throwing out code that has been getting hardened for a 
long time now.{quote}
I see where your hesitation is now, and I can definitely agree. Sounds like 
there are a few points to be investigated for the current system before we 
attempt to change anything:

- Why is the Overseer so slow at updating cluster state? What's causing the 
build-up of queue messages during a restart?
- What can we do to generally solve the problem of the Overseer being killed on 
every instance restart in a rolling bounce?
- How much is actually batched?

My gut is that for external collections, batching won't be of that much benefit 
(except for that super-large collection case that Yoink mentioned), but I agree 
that if the current system can be hardened to work even for those, then the 
simplicity of one code path should be preferred over ultra-optimizing for a 
non-issue (assuming the first two points above can be fixed).


was (Author: mewmewball):
{quote}For further discussion around the change, there should be background if 
you search the archives.{quote}
If you wouldn't mind terribly, will you please paste the link of a few relevant 
threads in the archive? (Sorry, I'm not familiar with all the keywords and 
archives, etc., yet.)

{quote}There is a strong argument to be made that we should first investigate 
the performance issues with the current strategy. ZooKeeper is pretty fast - 
these state updates are tiny and batched. It seems like we should be able to do 
a lot better without throwing out code that has been getting hardened for a 
long time now.{quote}
I see where your hesitation is now, and I can definitely agree. Sounds like 
there are a few points to be investigated for the current system before we 
attempt to change anything:

- Why is the Overseer's so slow at updating cluster state/ What's causing the 
build-up of queue messages during a restart?
- What can we do to generally solve the problem of the Overseer being killed on 
every instance restart in a rolling bounce?
- How much is actually batched?

My gut is that for external collections, batching won't be of that much benefit 
(except for that super-large collection case that Yoink mentioned), but I agree 
that if the current system can be hardened to work even for those, then the 
simplicity of one code path should be preferred over ultra-optimizing for a 
non-issue (assuming the first two points above can be fixed).




[jira] [Comment Edited] (SOLR-5872) Eliminate overseer queue

2014-03-17 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938796#comment-13938796
 ] 

Noble Paul edited comment on SOLR-5872 at 3/18/14 4:28 AM:
---

bq. I think if we decide to split out the clusterstate.json per collection, 
that is the direction we should take

Yes, that is the plan.

We would probably switch to that from 5.0 or so. But the challenge is to offer 
a smoother migration path. Until then we need a name to differentiate the two 
modes.
 * Initially, users would be able to switch to that mode when creating a 
collection (an opt-in); SOLR-5473 does that
 * Offer an API to migrate to the new format (SOLR-5756)
 * Make it the default format (from, say, 5.0)
 * Deprecate the old format




was (Author: noble.paul):
bq. I think if we decide to split out the clusterstate.json per collection, 
that is the direction we should take

Yes, that is the plan

we would probably switch to that from 5.0 or something. But the challenge is to 
offer a smother migration path. 
 * initially , users would be able to switch to that mode when creating a 
collection (an opt In) SOLR-5473 does that
*  offer an API to migrate to the new format  SOLR-5756
*  Make it the default format (from say 5.0)
*  deprecate the old format


