[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-10-10 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199766#comment-16199766
 ] 

Cao Manh Dat commented on SOLR-10265:
-

I linked two child issues to this one: SOLR-11443 and SOLR-11465. I estimate 
that if we finish both of them, the Overseer can process messages about 15x to 
20x faster. So with 1.6M messages, we could process all of them in 100 minutes 
(or less), which I think is enough for this ticket.

> Overseer can become the bottleneck in very large clusters
> -
>
> Key: SOLR-10265
> URL: https://issues.apache.org/jira/browse/SOLR-10265
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Cao Manh Dat
>
> Let's say we have a large cluster. Some numbers:
> - To ingest the data at the volume we want, I need roughly a 600-shard 
> collection.
> - Index into the collection for 1 hour and then create a new collection.
> - For a 30-day retention window, with these numbers we would end up with 
> ~400k cores in the cluster.
> - Just a rolling restart of this cluster can take hours because the overseer 
> queue gets backed up. If a few nodes lose connectivity to ZooKeeper, we can 
> also end up with lots of messages in the Overseer queue.
> With some tests, here are the two high-level problems we have identified:
> 1> How fast can the overseer process operations:
> The rate at which the overseer processes events is too slow at this scale. 
> I ran {{OverseerTest#testPerformance}}, which creates 10 collections (1 shard, 
> 1 replica) and generates 20k state change events. The test took 119 seconds 
> to run on my machine, which means ~170 events a second. Let's say a server 
> can process 5x what my machine can, so 1k events a second. 
> Total events generated by a 400k-replica cluster = 400k * 4 (state changes 
> until a replica becomes active) = 1.6M; at 1k events a second that will be 
> 1600 minutes.
> The second observation was that the rate at which the overseer can process 
> events slows down as the number of items in the queue gets larger.
> I ran the same {{OverseerTest#testPerformance}} but changed the number of 
> events generated to 2000 instead. The test took only 5 seconds to run, so it 
> was a lot faster than the run that generated 20k events.
> 2> State changes overwhelming ZK:
> For every state change Solr writes out a big state.json to ZooKeeper. This 
> can lead to the ZooKeeper transaction logs growing out of control even with 
> auto purging etc. set.
> I haven't debugged why the transaction logs ran into terabytes without 
> taking snapshots, but this was my assumption based on the other problems we 
> observed.






[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-10-10 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199169#comment-16199169
 ] 

Shawn Heisey commented on SOLR-10265:
-

No idea how this got assigned to me.  I probably was trying to type in another 
window while Jira had focus.




[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-10-06 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194970#comment-16194970
 ] 

Shawn Heisey commented on SOLR-10265:
-

Guava has something really cool that I think we could use for a multi-threaded 
overseer, and it is available in the 14.0.1 version that Solr currently 
includes:

https://google.github.io/guava/releases/14.0/api/docs/com/google/common/util/concurrent/Striped.html

If we use the name of the collection as the stripe input, then we will have 
individual locks for each collection, so we can guarantee order of operations.  
If there are any Overseer operations that are cluster-wide and not applicable 
to one collection, we probably need to come up with a special name for those.

NB: It's possible that different collection names might hash to the same stripe 
and therefore block each other, but that would hopefully be a rare occurrence.

Probably what should happen is that each thread would grab an operation from 
the queue, and attempt to acquire the lock for the collection mentioned in that 
operation.  If a ton of operations happen that all apply to one collection, 
then it would still act like the current single-thread implementation.  One 
thing that I do not know is whether the Java "Lock" implementation (which is 
what is obtained from the Striped class) guarantees what order the threads will 
be granted the lock.  The hope is of course that locks will be granted in the 
order they are requested when several threads are all attempting to use the 
same lock.
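
For illustration, here is a minimal, hypothetical sketch of the striping idea 
(the class name, the stripe count of 256, and the special {{__cluster__}} key 
for cluster-wide operations are all assumptions, not actual Overseer code):

{code:java}
import com.google.common.util.concurrent.Striped;
import java.util.concurrent.locks.Lock;

public class StripedCollectionLocks {
  // Cluster-wide operations get a reserved key since they have no collection name.
  private static final String CLUSTER_WIDE_KEY = "__cluster__";

  // Fixed number of stripes (an assumption); distinct collection names may
  // occasionally hash to the same stripe and block each other.
  private final Striped<Lock> stripes = Striped.lock(256);

  public void withCollectionLock(String collection, Runnable op) {
    String key = (collection != null) ? collection : CLUSTER_WIDE_KEY;
    Lock lock = stripes.get(key);  // the same name always maps to the same Lock
    lock.lock();
    try {
      op.run();  // apply the queued state-change operation under the lock
    } finally {
      lock.unlock();
    }
  }
}
{code}

IIRC {{Striped.lock(int)}} hands out non-fair {{ReentrantLock}} instances, so 
the order in which waiting threads are granted the same lock is not guaranteed, 
which matches the concern above.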




[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-10-06 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194319#comment-16194319
 ] 

Cao Manh Dat commented on SOLR-10265:
-

Maybe the problem here is that the Overseer processes all messages in a single 
thread (with a lot of blocking I/O every time we peek and poll messages and 
write the new cluster state)? So a powerful machine dedicated to the Overseer 
would largely be wasted.
The idea here is that each collection has its own {{states.json}}, so messages 
for different collections can be processed and updated in parallel. It can be 
tricky to implement, but if we want a cluster with 400k cores, we cannot just 
use a single thread to process 1.6M messages.
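
A rough sketch of that idea (purely illustrative; the class and method names 
are made up, this is not the actual patch for SOLR-11443 or SOLR-11465, and a 
real implementation would also need to cap the number of threads):

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PerCollectionDispatcher {
  // One single-threaded worker per collection keeps per-collection ordering,
  // while messages for different collections (each with its own state file)
  // are processed and written in parallel.
  private final ConcurrentMap<String, ExecutorService> workers = new ConcurrentHashMap<>();

  public void dispatch(String collection, Runnable applyMessage) {
    workers.computeIfAbsent(collection, c -> Executors.newSingleThreadExecutor())
           .execute(applyMessage);
  }
}
{code}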




[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-07-30 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106729#comment-16106729
 ] 

Noble Paul commented on SOLR-10265:
---

bq. Would a watcher have to be set for every shard of every collection hosted 
on a particular Solr if we split it up more granularly?

Yes. I don't think it is a huge problem.

Let's do a back-of-the-envelope calculation:

Let's split our 100B-doc index into 1000 shards (100M docs/shard) with 5 
replicas each. That is just 5000 cores, which leads to 5000 watchers if no 
node has multiple replicas of the same shard. IIRC ZooKeeper really has no 
issues with 5000 watchers.




[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-07-29 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106167#comment-16106167
 ] 

Erick Erickson commented on SOLR-10265:
---

Hmm, it seems to me that another problem with splitting any finer than 
collection is watchers.  Currently IIUC a watcher is set on the collection's 
state.json for every collection that is hosted on a particular Solr instance. 
Would a watcher have to be set for every shard of every collection hosted on a 
particular Solr if we split it up more granularly?

Or has this all changed and I'm behind the times?

Or, more radically, is ZooKeeper the right place to keep this info? Not that I 
have any idea what to do to replace it, just throwin' it out there. When 
SolrCloud started we were talking about making 5 billion doc collections 
manageable. The current model has gotten us this far, which is great; I mean, 
we're talking 100B-document (and larger) collections currently being supported 
out there. Are there other models that would better suit the next 100x 
expansion?




[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-07-29 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106038#comment-16106038
 ] 

Noble Paul commented on SOLR-10265:
---

bq. If for some reason a per-replica znode is unworkable, would a per-shard 
znode possibly work?

A per-replica znode is useless; per-shard is the finest granularity we can get.

However, the problem is not how big the state.json is. The real problem is the 
frequency of updates because of replica state changes. That is why I suggested 
that we should split out the state from the rest of the data (which rarely 
changes). 




[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-07-28 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106015#comment-16106015
 ] 

David Smiley commented on SOLR-10265:
-

I haven't participated in these overseer-related conversations yet, so pardon 
my display of ignorance. What if the collection's state.json were further 
split so that each replica had its own znode file? Then, when a replica comes 
up or goes down, it could simply write directly to ZK; no overseer needed.

If for some reason a per-replica znode is unworkable, would a per-shard znode 
possibly work? If so, then the shard leader could take care of managing the 
state of its replicas. Replicas would then contact their shard leader (not the 
Overseer) to notify it of a change in state, and the leader would write state 
changes directly to ZooKeeper.
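
As a rough sketch of the per-shard variant (the znode path layout here is 
hypothetical, not Solr's actual layout, and real code would need error 
handling and probably optimistic concurrency via znode versions):

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ShardStateWriter {
  private final ZooKeeper zk;

  public ShardStateWriter(ZooKeeper zk) {
    this.zk = zk;
  }

  // The shard leader publishes its shard's state directly, bypassing the Overseer.
  public void publishShardState(String collection, String shard, byte[] stateJson)
      throws KeeperException, InterruptedException {
    String path = "/collections/" + collection + "/state/" + shard;  // assumed layout
    if (zk.exists(path, false) == null) {
      zk.create(path, stateJson, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } else {
      zk.setData(path, stateJson, -1);  // -1 = ignore version (no optimistic locking here)
    }
  }
}
{code}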




[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-04-10 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963527#comment-15963527
 ] 

Mark Miller commented on SOLR-10265:


There are perhaps other wins we could get in the short term as well. For 
example, if we could tell a cluster stop and start apart from a simple Overseer 
failover, we could avoid the common waste where we fire tons of state updates 
due to shutdown or some other reason, and then, when we start the cluster back 
up, we first have to run through all those useless state updates in the queue. 




[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-03-19 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15932032#comment-15932032
 ] 

Noble Paul commented on SOLR-10265:
---

Yeah, the overseer messages and collection action messages are already handled 
separately in two different queues. If we can manage the overseer messages 
without sending them to ZK, it's a huuuge win, because that's the bulk of the 
operations.




[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-03-19 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931754#comment-15931754
 ] 

Erick Erickson commented on SOLR-10265:
---

Hmmm, could we take a step in that direction now? Off the top of my head on 
Sunday morning before coffee has really kicked in, so maybe nonsense.

Anyway, it seems like we have two sets of operations here, so what about 
splitting them out? The state-change messages go into memory, and the 
collections-level calls like "create collection", "addreplica" and the like 
continue to use the Overseer queue. It seems like that would allow us to build 
robustness into the collection-level ops "sometime" and have them nicely 
isolated.
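
A minimal sketch of that split, assuming a hypothetical {{OverseerMessage}} 
type and a ZK-backed queue interface (neither is an existing Solr class):

{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class OverseerMessageRouter {

  interface OverseerMessage { boolean isCollectionApiOp(); }

  interface DistributedQueue { void offer(OverseerMessage m); }  // assumed ZK-backed queue

  private final Queue<OverseerMessage> stateChangeQueue = new ConcurrentLinkedQueue<>();
  private final DistributedQueue collectionApiQueue;

  public OverseerMessageRouter(DistributedQueue collectionApiQueue) {
    this.collectionApiQueue = collectionApiQueue;
  }

  public void submit(OverseerMessage m) {
    if (m.isCollectionApiOp()) {
      collectionApiQueue.offer(m);   // durable path: create collection, addreplica, ...
    } else {
      stateChangeQueue.offer(m);     // high-volume replica state changes stay in memory
    }
  }
}
{code}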




[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-03-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931560#comment-15931560
 ] 

Mark Miller commented on SOLR-10265:


The main reason a ZK queue was used instead of an in-memory one was that there 
was a grand idea that eventually you could create a collection by simply 
putting the command on the queue and calling it successful, even if things 
crashed. The Overseer would have the create captured and work towards its 
parameters over time.

That is still ideal vs. the current system. It's never been done though, and 
it could largely be accomplished without the ZK queue.




[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-03-18 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931544#comment-15931544
 ] 

Erick Erickson commented on SOLR-10265:
---

bq: My suspicion is that it's problematic because of the number of messages in 
a typical queue

Well, that's certainly true, in this case for the overseer work queue, for at 
least two reasons: 1> the current method of dealing with the queue doesn't 
batch, and 2> the state changes get written to the logs, which get _huge_. Part 
of what we're exploring is being smarter about that too.

bq: I can envision types that indicate "node X just came up" and "node X just 
went down"

"Just went down" is being considered, but "just came up" is a different 
problem. Consider a single JVM hosting 400 replicas from 100 different 
collections each of which has 4 shards. "Node just came up" could (I think) 
move them all from "down" to "recovering" in one go, but all the leader 
election stuff still needs to be hashed out. Also consider if a leader and 
follower are on the same node. How do they coordinate the leader coming up and 
informing the follower of that fact without sending individual messages?

bq: When a collection is created, is there REALLY a need to send state changes 
for EVERY collection in the cloud? 

IIUC this just shouldn't be happening. If it is we need to fix it. I've 
certainly seen some allegations that this is so.

bq: If ZK isn't used for the overseer queue, 

Well, that's part of the proposal. The in-memory Overseer queue is OK 
(according to Noble's latest comment) if each client retries submitting the 
pending operations that aren't ack'd. So no alternate queueing mechanism is 
necessary.

What's really happening here IMO is the evolution of SolrCloud. So far we've 
gotten away with inefficiencies because we haven't really had huge clusters to 
deal with; now we're getting them. So we need to squeeze what efficiencies we 
can out of the current architecture while considering whether we really need 
ZK as the intermediary. Of course cutting down the number of messages would 
help; the fear is that a great deal of work has gone into hardening all the 
wonky conditions, so I do worry a bit about wholesale restructuring. Ya gotta 
bite the bullet sometime though.





[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-03-18 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931532#comment-15931532
 ] 

Shawn Heisey commented on SOLR-10265:
-

bq. We should seriously reconsider the idea of using ZK queue for the overseer 
messages. 

Serious question, not being confrontational: what are the characteristics of 
ZooKeeper which make it a bad choice for the overseer queue?  My suspicion is 
that it's problematic because of the number of messages in a typical queue, not 
because of technology issues.  If a way can be found to accomplish the overseer 
job with a very small number of messages instead of a very large number, would 
ZK be suitable?

Can't we create more queue entry types that can accomplish with a single entry 
what currently can take hundreds or thousands of entries (depending on the 
number of collections)?  I can envision types that indicate "node X just came 
up" and "node X just went down" ... the overseer should be able to use the 
cluster information in ZK to decide what state changes are required, rather 
than processing an individual message for each state change.  Other types that 
I haven't thought of would probably be needed.  Another type might specify 
state changes that affect multiple collections or shard replicas with a single 
entry.

When a collection is created, is there REALLY a need to send state changes for 
EVERY collection in the cloud?  That seems completely unnecessary to me.  I 
have not attempted other actions like creating a new replica, but I would not 
be surprised to learn that they also send many unnecessary state changes.

If ZK isn't used for the overseer queue, then some other method must be found 
to make sure the queue is not lost when nodes go down.  An in-memory queue 
would need to be somehow replicated to at least one other node in the cluster.  
Doing that well probably means finding another dependency where somebody else 
has worked out the bugs.
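
To illustrate the kind of coarse-grained queue entry being suggested (the type 
names and fields are hypothetical, not an existing Solr API):

{code:java}
public class CoarseOverseerMessage {
  public enum Type {
    NODE_UP,    // single entry instead of one message per replica hosted on the node
    NODE_DOWN,
    BULK_STATE  // a single entry describing state changes across many collections/shards
  }

  public final Type type;
  public final String nodeName;  // which node the entry refers to

  public CoarseOverseerMessage(Type type, String nodeName) {
    this.type = type;
    this.nodeName = nodeName;
  }

  // On NODE_DOWN, the overseer would consult the cluster information already in
  // ZK to find every replica hosted on that node and mark them down itself,
  // instead of requiring a separate queued message per replica.
}
{code}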





[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-03-18 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931501#comment-15931501
 ] 

Noble Paul commented on SOLR-10265:
---

The order of events between two different clients doesn't matter.




[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-03-18 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931500#comment-15931500
 ] 

Erick Erickson commented on SOLR-10265:
---

bq: The client should take care of the ordering. 

The case I was wondering about was two different clients. Or doesn't it matter 
as long as each client preserves the ordering of all requests that originate on 
that client only?






[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-03-18 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931437#comment-15931437
 ] 

Noble Paul commented on SOLR-10265:
---

bq. Depending on the timing of the retry logic, the clients could well submit 
them in reverse order.

The client should take care of the ordering: if the message with id {{n}} is 
to be retried, then all messages with {{id > n}} must be retried too.

bq. Another thought I had was how often we actually need to write the state 
out. Imagine 100 updates for the same state.json file. Is there any good 
reason those updates couldn't all be written at once?

Yeah, that is an easy optimization. I was working on a patch for the same.
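
A small sketch of that client-side retry rule (the {{submit}} call is a 
placeholder for however messages would actually be handed to the Overseer):

{code:java}
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class OrderedRetryBuffer {
  // Pending (unacknowledged) messages, kept sorted by id.
  private final ConcurrentNavigableMap<Long, byte[]> pending = new ConcurrentSkipListMap<>();

  public void send(long id, byte[] message) {
    pending.put(id, message);
    submit(id, message);
  }

  // Called when message `failedId` was not acknowledged: to preserve ordering,
  // every pending message with an id >= failedId is resubmitted in id order.
  public void retryFrom(long failedId) {
    pending.tailMap(failedId, true).forEach(this::submit);
  }

  public void acknowledged(long id) {
    pending.remove(id);
  }

  private void submit(long id, byte[] message) {
    // placeholder: hand the message to the (possibly in-memory) Overseer queue
  }
}
{code}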






[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-03-18 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931336#comment-15931336
 ] 

Erick Erickson commented on SOLR-10265:
---

[~noble.paul] Interesting. You're essentially replacing the overseer zknodes 
with an in-memory queue, right?

I do wonder about sequencing. The order of operations could get switched, 
although I'm not sure it matters if there's some sort of optimistic concurrency 
provision (or maybe even if not). Imagine request1 and request2 in that order, 
and then the overseer dies. Depending on the timing of the retry logic, the 
clients could well submit them in reverse order.

Another thought I had was how often we actually need to write the state out. 
Imagine 100 updates for the same state.json file. Is there any good reason 
those updates couldn't all be written at once? The algorithm would look 
something like this (a rough code sketch follows the steps):

# get N messages from the queue
# apply all the changes
# write out all affected nodes to ZK
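
Purely as illustration, a minimal sketch of that batching idea (not Solr's
actual code; the {{StateChange}} record and {{writeStateJson}} method are
made-up stand-ins for the real queue message and the ZK write):

{code}
import java.util.*;

class BatchedStateWriter {
    // Hypothetical message: one replica state change for one collection.
    record StateChange(String collection, String replica, String state) {}

    // Drain up to maxBatch queued messages, fold them per collection, and
    // write each affected collection's state once instead of once per message.
    static void flushBatch(Deque<StateChange> queue, int maxBatch) {
        Map<String, Map<String, String>> perCollection = new HashMap<>();
        for (int i = 0; i < maxBatch && !queue.isEmpty(); i++) {
            StateChange c = queue.poll();
            perCollection
                .computeIfAbsent(c.collection(), k -> new HashMap<>())
                .put(c.replica(), c.state()); // a later change overwrites an earlier one
        }
        // One write per affected collection; intermediate states are collapsed.
        perCollection.forEach(BatchedStateWriter::writeStateJson);
    }

    // Stand-in for the real ZK update of state.json.
    static void writeStateJson(String collection, Map<String, String> replicaStates) {
        System.out.println("write " + collection + " -> " + replicaStates);
    }

    public static void main(String[] args) {
        Deque<StateChange> q = new ArrayDeque<>(List.of(
            new StateChange("c1", "replica1", "down"),
            new StateChange("c1", "replica1", "recovering"),
            new StateChange("c1", "replica1", "active"), // only this one is written
            new StateChange("c2", "replica7", "active")));
        flushBatch(q, 100); // two writes instead of four
    }
}
{code}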

Hmmm, in that case, though, any watchers expecting an intermediate transition 
state wouldn't see it. Imagine a node goes straight from "down" to "active"; 
that is, all the intermediate state changes are collapsed into a single write. 
Would that matter? In the situation we're talking about here this would 
certainly cut down a lot of chatter.

Musings on Saturday morning.




[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-03-17 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931070#comment-15931070
 ] 

Noble Paul commented on SOLR-10265:
---

We should seriously reconsider the idea of using a ZK queue for the overseer 
messages.

There is no harm in maintaining an in-memory queue at the overseer and posting 
messages directly to it. There is a risk of the Overseer dying or getting moved 
in between; this can be alleviated by making the client smarter (a rough 
client-side sketch follows the steps below).

# The client posts a message to the Overseer.
# The Overseer puts the message into an internal queue and returns an id for the message.
# The Overseer processes the messages one after another.
# The client polls the Overseer to check the status; if the message was successfully processed, the call returns.
# If the Overseer fails in between, go back to step 1.
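
As a purely hypothetical illustration of the client side of those steps, with
an invented {{OverseerClient}} interface (nothing like this exists in Solr
today):

{code}
import java.util.concurrent.TimeUnit;

class OverseerSubmitExample {
    // Invented client-side API for the protocol sketched above.
    interface OverseerClient {
        String submit(byte[] message) throws Exception;   // steps 1-2: post, get back an id
        String status(String messageId) throws Exception; // step 4: "PENDING", "DONE" or "UNKNOWN"
    }

    // Submit a message and wait until the overseer has processed it,
    // resubmitting from scratch if the overseer died and lost its
    // in-memory queue (step 5).
    static void submitWithRetry(OverseerClient client, byte[] message) throws Exception {
        while (true) {
            String id = client.submit(message);
            while (true) {
                String s = client.status(id);
                if ("DONE".equals(s)) return;     // processed; we are finished
                if ("UNKNOWN".equals(s)) break;   // overseer restarted; go back to step 1
                TimeUnit.MILLISECONDS.sleep(200); // still pending; poll again
            }
        }
    }
}
{code}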


[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-03-11 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906381#comment-15906381
 ] 

Noble Paul commented on SOLR-10265:
---

bq.I haven't debugged why the transaction logs ran into terabytes without 
taking into snapshots but this was my assumption based on the other problems we 
observed

The current design is suboptimal. Just to change one small bit of information, 
we write ~300 bytes per replica; for 600 replicas we write ~180KB of data (add 
the fact that 600 nodes have to read and parse that data for every state change 
operation). Every write is appended to the ZK transaction log. We must split 
state.json to scale any further.
If we write out the state separately in a format like the following:
{code}
{
  "replica1": 1,
  "replica2": 0
}
{code}

we can bring the data size down to around 10 bytes per replica. This means 600 
replicas will need only ~6KB of data per update.
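
To make the orders of magnitude concrete, here is a tiny back-of-the-envelope
sketch (it simply takes the ~300-byte figure above as given; exact sizes depend
on the key names used):

{code}
class StateSizeEstimate {
    public static void main(String[] args) {
        int replicas = 600;
        int bytesPerReplicaToday = 300; // figure quoted in the comment above

        // Compact per-replica format from the comment: {"replicaN": 0|1, ...}
        StringBuilder compact = new StringBuilder("{");
        for (int i = 1; i <= replicas; i++) {
            if (i > 1) compact.append(',');
            compact.append('"').append("replica").append(i).append("\":").append(1);
        }
        compact.append('}');

        System.out.println("today   : ~" + (replicas * bytesPerReplicaToday) / 1024 + " KB per state change");
        System.out.println("compact : ~" + compact.length() / 1024 + " KB per state change");
    }
}
{code}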


[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-03-11 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906245#comment-15906245
 ] 

Shawn Heisey commented on SOLR-10265:
-

Side comment; this would need its own issue if we decide to pursue it.

Only about 6 entries can fit in the overseer queue before retrieving the 
queue exceeds the default jute.maxbuffer and ZOOKEEPER-1162 becomes a problem.

Some calculation revealed that if we use a nested node structure where we 
create a "directory" node in the overseer queue and then put events inside 
those directories, limiting the number of directories to 32768 and the number 
of entries in each directory to 32768 (32768 * 32768 is roughly a billion), we 
can handle a billion entries without exceeding the 1MB maxbuffer.  Figuring out 
how to handle rollover in the directory names would be necessary; a rough 
sketch of mapping a sequence number onto such a layout follows.
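
A rough, illustrative sketch of one way a monotonically increasing sequence id
could be mapped onto such a two-level layout (the path names are invented, and
it sidesteps the rollover question by simply wrapping around):

{code}
class NestedQueuePath {
    static final int FANOUT = 32768; // 32768 directories x 32768 entries ~= 1 billion ids

    // Map a sequence number to a two-level path so that no single znode
    // ever has more than FANOUT children.
    static String pathFor(long seq) {
        long dir = (seq / FANOUT) % FANOUT; // naive wrap-around instead of real rollover handling
        long entry = seq % FANOUT;
        return String.format("/overseer/queue/dir-%05d/qn-%05d", dir, entry);
    }

    public static void main(String[] args) {
        System.out.println(pathFor(0));              // /overseer/queue/dir-00000/qn-00000
        System.out.println(pathFor(32768));          // /overseer/queue/dir-00001/qn-00000
        System.out.println(pathFor(1_073_741_823L)); // last id before wrapping around
    }
}
{code}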

Of course it would likely take weeks or months to actually process a billion 
entries, but it does need to be able to handle more than 6 without breaking 
down.



[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters

2017-03-11 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906241#comment-15906241
 ] 

Erick Erickson commented on SOLR-10265:
---

Trying to think a little OOB here:

> Is there any way to split one or more collections across multiple ZK 
> ensembles? My instant response is yuck, that'd be major surgery and perhaps 
> practically impossible but I thought I'd toss it out there.

> Can we reduce the number of messages the Overseer has to process, especially 
> when nodes are coming up?

> Does all that stuff have to be written to the tlog so often (I'd guess yes, 
> just askin').

> Does all this stuff actually have to flow through ZK?

> Could we "somehow" roll up/batch changes? I.e. go through the queue, assemble 
> all of the changes that affect a particular znode and write _one_ state 
> change (or do we already)? 

And I don't know that code at all, so much of this may be nonsense or already 
done. Think of it as brainstorming.
