[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2015-02-16 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323231#comment-14323231
 ] 

Josh Elser commented on ACCUMULO-1454:
--

Yes, you are correct. Thanks for commenting. We'll need to do some 
brainstorming as to what we'll need to add to the RPC API to support it -- the 
above concern will likely be pulled in there. I just wanted to make sure I wrote 
it down so it doesn't get lost.

> Need good way to perform a rolling restart of all tablet servers
> 
>
> Key: ACCUMULO-1454
> URL: https://issues.apache.org/jira/browse/ACCUMULO-1454
> Project: Accumulo
>  Issue Type: Sub-task
>  Components: tserver
>Affects Versions: 1.4.3, 1.5.0
>Reporter: Mike Drob
> Attachments: ACCUMULO-1454-proposal-01.adoc, 
> ACCUMULO-1454-proposal-01.html
>
>
> When needing to change a tserver parameter (e.g. java heap space) across the 
> entire cluster, there is not a graceful way to perform a rolling restart.
> The naive approach of just killing tservers one at a time causes a lot of 
> churn on the cluster as tablets move around and zookeeper tries to maintain 
> current state.
> Potential solutions might be via a fancy fate operation, with coordination by 
> the master. Ideally, the master would know which servers are 'safe' to 
> restart and could minimize overall impact during the operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2015-02-14 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321688#comment-14321688
 ] 

Sean Busbey commented on ACCUMULO-1454:
---

That sounds like a problem for the parent issue, but not for this one. If we're 
doing a rolling restart generally (that is, not during an upgrade), then we 
needn't worry about version conflicts.

Maybe worth another child on the parent issue for an RPC to master about 
maximum running version negotiation?



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2015-02-13 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320543#comment-14320543
 ] 

Josh Elser commented on ACCUMULO-1454:
--

I think we chatted about this recently: there's an issue of handling newer 
versions of RFile and WALs in the middle of a rolling restart.

1. Server1 is restarted as the new version
2. Server1 writes some new data
3. Server1 dies
4. Server2 (still old version) gets the tablets from Server1

We need to ensure that there is control to limit the new software from writing 
out new versions of persistent files while there are still old versions of the 
software participating in the instance. It's similar to finalizing an upgrade: 
after we're sure that all of the servers have been upgraded and are functioning 
well, we can flip them over to using new messages/serialization that the old 
versions aren't aware of.

This problem gets much easier once we are using Thrift/PB for serializing 
things, because both of those can naturally read newer versions of messages they 
know about, ignoring the new fields.
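
To make the control described above concrete, here is a minimal sketch in Python (not Accumulo code). It assumes each server can report the highest persistent-file format it is able to read; the function name and the version numbers are hypothetical. The idea is that writers stay on the newest format that every participating server understands, and only move up once the last old server is gone.

```python
# Hypothetical sketch: pick the newest on-disk format every live server can read.
# Version numbers and names are illustrative, not Accumulo's actual values.

def max_writable_format(supported_by_server):
    """Return the highest file-format version readable by every server.

    supported_by_server maps a server name to the highest format
    version that server's software can read.
    """
    if not supported_by_server:
        raise ValueError("no live servers")
    return min(supported_by_server.values())


# Mid-restart: Server1 is already on new software, Server2 is not.
during = {"server1": 8, "server2": 7}
# After the rolling restart completes, every server can read format 8.
after = {"server1": 8, "server2": 8}

print(max_writable_format(during))  # 7
print(max_writable_format(after))   # 8
```

With this rule, the tablets that Server2 picks up in step 4 would only reference files it can read.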



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2014-08-19 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102506#comment-14102506
 ] 

Josh Elser commented on ACCUMULO-1454:
--

bq.  I can post design doc on RB if anyone has feedback.

That'd be best, IMO.



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2014-08-19 Thread Mike Drob (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102474#comment-14102474
 ] 

Mike Drob commented on ACCUMULO-1454:
-

Is the upgrade case the same as the 'change configuration' use case?

Why load/unload instead of move?



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2014-08-12 Thread Eric Newton (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094524#comment-14094524
 ] 

Eric Newton commented on ACCUMULO-1454:
---

bq. Would that just bash ZooKeeper trying to migrate tablets off of ~200 
tservers? (e.g. 200 tservers * 200 tablets)?

No more so than start-up.

bq. Would you consider this a FATE op

Yes: it needs to be part of the metadata processing state machine.  Migrations 
are not presently a FATE op.



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2014-08-12 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094482#comment-14094482
 ] 

Josh Elser commented on ACCUMULO-1454:
--

bq. start a lot of new tserver instances

This might be more difficult than it sounds. {{start-server.sh}} is fairly easy 
to manage, but {{stop-server.sh}} is pretty aggressive about just {{kill}}'ing 
the process on a host. I think we might have to expand on the shell scripts to 
really give an admin what they want.

bq. migrate tablets  

Would you consider this a FATE op that the master coordinates, or would you 
just add something directly to the TServer and let the client handle the 
coordination? Would that just bash ZooKeeper trying to migrate tablets off of 
~200 tservers? (e.g. 200 tservers * 200 tablets)?



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2014-08-12 Thread Keith Turner (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094450#comment-14094450
 ] 

Keith Turner commented on ACCUMULO-1454:


This discussion makes me think we could offer the following primitives.   

 * disable balancing
 * migrate tablets  
 * enable balancing

This would allow an admin to do the following.

 # disable balancing
 # start a lot of new tserver instances
 # call migrate tablets in a loop
 # kill old tservers
 # enable balancing

We could offer a script to assist with this.
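
As a rough illustration of what such a script might do, here is a sketch in Python against an in-memory stand-in for the cluster. Nothing here is a real Accumulo API: the `Cluster` class, its method names, and the tablet and server names are all hypothetical, and starting or killing processes is left out.

```python
# Hypothetical in-memory model of the proposed primitives; nothing here is a
# real Accumulo API. Tablet and server names are made up.

class Cluster:
    def __init__(self, assignments):
        # assignments: tablet name -> tserver name
        self.assignments = dict(assignments)
        self.balancing = True

    def disable_balancing(self):
        self.balancing = False

    def enable_balancing(self):
        self.balancing = True

    def migrate_tablet(self, tablet, dest):
        self.assignments[tablet] = dest

    def tablets_on(self, server):
        return sorted(t for t, s in self.assignments.items() if s == server)


def rolling_restart(cluster, old_to_new):
    """Drain each old tserver onto its replacement, then mark it retired.

    old_to_new maps each old tserver to the new instance that replaces it.
    Returns the old tservers that no longer host tablets and can be killed.
    """
    cluster.disable_balancing()
    retired = []
    try:
        for old, new in old_to_new.items():
            for tablet in cluster.tablets_on(old):  # "migrate tablets in a loop"
                cluster.migrate_tablet(tablet, new)
            retired.append(old)
    finally:
        cluster.enable_balancing()  # re-enable even if a migration fails
    return retired


cluster = Cluster({"t1": "old-a", "t2": "old-a", "t3": "old-b"})
retired = rolling_restart(cluster, {"old-a": "new-a", "old-b": "new-b"})
print(retired)              # ['old-a', 'old-b']
print(cluster.assignments)  # {'t1': 'new-a', 't2': 'new-a', 't3': 'new-b'}
```

Re-enabling the balancer in a `finally` block is a design choice of the sketch, so that a failed migration does not leave balancing off.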



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2014-08-12 Thread Mike Drob (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094234#comment-14094234
 ] 

Mike Drob commented on ACCUMULO-1454:
-

Disabling the balancer while doing this (or replacing it with a balancer 
specifically for rolling upgrades) is a Good Idea.



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2014-08-12 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094210#comment-14094210
 ] 

Josh Elser commented on ACCUMULO-1454:
--

bq. If we expose a user level command to request that a tablet be moved to a 
given destination

Do we necessarily care where they move? I don't have a good feel for the 
difference in cost between moving to a "sibling" tserver on the same node as 
opposed to a tserver on a completely different node. We also might just be 
fighting the balancer if we make it seem like the user has control over where a 
tablet is hosted. HBase has such a tool, no? Can we glean anything from their 
rolling upgrade support (good and bad)?



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2014-08-12 Thread Mike Drob (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094145#comment-14094145
 ] 

Mike Drob commented on ACCUMULO-1454:
-

If we expose a user level command to request that a tablet be moved to a given 
destination, then external tools could implement their own rolling restarts. If 
any of those turn out to be really good in the general case (or even in the 
upgrade case), then we can always backport them.



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2014-08-12 Thread Keith Turner (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094134#comment-14094134
 ] 

Keith Turner commented on ACCUMULO-1454:


I was thinking about use cases.

 * Administrator has 20 of 100 nodes with screwy java memory config, wants to 
fix those nodes w/ minimal impact.
 * Administrator has 100 of 100 nodes with screwy java memory config, wants to 
fix those nodes w/ minimal impact.
 * Administrator wants to upgrade cluster from 1.7.0 to 1.7.1 w/ minimal impact

Are there any other important use cases?  The first two are covered in the 
ticket description, I split them into two because one is a subset of tservers.



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2014-08-11 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093674#comment-14093674
 ] 

Sean Busbey commented on ACCUMULO-1454:
---

[~mdrob], presumably the sidecar could be started with its own 
ACCUMULO_CONF_DIR with the port set to a different value. 

Relying exclusively on dynamic ports would preclude rolling restarts for anyone 
running in an environment that requires whitelisted network comms (e.g. those 
running under DISA STIGs). But we could just call out needing to pick a port in 
whatever docs are describing how to set up a custom conf dir and make '0' the 
common case.



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2014-08-11 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093085#comment-14093085
 ] 

Josh Elser commented on ACCUMULO-1454:
--

bq. How do you plan to deal with port conflicts?

For any current version, the config could be modified to start up the "sidecar" 
processes with a port value of '0'. I've done pretty extensive testing in 
making sure we can bring up a cluster using completely dynamic ports.
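
For anyone unfamiliar with the port '0' behaviour, here is a small generic sketch in Python (plain sockets, not Accumulo configuration). Binding to port 0 asks the operating system to choose a free port, so two processes on the same host cannot collide. The function name is made up for the example.

```python
import socket

def bind_dynamic_port(host="127.0.0.1"):
    """Bind a listening TCP socket to port 0 and return (socket, actual_port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0: the OS chooses a free port
    sock.listen()
    return sock, sock.getsockname()[1]

# Stand-ins for the existing process and the "sidecar" on the same host.
primary, primary_port = bind_dynamic_port()
sidecar, sidecar_port = bind_dynamic_port()
print(primary_port, sidecar_port)  # two different OS-assigned ports
primary.close()
sidecar.close()
```

The actual port is only known after the bind, so whatever advertises the server's address has to publish the port it ended up with.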



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2014-08-11 Thread Mike Drob (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093021#comment-14093021
 ] 

Mike Drob commented on ACCUMULO-1454:
-

How do you plan to deal with port conflicts?



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2014-08-11 Thread Keith Turner (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092944#comment-14092944
 ] 

Keith Turner commented on ACCUMULO-1454:


I thought of another possible solution.  Instead of killing the tserver process 
and restarting, start another tserver instance on the same node  while the old 
tserver instance is still running.  Then migrate tablets between the old and 
new tserver instance on the same node.   After everything is migrated, kill the 
old tserver instance on the node.

I really like this approach, but it has one small problem.  There is a 
potential for memory exhaustion (from using 2x memory for buffering of read and 
write data).  To circumvent this, we could possibly make the decommissioned 
tserver flush its read cache, flush recently written memory, and hold new 
writes.  This approach may delay writes a bit, but seems like it would be good 
for reads.
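
A toy model of the memory concern, in Python. The sizes, the class, and its field names are invented for illustration, and "flushing" is reduced to zeroing a counter. It only shows why two fully buffered instances on one node double the footprint, and how decommissioning the old one first brings it back down.

```python
# Hypothetical model; the sizes and field names are made up for illustration.

class TServer:
    def __init__(self, read_cache_mb=0, write_buffer_mb=0):
        self.read_cache_mb = read_cache_mb
        self.write_buffer_mb = write_buffer_mb
        self.holding_writes = False

    def buffered_mb(self):
        return self.read_cache_mb + self.write_buffer_mb

    def decommission(self):
        """Release buffers and refuse new writes ahead of migration."""
        self.read_cache_mb = 0      # drop cached blocks
        self.write_buffer_mb = 0    # stands in for flushing in-memory data to disk
        self.holding_writes = True  # writers wait until the tablets have moved


old = TServer(read_cache_mb=2048, write_buffer_mb=1024)
# Worst case: the new instance is already as full as the old one.
new = TServer(read_cache_mb=2048, write_buffer_mb=1024)

naive_peak = old.buffered_mb() + new.buffered_mb()
old.decommission()
flushed_peak = old.buffered_mb() + new.buffered_mb()
print(naive_peak, flushed_peak)  # 6144 3072
```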




[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2013-06-03 Thread Mike Drob (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673749#comment-13673749
 ] 

Mike Drob commented on ACCUMULO-1454:
-

bq. Maybe the 'restarting' status has a lease? Server must become responsive 
within a configurable time period.

+1

bq. Currently when a tablet is closed, it interrupts running scans. Waiting for 
scans to finish is tricky, because it seems like you would not want to allow 
new scans to start. So while the scans running before close would see no delay, 
scans started after close will still see a delay.

Would it make sense to "double host" tablets? Let existing scans finish on a 
tserver that is about to go down, meanwhile, load those same tablets elsewhere 
and point new scans to the new locations. At the end of the whole process, run 
one final balance to shake things out.
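
A minimal sketch of the "double host" idea in Python, under the assumption that something tracks how many scans are still running on the old location. The class and host names are hypothetical, and the sketch covers reads only; it does not say how writes would be handled while a tablet is hosted twice.

```python
# Hypothetical sketch of "double hosting"; not how Accumulo tracks scans today.

class DoubleHostedTablet:
    def __init__(self, old_host, new_host, active_scans_on_old):
        self.old_host = old_host
        self.new_host = new_host
        self.active_on_old = active_scans_on_old

    def route_new_scan(self):
        """New scans always go to the new location."""
        return self.new_host

    def finish_scan_on_old(self):
        """Record that a pre-existing scan on the old host completed."""
        if self.active_on_old == 0:
            raise RuntimeError("no scans left on old host")
        self.active_on_old -= 1

    def old_host_can_unload(self):
        """The old host may unload once its pre-existing scans have drained."""
        return self.active_on_old == 0


tablet = DoubleHostedTablet("tserver-old", "tserver-new", active_scans_on_old=2)
print(tablet.route_new_scan())       # tserver-new
print(tablet.old_host_can_unload())  # False
tablet.finish_scan_on_old()
tablet.finish_scan_on_old()
print(tablet.old_host_can_unload())  # True
```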




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2013-05-23 Thread Keith Turner (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665541#comment-13665541
 ] 

Keith Turner commented on ACCUMULO-1454:


Seems like you would want an exception for metadata tablets?  Reassign those 
tablets immediately.

bq.  For most cases, we could even do neat stuff like wait for all scans to 
cease for a tablet before we migrate it away.

Currently when a tablet is closed, it interrupts running scans.   Waiting for 
scans to finish is tricky, because it seems like you would not want to allow 
new scans to start.  So while the scans running before close would see no 
delay, scans started after close will still see a delay.

bq. I think that's the key to an elegant solution here: ensure a delay long 
enough for the tserver to come back and continue serving the tablets it had been

Could possibly record this tablet state in the metadata table as opposed to 
keeping it in the master's memory.   So put something in the metadata table for 
a tablet that indicates the master should delay assigning the tablet until its 
tablet server becomes active.  If the master does not see the tablet server for 
a period of time, it could ignore those entries in the metadata table and 
assign.
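
Roughly, the decision the master would make per tablet could look like the sketch below (Python, for illustration only). The marker layout, the `max_wait` timeout, and the function name are assumptions, not actual Accumulo metadata columns or properties. Metadata tablets skip the wait, per the exception suggested above.

```python
# Hypothetical sketch; the marker fields and the timeout are illustrative and
# do not correspond to actual Accumulo metadata columns.

def should_assign(now, marker, live_servers, max_wait, is_metadata=False):
    """Decide whether the master should assign an unhosted tablet now.

    marker is None, or a (server, suspended_at) pair recorded for the tablet.
    Returns (assign, preferred_server).
    """
    if marker is None or is_metadata:
        return True, None    # ordinary tablet, or metadata: assign immediately
    server, suspended_at = marker
    if server in live_servers:
        return True, server  # server came back: hand the tablet back to it
    if now - suspended_at >= max_wait:
        return True, None    # waited long enough: assign elsewhere
    return False, None       # keep waiting for the restart


live = {"ts2"}
print(should_assign(100, ("ts1", 90), live, max_wait=60))     # (False, None)
print(should_assign(100, ("ts1", 90), {"ts1"}, max_wait=60))  # (True, 'ts1')
print(should_assign(200, ("ts1", 90), live, max_wait=60))     # (True, None)
```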



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2013-05-23 Thread David Medinets (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665079#comment-13665079
 ] 

David Medinets commented on ACCUMULO-1454:
--

Maybe the 'restarting' status has a lease? Server must become responsive
within a configurable time period.


On Wed, May 22, 2013 at 10:31 PM, Christopher Tubbs (JIRA)



> Need good way to perform a rolling restart of all tablet servers
> 
>
> Key: ACCUMULO-1454
> URL: https://issues.apache.org/jira/browse/ACCUMULO-1454
> Project: Accumulo
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.5.0, 1.4.3
>Reporter: Mike Drob
>
> When needing to change a tserver parameter (e.g. java heap space) across the 
> entire cluster, there is not a graceful way to perform a rolling restart.
> The naive approach of just killing tservers one at a time causes a lot of 
> churn on the cluster as tablets move around and zookeeper tries to maintain 
> current state.
> Potential solutions might be via a fancy fate operation, with coordination by 
> the master. Ideally, the master would know which servers are 'safe' to 
> restart and could minimize overall impact during the operation.



[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2013-05-22 Thread Christopher Tubbs (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664802#comment-13664802
 ] 

Christopher Tubbs commented on ACCUMULO-1454:
-

[~medined] wrote:
{quote}Is there some time delay before tablets are reassigned? Can the tserver 
restart within that window of time?{quote}

I think that's the key to an elegant solution here: ensure a delay long enough 
for the tserver to come back and continue serving the tablets it had been, to 
avoid rebalancing the whole cluster, but not so long that a failure to come 
back would prevent re-assignment entirely.




[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2013-05-22 Thread David Medinets (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664737#comment-13664737
 ] 

David Medinets commented on ACCUMULO-1454:
--

This ticket makes me laugh because my team at Toyrus.com developed rolling 
restart functionality back in 1999. We tracked the number of sessions assigned 
to each server using a load balancer. When a given server needed to be 
restarted, the load balancer changed the server's state to something like 
'restarting' and stopped sending sessions to it. The server would periodically 
check its own status. If it saw 'restarting' and it had zero sessions, then it 
restarted itself.
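
The drain-then-restart behaviour described above can be sketched roughly as follows. This is an illustration in Python only; the function names and the state strings `'serving'` and `'restarting'` are assumptions, not taken from any real load balancer or from Accumulo.

```python
def accepts_new_sessions(state):
    # Load balancer side: a server marked 'restarting' gets no new sessions.
    return state != "restarting"


def periodic_check(state, active_sessions):
    """Server side: decide what to do on one periodic status check."""
    if state != "restarting":
        return "serve"    # normal operation
    if active_sessions > 0:
        return "drain"    # keep serving existing sessions until they finish
    return "restart"      # marked for restart and idle, so it is safe to restart
```

The server never restarts while it still holds sessions, so the cost of the restart is paid by waiting rather than by dropping work.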

Is there some time delay before tablets are reassigned? Can the tserver restart 
within that window of time?




[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2013-05-22 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664635#comment-13664635
 ] 

Josh Elser commented on ACCUMULO-1454:
--

I think there's more to it than just pausing the balancer.

When a tserver dies, you get a massive swath of tablets that need to be 
reassigned. While this typically isn't an interruption of service due to the 
resiliency of the client, it will still affect query response time.

What if there were a way to "decommission" a tserver, in which we more gracefully 
migrate tablets off of that tserver to others? This makes things more difficult 
on our end; however, it should result in better QoS for clients. For most 
cases, we could even do neat stuff like wait for all scans to cease for a 
tablet before we migrate it away.
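
As a rough sketch of what such a decommission pass might look like (Python, purely illustrative): `plan_decommission` and its arguments are hypothetical, and choosing the least-loaded remaining server as the target is an assumption made for the example, not a description of the Accumulo balancer. It also assumes at least one server remains to receive tablets.

```python
def plan_decommission(tablets, active_scans, load):
    """Split the leaving server's tablets into moves to make now and tablets to defer.

    tablets: tablet ids hosted on the server being decommissioned.
    active_scans: dict of tablet id -> number of scans currently in progress.
    load: dict of remaining server name -> number of tablets it hosts.
    """
    load = dict(load)  # work on a copy so the caller's view is untouched
    moves, deferred = [], []
    for tablet in tablets:
        if active_scans.get(tablet, 0) > 0:
            deferred.append(tablet)  # wait for scans to cease before migrating
            continue
        target = min(load, key=load.get)  # least-loaded remaining server
        load[target] += 1
        moves.append((tablet, target))
    return moves, deferred
```

Calling this repeatedly until `deferred` is empty would drain the server without interrupting scans that are already running.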




[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers

2013-05-22 Thread Christopher Tubbs (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664592#comment-13664592
 ] 

Christopher Tubbs commented on ACCUMULO-1454:
-

This shouldn't be too difficult. Essentially, one needs to pause load balancing 
before each server goes down, and resume when the server comes back.

This could be done with a load balancer that detects that the cluster is in a 
"rolling-upgrade" state, and is less aggressive about tablet assignment... 
maybe through a simple timeout delay before assignment.
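
The pause/resume sequence above might be orchestrated along these lines. This is a sketch in Python under stated assumptions: the four callables are hypothetical hooks supplied by the caller, not existing Accumulo APIs, and `wait_until_up` is assumed to return False when its timeout elapses.

```python
def rolling_restart(servers, pause_balancer, resume_balancer, restart, wait_until_up):
    """Restart servers one at a time, pausing balancing around each restart.

    Returns the servers that did not come back within the wait.
    """
    failed = []
    for server in servers:
        pause_balancer()
        try:
            restart(server)
            if not wait_until_up(server):
                failed.append(server)  # timed out; its tablets need reassignment
        finally:
            resume_balancer()  # always resume, even if the restart raised
    return failed
```

Resuming in a `finally` block keeps a failed restart from leaving balancing paused for the whole cluster.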

