[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323231#comment-14323231 ]

Josh Elser commented on ACCUMULO-1454:
--------------------------------------

Yes, you are correct. Thanks for commenting. We'll need to do some brainstorming as to what we'll need to add to the RPC API to support it; the above concern will likely be pulled there. I just wanted to make sure I wrote it down so it doesn't get lost.

> Need good way to perform a rolling restart of all tablet servers
> ----------------------------------------------------------------
>
>                 Key: ACCUMULO-1454
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1454
>             Project: Accumulo
>          Issue Type: Sub-task
>          Components: tserver
>    Affects Versions: 1.4.3, 1.5.0
>            Reporter: Mike Drob
>         Attachments: ACCUMULO-1454-proposal-01.adoc, ACCUMULO-1454-proposal-01.html
>
> When needing to change a tserver parameter (e.g. Java heap space) across the entire cluster, there is not a graceful way to perform a rolling restart. The naive approach of just killing tservers one at a time causes a lot of churn on the cluster as tablets move around and ZooKeeper tries to maintain current state.
> Potential solutions might be via a fancy FATE operation, with coordination by the master. Ideally, the master would know which servers are 'safe' to restart and could minimize overall impact during the operation.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321688#comment-14321688 ]

Sean Busbey commented on ACCUMULO-1454:
---------------------------------------

That sounds like a problem for the parent issue, but not for this one. If we're doing a rolling restart generally (that is, not during an upgrade), then we needn't worry about version conflicts.

Maybe worth another child on the parent issue for an RPC to the master about maximum running version negotiation?
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320543#comment-14320543 ]

Josh Elser commented on ACCUMULO-1454:
--------------------------------------

I think we chatted about this recently: there's an issue of handling newer versions of RFile and WALs in the middle of a rolling restart.

1. Server1 is restarted as the new version
2. Server1 writes some new data
3. Server1 dies
4. Server2 (still old version) gets the tablets from Server1

We need to ensure that there is control to limit the new software from writing out new versions of persistent files while there are still old versions of the software participating in the instance. It's similar to finalizing an upgrade: after we're sure that all of the servers have been upgraded and are functioning well, we can flip them over to using new messages/serialization that the old versions aren't aware of.

This problem gets much easier after we get to using Thrift/PB for serializing things because both of those can naturally read newer versions of messages they know about, ignoring the new fields.
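A minimal sketch of the constraint described in this comment: new software keeps writing the oldest persistent format that any live server can read, and only moves to the new format once every server has been upgraded. The `Server` type, `write_version` function, and integer format numbers are hypothetical illustrations, not Accumulo APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical view of a tablet server during a rolling restart."""
    name: str
    max_format: int  # newest persistent file format this server can read


def write_version(live_servers):
    """Pick the newest file format that every live server can still read.

    Writing at the minimum supported format means a tablet that fails
    over to an old server never carries files the old server cannot parse.
    """
    if not live_servers:
        raise ValueError("no live servers")
    return min(s.max_format for s in live_servers)


# Mid-restart: Server1 is upgraded, Server2 is still on the old version.
mixed = [Server("Server1", 2), Server("Server2", 1)]
mid_restart_format = write_version(mixed)

# Once every server is upgraded, the new format can be "finalized".
upgraded = [Server("Server1", 2), Server("Server2", 2)]
finalized_format = write_version(upgraded)
```

With this rule, step 4 of the scenario is safe: whatever Server1 wrote before dying is still in a format Server2 understands.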
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102506#comment-14102506 ]

Josh Elser commented on ACCUMULO-1454:
--------------------------------------

bq. I can post design doc on RB if anyone has feedback.

That'd be best, IMO.
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102474#comment-14102474 ]

Mike Drob commented on ACCUMULO-1454:
-------------------------------------

Is the upgrade case the same as the 'change configuration' use case? Why load/unload instead of move?
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094524#comment-14094524 ]

Eric Newton commented on ACCUMULO-1454:
---------------------------------------

bq. Would that just bash ZooKeeper trying to migrate tablets off of ~200 tservers? (e.g. 200 tservers * 200 tablets)?

No more so than start-up.

bq. Would you consider this a FATE op

Yes: it needs to be part of the metadata processing state machine. Migrations are not presently a FATE op.
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094482#comment-14094482 ]

Josh Elser commented on ACCUMULO-1454:
--------------------------------------

bq. start a lot of new tserver instances

This might be more difficult than it sounds. {{start-server.sh}} is fairly easy to manage, but {{stop-server.sh}} is pretty aggressive about just {{kill}}'ing the process on a host. I think we might have to expand on the shell scripts to really give an admin what they want.

bq. migrate tablets

Would you consider this a FATE op that the master coordinates, or would you just add something directly to the TServer and let the client handle the coordination? Would that just bash ZooKeeper trying to migrate tablets off of ~200 tservers (e.g. 200 tservers * 200 tablets)?
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094450#comment-14094450 ]

Keith Turner commented on ACCUMULO-1454:
----------------------------------------

This discussion makes me think we could offer the following primitives.

* disable balancing
* migrate tablets
* enable balancing

This would allow an admin to do the following.

# disable balancing
# start a lot of new tserver instances
# call migrate tablets in a loop
# kill old tservers
# enable balancing

We could offer a script to assist with this.
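The five admin steps above can be sketched as a small driver over the three proposed primitives. Everything here is a hypothetical in-memory model for illustration: the `Cluster` class, its method names, and the `rolling_replace` helper are assumptions, not existing Accumulo commands or APIs.

```python
class Cluster:
    """Hypothetical in-memory stand-in for the master's view of tablet hosting."""

    def __init__(self, hosting):
        # Maps tserver name -> set of tablet ids it currently hosts.
        self.hosting = {ts: set(tablets) for ts, tablets in hosting.items()}
        self.balancing = True

    def disable_balancing(self):
        self.balancing = False

    def enable_balancing(self):
        self.balancing = True

    def start_tserver(self, name):
        self.hosting.setdefault(name, set())

    def migrate_tablets(self, source, dest):
        """Move every tablet from source to dest; return how many moved."""
        moved = len(self.hosting[source])
        self.hosting[dest] |= self.hosting[source]
        self.hosting[source] = set()
        return moved

    def kill_tserver(self, name):
        # Refuse to kill a server that still hosts tablets.
        if self.hosting[name]:
            raise RuntimeError(f"{name} still hosts tablets")
        del self.hosting[name]


def rolling_replace(cluster, replacements):
    """Run the five steps; replacements maps each old tserver to its new one."""
    cluster.disable_balancing()                 # 1. disable balancing
    try:
        for new in replacements.values():       # 2. start new tserver instances
            cluster.start_tserver(new)
        for old, new in replacements.items():   # 3. migrate tablets in a loop
            cluster.migrate_tablets(old, new)
        for old in replacements:                # 4. kill old tservers
            cluster.kill_tserver(old)
    finally:
        cluster.enable_balancing()              # 5. enable balancing


cluster = Cluster({"old1": {"t1", "t2"}, "old2": {"t3"}})
rolling_replace(cluster, {"old1": "new1", "old2": "new2"})
```

Re-enabling balancing in a `finally` block is a design choice of this sketch, so that a failed migration does not leave the balancer switched off.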
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094234#comment-14094234 ]

Mike Drob commented on ACCUMULO-1454:
-------------------------------------

Disabling the balancer while doing this (or replacing it with a balancer specifically for rolling upgrades) is a Good Idea.
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094210#comment-14094210 ]

Josh Elser commented on ACCUMULO-1454:
--------------------------------------

bq. If we expose a user level command to request that a tablet be moved to a given destination

Do we necessarily care where they move? I don't have a good feel for the difference in cost between moving to a "sibling" tserver on the same node as opposed to a tserver on a completely different node. We also might just be fighting the balancer if we make it seem like the user has control over where a tablet is hosted.

HBase has such a tool, no? Can we glean anything from their rolling upgrade support (good and bad)?
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094145#comment-14094145 ]

Mike Drob commented on ACCUMULO-1454:
-------------------------------------

If we expose a user level command to request that a tablet be moved to a given destination, then external tools could implement their own rolling restarts. If any of those turn out to be really good in the general case (or even in the upgrade case), then we can always backport them.
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094134#comment-14094134 ]

Keith Turner commented on ACCUMULO-1454:
----------------------------------------

I was thinking about use cases.

* Administrator has 20 of 100 nodes with screwy Java memory config, wants to fix those nodes with minimal impact.
* Administrator has 100 of 100 nodes with screwy Java memory config, wants to fix those nodes with minimal impact.
* Administrator wants to upgrade cluster from 1.7.0 to 1.7.1 with minimal impact.

Are there any other important use cases? The first two are covered in the ticket description; I split them into two because one involves only a subset of tservers.
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093674#comment-14093674 ]

Sean Busbey commented on ACCUMULO-1454:
---------------------------------------

[~mdrob], presumably the sidecar could be started with its own ACCUMULO_CONF_DIR with the port set to a different value.

Relying exclusively on dynamic ports would preclude rolling restarts for anyone running in an environment that requires whitelisted network comms (e.g. those running under DISA STIGs). But we could just call out needing to pick a port in whatever docs are describing how to set up a custom conf dir and make '0' the common case.
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093085#comment-14093085 ]

Josh Elser commented on ACCUMULO-1454:
--------------------------------------

bq. How do you plan to deal with port conflicts?

For any current version, the config could be modified to start up the "sidecar" processes with a port value of '0'. I've done pretty extensive testing in making sure we can bring up a cluster using completely dynamic ports.
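For readers unfamiliar with why a port value of '0' avoids conflicts: binding a socket to port 0 asks the operating system to choose a free ephemeral port, so two processes on the same host never collide. The sketch below shows only that general OS behavior with plain sockets; it is not Accumulo code, and `bind_dynamic_port` is a hypothetical helper name.

```python
import socket


def bind_dynamic_port(host="127.0.0.1"):
    """Bind a listening socket to port 0 and report the port the OS chose."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 means "pick any free port"
    sock.listen()
    return sock, sock.getsockname()[1]


# Stand-ins for the old tserver and its "sidecar" on the same node.
old_sock, old_port = bind_dynamic_port()
sidecar_sock, sidecar_port = bind_dynamic_port()

# Both listeners were open at the same time, so the ports must differ.
ports_differ = old_port != sidecar_port

old_sock.close()
sidecar_sock.close()
```

The trade-off raised elsewhere in this thread still applies: a port chosen by the OS cannot be whitelisted in advance, so locked-down environments would need an explicitly configured port instead.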
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093021#comment-14093021 ]

Mike Drob commented on ACCUMULO-1454:
-------------------------------------

How do you plan to deal with port conflicts?
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092944#comment-14092944 ]

Keith Turner commented on ACCUMULO-1454:
----------------------------------------

I thought of another possible solution. Instead of killing the tserver process and restarting, start another tserver instance on the same node while the old tserver instance is still running. Then migrate tablets between the old and new tserver instance on the same node. After everything is migrated, kill the old tserver instance on the node.

I really like this approach, but it has one small problem. There is a potential for memory exhaustion (from using 2x memory for buffering of read and write data). To circumvent this, we could possibly make the decommissioned tserver flush its read cache, flush recently written memory, and hold new writes. This approach may delay writes a bit, but seems like it would be good for reads.
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673749#comment-13673749 ]

Mike Drob commented on ACCUMULO-1454:
-------------------------------------

bq. Maybe the 'restarting' status has a lease? Server must become responsive within a configurable time period.

+1

bq. Currently when a tablet is closed, it interrupts running scans. Waiting for scans to finish is tricky, because it seems like you would not want to allow new scans to start. So while the scans running before close would see no delay, scans started after close will still see a delay.

Would it make sense to "double host" tablets? Let existing scans finish on a tserver that is about to go down; meanwhile, load those same tablets elsewhere and point new scans to the new locations. At the end of the whole process, run one final balance to shake things out.
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665541#comment-13665541 ]

Keith Turner commented on ACCUMULO-1454:
----------------------------------------

Seems like you would want an exception for metadata tablets? Reassign those tablets immediately.

bq. For most cases, we could even do neat stuff like wait for all scans to cease for a tablet before we migrate it away.

Currently when a tablet is closed, it interrupts running scans. Waiting for scans to finish is tricky, because it seems like you would not want to allow new scans to start. So while the scans running before close would see no delay, scans started after close will still see a delay.

bq. I think that's the key to an elegant solution here: ensure a delay long enough for the tserver to come back and continue serving the tablets it had been

Could possibly record this tablet state in the metadata table as opposed to keeping it in the master's memory. So put something in the metadata table for a tablet that indicates the master should delay assigning the tablet until its tablet server becomes active. If the master does not see the tablet server for a period of time, it could ignore those entries in the metadata table and assign.
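A minimal sketch of the delayed-assignment idea in this comment, combined with the lease suggested elsewhere in the thread and the exception for metadata tablets. The `RestartMarker` record, its fields, and the action names are hypothetical; they do not describe an actual metadata table schema or master API.

```python
from dataclasses import dataclass


@dataclass
class RestartMarker:
    """Hypothetical per-tablet entry noting a planned restart of its server."""
    tserver: str
    marked_at: float  # seconds, on whatever clock the master uses
    lease: float      # how long the master waits for the server to return


def assignment_action(marker, now, live_tservers, is_metadata=False):
    """Decide what the master does with a tablet whose server is restarting."""
    if is_metadata:
        return "reassign"             # metadata tablets are never delayed
    if marker.tserver in live_tservers:
        return "assign-to-original"   # the server came back within the lease
    if now - marker.marked_at < marker.lease:
        return "wait"                 # still inside the lease, hold the tablet
    return "reassign"                 # lease expired, ignore the marker


marker = RestartMarker(tserver="ts1", marked_at=100.0, lease=60.0)
```

Because the marker carries its own timestamp and lease, a master that restarts mid-operation can make the same decision from the stored entry alone, which is the advantage of recording it outside the master's memory.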
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665079#comment-13665079 ]

David Medinets commented on ACCUMULO-1454:
------------------------------------------

Maybe the 'restarting' status has a lease? Server must become responsive within a configurable time period.

On Wed, May 22, 2013 at 10:31 PM, Christopher Tubbs (JIRA)
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664802#comment-13664802 ]

Christopher Tubbs commented on ACCUMULO-1454:
---------------------------------------------

[~medined] wrote:
{quote}Is there some time delay before tablets are reassigned? Can the tserver restart within that window of time?{quote}

I think that's the key to an elegant solution here: ensure a delay long enough for the tserver to come back and continue serving the tablets it had been, to avoid rebalancing the whole cluster, but not so long that a failure to come back would prevent re-assignment entirely.
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664737#comment-13664737 ]

David Medinets commented on ACCUMULO-1454:

This ticket makes me laugh because my team at Toyrus.com developed rolling restart functionality back in 1999. We tracked the number of sessions assigned to each server using a load balancer. When a given server needed to be restarted, the load balancer changed the server's state to something like 'restarting' and did not send any new sessions to it. The server would periodically check its own status. If it saw 'restarting' and it had zero sessions then it restarted itself.

Is there some time delay before tablets are reassigned? Can the tserver restart within that window of time?
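The drain-then-restart scheme described in David Medinets's comment above might look something like this minimal sketch. It is not the original 1999 implementation and not Accumulo code. The `LoadBalancer` class, its method names, and the least-loaded assignment policy are assumptions made for illustration.

```python
class LoadBalancer:
    """Hypothetical balancer that tracks per-server state and session counts."""

    def __init__(self, servers):
        self.state = {s: "active" for s in servers}
        self.sessions = {s: 0 for s in servers}

    def mark_restarting(self, server):
        # From now on no new sessions are routed to this server.
        self.state[server] = "restarting"

    def assign_session(self):
        # Assumed policy: pick the active server with the fewest sessions.
        candidates = [s for s in self.state if self.state[s] == "active"]
        if not candidates:
            raise RuntimeError("no active servers available")
        target = min(candidates, key=lambda s: self.sessions[s])
        self.sessions[target] += 1
        return target

    def end_session(self, server):
        self.sessions[server] -= 1

    def may_restart(self, server):
        # The server's periodic self-check: flagged and fully drained.
        return self.state[server] == "restarting" and self.sessions[server] == 0
```

A server would poll `may_restart` for itself and restart only once it returns true, so no in-flight session is cut off.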
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664635#comment-13664635 ]

Josh Elser commented on ACCUMULO-1454:

I think there's more to it than just pausing the balancer. When a tserver dies, you get a massive swath of tablets that need to be reassigned. While this typically isn't an interruption of service due to the resiliency of the client, it will still affect query response time.

What if there were a way to "decommission" a tserver, in which we more gracefully migrate tablets off of that tserver to others? This makes things more difficult on our end; however, it should result in better QOS for clients. For most cases, we could even do neat stuff like wait for all scans to cease for a tablet before we migrate it away.
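One way to picture the "decommission" idea in Josh Elser's comment above is a pass that migrates only the tablets with no active scans and leaves the rest for a later pass. This is a rough sketch under stated assumptions, not an Accumulo API: `decommission_step`, the `active_scans` mapping, and the `pick_target` callback are all hypothetical names.

```python
def decommission_step(tablets, active_scans, pick_target):
    """Run one decommission pass over the tablets still on a tserver.

    tablets:      tablet ids still hosted on the server being drained
    active_scans: mapping of tablet id -> number of scans in progress
    pick_target:  callable choosing a destination server for a tablet

    Returns (moves, remaining): the migrations to perform now, and the
    tablets that still have scans running and must wait.
    """
    moves = []
    remaining = []
    for tablet in tablets:
        if active_scans.get(tablet, 0) == 0:
            # Idle tablet: safe to migrate without interrupting a scan.
            moves.append((tablet, pick_target(tablet)))
        else:
            remaining.append(tablet)
    return moves, remaining
```

The caller would repeat the step until `remaining` is empty. Since the comment says "for most cases", a real design would presumably also need a deadline after which busy tablets are moved anyway.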
[jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
[ https://issues.apache.org/jira/browse/ACCUMULO-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664592#comment-13664592 ]

Christopher Tubbs commented on ACCUMULO-1454:

This shouldn't be too difficult. Essentially, one needs to pause load balancing before each server goes down, and resume when the server comes back. This could be done with a load balancer that detects that the cluster is in a "rolling-upgrade" state and is less aggressive about tablet assignment, maybe through a simple timeout delay before assignment.