[ https://issues.apache.org/jira/browse/GEODE-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gideon updated GEODE-4250:
--------------------------
    Description:
Command would only succeed when the system is fully redundant.
Re-establishing Redundancy after the loss of a peer node is typically far more urgent and important than achieving better balance. The operational impact of rebalancing is also much higher, forcing impacted buckets' updates to be distributed to _redundancy-copies + 1_ peer processes and potentially spiking p2p connections/threads (and thus load) far beyond normal operations.
If the system is already close to exhausting available capacity for some hardware component, this can be enough to push it over-the-edge (and may force the original fault to recur). This problem is exacerbated when the cluster's overall capacity has been reduced due to the loss of a physical server. Without the ability to separate the operational tasks of re-establishing full data redundancy and rebalancing bucket partitions (that are already safely redundant), system administrators may be forced to provision replacement capacity _before_ they can restore full service, thus increasing downtime unnecessarily.
For these reasons, we must add the option to execute these operational tasks separately.
It still makes sense for _rebalancing_ ops to first re-establish redundancy, so we can keep the existing GFSH command/behavior (it would still be useful to clearly log completion of one step before the next one begins). We need a new GFSH command/ResourceManager API to execute re-establishment of redundancy _without_ rebalancing.

  was:
Command would only succeed when the system is fully redundant.
Re-establishing Redundancy after the loss of a peer node is typically far more urgent and important than achieving better balance. The operational impact of rebalancing is also much higher, forcing impacted buckets' updates to be distributed to _redundancy-copies + 1_ peer processes and potentially spiking p2p connections/threads (and thus load) far beyond normal operations.
If the system is already close to exhausting available capacity for some hardware component, this can be enough to push it over-the-edge (and may force the original fault to recur). This problem is exacerbated when the cluster's overall capacity has been reduced due to the loss of a physical server. Without the ability to separate the operational tasks of re-establishing full data redundancy and rebalancing bucket partitions (that are already safely redundant), system administrators may be forced to provision replacement capacity _before_ they can restore full service, thus increasing downtime unnecessarily.
For these reasons, we must add the option to execute these operational separately.
It still makes sense for _rebalancing_ ops to first re-establish redundancy, so we can keep the existing GFSH command/behavior (it would still be useful to clearly log completion of one step before the next one begins). We need a new GFSH command/ResourceManager API to execute re-establishment of redundancy _without_ rebalancing.


> Users would like a command to re-establish redundancy without rebalancing
> -------------------------------------------------------------------------
>
>                 Key: GEODE-4250
>                 URL: https://issues.apache.org/jira/browse/GEODE-4250
>             Project: Geode
>          Issue Type: Improvement
>          Components: regions
>            Reporter: Fred Krone
>            Priority: Major
>
> Command would only succeed when the system is fully redundant.
> Re-establishing Redundancy after the loss of a peer node is typically far
> more urgent and important than achieving better balance. The operational
> impact of rebalancing is also much higher, forcing impacted buckets' updates
> to be distributed to _redundancy-copies + 1_ peer processes and potentially
> spiking p2p connections/threads (and thus load) far beyond normal operations.
> If the system is already close to exhausting available capacity for some
> hardware component, this can be enough to push it over-the-edge (and may
> force the original fault to recur). This problem is exacerbated when the
> cluster's overall capacity has been reduced due to the loss of a physical
> server. Without the ability to separate the operational tasks of
> re-establishing full data redundancy and rebalancing bucket partitions (that
> are already safely redundant), system administrators may be forced to
> provision replacement capacity _before_ they can restore full service, thus
> increasing downtime unnecessarily.
> For these reasons, we must add the option to execute these operational tasks
> separately.
> It still makes sense for _rebalancing_ ops to first re-establish redundancy,
> so we can keep the existing GFSH command/behavior (it would still be useful
> to clearly log completion of one step before the next one begins). We need a
> new GFSH command/ResourceManager API to execute re-establishment of
> redundancy _without_ rebalancing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)