[ https://issues.apache.org/jira/browse/HADOOP-18396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577353#comment-17577353 ]
Steve Loughran commented on HADOOP-18396:
-----------------------------------------

funny

https://www.slideshare.net/steve_l/farming-hadoop-inthecloud

> Issues running in dynamic / managed environments
> ------------------------------------------------
>
>                 Key: HADOOP-18396
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18396
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.4.0, 3.3.9, 3.3.4
>         Environment: Running an HA configuration in Kubernetes, using Java 11.
>            Reporter: Steve Vaughan
>            Assignee: Steve Vaughan
>            Priority: Major
>
> Running in dynamic or managed environments is a challenge because we can't
> assume that all services will have DNS entries, will be started in a specific
> order, will maintain constant IP addresses, etc. I'm using the following
> assumptions to guide the changes necessary to operate in this kind of
> environment:
> # The configuration files are an expression of desired state.
> # If a referenced service instance is not resolvable or reachable at a
> moment in time, it eventually will be, and it should be able to participate
> in the future as if it had been there originally, without requiring manual
> intervention.
> # IP address changes should be handled in a way that not only allows
> distributed calls to continue to function, but also avoids having to
> re-resolve the address over and over.
> # Code that requires resolved names (Kerberos and DataNode registration)
> should fall back to DNS reverse lookups to work around temporary issues
> caused by caching. Example: The DataNode registration is only performed at
> startup, and yet the extra check that allows it to succeed in registering
> with the NameNode isn’t performed.
> # If an HA system is supposed to only require a quorum, then we shouldn’t
> require the full set, allowing the called service to bring the remaining
> instances into compliance.
> # Managing a service should be independent of other services. Example: You
> should be able to perform a rolling restart of JournalNodes without worrying
> about causing an issue with NameNodes as long as a quorum is present.
>
> A proof of these concepts would be the ability to:
> * Start with fewer than the full replica count of a service. As long as the
> required quorum or minimum count is provided, the cluster should still start
> and function. Example: 2 out of 3 configured JournalNodes should still allow
> the NameNode to format, function, roll over to the standby, etc.
> * Introduce missing instances, which should join the existing cluster
> without manual intervention. Example: A 3rd JournalNode, once started,
> should automatically be formatted and brought up to date.
> * Perform rolling restarts of individual services without negatively
> impacting other services (causing failures, restarts, etc.). Example:
> Rolling restarts of JournalNodes shouldn't cause problems in NameNodes;
> rolling restarts of NameNodes shouldn't cause problems with DataNodes.
> * Report an updated IP address in the logs only once (per dependent),
> avoiding costly re-resolution.
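
The quorum assumption in the quoted description (2 out of 3 configured JournalNodes being enough to proceed) can be sketched as a simple majority check. This is only an illustration, not Hadoop code: the class QuorumSketch and its methods are hypothetical names, and it assumes a strict-majority quorum over the configured instance count.

```java
/**
 * Hypothetical sketch, not part of Hadoop: decides whether the reachable
 * subset of a configured service (for example JournalNodes) is large enough
 * to proceed, assuming a strict-majority quorum.
 */
final class QuorumSketch {
    private QuorumSketch() {}

    /** Smallest number of instances that is a strict majority of the configured set. */
    static int quorumSize(int configured) {
        if (configured <= 0) {
            throw new IllegalArgumentException("configured must be positive");
        }
        return configured / 2 + 1;
    }

    /** True when the reachable instances alone satisfy the majority quorum. */
    static boolean hasQuorum(int configured, int reachable) {
        if (reachable < 0 || reachable > configured) {
            throw new IllegalArgumentException("reachable must be in [0, configured]");
        }
        return reachable >= quorumSize(configured);
    }

    public static void main(String[] args) {
        // 2 of 3 configured instances: enough to start and function.
        System.out.println(hasQuorum(3, 2)); // prints "true"
        // 1 of 3 configured instances: not enough.
        System.out.println(hasQuorum(3, 1)); // prints "false"
    }
}
```

Under this assumption, a missing instance that appears later only raises the reachable count; the decision to start never depended on it being present.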