http://git-wip-us.apache.org/repos/asf/hadoop/blob/a45ef2b6/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HDFSHighAvailabilityWithQJM.apt.vm ---------------------------------------------------------------------- diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HDFSHighAvailabilityWithQJM.apt.vm b/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HDFSHighAvailabilityWithQJM.apt.vm deleted file mode 100644 index a3a9f12..0000000 --- a/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HDFSHighAvailabilityWithQJM.apt.vm +++ /dev/null @@ -1,816 +0,0 @@ -~~ Licensed under the Apache License, Version 2.0 (the "License"); -~~ you may not use this file except in compliance with the License. -~~ You may obtain a copy of the License at -~~ -~~ http://www.apache.org/licenses/LICENSE-2.0 -~~ -~~ Unless required by applicable law or agreed to in writing, software -~~ distributed under the License is distributed on an "AS IS" BASIS, -~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -~~ See the License for the specific language governing permissions and -~~ limitations under the License. See accompanying LICENSE file. - - --- - Hadoop Distributed File System-${project.version} - High Availability - --- - --- - ${maven.build.timestamp} - -HDFS High Availability Using the Quorum Journal Manager - -%{toc|section=1|fromDepth=0} - -* {Purpose} - - This guide provides an overview of the HDFS High Availability (HA) feature - and how to configure and manage an HA HDFS cluster, using the Quorum Journal - Manager (QJM) feature. - - This document assumes that the reader has a general understanding of - general components and node types in an HDFS cluster. Please refer to the - HDFS Architecture guide for details. - -* {Note: Using the Quorum Journal Manager or Conventional Shared Storage} - - This guide discusses how to configure and use HDFS HA using the Quorum - Journal Manager (QJM) to share edit logs between the Active and Standby - NameNodes. For information on how to configure HDFS HA using NFS for shared - storage instead of the QJM, please see - {{{./HDFSHighAvailabilityWithNFS.html}this alternative guide.}} - -* {Background} - - Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in - an HDFS cluster. Each cluster had a single NameNode, and if that machine or - process became unavailable, the cluster as a whole would be unavailable - until the NameNode was either restarted or brought up on a separate machine. - - This impacted the total availability of the HDFS cluster in two major ways: - - * In the case of an unplanned event such as a machine crash, the cluster would - be unavailable until an operator restarted the NameNode. - - * Planned maintenance events such as software or hardware upgrades on the - NameNode machine would result in windows of cluster downtime. - - The HDFS High Availability feature addresses the above problems by providing - the option of running two redundant NameNodes in the same cluster in an - Active/Passive configuration with a hot standby. This allows a fast failover to - a new NameNode in the case that a machine crashes, or a graceful - administrator-initiated failover for the purpose of planned maintenance. - -* {Architecture} - - In a typical HA cluster, two separate machines are configured as NameNodes. - At any point in time, exactly one of the NameNodes is in an <Active> state, - and the other is in a <Standby> state. 
The Active NameNode is responsible - for all client operations in the cluster, while the Standby is simply acting - as a slave, maintaining enough state to provide a fast failover if - necessary. - - In order for the Standby node to keep its state synchronized with the Active - node, both nodes communicate with a group of separate daemons called - "JournalNodes" (JNs). When any namespace modification is performed by the - Active node, it durably logs a record of the modification to a majority of - these JNs. The Standby node is capable of reading the edits from the JNs, and - is constantly watching them for changes to the edit log. As the Standby Node - sees the edits, it applies them to its own namespace. In the event of a - failover, the Standby will ensure that it has read all of the edits from the - JounalNodes before promoting itself to the Active state. This ensures that the - namespace state is fully synchronized before a failover occurs. - - In order to provide a fast failover, it is also necessary that the Standby node - have up-to-date information regarding the location of blocks in the cluster. - In order to achieve this, the DataNodes are configured with the location of - both NameNodes, and send block location information and heartbeats to both. - - It is vital for the correct operation of an HA cluster that only one of the - NameNodes be Active at a time. Otherwise, the namespace state would quickly - diverge between the two, risking data loss or other incorrect results. In - order to ensure this property and prevent the so-called "split-brain scenario," - the JournalNodes will only ever allow a single NameNode to be a writer at a - time. During a failover, the NameNode which is to become active will simply - take over the role of writing to the JournalNodes, which will effectively - prevent the other NameNode from continuing in the Active state, allowing the - new Active to safely proceed with failover. - -* {Hardware resources} - - In order to deploy an HA cluster, you should prepare the following: - - * <<NameNode machines>> - the machines on which you run the Active and - Standby NameNodes should have equivalent hardware to each other, and - equivalent hardware to what would be used in a non-HA cluster. - - * <<JournalNode machines>> - the machines on which you run the JournalNodes. - The JournalNode daemon is relatively lightweight, so these daemons may - reasonably be collocated on machines with other Hadoop daemons, for example - NameNodes, the JobTracker, or the YARN ResourceManager. <<Note:>> There - must be at least 3 JournalNode daemons, since edit log modifications must be - written to a majority of JNs. This will allow the system to tolerate the - failure of a single machine. You may also run more than 3 JournalNodes, but - in order to actually increase the number of failures the system can tolerate, - you should run an odd number of JNs, (i.e. 3, 5, 7, etc.). Note that when - running with N JournalNodes, the system can tolerate at most (N - 1) / 2 - failures and continue to function normally. - - Note that, in an HA cluster, the Standby NameNode also performs checkpoints of - the namespace state, and thus it is not necessary to run a Secondary NameNode, - CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an - error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster - to be HA-enabled to reuse the hardware which they had previously dedicated to - the Secondary NameNode. 
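  As a quick illustration of the quorum arithmetic above, the sketch below works
  through the (N - 1) / 2 tolerance formula (integer division) for a few common
  JournalNode counts; these figures follow directly from the formula and are not
  additional configuration:

----
N = 3 JournalNodes  =>  (3 - 1) / 2 = 1 tolerated JournalNode failure
N = 5 JournalNodes  =>  (5 - 1) / 2 = 2 tolerated JournalNode failures
N = 7 JournalNodes  =>  (7 - 1) / 2 = 3 tolerated JournalNode failures
----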
- -* {Deployment} - -** Configuration overview - - Similar to Federation configuration, HA configuration is backward compatible - and allows existing single NameNode configurations to work without change. - The new configuration is designed such that all the nodes in the cluster may - have the same configuration without the need for deploying different - configuration files to different machines based on the type of the node. - - Like HDFS Federation, HA clusters reuse the <<<nameservice ID>>> to identify a - single HDFS instance that may in fact consist of multiple HA NameNodes. In - addition, a new abstraction called <<<NameNode ID>>> is added with HA. Each - distinct NameNode in the cluster has a different NameNode ID to distinguish it. - To support a single configuration file for all of the NameNodes, the relevant - configuration parameters are suffixed with the <<nameservice ID>> as well as - the <<NameNode ID>>. - -** Configuration details - - To configure HA NameNodes, you must add several configuration options to your - <<hdfs-site.xml>> configuration file. - - The order in which you set these configurations is unimportant, but the values - you choose for <<dfs.nameservices>> and - <<dfs.ha.namenodes.[nameservice ID]>> will determine the keys of those that - follow. Thus, you should decide on these values before setting the rest of the - configuration options. - - * <<dfs.nameservices>> - the logical name for this new nameservice - - Choose a logical name for this nameservice, for example "mycluster", and use - this logical name for the value of this config option. The name you choose is - arbitrary. It will be used both for configuration and as the authority - component of absolute HDFS paths in the cluster. - - <<Note:>> If you are also using HDFS Federation, this configuration setting - should also include the list of other nameservices, HA or otherwise, as a - comma-separated list. - ----- -<property> - <name>dfs.nameservices</name> - <value>mycluster</value> -</property> ----- - - * <<dfs.ha.namenodes.[nameservice ID]>> - unique identifiers for each NameNode in the nameservice - - Configure with a list of comma-separated NameNode IDs. This will be used by - DataNodes to determine all the NameNodes in the cluster. For example, if you - used "mycluster" as the nameservice ID previously, and you wanted to use "nn1" - and "nn2" as the individual IDs of the NameNodes, you would configure this as - such: - ----- -<property> - <name>dfs.ha.namenodes.mycluster</name> - <value>nn1,nn2</value> -</property> ----- - - <<Note:>> Currently, only a maximum of two NameNodes may be configured per - nameservice. - - * <<dfs.namenode.rpc-address.[nameservice ID].[name node ID]>> - the fully-qualified RPC address for each NameNode to listen on - - For both of the previously-configured NameNode IDs, set the full address and - IPC port of the NameNode processs. Note that this results in two separate - configuration options. For example: - ----- -<property> - <name>dfs.namenode.rpc-address.mycluster.nn1</name> - <value>machine1.example.com:8020</value> -</property> -<property> - <name>dfs.namenode.rpc-address.mycluster.nn2</name> - <value>machine2.example.com:8020</value> -</property> ----- - - <<Note:>> You may similarly configure the "<<servicerpc-address>>" setting if - you so desire. 
- - * <<dfs.namenode.http-address.[nameservice ID].[name node ID]>> - the fully-qualified HTTP address for each NameNode to listen on - - Similarly to <rpc-address> above, set the addresses for both NameNodes' HTTP - servers to listen on. For example: - ----- -<property> - <name>dfs.namenode.http-address.mycluster.nn1</name> - <value>machine1.example.com:50070</value> -</property> -<property> - <name>dfs.namenode.http-address.mycluster.nn2</name> - <value>machine2.example.com:50070</value> -</property> ----- - - <<Note:>> If you have Hadoop's security features enabled, you should also set - the <https-address> similarly for each NameNode. - - * <<dfs.namenode.shared.edits.dir>> - the URI which identifies the group of JNs where the NameNodes will write/read edits - - This is where one configures the addresses of the JournalNodes which provide - the shared edits storage, written to by the Active nameNode and read by the - Standby NameNode to stay up-to-date with all the file system changes the Active - NameNode makes. Though you must specify several JournalNode addresses, - <<you should only configure one of these URIs.>> The URI should be of the form: - "qjournal://<host1:port1>;<host2:port2>;<host3:port3>/<journalId>". The Journal - ID is a unique identifier for this nameservice, which allows a single set of - JournalNodes to provide storage for multiple federated namesystems. Though not - a requirement, it's a good idea to reuse the nameservice ID for the journal - identifier. - - For example, if the JournalNodes for this cluster were running on the - machines "node1.example.com", "node2.example.com", and "node3.example.com" and - the nameservice ID were "mycluster", you would use the following as the value - for this setting (the default port for the JournalNode is 8485): - ----- -<property> - <name>dfs.namenode.shared.edits.dir</name> - <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value> -</property> ----- - - * <<dfs.client.failover.proxy.provider.[nameservice ID]>> - the Java class that HDFS clients use to contact the Active NameNode - - Configure the name of the Java class which will be used by the DFS Client to - determine which NameNode is the current Active, and therefore which NameNode is - currently serving client requests. The only implementation which currently - ships with Hadoop is the <<ConfiguredFailoverProxyProvider>>, so use this - unless you are using a custom one. For example: - ----- -<property> - <name>dfs.client.failover.proxy.provider.mycluster</name> - <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> -</property> ----- - - * <<dfs.ha.fencing.methods>> - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover - - It is desirable for correctness of the system that only one NameNode be in - the Active state at any given time. <<Importantly, when using the Quorum - Journal Manager, only one NameNode will ever be allowed to write to the - JournalNodes, so there is no potential for corrupting the file system metadata - from a split-brain scenario.>> However, when a failover occurs, it is still - possible that the previous Active NameNode could serve read requests to - clients, which may be out of date until that NameNode shuts down when trying to - write to the JournalNodes. For this reason, it is still desirable to configure - some fencing methods even when using the Quorum Journal Manager. 
However, to - improve the availability of the system in the event the fencing mechanisms - fail, it is advisable to configure a fencing method which is guaranteed to - return success as the last fencing method in the list. Note that if you choose - to use no actual fencing methods, you still must configure something for this - setting, for example "<<<shell(/bin/true)>>>". - - The fencing methods used during a failover are configured as a - carriage-return-separated list, which will be attempted in order until one - indicates that fencing has succeeded. There are two methods which ship with - Hadoop: <shell> and <sshfence>. For information on implementing your own custom - fencing method, see the <org.apache.hadoop.ha.NodeFencer> class. - - * <<sshfence>> - SSH to the Active NameNode and kill the process - - The <sshfence> option SSHes to the target node and uses <fuser> to kill the - process listening on the service's TCP port. In order for this fencing option - to work, it must be able to SSH to the target node without providing a - passphrase. Thus, one must also configure the - <<dfs.ha.fencing.ssh.private-key-files>> option, which is a - comma-separated list of SSH private key files. For example: - ---- -<property> - <name>dfs.ha.fencing.methods</name> - <value>sshfence</value> -</property> - -<property> - <name>dfs.ha.fencing.ssh.private-key-files</name> - <value>/home/exampleuser/.ssh/id_rsa</value> -</property> ---- - - Optionally, one may configure a non-standard username or port to perform the - SSH. One may also configure a timeout, in milliseconds, for the SSH, after - which this fencing method will be considered to have failed. It may be - configured like so: - ---- -<property> - <name>dfs.ha.fencing.methods</name> - <value>sshfence([[username][:port]])</value> -</property> -<property> - <name>dfs.ha.fencing.ssh.connect-timeout</name> - <value>30000</value> -</property> ---- - - * <<shell>> - run an arbitrary shell command to fence the Active NameNode - - The <shell> fencing method runs an arbitrary shell command. It may be - configured like so: - ---- -<property> - <name>dfs.ha.fencing.methods</name> - <value>shell(/path/to/my/script.sh arg1 arg2 ...)</value> -</property> ---- - - The string between '(' and ')' is passed directly to a bash shell and may not - include any closing parentheses. - - The shell command will be run with an environment set up to contain all of the - current Hadoop configuration variables, with the '_' character replacing any - '.' characters in the configuration keys. The configuration used has already had - any namenode-specific configurations promoted to their generic forms -- for example - <<dfs_namenode_rpc-address>> will contain the RPC address of the target node, even - though the configuration may specify that variable as - <<dfs.namenode.rpc-address.ns1.nn1>>. 
- - Additionally, the following variables referring to the target node to be fenced - are also available: - -*-----------------------:-----------------------------------+ -| $target_host | hostname of the node to be fenced | -*-----------------------:-----------------------------------+ -| $target_port | IPC port of the node to be fenced | -*-----------------------:-----------------------------------+ -| $target_address | the above two, combined as host:port | -*-----------------------:-----------------------------------+ -| $target_nameserviceid | the nameservice ID of the NN to be fenced | -*-----------------------:-----------------------------------+ -| $target_namenodeid | the namenode ID of the NN to be fenced | -*-----------------------:-----------------------------------+ - - These environment variables may also be used as substitutions in the shell - command itself. For example: - ---- -<property> - <name>dfs.ha.fencing.methods</name> - <value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value> -</property> ---- - - If the shell command returns an exit - code of 0, the fencing is determined to be successful. If it returns any other - exit code, the fencing was not successful and the next fencing method in the - list will be attempted. - - <<Note:>> This fencing method does not implement any timeout. If timeouts are - necessary, they should be implemented in the shell script itself (eg by forking - a subshell to kill its parent in some number of seconds). - - * <<fs.defaultFS>> - the default path prefix used by the Hadoop FS client when none is given - - Optionally, you may now configure the default path for Hadoop clients to use - the new HA-enabled logical URI. If you used "mycluster" as the nameservice ID - earlier, this will be the value of the authority portion of all of your HDFS - paths. This may be configured like so, in your <<core-site.xml>> file: - ---- -<property> - <name>fs.defaultFS</name> - <value>hdfs://mycluster</value> -</property> ---- - - - * <<dfs.journalnode.edits.dir>> - the path where the JournalNode daemon will store its local state - - This is the absolute path on the JournalNode machines where the edits and - other local state used by the JNs will be stored. You may only use a single - path for this configuration. Redundancy for this data is provided by running - multiple separate JournalNodes, or by configuring this directory on a - locally-attached RAID array. For example: - ---- -<property> - <name>dfs.journalnode.edits.dir</name> - <value>/path/to/journal/node/local/data</value> -</property> ---- - -** Deployment details - - After all of the necessary configuration options have been set, you must - start the JournalNode daemons on the set of machines where they will run. This - can be done by running the command "<hadoop-daemon.sh start journalnode>" and - waiting for the daemon to start on each of the relevant machines. - - Once the JournalNodes have been started, one must initially synchronize the - two HA NameNodes' on-disk metadata. - - * If you are setting up a fresh HDFS cluster, you should first run the format - command (<hdfs namenode -format>) on one of NameNodes. - - * If you have already formatted the NameNode, or are converting a - non-HA-enabled cluster to be HA-enabled, you should now copy over the - contents of your NameNode metadata directories to the other, unformatted - NameNode by running the command "<hdfs namenode -bootstrapStandby>" on the - unformatted NameNode. 
Running this command will also ensure that the - JournalNodes (as configured by <<dfs.namenode.shared.edits.dir>>) contain - sufficient edits transactions to be able to start both NameNodes. - - * If you are converting a non-HA NameNode to be HA, you should run the - command "<hdfs -initializeSharedEdits>", which will initialize the - JournalNodes with the edits data from the local NameNode edits directories. - - At this point you may start both of your HA NameNodes as you normally would - start a NameNode. - - You can visit each of the NameNodes' web pages separately by browsing to their - configured HTTP addresses. You should notice that next to the configured - address will be the HA state of the NameNode (either "standby" or "active".) - Whenever an HA NameNode starts, it is initially in the Standby state. - -** Administrative commands - - Now that your HA NameNodes are configured and started, you will have access - to some additional commands to administer your HA HDFS cluster. Specifically, - you should familiarize yourself with all of the subcommands of the "<hdfs - haadmin>" command. Running this command without any additional arguments will - display the following usage information: - ---- -Usage: DFSHAAdmin [-ns <nameserviceId>] - [-transitionToActive <serviceId>] - [-transitionToStandby <serviceId>] - [-failover [--forcefence] [--forceactive] <serviceId> <serviceId>] - [-getServiceState <serviceId>] - [-checkHealth <serviceId>] - [-help <command>] ---- - - This guide describes high-level uses of each of these subcommands. For - specific usage information of each subcommand, you should run "<hdfs haadmin - -help <command>>". - - * <<transitionToActive>> and <<transitionToStandby>> - transition the state of the given NameNode to Active or Standby - - These subcommands cause a given NameNode to transition to the Active or Standby - state, respectively. <<These commands do not attempt to perform any fencing, - and thus should rarely be used.>> Instead, one should almost always prefer to - use the "<hdfs haadmin -failover>" subcommand. - - * <<failover>> - initiate a failover between two NameNodes - - This subcommand causes a failover from the first provided NameNode to the - second. If the first NameNode is in the Standby state, this command simply - transitions the second to the Active state without error. If the first NameNode - is in the Active state, an attempt will be made to gracefully transition it to - the Standby state. If this fails, the fencing methods (as configured by - <<dfs.ha.fencing.methods>>) will be attempted in order until one - succeeds. Only after this process will the second NameNode be transitioned to - the Active state. If no fencing method succeeds, the second NameNode will not - be transitioned to the Active state, and an error will be returned. - - * <<getServiceState>> - determine whether the given NameNode is Active or Standby - - Connect to the provided NameNode to determine its current state, printing - either "standby" or "active" to STDOUT appropriately. This subcommand might be - used by cron jobs or monitoring scripts which need to behave differently based - on whether the NameNode is currently Active or Standby. - - * <<checkHealth>> - check the health of the given NameNode - - Connect to the provided NameNode to check its health. The NameNode is capable - of performing some diagnostics on itself, including checking if internal - services are running as expected. This command will return 0 if the NameNode is - healthy, non-zero otherwise. 
One might use this command for monitoring - purposes. - - <<Note:>> This is not yet implemented, and at present will always return - success, unless the given NameNode is completely down. - -* {Automatic Failover} - -** Introduction - - The above sections describe how to configure manual failover. In that mode, - the system will not automatically trigger a failover from the active to the - standby NameNode, even if the active node has failed. This section describes - how to configure and deploy automatic failover. - -** Components - - Automatic failover adds two new components to an HDFS deployment: a ZooKeeper - quorum, and the ZKFailoverController process (abbreviated as ZKFC). - - Apache ZooKeeper is a highly available service for maintaining small amounts - of coordination data, notifying clients of changes in that data, and - monitoring clients for failures. The implementation of automatic HDFS failover - relies on ZooKeeper for the following things: - - * <<Failure detection>> - each of the NameNode machines in the cluster - maintains a persistent session in ZooKeeper. If the machine crashes, the - ZooKeeper session will expire, notifying the other NameNode that a failover - should be triggered. - - * <<Active NameNode election>> - ZooKeeper provides a simple mechanism to - exclusively elect a node as active. If the current active NameNode crashes, - another node may take a special exclusive lock in ZooKeeper indicating that - it should become the next active. - - The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client - which also monitors and manages the state of the NameNode. Each of the - machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible - for: - - * <<Health monitoring>> - the ZKFC pings its local NameNode on a periodic - basis with a health-check command. So long as the NameNode responds in a - timely fashion with a healthy status, the ZKFC considers the node - healthy. If the node has crashed, frozen, or otherwise entered an unhealthy - state, the health monitor will mark it as unhealthy. - - * <<ZooKeeper session management>> - when the local NameNode is healthy, the - ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it - also holds a special "lock" znode. This lock uses ZooKeeper's support for - "ephemeral" nodes; if the session expires, the lock node will be - automatically deleted. - - * <<ZooKeeper-based election>> - if the local NameNode is healthy, and the - ZKFC sees that no other node currently holds the lock znode, it will itself - try to acquire the lock. If it succeeds, then it has "won the election", and - is responsible for running a failover to make its local NameNode active. The - failover process is similar to the manual failover described above: first, - the previous active is fenced if necessary, and then the local NameNode - transitions to active state. - - For more details on the design of automatic failover, refer to the design - document attached to HDFS-2185 on the Apache HDFS JIRA. - -** Deploying ZooKeeper - - In a typical deployment, ZooKeeper daemons are configured to run on three or - five nodes. Since ZooKeeper itself has light resource requirements, it is - acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS - NameNode and Standby Node. Many operators choose to deploy the third ZooKeeper - process on the same node as the YARN ResourceManager. 
It is advisable to - configure the ZooKeeper nodes to store their data on separate disk drives from - the HDFS metadata for best performance and isolation. - - The setup of ZooKeeper is out of scope for this document. We will assume that - you have set up a ZooKeeper cluster running on three or more nodes, and have - verified its correct operation by connecting using the ZK CLI. - -** Before you begin - - Before you begin configuring automatic failover, you should shut down your - cluster. It is not currently possible to transition from a manual failover - setup to an automatic failover setup while the cluster is running. - -** Configuring automatic failover - - The configuration of automatic failover requires the addition of two new - parameters to your configuration. In your <<<hdfs-site.xml>>> file, add: - ----- - <property> - <name>dfs.ha.automatic-failover.enabled</name> - <value>true</value> - </property> ----- - - This specifies that the cluster should be set up for automatic failover. - In your <<<core-site.xml>>> file, add: - ----- - <property> - <name>ha.zookeeper.quorum</name> - <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value> - </property> ----- - - This lists the host-port pairs running the ZooKeeper service. - - As with the parameters described earlier in the document, these settings may - be configured on a per-nameservice basis by suffixing the configuration key - with the nameservice ID. For example, in a cluster with federation enabled, - you can explicitly enable automatic failover for only one of the nameservices - by setting <<<dfs.ha.automatic-failover.enabled.my-nameservice-id>>>. - - There are also several other configuration parameters which may be set to - control the behavior of automatic failover; however, they are not necessary - for most installations. Please refer to the configuration key specific - documentation for details. - -** Initializing HA state in ZooKeeper - - After the configuration keys have been added, the next step is to initialize - required state in ZooKeeper. You can do so by running the following command - from one of the NameNode hosts. - ----- -[hdfs]$ $HADOOP_PREFIX/bin/hdfs zkfc -formatZK ----- - - This will create a znode in ZooKeeper inside of which the automatic failover - system stores its data. - -** Starting the cluster with <<<start-dfs.sh>>> - - Since automatic failover has been enabled in the configuration, the - <<<start-dfs.sh>>> script will now automatically start a ZKFC daemon on any - machine that runs a NameNode. When the ZKFCs start, they will automatically - select one of the NameNodes to become active. - -** Starting the cluster manually - - If you manually manage the services on your cluster, you will need to manually - start the <<<zkfc>>> daemon on each of the machines that runs a NameNode. You - can start the daemon by running: - ----- -[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start zkfc ----- - -** Securing access to ZooKeeper - - If you are running a secure cluster, you will likely want to ensure that the - information stored in ZooKeeper is also secured. This prevents malicious - clients from modifying the metadata in ZooKeeper or potentially triggering a - false failover. 
- - In order to secure the information in ZooKeeper, first add the following to - your <<<core-site.xml>>> file: - ----- - <property> - <name>ha.zookeeper.auth</name> - <value>@/path/to/zk-auth.txt</value> - </property> - <property> - <name>ha.zookeeper.acl</name> - <value>@/path/to/zk-acl.txt</value> - </property> ----- - - Please note the '@' character in these values -- this specifies that the - configurations are not inline, but rather point to a file on disk. - - The first configured file specifies a list of ZooKeeper authentications, in - the same format as used by the ZK CLI. For example, you may specify something - like: - ----- -digest:hdfs-zkfcs:mypassword ----- - ...where <<<hdfs-zkfcs>>> is a unique username for ZooKeeper, and - <<<mypassword>>> is some unique string used as a password. - - Next, generate a ZooKeeper ACL that corresponds to this authentication, using - a command like the following: - ----- -[hdfs]$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword -output: hdfs-zkfcs:mypassword->hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs= ----- - - Copy and paste the section of this output after the '->' string into the file - <<<zk-acls.txt>>>, prefixed by the string "<<<digest:>>>". For example: - ----- -digest:hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=:rwcda ----- - - In order for these ACLs to take effect, you should then rerun the - <<<zkfc -formatZK>>> command as described above. - - After doing so, you may verify the ACLs from the ZK CLI as follows: - ----- -[zk: localhost:2181(CONNECTED) 1] getAcl /hadoop-ha -'digest,'hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM= -: cdrwa ----- - -** Verifying automatic failover - - Once automatic failover has been set up, you should test its operation. To do - so, first locate the active NameNode. You can tell which node is active by - visiting the NameNode web interfaces -- each node reports its HA state at the - top of the page. - - Once you have located your active NameNode, you may cause a failure on that - node. For example, you can use <<<kill -9 <pid of NN>>>> to simulate a JVM - crash. Or, you could power cycle the machine or unplug its network interface - to simulate a different kind of outage. After triggering the outage you wish - to test, the other NameNode should automatically become active within several - seconds. The amount of time required to detect a failure and trigger a - fail-over depends on the configuration of - <<<ha.zookeeper.session-timeout.ms>>>, but defaults to 5 seconds. - - If the test does not succeed, you may have a misconfiguration. Check the logs - for the <<<zkfc>>> daemons as well as the NameNode daemons in order to further - diagnose the issue. - - -* Automatic Failover FAQ - - * <<Is it important that I start the ZKFC and NameNode daemons in any - particular order?>> - - No. On any given node you may start the ZKFC before or after its corresponding - NameNode. - - * <<What additional monitoring should I put in place?>> - - You should add monitoring on each host that runs a NameNode to ensure that the - ZKFC remains running. In some types of ZooKeeper failures, for example, the - ZKFC may unexpectedly exit, and should be restarted to ensure that the system - is ready for automatic failover. - - Additionally, you should monitor each of the servers in the ZooKeeper - quorum. If ZooKeeper crashes, then automatic failover will not function. 
- - * <<What happens if ZooKeeper goes down?>> - - If the ZooKeeper cluster crashes, no automatic failovers will be triggered. - However, HDFS will continue to run without any impact. When ZooKeeper is - restarted, HDFS will reconnect with no issues. - - * <<Can I designate one of my NameNodes as primary/preferred?>> - - No. Currently, this is not supported. Whichever NameNode is started first will - become active. You may choose to start the cluster in a specific order such - that your preferred node starts first. - - * <<How can I initiate a manual failover when automatic failover is - configured?>> - - Even if automatic failover is configured, you may initiate a manual failover - using the same <<<hdfs haadmin>>> command. It will perform a coordinated - failover. - -* HDFS Upgrade/Finalization/Rollback with HA Enabled - - When moving between versions of HDFS, sometimes the newer software can simply - be installed and the cluster restarted. Sometimes, however, upgrading the - version of HDFS you're running may require changing on-disk data. In this case, - one must use the HDFS Upgrade/Finalize/Rollback facility after installing the - new software. This process is made more complex in an HA environment, since the - on-disk metadata that the NN relies upon is by definition distributed, both on - the two HA NNs in the pair, and on the JournalNodes in the case that QJM is - being used for the shared edits storage. This documentation section describes - the procedure to use the HDFS Upgrade/Finalize/Rollback facility in an HA setup. - - <<To perform an HA upgrade>>, the operator must do the following: - - [[1]] Shut down all of the NNs as normal, and install the newer software. - - [[2]] Start up all of the JNs. Note that it is <<critical>> that all the - JNs be running when performing the upgrade, rollback, or finalization - operations. If any of the JNs are down at the time of running any of these - operations, the operation will fail. - - [[3]] Start one of the NNs with the <<<'-upgrade'>>> flag. - - [[4]] On start, this NN will not enter the standby state as usual in an HA - setup. Rather, this NN will immediately enter the active state, perform an - upgrade of its local storage dirs, and also perform an upgrade of the shared - edit log. - - [[5]] At this point the other NN in the HA pair will be out of sync with - the upgraded NN. In order to bring it back in sync and once again have a highly - available setup, you should re-bootstrap this NameNode by running the NN with - the <<<'-bootstrapStandby'>>> flag. It is an error to start this second NN with - the <<<'-upgrade'>>> flag. - - Note that if at any time you want to restart the NameNodes before finalizing - or rolling back the upgrade, you should start the NNs as normal, i.e. without - any special startup flag. - - <<To finalize an HA upgrade>>, the operator will use the <<<`hdfs - dfsadmin -finalizeUpgrade'>>> command while the NNs are running and one of them - is active. The active NN at the time this happens will perform the finalization - of the shared log, and the NN whose local storage directories contain the - previous FS state will delete its local state. - - <<To perform a rollback>> of an upgrade, both NNs should first be shut down. - The operator should run the roll back command on the NN where they initiated - the upgrade procedure, which will perform the rollback on the local dirs there, - as well as on the shared log, either NFS or on the JNs. 
Afterward, this NN - should be started and the operator should run <<<`-bootstrapStandby'>>> on the - other NN to bring the two NNs in sync with this rolled-back file system state.
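  As a convenience, the following is a minimal sketch of the upgrade and
  finalization command sequence described above. The exact way each daemon is
  started (a foreground command versus <<<hadoop-daemon.sh>>>) depends on how you
  manage your cluster, and the <<<[hdfs]$>>> prompt is illustrative only:

----
# 1. With the new software installed and both NNs shut down, ensure every
#    JournalNode is running (critical for upgrade, rollback, and finalize).
[hdfs]$ hadoop-daemon.sh start journalnode

# 2. On one NameNode only: start it with the -upgrade flag. It immediately
#    becomes active and upgrades its local storage dirs and the shared edit log.
[hdfs]$ hdfs namenode -upgrade

# 3. On the other NameNode: re-bootstrap it from the upgraded NameNode.
#    (Do NOT start this second NN with -upgrade.)
[hdfs]$ hdfs namenode -bootstrapStandby

# 4. Once satisfied with the new version, finalize while both NNs are running
#    and one of them is active.
[hdfs]$ hdfs dfsadmin -finalizeUpgrade
----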
http://git-wip-us.apache.org/repos/asf/hadoop/blob/a45ef2b6/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsDesign.apt.vm ---------------------------------------------------------------------- diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsDesign.apt.vm b/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsDesign.apt.vm deleted file mode 100644 index 00628b1..0000000 --- a/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsDesign.apt.vm +++ /dev/null @@ -1,510 +0,0 @@ -~~ Licensed under the Apache License, Version 2.0 (the "License"); -~~ you may not use this file except in compliance with the License. -~~ You may obtain a copy of the License at -~~ -~~ http://www.apache.org/licenses/LICENSE-2.0 -~~ -~~ Unless required by applicable law or agreed to in writing, software -~~ distributed under the License is distributed on an "AS IS" BASIS, -~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -~~ See the License for the specific language governing permissions and -~~ limitations under the License. See accompanying LICENSE file. - - --- - HDFS Architecture - --- - ${maven.build.timestamp} - -HDFS Architecture - -%{toc|section=1|fromDepth=0} - -* Introduction - - The Hadoop Distributed File System (HDFS) is a distributed file system - designed to run on commodity hardware. It has many similarities with - existing distributed file systems. However, the differences from other - distributed file systems are significant. HDFS is highly fault-tolerant - and is designed to be deployed on low-cost hardware. HDFS provides high - throughput access to application data and is suitable for applications - that have large data sets. HDFS relaxes a few POSIX requirements to - enable streaming access to file system data. HDFS was originally built - as infrastructure for the Apache Nutch web search engine project. HDFS - is part of the Apache Hadoop Core project. The project URL is - {{http://hadoop.apache.org/}}. - -* Assumptions and Goals - -** Hardware Failure - - Hardware failure is the norm rather than the exception. An HDFS - instance may consist of hundreds or thousands of server machines, each - storing part of the file systemâs data. The fact that there are a huge - number of components and that each component has a non-trivial - probability of failure means that some component of HDFS is always - non-functional. Therefore, detection of faults and quick, automatic - recovery from them is a core architectural goal of HDFS. - -** Streaming Data Access - - Applications that run on HDFS need streaming access to their data sets. - They are not general purpose applications that typically run on general - purpose file systems. HDFS is designed more for batch processing rather - than interactive use by users. The emphasis is on high throughput of - data access rather than low latency of data access. POSIX imposes many - hard requirements that are not needed for applications that are - targeted for HDFS. POSIX semantics in a few key areas has been traded - to increase data throughput rates. - -** Large Data Sets - - Applications that run on HDFS have large data sets. A typical file in - HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support - large files. It should provide high aggregate data bandwidth and scale - to hundreds of nodes in a single cluster. It should support tens of - millions of files in a single instance. - -** Simple Coherency Model - - HDFS applications need a write-once-read-many access model for files. 
A - file once created, written, and closed need not be changed. This - assumption simplifies data coherency issues and enables high throughput - data access. A Map/Reduce application or a web crawler application fits - perfectly with this model. There is a plan to support appending-writes - to files in the future. - -** âMoving Computation is Cheaper than Moving Dataâ - - A computation requested by an application is much more efficient if it - is executed near the data it operates on. This is especially true when - the size of the data set is huge. This minimizes network congestion and - increases the overall throughput of the system. The assumption is that - it is often better to migrate the computation closer to where the data - is located rather than moving the data to where the application is - running. HDFS provides interfaces for applications to move themselves - closer to where the data is located. - -** Portability Across Heterogeneous Hardware and Software Platforms - - HDFS has been designed to be easily portable from one platform to - another. This facilitates widespread adoption of HDFS as a platform of - choice for a large set of applications. - -* NameNode and DataNodes - - HDFS has a master/slave architecture. An HDFS cluster consists of a - single NameNode, a master server that manages the file system namespace - and regulates access to files by clients. In addition, there are a - number of DataNodes, usually one per node in the cluster, which manage - storage attached to the nodes that they run on. HDFS exposes a file - system namespace and allows user data to be stored in files. - Internally, a file is split into one or more blocks and these blocks - are stored in a set of DataNodes. The NameNode executes file system - namespace operations like opening, closing, and renaming files and - directories. It also determines the mapping of blocks to DataNodes. The - DataNodes are responsible for serving read and write requests from the - file systemâs clients. The DataNodes also perform block creation, - deletion, and replication upon instruction from the NameNode. - - -[images/hdfsarchitecture.png] HDFS Architecture - - The NameNode and DataNode are pieces of software designed to run on - commodity machines. These machines typically run a GNU/Linux operating - system (OS). HDFS is built using the Java language; any machine that - supports Java can run the NameNode or the DataNode software. Usage of - the highly portable Java language means that HDFS can be deployed on a - wide range of machines. A typical deployment has a dedicated machine - that runs only the NameNode software. Each of the other machines in the - cluster runs one instance of the DataNode software. The architecture - does not preclude running multiple DataNodes on the same machine but in - a real deployment that is rarely the case. - - The existence of a single NameNode in a cluster greatly simplifies the - architecture of the system. The NameNode is the arbitrator and - repository for all HDFS metadata. The system is designed in such a way - that user data never flows through the NameNode. - -* The File System Namespace - - HDFS supports a traditional hierarchical file organization. A user or - an application can create directories and store files inside these - directories. The file system namespace hierarchy is similar to most - other existing file systems; one can create and remove files, move a - file from one directory to another, or rename a file. 
HDFS does not yet - implement user quotas or access permissions. HDFS does not support hard - links or soft links. However, the HDFS architecture does not preclude - implementing these features. - - The NameNode maintains the file system namespace. Any change to the - file system namespace or its properties is recorded by the NameNode. An - application can specify the number of replicas of a file that should be - maintained by HDFS. The number of copies of a file is called the - replication factor of that file. This information is stored by the - NameNode. - -* Data Replication - - HDFS is designed to reliably store very large files across machines in - a large cluster. It stores each file as a sequence of blocks; all - blocks in a file except the last block are the same size. The blocks of - a file are replicated for fault tolerance. The block size and - replication factor are configurable per file. An application can - specify the number of replicas of a file. The replication factor can be - specified at file creation time and can be changed later. Files in HDFS - are write-once and have strictly one writer at any time. - - The NameNode makes all decisions regarding replication of blocks. It - periodically receives a Heartbeat and a Blockreport from each of the - DataNodes in the cluster. Receipt of a Heartbeat implies that the - DataNode is functioning properly. A Blockreport contains a list of all - blocks on a DataNode. - -[images/hdfsdatanodes.png] HDFS DataNodes - -** Replica Placement: The First Baby Steps - - The placement of replicas is critical to HDFS reliability and - performance. Optimizing replica placement distinguishes HDFS from most - other distributed file systems. This is a feature that needs lots of - tuning and experience. The purpose of a rack-aware replica placement - policy is to improve data reliability, availability, and network - bandwidth utilization. The current implementation for the replica - placement policy is a first effort in this direction. The short-term - goals of implementing this policy are to validate it on production - systems, learn more about its behavior, and build a foundation to test - and research more sophisticated policies. - - Large HDFS instances run on a cluster of computers that commonly spread - across many racks. Communication between two nodes in different racks - has to go through switches. In most cases, network bandwidth between - machines in the same rack is greater than network bandwidth between - machines in different racks. - - The NameNode determines the rack id each DataNode belongs to via the - process outlined in {{{../hadoop-common/ClusterSetup.html#Hadoop+Rack+Awareness}Hadoop Rack Awareness}}. A simple but non-optimal policy - is to place replicas on unique racks. This prevents losing data when an - entire rack fails and allows use of bandwidth from multiple racks when - reading data. This policy evenly distributes replicas in the cluster - which makes it easy to balance load on component failure. However, this - policy increases the cost of writes because a write needs to transfer - blocks to multiple racks. - - For the common case, when the replication factor is three, HDFSâs - placement policy is to put one replica on one node in the local rack, - another on a different node in the local rack, and the last on a - different node in a different rack. This policy cuts the inter-rack - write traffic which generally improves write performance. 
The chance of - rack failure is far less than that of node failure; this policy does - not impact data reliability and availability guarantees. However, it - does reduce the aggregate network bandwidth used when reading data - since a block is placed in only two unique racks rather than three. - With this policy, the replicas of a file do not evenly distribute - across the racks. One third of replicas are on one node, two thirds of - replicas are on one rack, and the other third are evenly distributed - across the remaining racks. This policy improves write performance - without compromising data reliability or read performance. - - The current, default replica placement policy described here is a work - in progress. - -** Replica Selection - - To minimize global bandwidth consumption and read latency, HDFS tries - to satisfy a read request from a replica that is closest to the reader. - If there exists a replica on the same rack as the reader node, then - that replica is preferred to satisfy the read request. If angg/ HDFS - cluster spans multiple data centers, then a replica that is resident in - the local data center is preferred over any remote replica. - -** Safemode - - On startup, the NameNode enters a special state called Safemode. - Replication of data blocks does not occur when the NameNode is in the - Safemode state. The NameNode receives Heartbeat and Blockreport - messages from the DataNodes. A Blockreport contains the list of data - blocks that a DataNode is hosting. Each block has a specified minimum - number of replicas. A block is considered safely replicated when the - minimum number of replicas of that data block has checked in with the - NameNode. After a configurable percentage of safely replicated data - blocks checks in with the NameNode (plus an additional 30 seconds), the - NameNode exits the Safemode state. It then determines the list of data - blocks (if any) that still have fewer than the specified number of - replicas. The NameNode then replicates these blocks to other DataNodes. - -* The Persistence of File System Metadata - - The HDFS namespace is stored by the NameNode. The NameNode uses a - transaction log called the EditLog to persistently record every change - that occurs to file system metadata. For example, creating a new file - in HDFS causes the NameNode to insert a record into the EditLog - indicating this. Similarly, changing the replication factor of a file - causes a new record to be inserted into the EditLog. The NameNode uses - a file in its local host OS file system to store the EditLog. The - entire file system namespace, including the mapping of blocks to files - and file system properties, is stored in a file called the FsImage. The - FsImage is stored as a file in the NameNodeâs local file system too. - - The NameNode keeps an image of the entire file system namespace and - file Blockmap in memory. This key metadata item is designed to be - compact, such that a NameNode with 4 GB of RAM is plenty to support a - huge number of files and directories. When the NameNode starts up, it - reads the FsImage and EditLog from disk, applies all the transactions - from the EditLog to the in-memory representation of the FsImage, and - flushes out this new version into a new FsImage on disk. It can then - truncate the old EditLog because its transactions have been applied to - the persistent FsImage. This process is called a checkpoint. In the - current implementation, a checkpoint only occurs when the NameNode - starts up. 
Work is in progress to support periodic checkpointing in the - near future. - - The DataNode stores HDFS data in files in its local file system. The - DataNode has no knowledge about HDFS files. It stores each block of - HDFS data in a separate file in its local file system. The DataNode - does not create all files in the same directory. Instead, it uses a - heuristic to determine the optimal number of files per directory and - creates subdirectories appropriately. It is not optimal to create all - local files in the same directory because the local file system might - not be able to efficiently support a huge number of files in a single - directory. When a DataNode starts up, it scans through its local file - system, generates a list of all HDFS data blocks that correspond to - each of these local files and sends this report to the NameNode: this - is the Blockreport. - -* The Communication Protocols - - All HDFS communication protocols are layered on top of the TCP/IP - protocol. A client establishes a connection to a configurable TCP port - on the NameNode machine. It talks the ClientProtocol with the NameNode. - The DataNodes talk to the NameNode using the DataNode Protocol. A - Remote Procedure Call (RPC) abstraction wraps both the Client Protocol - and the DataNode Protocol. By design, the NameNode never initiates any - RPCs. Instead, it only responds to RPC requests issued by DataNodes or - clients. - -* Robustness - - The primary objective of HDFS is to store data reliably even in the - presence of failures. The three common types of failures are NameNode - failures, DataNode failures and network partitions. - -** Data Disk Failure, Heartbeats and Re-Replication - - Each DataNode sends a Heartbeat message to the NameNode periodically. A - network partition can cause a subset of DataNodes to lose connectivity - with the NameNode. The NameNode detects this condition by the absence - of a Heartbeat message. The NameNode marks DataNodes without recent - Heartbeats as dead and does not forward any new IO requests to them. - Any data that was registered to a dead DataNode is not available to - HDFS any more. DataNode death may cause the replication factor of some - blocks to fall below their specified value. The NameNode constantly - tracks which blocks need to be replicated and initiates replication - whenever necessary. The necessity for re-replication may arise due to - many reasons: a DataNode may become unavailable, a replica may become - corrupted, a hard disk on a DataNode may fail, or the replication - factor of a file may be increased. - -** Cluster Rebalancing - - The HDFS architecture is compatible with data rebalancing schemes. A - scheme might automatically move data from one DataNode to another if - the free space on a DataNode falls below a certain threshold. In the - event of a sudden high demand for a particular file, a scheme might - dynamically create additional replicas and rebalance other data in the - cluster. These types of data rebalancing schemes are not yet - implemented. - -** Data Integrity - - It is possible that a block of data fetched from a DataNode arrives - corrupted. This corruption can occur because of faults in a storage - device, network faults, or buggy software. The HDFS client software - implements checksum checking on the contents of HDFS files. When a - client creates an HDFS file, it computes a checksum of each block of - the file and stores these checksums in a separate hidden file in the - same HDFS namespace. 
When a client retrieves file contents it verifies - that the data it received from each DataNode matches the checksum - stored in the associated checksum file. If not, then the client can opt - to retrieve that block from another DataNode that has a replica of that - block. - -** Metadata Disk Failure - - The FsImage and the EditLog are central data structures of HDFS. A - corruption of these files can cause the HDFS instance to be - non-functional. For this reason, the NameNode can be configured to - support maintaining multiple copies of the FsImage and EditLog. Any - update to either the FsImage or EditLog causes each of the FsImages and - EditLogs to get updated synchronously. This synchronous updating of - multiple copies of the FsImage and EditLog may degrade the rate of - namespace transactions per second that a NameNode can support. However, - this degradation is acceptable because even though HDFS applications - are very data intensive in nature, they are not metadata intensive. - When a NameNode restarts, it selects the latest consistent FsImage and - EditLog to use. - - The NameNode machine is a single point of failure for an HDFS cluster. - If the NameNode machine fails, manual intervention is necessary. - Currently, automatic restart and failover of the NameNode software to - another machine is not supported. - -** Snapshots - - Snapshots support storing a copy of data at a particular instant of - time. One usage of the snapshot feature may be to roll back a corrupted - HDFS instance to a previously known good point in time. HDFS does not - currently support snapshots but will in a future release. - -* Data Organization - -** Data Blocks - - HDFS is designed to support very large files. Applications that are - compatible with HDFS are those that deal with large data sets. These - applications write their data only once but they read it one or more - times and require these reads to be satisfied at streaming speeds. HDFS - supports write-once-read-many semantics on files. A typical block size - used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB - chunks, and if possible, each chunk will reside on a different - DataNode. - -** Staging - - A client request to create a file does not reach the NameNode - immediately. In fact, initially the HDFS client caches the file data - into a temporary local file. Application writes are transparently - redirected to this temporary local file. When the local file - accumulates data worth over one HDFS block size, the client contacts - the NameNode. The NameNode inserts the file name into the file system - hierarchy and allocates a data block for it. The NameNode responds to - the client request with the identity of the DataNode and the - destination data block. Then the client flushes the block of data from - the local temporary file to the specified DataNode. When a file is - closed, the remaining un-flushed data in the temporary local file is - transferred to the DataNode. The client then tells the NameNode that - the file is closed. At this point, the NameNode commits the file - creation operation into a persistent store. If the NameNode dies before - the file is closed, the file is lost. - - The above approach has been adopted after careful consideration of - target applications that run on HDFS. These applications need streaming - writes to files. If a client writes to a remote file directly without - any client side buffering, the network speed and the congestion in the - network impacts throughput considerably. 
This approach is not without - precedent. Earlier distributed file systems, e.g. AFS, have used client - side caching to improve performance. A POSIX requirement has been - relaxed to achieve higher performance of data uploads. - -** Replication Pipelining - - When a client is writing data to an HDFS file, its data is first - written to a local file as explained in the previous section. Suppose - the HDFS file has a replication factor of three. When the local file - accumulates a full block of user data, the client retrieves a list of - DataNodes from the NameNode. This list contains the DataNodes that will - host a replica of that block. The client then flushes the data block to - the first DataNode. The first DataNode starts receiving the data in - small portions, writes each portion to its local repository and - transfers that portion to the second DataNode in the list. The second - DataNode, in turn starts receiving each portion of the data block, - writes that portion to its repository and then flushes that portion to - the third DataNode. Finally, the third DataNode writes the data to its - local repository. Thus, a DataNode can be receiving data from the - previous one in the pipeline and at the same time forwarding data to - the next one in the pipeline. Thus, the data is pipelined from one - DataNode to the next. - -* Accessibility - - HDFS can be accessed from applications in many different ways. - Natively, HDFS provides a - {{{http://hadoop.apache.org/docs/current/api/}FileSystem Java API}} - for applications to use. A C language wrapper for this Java API is also - available. In addition, an HTTP browser can also be used to browse the files - of an HDFS instance. Work is in progress to expose HDFS through the WebDAV - protocol. - -** FS Shell - - HDFS allows user data to be organized in the form of files and - directories. It provides a commandline interface called FS shell that - lets a user interact with the data in HDFS. The syntax of this command - set is similar to other shells (e.g. bash, csh) that users are already - familiar with. Here are some sample action/command pairs: - -*---------+---------+ -|| Action | Command -*---------+---------+ -| Create a directory named <<</foodir>>> | <<<bin/hadoop dfs -mkdir /foodir>>> -*---------+---------+ -| Remove a directory named <<</foodir>>> | <<<bin/hadoop fs -rm -R /foodir>>> -*---------+---------+ -| View the contents of a file named <<</foodir/myfile.txt>>> | <<<bin/hadoop dfs -cat /foodir/myfile.txt>>> -*---------+---------+ - - FS shell is targeted for applications that need a scripting language to - interact with the stored data. - -** DFSAdmin - - The DFSAdmin command set is used for administering an HDFS cluster. - These are commands that are used only by an HDFS administrator. Here - are some sample action/command pairs: - -*---------+---------+ -|| Action | Command -*---------+---------+ -|Put the cluster in Safemode | <<<bin/hdfs dfsadmin -safemode enter>>> -*---------+---------+ -|Generate a list of DataNodes | <<<bin/hdfs dfsadmin -report>>> -*---------+---------+ -|Recommission or decommission DataNode(s) | <<<bin/hdfs dfsadmin -refreshNodes>>> -*---------+---------+ - -** Browser Interface - - A typical HDFS install configures a web server to expose the HDFS - namespace through a configurable TCP port. This allows a user to - navigate the HDFS namespace and view the contents of its files using a - web browser. 
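  As a brief illustration of scripting against the interfaces described in
  this section, the sketch below creates a directory, uploads a local file,
  lists it, and then asks for a cluster report with DFSAdmin. The local file
  name and the target path are examples only, not part of the original guide.

----
 bash$ bin/hadoop fs -mkdir /foodir
 bash$ bin/hadoop fs -put localfile.txt /foodir/
 bash$ bin/hadoop fs -ls /foodir
 bash$ bin/hdfs dfsadmin -report
----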
-
-* Space Reclamation
-
-** File Deletes and Undeletes
-
- When a file is deleted by a user or an application, it is not
- immediately removed from HDFS. Instead, HDFS first renames it to a file
- in the <<</trash>>> directory. The file can be restored quickly as long as it
- remains in <<</trash>>>. A file remains in <<</trash>>> for a configurable amount
- of time. After the expiry of its life in <<</trash>>>, the NameNode deletes
- the file from the HDFS namespace. The deletion of a file causes the
- blocks associated with the file to be freed. Note that there could be
- an appreciable time delay between the time a file is deleted by a user
- and the time of the corresponding increase in free space in HDFS.
-
- A user can undelete a file after deleting it as long as it remains in
- the <<</trash>>> directory. If a user wants to undelete a file that he/she
- has deleted, he/she can navigate the <<</trash>>> directory and retrieve the
- file. The <<</trash>>> directory contains only the latest copy of the file
- that was deleted. The <<</trash>>> directory is just like any other directory
- with one special feature: HDFS applies specified policies to
- automatically delete files from this directory. The current default trash
- interval is 0 (files are deleted without being stored in trash). This value
- is a configurable parameter, <<<fs.trash.interval>>>, set in
- core-site.xml.
-
-** Decrease Replication Factor
-
- When the replication factor of a file is reduced, the NameNode selects
- excess replicas that can be deleted. The next Heartbeat transfers this
- information to the DataNode. The DataNode then removes the
- corresponding blocks and the corresponding free space appears in the
- cluster. Once again, there might be a time delay between the completion
- of the setReplication API call and the appearance of free space in the
- cluster.
-
-* References
-
- Hadoop {{{http://hadoop.apache.org/docs/current/api/}JavaDoc API}}.
-
- HDFS source code: {{http://hadoop.apache.org/version_control.html}}

http://git-wip-us.apache.org/repos/asf/hadoop/blob/a45ef2b6/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsEditsViewer.apt.vm
----------------------------------------------------------------------
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsEditsViewer.apt.vm b/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsEditsViewer.apt.vm
deleted file mode 100644
index 8c2db1b..0000000
--- a/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsEditsViewer.apt.vm
+++ /dev/null
@@ -1,104 +0,0 @@
-~~ Licensed under the Apache License, Version 2.0 (the "License");
-~~ you may not use this file except in compliance with the License.
-~~ You may obtain a copy of the License at
-~~
-~~ http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License. See accompanying LICENSE file.
-
-
- ---
- Offline Edits Viewer Guide
- ---
- Erik Steffl
- ---
- ${maven.build.timestamp}
-
-Offline Edits Viewer Guide
-
-%{toc|section=1|fromDepth=0}
-
-* Overview
-
- Offline Edits Viewer is a tool to parse the Edits log file. The current
- processors are mostly useful for conversion between different formats,
- including XML, which is human readable and easier to edit than the native
- binary format.
-
- The tool can parse the edits formats -18 (roughly Hadoop 0.19) and
- later. The tool operates on files only; it does not require a running
- Hadoop cluster.
-
- Input formats supported:
-
- [[1]] <<binary>>: native binary format that Hadoop uses internally
-
- [[2]] <<xml>>: XML format, as produced by the xml processor, used if the
- filename has a <<<.xml>>> (case insensitive) extension
-
- The Offline Edits Viewer provides several output processors (unless
- stated otherwise, the output of a processor can be converted back to the
- original edits file):
-
- [[1]] <<binary>>: native binary format that Hadoop uses internally
-
- [[2]] <<xml>>: XML format
-
- [[3]] <<stats>>: prints out statistics; this output cannot be converted
- back to an edits file
-
-* Usage
-
-----
- bash$ bin/hdfs oev -i edits -o edits.xml
-----
-
-*-----------------------:-----------------------------------+
-| Flag | Description |
-*-----------------------:-----------------------------------+
-|[<<<-i>>> ; <<<--inputFile>>>] <input file> | Specify the input edits log file to
-| | process. An .xml (case insensitive) extension means XML format; otherwise
-| | binary format is assumed. Required.
-*-----------------------:-----------------------------------+
-|[<<-o>> ; <<--outputFile>>] <output file> | Specify the output filename, if the
-| | specified output processor generates one. If the specified file already
-| | exists, it is silently overwritten. Required.
-*-----------------------:-----------------------------------+
-|[<<-p>> ; <<--processor>>] <processor> | Specify the processor to apply
-| | against the edits file. Currently valid options are
-| | <<<binary>>>, <<<xml>>> (default) and <<<stats>>>.
-*-----------------------:-----------------------------------+
-|<<[-v ; --verbose] >> | Print the input and output filenames and pipe output of
-| | processor to console as well as specified file. On extremely large
-| | files, this may increase processing time by an order of magnitude.
-*-----------------------:-----------------------------------+
-|<<[-h ; --help] >> | Display the tool usage and help information and exit.
-*-----------------------:-----------------------------------+
-
-* Case study: Hadoop cluster recovery
-
- In case there is some problem with the Hadoop cluster and the edits file
- is corrupted, it is possible to save at least the part of the edits file
- that is correct. This can be done by converting the binary edits to XML,
- editing it manually, and then converting it back to binary. The most
- common problem is that the edits file is missing the closing record (the
- record that has opCode -1). This should be recognized by the tool, and
- the XML format should be properly closed.
-
- If there is no closing record in the XML file, you can add one after the
- last correct record. Anything after the record with opCode -1 is
- ignored.
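  Sketched end to end, the recovery workflow looks like the following. The
  file names here are placeholders; point the tool at the actual edits
  segment copied from the NameNode's storage directory.

----
 # Convert the corrupted binary edits file to XML for manual repair:
 bash$ bin/hdfs oev -i edits -o edits.xml
 # Fix edits.xml in a text editor (e.g. remove damaged trailing records and
 # add a closing record if one is missing), then convert it back to binary:
 bash$ bin/hdfs oev -i edits.xml -o edits-recovered -p binary
----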
-
- Example of a closing record (with opCode -1):
-
-+----
- <RECORD>
- <OPCODE>-1</OPCODE>
- <DATA>
- </DATA>
- </RECORD>
-+----

http://git-wip-us.apache.org/repos/asf/hadoop/blob/a45ef2b6/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm
----------------------------------------------------------------------
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm b/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm
deleted file mode 100644
index 3b84226..0000000
--- a/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm
+++ /dev/null
@@ -1,247 +0,0 @@
-~~ Licensed under the Apache License, Version 2.0 (the "License");
-~~ you may not use this file except in compliance with the License.
-~~ You may obtain a copy of the License at
-~~
-~~ http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License. See accompanying LICENSE file.
-
- ---
- Offline Image Viewer Guide
- ---
- ---
- ${maven.build.timestamp}
-
-Offline Image Viewer Guide
-
-%{toc|section=1|fromDepth=0}
-
-* Overview
-
- The Offline Image Viewer is a tool to dump the contents of HDFS fsimage
- files to a human-readable format and to provide a read-only WebHDFS API,
- in order to allow offline analysis and examination of a Hadoop cluster's
- namespace. The tool is able to process very large image files relatively
- quickly. The tool handles the layout formats that were included with Hadoop
- versions 2.4 and up. If you want to handle older layout formats, you can
- use the Offline Image Viewer of Hadoop 2.3 or {{oiv_legacy Command}}.
- If the tool is not able to process an image file, it will exit cleanly.
- The Offline Image Viewer does not require a Hadoop cluster to be running;
- it is entirely offline in its operation.
-
- The Offline Image Viewer provides several output processors:
-
- [[1]] Web is the default output processor. It launches an HTTP server
- that exposes a read-only WebHDFS API. Users can investigate the namespace
- interactively by using the HTTP REST API.
-
- [[2]] XML creates an XML document of the fsimage and includes all of the
- information within the fsimage, similar to the lsr processor. The
- output of this processor is amenable to automated processing and
- analysis with XML tools. Due to the verbosity of the XML syntax,
- this processor will also generate the largest amount of output.
-
- [[3]] FileDistribution is the tool for analyzing file sizes in the
- namespace image. In order to run the tool one should define a range
- of integers [0, maxSize] by specifying maxSize and a step. The
- range of integers is divided into segments of size step: [0, s[1],
- ..., s[n-1], maxSize], and the processor calculates how many files
- in the system fall into each segment [s[i-1], s[i]). Note that
- files larger than maxSize always fall into the very last segment.
- The output file is formatted as a tab-separated two-column table:
- Size and NumFiles, where Size represents the start of the segment
- and NumFiles is the number of files from the image whose size falls
- into this segment.
-
-* Usage
-
-** Web Processor
-
- The Web processor launches an HTTP server which exposes a read-only
- WebHDFS API.
- Users can specify the address to listen on with the -addr option (the
- default is localhost:5978).
-
-----
- bash$ bin/hdfs oiv -i fsimage
- 14/04/07 13:25:14 INFO offlineImageViewer.WebImageViewer: WebImageViewer
- started. Listening on /127.0.0.1:5978. Press Ctrl+C to stop the viewer.
-----
-
- Users can access the viewer and get information about the fsimage with
- the following shell command:
-
-----
- bash$ bin/hdfs dfs -ls webhdfs://127.0.0.1:5978/
- Found 2 items
- drwxrwx--- - root supergroup 0 2014-03-26 20:16 webhdfs://127.0.0.1:5978/tmp
- drwxr-xr-x - root supergroup 0 2014-03-31 14:08 webhdfs://127.0.0.1:5978/user
-----
-
- To get the information of all the files and directories, you can simply use
- the following command:
-
-----
- bash$ bin/hdfs dfs -ls -R webhdfs://127.0.0.1:5978/
-----
-
- Users can also get JSON formatted FileStatuses via the HTTP REST API.
-
-----
- bash$ curl -i http://127.0.0.1:5978/webhdfs/v1/?op=liststatus
- HTTP/1.1 200 OK
- Content-Type: application/json
- Content-Length: 252
-
- {"FileStatuses":{"FileStatus":[
- {"fileId":16386,"accessTime":0,"replication":0,"owner":"theuser","length":0,"permission":"755","blockSize":0,"modificationTime":1392772497282,"type":"DIRECTORY","group":"supergroup","childrenNum":1,"pathSuffix":"user"}
- ]}}
-----
-
- The Web processor now supports the following operations:
-
- * {{{./WebHDFS.html#List_a_Directory}LISTSTATUS}}
-
- * {{{./WebHDFS.html#Status_of_a_FileDirectory}GETFILESTATUS}}
-
- * {{{./WebHDFS.html#Get_ACL_Status}GETACLSTATUS}}
-
-** XML Processor
-
- The XML processor is used to dump all the contents of the fsimage. Users
- can specify the input and output files via the -i and -o command-line
- options.
-
-----
- bash$ bin/hdfs oiv -p XML -i fsimage -o fsimage.xml
-----
-
- This will create a file named fsimage.xml containing all the information
- in the fsimage. For very large image files, this process may take several
- minutes.
-
- Applying the Offline Image Viewer with the XML processor would result in
- the following output:
-
-----
- <?xml version="1.0"?>
- <fsimage>
- <NameSection>
- <genstampV1>1000</genstampV1>
- <genstampV2>1002</genstampV2>
- <genstampV1Limit>0</genstampV1Limit>
- <lastAllocatedBlockId>1073741826</lastAllocatedBlockId>
- <txid>37</txid>
- </NameSection>
- <INodeSection>
- <lastInodeId>16400</lastInodeId>
- <inode>
- <id>16385</id>
- <type>DIRECTORY</type>
- <name></name>
- <mtime>1392772497282</mtime>
- <permission>theuser:supergroup:rwxr-xr-x</permission>
- <nsquota>9223372036854775807</nsquota>
- <dsquota>-1</dsquota>
- </inode>
- ...remaining output omitted...
-----
-
-* Options
-
-*-----------------------:-----------------------------------+
-| <<Flag>> | <<Description>> |
-*-----------------------:-----------------------------------+
-| <<<-i>>>\|<<<--inputFile>>> <input file> | Specify the input fsimage file
-| | to process. Required.
-*-----------------------:-----------------------------------+
-| <<<-o>>>\|<<<--outputFile>>> <output file> | Specify the output filename,
-| | if the specified output processor generates one. If
-| | the specified file already exists, it is silently
-| | overwritten. (output to stdout by default)
-*-----------------------:-----------------------------------+
-| <<<-p>>>\|<<<--processor>>> <processor> | Specify the image processor to
-| | apply against the image file. Currently valid options
-| | are Web (default), XML and FileDistribution.
-*-----------------------:-----------------------------------+
-| <<<-addr>>> <address> | Specify the address (host:port) to listen on.
-| | (localhost:5978 by default). This option is used with
-| | the Web processor.
-*-----------------------:-----------------------------------+
-| <<<-maxSize>>> <size> | Specify the range [0, maxSize] of file sizes to be
-| | analyzed in bytes (128GB by default). This option is
-| | used with the FileDistribution processor.
-*-----------------------:-----------------------------------+
-| <<<-step>>> <size> | Specify the granularity of the distribution in bytes
-| | (2MB by default). This option is used with the
-| | FileDistribution processor.
-*-----------------------:-----------------------------------+
-| <<<-h>>>\|<<<--help>>>| Display the tool usage and help information and
-| | exit.
-*-----------------------:-----------------------------------+
-
-* Analyzing Results
-
- The Offline Image Viewer makes it easy to gather large amounts of data
- about the HDFS namespace. This information can then be used to explore
- file system usage patterns or find specific files that match arbitrary
- criteria, along with other types of namespace analysis.
-
-* oiv_legacy Command
-
- Due to the internal layout changes introduced by the ProtocolBuffer-based
- fsimage ({{{https://issues.apache.org/jira/browse/HDFS-5698}HDFS-5698}}),
- OfflineImageViewer consumes an excessive amount of memory and loses some
- functions, such as the Indented and Delimited processors. If you want to
- process without a large amount of memory, or to use these processors, you
- can use the <<<oiv_legacy>>> command (the same as <<<oiv>>> in Hadoop 2.3).
-
-** Usage
-
- 1. Set <<<dfs.namenode.legacy-oiv-image.dir>>> to an appropriate directory
- to make the standby NameNode or SecondaryNameNode save its namespace in the
- old fsimage format during checkpointing.
-
- 2. Use the <<<oiv_legacy>>> command to process the old format fsimage.
-
-----
- bash$ bin/hdfs oiv_legacy -i fsimage_old -o output
-----
-
-** Options
-
-*-----------------------:-----------------------------------+
-| <<Flag>> | <<Description>> |
-*-----------------------:-----------------------------------+
-| <<<-i>>>\|<<<--inputFile>>> <input file> | Specify the input fsimage file to
-| | process. Required.
-*-----------------------:-----------------------------------+
-| <<<-o>>>\|<<<--outputFile>>> <output file> | Specify the output filename, if
-| | the specified output processor generates one. If the
-| | specified file already exists, it is silently
-| | overwritten. Required.
-*-----------------------:-----------------------------------+
-| <<<-p>>>\|<<<--processor>>> <processor> | Specify the image processor to
-| | apply against the image file. Valid options are
-| | Ls (default), XML, Delimited, Indented, and
-| | FileDistribution.
-*-----------------------:-----------------------------------+
-| <<<-skipBlocks>>> | Do not enumerate individual blocks within files. This
-| | may save processing time and output file space on
-| | namespaces with very large files. The Ls processor
-| | reads the blocks to correctly determine file sizes
-| | and ignores this option.
-*-----------------------:-----------------------------------+
-| <<<-printToScreen>>> | Pipe output of processor to console as well as
-| | specified file. On extremely large namespaces, this
-| | may increase processing time by an order of
-| | magnitude.
-*-----------------------:-----------------------------------+
-| <<<-delimiter>>> <arg>| When used in conjunction with the Delimited
-| | processor, replaces the default tab delimiter with
-| | the string specified by <arg>.
-*-----------------------:-----------------------------------+
-| <<<-h>>>\|<<<--help>>>| Display the tool usage and help information and exit.
-*-----------------------:-----------------------------------+

http://git-wip-us.apache.org/repos/asf/hadoop/blob/a45ef2b6/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsMultihoming.apt.vm
----------------------------------------------------------------------
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsMultihoming.apt.vm b/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsMultihoming.apt.vm
deleted file mode 100644
index 2be4567..0000000
--- a/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsMultihoming.apt.vm
+++ /dev/null
@@ -1,145 +0,0 @@
-~~ Licensed under the Apache License, Version 2.0 (the "License");
-~~ you may not use this file except in compliance with the License.
-~~ You may obtain a copy of the License at
-~~
-~~ http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License. See accompanying LICENSE file.
-
- ---
- Hadoop Distributed File System-${project.version} - Support for Multi-Homed Networks
- ---
- ---
- ${maven.build.timestamp}
-
-HDFS Support for Multihomed Networks
-
- This document is targeted at cluster administrators deploying <<<HDFS>>> in
- multihomed networks. Similar support for <<<YARN>>>/<<<MapReduce>>> is
- work in progress and will be documented when available.
-
-%{toc|section=1|fromDepth=0}
-
-* Multihoming Background
-
- In multihomed networks the cluster nodes are connected to more than one
- network interface. There could be multiple reasons for doing so.
-
- [[1]] <<Security>>: Security requirements may dictate that intra-cluster
- traffic be confined to a different network than the network used to
- transfer data in and out of the cluster.
-
- [[2]] <<Performance>>: Intra-cluster traffic may use one or more high bandwidth
- interconnects like Fiber Channel, Infiniband or 10GbE.
-
- [[3]] <<Failover/Redundancy>>: The nodes may have multiple network adapters
- connected to a single network to handle network adapter failure.
-
-
- Note that NIC Bonding (also known as NIC Teaming or Link
- Aggregation) is a related but separate topic. The following settings
- are usually not applicable to a NIC bonding configuration, which handles
- multiplexing and failover transparently while presenting a single 'logical
- network' to applications.
-
-* Fixing Hadoop Issues In Multihomed Environments
-
-** Ensuring HDFS Daemons Bind All Interfaces
-
- By default <<<HDFS>>> endpoints are specified as either hostnames or IP addresses.
- In either case <<<HDFS>>> daemons will bind to a single IP address, making
- the daemons unreachable from other networks.
-
- The solution is to have a separate setting for server endpoints that forces
- binding to the wildcard IP address <<<INADDR_ANY>>>, i.e. <<<0.0.0.0>>>. Do NOT
- supply a port number with any of these settings.
-
-----
-<property>
- <name>dfs.namenode.rpc-bind-host</name>
- <value>0.0.0.0</value>
- <description>
- The actual address the RPC server will bind to. If this optional address is
- set, it overrides only the hostname portion of dfs.namenode.rpc-address.
- It can also be specified per name node or name service for HA/Federation.
- This is useful for making the name node listen on all interfaces by
- setting it to 0.0.0.0.
- </description>
-</property>
-
-<property>
- <name>dfs.namenode.servicerpc-bind-host</name>
- <value>0.0.0.0</value>
- <description>
- The actual address the service RPC server will bind to. If this optional address is
- set, it overrides only the hostname portion of dfs.namenode.servicerpc-address.
- It can also be specified per name node or name service for HA/Federation.
- This is useful for making the name node listen on all interfaces by
- setting it to 0.0.0.0.
- </description>
-</property>
-
-<property>
- <name>dfs.namenode.http-bind-host</name>
- <value>0.0.0.0</value>
- <description>
- The actual address the HTTP server will bind to. If this optional address
- is set, it overrides only the hostname portion of dfs.namenode.http-address.
- It can also be specified per name node or name service for HA/Federation.
- This is useful for making the name node HTTP server listen on all
- interfaces by setting it to 0.0.0.0.
- </description>
-</property>
-
-<property>
- <name>dfs.namenode.https-bind-host</name>
- <value>0.0.0.0</value>
- <description>
- The actual address the HTTPS server will bind to. If this optional address
- is set, it overrides only the hostname portion of dfs.namenode.https-address.
- It can also be specified per name node or name service for HA/Federation.
- This is useful for making the name node HTTPS server listen on all
- interfaces by setting it to 0.0.0.0.
- </description>
-</property>
-----
-
-** Clients use Hostnames when connecting to DataNodes
-
- By default <<<HDFS>>> clients connect to DataNodes using the IP address
- provided by the NameNode. Depending on the network configuration, this
- IP address may be unreachable by the clients. The fix is to let clients
- perform their own DNS resolution of the DataNode hostname. The following
- setting enables this behavior.
-
-----
-<property>
- <name>dfs.client.use.datanode.hostname</name>
- <value>true</value>
- <description>Whether clients should use datanode hostnames when
- connecting to datanodes.
- </description>
-</property>
-----
-
-** DataNodes use Hostnames when connecting to other DataNodes
-
- Rarely, the NameNode-resolved IP address for a DataNode may be unreachable
- from other DataNodes. The fix is to force DataNodes to perform their own
- DNS resolution for inter-DataNode connections. The following setting enables
- this behavior.
-
-----
-<property>
- <name>dfs.datanode.use.datanode.hostname</name>
- <value>true</value>
- <description>Whether datanodes should use datanode hostnames when
- connecting to other datanodes for data transfer.
- </description>
-</property>
-----
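  After restarting the daemons with the bind-host settings above, it can be
  worth verifying that the servers are really listening on the wildcard
  address. A minimal check is sketched below; it assumes the NameNode RPC
  endpoint uses port 8020, so substitute the port from your
  dfs.namenode.rpc-address (or fs.defaultFS) setting.

----
 # Run on the NameNode host. A local address of 0.0.0.0:8020, rather than a
 # single interface address, indicates the wildcard binding took effect.
 bash$ netstat -tln | grep 8020
----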