Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "QuickStart" page has been changed by GlenMazza:
http://wiki.apache.org/hadoop/QuickStart?action=diff&rev1=34&rev2=35

Comment:
Removed duplicate information already available on the Hadoop site, providing links to that information instead (the remaining Apache information should eventually be incorporated into the website).

   * [[http://www.cloudera.com/hadoop-deb|Debian Packages for Debian based 
systems]] (Debian, Ubuntu, etc)
   * [[http://www.cloudera.com/hadoop-ec2|AMI for Amazon EC2]]
  
- If you want to work exclusively with Hadoop code directly from Apache, the 
rest of this document can help you get started quickly from there.
+ If you want to work exclusively with Hadoop code directly from Apache, the 
following articles from the website will be most useful:
+  * [[http://hadoop.apache.org/docs/stable/single_node_setup.html|Single-Node 
Setup]]
+  * [[http://hadoop.apache.org/docs/stable/cluster_setup.html|Cluster Setup]]
  
+ Note for the above Apache links: if you're having trouble getting "ssh localhost" to work on the following operating systems:
- The instructions below are
- based on the docs found at the 
[[http://hadoop.apache.org/common/docs/current/cluster_setup.html#Configurationml
 | Hadoop Cluster Setup/Configuration]].
- 
- Please note the instructions were last updated to match Release 0.21.0. 
Things may have changed since then. If they have, please update this page.
- 
- == Requirements ==
-  * Java 1.6+ (see HadoopJavaVersions for 1.6.X version details)
-  * ssh and sshd
-  * rsync
- 
- == Preparatory Steps ==
- Download
- 
- '''Release versions''' can be found at http://hadoop.apache.org/core/releases.html
- 
- '''Subversion:'''
- First check that the current build isn't borked:
- http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/
- 
- Then grab the latest with subversion 
- {{{svn co http://svn.apache.org/repos/asf/hadoop/core/trunk hadoop}}}
- 
- 
- run the following commands:
- {{{
- cd hadoop
- ant 
- ant examples
- bin/hadoop
- }}}
- `bin/hadoop` should display the basic command-line help and let you know it's at least basically working. If any of the above steps failed, use Subversion to roll back to an earlier day's revision.
- 
- == Stage 1: Standalone Operation ==
- By default, Hadoop is configured to run things in a non-distributed mode, as 
a single Java process. This is useful for debugging, and can be demonstrated as 
follows:
- {{{
- mkdir input
- cp conf/*.xml input
- bin/hadoop jar hadoop-mapred-examples-0.21.0.jar grep input output 
'security[a-z.]+'
- cat output/*
- }}}
- 
- Obviously the version number on the jar may have changed by the time you read this. You should see a lot of INFO-level log messages go by when you run it, and `cat output/*` should give you something that looks like this:
- 
- {{{
- cat output/*
- 1     security.task.umbilical.protocol.acl
- 1     security.refresh.policy.protocol.acl
- 1     security.namenode.protocol.acl
- 1     security.job.submission.protocol.acl
- 1     security.inter.tracker.protocol.acl
- 1     security.inter.datanode.protocol.acl
- 1     security.datanode.protocol.acl
- ...(and so on)
- }}}
- 
- If you saw the error `Exception in thread "main" java.lang.NoClassDefFoundError: hadoop-mapred-examples-0/21/0/jar`, it means you forgot to type `jar` after `bin/hadoop`. If you were unable to run this example, roll back to a previous night's version. If it seemed to run fine but `cat` didn't spit anything out, you probably mistyped something. Try copying the command directly from the wiki to avoid typos. You'll need to wipe out the output directory between each run.
- 
- Congratulations, you have just successfully run your first MapReduce job with Hadoop.
- 
- == Stage 2: Pseudo-distributed Configuration ==
- You can in fact run everything on a single host. To run things this way, put 
the following in `conf/hdfs-site.xml` (`conf/hadoop-site.xml` in versions < 
0.20)
- {{{
- <configuration>
- 
-   <property>
-     <name>fs.default.name</name>
-     <value>localhost:9000</value>
-   </property>
- 
-   <property>
-     <name>mapred.job.tracker</name>
-     <value>localhost:9001</value>
-   </property>
- 
-   <property>
-     <name>dfs.replication</name>
-     <value>1</value>
-       <!-- set to 1 to reduce warnings when 
-       running on a single node -->
-   </property>
- 
- </configuration>
- }}}
- 
- Now check that the command 
- `ssh localhost`
- does not require a password. If it does, set up passwordless ssh. For 
example, you can execute the following commands:
- {{{
- ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
- cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
- }}}
- 
- Now, try `ssh localhost` again. If this doesn't work, you're going to have to figure out what's going on with your `ssh-agent` on your own.
  
  '''Windows Users''' To start the ssh server, you need to run "ssh-host-config -y" in a Cygwin environment. If it asks for the CYGWIN environment variable value, set it to "ntsec tty". Afterwards, you can start the server from Cygwin with "cygrunsrv --start sshd" or from the Windows command line with "net start sshd".
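  Putting those commands together, a typical Cygwin session might look like the following minimal sketch (exact prompts and service behavior vary between Cygwin versions):
  {{{
  # configure and install the sshd service, answering yes to the script's questions
  ssh-host-config -y
  # start the service from within Cygwin...
  cygrunsrv --start sshd
  # ...or from a Windows command prompt instead:
  net start sshd
  }}}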
  
  '''Mac Users''' In recent versions of OS X, ssh-agent is already set up with launchd and keychain. This can be verified by executing "echo $SSH_AUTH_SOCK" in your favorite shell. You can use "ssh-add -k" and "ssh-add -K" to add your keys and passphrases to your keychain.
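  A quick check might look like the following sketch (the key path and the expected socket path are just examples; adjust them to your setup):
  {{{
  # launchd should already have created an agent socket for this session
  echo $SSH_AUTH_SOCK
  # typically prints something like /private/tmp/com.apple.launchd.XXXXXXXX/Listeners
  # add a key and store its passphrase in the keychain (the -K option is macOS-specific)
  ssh-add -K ~/.ssh/id_dsa
  }}}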
  
+ Multi-node cluster setup is largely similar to single-node 
(pseudo-distributed) setup, except for the following:
- === Bootstrapping ===
- A new distributed filesystem must be formatted with the following command, 
run on the master node:
- 
- {{{bin/hadoop namenode -format}}}
- 
- If asked to [re]format, you must reply Y (not just y) if you want to 
reformat, else Hadoop will abort the format.
- 
- You should see a quick series of `STARTUP_MSG`s and a `SHUTDOWN_MSG`
- 
- Open the {{{conf/hadoop-env.sh}}} file and define {{{JAVA_HOME}}} in it.
- Then start up the Hadoop daemon with 
- 
- {{{bin/start-all.sh}}}
- 
- It should notify you that it's starting the `namenode`, `datanode`, 
`secondarynamenode`, and `jobtracker`. 
- 
- Input files are copied into the distributed filesystem as follows: 
- {{{bin/hadoop dfs -put <localsrc> <dst>}}}
- For more details just type `bin/hadoop dfs` with no options.
- 
- To shutdown:
- 
- {{{bin/stop-all.sh}}}
- 
- === Browsing to the Services ===
- 
- Once the pseudo-distributed cluster is live, you can point your web browser at it by connecting to localhost at the chosen ports.
- If you have left the values at their defaults, the page PseudoDistributedHadoop provides shortcuts to these pages.
- 
- == Stage 3: Fully-distributed operation ==
- 
- Fully distributed operation is just like the pseudo-distributed operation described above, except that you specify:
  
   1. The hostname or IP address of your master server in the value for fs.default.name, as hdfs://master.example.com/ in conf/core-site.xml.
   1. The host and port of your master server in the value of mapred.job.tracker, as master.example.com:port in conf/mapred-site.xml (a minimal sketch of both files follows this list).
@@ -153, +31 @@

   1. mapred.map.tasks and mapred.reduce.tasks in conf/mapred-site.xml. As a 
rule of thumb, use 10x the number of slave processors for mapred.map.tasks, and 
2x the number of slave processors for mapred.reduce.tasks.
   1. Finally, list all slave hostnames or IP addresses in your conf/slaves 
file, one per line. Then format your filesystem and start your cluster on your 
master node, as above.
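  A minimal sketch of the corresponding entries, using the placeholder host master.example.com from the list above (port 9001 and the task counts are example values; adjust them to your cluster):
  {{{
  <!-- conf/core-site.xml -->
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://master.example.com/</value>
    </property>
  </configuration>

  <!-- conf/mapred-site.xml -->
  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>master.example.com:9001</value>
    </property>
    <property>
      <!-- rule of thumb: roughly 10x the number of slave processors -->
      <name>mapred.map.tasks</name>
      <value>20</value>
    </property>
    <property>
      <!-- rule of thumb: roughly 2x the number of slave processors -->
      <name>mapred.reduce.tasks</name>
      <value>4</value>
    </property>
  </configuration>
  }}}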
  
- See 
[[http://hadoop.apache.org/common/docs/current/cluster_setup.html#Configurationml
 | Hadoop Cluster Setup/Configuration]] for details.
+ See [[http://hadoop.apache.org/common/docs/stable/cluster_setup.html#Configuration | Hadoop Cluster Setup/Configuration]] for details.
  
