Hey everyone. A few weeks ago I started learning about Hadoop. I was given the 
task of understanding the Hadoop ecosystem and being able to answer some 
questions about it. I started by reading the O'Reilly book "Hadoop: The 
Definitive Guide". After reading it I had a first idea of how the components 
work together, but the book didn't really help me understand what is going on. 
In my opinion it mostly gives in-depth details about the various components, 
and that didn't help me understand the Hadoop ecosystem as a whole.

So I started working with it myself. I installed a VM (SUSE Leap 42.1) and followed the 
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
guide. After that I started working with files on it. I wrote my first simple 
mapper and reducer and analyzed my Apache access log as a test. That worked 
fine so far.
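(To show what I mean by "simple": think of a Hadoop Streaming style pair along 
these lines, e.g. counting HTTP status codes in the access log. The scripts 
below are just an illustrative sketch, not exactly what I wrote.

    #!/usr/bin/env python
    # mapper.py - emit the HTTP status code of each request as "<status>\t1"
    # (assumes the common/combined log format, where field 9 is the status code)
    import sys
    for line in sys.stdin:
        parts = line.split()
        if len(parts) > 8:
            print("%s\t1" % parts[8])

    #!/usr/bin/env python
    # reducer.py - sum the counts per status code
    # (Streaming sorts the mapper output by key before it reaches the reducer)
    import sys
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                print("%s\t%d" % (current_key, count))
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print("%s\t%d" % (current_key, count))

Such a pair can be run through the hadoop-streaming jar under 
share/hadoop/tools/lib/, so nothing fancy.)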

But now to my problems:
1) Everything I know about installing Hadoop right now amounts to: unpack a 
.tar.gz, run some shell scripts, and everything works. I have no clue at all 
which components are now installed on the VM, or where they are located.
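(As far as I can tell, the whole "installation" is just the unpacked directory 
tree, roughly:

    hadoop-x.y.z/
        bin/           client commands: hadoop, hdfs, yarn, mapred
        sbin/          start/stop scripts: start-dfs.sh, start-yarn.sh, ...
        etc/hadoop/    configuration: core-site.xml, hdfs-site.xml, yarn-site.xml, ...
        share/hadoop/  the jars for common, hdfs, mapreduce and yarn
        logs/          daemon logs, created once the daemons are running

but that doesn't tell me which daemons exist, what they do, or where their data 
ends up.)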

2) Furthermore, I'm missing all kinds of information about setting these 
components up. At one point the Apache guide says "Now check that you can ssh 
to the localhost without a passphrase" and "If you cannot ssh to localhost 
without a passphrase, execute the following commands:". Well, I'd like to know 
what I am actually doing here?! I mean, WHY do I need ssh running on localhost, 
and WHY does it have to work without a passphrase? What other ways of 
configuring this exist?
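(The commands the guide has you run at that point are:

    $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    $ chmod 0600 ~/.ssh/authorized_keys

so I can see that this generates a key pair and authorizes it for localhost, 
but not why Hadoop needs that.)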

3) The same with the next point: "The following instructions are to run a 
MapReduce job locally. If you want to execute a job on YARN, see YARN on Single 
Node." and "Format the filesystem: $ bin/hdfs namenode -format". I have no clue 
how HDFS works internally. To me, a filesystem is something I set up on 
partitions and mount on directories. So how am I supposed to explain HDFS to 
someone else? I understand the storage side: splitting files into blocks, 
spreading them across the cluster, storing metadata. But if someone asks me 
"How can this be called a filesystem if you install it by unpacking a 
.tar.gz?", I simply can't answer that question in any way.
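(To be concrete, what I actually did with it, following the guide, was along 
these lines; the file and directory names are just examples:

    $ bin/hdfs namenode -format
    $ sbin/start-dfs.sh
    $ bin/hdfs dfs -mkdir -p /user/mike
    $ bin/hdfs dfs -put access.log /user/mike/
    $ bin/hdfs dfs -ls /user/mike

So from the outside it behaves like a filesystem, I just can't explain what is 
behind it.)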

So I'm now looking for documentation/a guide covering:
- What the requirements are:
-- Do I have to use a specific (local) filesystem? Why or why not, and what 
would you recommend?
-- How should I partition my VM?
-- On which partition should I install which components?
- Setting up a VM with Hadoop
- Configuring Hadoop step by step (so far the only configuration I have touched 
is the minimal single-node one shown below)
- Setting up all the daemons/nodes manually, with an explanation of where they 
live, how they work, and how they should be configured
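(For reference, the only configuration the single-node guide had me put under 
etc/hadoop/ is roughly this, for pseudo-distributed mode:

    core-site.xml:
        <configuration>
            <property>
                <name>fs.defaultFS</name>
                <value>hdfs://localhost:9000</value>
            </property>
        </configuration>

    hdfs-site.xml:
        <configuration>
            <property>
                <name>dfs.replication</name>
                <value>1</value>
            </property>
        </configuration>

Everything else is apparently running on defaults I know nothing about.)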

I'm right now reading: 
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html
but from a first read, this guide tells you what to write into which 
configuration file, but not why you should (or should not) do so. After getting 
a first idea of what Hadoop is, I feel a bit left alone in the dark. I hope 
some of you can show me a way back onto the road.
For me it's very important not to just write some configuration somewhere. I 
need to understand what is going on, because once I have a running cluster I 
have to be able to handle all of this before we go into production with it.

Best Regards
Mike
