Hey everyone. I started learning about Hadoop a few weeks ago. My task is to understand the Hadoop ecosystem and to be able to answer some questions about it. I started by reading the book "O'Reilly - Hadoop: The Definitive Guide". After reading it I had a first idea of how the components work together, but the book didn't really help me understand what's going on. In my opinion it goes into fairly general in-depth detail about the various components, and that didn't help me understand the Hadoop ecosystem as a whole.
So I started to work with it hands-on. I installed a VM (SUSE Leap 42.1) and followed the https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html guide. After that I started working with files on it: I wrote my first simple mapper and reducer, and I analyzed my Apache log for some testing. That worked well so far. But let's face my problems:

1) All I know about installing Hadoop right now is: unpack a .tar.gz, run some shell scripts, and everything runs fine. I have no clue which components are now installed on the VM, or where they are located.

2) Furthermore, I'm missing all kinds of information about setting these up. At one point the Apache guide says: "Now check that you can ssh to the localhost without a passphrase" and "If you cannot ssh to localhost without a passphrase, execute the following commands:". Well, I'd like to know what I'm actually doing here. I mean, WHY do I need ssh running on localhost, and WHY does it have to work without a passphrase? What other ways of configuring this exist?

3) Same with the next step: "The following instructions are to run a MapReduce job locally. If you want to execute a job on YARN, see YARN on Single Node." and "Format the filesystem: $ bin/hdfs namenode -format". I have no clue how HDFS works internally. For me, a filesystem is something where I set up partitions and mount them on folders. So how am I supposed to explain HDFS to someone else? I understand the storing of data: splitting files into blocks, spreading them around the cluster, storing metadata. But if someone asks me "How can this be called a filesystem if you install it by unpacking a .tar.gz?", I simply can't answer that question in any way.

So I'm now looking for documentation/a guide that covers:
- Which requirements do I have?
-- Do I have to use a specific filesystem? If yes/no, why, or what would you recommend?
-- How should I partition my VM?
-- On which partition should I install which components?
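For reference, this is the entire configuration the single-node guide has you write before running `namenode -format` -- just two properties, which is part of why I feel like I'm missing something. My guesses at what they mean are in the comments:

```xml
<!-- etc/hadoop/core-site.xml: fs.defaultFS makes hdfs://localhost:9000
     the default filesystem URI, i.e. the address clients use to reach
     the NameNode. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml: dfs.replication is how many DataNodes
     each block is copied to; 1 because there is only one node. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

As far as I can tell, with nothing else configured the NameNode and DataNode data ends up under /tmp/hadoop-<username>, which would at least answer my "which partition" question for the single-node case -- can anyone confirm that?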
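On the ssh question, here is what I've pieced together so far (please correct me if this is wrong): the helper scripts like start-dfs.sh apparently use ssh to log in to every node of the cluster and start the daemons there, and on a single-node setup that node is simply localhost. Since a script can't type a passphrase, the login has to be key-based. These are the exact commands from the guide, with my current understanding as comments:

```shell
# Generate a key pair with an empty passphrase (-P ''), so ssh can
# log in non-interactively. ~/.ssh/id_rsa is the default key the
# ssh client offers.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

# Authorize that key for logins to this same machine: sshd accepts
# any public key listed in ~/.ssh/authorized_keys.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# sshd refuses an authorized_keys file with loose permissions.
chmod 0600 ~/.ssh/authorized_keys
```

Is that right? And is passphrase-less ssh only a convenience of the start scripts, i.e. could I instead start each daemon by hand on each machine and skip ssh entirely?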
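On the filesystem question, my current working mental model is this (again, please correct me): HDFS isn't a filesystem in the kernel/partition sense at all, it's a set of user-space Java daemons. A DataNode stores each block as a plain file on whatever local filesystem (ext4, btrfs, ...) its data directory sits on, and the NameNode keeps the metadata saying which blocks make up which file. That would explain why unpacking a .tar.gz is enough, and why `hdfs namenode -format` only initializes a metadata directory rather than touching any partition. A toy sketch of the idea, nothing Hadoop-specific, just blocks-as-plain-files (the 1 KB "block size" is made up; real HDFS blocks default to 128 MB):

```shell
# Create a 4 KB test file.
dd if=/dev/urandom of=bigfile bs=1024 count=4 2>/dev/null

# Split it into 1 KB "blocks" -- the way a DataNode keeps blk_* files
# as ordinary files in its data directory on the local filesystem.
split -b 1024 -d bigfile blk_
ls blk_*    # blk_00 blk_01 blk_02 blk_03

# The "metadata" a NameNode holds is essentially: bigfile = blk_00..blk_03.
# With that, the original file can be reassembled from its blocks:
cat blk_00 blk_01 blk_02 blk_03 > reassembled
cmp bigfile reassembled && echo "identical"
```

Is that roughly how it works, minus the replication and the network layer?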
- Setting up a VM with Hadoop
- Configuring Hadoop step by step
- Setting up all kinds of daemons/nodes manually, with an explanation of where they are located, how they work, and how they should be configured

I'm currently reading https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html, but from a first read this guide only tells you what to write in which configuration file, not why you should (or shouldn't) do it. After getting an idea of what Hadoop is, I feel like I've been left alone in the dark. I hope some of you can show me a way to get back on the road. For me it's very important not to just write some configuration somewhere; I need to understand what's going on, because once I have a running cluster, I need to be sure I can handle all this stuff before going into production with it.

Best Regards
Mike