Hey, that book doesn't include material about Hadoop MR2. It would be worth looking into some of Arun C Murthy's presentations.
On Tue, Nov 22, 2011 at 6:53 AM, hari708 <hari...@gmail.com> wrote:
>
> Hi,
> Please help me on this.
> I have a big file consisting of XML data. The XML is not represented as
> a single line in the file. If we stream this file to a Hadoop directory
> using the ./hadoop dfs -put command, how does the distribution happen?
> Basically, in my MapReduce program I am expecting a complete XML
> document as my input. I have a custom reader (for XML) in my MapReduce
> job configuration. My main confusion is that when the NameNode
> distributes data to DataNodes, there is a chance that one part of an XML
> document goes to one DataNode and the other half goes to another
> DataNode. If that is the case, will my custom XML reader in the
> MapReduce job be able to combine them (since MapReduce reads data
> locally only)?
>
> oleksiy wrote:
> >
> > Hello,
> >
> > Sorry for the late answer (I didn't have time).
> > The first thing I would like to clarify is what you mean by
> > "unstructured data". Could you give me an example of this data? You
> > should keep in mind that Hadoop is effective only for processing
> > particular types of tasks. For instance, how would you compute a
> > median using Hadoop MapReduce? That kind of problem is not a good fit
> > for Hadoop.
> >
> > So, let me give you a small description of how Hadoop works with
> > respect to what you wrote. Let's look at a sample (the simple
> > MapReduce word count app from the Hadoop site):
> > We have a 1 GB unstructured text file (let it be some book). We save
> > this book to HDFS, which by default divides the data into blocks of
> > 64 MB and replicates each block to 3 different nodes. So now we have
> > a 1 GB file split into blocks and spread across the HDFS cluster.
> >
> > When we run a MapReduce job, Hadoop automatically computes how many
> > tasks it needs to process this data. Say Hadoop creates 16 tasks, one
> > per 64 MB block.
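hari708's worry above (half an XML record on one DataNode, half on another) is handled by the record reader, not by HDFS. Below is a minimal Python sketch of the rule Hadoop's LineRecordReader applies at split boundaries, using newline-delimited records as a stand-in for one XML document per record; this is illustrative only, not Hadoop's actual Java code.

```python
# Illustrative sketch of Hadoop's LineRecordReader rule, in Python.
# Rule: a split that does not start at byte 0 skips everything up to and
# including the first newline (the previous split owns that record), and
# every split reads one record past its own end to finish the record
# that crosses the boundary. Net effect: each record is read exactly
# once, even when it straddles a block/split boundary.

def read_split(data: bytes, start: int, length: int):
    end = start + length
    pos = start
    if start > 0:
        nl = data.find(b"\n", start)
        if nl == -1:
            return  # no record begins inside this split
        pos = nl + 1
    while pos < len(data) and pos <= end:
        nl = data.find(b"\n", pos)
        if nl == -1:
            yield data[pos:]  # last record, no trailing newline
            return
        yield data[pos:nl]
        pos = nl + 1

records = b"alpha\nbravo\ncharlie\ndelta\necho\n"
# Four artificial 8-byte "blocks" over the 31-byte file:
splits = [(0, 8), (8, 8), (16, 8), (24, 7)]
out = [rec for (s, l) in splits for rec in read_split(records, s, l)]
print(out)  # every record exactly once, in order
```

Each split skips the partial record at its start and reads one record past its end, so a record that straddles a block boundary is read exactly once, by the split in which it begins (the few remote bytes are fetched from the neighbouring block over the network). A custom XML reader has to follow the same discipline for its own record delimiters.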
> > And in this situation task 1 will process the first 64 MB block,
> > which is located on node 1 (for instance), task 2 processes the
> > second 64 MB block, which is located on machine 2, and so on.
> >
> > In this situation each map task processes its own piece of data (in
> > our case, 64 MB).
> >
> > Also, one note regarding metadata: only the NameNode contains
> > metadata info. So, in our example, the NameNode knows that we have a
> > 1 GB file split into 64 MB blocks, giving 16 pieces spread across the
> > cluster. Beyond this, Hadoop does not need to know the real structure
> > of the data. In our example we have a simple book, and by default
> > Hadoop uses "TextInputFormat" for processing plain text files. In
> > this case, when Hadoop reads the data, the key it hands to the mapper
> > is the byte offset at which the line starts in the file, and the
> > value is the line itself. It does not need to know anything else
> > about the format.
> >
> > That's it :)
> >
> > panamamike wrote:
> >>
> >> oleksiy wrote:
> >>>
> >>> Hello,
> >>>
> >>> I would suggest you read at least this piece of the documentation:
> >>> http://hadoop.apache.org/common/docs/r0.20.204.0/hdfs_design.html#NameNode+and+DataNodes
> >>> (HDFS Architecture)
> >>>
> >>> This is the main part of the HDFS architecture. There you can find
> >>> some info on how a client reads data from different nodes.
> >>> I would also suggest a good book:
> >>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449389732
> >>> (Tom White - Hadoop: The Definitive Guide, 2nd Edition, 2010).
> >>> There you will definitely find answers to all your questions.
> >>>
> >>> Regards,
> >>> Oleksiy
> >>>
> >>> panamamike wrote:
> >>>>
> >>>> I'm new to Hadoop. I've read a few articles and presentations
> >>>> aimed at explaining what Hadoop is and how it works. Currently my
> >>>> understanding is that Hadoop is an MPP system which leverages a
> >>>> large block size to quickly find data.
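oleksiy's note above about TextInputFormat can be sketched in a few lines of Python (illustrative only, not Hadoop's Java implementation): the key passed to the mapper is the byte offset where each line starts, and the value is the line itself; Hadoop never interprets the content.

```python
# Illustrative sketch of what TextInputFormat hands to a mapper:
# (byte_offset, line) pairs, with the trailing newline stripped.

def text_input_records(data: bytes):
    offset = 0
    while offset < len(data):
        nl = data.find(b"\n", offset)
        if nl == -1:
            yield offset, data[offset:]  # last line, no newline
            return
        yield offset, data[offset:nl]
        offset = nl + 1

book = b"call me ishmael\nsome years ago\n"
for key, value in text_input_records(book):
    print(key, value)
# 0 b'call me ishmael'
# 16 b'some years ago'
```

The offsets are what make keys unique and cheap to compute; nothing about them depends on understanding the file's format.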
> >>>> In theory, I understand how a large block size, together with an
> >>>> MPP architecture and what I understand to be a massive indexing
> >>>> scheme via MapReduce, can be used to find data.
> >>>>
> >>>> What I don't understand is how, after you identify the appropriate
> >>>> 64 MB block, you find the data you're specifically after. Does
> >>>> this mean the CPU has to search the entire 64 MB block for the
> >>>> data of interest? If so, how does Hadoop know what data from that
> >>>> block to retrieve?
> >>>>
> >>>> I'm assuming the block is probably composed of one or more files.
> >>>> If not, I'm assuming the user isn't looking for the entire 64 MB
> >>>> block but rather a portion of it.
> >>>>
> >>>> Any help indicating documentation, books, or articles on the
> >>>> subject would be much appreciated.
> >>>>
> >>>> Regards,
> >>>>
> >>>> Mike
> >>>>
> >>>
> >>
> >> Oleksiy,
> >>
> >> Thank you for your input. I've actually read that section of the
> >> Hadoop documentation. I think it does a good job of describing the
> >> general architecture of how Hadoop works. The description reminds me
> >> of the Teradata MPP architecture. The thing that I'm missing is how
> >> Hadoop finds things.
> >>
> >> I see how Hadoop can potentially narrow searches down by using
> >> metadata indexes to find the large 64 MB blocks (I'm calling these
> >> large since typical blocks are measured in bytes). However, when it
> >> does find this block, how does it search within the block? Does it
> >> then come down to a brute-force search of the 64 MB, and are systems
> >> just fast enough these days that such a search isn't a big deal?
> >>
> >> Going back to my comparison to Teradata: Teradata had a weakness in
> >> that the speed of the MPP architecture was dependent on the quality
> >> of the data distribution index.
> >> Meaning, there had to be a way for the system to determine how to
> >> store data across the commodity hardware in order to have an even
> >> distribution. If the distribution isn't even, meaning that based on
> >> the index defined most data goes to one node in the system, you get
> >> something called "hot AMPing" (AMP being a Teradata Access Module
> >> Processor), where the MPP advantage is lost because the majority of
> >> the work is directed to that one node.
> >>
> >> How does Hadoop tackle this particular issue? Really, when it comes
> >> down to it, how does Hadoop distribute the data, balance the load,
> >> and keep up the parallel performance? This gets back to my question
> >> of how Hadoop finds things quickly. I know that in Teradata it's
> >> based on the design of the main index. My assumption is that Hadoop
> >> does something similar with the metadata, but that would mean
> >> unstructured data would have to be associated with some sort of
> >> metadata tags.
> >>
> >> Furthermore, that unstructured data could only be found if the
> >> correct metadata key values are searched. Is this the way it works?
> >>
> >> Mike
> >>
>
> --
> View this message in context:
> http://old.nabble.com/Need-help-understanding-Hadoop-Architecture-tp32705405p32871905.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.

--
Regards,
R.V.
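On Mike's distribution question above: HDFS places blocks without inspecting their content, so storage is spread evenly by construction, and each map task simply scans its block in full (the brute-force reading Mike guesses at; there is no index). What does get hashed is the shuffle: map output keys are routed to reducers by the default HashPartitioner, using (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. Below is a small Python re-implementation for illustration; the String.hashCode recurrence (h = 31*h + c in signed 32-bit arithmetic) and the partitioning formula match Hadoop's defaults, while the word list is made up.

```python
# Sketch of Hadoop's default HashPartitioner, reimplemented in Python.

def java_string_hashcode(s: str) -> int:
    """Mimic Java's String.hashCode: h = 31*h + c, signed 32-bit."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def partition(key: str, num_reducers: int) -> int:
    """Hadoop's rule: (hashCode & Integer.MAX_VALUE) % numReduceTasks."""
    return (java_string_hashcode(key) & 0x7FFFFFFF) % num_reducers

words = ["hadoop", "teradata", "namenode", "datanode", "block", "index"]
for w in words:
    print(w, "-> reducer", partition(w, 3))
```

The same key always lands on the same reducer, which is what makes the reduce side correct; skew appears only when one key dominates the data, not because of how blocks were stored. So nothing in Hadoop depends on a well-chosen distribution index the way Teradata's primary index does.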