Hey, that book doesn't include material about Hadoop MR2. It would be
worth looking at some of Arun C. Murthy's presentations.

On Tue, Nov 22, 2011 at 6:53 AM, hari708 <hari...@gmail.com> wrote:

>
> Hello,
> Please help me with this.
> I have a big file consisting of XML data, and an XML record is not
> represented as a single line in the file. If we load this file into a
> Hadoop directory using ./hadoop dfs -put, how does the distribution
> happen?
> Basically, in my MapReduce program I am expecting a complete XML record
> as my input, and I have a custom reader (for XML) in my MapReduce job
> configuration. My main confusion is: since the NameNode distributes data
> across DataNodes, there is a chance that one part of an XML record goes
> to one DataNode and the other half goes to another. If that is the case,
> will my custom XML reader in the MapReduce job be able to combine them
> (as MapReduce reads data locally only)?
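>
> For reference, this is roughly the core of my custom reader (a minimal
> sketch in Java; the class name and the helper are my own placeholders,
> loosely modeled on the readUntilMatch idea used by XML record readers):
>
>     import java.io.IOException;
>     import org.apache.hadoop.fs.FSDataInputStream;
>     import org.apache.hadoop.io.DataOutputBuffer;
>
>     public class XmlScan {
>         // Read until `match` (an opening or closing tag) is seen. When
>         // buf is non-null we are inside a record, so we keep reading even
>         // past `end` (the split boundary); FSDataInputStream continues
>         // into the next HDFS block transparently, which is what would let
>         // a record spanning two blocks be reassembled by one map task.
>         public static boolean readUntilMatch(FSDataInputStream in,
>                 byte[] match, long end, DataOutputBuffer buf)
>                 throws IOException {
>             int i = 0;
>             while (true) {
>                 int b = in.read();
>                 if (b == -1) return false;        // end of file
>                 if (buf != null) buf.write(b);    // collect record bytes
>                 if (b == match[i]) {
>                     if (++i >= match.length) return true;  // tag matched
>                 } else {
>                     // simple reset (fine for tags without repeated prefixes)
>                     i = 0;
>                     // between records: stop once past the split end
>                     if (buf == null && in.getPos() >= end) return false;
>                 }
>             }
>         }
>     }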
>
> oleksiy wrote:
> >
> > Hello,
> >
> > Sorry for the late answer (I didn't have time).
> > The first thing I would like to clarify is what you mean by
> > "unstructured data". Could you give me an example of this data? You
> > should keep in mind that Hadoop is effective only for particular types
> > of tasks. For example, how would you compute a median using Hadoop
> > MapReduce? That kind of problem is not a good fit for Hadoop.
> >
> > So, let me give you a short description of how Hadoop works with
> > respect to what you wrote. Let's look at a sample (the simple MapReduce
> > word-count app from the Hadoop site):
> > We have a 1 GB unstructured text file (let it be some book). We save
> > this book to HDFS, which by default divides the data into 64 MB blocks
> > and replicates each block to 3 different nodes. So now we have a 1 GB
> > file split into blocks and spread across the HDFS cluster.
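> >
> > If you want to see this for yourself, the following sketch (standard
> > FileSystem API; the path and class name are made up) prints where each
> > block of the file lives:
> >
> >     import org.apache.hadoop.conf.Configuration;
> >     import org.apache.hadoop.fs.BlockLocation;
> >     import org.apache.hadoop.fs.FileStatus;
> >     import org.apache.hadoop.fs.FileSystem;
> >     import org.apache.hadoop.fs.Path;
> >
> >     public class ShowBlocks {
> >         public static void main(String[] args) throws Exception {
> >             FileSystem fs = FileSystem.get(new Configuration());
> >             FileStatus st = fs.getFileStatus(new Path("/books/book.txt"));
> >             // One BlockLocation per 64 MB block, listing its replica hosts.
> >             for (BlockLocation b :
> >                     fs.getFileBlockLocations(st, 0, st.getLen())) {
> >                 System.out.println(b.getOffset() + " len=" + b.getLength()
> >                     + " on " + java.util.Arrays.toString(b.getHosts()));
> >             }
> >         }
> >     }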
> >
> > When we run a MapReduce job, Hadoop automatically computes how many
> > tasks it needs to process this data, by default one map task per block.
> > So here Hadoop creates 16 tasks: task 1 processes the first 64 MB,
> > which is located on node 1 (for instance), task 2 processes the second
> > 64 MB, which is located on node 2, and so on.
> >
> > In this way each map task processes its own piece of the data (in our
> > case, 64 MB).
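> >
> > A minimal driver sketch (assuming the newer org.apache.hadoop.mapreduce
> > API; class names and paths are placeholders). Note there is nothing
> > here about splits or task counts; Hadoop derives them from the input
> > file's blocks:
> >
> >     import org.apache.hadoop.conf.Configuration;
> >     import org.apache.hadoop.fs.Path;
> >     import org.apache.hadoop.io.IntWritable;
> >     import org.apache.hadoop.io.Text;
> >     import org.apache.hadoop.mapreduce.Job;
> >     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> >     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> >     import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;
> >
> >     public class WordCountDriver {
> >         public static void main(String[] args) throws Exception {
> >             Job job = new Job(new Configuration(), "word count");
> >             job.setJarByClass(WordCountDriver.class);
> >             job.setMapperClass(WordCountMapper.class); // sketched below
> >             job.setReducerClass(IntSumReducer.class);  // sums the 1s
> >             job.setOutputKeyClass(Text.class);
> >             job.setOutputValueClass(IntWritable.class);
> >             // No split settings: one map task per input block by default.
> >             FileInputFormat.addInputPath(job, new Path("/books/book.txt"));
> >             FileOutputFormat.setOutputPath(job, new Path("/out/wc"));
> >             System.exit(job.waitForCompletion(true) ? 0 : 1);
> >         }
> >     }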
> >
> > Also, one note regarding metadata: only the NameNode contains metadata
> > info. So, in our example, the NameNode knows that we have a 1 GB file
> > split into 64 MB blocks, i.e. 16 pieces spread across the cluster.
> > Hadoop does not need to know the real structure of the data. In our
> > example we have a simple book, and by default Hadoop uses
> > TextInputFormat for processing plain text files. In this case, when
> > Hadoop reads the data, the key is the byte offset of the line within
> > the file and the value is the line itself, so it does not need to know
> > the format.
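> >
> > For completeness, here is a sketch of the word-count mapper that
> > consumes those (offset, line) pairs (standard Hadoop API; only the
> > class name is my own):
> >
> >     import java.io.IOException;
> >     import java.util.StringTokenizer;
> >     import org.apache.hadoop.io.IntWritable;
> >     import org.apache.hadoop.io.LongWritable;
> >     import org.apache.hadoop.io.Text;
> >     import org.apache.hadoop.mapreduce.Mapper;
> >
> >     public class WordCountMapper
> >             extends Mapper<LongWritable, Text, Text, IntWritable> {
> >         private static final IntWritable ONE = new IntWritable(1);
> >         private final Text word = new Text();
> >
> >         @Override
> >         protected void map(LongWritable offset, Text line, Context ctx)
> >                 throws IOException, InterruptedException {
> >             // offset = byte position of this line (TextInputFormat key)
> >             // line   = the text of the line itself (the value)
> >             StringTokenizer tok = new StringTokenizer(line.toString());
> >             while (tok.hasMoreTokens()) {
> >                 word.set(tok.nextToken());
> >                 ctx.write(word, ONE);   // emit (word, 1)
> >             }
> >         }
> >     }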
> >
> > that's it :)
> >
> >
> >
> >
> > panamamike wrote:
> >>
> >>
> >>
> >> oleksiy wrote:
> >>>
> >>> Hello,
> >>>
> >>> I would suggest that you read at least this piece of the docs on the
> >>> HDFS architecture:
> >>>
> >>> http://hadoop.apache.org/common/docs/r0.20.204.0/hdfs_design.html#NameNode+and+DataNodes
> >>>
> >>> This is the main part of the HDFS architecture. There you can find
> >>> some info on how a client reads data from the different nodes.
> >>> I would also suggest a good book: Tom White, "Hadoop: The Definitive
> >>> Guide", 2nd Edition, 2010.
> >>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449389732
> >>> There you will definitely find answers to all your questions.
> >>>
> >>> Regards,
> >>> Oleksiy
> >>>
> >>>
> >>> panamamike wrote:
> >>>>
> >>>> I'm new to Hadoop. I've read a few articles and presentations aimed
> >>>> at explaining what Hadoop is and how it works. Currently my
> >>>> understanding is that Hadoop is an MPP system that leverages a large
> >>>> block size to find data quickly. In theory, I understand how a large
> >>>> block size, together with an MPP architecture and what I understand
> >>>> to be a massive indexing scheme via MapReduce, can be used to find
> >>>> data.
> >>>>
> >>>> What I don't understand is how, after you identify the appropriate
> >>>> 64MB block, you find the data you're specifically after. Does this
> >>>> mean the CPU has to search the entire 64MB block for the data of
> >>>> interest? If so, how does Hadoop know what data to retrieve from
> >>>> that block?
> >>>>
> >>>> I'm assuming the block is probably composed of one or more files.
> >>>> If not, I'm assuming the user isn't looking for the entire 64MB
> >>>> block but rather a portion of it.
> >>>>
> >>>> Any help pointing to documentation, books, or articles on the
> >>>> subject would be much appreciated.
> >>>>
> >>>> Regards,
> >>>>
> >>>> Mike
> >>>>
> >>>
> >>>
> >>
> >> Oleksiy,
> >>
> >> Thank you for your input. I've actually already read that section of
> >> the Hadoop documentation. I think it does a good job of describing the
> >> general architecture of how Hadoop works. The description reminds me
> >> of the Teradata MPP architecture. The thing I'm missing is how Hadoop
> >> finds things.
> >>
> >> I see how Hadoop can potentially narrow searches down by leveraging
> >> metadata indexes to find the large 64MB blocks (I'm calling these
> >> large since typical filesystem blocks are measured in kilobytes).
> >> However, when it does find a block, how does it search within the
> >> block? Does it then come down to a brute-force search of the 64MB,
> >> with systems these days just fast enough that such a search isn't a
> >> big deal?
> >>
> >> Going back to my comparison to Teradata: Teradata had a weakness in
> >> that the speed of the MPP architecture was dependent on the quality of
> >> the data distribution index. Meaning, there had to be a way for the
> >> system to determine how to store data across the commodity hardware so
> >> as to have an even distribution. If the distribution isn't even,
> >> meaning that based on the defined index most data goes to one node in
> >> the system, you get something called hot AMPing, where the MPP
> >> advantage is lost because the majority of the work is directed to that
> >> one node.
> >>
> >> How does Hadoop tackle this particular issue? Really, when it comes
> >> down to it, how does Hadoop distribute the data, balance the data
> >> load, and keep up the parallel performance? This gets back to my
> >> question of how Hadoop finds things quickly. I know that in Teradata
> >> it's based on the design of the primary index. My assumption is that
> >> Hadoop does something similar with the metadata, but that would mean
> >> unstructured data has to be associated with some sort of metadata
> >> tags.
> >>
> >> Furthermore, that unstructured data could only be found if the correct
> >> metadata key values are searched. Is this the way it works?
> >>
> >> Mike
> >>
> >
> >
>


-- 
Regards,
R.V.
