Hello,

Sorry for the late answer (didn't have time). The first thing I would like to clarify is what you mean by "unstructured data". Could you give me an example of such data? Keep in mind that Hadoop is effective only for particular types of tasks. For example, how would you compute a median with Hadoop MapReduce? That kind of problem is not a good fit for Hadoop.
Now let me give you a short description of how Hadoop works with regard to what you wrote. Let's look at a sample (the simple MapReduce word-count app from the Hadoop site). We have a 1 GB unstructured text file (let it be some book). We save this book to HDFS, which by default splits the data into 64 MB blocks and replicates each block to 3 different nodes. So now we have a 1 GB file split into blocks and spread across the HDFS cluster.

When we run a MapReduce job, Hadoop automatically computes how many tasks it needs to process this data. In our case it creates 16 map tasks (1 GB / 64 MB = 16 blocks). Task 1 processes the first 64 MB block, which sits on node 1 (for instance), task 2 processes the second 64 MB block, which sits on node 2, and so on. Each map task processes its own piece of data (in our case, one 64 MB block).

One more note regarding metadata: only the NameNode holds the metadata. So in our example the NameNode knows that we have a 1 GB file split into 64 MB blocks, 16 pieces spread across the cluster. Knowing that, Hadoop does not need to know the real structure of the data. In our example we have a simple book, and by default Hadoop uses "TextInputFormat" for plain text files. When Hadoop reads the data, the key for each record is the position (byte offset) of the line in the file and the value is the line itself, so it does not need to know anything else about the format. There is a small sketch of what the mapper sees at the bottom of this mail.

That's it :)

panamamike wrote:
>
> oleksiy wrote:
>>
>> Hello,
>>
>> I would suggest you to read at least this piece of info:
>> http://hadoop.apache.org/common/docs/r0.20.204.0/hdfs_design.html#NameNode+and+DataNodes
>> HDFS Architecture
>>
>> This is the main part of the HDFS architecture. There you can find some info
>> on how a client reads data from different nodes.
>> I would also suggest a good book:
>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449389732 Tom
>> White - Hadoop: The Definitive Guide - 2010, 2nd Edition
>> There you will definitely find answers to all your questions.
>>
>> Regards,
>> Oleksiy
>>
>> panamamike wrote:
>>>
>>> I'm new to Hadoop. I've read a few articles and presentations which are
>>> directed at explaining what Hadoop is and how it works. Currently my
>>> understanding is that Hadoop is an MPP system which leverages a large
>>> block size to quickly find data. In theory, I understand how a large
>>> block size along with an MPP architecture, as well as what I understand
>>> to be a massive index scheme via MapReduce, can be used to find data.
>>>
>>> What I don't understand is how, after you identify the appropriate 64MB
>>> block, do you find the data you're specifically after? Does this
>>> mean the CPU has to search the entire 64MB block for the data of
>>> interest? If so, how does Hadoop know what data from that block to
>>> retrieve?
>>>
>>> I'm assuming the block is probably composed of one or more files. If
>>> not, I'm assuming the user isn't looking for the entire 64MB block but
>>> rather a portion of it.
>>>
>>> Any help indicating documentation, books, or articles on the subject
>>> would be much appreciated.
>>>
>>> Regards,
>>>
>>> Mike
>>>
>>
>
> Oleksiy,
>
> Thank you for your input. I've actually read that section of the Hadoop
> documentation. I think it does a good job of describing the general
> architecture of how Hadoop works. The description reminds me of the
> Teradata MPP architecture. The thing that I'm missing is: how does Hadoop
> find things?
>
> I see how Hadoop can potentially narrow searches down by leveraging
> metadata indexes to find the large 64MB blocks (I'm calling these large
> since typical blocks are measured in bytes). However, when it does find
> this block, how does it search within the block? Does it then get down to
> a brute-force search of the 64MB, and are systems just fast enough these
> days that such a search isn't a big deal?
>
> Going back to my comparison to Teradata: Teradata had a weakness in that
> the speed of the MPP architecture was dependent on the quality of the data
> distribution index. Meaning, there had to be a way for the system to
> determine how to store data across the commodity hardware in order to have
> an even distribution. If the distribution isn't even, meaning that based
> on the index defined most data goes to one node in the system, you get
> something called hot amping, where the MPP advantage is lost because the
> majority of the work is being directed to the one node.
>
> How does Hadoop tackle this particular issue? Really, when it comes down
> to it, how does Hadoop distribute the data, balance the load, and keep up
> the parallel performance? This gets back to my question of how does
> Hadoop find things quickly? I know in Teradata it's based on the design of
> the main index. My assumption is that Hadoop does something similar with
> the metadata, but then that means unstructured data would have to be
> associated with some sort of metadata tags.
>
> Furthermore, that unstructured data could only be found if the correct
> metadata key values are searched. Is this the way it works?
>
> Mike
>
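P.S. Here is the small sketch I mentioned above, assuming the word-count example and the org.apache.hadoop.mapreduce API; the class names are placeholders of mine, not classes shipped with Hadoop. With TextInputFormat (the default for plain text) every map() call receives the byte offset of a line as the key and the line itself as the value, which is why the mapper never has to understand the overall layout of the book.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // With TextInputFormat the framework calls map() once per line:
    // the key is the byte offset of the line, the value is the line itself.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    // The reducer just sums the 1s emitted for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}

If you ran this over our 1 GB book, each of the 16 map tasks would apply exactly this map() method to the lines of its own 64 MB block, and the reducers would then sum the counts per word. No knowledge of the file's structure beyond "it is lines of text" is needed anywhere.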