Glad we could help, Terrence. The second pivot might be tricky; you may
have to run a second iteration. I haven't thought the problem all the way
through, though.

Good luck.

Alex
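A second iteration amounts to chaining two jobs. Below is a minimal,
hypothetical sketch of the driver wiring in the old
org.apache.hadoop.mapred API. PivotMapper, ShuffleReducer, PivotBackMapper,
and PivotBackReducer are illustration classes sketched further down this
thread, not code from the csvdatamix repository, and the intermediate path
is arbitrary.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class PivotDriver {
      public static void main(String[] args) throws Exception {
        // Job 1: pivot and shuffle. Emits (column, "row|cell") pairs.
        JobConf pivot = new JobConf(PivotDriver.class);
        pivot.setJobName("pivot-shuffle");
        pivot.setMapperClass(PivotMapper.class);
        pivot.setReducerClass(ShuffleReducer.class);
        pivot.setOutputKeyClass(IntWritable.class);
        pivot.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(pivot, new Path(args[0]));
        FileOutputFormat.setOutputPath(pivot, new Path("/tmp/pivoted"));
        JobClient.runJob(pivot); // blocks until job 1 finishes

        // Job 2: pivot back, reading job 1's output.
        JobConf back = new JobConf(PivotDriver.class);
        back.setJobName("pivot-back");
        back.setMapperClass(PivotBackMapper.class);
        back.setReducerClass(PivotBackReducer.class);
        back.setOutputKeyClass(IntWritable.class);
        back.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(back, new Path("/tmp/pivoted"));
        FileOutputFormat.setOutputPath(back, new Path(args[1]));
        JobClient.runJob(back);
      }
    }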
On Wed, Oct 8, 2008 at 1:02 PM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:

> I think I can figure this out now and get it to work. I will check back
> in if I get it. All that is missing at the moment is my pivot-back
> mapping step. Thanks for the help.
>
> Terrence A. Pietrondi

--- On Tue, 10/7/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> Thanks for the clarification, Samuel. I wasn't aware that parts of a line
> might be emitted, depending on the split, while using TextInputFormat.
> Terrence, this means that you'll have to take the approach of collecting
> key => column_number, value => column_contents in your map step.
>
> Alex

On Mon, Oct 6, 2008 at 6:41 PM, Samuel Guo <[EMAIL PROTECTED]> wrote:

> I think the 'split' Alex talked about is the MapReduce system's action;
> the 'split' you are describing is your mapper's action.
>
> I guess that your map/reduce application uses *TextInputFormat* to read
> your input file.
>
> Your input file will first be divided into a few splits; each split looks
> like <filename, offset, length>. What Alex said about "the location of
> these splits is semi-arbitrary" means that each split's offset into your
> input file is semi-arbitrary. Am I right, Alex?
> Then *TextInputFormat* will translate each file split into a sequence of
> lines, where the offset is treated as the key and the line as the value.
>
> Because the file splits are cut by byte offset, a line in your file may
> straddle two splits. The *LineRecordReader* used by *TextInputFormat*
> skips the half-baked line at the start of a split, to make sure that
> every mapper gets whole lines, one by one.
>
> For example, a file as below:
>
> ....
> AAA BBB CCC DDD
> EEE FFF GGG HHH
> AAA BBB CCC DDD
> ....
>
> may be divided into two file splits (assuming there are two mappers):
>
> split one:
> ....
> AAA BBB CCC
>
> split two:
> DDD
> EEE FFF GGG HHH
> AAA BBB CCC DDD
> ....
>
> Take split two as an example: TextInputFormat will use LineRecordReader
> to translate it into a sequence of <offset, line> pairs, skipping the
> half-baked first line "DDD". So the sequence will be:
>
> <offset1, "EEE FFF GGG HHH">
> <offset2, "AAA BBB CCC DDD">
> ....
>
> Then what you do with the lines depends on your job.
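Given Samuel's explanation, each map() call sees one whole line, so the
map step Alex suggests is straightforward. Here is a minimal sketch in the
old org.apache.hadoop.mapred API, assuming '|'-delimited rows as in the
examples in this thread; it is an illustration, not the PivotMapper from
the csvdatamix repository:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Emits (column_number, cell) for every field of every line, so each
    // reduce call sees the full contents of one column.
    public class PivotMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {
      public void map(LongWritable offset, Text line,
                      OutputCollector<IntWritable, Text> output,
                      Reporter reporter) throws IOException {
        String[] fields = line.toString().split("\\|");
        for (int i = 0; i < fields.length; i++) {
          output.collect(new IntWritable(i), new Text(fields[i]));
        }
      }
    }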
On Tue, Oct 7, 2008 at 5:55 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:

> So looking at the following mapper...
>
> http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup
>
> On line 32, you can see the row split via a delimiter. On line 43, you
> can see that the field index (the column index) is the map key, and the
> map value is the field contents. How is this incorrect? I think this
> follows your earlier suggestion of:
>
> "You may want to play with the following idea: collect key =>
> column_number and value => column_contents in your map step."
>
> Terrence A. Pietrondi

--- On Mon, 10/6/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> As far as I know, splits will never be made within a line, only between
> rows. To answer your question about ways to control the splits, see
> below:
>
> <http://wiki.apache.org/hadoop/HowManyMapsAndReduces>
> <http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html>
>
> Alex

On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:

> Can you explain "the location of these splits is semi-arbitrary"? What if
> the example was...
>
> AAA|BBB|CCC|DDD
> EEE|FFF|GGG|HHH
>
> Does this mean the split might fall inside CCC, so that the first line
> results in AAA|BBB|C and C|DDD? Is there a way to control this behavior
> to split on my delimiter?
>
> Terrence A. Pietrondi

--- On Sun, 10/5/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> Let's say you have one very large input file of the form:
>
> A|B|C|D
> E|F|G|H
> ...
> |1|2|3|4
>
> This input file will be broken up into N pieces, where N is the number of
> mappers that run. The location of these splits is semi-arbitrary. This
> means that unless you have one mapper, you won't be able to see the
> entire contents of a column in your mapper. Given that you would need one
> mapper to be able to see the entirety of a column, you've now essentially
> reduced your problem to a single machine.
>
> You may want to play with the following idea: collect key =>
> column_number and value => column_contents in your map step. This means
> that you would be able to see the entirety of a column in your reduce
> step, though you're still faced with the tasks of shuffling and
> re-pivoting.
>
> Does this clear up your confusion? Let me know if you'd like me to
> clarify more.
>
> Alex
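The shuffling half of that reduce step could look like the following
hedged sketch, again in the old mapred API: the reducer sees one whole
column per key, shuffles it, and re-emits it. The row tag appended to each
cell is an assumption of this sketch, added so that a later pivot-back job
has something to group on.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Receives one whole column per key, shuffles it, and re-emits it
    // with fresh row numbers.
    public class ShuffleReducer extends MapReduceBase
        implements Reducer<IntWritable, Text, IntWritable, Text> {
      public void reduce(IntWritable column, Iterator<Text> cells,
                         OutputCollector<IntWritable, Text> output,
                         Reporter reporter) throws IOException {
        List<String> values = new ArrayList<String>();
        while (cells.hasNext()) {
          // copy the contents: Hadoop reuses the Text object it hands in
          values.add(cells.next().toString());
        }
        Collections.shuffle(values);
        // Tag each cell with its new row number for the pivot-back job.
        for (int row = 0; row < values.size(); row++) {
          output.collect(column, new Text(row + "|" + values.get(row)));
        }
      }
    }

Without the row tag, plain (column, cell) pairs would lose every row
position, and the pivot back discussed below would have nothing to
reassemble rows from. The tagging assumes every column has the same number
of cells, which holds for well-formed delimited input.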
On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:

> I am not sure why this doesn't fit; maybe you can help me understand.
> Your previous comment was...
>
> "The reason I'm making this claim is because in order to do the pivot
> operation you must know about every row. Your input files will be split
> at semi-arbitrary places, essentially making it impossible for each
> mapper to know every single row."
>
> Are you saying that my row segments might not actually be the entire row,
> so I will get a bad key index? If so, how would the row segments be
> determined? I based my initial work off of the word count example, where
> the lines are tokenized. Does this mean that in this example the row
> tokens may not be the complete row?
>
> Thanks.
>
> Terrence A. Pietrondi

--- On Fri, 10/3/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> The approach that you've described does not fit well into the MapReduce
> paradigm. You may want to consider randomizing your data in a different
> way.
>
> Unfortunately some things can't be solved well with MapReduce, and I
> think this is one of them.
>
> Can someone else say more?
>
> Alex
On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:

> Sorry for the confusion, I did make some typos. My example should have
> looked like...
>
> A|B|C
> D|E|G
>
> pivots to...
>
> D|A
> E|B
> G|C
>
> Then for each row, shuffle the contents around randomly...
>
> D|A
> B|E
> C|G
>
> Then pivot the data back...
>
> A|E|G
> D|B|C
>
> The general goal is to shuffle the elements in each column of the input
> data. Meaning, the ordering of the elements in each column will not be
> the same as in the input.
>
> If you look at the initial input and compare it to the final output,
> you'll see that during the shuffling, B and E are swapped, and G and C
> are swapped, while A and D were shuffled back into their originating
> positions in the column.
>
> Once again, sorry for the typos and confusion.
>
> Terrence A. Pietrondi

--- On Fri, 10/3/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> Can you confirm that the example you've presented is accurate? I think
> you may have made some typos, because the letter "G" isn't in the final
> result; I also think your first pivot accidentally swapped C and G. I'm
> having a hard time understanding what you want to do, because it seems
> like your operations differ from your example.
>
> With that said, at first glance, this problem may not fit well into the
> MapReduce paradigm. The reason I'm making this claim is because in order
> to do the pivot operation you must know about every row. Your input files
> will be split at semi-arbitrary places, essentially making it impossible
> for each mapper to know every single row. There may be a way to do this
> by collecting, in your map step, key => column number (0, 1, 2, etc.) and
> value => (A, B, C, etc.), though you may run into problems when you try
> to pivot back. I say this because when you pivot back, you need to have
> each column, which means you'll need one reduce step. There may be a way
> to put the pivot-back operation in a second iteration, though I don't
> think that would help you.
>
> Terrence, please confirm that you've defined your example correctly. In
> the meantime, can someone else confirm that this problem does not fit
> well into the MapReduce paradigm?
>
> Alex
On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:

> I am trying to write a map reduce implementation to do the following:
>
> 1) read tabular data delimited in some fashion
> 2) pivot that data, so the rows are columns and the columns are rows
> 3) shuffle the rows (that were the columns) to randomize the data
> 4) pivot the data back
>
> For example...
>
> A|B|C
> D|E|G
>
> pivots to...
>
> D|A
> E|B
> C|G
>
> Then for each row, shuffle the contents around randomly...
>
> D|A
> B|E
> G|C
>
> Then pivot the data back...
>
> A|E|C
> D|B|C
>
> You can reference my progress so far...
>
> http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
>
> Terrence A. Pietrondi
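For step 4, and assuming job 1 writes the (column, "row|cell") pairs
sketched earlier, separated by TextOutputFormat's default tab, the pivot
back could be a second job along these hypothetical lines; this is one
possible shape for the "second iteration" idea, not code from the thread:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.TreeMap;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Job 2 mapper: input lines look like "column<TAB>row|cell". Re-key
    // by row, keeping the column in the value so the reducer can restore
    // cell order.
    public class PivotBackMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {
      public void map(LongWritable offset, Text line,
                      OutputCollector<IntWritable, Text> output,
                      Reporter reporter) throws IOException {
        String[] parts = line.toString().split("\t");
        String column = parts[0];
        String[] rowAndCell = parts[1].split("\\|", 2);
        int row = Integer.parseInt(rowAndCell[0]);
        output.collect(new IntWritable(row),
                       new Text(column + "|" + rowAndCell[1]));
      }
    }

    // Job 2 reducer: sort one row's cells by column number and emit the
    // reassembled '|'-delimited line.
    class PivotBackReducer extends MapReduceBase
        implements Reducer<IntWritable, Text, IntWritable, Text> {
      public void reduce(IntWritable row, Iterator<Text> cells,
                         OutputCollector<IntWritable, Text> output,
                         Reporter reporter) throws IOException {
        TreeMap<Integer, String> byColumn = new TreeMap<Integer, String>();
        while (cells.hasNext()) {
          String[] parts = cells.next().toString().split("\\|", 2);
          byColumn.put(Integer.valueOf(parts[0]), parts[1]);
        }
        StringBuilder line = new StringBuilder();
        for (String cell : byColumn.values()) {
          if (line.length() > 0) line.append('|');
          line.append(cell);
        }
        output.collect(row, new Text(line.toString()));
      }
    }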
--- On Thu, 10/2/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> I think it really depends on the job as to where the logic goes.
> Sometimes your reduce step is as simple as an identity function, and
> sometimes it can be more complex than your map step. It all depends on
> your data and the operation(s) you're trying to perform.
>
> Perhaps we should step out of the abstract. Do you have a specific
> problem you're trying to solve? Can you describe it?
>
> Alex

On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:

> I am sorry for the confusion. I meant distributed data.
>
> So help me out here. For example, if I am reducing to a single file, then
> my main transformation logic would be in my mapping step, since I am
> reducing away from the data?
>
> Terrence A. Pietrondi
> http://del.icio.us/tepietrondi
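As a concrete form of the "reduce as simple as an identity function" case
Alex mentions above, the old mapred API ships a ready-made pass-through
reducer; a minimal, hypothetical configuration sketch:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class IdentityExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // IdentityReducer passes (key, value) pairs straight through, so
        // all of the transformation logic lives in the map step.
        conf.setReducerClass(IdentityReducer.class);
      }
    }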
--- On Wed, 10/1/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> I'm not sure what you mean by "disconnected parts of data," but Hadoop is
> implemented to try and perform map tasks on machines that have the input
> data. This is to lower the amount of network traffic, hence making the
> entire job run faster. Hadoop does all this for you under the hood. From
> a user's point of view, all you need to do is store data in HDFS (the
> distributed filesystem) and run MapReduce jobs on that data. Take a look
> here:
>
> <http://wiki.apache.org/hadoop/WordCount>
>
> Alex

On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:

> So to be "distributed" in a sense, you would want to do your computation
> on the disconnected parts of data in the map phase, I would guess?
>
> Terrence A. Pietrondi
> http://del.icio.us/tepietrondi
--- On Wed, 10/1/08, Arun C Murthy <[EMAIL PROTECTED]> wrote:

> On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:
>
>> I am trying to plan out my map-reduce implementation, and I have some
>> questions about where computation should be split in order to take
>> advantage of the distributed nodes.
>>
>> Looking at the architecture diagram
>> (http://hadoop.apache.org/core/images/architecture.gif), are the map
>> boxes the major computation areas, or is the reduce the major
>> computation area?
> Usually the maps perform the 'embarrassingly parallel' computational
> steps, where-in each map works independently on a 'split' of your input,
> and the reduces perform the 'aggregate' computations.
>
> From http://hadoop.apache.org/core/ :
>
> "Hadoop implements MapReduce, using the Hadoop Distributed File System
> (HDFS). MapReduce divides applications into many small blocks of work.
> HDFS creates multiple replicas of data blocks for reliability, placing
> them on compute nodes around the cluster. MapReduce can then process the
> data where it is located."
>
> The Hadoop Map-Reduce framework is quite good at scheduling your 'maps'
> on the actual data-nodes where the input blocks are present, leading to
> I/O efficiencies...
>
> Arun
>
>> Thanks.
>>
>> Terrence A. Pietrondi
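Arun's division of labor is exactly what the canonical word count (the
WordCount wiki page linked earlier in this thread) illustrates. For
reference, a compact version in the old mapred API: the map is the
embarrassingly parallel step, run once per split, and the reduce is the
aggregate step.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {
      // Map: runs independently on each split; emits (word, 1) per token.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, ONE);
          }
        }
      }

      // Reduce: the aggregate step; sums all of the counts for one word.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
    }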