Let's say you have one very large input file of the form:

A|B|C|D
E|F|G|H
...
1|2|3|4

This input file will be broken up into N pieces, where N is the number of
mappers that run. The location of these splits is semi-arbitrary. This means
that unless you have one mapper, no single mapper will be able to see the
entire contents of a column. Given that you would need one mapper to be able
to see the entirety of a column, you've now essentially reduced your problem
to a single machine. You may want to play with the following idea: collect
key => column_number and value => column_contents in your map step. This
means that you would be able to see the entirety of a column in your reduce
step, though you're still faced with the tasks of shuffling and re-pivoting.
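Concretely, the map and reduce steps might look something like this (a
minimal, untested sketch against the 0.18 mapred API; the class names are
made up, and the pivot back is still left undone):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class ColumnShuffle {

      // Map: emit (column_number, column_contents) for every cell, so
      // that each reduce call sees the complete contents of one column.
      public static class PivotMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, IntWritable, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<IntWritable, Text> out,
                        Reporter reporter) throws IOException {
          String[] cells = line.toString().split("\\|");
          for (int col = 0; col < cells.length; col++) {
            out.collect(new IntWritable(col), new Text(cells[col]));
          }
        }
      }

      // Reduce: one call per column; shuffle the column and write it out
      // as one row of the pivoted table. Pivoting back is not handled.
      public static class ShuffleReducer extends MapReduceBase
          implements Reducer<IntWritable, Text, IntWritable, Text> {
        public void reduce(IntWritable col, Iterator<Text> values,
                           OutputCollector<IntWritable, Text> out,
                           Reporter reporter) throws IOException {
          List<String> column = new ArrayList<String>();
          while (values.hasNext()) {
            column.add(values.next().toString()); // copy; Text is reused
          }
          Collections.shuffle(column);
          StringBuilder row = new StringBuilder();
          for (int i = 0; i < column.size(); i++) {
            if (i > 0) row.append('|');
            row.append(column.get(i));
          }
          out.collect(col, new Text(row.toString()));
        }
      }
    }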
Does this clear up your confusion? Let me know if you'd like me to clarify
more.

Alex

On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi <[EMAIL PROTECTED]>
wrote:

> I am not sure why this doesn't fit, maybe you can help me understand. Your
> previous comment was...
>
> "The reason I'm making this claim is because in order to do the pivot
> operation you must know about every row. Your input files will be split at
> semi-arbitrary places, essentially making it impossible for each mapper to
> know every single row."
>
> Are you saying that my row segments might not actually be the entire row,
> so I will get a bad key index? If so, how would the row segments be
> determined? I based my initial work on the word count example, where the
> lines are tokenized. Does this mean that in this example the row tokens
> may not be the complete row?
>
> Thanks.
>
> Terrence A. Pietrondi
>
>
> --- On Fri, 10/3/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
>
> > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Friday, October 3, 2008, 7:14 PM
> > The approach that you've described does not fit well into the
> > MapReduce paradigm. You may want to consider randomizing your data in
> > a different way.
> >
> > Unfortunately some things can't be solved well with MapReduce, and I
> > think this is one of them.
> >
> > Can someone else say more?
> >
> > Alex
> >
> > On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi
> > <[EMAIL PROTECTED]> wrote:
> >
> > > Sorry for the confusion, I did make some typos. My example should
> > > have looked like...
> > >
> > > > A|B|C
> > > > D|E|G
> > > >
> > > > pivots to...
> > > >
> > > > D|A
> > > > E|B
> > > > G|C
> > > >
> > > > Then for each row, shuffle the contents around randomly...
> > > >
> > > > D|A
> > > > B|E
> > > > C|G
> > > >
> > > > Then pivot the data back...
> > > >
> > > > A|E|G
> > > > D|B|C
> > >
> > > The general goal is to shuffle the elements in each column of the
> > > input data. Meaning, the ordering of the elements in each column
> > > will not be the same as in the input.
> > >
> > > If you look at the initial input and compare it to the final output,
> > > you'll see that during the shuffling, B and E are swapped, and G and
> > > C are swapped, while A and D were shuffled back into their
> > > originating positions in the column.
> > >
> > > Once again, sorry for the typos and confusion.
> > >
> > > Terrence A. Pietrondi
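On a single machine, the whole pivot-shuffle-pivot sequence collapses to
shuffling each column in place; a plain-Java sketch of the transformation
the example above describes (hypothetical class name, unrelated to the SVN
code referenced further down):

    import java.util.Arrays;
    import java.util.Collections;

    public class ColumnShuffleDemo {
      public static void main(String[] args) {
        String[][] table = { { "A", "B", "C" },
                             { "D", "E", "G" } };

        // Shuffling each column in place has the same effect as
        // pivot -> shuffle each row -> pivot back.
        for (int col = 0; col < table[0].length; col++) {
          String[] column = new String[table.length];
          for (int row = 0; row < table.length; row++) {
            column[row] = table[row][col];
          }
          // Arrays.asList is backed by the array, so this shuffles it.
          Collections.shuffle(Arrays.asList(column));
          for (int row = 0; row < table.length; row++) {
            table[row][col] = column[row];
          }
        }

        // One possible output: A|E|G then D|B|C.
        for (String[] row : table) {
          StringBuilder sb = new StringBuilder();
          for (int i = 0; i < row.length; i++) {
            if (i > 0) sb.append('|');
            sb.append(row[i]);
          }
          System.out.println(sb);
        }
      }
    }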
> > >
> > > --- On Fri, 10/3/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > > Subject: Re: architecture diagram
> > > > To: core-user@hadoop.apache.org
> > > > Date: Friday, October 3, 2008, 11:01 AM
> > > > Can you confirm that the example you've presented is accurate? I
> > > > think you may have made some typos, because the letter "G" isn't
> > > > in the final result; I also think your first pivot accidentally
> > > > swapped C and G. I'm having a hard time understanding what you
> > > > want to do, because it seems like your operations differ from your
> > > > example.
> > > >
> > > > With that said, at first glance, this problem may not fit well
> > > > into the MapReduce paradigm. The reason I'm making this claim is
> > > > because in order to do the pivot operation you must know about
> > > > every row. Your input files will be split at semi-arbitrary
> > > > places, essentially making it impossible for each mapper to know
> > > > every single row. There may be a way to do this by collecting, in
> > > > your map step, key => column number (0, 1, 2, etc) and value =>
> > > > (A, B, C, etc), though you may run into problems when you try to
> > > > pivot back. I say this because when you pivot back, you need to
> > > > have each column, which means you'll need one reduce step. There
> > > > may be a way to put the pivot-back operation in a second
> > > > iteration, though I don't think that would help you.
> > > >
> > > > Terrence, please confirm that you've defined your example
> > > > correctly. In the meantime, can someone else confirm that this
> > > > problem does not fit well into the MapReduce paradigm?
> > > >
> > > > Alex
> > > >
> > > > On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi
> > > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > I am trying to write a map reduce implementation to do the
> > > > > following:
> > > > >
> > > > > 1) read tabular data delimited in some fashion
> > > > > 2) pivot that data, so the rows are columns and the columns are
> > > > >    rows
> > > > > 3) shuffle the rows (that were the columns) to randomize the data
> > > > > 4) pivot the data back
> > > > >
> > > > > For example...
> > > > >
> > > > > A|B|C
> > > > > D|E|G
> > > > >
> > > > > pivots to...
> > > > >
> > > > > D|A
> > > > > E|B
> > > > > C|G
> > > > >
> > > > > Then for each row, shuffle the contents around randomly...
> > > > >
> > > > > D|A
> > > > > B|E
> > > > > G|C
> > > > >
> > > > > Then pivot the data back...
> > > > >
> > > > > A|E|C
> > > > > D|B|C
> > > > >
> > > > > You can reference my progress so far...
> > > > >
> > > > > http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
> > > > >
> > > > > Terrence A. Pietrondi
> > > > >
> > > > >
> > > > > --- On Thu, 10/2/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > > > > Subject: Re: architecture diagram
> > > > > > To: core-user@hadoop.apache.org
> > > > > > Date: Thursday, October 2, 2008, 1:36 PM
> > > > > > I think it really depends on the job as to where logic goes.
> > > > > > Sometimes your reduce step is as simple as an identity
> > > > > > function, and sometimes it can be more complex than your map
> > > > > > step. It all depends on your data and the operation(s) you're
> > > > > > trying to perform.
> > > > > >
> > > > > > Perhaps we should step out of the abstract. Do you have a
> > > > > > specific problem you're trying to solve? Can you describe it?
> > > > > >
> > > > > > Alex
> > > > > >
> > > > > > On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi
> > > > > > <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > > I am sorry for the confusion. I meant distributed data.
> > > > > > >
> > > > > > > So help me out here. For example, if I am reducing to a
> > > > > > > single file, then my main transformation logic would be in
> > > > > > > my mapping step, since I am reducing away from the data?
> > > > > > >
> > > > > > > Terrence A. Pietrondi
> > > > > > > http://del.icio.us/tepietrondi
> > > > > > >
> > > > > > >
> > > > > > > --- On Wed, 10/1/08, Alex Loddengaard <[EMAIL PROTECTED]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > > > > > > Subject: Re: architecture diagram
> > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > Date: Wednesday, October 1, 2008, 7:44 PM
> > > > > > > > I'm not sure what you mean by "disconnected parts of
> > > > > > > > data," but Hadoop is implemented to try and perform map
> > > > > > > > tasks on machines that have input data. This is to lower
> > > > > > > > the amount of network traffic, hence making the entire job
> > > > > > > > run faster. Hadoop does all this for you under the hood.
> > > > > > > > From a user's point of view, all you need to do is store
> > > > > > > > data in HDFS (the distributed filesystem), and run
> > > > > > > > MapReduce jobs on that data. Take a look here:
> > > > > > > >
> > > > > > > > <http://wiki.apache.org/hadoop/WordCount>
> > > > > > > >
> > > > > > > > Alex
> > > > > > > >
> > > > > > > > On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi
> > > > > > > > <[EMAIL PROTECTED]> wrote:
> > > > > > > >
> > > > > > > > > So to be "distributed" in a sense, you would want to do
> > > > > > > > > your computation on the disconnected parts of data in
> > > > > > > > > the map phase, I would guess?
> > > > > > > > >
> > > > > > > > > Terrence A. Pietrondi
> > > > > > > > > http://del.icio.us/tepietrondi
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --- On Wed, 10/1/08, Arun C Murthy <[EMAIL PROTECTED]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > From: Arun C Murthy <[EMAIL PROTECTED]>
> > > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > > Date: Wednesday, October 1, 2008, 2:16 PM
> > > > > > > > > > On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I am trying to plan out my map-reduce
> > > > > > > > > > > implementation, and I have some questions about
> > > > > > > > > > > where computation should be split in order to take
> > > > > > > > > > > advantage of the distributed nodes.
> > > > > > > > > > >
> > > > > > > > > > > Looking at the architecture diagram
> > > > > > > > > > > (http://hadoop.apache.org/core/images/architecture.gif),
> > > > > > > > > > > are the map boxes the major computation areas, or is
> > > > > > > > > > > the reduce the major computation area?
> > > > > > > > > >
> > > > > > > > > > Usually the maps perform the 'embarrassingly parallel'
> > > > > > > > > > computational steps, where-in each map works
> > > > > > > > > > independently on a 'split' of your input, and the
> > > > > > > > > > reduces perform the 'aggregate' computations.
> > > > > > > > > >
> > > > > > > > > > From http://hadoop.apache.org/core/ :
> > > > > > > > > >
> > > > > > > > > > Hadoop implements MapReduce, using the Hadoop
> > > > > > > > > > Distributed File System (HDFS). MapReduce divides
> > > > > > > > > > applications into many small blocks of work. HDFS
> > > > > > > > > > creates multiple replicas of data blocks for
> > > > > > > > > > reliability, placing them on compute nodes around the
> > > > > > > > > > cluster. MapReduce can then process the data where it
> > > > > > > > > > is located.
> > > > > > > > > >
> > > > > > > > > > The Hadoop Map-Reduce framework is quite good at
> > > > > > > > > > scheduling your 'maps' on the actual data-nodes where
> > > > > > > > > > the input-blocks are present, leading to i/o
> > > > > > > > > > efficiencies...
> > > > > > > > > >
> > > > > > > > > > Arun
> > > > > > > > > >
> > > > > > > > > > > Thanks.
> > > > > > > > > > >
> > > > > > > > > > > Terrence A. Pietrondi
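Arun's split between 'embarrassingly parallel' maps and 'aggregate' reduces
is exactly the shape of the WordCount example linked earlier in the thread;
a condensed sketch of it against the 0.18-era mapred API (driver code
omitted):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

      // "Embarrassingly parallel": each map sees only its own split.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> out,
                        Reporter reporter) throws IOException {
          StringTokenizer tokens = new StringTokenizer(line.toString());
          while (tokens.hasMoreTokens()) {
            out.collect(new Text(tokens.nextToken()), ONE);
          }
        }
      }

      // "Aggregate": each reduce sees every count emitted for one word.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> out,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (counts.hasNext()) {
            sum += counts.next().get();
          }
          out.collect(word, new IntWritable(sum));
        }
      }
    }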