Re: architecture diagram
Glad we could help, Terrence. The second pivot might be tricky; you may have to run a second iteration. I haven't thought the problem all the way through, though. Good luck.

Alex
Re: architecture diagram
I think I can figure this out now and get it to work. I will check back in if I get it. All that is missing at the moment is my pivot-back mapping step. Thanks for the help.

Terrence A. Pietrondi
Re: architecture diagram
Thanks for the clarification, Samuel. I wasn't aware that parts of a line might be emitted depending on the split while using TextInputFormat.

Terrence, this means that you'll have to take the approach of collecting key => column_number, value => column_contents in your map step.

Alex
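The effect of that map step plus the framework's group-by-key can be simulated in plain Java (a hypothetical sketch, not Hadoop code; note too that real Hadoop does not guarantee the order of values within a key): after grouping, each key holds an entire column.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class ColumnGrouping {
    // Simulate the map step (emit column index => field contents) plus
    // the framework's group-by-key, so each "reducer" sees a whole column.
    static TreeMap<Integer, List<String>> groupColumns(List<String> rows, String delimiter) {
        TreeMap<Integer, List<String>> columns = new TreeMap<>();
        for (String row : rows) {
            // \Q..\E quotes the delimiter so "|" is not treated as a regex operator.
            String[] fields = row.split("\\Q" + delimiter + "\\E", -1);
            for (int i = 0; i < fields.length; i++) {
                columns.computeIfAbsent(i, k -> new ArrayList<>()).add(fields[i]);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("AAA|BBB|CCC|DDD", "EEE|FFF|GGG|HHH");
        System.out.println(groupColumns(rows, "|"));
        // {0=[AAA, EEE], 1=[BBB, FFF], 2=[CCC, GGG], 3=[DDD, HHH]}
    }
}
```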
Re: architecture diagram
I think what Alex talked about as 'split' is the MapReduce system's action. What you said about 'split' is your mapper's action.

I guess that your map/reduce application uses *TextInputFormat* to read your input file.

Your input file will first be split into a few file splits; these splits may look like <offset, length> regions of the file. What Alex said about 'The location of these splits is semi-arbitrary' means that a file split's offset in your input file is semi-arbitrary. Am I right, Alex?

Then *TextInputFormat* will translate these file splits into a sequence of lines, where the offset is treated as the key and the line is treated as the value.

Because the file splits are made by offset, some lines in your file may be split across different file splits. The *LineRecordReader* used by *TextInputFormat* skips the half-baked (partial) line at the start of such a file split, to make sure that every mapper gets whole lines, one by one.

For example, a file like this:

AAA BBB CCC DDD
EEE FFF GGG HHH
AAA BBB CCC DDD

may be split into two file splits (assuming there are two mappers).

Split one:

AAA BBB CCC

Split two:

DDD
EEE FFF GGG HHH
AAA BBB CCC DDD

Take split two as an example: TextInputFormat will use LineRecordReader to translate split two into a sequence of <offset, line> pairs, and it will skip the first half-baked line "DDD". So the sequence will be:

<offset, "EEE FFF GGG HHH">
<offset, "AAA BBB CCC DDD">

Then what to do with the lines depends on your job.
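The skip-the-partial-first-line behavior Samuel describes can be illustrated with a small plain-Java sketch (hypothetical code, not Hadoop's actual LineRecordReader): a reader whose split starts past offset 0 first discards characters up to and including the first newline, then emits every line that starts inside its split, reading past the split boundary to finish the last line.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    // Emit every line whose first character falls inside [splitStart, splitEnd).
    // If the split starts mid-file, skip up to and including the first '\n',
    // mimicking how a line reader avoids emitting a half-baked line.
    static List<String> linesForSplit(String data, int splitStart, int splitEnd) {
        int pos = splitStart;
        if (pos > 0) {
            while (pos < data.length() && data.charAt(pos - 1) != '\n') pos++;
        }
        List<String> lines = new ArrayList<>();
        while (pos < splitEnd && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            int lineEnd = (nl == -1) ? data.length() : nl;
            lines.add(data.substring(pos, lineEnd)); // a line may extend past splitEnd
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        String data = "AAA BBB CCC DDD\nEEE FFF GGG HHH\nAAA BBB CCC DDD\n";
        // Split the file at byte 11, in the middle of the first line.
        System.out.println(linesForSplit(data, 0, 11));  // [AAA BBB CCC DDD]
        System.out.println(linesForSplit(data, 11, 48)); // [EEE FFF GGG HHH, AAA BBB CCC DDD]
    }
}
```

On Samuel's example file, the first split emits only the complete first line, and the second split drops the dangling "DDD" tail and emits the two whole lines that start inside it.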
Re: architecture diagram
This mapper does follow my original suggestion, though I'm not familiar with how the delimiter works in this example. Anyone else?

Alex
Re: architecture diagram
So looking at the following mapper...

http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup

On line 32, you can see the row split via a delimiter. On line 43, you can see that the field index (the column index) is the map key, and the map value is the field contents. How is this incorrect? I think this follows your earlier suggestion of:

"You may want to play with the following idea: collect key => column_number and value => column_contents in your map step."

Terrence A. Pietrondi
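The map step Terrence describes (the line numbers above refer to the linked PivotMapper file) amounts to: split the row on a delimiter and emit one (column index, field contents) pair per column. A plain-Java sketch of that logic — hypothetical code with no Hadoop types, not the actual PivotMapper source — might look like:

```java
import java.util.ArrayList;
import java.util.List;

public class PivotMapSketch {
    // For one input row: split on the delimiter, then emit one
    // (column index, field contents) pair per column.
    static List<String[]> mapRow(String row, String delimiter) {
        List<String[]> pairs = new ArrayList<>();
        // \Q..\E quotes the delimiter so "|" is taken literally, not as regex.
        String[] fields = row.split("\\Q" + delimiter + "\\E", -1);
        for (int col = 0; col < fields.length; col++) {
            pairs.add(new String[] { Integer.toString(col), fields[col] });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : mapRow("AAA|BBB|CCC|DDD", "|")) {
            System.out.println(pair[0] + " => " + pair[1]);
        }
        // prints: 0 => AAA, 1 => BBB, 2 => CCC, 3 => DDD
    }
}
```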
Re: architecture diagram
As far as I know, splits will never be made within a line, only between rows. To answer your question about ways to control the splits, see below:

<http://wiki.apache.org/hadoop/HowManyMapsAndReduces>
<http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html>

Alex
Re: architecture diagram
Can you explain "The location of these splits is semi-arbitrary"? What if the example was...

AAA|BBB|CCC|DDD
EEE|FFF|GGG|HHH

Does this mean the split might be between CCC such that it results in AAA|BBB|C and C|DDD for the first line? Is there a way to control this behavior to split on my delimiter?

Terrence A. Pietrondi
Re: architecture diagram
Let's say you have one very large input file of the form:

A|B|C|D
E|F|G|H
...
1|2|3|4

This input file will be broken up into N pieces, where N is the number of mappers that run. The location of these splits is semi-arbitrary. This means that unless you have one mapper, you won't be able to see the entire contents of a column in your mapper. Given that you would need one mapper to be able to see the entirety of a column, you've now essentially reduced your problem to a single machine.

You may want to play with the following idea: collect key => column_number and value => column_contents in your map step. This means that you would be able to see the entirety of a column in your reduce step, though you're still faced with the tasks of shuffling and re-pivoting.

Does this clear up your confusion? Let me know if you'd like me to clarify more.

Alex

On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:
> I am not sure why this doesn't fit, maybe you can help me understand.
> [snip]
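To make the suggestion above concrete, here is a toy, single-process simulation in plain Python (not actual Hadoop code) of the map step collecting key => column_number and value => column_contents, so that after the framework's shuffle/sort each reduce call would see one whole column. The input rows are the example from the mail; all function names are invented for the sketch.

```python
# Toy simulation of the proposed map step: emit (column_number, field)
# pairs so each column ends up grouped under one key.
from collections import defaultdict

rows = ["A|B|C|D", "E|F|G|H", "1|2|3|4"]

def map_step(line):
    """Emit one (column_number, field) pair per delimited field."""
    for col, field in enumerate(line.split("|")):
        yield col, field

# Stand-in for the framework's shuffle/sort: group values by key.
columns = defaultdict(list)
for line in rows:
    for col, field in map_step(line):
        columns[col].append(field)

# The reduce call for key 0 would now see the entire first column.
print(columns[0])  # ['A', 'E', '1']
```

The open problem the thread keeps circling back to is the step after this: the reduce step sees whole columns, but re-pivoting the shuffled columns back into rows still has to happen somewhere.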
Re: architecture diagram
I am not sure why this doesn't fit, maybe you can help me understand. Your previous comment was...

"The reason I'm making this claim is because in order to do the pivot operation you must know about every row. Your input files will be split at semi-arbitrary places, essentially making it impossible for each mapper to know every single row."

Are you saying that my row segments might not actually be the entire row, so I will get a bad key index? If so, how would the row segments be determined? I based my initial work off of the word count example, where the lines are tokenized. Does this mean in this example the row tokens may not be the complete row?

Thanks.

Terrence A. Pietrondi

--- On Fri, 10/3/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> The approach that you've described does not fit well into the MapReduce paradigm.
> [snip]
Re: architecture diagram
The approach that you've described does not fit well into the MapReduce paradigm. You may want to consider randomizing your data in a different way.

Unfortunately, some things can't be solved well with MapReduce, and I think this is one of them.

Can someone else say more?

Alex

On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:
> Sorry for the confusion, I did make some typos. My example should have looked like...
> [snip]
Re: architecture diagram
Sorry for the confusion, I did make some typos. My example should have looked like...

> A|B|C
> D|E|G
>
> pivots to...
>
> D|A
> E|B
> G|C
>
> Then for each row, shuffle the contents around randomly...
>
> D|A
> B|E
> C|G
>
> Then pivot the data back...
>
> A|E|G
> D|B|C

The general goal is to shuffle the elements in each column in the input data. Meaning, the ordering of the elements in each column will not be the same as in the input.

If you look at the initial input and compare it to the final output, you'll see that during the shuffling, B and E are swapped, and G and C are swapped, while A and D were shuffled back into their originating positions in the column.

Once again, sorry for the typos and confusion.

Terrence A. Pietrondi

--- On Fri, 10/3/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> Can you confirm that the example you've presented is accurate?
> [snip]
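Setting aside the distributed-splits question for a moment, the stated goal can be pinned down as a small in-memory sketch: pivot rows to columns, shuffle each column independently, and pivot back. This is illustrative plain Python, not part of the project code, and the function name is made up for the sketch.

```python
# Sketch of the intended transformation: reorder the elements within
# each column while leaving the table's shape unchanged.
import random

def shuffle_columns(table, rng=random):
    cols = [list(c) for c in zip(*table)]     # pivot: rows -> columns
    for col in cols:
        rng.shuffle(col)                      # shuffle within each column
    return [list(row) for row in zip(*cols)]  # pivot back: columns -> rows

table = [["A", "B", "C"],
         ["D", "E", "G"]]
result = shuffle_columns(table)
# Each column of `result` holds the same elements as the corresponding
# input column, possibly in a different order.
```

The hard part in MapReduce is that `zip(*table)` needs every row at once, which is exactly what the arbitrary input splits take away from each mapper.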
Re: architecture diagram
Can you confirm that the example you've presented is accurate? I think you may have made some typos, because the letter "G" isn't in the final result; I also think your first pivot accidentally swapped C and G. I'm having a hard time understanding what you want to do, because it seems like your operations differ from your example.

With that said, at first glance, this problem may not fit well into the MapReduce paradigm. The reason I'm making this claim is because in order to do the pivot operation you must know about every row. Your input files will be split at semi-arbitrary places, essentially making it impossible for each mapper to know every single row. There may be a way to do this by collecting, in your map step, key => column number (0, 1, 2, etc.) and value => (A, B, C, etc.), though you may run into problems when you try to pivot back. I say this because when you pivot back, you need to have each column, which means you'll need one reduce step. There may be a way to put the pivot-back operation in a second iteration, though I don't think that would help you.

Terrence, please confirm that you've defined your example correctly. In the meantime, can someone else confirm that this problem does not fit well into the MapReduce paradigm?

Alex

On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:
> I am trying to write a map reduce implementation to do the following...
> [snip]
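The "second iteration" idea mentioned above can be roughed out as two chained jobs, simulated here in plain Python (not Hadoop): job 1 keys by column number and shuffles each column in its reduce, re-keying the output by row position; job 2 keys by row position to pivot back into rows. All names here are invented for the sketch, and it has not been thought through for the distributed case.

```python
# Two-iteration sketch: job 1 = group by column and shuffle;
# job 2 = group by row position to rebuild the rows.
import random
from collections import defaultdict

rows = [["A", "B", "C"], ["D", "E", "G"]]

# Job 1 map + framework shuffle: group fields by column number.
by_column = defaultdict(list)
for row in rows:
    for col, field in enumerate(row):
        by_column[col].append(field)

# Job 1 reduce: shuffle each column, emitting (row_index, (col, field)).
intermediate = []
for col, fields in by_column.items():
    random.shuffle(fields)
    for row_idx, field in enumerate(fields):
        intermediate.append((row_idx, (col, field)))

# Job 2 map + framework shuffle: group by row index.
by_row = defaultdict(dict)
for row_idx, (col, field) in intermediate:
    by_row[row_idx][col] = field

# Job 2 reduce: emit each rebuilt row with its columns in order.
result = [[by_row[r][c] for c in sorted(by_row[r])] for r in sorted(by_row)]
# `result` has the original shape, with each column independently reordered.
```

Note the single-reduce caveat from the mail still applies to job 1 only in spirit: each column can go to a different reducer, but job 2 must see every column's output for a given row index to rebuild that row.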
Re: architecture diagram
I am trying to write a map reduce implementation to do the following:

1) read tabular data delimited in some fashion
2) pivot that data, so the rows are columns and the columns are rows
3) shuffle the rows (that were the columns) to randomize the data
4) pivot the data back

For example...

A|B|C
D|E|G

pivots to...

D|A
E|B
C|G

Then for each row, shuffle the contents around randomly...

D|A
B|E
G|C

Then pivot the data back...

A|E|C
D|B|C

You can reference my progress so far...

http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/

Terrence A. Pietrondi

--- On Thu, 10/2/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> I think it really depends on the job as to where logic goes.
> [snip]
Re: architecture diagram
I think it really depends on the job as to where logic goes. Sometimes your reduce step is as simple as an identity function, and sometimes it can be more complex than your map step. It all depends on your data and the operation(s) you're trying to perform.

Perhaps we should step out of the abstract. Do you have a specific problem you're trying to solve? Can you describe it?

Alex

On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:
> I am sorry for the confusion. I meant distributed data.
> [snip]
Re: architecture diagram
I am sorry for the confusion. I meant distributed data.

So help me out here. For example, if I am reducing to a single file, then my main transformation logic would be in my mapping step, since I am reducing away from the data?

Terrence A. Pietrondi
http://del.icio.us/tepietrondi

--- On Wed, 10/1/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> I'm not sure what you mean by "disconnected parts of data," but Hadoop is implemented to try to perform map tasks on machines that have the input data.
> [snip]
Re: architecture diagram
I'm not sure what you mean by "disconnected parts of data," but Hadoop tries to schedule map tasks on the machines that already hold the input data. This lowers network traffic, making the entire job run faster. Hadoop does all of this for you under the hood: from a user's point of view, all you need to do is store data in HDFS (the distributed filesystem) and run MapReduce jobs on that data. Take a look here: <http://wiki.apache.org/hadoop/WordCount>

Alex

On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:

> So to be "distributed" in a sense, you would want to do your computation on the disconnected parts of data in the map phase, I would guess?
>
> Terrence A. Pietrondi
> http://del.icio.us/tepietrondi
>
> --- On Wed, 10/1/08, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>
> > From: Arun C Murthy <[EMAIL PROTECTED]>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Wednesday, October 1, 2008, 2:16 PM
> > On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:
> >
> > > I am trying to plan out my map-reduce implementation and I have some questions about where computation should be split in order to take advantage of the distributed nodes.
> > >
> > > Looking at the architecture diagram (http://hadoop.apache.org/core/images/architecture.gif), are the map boxes the major computation areas or is the reduce the major computation area?
> >
> > Usually the maps perform the 'embarrassingly parallel' computational steps, wherein each map works independently on a 'split' of your input, and the reduces perform the 'aggregate' computations.
> >
> > From http://hadoop.apache.org/core/ :
> >
> > Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
> >
> > The Hadoop Map-Reduce framework is quite good at scheduling your 'maps' on the actual data-nodes where the input-blocks are present, leading to i/o efficiencies...
> >
> > Arun
> >
> > > Thanks.
> > >
> > > Terrence A. Pietrondi
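The WordCount example linked above captures the whole flow. Here is a minimal sketch in Python (illustrative only; the real example uses Hadoop's Java API) of what the map step, the framework's grouping, and the reduce step each compute:

```python
from collections import defaultdict

def map_step(offset, line):
    # TextInputFormat hands each mapper (byte offset, line) pairs;
    # WordCount emits (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1

def reduce_step(word, counts):
    # The framework groups all values emitted under one key;
    # the reducer aggregates them (here: a sum).
    return word, sum(counts)

def run_job(lines):
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):  # shuffle/sort, simulated
        for key, value in map_step(offset, line):
            grouped[key].append(value)
    return dict(reduce_step(k, vs) for k, vs in grouped.items())

print(run_job(["AAA BBB CCC", "AAA BBB", "AAA"]))
# -> {'AAA': 3, 'BBB': 2, 'CCC': 1}
```

The map calls are independent of each other, which is what lets Hadoop run them on whichever nodes hold the data.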
Re: architecture diagram
So to be "distributed" in a sense, you would want to do your computation on the disconnected parts of data in the map phase, I would guess?

Terrence A. Pietrondi
http://del.icio.us/tepietrondi

--- On Wed, 10/1/08, Arun C Murthy <[EMAIL PROTECTED]> wrote:

> From: Arun C Murthy <[EMAIL PROTECTED]>
> Subject: Re: architecture diagram
> To: core-user@hadoop.apache.org
> Date: Wednesday, October 1, 2008, 2:16 PM
> On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:
>
> > I am trying to plan out my map-reduce implementation and I have some questions about where computation should be split in order to take advantage of the distributed nodes.
> >
> > Looking at the architecture diagram (http://hadoop.apache.org/core/images/architecture.gif), are the map boxes the major computation areas or is the reduce the major computation area?
>
> Usually the maps perform the 'embarrassingly parallel' computational steps, wherein each map works independently on a 'split' of your input, and the reduces perform the 'aggregate' computations.
>
> From http://hadoop.apache.org/core/ :
>
> Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
>
> The Hadoop Map-Reduce framework is quite good at scheduling your 'maps' on the actual data-nodes where the input-blocks are present, leading to i/o efficiencies...
>
> Arun
>
> > Thanks.
> >
> > Terrence A. Pietrondi
Re: architecture diagram
On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:

> I am trying to plan out my map-reduce implementation and I have some questions about where computation should be split in order to take advantage of the distributed nodes.
>
> Looking at the architecture diagram (http://hadoop.apache.org/core/images/architecture.gif), are the map boxes the major computation areas or is the reduce the major computation area?

Usually the maps perform the 'embarrassingly parallel' computational steps, wherein each map works independently on a 'split' of your input, and the reduces perform the 'aggregate' computations.

From http://hadoop.apache.org/core/ :

Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.

The Hadoop Map-Reduce framework is quite good at scheduling your 'maps' on the actual data-nodes where the input-blocks are present, leading to i/o efficiencies...

Arun

> Thanks.
>
> Terrence A. Pietrondi
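How each map gets its own independent 'split' is worth seeing concretely, including what happens to a line that straddles a split boundary. A rough Python sketch of the idea (illustrative pseudologic, not the actual Java LineRecordReader code): a reader whose split does not start the file skips past the first newline, because the previous split's reader runs past its own end to finish whatever line it is in the middle of.

```python
def records_for_split(data, offset, length):
    # Mimics TextInputFormat/LineRecordReader behavior: splits are cut
    # by byte offset, but lines are never lost or read twice.
    pos = offset
    if offset != 0:
        first_nl = data.find("\n", offset)
        if first_nl == -1:
            return []  # the whole split sits inside one long line
        pos = first_nl + 1
    records = []
    end = offset + length
    while pos < len(data) and pos <= end:
        nl = data.find("\n", pos)
        if nl == -1:
            records.append(data[pos:])  # last line, no trailing newline
            pos = len(data)
        else:
            records.append(data[pos:nl])
            pos = nl + 1
    return records

data = "AAA BBB CCC DDD\nEEE FFF GGG HHH\nAAA BBB CCC DDD\n"
for off, length in [(0, 20), (20, 20), (40, 8)]:
    print((off, length), records_for_split(data, off, length))
# Every line is read exactly once, by exactly one split:
# (0, 20)  -> ['AAA BBB CCC DDD', 'EEE FFF GGG HHH']
# (20, 20) -> ['AAA BBB CCC DDD']
# (40, 8)  -> []
```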
Re: architecture diagram
I normally find the intermediate stage of copying data from the mappers to the reducers to be a significant step - but that's not over the best-quality switches... The mappers and reducers run on the same boxes, close to the data.

On Wed, 2008-10-01 at 10:59 -0700, Alex Loddengaard wrote:

> It really depends on your job, I think. Often reduce steps can be the bottleneck if you want a single output file (one reducer).
>
> Hope this helps.
>
> Alex
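One common way to shrink that intermediate copy is a combiner, which pre-aggregates map output on the map side before the shuffle. A small Python sketch of the idea, assuming word-count-style (word, 1) pairs (in Hadoop you would set a combiner class on the job, typically the reducer itself for sums):

```python
from collections import Counter

def map_output(line):
    # Raw map output: one (word, 1) pair per occurrence.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # A combiner pre-aggregates on the map side, so the copy phase
    # moves one (word, count) pair per distinct word instead of one
    # pair per occurrence.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

raw = map_output("AAA BBB AAA CCC AAA BBB")
combined = combine(raw)
print(len(raw), "pairs shuffled without a combiner,",
      len(combined), "with one")
# -> 6 pairs shuffled without a combiner, 3 with one
```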
Re: architecture diagram
Hi Terrence,

It really depends on your job, I think. Often reduce steps can be the bottleneck if you want a single output file (one reducer).

Hope this helps.

Alex

On Wed, Oct 1, 2008 at 10:17 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:

> I am trying to plan out my map-reduce implementation and I have some questions about where computation should be split in order to take advantage of the distributed nodes.
>
> Looking at the architecture diagram (http://hadoop.apache.org/core/images/architecture.gif), are the map boxes the major computation areas or is the reduce the major computation area?
>
> Thanks.
>
> Terrence A. Pietrondi
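The single-output-file case is a bottleneck because the framework routes each key to a reduce task by hashing, and with one reduce task every key lands on the same node. A small Python sketch of the routing (Hadoop's default HashPartitioner does the equivalent key.hashCode() % numReduceTasks in Java):

```python
def partition(key, num_reducers):
    # Route each key by hash modulo the number of reduce tasks, so all
    # values for a key meet at one reducer.  With num_reducers == 1,
    # every key lands in partition 0: a single node performs the whole
    # aggregate step and writes the one output file.
    return hash(key) % num_reducers

keys = ["AAA", "BBB", "CCC", "DDD"]
print({key: partition(key, 1) for key in keys})
# -> {'AAA': 0, 'BBB': 0, 'CCC': 0, 'DDD': 0}
```

With more reducers the keys spread across partitions and the aggregate work parallelizes, at the cost of producing one output file per reducer.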
architecture diagram
I am trying to plan out my map-reduce implementation and I have some questions about where computation should be split in order to take advantage of the distributed nodes.

Looking at the architecture diagram (http://hadoop.apache.org/core/images/architecture.gif), are the map boxes the major computation areas or is the reduce the major computation area?

Thanks.

Terrence A. Pietrondi