Re: architecture diagram
Glad we could help, Terrence. The second pivot might be tricky; you may have to run a second iteration. I haven't thought the problem all the way through, though. Good luck.

Alex

On Wed, Oct 8, 2008 at 1:02 PM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote:

I think I can figure this out now and get it to work. I will check back in if I get it. All that is missing at the moment is my pivot-back mapping step. Thanks for the help.

Terrence A. Pietrondi
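As a concrete sketch of the "second iteration" Alex mentions: two jobs run back to back, with the second reading the first job's output directory. This assumes the 0.18-era org.apache.hadoop.mapred API of the thread; PivotMapper and ShuffleReducer are the hypothetical classes sketched further down this thread, and the identity stand-ins mark where Terrence's missing pivot-back logic would go.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class PivotDriver {
      public static void main(String[] args) throws Exception {
        // Iteration one: emit (column_number, cell) and shuffle each column.
        JobConf shuffle = new JobConf(PivotDriver.class);
        shuffle.setJobName("pivot-and-shuffle");
        shuffle.setOutputKeyClass(IntWritable.class);
        shuffle.setOutputValueClass(Text.class);
        shuffle.setMapperClass(PivotMapper.class);     // hypothetical, sketched below
        shuffle.setReducerClass(ShuffleReducer.class); // hypothetical, sketched below
        FileInputFormat.setInputPaths(shuffle, new Path(args[0]));
        FileOutputFormat.setOutputPath(shuffle, new Path("pivot-tmp"));
        JobClient.runJob(shuffle); // blocks until the first job completes

        // Iteration two: read the shuffled columns and pivot back to rows.
        JobConf pivotBack = new JobConf(PivotDriver.class);
        pivotBack.setJobName("pivot-back");
        pivotBack.setOutputKeyClass(LongWritable.class);
        pivotBack.setOutputValueClass(Text.class);
        // Identity stand-ins: the real pivot-back map/reduce logic (the part
        // Terrence says is still missing) would replace these two classes.
        pivotBack.setMapperClass(IdentityMapper.class);
        pivotBack.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(pivotBack, new Path("pivot-tmp"));
        FileOutputFormat.setOutputPath(pivotBack, new Path(args[1]));
        JobClient.runJob(pivotBack);
      }
    }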
Re: architecture diagram
Thanks for the clarification, Samuel. I wasn't aware that parts of a line might end up in different splits when using TextInputFormat. Terrence, this means that you'll have to take the approach of collecting key = column_number, value = column_contents in your map step.

Alex
Re: architecture diagram
Can you explain "The location of these splits is semi-arbitrary"? What if the example was...

AAA|BBB|CCC|DDD
EEE|FFF|GGG|HHH

Does this mean the split might land inside CCC, such that the first line results in AAA|BBB|C and C|DDD? Is there a way to control this behavior so it splits on my delimiter?

Terrence A. Pietrondi
Re: architecture diagram
As far as I know, splits will never be made within a line, only between rows. To answer your question about ways to control the splits, see below:

http://wiki.apache.org/hadoop/HowManyMapsAndReduces
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html

Alex
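For what it's worth, the first link mostly boils down to two JobConf knobs; a sketch, where MyJob is a placeholder class name:

    JobConf conf = new JobConf(MyJob.class); // MyJob is a placeholder
    // A hint only: the framework still computes splits from block boundaries,
    // so this cannot force a split to land on a record delimiter.
    conf.setNumMapTasks(10);
    // Honored exactly: one output file per reduce task.
    conf.setNumReduceTasks(4);

Splitting on a custom delimiter would instead mean writing a custom InputFormat, which is what the second link documents.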
Re: architecture diagram
So looking at the following mapper...

http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup

On line 32, you can see the row split via a delimiter. On line 43, you can see that the field index (the column index) is the map key, and the map value is the field contents. How is this incorrect? I think this follows your earlier suggestion of: "You may want to play with the following idea: collect key = column_number and value = column_contents in your map step."

Terrence A. Pietrondi
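For readers who can't view the linked file, here is a minimal sketch of a mapper matching Terrence's description: split each line on a delimiter, then emit the column index as the key and the cell as the value. It assumes the old org.apache.hadoop.mapred API and a '|' delimiter; it is a reconstruction, not the actual PivotMapper source.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class PivotMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {

      public void map(LongWritable offset, Text line,
                      OutputCollector<IntWritable, Text> output,
                      Reporter reporter) throws IOException {
        // Split the row on the delimiter ('|' is assumed here).
        String[] fields = line.toString().split("\\|");
        for (int i = 0; i < fields.length; i++) {
          // Key = column index, value = field contents.
          output.collect(new IntWritable(i), new Text(fields[i]));
        }
      }
    }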
Re: architecture diagram
This mapper does follow my original suggestion, though I'm not familiar with how the delimiter works in this example. Anyone else?

Alex
Re: architecture diagram
I think the 'split' Alex talked about is the MapReduce system's action, while the 'split' you describe is your mapper's action.

I guess that your map/reduce application uses *TextInputFormat* to read your input file. The input file will first be split into a few file splits; each split looks like (filename, offset, length). What Alex said about "the location of these splits is semi-arbitrary" means that a file split's offset into your input file is semi-arbitrary. Am I right, Alex?

Then *TextInputFormat* translates each file split into a sequence of lines, where the offset is treated as the key and the line as the value. Because file splits are made by byte offset, some lines in your file may be cut across two file splits. The *LineRecordReader* used by *TextInputFormat* drops the partial line at the start of a split, so that every mapper gets whole lines, one by one.

For example, a file like this:

AAA BBB CCC DDD
EEE FFF GGG HHH
AAA BBB CCC DDD

might be divided into two file splits (assuming two mappers), with the boundary falling inside the first line:

split one: AAA BBB CCC
split two: DDD EEE FFF GGG HHH AAA BBB CCC DDD

Take split two as the example: TextInputFormat uses LineRecordReader to translate split two into a sequence of (offset, line) pairs, and it skips the partial line "DDD" at the start (the mapper for split one reads past its boundary to finish that line, so nothing is lost). The sequence will be:

(offset1, EEE FFF GGG HHH)
(offset2, AAA BBB CCC DDD)

Then what to do with the lines depends on your job.
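Restating Samuel's point in code: by the time map() is called, the split arithmetic is already handled. With TextInputFormat, each call receives one whole line keyed by its byte offset. A minimal sketch against the old API (the class name is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LineMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

      public void map(LongWritable offset, Text line,
                      OutputCollector<LongWritable, Text> output,
                      Reporter reporter) throws IOException {
        // 'line' is always a complete line: LineRecordReader skips a partial
        // line at the start of a split and reads past the split's end to
        // finish the last line, so no fragment like "DDD" ever shows up here.
        output.collect(offset, line);
      }
    }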
Re: architecture diagram
Let's say you have one very large input file of the form:

A|B|C|D
E|F|G|H
...
1|2|3|4

This input file will be broken up into N pieces, where N is the number of mappers that run. The location of these splits is semi-arbitrary. This means that unless you have one mapper, you won't be able to see the entire contents of a column in your mapper. Given that you would need one mapper to be able to see the entirety of a column, you've now essentially reduced your problem to a single machine.

You may want to play with the following idea: collect key = column_number and value = column_contents in your map step. This means that you would be able to see the entirety of a column in your reduce step, though you're still faced with the tasks of shuffling and re-pivoting.

Does this clear up your confusion? Let me know if you'd like me to clarify more.

Alex

On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote:

I am not sure why this doesn't fit; maybe you can help me understand. Your previous comment was...

"The reason I'm making this claim is because in order to do the pivot operation you must know about every row. Your input files will be split at semi-arbitrary places, essentially making it impossible for each mapper to know every single row."

Are you saying that my row segments might not actually be the entire row, so I will get a bad key index? If so, how would the row segments be determined? I based my initial work off of the word count example, where the lines are tokenized. Does this mean in this example the row tokens may not be the complete row?

Thanks.

Terrence A. Pietrondi
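A sketch of the reduce step Alex describes: all cells of one column arrive under the same key, so the reducer can shuffle them before re-emitting. (Old mapred API; the class name and the use of Collections.shuffle are illustrative, not from the thread.)

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class ShuffleReducer extends MapReduceBase
        implements Reducer<IntWritable, Text, IntWritable, Text> {

      public void reduce(IntWritable column, Iterator<Text> cells,
                         OutputCollector<IntWritable, Text> output,
                         Reporter reporter) throws IOException {
        // Buffer the whole column, randomize it, and emit it back out.
        // Note this assumes one column's cells fit in memory on a reduce node.
        List<String> buffered = new ArrayList<String>();
        while (cells.hasNext()) {
          buffered.add(cells.next().toString());
        }
        Collections.shuffle(buffered);
        for (String cell : buffered) {
          output.collect(column, new Text(cell));
        }
      }
    }

As Alex notes, this only gets you shuffled columns keyed by column number; the re-pivot back to delimited rows is still left to do.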
Re: architecture diagram
Can you confirm that the example you've presented is accurate? I think you may have made some typos, because the letter G isn't in the final result; I also think your first pivot accidentally swapped C and G. I'm having a hard time understanding what you want to do, because it seems like your operations differ from your example.

With that said, at first glance, this problem may not fit well into the MapReduce paradigm. The reason I'm making this claim is because in order to do the pivot operation you must know about every row. Your input files will be split at semi-arbitrary places, essentially making it impossible for each mapper to know every single row. There may be a way to do this by collecting, in your map step, key = column number (0, 1, 2, etc.) and value = (A, B, C, etc.), though you may run into problems when you try to pivot back. I say this because when you pivot back, you need to have each column, which means you'll need one reduce step. There may be a way to put the pivot-back operation in a second iteration, though I don't think that would help you.

Terrence, please confirm that you've defined your example correctly. In the meantime, can someone else confirm that this problem does not fit well into the MapReduce paradigm?

Alex

On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote:

I am trying to write a map reduce implementation to do the following:

1) read tabular data delimited in some fashion
2) pivot that data, so the rows are columns and the columns are rows
3) shuffle the rows (that were the columns) to randomize the data
4) pivot the data back

For example.

A|B|C
D|E|G

pivots to...

D|A
E|B
C|G

Then for each row, shuffle the contents around randomly...

D|A
B|E
G|C

Then pivot the data back...

A|E|C
D|B|C

You can reference my progress so far...

http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/

Terrence A. Pietrondi
Re: architecture diagram
Sorry for the confusion, I did make some typos. My example should have looked like...

A|B|C
D|E|G

pivots to...

D|A
E|B
G|C

Then for each row, shuffle the contents around randomly...

D|A
B|E
C|G

Then pivot the data back...

A|E|G
D|B|C

The general goal is to shuffle the elements in each column of the input data; the ordering of the elements in each column will not be the same as in the input. If you compare the initial input to the final output, you'll see that during the shuffling, B and E are swapped, and G and C are swapped, while A and D were shuffled back into their originating positions in the column. Once again, sorry for the typos and confusion.

Terrence A. Pietrondi
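To make the intended transformation concrete outside Hadoop, here is a tiny single-machine Java sketch of the same pivot, per-row shuffle, pivot-back sequence (which is equivalent to shuffling each column of the input independently):

    import java.util.Arrays;
    import java.util.Collections;

    public class ColumnShuffleDemo {

      // Transpose a rectangular matrix: rows become columns and vice versa.
      static String[][] pivot(String[][] m) {
        String[][] t = new String[m[0].length][m.length];
        for (int r = 0; r < m.length; r++) {
          for (int c = 0; c < m[0].length; c++) {
            t[c][r] = m[r][c];
          }
        }
        return t;
      }

      public static void main(String[] args) {
        String[][] data = { { "A", "B", "C" }, { "D", "E", "G" } };
        String[][] pivoted = pivot(data);
        // Each row of the pivoted matrix is one column of the original;
        // Arrays.asList is backed by the row array, so shuffle works in place.
        for (String[] row : pivoted) {
          Collections.shuffle(Arrays.asList(row));
        }
        String[][] result = pivot(pivoted);
        for (String[] row : result) {
          StringBuilder sb = new StringBuilder();
          for (int c = 0; c < row.length; c++) {
            if (c > 0) sb.append('|');
            sb.append(row[c]);
          }
          System.out.println(sb);
        }
      }
    }

The hard part the thread keeps circling back to is doing this when the matrix is too large to hold on one machine, which is exactly what the in-memory pivot() above assumes away.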
Re: architecture diagram
The approach that you've described does not fit well into the MapReduce paradigm. You may want to consider randomizing your data in a different way. Unfortunately some things can't be solved well with MapReduce, and I think this is one of them. Can someone else say more?

Alex
Re: architecture diagram
I am sorry for the confusion. I meant distributed data. So help me out here. For example, if I am reducing to a single file, then my main transformation logic would be in my mapping step since I am reducing away from the data?

Terrence A. Pietrondi
http://del.icio.us/tepietrondi
Re: architecture diagram
I think it really depends on the job as to where logic goes. Sometimes your reduce step is as simple as an identity function, and sometimes it can be more complex than your map step. It all depends on your data and the operation(s) you're trying to perform.

Perhaps we should step out of the abstract. Do you have a specific problem you're trying to solve? Can you describe it?

Alex
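For the "reduce as identity" case Alex mentions, the old API ships a ready-made class, so the job setup is a couple of lines (a sketch; MyJob is a placeholder class name):

    JobConf conf = new JobConf(MyJob.class); // MyJob is a placeholder
    // Map outputs pass through unchanged, merge-sorted by key.
    conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);
    // One reduce task = the single output file Terrence described.
    conf.setNumReduceTasks(1);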
architecture diagram
I am trying to plan out my map-reduce implementation, and I have some questions about where computation should be split in order to take advantage of the distributed nodes. Looking at the architecture diagram (http://hadoop.apache.org/core/images/architecture.gif), are the map boxes the major computation areas, or is the reduce the major computation area?

Thanks.

Terrence A. Pietrondi
Re: architecture diagram
Hi Terrence,

It really depends on your job I think. Often reduce steps can be the bottleneck if you want a single output file (one reducer). Hope this helps.

Alex
Re: architecture diagram
I normally find the intermediate stage of copying data from the mappers to the reducers to be a significant step - but that's on a network without the best-quality switches... The mappers and reducers work on the same boxes, close to the data.
Re: architecture diagram
On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:

"Looking at the architecture diagram (http://hadoop.apache.org/core/images/architecture.gif), are the map boxes the major computation areas or is the reduce the major computation area?"

Usually the maps perform the 'embarrassingly parallel' computational steps, wherein each map works independently on a 'split' of your input, and the reduces perform the 'aggregate' computations.

From http://hadoop.apache.org/core/ :

"Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located."

The Hadoop Map-Reduce framework is quite good at scheduling your 'maps' on the actual data-nodes where the input blocks are present, leading to I/O efficiencies...

Arun
Re: architecture diagram
So to be distributed in a sense, you would want to do your computation on the disconnected parts of data in the map phase I would guess?

Terrence A. Pietrondi
http://del.icio.us/tepietrondi
Re: architecture diagram
I'm not sure what you mean by disconnected parts of data, but Hadoop is implemented to try and perform map tasks on machines that have input data. This is to lower the amount of network traffic, hence making the entire job run faster. Hadoop does all this for you under the hood. From a user's point of view, all you need to do is store data in HDFS (the distributed filesystem), and run MapReduce jobs on that data.

Take a look here: http://wiki.apache.org/hadoop/WordCount

Alex
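For reference, the heart of the WordCount example behind that link looks roughly like this (condensed from memory of the 0.18-era API; see the wiki page for the full, authoritative version). It is also a neat illustration of Arun's split: the map half is embarrassingly parallel, and the reduce half aggregates.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

      // The 'embarrassingly parallel' half: each map sees only its own split.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one); // each occurrence counts once
          }
        }
      }

      // The 'aggregate' half: all counts for one word arrive at one reducer.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
    }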