Re: architecture diagram

2008-10-08 Thread Alex Loddengaard
Glad we could help, Terrence.  The second pivot might be tricky; you may
have to run a second iteration.  I haven't thought the problem all the way
through, though.

Good luck.

Alex

On Wed, Oct 8, 2008 at 1:02 PM, Terrence A. Pietrondi [EMAIL PROTECTED]
 wrote:

 I think I can figure this out now and get it to work. I will check back in
 if I get it. All that is missing at the moment is in my pivot back mapping
 step. Thanks for the help.

 Terrence A. Pietrondi


 --- On Tue, 10/7/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

  From: Alex Loddengaard [EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Tuesday, October 7, 2008, 1:55 PM
  Thanks for the clarification, Samuel.  I wasn't aware
  that parts of a line
  might be emitted depending on the split, while using
  TextInputFormat.
  Terrence, this means that you'll have to take the
  approach of collecting key
  = column_count, value = column_contents in your map
  step.
 
  Alex
 
  On Mon, Oct 6, 2008 at 6:41 PM, Samuel Guo
  [EMAIL PROTECTED] wrote:
 
   I think what Alex talked about 'split' is the
  mapreduce system's action.
   What you said about 'split' is your
  mapper's action.
  
   I guess that your map/reduce application uses
  *TextInputFormat* to treat
   your input file.
  
   your input file will first be splitted into a few
  splits. these splits may
   be like filename, offset, length. What Alex
  said about 'The location of
   these splits is semi-arbitrary' means that the
  file split's offset in your
   input file is semi-arbitrary. Am I right, Alex?
   Then *TextInputFormat* will translate these file
  splits into a sequence of
   lines, where offset is treated as key and line is
  treated as value.
  
   As these file splits are splitted by offset. Some
  lines in your file may be
   splitted into different file splits. A
  *LineRecordReader* used by
   *TextInputFormat* will remove the half-baked line in
  these file splits to
   make sure that every mapper will get integrated lines
  one by one.
  
   For examples:
  
   a file as below:
   
   AAA BBB CCC DDD
   EEE FFF GGG HHH
   AAA BBB CCC DDD
   
  
   it may be splitted into two file splits(we assume that
  there are two
   mappers.).
   split one:
   
   AAA BBB CCC
  
   split two:
   DDD
   EEE FFF GGG HHH
   AAA BBB CCC DDD
   
  
   take split two as example:
   TextInputFormat will use LineRecordReader to translate
  split two into a
   sequence of offset, line pairs, and it will
  skip the first half-baked
   line
   DDD. so the sequence will be:
   offset1, EEE FFF GGG HHH
   offset2, AAA BBB CCC DDD
   
  
   Then what to do with the lines depends on your job.
  
  
   On Tue, Oct 7, 2008 at 5:55 AM, Terrence A. Pietrondi
  
   [EMAIL PROTECTED]
wrote:
  
So looking at the following mapper...
   
   
   
  
 
 http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup
   
On line 32, you can see the row split via a
  delimiter. On line 43, you
   can
see that the field index (the column index) is
  the map key, and the map
value is the field contents. How is this
  incorrect? I think this follows
your earlier suggestion of:
   
You may want to play with the following
  idea: collect key =
   column_number
and value = column_contents in your map
  step.
   
Terrence A. Pietrondi
   
   
--- On Mon, 10/6/08, Alex Loddengaard
  [EMAIL PROTECTED] wrote:
   
 From: Alex Loddengaard
  [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Monday, October 6, 2008, 12:55 PM
 As far as I know, splits will never be made
  within a line,
 only between
 rows.  To answer your question about ways to
  control the
 splits, see below:


  http://wiki.apache.org/hadoop/HowManyMapsAndReduces
 

   
  
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
 

 Alex

 On Mon, Oct 6, 2008 at 6:38 AM, Terrence A.
  Pietrondi
 [EMAIL PROTECTED]
  wrote:

  Can you explain The location of
  these splits is
 semi-arbitrary? What if
  the example was...
 
  AAA|BBB|CCC|DDD
  EEE|FFF|GGG|HHH
 
 
  Does this mean the split might be
  between CCC such
 that it results in
  AAA|BBB|C and C|DDD for the first line?
  Is there a way
 to control this
  behavior to split on my delimiter?
 
 
  Terrence A. Pietrondi
 
 
  --- On Sun, 10/5/08, Alex Loddengaard
 [EMAIL PROTECTED] wrote:
 
   From: Alex Loddengaard
 [EMAIL PROTECTED]
   Subject: Re: architecture diagram
   To: core-user@hadoop.apache.org
   Date: Sunday, October 5, 2008,
  9:26 PM
   Let's say you have one very
  large input file
 of the
   form:
  
   A|B|C|D
   E|F|G|H

Re: architecture diagram

2008-10-07 Thread Alex Loddengaard
Thanks for the clarification, Samuel.  I wasn't aware that parts of a line
might be emitted depending on the split, while using TextInputFormat.
Terrence, this means that you'll have to take the approach of collecting key
= column_number, value = column_contents in your map step.

Alex

On Mon, Oct 6, 2008 at 6:41 PM, Samuel Guo [EMAIL PROTECTED] wrote:

 I think what Alex talked about 'split' is the mapreduce system's action.
 What you said about 'split' is your mapper's action.

 I guess that your map/reduce application uses *TextInputFormat* to treat
 your input file.

 your input file will first be splitted into a few splits. these splits may
 be like filename, offset, length. What Alex said about 'The location of
 these splits is semi-arbitrary' means that the file split's offset in your
 input file is semi-arbitrary. Am I right, Alex?
 Then *TextInputFormat* will translate these file splits into a sequence of
 lines, where offset is treated as key and line is treated as value.

 As these file splits are splitted by offset. Some lines in your file may be
 splitted into different file splits. A *LineRecordReader* used by
 *TextInputFormat* will remove the half-baked line in these file splits to
 make sure that every mapper will get integrated lines one by one.

 For examples:

 a file as below:
 
 AAA BBB CCC DDD
 EEE FFF GGG HHH
 AAA BBB CCC DDD
 

 it may be splitted into two file splits(we assume that there are two
 mappers.).
 split one:
 
 AAA BBB CCC

 split two:
 DDD
 EEE FFF GGG HHH
 AAA BBB CCC DDD
 

 take split two as example:
 TextInputFormat will use LineRecordReader to translate split two into a
 sequence of offset, line pairs, and it will skip the first half-baked
 line
 DDD. so the sequence will be:
 offset1, EEE FFF GGG HHH
 offset2, AAA BBB CCC DDD
 

 Then what to do with the lines depends on your job.


 On Tue, Oct 7, 2008 at 5:55 AM, Terrence A. Pietrondi 
 [EMAIL PROTECTED]
  wrote:

  So looking at the following mapper...
 
 
 
 http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup
 
  On line 32, you can see the row split via a delimiter. On line 43, you
 can
  see that the field index (the column index) is the map key, and the map
  value is the field contents. How is this incorrect? I think this follows
  your earlier suggestion of:
 
  You may want to play with the following idea: collect key =
 column_number
  and value = column_contents in your map step.
 
  Terrence A. Pietrondi
 
 
  --- On Mon, 10/6/08, Alex Loddengaard [EMAIL PROTECTED] wrote:
 
   From: Alex Loddengaard [EMAIL PROTECTED]
   Subject: Re: architecture diagram
   To: core-user@hadoop.apache.org
   Date: Monday, October 6, 2008, 12:55 PM
   As far as I know, splits will never be made within a line,
   only between
   rows.  To answer your question about ways to control the
   splits, see below:
  
   http://wiki.apache.org/hadoop/HowManyMapsAndReduces
   
  
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
   
  
   Alex
  
   On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi
   [EMAIL PROTECTED]
wrote:
  
Can you explain The location of these splits is
   semi-arbitrary? What if
the example was...
   
AAA|BBB|CCC|DDD
EEE|FFF|GGG|HHH
   
   
Does this mean the split might be between CCC such
   that it results in
AAA|BBB|C and C|DDD for the first line? Is there a way
   to control this
behavior to split on my delimiter?
   
   
Terrence A. Pietrondi
   
   
--- On Sun, 10/5/08, Alex Loddengaard
   [EMAIL PROTECTED] wrote:
   
 From: Alex Loddengaard
   [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Sunday, October 5, 2008, 9:26 PM
 Let's say you have one very large input file
   of the
 form:

 A|B|C|D
 E|F|G|H
 ...
 |1|2|3|4

 This input file will be broken up into N pieces,
   where N is
 the number of
 mappers that run.  The location of these splits
   is
 semi-arbitrary.  This
 means that unless you have one mapper, you
   won't be
 able to see the entire
 contents of a column in your mapper.  Given that
   you would
 need one mapper
 to be able to see the entirety of a column,
   you've now
 essentially reduced
 your problem to a single machine.

 You may want to play with the following idea:
   collect key
 = column_number
 and value = column_contents in your map step.
This
 means that you would be
 able to see the entirety of a column in your
   reduce step,
 though you're
 still faced with the tasks of shuffling and
   re-pivoting.

 Does this clear up your confusion?  Let me know
   if
 you'd like me to clarify
 more.

 Alex

 On Sun, Oct 5, 2008 at 3:54 PM, Terrence A.
   Pietrondi
 [EMAIL PROTECTED

Re: architecture diagram

2008-10-06 Thread Terrence A. Pietrondi
Can you explain "The location of these splits is semi-arbitrary"? What if the
example was...

AAA|BBB|CCC|DDD
EEE|FFF|GGG|HHH


Does this mean the split might fall within CCC such that it results in AAA|BBB|C
and C|DDD for the first line? Is there a way to control this behavior to split
on my delimiter?


Terrence A. Pietrondi


--- On Sun, 10/5/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

 From: Alex Loddengaard [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Sunday, October 5, 2008, 9:26 PM
 Let's say you have one very large input file of the
 form:
 
 A|B|C|D
 E|F|G|H
 ...
 |1|2|3|4
 
 This input file will be broken up into N pieces, where N is
 the number of
 mappers that run.  The location of these splits is
 semi-arbitrary.  This
 means that unless you have one mapper, you won't be
 able to see the entire
 contents of a column in your mapper.  Given that you would
 need one mapper
 to be able to see the entirety of a column, you've now
 essentially reduced
 your problem to a single machine.
 
 You may want to play with the following idea: collect key
 = column_number
 and value = column_contents in your map step.  This
 means that you would be
 able to see the entirety of a column in your reduce step,
 though you're
 still faced with the tasks of shuffling and re-pivoting.
 
 Does this clear up your confusion?  Let me know if
 you'd like me to clarify
 more.
 
 Alex
 
 On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi
 [EMAIL PROTECTED]
  wrote:
 
  I am not sure why this doesn't fit, maybe you can
 help me understand. Your
  previous comment was...
 
  The reason I'm making this claim is because
 in order to do the pivot
  operation you must know about every row. Your input
 files will be split at
  semi-arbitrary places, essentially making it
 impossible for each mapper to
  know every single row.
 
  Are you saying that my row segments might not actually
 be the entire row so
  I will get a bad key index? If so, would the row
 segments be determined? I
  based my initial work off of the word count example,
 where the lines are
  tokenized. Does this mean in this example the row
 tokens may not be the
  complete row?
 
  Thanks.
 
  Terrence A. Pietrondi
 
 
  --- On Fri, 10/3/08, Alex Loddengaard
 [EMAIL PROTECTED] wrote:
 
   From: Alex Loddengaard
 [EMAIL PROTECTED]
   Subject: Re: architecture diagram
   To: core-user@hadoop.apache.org
   Date: Friday, October 3, 2008, 7:14 PM
   The approach that you've described does not
 fit well in
   to the MapReduce
   paradigm.  You may want to consider randomizing
 your data
   in a different
   way.
  
   Unfortunately some things can't be solved
 well with
   MapReduce, and I think
   this is one of them.
  
   Can someone else say more?
  
   Alex
  
   On Fri, Oct 3, 2008 at 8:15 AM, Terrence A.
 Pietrondi
   [EMAIL PROTECTED]
wrote:
  
Sorry for the confusion, I did make some
 typos. My
   example should have
looked like...
   
 A|B|C
 D|E|G

 pivots too...

 D|A
 E|B
 G|C

 Then for each row, shuffle the contents
 around
   randomly...

 D|A
 B|E
 C|G

 Then pivot the data back...

 A|E|G
 D|B|C
   
The general goal is to shuffle the elements
 in each
   column in the input
data. Meaning, the ordering of the elements
 in each
   column will not be the
same as in input.
   
If you look at the initial input and compare
 to the
   final output, you'll
see that during the shuffling, B and E are
 swapped,
   and G and C are swapped,
while A and D were shuffled back into their
   originating positions in the
column.
   
Once again, sorry for the typos and
 confusion.
   
Terrence A. Pietrondi
   
--- On Fri, 10/3/08, Alex Loddengaard
   [EMAIL PROTECTED] wrote:
   
 From: Alex Loddengaard
   [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Friday, October 3, 2008, 11:01 AM
 Can you confirm that the example
 you've
   presented is
 accurate?  I think you
 may have made some typos, because the
 letter
   G
 isn't in the final result;
 I also think your first pivot
 accidentally
   swapped C and G.
  I'm having a
 hard time understanding what you want
 to do,
   because it
 seems like your
 operations differ from your example.

 With that said, at first glance, this
 problem may
   not fit
 well in to the
 MapReduce paradigm.  The reason I'm
 making
   this claim
 is because in order to
 do the pivot operation you must know
 about every
   row.  Your
 input files will
 be split at semi-arbitrary places,
 essentially
   making it
 impossible for each
 mapper to know every single row.  There
 may be a
   way to do
 this by
 collecting, in your map step, key =
 column
   number (0,
 1, 2, etc) and value
 = (A, B, C, etc), though you

Re: architecture diagram

2008-10-06 Thread Alex Loddengaard
As far as I know, splits will never be made within a line, only between
rows.  To answer your question about ways to control the splits, see below:

http://wiki.apache.org/hadoop/HowManyMapsAndReduces

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
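
For reference, a hedged sketch of the knobs those two links describe, using
the mapred API of that era; the exact property name (mapred.min.split.size)
varied between versions, so treat it as illustrative only.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SplitControl {
  public static void configure(JobConf conf) {
    // Hint at a number of map tasks; the framework may still adjust it.
    conf.setNumMapTasks(4);
    // Raise the minimum split size so fewer, larger splits are created.
    conf.set("mapred.min.split.size", String.valueOf(64 * 1024 * 1024));
  }

  // Or subclass the input format and refuse to split files at all, so each
  // whole file (and therefore every complete line in it) goes to one mapper.
  public static class WholeFileTextInputFormat extends TextInputFormat {
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }
}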


Alex

On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi [EMAIL PROTECTED]
 wrote:

 Can you explain The location of these splits is semi-arbitrary? What if
 the example was...

 AAA|BBB|CCC|DDD
 EEE|FFF|GGG|HHH


 Does this mean the split might be between CCC such that it results in
 AAA|BBB|C and C|DDD for the first line? Is there a way to control this
 behavior to split on my delimiter?


 Terrence A. Pietrondi


 --- On Sun, 10/5/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

  From: Alex Loddengaard [EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Sunday, October 5, 2008, 9:26 PM
  Let's say you have one very large input file of the
  form:
 
  A|B|C|D
  E|F|G|H
  ...
  |1|2|3|4
 
  This input file will be broken up into N pieces, where N is
  the number of
  mappers that run.  The location of these splits is
  semi-arbitrary.  This
  means that unless you have one mapper, you won't be
  able to see the entire
  contents of a column in your mapper.  Given that you would
  need one mapper
  to be able to see the entirety of a column, you've now
  essentially reduced
  your problem to a single machine.
 
  You may want to play with the following idea: collect key
  = column_number
  and value = column_contents in your map step.  This
  means that you would be
  able to see the entirety of a column in your reduce step,
  though you're
  still faced with the tasks of shuffling and re-pivoting.
 
  Does this clear up your confusion?  Let me know if
  you'd like me to clarify
  more.
 
  Alex
 
  On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi
  [EMAIL PROTECTED]
   wrote:
 
   I am not sure why this doesn't fit, maybe you can
  help me understand. Your
   previous comment was...
  
   The reason I'm making this claim is because
  in order to do the pivot
   operation you must know about every row. Your input
  files will be split at
   semi-arbitrary places, essentially making it
  impossible for each mapper to
   know every single row.
  
   Are you saying that my row segments might not actually
  be the entire row so
   I will get a bad key index? If so, would the row
  segments be determined? I
   based my initial work off of the word count example,
  where the lines are
   tokenized. Does this mean in this example the row
  tokens may not be the
   complete row?
  
   Thanks.
  
   Terrence A. Pietrondi
  
  
   --- On Fri, 10/3/08, Alex Loddengaard
  [EMAIL PROTECTED] wrote:
  
From: Alex Loddengaard
  [EMAIL PROTECTED]
Subject: Re: architecture diagram
To: core-user@hadoop.apache.org
Date: Friday, October 3, 2008, 7:14 PM
The approach that you've described does not
  fit well in
to the MapReduce
paradigm.  You may want to consider randomizing
  your data
in a different
way.
   
Unfortunately some things can't be solved
  well with
MapReduce, and I think
this is one of them.
   
Can someone else say more?
   
Alex
   
On Fri, Oct 3, 2008 at 8:15 AM, Terrence A.
  Pietrondi
[EMAIL PROTECTED]
 wrote:
   
 Sorry for the confusion, I did make some
  typos. My
example should have
 looked like...

  A|B|C
  D|E|G
 
  pivots too...
 
  D|A
  E|B
  G|C
 
  Then for each row, shuffle the contents
  around
randomly...
 
  D|A
  B|E
  C|G
 
  Then pivot the data back...
 
  A|E|G
  D|B|C

 The general goal is to shuffle the elements
  in each
column in the input
 data. Meaning, the ordering of the elements
  in each
column will not be the
 same as in input.

 If you look at the initial input and compare
  to the
final output, you'll
 see that during the shuffling, B and E are
  swapped,
and G and C are swapped,
 while A and D were shuffled back into their
originating positions in the
 column.

 Once again, sorry for the typos and
  confusion.

 Terrence A. Pietrondi

 --- On Fri, 10/3/08, Alex Loddengaard
[EMAIL PROTECTED] wrote:

  From: Alex Loddengaard
[EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Friday, October 3, 2008, 11:01 AM
  Can you confirm that the example
  you've
presented is
  accurate?  I think you
  may have made some typos, because the
  letter
G
  isn't in the final result;
  I also think your first pivot
  accidentally
swapped C and G.
   I'm having a
  hard time understanding what you want
  to do,
because it
  seems like your
  operations differ from your example

Re: architecture diagram

2008-10-06 Thread Terrence A. Pietrondi
So looking at the following mapper...

http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup

On line 32, you can see the row split via a delimiter. On line 43, you can see 
that the field index (the column index) is the map key, and the map value is 
the field contents. How is this incorrect? I think this follows your earlier 
suggestion of:

"You may want to play with the following idea: collect key = column_number and
value = column_contents in your map step."

Terrence A. Pietrondi


--- On Mon, 10/6/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

 From: Alex Loddengaard [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Monday, October 6, 2008, 12:55 PM
 As far as I know, splits will never be made within a line,
 only between
 rows.  To answer your question about ways to control the
 splits, see below:
 
 http://wiki.apache.org/hadoop/HowManyMapsAndReduces
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
 
 
 Alex
 
 On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi
 [EMAIL PROTECTED]
  wrote:
 
  Can you explain The location of these splits is
 semi-arbitrary? What if
  the example was...
 
  AAA|BBB|CCC|DDD
  EEE|FFF|GGG|HHH
 
 
  Does this mean the split might be between CCC such
 that it results in
  AAA|BBB|C and C|DDD for the first line? Is there a way
 to control this
  behavior to split on my delimiter?
 
 
  Terrence A. Pietrondi
 
 
  --- On Sun, 10/5/08, Alex Loddengaard
 [EMAIL PROTECTED] wrote:
 
   From: Alex Loddengaard
 [EMAIL PROTECTED]
   Subject: Re: architecture diagram
   To: core-user@hadoop.apache.org
   Date: Sunday, October 5, 2008, 9:26 PM
   Let's say you have one very large input file
 of the
   form:
  
   A|B|C|D
   E|F|G|H
   ...
   |1|2|3|4
  
   This input file will be broken up into N pieces,
 where N is
   the number of
   mappers that run.  The location of these splits
 is
   semi-arbitrary.  This
   means that unless you have one mapper, you
 won't be
   able to see the entire
   contents of a column in your mapper.  Given that
 you would
   need one mapper
   to be able to see the entirety of a column,
 you've now
   essentially reduced
   your problem to a single machine.
  
   You may want to play with the following idea:
 collect key
   = column_number
   and value = column_contents in your map step.
  This
   means that you would be
   able to see the entirety of a column in your
 reduce step,
   though you're
   still faced with the tasks of shuffling and
 re-pivoting.
  
   Does this clear up your confusion?  Let me know
 if
   you'd like me to clarify
   more.
  
   Alex
  
   On Sun, Oct 5, 2008 at 3:54 PM, Terrence A.
 Pietrondi
   [EMAIL PROTECTED]
wrote:
  
I am not sure why this doesn't fit,
 maybe you can
   help me understand. Your
previous comment was...
   
The reason I'm making this claim
 is because
   in order to do the pivot
operation you must know about every row.
 Your input
   files will be split at
semi-arbitrary places, essentially making it
   impossible for each mapper to
know every single row.
   
Are you saying that my row segments might
 not actually
   be the entire row so
I will get a bad key index? If so, would the
 row
   segments be determined? I
based my initial work off of the word count
 example,
   where the lines are
tokenized. Does this mean in this example
 the row
   tokens may not be the
complete row?
   
Thanks.
   
Terrence A. Pietrondi
   
   
--- On Fri, 10/3/08, Alex Loddengaard
   [EMAIL PROTECTED] wrote:
   
 From: Alex Loddengaard
   [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Friday, October 3, 2008, 7:14 PM
 The approach that you've described
 does not
   fit well in
 to the MapReduce
 paradigm.  You may want to consider
 randomizing
   your data
 in a different
 way.

 Unfortunately some things can't be
 solved
   well with
 MapReduce, and I think
 this is one of them.

 Can someone else say more?

 Alex

 On Fri, Oct 3, 2008 at 8:15 AM,
 Terrence A.
   Pietrondi
 [EMAIL PROTECTED]
  wrote:

  Sorry for the confusion, I did
 make some
   typos. My
 example should have
  looked like...
 
   A|B|C
   D|E|G
  
   pivots too...
  
   D|A
   E|B
   G|C
  
   Then for each row, shuffle
 the contents
   around
 randomly...
  
   D|A
   B|E
   C|G
  
   Then pivot the data back...
  
   A|E|G
   D|B|C
 
  The general goal is to shuffle the
 elements
   in each
 column in the input
  data. Meaning, the ordering of the
 elements
   in each
 column will not be the
  same as in input.
 
  If you look at the initial input
 and compare

Re: architecture diagram

2008-10-06 Thread Alex Loddengaard
This mapper does follow my original suggestion, though I'm not familiar with
how the delimiter works in this example.  Anyone else?

Alex

On Mon, Oct 6, 2008 at 2:55 PM, Terrence A. Pietrondi [EMAIL PROTECTED]
 wrote:

 So looking at the following mapper...


 http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup

 On line 32, you can see the row split via a delimiter. On line 43, you can
 see that the field index (the column index) is the map key, and the map
 value is the field contents. How is this incorrect? I think this follows
 your earlier suggestion of:

 You may want to play with the following idea: collect key = column_number
 and value = column_contents in your map step.

 Terrence A. Pietrondi


 --- On Mon, 10/6/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

  From: Alex Loddengaard [EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Monday, October 6, 2008, 12:55 PM
  As far as I know, splits will never be made within a line,
  only between
  rows.  To answer your question about ways to control the
  splits, see below:
 
  http://wiki.apache.org/hadoop/HowManyMapsAndReduces
  
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
  
 
  Alex
 
  On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi
  [EMAIL PROTECTED]
   wrote:
 
   Can you explain The location of these splits is
  semi-arbitrary? What if
   the example was...
  
   AAA|BBB|CCC|DDD
   EEE|FFF|GGG|HHH
  
  
   Does this mean the split might be between CCC such
  that it results in
   AAA|BBB|C and C|DDD for the first line? Is there a way
  to control this
   behavior to split on my delimiter?
  
  
   Terrence A. Pietrondi
  
  
   --- On Sun, 10/5/08, Alex Loddengaard
  [EMAIL PROTECTED] wrote:
  
From: Alex Loddengaard
  [EMAIL PROTECTED]
Subject: Re: architecture diagram
To: core-user@hadoop.apache.org
Date: Sunday, October 5, 2008, 9:26 PM
Let's say you have one very large input file
  of the
form:
   
A|B|C|D
E|F|G|H
...
|1|2|3|4
   
This input file will be broken up into N pieces,
  where N is
the number of
mappers that run.  The location of these splits
  is
semi-arbitrary.  This
means that unless you have one mapper, you
  won't be
able to see the entire
contents of a column in your mapper.  Given that
  you would
need one mapper
to be able to see the entirety of a column,
  you've now
essentially reduced
your problem to a single machine.
   
You may want to play with the following idea:
  collect key
= column_number
and value = column_contents in your map step.
   This
means that you would be
able to see the entirety of a column in your
  reduce step,
though you're
still faced with the tasks of shuffling and
  re-pivoting.
   
Does this clear up your confusion?  Let me know
  if
you'd like me to clarify
more.
   
Alex
   
On Sun, Oct 5, 2008 at 3:54 PM, Terrence A.
  Pietrondi
[EMAIL PROTECTED]
 wrote:
   
 I am not sure why this doesn't fit,
  maybe you can
help me understand. Your
 previous comment was...

 The reason I'm making this claim
  is because
in order to do the pivot
 operation you must know about every row.
  Your input
files will be split at
 semi-arbitrary places, essentially making it
impossible for each mapper to
 know every single row.

 Are you saying that my row segments might
  not actually
be the entire row so
 I will get a bad key index? If so, would the
  row
segments be determined? I
 based my initial work off of the word count
  example,
where the lines are
 tokenized. Does this mean in this example
  the row
tokens may not be the
 complete row?

 Thanks.

 Terrence A. Pietrondi


 --- On Fri, 10/3/08, Alex Loddengaard
[EMAIL PROTECTED] wrote:

  From: Alex Loddengaard
[EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Friday, October 3, 2008, 7:14 PM
  The approach that you've described
  does not
fit well in
  to the MapReduce
  paradigm.  You may want to consider
  randomizing
your data
  in a different
  way.
 
  Unfortunately some things can't be
  solved
well with
  MapReduce, and I think
  this is one of them.
 
  Can someone else say more?
 
  Alex
 
  On Fri, Oct 3, 2008 at 8:15 AM,
  Terrence A.
Pietrondi
  [EMAIL PROTECTED]
   wrote:
 
   Sorry for the confusion, I did
  make some
typos. My
  example should have
   looked like...
  
A|B|C
D|E|G
   
pivots too...
   
D|A
E|B
G|C
   
Then for each row, shuffle
  the contents

Re: architecture diagram

2008-10-06 Thread Samuel Guo
I think the 'split' Alex talked about is the MapReduce system's action, while
the 'split' you described is your mapper's action.

I guess that your map/reduce application uses *TextInputFormat* to process
your input file.

Your input file will first be split into a few splits. These splits look
like <filename, offset, length>. What Alex said about 'The location of
these splits is semi-arbitrary' means that the file split's offset in your
input file is semi-arbitrary. Am I right, Alex?
Then *TextInputFormat* will translate these file splits into a sequence of
lines, where the offset is treated as the key and the line as the value.

Because these file splits are cut by byte offset, some lines in your file may
be split across different file splits. The *LineRecordReader* used by
*TextInputFormat* will drop the half-baked (partial) line at the start of such
a file split, to make sure that every mapper gets complete lines one by one.

For example:

a file as below:

AAA BBB CCC DDD
EEE FFF GGG HHH
AAA BBB CCC DDD


it may be split into two file splits (we assume that there are two
mappers):
split one:

AAA BBB CCC

split two:
DDD
EEE FFF GGG HHH
AAA BBB CCC DDD


Take split two as an example:
TextInputFormat will use LineRecordReader to translate split two into a
sequence of <offset, line> pairs, and it will skip the first half-baked line
"DDD". So the sequence will be:
<offset1, "EEE FFF GGG HHH">
<offset2, "AAA BBB CCC DDD">


Then what to do with the lines depends on your job.
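
To make that concrete, here is a rough mapper sketch (old mapred API) of what
a map call sees under TextInputFormat: a complete line keyed by its byte
offset, regardless of where the split boundary fell. The '|' delimiter and the
<column number, field> output follow Alex's earlier suggestion and are only
illustrative, not taken from Terrence's code.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ColumnMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<IntWritable, Text> output, Reporter reporter)
      throws IOException {
    // 'line' is always a whole line; the record reader has already handled
    // any line that straddled a file split boundary.
    String[] fields = line.toString().split("\\|");
    for (int col = 0; col < fields.length; col++) {
      output.collect(new IntWritable(col), new Text(fields[col]));
    }
  }
}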


On Tue, Oct 7, 2008 at 5:55 AM, Terrence A. Pietrondi [EMAIL PROTECTED]
 wrote:

 So looking at the following mapper...


 http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup

 On line 32, you can see the row split via a delimiter. On line 43, you can
 see that the field index (the column index) is the map key, and the map
 value is the field contents. How is this incorrect? I think this follows
 your earlier suggestion of:

 You may want to play with the following idea: collect key = column_number
 and value = column_contents in your map step.

 Terrence A. Pietrondi


 --- On Mon, 10/6/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

  From: Alex Loddengaard [EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Monday, October 6, 2008, 12:55 PM
  As far as I know, splits will never be made within a line,
  only between
  rows.  To answer your question about ways to control the
  splits, see below:
 
  http://wiki.apache.org/hadoop/HowManyMapsAndReduces
  
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
  
 
  Alex
 
  On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi
  [EMAIL PROTECTED]
   wrote:
 
   Can you explain The location of these splits is
  semi-arbitrary? What if
   the example was...
  
   AAA|BBB|CCC|DDD
   EEE|FFF|GGG|HHH
  
  
   Does this mean the split might be between CCC such
  that it results in
   AAA|BBB|C and C|DDD for the first line? Is there a way
  to control this
   behavior to split on my delimiter?
  
  
   Terrence A. Pietrondi
  
  
   --- On Sun, 10/5/08, Alex Loddengaard
  [EMAIL PROTECTED] wrote:
  
From: Alex Loddengaard
  [EMAIL PROTECTED]
Subject: Re: architecture diagram
To: core-user@hadoop.apache.org
Date: Sunday, October 5, 2008, 9:26 PM
Let's say you have one very large input file
  of the
form:
   
A|B|C|D
E|F|G|H
...
|1|2|3|4
   
This input file will be broken up into N pieces,
  where N is
the number of
mappers that run.  The location of these splits
  is
semi-arbitrary.  This
means that unless you have one mapper, you
  won't be
able to see the entire
contents of a column in your mapper.  Given that
  you would
need one mapper
to be able to see the entirety of a column,
  you've now
essentially reduced
your problem to a single machine.
   
You may want to play with the following idea:
  collect key
= column_number
and value = column_contents in your map step.
   This
means that you would be
able to see the entirety of a column in your
  reduce step,
though you're
still faced with the tasks of shuffling and
  re-pivoting.
   
Does this clear up your confusion?  Let me know
  if
you'd like me to clarify
more.
   
Alex
   
On Sun, Oct 5, 2008 at 3:54 PM, Terrence A.
  Pietrondi
[EMAIL PROTECTED]
 wrote:
   
 I am not sure why this doesn't fit,
  maybe you can
help me understand. Your
 previous comment was...

 The reason I'm making this claim
  is because
in order to do the pivot
 operation you must know about every row.
  Your input
files will be split at
 semi-arbitrary places, essentially making it
impossible for each mapper to
 know every single row.

 Are you saying that my row segments might
  not actually
be the entire row so
 I will get

Re: architecture diagram

2008-10-05 Thread Alex Loddengaard
Let's say you have one very large input file of the form:

A|B|C|D
E|F|G|H
...
|1|2|3|4

This input file will be broken up into N pieces, where N is the number of
mappers that run.  The location of these splits is semi-arbitrary.  This
means that unless you have one mapper, you won't be able to see the entire
contents of a column in your mapper.  Given that you would need one mapper
to be able to see the entirety of a column, you've now essentially reduced
your problem to a single machine.

You may want to play with the following idea: collect key = column_number
and value = column_contents in your map step.  This means that you would be
able to see the entirety of a column in your reduce step, though you're
still faced with the tasks of shuffling and re-pivoting.
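
A rough sketch of the reduce side of that idea (old mapred API) might look
like the following; it assumes one column's contents fit in memory and leaves
the re-pivoting back into rows for a later step.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ColumnShuffleReducer extends MapReduceBase
    implements Reducer<IntWritable, Text, IntWritable, Text> {
  public void reduce(IntWritable columnNumber, Iterator<Text> fields,
                     OutputCollector<IntWritable, Text> output,
                     Reporter reporter) throws IOException {
    // Each call sees every field of one column, so the whole column can be
    // collected and shuffled here.
    List<String> column = new ArrayList<String>();
    while (fields.hasNext()) {
      column.add(fields.next().toString());
    }
    Collections.shuffle(column);  // randomize the order within this column
    for (String field : column) {
      output.collect(columnNumber, new Text(field));
    }
  }
}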

Does this clear up your confusion?  Let me know if you'd like me to clarify
more.

Alex

On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi [EMAIL PROTECTED]
 wrote:

 I am not sure why this doesn't fit, maybe you can help me understand. Your
 previous comment was...

 The reason I'm making this claim is because in order to do the pivot
 operation you must know about every row. Your input files will be split at
 semi-arbitrary places, essentially making it impossible for each mapper to
 know every single row.

 Are you saying that my row segments might not actually be the entire row so
 I will get a bad key index? If so, how would the row segments be determined? I
 based my initial work off of the word count example, where the lines are
 tokenized. Does this mean in this example the row tokens may not be the
 complete row?

 Thanks.

 Terrence A. Pietrondi


 --- On Fri, 10/3/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

  From: Alex Loddengaard [EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Friday, October 3, 2008, 7:14 PM
  The approach that you've described does not fit well in
  to the MapReduce
  paradigm.  You may want to consider randomizing your data
  in a different
  way.
 
  Unfortunately some things can't be solved well with
  MapReduce, and I think
  this is one of them.
 
  Can someone else say more?
 
  Alex
 
  On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi
  [EMAIL PROTECTED]
   wrote:
 
   Sorry for the confusion, I did make some typos. My
  example should have
   looked like...
  
A|B|C
D|E|G
   
pivots too...
   
D|A
E|B
G|C
   
Then for each row, shuffle the contents around
  randomly...
   
D|A
B|E
C|G
   
Then pivot the data back...
   
A|E|G
D|B|C
  
   The general goal is to shuffle the elements in each
  column in the input
   data. Meaning, the ordering of the elements in each
  column will not be the
   same as in input.
  
   If you look at the initial input and compare to the
  final output, you'll
   see that during the shuffling, B and E are swapped,
  and G and C are swapped,
   while A and D were shuffled back into their
  originating positions in the
   column.
  
   Once again, sorry for the typos and confusion.
  
   Terrence A. Pietrondi
  
   --- On Fri, 10/3/08, Alex Loddengaard
  [EMAIL PROTECTED] wrote:
  
From: Alex Loddengaard
  [EMAIL PROTECTED]
Subject: Re: architecture diagram
To: core-user@hadoop.apache.org
Date: Friday, October 3, 2008, 11:01 AM
Can you confirm that the example you've
  presented is
accurate?  I think you
may have made some typos, because the letter
  G
isn't in the final result;
I also think your first pivot accidentally
  swapped C and G.
 I'm having a
hard time understanding what you want to do,
  because it
seems like your
operations differ from your example.
   
With that said, at first glance, this problem may
  not fit
well in to the
MapReduce paradigm.  The reason I'm making
  this claim
is because in order to
do the pivot operation you must know about every
  row.  Your
input files will
be split at semi-arbitrary places, essentially
  making it
impossible for each
mapper to know every single row.  There may be a
  way to do
this by
collecting, in your map step, key = column
  number (0,
1, 2, etc) and value
= (A, B, C, etc), though you may run in to
  problems
when you try to pivot
back.  I say this because when you pivot back,
  you need to
have each column,
which means you'll need one reduce step.
  There may be
a way to put the
pivot-back operation in a second iteration,
  though I
don't think that would
help you.
   
Terrence, please confirm that you've defined
  your
example correctly.  In the
meantime, can someone else confirm that this
  problem does
not fit will in to
the MapReduce paradigm?
   
Alex
   
On Thu, Oct 2, 2008 at 10:48 AM, Terrence A.
  Pietrondi 
[EMAIL PROTECTED] wrote:
   
 I am trying to write a map reduce
  implementation to do
the following:

 1) read

Re: architecture diagram

2008-10-03 Thread Alex Loddengaard
Can you confirm that the example you've presented is accurate?  I think you
may have made some typos, because the letter "G" isn't in the final result;
I also think your first pivot accidentally swapped C and G.  I'm having a
hard time understanding what you want to do, because it seems like your
operations differ from your example.

With that said, at first glance, this problem may not fit well into the
MapReduce paradigm.  The reason I'm making this claim is because in order to
do the pivot operation you must know about every row.  Your input files will
be split at semi-arbitrary places, essentially making it impossible for each
mapper to know every single row.  There may be a way to do this by
collecting, in your map step, key = column number (0, 1, 2, etc) and value
= (A, B, C, etc), though you may run into problems when you try to pivot
back.  I say this because when you pivot back, you need to have each column,
which means you'll need one reduce step.  There may be a way to put the
pivot-back operation in a second iteration, though I don't think that would
help you.

Terrence, please confirm that you've defined your example correctly.  In the
meantime, can someone else confirm that this problem does not fit well into
the MapReduce paradigm?

Alex

On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi 
[EMAIL PROTECTED] wrote:

 I am trying to write a map reduce implementation to do the following:

 1) read tabular data delimited in some fashion
 2) pivot that data, so the rows are columns and the columns are rows
 3) shuffle the rows (that were the columns) to randomize the data
 4) pivot the data back

 For example.

 A|B|C
 D|E|G

 pivots to...

 D|A
 E|B
 C|G

 Then for each row, shuffle the contents around randomly...

 D|A
 B|E
 G|C

 Then pivot the data back...

 A|E|C
 D|B|C

 You can reference my progress so far...

 http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
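
For comparison, a plain single-machine sketch of the intended transformation
(pivot, shuffle each column, pivot back) is below; it is only an illustration
of the goal, not code taken from the branch above.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ColumnShuffle {
  // Shuffle the values within each column while keeping the rows intact.
  public static String[][] shuffleColumns(String[][] rows) {
    int numRows = rows.length, numCols = rows[0].length;
    String[][] out = new String[numRows][numCols];
    for (int c = 0; c < numCols; c++) {
      List<String> column = new ArrayList<String>();  // "pivot": extract column c
      for (int r = 0; r < numRows; r++) column.add(rows[r][c]);
      Collections.shuffle(column);                    // randomize the column
      for (int r = 0; r < numRows; r++) out[r][c] = column.get(r);  // "pivot back"
    }
    return out;
  }

  public static void main(String[] args) {
    String[][] input = { {"A", "B", "C"}, {"D", "E", "G"} };
    for (String[] row : shuffleColumns(input)) {
      StringBuilder line = new StringBuilder();
      for (int c = 0; c < row.length; c++) {
        if (c > 0) line.append('|');
        line.append(row[c]);
      }
      System.out.println(line);  // e.g. "A|E|G" then "D|B|C"
    }
  }
}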

 Terrence A. Pietrondi


 --- On Thu, 10/2/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

  From: Alex Loddengaard [EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Thursday, October 2, 2008, 1:36 PM
  I think it really depends on the job as to where logic goes.
   Sometimes your
  reduce step is as simple as an identify function, and
  sometimes it can be
  more complex than your map step.  It all depends on your
  data and the
  operation(s) you're trying to perform.
 
  Perhaps we should step out of the abstract.  Do you have a
  specific problem
  you're trying to solve?  Can you describe it?
 
  Alex
 
  On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi
  [EMAIL PROTECTED]
   wrote:
 
   I am sorry for the confusion. I meant distributed
  data.
  
   So help me out here. For example, if I am reducing to
  a single file, then
   my main transformation logic would be in my mapping
  step since I am reducing
   away from the data?
  
   Terrence A. Pietrondi
   http://del.icio.us/tepietrondi
  
  
   --- On Wed, 10/1/08, Alex Loddengaard
  [EMAIL PROTECTED] wrote:
  
From: Alex Loddengaard
  [EMAIL PROTECTED]
Subject: Re: architecture diagram
To: core-user@hadoop.apache.org
Date: Wednesday, October 1, 2008, 7:44 PM
I'm not sure what you mean by
  disconnected parts
of data, but Hadoop is
implemented to try and perform map tasks on
  machines that
have input data.
This is to lower the amount of network traffic,
  hence
making the entire job
run faster.  Hadoop does all this for you under
  the hood.
From a user's
point of view, all you need to do is store data
  in HDFS
(the distributed
filesystem), and run MapReduce jobs on that data.
   Take a
look here:
   
http://wiki.apache.org/hadoop/WordCount
   
Alex
   
On Wed, Oct 1, 2008 at 1:11 PM, Terrence A.
  Pietrondi
[EMAIL PROTECTED]
 wrote:
   
 So to be distributed in a sense,
  you would
want to do your computation on
 the disconnected parts of data in the map
  phase I
would guess?

 Terrence A. Pietrondi
 http://del.icio.us/tepietrondi


 --- On Wed, 10/1/08, Arun C Murthy
[EMAIL PROTECTED] wrote:

  From: Arun C Murthy
  [EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Wednesday, October 1, 2008, 2:16
  PM
  On Oct 1, 2008, at 10:17 AM, Terrence
  A.
Pietrondi wrote:
 
   I am trying to plan out my
  map-reduce
implementation
  and I have some
   questions of where computation
  should be
split in
  order to take
   advantage of the distributed
  nodes.
  
   Looking at the architecture
  diagram
 
   
  (http://hadoop.apache.org/core/images/architecture.gif
   ), are the map boxes the major
  computation
areas or is
  the reduce
   the major computation area?
  
 
  Usually the maps perform the
  'embarrassingly

Re: architecture diagram

2008-10-03 Thread Terrence A. Pietrondi
Sorry for the confusion, I did make some typos. My example should have looked 
like... 

 A|B|C
 D|E|G

 pivots to...

 D|A
 E|B
 G|C

 Then for each row, shuffle the contents around randomly...

 D|A
 B|E
 C|G

 Then pivot the data back...

 A|E|G
 D|B|C

The general goal is to shuffle the elements in each column in the input data. 
Meaning, the ordering of the elements in each column will not be the same as in 
the input.

If you look at the initial input and compare to the final output, you'll see 
that during the shuffling, B and E are swapped, and G and C are swapped, while 
A and D were shuffled back into their originating positions in the column. 

Once again, sorry for the typos and confusion.

Terrence A. Pietrondi

--- On Fri, 10/3/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

 From: Alex Loddengaard [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Friday, October 3, 2008, 11:01 AM
 Can you confirm that the example you've presented is
 accurate?  I think you
 may have made some typos, because the letter G
 isn't in the final result;
 I also think your first pivot accidentally swapped C and G.
  I'm having a
 hard time understanding what you want to do, because it
 seems like your
 operations differ from your example.
 
 With that said, at first glance, this problem may not fit
 well in to the
 MapReduce paradigm.  The reason I'm making this claim
 is because in order to
 do the pivot operation you must know about every row.  Your
 input files will
 be split at semi-arbitrary places, essentially making it
 impossible for each
 mapper to know every single row.  There may be a way to do
 this by
 collecting, in your map step, key = column number (0,
 1, 2, etc) and value
 = (A, B, C, etc), though you may run in to problems
 when you try to pivot
 back.  I say this because when you pivot back, you need to
 have each column,
 which means you'll need one reduce step.  There may be
 a way to put the
 pivot-back operation in a second iteration, though I
 don't think that would
 help you.
 
 Terrence, please confirm that you've defined your
 example correctly.  In the
 meantime, can someone else confirm that this problem does
 not fit will in to
 the MapReduce paradigm?
 
 Alex
 
 On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi 
 [EMAIL PROTECTED] wrote:
 
  I am trying to write a map reduce implementation to do
 the following:
 
  1) read tabular data delimited in some fashion
  2) pivot that data, so the rows are columns and the
 columns are rows
  3) shuffle the rows (that were the columns) to
 randomize the data
  4) pivot the data back
 
  For example.
 
  A|B|C
  D|E|G
 
  pivots too...
 
  D|A
  E|B
  C|G
 
  Then for each row, shuffle the contents around
 randomly...
 
  D|A
  B|E
  G|C
 
  Then pivot the data back...
 
  A|E|C
  D|B|C
 
  You can reference my progress so far...
 
 
 http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
 
  Terrence A. Pietrondi
 
 
  --- On Thu, 10/2/08, Alex Loddengaard
 [EMAIL PROTECTED] wrote:
 
   From: Alex Loddengaard
 [EMAIL PROTECTED]
   Subject: Re: architecture diagram
   To: core-user@hadoop.apache.org
   Date: Thursday, October 2, 2008, 1:36 PM
   I think it really depends on the job as to where
 logic goes.
Sometimes your
   reduce step is as simple as an identify function,
 and
   sometimes it can be
   more complex than your map step.  It all depends
 on your
   data and the
   operation(s) you're trying to perform.
  
   Perhaps we should step out of the abstract.  Do
 you have a
   specific problem
   you're trying to solve?  Can you describe it?
  
   Alex
  
   On Thu, Oct 2, 2008 at 4:55 AM, Terrence A.
 Pietrondi
   [EMAIL PROTECTED]
wrote:
  
I am sorry for the confusion. I meant
 distributed
   data.
   
So help me out here. For example, if I am
 reducing to
   a single file, then
my main transformation logic would be in my
 mapping
   step since I am reducing
away from the data?
   
Terrence A. Pietrondi
http://del.icio.us/tepietrondi
   
   
--- On Wed, 10/1/08, Alex Loddengaard
   [EMAIL PROTECTED] wrote:
   
 From: Alex Loddengaard
   [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Wednesday, October 1, 2008, 7:44
 PM
 I'm not sure what you mean by
   disconnected parts
 of data, but Hadoop is
 implemented to try and perform map
 tasks on
   machines that
 have input data.
 This is to lower the amount of network
 traffic,
   hence
 making the entire job
 run faster.  Hadoop does all this for
 you under
   the hood.
 From a user's
 point of view, all you need to do is
 store data
   in HDFS
 (the distributed
 filesystem), and run MapReduce jobs on
 that data.
Take a
 look here:


 http://wiki.apache.org/hadoop/WordCount

 Alex

 On Wed, Oct 1, 2008 at 1:11 PM,
 Terrence A.
   Pietrondi

Re: architecture diagram

2008-10-03 Thread Alex Loddengaard
The approach that you've described does not fit well into the MapReduce
paradigm.  You may want to consider randomizing your data in a different
way.

Unfortunately some things can't be solved well with MapReduce, and I think
this is one of them.

Can someone else say more?

Alex

On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi [EMAIL PROTECTED]
 wrote:

 Sorry for the confusion, I did make some typos. My example should have
 looked like...

  A|B|C
  D|E|G
 
  pivots too...
 
  D|A
  E|B
  G|C
 
  Then for each row, shuffle the contents around randomly...
 
  D|A
  B|E
  C|G
 
  Then pivot the data back...
 
  A|E|G
  D|B|C

 The general goal is to shuffle the elements in each column in the input
 data. Meaning, the ordering of the elements in each column will not be the
 same as in input.

 If you look at the initial input and compare to the final output, you'll
 see that during the shuffling, B and E are swapped, and G and C are swapped,
 while A and D were shuffled back into their originating positions in the
 column.

 Once again, sorry for the typos and confusion.

 Terrence A. Pietrondi

 --- On Fri, 10/3/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

  From: Alex Loddengaard [EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Friday, October 3, 2008, 11:01 AM
  Can you confirm that the example you've presented is
  accurate?  I think you
  may have made some typos, because the letter G
  isn't in the final result;
  I also think your first pivot accidentally swapped C and G.
   I'm having a
  hard time understanding what you want to do, because it
  seems like your
  operations differ from your example.
 
  With that said, at first glance, this problem may not fit
  well in to the
  MapReduce paradigm.  The reason I'm making this claim
  is because in order to
  do the pivot operation you must know about every row.  Your
  input files will
  be split at semi-arbitrary places, essentially making it
  impossible for each
  mapper to know every single row.  There may be a way to do
  this by
  collecting, in your map step, key = column number (0,
  1, 2, etc) and value
  = (A, B, C, etc), though you may run in to problems
  when you try to pivot
  back.  I say this because when you pivot back, you need to
  have each column,
  which means you'll need one reduce step.  There may be
  a way to put the
  pivot-back operation in a second iteration, though I
  don't think that would
  help you.
 
  Terrence, please confirm that you've defined your
  example correctly.  In the
  meantime, can someone else confirm that this problem does
  not fit will in to
  the MapReduce paradigm?
 
  Alex
 
  On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi 
  [EMAIL PROTECTED] wrote:
 
   I am trying to write a map reduce implementation to do
  the following:
  
   1) read tabular data delimited in some fashion
   2) pivot that data, so the rows are columns and the
  columns are rows
   3) shuffle the rows (that were the columns) to
  randomize the data
   4) pivot the data back
  
   For example.
  
   A|B|C
   D|E|G
  
   pivots too...
  
   D|A
   E|B
   C|G
  
   Then for each row, shuffle the contents around
  randomly...
  
   D|A
   B|E
   G|C
  
   Then pivot the data back...
  
   A|E|C
   D|B|C
  
   You can reference my progress so far...
  
  
  http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
  
   Terrence A. Pietrondi
  
  
   --- On Thu, 10/2/08, Alex Loddengaard
  [EMAIL PROTECTED] wrote:
  
From: Alex Loddengaard
  [EMAIL PROTECTED]
Subject: Re: architecture diagram
To: core-user@hadoop.apache.org
Date: Thursday, October 2, 2008, 1:36 PM
I think it really depends on the job as to where
  logic goes.
 Sometimes your
reduce step is as simple as an identify function,
  and
sometimes it can be
more complex than your map step.  It all depends
  on your
data and the
operation(s) you're trying to perform.
   
Perhaps we should step out of the abstract.  Do
  you have a
specific problem
you're trying to solve?  Can you describe it?
   
Alex
   
On Thu, Oct 2, 2008 at 4:55 AM, Terrence A.
  Pietrondi
[EMAIL PROTECTED]
 wrote:
   
 I am sorry for the confusion. I meant
  distributed
data.

 So help me out here. For example, if I am
  reducing to
a single file, then
 my main transformation logic would be in my
  mapping
step since I am reducing
 away from the data?

 Terrence A. Pietrondi
 http://del.icio.us/tepietrondi


 --- On Wed, 10/1/08, Alex Loddengaard
[EMAIL PROTECTED] wrote:

  From: Alex Loddengaard
[EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Wednesday, October 1, 2008, 7:44
  PM
  I'm not sure what you mean by
disconnected parts
  of data, but Hadoop is
  implemented to try and perform map
  tasks

Re: architecture diagram

2008-10-02 Thread Terrence A. Pietrondi
I am sorry for the confusion. I meant "distributed" data.

So help me out here. For example, if I am reducing to a single file, then my 
main transformation logic would be in my mapping step since I am reducing away 
from the data?

Terrence A. Pietrondi
http://del.icio.us/tepietrondi


--- On Wed, 10/1/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

 From: Alex Loddengaard [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Wednesday, October 1, 2008, 7:44 PM
 I'm not sure what you mean by disconnected parts
 of data, but Hadoop is
 implemented to try and perform map tasks on machines that
 have input data.
 This is to lower the amount of network traffic, hence
 making the entire job
 run faster.  Hadoop does all this for you under the hood. 
 From a user's
 point of view, all you need to do is store data in HDFS
 (the distributed
 filesystem), and run MapReduce jobs on that data.  Take a
 look here:
 
 http://wiki.apache.org/hadoop/WordCount
 
 Alex
 
 On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi
 [EMAIL PROTECTED]
  wrote:
 
  So to be distributed in a sense, you would
 want to do your computation on
  the disconnected parts of data in the map phase I
 would guess?
 
  Terrence A. Pietrondi
  http://del.icio.us/tepietrondi
 
 
  --- On Wed, 10/1/08, Arun C Murthy
 [EMAIL PROTECTED] wrote:
 
   From: Arun C Murthy [EMAIL PROTECTED]
   Subject: Re: architecture diagram
   To: core-user@hadoop.apache.org
   Date: Wednesday, October 1, 2008, 2:16 PM
   On Oct 1, 2008, at 10:17 AM, Terrence A.
 Pietrondi wrote:
  
I am trying to plan out my map-reduce
 implementation
   and I have some
questions of where computation should be
 split in
   order to take
advantage of the distributed nodes.
   
Looking at the architecture diagram
  
 (http://hadoop.apache.org/core/images/architecture.gif
), are the map boxes the major computation
 areas or is
   the reduce
the major computation area?
   
  
   Usually the maps perform the 'embarrassingly
   parallel' computational
   steps where-in each map works independently on a
   'split' on your input
   and the reduces perform the 'aggregate'
   computations.
  
From http://hadoop.apache.org/core/ :
  
   Hadoop implements MapReduce, using the Hadoop
 Distributed
   File System
   (HDFS). MapReduce divides applications into many
 small
   blocks of work.
   HDFS creates multiple replicas of data blocks for
   reliability, placing
   them on compute nodes around the cluster.
 MapReduce can
   then process
   the data where it is located.
  
   The Hadoop Map-Reduce framework is quite good at
 scheduling
   your
   'maps' on the actual data-nodes where the
   input-blocks are present,
   leading to i/o efficiencies...
  
   Arun
  
Thanks.
   
Terrence A. Pietrondi
   
Re: architecture diagram

2008-10-02 Thread Alex Loddengaard
I think it really depends on the job as to where logic goes.  Sometimes your
reduce step is as simple as an identity function, and sometimes it can be
more complex than your map step.  It all depends on your data and the
operation(s) you're trying to perform.
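
When the reduce really is just an identity, the mapred API of that era ships a
ready-made class; a small configuration sketch, assuming the rest of the job
is already set up elsewhere:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IdentityReduceConfig {
  public static void configure(JobConf conf) {
    conf.setReducerClass(IdentityReducer.class);  // passes <key, value> through
    conf.setNumReduceTasks(1);  // one reducer gives a single output file
  }
}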

Perhaps we should step out of the abstract.  Do you have a specific problem
you're trying to solve?  Can you describe it?

Alex

On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi [EMAIL PROTECTED]
 wrote:

 I am sorry for the confusion. I meant distributed data.

 So help me out here. For example, if I am reducing to a single file, then
 my main transformation logic would be in my mapping step since I am reducing
 away from the data?

 Terrence A. Pietrondi
 http://del.icio.us/tepietrondi


 --- On Wed, 10/1/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

  From: Alex Loddengaard [EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Wednesday, October 1, 2008, 7:44 PM
  I'm not sure what you mean by disconnected parts
  of data, but Hadoop is
  implemented to try and perform map tasks on machines that
  have input data.
  This is to lower the amount of network traffic, hence
  making the entire job
  run faster.  Hadoop does all this for you under the hood.
  From a user's
  point of view, all you need to do is store data in HDFS
  (the distributed
  filesystem), and run MapReduce jobs on that data.  Take a
  look here:
 
  http://wiki.apache.org/hadoop/WordCount
 
  Alex
 
  On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi
  [EMAIL PROTECTED]
   wrote:
 
   So to be distributed in a sense, you would
  want to do your computation on
   the disconnected parts of data in the map phase I
  would guess?
  
   Terrence A. Pietrondi
   http://del.icio.us/tepietrondi
  
  
   --- On Wed, 10/1/08, Arun C Murthy
  [EMAIL PROTECTED] wrote:
  
From: Arun C Murthy [EMAIL PROTECTED]
Subject: Re: architecture diagram
To: core-user@hadoop.apache.org
Date: Wednesday, October 1, 2008, 2:16 PM
On Oct 1, 2008, at 10:17 AM, Terrence A.
  Pietrondi wrote:
   
 I am trying to plan out my map-reduce
  implementation
and I have some
 questions of where computation should be
  split in
order to take
 advantage of the distributed nodes.

 Looking at the architecture diagram
   
  (http://hadoop.apache.org/core/images/architecture.gif
 ), are the map boxes the major computation
  areas or is
the reduce
 the major computation area?

   
Usually the maps perform the 'embarrassingly
parallel' computational
steps where-in each map works independently on a
'split' on your input
and the reduces perform the 'aggregate'
computations.
   
 From http://hadoop.apache.org/core/ :
   
Hadoop implements MapReduce, using the Hadoop
  Distributed
File System
(HDFS). MapReduce divides applications into many
  small
blocks of work.
HDFS creates multiple replicas of data blocks for
reliability, placing
them on compute nodes around the cluster.
  MapReduce can
then process
the data where it is located.
   
The Hadoop Map-Reduce framework is quite good at
  scheduling
your
'maps' on the actual data-nodes where the
input-blocks are present,
leading to i/o efficiencies...
   
Arun
   
 Thanks.

 Terrence A. Pietrondi



  
  
  
  






architecture diagram

2008-10-01 Thread Terrence A. Pietrondi
I am trying to plan out my map-reduce implementation and I have some questions 
of where computation should be split in order to take advantage of the 
distributed nodes. 

Looking at the architecture diagram 
(http://hadoop.apache.org/core/images/architecture.gif), are the map boxes the 
major computation areas or is the reduce the major computation area?

Thanks.

Terrence A. Pietrondi


  


Re: architecture diagram

2008-10-01 Thread Alex Loddengaard
Hi Terrence,

It really depends on your job I think.  Often reduce steps can be the
bottleneck if you want a single output file (one reducer).
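
(A rough sketch of what "one reducer" looks like in a driver with the old
JobConf API; the class name and job name are placeholders and the
mapper/reducer setup is left out:

import java.io.IOException;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SingleOutputDriver {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(SingleOutputDriver.class);
    conf.setJobName("single-output-file");
    // One reduce task yields a single output file (part-00000), but it
    // also funnels every map output through one node, which is why a
    // lone reducer can become the bottleneck.
    conf.setNumReduceTasks(1);
    // ... set mapper, reducer, and input/output paths here ...
    JobClient.runJob(conf);
  }
}
)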

Hope this helps.

Alex

On Wed, Oct 1, 2008 at 10:17 AM, Terrence A. Pietrondi 
[EMAIL PROTECTED] wrote:

 I am trying to plan out my map-reduce implementation and I have some
 questions of where computation should be split in order to take advantage of
 the distributed nodes.

 Looking at the architecture diagram (
 http://hadoop.apache.org/core/images/architecture.gif), are the map boxes
 the major computation areas or is the reduce the major computation area?

 Thanks.

 Terrence A. Pietrondi






Re: architecture diagram

2008-10-01 Thread Tim Wintle
I normally find the intermediate stage of copying data from the mappers to
the reducers to be a significant step - but that's not over the
best-quality switches...

The mappers and reducers work on the same boxes, close to the data.  
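
(If that map-to-reduce copy, the shuffle, is the expensive part, a combiner
can often shrink it by pre-aggregating map output before it crosses the
network.  A hedged sketch with the old JobConf API; SumReduce here stands in
for any reducer whose operation is associative and commutative, and the rest
of the job setup is omitted:

import java.io.IOException;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CombinerDriver {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(CombinerDriver.class);
    // Run the reduce logic locally on each map's output first, so far
    // less data has to travel over the switches to the real reducers.
    conf.setCombinerClass(SumReduce.class);
    conf.setReducerClass(SumReduce.class);
    // ... mapper, key/value types, input and output paths ...
    JobClient.runJob(conf);
  }
}
)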


On Wed, 2008-10-01 at 10:59 -0700, Alex Loddengaard wrote:
 
 It really depends on your job I think.  Often reduce steps can be the
 bottleneck if you want a single output file (one reducer).
 
 Hope this helps.
 
 Alex



Re: architecture diagram

2008-10-01 Thread Arun C Murthy


On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:

I am trying to plan out my map-reduce implementation and I have some  
questions of where computation should be split in order to take  
advantage of the distributed nodes.


Looking at the architecture diagram (http://hadoop.apache.org/core/images/architecture.gif 
), are the map boxes the major computation areas or is the reduce  
the major computation area?




Usually the maps perform the 'embarrassingly parallel' computational
steps, wherein each map works independently on a 'split' of your input,
and the reduces perform the 'aggregate' computations.
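
(To make that concrete, the WordCount example linked elsewhere in this thread
has roughly the shape sketched below; the class names and types are
illustrative, not Terrence's actual job:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// The 'embarrassingly parallel' part: each map sees only its own split.
public class TokenMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> out,
                  Reporter reporter) throws IOException {
    StringTokenizer itr = new StringTokenizer(line.toString());
    while (itr.hasMoreTokens()) {
      out.collect(new Text(itr.nextToken()), ONE);
    }
  }
}

// The 'aggregate' part: one reduce call sees every count for a given word.
class SumReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> out,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    out.collect(key, new IntWritable(sum));
  }
}
)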


From http://hadoop.apache.org/core/ :

Hadoop implements MapReduce, using the Hadoop Distributed File System  
(HDFS). MapReduce divides applications into many small blocks of work.  
HDFS creates multiple replicas of data blocks for reliability, placing  
them on compute nodes around the cluster. MapReduce can then process  
the data where it is located.


The Hadoop Map-Reduce framework is quite good at scheduling your  
'maps' on the actual data-nodes where the input-blocks are present,  
leading to i/o efficiencies...


Arun


Thanks.

Terrence A. Pietrondi







Re: architecture diagram

2008-10-01 Thread Terrence A. Pietrondi
So to be distributed in a sense, you would want to do your computation on the 
disconnected parts of data in the map phase I would guess?

Terrence A. Pietrondi
http://del.icio.us/tepietrondi


--- On Wed, 10/1/08, Arun C Murthy [EMAIL PROTECTED] wrote:

 From: Arun C Murthy [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Wednesday, October 1, 2008, 2:16 PM
 On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:
 
  I am trying to plan out my map-reduce implementation
 and I have some  
  questions of where computation should be split in
 order to take  
  advantage of the distributed nodes.
 
  Looking at the architecture diagram
 (http://hadoop.apache.org/core/images/architecture.gif 
  ), are the map boxes the major computation areas or is
 the reduce  
  the major computation area?
 
 
 Usually the maps perform the 'embarrassingly
 parallel' computational  
 steps where-in each map works independently on a
 'split' on your input  
 and the reduces perform the 'aggregate'
 computations.
 
  From http://hadoop.apache.org/core/ :
 
 Hadoop implements MapReduce, using the Hadoop Distributed
 File System  
 (HDFS). MapReduce divides applications into many small
 blocks of work.  
 HDFS creates multiple replicas of data blocks for
 reliability, placing  
 them on compute nodes around the cluster. MapReduce can
 then process  
 the data where it is located.
 
 The Hadoop Map-Reduce framework is quite good at scheduling
 your  
 'maps' on the actual data-nodes where the
 input-blocks are present,  
 leading to i/o efficiencies...
 
 Arun
 
  Thanks.
 
  Terrence A. Pietrondi
 
 
 


  


Re: architecture diagram

2008-10-01 Thread Alex Loddengaard
I'm not sure what you mean by disconnected parts of data, but Hadoop is
implemented to try and perform map tasks on machines that have input data.
This is to lower the amount of network traffic, hence making the entire job
run faster.  Hadoop does all this for you under the hood.  From a user's
point of view, all you need to do is store data in HDFS (the distributed
filesystem), and run MapReduce jobs on that data.  Take a look here:

http://wiki.apache.org/hadoop/WordCount
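
(To make the "store it in HDFS and run a job on it" part concrete, a minimal
driver sketch follows; the paths and class names are made-up placeholders,
and the mapper/reducer configuration is omitted:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class HdfsJobDriver {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(HdfsJobDriver.class);
    // The input already lives in HDFS; the framework schedules each map
    // on or near a node holding that split's blocks, so you never manage
    // data placement yourself.
    FileInputFormat.setInputPaths(conf, new Path("/user/example/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/example/output"));
    // ... setMapperClass / setReducerClass / key and value types ...
    JobClient.runJob(conf);
  }
}
)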

Alex

On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi [EMAIL PROTECTED]
 wrote:

 So to be distributed in a sense, you would want to do your computation on
 the disconnected parts of data in the map phase I would guess?

 Terrence A. Pietrondi
 http://del.icio.us/tepietrondi


 --- On Wed, 10/1/08, Arun C Murthy [EMAIL PROTECTED] wrote:

  From: Arun C Murthy [EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Wednesday, October 1, 2008, 2:16 PM
  On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:
 
   I am trying to plan out my map-reduce implementation
  and I have some
   questions of where computation should be split in
  order to take
   advantage of the distributed nodes.
  
   Looking at the architecture diagram
  (http://hadoop.apache.org/core/images/architecture.gif
   ), are the map boxes the major computation areas or is
  the reduce
   the major computation area?
  
 
  Usually the maps perform the 'embarrassingly
  parallel' computational
  steps where-in each map works independently on a
  'split' on your input
  and the reduces perform the 'aggregate'
  computations.
 
   From http://hadoop.apache.org/core/ :
 
  Hadoop implements MapReduce, using the Hadoop Distributed
  File System
  (HDFS). MapReduce divides applications into many small
  blocks of work.
  HDFS creates multiple replicas of data blocks for
  reliability, placing
  them on compute nodes around the cluster. MapReduce can
  then process
  the data where it is located.
 
  The Hadoop Map-Reduce framework is quite good at scheduling
  your
  'maps' on the actual data-nodes where the
  input-blocks are present,
  leading to i/o efficiencies...
 
  Arun
 
   Thanks.
  
   Terrence A. Pietrondi