@Subir

You can do it.  Here is some pseudo code for it in map/reduce.  It abuses 
Map/Reduce a little to be more performant, but it is definitely doable.  At 
the end you will get one file for each reducer you have configured.  If you 
want a single file you can concatenate all of the files together, ordered by 
file name.  You should be able to do it in Pig too, but you will need an 
input format that gives you the offset, and you may need the reducer to sort 
by the offset internally within the bag it is handed.  This may cause Pig to 
have performance issues if it cannot keep the entire bag in memory to sort, 
which is why I did it in MR instead.

//Assuming TextInputFormat, where the key is the byte offset into the original
//input file, and there is only one input file. If there is more than one input
//file you need a way to include the ordering of the input files in the offset.
map(LongWritable offset, Text line) {
    String[] parts = line.toString().split(",");
    for (int i = 0; i < parts.length; i++) {
        collect(new ColumnOffsetKey(offset.get(), i), new Text(parts[i]));
    }
}
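
ColumnOffsetKey is not a built-in type; here is a minimal sketch of what that 
composite key could look like as a WritableComparable (class and field names 
are just illustrative, not from any existing library):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

//Illustrative composite key: serializes column first, then offset.
public class ColumnOffsetKey implements WritableComparable<ColumnOffsetKey> {
    private long offset; //byte offset of the original row
    private long column; //column index within that row

    public ColumnOffsetKey() {}

    public ColumnOffsetKey(long offset, long column) {
        this.offset = offset;
        this.column = column;
    }

    public long getOffset() { return offset; }
    public long getColumn() { return column; }

    public void write(DataOutput out) throws IOException {
        out.writeLong(column);
        out.writeLong(offset);
    }

    public void readFields(DataInput in) throws IOException {
        column = in.readLong();
        offset = in.readLong();
    }

    public int compareTo(ColumnOffsetKey other) {
        //Column first (the new row), then offset (position within the new row)
        if (column != other.column) {
            return column < other.column ? -1 : 1;
        }
        if (offset != other.offset) {
            return offset < other.offset ? -1 : 1;
        }
        return 0;
    }
}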

//We need to know the max number of columns ahead of time to get total order
//partitioning to work.
int partition(ColumnOffsetKey key) {
    return (int)(((double)key.column / MaxColumns) * numPartitions);
}
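
For example, MaxColumns could be passed in through the job Configuration and 
picked up by a custom Partitioner. A rough sketch (the property name and class 
name are made up for illustration):

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

//Illustrative partitioner: reads the column count from the job configuration.
public class TransposePartitioner extends Partitioner<ColumnOffsetKey, Text>
        implements Configurable {
    private Configuration conf;
    private long maxColumns;

    public void setConf(Configuration conf) {
        this.conf = conf;
        maxColumns = conf.getLong("transpose.max.columns", 1L);
    }

    public Configuration getConf() { return conf; }

    public int getPartition(ColumnOffsetKey key, Text value, int numPartitions) {
        //Columns are assumed to be 0-based, so column < maxColumns
        return (int)(((double)key.getColumn() / maxColumns) * numPartitions);
    }
}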

//You probably want to put in a binary comparator for performance reasons
int compare(ColumnOffsetKey key1, ColumnOffsetKey key2) {
    //Sort by column first (which will become the new row), then by offset
    //(which tells us the new column ordering).
    if(key1.column > key2.column) {
      return 1;
    } else if(key1.column < key2.column) {
      return -1;
    } else if(key1.offset > key2.offset) {
      return 1;
    } else if (key1.offset < key2.offset) {
      return -1;
    }
    return 0;
}
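
A rough sketch of such a binary (raw) comparator, assuming the key is 
serialized as two longs, column then offset, as in the ColumnOffsetKey sketch 
above:

import org.apache.hadoop.io.WritableComparator;

//Illustrative raw comparator: compares serialized keys without deserializing.
public class ColumnOffsetKeyComparator extends WritableComparator {
    public ColumnOffsetKeyComparator() {
        super(ColumnOffsetKey.class);
    }

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        //Bytes 0-7 hold the column, bytes 8-15 hold the offset
        long col1 = readLong(b1, s1);
        long col2 = readLong(b2, s2);
        if (col1 != col2) {
            return col1 < col2 ? -1 : 1;
        }
        long off1 = readLong(b1, s1 + 8);
        long off2 = readLong(b2, s2 + 8);
        if (off1 != off2) {
            return off1 < off2 ? -1 : 1;
        }
        return 0;
    }
}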

StringBuffer currentRow = null;
long currentRowNum = -1;

reduce(ColumnOffsetKey key, Iterable<Text> values) {
    //This is a bit ugly because we need to detect changes to the row; there is
    //probably a cleaner way to do this.
    //Each (offset, column) key is unique, so there is exactly one value per call.
    Text part = values.iterator().next();
    if (currentRowNum != key.column) {
        //Output the current row if needed
        if (currentRow != null) {
            collect(null, currentRow);
        }
        currentRow = new StringBuffer();
        currentRow.append(part);
        currentRowNum = key.column;
    } else {
        currentRow.append(',');
        currentRow.append(part);
    }
}

//This is called at the end of the reducer in the new API, or something like
//it; I don't remember the method name off the top of my head.
cleanup() {
    if(currentRow != null)
      collect(null, currentRow);
}
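
Wiring it all together, a driver could look roughly like this. The class names 
refer to the sketches above, plus a hypothetical TransposeMapper/TransposeReducer 
holding the map/reduce code; all of them are placeholders, not an existing 
library:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

//Illustrative driver: registers the custom key, partitioner and sort comparator.
public class TransposeDriver {
    public static void main(String[] args) throws Exception {
        long maxColumns = Long.parseLong(args[2]); //must be known up front

        Configuration conf = new Configuration();
        conf.setLong("transpose.max.columns", maxColumns);

        Job job = new Job(conf, "transpose");
        job.setJarByClass(TransposeDriver.class);
        job.setInputFormatClass(TextInputFormat.class); //key = byte offset, value = line
        job.setMapperClass(TransposeMapper.class);
        job.setReducerClass(TransposeReducer.class);
        job.setPartitionerClass(TransposePartitioner.class);
        job.setSortComparatorClass(ColumnOffsetKeyComparator.class);
        job.setMapOutputKeyClass(ColumnOffsetKey.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The number of reduces then controls how many output part files you get, and 
since the partitioner assigns columns to reducers in order, concatenating the 
part files by name gives the full transposed matrix.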

On 6/22/12 5:35 AM, "Subir S" <subir.sasiku...@gmail.com> wrote:

Thank you for the inputs!

@Norbert,
 But a GROUP BY column number clause also does not guarantee that the order of
the columns is preserved. Even the row number would need to be known, so that
at the end we could sort each row by row number using a nested FOREACH. But
since the sort order is not preserved after that FOREACH, the data may again
end up in the wrong order within the row for other operations.

To me it seems like it is not possible to do this in MR.


On Fri, Jun 22, 2012 at 12:56 AM, Robert Evans <ev...@yahoo-inc.com> wrote:

> That may be true if you have multiple reduces; I have not read through the
> code very closely.  So you can either run it with a single reduce, or you
> can write a custom partitioner to do it.  You only need to know the number
> of columns, and then you can divide them up appropriately, kind of like how
> the total order partitioner does it.
>
> --Bobby Evans
>
> On 6/21/12 1:15 PM, "Norbert Burger" <norbert.bur...@gmail.com> wrote:
>
> While it may be fine for many cases, if I'm reading the Nectar code
> correctly, that transpose doesn't guarantee anything about the order of
> rows within each column.  In other words, transposing:
>
> a - b - c
> d - e - f
> g - h - i
>
> may give you different permutations of "a - d - g" as the first row,
> depending on shuffle order.  You can trivially avoid this with one
> mapper/reducer, but then you're not exploiting the framework.  Note that
> you can accomplish the same with a higher-level language like Pig by using
> a UDF like LinkedIn's Enumerate [1] to tag each column, and then simply
> GROUPing BY column number.
>
> [1]
>
> https://raw.github.com/linkedin/datafu/master/src/java/datafu/pig/bags/Enumerate.java
>
> Norbert
>
> On Thu, Jun 21, 2012 at 5:00 AM, madhu phatak <phatak....@gmail.com>
> wrote:
>
> > Hi,
> >  It's possible in Map/Reduce. Look into the code here
> >
> >
> https://github.com/zinnia-phatak-dev/Nectar/tree/master/Nectar-regression/src/main/java/com/zinnia/nectar/regression/hadoop/primitive/mapreduce
> >
> >
> >
> > 2012/6/21 Subir S <subir.sasiku...@gmail.com>
> >
> > > Hi,
> > >
> > > Is it possible to implement a transpose operation of rows into columns
> > > and vice versa...
> > >
> > >
> > > i.e.
> > >
> > > col1 col2 col3
> > > col4 col5 col6
> > > col7 col8 col9
> > > col10 col11 col12
> > >
> > > can this be converted to
> > >
> > > col1 col4 col7 col10
> > > col2 col5 col8 col11
> > > col3 col6 col9 col12
> > >
> > > Is this even possible with map reduce? If yes, which language helps to
> > > achieve this faster?
> > >
> > > Thanks
> > >
> >
> >
> >
> > --
> > https://github.com/zinnia-phatak-dev/Nectar
> >
>
>
