Re: Generating multiple tuples from single tuple

Jonathan Coveney Mon, 02 Jul 2012 13:20:02 -0700

IMHO, if you want this to be more generic, I would have the UDF just take the
full line and parse it out itself. Why? Because what happens when you have
an indeterminate number of columns? That's my own personal opinion though.
As far as implementation goes, I would return a DataBag (because what you
want are many rows, and a bag is how Pig holds many rows).


You want these two things to make the Tuples and the output bag:

private static final TupleFactory mTupleFactory =
TupleFactory.getInstance();
private static final BagFactory mBagFactory = BagFactory.getInstance();

Their use is described in the Pig API docs, but essentially, you'll have
something like this (this is off the cuff and needs some love, but is the
general idea)...

DataBag output = mBagFactory.newDefaultBag();
// split() takes a regex, so the pipe has to be escaped
String[] vals = ((String)input.get(0)).split("\\|");
List<Object> protoTuple = new ArrayList<Object>(3);
protoTuple.add(vals[0]); //the first will be the ID
protoTuple.add(null);
protoTuple.add(null);
for (int i = 1; i < vals.length; i++) {
    String[] colAndValue = vals[i].split(":");
    protoTuple.set(1, colAndValue[0]); //the column name
    protoTuple.set(2, colAndValue[1]); //the value
    //the default of newTuple(List) is to copy the List over, which is
    //what we want
    output.add(mTupleFactory.newTuple(protoTuple));
}
return output;

The output bag will always hold tuples of ID, then col and val. You want to
FLATTEN the output of this UDF so that each tuple in the bag becomes its own
row.
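
The parsing part can be tried out without Pig on the classpath. Here is a
minimal sketch in plain Java of the split logic described above. The class
name SplitLineSketch and the method splitLine are made up for illustration,
plain String arrays stand in for Pig Tuples, and the trimming of whitespace
and skipping of fields without a colon are my assumptions about the data,
not something Pig does for you.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the body of the UDF; no Pig classes involved.
class SplitLineSketch {

    // Turns "ID | col1:val1 | col2:val2" into rows of {ID, col, val}.
    static List<String[]> splitLine(String line) {
        List<String[]> rows = new ArrayList<>();
        // String.split takes a regex, so the pipe must be escaped.
        String[] vals = line.split("\\|");
        String id = vals[0].trim(); // assumption: padding around fields is noise
        for (int i = 1; i < vals.length; i++) {
            // Limit of 2 keeps any further colons inside the value.
            String[] colAndValue = vals[i].trim().split(":", 2);
            if (colAndValue.length < 2) {
                continue; // assumption: silently skip fields with no colon
            }
            // A fresh array per row, so rows never share state.
            rows.add(new String[] { id, colAndValue[0], colAndValue[1] });
        }
        return rows;
    }

    public static void main(String[] args) {
        String line = "ID | ColumnName1:Value1 | ColumnName2:Value2";
        for (String[] row : splitLine(line)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Running main on the sample line from the original question should print one
line per column/value pair, each starting with the ID. In the real UDF each
String array would become a Tuple added to the output DataBag.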

2012/7/2 naresh <meumanar...@gmail.com>

> Thanks for the suggestions.
>
> @Jonathan Coveney:
>
> input tuple :  (id1,column1,column2)
> output : two tuples (id1,column1)  and (id2,column2) so it is List<Tuple>
> or should I return a Bag?
>
> public class SPLITTUPPLE extends EvalFunc <List<Tuple>>
> {
>     public List<Tuple> exec(Tuple input) throws IOException {
>         if (input == null || input.size() == 0)
>             return null;
>         try{
>             // not sure how whether I can create tuples on my own. Looks
> like I should use TupleFactory.
>             // return list of tuples.
>         }catch(Exception e){
>             throw WrappedIOException.wrap("Caught exception processing
> input row ", e);
>         }
>     }
> }
>
> Can you point me to some example?
>
> Thanks for your time,
> Naresh.
>
> On Mon, Jul 2, 2012 at 9:34 AM, Jonathan Coveney <jcove...@gmail.com>
> wrote:
>
> > You can probably hack together something that will do exactly this
> without
> > writing a UDF, but I think a UDF will be most useful here...especially if
> > you want to add more columns, etc etc.
> >
> > 2012/7/1 Subir S <subir.sasiku...@gmail.com>
> >
> > > Would FLATTEN help?
> > >
> > > B = GROUP A by ID;
> > >
> > > C = FOREACH B GENERATE group, FLATTEN ($1);
> > >
> > > Might work i guess. Not tested.
> > >
> > > On Mon, Jul 2, 2012 at 8:04 AM, naresh <meumanar...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > >         I am new to pig scripting. I like to generate multiple tuples
> > > from
> > > > a single tuple. What I mean is:
> > > >
> > > > I have file with following data in it.
> > > >
> > > > >> cat data
> > > >
> > > > ID | ColumnName1:Value1 | ColumnName2:Value2
> > > >
> > > > so I load it by the following command
> > > >
> > > > grunt >> A = load '$data' using PigStorage('|');
> > > >
> > > > grunt >> dump A;
> > > >
> > > > (ID,ColumnName1:Value1,ColumnName2:Value2)
> > > >
> > > > Now I want to split this tuple into two tuples.
> > > >
> > > > (ID, ColumnName1, Value1)
> > > > (ID, ColumnName2, Value2)
> > > >
> > > > Can I use UDF along with foreach and generate. Some thing like the
> > > > following?
> > > >
> > > > grunt >> foreach A generate SOMEUDF(A)
> > > >
> > > > Thanks for your time,
> > > > Naresh.
> > > >
> > >
> >
>
