Hi,
If you flatten a tuple or bag, Pig prefixes each resulting field with a
disambiguation string ([prefix]::). (See:
http://pig.apache.org/docs/r0.12.0/basic.html#disambiguate).
In your example, getSchemaName() returns a generated unique name built
from the class name, the first input schema field, and a unique id, and
that name becomes the prefix after flattening. If you want to get rid of
the disambiguation string, you need to define the schema explicitly when
flattening:
Example:
A = load 'data.txt' using PigStorage() as (c:chararray);
B = foreach A generate TOBAG(TOTUPLE($0, 1)) as ({(field1:chararray,
field2:int)});
describe B;
B: {bag_0: {(field1: chararray,field2: int)}}
Define the schema when flattening:
C = foreach B generate flatten($0) as (field1:chararray, field2:int);
describe C;
C: {field1: chararray,field2: int}
D = foreach C generate field1;
...
However, if the original column name (field1) is unique within the
schema, you can refer to it by this name, rather than using the
disambiguated form (bag_0::field1), so you don't need to explicitly set
the schema:
C = foreach B generate flatten($0);
describe C;
C: {bag_0::field1: chararray,bag_0::field2: int}
D = foreach C generate field1; --refers to bag_0::field1
...
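The lookup rule described above (a short name works as long as it matches exactly one field) can be modelled roughly as follows. This is only a simplified sketch in plain Java, not Pig's actual code: the class AliasLookup and the method resolve are hypothetical names made up for illustration, and the schema is reduced to a list of alias strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of short-name lookup in a flattened schema.
// It is NOT Pig's implementation; it only illustrates the uniqueness rule.
public class AliasLookup {

    // Returns the full alias that the given short name refers to.
    static String resolve(List<String> schema, String name) {
        // An exact match always wins.
        if (schema.contains(name)) {
            return name;
        }
        // Otherwise collect every alias that ends in "::" + name.
        List<String> matches = new ArrayList<>();
        for (String alias : schema) {
            if (alias.endsWith("::" + name)) {
                matches.add(alias);
            }
        }
        if (matches.size() == 1) {
            return matches.get(0);
        }
        if (matches.isEmpty()) {
            throw new IllegalArgumentException("No field named " + name);
        }
        // Two or more candidates: the short name is ambiguous.
        throw new IllegalArgumentException(
                "Ambiguous field " + name + ": " + matches);
    }

    public static void main(String[] args) {
        List<String> c = Arrays.asList("bag_0::field1", "bag_0::field2");
        System.out.println(resolve(c, "field1"));
    }
}
```

If the same short name appeared under two prefixes (say bag_0::field1 and a made-up bag_1::field1), the sketch throws instead of guessing, which is the case where you would have to use the fully prefixed name.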
Hope this helps!
--Lorand
On 06/14/2014 11:43 PM, Narayanan K wrote:
Hi
I am writing a Pig UDF that returns a Tuple as per
http://wiki.apache.org/pig/UDFManual . I want the output tuple to have
a particular schema, say {name:chararray, age:int}, after I FLATTEN the
result of the UDF.
As per the UDFManual, the method below
public Schema outputSchema(Schema input) {
    try {
        // Build the tuple schema from the input fields, in swapped order.
        Schema tupleSchema = new Schema();
        tupleSchema.add(input.getField(1));
        tupleSchema.add(input.getField(0));
        return new Schema(new Schema.FieldSchema(
                getSchemaName(this.getClass().getName().toLowerCase(), input),
                tupleSchema, DataType.TUPLE));
    } catch (Exception e) {
        return null;
    }
}
gives this.getClass().getName().toLowerCase()::name and
this.getClass().getName().toLowerCase()::age as the fields after I
flatten.
My actual use case has a Tuple whose schema has 100 columns, with
nested bags etc.
Is there some way I can get rid of the prefix on each of the fields?
I just need the schema of the Tuple to be
{ field_name1: datatype1, field_name2: datatype2, .... field_name100:
datatype100 }
Thanks
Narayanan