Hi,

When you flatten a tuple or bag, Pig prefixes each resulting field with a disambiguation string ([prefix]::); see http://pig.apache.org/docs/r0.12.0/basic.html#disambiguate. In your example, getSchemaName() returns a generated unique name built from the class name, the first input schema field, and a unique id, and that name becomes the prefix. To get rid of the disambiguation string, define the schema explicitly when flattening:

Example:

A = load 'data.txt' using PigStorage() as (c:chararray);
B = foreach A generate TOBAG(TOTUPLE($0, 1)) as ({(field1:chararray, field2:int)});
describe B;
B: {bag_0: {(field1: chararray,field2: int)}}

Define schema for flatten:

C = foreach B generate flatten($0) as (field1:chararray, field2:int);
describe C;
C: {field1: chararray,field2: int}
D = foreach C generate field1;
...

However, if the original column name (field1) is unique within the schema, you can refer to it by this name, rather than using the disambiguated form (bag_0::field1), so you don't need to explicitly set the schema:

C = foreach B generate flatten($0);
describe C;
C: {bag_0::field1: chararray,bag_0::field2: int}
D = foreach C generate field1;  --refers to bag_0::field1
...
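
To make the lookup rule above concrete, here is a toy model of it in plain Java. This is only a sketch of the behaviour described in this mail, not Pig's actual resolution code: the class AliasLookup and the method resolve are hypothetical names made up for illustration, and the model assumes a short name matches an alias either exactly or as the part after the last "::".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AliasLookup {
    // Toy model (hypothetical, not Pig code): map a short field name
    // to the single alias it can refer to after a flatten.
    static String resolve(List<String> aliases, String name) {
        // An exact match is taken as-is.
        if (aliases.contains(name)) {
            return name;
        }
        // Otherwise collect every alias of the form prefix::name.
        List<String> matches = new ArrayList<>();
        for (String alias : aliases) {
            if (alias.endsWith("::" + name)) {
                matches.add(alias);
            }
        }
        if (matches.size() == 1) {
            return matches.get(0);
        }
        if (matches.isEmpty()) {
            throw new IllegalArgumentException("no such field: " + name);
        }
        // More than one candidate: the short name is not unique.
        throw new IllegalArgumentException("ambiguous field: " + name + " " + matches);
    }

    public static void main(String[] args) {
        // Aliases as printed by "describe C" above.
        List<String> c = Arrays.asList("bag_0::field1", "bag_0::field2");
        System.out.println(resolve(c, "field1"));
    }
}
```

In this model, a second flattened relation that also carries a field1 (say under a made-up prefix bag_1::) would make the short name ambiguous, and you would have to use the fully prefixed form or set the schema explicitly as in the first example.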

Hope this helps!
--Lorand


On 06/14/2014 11:43 PM, Narayanan K wrote:
Hi

I am writing a Pig UDF that returns a Tuple as per
http://wiki.apache.org/pig/UDFManual . I want the output tuple to have
a particular schema, say {name:chararray, age:int}, after I FLATTEN the
UDF's result.

As per the UDFManual, the method below

public Schema outputSchema(Schema input) {
            try{
                Schema tupleSchema = new Schema();
                tupleSchema.add(input.getField(1));
                tupleSchema.add(input.getField(0));
                return new Schema(new Schema.FieldSchema(
                        getSchemaName(this.getClass().getName().toLowerCase(), input),
                        tupleSchema, DataType.TUPLE));
            }catch (Exception e){
                    return null;
            }
        }
    }

gives this.getClass().getName().toLowerCase()::name and
this.getClass().getName().toLowerCase()::age as the fields after I
flatten.

My actual use case has a Tuple whose schema has 100 columns, with
nested bags etc.

Is there some way I can get rid of the prefix on each of the fields?

I just need schema of the Tuple as

  { field_name1: datatype1, field_name2: datatype2, .... field_name100: datatype100 }


Thanks
Narayanan

