Thanks, Daniel. My UDF with schema (which I suspect is the culprit) is below. I've tried excluding the "outputSchema()" method entirely, as well as several variations of it:
(Full source here: http://pastie.org/1362084)

public class NormalizeListUDF extends EvalFunc<DataBag> {
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            DataBag output = DefaultBagFactory.getInstance().newDefaultBag();
            List<Object> tuples = input.getAll();
            String line = (String) tuples.remove(0);
            line = line.trim();
            String[] items = line.split(",");
            for (int i = 1; i < items.length - 1; i++) {
                for (int j = i + 1; j < items.length; j++) {
                    int num1 = Integer.parseInt(items[i]);
                    int num2 = Integer.parseInt(items[j]);
                    Tuple t = TupleFactory.getInstance().newTuple(1);
                    if (num1 < num2) {
                        t.set(0, num1 + "," + num2);
                    } else if (num2 < num1) {
                        t.set(0, num2 + "," + num1);
                    }
                    output.add(t);
                }
            }
            return output;
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }

    public Schema outputSchema(Schema input) {
        try {
            List<Schema.FieldSchema> fields = new ArrayList<Schema.FieldSchema>();
            Schema.FieldSchema f1 = new Schema.FieldSchema("f1", DataType.INTEGER);
            Schema.FieldSchema f2 = new Schema.FieldSchema("f2", DataType.INTEGER);
            fields.add(f1);
            fields.add(f2);
            Schema tupleInner = new Schema(fields);
            Schema.FieldSchema tupleSchema = new Schema.FieldSchema("t1", tupleInner, DataType.TUPLE);
            Schema bagInner = new Schema(tupleSchema);
            Schema.FieldSchema bagSchema = new Schema.FieldSchema("bag", bagInner, DataType.BAG);
            return new Schema(bagSchema);
        } catch (Exception e) {
            return null;
        }
    }
}

On Wed, Dec 8, 2010 at 7:04 PM, Daniel Dai <jiany...@yahoo-inc.com> wrote:

> It is not expected. I would think something wrong inside NormalizeListUDF.
> Make sure you feed bag of tuples which has the schema (int, int) inside your
> UDF. If you can post your UDF, I can know better.
>
> Daniel
>
>
> Michael Moss wrote:
>
>> Hello,
>>
>> I'm having an issue with a script that uses an EvalFunc I wrote. The issue
>> is the final output contains characters that I am not expecting (commas -
>> followed by what I'm guessing are null fields which I do not see).
>>
>> Snippet:
>> C = FOREACH B GENERATE FLATTEN(B) as (f1:int,f2:int);
>> grunt> DUMP C;
>> (2,3)
>> (2,4)
>> (2,5)
>> (3,4)
>> (3,5)
>> (4,5)
>> (2,3)
>> (2,4)
>> (2,5)
>> (3,4)
>> (3,5)
>> (4,5)
>>
>> D = GROUP C by (f1,f2);
>> grunt> describe D;
>> D: {group: (f1: int,f2: int),C: {f1: int,f2: int}}
>>
>> grunt> DUMP D;
>> ((2,3,),{(2,3,),(2,3,)})
>> ((2,4,),{(2,4,),(2,4,)})
>> ((2,5,),{(2,5,),(2,5,)})
>> ((3,4,),{(3,4,),(3,4,)})
>> ((3,5,),{(3,5,),(3,5,)})
>> ((4,5,),{(4,5,),(4,5,)})
>>
>> My question is, what are these extra comma/null fields in each tuple? I
>> expected the first row to read as:
>> ((2,3),{(2,3),(2,3)})
>>
>> It seems related, but when I run 'ILLUSTRATE C', I get an exception:
>> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>     at java.util.ArrayList.get(ArrayList.java:322)
>>     at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
>>     at org.apache.pig.pen.util.ExampleTuple.get(ExampleTuple.java:80)
>>     at org.apache.pig.pen.util.DisplayExamples.MakeArray(DisplayExamples.java:190)
>>     at org.apache.pig.pen.util.DisplayExamples.printTabular(DisplayExamples.java:86)
>>     at org.apache.pig.pen.util.DisplayExamples.printTabular(DisplayExamples.java:69)
>>     at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:143)
>>     at org.apache.pig.PigServer.getExamples(PigServer.java:785)
>>     at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:555)
>>     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:246)
>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
>>     at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
>>     at org.apache.pig.Main.main(Main.java:357)
>>
>> Excruciating detail below:
>>
>> My script:
>> REGISTER udf.jar
>> A = LOAD '/pig_input/co.txt' as (line:chararray);
>> B = FOREACH A GENERATE com.thumbplay.pig.NormalizeListUDF(line) as B;
>> C = FOREACH B GENERATE FLATTEN(B) as (f1:int,f2:int);
>> D = GROUP C by (f1,f2);
>> E = FOREACH D GENERATE group, COUNT(C);
>> STORE E INTO 'output' USING PigStorage(',');
>>
>> Here's what I'm trying to do:
>> For input:
>> A,1,2,3
>> B,1,2,3
>>
>> Produce combinations for each row (My UDF does this):
>> (1,2),(1,3),(2,3)
>> (1,2),(1,3),(2,3)
>>
>> Flatten them:
>> (1,2),
>> (1,3),
>> (2,3),
>> (1,2),
>> (1,3),
>> (2,3)
>>
>> Group and count them:
>> (1,2),2
>> (1,3),2
>> (2,3),2
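For reference, the steps listed under "Here's what I'm trying to do" (produce combinations per row, flatten, group and count) can be sketched in plain Java with no Pig dependency. This is only an illustration of the intended result, not the UDF itself: the class name PairCountSketch and its two methods are hypothetical, and skipping pairs of equal numbers is an assumption. One thing the sketch makes visible is that each pair is kept as two separate integers. The posted exec() instead builds a one-field tuple holding a "num1,num2" string, while outputSchema() declares two int fields, and that mismatch may be worth checking against the trailing commas in DUMP D.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Pig or of the posted UDF.
public class PairCountSketch {

    // Builds every (smaller, larger) pair from one row such as "A,1,2,3".
    public static List<List<Integer>> pairsForRow(String line) {
        String[] items = line.trim().split(",");
        List<List<Integer>> pairs = new ArrayList<List<Integer>>();
        // items[0] is the row label, so the numbers start at index 1.
        for (int i = 1; i < items.length - 1; i++) {
            for (int j = i + 1; j < items.length; j++) {
                int a = Integer.parseInt(items[i].trim());
                int b = Integer.parseInt(items[j].trim());
                if (a == b) {
                    continue; // assumption: equal numbers do not form a pair
                }
                // Two separate integer fields, matching an (int, int) schema.
                pairs.add(Arrays.asList(Math.min(a, b), Math.max(a, b)));
            }
        }
        return pairs;
    }

    // Flattens the pairs of all rows and counts how often each pair occurs.
    public static Map<List<Integer>, Integer> countPairs(List<String> rows) {
        Map<List<Integer>, Integer> counts = new LinkedHashMap<List<Integer>, Integer>();
        for (String row : rows) {
            for (List<Integer> pair : pairsForRow(row)) {
                Integer seen = counts.get(pair);
                counts.put(pair, seen == null ? 1 : seen + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<List<Integer>, Integer> counts = countPairs(Arrays.asList("A,1,2,3", "B,1,2,3"));
        for (Map.Entry<List<Integer>, Integer> e : counts.entrySet()) {
            List<Integer> pair = e.getKey();
            System.out.println("(" + pair.get(0) + "," + pair.get(1) + ")," + e.getValue());
        }
    }
}
```

In a Pig UDF, the equivalent of the two-integer pair would presumably be a two-field tuple with each number set in its own field, but that has not been verified against the script above.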