[ https://issues.apache.org/jira/browse/PIG-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037767#comment-14037767 ]
Mona Chitnis commented on PIG-3938:
-----------------------------------

* While loading data as kv:[ ], if the data is in the form of a multi-map, i.e. multiple rows with the same keys but different values, the rows get loaded into a map with unique keys, and later values override earlier ones. So if your use case needs to retain all values pertaining to one key, load the data some other way than using kv:[ ] directly.
* A UDF like MAPTOBAG should also override outputSchema(), similar to how it is implemented in TOBAG. Otherwise the schema of the resulting bag is unknown/null, which leads to data handling issues later in the pipeline.
* Sample Pig script for a POC using a finite, known number of key-value pairs. This produces the expected results when computing MIN/MAX among the different values corresponding to one key:
{code}
A = load 'input_exploded.txt' using PigStorage(',') as (id:chararray, k1:chararray, v1:chararray, k2:chararray, v2:chararray, k3:chararray, v3:chararray);
B = foreach A generate id, TOBAG(TOTUPLE(k1,v1),TOTUPLE(k2,v2),TOTUPLE(k3,v3)) as to_bag;
C = foreach B generate id, flatten(to_bag) as (key:chararray, value:chararray);
D = group C by (id,key);
E = foreach D generate group,MIN(C.value);
{code}

> Type cast doesn't work after flatten result of UDF
> --------------------------------------------------
>
>                 Key: PIG-3938
>                 URL: https://issues.apache.org/jira/browse/PIG-3938
>             Project: Pig
>          Issue Type: Bug
>          Components: internal-udfs
>    Affects Versions: 0.12.0, 0.11.1
>            Reporter: Hongchang Li
>
> This ticket is closely related to
> http://stackoverflow.com/questions/8828839/how-can-correct-data-types-on-apache-pig-be-enforced.
> To reproduce the issue, we first need a UDF that converts a map to a bag. The code is almost the same as
> (http://stackoverflow.com/questions/12476929/group-key-value-of-map-in-pig?answertab=votes#tab-top)
> {code:title=test.pig}
> $ cat test.pig
> register polisan/maptobag.jar;
> define MAPTOBAG maptobag.MAPTOBAG();
> A = load 'polisan/input1.txt' using PigStorage(' ') as (id:chararray, kv:[]);
> B = foreach A generate id, MAPTOBAG(kv) as to_bag;
> C = foreach B generate id, flatten(to_bag) as (key:chararray, value:chararray);
> D = group C by (id, key);
> E = foreach D generate group, MIN(C.value);
> dump E;
> {code}
> {code:title=polisan/input1.txt}
> 1 [x#1,y#ab]
> 1 [x#2,y#cd]
> {code}
> Then I ran the script and got the following exception:
> {noformat}
> 2014-05-15 19:44:52,944 [Thread-2] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: D: Local Rearrange[tuple]{tuple}(false) - scope-42 Operator Key: scope-42): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing min in Initial
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:263)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:282)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:277)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:1)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing min in Initial
>     at org.apache.pig.builtin.StringMin$Initial.exec(StringMin.java:81)
>     at org.apache.pig.builtin.StringMin$Initial.exec(StringMin.java:1)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:352)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextTuple(POUserFunc.java:391)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:281)
>     ... 8 more
> Caused by: java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String
>     at org.apache.pig.builtin.StringMin$Initial.exec(StringMin.java:73)
>     ... 15 more
> {noformat}

--
This message was sent by Atlassian JIRA (v6.2#6252)
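
As a rough illustration of the first point in the comment (a map holds one value per key, so a later value replaces an earlier one), here is a minimal plain-Java sketch. It does not use any Pig classes; the class name MultiValueSketch and its methods loadAsMap and minPerKey are hypothetical, and the key-value pairs are taken from the two input rows in the report. It assumes that MIN on chararray values behaves like a lexicographic string comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it is not part of Pig.
public class MultiValueSketch {

    // Collapses all pairs into a single map. A later value for the
    // same key replaces the earlier one, so earlier values are lost.
    public static Map<String, String> loadAsMap(List<String[]> pairs) {
        Map<String, String> map = new HashMap<>();
        for (String[] kv : pairs) {
            map.put(kv[0], kv[1]);
        }
        return map;
    }

    // Keeps every value per key (like flattening to (key, value) tuples
    // and grouping by key), then takes the smallest string per key.
    public static Map<String, String> minPerKey(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : pairs) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            result.put(e.getKey(), Collections.min(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Pairs from the rows "1 [x#1,y#ab]" and "1 [x#2,y#cd]".
        List<String[]> pairs = Arrays.asList(
            new String[] {"x", "1"}, new String[] {"y", "ab"},
            new String[] {"x", "2"}, new String[] {"y", "cd"});
        System.out.println("single map:  " + loadAsMap(pairs));
        System.out.println("min per key: " + minPerKey(pairs));
    }
}
```

With these pairs, loadAsMap retains only x=2 and y=cd, while minPerKey sees both values for each key and returns x=1 and y=ab. This is the same shape as the TOBAG/flatten/group pipeline in the sample script, where every (key, value) tuple survives until the MIN is computed.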