[
https://issues.apache.org/jira/browse/PIG-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985265#action_12985265
]
Daniel Dai commented on PIG-1813:
---------------------------------
Script to reproduce the issue:
{code}
// UDF without an outputSchema override: it simply wraps the input tuple in a bag.
public static class BagGenerateNoSchema extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        DataBag bg = DefaultBagFactory.getInstance().newDefaultBag();
        bg.add(input);
        return bg;
    }
}
{code}
{code}
a = load '1.txt' as (a0:map[]);
b = foreach a generate BagGenerateNoSchema(*) as b0;
c = foreach b generate flatten(IdentityColumn(b0));
d = foreach c generate $0#'key';
dump d;
{code}
Analysis:
1. BagGenerateNoSchema does not define outputSchema, so the output schema
should default to bag{} (since BagGenerateNoSchema extends EvalFunc<DataBag>);
see the sketch after this list.
2. LOGenerate erroneously generates null as the schema for BagGenerateNoSchema.
However, when we translate LOUserFunc into a physical operator, we only use
LOUserFunc's output schema bag{}, which is correct. So BagGenerateNoSchema
alone does not cause the problem.
3. IdentityColumn defines outputSchema:
{code}
public Schema outputSchema(Schema input) {
    return input;
}
{code}
It uses the input schema passed in by the logical plan, which is null.
4. When we translate IdentityColumn into a physical operator, we get the wrong
schema "null" as the expression's schema, and null is eventually translated
into bytearray.
5. In POUserFunc, we convert everything into bytearray if the declared type
for POUserFunc is bytearray.
6. In statement d, we try to convert the bytearray back into a map; however,
the lineage for $0 traces back to the UDF IdentityColumn rather than to a
loader, so Pig complains that it cannot convert bytearray to map. In reality,
$0 is a map, so this conversion should not happen.
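For illustration only, this is roughly what an explicit schema declaration on
BagGenerateNoSchema could look like, so that LOGenerate has something other
than null to propagate. The class name and the choice to wrap the input schema
inside the bag are assumptions made for the sake of the example; this is a
sketch, not the attached patch.
{code}
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.DefaultBagFactory;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class BagGenerateWithSchema extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        DataBag bg = DefaultBagFactory.getInstance().newDefaultBag();
        bg.add(input);
        return bg;
    }

    @Override
    public Schema outputSchema(Schema input) {
        try {
            // Declare a bag whose inner schema is the input tuple's schema,
            // so downstream operators see typed fields instead of null.
            return new Schema(new Schema.FieldSchema(null, input, DataType.BAG));
        } catch (Exception e) {
            // Fall back to the default (unknown) schema if construction fails.
            return null;
        }
    }
}
{code}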
There are two problems:
1. LOGenerate generates the wrong schema in the place of BagGenerateNoSchema.
2. Fixing #1 will fix the above script; however, the following script still fails:
{code}
a = load '1.txt' as (a0:map[]);
b = foreach a generate BagGenerateNoSchema(*) as b0;
c = foreach b generate flatten(IdentityColumn(b0));
d = foreach c generate $0#'key';
{code}
The reason is that the schema for BagGenerateNoSchema is an empty bag, so after
the flatten (in c) the schema is null. POUserFunc for IdentityColumn will
therefore carry type bytearray anyway. Pig will convert the data into bytearray
in POUserFunc and then try to convert it back to a map in statement d, which
will cause the lineage error again.
The script works in 0.7. The reason is that in 0.7 we have this code in POUserFunc:
{code}
if (resultType == DataType.BYTEARRAY) {
    if (res.result != null && DataType.findType(result.result) != DataType.BYTEARRAY) {
        result.result = new DataByteArray(result.result.toString().getBytes());
    }
}
{code}
This is apparently wrong, since res is always empty. In 0.8, we changed this
code to:
{code}
if (resultType == DataType.BYTEARRAY) {
    if (result.result != null && DataType.findType(result.result) != DataType.BYTEARRAY) {
        result.result = new DataByteArray(result.result.toString().getBytes());
    }
}
{code}
It checks whether the result type for the UDF is bytearray and, if it is,
converts the data to bytearray. However, this conversion should not happen: a
UDF whose output schema is bytearray should be treated as "type unknown", and
the data should be left as is.
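To make the failure concrete, here is a small standalone sketch of what that
0.8 branch does to a map value when the declared result type is bytearray. The
class and variable names are made up for illustration; DataType.findType and
DataByteArray are the Pig classes already used in the code above.
{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.DataType;

public class ByteArrayConversionSketch {
    public static void main(String[] args) throws Exception {
        // The actual runtime value produced by the UDF chain is a map ...
        Map<String, Object> value = new HashMap<String, Object>();
        value.put("key", "v1");
        Object result = value;

        // ... but the declared type, derived from the null schema, is bytearray.
        byte resultType = DataType.BYTEARRAY;

        // The 0.8 check: the runtime type (MAP) differs from BYTEARRAY, so the
        // map is replaced by the bytes of its toString() representation.
        if (resultType == DataType.BYTEARRAY
                && result != null
                && DataType.findType(result) != DataType.BYTEARRAY) {
            result = new DataByteArray(result.toString().getBytes());
        }

        // POCast later receives this DataByteArray; because the lineage for $0
        // points at a UDF rather than a loader, it cannot turn the bytes back
        // into a map, which is the ERROR 1075 in the report below.
        System.out.println(result.getClass().getName());
    }
}
{code}
Had the value been left untouched, the map lookup $0#'key' in statement d would
have operated on the original map and no cast would have been needed.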
> Pig 0.8 throws ERROR 1075 while trying to refer a map in the result of eval udf. Works with 0.7
> -----------------------------------------------------------------------------------------------
>
> Key: PIG-1813
> URL: https://issues.apache.org/jira/browse/PIG-1813
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.8.0
> Reporter: Vivek Padmanabhan
> Attachments: PIG-1813-0.patch
>
>
> register myudf.jar;
> A = load 'input' MyZippedStorage('\u0001') as ($inputSchema);
> B = foreach A generate id , value ;
> C = foreach B generate id , org.myudf.ExplodeHashList( (chararray)value, '\u0002', '\u0004', '\u0003') as value;
> D = FILTER C by value is not null;
> E = foreach D generate id , flatten(org.myudf.GETFIRST(value)) as hop;
> F = foreach E generate id , hop#'rmli' as rmli:bytearray ;
> store F into 'output.bz2' using PigStorage();
> The above script fails when run with Pig 0.8 but runs fine with Pig 0.7 or if
> pig.usenewlogicalplan=false.
> The below is the exception thrown in 0.8 :
> org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to map.
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:952)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.processInput(POMapLookUp.java:87)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:98)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:117)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:346)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
> at org.apache.hadoop.mapred.Child.main(Child.java:211)