[
https://issues.apache.org/jira/browse/PIG-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rohini Palaniswamy updated PIG-5474:
------------------------------------
Attachment: PIG-5474-2.patch
> Casting error or empty output when as clause is used on a bag with schema not
> defined
> -------------------------------------------------------------------------------------
>
> Key: PIG-5474
> URL: https://issues.apache.org/jira/browse/PIG-5474
> Project: Pig
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-5474-1.patch, PIG-5474-2.patch
>
>
> Ran into an issue where a script that worked with an older version of Pig
> failed on Pig 0.17. It was a regression caused by PIG-2315 adding additional
> POCast operators when there is an AS clause.
> A script with the lines below
> {code}
> G = FOREACH F GENERATE a0,
>     org.apache.pig.test.TestFlatten$UDFWithNoOutputSchema(a0, a1, b1, bag1) as bag2;
> H = FOREACH G GENERATE a0,
>     FLATTEN(bag2) as (x1:chararray, x2:double, x3:chararray, x4:long);
> {code}
> ran into this error
> {code}
> ERROR 1075: Received a bytearray from the UDF or Union from two different
> Loaders. Cannot determine how to convert the bytearray to string for
> [x1[-1,-1]]
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNextString(POCast.java:1125)
> {code}
> It was not easily reproducible with a simple script. It required a sequence
> of steps that left the CastLineageSetter unable to set the LoadFunc that
> provides the caster on POCast (
> https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/newplan/logical/visitor/CastLineageSetter.java#L108-L112
> ), which causes the error in
> https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java#L1123-L1125.
>
> The user's attempt to rewrite the script by moving the AS clause to the UDF
> statement, instead of placing it after FLATTEN, made the script pass. But all
> the bags produced were empty because the cast of the bag (
> https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java#L1824-L1828
> ) swallowed the underlying exception and returned null, unlike casts of
> primitive fields, which throw an error.
> {code}
> G = FOREACH F GENERATE a0,
>     org.apache.pig.test.TestFlatten$UDFWithNoOutputSchema(a0, a1, b1, bag1)
>     as bag2:{t:(a1:chararray, a2:double, a3:chararray, a4:long)};
> H = FOREACH G GENERATE a0, FLATTEN(bag2);
> {code}
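
The difference between a cast that swallows its failure and one that reports it can be sketched in plain Java. This is only an illustration of the failure mode described above, not Pig's POCast code: the class SwallowedCastSketch, the ConversionException type, and the convert, castSwallowing, and castStrict methods are all hypothetical names made up for this sketch.

```java
import java.util.Arrays;
import java.util.List;

class SwallowedCastSketch {
    /** Hypothetical failure type; not a Pig class. */
    static class ConversionException extends Exception {
        ConversionException(String msg) {
            super(msg);
        }
    }

    // Hypothetical converter: it only knows how to turn a comma-separated
    // String into a list, and fails for every other input type.
    static List<String> convert(Object raw) throws ConversionException {
        if (raw instanceof String) {
            return Arrays.asList(((String) raw).split(","));
        }
        throw new ConversionException(
                "no converter available for " + raw.getClass().getSimpleName());
    }

    // Swallowing variant: the failure is hidden and the caller only sees null,
    // which downstream code can easily mistake for "no data".
    static List<String> castSwallowing(Object raw) {
        try {
            return convert(raw);
        } catch (ConversionException e) {
            return null;
        }
    }

    // Strict variant: the failure reaches the caller, so the job fails loudly.
    static List<String> castStrict(Object raw) throws ConversionException {
        return convert(raw);
    }

    public static void main(String[] args) {
        Object unconvertible = new byte[] {1, 2, 3};

        System.out.println("swallowing: " + castSwallowing(unconvertible));
        try {
            System.out.println("strict: " + castStrict(unconvertible));
        } catch (ConversionException e) {
            System.out.println("strict failed: " + e.getMessage());
        }
    }
}
```

With the swallowing variant, a script can appear to succeed while producing empty results, which matches the symptom reported for the bag cast.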
> Also realized that this additional POCast has made the processing inefficient
> in general, as it tries to cast everything from bytearray to the type
> specified in the AS clause. If the UDF returned the correct type, let's say
> Integer, the code will still try to typecast to DataByteArray, hit a
> ClassCastException, and only then cast based on the realType
> https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java#L503.
> This is going to add a lot of overhead to processing when there are millions
> of rows.
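
The overhead pattern described above can be illustrated with a small standalone Java sketch. It is not Pig's POCast code and does not claim to show what the attached patch does: RawBytes is a hypothetical stand-in for a byte wrapper such as DataByteArray, and toIntViaException and toIntViaTypeCheck are made-up names. The first method assumes raw bytes and recovers through a ClassCastException on every already-typed value; the second checks the runtime type first, which is one possible way to avoid raising an exception per row.

```java
import java.nio.charset.StandardCharsets;

class CastFallbackSketch {
    /** Hypothetical raw-bytes wrapper; not Pig's DataByteArray. */
    static final class RawBytes {
        final byte[] data;

        RawBytes(byte[] data) {
            this.data = data;
        }
    }

    // Exception-driven path: assume the value is raw bytes and fall back to
    // the real type only after a ClassCastException has been thrown.
    static Integer toIntViaException(Object value) {
        try {
            RawBytes raw = (RawBytes) value; // throws when value is an Integer
            return Integer.valueOf(new String(raw.data, StandardCharsets.UTF_8).trim());
        } catch (ClassCastException e) {
            return (Integer) value;
        }
    }

    // Type-check path: inspect the runtime type first, so an already-typed
    // value is returned without constructing and catching an exception.
    static Integer toIntViaTypeCheck(Object value) {
        if (value instanceof Integer) {
            return (Integer) value;
        }
        RawBytes raw = (RawBytes) value;
        return Integer.valueOf(new String(raw.data, StandardCharsets.UTF_8).trim());
    }

    public static void main(String[] args) {
        Object typed = Integer.valueOf(7);
        Object untyped = new RawBytes("42".getBytes(StandardCharsets.UTF_8));

        // Both paths agree on the result; they differ in how an Integer input
        // gets there (via a thrown exception or via an instanceof check).
        System.out.println(toIntViaException(typed) + " " + toIntViaTypeCheck(typed));
        System.out.println(toIntViaException(untyped) + " " + toIntViaTypeCheck(untyped));
    }
}
```

Both methods return the same values; the sketch makes no timing claim, it only shows where the per-row exception would come from.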