[ 
https://issues.apache.org/jira/browse/PIG-5194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15938532#comment-15938532
 ] 

Adam Szita commented on PIG-5194:
---------------------------------

There was a lot going wrong with the e2e test suite for HiveUDF. Please find my 
fixes in [^PIG-5194.0.patch].
h6. 1. ClassNotFoundExceptions for Hive classes
Although jars such as {{hive-serde}} and {{hive-exec}} were in the ship list, 
SparkLauncher added them as FILEs rather than as JARs at job creation. 
This is a problem because Spark only adds an element to the container's 
classpath if {{sc.addJar()}} was called, not {{sc.addFile()}}. I fixed this so 
that shipped files ending with {{.jar}} are added as JARs and all others as 
plain FILEs.
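The routing rule described above can be sketched roughly as follows. This is 
only an illustration, not the code in the patch: {{Shipper}}, 
{{RecordingShipper}} and {{ship()}} are hypothetical names, and the 
{{Shipper}} interface merely stands in for the {{addJar}}/{{addFile}} calls on 
the Spark context so the example runs without a cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShipListSketch {
    /** Hypothetical stand-in for the addJar/addFile calls on the Spark context. */
    interface Shipper {
        void addJar(String path);
        void addFile(String path);
    }

    /** Routes each shipped path: names ending in ".jar" go to addJar, the rest to addFile. */
    static void ship(List<String> paths, Shipper shipper) {
        for (String path : paths) {
            if (path.endsWith(".jar")) {
                shipper.addJar(path);   // ends up on the container classpath
            } else {
                shipper.addFile(path);  // shipped, but not put on the classpath
            }
        }
    }

    /** Records the calls so the routing can be inspected. */
    static class RecordingShipper implements Shipper {
        final List<String> jars = new ArrayList<>();
        final List<String> files = new ArrayList<>();
        @Override public void addJar(String path) { jars.add(path); }
        @Override public void addFile(String path) { files.add(path); }
    }

    public static void main(String[] args) {
        RecordingShipper recorder = new RecordingShipper();
        // Example file names are made up for the demonstration.
        ship(Arrays.asList("hive-exec.jar", "hive-serde.jar", "lookup.txt"), recorder);
        System.out.println("jars=" + recorder.jars + " files=" + recorder.files);
    }
}
```

The check is a plain suffix match on {{.jar}}, mirroring the description above; 
whether the actual patch also handles upper-case extensions is not stated here.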
h6. 2. ClassCastException for HiveUDF 5 and 6.
The trick here is that these test cases have a {{group by}} and a {{foreach}} 
in their scripts, hence {{ReduceByConverter#MergeValuesFunction}} gets called 
from Spark in the combiner phase. This method is responsible for producing one 
partial result from two input partial results. In this case, it calls the 
Intermediate implementation of the HiveUDF.
It seems that {{HiveUDAF}} is not implemented correctly, because its 
Intermediate stage doesn't call the {{merge()}} method of the actual Hive UDAF, 
but rather {{iterate()}}. This is bad, because {{iterate()}} is not meant to 
accept partial results and should only be used for new input.
Old mapping of methods:
* Initial > just returns Tuples as is
* Intermediate > calls {{iterate()}} then finally {{terminatePartial()}}
* Final > calls {{merge()}} then finally {{terminate()}}

Following the information 
[here|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFEvaluator.java#L70]
 I changed this to be:
* Initial > *(PARTIAL1)* calls {{iterate()}} then finally {{terminatePartial()}}
* Intermediate > *(PARTIAL2)* calls {{merge()}} then finally 
{{terminatePartial()}}
* Final > *(FINAL)* calls {{merge()}} then finally {{terminate()}}
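
To illustrate the new mapping, here is a toy average aggregator wired up in 
the three stages. It is a sketch under simplifying assumptions: 
{{AvgEvaluator}}, {{initial()}}, {{intermediate()}} and {{finalStep()}} are 
hypothetical names, and the method signatures are deliberately simpler than 
those of Hive's real {{GenericUDAFEvaluator}}. The point is only that a 
partial result has a different shape from a raw input, so the Intermediate 
stage has to use {{merge()}}.

```java
public class UdafModesSketch {
    /** Toy average aggregator; a partial result is the pair {sum, count}. */
    static class AvgEvaluator {
        private long sum;
        private long count;

        /** Consumes one raw input value. */
        void iterate(long value) { sum += value; count++; }

        /** Consumes a partial result produced by terminatePartial(). */
        void merge(long[] partial) { sum += partial[0]; count += partial[1]; }

        /** Emits the partial state for a later stage. */
        long[] terminatePartial() { return new long[] {sum, count}; }

        /** Emits the final result. */
        double terminate() { return count == 0 ? 0.0 : (double) sum / count; }
    }

    /** Initial (PARTIAL1): iterate() over raw values, then terminatePartial(). */
    static long[] initial(long... values) {
        AvgEvaluator eval = new AvgEvaluator();
        for (long v : values) eval.iterate(v);
        return eval.terminatePartial();
    }

    /** Intermediate (PARTIAL2): merge() partials, then terminatePartial(). */
    static long[] intermediate(long[]... partials) {
        AvgEvaluator eval = new AvgEvaluator();
        for (long[] p : partials) eval.merge(p);
        return eval.terminatePartial();
    }

    /** Final (FINAL): merge() partials, then terminate(). */
    static double finalStep(long[]... partials) {
        AvgEvaluator eval = new AvgEvaluator();
        for (long[] p : partials) eval.merge(p);
        return eval.terminate();
    }

    public static void main(String[] args) {
        long[] p1 = initial(1, 2);
        long[] p2 = initial(3);
        long[] p3 = initial(4, 5, 6);
        // A combiner may merge only two partials at a time.
        long[] combined = intermediate(p1, p2);
        System.out.println(finalStep(combined, p3)); // 21 / 6 = 3.5
    }
}
```

With this toy, feeding a {{\{sum, count\}}} pair to {{iterate()}} would not 
even type-check; in the real UDAF the mismatch only surfaces at runtime.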

It looks like in MR/Tez modes all the partial results (not just two) for a 
given key are passed to {{HiveUDAF#Intermediate#exec}}, so this issue doesn't 
show there (at least on this test case's inputs).
[~rohini], [~daijy] I'm going to need your help to review this part.
h6. 3. ClassNotFoundException of DummyContextUDF in HiveUDF_7
SparkLauncher does not add {{pigContext.extraJars}}. I'm fixing this.

[~kellyzly] can you please review the {{SparkLauncher}} modifications?

> HiveUDF fails with Spark exec type
> ----------------------------------
>
>                 Key: PIG-5194
>                 URL: https://issues.apache.org/jira/browse/PIG-5194
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>             Fix For: spark-branch
>
>         Attachments: PIG-5194.0.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
