[
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964045#comment-15964045
]
Adam Szita commented on PIG-5135:
---------------------------------
The reason these tests failed is that they depend on the *hdfsBytesRead stat, which is always 0* when using Spark as the execution engine. (TestOrcStoragePushdown compares the bytes read with and without the optimization and expects a certain difference.)
Spark only counts the bytes read if the split it is given is a FileSplit.
[Here|https://github.com/apache/spark/blob/v1.6.1/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L142]
you can see that otherwise the {{bytesReadCallback}} is None, so the counter is
never incremented. That is what happens in our case, because PigSplit is not a
FileSplit.
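To illustrate the failure mode, the check boils down to something like the following (a simplified Java-flavoured sketch of the linked Scala code, not Spark's actual implementation; the class and method names here are illustrative only):
{code:java}
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Simplified illustration of the check linked above (not Spark's real code):
// the byte-read callback is only registered when the underlying Hadoop split
// is a FileSplit. PigSplit is not a FileSplit, so the callback stays unset and
// hdfsBytesRead remains 0.
public class BytesReadCallbackSketch {
    static boolean countsBytesRead(InputSplit split) {
        return split instanceof FileSplit;
    }
}
{code}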
In my patch [^PIG-5135.0.patch] I have created a wrapper, {{SparkPigSplit}}, that
wraps a PigSplit instance and delegates every method to it. If the original
PigSplit contains FileSplits, I create a {{FileSparkPigSplit}}, otherwise a
{{GenericSparkPigSplit}}. (The former extends FileSplit, so Spark will be able to
count the bytes being read.)
[~kellyzly] please take a look.
> Fix TestOrcStoragePushdown unit test in Spark mode
> --------------------------------------------------
>
> Key: PIG-5135
> URL: https://issues.apache.org/jira/browse/PIG-5135
> Project: Pig
> Issue Type: Bug
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5135.0.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)