[
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967670#comment-15967670
]
Adam Szita commented on PIG-5135:
---------------------------------
[~kellyzly] I've checked this. {{assertEquals(30,
inputStats.get(0).getBytes());}} passes, but {{assertEquals(18,
inputStats.get(1).getBytes());}} fails because Spark returns -1 here. The plan
generated for Spark consists of 4 jobs, the last one being responsible for the
replicated join. That job does 3 loads, so SparkPigStats reports its bytes read
as -1. (Even after adding together the bytes from all load ops in this job,
I got a different result than 18.) I guess compression is also at work in the
tmp file part generation, which further alters the number of bytes read.
I would say we should leave the exclusion for Spark as is, but update the
comment, since the reason we don't get the expected numbers is a different one.
What do you think?
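
To illustrate the behaviour described above, here is a minimal sketch. It is not the actual SparkPigStats code: the class {{BytesReadSketch}}, the method {{reportedBytesRead}}, and the byte counts passed to it are hypothetical. It only assumes what the comment states, namely that a job with more than one load gets -1 as its bytes-read value.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; names and numbers are not taken from Pig.
final class BytesReadSketch {
    // Sentinel meaning "bytes read cannot be attributed to a single input".
    static final long UNKNOWN = -1L;

    // Returns the byte count when the job has exactly one load,
    // otherwise the -1 sentinel (assumed behaviour, per the comment above).
    static long reportedBytesRead(List<Long> bytesPerLoad) {
        if (bytesPerLoad.size() != 1) {
            return UNKNOWN;
        }
        return bytesPerLoad.get(0);
    }

    public static void main(String[] args) {
        // A job with a single load reports that load's bytes.
        System.out.println(reportedBytesRead(Arrays.asList(30L)));
        // A job with three loads (like the replicated join job) reports -1.
        System.out.println(reportedBytesRead(Arrays.asList(10L, 5L, 3L)));
    }
}
```

Under this assumption, an assertion such as {{assertEquals(18, inputStats.get(1).getBytes());}} cannot pass for the replicated join job, whatever the individual loads read.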
> HDFS bytes read stats are always 0 in Spark mode
> ------------------------------------------------
>
> Key: PIG-5135
> URL: https://issues.apache.org/jira/browse/PIG-5135
> Project: Pig
> Issue Type: Bug
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5135.0.patch, PIG-5135.1.patch
>
>
> I discovered this while running TestOrcStoragePushdown unit test in Spark
> mode where the test depends on the value of this stat.