[ 
https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-4788:
----------------------------------
    Attachment: PIG-4776.patch

*Why does taskMetrics.inputMetrics().get().bytesRead() always return 0?*
Spark only calculates statistics such as "bytes_read" for *file* input.
In [NewHadoopRDD.scala#compute|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L137]
and [NewHadoopRDD.scala#close|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L212],
Spark only updates bytes_read for file-based splits (such as FileSplit and
CombineFileSplit). Because PigSplit extends InputSplit rather than FileSplit,
inputMetrics never sets the value of bytes_read.
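
The effect of that type check can be illustrated with a minimal, self-contained Java sketch. It is not Spark's or Pig's code: InputSplit, FileSplit, WrappingSplit and reportedBytesRead below are simplified stand-ins invented for illustration, and the sketch assumes the check depends only on whether the split is a file split.

```java
// Simplified stand-ins for the Hadoop split types; these are NOT the real classes.
abstract class InputSplit {
    abstract long getLength();
}

class FileSplit extends InputSplit {
    private final long length;

    FileSplit(long length) {
        this.length = length;
    }

    @Override
    long getLength() {
        return length;
    }
}

// Models a split that wraps other splits but is not itself a FileSplit,
// which is the situation of PigSplit before the patch.
class WrappingSplit extends InputSplit {
    private final InputSplit[] wrapped;

    WrappingSplit(InputSplit... wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    long getLength() {
        long total = 0;
        for (InputSplit s : wrapped) {
            total += s.getLength();
        }
        return total;
    }
}

class BytesReadSketch {
    // Hypothetical model of the check: only file-based splits are counted.
    static long reportedBytesRead(InputSplit split) {
        if (split instanceof FileSplit) {
            return split.getLength();
        }
        return 0L; // the metric is never updated for other split types
    }

    public static void main(String[] args) {
        InputSplit file = new FileSplit(1024);
        InputSplit wrapper = new WrappingSplit(new FileSplit(1024));
        System.out.println(reportedBytesRead(file));    // 1024
        System.out.println(reportedBytesRead(wrapper)); // 0, although getLength() is 1024
    }
}
```

In this model the wrapper reports 0 bytes even though its own getLength() is non-zero, which matches the symptom described in this issue.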

PigSplit wraps an array of InputSplit objects
(https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigSplit.java#L76).
Pig's input is file-based, so we can change "PigSplit extends InputSplit"
to "PigSplit extends FileSplit".

The changes in PIG-4776.patch are:
1. PigSplit extends FileSplit instead of InputSplit.
2. Add try/catch to PigSplit#getLocations() and PigSplit#getLength().
3. Add PigSplit#getPath(), which is called in
[NewTrackingRecordReader|https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MapTask.java#L507].
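
The shape of these three changes can be sketched as follows. This is not the content of the patch: GenericSplit, PathSplit and WrapperSplit are hypothetical stand-ins for InputSplit, FileSplit and PigSplit. The sketch assumes the try/catch is needed because the file-split accessors declare fewer checked exceptions than the generic ones, and both the fallback value of 0 and the choice of returning the first wrapped split's path are assumptions made for illustration.

```java
import java.io.IOException;

// Stand-in for a generic split whose accessor may throw checked exceptions.
abstract class GenericSplit {
    abstract long getLength() throws IOException, InterruptedException;
}

// Stand-in for a file split: narrower signature plus a path.
class PathSplit extends GenericSplit {
    private final String path;
    private final long length;

    PathSplit(String path, long length) {
        this.path = path;
        this.length = length;
    }

    @Override
    long getLength() {
        return length;
    }

    String getPath() {
        return path;
    }
}

// Hypothetical wrapper in the spirit of the patched PigSplit.
class WrapperSplit extends PathSplit {
    private final GenericSplit[] wrapped;

    WrapperSplit(GenericSplit... wrapped) {
        super(null, 0L);
        this.wrapped = wrapped;
    }

    // Change 2: the overridden method cannot declare checked exceptions,
    // so failures from the wrapped splits are caught here.
    @Override
    long getLength() {
        try {
            long total = 0;
            for (GenericSplit s : wrapped) {
                total += s.getLength();
            }
            return total;
        } catch (IOException | InterruptedException e) {
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
            return 0L; // fallback value is an assumption
        }
    }

    // Change 3: expose a path; here, the path of the first wrapped file split.
    @Override
    String getPath() {
        if (wrapped.length > 0 && wrapped[0] instanceof PathSplit) {
            return ((PathSplit) wrapped[0]).getPath();
        }
        return null;
    }
}
```

With this shape, a type check for the file-split stand-in succeeds on the wrapper, and callers that expect a path receive one.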

After testing, the patch introduces no new unit test failures, and the
existing unit test failures in TestOrcStoragePushdown are fixed.



> the value BytesRead metric info always returns 0 even the length of input 
> file is not 0 in spark engine
> -------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-4788
>                 URL: https://issues.apache.org/jira/browse/PIG-4788
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4776.patch
>
>
> In
> [JobMetricsListener#onTaskEnd|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L140],
>  taskMetrics.inputMetrics().get().bytesRead() always returns 0 even when the
> length of the input file is not zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
