[ https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liyunzhang_intel updated PIG-4788:
----------------------------------
    Attachment: PIG-4776.patch

*Why does taskMetrics.inputMetrics().get().bytesRead() always return 0?*

Spark only calculates statistics such as "bytes_read" for *file* inputs. In [NewHadoopRDD.scala#compute|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L137] and [NewHadoopRDD.scala#close|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L212], bytes_read is only calculated for file input splits (such as FileSplit and CombineFileSplit). Because PigSplit extends InputSplit rather than FileSplit, inputMetrics never sets the value of bytes_read.

PigSplit wraps an array of InputSplit (https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigSplit.java#L76), and the input format in Pig is file-based, so we can change "PigSplit extends InputSplit" to "PigSplit extends FileSplit".

Changes in PIG-4776.patch:
1. PigSplit extends FileSplit instead of InputSplit.
2. Add try/catch to PigSplit#getLocations() and PigSplit#getLength().
3. Add PigSplit#getPath(). PigSplit#getPath() will be called in [NewTrackingRecordReader|https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MapTask.java#L507].

After testing, no new unit test failures are introduced, and the unit test failures in TestOrcStoragePushdown will be fixed.
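The type check described above can be illustrated with a minimal, self-contained sketch. Everything in it is a simplified stand-in written for illustration: InputSplit, FileSplit, PigSplitBefore, PigSplitAfter and recordedBytesRead are hypothetical models, not the real Hadoop, Spark or Pig classes, and the real Spark code path reads file system statistics rather than simply returning the split length. The sketch only shows why a split that is not a FileSplit ends up with 0 bytes recorded, and why changing the superclass is enough to pass the check.

```java
import java.util.Arrays;

// Simplified model only: these nested types are hypothetical stand-ins,
// NOT org.apache.hadoop.mapreduce.InputSplit / FileSplit or Pig's PigSplit.
final class BytesReadSketch {

    static abstract class InputSplit {
        abstract long getLength();
    }

    static class FileSplit extends InputSplit {
        private final long length;
        FileSplit(long length) { this.length = length; }
        @Override long getLength() { return length; }
    }

    // Models PigSplit before the patch: wraps splits, extends InputSplit.
    static class PigSplitBefore extends InputSplit {
        private final InputSplit[] wrapped;
        PigSplitBefore(InputSplit... wrapped) { this.wrapped = wrapped; }
        @Override long getLength() {
            return Arrays.stream(wrapped).mapToLong(InputSplit::getLength).sum();
        }
    }

    // Models PigSplit after the patch: same wrapping, but is-a FileSplit.
    static class PigSplitAfter extends FileSplit {
        private final InputSplit[] wrapped;
        PigSplitAfter(InputSplit... wrapped) {
            super(0L); // the parent's own length field is unused in this model
            this.wrapped = wrapped;
        }
        @Override long getLength() {
            return Arrays.stream(wrapped).mapToLong(InputSplit::getLength).sum();
        }
    }

    // Models the check: bytes are only recorded for file-based splits.
    static long recordedBytesRead(InputSplit split) {
        return (split instanceof FileSplit) ? split.getLength() : 0L;
    }

    public static void main(String[] args) {
        InputSplit before = new PigSplitBefore(new FileSplit(100L), new FileSplit(28L));
        InputSplit after = new PigSplitAfter(new FileSplit(100L), new FileSplit(28L));
        System.out.println("before patch: " + recordedBytesRead(before)); // 0
        System.out.println("after patch:  " + recordedBytesRead(after));  // 128
    }
}
```

In this model both wrappers report the same length from getLength(); only the superclass differs, which matches the reasoning behind change 1 of the patch.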
> The BytesRead metric always returns 0 in the Spark engine even when the length of the input file is not 0
> -------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-4788
>                 URL: https://issues.apache.org/jira/browse/PIG-4788
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4776.patch
>
>
> In [JobMetricsListener#onTaskEnd|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L140],
> taskMetrics.inputMetrics().get().bytesRead() always returns 0 even when the
> length of the input file is not zero.