[
https://issues.apache.org/jira/browse/TEZ-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336003#comment-14336003
]
Siddharth Seth commented on TEZ-2121:
-------------------------------------
[~rgrandl], an INPUT_BYTES counter used to exist. It was based on the
statistics maintained by FileSystem, and was removed because looking up those
statistics in the tight loop (before and after each readKeyValue operation)
was considered expensive.
Looking at the patch, this counter would be incremented by the size of the
splits - and only for TezGroupedSplitsInputFormat. With any columnar format,
the size of the split is not representative of the data actually read.
There's also the aspect of compression. (Output counters, I believe,
differentiate between output bytes compressed vs. what's written to disk.)
Rather than naming this BYTES_READ, something like SPLIT_SIZE would be more
appropriate - but it should work for Inputs other than GroupedInputs as well.
Even better would be to set up a counter within your InputFormat itself.
If your InputFormat is based on FileInputFormat, a FileInputFormat.BYTES_READ
counter should already be present.
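
As a rough illustration of counting inside the InputFormat's own read path,
here is a minimal sketch in plain Java. It is not taken from the patch or
from Tez: CountingInputStream, InputBytesSketch, and the AtomicLong standing
in for a framework counter are all hypothetical names. The stream tallies the
bytes actually returned by read() in a plain field, and the total is published
to the counter once, so nothing expensive happens per record.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical wrapper that counts bytes actually returned by read(). */
class CountingInputStream extends FilterInputStream {
    // Plain field: updating it in the read loop is cheap.
    private long bytesRead;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            bytesRead++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    long getBytesRead() {
        return bytesRead;
    }
}

/** Hypothetical driver; the AtomicLong stands in for a framework counter. */
class InputBytesSketch {
    /** Reads the stream to the end and returns the number of bytes read. */
    static long drain(CountingInputStream in) throws IOException {
        byte[] buf = new byte[256];
        while (in.read(buf, 0, buf.length) != -1) {
            // Record parsing would happen here in a real reader.
        }
        return in.getBytesRead();
    }

    public static void main(String[] args) throws IOException {
        AtomicLong inputBytes = new AtomicLong();
        try (CountingInputStream in =
                 new CountingInputStream(new ByteArrayInputStream(new byte[1000]))) {
            // Publish once after reading, not before and after every record.
            inputBytes.addAndGet(drain(in));
        }
        System.out.println("bytes read: " + inputBytes.get());
    }
}
```

Note that this counts bytes handed to the reader, so for a compressed or
columnar source it matters where the wrapper sits: below the decompressor it
reports on-disk bytes, above it the decompressed size. Bytes passed over with
skip() are not counted in this sketch.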
> INPUT_BYTES counter to describe amount of data read by a task
> --------------------------------------------------------------
>
> Key: TEZ-2121
> URL: https://issues.apache.org/jira/browse/TEZ-2121
> Project: Apache Tez
> Issue Type: New Feature
> Affects Versions: 0.7.0
> Reporter: Robert Grandl
> Priority: Minor
> Fix For: 0.7.0
>
> Attachments: TEZ-2121.patch
>
>
> We propose to add an INPUT_BYTES counter to the set of TaskCounters in
> TaskCounter.java to record the amount of data being read by a task in a
> vertex. We observe that a similar counter exists for data being
> written (OUTPUT_BYTES).