There is one piece of information that'd be useful to know, which is the
source of the input. Even in the presence of an IOException, the input
metrics still specify that the task is reading from Hadoop.
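
To illustrate what I mean, here's a minimal sketch of that behavior (the
InputMetrics below is a simplified stand-in, not the actual Spark class):

    import scala.util.{Failure, Success, Try}

    // Hypothetical stand-in for Spark's InputMetrics, for illustration only.
    case class InputMetrics(readMethod: String, var bytesRead: Long = 0L)

    def metricsForHadoopSplit(splitLength: => Long): InputMetrics = {
      // The source is known up front, so it is recorded even when sizing fails.
      val metrics = InputMetrics(readMethod = "Hadoop")
      Try(splitLength) match {
        case Success(len) => metrics.bytesRead = len
        case Failure(e: java.io.IOException) =>
          // Same outcome as the quoted snippet: log and keep the metrics object.
          Console.err.println(s"Unable to get input size for task: $e")
        case Failure(e) => throw e
      }
      metrics
    }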

However, I'm slightly confused by this -- I think usually we'd want to
report the number of bytes actually read, rather than the total input
size. For example, if there is a limit (only read the first 5 records),
the actual number of bytes read can be much smaller than the total split
size.
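
As a hypothetical sketch of the distinction (the counting iterator and
the per-record size estimate are mine, not Spark's): if bytes are
accumulated as records are consumed, an early stop such as take(5) is
reflected, whereas split.getLength() reports the full split size
regardless.

    import java.io.{BufferedReader, StringReader}

    // Counts an approximate byte total for the records actually consumed.
    class CountingLineIterator(reader: BufferedReader) extends Iterator[String] {
      var bytesRead = 0L
      private var nextLine: String = reader.readLine()
      def hasNext: Boolean = nextLine != null
      def next(): String = {
        val line = nextLine
        bytesRead += line.length + 1  // crude estimate: line plus newline
        nextLine = reader.readLine()
        line
      }
    }

    val data = (1 to 1000).map(i => s"record-$i").mkString("\n")
    val it = new CountingLineIterator(new BufferedReader(new StringReader(data)))
    it.take(5).foreach(_ => ())  // a "limit": consume only the first 5 records
    // it.bytesRead is now ~45 bytes; data.length (the whole "split") is ~11 KB.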

Kay, am I misinterpreting this?
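
For what it's worth, the distinction Ted asks about below could be made
by leaving the metrics unset when the size lookup fails -- again a
sketch against the simplified stand-in type, not the actual Spark code:

    import scala.util.{Failure, Success, Try}

    case class InputMetrics(readMethod: String, bytesRead: Long)

    def inputMetricsOrNone(splitLength: => Long): Option[InputMetrics] =
      Try(splitLength) match {
        case Success(len) =>
          Some(InputMetrics(readMethod = "Hadoop", bytesRead = len))
        case Failure(e: java.io.IOException) =>
          // The failure is now visible to callers: None, rather than a
          // metrics object whose byte count was never filled in.
          Console.err.println(s"Unable to get input size for task: $e")
          None
        case Failure(e) => throw e
      }

    // e.g. context.taskMetrics.inputMetrics =
    //   inputMetricsOrNone(split.inputSplit.value.getLength())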



On Sat, Jul 26, 2014 at 7:42 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Hi,
> Starting at line 203:
>       try {
>         /* bytesRead may not exactly equal the bytes read by a task:
>          * split boundaries aren't always at record boundaries, so tasks
>          * may need to read into other splits to complete a record. */
>         inputMetrics.bytesRead = split.inputSplit.value.getLength()
>       } catch {
>         case e: java.io.IOException =>
>           logWarning("Unable to get input size to set InputMetrics for task", e)
>       }
>       context.taskMetrics.inputMetrics = Some(inputMetrics)
>
> If there is an IOException, context.taskMetrics.inputMetrics is still
> set by wrapping inputMetrics - as if there hadn't been any error.
>
> I wonder if the above code should distinguish the error condition.
>
> Cheers
>
