Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/962#discussion_r14212360
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala 
---
    @@ -67,6 +67,12 @@ class TaskMetrics extends Serializable {
       var diskBytesSpilled: Long = _
     
       /**
    +   * If this task reads from a HadoopRDD, from cached data, or from a parallelized collection,
    --- End diff --
    
    How can a task end up reading data from both HDFS and the cache? I didn't
    realize that was possible.
    
    
    On Wed, Jun 25, 2014 at 2:05 PM, andrewor14 <notificati...@github.com>
    wrote:
    
    > In core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:
    >
    > > @@ -67,6 +67,12 @@ class TaskMetrics extends Serializable {
    > >    var diskBytesSpilled: Long = _
    > >
    > >    /**
    > > +   * If this task reads from a HadoopRDD, from cached data, or from a parallelized collection,
    >
    > I see. What if the same task reads the data from both HDFS and the cache?
    > From the code it seems that we keep the input bytes from the cache and
    > overwrite the ones for HDFS. Maybe I'm misunderstanding but I don't see an
    > easy way to tell whether these bytes are for an external source or from the cache.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/962/files#r14212270>.
    >
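The concern raised above is that TaskMetrics holds a single slot for input metrics, so if one task reads some partitions from Hadoop and others from the block-manager cache, whichever source reports last wins. A minimal, self-contained Scala sketch of that overwrite behavior (class and field names here are illustrative, not the actual Spark API):

```scala
// Hypothetical sketch of the single-slot overwrite problem under discussion.
// Names (TaskMetricsSketch, InputMetrics, readMethod) are illustrative only.
object InputMetricsOverwrite {
  case class InputMetrics(readMethod: String, bytesRead: Long)

  class TaskMetricsSketch {
    // One mutable slot for input metrics: a later writer clobbers an earlier one.
    var inputMetrics: Option[InputMetrics] = None
  }

  def main(args: Array[String]): Unit = {
    val tm = new TaskMetricsSketch
    // The task first reads a partition from Hadoop...
    tm.inputMetrics = Some(InputMetrics("Hadoop", 1024L))
    // ...then reads a cached partition; the Hadoop bytes are silently lost.
    tm.inputMetrics = Some(InputMetrics("Memory", 512L))
    println(tm.inputMetrics)
  }
}
```

Under this sketch there is no way to recover, from the final metrics alone, whether the recorded bytes came from an external source or from the cache, which is exactly the ambiguity the reviewers are discussing.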

