[GitHub] spark pull request #11105: [SPARK-12469][CORE] Data Property accumulators fo...

squito Thu, 03 Nov 2016 09:33:05 -0700

Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11105#discussion_r86383782
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/ShuffledRDD.scala ---
    @@ -104,10 +105,26 @@ class ShuffledRDD[K: ClassTag, V: ClassTag, C: ClassTag](
       }
     
       override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
    +    // Use -1 for our Shuffle ID since we are on the read side of the shuffle.
    +    val shuffleWriteId = -1
    +    // If our task has data property accumulators we need to keep track of which partitions
    +    // we are processing.
    +    if (context.taskMetrics.hasDataPropertyAccumulators()) {
    +      context.setRDDPartitionInfo(id, shuffleWriteId, split.index)
    +    }
         val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
    -    SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
    +    val itr = SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1,
    +      context)
    --- End diff --
    
    I am looking closely at the combiner code to try to confirm this.  I think
    I believe it, but I don't think it's *guaranteed* to stay true in the future.
    E.g., right now the combiners do an `insertAll` into the
    `ExternalAppendOnlyMap` before reading from it.  But there is no reason Spark
    couldn't change so that it inserts only the *next* key from all incoming
    streams into the `ExternalAppendOnlyMap`, and then feeds that one key to the
    downstream iterators.
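
The difference between the two strategies can be sketched with plain Scala collections. This is only an illustration under stated assumptions, not Spark's implementation: `CombinerOrdering`, `eagerCombine`, and `streamingCombine` are hypothetical names, a `LinkedHashMap` stands in for `ExternalAppendOnlyMap`, and the streaming variant assumes its input is already sorted by key. The event log records when each record is merged and when each combined pair is handed downstream.

```scala
import scala.collection.mutable

object CombinerOrdering {
  // Log entries: "merge:<key>" when a record is folded into a combiner,
  // "emit:<key>" when a combined pair is handed to the downstream iterator.
  type Log = mutable.ArrayBuffer[String]

  // Eager strategy (analogous to insertAll-then-read): every record is
  // merged before the first combined pair can be emitted.
  def eagerCombine(records: Iterator[(String, Int)], log: Log): Iterator[(String, Int)] = {
    val combined = mutable.LinkedHashMap.empty[String, Int]
    records.foreach { case (k, v) =>
      log += s"merge:$k"
      combined(k) = combined.getOrElse(k, 0) + v
    }
    combined.iterator.map { case (k, v) =>
      log += s"emit:$k"
      (k, v)
    }
  }

  // Hypothetical streaming strategy: assumes the input is sorted by key and
  // emits each key as soon as all of its records have been merged, so merges
  // for later keys happen after earlier keys are already downstream.
  def streamingCombine(sortedRecords: Iterator[(String, Int)], log: Log): Iterator[(String, Int)] = {
    val buffered = sortedRecords.buffered
    new Iterator[(String, Int)] {
      def hasNext: Boolean = buffered.hasNext
      def next(): (String, Int) = {
        val key = buffered.head._1
        var sum = 0
        while (buffered.hasNext && buffered.head._1 == key) {
          log += s"merge:$key"
          sum += buffered.next()._2
        }
        log += s"emit:$key"
        (key, sum)
      }
    }
  }
}
```

For the input `("a", 1), ("a", 2), ("b", 3)`, the eager version logs all three merges before any emit, while the streaming version logs `emit:a` before `merge:b`. Any bookkeeping that relies on the "all merges happen first" ordering would see different behavior under the second strategy.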
    
    At the very least, we need a test to ensure this doesn't break if that
    internal implementation were to change.  (Does a test like that already exist?)
    
    Again, I'm still mulling over whether there is a good enough use case to
    bother supporting this at all ...


