Github user nkronenfeld commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94332668
  
    Ok, I think there are only 3 real API changes, though one of them appears 
in multiple places.
    
    * `RDD`
      * `flatMap[U](f: T => TraversableOnce[U])` changed to `flatMap[U](f: T => Traversable[U])`.  My thinking at the time was that `DStream` used a function to `Traversable`, `RDD` used one to `TraversableOnce`, and that the `Traversable` part was necessary in `DStream`.  Thinking about it again, I'm not completely sure of that - it might be just as valid to change the `DStream` version to `TraversableOnce` instead, which would be more general and would require fewer people to change their code.
    * `PairRDDFunctions`
      * Several of the functions now require a `ClassTag`, as the `DStream` versions already did.  I tried instead to remove the requirement from the `DStream` versions, since that would be more inclusive, but I ran into some compiler problems, and this was easier in the short run.  If people think it's doable to change it in that direction, I can try.
        * `combineByKey`
        * `mapValues`
        * `flatMapValues`
      * `reduceByKey`
        * changed the `(partitioner, function)` version to `(function, partitioner)` (see below).  I wanted to leave the old version in, deprecated, but for the same reason as the next point, I couldn't.
        * eliminated the `(function, Int)` version.  Unfortunately, the compiler seemed to resolve the function's type parameters before it picked which overload to use, and because of this it couldn't handle the type inference.  I'm not sure someone with more Scala-fu couldn't solve this; I'll copy the error in here when I can get hold of it again.
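    To illustrate the `flatMap` bullet, here's a sketch in plain Scala with no Spark dependency (in current Scala, `TraversableOnce`/`Traversable` correspond to `IterableOnce`/`Iterable`) of why the `TraversableOnce`-style bound is the more general one - an `Iterator`-returning function satisfies it, where a `Traversable` bound would reject it:

    ```scala
    // Minimal sketch, no Spark dependency.  IterableOnce plays the role of
    // TraversableOnce here: it admits Iterator results, which a
    // Traversable/Iterable bound would not.
    def flatMapOnce[T, U](xs: Seq[T])(f: T => IterableOnce[U]): Seq[U] =
      xs.flatMap(f)

    // An Iterator-returning function compiles against the wider bound:
    val out = flatMapOnce(Seq(1, 2, 3))(n => Iterator.fill(n)(n))
    // out == Seq(1, 2, 2, 3, 3, 3)
    ```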
    
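    On the `ClassTag` point, a minimal sketch (hypothetical names, not the actual Spark code) of the usual reason such a bound appears - materializing an `Array[C]` of a generic type requires one:

    ```scala
    import scala.reflect.ClassTag

    // Hypothetical combine step: building an Array[C] for a generic C is
    // what forces the ClassTag context bound; without it, toArray does
    // not compile.
    def combineToArray[V, C: ClassTag](vs: Seq[V])(createCombiner: V => C): Array[C] =
      vs.map(createCombiner).toArray

    val cs = combineToArray(Seq(1, 2, 3))(_.toString)
    // cs.toSeq == Seq("1", "2", "3")
    ```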
    However, regarding the last two, there is a general pattern across many of the Spark pair functions: they come in three versions:
    
    * `foo(...)` - an N-argument version that uses the default partitioner,
    * `foo(..., Int)` - an N+1-argument version that builds a `HashPartitioner` from the given number of partitions, and
    * `foo(..., Partitioner)` - an N+1-argument version that uses the passed-in partitioner.
    
    `PairRDDFunctions.reduceByKey` broke this pattern, but `PairDStreamFunctions.reduceByKey` did not - so, with the method signatures differing, one of them had to change.  Since the `RDD` one was out of sync with pretty much every similar function in the class, it seemed natural to change it - and, I would argue, it only makes sense that way.
    
    As my initial PR notes mention, I think the ideal thing would be to reduce 
all these function triplets to a single version, with auto-conversion from 
`Int` to `HashPartitioner`, which would leave all associated classes cleaner, 
smaller, and less confusing.
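    That collapse could look something like this sketch (an illustrative implicit conversion plus a default argument - not proposed Spark code, and again a toy local implementation that ignores the partitioner):

    ```scala
    import scala.language.implicitConversions

    trait Partitioner { def numPartitions: Int }
    case class HashPartitioner(numPartitions: Int) extends Partitioner

    // One overload instead of three: Int callers go through the implicit
    // conversion, and omitting the argument falls back to a default.
    implicit def intToPartitioner(n: Int): Partitioner = HashPartitioner(n)

    def reduceByKeySketch[K, V](data: Seq[(K, V)])(f: (V, V) => V,
        p: Partitioner = HashPartitioner(2)): Map[K, V] =
      data.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

    // The Int is auto-converted to HashPartitioner(4):
    val viaInt = reduceByKeySketch(Seq(("a", 1), ("a", 2)))(_ + _, 4)
    // viaInt == Map("a" -> 3)
    ```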
    
    Besides these three changes, I don't see any other API changes, but I'll go 
through it again later to be sure.

