[ https://issues.apache.org/jira/browse/SPARK-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209361#comment-14209361 ]

Patrick Wendell commented on SPARK-1296:
----------------------------------------

Because it's not possible to do without breaking compatibility.

> Make RDDs Covariant
> -------------------
>
>                 Key: SPARK-1296
>                 URL: https://issues.apache.org/jira/browse/SPARK-1296
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Michael Armbrust
>            Assignee: Michael Armbrust
>
> h2. First, what is the problem with RDDs not being covariant?
> {code}
> // Consider a function that takes a Seq of some trait.
> scala> trait A { val a = 1 }
> scala> def f(as: Seq[A]) = as.map(_.a)
> // A list of a concrete version of that trait can be used in this function.
> scala> class B extends A
> scala> f(new B :: Nil)
> res0: Seq[Int] = List(1)
> // Now let's try the same thing with RDDs.
> scala> def f(as: org.apache.spark.rdd.RDD[A]) = as.map(_.a)
> scala> val rdd = sc.parallelize(new B :: Nil)
> rdd: org.apache.spark.rdd.RDD[B] = ParallelCollectionRDD[2] at parallelize at <console>:42
> // :(
> scala> f(rdd)
> <console>:45: error: type mismatch;
>  found   : org.apache.spark.rdd.RDD[B]
>  required: org.apache.spark.rdd.RDD[A]
> Note: B <: A, but class RDD is invariant in type T.
> You may wish to define T as +T instead. (SLS 4.5)
>               f(rdd)
> {code}
> h2. Is it possible to make RDDs covariant?
> Probably? In terms of the public user interface, they are *mostly*
> covariant. (Internally we use the type parameter T in a lot of mutable state
> that breaks the covariance contract, but I think that with casting we can
> 'promise' the compiler that we are behaving.) There are also a lot of
> complications with other invariant types that we return.
> h2. What will it take to make RDDs covariant?
> As I mention above, all of our mutable internal state is going to require
> casting to avoid using T. This seems to be okay; it makes our life only
> slightly harder. This extra work is required because we are basically
> promising the compiler that, even if an RDD is implicitly upcast, internally
> we are keeping all the checkpointed data of the correct type. Since an RDD is
> immutable, we are okay!
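> To make this concrete, here is a minimal sketch (not Spark's actual code) of
> the trick: a covariant class that keeps internal mutable state by storing it
> at a weaker type and casting it back on read.
> {code}
> // The compiler rejects `private var cached: T` in a covariant class, so we
> // store the value as Any and cast on read, "promising" the compiler that
> // the field only ever holds a T.
> class Box[+T](initial: T) {
>   private var cached: Any = initial
>   def get: T = cached.asInstanceOf[T] // safe as long as we only store Ts
> }
>
> val b: Box[String] = new Box("hi")
> val a: Box[Any] = b // the upcast is now allowed, and a.get still works
> {code}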
> We also need to modify all the places where we use T in function parameters.  
> So for example:
> {code}
> def ++[U >: T : ClassTag](other: RDD[U]): RDD[U] =
>   this.union(other).asInstanceOf[RDD[U]]
> {code}
> We are now allowing you to append an RDD of a less specific type and then
> returning a less specific new RDD. This, I would argue, is a good change: we
> are strictly improving the power of the RDD interface while maintaining
> reasonable type semantics.
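> For example, under a hypothetical covariant RDD[+T] (this does not compile
> today), reusing A and B from above:
> {code}
> val bs: RDD[B] = sc.parallelize(new B :: Nil)
> val as: RDD[A] = sc.parallelize(new B :: Nil) // upcast via covariance
> val combined: RDD[A] = bs ++ as // U is inferred as A, the common supertype
> {code}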
> h2. So, why wouldn't we do it?
> There are a lot of places where we interact with invariant types. We return
> both Maps and Arrays from a lot of public functions. Arrays are invariant
> (if we returned immutable sequences instead, we would be fine), and Maps are
> invariant in the key (once again, immutable sequences of tuples would work
> here).
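> A quick illustration of the difference, in plain Scala (independent of
> Spark), again reusing A and B from above:
> {code}
> val bs: Array[B] = Array(new B)
> // val as: Array[A] = bs // does not compile: Array is invariant in its T
> val seq: Seq[A] = List(new B) // fine: Seq is covariant (declared Seq[+A])
> {code}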
> I don't think this is a deal breaker, and we may even be able to get away
> with it without changing the return types of these functions. For example,
> I think the following should work, though it once again requires making
> promises to the compiler:
> {code}
>   /**
>    * Return an array that contains all of the elements in this RDD.
>    */
>   def collect[U >: T](): Array[U] = {
>     val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
>     Array.concat(results: _*).asInstanceOf[Array[U]]
>   }
> {code}
> I started working on this 
> [here|https://github.com/marmbrus/spark/tree/coveriantRDD].  Thoughts / 
> suggestions are welcome!


