Thanks everyone for the comments! I waited for more replies before responding because I was interested in the community's opinion.
The thread I'm noticing in this thread (pun intended) is that most responses focus on the nested RDD issue. I think we all agree that it is problematic for many reasons, including not just implementation complexity but also user-facing complexity (programming model, processing patterns, debugging, etc.).

What about a much simpler approach? Rather than producing RDDs with Iterable[T], why not produce RDDs with SparkIterable[T]? Then, let's look at the *RDD APIs and decide which methods would be useful to have there. The simplest rule I can think of is: anything that does not involve a context, jobs, or partitioning in any form, thereby implicitly protecting the RDD/partitioning abstractions underneath. Instead of returning RDDs, these functions in SparkIterable will produce SparkIterables.

The benefits are significant: API consistency, programming model simplicity/consistency, greater leverage of non-trivial community code such as sampling and approximate counting, reuse of user code for reducers, etc. The cost is only the implementation effort of the new methods. No change to the process model. No nested RDDs.

Here is a quick list of the methods we could do this for. Not all need to be available at once: this is directional. The lists are alphabetical, /not/ in priority order of value.

*From RDD*

aggregate
count
countApprox
countApproxDistinct
countByValue
countByValueApprox
pipe
randomSplit
sample
sortBy
takeOrdered
takeSample
treeAggregate
treeReduce
union
zipWithUniqueId

*From PairRDDFunctions*

aggregateByKey
combineByKey
countApproxDistinctByKey
countByKey
flatMapValues
foldByKey
groupByKey
keys
lookup
mapValues
reduceByKey
sampleByKey
sampleByKeyExact
values

Here is another way to look at this: I am not sure why these methods, whose signatures have nothing to do with partitions or partitioning, were defined directly on RDDs as opposed to in some abstract trait. How a method is implemented is a separate concern from how APIs should be designed. Had these methods been put in a trait early on in the life of Spark, it would have been natural to expose them to Spark-backed Iterables. Since this was not done, we look at them and tell ourselves "we can't do this because we can't have RDD nesting", which is not the real issue, as the implementations of these methods in a SparkIterable don't need access to any RDD APIs. In fact, many implementations would be one-liners using the underlying Scala Iterable API:

// Purely for illustration
implicit class PairSparkIterableFunctions[K, V](self: SparkIterable[(K, V)]) {
  def groupByKey() =
    self.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2)) }
}
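To make the pattern concrete, here is a self-contained sketch. Everything in it is hypothetical: SparkIterable is modeled as a plain wrapper over an in-memory Scala Iterable, whereas a real implementation would be lazy and Spark-backed, and the names are mine for illustration, not a proposal for the actual API.

// Hypothetical sketch only: a toy SparkIterable backed by an in-memory
// Scala Iterable, to show that these methods need no RDD or SparkContext.
object SparkIterableSketch {

  // Stand-in for the real, Spark-backed implementation.
  final case class SparkIterable[T](underlying: Iterable[T]) {
    def map[U](f: T => U): SparkIterable[U] = SparkIterable(underlying.map(f))
    def groupBy[K](f: T => K): SparkIterable[(K, Iterable[T])] =
      SparkIterable(underlying.groupBy(f))
  }

  // Pair-specific methods as one-liners over the Iterable contract.
  implicit class PairOps[K, V](self: SparkIterable[(K, V)]) {
    def mapValues[U](f: V => U): SparkIterable[(K, U)] =
      self.map { case (k, v) => (k, f(v)) }
    def reduceByKey(f: (V, V) => V): SparkIterable[(K, V)] =
      self.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2).reduce(f)) }
  }

  def main(args: Array[String]): Unit = {
    val counts = SparkIterable(Seq(("a", 1), ("b", 1), ("a", 1))).reduceByKey(_ + _)
    println(counts.underlying) // e.g. Map(a -> 2, b -> 1)
  }
}

Note that nothing above touches a SparkContext, a partition, or an RDD: the implementations compose entirely out of the Iterable contract, which is the whole argument.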
The Spark community has decided that the RDD methods are important for large-scale data processing, so let's make them available to all the data Spark touches while avoiding the nested RDD mess. What do you think about this approach?

P.S. In a previous life I built developer tools, APIs, and standards used by over a million enterprise developers. One of the lessons I learned was that simple, high-level APIs based on consistent patterns substantially accelerate the growth of communities. Conversely, a lack of either high-level abstractions or consistency introduces friction, and because of the iterative nature of development, even small amounts of friction meaningfully slow adoption. Further, simplicity and consistency of high-level APIs always beat capability and performance in how the mass of developers makes technology choices. I have found no exceptions to this, which is why I wanted to bring the issue with the RDD API up here.