Thanks everyone for the comments! I waited for more replies to come in before
responding, as I was interested in the community's opinion.

The common thread I'm noticing in this thread (pun intended) is that most
responses focus on the nested RDD issue. I think we all agree that it is
problematic for many reasons: not just implementation complexity but also
user-facing complexity (programming model, processing patterns, debugging,
etc.).

What about a much simpler approach? Rather than producing RDDs with
Iterable[T], why not produce RDDs with SparkIterable[T]? Then, let's look at
the *RDD APIs and decide which methods would be useful to have there. The
simplest rule I can think of is: anything that does not involve a context,
job or partitioning in any form, thereby implicitly protecting the
RDD/partitioning abstractions underneath. Instead of returning RDDs, these
functions on SparkIterable will produce SparkIterables.
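
To make this concrete, here is a rough sketch of what such a SparkIterable
trait could declare. This is purely hypothetical: the type does not exist in
Spark, and the method signatures below are my own, loosely mirroring the RDD
API, just to show the shape of a context-free subset.

// Hypothetical sketch only: SparkIterable is the proposed abstraction, not an
// existing Spark type. It exposes the context/job/partition-free subset of
// the RDD API and returns SparkIterables (or plain values) instead of RDDs.
trait SparkIterable[T] extends Iterable[T] {
  def sample(withReplacement: Boolean, fraction: Double, seed: Long): SparkIterable[T]
  def countByValue(): Map[T, Long]
  def countApproxDistinct(relativeSD: Double): Long
  def sortBy[K: Ordering](f: T => K, ascending: Boolean = true): SparkIterable[T]
  def takeOrdered(num: Int)(implicit ord: Ordering[T]): Seq[T]
  // ...and so on for the other methods listed below, none of which need a
  // SparkContext, jobs, or partitioning information.
}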

The benefits are significant: API consistency, programming model
simplicity/consistency, greater leverage of non-trivial community code such
as sampling, approximate counting, reuse of user code for reducers, etc. The
cost is only the implementation effort of the new methods. No change to the
process model. No nested RDDs.
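
As an illustration of the reuse point, here is a hypothetical usage sketch.
The data, the names, and the assumption that groupByKey would return
RDD[(K, SparkIterable[V])] are all mine; none of this is existing Spark API.

// Hypothetical usage sketch. Assume purchases is an RDD[(String, Double)]
// of (userId, amount) pairs.
val add: (Double, Double) => Double = _ + _

// The user's reducer works at the RDD level today...
val totals: RDD[(String, Double)] = purchases.reduceByKey(add)

// ...and under this proposal the same reducer, and the familiar sample call,
// would work unchanged on the per-key values produced by groupByKey:
val sampledTotals: RDD[(String, Double)] =
  purchases.groupByKey()   // RDD[(String, SparkIterable[Double])]
    .mapValues(vs => vs.sample(false, 0.1, 42L).reduce(add))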

Here is a quick list of the methods we can do this for. Not all need to be
available at once: this is directional. This list is alphabetical, /not/ in
priority order of value.

*From RDD*
aggregate
count
countApprox
countApproxDistinct
countByValue
countByValueApprox
pipe
randomSplit
sample
sortBy
takeOrdered
takeSample
treeAggregate
treeReduce
union
zipWithUniqueId

*From PairRDDFunctions*
aggregateByKey
combineByKey
countApproxDistinctByKey
countByKey
flatMapValues
foldByKey
groupByKey
keys
lookup
mapValues
reduceByKey
sampleByKey
sampleByKeyExact
values

Here is another way to look at this: I am not sure why these methods, whose
signatures have nothing to do with partitions or partitioning, were defined
directly on RDDs rather than in some abstract trait. How a method is
implemented is a separate concern from how APIs should be designed. Had
these methods been put in a trait early in the life of Spark, it would
have been natural to expose them on Spark-backed Iterables. Since that was
not done, we look at them and tell ourselves "we can't do this because we
can't have RDD nesting." But nesting is not the real issue: the
implementations of these methods on a SparkIterable don't need access to any
RDD APIs. In fact, many would be one-liners using the underlying Scala
Iterable API:

// Purely for illustration
implicit class PairSparkIterableFunctions[K, V](self: SparkIterable[(K, V)]) {
  def groupByKey() =
    self.groupBy(_._1).map { case (k, valuePairs) => (k, valuePairs.map(_._2)) }
}
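
To reinforce the one-liner point, here are two more illustrative
implementations written against plain Scala Iterable (the wrapper names and
signatures are made up; a Spark-backed SparkIterable delegating to the same
collection operations would use the same bodies):

// Also purely for illustration: the bodies use only standard Scala collection
// operations, nothing from SparkContext, jobs or partitioning.
implicit class SparkIterableFunctions[T](self: Iterable[T]) {
  def countByValue(): Map[T, Long] =
    self.groupBy(identity).map { case (v, occurrences) => (v, occurrences.size.toLong) }
}

implicit class MorePairSparkIterableFunctions[K, V](self: Iterable[(K, V)]) {
  def reduceByKey(f: (V, V) => V): Iterable[(K, V)] =
    self.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2).reduce(f)) }
}

With these implicits in scope, Seq(("a", 1), ("a", 2), ("b", 3)).reduceByKey(_ + _)
behaves just like its RDD counterpart.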

The Spark community has decided that the RDD methods are important for
large-scale data processing, so let's make them available to all data Spark
touches while avoiding the nested RDD mess.

What do you think about this approach?

P.S. In a previous life I built developer tools, APIs and standards used by
over a million enterprise developers. One of the lessons I learned was that
simple, high-level APIs based on consistent patterns substantially
accelerate the growth of communities. Conversely, lack of either high-level
abstractions or consistency introduces friction. Because of the iterative
nature of development, even small amounts of friction meaningfully slow down
adoption. Further, simplicity of high-level APIs and consistency always beat
capability & performance in terms of how the mass of developers make
technology choices. I have found no exceptions to this, which is why I
wanted to bring the issue with the RDD API up here.



