----- Original Message -----
> Sure, drop() would be useful, but breaking the "transformations are lazy;
> only actions launch jobs" model is abhorrent -- which is not to say that we
> haven't already broken that model for useful operations (cf.
> RangePartitioner, which is used for sorted RDDs), but rather that each such
> exception to the model is a significant source of pain that can be hard to
> work with or work around.
>
> I really wouldn't like to see another such model-breaking transformation
> added to the API. On the other hand, being able to write transformations
> with dependencies on these kinds of "internal" jobs is sometimes very
> useful, so a significant reworking of Spark's Dependency model that would
> allow for lazily running such internal jobs and making the results
> available to subsequent stages may be something worth pursuing.
It turns out that drop can be implemented as a proper lazy transform. I discuss how that works here:

http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/

I updated the PR with this lazy implementation.

> On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash <and...@andrewash.com> wrote:
>
> > Personally I'd find the method useful -- I've often had a .csv file with a
> > header row that I want to drop, so I filter it out, which touches all
> > partitions anyway. I don't have any comments on the implementation quite
> > yet though.
> >
> > On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson <e...@redhat.com> wrote:
> >
> > > A few weeks ago I submitted a PR for supporting rdd.drop(n), under
> > > SPARK-2315:
> > > https://issues.apache.org/jira/browse/SPARK-2315
> > >
> > > Supporting the drop method would make some operations convenient;
> > > however, it forces computation of >= 1 partition of the parent RDD, and
> > > so it would behave like a "partial action" that returns an RDD as the
> > > result.
> > >
> > > I wrote up a discussion of these trade-offs here:
> > >
> > > http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
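[Editorial aside: the following is a minimal sketch, not Erik's PromiseRDD implementation. It uses plain Python lists to stand in for RDD partitions, to illustrate why drop(n) acts like a "partial action": finding the boundary record requires counting the leading partitions, and in Spark counting a partition means computing it.]

```python
# Sketch: drop the first n records from a partitioned dataset, where
# 'partitions' is a list of plain Python lists standing in for RDD
# partitions. In real Spark, each len(part) below would be a job that
# forces computation of that parent partition -- the trailing
# partitions are never touched, which is the lazy/eager boundary
# discussed in the thread.

def drop_partitions(partitions, n):
    out = []
    remaining = n
    for part in partitions:
        if remaining == 0:
            out.append(part)              # never inspected: stays "lazy"
        elif remaining >= len(part):      # this count forces the partition
            remaining -= len(part)
            out.append([])                # partition dropped entirely
        else:
            out.append(part[remaining:])  # boundary partition: partial drop
            remaining = 0
    return out

parts = [[1, 2, 3], [4, 5], [6, 7, 8]]
print(drop_partitions(parts, 4))  # [[], [5], [6, 7, 8]]
```

Only the partitions up to and including the boundary are ever counted; everything after is passed through untouched, which is why only >= 1 leading partition of the parent must be computed.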