Sure, drop() would be useful, but breaking the "transformations are lazy;
only actions launch jobs" model is abhorrent -- which is not to say that we
haven't already broken that model for useful operations (cf.
RangePartitioner, which is used for sorted RDDs), but rather that each such
exception to the model is a significant source of pain that can be hard to
work with or work around.

I really wouldn't like to see another such model-breaking transformation
added to the API.  On the other hand, being able to write transformations
with dependencies on these kinds of "internal" jobs is sometimes very
useful, so a significant reworking of Spark's Dependency model that would
allow for lazily running such internal jobs and making the results
available to subsequent stages may be something worth pursuing.


On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash <and...@andrewash.com> wrote:

> Personally I'd find the method useful -- I've often had a .csv file with a
> header row that I want to drop, so I filter it out, which touches all
> partitions anyway.  I don't have any comments on the implementation quite
> yet though.
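
(Sketched here for context rather than taken from the thread: a common way to
drop just a header row without a drop(n) method is to special-case partition 0
via mapPartitionsWithIndex; the file name is a placeholder.)

    val lines = sc.textFile("data.csv")
    val noHeader = lines.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter  // stays lazy; no job until an action
    }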
>
>
> On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson <e...@redhat.com> wrote:
>
> > A few weeks ago I submitted a PR for supporting rdd.drop(n), under
> > SPARK-2315:
> > https://issues.apache.org/jira/browse/SPARK-2315
> >
> > Supporting the drop method would make some operations convenient;
> > however, it forces computation of >= 1 partition of the parent RDD, and
> > so it would behave like a "partial action" that returns an RDD as the
> > result.
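
(A hedged sketch, not the code from the PR, of why drop(n) cannot stay fully
lazy; the helper name dropSketch is hypothetical:)

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    def dropSketch[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] = {
      // Eager step: collect() launches a job here, before any action is
      // called on the result -- the "partial action" behavior above. A real
      // version would count only leading partitions until n is covered.
      val counts = rdd.mapPartitions(it => Iterator(it.size.toLong)).collect()
      val starts = counts.scanLeft(0L)(_ + _)  // elements before each partition
      rdd.mapPartitionsWithIndex { (idx, iter) =>
        val toSkip = math.max(0L, n - starts(idx))
        iter.drop(toSkip.toInt)               // skip the first n elements overall
      }
    }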
> >
> > I wrote up a discussion of these trade-offs here:
> >
> >
> > http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
> >
>
