----- Original Message -----
> Sure, drop() would be useful, but breaking the "transformations are lazy;
> only actions launch jobs" model is abhorrent -- which is not to say that we
> haven't already broken that model for useful operations (cf.
> RangePartitioner, which is used for sorted RDDs), but rather that each such
> exception to the model is a significant source of pain that can be hard to
> work with or work around.

A thought that comes to mind here is that there are in fact already two 
categories of transform: ones that are truly lazy, and ones that are not.  One 
option is to embrace that, and commit to documenting the two categories as 
such, with an obvious bias towards favoring lazy transforms (to paraphrase 
Churchill, we're down to haggling over the price).
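
To make the distinction concrete, here is a minimal sketch (local mode, 
against the 1.x-era API; the object and variable names are mine) showing one 
transform from each category:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair/ordered RDD implicits

    object LazyVsEager {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("lazy-vs-eager"))
        val pairs = sc.parallelize(1 to 100000).map(i => (i % 100, i))

        // Truly lazy: records lineage only; no job runs here.
        val mapped = pairs.mapValues(_ + 1)

        // "Lazy" in name only: sortByKey constructs a RangePartitioner,
        // whose constructor samples the parent RDD -- a job launches right
        // here, before any action is invoked.
        val sorted = pairs.sortByKey()

        sorted.count()  // the action that launches the job we asked for
        sc.stop()
      }
    }

Watching the web UI while the sortByKey line runs shows the extra sampling 
job.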
 

> 
> I really wouldn't like to see another such model-breaking transformation
> added to the API.  On the other hand, being able to write transformations
> with dependencies on these kind of "internal" jobs is sometimes very
> useful, so a significant reworking of Spark's Dependency model that would
> allow for lazily running such internal jobs and making the results
> available to subsequent stages may be something worth pursuing.


This seems like a very interesting angle.  I don't have much feel for what a 
solution would look like, but it sounds as if it would involve caching the 
operations embodied in RDD transform method code for provisional (deferred) 
execution.  I believe those levels of invocation are currently executed on the 
master (driver), not on the executor nodes.
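
To make those "internal" jobs concrete, here is a rough sketch of a drop(n) 
along those lines (my own code against the 1.x API, not the actual SPARK-2315 
implementation).  The counting loop is the part that executes eagerly today, 
and is presumably what a reworked Dependency model would defer until a 
downstream action fires:

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    def drop[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] = {
      val sc = rdd.sparkContext
      var toSkip = n  // elements still to be skipped
      var p = 0       // candidate boundary partition
      var found = rdd.partitions.isEmpty
      while (!found) {
        // Eager "internal job": count a single partition of the parent.
        val count = sc.runJob(
          rdd, (it: Iterator[T]) => it.size, Seq(p), allowLocal = false).head
        if (count > toSkip || p == rdd.partitions.length - 1) found = true
        else { toSkip -= count; p += 1 }
      }
      val (bp, rem) = (p, toSkip)
      rdd.mapPartitionsWithIndex { (idx, it) =>
        if (idx < bp) Iterator.empty      // wholly dropped partitions
        else if (idx == bp) it.drop(rem)  // boundary: skip the remainder
        else it                           // later partitions pass through
      }
    }

Only partitions up to the boundary are ever computed, but they are computed 
at definition time -- exactly the partial-action behavior under discussion.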


> 
> 
> On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash <and...@andrewash.com> wrote:
> 
> > Personally I'd find the method useful -- I've often had a .csv file with a
> > header row that I want to drop, so I filter it out, which touches all
> > partitions anyway.  I don't have any comments on the implementation quite
> > yet though.
> >
> >
> > On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson <e...@redhat.com> wrote:
> >
> > > A few weeks ago I submitted a PR for supporting rdd.drop(n), under
> > > SPARK-2315:
> > > https://issues.apache.org/jira/browse/SPARK-2315
> > >
> > > Supporting the drop method would make some operations convenient; however,
> > > it forces computation of >= 1 partition of the parent RDD, and so it would
> > > behave like a "partial action" that returns an RDD as the result.
> > >
> > > I wrote up a discussion of these trade-offs here:
> > >
> > > http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
> > >
> >
> 
