----- Original Message -----
> It could make sense to add a skipHeader argument to SparkContext.textFile?

I also looked into this.  I don't think it's feasible given the limits of the
InputFormat and RecordReader interfaces.  A RecordReader can't (I think) *ever*
know which split it's attached to, and the getSplits() method has no concept of
a RecordReader, so it can't know how many records reside in its splits.  At the
RDD level, at least, it's possible to do, if not attractive.
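
For example, a rough sketch of the RDD-level version (names are mine; it
assumes a single input file, so the header lands in partition 0):

    import org.apache.spark.SparkContext

    // Drop the first record of partition 0 only.  With sc.textFile on a
    // single file, that first record is the header line.
    def dropHeader(sc: SparkContext, path: String) = {
      val lines = sc.textFile(path)
      lines.mapPartitionsWithIndex { (idx, iter) =>
        if (idx == 0) iter.drop(1) else iter
      }
    }

This stays lazy and touches no data up front, but it silently does the wrong
thing if the RDD is built from more than one file.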



> 
> 
> On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin <r...@databricks.com> wrote:
> 
> > If the purpose is dropping csv headers, perhaps we don't really need a
> > general drop, just one that drops the first line of a file? I'd really
> > try hard to avoid a general drop/dropWhile because they can be expensive
> > to do.
> >
> > Note that I think we will be adding this functionality (ignoring headers)
> > to the CsvRDD functionality in Spark SQL.
> >  https://github.com/apache/spark/pull/1351
> >
> >
> > On Mon, Jul 21, 2014 at 1:45 PM, Mark Hamstra <m...@clearstorydata.com>
> > wrote:
> >
> > > You can find some of the prior, related discussion here:
> > > https://issues.apache.org/jira/browse/SPARK-1021
> > >
> > >
> > > On Mon, Jul 21, 2014 at 1:25 PM, Erik Erlandson <e...@redhat.com> wrote:
> > >
> > > >
> > > >
> > > > ----- Original Message -----
> > > > > Rather than embrace non-lazy transformations and add more of them,
> > > > > I'd rather we 1) try to fully characterize the needs that are driving
> > > > > their creation/usage; and 2) design and implement new Spark
> > > > > abstractions that will allow us to meet those needs and eliminate
> > > > > existing non-lazy transformations.
> > > >
> > > >
> > > > In the case of drop, obtaining the index of the boundary partition can
> > > > be viewed as the action forcing compute -- one that happens to be
> > > > invoked inside a transform.  The concept of a "lazy action", one that
> > > > is triggered only when compute is invoked on the resulting RDD, might
> > > > be sufficient to restore laziness to the drop transform.  For that
> > > > matter, I might find some way to make use of Scala lazy values directly
> > > > and achieve the same goal for drop.
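
For concreteness, a rough, untested sketch of the lazy-value variant (class
and member names are mine, not from the PR): the boundary scan lives in a
lazy val that is forced on the driver only when the scheduler first asks for
the RDD's partitions, i.e. when a real action finally runs:

    import scala.reflect.ClassTag
    import org.apache.spark.{Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    class DropRDD[T: ClassTag](prev: RDD[T], n: Long) extends RDD[T](prev) {

      // Cumulative record counts per parent partition.  Evaluating this
      // runs a job over the parent, but only once, and only on first touch.
      private lazy val cumCounts: Array[Long] =
        firstParent[T]
          .mapPartitions(it => Iterator(it.size.toLong), preservesPartitioning = true)
          .collect()
          .scanLeft(0L)(_ + _)

      override def getPartitions: Array[Partition] = {
        // getPartitions runs on the driver when a job is submitted; forcing
        // the lazy val here keeps the counting job off the executors.
        cumCounts
        firstParent[T].partitions
      }

      override def compute(split: Partition, context: TaskContext): Iterator[T] = {
        // Records whose global index falls below n are dropped.
        val skip = math.max(0L, n - cumCounts(split.index))
        firstParent[T].iterator(split, context).drop(skip.toInt)
      }
    }

Whether the evaluated lazy val ships cleanly to executors with the serialized
RDD is one of the details a real implementation would have to nail down.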
> > > >
> > > >
> > > >
> > > > > They really mess up things like the creation of asynchronous
> > > > > FutureActions, job cancellation, accounting of job resource usage,
> > > > > etc., so I'd rather we seek a way out of the existing hole than make
> > > > > it deeper.
> > > > >
> > > > >
> > > > > On Mon, Jul 21, 2014 at 10:24 AM, Erik Erlandson <e...@redhat.com>
> > > > > wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > Sure, drop() would be useful, but breaking the "transformations
> > > > > > > are lazy; only actions launch jobs" model is abhorrent -- which
> > > > > > > is not to say that we haven't already broken that model for
> > > > > > > useful operations (cf. RangePartitioner, which is used for sorted
> > > > > > > RDDs), but rather that each such exception to the model is a
> > > > > > > significant source of pain that can be hard to work with or work
> > > > > > > around.
> > > > > >
> > > > > > A thought that comes to my mind here is that there are in fact
> > > > > > already two categories of transform: ones that are truly lazy, and
> > > > > > ones that are not.  A possible option is to embrace that, and
> > > > > > commit to documenting the two categories as such, with an obvious
> > > > > > bias towards favoring lazy transforms (to paraphrase Churchill,
> > > > > > we're down to haggling over the price).
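
The RangePartitioner case is easy to demonstrate (assuming a live
SparkContext sc): sortByKey() constructs a RangePartitioner up front, and
building one samples the parent RDD, so a job launches before any action is
ever called:

    // On older Spark, sortByKey may need: import org.apache.spark.SparkContext._
    val pairs = sc.parallelize(1 to 100000).map(x => (x % 100, x))
    val sorted = pairs.sortByKey()  // a sampling job runs right here

No collect(), count(), etc. has been invoked, yet a job shows up in the UI.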
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > I really wouldn't like to see another such model-breaking
> > > > > > > transformation added to the API.  On the other hand, being able
> > > > > > > to write transformations with dependencies on these kinds of
> > > > > > > "internal" jobs is sometimes very useful, so a significant
> > > > > > > reworking of Spark's Dependency model that would allow for lazily
> > > > > > > running such internal jobs and making the results available to
> > > > > > > subsequent stages may be something worth pursuing.
> > > > > >
> > > > > >
> > > > > > This seems like a very interesting angle.  I don't have much feel
> > > > > > for what a solution would look like, but it sounds as if it would
> > > > > > involve caching the operations embodied in RDD transform methods
> > > > > > for provisional execution.  I believe these levels of invocation
> > > > > > are currently executed on the master, not on the executor nodes.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash <and...@andrewash.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Personally I'd find the method useful -- I've often had a .csv
> > > > > > > > file with a header row that I want to drop, so I filter it out,
> > > > > > > > which touches all partitions anyway.  I don't have any comments
> > > > > > > > on the implementation quite yet, though.
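
The filter idiom reads roughly like this (the path is hypothetical; note
that first() is itself a small job over the first partition):

    val lines = sc.textFile("data.csv")
    val header = lines.first()            // small job: reads partition 0
    val rows = lines.filter(_ != header)  // lazy, but checks every partition

One caveat: this also drops any body line that happens to be identical to
the header.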
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson <e...@redhat.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > A few weeks ago I submitted a PR for supporting rdd.drop(n),
> > > > > > > > > under SPARK-2315:
> > > > > > > > > https://issues.apache.org/jira/browse/SPARK-2315
> > > > > > > > >
> > > > > > > > > Supporting the drop method would make some operations
> > > > > > > > > convenient; however, it forces computation of >= 1 partition
> > > > > > > > > of the parent RDD, and so it would behave like a "partial
> > > > > > > > > action" that returns an RDD as the result.
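
A rough sketch of that "partial action" shape using only public APIs (the
function name is mine, and this assumes a Spark version where
sc.runJob(rdd, func, partitions) is available):

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Count partitions from the front, one internal job each, until the
    // first n records are covered -- so >= 1 parent partition is computed
    // at call time, before any action runs on the result.
    def eagerDrop[T: ClassTag](rdd: RDD[T], n: Long): RDD[T] = {
      val sc = rdd.sparkContext
      var before = 0L    // records in partitions dropped entirely
      var boundary = 0   // first partition that keeps any records
      var found = false
      while (!found && boundary < rdd.partitions.length) {
        val count =
          sc.runJob(rdd, (it: Iterator[T]) => it.size.toLong, Seq(boundary)).head
        if (before + count <= n) { before += count; boundary += 1 }
        else found = true
      }
      val (b, skip) = (boundary, (n - before).toInt)
      rdd.mapPartitionsWithIndex { (idx, it) =>
        if (idx < b) Iterator.empty
        else if (idx == b) it.drop(skip)
        else it
      }
    }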
> > > > > > > > >
> > > > > > > > > I wrote up a discussion of these trade-offs here:
> > > > > > > > >
> > > > > > > > > http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
