Re: RFC: Supporting the Scala drop Method for Spark RDDs

Sandy Ryza Mon, 21 Jul 2014 23:01:21 -0700

Yeah, the input format doesn't support this behavior.  But it does tell you
the byte position of each record in the file.
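
For illustration, here is a minimal sketch of how that byte position could be used to drop the first line of every file. It is plain Python with no Spark dependency. `records_with_offsets` and `drop_first_line_per_file` are hypothetical names, and the (offset, line) pairs only mimic the keys a text input format hands back, assuming the key is the byte offset of each line within its file.

```python
def records_with_offsets(text):
    """Yield (byte_offset, line) pairs, mimicking text input format keys."""
    offset = 0
    # Split on the encoded bytes so offsets are byte positions, not characters.
    for raw in text.encode("utf-8").splitlines(keepends=True):
        yield offset, raw.decode("utf-8").rstrip("\r\n")
        offset += len(raw)


def drop_first_line_per_file(files):
    """Keep every record except the one that starts at byte 0 of its file."""
    for name, text in files.items():
        for offset, line in records_with_offsets(text):
            if offset != 0:  # offset 0 marks the first line of a file
                yield name, line


# Hypothetical in-memory "files", each starting with a CSV header.
files = {
    "a.csv": "id,name\n1,ann\n2,bob\n",
    "b.csv": "id,name\n3,cy\n",
}
rows = list(drop_first_line_per_file(files))
```

The check is made per record, so in this sketch no extra pass over the data is needed. Dropping the header of only the first file would also require knowing which file comes first, which the offset alone does not say.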



On Mon, Jul 21, 2014 at 10:55 PM, Reynold Xin <r...@databricks.com> wrote:

> Yes, that could work. But it is not as simple as just a binary flag.
>
> We might want to skip the first row for every file, or the header only for
> the first file. The former is not really supported out of the box by the
> input format I think?
>
>
> On Mon, Jul 21, 2014 at 10:50 PM, Sandy Ryza <sandy.r...@cloudera.com>
> wrote:
>
> > It could make sense to add a skipHeader argument to
> > SparkContext.textFile?
> >
> >
> > On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin <r...@databricks.com>
> wrote:
> >
> > > If the purpose is for dropping csv headers, perhaps we don't really need a
> > > common drop and only one that drops the first line in a file? I'd really
> > > try hard to avoid a common drop/dropWhile because they can be expensive to
> > > do.
> > >
> > > Note that I think we will be adding this functionality (ignoring headers)
> > > to the CsvRDD functionality in Spark SQL.
> > >  https://github.com/apache/spark/pull/1351
> > >
> > >
> > > On Mon, Jul 21, 2014 at 1:45 PM, Mark Hamstra <m...@clearstorydata.com>
> > > wrote:
> > >
> > > > You can find some of the prior, related discussion here:
> > > > https://issues.apache.org/jira/browse/SPARK-1021
> > > >
> > > >
> > > > On Mon, Jul 21, 2014 at 1:25 PM, Erik Erlandson <e...@redhat.com>
> > wrote:
> > > >
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > > Rather than embrace non-lazy transformations and add more of
> them,
> > > I'd
> > > > > > rather we 1) try to fully characterize the needs that are driving
> > > their
> > > > > > creation/usage; and 2) design and implement new Spark
> abstractions
> > > that
> > > > > > will allow us to meet those needs and eliminate existing non-lazy
> > > > > > transformations.
> > > > >
> > > > >
> > > > > In the case of drop, obtaining the index of the boundary partition
> > can
> > > be
> > > > > viewed as the action forcing compute -- one that happens to be
> > invoked
> > > > > inside of a transform.  The concept of a "lazy action", that is
> only
> > > > > triggered if the result rdd has compute invoked on it, might be
> > > > sufficient
> > > > > to restore laziness to the drop transform.   For that matter, I
> might
> > > > find
> > > > > some way to make use of Scala lazy values directly and achieve the
> > same
> > > > > goal for drop.
> > > > >
> > > > >
> > > > >
> > > > > > They really mess up things like creation of asynchronous
> > > > > > FutureActions, job cancellation and accounting of job resource
> > usage,
> > > > > etc.,
> > > > > > so I'd rather we seek a way out of the existing hole rather than
> > make
> > > > it
> > > > > > deeper.
> > > > > >
> > > > > >
> > > > > > On Mon, Jul 21, 2014 at 10:24 AM, Erik Erlandson <e...@redhat.com>
> > > > > > wrote:
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > > Sure, drop() would be useful, but breaking the
> "transformations
> > > are
> > > > > lazy;
> > > > > > > > only actions launch jobs" model is abhorrent -- which is not
> to
> > > say
> > > > > that
> > > > > > > we
> > > > > > > > haven't already broken that model for useful operations (cf.
> > > > > > > > RangePartitioner, which is used for sorted RDDs), but rather
> > that
> > > > > each
> > > > > > > such
> > > > > > > > exception to the model is a significant source of pain that
> can
> > > be
> > > > > hard
> > > > > > > to
> > > > > > > > work with or work around.
> > > > > > >
> > > > > > > A thought that comes to my mind here is that there are in fact
> > > > already
> > > > > two
> > > > > > > categories of transform: ones that are truly lazy, and ones
> that
> > > are
> > > > > not.
> > > > > > >  A possible option is to embrace that, and commit to
> documenting
> > > the
> > > > > two
> > > > > > > categories as such, with an obvious bias towards favoring lazy
> > > > > transforms
> > > > > > > (to paraphrase Churchill, we're down to haggling over the
> price).
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > I really wouldn't like to see another such model-breaking
> > > > > transformation
> > > > > > > > added to the API.  On the other hand, being able to write
> > > > > transformations
> > > > > > > > > with dependencies on these kinds of "internal" jobs is
> sometimes
> > > > very
> > > > > > > > useful, so a significant reworking of Spark's Dependency
> model
> > > that
> > > > > would
> > > > > > > > allow for lazily running such internal jobs and making the
> > > results
> > > > > > > > available to subsequent stages may be something worth
> pursuing.
> > > > > > >
> > > > > > >
> > > > > > > This seems like a very interesting angle.   I don't have much
> > feel
> > > > for
> > > > > > > what a solution would look like, but it sounds as if it would
> > > involve
> > > > > > > caching all operations embodied by RDD transform method code
> for
> > > > > > > provisional execution.  I believe that these levels of
> invocation
> > > are
> > > > > > > currently executed in the master, not executor nodes.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash <
> > > and...@andrewash.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Personally I'd find the method useful -- I've often had a
> > .csv
> > > > file
> > > > > > > with a
> > > > > > > > > header row that I want to drop so filter it out, which
> > touches
> > > > all
> > > > > > > > > partitions anyway.  I don't have any comments on the
> > > > implementation
> > > > > > > quite
> > > > > > > > > yet though.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson <
> > > e...@redhat.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > A few weeks ago I submitted a PR for supporting
> > rdd.drop(n),
> > > > > under
> > > > > > > > > > SPARK-2315:
> > > > > > > > > > https://issues.apache.org/jira/browse/SPARK-2315
> > > > > > > > > >
> > > > > > > > > > Supporting the drop method would make some operations
> > > > convenient;
> > > > > > > however,
> > > > > > > > > > it forces computation of >= 1 partition of the parent
> RDD,
> > > and
> > > > > so it
> > > > > > > > > would
> > > > > > > > > > behave like a "partial action" that returns an RDD as the
> > > > result.
> > > > > > > > > >
> > > > > > > > > > I wrote up a discussion of these trade-offs here:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
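
The "partial action" behavior described in the original proposal quoted above can be sketched roughly as follows. This is plain Python, not Spark code: partitions are modeled as zero-argument callables, and `drop` here is a hypothetical stand-in, not the implementation from the PR. The point is only that finding the boundary partition means computing the leading partitions eagerly.

```python
def drop(partitions, n):
    """Drop the first n elements across a list of lazily computed partitions.

    Returns (new_partitions, eagerly_computed_indices).
    """
    computed = []   # partitions that had to be computed to find the boundary
    remaining = n   # elements still to be dropped
    boundary = 0    # first partition that keeps at least some elements
    while boundary < len(partitions) and remaining > 0:
        size = len(partitions[boundary]())  # forces computation of this partition
        computed.append(boundary)
        if size > remaining:
            break
        remaining -= size
        boundary += 1

    def make(i):
        if i < boundary:
            return lambda: []                             # dropped entirely
        if i == boundary:
            return lambda: partitions[i]()[remaining:]    # partially dropped
        return partitions[i]                              # untouched, still lazy

    return [make(i) for i in range(len(partitions))], computed


# Three hypothetical partitions; the first one starts with a header row.
parts = [lambda: ["header", "a"], lambda: ["b", "c"], lambda: ["d"]]
result, eager = drop(parts, 1)
values = [x for p in result for x in p()]
```

In this model, dropping a single header row computes only the first partition up front, while a larger n walks further into the data before any action is called on the result.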
