- Original Message -
Sure, drop() would be useful, but breaking the "transformations are lazy;
only actions launch jobs" model is abhorrent -- which is not to say that we
haven't already broken that model for useful operations (cf.
RangePartitioner, which is used for sorted RDDs), but
Yeah, the input format doesn't support this behavior. But it does tell you
the byte position of each record in the file.
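The byte-position idea above can be sketched as follows. This is an illustrative sketch, not code from the thread: with the old-API Hadoop TextInputFormat, each record's key is its byte offset within the file, so offset 0 identifies the first line of every file. The path and variable names are assumptions.

```scala
// Pure predicate, testable without a cluster: a record is a header
// line iff it starts at byte offset 0 of its file.
def isHeader(offset: Long): Boolean = offset == 0L

// Hedged RDD sketch (needs a SparkContext `sc`; path is illustrative).
// TextInputFormat keys each record by its byte position in the file,
// so filtering out offset 0 drops one header line per input file:
//   import org.apache.hadoop.io.{LongWritable, Text}
//   import org.apache.hadoop.mapred.TextInputFormat
//   val rows = sc.hadoopFile[LongWritable, Text, TextInputFormat]("data.csv")
//     .filter { case (offset, _) => !isHeader(offset.get()) }
//     .map { case (_, line) => line.toString }
```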
On Mon, Jul 21, 2014 at 10:55 PM, Reynold Xin r...@databricks.com wrote:
Yes, that could work. But it is not as simple as just a binary flag.
We might want to skip the first row for every file, or the header only for
the first file.
- Original Message -
It could make sense to add a skipHeader argument to SparkContext.textFile?
I also looked into this. I don't think it's feasible given the limits of the
InputFormat and RecordReader interfaces. RecordReader can't (I think) *ever*
know which split it's attached to.
A few weeks ago I submitted a PR for supporting rdd.drop(n), under SPARK-2315:
https://issues.apache.org/jira/browse/SPARK-2315
Supporting the drop method would make some operations convenient; however, it
forces computation of >= 1 partition of the parent RDD, and so it would behave
like a
Personally I'd find the method useful -- I've often had a .csv file with a
header row that I want to drop, so I filter it out, which touches all
partitions anyway. I don't have any comments on the implementation quite
yet though.
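The filter-it-out workaround described above can be sketched like this. This is an illustrative sketch, not code from the thread; the file path is an assumption, and as noted, the filter touches every partition.

```scala
// Pure core of the workaround, testable without a cluster: take the
// first line as the header and filter out every line equal to it.
def stripHeader(lines: List[String]): List[String] =
  lines.headOption match {
    case Some(header) => lines.filter(_ != header)
    case None         => Nil
  }

// With an RDD the same idea reads (sketch; needs a SparkContext `sc`):
//   val lines  = sc.textFile("data.csv")   // path is illustrative
//   val header = lines.first()
//   val rows   = lines.filter(_ != header) // touches every partition
```

Note the equality filter also drops any repeated header lines mid-file, which may or may not be what you want.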
On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson e...@redhat.com wrote:
Sure, drop() would be useful, but breaking the "transformations are lazy;
only actions launch jobs" model is abhorrent -- which is not to say that we
haven't already broken that model for useful operations (cf.
RangePartitioner, which is used for sorted RDDs), but rather that each such
exception to
://apache-spark-developers-list.1001551.n3.nabble.com/RFC-Supporting-the-Scala-drop-Method-for-Spark-RDDs-tp7433p7436.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
- Original Message -
Sure, drop() would be useful, but breaking the "transformations are lazy;
only actions launch jobs" model is abhorrent -- which is not to say that we
haven't already broken that model for useful operations (cf.
RangePartitioner, which is used for sorted RDDs), but
Rather than embrace non-lazy transformations and add more of them, I'd
rather we 1) try to fully characterize the needs that are driving their
creation/usage; and 2) design and implement new Spark abstractions that
will allow us to meet those needs and eliminate existing non-lazy
transformations.
- Original Message -
Rather than embrace non-lazy transformations and add more of them, I'd
rather we 1) try to fully characterize the needs that are driving their
creation/usage; and 2) design and implement new Spark abstractions that
will allow us to meet those needs and eliminate existing non-lazy
transformations.
If the purpose is for dropping csv headers, perhaps we don't really need a
general drop, but only one that drops the first line in a file? I'd really
try hard to avoid a general drop/dropWhile because they can be expensive to
do.
Note that I think we will be adding this functionality (ignoring
It could make sense to add a skipHeader argument to SparkContext.textFile?
On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin r...@databricks.com wrote:
If the purpose is for dropping csv headers, perhaps we don't really need a
general drop, but only one that drops the first line in a file? I'd
Yes, that could work. But it is not as simple as just a binary flag.
We might want to skip the first row for every file, or the header only for
the first file. The former is not really supported out of the box by the
input format, I think?
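The "header only for the first file" case above is often approximated with mapPartitionsWithIndex. This is an illustrative sketch, not code from the thread: it assumes the first file's start lands in partition 0, which is a property of how textFile splits files, not something the API guarantees.

```scala
// Pure partition function, testable without a cluster: drop the first
// record of partition 0 only, and pass every other partition through.
def dropFirstInPartitionZero(idx: Int, it: Iterator[String]): Iterator[String] =
  if (idx == 0) it.drop(1) else it

// Hedged RDD usage (needs a SparkContext `sc`; path is illustrative):
//   val rows = sc.textFile("data.csv")
//     .mapPartitionsWithIndex(dropFirstInPartitionZero)
```

Skipping the first row of *every* file is harder from the RDD side alone, since a partition boundary need not coincide with a file boundary.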
On Mon, Jul 21, 2014 at 10:50 PM, Sandy Ryza