Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-29 Thread Erik Erlandson
- Original Message - Sure, drop() would be useful, but breaking the "transformations are lazy; only actions launch jobs" model is abhorrent -- which is not to say that we haven't already broken that model for useful operations (cf. RangePartitioner, which is used for sorted RDDs), but

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-22 Thread Sandy Ryza
Yeah, the input format doesn't support this behavior. But it does tell you the byte position of each record in the file. On Mon, Jul 21, 2014 at 10:55 PM, Reynold Xin r...@databricks.com wrote: Yes, that could work. But it is not as simple as just a binary flag. We might want to skip the
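The byte positions Sandy mentions suggest one workaround: with Hadoop's TextInputFormat, each record's key is its byte offset within its file, so a record at offset 0 is the first line of some file. A hedged sketch of that filtering logic, modeled here with a plain Scala Seq of (offset, line) pairs rather than a live SparkContext (the sample data is hypothetical):

```scala
// Records as TextInputFormat would deliver them: (byteOffset, line).
// Offset 0 marks the first line of each file.
val records: Seq[(Long, String)] = Seq(
  (0L,  "id,name"),  // header of file A
  (8L,  "1,alice"),
  (16L, "2,bob"),
  (0L,  "id,name"),  // header of file B
  (8L,  "3,carol")
)

// Skip the first record of every file by dropping offset-0 keys.
val withoutHeaders: Seq[String] =
  records.collect { case (offset, line) if offset != 0L => line }
```

On a real cluster the same predicate would be applied to the (LongWritable, Text) pairs produced by reading the file through the Hadoop input format, keeping the whole pipeline lazy.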

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-22 Thread Erik Erlandson
- Original Message - "It could make sense to add a skipHeader argument to SparkContext.textFile?" I also looked into this. I don't think it's feasible given the limits of the InputFormat and RecordReader interfaces. RecordReader can't (I think) *ever* know which split it's attached

RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
A few weeks ago I submitted a PR for supporting rdd.drop(n), under SPARK-2315: https://issues.apache.org/jira/browse/SPARK-2315 Supporting the drop method would make some operations convenient, however it forces computation of >= 1 partition of the parent RDD, and so it would behave like a
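One way to get drop-like behavior without a dedicated eager transformation is to index every element and filter on the index. A minimal sketch on plain Scala collections; the RDD analogue would pair zipWithIndex with filter (note that zipWithIndex itself launches a job to count per-partition sizes when there is more than one partition, so it shares the eagerness concern discussed in this thread):

```scala
// drop(n) expressed as zipWithIndex + filter. On an RDD this would be
// rdd.zipWithIndex().filter(_._2 >= n).map(_._1), where the global index
// comes at the cost of a job to count elements per partition.
def dropFirst[A](data: Seq[A], n: Int): Seq[A] =
  data.zipWithIndex.collect { case (x, i) if i >= n => x }

val rows = Seq("header", "row1", "row2", "row3")
val body = dropFirst(rows, 1)
```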

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Andrew Ash
Personally I'd find the method useful -- I've often had a .csv file with a header row that I want to drop, so I filter it out, which touches all partitions anyway. I don't have any comments on the implementation quite yet though. On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson e...@redhat.com
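The filter-based workaround Andrew describes can be sketched as follows, modeled with a plain Scala Seq and hypothetical CSV data; the caveats are that it scans every element and silently drops any later row that happens to equal the header:

```scala
val lines = Seq("id,name", "1,alice", "2,bob")

// Read the header once, then filter it out of the whole dataset.
// On an RDD this would be val header = rdd.first() followed by
// rdd.filter(_ != header) -- lazy, but it touches all partitions
// and removes *every* row equal to the header, not just the first.
val header = lines.head
val data   = lines.filter(_ != header)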

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Mark Hamstra
Sure, drop() would be useful, but breaking the "transformations are lazy; only actions launch jobs" model is abhorrent -- which is not to say that we haven't already broken that model for useful operations (cf. RangePartitioner, which is used for sorted RDDs), but rather that each such exception to

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
http://apache-spark-developers-list.1001551.n3.nabble.com/RFC-Supporting-the-Scala-drop-Method-for-Spark-RDDs-tp7433p7436.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
- Original Message - Sure, drop() would be useful, but breaking the "transformations are lazy; only actions launch jobs" model is abhorrent -- which is not to say that we haven't already broken that model for useful operations (cf. RangePartitioner, which is used for sorted RDDs), but

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Mark Hamstra
Rather than embrace non-lazy transformations and add more of them, I'd rather we 1) try to fully characterize the needs that are driving their creation/usage; and 2) design and implement new Spark abstractions that will allow us to meet those needs and eliminate existing non-lazy transformations.

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
- Original Message - Rather than embrace non-lazy transformations and add more of them, I'd rather we 1) try to fully characterize the needs that are driving their creation/usage; and 2) design and implement new Spark abstractions that will allow us to meet those needs and eliminate

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Reynold Xin
If the purpose is for dropping csv headers, perhaps we don't really need a common drop and only one that drops the first line in a file? I'd really try hard to avoid a common drop/dropWhile because they can be expensive to do. Note that I think we will be adding this functionality (ignoring

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Sandy Ryza
It could make sense to add a skipHeader argument to SparkContext.textFile? On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin r...@databricks.com wrote: If the purpose is for dropping csv headers, perhaps we don't really need a common drop and only one that drops the first line in a file? I'd

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Reynold Xin
Yes, that could work. But it is not as simple as just a binary flag. We might want to skip the first row for every file, or the header only for the first file. The former is not really supported out of the box by the input format I think? On Mon, Jul 21, 2014 at 10:50 PM, Sandy Ryza
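Reynold's two cases (skip the first row of every file, vs. only the header of the first file) map to different partition-level strategies. A hedged sketch of the "first file only" case, modeled with indexed partitions as plain Scala lists and hypothetical data; on an RDD the analogue would be mapPartitionsWithIndex, which only works when the header is known to sit in partition 0:

```scala
// Partitions of a text RDD, modeled as an indexed Seq of line lists;
// partition 0 holds the header line.
val partitions: Seq[List[String]] = Seq(
  List("id,name", "1,alice"),
  List("2,bob", "3,carol")
)

// Drop the first line of partition 0 only -- the "header only for the
// first file" case. On an RDD:
//   rdd.mapPartitionsWithIndex { (i, it) => if (i == 0) it.drop(1) else it }
val noHeader: Seq[String] = partitions.zipWithIndex.flatMap {
  case (part, 0) => part.drop(1)
  case (part, _) => part
}
```

Skipping a header in *every* file would instead need file-boundary information, e.g. the byte-offset record keys exposed by the input format, since partition indices say nothing about which file a split came from.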