Re: Apache Spark Docker image repository

2020-02-11 Thread Erik Erlandson
My takeaway from the last time we discussed this was: 1) To be ASF compliant, we needed to only publish images at official releases 2) There was some ambiguity about whether or not a container image that included GPL'ed packages (spark images do) might trip over the GPL "viral propagation" due to

Re: Initial Decom PR for Spark 3?

2020-02-08 Thread Erik Erlandson
I'd be willing to pull this in, unless others have concerns post branch-cut. On Tue, Feb 4, 2020 at 2:51 PM Holden Karau wrote: > Hi Y’all, > > I’ve got a K8s graceful decom PR ( > https://github.com/apache/spark/pull/26440 > ) I’d love to try and get in for Spark 3, but I don’t want to push

Re: [DISCUSS][SPARK-30275] Discussion about whether to add a gitlab-ci.yml file

2020-01-26 Thread Erik Erlandson
Can a '.gitlab-ci.yml' be considered code, in the same way that the k8s related dockerfiles are code? In other words, something like: "here is a piece of code you might choose to use for building your own binaries, that is not specifically endorsed by Apache Spark"? So it would not be involved in

[DISCUSS] commit to ExpressionEncoder?

2019-10-22 Thread Erik Erlandson
Currently the design of Encoder implies the possibility that encoders might be customized, or at least that there are other internal alternatives to ExpressionEncoder. However there are both implicit and explicit restrictions in spark-sql, such that ExpressionEncoder is the only functional option,

Re: [k8s] Spark operator (the Java one)

2019-10-19 Thread Erik Erlandson
> It's applicable regardless of if the operators are maintained as part of > Spark core or not, with the maturity of Kubernetes features around CRD > support and webhooks. The GCP Spark operator supports a lot of additional > pod/container configs using a webhook, and this approach seems pretty >

Re: Spark 3.0 preview release feature list and major changes

2019-10-19 Thread Erik Erlandson
I'd like to get SPARK-27296 onto 3.0: SPARK-27296 Efficient User Defined Aggregators On Mon, Oct 7, 2019 at 3:03 PM Xingbo Jiang wrote: > Hi all, > > I went over all the finished JIRA tickets targeted to Spark 3.0.0, here > I'm listing all

[DISCUSS] remove 'private[spark]' scoping from UserDefinedType

2019-10-19 Thread Erik Erlandson
The 3.0 release is as good an opportunity as any to make UserDefinedType public again. What does the community think? Cheers, Erik

Re: [k8s] Spark operator (the Java one)

2019-10-16 Thread Erik Erlandson
Folks have (correctly) pointed out that an operator does not need to be coupled to the Apache Spark project. However, I believe there are some strategic community benefits to supporting a Spark operator that should be weighed against the costs of maintaining one. *) The Kubernetes ecosystem is

Re: UDAFs have an inefficiency problem

2019-09-30 Thread Erik Erlandson
, depending on what option is selected. I'm pushing this forward now with the goal of getting a solution into the upcoming 3.0 branch cut On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson wrote: > I describe some of the details here: > https://issues.apache.org/jira/browse/SPARK-27296 > > The s

Re: Thoughts on Spark 3 release, or a preview release

2019-09-16 Thread Erik Erlandson
I'm in favor of adding SPARK-25299 - Use remote storage for persisting shuffle data https://issues.apache.org/jira/browse/SPARK-25299 If that is far enough along to get onto the roadmap. On Wed, Sep 11, 2019 at 11:37 AM Sean Owen wrote: >

Re: Data Property Accumulators

2019-08-21 Thread Erik Erlandson
I'm wondering whether keeping track of accumulation in "consistent mode" is like a case for mapping straight to the Try value, so parsedData has type RDD[Try[...]], and counting failures is parsedData.filter(_.isFailure).count, etc Put another way: Consistent mode accumulation seems (to me) like
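
As a rough sketch of that idea (illustrative code, not from the thread): carrying per-record success or failure in the data itself via Try makes failure counting an ordinary action over the RDD, with no separate accumulator bookkeeping.

    import scala.util.{Success, Try}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("try-sketch").getOrCreate()
    val raw = spark.sparkContext.parallelize(Seq("1", "2", "oops", "4"))

    // parse failures travel with the data, so recomputation cannot double-count them
    val parsedData = raw.map(s => Try(s.toInt))

    val failures = parsedData.filter(_.isFailure).count()        // 1
    val goodData = parsedData.collect { case Success(v) => v }   // RDD[Int]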

Re: UDAFs have an inefficiency problem

2019-07-05 Thread Erik Erlandson
I submitted a PR for this: https://github.com/apache/spark/pull/25024 On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson wrote: > I describe some of the details here: > https://issues.apache.org/jira/browse/SPARK-27296 > > The short version of the story is that aggregating data stru

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Erik Erlandson
BTW, if this is known, is there an existing JIRA I should link to? On Wed, Mar 27, 2019 at 4:36 PM Erik Erlandson wrote: > > At a high level, some candidate strategies are: > 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF > trait itself) so that

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Erik Erlandson
> on how to fix this? > > On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson > wrote: > >> I describe some of the details here: >> https://issues.apache.org/jira/browse/SPARK-27296 >> >> The short version of the story is that aggregating data structures (UDTs) >

UDAFs have an inefficiency problem

2019-03-27 Thread Erik Erlandson
I describe some of the details here: https://issues.apache.org/jira/browse/SPARK-27296 The short version of the story is that aggregating data structures (UDTs) used by UDAFs are serialized to a Row object, and de-serialized, for every row in a data frame. Cheers, Erik
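
For context, a minimal sketch of the typed-Aggregator style that, as I understand it, is roughly the direction SPARK-27296 ended up taking for Spark 3.0 (the aggregator below is illustrative): the aggregation buffer stays a plain JVM object between rows and only passes through its Encoder at partial-aggregation boundaries, rather than round-tripping through a Row for every input row.

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    case class MeanBuf(sum: Double, n: Long)

    object MeanAgg extends Aggregator[Double, MeanBuf, Double] {
      def zero: MeanBuf = MeanBuf(0.0, 0L)
      def reduce(b: MeanBuf, x: Double): MeanBuf = MeanBuf(b.sum + x, b.n + 1)
      def merge(a: MeanBuf, b: MeanBuf): MeanBuf = MeanBuf(a.sum + b.sum, a.n + b.n)
      def finish(b: MeanBuf): Double = if (b.n == 0L) 0.0 else b.sum / b.n
      def bufferEncoder: Encoder[MeanBuf] = Encoders.product[MeanBuf]
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

    // registration as an untyped UDAF (Spark 3.0+):
    // spark.udf.register("mean_agg", org.apache.spark.sql.functions.udaf(MeanAgg))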

Re: SPARk-25299: Updates As Of December 19, 2018

2019-01-09 Thread Erik Erlandson
Curious how SPARK-25299 (where file tracking is pushed to spark drivers, at least in option-5) interacts with Splash. The shuffle data location in SPARK-25299 would now have additional "fallback" logic for recovering from executor loss. On Thu, Jan 3, 2019 at 6:24 AM Peter Rudenko wrote: > Hi

Re: Remove non-Tungsten mode in Spark 3?

2019-01-09 Thread Erik Erlandson
Removing the user facing config seems like a good idea from the standpoint of reducing cognitive load, and documentation On Fri, Jan 4, 2019 at 7:03 AM Sean Owen wrote: > OK, maybe leave in tungsten for 3.0. > I did a quick check, and removing StaticMemoryManager saves a few hundred > lines.

Re: What's a blocker?

2018-10-25 Thread Erik Erlandson
I'd like to expand a bit on the phrase "opportunity cost" to try and make it more concrete: delaying a release means that the community is *not* receiving various bug fixes (and features). Just as a particular example, the wait for 2.3.2 delayed a fix for the Py3.7 iterator breaking change that

Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Erik Erlandson
I would be comfortable making the integration testing manual for now. A JIRA for ironing out how to make it reliable for automatic as a goal for 3.0 seems like a good idea. On Thu, Oct 25, 2018 at 8:11 AM Sean Owen wrote: > Forking this thread. > > Because we'll have another RC, we could

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Erik Erlandson
Cheers, Erik On Fri, Oct 19, 2018 at 9:40 AM Stephen Boesch wrote: > Erik - is there a current locale for approved/recommended third party > additions? The spark-packages has been stale for years it seems. > > Am Fr., 19. Okt. 2018 um 07:06 Uhr schrieb Erik Erlandson < > e

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Erik Erlandson
Hi Matt! There are a couple ways to do this. If you want to submit it for inclusion in Spark, you should start by filing a JIRA for it, and then a pull request. Another possibility is to publish it as your own 3rd party library, which I have done for aggregators before. On Wed, Oct 17, 2018

Re: Starting to make changes for Spark 3 -- what can we delete?

2018-10-17 Thread Erik Erlandson
My understanding was that the legacy mllib api was frozen, with all new dev going to ML, but it was not going to be removed. Although removing it would get rid of a lot of `OldXxx` shims. On Wed, Oct 17, 2018 at 12:55 AM Marco Gaido wrote: > Hi all, > > I think a very big topic on this would

Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Erik Erlandson
unless the plan is to also merge the Kerberos support to branch-2.4 >> >> >> >> Rob >> >> >> >> *From: *Erik Erlandson >> *Date: *Tuesday, 16 October 2018 at 16:47 >> *To: *dev >> *Subject: *[DISCUSS][K8S][TESTS] Include Kerberos integ

[DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Erik Erlandson
I'd like to propose including integration testing for Kerberos on the Spark 2.4 release: https://github.com/apache/spark/pull/22608 Arguments in favor: 1) it improves testing coverage on a feature important for integrating with HDFS deployments 2) its intersection with existing code is small - it

Kubernetes Big-Data-SIG notes, September 19

2018-09-19 Thread Erik Erlandson
Meta: Following this week's regular meeting, we will be meeting biweekly. The next meeting will be October 3. I will be in London for Spark Summit, so Yinan Li will chair that meeting. Spark K8s backend development for 2.4 is complete. There is some renewed discussion about how much

Re: [DISCUSS][K8S] Supporting advanced pod customisation

2018-09-19 Thread Erik Erlandson
I can speak somewhat to the current design. Two of the goals for this feature are that (1) its behavior is easy to reason about, and (2) its implementation in the back-end is lightweight. Option 1 was chosen partly because its behavior is relatively simple to describe to a user: "Your

Re: Python friendly API for Spark 3.0

2018-09-18 Thread Erik Erlandson
implementing and maintaining) and friendlier (for end users) > are worth doing, and maybe some of these "friendlier" APIs can be done > outside of Spark itself (imo, Frameless is doing a very nice job for the > parts of Spark that it is currently covering -- https://github.c

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
cating it in 2.4. I’d also consider looking >> at what other data science tools are doing before fully removing it: for >> example, if Pandas and TensorFlow no longer support Python 2 past some >> point, that might be a good point to remove it. >> > >> > Matei &

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
FWIW, Pandas is dropping Py2 support at the end of this year. TensorFlow is less clear: they only support Py3 on Windows, but there is no reference to any policy about Py2 on their roadmap or the

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Erik Erlandson
I have no binding vote, but I second Stavros’ recommendation for SPARK-23200. Per the parallel threads on Py2 support, I would also like to propose deprecating Py2 starting with this 2.4 release. On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin wrote: > You can log in to https://repository.apache.org

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
ate Py2 already in the 2.4.0 release. > > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson > wrote: > >> In case this didn't make it onto this thread: >> >> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and >> remove it entirely on a later 3.x re

Re: Should python-2 be supported in Spark 3.0?

2018-09-15 Thread Erik Erlandson
In case this didn't make it onto this thread: There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove it entirely on a later 3.x release. On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson wrote: > On a separate dev@spark thread, I raised a question of whet

Should python-2 be supported in Spark 3.0?

2018-09-15 Thread Erik Erlandson
On a separate dev@spark thread, I raised a question of whether or not to support python 2 in Apache Spark, going forward into Spark 3.0. Python-2 is going EOL at the end of 2019. The upcoming release of Spark 3.0 is an opportunity to make breaking

Re: Python friendly API for Spark 3.0

2018-09-15 Thread Erik Erlandson
bly-awkward position of supporting python 2 for some time after it goes EOL. Under the current release cadence, spark 3.0 will land some time in early 2019, which at that point will be mere months until EOL for py2. On Fri, Sep 14, 2018 at 5:01 PM, Holden Karau wrote: > > > On Fr

Re: Python friendly API for Spark 3.0

2018-09-14 Thread Erik Erlandson
To be clear, is this about a "python-friendly API" or a "friendly python API"? On the python side, it might be nice to take advantage of static typing. That requires python 3.6, but with python 2 going EOL, spark-3.0 might be a good opportunity to jump on the python-3-only train. On Fri, Sep 14, 2018 at

Kubernetes Big-Data-SIG notes, September 12

2018-09-12 Thread Erik Erlandson
Spark: Pod template parameters are mostly done. The main remaining design discussion is around how (or whether) to specify which container in the pod template is the driver container. HDFS: We had some discussions about the possibility of adding an HDFS

Kubernetes Big-Data-SIG notes, September 5

2018-09-05 Thread Erik Erlandson
Meta: At the weekly K8s Big Data SIG meeting today, we agreed to experiment with publishing a brief summary of noteworthy Spark-related topics from the weekly meeting to dev@spark, as a reference for interested members of the Apache Spark community. The format is a brief summary, including a link

Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

2018-08-23 Thread Erik Erlandson
Behaviors at this level of detail, across different ML implementations, are highly unlikely to ever align exactly. Statistically small changes in logic, such as "<" versus "<=", or differences in random number generators, etc, (to say nothing of different implementation languages) will accumulate

Re: [DISCUSS] SparkR support on k8s back-end for Spark 2.4

2018-08-16 Thread Erik Erlandson
IMO SparkR support makes sense to merge for 2.4, as long as the release wranglers agree that local integration testing is sufficiently convincing. Part of the intent here is to allow this to happen without Shane having to reorganize his complex upgrade schedule and make it even more complicated.

[DISCUSS] SparkR support on k8s back-end for Spark 2.4

2018-08-15 Thread Erik Erlandson
The SparkR support PR is finished, along with integration testing, however Shane has requested that the integration testing not be enabled until after the 2.4 release because it requires the OS updates he wants to test *after* the release. The integration testing can be run locally, and so the

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Erik Erlandson
The PR for SparkR support on the kube back-end is completed, but waiting for Shane to make some tweaks to the CI machinery for full testing support. If the code freeze is being delayed, this PR could be merged as well. On Fri, Jul 6, 2018 at 9:47 AM, Reynold Xin wrote: > FYI 6 mo is coming up

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Erik Erlandson
>> >> I really appreciate any notice on hydrogen PRs and welcome comments to >> help improve the feature, thanks! >> >> 2018-08-01 4:19 GMT+08:00 Reynold Xin : >> >>> I actually totally agree that we should make sure it should have no >>> impact o

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Erik Erlandson
they go in with proper stability annotations > and are understood not to be cast-in-stone final implementations, but > rather as a way to get people using them and generating the feedback that > is necessary to get us to something more like a final design and > implementation. > > On Tue, Jul 31,

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Erik Erlandson
Barrier mode seems like a high impact feature on Spark's core code: is one additional week enough time to properly vet this feature? On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres wrote: > Full continuous processing aggregation support ran into unanticipated > scalability and scheduling

Toward an "API" for spark images used by the Kubernetes back-end

2018-03-21 Thread Erik Erlandson
During the review of the recent PR to remove use of the init_container from kube pods as created by the Kubernetes back-end, the topic of documenting the "API" for these container images also came up. What information does the back-end provide to these containers? In what form? What assumptions

Publishing container images for Apache Spark

2018-01-11 Thread Erik Erlandson
Dear ASF Legal Affairs Committee, The Apache Spark development community has begun some discussions about publishing container images for Spark as part of its

Fwd: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Erik Erlandson
, questions and proposed actions to Apache counsel > for advice or guidance. > > On Tue, Dec 19, 2017 at 10:34 AM, Erik Erlandson <eerla...@redhat.com> > wrote: > >> I've been looking a bit more into ASF legal posture on licensing and >> container images. What I hav

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Erik Erlandson
Any actual bits distributed by the PMC would have > to follow all the license rules. > > On Tue, Dec 19, 2017 at 12:34 PM Erik Erlandson <eerla...@redhat.com> > wrote: > >> I've been looking a bit more into ASF legal posture on licensing and >> container images. What

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Erik Erlandson
ik On Thu, Dec 14, 2017 at 7:55 PM, Erik Erlandson <eerla...@redhat.com> wrote: > Currently the containers are based off alpine, which pulls in BSD2 and MIT > licensing: > https://github.com/apache/spark/pull/19717#discussion_r154502824 > > to the best of my understanding, n

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-14 Thread Erik Erlandson
deps to be compatible. On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra <m...@clearstorydata.com> wrote: > What licensing issues come into play? > > On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson <eerla...@redhat.com> > wrote: > >> We've been discussing the top

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-14 Thread Erik Erlandson
We've been discussing the topic of container images a bit more. The Kubernetes back-end operates by executing some specific CMD and ENTRYPOINT logic, which is different from Mesos, and which is probably not practical to unify at this level. However: these CMD and ENTRYPOINT configurations are

Re: Timeline for Spark 2.3

2017-12-14 Thread Erik Erlandson
I wanted to check in on the state of the 2.3 freeze schedule. Original proposal was "late Dec", which is a bit open to interpretation. We are working to get some refactoring done on the integration testing for the Kubernetes back-end in preparation for testing upcoming release candidates,

Re: Timeline for Spark 2.3

2017-11-09 Thread Erik Erlandson
+1 on extending the deadline. It will significantly improve the logistics for upstreaming the Kubernetes back-end. Also agreed, on the general realities of reduced bandwidth over the Nov-Dec holiday season. Erik On Thu, Nov 9, 2017 at 6:03 PM, Matei Zaharia wrote: >

Announcing Spark on Kubernetes release 0.4.0

2017-09-25 Thread Erik Erlandson
The Spark on Kubernetes development community is pleased to announce release 0.4.0 of Apache Spark with native Kubernetes scheduler back-end! The dev community is planning to use this release as the reference for upstreaming native kubernetes capability over the Spark 2.3 release cycle. This

Re: SPIP: Spark on Kubernetes

2017-09-02 Thread Erik Erlandson
vote has passed. >> So far, there have been 4 binding +1 votes, ~25 non-binding votes, and no >> -1 votes. >> >> Thanks all! >> >> +1 votes (binding): >> Reynold Xin >> Matei Zaharia >> Marcelo Vanzin >> Mark Hamstra >> >> +1

Re: SPIP: Spark on Kubernetes

2017-08-28 Thread Erik Erlandson
spawn off such as: >>>>- HDFS on Kubernetes >>>> <https://github.com/apache-spark-on-k8s/kubernetes-HDFS> with >>>> Locality and Performance Experiments >>>> - Kerberized access >>>> >>>> <https:/

Re: SPIP: Spark on Kubernetes

2017-08-21 Thread Erik Erlandson
ly care. >>- -1: I don't think this is a good idea because of the following >>technical reasons. >> >> If there is any further clarification desired, on the design or the >> implementation, please feel free to ask questions or provide feedback. >> >> >

Re: SPIP: Spark on Kubernetes

2017-08-18 Thread Erik Erlandson
echnical > > reasons. > > > > If there is any further clarification desired, on the design or the > > implementation, please feel free to ask questions or provide feedback. > > > > > > SPIP: Kubernetes as A Native Cluster Manager > > > > > >

Re: Questions about the future of UDTs and Encoders

2017-08-16 Thread Erik Erlandson
I've been working on packaging some UDTs as well. I have them working in Scala and PySpark, although I haven't been able to get them to serialize to Parquet, which puzzles me. Although it works, I have to define UDTs under the org.apache.spark scope due to the privatization, which is a bit
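
A minimal, illustrative sketch of that scoping workaround (Point1D and its UDT are hypothetical, and this assumes the @SQLUserDefinedType annotation hookup): because UserDefinedType is private[spark], the UDT has to be declared in a package under org.apache.spark to compile.

    package org.apache.spark.example.udt   // required only because the UDT API is private[spark]

    import org.apache.spark.sql.types._

    @SQLUserDefinedType(udt = classOf[Point1DUDT])
    class Point1D(val x: Double) extends Serializable

    class Point1DUDT extends UserDefinedType[Point1D] {
      override def sqlType: DataType = DoubleType
      override def serialize(p: Point1D): Any = p.x
      override def deserialize(datum: Any): Point1D = new Point1D(datum.asInstanceOf[Double])
      override def userClass: Class[Point1D] = classOf[Point1D]
    }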

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Erik Erlandson
represents a way to optimize uptake and retention of Apache Spark among the members of the expanding Kubernetes community. On Tue, Aug 15, 2017 at 8:43 AM, Erik Erlandson <eerla...@redhat.com> wrote: > +1 (non-binding) > > > On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan <fox..

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Erik Erlandson
ons or provide feedback. > > > SPIP: Kubernetes as A Native Cluster Manager > > Full Design Doc: link > <https://issues.apache.org/jira/secure/attachment/12881586/SPARK-18278%20Spark%20on%20Kubernetes%20Design%20Proposal%20Revision%202%20%281%29.pdf> > > JIRA: https://iss

Apache Spark on Kubernetes: New Release for Spark 2.2

2017-08-14 Thread Erik Erlandson
The Apache Spark on Kubernetes Community Development Project is pleased to announce the latest release of Apache Spark with native Scheduler Backend for Kubernetes! Features provided in this release include: - Cluster-mode submission of Spark jobs to a Kubernetes cluster - Support

Failing to write a data-frame containing a UDT to parquet format

2017-07-30 Thread Erik Erlandson
I'm trying to support parquet i/o for data-frames that contain a UDT (for t-digests). The UDT is defined here: https://github.com/erikerlandson/isarn-sketches-spark/blob/feature/pyspark/src/main/scala/org/apache/spark/isarnproject/sketches/udt/TDigestUDT.scala#L37 I can read and write using

Spark on Kubernetes: Birds-of-a-Feather Session 12:50pm 6/6

2017-06-05 Thread Erik Erlandson
Come learn about the community development project to add a native Kubernetes scheduling back-end to Apache Spark! Meet contributors and network with community members interested in running Spark on Kubernetes. Learn how to run Spark jobs on your Kubernetes cluster; find out how to contribute to

Re: [ml] Why all model classes are final?

2015-06-11 Thread Erik Erlandson
I was able to work around this problem in several cases using the class 'enhancement' or 'extension' pattern to add some functionality to the decision tree model data structures. - Original Message - Hi, previously all the models in ml package were private to package, so if i need to
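
A small illustrative sketch of that enhancement/extension pattern (assuming a spark.ml version where depth and numNodes are public on the model; the summaryLine method is hypothetical): an implicit wrapper class adds behavior to a final model class without subclassing it.

    import org.apache.spark.ml.classification.DecisionTreeClassificationModel

    object TreeModelEnrichment {
      // hypothetical convenience method; the point is the pattern, not this particular metric
      implicit class RichTreeModel(val model: DecisionTreeClassificationModel) extends AnyVal {
        def summaryLine: String = s"depth=${model.depth}, nodes=${model.numNodes}"
      }
    }

    // usage:  import TreeModelEnrichment._  ;  println(fittedModel.summaryLine)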

Re: Discussion | SparkContext 's setJobGroup and clearJobGroup should return a new instance of SparkContext

2015-01-12 Thread Erik Erlandson
setJobGroup needs fixing: https://issues.apache.org/jira/browse/SPARK-4514 I'm interested in any community input on what the semantics or design ought to be changed to. - Original Message - Hi spark committers I would like to discuss the possibility of changing the signature of

Re: Scalastyle improvements / large code reformatting

2014-10-13 Thread Erik Erlandson
- Original Message - I'm also against these huge reformattings. They slow down development and backporting for trivial reasons. Let's not do that at this point, the style of the current code is quite consistent and we have plenty of other things to worry about. Instead, what you can

Re: Adding abstraction in MLlib

2014-09-12 Thread Erik Erlandson
Are interface designs being captured anywhere as documents that the community can follow along with as the proposals evolve? I've worked on other open source projects where design docs were published as living documents (e.g. on google docs, or etherpad, but the particular mechanism isn't

PSA: SI-8835 (Iterator 'drop' method has a complexity bug causing quadratic behavior)

2014-09-06 Thread Erik Erlandson
I tripped over this recently while preparing a solution for SPARK-3250 (efficient sampling): Iterator 'drop' method has a complexity bug causing quadratic behavior https://issues.scala-lang.org/browse/SI-8835 It's something of a corner case, as the impact is serious only if one is repeatedly
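
A small illustrative sketch (not from the original mail) of the pattern where SI-8835 bites, plus a manual-skip workaround that stays linear: as I understand the issue, on affected Scala versions each drop() call wraps the previous iterator, so repeatedly dropping from the same iterator makes the total cost grow quadratically.

    // repeated drop(): quadratic on Scala versions affected by SI-8835
    def sumEveryTenthWithDrop(n: Int): Long = {
      var it: Iterator[Int] = Iterator.range(0, n)
      var sum = 0L
      while (it.hasNext) {
        sum += it.next()
        it = it.drop(9)
      }
      sum
    }

    // manual skip: linear regardless of the library implementation
    def sumEveryTenthWithSkip(n: Int): Long = {
      val it = Iterator.range(0, n)
      var sum = 0L
      while (it.hasNext) {
        sum += it.next()
        var k = 0
        while (k < 9 && it.hasNext) { it.next(); k += 1 }
      }
      sum
    }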

Re: Handling stale PRs

2014-08-26 Thread Erik Erlandson
- Original Message - Another thing is that we should, IMO, err on the side of explicitly saying no or not yet to patches, rather than letting them linger without attention. We do get patches where the user is well intentioned, but it is Completely agree. The solution is partly

any interest in something like rdd.parent[T](n) (equivalent to firstParent[T] for n==0) ?

2014-08-05 Thread Erik Erlandson
Not that rdd.dependencies(n).rdd.asInstanceOf[RDD[T]] is terrible, but rdd.parent[T](n) better captures the intent.
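
An illustrative sketch (not the actual patch) of the proposed convenience, written as an external enrichment over the public dependencies list:

    import org.apache.spark.rdd.RDD

    object RDDParentSyntax {
      implicit class ParentOps(val self: RDD[_]) extends AnyVal {
        def parent[T](n: Int): RDD[T] = self.dependencies(n).rdd.asInstanceOf[RDD[T]]
      }
    }

    // usage:  import RDDParentSyntax._  ;  rdd.parent[String](0)
    // instead of:  rdd.dependencies(0).rdd.asInstanceOf[RDD[String]]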

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-29 Thread Erik Erlandson
: Personally I'd find the method useful -- I've often had a .csv file with a header row that I want to drop so filter it out, which touches all partitions anyway. I don't have any comments on the implementation quite yet though. On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson e

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-22 Thread Erik Erlandson
PM, Erik Erlandson e...@redhat.com wrote: - Original Message - Rather than embrace non-lazy transformations and add more of them, I'd rather we 1) try to fully characterize the needs that are driving their creation/usage; and 2) design and implement new

RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
A few weeks ago I submitted a PR for supporting rdd.drop(n), under SPARK-2315: https://issues.apache.org/jira/browse/SPARK-2315 Supporting the drop method would make some operations convenient; however, it forces computation of >= 1 partition of the parent RDD, and so it would behave like a
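
For comparison, a sketch of the zipWithIndex-based workaround users often reach for (illustrative, not the SPARK-2315 implementation). Note that zipWithIndex itself launches a job to compute per-partition offsets when the RDD has more than one partition, so it runs into essentially the same laziness concern discussed in this thread.

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // drop the first n elements by global index
    def dropByIndex[T: ClassTag](rdd: RDD[T], n: Long): RDD[T] =
      rdd.zipWithIndex().filter { case (_, i) => i >= n }.map(_._1)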

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
. On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson [hidden email] wrote: A few weeks ago I submitted a PR for supporting rdd.drop(n), under SPARK-2315: https://issues.apache.org/jira/browse/SPARK-2315 Supporting the drop method would

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
anyway. I don't have any comments on the implementation quite yet though. On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson e...@redhat.com wrote: A few weeks ago I submitted a PR for supporting rdd.drop(n), under SPARK-2315: https://issues.apache.org/jira/browse/SPARK-2315

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
and accounting of job resource usage, etc., so I'd rather we seek a way out of the existing hole rather than make it deeper. On Mon, Jul 21, 2014 at 10:24 AM, Erik Erlandson e...@redhat.com wrote: - Original Message - Sure, drop() would be useful, but breaking the transformations