Re: [VOTE] SPIP: Monthly preview release

2025-07-03 Thread Mark Hamstra
+0 I'm not sure that it is a good idea, but I'm also not certain that it isn't an idea worth trying. On Wed, Jul 2, 2025 at 9:37 PM Hyukjin Kwon wrote: > > Hi all, > > I would like to start a vote on the monthly preview releases. > > Discussion thread: > https://lists.apache.org/thread/1hmsb3g7

Re: [DISCUSS] SPIP: Monthly preview release

2025-07-02 Thread Mark Hamstra
On Tue, Jul 1, 2025 at 9:42 PM Hyukjin Kwon wrote: > > > 1. Since we "release" the preview, we will go through the VOTE process. > > What is the expected overhead on doing this monthly? (Maybe this would be > > coupled with the questions below.) > > We have to vote for every preview every month

Re: [VOTE] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-06-02 Thread Mark Hamstra
+0 I'm never going to like calling it real-time when it is not, but that's not enough to vote against the SPIP. On Mon, Jun 2, 2025 at 12:57 AM Liu Cao wrote: > > +1 (non-binding) > > > On Sun, Jun 1, 2025 at 11:42 PM Anish Shrigondekar > wrote: >> >> +1 (non-binding) - this will be really use

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-30 Thread Mark Hamstra
ms and pacemakers are both examples where timeliness is not > only essential but the lack of it can result in a life-or-death situation." > > I don't think it is inaccurate or misleading to call this mode real-time. > It is soft real-time. > > On Thu, May 29, 2025 at 1

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Mark Hamstra
er fields, I can > clarify what we mean by "real-time" explicitly in the SPIP document and any > future documentation. That is not a problem and thank you for your feedback. > > On Thu, May 29, 2025 at 10:37 PM Mark Hamstra > wrote: > >> Referencing other misu

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Mark Hamstra
Referencing other misuse of "real-time" is not persuasive. A SPIP is an engineering document, not a marketing document. Technical clarity and accuracy should be non-negotiable. On Thu, May 29, 2025 at 10:27 PM Jerry Peng wrote: > Mark, > > As an example of my point if you go to the Apache Stor

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Mark Hamstra
de is not "low > latency mode" but I honestly don't like naming. This naming implies the > existing execution modes are not low latency which is not true. What > defines "low" in low latency? It is kind of relative. That is why the name > Real-time mode is se

Re: [VOTE] SPIP: Declarative Pipelines

2025-04-09 Thread Mark Hamstra
+1 On Wed, Apr 9, 2025 at 7:22 AM Sandy Ryza wrote: > > We started to get some votes on the discussion thread, so I'd like to move to > a formal vote on adding support for declarative pipelines. > > *Discussion thread: * > https://lists.apache.org/thread/lsv8f829ps0bog41fjoqc45xk7m574ly > *SPIP

Re: Revert of [SPARK-51229][BUILD][CONNECT] Fix dependency:analyze goal on connect common

2025-03-27 Thread Mark Hamstra
Back in the very early days of Spark (before it was even an Apache Incubator project), Maven was clearly a more mature, capable and stable tool suite for building, testing and publishing JVM code, even Scala code, so some of the earliest commercial adopters of Spark relied upon Maven. It made sense

Re: [RESULT][VOTE] Technical Justification for the veto of the "Retain migration logic..." code change proposal is not valid

2025-03-17 Thread Mark Hamstra
sound alternative is proposed. On Mon, Mar 17, 2025 at 3:37 PM Hyukjin Kwon wrote: > The vote passes with 5 +1s (4 binding +1s) and 3 -1s (3 binding -1s). > > (* = binding) > +1: > - Mark Hamstra * > - Jungtaek Lim > - Wenchen Fan * > - Reynold Xin * > - Yuanjian Li

Re: [DISCUSS] Involve any hack / workaround to not include vendor name in migration logic

2025-03-16 Thread Mark Hamstra
Doing something like pattern matching on \u0064\u0061\u0074\u0061\u0062\u0072\u0069\u0063\u006b\u0073 instead of “databricks” might also be an option if including “databricks” in the code is believed to be so offensive. On Sun, Mar 16, 2025 at 12:52 AM Jungtaek Lim wrote: > Hi dev, > > I'm reall
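
To make the escape sequence in this message concrete, here is a minimal Python sketch (not code from the Spark repository) showing that the escaped literal decodes to the same ten characters as the plain word. The helper name `mentions_vendor` and the sample configuration keys are hypothetical, chosen only for illustration.

```python
# The escape sequence from the message, written as a Python string literal.
# Each \uXXXX escape denotes one character: d, a, t, a, b, r, i, c, k, s.
ESCAPED = "\u0064\u0061\u0074\u0061\u0062\u0072\u0069\u0063\u006b\u0073"


def mentions_vendor(config_key: str) -> bool:
    """Hypothetical helper: report whether a config key contains the word.

    The source text never spells the word in plain characters; it only
    compares against the escaped literal above.
    """
    return ESCAPED in config_key


if __name__ == "__main__":
    print(ESCAPED)
    print(mentions_vendor("spark.databricks.example"))  # hypothetical key
    print(mentions_vendor("spark.sql.example"))         # hypothetical key
```

The match behaves exactly like a match on the plain word, so the escaped form changes only how the source file reads, not what the code does.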

Re: [VOTE] Technical Justification for the veto of the "Retain migration logic..." code change proposal is not valid

2025-03-15 Thread Mark Hamstra
Quick administrative note: I don't see any reason why this vote should take a long time, so I expect to close the process and tally the votes in not much more than 48 hours. On Sat, Mar 15, 2025 at 4:35 PM Mark Hamstra wrote: > > There has been enough discussion on this topic already,

Re: [VOTE][RESULT] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-15 Thread Mark Hamstra
>> For Retaining Migration Logic (+1): Jungtaek Lim, Sean Owen, Yang Jie >> (initially -1, then +1), Adam Binford (non-binding), Russell Jurney >> (non-binding), Mridul Muralidharan >> Against Retaining Migration Logic (-1): Dongjoon Hyun >> >> Maybe we should pu

[VOTE] Technical Justification for the veto of the "Retain migration logic..." code change proposal is not valid

2025-03-15 Thread Mark Hamstra
There has been enough discussion on this topic already, so I think that an immediate vote on the validity of Dongjoon's technical justification for his veto of the "Retain migration logic ... in Spark 4.0.x" proposal is in order. That technical justification has been called into question, and the g

Re: [VOTE][RESULT] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-15 Thread Mark Hamstra
lt stands. > Anyone who is really upset about it, please escalate to the board or > something, but, this thread and decision point has now concluded. > > > On Sat, Mar 15, 2025 at 1:16 PM Mark Hamstra wrote: >> >> You do not have the authority to declare Dongjoon's te

Re: [VOTE][RESULT] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-15 Thread Mark Hamstra
judged his -1 as veto for their reasoning of how >> this could be “technical” objection and I don’t think I heard anything. >> >> I can be corrected if you can point out what is the “technical” objection. >> If you or Dongjoon do not provide this to the end of the week, I

Re: [VOTE][RESULT] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-15 Thread Mark Hamstra
of the week, I have to > consider I haven’t heard about that and the veto (although Dongjoon stated it > is not a veto) will be ignored. > > 2025년 3월 15일 (토) 오후 8:19, Mark Hamstra 님이 작성: >> >> Once again, I have to object. Dongjoon said that the vote is a time >> li

Re: [VOTE][RESULT] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-15 Thread Mark Hamstra
Once again, I have to object. Dongjoon said that the vote is a time limited procedure, not that the vote itself is a procedural vote as distinct from a code change vote or a package release vote. Frankly, this feels like you are trying to manipulate the vote procedure by misrepresenting Dongjoon,

Re: [VOTE][RESULT] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-13 Thread Mark Hamstra
ther approach? Is it just that the current code is >> automatically achieving your goal? I believe this makes no sense. > > > I believe the last one is the most important one to hear, but I argue we > should say we don't hear about the justification if he doesn't answe

Re: [VOTE][RESULT] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-13 Thread Mark Hamstra
t want to let this be dragged. > > On Fri, Mar 14, 2025 at 1:37 PM Mark Hamstra wrote: >> >> The relevant time window is since Dongjoon's veto was challenged, not >> any other that you choose to assert. It has been less than a day since >> that challenge. >>

Re: [VOTE][RESULT] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-13 Thread Mark Hamstra
n PMC members need > to respect users. That's what the project is for. Likewise veto, PMC members > can't override it. > > > On Fri, Mar 14, 2025 at 12:26 PM Mark Hamstra wrote: >> >> Characterizing Dongjoon's position as just "agree to di

Re: [VOTE][RESULT] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-13 Thread Mark Hamstra
ersation if there is no movement on the substance of the discussion. > There is just clear support for the position in this vote. > > > > On Thu, Mar 13, 2025 at 9:42 PM Mark Hamstra wrote: >> >> This vote has not passed. >> >> The proposed code change has been

Re: [VOTE][RESULT] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-13 Thread Mark Hamstra
e concern or >> angle comes out, but, I'd say it's better not to keep entertaining this >> conversation if there is no movement on the substance of the discussion. >> There is just clear support for the position in this vote. >> >> >> >> On Thu,

Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-13 Thread Mark Hamstra
Absolutely not! This is clearly a vote on a code change, not on a procedural issue or a package release. The code change has been vetoed by a -1 vote by a qualified voter. On Thu, Mar 13, 2025 at 6:58 PM Jungtaek Lim wrote: > > Likewise I said, I'm concluding the VOTE since we ensure the criteri

Re: [VOTE][RESULT] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-13 Thread Mark Hamstra
This vote has not passed. The proposed code change has been vetoed by a qualified voter. The validity of that veto has been called into question since "the voter must provide with the veto a technical justification showing why the change is bad (opens a security exposure, negatively affects perfor

Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-13 Thread Mark Hamstra
Valid -1 votes are not restricted to technical objections. On Thu, Mar 13, 2025 at 7:28 AM Sean Owen wrote: > > I'm not sure if a VOTE is appropriate here, but I also do not see any valid > technical objection here. I don't think this can be considered a valid 'veto' > even if we were thinking

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-18 Thread Mark Hamstra
This doesn't really have anything to do with a broader approach to breaking changes. Removing the mistake in 4.0.0 does not change our striving to avoid breaking APIs or silently changing behavior -- striving is not a guarantee. And the addition of check-in tooling should prevent the issue from rec

Re: [VOTE] Release Apache Spark 3.5.5 deprecating `spark.databricks.*` configuration

2025-02-18 Thread Mark Hamstra
+1 On Tue, Feb 18, 2025 at 9:46 PM dongjoon.hyun wrote: > > Please vote to deprecate `spark.databricks.*` configuration at Apache Spark > 3.5.5. > This is a part of the following on-going discussion. > > - DISCUSSION: https://lists.apache.org/thread/qwxb21g5xjl7xfp4rozqmg1g0ndfw2jd > (Deprecat

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

2025-02-18 Thread Mark Hamstra
The issue is not how many lines of code it is, but rather how serious of an issue it is to have the databricks namespace in Apache code. It's not a large functional issue, but that doesn't mean that it is only a minor issue, nor do I think that I would characterize the removal of this clear error a

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-04 Thread Mark Hamstra
s > still no particular answer to why that would be different, as I take it you > are not volunteering to work on it and don't have a use case here either. > > For those reasons, I don't believe this is motivated enough to sustain a veto. > > On Fri, Oct 4, 2024 at 7:16 PM M

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-04 Thread Mark Hamstra
who might do some work and hasn't to date for some > reason. Add to that the fact that the 'pro' arguments all seem to be > arguments for working on GraphFrames, and I find this somewhat drastic. > > On Fri, Oct 4, 2024 at 5:23 PM Mark Hamstra wrote: >> >> "

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-04 Thread Mark Hamstra
GraphFrames not be the logical home of this going forward > anyway? which I think is the subtext. > > On Fri, Oct 4, 2024 at 4:56 PM Mark Hamstra wrote: >> >> I'm -1(*) because, while it technically means "might be removed in the >> future", I think develop

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-04 Thread Mark Hamstra
> deployment today? > > > Twitter: https://twitter.com/holdenkarau > Fight Health Insurance: https://www.fighthealthinsurance.com/ > Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > Prono

Re: [VOTE] Officialy Deprecate GraphX in Spark 4

2024-10-04 Thread Mark Hamstra
-1(*) reasoning posted in the DISCUSS thread On Mon, Sep 30, 2024 at 12:40 PM Holden Karau wrote: > > I think it has been de-facto deprecated, we haven’t updated it meaningfully > in several years. I think removing the API would be excessive but deprecating > it would give us the flexibility to

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-04 Thread Mark Hamstra
I'm -1(*) because, while it technically means "might be removed in the future", I think developers and users are more prone to interpret something being marked as deprecated as "very likely will be removed in the future, so don't depend on this or waste your time contributing to its further develop

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-26 Thread Mark Hamstra
fault of anyone currently contributing. I've wandered out of the context of this SPIP, I know. I'll at least +0 this SPIP, but I also couldn't let my concerns go unvoiced. On Mon, Mar 25, 2019 at 8:32 PM Xiangrui Meng wrote: > > > On Mon, Mar 25, 2019 at 8:07 PM Mark Hamst

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
"spark.task.cpus" is the > answer here. The point I want to make is that "spark.task.cpus", though > less ideal, is still needed when we have task-level requests for CPUs. > > On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra > wrote: > >> I remain unconvinced

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
IP. It fairly > separated necessary GPU support from risky scheduler changes. > > On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra > wrote: > >> Of course there is an issue of the perfect becoming the enemy of the >> good, so I can understand the impulse to get something done

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
any of the conventions used now to schedule GPUs can easily be broken by > one bad user. I think from the user point of view this gives many users > an improvement and we can extend it later to cover more use cases. > > Tom > On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Ham

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-21 Thread Mark Hamstra
I understand the application-level, static, global nature of spark.task.accelerator.gpu.count and its similarity to the existing spark.task.cpus, but to me this feels like extending a weakness of Spark's scheduler, not building on its strengths. That is because I consider binding the number of core

Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-10 Thread Mark Hamstra
Avro than > Spark uses, which triggers it > - it doesn't work in 2.4.0 > > It's not a regression from 2.4.0, which is the immediate question. > There isn't even a Parquet fix available. > But I'm not even seeing why this is excuse-making? > > On Sun, Mar

Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-10 Thread Mark Hamstra
Now wait... we created a regression in 2.4.0. Arguably, we should have blocked that release until we had a fix; but the issue came up late in the release process and it looks to me like there wasn't an adequate fix immediately available, so we did something bad and released 2.4.0 with a known regre

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Mark Hamstra
I'll try to find some time, but it's really at a premium right now. On Mon, Mar 4, 2019 at 3:17 PM Xiangrui Meng wrote: > > > On Mon, Mar 4, 2019 at 3:10 PM Mark Hamstra > wrote: > >> :) Sorry, that was ambiguous. I was seconding Imran's comment. >>

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Mark Hamstra
:) Sorry, that was ambiguous. I was seconding Imran's comment. On Mon, Mar 4, 2019 at 3:09 PM Xiangrui Meng wrote: > > > On Mon, Mar 4, 2019 at 1:56 PM Mark Hamstra > wrote: > >> +1 >> > > Mark, just to be clear, are you +1 on the SPIP or Imran's po

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Mark Hamstra
+1 On Mon, Mar 4, 2019 at 12:52 PM Imran Rashid wrote: > On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng wrote: > >> On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung >> wrote: >> >>> IMO upfront allocation is less useful. Specifically too expensive for >>> large jobs. >>> >> >> This is also an API/de

Re: [RESULT] [VOTE] Functional DataSourceV2 in Spark 3.0

2019-03-03 Thread Mark Hamstra
an Blue wrote: > > This vote fails with the following counts: > > 3 +1 votes: > >- Matt Cheah >- Ryan Blue >- Sean Owen (binding) > > 1 -0 vote: > >- Jose Torres > > 2 -1 votes: > >- Mark Hamstra (binding) >- Midrul Muralidh

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mark Hamstra
ike your objection is to this commitment for 3.0, but remember > that 3.0 is the next release so that we can remove deprecated APIs. It does > not mean that we aren't adding new features in that release and aren't > considering other goals. > > On Thu, Feb 28, 2019 at 10:12 AM

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mark Hamstra
Then I'm -1. Setting new features as blockers of major releases is not proper project management, IMO. On Thu, Feb 28, 2019 at 10:06 AM Ryan Blue wrote: > Mark, if this goal is adopted, "we" is the Apache Spark community. > > On Thu, Feb 28, 2019 at 9:52 AM Mark Hamst

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mark Hamstra
Who is "we" in these statements, such as "we should consider a functional DSv2 implementation a blocker for Spark 3.0"? If it means those contributing to the DSv2 effort want to set their own goals, milestones, etc., then that is fine with me. If you mean that the Apache Spark project should offici

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-24 Thread Mark Hamstra
munity, to > suggest a direction for the community to take, and I fully accept that the > decision is up to the community. I think it is reasonable to candidly state > how this matters; that context informs the discussion. > > On Fri, Feb 22, 2019 at 1:55 PM Mark Hamstra > wrote: >

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-22 Thread Mark Hamstra
g to get. The > perfect is the enemy of the good. > > Aside from throwing out a date, I probably just restated what everyone > said. But I was 'summoned' :) > > On Fri, Feb 22, 2019 at 12:40 PM Mark Hamstra > wrote: > >> However, as other people mentioned, Sp

Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-08 Thread Mark Hamstra
There are 2. C'mon Marcelo, you can make it 3! On Fri, Feb 8, 2019 at 5:03 PM Marcelo Vanzin wrote: > Hi Takeshi, > > Since we only really have one +1 binding vote, do you want to extend > this vote a bit? > > I've been stuck on a few things but plan to test this (setting things > up now), but i

Re: Trigger full GC during executor idle time?

2019-01-02 Thread Mark Hamstra
Without addressing whether the change is beneficial or not, I will note that the logic in the paper and the PR's description is incorrect: "During execution, some executor nodes finish the tasks assigned to them early and wait for the entire stage to complete before more tasks are assigned to them,

Re: A survey about IP clearance of Spark in UC Berkeley for donating to Apache

2018-11-28 Thread Mark Hamstra
Your history isn't really accurate. Years before Spark became an Apache project, the AMPlab and UC Berkeley placed the Spark code under a 3-clause BSD License and made the code publicly available. Later, a group of developers and Spark users from both inside and outside Berkeley brought Spark and t

Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-07 Thread Mark Hamstra
- Deprecate 2.11 right now via announcement and/or Spark 2.4.1 soon. > Drop 2.11 support in Spark 3.0, and support only 2.12. > - (same as above, but add Spark 2.13 support if possible for Spark 3.0) > > > On Wed, Nov 7, 2018 at 12:32 PM Mark Hamstra > wrote: > > > > I'

Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-07 Thread Mark Hamstra
I'm not following "exclude Scala 2.13". Is there something inherent in making 2.12 the default Scala version in Spark 3.0 that would prevent us from supporting the option of building with 2.13? On Tue, Nov 6, 2018 at 5:48 PM Sean Owen wrote: > That's possible here, sure. The issue is: would you

Re: What's a blocker?

2018-10-24 Thread Mark Hamstra
Yeah, I can pretty much agree with that. Before we get into release candidates, it's not as big a deal if something gets labeled as a blocker. Once we are into an RC, I'd like to see any discussions as to whether something is or isn't a blocker at least cross-referenced in the RC VOTE thread so tha

Re: About introduce function sum0 to Spark

2018-10-23 Thread Mark Hamstra
2:23 AM Wenchen Fan wrote: > This is logically `sum( if(isnull(col), 0, col) )` right? > > On Tue, Oct 23, 2018 at 2:58 PM 陶 加涛 wrote: > >> The name is from Apache Calcite, And it doesn’t matter, we can introduce >> our own. >> >> >> >> >> &

Re: About introduce function sum0 to Spark

2018-10-22 Thread Mark Hamstra
That's a horrible name. This is just a fold. On Mon, Oct 22, 2018 at 7:39 PM 陶 加涛 wrote: > Hi, in calcite, has the concept of sum0, here I quote the definition of > sum0: > > > > Sum0 is an aggregator which returns the sum of the values which > > go into it like Sum. It differs in that when no n
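
The "just a fold" remark can be illustrated with a short Python sketch. It assumes the usual reading of the truncated definition, namely that sum0 returns 0 rather than NULL when no non-null values are seen, and it models SQL NULL as Python `None`. The function names `sql_sum` and `sum0` are illustrative only and are not Spark or Calcite APIs.

```python
from functools import reduce
from typing import Iterable, Optional


def sql_sum(values: Iterable[Optional[int]]) -> Optional[int]:
    """SQL-style SUM: skips NULLs and yields NULL when nothing is left."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    return reduce(lambda acc, v: acc + v, non_null)


def sum0(values: Iterable[Optional[int]]) -> int:
    """A fold with initial value 0 that treats each NULL as 0.

    Because the fold starts from 0, empty or all-NULL input yields 0.
    """
    return reduce(lambda acc, v: acc + (0 if v is None else v), values, 0)


if __name__ == "__main__":
    print(sql_sum([1, None, 2]), sum0([1, None, 2]))
    print(sql_sum([]), sum0([]))
```

The two functions agree whenever at least one non-null value is present; they differ only on empty or all-NULL input, which is the case the thread is discussing.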

Re: Adding Extension to Load Custom functions into Thriftserver/SqlShell

2018-09-27 Thread Mark Hamstra
'll > probably have to do some forks (at least for the CliDriver), the > thriftserver has a bunch of code which doesn't run under "startWithContext" > so we may have an issue there as well. > > > On Wed, Sep 26, 2018, 6:21 PM Mark Hamstra > wrote: > >

Re: Adding Extension to Load Custom functions into Thriftserver/SqlShell

2018-09-26 Thread Mark Hamstra
You're talking about users starting Thriftserver or SqlShell from the command line, right? It's much easier if you are starting a Thriftserver programmatically so that you can register functions when initializing a SparkContext and then HiveThriftServer2.startWithContext using that context. On We

Re: ***UNCHECKED*** Re: Re: Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Mark Hamstra
That's overstated. We will also block for a data correctness issue -- and that is, arguably, what this is. On Wed, Sep 19, 2018 at 12:21 AM Reynold Xin wrote: > We also only block if it is a new regression. > > On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao > wrote: > >> Hi Marco, >> >> From my u

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
ng before fully removing it: for > example, if Pandas and TensorFlow no longer support Python 2 past some > point, that might be a good point to remove it. > > Matei > > > On Sep 17, 2018, at 11:01 AM, Mark Hamstra > wrote: > > > > If we're going to do tha

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
some spark versions supporting > Py2 past the point where Py2 is no longer receiving security patches > > > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra > wrote: > >> We could also deprecate Py2 already in the 2.4.0 release. >> >> On Sat, Sep 15, 2018 at 11:46 A

Re: Python friendly API for Spark 3.0

2018-09-16 Thread Mark Hamstra
make >> it more obvious to Pandas users, that will help the most. The other issue >> though is that a bunch of Pandas functions are just missing in Spark — it >> would be awesome to set up an umbrella JIRA to just track those and let >> people fill them in. >> >> M

Re: Python friendly API for Spark 3.0

2018-09-16 Thread Mark Hamstra
It's not splitting hairs, Erik. It's actually very close to something that I think deserves some discussion (perhaps on a separate thread.) What I've been thinking about also concerns API "friendliness" or style. The original RDD API was very intentionally modeled on the Scala parallel collections

Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Mark Hamstra
We could also deprecate Py2 already in the 2.4.0 release. On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson wrote: > In case this didn't make it onto this thread: > > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove > it entirely on a later 3.x release. > > On Sat, Sep 15

Re: time for Apache Spark 3.0?

2018-09-06 Thread Mark Hamstra
Yes, that is why we have these annotations in the code and the corresponding labels appearing in the API documentation: https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java As long as it is properly annotated, we can change or ev

Re: Naming policy for packages

2018-08-15 Thread Mark Hamstra
While it is permissible to have a Maven identifier like "spark-foo" from "org.bar", I'll agree with Sean that avoiding that kind of name is often wiser. It is just too easy to slip into prohibited usage if the most popular, de facto identification turns out to become "spark-foo" instead of something

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-08 Thread Mark Hamstra
I'm inclined to agree. Just saying that it is not a regression doesn't really cut it when it is a now known data correctness issue. We need something a lot more than nothing before releasing 2.4.0. At a barest minimum, that has to be much more complete and publicly highlighted documentation of the

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Mark Hamstra
No reasonable amount of time is likely going to be sufficient to fully vet the code as a PR. I'm not entirely happy with the design and code as they currently are (and I'm still trying to find the time to more publicly express my thoughts and concerns), but I'm fine with them going into 2.4 much as

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Mark Hamstra
See some of the related discussion under https://github.com/apache/spark/pull/21589 It feels to me like we need some kind of user code mechanism to signal policy preferences to Spark. This could also include ways to signal scheduling policy, which could include things like scheduling pool and/or b

Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-19 Thread Mark Hamstra
cked mirrors then we might have bigger problems, but > there the issue is verifying the download sigs in the first place. Those > would have to come from archive.apache.org. > > If you're up for it, yes that could be a fine security precaution. > > On Thu, Jul 19, 2018, 2:11 PM

Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-19 Thread Mark Hamstra
Is there or should there be some checking of digests just to make sure that we are really testing against the same thing in /tmp/test-spark that we are distributing from the archive? On Thu, Jul 19, 2018 at 11:15 AM Sean Owen wrote: > Ideally, that list is updated with each release, yes. Non-cur
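
As a rough illustration of the digest check asked about here, the following Python sketch hashes a downloaded file and compares the result with a published hex digest. It is not the logic used by HiveExternalCatalogVersionsSuite. The choice of SHA-512, the function names, and the idea that the expected digest arrives as a hex string are all assumptions made for the example.

```python
import hashlib


def sha512_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, published_hex: str) -> bool:
    """Compare a file's digest with a published one.

    Whitespace and letter case in the published value are normalized first,
    since checksum files are often wrapped or upper-cased (an assumption).
    """
    normalized = "".join(published_hex.split()).lower()
    return sha512_of(path) == normalized
```

A test harness could call `matches_published_digest` on each cached archive before using it and fall back to a fresh download when the check fails.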

Re: time for Apache Spark 3.0?

2018-06-15 Thread Mark Hamstra
Changing major version numbers is not about new features or a vague notion that it is time to do something that will be seen to be a significant release. It is about breaking stable public APIs. I still remain unconvinced that the next version can't be 2.4.0. On Fri, Jun 15, 2018 at 1:34 AM Andy

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread Mark Hamstra
+1 On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.3.1. > > Given that I expect at least a few people to be busy with Spark Summit next > week, I'm taking the liberty of setting an extended voting period. The vot

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Mark Hamstra
There is no hadoop-2.8 profile. Use hadoop-2.7, which is effectively hadoop-2.7+ On Fri, Jun 1, 2018 at 4:01 PM Nicholas Chammas wrote: > I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4 > using Flintrock. However, trying > to load the

Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-08 Thread Mark Hamstra
If I am understanding you correctly, you're just saying that the problem is that you know what you want to keep, not what you want to throw away, and that there is no unpersist DataFrames call based on that what-to-keep information. On Tue, May 8, 2018 at 6:00 AM, Nicholas Chammas wrote: > I cer

Re: Fair scheduler pool leak

2018-04-07 Thread Mark Hamstra
ode, which is equivalent to FIFO. Providing a way to set the mode of > the default scheduler would be awesome. > > Regarding why fair scheduling showed generally better performance for > out-of-core datasets, I don't have a good answer. My guess was > isolated job scheduling and b

Re: Fair scheduler pool leak

2018-04-07 Thread Mark Hamstra
Sorry, but I'm still not understanding this use case. Are you somehow creating additional scheduling pools dynamically as Jobs execute? If so, that is a very unusual thing to do. Scheduling pools are intended to be statically configured -- initialized, living and dying with the Application. On Sat

Re: time for Apache Spark 3.0?

2018-04-05 Thread Mark Hamstra
As with Sean, I'm not sure that this will require a new major version, but we should also be looking at Java 9 & 10 support -- particularly with regard to their better functionality in a containerized environment (memory limits from cgroups, not sysconf; support for cpusets). In that regard, we sho

Re: Spark on Kubernetes Builder Pattern Design Document

2018-02-05 Thread Mark Hamstra
issue that the work done on the fork was > isolated from the dev mailing list. Moving forward as we push our work into > mainline Spark, we aim to be transparent with the Spark community via the > Spark mailing list and Spark JIRA tickets. We’re specifically aiming to > deprecate the f

Re: Spark on Kubernetes Builder Pattern Design Document

2018-02-05 Thread Mark Hamstra
That's good, but you should probably stop and consider whether the discussions that led up to this document's creation could have taken place on this dev list -- because if they could have, then they probably should have as part of the whole spark-on-k8s project becoming part of mainline spark deve

Re: Union in Spark context

2018-02-05 Thread Mark Hamstra
First, the public API cannot be changed except when there is a major version change, and there is no way that we are going to do Spark 3.0.0 just for this change. Second, the change would be a mistake since the two different union methods are quite different. The method in RDD only ever works on t

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Mark Hamstra
sion_r154502824 >> >> to the best of my understanding, neither of those poses a problem. If we >> based the image off of centos I'd also expect the licensing of any image >> deps to be compatible. >> >> On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra >>

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-14 Thread Mark Hamstra
to always be built that way. >> The >> >> driver and executor images, there may be cases where people want to >> >> customize it - (like putting all dependencies into it for example). >> >> In those cases, as long as our images are bare bones, t

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-11-29 Thread Mark Hamstra
It's probably also worth considering whether there is only one, well-defined, correct way to create such an image or whether this is a reasonable avenue for customization. Part of why we don't do something like maintain and publish canonical Debian packages for Spark is because different organizati

Re: Object in compiler mirror not found - maven build

2017-11-26 Thread Mark Hamstra
Or you just have zinc running but in a bad state. `zinc -shutdown` should kill it off and let you try again. On Sun, Nov 26, 2017 at 2:12 PM, Sean Owen wrote: > I'm not seeing that on OS X or Linux. It sounds a bit like you have an old > version of zinc or scala or something installed. > > On Su

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-14 Thread Mark Hamstra
; question (e.g. Why are we downloading Spark in a test case ?). >> >> Thanks >> Shivaram >> >> On Wed, Sep 13, 2017 at 11:50 AM, Mark Hamstra >> wrote: >> > Yeah, but that discussion and use case is a bit different -- providing a >> > different

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-13 Thread Mark Hamstra
Yeah, but that discussion and use case is a bit different -- providing a different route to download the final released and approved artifacts that were built using only acceptable artifacts and sources vs. building and checking prior to release using something that is not from an Apache mirror. Th

Re: Supporting Apache Aurora as a cluster manager

2017-09-10 Thread Mark Hamstra
While it may be worth creating the design doc and JIRA ticket so that we at least have a better idea and a record of what you are talking about, I kind of doubt that we are going to want to merge this into the Spark codebase. That's not because of anything specific to this Aurora effort, but rather

Re: SPIP: Spark on Kubernetes

2017-08-28 Thread Mark Hamstra
> > In my opinion, the fact that there are nearly no changes to spark-core, > and most of our changes are additive should go to prove that this adds > little complexity to the workflow of the committers. Actually (and somewhat perversely), the otherwise praiseworthy isolation of the Kubernetes co

Re: Increase Timeout or optimize Spark UT?

2017-08-22 Thread Mark Hamstra
This is another argument for getting the code to the point where this can default to "true": SQLConf.scala: val ADAPTIVE_EXECUTION_ENABLED = buildConf("spark.sql.adaptive.enabled") On Tue, Aug 22, 2017 at 12:27 PM, Reynold Xin wrote: > +1 > > > On Tue, Aug 22, 2017 at 12:25 PM, Maciej Szymk

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Mark Hamstra
Points 2, 3 and 4 of the Project Plan in that document (i.e. "port existing data sources using internal APIs to use the proposed public Data Source V2 API") have my full support. Really, I'd like to see that dog-fooding effort completed and lessons learned from it fully digested before we remove any

Re: a stage can belong to more than one job please?

2017-06-06 Thread Mark Hamstra
Yes, a Stage can be part of more than one Job. The jobIds field of Stage is used repeatedly in the DAGScheduler. On Tue, Jun 6, 2017 at 5:04 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > I read same code of spark about stage. > > The constructor of stage keep the first job ID the stage was
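
The relationship described above can be sketched with a toy model: a stage remembers the first job that created it, plus the set of all jobs that depend on it. This is a simplified stand-in, not Spark's actual `Stage` class; the `register_job` helper is hypothetical, and seeding `job_ids` with the first job is an assumption made to keep the example short.

```python
class Stage:
    """Toy stand-in for a scheduler stage (not Spark's actual class)."""

    def __init__(self, stage_id, first_job_id):
        self.stage_id = stage_id
        # The job whose submission first caused this stage to be created.
        self.first_job_id = first_job_id
        # Every job that depends on this stage; it can grow over time.
        self.job_ids = {first_job_id}


def register_job(stage, job_id):
    """Record that another job also depends on an existing stage."""
    stage.job_ids.add(job_id)


# Job 0 creates the stage; job 1 later reuses the same stage.
shared_stage = Stage(stage_id=0, first_job_id=0)
register_job(shared_stage, 1)

print(shared_stage.first_job_id)     # prints 0
print(sorted(shared_stage.job_ids))  # prints [0, 1]
```

Keeping the first job ID separate from the full set is what lets a single stage be shared: the first ID never changes, while the set reflects every job currently using the stage.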

Re: Why did spark switch from AKKA to net / ...

2017-05-07 Thread Mark Hamstra
The point is that Spark's prior usage of Akka was limited enough that it could fairly easily be removed entirely instead of forcing particular architectural decisions on Spark's users. On Sun, May 7, 2017 at 1:14 PM, geoHeil wrote: > Thank you! > In the issue they outline that hard wired depende

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-01 Thread Mark Hamstra
LocalityPlacementStrategySuite hangs -- definitely been seeing that one for quite a while, not just with 2.1.1-rc, also with Ubuntu 16.10, and not with macOS Sierra. On Sat, Apr 1, 2017 at 12:34 PM, Sean Owen wrote: > (Tiny nits: first line says '2.1.0', just a note for next copy/paste of > the e

Re: Should we consider a Spark 2.1.1 release?

2017-03-19 Thread Mark Hamstra
That doesn't necessarily follow, Jacek. There is a point where too frequent releases decrease quality. That is because releases don't come for free -- each one demands a considerable amount of time from release managers, testers, etc. -- time that would otherwise typically be devoted to improving (

Re: Spark Improvement Proposals

2017-03-09 Thread Mark Hamstra
-0 on voting on whether we need a vote. On Thu, Mar 9, 2017 at 9:00 AM, Reynold Xin wrote: > I'm fine without a vote. (are we voting on whether we need a vote?) > > > On Thu, Mar 9, 2017 at 8:55 AM, Sean Owen wrote: > >> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt. >>

Re: Sharing data in columnar storage between two applications

2016-12-26 Thread Mark Hamstra
16, at 5:24 PM, Mark Hamstra wrote: > > Not so much about between applications, rather multiple frameworks within > an application, but still related: https://cs.stanford.edu/~matei/papers/2017/cidr_weld.pdf > > On Sun, Dec 25, 2016 at 8:12 PM, Kazuaki Ishizaki > wrote: > &
