Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-26 Thread Mark Hamstra
Yes, I do expect that the application-level approach outlined in this SPIP
will be sufficiently useful to be worth doing despite any concerns about it
not being ideal. My concern is not just about this design, however. It
feels to me like we are running into limitations of the current Spark
scheduler and that what is really needed is a deeper redesign in order to
be able to cleanly handle new or anticipated requirements like barrier mode
scheduling, GPUs, FPGAs, other domain specific resources, FaaS/serverless,
etc. Instead, what we are getting is layers of clever hacks to sort of make
the current scheduler do new things. The current scheduler was already too
complicated and murky for our own good, and these new grafts tend to make
that worse.

Unfortunately, I can't currently commit to trying to drive such a New
Scheduler effort, and I don't know anyone who can. We also can't
conceivably do something along these lines in Spark 3.0.0 -- there's just
not enough time even if other resources were available; so I don't have a
clear idea about the way forward. I am concerned, though, that scheduler
development isn't currently in very good shape and doesn't have a
better-looking future.  That is not at all intended as a slight on those
who are making contributions now after most of us who used to be more
active haven't been able to continue to be: current contributions are much
appreciated, they're just not enough -- which is not the fault of anyone
currently contributing. I've wandered out of the context of this SPIP, I
know. I'll at least +0 this SPIP, but I also couldn't let my concerns go
unvoiced.

On Mon, Mar 25, 2019 at 8:32 PM Xiangrui Meng  wrote:

>
>
> On Mon, Mar 25, 2019 at 8:07 PM Mark Hamstra 
> wrote:
>
>> Maybe.
>>
>> And I expect that we will end up doing something based on spark.task.cpus
>> in the short term. I'd just rather that this SPIP not make it look like
>> this is the way things should ideally be done. I'd prefer that we be quite
>> explicit in recognizing that this approach is a significant compromise, and
>> I'd like to see at least some references to the beginning of serious
>> longer-term efforts to do something better in a deeper re-design of
>> resource scheduling.
>>
>
> It is also a feature I desire as a user. How about suggesting it as
> future work in the SPIP? It certainly requires someone who fully
> understands the Spark scheduler to drive it. Shall we start with a Spark
> JIRA? I don't know the scheduler like you do, but I can speak for DL use
> cases. Maybe we just view it from different angles. To you, the
> application-level request is a significant compromise. To me, it provides a
> major milestone that brings GPUs to Spark workloads. I know many users who
> tried to do DL on Spark ended up doing hacks here and there, a huge pain.
> The scope covered by the current SPIP makes those users much happier. Tom
> and Andy from NVIDIA are certainly more calibrated on the usefulness of the
> current proposal.
>
>
>>
>> On Mon, Mar 25, 2019 at 7:39 PM Xiangrui Meng 
>> wrote:
>>
>>> There are certainly use cases where different stages require different
>>> numbers of CPUs or GPUs under an optimal setting. I don't think anyone
>>> disagrees that ideally users should be able to do it. We are just dealing
>>> with typical engineering trade-offs and seeing how we break it down into
>>> smaller ones. I think it is fair to treat the task-level resource request
>>> as a separate feature here because it also applies to CPUs alone without
>>> GPUs, as Tom mentioned above. But with only "spark.task.cpus" for many
>>> years, Spark has still been able to cover many, many use cases. Otherwise
>>> we shouldn't see many Spark users around now. Here we just apply similar
>>> arguments to GPUs.
>>>
>>> Initially, I was the person who really wanted task-level requests
>>> because it is ideal. In an offline discussion, Andy Feng pointed out that
>>> an application-level setting should fit common deep learning training and
>>> inference cases and greatly simplifies the changes required to the Spark
>>> job scheduler. With Imran's feedback on the initial design sketch, the
>>> application-level approach became my first choice because it is still
>>> very valuable but much less risky. If a feature brings great value to
>>> users, we should add it even if it is not ideal.
>>>
>>> Back to the default value discussion, let's forget GPUs and only
>>> consider CPUs. Would an application-level default number of CPU cores
>>> disappear if we added task-level requests? If yes, does it mean that users
>>> have to explicitly state the resource requirements for eve

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
Maybe.

And I expect that we will end up doing something based on spark.task.cpus
in the short term. I'd just rather that this SPIP not make it look like
this is the way things should ideally be done. I'd prefer that we be quite
explicit in recognizing that this approach is a significant compromise, and
I'd like to see at least some references to the beginning of serious
longer-term efforts to do something better in a deeper re-design of
resource scheduling.
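
For concreteness, here is a minimal sketch of what the application-level
approach looks like in practice. "spark.task.cpus" is an existing config,
while "spark.task.accelerator.gpu.count" is only the name proposed in this
SPIP and may change, so the snippet is illustrative rather than a supported
API, and the values are arbitrary:

    import org.apache.spark.SparkConf

    // One global, application-level declaration of per-task resources,
    // fixed up front for every job and stage the application runs.
    val conf = new SparkConf()
      .setAppName("accelerator-aware-app")
      .set("spark.executor.cores", "5")
      .set("spark.task.cpus", "1")                   // existing config
      .set("spark.task.accelerator.gpu.count", "1")  // name proposed in the SPIP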

On Mon, Mar 25, 2019 at 7:39 PM Xiangrui Meng  wrote:

> There are certainly use cases where different stages require different
> numbers of CPUs or GPUs under an optimal setting. I don't think anyone
> disagrees that ideally users should be able to do it. We are just dealing
> with typical engineering trade-offs and seeing how we break it down into
> smaller ones. I think it is fair to treat the task-level resource request
> as a separate feature here because it also applies to CPUs alone without
> GPUs, as Tom mentioned above. But with only "spark.task.cpus" for many
> years, Spark has still been able to cover many, many use cases. Otherwise
> we shouldn't see many Spark users around now. Here we just apply similar
> arguments to GPUs.
>
> Initially, I was the person who really wanted task-level requests because
> it is ideal. In an offline discussion, Andy Feng pointed out that an
> application-level setting should fit common deep learning training and
> inference cases and greatly simplifies the changes required to the
> Spark job scheduler. With Imran's feedback on the initial design sketch,
> the application-level approach became my first choice because it is still
> very valuable but much less risky. If a feature brings great value to
> users, we should add it even if it is not ideal.
>
> Back to the default value discussion, let's forget GPUs and only consider
> CPUs. Would an application-level default number of CPU cores disappear if
> we added task-level requests? If yes, does it mean that users have to
> explicitly state the resource requirements for every single stage? It is
> tedious to do, and users who do not fully understand the impact would
> probably do it wrong and waste even more resources. Then how many cores
> should each task use if the user didn't specify it? I do see
> "spark.task.cpus" as the answer here. The point I want to make is that
> "spark.task.cpus", though less ideal, is still needed when we have
> task-level requests for CPUs.
>
> On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra 
> wrote:
>
>> I remain unconvinced that a default configuration at the application
>> level makes sense even in that case. There may be some applications where
>> you know a priori that almost all the tasks for all the stages for all the
>> jobs will need some fixed number of gpus; but I think the more common cases
>> will be dynamic configuration at the job or stage level. Stage level could
>> have a lot of overlap with barrier mode scheduling -- barrier mode stages
>> having a need for an inter-task channel resource, gpu-ified stages needing
>> gpu resources, etc. Have I mentioned that I'm not a fan of the current
>> barrier mode API, Xiangrui? :) Yes, I know: "Show me something better."
>>
>> On Mon, Mar 25, 2019 at 3:55 PM Xiangrui Meng  wrote:
>>
>>> Say we support per-task resource requests in the future; it would still
>>> be inconvenient for users to declare the resource requirements for every
>>> single task/stage. So there must be some default values defined somewhere
>>> for task resource requirements. "spark.task.cpus" and
>>> "spark.task.accelerator.gpu.count" could serve this purpose without
>>> introducing breaking changes. So I'm +1 on the updated SPIP. It fairly
>>> separates the necessary GPU support from risky scheduler changes.
>>>
>>> On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra 
>>> wrote:
>>>
>>>> Of course there is an issue of the perfect becoming the enemy of the
>>>> good, so I can understand the impulse to get something done. I am left
>>>> wanting, however, at least something more of a roadmap to a task-level
>>>> future than just a vague "we may choose to do something more in the
>>>> future." At the risk of repeating myself, I don't think the
>>>> existing spark.task.cpus is very good, and I think that building more on
>>>> that weak foundation without a more clear path or stated intention to move
>>>> to something better runs the risk of leaving Spark stuck in a bad
>>>> neighborhood.
>>>>
>>>> On Thu, Mar 21, 2019 at 10:10 AM Tom Graves 
>>>> wrote:
>>&

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
I remain unconvinced that a default configuration at the application level
makes sense even in that case. There may be some applications where you
know a priori that almost all the tasks for all the stages for all the jobs
will need some fixed number of gpus; but I think the more common cases will
be dynamic configuration at the job or stage level. Stage level could have
a lot of overlap with barrier mode scheduling -- barrier mode stages having
a need for an inter-task channel resource, gpu-ified stages needing gpu
resources, etc. Have I mentioned that I'm not a fan of the current barrier
mode API, Xiangrui? :) Yes, I know: "Show me something better."
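
To make the cost of a single global setting concrete, here is a rough
back-of-the-envelope sketch, using the 5-core, 2-GPU executor example that
Tom gives elsewhere in this thread; it is just arithmetic, not Spark code,
and the values are illustrative:

    // Executor shape and global per-task settings (illustrative values).
    val executorCores = 5
    val executorGpus  = 2
    val taskCpus      = 1  // spark.task.cpus
    val taskGpus      = 1  // the proposed per-task GPU count

    // Concurrent task slots are bounded by the scarcest resource.
    val slots     = math.min(executorCores / taskCpus, executorGpus / taskGpus) // = 2
    val idleCores = executorCores - slots * taskCpus                            // = 3 cores sit idle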

On Mon, Mar 25, 2019 at 3:55 PM Xiangrui Meng  wrote:

> Say we support per-task resource requests in the future; it would still be
> inconvenient for users to declare the resource requirements for every
> single task/stage. So there must be some default values defined somewhere
> for task resource requirements. "spark.task.cpus" and
> "spark.task.accelerator.gpu.count" could serve this purpose without
> introducing breaking changes. So I'm +1 on the updated SPIP. It fairly
> separates the necessary GPU support from risky scheduler changes.
>
> On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra 
> wrote:
>
>> Of course there is an issue of the perfect becoming the enemy of the
>> good, so I can understand the impulse to get something done. I am left
>> wanting, however, at least something more of a roadmap to a task-level
>> future than just a vague "we may choose to do something more in the
>> future." At the risk of repeating myself, I don't think the
>> existing spark.task.cpus is very good, and I think that building more on
>> that weak foundation without a more clear path or stated intention to move
>> to something better runs the risk of leaving Spark stuck in a bad
>> neighborhood.
>>
>> On Thu, Mar 21, 2019 at 10:10 AM Tom Graves  wrote:
>>
>>> While I agree with you that it would be ideal to have the task level
>>> resources and do a deeper redesign for the scheduler, I think that can be a
>>> separate enhancement, as was discussed earlier in the thread. That feature
>>> is useful without GPUs.  I do realize that they overlap some, but I think
>>> the changes for this will be minimal to the scheduler, follow existing
>>> conventions, and it is an improvement over what we have now. I know many
>>> users will be happy to have this even without the task level scheduling, as
>>> many of the conventions used now to schedule gpus can easily be broken by
>>> one bad user. I think from the user point of view this gives many users
>>> an improvement and we can extend it later to cover more use cases.
>>>
>>> Tom
>>> On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Hamstra <
>>> m...@clearstorydata.com> wrote:
>>>
>>>
>>> I understand the application-level, static, global nature
>>> of spark.task.accelerator.gpu.count and its similarity to the
>>> existing spark.task.cpus, but to me this feels like extending a weakness of
>>> Spark's scheduler, not building on its strengths. That is because I
>>> consider binding the number of cores for each task to an application
>>> configuration to be far from optimal. This is already far from the desired
>>> behavior when an application is running a wide range of jobs (as in a
>>> generic job-runner style of Spark application), some of which require or
>>> can benefit from multi-core tasks, others of which will just waste the
>>> extra cores allocated to their tasks. Ideally, the number of cores
>>> allocated to tasks would get pushed to an even finer granularity than jobs,
>>> instead becoming a per-stage property.
>>>
>>> Now, of course, making allocation of general-purpose cores and
>>> domain-specific resources work in this finer-grained fashion is a lot more
>>> work than just trying to extend the existing resource allocation mechanisms
>>> to handle domain-specific resources, but it does feel to me like we should
>>> at least be considering doing that deeper redesign.
>>>
>>> On Thu, Mar 21, 2019 at 7:33 AM Tom Graves 
>>> wrote:
>>>
>>> The proposal here is that all your resources are static and the gpu per
>>> task config is global per application, meaning you ask for a certain amount
>>> of memory, cpu, and GPUs for every executor up front just like you do today
>>> and every executor you get is that size.  This means that both static and
>>> dynamic allocation still work with

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
Of course there is an issue of the perfect becoming the enemy of the good,
so I can understand the impulse to get something done. I am left wanting,
however, at least something more of a roadmap to a task-level future than
just a vague "we may choose to do something more in the future." At the
risk of repeating myself, I don't think the existing spark.task.cpus is
very good, and I think that building more on that weak foundation without a
more clear path or stated intention to move to something better runs the
risk of leaving Spark stuck in a bad neighborhood.

On Thu, Mar 21, 2019 at 10:10 AM Tom Graves  wrote:

> While I agree with you that it would be ideal to have the task level
> resources and do a deeper redesign for the scheduler, I think that can be a
> separate enhancement, as was discussed earlier in the thread. That feature
> is useful without GPUs.  I do realize that they overlap some, but I think
> the changes for this will be minimal to the scheduler, follow existing
> conventions, and it is an improvement over what we have now. I know many
> users will be happy to have this even without the task level scheduling, as
> many of the conventions used now to schedule gpus can easily be broken by
> one bad user. I think from the user point of view this gives many users
> an improvement and we can extend it later to cover more use cases.
>
> Tom
> On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Hamstra <
> m...@clearstorydata.com> wrote:
>
>
> I understand the application-level, static, global nature
> of spark.task.accelerator.gpu.count and its similarity to the
> existing spark.task.cpus, but to me this feels like extending a weakness of
> Spark's scheduler, not building on its strengths. That is because I
> consider binding the number of cores for each task to an application
> configuration to be far from optimal. This is already far from the desired
> behavior when an application is running a wide range of jobs (as in a
> generic job-runner style of Spark application), some of which require or
> can benefit from multi-core tasks, others of which will just waste the
> extra cores allocated to their tasks. Ideally, the number of cores
> allocated to tasks would get pushed to an even finer granularity than jobs,
> instead becoming a per-stage property.
>
> Now, of course, making allocation of general-purpose cores and
> domain-specific resources work in this finer-grained fashion is a lot more
> work than just trying to extend the existing resource allocation mechanisms
> to handle domain-specific resources, but it does feel to me like we should
> at least be considering doing that deeper redesign.
>
> On Thu, Mar 21, 2019 at 7:33 AM Tom Graves 
> wrote:
>
> The proposal here is that all your resources are static and the gpu per
> task config is global per application, meaning you ask for a certain amount
> of memory, cpu, and GPUs for every executor up front just like you do today,
> and every executor you get is that size.  This means that both static and
> dynamic allocation still work without explicitly adding more logic at this
> point. Since the config for gpu per task is global, it means every task you
> want will need a certain ratio of cpu to gpu.  Since that is a global, you
> can't really have the scenario you mentioned; all tasks are assumed to
> need a GPU.  For instance, I request 5 cores, 2 GPUs, and set 1 gpu per task
> for each executor.  That means that I could only run 2 tasks and 3 cores
> would be wasted.  The stage/task level configuration of resources was
> removed and is something we can do in a separate SPIP.
> We thought erroring would make it more obvious to the user.  We could
> change this to a warning if everyone thinks that is better, but I personally
> like the error until we can implement the lower-level, per-stage
> configuration.
>
> Tom
>
> On Thursday, March 21, 2019, 1:45:01 AM PDT, Marco Gaido <
> marcogaid...@gmail.com> wrote:
>
>
> Thanks for this SPIP.
> I cannot comment on the docs, but just wanted to highlight one thing. On
> page 5 of the SPIP, when we talk about DRA, I see:
>
> "For instance, if each executor consists 4 CPUs and 2 GPUs, and each task
> requires 1 CPU and 1GPU, then we shall throw an error on application start
> because we shall always have at least 2 idle CPUs per executor"
>
> I am not sure this is the correct behavior. We might have tasks requiring
> only CPU running in parallel as well, hence that setup may make sense. I'd
> rather emit a WARN or something similar. Anyway, we just said we will keep
> GPU scheduling at the task level out of scope for the moment, right?
>
> Thanks,
> Marco
>
> Il giorno gio 21 mar 2019 alle ore 01:26 Xiangrui Meng <
> m...@databricks.com> ha

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-21 Thread Mark Hamstra
I understand the application-level, static, global nature
of spark.task.accelerator.gpu.count and its similarity to the
existing spark.task.cpus, but to me this feels like extending a weakness of
Spark's scheduler, not building on its strengths. That is because I
consider binding the number of cores for each task to an application
configuration to be far from optimal. This is already far from the desired
behavior when an application is running a wide range of jobs (as in a
generic job-runner style of Spark application), some of which require or
can benefit from multi-core tasks, others of which will just waste the
extra cores allocated to their tasks. Ideally, the number of cores
allocated to tasks would get pushed to an even finer granularity than jobs,
instead becoming a per-stage property.

Now, of course, making allocation of general-purpose cores and
domain-specific resources work in this finer-grained fashion is a lot more
work than just trying to extend the existing resource allocation mechanisms
to handle domain-specific resources, but it does feel to me like we should
at least be considering doing that deeper redesign.
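
Purely as a strawman of what per-stage requests might look like (the names
below are invented for illustration and are not an existing or proposed
Spark API):

    // Hypothetical, illustration-only types; nothing like this exists in Spark today.
    case class StageResources(cpusPerTask: Int, gpusPerTask: Int)

    // Different stages of one job declaring different needs, instead of a
    // single application-wide setting:
    val stageRequests = Map(
      "etl"       -> StageResources(cpusPerTask = 1, gpusPerTask = 0), // shuffle/IO work
      "training"  -> StageResources(cpusPerTask = 4, gpusPerTask = 1), // multi-core + GPU tasks
      "inference" -> StageResources(cpusPerTask = 1, gpusPerTask = 1)
    )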

On Thu, Mar 21, 2019 at 7:33 AM Tom Graves 
wrote:

> The proposal here is that all your resources are static and the gpu per
> task config is global per application, meaning you ask for a certain amount
> of memory, cpu, and GPUs for every executor up front just like you do today,
> and every executor you get is that size.  This means that both static and
> dynamic allocation still work without explicitly adding more logic at this
> point. Since the config for gpu per task is global, it means every task you
> want will need a certain ratio of cpu to gpu.  Since that is a global, you
> can't really have the scenario you mentioned; all tasks are assumed to
> need a GPU.  For instance, I request 5 cores, 2 GPUs, and set 1 gpu per task
> for each executor.  That means that I could only run 2 tasks and 3 cores
> would be wasted.  The stage/task level configuration of resources was
> removed and is something we can do in a separate SPIP.
> We thought erroring would make it more obvious to the user.  We could
> change this to a warning if everyone thinks that is better, but I personally
> like the error until we can implement the lower-level, per-stage
> configuration.
>
> Tom
>
> On Thursday, March 21, 2019, 1:45:01 AM PDT, Marco Gaido <
> marcogaid...@gmail.com> wrote:
>
>
> Thanks for this SPIP.
> I cannot comment on the docs, but just wanted to highlight one thing. On
> page 5 of the SPIP, when we talk about DRA, I see:
>
> "For instance, if each executor consists 4 CPUs and 2 GPUs, and each task
> requires 1 CPU and 1GPU, then we shall throw an error on application start
> because we shall always have at least 2 idle CPUs per executor"
>
> I am not sure this is the correct behavior. We might have tasks requiring
> only CPU running in parallel as well, hence that setup may make sense. I'd
> rather emit a WARN or something similar. Anyway, we just said we will keep
> GPU scheduling at the task level out of scope for the moment, right?
>
> Thanks,
> Marco
>
> Il giorno gio 21 mar 2019 alle ore 01:26 Xiangrui Meng <
> m...@databricks.com> ha scritto:
>
> Steve, the initial work would focus on GPUs, but we will keep the
> interfaces general to support other accelerators in the future. This was
> mentioned in the SPIP and draft design.
>
> Imran, you should have comment permission now. Thanks for making a pass! I
> don't think the proposed 3.0 features should block the Spark 3.0 release
> either. It is just an estimate of what we could deliver. I will update the
> doc to make it clear.
>
> Felix, it would be great if you can review the updated docs and let us
> know your feedback.
>
> ** How about setting a tentative vote closing time to next Tue (Mar 26)?
>
> On Wed, Mar 20, 2019 at 11:01 AM Imran Rashid 
> wrote:
>
> Thanks for sending the updated docs.  Can you please give everyone the
> ability to comment?  I have some comments, but overall I think this is a
> good proposal and addresses my prior concerns.
>
> My only real concern is that I notice some mention of "must dos" for spark
> 3.0.  I don't want to make any commitment to holding spark 3.0 for parts of
> this, I think that is an entirely separate decision.  However I'm guessing
> this is just a minor wording issue, and you really mean that's a minimal
> set of features you are aiming for, which is reasonable.
>
> On Mon, Mar 18, 2019 at 12:56 PM Xingbo Jiang 
> wrote:
>
> Hi all,
>
> I updated the SPIP doc and stories; I hope they now contain a clear scope
> of the changes and enough details for the SPIP vote.
> Please review the updated docs, thanks!
>
> Xiangrui Meng  wrote on Wed, Mar 6, 2019, at 8:35 AM:
>
> How about letting Xingbo make a major revision to the SPIP doc 

Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-10 Thread Mark Hamstra
It worked in 2.3. We broke it with 2.4.0 and were informed of that
regression late in the 2.4.0 release process. Since we didn't fix it before
the 2.4.0 release, it should have been noted as a known issue. To now claim
that there is no regression from 2.4.0 is a circular argument denying the
existence of a known regression from 2.3.

On Sun, Mar 10, 2019 at 6:53 PM Sean Owen  wrote:

> From https://issues.apache.org/jira/browse/SPARK-25588, I'm reading that:
>
> - this is a Parquet-Avro version conflict thing
> - a downstream app wants different versions of Parquet and Avro than
> Spark uses, which triggers it
> - it doesn't work in 2.4.0
>
> It's not a regression from 2.4.0, which is the immediate question.
> There isn't even a Parquet fix available.
> But I'm not even seeing why this is excuse-making?
>
> On Sun, Mar 10, 2019 at 8:44 PM Mark Hamstra 
> wrote:
> >
> > Now wait... we created a regression in 2.4.0. Arguably, we should have
> blocked that release until we had a fix; but the issue came up late in the
> release process and it looks to me like there wasn't an adequate fix
> immediately available, so we did something bad and released 2.4.0 with a
> known regression. Saying that there is now no regression from 2.4 is
> tautological and no excuse for not taking in a fix -- and it looks like
> that fix has been waiting for months.
>


Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-10 Thread Mark Hamstra
Now wait... we created a regression in 2.4.0. Arguably, we should have
blocked that release until we had a fix; but the issue came up late in the
release process and it looks to me like there wasn't an adequate fix
immediately available, so we did something bad and released 2.4.0 with a
known regression. Saying that there is now no regression from 2.4 is
tautological and no excuse for not taking in a fix -- and it looks like
that fix has been waiting for months.

On Sun, Mar 10, 2019 at 3:42 PM DB Tsai  wrote:

> We have many important fixes in the 2.4 branch which we want to release
> asap, and this is not a regression from Spark 2.4; as a result, 2.4.1
> will not be blocked by this.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0359BC9965359766
>
>
> On Sun, Mar 10, 2019 at 3:08 PM Michael Heuer  wrote:
>
>> Any chance we could get some movement on this for 2.4.1?
>>
>> https://issues.apache.org/jira/browse/SPARK-25588
>> https://github.com/apache/parquet-mr/pull/560
>>
>> It would require a new Parquet release, which would then need to be
>> picked up by Spark.  We're dead in the water on 2.4.0 without a large
>> refactoring (remove all the RDD code paths for reading Avro stored in
>> Parquet).
>>
>>michael
>>
>>
>> On Mar 8, 2019, at 6:22 PM, Sean Owen  wrote:
>>
>> FWIW RC6 looked fine to me. Passed all tests, etc.
>>
>> On Fri, Mar 8, 2019 at 6:09 PM DB Tsai  wrote:
>>
>>> Sounds fair to me. I'll cut another rc7 when the PR is merged.
>>> Hopefully, this is the final rc. Thanks.
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> --
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 42E5B25A8F7A82C1
>>>
>>>
>>> On Fri, Mar 8, 2019 at 3:23 PM Xiao Li  wrote:
>>>
 It is common to hit this issue when the driver and executors have different
 object layouts, but Spark might not return a wrong answer. It is very hard
 to find out the root cause. Thus, I would suggest including it in Spark
 2.4.1.

 On Fri, Mar 8, 2019 at 3:13 PM DB Tsai  wrote:

> BTW, practically, is it common for users to run into this bug when
> the driver and executors have different object layouts?
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>
>
> On Fri, Mar 8, 2019 at 3:00 PM DB Tsai  wrote:
>
>> Hi Xiao,
>>
>> I already cut rc7 and started the build process. If we definitely need
>> this fix, I can cut rc8. Let me know what you think.
>>
>> Thanks,
>>
>> On Fri, Mar 8, 2019 at 1:46 PM Xiao Li  wrote:
>>
>>> Hi, DB,
>>>
>>> Since this RC will fail, could you hold it until we fix
>>> https://issues.apache.org/jira/browse/SPARK-27097? Either Kris or I
>>> will submit a PR today. The PR is small and the risk is low. This is a
>>> correctness bug. It would be good to have it.
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>>
>>>
>>>
>>> On Fri, Mar 8, 2019 at 12:17 PM DB Tsai 
>>> wrote:
>>>
 Since I cannot find the commit of `Preparing development version
 2.4.2-SNAPSHOT` after the rc6 cut, it's very risky to fix the branch and
 do a force-push. I'll follow Marcelo's suggestion to have another rc7 cut.
 Thus, this vote fails.

 DB Tsai  |  Siri Open Source Technologies [not a contribution]  |
  Apple, Inc

 > On Mar 8, 2019, at 11:45 AM, DB Tsai 
 wrote:
 >
 > Okay, I see the problem. The rc6 tag is not in the 2.4 branch. It's
 very weird. It must have been overwritten by a force push.
 >
 > DB Tsai  |  Siri Open Source Technologies [not a contribution]
 |   Apple, Inc
 >
 >> On Mar 8, 2019, at 11:39 AM, DB Tsai 
 wrote:
 >>
 >> I was using `./do-release-docker.sh` to create a release. But
 since the gpg validation failed a couple of times when the script tried to
 publish the jars into Nexus, I re-ran the scripts multiple times without
 creating a new rc. I was wondering if the script overwrites the
 v2.4.1-rc6 tag instead of using the same commit, causing this issue.
 >>
 >> Should we create a new rc7?
 >>
 >> DB Tsai  |  Siri Open Source Technologies [not a contribution]
 |   Apple, Inc
 >>
 >>> On Mar 8, 2019, at 10:54 AM, Marcelo Vanzin <
 van...@cloudera.com.INVALID> wrote:
 >>>
 >>> I personally find it a little weird to not have the commit in
 branch-2.4.
 >>>
 >>> Not that this would happen, but if the v2.4.1-rc6 tag is
 overwritten
 >>> (e.g. accidentally) then you lose the reference to that commit,

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Mark Hamstra
I'll try to find some time, but it's really at a premium right now.

On Mon, Mar 4, 2019 at 3:17 PM Xiangrui Meng  wrote:

>
>
> On Mon, Mar 4, 2019 at 3:10 PM Mark Hamstra 
> wrote:
>
>> :) Sorry, that was ambiguous. I was seconding Imran's comment.
>>
>
> Could you also help review Xingbo's design sketch and help evaluate the
> cost?
>
>
>>
>> On Mon, Mar 4, 2019 at 3:09 PM Xiangrui Meng  wrote:
>>
>>>
>>>
>>> On Mon, Mar 4, 2019 at 1:56 PM Mark Hamstra 
>>> wrote:
>>>
>>>> +1
>>>>
>>>
>>> Mark, just to be clear, are you +1 on the SPIP or Imran's point?
>>>
>>>
>>>>
>>>> On Mon, Mar 4, 2019 at 12:52 PM Imran Rashid 
>>>> wrote:
>>>>
>>>>> On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng  wrote:
>>>>>
>>>>>> On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung <
>>>>>> felixcheun...@hotmail.com> wrote:
>>>>>>
>>>>>>> IMO upfront allocation is less useful. Specifically too expensive
>>>>>>> for large jobs.
>>>>>>>
>>>>>>
>>>>>> This is also an API/design discussion.
>>>>>>
>>>>>
>>>>> I agree with Felix -- this is more than just an API question.  It has
>>>>> a huge impact on the complexity of what you're proposing.  You might be
>>>>> proposing big changes to a core and brittle part of spark, which is 
>>>>> already
>>>>> short of experts.
>>>>>
>>>>
>>> To my understanding, Felix's comment is mostly on the user interfaces,
>>> stating that upfront allocation is less useful, especially for large jobs.
>>> I agree that for large jobs we'd better have dynamic allocation, which was
>>> mentioned in the YARN support section of the companion scoping doc. We
>>> restrict the new container type to what is initially requested to keep
>>> things simple. However, upfront allocation already meets the requirements
>>> of basic workflows like data + DL training/inference + data. Saying "it is
>>> less useful specifically for large jobs" kinda misses the fact that "it is
>>> super useful for basic use cases".
>>>
>>> Your comment is mostly on the implementation side, which IMHO is the
>>> KEY question to conclude this vote: does the design sketch sufficiently
>>> demonstrate that the internal changes to the Spark scheduler are
>>> manageable? I read Xingbo's design sketch and I think it is doable, which
>>> led to my +1. But I'm not an expert on the scheduler, so I would feel more
>>> confident if the design was reviewed by some scheduler experts. I also read
>>> the design sketch for supporting different cluster managers, which I think
>>> is less critical than the internal scheduler changes.
>>>
>>>
>>>>
>>>>> I don't see any value in having a vote on "does feature X sound cool?"
>>>>>
>>>>
>>> I believe no one would disagree. To prepare the companion doc, we went
>>> through several rounds of discussions to provide concrete stories such that
>>> the proposal is not just "cool".
>>>
>>>
>>>>
>>>>>
>>>> We have to evaluate the potential benefit against the risks the feature
>>>>> brings and the continued maintenance cost.  We don't need super low-level
>>>>> details, but we have to have a sketch of the design to be able to make that
>>>>> tradeoff.
>>>>>
>>>>
>>> Could you review the design sketch from Xingbo, help evaluate the cost,
>>> and provide feedback?
>>>
>>>
>>


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Mark Hamstra
:) Sorry, that was ambiguous. I was seconding Imran's comment.

On Mon, Mar 4, 2019 at 3:09 PM Xiangrui Meng  wrote:

>
>
> On Mon, Mar 4, 2019 at 1:56 PM Mark Hamstra 
> wrote:
>
>> +1
>>
>
> Mark, just to be clear, are you +1 on the SPIP or Imran's point?
>
>
>>
>> On Mon, Mar 4, 2019 at 12:52 PM Imran Rashid 
>> wrote:
>>
>>> On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng  wrote:
>>>
>>>> On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung 
>>>> wrote:
>>>>
>>>>> IMO upfront allocation is less useful. Specifically too expensive for
>>>>> large jobs.
>>>>>
>>>>
>>>> This is also an API/design discussion.
>>>>
>>>
>>> I agree with Felix -- this is more than just an API question.  It has a
>>> huge impact on the complexity of what you're proposing.  You might be
>>> proposing big changes to a core and brittle part of spark, which is already
>>> short of experts.
>>>
>>
> To my understanding, Felix's comment is mostly on the user interfaces,
> stating that upfront allocation is less useful, especially for large jobs.
> I agree that for large jobs we'd better have dynamic allocation, which was
> mentioned in the YARN support section of the companion scoping doc. We
> restrict the new container type to what is initially requested to keep
> things simple. However, upfront allocation already meets the requirements
> of basic workflows like data + DL training/inference + data. Saying "it is
> less useful specifically for large jobs" kinda misses the fact that "it is
> super useful for basic use cases".
>
> Your comment is mostly on the implementation side, which IMHO is the
> KEY question to conclude this vote: does the design sketch sufficiently
> demonstrate that the internal changes to the Spark scheduler are
> manageable? I read Xingbo's design sketch and I think it is doable, which
> led to my +1. But I'm not an expert on the scheduler, so I would feel more
> confident if the design was reviewed by some scheduler experts. I also read
> the design sketch for supporting different cluster managers, which I think
> is less critical than the internal scheduler changes.
>
>
>>
>>> I don't see any value in having a vote on "does feature X sound cool?"
>>>
>>
> I believe no one would disagree. To prepare the companion doc, we went
> through several rounds of discussions to provide concrete stories such that
> the proposal is not just "cool".
>
>
>>
>>>
>> We have to evaluate the potential benefit against the risks the feature
>>> brings and the continued maintenance cost.  We don't need super low-level
>>> details, but we have to have a sketch of the design to be able to make that
>>> tradeoff.
>>>
>>
> Could you review the design sketch from Xingbo, help evaluate the cost,
> and provide feedback?
>
>


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Mark Hamstra
+1

On Mon, Mar 4, 2019 at 12:52 PM Imran Rashid  wrote:

> On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng  wrote:
>
>> On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung 
>> wrote:
>>
>>> IMO upfront allocation is less useful. Specifically too expensive for
>>> large jobs.
>>>
>>
>> This is also an API/design discussion.
>>
>
> I agree with Felix -- this is more than just an API question.  It has a
> huge impact on the complexity of what you're proposing.  You might be
> proposing big changes to a core and brittle part of spark, which is already
> short of experts.
>
> I don't see any value in having a vote on "does feature X sound cool?"  We
> have to evaluate the potential benefit against the risks the feature brings
> and the continued maintenance cost.  We don't need super low-level details,
> but we have to have a sketch of the design to be able to make that tradeoff.
>


Re: [RESULT] [VOTE] Functional DataSourceV2 in Spark 3.0

2019-03-03 Thread Mark Hamstra
No, it is not at all dead! There just isn't any kind of expectation or
commitment that the 3.0.0 release will be held up in any way if DSv2 is not
ready to go when the rest of 3.0.0 is. There is nothing new preventing
continued work on DSv2 or its eventual inclusion in a release.

On Sun, Mar 3, 2019 at 1:36 PM Jean Georges Perrin  wrote:

> Hi, I am kind of new to the whole Apache process (not specifically Spark).
> Does that mean that DataSourceV2 is dead or stays experimental? Thanks
> for clarifying for a newbie.
>
> jg
>
>
> On Mar 3, 2019, at 11:21, Ryan Blue  wrote:
>
> This vote fails with the following counts:
>
> 3 +1 votes:
>
>- Matt Cheah
>- Ryan Blue
>- Sean Owen (binding)
>
> 1 -0 vote:
>
>- Jose Torres
>
> 2 -1 votes:
>
>- Mark Hamstra (binding)
>- Mridul Muralidharan (binding)
>
> Thanks for the discussion, everyone. It sounds to me like the main
> objection is simply that we’ve already committed to a release that removes
> deprecated APIs and we don’t want to commit to features at the same time.
> While I’m a bit disappointed, I think that’s a reasonable position for the
> community to take and at least is a clear result.
>
> rb
>
> On Thu, Feb 28, 2019 at 8:38 AM Ryan Blue  wrote:
>
> I’d like to call a vote for committing to getting DataSourceV2 in a
>> functional state for Spark 3.0.
>>
>> For more context, please see the discussion thread, but here is a quick
>> summary about what this commitment means:
>>
>>- We think that a “functional DSv2” is an achievable goal for the
>>Spark 3.0 release
>>- We will consider this a blocker for Spark 3.0, and take reasonable
>>steps to make it happen
>>- We will *not* delay the release without a community discussion
>>
>> Here’s what we’ve defined as a functional DSv2:
>>
>>- Add a plugin system for catalogs
>>- Add an interface for table catalogs (see the ongoing SPIP vote)
>>- Add an implementation of the new interface that calls
>>SessionCatalog to load v2 tables
>>- Add a resolution rule to load v2 tables from the v2 catalog
>>- Add CTAS logical and physical plan nodes
>>- Add conversions from SQL parsed plans to v2 logical plans (e.g.,
>>INSERT INTO support)
>>
>> Please vote in the next 3 days on whether you agree with committing to
>> this goal.
>>
>> [ ] +1: Agree that we should consider a functional DSv2 implementation a
>> blocker for Spark 3.0
>> [ ] +0: . . .
>> [ ] -1: I disagree with this goal because . . .
>>
>> Thank you!
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>


Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mark Hamstra
I agree that adding new features in a major release is not forbidden, but
that is just not the primary goal of a major release. If we reach the point
where we are happy with the new public API before some new features are in
a satisfactory state to be merged, then I don't want there to be a prior
presumption that we cannot complete the primary goal of the major release.
If at that point you want to argue that it is worth waiting for some new
feature, then that would be fine and may have sufficient merits to warrant
some delay.

Regardless of whether significant new public API comes into a major release
or a feature release, it should come in with an experimental annotation so
that we can make changes without requiring a new major release.
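
As a minimal sketch of what that looks like in practice, assuming Spark's
existing org.apache.spark.annotation.Experimental tag (the trait itself is
invented purely for illustration):

    import org.apache.spark.annotation.Experimental

    // A hypothetical new public API; the trait name is invented for illustration.
    // Marking it @Experimental signals that it may change or be removed in a
    // later feature release, so shipping it does not lock down the interface
    // until a subsequent major version.
    @Experimental
    trait NewCatalogApi {
      def lookup(name: String): Option[String]
    }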

If you want to argue that some new features that are currently targeting
3.0.0 are significant enough that one or more of them should justify an
accelerated 3.1.0 release schedule if it is not ready in time for the 3.0.0
release, then I can much more easily get behind that kind of commitment;
but I remain opposed to the notion of promoting any new features to the
status of blockers of 3.0.0 at this time.

On Thu, Feb 28, 2019 at 10:23 AM Ryan Blue  wrote:

> Mark, I disagree. Setting common goals is a critical part of getting
> things done.
>
> This doesn't commit the community to push out the release if the goals
> aren't met, but does mean that we will, as a community, seriously consider
> it. This is also an acknowledgement that this is the most important feature
> in the next release (whether major or minor) for many of us. This has been
> in limbo for a very long time, so I think it is important for the community
> to commit to getting it to a functional state.
>
> It sounds like your objection is to this commitment for 3.0, but remember
> that 3.0 is the next release so that we can remove deprecated APIs. It does
> not mean that we aren't adding new features in that release and aren't
> considering other goals.
>
> On Thu, Feb 28, 2019 at 10:12 AM Mark Hamstra 
> wrote:
>
>> Then I'm -1. Setting new features as blockers of major releases is not
>> proper project management, IMO.
>>
>> On Thu, Feb 28, 2019 at 10:06 AM Ryan Blue  wrote:
>>
>>> Mark, if this goal is adopted, "we" is the Apache Spark community.
>>>
>>> On Thu, Feb 28, 2019 at 9:52 AM Mark Hamstra 
>>> wrote:
>>>
>>>> Who is "we" in these statements, such as "we should consider a
>>>> functional DSv2 implementation a blocker for Spark 3.0"? If it means those
>>>> contributing to the DSv2 effort want to set their own goals, milestones,
>>>> etc., then that is fine with me. If you mean that the Apache Spark project
>>>> should officially commit to the lack of a functional DSv2 implementation
>>>> being a blocker for the release of Spark 3.0, then I'm -1. A major release
>>>> is just not about adding new features. Rather, it is about making changes
>>>> to the existing public API. As such, I'm opposed to any new feature or any
>>>> API addition being considered a blocker of the 3.0.0 release.
>>>>
>>>>
>>>> On Thu, Feb 28, 2019 at 9:09 AM Matt Cheah  wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>>
>>>>>
>>>>> Are identifiers and namespaces going to be rolled under one of those
>>>>> six points?
>>>>>
>>>>>
>>>>>
>>>>> *From: *Ryan Blue 
>>>>> *Reply-To: *"rb...@netflix.com" 
>>>>> *Date: *Thursday, February 28, 2019 at 8:39 AM
>>>>> *To: *Spark Dev List 
>>>>> *Subject: *[VOTE] Functional DataSourceV2 in Spark 3.0
>>>>>
>>>>>
>>>>>
>>>>> I’d like to call a vote for committing to getting DataSourceV2 in a
>>>>> functional state for Spark 3.0.
>>>>>
>>>>> For more context, please see the discussion thread, but here is a
>>>>> quick summary about what this commitment means:
>>>>>
>>>>> · We think that a “functional DSv2” is an achievable goal for
>>>>> the Spark 3.0 release
>>>>>
>>>>> · We will consider this a blocker for Spark 3.0, and take
>>>>> reasonable steps to make it happen
>>>>>
>>>>> · We will *not* delay the release without a community
>>>>> discussion
>>>>>
>>>>> Here’s what we’ve defined as a functional DSv2:
>>>>>
>>>>> ·

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mark Hamstra
Then I'm -1. Setting new features as blockers of major releases is not
proper project management, IMO.

On Thu, Feb 28, 2019 at 10:06 AM Ryan Blue  wrote:

> Mark, if this goal is adopted, "we" is the Apache Spark community.
>
> On Thu, Feb 28, 2019 at 9:52 AM Mark Hamstra 
> wrote:
>
>> Who is "we" in these statements, such as "we should consider a functional
>> DSv2 implementation a blocker for Spark 3.0"? If it means those
>> contributing to the DSv2 effort want to set their own goals, milestones,
>> etc., then that is fine with me. If you mean that the Apache Spark project
>> should officially commit to the lack of a functional DSv2 implementation
>> being a blocker for the release of Spark 3.0, then I'm -1. A major release
>> is just not about adding new features. Rather, it is about making changes
>> to the existing public API. As such, I'm opposed to any new feature or any
>> API addition being considered a blocker of the 3.0.0 release.
>>
>>
>> On Thu, Feb 28, 2019 at 9:09 AM Matt Cheah  wrote:
>>
>>> +1 (non-binding)
>>>
>>>
>>>
>>> Are identifiers and namespaces going to be rolled under one of those six
>>> points?
>>>
>>>
>>>
>>> *From: *Ryan Blue 
>>> *Reply-To: *"rb...@netflix.com" 
>>> *Date: *Thursday, February 28, 2019 at 8:39 AM
>>> *To: *Spark Dev List 
>>> *Subject: *[VOTE] Functional DataSourceV2 in Spark 3.0
>>>
>>>
>>>
>>> I’d like to call a vote for committing to getting DataSourceV2 in a
>>> functional state for Spark 3.0.
>>>
>>> For more context, please see the discussion thread, but here is a quick
>>> summary about what this commitment means:
>>>
>>> · We think that a “functional DSv2” is an achievable goal for
>>> the Spark 3.0 release
>>>
>>> · We will consider this a blocker for Spark 3.0, and take
>>> reasonable steps to make it happen
>>>
>>> · We will *not* delay the release without a community discussion
>>>
>>> Here’s what we’ve defined as a functional DSv2:
>>>
>>> · Add a plugin system for catalogs
>>>
>>> · Add an interface for table catalogs (see the ongoing SPIP
>>> vote)
>>>
>>> · Add an implementation of the new interface that calls
>>> SessionCatalog to load v2 tables
>>>
>>> · Add a resolution rule to load v2 tables from the v2 catalog
>>>
>>> · Add CTAS logical and physical plan nodes
>>>
>>> · Add conversions from SQL parsed plans to v2 logical plans
>>> (e.g., INSERT INTO support)
>>>
>>> Please vote in the next 3 days on whether you agree with committing to
>>> this goal.
>>>
>>> [ ] +1: Agree that we should consider a functional DSv2 implementation a
>>> blocker for Spark 3.0
>>> [ ] +0: . . .
>>> [ ] -1: I disagree with this goal because . . .
>>>
>>> Thank you!
>>>
>>> --
>>>
>>> Ryan Blue
>>>
>>> Software Engineer
>>>
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mark Hamstra
Who is "we" in these statements, such as "we should consider a functional
DSv2 implementation a blocker for Spark 3.0"? If it means those
contributing to the DSv2 effort want to set their own goals, milestones,
etc., then that is fine with me. If you mean that the Apache Spark project
should officially commit to the lack of a functional DSv2 implementation
being a blocker for the release of Spark 3.0, then I'm -1. A major release
is just not about adding new features. Rather, it is about making changes
to the existing public API. As such, I'm opposed to any new feature or any
API addition being considered a blocker of the 3.0.0 release.


On Thu, Feb 28, 2019 at 9:09 AM Matt Cheah  wrote:

> +1 (non-binding)
>
>
>
> Are identifiers and namespaces going to be rolled under one of those six
> points?
>
>
>
> *From: *Ryan Blue 
> *Reply-To: *"rb...@netflix.com" 
> *Date: *Thursday, February 28, 2019 at 8:39 AM
> *To: *Spark Dev List 
> *Subject: *[VOTE] Functional DataSourceV2 in Spark 3.0
>
>
>
> I’d like to call a vote for committing to getting DataSourceV2 in a
> functional state for Spark 3.0.
>
> For more context, please see the discussion thread, but here is a quick
> summary about what this commitment means:
>
> · We think that a “functional DSv2” is an achievable goal for the
> Spark 3.0 release
>
> · We will consider this a blocker for Spark 3.0, and take
> reasonable steps to make it happen
>
> · We will *not* delay the release without a community discussion
>
> Here’s what we’ve defined as a functional DSv2:
>
> · Add a plugin system for catalogs
>
> · Add an interface for table catalogs (see the ongoing SPIP vote)
>
> · Add an implementation of the new interface that calls
> SessionCatalog to load v2 tables
>
> · Add a resolution rule to load v2 tables from the v2 catalog
>
> · Add CTAS logical and physical plan nodes
>
> · Add conversions from SQL parsed plans to v2 logical plans
> (e.g., INSERT INTO support)
>
> Please vote in the next 3 days on whether you agree with committing to
> this goal.
>
> [ ] +1: Agree that we should consider a functional DSv2 implementation a
> blocker for Spark 3.0
> [ ] +0: . . .
> [ ] -1: I disagree with this goal because . . .
>
> Thank you!
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>


Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-24 Thread Mark Hamstra
>
> I’m not quite sure what you mean here.
>

I'll try to explain once more, then I'll drop it since continuing the rest
of the discussion in this thread is more important than getting
side-tracked.

There is nothing wrong with individuals advocating for what they think
should or should not be in Spark 3.0, nor should anyone shy away from
explaining why they think delaying the release for some reason is or isn't
a good idea. What is a problem, or is at least something that I have a
problem with, are declarative, pseudo-authoritative statements that 3.0 (or
some other release) will or won't contain some feature, API, etc., or that
some issue is or is not a blocker or worth delaying for. When the PMC has not
voted on such issues, I'm often left thinking, "Wait... what? Who decided
that, or where did that decision come from?"

On Sun, Feb 24, 2019 at 1:27 PM Ryan Blue  wrote:

> Thanks to Matt for his philosophical take. I agree.
>
> The intent is to set a common goal, so that we work toward getting v2 in a
> usable state as a community. Part of that is making choices to get it done
> on time, which we have already seen on this thread: setting out more
> clearly what we mean by “DSv2” and what we think we can get done on time.
>
> I don’t mean to say that we should commit to a plan that *requires* a
> delay to the next release (which describes the goal better than 3.0 does).
> But we should commit to making sure the goal is met, acknowledging that
> this is one of the most important efforts for many people that work in this
> community.
>
> I think it would help to clarify what this commitment means, at least to
> me:
>
>1. What it means: the community will seriously consider delaying the
>next release if this isn’t done by our initial deadline.
>2. What it does not mean: delaying the release no matter what happens.
>
> In that event that this feature isn’t done on time, it would be up to the
> community to decide what to do. But in the mean time, I think it is healthy
> to set a goal and work toward it. (I am not making a distinction between
> PMC and community here.)
>
> I think this commitment is a good idea for the same reason why we set
> other goals: to hold ourselves accountable. When one sets a New Years
> resolution to drop 10 pounds, it isn’t that the hope or intent wasn’t there
> before. It is about having a (self-imposed) constraint that helps you make
> hard choices: cake now or meet my goal?
>
> “Spark 3.0 has many other major features as well, delaying the release has
> significant cost and we should try our best to not let it happen.”
>
> I agree with Wenchen here. No one wants to actually delay the release. We
> just want to push ourselves to make some tough decisions, using that delay
> as a motivating factor.
>
> The fact that some entity other than the PMC thinks that Spark 3.0 should
> contain certain new features or that it will be costly to them if 3.0 does
> not contain those features is not dispositive.
>
> I’m not quite sure what you mean here. While I am representing my
> employer, I am bringing up this topic as a member of the community, to
> suggest a direction for the community to take, and I fully accept that the
> decision is up to the community. I think it is reasonable to candidly state
> how this matters; that context informs the discussion.
>
> On Fri, Feb 22, 2019 at 1:55 PM Mark Hamstra 
> wrote:
>
>> To your other message: I already see a number of PMC members here. Who's
>>> the other entity?
>>>
>>
>> I'll answer indirectly since pointing fingers isn't really my intent. In
>> the absence of a PMC vote, I react negatively to individuals making new
>> declarative policy statements or statements to the effect that Spark
>> 3.0 will (or will not) include these features..., or that it will be too
>> costly to do something. Maybe these are innocent shorthand that leave off a
>> clarifying "in my opinion" or "according to the current state of JIRA" or
>> some such.
>>
>> My points are simply that nobody other than the PMC has an authoritative
>> say on such matters, and if we are at a point where the community needs
>> some definitive guidance, then we need PMC involvement and a vote. That's
>> not intended to preclude or terminate community discussion, because that
>> is, indeed, lovely to see.
>>
>> On Fri, Feb 22, 2019 at 12:04 PM Sean Owen  wrote:
>>
>>> To your other message: I already see a number of PMC members here. Who's
>>> the other entity? The PMC is the thing that says a thing is a release,
>>> sure, but this discussion is properly a community one. And here we are,
>>> this is lovely to see.
>>>
&g

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-22 Thread Mark Hamstra
>
> To your other message: I already see a number of PMC members here. Who's
> the other entity?
>

I'll answer indirectly since pointing fingers isn't really my intent. In
the absence of a PMC vote, I react negatively to individuals making new
declarative policy statements or statements to the effect that Spark
3.0 will (or will not) include these features..., or that it will be too
costly to do something. Maybe these are innocent shorthand that leave off a
clarifying "in my opinion" or "according to the current state of JIRA" or
some such.

My points are simply that nobody other than the PMC has an authoritative
say on such matters, and if we are at a point where the community needs
some definitive guidance, then we need PMC involvement and a vote. That's
not intended to preclude or terminate community discussion, because that
is, indeed, lovely to see.

On Fri, Feb 22, 2019 at 12:04 PM Sean Owen  wrote:

> To your other message: I already see a number of PMC members here. Who's
> the other entity? The PMC is the thing that says a thing is a release,
> sure, but this discussion is properly a community one. And here we are,
> this is lovely to see.
>
> (May I remind everyone to casually, sometime, browse the large list of
> other JIRAs targeted for Spark 3? it's much more than DSv2!)
>
> I can't speak to specific decisions here, but, I see:
>
> Spark 3 doesn't have a release date. Notionally it's 6 months after Spark
> 2.4 (Nov 2018). It'd be reasonable to plan for a little more time. Can we
> throw out... June 2019, and I update the website? It can slip but that
> gives a concrete timeframe around which to plan. What can comfortably get
> in by June 2019?
>
> Agreement that "DSv2" is going into Spark 3, for some definition of DSv2
> that's probably roughly Matt's list.
>
> Changes that can't go into a minor release (API changes, etc) must by
> definition go into Spark 3.0. Agree those first and do those now. Delay
> Spark 3 until they're done and prioritize accordingly.
> Changes that can go into a minor release can go into 3.1, if needed.
> This has been in discussion long enough that I think whatever design(s)
> are on the table for DSv2 now are as close as one is going to get. The
> perfect is the enemy of the good.
>
> Aside from throwing out a date, I probably just restated what everyone
> said. But I was 'summoned' :)
>
> On Fri, Feb 22, 2019 at 12:40 PM Mark Hamstra 
> wrote:
>
>> However, as other people mentioned, Spark 3.0 has many other major
>>> features as well
>>>
>>
>> I fundamentally disagree. First, Spark 3.0 has nothing until the PMC says
>> it has something, and we have made no commitment along the lines that
>> "Spark 3.0.0 will not be released unless it contains new features x, y and
>> z." Second, major-version releases are not about adding new features.
>> Major-version releases are about making changes to the public API that we
>> cannot make in feature or bug-fix releases. If that is all that is
>> accomplished in a particular major release, that's fine -- in fact, we
>> quite intentionally did not target new features in the Spark 2.0.0 release.
>> The fact that some entity other than the PMC thinks that Spark 3.0 should
>> contain certain new features or that it will be costly to them if 3.0 does
>> not contain those features is not dispositive. If there are public API
>> changes that should occur in a timely fashion and there is also a list of
>> new features that some users or contributors want to see in 3.0 but that
>> look likely to not be ready in a timely fashion, then the PMC should fully
>> consider releasing 3.0 without all those new features. There is no reason
>> that they can't come in with 3.1.0.
>>
>


Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-08 Thread Mark Hamstra
There are 2. C'mon Marcelo, you can make it 3!

On Fri, Feb 8, 2019 at 5:03 PM Marcelo Vanzin 
wrote:

> Hi Takeshi,
>
> Since we only really have one +1 binding vote, do you want to extend
> this vote a bit?
>
> I've been stuck on a few things but plan to test this (setting things
> up now), but it probably won't happen before the deadline.
>
> On Tue, Feb 5, 2019 at 5:07 PM Takeshi Yamamuro 
> wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 2.3.3.
> >
> > The vote is open until February 8 6:00PM (PST) and passes if a majority
> +1 PMC votes are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.3.3
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.3.3-rc2 (commit
> 66fd9c34bf406a4b5f86605d06c9607752bd637a):
> > https://github.com/apache/spark/tree/v2.3.3-rc2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc2-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1298/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc2-docs/
> >
> > The list of bug fixes going into 2.3.3 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12343759
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your project's resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.3.3?
> > ===
> >
> > The current list of open tickets targeted at 2.3.3 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.3.3
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
> > P.S.
> > I checked all the tests passed in the Amazon Linux 2 AMI;
> > $ java -version
> > openjdk version "1.8.0_191"
> > OpenJDK Runtime Environment (build 1.8.0_191-b12)
> > OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
> > $ ./build/mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos
> -Psparkr test
> >
> > --
> > ---
> > Takeshi Yamamuro
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Trigger full GC during executor idle time?

2019-01-02 Thread Mark Hamstra
Without addressing whether the change is beneficial or not, I will note
that the logic in the paper and the PR's description is incorrect: "During
execution, some executor nodes finish the tasks assigned to them early and
wait for the entire stage to complete before more tasks are assigned to
them, while other executor nodes take longer to finish." That is simply not
true -- or more generously, is only sort of true in some circumstances
where only a single Job is executing on the cluster. Less generously, there
is no coordination between Executors. They simply receive Tasks from the
DAGScheduler. When an Executor has idle resources, it informs the
DAGScheduler, and it is the DAGScheduler that knows whether there is more
work ready for the Executor. Perhaps the DAGScheduler should be sending a
message to the Executor if it knows that there isn't more work for the
Executor to do, but I am really dubious about Executors on their own
deciding with their limited knowledge that they are going to take a GC
break unless they really need to.
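
To make the dynamic concrete, here is a rough, purely illustrative Scala
sketch of the kind of executor-local heuristic the PR contemplates; none of
the names reflect Spark's actual internals, and it exists only to show where
the limited-knowledge decision happens:

// Purely conceptual: trigger a full GC when no tasks have run for a while.
class IdleGcHeuristic(idleThresholdMs: Long) {
  @volatile private var lastTaskFinishedAt: Long = System.currentTimeMillis()

  def onTaskFinished(): Unit = {
    lastTaskFinishedAt = System.currentTimeMillis()
  }

  def maybeGc(runningTasks: Int): Unit = {
    val idleFor = System.currentTimeMillis() - lastTaskFinishedAt
    if (runningTasks == 0 && idleFor > idleThresholdMs) {
      // This is the limited-knowledge decision in question: the executor has
      // no idea whether the DAGScheduler is about to send it more work.
      System.gc()
    }
  }
}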

On Mon, Dec 31, 2018 at 4:13 PM Holden Karau  wrote:

> Maybe it would make sense to loop in the paper authors? I imagine they
> might have more information than ended up in the paper.
>
> On Mon, Dec 31, 2018 at 2:10 PM Ryan Blue 
> wrote:
>
>> After a quick look, I don't think that the paper's evaluation is very
>> thorough. I don't see where it discusses what the
>> PageRank implementation is doing in terms of object allocation or whether
>> data is cached between iterations (looks like it probably isn't, based on
>> Table III). It also doesn't address how this would interact with
>> spark.memory.fraction. I think it would be a problem to set this threshold
>> lower than spark.memory.fraction. And it doesn't say whether this is static
>> or dynamic allocation.
>>
>> My impression is that this is obviously a good idea for some
>> allocation-heavy iterative workloads, but it is unclear whether it would
>> help generally:
>>
>> * An empty executor may delay starting tasks because of the optimistic GC
>> * Full GC instead of incremental may not be needed and could increase
>> starting delay
>> * 1-core executors will always GC between tasks
>> * Spark-managed memory may cause long GC pauses that don't recover much
>> space
>> * Dynamic allocation probably eliminates most of the benefit because of
>> executor turn-over
>>
>> rb
>>
>> On Mon, Dec 31, 2018 at 11:01 AM Reynold Xin  wrote:
>>
>>> Not sure how reputable or representative that paper is...
>>>
>>> On Mon, Dec 31, 2018 at 10:57 AM Sean Owen  wrote:
>>>
 https://github.com/apache/spark/pull/23401

 Interesting PR; I thought it was not worthwhile until I saw a paper
 claiming this can speed things up to the tune of 2-6%. Has anyone
 considered this before?

 Sean

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: A survey about IP clearance of Spark in UC Berkeley for donating to Apache

2018-11-28 Thread Mark Hamstra
Your history isn't really accurate. Years before Spark became an Apache
project, the AMPlab and UC Berkeley placed the Spark code under a 3-clause
BSD License and made the code publicly available. Later, a group of
developers and Spark users from both inside and outside Berkeley brought
Spark and that repository of code through the Apache incubation process to
become a full Apache project. So, it is not really accurate to say that UC
Berkeley donated Spark to the ASF.

On Tue, Nov 27, 2018 at 9:21 PM hxd  wrote:

> Hi,
>
> As we know, Spark is one of the most famous projects for distributed
> computing. It was initially donated by UC Berkeley to the ASF, and currently
> many developers around the world are contributing to the project.
>
> Because the Apache 2.0 License requires licensing related patents to the ASF
> when needed, I want to make a survey about "how universities deal with IP
> clearance when donating to Apache". We believe it would be helpful to let
> more universities understand the process and join Apache more smoothly
> in the future.
>
> Therefore, I want to know whether UC Berkeley had any related patents before
> the university contributed the source code to Apache. If so, how did the
> university deal with them? And what documents did the university provide
> to Apache, just the SGA?
>
> Thanks very much!
>
> Best,
> Xiangdong
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-07 Thread Mark Hamstra
Ok, got it -- it's really just an argument for not all of 2.11, 2.12 and
2.13 at the same time; always 2.12; now figure out when we stop 2.11
support and start 2.13 support.

On Wed, Nov 7, 2018 at 11:10 AM Sean Owen  wrote:

> It's not making 2.12 the default, but not dropping 2.11. Supporting
> 2.13 could mean supporting 3 Scala versions at once, which I claim is
> just too much. I think the options are likely:
>
> - Support 2.11, 2.12 in Spark 3.0. Deprecate 2.11 and make 2.12 the
> default. Add 2.13 support in 3.x and drop 2.11 in the same release
> - Deprecate 2.11 right now via announcement and/or Spark 2.4.1 soon.
> Drop 2.11 support in Spark 3.0, and support only 2.12.
> - (same as above, but add Spark 2.13 support if possible for Spark 3.0)
>
>
> On Wed, Nov 7, 2018 at 12:32 PM Mark Hamstra 
> wrote:
> >
> > I'm not following "exclude Scala 2.13". Is there something inherent in
> making 2.12 the default Scala version in Spark 3.0 that would prevent us
> from supporting the option of building with 2.13?
> >
> > On Tue, Nov 6, 2018 at 5:48 PM Sean Owen  wrote:
> >>
> >> That's possible here, sure. The issue is: would you exclude Scala 2.13
> >> support in 3.0 for this, if it were otherwise ready to go?
> >> I think it's not a hard rule that something has to be deprecated
> >> previously to be removed in a major release. The notice is helpful,
> >> sure, but there are lots of ways to provide that notice to end users.
> >> Lots of things are breaking changes in a major release. Or: deprecate
> >> in Spark 2.4.1, if desired?
> >>
> >> On Tue, Nov 6, 2018 at 7:36 PM Wenchen Fan  wrote:
> >> >
> >> > We make Scala 2.11 the default one in Spark 2.0, then drop Scala 2.10
> in Spark 2.3. Shall we follow it and drop Scala 2.11 at some point of Spark
> 3.x?
> >> >
> >> > On Wed, Nov 7, 2018 at 8:55 AM Reynold Xin 
> wrote:
> >> >>
> >> >> Have we deprecated Scala 2.11 already in an existing release?
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>


Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-07 Thread Mark Hamstra
I'm not following "exclude Scala 2.13". Is there something inherent in
making 2.12 the default Scala version in Spark 3.0 that would prevent us
from supporting the option of building with 2.13?

On Tue, Nov 6, 2018 at 5:48 PM Sean Owen  wrote:

> That's possible here, sure. The issue is: would you exclude Scala 2.13
> support in 3.0 for this, if it were otherwise ready to go?
> I think it's not a hard rule that something has to be deprecated
> previously to be removed in a major release. The notice is helpful,
> sure, but there are lots of ways to provide that notice to end users.
> Lots of things are breaking changes in a major release. Or: deprecate
> in Spark 2.4.1, if desired?
>
> On Tue, Nov 6, 2018 at 7:36 PM Wenchen Fan  wrote:
> >
> > We make Scala 2.11 the default one in Spark 2.0, then drop Scala 2.10 in
> Spark 2.3. Shall we follow it and drop Scala 2.11 at some point of Spark
> 3.x?
> >
> > On Wed, Nov 7, 2018 at 8:55 AM Reynold Xin  wrote:
> >>
> >> Have we deprecated Scala 2.11 already in an existing release?
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: What's a blocker?

2018-10-24 Thread Mark Hamstra
Yeah, I can pretty much agree with that. Before we get into release
candidates, it's not as big a deal if something gets labeled as a blocker.
Once we are into an RC, I'd like to see any discussions as to whether
something is or isn't a blocker at least cross-referenced in the RC VOTE
thread so that PMC members can more easily be aware of the discussion and
potentially weigh in.

On Wed, Oct 24, 2018 at 7:12 PM Saisai Shao  wrote:

> Just my two cents from past experience. As the release manager for Spark
> 2.3.2, I felt the release was significantly delayed by blocker issues. The
> vote failed several times because of one or two "blocker issues". I think
> during the RC period, each "blocker issue" should be carefully evaluated by
> the relevant PMC members and the release manager. Issues that are not so
> critical, or that only matter to one or two firms, should be marked as
> blockers only with care, to avoid delaying the release.
>
> Thanks
> Saisai
>


Re: About introduce function sum0 to Spark

2018-10-23 Thread Mark Hamstra
Yes, as long as you are only talking about summing numeric values. Part of
my point, though, is that this is just a special case of folding or
aggregating with an initial or 'zero' value. It doesn't need to be limited
to just numeric sums with zero = 0.
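
For illustration, a minimal Scala sketch (the object and column names are
just assumptions, and sum0 is not an actual Spark built-in) of how the same
null-to-zero behavior can be expressed today by wrapping sum in coalesce:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, lit, sum}

object Sum0Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sum0-sketch")
      .getOrCreate()
    import spark.implicits._

    // A column containing only nulls.
    val df = Seq[java.lang.Integer](null, null).toDF("col")

    // Plain sum returns null when it sees no non-null values...
    df.select(sum($"col")).show()

    // ...while coalesce(sum(col), 0) returns 0, which is the sum0 behavior.
    df.select(coalesce(sum($"col"), lit(0))).show()

    spark.stop()
  }
}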

On Tue, Oct 23, 2018 at 12:23 AM Wenchen Fan  wrote:

> This is logically `sum( if(isnull(col), 0, col) )` right?
>
> On Tue, Oct 23, 2018 at 2:58 PM 陶 加涛  wrote:
>
>> The name is from Apache Calcite, and it doesn’t matter; we can introduce
>> our own.
>>
>>
>>
>>
>>
>> ---
>>
>> Regards!
>>
>> Aron Tao
>>
>>
>>
>> *From:* Mark Hamstra 
>> *Date:* Tuesday, October 23, 2018, 12:28
>> *To:* "taojia...@gmail.com" 
>> *Cc:* dev 
>> *Subject:* Re: About introduce function sum0 to Spark
>>
>>
>>
>> That's a horrible name. This is just a fold.
>>
>>
>>
>> On Mon, Oct 22, 2018 at 7:39 PM 陶 加涛  wrote:
>>
>> Hi, Calcite has the concept of sum0; here I quote the definition of
>> sum0:
>>
>>
>>
>> Sum0 is an aggregator which returns the sum of the values which
>>
>> go into it like Sum. It differs in that when no non null values
>>
>> are applied zero is returned instead of null..
>>
>>
>>
>> One scenario is that we can use sum0 to implement pre-calculated
>> counts (in a pre-calculation system like Apache Kylin).
>>
>>
>>
>> It is very easy to implement sum0 in Spark; if the community considers this
>> necessary, I would like to open a JIRA and implement it.
>>
>>
>>
>> ---
>>
>> Regards!
>>
>> Aron Tao
>>
>>
>>
>>


Re: About introduce function sum0 to Spark

2018-10-22 Thread Mark Hamstra
That's a horrible name. This is just a fold.

On Mon, Oct 22, 2018 at 7:39 PM 陶 加涛  wrote:

> Hi, Calcite has the concept of sum0; here I quote the definition of
> sum0:
>
>
>
> Sum0 is an aggregator which returns the sum of the values which
>
> go into it like Sum. It differs in that when no non null values
>
> are applied zero is returned instead of null..
>
>
>
> One scenario is that we can use sum0 to implement pre-calculated
> counts (in a pre-calculation system like Apache Kylin).
>
>
>
> It is very easy to implement sum0 in Spark; if the community considers this
> necessary, I would like to open a JIRA and implement it.
>
>
>
> ---
>
> Regards!
>
> Aron Tao
>
>
>


Re: Adding Extension to Load Custom functions into Thriftserver/SqlShell

2018-09-27 Thread Mark Hamstra
Yes, the "startWithContext" code predates SparkSessions in Thriftserver, so
it doesn't really work the way you want it to with Session initiation.

On Thu, Sep 27, 2018 at 11:13 AM Russell Spitzer 
wrote:

> While that's easy for some users, we basically want to load up some
> functions by default into all session catalogues regardless of who made
> them. We do this with certain rules and strategies using the
> SparkExtensions, so all apps that run through our submit scripts get a
> config parameter added and it's transparent to the user. I think we'll
> probably have to do some forks (at least for the CliDriver), the
> thriftserver has a bunch of code which doesn't run under "startWithContext"
> so we may have an issue there as well.
>
>
> On Wed, Sep 26, 2018, 6:21 PM Mark Hamstra 
> wrote:
>
>> You're talking about users starting Thriftserver or SqlShell from the
>> command line, right? It's much easier if you are starting a Thriftserver
>> programmatically so that you can register functions when initializing a
>> SparkContext and then  HiveThriftServer2.startWithContext using that
>> context.
>>
>> On Wed, Sep 26, 2018 at 3:30 PM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I've been looking recently on possible avenues to load new functions
>>> into the Thriftserver and SqlShell at launch time. I basically want to
>>> preload a set of functions in addition to those already present in the
>>> Spark Code. I'm not sure there is at present a way to do this and I was
>>> wondering if anyone had any ideas.
>>>
>>> I would basically want to make it so that any user launching either of
>>> these tools would automatically have access to some custom functions. In
>>> the SparkShell I can do this by adding additional lines to the init section
>>> but I think It would be nice if we could pass in a parameter which would
>>> point to a class with a list of additional functions to add to all new
>>> session states.
>>>
>>> An interface like Spark Sessions Extensions but instead of running
>>> during Session Init, it would run after session init has completed.
>>>
>>> Thanks for your time and I would be glad to hear any opinions or ideas
>>> on this,
>>>
>>


Re: Adding Extension to Load Custom functions into Thriftserver/SqlShell

2018-09-26 Thread Mark Hamstra
You're talking about users starting Thriftserver or SqlShell from the
command line, right? It's much easier if you are starting a Thriftserver
programmatically so that you can register functions when initializing a
SparkContext and then  HiveThriftServer2.startWithContext using that
context.
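
A minimal sketch of that programmatic approach (the UDF name and logic are
purely illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ThriftServerWithCustomFunctions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("thriftserver-with-custom-functions")
      .enableHiveSupport()
      .getOrCreate()

    // Register whatever functions should be visible to JDBC/ODBC clients.
    spark.udf.register("my_custom_upper",
      (s: String) => if (s == null) null else s.toUpperCase)

    // Start the Thriftserver against this context; the functions registered
    // above are part of the context it serves.
    HiveThriftServer2.startWithContext(spark.sqlContext)
  }
}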

On Wed, Sep 26, 2018 at 3:30 PM Russell Spitzer 
wrote:

> I've been looking recently on possible avenues to load new functions into
> the Thriftserver and SqlShell at launch time. I basically want to preload a
> set of functions in addition to those already present in the Spark Code.
> I'm not sure there is at present a way to do this and I was wondering if
> anyone had any ideas.
>
> I would basically want to make it so that any user launching either of
> these tools would automatically have access to some custom functions. In
> the SparkShell I can do this by adding additional lines to the init section
> but I think it would be nice if we could pass in a parameter which would
> point to a class with a list of additional functions to add to all new
> session states.
>
> An interface like Spark Sessions Extensions but instead of running during
> Session Init, it would run after session init has completed.
>
> Thanks for your time and I would be glad to hear any opinions or ideas on
> this,
>


Re: ***UNCHECKED*** Re: Re: Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Mark Hamstra
That's overstated. We will also block for a data correctness issue -- and
that is, arguably, what this is.

On Wed, Sep 19, 2018 at 12:21 AM Reynold Xin  wrote:

> We also only block if it is a new regression.
>
> On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao 
> wrote:
>
>> Hi Marco,
>>
>> From my understanding of SPARK-25454, I don't think it is a blocker issue;
>> it might be a corner case, so personally I don't want to block the release
>> of 2.3.2 because of this issue. The release has been delayed for a long
>> time.
>>
>> On Wed, Sep 19, 2018 at 2:58 PM, Marco Gaido  wrote:
>>
>>> Sorry, I am -1 because of SPARK-25454 which is a regression from 2.2.
>>>
>>> On Wed, Sep 19, 2018 at 03:45, Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>>
 +1.

 I tested with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
 -Phive-thriftserve` on OpenJDK(1.8.0_181)/CentOS 7.5.

 I hit the following test case failure once during testing, but it's not
 persistent.

 KafkaContinuousSourceSuite
 ...
 subscribing topic by name from earliest offsets (failOnDataLoss:
 false) *** FAILED ***

 Thank you, Saisai.

 Bests,
 Dongjoon.

 On Mon, Sep 17, 2018 at 6:48 PM Saisai Shao 
 wrote:

> +1 from my own side.
>
> Thanks
> Saisai
>
> On Tue, Sep 18, 2018 at 9:34 AM, Wenchen Fan  wrote:
>
>> +1. All the blocker issues are all resolved in 2.3.2 AFAIK.
>>
>> On Tue, Sep 18, 2018 at 9:23 AM Sean Owen  wrote:
>>
>>> +1 . Licenses and sigs check out as in previous 2.3.x releases. A
>>> build from source with most profiles passed for me.
>>> On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao 
>>> wrote:
>>> >
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version 2.3.2.
>>> >
>>> > The vote is open until September 21 PST and passes if a majority
>>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 2.3.2
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v2.3.2-rc6 (commit
>>> 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
>>> > https://github.com/apache/spark/tree/v2.3.2-rc6
>>> >
>>> > The release files, including signatures, digests, etc. can be
>>> found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >
>>> > The staging repository for this release can be found at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1286/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
>>> >
>>> > The list of bug fixes going into 2.3.2 can be found at the
>>> following URL:
>>> > https://issues.apache.org/jira/projects/SPARK/versions/12343289
>>> >
>>> >
>>> > FAQ
>>> >
>>> > =
>>> > How can I help test this release?
>>> > =
>>> >
>>> > If you are a Spark user, you can help us test this release by
>>> taking
>>> > an existing Spark workload and running on this release candidate,
>>> then
>>> > reporting any regressions.
>>> >
>>> > If you're working in PySpark you can set up a virtual env and
>>> install
>>> > the current RC and see if anything important breaks, in the
>>> Java/Scala
>>> > you can add the staging repository to your project's resolvers and test
>>> > with the RC (make sure to clean up the artifact cache before/after so
>>> > you don't end up building with an out-of-date RC going forward).
>>> >
>>> > ===
>>> > What should happen to JIRA tickets still targeting 2.3.2?
>>> > ===
>>> >
>>> > The current list of open tickets targeted at 2.3.2 can be found at:
>>> > https://issues.apache.org/jira/projects/SPARK and search for
>>> "Target Version/s" = 2.3.2
>>> >
>>> > Committers should look at those and triage. Extremely important bug
>>> > fixes, documentation, and API tweaks that impact compatibility
>>> should
>>> > be worked on immediately. Everything else please retarget to an
>>> > appropriate release.
>>> >
>>> > ==
>>> > But my bug isn't fixed?
>>> > ==
>>> >
>>> > In order to make timely releases, we will typically not hold the
>>> > release unless the bug in question is a regression from the
>>> previous
>>> > release. That being said, if there is 

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
What is the disadvantage to deprecating now in 2.4.0? I mean, it doesn't
change the code at all; it's just a notification that we will eventually
cease supporting Py2. Wouldn't users prefer to get that notification sooner
rather than later?

On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia 
wrote:

> I’d like to understand the maintenance burden of Python 2 before
> deprecating it. Since it is not EOL yet, it might make sense to only
> deprecate it once it’s EOL (which is still over a year from now).
> Supporting Python 2+3 seems less burdensome than supporting, say, multiple
> Scala versions in the same codebase, so what are we losing out?
>
> The other thing is that even though Python core devs might not support 2.x
> later, it’s quite possible that various Linux distros will if moving from 2
> to 3 remains painful. In that case, we may want Apache Spark to continue
> releasing for it despite the Python core devs not supporting it.
>
> Basically, I’d suggest to deprecate this in Spark 3.0 and then remove it
> later in 3.x instead of deprecating it in 2.4. I’d also consider looking at
> what other data science tools are doing before fully removing it: for
> example, if Pandas and TensorFlow no longer support Python 2 past some
> point, that might be a good point to remove it.
>
> Matei
>
> > On Sep 17, 2018, at 11:01 AM, Mark Hamstra 
> wrote:
> >
> > If we're going to do that, then we need to do it right now, since 2.4.0
> is already in release candidates.
> >
> > On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson 
> wrote:
> > I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem
> like a ways off but even now there may be some spark versions supporting
> Py2 past the point where Py2 is no longer receiving security patches
> >
> >
> > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra 
> wrote:
> > We could also deprecate Py2 already in the 2.4.0 release.
> >
> > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
> wrote:
> > In case this didn't make it onto this thread:
> >
> > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
> remove it entirely on a later 3.x release.
> >
> > On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
> wrote:
> > On a separate dev@spark thread, I raised a question of whether or not
> to support python 2 in Apache Spark, going forward into Spark 3.0.
> >
> > Python-2 is going EOL at the end of 2019. The upcoming release of Spark
> 3.0 is an opportunity to make breaking changes to Spark's APIs, and so it
> is a good time to consider support for Python-2 on PySpark.
> >
> > Key advantages to dropping Python 2 are:
> >   • Support for PySpark becomes significantly easier.
> >   • Avoid having to support Python 2 until Spark 4.0, which is
> likely to imply supporting Python 2 for some time after it goes EOL.
> > (Note that supporting python 2 after EOL means, among other things, that
> PySpark would be supporting a version of python that was no longer
> receiving security patches)
> >
> > The main disadvantage is that PySpark users who have legacy python-2
> code would have to migrate their code to python 3 to take advantage of
> Spark 3.0
> >
> > This decision obviously has large implications for the Apache Spark
> community and we want to solicit community feedback.
> >
> >
>
>


Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
If we're going to do that, then we need to do it right now, since 2.4.0 is
already in release candidates.

On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson  wrote:

> I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem
> like a ways off but even now there may be some spark versions supporting
> Py2 past the point where Py2 is no longer receiving security patches
>
>
> On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra 
> wrote:
>
>> We could also deprecate Py2 already in the 2.4.0 release.
>>
>> On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
>> wrote:
>>
>>> In case this didn't make it onto this thread:
>>>
>>> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
>>> remove it entirely on a later 3.x release.
>>>
>>> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
>>> wrote:
>>>
>>>> On a separate dev@spark thread, I raised a question of whether or not
>>>> to support python 2 in Apache Spark, going forward into Spark 3.0.
>>>>
>>>> Python-2 is going EOL <https://github.com/python/devguide/pull/344> at
>>>> the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
>>>> make breaking changes to Spark's APIs, and so it is a good time to consider
>>>> support for Python-2 on PySpark.
>>>>
>>>> Key advantages to dropping Python 2 are:
>>>>
>>>>- Support for PySpark becomes significantly easier.
>>>>- Avoid having to support Python 2 until Spark 4.0, which is likely
>>>>to imply supporting Python 2 for some time after it goes EOL.
>>>>
>>>> (Note that supporting python 2 after EOL means, among other things,
>>>> that PySpark would be supporting a version of python that was no longer
>>>> receiving security patches)
>>>>
>>>> The main disadvantage is that PySpark users who have legacy python-2
>>>> code would have to migrate their code to python 3 to take advantage of
>>>> Spark 3.0
>>>>
>>>> This decision obviously has large implications for the Apache Spark
>>>> community and we want to solicit community feedback.
>>>>
>>>>
>>>


Re: Python friendly API for Spark 3.0

2018-09-16 Thread Mark Hamstra
>
> difficult to reconcile
>

That's a big chunk of what I'm getting at: How much is it even possible to
do this kind of reconciliation from the underlying implementation to a more
normal/expected/friendly API for a given programming environment? How much
more work is it for us to maintain multiple such reconciliations, one for
each environment? Do we even need to do it at all, or can we push off such
higher-level reconciliations to 3rd-party efforts like Frameless?


On Sun, Sep 16, 2018 at 2:12 PM Reynold Xin  wrote:

> Most of those are pretty difficult to add though, because they are
> fundamentally difficult to do in a distributed setting and with lazy
> execution.
>
> We should add some but at some point there are fundamental differences
> between the underlying execution engine that are pretty difficult to
> reconcile.
>
> On Sun, Sep 16, 2018 at 2:09 PM Matei Zaharia 
> wrote:
>
>> My 2 cents on this is that the biggest room for improvement in Python is
>> similarity to Pandas. We already made the Python DataFrame API different
>> from Scala/Java in some respects, but if there’s anything we can do to make
>> it more obvious to Pandas users, that will help the most. The other issue
>> though is that a bunch of Pandas functions are just missing in Spark — it
>> would be awesome to set up an umbrella JIRA to just track those and let
>> people fill them in.
>>
>> Matei
>>
>> > On Sep 16, 2018, at 1:02 PM, Mark Hamstra 
>> wrote:
>> >
>> > It's not splitting hairs, Erik. It's actually very close to something
>> that I think deserves some discussion (perhaps on a separate thread.) What
>> I've been thinking about also concerns API "friendliness" or style. The
>> original RDD API was very intentionally modeled on the Scala parallel
>> collections API. That made it quite friendly for some Scala programmers,
>> but not as much so for users of the other language APIs when they
>> eventually came about. Similarly, the Dataframe API drew a lot from pandas
>> and R, so it is relatively friendly for those used to those abstractions.
>> Of course, the Spark SQL API is modeled closely on HiveQL and standard SQL.
>> The new barrier scheduling draws inspiration from MPI. With all of these
>> models and sources of inspiration, as well as multiple language targets,
>> there isn't really a strong sense of coherence across Spark -- I mean, even
>> though one of the key advantages of Spark is the ability to do within a
>> single framework things that would otherwise require multiple frameworks,
>> actually doing that is requiring more than one programming style or
>> multiple design abstractions more than what is strictly necessary even when
>> writing Spark code in just a single language.
>> >
>> > For me, that raises questions over whether we want to start designing,
>> implementing and supporting APIs that are designed to be more consistent,
>> friendly and idiomatic to particular languages and abstractions -- e.g. an
>> API covering all of Spark that is designed to look and feel as much like
>> "normal" code for a Python programmer, another that looks and feels more
>> like "normal" Java code, another for Scala, etc. That's a lot more work and
>> support burden than the current approach where sometimes it feels like you
>> are writing "normal" code for your prefered programming environment, and
>> sometimes it feels like you are trying to interface with something foreign,
>> but underneath it hopefully isn't too hard for those writing the
>> implementation code below the APIs, and it is not too hard to maintain
>> multiple language bindings that are each fairly lightweight.
>> >
>> > It's a cost-benefit judgement, of course, whether APIs that are heavier
>> (in terms of implementing and maintaining) and friendlier (for end users)
>> are worth doing, and maybe some of these "friendlier" APIs can be done
>> outside of Spark itself (imo, Frameless is doing a very nice job for the
>> parts of Spark that it is currently covering --
>> https://github.com/typelevel/frameless); but what we have currently is a
>> bit too ad hoc and fragmentary for my taste.
>> >
>> > On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson 
>> wrote:
>> > I am probably splitting hairs to finely, but I was considering the
>> difference between improvements to the jvm-side (py4j and the scala/java
>> code) that would make it easier to write the python layer ("python-friendly
>> api"), and actual improvements to the python layers ("friendly python api").
>> >
>

Re: Python friendly API for Spark 3.0

2018-09-16 Thread Mark Hamstra
It's not splitting hairs, Erik. It's actually very close to something that
I think deserves some discussion (perhaps on a separate thread.) What I've
been thinking about also concerns API "friendliness" or style. The original
RDD API was very intentionally modeled on the Scala parallel collections
API. That made it quite friendly for some Scala programmers, but not as
much so for users of the other language APIs when they eventually came
about. Similarly, the Dataframe API drew a lot from pandas and R, so it is
relatively friendly for those used to those abstractions. Of course, the
Spark SQL API is modeled closely on HiveQL and standard SQL. The new
barrier scheduling draws inspiration from MPI. With all of these models and
sources of inspiration, as well as multiple language targets, there isn't
really a strong sense of coherence across Spark -- I mean, even though one
of the key advantages of Spark is the ability to do within a single
framework things that would otherwise require multiple frameworks, actually
doing that is requiring more than one programming style or multiple design
abstractions more than what is strictly necessary even when writing Spark
code in just a single language.

For me, that raises questions over whether we want to start designing,
implementing and supporting APIs that are designed to be more consistent,
friendly and idiomatic to particular languages and abstractions -- e.g. an
API covering all of Spark that is designed to look and feel as much like
"normal" code for a Python programmer, another that looks and feels more
like "normal" Java code, another for Scala, etc. That's a lot more work and
support burden than the current approach where sometimes it feels like you
are writing "normal" code for your prefered programming environment, and
sometimes it feels like you are trying to interface with something foreign,
but underneath it hopefully isn't too hard for those writing the
implementation code below the APIs, and it is not too hard to maintain
multiple language bindings that are each fairly lightweight.

It's a cost-benefit judgement, of course, whether APIs that are heavier (in
terms of implementing and maintaining) and friendlier (for end users) are
worth doing, and maybe some of these "friendlier" APIs can be done outside
of Spark itself (imo, Frameless is doing a very nice job for the parts of
Spark that it is currently covering --
https://github.com/typelevel/frameless); but what we have currently is a
bit too ad hoc and fragmentary for my taste.

On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson  wrote:

> I am probably splitting hairs to finely, but I was considering the
> difference between improvements to the jvm-side (py4j and the scala/java
> code) that would make it easier to write the python layer ("python-friendly
> api"), and actual improvements to the python layers ("friendly python api").
>
> They're not mutually exclusive of course, and both worth working on. But
> it's *possible* to improve either without the other.
>
> Stub files look like a great solution for type annotations, maybe even if
> only python 3 is supported.
>
> I definitely agree that any decision to drop python 2 should not be taken
> lightly. Anecdotally, I'm seeing an increase in python developers
> announcing that they are dropping support for python 2 (and loving it). As
> people have already pointed out, if we don't drop python 2 for spark 3.0,
> we're stuck with it until 4.0, which would place spark in a
> possibly-awkward position of supporting python 2 for some time after it
> goes EOL.
>
> Under the current release cadence, spark 3.0 will land some time in early
> 2019, which at that point will be mere months until EOL for py2.
>
> On Fri, Sep 14, 2018 at 5:01 PM, Holden Karau 
> wrote:
>
>>
>>
>> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson  wrote:
>>
>>> To be clear, is this about "python-friendly API" or "friendly python
>>> API" ?
>>>
>> Well what would you consider to be different between those two
>> statements? I think it would be good to be a bit more explicit, but I don't
>> think we should necessarily limit ourselves.
>>
>>>
>>> On the python side, it might be nice to take advantage of static typing.
>>> Requires python 3.6 but with python 2 going EOL, a spark-3.0 might be a
>>> good opportunity to jump the python-3-only train.
>>>
>> I think we can make types sort of work without ditching 2 (the types only
>> would work in 3 but it would still function in 2). Ditching 2 entirely
>> would be a big thing to consider, I honestly hadn't been considering that
>> but it could be from just spending so much time maintaining a 2/3 code
>> base. I'd suggest reaching out to to user@ before making that kind of
>> change.
>>
>>>
>>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau 
>>> wrote:
>>>
 Since we're talking about Spark 3.0 in the near future (and since some
 recent conversation on a proposed change reminded me) I wanted to open up
 the floor and see if folks have 

Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Mark Hamstra
We could also deprecate Py2 already in the 2.4.0 release.

On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson  wrote:

> In case this didn't make it onto this thread:
>
> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove
> it entirely on a later 3.x release.
>
> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
> wrote:
>
>> On a separate dev@spark thread, I raised a question of whether or not to
>> support python 2 in Apache Spark, going forward into Spark 3.0.
>>
>> Python-2 is going EOL  at
>> the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
>> make breaking changes to Spark's APIs, and so it is a good time to consider
>> support for Python-2 on PySpark.
>>
>> Key advantages to dropping Python 2 are:
>>
>>- Support for PySpark becomes significantly easier.
>>- Avoid having to support Python 2 until Spark 4.0, which is likely
>>to imply supporting Python 2 for some time after it goes EOL.
>>
>> (Note that supporting python 2 after EOL means, among other things, that
>> PySpark would be supporting a version of python that was no longer
>> receiving security patches)
>>
>> The main disadvantage is that PySpark users who have legacy python-2 code
>> would have to migrate their code to python 3 to take advantage of Spark 3.0
>>
>> This decision obviously has large implications for the Apache Spark
>> community and we want to solicit community feedback.
>>
>>
>


Re: time for Apache Spark 3.0?

2018-09-06 Thread Mark Hamstra
Yes, that is why we have these annotations in the code and the
corresponding labels appearing in the API documentation:
https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java

As long as it is properly annotated, we can change or even eliminate an API
method before the next major release. And frankly, we shouldn't be
contemplating bringing in the DS v2 API (and, I'd argue, *any* new API)
without such an annotation. There is just too much risk of not getting
everything right before we see the results of the new API being more widely
used, and too much cost in maintaining something we come to regret until the
next major release, for us to create a new API in a fully frozen state.
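
For concreteness, a minimal sketch of how such an annotation is applied; the
trait and method are purely illustrative and not an actual Spark API:

import org.apache.spark.annotation.InterfaceStability
import org.apache.spark.sql.types.StructType

@InterfaceStability.Evolving
trait MyNewReadSupport {
  // Marked Evolving: the contract can still change, or even be removed,
  // before it is promoted to Stable in a later release.
  def readSchema(): StructType
}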


On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue  wrote:

> It would be great to get more features out incrementally. For experimental
> features, do we have more relaxed constraints?
>
> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:
>
>> +1 on 3.0
>>
>> Dsv2 stable can still evolve in across major releases. DataFrame,
>> Dataset, dsv1 and a lot of other major features all were developed
>> throughout the 1.x and 2.x lines.
>>
>> I do want to explore ways for us to get dsv2 incremental changes out
>> there more frequently, to get feedback. Maybe that means we apply additive
>> changes to 2.4.x; maybe that means making another 2.5 release sooner. I
>> will start a separate thread about it.
>>
>>
>>
>> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
>>
>>> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
>>> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
>>> happen before 3.x? if it's a significant change, seems reasonable for a
>>> major version bump rather than minor. Is the concern that tying it to 3.0
>>> means you have to take a major version update to get it?
>>>
>>> I generally support moving on to 3.x so we can also jettison a lot of
>>> older dependencies, code, fix some long standing issues, etc.
>>>
>>> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>>>
>>> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
>>> wrote:
>>>
 My concern is that the v2 data source API is still evolving and not
 very close to stable. I had hoped to have stabilized the API and behaviors
 for a 3.0 release. But we could also wait on that for a 4.0 release,
 depending on when we think that will be.

 Unless there is a pressing need to move to 3.0 for some other area, I
 think it would be better for the v2 sources to have a 2.5 release.

 On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:

> Yesterday, the 2.4 branch was created. Based on the above discussion,
> I think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Naming policy for packages

2018-08-15 Thread Mark Hamstra
While it is permissible to have a Maven identifier like "spark-foo" from
"org.bar", I'll agree with Sean that avoiding that kind of name is often
wiser. It is just too easy to slip into prohibited usage if the most
popular, de facto identification turns out to become "spark-foo" instead of
something like "Foo for Apache Spark".

On Wed, Aug 15, 2018 at 11:47 AM Koert Kuipers  wrote:

> ok it doesn't sound so bad if the maven identifier can have spark in it. no
> big deal!
>
> otherwise i was going to suggest "kraps". like kraps-xml
>
> scala> "spark".reverse
> res0: String = kraps
>
>
> On Wed, Aug 15, 2018 at 2:43 PM, Sean Owen  wrote:
>
>> I'd refer you again to the trademark policy. In the first link I see
>> projects whose software ID is like "spark-foo" but title/subtitle is like
>> "Foo for Apache Spark". This is OK. 'sparklyr' is in a gray area we've
>> talked about before; see https://www.apache.org/foundation/marks/ as
>> well. I think it's in a gray area, myself.
>>
>> My best advice to anyone is to avoid this entirely by just not naming
>> your project anything like 'spark'.
>>
>> On Wed, Aug 15, 2018 at 10:39 AM <0xf0f...@protonmail.com> wrote:
>>
>>> Does it mean that the majority of Spark-related projects, including top
>>> Databricks (
>>> https://github.com/databricks?utf8=%E2%9C%93=spark==)
>>> or RStudio (sparklyr) contributions, violate the trademark?
>>>
>>>
>>> Sent with ProtonMail  Secure Email.
>>>
>>> ‐‐‐ Original Message ‐‐‐
>>> On August 15, 2018 5:51 PM, Sean Owen  wrote:
>>>
>>> You might be interested in the full policy:
>>> https://spark.apache.org/trademarks.html
>>>
>>> What it is trying to prevent is confusion. Is spark-xml from the Spark
>>> project? Sounds like it, but who knows? What if a vendor releases ASFSpark
>>> 3.0? Are people going to think this is an official real project release?
>>>
>>> You can release 'Foo for Apache Spark'. You can use shorthand like
>>> foo-spark in software identifiers like Maven coordinates.
>>>
>>> Keeping trademark rights is essential in OSS and part of it is making an
>>> effort to assert that right.
>>>
>>>
>>>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-08 Thread Mark Hamstra
I'm inclined to agree. Just saying that it is not a regression doesn't
really cut it when it is a now known data correctness issue. We need
something a lot more than nothing before releasing 2.4.0. At a barest
minimum, that has to be much more complete and publicly highlighted
documentation of the issue so that users are less likely to stumble into
this unaware; but really we need to fix at least the most common cases of
this bug. Backports to maintenance branches are also probably in order.

On Wed, Aug 8, 2018 at 7:06 AM Imran Rashid 
wrote:

> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
>>
>> SPARK-23243 : 
>> Shuffle+Repartition
>> on an RDD could lead to incorrect answers
>> It turns out to be a very complicated issue, there is no consensus about
>> what is the right fix yet. Likely to miss it in Spark 2.4 because it's a
>> long-standing issue, not a regression.
>>
>
> This is a really serious data loss bug.  Yes its very complex, but we
> absolutely have to fix this, I really think it should be in 2.4.
> Has worked on it stopped?
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Mark Hamstra
No reasonable amount of time is likely going to be sufficient to fully vet
the code as a PR. I'm not entirely happy with the design and code as they
currently are (and I'm still trying to find the time to more publicly
express my thoughts and concerns), but I'm fine with them going into 2.4
much as they are as long as they go in with proper stability annotations
and are understood not to be cast-in-stone final implementations, but
rather as a way to get people using them and generating the feedback that
is necessary to get us to something more like a final design and
implementation.

On Tue, Jul 31, 2018 at 11:54 AM Erik Erlandson  wrote:

>
> Barrier mode seems like a high impact feature on Spark's core code: is one
> additional week enough time to properly vet this feature?
>
> On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres <
> joseph.tor...@databricks.com> wrote:
>
>> Full continuous processing aggregation support ran into unanticipated
>> scalability and scheduling problems. We’re planning to overcome those by
>> using some of the barrier execution machinery, but since barrier execution
>> itself is still in progress the full support isn’t going to make it into
>> 2.4.
>>
>> Jose
>>
>> On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda 
>> wrote:
>>
>>> Hi,
>>>
>>> what is the status of Continuous Processing + Aggregations? As far as I
>>> remember, Jose Torres said it should be easy to perform aggregations if
>>> coalesce(1) works. IIRC it's already merged to master.
>>>
>>> Is this work in progress? If yes, it would be great to have full
>>> aggregation/join support in Spark 2.4 in CP.
>>>
>>> Pozdrawiam / Best regards,
>>>
>>> Tomek
>>>
>>>
>>> On 2018-07-31 10:43, Petar Zečević wrote:
>>> > This one is important to us:
>>> https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join
>>> inner range optimization) but I think it could be useful to others too.
>>> >
>>> > It is finished and is ready to be merged (was ready a month ago at
>>> least).
>>> >
>>> > Do you think you could consider including it in 2.4?
>>> >
>>> > Petar
>>> >
>>> >
>>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>>> >
>>> >> I went through the open JIRA tickets and here is a list that we
>>> should consider for Spark 2.4:
>>> >>
>>> >> High Priority:
>>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>>> >> This one is critical to the Spark ecosystem for deep learning. It
>>> only has a few remaining works and I think we should have it in Spark 2.4.
>>> >>
>>> >> Middle Priority:
>>> >> SPARK-23899: Built-in SQL Function Improvement
>>> >> We've already added a lot of built-in functions in this release, but
>>> there are a few useful higher-order functions in progress, like
>>> `array_except`, `transform`, etc. It would be great if we can get them in
>>> Spark 2.4.
>>> >>
>>> >> SPARK-14220: Build and test Spark against Scala 2.12
>>> >> Very close to finishing, great to have it in Spark 2.4.
>>> >>
>>> >> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
>>> >> This one is there for years (thanks for your patience Michael!), and
>>> is also close to finishing. Great to have it in 2.4.
>>> >>
>>> >> SPARK-24882: data source v2 API improvement
>>> >> This is to improve the data source v2 API based on what we learned
>>> during this release. From the migration of existing sources and design of
>>> new features, we found some problems in the API and want to address them. I
>>> believe this should be
>>> >> the last significant API change to data source v2, so great to have
>>> in Spark 2.4. I'll send a discuss email about it later.
>>> >>
>>> >> SPARK-24252: Add catalog support in Data Source V2
>>> >> This is a very important feature for data source v2, and is currently
>>> being discussed in the dev list.
>>> >>
>>> >> SPARK-24768: Have a built-in AVRO data source implementation
>>> >> Most of it is done, but date/timestamp support is still missing.
>>> Great to have in 2.4.
>>> >>
>>> >> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect
>>> answers
>>> >> This is a long-standing correctness bug, great to have in 2.4.
>>> >>
>>> >> There are some other important features like the adaptive execution,
>>> streaming SQL, etc., not in the list, since I think we are not able to
>>> finish them before 2.4.
>>> >>
>>> >> Feel free to add more things if you think they are important to Spark
>>> 2.4 by replying to this email.
>>> >>
>>> >> Thanks,
>>> >> Wenchen
>>> >>
>>> >> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen  wrote:
>>> >>
>>> >>   In theory releases happen on a time-based cadence, so it's pretty
>>> much wrap up what's ready by the code freeze and ship it. In practice, the
>>> cadence slips frequently, and it's very much a negotiation about what
>>> features should push the
>>> >>   code freeze out a few weeks every time. So, kind of a hybrid
>>> approach here that works OK.
>>> >>
>>> >>   Certainly speak up if you think there's something that really needs
>>> to get into 2.4. 

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Mark Hamstra
See some of the related discussion under
https://github.com/apache/spark/pull/21589

It feels to me like we need some kind of user-code mechanism to signal
policy preferences to Spark. This could also include ways to signal
scheduling policy, which could include things like scheduling pool and/or
barrier scheduling. Some of those scheduling policies operate at inherently
different levels currently -- e.g. scheduling pools at the Job level
(really, the thread local level in the current implementation) and barrier
scheduling at the Stage level -- so it is not completely obvious how to
unify all of these policy options/preferences/mechanism, or whether it is
possible, but I think it is worth considering such things at a fairly high
level of abstraction and try to unify and simplify before making things
more complex with multiple policy mechanisms.
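
For concreteness, a minimal Scala sketch of two existing code-level
mechanisms touched on in this thread: the thread-local scheduling-pool
property, and the repartition/coalesce calls that the quoted proposal below
wants to expose through a SQL hint (the pool name, paths, and object name are
illustrative assumptions):

import org.apache.spark.sql.SparkSession

object PolicySignalsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("policy-signals-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Job-level (really thread-local) scheduling policy: jobs submitted from
    // this thread are assigned to the named fair-scheduler pool.
    sc.setLocalProperty("spark.scheduler.pool", "reporting")

    val df = spark.range(0, 1000000)

    // Controlling the number of output files currently requires code:
    // a full shuffle into 200 partitions gives roughly 200 output files...
    df.repartition(200).write.mode("overwrite").parquet("/tmp/illustrative/wide")
    // ...while coalescing down to 4 partitions without a shuffle gives 4 files.
    df.coalesce(4).write.mode("overwrite").parquet("/tmp/illustrative/narrow")

    // Reset to the default pool for subsequent jobs from this thread.
    sc.setLocalProperty("spark.scheduler.pool", null)
    spark.stop()
  }
}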

On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin  wrote:

> Seems like a good idea in general. Do other systems have similar concepts?
> In general it'd be easier if we can follow existing convention if there is
> any.
>
>
> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge  wrote:
>
>> Hi all,
>>
>> Many Spark users in my company are asking for a way to control the number
>> of output files in Spark SQL. There are use cases to either reduce or
>> increase the number. The users prefer not to use function *repartition*(n)
>> or *coalesce*(n, shuffle) that require them to write and deploy
>> Scala/Java/Python code.
>>
>> Could we introduce a query hint for this purpose (similar to Broadcast
>> Join Hints)?
>>
>> /*+ *COALESCE*(n, shuffle) */
>>
>> In general, is a query hint the best way to bring DF functionality to
>> SQL without extending SQL syntax? Any suggestion is highly appreciated.
>>
>> This requirement is not the same as SPARK-6221 that asked for
>> auto-merging output files.
>>
>> Thanks,
>> John Zhuge
>>
>


Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-19 Thread Mark Hamstra
Yeah, I was mostly thinking that, if the normal Spark PR tests were set up
to check the sigs (every time? some of the time?), then this could serve as
an automatic check that nothing funny has been done to the archives. There
shouldn't be any difference between the cache and the archive; but if there
ever is, then we may well have a serious security problem.

On Thu, Jul 19, 2018 at 12:41 PM Sean Owen  wrote:

> Yeah if the test code keeps around the archive and/or digest of what it
> unpacked. A release should never be modified though, so highly rare.
>
> If the worry is hacked mirrors then we might have bigger problems, but
> there the issue is verifying the download sigs in the first place. Those
> would have to come from archive.apache.org.
>
> If you're up for it, yes that could be a fine security precaution.
>
> On Thu, Jul 19, 2018, 2:11 PM Mark Hamstra 
> wrote:
>
>> Is there or should there be some checking of digests just to make sure
>> that we are really testing against the same thing in /tmp/test-spark that
>> we are distributing from the archive?
>>
>> On Thu, Jul 19, 2018 at 11:15 AM Sean Owen  wrote:
>>
>>> Ideally, that list is updated with each release, yes. Non-current
>>> releases will now always download from archive.apache.org though. But
>>> we run into rate-limiting problems if that gets pinged too much. So yes
>>> good to keep the list only to current branches.
>>>
>>> It looks like the download is cached in /tmp/test-spark, for what it's
>>> worth.
>>>
>>> On Thu, Jul 19, 2018 at 11:06 AM Felix Cheung 
>>> wrote:
>>>
>>>> +1 this has been problematic.
>>>>
>>>> Also, this list needs to be updated every time we make a new release?
>>>>
>>>> Plus can we cache them on Jenkins, maybe we can avoid downloading the
>>>> same thing from Apache archive every test run.
>>>>
>>>>
>>>> --
>>>> *From:* Marco Gaido 
>>>> *Sent:* Monday, July 16, 2018 11:12 PM
>>>> *To:* Hyukjin Kwon
>>>> *Cc:* Sean Owen; dev
>>>> *Subject:* Re: Cleaning Spark releases from mirrors, and the flakiness
>>>> of HiveExternalCatalogVersionsSuite
>>>>
>>>> +1 too
>>>>
>>>> On Tue, 17 Jul 2018, 05:38 Hyukjin Kwon,  wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Jul 17, 2018 at 7:34 AM, Sean Owen wrote:
>>>>>
>>>>>> Fix is committed to branches back through 2.2.x, where this test was
>>>>>> added.
>>>>>>
>>>>>> There is still some issue; I'm seeing that archive.apache.org is
>>>>>> rate-limiting downloads and frequently returning 503 errors.
>>>>>>
>>>>>> We can help, I guess, by avoiding testing against non-current
>>>>>> releases. Right now we should be testing against 2.3.1, 2.2.2, 2.1.3,
>>>>>> right? 2.0.x is now effectively EOL right?
>>>>>>
>>>>>> I can make that quick change too if everyone's amenable, in order to
>>>>>> prevent more failures in this test from master.
>>>>>>
>>>>>> On Sun, Jul 15, 2018 at 3:51 PM Sean Owen  wrote:
>>>>>>
>>>>>>> Yesterday I cleaned out old Spark releases from the mirror system --
>>>>>>> we're supposed to only keep the latest release from active branches out 
>>>>>>> on
>>>>>>> mirrors. (All releases are available from the Apache archive site.)
>>>>>>>
>>>>>>> Having done so I realized quickly that the
>>>>>>> HiveExternalCatalogVersionsSuite relies on the versions it downloads 
>>>>>>> being
>>>>>>> available from mirrors. It has been flaky, as sometimes mirrors are
>>>>>>> unreliable. I think now it will not work for any versions except 2.3.1,
>>>>>>> 2.2.2, 2.1.3.
>>>>>>>
>>>>>>> Because we do need to clean those releases out of the mirrors soon
>>>>>>> anyway, and because they're flaky sometimes, I propose adding logic to 
>>>>>>> the
>>>>>>> test to fall back on downloading from the Apache archive site.
>>>>>>>
>>>>>>> ... and I'll do that right away to unblock
>>>>>>> HiveExternalCatalogVersionsSuite runs. I think it needs to be 
>>>>>>> backported to
>>>>>>> other branches as they will still be testing against potentially
>>>>>>> non-current Spark releases.
>>>>>>>
>>>>>>> Sean
>>>>>>>
>>>>>>


Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-19 Thread Mark Hamstra
Is there or should there be some checking of digests just to make sure that
we are really testing against the same thing in /tmp/test-spark that we are
distributing from the archive?

On Thu, Jul 19, 2018 at 11:15 AM Sean Owen  wrote:

> Ideally, that list is updated with each release, yes. Non-current releases
> will now always download from archive.apache.org though. But we run into
> rate-limiting problems if that gets pinged too much. So yes good to keep
> the list only to current branches.
>
> It looks like the download is cached in /tmp/test-spark, for what it's
> worth.
>
> On Thu, Jul 19, 2018 at 11:06 AM Felix Cheung 
> wrote:
>
>> +1 this has been problematic.
>>
>> Also, this list needs to be updated every time we make a new release?
>>
>> Plus, can we cache them on Jenkins? Maybe we can avoid downloading the
>> same thing from the Apache archive on every test run.
>>
>>
>> --
>> *From:* Marco Gaido 
>> *Sent:* Monday, July 16, 2018 11:12 PM
>> *To:* Hyukjin Kwon
>> *Cc:* Sean Owen; dev
>> *Subject:* Re: Cleaning Spark releases from mirrors, and the flakiness
>> of HiveExternalCatalogVersionsSuite
>>
>> +1 too
>>
>> On Tue, 17 Jul 2018, 05:38 Hyukjin Kwon,  wrote:
>>
>>> +1
>>>
>>> 2018년 7월 17일 (화) 오전 7:34, Sean Owen 님이 작성:
>>>
 Fix is committed to branches back through 2.2.x, where this test was
 added.

 There is still some issue; I'm seeing that archive.apache.org is
 rate-limiting downloads and frequently returning 503 errors.

 We can help, I guess, by avoiding testing against non-current releases.
 Right now we should be testing against 2.3.1, 2.2.2, 2.1.3, right? 2.0.x is
 now effectively EOL right?

 I can make that quick change too if everyone's amenable, in order to
 prevent more failures in this test from master.

 On Sun, Jul 15, 2018 at 3:51 PM Sean Owen  wrote:

> Yesterday I cleaned out old Spark releases from the mirror system --
> we're supposed to only keep the latest release from active branches out on
> mirrors. (All releases are available from the Apache archive site.)
>
> Having done so I realized quickly that the
> HiveExternalCatalogVersionsSuite relies on the versions it downloads being
> available from mirrors. It has been flaky, as sometimes mirrors are
> unreliable. I think now it will not work for any versions except 2.3.1,
> 2.2.2, 2.1.3.
>
> Because we do need to clean those releases out of the mirrors soon
> anyway, and because they're flaky sometimes, I propose adding logic to the
> test to fall back on downloading from the Apache archive site.
>
> ... and I'll do that right away to unblock
> HiveExternalCatalogVersionsSuite runs. I think it needs to be backported 
> to
> other branches as they will still be testing against potentially
> non-current Spark releases.
>
> Sean
>



Re: time for Apache Spark 3.0?

2018-06-15 Thread Mark Hamstra
Changing major version numbers is not about new features or a vague notion
that it is time to do something that will be seen to be a significant
release. It is about breaking stable public APIs.

I still remain unconvinced that the next version can't be 2.4.0.

On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:

> *Dear all:*
>
> It has been two months since this topic was proposed. Any progress? 2018 is
> already about half over.
>
> I agree that the new version should include some exciting new features. How
> about this one:
>
> *6. ML/DL framework to be integrated as a core component and feature. (Such
> as Angel / BigDL / ……)*
>
> 3.0 is a very important version for a good open source project. It would be
> better to shed the historical burden and *focus on new areas*. Spark is
> already widely used all over the world as a successful big data framework,
> and it can be better than that.
>
>
> *Andy*
>
>
> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin  wrote:
>
>> There was a discussion thread on scala-contributors
>> 
>> about Apache Spark not yet supporting Scala 2.12, and that got me to think
>> perhaps it is about time for Spark to work towards the 3.0 release. By the
>> time it comes out, it will be more than 2 years since Spark 2.0.
>>
>> For contributors less familiar with Spark’s history, I want to give more
>> context on Spark releases:
>>
>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If
>> we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0
>> in 2018.
>>
>> 2. Spark’s versioning policy promises that Spark does not break stable
>> APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>> 2.0, 2.x to 3.0).
>>
>> 3. That said, a major version isn’t necessarily the playground for
>> disruptive API changes to make it painful for users to update. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs.
>>
>> 4. Spark as a project has a culture of evolving architecture and
>> developing major new features incrementally, so major releases are not the
>> only time for exciting new features. For example, the bulk of the work in
>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>> Processing was introduced in Spark 2.3. Both were feature releases rather
>> than major releases.
>>
>>
>> You can find more background in the thread discussing Spark 2.0:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>>
>>
>> The primary motivating factor IMO for a major version bump is to support
>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>> Similar to Spark 2.0, I think there are also opportunities for other
>> changes that we know have been biting us for a long time but can’t be
>> changed in feature releases (to be clear, I’m actually not sure they are
>> all good ideas, but I’m writing them down as candidates for consideration):
>>
>> 1. Support Scala 2.12.
>>
>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 2.x.
>>
>> 3. Shade all dependencies.
>>
>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>> compliant, to prevent users from shooting themselves in the foot, e.g.
>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
>> less painful for users to upgrade here, I’d suggest creating a flag for
>> backward compatibility mode.
>>
>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
>> standard compliant, and have a flag for backward compatibility.
>>
>> 6. Miscellaneous other small changes documented in JIRA already (e.g.
>> “JavaPairRDD flatMapValues requires function returning Iterable, not
>> Iterator”, “Prevent column name duplication in temporary view”).
>>
>>
>> Now the reality of a major version bump is that the world often thinks in
>> terms of what exciting features are coming. I do think there are a number
>> of major changes happening already that can be part of the 3.0 release, if
>> they make it in:
>>
>> 1. Scala 2.12 support (listing it twice)
>> 2. Continuous Processing non-experimental
>> 3. Kubernetes support non-experimental
>> 4. A more fleshed out version of data source API v2 (I don’t think it is
>> realistic to stabilize that in one release)
>> 5. Hadoop 3.0 support
>> 6. ...
>>
>>
>>
>> Similar to the 2.0 discussion, this thread should focus on the framework
>> and whether it’d make sense to create Spark 3.0 as the next release, rather
>> than the individual feature requests. Those are important but are best done
>> in their own separate threads.
>>
>>
>>
>>
>>


Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread Mark Hamstra
+1

On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.3.1.
>
> Given that I expect at least a few people to be busy with Spark Summit next
> week, I'm taking the liberty of setting an extended voting period. The vote
> will be open until Friday, June 8th, at 19:00 UTC (that's 12:00 PDT).
>
> It passes with a majority of +1 votes, which must include at least 3 +1
> votes
> from the PMC.
>
> [ ] +1 Release this package as Apache Spark 2.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.1-rc4 (commit 30aaa5a3):
> https://github.com/apache/spark/tree/v2.3.1-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1272/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-docs/
>
> The list of bug fixes going into 2.3.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
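
For an sbt-based project, that resolver setup might look roughly like this (a
sketch; the staging URL is the one from this vote thread, and the version
string assumes the RC is consumed as 2.3.1):

    // build.sbt fragment (sketch): resolve the RC artifacts from the staging
    // repository and depend on them to smoke-test your own build.
    resolvers += "Apache Spark 2.3.1 RC4 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1272/"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.3.1" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.3.1" % "provided"
    )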
>
> ===
> What should happen to JIRA tickets still targeting 2.3.1?
> ===
>
> The current list of open tickets targeted at 2.3.1 can be found at:
> https://s.apache.org/Q3Uo
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Mark Hamstra
There is no hadoop-2.8 profile. Use hadoop-2.7, which is effectively
hadoop-2.7+

On Fri, Jun 1, 2018 at 4:01 PM Nicholas Chammas 
wrote:

> I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4
> using Flintrock . However, trying
> to load the hadoop-aws package gave me some errors.
>
> $ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4
>
> 
>
> :: problems summary ::
>  WARNINGS
> [NOT FOUND  ] 
> com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle) (2ms)
>  local-m2-cache: tried
>   
> file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-json/1.9/jersey-json-1.9.jar
> [NOT FOUND  ] 
> com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle) (0ms)
>  local-m2-cache: tried
>   
> file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-server/1.9/jersey-server-1.9.jar
> [NOT FOUND  ] 
> org.codehaus.jettison#jettison;1.1!jettison.jar(bundle) (1ms)
>  local-m2-cache: tried
>   
> file:/home/ec2-user/.m2/repository/org/codehaus/jettison/jettison/1.1/jettison-1.1.jar
> [NOT FOUND  ] 
> com.sun.xml.bind#jaxb-impl;2.2.3-1!jaxb-impl.jar (0ms)
>  local-m2-cache: tried
>   
> file:/home/ec2-user/.m2/repository/com/sun/xml/bind/jaxb-impl/2.2.3-1/jaxb-impl-2.2.3-1.jar
>
> I’d guess I’m probably using the wrong version of hadoop-aws, but I
> called make-distribution.sh with -Phadoop-2.8 so I’m not sure what else
> to try.
>
> Any quick pointers?
>
> Nick
> ​
>
> On Fri, Jun 1, 2018 at 6:29 PM Marcelo Vanzin  wrote:
>
>> Starting with my own +1 (binding).
>>
>> On Fri, Jun 1, 2018 at 3:28 PM, Marcelo Vanzin 
>> wrote:
>> > Please vote on releasing the following candidate as Apache Spark
>> version 2.3.1.
>> >
>> > Given that I expect at least a few people to be busy with Spark Summit
>> next
>> > week, I'm taking the liberty of setting an extended voting period. The
>> vote
>> > will be open until Friday, June 8th, at 19:00 UTC (that's 12:00 PDT).
>> >
>> > It passes with a majority of +1 votes, which must include at least 3 +1
>> votes
>> > from the PMC.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.3.1
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v2.3.1-rc4 (commit 30aaa5a3):
>> > https://github.com/apache/spark/tree/v2.3.1-rc4
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1272/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-docs/
>> >
>> > The list of bug fixes going into 2.3.1 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12342432
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out-of-date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 2.3.1?
>> > ===
>> >
>> > The current list of open tickets targeted at 2.3.1 can be found at:
>> > https://s.apache.org/Q3Uo
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>> >
>> >
>> > --
>> > Marcelo
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: 

Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-08 Thread Mark Hamstra
If I am understanding you correctly, you're just saying that the problem is
that you know what you want to keep, not what you want to throw away, and
that there is no unpersist DataFrames call based on that what-to-keep
information.
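
That manual-tracking alternative might look roughly like the sketch below (the
helper and its names are hypothetical, not an existing Spark API):

    import org.apache.spark.sql.DataFrame
    import scala.collection.mutable

    // Hypothetical registry: record every DataFrame you persist, then
    // unpersist everything except the ones you still need.
    object PersistRegistry {
      private val persisted = mutable.Set.empty[DataFrame]

      def persist(df: DataFrame): DataFrame = {
        persisted += df
        df.persist()
      }

      def unpersistAllExcept(keep: DataFrame*): Unit = {
        val keepSet = keep.toSet
        persisted.filterNot(keepSet).foreach(_.unpersist())
        persisted.retain(keepSet.contains)  // drop the unpersisted entries
      }
    }

    // Usage sketch:
    //   val a = PersistRegistry.persist(spark.range(10).toDF())
    //   val b = PersistRegistry.persist(spark.range(20).toDF())
    //   PersistRegistry.unpersistAllExcept(b)  // unpersists a, keeps b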

On Tue, May 8, 2018 at 6:00 AM, Nicholas Chammas  wrote:

> I certainly can, but the problem I’m facing is that of how best to track
> all the DataFrames I no longer want to persist.
>
> I create and persist various DataFrames throughout my pipeline. Spark is
> already tracking all this for me, and exposing some of that tracking
> information via getPersistentRDDs(). So when I arrive at a point in my
> program where I know, “I only need this DataFrame going forward”, I want to
> be able to tell Spark “Please unpersist everything except this one
> DataFrame”. If I cannot leverage the information about persisted DataFrames
> that Spark is already tracking, then the alternative is for me to carefully
> track and unpersist DataFrames when I no longer need them.
>
> I suppose the problem is similar at a high level to garbage collection.
> Tracking and freeing DataFrames manually is analogous to malloc and free;
> and full automation would be Spark automatically unpersisting DataFrames
> when they were no longer referenced or needed. I’m looking for an
> in-between solution that lets me leverage some of the persistence tracking
> in Spark so I don’t have to do it all myself.
>
> Does this make more sense now, from a use case perspective as well as from
> a desired API perspective?
> ​
>
> On Thu, May 3, 2018 at 10:26 PM Reynold Xin  wrote:
>
>> Why do you need the underlying RDDs? Can't you just unpersist the
>> dataframes that you don't need?
>>
>>
>> On Mon, Apr 30, 2018 at 8:17 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> This seems to be an underexposed part of the API. My use case is this: I
>>> want to unpersist all DataFrames except a specific few. I want to do this
>>> because I know at a specific point in my pipeline that I have a handful of
>>> DataFrames that I need, and everything else is no longer needed.
>>>
>>> The problem is that there doesn’t appear to be a way to identify
>>> specific DataFrames (or rather, their underlying RDDs) via
>>> getPersistentRDDs(), which is the only way I’m aware of to ask Spark
>>> for all currently persisted RDDs:
>>>
>>> >>> a = spark.range(10).persist()
>>> >>> a.rdd.id()
>>> 8
>>> >>> list(spark.sparkContext._jsc.getPersistentRDDs().items())
>>> [(3, JavaObject id=o36)]
>>>
>>> As you can see, the id of the persisted RDD, 8, doesn’t match the id
>>> returned by getPersistentRDDs(), 3. So I can’t go through the RDDs
>>> returned by getPersistentRDDs() and know which ones I want to keep.
>>>
>>> id() itself appears to be an undocumented method of the RDD API, and in
>>> PySpark getPersistentRDDs() is buried behind the Java sub-objects
>>> , so I know I’m
>>> reaching here. But is there a way to do what I want in PySpark without
>>> manually tracking everything I’ve persisted myself?
>>>
>>> And more broadly speaking, do we want to add additional APIs, or
>>> formalize currently undocumented APIs like id(), to make this use case
>>> possible?
>>>
>>> Nick
>>> ​
>>>
>>


Re: Fair scheduler pool leak

2018-04-07 Thread Mark Hamstra
>
> Providing a way to set the mode of the default scheduler would be awesome.


That's trivial: Just use the pool configuration XML file and define a pool
named "default" with the characteristics that you want (including
schedulingMode FAIR).

You only get the default construction of the pool named "default" if you
don't define your own "default".
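
A minimal sketch of that setup (the XML follows the standard fair scheduler
allocation-file format; writing it to a temp file from code is just one way to
avoid shipping a separate config file, not something Spark does for you):

    import java.nio.file.Files
    import org.apache.spark.SparkConf

    // Define our own "default" pool with FAIR scheduling inside it.
    val allocations =
      """<?xml version="1.0"?>
        |<allocations>
        |  <pool name="default">
        |    <schedulingMode>FAIR</schedulingMode>
        |    <weight>1</weight>
        |    <minShare>0</minShare>
        |  </pool>
        |</allocations>
        |""".stripMargin

    val allocFile = Files.createTempFile("fairscheduler", ".xml")
    Files.write(allocFile, allocations.getBytes("UTF-8"))

    val conf = new SparkConf()
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", allocFile.toString)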

On Sat, Apr 7, 2018 at 2:32 PM, Matthias Boehm <mboe...@gmail.com> wrote:

> No, these pools are not created per job but per parfor worker and
> thus, used to execute many jobs. For all scripts with a single
> top-level parfor this is equivalent to static initialization. However,
> yes we create these pools dynamically on demand to avoid unnecessary
> initialization and handle scenarios of nested parfor.
>
> At the end of the day, we just want to configure fair scheduling in a
> programmatic way without the need for additional configuration files
> which is a hassle for a library that is meant to work out-of-the-box.
> Simply setting 'spark.scheduler.mode' to FAIR does not do the trick
> because we end up with a single default fair scheduler pool in FIFO
> mode, which is equivalent to FIFO. Providing a way to set the mode of
> the default scheduler would be awesome.
>
> Regarding why fair scheduling showed generally better performance for
> out-of-core datasets, I don't have a good answer. My guess was
> isolated job scheduling and better locality of in-memory partitions.
>
> Regards,
> Matthias
>
> On Sat, Apr 7, 2018 at 8:50 AM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
> > Sorry, but I'm still not understanding this use case. Are you somehow
> > creating additional scheduling pools dynamically as Jobs execute? If so,
> > that is a very unusual thing to do. Scheduling pools are intended to be
> > statically configured -- initialized, living and dying with the
> Application.
> >
> > On Sat, Apr 7, 2018 at 12:33 AM, Matthias Boehm <mboe...@gmail.com>
> wrote:
> >>
> >> Thanks for the clarification Imran - that helped. I was mistakenly
> >> assuming that these pools are removed via weak references, as the
> >> ContextCleaner does for RDDs, broadcasts, and accumulators, etc. For
> >> the time being, we'll just work around it, but I'll file a
> >> nice-to-have improvement JIRA. Also, you're right, we see indeed these
> >> warnings but they're usually hidden when running with ERROR or INFO
> >> (due to overwhelming output) log levels.
> >>
> >> Just to give the context: We use these scheduler pools in SystemML's
> >> parallel for loop construct (parfor), which allows combining data- and
> >> task-parallel computation. If the data fits into the remote memory
> >> budget, the optimizer may decide to execute the entire loop as a
> >> single spark job (with groups of iterations mapped to spark tasks). If
> >> the data is too large and non-partitionable, the parfor loop is
> >> executed as a multi-threaded operator in the driver and each worker
> >> might spawn several data-parallel spark jobs in the context of the
> >> worker's scheduler pool, for operations that don't fit into the
> >> driver.
> >>
> >> We decided to use these fair scheduler pools (w/ fair scheduling
> >> across pools, FIFO per pool) instead of the default FIFO scheduler
> >> because it gave us better and more robust performance back in the
> >> Spark 1.x line. This was especially true for concurrent jobs over
> >> shared input data (e.g., for hyper parameter tuning) and when the data
> >> size exceeded aggregate memory. The only downside was that we had to
> >> guard against scenarios where concurrent jobs would lazily pull a
> >> shared RDD into cache because that led to thread contention on the
> >> executors' block managers and spurious replicated in-memory
> >> partitions.
> >>
> >> Regards,
> >> Matthias
> >>
> >> On Fri, Apr 6, 2018 at 8:08 AM, Imran Rashid <iras...@cloudera.com>
> wrote:
> >> > Hi Matthias,
> >> >
> >> > This doeesn't look possible now.  It may be worth filing an
> improvement
> >> > jira
> >> > for.
> >> >
> >> > But I'm trying to understand what you're trying to do a little better.
> >> > So
> >> > you intentionally have each thread create a new unique pool when its
> >> > submits
> >> > a job?  So that pool will just get the default pool configuration, and
> >> > you
> >> > will see lots of these messages in your logs?

Re: Fair scheduler pool leak

2018-04-07 Thread Mark Hamstra
Sorry, but I'm still not understanding this use case. Are you somehow
creating additional scheduling pools dynamically as Jobs execute? If so,
that is a very unusual thing to do. Scheduling pools are intended to be
statically configured -- initialized, living and dying with the
Application.
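
For reference, the per-worker-thread pattern described in the quoted message
below is roughly the following (a sketch, not SystemML's actual code):

    import org.apache.spark.SparkContext

    // Each driver-side worker thread tags its jobs with its own pool via the
    // thread-local property, then clears the property when it is done.
    def runInPool[T](sc: SparkContext, poolName: String)(body: => T): T = {
      sc.setLocalProperty("spark.scheduler.pool", poolName)
      try {
        body  // Spark jobs submitted here are scheduled in poolName
      } finally {
        sc.setLocalProperty("spark.scheduler.pool", null)  // the cleanup step
      }
    }

The issue under discussion is that the pool created on first use of poolName
stays registered with the root pool even after the property is cleared.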

On Sat, Apr 7, 2018 at 12:33 AM, Matthias Boehm  wrote:

> Thanks for the clarification Imran - that helped. I was mistakenly
> assuming that these pools are removed via weak references, as the
> ContextCleaner does for RDDs, broadcasts, and accumulators, etc. For
> the time being, we'll just work around it, but I'll file a
> nice-to-have improvement JIRA. Also, you're right, we see indeed these
> warnings but they're usually hidden when running with ERROR or INFO
> (due to overwhelming output) log levels.
>
> Just to give the context: We use these scheduler pools in SystemML's
> parallel for loop construct (parfor), which allows combining data- and
> task-parallel computation. If the data fits into the remote memory
> budget, the optimizer may decide to execute the entire loop as a
> single spark job (with groups of iterations mapped to spark tasks). If
> the data is too large and non-partitionable, the parfor loop is
> executed as a multi-threaded operator in the driver and each worker
> might spawn several data-parallel spark jobs in the context of the
> worker's scheduler pool, for operations that don't fit into the
> driver.
>
> We decided to use these fair scheduler pools (w/ fair scheduling
> across pools, FIFO per pool) instead of the default FIFO scheduler
> because it gave us better and more robust performance back in the
> Spark 1.x line. This was especially true for concurrent jobs over
> shared input data (e.g., for hyper parameter tuning) and when the data
> size exceeded aggregate memory. The only downside was that we had to
> guard against scenarios where concurrent jobs would lazily pull a
> shared RDD into cache because that led to thread contention on the
> executors' block managers and spurious replicated in-memory
> partitions.
>
> Regards,
> Matthias
>
> On Fri, Apr 6, 2018 at 8:08 AM, Imran Rashid  wrote:
> > Hi Matthias,
> >
> > This doeesn't look possible now.  It may be worth filing an improvement
> jira
> > for.
> >
> > But I'm trying to understand what you're trying to do a little better.
> So
> > you intentionally have each thread create a new unique pool when its
> submits
> > a job?  So that pool will just get the default pool configuration, and
> you
> > will see lots of these messages in your logs?
> >
> > https://github.com/apache/spark/blob/6ade5cbb498f6c6ea38779b97f2325d5cf5013f2/core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala#L196-L200
> >
> > What is the use case for creating pools this way?
> >
> > Also if I understand correctly, it doesn't even matter if the thread
> dies --
> > that pool will still stay around, as the rootPool will retain a
> reference to
> > its (the pools aren't really actually tied to specific threads).
> >
> > Imran
> >
> > On Thu, Apr 5, 2018 at 9:46 PM, Matthias Boehm 
> wrote:
> >>
> >> Hi all,
> >>
> >> for concurrent Spark jobs spawned from the driver, we use Spark's fair
> >> scheduler pools, which are set and unset in a thread-local manner by
> >> each worker thread. Typically (for rather long jobs), this works very
> >> well. Unfortunately, in an application with lots of very short
> >> parallel sections, we see 1000s of these pools remaining in the Spark
> >> UI, which indicates some kind of leak. Each worker cleans up its local
> >> property by setting it to null, but not all pools are properly
> >> removed. I've checked and reproduced this behavior with Spark 2.1-2.3.
> >>
> >> Now my question: Is there a way to explicitly remove these pools,
> >> either globally, or locally while the thread is still alive?
> >>
> >> Regards,
> >> Matthias
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: time for Apache Spark 3.0?

2018-04-05 Thread Mark Hamstra
As with Sean, I'm not sure that this will require a new major version, but
we should also be looking at Java 9 & 10 support -- particularly with
regard to their better functionality in a containerized environment (memory
limits from cgroups, not sysconf; support for cpusets). In that regard, we
should also be looking at using the latest Scala 2.11.x maintenance release
in current Spark branches.

On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:

> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:
>
>> The primary motivating factor IMO for a major version bump is to support
>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>> Similar to Spark 2.0, I think there are also opportunities for other
>> changes that we know have been biting us for a long time but can’t be
>> changed in feature releases (to be clear, I’m actually not sure they are
>> all good ideas, but I’m writing them down as candidates for consideration):
>>
>
> IIRC from looking at this, it is possible to support 2.11 and 2.12
> simultaneously. The cross-build already works now in 2.3.0. Barring some
> big change needed to get 2.12 fully working -- and that may be the case --
> it nearly works that way now.
>
> Compiling vs 2.11 and 2.12 does however result in some APIs that differ in
> byte code. However Scala itself isn't mutually compatible between 2.11 and
> 2.12 anyway; that's never been promised as compatible.
>
> (Interesting question about what *Java* users should expect; they would
> see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
>
> I don't disagree with shooting for Spark 3.0, just saying I don't know if
> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
> 2.11 support if needed to make supporting 2.12 less painful.
>


Re: Spark on Kubernetes Builder Pattern Design Document

2018-02-05 Thread Mark Hamstra
Sure. Obviously, there is going to be some overlap as the project
transitions to being part of mainline Spark development. As long as you are
consciously working toward moving discussions into this dev list, then all
is good.

On Mon, Feb 5, 2018 at 1:56 PM, Matt Cheah <mch...@palantir.com> wrote:

> I think in this case, the original design that was proposed before the
> document was implemented on the Spark on K8s fork, that we took some time
> to build separately before proposing that the fork be merged into the main
> line.
>
>
>
> Specifically, the timeline of events was:
>
>
>
>1. We started building Spark on Kubernetes on a fork and was prepared
>to merge our work directly into master,
>2. Discussion on https://issues.apache.org/jira/browse/SPARK-18278 led
>us to move down the path of working on a fork first. We would harden the
>fork, have the fork become used more widely to prove its value and
>    robustness in practice. See https://github.com/apache-spark-on-k8s/spark
>3. On said fork, we made the original design decisions to use a
>step-based builder pattern for the driver but not the same design for the
>executors. This original discussion was made among the collaborators of the
>fork, as much of the work on the fork in general was not done on the
>mailing list.
>4. We eventually decided to merge the fork into the main line, and got
>the feedback in the corresponding PRs.
>
>
>
> Therefore the question may less so be with this specific design, but
> whether or not the overarching approach we took - building Spark on K8s on
> a fork first before merging into mainline – was the correct one in the
> first place. There’s also the issue that the work done on the fork was
> isolated from the dev mailing list. Moving forward as we push our work into
> mainline Spark, we aim to be transparent with the Spark community via the
> Spark mailing list and Spark JIRA tickets. We’re specifically aiming to
> deprecate the fork and migrate all the work done on the fork into the main
> line.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Mark Hamstra <m...@clearstorydata.com>
> *Date: *Monday, February 5, 2018 at 1:44 PM
> *To: *Matt Cheah <mch...@palantir.com>
> *Cc: *"dev@spark.apache.org" <dev@spark.apache.org>, "
> ramanath...@google.com" <ramanath...@google.com>, Ilan Filonenko <
> i...@cornell.edu>, Erik <e...@redhat.com>, Marcelo Vanzin <
> van...@cloudera.com>
> *Subject: *Re: Spark on Kubernetes Builder Pattern Design Document
>
>
>
> That's good, but you should probably stop and consider whether the
> discussions that led up to this document's creation could have taken place
> on this dev list -- because if they could have, then they probably should
> have as part of the whole spark-on-k8s project becoming part of mainline
> spark development, not a separate fork.
>
>
>
> On Mon, Feb 5, 2018 at 1:17 PM, Matt Cheah <mch...@palantir.com> wrote:
>
> Hi everyone,
>
>
>
> While we were building the Spark on Kubernetes integration, we realized
> that some of the abstractions we introduced for building the driver
> application in spark-submit, and building executor pods in the scheduler
> backend, could be improved for better readability and clarity. We received
> feedback in this pull request <https://github.com/apache/spark/pull/19954>
> in particular. In response to this feedback, we’ve put together a design
> document that proposes a possible refactor to address the given feedback.
>
>
>
> You may comment on the proposed design at this link:
> https://docs.google.com/document/d/1XPLh3E2JJ7yeJSDLZWXh_lUcjZ1P0dy9QeUEyxIlfak/edit#
>
>
>
> I hope that we can have a productive discussion and continue improving the
> Kubernetes integration further.
>
>
>
> Thanks,
>
>
>
> -Matt Cheah
>
>
>


Re: Spark on Kubernetes Builder Pattern Design Document

2018-02-05 Thread Mark Hamstra
That's good, but you should probably stop and consider whether the
discussions that led up to this document's creation could have taken place
on this dev list -- because if they could have, then they probably should
have as part of the whole spark-on-k8s project becoming part of mainline
spark development, not a separate fork.

On Mon, Feb 5, 2018 at 1:17 PM, Matt Cheah  wrote:

> Hi everyone,
>
>
>
> While we were building the Spark on Kubernetes integration, we realized
> that some of the abstractions we introduced for building the driver
> application in spark-submit, and building executor pods in the scheduler
> backend, could be improved for better readability and clarity. We received
> feedback in this pull request 
> in particular. In response to this feedback, we’ve put together a design
> document that proposes a possible refactor to address the given feedback.
>
>
>
> You may comment on the proposed design at this link:
> https://docs.google.com/document/d/1XPLh3E2JJ7yeJSDLZWXh_
> lUcjZ1P0dy9QeUEyxIlfak/edit#
>
>
>
> I hope that we can have a productive discussion and continue improving the
> Kubernetes integration further.
>
>
>
> Thanks,
>
>
>
> -Matt Cheah
>


Re: Union in Spark context

2018-02-05 Thread Mark Hamstra
First, the public API cannot be changed except when there is a major
version change, and there is no way that we are going to do Spark 3.0.0
just for this change.

Second, the change would be a mistake since the two different union methods
are quite different. The method in RDD only ever works on two RDDs at a
time, whereas the method in SparkContext can work on many RDDs in a single
call. That means that the method in SparkContext is much preferred when
unioning many RDDs to prevent a lengthy lineage chain.
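
A quick illustration of the difference (a sketch; assumes an active
SparkContext named sc):

    import org.apache.spark.SparkContext

    def unionExample(sc: SparkContext): Unit = {
      val rdds = (1 to 100).map(i => sc.parallelize(Seq(i)))

      // RDD.union combines two RDDs at a time, so folding over many RDDs
      // builds a deep chain of unions -- a lengthy lineage.
      val chained = rdds.reduce(_ union _)

      // SparkContext.union takes the whole sequence in one call and produces
      // a single union over all the parents.
      val flat = sc.union(rdds)

      println(chained.toDebugString)
      println(flat.toDebugString)
    }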

On Mon, Feb 5, 2018 at 8:04 AM, Suchith J N  wrote:

> Hi,
>
> Seems like simple clean up - Why do we have union() on RDDs in
> SparkContext? Shouldn't it reside in RDD? There is one in RDD, but it seems
> like a wrapper around this.
>
> Regards,
> Suchith
>


Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Mark Hamstra
Reasoning by analogy to other Apache projects is generally not sufficient
when it comes to ensuring legally permissible form or behavior -- that
another project is doing something is not a guarantee that they are doing
it right. If we have issues or legal questions, we need to formulate them
and our proposed actions as clearly and concretely as possible so that the
PMC can take those issues, questions and proposed actions to Apache counsel
for advice or guidance.

On Tue, Dec 19, 2017 at 10:34 AM, Erik Erlandson <eerla...@redhat.com>
wrote:

> I've been looking a bit more into ASF legal posture on licensing and
> container images. What I have found indicates that ASF considers container
> images to be just another variety of distribution channel.  As such, it is
> acceptable to publish official releases; for example an image such as
> spark:v2.3.0 built from the v2.3.0 source is fine.  It is not acceptable to
> do something like regularly publish spark:latest built from the head of
> master.
>
> More detail here:
> https://issues.apache.org/jira/browse/LEGAL-270
>
> So as I understand it, making a release-tagged public image as part of
> each official release does not pose any problems.
>
> With respect to considering the licenses of other ancillary dependencies
> that are also installed on such container images, I noticed this clause in
> the legal boilerplate for the Flink images
> <https://hub.docker.com/r/library/flink/>:
>
> As with all Docker images, these likely also contain other software which
>> may be under other licenses (such as Bash, etc from the base distribution,
>> along with any direct or indirect dependencies of the primary software
>> being contained).
>>
>
> So it may be sufficient to resolve this via disclaimer.
>
> -Erik
>
> On Thu, Dec 14, 2017 at 7:55 PM, Erik Erlandson <eerla...@redhat.com>
> wrote:
>
>> Currently the containers are based off alpine, which pulls in BSD2 and
>> MIT licensing:
>> https://github.com/apache/spark/pull/19717#discussion_r154502824
>>
>> to the best of my understanding, neither of those poses a problem.  If we
>> based the image off of centos I'd also expect the licensing of any image
>> deps to be compatible.
>>
>> On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra <m...@clearstorydata.com>
>> wrote:
>>
>>> What licensing issues come into play?
>>>
>>> On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson <eerla...@redhat.com>
>>> wrote:
>>>
>>>> We've been discussing the topic of container images a bit more.  The
>>>> kubernetes back-end operates by executing some specific CMD and ENTRYPOINT
>>>> logic, which is different than mesos, and which is probably not practical
>>>> to unify at this level.
>>>>
>>>> However: These CMD and ENTRYPOINT configurations are essentially just a
>>>> thin skin on top of an image which is just an install of a spark distro.
>>>> We feel that a single "spark-base" image should be publishable, that is
>>>> consumable by kube-spark images, and mesos-spark images, and likely any
>>>> other community image whose primary purpose is running spark components.
>>>> The kube-specific dockerfiles would be written "FROM spark-base" and just
>>>> add the small command and entrypoint layers.  Likewise, the mesos images
>>>> could add any specialization layers that are necessary on top of the
>>>> "spark-base" image.
>>>>
>>>> Does this factorization sound reasonable to others?
>>>> Cheers,
>>>> Erik
>>>>
>>>>
>>>> On Wed, Nov 29, 2017 at 10:04 AM, Mridul Muralidharan <mri...@gmail.com
>>>> > wrote:
>>>>
>>>>> We do support running on Apache Mesos via docker images - so this
>>>>> would not be restricted to k8s.
>>>>> But unlike mesos support, which has other modes of running, I believe
>>>>> k8s support more heavily depends on availability of docker images.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>>
>>>>> On Wed, Nov 29, 2017 at 8:56 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>> > Would it be logical to provide Docker-based distributions of other
>>>>> pieces of
>>>>> > Spark? or is this specific to K8S?
>>>>> > The problem is we wouldn't generally also provide a distribution of
>>>>> Spark
>>>>

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-14 Thread Mark Hamstra
What licensing issues come into play?

On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson <eerla...@redhat.com> wrote:

> We've been discussing the topic of container images a bit more.  The
> kubernetes back-end operates by executing some specific CMD and ENTRYPOINT
> logic, which is different than mesos, and which is probably not practical
> to unify at this level.
>
> However: These CMD and ENTRYPOINT configurations are essentially just a
> thin skin on top of an image which is just an install of a spark distro.
> We feel that a single "spark-base" image should be publishable, that is
> consumable by kube-spark images, and mesos-spark images, and likely any
> other community image whose primary purpose is running spark components.
> The kube-specific dockerfiles would be written "FROM spark-base" and just
> add the small command and entrypoint layers.  Likewise, the mesos images
> could add any specialization layers that are necessary on top of the
> "spark-base" image.
>
> Does this factorization sound reasonable to others?
> Cheers,
> Erik
>
>
> On Wed, Nov 29, 2017 at 10:04 AM, Mridul Muralidharan <mri...@gmail.com>
> wrote:
>
>> We do support running on Apache Mesos via docker images - so this
>> would not be restricted to k8s.
>> But unlike mesos support, which has other modes of running, I believe
>> k8s support more heavily depends on availability of docker images.
>>
>>
>> Regards,
>> Mridul
>>
>>
>> On Wed, Nov 29, 2017 at 8:56 AM, Sean Owen <so...@cloudera.com> wrote:
>> > Would it be logical to provide Docker-based distributions of other
>> pieces of
>> > Spark? or is this specific to K8S?
>> > The problem is we wouldn't generally also provide a distribution of
>> Spark
>> > for the reasons you give, because if that, then why not RPMs and so on.
>> >
>> > On Wed, Nov 29, 2017 at 10:41 AM Anirudh Ramanathan <
>> ramanath...@google.com>
>> > wrote:
>> >>
>> >> In this context, I think the docker images are similar to the binaries
>> >> rather than an extension.
>> >> It's packaging the compiled distribution to save people the effort of
>> >> building one themselves, akin to binaries or the python package.
>> >>
>> >> For reference, this is the base dockerfile for the main image that we
>> >> intend to publish. It's not particularly complicated.
>> >> The driver and executor images are based on said base image and only
>> >> customize the CMD (any file/directory inclusions are extraneous and
>> will be
>> >> removed).
>> >>
>> >> Is there only one way to build it? That's a bit harder to reason about.
>> >> The base image I'd argue is likely going to always be built that way.
>> The
>> >> driver and executor images, there may be cases where people want to
>> >> customize it - (like putting all dependencies into it for example).
>> >> In those cases, as long as our images are bare bones, they can use the
>> >> spark-driver/spark-executor images we publish as the base, and build
>> their
>> >> customization as a layer on top of it.
>> >>
>> >> I think the composability of docker images, makes this a bit different
>> >> from say - debian packages.
>> >> We can publish canonical images that serve as both - a complete image
>> for
>> >> most Spark applications, as well as a stable substrate to build
>> >> customization upon.
>> >>
>> >> On Wed, Nov 29, 2017 at 7:38 AM, Mark Hamstra <m...@clearstorydata.com
>> >
>> >> wrote:
>> >>>
>> >>> It's probably also worth considering whether there is only one,
>> >>> well-defined, correct way to create such an image or whether this is a
>> >>> reasonable avenue for customization. Part of why we don't do
>> something like
>> >>> maintain and publish canonical Debian packages for Spark is because
>> >>> different organizations doing packaging and distribution of
>> infrastructures
>> >>> or operating systems can reasonably want to do this in a custom (or
>> >>> non-customary) way. If there is really only one reasonable way to do a
>> >>> docker image, then my bias starts to tend more toward the Spark PMC
>> taking
>> >>> on the responsibility to maintain and publish that image. If there is
>> more
>> >>> than one way to do it and publishing

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-11-29 Thread Mark Hamstra
It's probably also worth considering whether there is only one,
well-defined, correct way to create such an image or whether this is a
reasonable avenue for customization. Part of why we don't do something like
maintain and publish canonical Debian packages for Spark is because
different organizations doing packaging and distribution of infrastructures
or operating systems can reasonably want to do this in a custom (or
non-customary) way. If there is really only one reasonable way to do a
docker image, then my bias starts to tend more toward the Spark PMC taking
on the responsibility to maintain and publish that image. If there is more
than one way to do it and publishing a particular image is more just a
convenience, then my bias tends more away from maintaining and publish it.

On Wed, Nov 29, 2017 at 5:14 AM, Sean Owen  wrote:

> Source code is the primary release; compiled binary releases are
> conveniences that are also released. A docker image sounds fairly different
> though. To the extent it's the standard delivery mechanism for some
> artifact (think: pyspark on PyPI as well) that makes sense, but is that the
> situation? if it's more of an extension or alternate presentation of Spark
> components, that typically wouldn't be part of a Spark release. The ones
> the PMC takes responsibility for maintaining ought to be the core, critical
> means of distribution alone.
>
> On Wed, Nov 29, 2017 at 2:52 AM Anirudh Ramanathan wrote:
>
>> Hi all,
>>
>> We're all working towards the Kubernetes scheduler backend (full steam
>> ahead!) that's targeted towards Spark 2.3. One of the questions that comes
>> up often is docker images.
>>
>> While we're making available dockerfiles to allow people to create their
>> own docker images from source, ideally, we'd want to publish official
>> docker images as part of the release process.
>>
>> I understand that the ASF has procedure around this, and we would want to
>> get that started to help us get these artifacts published by 2.3. I'd love
>> to get a discussion around this started, and the thoughts of the community
>> regarding this.
>>
>> --
>> Thanks,
>> Anirudh Ramanathan
>>
>


Re: Object in compiler mirror not found - maven build

2017-11-26 Thread Mark Hamstra
Or you just have zinc running but in a bad state. `zinc -shutdown` should
kill it off and let you try again.

On Sun, Nov 26, 2017 at 2:12 PM, Sean Owen  wrote:

> I'm not seeing that on OS X or Linux. It sounds a bit like you have an old
> version of zinc or scala or something installed.
>
> On Sun, Nov 26, 2017 at 3:55 PM Tomasz Dudek <
> megatrontomaszdu...@gmail.com> wrote:
>
>> Hello everyone,
>>
>> I would love to help develop Apache Spark. I have run into a (very
>> basic?) issue which is holding me back in that mission.
>>
>> I followed the `how to contribute` guide, however running ./build/mvn
>> -DskipTests clean package fails with:
>>
>> [INFO] Using zinc server for incremental compilation
>> [info] 'compiler-interface' not yet compiled for Scala 2.11.8.
>> Compiling...
>> error: scala.reflect.internal.MissingRequirementError: object
>> java.lang.Object in compiler mirror not found.
>> at scala.reflect.internal.MissingRequirementError$.signal(
>> MissingRequirementError.scala:17)
>> at scala.reflect.internal.MissingRequirementError$.notFound(
>> MissingRequirementError.scala:18)
>> at scala.reflect.internal.Mirrors$RootsBase.
>> getModuleOrClass(Mirrors.scala:53)
>>
>> is it perhaps compability issue? Versions I use are as follows:
>>
>> ➜  spark git:(master) ✗ ./build/mvn --version
>> Using `mvn` from path: /Users/tdudek/Programming/
>> spark/build/apache-maven-3.3.9/bin/mvn
>> Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5;
>> 2015-11-10T17:41:47+01:00)
>> Maven home: /Users/tdudek/Programming/spark/build/apache-maven-3.3.9
>> Java version: 1.8.0_152, vendor: Oracle Corporation
>> Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_
>> 152.jdk/Contents/Home/jre
>> Default locale: en_PL, platform encoding: US-ASCII
>> OS name: "mac os x", version: "10.13.1", arch: "x86_64", family: "mac"
>>
>> I just lost a few hours mindlessly trying to make it work. I hate to waste
>> other people's time and I'm REALLY ashamed of my question, but I think I am
>> missing something fundamental.
>>
>> Cheers,
>> Tomasz
>>
>


Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-14 Thread Mark Hamstra
The problem is that it's not really an "official" download link, but rather
just a supplemental convenience. While that may be ok when distributing
artifacts, it's more of a problem when actually building and testing
artifacts. In the latter case, the download should really only be from an
Apache mirror.

On Thu, Sep 14, 2017 at 1:20 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

> That test case is trying to test the backward compatibility of
> `HiveExternalCatalog`. It downloads official Spark releases and creates
> tables with them, and then reads these tables via the current Spark.
>
> About the download link, I just picked it from the Spark website, and this
> link is the default one when you choose "direct download". Do we have a
> better choice?
>
> On Thu, Sep 14, 2017 at 3:05 AM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> Mark, I agree with your point on the risks of using Cloudfront while
>> building Spark. I was only trying to provide background on when we
>> started using Cloudfront.
>>
>> Personally, I don't have enough context about the test case in
>> question (e.g. why are we downloading Spark in a test case?).
>>
>> Thanks
>> Shivaram
>>
>> On Wed, Sep 13, 2017 at 11:50 AM, Mark Hamstra <m...@clearstorydata.com>
>> wrote:
>> > Yeah, but that discussion and use case is a bit different -- providing a
>> > different route to download the final released and approved artifacts
>> that
>> > were built using only acceptable artifacts and sources vs. building and
>> > checking prior to release using something that is not from an Apache
>> mirror.
>> > This new use case puts us in the position of approving spark artifacts
>> that
>> > weren't built entirely from canonical resources located in presumably
>> secure
>> > and monitored repositories. Incorporating something that is not
>> completely
>> > trusted or approved into the process of building something that we are
>> then
>> > going to approve as trusted is different from the prior use of
>> cloudfront.
>> >
>> > On Wed, Sep 13, 2017 at 10:26 AM, Shivaram Venkataraman
>> > <shiva...@eecs.berkeley.edu> wrote:
>> >>
>> >> The bucket comes from Cloudfront, a CDN thats part of AWS. There was a
>> >> bunch of discussion about this back in 2013
>> >>
>> >> https://lists.apache.org/thread.html/9a72ff7ce913dd85a6b112b
>> 1b2de536dcda74b28b050f70646aba0ac@1380147885@%3Cdev.spark.apache.org%3E
>> >>
>> >> Shivaram
>> >>
>> >> On Wed, Sep 13, 2017 at 9:30 AM, Sean Owen <so...@cloudera.com> wrote:
>> >> > Not a big deal, but Mark noticed that this test now downloads Spark
>> >> > artifacts from the same 'direct download' link available on the
>> >> > downloads
>> >> > page:
>> >> >
>> >> >
>> >> > https://github.com/apache/spark/blob/master/sql/hive/src/
>> test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVe
>> rsionsSuite.scala#L53
>> >> >
>> >> > https://d3kbcqa49mib13.cloudfront.net/spark-$version-bin-
>> hadoop2.7.tgz
>> >> >
>> >> > I don't know of any particular problem with this, which is a parallel
>> >> > download option in addition to the Apache mirrors. It's also the
>> >> > default.
>> >> >
>> >> > Does anyone know what this bucket is and if there's a strong reason
>> we
>> >> > can't
>> >> > just use mirrors?
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-13 Thread Mark Hamstra
Yeah, but that discussion and use case is a bit different -- providing a
different route to download the final released and approved artifacts that
were built using only acceptable artifacts and sources vs. building and
checking prior to release using something that is not from an Apache
mirror. This new use case puts us in the position of approving spark
artifacts that weren't built entirely from canonical resources located in
presumably secure and monitored repositories. Incorporating something that
is not completely trusted or approved into the process of building
something that we are then going to approve as trusted is different from
the prior use of cloudfront.

On Wed, Sep 13, 2017 at 10:26 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> The bucket comes from Cloudfront, a CDN thats part of AWS. There was a
> bunch of discussion about this back in 2013
> https://lists.apache.org/thread.html/9a72ff7ce913dd85a6b112b1b2de53
> 6dcda74b28b050f70646aba0ac@1380147885@%3Cdev.spark.apache.org%3E
>
> Shivaram
>
> On Wed, Sep 13, 2017 at 9:30 AM, Sean Owen  wrote:
> > Not a big deal, but Mark noticed that this test now downloads Spark
> > artifacts from the same 'direct download' link available on the downloads
> > page:
> >
> > https://github.com/apache/spark/blob/master/sql/hive/
> src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSui
> te.scala#L53
> >
> > https://d3kbcqa49mib13.cloudfront.net/spark-$version-bin-hadoop2.7.tgz
> >
> > I don't know of any particular problem with this, which is a parallel
> > download option in addition to the Apache mirrors. It's also the default.
> >
> > Does anyone know what this bucket is and if there's a strong reason we
> can't
> > just use mirrors?
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Supporting Apache Aurora as a cluster manager

2017-09-11 Thread Mark Hamstra
While it may be worth creating the design doc and JIRA ticket so that we at
least have a better idea and a record of what you are talking about, I kind
of doubt that we are going to want to merge this into the Spark codebase.
That's not because of anything specific to this Aurora effort, but rather
because scheduler implementations in general are not going in the preferred
direction. There is already some regret that the YARN scheduler wasn't
implemented by means of a scheduler plug-in API, and there is likely to be
more regret if we continue to go forward with the spark-on-kubernetes SPIP
in its present form. I'd guess that we are likely to merge code associated
with that SPIP just because Kubernetes has become such an important
resource scheduler, but such a merge wouldn't be without some misgivings.
That is because we just can't get into the position of having more and more
scheduler implementations in the Spark code, and more and more maintenance
overhead to keep up with the idiosyncrasies of all the scheduler
implementations. We've really got to get to the kind of plug-in
architecture discussed in SPARK-19700 so that scheduler implementations can
be done outside of the Spark codebase, release schedule, etc.

My opinion on the subject isn't dispositive on its own, of course, but that
is how I'm seeing things right now.

On Sun, Sep 10, 2017 at 8:27 PM, karthik padmanabhan  wrote:

> Hi Spark Devs,
>
> We are using Aurora (http://aurora.apache.org/) as our mesos framework
> for running stateless services. We would like to use Aurora to deploy big
> data and batch workloads as well. And for this we have forked Spark and
> implement the ExternalClusterManager trait.
>
> The reason for doing this and not running Spark on Mesos is to leverage
> the existing roles and quotas provided by Aurora for admission control and
> also leverage Aurora features such as priority and preemption. Additionally
> we would like Aurora to be the only deploy/orchestration system that our
> users should interact with.
>
> We have a working POC where Spark launches jobs through Aurora as the
> ClusterManager. Is this something that can be merged upstream? If so, I can
> create a design document and an associated JIRA ticket.
>
> Thanks
> Karthik
>


Re: SPIP: Spark on Kubernetes

2017-08-28 Thread Mark Hamstra
>
> In my opinion, the fact that there are nearly no changes to spark-core,
> and most of our changes are additive should go to prove that this adds
> little complexity to the workflow of the committers.


Actually (and somewhat perversely), the otherwise praiseworthy isolation of
the Kubernetes code does mean that it adds complexity to the workflow of
the existing Spark committers. I'll reiterate Imran's concerns: The
existing Spark committers familiar with Spark's scheduler code have
adequate knowledge of the Standalone and Yarn implementations, but still do not
have sufficient coverage of Mesos. Adding k8s code to Spark would mean that
the progression of that code would start seeing the issues that the Mesos
code in Spark currently sees: Reviews and commits tend to languish because
we don't have currently active committers with sufficient knowledge and
cycles to deal with the Mesos PRs. Some of this is because the PMC needs to
get back to addressing the issue of adding new Spark committers who do have
the needed Mesos skills, but that isn't as simple as we'd like because
ideally a Spark committer has demonstrated skills across a significant
portion of the Spark code, not just tightly focused on one area (such as
Mesos or k8s integration.) In short, adding Kubernetes support directly
into Spark isn't likely (at least in the short-term) to be entirely
positive for the spark-on-k8s project, since merging of PRs to the
spark-on-k8s code is very likely to be quite slow at least until such time as we
have k8s-focused Spark committers. If this project does end up getting
pulled into the Spark codebase, then the PMC will need to start looking at
bringing in one or more new committers who meet our requirements for such a
role and responsibility, and who also have k8s skills. The success and pace
of development of the spark-on-k8s code will depend in large measure on the
PMC's ability to find such new committers.

All that said, I'm +1 if those currently responsible for the
spark-on-k8s project still want to bring the code into Spark.


On Mon, Aug 21, 2017 at 11:48 AM, Anirudh Ramanathan <
ramanath...@google.com.invalid> wrote:

> Thank you for your comments Imran.
>
> Regarding integration tests,
>
> What you inferred from the documentation is correct -
> Integration tests do not require any prior setup or a Kubernetes cluster
> to run. Minikube is a single binary that brings up a one-node cluster and
> exposes the full Kubernetes API. It is actively maintained and kept up to
> date with the rest of the project. These local integration tests on Jenkins
> (like the ones with spark-on-yarn), should allow for the committers to
> merge changes with a high degree of confidence.
> I will update the proposal to include more information about the extent
> and kinds of testing we do.
>
> As for (b), people on this thread and the set of contributors on our fork
> are a fairly wide community of contributors and committers who would be
> involved in the maintenance long-term. It was one of the reasons behind
> developing separately as a fork. In my opinion, the fact that there are
> nearly no changes to spark-core, and most of our changes are additive
> should go to prove that this adds little complexity to the workflow of the
> committers.
>
> Separating out the cluster managers (into an as yet undecided new home)
> appears far more disruptive and a high risk change for the short term.
> However, when there is enough community support behind that effort, tracked
> in SPARK-19700; and if that
> is realized in the future, it wouldn't be difficult to switch over
> Kubernetes, YARN and Mesos to using the pluggable API. Currently, in my
> opinion, with the integration tests, active users, and a community of
> maintainers, Spark-on-Kubernetes would add minimal overhead and benefit a
> large (and growing) class of users.
>
> Lastly, the RSS is indeed separate and a value-add that we would love to
> share with other cluster managers as well.
>
> On Mon, Aug 21, 2017 at 10:17 AM, Imran Rashid 
> wrote:
>
>> Overall this looks like a good proposal.  I do have some concerns which
>> I'd like to discuss -- please understand I'm taking a "devil's advocate"
>> stance here for discussion, not that I'm giving a -1.
>>
>> My primary concern is about testing and maintenance.  My concerns might
>> be addressed if the doc included a section on testing that might just be
>> this: https://github.com/apache-spark-on-k8s/spark/blob/branch-2.
>> 2-kubernetes/resource-managers/kubernetes/README.md#
>> running-the-kubernetes-integration-tests
>>
>> but without the concerning warning "Note that the integration test
>> framework is currently being heavily revised and is subject to change".
>> I'd like the proposal to clearly indicate that some baseline testing can be
>> done by devs and in spark's regular jenkins builds without special access
>> to kubernetes clusters.
>>
>> It's worth noting that 

Re: Increase Timeout or optimize Spark UT?

2017-08-22 Thread Mark Hamstra
This is another argument for getting the code to the point where this can
default to "true":

SQLConf.scala:  val ADAPTIVE_EXECUTION_ENABLED = buildConf("spark.sql.adaptive.enabled")
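
Until that happens, individual suites or jobs have to opt in explicitly. A
minimal sketch of what that looks like (conf keys as they appear in the 2.x
SQLConf; the master and target-size values are only illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[4]")
  .config("spark.sql.adaptive.enabled", "true") // let the coordinator pick post-shuffle partition counts
  .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "16m") // small target suits small test data
  .getOrCreate()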

On Tue, Aug 22, 2017 at 12:27 PM, Reynold Xin  wrote:

> +1
>
>
> On Tue, Aug 22, 2017 at 12:25 PM, Maciej Szymkiewicz <
> mszymkiew...@gmail.com> wrote:
>
>> Hi,
>>
>> From my experience it is possible to cut quite a lot by reducing
>> spark.sql.shuffle.partitions to some reasonable value (let's say
>> comparable to the number of cores). 200 is a serious overkill for most of
>> the test cases anyway.
>>
>>
>> Best,
>> Maciej
>>
>>
>>
>> On 21 August 2017 at 03:00, Dong Joon Hyun  wrote:
>>
>>> +1 for any efforts to recover Jenkins!
>>>
>>>
>>>
>>> Thank you for the direction.
>>>
>>>
>>>
>>> Bests,
>>>
>>> Dongjoon.
>>>
>>>
>>>
>>> *From: *Reynold Xin 
>>> *Date: *Sunday, August 20, 2017 at 5:53 PM
>>> *To: *Dong Joon Hyun 
>>> *Cc: *"dev@spark.apache.org" 
>>> *Subject: *Re: Increase Timeout or optimize Spark UT?
>>>
>>>
>>>
>>> It seems like it's time to look into how to cut down some of the test
>>> runtimes. Test runtimes will slowly go up given the way development
>>> happens. 3 hr is already a very long time for tests to run.
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Aug 20, 2017 at 5:45 PM, Dong Joon Hyun 
>>> wrote:
>>>
>>> Hi, All.
>>>
>>>
>>>
>>> Recently, Apache Spark master branch test (SBT with hadoop-2.7 / 2.6)
>>> has been hitting the build timeout.
>>>
>>>
>>>
>>> Please see the build time trend.
>>>
>>>
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Tes
>>> t%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/buildTimeTrend
>>>
>>>
>>>
>>> All recent 22 builds fail due to timeout directly/indirectly. The last
>>> success (SBT with Hadoop-2.7) is 15th August.
>>>
>>>
>>>
>>> We may do the followings.
>>>
>>>
>>>
>>>1. Increase Build Timeout (3 hr 30 min)
>>>2. Optimize UTs (Scala/Java/Python/UT)
>>>
>>>
>>>
>>> But, Option 1 will be the immediate solution for now . Could you update
>>> the Jenkins setup?
>>>
>>>
>>>
>>> Bests,
>>>
>>> Dongjoon.
>>>
>>>
>>>
>>
>>
>


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Mark Hamstra
Points 2, 3 and 4 of the Project Plan in that document (i.e. "port existing
data sources using internal APIs to use the proposed public Data Source V2
API") have my full support. Really, I'd like to see that dog-fooding effort
completed and the lessons learned from it fully digested before we remove any
unstable annotations from the new API. It's okay to get a proposal out
there so that we can talk about it and start implementing and using it
internally, followed by external use under the unstable annotations, but I
don't want to see a premature vote on a final form of a new public API.

On Thu, Aug 17, 2017 at 8:55 AM, Reynold Xin  wrote:

> Yea I don't think it's a good idea to upload a doc and then call for a
> vote immediately. People need time to digest ...
>
>
> On Thu, Aug 17, 2017 at 6:22 AM, Wenchen Fan  wrote:
>
>> Sorry let's remove the VOTE tag as I just wanna bring this up for
>> discussion.
>>
>> I'll restart the voting process after we have enough discussion on the
>> JIRA ticket or here in this email thread.
>>
>> On Thu, Aug 17, 2017 at 9:12 PM, Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> -1, I don't think there has really been any discussion of this api
>>> change yet or at least it hasn't occurred on the jira ticket
>>>
>>> On Thu, Aug 17, 2017 at 8:05 AM Wenchen Fan  wrote:
>>>
 adding my own +1 (binding)

 On Thu, Aug 17, 2017 at 9:02 PM, Wenchen Fan 
 wrote:

> Hi all,
>
> Following the SPIP process, I'm putting this SPIP up for a vote.
>
> The current data source API doesn't work well because of some
> limitations like: no partitioning/bucketing support, no columnar read, 
> hard
> to support more operator push down, etc.
>
> I'm proposing a Data Source API V2 to address these problems, please
> read the full document at https://issues.apache.org/jira
> /secure/attachment/12882332/SPIP%20Data%20Source%20API%20V2.pdf
>
> Since this SPIP is mostly about APIs, I also created a prototype and
> put java docs on these interfaces, so that it's easier to review these
> interfaces and discuss: https://github.com/cl
> oud-fan/spark/pull/10/files
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following
> technical reasons.
>
> Thanks!
>


>>
>


Re: a stage can belong to more than one job please?

2017-06-06 Thread Mark Hamstra
Yes, a Stage can be part of more than one Job. The jobIds field of Stage is
used repeatedly in the DAGScheduler.
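
A quick way to see this (a standalone sketch; sc is assumed to be an existing
SparkContext, e.g. the one in spark-shell):

val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))
val reduced = pairs.reduceByKey(_ + _) // introduces a ShuffleMapStage
reduced.count()   // job 0: runs the shuffle map stage plus a result stage
reduced.collect() // job 1: reuses that same shuffle map stage, so the stage now belongs to both jobs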

On Tue, Jun 6, 2017 at 5:04 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> Hi all,
>
> I read some code of Spark about stages.
>
> The constructor of Stage keeps the first job ID the stage was part of.
> Does that mean a stage can belong to more than one job?
> And I find that the member jobIds is never used. It looks strange.
>
>
> thanks adv
>


Re: Why did spark switch from AKKA to net / ...

2017-05-07 Thread Mark Hamstra
The point is that Spark's prior usage of Akka was limited enough that it
could fairly easily be removed entirely instead of forcing particular
architectural decisions on Spark's users.

On Sun, May 7, 2017 at 1:14 PM, geoHeil  wrote:

> Thank you!
> In the issue they outline that hard wired dependencies were the problem.
> But wouldn't one want to not directly accept the messages from an actor
> but have Kafka as a failsafe intermediary?
>
> zero323 [via Apache Spark Developers List] <[hidden email]
> > wrote on Sun, 7 May 2017 at 21:17:
>
>> https://issues.apache.org/jira/browse/SPARK-5293
>>
>>
>> On 05/07/2017 08:59 PM, geoHeil wrote:
>>
>> > Hi,
>> >
>> > I am curious why spark (with 2.0 completely) removed any akka
>> dependencies
>> > for RPC and switched entirely to (as far as I know) Netty
>> >
>> > regards,
>> > Georg
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-
>> developers-list.1001551.n3.nabble.com/Why-did-spark-
>> switch-from-AKKA-to-net-tp21522.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>> >
>> > -
>> > To unsubscribe e-mail: [hidden email]
>> 
>> >
>>
>>
>> -
>> To unsubscribe e-mail: [hidden email]
>> 
>>
>>
>>
>> --
>> If you reply to this email, your message will be added to the discussion
>> below:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Why-did-spark-
>> switch-from-AKKA-to-net-tp21522p21523.html
>> To unsubscribe from Why did spark switch from AKKA to net / ..., click
>> here.
>> NAML
>> 
>>
>
> --
> View this message in context: Re: Why did spark switch from AKKA to net /
> ...
> 
>
> Sent from the Apache Spark Developers List mailing list archive
>  at
> Nabble.com.
>


Re: Should we consider a Spark 2.1.1 release?

2017-03-19 Thread Mark Hamstra
That doesn't necessarily follow, Jacek. There is a point where too frequent
releases decrease quality. That is because releases don't come for free --
each one demands a considerable amount of time from release managers,
testers, etc. -- time that would otherwise typically be devoted to
improving (or at least adding to) the code. And that doesn't even begin to
consider the time that needs to be spent putting a new version into a
larger software distribution or that users need to put in to deploy and use
a new version. If you have an extremely lightweight deployment cycle, then
small, quick releases can make sense; but "lightweight" doesn't really
describe a Spark release. The concern for excessive overhead is a large
part of the thinking behind why we stretched out the roadmap to allow
longer intervals between scheduled releases. A similar concern does come
into play for unscheduled maintenance releases -- but I don't think that
that is the forcing function at this point: A 2.1.1 release is a good idea.

On Sun, Mar 19, 2017 at 6:24 AM, Jacek Laskowski  wrote:

> +1
>
> More, smaller, and more frequent releases (so major releases get even more
> quality).
>
> Jacek
>
> On 13 Mar 2017 8:07 p.m., "Holden Karau"  wrote:
>
>> Hi Spark Devs,
>>
>> Spark 2.1 has been out since end of December
>> 
>> and we've got quite a few fixes merged for 2.1.1
>> 
>> .
>>
>> On the Python side one of the things I'd like to see us get out into a
>> patch release is a packaging fix (now merged) before we upload to PyPI &
>> Conda, and we also have the normal batch of fixes like toLocalIterator for
>> large DataFrames in PySpark.
>>
>> I've chatted with Felix & Shivaram who seem to think the R side is
>> looking close to in good shape for a 2.1.1 release to submit to CRAN (if
>> I've miss-spoken my apologies). The two outstanding issues that are being
>> tracked for R are SPARK-18817, SPARK-19237.
>>
>> Looking at the other components quickly it seems like structured
>> streaming could also benefit from a patch release.
>>
>> What do others think - are there any issues people are actively targeting
>> for 2.1.1? Is this too early to be considering a patch release?
>>
>> Cheers,
>>
>> Holden
>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>


Re: Spark Improvement Proposals

2017-03-09 Thread Mark Hamstra
-0 on voting on whether we need a vote.

On Thu, Mar 9, 2017 at 9:00 AM, Reynold Xin  wrote:

> I'm fine without a vote. (are we voting on whether we need a vote?)
>
>
> On Thu, Mar 9, 2017 at 8:55 AM, Sean Owen  wrote:
>
>> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
>> Nah, anyone can call a vote. This really isn't that formal. We just want to
>> declare and document consensus.
>>
>> I think SPIP is just a remix of existing process anyway, and don't think
>> it will actually do much anyway, which is why I am sanguine about the whole
>> thing.
>>
>> To bring this to a conclusion, I will just put the contents of the doc in
>> an email tomorrow for a VOTE. Raise any objections now.
>>
>> On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger  wrote:
>>
>>> I started this idea as a fork with a merge-able change to docs.
>>> Reynold moved it to his google doc, and has suggested during this
>>> email thread that a vote should occur.
>>> If a vote needs to occur, I can't see anything on
>>> http://apache.org/foundation/voting.html suggesting that I can call
>>> for a vote, which is why I'm asking PMC members to do it since they're
>>> the ones who would vote anyway.
>>> Now Sean is saying this is a code/doc change that can just be reviewed
>>> and merged as usual...which is what I tried to do to begin with.
>>>
>>> The fact that you haven't agreed on a process to agree on your process
>>> is, I think, an indication that the process really does need
>>> improvement ;)
>>>
>>>
>


Re: Sharing data in columnar storage between two applications

2016-12-26 Thread Mark Hamstra
Yes, this is part of Matei's current research, for which code is not yet
publicly available at all, much less in a form suitable for production use.

On Mon, Dec 26, 2016 at 2:29 AM, Evan Chan <vel...@gmail.com> wrote:

> Looks pretty interesting, but might take a while honestly.
>
> On Dec 25, 2016, at 5:24 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>
> NOt so much about between applications, rather multiple frameworks within
> an application, but still related: https://cs.stanford.
> edu/~matei/papers/2017/cidr_weld.pdf
>
> On Sun, Dec 25, 2016 at 8:12 PM, Kazuaki Ishizaki <ishiz...@jp.ibm.com>
> wrote:
>
>> Here is an interesting discussion to share data in columnar storage
>> between two applications.
>> https://github.com/apache/spark/pull/15219#issuecomment-265835049
>>
>> One of the ideas is to prepare interfaces (or traits) only for read or
>> write. Each application can then implement only the class it needs (e.g.
>> read or write). For example, FiloDB wants to provide a columnar storage
>> that can be read from Spark. In that case, it is easy to implement only
>> read APIs for Spark. These two classes can be prepared.
>> However, it may lead to incompatibility in ColumnarBatch. ColumnarBatch
>> keeps a set of ColumnVector that can be read or written. The ColumnVector
>> class should have read and write APIs. How can we put the new ColumnVector
>> with only read APIs?  Here is an example to case incompatibility at
>> https://gist.github.com/kiszk/00ab7d0c69f0e598e383cdc8e72bcc4d
>>
>> Another possible idea is that both applications support Apache Arrow
>> APIs.
>> Other approaches could be possible as well.
>>
>> What approach would be good for all of applications?
>>
>> Regards,
>> Kazuaki Ishizaki
>>
>
>
>


Re: Shuffle intermidiate results not being cached

2016-12-26 Thread Mark Hamstra
Shuffle results are only reused if you are reusing the exact same RDD.  If
you are working with Dataframes that you have not explicitly cached, then
they are going to be producing new RDDs within their physical plan creation
and evaluation, so you won't get implicit shuffle reuse.  This is what
https://issues.apache.org/jira/browse/SPARK-11838 is about.
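
For the pattern described below, that means caching the unioned "window"
explicitly if you want later passes to reuse the work. A rough sketch
(previousWindow and newBatch are placeholders for the DataFrames read from
file, not the original code):

import org.apache.spark.sql.functions.sum

val window = previousWindow.union(newBatch).cache() // no implicit reuse without this
val perKey = window.groupBy("key").agg(sum("value").as("total"))
perKey.count()     // first action materializes the cached window
perKey.collect()   // later actions over the same cached data avoid re-reading the files
window.unpersist() // once this window has been superseded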

On Mon, Dec 26, 2016 at 5:56 AM, assaf.mendelson 
wrote:

> Hi,
>
>
>
> Sorry to be bothering everyone on the holidays but I have found what may
> be a bug.
>
>
>
> I am doing a "manual" streaming (see http://stackoverflow.com/
> questions/41266956/apache-spark-streaming-performance for the specific
> code) where I essentially read an additional dataframe each time from file,
> union it with previous dataframes to create a "window" and then do double
> aggregation on the result.
>
> Having looked at the documentation (https://spark.apache.org/
> docs/latest/programming-guide.html#which-storage-level-to-choose right
> above the headline) I expected spark to automatically cache the partial
> aggregation for each dataframe read and then continue with the aggregations
> from there. Instead it seems it reads each dataframe from file all over
> again.
>
> Is this a bug? Am I doing something wrong?
>
>
>
> Thanks.
>
> Assaf.
>
> --
> View this message in context: Shuffle intermidiate results not being
> cached
> 
> Sent from the Apache Spark Developers List mailing list archive
>  at
> Nabble.com.
>


Re: Sharing data in columnar storage between two applications

2016-12-25 Thread Mark Hamstra
NOt so much about between applications, rather multiple frameworks within
an application, but still related:
https://cs.stanford.edu/~matei/papers/2017/cidr_weld.pdf

On Sun, Dec 25, 2016 at 8:12 PM, Kazuaki Ishizaki 
wrote:

> Here is an interesting discussion to share data in columnar storage
> between two applications.
> https://github.com/apache/spark/pull/15219#issuecomment-265835049
>
> One of the ideas is to prepare interfaces (or traits) only for read or
> write. Each application can then implement only the class it needs (e.g.
> read or write). For example, FiloDB wants to provide a columnar storage
> that can be read from Spark. In that case, it is easy to implement only
> read APIs for Spark. These two classes can be prepared.
> However, it may lead to incompatibility in ColumnarBatch. ColumnarBatch
> keeps a set of ColumnVector that can be read or written. The ColumnVector
> class should have read and write APIs. How can we put the new ColumnVector
> with only read APIs?  Here is an example of the incompatibility at
> https://gist.github.com/kiszk/00ab7d0c69f0e598e383cdc8e72bcc4d
>
> Another possible idea is that both applications support Apache Arrow APIs.
> Other approaches could be possible as well.
>
> What approach would be good for all of applications?
>
> Regards,
> Kazuaki Ishizaki
>


Re: Can I add a new method to RDD class?

2016-12-07 Thread Mark Hamstra
The easiest way is probably with:

mvn versions:set -DnewVersion=your_new_version

On Wed, Dec 7, 2016 at 11:31 AM, Teng Long  wrote:

> Hi Holden,
>
> Can you please tell me how to edit version numbers efficiently? the
> correct way? I'm really struggling with this and don't know where to look.
>
> Thanks,
> Teng
>
>
> On Dec 6, 2016, at 4:02 PM, Teng Long  wrote:
>
> Hi Jakob,
>
> It seems like I’ll have to either replace the version with my custom
> version in all the pom.xml files in every subdirectory that has one and
> publish locally, or keep the version (i.e. 2.0.2) and manually remove the
> spark repository cache in ~/.ivy2 and ~/.m2 and publish spark locally, then
> compile my application with the correct version respectively to make it
> work. I think there has to be an elegant way to do this.
>
> On Dec 6, 2016, at 1:07 PM, Jakob Odersky-2 [via Apache Spark Developers
> List] <[hidden email]
> > wrote:
>
> Yes, I think changing the  property (line 29) in spark's root
> pom.xml should be sufficient. However, keep in mind that you'll also
> need to publish spark locally before you can access it in your test
> application.
>
> On Tue, Dec 6, 2016 at 2:50 AM, Teng Long <[hidden email]> wrote:
>
> > Thank you Jokob for clearing things up for me.
> >
> > Before, I thought my application was compiled against my local build
> since I
> > can get all the logs I just added in spark-core. But it was all along
> using
> > spark downloaded from remote maven repository, and that's why I "cannot"
> add
> > new RDD methods in.
> >
> > How can I specify a custom version? modify version numbers in all the
> > pom.xml file?
> >
> >
> >
> > On Dec 5, 2016, at 9:12 PM, Jakob Odersky <[hidden email]> wrote:
> >
> > m rdds in an "org.apache.spark" package as well
> >
> >
> -
> To unsubscribe e-mail: [hidden email]
>
>
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Can-I-add-a-new-
> method-to-RDD-class-tp20100p20151.html
> To unsubscribe from Can I add a new method to RDD class?, click here.
> NAML
> 
>
>
>
> --
> View this message in context: Re: Can I add a new method to RDD class?
> 
> Sent from the Apache Spark Developers List mailing list archive
>  at Nabble.com
> .
>
>


Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread Mark Hamstra
You still have the problem that even within a single Job it is often the
case that not every Exchange really wants to use the same number of shuffle
partitions.
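
For example (a sketch with made-up DataFrames; clicks and users stand in for
any two large inputs), a single action can contain exchanges with very
different ideal partition counts, yet one spark.sql.shuffle.partitions value
applies to both:

val joined = clicks.join(users, "userId")          // exchange #1: wide shuffle over large inputs, wants many partitions
val byCountry = joined.groupBy("country").count()  // exchange #2: low-cardinality aggregate, a handful of partitions would do
byCountry.collect()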

On Tue, Nov 15, 2016 at 2:46 AM, Sean Owen  wrote:

> Once you get to needing this level of fine-grained control, should you not
> consider using the programmatic API in part, to let you control individual
> jobs?
>
>
> On Tue, Nov 15, 2016 at 1:19 AM leo9r  wrote:
>
>> Hi Daniel,
>>
>> I completely agree with your request. As the amount of data being
>> processed
>> with SparkSQL grows, tweaking sql.shuffle.partitions becomes a common need
>> to prevent OOM and performance degradation. The fact that
>> sql.shuffle.partitions cannot be set several times in the same job/action,
>> because of the reason you explain, is a big inconvenient for the
>> development
>> of ETL pipelines.
>>
>> Have you got any answer or feedback in this regard?
>>
>> Thanks,
>> Leo Lezcano
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-
>> developers-list.1001551.n3.nabble.com/Spark-SQL-parameters-like-shuffle-
>> partitions-should-be-stored-in-the-lineage-tp13240p19867.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread Mark Hamstra
AFAIK, the adaptive shuffle partitioning still isn't completely ready to be
made the default, and there are some corner issues that need to be
addressed before this functionality is declared finished and ready.  E.g.,
the current logic can make data skew problems worse by turning One Big
Partition into an even larger partition before the ExchangeCoordinator
decides to create a new one.  That can be worked around by changing the
logic to "If including the nextShuffleInputSize would exceed the target
partition size, then start a new partition":
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ExchangeCoordinator.scala#L173

If you're willing to work around those kinds of issues to fit your use
case, then I do know that the adaptive shuffle partitioning can be made to
work well even if it is not perfect.  It would be nice, though, to see
adaptive partitioning be finished and hardened to the point where it
becomes the default, because a fixed number of shuffle partitions has some
significant limitations and problems.
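
For concreteness, the "start a new partition before exceeding the target"
behavior described above looks roughly like this (a standalone sketch, not the
actual ExchangeCoordinator code):

import scala.collection.mutable.ArrayBuffer

// Group consecutive shuffle inputs into post-shuffle partitions of roughly targetSize bytes.
def coalesce(inputSizes: Seq[Long], targetSize: Long): Seq[Seq[Long]] = {
  val partitions = ArrayBuffer(ArrayBuffer.empty[Long])
  var currentBytes = 0L
  for (size <- inputSizes) {
    // Check *before* adding: if including the next input would exceed the target,
    // close the current partition instead of letting an already-big one grow larger.
    if (currentBytes > 0 && currentBytes + size > targetSize) {
      partitions += ArrayBuffer.empty[Long]
      currentBytes = 0L
    }
    partitions.last += size
    currentBytes += size
  }
  partitions.map(_.toSeq).toSeq
}

coalesce(Seq(10L, 10L, 90L, 10L), 50L) // -> Seq(Seq(10, 10), Seq(90), Seq(10))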

On Tue, Nov 15, 2016 at 12:50 AM, leo9r  wrote:

> That's great insight Mark, I'm looking forward to give it a try!!
>
> According to jira's  Adaptive execution in Spark
>   , it seems that some
> functionality was added in Spark 1.6.0 and the rest is still in progress.
> Are there any improvements to the SparkSQL adaptive behavior in Spark 2.0+
> that you know?
>
> Thanks and best regards,
> Leo
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Spark-SQL-parameters-like-shuffle-
> partitions-should-be-stored-in-the-lineage-tp13240p19885.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-14 Thread Mark Hamstra
Take a look at spark.sql.adaptive.enabled and the ExchangeCoordinator.  A
single, fixed-size sql.shuffle.partitions is not the only way to control
the number of partitions in an Exchange -- if you are willing to deal with
code that is still off by default.
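
A minimal sketch of opting in (assuming an existing SparkSession named spark,
e.g. the shell's; conf keys as they appear in the 2.x SQLConf, values and the
toy aggregation only illustrative):

import org.apache.spark.sql.functions.col

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "64m")

// With the ExchangeCoordinator on, the post-shuffle partition count for this
// aggregation is derived from the map output sizes rather than taken verbatim
// from spark.sql.shuffle.partitions.
spark.range(0, 1000000).groupBy(col("id") % 100).count().collect()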

On Mon, Nov 14, 2016 at 4:19 PM, leo9r  wrote:

> Hi Daniel,
>
> I completely agree with your request. As the amount of data being processed
> with SparkSQL grows, tweaking sql.shuffle.partitions becomes a common need
> to prevent OOM and performance degradation. The fact that
> sql.shuffle.partitions cannot be set several times in the same job/action,
> because of the reason you explain, is a big inconvenient for the
> development
> of ETL pipelines.
>
> Have you got any answer or feedback in this regard?
>
> Thanks,
> Leo Lezcano
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Spark-SQL-parameters-like-shuffle-
> partitions-should-be-stored-in-the-lineage-tp13240p19867.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Mark Hamstra
You're right; so we could remove Java 7 support in 2.1.0.

Both Holden and I not having the facts immediately to mind does suggest,
however, that we should be doing a better job of making sure that
information about deprecated language versions is inescapably public.
That's harder to do with a language version deprecation since using such a
version doesn't really give you the same kind of repeated warnings that
using a deprecated API does.

On Tue, Oct 25, 2016 at 12:59 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> No, I think our intent is that using a deprecated language version can
> generate warnings, but that it should still work; whereas once we remove
> support for a language version, then it really is ok for Spark developers
> to do things not compatible with that version and for users attempting to
> use that version to encounter errors.
>
> OK, understood.
>
> With that understanding, the first steps toward removing support for Scala
> 2.10 and/or Java 7 would be to deprecate them in 2.1.0. Actual removal of
> support could then occur at the earliest in 2.2.0.
>
> Java 7 is already deprecated per the 2.0 release notes which I linked to. Here
> they are
> <http://spark.apache.org/releases/spark-release-2-0-0.html#deprecations>
> again.
> ​
>
> On Tue, Oct 25, 2016 at 3:19 PM Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> No, I think our intent is that using a deprecated language version can
>> generate warnings, but that it should still work; whereas once we remove
>> support for a language version, then it really is ok for Spark developers to
>> do things not compatible with that version and for users attempting to use
>> that version to encounter errors.
>>
>> With that understanding, the first steps toward removing support for
>> Scala 2.10 and/or Java 7 would be to deprecate them in 2.1.0.  Actual
>> removal of support could then occur at the earliest in 2.2.0.
>>
>> On Tue, Oct 25, 2016 at 12:13 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>> FYI: Support for both Python 2.6 and Java 7 was deprecated in 2.0 (see release
>> notes <http://spark.apache.org/releases/spark-release-2-0-0.html> under
>> notes <http://spark.apache.org/releases/spark-release-2-0-0.html> under
>> Deprecations). The deprecation notice didn't offer a specific timeline for
>> completely dropping support other than to say they "might be removed in
>> future versions of Spark 2.x".
>>
>> Not sure what the distinction between deprecating and dropping support is
>> for language versions, since in both cases it seems like it's OK to do
>> things not compatible with the deprecated versions.
>>
>> Nick
>>
>>
>> On Tue, Oct 25, 2016 at 11:50 AM Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>> I'd also like to add Python 2.6 to the list of things. We've considered
>> dropping it before but never followed through to the best of my knowledge
>> (although on mobile right now so can't double check).
>>
>> On Tuesday, October 25, 2016, Sean Owen <so...@cloudera.com> wrote:
>>
>> I'd like to gauge where people stand on the issue of dropping support for
>> a few things that were considered for 2.0.
>>
>> First: Scala 2.10. We've seen a number of build breakages this week
>> because the PR builder only tests 2.11. No big deal at this stage, but, it
>> did cause me to wonder whether it's time to plan to drop 2.10 support,
>> especially with 2.12 coming soon.
>>
>> Next, Java 7. It's reasonably old and out of public updates at this
>> stage. It's not that painful to keep supporting, to be honest. It would
>> simplify some bits of code, some scripts, some testing.
>>
>> Hadoop versions: I think the general argument is that most anyone
>> would be using, at the least, 2.6, and it would simplify some code that has
>> to reflect to use not-even-that-new APIs. It would remove some moderate
>> complexity in the build.
>>
>>
>> "When" is a tricky question. Although it's a little aggressive for minor
>> releases, I think these will all happen before 3.x regardless. 2.1.0 is not
>> out of the question, though coming soon. What about ... 2.2.0?
>>
>>
>> Although I tend to favor dropping support, I'm mostly asking for current
>> opinions.
>>
>>
>>
>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>>
>>


Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Mark Hamstra
What's changed since the last time we discussed these issues, about 7
months ago?  Or, another way to formulate the question: What are the
threshold criteria that we should use to decide when to end Scala 2.10
and/or Java 7 support?

On Tue, Oct 25, 2016 at 8:36 AM, Sean Owen  wrote:

> I'd like to gauge where people stand on the issue of dropping support for
> a few things that were considered for 2.0.
>
> First: Scala 2.10. We've seen a number of build breakages this week
> because the PR builder only tests 2.11. No big deal at this stage, but, it
> did cause me to wonder whether it's time to plan to drop 2.10 support,
> especially with 2.12 coming soon.
>
> Next, Java 7. It's reasonably old and out of public updates at this stage.
> It's not that painful to keep supporting, to be honest. It would simplify
> some bits of code, some scripts, some testing.
>
> Hadoop versions: I think the general argument is that most anyone
> would be using, at the least, 2.6, and it would simplify some code that has
> to reflect to use not-even-that-new APIs. It would remove some moderate
> complexity in the build.
>
>
> "When" is a tricky question. Although it's a little aggressive for minor
> releases, I think these will all happen before 3.x regardless. 2.1.0 is not
> out of the question, though coming soon. What about ... 2.2.0?
>
>
> Although I tend to favor dropping support, I'm mostly asking for current
> opinions.
>


Re: Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-24 Thread Mark Hamstra
The advice to avoid idioms that may not be universally understood is good.
My further issue with the misuse of "straw-man" (which really is not, or
should not be, separable from "straw-man argument") is that a "straw-man"
in the established usage is something that is always intended to be a
failure or designed to be obviously and fatally flawed.  That's what makes
it fundamentally different from a trial balloon or a first crack at
something or a prototype or an initial design proposal -- these are all
intended, despite any remaining flaws, to have merits that are likely worth
pursuing further, whereas a straw-man is only intended to be knocked apart
as a way to preclude and put an end to further consideration of something.


On Mon, Oct 24, 2016 at 10:38 AM, Sean Owen <so...@cloudera.com> wrote:

> Well, it's more of a reference to the fallacy than anything. Writing down
> a proposed action implicitly claims it's what others are arguing for. It's
> self-deprecating to call it a "straw man", suggesting that it may not at
> all be what others are arguing for, and is done to openly invite criticism
> and feedback. The logical fallacy is "attacking a straw man", and that's
> not what was written here.
>
> Really, the important thing is that we understand each other, and I'm
> guessing you did. Although I think the usage here is fine, casually,
> avoiding idioms is best, where plain language suffices, especially given we
> have people from lots of language backgrounds here.
>
>
> On Mon, Oct 24, 2016 at 6:11 PM Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> Alright, that does it!  Who is responsible for this "straw-man"
>> abuse that is becoming too commonplace in the Spark community?  "Straw-man"
>> does not mean something like "trial balloon" or "run it up the flagpole and
>> see if anyone salutes", and I would really appreciate it if Spark
>> developers would stop using "straw-man" to mean anything other than its
>> established meaning: The logical fallacy of declaring victory by knocking
>> down an easily defeated argument or position that the opposition has never
>> actually made.
>>
>> On Mon, Oct 24, 2016 at 5:51 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> BTW I wrote up a straw-man proposal for migrating the wiki content:
>>
>> https://issues.apache.org/jira/browse/SPARK-18073
>>
>> On Tue, Oct 18, 2016 at 12:25 PM Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>> Right now the wiki isn't particularly accessible to updates by external
>> contributors. We've already got a contributing to spark page which just
>> links to the wiki - how about if we just move the wiki contents over? This
>> way contributors can contribute to our documentation about how to
>> contribute probably helping clear up points of confusion for new
>> contributors which the rest of us may be blind to.
>>
>> If we do this we would probably want to update the wiki page to point to
>> the documentation generated from markdown. It would also mean that the
>> results of any update to the contributing guide take a full release cycle
>> to be visible. Another alternative would be opening up the wiki to a
>> broader set of people.
>>
>> I know a lot of people are probably getting ready for Spark Summit EU
>> (and I hope to catch up with some of y'all there) but I figured this a
>> relatively minor proposal.
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>>
>>


Re: Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-24 Thread Mark Hamstra
Alright, that does it!  Who is responsible for this "straw-man" abuse
that is becoming too commonplace in the Spark community?  "Straw-man" does
not mean something like "trial balloon" or "run it up the flagpole and see
if anyone salutes", and I would really appreciate it if Spark developers
would stop using "straw-man" to mean anything other than its established
meaning: The logical fallacy of declaring victory by knocking down an
easily defeated argument or position that the opposition has never actually
made.

On Mon, Oct 24, 2016 at 5:51 AM, Sean Owen  wrote:

> BTW I wrote up a straw-man proposal for migrating the wiki content:
>
> https://issues.apache.org/jira/browse/SPARK-18073
>
> On Tue, Oct 18, 2016 at 12:25 PM Holden Karau 
> wrote:
>
>> Right now the wiki isn't particularly accessible to updates by external
>> contributors. We've already got a contributing to spark page which just
>> links to the wiki - how about if we just move the wiki contents over? This
>> way contributors can contribute to our documentation about how to
>> contribute probably helping clear up points of confusion for new
>> contributors which the rest of us may be blind to.
>>
>> If we do this we would probably want to update the wiki page to point to
>> the documentation generated from markdown. It would also mean that the
>> results of any update to the contributing guide take a full release cycle
>> to be visible. Another alternative would be opening up the wiki to a
>> broader set of people.
>>
>> I know a lot of people are probably getting ready for Spark Summit EU
>> (and I hope to catch up with some of y'all there) but I figured this a
>> relatively minor proposal.
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>


Re: DAGScheduler.handleJobCancellation uses jobIdToStageIds for verification while jobIdToActiveJob for lookup?

2016-10-13 Thread Mark Hamstra
There were at least a couple of ideas behind the original thinking on using
both of those Maps: 1) We needed the ability to efficiently get from jobId
to both ActiveJob objects and to sets of associated Stages, and using both
Maps here was an opportunity to do a little sanity checking to make sure
that the Maps looked at least minimally consistent for the Job at issue; 2)
Similarly, it could serve as a kind of hierarchical check -- first, for the
Job which we are being asked to cancel, that we ever knew enough to even
register its existence; second, that for a JobId that passes the first
test, that we still have an ActiveJob that can be canceled.

Now, without doing a bunch of digging into the code archives, I can't tell
you for sure whether those ideas were ever implemented completely correctly
or whether they still make valid sense in the current code, but from
looking at the lines that you highlighted, I can tell you that even if the
ideas still make sense and are worth carrying forward, the current code
doesn't implement them correctly.  In particular, if it is possible for the
`jobId` to not be in `jobIdToActiveJob`, we're going to produce a
`NoSuchElementException` for the missing key instead of handling it or even
reporting it in a more useful way.
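
A standalone sketch of the pattern and of a more defensive shape (stand-in maps
and println instead of the real DAGScheduler fields and logging):

import scala.collection.mutable

val jobIdToStageIds = mutable.HashMap[Int, Set[Int]](1 -> Set(0, 1))
val jobIdToActiveJob = mutable.HashMap[Int, String]() // String stands in for ActiveJob

def handleJobCancellation(jobId: Int): Unit = {
  if (!jobIdToStageIds.contains(jobId)) {
    println(s"Trying to cancel unregistered job $jobId") // sanity check #1: was the job ever registered?
  } else {
    // Sanity check #2: the job was registered, but is there still an ActiveJob to cancel?
    jobIdToActiveJob.get(jobId) match {
      case Some(job) => println(s"Cancelling $job") // the existing cancellation path would go here
      case None => println(s"Job $jobId has registered stages but no ActiveJob") // instead of a NoSuchElementException
    }
  }
}

handleJobCancellation(1) // reports the inconsistency instead of throwing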

On Thu, Oct 13, 2016 at 8:11 AM, Jacek Laskowski <ja...@japila.pl> wrote:

> Thanks Imran! Not only did the response come so promptly, but also
> it's something I could work on (and have another Spark contributor
> badge unlocked)! Thanks.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Oct 13, 2016 at 5:02 PM, Imran Rashid <iras...@cloudera.com>
> wrote:
> > Hi Jacek,
> >
> > doesn't look like there is any good reason -- Mark Hamstra might know
> this
> > best.  Feel free to open a jira & pr for it, you can ping Mark, Kay
> > Ousterhout, and me (@squito) for review.
> >
> > Imran
> >
> > On Thu, Oct 13, 2016 at 7:56 AM, Jacek Laskowski <ja...@japila.pl>
> wrote:
> >>
> >> Hi,
> >>
> >> Is there a reason why DAGScheduler.handleJobCancellation checks the
> >> active job id in jobIdToStageIds [1] while looking the job up in
> >> jobIdToActiveJob [2]? Perhaps synchronized earlier yet still
> >> inconsistent.
> >>
> >> [1]
> >> https://github.com/apache/spark/blob/master/core/src/
> main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1372
> >> [2]
> >> https://github.com/apache/spark/blob/master/core/src/
> main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1376
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> 
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
>


Re: Spark Improvement Proposals

2016-10-10 Thread Mark Hamstra
There is a larger issue to keep in mind, and that is that what you are
proposing is a procedure that, as far as I am aware, hasn't previously been
adopted in an Apache project, and thus is not an easy or exact fit with
established practices that have been blessed as "The Apache Way".  As such,
we need to be careful, because we have run into some trouble in the past
with some people inside the ASF but essentially outside the Spark community who
didn't like the way we were doing things.

On Mon, Oct 10, 2016 at 3:53 PM, Cody Koeninger <c...@koeninger.org> wrote:

> Apache documents say lots of confusing stuff, including that committers are
> in practice given a vote.
>
> https://www.apache.org/foundation/voting.html
>
> I don't care either way. If someone wants me to sub committer for PMC in
> the voting section, fine, we just need a clear outcome.
>
> On Oct 10, 2016 17:36, "Mark Hamstra" <m...@clearstorydata.com> wrote:
>
>> If I'm correctly understanding the kind of voting that you are talking
>> about, then to be accurate, it is only the PMC members that have a vote,
>> not all committers: https://www.apache.org/foundation/how-it-works.
>> html#pmc-members
>>
>> On Mon, Oct 10, 2016 at 12:02 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> I think the main value is in being honest about what's going on.  No
>>> one other than committers can cast a meaningful vote, that's the
>>> reality.  Beyond that, if people think it's more open to allow formal
>>> proposals from anyone, I'm not necessarily against it, but my main
>>> question would be this:
>>>
>>> If anyone can submit a proposal, are committers actually going to
>>> clearly reject and close proposals that don't meet the requirements?
>>>
>>> Right now we have a serious problem with lack of clarity regarding
>>> contributions, and that cannot spill over into goal-setting.
>>>
>>> > On Mon, Oct 10, 2016 at 1:54 PM, Ryan Blue <rb...@netflix.com> wrote:
>>> > +1 to votes to approve proposals. I agree that proposals should have an
>>> > official mechanism to be accepted, and a vote is an established means
>>> of
>>> > doing that well. I like that it includes a period to review the
>>> proposal and
>>> > I think proposals should have been discussed enough ahead of a vote to
>>> > survive the possibility of a veto.
>>> >
>>> > I also like the names that are short and (mostly) unique, like SEP.
>>> >
>>> > Where I disagree is with the requirement that a committer must formally
>>> > propose an enhancement. I don't see the value of restricting this: if
>>> > someone has the will to write up a proposal then they should be
>>> encouraged
>>> > to do so and start a discussion about it. Even if there is a political
>>> > reality as Cody says, what is the value of codifying that in our
>>> process? I
>>> > think restricting who can submit proposals would only undermine them by
>>> > pushing contributors out. Maybe I'm missing something here?
>>> >
>>> > rb
>>> >
>>> >
>>> >
>>> >> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>> >>
>>> >> Yes, users suggesting SIPs is a good thing and is explicitly called
>>> >> out in the linked document under the Who? section.  Formally proposing
>>> >> them, not so much, because of the political realities.
>>> >>
>>> >> Yes, implementation strategy definitely affects goals.  There are all
>>> >> kinds of examples of this, I'll pick one that's my fault so as to
>>> >> avoid sounding like I'm blaming:
>>> >>
>>> >> When I implemented the Kafka DStream, one of my (not explicitly agreed
>>> >> upon by the community) goals was to make sure people could use the
>>> >> Dstream with however they were already using Kafka at work.  The lack
>>> >> of explicit agreement on that goal led to all kinds of fighting with
>>> >> committers, that could have been avoided.  The lack of explicit
>>> >> up-front strategy discussion led to the DStream not really working
>>> >> with compacted topics.  I knew about compacted topics, but don't have
>>> >> a use for them, so had a blind spot there.  If there was explicit
>>> >> up-front discussion that my strategy was "assume that batches can be
>>> >> de

Re: Spark Improvement Proposals

2016-10-10 Thread Mark Hamstra
I'm not a fan of the SEP acronym.  Besides its prior established meaning of
"Somebody else's problem", there are other inappropriate or offensive
connotations, such as this Australian slang that often gets shortened to
just "sep":  http://www.urbandictionary.com/define.php?term=Seppo

On Sun, Oct 9, 2016 at 4:00 PM, Nicholas Chammas  wrote:

> On Sun, Oct 9, 2016 at 5:19 PM Cody Koeninger  wrote:
>
>> Regarding name, if the SIP overlap is a concern, we can pick a different
>> name.
>>
>> My tongue in cheek suggestion would be
>>
>> Spark Lightweight Improvement process (SPARKLI)
>>
>
> If others share my minor concern about the SIP name, I propose Spark
> Enhancement Proposal (SEP), taking inspiration from the Python Enhancement
> Proposal name.
>
> So if we're going to number proposals like other projects do, they'd be
> numbered SEP-1, SEP-2, etc. This avoids the naming conflict with Scala SIPs.
>
> Another way to avoid a conflict is to stick with "Spark Improvement
> Proposal" but use SPIP as the acronym. So SPIP-1, SPIP-2, etc.
>
> Anyway, it's not a big deal. I just wanted to raise this point.
>
> Nick
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-10-01 Thread Mark Hamstra
Thanks for doing the investigation.  What I found out yesterday is that my
other macOS 10.12 machine ran into the same issue, while various Linux
machines did not, so there may well be an OS-specific component to this
particular OOM-in-tests problem.  Unfortunately, increasing the heap as you
suggest doesn't resolve the issue for me -- even if I increase it all the
way to 6g.  This does appear to be environment-specific (and not an
environment that I would expect to see in Spark deployments), so I agree
that this is not a blocker.

I looked a bit into the other annoying issue that I've been seeing for
awhile now with the shell terminating when YarnClusterSuite is run on an
Ubuntu 16.04 box.  Both Sean Owen and I have run into this problem when
running the tests over an ssh connection, and we each assumed that it was
an ssh-specific problem.  Yesterday, though, I spent some time logged
directly into both a normal graphical session and a console session, and I
am seeing similar problems there. Running the tests from the graphical
session actually ends up failing and kicking me all the way out to the
login screen when YarnClusterSuite is run, while doing the same from the
console ends up terminating the shell.  All very strange, and I don't have
much of a clue what is going on yet, but it also seems to quite specific to
this environment, so I wouldn't consider this issue to be a blocker, either

On Fri, Sep 30, 2016 at 8:47 PM, Shixiong(Ryan) Zhu <shixi...@databricks.com
> wrote:

> Hey Mark,
>
> I can reproduce the failure locally using your command. There were a lot
> of OutOfMemoryError in the unit test log. I increased the heap size from 3g
> to 4g at https://github.com/apache/spark/blob/v2.0.1-rc4/pom.xml#L2029
> and it passed tests. I think the patch you mentioned increased the memory
> usage of BlockManagerSuite and made the tests easy to OOM. It can be
> fixed by mocking SparkContext (or maybe that's not necessary since Jenkins's
> maven and sbt builds are green now).
>
> However, since this is only a test issue, it should not be a blocker.
>
>
> On Fri, Sep 30, 2016 at 8:34 AM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> 0
>>
>> RC4 is causing a build regression for me on at least one of my machines.
>> RC3 built and ran tests successfully, but the tests consistently fail with
>> RC4 unless I revert 9e91a1009e6f916245b4d4018de1664ea3decfe7,
>> "[SPARK-15703][SCHEDULER][CORE][WEBUI] Make ListenerBus event queue size
>> configurable (branch 2.0)".  This is using build/mvn -U -Pyarn -Phadoop-2.7
>> -Pkinesis-asl -Phive -Phive-thriftserver -Dpyspark -Dsparkr -DskipTests
>> clean package; build/mvn -U -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>> -Phive-thriftserver -Dpyspark -Dsparkr test.  Environment is macOS 10.12,
>> Java 1.8.0_102.
>>
>> There are no tests that go red.  Rather, the core tests just end after...
>>
>> ...
>> BlockManagerSuite:
>> ...
>> - overly large block
>> - block compression
>> - block store put failure
>>
>> ...with only the generic "[ERROR] Failed to execute goal
>> org.scalatest:scalatest-maven-plugin:1.0:test (test) on project
>> spark-core_2.11: There are test failures".
>>
>> I'll try some other environments today to see whether I can turn this 0
>> into either a -1 or +1, but right now that commit is looking deeply
>> suspicious to me.
>>
>> On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a
>>> majority of at least 3+1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.0.1
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The tag to be voted on is v2.0.1-rc4 (933d2c1ea4e5f5c4ec8d375b5ccaa
>>> 4577ba4be38)
>>>
>>> This release candidate resolves 301 issues:
>>> https://s.apache.org/spark-2.0.1-jira
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1203/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-docs/
>>>

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-30 Thread Mark Hamstra
0

RC4 is causing a build regression for me on at least one of my machines.
RC3 built and ran tests successfully, but the tests consistently fail with
RC4 unless I revert 9e91a1009e6f916245b4d4018de1664ea3decfe7,
"[SPARK-15703][SCHEDULER][CORE][WEBUI] Make ListenerBus event queue size
configurable (branch 2.0)".  This is using build/mvn -U -Pyarn -Phadoop-2.7
-Pkinesis-asl -Phive -Phive-thriftserver -Dpyspark -Dsparkr -DskipTests
clean package; build/mvn -U -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
-Phive-thriftserver -Dpyspark -Dsparkr test.  Environment is macOS 10.12,
Java 1.8.0_102.

There are no tests that go red.  Rather, the core tests just end after...

...
BlockManagerSuite:
...
- overly large block
- block compression
- block store put failure

...with only the generic "[ERROR] Failed to execute goal
org.scalatest:scalatest-maven-plugin:1.0:test (test) on project
spark-core_2.11: There are test failures".

I'll try some other environments today to see whether I can turn this 0
into either a -1 or +1, but right now that commit is looking deeply
suspicious to me.

On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a
> majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.1
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.1-rc4 (933d2c1ea4e5f5c4ec8d375b5ccaa4
> 577ba4be38)
>
> This release candidate resolves 301 issues: https://s.apache.org/spark-2.
> 0.1-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1203/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.0.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series.  Bugs already
> present in 2.0.0, missing features, or bugs related to new features will
> not necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
> (i.e. RC5) is cut, I will change the fix version of those patches to 2.0.1.
>
>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-27 Thread Mark Hamstra
If we're going to cut another RC, then it would be good to get this in as
well (assuming that it is merged shortly):
https://github.com/apache/spark/pull/15213

It's not a regression, and it shouldn't happen too often, but when failed
stages don't get resubmitted it is a fairly significant issue.

On Tue, Sep 27, 2016 at 1:31 PM, Reynold Xin <r...@databricks.com> wrote:

> Actually I'm going to have to -1 the release myself. Sorry for crashing
> the party, but I saw two super critical issues discovered in the last 2
> days:
>
> https://issues.apache.org/jira/browse/SPARK-17666  -- this would
> eventually hang Spark when running against S3 (and many other storage
> systems)
>
> https://issues.apache.org/jira/browse/SPARK-17673  -- this is a
> correctness issue across all non-file data sources.
>
> If we go ahead and release 2.0.1 based on this RC, we would need to cut
> 2.0.2 immediately.
>
>
>
>
>
> On Tue, Sep 27, 2016 at 10:18 AM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> I've got a couple of build niggles that should really be investigated at
>> some point (what look to be OOM issues in spark-repl when building and
>> testing with mvn in a single pass instead of in two passes with -DskipTests
>> first; the killing of ssh sessions by YarnClusterSuite), but these
>> aren't anything that should hold up the release.
>>
>> +1
>>
>> On Sat, Sep 24, 2016 at 3:08 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT and passes if
>>> a majority of at least 3+1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.0.1
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The tag to be voted on is v2.0.1-rc3 (9d28cc10357a8afcfb2fa2e6eecb5
>>> c2cc2730d17)
>>>
>>> This release candidate resolves 290 issues:
>>> https://s.apache.org/spark-2.0.1-jira
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1201/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-docs/
>>>
>>>
>>> Q: How can I help test this release?
>>> A: If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions from 2.0.0.
>>>
>>> Q: What justifies a -1 vote for this release?
>>> A: This is a maintenance release in the 2.0.x series.  Bugs already
>>> present in 2.0.0, missing features, or bugs related to new features will
>>> not necessarily block this release.
>>>
>>> Q: What fix version should I use for patches merging into branch-2.0
>>> from now on?
>>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
>>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.1.
>>>
>>>
>>>
>>
>


Re: [discuss] Spark 2.x release cadence

2016-09-27 Thread Mark Hamstra
+1

And I'll dare say that for those with Spark in production, what is more
important is that maintenance releases come out in a timely fashion than
that new features are released one month sooner or later.

On Tue, Sep 27, 2016 at 12:06 PM, Reynold Xin  wrote:

> We are 2 months past releasing Spark 2.0.0, an important milestone for the
> project. Spark 2.0.0 deviated (it took 6 months) from the regular release
> cadence we had for the 1.x line, and we never explicitly discussed what the
> release cadence should look like for 2.x. Thus this email.
>
> During Spark 1.x, roughly every three months we made a new 1.x feature
> release (e.g. 1.5.0 came out three months after 1.4.0). Development
> happened primarily in the first two months, and then a release branch was
> cut at the end of month 2, and the last month was reserved for QA and
> release preparation.
>
> During 2.0.0 development, I really enjoyed the longer release cycle
> because there were a lot of major changes happening and the longer time was
> critical for thinking through architectural changes as well as API design.
> While I don't expect the same degree of drastic changes in a 2.x feature
> release, I do think it'd make sense to increase the length of release cycle
> so we can make better designs.
>
> My strawman proposal is to maintain a regular release cadence, as we did
> in Spark 1.x, and increase the cycle from 3 months to 4 months. This
> effectively gives us ~50% more time to develop (in reality it'd be slightly
> less than 50% since longer dev time also means longer QA time). As for
> maintenance releases, I think those should still be cut on-demand, similar
> to Spark 1.x, but more aggressively.
>
> To put this into perspective, 4-month cycle means we will release Spark
> 2.1.0 at the end of Nov or early Dec (and branch cut / code freeze at the
> end of Oct).
>
> I am curious what others think.
>
>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC2)

2016-09-25 Thread Mark Hamstra
Spark's branch-2.0 is a maintenance branch, effectively meaning that only
bug-fixes will be added to it.  There are other maintenance branches (such
as branch-1.6) that are also receiving bug-fixes in theory, but not so much
in fact as maintenance branches get older.  The major and minor version
numbers of maintenance branches stay fixed, with only the patch-level
version number changing as new releases are made from a maintenance
branch.  Thus, the next release from branch-2.0 will be 2.0.1; the set of
bug-fixes contributing to the following branch-2.0 release will result in 2.0.2,
etc.

New work, both bug-fixes and non-bug-fixes, is contributed to the master
branch.  New releases from the master branch increment the minor version
number (unless they include API-breaking changes, in which case the major
version number changes -- e.g. Spark 1.x.y to Spark 2.0.0).  Thus the first
release from the current master branch will be 2.1.0, the next will be
2.2.0, etc.

There should be active "next JIRA numbers" for whatever will be the next
release from the master as well as each of the maintenance branches.

This is all just basic SemVer (http://semver.org/), so it surprises me some
that you are finding the concepts to be new, difficult or frustrating.
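
To make the mapping concrete, here is a small illustrative Scala sketch of the
branch-to-next-release relationship described above; the branch names and
version numbers come from this thread, and treating them as a Map is purely an
illustration, not anything in the Spark codebase:

  // Next planned release from each active branch at the time of this thread.
  val nextRelease = Map(
    "master"     -> "2.1.0", // new features land here; the minor version increments
    "branch-2.0" -> "2.0.1", // maintenance branch; only the patch level increments
    "branch-1.6" -> "1.6.x"  // older maintenance branch, receiving fewer fixes over time
  )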

On Sun, Sep 25, 2016 at 8:31 AM, Jacek Laskowski  wrote:

> Hi Sean,
>
> I remember a similar discussion about the releases in Spark and I must
> admit it again -- I simply don't get it. I seem to not have paid
> enough attention to details to appreciate it. I apologize for asking
> the very same questions again and again. Sorry.
>
> Re the next release, I was referring to JIRA where 2.0.2 came up quite
> recently for issues not included in 2.0.1. This disjoint between
> releases and JIRA versions causes even more frustration whenever I'm
> asked what and when the next release is going to be. It's not as
> simple as I think it should be (for me).
>
> (I really hope it's only me with this mental issue)
>
> Unless I'm mistaken, -Pmesos won't get included in 2.0.x releases
> unless someone adds it to branch-2.0. Correct?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sun, Sep 25, 2016 at 1:35 PM, Sean Owen  wrote:
> > Master is implicitly 2.1.x right now. When branch-2.1 is cut, master
> > becomes the de facto 2.2.x branch. It's not true that the next release
> > is 2.0.2. You can see the master version:
> > https://github.com/apache/spark/blob/master/pom.xml#L29
> >
> > On Sun, Sep 25, 2016 at 12:30 PM, Jacek Laskowski 
> wrote:
> >> Hi Sean,
> >>
> >> So, another question would be when is the change going to be released
> >> then? What's the version for the master? The next release's 2.0.2 so
> >> it's not for mesos profile either :(
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> 
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >>
> >> On Sun, Sep 25, 2016 at 1:27 PM, Sean Owen  wrote:
> >>> It's a change to the structure of the project, and probably not
> >>> appropriate for a maintenance release. 2.0.1 core would then no longer
> >>> contain Mesos code while 2.0.0 did.
> >>>
> >>> On Sun, Sep 25, 2016 at 12:26 PM, Jacek Laskowski 
> wrote:
>  Hi Sean,
> 
>  Sure, but then the question is why it's not a part of 2.0.1? I thought
>  it was considered ready for prime time and so should be shipped in
>  2.0.1.
> 
>  Pozdrawiam,
>  Jacek Laskowski
>  
>  https://medium.com/@jaceklaskowski/
>  Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>  Follow me at https://twitter.com/jaceklaskowski
> 
> 
>  On Sun, Sep 25, 2016 at 1:21 PM, Sean Owen 
> wrote:
> > It was added to the master branch, and this is a release from the
> 2.0.x branch.
> >
> > On Sun, Sep 25, 2016 at 12:12 PM, Jacek Laskowski 
> wrote:
> >> Hi,
> >>
> >> That's even more interesting. How's so since the profile got added a
> >> week ago or later and RC2 was cut two/three days ago? Anyone know?
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC2)

2016-09-23 Thread Mark Hamstra
Similar but not identical configuration (Java 8/macOS 10.12 with build/mvn
-Phive -Phive-thriftserver -Phadoop-2.7 -Pyarn clean install);
Similar but not identical failure:

...

- line wrapper only initialized once when used as encoder outer scope

Spark context available as 'sc' (master = local-cluster[1,1,1024], app id =
app-20160923150640-).

Spark session available as 'spark'.

Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError:
GC overhead limit exceeded

Exception in thread "dispatcher-event-loop-7" java.lang.OutOfMemoryError:
GC overhead limit exceeded

- define case class and create Dataset together with paste mode

java.lang.OutOfMemoryError: GC overhead limit exceeded

- should clone and clean line object in ClosureCleaner *** FAILED ***

  java.util.concurrent.TimeoutException: Futures timed out after [10
minutes]

...


On Fri, Sep 23, 2016 at 3:08 PM, Sean Owen  wrote:

> +1 Signatures and hashes check out. I checked that the Kinesis
> assembly artifacts are not present.
>
> I compiled and tested on Java 8 / Ubuntu 16 with -Pyarn -Phive
> -Phive-thriftserver -Phadoop-2.7 -Psparkr and only saw one test
> problem. This test never completed. If nobody else sees it, +1,
> assuming it's a bad test or env issue.
>
> - should clone and clean line object in ClosureCleaner *** FAILED ***
>   isContain was true Interpreter output contained 'Exception':
>   Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
>       /_/
>
>   Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)
>   Type in expressions to have them evaluated.
>   Type :help for more information.
>
>   scala> // Entering paste mode (ctrl-D to finish)
>
>
>   // Exiting paste mode, now interpreting.
>
>   org.apache.spark.SparkException: Job 0 cancelled because
> SparkContext was shut down
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$
> cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:818)
> ...
>
>
> On Fri, Sep 23, 2016 at 7:01 AM, Reynold Xin  wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 2.0.1. The vote is open until Sunday, Sep 25, 2016 at 23:59 PDT and
> passes
> > if a majority of at least 3+1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 2.0.1
> > [ ] -1 Do not release this package because ...
> >
> >
> > The tag to be voted on is v2.0.1-rc2
> > (04141ad49806a48afccc236b699827997142bd57)
> >
> > This release candidate resolves 284 issues:
> > https://s.apache.org/spark-2.0.1-jira
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc2-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1199
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc2-docs/
> >
> >
> > Q: How can I help test this release?
> > A: If you are a Spark user, you can help us test this release by taking
> an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions from 2.0.0.
> >
> > Q: What justifies a -1 vote for this release?
> > A: This is a maintenance release in the 2.0.x series.  Bugs already
> present
> > in 2.0.0, missing features, or bugs related to new features will not
> > necessarily block this release.
> >
> > Q: What happened to 2.0.1 RC1?
> > A: There was an issue with RC1 R documentation during release candidate
> > preparation. As a result, rc1 was canceled before a vote was called.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: spark roadmap

2016-08-29 Thread Mark Hamstra
At this point, there is no target date set for 2.1.  That's something that
we should do fairly soon, but right now there is at least a little room for
discussion as to whether we want to continue with the same pace of releases
that we targeted throughout the 1.x development cycles, or whether
lengthening the release cycles by a month or two might be better (mainly by
reducing the overhead fraction that comes from the constant-size
engineering mechanics of coordinating and making a release.)

On Mon, Aug 29, 2016 at 1:23 AM, Denis Bolshakov 
wrote:

> Hello spark devs,
>
> Does any one can provide a roadmap for the nearest two months?
> Or at least say when we can expect 2.1 release and which features will be
> added?
>
>
> --
> //with Best Regards
> --Denis Bolshakov
> e-mail: bolshakov.de...@gmail.com
>


Re: renaming "minor release" to "feature release"

2016-07-29 Thread Mark Hamstra
One issue worth at least considering is that our minor releases usually do
not include only new features, but also many bug-fixes -- at least some of
which often do not get backported into the next patch-level release.
 "Feature release" does not convey that information.

On Thu, Jul 28, 2016 at 8:20 PM, vaquar khan  wrote:

> +1
> Though the following is the commonly used standard for releases
> (http://semver.org/), "feature" also looks good, as a minor release
> indicates that significant features have been added:
>
>1. MAJOR version when you make incompatible API changes,
>2. MINOR version when you add functionality in a backwards-compatible
>manner, and
>3. PATCH version when you make backwards-compatible bug fixes.
>
>
> Apart from replacing the word "Minor" with "feature", there are no other
> changes to the versioning policy.
>
> regards,
> Vaquar khan
>
> On Thu, Jul 28, 2016 at 6:20 PM, Matei Zaharia 
> wrote:
>
>> I also agree with this given the way we develop stuff. We don't really
>> want to move to possibly-API-breaking major releases super often, but we do
>> have lots of large features that come out all the time, and our current
>> name doesn't convey that.
>>
>> Matei
>>
>> On Jul 28, 2016, at 4:15 PM, Reynold Xin  wrote:
>>
>> Yea definitely. Those are consistent with what is defined here:
>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Versioning+Policy
>>
>> The only change I'm proposing is replacing "minor" with "feature".
>>
>>
>> On Thu, Jul 28, 2016 at 4:10 PM, Sean Owen  wrote:
>>
>>> Although 'minor' is the standard term, the important thing is making
>>> the nature of the release understood. 'feature release' seems OK to me
>>> as an additional description.
>>>
>>> Is it worth agreeing on or stating a little more about the theory?
>>>
>>> patch release: backwards/forwards compatible within a minor release,
>>> generally fixes only
>>> minor/feature release: backwards compatible within a major release,
>>> not forward; generally also includes new features
>>> major release: not backwards compatible and may remove or change
>>> existing features
>>>
>>> On Thu, Jul 28, 2016 at 3:46 PM, Reynold Xin 
>>> wrote:
>>> > tl;dr
>>> >
>>> > I would like to propose renaming “minor release” to “feature release”
>>> in
>>> > Apache Spark.
>>> >
>>> >
>>> > details
>>> >
>>> > Apache Spark’s official versioning policy follows roughly semantic
>>> > versioning. Each Spark release is versioned as
>>> > [major].[minor].[maintenance]. That is to say, 1.0.0 and 2.0.0 are both
>>> > “major releases”, whereas “1.1.0” and “1.3.0” would be minor releases.
>>> >
>>> > I have gotten a lot of feedback from users that the word “minor” is
>>> > confusing and does not accurately describe those releases. When users
>>> hear
>>> > the word “minor”, they think it is a small update that introduces
>>> a couple of
>>> > minor features and some bug fixes. But if you look at the history of
>>> Spark
>>> > 1.x, here are just a subset of large features added:
>>> >
>>> > Spark 1.1: sort-based shuffle, JDBC/ODBC server, new stats library,
>>> 2-5X
>>> > perf improvement for machine learning.
>>> >
>>> > Spark 1.2: HA for streaming, new network module, Python API for
>>> streaming,
>>> > ML pipelines, data source API.
>>> >
>>> > Spark 1.3: DataFrame API, Spark SQL graduate out of alpha, tons of new
>>> > algorithms in machine learning.
>>> >
>>> > Spark 1.4: SparkR, Python 3 support, DAG viz, robust joins in SQL, math
>>> > functions, window functions, SQL analytic functions, Python API for
>>> > pipelines.
>>> >
>>> > Spark 1.5: code generation, Project Tungsten
>>> >
>>> > Spark 1.6: automatic memory management, Dataset API, ML pipeline
>>> persistence
>>> >
>>> >
>>> > So while “minor” is an accurate depiction of the releases from an API
>>> > compatibiility point of view, we are miscommunicating and doing Spark a
>>> > disservice by calling these releases “minor”. I would actually call
>>> these
>>> > releases “major”, but then it would be a larger deviation from semantic
>>> > versioning. I think calling these “feature releases” would be a smaller
>>> > change and a more accurate depiction of what they are.
>>> >
>>> > That said, I’m not attached to the name “feature” and am open to
>>> > suggestions, as long as they don’t convey the notion of “minor”.
>>> >
>>> >
>>>
>>
>>
>>
>
>
> --
> Regards,
> Vaquar Khan
> +91 830-851-1500
>
>


Re: drop java 7 support for spark 2.1.x or spark 2.2.x

2016-07-23 Thread Mark Hamstra
Sure, signalling well ahead of time is good, as is getting better
performance from Java 8; but do either of those interests really require
dropping Java 7 support sooner rather than later?

Now, to retroactively copy edit myself, when I previously wrote "after all
or nearly all relevant clusters are actually no longer running on Java 6",
I meant "...no longer running on Java 7".  We should be at a point now
where there aren't many Java 6 clusters left, but my sense is that there
are still quite a number of Java 7 clusters around, and that there will be
for a good while still.

On Sat, Jul 23, 2016 at 3:50 PM, Koert Kuipers <ko...@tresata.com> wrote:

> i care about signalling it in advance mostly. and given the performance
> differences we do have some interest in pushing towards java 8
>
> On Jul 23, 2016 6:10 PM, "Mark Hamstra" <m...@clearstorydata.com> wrote:
>
> Why the push to remove Java 7 support as soon as possible (which is how I
> read your "cluster admins plan to migrate by date X, so Spark should end
> Java 7 support then, too")?  First, I don't think we should be removing
> Java 7 support until some time after all or nearly all relevant clusters
> are actually no longer running on Java 6, and that targeting removal of
> support at our best guess about when admins are just *planning* to migrate
> isn't a very good idea.  Second, I don't see the significant difficulty or
> harm in continuing to support Java 7 for a while longer.
>
> On Sat, Jul 23, 2016 at 2:54 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> dropping java 7 support was considered for spark 2.0.x but we decided
>> against it.
>>
>> ideally dropping support for a java version should be communicated far in
>> advance to facilitate the transition.
>>
>> is this the right time to make that decision and start communicating it
>> (mailing list, jira, etc.)? perhaps for spark 2.1.x or spark 2.2.x?
>>
>> my general sense is that most cluster admins have plans to migrate to
>> java 8 before end of year. so that could line up nicely with spark 2.2
>>
>>
>
>


Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-15 Thread Mark Hamstra
Yes.  https://github.com/apache/spark/pull/11796

On Fri, Jul 15, 2016 at 2:50 PM, Krishna Sankar  wrote:

> Can't find the "spark-assembly-2.0.0-hadoop2.7.0.jar" after compilation.
> Usually it is in the assembly/target/scala-2.11 directory.
> Has the packaging changed for 2.0.0?
> Cheers
> 
>
> On Thu, Jul 14, 2016 at 11:59 AM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.0. The vote is open until Sunday, July 17, 2016 at 12:00 PDT and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.0-rc4
>> (e5f8c1117e0c48499f54d62b556bc693435afae0).
>>
>> This release candidate resolves ~2500 issues:
>> https://s.apache.org/spark-2.0.0-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1192/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs/
>>
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions from 1.x.
>>
>> ==
>> What justifies a -1 vote for this release?
>> ==
>> Critical bugs impacting major functionalities.
>>
>> Bugs already present in 1.x, missing features, or bugs related to new
>> features will not necessarily block this release. Note that historically
>> Spark documentation has been published on the website separately from the
>> main release so we do not need to block the release due to documentation
>> errors either.
>>
>>
>> Note: There was a mistake made during "rc3" preparation, and as a result
>> there is no "rc3", but only "rc4".
>>
>>
>


Re: Latest spark release in the 1.4 branch

2016-07-07 Thread Mark Hamstra
You've got to satisfy my curiosity, though.  Why would you want to run such
a badly out-of-date version in production?  I mean, 2.0.0 is just about
ready for release, and lagging three full releases behind, with one of them
being a major version release, is a long way from where Spark is now.

On Wed, Jul 6, 2016 at 11:12 PM, Niranda Perera 
wrote:

> Thanks Reynold
>
> On Thu, Jul 7, 2016 at 11:40 AM, Reynold Xin  wrote:
>
>> Yes definitely.
>>
>>
>> On Wed, Jul 6, 2016 at 11:08 PM, Niranda Perera > > wrote:
>>
>>> Thanks Reynold for the prompt response. Do you think we could use a
>>> 1.4-branch latest build in a production environment?
>>>
>>>
>>>
>>> On Thu, Jul 7, 2016 at 11:33 AM, Reynold Xin 
>>> wrote:
>>>
 I think last time I tried I had some trouble releasing it because the
 release scripts no longer work with branch-1.4. You can build from the
 branch yourself, but it might be better to upgrade to the later versions.

 On Wed, Jul 6, 2016 at 11:02 PM, Niranda Perera <
 niranda.per...@gmail.com> wrote:

> Hi guys,
>
> May I know if you have halted development in the Spark 1.4 branch? I
> see that there is a release tag for 1.4.2 but it was never released.
>
> Can we expect a 1.4.x bug fixing release anytime soon?
>
> Best
> --
> Niranda
> @n1r44 
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
>


>>>
>>>
>>> --
>>> Niranda
>>> @n1r44 
>>> +94-71-554-8430
>>> https://pythagoreanscript.wordpress.com/
>>>
>>
>>
>
>
> --
> Niranda
> @n1r44 
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
>


Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Mark Hamstra
No, that isn't necessarily enough to be considered a blocker.  A blocker
would be something that would have large negative effects on a significant
number of people trying to run Spark.  Arguably, something that prevents a
minority of Spark developers from running unit tests on one OS does not
qualify.  That's not to say that we shouldn't fix this, but only that it
needn't block a 2.0.0 release.

On Wed, Jun 22, 2016 at 5:56 PM, Ulanov, Alexander <alexander.ula...@hpe.com
> wrote:

> Spark Unit tests fail on Windows in Spark 2.0. It can be considered as
> blocker since there are people that develop for Spark on Windows. The
> referenced issue is indeed Minor and has nothing to do with unit tests.
>
>
>
> *From:* Mark Hamstra [mailto:m...@clearstorydata.com]
> *Sent:* Wednesday, June 22, 2016 4:09 PM
> *To:* Marcelo Vanzin <van...@cloudera.com>
> *Cc:* Ulanov, Alexander <alexander.ula...@hpe.com>; Reynold Xin <
> r...@databricks.com>; dev@spark.apache.org
> *Subject:* Re: [VOTE] Release Apache Spark 2.0.0 (RC1)
>
>
>
> It's also marked as Minor, not Blocker.
>
>
>
> On Wed, Jun 22, 2016 at 4:07 PM, Marcelo Vanzin <van...@cloudera.com>
> wrote:
>
> On Wed, Jun 22, 2016 at 4:04 PM, Ulanov, Alexander
> <alexander.ula...@hpe.com> wrote:
> > -1
> >
> > Spark Unit tests fail on Windows. Still not resolved, though marked as
> > resolved.
>
> To be pedantic, it's marked as a duplicate
> (https://issues.apache.org/jira/browse/SPARK-15899), which doesn't
> mean necessarily that it's fixed.
>
>
>
>
> > https://issues.apache.org/jira/browse/SPARK-15893
> >
> > From: Reynold Xin [mailto:r...@databricks.com]
> > Sent: Tuesday, June 21, 2016 6:27 PM
> > To: dev@spark.apache.org
> > Subject: [VOTE] Release Apache Spark 2.0.0 (RC1)
> >
> >
> >
> > Please vote on releasing the following candidate as Apache Spark version
> > 2.0.0. The vote is open until Friday, June 24, 2016 at 19:00 PDT and
> passes
> > if a majority of at least 3+1 PMC votes are cast.
> >
> >
> >
> > [ ] +1 Release this package as Apache Spark 2.0.0
> >
> > [ ] -1 Do not release this package because ...
> >
> >
> >
> >
> >
> > The tag to be voted on is v2.0.0-rc1
> > (0c66ca41afade6db73c9aeddd5aed6e5dcea90df).
> >
> >
> >
> > This release candidate resolves ~2400 issues:
> > https://s.apache.org/spark-2.0.0-rc1-jira
> >
> >
> >
> > The release files, including signatures, digests, etc. can be found at:
> >
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-bin/
> >
> >
> >
> > Release artifacts are signed with the following key:
> >
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> >
> >
> > The staging repository for this release can be found at:
> >
> > https://repository.apache.org/content/repositories/orgapachespark-1187/
> >
> >
> >
> > The documentation corresponding to this release can be found at:
> >
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/
> >
> >
> >
> >
> >
> > ===
> >
> > == How can I help test this release? ==
> >
> > ===
> >
> > If you are a Spark user, you can help us test this release by taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions from 1.x.
> >
> >
> >
> > 
> >
> > == What justifies a -1 vote for this release? ==
> >
> > 
> >
> > Critical bugs impacting major functionalities.
> >
> >
> >
> > Bugs already present in 1.x, missing features, or bugs related to new
> > features will not necessarily block this release. Note that historically
> > Spark documentation has been published on the website separately from the
> > main release so we do not need to block the release due to documentation
> > errors either.
> >
> >
> >
> >
>
>
> --
> Marcelo
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>


Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Mark Hamstra
It's also marked as Minor, not Blocker.

On Wed, Jun 22, 2016 at 4:07 PM, Marcelo Vanzin  wrote:

> On Wed, Jun 22, 2016 at 4:04 PM, Ulanov, Alexander
>  wrote:
> > -1
> >
> > Spark Unit tests fail on Windows. Still not resolved, though marked as
> > resolved.
>
> To be pedantic, it's marked as a duplicate
> (https://issues.apache.org/jira/browse/SPARK-15899), which doesn't
> mean necessarily that it's fixed.
>
>
>
> > https://issues.apache.org/jira/browse/SPARK-15893
> >
> > From: Reynold Xin [mailto:r...@databricks.com]
> > Sent: Tuesday, June 21, 2016 6:27 PM
> > To: dev@spark.apache.org
> > Subject: [VOTE] Release Apache Spark 2.0.0 (RC1)
> >
> >
> >
> > Please vote on releasing the following candidate as Apache Spark version
> > 2.0.0. The vote is open until Friday, June 24, 2016 at 19:00 PDT and
> passes
> > if a majority of at least 3+1 PMC votes are cast.
> >
> >
> >
> > [ ] +1 Release this package as Apache Spark 2.0.0
> >
> > [ ] -1 Do not release this package because ...
> >
> >
> >
> >
> >
> > The tag to be voted on is v2.0.0-rc1
> > (0c66ca41afade6db73c9aeddd5aed6e5dcea90df).
> >
> >
> >
> > This release candidate resolves ~2400 issues:
> > https://s.apache.org/spark-2.0.0-rc1-jira
> >
> >
> >
> > The release files, including signatures, digests, etc. can be found at:
> >
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-bin/
> >
> >
> >
> > Release artifacts are signed with the following key:
> >
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> >
> >
> > The staging repository for this release can be found at:
> >
> > https://repository.apache.org/content/repositories/orgapachespark-1187/
> >
> >
> >
> > The documentation corresponding to this release can be found at:
> >
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/
> >
> >
> >
> >
> >
> > ===
> >
> > == How can I help test this release? ==
> >
> > ===
> >
> > If you are a Spark user, you can help us test this release by taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions from 1.x.
> >
> >
> >
> > 
> >
> > == What justifies a -1 vote for this release? ==
> >
> > 
> >
> > Critical bugs impacting major functionalities.
> >
> >
> >
> > Bugs already present in 1.x, missing features, or bugs related to new
> > features will not necessarily block this release. Note that historically
> > Spark documentation has been published on the website separately from the
> > main release so we do not need to block the release due to documentation
> > errors either.
> >
> >
> >
> >
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Mark Hamstra
SPARK-15893 is resolved as a duplicate of SPARK-15899.  SPARK-15899 is
Unresolved.

On Wed, Jun 22, 2016 at 4:04 PM, Ulanov, Alexander  wrote:

> -1
>
> Spark Unit tests fail on Windows. Still not resolved, though marked as
> resolved.
>
> https://issues.apache.org/jira/browse/SPARK-15893
>
> *From:* Reynold Xin [mailto:r...@databricks.com]
> *Sent:* Tuesday, June 21, 2016 6:27 PM
> *To:* dev@spark.apache.org
> *Subject:* [VOTE] Release Apache Spark 2.0.0 (RC1)
>
>
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.0. The vote is open until Friday, June 24, 2016 at 19:00 PDT and passes
> if a majority of at least 3+1 PMC votes are cast.
>
>
>
> [ ] +1 Release this package as Apache Spark 2.0.0
>
> [ ] -1 Do not release this package because ...
>
>
>
>
>
> The tag to be voted on is v2.0.0-rc1
> (0c66ca41afade6db73c9aeddd5aed6e5dcea90df).
>
>
>
> This release candidate resolves ~2400 issues:
> https://s.apache.org/spark-2.0.0-rc1-jira
>
>
>
> The release files, including signatures, digests, etc. can be found at:
>
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-bin/
>
>
>
> Release artifacts are signed with the following key:
>
> https://people.apache.org/keys/committer/pwendell.asc
>
>
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1187/
>
>
>
> The documentation corresponding to this release can be found at:
>
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/
>
>
>
>
>
> ===
>
> == How can I help test this release? ==
>
> ===
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.x.
>
>
>
> 
>
> == What justifies a -1 vote for this release? ==
>
> 
>
> Critical bugs impacting major functionalities.
>
>
>
> Bugs already present in 1.x, missing features, or bugs related to new
> features will not necessarily block this release. Note that historically
> Spark documentation has been published on the website separately from the
> main release so we do not need to block the release due to documentation
> errors either.
>
>
>
>
>


Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-06 Thread Mark Hamstra
>
> I still don't know where this "severely compromised builds of limited
> usefulness" thing comes from? what's so bad? You didn't veto its
> release, after all.


I simply mean that it was released with the knowledge that there are still
significant bugs in the preview that definitely would warrant a veto if
this were intended to be on a par with other releases.  There have been
repeated announcements to that effect, but developers finding the preview
artifacts on Maven Central months from now may well not also see those
announcements and related discussion.  The artifacts will be very stale and
no longer useful for their limited testing purpose, but will persist in the
repository.

On Mon, Jun 6, 2016 at 9:51 AM, Sean Owen <so...@cloudera.com> wrote:

> I still don't know where this "severely compromised builds of limited
> usefulness" thing comes from? what's so bad? You didn't veto its
> release, after all. And rightly so: a release doesn't mean "definitely
> works"; it means it was created the right way. It's OK to say it's
> buggy alpha software; this isn't an argument to not really release it.
>
> But aside from that: if it should be used by someone, then who did you
> have in mind?
>
> It would be coherent at least to decide not to make alpha-like
> release, but, we agreed to, which is why this argument sort of
> surprises me.
>
> I share some concerns about piling on Databricks. Nothing here is by
> nature about an organization. However, this release really began in
> response to a thread (which not everyone here can see) about
> Databricks releasing a "2.0.0 preview" option in their product before
> it existed. I presume employees of that company sort of endorse this,
> which has put this same release into the hands of not just developers
> or admins but end users -- even with caveats and warnings.
>
> (And I think that's right!)
>
> While I'd like to see your reasons before I'd agree with you Mark,
> yours is a feasible position; I'm not as sure how people who work for
> Databricks can argue at the same time however that this should be
> carefully guarded as an ASF release -- even with caveats and warnings.
>
> We don't need to assume bad faith -- I don't. The appearance alone is
> enough to act to make this consistent.
>
> But, I think the resolution is simple: it's not 'dangerous' to release
> this and I don't think people who say they think this really do. So
> just finish this release normally, and we're done. Even if you think
> there's an argument against it, weigh vs the problems above.
>
>
> On Mon, Jun 6, 2016 at 4:00 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
> > This is not a Databricks vs. The World situation, and the fact that some
> > persist in forcing every issue into that frame is getting annoying.
> There
> > are good engineering and project-management reasons not to populate the
> > long-term, canonical repository of Maven artifacts with what are known
> to be
> > severely compromised builds of limited usefulness, particularly over
> time.
> > It is a legitimate dispute over whether these preview artifacts should be
> > deployed to Maven Central, not one that must be seen as Databricks
> seeking
> > improper advantage.
> >
>


Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-06 Thread Mark Hamstra
This is not a Databricks vs. The World situation, and the fact that some
persist in forcing every issue into that frame is getting annoying.  There
are good engineering and project-management reasons not to populate the
long-term, canonical repository of Maven artifacts with what are known to
be severely compromised builds of limited usefulness, particularly over
time.  It is a legitimate dispute over whether these preview artifacts
should be deployed to Maven Central, not one that must be seen as
Databricks seeking improper advantage.

On Mon, Jun 6, 2016 at 5:34 AM, Shane Curcuru  wrote:

>
>
> On 2016-06-04 18:42 (-0400), Sean Owen  wrote:
> ...
> > The question is, can you just not fully release it? I don't think so,
> > even as a matter of process, and don't see a good reason not to.
> >
> > To Reynold's quote, I think that's suggesting that not all projects
> > will release to a repo at all (e.g. OpenOffice?). I don't think it
> > means you're free to not release some things to Maven, if that's
> > appropriate and common for the type of project.
> >
> > Regarding risk, remember that the audience for Maven artifacts are
> > developers, not admins or end users. I understand that developers can
> > temporarily change their build to use a different resolver if they
> > care, but, why? (and, where would someone figure this out?)
> >
> > Regardless: the 2.0.0-preview docs aren't published to go along with
> > the source/binary releases. Those need be released to the project
> > site, though probably under a different /preview/ path or something.
> > If they are, is it weird that someone wouldn't find the release in the
> > usual place in Maven then?
> >
> > Given that the driver of this was concern over wide access to
> > 2.0.0-preview, I think it's best to err on the side openness vs some
> > theoretical problem.
>
> The mere fact that there continues to be repeated pushback from PMC
> members employed by DataBricks to such a reasonable and easy question to
> answer and take action on for the benefit of all the project's users
> raises red flags for me.
>
> Immaterial of the actual motivations of individual PMC members, this
> still gives the *appearance* that DataBricks as an organization
> effectively exercises a more than healthy amount of control over how the
> project operates in simple, day-to-day manners.
>
> I strongly urge everyone participating in Apache Spark development to
> read and take to heart this required policy for Apache projects:
>
>   http://community.apache.org/projectIndependence
>
> - Shane, speaking as an individual
>
> (If I were speaking in other roles I hold, I wouldn't be as polite)
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-03 Thread Mark Hamstra
It's not a question of whether the preview artifacts can be made available
on Maven central, but rather whether they must be or should be.  I've got
no problems leaving these unstable, transitory artifacts out of the more
permanent, canonical repository.
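
For developers who do want to compile against the preview without it being on
Maven Central, a minimal sbt sketch of the "add a resolver" approach Michael
Armbrust mentions further down in the quoted thread might look like the
following; the staging-repository URL is an assumption for illustration, not
taken from this thread:

  resolvers += "apache-staging" at "https://repository.apache.org/content/repositories/staging/"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0-preview" % "provided"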

On Fri, Jun 3, 2016 at 1:53 AM, Steve Loughran 
wrote:

>
> It's been voted on by the project, so can go up on central
>
> There are already some JIRAs being filed against it; this is a metric of
> success for a pre-beta of the artifacts.
>
> The risk of exercising the m2 central option is that people may get
> expectations that they can point their code at the 2.0.0-preview and then,
> when a release comes out, simply
> update their dependency; this may/may not be the case. But is it harmful
> if people do start building and testing against the preview? If it finds
> problems early, it can only be a good thing
>
>
> > On 1 Jun 2016, at 23:10, Sean Owen  wrote:
> >
> > I'll be more specific about the issue that I think trumps all this,
> > which I realize maybe not everyone was aware of.
> >
> > There was a long and contentious discussion on the PMC about, among
> > other things, advertising a "Spark 2.0 preview" from Databricks, such
> > as at
> https://databricks.com/blog/2016/05/11/apache-spark-2-0-technical-preview-easier-faster-and-smarter.html
> >
> > That post has already been updated/fixed from an earlier version, but
> > part of the resolution was to make a full "2.0.0 preview" release in
> > order to continue to be able to advertise it as such. Without it, I
> > believe the PMC's conclusion remains that this blog post / product
> > announcement is not allowed by ASF policy. Hence, either the product
> > announcements need to be taken down and a bunch of wording changed in
> > the Databricks product, or, this needs to be a normal release.
> >
> > Obviously, it seems far easier to just finish the release per usual. I
> > actually didn't realize this had not been offered for download at
> > http://spark.apache.org/downloads.html either. It needs to be
> > accessible there too.
> >
> >
> > We can get back in the weeds about what a "preview" release means,
> > but, normal voted releases can and even should be alpha/beta
> > (http://www.apache.org/dev/release.html) The culture is, in theory, to
> > release early and often. I don't buy an argument that it's too old, at
> > 2 weeks, when the alternative is having nothing at all to test
> > against.
> >
> > On Wed, Jun 1, 2016 at 5:02 PM, Michael Armbrust 
> wrote:
> >>> I'd think we want less effort, not more, to let people test it? for
> >>> example, right now I can't easily try my product build against
> >>> 2.0.0-preview.
> >>
> >>
> >> I don't feel super strongly one way or the other, so if we need to
> publish
> >> it permanently we can.
> >>
> >> However, either way you can still test against this release.  You just
> need
> >> to add a resolver as well (which is how I have always tested packages
> >> against RCs).  One concern with making it permeant is this preview
> release
> >> is already fairly far behind branch-2.0, so many of the issues that
> people
> >> might report have already been fixed and that might continue even after
> the
> >> release is made.  I'd rather be able to force upgrades eventually when
> we
> >> vote on the final 2.0 release.
> >>
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-20 Thread Mark Hamstra
This isn't yet a release candidate since, as Reynold mentioned in his
opening post, preview releases are "not meant to be functional, i.e. they
can and highly likely will contain critical bugs or documentation errors."
 Once we're at the point where we expect there not to be such bugs and
errors, then the release candidates will start.

On Fri, May 20, 2016 at 4:40 AM, Ross Lawley  wrote:

> +1 Having an rc1 would help me get stable feedback on using my library
> with Spark, compared to relying on 2.0.0-SNAPSHOT.
>
>
> On Fri, 20 May 2016 at 05:57 Xiao Li  wrote:
>
>> Changed my vote to +1. Thanks!
>>
>> 2016-05-19 13:28 GMT-07:00 Xiao Li :
>>
>>> Will do. Thanks!
>>>
>>> 2016-05-19 13:26 GMT-07:00 Reynold Xin :
>>>
 Xiao thanks for posting. Please file a bug in JIRA. Again as I said in
 the email this is not meant to be a functional release and will contain
 bugs.

 On Thu, May 19, 2016 at 1:20 PM, Xiao Li  wrote:

> -1
>
> Unable to use Hive meta-store in pyspark shell. Tried both HiveContext
> and SparkSession. Both failed. It always uses in-memory catalog. Anybody
> else hit the same issue?
>
>
> Method 1: SparkSession
>
> >>> from pyspark.sql import SparkSession
>
> >>> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
>
> >>>
>
> >>> spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
>
> DataFrame[]
>
> >>> spark.sql("LOAD DATA LOCAL INPATH
> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>
> Traceback (most recent call last):
>
>   File "", line 1, in 
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py",
> line 494, in sql
>
> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
> line 933, in __call__
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py",
> line 57, in deco
>
> return f(*a, **kw)
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
> line 312, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
>
> : java.lang.UnsupportedOperationException: loadTable is not implemented
>
> at
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.loadTable(InMemoryCatalog.scala:297)
>
> at
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:280)
>
> at
> org.apache.spark.sql.execution.command.LoadData.run(tables.scala:263)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>
> at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>
> at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
>
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
>
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:187)
>
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:168)
>
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
>
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>
> at py4j.Gateway.invoke(Gateway.java:280)
>
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
Ah, got it.  While that would be useful, it doesn't address the more
general (and potentially even more beneficial) case where the total number
of worker nodes is fully elastic.  That already starts to push you into the
direction of splitting Spark worker and HDFS data nodes into disjoint sets,
and to compensate for the loss of data locality you start wishing for some
kind of hierarchical storage where at least your hot data can be present on
the Spark workers.  Even without an elastic number of HDFS nodes, you might
well get into a similar kind of desire for hierarchical storage, with another
layer providing faster access to the shuffle files than is possible using
HDFS -- because I share Reynold's scepticism that HDFS by itself will be up
to demands of handling the shuffle files.  With such a hierarchical split
or Spark-node-local caching layer, considering the more general split
between data and fully elastic worker nodes becomes much more tractable.

On Thu, Apr 28, 2016 at 11:23 AM, Michael Gummelt <mgumm...@mesosphere.io>
wrote:

> Not disjoint.  Colocated.  By "shrinking", I don't mean any nodes are
> going away.  I mean executors are decreasing in number, which is the case
> with dynamic allocation.  HDFS nodes aren't decreasing in number though,
> and we can still colocate on those nodes, as always.
>
> On Thu, Apr 28, 2016 at 11:19 AM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> So you are only considering the case where your set of HDFS nodes is
>> disjoint from your dynamic set of Spark Worker nodes?  That would seem to
>> be a pretty significant sacrifice of data locality.
>>
>> On Thu, Apr 28, 2016 at 11:15 AM, Michael Gummelt <mgumm...@mesosphere.io
>> > wrote:
>>
>>> > if after a work-load burst your cluster dynamically changes from 10,000
>>> workers to 1,000, will the typical HDFS replication factor be sufficient to
>>> retain access to the shuffle files in HDFS
>>>
>>> HDFS isn't resizing.  Spark is.  HDFS files should be HA and durable.
>>>
>>> On Thu, Apr 28, 2016 at 11:08 AM, Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>>
>>>> Yes, replicated and distributed shuffle materializations are a key
>>>> requirement to maintain performance in a fully elastic cluster where
>>>> Executors aren't just reallocated across an essentially fixed number of
>>>> Worker nodes, but rather the number of Workers itself is dynamic.
>>>> Retaining the file interface to those shuffle materializations while also
>>>> using HDFS for the spark.local.dirs has a certain amount of attraction, but
>>>> I also wonder whether a typical HDFS deployment is really sufficient to
>>>> handle this kind of elastic cluster scaling.  For instance and assuming
>>>> HDFS co-located on worker nodes, if after a work-load burst your cluster
>>>> dynamically changes from 10,000 workers to 1,000, will the typical HDFS
>>>> replication factor be sufficient to retain access to the shuffle files in
>>>> HDFS, or will we instead be seeing numerous FetchFailure exceptions, Tasks
>>>> recomputed or Stages aborted, etc. so that the net effect is not all that
>>>> much different than if the shuffle files had not been relocated to HDFS and
>>>> the Executors or ShuffleService instances had just disappeared along with
>>>> the worker nodes?
>>>>
>>>> On Thu, Apr 28, 2016 at 10:46 AM, Michael Gummelt <
>>>> mgumm...@mesosphere.io> wrote:
>>>>
>>>>> > Why would you run the shuffle service on 10K nodes but Spark
>>>>> executors
>>>>> on just 100 nodes? wouldn't you also run that service just on the 100
>>>>> nodes?
>>>>>
>>>>> We have to start the service beforehand, out of band, and we don't
>>>>> know a priori where the Spark executors will land.  Those 100 executors
>>>>> could land on any of the 10K nodes.
>>>>>
>>>>> > What does plumbing it through HDFS buy you in comparison?
>>>>>
>>>>> It drops the shuffle service requirement, which is HUGE.  It means
>>>>> Spark can completely vacate the machine when it's not in use, which is
>>>>> crucial for a large, multi-tenant cluster.  ShuffledRDDs can now read the
>>>>> map files from HDFS, rather than the ancestor executors, which means we 
>>>>> can
>>>>> shut executors down immediately after the shuffle files are written.
>>>>>
>>>>> > There's some additional overhead and if 

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
So you are only considering the case where your set of HDFS nodes is
disjoint from your dynamic set of Spark Worker nodes?  That would seem to
be a pretty significant sacrifice of data locality.

On Thu, Apr 28, 2016 at 11:15 AM, Michael Gummelt <mgumm...@mesosphere.io>
wrote:

> > if after a work-load burst your cluster dynamically changes from 10,000
> workers to 1,000, will the typical HDFS replication factor be sufficient to
> retain access to the shuffle files in HDFS
>
> HDFS isn't resizing.  Spark is.  HDFS files should be HA and durable.
>
> On Thu, Apr 28, 2016 at 11:08 AM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> Yes, replicated and distributed shuffle materializations are a key
>> requirement to maintain performance in a fully elastic cluster where
>> Executors aren't just reallocated across an essentially fixed number of
>> Worker nodes, but rather the number of Workers itself is dynamic.
>> Retaining the file interface to those shuffle materializations while also
>> using HDFS for the spark.local.dirs has a certain amount of attraction, but
>> I also wonder whether a typical HDFS deployment is really sufficient to
>> handle this kind of elastic cluster scaling.  For instance and assuming
>> HDFS co-located on worker nodes, if after a work-load burst your cluster
>> dynamically changes from 10,000 workers to 1,000, will the typical HDFS
>> replication factor be sufficient to retain access to the shuffle files in
>> HDFS, or will we instead be seeing numerous FetchFailure exceptions, Tasks
>> recomputed or Stages aborted, etc. so that the net effect is not all that
>> much different than if the shuffle files had not been relocated to HDFS and
>> the Executors or ShuffleService instances had just disappeared along with
>> the worker nodes?
>>
>> On Thu, Apr 28, 2016 at 10:46 AM, Michael Gummelt <mgumm...@mesosphere.io
>> > wrote:
>>
>>> > Why would you run the shuffle service on 10K nodes but Spark executors
>>> on just 100 nodes? wouldn't you also run that service just on the 100
>>> nodes?
>>>
>>> We have to start the service beforehand, out of band, and we don't know
>>> a priori where the Spark executors will land.  Those 100 executors could
>>> land on any of the 10K nodes.
>>>
>>> > What does plumbing it through HDFS buy you in comparison?
>>>
>>> It drops the shuffle service requirement, which is HUGE.  It means Spark
>>> can completely vacate the machine when it's not in use, which is crucial
>>> for a large, multi-tenant cluster.  ShuffledRDDs can now read the map files
>>> from HDFS, rather than the ancestor executors, which means we can shut
>>> executors down immediately after the shuffle files are written.
>>>
>>> > There's some additional overhead and if anything you lose some control
>>> over locality, in a context where I presume HDFS itself is storing data on
>>> much more than the 100 Spark nodes.
>>>
>>> Write locality would be sacrificed, but the descendent executors were
>>> already doing a remote read (they have to read from multiple ancestor
>>> executors), so there's no additional cost in read locality.  In fact, if we
>>> take advantage of HDFS's favored node feature, we could make it likely that
>>> all map files for a given partition land on the same node, so the
>>> descendent executor would never have to do a remote read!  We'd effectively
>>> shift the remote IO from read side to write side, for theoretically no
>>> change in performance.
>>>
>>> In summary:
>>>
>>> Advantages:
>>> - No shuffle service dependency (increased utilization, decreased
>>> management cost)
>>> - Shut executors down immediately after shuffle files are written,
>>> rather than waiting for a timeout (increased utilization)
>>> - HDFS is HA, so shuffle files survive a node failure, which isn't true
>>> for the shuffle service (decreased latency during failures)
>>> - Potential ability to parallelize shuffle file reads if we write a new
>>> shuffle iterator (decreased latency)
>>>
>>> Disadvantages
>>> - Increased write latency (but potentially not if we implement it
>>> efficiently.  See above).
>>> - Would need some sort of GC on HDFS shuffle files
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Apr 28, 2016 at 1:36 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Why would you run the shuffle service on 10K nodes but Spark executors
>

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
Yes, replicated and distributed shuffle materializations are a key
requirement to maintain performance in a fully elastic cluster where
Executors aren't just reallocated across an essentially fixed number of
Worker nodes, but rather the number of Workers itself is dynamic.
Retaining the file interface to those shuffle materializations while also
using HDFS for the spark.local.dirs has a certain amount of attraction, but
I also wonder whether a typical HDFS deployment is really sufficient to
handle this kind of elastic cluster scaling.  For instance and assuming
HDFS co-located on worker nodes, if after a work-load burst your cluster
dynamically changes from 10,000 workers to 1,000, will the typical HDFS
replication factor be sufficient to retain access to the shuffle files in
HDFS, or will we instead be seeing numerous FetchFailure exceptions, Tasks
recomputed or Stages aborted, etc. so that the net effect is not all that
much different than if the shuffle files had not been relocated to HDFS and
the Executors or ShuffleService instances had just disappeared along with
the worker nodes?
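
For context, here is a minimal Scala sketch of the status quo being debated:
with dynamic allocation today, shuffle files stay on node-local disk and the
external shuffle service must run on every node so that map outputs outlive
their executors, which is exactly the dependency the HDFS idea aims to remove.
The property names are real Spark settings; the local directory value is just
an example.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true") // executors come and go with load
    .set("spark.shuffle.service.enabled", "true")   // external service serves map outputs after executors exit
    .set("spark.local.dir", "/mnt/spark-local")     // shuffle files live on node-local disk today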

On Thu, Apr 28, 2016 at 10:46 AM, Michael Gummelt 
wrote:

> > Why would you run the shuffle service on 10K nodes but Spark executors
> on just 100 nodes? wouldn't you also run that service just on the 100
> nodes?
>
> We have to start the service beforehand, out of band, and we don't know a
> priori where the Spark executors will land.  Those 100 executors could land
> on any of the 10K nodes.
>
> > What does plumbing it through HDFS buy you in comparison?
>
> It drops the shuffle service requirement, which is HUGE.  It means Spark
> can completely vacate the machine when it's not in use, which is crucial
> for a large, multi-tenant cluster.  ShuffledRDDs can now read the map files
> from HDFS, rather than the ancestor executors, which means we can shut
> executors down immediately after the shuffle files are written.
>
> > There's some additional overhead and if anything you lose some control
> over locality, in a context where I presume HDFS itself is storing data on
> much more than the 100 Spark nodes.
>
> Write locality would be sacrificed, but the descendent executors were
> already doing a remote read (they have to read from multiple ancestor
> executors), so there's no additional cost in read locality.  In fact, if we
> take advantage of HDFS's favored node feature, we could make it likely that
> all map files for a given partition land on the same node, so the
> descendent executor would never have to do a remote read!  We'd effectively
> shift the remote IO from read side to write side, for theoretically no
> change in performance.
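>
> To make the favored-node idea concrete, a sketch only: DistributedFileSystem
> exposes a create() overload that accepts favored nodes, though the exact
> signature varies across Hadoop versions, and the shuffle path layout and
> host below are made up.
>
> import java.net.InetSocketAddress
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.fs.permission.FsPermission
> import org.apache.hadoop.hdfs.DistributedFileSystem
>
> val shuffleDir = new Path("hdfs:///spark/shuffle/app-123")  // hypothetical layout
> val fs = shuffleDir.getFileSystem(new Configuration()).asInstanceOf[DistributedFileSystem]
> // Hint HDFS to place a replica on the node that will run the reduce-side
> // executor for partition 7, so its read stays node-local.
> val favored = Array(new InetSocketAddress("worker-42.example.com", 9866))  // hypothetical host
> val out = fs.create(
>   new Path(shuffleDir, "map_5_reduce_7.data"),
>   FsPermission.getFileDefault(),
>   true,                // overwrite
>   64 * 1024,           // buffer size
>   3.toShort,           // replication
>   128L * 1024 * 1024,  // block size
>   null,                // progress reporter
>   favored)
> out.write("serialized map output for partition 7".getBytes("UTF-8"))
> out.close()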
>
> In summary:
>
> Advantages:
> - No shuffle service dependency (increased utilization, decreased
> management cost)
> - Shut executors down immediately after shuffle files are written, rather
> than waiting for a timeout (increased utilization)
> - HDFS is HA, so shuffle files survive a node failure, which isn't true
> for the shuffle service (decreased latency during failures)
> - Potential ability to parallelize shuffle file reads if we write a new
> shuffle iterator (decreased latency)
>
> Disadvantages
> - Increased write latency (but potentially not if we implement it
> efficiently.  See above).
> - Would need some sort of GC on HDFS shuffle files
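>
> For reference, a minimal sketch of the configuration difference.
> spark.dynamicAllocation.enabled and spark.shuffle.service.enabled are real
> settings; "spark.shuffle.hdfs.dir" is not a real Spark setting, just a
> placeholder for the idea being proposed here.
>
> import org.apache.spark.SparkConf
>
> // Today: dynamic allocation requires the external shuffle service to be
> // running, out of band, on every node an executor might land on.
> val withShuffleService = new SparkConf()
>   .set("spark.dynamicAllocation.enabled", "true")
>   .set("spark.shuffle.service.enabled", "true")
>
> // Proposed: no shuffle service; map outputs go to a shared, replicated
> // filesystem, so executors can exit as soon as their files are written.
> val hdfsBacked = new SparkConf()
>   .set("spark.dynamicAllocation.enabled", "true")
>   .set("spark.shuffle.service.enabled", "false")
>   .set("spark.shuffle.hdfs.dir", "hdfs:///spark/shuffle")  // placeholder key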
>
>
>
>
>
> On Thu, Apr 28, 2016 at 1:36 AM, Sean Owen  wrote:
>
>> Why would you run the shuffle service on 10K nodes but Spark executors
>> on just 100 nodes? wouldn't you also run that service just on the 100
>> nodes?
>>
>> What does plumbing it through HDFS buy you in comparison? There's some
>> additional overhead and if anything you lose some control over
>> locality, in a context where I presume HDFS itself is storing data on
>> much more than the 100 Spark nodes.
>>
>> On Thu, Apr 28, 2016 at 1:34 AM, Michael Gummelt 
>> wrote:
>> >> Are you suggesting to have shuffle service persist and fetch data with
>> >> hdfs, or skip shuffle service altogether and just write to hdfs?
>> >
>> > Skip shuffle service altogether.  Write to HDFS.
>> >
>> > Mesos environments tend to be multi-tenant, and running the shuffle
>> service
>> > on all nodes could be extremely wasteful.  If you're running a 10K node
>> > cluster, and you'd like to run a Spark job that consumes 100 nodes, you
>> > would have to run the shuffle service on all 10K nodes out of band of
>> Spark
>> > (e.g. marathon).  I'd like a solution for dynamic allocation that
>> doesn't
>> > require this overhead.
>> >
>> > I'll look at SPARK-1529.
>> >
>> > On Wed, Apr 27, 2016 at 10:24 AM, Steve Loughran <
>> ste...@hortonworks.com>
>> > wrote:
>> >>
>> >>
>> >> > On 27 Apr 2016, at 04:59, Takeshi Yamamuro 
>> >> > wrote:
>> >> >
>> >> > Hi, all
>> >> >
>> >> > See SPARK-1529 for related discussion.
>> >> >
>> >> > // maropu
>> >>
>> >>
>> >> I'd not seen that discussion.
>> >>
>> >> I'm 

Re: Question about Scala style, explicit typing within transformation functions and anonymous val.

2016-04-17 Thread Mark Hamstra
I actually find my version of 3 more readable than the one with the `_`,
which looks too much like a partially applied function.  It's a minor
issue, though.
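
For concreteness, a minimal illustration of the spellings under discussion
(`double` and `xs` are just made-up names):

def double(x: Int): Int = x * 2
val xs = Seq(1, 2, 3)

xs.map { x => double(x) }  // the original style 3
xs.map(double(_))          // the `_` placeholder version
xs.map(double)             // passing the method directly via eta-expansion

All three return Seq(2, 4, 6); the last also avoids the visual overlap with a
partially applied function such as `double _`.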

On Sat, Apr 16, 2016 at 11:56 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Hi Mark,
>
> I know but that could harm readability. AFAIK, for this reason, that is
> not (or rarely) used in Spark.
>
> 2016-04-17 15:54 GMT+09:00 Mark Hamstra <m...@clearstorydata.com>:
>
>> FWIW, 3 should work as just `.map(function)`.
>>
>> On Sat, Apr 16, 2016 at 11:48 PM, Reynold Xin <r...@databricks.com>
>> wrote:
>>
>>> Hi Hyukjin,
>>>
>>> Thanks for asking.
>>>
>>> For 1 the change is almost always better.
>>>
>>> For 2 it depends on the context. In general, if the type is not obvious,
>>> it helps readability to declare it explicitly.
>>>
>>> For 3 again it depends on context.
>>>
>>>
>>> So while it is a good idea to change 1 to reflect a more consistent code
>>> base (and maybe we should codify it), it is almost always a bad idea to
>>> change 2 and 3 just for the sake of changing them.
>>>
>>>
>>>
>>> On Sat, Apr 16, 2016 at 11:06 PM, Hyukjin Kwon <gurwls...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> First of all, I am sorry that this is relatively trivial and minor, but I
>>>> just want to be clear on this and careful about future PRs.
>>>>
>>>> Recently, I have submitted a PR (
>>>> https://github.com/apache/spark/pull/12413) about Scala style and this
>>>> was merged. In this PR, I changed
>>>>
>>>> 1.
>>>>
>>>> from
>>>>
>>>> .map(item => {
>>>>   ...
>>>> })
>>>>
>>>> to
>>>>
>>>> .map { item =>
>>>>   ...
>>>> }
>>>>
>>>>
>>>>
>>>> 2.
>>>> from
>>>>
>>>> words.foreachRDD { (rdd: RDD[String], time: Time) => ...
>>>>
>>>> to
>>>>
>>>> words.foreachRDD { (rdd, time) => ...
>>>>
>>>>
>>>>
>>>> 3.
>>>>
>>>> from
>>>>
>>>> .map { x =>
>>>>   function(x)
>>>> }
>>>>
>>>> to
>>>>
>>>> .map(function(_))
>>>>
>>>>
>>>> My question is: 2. and 3. look arguable (please see the discussion in the
>>>> PR).
>>>> I agree that I don't need to make such changes in the future, but I just
>>>> wonder whether I should revert 2. and 3.
>>>>
>>>> FYI,
>>>> - The usage of 2. is pretty rare.
>>>> - 3. is pretty common, but the PR corrects cases like the above only when
>>>> the val within the closure looks obviously meaningless (such as x or a)
>>>> and spans only a single line.
>>>>
>>>> I would appreciate it if you could add some comments and opinions on this.
>>>>
>>>> Thanks!
>>>>
>>>
>>>
>>
>


Re: Question about Scala style, explicit typing within transformation functions and anonymous val.

2016-04-17 Thread Mark Hamstra
FWIW, 3 should work as just `.map(function)`.

On Sat, Apr 16, 2016 at 11:48 PM, Reynold Xin  wrote:

> Hi Hyukjin,
>
> Thanks for asking.
>
> For 1 the change is almost always better.
>
> For 2 it depends on the context. In general, if the type is not obvious, it
> helps readability to declare it explicitly.
>
> For 3 again it depends on context.
>
>
> So while it is a good idea to change 1 to reflect a more consistent code
> base (and maybe we should codify it), it is almost always a bad idea to
> change 2 and 3 just for the sake of changing them.
>
>
>
> On Sat, Apr 16, 2016 at 11:06 PM, Hyukjin Kwon 
> wrote:
>
>> Hi all,
>>
>> First of all, I am sorry that this is relatively trivial and minor, but I
>> just want to be clear on this and careful about future PRs.
>>
>> Recently, I have submitted a PR (
>> https://github.com/apache/spark/pull/12413) about Scala style and this
>> was merged. In this PR, I changed
>>
>> 1.
>>
>> from
>>
>> .map(item => {
>>   ...
>> })
>>
>> to
>>
>> .map { item =>
>>   ...
>> }
>>
>>
>>
>> 2.
>> from
>>
>> words.foreachRDD { (rdd: RDD[String], time: Time) => ...
>>
>> to
>>
>> words.foreachRDD { (rdd, time) => ...
>>
>>
>>
>> 3.
>>
>> from
>>
>> .map { x =>
>>   function(x)
>> }
>>
>> to
>>
>> .map(function(_))
>>
>>
>> My question is: 2. and 3. look arguable (please see the discussion in the
>> PR).
>> I agree that I don't need to make such changes in the future, but I just
>> wonder whether I should revert 2. and 3.
>>
>> FYI,
>> - The usage of 2. is pretty rare.
>> - 3. is pretty common, but the PR corrects cases like the above only when
>> the val within the closure looks obviously meaningless (such as x or a)
>> and spans only a single line.
>>
>> I would appreciate it if you could add some comments and opinions on this.
>>
>> Thanks!
>>
>
>

