Re: Integrating ML/DL frameworks with Spark

2018-05-16 Thread Daniel Galvez
Hi all,

Paul Ogilvie pointed this thread out to me; we overlapped a little at LinkedIn. 
It’s good to see that this kind of discussion is going on!

I have a few thoughts on the discussion:

- Practically speaking, one of the lowest-hanging fruits is the ability for 
Spark to request GPUs (and, more generally, devices). I would be happy to 
implement this myself if given the go-ahead. I’m familiar with only YARN, 
though, not the Mesos or Kubernetes resource schedulers. It would be best to 
be forward-looking and think about how to request arbitrary Linux devices 
rather than just GPUs.

- The discussion here regarding ML/DL seems to focus on DL in particular, and 
the DL discussion seems to focus vaguely on data-parallel deep learning 
training. This is probably a fine starting point.

- It is generally challenging to utilize a GPU fully in each kernel call, but 
there are solutions like CUDA MPS that virtualize a physical GPU as many 
smaller GPUs. However, each physical GPU is still represented as a single 
character device, e.g., /dev/nvidia0. This does not mesh well with YARN’s GPU 
isolation, which puts each executor in its own cgroup with only specific 
*physical* character devices whitelisted. Alas. Supporting CUDA MPS would be 
good to keep in mind for inference workloads; I could elaborate if desired.

- For things like all-reduce to work well, you need to keep your I/O 
bandwidth in mind. This means being aware of the “topology” of your compute 
devices (be they CPUs, GPUs, FPGAs, IPUs, or whatever). I’m not sure whether 
Spark is already aware of this at the Ethernet level, forgive me, but I am 
certain that it is not aware of it at the PCIe level. Ring all-reduce handles 
this automatically in some sense when it creates its “ring”, but only if you 
give it control of your full topology, which is the traditional MPI style 
(i.e., with MPI you’re normally not sharing a node with other jobs). 
Secondly, InfiniBand connections exist that let GPUs talk directly to one 
another via what is called “GPUDirect”, effectively bypassing the CPU and 
running at the highest bandwidth possible today. This is a very popular 
approach, and not something that Spark would seemingly be able to touch. So I 
question Spark’s ability to have a hand in large-scale distributed training 
of deep learning models.
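As a toy illustration of the ring all-reduce mechanics mentioned above (this is not Spark or MPI code; `ring_allreduce` and its chunk bookkeeping are a single-process sketch of my own, not any framework’s API):

```python
import numpy as np

def ring_allreduce(vectors):
    """Toy single-process simulation of ring all-reduce (sum).

    Each of n workers starts with one vector, split into n chunks.
    During n-1 scatter-reduce steps, each worker passes one chunk to its
    right-hand neighbor, which accumulates it; during n-1 all-gather
    steps, the fully reduced chunks are passed around the ring, so every
    worker ends with the elementwise sum of all input vectors.
    """
    n = len(vectors)
    chunks = [np.array_split(v.astype(float), n) for v in vectors]
    # Scatter-reduce: at step s, worker i sends chunk (i - s) mod n to
    # worker (i + 1) mod n, which adds it to its own copy.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]
    # All-gather: at step s, worker i forwards the already fully reduced
    # chunk (i + 1 - s) mod n, overwriting the neighbor's stale copy.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()
    return [np.concatenate(c) for c in chunks]

workers = [np.arange(8.0) * (k + 1) for k in range(4)]
result = ring_allreduce(workers)
print(result[0])  # every worker holds the elementwise sum of all vectors
```

The point relevant to this thread: each of the 2(n-1) steps only ever talks to a ring neighbor, so the achievable bandwidth depends entirely on how that ring is laid over the physical PCIe/network topology.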

- I would want to know more about claims of UDFs being slow. For perspective, 
PCI Express Gen 3 (Gen 4 is not out yet…) has an effective bandwidth of 
roughly 12 GB/s. Split among 4 GPUs, you have 3 GB/s each. In 
high-performance computing, this is always considered the bottleneck.
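To make that arithmetic concrete, here is a quick back-of-envelope sketch; the 100M-parameter model is a made-up example, and the 12 GB/s figure is the effective number cited above:

```python
# Back-of-envelope arithmetic for the numbers above.
pcie_gen3_effective_gb_s = 12.0   # effective GB/s, as cited above
gpus_sharing_link = 4
per_gpu_gb_s = pcie_gen3_effective_gb_s / gpus_sharing_link   # 3 GB/s each

# Hypothetical model: 100M fp32 parameters -> 0.4 GB of gradients.
grad_bytes = 100e6 * 4
seconds_per_exchange = grad_bytes / (per_gpu_gb_s * 1e9)
print(f"{per_gpu_gb_s:.0f} GB/s per GPU; one full-gradient transfer "
      f"takes ~{seconds_per_exchange * 1e3:.0f} ms")
```

At ~133 ms per full fp32 gradient exchange for even this modest model, the shared PCIe link dominates step time unless transfers overlap with compute.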

Anyway, this is something I’m particularly interested in. Feel free to poke me 
if you want me to answer a specific question.

Sincerely,
Daniel

On 2018/05/09 23:31:10, Xiangrui Meng  wrote: 
> Shivaram: Yes, we can call it "gang scheduling" or "barrier
> synchronization". Spark doesn't support it now. The proposal is to have a
> proper support in Spark's job scheduler, so we can integrate well with
> MPI-like frameworks.
> 
> On Tue, May 8, 2018 at 11:17 AM Nan Zhu  wrote:
> 
> > …how I skipped the last part
> >
> > On Tue, May 8, 2018 at 11:16 AM, Reynold Xin  wrote:
> >
> >> Yes, Nan, totally agree. To be on the same page, that's exactly what I
> >> wrote wasn't it?
> >>
> >> On Tue, May 8, 2018 at 11:14 AM Nan Zhu  wrote:
> >>
> >>> besides that, one of the things which is needed by multiple frameworks
> >>> is to schedule tasks in a single wave
> >>>
> >>> i.e.
> >>>
> >>> if some frameworks like xgboost/mxnet requires 50 parallel workers,
> >>> Spark is desired to provide a capability to ensure that either we run 50
> >>> tasks at once, or we should quit the complete application/job after some
> >>> timeout period
> >>>
> >>> Best,
> >>>
> >>> Nan
> >>>
> >>> On Tue, May 8, 2018 at 11:10 AM, Reynold Xin 
> >>> wrote:
> >>>
>  I think that's what Xiangrui was referring to. Instead of retrying a
>  single task, retry the entire stage, and the entire stage of tasks
>  need to be scheduled all at once.
> 
> 
>  On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
>  shiva...@eecs.berkeley.edu> wrote:
> 
> >
> >>
> >>>- Fault tolerance and execution model: Spark assumes
> >>>fine-grained task recovery, i.e. if something fails, only that task is
> >>>rerun. This doesn’t match the execution model of distributed ML/DL
> >>>frameworks that are typically MPI-based, and rerunning a single task
> >>>would lead to the entire system hanging. A whole stage needs to be re-run.
> >>>
> >>> This is not only useful for integrating with 3rd-party frameworks,
> >> but also useful for scaling MLlib algorithms. One of my earliest 
> >> attempts
> >> in Spark MLlib was to implement All-Reduce primitive (SPARK-1485
> >> ). But we

Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-16 Thread Saisai Shao
+1, checked the new py4j-related changes.

Marcelo Vanzin  wrote on Thu, May 17, 2018 at 5:41 AM:

> This is actually in 2.3, jira is just missing the version.
>
> https://github.com/apache/spark/pull/20765
>
> On Wed, May 16, 2018 at 2:14 PM, kant kodali  wrote:
> > I am not sure how SPARK-23406 is a new feature, since streaming joins are
> > already part of Spark 2.3.0. The self joins didn't work because of a bug
> > and it is fixed, but I can understand if it touches some other code paths.
> >
> > On Wed, May 16, 2018 at 3:22 AM, Marco Gaido 
> wrote:
> >>
> >> I'd be against having a new feature in a minor maintenance release. I
> >> think such a release should contain only bugfixes.
> >>
> >> 2018-05-16 12:11 GMT+02:00 kant kodali :
> >>>
> >>> Can this https://issues.apache.org/jira/browse/SPARK-23406 be part of
> >>> 2.3.1?
> >>>
> >>> On Tue, May 15, 2018 at 2:07 PM, Marcelo Vanzin 
> >>> wrote:
> 
>  Bummer. People should still feel welcome to test the existing RC so we
>  can rule out other issues.
> 
>  On Tue, May 15, 2018 at 2:04 PM, Xiao Li 
> wrote:
>  > -1
>  >
>  > We have a correctness bug fix that was merged after 2.3 RC1. It would
>  > be nice to have that in the Spark 2.3.1 release.
>  >
>  > https://issues.apache.org/jira/browse/SPARK-24259
>  >
>  > Xiao
>  >
>  >
>  > 2018-05-15 14:00 GMT-07:00 Marcelo Vanzin :
>  >>
>  >> Please vote on releasing the following candidate as Apache Spark
>  >> version
>  >> 2.3.1.
>  >>
>  >> The vote is open until Friday, May 18, at 21:00 UTC and passes if
>  >> a majority of at least 3 +1 PMC votes are cast.
>  >>
>  >> [ ] +1 Release this package as Apache Spark 2.3.1
>  >> [ ] -1 Do not release this package because ...
>  >>
>  >> To learn more about Apache Spark, please see
> http://spark.apache.org/
>  >>
>  >> The tag to be voted on is v2.3.1-rc1 (commit cc93bc95):
>  >> https://github.com/apache/spark/tree/v2.3.0-rc1
>  >>
>  >> The release files, including signatures, digests, etc. can be found
>  >> at:
>  >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-bin/
>  >>
>  >> Signatures used for Spark RCs can be found in this file:
>  >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>  >>
>  >> The staging repository for this release can be found at:
>  >>
>  >>
> https://repository.apache.org/content/repositories/orgapachespark-1269/
>  >>
>  >> The documentation corresponding to this release can be found at:
>  >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-docs/
>  >>
>  >> The list of bug fixes going into 2.3.1 can be found at the
> following
>  >> URL:
>  >> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>  >>
>  >> FAQ
>  >>
>  >> =
>  >> How can I help test this release?
>  >> =
>  >>
>  >> If you are a Spark user, you can help us test this release by
> taking
>  >> an existing Spark workload and running on this release candidate,
>  >> then
>  >> reporting any regressions.
>  >>
>  >> If you're working in PySpark you can set up a virtual env and
> install
>  >> the current RC and see if anything important breaks, in the
>  >> Java/Scala
>  >> you can add the staging repository to your project's resolvers and
>  >> test with the RC (make sure to clean up the artifact cache
>  >> before/after so you don't end up building with an out-of-date RC
>  >> going forward).
>  >>
>  >> ===
>  >> What should happen to JIRA tickets still targeting 2.3.1?
>  >> ===
>  >>
>  >> The current list of open tickets targeted at 2.3.1 can be found at:
>  >> https://s.apache.org/Q3Uo
>  >>
>  >> Committers should look at those and triage. Extremely important bug
>  >> fixes, documentation, and API tweaks that impact compatibility
> should
>  >> be worked on immediately. Everything else please retarget to an
>  >> appropriate release.
>  >>
>  >> ==
>  >> But my bug isn't fixed?
>  >> ==
>  >>
>  >> In order to make timely releases, we will typically not hold the
>  >> release unless the bug in question is a regression from the
> previous
>  >> release. That being said, if there is something which is a
> regression
>  >> that has not been correctly targeted please ping me or a committer
> to
>  >> help target the issue.
>  >>
>  >>
>  >> --
>  >> Marcelo
>  >>
>  >>
> -
>  >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>  >>

Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-16 Thread Marcelo Vanzin
This is actually in 2.3, jira is just missing the version.

https://github.com/apache/spark/pull/20765

On Wed, May 16, 2018 at 2:14 PM, kant kodali  wrote:
> I am not sure how SPARK-23406 is a new feature. since streaming joins are
> already part of SPARK 2.3.0. The self joins didn't work because of a bug and
> it is fixed but I can understand if it touches some other code paths.
>
> On Wed, May 16, 2018 at 3:22 AM, Marco Gaido  wrote:
>>
>> I'd be against having a new feature in a minor maintenance release. I
>> think such a release should contain only bugfixes.
>>
>> 2018-05-16 12:11 GMT+02:00 kant kodali :
>>>
>>> Can this https://issues.apache.org/jira/browse/SPARK-23406 be part of
>>> 2.3.1?
>>>
>>> On Tue, May 15, 2018 at 2:07 PM, Marcelo Vanzin 
>>> wrote:

 Bummer. People should still feel welcome to test the existing RC so we
 can rule out other issues.

 On Tue, May 15, 2018 at 2:04 PM, Xiao Li  wrote:
 > -1
 >
 > We have a correctness bug fix that was merged after 2.3 RC1. It would
 > be
 > nice to have that in Spark 2.3.1 release.
 >
 > https://issues.apache.org/jira/browse/SPARK-24259
 >
 > Xiao
 >
 >
 > 2018-05-15 14:00 GMT-07:00 Marcelo Vanzin :
 >>
 >> Please vote on releasing the following candidate as Apache Spark
 >> version
 >> 2.3.1.
 >>
 >> The vote is open until Friday, May 18, at 21:00 UTC and passes if
 >> a majority of at least 3 +1 PMC votes are cast.
 >>
 >> [ ] +1 Release this package as Apache Spark 2.3.1
 >> [ ] -1 Do not release this package because ...
 >>
 >> To learn more about Apache Spark, please see http://spark.apache.org/
 >>
 >> The tag to be voted on is v2.3.1-rc1 (commit cc93bc95):
 >> https://github.com/apache/spark/tree/v2.3.0-rc1
 >>
 >> The release files, including signatures, digests, etc. can be found
 >> at:
 >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-bin/
 >>
 >> Signatures used for Spark RCs can be found in this file:
 >> https://dist.apache.org/repos/dist/dev/spark/KEYS
 >>
 >> The staging repository for this release can be found at:
 >>
 >> https://repository.apache.org/content/repositories/orgapachespark-1269/
 >>
 >> The documentation corresponding to this release can be found at:
 >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-docs/
 >>
 >> The list of bug fixes going into 2.3.1 can be found at the following
 >> URL:
 >> https://issues.apache.org/jira/projects/SPARK/versions/12342432
 >>
 >> FAQ
 >>
 >> =
 >> How can I help test this release?
 >> =
 >>
 >> If you are a Spark user, you can help us test this release by taking
 >> an existing Spark workload and running on this release candidate,
 >> then
 >> reporting any regressions.
 >>
 >> If you're working in PySpark you can set up a virtual env and install
 >> the current RC and see if anything important breaks, in the
 >> Java/Scala
 >> you can add the staging repository to your projects resolvers and
 >> test
 >> with the RC (make sure to clean up the artifact cache before/after so
 >> you don't end up building with a out of date RC going forward).
 >>
 >> ===
 >> What should happen to JIRA tickets still targeting 2.3.1?
 >> ===
 >>
 >> The current list of open tickets targeted at 2.3.1 can be found at:
 >> https://s.apache.org/Q3Uo
 >>
 >> Committers should look at those and triage. Extremely important bug
 >> fixes, documentation, and API tweaks that impact compatibility should
 >> be worked on immediately. Everything else please retarget to an
 >> appropriate release.
 >>
 >> ==
 >> But my bug isn't fixed?
 >> ==
 >>
 >> In order to make timely releases, we will typically not hold the
 >> release unless the bug in question is a regression from the previous
 >> release. That being said, if there is something which is a regression
 >> that has not been correctly targeted please ping me or a committer to
 >> help target the issue.
 >>
 >>
 >> --
 >> Marcelo
 >>
 >> -
 >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 >>
 >



 --
 Marcelo

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

>>>
>>
>



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-16 Thread kant kodali
I am not sure how SPARK-23406
is a new feature, since streaming joins are already part of Spark 2.3.0.
The self joins didn't work because of a bug and it is fixed, but I can
understand if it touches some other code paths.

On Wed, May 16, 2018 at 3:22 AM, Marco Gaido  wrote:

> I'd be against having a new feature in a minor maintenance release. I
> think such a release should contain only bugfixes.
>
> 2018-05-16 12:11 GMT+02:00 kant kodali :
>
>> Can this https://issues.apache.org/jira/browse/SPARK-23406 be part of
>> 2.3.1?


[DISCUSS] PySpark Window UDF

2018-05-16 Thread Li Jin
Hi All,

I have been looking into leveraging the Arrow and Pandas UDF work we have
done so far for window UDFs in PySpark. I have done some investigation and
believe there is a way to do PySpark window UDFs efficiently.

The basic idea is that instead of passing each window to Python separately,
we can pass a "batch of windows" as an Arrow batch of rows plus begin/end
indices for each window (the indices are computed on the Java side), then
roll over the begin/end indices in Python and apply the UDF.
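To sketch the idea (with pandas standing in for an Arrow batch; the function name and index layout here are illustrative, not the proposed API — assume the begin/end arrays arrive from the Java side):

```python
import numpy as np
import pandas as pd

def apply_window_udf(batch, begins, ends, udf):
    """Apply `udf` to each [begin, end) slice of a columnar batch.

    `batch` stands in for an Arrow record batch (a pandas DataFrame here);
    `begins`/`ends` are the per-row window bounds assumed to have been
    computed on the Java side.
    """
    return np.array([udf(batch.iloc[b:e]) for b, e in zip(begins, ends)])

df = pd.DataFrame({"v": [1.0, 2.0, 3.0, 4.0, 5.0]})
# Bounds for a trailing window of up to 3 rows (0-based, end-exclusive).
begins = [0, 0, 0, 1, 2]
ends = [1, 2, 3, 4, 5]
means = apply_window_udf(df, begins, ends, lambda w: w["v"].mean())
print(means)  # trailing-window means: 1.0, 1.5, 2.0, 3.0, 4.0
```

The win is that the batch crosses the JVM/Python boundary once, and Python only does cheap index arithmetic per window instead of per-window serialization.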

I have written my investigation in more details here:
https://docs.google.com/document/d/14EjeY5z4-NC27-SmIP9CsMPCANeTcvxN44a7SIJtZPc/edit#

I think this is pretty promising and hope to get some feedback from the
community about this approach. Let's discuss! :)

Li


Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-16 Thread Holden Karau
RC1 PySpark installs into a venv, for what it's worth.

On Wed, May 16, 2018 at 5:26 AM, Sean Owen  wrote:

> +1 the release (otherwise) looks fine to me. Sigs and licenses are OK.
> Builds and passes tests on Debian with -Pyarn -Phadoop-2.7 -Phive
> -Phive-thriftserver -Pkubernetes


-- 
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-16 Thread Sean Owen
+1 the release (otherwise) looks fine to me. Sigs and licenses are OK.
Builds and passes tests on Debian with -Pyarn -Phadoop-2.7 -Phive
-Phive-thriftserver -Pkubernetes



Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-16 Thread Marco Gaido
I'd be against having a new feature in a minor maintenance release. I think
such a release should contain only bugfixes.

2018-05-16 12:11 GMT+02:00 kant kodali :

> Can this https://issues.apache.org/jira/browse/SPARK-23406 be part of
> 2.3.1?


Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-16 Thread kant kodali
Can this https://issues.apache.org/jira/browse/SPARK-23406 be part of 2.3.1?

On Tue, May 15, 2018 at 2:07 PM, Marcelo Vanzin  wrote:

> Bummer. People should still feel welcome to test the existing RC so we
> can rule out other issues.