Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-08 Thread Hyukjin Kwon
The reason is that it is not yet 100% clear whether the root cause of the
Sphinx bug is Python 2, or whether using Python 3 is the right workaround.
Xiangrui opened a bug against Sphinx:
https://github.com/sphinx-doc/sphinx/issues/5142

Here are my observations:
- Sphinx seems to have a bug where the 'autodoc_docstring_signature'
  feature (which allows the signature in the documentation to be
  overridden manually) does not work in a few cases such as __init__, and
  so fails to override the signature.
- In Python 2, functools.wraps does not copy the signature. So it looks
  like an __init__ wrapped by a decorator (for example, 'keyword_only') is
  documented with the wrapper's signature (*args, **kwargs).
- In Python 3, functools.wraps copies the signature. So the documentation
  looks fine even though, apparently, autodoc did not work.
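
The Python 3 behaviour described above can be reproduced with a minimal
sketch. The keyword_only decorator and Estimator class below are simplified,
hypothetical stand-ins rather than the PySpark code, and the
follow_wrapped=False call is only an approximation of what a tool sees when
the wrapper's own signature is used (as described for Python 2):

```python
import functools
import inspect


def keyword_only(func):
    """Hypothetical stand-in for a decorator that forces keyword arguments."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        return func(self, **kwargs)
    return wrapper


class Estimator(object):
    @keyword_only
    def __init__(self, maxIter=100, regParam=0.0):
        self.maxIter = maxIter
        self.regParam = regParam


# Python 3: functools.wraps sets __wrapped__, and inspect.signature follows it,
# so the original parameter list is recovered.
resolved = str(inspect.signature(Estimator.__init__))

# Ignoring __wrapped__ shows the wrapper's own generic signature instead.
raw = str(inspect.signature(Estimator.__init__, follow_wrapped=False))

print(resolved)
print(raw)
```

The first value is the signature a reader expects to see in the API docs; the
second is the generic one that hides the real parameters.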

To cut it short, I am waiting for a response at
https://github.com/sphinx-doc/sphinx/issues/5142 to confirm that this is an
issue in Sphinx and that the workaround is to use Python 3.
Given my observations, the workaround is to use Python 3. So, if the
response from Sphinx is still pending, we could probably just merge the PR
for now. Even if the bug is fixed in Sphinx, I think we will have to live
with this bug for a long time anyway.



Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-08 Thread Saisai Shao
Thanks @Hyukjin Kwon. Yes, I'm using Python 2 to build the docs; it looks
like Python 2 with Sphinx has issues.

What is still pending for this PR
(https://github.com/apache/spark/pull/21659)? I'm planning to cut RC2 once
it is merged. Do you have an ETA for this PR?

On Mon, Jul 9, 2018 at 9:06 AM, Hyukjin Kwon wrote:

> It seems Sphinx on Python 2 was used -
> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/_site/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
> and the SPARK-24530 issue exists in the RC. It's kind of tricky to manually
> verify whether Python 3 was used, given my few tries locally.
>
> I think the fix for SPARK-24530 is technically not merged yet;
> however, I don't think this blocks the release, as with the previous
> release. I think we could proceed in parallel.
> I will probably make progress on
> https://github.com/apache/spark/pull/21659, and fix the release doc too.

Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-08 Thread Saisai Shao
Hi Sean,

SPARK-24530 is not included in this RC1 release. Actually, I'm not so
familiar with this issue, so I was still using Python 2 to generate the docs.

The JIRA mentions that Python 3 with Sphinx could work around this issue.
@Hyukjin Kwon, would you please help to clarify?

Thanks
Saisai


Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-08 Thread Xiao Li
Three business days might be too short. Shall we keep the vote open until
the end of this Friday (July 13th)?

Cheers,

Xiao


Re: [SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

2018-07-08 Thread Reynold Xin
Yes I would just reuse the same function.


Re: [DESIGN] Barrier Execution Mode

2018-07-08 Thread Reynold Xin
Xingbo,

Please reference the SPIP and JIRA ticket next time: [SPARK-24374] SPIP:
Support Barrier Scheduling in Apache Spark


Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-08 Thread Sean Owen
Just checking that the doc issue in
https://issues.apache.org/jira/browse/SPARK-24530 is worked around in this
release?

This was pointed out as an example of a broken doc:
https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression

Here it is in 2.3.2 RC1:
https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/_site/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression

It wasn't immediately obvious to me whether this addressed the issue that
was identified or not.


Otherwise nothing is open for 2.3.2, sigs and license look good, tests pass
as last time, etc.

+1


[DESIGN] Barrier Execution Mode

2018-07-08 Thread Xingbo Jiang
Hi All,

I would like to invite you to review the design document for Barrier
Execution Mode:
https://docs.google.com/document/d/1GvcYR6ZFto3dOnjfLjZMtTezX0W5VYN9w1l4-tQXaZk/edit#

TL;DR: We announced Project Hydrogen at the recent Spark+AI Summit. A major
part of the project involves significant changes to the execution mode of
Spark. This design doc proposes new APIs as well as a new execution mode
(known as barrier execution mode) to provide high-performance support for
DL workloads.

Major changes include:

   - Add RDDBarrier to support gang scheduling.
   - Add BarrierTaskContext to support global sync of all tasks in a stage.
   - Provide a better fault tolerance approach for barrier stages: if some
   tasks fail in the middle, retry all tasks in the same stage.
   - Integrate barrier execution mode with the Standalone cluster manager.
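
As a rough analogy only (this is plain Python threading, not the proposed
Spark API, and names such as NUM_TASKS and task are made up for
illustration), the global sync that BarrierTaskContext is meant to provide
behaves like a barrier that no task passes until every task in the stage has
reached it:

```python
import threading

NUM_TASKS = 4  # hypothetical number of tasks in one barrier stage
barrier = threading.Barrier(NUM_TASKS)
events = []
events_lock = threading.Lock()


def task(task_id):
    # Phase 1: work done independently by each task.
    with events_lock:
        events.append(("before", task_id))
    # Global sync point: blocks until all NUM_TASKS tasks have arrived.
    barrier.wait()
    # Phase 2: only starts once every task has finished phase 1.
    with events_lock:
        events.append(("after", task_id))


threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

phases = [phase for phase, _ in events]
print(phases)
```

Every "before" entry is recorded ahead of any "after" entry, which is the
ordering guarantee a stage-wide sync gives to tasks that must exchange data
with each other.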

Please feel free to review and discuss the design proposal.

Thanks,
Xingbo


Re: [SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

2018-07-08 Thread Li Jin
Hi Linar,

This seems useful. But perhaps reusing the same function name is better?

http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame

Currently createDataFrame takes an RDD of any kind of SQL data
representation (e.g. row, tuple, int, boolean, etc.), a list, or a
pandas.DataFrame.

Perhaps we can support taking an RDD of pandas.DataFrame as the "data"
argument too?

What do other people think?

Li


[SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

2018-07-08 Thread Linar Savion
We've created a snippet that creates a Spark DF from an RDD of many pandas
DFs in a distributed manner, without requiring the driver to collect the
entire dataset.

Early tests show a 6x-10x performance improvement over the
pandasDF->Rows->sparkDF path.

I've seen that there are some open pull requests that change the way Arrow
serialization works. Should I open a pull request to add this functionality
to SparkSession (`createFromPandasDataframesRDD`)?

https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5

Thanks,
Linar


[VOTE] SPARK 2.3.2 (RC1)

2018-07-08 Thread Saisai Shao
Please vote on releasing the following candidate as Apache Spark version
2.3.2.

The vote is open until July 11th PST and passes if a majority of +1 PMC
votes are cast, with a minimum of 3 +1 votes.
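
Read literally, the passing rule above can be sketched as a small check. The
function name and the reading of "majority" as more PMC +1 votes than PMC -1
votes are assumptions made for illustration:

```python
def vote_passes(pmc_plus_ones, pmc_minus_ones, minimum_plus_ones=3):
    """Return True if PMC +1 votes meet the minimum and outnumber PMC -1 votes."""
    has_minimum = pmc_plus_ones >= minimum_plus_ones
    has_majority = pmc_plus_ones > pmc_minus_ones
    return has_minimum and has_majority


print(vote_passes(3, 0))
print(vote_passes(2, 0))
```

Non-PMC votes are still welcome as feedback; under this reading they simply
do not count towards the tally.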

[ ] +1 Release this package as Apache Spark 2.3.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.3.2-rc1
(commit 4df06b45160241dbb331153efbb25703f913c192):
https://github.com/apache/spark/tree/v2.3.2-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1277/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/

The list of bug fixes going into 2.3.2 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12343289

PS. This is my first time doing a release, so please help check that
everything has landed correctly. Thanks ^-^

FAQ

=================================
How can I help test this release?
=================================

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install
the current RC, and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

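=========================================================
What should happen to JIRA tickets still targeting 2.3.2?
=========================================================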

The current list of open tickets targeted at 2.3.2 can be found by going to
https://issues.apache.org/jira/projects/SPARK and searching for "Target
Version/s" = 2.3.2
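
For those who prefer a direct link, the search described above could be
expressed as a JQL query URL. This is a sketch under assumptions: the
/jira/issues/?jql= path is assumed to be the issue-navigator endpoint, and
the helper name and the extra "resolution = Unresolved" clause (to
approximate "open") are not from the original instructions:

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed issue-navigator endpoint on the Apache JIRA instance.
JIRA_SEARCH = "https://issues.apache.org/jira/issues/"


def target_version_url(version):
    """Build a search URL for unresolved SPARK tickets targeting `version`."""
    jql = (
        'project = SPARK AND "Target Version/s" = {} '
        "AND resolution = Unresolved"
    ).format(version)
    return JIRA_SEARCH + "?" + urlencode({"jql": jql})


url = target_version_url("2.3.2")
# Round-trip the query string to confirm the JQL survives URL encoding.
decoded_jql = parse_qs(urlparse(url).query)["jql"][0]
print(url)
```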

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Please retarget everything else to an
appropriate release.

=======================
But my bug isn't fixed?
=======================

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.