Re: ASF board report draft for May

2024-05-06 Thread Matei Zaharia
>> 
>>>>> To track and comply with the new ASF Infra Policy as much as possible, we 
>>>>> opened a blocker-level JIRA issue and have been working on it.
>>>>> - https://infra.apache.org/github-actions-policy.html
>>>>> 
>>>>> Please include a sentence noting that the Apache Spark PMC is working on 
>>>>> this under the following umbrella JIRA issue.
>>>>> 
>>>>> https://issues.apache.org/jira/browse/SPARK-48094
>>>>> > Reduce GitHub Action usage according to ASF project allowance
>>>>> 
>>>>> Thanks,
>>>>> Dongjoon.
>>>>> 
>>>>> 
>>>>> On Sun, May 5, 2024 at 3:45 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>> Do we want to include that we’re planning on having a preview release of 
>>>>>> Spark 4 so folks can see the APIs “soon”?
>>>>>> 
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.): 
>>>>>> https://amzn.to/2MaRAG9
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>> 
>>>>>> 
>>>>>> On Sun, May 5, 2024 at 3:24 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>>> It’s time for our quarterly ASF board report on Apache Spark this 
>>>>>>> Wednesday. Here’s a draft, feel free to suggest changes.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Description:
>>>>>>> 
>>>>>>> Apache Spark is a fast and general purpose engine for large-scale data 
>>>>>>> processing. It offers high-level APIs in Java, Scala, Python, R and SQL 
>>>>>>> as well as a rich set of libraries including stream processing, machine 
>>>>>>> learning, and graph analytics.
>>>>>>> 
>>>>>>> Issues for the board:
>>>>>>> 
>>>>>>> - None
>>>>>>> 
>>>>>>> Project status:
>>>>>>> 
>>>>>>> - We made two patch releases: Spark 3.5.1 on February 28, 2024, and 
>>>>>>> Spark 3.4.2 on April 18, 2024.
>>>>>>> - The votes on "SPIP: Structured Logging Framework for Apache Spark" 
>>>>>>> and "Pure Python Package in PyPI (Spark Connect)" have passed.
>>>>>>> - The votes for two behavior changes have passed: "SPARK-44444: Use 
>>>>>>> ANSI SQL mode by default" and "SPARK-46122: Set 
>>>>>>> spark.sql.legacy.createHiveTableByDefault to false".
>>>>>>> - The community decided that the upcoming Spark 4.0 release will drop 
>>>>>>> support for Python 3.8.
>>>>>>> - We started a discussion about the definition of behavior changes, 
>>>>>>> which is critical for version upgrades and user experience.
>>>>>>> - We've opened a dedicated repository for the Spark Kubernetes Operator 
>>>>>>> at https://github.com/apache/spark-kubernetes-operator. Based on a 
>>>>>>> vote, we added a new version in the Apache Spark JIRA for versioning 
>>>>>>> the Spark operator.
>>>>>>> 
>>>>>>> Trademarks:
>>>>>>> 
>>>>>>> - No changes since the last report.
>>>>>>> 
>>>>>>> Latest releases:
>>>>>>> - Spark 3.4.3 was released on April 18, 2024
>>>>>>> - Spark 3.5.1 was released on February 28, 2024
>>>>>>> - Spark 3.3.4 was released on December 16, 2023
>>>>>>> 
>>>>>>> Committers and PMC:
>>>>>>> 
>>>>>>> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
>>>>>>> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and 
>>>>>>> Yikun Jiang).
>>>>>>> 
>>>>>>> 
>>>>>>> -
>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>>>>>>> <mailto:dev-unsubscr...@spark.apache.org>
>>>>>>> 
> 
> 
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau



ASF board report draft for May

2024-05-05 Thread Matei Zaharia
It’s time for our quarterly ASF board report on Apache Spark this Wednesday. 
Here’s a draft, feel free to suggest changes.



Description:

Apache Spark is a fast and general purpose engine for large-scale data 
processing. It offers high-level APIs in Java, Scala, Python, R and SQL as well 
as a rich set of libraries including stream processing, machine learning, and 
graph analytics.

Issues for the board:

- None

Project status:

- We made two patch releases: Spark 3.5.1 on February 28, 2024, and Spark 3.4.2 
on April 18, 2024.
- The votes on "SPIP: Structured Logging Framework for Apache Spark" and "Pure 
Python Package in PyPI (Spark Connect)" have passed.
- The votes for two behavior changes have passed: "SPARK-44444: Use ANSI SQL 
mode by default" and "SPARK-46122: Set 
spark.sql.legacy.createHiveTableByDefault to false" (see the config sketch 
after this list).
- The community decided that the upcoming Spark 4.0 release will drop support 
for Python 3.8.
- We started a discussion about the definition of behavior changes, which is 
critical for version upgrades and user experience.
- We've opened a dedicated repository for the Spark Kubernetes Operator at 
https://github.com/apache/spark-kubernetes-operator. Based on a vote, we added 
a new version in the Apache Spark JIRA for versioning the Spark operator.

Trademarks:

- No changes since the last report.

Latest releases:
- Spark 3.4.3 was released on April 18, 2024
- Spark 3.5.1 was released on February 28, 2024
- Spark 3.3.4 was released on December 16, 2023

Committers and PMC:

- The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
- The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and Yikun 
Jiang).


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: ASF board report draft for February

2024-02-18 Thread Matei Zaharia
Thanks for the clarification. I updated it to say Comet is in the process of 
being open sourced.

> On Feb 18, 2024, at 1:55 AM, Mich Talebzadeh  
> wrote:
> 
> Hi Matei,
> 
> With regard to your last point
> 
> "- Project Comet, a plugin designed to accelerate Spark query execution by 
> leveraging DataFusion and Arrow, has been open-sourced under the Apache Arrow 
> project. For more information, visit 
> https://github.com/apache/arrow-datafusion-comet."
> 
> If my understanding is correct (as of 15th February), I don't think the full 
> project is open sourced yet, and I quote a response from the thread owner Chao 
> Sun
> 
> "Note that we haven't open sourced several features yet including shuffle 
> support, which the aggregate operation depends on. Please stay tuned!" 
> 
> I would be inclined to leave that line out for now. The rest is fine.
> 
> HTH
> 
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
> 
>view my Linkedin profile 
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: The information provided is correct to the best of my knowledge 
> but of course cannot be guaranteed . It is essential to note that, as with 
> any advice, one verified and tested result holds more weight than a thousand 
> expert opinions.
> 
> 
> On Sat, 17 Feb 2024 at 19:23, Matei Zaharia  <mailto:matei.zaha...@gmail.com>> wrote:
>> Hi all,
>> 
>> I missed some reminder emails about our board report this month, but here is 
>> my draft. I’ll submit it tomorrow if that’s ok. 
>> 
>> ==
>> 
>> Issues for the board:
>> 
>> - None
>> 
>> Project status:
>> 
>> - We made two patch releases: Spark 3.3.4 (EOL release) on December 16, 
>> 2023, and Spark 3.4.2 on November 30, 2023.
>> - We have begun voting for a Spark 3.5.1 maintenance release.
>> - The vote on "SPIP: Structured Streaming - Arbitrary State API v2" has 
>> passed.
>> - We transitioned to an ASF-hosted analytics service, Matomo. For details, 
>> visit 
>> https://analytics.apache.org/index.php?module=CoreHome=index=yesterday=day=40.
>> - Project Comet, a plugin designed to accelerate Spark query execution by 
>> leveraging DataFusion and Arrow, has been open-sourced under the Apache 
>> Arrow project. For more information, visit 
>> https://github.com/apache/arrow-datafusion-comet.
>> 
>> Trademarks:
>> 
>> - No changes since the last report.
>> 
>> Latest releases:
>> 
>> - Spark 3.3.4 was released on December 16, 2023
>> - Spark 3.4.2 was released on November 30, 2023
>> - Spark 3.5.0 was released on September 13, 2023
>> 
>> Committers and PMC:
>> 
>> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
>> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and Yikun 
>> Jiang).
>> 
>> ==
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>> <mailto:dev-unsubscr...@spark.apache.org>
>> 



ASF board report draft for February

2024-02-17 Thread Matei Zaharia
Hi all,

I missed some reminder emails about our board report this month, but here is my 
draft. I’ll submit it tomorrow if that’s ok. 

==

Issues for the board:

- None

Project status:

- We made two patch releases: Spark 3.3.4 (EOL release) on December 16, 2023, 
and Spark 3.4.2 on November 30, 2023.
- We have begun voting for a Spark 3.5.1 maintenance release.
- The vote on "SPIP: Structured Streaming - Arbitrary State API v2" has passed.
- We transitioned to an ASF-hosted analytics service, Matomo. For details, 
visit 
https://analytics.apache.org/index.php?module=CoreHome=index=yesterday=day=40.
- Project Comet, a plugin designed to accelerate Spark query execution by 
leveraging DataFusion and Arrow, has been open-sourced under the Apache Arrow 
project. For more information, visit 
https://github.com/apache/arrow-datafusion-comet.

Trademarks:

- No changes since the last report.

Latest releases:

- Spark 3.3.4 was released on December 16, 2023
- Spark 3.4.2 was released on November 30, 2023
- Spark 3.5.0 was released on September 13, 2023

Committers and PMC:

- The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
- The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and Yikun 
Jiang).

==
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: ASF board report draft for Nov 2023

2023-11-09 Thread Matei Zaharia
Alright, done and posted.

> On Nov 6, 2023, at 10:55 AM, Dongjoon Hyun  wrote:
> 
> Thank you, Matei.
> 
> It would be great if we can include upcoming plans briefly.
> 
> - Apache Spark 3.4.2 
> (https://lists.apache.org/thread/35o2169l5r05k2mknqjy9mztq3ty1btr)
> - Apache Spark 3.3.4 EOL (December 16th)
> 
> Dongjoon.
> 
> On 2023/11/06 05:32:11 Matei Zaharia wrote:
>> It’s time to send our project’s quarterly report to the ASF board on 
>> Wednesday November 8th. Here’s what I wrote as a draft; let me know any 
>> suggested changes.
>> 
>> =
>> 
>> Issues for the board:
>> 
>> - None
>> 
>> Project status:
>> 
>> - We released Apache Spark 3.5 on September 15, a feature release with over 
>> 1300 patches. This release introduced more scenarios with general 
>> availability for Spark Connect, like Scala and Go client, distributed 
>> training and inference support, and enhancement of compatibility for 
>> Structured streaming. It also introduced new PySpark and SQL functionality, 
>> including the SQL IDENTIFIER clause, named argument support for SQL function 
>> calls, SQL function support for HyperLogLog approximate aggregations, and 
>> Python user-defined table functions; simplified distributed training with 
>> DeepSpeed; introduced watermark propagation among operators; and added the 
>> dropDuplicatesWithinWatermark operation in Structured Streaming.
>> - We made a patch release, Spark 3.3.3, on August 21, 2023.
>> - Apache Spark 4.0.0-SNAPSHOT is now ready for Java 21. [SPARK-43831]
>> - The vote on "Updating documentation hosted for EOL and maintenance 
>> releases" has passed.
>> - The vote on the Spark Project Improvement Proposals (SPIPs) for "State 
>> Data Source - Reader" has passed.
>> - The PMC has voted to add two new PMC members, Yuanjian Li and Yikun Jiang, 
>> and one new committer, Jiaan Geng, to the project.
>> 
>> Trademarks:
>> 
>> - No changes since the last report.
>> 
>> Latest releases:
>> 
>> - Spark 3.5.0 was released on September 13, 2023
>> - Spark 3.3.3 was released on August 21, 2023
>> - Spark 3.4.1 was released on June 23, 2023
>> 
>> Committers and PMC:
>> 
>> - The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
>> - The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and Yikun 
>> Jiang).
>> 
>> =
>> 
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> 
>> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



ASF board report draft for Nov 2023

2023-11-05 Thread Matei Zaharia
It’s time to send our project’s quarterly report to the ASF board on Wednesday 
November 8th. Here’s what I wrote as a draft; let me know any suggested changes.

=

Issues for the board:

- None

Project status:

- We released Apache Spark 3.5 on September 15, a feature release with over 
1300 patches. This release brought more Spark Connect scenarios to general 
availability, such as the Scala and Go clients, distributed training and 
inference support, and improved compatibility with Structured Streaming. It 
also introduced new PySpark and SQL functionality, including the SQL IDENTIFIER 
clause, named argument support for SQL function calls, SQL function support for 
HyperLogLog approximate aggregations, and Python user-defined table functions; 
simplified distributed training with DeepSpeed; introduced watermark 
propagation among operators; and added the dropDuplicatesWithinWatermark 
operation in Structured Streaming (see the short sketch after this list).
- We made a patch release, Spark 3.3.3, on August 21, 2023.
- Apache Spark 4.0.0-SNAPSHOT is now ready for Java 21. [SPARK-43831]
- The vote on "Updating documentation hosted for EOL and maintenance releases" 
has passed.
- The vote on the Spark Project Improvement Proposal (SPIP) for "State Data 
Source - Reader" has passed.
- The PMC has voted to add two new PMC members, Yuanjian Li and Yikun Jiang, 
and one new committer, Jiaan Geng, to the project.
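
A short PySpark 3.5 sketch of two of the items above, Python user-defined table 
functions and the HyperLogLog sketch aggregates (a minimal example; see the 
release notes for the full APIs):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit, udtf

    spark = SparkSession.builder.getOrCreate()

    # Python user-defined table function: one input row yields many output rows.
    @udtf(returnType="word: string, length: int")
    class SplitWords:
        def eval(self, text: str):
            for w in text.split():
                yield (w, len(w))

    SplitWords(lit("hello spark three five")).show()

    # HyperLogLog sketch aggregates, new SQL functions in 3.5.
    spark.range(1000).createOrReplaceTempView("ids")
    spark.sql(
        "SELECT hll_sketch_estimate(hll_sketch_agg(id)) AS approx_distinct FROM ids"
    ).show()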

Trademarks:

- No changes since the last report.

Latest releases:

- Spark 3.5.0 was released on September 13, 2023
- Spark 3.3.3 was released on August 21, 2023
- Spark 3.4.1 was released on June 23, 2023

Committers and PMC:

- The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
- The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and Yikun 
Jiang).

=

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Updating documentation hosted for EOL and maintenance releases

2023-08-31 Thread Matei Zaharia
It would be great to do this IMO, because there are often usability and 
formatting fixes needed to docs over time, and people naturally search for docs 
from their *deployed* version of the project — not the latest version, hoping 
that it also applies to their release.

For example, right now there’s a bug where the language switcher for code 
examples on many versions of our docs does not toggle all the code examples on 
the page. If you switch one example from Scala to Python, it used to switch all 
the other ones too, but it stopped doing that. It would be great to fix this. 
There may be smaller issues too.

> On Aug 30, 2023, at 10:27 PM, Hyukjin Kwon  wrote:
> 
> Hi all,
> 
> I would like to raise a discussion about updating documentation hosted for 
> EOL and maintenance
> versions.
> 
> To provide some context, we currently host the documentation for EOL versions 
> of Apache Spark,
> which can be found at links like 
> https://spark.apache.org/docs/2.3.1/api/python/index.html. Some
> of their documentation appears at the top of search results if you google. The 
> same applies to
> maintenance releases. Once technical mistakes in the documentation, incorrect 
> information,
> etc. are landed mistakenly, they become permanent and/or cannot easily be 
> fixed, e.g., until
> the next maintenance release.
> 
> In practice, we’ve already taken steps to update and fix the documentation 
> for these EOL and
> maintenance releases, including:
> 
> - Algolia and DocSearch, which require us to make some changes after each 
>   release to allow search results on the Apache Spark website and documentation
> - Regenerating documentation that was incorrectly generated
> - Fixing the malformed download page
> - …
> I would like to take this a step further, and have documentation improvements 
> and better examples landed in maintenance branches also be landed in the 
> hosted documentation, for better usability.
> The changes landed into EOL or maintenance branches, according to SemVer, are 
> usually only bug
> fixes, so the documentation changes such as fixing examples would not 
> introduce any surprises.
> 
> This documentation is critical to end users, it is the area where I most 
> often hear we should improve, and I eagerly would like to improve the usability 
> here.
> 
> TL;DR, what I would like to propose is to improve our current practice of 
> landing updates in the
> documentation hosted for EOL and maintenance versions so that we can show a 
> better search
> result for Spark documentation, end users can read the correct information in 
> the versions they use,
> and follow the better examples provided in Spark documentation.
> 
> 
> 



Re: ASF board report draft for August 2023

2023-08-09 Thread Matei Zaharia
Sounds good, I’ll add that.

> On Aug 8, 2023, at 9:34 AM, Holden Karau  wrote:
> 
> Maybe add a link to the 4.0 JIRA where we are tracking the current plans for 
> 4.0?
> 
> On Tue, Aug 8, 2023 at 9:33 AM Dongjoon Hyun  <mailto:dongjoon.h...@gmail.com>> wrote:
>> Thank you, Matei.
>> 
>> It looks good to me.
>> 
>> Dongjoon
>> 
>> On Mon, Aug 7, 2023 at 22:54 Matei Zaharia > <mailto:matei.zaha...@gmail.com>> wrote:
>>> It’s time to send our quarterly report to the ASF board on August 9th. 
>>> Here’s what I wrote as a draft — feel free to suggest changes.
>>> 
>>> =
>>> 
>>> Issues for the board:
>>> 
>>> - None
>>> 
>>> Project status:
>>> 
>>> - We cut the branch Spark 3.5.0 on July 17th 2023. The community is working 
>>> on bug fixes, tests, stability and documentation.
>>> - We made a patch release, Spark 3.4.1, on June 23, 2023.
>>> - We are preparing a Spark 3.3.3 release for later this month 
>>> (https://lists.apache.org/thread/0kgnw8njjnfgc5nghx60mn7oojvrqwj7).
>>> - Votes on three Spark Project Improvement Proposals (SPIP) passed: "XML 
>>> data source support", "Python Data Source API", and "PySpark Test 
>>> Framework".
>>> - A vote for "Apache Spark PMC asks Databricks to differentiate its Spark 
>>> version string" did not pass. This was asking a company to change the 
>>> string returned by Spark APIs in a product that packages a modified version 
>>> of Apache Spark.
>>> - The community decided to release Apache Spark 4.0.0 after the 3.5.0 
>>> version.
>>> - An official Apache Spark Docker image is now available at 
>>> https://hub.docker.com/_/spark
>>> - A new repository, https://github.com/apache/spark-connect-go, was created 
>>> for the Go client of Spark Connect.
>>> - The PMC voted to add two new committers to the project, XiDuo You and 
>>> Peter Toth
>>> 
>>> Trademarks:
>>> 
>>> - No changes since the last report.
>>> 
>>> Latest releases:
>>> 
>>> - We released Apache Spark 3.4.1 on June 23, 2023
>>> - We released Apache Spark 3.2.4 on April 13, 2023
>>> - We released Spark 3.3.2 on February 17, 2023
>>> 
>>> Committers and PMC:
>>> 
>>> - The latest committers were added on July 11th, 2023 (XiDuo You and Peter 
>>> Toth).
>>> - The latest PMC members were added on May 10th, 2023 (Chao Sun, Xinrong 
>>> Meng and Ruifeng Zheng).
>>> 
>>> =
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
>  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau



ASF board report draft for August 2023

2023-08-07 Thread Matei Zaharia
It’s time to send our quarterly report to the ASF board on August 9th. Here’s 
what I wrote as a draft — feel free to suggest changes.

=

Issues for the board:

- None

Project status:

- We cut the branch Spark 3.5.0 on July 17th 2023. The community is working on 
bug fixes, tests, stability and documentation.
- We made a patch release, Spark 3.4.1, on June 23, 2023.
- We are preparing a Spark 3.3.3 release for later this month 
(https://lists.apache.org/thread/0kgnw8njjnfgc5nghx60mn7oojvrqwj7).
- Votes on three Spark Project Improvement Proposals (SPIPs) passed: "XML data 
source support", "Python Data Source API", and "PySpark Test Framework" (a 
short example of the latter follows this list).
- A vote for "Apache Spark PMC asks Databricks to differentiate its Spark 
version string" did not pass. This was asking a company to change the string 
returned by Spark APIs in a product that packages a modified version of Apache 
Spark.
- The community decided to release Apache Spark 4.0.0 after the 3.5.0 version.
- An official Apache Spark Docker image is now available at 
https://hub.docker.com/_/spark
- A new repository, https://github.com/apache/spark-connect-go, was created for 
the Go client of Spark Connect.
- The PMC voted to add two new committers to the project, XiDuo You and Peter 
Toth
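
To make the "PySpark Test Framework" SPIP above concrete, a minimal sketch, 
assuming the assertDataFrameEqual helper that this work adds to pyspark.testing 
in the upcoming Spark 3.5:

    from pyspark.sql import SparkSession
    from pyspark.testing import assertDataFrameEqual

    spark = SparkSession.builder.getOrCreate()

    actual = spark.createDataFrame([("a", 1.0), ("b", 2.0)], ["key", "value"])
    expected = spark.createDataFrame([("a", 1.0), ("b", 2.0000001)], ["key", "value"])

    # Passes: float values are compared with a tolerance (rtol/atol) rather than
    # exact equality, which keeps DataFrame tests stable across platforms.
    assertDataFrameEqual(actual, expected, rtol=1e-3)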

Trademarks:

- No changes since the last report.

Latest releases:

- We released Apache Spark 3.4.1 on June 23, 2023
- We released Apache Spark 3.2.4 on April 13, 2023
- We released Spark 3.3.2 on February 17, 2023

Committers and PMC:

- The latest committers were added on July 11th, 2023 (XiDuo You and Peter 
Toth).
- The latest PMC members were added on May 10th, 2023 (Chao Sun, Xinrong Meng 
and Ruifeng Zheng).

=
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPIP] Python Data Source API

2023-07-10 Thread Matei Zaharia
+1

> On Jul 10, 2023, at 10:19 AM, Takuya UESHIN  wrote:
> 
> +1
> 
> On Sun, Jul 9, 2023 at 10:05 PM Ruifeng Zheng  > wrote:
>> +1
>> 
>> On Mon, Jul 10, 2023 at 8:20 AM Jungtaek Lim > > wrote:
>>> +1
>>> 
>>> On Sat, Jul 8, 2023 at 4:13 AM Reynold Xin  
>>> wrote:
 +1!
 
 
 On Fri, Jul 7 2023 at 11:58 AM, Holden Karau >>> > wrote: 
> +1
> 
> On Fri, Jul 7, 2023 at 9:55 AM huaxin gao  > wrote:
>> +1
>> 
>> On Fri, Jul 7, 2023 at 8:59 AM Mich Talebzadeh 
>> mailto:mich.talebza...@gmail.com>> wrote:
>>> +1 for me 
>>> 
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>> 
>>>view my Linkedin profile 
>>> 
>>> 
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>> 
>>>  
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>> loss, damage or destruction of data or any other property which may 
>>> arise from relying on this email's technical content is explicitly 
>>> disclaimed. The author will in no case be liable for any monetary 
>>> damages arising from such loss, damage or destruction.
>>>  
>>> 
>>> 
>>> On Fri, 7 Jul 2023 at 11:05, Martin Grund 
>>>  wrote:
 +1 (non-binding)
 
 On Fri, Jul 7, 2023 at 12:05 AM Denny Lee >>> > wrote:
> +1 (non-binding) 
> 
> On Fri, Jul 7, 2023 at 00:50 Maciej  > wrote:
>> +0
>> 
>> Best regards,
>> Maciej Szymkiewicz
>> 
>> Web: https://zero323.net 
>> PGP: A30CEF0C31A501EC
>> On 7/6/23 17:41, Xiao Li wrote:
>>> +1
>>> 
>>> Xiao
>>> 
>>> Hyukjin Kwon mailto:gurwls...@apache.org>> 
>>> 于2023年7月5日周三 17:28写道:
 +1.
 
 See https://youtu.be/yj7XlTB1Jvc?t=604 :-).
 
 On Thu, 6 Jul 2023 at 09:15, Allison Wang 
  
  wrote:
> Hi all,
> 
> I'd like to start the vote for SPIP: Python Data Source API.
> 
> The high-level summary for the SPIP is that it aims to introduce 
> a simple API in Python for Data Sources. The idea is to enable 
> Python developers to create data sources without learning Scala 
> or dealing with the complexities of the current data source APIs. 
> This would make Spark more accessible to the wider Python 
> developer community. 
> 
> References:
> SPIP doc 
> 
> JIRA ticket 
> Discussion thread 
> 
> 
> Please vote on the SPIP for the next 72 hours:
> 
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because __.
> 
> Thanks,
> Allison
> 
> 
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): 
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> 
> 
> -- 
> Takuya UESHIN
> 



ASF board report draft for May 2023

2023-05-09 Thread Matei Zaharia
It’s time to send our ASF board report again on May 10th. I’ve put together 
this draft — let me know whether to add anything else.



Issues for the board:

- None

Project status:

- We released Apache Spark 3.4 on April 13th, a feature release with over 2600 
patches. This release introduces a Python client for Spark Connect, augments 
Structured Streaming with async progress tracking and Python arbitrary stateful 
processing, increases Pandas API coverage and provides NumPy input support, 
simplifies the migration from traditional data warehouses to Apache Spark by 
improving ANSI compliance and implementing dozens of new built-in functions, 
and boosts development productivity and debuggability with memory profiling.
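
As a minimal sketch of the new Python client for Spark Connect, the DataFrame 
code stays the same and only the session changes; this assumes a Spark Connect 
server is already running (the host and port below are placeholders):

    from pyspark.sql import SparkSession

    # Connect to a remote Spark Connect endpoint instead of a local JVM driver.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    df = spark.range(10).selectExpr("id", "id * id AS squared")
    df.show()  # planned and executed on the remote cluster; results return to the client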

- We made two patch releases: Spark 3.2.4 on April 13th and Spark 3.3.2 on 
February 17th. These have bug fixes to the corresponding branches of the 
project.

- The PMC voted to add three new PMC members to the project (to be announced 
soon once they accept).

Trademarks:

- No changes since the last report.

Latest releases:

- Spark 3.4.0 was released on April 13, 2023
- Spark 3.2.4 was released on April 13, 2023
- Spark 3.3.2 was released on February 17, 2023

Committers and PMC:

- The latest committer was added on Oct 2nd, 2022 (Yikun Jiang).
- The latest PMC member was added on June 28th, 2022 (Huaxin Gao).


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Slack for Spark Community: Merging various threads

2023-04-06 Thread Matei Zaharia
To me, the most important opportunity here is to create a better support 
environment for users, and I think it’s super important to allow users to join 
immediately on their own if we want this to succeed. A lot of users these days 
do prefer to join a live chat interface to get support for an open source 
project than to email a list. Just look at how many people joined the Slack 
workspaces for related open source projects — there are 8000 users in the Delta 
Lake one, which covers only a subset of the Spark user community, so this is 
already more than the user@ mailing list. So I do think we should have an 
official project Slack with an easy invitation process.

*Developer* discussions should still happen on email, JIRA and GitHub and be 
async-friendly (72-hour rule) to fit the ASF’s development model.

Given the size of the overall Spark user base, I would also lean towards a 
standalone workspace, where we have more control over organizing the channels, 
have #general and #random channels that are only about Apache Spark, etc. I’m 
not sure that joining a Slack workspace that supports all ASF projects would 
work well for users seeking help, unless it’s being organized for broad user 
support that way. In practice, when you use the Slack UI, you can easily switch 
between different Slack workspaces and invite people from them into another 
workspace, so users should not have trouble participating in multiple Slack 
workspaces to cover different projects.

Just my 2 cents, maybe there is another approach that would work with the 
ASF-wide workspace, but I do think that Slack is meant to be used as a 
multi-workspace thing.

> On Apr 6, 2023, at 3:17 PM, Maciej  wrote:
> 
> Additionally. there is no indication that the-asf.slack.com is intended for 
> general support. In particular it states the following
> 
> > The Apache Software Foundation has a workspace on Slack 
> >  to provide channels on which people working on 
> > the same ASF project, or in the same area of the Foundation, can discuss 
> > issues, solve problems, and build community in real-time.
> 
> and then
> 
> > Other contributors and interested parties (observers, former members, 
> > software evaluators, members of the media, those without an @apache.org 
> > address) who want to participate in channels in the ASF workspace can use a 
> > guest account.
> 
> Extending this to inviting everyone on @user (over 4k subscribers according 
> to the previous thread) might be a stretch, especially without knowing the 
> details of the agreement between the ASF and the Slack Technologies.
> 
> -- 
> Best regards,
> Maciej Szymkiewicz
> 
> Web: https://zero323.net 
> PGP: A30CEF0C31A501EC
> 
> On 4/6/23 17:13, Denny Lee wrote:
>> Thanks Dongjoon, but I don't think this is misleading insofar that this is 
>> not a self-service process but an invite process which admittedly I did not 
>> state explicitly in my previous thread.  And thanks for the invite to 
>> the-ASF Slack - I just joined :) 
>> 
>> Saying this, I do completely agree with your two assertions:
>> Shall we narrow-down our focus on comparing the ASF Slack vs another 
>> 3rd-party Slack because all of us agree that this is important?  
>> Yes, I do agree that is an important aspect, all else being equal.
>> I'm wondering what ASF misses here if Apache Spark PMC invites all remaining 
>> subscribers of `user@spark` and `dev@spark` mailing lists.
>> The key question here is that do PMC members have the bandwidth of inviting 
>> everyone in user@ and dev@?   There is a lot of overhead of maintaining this 
>> so that's my key concern is if we have the number of volunteers to manage 
>> this.  Note, I'm willing to help with this process as well it was just more 
>> of a matter that there are a lot of folks to approve  
>> A reason why we may want to consider Spark's own Slack is because we can 
>> potentially create different channels within Slack to more easily group 
>> messages (e.g. different threads for troubleshooting, RDDs, streaming, 
>> etc.).  Again, we'd need someone to manage this so that way we don't have an 
>> out of control number of channels.
>> WDYT?
>> 
>> 
>> 
>> On Wed, Apr 5, 2023 at 10:50 PM Dongjoon Hyun > > wrote:
>>> Thank you so much, Denny.
>>> Yes, let me comment on a few things.
>>> 
>>> > - While there is an ASF Slack, it
>>> > requires an @apache.org email address
>>> 
>>> 1. This sounds a little misleading because we can see `guest` accounts in 
>>> the same link. People can be invited by "Invite people to ASF" link. I 
>>> invited you, Denny, and attached the screenshots.
>>> 
>>> >   using linen.dev  as its Slack archive (so we can 
>>> > surpass the 90 days limit)
>>> 
>>> 2. The official Foundation-supported Slack workspace preserves all messages.
>>> (the-asf.slack.com 

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-25 Thread Matei Zaharia
I’m +1 on switching to Python by default given what I see at the majority of 
users. I like the idea of investigating a way to save the language choice in a 
cookie and to switch all code examples on the page to a new language when you 
click one of the tabs. We used to have the switching behavior at least (e.g. 
see this archived page from 2016 
https://web.archive.org/web/20160308055505/https://spark.apache.org/docs/latest/quick-start.html), 
so I’m not sure what happened to that. We might never have had the cookie, but 
that is worth investigating.

Matei

> On Feb 23, 2023, at 11:31 PM, Santosh Pingale 
>  wrote:
> 
> Yes, I definitely agree and +1 to the proposal (FWIW). 
> 
> I was looking at Dongjoon's comments which made a lot of sense to me and 
> trying to come up with an approach that provides smooth segway to python as 
> first tab later on. But this is mostly guess work as I do not personally know 
> the actual user behaviour on docs site.
> 
> On Fri, Feb 24, 2023, 8:01 AM Hyukjin Kwon  > wrote:
> That sounds good to have that especially given that it will allow more 
> flexibility to the users.
> But I think that's slightly orthogonal to this proposal since this proposal 
> is more about the default (before users take an action).
> 
> 
> On Fri, 24 Feb 2023 at 15:35, Santosh Pingale  > wrote:
> Very interesting and user focused discussion, thanks for the proposal.
> 
> Would it be better if we rather let users set the preference about the 
> language they want to see first in the code examples? This preference can be 
> easily stored on the browser side and used to decide ordering. This is inline 
> with freedom users have with spark today.
> 
> 
> On Fri, Feb 24, 2023, 4:46 AM Allan Folting  > wrote:
> I think this needs to be consistently done on all relevant pages and my 
> intent is to do that work in time for when it is first released.
> I started with the "Spark SQL, DataFrames and Datasets Guide" page to break 
> it up into multiple, scoped PRs.
> I should have made that clear before.
> 
> I think it's a great idea to have an umbrella JIRA for this to outline the 
> full scope and track overall progress and I'm happy to create it.
> 
> I can't speak on behalf of all Scala users of course, but I don't think this 
> change makes Scala appear as a 2nd class citizen, like I don't think of 
> Python as a 2nd class citizen because it is not first currently, but it does 
> recognize that Python is more broadly popular today.
> 
> Thanks,
> Allan
> 
> On Thu, Feb 23, 2023 at 6:55 PM Dongjoon Hyun  > wrote:
> Thank you all.
> 
> Yes, attracting more Python users and being more Python user-friendly is 
> always good.
> 
> Basically, SPARK-42493 is proposing to introduce intentional inconsistency to 
> Apache Spark documentation.
> 
> The inconsistency from SPARK-42493 might give Python users the following 
> questions first.
> 
> - Why not RDD pages which are the heart of Apache Spark? Is Python not good 
> in RDD?
> - Why not ML and Structured Streaming pages when DATA+AI Summit focuses on ML 
> heavily?
> 
> Also, more questions to the Scala users.
> - Is Scala language stepping down to the 2nd citizen language?
> - What about Scala 3?
> 
> Of course, I understand SPARK-42493 has specific scopes 
> (SQL/Dataset/Dataframe) and didn't mean anything like the above at all.
> However, if SPARK-42493 is emphasized as "the first step" to introduce that 
> inconsistency, I'm wondering 
> - What direction we are heading?
> - What is the next target scope?
> - When it will be achieved (or completed)?
> - Or, is the goal to be permanently inconsistent in terms of the 
> documentation?
> 
> It's unclear even in the documentation-only scope. If we are expecting more 
> and more subtasks during Apache Spark 3.5 timeframe, shall we have an 
> umbrella JIRA?
> 
> Bests,
> Dongjoon.
> 
> 
> On Thu, Feb 23, 2023 at 6:15 PM Allan Folting  > wrote:
> Thanks a lot for the questions and comments/feedback!
> 
> To address your questions Dongjoon, I do not intend for these updates to the 
> documentation to be tied to the potential changes/suggestions you ask about.
> 
> In other words, this proposal is only about adjusting the documentation to 
> target the majority of people reading it - namely the large and growing 
> number of Python users - and new users in particular as they are often 
> already familiar with and have a preference for Python when evaluating or 
> starting to use Spark.
> 
> While we may want to strengthen support for Python in other ways, I think 
> such efforts should be tracked separately from this.
> 
> Allan
> 
> On Thu, Feb 23, 2023 at 1:44 AM Mich Talebzadeh  > wrote:
> If this is not just flip flopping 

Re: ASF board report draft for Feb 2023

2023-02-08 Thread Matei Zaharia
Sounds good!

On Feb 6, 2023, at 8:16 PM, Dongjoon Hyun wrote:

Thank you, Matei. Could you include the following additionally?

1. Liang-Chi is preparing v3.3.2 (this month).
   https://lists.apache.org/thread/nwzr3o2cxyyf6sbb37b8yylgcvmbtp16
2. Since Spark 3.4.0, we attached an SBOM to the Apache Spark Maven artifacts 
   [SPARK-41893], in line with other ASF projects.
   https://cwiki.apache.org/confluence/display/COMDEV/SBOM

Thanks,
Dongjoon

On Mon, Feb 6, 2023 at 6:13 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:

Hi all,

It’s time to send our quarterly report to the ASF board this Wednesday (Feb 8th). Here is a draft; let me know if you have suggestions:

===

Issues for the board:

- None

Project status:

- We cut the branch Spark 3.4.0 on Jan 24th 2023. The community is working on bug fixes, tests, stability and docs.
- We released Apache Spark 3.2.3, a bug fix release for the 3.2 line, on Nov 28th 2022.
- Votes on the Spark Project Improvement Proposals (SPIPs) for "Asynchronous Offset Management in Structured Streaming" and "Better Spark UI scalability and Driver stability for large applications" passed.
- The DStream API will be deprecated in the upcoming Apache Spark 3.4 release to focus work on the Structured Streaming APIs. [SPARK-42075]

Trademarks:

- No changes since the last report.

Latest releases:

- Spark 3.2.3 was released on Nov 28, 2022.
- Spark 3.3.1 was released on Oct 25, 2022.
- Spark 3.3.0 was released on June 16, 2022.

Committers and PMC:

- The latest committer was added on Oct 2nd, 2022 (Yikun Jiang).
- The latest PMC member was added on June 28th, 2022 (Huaxin Gao).

===


ASF board report draft for Feb 2023

2023-02-06 Thread Matei Zaharia
Hi all,

It’s time to send our quarterly report to the ASF board this Wednesday (Feb 
8th). Here is a draft; let me know if you have suggestions:

===

Issues for the board:

- None

Project status:

- We cut the branch Spark 3.4.0 on Jan 24th 2023. The community is working on 
bug fixes, tests, stability and docs.
- We released Apache Spark 3.2.3, a bug fix release for the 3.2 line, on Nov 
28th 2022.
- Votes on the Spark Project Improvement Proposals (SPIPs) for "Asynchronous 
Offset Management in Structured Streaming" and "Better Spark UI scalability and 
Driver stability for large applications" passed.
- The DStream API will be deprecated in the upcoming Apache Spark 3.4 release 
to focus work on the Structured Streaming APIs. [SPARK-42075]
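
Since the DStream deprecation mainly affects users of the old spark.streaming 
module, here is a minimal sketch of the Structured Streaming equivalent of the 
classic streaming word count (host and port are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.getOrCreate()

    # Read lines from a socket source instead of a socketTextStream DStream.
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Continuously print updated counts to the console.
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()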

Trademarks:

- No changes since the last report.

Latest releases:

- Spark 3.2.3 was released on Nov 28, 2022.
- Spark 3.3.1 was released on Oct 25, 2022.
- Spark 3.3.0 was released on June 16, 2022.

Committers and PMC:

- The latest committer was added on Oct 2nd, 2022 (Yikun Jiang).
- The latest PMC member was added on June 28th, 2022 (Huaxin Gao).

===
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: ASF board report draft for November

2022-11-10 Thread Matei Zaharia
Sounds good.

> On Nov 7, 2022, at 12:02 PM, Dongjoon Hyun  wrote:
> 
> Shall we mention Spark 3.2.3 release preparation since Chao is currently 
> actively working on it?
> 
> Dongjoon.
> 
> On Mon, Nov 7, 2022 at 11:53 AM Matei Zaharia  <mailto:matei.zaha...@gmail.com>> wrote:
> It’s time to send our quarterly report to the ASF board on Wednesday. Here is 
> a draft, let me know if you have suggestions:
> 
> ===
> 
> Description:
> 
> Apache Spark is a fast and general purpose engine for large-scale data
> processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
> well as a rich set of libraries including stream processing, machine learning,
> and graph analytics.
> 
> Issues for the board:
> 
> - None
> 
> Project status:
> 
> - We released Apache Spark 3.3.1, a bug fix release for the 3.3 line, on 
> October 25th.
> - The vote on the Spark Project Improvement Proposal (SPIP) for "Support 
> Docker Official Image for Spark" passed. We created a new Github repository 
> https://github.com/apache/spark-docker for building the official Docker 
> image.
> - We decided to drop the Apache Spark Hadoop 2 binary distribution in future 
> releases.
> - We added a new committer, Yikun Jiang, in October 2022.
> 
> Trademarks:
> 
> - No changes since the last report.
> 
> Latest releases:
> 
> - Spark 3.3.1 was released on Oct 25, 2022.
> - Spark 3.3.0 was released on June 16, 2022.
> - Spark 3.2.2 was released on July 17, 2022.
> 
> Committers and PMC:
> 
> - The latest committer was added on Oct 2nd, 2022 (Yikun Jiang).
> - The latest PMC member was added on June 28th, 2022 (Huaxin Gao).
> 
> ===
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> <mailto:dev-unsubscr...@spark.apache.org>
> 



ASF board report draft for November

2022-11-07 Thread Matei Zaharia
It’s time to send our quarterly report to the ASF board on Wednesday. Here is a 
draft, let me know if you have suggestions:

===

Description:

Apache Spark is a fast and general purpose engine for large-scale data
processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
well as a rich set of libraries including stream processing, machine learning,
and graph analytics.

Issues for the board:

- None

Project status:

- We released Apache Spark 3.3.1, a bug fix release for the 3.3 line, on 
October 25th.
- The vote on the Spark Project Improvement Proposal (SPIP) for "Support Docker 
Official Image for Spark" passed. We created a new Github repository 
https://github.com/apache/spark-docker for building the official Docker image.
- We decided to drop the Apache Spark Hadoop 2 binary distribution in future 
releases.
- We added a new committer, Yikun Jiang, in October 2022.

Trademarks:

- No changes since the last report.

Latest releases:

- Spark 3.3.1 was released on Oct 25, 2022.
- Spark 3.3.0 was released on June 16, 2022.
- Spark 3.2.2 was released on July 17, 2022.

Committers and PMC:

- The latest committer was added on Oct 2nd, 2022 (Yikun Jiang).
- The latest PMC member was added on June 28th, 2022 (Huaxin Gao).

===
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: ASF board report draft for August

2022-08-10 Thread Matei Zaharia
Actually I forgot to add one more item. I want to mention that the community 
started a large effort to improve Structured Streaming performance, usability, 
APIs, and connectors (https://issues.apache.org/jira/browse/SPARK-40025 
<https://issues.apache.org/jira/browse/SPARK-40025>), and we’d love to get 
feedback and contributions on that.

> On Aug 10, 2022, at 11:16 AM, Matei Zaharia  wrote:
> 
> It’s time to submit our quarterly report to the ASF board on Friday. Here is 
> a draft, lmk if you have suggestions:
> 
> ===
> 
> Description:
> 
> Apache Spark is a fast and general purpose engine for large-scale data
> processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
> well as a rich set of libraries including stream processing, machine learning,
> and graph analytics.
> 
> Issues for the board:
> 
> - None
> 
> Project status:
> 
> - Apache Spark was honored to receive the SIGMOD System Award this year, 
> given by SIGMOD (the ACM’s data management research organization) to 
> impactful real-world and research systems.
> 
> - We recently released Apache Spark 3.3.0, a feature release that improves 
> join query performance via Bloom filters, increases the Pandas API coverage 
> with the support of popular Pandas features such as datetime.timedelta and 
> merge_asof, simplifies the migration from traditional data warehouses by 
> improving ANSI SQL compliance and supporting dozens of new built-in 
> functions, boosts development productivity with better error handling, 
> autocompletion, performance, and profiling.
> 
> - We released Apache Spark 3.2.2, a bug fix release for the 3.2 line, on July 
> 17th.
> 
> - A Spark Project Improvement Proposal (SPIP) for Spark Connect was voted on 
> and accepted. Spark Connect introduces a lightweight client/server API for 
> Spark (https://issues.apache.org/jira/browse/SPARK-39375) that will allow 
> applications to submit work to a remote Spark cluster without running the 
> heavyweight query planner in the client, and will also decouple the client 
> version from the server version, making it possible to update Spark without 
> updating all the applications.
> 
> - We added three new PMC members, Huaxin Gao, Gengliang Wang and Maxim Gekk, 
> in June 2022.
> 
> - We added a new committer, Xinrong Meng, in July 2022.
> 
> Trademarks:
> 
> - No changes since the last report.
> 
> Latest releases:
> 
> - Spark 3.3.0 was released on June 16, 2022.
> - Spark 3.2.2 was released on July 17, 2022.
> - Spark 3.1.3 was released on February 18, 2022.
> 
> Committers and PMC:
> 
> - The latest committer was added on July 13th, 2022 (Xinrong Meng).
> - The latest PMC member was added on June 28th, 2022 (Huaxin Gao).
> 
> ===



ASF board report draft for August

2022-08-10 Thread Matei Zaharia
It’s time to submit our quarterly report to the ASF board on Friday. Here is a 
draft, lmk if you have suggestions:

===

Description:

Apache Spark is a fast and general purpose engine for large-scale data
processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
well as a rich set of libraries including stream processing, machine learning,
and graph analytics.

Issues for the board:

- None

Project status:

- Apache Spark was honored to receive the SIGMOD System Award this year, given 
by SIGMOD (the ACM’s data management research organization) to impactful 
real-world and research systems.

- We recently released Apache Spark 3.3.0, a feature release that improves join 
query performance via Bloom filters, increases the Pandas API coverage with the 
support of popular Pandas features such as datetime.timedelta and merge_asof, 
simplifies the migration from traditional data warehouses by improving ANSI SQL 
compliance and supporting dozens of new built-in functions, boosts development 
productivity with better error handling, autocompletion, performance, and 
profiling.
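
A small pyspark.pandas sketch of the two Pandas features called out above, 
merge_asof and datetime.timedelta support, assuming a Spark 3.3 build:

    import datetime
    import pyspark.pandas as ps

    trades = ps.DataFrame({
        "time": [datetime.datetime(2022, 6, 16, 10, 0, s) for s in (1, 3, 8)],
        "price": [3.30, 3.31, 3.32],
    })
    quotes = ps.DataFrame({
        "time": [datetime.datetime(2022, 6, 16, 10, 0, s) for s in (0, 2, 7)],
        "bid": [3.29, 3.30, 3.31],
    })

    # As in pandas: join each trade to the most recent quote at or before its time.
    matched = ps.merge_asof(trades, quotes, on="time")
    print(matched.head())

    # datetime.timedelta values are now a supported column type as well.
    gaps = ps.Series([datetime.timedelta(seconds=s) for s in (1, 2, 5)])
    print(gaps)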

- We released Apache Spark 3.2.2, a bug fix release for the 3.2 line, on July 
17th.

- A Spark Project Improvement Proposal (SPIP) for Spark Connect was voted on 
and accepted. Spark Connect introduces a lightweight client/server API for 
Spark (https://issues.apache.org/jira/browse/SPARK-39375) that will allow 
applications to submit work to a remote Spark cluster without running the 
heavyweight query planner in the client, and will also decouple the client 
version from the server version, making it possible to update Spark without 
updating all the applications.

- We added three new PMC members, Huaxin Gao, Gengliang Wang and Maxim Gekk, in 
June 2022.

- We added a new committer, Xinrong Meng, in July 2022.

Trademarks:

- No changes since the last report.

Latest releases:

- Spark 3.3.0 was released on June 16, 2022.
- Spark 3.2.2 was released on July 17, 2022.
- Spark 3.1.3 was released on February 18, 2022.

Committers and PMC:

- The latest committer was added on July 13th, 2022 (Xinrong Meng).
- The latest PMC member was added on June 28th, 2022 (Huaxin Gao).

===

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPIP] Spark Connect

2022-06-13 Thread Matei Zaharia
+1, very excited about this direction.

Matei

> On Jun 13, 2022, at 11:07 AM, Herman van Hovell 
>  wrote:
> 
> Let me kick off the voting...
> 
> +1
> 
> On Mon, Jun 13, 2022 at 2:02 PM Herman van Hovell  > wrote:
> Hi all,
> 
> I’d like to start a vote for SPIP: "Spark Connect"
> 
> The goal of the SPIP is to introduce a Dataframe based client/server API for 
> Spark
> 
> Please also refer to:
> 
> - Previous discussion in dev mailing list: [DISCUSS] SPIP: Spark Connect - A 
> client and server interface for Apache Spark. 
> 
> - Design doc: Spark Connect - A client and server interface for Apache Spark. 
> 
> - JIRA: SPARK-39375 
> 
> Please vote on the SPIP for the next 72 hours:
> 
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
> 
> Kind Regards,
> Herman



SIGMOD System Award for Apache Spark

2022-05-12 Thread Matei Zaharia
Hi all,

We recently found out that Apache Spark received 
 the SIGMOD System Award this year, 
given by SIGMOD (the ACM’s data management research organization) to impactful 
real-world and research systems. This puts Spark in good company with some very 
impressive previous recipients 
. This award is really 
an achievement by the whole community, so I wanted to say congrats to everyone 
who contributes to Spark, whether through code, issue reports, docs, or other 
means.

Matei

ASF board report draft for May 2022

2022-05-10 Thread Matei Zaharia
Hi all,

It’s time to submit our quarterly ASF board report again this Wednesday. I’ve 
put together the draft below. Let me know if you have any suggestions:

===

Description:

Apache Spark is a fast and general purpose engine for large-scale data
processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
well as a rich set of libraries including stream processing, machine learning,
and graph analytics.

Issues for the board:

- None

Project status:

- We are working on the release of Spark 3.3.0, with Release Candidate 1 
currently being tested and voted on.

- We released Apache Spark 3.1.3, a bug fix release for the 3.1 line, on 
February 18th.

- We started publishing official Docker images of Apache Spark in Docker Hub, 
at https://hub.docker.com/r/apache/spark/tags
 
- A new Spark Project Improvement Proposal (SPIP) is being discussed by the 
community to offer a simplified API for deep learning inference, including 
built-in integration with popular libraries such as Tensorflow, PyTorch and 
HuggingFace (https://issues.apache.org/jira/browse/SPARK-38648).
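
As a rough sketch of the kind of API that SPIP proposes (the predict_batch_udf 
helper used here was added later, in pyspark.ml.functions as of Spark 3.4, so 
treat the exact names as assumptions; the model function is a stand-in for a 
real TensorFlow/PyTorch/HuggingFace model):

    import numpy as np
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark.ml.functions import predict_batch_udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()

    def make_predict_fn():
        # Load the model once per executor; return a function scoring whole batches.
        def predict(inputs: np.ndarray) -> np.ndarray:
            return inputs * 2.0  # placeholder for model.predict(inputs)
        return predict

    score = predict_batch_udf(make_predict_fn, return_type=DoubleType(), batch_size=64)

    df = spark.range(100).withColumn("x", col("id") * 0.1)
    df.withColumn("prediction", score(col("x"))).show(5)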

Trademarks:

- No changes since the last report.

Latest releases:

- Spark 3.1.3 was released on February 18, 2022.
- Spark 3.2.1 was released on January 26, 2022.
- Spark 3.2.0 was released on October 13, 2021.

Committers and PMC:
- The latest committer was added on Dec 20th, 2021 (Yuanjian Li).
- The latest PMC member was added on Jan 19th, 2022 (Maciej Szymkiewicz).

===



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: ASF board report draft for February 2022

2022-02-09 Thread Matei Zaharia
Thanks, good idea.

> On Feb 8, 2022, at 12:25 PM, Mich Talebzadeh  
> wrote:
> 
> Hi,
> 
> I believe it would be beneficial to provide the links to SPIPs mentioned in 
> the report
> 
> - Two Spark Project Improvement Proposals (SPIPs) were recently accepted by 
> the community: namely; 1)  Support for Customized Kubernetes Schedulers 
> <https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg>
>  and 2) Storage Partitioned Join for Data Source V2 
> <https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE/edit#heading=h.82w8qxfl2uwl>
> 
> HTH
> 
>view my Linkedin profile 
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Tue, 8 Feb 2022 at 09:06, Matei Zaharia  <mailto:matei.zaha...@gmail.com>> wrote:
> It’s time to send our quarterly report to the ASF board again this Wednesday. 
> I’ve written the following draft for it — let me know if you want to add or 
> change anything.
> 
> ==
> 
> Description:
> 
> Apache Spark is a fast and general purpose engine for large-scale data
> processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
> well as a rich set of libraries including stream processing, machine learning,
> and graph analytics.
> 
> Issues for the board:
> 
> - None
> 
> Project status:
> 
> - We released Apache Spark 3.2.1, a bug fix release for the 3.2 line, in 
> January.
> 
> - Two Spark Project Improvement Proposals (SPIPs) were recently accepted by 
> the community: Support for Customized Kubernetes Schedulers and Storage 
> Partitioned Join for Data Source V2.
> 
> - We’ve migrated away from Spark’s original Jenkins CI/CD infrastructure, 
> which was graciously hosted by UC Berkeley on their clusters since 2013, to 
> GitHub Actions. Thanks to the Berkeley CS department for hosting this for so 
> long!
> 
> - We added a new committer, Yuanjian Li, in December 2021.
> 
> - We added a new PMC member, Maciej Szymkiewicz, in January 2022.
> 
> Trademarks:
> 
> - No changes since the last report.
> 
> Latest releases:
> 
> - Spark 3.2.1 was released on January 26, 2022.
> - Spark 3.2.0 was released on October 13, 2021.
> - Spark 3.1.2 was released on June 23rd, 2021.
> 
> Committers and PMC:
> - The latest committer was added on Dec 20th, 2021 (Yuanjian Li).
> - The latest PMC member was added on Jan 19th, 2022 (Maciej Szymkiewicz).
> 
> ==
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> <mailto:dev-unsubscr...@spark.apache.org>
> 



ASF board report draft for February 2022

2022-02-08 Thread Matei Zaharia
It’s time to send our quarterly report to the ASF board again this Wednesday. 
I’ve written the following draft for it — let me know if you want to add or 
change anything.

==

Description:

Apache Spark is a fast and general purpose engine for large-scale data
processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
well as a rich set of libraries including stream processing, machine learning,
and graph analytics.

Issues for the board:

- None

Project status:

- We released Apache Spark 3.2.1, a bug fix release for the 3.2 line, in 
January.

- Two Spark Project Improvement Proposals (SPIPs) were recently accepted by the 
community: Support for Customized Kubernetes Schedulers and Storage Partitioned 
Join for Data Source V2.

- We’ve migrated away from Spark’s original Jenkins CI/CD infrastructure, which 
was graciously hosted by UC Berkeley on their clusters since 2013, to GitHub 
Actions. Thanks to the Berkeley CS department for hosting this for so long!

- We added a new committer, Yuanjian Li, in December 2021.

- We added a new PMC member, Maciej Szymkiewicz, in January 2022.

Trademarks:

- No changes since the last report.

Latest releases:

- Spark 3.2.1 was released on January 26, 2022.
- Spark 3.2.0 was released on October 13, 2021.
- Spark 3.1.2 was released on June 23rd, 2021.

Committers and PMC:
- The latest committer was added on Dec 20th, 2021 (Yuanjian Li).
- The latest PMC member was added on Jan 19th, 2022 (Maciej Szymkiewicz).

==
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: ASF board report draft for November

2021-11-10 Thread Matei Zaharia
Sounds good, I’ll fix that.

Matei

> On Nov 9, 2021, at 12:39 AM, Mich Talebzadeh  
> wrote:
> 
> Hi,
> 
> Just a minor modification
> 
> Under Description:
> 
> Apache Spark is a fast and general engine for large-scale data processing.
> 
> It should read
> 
> Apache Spark is a fast and general purpose engine for large-scale data 
> processing. 
> 
> HTH
> 
>view my Linkedin profile 
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Tue, 9 Nov 2021 at 08:06, Matei Zaharia  <mailto:matei.zaha...@gmail.com>> wrote:
> Hi all,
> 
> Our ASF board report needs to be submitted again this Wednesday (November 
> 10). I wrote a draft with the major things that happened in the past three 
> months — let me know if I missed something.
> 
> ===
> 
> Description:
> 
> Apache Spark is a fast and general engine for large-scale data processing. It
> offers high-level APIs in Java, Scala, Python, R and SQL as well as a rich set
> of libraries including stream processing, machine learning, and graph
> analytics.
> 
> Issues for the board:
> 
> - None
> 
> Project status:
> 
> - We recently released Apache Spark 3.2, a feature release that adds several 
> large
>   pieces of functionality. Spark 3.2 includes a new Pandas API for Apache 
> Spark
>   based on the Koalas project, a new push-based shuffle implementation, a more
>   efficient RocksDB state store for Structured Streaming, native support for
>   session windows, error message standardization, and significant improvements
>   to Spark SQL, such as the use of adaptive query execution by default and GA
>   status for the ANSI SQL language mode.
> 
> - We updated the Apache Spark homepage with a new design and more examples.
> 
> - We added one new committer, Chao Sun, in October.
> 
> Trademarks:
> 
> - No changes since the last report.
> 
> Latest releases:
> 
> - Spark 3.2.0 was released on October 13, 2021.
> - Spark 3.1.2 was released on June 23rd, 2021.
> - Spark 3.0.3 was released on June 1st, 2021.
> 
> Committers and PMC:
> 
> - The latest committer was added on November 5th, 2021 (Chao Sun).
> - The latest PMC member was added on June 20th, 2021 (Kousuke Saruta).
> 
> ===
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> <mailto:dev-unsubscr...@spark.apache.org>
> 



ASF board report draft for November

2021-11-09 Thread Matei Zaharia
Hi all,

Our ASF board report needs to be submitted again this Wednesday (November 10). 
I wrote a draft with the major things that happened in the past three months — 
let me know if I missed something.

===

Description:

Apache Spark is a fast and general engine for large-scale data processing. It
offers high-level APIs in Java, Scala, Python, R and SQL as well as a rich set
of libraries including stream processing, machine learning, and graph
analytics.

Issues for the board:

- None

Project status:

- We recently released Apache Spark 3.2, a feature release that adds several 
large
  pieces of functionality. Spark 3.2 includes a new Pandas API for Apache Spark
  based on the Koalas project, a new push-based shuffle implementation, a more
  efficient RocksDB state store for Structured Streaming, native support for
  session windows, error message standardization, and significant improvements
  to Spark SQL, such as the use of adaptive query execution by default and GA
  status for the ANSI SQL language mode.

- We updated the Apache Spark homepage with a new design and more examples.

- We added one new committer, Chao Sun, in October.
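
As a concrete illustration of the session window and RocksDB state store items 
in the Spark 3.2 bullet above, here is a minimal Structured Streaming sketch in 
Scala. It is illustrative only: the socket source, host/port, column names, 
five-minute gap and ten-minute watermark are placeholder choices rather than 
anything prescribed by the release.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, session_window}

object SessionWindowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-3.2-session-window-sketch")
      // Opt in to the RocksDB-backed state store introduced in Spark 3.2
      // (the default provider remains the HDFS-backed one).
      .config("spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
      .getOrCreate()

    // Placeholder source: lines of "user,2021-10-13 12:00:00" over a socket.
    val events = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .selectExpr(
        "split(value, ',')[0] AS user",
        "CAST(split(value, ',')[1] AS TIMESTAMP) AS eventTime")

    // One session per user, closed after a 5-minute gap in activity.
    val sessions = events
      .withWatermark("eventTime", "10 minutes")
      .groupBy(col("user"), session_window(col("eventTime"), "5 minutes"))
      .count()

    sessions.writeStream
      .outputMode("append")   // append output, paired with the watermark above
      .format("console")
      .start()
      .awaitTermination()
  }
}
```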

Trademarks:

- No changes since the last report.

Latest releases:

- Spark 3.2.0 was released on October 13, 2021.
- Spark 3.1.2 was released on June 23rd, 2021.
- Spark 3.0.3 was released on June 1st, 2021.

Committers and PMC:

- The latest committer was added on November 5th, 2021 (Chao Sun).
- The latest PMC member was added on June 20th, 2021 (Kousuke Saruta).

===
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: ASF board report draft for August

2021-08-10 Thread Matei Zaharia
Good point, I’ll make sure to include that.

> On Aug 9, 2021, at 9:20 PM, Mridul Muralidharan  wrote:
> 
> Hi Matei,
> 
>   3.2 will also include support for push-based shuffle (SPIP SPARK-30602).
> 
> Regards,
> Mridul
> 
> On Mon, Aug 9, 2021 at 9:26 PM Hyukjin Kwon  <mailto:gurwls...@gmail.com>> wrote:
> > Which version of the Koalas project are you referring to? 1.8.1?
> 
> Yes, the latest version 1.8.1.
> 
> On Tue, Aug 10, 2021 at 11:07 AM, Igor Costa  <mailto:igorco...@gmail.com>> wrote:
> Hi Matei, nice update
> 
> 
> Just one question, when you mention “ We are working on Spark 3.2.0 as our 
> next release, with a release candidate likely to come soon. Spark 3.2 
> includes a new Pandas API for Apache Spark based on the Koalas project”
> 
> 
> Which version of the Koalas project are you referring to? 1.8.1?
> 
> 
> 
> Cheers
> Igor 
> 
> On Tue, 10 Aug 2021 at 13:31, Matei Zaharia  <mailto:matei.zaha...@gmail.com>> wrote:
> It’s time for our quarterly report to the ASF board, which we need to send 
> out this Wednesday. I wrote the draft below based on community activity — let 
> me know if you’d like to add or change anything:
> 
> ==
> 
> Description:
> 
> Apache Spark is a fast and general engine for large-scale data processing. It 
> offers high-level APIs in Java, Scala, Python, R and SQL as well as a rich 
> set of libraries including stream processing, machine learning, and graph 
> analytics.
> 
> Issues for the board:
> 
> - None
> 
> Project status:
> 
> - We made a number of maintenance releases in the past three months. We 
> released Apache Spark 3.1.2 and 3.0.3 in June as maintenance releases for the 
> 3.x branches. We also released Apache Spark 2.4.8 on May 17 as a bug fix 
> release for the Spark 2.x line. This may be the last release on 2.x unless 
> major new bugs are found.
> 
> - We added three PMC members: Liang-Chi Hsieh, Kousuke Saruta and Takeshi 
> Yamamuro.
> 
> - We are working on Spark 3.2.0 as our next release, with a release candidate 
> likely to come soon. Spark 3.2 includes a new Pandas API for Apache Spark 
> based on the Koalas project, a RocksDB state store for Structured Streaming, 
> native support for session windows, error message standardization, and 
> significant improvements to Spark SQL, such as the use of adaptive query 
> execution by default.
> 
> Trademarks:
> 
> - No changes since the last report.
> 
> Latest releases:
> 
> - Spark 3.1.2 was released on June 23rd, 2021.
> - Spark 3.0.3 was released on June 1st, 2021.
> - Spark 2.4.8 was released on May 17th, 2021.
> 
> Committers and PMC:
> 
> - The latest committers were added on March 11th, 2021 (Attila Zsolt Piros, 
> Gabor Somogyi, Kent Yao, Maciej Szymkiewicz, Max Gekk, and Yi Wu).
> - The latest PMC member was added on June 20th, 2021 (Kousuke Saruta).
> 
> 
> 
> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> <mailto:dev-unsubscr...@spark.apache.org>
> 
> -- 
> Sent from Gmail Mobile



ASF board report draft for August

2021-08-09 Thread Matei Zaharia
It’s time for our quarterly report to the ASF board, which we need to send out 
this Wednesday. I wrote the draft below based on community activity — let me 
know if you’d like to add or change anything:

==

Description:

Apache Spark is a fast and general engine for large-scale data processing. It 
offers high-level APIs in Java, Scala, Python, R and SQL as well as a rich set 
of libraries including stream processing, machine learning, and graph analytics.

Issues for the board:

- None

Project status:

- We made a number of maintenance releases in the past three months. We 
released Apache Spark 3.1.2 and 3.0.3 in June as maintenance releases for the 
3.x branches. We also released Apache Spark 2.4.8 on May 17 as a bug fix 
release for the Spark 2.x line. This may be the last release on 2.x unless 
major new bugs are found.

- We added three PMC members: Liang-Chi Hsieh, Kousuke Saruta and Takeshi 
Yamamuro.

- We are working on Spark 3.2.0 as our next release, with a release candidate 
likely to come soon. Spark 3.2 includes a new Pandas API for Apache Spark based 
on the Koalas project, a RocksDB state store for Structured Streaming, native 
support for session windows, error message standardization, and significant 
improvements to Spark SQL, such as the use of adaptive query execution by 
default.

Trademarks:

- No changes since the last report.

Latest releases:

- Spark 3.1.2 was released on June 23rd, 2021.
- Spark 3.0.3 was released on June 1st, 2021.
- Spark 2.4.8 was released on May 17th, 2021.

Committers and PMC:

- The latest committers were added on March 11th, 2021 (Attila Zsolt Piros, 
Gabor Somogyi, Kent Yao, Maciej Szymkiewicz, Max Gekk, and Yi Wu).
- The latest PMC member was added on June 20th, 2021 (Kousuke Saruta).





-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



ASF board report draft for May

2021-05-10 Thread Matei Zaharia
It’s time for our quarterly report to the ASF board, which we need to submit on 
Wednesday. I’ve put together the following draft based on activity in the 
community — let me know if you’d like to add or change anything:

==

Description:

Apache Spark is a fast and general engine for large-scale data processing. It
offers high-level APIs in Java, Scala, Python, R and SQL as well as a rich set
of libraries including stream processing, machine learning, and graph
analytics.

Issues for the board:

- None

Project status:

- We released Apache Spark 3.1.1, a major update release for the 3.x branch, on 
March 2nd. This release includes updates to improve Python usability and error 
messages, ANSI SQL support, the streaming UI, and support for running Apache 
Spark on Kubernetes, which is now marked GA. Overall, the release includes 
about 1500 patches.

- We are voting on an Apache Spark 2.4.8 bug fix release for the Spark 2.x 
line. This may be the last release on 2.x.

- We added six new committers to the project: Attila Zsolt Piros, Gabor 
Somogyi, Kent Yao, Maciej Szymkiewicz, Max Gekk, and Yi Wu.

- Several SPIPs (major project improvement proposals) were voted on and 
accepted, including adding a Function Catalog in Spark SQL and adding a Pandas 
API layer for PySpark based on the Koalas project. We’ve also started an effort 
to standardize error message reporting in Apache Spark 
(https://spark.apache.org/error-message-guidelines.html) so that messages are 
easier to understand and users can quickly figure out how to fix them.

Trademarks:

- No changes since the last report.

Latest releases:

- Spark 3.1.1 was released on March 2nd, 2021.
- Spark 3.0.2 was released on February 19th, 2021.
- Spark 2.4.7 was released on September 12th, 2020.

Committers and PMC:

- The latest committers were added on March 11th, 2021 (Attila Zsolt Piros, 
Gabor Somogyi, Kent Yao, Maciej Szymkiewicz, Max Gekk, and Yi Wu).
- The latest PMC member was added on Sept 4th, 2019 (Dongjoon Hyun). The PMC 
has been discussing some new PMC candidates.
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-28 Thread Matei Zaharia
+1

Matei

> On Mar 28, 2021, at 1:45 AM, Gengliang Wang  wrote:
> 
> +1 (non-binding)
> 
> On Sun, Mar 28, 2021 at 11:12 AM Mridul Muralidharan  > wrote:
> +1
> 
> Regards,
> Mridul 
> 
> On Sat, Mar 27, 2021 at 6:09 PM Xiao Li  > wrote:
> +1 
> 
> Xiao
> 
> Takeshi Yamamuro mailto:linguin@gmail.com>> 
> wrote on Fri, Mar 26, 2021 at 4:14 PM:
> +1 (non-binding)
> 
> On Sat, Mar 27, 2021 at 4:53 AM Liang-Chi Hsieh  > wrote:
> +1 (non-binding)
> 
> 
> rxin wrote
> > +1. Would open up a huge persona for Spark.
> > 
> > On Fri, Mar 26 2021 at 11:30 AM, Bryan Cutler  wrote:
> > 
> >> +1 (non-binding)
> >> 
> >> On Fri, Mar 26, 2021 at 9:49 AM Maciej  wrote:
> >> 
> >>> +1 (nonbinding)
> 
> 
> 
> 
> 
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ 
> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> 
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro



Welcoming six new Apache Spark committers

2021-03-26 Thread Matei Zaharia
Hi all,

The Spark PMC recently voted to add several new committers. Please join me in 
welcoming them to their new role! Our new committers are:

- Maciej Szymkiewicz (contributor to PySpark)
- Max Gekk (contributor to Spark SQL)
- Kent Yao (contributor to Spark SQL)
- Attila Zsolt Piros (contributor to decommissioning and Spark on Kubernetes)
- Yi Wu (contributor to Spark Core and SQL)
- Gabor Somogyi (contributor to Streaming and security)

All six of them contributed to Spark 3.1 and we’re very excited to have them 
join as committers.

Matei and the Spark PMC
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



ASF board report for February 2021

2021-02-08 Thread Matei Zaharia
It’s time to prepare our quarterly ASF board report, which we need to submit on 
Feb 10th. The last one was in November. I’ve written a draft here, but let me 
know if you want to add any more content that I’ve missed.

==

Apache Spark is a fast and general engine for large-scale data processing. It 
offers high-level APIs in Java, Scala, Python, R and SQL as well as a rich set 
of libraries including stream processing, machine learning, and graph analytics.

Project status:

- The community is close to finalizing the first Spark 3.1.x release, which 
will be Spark 3.1.1. There was a problem with our release candidate packaging 
scripts that caused us to accidentally publish a 3.1.0 version to Maven Central 
before it was ready, so we’ve deleted that and will not use that version 
number. Several release candidates for 3.1.1 have gone out to the dev mailing 
list and we’re tracking the last remaining issues.

- Several proposals for significant new features are being discussed on the dev 
mailing list, including a function catalog for Spark SQL, a RocksDB based state 
store for streaming applications, and public APIs for creating user-defined 
types (UDTs) in Spark SQL. We would welcome feedback on these from interested 
community members.

Trademarks:

- No changes since the last report.

Latest releases:

- Spark 2.4.7 was released on September 12th, 2020.
- Spark 3.0.1 was released on September 8th, 2020.
- Spark 3.0.0 was released on June 18th, 2020.

Committers and PMC:

- The latest committers were added on July 14th, 2020 (Huaxin Gao, Jungtaek
 Lim and Dilip Biswal).
- The latest PMC member was added on Sept 4th, 2019 (Dongjoon Hyun). The PMC
 has been discussing some new PMC candidates.

==
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Draft ASF board report for November

2020-11-10 Thread Matei Zaharia
Hi all,

It’s time to send in our quarterly ASF board report on Nov 11, so I wanted to 
include anything notable going on that we want to appear in the board archive. 
Here is my draft; let me know if you have suggested changes.

===

Apache Spark is a fast and general engine for large-scale data processing. It 
offers high-level APIs in Java, Scala, Python, R and SQL as well as a rich set 
of libraries including stream processing, machine learning, and graph analytics.

Project status:

- We released Apache Spark 3.0.1 on September 8th and Spark 2.4.7 on September 
12th as maintenance releases with bug fixes to these two branches.

- The community is working on a number of new features in the Spark 3.x branch, 
including improved data catalog APIs, a push-based shuffle implementation, and 
better error messages to make Spark applications easier to debug. The largest 
changes are being discussed as SPIPs on our mailing list.

Trademarks:

- One of the two software projects we reached out to in July to change its name 
due to a trademark issue has changed it. We are still waiting for a reply from 
the other one, but it may be that development there has stopped.

Latest releases:

- Spark 2.4.7 was released on September 12th, 2020.
- Spark 3.0.1 was released on September 8th, 2020.
- Spark 3.0.0 was released on June 18th, 2020.

Committers and PMC:

- The latest committers were added on July 14th, 2020 (Huaxin Gao, Jungtaek Lim 
and Dilip Biswal).
- The latest PMC member was added on Sept 4th, 2019 (Dongjoon Hyun). The PMC 
has been discussing some new candidates to add as PMC members.
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Standardize Spark Exception Messages SPIP

2020-11-05 Thread Matei Zaharia
+1

Matei

> On Nov 5, 2020, at 10:25 AM, EveLiao  wrote:
> 
> +1 
> Thanks!
> 
> 
> 
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



ASF board report draft for August

2020-08-10 Thread Matei Zaharia
Hi all,

Our quarterly project board report needs to be submitted on August 12th, and I 
wanted to include anything notable going on that we want to appear in the board 
archive. Here is my draft below; let me know if you have suggested changes.

===

Apache Spark is a fast and general engine for large-scale data processing. It 
offers high-level APIs in Java, Scala, Python, R and SQL as well as a rich set 
of libraries including stream processing, machine learning, and graph analytics.

Project status:

- We released Apache Spark 3.0.0 on June 18th, 2020. This was our largest 
release yet, containing over 3400 patches from the community, including 
significant improvements to SQL performance, ANSI SQL compatibility, Python 
APIs, SparkR performance, error reporting and monitoring tools. This release 
also enhances Spark’s job scheduler to support adaptive execution (changing 
query plans at runtime to reduce the need for configuration) and workloads that 
need hardware accelerators.

- We released Apache Spark 2.4.6 on June 5th, 2020 with bug fixes to the 2.4 
line.

- The community is working on 3.0.1 and 2.4.7 releases with bug fixes to these 
two branches.

- We had a discussion on the dev list about clarifying our process for handling 
-1 votes on patches, which will go into updated guidelines on our website.

- We added three committers to the project since the last report: Huaxin Gao, 
Jungtaek Lim and Dilip Biswal.
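
A small aside on the adaptive execution item in the 3.0.0 bullet above: in 
Spark 3.0 the feature ships disabled by default, so trying it out is a one-line 
configuration change. The sketch below is illustrative only; it assumes the 
`spark` session that spark-shell predefines, and the toy aggregation is made up 
for the example.

```scala
// spark-shell-style sketch (Spark 3.0): turn on adaptive query execution,
// which is off by default in this release, then run a toy aggregation.
spark.conf.set("spark.sql.adaptive.enabled", "true")

import spark.implicits._

// With AQE on, the physical plan for this aggregation can be re-optimized at
// runtime from observed shuffle statistics, for example by coalescing small
// post-shuffle partitions.
spark.range(0L, 1000000L)
  .groupBy(($"id" % 100).as("bucket"))
  .count()
  .show()
```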

Trademarks:

- We engaged with two companies that had created products with “Spark” in the 
name to ask them to follow our trademark guidelines.

Latest releases:

- Spark 3.0.0 was released on June 18th, 2020.
- Spark 2.4.6 was released on June 5th, 2020.
- Spark 2.4.5 was released on Feb 8th, 2020.

Committers and PMC:

- The latest PMC member was added on Sept 4th, 2019 (Dongjoon Hyun).
- The latest committers were added on July 14th, 2020 (Huaxin Gao, Jungtaek Lim 
and Dilip Biswal).
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Welcoming some new Apache Spark committers

2020-07-14 Thread Matei Zaharia
Hi all,

The Spark PMC recently voted to add several new committers. Please join me in 
welcoming them to their new roles! The new committers are:

- Huaxin Gao
- Jungtaek Lim
- Dilip Biswal

All three of them contributed to Spark 3.0 and we’re excited to have them join 
the project.

Matei and the Spark PMC
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Removing references to slave (and maybe in the future master)

2020-06-18 Thread Matei Zaharia
Yup, it would be great to do this. FWIW, I would propose using “worker” 
everywhere instead unless it already means something in that context, just to 
have a single word for this (instead of multiple words such as agent, replica, 
etc), but I haven’t looked into whether that would make anything confusing.

> On Jun 18, 2020, at 1:14 PM, Holden Karau  wrote:
> 
> Thank you. I agree being careful with API comparability is important. I think 
> in situations where the terms are exposed in our API we can introduce 
> alternatives and deprecate the old ones to allow for a smooth migration.
> 
> On Thu, Jun 18, 2020 at 12:28 PM Reynold Xin  > wrote:
> Thanks for doing this. I think this is a great thing to do.
> 
> But we gotta be careful with API compatibility.
> 
> 
> On Thu, Jun 18, 2020 at 11:32 AM, Holden Karau  > wrote:
> Hi Folks,
> 
> I've started working on cleaning up the Spark code to remove references to 
> slave since the word has a lot of negative connotations and we can generally 
> replace it with more accurate/descriptive words in our code base. The PR is 
> at https://github.com/apache/spark/pull/28864 
>  (I'm a little uncertain about the 
> places where I chose the name "AgentLost" as the replacement; suggestions 
> welcome).
> 
> At some point I think we should explore deprecating master as well, but that 
> is used very broadly inside of our code and in our APIs, so while it is 
> visible to more people changing it would be more work. I think having 
> consensus around removing slave though is a good first step.
> 
> Cheers,
> 
> Holden
> 
> -- 
> Twitter: https://twitter.com/holdenkarau 
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
>  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau 
> 
> -- 
> Twitter: https://twitter.com/holdenkarau 
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
>  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau 
> 


Re: [vote] Apache Spark 3.0 RC3

2020-06-09 Thread Matei Zaharia
Congrats! Excited to see the release posted soon.

> On Jun 9, 2020, at 6:39 PM, Reynold Xin  wrote:
> 
> 
> I waited another day to account for the weekend. This vote passes with the 
> following +1 votes and no -1 votes!
> 
> I'll start the release prep later this week.
> 
> +1:
> Reynold Xin (binding)
> Prashant Sharma (binding)
> Gengliang Wang
> Sean Owen (binding)
> Mridul Muralidharan (binding)
> Takeshi Yamamuro
> Maxim Gekk
> Matei Zaharia (binding)
> Jungtaek Lim
> Denny Lee
> Russell Spitzer
> Dongjoon Hyun (binding)
> DB Tsai (binding)
> Michael Armbrust (binding)
> Tom Graves (binding)
> Bryan Cutler
> Huaxin Gao
> Jiaxin Shan
> Xingbo Jiang
> Xiao Li (binding)
> Hyukjin Kwon (binding)
> Kent Yao
> Wenchen Fan (binding)
> Shixiong Zhu (binding)
> Burak Yavuz
> Tathagata Das (binding)
> Ryan Blue
> 
> -1: None
> 
> 
> 
>> On Sat, Jun 06, 2020 at 1:08 PM, Reynold Xin  wrote:
>> Please vote on releasing the following candidate as Apache Spark version 
>> 3.0.0.
>> 
>> The vote is open until [DUE DAY] and passes if a majority +1 PMC votes are 
>> cast, with a minimum of 3 +1 votes.
>> 
>> [ ] +1 Release this package as Apache Spark 3.0.0
>> [ ] -1 Do not release this package because ...
>> 
>> To learn more about Apache Spark, please see http://spark.apache.org/
>> 
>> The tag to be voted on is v3.0.0-rc3 (commit 
>> 3fdfce3120f307147244e5eaf46d61419a723d50):
>> https://github.com/apache/spark/tree/v3.0.0-rc3
>> 
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
>> 
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> 
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1350/
>> 
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/
>> 
>> The list of bug fixes going into 3.0.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>> 
>> This release is using the release script of the tag v3.0.0-rc3.
>> 
>> FAQ
>> 
>> =
>> How can I help test this release?
>> =
>> 
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>> 
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>> 
>> ===
>> What should happen to JIRA tickets still targeting 3.0.0?
>> ===
>> 
>> The current list of open tickets targeted at 3.0.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> Version/s" = 3.0.0
>> 
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>> 
>> ==
>> But my bug isn't fixed?
>> ==
>> 
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
> 
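
For anyone testing the RC from a JVM project, here is a minimal sketch of the 
resolver step described in the FAQ above, assuming an sbt build. The resolver 
name and the spark-sql dependency are illustrative choices; the staging URL is 
the one listed for this RC, and the assumption that the staged artifacts carry 
the plain 3.0.0 version should be double-checked against the repository.

```scala
// build.sbt sketch: point a test project at the v3.0.0-rc3 staging repository.
resolvers += "Apache Spark 3.0.0 RC3 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1350/"

// Illustrative dependency; assuming the staged RC artifacts carry the plain
// 3.0.0 version, the usual coordinates resolve once the resolver is added.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0" % "provided"
```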


Re: [vote] Apache Spark 3.0 RC3

2020-06-07 Thread Matei Zaharia
+1

Matei

> On Jun 7, 2020, at 6:53 AM, Maxim Gekk  wrote:
> 
> +1 (non-binding)
> 
> On Sun, Jun 7, 2020 at 2:34 PM Takeshi Yamamuro  > wrote:
> +1 (non-binding)
> 
> I don't see any ongoing PR to fix critical bugs in my area.
> Bests,
> Takeshi
> 
> On Sun, Jun 7, 2020 at 7:24 PM Mridul Muralidharan  > wrote:
> +1
> 
> Regards,
> Mridul
> 
> On Sat, Jun 6, 2020 at 1:20 PM Reynold Xin  > wrote:
> Apologies for the mistake. The vote is open till 11:59pm Pacific time on Mon 
> June 9th. 
> 
> On Sat, Jun 6, 2020 at 1:08 PM Reynold Xin  > wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 3.0.0.
> 
> The vote is open until [DUE DAY] and passes if a majority +1 PMC votes are 
> cast, with a minimum of 3 +1 votes.
> 
> [ ] +1 Release this package as Apache Spark 3.0.0
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see http://spark.apache.org/ 
> 
> 
> The tag to be voted on is v3.0.0-rc3 (commit 
> 3fdfce3120f307147244e5eaf46d61419a723d50):
> https://github.com/apache/spark/tree/v3.0.0-rc3 
> 
> 
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/ 
> 
> 
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS 
> 
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1350/ 
> 
> 
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/ 
> 
> 
> The list of bug fixes going into 3.0.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12339177 
> 
> 
> This release is using the release script of the tag v3.0.0-rc3.
> 
> FAQ
> 
> =
> How can I help test this release?
> =
> 
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
> 
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
> 
> ===
> What should happen to JIRA tickets still targeting 3.0.0?
> ===
> 
> The current list of open tickets targeted at 3.0.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK 
>  and search for "Target 
> Version/s" = 3.0.0
> 
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
> 
> ==
> But my bug isn't fixed?
> ==
> 
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
> 
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro



ASF board report draft for May

2020-05-11 Thread Matei Zaharia
Hi all,

Our quarterly project board report needs to be submitted on May 13th, and I 
wanted to include anything notable going on that we want to appear in the board 
archive. Here is my draft below — let me know if you have suggested changes.

===

Apache Spark is a fast and general engine for large-scale data processing. It 
offers high-level APIs in Java, Scala, Python and R as well as a rich set of 
libraries including stream processing, machine learning, and graph analytics.

Project status:

- Progress is continuing on the upcoming Apache Spark 3.0 release, with the 
first votes on release candidates. This will be a major release with various 
API and SQL language updates, so we’ve tried to solicit broad input on it 
through two preview releases and a lot of JIRA and mailing list discussion.

- The community is also voting on a release candidate for Apache Spark 2.4.6, 
bringing bug fixes to the 2.4 branch.

Trademarks:

- Nothing new to report in the past 3 months.

Latest releases:

- Spark 2.4.5 was released on Feb 8th, 2020.
- Spark 3.0.0-preview2 was released on Dec 23rd, 2019.
- Spark 3.0.0-preview was released on Nov 6th, 2019.
- Spark 2.3.4 was released on Sept 9th, 2019.

Committers and PMC:

- The latest PMC member was added on Sept 4th, 2019 (Dongjoon Hyun).
- The latest committer was added on Sept 9th, 2019 (Weichen Xu). We also added
Ryan Blue, L.C. Hsieh, Gengliang Wang, Yuming Wang and Ruifeng Zheng as
committers in the past three months.
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-09 Thread Matei Zaharia
+1 as well.

Matei

> On Mar 9, 2020, at 12:05 AM, Wenchen Fan  wrote:
> 
> +1 (binding), assuming that this is for public stable APIs, not APIs that are 
> marked as unstable, evolving, etc.
> 
> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía  > wrote:
> +1 (non-binding)
> 
> Michael's section on the trade-offs of maintaining / removing an API is one 
> of the best reads I have seen on this mailing list. Enthusiastic +1
> 
> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun  > wrote:
> >
> > This new policy has a good intention, but can we narrow it down to the 
> > migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
> >
> > I saw that there already exists a reverting PR to bring back Spark 1.4 and 
> > 1.5 APIs based on this AS-IS suggestion.
> >
> > The AS-IS policy clearly mentions the JVM/Scala-level difficulty, and 
> > that's nice.
> >
> > However, for the other cases, it sounds like `recommending older APIs as 
> > much as possible` due to the following.
> >
> >  > How long has the API been in Spark?
> >
> > We had better be more careful when we add a new policy, and should aim not 
> > to mislead users and 3rd-party library developers into thinking "older is 
> > better".
> >
> > Technically, I'm wondering who will use new APIs in their examples (in 
> > books and on StackOverflow) if they always need to add a warning like 
> > `this only works on 2.4.0+`.
> >
> > Bests,
> > Dongjoon.
> >
> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan  > > wrote:
> >>
> >> I am in broad agreement with the proposal; like any developer, I prefer
> >> stable, well-designed APIs :-)
> >>
> >> Can we tie the proposal to stability guarantees given by spark and
> >> reasonable expectation from users ?
> >> In my opinion, an unstable or evolving API could change, while an
> >> experimental API which has been around for ages should be handled more
> >> conservatively.
> >> Which brings in question what are the stability guarantees as
> >> specified by annotations interacting with the proposal.
> >>
> >> Also, can we expand on 'when' an API change can occur ?  Since we are
> >> proposing to diverge from semver.
> >> Patch release ? Minor release ? Only major release ? Based on 'impact'
> >> of API ? Stability guarantees ?
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >>
> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust  >> > wrote:
> >> >
> >> > I'll start off the vote with a strong +1 (binding).
> >> >
> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust  >> > > wrote:
> >> >>
> >> >> I propose to add the following text to Spark's Semantic Versioning 
> >> >> policy and adopt it as the rubric that should be used when deciding to 
> >> >> break APIs (even at major versions such as 3.0).
> >> >>
> >> >>
> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a 
> >> >> procedural vote, the measure will pass if there are more favourable 
> >> >> votes than unfavourable ones. PMC votes are binding, but the community 
> >> >> is encouraged to add their voice to the discussion.
> >> >>
> >> >>
> >> >> [ ] +1 - Spark should adopt this policy.
> >> >>
> >> >> [ ] -1  - Spark should not adopt this policy.
> >> >>
> >> >>
> >> >> 
> >> >>
> >> >>
> >> >> Considerations When Breaking APIs
> >> >>
> >> >> The Spark project strives to avoid breaking APIs or silently changing 
> >> >> behavior, even at major versions. While this is not always possible, 
> >> >> the balance of the following factors should be considered before 
> >> >> choosing to break an API.
> >> >>
> >> >>
> >> >> Cost of Breaking an API
> >> >>
> >> >> Breaking an API almost always has a non-trivial cost to the users of 
> >> >> Spark. A broken API means that Spark programs need to be rewritten 
> >> >> before they can be upgraded. However, there are a few considerations 
> >> >> when thinking about what the cost will be:
> >> >>
> >> >> Usage - an API that is actively used in many different places, is 
> >> >> always very costly to break. While it is hard to know usage for sure, 
> >> >> there are a bunch of ways that we can estimate:
> >> >>
> >> >> How long has the API been in Spark?
> >> >>
> >> >> Is the API common even for basic programs?
> >> >>
> >> >> How often do we see recent questions in JIRA or mailing lists?
> >> >>
> >> >> How often does it appear in StackOverflow or blogs?
> >> >>
> >> >> Behavior after the break - How will a program that works today, work 
> >> >> after the break? The following are listed roughly in order of 
> >> >> increasing severity:
> >> >>
> >> >> Will there be a compiler or linker error?
> >> >>
> >> >> Will there be a runtime exception?
> >> >>
> >> >> Will that exception happen after significant processing has been done?
> >> >>
> >> >> Will we silently return different answers? (very hard to debug, might 
> >> >> not even notice!)
> >> >>
> >> 

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-27 Thread Matei Zaharia
+1 on this new rubric. It definitely captures the issues I’ve seen in Spark and 
in other projects. If we write down this rubric (or something like it), it will 
also be easier to refer to it during code reviews or in proposals of new APIs 
(we could ask “do you expect to have to change this API in the future, and if 
so, how”).

Matei

> On Feb 24, 2020, at 3:02 PM, Michael Armbrust  wrote:
> 
> Hello Everyone,
> 
> As more users have started upgrading to Spark 3.0 preview (including myself), 
> there have been many discussions around APIs that have been broken compared 
> with Spark 2.x. In many of these discussions, one of the rationales for 
> breaking an API seems to be "Spark follows semantic versioning 
> , so this major release is 
> our chance to get it right [by breaking APIs]". Similarly, in many cases the 
> response to questions about why an API was completely removed has been, "this 
> API has been deprecated since x.x, so we have to remove it".
> 
> As a long time contributor to and user of Spark this interpretation of the 
> policy is concerning to me. This reasoning misses the intention of the 
> original policy, and I am worried that it will hurt the long-term success of 
> the project.
> 
> I definitely understand that these are hard decisions, and I'm not proposing 
> that we never remove anything from Spark. However, I would like to give some 
> additional context and also propose a different rubric for thinking about API 
> breakage moving forward.
> 
> Spark adopted semantic versioning back in 2014 during the preparations for 
> the 1.0 release. As this was the first major release -- and as, up until 
> fairly recently, Spark had only been an academic project -- no real promises 
> had been made about API stability ever.
> 
> During the discussion, some committers suggested that this was an opportunity 
> to clean up cruft and give the Spark APIs a once-over, making cosmetic 
> changes to improve consistency. However, in the end, it was decided that in 
> many cases it was not in the best interests of the Spark community to break 
> things just because we could. Matei actually said it pretty forcefully 
> :
> 
> I know that some names are suboptimal, but I absolutely detest breaking APIs, 
> config names, etc. I’ve seen it happen way too often in other projects (even 
> things we depend on that are officially post-1.0, like Akka or Protobuf or 
> Hadoop), and it’s very painful. I think that we as fairly cutting-edge users 
> are okay with libraries occasionally changing, but many others will consider 
> it a show-stopper. Given this, I think that any cosmetic change now, even 
> though it might improve clarity slightly, is not worth the tradeoff in terms 
> of creating an update barrier for existing users.
> 
> In the end, while some changes were made, most APIs remained the same and 
> users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think this 
> served the project very well, as compatibility means users are able to 
> upgrade and we keep as many people on the latest versions of Spark (though 
> maybe not the latest APIs of Spark) as possible.
> 
> As Spark grows, I think compatibility actually becomes more important and we 
> should be more conservative rather than less. Today, there are very likely 
> more Spark programs running than there were at any other time in the past. 
> Spark is no longer a tool only used by advanced hackers, it is now also 
> running "traditional enterprise workloads.'' In many cases these jobs are 
> powering important processes long after the original author leaves.
> 
> Broken APIs can also affect libraries that extend Spark. This dependency can 
> be even harder for users, as if the library has not been upgraded to use new 
> APIs and they need that library, they are stuck.
> 
> Given all of this, I'd like to propose the following rubric as an addition to 
> our semantic versioning policy. After discussion and if people agree this is 
> a good idea, I'll call a vote of the PMC to ratify its inclusion in the 
> official policy.
> 
> Considerations When Breaking APIs
> The Spark project strives to avoid breaking APIs or silently changing 
> behavior, even at major versions. While this is not always possible, the 
> balance of the following factors should be considered before choosing to 
> break an API.
> 
> Cost of Breaking an API
> Breaking an API almost always has a non-trivial cost to the users of Spark. A 
> broken API means that Spark programs need to be rewritten before they can be 
> upgraded. However, there are a few considerations when thinking about what 
> the cost will be:
> Usage - an API that is actively used in many different places, is always very 
> costly to break. While it is hard to know usage for sure, there are a bunch 
> of ways that we can estimate: 

ASF board report draft for February

2020-02-09 Thread Matei Zaharia
Hi all,

Our project board report needs to be submitted on Feb 12th, and I wanted to 
include anything notable going on that we want to appear in the board archive. 
Here is my draft below — let me know if you have suggestions to add or change 
things.

===

Apache Spark is a fast and general engine for large-scale data processing. It 
offers high-level APIs in Java, Scala, Python and R as well as a rich set of 
libraries including stream processing, machine learning, and graph analytics.

Project status:

- We have cut a release branch for Apache Spark 3.0, which is now undergoing 
testing and bug fixes before the final release. In December, we also published 
a new preview release for the 3.0 branch that the community can use to test and 
give feedback: https://spark.apache.org/news/spark-3.0.0-preview2.html. Spark 
3.0 includes a range of new features and dependency upgrades (e.g. Java 11) but 
remains largely compatible with Spark’s current API.

- We published Apache Spark 2.4.5 on Feb 8th with bug fixes for the 2.4 branch 
of Spark.

Trademarks:

- Nothing new to report in the past 3 months.

Latest releases:

- Spark 2.4.5 was released on Feb 8th, 2020.
- Spark 3.0.0-preview2 was released on Dec 23rd, 2019.
- Spark 3.0.0-preview was released on Nov 6th, 2019.
- Spark 2.3.4 was released on Sept 9th, 2019.

Committers and PMC:

- The latest PMC member was added on Sept 4th, 2019 (Dongjoon Hyun).
- The latest committer was added on Sept 9th, 2019 (Weichen Xu). We also added
 Ryan Blue, L.C. Hsieh, Gengliang Wang, Yuming Wang and Ruifeng Zheng as
 committers in the past three months.
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Why Apache Spark doesn't use Calcite?

2020-01-13 Thread Matei Zaharia
I’m pretty sure that Catalyst was built before Calcite, or at least in 
parallel. Calcite 1.0 was only released in 2015. From a technical standpoint, 
building Catalyst in Scala also made it more concise and easier to extend than 
an optimizer written in Java (you can find various presentations about how 
Catalyst works).
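
To make that extensibility point a bit more concrete, here is a small 
spark-shell-style sketch of what a user-defined Catalyst optimizer rule can 
look like in Scala. The rule is invented for illustration (Spark's built-in 
optimizer already performs this kind of simplification); the point is just how 
compact a tree rewrite is with pattern matching, and that it can be plugged in 
through the experimental hook without modifying Spark. It assumes the 
predefined `spark` session.

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

// Toy rule: drop filters whose condition is the literal TRUE.
object RemoveTrivialTrueFilter extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    case Filter(Literal(true, BooleanType), child) => child
  }
}

// Register the extra rule with the active session's optimizer.
spark.experimental.extraOptimizations ++= Seq(RemoveTrivialTrueFilter)
```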

Matei

> On Jan 13, 2020, at 8:41 AM, Michael Mior  wrote:
> 
> It's fairly common for adapters (Calcite's abstraction of a data
> source) to push down predicates. However, the API certainly looks a
> lot different than Catalyst's.
> --
> Michael Mior
> mm...@apache.org
> 
> Le lun. 13 janv. 2020 à 09:45, Jason Nerothin
>  a écrit :
>> 
>> The implementation they chose supports push down predicates, Datasets and 
>> other features that are not available in Calcite:
>> 
>> https://databricks.com/glossary/catalyst-optimizer
>> 
>> On Mon, Jan 13, 2020 at 8:24 AM newroyker  wrote:
>>> 
>>> Was there a qualitative or quantitative benchmark done before a design
>>> decision was made not to use Calcite?
>>> 
>>> Are there limitations (for heuristic based, cost based, * aware optimizer)
>>> in Calcite, and frameworks built on top of Calcite? In the context of big
> >>> data / TPC-H benchmarks.
>>> 
>>> I was unable to dig up anything concrete from user group / Jira. Appreciate
>>> if any Catalyst veteran here can give me pointers. Trying to defend
>>> Spark/Catalyst.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>> 
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> 
>> 
>> 
>> --
>> Thanks,
>> Jason
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 3.0 preview release 2?

2019-12-09 Thread Matei Zaharia
Yup, it would be great to release these more often.

> On Dec 9, 2019, at 4:25 PM, Takeshi Yamamuro  wrote:
> 
> +1; Looks great if we can in terms of user's feedbacks.
> 
> Bests,
> Takeshi
> 
> On Tue, Dec 10, 2019 at 3:14 AM Dongjoon Hyun  > wrote:
> Thank you, All.
> 
> +1 for another `3.0-preview`.
> 
> Also, thank you Yuming for volunteering for that!
> 
> Bests,
> Dongjoon.
> 
> 
> On Mon, Dec 9, 2019 at 9:39 AM Xiao Li  > wrote:
> When entering the official release candidates, the new features have to be 
> disabled or even reverted [if the conf is not available] if the fixes are not 
> trivial; otherwise, we might need 10+ RCs to make the final release. The new 
> features should not block the release based on the previous discussions. 
> 
> I agree we should have code freeze at the beginning of 2020. The preview 
> releases should not block the official releases. The preview is just to 
> collect more feedback about these new features or behavior changes.
> 
> Also, for the release of Spark 3.0, we still need the Hive community to do us 
> a favor to release 2.3.7 for having HIVE-22190 
> . Before asking Hive 
> community to do 2.3.7 release, if possible, we want our Spark community to 
> have more tries, especially the support of JDK 11 on Hadoop 2.7 and 3.2, 
> which is based on Hive 2.3 execution JAR. During the preview stage, we might 
> find more issues that are not covered by our test cases.
> 
>  
> 
> On Mon, Dec 9, 2019 at 4:55 AM Sean Owen  > wrote:
> Seems fine to me of course. Honestly that wouldn't be a bad result for
> a release candidate, though we would probably roll another one now.
> How about simply moving to a release candidate? If not now then at
> least move to code freeze from the start of 2020. There is also some
> downside in pushing out the 3.0 release further with previews.
> 
> On Mon, Dec 9, 2019 at 12:32 AM Xiao Li  > wrote:
> >
> > I got many great feedbacks from the community about the recent 3.0 preview 
> > release. Since the last 3.0 preview release, we already have 353 commits 
> > [https://github.com/apache/spark/compare/v3.0.0-preview...master 
> > ]. There 
> > are various important features and behavior changes we want the community 
> > to try before entering the official release candidates of Spark 3.0.
> >
> >
> > Below is my selected items that are not part of the last 3.0 preview but 
> > already available in the upstream master branch:
> >
> > Support JDK 11 with Hadoop 2.7
> > Spark SQL will respect its own default format (i.e., parquet) when users do 
> > CREATE TABLE without USING or STORED AS clauses
> > Enable Parquet nested schema pruning and nested pruning on expressions by 
> > default
> > Add observable Metrics for Streaming queries
> > Column pruning through nondeterministic expressions
> > RecordBinaryComparator should check endianness when compared by long
> > Improve parallelism for local shuffle reader in adaptive query execution
> > Upgrade Apache Arrow to version 0.15.1
> > Various interval-related SQL support
> > Add a mode to pin Python thread into JVM's
> > Provide option to clean up completed files in streaming query
> >
> > I am wondering if we can have another preview release for Spark 3.0? This 
> > can help us find the design/API defects as early as possible and avoid the 
> > significant delay of the upcoming Spark 3.0 release
> >
> >
> > Also, any committer is willing to volunteer as the release manager of the 
> > next preview release of Spark 3.0, if we have such a release?
> >
> >
> > Cheers,
> >
> >
> > Xiao
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> 
> 
> 
> 
> -- 
>   
> 
> 
> -- 
> ---
> Takeshi Yamamuro



Re: ASF board report for November 2019

2019-11-12 Thread Matei Zaharia
Oops, sorry about the typo there; I’ll correct that.

> On Nov 12, 2019, at 12:43 AM, ruifengz  wrote:
> 
> nit: Ruifeng Zhang as committers in the past three months. <- Ruifeng Zheng
> 
> ☺Thanks
> 
> On 11/12/19 3:54 PM, Matei Zaharia wrote:
>> Good catch, thanks.
>> 
>>> On Nov 11, 2019, at 6:46 PM, Jungtaek Lim >> <mailto:kabhwan.opensou...@gmail.com>> wrote:
>>> 
>>> nit: - The latest committer was added on Sept 4th, 2019 (Dongjoon Hyun). <= 
>>> s/committer/PMC member
>>> 
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>> 
>>> On Tue, Nov 12, 2019 at 11:38 AM Matei Zaharia >> <mailto:matei.zaha...@gmail.com>> wrote:
>>> Hi all,
>>> 
>>> It’s time to send our quarterly report to the ASF board. Here is my draft — 
>>> please feel free to suggest any changes.
>>> 
>>> 
>>> 
>>> Apache Spark is a fast and general engine for large-scale data processing. 
>>> It
>>> offers high-level APIs in Java, Scala, Python and R as well as a rich set of
>>> libraries including stream processing, machine learning, and graph 
>>> analytics.
>>> 
>>> Project status:
>>> 
>>> - We made the first preview release for Spark 3.0 on November 6th. This
>>>   release aims to get early feedback on the new APIs and functionality
>>>   targeting Spark 3.0 but does not provide API or stability guarantees. We
>>>   encourage community members to try this release and leave feedback on
>>>   JIRA. More info about what’s new and how to report feedback is found at
>>>   https://spark.apache.org/news/spark-3.0.0-preview.html 
>>> <https://spark.apache.org/news/spark-3.0.0-preview.html>.
>>> 
>>> - We published Spark 2.4.4. and 2.3.4 as maintenance releases to fix bugs
>>>   in the 2.4 and 2.3 branches.
>>> 
>>> - We added one new PMC member and six committers to the project
>>>   in August and September, covering data sources, streaming, SQL, ML
>>>   and other components of the project.
>>> 
>>> Trademarks:
>>> 
>>> - Nothing new to report since August.
>>> 
>>> Latest releases:
>>> 
>>> - Spark 3.0.0-preview was released on Nov 6th, 2019.
>>> - Spark 2.3.4 was released on Sept 9th, 2019.
>>> - Spark 2.4.4 was released on Sept 1st, 2019.
>>> 
>>> Committers and PMC:
>>> 
>>> - The latest committer was added on Sept 4th, 2019 (Dongjoon Hyun).
>>> - The latest committer was added on Sept 9th, 2019 (Weichen Xu). We
>>>   also added Ryan Blue, L.C. Hsieh, Gengliang Wang, Yuming Wang and
>>>   Ruifeng Zhang as committers in the past three months.
>>> 
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>>> <mailto:dev-unsubscr...@spark.apache.org>
>>> 
>> 



Re: ASF board report for November 2019

2019-11-11 Thread Matei Zaharia
Good catch, thanks.

> On Nov 11, 2019, at 6:46 PM, Jungtaek Lim  
> wrote:
> 
> nit: - The latest committer was added on Sept 4th, 2019 (Dongjoon Hyun). <= 
> s/committer/PMC member
> 
> Thanks,
> Jungtaek Lim (HeartSaVioR)
> 
> On Tue, Nov 12, 2019 at 11:38 AM Matei Zaharia  <mailto:matei.zaha...@gmail.com>> wrote:
> Hi all,
> 
> It’s time to send our quarterly report to the ASF board. Here is my draft — 
> please feel free to suggest any changes.
> 
> 
> 
> Apache Spark is a fast and general engine for large-scale data processing. It
> offers high-level APIs in Java, Scala, Python and R as well as a rich set of
> libraries including stream processing, machine learning, and graph analytics.
> 
> Project status:
> 
> - We made the first preview release for Spark 3.0 on November 6th. This
>   release aims to get early feedback on the new APIs and functionality
>   targeting Spark 3.0 but does not provide API or stability guarantees. We
>   encourage community members to try this release and leave feedback on
>   JIRA. More info about what’s new and how to report feedback is found at
>   https://spark.apache.org/news/spark-3.0.0-preview.html 
> <https://spark.apache.org/news/spark-3.0.0-preview.html>.
> 
> - We published Spark 2.4.4. and 2.3.4 as maintenance releases to fix bugs
>   in the 2.4 and 2.3 branches.
> 
> - We added one new PMC member and six committers to the project
>   in August and September, covering data sources, streaming, SQL, ML
>   and other components of the project.
> 
> Trademarks:
> 
> - Nothing new to report since August.
> 
> Latest releases:
> 
> - Spark 3.0.0-preview was released on Nov 6th, 2019.
> - Spark 2.3.4 was released on Sept 9th, 2019.
> - Spark 2.4.4 was released on Sept 1st, 2019.
> 
> Committers and PMC:
> 
> - The latest committer was added on Sept 4th, 2019 (Dongjoon Hyun).
> - The latest committer was added on Sept 9th, 2019 (Weichen Xu). We
>   also added Ryan Blue, L.C. Hsieh, Gengliang Wang, Yuming Wang and
>   Ruifeng Zhang as committers in the past three months.
> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> <mailto:dev-unsubscr...@spark.apache.org>
> 



ASF board report for November 2019

2019-11-11 Thread Matei Zaharia
Hi all,

It’s time to send our quarterly report to the ASF board. Here is my draft — 
please feel free to suggest any changes.



Apache Spark is a fast and general engine for large-scale data processing. It
offers high-level APIs in Java, Scala, Python and R as well as a rich set of
libraries including stream processing, machine learning, and graph analytics.

Project status:

- We made the first preview release for Spark 3.0 on November 6th. This
  release aims to get early feedback on the new APIs and functionality
  targeting Spark 3.0 but does not provide API or stability guarantees. We
  encourage community members to try this release and leave feedback on
  JIRA. More info about what’s new and how to report feedback is found at
  https://spark.apache.org/news/spark-3.0.0-preview.html.

- We published Spark 2.4.4. and 2.3.4 as maintenance releases to fix bugs
  in the 2.4 and 2.3 branches.

- We added one new PMC member and six committers to the project
  in August and September, covering data sources, streaming, SQL, ML
  and other components of the project.

Trademarks:

- Nothing new to report since August.

Latest releases:

- Spark 3.0.0-preview was released on Nov 6th, 2019.
- Spark 2.3.4 was released on Sept 9th, 2019.
- Spark 2.4.4 was released on Sept 1st, 2019.

Committers and PMC:

- The latest committer was added on Sept 4th, 2019 (Dongjoon Hyun).
- The latest committer was added on Sept 9th, 2019 (Weichen Xu). We
  also added Ryan Blue, L.C. Hsieh, Gengliang Wang, Yuming Wang and
  Ruifeng Zhang as committers in the past three months.


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Matei Zaharia
If the goal is to get people to try the DSv2 API and build DSv2 data sources, 
can we recommend the 3.0-preview release for this? That would get people 
shifting to 3.0 faster, which is probably better overall compared to 
maintaining two major versions. There’s not that much else changing in 3.0 if 
you already want to update your Java version.

> On Sep 21, 2019, at 2:45 PM, Ryan Blue  wrote:
> 
> > If you insist we shouldn't change the unstable temporary API in 3.x . . .
> 
> Not what I'm saying at all. I said we should carefully consider whether a 
> breaking change is the right decision in the 3.x line.
> 
> All I'm suggesting is that we can make a 2.5 release with the feature and an 
> API that is the same as the one in 3.0.
> 
> > I also don't get this backporting a giant feature to 2.x line
> 
> I am planning to do this so we can use DSv2 before 3.0 is released. Then we 
> can have a source implementation that works in both 2.x and 3.0 to make the 
> transition easier. Since I'm already doing the work, I'm offering to share it 
> with the community.
> 
> 
> On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin  > wrote:
> Because for example we'd need to move the location of InternalRow, breaking 
> the package name. If you insist we shouldn't change the unstable temporary 
> API in 3.x to maintain compatibility with 3.0, which is totally different 
> from my understanding of the situation when you exposed it, then I'd say we 
> should gate 3.0 on having a stable row interface.
> 
> I also don't get this backporting a giant feature to 2.x line ... as 
> suggested by others in the thread, DSv2 would be one of the main reasons 
> people upgrade to 3.0. What's so special about DSv2 that we are doing this? 
> Why not abandon 3.0 entirely and backport all the features to 2.x?
> 
> 
> 
> On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue  > wrote:
> Why would that require an incompatible change?
> 
> We *could* make an incompatible change and remove support for InternalRow, 
> but I think we would want to carefully consider whether that is the right 
> decision. And in any case, we would be able to keep 2.5 and 3.0 compatible, 
> which is the main goal.
> 
> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin  > wrote:
> How would you not make incompatible changes in 3.x? As discussed the 
> InternalRow API is not stable and needs to change. 
> 
> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue  > wrote:
> > Making downstream to diverge their implementation heavily between minor 
> > versions (say, 2.4 vs 2.5) wouldn't be a good experience
> 
> You're right that the API has been evolving in the 2.x line. But, it is now 
> reasonably stable with respect to the current feature set and we should not 
> need to break compatibility in the 3.x line. Because we have reached our 
> goals for the 3.0 release, we can backport at least those features to 2.x and 
> confidently have an API that works in both a 2.x release and is compatible 
> with 3.0, if not 3.1 and later releases as well.
> 
> > I'd rather say preparation of Spark 2.5 should be started after Spark 3.0 
> > is officially released
> 
> The reason I'm suggesting this is that I'm already going to do the work to 
> backport the 3.0 release features to 2.4. I've been asked by several people 
> when DSv2 will be released, so I know there is a lot of interest in making 
> this available sooner than 3.0. If I'm already doing the work, then I'd be 
> happy to share that with the community.
> 
> I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5 while 
> preparing the 3.0 preview and fixing bugs. For DSv2, the work is about 
> complete so we can easily release the same set of features and API in 2.5 and 
> 3.0.
> 
> If we decide for some reason to wait until after 3.0 is released, I don't 
> know that there is much value in a 2.5. The purpose is to be a step toward 
> 3.0, and releasing that step after 3.0 doesn't seem helpful to me. It also 
> wouldn't get these features out any sooner than 3.0, as a 2.5 release 
> probably would, given the work needed to validate the incompatible changes in 
> 3.0.
> 
> > DSv2 change would be the major backward incompatibility which Spark 2.x 
> > users may hesitate to upgrade
> 
> As I pointed out, DSv2 has been changing in the 2.x line, so this is 
> expected. I don't think it will need incompatible changes in the 3.x line.
> 
> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim  > wrote:
> Just 2 cents, I haven't tracked the change of DSv2 (though I needed to deal 
> with this as the change made confusion on my PRs...), but my bet is that DSv2 
> would be already changed in incompatible way, at least who works for custom 
> DataSource. Making downstream to diverge their implementation heavily between 
> minor versions (say, 2.4 vs 2.5) wouldn't be a good experience - especially 
> we are not completely 

Welcoming some new committers and PMC members

2019-09-09 Thread Matei Zaharia
Hi all,

The Spark PMC recently voted to add several new committers and one PMC member. 
Join me in welcoming them to their new roles!

New PMC member: Dongjoon Hyun

New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang, 
Weichen Xu, Ruifeng Zheng

The new committers cover lots of important areas including ML, SQL, and data 
sources, so it’s great to have them here. All the best,

Matei and the Spark PMC


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: JDK11 Support in Apache Spark

2019-08-26 Thread Matei Zaharia
+1, it’s super messy without that. But great to see this running!

> On Aug 26, 2019, at 10:53 AM, Reynold Xin  wrote:
> 
> Exactly - I think it's important to be able to create a single binary build. 
> Otherwise downstream users (the 99.99% who won't be building their own Spark but 
> just pull it from Maven) will have to deal with the mess, and it's even worse 
> for libraries.
> 
> 
> On Mon, Aug 26, 2019 at 10:51 AM, Dongjoon Hyun  > wrote:
> Oh, right. If you want to publish something to Maven, it will inherit the 
> situation.
> Thank you for feedback. :)
> 
> On Mon, Aug 26, 2019 at 10:37 AM Michael Heuer  > wrote:
> That is not true for any downstream users who also provide a library.  
> Whatever build mess you create in Apache Spark, we'll have to inherit it.  ;)
> 
>michael
> 
> 
>> On Aug 26, 2019, at 12:32 PM, Dongjoon Hyun > > wrote:
>> 
>> As Shane wrote, not yet.
>> 
>> `one build that works for both` is our aspiration and the next step mentioned 
>> in the first email.
>> 
>> > The next step is `how to support JDK8/JDK11 together in a single artifact`.
>> 
>> For the downstream users who build from the Apache Spark source, that will 
>> not be a blocker because they will prefer a single JDK.
>> 
>> Bests,
>> Dongjoon.
>> 
>> On Mon, Aug 26, 2019 at 10:28 AM Shane Knapp > > wrote:
>> maybe in the future, but not right now as the hadoop 2.7 build is broken.
>> 
>> also, i busted dev/run-tests.py  in my changes to 
>> support java11 in PRBs:
>> https://github.com/apache/spark/pull/25585 
>> 
>> 
>> quick fix, testing now.
>> 
>> On Mon, Aug 26, 2019 at 10:23 AM Reynold Xin > > wrote:
>> Would it be possible to have one build that works for both?
> 



ASF board report draft for August

2019-08-12 Thread Matei Zaharia
Hi all,

It’s time to submit our quarterly report to the ASF board again this Wednesday. 
Here is my draft about what’s new — feel free to suggest changes.



Apache Spark is a fast and general engine for large-scale data processing. It
offers high-level APIs in Java, Scala, Python and R as well as a rich set of
libraries including stream processing, machine learning, and graph analytics.

Project status:

- Discussions are continuing about our next feature release, which will likely
  be Spark 3.0, on the dev and user mailing lists. Some key questions include
  whether to remove various deprecated APIs, and which minimum versions of
  Java, Python, Scala, etc to support. There are also a number of new features
  targeting this release. We encourage everyone in the community to give
  feedback on these discussions through our mailing lists or issue tracker.

- We announced a plan to stop supporting Python 2 in our next major release,
  as many other projects in the Python ecosystem are now dropping support
  (https://spark.apache.org/news/plan-for-dropping-python-2-support.html).

- We added three new PMC members to the project in May: Takuya Ueshin,
  Jerry Shao and Hyukjin Kwon.

- There is an ongoing discussion on our dev list about whether to consider
  adding project committers who do not contribute to the code or docs in the
  project, and what the criteria might be for those. (Note that the project does
  solicit committers who only work on docs, and has also added committers
  who work on other tasks, like maintaining our build infrastructure).

Trademarks:

- We are continuing engagement with various organizations.

Latest releases:

- May 8th, 2019: Spark 2.4.3
- April 23rd, 2019: Spark 2.4.2
- March 31st, 2019: Spark 2.4.1
- Feb 15th, 2019: Spark 2.3.3

Committers and PMC:

- The latest committer was added on Jan 29th, 2019 (Jose Torres).
- The latest PMC members were added on May 21st, 2019 (Jerry Shao,
  Takuya Ueshin and Hyukjin Kwon).



ASF board report for May

2019-05-06 Thread Matei Zaharia
It’s time to submit Spark's quarterly ASF board report on May 15th, so I wanted 
to run the report by everyone to make sure we’re not missing something. Let me 
know whether I missed anything:



Apache Spark is a fast and general engine for large-scale data processing. It 
offers high-level APIs in Java, Scala, Python and R as well as a rich set of 
libraries including stream processing, machine learning, and graph analytics. 

Project status:

- We released Apache Spark 2.4.1, 2.4.2 and 2.3.3 in the past three months to 
fix issues in the 2.3 and 2.4 branches.

- Discussions are under way about the next feature release, which will likely 
be Spark 3.0, on our dev and user mailing lists. Some key questions include 
whether to remove various deprecated APIs, and which minimum versions of Java, 
Python, Scala, etc to support. There are also a number of new features 
targeting this release. We encourage everyone in the community to give feedback 
on these discussions through our mailing lists or issue tracker.

- Several Spark Project Improvement Proposals (SPIPs) for major additions to 
Spark were discussed on the dev list in the past three months. These include 
support for passing columnar data efficiently into external engines (e.g. GPU 
based libraries), accelerator-aware scheduling, new data source APIs, and .NET 
support. Some of these have been accepted (e.g. table metadata and accelerator 
aware scheduling proposals) while others are still being discussed.

Trademarks:

- We are continuing engagement with various organizations.

Latest releases:

- April 23rd, 2019: Spark 2.4.2
- March 31st, 2019: Spark 2.4.1
- Feb 15th, 2019: Spark 2.3.3

Committers and PMC:

- The latest committer was added on Jan 29th, 2019 (Jose Torres).
- The latest PMC member was added on Jan 12th, 2018 (Xiao Li).


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-23 Thread Matei Zaharia
Just as a note here, if the goal is for the format not to change, why not make that 
explicit in a versioning policy? You can always include a format version number 
and say that future versions may increment the number, but this specific 
version will always be readable in some specific way. You could also put a 
timeline on how long old version numbers will be recognized in the official 
libraries (e.g. 3 years).
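
Here is a minimal sketch in Scala of how such a version check could look at read
time; the header layout and names are hypothetical, purely for illustration, and
not Arrow's or Spark's actual format:

import java.nio.ByteBuffer

// Hypothetical header for a versioned binary format; names are illustrative only.
case class ColumnBatchHeader(version: Int, numColumns: Int, numRows: Int)

object ColumnBatchHeader {
  // Versions this reader promises to keep readable.
  val SupportedVersions: Set[Int] = Set(1, 2)

  def read(bytes: Array[Byte]): ColumnBatchHeader = {
    val buf = ByteBuffer.wrap(bytes)
    val version = buf.getInt()   // the version number always comes first
    if (!SupportedVersions.contains(version)) {
      throw new IllegalArgumentException(
        s"Unsupported format version $version; supported: ${SupportedVersions.mkString(", ")}")
    }
    // Fields shared by all supported versions are read the same way; newer
    // versions may only append new fields after the known ones.
    ColumnBatchHeader(version, buf.getInt(), buf.getInt())
  }
}

Old version numbers then stay readable for the documented window, while data written
by an unknown newer version fails loudly instead of being silently misread.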

Matei

> On Apr 22, 2019, at 6:36 AM, Bobby Evans  wrote:
> 
> Yes, it is technically possible for the layout to change.  No, it is not 
> going to happen.  It is already baked into several different official 
> libraries which are widely used, not just for holding and processing the 
> data, but also for transfer of the data between the various implementations.  
> There would have to be a really serious reason to force an incompatible 
> change at this point.  So in the worst case, we can version the layout and 
> bake that into the API that exposes the internal layout of the data.  That 
> way code that wants to program against a JAVA API can do so using the API 
> that Spark provides, those who want to interface with something that expects 
> the data in arrow format will already have to know what version of the format 
> it was programmed against and in the worst case if the layout does change we 
> can support the new layout if needed.
> 
> On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler  wrote:
> The Arrow data format is not yet stable, meaning there are no guarantees on 
> backwards/forwards compatibility. Once version 1.0 is released, it will have 
> those guarantees but it's hard to say when that will be. The remaining work 
> to get there can be seen at 
> https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone.
>  So yes, it is a risk that exposing Spark data as Arrow could cause an issue 
> if handled by a different version that is not compatible. That being said, 
> changes to format are not taken lightly and are backwards compatible when 
> possible. I think it would be fair to mark the APIs exposing Arrow data as 
> experimental for the time being, and clearly state the version that must be 
> used to be compatible in the docs. Also, adding features like this and 
> SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0 
> release. Adding the Arrow dev list to CC.
> 
> Bryan
> 
> On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia  wrote:
> Okay, that makes sense, but is the Arrow data format stable? If not, we risk 
> breakage when Arrow changes in the future and some libraries using this 
> feature begin to use the new Arrow code.
> 
> Matei
> 
> > On Apr 20, 2019, at 1:39 PM, Bobby Evans  wrote:
> > 
> > I want to be clear that this SPIP is not proposing exposing Arrow 
> > APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and 
> > because of the overlap between the two SPIPs I scaled this one back to 
> > concentrate just on the columnar processing aspects. Sorry for the 
> > confusion as I didn't update the JIRA description clearly enough when we 
> > adjusted it during the discussion on the JIRA.  As part of the columnar 
> > processing, we plan on providing arrow formatted data, but that will be 
> > exposed through a Spark owned API.
> > 
> > On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia  
> > wrote:
> > FYI, I’d also be concerned about exposing the Arrow API or format as a 
> > public API if it’s not yet stable. Is stabilization of the API and format 
> > coming soon on the roadmap there? Maybe someone can work with the Arrow 
> > community to make that happen.
> > 
> > We’ve been bitten lots of times by API changes forced by external libraries 
> > even when those were widely popular. For example, we used Guava’s Optional 
> > for a while, which changed at some point, and we also had issues with 
> > Protobuf and Scala itself (especially how Scala’s APIs appear in Java). API 
> > breakage might not be as serious in dynamic languages like Python, where 
> > you can often keep compatibility with old behaviors, but it really hurts in 
> > Java and Scala.
> > 
> > The problem is especially bad for us because of two aspects of how Spark is 
> > used:
> > 
> > 1) Spark is used for production data transformation jobs that people need 
> > to keep running for a long time. Nobody wants to make changes to a job 
> > that’s been working fine and computing something correctly for years just 
> > to get a bug fix from the latest Spark release or whatever. It’s much 
> > better if they can upgrade Spark without editing every job.
> > 
> > 2) Spark is often used as “glue” to combine data processing code in other

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-20 Thread Matei Zaharia
Okay, that makes sense, but is the Arrow data format stable? If not, we risk 
breakage when Arrow changes in the future and some libraries using this feature 
> begin to use the new Arrow code.

Matei

> On Apr 20, 2019, at 1:39 PM, Bobby Evans  wrote:
> 
> I want to be clear that this SPIP is not proposing exposing Arrow 
> APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and because 
> of the overlap between the two SPIPs I scaled this one back to concentrate 
> just on the columnar processing aspects. Sorry for the confusion as I didn't 
> update the JIRA description clearly enough when we adjusted it during the 
> discussion on the JIRA.  As part of the columnar processing, we plan on 
> providing arrow formatted data, but that will be exposed through a Spark 
> owned API.
> 
> On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia  wrote:
> FYI, I’d also be concerned about exposing the Arrow API or format as a public 
> API if it’s not yet stable. Is stabilization of the API and format coming 
> soon on the roadmap there? Maybe someone can work with the Arrow community to 
> make that happen.
> 
> We’ve been bitten lots of times by API changes forced by external libraries 
> even when those were widely popular. For example, we used Guava’s Optional 
> for a while, which changed at some point, and we also had issues with 
> Protobuf and Scala itself (especially how Scala’s APIs appear in Java). API 
> breakage might not be as serious in dynamic languages like Python, where you 
> can often keep compatibility with old behaviors, but it really hurts in Java 
> and Scala.
> 
> The problem is especially bad for us because of two aspects of how Spark is 
> used:
> 
> 1) Spark is used for production data transformation jobs that people need to 
> keep running for a long time. Nobody wants to make changes to a job that’s 
> been working fine and computing something correctly for years just to get a 
> bug fix from the latest Spark release or whatever. It’s much better if they 
> can upgrade Spark without editing every job.
> 
> 2) Spark is often used as “glue” to combine data processing code in other 
> libraries, and these might start to require different versions of our 
> dependencies. For example, the Guava class exposed in Spark became a problem 
> when third-party libraries started requiring a new version of Guava: those 
> new libraries just couldn’t work with Spark. Protobuf was especially bad 
> because some users wanted to read data stored as Protobufs (or in a format 
> that uses Protobuf inside), so they needed a different version of the library 
> in their main data processing code.
> 
> If there was some guarantee that this stuff would remain backward-compatible, 
> we’d be in a much better spot. It’s not that hard to keep a storage format 
> backward-compatible: just document the format and extend it only in ways that 
> don’t break the meaning of old data (for example, add new version numbers or 
> field types that are read in a different way). It’s a bit harder for a Java 
> API, but maybe Spark could just expose byte arrays directly and work on those 
> if the API is not guaranteed to stay stable (that is, we’d still use our own 
> classes to manipulate the data internally, and end users could use the Arrow 
> library if they want it).
> 
> Matei
> 
> > On Apr 20, 2019, at 8:38 AM, Bobby Evans  wrote:
> > 
> > I think you misunderstood the point of this SPIP. I responded to your 
> > comments in the SPIP JIRA.
> > 
> > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng  wrote:
> > I posted my comment in the JIRA. Main concerns here:
> > 
> > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have a 1.0 
> > release someday.
> > 2. ML/DL systems that can benefit from columnar format are mostly in 
> > Python.
> > 3. Simple operations, though they benefit from vectorization, might not be worth the 
> > data exchange overhead.
> > 
> > So would an improved Pandas UDF API be good enough? For example, 
> > SPARK-26412 (UDF that takes an iterator of Arrow batches).
> > 
> > Sorry, I should have joined the discussion earlier! Hope it is not too late :)
> > 
> > On Fri, Apr 19, 2019 at 1:20 PM  wrote:
> > +1 (non-binding) for better columnar data processing support.
> > 
> >  
> > 
> > From: Jules Damji  
> > Sent: Friday, April 19, 2019 12:21 PM
> > To: Bryan Cutler 
> > Cc: Dev 
> > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
> > Processing Support
> > 
> >  
> > 
> > + (non-binding)
> > 
> > Sent from my iPhone
> > 
> > Pardon the dumb thumb 

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-20 Thread Matei Zaharia
FYI, I’d also be concerned about exposing the Arrow API or format as a public 
API if it’s not yet stable. Is stabilization of the API and format coming soon 
on the roadmap there? Maybe someone can work with the Arrow community to make 
that happen.

We’ve been bitten lots of times by API changes forced by external libraries 
even when those were widely popular. For example, we used Guava’s Optional for 
a while, which changed at some point, and we also had issues with Protobuf and 
Scala itself (especially how Scala’s APIs appear in Java). API breakage might 
not be as serious in dynamic languages like Python, where you can often keep 
compatibility with old behaviors, but it really hurts in Java and Scala.

The problem is especially bad for us because of two aspects of how Spark is 
used:

1) Spark is used for production data transformation jobs that people need to 
keep running for a long time. Nobody wants to make changes to a job that’s been 
working fine and computing something correctly for years just to get a bug fix 
from the latest Spark release or whatever. It’s much better if they can upgrade 
Spark without editing every job.

2) Spark is often used as “glue” to combine data processing code in other 
libraries, and these might start to require different versions of our 
dependencies. For example, the Guava class exposed in Spark became a problem 
when third-party libraries started requiring a new version of Guava: those new 
libraries just couldn’t work with Spark. Protobuf was especially bad because 
some users wanted to read data stored as Protobufs (or in a format that uses 
Protobuf inside), so they needed a different version of the library in their 
main data processing code.

If there was some guarantee that this stuff would remain backward-compatible, 
we’d be in a much better spot. It’s not that hard to keep a storage format 
backward-compatible: just document the format and extend it only in ways that 
don’t break the meaning of old data (for example, add new version numbers or 
field types that are read in a different way). It’s a bit harder for a Java 
API, but maybe Spark could just expose byte arrays directly and work on those 
if the API is not guaranteed to stay stable (that is, we’d still use our own 
classes to manipulate the data internally, and end users could use the Arrow 
library if they want it).
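
As a concrete sketch of that byte-array idea (a hypothetical class and names,
not an actual Spark API), the public surface could carry only opaque bytes plus
a format tag:

/**
 * Hypothetical wrapper: the public API exposes only raw bytes and a format tag,
 * so the library that produced the bytes never leaks into public signatures.
 */
final class OpaqueColumnarBatch(
    payload: Array[Byte],
    val formatName: String,      // e.g. "arrow"
    val formatVersion: String) { // e.g. "0.12.0"

  /** Defensive copy; callers who recognize formatName/formatVersion can hand
    * this to the external library of their choice, or ignore it entirely. */
  def toBytes: Array[Byte] = payload.clone()
}

Compatibility is then carried by the format name and version strings rather than by
shared classes, so the wrapper's surface can stay stable even if the underlying
format library evolves.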

Matei

> On Apr 20, 2019, at 8:38 AM, Bobby Evans  wrote:
> 
> I think you misunderstood the point of this SPIP. I responded to your 
> comments in the SPIP JIRA.
> 
> On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng  wrote:
> I posted my comment in the JIRA. Main concerns here:
> 
> 1. Exposing third-party Java APIs in Spark is risky. Arrow might have a 1.0 
> release someday.
> 2. ML/DL systems that can benefit from columnar format are mostly in Python.
> 3. Simple operations, though they benefit from vectorization, might not be worth the 
> data exchange overhead.
> 
> So would an improved Pandas UDF API be good enough? For example, 
> SPARK-26412 (UDF that takes an iterator of Arrow batches).
> 
> Sorry, I should have joined the discussion earlier! Hope it is not too late :)
> 
> On Fri, Apr 19, 2019 at 1:20 PM  wrote:
> +1 (non-binding) for better columnar data processing support.
> 
>  
> 
> From: Jules Damji  
> Sent: Friday, April 19, 2019 12:21 PM
> To: Bryan Cutler 
> Cc: Dev 
> Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
> Processing Support
> 
>  
> 
> + (non-binding)
> 
> Sent from my iPhone
> 
> Pardon the dumb thumb typos :)
> 
> 
> On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
> 
> +1 (non-binding)
> 
>  
> 
> On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
> 
> +1 (non-binding).  Looking forward to seeing better support for processing 
> columnar data.
> 
>  
> 
> Jason
> 
>  
> 
> On Tue, Apr 16, 2019 at 10:38 AM Tom Graves  
> wrote:
> 
> Hi everyone,
> 
>  
> 
> I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended 
> Columnar Processing Support.  The proposal is to extend the support to allow 
> for more columnar processing.
> 
>  
> 
> You can find the full proposal in the jira at: 
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS 
> thread in the dev mailing list.
> 
>  
> 
> Please vote as early as you can, I will leave the vote open until next Monday 
> (the 22nd), 2pm CST to give people plenty of time.
> 
>  
> 
> [ ] +1: Accept the proposal as an official SPIP
> 
> [ ] +0
> 
> [ ] -1: I don't think this is a good idea because ...
> 
>  
> 
>  
> 
> Thanks!
> 
> Tom Graves
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matei Zaharia
To add to this, we can add a stable interface anytime if the original one was 
marked as unstable; we wouldn’t have to wait until 4.0. We had a lot of APIs 
that were experimental in 2.0 and then got stabilized in later 2.x releases for 
example.
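
As a small illustration of that pattern (the annotation and interface below are
hypothetical stand-ins, not Spark's actual annotations):

import scala.annotation.StaticAnnotation

// Hypothetical "experimental / evolving" marker: anything carrying it may change
// between feature releases without requiring a major version bump.
class Unstable extends StaticAnnotation

@Unstable
trait RowBatchReader {   // hypothetical interface still being iterated on
  def next(): Boolean
  def get(): AnyRef      // placeholder element type until the API settles
}

// Once the interface settles, the marker is dropped (or swapped for a "stable"
// one) in a later minor release; no new major version is needed for that.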

Matei

> On Feb 26, 2019, at 5:12 PM, Reynold Xin  wrote:
> 
> We will have to fix that before we declare DSv2 is stable, because 
> InternalRow is not a stable API. We don’t necessarily need to do it in 3.0. 
> 
> On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah  wrote:
> Will that then require an API break down the line? Do we save that for Spark 
> 4?
> 
> 
>  
> 
> -Matt Cheah
> 
>  
> 
> From: Ryan Blue 
> Reply-To: "rb...@netflix.com" 
> Date: Tuesday, February 26, 2019 at 4:53 PM
> To: Matt Cheah 
> Cc: Sean Owen , Wenchen Fan , Xiao Li 
> , Matei Zaharia , Spark Dev 
> List 
> Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2
> 
>  
> 
> That's a good question.
> 
>  
> 
> While I'd love to have a solution for that, I don't think it is a good idea 
> to delay DSv2 until we have one. That is going to require a lot of internal 
> changes and I don't see how we could make the release date if we are 
> including an InternalRow replacement.
> 
>  
> 
> On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah  wrote:
> 
> Reynold made a note earlier about a proper Row API that isn’t InternalRow – 
> is that still on the table?
> 
>  
> 
> -Matt Cheah
> 
>  
> 
> From: Ryan Blue 
> Reply-To: "rb...@netflix.com" 
> Date: Tuesday, February 26, 2019 at 4:40 PM
> To: Matt Cheah 
> Cc: Sean Owen , Wenchen Fan , Xiao Li 
> , Matei Zaharia , Spark Dev 
> List 
> Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2
> 
>  
> 
> Thanks for bumping this, Matt. I think we can have the discussion here to 
> clarify exactly what we’re committing to and then have a vote thread once 
> we’re agreed.
> Getting back to the DSv2 discussion, I think we have a good handle on what 
> would be added:
> - Plugin system for catalogs
> - TableCatalog interface (I’ll start a vote thread for this SPIP shortly)
> - TableCatalog implementation backed by SessionCatalog that can load v2 tables
> - Resolution rule to load v2 tables using the new catalog
> - CTAS logical and physical plan nodes
> - Conversions from SQL parsed logical plans to v2 logical plans
> 
> Initially, this will always use the v2 catalog backed by SessionCatalog to 
> avoid dependence on the multi-catalog work. All of those are already 
> implemented and working, so I think it is reasonable that we can get them in.
> Then we can consider a few stretch goals:
> - Get in as much DDL as we can. I think create and drop table should be easy.
> - Multi-catalog identifier parsing and multi-catalog support
> 
> If we get those last two in, it would be great. We can make the call closer 
> to release time. Does anyone want to change this set of work?
>  
> 
> On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah  wrote:
> 
> What would then be the next steps we'd take to collectively decide on plans 
> and timelines moving forward? Might I suggest scheduling a conference call 
> with appropriate PMCs to put our ideas together? Maybe such a discussion can 
> take place at next week's meeting? Or do we need to have a separate 
> formalized voting thread which is guided by a PMC?
> 
> My suggestion is to try to make concrete steps forward and to avoid letting 
> this slip through the cracks.
> 
> I also think there would be merits to having a project plan and estimates 
> around how long each of the features we want to complete is going to take to 
> implement and review.
> 
> -Matt Cheah
> 
> On 2/24/19, 3:05 PM, "Sean Owen"  wrote:
> 
> Sure, I don't read anyone making these statements though? Let's assume
> good intent, that "foo should happen" as "my opinion as a member of
> the community, which is not solely up to me, is that foo should
> happen". I understand it's possible for a person to make their opinion
> over-weighted; this whole style of decision making assumes good actors
> and doesn't optimize against bad ones. Not that it can't happen, just
> not seeing it here.
> 
> I have never seen any vote on a feature list, by a PMC or otherwise.
> We can do that if really needed I guess. But that also isn't the
> authoritative process in play here, in contrast.
> 
> If there's not a more specific subtext or issue here, which is fine to
> say (on private@ if it's sensitive or something), yes, let's move 

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Matei Zaharia
How large would the delay be? My 2 cents are that there’s nothing stopping us 
from making feature releases more often if we want to, so we shouldn’t see this 
as an “either delay 3.0 or release in >6 months” decision. If the work is 
likely to get in with a small delay and simplifies our work after 3.0 (e.g. we 
can get rid of older APIs), then the delay may be worth it. But if it would be 
a large delay, we should also weigh it against other things that are going to 
get delayed if 3.0 moves much later.

It might also be better to propose a specific date to delay until, so people 
can still plan around when the release branch will likely be cut.

Matei

> On Feb 21, 2019, at 1:03 PM, Ryan Blue  wrote:
> 
> Hi everyone,
> 
> In the DSv2 sync last night, we had a discussion about roadmap and what the 
> goal should be for getting the main features into Spark. We all agreed that 
> 3.0 should be that goal, even if it means delaying the 3.0 release.
> 
> The possibility of delaying the 3.0 release may be controversial, so I want 
> to bring it up to the dev list to build consensus around it. The rationale 
> for this is partly that much of this work has been outstanding for more than 
> a year now. If it doesn't make it into 3.0, then it would be another 6 months 
> before it would be in a release, and would be nearing 2 years to get the work 
> done.
> 
> Are there any objections to targeting 3.0 for this?
> 
> In addition, much of the planning for multi-catalog support has been done to 
> make v2 possible. Do we also want to include multi-catalog support?
> 
> 
> rb
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



ASF board report for February

2019-02-09 Thread Matei Zaharia
It’s time to submit Spark's quarterly ASF board report on February 13th, so I 
wanted to run the report by everyone to make sure we’re not missing something. 
Let me know whether I missed anything:



Apache Spark is a fast and general engine for large-scale data processing. It 
offers high-level APIs in Java, Scala, Python and R as well as a rich set of 
libraries including stream processing, machine learning, and graph analytics. 

Project status:

- We released Apache Spark 2.2.3 on January 11th to fix bugs in the 2.2 branch. 
The community is also currently voting on a 2.3.3 release to bring recent fixes 
to the Spark 2.3 branch.

- Discussions are under way about the next feature release, which will likely 
be Spark 3.0, on our dev and user mailing lists. Some key questions include 
whether to remove various deprecated APIs, and which minimum versions of Java, 
Python, Scala, etc to support. There are also a number of new features 
targeting this release. We encourage everyone in the community to give feedback 
on these discussions through our mailing lists or issue tracker.

Trademarks:

- We are continuing engagement with various organizations.

Latest releases:

- Jan 11th, 2019: Spark 2.2.3
- Nov 2nd, 2018: Spark 2.4.0
- Sept 24th, 2018: Spark 2.3.2

Committers and PMC:

- We added Jose Torres as a new committer on January 29th.
- The latest committer was added on January 29th, 2019 (Jose Torres).
- The latest PMC member was added on Jan 12th, 2018 (Xiao Li).


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Automated formatting

2018-11-22 Thread Matei Zaharia
Can we start by just recommending to contributors that they do this manually? 
Then if it seems to work fine, we can try to automate it.

> On Nov 22, 2018, at 4:40 PM, Cody Koeninger  wrote:
> 
> I believe scalafmt only works on scala sources.  There are a few
> plugins for formatting java sources, but I'm less familiar with them.
> On Thu, Nov 22, 2018 at 11:39 AM Mridul Muralidharan  wrote:
>> 
>> Is this handling only scala or java as well ?
>> 
>> Regards,
>> Mridul
>> 
>> On Thu, Nov 22, 2018 at 9:11 AM Cody Koeninger  wrote:
>>> 
>>> Plugin invocation is ./build/mvn mvn-scalafmt_2.12:format
>>> 
>>> It takes about 5 seconds, and errors out on the first different file
>>> that doesn't match formatting.
>>> 
>>> I made a shell wrapper so that contributors can just run
>>> 
>>> ./dev/scalafmt
>>> 
>>> to actually format in place the files that have changed (or pass
>>> through commandline args if they want to do something different)
>>> 
>>> On Wed, Nov 21, 2018 at 3:36 PM Sean Owen  wrote:
 
 I know the PR builder runs SBT, but I presume this would just be a
 separate mvn job that runs. If it doesn't take long and only checks
 the right diff, seems worth a shot. What's the invocation that Shane
 could add (after this change goes in)
 On Wed, Nov 21, 2018 at 3:27 PM Cody Koeninger  wrote:
> 
> There's a mvn plugin (sbt as well, but it requires sbt 1.0+) so it
> should be runnable from the PR builder
> 
> Super basic example with a minimal config that's close to current
> style guide here:
> 
> https://github.com/apache/spark/compare/master...koeninger:scalafmt
> 
> I imagine tracking down the corner cases in the config, especially
> around interactions with scalastyle, may take a bit of work.  Happy to
> do it, but not if there's significant concern about style related
> changes in PRs.
> On Wed, Nov 21, 2018 at 2:42 PM Sean Owen  wrote:
>> 
>> Yeah fair, maybe mostly consistent in broad strokes but not in the 
>> details.
>> Is this something that can be just run in the PR builder? if the rules
>> are simple and not too hard to maintain, seems like a win.
>> On Wed, Nov 21, 2018 at 2:26 PM Cody Koeninger  
>> wrote:
>>> 
>>> Definitely not suggesting a mass reformat, just on a per-PR basis.
>>> 
>>> scalafmt --diff  will reformat only the files that differ from git head
>>> scalafmt --test --diff won't modify files, just throw an exception if
>>> they don't match format
>>> 
>>> I don't think code is consistently formatted now.
>>> I tried scalafmt on the most recent PR I looked at, and it caught
>>> stuff as basic as newlines before curly brace in existing code.
>>> I've had different reviewers for PRs that were literal backports or
>>> cut & paste of each other come up with different formatting nits.
>>> 
>>> 
>>> On Wed, Nov 21, 2018 at 12:03 PM Sean Owen  wrote:
 
 I think reformatting the whole code base might be too much. If there
 are some more targeted cleanups, sure. We do have some links to style
 guides buried somewhere in the docs, although the conventions are
 pretty industry standard.
 
 I *think* the code is pretty consistently formatted now, and would
 expect contributors to follow formatting they see, so ideally the
 surrounding code alone is enough to give people guidance. In practice,
 we're always going to have people format differently no matter what I
 think so it's inevitable.
 
 Is there a way to just check style on PR changes? that's fine.
 On Wed, Nov 21, 2018 at 11:40 AM Cody Koeninger  
 wrote:
> 
> Is there any appetite for revisiting automating formatting?
> 
> I know over the years various people have expressed opposition to it
> as unnecessary churn in diffs, but having every new contributor
> greeted with "nit: 4 space indentation for argument lists" isn't very
> welcoming.
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



ASF board report for November

2018-11-11 Thread Matei Zaharia
It’s time to submit Spark's quarterly ASF board report on November 14th, so I 
wanted to run the text by everyone to make sure we’re not missing something. 
Let me know whether I missed anything:



Apache Spark is a fast and general engine for large-scale data processing. It 
offers high-level APIs in Java, Scala, Python and R as well as a rich set of 
libraries including stream processing, machine learning, and graph analytics. 

Project status:

- We released Apache Spark 2.4.0 on Nov 2nd, 2018 as our newest feature 
release. Spark 2.4’s features include a barrier execution mode for machine 
learning computations, higher-order functions in Spark SQL, pivot syntax in 
SQL, a built-in Apache Avro data source, Kubernetes improvements, and 
experimental support for Scala 2.12, as well as multiple smaller features and 
fixes. The release notes are available at 
http://spark.apache.org/releases/spark-release-2-4-0.html.

- We released Apache Spark 2.3.2 on Sept 24th, 2018 as a bug fix release for 
the 2.3 branch.

- Multiple dev discussions are under way about the next feature release, which 
is likely to be Spark 3.0, on our dev and user mailing lists. Some of the key 
questions are which JDK, Scala, Python, R, Hadoop and Hive versions to support, 
as well as whether to remove certain deprecated APIs. We encourage everyone in 
the community to give feedback on these discussions through the mailing lists 
and JIRA.

Trademarks:

- We are continuing engagement with various organizations.

Latest releases:

- Nov 2nd, 2018: Spark 2.4.0
- Sept 24th, 2018: Spark 2.3.2
- July 2nd, 2018: Spark 2.2.2

Committers and PMC:

- We added six new committers since the last report: Shane Knapp, Dongjoon 
Hyun, Kazuaki Ishizaki, Xingbo Jiang, Yinan Li, and Takeshi Yamamuro.
- The latest committer was added on September 18th, 2018 (Kazuaki Ishizaki).
- The latest PMC member was added on Jan 12th, 2018 (Xiao Li).


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

2018-11-08 Thread Matei Zaharia
I didn’t realize the same thing was broken in 2.3.0, but we should probably 
have made this a blocker for future releases, if it’s just a matter of removing 
things from the test script. We should also make the docs at 
https://spark.apache.org/docs/latest/sparkr.html clear about how we want people 
to run SparkR. They don’t seem to say to use any specific mirror or anything 
(in fact they only talk about how to import SparkR in RStudio and in our 
bin/sparkR, not in a normal R shell). I’m pretty sure it’s OK to update the 
docs website for 2.4.0 after the release to fix this if we want.

Matei

> On Nov 7, 2018, at 6:24 PM, Wenchen Fan  wrote:
> 
> Do we need to create a JIRA ticket for it and list it as a known issue in 
> 2.4.0 release notes?
> 
> On Wed, Nov 7, 2018 at 11:26 PM Shivaram Venkataraman 
>  wrote:
> Agree with the points Felix made.
> 
> One thing is that it looks like the only problem is vignettes and the
> tests are being skipped as designed. If you see
> https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/Windows/00check.log
> and 
> https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/Debian/00check.log,
> the tests run in 1s.
> On Tue, Nov 6, 2018 at 1:29 PM Felix Cheung  wrote:
> >
> > I’d rather not mess with 2.4.0 at this point. On CRAN is nice but users can 
> > also install from Apache Mirror.
> >
> > Also, I had attempted and failed to get the vignettes not to build; it was 
> > non-trivial and I couldn't get it to work, but I have an idea.
> >
> > As for tests I don’t know exact why is it not skipped. Need to investigate 
> > but worse case test_package can run with 0 test.
> >
> >
> >
> > 
> > From: Sean Owen 
> > Sent: Tuesday, November 6, 2018 10:51 AM
> > To: Shivaram Venkataraman
> > Cc: Felix Cheung; Wenchen Fan; Matei Zaharia; dev
> > Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> >
> > I think the second option, to skip the tests, is best right now, if
> > the alternative is to have no SparkR release at all!
> > Can we monkey-patch the 2.4.0 release for SparkR in this way, bless it
> > from the PMC, and release that? It's drastic but so is not being able
> > to release, I think.
> > Right? or is CRAN not actually an important distribution path for
> > SparkR in particular?
> >
> > On Tue, Nov 6, 2018 at 12:49 PM Shivaram Venkataraman
> >  wrote:
> > >
> > > Right - I think we should move on with 2.4.0.
> > >
> > > In terms of what can be done to avoid this error there are two strategies
> > > - Felix had this other thread about JDK 11 that should at least let
> > > Spark run on the CRAN instance. In general this strategy isn't
> > > foolproof because the JDK version and other dependencies on that
> > > machine keep changing over time and we don't have much control over it.
> > > Worse, we also don't have much control
> > > - The other solution is to not run code to build the vignettes
> > > document and just have static code blocks there that have been
> > > pre-evaluated / pre-populated. We can open a JIRA to discuss the
> > > pros/cons of this ?
> > >
> > > Thanks
> > > Shivaram
> > >
> > > On Tue, Nov 6, 2018 at 10:57 AM Felix Cheung  
> > > wrote:
> > > >
> > > > We have not been able to publish to CRAN for quite some time (since 
> > > > 2.3.0 was archived - the cause is Java 11)
> > > >
> > > > I think it’s ok to announce the release of 2.4.0
> > > >
> > > >
> > > > 
> > > > From: Wenchen Fan 
> > > > Sent: Tuesday, November 6, 2018 8:51 AM
> > > > To: Felix Cheung
> > > > Cc: Matei Zaharia; Sean Owen; Spark dev list; Shivaram Venkataraman
> > > > Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> > > >
> > > > Do you mean we should have a 2.4.0 release without CRAN and then do a 
> > > > 2.4.1 immediately?
> > > >
> > > > On Wed, Nov 7, 2018 at 12:34 AM Felix Cheung 
> > > >  wrote:
> > > >>
> > > >> Shivaram and I were discussing.
> > > >> Actually we worked with them before. Another possible approach is to 
> > > >> remove the vignettes eval and all test from the source package... in 
> > > >> the next release.
> > > >>
> > > >>
> > > >> 
> > > >

Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

2018-11-06 Thread Matei Zaharia
Maybe it’s worth contacting the CRAN maintainers to ask for help? Perhaps we 
aren’t disabling it correctly, or perhaps they can ignore this specific 
failure. +Shivaram who might have some ideas.

Matei

> On Nov 5, 2018, at 9:09 PM, Felix Cheung  wrote:
> 
> I don't know what the cause is yet.
> 
> The test should be skipped because of this check
> https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L21
> 
> And this
> https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L57
> 
> But it ran:
> callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", 
> "fit", formula,
> 
> The earlier release was archived because of Java 11+ too, so this unfortunately 
> isn't new.
> 
> 
> From: Sean Owen 
> Sent: Monday, November 5, 2018 7:22 PM
> To: Felix Cheung
> Cc: dev
> Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
>  
> What can we do to get the release through? is there any way to
> circumvent these tests or otherwise hack it? or does it need a
> maintenance release?
> On Mon, Nov 5, 2018 at 8:53 PM Felix Cheung  wrote:
> >
> > FYI. SparkR submission failed. It seems to detect Java 11 correctly with 
> > vignettes, but tests are not being skipped as would be expected.
> >
> > Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
> > Java version 8 is required for this package; found version: 11.0.1
> > Execution halted
> >
> > * checking PDF version of manual ... OK
> > * DONE
> > Status: 1 WARNING, 1 NOTE
> >
> > Current CRAN status: ERROR: 1, OK: 1
> > See: 
> >
> > Version: 2.3.0
> > Check: tests, Result: ERROR
> > Running 'run-all.R' [8s/35s]
> > Running the tests in 'tests/run-all.R' failed.
> > Last 13 lines of output:
> > 4: callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", 
> > "fit", formula,
> > data@sdf, tolower(family$family), family$link, tol, as.integer(maxIter), 
> > weightCol,
> > regParam, as.double(var.power), as.double(link.power), 
> > stringIndexerOrderType,
> > offsetCol)
> > 5: invokeJava(isStatic = TRUE, className, methodName, ...)
> > 6: handleErrors(returnStatus, conn)
> > 7: stop(readString(conn))
> >
> > == testthat results ===========================================================
> > OK: 0 SKIPPED: 0 FAILED: 2
> > 1. Error: create DataFrame from list or data.frame (@test_basic.R#26)
> > 2. Error: spark.glm and predict (@test_basic.R#58)
> >
> >
> >
> > -- Forwarded message -
> > Date: Mon, Nov 5, 2018, 10:12
> > Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> >
> > Dear maintainer,
> >
> > package SparkR_2.4.0.tar.gz does not pass the incoming checks 
> > automatically, please see the following pre-tests:
> > Windows: 
> > 
> > Status: 1 NOTE
> > Debian: 
> > 
> > Status: 1 WARNING, 1 NOTE
> >
> > Last released version's CRAN status: ERROR: 1, OK: 1
> > See: 
> >
> > CRAN Web: 
> >
> > Please fix all problems and resubmit a fixed version via the webform.
> > If you are not sure how to fix the problems shown, please ask for help on 
> > the R-package-devel mailing list:
> > 
> > If you are fairly certain the rejection is a false positive, please 
> > reply-all to this message and explain.
> >
> > More details are given in the directory:
> > 
> > The files will be removed after roughly 7 days.
> >
> > No strong reverse dependencies to be checked.
> >
> > Best regards,
> > CRAN teams' auto-check service
> > Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
> > Check: CRAN incoming feasibility, Result: NOTE
> > Maintainer: 'Shivaram Venkataraman '
> >
> > New submission
> >
> > Package was archived on CRAN
> >
> > Possibly mis-spelled words in DESCRIPTION:
> > Frontend (4:10, 5:28)
> >
> > CRAN repository db overrides:
> > X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
> > corrected despite reminders.
> >
> > Flavor: r-devel-linux-x86_64-debian-gcc
> > Check: re-building of vignette outputs, Result: WARNING
> > Error in re-building vignettes:
> > ...
> >
> > Attaching package: 'SparkR'
> >
> > The following objects are masked from 'package:stats':
> >
> > cov, filter, lag, na.omit, predict, sd, var, window
> >
> > The following objects are masked from 'package:base':
> >
> > as.data.frame, colnames, colnames<-, drop, endsWith,
> > intersect, rank, rbind, sample, startsWith, subset, summary,
> > transform, 

Re: Spark 2.4.0 artifact in Maven repository

2018-11-06 Thread Matei Zaharia
Hi Bartosz,

This is because the vote on 2.4 has passed (you can see the vote thread on the 
dev mailing list) and we are just working to get the release into various 
channels (Maven, PyPI, etc), which can take some time. Expect to see an 
announcement soon once that’s done.

Matei

> On Nov 4, 2018, at 7:14 AM, Bartosz Konieczny  wrote:
> 
> Hi,
> 
> Today I wanted to set up a development environment for GraphX and when I 
> visited Maven central repository 
> (https://mvnrepository.com/artifact/org.apache.spark/spark-graphx) I saw that 
> it was already available in 2.4.0 version. Does it mean that the new version 
> of Apache Spark was released ? It seems quite surprising for me because I 
> didn't find any release information and the 2.4 artifact was deployed 
> 29/10/2018. Maybe somebody here has some explanation for that ?
> 
> Best regards,
> Bartosz Konieczny.


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Matei Zaharia
That’s a good point — I’d say there’s just a risk of creating a perception 
issue. First, some users might feel that this means they have to migrate now, 
which is before Python itself drops support; they might also be surprised that 
we did this in a minor release (e.g. might we drop Python 2 altogether in a 
Spark 2.5 if that later comes out?). Second, contributors might feel that this 
means new features no longer have to work with Python 2, which would be 
confusing. Maybe it’s OK on both fronts, but it just seems scarier for users to 
do this now if we do plan to have Spark 3.0 in the next 6 months anyway.

Matei

> On Sep 17, 2018, at 1:04 PM, Mark Hamstra  wrote:
> 
> What is the disadvantage to deprecating now in 2.4.0? I mean, it doesn't 
> change the code at all; it's just a notification that we will eventually 
> cease supporting Py2. Wouldn't users prefer to get that notification sooner 
> rather than later?
> 
> On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia  
> wrote:
> I’d like to understand the maintenance burden of Python 2 before deprecating 
> it. Since it is not EOL yet, it might make sense to only deprecate it once 
> it’s EOL (which is still over a year from now). Supporting Python 2+3 seems 
> less burdensome than supporting, say, multiple Scala versions in the same 
> codebase, so what are we losing out on?
> 
> The other thing is that even though Python core devs might not support 2.x 
> later, it’s quite possible that various Linux distros will if moving from 2 
> to 3 remains painful. In that case, we may want Apache Spark to continue 
> releasing for it despite the Python core devs not supporting it.
> 
> Basically, I’d suggest to deprecate this in Spark 3.0 and then remove it 
> later in 3.x instead of deprecating it in 2.4. I’d also consider looking at 
> what other data science tools are doing before fully removing it: for 
> example, if Pandas and TensorFlow no longer support Python 2 past some point, 
> that might be a good point to remove it.
> 
> Matei
> 
> > On Sep 17, 2018, at 11:01 AM, Mark Hamstra  wrote:
> > 
> > If we're going to do that, then we need to do it right now, since 2.4.0 is 
> > already in release candidates.
> > 
> > On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson  wrote:
> > I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem 
> > like a ways off but even now there may be some spark versions supporting 
> > Py2 past the point where Py2 is no longer receiving security patches 
> > 
> > 
> > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra  
> > wrote:
> > We could also deprecate Py2 already in the 2.4.0 release.
> > 
> > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson  wrote:
> > In case this didn't make it onto this thread:
> > 
> > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove 
> > it entirely on a later 3.x release.
> > 
> > On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson  
> > wrote:
> > On a separate dev@spark thread, I raised a question of whether or not to 
> > support python 2 in Apache Spark, going forward into Spark 3.0.
> > 
> > Python-2 is going EOL at the end of 2019. The upcoming release of Spark 3.0 
> > is an opportunity to make breaking changes to Spark's APIs, and so it is a 
> > good time to consider support for Python-2 on PySpark.
> > 
> > Key advantages to dropping Python 2 are:
> >   • Support for PySpark becomes significantly easier.
> >   • Avoid having to support Python 2 until Spark 4.0, which is likely 
> > to imply supporting Python 2 for some time after it goes EOL.
> > (Note that supporting python 2 after EOL means, among other things, that 
> > PySpark would be supporting a version of python that was no longer 
> > receiving security patches)
> > 
> > The main disadvantage is that PySpark users who have legacy python-2 code 
> > would have to migrate their code to python 3 to take advantage of Spark 3.0
> > 
> > This decision obviously has large implications for the Apache Spark 
> > community and we want to solicit community feedback.
> > 
> > 
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Matei Zaharia
I’d like to understand the maintenance burden of Python 2 before deprecating 
it. Since it is not EOL yet, it might make sense to only deprecate it once it’s 
EOL (which is still over a year from now). Supporting Python 2+3 seems less 
burdensome than supporting, say, multiple Scala versions in the same codebase, 
so what are we losing out on?

The other thing is that even though Python core devs might not support 2.x 
later, it’s quite possible that various Linux distros will if moving from 2 to 
3 remains painful. In that case, we may want Apache Spark to continue releasing 
for it despite the Python core devs not supporting it.

Basically, I’d suggest to deprecate this in Spark 3.0 and then remove it later 
in 3.x instead of deprecating it in 2.4. I’d also consider looking at what 
other data science tools are doing before fully removing it: for example, if 
Pandas and TensorFlow no longer support Python 2 past some point, that might be 
a good point to remove it.

Matei
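
For a sense of what supporting Python 2+3 in one code base actually involves, it is mostly small compatibility shims like the following scattered through the source (a generic sketch, not code taken from PySpark):

from __future__ import print_function  # keep Python 2 modules compatible with 3-style print

import sys

if sys.version_info[0] >= 3:
    xrange = range            # Python 3 dropped xrange
    unicode_type = str
else:
    unicode_type = unicode    # noqa: F821 (only defined on Python 2)

def to_text(value):
    # Normalize the bytes/str split between the two major versions.
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return unicode_type(value)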

> On Sep 17, 2018, at 11:01 AM, Mark Hamstra  wrote:
> 
> If we're going to do that, then we need to do it right now, since 2.4.0 is 
> already in release candidates.
> 
> On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson  wrote:
> I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem like 
> a ways off but even now there may be some spark versions supporting Py2 past 
> the point where Py2 is no longer receiving security patches 
> 
> 
> On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra  wrote:
> We could also deprecate Py2 already in the 2.4.0 release.
> 
> On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson  wrote:
> In case this didn't make it onto this thread:
> 
> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove it 
> entirely on a later 3.x release.
> 
> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson  wrote:
> On a separate dev@spark thread, I raised a question of whether or not to 
> support python 2 in Apache Spark, going forward into Spark 3.0.
> 
> Python-2 is going EOL at the end of 2019. The upcoming release of Spark 3.0 
> is an opportunity to make breaking changes to Spark's APIs, and so it is a 
> good time to reconsider support for Python-2 in PySpark.
> 
> Key advantages to dropping Python 2 are:
>   • Support for PySpark becomes significantly easier.
>   • Avoid having to support Python 2 until Spark 4.0, which is likely to 
> imply supporting Python 2 for some time after it goes EOL.
> (Note that supporting python 2 after EOL means, among other things, that 
> PySpark would be supporting a version of python that was no longer receiving 
> security patches)
> 
> The main disadvantage is that PySpark users who have legacy python-2 code 
> would have to migrate their code to python 3 to take advantage of Spark 3.0
> 
> This decision obviously has large implications for the Apache Spark community 
> and we want to solicit community feedback.
> 
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is there any open source framework that converts Cypher to SparkSQL?

2018-09-16 Thread Matei Zaharia
GraphFrames (https://graphframes.github.io) offers a Cypher-like syntax that 
then executes on Spark SQL.
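
A minimal sketch of that motif-finding syntax through the GraphFrames Python API (assumes the graphframes package has been added, e.g. via --packages; the data is made up):

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.master("local[*]").appName("motif-example").getOrCreate()

vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "knows"), ("b", "c", "knows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
# A Cypher-like pattern; GraphFrames compiles it down to Spark SQL joins.
g.find("(x)-[e1]->(y); (y)-[e2]->(z)").show()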

> On Sep 14, 2018, at 2:42 AM, kant kodali  wrote:
> 
> Hi All,
> 
> Is there any open source framework that converts Cypher to SparkSQL?
> 
> Thanks!


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Python friendly API for Spark 3.0

2018-09-16 Thread Matei Zaharia
My 2 cents on this is that the biggest room for improvement in Python is 
similarity to Pandas. We already made the Python DataFrame API different from 
Scala/Java in some respects, but if there’s anything we can do to make it more 
obvious to Pandas users, that will help the most. The other issue though is 
that a bunch of Pandas functions are just missing in Spark — it would be 
awesome to set up an umbrella JIRA to just track those and let people fill them 
in.

Matei
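
As one concrete example of the kind of gap described above, a common pandas one-liner has no equally short DataFrame-API spelling today (illustrative only; the data is made up):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

pdf = pd.DataFrame({"color": ["red", "blue", "red", "red"]})
print(pdf["color"].value_counts())   # one line in pandas

sdf = spark.createDataFrame(pdf)
# The closest PySpark equivalent is noticeably more verbose:
sdf.groupBy("color").count().orderBy("count", ascending=False).show()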

> On Sep 16, 2018, at 1:02 PM, Mark Hamstra  wrote:
> 
> It's not splitting hairs, Erik. It's actually very close to something that I 
> think deserves some discussion (perhaps on a separate thread.) What I've been 
> thinking about also concerns API "friendliness" or style. The original RDD 
> API was very intentionally modeled on the Scala parallel collections API. 
> That made it quite friendly for some Scala programmers, but not as much so 
> for users of the other language APIs when they eventually came about. 
> Similarly, the Dataframe API drew a lot from pandas and R, so it is 
> relatively friendly for those used to those abstractions. Of course, the 
> Spark SQL API is modeled closely on HiveQL and standard SQL. The new barrier 
> scheduling draws inspiration from MPI. With all of these models and sources 
> of inspiration, as well as multiple language targets, there isn't really a 
> strong sense of coherence across Spark -- I mean, even though one of the key 
> advantages of Spark is the ability to do within a single framework things 
> that would otherwise require multiple frameworks, actually doing that is 
> requiring more than one programming style or multiple design abstractions 
> more than what is strictly necessary even when writing Spark code in just a 
> single language.
> 
> For me, that raises questions over whether we want to start designing, 
> implementing and supporting APIs that are designed to be more consistent, 
> friendly and idiomatic to particular languages and abstractions -- e.g. an 
> API covering all of Spark that is designed to look and feel as much like 
> "normal" code for a Python programmer, another that looks and feels more like 
> "normal" Java code, another for Scala, etc. That's a lot more work and 
> support burden than the current approach where sometimes it feels like you 
> are writing "normal" code for your prefered programming environment, and 
> sometimes it feels like you are trying to interface with something foreign, 
> but underneath it hopefully isn't too hard for those writing the 
> implementation code below the APIs, and it is not too hard to maintain 
> multiple language bindings that are each fairly lightweight.
> 
> It's a cost-benefit judgement, of course, whether APIs that are heavier (in 
> terms of implementing and maintaining) and friendlier (for end users) are 
> worth doing, and maybe some of these "friendlier" APIs can be done outside of 
> Spark itself (imo, Frameless is doing a very nice job for the parts of Spark 
> that it is currently covering -- https://github.com/typelevel/frameless); but 
> what we have currently is a bit too ad hoc and fragmentary for my taste. 
> 
> On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson  wrote:
> I am probably splitting hairs too finely, but I was considering the difference 
> between improvements to the jvm-side (py4j and the scala/java code) that 
> would make it easier to write the python layer ("python-friendly api"), and 
> actual improvements to the python layers ("friendly python api").
> 
> They're not mutually exclusive of course, and both worth working on. But it's 
> *possible* to improve either without the other.
> 
> Stub files look like a great solution for type annotations, maybe even if 
> only python 3 is supported.
> 
> I definitely agree that any decision to drop python 2 should not be taken 
> lightly. Anecdotally, I'm seeing an increase in python developers announcing 
> that they are dropping support for python 2 (and loving it). As people have 
> already pointed out, if we don't drop python 2 for spark 3.0, we're stuck 
> with it until 4.0, which would place spark in a possibly-awkward position of 
> supporting python 2 for some time after it goes EOL.
> 
> Under the current release cadence, spark 3.0 will land some time in early 
> 2019, which at that point will be mere months until EOL for py2.
> 
> On Fri, Sep 14, 2018 at 5:01 PM, Holden Karau  wrote:
> 
> 
> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson  wrote:
> To be clear, is this about "python-friendly API" or "friendly python API" ?
> Well what would you consider to be different between those two statements? I 
> think it would be good to be a bit more explicit, but I don't think we should 
> necessarily limit ourselves.
> 
> On the python side, it might be nice to take advantage of static typing. 
> Requires python 3.6 but with python 2 going EOL, a spark-3.0 might be a good 
> opportunity to jump the python-3-only train.
> I think we can make 

Re: time for Apache Spark 3.0?

2018-09-06 Thread Matei Zaharia
Yes, you can start with Unstable and move to Evolving and Stable when needed. 
We’ve definitely had experimental features that changed across maintenance 
releases when they were well-isolated. If your change risks breaking stuff in 
stable components of Spark though, then it probably won’t be suitable for that.

> On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
> 
> I meant flexibility beyond the point releases. I think what Reynold was 
> suggesting was getting v2 code out more often than the point releases every 6 
> months. An Evolving API can change in point releases, but maybe we should 
> move v2 to Unstable so it can change more often? I don't really see another 
> way to get changes out more often.
> 
> On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra  wrote:
> Yes, that is why we have these annotations in the code and the corresponding 
> labels appearing in the API documentation: 
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
> 
> As long as it is properly annotated, we can change or even eliminate an API 
> method before the next major release. And frankly, we shouldn't be 
> contemplating bringing in the DS v2 API (and, I'd argue, any new API) without 
> such an annotation. There is just too much risk of not getting everything 
> right before we see the results of the new API being more widely used, and 
> too much cost in maintaining until the next major release something that we 
> come to regret for us to create new API in a fully frozen state.
>  
> 
> On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue  wrote:
> It would be great to get more features out incrementally. For experimental 
> features, do we have more relaxed constraints?
> 
> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:
> +1 on 3.0
> 
> Dsv2 stable can still evolve in across major releases. DataFrame, Dataset, 
> dsv1 and a lot of other major features all were developed throughout the 1.x 
> and 2.x lines.
> 
> I do want to explore ways for us to get dsv2 incremental changes out there 
> more frequently, to get feedback. Maybe that means we apply additive changes 
> to 2.4.x; maybe that means making another 2.5 release sooner. I will start a 
> separate thread about it.
> 
> 
> 
> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on timing? 
> 6 months?) but simply next. Do you mean you'd prefer that change to happen 
> before 3.x? if it's a significant change, seems reasonable for a major 
> version bump rather than minor. Is the concern that tying it to 3.0 means you 
> have to take a major version update to get it?
> 
> I generally support moving on to 3.x so we can also jettison a lot of older 
> dependencies, code, fix some long standing issues, etc.
> 
> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
> 
> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue  wrote:
> My concern is that the v2 data source API is still evolving and not very 
> close to stable. I had hoped to have stabilized the API and behaviors for a 
> 3.0 release. But we could also wait on that for a 4.0 release, depending on 
> when we think that will be.
> 
> Unless there is a pressing need to move to 3.0 for some other area, I think 
> it would be better for the v2 sources to have a 2.5 release.
> 
> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
> Yesterday, the 2.4 branch was created. Based on the above discussion, I think 
> we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Nightly Builds in the docs (in spark-nightly/spark-master-bin/latest? Can't seem to find it)

2018-08-31 Thread Matei Zaharia
If we actually build stuff nightly in Jenkins, it wouldn’t hurt to publish those 
builds, IMO. It helps more people try master and test it.

> On Aug 31, 2018, at 1:28 PM, Sean Owen  wrote:
> 
> There are some builds there, but they're not recent:
> 
> https://people.apache.org/~pwendell/spark-nightly/
> 
> We can either get the jobs running again, or just knock this on the head and 
> remove it.
> 
> Anyone know how to get it running again and want to? I have a feeling Shane 
> knows if anyone. Or does anyone know if we even need these at this point? if 
> nobody has complained in about a year, unlikely.
> 
> On Fri, Aug 31, 2018 at 3:15 PM Cody Koeninger  wrote:
> Just got a question about this on the user list as well.
> 
> Worth removing that link to pwendell's directory from the docs?
> 
> On Sun, Jan 21, 2018 at 12:13 PM, Jacek Laskowski  wrote:
> > Hi,
> >
> > http://spark.apache.org/developer-tools.html#nightly-builds reads:
> >
> >> Spark nightly packages are available at:
> >> Latest master build:
> >> https://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest
> >
> > but the URL gives 404. Is this intended?
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://about.me/JacekLaskowski
> > Mastering Spark SQL https://bit.ly/mastering-spark-sql
> > Spark Structured Streaming https://bit.ly/spark-structured-streaming
> > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> > Follow me at https://twitter.com/jaceklaskowski
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Matei Zaharia
I like this as well. Regarding “cost”, I think the equivalent concept for us is 
impact on the rest of the project (say maintenance cost down the line or 
whatever). This could be captured in the “risks” too, but it’s a slightly 
different concept. We should probably just clarify what we mean with each 
question.

Matei

> On Aug 31, 2018, at 1:09 PM, Cody Koeninger  wrote:
> 
> +1 to Sean's comment
> 
> On Fri, Aug 31, 2018 at 2:48 PM, Reynold Xin  wrote:
>> Yup all good points. One way I've done it in the past is to have an appendix
>> section for design sketch, as an expansion to the question "- What is new in
>> your approach and why do you think it will be successful?"
>> 
>> On Fri, Aug 31, 2018 at 12:47 PM Marcelo Vanzin
>>  wrote:
>>> 
>>> I like the questions (aside maybe from the cost one which perhaps does
>>> not matter much here), especially since they encourage explaining
>>> things in a more plain language than generally used by specs.
>>> 
>>> But I don't think we can ignore design aspects; it's been my
>>> observation that a good portion of SPIPs, when proposed, already have
>>> at the very least some sort of implementation (even if it's a barely
>>> working p.o.c.), so it would also be good to have that information up
>>> front if it's available.
>>> 
>>> (So I guess I'm just repeating Sean's reply.)
>>> 
>>> On Fri, Aug 31, 2018 at 11:23 AM Reynold Xin  wrote:
 
 I helped craft the current SPIP template last year. I was recently
 (re-)introduced to the Heilmeier Catechism, a set of questions DARPA
 developed to evaluate proposals. The set of questions are:
 
 - What are you trying to do? Articulate your objectives using absolutely
 no jargon.
 - How is it done today, and what are the limits of current practice?
 - What is new in your approach and why do you think it will be
 successful?
 - Who cares? If you are successful, what difference will it make?
 - What are the risks?
 - How much will it cost?
 - How long will it take?
 - What are the mid-term and final “exams” to check for success?
 
 When I read the above list, it resonates really well because they are
 almost always the same set of questions I ask myself and others before I
 decide whether something is worth doing. In some ways, our SPIP template
 tries to capture some of these (e.g. target persona), but are not as
 explicit and well articulated.
 
 What do people think about replacing the current SPIP template with the
 above?
 
 At a high level, I think the Heilmeier's Catechism emphasizes less about
 the "how", and more the "why" and "what", which is what I'd argue SPIPs
 should be about. The hows should be left in design docs for larger 
 projects.
 
 
>>> 
>>> 
>>> --
>>> Marcelo
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> 
>> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

2018-08-23 Thread Matei Zaharia
Yes, that makes sense, but just to be clear, using the same seed does *not* 
imply that the algorithm should produce “equivalent” results by some definition 
of equivalent if you change the input data. For example, in SGD, the random 
seed might be used to select the next minibatch of examples, but if you reorder 
the data or change the labels, this will result in a different gradient being 
computed. Just because the dataset transformation seems to preserve the ML 
problem at a high abstraction level does not mean that even a deterministic ML 
algorithm (MLlib with seed) will give the same result. Maybe other libraries 
do, but it doesn’t necessarily mean that MLlib is doing something wrong here.

Basically, I’m just saying that as an ML library developer I wouldn’t be super 
concerned about these particular test results (especially if just a few 
instances change classification). I would be much more interested, however, in 
results like the following:

- The differences in the algorithm’s evaluation metrics (loss, accuracy, etc.) are 
statistically significant when you change these properties of the data. This 
probably requires you to run multiple times with different seeds.
- MLlib’s evaluation metrics for a problem differ in a statistically significant 
way from those of other ML libraries, for algorithms configured with equivalent 
hyperparameters. (Sometimes libraries have different definitions for 
hyperparameters, though.)

The second one is definitely something we’ve tested for informally in the past, 
though it is not in unit tests as far as I know.

Matei
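
To make the distinction concrete: fixing the seed makes one training run repeatable on the exact same input, but it does not promise invariance under transformations of that input. A small sketch (the dataset path and parameters are illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.master("local[*]").getOrCreate()
train = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

rf = RandomForestClassifier(numTrees=20, seed=42)
model_a = rf.fit(train)                   # repeatable for this exact input
model_b = rf.fit(train.orderBy("label"))  # same seed, reordered rows: the same random
                                          # draws now land on different data, so some
                                          # trees (and predictions) can legitimately differ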

> On Aug 23, 2018, at 5:14 AM, Steffen Herbold  
> wrote:
> 
> Dear Matei,
> 
> thanks for the feedback!
> 
> I used the setSeed option for all randomized classifiers and always used the 
> same seeds for training with the hope that this deals with the 
> non-determinism. I did not run any significance tests, because I was 
> considering this from a functional perspective, assuming that the 
> nondeterminism would be dealt with if I fix the seed values. The test results 
> contain how many instances were classified differently. Sometimes these are 
> only 1 or 2 out of 100 instances, i.e., almost certainly not significant. 
> Other cases seem to be more interesting. For example, 20/100 instances were 
> classified differently by the linear SVM for informative uniformly 
> distributed data if we added 1 to each feature value.
> 
> I know that these problems should sometimes be expected. However, I was 
> actually not sure what to expect, especially after I started to look at the 
> results for different ML libraries in comparison. The random forest are a 
> good example. I expected them to be dependent on feature/instance order. 
> However, they are not in Weka, only in scikit-learn and Spark MLlib. There 
> are more such examples, like logistic regression that exhibits different 
> behavior in all three libraries. Thus, I decided to just give my results to 
> the people who know what to expect from their implementations, i.e., the devs.
> 
> I will probably expand my test generator to allow more detailed 
> specifications of the expectations of the algorithms in the future. This 
> seems to be a "must" for a potentially productive use by projects. Relaxing 
> the assertions to only react if the differences are significant would be 
> another possible change. This could be a command line option to allow 
> different strictness of testing.
> 
> Best,
> Steffen
> 
> 
> Am 22.08.2018 um 23:27 schrieb Matei Zaharia:
>> Hi Steffen,
>> 
>> Thanks for sharing your results about MLlib — this sounds like a useful 
>> tool. However, I wanted to point out that some of the results may be 
>> expected for certain machine learning algorithms, so it might be good to 
>> design those tests with that in mind. For example:
>> 
>>> - The classification of LogisticRegression, DecisionTree, and RandomForest 
>>> were not inverted when all binary class labels are flipped.
>>> - The classification of LogisticRegression, DecisionTree, GBT, and 
>>> RandomForest sometimes changed when the features are reordered.
>>> - The classification of LogisticRegression, RandomForest, and LinearSVC 
>>> sometimes changed when the instances are reordered.
>> All of these things might occur because the algorithms are nondeterministic. 
>> Were the effects large or small? Or, for example, was the final difference 
>> in accuracy statistically significant? Many ML algorithms are trained using 
>> randomized algorithms like stochastic gradient descent, so you can’t expect 
>> exactly the same results under these changes.
>> 
>>> - The classification of NaïveBayes and the LinearSVC sometimes changed if 
>>> one is added to each feat

Re: Porting or explicitly linking project style in Apache Spark based on https://github.com/databricks/scala-style-guide

2018-08-23 Thread Matei Zaharia
There’s already a code style guide listed on 
http://spark.apache.org/contributing.html. Maybe it’s the same? We should 
decide which one we actually want and update this page if it’s wrong.

Matei

> On Aug 23, 2018, at 6:33 PM, Sean Owen  wrote:
> 
> Seems OK to me. The style is pretty standard Scala style anyway. My guidance 
> is always to follow the code around the code you're changing.
> 
> On Thu, Aug 23, 2018 at 8:14 PM Hyukjin Kwon  wrote:
> Hi all,
> 
> I usually follow https://github.com/databricks/scala-style-guide for Apache 
> Spark's style, which is generally the same as Spark's code base in practice.
> Thing is, we don't explicitly mention this within Apache Spark as far as I 
> can tell.
> 
> Can we explicitly mention this or port this style guide? It doesn't 
> necessarily mean hard requirements for PRs or code changes but we could at 
> least encourage people to read it.
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

2018-08-22 Thread Matei Zaharia
Hi Steffen,

Thanks for sharing your results about MLlib — this sounds like a useful tool. 
However, I wanted to point out that some of the results may be expected for 
certain machine learning algorithms, so it might be good to design those tests 
with that in mind. For example:

> - The classification of LogisticRegression, DecisionTree, and RandomForest 
> were not inverted when all binary class labels are flipped.
> - The classification of LogisticRegression, DecisionTree, GBT, and 
> RandomForest sometimes changed when the features are reordered.
> - The classification of LogisticRegression, RandomForest, and LinearSVC 
> sometimes changed when the instances are reordered.

All of these things might occur because the algorithms are nondeterministic. 
Were the effects large or small? Or, for example, was the final difference in 
accuracy statistically significant? Many ML algorithms are trained using 
randomized algorithms like stochastic gradient descent, so you can’t expect 
exactly the same results under these changes.

> - The classification of NaïveBayes and the LinearSVC sometimes changed if one 
> is added to each feature value.

This might be due to nondeterminism as above, but it might also be due to 
regularization or nonlinear effects for some algorithms. For example, some 
algorithms might look at the relative values of features, in which case adding 
1 to each feature value transforms the data. Other algorithms might require 
that data be centered around a mean of 0 to work best.

I haven’t read the paper in detail, but basically it would be good to account 
for randomized algorithms as well as various model assumptions, and make sure 
the differences in results in these tests are statistically significant.

Matei


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Am I crazy, or does the binary distro not have Kafka integration?

2018-08-04 Thread Matei Zaharia
I think that traditionally, the reason *not* to include these has been if they 
brought additional dependencies that users don’t really need, but that might 
clash with what the users have in their own app. Maybe this used to be the case 
for Kafka. We could analyze it and include it by default, or perhaps make it 
easier to add it in spark-submit and spark-shell. I feel that in an IDE, it 
won’t be a huge problem because you just add it once, but it is annoying for 
spark-submit.

Matei
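
For reference, this is the workaround users need today: pull the Kafka source in explicitly at submit time, since it is not in the jars/ directory of the binary distro. A sketch (the package coordinates match the 2.3.1 / Scala 2.11 build discussed below; the broker and topic are made up):

# Submitted with something like:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1 app.py
# Without --packages, the load() below typically fails with
# "Failed to find data source: kafka".
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-example").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

query = (stream.selectExpr("CAST(value AS STRING)")
         .writeStream.format("console").start())
query.awaitTermination()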

> On Aug 4, 2018, at 2:19 PM, Sean Owen  wrote:
> 
> Hm OK I am crazy then. I think I never noticed it because I had always used a 
> distro that did actually supply this on the classpath.
> Well ... I think it would be reasonable to include these things (at least, 
> Kafka integration) by default in the binary distro. I'll update the JIRA to 
> reflect that this is at best a Wish.
> 
> On Sat, Aug 4, 2018 at 4:17 PM Jacek Laskowski  wrote:
> Hi Sean,
> 
> It's been for years I'd say that you had to specify --packages to get the 
> Kafka-related jars on the classpath. I simply got used to this annoyance (as 
> did others). Could it be that it's an external package (although an integral 
> part of Spark)?!
> 
> I'm very glad you've brought it up since I think Kafka data source is so 
> important that it should be included in spark-shell and spark-submit by 
> default. THANKS!
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
> 
> On Sat, Aug 4, 2018 at 9:56 PM, Sean Owen  wrote:
> Let's take this to https://issues.apache.org/jira/browse/SPARK-25026 -- I 
> provisionally marked this a Blocker, as if it's correct, then the release is 
> missing an important piece and we'll want to remedy that ASAP. I still have 
> this feeling I am missing something. The classes really aren't there in the 
> release but ... *nobody* noticed all this time? I guess maybe Spark-Kafka 
> users may be using a vendor distro that does package these bits.
> 
> 
> On Sat, Aug 4, 2018 at 10:48 AM Sean Owen  wrote:
> I was debugging why a Kafka-based streaming app doesn't seem to find 
> Kafka-related integration classes when run standalone from our latest 2.3.1 
> release, and noticed that there doesn't seem to be any Kafka-related jars 
> from Spark in the distro. In jars/, I see:
> 
> spark-catalyst_2.11-2.3.1.jar
> spark-core_2.11-2.3.1.jar
> spark-graphx_2.11-2.3.1.jar
> spark-hive-thriftserver_2.11-2.3.1.jar
> spark-hive_2.11-2.3.1.jar
> spark-kubernetes_2.11-2.3.1.jar
> spark-kvstore_2.11-2.3.1.jar
> spark-launcher_2.11-2.3.1.jar
> spark-mesos_2.11-2.3.1.jar
> spark-mllib-local_2.11-2.3.1.jar
> spark-mllib_2.11-2.3.1.jar
> spark-network-common_2.11-2.3.1.jar
> spark-network-shuffle_2.11-2.3.1.jar
> spark-repl_2.11-2.3.1.jar
> spark-sketch_2.11-2.3.1.jar
> spark-sql_2.11-2.3.1.jar
> spark-streaming_2.11-2.3.1.jar
> spark-tags_2.11-2.3.1.jar
> spark-unsafe_2.11-2.3.1.jar
> spark-yarn_2.11-2.3.1.jar
> 
> I checked make-distribution.sh, and it copies a bunch of JARs into the 
> distro, but does not seem to touch the kafka modules.
> 
> Am I crazy or missing something obvious -- those should be in the release, 
> right?
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Revisiting Online serving of Spark models?

2018-07-03 Thread Matei Zaharia
Just wondering, is there an update on this? I haven’t seen a summary of the 
offline discussion but maybe I’ve missed it.

Matei 

> On Jun 11, 2018, at 8:51 PM, Holden Karau  wrote:
> 
> So I kicked of a thread on user@ to collect people's feedback there but I'll 
> summarize the offline results later this week too.
> 
> On Tue, Jun 12, 2018, 5:03 AM Liang-Chi Hsieh  wrote:
> 
> Hi,
> 
> It'd be great if there can be any sharing of the offline discussion. Thanks!
> 
> 
> 
> Holden Karau wrote
> > We’re by the registration sign going to start walking over at 4:05
> > 
> > On Wed, Jun 6, 2018 at 2:43 PM Maximiliano Felice <
> 
> > maximilianofelice@
> 
> >> wrote:
> > 
> >> Hi!
> >>
> >> Do we meet at the entrance?
> >>
> >> See you
> >>
> >>
> >> El mar., 5 de jun. de 2018 3:07 PM, Nick Pentreath <
> >> 
> 
> > nick.pentreath@
> 
> >> escribió:
> >>
> >>> I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.
> >>>
> >>> On Sun, 3 Jun 2018 at 00:24 Holden Karau 
> 
> > holden@
> 
> >  wrote:
> >>>
>  On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <
>  
> 
> > maximilianofelice@
> 
> >> wrote:
> 
> > Hi!
> >
> > We're already in San Francisco waiting for the summit. We even think
> > that we spotted @holdenk this afternoon.
> >
>  Unless you happened to be walking by my garage probably not super
>  likely, spent the day working on scooters/motorcycles (my style is a
>  little
>  less unique in SF :)). Also if you see me feel free to say hi unless I
>  look
>  like I haven't had my first coffee of the day, love chatting with folks
>  IRL
>  :)
> 
> >
> > @chris, we're really interested in the Meetup you're hosting. My team
> > will probably join it since the beginning of you have room for us, and
> > I'll
> > join it later after discussing the topics on this thread. I'll send
> > you an
> > email regarding this request.
> >
> > Thanks
> >
> > El vie., 1 de jun. de 2018 7:26 AM, Saikat Kanjilal <
> > 
> 
> > sxk1969@
> 
> >> escribió:
> >
> >> @Chris This sounds fantastic, please send summary notes for Seattle
> >> folks
> >>
> >> @Felix I work in downtown Seattle, am wondering if we should a tech
> >> meetup around model serving in spark at my work or elsewhere close,
> >> thoughts?  I’m actually in the midst of building microservices to
> >> manage
> >> models and when I say models I mean much more than machine learning
> >> models
> >> (think OR, process models as well)
> >>
> >> Regards
> >>
> >> Sent from my iPhone
> >>
> >> On May 31, 2018, at 10:32 PM, Chris Fregly 
> 
> > chris@
> 
> >  wrote:
> >>
> >> Hey everyone!
> >>
> >> @Felix:  thanks for putting this together.  i sent some of you a
> >> quick
> >> calendar event - mostly for me, so i don’t forget!  :)
> >>
> >> Coincidentally, this is the focus of June 6th's *Advanced Spark and
> >> TensorFlow Meetup*
> >> https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/;
> >> @5:30pm
> >> on June 6th (same night) here in SF!
> >>
> >> Everybody is welcome to come.  Here’s the link to the meetup that
> >> includes the signup link:
> >> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/*
> >> https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/;
> >>
> >> We have an awesome lineup of speakers covered a lot of deep,
> >> technical
> >> ground.
> >>
> >> For those who can’t attend in person, we’ll be broadcasting live -
> >> and
> >> posting the recording afterward.
> >>
> >> All details are in the meetup link above…
> >>
> >> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
> >> welcome to give a talk. I can move things around to make room.
> >>
> >> @joseph:  I’d personally like an update on the direction of the
> >> Databricks proprietary ML Serving export format which is similar to
> >> PMML
> >> but not a standard in any way.
> >>
> >> Also, the Databricks ML Serving Runtime is only available to
> >> Databricks customers.  This seems in conflict with the community
> >> efforts
> >> described here.  Can you comment on behalf of Databricks?
> >>
> >> Look forward to your response, joseph.
> >>
> >> See you all soon!
> >>
> >> —
> >>
> >>
> >> *Chris Fregly *Founder @ *PipelineAI* https://pipeline.ai/;
> >> (100,000
> >> Users)
> >> Organizer @ *Advanced Spark and TensorFlow Meetup*
> >> https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/;
> >> (85,000
> >> Global Members)
> >>
> >>
> >>
> >> *San Francisco - Chicago - Austin -
> >> Washington DC - London - Dusseldorf *
> >> *Try our PipelineAI Community Edition 

Re: Beam's recent community development work

2018-07-02 Thread Matei Zaharia
I think telling people that they’re being considered as committers early on is 
a good idea, but AFAIK we’ve always had individual committers do that with 
contributors who were doing great work in various areas. We don’t have a 
centralized process for it though — it’s up to whoever wants to work with each 
contributor.

Matei

> On Jul 2, 2018, at 5:35 PM, Reynold Xin  wrote:
> 
> That's fair, and it's great to find high quality contributors. But I also 
> feel the two projects have very different backgrounds and maturity phases. 
> There are 1300+ contributors to Spark, and only 300 to Beam, with the vast 
> majority of contributions coming from a single company for Beam (based on my 
> cursory look at the two pages of commits on github). With the recent security 
> and correctness storms, I actually worry more about quality (which requires 
> more infrastructure) than about just people adding more code to the project.
> 
> 
> 
> On Mon, Jul 2, 2018 at 5:25 PM Holden Karau  wrote:
> As someone who floats a bit between both projects (as a contributor) I'd love 
> to see us adopt some of these techniques to be pro-active about growing our 
> committer-ship (I think perhaps we could do this by also moving some of the 
> newer committers into the PMC faster so there are more eyes out looking for 
> people to bring forward)?
> 
> On Mon, Jul 2, 2018 at 4:54 PM, Sean Owen  wrote:
> Worth, I think, a read and consideration from Spark folks. I'd be interested 
> in comments; I have a few reactions too.
> 
> 
> -- Forwarded message -
> From: Kenneth Knowles 
> Date: Sat, Jun 30, 2018 at 1:15 AM
> Subject: Beam's recent community development work
> To: , , Griselda Cuevas 
> , dev 
> 
> 
> Hi all,
> 
> The ASF board suggested that we (Beam) share some of what we've been doing 
> for community development with d...@community.apache.org and 
> memb...@apache.org. So here is a long description. I have included 
> d...@beam.apache.org because it is the subject, really, and this is & should 
> be all public knowledge.
> 
> We would love feedback! We based a lot of this on reading the community 
> project site, and probably could have learned even more with more study.
> 
> # Background
> 
> We face two problems in our contributor/committer-base:
> 
> 1. Not enough committers to review all the code being contributed, in part 
> due to recent departure of a few committers
> 2. We want our contributor-base (hence committer-base) to be more spread 
> across companies and backgrounds, for the usual Apache reasons. Our user base 
> is not active and varied enough to make this automatic. One solution is to 
> make the right software to get a varied user base, but that is a different 
> thread :-) so instead we have to work hard to build our community around the 
> software we have.
> 
> # What we did
> 
> ## Committer guidelines
> 
> We published committer guidelines [1] for transparency and as an invitation. 
> We start by emphasizing that there are many kinds of contributions, not just 
> code (we have committers from community development, tech writing, training, 
> etc). Then we have three aspects:
> 
> 1. ASF code of conduct
> 2. ASF committer responsibilities
> 3. Beam-specific committer responsibilities
> 
> The best way to understand is to follow the link at the bottom of this email. 
> The important part is that you shouldn't be proposing a committer for other 
> reasons, and you shouldn't be blocking a committer for other reasons.
> 
> ## Instead of just "[DISCUSS] Potential committer XYZ" we discuss every layer
> 
> Gris (CC'd) outlined this: people go through these phases of relationship 
> with our project:
> 
> 1. aware of it
> 2. interested in it / checking it out
> 3. using it for real
> 4. first-time contributor
> 5. repeat contributor
> 6. committer
> 7. PMC
> 
> As soon as we notice someone, like a user asking really deep questions, we 
> invite discussion on private@ on how we can move them to the next level of 
> engagement.
> 
> ## Monthly cadence
> 
> Every ~month, we call for new discussions and revisit ~all prior discussions. 
> This way we do not forget to keep up this effort.
> 
> ## Individual discussions
> 
> For each person we have a separate thread on private@. This ensures we have 
> quality focused discussions that lead to feedback. In collective discussions 
> that we used to do, we often didn't really come up with actionable feedback 
> and ended up not even contacting potential committers to encourage them. And 
> consensus was much less clear.
> 
> ## Feedback!
> 
> If someone is brought up for a discussion, that means they got enough 
> attention that we hope to engage them more. But unsolicited feedback is never 
> a good idea. For a potential committer, we did this:
> 
> 1. Send an email saying something like "you were discussed as a potential 
> committer - do you want to become one? do you want feedback?"
> 2. If they say yes (so far everyone) we send a few bullet points 

Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Matei Zaharia
Maybe your application is overriding the master variable when it creates its 
SparkContext. I see you are still passing “yarn-client” as an argument later to 
it in your command.
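
In other words, the likely culprit is a pattern like the following (shown in PySpark for brevity; the job here is a Scala jar, but the issue is identical): a master taken from a command-line argument silently wins over the --master flag given to spark-submit.

from pyspark.sql import SparkSession

# Problematic pattern: a master taken from the first CLI argument ("yarn-client" in
# the command below) overrides whatever --master was passed to spark-submit:
#   spark = SparkSession.builder.master(sys.argv[1]).appName("GetRevenuePerOrder").getOrCreate()

# Safer pattern: leave the master out of the code and let spark-submit control it.
spark = SparkSession.builder.appName("GetRevenuePerOrder").getOrCreate()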

> On Jun 17, 2018, at 11:53 AM, Raymond Xie  wrote:
> 
> Thank you Subhash.
> 
> Here is the new command:
> spark-submit --master local[*] --class retail_db.GetRevenuePerOrder --conf 
> spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client 
> /public/retail_db/order_items /home/rxie/output/revenueperorder
> 
> Still seeing the same issue here.
> 2018-06-17 11:51:25 INFO  RMProxy:98 - Connecting to ResourceManager at /0.0.0.0:8032
> 2018-06-17 11:51:27 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:28 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:29 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:30 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:31 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:32 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:33 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:34 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:35 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:36 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 
> 
> 
> 
> Sincerely yours,
> 
> 
> Raymond
> 
> On Sun, Jun 17, 2018 at 2:36 PM, Subhash Sriram  
> wrote:
> Hi Raymond,
> 
> If you set your master to local[*] instead of yarn-client, it should run on 
> your local machine.
> 
> Thanks,
> Subhash 
> 
> Sent from my iPhone
> 
> On Jun 17, 2018, at 2:32 PM, Raymond Xie  wrote:
> 
>> Hello,
>> 
>> I am wondering how can I run spark job in my environment which is a single 
>> Ubuntu host with no hadoop installed? if I run my job like below, I will end 
>> up with infinite loop at the end. Thank you very much.
>> 
>> rxie@ubuntu:~/data$ spark-submit --class retail_db.GetRevenuePerOrder --conf 
>> spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client 
>> /public/retail_db/order_items /home/rxie/output/revenueperorder
>> 2018-06-17 11:19:36 WARN  Utils:66 - Your hostname, ubuntu resolves to a 
>> loopback address: 127.0.1.1; using 192.168.112.141 instead (on interface 
>> ens33)
>> 2018-06-17 11:19:36 WARN  Utils:66 - Set SPARK_LOCAL_IP 

Re: time for Apache Spark 3.0?

2018-04-05 Thread Matei Zaharia
Oh, forgot to add, but splitting the source tree in Scala also creates the 
issue of a big maintenance burden for third-party libraries built on Spark. As 
Josh said on the JIRA:

"I think this is primarily going to be an issue for end users who want to use 
an existing source tree to cross-compile for Scala 2.10, 2.11, and 2.12. Thus 
the pain of the source incompatibility would mostly be felt by library/package 
maintainers but it can be worked around as long as there's at least some common 
subset which is source compatible across all of those versions.”

This means that all the data sources, ML algorithms, etc developed outside our 
source tree would have to do the same thing we do internally.

> On Apr 5, 2018, at 10:30 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> 
> Sorry, but just to be clear here, this is the 2.12 API issue: 
> https://issues.apache.org/jira/browse/SPARK-14643, with more details in this 
> doc: 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.
> 
> Basically, if we are allowed to change Spark’s API a little to have only one 
> version of methods that are currently overloaded between Java and Scala, we 
> can get away with a single source tree for all Scala versions and Java ABI 
> compatibility against any type of Spark (whether using Scala 2.11 or 2.12). 
> On the other hand, if we want to keep the API and ABI of the Spark 2.x 
> branch, we’ll need a different source tree for Scala 2.12 with different 
> copies of pretty large classes such as RDD, DataFrame and DStream, and Java 
> users may have to change their code when linking against different versions 
> of Spark.
> 
> This is of course only one of the possible ABI changes, but it is a 
> considerable engineering effort, so we’d have to sign up for maintaining all 
> these different source files. It seems kind of silly given that Scala 2.12 
> was released in 2016, so we’re doing all this work to keep ABI compatibility 
> for Scala 2.11, which isn’t even that widely used any more for new projects. 
> Also keep in mind that the next Spark release will probably take at least 3-4 
> months, so we’re talking about what people will be using in fall 2018.
> 
> Matei
> 
>> On Apr 5, 2018, at 10:13 AM, Marcelo Vanzin <van...@cloudera.com> wrote:
>> 
>> I remember seeing somewhere that Scala still has some issues with Java
>> 9/10 so that might be hard...
>> 
>> But on that topic, it might be better to shoot for Java 11
>> compatibility. 9 and 10, following the new release model, aren't
>> really meant to be long-term releases.
>> 
>> In general, agree with Sean here. Doesn't look like 2.12 support
>> requires unexpected API breakages. So unless there's a really good
>> reason to break / remove a bunch of existing APIs...
>> 
>> On Thu, Apr 5, 2018 at 9:04 AM, Marco Gaido <marcogaid...@gmail.com> wrote:
>>> Hi all,
>>> 
>>> I also agree with Mark that we should add Java 9/10 support to an eventual
>>> Spark 3.0 release, because supporting Java 9 is not a trivial task since we
>>> are using some internal APIs for the memory management which changed: either
>>> we find a solution which works on both (but I am not sure it is feasible) or
>>> we have to switch between 2 implementations according to the Java version.
>>> So I'd rather avoid doing this in a non-major release.
>>> 
>>> Thanks,
>>> Marco
>>> 
>>> 
>>> 2018-04-05 17:35 GMT+02:00 Mark Hamstra <m...@clearstorydata.com>:
>>>> 
>>>> As with Sean, I'm not sure that this will require a new major version, but
>>>> we should also be looking at Java 9 & 10 support -- particularly with 
>>>> regard
>>>> to their better functionality in a containerized environment (memory limits
>>>> from cgroups, not sysconf; support for cpusets). In that regard, we should
>>>> also be looking at using the latest Scala 2.11.x maintenance release in
>>>> current Spark branches.
>>>> 
>>>> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen <sro...@gmail.com> wrote:
>>>>> 
>>>>> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin <r...@databricks.com> wrote:
>>>>>> 
>>>>>> The primary motivating factor IMO for a major version bump is to support
>>>>>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>>>>>> Similar to Spark 2.0, I think there are also opportunities for other 
>>>>>> changes
>>>>>> that we know have been biting us for a long time but can’t be changed in

Re: time for Apache Spark 3.0?

2018-04-05 Thread Matei Zaharia
Sorry, but just to be clear here, this is the 2.12 API issue: 
https://issues.apache.org/jira/browse/SPARK-14643, with more details in this 
doc: 
https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.

Basically, if we are allowed to change Spark’s API a little to have only one 
version of methods that are currently overloaded between Java and Scala, we can 
get away with a single source tree for all Scala versions and Java ABI 
compatibility against any type of Spark (whether using Scala 2.11 or 2.12). On 
the other hand, if we want to keep the API and ABI of the Spark 2.x branch, 
we’ll need a different source tree for Scala 2.12 with different copies of 
pretty large classes such as RDD, DataFrame and DStream, and Java users may 
have to change their code when linking against different versions of Spark.

This is of course only one of the possible ABI changes, but it is a 
considerable engineering effort, so we’d have to sign up for maintaining all 
these different source files. It seems kind of silly given that Scala 2.12 was 
released in 2016, so we’re doing all this work to keep ABI compatibility for 
Scala 2.11, which isn’t even that widely used any more for new projects. Also 
keep in mind that the next Spark release will probably take at least 3-4 
months, so we’re talking about what people will be using in fall 2018.

Matei

> On Apr 5, 2018, at 10:13 AM, Marcelo Vanzin  wrote:
> 
> I remember seeing somewhere that Scala still has some issues with Java
> 9/10 so that might be hard...
> 
> But on that topic, it might be better to shoot for Java 11
> compatibility. 9 and 10, following the new release model, aren't
> really meant to be long-term releases.
> 
> In general, agree with Sean here. Doesn't look like 2.12 support
> requires unexpected API breakages. So unless there's a really good
> reason to break / remove a bunch of existing APIs...
> 
> On Thu, Apr 5, 2018 at 9:04 AM, Marco Gaido  wrote:
>> Hi all,
>> 
>> I also agree with Mark that we should add Java 9/10 support to an eventual
>> Spark 3.0 release, because supporting Java 9 is not a trivial task since we
>> are using some internal APIs for the memory management which changed: either
>> we find a solution which works on both (but I am not sure it is feasible) or
>> we have to switch between 2 implementations according to the Java version.
>> So I'd rather avoid doing this in a non-major release.
>> 
>> Thanks,
>> Marco
>> 
>> 
>> 2018-04-05 17:35 GMT+02:00 Mark Hamstra :
>>> 
>>> As with Sean, I'm not sure that this will require a new major version, but
>>> we should also be looking at Java 9 & 10 support -- particularly with regard
>>> to their better functionality in a containerized environment (memory limits
>>> from cgroups, not sysconf; support for cpusets). In that regard, we should
>>> also be looking at using the latest Scala 2.11.x maintenance release in
>>> current Spark branches.
>>> 
>>> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:
 
 On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:
> 
> The primary motivating factor IMO for a major version bump is to support
> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
> Similar to Spark 2.0, I think there are also opportunities for other 
> changes
> that we know have been biting us for a long time but can’t be changed in
> feature releases (to be clear, I’m actually not sure they are all good
> ideas, but I’m writing them down as candidates for consideration):
 
 
 IIRC from looking at this, it is possible to support 2.11 and 2.12
 simultaneously. The cross-build already works now in 2.3.0. Barring some 
 big
 change needed to get 2.12 fully working -- and that may be the case -- it
 nearly works that way now.
 
 Compiling vs 2.11 and 2.12 does however result in some APIs that differ
 in byte code. However Scala itself isn't mutually compatible between 2.11
 and 2.12 anyway; that's never been promised as compatible.
 
 (Interesting question about what *Java* users should expect; they would
 see a difference in 2.11 vs 2.12 Spark APIs, but that has always been 
 true.)
 
 I don't disagree with shooting for Spark 3.0, just saying I don't know if
 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
 2.11 support if needed to make supporting 2.12 less painful.
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> Marcelo
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: time for Apache Spark 3.0?

2018-04-05 Thread Matei Zaharia
Java 9/10 support would be great to add as well.

Regarding Scala 2.12, I thought that supporting it would become easier if we 
change the Spark API and ABI slightly. Basically, it is of course possible to 
create an alternate source tree today, but it might be possible to share the 
same source files if we tweak some small things in the methods that are 
overloaded across Scala and Java. I don’t remember the exact details, but the 
idea was to reduce the total maintenance work needed at the cost of requiring 
users to recompile their apps.

I’m personally for moving to 3.0 because of the other things we can clean up as 
well, e.g. the default SQL dialect, Iterable stuff, and possibly dependency 
shading (a major pain point for lots of users). It’s also a chance to highlight 
Kubernetes, continuous processing and other features more if they become “GA".

Matei

> On Apr 5, 2018, at 9:04 AM, Marco Gaido  wrote:
> 
> Hi all,
> 
> I also agree with Mark that we should add Java 9/10 support to an eventual 
> Spark 3.0 release, because supporting Java 9 is not a trivial task since we 
> are using some internal APIs for the memory management which changed: either 
> we find a solution which works on both (but I am not sure it is feasible) or 
> we have to switch between 2 implementations according to the Java version.
> So I'd rather avoid doing this in a non-major release.
> 
> Thanks,
> Marco
> 
> 
> 2018-04-05 17:35 GMT+02:00 Mark Hamstra :
> As with Sean, I'm not sure that this will require a new major version, but we 
> should also be looking at Java 9 & 10 support -- particularly with regard to 
> their better functionality in a containerized environment (memory limits from 
> cgroups, not sysconf; support for cpusets). In that regard, we should also be 
> looking at using the latest Scala 2.11.x maintenance release in current Spark 
> branches.
> 
> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:
> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:
> The primary motivating factor IMO for a major version bump is to support 
> Scala 2.12, which requires minor API breaking changes to Spark’s APIs. 
> Similar to Spark 2.0, I think there are also opportunities for other changes 
> that we know have been biting us for a long time but can’t be changed in 
> feature releases (to be clear, I’m actually not sure they are all good ideas, 
> but I’m writing them down as candidates for consideration):
> 
> IIRC from looking at this, it is possible to support 2.11 and 2.12 
> simultaneously. The cross-build already works now in 2.3.0. Barring some big 
> change needed to get 2.12 fully working -- and that may be the case -- it 
> nearly works that way now.
> 
> Compiling vs 2.11 and 2.12 does however result in some APIs that differ in 
> byte code. However Scala itself isn't mutually compatible between 2.11 and 
> 2.12 anyway; that's never been promised as compatible.
> 
> (Interesting question about what *Java* users should expect; they would see a 
> difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
> 
> I don't disagree with shooting for Spark 3.0, just saying I don't know if 
> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping 
> 2.11 support if needed to make supporting 2.12 less painful.
> 
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Matei Zaharia
Welcome, Zhenhua!

Matei

> On Apr 1, 2018, at 10:28 PM, Wenchen Fan  wrote:
> 
> Hi all,
> 
> The Spark PMC recently added Zhenhua Wang as a committer on the project. 
> Zhenhua is the major contributor of the CBO project, and has been 
> contributing across several areas of Spark for a while, focusing especially 
> on the analyzer and optimizer in Spark SQL. Please join me in welcoming Zhenhua!
> 
> Wenchen


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Welcoming some new committers

2018-03-02 Thread Matei Zaharia
Hi everyone,

The Spark PMC has recently voted to add several new committers to the project, 
based on their contributions to Spark 2.3 and other past work:

- Anirudh Ramanathan (contributor to Kubernetes support)
- Bryan Cutler (contributor to PySpark and Arrow support)
- Cody Koeninger (contributor to streaming and Kafka support)
- Erik Erlandson (contributor to Kubernetes support)
- Matt Cheah (contributor to Kubernetes support and other parts of Spark)
- Seth Hendrickson (contributor to MLlib and PySpark)

Please join me in welcoming Anirudh, Bryan, Cody, Erik, Matt and Seth as 
committers!

Matei
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Please keep s3://spark-related-packages/ alive

2018-02-27 Thread Matei Zaharia
For Flintrock, have you considered using a Requester Pays bucket? That way 
you’d get the availability of S3 without having to foot the bill for bandwidth 
yourself (which was the bulk of the cost for the old bucket).
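For illustration (not part of the original message), a minimal boto3 sketch of what a Requester Pays download could look like for Flintrock-style tooling; the object key is assumed, and whether the bucket is actually configured as Requester Pays is hypothetical:

    # Sketch: download a Spark package from a Requester Pays bucket,
    # with the caller (not the bucket owner) paying for the bandwidth.
    import boto3

    s3 = boto3.client("s3")
    s3.download_file(
        Bucket="spark-related-packages",          # bucket named in this thread
        Key="spark-2.2.1-bin-hadoop2.7.tgz",      # assumed key name
        Filename="spark-2.2.1-bin-hadoop2.7.tgz",
        ExtraArgs={"RequestPayer": "requester"},  # caller pays the transfer cost
    )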

Matei

> On Feb 27, 2018, at 4:35 PM, Nicholas Chammas  
> wrote:
> 
> So is there no hope for this S3 bucket, or room to replace it with a bucket 
> owned by some organization other than AMPLab (which is technically now 
> defunct, I guess)? Sorry to persist, but I just have to ask.
> 
> On Tue, Feb 27, 2018 at 10:36 AM Michael Heuer  wrote:
> On Tue, Feb 27, 2018 at 8:17 AM, Sean Owen  wrote:
> See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-d3kbcqa49mib13-cloudfront-net-td22427.html
>  -- it was 'retired', yes.
> 
> Agree with all that, though they're intended for occasional individual use 
> and not a case where performance and uptime matter. For that, I think you'd 
> want to just host your own copy of the bits you need. 
> 
> The notional problem was that the S3 bucket wasn't obviously 
> controlled/blessed by the ASF and yet was a source of official bits. It was 
> another set of third-party credentials to hand around to release managers, 
> which was IIRC a little problematic.
> 
> Homebrew does host distributions of ASF projects, like Spark, FWIW. 
> 
> To clarify, the apache-spark.rb formula in Homebrew uses the Apache mirror 
> closer.lua script
> 
> https://github.com/Homebrew/homebrew-core/blob/master/Formula/apache-spark.rb#L4
> 
>michael
> 
>  
> On Mon, Feb 26, 2018 at 10:57 PM Nicholas Chammas 
>  wrote:
> If you go to the Downloads page and download Spark 2.2.1, you’ll get a link 
> to an Apache mirror. It didn’t use to be this way. As recently as Spark 
> 2.2.0, downloads were served via CloudFront, which was backed by an S3 bucket 
> named spark-related-packages.
> 
> It seems that we’ve stopped using CloudFront, and the S3 bucket behind it has 
> stopped receiving updates (e.g. Spark 2.2.1 isn’t there). I’m guessing this 
> is part of an effort to use the Apache mirror network, like other Apache 
> projects do.
> 
> From a user perspective, the Apache mirror network is several steps down from 
> using a modern CDN. Let me summarize why:
> 
>   • Apache mirrors are often slow. Apache does not impose any performance 
> requirements on its mirrors. The difference between getting a good mirror and 
> a bad one means downloading Spark in less than a minute vs. 20 minutes. The 
> problem is so bad that I’ve thought about adding an Apache mirror blacklist 
> to Flintrock to avoid getting one of these dud mirrors.
>   • Apache mirrors are inconvenient to use. When you download something 
> from an Apache mirror, you get a link like this one. Instead of automatically 
> redirecting you to your download, though, you need to process the results you 
> get back to find your download target. And you need to handle the high 
> download failure rate, since sometimes the mirror you get doesn’t have the 
> file it claims to have.
>   • Apache mirrors are incomplete. Apache mirrors only keep around the 
> latest releases, save for a few “archive” mirrors, which are often slow. So 
> if you want to download anything but the latest version of Spark, you are out 
> of luck.
> Some of these problems can be mitigated by picking a specific mirror that 
> works well and hardcoding it in your scripts, but that defeats the purpose of 
> dynamically selecting a mirror and makes you a “bad” user of the mirror 
> network.
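For illustration (not from the original message), a hedged Python sketch of the mirror-resolution dance described above: ask the ASF closer.lua resolver for a preferred mirror and fall back to another candidate if the file is missing. The as_json parameter and the "preferred", "http" and "path_info" response fields are assumptions about the resolver's behavior rather than a documented contract:

    import json
    import urllib.error
    import urllib.request

    PATH = "spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz"
    RESOLVER = "https://www.apache.org/dyn/closer.lua?path=" + PATH + "&as_json=1"

    with urllib.request.urlopen(RESOLVER) as resp:
        info = json.load(resp)

    # Try the resolver's preferred mirror first, then any other HTTP mirrors.
    mirrors = [info["preferred"]] + info.get("http", [])
    for base in mirrors:
        url = base.rstrip("/") + "/" + info.get("path_info", PATH)
        try:
            urllib.request.urlretrieve(url, "spark-2.2.1-bin-hadoop2.7.tgz")
            break
        except urllib.error.HTTPError:
            continue  # this mirror doesn't carry the file; try the next one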
> 
> I raised some of these issues over on INFRA-10999. The ticket sat for a year 
> before I heard anything back, and the bottom line was that none of the above 
> problems have a solution on the horizon. It’s fine. I understand that Apache 
> is a volunteer organization and that the infrastructure team has a lot to 
> manage as it is. I still find it disappointing that an organization of 
> Apache’s stature doesn’t have a better solution for this in collaboration 
> with a third party. Python serves PyPI downloads using Fastly and Homebrew 
> serves packages using Bintray. They both work really, really well. Why don’t 
> we have something as good for Apache projects? Anyway, that’s a separate 
> discussion.
> 
> What I want to say is this:
> 
> Dear whoever owns the spark-related-packages S3 bucket,
> 
> Please keep the bucket up-to-date with the latest Spark releases, alongside 
> the past releases that are already on there. It’s a huge help to the 
> Flintrock project, and it’s an equally big help to those of us writing 
> infrastructure automation scripts that deploy Spark in other contexts.
> 
> I understand that hosting this stuff is not free, and that I am not paying 
> anything for this service. If it needs to go, so be it. But I wanted to take 
> this opportunity to lay out the benefits I’ve enjoyed thanks to having this 
> bucket around, 

Re: Spark 3

2018-01-20 Thread Matei Zaharia
We should only make breaking changes when we have a strong reason to do so — 
otherwise, it’s fine to stay on 2.x for a while. For example, maybe there’s a 
way to support Hadoop 3.0 from Spark 2.x as well. So far, none of the JIRAs 
targeting 3.0 seem that compelling, though I could be missing something. The 
most serious ones are probably the ones regarding dependencies that we’re 
forced to pull in — it would be great to minimize those.

Matei

> On Jan 19, 2018, at 10:26 AM, Reynold Xin  wrote:
> 
> We can certainly provide a build for Scala 2.12, even in 2.x.
> 
> 
> On Fri, Jan 19, 2018 at 10:17 AM, Justin Miller 
>  wrote:
> Would that mean supporting both 2.12 and 2.11? Could be a while before some 
> of our libraries are off of 2.11.
> 
> Thanks,
> Justin
> 
> 
>> On Jan 19, 2018, at 10:53 AM, Koert Kuipers  wrote:
>> 
>> i was expecting to be able to move to scala 2.12 sometime this year
>> 
>> if this cannot be done in spark 2.x then that could be a compelling reason 
>> to move spark 3 up to 2018 i think
>> 
>> hadoop 3 sounds great but personally i have no use case for it yet
>> 
>> On Fri, Jan 19, 2018 at 12:31 PM, Sean Owen  wrote:
>> Forking this thread to muse about Spark 3. Like Spark 2, I assume it would 
>> be more about making all those accumulated breaking changes and updating 
>> lots of dependencies. Hadoop 3 looms large in that list as well as Scala 
>> 2.12.
>> 
>> Spark 1 was released in May 2014, and Spark 2 in July 2016. If Spark 2.3 is 
>> out in Feb 2018 and it takes the now-usual 6 months until a next release, 
>> Spark 3 could reasonably be next.
>> 
>> However the release cycles are naturally slowing down, and it could also be 
>> said that 2019 would be more on schedule for Spark 3.
>> 
>> Nothing particularly urgent about deciding, but I'm curious if anyone had an 
>> opinion on whether to move on to Spark 3 next or just continue with 2.4 
>> later this year.
>> 
>> On Fri, Jan 19, 2018 at 11:13 AM Sean Owen  wrote:
>> Yeah, if users are using Kryo directly, they should be insulated from a 
>> Spark-side change because of shading.
>> However this also entails updating (unshaded) Chill from 0.8.x to 0.9.x. I 
>> am not sure if that causes problems for apps.
>> 
>> Normally I'd avoid any major-version change in a minor release. This one 
>> looked potentially entirely internal.
>> I think if there are any doubts, we can leave it for Spark 3. There was a 
>> bug report that needed a fix from Kryo 4, but it might be minor after all.
>> 
>> 
> 
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Timeline for Spark 2.3

2017-11-09 Thread Matei Zaharia
I’m also +1 on extending this to get Kubernetes and other features in.

Matei

> On Nov 9, 2017, at 4:04 PM, Anirudh Ramanathan  
> wrote:
> 
> This would help the community on the Kubernetes effort quite a bit - giving 
> us additional time for reviews and testing for the 2.3 release.
> 
> On Thu, Nov 9, 2017 at 3:56 PM, Justin Miller  
> wrote:
> That sounds fine to me. I’m hoping that this ticket can make it into Spark 
> 2.3: https://issues.apache.org/jira/browse/SPARK-18016
> 
> It’s causing some pretty considerable problems when we alter the columns to 
> be nullable, but we are OK for now without that.
> 
> Best,
> Justin
> 
>> On Nov 9, 2017, at 4:54 PM, Michael Armbrust  wrote:
>> 
>> According to the timeline posted on the website, we are nearing branch cut 
>> for Spark 2.3.  I'd like to propose pushing this out towards mid to late 
>> December for a couple of reasons and would like to hear what people think.
>> 
>> 1. I've done release management during the Thanksgiving / Christmas time 
>> before and in my experience, we don't actually get a lot of testing during 
>> this time due to vacations and other commitments. I think beginning the RC 
>> process in early January would give us the best coverage in the shortest 
>> amount of time.
>> 2. There are several large initiatives in progress that given a little more 
>> time would leave us with a much more exciting 2.3 release. Specifically, the 
>> work on the history server, Kubernetes and continuous processing.
>> 3. Given the actual release date of Spark 2.2, I think we'll still get Spark 
>> 2.3 out roughly 6 months after.
>> 
>> Thoughts?
>> 
>> Michael
> 
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-03 Thread Matei Zaharia
+1 from me too.

Matei

> On Nov 3, 2017, at 4:59 PM, Wenchen Fan  wrote:
> 
> +1.
> 
> I think this architecture makes a lot of sense to let executors talk to 
> source/sink directly, and bring very low latency.
> 
> On Thu, Nov 2, 2017 at 9:01 AM, Sean Owen  wrote:
> +0 simply because I don't feel I know enough to have an opinion. I have no 
> reason to doubt the change though, from a skim through the doc.
> 
> 
> On Wed, Nov 1, 2017 at 3:37 PM Reynold Xin  wrote:
> Earlier I sent out a discussion thread for CP in Structured Streaming:
> 
> https://issues.apache.org/jira/browse/SPARK-20928
> 
> It is meant to be a very small, surgical change to Structured Streaming to 
> enable ultra-low latency. This is great timing because we are also designing 
> and implementing data source API v2. If designed properly, we can have the 
> same data source API working for both streaming and batch.
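For illustration (not part of the SPIP text), a hedged sketch of what continuous processing ended up looking like from the user-facing Python API in Spark 2.3; the trigger(continuous=...) parameter and the rate/console source and sink choices are assumptions relative to this proposal:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("continuous-sketch").getOrCreate()

    # A toy unbounded source; "rate" emits rows with a timestamp and a value.
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    query = (
        events.writeStream
        .format("console")
        .option("checkpointLocation", "/tmp/continuous-sketch-checkpoint")
        # Continuous mode: the interval is a checkpoint interval, not a micro-batch size.
        .trigger(continuous="1 second")
        .start()
    )
    query.awaitTermination()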
> 
> 
> Following the SPIP process, I'm putting this SPIP up for a vote.
> 
> +1: Let's go ahead and design / implement the SPIP.
> +0: Don't really care.
> -1: I do not think this is a good idea for the following reasons.
> 
> 
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark 1.x - End of life

2017-10-19 Thread Matei Zaharia
Hi Ismael,

It depends on what you mean by “support”. In general, there won’t be new 
feature releases for 1.X (e.g. Spark 1.7) because all the new features are 
being added to the master branch. However, there is always room for bug fix 
releases if there is a catastrophic bug, and committers can make those at any 
time. In general though, I’d recommend moving workloads to Spark 2.x. We tried 
to make the migration as easy as possible (a few APIs changed, but not many), 
and 2.x has been out for a long time now and is widely used.

We should perhaps write a more explicit maintenance policy, but all of this is 
run based on what committers want to work on; if someone thinks that there’s a 
serious enough issue in 1.6 to update it, they can put together a new release. 
It does help to hear from users about this though, e.g. if you think there’s a 
significant issue that people are missing.

Matei

> On Oct 19, 2017, at 5:20 AM, Ismaël Mejía  wrote:
> 
> Hello,
> 
> I noticed that some of the (Big Data / Cloud Managed) Hadoop
> distributions are starting to (phase out / deprecate) Spark 1.x and I
> was wondering if the Spark community has already decided when will it
> end the support for Spark 1.x. I ask this also considering that the
> latest release in the series is already almost one year old. Any idea
> on this ?
> 
> Thanks,
> Ismaël
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Welcoming Tejas Patil as a Spark committer

2017-09-29 Thread Matei Zaharia
Hi all,

The Spark PMC recently added Tejas Patil as a committer on the
project. Tejas has been contributing across several areas of Spark for
a while, focusing especially on scalability issues and SQL. Please
join me in welcoming Tejas!

Matei

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-23 Thread Matei Zaharia
+1; we should consider something similar for multi-dimensional tensors too.

Matei

> On Sep 23, 2017, at 7:27 AM, Yanbo Liang  wrote:
> 
> +1
> 
> On Sat, Sep 23, 2017 at 7:08 PM, Noman Khan  wrote:
> +1 
> 
> Regards 
> Noman 
> From: Denny Lee 
> Sent: Friday, September 22, 2017 2:59:33 AM
> To: Apache Spark Dev; Sean Owen; Tim Hunter
> Cc: Danil Kirsanov; Joseph Bradley; Reynold Xin; Sudarshan Sudarshan
> Subject: Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
>  
> +1 
> 
> On Thu, Sep 21, 2017 at 11:15 Sean Owen  wrote:
> Am I right that this doesn't mean other packages would use this 
> representation, but that they could?
> 
> The representation looked fine to me w.r.t. what DL frameworks need.
> 
> My previous comment was that this is actually quite lightweight. It's kind of 
> like how I/O support is provided for CSV and JSON, so makes enough sense to 
> add to Spark. It doesn't really preclude other solutions.
> 
> For those reasons I think it's fine. +1
> 
> On Thu, Sep 21, 2017 at 6:32 PM Tim Hunter  wrote:
> Hello community,
> 
> I would like to call for a vote on SPARK-21866. It is a short proposal that 
> has important applications for image processing and deep learning. Joseph 
> Bradley has offered to be the shepherd.
> 
> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
> PDF version: 
> https://issues.apache.org/jira/secure/attachment/12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf
> 
> Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
>   • BigDL
>   • DeepLearning4J
>   • Deep Learning Pipelines
>   • MMLSpark
>   • TensorFlow (Spark connector)
>   • TensorFlowOnSpark
>   • TensorFrames
>   • Thunder
> Goals:
>   • Simple representation of images in Spark DataFrames, based on 
> pre-existing industrial standards (OpenCV)
>   • This format should eventually allow the development of 
> high-performance integration points with image processing libraries such as 
> libOpenCV, Google TensorFlow, CNTK, and other C libraries.
>   • The reader should be able to read popular formats of images from 
> distributed sources.
> Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
>   • the total size of an image should be restricted to less than 2GB 
> (roughly)
>   • the meaning of color channels is application-specific and is not 
> mandated by the standard (in line with the OpenCV standard)
>   • specialized formats used in meteorology, the medical field, etc. are 
> not supported
>   • this format is specialized to images and does not attempt to solve 
> the more general problem of representing n-dimensional tensors in Spark.

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-11 Thread Matei Zaharia
+1 (binding)

> On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon  wrote:
> 
> +1 (non-binding)
> 
> 
> 2017-09-12 9:52 GMT+09:00 Yin Huai :
> +1
> 
> On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal  wrote:
> +1 (non-binding)
> 
> On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler  wrote:
> +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's fine 
> to work out the minor details of the API during review.
> 
> Bryan
> 
> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN  wrote:
> Hi all,
> 
> Thank you for voting and suggestions.
> 
> As Wenchen mentioned and also we're discussing at JIRA, we need to discuss 
> the size hint for the 0-parameter UDF.
> But I believe we got a consensus about the basic APIs except for the size 
> hint, I'd like to submit a pr based on the current proposal and continue 
> discussing in its review.
> 
> https://github.com/apache/spark/pull/19147
> 
> I'd keep this vote open to wait for more opinions.
> 
> Thanks.
> 
> 
> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan  wrote:
> +1 on the design and proposed API.
> 
> One detail I'd like to discuss is the 0-parameter UDF, how we can specify the 
> size hint. This can be done in the PR review though.
> 
> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung  
> wrote:
> +1 on this and like the suggestion of type in string form.
> 
> Would it be correct to assume there will be data type check, for example the 
> returned pandas data frame column data types match what is specified. We 
> have seen quite a bit of issues/confusions with that in R.
> 
> Would it make sense to have a more generic decorator name so that it could 
> also be useable for other efficient vectorized format in the future? Or do we 
> anticipate the decorator to be format specific and will have more in the 
> future?
> 
> From: Reynold Xin 
> Sent: Friday, September 1, 2017 5:16:11 AM
> To: Takuya UESHIN
> Cc: spark-dev
> Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>  
> Ok, thanks.
> 
> +1 on the SPIP for scope etc
> 
> 
> On API details (will deal with in code reviews as well but leaving a note 
> here in case I forget)
> 
> 1. I would suggest having the API also accept data type specification in 
> string form. It is usually simpler to say "long" than "LongType()". 
> 
> 2. Think about what error message to show when the number of rows doesn't match 
> at runtime. 
> 
> 
> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN  wrote:
> Yes, the aggregation is out of scope for now.
> I think we should continue discussing the aggregation at JIRA and we will be 
> adding those later separately.
> 
> Thanks.
> 
> 
> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin  wrote:
> Is the idea aggregate is out of scope for the current effort and we will be 
> adding those later?
> 
> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN  wrote:
> Hi all,
> 
> We've been discussing to support vectorized UDFs in Python and we almost got 
> a consensus about the APIs, so I'd like to summarize and call for a vote.
> 
> Note that this vote should focus on APIs for vectorized UDFs, not APIs for 
> vectorized UDAFs or Window operations.
> 
> https://issues.apache.org/jira/browse/SPARK-21190
> 
> 
> Proposed API
> 
> We introduce a @pandas_udf decorator (or annotation) to define vectorized 
> UDFs which takes one or more pandas.Series or one integer value meaning the 
> length of the input value for 0-parameter UDFs. The return value should be 
> pandas.Series of the specified type and the length of the returned value 
> should be the same as input value.
> 
> We can define vectorized UDFs as:
> 
>   @pandas_udf(DoubleType())
>   def plus(v1, v2):
>   return v1 + v2
> 
> or we can define as:
> 
>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
> 
> We can use it similar to row-by-row UDFs:
> 
>   df.withColumn('sum', plus(df.v1, df.v2))
> 
> As for 0-parameter UDFs, we can define and use as:
> 
>   @pandas_udf(LongType())
>   def f0(size):
>   return pd.Series(1).repeat(size)
> 
>   df.select(f0())
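For illustration (not from the original message), a self-contained sketch that runs the proposed scalar API end to end; the import locations (pyspark.sql.functions.pandas_udf, pyspark.sql.types.DoubleType) reflect where the API later landed and are assumptions relative to the proposal text, and PyArrow is assumed to be installed:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("pandas-udf-sketch").getOrCreate()
    df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["v1", "v2"])

    @pandas_udf(DoubleType())
    def plus(v1, v2):
        # v1 and v2 arrive as pandas.Series covering a whole Arrow batch,
        # and the returned Series must have the same length.
        return v1 + v2

    df.withColumn("sum", plus(df.v1, df.v2)).show()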
> 
> 
> 
> The vote will be up for the next 72 hours. Please reply with your vote:
> 
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical 
> reasons.
> 
> Thanks!
> 
> -- 
> Takuya UESHIN
> Tokyo, Japan
> 
> http://twitter.com/ueshin
> 
> 
> 
> -- 
> Takuya UESHIN
> Tokyo, Japan
> 
> http://twitter.com/ueshin
> 
> 
> 
> 
> -- 
> Takuya UESHIN
> Tokyo, Japan
> 
> http://twitter.com/ueshin
> 
> 
> 
> 
> -- 
> Sameer Agarwal
> Software Engineer | Databricks Inc.
> http://cs.berkeley.edu/~sameerag
> 
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-09-04 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152915#comment-16152915
 ] 

Matei Zaharia commented on SPARK-21866:
---

Just to chime in on this, I've also seen feedback that the deep learning 
libraries for Spark are too fragmented: there are too many of them, and people 
don't know where to start. This standard representation would at least give 
them a clear way to interoperate. It would let people write separate libraries 
for image processing, data augmentation and then training for example.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
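The quoted description is cut off here. For illustration (not part of the JIRA text), a rough PySpark sketch of the kind of DataFrame schema the SPIP describes; only the "mode" field appears in the quoted fragment, so the remaining field names and types are assumptions based on the OpenCV-style description:

    from pyspark.sql.types import (
        BinaryType, IntegerType, StringType, StructField, StructType
    )

    image_schema = StructType([
        StructField("mode", StringType(), False),       # OpenCV type string, e.g. "CV_8UC3"
        StructField("origin", StringType(), True),      # source URI of the image (assumed)
        StructField("height", IntegerType(), False),    # height in pixels (assumed)
        StructField("width", IntegerType(), False),     # width in pixels (assumed)
        StructField("nChannels", IntegerType(), False), # number of color channels (assumed)
        StructField("data", BinaryType(), False),       # decompressed pixel bytes (assumed)
    ])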

Re: Moving Scala 2.12 forward one step

2017-09-01 Thread Matei Zaharia
That would be awesome. I’m not sure whether we want 3.0 to be right after 2.3 
(I guess this Scala issue is one reason to start discussing that), but even if 
we do, I imagine that wouldn’t be out for at least 4-6 more months after 2.3, 
and that’s a long time to go without Scala 2.12 support. If we decide to do 2.4 
next instead, that’s even longer.

Matei

> On Sep 1, 2017, at 1:52 AM, Sean Owen <so...@cloudera.com> wrote:
> 
> OK, what I'll do is focus on some changes that can be merged to master 
> without impacting the 2.11 build (e.g. putting kafka-0.8 behind a profile, 
> maybe, or adding the 2.12 REPL). Anything that is breaking, we can work on in 
> a series of open PRs, or maybe a branch, yea. It's unusual but might be 
> worthwhile.
> 
> On Fri, Sep 1, 2017 at 9:44 AM Matei Zaharia <matei.zaha...@gmail.com> wrote:
> If the changes aren’t that hard, I think we should also consider building a 
> Scala 2.12 version of Spark 2.3 in a separate branch. I’ve definitely seen 
> concerns from some large Scala users that Spark isn’t supporting 2.12 soon 
> enough. I thought SPARK-14220 was blocked mainly because the changes are 
> hard, but if not, maybe we can release such a branch sooner.
> 
> Matei
> 
> > On Aug 31, 2017, at 3:59 AM, Sean Owen <so...@cloudera.com> wrote:
> >
> > I don't think there's a target. The changes aren't all that hard (see the 
> > SPARK-14220 umbrella) but there are some changes that are hard or 
> > impossible without changing key APIs, as far as we can see. That would 
> > suggest 3.0.
> >
> > One motivation I have here for getting it as far as possible otherwise is 
> > so people could, if they wanted, create a 2.12 build themselves without 
> > much work even if it were not supported upstream. This particular change is 
> > a lot of the miscellaneous stuff you'd have to fix to get to that point.
> >
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Moving Scala 2.12 forward one step

2017-09-01 Thread Matei Zaharia
If the changes aren’t that hard, I think we should also consider building a 
Scala 2.12 version of Spark 2.3 in a separate branch. I’ve definitely seen 
concerns from some large Scala users that Spark isn’t supporting 2.12 soon 
enough. I thought SPARK-14220 was blocked mainly because the changes are hard, 
but if not, maybe we can release such a branch sooner.

Matei

> On Aug 31, 2017, at 3:59 AM, Sean Owen  wrote:
> 
> I don't think there's a target. The changes aren't all that hard (see the 
> SPARK-14220 umbrella) but there are some changes that are hard or impossible 
> without changing key APIs, as far as we can see. That would suggest 3.0.
> 
> One motivation I have here for getting it as far as possible otherwise is so 
> people could, if they wanted, create a 2.12 build themselves without much 
> work even if it were not supported upstream. This particular change is a lot 
> of the miscellaneous stuff you'd have to fix to get to that point.
> 
> On Thu, Aug 31, 2017 at 11:04 AM Saisai Shao  wrote:
> Hi Sean,
> 
> Do we have a planned target version for Scala 2.12 support? Several other 
> projects like Zeppelin, Livy which rely on Spark repl also require changes to 
> support this Scala 2.12.
> 
> Thanks
> Jerry
> 
> On Thu, Aug 31, 2017 at 5:55 PM, Sean Owen  wrote:
> No, this doesn't let Spark build and run on 2.12. It makes changes that will 
> be required though, the ones that are really no loss to the current 2.11 
> build.
> 
> 
> On Thu, Aug 31, 2017, 10:48 Denis Bolshakov  wrote:
> Hello,
> 
> Sounds amazing. Is there any improvements in benchmarks?
> 
> 
> On 31 August 2017 at 12:25, Sean Owen  wrote:
> Calling attention to the question of Scala 2.12 again for moment. I'd like to 
> make a modest step towards support. Have a look again, if you would, at 
> SPARK-14280:
> 
> https://github.com/apache/spark/pull/18645
> 
> This is a lot of the change for 2.12 that doesn't break 2.11, and really 
> doesn't add any complexity. It's mostly dependency updates and clarifying 
> some code. Other items like dealing with Kafka 0.8 support, the 2.12 REPL, 
> etc, are not  here.
> 
> So, this still doesn't result in a working 2.12 build but it's most of the 
> miscellany that will be required.
> 
> I'd like to merge it but wanted to flag it for feedback as it's not trivial.
> 
> 
> 
> -- 
> //with Best Regards
> --Denis Bolshakov
> e-mail: bolshakov.de...@gmail.com
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[jira] [Updated] (SPARK-18278) SPIP: Support native submission of spark jobs to a kubernetes cluster

2017-08-29 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-18278:
--
Labels: SPIP  (was: )

> SPIP: Support native submission of spark jobs to a kubernetes cluster
> -
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
>  Labels: SPIP
> Attachments: SPARK-18278 Spark on Kubernetes Design Proposal Revision 
> 2 (1).pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executor lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark

2017-08-29 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-21866:
--
Labels: SPIP  (was: )

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact chann

Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread Matei Zaharia
Hi everyone,

The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai has 
been contributing to many areas of the project for a long time, so it’s great 
to see him join. Join me in thanking and congratulating him!

Matei
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



  1   2   3   4   5   6   7   8   9   10   >