Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Reynold Xin
+1

On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale
 wrote:

> +1
>
> On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun 
> wrote:
>
>> FYI, there is a proposal to drop Python 3.8 because its EOL is October
>> 2024.
>>
>> https://github.com/apache/spark/pull/46228
>> [SPARK-47993][PYTHON] Drop Python 3.8
>>
>> Since Python 3.8 is still alive and its lifecycle will overlap with Apache
>> Spark 4.0.0, please give us your feedback on the PR if you have any
>> concerns.
>>
>> From my side, I agree with this decision.
>>
>> Thanks,
>> Dongjoon.
>>
>




Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Reynold Xin
One of the problems in the past when something like this was brought up was that 
the ASF couldn't officially bless venues beyond the already approved 
ones. So that's something to look into.

Now of course you are welcome to run unofficial things unblessed as long as 
they follow trademark rules.

On Mon, Mar 18, 2024 at 1:53 PM, Mich Talebzadeh < mich.talebza...@gmail.com > 
wrote:

> 
> Well, as long as it works.
> 
> Please all check this link from Databricks and let us know your thoughts.
> Will something similar work for us? Of course Databricks has much deeper
> pockets than our ASF community. Will it require moderation on our side to
> block spam and nutcases?
> 
> 
> 
> Knowledge Sharing Hub - Databricks (
> https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub
> )
> 
> 
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
> 
> view my Linkedin profile ( https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/ )
> https://en.everybodywiki.com/Mich_Talebzadeh
> 
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, to quote Wernher von Braun
> ( https://en.wikipedia.org/wiki/Wernher_von_Braun ): "one test result is
> worth one thousand expert opinions".
> 
> On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen < bjornjorgen...@gmail.com > wrote:
> 
>> something like this Spark community · GitHub ( https://github.com/Spark-community )
>> 
>> On Mon, 18 Mar 2024 at 17:26, Parsian, Mahmoud < mpars...@illumina.com.invalid > wrote:
>> 
>>> Good idea. Will be useful
>>> 
>>> +1
>>> 
>>> *From:* ashok34...@yahoo.com.INVALID < ashok34...@yahoo.com.INVALID >
>>> *Date:* Monday, March 18, 2024 at 6:36 AM
>>> *To:* user @spark < u...@spark.apache.org >, Spark dev list < dev@spark.apache.org >, Mich Talebzadeh < mich.talebza...@gmail.com >
>>> *Cc:* Matei Zaharia < matei.zaha...@gmail.com >
>>> *Subject:* Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community
>>> 
>>> External message, be mindful when clicking links or attachments
>>> 
>>> Good idea. Will be useful
>>> 
>>> +1
>>> 
>>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh < mich.talebza...@gmail.com > wrote:
>>> 
>>> Some of you may be aware that the Databricks community (Home | Databricks)
>>> have just launched a knowledge sharing hub. I thought it would be a
>>> good idea for the Apache Spark user group to have the same, especially
>>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>> Streaming, Spark MLlib and so forth.
>>> 
>>> Apache Spark user and dev groups have been around for a good while.
>>> They are serving their purpose. We went through creating a Slack
>>> community that managed to create more heat than light. This is
>>> what the Databricks community came up with, and I quote:
>>> 
>>> "Knowledge Sharing Hub
>>> Dive into a collaborative space where members like YOU can exchange
>>> knowledge, tips, and best practices. Join the conversation today and
>>> unlock a wealth of collective wisdom to enhance your experience and
>>> drive success."
>>> 
>>> I don't know the logistics of setting it up, but I am sure that should
>>> not be that difficult. If anyone is supportive of this proposal, let
>>> the usual +1, 0, -1 decide.
>>> 
>>> HTH
>>> 
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>> 
>>> view my Linkedin profile

Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-11 Thread Reynold Xin
+1

On Mon, Mar 11 2024 at 7:38 PM, Jungtaek Lim < kabhwan.opensou...@gmail.com > 
wrote:

> 
> +1 (non-binding), thanks Gengliang!
> 
> 
> On Mon, Mar 11, 2024 at 5:46 PM Gengliang Wang < ltn...@gmail.com > wrote:
> 
> 
> 
>> Hi all,
>> 
>> I'd like to start the vote for SPIP: Structured Logging Framework for
>> Apache Spark
>> 
>> References:
>> 
>> 
>> * JIRA ticket ( https://issues.apache.org/jira/browse/SPARK-47240 )
>> * SPIP doc (
>> https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing
>> )
>> * Discussion thread (
>> https://lists.apache.org/thread/gocslhbfv1r84kbcq3xt04nx827ljpxq )
>> 
>> 
>> Please vote on the SPIP for the next 72 hours:
>> 
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>> 
>> Thanks!
>> 
>> Gengliang Wang
>> 
> 
>



Re: [VOTE] SPIP: Testing Framework for Spark UI Javascript files

2023-11-24 Thread Reynold Xin
+1

On Fri, Nov 24, 2023 at 10:19 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> +1
> 
> 
> Thanks,
> Dongjoon.
> 
> On Fri, Nov 24, 2023 at 7:14 PM Ye Zhou < zhouyejoe@ gmail. com (
> zhouye...@gmail.com ) > wrote:
> 
> 
>> +1(non-binding)
>> 
>> On Fri, Nov 24, 2023 at 11:16 Mridul Muralidharan < mridul@ gmail. com (
>> mri...@gmail.com ) > wrote:
>> 
>> 
>>> 
>>> 
>>> +1
>>> 
>>> 
>>> Regards,
>>> Mridul
>>> 
>>> On Fri, Nov 24, 2023 at 8:21 AM Kent Yao < yao@ apache. org (
>>> y...@apache.org ) > wrote:
>>> 
>>> 
 Hi Spark Dev,
 
 Following the discussion [1], I'd like to start the vote for the SPIP [2].
 
 
 The SPIP aims to improve the test coverage and develop experience for
 Spark UI-related javascript codes.
 
 This thread will be open for at least the next 72 hours.  Please vote
 accordingly,
 
 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …
 
 
 Thank you!
 Kent Yao
 
 [1] https://lists.apache.org/thread/5rqrho4ldgmqlc173y2229pfll5sgkff
 [2] https://docs.google.com/document/d/1hWl5Q2CNNOjN5Ubyoa28XmpJtDyD9BtGtiEG2TT94rg/edit?usp=sharing
 
 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> 
>>> 
>>> 
>> 
>> 
> 
>



Re: [DISCUSS] SPIP: ShuffleManager short name registration via SparkPlugin

2023-11-04 Thread Reynold Xin
Why do we need this? The reason data source APIs need it is that they will be 
used by very unsophisticated end users and used all the time (for each 
connection / query). Shuffle is something you set up once, presumably by fairly 
sophisticated admins / engineers.

On Sat, Nov 04, 2023 at 2:42 PM, Alessandro Bellina < abell...@gmail.com > 
wrote:

> 
> Hello devs,
> 
> 
> I would like to start discussion on the SPIP "ShuffleManager short name
> registration via SparkPlugin"
> 
> 
> The idea behind this change is to allow a driver plugin (spark.plugins) to
> export ShuffleManagers via short names, along with sensible default
> configurations. Users can then use this short name to enable this
> ShuffleManager + configs using spark.shuffle.manager.
> 
> 
> SPIP: https://docs.google.com/document/d/1flijDjMMAAGh2C2k-vg1u651RItaRquLGB_sVudxf6I/edit#heading=h.vqpecs4nrsto
> JIRA: https://issues.apache.org/jira/browse/SPARK-45792
> 
> 
> I look forward to hearing your feedback.
> 
> 
> Thanks
> 
> 
> Alessandro
>
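
For readers who haven't followed the SPIP, here is a rough PySpark sketch of how the
shuffle manager is configured today versus how a plugin-registered short name might be
used under the proposal. The plugin class and the short name "myshuffle" below are
purely hypothetical illustrations, not anything the SPIP defines.

from pyspark.sql import SparkSession

# Today: spark.shuffle.manager takes either a built-in short name ("sort")
# or a fully qualified ShuffleManager class name, set before the context starts.
spark = (
    SparkSession.builder
    .config("spark.shuffle.manager",
            "org.apache.spark.shuffle.sort.SortShuffleManager")
    .getOrCreate()
)

# Under the SPIP, a driver plugin registered via spark.plugins could export a
# short name, so a user would instead write something like (hypothetical values):
#   .config("spark.plugins", "com.example.MyDriverPlugin")
#   .config("spark.shuffle.manager", "myshuffle")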



Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Reynold Xin
It should be the same as SQL. Otherwise it takes away a lot of potential future 
optimization opportunities.

On Mon, Sep 18 2023 at 8:47 AM, Nicholas Chammas < nicholas.cham...@gmail.com > 
wrote:

> 
> I’ve always considered DataFrames to be logically equivalent to SQL tables
> or queries.
> 
> 
> In SQL, the result order of any query is implementation-dependent without
> an explicit ORDER BY clause. Technically, you could run `SELECT * FROM
> table;` 10 times in a row and get 10 different orderings.
> 
> 
> I thought the same applied to DataFrames, but the docstring for the
> recently added method DataFrame.offset (
> https://github.com/apache/spark/pull/40873/files#diff-4ff57282598a3b9721b8d6f8c2fea23a62e4bc3c0f1aa5444527549d1daa38baR1293-R1301
> ) implies otherwise.
> 
> 
> This example will work fine in practice, of course. But if DataFrames are
> technically unordered without an explicit ordering clause, then in theory
> a future implementation change may result in “Bob" being the “first” row
> in the DataFrame, rather than “Tom”. That would make the example
> incorrect.
> 
> 
> Is that not the case?
> 
> 
> Nick
>
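
To illustrate the SQL-like semantics Reynold describes, a small PySpark sketch with
made-up data; only the explicit sort pins down which row comes "first":

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Tom", 1), ("Bob", 2)], ["name", "id"])

# Without an ordering clause the result order is implementation-dependent,
# just like `SELECT * FROM table;` in SQL.
df.show()

# Only an explicit sort guarantees the order of the returned rows.
df.orderBy("id").show()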



Re: [VOTE][SPIP] Python Data Source API

2023-07-07 Thread Reynold Xin
+1!

On Fri, Jul 7 2023 at 11:58 AM, Holden Karau < hol...@pigscanfly.ca > wrote:

> 
> +1
> 
> 
> On Fri, Jul 7, 2023 at 9:55 AM huaxin gao < huaxin.ga...@gmail.com > wrote:
> 
> 
> 
>> +1
>> 
>> 
>> On Fri, Jul 7, 2023 at 8:59 AM Mich Talebzadeh < mich.talebza...@gmail.com
>> > wrote:
>> 
>> 
>>> +1 for me
>>> 
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> ** view my Linkedin profile (
>>> https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/ )
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
>>> loss, damage or destruction of data or any other property which may arise
>>> from relying on this email's technical content is explicitly disclaimed.
>>> The author will in no case be liable for any monetary damages arising from
>>> such loss, damage or destruction.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Fri, 7 Jul 2023 at 11:05, Martin Grund 
>>> wrote:
>>> 
>>> 
 +1 (non-binding)
 
 
 On Fri, Jul 7, 2023 at 12:05 AM Denny Lee < denny.g@gmail.com > wrote:
 
 
 
> +1 (non-binding)
> 
> On Fri, Jul 7, 2023 at 00:50 Maciej < mszymkiew...@gmail.com > wrote:
> 
> 
>> 
>> 
>> +0
>> 
>> 
>> Best regards,
>> Maciej Szymkiewicz
>> 
>> Web: https://zero323.net
>> PGP: A30CEF0C31A501EC
>> On 7/6/23 17:41, Xiao Li wrote:
>> 
>> 
>>> +1
>>> 
>>> 
>>> Xiao
>>> 
>>> Hyukjin Kwon < gurwls...@apache.org > 于2023年7月5日周三 17:28写道:
>>> 
>>> 
 +1.
 
 See https://youtu.be/yj7XlTB1Jvc?t=604 :-).
 
 
 On Thu, 6 Jul 2023 at 09:15, Allison Wang 
 
 ( allison.w...@databricks.com.invalid ) wrote:
 
 
> Hi all,
> 
> I'd like to start the vote for SPIP: Python Data Source API.
> 
> The high-level summary for the SPIP is that it aims to introduce a 
> simple
> API in Python for Data Sources. The idea is to enable Python 
> developers to
> create data sources without learning Scala or dealing with the
> complexities of the current data source APIs. This would make Spark 
> more
> accessible to the wider Python developer community.
> 
> 
> References:
> 
> 
> * SPIP doc (
> https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing
> )
> * JIRA ticket ( https://issues.apache.org/jira/browse/SPARK-44076 )
> * Discussion thread (
> https://lists.apache.org/thread/w621zn14ho4rw61b0s139klnqh900s8y )
> 
> 
> 
> 
> Please vote on the SPIP for the next 72 hours:
> 
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because __.
> 
> Thanks,
> Allison
> 
 
 
>>> 
>>> 
>> 
>> 
> 
> 
 
 
>>> 
>>> 
>> 
>> 
> 
> 
> 
> 
> --
> Twitter: https://twitter.com/holdenkarau
> 
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> ( https://amzn.to/2MaRAG9 )
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>



Re: [DISCUSS] SPIP: Python Data Source API

2023-06-25 Thread Reynold Xin
Personally I'd love this, but I agree with some of the earlier comments that 
this should not be Python specific (meaning I should be able to implement a 
data source in Python and then make it usable across all languages Spark 
supports). I think we should find a way to make this reusable beyond Python 
(especially for SQL).

Python is the most popular programming language by a large margin, in general 
and among Spark users. Many of the organizations that use Spark often don't 
even have a single person who knows Scala. What if they want to implement a 
custom data source to fetch some data? Today we'd have to tell them to learn 
Scala/Java and the fairly complex data source API (v1 or v2).

Maciej - I understand your concern about endpoint throttling etc. And this goes 
well beyond querying REST endpoints. I personally had that concern too 
when we were adding the JDBC data source (what if somebody launches a 512 node 
Spark cluster to query my single node MySQL cluster?!). But the built-in JDBC 
data source is one of the most popular data sources (I just looked up its usage 
on Databricks and it's by far the #1 data source outside of files, used by > 
1 organizations every day).
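
For context, a rough sketch of the RDD-based workaround that is discussed further down
this thread: fetch from some endpoint in parallel, then convert to a DataFrame. The
`fetch_page` helper and the record shape are made-up placeholders; the proposed Python
Data Source API aims to replace exactly this kind of hand-rolled plumbing.

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

def fetch_page(page_id):
    # Stand-in for a real HTTP/SDK call; returns a few fake records per "page".
    return [Row(page=page_id, value="record-%d-%d" % (page_id, i)) for i in range(3)]

# One partition per page so the fetches run in parallel across the cluster.
pages = spark.sparkContext.parallelize(range(10), numSlices=10)
df = pages.flatMap(fetch_page).toDF()
df.show()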

On Sun, Jun 25, 2023 at 1:38 AM, Maciej < mszymkiew...@gmail.com > wrote:

> 
> 
> 
> Thanks for your feedback Martin.
> 
> However, if the primary intended purpose of this API is to provide an
> interface for endpoint querying, then I find this proposal even less
> convincing.
> 
> 
> 
> Neither the Spark execution model nor the data source API (full or
> restricted as proposed here) are a good fit for handling problems arising
> from massive endpoint requests, including, but not limited to, handling
> quotas and rate limiting.
> 
> 
> 
> Consistency and streamlined development are, of course, valuable.
> Nonetheless, they are not sufficient, especially if they cannot deliver
> the expected user experience in terms of reliability and execution cost.
> 
> 
> 
> 
> 
> 
> Best regards,
> Maciej Szymkiewicz
> 
> Web: https:/ / zero323. net ( https://zero323.net )
> PGP: A30CEF0C31A501EC
> On 6/24/23 23:42, Martin Grund wrote:
> 
> 
>> Hey,
>> 
>> 
>> I would like to express my strong support for Python Data Sources even
>> though they might not be immediately as powerful as Scala-based data
>> sources. One element that is easily lost in this discussion is how much
>> faster the iteration speed is with Python compared to Scala. Due to the
>> dynamic nature of Python, you can design and build a data source while
>> running in a notebook and continuously change the code until it works as
>> you want. This behavior is unparalleled!
>> 
>> 
>> There exists a litany of Python libraries connecting to all kinds of
>> different endpoints that could provide data that is usable with Spark. I
>> personally can imagine implementing a data source on top of the AWS SDK to
>> extract EC2 instance information. Now I don't have to switch tools and can
>> keep my pipeline consistent.
>> 
>> 
>> Let's say you want to query an API in parallel from Spark using Python. Today's
>> way would be to create a Python RDD and then implement the planning and
>> execution process manually, finally calling `toDF` at the end. While the
>> actual code of the DS and the RDD-based implementation would be very
>> similar, the abstraction that is provided by the DS is much more powerful
>> and future-proof. Performing dynamic partition elimination and filter
>> push-down can all be implemented at a later point in time.
>> 
>> 
>> Comparing a DS to using batch calling from a UDF is not great because, the
>> execution pattern would be very brittle. Imagine something like
>> `spark.range(10).withColumn("data",
>> fetch_api).explode(col("data")).collect()`. Here you're encoding
>> partitioning logic and data transformation in simple ways, but you can't
>> reason about the structural integrity of the query and tiny changes in the
>> UDF interface might already cause a lot of downstream issues.
>> 
>> 
>> 
>> 
>> Martin
>> 
>> 
>> 
>> On Sat, Jun 24 , 2023 at 1:44 AM Maciej < mszymkiewicz@ gmail. com (
>> mszymkiew...@gmail.com ) > wrote:
>> 
>> 
>>> 
>>> 
>>> With such limited scope (both language availability and features) do we
>>> have any representative examples of sources that could significantly
>>> benefit from providing this API,  compared other available options, such
>>> as batch imports, direct queries from vectorized  UDFs or even interfacing
>>> sources through 3rd party FDWs?
>>> 
>>> 
>>> Best regards,
>>> Maciej Szymkiewicz
>>> 
>>> Web: https:/ / zero323. net ( https://zero323.net )
>>> PGP: A30CEF0C31A501EC
>>> On 6/20/23 16:23, Wenchen Fan wrote:
>>> 
>>> 
 In an ideal world, every data source you want to connect to already has a
 Spark data source implementation (either v1 or v2), then this Python API
 is useless. But I feel it's common that people want to do quick data
 exploration, and the target data system is not popular 

Re: [VOTE][SPIP] PySpark Test Framework

2023-06-21 Thread Reynold Xin
+1

This is a great idea.

On Wed, Jun 21, 2023 at 8:29 AM, Holden Karau < hol...@pigscanfly.ca > wrote:

> 
> I’d like to start with a +1, better Python testing tools integrated into
> the project make sense.
> 
> On Wed, Jun 21, 2023 at 8:11 AM Amanda Liu < amandastephanieliu@ gmail. com
> ( amandastephanie...@gmail.com ) > wrote:
> 
> 
>> Hi all,
>> 
>> I'd like to start the vote for SPIP: PySpark Test Framework.
>> 
>> The high-level summary for the SPIP is that it proposes an official test
>> framework for PySpark. Currently, there are only disparate open-source
>> repos and blog posts for PySpark testing resources. We can streamline and
>> simplify the testing process by incorporating test features, such as a
>> PySpark Test Base class (which allows tests to share Spark sessions) and
>> test util functions (for example, asserting dataframe and schema
>> equality).
>> 
>> *SPIP doc:* https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
>> 
>> *JIRA ticket:* https://issues.apache.org/jira/browse/SPARK-44042
>> 
>> *Discussion thread:* https://lists.apache.org/thread/trwgbgn3ycoj8b8k8lkxko2hql23o41n
>> 
>> Please vote on the SPIP for the next 72 hours:
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because __.
>> 
>> Thank you!
>> 
>> Best,
>> Amanda Liu
>> 
>> 
> 
> --
> Twitter: https://twitter.com/holdenkarau
> 
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
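
For readers unfamiliar with the idea, a minimal, hypothetical sketch of what a
shared-session test base class and a DataFrame-equality helper look like; this is not
the SPIP's actual API, just an illustration of the pattern it wants to standardize.

import unittest
from pyspark.sql import SparkSession

class SparkTestBase(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # One local SparkSession shared by every test in the class.
        cls.spark = SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def assert_df_equal(self, left, right):
        # Compare schema and sorted rows; fine for small test DataFrames.
        self.assertEqual(left.schema, right.schema)
        self.assertEqual(sorted(left.collect()), sorted(right.collect()))

class MyJobTest(SparkTestBase):
    def test_identity(self):
        df = self.spark.createDataFrame([(1, "a")], ["id", "value"])
        self.assert_df_equal(df, df)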



Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-12 Thread Reynold Xin
+1

On Thu, Jan 12, 2023 at 9:46 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> +1 for the proposal (guiding only without any code change).
> 
> 
> Thanks,
> Dongjoon.
> 
> On Thu, Jan 12, 2023 at 9:33 PM Shixiong Zhu < zsxwing@ gmail. com (
> zsxw...@gmail.com ) > wrote:
> 
> 
>> +1
>> 
>> 
>> 
>> On Thu, Jan 12, 2023 at 5:08 PM Tathagata Das < tathagata. das1565@ gmail.
>> com ( tathagata.das1...@gmail.com ) > wrote:
>> 
>> 
>>> +1
>>> 
>>> 
>>> On Thu, Jan 12, 2023 at 7:46 PM Hyukjin Kwon < gurwls223@ gmail. com (
>>> gurwls...@gmail.com ) > wrote:
>>> 
>>> 
 +1
 
 
 On Fri, 13 Jan 2023 at 08:51, Jungtaek Lim < kabhwan. opensource@ gmail. 
 com
 ( kabhwan.opensou...@gmail.com ) > wrote:
 
 
> bump for more visibility.
> 
> On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim < kabhwan. opensource@ 
> gmail.
> com ( kabhwan.opensou...@gmail.com ) > wrote:
> 
> 
>> Hi dev,
>> 
>> 
>> I'd like to propose the deprecation of DStream in Spark 3.4, in favor of
>> promoting Structured Streaming.
>> (Sorry for the late proposal, if we don't make the change in 3.4, we will
>> have to wait for another 6 months.)
>> 
>> 
>> We have been focusing on Structured Streaming for years (across multiple
>> major and minor versions), and during the time we haven't made any
>> improvements for DStream. Furthermore, recently we updated the DStream 
>> doc
>> to explicitly say DStream is a legacy project.
>> https://spark.apache.org/docs/latest/streaming-programming-guide.html#note
>> 
>> 
>> 
>> The baseline of deprecation is that we don't see a particular use case
>> which only DStream solves. This is a different story with GraphX and
>> MLLIB, as we don't have replacements for that.
>> 
>> 
>> The proposal does not mean we will remove the API soon, as the Spark
>> project has been making deprecation against public API. I don't intend to
>> propose the target version for removal. The goal is to guide users to
>> refrain from constructing a new workload with DStream. We might want to 
>> go
>> with this in future, but it would require a new discussion thread at that
>> time.
>> 
>> 
>> What do you think?
>> 
>> 
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>> 
> 
> 
 
 
>>> 
>>> 
>> 
>> 
> 
>



Re: How can I get the same spark context in two different python processes

2022-12-12 Thread Reynold Xin
Spark Connect :)

(It’s work in progress)

On Mon, Dec 12 2022 at 2:29 PM, Kevin Su < pings...@gmail.com > wrote:

> 
> Hey there, How can I get the same spark context in two different python
> processes?
> Let’s say I create a context in Process A, and then I want to use python
> subprocess B to get the spark context created by Process A. How can I
> achieve that?
> 
> 
> I've tried
> pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(), but it
> will create a new spark context.
>
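
A small sketch (app name made up) of the behavior Kevin describes: getOrCreate() only
reuses the session within one Python process, while a second Python process gets its
own SparkContext. Spark Connect addresses exactly this by letting multiple client
processes attach to one remote Spark application.

import subprocess, sys, textwrap
from pyspark.sql import SparkSession

parent = SparkSession.builder.appName("spark").getOrCreate()
print("parent app id:", parent.sparkContext.applicationId)

# Same process: the builder hands back the existing session.
assert SparkSession.builder.getOrCreate() is parent

# Different process: this launches a second driver with a new application id.
child_code = textwrap.dedent("""
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("spark").getOrCreate()
    print("child app id:", spark.sparkContext.applicationId)
""")
subprocess.run([sys.executable, "-c", child_code], check=True)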



Re: Re: [VOTE][SPIP] Spark Connect

2022-06-15 Thread Reynold Xin
+1 super excited about this. I think it'd make Spark a lot more usable in 
application development and cloud setting:

(1) Makes it easier to embed in applications with thinner client dependencies.
(2) Easier to isolate user code vs system code in the driver.

(3) Opens up the potential to upgrade the server side for better performance 
and security updates without application changes.

One thing related to (2) I'd love to discuss, but at a separate thread, is 
whether it'd make sense to expand or work on a project to better specify system 
code vs user code boundary in the executors as well. Then we can really have 
complete system code vs user code isolation in execution.

On Wed, Jun 15, 2022 at 9:21 AM, Xiao Li < gatorsm...@gmail.com > wrote:

> 
> +1
> 
> 
> Xiao
> 
> beliefer < belie...@163.com > wrote on Tue, Jun 14, 2022 at 03:35:
> 
> 
> 
>> +1
>> Yeah, I tried to use Apache Livy, so as we can runing interactive query.
>> But the Spark Driver in Livy looks heavy.
>> 
>> 
>> The SPIP may resolve the issue.
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> At 2022-06-14 18:11:21, "Wenchen Fan" < cloud0fan@ gmail. com (
>> cloud0...@gmail.com ) > wrote:
>> 
>> 
>>> +1
>>> 
>>> 
>>> On Tue, Jun 14, 2022 at 9:38 AM Ruifeng Zheng < ruifengz@ foxmail. com (
>>> ruife...@foxmail.com ) > wrote:
>>> 
>>> 
 +1
 
 
 
 
 -- Original message --
 *From:* "huaxin gao" < huaxin.ga...@gmail.com >
 *Sent:* Tuesday, June 14, 2022, 8:47 AM
 *To:* "L. C. Hsieh" < vii...@gmail.com >
 *Cc:* "Spark dev list" < dev@spark.apache.org >
 
 *Subject:* Re: [VOTE][SPIP] Spark Connect
 
 
 +1
 
 
 On Mon, Jun 13, 2022 at 5:42 PM L. C. Hsieh < viirya@ gmail. com (
 vii...@gmail.com ) > wrote:
 
 
> +1
> 
> On Mon, Jun 13, 2022 at 5:41 PM Chao Sun < sunchao@ apache. org (
> sunc...@apache.org ) > wrote:
> >
> > +1 (non-binding)
> >
> > On Mon, Jun 13, 2022 at 5:11 PM Hyukjin Kwon < gurwls223@ gmail. com (
> gurwls...@gmail.com ) > wrote:
> >>
> >> +1
> >>
> >> On Tue, 14 Jun 2022 at 08:50, Yuming Wang < wgyumg@ gmail. com (
> wgy...@gmail.com ) > wrote:
> >>>
> >>> +1.
> >>>
> >>> On Tue, Jun 14, 2022 at 2:20 AM Matei Zaharia < matei. zaharia@ gmail.
> com ( matei.zaha...@gmail.com ) > wrote:
> 
>  +1, very excited about this direction.
> 
>  Matei
> 
>  On Jun 13, 2022, at 11:07 AM, Herman van Hovell < herman@ databricks.
> com. INVALID ( her...@databricks.com.INVALID ) > wrote:
> 
>  Let me kick off the voting...
> 
>  +1
> 
>  On Mon, Jun 13, 2022 at 2:02 PM Herman van Hovell < herman@ 
>  databricks.
> com ( her...@databricks.com ) > wrote:
> >
> > Hi all,
> >
> > I’d like to start a vote for SPIP: "Spark Connect"
> >
> > The goal of the SPIP is to introduce a Dataframe based client/server
> API for Spark
> >
> > Please also refer to:
> >
> > - Previous discussion in dev mailing list: [DISCUSS] SPIP: Spark
> Connect - A client and server interface for Apache Spark.
> > - Design doc: Spark Connect - A client and server interface for
> Apache Spark.
> > - JIRA: SPARK-39375
> >
> > Please vote on the SPIP for the next 72 hours:
> >
> > [ ] +1: Accept the proposal as an official SPIP
> > [ ] +0
> > [ ] -1: I don’t think this is a good idea because …
> >
> > Kind Regards,
> > Herman
> 
> 
> 
> -
> To unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
 
 
 
>>> 
>>> 
>> 
>> 
> 
>



Re: Stickers and Swag

2022-06-14 Thread Reynold Xin
Nice! Going to order a few items myself ...

On Tue, Jun 14, 2022 at 7:54 PM, Gengliang Wang < ltn...@gmail.com > wrote:

> 
> FYI now you can find the shopping information on https://spark.apache.org/community as well :)
> 
> 
> 
> Gengliang
> 
> 
> 
> 
> 
> 
>> On Jun 14, 2022, at 7:47 PM, Hyukjin Kwon < gurwls223@ gmail. com (
>> gurwls...@gmail.com ) > wrote:
>> 
>> Woohoo
>> 
>> On Tue, 14 Jun 2022 at 15:04, Xiao Li < gatorsmile@ gmail. com (
>> gatorsm...@gmail.com ) > wrote:
>> 
>> 
>>> Hi, all,
>>> 
>>> 
>>> The ASF has an official store at RedBubble (
>>> https://www.redbubble.com/people/comdev/shop ) that Apache Community
>>> Development (ComDev) runs. If you are interested in buying Spark Swag, 70
>>> products featuring the Spark logo are available: https://www.redbubble.com/shop/ap/113203780
>>> 
>>> 
>>> Go Spark!
>>> 
>>> 
>>> Xiao
>>> 
>> 
>> 
> 
>



Re: Data correctness issue with Repartition + FetchFailure

2022-03-12 Thread Reynold Xin
This is why RoundRobinPartitioning shouldn't be used ...

On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu < jasonxu.sp...@gmail.com > wrote:

> 
> Hi Spark community,
> 
> I reported a data correctness issue in https://issues.apache.org/jira/browse/SPARK-38388.
> In short, non-deterministic data + Repartition + FetchFailure could result
> in incorrect data, this is an issue we run into in production pipelines, I
> have an example to reproduce the bug in the ticket.
> 
> I report here to bring more attention, could you help confirm it's a bug
> and worth effort to further investigate and fix, thank you in advance for
> help!
> 
> Thanks,
> Jason Xu
>
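
For readers who don't want to open the ticket, a sketch of the risky pattern (no fetch
failure is simulated here): round-robin repartitioning of non-deterministic input. If an
upstream stage is recomputed after a FetchFailure, the recomputed rows can be routed to
different partitions than before, so records can be lost or duplicated.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Non-deterministic input: the sampled rows differ every time this is recomputed.
nondeterministic = (
    spark.range(1000000)
    .withColumn("keep", F.rand() < 0.5)
    .filter("keep")
)

# repartition(n) without columns uses RoundRobinPartitioning.
out = nondeterministic.repartition(10)
print(out.count())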



Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Reynold Xin
tl;dr: there's no easy way to implement aggregate expressions that require 
multiple passes over the data. It is simply not something that's supported, and doing 
so would be very high cost.

Would you be OK using approximate percentile? That's relatively cheap.
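
A minimal PySpark sketch of that route, using the existing percentile_approx function as
a cheap stand-in for an exact median (the data here is made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1, 1001).withColumnRenamed("id", "x")

# Single pass, bounded memory; the accuracy argument trades memory for precision.
df.select(F.percentile_approx("x", 0.5, 10000).alias("approx_median")).show()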

On Mon, Dec 13, 2021 at 6:43 PM, Nicholas Chammas < nicholas.cham...@gmail.com 
> wrote:

> 
> No takers here? :)
> 
> 
> I can see now why a median function is not available in most data
> processing systems. It's pretty annoying to implement!
> 
> On Thu, Dec 9, 2021 at 9:25 PM Nicholas Chammas < nicholas. chammas@ gmail.
> com ( nicholas.cham...@gmail.com ) > wrote:
> 
> 
>> I'm trying to create a new aggregate function. It's my first time working
>> with Catalyst, so it's exciting---but I'm also in a bit over my head.
>> 
>> 
>> My goal is to create a function to calculate the median (
>> https://issues.apache.org/jira/browse/SPARK-26589 ).
>> 
>> 
>> As a very simple solution, I could just define median to be an alias of
>> `Percentile(col, 0.5)`. However, the leading comment on the Percentile expression (
>> https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39
>> ) highlights that it's very memory-intensive and can easily lead to
>> OutOfMemory errors.
>> 
>> 
>> So instead of using Percentile, I'm trying to create an Expression that
>> calculates the median without needing to hold everything in memory at
>> once. I'm considering two different approaches:
>> 
>> 
>> 1. Define Median as a combination of existing expressions: The median can
>> perhaps be built out of the existing expressions for Count (
>> https://github.com/apache/spark/blob/9af338cd685bce26abbc2dd4d077bde5068157b1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L48
>> ) and NthValue (
>> https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L675
>> ).
>> 
>> 
>> 
>>> I don't see a template I can follow for building a new expression out of
>>> existing expressions (i.e. without having to implement a bunch of methods
>>> for DeclarativeAggregate or ImperativeAggregate). I also don't know how I
>>> would wrap NthValue to make it usable as a regular aggregate function. The
>>> wrapped NthValue would need an implicit window that provides the necessary
>>> ordering.
>>> 
>> 
>> 
>> 
>> 
>>> Is there any potential to this idea? Any pointers on how to implement it?
>>> 
>> 
>> 
>> 
>> 2. Another memory-light approach to calculating the median requires
>> multiple passes over the data to converge on the answer. The approach is 
>> described
>> here (
>> https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers
>> ). (I posted a sketch implementation of this approach using Spark's
>> user-level API here (
>> https://issues.apache.org/jira/browse/SPARK-26589?focusedCommentId=17452081=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17452081
>> ).)
>> 
>> 
>> 
>>> I am also struggling to understand how I would build an aggregate function
>>> like this, since it requires multiple passes over the data. From what I
>>> can see, Catalyst's aggregate functions are designed to work with a single
>>> pass over the data.
>>> 
>>> 
>>> We don't seem to have an interface for AggregateFunction that supports
>>> multiple passes over the data. Is there some way to do this?
>>> 
>> 
>> 
>> Again, this is my first serious foray into Catalyst. Any specific
>> implementation guidance is appreciated!
>> 
>> 
>> Nick
>> 
> 
>



Re: spark binary map

2021-10-16 Thread Reynold Xin
Read up on Unsafe here: https://mechanical-sympathy.blogspot.com/

On Sat, Oct 16, 2021 at 12:41 AM, Rohan Bajaj < rohanbaja...@gmail.com > wrote:

> 
> In 2015 Reynold Xin made improvements to Spark and it was basically moving
> some structures that were on the java heap and moving them off heap.
> 
> 
> In particular it seemed like the memory did not require any
> serialization/deserialization.
> 
> 
> How was this performed? Was the data memory mapped? If it was, then to
> avoid serialization/deserialization I'm assuming some sort of wrapper was
> introduced to allow access to that data.
> 
> 
> Something like this:
> 
> 
> struct DataType
> {
> long pointertoData;
> 
> 
> method1();
> method2();
> }
> 
> 
> but if it was done this way there is an extra indirection, and I'm
> assuming it benchmarked positively.
> 
> 
> Just trying to learn.
>



Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-08 Thread Reynold Xin
+1

On Thu, Oct 07, 2021 at 11:54 PM, Yuming Wang < wgy...@gmail.com > wrote:

> 
> +1 (non-binding).
> 
> 
> On Fri, Oct 8, 2021 at 1:02 PM Dongjoon Hyun < dongjoon. hyun@ gmail. com (
> dongjoon.h...@gmail.com ) > wrote:
> 
> 
>> +1 for Apache Spark 3.2.0 RC7.
>> 
>> 
>> It looks good to me. I tested with EKS 1.21 additionally.
>> 
>> 
>> Cheers,
>> Dongjoon.
>> 
>> 
>> 
>> On Thu, Oct 7, 2021 at 7:46 PM 郑瑞峰 < ruifengz@ foxmail. com (
>> ruife...@foxmail.com ) > wrote:
>> 
>> 
>>> +1 (non-binding)
>>> 
>>> 
>>> 
>>> -- Original message --
>>> *From:* "Sean Owen" < sro...@gmail.com >
>>> *Sent:* Thursday, October 7, 2021, 10:23 PM
>>> *To:* "Gengliang Wang" < ltn...@gmail.com >
>>> *Cc:* "dev" < dev@spark.apache.org >
>>> *Subject:* Re: [VOTE] Release Spark 3.2.0 (RC7)
>>> 
>>> 
>>> +1 again. Looks good in Scala 2.12, 2.13, and in Java 11.
>>> I note that the mem requirements for Java 11 tests seem to need to be
>>> increased but we're handling that separately. It doesn't really affect
>>> users.
>>> 
>>> On Wed, Oct 6, 2021 at 11:49 AM Gengliang Wang < ltnwgl@ gmail. com (
>>> ltn...@gmail.com ) > wrote:
>>> 
>>> 
 Please vote on releasing the following candidate as Apache Spark version
 3.2.0.
 
 
 
 The vote is open until 11:59pm Pacific time October 11 and passes if a
 majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
 
 
 
 [ ] +1 Release this package as Apache Spark 3.2.0
 
 [ ] -1 Do not release this package because ...
 
 
 
 To learn more about Apache Spark, please see http://spark.apache.org/
 
 The tag to be voted on is v3.2.0-rc7 (commit
 5d45a415f3a29898d92380380cfd82bfc7f579ea):
 https://github.com/apache/spark/tree/v3.2.0-rc7
 
 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc7-bin/
 
 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS
 
 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1394
 
 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc7-docs/
 
 The list of bug fixes going into 3.2.0 can be found at the following URL:
 https://issues.apache.org/jira/projects/SPARK/versions/12349407
 
 This release is using the release script of the tag v3.2.0-rc7.
 
 
 
 
 
 FAQ
 
 
 
 =
 
 How can I help test this release?
 
 =
 
 If you are a Spark user, you can help us test this release by taking
 
 an existing Spark workload and running on this release candidate, then
 
 reporting any regressions.
 
 
 
 If you're working in PySpark you can set up a virtual env and install
 
 the current RC and see if anything important breaks, in the Java/Scala
 
 you can add the staging repository to your projects resolvers and test
 
 with the RC (make sure to clean up the artifact cache before/after so
 
 you don't end up building with a out of date RC going forward).
 
 
 
 ===
 
 What should happen to JIRA tickets still targeting 3.2.0?
 
 ===
 
 The current list of open tickets targeted at 3.2.0 can be found at:
 
 https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 3.2.0
 
 
 
 Committers should look at those and triage. Extremely important bug
 
 fixes, documentation, and API tweaks that impact compatibility should
 
 be worked on immediately. Everything else please retarget to an
 
 appropriate release.
 
 
 
 ==
 
 But my bug isn't fixed?
 
 ==
 
 In order to make timely releases, we will typically not hold the
 
 release unless the bug in question is a regression 

Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-26 Thread Reynold Xin
+1. Would open up a huge persona for Spark.

On Fri, Mar 26 2021 at 11:30 AM, Bryan Cutler < cutl...@gmail.com > wrote:

> 
> +1 (non-binding)
> 
> 
> On Fri, Mar 26, 2021 at 9:49 AM Maciej < mszymkiew...@gmail.com > wrote:
> 
> 
>> +1 (nonbinding)
>> 
>> 
>> 
>> On 3/26/21 3:52 PM, Hyukjin Kwon wrote:
>> 
>> 
>>> 
>>> 
>>> Hi all,
>>> 
>>> 
>>> 
>>> I’d like to start a vote for SPIP: Support pandas API layer on PySpark.
>>> 
>>> 
>>> 
>>> The proposal is to embrace Koalas in PySpark to have the pandas API layer
>>> on PySpark.
>>> 
>>> 
>>> 
>>> 
>>> Please also refer to:
>>> 
>>> 
>>> 
>>> * Previous discussion in dev mailing list: [DISCUSS] Support pandas API
>>> layer on PySpark (
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html
>>> ).
>>> * JIRA: SPARK-34849 ( https://issues.apache.org/jira/browse/SPARK-34849 )
>>> * Koalas internals documentation: 
>>> https://docs.google.com/document/d/1tk24aq6FV5Wu2bX_Ym606doLFnrZsh4FdUd52FqojZU/edit
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Please vote on the SPIP for the next 72 hours:
>>> 
>>> 
>>> 
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Best regards,
>> Maciej Szymkiewicz
>> 
>> Web: https://zero323.net
>> Keybase: https://keybase.io/zero323
>> Gigs: https://www.codementor.io/@zero323
>> PGP: A30CEF0C31A501EC
>> 
> 
>



Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Reynold Xin
I don't think we should deprecate existing APIs.

Spark's own Python API is relatively stable and not difficult to support. It 
has a pretty large number of users and existing code. Also pretty easy to learn 
by data engineers.

The pandas API is great for data science, but isn't that great for some other 
tasks. It's super wide. Great for data scientists who have learned it, or 
great for copy-pasting from Stack Overflow.
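
For readers who haven't used Koalas, a minimal sketch of the pandas-style API under
discussion, using the then-separate databricks.koalas package (it later shipped in
Spark itself as pyspark.pandas):

import databricks.koalas as ks

kdf = ks.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
print(kdf.describe())   # pandas-style API...
kdf.to_spark().show()   # ...backed by a Spark DataFrame underneath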

On Sun, Mar 14, 2021 at 11:08 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Thank you for the proposal. It looks like a good addition.
> BTW, what is the future plan for the existing APIs?
> Are we going to deprecate it eventually in favor of Koalas (because we
> don't remove the existing APIs in general)?
> 
> 
> > Fourthly, PySpark is still not Pythonic enough. For example, I hear
> complaints such as "why does
> 
> > PySpark follow pascalCase?" or "PySpark APIs are difficult to learn",
> and APIs are very difficult to change
> > in Spark (as I emphasized above).
> 
> 
> 
> 
> On Sun, Mar 14, 2021 at 4:03 AM Hyukjin Kwon < gurwls...@gmail.com > wrote:
> 
> 
> 
>> 
>> 
>> Firstly my biggest reason is that I would like to promote this more as a
>> built-in support because it is simply
>> important to have it with the impact on the large user group, and the
>> needs are increasing
>> as the charts indicate. I usually think that features or add-ons stay as
>> third parties when it’s rather for a
>> smaller set of users, it addresses a corner case of needs, etc. I think
>> this is similar to the datasources
>> we have added. Spark ported CSV and Avro because more and more people use
>> it, and it became important
>> to have it as a built-in support.
>> 
>> 
>> 
>> Secondly, Koalas needs more help from Spark, PySpark, Python and pandas
>> experts from the
>> bigger community. Koalas’ team isn’t experts in all the areas, and there
>> are many missing corner
>> cases to fix, Some require deep expertise from specific areas.
>> 
>> 
>> 
>> One example is the type hints. Koalas uses type hints for schema
>> inference.
>> Due to the lack of Python’s type hinting way, Koalas added its own (hacky)
>> way (
>> https://koalas.readthedocs.io/en/latest/user_guide/typehints.html#type-hints-in-koalas
>> ).
>> Fortunately the way Koalas implemented is now partially proposed into
>> Python officially (PEP 646).
>> But Koalas could have been better with interacting with the Python
>> community more and actively
>> joining in the design issues together to lead the best output that
>> benefits both and more projects.
>> 
>> 
>> 
>> Thirdly, I would like to contribute to the growth of PySpark. The growth
>> of the Koalas is very fast given the
>> internal and external stats. The number of users has jumped up twice
>> almost every 4 ~ 6 months.
>> I think Koalas will be a good momentum to keep Spark up.
>> 
>> 
>> Fourthly, PySpark is still not Pythonic enough. For example, I hear
>> complaints such as "why does
>> PySpark follow pascalCase?" or "PySpark APIs are difficult to learn", and
>> APIs are very difficult to change
>> in Spark (as I emphasized above). This set of Koalas APIs will be able to
>> address these concerns
>> in PySpark.
>> 
>> Lastly, I really think PySpark needs its native plotting features. As I
>> emphasized before with
>> elaboration, I do think this is an important feature missing in PySpark
>> that users need.
>> I do think Koalas completes what PySpark is currently missing.
>> 
>> 
>> 
>> 
>> 
>> On Sun, 14 Mar 2021 at 7:12 PM, Sean Owen < sro...@gmail.com > wrote:
>> 
>> 
>>> I like koalas a lot. Playing devil's advocate, why not just let it
>>> continue to live as an add on? Usually the argument is it'll be maintained
>>> better in Spark but it's well maintained. It adds some overhead to
>>> maintaining Spark conversely. On the upside it makes it a little more
>>> discoverable. Are there more 'synergies'?
>>> 
>>> On Sat, Mar 13, 2021, 7:57 PM Hyukjin Kwon < gurwls...@gmail.com > wrote:
>>> 
>>> 
 
 
 Hi all,
 
 
 
 
 I would like to start the discussion on supporting pandas API layer on
 Spark.
 
 
 
 
 
 
 
 If we have a general consensus on having it in PySpark, I will initiate
 and drive an SPIP with a detailed explanation about the implementation’s
 overview and structure.
 
 
 
 I would appreciate it if I can know whether you guys support this or not
 before starting the SPIP.
 
 
 
 
  What do you want to propose?
 
 
 
 
 I have been working on the Koalas ( https://github.com/databricks/koalas )
 project that is essentially: pandas API support on Spark, and I would like
 to propose embracing Koalas in PySpark.
 
 
 
 
 
 
 
 More specifically, I am thinking about adding a separate package, to
 PySpark, for pandas APIs on PySpark Therefore 

Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-24 Thread Reynold Xin
+1 Correctness issues are serious!

On Wed, Feb 24, 2021 at 11:08 AM, Mridul Muralidharan < mri...@gmail.com > 
wrote:

> 
> That is indeed cause for concern.
> +1 on extending the voting deadline until we finish investigation of this.
> 
> 
> 
> 
> Regards,
> Mridul
> 
> 
> 
> On Wed, Feb 24, 2021 at 12:55 PM Xiao Li < gatorsm...@gmail.com > wrote:
> 
> 
>> -1 Could we extend the voting deadline?
>> 
>> 
>> A few TPC-DS queries (q17, q18, q39a, q39b) are returning different
>> results between Spark 3.0 and Spark 3.1. We need a few more days to
>> understand whether these changes are expected.
>> 
>> 
>> Xiao
>> 
>> 
>> Mridul Muralidharan < mri...@gmail.com > wrote on Wed, Feb 24, 2021 at 10:41 AM:
>> 
>> 
>>> 
>>> 
>>> Sounds good, thanks for clarifying Hyukjin !
>>> +1 on release.
>>> 
>>> 
>>> 
>>> Regards,
>>> Mridul
>>> 
>>> 
>>> 
>>> On Wed, Feb 24, 2021 at 2:46 AM Hyukjin Kwon < gurwls...@gmail.com > wrote:
>>> 
>>> 
>>> 
 
 
 I remember HiveExternalCatalogVersionsSuite was flaky for a while which is
 fixed in 
 https://github.com/apache/spark/commit/0d5d248bdc4cdc71627162a3d20c42ad19f24ef4
 
 and .. KafkaDelegationTokenSuite is flaky ( 
 https://issues.apache.org/jira/browse/SPARK-31250
 ).
 
 
 
 On Wed, 24 Feb 2021 at 5:19 PM, Mridul Muralidharan < mri...@gmail.com > wrote:
 
 
> 
> 
> Signatures, digests, etc check out fine.
> 
> Checked out tag and build/tested with -Pyarn -Phadoop-2.7 -Phive
> -Phive-thriftserver -Pmesos -Pkubernetes
> 
> 
> I keep getting test failures with
> * org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite
> * org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.
> (Note: I remove $HOME/.m2 and $HOME/.iv2 paths before build)
> 
> 
> 
> Removing these suites gets the build through though - does anyone have
> suggestions on how to fix it ? I did not face this with RC1.
> 
> 
> 
> Regards,
> Mridul
> 
> 
> 
> On Mon, Feb 22, 2021 at 12:57 AM Hyukjin Kwon < gurwls...@gmail.com >
> wrote:
> 
> 
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.1.1.
>> 
>> 
>> The vote is open until February 24th 11PM PST and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>> 
>> 
>> [ ] +1 Release this package as Apache Spark 3.1.1
>> [ ] -1 Do not release this package because ...
>> 
>> 
>> To learn more about Apache Spark, please see http://spark.apache.org/
>> 
>> 
>> The tag to be voted on is v3.1.1-rc3 (commit
>> 1d550c4e90275ab418b9161925049239227f3dc9):
>> https://github.com/apache/spark/tree/v3.1.1-rc3
>> 
>> 
>> 
>> The release files, including signatures, digests, etc. can be found at:
>> ( https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/ )
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/
>> 
>> 
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> 
>> 
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1367
>> 
>> 
>> 
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-docs/
>> 
>> 
>> 
>> The list of bug fixes going into 3.1.1 can be found at the following URL:
>> https://s.apache.org/41kf2
>> 
>> 
>> This release is using the release script of the tag v3.1.1-rc3.
>> 
>> 
>> FAQ
>> 
>> ===
>> What happened to 3.1.0?
>> ===
>> 
>> There was a technical issue during Apache Spark 3.1.0 preparation, and it
>> was discussed and decided to skip 3.1.0.
>> Please see 
>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html
>> for more details.
>> 
>> 
>> =
>> How can I help test this release?
>> =
>> 
>> 
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>> 
>> 
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC via "pip install 
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/pyspark-3.1.1.tar.gz
>> "
>> and see if anything important breaks.
>> In the Java/Scala, you can add the staging repository to your projects
>> resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>> 
>> 
>> ===

Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Reynold Xin
Enrico - do feel free to reopen the PRs or email people directly, unless you 
are told otherwise.

On Thu, Feb 18, 2021 at 9:09 AM, Nicholas Chammas < nicholas.cham...@gmail.com 
> wrote:

> 
> On Thu, Feb 18, 2021 at 10:34 AM Sean Owen < srowen@ gmail. com (
> sro...@gmail.com ) > wrote:
> 
> 
>> There is no way to force people to review or commit something of course.
>> And keep in mind we get a lot of, shall we say, unuseful pull requests.
>> There is occasionally some blowback to closing someone's PR, so the path
>> of least resistance is often the timeout / 'soft close'. That is, it takes
>> a lot more time to satisfactorily debate down the majority of PRs that
>> probably shouldn't get merged, and there just isn't that much bandwidth.
>> That said of course it's bad if lots of good PRs are getting lost in the
>> shuffle and I am sure there are some.
>> 
>> 
>> One other aspect is that a committer is taking some degree of
>> responsibility for merging a change, so the ask is more than just a few
>> minutes of eyeballing. If it breaks something the merger pretty much owns
>> resolving it, and, the whole project owns any consequence of the change
>> for the future.
>> 
> 
> 
> 
> +1
>



Re: [DISCUSS] Add RocksDB StateStore

2021-02-13 Thread Reynold Xin
Late +1

On Sat, Feb 13 2021 at 2:49 PM, Liang-Chi Hsieh < vii...@gmail.com > wrote:

> 
> 
> 
> Hi devs,
> 
> 
> 
> Thanks for all the inputs. I think overall there are positive inputs in
> Spark community about having RocksDB state store as external module. Then
> let's go forward with this direction and to improve structured streaming.
> I will keep update to the JIRA SPARK-34198.
> 
> 
> 
> Thanks all again for the inputs and discussion.
> 
> 
> 
> Liang-Chi Hsieh
> 
> 
> 
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 
>



Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Reynold Xin
There's another thing that's not mentioned … it's primarily a problem for 
Scala. Due to static typing, we need a very large number of function overloads 
for the Scala version of each function, whereas in SQL/Python they are just 
one. There's a limit on how many functions we can add, and it also makes it 
difficult to browse through the docs when there are a lot of functions.
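
As an aside, the usual workaround today when a SQL function has no dedicated
Scala/Python wrapper is to call it through expr(); a small sketch with
regexp_extract_all and made-up data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("100-200, 300-400",)], ["str"])

# regexp_extract_all exists in SQL but (at the time of this thread) has no
# dedicated DataFrame function, so it is reached via expr().
df.select(
    F.expr(r"regexp_extract_all(str, '(\\d+)-(\\d+)', 1)").alias("groups")
).show(truncate=False)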

On Thu, Jan 28, 2021 at 1:09 PM, Maciej < mszymkiew...@gmail.com > wrote:

> 
> Just my two cents on R side.
> 
> 
> 
> On 1/28/21 10:00 PM, Nicholas Chammas wrote:
> 
> 
>> On Thu, Jan 28 , 2021 at 3:40 PM Sean Owen < srowen@ gmail. com (
>> sro...@gmail.com ) > wrote:
>> 
>> 
>>> It isn't that regexp_extract_all (for example) is useless outside SQL,
>>> just, where do you draw the line? Supporting 10s of random SQL functions
>>> across 3 other languages has a cost, which has to be weighed against
>>> benefit, which we can never measure well except anecdotally: one or two
>>> people say "I want this" in a sea of hundreds of thousands of users.
>>> 
>> 
>> 
>> 
>> +1 to this, but I will add that Jira and Stack Overflow activity can
>> sometimes give good signals about API gaps that are frustrating users. If
>> there is an SO question with 30K views about how to do something that
>> should have been easier, then that's an important signal about the API.
>> 
>> 
>> 
>>> For this specific case, I think there is a fine argument that
>>> regexp_extract_all should be added simply for consistency with
>>> regexp_extract. I can also see the argument that regexp_extract was a step
>>> too far, but, what's public is now a public API.
>>> 
>> 
>> 
>> 
>> I think in this case a few references to where/how people are having to
>> work around missing a direct function for regexp_extract_all could help
>> guide the decision. But that itself means we are making these decisions on
>> a case-by-case basis.
>> 
>> 
>> From a user perspective, it's definitely conceptually simpler to have SQL
>> functions be consistent and available across all APIs.
>> 
>> 
>> 
>> Perhaps if we had a way to lower the maintenance burden of keeping
>> functions in sync across SQL/Scala/Python/R, it would be easier for
>> everyone to agree to just have all the functions be included across the
>> board all the time.
>> 
> 
> 
> 
> Python aligns quite well with Scala so that might be fine, but R is a bit of a
> tricky thing. Especially the lack of proper namespaces makes it rather risky
> to have packages that export hundreds of functions. sparklyr handles this
> neatly with NSE, but I don't think we're going to go this way.
> 
> 
> 
>> 
>> 
>> Would, for example, some sort of automatic testing mechanism for SQL
>> functions help here? Something that uses a common function testing
>> specification to automatically test SQL, Scala, Python, and R functions,
>> without requiring maintainers to write tests for each language's version
>> of the functions. Would that address the maintenance burden?
>> 
> 
> 
> 
> With R we don't really test most of the functions beyond the simple
> "callability". One the complex ones, that require some non-trivial
> transformations of arguments, are fully tested.
> 
> 
> -- 
> Best regards,
> Maciej Szymkiewicz
> 
> Web: https://zero323.net
> Keybase: https://keybase.io/zero323
> Gigs: https://www.codementor.io/@zero323
> PGP: A30CEF0C31A501EC
>



Re: [VOTE] Standardize Spark Exception Messages SPIP

2020-11-09 Thread Reynold Xin
Exciting & look forward to this!

(And a late +1 vote that probably won't be counted)

On Mon, Nov 09, 2020 at 2:37 PM, Allison Wang < allison.w...@databricks.com > 
wrote:

> 
> 
> 
> Thanks everyone for voting! With 11 +1s and no -1s, this vote passes.
> 
> 
> 
> +1s:
> Mridul Muralidharan
> Angers Zhu
> Chandni Singh
> Eve Liao
> Matei Zaharia
> Kalyan
> Wenchen Fan
> Gengliang Wang
> Xiao Li
> Takeshi Yamamuro
> Herman van Hovell
> 
> 
> 
> Thanks,
> Allison
> 
> 
> 
> --
> Sent from: http:/ / apache-spark-developers-list. 1001551. n3. nabble. com/
> ( http://apache-spark-developers-list.1001551.n3.nabble.com/ )
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>



Re: I'm going to be out starting Nov 5th

2020-10-31 Thread Reynold Xin
Take care Holden and best of luck with everything!

On Sat, Oct 31 2020 at 10:21 AM, Holden Karau < hol...@pigscanfly.ca > wrote:

> 
> Hi Folks,
> 
> 
> Just a heads up so folks working on decommissioning or other areas I've
> been active in don't block on me, I'm going to be out for at least a week
> and possibly more starting on November 5th. If there is anything that
> folks want me to review before then please let me know and I'll make the
> time for it. If you are curious I've got more details at 
> http://blog.holdenkarau.com/2020/10/taking-break-surgery.html
> 
> 
> 
> Happy Sparking Everyone,
> 
> 
> Holden :)
> 
> 
> --
> Twitter: https://twitter.com/holdenkarau
> 
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>



Re: Avoiding unnnecessary sort in FileFormatWriter/DynamicPartitionDataWriter

2020-09-04 Thread Reynold Xin
The issue is memory overhead. Writing files creates a lot of buffering (especially 
in columnar formats like Parquet/ORC). Even a few open file handles and their buffers 
per task can OOM the entire process easily.

On Fri, Sep 04, 2020 at 5:51 AM, XIMO GUANTER GONZALBEZ < 
joaquin.guantergonzal...@telefonica.com > wrote:

> 
> 
> 
> Hello,
> 
> 
> 
> 
> 
> 
> 
> I have observed that if a DataFrame is saved with partitioning columns in
> Parquet, then a sort is performed in FileFormatWriter (see
> https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L152 )
> because DynamicPartitionDataWriter only supports having a single file
> open at a time (see
> https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L170-L171 ).
> I think it would be possible to avoid this sort (which is a major
> bottleneck for some of my scenarios) if DynamicPartitionDataWriter could
> have multiple files open at the same time, writing each piece of data
> to its corresponding file.
> 
> 
> 
> 
> 
> 
> 
> Would that change be a welcome PR for the project or is there any major
> problem that I am not considering that would prevent removing this sort?
> 
> 
> 
> 
> 
> 
> 
> Thanks,
> 
> 
> 
> Ximo.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Some more detail about the problem, in case I didn’t explain myself
> correctly: suppose we have a dataframe which we want to partition by
> column A:
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Column A | Column B
> ---------+---------
>        4 | A
>        1 | B
>        2 | C
> 
> The current behavior will first sort the dataframe:
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Column A | Column B
> ---------+---------
>        1 | B
>        2 | C
>        4 | A
> 
> So that DynamicPartitionDataWriter can have a single file open, since all
> the data for a single partition will be adjacent and can be iterated over
> sequentially. In order to process the first row,
> DynamicPartitionDataWriter will open a file in
> /columnA=1/part-r-0-.parquet and write the data. When processing
> the second row it will see that it belongs to a different partition, close the
> first file and open a new file in /columnA=2/part-r-0-.parquet,
> and so on.
> 
> 
> 
> 
> 
> 
> 
> My proposed change would involve changing DynamicPartitionDataWriter to
> have as many open files as partitions, and close them all once all data
> has been processed.
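> 
> For reference, a minimal way to observe the sort being discussed, from
> spark-shell (paths and values here are only illustrative):
> 
>   import spark.implicits._
> 
>   val df = Seq((4, "A"), (1, "B"), (2, "C")).toDF("columnA", "columnB")
> 
>   // FileFormatWriter adds a sort on the partition column before this write so
>   // that DynamicPartitionDataWriter only ever needs one file open per task.
>   df.write.partitionBy("columnA").parquet("/tmp/partitioned-write-example")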
> 
> 
> 
> 
>



Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Reynold Xin
Welcome all!

On Tue, Jul 14, 2020 at 10:36 AM, Matei Zaharia < matei.zaha...@gmail.com > 
wrote:

> 
> 
> 
> Hi all,
> 
> 
> 
> The Spark PMC recently voted to add several new committers. Please join me
> in welcoming them to their new roles! The new committers are:
> 
> 
> 
> - Huaxin Gao
> - Jungtaek Lim
> - Dilip Biswal
> 
> 
> 
> All three of them contributed to Spark 3.0 and we’re excited to have them
> join the project.
> 
> 
> 
> Matei and the Spark PMC
> 
> 
>



Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Reynold Xin
+1 on doing a new patch release soon. I saw some of these issues when preparing 
the 3.0 release, and some of them are very serious.

On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman < 
shiva...@eecs.berkeley.edu > wrote:

> 
> 
> 
> +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1 release soon.
> 
> 
> 
> 
> Shivaram
> 
> 
> 
> On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro < linguin. m. s@ gmail. com
> ( linguin@gmail.com ) > wrote:
> 
> 
>> 
>> 
>> Thanks for the heads-up, Yuanjian!
>> 
>> 
>>> 
>>> 
>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
>>> 
>>> 
>> 
>> 
>> 
>> wow, the updates are so quick. Anyway, +1 for the release.
>> 
>> 
>> 
>> Bests,
>> Takeshi
>> 
>> 
>> 
>> On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li < xyliyuanjian@ gmail. com (
>> xyliyuanj...@gmail.com ) > wrote:
>> 
>> 
>>> 
>>> 
>>> Hi dev-list,
>>> 
>>> 
>>> 
>>> I’m writing this to raise the discussion about Spark 3.0.1 feasibility
>>> since 4 blocker issues were found after Spark 3.0.0:
>>> 
>>> 
>>> 
>>> [SPARK-31990] The state store compatibility broken will cause a
>>> correctness issue when Streaming query with `dropDuplicate` uses the
>>> checkpoint written by the old Spark version.
>>> 
>>> 
>>> 
>>> [SPARK-32038] The regression bug in handling NaN values in COUNT(DISTINCT)
>>> 
>>> 
>>> 
>>> 
>>> [SPARK-31918][WIP] CRAN requires to make it working with the latest R 4.0.
>>> It makes the 3.0 release unavailable on CRAN, and only supports R [3.5,
>>> 4.0)
>>> 
>>> 
>>> 
>>> [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time regression
>>> 
>>> 
>>> 
>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0. I
>>> think it would be great if we have Spark 3.0.1 to deliver the critical
>>> fixes.
>>> 
>>> 
>>> 
>>> Any comments are appreciated.
>>> 
>>> 
>>> 
>>> Best,
>>> 
>>> 
>>> 
>>> Yuanjian
>>> 
>>> 
>> 
>> 
>> 
>> --
>> ---
>> Takeshi Yamamuro
>> 
>> 
> 
> 
> 
> 
> 
>



Re: Removing references to slave (and maybe in the future master)

2020-06-18 Thread Reynold Xin
Thanks for doing this. I think this is a great thing to do.

But we gotta be careful with API compatibility.

On Thu, Jun 18, 2020 at 11:32 AM, Holden Karau < hol...@pigscanfly.ca > wrote:

> 
> Hi Folks,
> 
> 
> I've started working on cleaning up the Spark code to remove references to
> slave since the word has a lot of negative connotations and we can
> generally replace it with more accurate/descriptive words in our code
> base. The PR is at https://github.com/apache/spark/pull/28864 (I'm a little
> uncertain about where I chose the name "AgentLost" as the replacement;
> suggestions welcome).
> 
> 
> At some point I think we should explore deprecating master as well, but
> that is used very broadly inside of our code and in our APIs, so while it
> is visible to more people, changing it would be more work. I think having
> consensus around removing slave though is a good first step.
> 
> 
> Cheers,
> 
> 
> Holden
> 
> 
> --
> Twitter: https://twitter.com/holdenkarau
> 
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>



[ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Reynold Xin
Hi all,

Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many of 
the innovations from Spark 2.x, bringing new ideas as well as continuing 
long-term projects that have been in development. This release resolves more 
than 3400 tickets.

We'd like to thank our contributors and users for their contributions and early 
feedback to this release. This release would not have been possible without you.

To download Spark 3.0.0, head over to the download page: 
http://spark.apache.org/downloads.html

To view the release notes: 
https://spark.apache.org/releases/spark-release-3-0-0.html



Re: [vote] Apache Spark 3.0 RC3

2020-06-17 Thread Reynold Xin
Hopefully today! I think we can get the release officially out tomorrow. For 
those of you that don't know, it's already on Maven. Please take a final look 
at the release notes, as I will turn it into a release page today.

On Wed, Jun 17, 2020 at 11:31 AM, Tom Graves < tgraves...@yahoo.com > wrote:

> 
> Reynold,
> 
> 
> What's the plan on pushing the official release binaries and source tar? 
> It would be nice to have the release artifacts now that it's available on
> maven.
> 
> 
> thanks,
> Tom
> 
> 
> On Monday, June 15, 2020, 01:52:12 PM CDT, Reynold Xin < rxin@ databricks.
> com ( r...@databricks.com ) > wrote:
> 
> 
> 
> 
> Thanks for the reminder, Dongjoon.
> 
> 
> 
> I created the official release tag over the past weekend and have been working on
> the release notes (a lot of interesting changes!). I've created a Google
> Doc so it's easier for everybody to comment on things that I've
> missed:
> https://docs.google.com/document/d/1NrTqxf2f39AXDF8VTIch6kwD8VKPaIlLW1QvuqEcwR4/edit
> 
> 
> 
> Plan to publish to maven et al today or tomorrow and give a day or two for
> dev@ to comment on the release notes before finalizing.
> 
> 
> 
> PS: There are two critical problems I've seen with the release (Spark UI
> is virtually unusable in some cases, and streaming issues). I will
> highlight them in the release notes and link to the JIRA tickets. But I
> think we should make 3.0.1 ASAP to follow up.
> 
> 
> 
> 
> 
> 
> On Sun, Jun 14, 2020 at 11:46 AM, Dongjoon Hyun < dongjoon. hyun@ gmail. com
> ( dongjoon.h...@gmail.com ) > wrote:
> 
>> Hi, Reynold.
>> 
>> 
>> Is there any progress on 3.0.0 release since the vote was finalized 5 days
>> ago?
>> 
>> 
>> Apparently, tag `v3.0.0` is not created yet, the binary and docs are still
>> sitting on the voting location, Maven Central doesn't have it, and
>> PySpark/SparkR uploading is not started yet.
>> 
>> 
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
>> 
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/
>> 
>> 
>> 
>> Like Apache Spark 2.0.1 had 316 fixes after 2.0.0, we already have 35
>> patches on top of `v3.0.0-rc3` and are expecting more.
>> 
>> 
>> Although we can have Apache Spark 3.0.1 very soon before Spark+AI Summit,
>> Apache Spark 3.0.0 should be available in Apache Spark distribution
>> channel because it passed the vote.
>> 
>> 
>> 
>> Apache Spark 3.0.0 release itself helps the community use 3.0-line
>> codebase and makes the codebase healthy.
>> 
>> 
>> Please let us know if you need any help from the community for 3.0.0
>> release.
>> 
>> 
>> Thanks,
>> Dongjoon.
>> 
>> 
>> 
>> On Tue, Jun 9, 2020 at 9:41 PM Matei Zaharia < matei. zaharia@ gmail. com (
>> matei.zaha...@gmail.com ) > wrote:
>> 
>> 
>>> Congrats! Excited to see the release posted soon.
>>> 
>>> 
>>>> On Jun 9, 2020, at 6:39 PM, Reynold Xin < rxin@ databricks. com (
>>>> r...@databricks.com ) > wrote:
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>>> 
>>>> I waited another day to account for the weekend. This vote passes with the
>>>> following +1 votes and no -1 votes!
>>>> 
>>>> 
>>>> 
>>>> I'll start the release prep later this week.
>>>> 
>>>> 
>>>> 
>>>> +1:
>>>> 
>>>> Reynold Xin (binding)
>>>> 
>>>> Prashant Sharma (binding)
>>>> 
>>>> Gengliang Wang
>>>> 
>>>> Sean Owen (binding)
>>>> 
>>>> Mridul Muralidharan (binding)
>>>> 
>>>> Takeshi Yamamuro
>>>> 
>>>> Maxim Gekk
>>>> 
>>>> Matei Zaharia (binding)
>>>> 
>>>> Jungtaek Lim
>>>> 
>>>> Denny Lee
>>>> 
>>>> Russell Spitzer
>>>> 
>>>> Dongjoon Hyun (binding)
>>>> 
>>>> DB Tsai (binding)
>>>> 
>>>> Michael Armbrust (binding)
>>>> 
>>>> Tom Graves (binding)
>>>> 
>>>>

Re: [vote] Apache Spark 3.0 RC3

2020-06-15 Thread Reynold Xin
Thanks for the reminder, Dongjoon.

I created the official release tag over the past weekend and have been working on the 
release notes (a lot of interesting changes!). I've created a Google Doc so 
it's easier for everybody to comment on things that I've missed: 
https://docs.google.com/document/d/1NrTqxf2f39AXDF8VTIch6kwD8VKPaIlLW1QvuqEcwR4/edit

Plan to publish to maven et al today or tomorrow and give a day or two for dev@ 
to comment on the release notes before finalizing.

PS: There are two critical problems I've seen with the release (Spark UI is 
virtually unusable in some cases, and streaming issues). I will highlight them 
in the release notes and link to the JIRA tickets. But I think we should make 
3.0.1 ASAP to follow up.

On Sun, Jun 14, 2020 at 11:46 AM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Hi, Reynold.
> 
> 
> Is there any progress on 3.0.0 release since the vote was finalized 5 days
> ago?
> 
> 
> Apparently, tag `v3.0.0` is not created yet, the binary and docs are still
> sitting on the voting location, Maven Central doesn't have it, and
> PySpark/SparkR uploading is not started yet.
> 
> 
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
> 
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/
> 
> 
> 
> Like Apache Spark 2.0.1 had 316 fixes after 2.0.0, we already have 35
> patches on top of `v3.0.0-rc3` and are expecting more.
> 
> 
> Although we can have Apache Spark 3.0.1 very soon before Spark+AI Summit,
> Apache Spark 3.0.0 should be available in Apache Spark distribution
> channel because it passed the vote.
> 
> 
> 
> Apache Spark 3.0.0 release itself helps the community use 3.0-line
> codebase and makes the codebase healthy.
> 
> 
> Please let us know if you need any help from the community for 3.0.0
> release.
> 
> 
> Thanks,
> Dongjoon.
> 
> 
> 
> On Tue, Jun 9, 2020 at 9:41 PM Matei Zaharia < matei. zaharia@ gmail. com (
> matei.zaha...@gmail.com ) > wrote:
> 
> 
>> Congrats! Excited to see the release posted soon.
>> 
>> 
>>> On Jun 9, 2020, at 6:39 PM, Reynold Xin < rxin@ databricks. com (
>>> r...@databricks.com ) > wrote:
>>> 
>>> 
>> 
>> 
>> 
>>> 
>>> I waited another day to account for the weekend. This vote passes with the
>>> following +1 votes and no -1 votes!
>>> 
>>> 
>>> 
>>> I'll start the release prep later this week.
>>> 
>>> 
>>> 
>>> +1:
>>> 
>>> Reynold Xin (binding)
>>> 
>>> Prashant Sharma (binding)
>>> 
>>> Gengliang Wang
>>> 
>>> Sean Owen (binding)
>>> 
>>> Mridul Muralidharan (binding)
>>> 
>>> Takeshi Yamamuro
>>> 
>>> Maxim Gekk
>>> 
>>> Matei Zaharia (binding)
>>> 
>>> Jungtaek Lim
>>> 
>>> Denny Lee
>>> 
>>> Russell Spitzer
>>> 
>>> Dongjoon Hyun (binding)
>>> 
>>> DB Tsai (binding)
>>> 
>>> Michael Armbrust (binding)
>>> 
>>> Tom Graves (binding)
>>> 
>>> Bryan Cutler
>>> 
>>> Huaxin Gao
>>> 
>>> Jiaxin Shan
>>> 
>>> Xingbo Jiang
>>> 
>>> Xiao Li (binding)
>>> 
>>> Hyukjin Kwon (binding)
>>> 
>>> Kent Yao
>>> 
>>> Wenchen Fan (binding)
>>> 
>>> Shixiong Zhu (binding)
>>> 
>>> Burak Yavuz
>>> 
>>> Tathagata Das (binding)
>>> 
>>> Ryan Blue
>>> 
>>> 
>>> 
>>> -1: None
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Sat, Jun 06, 2020 at 1:08 PM, Reynold Xin < rxin@ databricks. com (
>>> r...@databricks.com ) > wrote:
>>> 
>>>> Please vote on releasing the following candidate as Apache Spark version
>>>> 3.0.0.
>>>> 
>>>> 
>>>> 
>>>> The vote is open until [DUE DAY] and passes if a majority +1 PMC votes are
>>>> cast, with a minimum of 3 +1 votes.
>>>> 
>>>> 
>>>> 
>>>> [ ] +1 Release this package as Apache Spark 3.0.0
>>>> 
>>>> [ ] -1 Do not release this package because ...
>>>> 
>>>> 
>>>> 
>>>> To learn more about Apache Spark, please se

Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-13 Thread Reynold Xin
Echoing Sean's earlier comment … What is the functionality that would go into a 
2.5.0 release, that can't be in a 2.4.7 release?

On Fri, Jun 12, 2020 at 11:14 PM, Holden Karau < hol...@pigscanfly.ca > wrote:

> 
> Can I suggest we maybe decouple this conversation a bit? First, see whether there
> is agreement on making a transitional release in principle, and then
> folks who feel strongly about specific backports can have their respective
> discussions. It's not like we normally know or
> have agreement on everything going into a release at the time we cut the
> branch.
> 
> On Fri, Jun 12, 2020 at 10:28 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> I understand the argument to add JDK 11 support just to extend the EOL,
>> but the other things seem kind of arbitrary and are not supported by your
>> arguments, especially DSv2 which is a massive change. DSv2 IIUC is not api
>> stable yet and will continue to evolve in the 3.x line.
>> 
>> 
>> Spark is designed in a way that’s decoupled from storage, and as a result
>> one can run multiple versions of Spark in parallel during migration.
>> 
> 
> At the job level sure, but upgrading large jobs, possibly written in Scala
> 2.11, whole-hog as it currently stands is not a small matter.
> 
>> 
>> On Fri, Jun 12, 2020 at 9:40 PM DB Tsai < dbtsai@ dbtsai. com (
>> dbt...@dbtsai.com ) > wrote:
>> 
>> 
>>> +1 for a 2.x release with DSv2, JDK11, and Scala 2.11 support
>>> 
>>> 
>>> 
>>> We had an internal preview version of Spark 3.0 for our customers to try
>>> out for a while, and then we realized that it's very challenging for
>>> enterprise applications in production to move to Spark 3.0. For example,
>>> many of our customers' Spark applications depend on some internal projects
>>> that may not be owned by ETL teams; it requires much coordination with
>>> other teams to cross-build the dependencies that Spark applications depend
>>> on with Scala 2.12 in order to use Spark 3.0. Now that we have removed support
>>> for Scala 2.11 in Spark 3.0, this results in a really big gap when migrating
>>> from the 2.x line to 3.0, based on my observation working with our
>>> customers.
>>> 
>>> 
>>> Also, JDK8 is already EOL; in some companies, using JDK8 is not supported
>>> by the infra team and requires an exception to use an unsupported JDK. Of
>>> course, for those companies, they can use vendor's Spark distribution such
>>> as CDH Spark 2.4 which supports JDK11 or they can maintain their own Spark
>>> release which is possible but not very trivial.
>>> 
>>> 
>>> As a result, having a 2.5 release with DSv2, JDK11, and Scala 2.11 support
>>> can definitely narrow the gap, and users can still move forward using new
>>> features. After all, the reason we work on OSS is that we like people
>>> to use our code, isn't it?
>>> 
>>> Sincerely,
>>> 
>>> DB Tsai
>>> --
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 42E5B25A8F7A82C1
>>> 
>>> 
>>> 
>>> On Fri, Jun 12, 2020 at 8:51 PM Jungtaek Lim < kabhwan. opensource@ gmail.
>>> com ( kabhwan.opensou...@gmail.com ) > wrote:
>>> 
>>> 
>>>> I guess we already went through the same discussion, right? If anyone
>>>> missed it, please go through the discussion thread. [1] The consensus did not
>>>> look positive on migrating the new DSv2 into the Spark 2.x version line,
>>>> because the change is pretty huge, and also backward incompatible.
>>>> 
>>>> 
>>>> The benefit I can think of for having Spark 2.5 is to avoid forcing an upgrade
>>>> to the major release just to get fixes for critical bugs. Not all critical
>>>> fixes were landed to 2.x as well, because some fixes bring backward
>>>> incompatibility. We don't land these fixes to the 2.x version line because
>>>> we didn't consider having Spark 2.5 before - we don't want to let end
>>>> users tolerate the inconvenience during upgrading bugfix version. End
>>>> users may be OK to tolerate during upgrading minor version, since they can
>>>> still live with 2.4.x to deny these fixes.
>>>> 
>>>> 
>>>> In addition, given there's a huge time gap between Spark 2.4 and 3.0, we
>>>> might want to consider porting some of

Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Reynold Xin
I understand the argument to add JDK 11 support just to extend the EOL, but
the other things seem kind of arbitrary and are not supported by your
arguments, especially DSv2 which is a massive change. DSv2 IIUC is not api
stable yet and will continue to evolve in the 3.x line.

Spark is designed in a way that’s decoupled from storage, and as a result
one can run multiple versions of Spark in parallel during migration.

On Fri, Jun 12, 2020 at 9:40 PM DB Tsai  wrote:

> +1 for a 2.x release with DSv2, JDK11, and Scala 2.11 support
>
> We had an internal preview version of Spark 3.0 for our customers to try
> out for a while, and then we realized that it's very challenging for
> enterprise applications in production to move to Spark 3.0. For example,
> many of our customers' Spark applications depend on some internal projects
> that may not be owned by ETL teams; it requires much coordination with
> other teams to cross-build the dependencies that Spark applications depend
> on with Scala 2.12 in order to use Spark 3.0. Now that we have removed support
> for Scala 2.11 in Spark 3.0, this results in a really big gap when migrating
> from the 2.x line to 3.0, based on my observation working with our customers.
>
> Also, JDK8 is already EOL; in some companies, using JDK8 is not supported
> by the infra team and requires an exception to use an unsupported JDK. Of
> course, for those companies, they can use vendor's Spark distribution such
> as CDH Spark 2.4 which supports JDK11 or they can maintain their own Spark
> release which is possible but not very trivial.
>
> As a result, having a 2.5 release with DSv2, JDK11, and Scala 2.11 support
> can definitely narrow the gap, and users can still move forward using new
> features. After all, the reason we work on OSS is that we like people
> to use our code, isn't it?
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>
>
> On Fri, Jun 12, 2020 at 8:51 PM Jungtaek Lim 
> wrote:
>
>> I guess we already went through the same discussion, right? If anyone
>> missed it, please go through the discussion thread. [1] The consensus did not
>> look positive on migrating the new DSv2 into the Spark 2.x version line,
>> because the change is pretty huge, and also backward incompatible.
>>
>> The benefit I can think of for having Spark 2.5 is to avoid forcing an
>> upgrade to the major release just to get fixes for critical bugs. Not all
>> critical fixes were landed to 2.x as well, because some fixes bring
>> backward incompatibility. We don't land these fixes to the 2.x version line
>> because we didn't consider having Spark 2.5 before - we don't want to let
>> end users tolerate the inconvenience during upgrading bugfix version. End
>> users may be OK to tolerate during upgrading minor version, since they can
>> still live with 2.4.x to deny these fixes.
>>
>> In addition, given there's a huge time gap between Spark 2.4 and 3.0, we
>> might want to consider porting some of the features which don't bring backward
>> incompatibility. Well, new major features of Spark 3.0 would probably be
>> better introduced in Spark 3.0 itself, but some features could be,
>> especially if the feature resolves the long-standing issue or the feature
>> has been provided for a long time in competitive products.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1.
>> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Spark-2-5-release-td27963.html#a27979
>>
>> On Sat, Jun 13, 2020 at 10:13 AM Ryan Blue 
>> wrote:
>>
>>> +1 for a 2.x release with a DSv2 API that matches 3.0.
>>>
>>> There are a lot of big differences between the API in 2.4 and 3.0, and I
>>> think a release to help migrate would be beneficial to organizations like
>>> ours that will be supporting 2.x and 3.0 in parallel for quite a while.
>>> Migration to Spark 3 is going to take time as people build confidence in
>>> it. I don't think that can be avoided by leaving a larger feature gap
>>> between 2.x and 3.0.
>>>
>>> On Fri, Jun 12, 2020 at 5:53 PM Xiao Li  wrote:
>>>
 Based on my understanding, DSV2 is not stable yet. It still
 misses various features. Even our built-in file sources are still unable to
 fully migrate to DSV2. We plan to enhance it in the next few releases to
 close the gap.

 Also, the changes on DSV2 in Spark 3.0 did not break any existing
 application. We should encourage more users to try Spark 3 and increase the
 adoption of Spark 3.x.

 Xiao

 On Fri, Jun 12, 2020 at 5:36 PM Holden Karau 
 wrote:

> So one of the things which we’re planning on backporting internally
> is DSv2, which I think would be more broadly useful if it were available in a
> community release in a 2.x branch. Anything else on top of that would be on a
> case by case basis for if they make an easier upgrade path to 3.
>
> If we’re worried about people using 2.5 as 

Re: [vote] Apache Spark 3.0 RC3

2020-06-09 Thread Reynold Xin
I waited another day to account for the weekend. This vote passes with the 
following +1 votes and no -1 votes!

I'll start the release prep later this week.

+1:

Reynold Xin (binding)

Prashant Sharma (binding)

Gengliang Wang

Sean Owen (binding)

Mridul Muralidharan (binding)

Takeshi Yamamuro

Maxim Gekk

Matei Zaharia (binding)

Jungtaek Lim

Denny Lee

Russell Spitzer

Dongjoon Hyun (binding)

DB Tsai (binding)

Michael Armbrust (binding)

Tom Graves (binding)

Bryan Cutler

Huaxin Gao

Jiaxin Shan

Xingbo Jiang

Xiao Li (binding)

Hyukjin Kwon (binding)

Kent Yao

Wenchen Fan (binding)

Shixiong Zhu (binding)

Burak Yavuz

Tathagata Das (binding)

Ryan Blue

-1: None

On Sat, Jun 06, 2020 at 1:08 PM, Reynold Xin < r...@databricks.com > wrote:

> 
> Please vote on releasing the following candidate as Apache Spark version
> 3.0.0.
> 
> 
> 
> The vote is open until [DUE DAY] and passes if a majority +1 PMC votes are
> cast, with a minimum of 3 +1 votes.
> 
> 
> 
> [ ] +1 Release this package as Apache Spark 3.0.0
> 
> [ ] -1 Do not release this package because ...
> 
> 
> 
> To learn more about Apache Spark, please see http://spark.apache.org/
> 
> 
> 
> The tag to be voted on is v3.0.0-rc3 (commit
> 3fdfce3120f307147244e5eaf46d61419a723d50):
> 
> https://github.com/apache/spark/tree/v3.0.0-rc3
> 
> 
> 
> The release files, including signatures, digests, etc. can be found at:
> 
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
> 
> 
> 
> Signatures used for Spark RCs can be found in this file:
> 
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
> 
> 
> The staging repository for this release can be found at:
> 
> https://repository.apache.org/content/repositories/orgapachespark-1350/
> 
> 
> 
> The documentation corresponding to this release can be found at:
> 
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/
> 
> 
> 
> The list of bug fixes going into 3.0.0 can be found at the following URL:
> 
> https://issues.apache.org/jira/projects/SPARK/versions/12339177
> 
> 
> 
> This release is using the release script of the tag v3.0.0-rc3.
> 
> 
> 
> FAQ
> 
> 
> 
> =
> 
> How can I help test this release?
> 
> =
> 
> 
> 
> If you are a Spark user, you can help us test this release by taking
> 
> an existing Spark workload and running on this release candidate, then
> 
> reporting any regressions.
> 
> 
> 
> If you're working in PySpark you can set up a virtual env and install
> 
> the current RC and see if anything important breaks, in the Java/Scala
> 
> you can add the staging repository to your projects resolvers and test
> 
> with the RC (make sure to clean up the artifact cache before/after so
> 
> you don't end up building with an out-of-date RC going forward).
> 
> 
> 
> ===
> 
> What should happen to JIRA tickets still targeting 3.0.0?
> 
> ===
> 
> 
> 
> The current list of open tickets targeted at 3.0.0 can be found at:
> 
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.0
> 
> 
> 
> Committers should look at those and triage. Extremely important bug
> 
> fixes, documentation, and API tweaks that impact compatibility should
> 
> be worked on immediately. Everything else please retarget to an
> 
> appropriate release.
> 
> 
> 
> ==
> 
> But my bug isn't fixed?
> 
> ==
> 
> 
> 
> In order to make timely releases, we will typically not hold the
> 
> release unless the bug in question is a regression from the previous
> 
> release. That being said, if there is something which is a regression
> 
> that has not been correctly targeted please ping me or a committer to
> 
> help target the issue.
>



Re: [vote] Apache Spark 3.0 RC3

2020-06-06 Thread Reynold Xin
Apologies for the mistake. The vote is open till 11:59pm Pacific time on
Mon June 9th.

On Sat, Jun 6, 2020 at 1:08 PM Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.0.0.
>
> The vote is open until [DUE DAY] and passes if a majority +1 PMC votes are
> cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.0.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.0.0-rc3 (commit
> 3fdfce3120f307147244e5eaf46d61419a723d50):
> https://github.com/apache/spark/tree/v3.0.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1350/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/
>
> The list of bug fixes going into 3.0.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>
> This release is using the release script of the tag v3.0.0-rc3.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.0.0?
> ===
>
> The current list of open tickets targeted at 3.0.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>




[vote] Apache Spark 3.0 RC3

2020-06-06 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 3.0.0.

The vote is open until [DUE DAY] and passes if a majority +1 PMC votes are 
cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.0.0

[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.0.0-rc3 (commit 
3fdfce3120f307147244e5eaf46d61419a723d50):

https://github.com/apache/spark/tree/v3.0.0-rc3

The release files, including signatures, digests, etc. can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/

Signatures used for Spark RCs can be found in this file:

https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1350/

The documentation corresponding to this release can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/

The list of bug fixes going into 3.0.0 can be found at the following URL:

https://issues.apache.org/jira/projects/SPARK/versions/12339177

This release is using the release script of the tag v3.0.0-rc3.

FAQ

=

How can I help test this release?

=

If you are a Spark user, you can help us test this release by taking

an existing Spark workload and running on this release candidate, then

reporting any regressions.

If you're working in PySpark you can set up a virtual env and install

the current RC and see if anything important breaks, in the Java/Scala

you can add the staging repository to your projects resolvers and test

with the RC (make sure to clean up the artifact cache before/after so

you don't end up building with an out-of-date RC going forward).

===

What should happen to JIRA tickets still targeting 3.0.0?

===

The current list of open tickets targeted at 3.0.0 can be found at:

https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.0.0

Committers should look at those and triage. Extremely important bug

fixes, documentation, and API tweaks that impact compatibility should

be worked on immediately. Everything else please retarget to an

appropriate release.

==

But my bug isn't fixed?

==

In order to make timely releases, we will typically not hold the

release unless the bug in question is a regression from the previous

release. That being said, if there is something which is a regression

that has not been correctly targeted please ping me or a committer to

help target the issue.



[VOTE] Apache Spark 3.0 RC2

2020-05-18 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 3.0.0.

The vote is open until Thu May 21 11:59pm Pacific time and passes if a majority 
+1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.0.0

[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.0.0-rc2 (commit 
29853eca69bceefd227cbe8421a09c116b7b753a):

https://github.com/apache/spark/tree/v3.0.0-rc2

The release files, including signatures, digests, etc. can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc2-bin/

Signatures used for Spark RCs can be found in this file:

https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1345/

The documentation corresponding to this release can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc2-docs/

The list of bug fixes going into 3.0.0 can be found at the following URL:

https://issues.apache.org/jira/projects/SPARK/versions/12339177

This release is using the release script of the tag v3.0.0-rc2.

FAQ

=

How can I help test this release?

=

If you are a Spark user, you can help us test this release by taking

an existing Spark workload and running on this release candidate, then

reporting any regressions.

If you're working in PySpark you can set up a virtual env and install

the current RC and see if anything important breaks, in the Java/Scala

you can add the staging repository to your projects resolvers and test

with the RC (make sure to clean up the artifact cache before/after so

you don't end up building with an out-of-date RC going forward).

===

What should happen to JIRA tickets still targeting 3.0.0?

===

The current list of open tickets targeted at 3.0.0 can be found at:

https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.0.0

Committers should look at those and triage. Extremely important bug

fixes, documentation, and API tweaks that impact compatibility should

be worked on immediately. Everything else please retarget to an

appropriate release.

==

But my bug isn't fixed?

==

In order to make timely releases, we will typically not hold the

release unless the bug in question is a regression from the previous

release. That being said, if there is something which is a regression

that has not been correctly targeted please ping me or a committer to

help target the issue.



Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-28 Thread Reynold Xin
The con is much more than just more effort to maintain a parallel API. It
puts the burden on all libraries and library developers to maintain a
parallel API as well. That’s one of the primary reasons we moved away from
this RDD vs JavaRDD approach in the old RDD API.
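
To make the cost concrete, here is a toy model of the kind of RDD / JavaRDD split
being referred to (an illustrative sketch, not the actual Spark source):

    import java.util.{List => JList}
    import scala.collection.JavaConverters._

    class MiniRDD[T](val data: Seq[T]) {
      def map[U](f: T => U): MiniRDD[U] = new MiniRDD(data.map(f))
      def collect(): Seq[T] = data
    }

    // The parallel Java-facing surface that must be kept in sync forever, and
    // that every library built on top has to mirror as well.
    class JavaMiniRDD[T](val rdd: MiniRDD[T]) {
      def map[U](f: java.util.function.Function[T, U]): JavaMiniRDD[U] =
        new JavaMiniRDD(rdd.map(t => f.apply(t)))
      def collect(): JList[T] = rdd.collect().asJava
    }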


On Tue, Apr 28, 2020 at 12:38 AM ZHANG Wei  wrote:

> Frankly, I also love the pure Java type in the Java API and the Scala type in
> the Scala API. :-)
>
> If we don't treat Java as a "FRIEND" of Scala, just like Python, maybe we
> can adopt the shape of option 1, the specific Java classes. (But I don't
> like the `Java` prefix, which is redundant when I'm coding a Java app,
> such as JavaRDD; why not distinguish it by package namespace...) The specific
> Java API can also leverage some native Java language features in new
> versions.
>
> And given the friendly relationship between Scala and Java, a Java
> user can call the Scala API with the help of `.asScala` or `.asJava` if the Java API
> is not ready, then switch to the Java API when it's well cooked.
>
> The con is more effort to maintain.
>
> My 2 cents.
>
> --
> Cheers,
> -z
>
> On Tue, 28 Apr 2020 12:07:36 +0900
> Hyukjin Kwon  wrote:
>
> > The problem is that calling Scala instances from the Java side is generally
> > discouraged, to the best of my knowledge.
> > A Java user is unlikely to know asJava in Scala, but a Scala user will
> > likely know both asScala and asJava.
> >
> >
> > On Tue, Apr 28, 2020 at 11:35 AM, ZHANG Wei wrote:
> >
> > > How about making a small change to option 4:
> > >   Keep the Scala API returning a Scala type instance, while providing an
> > >   `asJava` method to return a Java type instance.
> > >
> > > Scala 2.13 provides CollectionConverters [1][2][3], which will be supported
> > > naturally once Spark upgrades its dependencies. For the
> > > current Scala 2.12 version, we can wrap `ImplicitConversionsToJava` [4]
> > > as Scala 2.13 does and add implicit conversions.
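> > >
> > > A minimal sketch of that shape on Scala 2.12 (the class and method names here
> > > are illustrative, not an existing Spark API):
> > >
> > >   import scala.collection.JavaConverters._
> > >
> > >   class TableNames(private val names: Seq[String]) {
> > >     def get: Seq[String] = names                       // Scala callers
> > >     def asJava: java.util.List[String] = names.asJava  // Java callers
> > >   }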
> > >
> > > Just my 2 cents.
> > >
> > > --
> > > Cheers,
> > > -z
> > >
> > > [1]
> > > https://docs.scala-lang.org/overviews/collections-2.13/conversions-between-java-and-scala-collections.html
> > > [2]
> > > https://www.scala-lang.org/api/2.13.0/scala/jdk/javaapi/CollectionConverters$.html
> > > [3]
> > > https://www.scala-lang.org/api/2.13.0/scala/jdk/CollectionConverters$.html
> > > [4]
> > > https://www.scala-lang.org/api/2.12.11/scala/collection/convert/ImplicitConversionsToJava$.html
> > >
> > >
> > > On Tue, 28 Apr 2020 08:52:57 +0900
> > > Hyukjin Kwon  wrote:
> > >
> > > > I would like to make sure I am open to other options that can be
> > > > considered situationally and based on the context.
> > > > It's okay, and I don't target to restrict this here. For example,
> DSv2, I
> > > > understand it's written in Java because Java
> > > > interfaces arguably bring better performance. That's why vectorized
> > > > readers are written in Java too.
> > > >
> > > > Maybe the "general" wasn't explicit in my previous email. Adding
> APIs to
> > > > return a Java instance is still
> > > > rather rare in general, given my few years of monitoring.
> > > > The problem I would rather deal with is when we need to
> > > > add one or a couple of user-facing
> > > > Java-specific APIs to return Java instances, which is relatively more
> > > > frequent compared to when we need a bunch
> > > > of Java specific APIs.
> > > >
> > > > In this case, I think it should be guided towards approach 4. There are
> > > > pros and cons between 3. and 4., of course.
> > > > But it looks to me like approach 4 is closer to what Spark has targeted
> > > > so far.
> > > >
> > > >
> > > >
> > > > On Tue, Apr 28, 2020 at 8:34 AM, Hyukjin Kwon wrote:
> > > >
> > > > > > One thing we could do here is use Java collections internally and
> > > make
> > > > > the Scala API a thin wrapper around Java -- like how Python works.
> > > > > > Then adding a method to the Scala API would require adding it to
> the
> > > > > Java API and we would keep the two more in 

Re: Spark DAG scheduler

2020-04-16 Thread Reynold Xin
If you are talking about a tree, then the RDDs are nodes, and the dependencies 
are the edges.

If you are talking about a DAG, then the partitions in the RDDs are the nodes, 
and the dependencies between the partitions are the edges.
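
A quick way to see both views from spark-shell (just a sketch):

    val rdd = sc.parallelize(1 to 100, 4).map(_ * 2).groupBy(_ % 10)

    rdd.toDebugString          // the lineage: every RDD in the chain
    rdd.dependencies           // the incoming edges (a ShuffleDependency here)
    rdd.dependencies.head.rdd  // follow an edge back to the parent RDD

The DAGScheduler walks this same dependency structure when it splits a job into stages.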

On Thu, Apr 16, 2020 at 4:02 PM, Mania Abdi < abdi...@husky.neu.edu > wrote:

> 
> Is it correct to say, the nodes in the DAG are RDDs and the edges are
> computations?
> 
> 
> On Thu, Apr 16, 2020 at 6:21 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> The RDD is the DAG.
>> 
>> 
>> 
>> On Thu, Apr 16, 2020 at 3:16 PM, Mania Abdi < abdi. ma@ husky. neu. edu (
>> abdi...@husky.neu.edu ) > wrote:
>> 
>>> Hello everyone,
>>> 
>>> I am implementing a caching mechanism for analytic workloads running on
>>> top of Spark and I need to retrieve the Spark DAG right after it is
>>> generated and the DAG scheduler. I would appreciate it if you could give
>>> me some hints or reference me to some documents about where the DAG is
>>> generated and inputs assigned to it. I found the DAG Scheduler class (
>>> https://github.com/apache/spark/blob/55dea9be62019d64d5d76619e1551956c8bb64d0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
>>> ) but I am not sure if it is a good starting point.
>>> 
>>> 
>>> 
>>> Regards
>>> Mania
>>> 
>> 
>> 
> 
>



Re: Spark DAG scheduler

2020-04-16 Thread Reynold Xin
The RDD is the DAG.

On Thu, Apr 16, 2020 at 3:16 PM, Mania Abdi < abdi...@husky.neu.edu > wrote:

> 
> Hello everyone,
> 
> I am implementing a caching mechanism for analytic workloads running on
> top of Spark and I need to retrieve the Spark DAG right after it is
> generated and the DAG scheduler. I would appreciate it if you could give
> me some hints or reference me to some documents about where the DAG is
> generated and inputs assigned to it. I found the DAG Scheduler class (
> https://github.com/apache/spark/blob/55dea9be62019d64d5d76619e1551956c8bb64d0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
> ) but I am not sure if it is a good starting point.
> 
> 
> 
> Regards
> Mania
>



Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-01 Thread Reynold Xin
The Apache Software Foundation requires voting before any release can be 
published.

On Tue, Mar 31, 2020 at 11:27 PM, Stephen Coy < s...@infomedia.com.au.invalid > 
wrote:

> 
> 
>> On 1 Apr 2020, at 5:20 pm, Sean Owen < srowen@ gmail. com (
>> sro...@gmail.com ) > wrote:
>> 
>> It can be published as "3.0.0-rc1" but how do we test that to vote on it
>> without some other RC1 RC1
>> 
> 
> 
> I’m not sure what you mean by this question?
> 
> 
> 
> 
>



[VOTE] Apache Spark 3.0.0 RC1

2020-03-31 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 3.0.0.

The vote is open until 11:59pm Pacific time Fri Apr 3 , and passes if a 
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.0.0

[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.0.0-rc1 (commit 
6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1):

https://github.com/apache/spark/tree/v3.0.0-rc1

The release files, including signatures, digests, etc. can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/

Signatures used for Spark RCs can be found in this file:

https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1341/

The documentation corresponding to this release can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-docs/

The list of bug fixes going into 3.0.0 can be found at the following URL:

https://issues.apache.org/jira/projects/SPARK/versions/12339177

This release is using the release script of the tag v3.0.0-rc1.

FAQ

=

How can I help test this release?

=

If you are a Spark user, you can help us test this release by taking

an existing Spark workload and running on this release candidate, then

reporting any regressions.

If you're working in PySpark you can set up a virtual env and install

the current RC and see if anything important breaks, in the Java/Scala

you can add the staging repository to your projects resolvers and test

with the RC (make sure to clean up the artifact cache before/after so

you don't end up building with an out-of-date RC going forward).

===

What should happen to JIRA tickets still targeting 3.0.0?

===

The current list of open tickets targeted at 3.0.0 can be found at:

https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.0.0

Committers should look at those and triage. Extremely important bug

fixes, documentation, and API tweaks that impact compatibility should

be worked on immediately. Everything else please retarget to an

appropriate release.

==

But my bug isn't fixed?

==

In order to make timely releases, we will typically not hold the

release unless the bug in question is a regression from the previous

release. That being said, if there is something which is a regression

that has not been correctly targeted please ping me or a committer to

help target the issue.

Note: I fully expect this RC to fail.



Re: Release Manager's official `branch-3.0` Assessment?

2020-03-28 Thread Reynold Xin
Let's start cutting RC next week.

On Sat, Mar 28, 2020 at 11:51 AM, Sean Owen < sro...@gmail.com > wrote:

> 
> I'm also curious - there are no open blockers for 3.0, but I know a few are
> still floating around open to revert changes. What is the status there?
> From my field of view I'm not aware of other blocking issues.
> 
> On Fri, Mar 27, 2020 at 10:56 PM Jungtaek Lim < kabhwan. opensource@ gmail.
> com ( kabhwan.opensou...@gmail.com ) > wrote:
> 
> 
>> Now the end of March is just around the corner. I'm not qualified to say
>> (and honestly don't know) where we are, but if we were intended to be in
>> blocker mode it doesn't seem to be working; lots of development still happens,
>> and priority/urgency doesn't seem to be applied to the sequence of
>> reviewing.
>> 
>> 
>> How about listing (or linking to an epic, or labelling) JIRA issues/PRs which
>> are blockers (either by priority or technically) for the Spark 3.0 release,
>> and making clear we should try to review these blockers first? A GitHub PR
>> label may help here to filter out other PRs and concentrate on these.
>> 
>> 
>> 
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>> 
>> 
>> 
>> On Wed, Mar 25, 2020 at 1:52 PM Xiao Li < lixiao@ databricks. com (
>> lix...@databricks.com ) > wrote:
>> 
>> 
>>> Let us try to finish the remaining major blockers in the next few days.
>>> For example, https:/ / issues. apache. org/ jira/ browse/ SPARK-31085 (
>>> https://issues.apache.org/jira/browse/SPARK-31085 )
>>> 
>>> 
>>> +1 to cut the RC even if we still have the blockers that will fail the
>>> RCs. 
>>> 
>>> 
>>> 
>>> Cheers,
>>> 
>>> 
>>> Xiao
>>> 
>>> 
>>> 
>>> On Tue, Mar 24, 2020 at 6:56 PM Dongjoon Hyun < dongjoon. hyun@ gmail. com
>>> ( dongjoon.h...@gmail.com ) > wrote:
>>> 
>>> 
>>>> +1
>>>> 
>>>> 
>>>> Thanks,
>>>> Dongjoon.
>>>> 
>>>> On Tue, Mar 24, 2020 at 14:49 Reynold Xin < rxin@ databricks. com (
>>>> r...@databricks.com ) > wrote:
>>>> 
>>>> 
>>>>> I actually think we should start cutting RCs. We can cut RCs even with
>>>>> blockers.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Mar 24, 2020 at 12:51 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. 
>>>>> com
>>>>> ( dongjoon.h...@gmail.com ) > wrote:
>>>>> 
>>>>>> Hi, All.
>>>>>> 
>>>>>> First of all, always "Community Over Code"!
>>>>>> I wish you the best health and happiness.
>>>>>> 
>>>>>> As we know, we are still in the QA period and we didn't reach the RC stage.
>>>>>> It seems that we need to make the website up-to-date once more.
>>>>>> 
>>>>>>    https://spark.apache.org/versioning-policy.html
>>>>>> 
>>>>>> If possible, it would be really great if we can get `3.0.0` release
>>>>>> manager's official `branch-3.0` assessment because we have only 1 week
>>>>>> before the end of March.
>>>>>> 
>>>>>> 
>>>>>> Could you, the 3.0.0 release manager, share your thoughts and update the
>>>>>> website, please?
>>>>>> 
>>>>>> 
>>>>>> Bests
>>>>>> Dongjoon.
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> 
>> 
>> 
> 
>



Re: results of taken(3) not appearing in console window

2020-03-26 Thread Reynold Xin
bcc dev, +user

You need to print out the result. Take itself doesn't print. You only got the 
results printed to the console because the Scala REPL automatically prints the 
returned value from take.

On Thu, Mar 26, 2020 at 12:15 PM, Zahid Rahman < zahidr1...@gmail.com > wrote:

> 
> I am running the same code with the same libraries but am not getting the same
> output.
> scala>  case class flight (DEST_COUNTRY_NAME: String,
>      |                      ORIGIN_COUNTRY_NAME:String,
>      |                      count: BigInt)
> defined class flight
> 
> scala>     val flightDf = spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
> flightDf: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string,
> ORIGIN_COUNTRY_NAME: string ... 1 more field]
> 
> scala> val flights = flightDf.as[flight]
> flights: org.apache.spark.sql.Dataset[flight] = [DEST_COUNTRY_NAME:
> string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
> 
> scala> flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME !=
> "Canada").map(flight_row => flight_row).take(3)
> *res0: Array[flight] = Array(flight(United States,Romania,1),
> flight(United States,Ireland,264), flight(United States,India,69))
> *
> 
> 
>  
> 20/03/26 19:09:00 INFO SparkContext: Running Spark version 3.0.0-preview2
> 20/03/26 19:09:00 INFO ResourceUtils:
> ==
> 20/03/26 19:09:00 INFO ResourceUtils: Resources for spark.driver:
> 
> 20/03/26 19:09:00 INFO ResourceUtils:
> ==
> 20/03/26 19:09:00 INFO SparkContext: Submitted application:  chapter2
> 20/03/26 19:09:00 INFO SecurityManager: Changing view acls to: kub19
> 20/03/26 19:09:00 INFO SecurityManager: Changing modify acls to: kub19
> 20/03/26 19:09:00 INFO SecurityManager: Changing view acls groups to:
> 20/03/26 19:09:00 INFO SecurityManager: Changing modify acls groups to:
> 20/03/26 19:09:00 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users  with view permissions: Set(kub19);
> groups with view permissions: Set(); users  with modify permissions:
> Set(kub19); groups with modify permissions: Set()
> 20/03/26 19:09:00 INFO Utils: Successfully started service 'sparkDriver'
> on port 46817.
> 20/03/26 19:09:00 INFO SparkEnv: Registering MapOutputTracker
> 20/03/26 19:09:00 INFO SparkEnv: Registering BlockManagerMaster
> 20/03/26 19:09:00 INFO BlockManagerMasterEndpoint: Using
> org.apache.spark.storage.DefaultTopologyMapper for getting topology information
> 20/03/26 19:09:00 INFO BlockManagerMasterEndpoint:
> BlockManagerMasterEndpoint up
> 20/03/26 19:09:00 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
> 20/03/26 19:09:00 INFO DiskBlockManager: Created local directory at
> /tmp/blockmgr-0baf5097-2595-4542-99e2-a192d7baf37c
> 20/03/26 19:09:00 INFO MemoryStore: MemoryStore started with capacity
> 127.2 MiB
> 20/03/26 19:09:01 INFO SparkEnv: Registering OutputCommitCoordinator
> 20/03/26 19:09:01 WARN Utils: Service 'SparkUI' could not bind on port
> 4040. Attempting port 4041.
> 20/03/26 19:09:01 INFO Utils: Successfully started service 'SparkUI' on
> port 4041.
> 20/03/26 19:09:01 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
> http://localhost:4041
> 20/03/26 19:09:01 INFO Executor: Starting executor ID driver on host
> localhost
> 20/03/26 19:09:01 INFO Utils: Successfully started service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38135.
> 20/03/26 19:09:01 INFO NettyBlockTransferService: Server created on
> localhost:38135
> 20/03/26 19:09:01 INFO BlockManager: Using
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
> 20/03/26 19:09:01 INFO BlockManagerMaster: Registering BlockManager
> BlockManagerId(driver, localhost, 38135, None)
> 20/03/26 19:09:01 INFO BlockManagerMasterEndpoint: Registering block
> manager localhost:38135 with 127.2 MiB RAM, BlockManagerId(driver,
> localhost, 38135, None)
> 20/03/26 19:09:01 INFO BlockManagerMaster: Registered BlockManager
> BlockManagerId(driver, localhost, 38135, None)
> 20/03/26 19:09:01 INFO BlockManager: Initialized BlockManager:
> BlockManagerId(driver, localhost, 38135, None)
> 20/03/26 19:09:01 INFO SharedState: Setting hive.metastore.warehouse.dir
> ('null') to the value of spark.sql.warehouse.dir
> ('file:/home/kub19/spark-3.0.0-preview2-bin-hadoop2.7/projects/spark-warehouse').
> 
> 20/03/26 19:09:01 INFO SharedState: Warehouse path is
> 'file:/home/kub19/spark-3.0.0-preview2-bin-hadoop2.7/projects/spark-warehouse'.
> 
> 20/03/26 19:09:02 INFO SparkContext: Starting job: parquet at
> 

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-24 Thread Reynold Xin
I actually think we should start cutting RCs. We can cut RCs even with blockers.

On Tue, Mar 24, 2020 at 12:51 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Hi, All.
> 
> First of all, always "Community Over Code"!
> I wish you the best health and happiness.
> 
> As we know, we are still working on QA period, we didn't reach RC stage.
> It seems that we need to make website up-to-date once more.
> 
>    https:/ / spark. apache. org/ versioning-policy. html (
> https://spark.apache.org/versioning-policy.html )
> 
> If possible, it would be really great if we can get `3.0.0` release
> manager's official `branch-3.0` assessment because we have only 1 week
> before the end of March.
> 
> 
> Could you, the 3.0.0 release manager, share your thoughts and update the
> website, please?
> 
> 
> Bests
> Dongjoon.
>



Re: FYI: The evolution on `CHAR` type behavior

2020-03-19 Thread Reynold Xin
I agree it sucks. We started with some decision that might have made sense back 
in 2013 (let's use Hive as the default source, and guess what, pick the slowest 
possible serde by default). We have been paying that debt ever since.

Thanks for bringing this thread up though. We don't have a clear solution yet, 
but at least it made a lot of people aware of the issues.

On Thu, Mar 19, 2020 at 8:56 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Technically, I have been suffering with (1) `CREATE TABLE` due to many
> differences for a long time (since 2017). So, I had a wrong assumption for
> the implication of that "(2) FYI: SPARK-30098 Use default datasource as
> provider for CREATE TABLE syntax", Reynold. I admit that. You may not feel
> in the similar way. However, it was a lot to me. Also, switching
> `convertMetastoreOrc` at 2.4 was a big change to me although there will be
> no difference for Parquet-only users.
> 
> 
> Dongjoon.
> 
> 
> > References:
> > 1. "CHAR implementation?", 2017/09/15
> >      https:/ / lists. apache. org/ thread. html/ 
> > 96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.
> spark. apache. org%3E (
> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
> )
> > 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
> syntax", 2019/12/06
> >    https:/ / lists. apache. org/ thread. html/ 
> > 493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.
> spark. apache. org%3E (
> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
> )
> 
> 
> 
> 
> 
> 
> On Thu, Mar 19, 2020 at 8:47 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> You are joking when you said " informed widely and discussed in many ways
>> twice" right?
>> 
>> 
>> 
>> This thread doesn't even talk about char/varchar: https:/ / lists. apache.
>> org/ thread. html/ 
>> 493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.
>> spark. apache. org%3E (
>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>> )
>> 
>> 
>> 
>> (Yes it talked about changing the default data source provider, but that's
>> just one of the ways we are exposing this char/varchar issue).
>> 
>> 
>> 
>> 
>> 
>> On Thu, Mar 19, 2020 at 8:41 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. com
>> ( dongjoon.h...@gmail.com ) > wrote:
>> 
>>> +1 for Wenchen's suggestion.
>>> 
>>> I believe that the difference and effects are informed widely and
>>> discussed in many ways twice.
>>> 
>>> First, this was shared on last December.
>>> 
>>>     "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>> syntax", 2019/12/06
>>>    https:/ / lists. apache. org/ thread. html/ 
>>> 493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.
>>> spark. apache. org%3E (
>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>> )
>>> 
>>> Second (at this time in this thread), this has been discussed according to
>>> the new community rubric.
>>> 
>>>     - https:/ / spark. apache. org/ versioning-policy. html (
>>> https://spark.apache.org/versioning-policy.html ) (Section: "Considerations
>>> When Breaking APIs")
>>> 
>>> 
>>> Thank you all.
>>> 
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> On Tue, Mar 17, 2020 at 10:41 PM Wenchen Fan < cloud0fan@ gmail. com (
>>> cloud0...@gmail.com ) > wrote:
>>> 
>>> 
>>>> OK let me put a proposal here:
>>>> 
>>>> 
>>>> 1. Permanently ban CHAR for native data source tables, and only keep it
>>>> for Hive compatibility.
>>>> It's OK to forget about padding like what Snowflake and MySQL have done.
>>>> But it's hard for Spark to require consistent behavior about CHAR type in
>>>> all data sources. Since CHAR type is not that useful nowadays, seems OK to
>>>> just ban it. Another way is to document that the padding of CHAR type is
>>>> data source dependent, but it's a bit weird to leave this inconsistency in
>>>> Spark.
>>>> 
>>>> 
>&g

Re: FYI: The evolution on `CHAR` type behavior

2020-03-19 Thread Reynold Xin
You are joking when you said "informed widely and discussed in many ways
twice", right?

This thread doesn't even talk about char/varchar: 
https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E

(Yes it talked about changing the default data source provider, but that's just 
one of the ways we are exposing this char/varchar issue).

On Thu, Mar 19, 2020 at 8:41 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> +1 for Wenchen's suggestion.
> 
> I believe that the difference and effects are informed widely and
> discussed in many ways twice.
> 
> First, this was shared on last December.
> 
>     "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
> syntax", 2019/12/06
>    https:/ / lists. apache. org/ thread. html/ 
> 493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.
> spark. apache. org%3E (
> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
> )
> 
> Second (at this time in this thread), this has been discussed according to
> the new community rubric.
> 
>     - https:/ / spark. apache. org/ versioning-policy. html (
> https://spark.apache.org/versioning-policy.html ) (Section: "Considerations
> When Breaking APIs")
> 
> 
> Thank you all.
> 
> 
> Bests,
> Dongjoon.
> 
> On Tue, Mar 17, 2020 at 10:41 PM Wenchen Fan < cloud0fan@ gmail. com (
> cloud0...@gmail.com ) > wrote:
> 
> 
>> OK let me put a proposal here:
>> 
>> 
>> 1. Permanently ban CHAR for native data source tables, and only keep it
>> for Hive compatibility.
>> It's OK to forget about padding like what Snowflake and MySQL have done.
>> But it's hard for Spark to require consistent behavior about CHAR type in
>> all data sources. Since CHAR type is not that useful nowadays, seems OK to
>> just ban it. Another way is to document that the padding of CHAR type is
>> data source dependent, but it's a bit weird to leave this inconsistency in
>> Spark.
>> 
>> 
>> 2. Leave VARCHAR unchanged in 3.0
>> VARCHAR type is so widely used in databases and it's weird if Spark
>> doesn't support it. VARCHAR type is exactly the same as Spark StringType
>> when the length limitation is not hit, and I'm fine to temporarily leave
>> this flaw in 3.0 and users may hit behavior changes when the string values
>> hit the VARCHAR length limitation.
>> 
>> 
>> 3. Finalize the VARCHAR behavior in 3.1
>> For now I have 2 ideas:
>> a) Make VARCHAR(x) a first-class data type. This means Spark data sources
>> should support VARCHAR, and CREATE TABLE should fail if a column is
>> VARCHAR type and the underlying data source doesn't support it (e.g.
>> JSON/CSV). Type cast, type coercion, table insertion, etc. should be
>> updated as well.
>> b) Simply document that, the underlying data source may or may not enforce
>> the length limitation of VARCHAR(x).
>> 
>> 
>> Please let me know if you have different ideas.
>> 
>> 
>> Thanks,
>> Wenchen
>> 
>> On Wed, Mar 18, 2020 at 1:08 AM Michael Armbrust < michael@ databricks. com
>> ( mich...@databricks.com ) > wrote:
>> 
>> 
>>> 
 What I'd oppose is to just ban char for the native data sources, and do
 not have a plan to address this problem systematically.
 
>>> 
>>> 
>>> 
>>> +1
>>> 
>>>  
>>> 
 Just forget about padding, like what Snowflake and MySQL have done.
 Document that char(x) is just an alias for string. And then move on.
 Almost no work needs to be done...
 
>>> 
>>> 
>>> 
>>> +1 
>>> 
>> 
>> 
> 
>



Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
For sure.

There's another reason I feel char is not that important and it's more 
important to be internally consistent (e.g. all data sources support it with 
the same behavior, vs one data source doing one behavior and another doing the
other). char was created at a time when cpu was slow and storage was expensive, 
and being able to pack things nicely at fixed length was highly useful. The 
fact that it was padded was initially done for performance, not for the padding 
itself. A lot has changed since char was invented, and with modern technologies 
(columnar, dictionary encoding, etc) there is little reason to use a char data 
type for anything. As a matter of fact, Spark internally converts char type to 
string to work with.

I see two solutions really.

1. We require padding, and ban all uses of char when it is not properly padded. 
This would ban all the native data sources, which are the primarily way people 
are using Spark. This leaves only char support for tables going through Hive 
serdes, which are slow to begin with. It is basically Dongjoon and Wenchen's 
suggestion. This turns char support into a compatibility feature only for some 
Hive tables that cannot be converted into Spark native data sources. This has 
confusing end-user behavior because depending on whether that Hive table is 
converted into Spark native data sources, we might or might not support char 
type.

An extension to the above is to introduce padding for char type across the 
board, and make char type a first class data type. There are a lot of work to 
introduce another data type, especially for one that has virtually no usage ( 
https://trends.google.com/trends/explore?geo=US=hive%20char,hive%20string ) 
and its usage will likely continue to decline in the future (just reason from 
first principle based on the reason char was introduced in the first place).

Now I'm assuming it's a lot of work to do char properly. But if it is not the 
case (e.g. just a simple rule to insert padding at planning time), then maybe 
it's worth doing it this way. I'm totally OK with this too.
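
To make that concrete, here is a hand-written sketch (hypothetical illustration
only, not an existing Spark rule) of what a read-side padding rewrite for a
CHAR(3)-style column would amount to, runnable in spark-shell:

    import org.apache.spark.sql.functions.{col, length, rpad}
    import spark.implicits._

    // Stand-in for a CHAR(3) column whose stored value came back unpadded (length 2).
    val t = Seq("a ").toDF("a")

    // A planning-time rule would effectively rewrite reads of the column into rpad(a, 3, ' '):
    t.select(rpad(col("a"), 3, " ").as("a"))
      .select(col("a"), length(col("a")))
      .show()   // length(a) is 3 regardless of how the source stored the value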

What I'd oppose is to just ban char for the native data sources, and do not 
have a plan to address this problem systematically.

2. Just forget about padding, like what Snowflake and MySQL have done. Document 
that char(x) is just an alias for string. And then move on. Almost no work 
needs to be done...

On Mon, Mar 16, 2020 at 5:54 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Thank you for sharing and confirming.
> 
> 
> We had better consider all heterogeneous customers in the world. And, I
> also have experiences with the non-negligible cases in on-prem.
> 
> 
> 
> Bests,
> Dongjoon.
> 
> 
> On Mon, Mar 16, 2020 at 5:42 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> −User
>> 
>> 
>> 
>> char barely showed up (honestly negligible). I was comparing select vs
>> select.
>> 
>> 
>> 
>> 
>> 
>> On Mon, Mar 16, 2020 at 5:40 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. com
>> ( dongjoon.h...@gmail.com ) > wrote:
>> 
>>> Ur, are you comparing the number of SELECT statement with TRIM and CREATE
>>> statements with `CHAR`?
>>> 
>>> > I looked up our usage logs (sorry I can't share this publicly) and trim
>>> has at least four orders of magnitude higher usage than char.
>>> 
>>> We need to discuss more about what to do. This thread is what I expected
>>> exactly. :)
>>> 
>>> > BTW I'm not opposing us sticking to SQL standard (I'm in general for
>>> it). I was merely pointing out that if we deviate away from SQL standard
>>> in any way we are considered "wrong" or "incorrect". That argument itself
>>> is flawed when plenty of other popular database systems also deviate away
>>> from the standard on this specific behavior.
>>> 
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin < rxin@ databricks. com (
>>> r...@databricks.com ) > wrote:
>>> 
>>> 
>>>> BTW I'm not opposing us sticking to SQL standard (I'm in general for it).
>>>> I was merely pointing out that if we deviate away from SQL standard in any
>>>> way we are considered "wrong" or "incorrect". That argument itself is
>>>> flawed when plenty of other popular database systems also deviate away
>>>> from the standard on this specific behavior.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin < rxin@ da

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
−User

char barely showed up (honestly negligible). I was comparing select vs select.

On Mon, Mar 16, 2020 at 5:40 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Ur, are you comparing the number of SELECT statement with TRIM and CREATE
> statements with `CHAR`?
> 
> > I looked up our usage logs (sorry I can't share this publicly) and trim
> has at least four orders of magnitude higher usage than char.
> 
> We need to discuss more about what to do. This thread is what I expected
> exactly. :)
> 
> > BTW I'm not opposing us sticking to SQL standard (I'm in general for
> it). I was merely pointing out that if we deviate away from SQL standard
> in any way we are considered "wrong" or "incorrect". That argument itself
> is flawed when plenty of other popular database systems also deviate away
> from the standard on this specific behavior.
> 
> 
> Bests,
> Dongjoon.
> 
> On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> BTW I'm not opposing us sticking to SQL standard (I'm in general for it).
>> I was merely pointing out that if we deviate away from SQL standard in any
>> way we are considered "wrong" or "incorrect". That argument itself is
>> flawed when plenty of other popular database systems also deviate away
>> from the standard on this specific behavior.
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin < rxin@ databricks. com (
>> r...@databricks.com ) > wrote:
>> 
>>> I looked up our usage logs (sorry I can't share this publicly) and trim
>>> has at least four orders of magnitude higher usage than char.
>>> 
>>> 
>>> 
>>> 
>>> On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. com
>>> ( dongjoon.h...@gmail.com ) > wrote:
>>> 
>>>> Thank you, Stephen and Reynold.
>>>> 
>>>> 
>>>> To Reynold.
>>>> 
>>>> 
>>>> The way I see the following is a little different.
>>>> 
>>>> 
>>>>       > CHAR is an undocumented data type without clearly defined
>>>> semantics.
>>>> 
>>>> Let me describe in Apache Spark User's View point.
>>>> 
>>>> 
>>>> Apache Spark started to claim `HiveContext` (and `hql/hiveql` function) at
>>>> Apache Spark 1.x without much documentation. In addition, there still
>>>> exists an effort which is trying to keep it in 3.0.0 age.
>>>> 
>>>>        https:/ / issues. apache. org/ jira/ browse/ SPARK-31088 (
>>>> https://issues.apache.org/jira/browse/SPARK-31088 )
>>>>        Add back HiveContext and createExternalTable
>>>> 
>>>> Historically, we tried to make many SQL-based customer migrate their
>>>> workloads from Apache Hive into Apache Spark through `HiveContext`.
>>>> 
>>>> Although Apache Spark didn't have a good document about the inconsistent
>>>> behavior among its data sources, Apache Hive has been providing its
>>>> documentation and many customers rely the behavior.
>>>> 
>>>>       - https:/ / cwiki. apache. org/ confluence/ display/ Hive/ 
>>>> LanguageManual+Types
>>>> ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types )
>>>> 
>>>> At that time, frequently in on-prem Hadoop clusters by well-known vendors,
>>>> many existing huge tables were created by Apache Hive, not Apache Spark.
>>>> And, Apache Spark is used for boosting SQL performance with its *caching*.
>>>> This was true because Apache Spark was added into the Hadoop-vendor
>>>> products later than Apache Hive.
>>>> 
>>>> 
>>>> Until the turning point at Apache Spark 2.0, we tried to catch up more
>>>> features to be consistent at least with Hive tables in Apache Hive and
>>>> Apache Spark because two SQL engines share the same tables.
>>>> 
>>>> For the following, technically, while Apache Hive doesn't changed its
>>>> existing behavior in this part, Apache Spark evolves inevitably by moving
>>>> away from the original Apache Spark old behaviors one-by-one.
>>>> 
>>>> 
>>>>       >  the value is already fucked up
>>>> 
>>>> 
>>>> The following is the change log.
>>>> 
>>>>       - When we sw

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
BTW I'm not opposing us sticking to SQL standard (I'm in general for it). I was 
merely pointing out that if we deviate away from SQL standard in any way we are 
considered "wrong" or "incorrect". That argument itself is flawed when plenty 
of other popular database systems also deviate away from the standard on this 
specific behavior.

On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin < r...@databricks.com > wrote:

> 
> I looked up our usage logs (sorry I can't share this publicly) and trim
> has at least four orders of magnitude higher usage than char.
> 
> 
> 
> 
> On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. com
> ( dongjoon.h...@gmail.com ) > wrote:
> 
>> Thank you, Stephen and Reynold.
>> 
>> 
>> To Reynold.
>> 
>> 
>> The way I see the following is a little different.
>> 
>> 
>>       > CHAR is an undocumented data type without clearly defined
>> semantics.
>> 
>> Let me describe in Apache Spark User's View point.
>> 
>> 
>> Apache Spark started to claim `HiveContext` (and `hql/hiveql` function) at
>> Apache Spark 1.x without much documentation. In addition, there still
>> exists an effort which is trying to keep it in 3.0.0 age.
>> 
>>        https:/ / issues. apache. org/ jira/ browse/ SPARK-31088 (
>> https://issues.apache.org/jira/browse/SPARK-31088 )
>>        Add back HiveContext and createExternalTable
>> 
>> Historically, we tried to make many SQL-based customer migrate their
>> workloads from Apache Hive into Apache Spark through `HiveContext`.
>> 
>> Although Apache Spark didn't have a good document about the inconsistent
>> behavior among its data sources, Apache Hive has been providing its
>> documentation and many customers rely the behavior.
>> 
>>       - https:/ / cwiki. apache. org/ confluence/ display/ Hive/ 
>> LanguageManual+Types
>> ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types )
>> 
>> At that time, frequently in on-prem Hadoop clusters by well-known vendors,
>> many existing huge tables were created by Apache Hive, not Apache Spark.
>> And, Apache Spark is used for boosting SQL performance with its *caching*.
>> This was true because Apache Spark was added into the Hadoop-vendor
>> products later than Apache Hive.
>> 
>> 
>> Until the turning point at Apache Spark 2.0, we tried to catch up more
>> features to be consistent at least with Hive tables in Apache Hive and
>> Apache Spark because two SQL engines share the same tables.
>> 
>> For the following, technically, while Apache Hive doesn't changed its
>> existing behavior in this part, Apache Spark evolves inevitably by moving
>> away from the original Apache Spark old behaviors one-by-one.
>> 
>> 
>>       >  the value is already fucked up
>> 
>> 
>> The following is the change log.
>> 
>>       - When we switched the default value of `convertMetastoreParquet`.
>> (at Apache Spark 1.2)
>>       - When we switched the default value of `convertMetastoreOrc` (at
>> Apache Spark 2.4)
>>       - When we switched `CREATE TABLE` itself. (Change `TEXT` table to
>> `PARQUET` table at Apache Spark 3.0)
>> 
>> To sum up, this has been a well-known issue in the community and among the
>> customers.
>> 
>> Bests,
>> Dongjoon.
>> 
>> On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy < scoy@ infomedia. com. au (
>> s...@infomedia.com.au ) > wrote:
>> 
>> 
>>> Hi there,
>>> 
>>> 
>>> I’m kind of new around here, but I have had experience with all of all the
>>> so called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL
>>> Server as well as Postgresql.
>>> 
>>> 
>>> They all support the notion of “ANSI padding” for CHAR columns - which
>>> means that such columns are always space padded, and they default to
>>> having this enabled (for ANSI compliance).
>>> 
>>> 
>>> MySQL also supports it, but it defaults to leaving it disabled for
>>> historical reasons not unlike what we have here.
>>> 
>>> 
>>> In my opinion we should push toward standards compliance where possible
>>> and then document where it cannot work.
>>> 
>>> 
>>> If users don’t like the padding on CHAR columns then they should change to
>>> VARCHAR - I believe that was its purpose in the first place, and it does
>>> not dictate any sort of “padding".
>>> 
>>> 
>>> I can see

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
I looked up our usage logs (sorry I can't share this publicly) and trim has at 
least four orders of magnitude higher usage than char.

On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Thank you, Stephen and Reynold.
> 
> 
> To Reynold.
> 
> 
> The way I see the following is a little different.
> 
> 
>       > CHAR is an undocumented data type without clearly defined
> semantics.
> 
> Let me describe in Apache Spark User's View point.
> 
> 
> Apache Spark started to claim `HiveContext` (and `hql/hiveql` function) at
> Apache Spark 1.x without much documentation. In addition, there still
> exists an effort which is trying to keep it in 3.0.0 age.
> 
>        https:/ / issues. apache. org/ jira/ browse/ SPARK-31088 (
> https://issues.apache.org/jira/browse/SPARK-31088 )
>        Add back HiveContext and createExternalTable
> 
> Historically, we tried to make many SQL-based customer migrate their
> workloads from Apache Hive into Apache Spark through `HiveContext`.
> 
> Although Apache Spark didn't have a good document about the inconsistent
> behavior among its data sources, Apache Hive has been providing its
> documentation and many customers rely the behavior.
> 
>       - https:/ / cwiki. apache. org/ confluence/ display/ Hive/ 
> LanguageManual+Types
> ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types )
> 
> At that time, frequently in on-prem Hadoop clusters by well-known vendors,
> many existing huge tables were created by Apache Hive, not Apache Spark.
> And, Apache Spark is used for boosting SQL performance with its *caching*.
> This was true because Apache Spark was added into the Hadoop-vendor
> products later than Apache Hive.
> 
> 
> Until the turning point at Apache Spark 2.0, we tried to catch up more
> features to be consistent at least with Hive tables in Apache Hive and
> Apache Spark because two SQL engines share the same tables.
> 
> For the following, technically, while Apache Hive doesn't changed its
> existing behavior in this part, Apache Spark evolves inevitably by moving
> away from the original Apache Spark old behaviors one-by-one.
> 
> 
>       >  the value is already fucked up
> 
> 
> The following is the change log.
> 
>       - When we switched the default value of `convertMetastoreParquet`.
> (at Apache Spark 1.2)
>       - When we switched the default value of `convertMetastoreOrc` (at
> Apache Spark 2.4)
>       - When we switched `CREATE TABLE` itself. (Change `TEXT` table to
> `PARQUET` table at Apache Spark 3.0)
> 
> To sum up, this has been a well-known issue in the community and among the
> customers.
> 
> Bests,
> Dongjoon.
> 
> On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy < scoy@ infomedia. com. au (
> s...@infomedia.com.au ) > wrote:
> 
> 
>> Hi there,
>> 
>> 
>> I’m kind of new around here, but I have had experience with all of all the
>> so called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL
>> Server as well as Postgresql.
>> 
>> 
>> They all support the notion of “ANSI padding” for CHAR columns - which
>> means that such columns are always space padded, and they default to
>> having this enabled (for ANSI compliance).
>> 
>> 
>> MySQL also supports it, but it defaults to leaving it disabled for
>> historical reasons not unlike what we have here.
>> 
>> 
>> In my opinion we should push toward standards compliance where possible
>> and then document where it cannot work.
>> 
>> 
>> If users don’t like the padding on CHAR columns then they should change to
>> VARCHAR - I believe that was its purpose in the first place, and it does
>> not dictate any sort of “padding".
>> 
>> 
>> I can see why you might “ban” the use of CHAR columns where they cannot be
>> consistently supported, but VARCHAR is a different animal and I would
>> expect it to work consistently everywhere.
>> 
>> 
>> 
>> 
>> Cheers,
>> 
>> 
>> Steve C
>> 
>> 
>>> On 17 Mar 2020, at 10:01 am, Dongjoon Hyun < dongjoon. hyun@ gmail. com (
>>> dongjoon.h...@gmail.com ) > wrote:
>>> 
>>> Hi, Reynold.
>>> (And +Michael Armbrust)
>>> 
>>> 
>>> If you think so, do you think it's okay that we change the return value
>>> silently? Then, I'm wondering why we reverted `TRIM` functions then?
>>> 
>>> 
>>> > Are we sure "not padding" is "incorrect"?
>>> 
>>> 
>>> 
>>> Bests,
>>> Dongjoon.

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
I haven't spent enough time thinking about it to give a strong opinion, but 
this is of course very different from TRIM.

TRIM is a publicly documented function with two arguments, and we silently 
swapped the two arguments. And trim is also quite commonly used from a long 
time ago.

CHAR is an undocumented data type without clearly defined semantics. It's not 
great that we are changing the value here, but the value is already fucked up. 
It depends on the underlying data source, and random configs that are seemingly 
unrelated (orc) would impact the value.

On Mon, Mar 16, 2020 at 4:01 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Hi, Reynold.
> (And +Michael Armbrust)
> 
> 
> If you think so, do you think it's okay that we change the return value
> silently? Then, I'm wondering why we reverted `TRIM` functions then?
> 
> 
> > Are we sure "not padding" is "incorrect"?
> 
> 
> 
> Bests,
> Dongjoon.
> 
> 
> 
> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta < gourav. sengupta@ gmail.
> com ( gourav.sengu...@gmail.com ) > wrote:
> 
> 
>> Hi,
>> 
>> 
>> 100% agree with Reynold.
>> 
>> 
>> 
>> 
>> Regards,
>> Gourav Sengupta
>> 
>> 
>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin < rxin@ databricks. com (
>> r...@databricks.com ) > wrote:
>> 
>> 
>>> Are we sure "not padding" is "incorrect"?
>>> 
>>> 
>>> 
>>> I don't know whether ANSI SQL actually requires padding, but plenty of
>>> databases don't actually pad.
>>> 
>>> 
>>> 
>>> https:/ / docs. snowflake. net/ manuals/ sql-reference/ data-types-text. 
>>> html
>>> (
>>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html#:~:text=CHAR%20%2C%20CHARACTER,(1)%20is%20the%20default.=Snowflake%20currently%20deviates%20from%20common,space%2Dpadded%20at%20the%20end.
>>> ) : "Snowflake currently deviates from common CHAR semantics in that
>>> strings shorter than the maximum length are not space-padded at the end."
>>> 
>>> 
>>> 
>>> MySQL: https:/ / stackoverflow. com/ questions/ 53528645/ 
>>> why-char-dont-have-padding-in-mysql
>>> (
>>> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>> )
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. com
>>> ( dongjoon.h...@gmail.com ) > wrote:
>>> 
>>>> Hi, Reynold.
>>>> 
>>>> 
>>>> Please see the following for the context.
>>>> 
>>>> 
>>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-31136 (
>>>> https://issues.apache.org/jira/browse/SPARK-31136 )
>>>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>> syntax"
>>>> 
>>>> 
>>>> I raised the above issue according to the new rubric, and the banning was
>>>> the proposed alternative to reduce the potential issue.
>>>> 
>>>> 
>>>> Please give us your opinion since it's still PR.
>>>> 
>>>> 
>>>> Bests,
>>>> Dongjoon.
>>>> 
>>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin < rxin@ databricks. com (
>>>> r...@databricks.com ) > wrote:
>>>> 
>>>> 
>>>>> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
>>>>> of both new and old users?
>>>>> 
>>>>> 
>>>>> For old users, their old code that was working for char(3) would now stop
>>>>> working. 
>>>>> 
>>>>> 
>>>>> For new users, depending on whether the underlying metastore char(3) is
>>>>> either supported but different from ansi Sql (which is not that big of a
>>>>> deal if we explain it) or not supported. 
>>>>> 
>>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun < dongjoon. hyun@ gmail. com
>>>>> ( dongjoon.h...@gmail.com ) > wrote:
>>>>> 
>>>>> 
>>>>>> Hi, All.
>>>>>> 
>>>>>> Apache Spark has been suffered from a known consistency issue on `CHAR`
>>>>>> type behavio

Re: FYI: The evolution on `CHAR` type behavior

2020-03-15 Thread Reynold Xin
Are we sure "not padding" is "incorrect"?

I don't know whether ANSI SQL actually requires padding, but plenty of 
databases don't actually pad.

https://docs.snowflake.net/manuals/sql-reference/data-types-text.html ( 
https://docs.snowflake.net/manuals/sql-reference/data-types-text.html#:~:text=CHAR%20%2C%20CHARACTER,(1)%20is%20the%20default.=Snowflake%20currently%20deviates%20from%20common,space%2Dpadded%20at%20the%20end.
 ) : "Snowflake currently deviates from common CHAR semantics in that strings 
shorter than the maximum length are not space-padded at the end."

MySQL: 
https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql

On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Hi, Reynold.
> 
> 
> Please see the following for the context.
> 
> 
> https:/ / issues. apache. org/ jira/ browse/ SPARK-31136 (
> https://issues.apache.org/jira/browse/SPARK-31136 )
> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
> syntax"
> 
> 
> I raised the above issue according to the new rubric, and the banning was
> the proposed alternative to reduce the potential issue.
> 
> 
> Please give us your opinion since it's still PR.
> 
> 
> Bests,
> Dongjoon.
> 
> On Sat, Mar 14, 2020 at 17:54 Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
>> of both new and old users?
>> 
>> 
>> For old users, their old code that was working for char(3) would now stop
>> working. 
>> 
>> 
>> For new users, depending on whether the underlying metastore char(3) is
>> either supported but different from ansi Sql (which is not that big of a
>> deal if we explain it) or not supported. 
>> 
>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun < dongjoon. hyun@ gmail. com
>> ( dongjoon.h...@gmail.com ) > wrote:
>> 
>> 
>>> Hi, All.
>>> 
>>> Apache Spark has suffered from a known consistency issue on `CHAR`
>>> type behavior among its usages and configurations. However, the evolution
>>> direction has been gradually moving forward to be consistent inside Apache
>>> Spark because we don't have `CHAR` officially. The following is the
>>> summary.
>>> 
>>> With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different result.
>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>>> Hive behavior.)
>>> 
>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>> 
>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>> 
>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>     a   3
>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>     a   3
>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>     a 2
>>> 
>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
>>> behavior.)
>>> 
>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>     a   3
>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>     a 2
>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>     a 2
>>> 
>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became
>>> consistent.
>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>> fallback to Hive behavior.)
>>> 
>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>     a 2
>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>     a 2
>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>     a 2
>>> 
>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in the
>>> following syntax to be safe.
>>> 
>>>     CREATE TABLE t(a CHAR(3));
>>>    https:/ / github. com/ apache/ spark/ pull/ 27902 (
>>> https://github.com/apache/spark/pull/27902 )
>>> 
>>> This email is sent out to inform you based on the new policy we voted.
>>> The recommendation is always using Apache Spark's native type `String`.
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> References:
>>> 1. "CHAR implementation?", 2017/09/15
>>>      https:/ / lists. apache. org/ thread. html/ 
>>> 96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.
>>> spark. apache. org%3E (
>>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>> )
>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>> syntax", 2019/12/06
>>>    https:/ / lists. apache. org/ thread. html/ 
>>> 493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.
>>> spark. apache. org%3E (
>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>> )
>>> 
>> 
>> 
> 
>



Re: FYI: The evolution on `CHAR` type behavior

2020-03-14 Thread Reynold Xin
I don’t understand this change. Wouldn’t this “ban” confuse the hell out of
both new and old users?

For old users, their old code that was working for char(3) would now stop
working.

For new users, depending on whether the underlying metastore char(3) is
either supported but different from ansi Sql (which is not that big of a
deal if we explain it) or not supported.

On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Apache Spark has suffered from a known consistency issue on `CHAR`
> type behavior among its usages and configurations. However, the evolution
> direction has been gradually moving forward to be consistent inside Apache
> Spark because we don't have `CHAR` officially. The following is the summary.
>
> With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different result.
> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
> Hive behavior.)
>
> spark-sql> CREATE TABLE t1(a CHAR(3));
> spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
> spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>
> spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
> spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
> spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>
> spark-sql> SELECT a, length(a) FROM t1;
> a   3
> spark-sql> SELECT a, length(a) FROM t2;
> a   3
> spark-sql> SELECT a, length(a) FROM t3;
> a 2
>
> Since 2.4.0, `STORED AS ORC` became consistent.
> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
> behavior.)
>
> spark-sql> SELECT a, length(a) FROM t1;
> a   3
> spark-sql> SELECT a, length(a) FROM t2;
> a 2
> spark-sql> SELECT a, length(a) FROM t3;
> a 2
>
> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became
> consistent.
> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
> fallback to Hive behavior.)
>
> spark-sql> SELECT a, length(a) FROM t1;
> a 2
> spark-sql> SELECT a, length(a) FROM t2;
> a 2
> spark-sql> SELECT a, length(a) FROM t3;
> a 2
>
> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in the
> following syntax to be safe.
>
> CREATE TABLE t(a CHAR(3));
> https://github.com/apache/spark/pull/27902
>
> This email is sent out to inform you based on the new policy we voted.
> The recommendation is always using Apache Spark's native type `String`.
>
> Bests,
> Dongjoon.
>
> References:
> 1. "CHAR implementation?", 2017/09/15
>
> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
> syntax", 2019/12/06
>
> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>




Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-09 Thread Reynold Xin
+1

On Mon, Mar 09, 2020 at 3:53 PM, John Zhuge < jzh...@apache.org > wrote:

> 
> +1 (non-binding)
> 
> 
> On Mon, Mar 9, 2020 at 1:32 PM Michael Heuer < heuermh@ gmail. com (
> heue...@gmail.com ) > wrote:
> 
> 
>> +1 (non-binding)
>> 
>> 
>> I am disappointed however that this only mentions API and not dependencies
>> and transitive dependencies.
>> 
>> 
>> As Spark does not provide separation between its runtime classpath and the
>> classpath used by applications, I believe Spark's dependencies and
>> transitive dependencies should be considered part of the API for this
>> policy.  Breaking dependency upgrades and incompatible dependency versions
>> are the source of much frustration.
>> 
>> 
>>    michael
>> 
>> 
>> 
>> 
>>> On Mar 9, 2020, at 2:16 PM, Takuya UESHIN < ueshin@ happy-camper. st (
>>> ues...@happy-camper.st ) > wrote:
>>> 
>>> +1 (binding)
>>> 
>>> 
>>> 
>>> On Mon, Mar 9, 2020 at 11:49 AM Xingbo Jiang < jiangxb1987@ gmail. com (
>>> jiangxb1...@gmail.com ) > wrote:
>>> 
>>> 
 +1 (non-binding)
 
 
 Cheers,
 
 
 Xingbo
 
 On Mon, Mar 9, 2020 at 9:35 AM Xiao Li < lixiao@ databricks. com (
 lix...@databricks.com ) > wrote:
 
 
> +1 (binding)
> 
> 
> Xiao
> 
> On Mon, Mar 9, 2020 at 8:33 AM Denny Lee < denny. g. lee@ gmail. com (
> denny.g@gmail.com ) > wrote:
> 
> 
>> +1 (non-binding)
>> 
>> 
>> On Mon, Mar 9, 2020 at 1:59 AM Hyukjin Kwon < gurwls223@ gmail. com (
>> gurwls...@gmail.com ) > wrote:
>> 
>> 
>>> The proposal itself seems good as the factors to consider, Thanks 
>>> Michael.
>>> 
>>> 
>>> Several concerns mentioned look good points, in particular:
>>> 
>>> > ... assuming that this is for public stable APIs, not APIs that are
>>> marked as unstable, evolving, etc. ...
>>> I would like to confirm this. We already have API annotations such as
>>> Experimental, Unstable, etc. and the implication of each is still
>>> effective. If it's for stable APIs, it makes sense to me as well.
>>> 
>>> > ... can we expand on 'when' an API change can occur ?  Since we are
>>> proposing to diverge from semver. ...
>>> 
>>> I think this is a good point. If we're proposing to divert from semver,
>>> the delta compared to semver will have to be clarified to avoid 
>>> different
>>> personal interpretations of the somewhat general principles.
>>> 
>>> > ... can we narrow down on the migration from Apache Spark 2.4.5 to
>>> Apache Spark 3.0+? ...
>>> 
>>> Assuming these concerns will be addressed, +1 (binding).
>>> 
>>>  
>>> 2020년 3월 9일 (월) 오후 4:53, Takeshi Yamamuro < linguin. m. s@ gmail. com (
>>> linguin@gmail.com ) >님이 작성:
>>> 
>>> 
 +1 (non-binding)
 
 
 Bests,
 Takeshi
 
 On Mon, Mar 9, 2020 at 4:52 PM Gengliang Wang < gengliang. wang@ 
 databricks.
 com ( gengliang.w...@databricks.com ) > wrote:
 
 
> +1 (non-binding)
> 
> 
> Gengliang
> 
> On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia < matei. zaharia@ 
> gmail. com
> ( matei.zaha...@gmail.com ) > wrote:
> 
> 
>> +1 as well.
>> 
>> 
>> Matei
>> 
>> 
>>> On Mar 9, 2020, at 12:05 AM, Wenchen Fan < cloud0fan@ gmail. com (
>>> cloud0...@gmail.com ) > wrote:
>>> 
>>> +1 (binding), assuming that this is for public stable APIs, not 
>>> APIs that
>>> are marked as unstable, evolving, etc.
>>> 
>>> 
>>> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía < iemejia@ gmail. com (
>>> ieme...@gmail.com ) > wrote:
>>> 
>>> 
 +1 (non-binding)
 
 Michael's section on the trade-offs of maintaining / removing an 
 API are
 one of
 the best reads I have seeing in this mailing list. Enthusiast +1
 
 On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun < dongjoon. hyun@ 
 gmail. com (
 dongjoon.h...@gmail.com ) > wrote:
 >
 > This new policy has a good indention, but can we narrow down on 
 > the
 migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
 >
 > I saw that there already exists a reverting PR to bring back 
 > Spark 1.4
 and 1.5 APIs based on this AS-IS suggestion.
 >
 > The AS-IS policy is clearly mentioning that JVM/Scala-level 
 > difficulty,
 and it's nice.
 >
 > However, for the other cases, it sounds like `recommending older 
 > APIs as
 much as possible` due to the following.
 >
 >      

Re: [DISCUSS] Shall we mark spark streaming component as deprecated.

2020-03-02 Thread Reynold Xin
It's a good discussion to have though: should we deprecate dstream, and what do 
we need to do to make that happen? My experience working with a lot of Spark 
users is that in general I recommend they stay away from DStream, due to a
lot of design and architectural issues.
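
For anyone reading along, a minimal Structured Streaming sketch of the API that
is generally recommended instead (using the built-in rate source and console
sink purely for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("structured-streaming-sketch")
      .master("local[*]")
      .getOrCreate()

    // A continuous count over a test source; the DataFrame API replaces DStream transformations.
    val counts = spark.readStream
      .format("rate")                  // built-in test source emitting (timestamp, value) rows
      .option("rowsPerSecond", "5")
      .load()
      .groupBy()
      .count()

    val query = counts.writeStream
      .outputMode("complete")          // aggregations need complete or update mode
      .format("console")
      .start()

    query.awaitTermination()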

On Mon, Mar 02, 2020 at 9:32 AM, Prashant Sharma < scrapco...@gmail.com > wrote:

> 
> I may have speculated, or believed the unauthorised sources, nevertheless
> I am happy to be corrected. 
> 
> On Mon, Mar 2, 2020 at 8:05 PM Sean Owen < srowen@ gmail. com (
> sro...@gmail.com ) > wrote:
> 
> 
>> Er, who says it's deprecated? I have never heard anything like that.
>> Why would it be?
>> 
>> On Mon, Mar 2, 2020 at 4:52 AM Prashant Sharma < scrapcodes@ gmail. com (
>> scrapco...@gmail.com ) > wrote:
>> >
>> > Hi All,
>> >
>> > It is noticed that some of the users of Spark streaming do not
>> immediately realise that it is a deprecated component and it would be
>> scary, if they end up with it in production. Now that we are in a position
>> to release about Spark 3.0.0, may be we should discuss - should the spark
>> streaming carry an explicit notice? That it is deprecated and not under
>> active development.
>> >
>> > I have opened an issue already, but I think a mailing list discussion
>> would be more appropriate. https:/ / issues. apache. org/ jira/ browse/ 
>> SPARK-31006
>> ( https://issues.apache.org/jira/browse/SPARK-31006 )
>> >
>> > Thanks,
>> > Prashant.
>> >
> 
> 
>



Re: [DISCUSS] naming policy of Spark configs

2020-02-12 Thread Reynold Xin
This is really cool. We should also be more opinionated about how we specify 
time and intervals.
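
As a hypothetical illustration of that point: a time-valued setting reads much
better when the value carries an explicit unit suffix instead of a bare number.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.network.timeout", "120s")   // unit is explicit: 120 seconds
    // versus a made-up counter-example where the unit (ms? s?) is left to the reader:
    //   .set("spark.example.someTimeout", "120")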

On Wed, Feb 12, 2020 at 3:15 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Thank you, Wenchen.
> 
> 
> The new policy looks clear to me. +1 for the explicit policy.
> 
> 
> So, are we going to revise the existing conf names before 3.0.0 release?
> 
> 
> Or, is it applied to new up-coming configurations from now?
> 
> 
> Bests,
> Dongjoon.
> 
> On Wed, Feb 12, 2020 at 7:43 AM Wenchen Fan < cloud0fan@ gmail. com (
> cloud0...@gmail.com ) > wrote:
> 
> 
>> Hi all,
>> 
>> 
>> I'd like to discuss the naming policy of Spark configs, as for now it
>> depends on personal preference which leads to inconsistent namings.
>> 
>> 
>> In general, the config name should be a noun that describes its meaning
>> clearly.
>> Good examples:
>> spark.sql.session.timeZone
>> 
>> spark.sql.streaming.continuous.executorQueueSize
>> 
>> spark.sql.statistics.histogram.numBins
>> 
>> Bad examples:
>> spark.sql.defaultSizeInBytes (default size for what?)
>> 
>> 
>> 
>> Also note that, config name has many parts, joined by dots. Each part is a
>> namespace. Don't create namespace unnecessarily.
>> Good example:
>> spark.sql.execution.rangeExchange.sampleSizePerPartition
>> 
>> spark.sql.execution.arrow.maxRecordsPerBatch
>> 
>> Bad examples:
>> spark.sql.windowExec.buffer.in.memory.threshold ("in" is not a useful
>> namespace; better to be .buffer.inMemoryThreshold)
>> 
>> 
>> 
>> For a big feature, usually we need to create an umbrella config to turn it
>> on/off, and other configs for fine-grained controls. These configs should
>> share the same namespace, and the umbrella config should be named like 
>> featureName.enabled
>> . For example:
>> spark.sql.cbo.enabled
>> 
>> spark.sql.cbo.starSchemaDetection
>> 
>> spark.sql.cbo.starJoinFTRatio
>> spark.sql.cbo.joinReorder.enabled
>> spark.sql.cbo.joinReorder.dp.threshold (BTW "dp" is not a good namespace)
>> 
>> spark.sql.cbo.joinReorder.card.weight (BTW "card" is not a good namespace)
>> 
>> 
>> 
>> 
>> For boolean configs, in general it should end with a verb, e.g. 
>> spark.sql.join.preferSortMergeJoin
>> . If the config is for a feature and you can't find a good verb for the
>> feature, featureName.enabled is also good.
>> 
>> 
>> I'll update https:/ / spark. apache. org/ contributing. html (
>> https://spark.apache.org/contributing.html ) after we reach a consensus
>> here. Any comments are welcome!
>> 
>> 
>> Thanks,
>> Wenchen
>> 
> 
>

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2020-02-01 Thread Reynold Xin
Note that branch-3.0 was cut. Please focus on testing, polish, and let's get 
the release out!

On Wed, Jan 29, 2020 at 3:41 PM, Reynold Xin < r...@databricks.com > wrote:

> 
> Just a reminder - code freeze is coming this Fri !
> 
> 
> 
> There can always be exceptions, but those should be exceptions and
> discussed on a case by case basis rather than becoming the norm.
> 
> 
> 
> 
> 
> 
> On Tue, Dec 24, 2019 at 4:55 PM, Jungtaek Lim < kabhwan. opensource@ gmail.
> com ( kabhwan.opensou...@gmail.com ) > wrote:
> 
>> Jan 31 sounds good to me.
>> 
>> 
>> Just curious, do we allow some exception on code freeze? One thing came
>> into my mind is that some feature could have multiple subtasks and part of
>> subtasks have been merged and other subtask(s) are in reviewing. In this
>> case do we allow these subtasks to have more days to get reviewed and
>> merged later?
>> 
>> 
>> Happy Holiday!
>> 
>> 
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>> 
>> On Wed, Dec 25, 2019 at 8:36 AM Takeshi Yamamuro < linguin. m. s@ gmail. com
>> ( linguin@gmail.com ) > wrote:
>> 
>> 
>>> Looks nice, happy holiday, all!
>>> 
>>> 
>>> Bests,
>>> Takeshi
>>> 
>>> On Wed, Dec 25, 2019 at 3:56 AM Dongjoon Hyun < dongjoon. hyun@ gmail. com
>>> ( dongjoon.h...@gmail.com ) > wrote:
>>> 
>>> 
>>>> +1 for January 31st.
>>>> 
>>>> 
>>>> Bests,
>>>> Dongjoon.
>>>> 
>>>> On Tue, Dec 24, 2019 at 7:11 AM Xiao Li < lixiao@ databricks. com (
>>>> lix...@databricks.com ) > wrote:
>>>> 
>>>> 
>>>>> Jan 31 is pretty reasonable. Happy Holidays! 
>>>>> 
>>>>> 
>>>>> Xiao
>>>>> 
>>>>> On Tue, Dec 24, 2019 at 5:52 AM Sean Owen < srowen@ gmail. com (
>>>>> sro...@gmail.com ) > wrote:
>>>>> 
>>>>> 
>>>>>> Yep, always happens. Is earlier realistic, like Jan 15? it's all 
>>>>>> arbitrary
>>>>>> but indeed this has been in progress for a while, and there's a downside
>>>>>> to not releasing it, to making the gap to 3.0 larger. 
>>>>>> On my end I don't know of anything that's holding up a release; is it
>>>>>> basically DSv2?
>>>>>> 
>>>>>> BTW these are the items still targeted to 3.0.0, some of which may not
>>>>>> have been legitimately tagged. It may be worth reviewing what's still 
>>>>>> open
>>>>>> and necessary, and what should be untargeted.
>>>>>> 
>>>>>> 
>>>>>> SPARK-29768 nondeterministic expression fails column pruning
>>>>>> SPARK-29345 Add an API that allows a user to define and observe arbitrary
>>>>>> metrics on streaming queries
>>>>>> SPARK-29348 Add observable metrics
>>>>>> SPARK-29429 Support Prometheus monitoring natively
>>>>>> SPARK-29577 Implement p-value simulation and unit tests for chi2 test
>>>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>>>>>> SPARK-28588 Build a SQL reference doc
>>>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>>>>> SPARK-28684 Hive module support JDK 11
>>>>>> SPARK-28548 explain() shows wrong result for persisted DataFrames after
>>>>>> some operations
>>>>>> SPARK-28264 Revisiting Python / pandas UDF
>>>>>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>>>>>> SPARK-28155 do not leak SaveMode to file source v2
>>>>>> SPARK-28103 Cannot infer filters from union table with empty local
>>>>>> relation table properly
>>>>>> SPARK-27986 Support Aggregate Expressions with filter
>>>>>> SPARK-28024 Incorrect numeric values when out of range
>>>>>> SPARK-27936 Support local dependency uploading from --py-files
>>>>>> SPARK-27780 Shuffle server & client should be versioned to enable 
>>>>>> smoother
>>>>>> upgrade
>>>>>> SPARK-27714 Support Join Reorder based on Genetic Al

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2020-01-29 Thread Reynold Xin
27272 Enable blacklisting of node/executor on fetch failures by
>>>>> default
>>>>> SPARK-27296 Efficient User Defined Aggregators
>>>>> SPARK-25128 multiple simultaneous job submissions against k8s backend
>>>>> cause driver pods to hang
>>>>> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
>>>>> SPARK-21559 Remove Mesos fine-grained mode
>>>>> SPARK-24942 Improve cluster resource management with jobs containing
>>>>> barrier stage
>>>>> SPARK-25914 Separate projection from grouping and aggregate in logical
>>>>> Aggregate
>>>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
>>>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>>>> SPARK-26425 Add more constraint checks in file streaming source to avoid
>>>>> checkpoint corruption
>>>>> SPARK-25843 Redesign rangeBetween API
>>>>> SPARK-25841 Redesign window function rangeBetween API
>>>>> SPARK-25752 Add trait to easily whitelist logical operators that produce
>>>>> named output from CleanupAliases
>>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window
>>>>> aggregate
>>>>> SPARK-25531 new write APIs for data source v2
>>>>> SPARK-25547 Pluggable jdbc connection factory
>>>>> SPARK-20845 Support specification of column names in INSERT INTO
>>>>> SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
>>>>> 
>>>>> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
>>>>> SPARK-25074 Implement maxNumConcurrentTasks() in
>>>>> MesosFineGrainedSchedulerBackend
>>>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>>>> SPARK-25186 Stabilize Data Source V2 API
>>>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
>>>>> execution mode
>>>>> SPARK-7768 Make user-defined type (UDT) API public
>>>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition
>>>>> Spec
>>>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>>>> SPARK-19842 Informational Referential Integrity Constraints Support in
>>>>> Spark
>>>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested list
>>>>> of structures
>>>>> SPARK-22386 Data Source V2 improvements
>>>>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Mon, Dec 23, 2019 at 5:48 PM Reynold Xin < rxin@ databricks. com (
>>>>> r...@databricks.com ) > wrote:
>>>>> 
>>>>> 
>>>>>> We've pushed out 3.0 multiple times. The latest release window documented
>>>>>> on the website ( http://spark.apache.org/versioning-policy.html ) says
>>>>>> we'd code freeze and cut branch-3.0 early Dec. It looks like we are
>>>>>> suffering a bit from the tragedy of the commons, that nobody is pushing
>>>>>> for getting the release out. I understand the natural tendency for each
>>>>>> individual is to finish or extend the feature/bug that the person has 
>>>>>> been
>>>>>> working on. At some point we need to say "this is it" and get the release
>>>>>> out. I'm happy to help drive this process.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> To be realistic, I don't think we should just code freeze * today *.
>>>>>> Although we have updated the website, contributors have all been 
>>>>>> operating
>>>>>> under the assumption that all active developments are still going on. I
>>>>>> propose we *cut the branch on* *Jan 31* *, and code freeze and switch 
>>>>>> over
>>>>>> to bug squashing mode, and try to get the 3.0 official release out in 
>>>>>> Q1*.
>>>>>> That is, by default no new features can go into the branch starting Jan 
>>>>>> 31
>>>>>> .
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> What do you think?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> And happy holidays everybody.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Databricks Summit - Watch the talks (
>>>> https://databricks.com/sparkaisummit/north-america ) 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
>> --
>> ---
>> Takeshi Yamamuro
>> 
> 
>

Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

2020-01-21 Thread Reynold Xin
If your UDF itself is very CPU intensive, it probably won't make that much of a 
difference, because the UDF itself will dwarf the serialization/deserialization 
overhead.

If your UDF is cheap, it will help tremendously.
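
To make "native" concrete, below is a minimal sketch of a Catalyst binary
expression (an editor's illustration, not code from this thread; it mirrors how
Spark's built-in levenshtein expression is written, and the class name and the
use of UTF8String.levenshteinDistance are stand-ins for the scoring logic
described in the question quoted below; the API shown is the 2.4/3.0-era one,
and newer Spark versions require extra overrides):

    import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression}
    import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
    import org.apache.spark.sql.types.{DataType, IntegerType}
    import org.apache.spark.unsafe.types.UTF8String

    // A string-scoring expression that works directly on Spark's internal
    // UTF8String representation, so the per-row conversion between internal
    // rows and Java objects that a registered UDF pays is avoided.
    case class StringScore(left: Expression, right: Expression) extends BinaryExpression {
      override def dataType: DataType = IntegerType

      // Interpreted path: called with the internal (UTF8String) values.
      override protected def nullSafeEval(l: Any, r: Any): Any =
        l.asInstanceOf[UTF8String].levenshteinDistance(r.asInstanceOf[UTF8String])

      // Codegen path: inline the same call into the generated Java code.
      override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
        defineCodeGen(ctx, ev, (a, b) => s"$a.levenshteinDistance($b)")
    }

    // Usage: wrap the expression in a Column, e.g.
    // df.withColumn("score", new Column(StringScore(col("a").expr, col("b").expr)))

Whether this beats a plain registered UDF depends on how much of the per-row
cost is the scoring itself versus the conversion around it, which is the point
above: the cheaper the function, the more the native version helps.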

On Mon, Jan 20, 2020 at 6:33 PM, < em...@yeikel.com > wrote:

> 
> 
> 
> Hi,
> 
> 
> 
>  
> 
> 
> 
> I read online [1] that for the best UDF performance it is possible to
> implement them using internal Spark expressions, and I also saw a couple
> of pull requests such as [2] and [3] where this was put into practice (not
> sure if for that reason or just to extend the API).
> 
> 
> 
>  
> 
> 
> 
> We have an algorithm that computes a score similar to what the Levenshtein
> distance does and it takes about 30%-40% of the overall time of our job.
> We are looking for ways to improve it without adding more resources.
> 
> 
> 
>  
> 
> 
> 
> I was wondering if it would be advisable to implement it by extending 
> BinaryExpression
> like [1] and if it would result in any performance gains.
> 
> 
> 
>  
> 
> 
> 
> Thanks for your help!
> 
> 
> 
>  
> 
> 
> 
> [1] https:/ / hackernoon. com/ 
> apache-spark-tips-and-tricks-for-better-performance-cf2397cac11
> (
> https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11
> )
> 
> 
> 
> [2] https:/ / github. com/ apache/ spark/ pull/ 7214 (
> https://github.com/apache/spark/pull/7214 )
> 
> 
> 
> [3] https:/ / github. com/ apache/ spark/ pull/ 7236 (
> https://github.com/apache/spark/pull/7236 )
> 
> 
>

Re: Enabling push-based shuffle in Spark

2020-01-21 Thread Reynold Xin
Thanks for writing this up. 

Usually when people talk about push-based shuffle, they are motivating it 
primarily to reduce the latency of short queries, by pipelining the map phase, 
shuffle phase, and the reduce phase (which this design isn't going to address). 
It's interesting you are targeting throughput by optimizing for random reads 
instead.

My main questions are ...

1. This is designing for HDDs. But SSD prices have gone lower than HDDs this 
year, so most new data center storage will be using SSDs from now on. Are we 
introducing a lot of complexity to address a problem that only exists with 
legacy that will be phased out soon?

2. Is there a simpler way to address this? E.g. you can simply merge map 
outputs for each node locally, without involving any type of push. It seems to 
me you'd address the same issues you have, with the same limitations (of memory 
buffer limiting the number of concurrent streams you can write to).

On Tue, Jan 21, 2020 at 6:13 PM, mshen < ms...@apache.org > wrote:

> 
> 
> 
> I'd like to start a discussion on enabling push-based shuffle in Spark.
> This is meant to address issues with existing shuffle inefficiency in a
> large-scale Spark compute infra deployment.
> Facebook's previous talks on SOS shuffle
> < https:/ / databricks. com/ session/ sos-optimizing-shuffle-i-o (
> https://databricks.com/session/sos-optimizing-shuffle-i-o ) > and Cosco
> shuffle service
> < https:/ / databricks. com/ session/ 
> cosco-an-efficient-facebook-scale-shuffle-service
> (
> https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service
> ) > are solutions dealing with a similar problem.
> Note that this is somewhat orthogonal to the work in SPARK-25299
> < https:/ / issues. apache. org/ jira/ browse/ SPARK-25299 (
> https://issues.apache.org/jira/browse/SPARK-25299 ) > , which is to use
> remote storage to store shuffle data.
> More details of our proposed design is in SPARK-30602
> < https:/ / issues. apache. org/ jira/ browse/ SPARK-30602 (
> https://issues.apache.org/jira/browse/SPARK-30602 ) > , with SPIP attached.
> Would appreciate comments and discussions from the community.
> 
> 
> 
> -
> Min Shen
> Staff Software Engineer
> LinkedIn
> --
> Sent from: http:/ / apache-spark-developers-list. 1001551. n3. nabble. com/
> ( http://apache-spark-developers-list.1001551.n3.nabble.com/ )
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>

Re: Adding Maven Central mirror from Google to the build?

2020-01-21 Thread Reynold Xin
This seems reasonable!

On Tue, Jan 21, 2020 at 3:23 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> +1, I'm supporting the following proposal.
> 
> 
> > this mirror as the primary repo in the build, falling back to Central if
> needed.
> 
> 
> Thanks,
> Dongjoon.
> 
> 
> 
> On Tue, Jan 21, 2020 at 14:37 Sean Owen < srowen@ gmail. com (
> sro...@gmail.com ) > wrote:
> 
> 
>> See https:/ / github. com/ apache/ spark/ pull/ 27307 (
>> https://github.com/apache/spark/pull/27307 ) for some context. We've
>> had to add, in at least one place, some settings to resolve artifacts
>> from a mirror besides Maven Central to work around some build
>> problems.
>> 
>> Now, we find it might be simpler to just use this mirror as the
>> primary repo in the build, falling back to Central if needed.
>> 
>> The question is: any objections to that?
>> 
>> -
>> To unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
>> dev-unsubscr...@spark.apache.org )
> 
> 
>

Re: [DISCUSS] Support year-month and day-time Intervals

2020-01-10 Thread Reynold Xin
Introducing a new data type has high overhead, both in terms of internal 
complexity and users' cognitive load. Introducing two data types would have 
even higher overhead.

I looked quickly and it looks like both Redshift and Snowflake, two of the most 
recent SQL analytics successes, have only one interval type, and don't support 
storing it. That gets me thinking that, in reality, storing an interval type is 
not that useful.

Do we really need to do this? One of the worst things we can do as a community 
is to introduce features that are almost never used, but at the same time have 
high internal complexity for maintenance.
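
For readers following along, the two ANSI interval categories under discussion
look like this as standard SQL literals (an editor's illustration, assuming a
SparkSession named spark; whether these parse depends on the Spark version and
on the outcome of this proposal):

    // Year-month intervals carry only whole months (years + months).
    spark.sql("SELECT INTERVAL '1-2' YEAR TO MONTH")

    // Day-time intervals carry sub-month precision (days, hours, minutes, seconds).
    spark.sql("SELECT INTERVAL '3 04:05:06' DAY TO SECOND")

The existing CalendarIntervalType mixes both kinds of fields in one value, which
is part of why it cannot be compared or stored, and part of the motivation for
splitting it in two.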

On Fri, Jan 10, 2020 at 10:45 AM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Thank you for clarification.
> 
> 
> Bests,
> Dongjoon.
> 
> On Fri, Jan 10, 2020 at 10:07 AM Kent Yao < yaooqinn@ qq. com (
> yaooq...@qq.com ) > wrote:
> 
> 
>> 
>> Hi Dongjoon,
>> 
>> 
>> Yes, As we want make CalenderIntervalType deprecated and so far, we just
>> find
>> 1. The make_interval function that produces legacy CalenderIntervalType
>> values, 
>> 2. `interval` -> CalenderIntervalType support in the parser
>> 
>> 
>> Thanks
>> 
>> 
>> *Kent Yao*
>> Data Science Center, Hangzhou Research Institute, Netease Corp.
>> PHONE: (86) 186-5715-3499
>> EMAIL: hzyaoqin@ corp. netease. com ( hzyao...@corp.netease.com )
>> 
>> 
>> On 01/11/2020 01:57 , Dongjoon Hyun (
>> dongjoon.h...@gmail.com ) wrote:
>> 
>>> Hi, Kent. 
>>> 
>>> 
>>> Thank you for the proposal.
>>> 
>>> 
>>> Does your proposal need to revert something from the master branch?
>>> I'm just asking because it's not clear in the proposal document.
>>> 
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> On Fri, Jan 10, 2020 at 5:31 AM Dr. Kent Yao < yaooqinn@ qq. com (
>>> yaooq...@qq.com ) > wrote:
>>> 
>>> 
 Hi, Devs
 
 I’d like to propose to add two new interval types which are year-month and
 
 day-time intervals for better ANSI support and future improvements. We
 will
 keep the current CalenderIntervalType but mark it as deprecated until we
 find the right time to remove it completely. The backward compatibility of
 
 the old interval type usages in 2.4 will be guaranteed.
 
 Here is the design doc:
 
 [SPIP] Support Year-Month and Day-Time Intervals -
 https:/ / docs. google. com/ document/ d/ 
 1JNRzcBk4hcm7k2cOXSG1A9U9QM2iNGQzBSXZzScUwAU/
 edit?usp=sharing (
 https://docs.google.com/document/d/1JNRzcBk4hcm7k2cOXSG1A9U9QM2iNGQzBSXZzScUwAU/edit?usp=sharing
 )
 
 All comments are welcome!
 
 Thanks,
 
 Kent Yao
 
 
 
 
 --
 Sent from: http:/ / apache-spark-developers-list. 1001551. n3. nabble. com/
 ( http://apache-spark-developers-list.1001551.n3.nabble.com/ )
 
 -
 To unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
 dev-unsubscr...@spark.apache.org )
>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: [SPARK-30296][SQL] Add Dataset diffing feature

2020-01-07 Thread Reynold Xin
Can this perhaps exist as a utility function outside Spark?
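
For what it's worth, a basic version of such a diff can be written today on top
of public Dataset APIs, which is roughly what a utility outside Spark would look
like. A minimal, hypothetical sketch (names and the output layout are made up;
the proposed PR is more featureful):

    import org.apache.spark.sql.{DataFrame, Dataset}
    import org.apache.spark.sql.functions.lit

    object DatasetDiff {
      // Rows only in `left` are tagged "left_only", rows only in `right` are
      // tagged "right_only". exceptAll keeps duplicates, so differences in
      // multiplicity also show up in the result.
      def diff[T](left: Dataset[T], right: Dataset[T]): DataFrame = {
        val onlyLeft  = left.exceptAll(right).withColumn("diff_side", lit("left_only"))
        val onlyRight = right.exceptAll(left).withColumn("diff_side", lit("right_only"))
        onlyLeft.unionByName(onlyRight)
      }
    }

    // Usage in a regression test: an empty result means the datasets agree.
    // assert(DatasetDiff.diff(expected, actual).isEmpty)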

On Tue, Jan 07, 2020 at 12:18 AM, Enrico Minack < m...@enrico.minack.dev > 
wrote:

> 
> 
> 
> Hi Devs,
> 
> 
> 
> I'd like to get your thoughts on this Dataset feature proposal. Comparing
> datasets is a central operation when regression testing your code changes.
> 
> 
> 
> 
> It would be super useful if Spark's Datasets provide this transformation
> natively.
> 
> 
> 
> https:/ / github. com/ apache/ spark/ pull/ 26936 (
> https://github.com/apache/spark/pull/26936 )
> 
> 
> 
> Regards,
> Enrico
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>

Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-23 Thread Reynold Xin
We've pushed out 3.0 multiple times. The latest release window documented on 
the website ( http://spark.apache.org/versioning-policy.html ) says we'd code 
freeze and cut branch-3.0 early Dec. It looks like we are suffering a bit from 
the tragedy of the commons, that nobody is pushing for getting the release out. 
I understand the natural tendency for each individual is to finish or extend 
the feature/bug that the person has been working on. At some point we need to 
say "this is it" and get the release out. I'm happy to help drive this process.

To be realistic, I don't think we should just code freeze *today*. Although 
we have updated the website, contributors have all been operating under the 
assumption that all active developments are still going on. I propose we *cut 
the branch on Jan 31, and code freeze and switch over to bug squashing 
mode, and try to get the 3.0 official release out in Q1*. That is, by default 
no new features can go into the branch starting Jan 31.

What do you think?

And happy holidays everybody.

Re: Spark 3.0 preview release 2?

2019-12-08 Thread Reynold Xin
If the cost is low, why don't we just do monthly previews until we code freeze? 
If it is high, maybe we should discuss and do it when there are people who 
volunteer.

On Sun, Dec 08, 2019 at 10:32 PM, Xiao Li < gatorsm...@gmail.com > wrote:

> 
> 
> 
> I got a lot of great feedback from the community about the recent 3.0 preview
> release. Since the last 3.0 preview release, we already have 353 commits [
> https://github.com/apache/spark/compare/v3.0.0-preview...master (
> https://github.com/apache/spark/compare/v3.0.0-preview...master ) ]. There
> are various important features and behavior changes we want the community
> to try before entering the official release candidates of Spark 3.0. 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Below is my selected items that are not part of the last 3.0 preview but
> already available in the upstream master branch: 
> 
> 
> 
> 
> 
> 
> 
> * Support JDK 11 with Hadoop 2.7
> * Spark SQL will respect its own default format (i.e., parquet) when users
> do CREATE TABLE without USING or STORED AS clauses
> * Enable Parquet nested schema pruning and nested pruning on expressions
> by default
> * Add observable Metrics for Streaming queries
> * Column pruning through nondeterministic expressions
> * RecordBinaryComparator should check endianness when compared by long 
> * Improve parallelism for local shuffle reader in adaptive query execution
> 
> * Upgrade Apache Arrow to version 0.15.1
> * Various interval-related SQL support
> * Add a mode to pin Python thread into JVM's
> * Provide option to clean up completed files in streaming query
> 
> 
> 
> 
> 
> 
> 
> 
> I am wondering if we can have another preview release for Spark 3.0? This
> can help us find the design/API defects as early as possible and avoid the
> significant delay of the upcoming Spark 3.0 release
> 
> 
> 
> 
> 
> 
> 
> 
> Also, any committer is willing to volunteer as the release manager of the
> next preview release of Spark 3.0, if we have such a release? 
> 
> 
> 
> 
> 
> 
> 
> 
> Cheers,
> 
> 
> 
> 
> 
> 
> 
> 
> Xiao
> 
> 
>

Re: Why Spark generates Java code and not Scala?

2019-11-09 Thread Reynold Xin
It’s mainly due to compilation speed. The Scala compiler is known to be slow.
Even javac is quite slow. We use Janino, which is a simpler compiler, to get
faster compilation at runtime.

Also, for low-level code we can’t use (due to perf concerns) any of the
edges Scala has over Java, e.g. we can’t use the Scala collection library,
functional programming, map/flatMap. So using Scala doesn’t really buy
anything even if there were no compilation speed concerns.
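
To make the Janino point concrete, here is a tiny, self-contained sketch of
compiling and running a Java expression at runtime (an editor's illustration,
assuming Janino is on the classpath; Spark's real codegen path uses Janino's
ClassBodyEvaluator with far more machinery):

    import org.codehaus.janino.ExpressionEvaluator

    object JaninoDemo {
      def main(args: Array[String]): Unit = {
        // Compile a Java expression at runtime. Janino is a small embeddable
        // compiler, so this is much cheaper than invoking javac or scalac.
        val ee = new ExpressionEvaluator()
        ee.setParameters(Array("a", "b"), Array[Class[_]](classOf[Int], classOf[Int]))
        ee.setExpressionType(classOf[Int])
        ee.cook("a * a + b") // the "generated code", analogous to codegen output

        // Evaluate the compiled expression with concrete arguments.
        val result = ee.evaluate(Array[AnyRef](Int.box(3), Int.box(4)))
        println(result) // 13
      }
    }

This is the same idea behind whole-stage codegen: generate plain Java source for
the hot path of a query, then compile it inside the running JVM with Janino.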

On Sat, Nov 9, 2019 at 9:52 AM Holden Karau  wrote:

>
> Switching this from user to dev
>
> On Sat, Nov 9, 2019 at 9:47 AM Bartosz Konieczny 
> wrote:
>
>> Hi there,
>>
>> Few days ago I got an intriguing but hard to answer question:
>> "Why Spark generates Java code and not Scala code?"
>> (https://github.com/bartosz25/spark-scala-playground/issues/18)
>>
>> Since I'm not sure about the exact answer, I'd like to ask you to confirm
>> or not my thinking. I was looking for the reasons in the JIRA and the
>> research paper "Spark SQL: Relational Data Processing in Spark" (
>> http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf) but
>> found nothing explaining why Java over Scala. The single task I found was
>> about why Scala and not Java but concerning data types (
>> https://issues.apache.org/jira/browse/SPARK-5193) That's why I'm writing
>> here.
>>
>> My guesses about choosing Java code are:
>> - Java runtime compiler libs are more mature and prod-ready than the
>> Scala's - or at least, they were at the implementation time
>> - Scala compiler tends to be slower than the Java's
>> https://stackoverflow.com/questions/3490383/java-compile-speed-vs-scala-compile-speed
>>
> From the discussions when I was doing some code gen (in MLlib not SQL) I
> think this is the primary reason why.
>
>>
>> 
>> - Scala compiler seems to be more complex, so debugging & maintaining it
>> would be harder
>>
> this was also given as a secondary reason
>
>> - it was easier to represent a pure Java OO design than mixed FP/OO in
>> Scala
>>
> no one brought up this point. Maybe it was a consideration and it just
> wasn’t raised.
>
>> ?
>>
>> Thank you for your help.
>>
>>
>> --
>> Bartosz Konieczny
>> data engineer
>> https://www.waitingforcode.com
>> https://github.com/bartosz25/
>> https://twitter.com/waitingforcode
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [VOTE] SPARK 3.0.0-preview (RC1)

2019-10-29 Thread Reynold Xin
Does the description make sense? This is a preview release so there is no
need to retarget versions.

On Tue, Oct 29, 2019 at 7:01 PM Xingbo Jiang  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.0.0-preview.
>
> The vote is open until November 2 PST and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.0.0-preview
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.0.0-preview-rc1 (commit
> 5eddbb5f1d9789696927f435c55df887e50a1389):
> https://github.com/apache/spark/tree/v3.0.0-preview-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1334/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-docs/
>
> The list of bug fixes going into 3.0.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.0.0?
> ===
>
> The current list of open tickets targeted at 3.0.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: Add spark dependency on on org.opencypher:okapi-shade.okapi

2019-10-16 Thread Reynold Xin
Just curious - did we discuss why this shouldn't be another Apache sister 
project?

On Wed, Oct 16, 2019 at 10:21 AM, Sean Owen < sro...@gmail.com > wrote:

> 
> 
> 
> We don't all have to agree on whether to add this -- there are like 10
> people with an opinion -- and I certainly would not veto it. In practice a
> medium-sized changes needs someone to review/merge it all the way through,
> and nobody strongly objecting. I too don't know what to make of the
> situation; what happened to the supporters here?
> 
> 
> 
> I am concerned about maintenance, as inevitably any new module falls on
> everyone to maintain to some degree, and people come and go despite their
> intentions. But that isn't the substance of why I personally wouldn't
> merge it. Just doesn't seem like it must live in Spark. But again this is
> my opinion; you don't need to convince me, just need to
> (re?)-convince a shepherd, sponsor for this change.
> 
> 
> 
> Voting on the dependency part or whatever is also not important. It's a
> detail, and already merged even.
> 
> 
> 
> The issue to hand is: if nobody supports reviewing and merging the rest of
> the change, what then? we can't leave it half implemented. The fallback
> plan is just to back it out and reconsider later. This would be a poor
> outcome process-wise, but better than leaving it incomplete.
> 
> 
> 
> On Wed, Oct 16, 2019 at 3:15 AM Martin Junghanns
> < martin. junghanns@ neo4j. com ( martin.jungha...@neo4j.com ) > wrote:
> 
> 
>> 
>> 
>> I'm slightly confused about this discussion. I worked on all of the
>> aforementioned PRs: the module PR that has been merged, the current PR
>> that introduces the Graph API and the PoC PR that contains the full
>> implementation. The issues around shading were addressed and the module PR
>> eventually got merged. Two PMC members including the SPIP shepherd are
>> working with me (and others) on the current API PR. The SPIP to bring
>> Spark Graph into Apache Spark itself has been successfully voted on
>> earlier this year. I presented this work at Spark Summit in San Francisco
>> in May and was asked by the organizers to present the topic at the
>> European Spark Summit. I'm currently sitting in the speakers room of that
>> conference preparing for the talk and reading this thread. I hope you
>> understand my confusion.
>> 
>> 
>> 
>> I admit - and Xiangrui pointed this out in the other thread, too - that we
>> made the early mistake of not bringing more Spark committers on board
>> which led to a stagnation period during summer when Xiangrui wasn't
>> around to help review and bring progress to the project.
>> 
>> 
>> 
>> Sean, if your concern is the lack of maintainers of that module, I
>> personally would like to volunteer to maintain Spark Graph. I'm also a
>> contributor to the Okapi stack and am able to work on whatever issue might
>> come up on that end including updating dependencies etc. FWIW, Okapi is
>> actively maintained by a team at Neo4j.
>> 
>> 
>> 
>> Best, Martin
>> 
>> 
>> 
>> On Wed, 16 Oct 2019, 4:35 AM Sean Owen < srowen@ gmail. com (
>> sro...@gmail.com ) > wrote:
>> 
>> 
>>> 
>>> 
>>> I do not have a very informed opinion here, so take this with a grain of
>>> salt.
>>> 
>>> 
>>> 
>>> I'd say that we need to either commit a coherent version of this for Spark
>>> 3, or not at all. If it doesn't have support, I'd back out the existing
>>> changes.
>>> I was initially skeptical about how much this needs to be in Spark vs a
>>> third-party package, and that still stands.
>>> 
>>> 
>>> 
>>> The addition of another dependency isn't that big a deal IMHO, but, yes,
>>> it does add something to the maintenance overhead. But that's all the more
>>> true of a new module.
>>> 
>>> 
>>> 
>>> I don't feel strongly about it, but if this isn't obviously getting
>>> support from any committers, can we keep it as a third party library for
>>> now?
>>> 
>>> 
>>> 
>>> On Tue, Oct 15, 2019 at 8:53 PM Weichen Xu < weichen. xu@ databricks. com (
>>> weichen...@databricks.com ) > wrote:
>>> 
>>> 
 
 
 Hi Mats Rydberg,
 
 
 
 Although this dependency "org.opencypher:okapi-shade.okapi" was added into
 spark, but Xiangrui raised two concerns (see above mail) about it, so we'd
 better rethink on this and consider whether this is a good choice, so I
 call this vote.
 
 
 
 Thanks!
 
 
 
 On Tue, Oct 15, 2019 at 10:56 PM Mats Rydberg < mats@ neo4j. org. invalid (
 m...@neo4j.org.invalid ) > wrote:
 
 
> 
> 
> Hello Weichen, community
> 
> 
> 
> I'm sorry, I'm feeling a little bit confused about this vote. Is this
> about the PR ( https:/ / github. com/ apache/ spark/ pull/ 24490 (
> https://github.com/apache/spark/pull/24490 ) ) that was merged in early
> June and introduced the spark-graph module including the okapi-shade
> dependency?
> 
> 
> 
> Regarding the okapi-shade dependency which was 

Re: branch-3.0 vs branch-3.0-preview (?)

2019-10-16 Thread Reynold Xin
Can we just tag master?

On Wed, Oct 16, 2019 at 12:34 AM, Wenchen Fan < cloud0...@gmail.com > wrote:

> 
> Does anybody remember what we did for 2.0 preview? Personally I'd like to
> avoid cutting branch-3.0 right now, otherwise we need to merge PRs into
> two branches in the following several months.
> 
> 
> Thanks,
> Wenchen
> 
> On Wed, Oct 16, 2019 at 3:01 PM Xingbo Jiang < jiangxb1987@ gmail. com (
> jiangxb1...@gmail.com ) > wrote:
> 
> 
>> Hi Dongjoon,
>> 
>> 
>> I'm not sure about the best practice of maintaining a preview release
>> branch. Since new features might still go into Spark 3.0 after the preview
>> release, I guess it might make more sense to have separate branches for
>> 3.0.0 and 3.0-preview.
>> 
>> 
>> However, I'm open to both solutions, if we really want to reuse the branch
>> to also release Spark 3.0.0, then I would be happy to create a new one.
>> 
>> 
>> Thanks!
>> 
>> 
>> Xingbo
>> 
>> Dongjoon Hyun < dongjoon. hyun@ gmail. com ( dongjoon.h...@gmail.com ) >
>> 于2019年10月16日周三 上午6:26写道:
>> 
>> 
>>> Hi, 
>>> 
>>> 
>>> It seems that we have `branch-3.0-preview` branch.
>>> 
>>> 
>>> https:/ / github. com/ apache/ spark/ commits/ branch-3. 0-preview (
>>> https://github.com/apache/spark/commits/branch-3.0-preview )
>>> 
>>> 
>>> 
>>> Can we have `branch-3.0` instead of `branch-3.0-preview`?
>>> 
>>> 
>>> We can tag `v3.0.0-preview` on `branch-3.0` and continue to use for
>>> `v3.0.0` later.
>>> 
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>> 
>> 
> 
>

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Reynold Xin
I just looked at the PR. I think there is some follow-up work that needs to be 
done, e.g. we shouldn't create a top-level package 
org.apache.spark.sql.dynamicpruning.

On Wed, Oct 02, 2019 at 1:52 PM, Maryann Xue < maryann@databricks.com > 
wrote:

> 
> There is no internal write up, but I think we should at least give some
> up-to-date description on that JIRA entry.
> 
> On Wed, Oct 2, 2019 at 3:13 PM Reynold Xin < r...@databricks.com > wrote:
> 
> 
>> No there is no separate write up internally.
>> 
>> On Wed, Oct 2, 2019 at 12:29 PM Ryan Blue < rb...@netflix.com > wrote:
>> 
>> 
>>> Thanks for the pointers, but what I'm looking for is information about the
>>> design of this implementation, like what requires this to be in spark-sql
>>> instead of spark-catalyst.
>>> 
>>> 
>>> Even a high-level description, like what the optimizer rules are and what
>>> they do would be great. Was there one written up internally that you could
>>> share?
>>> 
>>> On Wed, Oct 2, 2019 at 10:40 AM Maryann Xue < maryann@databricks.com >
>>> wrote:
>>> 
>>> 
>>>> > It lists 3 cases for how a filter is built, but nothing about the
>>>> overall approach or design that helps when trying to find out where it
>>>> should be placed in the optimizer rules.
>>>> 
>>>> 
>>>> The overall idea/design of DPP can be simply put as using the result of
>>>> one side of the join to prune partitions of a scan on the other side. The
>>>> optimal situation is when the join is a broadcast join and the table being
>>>> partition-pruned is on the probe side. In that case, by the time the probe
>>>> side starts, the filter will already have the results available and ready
>>>> for reuse.
>>>> 
>>>> 
>>>> Regarding the place in the optimizer rules, it's preferred to happen late
>>>> in the optimization, and definitely after join reorder.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> Maryann
>>>> 
>>>> On Wed, Oct 2, 2019 at 12:20 PM Reynold Xin < r...@databricks.com > wrote:
>>>> 
>>>> 
>>>> 
>>>>> Whoever created the JIRA years ago didn't describe dpp correctly, but the
>>>>> linked jira in Hive was correct (which unfortunately is much more terse
>>>>> than any of the patches we have in Spark 
>>>>> https://issues.apache.org/jira/browse/HIVE-9152
>>>>> ). Henry R's description was also correct.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue < rb...@netflix.com.invalid > 
>>>>> wrote:
>>>>> 
>>>>> 
>>>>>> Where can I find a design doc for dynamic partition pruning that explains
>>>>>> how it works?
>>>>>> 
>>>>>> 
>>>>>> The JIRA issue, SPARK-11150, doesn't seem to describe dynamic partition
>>>>>> pruning (as pointed out by Henry R.) and doesn't have any comments about
>>>>>> the implementation's approach. And the PR description also doesn't have
>>>>>> much information. It lists 3 cases for how a filter is built, but nothing
>>>>>> about the overall approach or design that helps when trying to find out
>>>>>> where it should be placed in the optimizer rules. It also isn't clear why
>>>>>> this couldn't be part of spark-catalyst.
>>>>>> 
>>>>>> On Wed, Oct 2, 2019 at 1:48 AM Wenchen Fan < cloud0...@gmail.com > wrote:
>>>>>> 
>>>>>> 
>>>>>>> dynamic partition pruning rule generates "hidden" filters that will be
>>>>>>> converted to real predicates at runtime, so it doesn't matter where we 
>>>>>>> run
>>>>>>> the rule.
>>>>>>> 
>>>>>>> 
>>>>>>> For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's 
>>>>>>> better
>>>>>>> to run it before join reorder.
>>>>>>> 
>>>>>>> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue < rb...@netflix.com.invalid >
>>>

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Reynold Xin
No there is no separate write up internally.

On Wed, Oct 2, 2019 at 12:29 PM Ryan Blue  wrote:

> Thanks for the pointers, but what I'm looking for is information about the
> design of this implementation, like what requires this to be in spark-sql
> instead of spark-catalyst.
>
> Even a high-level description, like what the optimizer rules are and what
> they do would be great. Was there one written up internally that you could
> share?
>
> On Wed, Oct 2, 2019 at 10:40 AM Maryann Xue 
> wrote:
>
>> > It lists 3 cases for how a filter is built, but nothing about the
>> overall approach or design that helps when trying to find out where it
>> should be placed in the optimizer rules.
>>
>> The overall idea/design of DPP can be simply put as using the result of
>> one side of the join to prune partitions of a scan on the other side. The
>> optimal situation is when the join is a broadcast join and the table being
>> partition-pruned is on the probe side. In that case, by the time the probe
>> side starts, the filter will already have the results available and ready
>> for reuse.
>>
>> Regarding the place in the optimizer rules, it's preferred to happen late
>> in the optimization, and definitely after join reorder.
>>
>>
>> Thanks,
>> Maryann
>>
>> On Wed, Oct 2, 2019 at 12:20 PM Reynold Xin  wrote:
>>
>>> Whoever created the JIRA years ago didn't describe dpp correctly, but
>>> the linked jira in Hive was correct (which unfortunately is much more terse
>>> than any of the patches we have in Spark
>>> https://issues.apache.org/jira/browse/HIVE-9152). Henry R's description
>>> was also correct.
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue 
>>> wrote:
>>>
>>>> Where can I find a design doc for dynamic partition pruning that
>>>> explains how it works?
>>>>
>>>> The JIRA issue, SPARK-11150, doesn't seem to describe dynamic partition
>>>> pruning (as pointed out by Henry R.) and doesn't have any comments about
>>>> the implementation's approach. And the PR description also doesn't have
>>>> much information. It lists 3 cases for how a filter is built, but
>>>> nothing about the overall approach or design that helps when trying to find
>>>> out where it should be placed in the optimizer rules. It also isn't clear
>>>> why this couldn't be part of spark-catalyst.
>>>>
>>>> On Wed, Oct 2, 2019 at 1:48 AM Wenchen Fan  wrote:
>>>>
>>>>> dynamic partition pruning rule generates "hidden" filters that will be
>>>>> converted to real predicates at runtime, so it doesn't matter where we run
>>>>> the rule.
>>>>>
>>>>> For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's
>>>>> better to run it before join reorder.
>>>>>
>>>>> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue 
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I have been working on a PR that moves filter and projection pushdown
>>>>>> into the optimizer for DSv2, instead of when converting to physical plan.
>>>>>> This will make DSv2 work with optimizer rules that depend on stats, like
>>>>>> join reordering.
>>>>>>
>>>>>> While adding the optimizer rule, I found that some rules appear to be
>>>>>> out of order. For example, PruneFileSourcePartitions that handles
>>>>>> filter pushdown for v1 scans is in SparkOptimizer (spark-sql) in a
>>>>>> batch that will run after all of the batches in Optimizer
>>>>>> (spark-catalyst) including CostBasedJoinReorder.
>>>>>>
>>>>>> SparkOptimizer also adds the new “dynamic partition pruning” rules
>>>>>> *after* both the cost-based join reordering and the v1 partition
>>>>>> pruning rule. I’m not sure why this should run after join reordering and
>>>>>> partition pruning, since it seems to me like additional filters would be
>>>>>> good to have before those rules run.
>>>>>>
>>>>>> It looks like this might just be that the rules were written in the
>>>>>> spark-sql module instead of in catalyst. That makes some sense for the v1
>>>>>> pushdown, which is altering physical plan details (FileIndex) that
>>>>>> have leaked into the logical plan. I’m not sure why the dynamic partition
>>>>>> pruning rules aren’t in catalyst or why they run after the v1 predicate
>>>>>> pushdown.
>>>>>>
>>>>>> Can someone more familiar with these rules clarify why they appear to
>>>>>> be out of order?
>>>>>>
>>>>>> Assuming that this is an accident, I think it’s something that should
>>>>>> be fixed before 3.0. My PR fixes early pushdown, but the “dynamic” 
>>>>>> pruning
>>>>>> may still need to be addressed.
>>>>>>
>>>>>> rb
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Reynold Xin
Whoever created the JIRA years ago didn't describe dpp correctly, but the 
linked jira in Hive was correct (which unfortunately is much more terse than 
any of the patches we have in Spark 
https://issues.apache.org/jira/browse/HIVE-9152 ). Henry R's description was 
also correct.
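
For readers who want a concrete picture, the kind of query dynamic partition
pruning targets looks roughly like the following (an editor's illustration with
made-up table and column names, matching the description given further down the
thread):

    // `sales` is a large fact table partitioned by `sale_date`;
    // `dates` is a small dimension table with a selective filter.
    spark.sql("""
      SELECT s.item, s.amount
      FROM sales s
      JOIN dates d ON s.sale_date = d.sale_date
      WHERE d.is_holiday = true
    """)
    // Without DPP, every partition of `sales` is scanned and then filtered by
    // the join. With DPP, the filtered `dates` result (ideally the broadcast
    // side) becomes a runtime filter on `sale_date`, so only the matching
    // partitions of `sales` are read at all.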

On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue < rb...@netflix.com.invalid > wrote:

> 
> Where can I find a design doc for dynamic partition pruning that explains
> how it works?
> 
> 
> The JIRA issue, SPARK-11150, doesn't seem to describe dynamic partition
> pruning (as pointed out by Henry R.) and doesn't have any comments about
> the implementation's approach. And the PR description also doesn't have
> much information. It lists 3 cases for how a filter is built, but nothing
> about the overall approach or design that helps when trying to find out
> where it should be placed in the optimizer rules. It also isn't clear why
> this couldn't be part of spark-catalyst.
> 
> On Wed, Oct 2, 2019 at 1:48 AM Wenchen Fan < cloud0...@gmail.com > wrote:
> 
> 
>> dynamic partition pruning rule generates "hidden" filters that will be
>> converted to real predicates at runtime, so it doesn't matter where we run
>> the rule.
>> 
>> 
>> For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's better
>> to run it before join reorder.
>> 
>> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue < rb...@netflix.com.invalid >
>> wrote:
>> 
>> 
>>> 
>>> 
>>> Hi everyone,
>>> 
>>> 
>>> 
>>> I have been working on a PR that moves filter and projection pushdown into
>>> the optimizer for DSv2, instead of when converting to physical plan. This
>>> will make DSv2 work with optimizer rules that depend on stats, like join
>>> reordering.
>>> 
>>> 
>>> 
>>> While adding the optimizer rule, I found that some rules appear to be out
>>> of order. For example, PruneFileSourcePartitions that handles filter
>>> pushdown for v1 scans is in SparkOptimizer (spark-sql) in a batch that will
>>> run after all of the batches in Optimizer (spark-catalyst) including 
>>> CostBasedJoinReorder
>>> .
>>> 
>>> 
>>> 
>>> SparkOptimizer also adds the new “dynamic partition pruning” rules after 
>>> both
>>> the cost-based join reordering and the v1 partition pruning rule. I’m not
>>> sure why this should run after join reordering and partition pruning,
>>> since it seems to me like additional filters would be good to have before
>>> those rules run.
>>> 
>>> 
>>> 
>>> It looks like this might just be that the rules were written in the
>>> spark-sql module instead of in catalyst. That makes some sense for the v1
>>> pushdown, which is altering physical plan details ( FileIndex ) that have
>>> leaked into the logical plan. I’m not sure why the dynamic partition
>>> pruning rules aren’t in catalyst or why they run after the v1 predicate
>>> pushdown.
>>> 
>>> 
>>> 
>>> Can someone more familiar with these rules clarify why they appear to be
>>> out of order?
>>> 
>>> 
>>> 
>>> Assuming that this is an accident, I think it’s something that should be
>>> fixed before 3.0. My PR fixes early pushdown, but the “dynamic” pruning
>>> may still need to be addressed.
>>> 
>>> 
>>> 
>>> rb
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>> 
>> 
>> 
> 
> 
> 
> 
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Collections passed from driver to executors

2019-09-23 Thread Reynold Xin
A while ago we changed it so the task gets broadcasted too, so I think the two 
are fairly similar.
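
The question below is about PySpark, but the two approaches being compared look
like this in Scala form (an editor's sketch to make the comparison concrete; in
both cases the lookup table travels from the driver to the executors, either
inside the serialized task closure or as an explicit broadcast variable):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}

    val spark = SparkSession.builder().getOrCreate()
    val dict = Map("a" -> 1, "b" -> 2)

    // 1. Capture the map in the UDF closure; it is shipped as part of the task.
    val lookupClosure = udf((k: String) => dict.getOrElse(k, 0))

    // 2. Broadcast the map explicitly and read it through the broadcast handle.
    val dictBc = spark.sparkContext.broadcast(dict)
    val lookupBroadcast = udf((k: String) => dictBc.value.getOrElse(k, 0))

    // df.withColumn("v", lookupClosure(col("k")))
    // df.withColumn("v", lookupBroadcast(col("k")))

For a small object the two behave similarly, which is the point above; an
explicit broadcast mainly matters when the object is large or reused across many
tasks and queries.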

On Mon, Sep 23, 2019 at 8:17 PM, Dhrubajyoti Hati < dhruba.w...@gmail.com > 
wrote:

> 
> I was wondering if anyone could help with this question.
> 
> On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti Hati, < dhruba. work@ gmail. com
> ( dhruba.w...@gmail.com ) > wrote:
> 
> 
>> Hi,
>> 
>> 
>> I have a question regarding passing a dictionary from driver to executors
>> in spark on yarn. This dictionary is needed in an udf. I am using pyspark.
>> 
>> 
>> As I understand this can be passed in two ways:
>> 
>> 
>> 1. Broadcast the variable and then use it in the udfs
>> 
>> 
>> 2. Pass the dictionary in the udf itself, in something like this:
>> 
>> 
>>   def udf1(col1, dict):
>>    ..
>>   def udf1_fn(dict):
>>     return udf(lambda col_data: udf1(col_data, dict))
>> 
>> 
>>   df.withColumn("column_new", udf1_fn(dict)("old_column"))
>> 
>> 
>> Well I have tested with both the ways and it works both ways.
>> 
>> 
>> Now I am wondering what is fundamentally different between the two. I
>> understand how broadcast works but I am not sure how the data is passed
>> across in the 2nd way. Is the dictionary passed to each executor every
>> time a new task runs on that executor, or is it passed only
>> once? Also, how is the data passed to the Python processes? They are Python
>> UDFs so I think they are executed natively in Python (please correct me if I
>> am wrong). So the data will be serialised and passed to Python.
>> 
>> So in summary my question is: which is the better/more efficient way to write
>> the whole thing, and why?
>> 
>> 
>> Thank you!
>> 
>> 
>> Regards,
>> Dhrub
>> 
> 
>

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Reynold Xin
Because for example we'd need to move the location of InternalRow, breaking the 
package name. If you insist we shouldn't change the unstable temporary API in 
3.x to maintain compatibility with 3.0, which is totally different from my 
understanding of the situation when you exposed it, then I'd say we should gate 
3.0 on having a stable row interface.

I also don't get backporting a giant feature to the 2.x line ... as suggested 
by others in the thread, DSv2 would be one of the main reasons people upgrade 
to 3.0. What's so special about DSv2 that we are doing this? Why not abandon 
3.0 entirely and backport all the features to 2.x?

On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue < rb...@netflix.com > wrote:

> 
> Why would that require an incompatible change?
> 
> 
> We *could* make an incompatible change and remove support for InternalRow,
> but I think we would want to carefully consider whether that is the right
> decision. And in any case, we would be able to keep 2.5 and 3.0
> compatible, which is the main goal.
> 
> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin < r...@databricks.com > wrote:
> 
> 
> 
>> How would you not make incompatible changes in 3.x? As discussed the
>> InternalRow API is not stable and needs to change. 
>> 
>> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue < rb...@netflix.com > wrote:
>> 
>> 
>>> > Making downstream to diverge their implementation heavily between minor
>>> versions (say, 2.4 vs 2.5) wouldn't be a good experience
>>> 
>>> 
>>> You're right that the API has been evolving in the 2.x line. But, it is
>>> now reasonably stable with respect to the current feature set and we
>>> should not need to break compatibility in the 3.x line. Because we have
>>> reached our goals for the 3.0 release, we can backport at least those
>>> features to 2.x and confidently have an API that works in both a 2.x
>>> release and is compatible with 3.0, if not 3.1 and later releases as well.
>>> 
>>> 
>>> 
>>> > I'd rather say preparation of Spark 2.5 should be started after Spark
>>> 3.0 is officially released
>>> 
>>> 
>>> The reason I'm suggesting this is that I'm already going to do the work to
>>> backport the 3.0 release features to 2.4. I've been asked by several
>>> people when DSv2 will be released, so I know there is a lot of interest in
>>> making this available sooner than 3.0. If I'm already doing the work, then
>>> I'd be happy to share that with the community.
>>> 
>>> 
>>> I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5
>>> while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
>>> about complete so we can easily release the same set of features and API
>>> in 2.5 and 3.0.
>>> 
>>> 
>>> If we decide for some reason to wait until after 3.0 is released, I don't
>>> know that there is much value in a 2.5. The purpose is to be a step toward
>>> 3.0, and releasing that step after 3.0 doesn't seem helpful to me. It also
>>> wouldn't get these features out any sooner than 3.0, as a 2.5 release
>>> probably would, given the work needed to validate the incompatible changes
>>> in 3.0.
>>> 
>>> 
>>> > DSv2 change would be the major backward incompatibility which Spark 2.x
>>> users may hesitate to upgrade
>>> 
>>> 
>>> As I pointed out, DSv2 has been changing in the 2.x line, so this is
>>> expected. I don't think it will need incompatible changes in the 3.x line.
>>> 
>>> 
>>> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim < kabh...@gmail.com > wrote:
>>> 
>>> 
>>>> Just 2 cents, I haven't tracked the change of DSv2 (though I needed to
>>>> deal with this as the change made confusion on my PRs...), but my bet is
>>>> that DSv2 would be already changed in incompatible way, at least who works
>>>> for custom DataSource. Making downstream to diverge their implementation
>>>> heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good
>>>> experience - especially we are not completely closed the chance to further
>>>> modify DSv2, and the change could be backward incompatible.
>>>> 
>>>> 
>>>> If we really want to bring the DSv2 change to 2.x version line to let end
>>>> users avoid forcing to upgrade Spark 3.x to enjoy new DSv2, I'd rather say
>>>> preparation of Spark 2.5 should be started after Spark 3.0 is officially
>>>> released, honestly even la

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Reynold Xin
How would you not make incompatible changes in 3.x? As discussed the
InternalRow API is not stable and needs to change.

On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue  wrote:

> > Making downstream to diverge their implementation heavily between minor
> versions (say, 2.4 vs 2.5) wouldn't be a good experience
>
> You're right that the API has been evolving in the 2.x line. But, it is
> now reasonably stable with respect to the current feature set and we should
> not need to break compatibility in the 3.x line. Because we have reached
> our goals for the 3.0 release, we can backport at least those features to
> 2.x and confidently have an API that works in both a 2.x release and is
> compatible with 3.0, if not 3.1 and later releases as well.
>
> > I'd rather say preparation of Spark 2.5 should be started after Spark
> 3.0 is officially released
>
> The reason I'm suggesting this is that I'm already going to do the work to
> backport the 3.0 release features to 2.4. I've been asked by several people
> when DSv2 will be released, so I know there is a lot of interest in making
> this available sooner than 3.0. If I'm already doing the work, then I'd be
> happy to share that with the community.
>
> I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5
> while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
> about complete so we can easily release the same set of features and API in
> 2.5 and 3.0.
>
> If we decide for some reason to wait until after 3.0 is released, I don't
> know that there is much value in a 2.5. The purpose is to be a step toward
> 3.0, and releasing that step after 3.0 doesn't seem helpful to me. It also
> wouldn't get these features out any sooner than 3.0, as a 2.5 release
> probably would, given the work needed to validate the incompatible changes
> in 3.0.
>
> > DSv2 change would be the major backward incompatibility which Spark 2.x
> users may hesitate to upgrade
>
> As I pointed out, DSv2 has been changing in the 2.x line, so this is
> expected. I don't think it will need incompatible changes in the 3.x line.
>
> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim  wrote:
>
>> Just 2 cents, I haven't tracked the change of DSv2 (though I needed to
>> deal with this as the change made confusion on my PRs...), but my bet is
>> that DSv2 would be already changed in incompatible way, at least who works
>> for custom DataSource. Making downstream to diverge their implementation
>> heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good
>> experience - especially we are not completely closed the chance to further
>> modify DSv2, and the change could be backward incompatible.
>>
>> If we really want to bring the DSv2 change to 2.x version line to let end
>> users avoid forcing to upgrade Spark 3.x to enjoy new DSv2, I'd rather say
>> preparation of Spark 2.5 should be started after Spark 3.0 is officially
>> released, honestly even later than that, say, getting some reports from
>> Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope we don't make Spark
>> 2.5 be a kind of "tech-preview" which Spark 2.4 users may be frustrated to
>> upgrade to next minor version.
>>
>> Btw, do we have any specific target users for this? Personally DSv2
>> change would be the major backward incompatibility which Spark 2.x users
>> may hesitate to upgrade, so they might be already prepared to migrate to
>> Spark 3.0 if they are prepared to migrate to new DSv2.
>>
>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun 
>> wrote:
>>
>>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>>> I believe we follow Semantic Versioning (
>>> https://spark.apache.org/versioning-policy.html ).
>>>
>>> > We just won’t add any breaking changes before 3.1.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue 
>>> wrote:
>>>
>>>> I don’t think we need to gate a 3.0 release on making a more stable
>>>> version of InternalRow
>>>>
>>>> Sounds like we agree, then. We will use it for 3.0, but there are known
>>>> problems with it.
>>>>
>>>> Thinking we’d have dsv2 working in both 3.x (which will change and
>>>> progress towards more stable, but will have to break certain APIs) and 2.x
>>>> seems like a false premise.
>>>>
>>>> Why do you think we will need to break certain APIs before 3.0?
>>>>
>>>> I’m only suggesting that we release the same support in a 2.5 release
>>>> that we d

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Reynold Xin
I don't think we need to gate a 3.0 release on making a more stable version of 
InternalRow, but thinking we'd have dsv2 working in both 3.x (which will change 
and progress towards more stable, but will have to break certain APIs) and 2.x 
seems like a false premise.

To point out some problems with InternalRow, which you think is already 
pragmatic and stable:

The class is in catalyst, which states: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala

/**

* Catalyst is a library for manipulating relational query plans.  All classes 
in catalyst are

* considered an internal API to Spark SQL and are subject to change between 
minor releases.

*/

There is no even any annotation on the interface.

The entire dependency chain was created to be private, and is tightly coupled 
with internal implementations. For example, 

https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

/**

* A UTF-8 String for internal Spark use.

* 

* A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,

* search, see http://en.wikipedia.org/wiki/UTF-8 for details.

* 

* Note: This is not designed for general use cases, should not be used outside 
SQL.

*/

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala

(which again is in catalyst package)

If you want to argue this way, you might as well argue we should make the 
entire catalyst package public to be pragmatic and not allow any changes.
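
To make the coupling under debate concrete: a DSv2 reader hands rows back to
Spark as InternalRow, so a third-party connector ends up importing exactly the
catalyst and unsafe classes quoted above. A minimal, hypothetical reader against
the 3.0-era connector API (not any real source):

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.read.PartitionReader
    import org.apache.spark.unsafe.types.UTF8String

    // Emits two hard-coded (name, count) rows; a real reader would pull from storage.
    class ToyPartitionReader extends PartitionReader[InternalRow] {
      private val rows = Iterator(("a", 1L), ("b", 2L))
      private var current: InternalRow = _

      override def next(): Boolean = {
        if (!rows.hasNext) return false
        val (name, count) = rows.next()
        // Internal representations leak into connector code: strings must be UTF8String.
        current = InternalRow.fromSeq(Seq(UTF8String.fromString(name), count))
        true
      }

      override def get(): InternalRow = current
      override def close(): Unit = ()
    }

Whether that surface counts as stable enough is exactly the disagreement in this
thread.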

On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue < rb...@netflix.com > wrote:

> 
> 
>> 
>> 
>> When you created the PR to make InternalRow public
>> 
>> 
> 
> 
> 
> This isn’t quite accurate. The change I made was to use InternalRow instead
> of UnsafeRow , which is a specific implementation of InternalRow. Exposing
> this API has always been a part of DSv2 and while both you and I did some
> work to avoid this, we are still in the phase of starting with that API.
> 
> 
> 
> Note that any change to InternalRow would be very costly to implement
> because this interface is widely used. That is why I think we can
> certainly consider it stable enough to use here, and that’s probably why 
> UnsafeRow
> was part of the original proposal.
> 
> 
> 
> In any case, the goal for 3.0 was not to replace the use of InternalRow ,
> it was to get the majority of SQL working on top of the interface added
> after 2.4. That’s done and stable, so I think a 2.5 release with it is
> also reasonable.
> 
> 
> 
> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin < r...@databricks.com > wrote:
> 
> 
> 
>> To push back, while I agree we should not drastically change
>> "InternalRow", there are a lot of changes that need to happen to make it
>> stable. For example, none of the publicly exposed interfaces should be in
>> the Catalyst package or the unsafe package. External implementations
>> should be decoupled from the internal implementations, with cheap ways to
>> convert back and forth.
>> 
>> 
>> 
>> When you created the PR to make InternalRow public, the understanding was
>> to work towards making it stable in the future, assuming we will start
>> with an unstable API temporarily. You can't just make a bunch internal
>> APIs tightly coupled with other internal pieces public and stable and call
>> it a day, just because it happen to satisfy some use cases temporarily
>> assuming the rest of Spark doesn't change.
>> 
>> 
>> 
>> 
>> 
>> 
>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue < rb...@netflix.com > wrote:
>> 
>>> > DSv2 is far from stable right?
>>> 
>>> 
>>> No, I think it is reasonably stable and very close to being ready for a
>>> release.
>>> 
>>> 
>>> > All the actual data types are unstable and you guys have completely
>>> ignored that.
>>> 
>>> 
>>> I think what you're referring to is the use of `InternalRow`. That's a
>>> stable API and there has been no work to avoid using it. In any case, I
>>> don't think that anyone is suggesting that we delay 3.0 until a
>>> replacement for `InternalRow` is added, right?
>>> 
>>> 
>>> While I understand the motivation for a better solution here, I think the
>>> pragmatic solution is to continue using `InternalRow`.
>>> 
>>> 
>>> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
>>> invasive of a change to backport once you consider the parts needed to
>>> make dsv2 stabl

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Reynold Xin
To push back, while I agree we should not drastically change "InternalRow", 
there are a lot of changes that need to happen to make it stable. For example, 
none of the publicly exposed interfaces should be in the Catalyst package or 
the unsafe package. External implementations should be decoupled from the 
internal implementations, with cheap ways to convert back and forth.

When you created the PR to make InternalRow public, the understanding was to 
work towards making it stable in the future, assuming we will start with an 
unstable API temporarily. You can't just make a bunch of internal APIs tightly 
coupled with other internal pieces public and stable and call it a day, just 
because they happen to satisfy some use cases temporarily, assuming the rest of 
Spark doesn't change.

On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue < rb...@netflix.com > wrote:

> 
> > DSv2 is far from stable right?
> 
> 
> No, I think it is reasonably stable and very close to being ready for a
> release.
> 
> 
> > All the actual data types are unstable and you guys have completely
> ignored that.
> 
> 
> I think what you're referring to is the use of `InternalRow`. That's a
> stable API and there has been no work to avoid using it. In any case, I
> don't think that anyone is suggesting that we delay 3.0 until a
> replacement for `InternalRow` is added, right?
> 
> 
> While I understand the motivation for a better solution here, I think the
> pragmatic solution is to continue using `InternalRow`.
> 
> 
> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
> invasive of a change to backport once you consider the parts needed to
> make dsv2 stable.
> 
> 
> I believe that those of us working on DSv2 are confident about the current
> stability. We set goals for what to get into the 3.0 release months ago
> and have very nearly reached the point where we are ready for that
> release.
> 
> 
> I don't think instability would be a problem in maintaining compatibility
> between the 2.5 version and the 3.0 version. If we find that we need to
> make API changes (other than additions) then we can make those in the 3.1
> release. Because the goals we set for the 3.0 release have been reached
> with the current API and if we are ready to release 3.0, we can release a
> 2.5 with the same API.
> 
> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> DSv2 is far from stable right? All the actual data types are unstable and
>> you guys have completely ignored that. We'd need to work on that and that
>> will be a breaking change. If the goal is to make DSv2 work across 3.x and
>> 2.x, that seems too invasive of a change to backport once you consider the
>> parts needed to make dsv2 stable.
>> 
>> 
>> 
>> 
>> 
>> 
>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue < rblue@ netflix. com. invalid
>> ( rb...@netflix.com.invalid ) > wrote:
>> 
>>> Hi everyone,
>>> 
>>> 
>>> In the DSv2 sync this week, we talked about a possible Spark 2.5 release
>>> based on the latest Spark 2.4, but with DSv2 and Java 11 support added.
>>> 
>>> 
>>> A Spark 2.5 release with these two additions will help people migrate to
>>> Spark 3.0 when it is released because they will be able to use a single
>>> implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly,
>>> upgrading to 3.0 won't also require also updating to Java 11 because users
>>> could update to Java 11 with the 2.5 release and have fewer major changes.
>>> 
>>> 
>>> 
>>> Another reason to consider a 2.5 release is that many people are
>>> interested in a release with the latest DSv2 API and support for DSv2 SQL.
>>> I'm already going to be backporting DSv2 support to the Spark 2.4 line, so
>>> it makes sense to share this work with the community.
>>> 
>>> 
>>> This release line would just consist of backports like DSv2 and Java 11
>>> that assist compatibility, to keep the scope of the release small. The
>>> purpose is to assist people moving to 3.0 and not distract from the 3.0
>>> release.
>>> 
>>> 
>>> Would a Spark 2.5 release help anyone else? Are there any concerns about
>>> this plan?
>>> 
>>> 
>>> 
>>> 
>>> rb
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> 
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Reynold Xin
DSv2 is far from stable right? All the actual data types are unstable and you 
guys have completely ignored that. We'd need to work on that and that will be a 
breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that 
seems too invasive of a change to backport once you consider the parts needed 
to make dsv2 stable.

On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue < rb...@netflix.com.invalid > wrote:

> 
> Hi everyone,
> 
> 
> In the DSv2 sync this week, we talked about a possible Spark 2.5 release
> based on the latest Spark 2.4, but with DSv2 and Java 11 support added.
> 
> 
> A Spark 2.5 release with these two additions will help people migrate to
> Spark 3.0 when it is released because they will be able to use a single
> implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly,
> upgrading to 3.0 won't also require updating to Java 11 because users
> could update to Java 11 with the 2.5 release and have fewer major changes.
> 
> 
> 
> Another reason to consider a 2.5 release is that many people are
> interested in a release with the latest DSv2 API and support for DSv2 SQL.
> I'm already going to be backporting DSv2 support to the Spark 2.4 line, so
> it makes sense to share this work with the community.
> 
> 
> This release line would just consist of backports like DSv2 and Java 11
> that assist compatibility, to keep the scope of the release small. The
> purpose is to assist people moving to 3.0 and not distract from the 3.0
> release.
> 
> 
> Would a Spark 2.5 release help anyone else? Are there any concerns about
> this plan?
> 
> 
> 
> 
> rb
> 
> 
> 
> 
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Thoughts on Spark 3 release, or a preview release

2019-09-12 Thread Reynold Xin
+1! Long due for a preview release.

On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau < hol...@pigscanfly.ca > wrote:

> 
> I like the idea from the PoV of giving folks something to start testing
> against and exploring so they can raise issues with us earlier in the
> process and we have more time to make calls around this.
> 
> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge < jzh...@apache.org > wrote:
> 
> 
>> +1  Like the idea as a user and a DSv2 contributor.
>> 
>> 
>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim < kabh...@gmail.com > wrote:
>> 
>> 
>>> +1 (as a contributor) from me to have a preview release of Spark 3, as it
>>> would help to test the features. When to cut the preview release is the
>>> open question, as the major work should ideally be done before that - if
>>> the intent is to introduce new features before the official release, that
>>> works regardless of the preview timing, but if the intent is to give an
>>> earlier opportunity to test, the major work should ideally land first.
>>> 
>>> 
>>> As one of the contributors in the structured streaming area, I'd like to add
>>> some items for Spark 3.0, both "must be done" and "better to have". For
>>> "better to have", I picked some new-feature items which committers
>>> reviewed for a couple of rounds and which then stalled without a soft reject (no valid
>>> reason to stop). For Spark 2.4 users, the only feature added for structured
>>> streaming is Kafka delegation token (counting the Kafka
>>> consumer pool revision as an improvement). I hope we can provide some gifts for structured
>>> streaming users in the Spark 3.0 envelope.
>>> 
>>> 
>>> > must be done
>>> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent
>>> output
>>> 
>>> It's a correctness issue with multiple users reported, being reported at
>>> Nov. 2018. There's a way to reproduce it consistently, and we have a patch
>>> submitted at Jan. 2019 to fix it.
>>> 
>>> 
>>> > better to have
>>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>>> * SPARK-26848 Introduce new option to Kafka source - specify timestamp to
>>> start and end offset
>>> * SPARK-20568 Delete files after processing in structured streaming
>>> 
>>> 
>>> There're some more new features/improvements items in SS, but given we're
>>> talking about ramping-down, above list might be realistic one.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin < j...@jgp.net > wrote:
>>> 
>>> 
 As a user/non committer, +1
 
 
 I love the idea of an early 3.0.0 so we can test current dev against it. I
 know the final 3.x will probably need another round of testing when it
 gets out, but less so, for sure... I know I could check out and compile, but
 having a “packaged” preview version is great if it does not take too much of
 the team's time...
 
 jg
 
 
 
 On Sep 11, 2019, at 20:40, Hyukjin Kwon < gurwls...@gmail.com > wrote:
 
 
 
> +1 from me too but I would like to know what other people think too.
> 
> 
> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun < dongjoon.h...@gmail.com > wrote:
> 
> 
>> Thank you, Sean.
>> 
>> 
>> I'm also +1 for the following three.
>> 
>> 
>> 1. Start to ramp down (by the official branch-3.0 cut)
>> 2. Apache Spark 3.0.0-preview in 2019
>> 3. Apache Spark 3.0.0 in early 2020
>> 
>> 
>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps 
>> it
>> a lot.
>> 
>> 
>> After this discussion, can we have some timeline for `Spark 3.0 Release
>> Window` in our versioning-policy page?
>> 
>> 
>> - https://spark.apache.org/versioning-policy.html
>> 
>> 
>> Bests,
>> Dongjoon.
>> 
>> 
>> 
>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer < heue...@gmail.com > wrote:
>> 
>> 
>>> I would love to see Spark + Hadoop + Parquet + Avro compatibility 
>>> problems
>>> resolved, e.g.
>>> 
>>> 
>>> https://issues.apache.org/jira/browse/SPARK-25588
>>> https://issues.apache.org/jira/browse/SPARK-27781
>>> 
>>> 
>>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far 
>>> as
>>> I know, Parquet has not cut a release based on this new version.
>>> 
>>> 
>>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>>> 
>>> 
>>> https://github.com/apache/spark/pull/24851
>>> https://github.com/apache/spark/pull/24297

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-05 Thread Reynold Xin
Having three modes is a lot. Why not just use ansi mode as default, and legacy 
for backward compatibility? Then over time there's only the ANSI mode, which is 
standard compliant and easy to understand. We also don't need to invent a 
standard just for Spark.
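
For illustration, a minimal sketch (spark-shell style) of how the two defaults under discussion would differ on a Spark 3.0 build; the table is hypothetical, and only the config key and policy names are taken from the proposal quoted below:

    // Hypothetical target table: CREATE TABLE t (i INT) USING parquet
    spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
    spark.sql("INSERT INTO t VALUES ('not a number')")  // accepted; the bad value is silently written as NULL

    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
    spark.sql("INSERT INTO t VALUES ('not a number')")  // rejected at analysis time: string -> int is not allowed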

On Thu, Sep 05, 2019 at 12:27 AM, Wenchen Fan < cloud0...@gmail.com > wrote:

> 
> +1
> 
> 
> To be honest I don't like the legacy policy. It's too loose and easy for
> users to make mistakes, especially when Spark returns null if a function
> hits errors like overflow.
> 
> 
> The strict policy is not good either. It's too strict and stops valid use
> cases like writing timestamp values to a date type column. Users do expect
> truncation to happen without adding a cast manually in this case. It's also
> weird to use a Spark-specific policy that no other database is using.
> 
> 
> The ANSI policy is better. It stops invalid use cases like writing string
> values to an int type column, while keeping valid use cases like timestamp
> -> date.
> 
> 
> I think it's no doubt that we should use ANSI policy instead of legacy
> policy for v1 tables. Except for backward compatibility, ANSI policy is
> literally better than the legacy policy.
> 
> 
> The v2 table is arguable here. Although the ANSI policy is better than
> strict policy to me, this is just the store assignment policy, which only
> partially controls the table insertion behavior. With Spark's "return null
> on error" behavior, the table insertion is more likely to insert invalid
> null values with the ANSI policy compared to the strict policy.
> 
> 
> I think we should use ANSI policy by default for both v1 and v2 tables,
> because
> 1. End-users don't care how the table is implemented. Spark should provide
> consistent table insertion behavior between v1 and v2 tables.
> 2. Data Source V2 is unstable in Spark 2.x so there is no backward
> compatibility issue. That said, the baseline to judge which policy is
> better should be the table insertion behavior in Spark 2.x, which is the
> legacy policy + "return null on error". ANSI policy is better than the
> baseline.
> 3. We expect more and more users to migrate their data sources to the V2
> API. The strict policy can be a stopper as it's a too big breaking change,
> which may break many existing queries.
> 
> 
> Thanks,
> Wenchen 
> 
> 
> 
> 
> On Wed, Sep 4, 2019 at 1:59 PM Gengliang Wang < gengliang.w...@databricks.com > wrote:
> 
> 
>> Hi everyone,
>> 
>> I'd like to call for a vote on SPARK-28885 (
>> https://issues.apache.org/jira/browse/SPARK-28885 ) "Follow ANSI store
>> assignment rules in table insertion by default".  
>> When inserting a value into a column with the different data type, Spark
>> performs type coercion. Currently, we support 3 policies for the type
>> coercion rules: ANSI, legacy and strict, which can be set via the option
>> "spark.sql.storeAssignmentPolicy":
>> 1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice,
>> the behavior is mostly the same as PostgreSQL. It disallows certain
>> unreasonable type conversions such as converting `string` to `int` and
>> `double` to `boolean`.
>> 2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`,
>> which is very loose. E.g., converting either `string` to `int` or `double`
>> to `boolean` is allowed. It is the current behavior in Spark 2.x for
>> compatibility with Hive.
>> 3. Strict: Spark doesn't allow any possible precision loss or data
>> truncation in type coercion, e.g., converting either `double` to `int` or
>> `decimal` to `double` is not allowed. The rules were originally for the Dataset
>> encoder. As far as I know, no mainstream DBMS is using this policy by
>> default.
>> 
>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>> uses "Strict". This proposal is to use "ANSI" policy by default for both
>> V1 and V2 in Spark 3.0.
>> 
>> There was also a DISCUSS thread "Follow ANSI SQL on table insertion" in
>> the dev mailing list.
>> 
>> This vote is open until next Thurs (Sept. 12nd).
>> 
>> [ ] +1: Accept the proposal
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ... Thank you! Gengliang
>> 
>> 
> 
>

Re: JDK11 Support in Apache Spark

2019-08-26 Thread Reynold Xin
Exactly - I think it's important to be able to create a single binary build. 
Otherwise downstream users (the 99.99% won't be building their own Spark but 
just pull it from Maven) will have to deal with the mess, and it's even worse 
for libraries.
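
To make the downstream view concrete, a typical library or application build pulls exactly one published artifact and has no JDK-specific flavor to choose from (the version and scope below are illustrative, not a statement about how 3.0 artifacts will be published):

    // build.sbt fragment: downstream builds resolve a single spark-sql artifact from
    // Maven Central; there is no JDK8 vs JDK11 classifier, so whatever is published
    // has to work on both.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0" % "provided"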

On Mon, Aug 26, 2019 at 10:51 AM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Oh, right. If you want to publish something to Maven, it will inherit the
> situation.
> Thank you for feedback. :)
> 
> On Mon, Aug 26, 2019 at 10:37 AM Michael Heuer < heue...@gmail.com > wrote:
> 
> 
>> That is not true for any downstream users who also provide a library. 
>> Whatever build mess you create in Apache Spark, we'll have to inherit it. 
>> ;)
>> 
>> 
>>    michael
>> 
>> 
>> 
>> 
>>> On Aug 26, 2019, at 12:32 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > wrote:
>>> 
>>> As Shane wrote, not yet.
>>> 
>>> 
>>> `one build for works for both` is our aspiration and the next step
>>> mentioned in the first email.
>>> 
>>> 
>>> 
>>> > The next step is `how to support JDK8/JDK11 together in a single
>>> artifact`.
>>> 
>>> 
>>> For the downstream users who build from the Apache Spark source, that will
>>> not be a blocker because they will prefer a single JDK.
>>> 
>>> 
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> On Mon, Aug 26, 2019 at 10:28 AM Shane Knapp < skn...@berkeley.edu > wrote:
>>> 
>>> 
>>>> maybe in the future, but not right now as the hadoop 2.7 build is broken.
>>>> 
>>>> 
>>>> also, i busted dev/run-tests.py in my changes
>>>> to support java11 in PRBs:
>>>> https://github.com/apache/spark/pull/25585
>>>> 
>>>> 
>>>> 
>>>> quick fix, testing now.
>>>> 
>>>> On Mon, Aug 26, 2019 at 10:23 AM Reynold Xin < r...@databricks.com > wrote:
>>>> 
>>>> 
>>>>> Would it be possible to have one build that works for both?
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: JDK11 Support in Apache Spark

2019-08-26 Thread Reynold Xin
Would it be possible to have one build that works for both?

On Mon, Aug 26, 2019 at 10:22 AM Dongjoon Hyun 
wrote:

> Thank you all!
>
> Let me add more explanation on the current status.
>
> - If you want to run on JDK8, you need to build on JDK8
> - If you want to run on JDK11, you need to build on JDK11.
>
> The other combinations will not work.
>
> Currently, we have two Jenkins jobs. (1) is the one I pointed, and (2) is
> the one for the remaining community work.
>
> 1) Build and test on JDK11 (spark-master-test-maven-hadoop-3.2-jdk-11)
> 2) Build on JDK8 and test on JDK11
> (spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing)
>
> To keep JDK11 compatibility, the following is merged today.
>
> [SPARK-28701][TEST-HADOOP3.2][TEST-JAVA11][K8S] adding java11
> support for pull request builds
>
> But we still have many things to do for Jenkins/Release and we need your
> support with JDK11. :)
>
> Bests,
> Dongjoon.
>
>
> On Sun, Aug 25, 2019 at 10:30 PM Takeshi Yamamuro 
> wrote:
>
>> Cool, congrats!
>>
>> Bests,
>> Takeshi
>>
>> On Mon, Aug 26, 2019 at 1:01 PM Hichame El Khalfi 
>> wrote:
>>
>>> That's Awesome !!!
>>>
>>> Thanks to everyone that made this possible :cheers:
>>>
>>> Hichame
>>>
>>> *From:* cloud0...@gmail.com
>>> *Sent:* August 25, 2019 10:43 PM
>>> *To:* lix...@databricks.com
>>> *Cc:* felixcheun...@hotmail.com; ravishankar.n...@gmail.com;
>>> dongjoon.h...@gmail.com; dev@spark.apache.org; u...@spark.apache.org
>>> *Subject:* Re: JDK11 Support in Apache Spark
>>>
>>> Great work!
>>>
>>> On Sun, Aug 25, 2019 at 6:03 AM Xiao Li  wrote:
>>>
 Thank you for your contributions! This is a great feature for Spark
 3.0! We finally achieve it!

 Xiao

 On Sat, Aug 24, 2019 at 12:18 PM Felix Cheung <
 felixcheun...@hotmail.com> wrote:

> That’s great!
>
> --
> *From:* ☼ R Nair 
> *Sent:* Saturday, August 24, 2019 10:57:31 AM
> *To:* Dongjoon Hyun 
> *Cc:* dev@spark.apache.org ; user @spark/'user
> @spark'/spark users/user@spark 
> *Subject:* Re: JDK11 Support in Apache Spark
>
> Finally!!! Congrats
>
> On Sat, Aug 24, 2019, 11:11 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Thanks to your many many contributions,
>> Apache Spark master branch starts to pass on JDK11 as of today.
>> (with `hadoop-3.2` profile: Apache Hadoop 3.2 and Hive 2.3.6)
>>
>>
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/
>> (JDK11 is used for building and testing.)
>>
>> We already verified all UTs (including PySpark/SparkR) before.
>>
>> Please feel free to use JDK11 in order to build/test/run `master`
>> branch and
>> share your experience including any issues. It will help Apache Spark
>> 3.0.0 release.
>>
>> For the follow-ups, please follow
>> https://issues.apache.org/jira/browse/SPARK-24417 .
>> The next step is `how to support JDK8/JDK11 together in a single
>> artifact`.
>>
>> Bests,
>> Dongjoon.
>>
>

 --

>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-31 Thread Reynold Xin
OK to push back: "disagreeing with the premise that we can afford to not be 
maximal on standard 3. The correctness of the data is non-negotiable, and 
whatever solution we settle on cannot silently adjust the user’s data under any 
circumstances."

This blanket statement sounds great on the surface, but there are a lot of 
subtleties. "Correctness" is absolutely important, but engineering/prod 
development are often about tradeoffs, and the industry has consistently traded 
correctness for performance or convenience, e.g. overflow checks, null 
pointers, consistency in databases ...

It all depends on the use cases and what each use case can tolerate. For 
example, while I want my data engineering production pipeline to throw any 
error when the data doesn't match my expectations (e.g. type widening, 
overflow), if I'm doing some quick and dirty data science, I don't want the job 
to just fail due to outliers.
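
To make the tradeoff concrete, here are two behaviors as generally observed in a Spark 2.4 spark-shell (a sketch for illustration, not output captured from this thread):

    // Invalid cast: the query succeeds and silently produces NULL instead of failing.
    spark.sql("SELECT CAST('abc' AS INT)").show()   // prints null

    // Integer overflow: the result silently wraps (Java int semantics) instead of failing.
    spark.sql("SELECT 2147483647 + 1").show()       // prints -2147483648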

On Wed, Jul 31, 2019 at 10:13 AM, Matt Cheah < mch...@palantir.com > wrote:

> 
> 
> 
> Sorry I meant the current behavior for V2, which fails the query
> compilation if the cast is not safe.
> 
> 
> 
>  
> 
> 
> 
> Agreed that a separate discussion about overflow might be warranted. I’m
> surprised we don’t throw an error now, but it might be warranted to do so.
> 
> 
> 
> 
>  
> 
> 
> 
> -Matt Cheah
> 
> 
> 
>  
> 
> 
> 
> *From:* Reynold Xin < r...@databricks.com >
> *Date:* Wednesday, July 31, 2019 at 9:58 AM
> *To:* Matt Cheah < mch...@palantir.com >
> *Cc:* Russell Spitzer < russell.spit...@gmail.com >, Takeshi Yamamuro < 
> linguin@gmail.com
> >, Gengliang Wang < gengliang.w...@databricks.com >, Ryan Blue < 
> >rb...@netflix.com
> >, Spark dev list < dev@spark.apache.org >, Hyukjin Kwon < gurwls...@gmail.com
> >, Wenchen Fan < cloud0...@gmail.com >
> *Subject:* Re: [Discuss] Follow ANSI SQL on table insertion
> 
> 
> 
> 
>  
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Matt what do you mean by maximizing 3, while allowing not throwing errors
> when any operations overflow? Those two seem contradicting.
> 
> 
> 
> 
>  
> 
> 
> 
> 
>  
> 
> 
> 
> On Wed, Jul 31, 2019 at 9:55 AM, Matt Cheah < mch...@palantir.com > wrote:
> 
> 
> 
>> 
>> 
>> I’m -1, simply from disagreeing with the premise that we can afford to not
>> be maximal on standard 3. The correctness of the data is non-negotiable,
>> and whatever solution we settle on cannot silently adjust the user’s data
>> under any circumstances.
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> I think the existing behavior is fine, or perhaps the behavior can be
>> flagged by the destination writer at write time.
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> -Matt Cheah
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> *From:* Hyukjin Kwon < gurwls...@gmail.com >
>> *Date:* Monday, July 29, 2019 at 11:33 PM
>> *To:* Wenchen Fan < cloud0...@gmail.com >
>> *Cc:* Russell Spitzer < russell.spit...@gmail.com >, Takeshi Yamamuro < 
>> linguin@gmail.com
>> >, Gengliang Wang < gengliang.w...@databricks.com >, Ryan Blue < 
>> >rb...@netflix.com
>> >, Spark dev list < dev@spark.apache.org >
>> *Subject:* Re: [Discuss] Follow ANSI SQL on table insertion
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
> From my look, +1 on the proposal, considering ANSI and other DBMSes in
>> general.
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> On Tue, Jul 30, 2019 at 3:21 PM, Wenchen Fan < cloud0...@gmail.com > wrote:
>> 
>> 
>> 
>> 
>>> 
>>> 
>>> We can add a config for a certain behavior if it makes sense, but the most
>>> important thing we want to reach an agreement here is: what should be the
>>> default behavior?
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> 
>>> Let's explore the solution space of table insertion behavior first:
>>> 
>>> 
>>> 
>>> 
>>> At compile time,
>>> 
>>> 
>>> 
>>> 
>>> 1. always add cast
>>> 
>>> 
>>> 
>>> 
>>> 2. add cast following the ANSI SQL store assignment rule (e.g. string to
>>> int is forbidden but long to int is allowed)
>>> 
>>> 
>>> 
>>> 
>>> 3. only add cast if it's 100% safe

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-31 Thread Reynold Xin
Matt what do you mean by maximizing 3, while allowing not throwing errors when 
any operations overflow? Those two seem contradicting.

On Wed, Jul 31, 2019 at 9:55 AM, Matt Cheah < mch...@palantir.com > wrote:

> 
> 
> 
> I’m -1, simply from disagreeing with the premise that we can afford to not
> be maximal on standard 3. The correctness of the data is non-negotiable,
> and whatever solution we settle on cannot silently adjust the user’s data
> under any circumstances.
> 
> 
> 
>  
> 
> 
> 
> I think the existing behavior is fine, or perhaps the behavior can be
> flagged by the destination writer at write time.
> 
> 
> 
>  
> 
> 
> 
> -Matt Cheah
> 
> 
> 
>  
> 
> 
> 
> *From:* Hyukjin Kwon < gurwls...@gmail.com >
> *Date:* Monday, July 29, 2019 at 11:33 PM
> *To:* Wenchen Fan < cloud0...@gmail.com >
> *Cc:* Russell Spitzer < russell.spit...@gmail.com >, Takeshi Yamamuro < 
> linguin@gmail.com
> >, Gengliang Wang < gengliang.w...@databricks.com >, Ryan Blue < 
> >rb...@netflix.com
> >, Spark dev list < dev@spark.apache.org >
> *Subject:* Re: [Discuss] Follow ANSI SQL on table insertion
> 
> 
> 
> 
>  
> 
> 
> 
> 
> From my look, +1 on the proposal, considering ANSI and other DBMSes in
> general.
> 
> 
> 
> 
>  
> 
> 
> 
> On Tue, Jul 30, 2019 at 3:21 PM, Wenchen Fan < cloud0...@gmail.com > wrote:
> 
> 
> 
> 
>> 
>> 
>> We can add a config for a certain behavior if it makes sense, but the most
>> important thing we want to reach an agreement here is: what should be the
>> default behavior?
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> Let's explore the solution space of table insertion behavior first:
>> 
>> 
>> 
>> 
>> At compile time,
>> 
>> 
>> 
>> 
>> 1. always add cast
>> 
>> 
>> 
>> 
>> 2. add cast following the ANSI SQL store assignment rule (e.g. string to
>> int is forbidden but long to int is allowed)
>> 
>> 
>> 
>> 
>> 3. only add cast if it's 100% safe
>> 
>> 
>> 
>> 
>> At runtime,
>> 
>> 
>> 
>> 
>> 1. return null for invalid operations
>> 
>> 
>> 
>> 
>> 2. throw exceptions at runtime for invalid operations
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> The standards to evaluate a solution:
>> 
>> 
>> 
>> 
>> 1. How robust the query execution is. For example, users usually don't
>> want to see the query fail midway.
>> 
>> 
>> 
>> 
>> 2. How tolerant it is to user queries. For example, a user would like to write
>> long values to an int column as he knows all the long values won't exceed
>> int range.
>> 
>> 
>> 
>> 
>> 3. How clean the result is. For example, users usually don't want to see
>> silently corrupted data (null values).
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> The current Spark behavior for Data Source V1 tables: always add cast and
>> return null for invalid operations. This maximizes standard 1 and 2, but
>> the result is least clean and users are very likely to see silently
>> corrupted data (null values).
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> The current Spark behavior for Data Source V2 tables (new in Spark 3.0):
>> only add cast if it's 100% safe. This maximizes standard 1 and 3, but many
>> queries may fail to compile, even if these queries can run on other SQL
>> systems. Note that people can still see silently corrupted data because
>> cast is not the only operation that can return corrupted data. Simple operations
>> like ADD can also return corrupted data if overflow happens. e.g. INSERT
>> INTO t1 (intCol) SELECT anotherIntCol + 100 FROM t2 
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> The proposal here: add cast following ANSI SQL store assignment rule, and
>> return null for invalid operations. This maximizes standard 1, and also
>> fits standard 2 well: if a query can't compile in Spark, it usually can't
>> compile in other mainstream databases as well. I think that's tolerant
>> enough. For standard 3, this proposal doesn't maximize it but can avoid
>> many invalid operations already.
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> Technically we can't make the result 100% clean at compile time; we have
>> to handle things like overflow at runtime. I think the new proposal makes
>> more sense as the default behavior.
>> 
>> 
>> 
>> 
>>   
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer < russell.spit...@gmail.com
>> > wrote:
>> 
>> 
>> 
>>> 
>>> 
>>> I understand Spark is making the decisions; I'm saying the actual final
>>> effect of the null decision would be different depending on the insertion
>>> target if the target has different behaviors for null.
>>> 
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan < cloud0...@gmail.com > wrote:
>>> 
>>> 
>>> 
>>> 
 
 
 > I'm a big -1 on null values for invalid casts.
 
 
 
  
 
 
 
 
 This is why we want to introduce the ANSI mode, so that invalid cast fails
 at runtime. But we have to keep the null behavior for a while, to keep
 backward compatibility. Spark returns null for invalid cast since the
 first day of Spark SQL, we 

Re: [DISCUSS] New sections in Github Pull Request description template

2019-07-23 Thread Reynold Xin
I like the spirit, but not sure about the exact proposal. Take a look at
k8s':
https://raw.githubusercontent.com/kubernetes/kubernetes/master/.github/PULL_REQUEST_TEMPLATE.md



On Tue, Jul 23, 2019 at 8:27 PM, Hyukjin Kwon  wrote:

> (Plus, it helps to track history too. Spark's commit logs are growing and
> now it's pretty difficult to track the history and see what change
> introduced a specific behaviour)
>
> On Wed, Jul 24, 2019 at 12:20 PM, Hyukjin Kwon wrote:
>
> Hi all,
>
> I would like to discuss some new sections under "## What changes
> were proposed in this pull request?":
>
> ### Do the changes affect _any_ user/dev-facing input or output?
>
> (Please answer yes or no. If yes, answer the questions below)
>
> ### What was the previous behavior?
>
> (Please provide the console output, description and/or reproducer about the 
> previous behavior)
>
> ### What is the behavior the changes propose?
>
> (Please provide the console output, description and/or reproducer about the 
> behavior the changes propose)
>
> See
> https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>  .
>
> From my experience so far in the Spark community, and judging from the
> interaction with other
> committers and contributors, it is pretty critical to know before/after
> behaviour changes even if it
> was a bug.
>
> The new sections will make the review process much easier, and we'll be able to
> quickly judge how serious the changes are.
> Given that the Spark community still suffers from open PRs just queueing up
> without review, I think this can help
> both reviewers and PR authors.
>
> I do describe them often when I think it's useful and possible.
> For instance see https://github.com/apache/spark/pull/24927 - I am sure
> you guys have a clear idea of what the
> PR fixes.
>
> I cc'ed some guys I can currently think of for now FYI. Please let me know
> if you guys have any thought on this!
>
>


Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-12 Thread Reynold Xin
No sorry I'm not at liberty to share other people's code.

On Fri, Jul 12, 2019 at 9:33 AM, Gourav Sengupta < gourav.sengu...@gmail.com > 
wrote:

> 
> Hi Reynold,
> 
> 
> I am genuinely curious about queries which are more than 1 MB and am
> stunned by tens of MB's. Any samples to share :) 
> 
> 
> Regards,
> Gourav
> 
> On Thu, Jul 11, 2019 at 5:03 PM Reynold Xin < r...@databricks.com > wrote:
> 
> 
>> There is no explicit limit but a JVM string cannot be bigger than 2G. It
>> will also at some point run out of memory with too big of a query plan
>> tree or become incredibly slow due to query planning complexity. I've seen
>> queries that are tens of MBs in size.
>> 
>> 
>> 
>> 
>> 
>> 
>> On Thu, Jul 11, 2019 at 5:01 AM, 李书明 < alemmont...@126.com > wrote:
>> 
>>> I have a question about the maximum length of SQL statement that is
>>> supported in Spark SQL. I can't find the answer in the Spark documentation.
>>> 
>>> 
>>> Maybe Integer.MAX_VALUE, or not?
>>> 
>> 
>> 
> 
>

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-11 Thread Reynold Xin
There is no explicit limit but a JVM string cannot be bigger than 2G. It will 
also at some point run out of memory with too big of a query plan tree or 
become incredibly slow due to query planning complexity. I've seen queries that 
are tens of MBs in size.
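
A quick way to see the planning-cost point, as a sketch (the generated query is synthetic; assume a running spark-shell):

    // Build one large SQL string out of many UNION ALL branches. The practical ceiling
    // is not the string itself (a JVM String tops out around 2G characters) but the
    // planning time and memory, which grow with the size of the parsed tree.
    val branches = 5000  // grow this to push the SQL text toward the MB range
    val bigSql = (1 to branches).map(i => s"SELECT $i AS id").mkString(" UNION ALL ")
    println(s"query length: ${bigSql.length} characters")
    spark.sql(bigSql).count()  // expect planning to slow down long before any hard limit is hit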

On Thu, Jul 11, 2019 at 5:01 AM, 李书明 < alemmont...@126.com > wrote:

> 
> I have a question about the maximum length of SQL statement that is
> supported in Spark SQL. I can't find the answer in the Spark documentation.
> 
> 
> Maybe Integer.MAX_VALUE, or not?
> 
> 
> 
>

Revisiting Python / pandas UDF

2019-07-05 Thread Reynold Xin
Hi all,

In the past two years, the pandas UDFs are perhaps the most important changes 
to Spark for Python data science. However, these functionalities have evolved 
organically, leading to some inconsistencies and confusions among users. I 
created a ticket and a document summarizing the issues, and a concrete proposal 
to fix them (the changes are pretty small). Thanks Xiangrui for initially 
bringing this to my attention, and Li Jin, Hyukjin, for offline discussions.

Please take a look: 

https://issues.apache.org/jira/browse/SPARK-28264

https://docs.google.com/document/u/1/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit

Re: Disabling `Merge Commits` from GitHub Merge Button

2019-07-01 Thread Reynold Xin
That's a good idea. We should only be using squash.

On Mon, Jul 01, 2019 at 1:52 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Hi, Apache Spark PMC members and committers.
> 
> 
> We are using GitHub `Merge Button` in `spark-website` repository
> because it's very convenient.
> 
> 
>     1. https://github.com/apache/spark-website/commits/asf-site
> 
>     2. https://github.com/apache/spark/commits/master
> 
> 
> In order to be consistent with our previous behavior,
> can we disable `Allow Merge Commits` from GitHub `Merge Button` setting
> explicitly?
> 
> 
> I hope we can enforce it in both `spark-website` and `spark` repository
> consistently.
> 
> 
> Bests,
> Dongjoon.
>

Re: Exposing JIRA issue types at GitHub PRs

2019-06-13 Thread Reynold Xin
Seems like a good idea. Can we test this with a component first?

On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Since we use both Apache JIRA and GitHub actively for Apache Spark
> contributions, we have lots of JIRAs and PRs consequently. One specific
> thing I've been longing to see is `Jira Issue Type` in GitHub.
>
> How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`?
> There are two main benefits:
> 1. It helps the communication between the contributors and reviewers with
> more information.
> (In some cases, some people only visit GitHub to see the PR and
> commits)
> 2. `Labels` is searchable. We don't need to visit Apache Jira to search
> PRs to see a specific type.
> (For example, the reviewers can see and review 'BUG' PRs first by
> using `is:open is:pr label:BUG`.)
>
> Of course, this can be done automatically without human intervention.
> Since we already have GitHub Jenkins job to access JIRA/GitHub, that job
> can add the labels from the beginning. If needed, I can volunteer to update
> the script.
>
> To show the demo, I labeled several PRs manually. You can see the result
> right now in Apache Spark PR page.
>
>   - https://github.com/apache/spark/pulls
>
> If you're surprised due to those manual activities, I want to apologize
> for that. I hope we can take advantage of the existing GitHub features to
> serve Apache Spark community in a way better than yesterday.
>
> What do you think about this specific suggestion?
>
> Bests,
> Dongjoon
>
> PS. I saw that `Request Review` and `Assign` features are already used for
> some purposes, but these feature are out of the scope in this email.
>


Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Reynold Xin
+1 on Xiangrui’s plan.

On Thu, May 30, 2019 at 7:55 AM shane knapp  wrote:

> I don't have a good sense of the overhead of continuing to support
>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>
>> from the build/test side, it will actually be pretty easy to continue
> support for python2.7 for spark 2.x as the feature sets won't be expanding.
>
> that being said, i will be cracking a bottle of champagne when i can
> delete all of the ansible and anaconda configs for python2.x.  :)
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [RESULT][VOTE] SPIP: Public APIs for extended Columnar Processing Support

2019-05-29 Thread Reynold Xin
Thanks Tom.

I finally had time to look at the updated SPIP 10 mins ago. I support the high 
level idea and +1 on the SPIP.

That said, I think the proposed API is too complicated and too invasive a change 
to the existing internals. A much simpler API would be to expose a columnar batch 
iterator interface, i.e. an uber column-oriented UDF with the ability to manage its 
life cycle. Once we have that, we can also refactor the existing Python UDFs to 
use that interface.

As I said earlier (a couple of months ago when this was first surfaced?), I support 
the idea of enabling *external* column-oriented processing logic, but not 
changing Spark itself to have two processing modes, which is simply very 
complicated and would create a very high maintenance burden for the project.
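
For context, one possible shape of such an interface, purely as a sketch (the trait and method names are hypothetical, not an existing Spark API; ColumnarBatch is the existing org.apache.spark.sql.vectorized.ColumnarBatch):

    import org.apache.spark.sql.vectorized.ColumnarBatch

    // Hypothetical user-facing hook: consume a stream of columnar batches, produce a
    // stream of columnar batches, and manage the life cycle of any off-heap resources.
    trait ColumnarBatchTransformer extends AutoCloseable {
      def open(): Unit                                                        // acquire GPU / off-heap resources
      def transform(batches: Iterator[ColumnarBatch]): Iterator[ColumnarBatch]
      def close(): Unit                                                       // release resources when the task finishes
    }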

On Wed, May 29, 2019 at 9:49 PM, Thomas graves < tgra...@apache.org > wrote:

> 
> 
> 
> Hi all,
> 
> 
> 
> The vote passed with 9 +1's (4 binding) and 1 +0 and no -1's.
> 
> 
> 
> +1s (* = binding) :
> Bobby Evans*
> Thomas Graves*
> DB Tsai*
> Felix Cheung*
> Bryan Cutler
> Kazuaki Ishizaki
> Tyson Condie
> Dongjoon Hyun
> Jason Lowe
> 
> 
> 
> +0s:
> Xiangrui Meng
> 
> 
> 
> Thanks,
> Tom Graves
> 
> 
> 
> 
> 
>

Re: Interesting implications of supporting Scala 2.13

2019-05-11 Thread Reynold Xin
If the number of changes that would require two source trees are small, another 
thing we can do is to reach out to the Scala team and kindly ask them whether 
they could change Scala 2.13 itself so it'd be easier to maintain compatibility 
with Scala 2.12.

On Sat, May 11, 2019 at 4:25 PM, Sean Owen < sro...@gmail.com > wrote:

> 
> 
> 
> For those interested, here's the first significant problem I see that will
> require separate source trees or a breaking change:
> https://issues.apache.org/jira/browse/SPARK-27683?focusedCommentId=16837967=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16837967
> 
> 
> 
> Interested in thoughts on how to proceed on something like this, as there
> will probably be a few more similar issues.
> 
> 
> 
> On Fri, May 10, 2019 at 3:32 PM Reynold Xin < r...@databricks.com > wrote:
> 
> 
>> 
>> 
>> Yea my main point is when we do support 2.13, it'd be great if we don't
>> have to break APIs. That's why doing the prep work in 3.0 would be great.
>> 
>> 
>> 
>> On Fri, May 10, 2019 at 1:30 PM, Imran Rashid < iras...@cloudera.com > wrote:
>> 
>> 
>>> 
>>> 
>>> +1 on making whatever api changes we can now for 3.0.
>>> 
>>> 
>>> 
>>> I don't think that is making any commitments to supporting scala 2.13 in
>>> any specific version. We'll have to deal with all the other points you
>>> raised when we do cross that bridge, but hopefully those are things we can
>>> cover in a minor release.
>>> 
>>> 
>>> 
>>> On Fri, May 10, 2019 at 2:31 PM Sean Owen < sro...@gmail.com > wrote:
>>> 
>>> 
>>>> 
>>>> 
>>>> I really hope we don't have to have separate source trees for some files,
>>>> but yeah it's an option too. OK, will start looking into changes we can
>>>> make now that don't break things now, and deprecations we need to make now
>>>> proactively.
>>>> 
>>>> 
>>>> 
>>>> I should also say that supporting Scala 2.13 will mean dependencies have
>>>> to support Scala 2.13, and that could take a while, because there are a
>>>> lot. In particular, I think we'll find our SBT 0.13 build won't make it,
>>>> perhaps just because of the plugins it needs. I tried updating to SBT 1.x
>>>> and it seemed to need quite a lot of rewrite, again in part due to how
>>>> newer plugin versions changed. I failed and gave up.
>>>> 
>>>> 
>>>> 
>>>> At some point maybe we figure out whether we can remove the SBT-based
>>>> build if it's super painful, but only if there's not much other choice.
>>>> That is for a future thread.
>>>> 
>>>> 
>>>> 
>>>> On Fri, May 10, 2019 at 1:51 PM Reynold Xin < r...@databricks.com > wrote:
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> Looks like a great idea to make changes in Spark 3.0 to prepare for Scala
>>>>> 2.13 upgrade.
>>>>> 
>>>>> 
>>>>> 
>>>>> Are there breaking changes that would require us to have two different
>>>>> source trees for 2.12 vs 2.13?
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, May 10, 2019 at 11:41 AM, Sean Owen < sro...@gmail.com > wrote:
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> While that's not happening soon (2.13 isn't out), note that some of the
>>>>>> changes to collections will be fairly breaking changes.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> https://issues.apache.org/jira/browse/SPARK-25075
>>>>>> https://docs.scala-lang.org/overviews/core/collections-migration-213.html
>>>>>> 
>>>>>> 
&

Re: Interesting implications of supporting Scala 2.13

2019-05-10 Thread Reynold Xin
Yea my main point is when we do support 2.13, it'd be great if we don't have to 
break APIs. That's why doing the prep work in 3.0 would be great.
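
To make "prep work" concrete, this is the general deprecate-and-overload pattern, shown on a standalone class rather than on SparkConf itself (purely a sketch; the eventual Spark change may look different):

    // Illustrative only: same shape as SparkConf.setAll(Traversable), which Sean mentions
    // below. Deprecate the Traversable overload in 3.0 and steer callers to Iterable,
    // which survives the 2.13 collections rework.
    class Settings {
      private val entries = scala.collection.mutable.Map.empty[String, String]

      @deprecated("Use setAll(Iterable) instead", "3.0.0")
      def setAll(settings: Traversable[(String, String)]): Settings =
        setAll(settings.toIterable)

      def setAll(settings: Iterable[(String, String)]): Settings = {
        entries ++= settings
        this
      }
    }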

On Fri, May 10, 2019 at 1:30 PM, Imran Rashid < iras...@cloudera.com > wrote:

> 
> +1 on making whatever api changes we can now for 3.0.
> 
> 
> I don't think that is making any commitments to supporting scala 2.13 in
> any specific version.  We'll have to deal with all the other points you
> raised when we do cross that bridge, but hopefully those are things we can
> cover in a minor release.
> 
> On Fri, May 10, 2019 at 2:31 PM Sean Owen < sro...@gmail.com > wrote:
> 
> 
>> I really hope we don't have to have separate source trees for some files,
>> but yeah it's an option too. OK, will start looking into changes we can
>> make now that don't break things now, and deprecations we need to make now
>> proactively.
>> 
>> 
>> I should also say that supporting Scala 2.13 will mean dependencies have
>> to support Scala 2.13, and that could take a while, because there are a
>> lot.
>> In particular, I think we'll find our SBT 0.13 build won't make it,
>> perhaps just because of the plugins it needs. I tried updating to SBT 1.x
>> and it seemed to need quite a lot of rewrite, again in part due to how
>> newer plugin versions changed. I failed and gave up.
>> 
>> 
>> At some point maybe we figure out whether we can remove the SBT-based
>> build if it's super painful, but only if there's not much other choice.
>> That is for a future thread.
>> 
>> 
>> 
>> On Fri, May 10, 2019 at 1:51 PM Reynold Xin < r...@databricks.com > wrote:
>> 
>> 
>>> Looks like a great idea to make changes in Spark 3.0 to prepare for Scala
>>> 2.13 upgrade.
>>> 
>>> 
>>> 
>>> Are there breaking changes that would require us to have two different
>>> source trees for 2.12 vs 2.13?
>>> 
>>> 
>>> 
>>> On Fri, May 10, 2019 at 11:41 AM, Sean Owen < sro...@gmail.com > wrote:
>>> 
>>>> 
>>>> 
>>>> While that's not happening soon (2.13 isn't out), note that some of the
>>>> changes to collections will be fairly breaking changes.
>>>> 
>>>> 
>>>> 
>>>> https://issues.apache.org/jira/browse/SPARK-25075
>>>> https://docs.scala-lang.org/overviews/core/collections-migration-213.html
>>>> 
>>>> 
>>>> 
>>>> Some of this may impact a public API, so may need to start proactively
>>>> fixing stuff for 2.13 before 3.0 comes out where possible.
>>>> 
>>>> 
>>>> 
>>>> Here's an example: Traversable goes away. We have a method
>>>> SparkConf.setAll(Traversable). We can't support 2.13 while that still
>>>> exists. Of course, we can decide to deprecate it with replacement (use
>>>> Iterable) and remove it in the version that supports 2.13. But that would
>>>> mean a little breaking change, and we either have to accept that for a
>>>> future 3.x release, or it waits until 4.x.
>>>> 
>>>> 
>>>> 
>>>> I wanted to put that on the radar now to gather opinions about whether
>>>> this will probably be acceptable, or whether we really need to get methods
>>>> like that changed before 3.0.
>>>> 
>>>> 
>>>> 
>>>> Also: there's plenty of straightforward but medium-sized changes we can
>>>> make now in anticipation of 2.13 support, like, make the type of Seq we
>>>> use everywhere explicit (will be good for a like 1000 file change I'm
>>>> sure). Or see if we can swap out Traversable everywhere. Remove
>>>> MutableList, etc.
>>>> 
>>>> 
>>>> 
>>>> I was going to start fiddling with that unless it just sounds too
>>>> disruptive.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: Interesting implications of supporting Scala 2.13

2019-05-10 Thread Reynold Xin
Looks like a great idea to make changes in Spark 3.0 to prepare for Scala 2.13 
upgrade.

Are there breaking changes that would require us to have two different source 
trees for 2.12 vs 2.13?

On Fri, May 10, 2019 at 11:41 AM, Sean Owen < sro...@gmail.com > wrote:

> 
> 
> 
> While that's not happening soon (2.13 isn't out), note that some of the
> changes to collections will be fairly breaking changes.
> 
> 
> 
> https://issues.apache.org/jira/browse/SPARK-25075
> https://docs.scala-lang.org/overviews/core/collections-migration-213.html
> 
> 
> 
> Some of this may impact a public API, so may need to start proactively
> fixing stuff for 2.13 before 3.0 comes out where possible.
> 
> 
> 
> Here's an example: Traversable goes away. We have a method
> SparkConf.setAll(Traversable). We can't support 2.13 while that still
> exists. Of course, we can decide to deprecate it with replacement (use
> Iterable) and remove it in the version that supports 2.13. But that would
> mean a little breaking change, and we either have to accept that for a
> future 3.x release, or it waits until 4.x.
> 
> 
> 
> I wanted to put that on the radar now to gather opinions about whether
> this will probably be acceptable, or whether we really need to get methods
> like that changed before 3.0.
> 
> 
> 
> Also: there's plenty of straightforward but medium-sized changes we can
> make now in anticipation of 2.13 support, like, make the type of Seq we
> use everywhere explicit (will be good for a like 1000 file change I'm
> sure). Or see if we can swap out Traversable everywhere. Remove
> MutableList, etc.
> 
> 
> 
> I was going to start fiddling with that unless it just sounds too
> disruptive.
> 
> 
> 
> 
> 
>

Re: [VOTE] Release Apache Spark 2.4.2

2019-04-26 Thread Reynold Xin
I do feel it'd be better to not switch default Scala versions in a minor 
release. I don't know how much downstream this impacts. Dotnet is a good data 
point. Anybody else hit this issue?

On Thu, Apr 25, 2019 at 11:36 PM, Terry Kim < yumin...@gmail.com > wrote:

> 
> 
> 
> Very much interested in hearing what you folks decide. We currently have a
> couple asking us questions at https://github.com/dotnet/spark/issues .
> 
> 
> 
> Thanks,
> Terry
> 
> 
> 
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
> 
> 
> 
> 
> 
>

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Reynold Xin
"if others think it would be helpful, we can cancel this vote, update the SPIP 
to clarify exactly what I am proposing, and then restart the vote after we have 
gotten more agreement on what APIs should be exposed"

That'd be very useful. At least I was confused by what the SPIP was about. No 
point voting on something when there is still a lot of confusion about what it 
is.

On Mon, Apr 22, 2019 at 10:58 AM, Bobby Evans < reva...@gmail.com > wrote:

> 
> 
> 
> Xiangrui Meng,
> 
> 
> 
> I provided some examples in the original discussion thread.
> 
> 
> 
> https://lists.apache.org/thread.html/f7cdc2cbfb1dafa001422031ff6a3a6dc7b51efc175327b0bbfe620e@%3Cdev.spark.apache.org%3E
> 
> 
> 
> But the concrete use case that we have is GPU accelerated ETL on Spark.
> Primarily as data preparation and feature engineering for ML tools like
> XGBoost, which by the way exposes a Spark specific scala API, not just a
> python one. We built a proof of concept and saw decent performance gains.
> Enough gains to more than pay for the added cost of a GPU, with the
> potential for even better performance in the future. With that proof of
> concept, we were able to make all of the processing columnar end-to-end
> for many queries, so there really weren't any data conversion costs to
> overcome, but we did want the design flexible enough to include a
> cost-based optimizer.
> 
> 
> 
> It looks like there is some confusion around this SPIP especially in how
> it relates to features in other SPIPs around data exchange between
> different systems. I didn't want to update the text of this SPIP while it
> was under an active vote, but if others think it would be helpful, we can
> cancel this vote, update the SPIP to clarify exactly what I am proposing,
> and then restart the vote after we have gotten more agreement on what APIs
> should be exposed.
> 
> 
> 
> Thanks,
> 
> 
> 
> Bobby
> 
> 
> 
> On Mon, Apr 22, 2019 at 10:49 AM Xiangrui Meng < men...@gmail.com > wrote:
> 
> 
>> 
>> 
>> Per Robert's comment on the JIRA, ETL is the main use case for the SPIP. I
>> think the SPIP should list a concrete ETL use case (from POC?) that can
>> benefit from this *public Java/Scala API*, does *vectorization*, and
>> significantly *boosts the performance* even with data conversion overhead.
>> 
>> 
>> 
>> 
>> The current mid-term success (Pandas UDF) doesn't match the purpose of
>> SPIP and it can be done without exposing any public APIs.
>> 
>> 
>> 
>> Depending on how much benefit it brings, we might agree that a public
>> Java/Scala API is needed. Then we might want to step slightly into how. I
>> saw four options mentioned in the JIRA and discussion threads:
>> 
>> 
>> 
>> 1. Expose `Array[Byte]` in Arrow format. Let the user decode it using an Arrow
>> library.
>> 2. Expose `ArrowRecordBatch`. It makes Spark expose third-party APIs.
>> 3. Expose `ColumnarBatch` and make it Arrow-compatible, which is also used
>> by Spark internals. It makes it hard for us to change Spark internals in the
>> future.
>> 4. Expose something like `SparkRecordBatch` that is Arrow-compatible and
>> maintain conversion between internal `ColumnarBatch` and
>> `SparkRecordBatch`. It might cause conversion overhead in the future if
>> our internal format becomes different from Arrow.
>> 
>> 
>> 
>> Note that both 3 and 4 will make many APIs public to be Arrow compatible.
>> So we should really give concrete ETL cases to prove that it is important
>> for us to do so.
>> 
>> 
>> 
>> On Mon, Apr 22, 2019 at 8:27 AM Tom Graves < tgraves...@yahoo.com > wrote:
>> 
>> 
>>> 
>>> 
>>> Based on there is still discussion and Spark Summit is this week, I'm
>>> going to extend the vote til Friday the 26th.
>>> 
>>> 
>>> 
>>> Tom
>>> On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans < reva...@gmail.com > wrote:
>>> 
>>> 
>>> 
>>> Yes, it is technically possible for the layout to change. No, it is not
>>> going to happen. It is already baked into several different official
>>> libraries which are widely used, not just for holding and processing the
>>> data, but also for transfer of the data between the various
>>> implementations. There would have to be a really serious reason to force
>>> an incompatible change at this point. So in the worst case, we can version
>>> the layout and bake that into the API that exposes the internal layout of
>>> the data. That way, code that wants to program against a Java API can do so
>>> using the API that Spark provides. Those who want to interface with
>>> something that expects the data in Arrow format will already have to know
>>> what version of the format it was programmed against, and in the worst case,
>>> if the layout does change, we can support the new layout if needed.
>>> 
>>> 
>>> 
>>> On Sun, Apr 21, 2019 at 12:45 AM Bryan 

Re: Spark 2.4.2

2019-04-18 Thread Reynold Xin
We should have shaded all Spark’s dependencies :(
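
For what it's worth, a sketch of what per-dependency shading can look like with sbt-assembly (the pattern and relocation target are illustrative, and this is not how Spark's actual build is organized):

    // build.sbt fragment (requires the sbt-assembly plugin): relocate Jackson inside the
    // assembled jar so the bundled copy cannot clash with whatever version an application
    // or library brings on its own classpath.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.fasterxml.jackson.**" -> "org.sparkproject.jackson.@1").inAll
    )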

On Wed, Apr 17, 2019 at 11:47 PM Sean Owen  wrote:

> For users that would inherit Jackson and use it directly, or whose
> dependencies do. Spark itself (with modifications) should be OK with
> the change.
> It's risky and normally wouldn't backport, except that I've heard a
> few times about concerns about CVEs affecting Databind, so wondering
> who else out there might have an opinion. I'm not pushing for it
> necessarily.
>
> On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin  wrote:
> >
> > For Jackson - are you worrying about JSON parsing for users or internal
> Spark functionality breaking?
> >
> > On Wed, Apr 17, 2019 at 6:02 PM Sean Owen  wrote:
> >>
> >> There's only one other item on my radar, which is considering updating
> >> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up
> >> a few times now that there are a number of CVEs open for 2.6.7. Cons:
> >> not clear they affect Spark, and Jackson 2.6->2.9 does change Jackson
> >> behavior non-trivially. That said back-porting the update PR to 2.4
> >> worked out OK locally. Any strong opinions on this one?
> >>
> >> On Wed, Apr 17, 2019 at 7:49 PM Wenchen Fan 
> wrote:
> >> >
> >> > I volunteer to be the release manager for 2.4.2, as I was also going
> to propose 2.4.2 because of the reverting of SPARK-25250. Are there any
> other ongoing bug fixes we want to include in 2.4.2? If not, I'd like to
> start the release process today (CST).
> >> >
> >> > Thanks,
> >> > Wenchen
> >> >
> >> > On Thu, Apr 18, 2019 at 3:44 AM Sean Owen  wrote:
> >> >>
> >> >> I think the 'only backport bug fixes to branches' principle remains
> sound. But what's a bug fix? Something that changes behavior to match what
> is explicitly supposed to happen, or implicitly supposed to happen --
> implied by what other similar things do, by reasonable user expectations,
> or simply how it worked previously.
> >> >>
> >> >> Is this a bug fix? I guess the criteria that matches is that
> behavior doesn't match reasonable user expectations? I don't know enough to
> have a strong opinion. I also don't think there is currently an objection
> to backporting it, whatever it's called.
> >> >>
> >> >>
> >> >> Is the question whether this needs a new release? There's no harm in
> another point release, other than needing a volunteer release manager. One
> could say, wait a bit longer to see what more info comes in about 2.4.1.
> But given that 2.4.1 took like 2 months, it's reasonable to move towards a
> release cycle again. I don't see objection to that either (?)
> >> >>
> >> >>
> >> >> The meta question remains: is a 'bug fix' definition even agreed,
> and being consistently applied? There aren't correct answers, only best
> guesses from each person's own experience, judgment and priorities. These
> can differ even when applied in good faith.
> >> >>
> >> >> Sometimes the variance of opinion comes because people have
> different info that needs to be surfaced. Here, maybe it's best to share
> what about that offline conversation was convincing, for example.
> >> >>
> >> >> I'd say it's also important to separate what one would prefer from
> what one can't live with(out). Assuming one trusts the intent and
> experience of the handful of others with an opinion, I'd defer to someone
> who wants X and will own it, even if I'm moderately against it. Otherwise
> we'd get little done.
> >> >>
> >> >> In that light, it seems like both of the PRs at issue here are not
> _wrong_ to backport. This is a good pair that highlights why, when there
> isn't a clear reason to do / not do something (e.g. obvious errors,
> breaking public APIs) we give benefit-of-the-doubt in order to get it later.
> >> >>
> >> >>
> >> >> On Wed, Apr 17, 2019 at 12:09 PM Ryan Blue 
> wrote:
> >> >>>
> >> >>> Sorry, I should be more clear about what I'm trying to say here.
> >> >>>
> >> >>> In the past, Xiao has taken the opposite stance. A good example is
> PR #21060 that was a very similar situation: behavior didn't match what was
> expected and there was low risk. There was a long argument and the patch
> didn't make it into 2.3 (to my knowledge).
> >> >>>
> >> >>> What we call these low-risk behavior fixes doesn't matter. I called
> it a bug on #21060 but I'
