Re: SPARK-34600. Support user-defined types in Pandas UDF

2021-03-03 Thread attilapiros
Hi!

First of all thanks for your contribution!

PySpark is not an area I am familiar with but I can answer your question
regarding Jira.

The issue will be assigned to you when your change is in:
>  The JIRA will be Assigned to the primary contributor to the change as a
> way of giving credit. If the JIRA isn’t closed and/or Assigned promptly,
> comment on the JIRA.

You can check the contributing page: https://spark.apache.org/contributing.html

Best regards,
Attila



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



SPARK-34600. Support user-defined types in Pandas UDF

2021-03-03 Thread Lei Xu
Hi,

I have been working on a PR
(https://github.com/apache/spark/pull/31735) that allows returning a
UserDefinedType from a Pandas UDF.
Would love to see the feedback from the community.

Btw, since this is my first patch on Spark, it seems that I don't have
permission to assign the ticket
(https://issues.apache.org/jira/browse/SPARK-34600) to myself.
Does it require a Spark committer's / PMC member's help to add me to the
contributor list?

Thanks,
Best regards,

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread John Zhuge
+1 Good plan to move forward.

Thank you all for the constructive and comprehensive discussions in this
thread! Decisions on this important feature will have ramifications for
years to come.

On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan  wrote:

> +1 to this proposal. If people don't like the ScalarFunction0,1, ...
> variants and prefer the "magical methods", then we can have a single
> ScalarFunction interface which has the row-parameter API (with a default
> implementation to fail) and documents to describe the "magical methods"
> (which can be done later).
>
> I'll start the PR review this week to check the naming, doc, etc.
>
> Thanks all for the discussion here and let's move forward!
>
> On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue  wrote:
>
>> Good point, Dongjoon. I think we can probably come to some compromise
>> here:
>>
>>- Remove SupportsInvoke since it isn’t really needed. We should
>>always try to find the right method to invoke in the codegen path.
>>- Add a default implementation of produceResult so that
>>implementations don’t have to use it. If they don’t implement it and a
>>magic function can’t be found, then it will throw
>>UnsupportedOperationException
>>
>> This is assuming that we can agree not to introduce all of the
>> ScalarFunction interface variations, which would have limited utility
>> because of type erasure.
>>
>> Does that sound like a good plan to everyone? If so, I’ll update the SPIP
>> doc so we can move forward.
>>
>> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> We shared many opinions in different perspectives.
>>> However, we didn't reach a consensus even on a partial merge by
>>> excluding something
>>> (on the PR by me, on this mailing thread by Wenchen).
>>>
>>> For the following claims, we have another alternative to mitigate it.
>>>
>>> > I don't like it because it promotes the row-parameter API and
>>> forces users to implement it, even if the users want to only use the
>>> individual-parameters API.
>>>
>>> Why don't we merge the AS-IS PR by adding something instead of excluding
>>> something?
>>>
>>> - R produceResult(InternalRow input);
>>> + default R produceResult(InternalRow input) throws Exception {
>>> +   throw new UnsupportedOperationException();
>>> + }
>>>
>>> By providing the default implementation, it will not *force users to
>>> implement it* technically.
>>> And, we can provide a document about our expected usage properly.
>>> What do you think?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue  wrote:
>>>
 Yes, GenericInternalRow is safe if when type mismatches, with the cost
 of using Object[], and primitive types need to do boxing

 The question is not whether to use the magic functions, which would not
 need boxing. The question here is whether to use multiple
 ScalarFunction interfaces. Those interfaces would require boxing or
 using Object[] so there isn’t a benefit.

 If we do want to reuse one UDF for different types, using “magical
 methods” solves the problem

 Yes, that’s correct. We agree that magic methods are a good option for
 this.

 Again, the question we need to decide is whether to use InternalRow or
 interfaces like ScalarFunction2 for non-codegen. The option to use
 multiple interfaces is limited by type erasure because you can only have
 one set of type parameters. If you wanted to support both ScalarFunction2<Integer, Integer> and
 ScalarFunction2<Long, Long> you’d have to fall back to ScalarFunction2<Object, Object> and cast.

 The point is that type erasure will commonly lead either to many
 different implementation classes (one for each type combination) or will
 lead to parameterizing by Object, which defeats the purpose.

 The alternative adds safety because correct types are produced by calls
 like getLong(0). Yes, this depends on the implementation making the
 correct calls, but it is better than using Object and casting.

 I can’t think of real use cases that will force the
 individual-parameters approach to use Object instead of concrete types.

 I think this is addressed by the type erasure discussion above. A
 simple Plus method would require Object or 12 implementations for 2
 arguments and 4 numeric types.

 And basically all varargs cases would need to use Object[]. Consider a
 UDF to create a map that requires string keys and some consistent type for
 values. This would be easy with the InternalRow API because you can
 use getString(pos) and get(pos + 1, valueType) to get the key/value
 pairs. Use of UTF8String vs String will be checked at compile time.

 I agree that Object[] is worse than InternalRow

 Yes, and if we are always using Object because of type erasure or
 using magic methods to get better performance, the utility of the

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Wenchen Fan
+1 to this proposal. If people don't like the ScalarFunction0,1, ...
variants and prefer the "magical methods", then we can have a single
ScalarFunction interface which has the row-parameter API (with a default
implementation to fail) and documents to describe the "magical methods"
(which can be done later).

I'll start the PR review this week to check the naming, doc, etc.

Thanks all for the discussion here and let's move forward!

On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue  wrote:

> Good point, Dongjoon. I think we can probably come to some compromise here:
>
>- Remove SupportsInvoke since it isn’t really needed. We should always
>try to find the right method to invoke in the codegen path.
>- Add a default implementation of produceResult so that
>implementations don’t have to use it. If they don’t implement it and a
>magic function can’t be found, then it will throw
>UnsupportedOperationException
>
> This is assuming that we can agree not to introduce all of the
> ScalarFunction interface variations, which would have limited utility
> because of type erasure.
>
> Does that sound like a good plan to everyone? If so, I’ll update the SPIP
> doc so we can move forward.
>
> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> We shared many opinions in different perspectives.
>> However, we didn't reach a consensus even on a partial merge by excluding
>> something
>> (on the PR by me, on this mailing thread by Wenchen).
>>
>> For the following claims, we have another alternative to mitigate it.
>>
>> > I don't like it because it promotes the row-parameter API and
>> forces users to implement it, even if the users want to only use the
>> individual-parameters API.
>>
>> Why don't we merge the AS-IS PR by adding something instead of excluding
>> something?
>>
>> - R produceResult(InternalRow input);
>> + default R produceResult(InternalRow input) throws Exception {
>> +   throw new UnsupportedOperationException();
>> + }
>>
>> By providing the default implementation, it will not *force users to
>> implement it* technically.
>> And, we can provide a document about our expected usage properly.
>> What do you think?
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue  wrote:
>>
>>> Yes, GenericInternalRow is safe if when type mismatches, with the cost
>>> of using Object[], and primitive types need to do boxing
>>>
>>> The question is not whether to use the magic functions, which would not
>>> need boxing. The question here is whether to use multiple ScalarFunction
>>> interfaces. Those interfaces would require boxing or using Object[] so
>>> there isn’t a benefit.
>>>
>>> If we do want to reuse one UDF for different types, using “magical
>>> methods” solves the problem
>>>
>>> Yes, that’s correct. We agree that magic methods are a good option for
>>> this.
>>>
>>> Again, the question we need to decide is whether to use InternalRow or
>>> interfaces like ScalarFunction2 for non-codegen. The option to use
>>> multiple interfaces is limited by type erasure because you can only have
>>> one set of type parameters. If you wanted to support both ScalarFunction2<Integer, Integer> and
>>> ScalarFunction2<Long, Long> you’d have to fall back to ScalarFunction2<Object, Object> and cast.
>>>
>>> The point is that type erasure will commonly lead either to many
>>> different implementation classes (one for each type combination) or will
>>> lead to parameterizing by Object, which defeats the purpose.
>>>
>>> The alternative adds safety because correct types are produced by calls
>>> like getLong(0). Yes, this depends on the implementation making the
>>> correct calls, but it is better than using Object and casting.
>>>
>>> I can’t think of real use cases that will force the
>>> individual-parameters approach to use Object instead of concrete types.
>>>
>>> I think this is addressed by the type erasure discussion above. A simple
>>> Plus method would require Object or 12 implementations for 2 arguments
>>> and 4 numeric types.
>>>
>>> And basically all varargs cases would need to use Object[]. Consider a
>>> UDF to create a map that requires string keys and some consistent type for
>>> values. This would be easy with the InternalRow API because you can use
>>> getString(pos) and get(pos + 1, valueType) to get the key/value pairs.
>>> Use of UTF8String vs String will be checked at compile time.
>>>
>>> I agree that Object[] is worse than InternalRow
>>>
>>> Yes, and if we are always using Object because of type erasure or using
>>> magic methods to get better performance, the utility of the parameterized
>>> interfaces is very limited.
>>>
>>> Because we want to expose the magic functions, the use of
>>> ScalarFunction2 and similar is extremely limited because it is only for
>>> non-codegen. Varargs is by far the more common case. The InternalRow
>>> interface is also a very simple way to get started and ensures that Spark
>>> can always find the right 

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Ryan Blue
Good point, Dongjoon. I think we can probably come to some compromise here:

   - Remove SupportsInvoke since it isn’t really needed. We should always
   try to find the right method to invoke in the codegen path.
   - Add a default implementation of produceResult so that implementations
   don’t have to use it. If they don’t implement it and a magic function can’t
   be found, then it will throw UnsupportedOperationException

This is assuming that we can agree not to introduce all of the
ScalarFunction interface variations, which would have limited utility
because of type erasure.

Does that sound like a good plan to everyone? If so, I’ll update the SPIP
doc so we can move forward.

On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> We shared many opinions in different perspectives.
> However, we didn't reach a consensus even on a partial merge by excluding
> something
> (on the PR by me, on this mailing thread by Wenchen).
>
> For the following claims, we have another alternative to mitigate it.
>
> > I don't like it because it promotes the row-parameter API and forces
> users to implement it, even if the users want to only use the
> individual-parameters API.
>
> Why don't we merge the AS-IS PR by adding something instead of excluding
> something?
>
> - R produceResult(InternalRow input);
> + default R produceResult(InternalRow input) throws Exception {
> +   throw new UnsupportedOperationException();
> + }
>
> By providing the default implementation, it will not *force users to
> implement it* technically.
> And, we can provide a document about our expected usage properly.
> What do you think?
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue  wrote:
>
>> Yes, GenericInternalRow is safe if when type mismatches, with the cost of
>> using Object[], and primitive types need to do boxing
>>
>> The question is not whether to use the magic functions, which would not
>> need boxing. The question here is whether to use multiple ScalarFunction
>> interfaces. Those interfaces would require boxing or using Object[] so
>> there isn’t a benefit.
>>
>> If we do want to reuse one UDF for different types, using “magical
>> methods” solves the problem
>>
>> Yes, that’s correct. We agree that magic methods are a good option for
>> this.
>>
>> Again, the question we need to decide is whether to use InternalRow or
>> interfaces like ScalarFunction2 for non-codegen. The option to use
>> multiple interfaces is limited by type erasure because you can only have
>> one set of type parameters. If you wanted to support both ScalarFunction2<Integer, Integer> and
>> ScalarFunction2<Long, Long> you’d have to fall back to ScalarFunction2<Object, Object> and cast.
>>
>> The point is that type erasure will commonly lead either to many
>> different implementation classes (one for each type combination) or will
>> lead to parameterizing by Object, which defeats the purpose.
>>
>> The alternative adds safety because correct types are produced by calls
>> like getLong(0). Yes, this depends on the implementation making the
>> correct calls, but it is better than using Object and casting.
>>
>> I can’t think of real use cases that will force the individual-parameters
>> approach to use Object instead of concrete types.
>>
>> I think this is addressed by the type erasure discussion above. A simple
>> Plus method would require Object or 12 implementations for 2 arguments
>> and 4 numeric types.
>>
>> And basically all varargs cases would need to use Object[]. Consider a
>> UDF to create a map that requires string keys and some consistent type for
>> values. This would be easy with the InternalRow API because you can use
>> getString(pos) and get(pos + 1, valueType) to get the key/value pairs.
>> Use of UTF8String vs String will be checked at compile time.
>>
>> I agree that Object[] is worse than InternalRow
>>
>> Yes, and if we are always using Object because of type erasure or using
>> magic methods to get better performance, the utility of the parameterized
>> interfaces is very limited.
>>
>> Because we want to expose the magic functions, the use of ScalarFunction2
>> and similar is extremely limited because it is only for non-codegen.
>> Varargs is by far the more common case. The InternalRow interface is
>> also a very simple way to get started and ensures that Spark can always
>> find the right method after the function is bound to input types.
>>
>> On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan  wrote:
>>
>>> Yes, GenericInternalRow is safe if when type mismatches, with the cost
>>> of using Object[], and primitive types need to do boxing. And this is a
>>> runtime failure, which is absolutely worse than query-compile-time checks.
>>> Also, don't forget my previous point: users need to specify the type and
>>> index such as row.getLong(0), which is error-prone.
>>>
>>> > But we don’t do that for any of the similar UDFs today so I’m
>>> skeptical that this would actually be a high enough priority 

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Dongjoon Hyun
Hi, All.

We shared many opinions in different perspectives.
However, we didn't reach a consensus even on a partial merge by excluding
something
(on the PR by me, on this mailing thread by Wenchen).

For the following claims, we have another alternative to mitigate it.

> I don't like it because it promotes the row-parameter API and forces
users to implement it, even if the users want to only use the
individual-parameters API.

Why don't we merge the AS-IS PR by adding something instead of excluding
something?

- R produceResult(InternalRow input);
+ default R produceResult(InternalRow input) throws Exception {
+   throw new UnsupportedOperationException();
+ }

By providing the default implementation, it will not *force users to
implement it* technically.
And, we can provide a document about our expected usage properly.
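
For illustration, here is a minimal sketch (the interface shape and the
magic-method name `invoke` below are assumptions for the sketch, not the
committed Spark API) of a UDF that relies only on a magic method and never
overrides produceResult:

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.unsafe.types.UTF8String;

    // Assumed interface shape: the row-parameter API with a failing default.
    interface ScalarFunction<R> {
      default R produceResult(InternalRow input) throws Exception {
        throw new UnsupportedOperationException("no row-based implementation");
      }
    }

    // Provides only a hypothetical magic method; the default produceResult
    // is never overridden, so nothing extra is forced on the implementor.
    class StrLen implements ScalarFunction<Integer> {
      public int invoke(UTF8String s) {
        return s.numChars();
      }
    }
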
What do you think?

Bests,
Dongjoon.



On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue  wrote:

> Yes, GenericInternalRow is safe if when type mismatches, with the cost of
> using Object[], and primitive types need to do boxing
>
> The question is not whether to use the magic functions, which would not
> need boxing. The question here is whether to use multiple ScalarFunction
> interfaces. Those interfaces would require boxing or using Object[] so
> there isn’t a benefit.
>
> If we do want to reuse one UDF for different types, using “magical
> methods” solves the problem
>
> Yes, that’s correct. We agree that magic methods are a good option for
> this.
>
> Again, the question we need to decide is whether to use InternalRow or
> interfaces like ScalarFunction2 for non-codegen. The option to use
> multiple interfaces is limited by type erasure because you can only have
> one set of type parameters. If you wanted to support both ScalarFunction2<Integer, Integer> and
> ScalarFunction2<Long, Long> you’d have to fall back to ScalarFunction2<Object, Object> and cast.
>
> The point is that type erasure will commonly lead either to many different
> implementation classes (one for each type combination) or will lead to
> parameterizing by Object, which defeats the purpose.
>
> The alternative adds safety because correct types are produced by calls
> like getLong(0). Yes, this depends on the implementation making the
> correct calls, but it is better than using Object and casting.
>
> I can’t think of real use cases that will force the individual-parameters
> approach to use Object instead of concrete types.
>
> I think this is addressed by the type erasure discussion above. A simple
> Plus method would require Object or 12 implementations for 2 arguments
> and 4 numeric types.
>
> And basically all varargs cases would need to use Object[]. Consider a
> UDF to create a map that requires string keys and some consistent type for
> values. This would be easy with the InternalRow API because you can use
> getString(pos) and get(pos + 1, valueType) to get the key/value pairs.
> Use of UTF8String vs String will be checked at compile time.
>
> I agree that Object[] is worse than InternalRow
>
> Yes, and if we are always using Object because of type erasure or using
> magic methods to get better performance, the utility of the parameterized
> interfaces is very limited.
>
> Because we want to expose the magic functions, the use of ScalarFunction2
> and similar is extremely limited because it is only for non-codegen.
> Varargs is by far the more common case. The InternalRow interface is also
> a very simple way to get started and ensures that Spark can always find the
> right method after the function is bound to input types.
>
> On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan  wrote:
>
>> Yes, GenericInternalRow is safe if when type mismatches, with the cost
>> of using Object[], and primitive types need to do boxing. And this is a
>> runtime failure, which is absolutely worse than query-compile-time checks.
>> Also, don't forget my previous point: users need to specify the type and
>> index such as row.getLong(0), which is error-prone.
>>
>> > But we don’t do that for any of the similar UDFs today so I’m skeptical
>> that this would actually be a high enough priority to implement.
>>
>> I'd say this is a must-have if we go with the individual-parameters
>> approach. The Scala UDF today checks the method signature at compile-time,
>> thanks to the Scala type tag. The Java UDF today doesn't check and is hard
>> to use.
>>
>> > You can’t implement ScalarFunction2<Integer, Integer> and
>> ScalarFunction2<Long, Long>.
>>
>> Can you elaborate? We have function binding and we can use different UDFs
>> for different input types. If we do want to reuse one UDF
>> for different types, using "magical methods" solves the problem:
>> class MyUDF {
>>   def call(i: Int): Int = ...
>>   def call(l: Long): Long = ...
>> }
>>
>> On the other side, I don't think the row-parameter approach can solve
>> this problem. The best I can think of is:
>> class MyUDF implement ScalaFunction[Object] {
>>   def call(row: InternalRow): Object = {
>> if (int input) 

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Hyukjin Kwon
Thank you so much guys .. it indeed took a long time and it was pretty
tough this time :-).
It was all possible because of your guys' support. I sincerely appreciate
it .

On Thu, Mar 4, 2021 at 2:26 AM, Dongjoon Hyun wrote:

> It took a long time. Thank you, Hyukjin and all!
>
> Bests,
> Dongjoon.
>
> On Wed, Mar 3, 2021 at 3:23 AM Gabor Somogyi 
> wrote:
>
>> Good to hear and great work Hyukjin! 
>>
>> On Wed, 3 Mar 2021, 11:15 Jungtaek Lim, 
>> wrote:
>>
>>> Thanks Hyukjin for driving the huge release, and thanks everyone for
>>> contributing the release!
>>>
>>> On Wed, Mar 3, 2021 at 6:54 PM angers zhu  wrote:
>>>
 Great work, Hyukjin !

 Bests,
 Angers

On Wed, Mar 3, 2021 at 5:02 PM, Wenchen Fan wrote:

> Great work and congrats!
>
> On Wed, Mar 3, 2021 at 3:51 PM Kent Yao  wrote:
>
>> Congrats, all!
>>
>> Bests,
>> *Kent Yao *
>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>> *a spark enthusiast*
>> *kyuubi is a
>> unified multi-tenant JDBC interface for large-scale data processing and
>> analytics, built on top of Apache Spark .*
>> *spark-authorizer A
>> Spark SQL extension which provides SQL Standard Authorization for 
>> **Apache
>> Spark .*
>> *spark-postgres  A
>> library for reading data from and transferring data to Postgres / 
>> Greenplum
>> with Spark SQL and DataFrames, 10~100x faster.*
>> *spark-func-extras A
>> library that brings excellent and useful functions from various modern
>> database management systems to Apache Spark .*
>>
>>
>>
>> On 03/3/2021 15:11, Takeshi Yamamuro wrote:
>>
>> Great work and Congrats, all!
>>
>> Bests,
>> Takeshi
>>
>> On Wed, Mar 3, 2021 at 2:18 PM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> Thanks Hyukjin and congratulations everyone on the release !
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Tue, Mar 2, 2021 at 8:54 PM Yuming Wang  wrote:
>>>
 Great work, Hyukjin!

 On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon 
 wrote:

> We are excited to announce Spark 3.1.1 today.
>
> Apache Spark 3.1.1 is the second release of the 3.x line. This
> release adds
> Python type annotations and Python dependency management support
> as part of Project Zen.
> Other major updates include improved ANSI SQL compliance support,
> history server support
> in structured streaming, the general availability (GA) of
> Kubernetes and node decommissioning
> in Kubernetes and Standalone. In addition, this release continues
> to focus on usability, stability,
> and polish while resolving around 1500 tickets.
>
> We'd like to thank our contributors and users for their
> contributions and early feedback to
> this release. This release would not have been possible without
> you.
>
> To download Spark 3.1.1, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-1-1.html
>
>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>>


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-03 Thread Takeshi Yamamuro
+1 for releasing 2.4.8 and thanks, Liang-chi, for volunteering.
Btw, does anyone roughly know how many v2.4 users there still are, based on
some stats (e.g., # of v2.4.7 downloads from the official repos)?
Have most users already started using v3.x?

On Thu, Mar 4, 2021 at 8:34 AM Hyukjin Kwon  wrote:

> Yeah, I would prefer to have a 2.4.8 release as an EOL too. I don't mind
> having 2.4.9 as EOL too if that's preferred from more people.
>
> On Thu, Mar 4, 2021 at 4:01 AM, Sean Owen wrote:
>
>> Sure, I'm even arguing that 2.4.8 could possibly be the final release. No
>> objection of course to continuing to backport to 2.4.x where appropriate
>> and cutting 2.4.9 later in the year as a final EOL release, either.
>>
>> On Wed, Mar 3, 2021 at 12:59 PM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you, Sean.
>>>
>>> Ya, exactly, we can release 2.4.8 as a normal release first and use
>>> 2.4.9 as the EOL release.
>>>
>>> Since 2.4.7 was released almost 6 months ago, 2.4.8 is a little late in
>>> terms of the cadence.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Mar 3, 2021 at 10:55 AM Sean Owen  wrote:
>>>
 For reference, 2.3.x was maintained from February 2018 (2.3.0) to Sep
 2019 (2.3.4), or about 19 months. The 2.4 branch should probably be
 maintained longer than that, as the final 2.x branch. 2.4.0 was released in
 Nov 2018. A final release in, say, April 2021 would be about 30 months.
 That feels about right timing-wise.

 We should in any event release 2.4.8, yes. We can of course choose to
 release a 2.4.9 if some critical issue is found, later.

 But yeah based on the velocity of back-ports to 2.4.x, it seems about
 time to call it EOL.

 Sean

>>>

-- 
---
Takeshi Yamamuro


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-03 Thread Hyukjin Kwon
Yeah, I would prefer to have a 2.4.8 release as an EOL too. I don't mind
having 2.4.9 as EOL too if that's preferred from more people.

On Thu, Mar 4, 2021 at 4:01 AM, Sean Owen wrote:

> Sure, I'm even arguing that 2.4.8 could possibly be the final release. No
> objection of course to continuing to backport to 2.4.x where appropriate
> and cutting 2.4.9 later in the year as a final EOL release, either.
>
> On Wed, Mar 3, 2021 at 12:59 PM Dongjoon Hyun 
> wrote:
>
>> Thank you, Sean.
>>
>> Ya, exactly, we can release 2.4.8 as a normal release first and use 2.4.9
>> as the EOL release.
>>
>> Since 2.4.7 was released almost 6 months ago, 2.4.8 is a little late in
>> terms of the cadence.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Wed, Mar 3, 2021 at 10:55 AM Sean Owen  wrote:
>>
>>> For reference, 2.3.x was maintained from February 2018 (2.3.0) to Sep
>>> 2019 (2.3.4), or about 19 months. The 2.4 branch should probably be
>>> maintained longer than that, as the final 2.x branch. 2.4.0 was released in
>>> Nov 2018. A final release in, say, April 2021 would be about 30 months.
>>> That feels about right timing-wise.
>>>
>>> We should in any event release 2.4.8, yes. We can of course choose to
>>> release a 2.4.9 if some critical issue is found, later.
>>>
>>> But yeah based on the velocity of back-ports to 2.4.x, it seems about
>>> time to call it EOL.
>>>
>>> Sean
>>>
>>


Re: Apache Spark Docker image repository

2021-03-03 Thread Ismaël Mejía
Since Spark 3.1.1 is out now I was wondering if it would make sense to
try to get some consensus about starting to release docker images as
part of Spark 3.2.
Having ready-to-use images would definitely benefit adoption, in
particular now that containerized runs via k8s have become GA.

WDYT? Are there still some issues/blockers or reasons to not move forward?

On Tue, Feb 18, 2020 at 2:29 PM Ismaël Mejía  wrote:
>
> +1 to have Spark docker images for Dongjoon's arguments, having a container
> based distribution is definitely something in the benefit of users and the
> project too. Having this in the Apache Spark repo matters because of multiple
> eyes to fix/ímprove the images for the benefit of everyone.
>
> What still needs to be tested is the best distribution approach. I have been
> involved in both Flink and Beam's docker images processes (and passed the whole
> 'docker official image' validation), and one of the lessons learnt is that the
> less you put in an image the better it is for everyone. So I wonder if the whole
> include-everything-in-the-world approach (Python, R, etc) would scale, or if those
> should be overlays on top of a more core minimal image, but well, those are details
> to fix once consensus on this is agreed.
>
> On the Apache INFRA side there is some stuff to deal with at the beginning, but
> things become smoother once they are in place. In any case, fantastic idea, and
> if I can help around I would be glad to.
>
> Regards,
> Ismaël
>
> On Tue, Feb 11, 2020 at 10:56 PM Dongjoon Hyun  
> wrote:
>>
>> Hi, Sean.
>>
>> Yes. We should keep this minimal.
>>
>> BTW, for the following questions,
>>
>> > But how much value does that add?
>>
>> How much value do you think we have at our binary distribution in the 
>> following link?
>>
>> - https://www.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>>
>> Docker image can have a similar value with the above for the users who are 
>> using Dockerized environment.
>>
>> If you are assuming users who build from the source code or live on
>> vendor distributions, both the above existing binary distribution link and the
>> Docker image have no value.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Tue, Feb 11, 2020 at 8:51 AM Sean Owen  wrote:
>>>
>>> To be clear this is a convenience 'binary' for end users, not just an
>>> internal packaging to aid the testing framework?
>>>
>>> There's nothing wrong with providing an additional official packaging
>>> if we vote on it and it follows all the rules. There is an open
>>> question about how much value it adds vs that maintenance. I see we do
>>> already have some Dockerfiles, sure. Is it possible to reuse or
>>> repurpose these so that we don't have more to maintain? or: what is
>>> different from the existing Dockerfiles here? (dumb question, never
>>> paid much attention to them)
>>>
>>> We definitely can't release GPL bits or anything, yes. Just releasing
>>> a Dockerfile referring to GPL bits is a gray area - no bits are being
>>> redistributed, but, does it constitute a derived work where the GPL
>>> stuff is a non-optional dependency? Would any publishing of these
>>> images cause us to put a copy of third party GPL code anywhere?
>>>
>>> At the least, we should keep this minimal. One image if possible, that
>>> you overlay on top of your preferred OS/Java/Python image. But how
>>> much value does that add? I have no info either way that people want
>>> or don't need such a thing.
>>>
>>> On Tue, Feb 11, 2020 at 10:13 AM Erik Erlandson  wrote:
>>> >
>>> > My takeaway from the last time we discussed this was:
>>> > 1) To be ASF compliant, we needed to only publish images at official 
>>> > releases
>>> > 2) There was some ambiguity about whether or not a container image that 
>>> > included GPL'ed packages (spark images do) might trip over the GPL "viral 
>>> > propagation" due to integrating ASL and GPL in a "binary release".  The 
>>> > "air gap" GPL provision may apply - the GPL software interacts only at 
>>> > command-line boundaries.
>>> >
>>> > On Wed, Feb 5, 2020 at 1:23 PM Dongjoon Hyun  
>>> > wrote:
>>> >>
>>> >> Hi, All.
>>> >>
>>> >> From 2020, shall we have an official Docker image repository as an 
>>> >> additional distribution channel?
>>> >>
>>> >> I'm considering the following images.
>>> >>
>>> >> - Public binary release (no snapshot image)
>>> >> - Public non-Spark base image (OS + R + Python)
>>> >>   (This can be used in GitHub Action Jobs and Jenkins K8s 
>>> >> Integration Tests to speed up jobs and to have more stabler environments)
>>> >>
>>> >> Bests,
>>> >> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Apache Spark 3.2 Expectation

2021-03-03 Thread Dongjoon Hyun
Hi, John.

This thread aims to share your expectations and goals (and maybe work
progress) for Apache Spark 3.2 because we are making this together. :)

Bests,
Dongjoon.


On Wed, Mar 3, 2021 at 1:59 PM John Zhuge  wrote:

> Hi Dongjoon,
>
> Is it possible to get ViewCatalog in? The community already had fairly
> detailed discussions.
>
> Thanks,
> John
>
> On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Since we have been preparing Apache Spark 3.2.0 in master branch since
>> December 2020, March seems to be a good time to share our thoughts and
>> aspirations on Apache Spark 3.2.
>>
>> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
>> seems to be the last minor release of this year. Given the timeframe, we
>> might consider the following. (This is a small set. Please add your
>> thoughts to this limited list.)
>>
>> # Languages
>>
>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>> and investigating the publishing issue. Thank you for your contributions
>> and feedback on this.
>>
>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021. Like
>> Java 11, we need lots of support from our dependencies. Let's see.
>>
>> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
>> 2021-12-23. So, the deprecation is not required yet, but we had better
>> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>>
>> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
>> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If it
>> succeeds to revive it, we can keep publishing. Otherwise, I believe we had
>> better drop it from the releasing work item list officially.
>>
>> # Dependencies
>>
>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile in
>> Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
>> shaded clients via SPARK-33212. So far, there is one on-going report at
>> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
>> we can move toward Hadoop 3.3.2.
>>
>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely
>> via SPARK-32981 and replaced the generated hive-service-rpc code with the
>> official dependency via SPARK-32981. We are steadily improving this area
>> and will consume Hive 2.3.9 if available.
>>
>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>> support K8s model 1.19.
>>
>> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
>> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
>> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
>> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
>> with Kafka Client 2.8 hopefully.
>>
>> # Some Features
>>
>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
>> Iceberg integration. Especially, we hope the on-going function catalog SPIP
>> and up-coming storage partitioned join SPIP can be delivered as a part of
>> Spark 3.2 and become an additional foundation.
>>
>> - Columnar Encryption: As of today, Apache Spark master branch supports
>> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
>> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
>> Apache Spark 3.2 is going to be the first release to have this feature
>> officially. Any feedback is welcome.
>>
>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
>> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
>> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
>> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
>> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
>> too. I'm expecting more benefits.
>>
>> - Structured Streaming with RocksDB backend: According to the latest
>> update, it looks active enough for merging to master branch in Spark 3.2.
>>
>> Please share your thoughts and let's build better Apache Spark 3.2
>> together.
>>
>> Bests,
>> Dongjoon.
>>
>
>
> --
> John Zhuge
>


Re: Apache Spark 3.2 Expectation

2021-03-03 Thread John Zhuge
Hi Dongjoon,

Is it possible to get ViewCatalog in? The community already had fairly
detailed discussions.

Thanks,
John

On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Since we have been preparing Apache Spark 3.2.0 in master branch since
> December 2020, March seems to be a good time to share our thoughts and
> aspirations on Apache Spark 3.2.
>
> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
> seems to be the last minor release of this year. Given the timeframe, we
> might consider the following. (This is a small set. Please add your
> thoughts to this limited list.)
>
> # Languages
>
> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but slipped
> out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505 and
> investigating the publishing issue. Thank you for your contributions and
> feedback on this.
>
> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021. Like
> Java 11, we need lots of support from our dependencies. Let's see.
>
> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
> 2021-12-23. So, the deprecation is not required yet, but we had better
> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>
> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If it
> succeeds to revive it, we can keep publishing. Otherwise, I believe we had
> better drop it from the releasing work item list officially.
>
> # Dependencies
>
> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile in
> Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
> shaded clients via SPARK-33212. So far, there is one on-going report at
> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
> we can move toward Hadoop 3.3.2.
>
> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default instead
> of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely via
> SPARK-32981 and replaced the generated hive-service-rpc code with the
> official dependency via SPARK-32981. We are steadily improving this area
> and will consume Hive 2.3.9 if available.
>
> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s client
> dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to support
> K8s model 1.19.
>
> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
> with Kafka Client 2.8 hopefully.
>
> # Some Features
>
> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
> Iceberg integration. Especially, we hope the on-going function catalog SPIP
> and up-coming storage partitioned join SPIP can be delivered as a part of
> Spark 3.2 and become an additional foundation.
>
> - Columnar Encryption: As of today, Apache Spark master branch supports
> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
> Apache Spark 3.2 is going to be the first release to have this feature
> officially. Any feedback is welcome.
>
> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
> too. I'm expecting more benefits.
>
> - Structured Streaming with RocksDB backend: According to the latest
> update, it looks active enough for merging to master branch in Spark 3.2.
>
> Please share your thoughts and let's build better Apache Spark 3.2
> together.
>
> Bests,
> Dongjoon.
>


-- 
John Zhuge


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-03 Thread Sean Owen
Sure, I'm even arguing that 2.4.8 could possibly be the final release. No
objection of course to continuing to backport to 2.4.x where appropriate
and cutting 2.4.9 later in the year as a final EOL release, either.

On Wed, Mar 3, 2021 at 12:59 PM Dongjoon Hyun 
wrote:

> Thank you, Sean.
>
> Ya, exactly, we can release 2.4.8 as a normal release first and use 2.4.9
> as the EOL release.
>
> Since 2.4.7 was released almost 6 months ago, 2.4.8 is a little late in
> terms of the cadence.
>
> Bests,
> Dongjoon.
>
>
> On Wed, Mar 3, 2021 at 10:55 AM Sean Owen  wrote:
>
>> For reference, 2.3.x was maintained from February 2018 (2.3.0) to Sep
>> 2019 (2.3.4), or about 19 months. The 2.4 branch should probably be
>> maintained longer than that, as the final 2.x branch. 2.4.0 was released in
>> Nov 2018. A final release in, say, April 2021 would be about 30 months.
>> That feels about right timing-wise.
>>
>> We should in any event release 2.4.8, yes. We can of course choose to
>> release a 2.4.9 if some critical issue is found, later.
>>
>> But yeah based on the velocity of back-ports to 2.4.x, it seems about
>> time to call it EOL.
>>
>> Sean
>>
>


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-03 Thread Dongjoon Hyun
Thank you, Sean.

Ya, exactly, we can release 2.4.8 as a normal release first and use 2.4.9
as the EOL release.

Since 2.4.7 was released almost 6 months ago, 2.4.8 is a little late in
terms of the cadence.

Bests,
Dongjoon.


On Wed, Mar 3, 2021 at 10:55 AM Sean Owen  wrote:

> For reference, 2.3.x was maintained from February 2018 (2.3.0) to Sep 2019
> (2.3.4), or about 19 months. The 2.4 branch should probably be maintained
> longer than that, as the final 2.x branch. 2.4.0 was released in Nov 2018.
> A final release in, say, April 2021 would be about 30 months. That feels
> about right timing-wise.
>
> We should in any event release 2.4.8, yes. We can of course choose to
> release a 2.4.9 if some critical issue is found, later.
>
> But yeah based on the velocity of back-ports to 2.4.x, it seems about time
> to call it EOL.
>
> Sean
>
>
> On Wed, Mar 3, 2021 at 12:05 PM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> We successfully completed Apache Spark 3.1.1 and 3.0.2 releases and
>> started 3.2.0 discussion already.
>>
>> Let's talk about branch-2.4 because there exist some discussions on JIRA
>> and GitHub about skipping backporting to 2.4.
>>
>> Since `branch-2.4` has been maintained well as LTS, I'd like to suggest
>> having Apache Spark 2.4.8 release as the official EOL release of 2.4 line
>> in order to focus on 3.x more from now. Please note that `branch-2.4` will
>> be frozen officially like `branch-2.3` after EOL release.
>>
>> - Apache Spark 2.4.0 was released on November 2, 2018.
>> - Apache Spark 2.4.7 was released on September 12, 2020.
>> - Since v2.4.7 tag, `branch-2.4` has 134 commits including the following
>> 12 correctness issues.
>>
>> ## CORRECTNESS ISSUE
>> SPARK-30201 HiveOutputWriter standardOI should use
>> ObjectInspectorCopyOption.DEFAULT
>> SPARK-30228 Update zstd-jni to 1.4.4-3
>> SPARK-30894 The nullability of Size function should not depend on
>> SQLConf.get
>> SPARK-32635 When pyspark.sql.functions.lit() function is used with
>> dataframe cache, it returns wrong result
>> SPARK-32908 percentile_approx() returns incorrect results
>> SPARK-33183 Bug in optimizer rule EliminateSorts
>> SPARK-33290 REFRESH TABLE should invalidate cache even though the table
>> itself may not be cached
>> SPARK-33593 Vector reader got incorrect data with binary partition value
>> SPARK-33726 Duplicate field names causes wrong answers during aggregation
>> SPARK-34187 Use available offset range obtained during polling when
>> checking offset validation
>> SPARK-34212 For parquet table, after changing the precision and scale of
>> decimal type in hive, spark reads incorrect value
>> SPARK-34229 Avro should read decimal values with the file schema
>>
>> ## SECURITY ISSUE
>> SPARK-3 Upgrade Jetty to 9.4.28.v20200408
>> SPARK-33831 Update to jetty 9.4.34
>> SPARK-34449 Upgrade Jetty to fix CVE-2020-27218
>>
>> What do you think about this?
>>
>> Bests,
>> Dongjoon.
>>
>


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-03 Thread Sean Owen
For reference, 2.3.x was maintained from February 2018 (2.3.0) to Sep 2019
(2.3.4), or about 19 months. The 2.4 branch should probably be maintained
longer than that, as the final 2.x branch. 2.4.0 was released in Nov 2018.
A final release in, say, April 2021 would be about 30 months. That feels
about right timing-wise.

We should in any event release 2.4.8, yes. We can of course choose to
release a 2.4.9 if some critical issue is found, later.

But yeah based on the velocity of back-ports to 2.4.x, it seems about time
to call it EOL.

Sean


On Wed, Mar 3, 2021 at 12:05 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> We successfully completed Apache Spark 3.1.1 and 3.0.2 releases and
> started 3.2.0 discussion already.
>
> Let's talk about branch-2.4 because there exist some discussions on JIRA
> and GitHub about skipping backporting to 2.4.
>
> Since `branch-2.4` has been maintained well as LTS, I'd like to suggest
> having Apache Spark 2.4.8 release as the official EOL release of 2.4 line
> in order to focus on 3.x more from now. Please note that `branch-2.4` will
> be frozen officially like `branch-2.3` after EOL release.
>
> - Apache Spark 2.4.0 was released on November 2, 2018.
> - Apache Spark 2.4.7 was released on September 12, 2020.
> - Since v2.4.7 tag, `branch-2.4` has 134 commits including the following
> 12 correctness issues.
>
> ## CORRECTNESS ISSUE
> SPARK-30201 HiveOutputWriter standardOI should use
> ObjectInspectorCopyOption.DEFAULT
> SPARK-30228 Update zstd-jni to 1.4.4-3
> SPARK-30894 The nullability of Size function should not depend on
> SQLConf.get
> SPARK-32635 When pyspark.sql.functions.lit() function is used with
> dataframe cache, it returns wrong result
> SPARK-32908 percentile_approx() returns incorrect results
> SPARK-33183 Bug in optimizer rule EliminateSorts
> SPARK-33290 REFRESH TABLE should invalidate cache even though the table
> itself may not be cached
> SPARK-33593 Vector reader got incorrect data with binary partition value
> SPARK-33726 Duplicate field names causes wrong answers during aggregation
> SPARK-34187 Use available offset range obtained during polling when
> checking offset validation
> SPARK-34212 For parquet table, after changing the precision and scale of
> decimal type in hive, spark reads incorrect value
> SPARK-34229 Avro should read decimal values with the file schema
>
> ## SECURITY ISSUE
> SPARK-3 Upgrade Jetty to 9.4.28.v20200408
> SPARK-33831 Update to jetty 9.4.34
> SPARK-34449 Upgrade Jetty to fix CVE-2020-27218
>
> What do you think about this?
>
> Bests,
> Dongjoon.
>


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-03 Thread Dongjoon Hyun
Thank you for volunteering as Apache Spark 2.4.8 release manager, Liang-chi!

On Wed, Mar 3, 2021 at 10:13 AM Liang-Chi Hsieh  wrote:

>
> Thanks Dongjoon!
>
> +1 and I volunteer to do the release of 2.4.8 if it passes.
>
>
> Liang-Chi
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Ryan Blue
Yes, GenericInternalRow is safe if when type mismatches, with the cost of
using Object[], and primitive types need to do boxing

The question is not whether to use the magic functions, which would not
need boxing. The question here is whether to use multiple ScalarFunction
interfaces. Those interfaces would require boxing or using Object[] so
there isn’t a benefit.

If we do want to reuse one UDF for different types, using “magical methods”
solves the problem

Yes, that’s correct. We agree that magic methods are a good option for this.

Again, the question we need to decide is whether to use InternalRow or
interfaces like ScalarFunction2 for non-codegen. The option to use multiple
interfaces is limited by type erasure because you can only have one set of
type parameters. If you wanted to support both ScalarFunction2<Integer, Integer> and
ScalarFunction2<Long, Long> you’d have to fall back to ScalarFunction2<Object, Object> and cast.

The point is that type erasure will commonly lead either to many different
implementation classes (one for each type combination) or will lead to
parameterizing by Object, which defeats the purpose.
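
A minimal Java sketch of that trade-off, assuming a ScalarFunction2 shape for
illustration (not Spark's actual interface):

    // One class cannot implement the same generic interface twice...
    interface ScalarFunction2<A, B> {
      Object call(A left, B right);
    }

    // class Plus implements ScalarFunction2<Integer, Integer>,
    //                       ScalarFunction2<Long, Long> { ... }
    // --> rejected by javac: the interface is inherited with different arguments.

    // ...so a reusable implementation falls back to Object, boxing and casting:
    class Plus implements ScalarFunction2<Object, Object> {
      @Override
      public Object call(Object left, Object right) {
        if (left instanceof Integer) {
          return (Integer) left + (Integer) right;
        }
        return (Long) left + (Long) right;
      }
    }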

The alternative adds safety because correct types are produced by calls
like getLong(0). Yes, this depends on the implementation making the correct
calls, but it is better than using Object and casting.

I can’t think of real use cases that will force the individual-parameters
approach to use Object instead of concrete types.

I think this is addressed by the type erasure discussion above. A simple
Plus method would require Object or 12 implementations for 2 arguments and
4 numeric types.

And basically all varargs cases would need to use Object[]. Consider a UDF
to create a map that requires string keys and some consistent type for
values. This would be easy with the InternalRow API because you can use
getString(pos) and get(pos + 1, valueType) to get the key/value pairs. Use
of UTF8String vs String will be checked at compile time.
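
A minimal sketch of that varargs map case in the row-parameter style; MakeMap
and its produceResult signature are illustrative assumptions, only the
InternalRow accessors (getString, get(pos, type), numFields) are existing Spark
methods, and a real function would produce Spark's MapData rather than a
java.util.Map:

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.types.DataType;

    import java.util.HashMap;
    import java.util.Map;

    class MakeMap {
      private final DataType valueType;  // consistent value type, known after binding

      MakeMap(DataType valueType) {
        this.valueType = valueType;
      }

      // Alternating key/value arguments arrive as one InternalRow.
      Map<String, Object> produceResult(InternalRow input) {
        Map<String, Object> result = new HashMap<>();
        for (int pos = 0; pos < input.numFields(); pos += 2) {
          // getString(pos) fixes the key type; get(pos + 1, valueType) reads
          // the value with its bound Spark type.
          result.put(input.getString(pos), input.get(pos + 1, valueType));
        }
        return result;
      }
    }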

I agree that Object[] is worse than InternalRow

Yes, and if we are always using Object because of type erasure or using
magic methods to get better performance, the utility of the parameterized
interfaces is very limited.

Because we want to expose the magic functions, the use of ScalarFunction2
and similar is extremely limited because it is only for non-codegen.
Varargs is by far the more common case. The InternalRow interface is also a
very simple way to get started and ensures that Spark can always find the
right method after the function is bound to input types.

On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan  wrote:

> Yes, GenericInternalRow is safe if when type mismatches, with the cost of
> using Object[], and primitive types need to do boxing. And this is a
> runtime failure, which is absolutely worse than query-compile-time checks.
> Also, don't forget my previous point: users need to specify the type and
> index such as row.getLong(0), which is error-prone.
>
> > But we don’t do that for any of the similar UDFs today so I’m skeptical
> that this would actually be a high enough priority to implement.
>
> I'd say this is a must-have if we go with the individual-parameters
> approach. The Scala UDF today checks the method signature at compile-time,
> thanks to the Scala type tag. The Java UDF today doesn't check and is hard
> to use.
>
> > You can’t implement ScalarFunction2<Integer, Integer> and
> ScalarFunction2<Long, Long>.
>
> Can you elaborate? We have function binding and we can use different UDFs
> for different input types. If we do want to reuse one UDF
> for different types, using "magical methods" solves the problem:
> class MyUDF {
>   def call(i: Int): Int = ...
>   def call(l: Long): Long = ...
> }
>
> On the other side, I don't think the row-parameter approach can solve this
> problem. The best I can think of is:
> class MyUDF implement ScalaFunction[Object] {
>   def call(row: InternalRow): Object = {
> if (int input) row.getInt(0) ... else row.getLong(0) ...
>   }
> }
>
> This is worse because: 1) it needs to do if-else to check different input
> types. 2) the return type can only be Object and cause boxing issues.
>
> I agree that Object[] is worse than InternalRow. But I can't think of
> real use cases that will force the individual-parameters approach to use
> Object instead of concrete types.
>
>
> On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue  wrote:
>
>> Thanks for adding your perspective, Erik!
>>
>> If the input is string type but the UDF implementation calls
>> row.getLong(0), it returns wrong data
>>
>> I think this is misleading. It is true for UnsafeRow, but there is no
>> reason why InternalRow should return incorrect values.
>>
>> The implementation in GenericInternalRow
>> 
>> would throw a ClassCastException. I don’t think that using a row is a
>> bad option simply because UnsafeRow is unsafe.
>>
>> It’s unlikely that UnsafeRow would be used to pass the data. 

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-03 Thread Liang-Chi Hsieh


Thanks Dongjoon!

+1 and I volunteer to do the release of 2.4.8 if it passes.


Liang-Chi




--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-03 Thread Dongjoon Hyun
Hi, All.

We successfully completed Apache Spark 3.1.1 and 3.0.2 releases and started
3.2.0 discussion already.

Let's talk about branch-2.4 because there exist some discussions on JIRA
and GitHub about skipping backporting to 2.4.

Since `branch-2.4` has been maintained well as LTS, I'd like to suggest
having Apache Spark 2.4.8 release as the official EOL release of 2.4 line
in order to focus on 3.x more from now. Please note that `branch-2.4` will
be frozen officially like `branch-2.3` after EOL release.

- Apache Spark 2.4.0 was released on November 2, 2018.
- Apache Spark 2.4.7 was released on September 12, 2020.
- Since v2.4.7 tag, `branch-2.4` has 134 commits including the following 12
correctness issues.

## CORRECTNESS ISSUE
SPARK-30201 HiveOutputWriter standardOI should use
ObjectInspectorCopyOption.DEFAULT
SPARK-30228 Update zstd-jni to 1.4.4-3
SPARK-30894 The nullability of Size function should not depend on
SQLConf.get
SPARK-32635 When pyspark.sql.functions.lit() function is used with
dataframe cache, it returns wrong result
SPARK-32908 percentile_approx() returns incorrect results
SPARK-33183 Bug in optimizer rule EliminateSorts
SPARK-33290 REFRESH TABLE should invalidate cache even though the table
itself may not be cached
SPARK-33593 Vector reader got incorrect data with binary partition value
SPARK-33726 Duplicate field names causes wrong answers during aggregation
SPARK-34187 Use available offset range obtained during polling when
checking offset validation
SPARK-34212 For parquet table, after changing the precision and scale of
decimal type in hive, spark reads incorrect value
SPARK-34229 Avro should read decimal values with the file schema

## SECURITY ISSUE
SPARK-3 Upgrade Jetty to 9.4.28.v20200408
SPARK-33831 Update to jetty 9.4.34
SPARK-34449 Upgrade Jetty to fix CVE-2020-27218

What do you think about this?

Bests,
Dongjoon.


Re: minikube and kubernetes cluster versions for integration testing

2021-03-03 Thread shane knapp ☠
please open a jira for this and assign it to me...  shouldn't be too big of
a deal to get this set up.

On Tue, Mar 2, 2021 at 6:06 PM Dongjoon Hyun 
wrote:

> Thank you for sharing and suggestion, Attila.
>
> Additionally, given the following information,
>
> - The latest Minikube is v1.18.0 with K8s v1.20.2
> - AWS EKS will add K8s v1.20 on April, 2021
> - The end of support in AWS EKS are
> K8s v1.15 (May 3, 2021)
> K8s v1.16 (July, 2021)
> K8s v1.17 (September, 2021)
>
> The minimum K8s versions (v1.17) sound reasonable and practical to me for
> Apache Spark 3.2.0.
>
> For Minikube, I'd like to recommend to use the latest Minikube versions.
> However, if Minikube v1.7.3 support is easy enough in the script, +1 for
> using v1.7.3 as the minimum Minikube version checking.
>
> Thanks,
> Dongjoon.
>
>
> On Tue, Mar 2, 2021 at 5:03 AM Attila Zsolt Piros <
> piros.attila.zs...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am working on a PR to change kubernetes integration testing and use the
>> `minikube kubectl -- config view --minify` output to build the kubernetes
>> client config.
>> This solution has the advantage of not using hardcoded values like 8443
>> for server port (which is wrong when the vm-driver is docker as the port in
>> that case is 32788 by default).
>>
>> But my question is a bit more generic than my PR. It is about the supported
>> Minikube versions and kubernetes cluster versions, which is why I decided to
>> write this mail.
>>
>> To test this new solution I created a shell script to install each
>> Minikube version one by one, start a kubernetes cluster, and view the
>> config with the command above.
>> Running the test, I found some issues.
>>
>> Currently for k8s testing we suggest using *minikube version v0.34.1 or
>> greater* with *kubernetes version v1.15.12* (for details check the "Testing
>> K8S" section in the developer tools page).
>>
>>
>> *I have the following three findings:*
>> 1) Looking at the Minikube documentation I came across advice
>> about checking which kubernetes cluster versions are supported for a
>> Minikube version:
>>
>> *"For up to date information on supported versions,
>> see OldestKubernetesVersion and NewestKubernetesVersion in constants.go"*
>> I think it would be a good idea to follow the official support matrix
>> of Minikube so I have collected some relevant versions into this table (the
>> link navigates to the relevant lines in `constants.go`):
>>                         |         kubernetes version           |
>>  minikube version       |  oldest  |  newest  |    default     |
>>  ------------------------+----------+----------+----------------
>>  v0.34.1                | ???      | ???      | v1.13.3        |
>>  v1.1.0 (22 May 2019)   | v1.10.13 | v1.14.2  | v1.14.2        |
>>  v1.2.0                 | v1.10.13 | v1.15.0  | v1.15.0        |
>>  v1.3.0 (6 Aug 2019)    | v1.10.13 | v1.15.2  | v1.15.2        |
>>  v1.6.0 (11 Dec 2019)   | v1.11.10 | v1.17.0  | v1.17.0        |
>>  v1.7.3 (8 Feb 2020)    | v1.11.10 | v1.17.3  | v1.17.3        |
>>  v1.13.1                | v1.13.0  | v1.19.2  | v1.19.2        |
>>  v1.17.1                | v1.13.0  | v1.20.2  | v1.20.3-rc.0   |
>>
>>
>> Looking at this, we can see that if we intend to support v1.15.12 as the
>> kubernetes version, we should drop everything under v1.3.0.
>>
>> 2) I would suggest dropping v1.15.12 as the kubernetes
>> version because of this issue (I just found it
>> by running my script).
>>
>> 3) On Minikube v1.7.2 there is this permission denied issue, so I suggest
>> supporting Minikube version 1.7.3 and greater.
>>
>> My test script is check_minikube_versions.zsh.

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Dongjoon Hyun
It took a long time. Thank you, Hyukjin and all!

Bests,
Dongjoon.

On Wed, Mar 3, 2021 at 3:23 AM Gabor Somogyi 
wrote:

> Good to hear and great work Hyukjin! 
>
> On Wed, 3 Mar 2021, 11:15 Jungtaek Lim, 
> wrote:
>
>> Thanks Hyukjin for driving the huge release, and thanks everyone for
>> contributing the release!
>>
>> On Wed, Mar 3, 2021 at 6:54 PM angers zhu  wrote:
>>
>>> Great work, Hyukjin !
>>>
>>> Bests,
>>> Angers
>>>
>>> Wenchen Fan wrote on Wed, Mar 3, 2021 at 5:02 PM:
>>>
 Great work and congrats!

 On Wed, Mar 3, 2021 at 3:51 PM Kent Yao  wrote:

> Congrats, all!
>
> Bests,
> *Kent Yao *
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> *a spark enthusiast*
> *kyuubi is a
> unified multi-tenant JDBC interface for large-scale data processing and
> analytics, built on top of Apache Spark .*
> *spark-authorizer A
> Spark SQL extension which provides SQL Standard Authorization for **Apache
> Spark .*
> *spark-postgres  A library
> for reading data from and transferring data to Postgres / Greenplum with
> Spark SQL and DataFrames, 10~100x faster.*
> *spark-func-extras A
> library that brings excellent and useful functions from various modern
> database management systems to Apache Spark .*
>
>
>
> On 03/3/2021 15:11,Takeshi Yamamuro
>  wrote:
>
> Great work and Congrats, all!
>
> Bests,
> Takeshi
>
> On Wed, Mar 3, 2021 at 2:18 PM Mridul Muralidharan 
> wrote:
>
>>
>> Thanks Hyukjin and congratulations everyone on the release !
>>
>> Regards,
>> Mridul
>>
>> On Tue, Mar 2, 2021 at 8:54 PM Yuming Wang  wrote:
>>
>>> Great work, Hyukjin!
>>>
>>> On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon 
>>> wrote:
>>>
 We are excited to announce Spark 3.1.1 today.

 Apache Spark 3.1.1 is the second release of the 3.x line. This
 release adds
 Python type annotations and Python dependency management support as
 part of Project Zen.
 Other major updates include improved ANSI SQL compliance support,
 history server support
 in structured streaming, the general availability (GA) of
 Kubernetes and node decommissioning
 in Kubernetes and Standalone. In addition, this release continues
 to focus on usability, stability,
 and polish while resolving around 1500 tickets.

 We'd like to thank our contributors and users for their
 contributions and early feedback to
 this release. This release would not have been possible without you.

 To download Spark 3.1.1, head over to the download page:
 http://spark.apache.org/downloads.html

 To view the release notes:
 https://spark.apache.org/releases/spark-release-3-1-1.html


>
> --
> ---
> Takeshi Yamamuro
>
>


Re: Apache Spark 3.2 Expectation

2021-03-03 Thread Chang Chen
+1 for Data Source V2 Aggregate push down

huaxin gao wrote on Sat, Feb 27, 2021 at 4:20 AM:

> Thanks Dongjoon and Xiao for the discussion. I would like to add Data
> Source V2 Aggregate push down to the list. I am currently working on
> JDBC Data Source V2 Aggregate push down, but the common code can be used
> for the file based V2 Data Source as well. For example, MAX and MIN can be
> pushed down to Parquet and Orc, since they can use statistics information
> to perform these operations efficiently. Quite a few users are
> interested in this Aggregate push down feature and the preliminary
> performance test for JDBC Aggregate push down is positive. So I think it is
> a valuable feature to add for Spark 3.2.
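>
> As a concrete illustration (a minimal sketch only, not the API being
> proposed; the path and column name are placeholders), a query like the
> following could be answered from Parquet footer statistics alone once
> MAX/MIN push down is in place:
>
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.functions.{max, min}
>
>   object AggregatePushDownExample {
>     def main(args: Array[String]): Unit = {
>       val spark = SparkSession.builder()
>         .appName("aggregate-pushdown-example")
>         .master("local[*]")
>         .getOrCreate()
>
>       // "/tmp/events" stands in for an existing Parquet dataset.
>       val df = spark.read.parquet("/tmp/events")
>
>       // With MAX/MIN push down, this aggregate only needs the per-file (or
>       // per-row-group) statistics from the Parquet footers instead of
>       // decoding every row of the "amount" column.
>       df.agg(max("amount"), min("amount")).show()
>
>       spark.stop()
>     }
>   }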
>
> Thanks,
> Huaxin
>
> On Fri, Feb 26, 2021 at 11:13 AM Xiao Li  wrote:
>
>> Thank you, Dongjoon, for initiating this discussion. Let us keep it open.
>> It might take 1-2 weeks to collect from the community all the features
>> we plan to build and ship in 3.2 since we just finished the 3.1 voting.
>>
>>
>>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need `branch-cut`
>>> in April because we took 3 months for the Spark 3.1 release.
>>
>>
>> TBH, cutting the branch this April does not look good to me. That means,
>> we only have one month left for feature development of Spark 3.2. Do we
>> have enough features in the current master branch? If not, are we able to
>> finish major features we collected here? Do they have a timeline or project
>> plan?
>>
>> Xiao
>>
>> Dongjoon Hyun wrote on Fri, Feb 26, 2021 at 10:07 AM:
>>
>>> Thank you, Mridul and Sean.
>>>
>>> 1. Yes, `2017` was a typo. Java 17 is scheduled September 2021. And, of
>>> course, it's a nice-to-have status. :)
>>>
>>> 2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks
>>> for sharing,
>>>
>>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need `branch-cut`
>>> in April because we took 3 months for the Spark 3.1 release.
>>> Let's update our release roadmap of the Apache Spark website.
>>>
>>> > I'd roughly expect 3.2 in, say, July of this year, given the usual
>>> cadence. No reason it couldn't be a little sooner or later. There is
>>> already some good stuff in 3.2 and will be a good minor release in 5-6
>>> months.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Thu, Feb 25, 2021 at 9:33 AM Sean Owen  wrote:
>>>
 I'd roughly expect 3.2 in, say, July of this year, given the usual
 cadence. No reason it couldn't be a little sooner or later. There is
 already some good stuff in 3.2 and will be a good minor release in 5-6
 months.

 On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> We have been preparing Apache Spark 3.2.0 in the master branch since
> December 2020, so March seems to be a good time to share our thoughts and
> aspirations for Apache Spark 3.2.
>
> According to the progress on Apache Spark 3.1 release, Apache Spark
> 3.2 seems to be the last minor release of this year. Given the timeframe,
> we might consider the following. (This is a small set. Please add your
> thoughts to this limited list.)
>
> # Languages
>
> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
> and investigating the publishing issue. Thank you for your contributions
> and feedback on this.
>
> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like
> Java 11, we need lots of support from our dependencies. Let's see.
>
> - Python 3.6 Deprecation(?): Python 3.6 community support ends on
> 2021-12-23. So, the deprecation is not required yet, but we had better
> prepare it because we don't have an ETA for Apache Spark 3.3 in 2022.
>
> - SparkR CRAN publishing: As we know, it has been discontinued so far.
> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN publishing.
> If that succeeds in reviving it, we can keep publishing. Otherwise, I believe
> we had better drop it from the release work item list officially.
>
> # Dependencies
>
> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile
> in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 
> 3.2.2's
> shaded clients via SPARK-33212. So far, there is one on-going report at
> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
> we can move toward Hadoop 3.3.2.
>
> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile 
> completely
> via SPARK-32981 and replaced the generated hive-service-rpc code with the
> official dependency via SPARK-32981. We are steadily improving this area
> and will consume Hive 2.3.9 if available.
>
> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
> client dependency to 

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Gabor Somogyi
Good to hear and great work Hyukjin! 

On Wed, 3 Mar 2021, 11:15 Jungtaek Lim, 
wrote:

> Thanks Hyukjin for driving the huge release, and thanks everyone for
> contributing the release!
>
> On Wed, Mar 3, 2021 at 6:54 PM angers zhu  wrote:
>
>> Great work, Hyukjin !
>>
>> Bests,
>> Angers
>>
>> Wenchen Fan wrote on Wed, Mar 3, 2021 at 5:02 PM:
>>
>>> Great work and congrats!
>>>
>>> On Wed, Mar 3, 2021 at 3:51 PM Kent Yao  wrote:
>>>
 Congrats, all!

 Bests,
 *Kent Yao *
 @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
 *a spark enthusiast*
 *kyuubi is a
 unified multi-tenant JDBC interface for large-scale data processing and
 analytics, built on top of Apache Spark .*
 *spark-authorizer A Spark
 SQL extension which provides SQL Standard Authorization for **Apache
 Spark .*
 *spark-postgres  A library
 for reading data from and transferring data to Postgres / Greenplum with
 Spark SQL and DataFrames, 10~100x faster.*
 *spark-func-extras A
 library that brings excellent and useful functions from various modern
 database management systems to Apache Spark .*



 On 03/3/2021 15:11,Takeshi Yamamuro
  wrote:

 Great work and Congrats, all!

 Bests,
 Takeshi

 On Wed, Mar 3, 2021 at 2:18 PM Mridul Muralidharan 
 wrote:

>
> Thanks Hyukjin and congratulations everyone on the release !
>
> Regards,
> Mridul
>
> On Tue, Mar 2, 2021 at 8:54 PM Yuming Wang  wrote:
>
>> Great work, Hyukjin!
>>
>> On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon 
>> wrote:
>>
>>> We are excited to announce Spark 3.1.1 today.
>>>
>>> Apache Spark 3.1.1 is the second release of the 3.x line. This
>>> release adds
>>> Python type annotations and Python dependency management support as
>>> part of Project Zen.
>>> Other major updates include improved ANSI SQL compliance support,
>>> history server support
>>> in structured streaming, the general availability (GA) of Kubernetes
>>> and node decommissioning
>>> in Kubernetes and Standalone. In addition, this release continues to
>>> focus on usability, stability,
>>> and polish while resolving around 1500 tickets.
>>>
>>> We'd like to thank our contributors and users for their
>>> contributions and early feedback to
>>> this release. This release would not have been possible without you.
>>>
>>> To download Spark 3.1.1, head over to the download page:
>>> http://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-1-1.html
>>>
>>>

 --
 ---
 Takeshi Yamamuro




Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Jungtaek Lim
Thanks Hyukjin for driving the huge release, and thanks everyone for
contributing the release!

On Wed, Mar 3, 2021 at 6:54 PM angers zhu  wrote:

> Great work, Hyukjin !
>
> Bests,
> Angers
>
> Wenchen Fan wrote on Wed, Mar 3, 2021 at 5:02 PM:
>
>> Great work and congrats!
>>
>> On Wed, Mar 3, 2021 at 3:51 PM Kent Yao  wrote:
>>
>>> Congrats, all!
>>>
>>> Bests,
>>> *Kent Yao *
>>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>>> *a spark enthusiast*
>>> *kyuubi is a
>>> unified multi-tenant JDBC interface for large-scale data processing and
>>> analytics, built on top of Apache Spark .*
>>> *spark-authorizer A Spark
>>> SQL extension which provides SQL Standard Authorization for **Apache
>>> Spark .*
>>> *spark-postgres  A library
>>> for reading data from and transferring data to Postgres / Greenplum with
>>> Spark SQL and DataFrames, 10~100x faster.*
>>> *spark-func-extras A
>>> library that brings excellent and useful functions from various modern
>>> database management systems to Apache Spark .*
>>>
>>>
>>>
>>> On 03/3/2021 15:11,Takeshi Yamamuro
>>>  wrote:
>>>
>>> Great work and Congrats, all!
>>>
>>> Bests,
>>> Takeshi
>>>
>>> On Wed, Mar 3, 2021 at 2:18 PM Mridul Muralidharan 
>>> wrote:
>>>

 Thanks Hyukjin and congratulations everyone on the release !

 Regards,
 Mridul

 On Tue, Mar 2, 2021 at 8:54 PM Yuming Wang  wrote:

> Great work, Hyukjin!
>
> On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon 
> wrote:
>
>> We are excited to announce Spark 3.1.1 today.
>>
>> Apache Spark 3.1.1 is the second release of the 3.x line. This
>> release adds
>> Python type annotations and Python dependency management support as
>> part of Project Zen.
>> Other major updates include improved ANSI SQL compliance support,
>> history server support
>> in structured streaming, the general availability (GA) of Kubernetes
>> and node decommissioning
>> in Kubernetes and Standalone. In addition, this release continues to
>> focus on usability, stability,
>> and polish while resolving around 1500 tickets.
>>
>> We'd like to thank our contributors and users for their contributions
>> and early feedback to
>> this release. This release would not have been possible without you.
>>
>> To download Spark 3.1.1, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-1-1.html
>>
>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>>


Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread angers zhu
Great work, Hyukjin !

Bests,
Angers

Wenchen Fan wrote on Wed, Mar 3, 2021 at 5:02 PM:

> Great work and congrats!
>
> On Wed, Mar 3, 2021 at 3:51 PM Kent Yao  wrote:
>
>> Congrats, all!
>>
>> Bests,
>> *Kent Yao *
>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>> *a spark enthusiast*
>> *kyuubi is a
>> unified multi-tenant JDBC interface for large-scale data processing and
>> analytics, built on top of Apache Spark .*
>> *spark-authorizer A Spark
>> SQL extension which provides SQL Standard Authorization for **Apache
>> Spark .*
>> *spark-postgres  A library
>> for reading data from and transferring data to Postgres / Greenplum with
>> Spark SQL and DataFrames, 10~100x faster.*
>> *spark-func-extras A
>> library that brings excellent and useful functions from various modern
>> database management systems to Apache Spark .*
>>
>>
>>
>> On 03/3/2021 15:11,Takeshi Yamamuro
>>  wrote:
>>
>> Great work and Congrats, all!
>>
>> Bests,
>> Takeshi
>>
>> On Wed, Mar 3, 2021 at 2:18 PM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> Thanks Hyukjin and congratulations everyone on the release !
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Tue, Mar 2, 2021 at 8:54 PM Yuming Wang  wrote:
>>>
 Great work, Hyukjin!

 On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon 
 wrote:

> We are excited to announce Spark 3.1.1 today.
>
> Apache Spark 3.1.1 is the second release of the 3.x line. This release
> adds
> Python type annotations and Python dependency management support as
> part of Project Zen.
> Other major updates include improved ANSI SQL compliance support,
> history server support
> in structured streaming, the general availability (GA) of Kubernetes
> and node decommissioning
> in Kubernetes and Standalone. In addition, this release continues to
> focus on usability, stability,
> and polish while resolving around 1500 tickets.
>
> We'd like to thank our contributors and users for their contributions
> and early feedback to
> this release. This release would not have been possible without you.
>
> To download Spark 3.1.1, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-1-1.html
>
>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>>


Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Wenchen Fan
Great work and congrats!

On Wed, Mar 3, 2021 at 3:51 PM Kent Yao  wrote:

> Congrats, all!
>
> Bests,
> *Kent Yao *
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> *a spark enthusiast*
> *kyuubi is a unified multi-tenant JDBC
> interface for large-scale data processing and analytics, built on top
> of Apache Spark .*
> *spark-authorizer A Spark
> SQL extension which provides SQL Standard Authorization for **Apache
> Spark .*
> *spark-postgres  A library for
> reading data from and transferring data to Postgres / Greenplum with Spark
> SQL and DataFrames, 10~100x faster.*
> *spark-func-extras A
> library that brings excellent and useful functions from various modern
> database management systems to Apache Spark .*
>
>
>
> On 03/3/2021 15:11,Takeshi Yamamuro
>  wrote:
>
> Great work and Congrats, all!
>
> Bests,
> Takeshi
>
> On Wed, Mar 3, 2021 at 2:18 PM Mridul Muralidharan 
> wrote:
>
>>
>> Thanks Hyukjin and congratulations everyone on the release !
>>
>> Regards,
>> Mridul
>>
>> On Tue, Mar 2, 2021 at 8:54 PM Yuming Wang  wrote:
>>
>>> Great work, Hyukjin!
>>>
>>> On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon  wrote:
>>>
 We are excited to announce Spark 3.1.1 today.

 Apache Spark 3.1.1 is the second release of the 3.x line. This release
 adds
 Python type annotations and Python dependency management support as
 part of Project Zen.
 Other major updates include improved ANSI SQL compliance support,
 history server support
 in structured streaming, the general availability (GA) of Kubernetes
 and node decommissioning
 in Kubernetes and Standalone. In addition, this release continues to
 focus on usability, stability,
 and polish while resolving around 1500 tickets.

 We'd like to thank our contributors and users for their contributions
 and early feedback to
 this release. This release would not have been possible without you.

 To download Spark 3.1.1, head over to the download page:
 http://spark.apache.org/downloads.html

 To view the release notes:
 https://spark.apache.org/releases/spark-release-3-1-1.html


>
> --
> ---
> Takeshi Yamamuro
>
>