Re: Apache Spark 3.2.2 Release?

2022-07-07 Thread Joshua Rosen
+1; thanks for coordinating this!

I have a few more correctness bugs to add to the list in your original
email (these were originally missing the 'correctness' JIRA label):

- https://issues.apache.org/jira/browse/SPARK-37643 : when
charVarcharAsString is true, char datatype partition table query incorrect
- https://issues.apache.org/jira/browse/SPARK-37865 : Spark should not
dedup the groupingExpressions when the first child of Union has duplicate
columns
- https://issues.apache.org/jira/browse/SPARK-38787 : Possible correctness
issue on stream-stream join when handling edge case


On Thu, Jul 7, 2022 at 6:12 PM Dongjoon Hyun 
wrote:

> Thank you all.
>
> I'll check and prepare RC1 for next week.
>
> Dongjoon.
>


Re: Apache Spark 3.2.2 Release?

2022-07-07 Thread Dongjoon Hyun
Thank you all.

I'll check and prepare RC1 for next week.

Dongjoon.


Re: [DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-07 Thread Rui Wang
I want to highlight in case I missed this in the original email:

The 4 APIs will not be deleted. They will just be marked with deprecation
annotations, and we encourage users to use their alternatives.


-Rui



[DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-07 Thread Rui Wang
Hi Community,

Proposal:
I want to discuss a proposal to deprecate the following Catalog APIs:
def listColumns(dbName: String, tableName: String): Dataset[Column]
def getTable(dbName: String, tableName: String): Table
def getFunction(dbName: String, functionName: String): Function
def tableExists(dbName: String, tableName: String): Boolean


Context:
We have been adding support for table identifiers with a catalog name (aka
the 3-layer namespace) to the Catalog API in
https://issues.apache.org/jira/browse/SPARK-39235.
The basic idea is, if an API accepts:
1. only tableName: String, we allow it to accept "a.b.c" and pass it to the
analyzer, which treats a as the catalog name, b as the namespace name, and c
as the table name.
2. only dbName: String, we allow it to accept "a.b" and pass it to the
analyzer, which treats a as the catalog name and b as the namespace name.
Meanwhile, we still maintain backwards compatibility for such APIs to
make sure past behavior remains the same. E.g., if you only use tableName, it
is still recognized by the session catalog.
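
To make the rules above concrete, here is a minimal plain-Python sketch of how
a single-string table name could be split into name parts. It is a toy model,
not Spark's analyzer: `resolve_table_name`, `TableIdent`, `SESSION_CATALOG`, and
`CURRENT_DATABASE` are hypothetical names, the two-part fallback to the session
catalog is an assumption about the backwards-compatible path, and quoted
identifiers and multi-level namespaces are ignored.

```python
from typing import NamedTuple

# Assumptions for illustration only; not taken from the email.
SESSION_CATALOG = "spark_catalog"  # assumed name of the session catalog
CURRENT_DATABASE = "default"       # assumed current database


class TableIdent(NamedTuple):
    catalog: str
    namespace: str
    table: str


def resolve_table_name(name: str) -> TableIdent:
    """Toy resolution of a single-string table name into three parts."""
    parts = name.split(".")  # simplification: no backtick-quoted parts
    if any(not part for part in parts):
        raise ValueError(f"empty name part in {name!r}")
    if len(parts) == 3:
        # "a.b.c": a is the catalog, b the namespace, c the table.
        return TableIdent(parts[0], parts[1], parts[2])
    if len(parts) == 2:
        # Assumed backwards-compatible path: "db.table" in the session catalog.
        return TableIdent(SESSION_CATALOG, parts[0], parts[1])
    if len(parts) == 1:
        # A bare table name is still handled by the session catalog.
        return TableIdent(SESSION_CATALOG, CURRENT_DATABASE, parts[0])
    raise ValueError(f"too many name parts for this sketch: {name!r}")
```

With this shape, `resolve_table_name("a.b.c")` yields catalog `a`, namespace
`b`, and table `c`, while a bare `c` still lands in the session catalog.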

With this effort ongoing, the above 4 APIs are not fully compatible with
the 3-layer namespace.

Take tableExists(dbName: String, tableName: String) as an example: it takes
two parameters and leaves no room for the extra catalog name.
Also, if we want to reuse the two parameters, which one would be the one that
takes more than one name part?


How?
So how do we improve the above 4 APIs? There are two options:
a. Expand those four APIs to accept catalog names. For
example, tableExists(catalogName: String, dbName: String, tableName:
String).
b. Mark those APIs as `deprecated`.

I am proposing to follow option b, which is API deprecation.

Why?
1. Reduce unneeded APIs. The existing APIs can support the same behavior
given SPARK-39235. For example, tableExists(dbName, tableName) can be
replaced with tableExists("dbName.tableName").
2. Reduce incomplete APIs. The APIs proposed for deprecation do not support
the 3-layer namespace now, and it is hard to make them do so (where would the
3-part name go?).
3. Deprecation encourages users to migrate their API usage.
4. There is existing practice: we deprecated the CreateExternalTable API
when adding the CreateTable API:
https://github.com/apache/spark/blob/7dcb4bafd02dd43213d3cc4a936c170bda56ddc5/sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala#L220
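
The migration in point 1 and the deprecation in option b can be sketched in
plain Python as follows. `ToyCatalog`, `table_exists`, and
`table_exists_two_arg` are hypothetical names for illustration only (this is
not the Spark Catalog class). The sketch just shows a two-parameter method
that warns and delegates to the single-string form, so existing callers keep
working.

```python
import warnings


class ToyCatalog:
    """Hypothetical stand-in for a catalog; not the Spark Catalog class."""

    def __init__(self, qualified_names):
        # Store names such as "db.table" exactly as given.
        self._tables = set(qualified_names)

    def table_exists(self, name: str) -> bool:
        """Single-string form: takes a qualified name such as "db.table"."""
        return name in self._tables

    def table_exists_two_arg(self, db_name: str, table_name: str) -> bool:
        """Deprecated two-parameter form; warns, then delegates."""
        warnings.warn(
            "table_exists_two_arg(db_name, table_name) is deprecated; "
            'use table_exists("db_name.table_name") instead',
            DeprecationWarning,
            stacklevel=2,
        )
        return self.table_exists(f"{db_name}.{table_name}")
```

The deprecated method is not removed, which matches the intent of the
proposal: old call sites still return the same answer and only gain a warning.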


What do you think?

Thanks,
Rui Wang


Re: Apache Spark 3.2.2 Release?

2022-07-07 Thread Andrew Ray
+1 (non-binding) Thanks!

On Thu, Jul 7, 2022 at 7:00 AM Yang,Jie(INF)  wrote:

> +1 (non-binding) Thank you Dongjoon ~
>
>
>
> *From:* Ruifeng Zheng
> *Date:* Thursday, July 7, 2022, 16:28
> *To:* dev
> *Subject:* Re: Apache Spark 3.2.2 Release?
>
>
>
> +1 thank you Dongjoon!
>
>
> --
>
>
> Ruifeng Zheng
>
> ruife...@foxmail.com
>
>
>
>
>
>
>
> -- Original --
>
> *From:* "Yikun Jiang" ;
>
> *Date:* Thu, Jul 7, 2022 04:16 PM
>
> *To:* "Mridul Muralidharan";
>
> *Cc:* "Gengliang Wang";"Cheng Su";"Maxim Gekk";"Wenchen Fan";"Xiao Li";
> "Xinrong Meng";"Yuming Wang";"dev";
>
> *Subject:* Re: Apache Spark 3.2.2 Release?
>
>
>
> +1  (non-binding)
>
>
>
> Thanks!
>
>
> Regards,
>
> Yikun
>
>
>
>
>
> On Thu, Jul 7, 2022 at 1:57 PM Mridul Muralidharan 
> wrote:
>
> +1
>
>
>
> Thanks for driving this Dongjoon !
>
>
>
> Regards,
>
> Mridul
>
>
>
> On Thu, Jul 7, 2022 at 12:36 AM Gengliang Wang  wrote:
>
> +1.
>
> Thank you, Dongjoon.
>
>
>
> On Wed, Jul 6, 2022 at 10:21 PM Wenchen Fan  wrote:
>
> +1
>
>
>
> On Thu, Jul 7, 2022 at 10:41 AM Xinrong Meng
>  wrote:
>
> +1
>
>
> Thanks!
>
>
>
> Xinrong Meng
>
> Software Engineer
>
> Databricks
>
>
>
>
>
> On Wed, Jul 6, 2022 at 7:25 PM Xiao Li  wrote:
>
> +1
>
>
>
> Xiao
>
>
>
> On Wed, Jul 6, 2022 at 7:16 PM Cheng Su wrote:
>
> +1 (non-binding)
>
>
>
> Thanks,
>
> Cheng Su
>
>
>
> On Wed, Jul 6, 2022 at 6:01 PM Yuming Wang  wrote:
>
> +1
>
>
>
> On Thu, Jul 7, 2022 at 5:53 AM Maxim Gekk
>  wrote:
>
> +1
>
>
>
> On Thu, Jul 7, 2022 at 12:26 AM John Zhuge  wrote:
>
> +1  Thanks for the effort!
>
>
>
> On Wed, Jul 6, 2022 at 2:23 PM Bjørn Jørgensen 
> wrote:
>
> +1
>
>
>
> On Wed, Jul 6, 2022 at 11:05 PM Hyukjin Kwon wrote:
>
> Yeah +1
>
>
>
> On Thu, Jul 7, 2022 at 5:40 AM Dongjoon Hyun 
> wrote:
>
> Hi, All.
>
> Since the Apache Spark 3.2.1 tag was created (Jan 19), 197 new patches,
> including 11 correctness patches, have landed on branch-3.2.
>
> Shall we make a new release, Apache Spark 3.2.2, as the third release
> in the 3.2 line? I'd like to volunteer as the release manager for Apache
> Spark 3.2.2. I'm thinking about starting the first RC next week.
>
> $ git log --oneline v3.2.1..HEAD | wc -l
>  197
>
> # Correctness issues
>
> SPARK-38075 Hive script transform with order by and limit will
> return fake rows
> SPARK-38204 All state operators are at a risk of inconsistency
> between state partitioning and operator partitioning
> SPARK-38309 SHS has incorrect percentiles for shuffle read bytes
> and shuffle total blocks metrics
> SPARK-38320 (flat)MapGroupsWithState can timeout groups which just
> received inputs in the same microbatch
> SPARK-38614 After Spark update, df.show() shows incorrect
> F.percent_rank results
> SPARK-38655 OffsetWindowFunctionFrameBase cannot find the offset
> row whose input is not null
> SPARK-38684 Stream-stream outer join has a possible correctness
> issue due to weakly read consistent on outer iterators
> SPARK-39061 Incorrect results or NPE when using Inline function
> against an array of dynamically created structs
> SPARK-39107 Silent change in regexp_replace's handling of empty strings
> SPARK-39259 Timestamps returned by now() and equivalent functions
> are not consistent in subqueries
> SPARK-39293 The accumulator of ArrayAggregate should copy the
> intermediate result if string, struct, array, or map
>
> Best,
> Dongjoon.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
>
> John Zhuge
>
>


Re: Apache Spark 3.2.2 Release?

2022-07-07 Thread Yang,Jie(INF)
+1 (non-binding) Thank you Dongjoon ~



Re: Apache Spark 3.2.2 Release?

2022-07-07 Thread Ruifeng Zheng
+1 thank you Dongjoon!




Ruifeng Zheng
ruife...@foxmail.com










Re: Apache Spark 3.2.2 Release?

2022-07-07 Thread Yikun Jiang
+1  (non-binding)

Thanks!

Regards,
Yikun

