[jira] [Resolved] (SPARK-43927) Add cast alias to Scala and Python

2023-06-02 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43927.
---
Resolution: Not A Problem

> Add cast alias to Scala and Python
> --
>
> Key: SPARK-43927
> URL: https://issues.apache.org/jira/browse/SPARK-43927
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> add functions for:
> * castAlias("boolean", BooleanType),
> * castAlias("tinyint", ByteType),
> * castAlias("smallint", ShortType),
> * castAlias("int", IntegerType),
> * castAlias("bigint", LongType),
> * castAlias("float", FloatType),
> * castAlias("double", DoubleType),
> * castAlias("decimal", DecimalType.USER_DEFAULT),
> * castAlias("date", DateType),
> * castAlias("timestamp", TimestampType),
> * castAlias("binary", BinaryType),
> * castAlias("string", StringType),
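The pattern behind the list above can be sketched in plain Python. This is a hypothetical illustration only (the `Column` class and `_cast_alias` helper below are stand-ins, not the actual Spark API): each alias is a thin wrapper that casts a column to one fixed type.

```python
class Column:
    """Illustrative stand-in for a DataFrame column expression."""
    def __init__(self, expr, dtype="unknown"):
        self.expr = expr
        self.dtype = dtype

    def cast(self, dtype):
        # Casting yields a new expression with the target type.
        return Column(f"CAST({self.expr} AS {dtype})", dtype)


def _cast_alias(type_name):
    # Build one alias function, e.g. bigint(col) == col.cast("bigint").
    def alias(col):
        return col.cast(type_name)
    alias.__name__ = type_name
    return alias


# One wrapper per SQL type keyword in the ticket's list (subset shown).
boolean = _cast_alias("boolean")
tinyint = _cast_alias("tinyint")
bigint = _cast_alias("bigint")
string = _cast_alias("string")

print(bigint(Column("age")).expr)  # CAST(age AS bigint)
```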



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43907) Add SQL functions into Scala, Python and R API

2023-06-02 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728598#comment-17728598
 ] 

jiaan.geng commented on SPARK-43907:


[~gurwls223] Thank you for your feedback.

> Add SQL functions into Scala, Python and R API
> --
>
> Key: SPARK-43907
> URL: https://issues.apache.org/jira/browse/SPARK-43907
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SparkR, SQL
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See the discussion in dev mailing list 
> (https://lists.apache.org/thread/0tdcfyzxzcv8w46qbgwys2rormhdgyqg).
> This is an umbrella JIRA to implement all SQL functions in Scala, Python and R.






[jira] [Updated] (SPARK-43916) Add percentile to Scala and Python API

2023-06-02 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-43916:
---
Summary: Add percentile to Scala and Python API  (was: Add percentile to 
Scala, Python and R API)

> Add percentile to Scala and Python API
> --
>
> Key: SPARK-43916
> URL: https://issues.apache.org/jira/browse/SPARK-43916
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, R, SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>







[jira] [Created] (SPARK-43945) Fix bug for `SQLQueryTestSuite` when run on local env

2023-06-02 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43945:
---

 Summary: Fix bug for `SQLQueryTestSuite` when run on local env
 Key: SPARK-43945
 URL: https://issues.apache.org/jira/browse/SPARK-43945
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Commented] (SPARK-43798) Initial support for Python UDTFs

2023-06-02 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728601#comment-17728601
 ] 

Hudson commented on SPARK-43798:


User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/41316

> Initial support for Python UDTFs
> 
>
> Key: SPARK-43798
> URL: https://issues.apache.org/jira/browse/SPARK-43798
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> Support Python user-defined table functions with batch eval. 
>  






[jira] [Commented] (SPARK-43945) Fix bug for `SQLQueryTestSuite` when run on local env

2023-06-02 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728603#comment-17728603
 ] 

Hudson commented on SPARK-43945:


User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41431

> Fix bug for `SQLQueryTestSuite` when run on local env
> -
>
> Key: SPARK-43945
> URL: https://issues.apache.org/jira/browse/SPARK-43945
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Created] (SPARK-43946) Add rule to remove unused CTEDef

2023-06-02 Thread jinhai-cloud (Jira)
jinhai-cloud created SPARK-43946:


 Summary: Add rule to remove unused CTEDef
 Key: SPARK-43946
 URL: https://issues.apache.org/jira/browse/SPARK-43946
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0, 3.3.2, 3.2.4
Reporter: jinhai-cloud









[jira] [Created] (SPARK-43947) Incorrect SparkException when missing amount in resources in Stage-Level Scheduling

2023-06-02 Thread Jacek Laskowski (Jira)
Jacek Laskowski created SPARK-43947:
---

 Summary: Incorrect SparkException when missing amount in resources 
in Stage-Level Scheduling
 Key: SPARK-43947
 URL: https://issues.apache.org/jira/browse/SPARK-43947
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 3.4.0
Reporter: Jacek Laskowski


[ResourceUtils.listResourceIds|https://github.com/apache/spark/blob/807abf9c53ee8c1c7ef69646ebd8a266f60d5580/core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala#L152-L155]
 can throw an exception for any missing config, not just `amount`.

{code:scala}
  val index = key.indexOf('.')
  if (index < 0) {
    throw new SparkException(s"You must specify an amount config for resource: $key " +
      s"config: $componentName.$RESOURCE_PREFIX.$key")
  }
{code}
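The reported behavior can be sketched in plain Python (the function and config names below are illustrative, not Spark's actual internals): any key lacking a `.` separator hits the same branch, so the error always blames `amount` even when a different sub-config was intended.

```python
def parse_resource_key(key, component_name="spark.executor", resource_prefix="resource"):
    # Mirrors the Scala snippet: a key without '.' always raises the
    # "must specify an amount" error, regardless of which part is missing.
    index = key.find(".")
    if index < 0:
        raise ValueError(
            f"You must specify an amount config for resource: {key} "
            f"config: {component_name}.{resource_prefix}.{key}"
        )
    return key[:index], key[index + 1:]

print(parse_resource_key("gpu.amount"))   # ('gpu', 'amount')
print(parse_resource_key("gpu.vendor"))   # ('gpu', 'vendor')
# parse_resource_key("gpu") raises an error mentioning 'amount', even though
# the missing config could be anything.
```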







[jira] [Updated] (SPARK-43947) Incorrect SparkException when missing config in resources in Stage-Level Scheduling

2023-06-02 Thread Jacek Laskowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-43947:

Summary: Incorrect SparkException when missing config in resources in 
Stage-Level Scheduling  (was: Incorrect SparkException when missing amount in 
resources in Stage-Level Scheduling)

> Incorrect SparkException when missing config in resources in Stage-Level 
> Scheduling
> ---
>
> Key: SPARK-43947
> URL: https://issues.apache.org/jira/browse/SPARK-43947
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.4.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> [ResourceUtils.listResourceIds|https://github.com/apache/spark/blob/807abf9c53ee8c1c7ef69646ebd8a266f60d5580/core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala#L152-L155]
>  can throw an exception for any missing config, not just `amount`.
> {code:scala}
>   val index = key.indexOf('.')
>   if (index < 0) {
>     throw new SparkException(s"You must specify an amount config for resource: $key " +
>       s"config: $componentName.$RESOURCE_PREFIX.$key")
>   }
> {code}






[jira] [Updated] (SPARK-43946) Add rule to remove unused CTEDef

2023-06-02 Thread jinhai-cloud (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jinhai-cloud updated SPARK-43946:
-
Description: 
{code:java}
with t1 as (
  select rand() c3
),
t2 as (select * from t1)
select c3 from t1 where c3 > 0 {code}
{code:java}
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.InlineCTE ===
 WithCTE                                               WithCTE
 :- CTERelationDef 0, false                            :- CTERelationDef 0, false
 :  +- Project [rand(3418873542988342437) AS c3#236]   :  +- Project [rand(3418873542988342437) AS c3#236]
 :     +- OneRowRelation                               :     +- OneRowRelation
!:- CTERelationDef 1, false                            +- Project [c3#236]
!:  +- Project [c3#236]                                   +- Filter (c3#236 > cast(0 as double))
!:     +- CTERelationRef 0, true, [c3#236]                   +- CTERelationRef 0, true, [c3#236]
!+- Project [c3#236]
!   +- Filter (c3#236 > cast(0 as double))
!      +- CTERelationRef 0, true, [c3#236]
{code}
When the InlineCTE rule is applied to the query above, CTERelationDef 0 cannot be inlined because its refCount is 2.

However, in the optimized logical plan the refCount of CTERelationDef 0 is actually 1, so the plan can be optimized further.

Therefore, we can add a rule *RemoveRedundantCTEDef* that deletes unreferenced CTERelationDefs so that the refCount is not miscalculated:
{code:java}
Project [c3#236]
+- Filter (c3#236 > cast(0 as double))
   +- Project [rand(-7871530451581327544) AS c3#236]
      +- OneRowRelation {code}
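The idea can be sketched in plain Python (the plan representation and function name below are hypothetical, not Catalyst's): keep only the CTE definitions reachable from the main query body, so a later inline pass sees correct reference counts.

```python
def remove_unused_cte_defs(defs, main_refs):
    # defs: {cte_id: set of cte_ids that its body references}
    # main_refs: cte_ids referenced by the final query body.
    # Keep only definitions reachable from the main query; a def referenced
    # solely by another unused def is dropped transitively.
    reachable, frontier = set(), list(main_refs)
    while frontier:
        cte = frontier.pop()
        if cte not in reachable:
            reachable.add(cte)
            frontier.extend(defs.get(cte, set()))
    return {c: body for c, body in defs.items() if c in reachable}

# The ticket's query: def 0 is t1, def 1 is t2 (whose body references t1),
# but the final SELECT reads only t1, so def 1 is unused and t1's true
# refCount becomes 1, which allows inlining.
defs = {0: set(), 1: {0}}
kept = remove_unused_cte_defs(defs, main_refs={0})
print(sorted(kept))  # [0]
```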

> Add rule to remove unused CTEDef
> 
>
> Key: SPARK-43946
> URL: https://issues.apache.org/jira/browse/SPARK-43946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.4, 3.3.2, 3.4.0
>Reporter: jinhai-cloud
>Priority: Major
>
> {code:java}
> with t1 as (
>   select rand() c3
> ),
> t2 as (select * from t1)
> select c3 from t1 where c3 > 0 {code}
> {code:java}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.InlineCTE ===
>  WithCTE                                               WithCTE
>  :- CTERelationDef 0, false                            :- CTERelationDef 0, 
> false
>  :  +- Project [rand(3418873542988342437) AS c3#236]   :  +- Project 
> [rand(3418873542988342437) AS c3#236]
>  :     +- OneRowRelation                               :     +- OneRowRelation
> !:- CTERelationDef 1, false                            +- Project [c3#236]
> !:  +- Project [c3#236]                                   +- Filter (c3#236 > 
> cast(0 as double))
> !:     +- CTERelationRef 0, true, [c3#236]                   +- 
> CTERelationRef 0, true, [c3#236]
> !+- Project [c3#236]                                   
> !   +- Filter (c3#236 > cast(0 as double))             
> !      +- CTERelationRef 0, true, [c3#236]             
>  {code}
> When the InlineCTE rule is applied to the query above, CTERelationDef 0 cannot
> be inlined because its refCount is 2.
> However, in the optimized logical plan the refCount of CTERelationDef 0 is
> actually 1, so the plan can be optimized further.
> Therefore, we can add a rule *RemoveRedundantCTEDef* that deletes unreferenced
> CTERelationDefs so that the refCount is not miscalculated:
> {code:java}
> Project [c3#236]
> +- Filter (c3#236 > cast(0 as double))
>    +- Project [rand(-7871530451581327544) AS c3#236]
>       +- OneRowRelation {code}






[jira] [Updated] (SPARK-43946) Add rule to remove unused CTEDef

2023-06-02 Thread jinhai-cloud (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jinhai-cloud updated SPARK-43946:
-
Description: 
{code:java}
with t1 as (
  select rand() c3
),
t2 as (select * from t1)
select c3 from t1 where c3 > 0 {code}
{code:java}
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.InlineCTE ===
 WithCTE                                               WithCTE
 :- CTERelationDef 0, false                            :- CTERelationDef 0, false
 :  +- Project [rand(3418873542988342437) AS c3#236]   :  +- Project [rand(3418873542988342437) AS c3#236]
 :     +- OneRowRelation                               :     +- OneRowRelation
!:- CTERelationDef 1, false                            +- Project [c3#236]
!:  +- Project [c3#236]                                   +- Filter (c3#236 > cast(0 as double))
!:     +- CTERelationRef 0, true, [c3#236]                   +- CTERelationRef 0, true, [c3#236]
!+- Project [c3#236]
!   +- Filter (c3#236 > cast(0 as double))
!      +- CTERelationRef 0, true, [c3#236]
{code}
When the InlineCTE rule is applied to the query above, CTERelationDef 0 cannot be inlined because its refCount is 2.

However, in the optimized logical plan the refCount of CTERelationDef 0 is actually 1, so the plan can be optimized further.

Therefore, we can add a rule *RemoveUnusedCTEDef* that deletes unreferenced CTERelationDefs so that the refCount is not miscalculated:
{code:java}
Project [c3#236]
+- Filter (c3#236 > cast(0 as double))
   +- Project [rand(-7871530451581327544) AS c3#236]
      +- OneRowRelation {code}

  was:
{code:java}
// code placeholder
with t1 as (
  select rand() c3
),
t2 as (select * from t1)
select c3 from t1 where c3 > 0 {code}
{code:java}
// code placeholder
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.InlineCTE ===
 WithCTE                                               WithCTE
 :- CTERelationDef 0, false                            :- CTERelationDef 0, 
false
 :  +- Project [rand(3418873542988342437) AS c3#236]   :  +- Project 
[rand(3418873542988342437) AS c3#236]
 :     +- OneRowRelation                               :     +- OneRowRelation
!:- CTERelationDef 1, false                            +- Project [c3#236]
!:  +- Project [c3#236]                                   +- Filter (c3#236 > 
cast(0 as double))
!:     +- CTERelationRef 0, true, [c3#236]                   +- CTERelationRef 
0, true, [c3#236]
!+- Project [c3#236]                                   
!   +- Filter (c3#236 > cast(0 as double))             
!      +- CTERelationRef 0, true, [c3#236]             
 {code}
When the above query applies the inlineCTE rule, inline is not possible because 
the refCount of CTERelationDef 0 is equal to 2.

However, according to the optimized logicalplan, the plan can be further 
optimized because the refCount of CTERelationDef 0 is equal to 1.

Therefore, we can add the rule *RemoveRedundantCTEDef* to delete the 
unreferenced CTERelationDef to prevent the refCount from being miscalculated
{code:java}
// code placeholder
Project [c3#236]
+- Filter (c3#236 > cast(0 as double))
   +- Project [rand(-7871530451581327544) AS c3#236]
      +- OneRowRelation {code}


> Add rule to remove unused CTEDef
> 
>
> Key: SPARK-43946
> URL: https://issues.apache.org/jira/browse/SPARK-43946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.4, 3.3.2, 3.4.0
>Reporter: jinhai-cloud
>Priority: Major
>
> {code:java}
> with t1 as (
>   select rand() c3
> ),
> t2 as (select * from t1)
> select c3 from t1 where c3 > 0 {code}
> {code:java}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.InlineCTE ===
>  WithCTE                                               WithCTE
>  :- CTERelationDef 0, false                            :- CTERelationDef 0, 
> false
>  :  +- Project [rand(3418873542988342437) AS c3#236]   :  +- Project 
> [rand(3418873542988342437) AS c3#236]
>  :     +- OneRowRelation                               :     +- OneRowRelation
> !:- CTERelationDef 1, false                            +- Project [c3#236]
> !:  +- Project [c3#236]                                   +- Filter (c3#236 > 
> cast(0 as double))
> !:     +- CTERelationRef 0, true, [c3#236]                   +- 
> CTERelationRef 0, true, [c3#236]
> !+- Project [c3#236]                                   
> !   +- Filter (c3#236 > cast(0 as double))             
> !      +- CTERelationRef 0, true, [c3#236]             
>  {code}
> When the above query applies the inlineCTE rule, inline is not possible 
> because the refCount of CTERelationDef 0 is equal to 2

[jira] [Commented] (SPARK-43946) Add rule to remove unused CTEDef

2023-06-02 Thread jinhai-cloud (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728620#comment-17728620
 ] 

jinhai-cloud commented on SPARK-43946:
--

[~cloud_fan], can you take a look at this issue?

> Add rule to remove unused CTEDef
> 
>
> Key: SPARK-43946
> URL: https://issues.apache.org/jira/browse/SPARK-43946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.4, 3.3.2, 3.4.0
>Reporter: jinhai-cloud
>Priority: Major
>
> {code:java}
> with t1 as (
>   select rand() c3
> ),
> t2 as (select * from t1)
> select c3 from t1 where c3 > 0 {code}
> {code:java}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.InlineCTE ===
>  WithCTE                                               WithCTE
>  :- CTERelationDef 0, false                            :- CTERelationDef 0, 
> false
>  :  +- Project [rand(3418873542988342437) AS c3#236]   :  +- Project 
> [rand(3418873542988342437) AS c3#236]
>  :     +- OneRowRelation                               :     +- OneRowRelation
> !:- CTERelationDef 1, false                            +- Project [c3#236]
> !:  +- Project [c3#236]                                   +- Filter (c3#236 > 
> cast(0 as double))
> !:     +- CTERelationRef 0, true, [c3#236]                   +- 
> CTERelationRef 0, true, [c3#236]
> !+- Project [c3#236]                                   
> !   +- Filter (c3#236 > cast(0 as double))             
> !      +- CTERelationRef 0, true, [c3#236]             
>  {code}
> When the InlineCTE rule is applied to the query above, CTERelationDef 0 cannot
> be inlined because its refCount is 2.
> However, in the optimized logical plan the refCount of CTERelationDef 0 is
> actually 1, so the plan can be optimized further.
> Therefore, we can add a rule *RemoveUnusedCTEDef* that deletes unreferenced
> CTERelationDefs so that the refCount is not miscalculated:
> {code:java}
> Project [c3#236]
> +- Filter (c3#236 > cast(0 as double))
>    +- Project [rand(-7871530451581327544) AS c3#236]
>       +- OneRowRelation {code}






[jira] [Created] (SPARK-43948) Assign names to the error class _LEGACY_ERROR_TEMP_[0050|0058|0059|1204]

2023-06-02 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43948:
---

 Summary: Assign names to the error class 
_LEGACY_ERROR_TEMP_[0050|0058|0059|1204]
 Key: SPARK-43948
 URL: https://issues.apache.org/jira/browse/SPARK-43948
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Created] (SPARK-43949) Upgrade Cloudpickle to 2.2.1

2023-06-02 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-43949:


 Summary: Upgrade Cloudpickle to 2.2.1
 Key: SPARK-43949
 URL: https://issues.apache.org/jira/browse/SPARK-43949
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.4.0, 3.3.2, 3.5.0
Reporter: Hyukjin Kwon


Cloudpickle 2.2.1 includes a fix for a namedtuple issue
(https://github.com/cloudpipe/cloudpickle/issues/460). PySpark relies heavily on
namedtuple, especially for RDDs, so we should upgrade to pick up the fix.
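As a stdlib-only illustration of why namedtuple serialization matters here (the cloudpickle 2.2.1 fix concerns pickling behavior for namedtuples; this sketch only shows the round-trip itself, not the bug):

```python
import pickle
from collections import namedtuple

# A module-level namedtuple round-trips through stdlib pickle; records in
# PySpark RDDs commonly travel between processes the same way.
Point = namedtuple("Point", ["x", "y"])

blob = pickle.dumps(Point(1, 2))
restored = pickle.loads(blob)
print(restored)  # Point(x=1, y=2)
```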






[jira] [Commented] (SPARK-43911) Directly use Set to consume iterator data to deduplicate, thereby reducing memory usage

2023-06-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728634#comment-17728634
 ] 

ASF GitHub Bot commented on SPARK-43911:


User 'mcdull-zhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/41419

> Directly use Set to consume iterator data to deduplicate, thereby reducing 
> memory usage
> ---
>
> Key: SPARK-43911
> URL: https://issues.apache.org/jira/browse/SPARK-43911
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: mcdull_zhang
>Priority: Minor
>
> When SubqueryBroadcastExec reuses the keys of a broadcast HashedRelation for
> dynamic partition pruning, it puts all the keys in an Array and then calls
> distinct on the Array to remove duplicates.
> In general, a broadcast HashedRelation may have many rows and the repetition
> rate of the key is high, so the Array can occupy a large amount of memory
> (memory that is not managed by the MemoryManager), which may trigger OOM.
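The proposed change can be sketched in plain Python (function names are illustrative): consume the iterator straight into a set instead of materializing every, possibly duplicated, key in a list first.

```python
def dedup_via_list(keys_iter):
    # Current approach: materialize every key first, then deduplicate.
    buf = list(keys_iter)            # peak memory ~ total number of rows
    return set(buf)

def dedup_via_set(keys_iter):
    # Proposed approach: consume the iterator directly into a set.
    out = set()
    for k in keys_iter:              # peak memory ~ number of distinct keys
        out.add(k)
    return out

keys = (i % 10 for i in range(1_000_000))   # high repetition rate
print(len(dedup_via_set(keys)))  # 10
```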






[jira] [Updated] (SPARK-43943) Add math functions to Scala and Python

2023-06-02 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-43943:
--
Description: 
Add following functions:

* ceiling
* e
* pi
* ln
* negative
* positive
* power
* sign
* std
* width_bucket

to:

* Scala API
* Python API
* Spark Connect Scala Client
* Spark Connect Python Client

  was:
Add following functions:

* ceiling
* e
* pi
* ln
* mod
* negative
* positive
* power
* sign
* std
* width_bucket

to:

* Scala API
* Python API
* Spark Connect Scala Client
* Spark Connect Python Client


> Add math functions to Scala and Python
> --
>
> Key: SPARK-43943
> URL: https://issues.apache.org/jira/browse/SPARK-43943
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Add following functions:
> * ceiling
> * e
> * pi
> * ln
> * negative
> * positive
> * power
> * sign
> * std
> * width_bucket
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client






[jira] [Commented] (SPARK-43879) Decouple handle command and send response on server side

2023-06-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728642#comment-17728642
 ] 

ASF GitHub Bot commented on SPARK-43879:


User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/41379

> Decouple handle command and send response on server side
> 
>
> Key: SPARK-43879
> URL: https://issues.apache.org/jira/browse/SPARK-43879
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>
> SparkConnectStreamHandler handles requests from the Connect client and sends
> responses back to it. It holds a StreamObserver component that is used to send
> responses.
> So I think the StreamObserver should be accessible only from within
> SparkConnectStreamHandler.
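The encapsulation being asked for can be sketched as follows (class and method names are hypothetical stand-ins, not the actual Connect server API): the handler owns the observer, so processing a command and sending the response stay decoupled behind one interface.

```python
class StreamObserver:
    """Stand-in for the gRPC response observer."""
    def __init__(self):
        self.sent = []

    def on_next(self, response):
        self.sent.append(response)


class StreamHandler:
    """Owns the observer; callers never touch it directly."""
    def __init__(self):
        self._observer = StreamObserver()   # reachable only via the handler

    def handle(self, command):
        result = f"handled:{command}"       # step 1: process the command
        self._send(result)                  # step 2: separately, send response
        return result

    def _send(self, response):
        self._observer.on_next(response)


handler = StreamHandler()
handler.handle("plan")
print(handler._observer.sent)  # ['handled:plan']  (peeking only for the demo)
```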






[jira] [Commented] (SPARK-43866) Partition filter condition should be pushed down to metastore query if it is an equivalence predicate

2023-06-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728643#comment-17728643
 ] 

ASF GitHub Bot commented on SPARK-43866:


User 'ming95' has created a pull request for this issue:
https://github.com/apache/spark/pull/41370

> Partition filter condition should be pushed down to metastore query if it is
> an equivalence predicate
> ---
>
> Key: SPARK-43866
> URL: https://issues.apache.org/jira/browse/SPARK-43866
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: ming95
>Priority: Major
>
> Typically, Hive partition fields are created as string types.
> {code:java}
> CREATE TABLE IF NOT EXISTS test_tb (
>   id int
> )
> PARTITIONED BY (dt string)
> {code}
> However, casts are often introduced inadvertently during use. For example:
> {code:java}
> select * from test_tb where dt=20230505;
> {code}
> The condition `dt=20230505` is not pushed down to the metastore because
> `20230505` is an IntegralType, resulting in a request for all partitions.
> However, for equivalence predicates, partition filter pushdown should be
> supported.
> This can hurt execution performance when the table has very many partitions.
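The proposed relaxation can be sketched in plain Python (the function name and the use of Python type names for catalog types are illustrative assumptions): for an equality predicate against a string partition column, a literal of another type can still be pushed down by converting it to a string; other predicates keep the conservative behavior.

```python
def pushable_filter(column_type, op, literal):
    # column_type: a Python type name standing in for the partition column's
    # catalog type. Returns the filter to send to the metastore, or None.
    if type(literal).__name__ == column_type:
        return (op, literal)            # types already match: push as-is
    if column_type == "str" and op == "=":
        return (op, str(literal))       # dt = 20230505  ->  dt = '20230505'
    return None                         # not pushed: all partitions fetched

print(pushable_filter("str", "=", 20230505))  # ('=', '20230505')
print(pushable_filter("str", ">", 20230505))  # None
```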






[jira] [Updated] (SPARK-43911) Use toSet to deduplicate the iterator data to prevent the creation of large Array

2023-06-02 Thread mcdull_zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mcdull_zhang updated SPARK-43911:
-
Summary: Use toSet to deduplicate the iterator data to prevent the creation 
of large Array  (was: Directly use Set to consume iterator data to deduplicate, 
thereby reducing memory usage)

> Use toSet to deduplicate the iterator data to prevent the creation of large 
> Array
> -
>
> Key: SPARK-43911
> URL: https://issues.apache.org/jira/browse/SPARK-43911
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: mcdull_zhang
>Priority: Minor
>
> When SubqueryBroadcastExec reuses the keys of a broadcast HashedRelation for
> dynamic partition pruning, it puts all the keys in an Array and then calls
> distinct on the Array to remove duplicates.
> In general, a broadcast HashedRelation may have many rows and the repetition
> rate of the key is high, so the Array can occupy a large amount of memory
> (memory that is not managed by the MemoryManager), which may trigger OOM.






[jira] [Updated] (SPARK-43916) Add percentile* to Scala and Python API

2023-06-02 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-43916:
---
Summary: Add percentile* to Scala and Python API  (was: Add percentile to 
Scala and Python API)

> Add percentile* to Scala and Python API
> ---
>
> Key: SPARK-43916
> URL: https://issues.apache.org/jira/browse/SPARK-43916
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, R, SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>







[jira] [Updated] (SPARK-43916) Add percentile like functions to Scala and Python API

2023-06-02 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-43916:
---
Summary: Add percentile like functions to Scala and Python API  (was: Add 
percentile* to Scala and Python API)

> Add percentile like functions to Scala and Python API
> -
>
> Key: SPARK-43916
> URL: https://issues.apache.org/jira/browse/SPARK-43916
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, R, SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>







[jira] [Commented] (SPARK-43063) `df.show` handle null should print NULL instead of null

2023-06-02 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728682#comment-17728682
 ] 

GridGain Integration commented on SPARK-43063:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41432

> `df.show` handle null should print NULL instead of null
> ---
>
> Key: SPARK-43063
> URL: https://issues.apache.org/jira/browse/SPARK-43063
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: yikaifei
>Assignee: yikaifei
>Priority: Trivial
> Fix For: 3.5.0
>
>
> `df.show` should print NULL instead of null when handling null values, for
> consistent behavior.
> {code:java}
> The following behavior is currently inconsistent:
> ``` shell
> scala> spark.sql("select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 
> 'New Jersey', 4, 'Seattle') as result").show(false)
> +------+
> |result|
> +------+
> |null  |
> +------+
> ```
> ``` shell
> spark-sql> DESC FUNCTION EXTENDED decode;
> function_desc
> Function: decode
> Class: org.apache.spark.sql.catalyst.expressions.Decode
> Usage:
> decode(bin, charset) - Decodes the first argument using the second 
> argument character set.
> decode(expr, search, result [, search, result ] ... [, default]) - 
> Compares expr
>   to each search value in order. If expr is equal to a search value, 
> decode returns
>   the corresponding result. If no match is found, then it returns 
> default. If default
>   is omitted, it returns null.
> Extended Usage:
> Examples:
>   > SELECT decode(encode('abc', 'utf-8'), 'utf-8');
>abc
>   > SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle', 'Non domestic');
>San Francisco
>   > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle', 'Non domestic');
>Non domestic
>   > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle');
>NULL
> Since: 3.2.0
> Time taken: 0.074 seconds, Fetched 4 row(s)
> ```
> ``` shell
> spark-sql> select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New 
> Jersey', 4, 'Seattle');
> NULL
> {code}
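The fix amounts to rendering null cells as the literal NULL in the show() formatter. A minimal pure-Python illustration of the intended behavior (not Spark's actual code; `format_cell` and `format_row` are hypothetical names used here for exposition):

```python
def format_cell(value):
    # Render a SQL NULL as the literal string "NULL", matching spark-sql's
    # console output, instead of the lowercase "null" that df.show()
    # previously printed for missing values.
    return "NULL" if value is None else str(value)

def format_row(values, width):
    # Pad each rendered cell to the column width, as df.show(false) does.
    return "|" + "|".join(format_cell(v).ljust(width) for v in values) + "|"
```

With a 6-character column, a null cell then renders as `|NULL  |` instead of `|null  |`.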






[jira] [Commented] (SPARK-43075) Change gRPC to grpcio when it is not installed.

2023-06-02 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728681#comment-17728681
 ] 

GridGain Integration commented on SPARK-43075:
--

User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/40716

> Change gRPC to grpcio when it is not installed.
> ---
>
> Key: SPARK-43075
> URL: https://issues.apache.org/jira/browse/SPARK-43075
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Resolved] (SPARK-43949) Upgrade Cloudpickle to 2.2.1

2023-06-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43949.
--
Fix Version/s: 3.4.1
   3.5.0
 Assignee: Hyukjin Kwon
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/41433

> Upgrade Cloudpickle to 2.2.1
> 
>
> Key: SPARK-43949
> URL: https://issues.apache.org/jira/browse/SPARK-43949
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.2, 3.4.0, 3.5.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.1, 3.5.0
>
>
> Cloudpickle 2.2.1 has a fix for named tuple issue 
> (https://github.com/cloudpipe/cloudpickle/issues/460). PySpark relies on 
> namedtuple heavily especially for RDD. We should upgrade and fix it.






[jira] [Created] (SPARK-43950) Upgrade kubernetes-client to 6.7.0

2023-06-02 Thread Jira
Bjørn Jørgensen created SPARK-43950:
---

 Summary: Upgrade kubernetes-client to 6.7.0
 Key: SPARK-43950
 URL: https://issues.apache.org/jira/browse/SPARK-43950
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build, Kubernetes
Affects Versions: 3.5.0
Reporter: Bjørn Jørgensen









[jira] [Resolved] (SPARK-43351) Support Golang in Spark Connect

2023-06-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43351.
--
  Assignee: BoYang
Resolution: Fixed

Fixed in https://github.com/apache/spark-connect-go/pull/6

> Support Golang in Spark Connect
> ---
>
> Key: SPARK-43351
> URL: https://issues.apache.org/jira/browse/SPARK-43351
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: BoYang
>Assignee: BoYang
>Priority: Major
>
> Support Spark Connect client side in Go programming language 






[jira] [Commented] (SPARK-43864) Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL

2023-06-02 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728724#comment-17728724
 ] 

gaoyajun02 commented on SPARK-43864:


It looks like a series of test package dependencies needs to be changed. I'm 
not very familiar with these; can you solve it? @[~panbingkun] 

> Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 
> 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL
> 
>
> Key: SPARK-43864
> URL: https://issues.apache.org/jira/browse/SPARK-43864
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: gaoyajun02
>Priority: Minor
>
> CVE-2023-26119 Detail: [https://nvd.nist.gov/vuln/detail/CVE-2023-26119]
> It is recommended to replace 'net.sourceforge.htmlunit' with 'org.htmlunit' in 
> Spark:
> {code:xml}
> <dependency>
>   <groupId>org.htmlunit</groupId>
>   <artifactId>htmlunit</artifactId>
>   <scope>test</scope>
> </dependency>
> <dependency>
>   <groupId>org.htmlunit</groupId>
>   <artifactId>htmlunit-core-js</artifactId>
>   <scope>test</scope>
> </dependency>
> {code}
> see: [https://www.htmlunit.org/migration.html]






[jira] [Created] (SPARK-43951) RocksDB state store can become corrupt on task retries

2023-06-02 Thread Adam Binford (Jira)
Adam Binford created SPARK-43951:


 Summary: RocksDB state store can become corrupt on task retries
 Key: SPARK-43951
 URL: https://issues.apache.org/jira/browse/SPARK-43951
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Adam Binford


A couple of our streaming jobs have failed since upgrading to Spark 3.4 with an 
error such as:

org.rocksdb.RocksDBException: Mismatch in unique ID on table file ###. 
Expected: {###,###} Actual: {###,###} in file /MANIFEST-

This is due to the change from 
[https://github.com/facebook/rocksdb/commit/6de7081cf37169989e289a4801187097f0c50fae]
 that enabled unique ID checks by default, and I finally tracked down the exact 
sequence of steps that leads to this failure in the way RocksDB state store is 
used.
 # A task fails after uploading the checkpoint to HDFS. Let's say it uploaded 
11.zip for version 11 of the table, and the task failed before it could finish, 
but after successfully uploading the checkpoint.
 # The same task is retried and goes back to load version 10 of the table as 
expected.
 # Cleanup/maintenance is called for this partition, which looks in HDFS for 
persisted versions and sees up through version 11 since that zip file was 
successfully uploaded on the previous task.
 # As part of resolving what SST files are part of each table version, 
versionToRocksDBFiles.put(version, newResolvedFiles) is called for version 11 
with its SST files that were uploaded in the first failed task.
 # The second attempt at the task commits and goes to sync its checkpoint to 
HDFS.
 # versionToRocksDBFiles contains the SST files to upload from step 4, and 
these files are considered "the same" as what's in the local working dir 
because the name and file size match.
 # No SST files are uploaded because they matched above, but in reality the 
unique ID inside the SST files is different (presumably it is randomly 
generated and inserted into each SST file); it just doesn't affect the size.
 # A new METADATA file is uploaded which has the new unique IDs listed inside.
 # When version 11 of the table is read during the next batch, the unique IDs 
in the METADATA file don't match the unique IDs in the SST files, which causes 
the exception.

 

This is basically a ticking time bomb for anyone using RocksDB. Thoughts on 
possible fixes would be:
 * Disable unique ID verification. I don't currently see a binding for this in 
the RocksDB java wrapper, so that would probably have to be added first.
 * Disable checking if files are already uploaded with the same size, and just 
always upload SST files no matter what.
 * Update the "same file" check to also be able to do some kind of CRC 
comparison or something like that.
 * Update the maintenance/cleanup to not update the versionToRocksDBFiles map.






[jira] [Commented] (SPARK-43951) RocksDB state store can become corrupt on task retries

2023-06-02 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728736#comment-17728736
 ] 

Adam Binford commented on SPARK-43951:
--

Of course, as soon as I finished figuring all this out, I found 
https://github.com/apache/spark/pull/41089

> RocksDB state store can become corrupt on task retries
> --
>
> Key: SPARK-43951
> URL: https://issues.apache.org/jira/browse/SPARK-43951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Adam Binford
>Priority: Major
>
> A couple of our streaming jobs have failed since upgrading to Spark 3.4 with 
> an error such as:
> org.rocksdb.RocksDBException: Mismatch in unique ID on table file ###. 
> Expected: {###,###} Actual: {###,###} in file /MANIFEST-
> This is due to the change from 
> [https://github.com/facebook/rocksdb/commit/6de7081cf37169989e289a4801187097f0c50fae]
>  that enabled unique ID checks by default, and I finally tracked down the 
> exact sequence of steps that leads to this failure in the way RocksDB state 
> store is used.
>  # A task fails after uploading the checkpoint to HDFS. Let's say it uploaded 
> 11.zip for version 11 of the table, and the task failed before it could finish, 
> but after successfully uploading the checkpoint.
>  # The same task is retried and goes back to load version 10 of the table as 
> expected.
>  # Cleanup/maintenance is called for this partition, which looks in HDFS for 
> persisted versions and sees up through version 11 since that zip file was 
> successfully uploaded on the previous task.
>  # As part of resolving what SST files are part of each table version, 
> versionToRocksDBFiles.put(version, newResolvedFiles) is called for version 11 
> with its SST files that were uploaded in the first failed task.
>  # The second attempt at the task commits and goes to sync its checkpoint to 
> HDFS.
>  # versionToRocksDBFiles contains the SST files to upload from step 4, and 
> these files are considered "the same" as what's in the local working dir 
> because the name and file size match.
>  # No SST files are uploaded because they matched above, but in reality the 
> unique ID inside the SST files is different (presumably it is randomly 
> generated and inserted into each SST file); it just doesn't affect the size.
>  # A new METADATA file is uploaded which has the new unique IDs listed inside.
>  # When version 11 of the table is read during the next batch, the unique IDs 
> in the METADATA file don't match the unique IDs in the SST files, which 
> causes the exception.
>  
> This is basically a ticking time bomb for anyone using RocksDB. Thoughts on 
> possible fixes would be:
>  * Disable unique ID verification. I don't currently see a binding for this 
> in the RocksDB java wrapper, so that would probably have to be added first.
>  * Disable checking if files are already uploaded with the same size, and 
> just always upload SST files no matter what.
>  * Update the "same file" check to also be able to do some kind of CRC 
> comparison or something like that.
>  * Update the maintenance/cleanup to not update the versionToRocksDBFiles map.






[jira] [Resolved] (SPARK-43951) RocksDB state store can become corrupt on task retries

2023-06-02 Thread Adam Binford (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Binford resolved SPARK-43951.
--
Resolution: Fixed

> RocksDB state store can become corrupt on task retries
> --
>
> Key: SPARK-43951
> URL: https://issues.apache.org/jira/browse/SPARK-43951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Adam Binford
>Priority: Major
>
> A couple of our streaming jobs have failed since upgrading to Spark 3.4 with 
> an error such as:
> org.rocksdb.RocksDBException: Mismatch in unique ID on table file ###. 
> Expected: {###,###} Actual: {###,###} in file /MANIFEST-
> This is due to the change from 
> [https://github.com/facebook/rocksdb/commit/6de7081cf37169989e289a4801187097f0c50fae]
>  that enabled unique ID checks by default, and I finally tracked down the 
> exact sequence of steps that leads to this failure in the way RocksDB state 
> store is used.
>  # A task fails after uploading the checkpoint to HDFS. Let's say it uploaded 
> 11.zip for version 11 of the table, and the task failed before it could finish, 
> but after successfully uploading the checkpoint.
>  # The same task is retried and goes back to load version 10 of the table as 
> expected.
>  # Cleanup/maintenance is called for this partition, which looks in HDFS for 
> persisted versions and sees up through version 11 since that zip file was 
> successfully uploaded on the previous task.
>  # As part of resolving what SST files are part of each table version, 
> versionToRocksDBFiles.put(version, newResolvedFiles) is called for version 11 
> with its SST files that were uploaded in the first failed task.
>  # The second attempt at the task commits and goes to sync its checkpoint to 
> HDFS.
>  # versionToRocksDBFiles contains the SST files to upload from step 4, and 
> these files are considered "the same" as what's in the local working dir 
> because the name and file size match.
>  # No SST files are uploaded because they matched above, but in reality the 
> unique ID inside the SST files is different (presumably it is randomly 
> generated and inserted into each SST file); it just doesn't affect the size.
>  # A new METADATA file is uploaded which has the new unique IDs listed inside.
>  # When version 11 of the table is read during the next batch, the unique IDs 
> in the METADATA file don't match the unique IDs in the SST files, which 
> causes the exception.
>  
> This is basically a ticking time bomb for anyone using RocksDB. Thoughts on 
> possible fixes would be:
>  * Disable unique ID verification. I don't currently see a binding for this 
> in the RocksDB java wrapper, so that would probably have to be added first.
>  * Disable checking if files are already uploaded with the same size, and 
> just always upload SST files no matter what.
>  * Update the "same file" check to also be able to do some kind of CRC 
> comparison or something like that.
>  * Update the maintenance/cleanup to not update the versionToRocksDBFiles map.






[jira] [Commented] (SPARK-43943) Add math functions to Scala and Python

2023-06-02 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728738#comment-17728738
 ] 

GridGain Integration commented on SPARK-43943:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/41435

> Add math functions to Scala and Python
> --
>
> Key: SPARK-43943
> URL: https://issues.apache.org/jira/browse/SPARK-43943
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Add following functions:
> * ceiling
> * e
> * pi
> * ln
> * negative
> * positive
> * power
> * sign
> * std
> * width_bucket
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client






[jira] [Commented] (SPARK-43864) Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL

2023-06-02 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728759#comment-17728759
 ] 

BingKun Pan commented on SPARK-43864:
-

[~gaoyajun02] Okay, Let me investigate it first.

 

> Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 
> 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL
> 
>
> Key: SPARK-43864
> URL: https://issues.apache.org/jira/browse/SPARK-43864
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: gaoyajun02
>Priority: Minor
>
> CVE-2023-26119 Detail: [https://nvd.nist.gov/vuln/detail/CVE-2023-26119]
> It is recommended to replace 'net.sourceforge.htmlunit' with 'org.htmlunit' in 
> Spark:
> {code:xml}
> <dependency>
>   <groupId>org.htmlunit</groupId>
>   <artifactId>htmlunit</artifactId>
>   <scope>test</scope>
> </dependency>
> <dependency>
>   <groupId>org.htmlunit</groupId>
>   <artifactId>htmlunit-core-js</artifactId>
>   <scope>test</scope>
> </dependency>
> {code}
> see: [https://www.htmlunit.org/migration.html]






[jira] [Comment Edited] (SPARK-43864) Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL

2023-06-02 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728759#comment-17728759
 ] 

BingKun Pan edited comment on SPARK-43864 at 6/2/23 2:20 PM:
-

@[~gaoyajun02] Okay, Let me investigate it first.

 


was (Author: panbingkun):
[~gaoyajun02] Okay, Let me investigate it first.

 

> Versions of the package net.sourceforge.htmlunit:htmlunit from 0 and before 
> 3.0.0 are vulnerable to Remote Code Execution (RCE) via XSTL
> 
>
> Key: SPARK-43864
> URL: https://issues.apache.org/jira/browse/SPARK-43864
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: gaoyajun02
>Priority: Minor
>
> CVE-2023-26119 Detail: [https://nvd.nist.gov/vuln/detail/CVE-2023-26119]
> It is recommended to replace 'net.sourceforge.htmlunit' with 'org.htmlunit' in 
> Spark:
> {code:xml}
> <dependency>
>   <groupId>org.htmlunit</groupId>
>   <artifactId>htmlunit</artifactId>
>   <scope>test</scope>
> </dependency>
> <dependency>
>   <groupId>org.htmlunit</groupId>
>   <artifactId>htmlunit-core-js</artifactId>
>   <scope>test</scope>
> </dependency>
> {code}
> see: [https://www.htmlunit.org/migration.html]






[jira] [Created] (SPARK-43952) Cancel Spark jobs not only by a single "jobgroup", but allow multiple "job tags"

2023-06-02 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-43952:
-

 Summary: Cancel Spark jobs not only by a single "jobgroup", but 
allow multiple "job tags"
 Key: SPARK-43952
 URL: https://issues.apache.org/jira/browse/SPARK-43952
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Juliusz Sompolski


Currently, the only way to cancel running Spark Jobs is by using 
SparkContext.cancelJobGroup, using a job group name that was previously set 
using SparkContext.setJobGroup. This is problematic if multiple different parts 
of the system want to do cancellation, and set their own ids.

For example, 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L133]
 sets its own job group, which may override the job group set by the user. This 
way, if the user cancels the job group they set, it will not cancel these 
broadcast jobs launched from within their jobs.

As a solution, consider adding SparkContext.addJobTag / 
SparkContext.removeJobTag, which would allow jobs to carry multiple "tags", and 
introduce SparkContext.cancelJobsByTag to allow more flexible cancellation of 
jobs.
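A rough sketch of the proposed semantics in plain Python (purely illustrative; the JobRegistry class below is a hypothetical stand-in, not the proposed SparkContext API):

```python
class JobRegistry:
    """Toy model of tag-based job cancellation. In contrast to a single
    job group, a job may carry several tags, and cancelling by one tag
    leaves jobs that only carry other tags untouched."""

    def __init__(self):
        self._tags = {}          # job_id -> set of tags
        self.cancelled = set()   # ids of cancelled jobs

    def add_job_tag(self, job_id, tag):
        self._tags.setdefault(job_id, set()).add(tag)

    def remove_job_tag(self, job_id, tag):
        self._tags.get(job_id, set()).discard(tag)

    def cancel_jobs_by_tag(self, tag):
        # Cancel every job carrying this tag; other jobs keep running.
        for job_id, tags in self._tags.items():
            if tag in tags:
                self.cancelled.add(job_id)
```

Here an internal broadcast job tagged with both the user's tag and an internal "broadcast" tag can be cancelled by either party, while cancelling by the internal tag does not touch unrelated user jobs.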






[jira] [Commented] (SPARK-43754) Spark Connect Session & Query lifecycle

2023-06-02 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728790#comment-17728790
 ] 

Juliusz Sompolski commented on SPARK-43754:
---

Indirectly related to https://issues.apache.org/jira/browse/SPARK-43952 (so 
that Spark Connect query cancellation does not conflict with other places 
that set job groups)

> Spark Connect Session & Query lifecycle
> ---
>
> Key: SPARK-43754
> URL: https://issues.apache.org/jira/browse/SPARK-43754
> Project: Spark
>  Issue Type: Epic
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Currently, queries in Spark Connect are executed within the RPC handler.
> We want to detach the RPC interface from actual sessions and execution, so 
> that we can make the interface more flexible:
>  * maintain long-running sessions, independent of an unbroken gRPC channel
>  * be able to cancel queries
>  * have interfaces for retrieving query results other than push from the server






[jira] [Commented] (SPARK-43952) Cancel Spark jobs not only by a single "jobgroup", but allow multiple "job tags"

2023-06-02 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728791#comment-17728791
 ] 

Juliusz Sompolski commented on SPARK-43952:
---

Indirectly related to https://issues.apache.org/jira/browse/SPARK-43754, so 
that Spark Connect query cancellation does not conflict with other places 
that set job groups.

> Cancel Spark jobs not only by a single "jobgroup", but allow multiple "job 
> tags"
> 
>
> Key: SPARK-43952
> URL: https://issues.apache.org/jira/browse/SPARK-43952
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Currently, the only way to cancel running Spark Jobs is by using 
> SparkContext.cancelJobGroup, using a job group name that was previously set 
> using SparkContext.setJobGroup. This is problematic if multiple different 
> parts of the system want to do cancellation, and set their own ids.
> For example, 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L133]
>  sets its own job group, which may override the job group set by the user. This 
> way, if the user cancels the job group they set, it will not cancel the 
> broadcast jobs launched from within their jobs.
> As a solution, consider adding SparkContext.addJobTag / 
> SparkContext.removeJobTag, which would allow jobs to carry multiple "tags", and 
> introduce SparkContext.cancelJobsByTag to allow more flexible cancellation of 
> jobs.






[jira] [Updated] (SPARK-43922) Add named argument support in parser for function call

2023-06-02 Thread Richard Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Yu updated SPARK-43922:
---
Description: 
Today, we are implementing named argument support for user defined functions, 
some built-in functions, and table-valued functions. For the first step towards 
building such a feature, we need to make some requisite changes in the parser. 

To accomplish this, in this issue, we plan to add some new syntax tokens to the 
parser in Spark. Changes will also be made in the abstract syntax tree builder 
as well to reflect these new tokens. Such changes will first be restricted to 
normal function calls (table value functions will be treated separately). 

  was:
Today, we are implementing named parameter support for user defined functions, 
some built-in functions, and table-valued functions. For the first step towards 
building such a feature, we need to make some requisite changes in the parser. 

To accomplish this, in this issue, we plan to add some new syntax tokens to the 
parser in Spark. Changes will also be made in the abstract syntax tree builder 
as well to reflect these new tokens. Such changes will first be restricted to 
normal function calls (table value functions will be treated separately). 


> Add named argument support in parser for function call
> --
>
> Key: SPARK-43922
> URL: https://issues.apache.org/jira/browse/SPARK-43922
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Richard Yu
>Priority: Major
>
> Today, we are implementing named argument support for user defined functions, 
> some built-in functions, and table-valued functions. For the first step 
> towards building such a feature, we need to make some requisite changes in 
> the parser. 
> To accomplish this, in this issue, we plan to add some new syntax tokens to 
> the parser in Spark. Changes will also be made in the abstract syntax tree 
> builder as well to reflect these new tokens. Such changes will first be 
> restricted to normal function calls (table value functions will be treated 
> separately). 






[jira] [Updated] (SPARK-43922) Add named argument support in parser for function call

2023-06-02 Thread Richard Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Yu updated SPARK-43922:
---
Summary: Add named argument support in parser for function call  (was: Add 
named parameter support in parser for function call)

> Add named argument support in parser for function call
> --
>
> Key: SPARK-43922
> URL: https://issues.apache.org/jira/browse/SPARK-43922
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Richard Yu
>Priority: Major
>
> Today, we are implementing named parameter support for user defined 
> functions, some built-in functions, and table-valued functions. For the first 
> step towards building such a feature, we need to make some requisite changes 
> in the parser. 
> To accomplish this, in this issue, we plan to add some new syntax tokens to 
> the parser in Spark. Changes will also be made in the abstract syntax tree 
> builder as well to reflect these new tokens. Such changes will first be 
> restricted to normal function calls (table value functions will be treated 
> separately). 






[jira] [Resolved] (SPARK-43945) Fix bug for `SQLQueryTestSuite` when run on local env

2023-06-02 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43945.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41431
[https://github.com/apache/spark/pull/41431]

> Fix bug for `SQLQueryTestSuite` when run on local env
> -
>
> Key: SPARK-43945
> URL: https://issues.apache.org/jira/browse/SPARK-43945
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-43945) Fix bug for `SQLQueryTestSuite` when run on local env

2023-06-02 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43945:


Assignee: BingKun Pan

> Fix bug for `SQLQueryTestSuite` when run on local env
> -
>
> Key: SPARK-43945
> URL: https://issues.apache.org/jira/browse/SPARK-43945
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>







[jira] [Assigned] (SPARK-43904) Upgrade jackson to 2.15.2

2023-06-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43904:
-

Assignee: BingKun Pan

> Upgrade jackson to 2.15.2
> -
>
> Key: SPARK-43904
> URL: https://issues.apache.org/jira/browse/SPARK-43904
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>







[jira] [Resolved] (SPARK-43950) Upgrade kubernetes-client to 6.7.0

2023-06-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43950.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41434
[https://github.com/apache/spark/pull/41434]

> Upgrade kubernetes-client to 6.7.0
> --
>
> Key: SPARK-43950
> URL: https://issues.apache.org/jira/browse/SPARK-43950
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build, Kubernetes
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-43950) Upgrade kubernetes-client to 6.7.0

2023-06-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43950:
-

Assignee: Bjørn Jørgensen

> Upgrade kubernetes-client to 6.7.0
> --
>
> Key: SPARK-43950
> URL: https://issues.apache.org/jira/browse/SPARK-43950
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build, Kubernetes
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>







[jira] [Resolved] (SPARK-43904) Upgrade jackson to 2.15.2

2023-06-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43904.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41414
[https://github.com/apache/spark/pull/41414]

> Upgrade jackson to 2.15.2
> -
>
> Key: SPARK-43904
> URL: https://issues.apache.org/jira/browse/SPARK-43904
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>







[jira] [Created] (SPARK-43953) Remove pass

2023-06-02 Thread Jira
Bjørn Jørgensen created SPARK-43953:
---

 Summary: Remove pass
 Key: SPARK-43953
 URL: https://issues.apache.org/jira/browse/SPARK-43953
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Bjørn Jørgensen









[jira] [Assigned] (SPARK-36612) Support left outer join build left or right outer join build right in shuffled hash join

2023-06-02 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-36612:


Assignee: Szehon Ho

> Support left outer join build left or right outer join build right in 
> shuffled hash join
> 
>
> Key: SPARK-36612
> URL: https://issues.apache.org/jira/browse/SPARK-36612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: mcdull_zhang
>Assignee: Szehon Ho
>Priority: Major
>
> Currently, Spark SQL does not support building the left side in a left outer 
> join (or the right side in a right outer join) for shuffled hash join.
> However, in our production environment there are many scenarios where a small 
> table is left-joined with a large table, and the large table often has data 
> skew (which AQE currently cannot handle).
> Inspired by SPARK-32399, we can use a similar approach to implement left 
> outer join build left.
> I think this change is very valuable; I would like to hear how the community 
> views this proposal.
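The idea above can be sketched in plain Python: a left outer join that builds the hash table on the (small) left side and tracks which left rows found a match, so unmatched left rows can still be emitted with nulls. This is only an illustrative sketch of the technique under discussion, not Spark's actual implementation.

```python
from collections import defaultdict

def left_outer_join_build_left(left, right, key):
    """Left outer join that builds the hash table on the LEFT side.

    Each left row is tracked so that rows with no match on the right
    side are still emitted (paired with None), preserving left outer
    join semantics even though the build side is the left one.
    """
    table = defaultdict(list)   # join key -> list of left rows
    matched = set()             # ids of left rows that found a match
    for row in left:
        table[row[key]].append(row)

    out = []
    for r in right:             # stream (probe) the large right side
        for l in table.get(r[key], []):
            matched.add(id(l))
            out.append((l, r))

    for rows in table.values(): # emit unmatched left rows with None
        for l in rows:
            if id(l) not in matched:
                out.append((l, None))
    return out

small = [{"k": 1, "v": "a"}, {"k": 2, "v": "b"}]
large = [{"k": 1, "v": "x"}, {"k": 1, "v": "y"}, {"k": 3, "v": "z"}]
result = left_outer_join_build_left(small, large, "k")
```

This is why the build side matters for skew: only the small left table needs to fit in memory, while the skewed right side is streamed row by row.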






[jira] [Resolved] (SPARK-36612) Support left outer join build left or right outer join build right in shuffled hash join

2023-06-02 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-36612.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

> Support left outer join build left or right outer join build right in 
> shuffled hash join
> 
>
> Key: SPARK-36612
> URL: https://issues.apache.org/jira/browse/SPARK-36612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: mcdull_zhang
>Assignee: Szehon Ho
>Priority: Major
> Fix For: 3.5.0
>
>
> Currently, Spark SQL does not support building the left side in a left outer 
> join (or the right side in a right outer join) for shuffled hash join.
> However, in our production environment there are many scenarios where a small 
> table is left-joined with a large table, and the large table often has data 
> skew (which AQE currently cannot handle).
> Inspired by SPARK-32399, we can use a similar approach to implement left 
> outer join build left.
> I think this change is very valuable; I would like to hear how the community 
> views this proposal.






[jira] [Resolved] (SPARK-43380) Fix Avro data type conversion issues to avoid producing incorrect results

2023-06-02 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-43380.

Fix Version/s: 3.5.0
   (was: 3.4.0)
   Resolution: Fixed

Issue resolved by pull request 41052
[https://github.com/apache/spark/pull/41052]

> Fix Avro data type conversion issues to avoid producing incorrect results
> -
>
> Key: SPARK-43380
> URL: https://issues.apache.org/jira/browse/SPARK-43380
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Zerui Bao
>Priority: Major
> Fix For: 3.5.0
>
>
> We found the following issues with open-source Avro:
>  * Interval types can be read as date or timestamp types, which would lead 
> to wildly different results
>  * Decimal types can be read with lower precision, which leads to data being 
> read as {{null}} instead of suggesting that a wider decimal format should be 
> provided
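The second issue is the more subtle one: silently returning null on a narrowing decimal read corrupts data, whereas failing loudly points the user at the fix. A minimal sketch of that behavior in plain Python (this is an illustration of the principle, not Avro's or Spark's reader code; `read_decimal` is a hypothetical helper):

```python
from decimal import Decimal

def read_decimal(value: str, precision: int, scale: int):
    """Parse a decimal and enforce the integral-digit budget of the
    declared (precision, scale).

    Returning None on overflow would silently corrupt data; raising
    instead tells the caller to widen the declared decimal type.
    """
    d = Decimal(value)
    t = d.as_tuple()
    int_digits = len(t.digits) + t.exponent  # digits left of the point
    if int_digits > precision - scale:
        raise ValueError(
            f"value {value} does not fit decimal({precision},{scale}); "
            f"use a wider decimal type"
        )
    return d

ok = read_decimal("123.45", 5, 2)   # fits decimal(5,2)
```

With this shape, `read_decimal("1234.5", 5, 2)` raises instead of quietly producing null.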






[jira] [Created] (SPARK-43954) Upgrade sbt from 1.8.2 to 1.9.0

2023-06-02 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43954:
---

 Summary: Upgrade sbt from 1.8.2 to 1.9.0
 Key: SPARK-43954
 URL: https://issues.apache.org/jira/browse/SPARK-43954
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Updated] (SPARK-43954) Upgrade sbt from 1.8.3 to 1.9.0

2023-06-02 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43954:

Summary: Upgrade sbt from 1.8.3 to 1.9.0  (was: Upgrade sbt from 1.8.2 to 
1.9.0)

> Upgrade sbt from 1.8.3 to 1.9.0
> ---
>
> Key: SPARK-43954
> URL: https://issues.apache.org/jira/browse/SPARK-43954
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Created] (SPARK-43955) Upgrade `scalafmt` from 3.7.3 to 3.7.4

2023-06-02 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43955:
---

 Summary: Upgrade `scalafmt` from 3.7.3 to 3.7.4
 Key: SPARK-43955
 URL: https://issues.apache.org/jira/browse/SPARK-43955
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Updated] (SPARK-43955) Upgrade `scalafmt` from 3.7.3 to 3.7.4

2023-06-02 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43955:

Description: Release notes: 
https://github.com/scalameta/scalafmt/releases/tag/v3.7.4

> Upgrade `scalafmt` from 3.7.3 to 3.7.4
> --
>
> Key: SPARK-43955
> URL: https://issues.apache.org/jira/browse/SPARK-43955
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>
> Release notes: https://github.com/scalameta/scalafmt/releases/tag/v3.7.4






[jira] [Updated] (SPARK-43954) Upgrade sbt from 1.8.3 to 1.9.0

2023-06-02 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43954:

Description: Release notes: [https://github.com/sbt/sbt/releases/tag/v1.9.0]

> Upgrade sbt from 1.8.3 to 1.9.0
> ---
>
> Key: SPARK-43954
> URL: https://issues.apache.org/jira/browse/SPARK-43954
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>
> Release notes: [https://github.com/sbt/sbt/releases/tag/v1.9.0]






[jira] [Created] (SPARK-43956) Fix the bug doesn't display column's sql for Percentile[Cont|Disc]

2023-06-02 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-43956:
--

 Summary: Fix the bug doesn't display column's sql for 
Percentile[Cont|Disc]
 Key: SPARK-43956
 URL: https://issues.apache.org/jira/browse/SPARK-43956
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: jiaan.geng
 Fix For: 3.5.0


Last year, I contributed the Percentile[Cont|Disc] functions to Spark SQL.
Recently, I found that the sql method of Percentile[Cont|Disc] does not render 
the column's SQL properly.
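The kind of fix described can be sketched as follows: when rendering the expression to a SQL string, the sort column's own SQL must appear inside the WITHIN GROUP clause. The class names below are hypothetical stand-ins for illustration, not Spark's actual expression classes.

```python
class Column:
    """Minimal stand-in for an expression that can render itself to SQL."""
    def __init__(self, name):
        self.name = name

    def sql(self):
        return f"`{self.name}`"


class PercentileCont:
    """SQL-string rendering for percentile_cont.

    The bug described above omitted the sort column's SQL; a correct
    rendering must include it inside WITHIN GROUP (ORDER BY ...).
    """
    def __init__(self, percentage, sort_column, descending=False):
        self.percentage = percentage
        self.sort_column = sort_column
        self.descending = descending

    def sql(self):
        direction = " DESC" if self.descending else ""
        return (
            f"percentile_cont({self.percentage}) "
            f"WITHIN GROUP (ORDER BY {self.sort_column.sql()}{direction})"
        )


expr = PercentileCont(0.5, Column("salary"))
rendered = expr.sql()
```

Here `rendered` includes the column's SQL, e.g. the backticked `salary` reference, rather than dropping it from the generated string.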






[jira] [Resolved] (SPARK-43956) Fix the bug doesn't display column's sql for Percentile[Cont|Disc]

2023-06-02 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-43956.
-
  Assignee: jiaan.geng
Resolution: Fixed

> Fix the bug doesn't display column's sql for Percentile[Cont|Disc]
> --
>
> Key: SPARK-43956
> URL: https://issues.apache.org/jira/browse/SPARK-43956
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>
> Last year, I contributed the Percentile[Cont|Disc] functions to Spark SQL.
> Recently, I found that the sql method of Percentile[Cont|Disc] does not 
> render the column's SQL properly.






[jira] [Commented] (SPARK-43956) Fix the bug doesn't display column's sql for Percentile[Cont|Disc]

2023-06-02 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728926#comment-17728926
 ] 

jiaan.geng commented on SPARK-43956:


resolved by https://github.com/apache/spark/pull/41436

> Fix the bug doesn't display column's sql for Percentile[Cont|Disc]
> --
>
> Key: SPARK-43956
> URL: https://issues.apache.org/jira/browse/SPARK-43956
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>
> Last year, I contributed the Percentile[Cont|Disc] functions to Spark SQL.
> Recently, I found that the sql method of Percentile[Cont|Disc] does not 
> render the column's SQL properly.


