[jira] [Resolved] (SPARK-43627) Enable pyspark.pandas.spark.functions.skew in Spark Connect.

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43627.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41604
[https://github.com/apache/spark/pull/41604]

> Enable pyspark.pandas.spark.functions.skew in Spark Connect.
> 
>
> Key: SPARK-43627
> URL: https://issues.apache.org/jira/browse/SPARK-43627
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>
> Enable pyspark.pandas.spark.functions.skew in Spark Connect.






[jira] [Resolved] (SPARK-43626) Enable pyspark.pandas.spark.functions.kurt in Spark Connect.

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43626.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41604
[https://github.com/apache/spark/pull/41604]

> Enable pyspark.pandas.spark.functions.kurt in Spark Connect.
> 
>
> Key: SPARK-43626
> URL: https://issues.apache.org/jira/browse/SPARK-43626
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>
> Enable pyspark.pandas.spark.functions.kurt in Spark Connect.






[jira] [Created] (SPARK-44064) Maven test `ProductAggSuite` aborted

2023-06-14 Thread Yang Jie (Jira)
Yang Jie created SPARK-44064:


 Summary: Maven test `ProductAggSuite` aborted
 Key: SPARK-44064
 URL: https://issues.apache.org/jira/browse/SPARK-44064
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Yang Jie


Running

{code:bash}
./build/mvn -DskipTests -Pyarn -Pmesos -Pkubernetes -Pvolcano -Phive \
  -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl clean install
build/mvn test -pl sql/catalyst
{code}
aborts with:

 
{code:java}
ProductAggSuite:
*** RUN ABORTED ***
  java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$
  at 
org.apache.spark.sql.catalyst.expressions.codegen.JavaCode$.variable(javaCode.scala:64)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.JavaCode$.isNullVariable(javaCode.scala:77)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:200)
  at scala.Option.getOrElse(Option.scala:189)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.$anonfun$create$1(GenerateSafeProjection.scala:156)
  at scala.collection.immutable.List.map(List.scala:293)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:153)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:39)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1369)
 {code}
 






[jira] [Updated] (SPARK-44064) Maven test `ProductAggSuite` aborted

2023-06-14 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-44064:
-
Issue Type: Bug  (was: Improvement)

> Maven test `ProductAggSuite` aborted
> 
>
> Key: SPARK-44064
> URL: https://issues.apache.org/jira/browse/SPARK-44064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> Running
> {code:bash}
> ./build/mvn -DskipTests -Pyarn -Pmesos -Pkubernetes -Pvolcano -Phive \
>   -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl clean install
> build/mvn test -pl sql/catalyst
> {code}
> aborts with:
>  
> {code:java}
> ProductAggSuite:
> *** RUN ABORTED ***
>   java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.JavaCode$.variable(javaCode.scala:64)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.JavaCode$.isNullVariable(javaCode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:200)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.$anonfun$create$1(GenerateSafeProjection.scala:156)
>   at scala.collection.immutable.List.map(List.scala:293)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:153)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:39)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1369)
>  {code}
>  






[jira] [Commented] (SPARK-44058) Remove deprecated API usage in HiveShim.scala

2023-06-14 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732851#comment-17732851
 ] 

Yuming Wang commented on SPARK-44058:
-

This is used to connect to Hive metastore 0.12.

> Remove deprecated API usage in HiveShim.scala
> -
>
> Key: SPARK-44058
> URL: https://issues.apache.org/jira/browse/SPARK-44058
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.4.0
>Reporter: Aman Raj
>Priority: Major
>
> Spark's HiveShim.scala calls this particular method in Hive:
> {code:scala}
> createPartitionMethod.invoke(
>   hive,
>   table,
>   spec,
>   location,
>   params,       // partParams
>   null,         // inputFormat
>   null,         // outputFormat
>   -1: JInteger, // numBuckets
>   null,         // cols
>   null,         // serializationLib
>   null,         // serdeParams
>   null,         // bucketCols
>   null)         // sortCols
> {code}
> There is no such implementation of createPartition in Hive anymore. The only 
> definition is:
> {code:java}
> public Partition createPartition(Table tbl, Map<String, String> partSpec)
>     throws HiveException {
>   try {
>     org.apache.hadoop.hive.metastore.api.Partition part =
>         Partition.createMetaPartitionObject(tbl, partSpec, null);
>     AcidUtils.TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, tbl);
>     part.setWriteId(tableSnapshot != null ? tableSnapshot.getWriteId() : 0);
>     return new Partition(tbl, getMSC().add_partition(part));
>   } catch (Exception e) {
>     LOG.error(StringUtils.stringifyException(e));
>     throw new HiveException(e);
>   }
> }
> {code}
> *The 12-parameter implementation was removed in HIVE-5951.*
> The issue is that this 12-parameter overload of createPartition was added in 
> Hive 0.12 and removed again in Hive 0.13. When Spark still used Hive 0.12, 
> the SPARK-15334 commit added this 12-parameter invocation, but after Hive 
> moved to the newer API the call was never updated in Spark OSS, which looks 
> to us like a bug on the Spark side.
> We need to migrate to the newest implementation of Hive's createPartition 
> method, otherwise this flow can break.
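For context, a rough, hedged sketch (the object name and plumbing are illustrative, not Spark's actual shim code) of what reflectively binding the surviving two-argument overload could look like:

{code:scala}
// Illustrative only: bind the 2-argument
// createPartition(Table, Map<String, String>) that exists in Hive 0.13+,
// instead of the 12-argument overload removed by HIVE-5951.
import java.util.{Map => JMap}
import org.apache.hadoop.hive.ql.metadata.{Hive, Partition, Table}

object Hive013PlusShim {
  private lazy val createPartitionMethod =
    classOf[Hive].getMethod("createPartition",
      classOf[Table], classOf[JMap[_, _]])

  def createPartition(hive: Hive, table: Table,
      spec: JMap[String, String]): Partition =
    createPartitionMethod.invoke(hive, table, spec).asInstanceOf[Partition]
}
{code}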






[jira] [Commented] (SPARK-43259) Assign a name to the error class _LEGACY_ERROR_TEMP_2024

2023-06-14 Thread Abhijeet Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732843#comment-17732843
 ] 

Abhijeet Singh commented on SPARK-43259:


I want to work on this issue.
I have raised a PR for the same: https://github.com/apache/spark/pull/41607

> Assign a name to the error class _LEGACY_ERROR_TEMP_2024
> 
>
> Key: SPARK-43259
> URL: https://issues.apache.org/jira/browse/SPARK-43259
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2024* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the example in error-classes.json).
> Add a test that triggers the error from user code if such a test doesn't 
> exist yet. Check the exception fields by using {*}checkError(){*}. That 
> function checks only the valuable error fields and avoids depending on the 
> error text message, so tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate 
> other tests that might trigger the error onto checkError(); a sketch follows 
> after this description.
> If you cannot reproduce the error from user space (using a SQL query), 
> replace the error with an internal error; see 
> {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is 
> not clear, and propose to users how to avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]
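A minimal sketch of the kind of test this asks for, assuming the usual {{checkError()}} helper from SparkFunSuite; the error class name, the triggering query, and the parameters below are placeholders for whatever this ticket decides:

{code:scala}
// Hypothetical sketch: "PROPER_ERROR_NAME", the query, and the parameters
// map are placeholders for the name this ticket assigns to
// _LEGACY_ERROR_TEMP_2024 and for a user-space reproduction.
val e = intercept[SparkException] {
  sql("SELECT ...").collect()
}
checkError(
  exception = e,
  errorClass = "PROPER_ERROR_NAME",
  parameters = Map("param" -> "value"))
{code}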






[jira] [Updated] (SPARK-43937) Add ifnull,isnotnull,equal_null,nullif,nvl,nvl2 to Scala and Python

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-43937:
--
Summary: Add ifnull,isnotnull,equal_null,nullif,nvl,nvl2 to Scala and 
Python  (was: Add not,ifnull,isnotnull,equal_null,nullif,nvl,nvl2 to Scala and 
Python)

> Add ifnull,isnotnull,equal_null,nullif,nvl,nvl2 to Scala and Python
> ---
>
> Key: SPARK-43937
> URL: https://issues.apache.org/jira/browse/SPARK-43937
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Add following functions:
> * -not-
> * -if-
> * ifnull
> * isnotnull
> * equal_null
> * nullif
> * nvl
> * nvl2
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client
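For reference, a hedged usage sketch, assuming these land in {{org.apache.spark.sql.functions}} with the same semantics as their SQL counterparts:

{code:scala}
// Sketch under the assumption that ifnull/nvl2 become available in
// org.apache.spark.sql.functions once this ticket lands (Spark 3.5+).
import org.apache.spark.sql.functions.{col, ifnull, lit, nvl2}

val df = spark.range(4).selectExpr("IF(id % 2 = 0, NULL, id) AS id")
df.select(
  ifnull(col("id"), lit(-1L)),               // -1 where id is NULL
  nvl2(col("id"), lit("set"), lit("unset"))  // branch on NULL-ness
).show()
{code}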






[jira] [Updated] (SPARK-43937) Add not,ifnull,isnotnull,equal_null,nullif,nvl,nvl2 to Scala and Python

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-43937:
--
Description: 
Add following functions:

* -not-
* -if-
* ifnull
* isnotnull
* equal_null
* nullif
* nvl
* nvl2

to:

* Scala API
* Python API
* Spark Connect Scala Client
* Spark Connect Python Client

  was:
Add following functions:

* not
* if
* ifnull
* isnotnull
* equal_null
* nullif
* nvl
* nvl2

to:

* Scala API
* Python API
* Spark Connect Scala Client
* Spark Connect Python Client


> Add not,ifnull,isnotnull,equal_null,nullif,nvl,nvl2 to Scala and Python
> ---
>
> Key: SPARK-43937
> URL: https://issues.apache.org/jira/browse/SPARK-43937
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Add following functions:
> * -not-
> * -if-
> * ifnull
> * isnotnull
> * equal_null
> * nullif
> * nvl
> * nvl2
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client






[jira] [Updated] (SPARK-43937) Add not,ifnull,isnotnull,equal_null,nullif,nvl,nvl2 to Scala and Python

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-43937:
--
Summary: Add not,ifnull,isnotnull,equal_null,nullif,nvl,nvl2 to Scala and 
Python  (was: Add not,if,ifnull,isnotnull,equal_null,nullif,nvl,nvl2 to Scala 
and Python)

> Add not,ifnull,isnotnull,equal_null,nullif,nvl,nvl2 to Scala and Python
> ---
>
> Key: SPARK-43937
> URL: https://issues.apache.org/jira/browse/SPARK-43937
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Add following functions:
> * not
> * if
> * ifnull
> * isnotnull
> * equal_null
> * nullif
> * nvl
> * nvl2
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client






[jira] [Commented] (SPARK-44058) Remove deprecated API usage in HiveShim.scala

2023-06-14 Thread Aman Raj (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732840#comment-17732840
 ] 

Aman Raj commented on SPARK-44058:
--

[~yumwang] In that case, this createPartition function is not required, right?

> Remove deprecated API usage in HiveShim.scala
> -
>
> Key: SPARK-44058
> URL: https://issues.apache.org/jira/browse/SPARK-44058
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.4.0
>Reporter: Aman Raj
>Priority: Major
>
> Spark's HiveShim.scala calls this particular method in Hive:
> {code:scala}
> createPartitionMethod.invoke(
>   hive,
>   table,
>   spec,
>   location,
>   params,       // partParams
>   null,         // inputFormat
>   null,         // outputFormat
>   -1: JInteger, // numBuckets
>   null,         // cols
>   null,         // serializationLib
>   null,         // serdeParams
>   null,         // bucketCols
>   null)         // sortCols
> {code}
> There is no such implementation of createPartition in Hive anymore. The only 
> definition is:
> {code:java}
> public Partition createPartition(Table tbl, Map<String, String> partSpec)
>     throws HiveException {
>   try {
>     org.apache.hadoop.hive.metastore.api.Partition part =
>         Partition.createMetaPartitionObject(tbl, partSpec, null);
>     AcidUtils.TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, tbl);
>     part.setWriteId(tableSnapshot != null ? tableSnapshot.getWriteId() : 0);
>     return new Partition(tbl, getMSC().add_partition(part));
>   } catch (Exception e) {
>     LOG.error(StringUtils.stringifyException(e));
>     throw new HiveException(e);
>   }
> }
> {code}
> *The 12-parameter implementation was removed in HIVE-5951.*
> The issue is that this 12-parameter overload of createPartition was added in 
> Hive 0.12 and removed again in Hive 0.13. When Spark still used Hive 0.12, 
> the SPARK-15334 commit added this 12-parameter invocation, but after Hive 
> moved to the newer API the call was never updated in Spark OSS, which looks 
> to us like a bug on the Spark side.
> We need to migrate to the newest implementation of Hive's createPartition 
> method, otherwise this flow can break.






[jira] [Resolved] (SPARK-44063) Revert SPARK-44047

2023-06-14 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan resolved SPARK-44063.
-
Resolution: Won't Fix

It was an issue with my local environment.

> Revert SPARK-44047
> --
>
> Key: SPARK-44063
> URL: https://issues.apache.org/jira/browse/SPARK-44063
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Connect
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Created] (SPARK-44063) Revert SPARK-44047

2023-06-14 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-44063:
---

 Summary: Revert SPARK-44047
 Key: SPARK-44063
 URL: https://issues.apache.org/jira/browse/SPARK-44063
 Project: Spark
  Issue Type: Bug
  Components: Build, Connect
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Commented] (SPARK-43926) Add array_agg, array_size, cardinality, count_min_sketch,mask,named_struct,json_* to Scala and Python

2023-06-14 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732825#comment-17732825
 ] 

Ruifeng Zheng commented on SPARK-43926:
---

[~ivoson] many thanks, please go ahead

> Add array_agg, array_size, cardinality, 
> count_min_sketch,mask,named_struct,json_* to Scala and Python
> -
>
> Key: SPARK-43926
> URL: https://issues.apache.org/jira/browse/SPARK-43926
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Add the following functions:
> * array_agg
> * array_size
> * cardinality
> * count_min_sketch
> * named_struct
> * json_array_length
> * json_object_keys
> * mask
>   to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client






[jira] [Assigned] (SPARK-43627) Enable pyspark.pandas.spark.functions.skew in Spark Connect.

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43627:
-

Assignee: Ruifeng Zheng

> Enable pyspark.pandas.spark.functions.skew in Spark Connect.
> 
>
> Key: SPARK-43627
> URL: https://issues.apache.org/jira/browse/SPARK-43627
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Ruifeng Zheng
>Priority: Major
>
> Enable pyspark.pandas.spark.functions.skew in Spark Connect.






[jira] [Assigned] (SPARK-43626) Enable pyspark.pandas.spark.functions.kurt in Spark Connect.

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43626:
-

Assignee: Ruifeng Zheng

> Enable pyspark.pandas.spark.functions.kurt in Spark Connect.
> 
>
> Key: SPARK-43626
> URL: https://issues.apache.org/jira/browse/SPARK-43626
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Ruifeng Zheng
>Priority: Major
>
> Enable pyspark.pandas.spark.functions.kurt in Spark Connect.






[jira] [Resolved] (SPARK-43941) Add any_value, approx_percentile,count_if,first_value,histogram_numeric,last_value to Scala and Python

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43941.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41588
[https://github.com/apache/spark/pull/41588]

> Add any_value, 
> approx_percentile,count_if,first_value,histogram_numeric,last_value to Scala 
> and Python
> --
>
> Key: SPARK-43941
> URL: https://issues.apache.org/jira/browse/SPARK-43941
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>
> Add following functions:
> * any_value
> * approx_percentile
> * count_if
> * first_value
> * histogram_numeric
> * last_value
> * reduce
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client






[jira] [Assigned] (SPARK-43941) Add any_value, approx_percentile,count_if,first_value,histogram_numeric,last_value to Scala and Python

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43941:
-

Assignee: jiaan.geng

> Add any_value, 
> approx_percentile,count_if,first_value,histogram_numeric,last_value to Scala 
> and Python
> --
>
> Key: SPARK-43941
> URL: https://issues.apache.org/jira/browse/SPARK-43941
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: jiaan.geng
>Priority: Major
>
> Add following functions:
> * any_value
> * approx_percentile
> * count_if
> * first_value
> * histogram_numeric
> * last_value
> * reduce
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client






[jira] [Resolved] (SPARK-43659) Enable OpsOnDiffFramesEnabledSlowParityTests.test_series_eq

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43659.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41582
[https://github.com/apache/spark/pull/41582]

> Enable OpsOnDiffFramesEnabledSlowParityTests.test_series_eq
> ---
>
> Key: SPARK-43659
> URL: https://issues.apache.org/jira/browse/SPARK-43659
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Enable OpsOnDiffFramesEnabledSlowParityTests.test_series_eq






[jira] [Assigned] (SPARK-43659) Enable OpsOnDiffFramesEnabledSlowParityTests.test_series_eq

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43659:
-

Assignee: Haejoon Lee

> Enable OpsOnDiffFramesEnabledSlowParityTests.test_series_eq
> ---
>
> Key: SPARK-43659
> URL: https://issues.apache.org/jira/browse/SPARK-43659
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Enable OpsOnDiffFramesEnabledSlowParityTests.test_series_eq






[jira] [Commented] (SPARK-44058) Remove deprecated API usage in HiveShim.scala

2023-06-14 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732806#comment-17732806
 ] 

Yuming Wang commented on SPARK-44058:
-

For Hive 0.13 and later, we use 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L753-L768.

> Remove deprecated API usage in HiveShim.scala
> -
>
> Key: SPARK-44058
> URL: https://issues.apache.org/jira/browse/SPARK-44058
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.4.0
>Reporter: Aman Raj
>Priority: Major
>
> Spark's HiveShim.scala calls this particular method in Hive:
> {code:scala}
> createPartitionMethod.invoke(
>   hive,
>   table,
>   spec,
>   location,
>   params,       // partParams
>   null,         // inputFormat
>   null,         // outputFormat
>   -1: JInteger, // numBuckets
>   null,         // cols
>   null,         // serializationLib
>   null,         // serdeParams
>   null,         // bucketCols
>   null)         // sortCols
> {code}
> There is no such implementation of createPartition in Hive anymore. The only 
> definition is:
> {code:java}
> public Partition createPartition(Table tbl, Map<String, String> partSpec)
>     throws HiveException {
>   try {
>     org.apache.hadoop.hive.metastore.api.Partition part =
>         Partition.createMetaPartitionObject(tbl, partSpec, null);
>     AcidUtils.TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, tbl);
>     part.setWriteId(tableSnapshot != null ? tableSnapshot.getWriteId() : 0);
>     return new Partition(tbl, getMSC().add_partition(part));
>   } catch (Exception e) {
>     LOG.error(StringUtils.stringifyException(e));
>     throw new HiveException(e);
>   }
> }
> {code}
> *The 12-parameter implementation was removed in HIVE-5951.*
> The issue is that this 12-parameter overload of createPartition was added in 
> Hive 0.12 and removed again in Hive 0.13. When Spark still used Hive 0.12, 
> the SPARK-15334 commit added this 12-parameter invocation, but after Hive 
> moved to the newer API the call was never updated in Spark OSS, which looks 
> to us like a bug on the Spark side.
> We need to migrate to the newest implementation of Hive's createPartition 
> method, otherwise this flow can break.






[jira] [Assigned] (SPARK-43975) DataSource V2: Handle UPDATE commands for group-based sources

2023-06-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43975:
-

Assignee: Anton Okolnychyi

> DataSource V2: Handle UPDATE commands for group-based sources
> -
>
> Key: SPARK-43975
> URL: https://issues.apache.org/jira/browse/SPARK-43975
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
>
> We need to handle UPDATE commands for group-based sources.






[jira] [Resolved] (SPARK-43975) DataSource V2: Handle UPDATE commands for group-based sources

2023-06-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43975.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41600
[https://github.com/apache/spark/pull/41600]

> DataSource V2: Handle UPDATE commands for group-based sources
> -
>
> Key: SPARK-43975
> URL: https://issues.apache.org/jira/browse/SPARK-43975
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.5.0
>
>
> We need to handle UPDATE commands for group-based sources.






[jira] [Created] (SPARK-44062) Add PySparkTestBase unit test class

2023-06-14 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44062:
--

 Summary: Add PySparkTestBase unit test class
 Key: SPARK-44062
 URL: https://issues.apache.org/jira/browse/SPARK-44062
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Created] (SPARK-44061) Add assert_df_equality util function

2023-06-14 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44061:
--

 Summary: Add assert_df_equality util function
 Key: SPARK-44061
 URL: https://issues.apache.org/jira/browse/SPARK-44061
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Updated] (SPARK-44060) Code-gen for build side outer shuffled hash join

2023-06-14 Thread Szehon Ho (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho updated SPARK-44060:
--
Description: 
Here, build side outer join means LEFT OUTER join with build left, or RIGHT 
OUTER join with build right.

As a followup for https://github.com/apache/spark/pull/41398 (non-codegen 
build-side outer shuffled hash join), this task is to add code-gen for it.

> Code-gen for build side outer shuffled hash join
> 
>
> Key: SPARK-44060
> URL: https://issues.apache.org/jira/browse/SPARK-44060
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Szehon Ho
>Priority: Major
>
> Here, build side outer join means LEFT OUTER join with build left, or RIGHT 
> OUTER join with build right.
> As a followup for https://github.com/apache/spark/pull/41398 (non-codegen 
> build-side outer shuffled hash join), this task is to add code-gen for it.
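For illustration, a hedged Scala sketch of a plan shape this covers, using the standard SHUFFLE_HASH hint; the DataFrames are placeholders:

{code:scala}
// Hypothetical example: a LEFT OUTER join where the hinted left side is
// the build side -- the shape this ticket adds code-gen for.
val left  = spark.range(0, 100).withColumnRenamed("id", "k")
val right = spark.range(0, 50).withColumnRenamed("id", "k")

left.hint("shuffle_hash")
  .join(right, Seq("k"), "left_outer")
  .explain()  // expect ShuffledHashJoin ... BuildLeft in the plan
{code}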






[jira] [Updated] (SPARK-44060) Code-gen for build side outer shuffled hash join

2023-06-14 Thread Szehon Ho (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho updated SPARK-44060:
--
Description: 
Here, build side outer join means LEFT OUTER join with build left, or RIGHT 
OUTER join with build right.

As a follow-up to https://github.com/apache/spark/pull/41398 (SPARK-36612, 
non-codegen build-side outer shuffled hash join), this task is to add code-gen 
for it.

  was:
Here, build side outer join means LEFT OUTER join with build left, or RIGHT 
OUTER join with build right.

As a followup for https://github.com/apache/spark/pull/41398 (non-codegen 
build-side outer shuffled hash join), this task is to add code-gen for it.


> Code-gen for build side outer shuffled hash join
> 
>
> Key: SPARK-44060
> URL: https://issues.apache.org/jira/browse/SPARK-44060
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Szehon Ho
>Priority: Major
>
> Here, build side outer join means LEFT OUTER join with build left, or RIGHT 
> OUTER join with build right.
> As a follow-up to https://github.com/apache/spark/pull/41398 (SPARK-36612, 
> non-codegen build-side outer shuffled hash join), this task is to add 
> code-gen for it.






[jira] [Created] (SPARK-44060) Code-gen for build side outer shuffled hash join

2023-06-14 Thread Szehon Ho (Jira)
Szehon Ho created SPARK-44060:
-

 Summary: Code-gen for build side outer shuffled hash join
 Key: SPARK-44060
 URL: https://issues.apache.org/jira/browse/SPARK-44060
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Szehon Ho









[jira] [Updated] (SPARK-43440) Support registration of an Arrow Python UDF

2023-06-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43440:
-
Summary: Support registration of an Arrow Python UDF   (was: Support 
registration of an Arrow-optimized Python UDF )

> Support registration of an Arrow Python UDF 
> 
>
> Key: SPARK-43440
> URL: https://issues.apache.org/jira/browse/SPARK-43440
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>
> Currently, when users register an Arrow-optimized Python UDF, it will be 
> registered as a pickled Python UDF and thus executed without Arrow 
> optimization.
> We should support registration of Arrow-optimized Python UDFs and execute 
> them with Arrow optimization.






[jira] [Updated] (SPARK-43893) Non-atomic data type support in Arrow Python UDF

2023-06-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43893:
-
Summary: Non-atomic data type support in Arrow Python UDF  (was: Non-atomic 
data type support in Arrow-optimized Python UDF)

> Non-atomic data type support in Arrow Python UDF
> 
>
> Key: SPARK-43893
> URL: https://issues.apache.org/jira/browse/SPARK-43893
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43412) Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow Python UDFs

2023-06-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43412:
-
Summary: Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow Python UDFs  
(was: Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow-optimized Python 
UDFs)

> Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow Python UDFs
> 
>
> Key: SPARK-43412
> URL: https://issues.apache.org/jira/browse/SPARK-43412
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>
> We are about to improve nested non-atomic input/output support of an 
> Arrow-optimized Python UDF.
> However, currently, it shares the same EvalType with a pickled Python UDF, 
> but the same implementation with a Pandas UDF.
> Introducing an EvalType enables isolating the changes to Arrow-optimized 
> Python UDFs.
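A hedged sketch of the idea on the JVM side; the constant values are assumptions, shown only to mirror how existing eval types are declared:

{code:scala}
// Illustrative only: a dedicated eval type so Arrow Python UDFs stop
// sharing SQL_BATCHED_UDF with pickled UDFs. The values are assumptions.
object PythonEvalType {
  val SQL_BATCHED_UDF = 100        // existing: pickled Python UDF
  val SQL_ARROW_BATCHED_UDF = 101  // new: Arrow Python UDF
}
{code}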






[jira] [Updated] (SPARK-43082) Arrow Python UDFs in Spark Connect

2023-06-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43082:
-
Summary: Arrow Python UDFs in Spark Connect  (was: Arrow-optimized Python 
UDFs in Spark Connect)

> Arrow Python UDFs in Spark Connect
> --
>
> Key: SPARK-43082
> URL: https://issues.apache.org/jira/browse/SPARK-43082
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>
> Implement Arrow-optimized Python UDFs in Spark Connect.






[jira] [Updated] (SPARK-42893) Block Arrow Python UDFs

2023-06-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-42893:
-
Summary: Block Arrow Python UDFs  (was: Block Arrow-optimized Python UDFs)

> Block Arrow Python UDFs
> ---
>
> Key: SPARK-42893
> URL: https://issues.apache.org/jira/browse/SPARK-42893
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Considering the upcoming improvements to the result inconsistencies between 
> traditional pickled Python UDFs and Arrow-optimized Python UDFs, we'd better 
> block the feature for now; otherwise, users who try out the feature will 
> face behavior changes in the next release.
> In addition, since the Spark Connect Python Client (SCPC) has been 
> introduced in Spark 3.4, we'd better ensure the feature is ready in both 
> vanilla PySpark and SCPC at the same time, for compatibility.






[jira] [Updated] (SPARK-40307) Introduce Arrow Python UDFs

2023-06-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-40307:
-
Summary: Introduce Arrow Python UDFs  (was: Introduce Arrow-optimized 
Python UDFs)

> Introduce Arrow Python UDFs
> ---
>
> Key: SPARK-40307
> URL: https://issues.apache.org/jira/browse/SPARK-40307
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> A Python user-defined function (UDF) enables users to run arbitrary code 
> against PySpark columns. It uses Pickle for (de)serialization and executes 
> row by row.
> One major performance bottleneck of Python UDFs is (de)serialization, that 
> is, the data interchange between the worker JVM and the spawned Python 
> subprocess that actually executes the UDF. We should seek an alternative to 
> handle the (de)serialization: Arrow, which is already used in the 
> (de)serialization of Pandas UDFs.
> There should be two ways to enable/disable the Arrow optimization for Python 
> UDFs:
> - the Spark configuration `spark.sql.execution.pythonUDF.arrow.enabled`, 
> disabled by default.
> - the `useArrow` parameter of the `udf` function, None by default.
> The Spark configuration takes effect only when `useArrow` is None. Otherwise, 
> `useArrow` decides whether a specific user-defined function is optimized by 
> Arrow or not.
> The reason why we introduce these two ways is to provide both a convenient, 
> per-Spark-session control and a finer-grained, per-UDF control of the Arrow 
> optimization for Python UDFs.
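As a concrete toggle, the session-wide switch named above can be set as follows; the per-UDF {{useArrow}} parameter lives on the Python {{udf}} API and overrides it when set:

{code:scala}
// Session-wide Arrow optimization for Python UDFs; disabled by default.
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")
{code}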






[jira] [Updated] (SPARK-43903) Improve ArrayType input support in Arrow Python UDF

2023-06-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43903:
-
Summary: Improve ArrayType input support in Arrow Python UDF  (was: Improve 
ArrayType input support in Arrow-optimized Python UDF)

> Improve ArrayType input support in Arrow Python UDF
> ---
>
> Key: SPARK-43903
> URL: https://issues.apache.org/jira/browse/SPARK-43903
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Priority: Major
>







[jira] [Updated] (SPARK-43903) Improve ArrayType input support in Arrow-optimized Python UDF

2023-06-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43903:
-
Summary: Improve ArrayType input support in Arrow-optimized Python UDF  
(was: Non-atomic data type support in Arrow-optimized Python UDF)

> Improve ArrayType input support in Arrow-optimized Python UDF
> -
>
> Key: SPARK-43903
> URL: https://issues.apache.org/jira/browse/SPARK-43903
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Priority: Major
>







[jira] [Updated] (SPARK-43893) Non-atomic data type support in Arrow-optimized Python UDF

2023-06-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-43893:
-
Summary: Non-atomic data type support in Arrow-optimized Python UDF  (was: 
StructType input/output support in Arrow-optimized Python UDF)

> Non-atomic data type support in Arrow-optimized Python UDF
> --
>
> Key: SPARK-43893
> URL: https://issues.apache.org/jira/browse/SPARK-43893
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Commented] (SPARK-44057) Mark all `local-cluster` tests as `ExtendedSQLTest`

2023-06-14 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732751#comment-17732751
 ] 

GridGain Integration commented on SPARK-44057:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41601

> Mark all `local-cluster` tests as `ExtendedSQLTest`
> ---
>
> Key: SPARK-44057
> URL: https://issues.apache.org/jira/browse/SPARK-44057
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.5.0
>
>
> This issue aims to mark all `local-cluster` tests as `ExtendedSQLTest`
> https://pipelines.actions.githubusercontent.com/serviceHosts/03398d36-4378-4d47-a936-fba0a5e8ccb9/_apis/pipelines/1/runs/251144/signedlogcontent/12?urlExpires=2023-06-14T17%3A11%3A50.2399742Z&urlSigningMethod=HMACV1&urlSignature=%2FHTlrgaHtF2Jv65vw%2Fj4SzT69etebI0swSSM6dXC0tk%3D
> {code}
> $ git grep local-cluster sql/core/
> sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala:  
>   val session = SparkSession.builder().master("local-cluster[3, 1, 
> 1024]").getOrCreate()
> sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala:  
>   val session = SparkSession.builder().master("local-cluster[3, 1, 
> 1024]").getOrCreate()
> sql/core/src/test/scala/org/apache/spark/sql/execution/BroadcastExchangeSuite.scala://
>  Additional tests run in 'local-cluster' mode.
> sql/core/src/test/scala/org/apache/spark/sql/execution/BroadcastExchangeSuite.scala:
>   .setMaster("local-cluster[2,1,1024]")
> sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala:
>   "--master", "local-cluster[1,1,1024]",
> sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala:
>* Create a new [[SparkSession]] running in local-cluster mode with unsafe 
> and codegen enabled.
> sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala:
>   .master("local-cluster[2,1,1024]")
> sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala:
>  * Tests in this suite we need to run Spark in local-cluster mode. In 
> particular, the use of
> sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala:
>* Create a new [[SparkSession]] running in local-cluster mode with unsafe 
> and codegen enabled.
> sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala:
>   .master("local-cluster[2,1,512]")
> sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDDSuite.scala:
>   .config(sparkConf.setMaster("local-cluster[2, 1, 1024]"))
> sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala:
>   // Create a new [[SparkSession]] running in local-cluster mode.
> sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala:
>   .master("local-cluster[2,1,1024]")
> {code}






[jira] [Assigned] (SPARK-44057) Mark all `local-cluster` tests as `ExtendedSQLTest`

2023-06-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44057:
-

Assignee: Dongjoon Hyun

> Mark all `local-cluster` tests as `ExtendedSQLTest`
> ---
>
> Key: SPARK-44057
> URL: https://issues.apache.org/jira/browse/SPARK-44057
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to mark all `local-cluster` tests as `ExtendedSQLTest`
> https://pipelines.actions.githubusercontent.com/serviceHosts/03398d36-4378-4d47-a936-fba0a5e8ccb9/_apis/pipelines/1/runs/251144/signedlogcontent/12?urlExpires=2023-06-14T17%3A11%3A50.2399742Z&urlSigningMethod=HMACV1&urlSignature=%2FHTlrgaHtF2Jv65vw%2Fj4SzT69etebI0swSSM6dXC0tk%3D
> {code}
> $ git grep local-cluster sql/core/
> sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala:  
>   val session = SparkSession.builder().master("local-cluster[3, 1, 
> 1024]").getOrCreate()
> sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala:  
>   val session = SparkSession.builder().master("local-cluster[3, 1, 
> 1024]").getOrCreate()
> sql/core/src/test/scala/org/apache/spark/sql/execution/BroadcastExchangeSuite.scala://
>  Additional tests run in 'local-cluster' mode.
> sql/core/src/test/scala/org/apache/spark/sql/execution/BroadcastExchangeSuite.scala:
>   .setMaster("local-cluster[2,1,1024]")
> sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala:
>   "--master", "local-cluster[1,1,1024]",
> sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala:
>* Create a new [[SparkSession]] running in local-cluster mode with unsafe 
> and codegen enabled.
> sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala:
>   .master("local-cluster[2,1,1024]")
> sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala:
>  * Tests in this suite we need to run Spark in local-cluster mode. In 
> particular, the use of
> sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala:
>* Create a new [[SparkSession]] running in local-cluster mode with unsafe 
> and codegen enabled.
> sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala:
>   .master("local-cluster[2,1,512]")
> sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDDSuite.scala:
>   .config(sparkConf.setMaster("local-cluster[2, 1, 1024]"))
> sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala:
>   // Create a new [[SparkSession]] running in local-cluster mode.
> sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala:
>   .master("local-cluster[2,1,1024]")
> {code}






[jira] [Resolved] (SPARK-44057) Mark all `local-cluster` tests as `ExtendedSQLTest`

2023-06-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44057.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41601
[https://github.com/apache/spark/pull/41601]

> Mark all `local-cluster` tests as `ExtendedSQLTest`
> ---
>
> Key: SPARK-44057
> URL: https://issues.apache.org/jira/browse/SPARK-44057
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.5.0
>
>
> This issue aims to mark all `local-cluster` tests as `ExtendedSQLTest`
> https://pipelines.actions.githubusercontent.com/serviceHosts/03398d36-4378-4d47-a936-fba0a5e8ccb9/_apis/pipelines/1/runs/251144/signedlogcontent/12?urlExpires=2023-06-14T17%3A11%3A50.2399742Z&urlSigningMethod=HMACV1&urlSignature=%2FHTlrgaHtF2Jv65vw%2Fj4SzT69etebI0swSSM6dXC0tk%3D
> {code}
> $ git grep local-cluster sql/core/
> sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala:  
>   val session = SparkSession.builder().master("local-cluster[3, 1, 
> 1024]").getOrCreate()
> sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala:  
>   val session = SparkSession.builder().master("local-cluster[3, 1, 
> 1024]").getOrCreate()
> sql/core/src/test/scala/org/apache/spark/sql/execution/BroadcastExchangeSuite.scala://
>  Additional tests run in 'local-cluster' mode.
> sql/core/src/test/scala/org/apache/spark/sql/execution/BroadcastExchangeSuite.scala:
>   .setMaster("local-cluster[2,1,1024]")
> sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala:
>   "--master", "local-cluster[1,1,1024]",
> sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala:
>* Create a new [[SparkSession]] running in local-cluster mode with unsafe 
> and codegen enabled.
> sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala:
>   .master("local-cluster[2,1,1024]")
> sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala:
>  * Tests in this suite we need to run Spark in local-cluster mode. In 
> particular, the use of
> sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala:
>* Create a new [[SparkSession]] running in local-cluster mode with unsafe 
> and codegen enabled.
> sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala:
>   .master("local-cluster[2,1,512]")
> sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDDSuite.scala:
>   .config(sparkConf.setMaster("local-cluster[2, 1, 1024]"))
> sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala:
>   // Create a new [[SparkSession]] running in local-cluster mode.
> sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala:
>   .master("local-cluster[2,1,1024]")
> {code}






[jira] [Commented] (SPARK-44041) Upgrade ammonite to 2.5.9

2023-06-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732747#comment-17732747
 ] 

Dongjoon Hyun commented on SPARK-44041:
---

Nice! Looking forward to seeing it.

> Upgrade ammonite to 2.5.9
> -
>
> Key: SPARK-44041
> URL: https://issues.apache.org/jira/browse/SPARK-44041
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> To support Scala 2.12.18 & 2.13.11.
> There is already a tag: 
> [https://github.com/com-lihaoyi/Ammonite/releases/tag/2.5.9]






[jira] [Updated] (SPARK-44059) Add named argument support for SQL functions

2023-06-14 Thread Richard Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Yu updated SPARK-44059:
---
Description: 
Today, there is increasing demand for named argument functions, especially as 
we continue to introduce longer and longer parameter lists in our SQL 
functions. In these functions, many arguments could have default values, 
making it wasteful to have to specify all of them. This is an umbrella ticket 
to track the smaller subtasks needed to implement this feature.

Issues currently tracked:

https://issues.apache.org/jira/browse/SPARK-43922

  was:Today, there is increasing demand for named argument functions, 
especially as we continue to introduce longer and longer parameter lists in our 
SQL functions. In these functions, many arguments could have default values, 
making it a waste to specify them all even if it is redundant. This is an 
umbrella ticket to track smaller subtasks which would be completed for 
implementing this feature.


> Add named argument support for SQL functions
> 
>
> Key: SPARK-44059
> URL: https://issues.apache.org/jira/browse/SPARK-44059
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Richard Yu
>Priority: Major
>
> Today, there is increasing demand for named argument functions, especially as 
> we continue to introduce longer and longer parameter lists in our SQL 
> functions. In these functions, many arguments could have default values, 
> making it wasteful to have to specify all of them. This is an umbrella 
> ticket to track the smaller subtasks that implement this feature.
> Issues currently tracked:
> https://issues.apache.org/jira/browse/SPARK-43922
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44059) Add named argument support for SQL functions

2023-06-14 Thread Richard Yu (Jira)
Richard Yu created SPARK-44059:
--

 Summary: Add named argument support for SQL functions
 Key: SPARK-44059
 URL: https://issues.apache.org/jira/browse/SPARK-44059
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Affects Versions: 3.5.0
Reporter: Richard Yu


Today, there is increasing demand for named argument functions, especially as 
we continue to introduce longer and longer parameter lists in our SQL 
functions. In these functions, many arguments could have default values, making 
it wasteful to have to specify all of them. This is an umbrella ticket to track 
the smaller subtasks that implement this feature.
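
A hypothetical sketch of what the feature could look like from SQL (hedged: the 
`name => value` syntax and the mask() parameter names below are assumptions 
taken for illustration, not a settled design; an existing SparkSession named 
`spark` is assumed):

{code:scala}
// Positional call: every default before the argument we care about must be
// spelled out.
spark.sql("SELECT mask('AbCD123-@$#', 'X', 'x', 'n', '*')").show()

// Named-argument call: only the non-default parameter is supplied.
// (Hypothetical syntax; the concrete form is decided in the linked subtasks.)
spark.sql("SELECT mask('AbCD123-@$#', otherChar => '*')").show()
{code}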



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44058) Remove deprecated API usage in HiveShim.scala

2023-06-14 Thread Aman Raj (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Raj updated SPARK-44058:
-
Description: 
Spark's HiveShim.scala invokes this particular Hive method:
createPartitionMethod.invoke(
hive,
table,
spec,
location,
params, // partParams
null, // inputFormat
null, // outputFormat
-1: JInteger, // numBuckets
null, // cols
null, // serializationLib
null, // serdeParams
null, // bucketCols
null) // sortCols
}
 
We do not have any such implementation of createPartition in Hive. We only have 
this definition:
public Partition createPartition(Table tbl, Map<String, String> partSpec) 
throws HiveException {
  try {
    org.apache.hadoop.hive.metastore.api.Partition part =
        Partition.createMetaPartitionObject(tbl, partSpec, null);
    AcidUtils.TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, tbl);
    part.setWriteId(tableSnapshot != null ? tableSnapshot.getWriteId() : 0);
    return new Partition(tbl, getMSC().add_partition(part));
  } catch (Exception e) {
    LOG.error(StringUtils.stringifyException(e));
    throw new HiveException(e);
  }
}
*The 12-parameter implementation was removed in HIVE-5951.*

 

The issue is that this 12-parameter implementation of the createPartition 
method was added in Hive 0.12 and then removed in Hive 0.13. When Hive 0.12 was 
used in Spark, the SPARK-15334 commit added this 12-parameter invocation. After 
Hive migrated to newer APIs, this was never updated in Spark OSS, which looks 
to us like a bug on the Spark side.

 

We need to migrate to the newest implementation of Hive's createPartition 
method; otherwise this flow can break.

  was:
Spark's HiveShim.scala calls this particular method in Hive :
createPartitionMethod.invoke(
hive,
table,
spec,
location,
params, // partParams
null, // inputFormat
null, // outputFormat
-1: JInteger, // numBuckets
null, // cols
null, // serializationLib
null, // serdeParams
null, // bucketCols
null) // sortCols
}
 
We do not have any such implementation of createPartition in Hive. We only have 
this definition :
public Partition createPartition(Table tbl, Map partSpec) 
throws HiveException {
    try {
      org.apache.hadoop.hive.metastore.api.Partition part =
          Partition.createMetaPartitionObject(tbl, partSpec, null);
      AcidUtils.TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, 
tbl);
      part.setWriteId(tableSnapshot != null ? tableSnapshot.getWriteId() : 0);
      return new Partition(tbl, getMSC().add_partition(part));
    } catch (Exception e) {
      LOG.error(StringUtils.stringifyException(e));
      throw new HiveException(e);
    }

  }
 
The issue is that this 12 parameter implementation of createPartition method 
was added in Hive-0.12 and then was removed in Hive-0.13. When Hive 0.12 was 
used in Spark, [SPARK-15334] commit in Spark added this 12 parameters 
implementation. But after Hive migrated to newer APIs somehow this was not 
changed in Spark OSS and it looks to us like a Bug from the Spark end.

 

We need to migrate to the newest implementation of Hive createPartition method 
otherwise this flow can break


> Remove deprecated API usage in HiveShim.scala
> -
>
> Key: SPARK-44058
> URL: https://issues.apache.org/jira/browse/SPARK-44058
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.4.0
>Reporter: Aman Raj
>Priority: Major
>
> Spark's HiveShim.scala calls this particular method in Hive :
> createPartitionMethod.invoke(
> hive,
> table,
> spec,
> location,
> params, // partParams
> null, // inputFormat
> null, // outputFormat
> -1: JInteger, // numBuckets
> null, // cols
> null, // serializationLib
> null, // serdeParams
> null, // bucketCols
> null) // sortCols
> }
>  
> We do not have any such implementation of createPartition in Hive. We only 
> have this definition:
> public Partition createPartition(Table tbl, Map<String, String> partSpec) 
> throws HiveException {
>   try {
>     org.apache.hadoop.hive.metastore.api.Partition part =
>         Partition.createMetaPartitionObject(tbl, partSpec, null);
>     AcidUtils.TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, 
> tbl);
>     part.setWriteId(tableSnapshot != null ? tableSnapshot.getWriteId() : 0);
>     return new Partition(tbl, getMSC().add_partition(part));
>   } catch (Exception e) {
>     LOG.error(StringUtils.stringifyException(e));
>     throw new HiveException(e);
>   }
> }
> *The 12-parameter implementation was removed in HIVE-5951.*
>  
> The issue is that this 12 parameter implementation of createPartition method 
> was added in Hive-0.12 and then was removed in Hive-0.13. When Hive 0.12 was 
> used in Spark, SPARK-15334 commit in Spark added this 12 parameters 
> implementati

[jira] [Created] (SPARK-44058) Remove deprecated API usage in HiveShim.scala

2023-06-14 Thread Aman Raj (Jira)
Aman Raj created SPARK-44058:


 Summary: Remove deprecated API usage in HiveShim.scala
 Key: SPARK-44058
 URL: https://issues.apache.org/jira/browse/SPARK-44058
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 3.4.0
Reporter: Aman Raj


Spark's HiveShim.scala invokes this particular Hive method:
createPartitionMethod.invoke(
hive,
table,
spec,
location,
params, // partParams
null, // inputFormat
null, // outputFormat
-1: JInteger, // numBuckets
null, // cols
null, // serializationLib
null, // serdeParams
null, // bucketCols
null) // sortCols
}
 
We do not have any such implementation of createPartition in Hive. We only have 
this definition:
public Partition createPartition(Table tbl, Map<String, String> partSpec) 
throws HiveException {
    try {
      org.apache.hadoop.hive.metastore.api.Partition part =
          Partition.createMetaPartitionObject(tbl, partSpec, null);
      AcidUtils.TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, 
tbl);
      part.setWriteId(tableSnapshot != null ? tableSnapshot.getWriteId() : 0);
      return new Partition(tbl, getMSC().add_partition(part));
    } catch (Exception e) {
      LOG.error(StringUtils.stringifyException(e));
      throw new HiveException(e);
    }

  }
 
The issue is that this 12-parameter implementation of the createPartition 
method was added in Hive 0.12 and then removed in Hive 0.13. When Hive 0.12 was 
used in Spark, the [SPARK-15334] commit added this 12-parameter invocation. 
After Hive migrated to newer APIs, this was never updated in Spark OSS, which 
looks to us like a bug on the Spark side.

 

We need to migrate to the newest implementation of Hive's createPartition 
method; otherwise this flow can break.
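
A minimal sketch of what the migration inside HiveShim.scala could look like 
(hedged: only the two-argument Hive signature quoted above is taken from this 
ticket; the shim object and the reflection plumbing here are illustrative 
assumptions, not the actual Spark code):

{code:scala}
import java.util.{Map => JMap}
import org.apache.hadoop.hive.ql.metadata.{Hive, Partition, Table}

object CreatePartitionShim {
  // Resolve the surviving two-argument overload once, mirroring how HiveShim
  // resolves Hive methods reflectively, instead of the removed 12-argument one.
  private lazy val createPartitionMethod =
    classOf[Hive].getMethod("createPartition",
      classOf[Table], classOf[JMap[String, String]])

  def createPartition(hive: Hive, table: Table,
      spec: JMap[String, String]): Partition =
    createPartitionMethod.invoke(hive, table, spec).asInstanceOf[Partition]
}
{code}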



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44057) Mark all `local-cluster` tests as `ExtendedSQLTest`

2023-06-14 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44057:
-

 Summary: Mark all `local-cluster` tests as `ExtendedSQLTest`
 Key: SPARK-44057
 URL: https://issues.apache.org/jira/browse/SPARK-44057
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.5.0
Reporter: Dongjoon Hyun


This issue aims to mark all `local-cluster` tests as `ExtendedSQLTest`

https://pipelines.actions.githubusercontent.com/serviceHosts/03398d36-4378-4d47-a936-fba0a5e8ccb9/_apis/pipelines/1/runs/251144/signedlogcontent/12?urlExpires=2023-06-14T17%3A11%3A50.2399742Z&urlSigningMethod=HMACV1&urlSignature=%2FHTlrgaHtF2Jv65vw%2Fj4SzT69etebI0swSSM6dXC0tk%3D

{code}
$ git grep local-cluster sql/core/
sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala:
val session = SparkSession.builder().master("local-cluster[3, 1, 
1024]").getOrCreate()
sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala:
val session = SparkSession.builder().master("local-cluster[3, 1, 
1024]").getOrCreate()
sql/core/src/test/scala/org/apache/spark/sql/execution/BroadcastExchangeSuite.scala://
 Additional tests run in 'local-cluster' mode.
sql/core/src/test/scala/org/apache/spark/sql/execution/BroadcastExchangeSuite.scala:
  .setMaster("local-cluster[2,1,1024]")
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala:
  "--master", "local-cluster[1,1,1024]",
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala:
   * Create a new [[SparkSession]] running in local-cluster mode with unsafe 
and codegen enabled.
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala:
  .master("local-cluster[2,1,1024]")
sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala:
 * Tests in this suite we need to run Spark in local-cluster mode. In 
particular, the use of
sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala:
   * Create a new [[SparkSession]] running in local-cluster mode with unsafe 
and codegen enabled.
sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala:
  .master("local-cluster[2,1,512]")
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDDSuite.scala:
  .config(sparkConf.setMaster("local-cluster[2, 1, 1024]"))
sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala:
  // Create a new [[SparkSession]] running in local-cluster mode.
sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala:
  .master("local-cluster[2,1,1024]")
{code}
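
For context, a minimal sketch of the tagging itself (hedged: 
`org.apache.spark.tags.ExtendedSQLTest` is the existing class-level test 
annotation; the suite below is one of the grep hits, with its real parent 
traits elided to `SparkFunSuite`):

{code:scala}
import org.apache.spark.SparkFunSuite
import org.apache.spark.tags.ExtendedSQLTest

// A class-level tag: ScalaTest applies it to every test in the suite, so the
// test pipeline can route the whole local-cluster suite to the slower
// "extended" group.
@ExtendedSQLTest
class BroadcastExchangeSuite extends SparkFunSuite {
  // existing local-cluster tests unchanged
}
{code}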



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44053) Update ORC to 1.8.4

2023-06-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732668#comment-17732668
 ] 

Dongjoon Hyun commented on SPARK-44053:
---

The Apache ORC 1.9.0 PR will arrive later this month.

> Update ORC to 1.8.4
> ---
>
> Key: SPARK-44053
> URL: https://issues.apache.org/jira/browse/SPARK-44053
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Yiqun Zhang
>Assignee: Yiqun Zhang
>Priority: Major
> Fix For: 3.4.1, 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44053) Update ORC to 1.8.4

2023-06-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44053:
--
Affects Version/s: 3.5.0

> Update ORC to 1.8.4
> ---
>
> Key: SPARK-44053
> URL: https://issues.apache.org/jira/browse/SPARK-44053
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Yiqun Zhang
>Assignee: Yiqun Zhang
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44053) Update ORC to 1.8.4

2023-06-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44053:
--
Fix Version/s: 3.4.1

> Update ORC to 1.8.4
> ---
>
> Key: SPARK-44053
> URL: https://issues.apache.org/jira/browse/SPARK-44053
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Yiqun Zhang
>Assignee: Yiqun Zhang
>Priority: Major
> Fix For: 3.4.1, 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44056) Improve error message when UDF execution fails

2023-06-14 Thread Rob Reeves (Jira)
Rob Reeves created SPARK-44056:
--

 Summary: Improve error message when UDF execution fails
 Key: SPARK-44056
 URL: https://issues.apache.org/jira/browse/SPARK-44056
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: Rob Reeves


If a user has multiple UDFs defined with the same method signature, it is hard 
to tell from the function class alone which one caused the issue. For example, 
in Spark 3.1.1:
{code}
Caused by: org.apache.spark.SparkException: Failed to execute user defined 
function(UDFRegistration$$Lambda$666/1969461119: (bigint, string) => string)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.subExpr_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
at 
org.apache.spark.sql.execution.FilterExec.$anonfun$doExecute$3(basicPhysicalOperators.scala:249)
at 
org.apache.spark.sql.execution.FilterExec.$anonfun$doExecute$3$adapted(basicPhysicalOperators.scala:248)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:131)
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:523)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1535)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:526)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException

This is the end of the stack trace. I didn't truncate it.
{code}

If the SQL API is used, the ScalaUDF will have a name. It should be part of the 
error message to help debugging.
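
A minimal sketch of the ambiguity (hedged: assumes a SparkSession named `spark` 
and a registered view `people(id BIGINT, name STRING)`; both function names are 
made up for illustration):

{code:scala}
// Two UDFs with the identical (bigint, string) => string signature: the
// generated lambda class in the stack trace cannot distinguish them, but the
// registered name carried by the ScalaUDF could.
spark.udf.register("format_id", (id: Long, name: String) => s"$name-$id")
spark.udf.register("mask_name", (id: Long, name: String) => name.take(1) + "***")

spark.sql(
  "SELECT mask_name(id, name) FROM people WHERE format_id(id, name) <> ''")
{code}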



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44053) Update ORC to 1.8.4

2023-06-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44053.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41593
[https://github.com/apache/spark/pull/41593]

> Update ORC to 1.8.4
> ---
>
> Key: SPARK-44053
> URL: https://issues.apache.org/jira/browse/SPARK-44053
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.1
>Reporter: Yiqun Zhang
>Assignee: Yiqun Zhang
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44053) Update ORC to 1.8.4

2023-06-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44053:
-

Assignee: Yiqun Zhang

> Update ORC to 1.8.4
> ---
>
> Key: SPARK-44053
> URL: https://issues.apache.org/jira/browse/SPARK-44053
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.1
>Reporter: Yiqun Zhang
>Assignee: Yiqun Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44055) Remove redundant `override` from `CheckpointRDD`

2023-06-14 Thread Yang Jie (Jira)
Yang Jie created SPARK-44055:


 Summary: Remove redundant `override` from `CheckpointRDD`
 Key: SPARK-44055
 URL: https://issues.apache.org/jira/browse/SPARK-44055
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Yang Jie
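
For context, a minimal sketch of the pattern such a cleanup usually targets 
(hedged: illustrative class names, not the actual CheckpointRDD members; in 
Scala, `override` is optional when implementing an abstract member):

{code:scala}
abstract class RDDBase {
  def compute(): Iterator[Int] // abstract member
}

class CheckpointLike extends RDDBase {
  // `override` is redundant here because the parent member is abstract;
  // the cleanup drops such modifiers.
  override def compute(): Iterator[Int] = Iterator.empty
}
{code}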






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-43819) Barrier Executor Stage Not Retried on Task Failure

2023-06-14 Thread Matthew Tieman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Tieman closed SPARK-43819.
--

> Barrier Executor Stage Not Retried on Task Failure
> --
>
> Key: SPARK-43819
> URL: https://issues.apache.org/jira/browse/SPARK-43819
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.3.2
>Reporter: Matthew Tieman
>Priority: Major
>
> When running a stage using the barrier executor, the expectation is that a 
> task failure will result in the stage being retried. However, if an 
> exception is thrown from a task, the stage is not retried and the job fails.
> Running the pyspark code below causes a single task to fail, failing the 
> stage without any retry.
> {code:python}
> def test_func(index: int) -> list:
>     if index == 0:
>         raise RuntimeError("Thrown from test func")
>     return []
> start_rdd = sc.parallelize([i for i in range(10)], 10)
> result = start_rdd.barrier().mapPartitionsWithIndex(lambda i, c: test_func(i))
> result.collect(){code}
>  
> This failure is seen running locally via the pyspark shell and on a K8s 
> cluster.
>  
> Stack trace from local execution:
> {noformat}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/rdd.py", 
> line 1197, in collect
>     sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
>  line 1321, in __call__
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/sql/utils.py", 
> line 190, in deco
>     return f(*a, **kw)
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py",
>  line 326, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Could 
> not recover from a failed barrier ResultStage. Most recent failure reason: 
> Stage failed because barrier task ResultTask(0, 0) finished unsuccessfully.
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py",
>  line 686, in main
>     process()
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py",
>  line 676, in process
>     out_iter = func(split_index, iterator)
>   File "", line 1, in 
>   File "", line 3, in test_func
> RuntimeError: Thrown from test func
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:559)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:765)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:747)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:512)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
>   at 
> org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
>   at 
> org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
>   at 
> org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1021)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:9

[jira] [Commented] (SPARK-43819) Barrier Executor Stage Not Retried on Task Failure

2023-06-14 Thread Matthew Tieman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732599#comment-17732599
 ] 

Matthew Tieman commented on SPARK-43819:


After further debugging, I found the issue was a combination of incorrect 
configuration and incorrect expectations.

First, the incorrect expectation was that on task failure the stage would be 
retried. However, this only happens if the failure happens in a shuffle map 
stage. If the failure happens in a result stage, the job will be aborted.

Next, the misconfiguration was in the {{SparkApplication}} resource submitted 
to the K8s Spark operator: specifically, the {{restartPolicy}} was set to 
{{Never}}.

The combination of the barrier failing the job and the Spark operator being 
told not to retry applications on failure led to the issue. The solution was 
to configure a restart policy with an appropriate number of retry attempts.
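
A minimal sketch of the shuffle-map-stage case where the stage retry does kick 
in (hedged: assumes a SparkContext named `sc`; `stageAttemptNumber()` gates the 
failure to the first stage attempt):

{code:scala}
import org.apache.spark.TaskContext

val rdd = sc.parallelize(0 until 10, 10)
val mapped = rdd.barrier().mapPartitionsWithIndex { (i, it) =>
  // Fail partition 0 on the first stage attempt only; the barrier scheduler
  // retries the whole stage, and the second attempt succeeds.
  if (i == 0 && TaskContext.get().stageAttemptNumber() == 0) {
    throw new RuntimeException("Thrown from test func")
  }
  it.map(x => (x % 2, 1))
}
// The shuffle makes the barrier stage a ShuffleMapStage, the case eligible for
// stage retry; a bare collect() on the barrier stage would be a ResultStage.
mapped.reduceByKey(_ + _).collect()
{code}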

> Barrier Executor Stage Not Retried on Task Failure
> --
>
> Key: SPARK-43819
> URL: https://issues.apache.org/jira/browse/SPARK-43819
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.3.2
>Reporter: Matthew Tieman
>Priority: Major
>
> When running a stage using the barrier executor, the expectation is that a 
> task failure will result in the stage being retried. However, if an 
> exception is thrown from a task, the stage is not retried and the job fails.
> Running the pyspark code below causes a single task to fail, failing the 
> stage without any retry.
> {code:python}
> def test_func(index: int) -> list:
>     if index == 0:
>         raise RuntimeError("Thrown from test func")
>     return []
> start_rdd = sc.parallelize([i for i in range(10)], 10)
> result = start_rdd.barrier().mapPartitionsWithIndex(lambda i, c: test_func(i))
> result.collect(){code}
>  
> This failure is seen running locally via the pyspark shell and on a K8s 
> cluster.
>  
> Stack trace from local execution:
> {noformat}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/rdd.py", 
> line 1197, in collect
>     sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
>  line 1321, in __call__
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/sql/utils.py", 
> line 190, in deco
>     return f(*a, **kw)
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py",
>  line 326, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Could 
> not recover from a failed barrier ResultStage. Most recent failure reason: 
> Stage failed because barrier task ResultTask(0, 0) finished unsuccessfully.
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py",
>  line 686, in main
>     process()
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py",
>  line 676, in process
>     out_iter = func(split_index, iterator)
>   File "", line 1, in 
>   File "", line 3, in test_func
> RuntimeError: Thrown from test func
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:559)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:765)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:747)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:512)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
>   at 
> o

[jira] [Resolved] (SPARK-43819) Barrier Executor Stage Not Retried on Task Failure

2023-06-14 Thread Matthew Tieman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Tieman resolved SPARK-43819.

Resolution: Not A Problem

> Barrier Executor Stage Not Retried on Task Failure
> --
>
> Key: SPARK-43819
> URL: https://issues.apache.org/jira/browse/SPARK-43819
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.3.2
>Reporter: Matthew Tieman
>Priority: Major
>
> When running a stage using the barrier executor, the expectation is that a 
> task failure will result in the stage being retried. However, if an 
> exception is thrown from a task, the stage is not retried and the job fails.
> Running the pyspark code below causes a single task to fail, failing the 
> stage without any retry.
> {code:python}
> def test_func(index: int) -> list:
>     if index == 0:
>         raise RuntimeError("Thrown from test func")
>     return []
> start_rdd = sc.parallelize([i for i in range(10)], 10)
> result = start_rdd.barrier().mapPartitionsWithIndex(lambda i, c: test_func(i))
> result.collect(){code}
>  
> This failure is seen running locally via the pyspark shell and on a K8s 
> cluster.
>  
> Stack trace from local execution:
> {noformat}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/rdd.py", 
> line 1197, in collect
>     sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
>  line 1321, in __call__
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/sql/utils.py", 
> line 190, in deco
>     return f(*a, **kw)
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py",
>  line 326, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Could 
> not recover from a failed barrier ResultStage. Most recent failure reason: 
> Stage failed because barrier task ResultTask(0, 0) finished unsuccessfully.
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py",
>  line 686, in main
>     process()
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py",
>  line 676, in process
>     out_iter = func(split_index, iterator)
>   File "", line 1, in 
>   File "", line 3, in test_func
> RuntimeError: Thrown from test func
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:559)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:765)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:747)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:512)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
>   at 
> org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
>   at 
> org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
>   at 
> org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1021)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
>   at org.apache.spark.scheduler.Res

[jira] [Commented] (SPARK-44041) Upgrade ammonite to 2.5.9

2023-06-14 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732572#comment-17732572
 ] 

Yang Jie commented on SPARK-44041:
--

I will give a pr when it can be downloaded from Maven

> Upgrade ammonite to 2.5.9
> -
>
> Key: SPARK-44041
> URL: https://issues.apache.org/jira/browse/SPARK-44041
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> To support Scala 2.12.18 & 2.13.11.
>  
> A tag already exists: 
> [https://github.com/com-lihaoyi/Ammonite/releases/tag/2.5.9]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44041) Upgrade ammonite to 2.5.9

2023-06-14 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732572#comment-17732572
 ] 

Yang Jie edited comment on SPARK-44041 at 6/14/23 3:03 PM:
---

I will give a pr when it can be downloaded by Maven


was (Author: luciferyang):
I will give a pr when it can be downloaded from Maven

> Upgrade ammonite to 2.5.9
> -
>
> Key: SPARK-44041
> URL: https://issues.apache.org/jira/browse/SPARK-44041
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> To support Scala 2.12.18 & 2.13.11.
>  
> A tag already exists: 
> [https://github.com/com-lihaoyi/Ammonite/releases/tag/2.5.9]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44054) Make test cases inherit SparkFunSuite have a default timeout

2023-06-14 Thread Yang Jie (Jira)
Yang Jie created SPARK-44054:


 Summary: Make test cases inherit SparkFunSuite have a default 
timeout
 Key: SPARK-44054
 URL: https://issues.apache.org/jira/browse/SPARK-44054
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.5.0
Reporter: Yang Jie
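
For context, a minimal sketch of one way to give every test in a suite a 
default timeout with ScalaTest (hedged: SparkFunSuite would presumably bake 
something similar in; the ten-minute limit is made up):

{code:scala}
import org.scalatest.concurrent.TimeLimitedTests
import org.scalatest.funsuite.AnyFunSuite
import org.scalatest.time.{Minutes, Span}

class ExampleSuite extends AnyFunSuite with TimeLimitedTests {
  // Every test in the suite fails if it runs longer than this limit.
  val timeLimit: Span = Span(10, Minutes)

  test("finishes within the limit") {
    assert(1 + 1 == 2)
  }
}
{code}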






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44047) Upgrade google guava for connect from 31.0.1-jre to 32.0.1-jre

2023-06-14 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-44047:


Assignee: BingKun Pan

> Upgrade google guava for connect from 31.0.1-jre to 32.0.1-jre
> --
>
> Key: SPARK-44047
> URL: https://issues.apache.org/jira/browse/SPARK-44047
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Connect
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44047) Upgrade google guava for connect from 31.0.1-jre to 32.0.1-jre

2023-06-14 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-44047.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41581
[https://github.com/apache/spark/pull/41581]

> Upgrade google guava for connect from 31.0.1-jre to 32.0.1-jre
> --
>
> Key: SPARK-44047
> URL: https://issues.apache.org/jira/browse/SPARK-44047
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Connect
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44051) Split `pyspark.pandas.tests.connect.data_type_ops.test_parity_num_ops`

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44051.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41591
[https://github.com/apache/spark/pull/41591]

> Split `pyspark.pandas.tests.connect.data_type_ops.test_parity_num_ops`
> --
>
> Key: SPARK-44051
> URL: https://issues.apache.org/jira/browse/SPARK-44051
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark, Tests
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44051) Split `pyspark.pandas.tests.connect.data_type_ops.test_parity_num_ops`

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44051:
-

Assignee: Ruifeng Zheng

> Split `pyspark.pandas.tests.connect.data_type_ops.test_parity_num_ops`
> --
>
> Key: SPARK-44051
> URL: https://issues.apache.org/jira/browse/SPARK-44051
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark, Tests
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44004) Assign name & improve error message for frequent LEGACY errors.

2023-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732465#comment-17732465
 ] 

ASF GitHub Bot commented on SPARK-44004:


User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/41504

> Assign name & improve error message for frequent LEGACY errors.
> ---
>
> Key: SPARK-44004
> URL: https://issues.apache.org/jira/browse/SPARK-44004
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> This addresses _LEGACY_ERROR_TEMP_1333, _LEGACY_ERROR_TEMP_2331, 
> _LEGACY_ERROR_TEMP_0023, _LEGACY_ERROR_TEMP_1157, _LEGACY_ERROR_TEMP_2308, 
> _LEGACY_ERROR_TEMP_1051, _LEGACY_ERROR_TEMP_1029, _LEGACY_ERROR_TEMP_1318



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44004) Assign name & improve error message for frequent LEGACY errors.

2023-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732464#comment-17732464
 ] 

ASF GitHub Bot commented on SPARK-44004:


User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/41504

> Assign name & improve error message for frequent LEGACY errors.
> ---
>
> Key: SPARK-44004
> URL: https://issues.apache.org/jira/browse/SPARK-44004
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> This addresses _LEGACY_ERROR_TEMP_1333, _LEGACY_ERROR_TEMP_2331, 
> _LEGACY_ERROR_TEMP_0023, _LEGACY_ERROR_TEMP_1157, _LEGACY_ERROR_TEMP_2308, 
> _LEGACY_ERROR_TEMP_1051, _LEGACY_ERROR_TEMP_1029, _LEGACY_ERROR_TEMP_1318



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44040) Incorrect result after count distinct

2023-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732461#comment-17732461
 ] 

ASF GitHub Bot commented on SPARK-44040:


User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/41576

> Incorrect result after count distinct
> -
>
> Key: SPARK-44040
> URL: https://issues.apache.org/jira/browse/SPARK-44040
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Aleksandr Aleksandrov
>Priority: Critical
>
> When I try to call count after the distinct function on a null Decimal 
> field, Spark returns an incorrect result starting from Spark 3.4.0.
> A minimal example to reproduce:
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.\{Column, DataFrame, Dataset, Row, SparkSession}
> import org.apache.spark.sql.types.\{StringType, StructField, StructType}
> val schema = StructType( Array(
> StructField("money", DecimalType(38,6), true),
> StructField("reference_id", StringType, true)
> ))
> val payDf = spark.createDataFrame(sc.emptyRDD[Row], schema)
> val aggDf = payDf.agg(sum("money").as("money")).withColumn("name", lit("df1"))
> val aggDf1 = payDf.agg(sum("money").as("money")).withColumn("name", 
> lit("df2"))
> val unionDF: DataFrame = aggDf.union(aggDf1)
> unionDF.select("money").distinct.show // return correct result
> unionDF.select("money").distinct.count // return 2 instead of 1
> unionDF.select("money").distinct.count == 1 // return false
> This block of code throws an assertion error and after that returns an 
> incorrect count (in Spark 3.2.1 everything works fine and I get the correct 
> result = 1):
> *scala> unionDF.select("money").distinct.show // return correct result*
> java.lang.AssertionError: assertion failed:
> Decimal$DecimalIsFractional
> while compiling: 
> during phase: globalPhase=terminal, enteringPhase=jvm
> library version: version 2.12.17
> compiler version: version 2.12.17
> reconstructed args: -classpath 
> /Users/aleksandrov/.ivy2/jars/org.apache.spark_spark-connect_2.12-3.4.0.jar:/Users/aleksandrov/.ivy2/jars/io.delta_delta-core_2.12-2.4.0.jar:/Users/aleksandrov/.ivy2/jars/io.delta_delta-storage-2.4.0.jar:/Users/aleksandrov/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar:/Users/aleksandrov/.ivy2/jars/org.antlr_antlr4-runtime-4.9.3.jar
>  -Yrepl-class-based -Yrepl-outdir 
> /private/var/folders/qj/_dn4xbp14jn37qmdk7ylyfwcgr/T/spark-f37bb154-75f3-4db7-aea8-3c4363377bd8/repl-350f37a1-1df1-4816-bd62-97929c60a6c1
> last tree to typer: TypeTree(class Byte)
> tree position: line 6 of 
> tree tpe: Byte
> symbol: (final abstract) class Byte in package scala
> symbol definition: final abstract class Byte extends (a ClassSymbol)
> symbol package: scala
> symbol owners: class Byte
> call site: constructor $eval in object $eval in package $line19
> == Source file context for tree position ==
> 3
> 4object $eval {
> 5lazyval $result = 
> $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0
> 6lazyval $print: {_}root{_}.java.lang.String = {
> 7 $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw
> 8
> 9""
> at 
> scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185)
> at scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525)
> at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514)
> at scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353)
> at 
> scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346)
> at 
> scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348)
> at 
> scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487)
> at 
> scala.reflect.internal.Symbols.$anonfun$forEachRelevantSymbols$1$adapted(Symbols.scala:3802)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
> at scala.reflect.internal.Symbols.markFlagsCompleted(Symbols.scala:3799)
> at scala.reflect.internal.Symbols.markFlagsCompleted$(Symbols.scala:3805)
> at scala.reflect.internal.SymbolTable.markFlagsCompleted(SymbolTable.scala:28)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.finishSym$1(UnPickler.scala:324)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:342)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.readSymbolRef(UnPickler.scala:645)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.readType(UnPickler.scala:413)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$readSymbol$10(UnPickler.scala:357)
> at scala.reflect.internal.pickling.UnPickler$Scan.at(UnP

[jira] [Commented] (SPARK-43915) Assign names to the error class _LEGACY_ERROR_TEMP_[2438-2445]

2023-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732459#comment-17732459
 ] 

ASF GitHub Bot commented on SPARK-43915:


User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/41553

> Assign names to the error class _LEGACY_ERROR_TEMP_[2438-2445]
> --
>
> Key: SPARK-43915
> URL: https://issues.apache.org/jira/browse/SPARK-43915
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44044) Improve Error message for SQL Window functions

2023-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732456#comment-17732456
 ] 

ASF GitHub Bot commented on SPARK-44044:


User 'siying' has created a pull request for this issue:
https://github.com/apache/spark/pull/41578

> Improve Error message for SQL Window functions
> --
>
> Key: SPARK-44044
> URL: https://issues.apache.org/jira/browse/SPARK-44044
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Siying Dong
>Priority: Trivial
>
> Right now, if a window spec is used with a streaming query, the error message 
> looks like the following:
> Non-time-based windows are not supported on streaming DataFrames/Datasets;
> Window [... 
> The message isn't very helpful for identifying what the problem is, and some 
> customers and even support engineers have been confused by it. It is suggested 
> that we call out the aggregation function over the window spec so that users 
> can more easily locate the part of the query that caused the problem.
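
A minimal sketch that reproduces the message (hedged: assumes a SparkSession 
named `spark`; the built-in `rate` source supplies `timestamp` and `value` 
columns, and the unsupported-operation check fires when the streaming query is 
analyzed and started):

{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val stream = spark.readStream.format("rate").load() // streaming DataFrame
val w = Window.partitionBy(col("value")).orderBy(col("timestamp"))
val ranked = stream.withColumn("rn", row_number().over(w))

// Fails with:
// "Non-time-based windows are not supported on streaming DataFrames/Datasets"
ranked.writeStream.format("console").start()
{code}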



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44044) Improve Error message for SQL Window functions

2023-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732455#comment-17732455
 ] 

ASF GitHub Bot commented on SPARK-44044:


User 'siying' has created a pull request for this issue:
https://github.com/apache/spark/pull/41578

> Improve Error message for SQL Window functions
> --
>
> Key: SPARK-44044
> URL: https://issues.apache.org/jira/browse/SPARK-44044
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Siying Dong
>Priority: Trivial
>
> Right now, if a window spec is used with a streaming query, the error message 
> looks like the following:
> Non-time-based windows are not supported on streaming DataFrames/Datasets;
> Window [... 
> The message isn't very helpful for identifying what the problem is, and some 
> customers and even support engineers have been confused by it. It is suggested 
> that we call out the aggregation function over the window spec so that users 
> can more easily locate the part of the query that caused the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44053) Update ORC to 1.8.4

2023-06-14 Thread Yiqun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Zhang updated SPARK-44053:

Affects Version/s: 3.4.1
   (was: 3.5.0)

> Update ORC to 1.8.4
> ---
>
> Key: SPARK-44053
> URL: https://issues.apache.org/jira/browse/SPARK-44053
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.1
>Reporter: Yiqun Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44053) Update ORC to 1.8.4

2023-06-14 Thread Yiqun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732447#comment-17732447
 ] 

Yiqun Zhang commented on SPARK-44053:
-

Our plan is to upgrade Spark 3.4.1 to ORC 1.8.4 and Spark 3.5.0 to ORC 1.9.0, 
so I set the affected version to 3.4.1.

[~yumwang]  :)

> Update ORC to 1.8.4
> ---
>
> Key: SPARK-44053
> URL: https://issues.apache.org/jira/browse/SPARK-44053
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yiqun Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44053) Update ORC to 1.8.4

2023-06-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-44053:

Affects Version/s: 3.5.0
   (was: 3.4.1)

> Update ORC to 1.8.4
> ---
>
> Key: SPARK-44053
> URL: https://issues.apache.org/jira/browse/SPARK-44053
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yiqun Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44053) Update ORC to 1.8.4

2023-06-14 Thread Yiqun Zhang (Jira)
Yiqun Zhang created SPARK-44053:
---

 Summary: Update ORC to 1.8.4
 Key: SPARK-44053
 URL: https://issues.apache.org/jira/browse/SPARK-44053
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.4.1
Reporter: Yiqun Zhang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43645) Enable pyspark.pandas.spark.functions.stddev in Spark Connect.

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43645:
-

Assignee: Ruifeng Zheng

> Enable pyspark.pandas.spark.functions.stddev in Spark Connect.
> --
>
> Key: SPARK-43645
> URL: https://issues.apache.org/jira/browse/SPARK-43645
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>
> Enable pyspark.pandas.spark.functions.stddev in Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43931) Add make_* functions to Scala and Python

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43931:
-

Assignee: BingKun Pan

> Add make_* functions to Scala and Python
> 
>
> Key: SPARK-43931
> URL: https://issues.apache.org/jira/browse/SPARK-43931
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: BingKun Pan
>Priority: Major
>
> Add the following functions:
> * make_dt_interval
> * make_interval
> * make_timestamp
> * make_timestamp_ltz
> * make_timestamp_ntz
> * make_ym_interval
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43931) Add make_* functions to Scala and Python

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43931.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41477
[https://github.com/apache/spark/pull/41477]

> Add make_* functions to Scala and Python
> 
>
> Key: SPARK-43931
> URL: https://issues.apache.org/jira/browse/SPARK-43931
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: BingKun Pan
>Priority: Major
> Fix For: 3.5.0
>
>
> Add the following functions:
> * make_dt_interval
> * make_interval
> * make_timestamp
> * make_timestamp_ltz
> * make_timestamp_ntz
> * make_ym_interval
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client






[jira] [Resolved] (SPARK-43622) Enable pyspark.pandas.spark.functions.var in Spark Connect.

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43622.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41589
[https://github.com/apache/spark/pull/41589]

> Enable pyspark.pandas.spark.functions.var in Spark Connect.
> ---
>
> Key: SPARK-43622
> URL: https://issues.apache.org/jira/browse/SPARK-43622
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>
> Enable pyspark.pandas.spark.functions.var in Spark Connect.
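
As with the stddev ticket, a small hedged sketch; the assumption again is that DataFrame.var(ddof=...) is the user-facing route into the internal var helper being enabled for Connect.

{code:python}
# Companion sketch to the stddev example: per-column variance on
# pandas-on-Spark, with ddof selecting population vs. sample variance.
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
print(psdf.var(ddof=0))  # population variance per column
print(psdf.var())        # sample variance (ddof=1, the default)
{code}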






[jira] [Resolved] (SPARK-43645) Enable pyspark.pandas.spark.functions.stddev in Spark Connect.

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43645.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41589
[https://github.com/apache/spark/pull/41589]

> Enable pyspark.pandas.spark.functions.stddev in Spark Connect.
> --
>
> Key: SPARK-43645
> URL: https://issues.apache.org/jira/browse/SPARK-43645
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Enable pyspark.pandas.spark.functions.stddev in Spark Connect.






[jira] [Resolved] (SPARK-44035) Split `pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_slow`

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44035.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41565
[https://github.com/apache/spark/pull/41565]

> Split `pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_slow`
> 
>
> Key: SPARK-44035
> URL: https://issues.apache.org/jira/browse/SPARK-44035
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark, Tests
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-44035) Split `pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_slow`

2023-06-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44035:
-

Assignee: Ruifeng Zheng

> Split `pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_slow`
> 
>
> Key: SPARK-44035
> URL: https://issues.apache.org/jira/browse/SPARK-44035
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark, Tests
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Assigned] (SPARK-44048) Remove sql-migration-old.md

2023-06-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44048:
-

Assignee: Yuming Wang

> Remove sql-migration-old.md
> ---
>
> Key: SPARK-44048
> URL: https://issues.apache.org/jira/browse/SPARK-44048
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>







[jira] [Resolved] (SPARK-44048) Remove sql-migration-old.md

2023-06-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44048.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41583
[https://github.com/apache/spark/pull/41583]

> Remove sql-migration-old.md
> ---
>
> Key: SPARK-44048
> URL: https://issues.apache.org/jira/browse/SPARK-44048
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-43963) DataSource V2: Handle MERGE commands for group-based sources

2023-06-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43963:
-

Assignee: Anton Okolnychyi

> DataSource V2: Handle MERGE commands for group-based sources
> 
>
> Key: SPARK-43963
> URL: https://issues.apache.org/jira/browse/SPARK-43963
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
>
> We need to handle MERGE commands for group-based sources.
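
For context, a hedged illustration of the statement class involved. The table names are placeholders, spark is an active SparkSession, and the target is assumed to be a DSv2 table whose source reports group-based (copy-on-write) row-level operation support, so Spark must rewrite the command into a replacement of the affected row groups.

{code:python}
# Illustrative MERGE of the kind the group-based rewrite must handle;
# `target` and `source` are placeholder names, not from the ticket.
spark.sql("""
    MERGE INTO target AS t
    USING source AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
{code}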






[jira] [Resolved] (SPARK-43963) DataSource V2: Handle MERGE commands for group-based sources

2023-06-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43963.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41577
[https://github.com/apache/spark/pull/41577]

> DataSource V2: Handle MERGE commands for group-based sources
> 
>
> Key: SPARK-43963
> URL: https://issues.apache.org/jira/browse/SPARK-43963
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.5.0
>
>
> We need to handle MERGE commands for group-based sources.






[jira] [Resolved] (SPARK-44049) Fix KubernetesSuite to use `inNamespace` for validating driver pod cleanup

2023-06-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44049.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41586
[https://github.com/apache/spark/pull/41586]

> Fix KubernetesSuite to use `inNamespace` for validating driver pod cleanup
> --
>
> Key: SPARK-44049
> URL: https://issues.apache.org/jira/browse/SPARK-44049
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-44049) Fix KubernetesSuite to use `inNamespace` for validating driver pod cleanup

2023-06-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44049:
-

Assignee: Dongjoon Hyun

> Fix KubernetesSuite to use `inNamespace` for validating driver pod cleanup
> --
>
> Key: SPARK-44049
> URL: https://issues.apache.org/jira/browse/SPARK-44049
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>







[jira] [Created] (SPARK-44052) Add util to get proper Column or DataFrame class for Spark Connect.

2023-06-14 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-44052:
---

 Summary: Add util to get proper Column or DataFrame class for 
Spark Connect.
 Key: SPARK-44052
 URL: https://issues.apache.org/jira/browse/SPARK-44052
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


A lot of code is duplicated to get the proper PySparkColumn or 
PySparkDataFrame class, so it would be great to have a util function to 
deduplicate it.
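
A hypothetical sketch of such a helper; the name, location, and shape below are illustrative guesses, not the merged implementation. Only pyspark.sql.utils.is_remote and the two Column classes are existing APIs.

{code:python}
# Hypothetical helper: pick the Column class matching the execution mode.
from pyspark.sql.utils import is_remote


def _get_column_class():
    if is_remote():  # Spark Connect client session
        from pyspark.sql.connect.column import Column
    else:
        from pyspark.sql.column import Column
    return Column
{code}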


