[jira] [Resolved] (SPARK-44920) Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()

2023-08-22 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-44920.
--
Fix Version/s: 3.3.4
   3.5.0
   4.0.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 42619
[https://github.com/apache/spark/pull/42619]

> Use await() instead of awaitUninterruptibly() in 
> TransportClientFactory.createClient() 
> ---
>
> Key: SPARK-44920
> URL: https://issues.apache.org/jira/browse/SPARK-44920
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.3, 3.4.2, 3.5.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
> Fix For: 3.3.4, 3.5.0, 4.0.0, 3.4.2
>
>
> This is a follow-up to SPARK-44241:
> That change added an `awaitUninterruptibly()` call, which I think should be a 
> plain `await()` instead. This will prevent issues when cancelling tasks with 
> hanging network connections. 
> This issue is similar to SPARK-19529.
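> A minimal sketch (not the actual TransportClientFactory code) of why the 
> interruptible wait matters; it uses a plain CountDownLatch as a stand-in for 
> Netty-style futures, whose await() likewise throws InterruptedException 
> while awaitUninterruptibly() swallows the interrupt:
> {code:scala}
> import java.util.concurrent.CountDownLatch
> 
> // Stand-in for a connection attempt that never completes.
> val pending = new CountDownLatch(1)
> 
> val worker = new Thread(() => {
>   try {
>     pending.await() // interruptible, like future.await()
>     println("connected")
>   } catch {
>     case _: InterruptedException =>
>       // With an uninterruptible wait the thread would still be blocked
>       // here, and task cancellation could not unstick it.
>       println("wait cancelled")
>   }
> })
> worker.start()
> worker.interrupt() // task cancellation delivers an interrupt
> worker.join()
> {code}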



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44925) K8s default service token file should not be materialized into token

2023-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44925:
-

Assignee: Dongjoon Hyun

> K8s default service token file should not be materialized into token
> 
>
> Key: SPARK-44925
> URL: https://issues.apache.org/jira/browse/SPARK-44925
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.2, 3.4.1, 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44925) K8s default service token file should not be materialized into token

2023-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44925.
---
Fix Version/s: 3.3.4
   3.5.0
   4.0.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 42624
[https://github.com/apache/spark/pull/42624]

> K8s default service token file should not be materialized into token
> 
>
> Key: SPARK-44925
> URL: https://issues.apache.org/jira/browse/SPARK-44925
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.2, 3.4.1, 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.4, 3.5.0, 4.0.0, 3.4.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44922) Disable o.a.p.h.InternalParquetRecordWriter logs for tests to reduce the log volume

2023-08-22 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-44922.
--
Fix Version/s: 3.4.2
   3.5.0
 Assignee: Kent Yao
   Resolution: Fixed

Issue resolved by https://github.com/apache/spark/pull/42614

> Disable o.a.p.h.InternalParquetRecordWriter logs for tests to reduce the log 
> volume
> ---
>
> Key: SPARK-44922
> URL: https://issues.apache.org/jira/browse/SPARK-44922
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.4.2, 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44905) NullPointerException on stateful expression evaluation

2023-08-22 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-44905.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by https://github.com/apache/spark/pull/42601

> NullPointerException on stateful expression evaluation
> --
>
> Key: SPARK-44905
> URL: https://issues.apache.org/jira/browse/SPARK-44905
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0, 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44840) array_insert() gives wrong results for negative index

2023-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44840:
--
Fix Version/s: 3.5.1

> array_insert() gives wrong results for negative index
> ---
>
> Key: SPARK-44840
> URL: https://issues.apache.org/jira/browse/SPARK-44840
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: Max Gekk
>Priority: Major
> Fix For: 4.0.0, 3.5.1
>
>
> Unlike Snowflake, we decided that array_insert() is 1-based.
> This means 1 is the first element in an array and -1 is the last. 
> This matches the behavior of functions such as substr() and element_at().
>  
> {code:java}
> > SELECT array_insert(array('a', 'b', 'c'), 1, 'z');
> ["z","a","b","c"]
> > SELECT array_insert(array('a', 'b', 'c'), 0, 'z');
> Error
> > SELECT array_insert(array('a', 'b', 'c'), -1, 'z');
> ["a","b","c","z"]
> > SELECT array_insert(array('a', 'b', 'c'), 5, 'z');
> ["a","b","c",NULL,"z"]
> > SELECT array_insert(array('a', 'b', 'c'), -5, 'z');
> ["z",NULL,"a","b","c"]
> > SELECT array_insert(array('a', 'b', 'c'), 2, cast(NULL AS STRING));
> ["a",NULL,"b","c"]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44878) Address LRU cache insertion failure for RocksDB with strict limit

2023-08-22 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757801#comment-17757801
 ] 

Snoot.io commented on SPARK-44878:
--

User 'anishshri-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/42567

> Address LRU cache insertion failure for RocksDB with strict limit
> -
>
> Key: SPARK-44878
> URL: https://issues.apache.org/jira/browse/SPARK-44878
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.1
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
> Fix For: 4.0.0
>
>
> Address LRU cache insertion failure for RocksDB with strict limit



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44878) Address LRU cache insertion failure for RocksDB with strict limit

2023-08-22 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757800#comment-17757800
 ] 

Snoot.io commented on SPARK-44878:
--

User 'anishshri-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/42567

> Address LRU cache insertion failure for RocksDB with strict limit
> -
>
> Key: SPARK-44878
> URL: https://issues.apache.org/jira/browse/SPARK-44878
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.1
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
> Fix For: 4.0.0
>
>
> Address LRU cache insertion failure for RocksDB with strict limit



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44878) Address LRU cache insertion failure for RocksDB with strict limit

2023-08-22 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-44878.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42567
[https://github.com/apache/spark/pull/42567]

> Address LRU cache insertion failure for RocksDB with strict limit
> -
>
> Key: SPARK-44878
> URL: https://issues.apache.org/jira/browse/SPARK-44878
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.1
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
> Fix For: 4.0.0
>
>
> Address LRU cache insertion failure for RocksDB with strict limit



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44878) Address LRU cache insertion failure for RocksDB with strict limit

2023-08-22 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-44878:


Assignee: Anish Shrigondekar

> Address LRU cache insertion failure for RocksDB with strict limit
> -
>
> Key: SPARK-44878
> URL: https://issues.apache.org/jira/browse/SPARK-44878
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.1
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
>
> Address LRU cache insertion failure for RocksDB with strict limit



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44750) SparkSession.Builder should respect the options

2023-08-22 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757797#comment-17757797
 ] 

Snoot.io commented on SPARK-44750:
--

User 'michaelzhan-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/42548

> SparkSession.Builder should respect the options
> ---
>
> Key: SPARK-44750
> URL: https://issues.apache.org/jira/browse/SPARK-44750
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Michael Zhang
>Priority: Major
>
> In the Connect session builder, we use the {{config}} method to set options.
> However, the options are actually ignored when we create a new session.
> {code}
> def create(self) -> "SparkSession":
> has_channel_builder = self._channel_builder is not None
> has_spark_remote = "spark.remote" in self._options
> if has_channel_builder and has_spark_remote:
> raise ValueError(
> "Only one of connection string or channelBuilder "
> "can be used to create a new SparkSession."
> )
> if not has_channel_builder and not has_spark_remote:
> raise ValueError(
> "Needs either connection string or channelBuilder to 
> create a new SparkSession."
> )
> if has_channel_builder:
> assert self._channel_builder is not None
> session = SparkSession(connection=self._channel_builder)
> else:
> spark_remote = to_str(self._options.get("spark.remote"))
> assert spark_remote is not None
> session = SparkSession(connection=spark_remote)
> SparkSession._set_default_and_active_session(session)
> return session
> {code}
> we should respect the options by invoking {{session.conf.set}} after creation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42017) df["bad_key"] does not raise AnalysisException

2023-08-22 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757794#comment-17757794
 ] 

Snoot.io commented on SPARK-42017:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42608

> df["bad_key"] does not raise AnalysisException
> --
>
> Key: SPARK-42017
> URL: https://issues.apache.org/jira/browse/SPARK-42017
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> e.g.)
> {code}
> 23/01/12 14:33:43 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> FAILED [  8%]
> pyspark/sql/tests/test_column.py:105 (ColumnParityTests.test_access_column)
> self = <ColumnParityTests testMethod=test_access_column>
> def test_access_column(self):
> df = self.df
> self.assertTrue(isinstance(df.key, Column))
> self.assertTrue(isinstance(df["key"], Column))
> self.assertTrue(isinstance(df[0], Column))
> self.assertRaises(IndexError, lambda: df[2])
> >   self.assertRaises(AnalysisException, lambda: df["bad_key"])
> E   AssertionError: AnalysisException not raised by 
> ../test_column.py:112: AssertionError
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44903) Refine docstring of `approx_count_distinct`

2023-08-22 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757790#comment-17757790
 ] 

Snoot.io commented on SPARK-44903:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/42596

> Refine docstring of `approx_count_distinct`
> ---
>
> Key: SPARK-44903
> URL: https://issues.apache.org/jira/browse/SPARK-44903
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44860) Implement SESSION_USER function

2023-08-22 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757788#comment-17757788
 ] 

Snoot.io commented on SPARK-44860:
--

User 'vitaliili-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/42549

> Implement SESSION_USER function
> ---
>
> Key: SPARK-44860
> URL: https://issues.apache.org/jira/browse/SPARK-44860
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vitalii Li
>Priority: Major
>
> According to the SQL standard, SESSION_USER and CURRENT_USER behave 
> differently inside routines (see the sketch below):
> - CURRENT_USER inside a routine should return the security definer of the 
> routine, e.g. the owner identity
> - SESSION_USER inside a routine should return the connected user.
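> An illustration of the intended difference; SESSION_USER is the function 
> this ticket proposes to add, while CURRENT_USER already exists:
> {code:scala}
> // At top level the two typically agree; the distinction only shows up
> // inside a routine that runs with definer rights.
> spark.sql("SELECT CURRENT_USER()").show() // in a routine: the definer/owner
> spark.sql("SELECT SESSION_USER()").show() // in a routine: the connected user
> {code}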



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44913) DS V2 supports push down V2 UDF that has magic method

2023-08-22 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757787#comment-17757787
 ] 

Snoot.io commented on SPARK-44913:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/42612

> DS V2 supports push down V2 UDF that has magic method
> -
>
> Key: SPARK-44913
> URL: https://issues.apache.org/jira/browse/SPARK-44913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Xianyang Liu
>Priority: Major
>
> Right now we only support pushing down V2 UDFs that do not have a magic 
> method, because such a UDF is analyzed into an `ApplyFunctionExpression`, 
> which can be translated and pushed down. However, a V2 UDF that has a magic 
> method is analyzed into `StaticInvoke` or `Invoke`, which cannot be 
> translated into a V2 expression and therefore cannot be pushed down to the 
> data source. Since the magic method is the recommended approach, this PR 
> adds support for pushing down V2 UDFs that have a magic method.
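> For context, a minimal sketch (an assumed shape based on the DSv2 
> FunctionCatalog API, not code from the PR) of a V2 UDF whose typed `invoke` 
> is the magic method the analyzer binds via StaticInvoke/Invoke:
> {code:scala}
> import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
> import org.apache.spark.sql.types._
> 
> class LongAdd extends ScalarFunction[java.lang.Long] {
>   override def inputTypes(): Array[DataType] = Array(LongType, LongType)
>   override def resultType(): DataType = LongType
>   override def name(): String = "long_add"
> 
>   // The "magic method": its signature matches inputTypes, so the analyzer
>   // can call it directly instead of going through ApplyFunctionExpression.
>   def invoke(x: Long, y: Long): Long = x + y
> }
> {code}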



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44925) K8s default service token file should not be materialized into token

2023-08-22 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757786#comment-17757786
 ] 

Snoot.io commented on SPARK-44925:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/42624

> K8s default service token file should not be materialized into token
> 
>
> Key: SPARK-44925
> URL: https://issues.apache.org/jira/browse/SPARK-44925
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.2, 3.4.1, 3.5.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44925) K8s default service token file should not be materialized into token

2023-08-22 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757785#comment-17757785
 ] 

Snoot.io commented on SPARK-44925:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/42624

> K8s default service token file should not be materialized into token
> 
>
> Key: SPARK-44925
> URL: https://issues.apache.org/jira/browse/SPARK-44925
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.2, 3.4.1, 3.5.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44923) Some directories should be cleared when regenerating files

2023-08-22 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-44923:

Component/s: Build

> Some directories should be cleared when regenerating files
> --
>
> Key: SPARK-44923
> URL: https://issues.apache.org/jira/browse/SPARK-44923
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44925) K8s default service token file should not be materialized into token

2023-08-22 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44925:
-

 Summary: K8s default service token file should not be materialized 
into token
 Key: SPARK-44925
 URL: https://issues.apache.org/jira/browse/SPARK-44925
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.4.1, 3.3.2, 3.2.4, 3.1.3, 3.0.3, 2.4.8, 3.5.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44742) Add Spark version drop down to the PySpark doc site

2023-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44742.
--
Fix Version/s: 4.0.0
   3.5.0
   Resolution: Fixed

Issue resolved by pull request 42428
[https://github.com/apache/spark/pull/42428]

> Add Spark version drop down to the PySpark doc site
> ---
>
> Key: SPARK-44742
> URL: https://issues.apache.org/jira/browse/SPARK-44742
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: BingKun Pan
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Currently, the PySpark documentation does not have a version dropdown. While 
> by default we want people to land on the latest version, a dropdown will 
> make it easier for people to find the docs for their release. 
> Other libraries such as numpy have such a dropdown.  
> !image-2023-08-09-09-38-00-805.png|width=214,height=189!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44742) Add Spark version drop down to the PySpark doc site

2023-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44742:


Assignee: BingKun Pan

> Add Spark version drop down to the PySpark doc site
> ---
>
> Key: SPARK-44742
> URL: https://issues.apache.org/jira/browse/SPARK-44742
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: BingKun Pan
>Priority: Major
>
> Currently, the PySpark documentation does not have a version dropdown. While 
> by default we want people to land on the latest version, a dropdown will 
> make it easier for people to find the docs for their release. 
> Other libraries such as numpy have such a dropdown.  
> !image-2023-08-09-09-38-00-805.png|width=214,height=189!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44924) Add configurations for FileStreamSource cached files

2023-08-22 Thread kevin nacios (Jira)
kevin nacios created SPARK-44924:


 Summary: Add configurations for FileStreamSource cached files
 Key: SPARK-44924
 URL: https://issues.apache.org/jira/browse/SPARK-44924
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: kevin nacios


With https://issues.apache.org/jira/browse/SPARK-30866, caching of listed files 
was added to structured streaming to reduce the cost of relisting files from 
the filesystem on each batch. The settings that drive this are currently 
hardcoded and there is no way to change them.

This impacts some of our workloads where we process large datasets and it is 
unknown how "heavy" individual files are, so a single batch can take a long 
time. When we set maxFilesPerTrigger to 100k files, a subsequent batch that 
uses the cached maximum of 10k files makes the job take longer: the cluster is 
capable of handling 100k files but is stuck doing 10% of the workload. The 
benefit of the caching doesn't outweigh its cost to the performance of the 
rest of the job.

With config settings available for this, we could either absorb some increased 
driver memory usage and cache the next 100k files, or opt to disable caching 
entirely and just relist files each batch by setting the cache amount to 0.
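A sketch of how this might look. "maxFilesPerTrigger" is an existing 
file-source option, but the cache-size setting below is hypothetical, standing 
in for the hardcoded limit this ticket asks to expose:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("demo").getOrCreate()

// Hypothetical config name; today the cached-files cap is hardcoded.
spark.conf.set("spark.sql.streaming.fileSource.maxCachedFiles", "100000")

val df = spark.readStream
  .schema(new StructType().add("id", LongType))
  .option("maxFilesPerTrigger", "100000") // existing option
  .json("/data/incoming")
{code}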



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44923) Some directories should be cleared when regenerating files

2023-08-22 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-44923:
---

 Summary: Some directories should be cleared when regenerating files
 Key: SPARK-44923
 URL: https://issues.apache.org/jira/browse/SPARK-44923
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44922) Disable o.a.p.h.InternalParquetRecordWriter logs for tests to reduce the log volume

2023-08-22 Thread Kent Yao (Jira)
Kent Yao created SPARK-44922:


 Summary: Disable o.a.p.h.InternalParquetRecordWriter logs for 
tests to reduce the log volume
 Key: SPARK-44922
 URL: https://issues.apache.org/jira/browse/SPARK-44922
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44921) Remove SqlBaseLexer.tokens from codebase

2023-08-22 Thread Rui Wang (Jira)
Rui Wang created SPARK-44921:


 Summary: Remove SqlBaseLexer.tokens from codebase
 Key: SPARK-44921
 URL: https://issues.apache.org/jira/browse/SPARK-44921
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Rui Wang
Assignee: Rui Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44920) Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()

2023-08-22 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-44920:
--

Assignee: Josh Rosen

> Use await() instead of awaitUninterruptibly() in 
> TransportClientFactory.createClient() 
> ---
>
> Key: SPARK-44920
> URL: https://issues.apache.org/jira/browse/SPARK-44920
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.3, 3.4.2, 3.5.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> This is a follow-up to SPARK-44241:
> That change added an `awaitUninterruptibly()` call, which I think should be a 
> plain `await()` instead. This will prevent issues when cancelling tasks with 
> hanging network connections. 
> This issue is similar to SPARK-19529.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44920) Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()

2023-08-22 Thread Josh Rosen (Jira)
Josh Rosen created SPARK-44920:
--

 Summary: Use await() instead of awaitUninterruptibly() in 
TransportClientFactory.createClient() 
 Key: SPARK-44920
 URL: https://issues.apache.org/jira/browse/SPARK-44920
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.3, 3.4.2, 3.5.0
Reporter: Josh Rosen


This is a follow-up to SPARK-44241:

That change added an `awaitUninterruptibly()` call, which I think should be a 
plain `await()` instead. This will prevent issues when cancelling tasks with 
hanging network connections. 

This issue is similar to SPARK-19529.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44907) `DataFrame.join` should throw IllegalArgumentException for invalid join types

2023-08-22 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44907:
-

Assignee: Ruifeng Zheng

> `DataFrame.join` should throw IllegalArgumentException for invalid join types
> -
>
> Key: SPARK-44907
> URL: https://issues.apache.org/jira/browse/SPARK-44907
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44907) `DataFrame.join` should throw IllegalArgumentException for invalid join types

2023-08-22 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44907.
---
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42603
[https://github.com/apache/spark/pull/42603]

> `DataFrame.join` should throw IllegalArgumentException for invalid join types
> -
>
> Key: SPARK-44907
> URL: https://issues.apache.org/jira/browse/SPARK-44907
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.5.0, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44919) Avro connector: convert a union of a single primitive type to a StructType

2023-08-22 Thread Tianhan Hu (Jira)
Tianhan Hu created SPARK-44919:
--

 Summary: Avro connector: convert a union of a single primitive 
type to a StructType
 Key: SPARK-44919
 URL: https://issues.apache.org/jira/browse/SPARK-44919
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: Tianhan Hu


Spark's Avro data source schema converter currently converts a union with a 
single primitive type to a Spark primitive type instead of a StructType, 
whereas it translates more complex unions consisting of multiple primitive 
types into StructTypes.

For example:
{code:scala}
import scala.collection.JavaConverters._
import org.apache.avro._
import org.apache.spark.sql.avro._

// ["string", "null"]
SchemaConverters.toSqlType(
  Schema.createUnion(Seq(
    Schema.create(Schema.Type.STRING),
    Schema.create(Schema.Type.NULL)).asJava)
).dataType

// ["string", "int", "null"]
SchemaConverters.toSqlType(
  Schema.createUnion(Seq(
    Schema.create(Schema.Type.STRING),
    Schema.create(Schema.Type.INT),
    Schema.create(Schema.Type.NULL)).asJava)
).dataType
{code}
The first call returns StringType; the second returns 
StructType(StringType, IntegerType).

We hope to add a new configuration to control the conversion behavior. The 
default behavior would stay the same; when the config is altered, a union 
with a single primitive type would be translated into a StructType.
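A sketch of the proposed switch; the config name below is hypothetical, since 
the ticket only states that a new configuration will be added:
{code:scala}
// Hypothetical flag; the default preserves today's behavior.
spark.conf.set("spark.sql.avro.convertSingleTypeUnionToStruct", "true")
{code}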



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44857) Fix getBaseURI error in Spark Worker LogPage UI buttons

2023-08-22 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757691#comment-17757691
 ] 

Dongjoon Hyun commented on SPARK-44857:
---

Thank you for fixing the `Fix Version`, [~yumwang].

> Fix getBaseURI error in Spark Worker LogPage UI buttons
> ---
>
> Key: SPARK-44857
> URL: https://issues.apache.org/jira/browse/SPARK-44857
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.2.0, 3.2.4, 3.3.2, 3.4.1, 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4
>
> Attachments: Screenshot 2023-08-17 at 2.38.45 PM.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44918) Add named argument support for scalar Python/Pandas UDFs

2023-08-22 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-44918:
-

 Summary: Add named argument support for scalar Python/Pandas UDFs
 Key: SPARK-44918
 URL: https://issues.apache.org/jira/browse/SPARK-44918
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44916) Document Spark Driver Live Log UI

2023-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44916.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42615
[https://github.com/apache/spark/pull/42615]

> Document Spark Driver Live Log UI
> -
>
> Key: SPARK-44916
> URL: https://issues.apache.org/jira/browse/SPARK-44916
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44916) Document Spark Driver Live Log UI

2023-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44916:
-

Assignee: Dongjoon Hyun

> Document Spark Driver Live Log UI
> -
>
> Key: SPARK-44916
> URL: https://issues.apache.org/jira/browse/SPARK-44916
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44917) PySpark Streaming DataStreamWriter table API

2023-08-22 Thread Wei Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Liu resolved SPARK-44917.
-
Resolution: Not A Problem

> PySpark Streaming DataStreamWriter table API
> 
>
> Key: SPARK-44917
> URL: https://issues.apache.org/jira/browse/SPARK-44917
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44914) Upgrade Apache ivy to 2.5.2

2023-08-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bjørn Jørgensen updated SPARK-44914:

Affects Version/s: 3.5.0

> Upgrade Apache ivy to 2.5.2
> 
>
> Key: SPARK-44914
> URL: https://issues.apache.org/jira/browse/SPARK-44914
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [CVE-2022-46751|https://www.cve.org/CVERecord?id=CVE-2022-46751]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44917) PySpark Streaming DataStreamWriter table API

2023-08-22 Thread Wei Liu (Jira)
Wei Liu created SPARK-44917:
---

 Summary: PySpark Streaming DataStreamWriter table API
 Key: SPARK-44917
 URL: https://issues.apache.org/jira/browse/SPARK-44917
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Wei Liu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44914) Upgrade Apache ivy to 2.5.2

2023-08-22 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757647#comment-17757647
 ] 

Hudson commented on SPARK-44914:


User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/42613

> Upgrade Apache ivy to 2.5.2
> 
>
> Key: SPARK-44914
> URL: https://issues.apache.org/jira/browse/SPARK-44914
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [CVE-2022-46751|https://www.cve.org/CVERecord?id=CVE-2022-46751]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44840) array_insert() gives wrong results for negative index

2023-08-22 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-44840.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42564
[https://github.com/apache/spark/pull/42564]

> array_insert() gives wrong results for negative index
> ---
>
> Key: SPARK-44840
> URL: https://issues.apache.org/jira/browse/SPARK-44840
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: Max Gekk
>Priority: Major
> Fix For: 4.0.0
>
>
> Unlike Snowflake, we decided that array_insert() is 1-based.
> This means 1 is the first element in an array and -1 is the last. 
> This matches the behavior of functions such as substr() and element_at().
>  
> {code:java}
> > SELECT array_insert(array('a', 'b', 'c'), 1, 'z');
> ["z","a","b","c"]
> > SELECT array_insert(array('a', 'b', 'c'), 0, 'z');
> Error
> > SELECT array_insert(array('a', 'b', 'c'), -1, 'z');
> ["a","b","c","z"]
> > SELECT array_insert(array('a', 'b', 'c'), 5, 'z');
> ["a","b","c",NULL,"z"]
> > SELECT array_insert(array('a', 'b', 'c'), -5, 'z');
> ["z",NULL,"a","b","c"]
> > SELECT array_insert(array('a', 'b', 'c'), 2, cast(NULL AS STRING));
> ["a",NULL,"b","c"]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44916) Document Spark Driver Live Log UI

2023-08-22 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44916:
-

 Summary: Document Spark Driver Live Log UI
 Key: SPARK-44916
 URL: https://issues.apache.org/jira/browse/SPARK-44916
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44915) Validate checksum of remounted PVC's shuffle data before recovery

2023-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44915:
--
Component/s: (was: Spark Core)

> Validate checksum of remounted PVC's shuffle data before recovery
> -
>
> Key: SPARK-44915
> URL: https://issues.apache.org/jira/browse/SPARK-44915
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44915) Validate checksum of remounted PVC's shuffle data before recovery

2023-08-22 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44915:
-

 Summary: Validate checksum of remounted PVC's shuffle data before 
recovery
 Key: SPARK-44915
 URL: https://issues.apache.org/jira/browse/SPARK-44915
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-08-22 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-44871:


Assignee: Peter Toth

> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Critical
> Fix For: 3.5.0, 4.0.0
>
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
>   percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
>   percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
>   percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
>   percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
>   percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
>   percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
>   percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
>   percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
>   percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
>   percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
>   percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-08-22 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-44871.
--
Fix Version/s: 3.4.2
   (was: 3.5.0)
   (was: 4.0.0)
   Resolution: Fixed

Issue resolved by pull request 42610
[https://github.com/apache/spark/pull/42610]

> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Critical
> Fix For: 3.4.2
>
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
>   percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
>   percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
>   percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
>   percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
>   percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
>   percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
>   percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
>   percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
>   percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
>   percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
>   percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44914) Upgrade Apache ivy to 2.5.2

2023-08-22 Thread Jira
Bjørn Jørgensen created SPARK-44914:
---

 Summary: Upgrade Apache ivy to 2.5.2
 Key: SPARK-44914
 URL: https://issues.apache.org/jira/browse/SPARK-44914
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build
Affects Versions: 4.0.0
Reporter: Bjørn Jørgensen


[CVE-2022-46751|https://www.cve.org/CVERecord?id=CVE-2022-46751]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44884) Spark doesn't create SUCCESS file when external path is passed

2023-08-22 Thread Dipayan Dev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757550#comment-17757550
 ] 

Dipayan Dev commented on SPARK-44884:
-

[~ste...@apache.org], I am running on Dataproc, but I am able to replicate it 
from my local machine as well.

The _SUCCESS file is created with:
 * Spark 2.x (2.4.0 in my case) with .saveAsTable(), with or without an 
external path.
 * Spark 3.3.0 with .saveAsTable() without an external path.

The _SUCCESS file is not created with:
 * Spark 3.3.0 with .saveAsTable() with an external path.

As mentioned, I have set the following config, but it does not help:
spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", 
true) 
Are you able to replicate the issue with the snippet I have shared, or is the 
_SUCCESS file generated at your end when an external path is passed?

> Spark doesn't create SUCCESS file when external path is passed
> --
>
> Key: SPARK-44884
> URL: https://issues.apache.org/jira/browse/SPARK-44884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dipayan Dev
>Priority: Critical
> Attachments: image-2023-08-20-18-08-38-531.png, 
> image-2023-08-20-18-46-53-342.png
>
>
> The issue is not happening in Spark 2.x (I am using 2.4.0), but only in 3.3.0
> Code to reproduce the issue.
>  
> {code:java}
> scala> spark.conf.set("spark.sql.orc.char.enabled", true)
> scala> val DF = Seq(("test1", 123)).toDF("name", "num")
> scala> DF.write.option("path", 
> "gs://test_dd123/").mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("test_schema.table_name")
> 23/08/20 12:31:43 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.   {code}
> The above code succeeds and creates the External Hive table, but {*}there is 
> no SUCCESS file generated{*}. The same code when running spark 2.4.0, 
> generating a SUCCESS file.
> Adding the content of the bucket after table creation
>  
> !image-2023-08-20-18-08-38-531.png|width=453,height=162!
>  
> But when I don’t pass the external path as following, the SUCCESS file is 
> generated
> {code:java}
> scala> 
> DF.write.mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("us_wm_supply_chain_rcv_pre_prod.test_tb1")
>  {code}
> !image-2023-08-20-18-46-53-342.png|width=465,height=166!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44884) Spark doesn't create SUCCESS file when external path is passed

2023-08-22 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757547#comment-17757547
 ] 

Steve Loughran commented on SPARK-44884:


[~dipayandev] I don't think anyone has disabled the option; it doesn't 
surface in my test setup (manifest and s3a committers).

Afraid you are going to have to debug it yourself, as it is your env which has 
the problem.

Does everything work if you use .saveAs() rather than .saveAsTable()?

> Spark doesn't create SUCCESS file when external path is passed
> --
>
> Key: SPARK-44884
> URL: https://issues.apache.org/jira/browse/SPARK-44884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dipayan Dev
>Priority: Critical
> Attachments: image-2023-08-20-18-08-38-531.png, 
> image-2023-08-20-18-46-53-342.png
>
>
> The issue is not happening in Spark 2.x (I am using 2.4.0), but only in 3.3.0
> Code to reproduce the issue.
>  
> {code:java}
> scala> spark.conf.set("spark.sql.orc.char.enabled", true)
> scala> val DF = Seq(("test1", 123)).toDF("name", "num")
> scala> DF.write.option("path", 
> "gs://test_dd123/").mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("test_schema.table_name")
> 23/08/20 12:31:43 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.   {code}
> The above code succeeds and creates the External Hive table, but {*}there is 
> no SUCCESS file generated{*}. The same code when running spark 2.4.0, 
> generating a SUCCESS file.
> Adding the content of the bucket after table creation
>  
> !image-2023-08-20-18-08-38-531.png|width=453,height=162!
>  
> But when I don’t pass the external path as following, the SUCCESS file is 
> generated
> {code:java}
> scala> 
> DF.write.mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("us_wm_supply_chain_rcv_pre_prod.test_tb1")
>  {code}
> !image-2023-08-20-18-46-53-342.png|width=465,height=166!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44817) Incremental Stats Collection

2023-08-22 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757539#comment-17757539
 ] 

Rakesh Raushan commented on SPARK-44817:


Sure. I would try to come up with a SPIP by this weekend.

> Incremental Stats Collection
> 
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's cost-based optimizer depends on table and column statistics.
> After every DML query, table and column stats are invalidated unless auto 
> update of stats collection is turned on. To keep stats updated, we need to 
> run the `ANALYZE TABLE ... COMPUTE STATISTICS` command, which is very 
> expensive; it is not feasible to run it after every DML query (a sketch of 
> this refresh path follows the list below).
> Instead, we can incrementally update the stats during each DML query run 
> itself. This way our table and column stats would be fresh at all times and 
> CBO benefits can be applied. Initially, we can update only table-level stats 
> and gradually start updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE ... 
> COMPUTE STATISTICS` for updating stats.
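> For reference, a sketch of today's manual refresh path (the table and column 
> names are illustrative):
> {code:scala}
> // Full recomputation after a DML run; expensive on large tables.
> spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
> spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS id, amount")
> {code}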



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-08-22 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-44871:
---
Affects Version/s: 3.4.0
   3.3.2
   3.3.1
   3.3.0

> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1
>Reporter: Peter Toth
>Priority: Critical
> Fix For: 3.5.0, 4.0.0
>
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
>   percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
>   percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
>   percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
>   percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
>   percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
>   percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
>   percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
>   percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
>   percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
>   percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
>   percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-08-22 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-44871:
---
Affects Version/s: 3.4.1
   3.3.3
   (was: 3.3.0)
   (was: 3.4.0)
   (was: 3.5.0)
   (was: 4.0.0)

> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.1
>Reporter: Peter Toth
>Priority: Critical
> Fix For: 3.5.0, 4.0.0
>
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
>   percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
>   percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
>   percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
>   percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
>   percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
>   percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
>   percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
>   percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
>   percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
>   percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
>   percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-08-22 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-44871:
---
Fix Version/s: 3.5.0
   4.0.0

> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0, 4.0.0
>Reporter: Peter Toth
>Priority: Critical
> Fix For: 3.5.0, 4.0.0
>
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
>   percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
>   percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
>   percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
>   percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
>   percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
>   percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
>   percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
>   percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
>   percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
>   percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
>   percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44913) DS V2 supports push down V2 UDF that has magic method

2023-08-22 Thread Xianyang Liu (Jira)
Xianyang Liu created SPARK-44913:


 Summary: DS V2 supports push down V2 UDF that has magic method
 Key: SPARK-44913
 URL: https://issues.apache.org/jira/browse/SPARK-44913
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.1
Reporter: Xianyang Liu


Right now we only support pushing down V2 UDFs that do not have a magic 
method, because such a UDF is analyzed into an `ApplyFunctionExpression`, 
which can be translated and pushed down. However, a V2 UDF that has a magic 
method is analyzed into `StaticInvoke` or `Invoke`, which cannot be translated 
into a V2 expression and therefore cannot be pushed down to the data source. 
Since the magic method is the recommended approach, this PR adds support for 
pushing down V2 UDFs that have a magic method (see the sketch below).
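A minimal sketch of such a UDF, assuming Spark's `ScalarFunction` API (the 
class and function names are hypothetical):
{code:java}
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;

// Hypothetical V2 UDF. Because it defines the "invoke" magic method, Spark
// analyzes calls to it into Invoke/StaticInvoke rather than
// ApplyFunctionExpression, which is what previously blocked push-down.
public class IntAdd implements ScalarFunction<Integer> {
  @Override public DataType[] inputTypes() {
    return new DataType[] { DataTypes.IntegerType, DataTypes.IntegerType };
  }
  @Override public DataType resultType() { return DataTypes.IntegerType; }
  @Override public String name() { return "int_add"; }

  // Magic method: resolved by name and argument types instead of the
  // row-based produceResult() path.
  public int invoke(int a, int b) { return a + b; }
}
{code}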



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44912) Spark 3.4 multi-column sum slows with many columns

2023-08-22 Thread Brady Bickel (Jira)
Brady Bickel created SPARK-44912:


 Summary: Spark 3.4 multi-column sum slows with many columns
 Key: SPARK-44912
 URL: https://issues.apache.org/jira/browse/SPARK-44912
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.4.1, 3.4.0
Reporter: Brady Bickel


The code below is a minimal reproducible example of an issue I discovered with 
Pyspark 3.4.x. I want to sum the values of multiple columns and put the sum of 
those columns (per row) into a new column. This code works and returns in a 
reasonable amount of time in Pyspark 3.3.x, but is extremely slow in Pyspark 
3.4.x when the number of columns grows. See below for execution timing summary 
as N varies.
{code:java}
import pyspark.sql.functions as F
import random
import string
from functools import reduce
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# generate a dataframe N columns by M rows with random 8 digit column 
# names and random integers in [-5,10]
N = 30
M = 100
columns = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8))
           for _ in range(N)]
data = [tuple([random.randint(-5, 10) for _ in range(N)])
        for _ in range(M)]

df = spark.sparkContext.parallelize(data).toDF(columns)
# 3 ways to add a sum column, all of them slow for high N in spark 3.4
df = df.withColumn("col_sum1", sum(df[col] for col in columns))
df = df.withColumn("col_sum2", reduce(add, [F.col(col) for col in columns]))
df = df.withColumn("col_sum3", F.expr("+".join(columns))) {code}
Timing results for Spark 3.3:
||N||Exe Time (s)||
|5|0.514|
|10|0.248|
|15|0.327|
|20|0.403|
|25|0.279|
|30|0.322|
|50|0.430|

Timing results for Spark 3.4:
||N||Exe Time (s)||
|5|0.379|
|10|0.318|
|15|0.405|
|20|1.32|
|25|28.8|
|30|448|
|50|>1 (did not finish)|



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44911) create hive table with invalid column should return error class

2023-08-22 Thread zzzzming95 (Jira)
ming95 created SPARK-44911:
--

 Summary: create hive table with invalid column should return error 
class
 Key: SPARK-44911
 URL: https://issues.apache.org/jira/browse/SPARK-44911
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.1
Reporter: ming95






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40156) url_decode() exposes a Java error

2023-08-22 Thread zzzzming95 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ming95 resolved SPARK-40156.

Fix Version/s: 3.4.0
   Resolution: Fixed

> url_decode() exposes a Java error
> -
>
> Key: SPARK-40156
> URL: https://issues.apache.org/jira/browse/SPARK-40156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
> Fix For: 3.4.0
>
>
> Given a badly encoded string, Spark returns a raw Java error.
> It should instead return an ERROR_CLASS.
> spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org');
> 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT 
> url_decode('http%3A%2F%2spark.apache.org')]
> java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in 
> escape (%) pattern - Error at index 1 in: "2s"
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44910) Encoders.bean does not support superclasses with generic type arguments

2023-08-22 Thread Giambattista Bloisi (Jira)
Giambattista Bloisi created SPARK-44910:
---

 Summary: Encoders.bean does not support superclasses with generic 
type arguments
 Key: SPARK-44910
 URL: https://issues.apache.org/jira/browse/SPARK-44910
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1, 3.5.0, 4.0.0
Reporter: Giambattista Bloisi


As with SPARK-44634, another unsupported feature of the bean encoder is a bean 
whose superclass has generic type arguments. For example:
{code:java}
class JavaBeanWithGenericsA<T> {
    public T getPropertyA() {
        return null;
    }

    public void setPropertyA(T a) {
    }
}

// Any concrete type argument triggers the issue; String is illustrative.
class JavaBeanWithGenericBase extends JavaBeanWithGenericsA<String> {
}

Encoders.bean(JavaBeanWithGenericBase.class); // Exception

{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44885) NullPointerException is thrown when column with ROWID type contains NULL values

2023-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44885:


Assignee: Tim Nieradzik

> NullPointerException is thrown when column with ROWID type contains NULL 
> values
> ---
>
> Key: SPARK-44885
> URL: https://issues.apache.org/jira/browse/SPARK-44885
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1
>Reporter: Tim Nieradzik
>Assignee: Tim Nieradzik
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> A row ID may be NULL in an Oracle table. When this is the case, the following 
> exception is thrown:
> {noformat}
> [info] Cause: java.lang.NullPointerException:
> [info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$12(JdbcUtils.scala:452)
> [info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$12$adapted(JdbcUtils.scala:451)
> [info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:361)
> [info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:343)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44885) NullPointerException is thrown when column with ROWID type contains NULL values

2023-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44885.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42576
[https://github.com/apache/spark/pull/42576]

> NullPointerException is thrown when column with ROWID type contains NULL 
> values
> ---
>
> Key: SPARK-44885
> URL: https://issues.apache.org/jira/browse/SPARK-44885
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1
>Reporter: Tim Nieradzik
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> A row ID may be NULL in an Oracle table. When this is the case, the following 
> exception is thrown:
> {noformat}
> [info] Cause: java.lang.NullPointerException:
> [info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$12(JdbcUtils.scala:452)
> [info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$12$adapted(JdbcUtils.scala:451)
> [info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:361)
> [info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:343)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44786) Convert common Spark exceptions

2023-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44786:
-
Fix Version/s: 3.5.0

> Convert common Spark exceptions
> ---
>
> Key: SPARK-44786
> URL: https://issues.apache.org/jira/browse/SPARK-44786
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yihong He
>Assignee: Yihong He
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44786) Convert common Spark exceptions

2023-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44786.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42472
[https://github.com/apache/spark/pull/42472]

> Convert common Spark exceptions
> ---
>
> Key: SPARK-44786
> URL: https://issues.apache.org/jira/browse/SPARK-44786
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yihong He
>Assignee: Yihong He
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44786) Convert common Spark exceptions

2023-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44786:


Assignee: Yihong He

> Convert common Spark exceptions
> ---
>
> Key: SPARK-44786
> URL: https://issues.apache.org/jira/browse/SPARK-44786
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yihong He
>Assignee: Yihong He
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38958) Override S3 Client in Spark Write/Read calls

2023-08-22 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757364#comment-17757364
 ] 

Steve Loughran commented on SPARK-38958:


[~hershalb] we are about to merge the v2 SDK feature set; it'd be good for you 
to see if your changes work there.

As for static headers, I could imagine something like what we added in 
HADOOP-17833 for adding headers to created files.

# Define a well-known prefix, e.g. {{fs.s3a.request.headers.}}
# Every key matching {{fs.s3a.request.headers.*}} becomes a header; its value 
becomes the header value.

The alternative, as done for custom signers, is a list of key=value pairs 
separated by commas.
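
A sketch of the first option under the proposed key scheme (the prefix is only 
proposed above, not an existing s3a option, and the header is just an example):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.SparkSession;

public class S3RequestHeaders {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();
    Configuration conf = spark.sparkContext().hadoopConfiguration();
    // Hypothetical: everything after the prefix would be the header name and
    // the value the header value sent on each S3 request.
    conf.set("fs.s3a.request.headers.x-amz-expected-bucket-owner", "123456789012");
  }
}
{code}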

> Override S3 Client in Spark Write/Read calls
> 
>
> Key: SPARK-38958
> URL: https://issues.apache.org/jira/browse/SPARK-38958
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Hershal
>Priority: Major
>
> Hello,
> I have been working to use spark to read and write data to S3. Unfortunately, 
> there are a few S3 headers that I need to add to my spark read/write calls. 
> After much looking, I have not found a way to replace the S3 client that 
> spark uses to make the read/write calls. I also have not found a 
> configuration that allows me to pass in S3 headers. Here is an example of 
> some common S3 request headers 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonRequestHeaders.html]).
>  Does there already exist functionality to add S3 headers to spark read/write 
> calls or pass in a custom client that would pass these headers on every 
> read/write request? Appreciate the help and feedback
>  
> Thanks,



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44909) Skip starting torch distributor log streaming server when it is not available

2023-08-22 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-44909:
--

 Summary: Skip starting torch distributor log streaming server when 
it is not available
 Key: SPARK-44909
 URL: https://issues.apache.org/jira/browse/SPARK-44909
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.5
Reporter: Weichen Xu


Skip starting torch distributor log streaming server when it is not available.

 

In some cases, e.g. in a Databricks Connect cluster, a network limitation 
causes the log streaming server to fail to start, but this should not break 
the torch distributor training routine.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44905) NullPointerException on stateful expression evaluation

2023-08-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757297#comment-17757297
 ] 

ASF GitHub Bot commented on SPARK-44905:


User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/42601

> NullPointerException on stateful expression evaluation
> --
>
> Key: SPARK-44905
> URL: https://issues.apache.org/jira/browse/SPARK-44905
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0, 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26398) Support building GPU docker images

2023-08-22 Thread comet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757293#comment-17757293
 ] 

comet commented on SPARK-26398:
---

I see #23347 was closed without being merged. Does that mean GPU support is 
not available in Spark and we need to build the docker image ourselves? Is 
there any step-by-step guide available?

> Support building GPU docker images
> --
>
> Key: SPARK-26398
> URL: https://issues.apache.org/jira/browse/SPARK-26398
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.0
>Reporter: Rong Ou
>Priority: Minor
>
> To run Spark on Kubernetes, a user first needs to build docker images using 
> the `bin/docker-image-tool.sh` script. However, this script only supports 
> building images for running on CPUs. As parts of Spark and related libraries 
> (e.g. XGBoost) get accelerated on GPUs, it's desirable to build base images 
> that can take advantage of GPU acceleration.
> This issue only addresses building docker images with CUDA support. Actually 
> accelerating Spark on GPUs is outside the scope, as is supporting other types 
> of GPUs.
> Today if anyone wants to experiment with running Spark on Kubernetes with GPU 
> support, they have to write their own custom `Dockerfile`. By providing an 
> "official" way to build GPU-enabled docker images, we can make it easier to 
> get started.
> For now probably not that many people care about this, but it's a necessary 
> first step towards GPU acceleration for Spark on Kubernetes.
> The risks are minimal as we only need to make minor changes to 
> `bin/docker-image-tool.sh`. The PR is already done and will be attached. 
> Success means anyone can easily build Spark docker images with GPU support.
> Proposed API changes: add an optional `-g` flag to 
> `bin/docker-image-tool.sh` for building GPU versions of the JVM/Python/R 
> docker images. When `-g` is omitted, existing behavior is preserved.
> Design sketch: when the `-g` flag is specified, we append `-gpu` to the 
> docker image names and switch to dockerfiles based on the official CUDA 
> images. Since the CUDA images are based on Ubuntu while the Spark dockerfiles 
> are based on Alpine, the steps for setting up additional packages differ, so 
> there is a parallel set of `Dockerfile.gpu` files.
> Alternative: if we are willing to forego Alpine and switch to Ubuntu for the 
> CPU-only images, the two sets of dockerfiles can be unified, and we can just 
> pass in a different base image depending on whether the `-g` flag is present 
> or not.
>  
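> For illustration, a hypothetical invocation under the proposed API, i.e. the 
> existing docker-image-tool usage plus the new flag (never merged, so 
> illustrative only):
> {noformat}
> # -g would select the CUDA-based dockerfiles and append "-gpu" to image names
> ./bin/docker-image-tool.sh -r my-repo -t v2.4.0 -g build
> {noformat}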



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44908) Fix spark connect ML crossvalidator "foldCol" param

2023-08-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757292#comment-17757292
 ] 

ASF GitHub Bot commented on SPARK-44908:


User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/42605

> Fix spark connect ML crossvalidator "foldCol" param
> ---
>
> Key: SPARK-44908
> URL: https://issues.apache.org/jira/browse/SPARK-44908
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, ML
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
>
> Fix spark connect ML crossvalidator "foldCol" param.
>  
> Currently it calls `df.rdd` APIs, which are not supported in Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44908) Fix spark connect ML crossvalidator "foldCol" param

2023-08-22 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-44908:
--

 Summary: Fix spark connect ML crossvalidator "foldCol" param
 Key: SPARK-44908
 URL: https://issues.apache.org/jira/browse/SPARK-44908
 Project: Spark
  Issue Type: Bug
  Components: Connect, ML
Affects Versions: 3.5.0
Reporter: Weichen Xu


Fix spark connect ML crossvalidator "foldCol" param.

 

Currently it calls `df.rdd` APIs, which are not supported in Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44908) Fix spark connect ML crossvalidator "foldCol" param

2023-08-22 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-44908:
--

Assignee: Weichen Xu

> Fix spark connect ML crossvalidator "foldCol" param
> ---
>
> Key: SPARK-44908
> URL: https://issues.apache.org/jira/browse/SPARK-44908
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, ML
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
>
> Fix spark connect ML crossvalidator "foldCol" param.
>  
> Currently it calls `df.rdd` APIs, which are not supported in Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44906) Move substituteAppNExecIds logic into kubernetesConf.annotations method

2023-08-22 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757238#comment-17757238
 ] 

GridGain Integration commented on SPARK-44906:
--

User 'zwangsheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42600

> Move substituteAppNExecIds logic into kubernetesConf.annotations method 
> 
>
> Key: SPARK-44906
> URL: https://issues.apache.org/jira/browse/SPARK-44906
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.1
>Reporter: Binjie Yang
>Priority: Major
>
> Move the Utils.substituteAppNExecIds logic into KubernetesConf.annotations 
> as the default behavior, so users can reuse it instead of rewriting the same 
> logic themselves.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42768) Enable cached plan apply AQE by default

2023-08-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-42768.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40390
[https://github.com/apache/spark/pull/40390]

> Enable cached plan apply AQE by default
> ---
>
> Key: SPARK-42768
> URL: https://issues.apache.org/jira/browse/SPARK-42768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42768) Enable cached plan apply AQE by default

2023-08-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-42768:
---

Assignee: XiDuo You

> Enable cached plan apply AQE by default
> ---
>
> Key: SPARK-42768
> URL: https://issues.apache.org/jira/browse/SPARK-42768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44905) NullPointerException on stateful expression evaluation

2023-08-22 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-44905:


Assignee: Kent Yao

> NullPointerException on stateful expression evaluation
> --
>
> Key: SPARK-44905
> URL: https://issues.apache.org/jira/browse/SPARK-44905
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0, 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44907) `DataFrame.join` should throw IllegalArgumentException for invalid join types

2023-08-22 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-44907:
--
Component/s: PySpark
 (was: Tests)

> `DataFrame.join` should throw IllegalArgumentException for invalid join types
> -
>
> Key: SPARK-44907
> URL: https://issues.apache.org/jira/browse/SPARK-44907
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44907) `DataFrame.join` should throw IllegalArgumentException for invalid join types

2023-08-22 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-44907:
-

 Summary: `DataFrame.join` should throw IllegalArgumentException 
for invalid join types
 Key: SPARK-44907
 URL: https://issues.apache.org/jira/browse/SPARK-44907
 Project: Spark
  Issue Type: Bug
  Components: Connect, Tests
Affects Versions: 3.5.0, 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3

2023-08-22 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-44892:

Fix Version/s: (was: 4.0.0)

> Add official image Dockerfile for Spark 3.3.3
> -
>
> Key: SPARK-44892
> URL: https://issues.apache.org/jira/browse/SPARK-44892
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker
>Affects Versions: 3.3.3
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3

2023-08-22 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-44892.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 54
[https://github.com/apache/spark-docker/pull/54]

> Add official image Dockerfile for Spark 3.3.3
> -
>
> Key: SPARK-44892
> URL: https://issues.apache.org/jira/browse/SPARK-44892
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker
>Affects Versions: 3.3.3
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3

2023-08-22 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-44892:
---

Assignee: Yuming Wang

> Add official image Dockerfile for Spark 3.3.3
> -
>
> Key: SPARK-44892
> URL: https://issues.apache.org/jira/browse/SPARK-44892
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker
>Affects Versions: 3.3.3
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44890) Miswritten remarks in pom file

2023-08-22 Thread chenyu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17756746#comment-17756746
 ] 

chenyu edited comment on SPARK-44890 at 8/22/23 7:11 AM:
-

I have submitted a patch 

[https://github.com/apache/spark/pull/42598|https://github.com/apache/spark/pull/42583]

 


was (Author: JIRAUSER299988):
I have submitted a patch 

https://github.com/apache/spark/pull/42583

 

> Miswritten remarks in pom file
> --
>
> Key: SPARK-44890
> URL: https://issues.apache.org/jira/browse/SPARK-44890
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.1
>Reporter: chenyu
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> A spelling issue in the pom file ('dont update') hurts readability.
> It should follow the same writing style as other places.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44728) Improve PySpark documentations

2023-08-22 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757209#comment-17757209
 ] 

Ruifeng Zheng edited comment on SPARK-44728 at 8/22/23 6:20 AM:


A docstring should contain the following sections:
 # Brief Description: A concise summary explaining the function's purpose.
 # Version Annotations: Annotations like versionadded and versionchanged to 
signify the addition or modifications of the function in different versions of 
the software.
 # Parameters: This section should list and describe all input parameters. If 
the function doesn't accept any parameters, this section can be omitted.
 # Returns: Detail what the function returns. If the function doesn't return 
anything, this section can be omitted.
 # See Also: A list of related API functions or methods. This section can be 
omitted if no related APIs exist.
 # Notes: Include additional information or warnings about the function's usage 
here.
 # Examples:
 ## A docstring contains 3~5 examples if possible.
 ## Every example should begin with a brief description, followed by the 
example code, and conclude with the expected output.
 ## Every example should be copy-pasteable;
 ## Any necessary import statements should be included at the beginning of each 
example.


was (Author: podongfeng):
A docstring should contain the following sections:
 # Brief Description: A concise summary explaining the function's purpose.
 # Version Annotations: Annotations like versionadded and versionchanged to 
signify the addition or modifications of the function in different versions of 
the software.
 # Parameters: This section should list and describe all input parameters. If 
the function doesn't accept any parameters, this section can be omitted.
 # Returns: Detail what the function returns. If the function doesn't return 
anything, this section can be omitted.
 # See Also: A list of related API functions or methods. This section can be 
omitted if no related APIs exist.
 # Notes: Include additional information or warnings about the function's usage 
here.
 # Examples: A docstring contains 3~5 examples if possible.
 ## Every example should begin with a brief description, followed by the 
example code, and conclude with the expected output.
 ## Every example should be copy-pasteable;
 ## Any necessary import statements should be included at the beginning of each 
example.

> Improve PySpark documentations
> --
>
> Key: SPARK-44728
> URL: https://issues.apache.org/jira/browse/SPARK-44728
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> An umbrella Jira ticket to improve the PySpark documentation.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44728) Improve PySpark documentations

2023-08-22 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757209#comment-17757209
 ] 

Ruifeng Zheng edited comment on SPARK-44728 at 8/22/23 6:10 AM:


A docstring should contain the following sections:
 # Brief Description: A concise summary explaining the function's purpose.
 # Version Annotations: Annotations like versionadded and versionchanged to 
signify the addition or modifications of the function in different versions of 
the software.
 # Parameters: This section should list and describe all input parameters. If 
the function doesn't accept any parameters, this section can be omitted.
 # Returns: Detail what the function returns. If the function doesn't return 
anything, this section can be omitted.
 # See Also: A list of related API functions or methods. This section can be 
omitted if no related APIs exist.
 # Notes: Include additional information or warnings about the function's usage 
here.
 # Examples: A docstring contains 3~5 examples if possible.
 ## Every example should begin with a brief description, followed by the 
example code, and conclude with the expected output.
 ## Every example should be copy-pasteable;
 ## Any necessary import statements should be included at the beginning of each 
example.


was (Author: podongfeng):
A good docstring should contain the following sections:
 # Brief Description: A concise summary explaining the function's purpose.
 # Version Annotations: Annotations like versionadded and versionchanged to 
signify the addition or modifications of the function in different versions of 
the software.
 # Parameters: This section should list and describe all input parameters. If 
the function doesn't accept any parameters, this section can be omitted.
 # Returns: Detail what the function returns. If the function doesn't return 
anything, this section can be omitted.
 # See Also: A list of related API functions or methods. This section can be 
omitted if no related APIs exist.
 # Notes: Include additional information or warnings about the function's usage 
here.
 # Examples: A docstring contains 3~5 examples if possible.
 ## Every example should begin with a brief description, followed by the 
example code, and conclude with the expected output.
 ## Every example should be copy-pasteable;
 ## Any necessary import statements should be included at the beginning of each 
example.

> Improve PySpark documentations
> --
>
> Key: SPARK-44728
> URL: https://issues.apache.org/jira/browse/SPARK-44728
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> An umbrella Jira ticket to improve the PySpark documentation.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44904) Correct the ‘versionadded’ of `sql.functions.approx_percentile` to 3.5.0.

2023-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44904:


Assignee: Yang Jie

> Correct the ‘versionadded’ of `sql.functions.approx_percentile` to 3.5.0.
> -
>
> Key: SPARK-44904
> URL: https://issues.apache.org/jira/browse/SPARK-44904
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44904) Correct the ‘versionadded’ of `sql.functions.approx_percentile` to 3.5.0.

2023-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44904.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42597
[https://github.com/apache/spark/pull/42597]

> Correct the ‘versionadded’ of `sql.functions.approx_percentile` to 3.5.0.
> -
>
> Key: SPARK-44904
> URL: https://issues.apache.org/jira/browse/SPARK-44904
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44728) Improve PySpark documentations

2023-08-22 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757209#comment-17757209
 ] 

Ruifeng Zheng edited comment on SPARK-44728 at 8/22/23 6:01 AM:


A good docstring should contain the following sections:
 # Brief Description: A concise summary explaining the function's purpose.
 # Version Annotations: Annotations like versionadded and versionchanged to 
signify the addition or modifications of the function in different versions of 
the software.
 # Parameters: This section should list and describe all input parameters. If 
the function doesn't accept any parameters, this section can be omitted.
 # Returns: Detail what the function returns. If the function doesn't return 
anything, this section can be omitted.
 # See Also: A list of related API functions or methods. This section can be 
omitted if no related APIs exist.
 # Notes: Include additional information or warnings about the function's usage 
here.
 # Examples: A docstring contains 3~5 examples if possible.
 ## Every example should begin with a brief description, followed by the 
example code, and conclude with the expected output.
 ## Every example should be copy-pasteable;
 ## Any necessary import statements should be included at the beginning of each 
example.


was (Author: podongfeng):
A good docstring should contain the following sections:
 # Brief Description: A concise summary explaining the function's purpose.
 # xVersion Annotations: Annotations like versionadded and versionchanged to 
signify the addition or modifications of the function in different versions of 
the software.
 # Parameters: This section should list and describe all input parameters. If 
the function doesn't accept any parameters, this section can be omitted.
 # Returns: Detail what the function returns. If the function doesn't return 
anything, this section can be omitted.
 # See Also: A list of related API functions or methods. This section can be 
omitted if no related APIs exist.
 # Notes: Include additional information or warnings about the function's usage 
here.
 # Examples: A docstring contains 3~5 examples if possible.
 ## Every example should begin with a brief description, followed by the 
example code, and conclude with the expected output.
 ## Every example should be copy-pasteable;
 ## Any necessary import statements should be included at the beginning of each 
example.

> Improve PySpark documentations
> --
>
> Key: SPARK-44728
> URL: https://issues.apache.org/jira/browse/SPARK-44728
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> An umbrella Jira ticket to improve the PySpark documentation.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44728) Improve PySpark documentations

2023-08-22 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757209#comment-17757209
 ] 

Ruifeng Zheng edited comment on SPARK-44728 at 8/22/23 5:59 AM:


A good docstring should contain the following sections:
 # Brief Description: A concise summary explaining the function's purpose.
 # xVersion Annotations: Annotations like versionadded and versionchanged to 
signify the addition or modifications of the function in different versions of 
the software.
 # Parameters: This section should list and describe all input parameters. If 
the function doesn't accept any parameters, this section can be omitted.
 # Returns: Detail what the function returns. If the function doesn't return 
anything, this section can be omitted.
 # See Also: A list of related API functions or methods. This section can be 
omitted if no related APIs exist.
 # Notes: Include additional information or warnings about the function's usage 
here.
 # Examples: A docstring contains 3~5 examples if possible.
 ## Every example should begin with a brief description, followed by the 
example code, and conclude with the expected output.
 ## Every example should be copy-pasteable;
 ## Any necessary import statements should be included at the beginning of each 
example.


was (Author: podongfeng):
A good docstring should contain the following sections:
 # Brief Description: A concise summary explaining the function's purpose.
 # xVersion Annotations: Annotations like versionadded and versionchanged to 
signify the addition or modifications of the function in different versions of 
the software.
 # Parameters: This section should list and describe all input parameters. If 
the function doesn't accept any parameters, this section can be omitted.
 # Returns: Detail what the function returns. If the function doesn't return 
anything, this section can be omitted.
 # See Also: A list of related API functions or methods. This section can be 
omitted if no related APIs exist.
 # Notes: Include additional information or warnings about the function's usage 
here.
 # Examples: A docstring contains 3~5 examples if possible. Every example 
should begin with a brief description, followed by the example code, and 
conclude with the expected output.
 ## An example should be copy-pasteable;
 ## Any necessary import statements should be included at the beginning of each 
example.

> Improve PySpark documentations
> --
>
> Key: SPARK-44728
> URL: https://issues.apache.org/jira/browse/SPARK-44728
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> An umbrella Jira ticket to improve the PySpark documentation.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44728) Improve PySpark documentations

2023-08-22 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757209#comment-17757209
 ] 

Ruifeng Zheng edited comment on SPARK-44728 at 8/22/23 5:59 AM:


A good docstring should contain the following sections:
 # Brief Description: A concise summary explaining the function's purpose.
 # xVersion Annotations: Annotations like versionadded and versionchanged to 
signify the addition or modifications of the function in different versions of 
the software.
 # Parameters: This section should list and describe all input parameters. If 
the function doesn't accept any parameters, this section can be omitted.
 # Returns: Detail what the function returns. If the function doesn't return 
anything, this section can be omitted.
 # See Also: A list of related API functions or methods. This section can be 
omitted if no related APIs exist.
 # Notes: Include additional information or warnings about the function's usage 
here.
 # Examples: A docstring contains 3~5 examples if possible. Every example 
should begin with a brief description, followed by the example code, and 
conclude with the expected output.
 ## An example should be copy-pasteable;
 ## Any necessary import statements should be included at the beginning of each 
example.


was (Author: podongfeng):
A good docstring should contain the following sections:
 # Brief Description: A concise summary explaining the function's purpose.
 # xVersion Annotations: Annotations like versionadded and versionchanged to 
signify the addition or modifications of the function in different versions of 
the software.
 # Parameters: This section should list and describe all input parameters. If 
the function doesn't accept any parameters, this section can be omitted.
 # Returns: Detail what the function returns. If the function doesn't return 
anything, this section can be omitted.
 # See Also: A list of related API functions or methods. This section can be 
omitted if no related APIs exist.
 # Notes: Include additional information or warnings about the function's usage 
here.
 # Examples: Every example should begin with a brief description, followed by 
the example code, and conclude with the expected output. Any necessary import 
statements should be included at the beginning of each example.

> Improve PySpark documentations
> --
>
> Key: SPARK-44728
> URL: https://issues.apache.org/jira/browse/SPARK-44728
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> An umbrella Jira ticket to improve the PySpark documentation.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org