[jira] [Assigned] (SPARK-44691) Move Subclasses of Analysis to sql/api

2023-08-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-44691:
---

Assignee: Yihong He

> Move Subclasses of Analysis to sql/api
> --
>
> Key: SPARK-44691
> URL: https://issues.apache.org/jira/browse/SPARK-44691
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yihong He
>Assignee: Yihong He
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44691) Move Subclasses of Analysis to sql/api

2023-08-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44691.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

> Move Subclasses of Analysis to sql/api
> --
>
> Key: SPARK-44691
> URL: https://issues.apache.org/jira/browse/SPARK-44691
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yihong He
>Assignee: Yihong He
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44755) Local tmp data is not cleared while using spark streaming consuming from kafka

2023-08-09 Thread leesf (Jira)
leesf created SPARK-44755:
-

 Summary: Local tmp data is not cleared while using spark streaming 
consuming from kafka
 Key: SPARK-44755
 URL: https://issues.apache.org/jira/browse/SPARK-44755
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: leesf


We are using Spark 3.2 to consume data from Kafka and then `collectAsMap` to
send the results to the driver. We found that the local temp files do not get
cleared if the data consumed from Kafka is larger than 200m
(spark.network.maxRemoteBlockSizeFetchToMem).

!https://intranetproxy.alipay.com/skylark/lark/0/2023/png/320711/1691419276170-2dd0964f-4cf4-4b15-9fbe-9622116671da.png!
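
For reference, a minimal sketch of how the reported pattern could look with
Structured Streaming in PySpark (the actual job is not included in the report;
broker and topic names below are hypothetical):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-collect-as-map").getOrCreate()

# Hypothetical brokers/topic; the real job is not shown in the report.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

def to_driver(batch_df, batch_id):
    # Blocks larger than spark.network.maxRemoteBlockSizeFetchToMem (200m in
    # the reporter's setup) are fetched to local temp files; the report is
    # that those files are never cleaned up afterwards.
    lookup = batch_df.rdd.map(lambda r: (r.key, r.value)).collectAsMap()
    print(batch_id, len(lookup))

stream.writeStream.foreachBatch(to_driver).start().awaitTermination()
{code}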



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44754) Improve DeduplicateRelations rewriteAttrs compatibility

2023-08-09 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-44754:

Description: 
Following [https://github.com/apache/spark/pull/41554], we should add tests for
{{MapPartitionsInR}}, {{MapPartitionsInRWithArrow}}, {{MapElements}},
{{MapGroups}}, {{FlatMapGroupsWithState}}, {{FlatMapGroupsInR}},
{{FlatMapGroupsInRWithArrow}}, {{FlatMapGroupsInPandas}}, {{MapInPandas}},
{{PythonMapInArrow}}, {{FlatMapGroupsInPandasWithState}} and
{{FlatMapCoGroupsInPandas}} to make sure DeduplicateRelations rewriteAttrs
rewrites their attributes correctly. We should also fix the incorrect behavior,
following [https://github.com/apache/spark/pull/41554].

  was:
Following [https://github.com/apache/spark/pull/41554], we should add tests for
{{MapPartitionsInR}}, {{MapPartitionsInRWithArrow}}, {{MapElements}},
{{MapGroups}}, {{FlatMapGroupsWithState}}, {{FlatMapGroupsInR}},
{{FlatMapGroupsInRWithArrow}}, {{FlatMapGroupsInPandas}}, {{MapInPandas}},
{{PythonMapInArrow}}, {{FlatMapGroupsInPandasWithState}} and
{{FlatMapCoGroupsInPandas}} to make sure DeduplicateRelations rewriteAttrs
rewrites their attributes correctly. We should also fix the incorrect behavior,
following [https://github.com/apache/spark/pull/41554].


> Improve DeduplicateRelations rewriteAttrs compatibility
> ---
>
> Key: SPARK-44754
> URL: https://issues.apache.org/jira/browse/SPARK-44754
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jia Fan
>Priority: Major
>
> Following [https://github.com/apache/spark/pull/41554], we should add tests
> for {{MapPartitionsInR}}, {{MapPartitionsInRWithArrow}}, {{MapElements}},
> {{MapGroups}}, {{FlatMapGroupsWithState}}, {{FlatMapGroupsInR}},
> {{FlatMapGroupsInRWithArrow}}, {{FlatMapGroupsInPandas}}, {{MapInPandas}},
> {{PythonMapInArrow}}, {{FlatMapGroupsInPandasWithState}} and
> {{FlatMapCoGroupsInPandas}} to make sure DeduplicateRelations rewriteAttrs
> rewrites their attributes correctly. We should also fix the incorrect
> behavior, following [https://github.com/apache/spark/pull/41554].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44754) Improve DeduplicateRelations rewriteAttrs compatibility

2023-08-09 Thread Jia Fan (Jira)
Jia Fan created SPARK-44754:
---

 Summary: Improve DeduplicateRelations rewriteAttrs compatibility
 Key: SPARK-44754
 URL: https://issues.apache.org/jira/browse/SPARK-44754
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Jia Fan


Following [https://github.com/apache/spark/pull/41554], we should add tests for
{{MapPartitionsInR}}, {{MapPartitionsInRWithArrow}}, {{MapElements}},
{{MapGroups}}, {{FlatMapGroupsWithState}}, {{FlatMapGroupsInR}},
{{FlatMapGroupsInRWithArrow}}, {{FlatMapGroupsInPandas}}, {{MapInPandas}},
{{PythonMapInArrow}}, {{FlatMapGroupsInPandasWithState}} and
{{FlatMapCoGroupsInPandas}} to make sure DeduplicateRelations rewriteAttrs
rewrites their attributes correctly. We should also fix the incorrect behavior,
following [https://github.com/apache/spark/pull/41554].
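
For illustration, a minimal PySpark sketch (not taken from the PR) of the kind
of plan such tests would exercise, using one of the nodes listed above:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def add_one(batches):
    for pdf in batches:
        yield pdf.assign(plus_one=pdf["id"] + 1)

mapped = spark.range(5).mapInPandas(add_one, "id long, plus_one long")

# Self-joining the same MapInPandas plan forces the analyzer to deduplicate
# the conflicting attribute ids on one side, which is exactly the
# DeduplicateRelations/rewriteAttrs path these tests should cover.
mapped.join(mapped, "id").show()
{code}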



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44461) Enable Process Isolation for streaming python worker

2023-08-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44461:
-
Fix Version/s: (was: 3.5.0)
   (was: 4.0.0)

> Enable Process Isolation for streaming python worker
> 
>
> Key: SPARK-44461
> URL: https://issues.apache.org/jira/browse/SPARK-44461
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.4.1
>Reporter: Raghu Angadi
>Priority: Major
>
> Enable PI for Python worker used for foreachBatch() & streaming listener in 
> Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-44461) Enable Process Isolation for streaming python worker

2023-08-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-44461:
--

> Enable Process Isolation for streaming python worker
> 
>
> Key: SPARK-44461
> URL: https://issues.apache.org/jira/browse/SPARK-44461
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.4.1
>Reporter: Raghu Angadi
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Enable PI for Python worker used for foreachBatch() & streaming listener in 
> Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44729) Add canonical links to the PySpark docs page

2023-08-09 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752599#comment-17752599
 ] 

Ruifeng Zheng commented on SPARK-44729:
---

[~panbingkun] Thanks!

> Add canonical links to the PySpark docs page
> 
>
> Key: SPARK-44729
> URL: https://issues.apache.org/jira/browse/SPARK-44729
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> We should add the canonical link to the PySpark docs page 
> [https://spark.apache.org/docs/latest/api/python/index.html] so that the 
> search engine can return the latest PySpark docs
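
One way to do this, assuming the PySpark docs keep building with Sphinx, is to
set {{html_baseurl}} in the docs configuration (the file path and final URL
below are a sketch for the implementer to confirm):

{code:python}
# python/docs/source/conf.py
# Sphinx (>= 1.8) emits <link rel="canonical" ...> on every generated page
# when html_baseurl is set, pointing search engines at the latest docs.
html_baseurl = "https://spark.apache.org/docs/latest/api/python/"
{code}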



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44461) Enable Process Isolation for streaming python worker

2023-08-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44461.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42421
[https://github.com/apache/spark/pull/42421]

> Enable Process Isolation for streaming python worker
> 
>
> Key: SPARK-44461
> URL: https://issues.apache.org/jira/browse/SPARK-44461
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.4.1
>Reporter: Raghu Angadi
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Enable PI for Python worker used for foreachBatch() & streaming listener in 
> Connect.
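
For context, a hedged sketch of the user-facing APIs whose Python callbacks run
in that server-side worker when using Spark Connect (the connection string,
source, and sink below are illustrative only):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.streaming.listener import StreamingQueryListener

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

class LogProgress(StreamingQueryListener):
    def onQueryStarted(self, event): pass
    def onQueryProgress(self, event):
        print(event.progress)
    def onQueryIdle(self, event): pass
    def onQueryTerminated(self, event): pass

# Over Spark Connect, both the listener callbacks and the foreachBatch
# function execute in a server-side Python worker, which is the process
# this ticket proposes to run with process isolation.
spark.streams.addListener(LogProgress())

def handle_batch(df, batch_id):
    df.write.format("noop").mode("append").save()  # placeholder sink

query = (
    spark.readStream.format("rate").load()
    .writeStream.foreachBatch(handle_batch)
    .start()
)
{code}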



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44753) XML: Add Python and sparkR binding including Spark Connect

2023-08-09 Thread Sandip Agarwala (Jira)
Sandip Agarwala created SPARK-44753:
---

 Summary: XML: Add Python and sparkR binding including Spark Connect
 Key: SPARK-44753
 URL: https://issues.apache.org/jira/browse/SPARK-44753
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Sandip Agarwala






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44752) XML: Update Spark Docs

2023-08-09 Thread Sandip Agarwala (Jira)
Sandip Agarwala created SPARK-44752:
---

 Summary: XML: Update Spark Docs
 Key: SPARK-44752
 URL: https://issues.apache.org/jira/browse/SPARK-44752
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Sandip Agarwala


 [https://spark.apache.org/docs/latest/sql-data-sources.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44751) XML: Implement FIleFormat Interface

2023-08-09 Thread Sandip Agarwala (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandip Agarwala updated SPARK-44751:

Description: 
This will also address most of the review comments from the first XML PR:

https://github.com/apache/spark/pull/41832

> XML: Implement FIleFormat Interface
> ---
>
> Key: SPARK-44751
> URL: https://issues.apache.org/jira/browse/SPARK-44751
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Priority: Major
>
> This will also address most of the review comments from the first XML PR:
> https://github.com/apache/spark/pull/41832



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44751) XML: Implement FIleFormat Interface

2023-08-09 Thread Sandip Agarwala (Jira)
Sandip Agarwala created SPARK-44751:
---

 Summary: XML: Implement FIleFormat Interface
 Key: SPARK-44751
 URL: https://issues.apache.org/jira/browse/SPARK-44751
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Sandip Agarwala






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44750) SparkSession.Builder should respect the options

2023-08-09 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-44750:
--
Description: 
In connect session builder, users use {{config}} method to set options.
However, the options are actually ignored when we create a new session.

{code}
def create(self) -> "SparkSession":
    has_channel_builder = self._channel_builder is not None
    has_spark_remote = "spark.remote" in self._options

    if has_channel_builder and has_spark_remote:
        raise ValueError(
            "Only one of connection string or channelBuilder "
            "can be used to create a new SparkSession."
        )

    if not has_channel_builder and not has_spark_remote:
        raise ValueError(
            "Needs either connection string or channelBuilder to create "
            "a new SparkSession."
        )

    if has_channel_builder:
        assert self._channel_builder is not None
        session = SparkSession(connection=self._channel_builder)
    else:
        spark_remote = to_str(self._options.get("spark.remote"))
        assert spark_remote is not None
        session = SparkSession(connection=spark_remote)

    SparkSession._set_default_and_active_session(session)
    return session

{code}


we should respect the options by invoking {{session.conf.set}} after creation.

  was:
In connect session builder, we use {{config}} method to set options.
However, the options are actually ignored.

{code}
def create(self) -> "SparkSession":
    has_channel_builder = self._channel_builder is not None
    has_spark_remote = "spark.remote" in self._options

    if has_channel_builder and has_spark_remote:
        raise ValueError(
            "Only one of connection string or channelBuilder "
            "can be used to create a new SparkSession."
        )

    if not has_channel_builder and not has_spark_remote:
        raise ValueError(
            "Needs either connection string or channelBuilder to create "
            "a new SparkSession."
        )

    if has_channel_builder:
        assert self._channel_builder is not None
        session = SparkSession(connection=self._channel_builder)
    else:
        spark_remote = to_str(self._options.get("spark.remote"))
        assert spark_remote is not None
        session = SparkSession(connection=spark_remote)

    SparkSession._set_default_and_active_session(session)
    return session

{code}


we should respect the options by invoking {{session.conf.set}} after creation.


> SparkSession.Builder should respect the options
> ---
>
> Key: SPARK-44750
> URL: https://issues.apache.org/jira/browse/SPARK-44750
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>
> In connect session builder, users use {{config}} method to set options.
> However, the options are actually ignored when we create a new session.
> {code}
> def create(self) -> "SparkSession":
>     has_channel_builder = self._channel_builder is not None
>     has_spark_remote = "spark.remote" in self._options
>     if has_channel_builder and has_spark_remote:
>         raise ValueError(
>             "Only one of connection string or channelBuilder "
>             "can be used to create a new SparkSession."
>         )
>     if not has_channel_builder and not has_spark_remote:
>         raise ValueError(
>             "Needs either connection string or channelBuilder to create "
>             "a new SparkSession."
>         )
>     if has_channel_builder:
>         assert self._channel_builder is not None
>         session = SparkSession(connection=self._channel_builder)
>     else:
>         spark_remote = to_str(self._options.get("spark.remote"))
>         assert spark_remote is not None
>         session = SparkSession(connection=spark_remote)
>     SparkSession._set_default_and_active_session(session)
>     return session
> {code}
> we should respect the options by invoking {{session.conf.set}} after creation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44750) SparkSession.Builder should respect the options

2023-08-09 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-44750:
--
Description: 
In connect session builder, we use {{config}} method to set options.
However, the options are actually ignored when we create a new session.

{code}
def create(self) -> "SparkSession":
    has_channel_builder = self._channel_builder is not None
    has_spark_remote = "spark.remote" in self._options

    if has_channel_builder and has_spark_remote:
        raise ValueError(
            "Only one of connection string or channelBuilder "
            "can be used to create a new SparkSession."
        )

    if not has_channel_builder and not has_spark_remote:
        raise ValueError(
            "Needs either connection string or channelBuilder to create "
            "a new SparkSession."
        )

    if has_channel_builder:
        assert self._channel_builder is not None
        session = SparkSession(connection=self._channel_builder)
    else:
        spark_remote = to_str(self._options.get("spark.remote"))
        assert spark_remote is not None
        session = SparkSession(connection=spark_remote)

    SparkSession._set_default_and_active_session(session)
    return session

{code}


we should respect the options by invoking {{session.conf.set}} after creation.

  was:
In connect session builder, users use {{config}} method to set options.
However, the options are actually ignored when we create a new session.

{code}
def create(self) -> "SparkSession":
    has_channel_builder = self._channel_builder is not None
    has_spark_remote = "spark.remote" in self._options

    if has_channel_builder and has_spark_remote:
        raise ValueError(
            "Only one of connection string or channelBuilder "
            "can be used to create a new SparkSession."
        )

    if not has_channel_builder and not has_spark_remote:
        raise ValueError(
            "Needs either connection string or channelBuilder to create "
            "a new SparkSession."
        )

    if has_channel_builder:
        assert self._channel_builder is not None
        session = SparkSession(connection=self._channel_builder)
    else:
        spark_remote = to_str(self._options.get("spark.remote"))
        assert spark_remote is not None
        session = SparkSession(connection=spark_remote)

    SparkSession._set_default_and_active_session(session)
    return session

{code}


we should respect the options by invoking {{session.conf.set}} after creation.


> SparkSession.Builder should respect the options
> ---
>
> Key: SPARK-44750
> URL: https://issues.apache.org/jira/browse/SPARK-44750
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>
> In connect session builder, we use {{config}} method to set options.
> However, the options are actually ignored when we create a new session.
> {code}
> def create(self) -> "SparkSession":
>     has_channel_builder = self._channel_builder is not None
>     has_spark_remote = "spark.remote" in self._options
>     if has_channel_builder and has_spark_remote:
>         raise ValueError(
>             "Only one of connection string or channelBuilder "
>             "can be used to create a new SparkSession."
>         )
>     if not has_channel_builder and not has_spark_remote:
>         raise ValueError(
>             "Needs either connection string or channelBuilder to create "
>             "a new SparkSession."
>         )
>     if has_channel_builder:
>         assert self._channel_builder is not None
>         session = SparkSession(connection=self._channel_builder)
>     else:
>         spark_remote = to_str(self._options.get("spark.remote"))
>         assert spark_remote is not None
>         session = SparkSession(connection=spark_remote)
>     SparkSession._set_default_and_active_session(session)
>     return session
> {code}
> we should respect the options by invoking {{session.conf.set}} after creation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@s

[jira] [Assigned] (SPARK-44750) SparkSession.Builder should respect the options

2023-08-09 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44750:
-

Assignee: Ruifeng Zheng

> SparkSession.Builder should respect the options
> ---
>
> Key: SPARK-44750
> URL: https://issues.apache.org/jira/browse/SPARK-44750
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>
> In connect session builder, we use {{config}} method to set options.
> However, the options are actually ignored.
> {code}
> def create(self) -> "SparkSession":
>     has_channel_builder = self._channel_builder is not None
>     has_spark_remote = "spark.remote" in self._options
>     if has_channel_builder and has_spark_remote:
>         raise ValueError(
>             "Only one of connection string or channelBuilder "
>             "can be used to create a new SparkSession."
>         )
>     if not has_channel_builder and not has_spark_remote:
>         raise ValueError(
>             "Needs either connection string or channelBuilder to create "
>             "a new SparkSession."
>         )
>     if has_channel_builder:
>         assert self._channel_builder is not None
>         session = SparkSession(connection=self._channel_builder)
>     else:
>         spark_remote = to_str(self._options.get("spark.remote"))
>         assert spark_remote is not None
>         session = SparkSession(connection=spark_remote)
>     SparkSession._set_default_and_active_session(session)
>     return session
> {code}
> we should respect the options by invoking {{session.conf.set}} after creation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44750) SparkSession.Builder should respect the options

2023-08-09 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-44750:
-

 Summary: SparkSession.Builder should respect the options
 Key: SPARK-44750
 URL: https://issues.apache.org/jira/browse/SPARK-44750
 Project: Spark
  Issue Type: Improvement
  Components: Connect, PySpark
Affects Versions: 3.5.0, 4.0.0
Reporter: Ruifeng Zheng


In connect session builder, we use {{config}} method to set options.
However, the options are actually ignored.

{code}
def create(self) -> "SparkSession":
    has_channel_builder = self._channel_builder is not None
    has_spark_remote = "spark.remote" in self._options

    if has_channel_builder and has_spark_remote:
        raise ValueError(
            "Only one of connection string or channelBuilder "
            "can be used to create a new SparkSession."
        )

    if not has_channel_builder and not has_spark_remote:
        raise ValueError(
            "Needs either connection string or channelBuilder to create "
            "a new SparkSession."
        )

    if has_channel_builder:
        assert self._channel_builder is not None
        session = SparkSession(connection=self._channel_builder)
    else:
        spark_remote = to_str(self._options.get("spark.remote"))
        assert spark_remote is not None
        session = SparkSession(connection=spark_remote)

    SparkSession._set_default_and_active_session(session)
    return session

{code}


we should respect the options by invoking {{session.conf.set}} after creation.
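
A hedged sketch of that proposal (not the merged patch; exactly which keys to
skip is an assumption here): after the session object is built inside
{{Builder.create()}}, replay the collected options onto it.

{code:python}
# ...continuing at the end of Builder.create(), before returning `session`:
for key, value in self._options.items():
    if key != "spark.remote":  # the connection string itself is not a runtime conf
        session.conf.set(key, to_str(value))
{code}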



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44732) Port the initial implementation of Spark XML data source

2023-08-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44732:


Assignee: Hyukjin Kwon

> Port the initial implementation of Spark XML data source
> 
>
> Key: SPARK-44732
> URL: https://issues.apache.org/jira/browse/SPARK-44732
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44732) Port the initial implementation of Spark XML data source

2023-08-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44732.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 41832
[https://github.com/apache/spark/pull/41832]

> Port the initial implementation of Spark XML data source
> 
>
> Key: SPARK-44732
> URL: https://issues.apache.org/jira/browse/SPARK-44732
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42621) Add `inclusive` parameter for date_range

2023-08-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42621.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 40665
[https://github.com/apache/spark/pull/40665]

> Add `inclusive` parameter for date_range
> 
>
> Key: SPARK-42621
> URL: https://issues.apache.org/jira/browse/SPARK-42621
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
> Fix For: 4.0.0
>
>
> See https://github.com/pandas-dev/pandas/issues/40245
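
A hedged sketch of the resulting API, assuming it mirrors pandas' {{inclusive}}
parameter ("both", "neither", "left", "right", with "both" as the default):

{code:python}
import pyspark.pandas as ps

# Drop the right endpoint, as in pandas.
idx = ps.date_range(start="2023-01-01", end="2023-01-05", inclusive="left")
# DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'], ...)
{code}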



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44729) Add canonical links to the PySpark docs page

2023-08-09 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752575#comment-17752575
 ] 

BingKun Pan commented on SPARK-44729:
-

Okay, let me do it.

> Add canonical links to the PySpark docs page
> 
>
> Key: SPARK-44729
> URL: https://issues.apache.org/jira/browse/SPARK-44729
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> We should add the canonical link to the PySpark docs page 
> [https://spark.apache.org/docs/latest/api/python/index.html] so that the 
> search engine can return the latest PySpark docs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44580) RocksDB crashed when testing in GitHub Actions

2023-08-09 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752572#comment-17752572
 ] 

BingKun Pan commented on SPARK-44580:
-

I have looked through the logs of the cases above, and there are log lines
similar to !image-2023-08-10-09-44-19-341.png! before each crash

> RocksDB crashed when testing in GitHub Actions
> --
>
> Key: SPARK-44580
> URL: https://issues.apache.org/jira/browse/SPARK-44580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Yang Jie
>Priority: Major
> Attachments: image-2023-08-09-20-26-11-507.png, 
> image-2023-08-10-09-44-19-341.png
>
>
> [https://github.com/LuciferYang/spark/actions/runs/5666554831/job/15395578871]
>  
> {code:java}
> #
> 17177# A fatal error has been detected by the Java Runtime Environment:
> 17178#
> 17179#  SIGSEGV (0xb) at pc=0x7f8a077d2743, pid=4403, 
> tid=0x7f89cadff640
> 17180#
> 17181# JRE version: OpenJDK Runtime Environment (8.0_372-b07) (build 
> 1.8.0_372-b07)
> 17182# Java VM: OpenJDK 64-Bit Server VM (25.372-b07 mixed mode linux-amd64 
> compressed oops)
> 17183# Problematic frame:
> 17184# C  [librocksdbjni886380103972770161.so+0x3d2743]  
> rocksdb::DBImpl::FailIfCfHasTs(rocksdb::ColumnFamilyHandle const*) const+0x23
> 17185#
> 17186# Failed to write core dump. Core dumps have been disabled. To enable 
> core dumping, try "ulimit -c unlimited" before starting Java again
> 17187#
> 17188# An error report file with more information is saved as:
> 17189# /home/runner/work/spark/spark/sql/core/hs_err_pid4403.log
> 17190#
> 17191# If you would like to submit a bug report, please visit:
> 17192#   https://github.com/adoptium/adoptium-support/issues
> 17193# The crash happened outside the Java Virtual Machine in native code.
> 17194# See problematic frame for where to report the bug.
> 17195# {code}
>  
> This is my first time encountering this problem, and I am  unsure of the root 
> cause now
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44580) RocksDB crashed when testing in GitHub Actions

2023-08-09 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-44580:

Attachment: image-2023-08-10-09-44-19-341.png

> RocksDB crashed when testing in GitHub Actions
> --
>
> Key: SPARK-44580
> URL: https://issues.apache.org/jira/browse/SPARK-44580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Yang Jie
>Priority: Major
> Attachments: image-2023-08-09-20-26-11-507.png, 
> image-2023-08-10-09-44-19-341.png
>
>
> [https://github.com/LuciferYang/spark/actions/runs/5666554831/job/15395578871]
>  
> {code:java}
> #
> 17177# A fatal error has been detected by the Java Runtime Environment:
> 17178#
> 17179#  SIGSEGV (0xb) at pc=0x7f8a077d2743, pid=4403, 
> tid=0x7f89cadff640
> 17180#
> 17181# JRE version: OpenJDK Runtime Environment (8.0_372-b07) (build 
> 1.8.0_372-b07)
> 17182# Java VM: OpenJDK 64-Bit Server VM (25.372-b07 mixed mode linux-amd64 
> compressed oops)
> 17183# Problematic frame:
> 17184# C  [librocksdbjni886380103972770161.so+0x3d2743]  
> rocksdb::DBImpl::FailIfCfHasTs(rocksdb::ColumnFamilyHandle const*) const+0x23
> 17185#
> 17186# Failed to write core dump. Core dumps have been disabled. To enable 
> core dumping, try "ulimit -c unlimited" before starting Java again
> 17187#
> 17188# An error report file with more information is saved as:
> 17189# /home/runner/work/spark/spark/sql/core/hs_err_pid4403.log
> 17190#
> 17191# If you would like to submit a bug report, please visit:
> 17192#   https://github.com/adoptium/adoptium-support/issues
> 17193# The crash happened outside the Java Virtual Machine in native code.
> 17194# See problematic frame for where to report the bug.
> 17195# {code}
>  
> This is my first time encountering this problem, and I am  unsure of the root 
> cause now
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44580) RocksDB crashed when testing in GitHub Actions

2023-08-09 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752564#comment-17752564
 ] 

BingKun Pan commented on SPARK-44580:
-

>From this error, it seems that it is caused by the absence of `dfsRootDir`

> RocksDB crashed when testing in GitHub Actions
> --
>
> Key: SPARK-44580
> URL: https://issues.apache.org/jira/browse/SPARK-44580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Yang Jie
>Priority: Major
> Attachments: image-2023-08-09-20-26-11-507.png
>
>
> [https://github.com/LuciferYang/spark/actions/runs/5666554831/job/15395578871]
>  
> {code:java}
> #
> 17177# A fatal error has been detected by the Java Runtime Environment:
> 17178#
> 17179#  SIGSEGV (0xb) at pc=0x7f8a077d2743, pid=4403, 
> tid=0x7f89cadff640
> 17180#
> 17181# JRE version: OpenJDK Runtime Environment (8.0_372-b07) (build 
> 1.8.0_372-b07)
> 17182# Java VM: OpenJDK 64-Bit Server VM (25.372-b07 mixed mode linux-amd64 
> compressed oops)
> 17183# Problematic frame:
> 17184# C  [librocksdbjni886380103972770161.so+0x3d2743]  
> rocksdb::DBImpl::FailIfCfHasTs(rocksdb::ColumnFamilyHandle const*) const+0x23
> 17185#
> 17186# Failed to write core dump. Core dumps have been disabled. To enable 
> core dumping, try "ulimit -c unlimited" before starting Java again
> 17187#
> 17188# An error report file with more information is saved as:
> 17189# /home/runner/work/spark/spark/sql/core/hs_err_pid4403.log
> 17190#
> 17191# If you would like to submit a bug report, please visit:
> 17192#   https://github.com/adoptium/adoptium-support/issues
> 17193# The crash happened outside the Java Virtual Machine in native code.
> 17194# See problematic frame for where to report the bug.
> 17195# {code}
>  
> This is my first time encountering this problem, and I am  unsure of the root 
> cause now
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44729) Add canonical links to the PySpark docs page

2023-08-09 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752561#comment-17752561
 ] 

Ruifeng Zheng commented on SPARK-44729:
---

[~panbingkun] Hi Bingkun, would you mind taking a look at this one?

> Add canonical links to the PySpark docs page
> 
>
> Key: SPARK-44729
> URL: https://issues.apache.org/jira/browse/SPARK-44729
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> We should add the canonical link to the PySpark docs page 
> [https://spark.apache.org/docs/latest/api/python/index.html] so that the 
> search engine can return the latest PySpark docs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44747) Add Dataset.Builder methods to Scala Client

2023-08-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44747.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42419
[https://github.com/apache/spark/pull/42419]

> Add Dataset.Builder methods to Scala Client
> ---
>
> Key: SPARK-44747
> URL: https://issues.apache.org/jira/browse/SPARK-44747
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44749) Support named arguments in Python UDTF

2023-08-09 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-44749:
-

 Summary: Support named arguments in Python UDTF
 Key: SPARK-44749
 URL: https://issues.apache.org/jira/browse/SPARK-44749
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44740) Allow configuring the session ID for a spark connect client in the remote string

2023-08-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44740:


Assignee: Martin Grund

> Allow configuring the session ID for a spark connect client in the remote 
> string
> 
>
> Key: SPARK-44740
> URL: https://issues.apache.org/jira/browse/SPARK-44740
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44740) Allow configuring the session ID for a spark connect client in the remote string

2023-08-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44740.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42415
[https://github.com/apache/spark/pull/42415]

> Allow configuring the session ID for a spark connect client in the remote 
> string
> 
>
> Key: SPARK-44740
> URL: https://issues.apache.org/jira/browse/SPARK-44740
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44745) Document shuffle data recovery from the remounted K8s PVCs

2023-08-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44745:
-

Assignee: Dongjoon Hyun

> Document shuffle data recovery from the remounted K8s PVCs
> --
>
> Key: SPARK-44745
> URL: https://issues.apache.org/jira/browse/SPARK-44745
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44745) Document shuffle data recovery from the remounted K8s PVCs

2023-08-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44745.
---
Fix Version/s: 3.5.0
   4.0.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 42417
[https://github.com/apache/spark/pull/42417]

> Document shuffle data recovery from the remounted K8s PVCs
> --
>
> Key: SPARK-44745
> URL: https://issues.apache.org/jira/browse/SPARK-44745
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.5.0, 4.0.0, 3.4.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44748) Query execution to support PARTITION BY and ORDER BY clause for table arguments

2023-08-09 Thread Daniel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752531#comment-17752531
 ] 

Daniel commented on SPARK-44748:


I can work on this one

> Query execution to support PARTITION BY and ORDER BY clause for table 
> arguments
> ---
>
> Key: SPARK-44748
> URL: https://issues.apache.org/jira/browse/SPARK-44748
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44503) Query planning to support PARTITION BY and ORDER BY clause for table arguments

2023-08-09 Thread Daniel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel resolved SPARK-44503.

Resolution: Fixed

> Query planning to support PARTITION BY and ORDER BY clause for table arguments
> --
>
> Key: SPARK-44503
> URL: https://issues.apache.org/jira/browse/SPARK-44503
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44748) Query execution to support PARTITION BY and ORDER BY clause for table arguments

2023-08-09 Thread Daniel (Jira)
Daniel created SPARK-44748:
--

 Summary: Query execution to support PARTITION BY and ORDER BY 
clause for table arguments
 Key: SPARK-44748
 URL: https://issues.apache.org/jira/browse/SPARK-44748
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Daniel






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44503) Query planning to support PARTITION BY and ORDER BY clause for table arguments

2023-08-09 Thread Daniel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel updated SPARK-44503:
---
Summary: Query planning to support PARTITION BY and ORDER BY clause for 
table arguments  (was: Support PARTITION BY and ORDER BY clause for table 
arguments)

> Query planning to support PARTITION BY and ORDER BY clause for table arguments
> --
>
> Key: SPARK-44503
> URL: https://issues.apache.org/jira/browse/SPARK-44503
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44646) Migrate Log4j 2.x in Spark 3.4.1 to Logback

2023-08-09 Thread Yu Tian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752524#comment-17752524
 ] 

Yu Tian commented on SPARK-44646:
-

Hi [~viirya] 

Thanks for the suggestion. We spent some time evaluating the log4j-to-slf4j
approach; unfortunately, it does not seem to work.

Compared to log4j-over-slf4j for log4j 1.x, log4j-to-slf4j is mainly an adapter
for log4j-core, which means we still need the log4j-core dependency. Since
Logback and log4j-core are two separate implementations, slf4j will complain
about having both on the classpath. Below is a diagram of what we tried:

!Screenshot 2023-08-09 at 2.40.12 PM.png!

If there is no better solution, we may need to rewrite the existing logging
logic against log4j 2.x. Thanks.

> Migrate Log4j 2.x in Spark 3.4.1 to Logback
> ---
>
> Key: SPARK-44646
> URL: https://issues.apache.org/jira/browse/SPARK-44646
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Build
>Affects Versions: 3.4.1
>Reporter: Yu Tian
>Priority: Major
> Attachments: Screenshot 2023-08-09 at 2.40.12 PM.png
>
>
> Hi,
> We are working on upgrading from Spark 3.1.3 to Spark 3.4.1. Our logging
> system uses the Logback framework, which works with Spark 3.1.3 since that
> version uses log4j 1.x. However, when we upgrade Spark to 3.4.1, which per
> the [release
> notes|https://spark.apache.org/docs/latest/core-migration-guide.html] has
> migrated from log4j 1.x to log4j 2.x, the way we replace log4j with Logback
> causes failures when starting the Spark master:
> Error: Unable to initialize main class org.apache.spark.deploy.master.Master
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/logging/log4j/core/Filter
> In our current approach we use log4j-over-slf4j to replace log4j-core, but it
> is only applicable to the log4j 1.x library, and there is no log4j-over-slf4j
> equivalent for log4j 2.x yet (please correct me if I am wrong).
> I am also curious why Spark chose to use log4j 2.x directly instead of an
> SPI, which gives users less flexibility to choose whatever logger
> implementation they want to use.
> I want to share this issue and see if anyone else has reported it and whether
> there is any workaround or alternative solution. Any suggestions are
> appreciated, thanks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44646) Migrate Log4j 2.x in Spark 3.4.1 to Logback

2023-08-09 Thread Yu Tian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Tian updated SPARK-44646:

Attachment: Screenshot 2023-08-09 at 2.40.12 PM.png

> Migrate Log4j 2.x in Spark 3.4.1 to Logback
> ---
>
> Key: SPARK-44646
> URL: https://issues.apache.org/jira/browse/SPARK-44646
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Build
>Affects Versions: 3.4.1
>Reporter: Yu Tian
>Priority: Major
> Attachments: Screenshot 2023-08-09 at 2.40.12 PM.png
>
>
> Hi,
> We are working on upgrading from Spark 3.1.3 to Spark 3.4.1. Our logging
> system uses the Logback framework, which works with Spark 3.1.3 since that
> version uses log4j 1.x. However, when we upgrade Spark to 3.4.1, which per
> the [release
> notes|https://spark.apache.org/docs/latest/core-migration-guide.html] has
> migrated from log4j 1.x to log4j 2.x, the way we replace log4j with Logback
> causes failures when starting the Spark master:
> Error: Unable to initialize main class org.apache.spark.deploy.master.Master
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/logging/log4j/core/Filter
> In our current approach we use log4j-over-slf4j to replace log4j-core, but it
> is only applicable to the log4j 1.x library, and there is no log4j-over-slf4j
> equivalent for log4j 2.x yet (please correct me if I am wrong).
> I am also curious why Spark chose to use log4j 2.x directly instead of an
> SPI, which gives users less flexibility to choose whatever logger
> implementation they want to use.
> I want to share this issue and see if anyone else has reported it and whether
> there is any workaround or alternative solution. Any suggestions are
> appreciated, thanks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44747) Add Dataset.Builder methods to Scala Client

2023-08-09 Thread Herman van Hövell (Jira)
Herman van Hövell created SPARK-44747:
-

 Summary: Add Dataset.Builder methods to Scala Client
 Key: SPARK-44747
 URL: https://issues.apache.org/jira/browse/SPARK-44747
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.5.0
Reporter: Herman van Hövell
Assignee: Herman van Hövell






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44746) Improve the documentation for TABLE input arguments for UDTFs

2023-08-09 Thread Allison Wang (Jira)
Allison Wang created SPARK-44746:


 Summary: Improve the documentation for TABLE input arguments for 
UDTFs
 Key: SPARK-44746
 URL: https://issues.apache.org/jira/browse/SPARK-44746
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


We should add more examples for using Python UDTFs with TABLE arguments.
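
For instance, a hedged sketch of the kind of documentation snippet this asks
for: a Python UDTF invoked with a TABLE argument (the function and view names
are illustrative only):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udtf

spark = SparkSession.builder.getOrCreate()

# Each input row of the TABLE argument is passed to eval() as a Row.
@udtf(returnType="id: bigint, doubled: bigint")
class DoubleIt:
    def eval(self, row):
        yield row["id"], row["id"] * 2

spark.udtf.register("double_it", DoubleIt)
spark.range(3).createOrReplaceTempView("t")
spark.sql("SELECT * FROM double_it(TABLE(t))").show()
{code}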



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44508) Add user guide for Python UDTFs

2023-08-09 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-44508:
-
Component/s: Documentation

> Add user guide for Python UDTFs
> ---
>
> Key: SPARK-44508
> URL: https://issues.apache.org/jira/browse/SPARK-44508
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> Add documentation for Python UDTFs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44738) Spark Connect Reattach misses metadata propagation

2023-08-09 Thread Juliusz Sompolski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-44738:
--
Epic Link: SPARK-43754

> Spark Connect Reattach misses metadata propagation
> --
>
> Key: SPARK-44738
> URL: https://issues.apache.org/jira/browse/SPARK-44738
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Blocker
> Fix For: 3.5.0, 4.0.0
>
>
> Currently, client metadata is not propagated in the Spark Connect Reattach
> handler.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44745) Document shuffle data recovery from the remounted K8s PVCs

2023-08-09 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44745:
-

 Summary: Document shuffle data recovery from the remounted K8s PVCs
 Key: SPARK-44745
 URL: https://issues.apache.org/jira/browse/SPARK-44745
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.4.2, 3.5.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42035) Add a config flag to force exit on JDK major version mismatch

2023-08-09 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-42035:
-
Target Version/s: 4.0.0

> Add a config flag to force exit on JDK major version mismatch
> -
>
> Key: SPARK-42035
> URL: https://issues.apache.org/jira/browse/SPARK-42035
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> JRE version mismatches can cause errors which are difficult to debug
> (potentially even correctness issues with serialization). We should add a
> flag for platforms which wish to "fail fast" and exit on a major version
> mismatch.
>  
> I think this could be a good thing to have on by default in Spark 4.
>  
> Generally I expect to see more folks upgrading JREs & JDKs in the coming few
> years.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42261) K8s will not allocate more execs if there are any pending execs until next snapshot

2023-08-09 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-42261:
-
Target Version/s: 4.0.0

> K8s will not allocate more execs if there are any pending execs until next 
> snapshot
> ---
>
> Key: SPARK-42261
> URL: https://issues.apache.org/jira/browse/SPARK-42261
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0, 3.3.1, 3.3.2, 3.4.0, 3.4.1, 3.5.0
>Reporter: Holden Karau
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44511) Allow insertInto to succeed with partition columns specified when they match those on the target table

2023-08-09 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-44511:
-
Target Version/s: 4.0.0

> Allow insertInto to succeed with partition columns specified when they match 
> those on the target table
> 
>
> Key: SPARK-44511
> URL: https://issues.apache.org/jira/browse/SPARK-44511
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Holden Karau
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42361) Add an option to use external storage to distribute JAR set in cluster mode on Kube

2023-08-09 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-42361:
-
Target Version/s: 4.0.0

> Add an option to use external storage to distribute JAR set in cluster mode 
> on Kube
> ---
>
> Key: SPARK-42361
> URL: https://issues.apache.org/jira/browse/SPARK-42361
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Holden Karau
>Priority: Minor
>
> tl;dr – sometimes the driver can get overwhelmed serving the initial jar set. 
> You'll see a lot of "Executor fetching spark://.../jar" and then 
> connection timed out.
>  
> On YARN the jars (in cluster mode) are cached in HDFS.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42260) Log when the K8s Exec Pods Allocator Stalls

2023-08-09 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-42260:
-
Target Version/s: 4.0.0

> Log when the K8s Exec Pods Allocator Stalls
> ---
>
> Key: SPARK-42260
> URL: https://issues.apache.org/jira/browse/SPARK-42260
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
>
> Sometimes if the K8s APIs are being slow the ExecutorPods allocator can stall 
> and it would be good for us to log this (and how long we've stalled for) so 
> folks can tell more clearly why Spark is unable to reach the desired target 
> number of executors.
>  
> This is _somewhat_ related to SPARK-36664 which logs the time spent waiting 
> for executor allocation but goes a step further for K8s and logs when we've 
> stalled because we have too many pending pods.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44727) Improve the error message for dynamic allocation conditions

2023-08-09 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752496#comment-17752496
 ] 

Holden Karau commented on SPARK-44727:
--

Do you have more context [~chengpan] ?

> Improve the error message for dynamic allocation conditions
> ---
>
> Key: SPARK-44727
> URL: https://issues.apache.org/jira/browse/SPARK-44727
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42035) Add a config flag to force exit on JDK major version mismatch

2023-08-09 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-42035:
-
Description: 
JRE version mismatches can cause errors which are difficult to debug 
(potentially correctness issues with serialization). We should add a flag for 
platforms which wish to "fail fast" and exit on major version mismatch.

 

I think this could be a good thing to have on by default in Spark 4.

 

Generally I expect to see more folks upgrading JRE & JDKs in the coming few 
years.

  was:JRE version mismatches can cause errors which are difficult to debug 
(potentially correctness issues with serialization). We should add a flag for 
platforms which wish to "fail fast" and exit on major version mismatch.


> Add a config flag to force exit on JDK major version mismatch
> -
>
> Key: SPARK-42035
> URL: https://issues.apache.org/jira/browse/SPARK-42035
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> JRE version mismatches can cause errors which are difficult to debug 
> (potentially correctness issues with serialization). We should add a flag for 
> platforms which wish to "fail fast" and exit on major version mismatch.
>  
> I think this could be a good thing to have on by default in Spark 4.
>  
> Generally I expect to see more folks upgrading JRE & JDKs in the coming few 
> years.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34337) Reject disk blocks when out of disk space

2023-08-09 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-34337:
-
Target Version/s: 4.0.0

> Reject disk blocks when out of disk space
> -
>
> Key: SPARK-34337
> URL: https://issues.apache.org/jira/browse/SPARK-34337
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1, 3.1.2, 3.2.0
>Reporter: Holden Karau
>Priority: Major
>
> Now that we have the ability to store shuffle blocks on disaggregated 
> storage (when configured), we should add the option to reject storing blocks 
> locally on an executor at a certain disk pressure threshold.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44744) Move DS v2 API to sql/api module

2023-08-09 Thread Yihong He (Jira)
Yihong He created SPARK-44744:
-

 Summary: Move DS v2 API to sql/api module
 Key: SPARK-44744
 URL: https://issues.apache.org/jira/browse/SPARK-44744
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.5.0
Reporter: Yihong He






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44743) Reflect function behavior different from Hive

2023-08-09 Thread Nikhil Goyal (Jira)
Nikhil Goyal created SPARK-44743:


 Summary: Reflect function behavior different from Hive
 Key: SPARK-44743
 URL: https://issues.apache.org/jira/browse/SPARK-44743
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, SQL
Affects Versions: 3.4.1
Reporter: Nikhil Goyal


The Spark reflect function will fail if the underlying method call throws an exception. 
This causes the whole job to fail.

In Hive, however, the exception is caught and null is returned. A simple test to 
reproduce the behavior:
{code:java}
select reflect('java.net.URLDecoder', 'decode', '%') {code}
The workaround would be to wrap this call in a try
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CallMethodViaReflection.scala#L136]


We can support this by adding a new UDF `try_reflect` which mimics Hive's 
behavior. Please share your thoughts on this.
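Until something like `try_reflect` exists, one way to approximate the Hive semantics from PySpark is a plain Python UDF that swallows the exception itself. This is only an illustrative sketch, not the proposed built-in: `null_on_error` and `strict_url_decode` are made-up helpers standing in for `java.net.URLDecoder.decode`.

{code:python}
from urllib.parse import unquote

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def null_on_error(fn):
    """Wrap a plain Python function so any exception maps to NULL,
    mirroring Hive's reflect() behavior instead of failing the task."""
    def wrapped(*args):
        try:
            return fn(*args)
        except Exception:
            return None
    return wrapped

def strict_url_decode(s):
    """Stand-in for java.net.URLDecoder.decode: raise on a dangling '%'."""
    decoded = unquote(s, errors="strict")
    if "%" in s and decoded == s:
        raise ValueError(f"incomplete escape sequence in {s!r}")
    return decoded

safe_decode = udf(null_on_error(strict_url_decode), StringType())

df = spark.createDataFrame([("%",), ("%41",)], ["raw"])
df.select("raw", safe_decode("raw").alias("decoded")).show()
# '%' yields NULL instead of aborting the job; '%41' yields 'A'.
{code}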



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44742) Add Spark version drop down to the PySpark doc site

2023-08-09 Thread Allison Wang (Jira)
Allison Wang created SPARK-44742:


 Summary: Add Spark version drop down to the PySpark doc site
 Key: SPARK-44742
 URL: https://issues.apache.org/jira/browse/SPARK-44742
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Currently, the PySpark documentation does not have a version dropdown. While by 
default we want people to land on the latest version, a version dropdown will make it 
easier for people to find the docs for the release they are using. 

Other libraries such as numpy have such a version dropdown.  
!image-2023-08-09-09-38-00-805.png|width=214,height=189!
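For reference, a hedged sketch of how such a dropdown is typically wired up with the pydata-sphinx-theme that the PySpark docs already use; the JSON URL and version string below are placeholders, not actual project settings.

{code:python}
# docs/source/conf.py sketch, assuming the pydata-sphinx-theme version switcher.
html_theme = "pydata_sphinx_theme"
html_theme_options = {
    "switcher": {
        # A small JSON file listing released versions and their doc URLs
        # (placeholder location).
        "json_url": "https://spark.apache.org/docs/latest/api/python/_static/versions.json",
        # The version this particular build of the docs corresponds to.
        "version_match": "4.0.0",
    },
    # Render the dropdown in the navbar next to the existing icon links.
    "navbar_end": ["version-switcher", "navbar-icon-links"],
}
{code}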



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44741) Spark StatsD metrics reported to support metrics filter option

2023-08-09 Thread rameshkrishnan muthusamy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752456#comment-17752456
 ] 

rameshkrishnan muthusamy commented on SPARK-44741:
--

I am working on a PR for this enhancement.

> Spark StatsD metrics reported to support metrics filter option 
> ---
>
> Key: SPARK-44741
> URL: https://issues.apache.org/jira/browse/SPARK-44741
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: rameshkrishnan muthusamy
>Priority: Minor
>  Labels: Metrics, Sink, statsd
>
> The Spark StatsD metrics sink currently does not support a metrics 
> filtering option. 
> Though this option is available in the underlying reporters, it is not exposed in the 
> StatsD sink. An example of this can be seen at 
> [https://github.com/apache/spark/blob/be9ffb37585fe421705ceaa52fe49b89c50703a3/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L76]
>  
> This is a critical option to have when teams do not want all the metrics 
> that Spark exposes in their metrics monitoring platforms and want to switch to 
> detailed metrics as and when needed. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44741) Spark StatsD metrics reported to support metrics filter option

2023-08-09 Thread rameshkrishnan muthusamy (Jira)
rameshkrishnan muthusamy created SPARK-44741:


 Summary: Spark StatsD metrics reported to support metrics filter 
option 
 Key: SPARK-44741
 URL: https://issues.apache.org/jira/browse/SPARK-44741
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: rameshkrishnan muthusamy


The Spark StatsD metrics sink currently does not support a metrics 
filtering option. 

Though this option is available in the underlying reporters, it is not exposed in the 
StatsD sink. An example of this can be seen at 
[https://github.com/apache/spark/blob/be9ffb37585fe421705ceaa52fe49b89c50703a3/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L76]

 

This is a critical option to have when teams do not want all the metrics that 
Spark exposes in their metrics monitoring platforms and want to switch to detailed 
metrics as and when needed. 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44740) Allow configuring the session ID for a spark connect client in the remote string

2023-08-09 Thread Martin Grund (Jira)
Martin Grund created SPARK-44740:


 Summary: Allow configuring the session ID for a spark connect 
client in the remote string
 Key: SPARK-44740
 URL: https://issues.apache.org/jira/browse/SPARK-44740
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Martin Grund






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44720) Make Dataset use Encoder instead of AgnosticEncoder

2023-08-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44720.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Make Dataset use Encoder instead of AgnosticEncoder
> ---
>
> Key: SPARK-44720
> URL: https://issues.apache.org/jira/browse/SPARK-44720
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44580) RocksDB crashed when testing in GitHub Actions

2023-08-09 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-44580:
-
Attachment: image-2023-08-09-20-26-11-507.png

> RocksDB crashed when testing in GitHub Actions
> --
>
> Key: SPARK-44580
> URL: https://issues.apache.org/jira/browse/SPARK-44580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Yang Jie
>Priority: Major
> Attachments: image-2023-08-09-20-26-11-507.png
>
>
> [https://github.com/LuciferYang/spark/actions/runs/5666554831/job/15395578871]
>  
> {code:java}
> #
> 17177# A fatal error has been detected by the Java Runtime Environment:
> 17178#
> 17179#  SIGSEGV (0xb) at pc=0x7f8a077d2743, pid=4403, 
> tid=0x7f89cadff640
> 17180#
> 17181# JRE version: OpenJDK Runtime Environment (8.0_372-b07) (build 
> 1.8.0_372-b07)
> 17182# Java VM: OpenJDK 64-Bit Server VM (25.372-b07 mixed mode linux-amd64 
> compressed oops)
> 17183# Problematic frame:
> 17184# C  [librocksdbjni886380103972770161.so+0x3d2743]  
> rocksdb::DBImpl::FailIfCfHasTs(rocksdb::ColumnFamilyHandle const*) const+0x23
> 17185#
> 17186# Failed to write core dump. Core dumps have been disabled. To enable 
> core dumping, try "ulimit -c unlimited" before starting Java again
> 17187#
> 17188# An error report file with more information is saved as:
> 17189# /home/runner/work/spark/spark/sql/core/hs_err_pid4403.log
> 17190#
> 17191# If you would like to submit a bug report, please visit:
> 17192#   https://github.com/adoptium/adoptium-support/issues
> 17193# The crash happened outside the Java Virtual Machine in native code.
> 17194# See problematic frame for where to report the bug.
> 17195# {code}
>  
> This is my first time encountering this problem, and I am  unsure of the root 
> cause now
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44580) RocksDB crashed when testing in GitHub Actions

2023-08-09 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752402#comment-17752402
 ] 

Yang Jie commented on SPARK-44580:
--

A new crash case

[https://github.com/yaooqinn/spark/actions/runs/5805477173/job/15736662791]

!image-2023-08-09-20-26-11-507.png!

> RocksDB crashed when testing in GitHub Actions
> --
>
> Key: SPARK-44580
> URL: https://issues.apache.org/jira/browse/SPARK-44580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Yang Jie
>Priority: Major
>
> [https://github.com/LuciferYang/spark/actions/runs/5666554831/job/15395578871]
>  
> {code:java}
> #
> 17177# A fatal error has been detected by the Java Runtime Environment:
> 17178#
> 17179#  SIGSEGV (0xb) at pc=0x7f8a077d2743, pid=4403, 
> tid=0x7f89cadff640
> 17180#
> 17181# JRE version: OpenJDK Runtime Environment (8.0_372-b07) (build 
> 1.8.0_372-b07)
> 17182# Java VM: OpenJDK 64-Bit Server VM (25.372-b07 mixed mode linux-amd64 
> compressed oops)
> 17183# Problematic frame:
> 17184# C  [librocksdbjni886380103972770161.so+0x3d2743]  
> rocksdb::DBImpl::FailIfCfHasTs(rocksdb::ColumnFamilyHandle const*) const+0x23
> 17185#
> 17186# Failed to write core dump. Core dumps have been disabled. To enable 
> core dumping, try "ulimit -c unlimited" before starting Java again
> 17187#
> 17188# An error report file with more information is saved as:
> 17189# /home/runner/work/spark/spark/sql/core/hs_err_pid4403.log
> 17190#
> 17191# If you would like to submit a bug report, please visit:
> 17192#   https://github.com/adoptium/adoptium-support/issues
> 17193# The crash happened outside the Java Virtual Machine in native code.
> 17194# See problematic frame for where to report the bug.
> 17195# {code}
>  
> This is my first time encountering this problem, and I am  unsure of the root 
> cause now
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43429) Add default/active SparkSession APIs

2023-08-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43429.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42406
[https://github.com/apache/spark/pull/42406]

> Add default/active SparkSession APIs
> 
>
> Key: SPARK-43429
> URL: https://issues.apache.org/jira/browse/SPARK-43429
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42620) Add `inclusive` parameter for (DataFrame|Series).between_time

2023-08-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42620.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 40370
[https://github.com/apache/spark/pull/40370]

> Add `inclusive` parameter for (DataFrame|Series).between_time
> -
>
> Key: SPARK-42620
> URL: https://issues.apache.org/jira/browse/SPARK-42620
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
> Fix For: 4.0.0
>
>
> See https://github.com/pandas-dev/pandas/pull/43248



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44721) Retry Policy Revamp

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752349#comment-17752349
 ] 

ASF GitHub Bot commented on SPARK-44721:


User 'cdkrot' has created a pull request for this issue:
https://github.com/apache/spark/pull/42399

> Retry Policy Revamp
> ---
>
> Key: SPARK-44721
> URL: https://issues.apache.org/jira/browse/SPARK-44721
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Alice Sayutina
>Priority: Major
>
> Change the retry logic. With the existing retry logic, the maximum tolerated wait 
> time can, with small probability, be extremely low. Revamp the logic to guarantee 
> a certain minimum wait time.
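A minimal sketch of the idea (the parameter names and defaults are illustrative, not the actual Spark Connect policy): keep exponential backoff with jitter, but floor the per-attempt jitter and top up the final sleep so the summed wait cannot collapse to a tiny value.

{code:python}
import random

def backoff_sleeps(max_retries=15,
                   initial_backoff=0.05,   # seconds
                   max_backoff=60.0,
                   min_total_wait=600.0,
                   multiplier=4.0):
    """Yield sleep durations whose sum is guaranteed to reach
    min_total_wait, even when the random jitter comes out low."""
    total = 0.0
    backoff = initial_backoff
    for attempt in range(max_retries):
        # Jitter in [backoff/2, backoff] keeps a per-attempt floor, unlike
        # jitter in [0, backoff] which can collapse to nearly zero.
        sleep = random.uniform(backoff / 2, backoff)
        if attempt == max_retries - 1 and total + sleep < min_total_wait:
            # Top up the final attempt so the overall budget is honoured.
            sleep = min_total_wait - total
        total += sleep
        yield sleep
        backoff = min(backoff * multiplier, max_backoff)

# The summed wait can no longer be "extremely low with small probability".
waits = list(backoff_sleeps())
print(f"attempts={len(waits)}, total wait={sum(waits):.1f}s")  # >= ~600s
{code}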



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40909) Reuse the broadcast exchange for bloom filter

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752348#comment-17752348
 ] 

ASF GitHub Bot commented on SPARK-40909:


User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/42395

> Reuse the broadcast exchange for bloom filter
> -
>
> Key: SPARK-40909
> URL: https://issues.apache.org/jira/browse/SPARK-40909
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, even if the creation side of the bloom filter could be broadcast, Spark 
> cannot inject a bloom filter or InSubquery filter into the application side.
> In fact, we can inject a bloom filter that reuses the broadcast exchange 
> and improves performance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44738) Spark Connect Reattach misses metadata propagation

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752343#comment-17752343
 ] 

ASF GitHub Bot commented on SPARK-44738:


User 'grundprinzip' has created a pull request for this issue:
https://github.com/apache/spark/pull/42409

> Spark Connect Reattach misses metadata propagation
> --
>
> Key: SPARK-44738
> URL: https://issues.apache.org/jira/browse/SPARK-44738
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Blocker
> Fix For: 3.5.0, 4.0.0
>
>
> Currently, in the Spark Connect Reattach handler, client metadata is not 
> propagated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44738) Spark Connect Reattach misses metadata propagation

2023-08-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44738.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42409
[https://github.com/apache/spark/pull/42409]

> Spark Connect Reattach misses metadata propagation
> --
>
> Key: SPARK-44738
> URL: https://issues.apache.org/jira/browse/SPARK-44738
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Blocker
> Fix For: 3.5.0, 4.0.0
>
>
> Currently, in the Spark Connect Reattach handler, client metadata is not 
> propagated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44738) Spark Connect Reattach misses metadata propagation

2023-08-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44738:


Assignee: Martin Grund

> Spark Connect Reattach misses metadata propagation
> --
>
> Key: SPARK-44738
> URL: https://issues.apache.org/jira/browse/SPARK-44738
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Blocker
> Fix For: 3.5.0
>
>
> Currently, in the Spark Connect Reattach handler, client metadata is not 
> propagated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44739) Conflicting attribute during join two times the same table (AQE is disabled)

2023-08-09 Thread kondziolka9ld (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kondziolka9ld updated SPARK-44739:
--
Description: 
h2. Issue

I came across something that seems to be a bug in *pyspark* (when I disable 
adaptive queries). It is about joining the same dataframe twice (please 
see the reproduction steps below). 

h2. Reproduction steps
{code:java}
pyspark --conf spark.sql.adaptive.enabled=false
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
23/08/09 10:18:54 WARN Utils: Your hostname, kondziolka-dd-laptop resolves to a 
loopback address: 127.0.1.1; using 192.168.0.18 instead (on interface wlp0s20f3)
23/08/09 10:18:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
23/08/09 10:18:56 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Python version 3.8.10 (default, Nov 14 2022 12:59:47)
Spark context Web UI available at http://192.168.0.18:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1691569137130).
SparkSession available as 'spark'.

>>> sc.setCheckpointDir("file:///tmp")
>>> df1=spark.createDataFrame([(1, 42)], ["id", "fval"])
>>> df2=spark.createDataFrame([(1, 0, "jeden")], ["id", "target", "aux"]) 
>>> df2.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[id#4L,target#5L,aux#6]
>>> j1=df1.join(df2, ["id"]).select("fval", "aux").checkpoint()
>>> j1.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[fval#1L,aux#6]
>>> # we see that both j1 and df2 refers to the same attribute aux#6
>>> # let's join df2 to j1. Both of them has aux column.
>>> j1.join(df2, "aux")                                                      
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/kondziolkadd/.local/lib/python3.8/site-packages/pyspark/sql/dataframe.py",
 line 1539, in join
    jdf = self._jdf.join(other._jdf, on, how)
  File 
"/home/kondziolkadd/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1321, in __call__
  File 
"/home/kondziolkadd/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", 
line 196, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: 
Failure when resolving conflicting references in Join:
'Join Inner
:- LogicalRDD [fval#1L, aux#6], false
+- LogicalRDD [id#4L, target#5L, aux#6], false

Conflicting attributes: aux#6
;
'Join Inner
:- LogicalRDD [fval#1L, aux#6], false
+- LogicalRDD [id#4L, target#5L, aux#6], false
{code}
 

h2. Workaround

The workaround is to rename the columns twice, i.e. an identity rename 
`X -> X' -> X`. It looks like this forces a rewrite of the metadata (it changes the 
attribute id) and in this way avoids the conflict.
{code:java}
>>> sc.setCheckpointDir("file:///tmp")
>>> df1=spark.createDataFrame([(1, 42)], ["id", "fval"])
>>> df2=spark.createDataFrame([(1, 0, "jeden")], ["id", "target", "aux"])
>>> df2.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[id#4L,target#5L,aux#6]
>>> j1=df1.join(df2, ["id"]).select("fval", "aux").withColumnRenamed("aux", 
>>> "_aux").withColumnRenamed("_aux", "aux").checkpoint()
>>> j1.explain()                                                                
== Physical Plan ==
*(1) Scan ExistingRDD[fval#1L,aux#19]
>>> j1.join(df2, "aux")
>>>
{code}

h2. Others
 * Repartitioning before the checkpoint is a workaround as well (it does not change 
the id of the attribute)

{code:java}
>>> j1=df1.join(df2, ["id"]).select("fval", 
>>> "aux").repartition(100).checkpoint() 
>>> j1.join(df2, "aux") {code}
 * Without `checkpoint` the issue does not occur (although the id is the same)

{code:java}
>>> j1=df1.join(df2, ["id"]).select("fval", "aux")
>>> j1.join(df2, "aux") {code}
 * Without disabling `AQE` it does not occur
 * I was not able to reproduce it in Spark outside of `pyspark`; I only 
reproduced it in `pyspark`.

 

  was:
h2. Issue

I came across something that seems to be a bug in *pyspark* (when I disable 
adaptive queries).

It is about joining the same dataframe twice.

It is necessary to `checkpoint` dataframe `j1` before joining to expose this 
issue.

h2. Reproduction steps

 
{code:java}
pyspark --conf spark.sql.adaptive.enabled=false
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
23/08/09 10:18:54 WARN Utils: Your hostname, kondziolka-dd-laptop resolves to a 
loopback address: 127.0.1.1; using 192.168.0.18 instead (on interface wlp0s20f3)
23/08/09 10:18

[jira] [Created] (SPARK-44739) Conflicting attribute during join two times the same table (AQE is disabled)

2023-08-09 Thread kondziolka9ld (Jira)
kondziolka9ld created SPARK-44739:
-

 Summary: Conflicting attribute during join two times the same 
table (AQE is disabled)
 Key: SPARK-44739
 URL: https://issues.apache.org/jira/browse/SPARK-44739
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.0
Reporter: kondziolka9ld


h2. Issue

I came across something that seems to be a bug in *pyspark* (when I disable 
adaptive queries).

It is about joining the same dataframe twice.

It is necessary to `checkpoint` dataframe `j1` before joining to expose this 
issue.

h2. Reproduction steps

 
{code:java}
pyspark --conf spark.sql.adaptive.enabled=false
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
23/08/09 10:18:54 WARN Utils: Your hostname, kondziolka-dd-laptop resolves to a 
loopback address: 127.0.1.1; using 192.168.0.18 instead (on interface wlp0s20f3)
23/08/09 10:18:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
23/08/09 10:18:56 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Python version 3.8.10 (default, Nov 14 2022 12:59:47)
Spark context Web UI available at http://192.168.0.18:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1691569137130).
SparkSession available as 'spark'.

>>> sc.setCheckpointDir("file:///tmp")
>>> df1=spark.createDataFrame([(1, 42)], ["id", "fval"])
>>> df2=spark.createDataFrame([(1, 0, "jeden")], ["id", "target", "aux"])
>>> df2.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[id#4L,target#5L,aux#6]
>>> j1=df1.join(df2, ["id"]).select("fval", "aux").checkpoint()
>>> j1.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[fval#1L,aux#6]
>>> # we see that both j1 and df2 refers to the same attribute aux#6
>>> # let's join df2 to j1. Both of them has aux column.
>>> j2=j1.join(df2, "aux")                                                      
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/kondziolkadd/.local/lib/python3.8/site-packages/pyspark/sql/dataframe.py",
 line 1539, in join
    jdf = self._jdf.join(other._jdf, on, how)
  File 
"/home/user/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1321, in __call__
  File "/home/user/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", 
line 196, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: 
Failure when resolving conflicting references in Join:
'Join Inner
:- LogicalRDD [fval#1L, aux#6], false
+- LogicalRDD [id#4L, target#5L, aux#6], false

Conflicting attributes: aux#6
;
'Join Inner
:- LogicalRDD [fval#1L, aux#6], false
+- LogicalRDD [id#4L, target#5L, aux#6], false
{code}
 

h2. Workaround

The workaround is to rename the columns twice, i.e. an identity rename 
`X -> X' -> X`.

It looks like this forces a rewrite of the metadata (it changes the attribute id) and 
in this way avoids the conflict.
{code:java}
>>> sc.setCheckpointDir("file:///tmp")
>>> df1=spark.createDataFrame([(1, 42)], ["id", "fval"])
>>> df2=spark.createDataFrame([(1, 0, "jeden")], ["id", "target", "aux"])
>>> df2.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[id#4L,target#5L,aux#6]
>>> j1=df1.join(df2, ["id"]).select("fval", "aux").withColumnRenamed("aux", 
>>> "_aux").withColumnRenamed("_aux", "aux").checkpoint()
>>> j1.explain()                                                                
== Physical Plan ==
*(1) Scan ExistingRDD[fval#1L,aux#19]
>>> j2=j1.join(df2, "aux")
>>> 
 {code}

h2. Others
 # Repartitioning `j1` before the checkpoint is a workaround as well (it does not 
change the id of the attribute)

{code:java}
j1=df1.join(df2, ["id"]).select("fval", "aux").repartition(100).checkpoint() 
{code}

 # Without `checkpoint` the issue does not occur (although the id is the same)

{code:java}
>>> j1=df1.join(df2, ["id"]).select("fval", "aux")
>>> j2=j1.join(df2, "aux") {code}

 # Without disabling AQE it does not occur
 # I was not able to reproduce it in Spark outside of `pyspark`; I only 
reproduced it in `pyspark`.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42849) Session variables

2023-08-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-42849.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 40474
[https://github.com/apache/spark/pull/40474]

> Session variables
> -
>
> Key: SPARK-42849
> URL: https://issues.apache.org/jira/browse/SPARK-42849
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Serge Rielau
>Assignee: Serge Rielau
>Priority: Major
> Fix For: 4.0.0
>
>
> Provide a type-safe, engine-controlled session variable:
> CREATE [ OR REPLACE ] TEMPORARY VARIABLE [ IF NOT EXISTS ] var_name [ type ] [ 
> DEFAULT expression ]
> SET { variable = expression | ( variable [, ...] ) = ( subquery | expression 
> [, ...] ) }
> DROP VARIABLE [ IF EXISTS ] variable_name



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42849) Session variables

2023-08-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-42849:
---

Assignee: Serge Rielau

> Session variables
> -
>
> Key: SPARK-42849
> URL: https://issues.apache.org/jira/browse/SPARK-42849
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Serge Rielau
>Assignee: Serge Rielau
>Priority: Major
>
> Provide a type-safe, engine-controlled session variable:
> CREATE [ OR REPLACE ] TEMPORARY VARIABLE [ IF NOT EXISTS ] var_name [ type ] [ 
> DEFAULT expression ]
> SET { variable = expression | ( variable [, ...] ) = ( subquery | expression 
> [, ...] ) }
> DROP VARIABLE [ IF EXISTS ] variable_name



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44738) Spark Connect Reattach misses metadata propagation

2023-08-09 Thread Martin Grund (Jira)
Martin Grund created SPARK-44738:


 Summary: Spark Connect Reattach misses metadata propagation
 Key: SPARK-44738
 URL: https://issues.apache.org/jira/browse/SPARK-44738
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.5.0
Reporter: Martin Grund
 Fix For: 3.5.0


Currently, in the Spark Connect Reattach handler, client metadata is not 
propagated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44551) Wrong semantics for null IN (empty list) - IN expression execution

2023-08-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44551.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 42163
[https://github.com/apache/spark/pull/42163]

> Wrong semantics for null IN (empty list) - IN expression execution
> --
>
> Key: SPARK-44551
> URL: https://issues.apache.org/jira/browse/SPARK-44551
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
> Fix For: 3.5.0
>
>
> {{null IN (empty list)}} incorrectly evaluates to null, when it should 
> evaluate to false. (The reason it should be false is because a IN (b1, b2) is 
> defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR 
> which is false. This is specified by ANSI SQL.)
> Many places in Spark execution (In, InSet, InSubquery) and optimization 
> (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that 
> the Spark behavior for the null IN (empty list) is inconsistent in some 
> places - literal IN lists generally return null (incorrect), while IN/NOT IN 
> subqueries mostly return false/true, respectively (correct) in this case.
> This is a longstanding correctness issue which has existed since null support 
> for IN expressions was first added to Spark.
> Doc with more details: 
> [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
>  
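For illustration only (this example is not from the linked doc): an IN subquery over an empty relation is the easiest way to hit the empty-list case from SQL, and per ANSI (and after this fix) it should evaluate to false rather than NULL.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# NULL IN () cannot be written literally, so the empty-list case surfaces
# through subqueries or optimizer rewrites.  An IN subquery over an empty
# relation should evaluate to false (not NULL) per ANSI SQL.
spark.sql(
    "SELECT CAST(NULL AS INT) IN (SELECT id FROM range(0)) AS null_in_empty"
).show()
# Expected after the fix: false.
{code}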



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44551) Wrong semantics for null IN (empty list) - IN expression execution

2023-08-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-44551:
---

Assignee: Jack Chen

> Wrong semantics for null IN (empty list) - IN expression execution
> --
>
> Key: SPARK-44551
> URL: https://issues.apache.org/jira/browse/SPARK-44551
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
>
> {{null IN (empty list)}} incorrectly evaluates to null, when it should 
> evaluate to false. (The reason it should be false is because a IN (b1, b2) is 
> defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR 
> which is false. This is specified by ANSI SQL.)
> Many places in Spark execution (In, InSet, InSubquery) and optimization 
> (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that 
> the Spark behavior for the null IN (empty list) is inconsistent in some 
> places - literal IN lists generally return null (incorrect), while IN/NOT IN 
> subqueries mostly return false/true, respectively (correct) in this case.
> This is a longstanding correctness issue which has existed since null support 
> for IN expressions was first added to Spark.
> Doc with more details: 
> [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org