[jira] [Assigned] (SPARK-44691) Move Subclasses of Analysis to sql/api
[ https://issues.apache.org/jira/browse/SPARK-44691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-44691: --- Assignee: Yihong He > Move Subclasses of Analysis to sql/api > -- > > Key: SPARK-44691 > URL: https://issues.apache.org/jira/browse/SPARK-44691 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yihong He >Assignee: Yihong He >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44691) Move Subclasses of Analysis to sql/api
[ https://issues.apache.org/jira/browse/SPARK-44691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-44691. - Fix Version/s: 3.5.0 Resolution: Fixed > Move Subclasses of Analysis to sql/api > -- > > Key: SPARK-44691 > URL: https://issues.apache.org/jira/browse/SPARK-44691 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yihong He >Assignee: Yihong He >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44755) Local tmp data is not cleared while using spark streaming consuming from kafka
leesf created SPARK-44755: - Summary: Local tmp data is not cleared while using spark streaming consuming from kafka Key: SPARK-44755 URL: https://issues.apache.org/jira/browse/SPARK-44755 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.0 Reporter: leesf We are using Spark 3.2 to consume data from Kafka and then `collectAsMap` to send the results to the driver. We found that the local temp files do not get cleared if the data consumed from Kafka is larger than 200 MB (spark.network.maxRemoteBlockSizeFetchToMem). !https://intranetproxy.alipay.com/skylark/lark/0/2023/png/320711/1691419276170-2dd0964f-4cf4-4b15-9fbe-9622116671da.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
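A rough, hypothetical way to probe this locally (not taken from the report, which used Kafka): lower spark.network.maxRemoteBlockSizeFetchToMem so that results pulled back to the driver go through the fetch-to-disk path, then check whether the temp files under the block manager directories are cleaned up afterwards. All names and sizes below are made up for illustration.
{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("fetch-to-disk-temp-file-check")
    # Force even small remote blocks to be fetched to local temp files.
    .config("spark.network.maxRemoteBlockSizeFetchToMem", "1m")
    .getOrCreate()
)
sc = spark.sparkContext

# Each task produces a ~4 MB result, well above the 1 MB threshold set above.
rdd = sc.parallelize(range(16), 16).map(lambda i: (i, "x" * (4 * 1024 * 1024)))
result = rdd.collectAsMap()
print(len(result))

# Afterwards, inspect the blockmgr-* directories under spark.local.dir to see
# whether the temp files created for these fetches have been removed.
{code}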
[jira] [Updated] (SPARK-44754) Improve DeduplicateRelations rewriteAttrs compatibility
[ https://issues.apache.org/jira/browse/SPARK-44754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jia Fan updated SPARK-44754: Description: Following [https://github.com/apache/spark/pull/41554], we should add tests for MapPartitionsInR, MapPartitionsInRWithArrow, MapElements, MapGroups, FlatMapGroupsWithState, FlatMapGroupsInR, FlatMapGroupsInRWithArrow, FlatMapGroupsInPandas, MapInPandas, PythonMapInArrow, FlatMapGroupsInPandasWithState and FlatMapCoGroupsInPandas to make sure DeduplicateRelations rewriteAttrs rewrites their attributes correctly. We should also fix the incorrect behavior, following [https://github.com/apache/spark/pull/41554]. was: the same text, previously garbled by stray JIRA brace markup > Improve DeduplicateRelations rewriteAttrs compatibility > --- > > Key: SPARK-44754 > URL: https://issues.apache.org/jira/browse/SPARK-44754 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Jia Fan >Priority: Major > > Following [https://github.com/apache/spark/pull/41554], we should add tests for MapPartitionsInR, MapPartitionsInRWithArrow, MapElements, MapGroups, FlatMapGroupsWithState, FlatMapGroupsInR, FlatMapGroupsInRWithArrow, FlatMapGroupsInPandas, MapInPandas, PythonMapInArrow, FlatMapGroupsInPandasWithState and FlatMapCoGroupsInPandas to make sure DeduplicateRelations rewriteAttrs rewrites their attributes correctly. We should also fix the incorrect behavior, following [https://github.com/apache/spark/pull/41554]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44754) Improve DeduplicateRelations rewriteAttrs compatibility
Jia Fan created SPARK-44754: --- Summary: Improve DeduplicateRelations rewriteAttrs compatibility Key: SPARK-44754 URL: https://issues.apache.org/jira/browse/SPARK-44754 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Jia Fan Following [https://github.com/apache/spark/pull/41554], we should add tests for MapPartitionsInR, MapPartitionsInRWithArrow, MapElements, MapGroups, FlatMapGroupsWithState, FlatMapGroupsInR, FlatMapGroupsInRWithArrow, FlatMapGroupsInPandas, MapInPandas, PythonMapInArrow, FlatMapGroupsInPandasWithState and FlatMapCoGroupsInPandas to make sure DeduplicateRelations rewriteAttrs rewrites their attributes correctly. We should also fix the incorrect behavior, following [https://github.com/apache/spark/pull/41554]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44461) Enable Process Isolation for streaming python worker
[ https://issues.apache.org/jira/browse/SPARK-44461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44461: - Fix Version/s: (was: 3.5.0) (was: 4.0.0) > Enable Process Isolation for streaming python worker > > > Key: SPARK-44461 > URL: https://issues.apache.org/jira/browse/SPARK-44461 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.4.1 >Reporter: Raghu Angadi >Priority: Major > > Enable PI for Python worker used for foreachBatch() & streaming listener in > Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-44461) Enable Process Isolation for streaming python worker
[ https://issues.apache.org/jira/browse/SPARK-44461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-44461: -- > Enable Process Isolation for streaming python worker > > > Key: SPARK-44461 > URL: https://issues.apache.org/jira/browse/SPARK-44461 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.4.1 >Reporter: Raghu Angadi >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Enable PI for Python worker used for foreachBatch() & streaming listener in > Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44729) Add canonical links to the PySpark docs page
[ https://issues.apache.org/jira/browse/SPARK-44729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752599#comment-17752599 ] Ruifeng Zheng commented on SPARK-44729: --- [~panbingkun] Thanks! > Add canonical links to the PySpark docs page > > > Key: SPARK-44729 > URL: https://issues.apache.org/jira/browse/SPARK-44729 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > We should add the canonical link to the PySpark docs page > [https://spark.apache.org/docs/latest/api/python/index.html] so that the > search engine can return the latest PySpark docs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
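For context, a canonical link is a plain HTML tag in each page's head that tells search engines which URL is the preferred version of that page. A minimal illustration of what each versioned PySpark page could carry (the exact target URL policy is up to the implementation):
{code}
<!-- e.g. inside https://spark.apache.org/docs/3.4.1/api/python/index.html -->
<link rel="canonical" href="https://spark.apache.org/docs/latest/api/python/index.html" />
{code}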
[jira] [Resolved] (SPARK-44461) Enable Process Isolation for streaming python worker
[ https://issues.apache.org/jira/browse/SPARK-44461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44461. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42421 [https://github.com/apache/spark/pull/42421] > Enable Process Isolation for streaming python worker > > > Key: SPARK-44461 > URL: https://issues.apache.org/jira/browse/SPARK-44461 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.4.1 >Reporter: Raghu Angadi >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Enable PI for Python worker used for foreachBatch() & streaming listener in > Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44753) XML: Add Python and sparkR binding including Spark Connect
Sandip Agarwala created SPARK-44753: --- Summary: XML: Add Python and sparkR binding including Spark Connect Key: SPARK-44753 URL: https://issues.apache.org/jira/browse/SPARK-44753 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Sandip Agarwala -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44752) XML: Update Spark Docs
Sandip Agarwala created SPARK-44752: --- Summary: XML: Update Spark Docs Key: SPARK-44752 URL: https://issues.apache.org/jira/browse/SPARK-44752 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 4.0.0 Reporter: Sandip Agarwala [https://spark.apache.org/docs/latest/sql-data-sources.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44751) XML: Implement FileFormat Interface
[ https://issues.apache.org/jira/browse/SPARK-44751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandip Agarwala updated SPARK-44751: Description: This will also address most of the review comments from the first XML PR: https://github.com/apache/spark/pull/41832 > XML: Implement FileFormat Interface > --- > > Key: SPARK-44751 > URL: https://issues.apache.org/jira/browse/SPARK-44751 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Sandip Agarwala >Priority: Major > > This will also address most of the review comments from the first XML PR: > https://github.com/apache/spark/pull/41832 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44751) XML: Implement FileFormat Interface
Sandip Agarwala created SPARK-44751: --- Summary: XML: Implement FileFormat Interface Key: SPARK-44751 URL: https://issues.apache.org/jira/browse/SPARK-44751 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Sandip Agarwala -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44750) SparkSession.Builder should respect the options
[ https://issues.apache.org/jira/browse/SPARK-44750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-44750: -- Description: In connect session builder, users use {{config}} method to set options. However, the options are actually ignored when we create a new session. {code} def create(self) -> "SparkSession": has_channel_builder = self._channel_builder is not None has_spark_remote = "spark.remote" in self._options if has_channel_builder and has_spark_remote: raise ValueError( "Only one of connection string or channelBuilder " "can be used to create a new SparkSession." ) if not has_channel_builder and not has_spark_remote: raise ValueError( "Needs either connection string or channelBuilder to create a new SparkSession." ) if has_channel_builder: assert self._channel_builder is not None session = SparkSession(connection=self._channel_builder) else: spark_remote = to_str(self._options.get("spark.remote")) assert spark_remote is not None session = SparkSession(connection=spark_remote) SparkSession._set_default_and_active_session(session) return session {code} we should respect the options by invoking {{session.conf.set}} after creation. was: In connect session builder, we use {{config}} method to set options. However, the options are actually ignored. {code} def create(self) -> "SparkSession": has_channel_builder = self._channel_builder is not None has_spark_remote = "spark.remote" in self._options if has_channel_builder and has_spark_remote: raise ValueError( "Only one of connection string or channelBuilder " "can be used to create a new SparkSession." ) if not has_channel_builder and not has_spark_remote: raise ValueError( "Needs either connection string or channelBuilder to create a new SparkSession." ) if has_channel_builder: assert self._channel_builder is not None session = SparkSession(connection=self._channel_builder) else: spark_remote = to_str(self._options.get("spark.remote")) assert spark_remote is not None session = SparkSession(connection=spark_remote) SparkSession._set_default_and_active_session(session) return session {code} we should respect the options by invoking {{session.conf.set}} after creation. > SparkSession.Builder should respect the options > --- > > Key: SPARK-44750 > URL: https://issues.apache.org/jira/browse/SPARK-44750 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > > In connect session builder, users use {{config}} method to set options. > However, the options are actually ignored when we create a new session. > {code} > def create(self) -> "SparkSession": > has_channel_builder = self._channel_builder is not None > has_spark_remote = "spark.remote" in self._options > if has_channel_builder and has_spark_remote: > raise ValueError( > "Only one of connection string or channelBuilder " > "can be used to create a new SparkSession." > ) > if not has_channel_builder and not has_spark_remote: > raise ValueError( > "Needs either connection string or channelBuilder to > create a new SparkSession." 
> ) > if has_channel_builder: > assert self._channel_builder is not None > session = SparkSession(connection=self._channel_builder) > else: > spark_remote = to_str(self._options.get("spark.remote")) > assert spark_remote is not None > session = SparkSession(connection=spark_remote) > SparkSession._set_default_and_active_session(session) > return session > {code} > we should respect the options by invoking {{session.conf.set}} after creation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44750) SparkSession.Builder should respect the options
[ https://issues.apache.org/jira/browse/SPARK-44750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-44750: -- Description: In connect session builder, we use {{config}} method to set options. However, the options are actually ignored when we create a new session. {code} def create(self) -> "SparkSession": has_channel_builder = self._channel_builder is not None has_spark_remote = "spark.remote" in self._options if has_channel_builder and has_spark_remote: raise ValueError( "Only one of connection string or channelBuilder " "can be used to create a new SparkSession." ) if not has_channel_builder and not has_spark_remote: raise ValueError( "Needs either connection string or channelBuilder to create a new SparkSession." ) if has_channel_builder: assert self._channel_builder is not None session = SparkSession(connection=self._channel_builder) else: spark_remote = to_str(self._options.get("spark.remote")) assert spark_remote is not None session = SparkSession(connection=spark_remote) SparkSession._set_default_and_active_session(session) return session {code} we should respect the options by invoking {{session.conf.set}} after creation. was: In connect session builder, users use {{config}} method to set options. However, the options are actually ignored when we create a new session. {code} def create(self) -> "SparkSession": has_channel_builder = self._channel_builder is not None has_spark_remote = "spark.remote" in self._options if has_channel_builder and has_spark_remote: raise ValueError( "Only one of connection string or channelBuilder " "can be used to create a new SparkSession." ) if not has_channel_builder and not has_spark_remote: raise ValueError( "Needs either connection string or channelBuilder to create a new SparkSession." ) if has_channel_builder: assert self._channel_builder is not None session = SparkSession(connection=self._channel_builder) else: spark_remote = to_str(self._options.get("spark.remote")) assert spark_remote is not None session = SparkSession(connection=spark_remote) SparkSession._set_default_and_active_session(session) return session {code} we should respect the options by invoking {{session.conf.set}} after creation. > SparkSession.Builder should respect the options > --- > > Key: SPARK-44750 > URL: https://issues.apache.org/jira/browse/SPARK-44750 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > > In connect session builder, we use {{config}} method to set options. > However, the options are actually ignored when we create a new session. > {code} > def create(self) -> "SparkSession": > has_channel_builder = self._channel_builder is not None > has_spark_remote = "spark.remote" in self._options > if has_channel_builder and has_spark_remote: > raise ValueError( > "Only one of connection string or channelBuilder " > "can be used to create a new SparkSession." > ) > if not has_channel_builder and not has_spark_remote: > raise ValueError( > "Needs either connection string or channelBuilder to > create a new SparkSession." 
> ) > if has_channel_builder: > assert self._channel_builder is not None > session = SparkSession(connection=self._channel_builder) > else: > spark_remote = to_str(self._options.get("spark.remote")) > assert spark_remote is not None > session = SparkSession(connection=spark_remote) > SparkSession._set_default_and_active_session(session) > return session > {code} > we should respect the options by invoking {{session.conf.set}} after creation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44750) SparkSession.Builder should respect the options
[ https://issues.apache.org/jira/browse/SPARK-44750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-44750: - Assignee: Ruifeng Zheng > SparkSession.Builder should respect the options > --- > > Key: SPARK-44750 > URL: https://issues.apache.org/jira/browse/SPARK-44750 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > > In connect session builder, we use {{config}} method to set options. > However, the options are actually ignored. > {code} > def create(self) -> "SparkSession": > has_channel_builder = self._channel_builder is not None > has_spark_remote = "spark.remote" in self._options > if has_channel_builder and has_spark_remote: > raise ValueError( > "Only one of connection string or channelBuilder " > "can be used to create a new SparkSession." > ) > if not has_channel_builder and not has_spark_remote: > raise ValueError( > "Needs either connection string or channelBuilder to > create a new SparkSession." > ) > if has_channel_builder: > assert self._channel_builder is not None > session = SparkSession(connection=self._channel_builder) > else: > spark_remote = to_str(self._options.get("spark.remote")) > assert spark_remote is not None > session = SparkSession(connection=spark_remote) > SparkSession._set_default_and_active_session(session) > return session > {code} > we should respect the options by invoking {{session.conf.set}} after creation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44750) SparkSession.Builder should respect the options
Ruifeng Zheng created SPARK-44750: - Summary: SparkSession.Builder should respect the options Key: SPARK-44750 URL: https://issues.apache.org/jira/browse/SPARK-44750 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 3.5.0, 4.0.0 Reporter: Ruifeng Zheng In connect session builder, we use {{config}} method to set options. However, the options are actually ignored. {code} def create(self) -> "SparkSession": has_channel_builder = self._channel_builder is not None has_spark_remote = "spark.remote" in self._options if has_channel_builder and has_spark_remote: raise ValueError( "Only one of connection string or channelBuilder " "can be used to create a new SparkSession." ) if not has_channel_builder and not has_spark_remote: raise ValueError( "Needs either connection string or channelBuilder to create a new SparkSession." ) if has_channel_builder: assert self._channel_builder is not None session = SparkSession(connection=self._channel_builder) else: spark_remote = to_str(self._options.get("spark.remote")) assert spark_remote is not None session = SparkSession(connection=spark_remote) SparkSession._set_default_and_active_session(session) return session {code} we should respect the options by invoking {{session.conf.set}} after creation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
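For readers skimming the thread, a minimal sketch of the change the ticket suggests (this is not the actual patch; whether keys other than "spark.remote" must be excluded, and how conflicts with server-side confs are handled, are open details):
{code:python}
def create(self) -> "SparkSession":
    has_channel_builder = self._channel_builder is not None
    has_spark_remote = "spark.remote" in self._options
    if has_channel_builder and has_spark_remote:
        raise ValueError(
            "Only one of connection string or channelBuilder "
            "can be used to create a new SparkSession."
        )
    if not has_channel_builder and not has_spark_remote:
        raise ValueError(
            "Needs either connection string or channelBuilder to create a new SparkSession."
        )
    if has_channel_builder:
        assert self._channel_builder is not None
        session = SparkSession(connection=self._channel_builder)
    else:
        spark_remote = to_str(self._options.get("spark.remote"))
        assert spark_remote is not None
        session = SparkSession(connection=spark_remote)

    # Suggested addition: replay the options collected by .config() so they are
    # no longer silently dropped; "spark.remote" is connection info, not a runtime conf.
    for key, value in self._options.items():
        if key != "spark.remote":
            session.conf.set(key, value)

    SparkSession._set_default_and_active_session(session)
    return session
{code}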
[jira] [Assigned] (SPARK-44732) Port the initial implementation of Spark XML data source
[ https://issues.apache.org/jira/browse/SPARK-44732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44732: Assignee: Hyukjin Kwon > Port the initial implementation of Spark XML data source > > > Key: SPARK-44732 > URL: https://issues.apache.org/jira/browse/SPARK-44732 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44732) Port the initial implementation of Spark XML data source
[ https://issues.apache.org/jira/browse/SPARK-44732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44732. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 41832 [https://github.com/apache/spark/pull/41832] > Port the initial implementation of Spark XML data source > > > Key: SPARK-44732 > URL: https://issues.apache.org/jira/browse/SPARK-44732 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42621) Add `inclusive` parameter for date_range
[ https://issues.apache.org/jira/browse/SPARK-42621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42621. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 40665 [https://github.com/apache/spark/pull/40665] > Add `inclusive` parameter for date_range > > > Key: SPARK-42621 > URL: https://issues.apache.org/jira/browse/SPARK-42621 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Fix For: 4.0.0 > > > See https://github.com/pandas-dev/pandas/issues/40245 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
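For readers unfamiliar with the pandas change being mirrored, a small usage sketch of the new parameter in the pandas API on Spark (semantics are assumed to follow pandas' `inclusive`; per this ticket the parameter lands in 4.0.0):
{code:python}
import pyspark.pandas as ps

# "left" keeps the start bound and drops the end bound; the other values accepted
# by pandas are "both" (the default), "right" and "neither".
idx = ps.date_range(start="2023-01-01", end="2023-01-05", inclusive="left")
print(idx)  # expected: 2023-01-01 through 2023-01-04
{code}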
[jira] [Commented] (SPARK-44729) Add canonical links to the PySpark docs page
[ https://issues.apache.org/jira/browse/SPARK-44729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752575#comment-17752575 ] BingKun Pan commented on SPARK-44729: - Okay, let me do it. > Add canonical links to the PySpark docs page > > > Key: SPARK-44729 > URL: https://issues.apache.org/jira/browse/SPARK-44729 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > We should add the canonical link to the PySpark docs page > [https://spark.apache.org/docs/latest/api/python/index.html] so that the > search engine can return the latest PySpark docs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44580) RocksDB crashed when testing in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-44580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752572#comment-17752572 ] BingKun Pan commented on SPARK-44580: - I have observed the logs of the above cases, and there are logs similar to !image-2023-08-10-09-44-19-341.png! before each crash > RocksDB crashed when testing in GitHub Actions > -- > > Key: SPARK-44580 > URL: https://issues.apache.org/jira/browse/SPARK-44580 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.5.0, 4.0.0 >Reporter: Yang Jie >Priority: Major > Attachments: image-2023-08-09-20-26-11-507.png, > image-2023-08-10-09-44-19-341.png > > > [https://github.com/LuciferYang/spark/actions/runs/5666554831/job/15395578871] > > {code:java} > # > 17177# A fatal error has been detected by the Java Runtime Environment: > 17178# > 17179# SIGSEGV (0xb) at pc=0x7f8a077d2743, pid=4403, > tid=0x7f89cadff640 > 17180# > 17181# JRE version: OpenJDK Runtime Environment (8.0_372-b07) (build > 1.8.0_372-b07) > 17182# Java VM: OpenJDK 64-Bit Server VM (25.372-b07 mixed mode linux-amd64 > compressed oops) > 17183# Problematic frame: > 17184# C [librocksdbjni886380103972770161.so+0x3d2743] > rocksdb::DBImpl::FailIfCfHasTs(rocksdb::ColumnFamilyHandle const*) const+0x23 > 17185# > 17186# Failed to write core dump. Core dumps have been disabled. To enable > core dumping, try "ulimit -c unlimited" before starting Java again > 17187# > 17188# An error report file with more information is saved as: > 17189# /home/runner/work/spark/spark/sql/core/hs_err_pid4403.log > 17190# > 17191# If you would like to submit a bug report, please visit: > 17192# https://github.com/adoptium/adoptium-support/issues > 17193# The crash happened outside the Java Virtual Machine in native code. > 17194# See problematic frame for where to report the bug. > 17195# {code} > > This is my first time encountering this problem, and I am unsure of the root > cause now > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44580) RocksDB crashed when testing in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-44580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-44580: Attachment: image-2023-08-10-09-44-19-341.png > RocksDB crashed when testing in GitHub Actions > -- > > Key: SPARK-44580 > URL: https://issues.apache.org/jira/browse/SPARK-44580 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.5.0, 4.0.0 >Reporter: Yang Jie >Priority: Major > Attachments: image-2023-08-09-20-26-11-507.png, > image-2023-08-10-09-44-19-341.png > > > [https://github.com/LuciferYang/spark/actions/runs/5666554831/job/15395578871] > > {code:java} > # > 17177# A fatal error has been detected by the Java Runtime Environment: > 17178# > 17179# SIGSEGV (0xb) at pc=0x7f8a077d2743, pid=4403, > tid=0x7f89cadff640 > 17180# > 17181# JRE version: OpenJDK Runtime Environment (8.0_372-b07) (build > 1.8.0_372-b07) > 17182# Java VM: OpenJDK 64-Bit Server VM (25.372-b07 mixed mode linux-amd64 > compressed oops) > 17183# Problematic frame: > 17184# C [librocksdbjni886380103972770161.so+0x3d2743] > rocksdb::DBImpl::FailIfCfHasTs(rocksdb::ColumnFamilyHandle const*) const+0x23 > 17185# > 17186# Failed to write core dump. Core dumps have been disabled. To enable > core dumping, try "ulimit -c unlimited" before starting Java again > 17187# > 17188# An error report file with more information is saved as: > 17189# /home/runner/work/spark/spark/sql/core/hs_err_pid4403.log > 17190# > 17191# If you would like to submit a bug report, please visit: > 17192# https://github.com/adoptium/adoptium-support/issues > 17193# The crash happened outside the Java Virtual Machine in native code. > 17194# See problematic frame for where to report the bug. > 17195# {code} > > This is my first time encountering this problem, and I am unsure of the root > cause now > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44580) RocksDB crashed when testing in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-44580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752564#comment-17752564 ] BingKun Pan commented on SPARK-44580: - >From this error, it seems that it is caused by the absence of `dfsRootDir` > RocksDB crashed when testing in GitHub Actions > -- > > Key: SPARK-44580 > URL: https://issues.apache.org/jira/browse/SPARK-44580 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.5.0, 4.0.0 >Reporter: Yang Jie >Priority: Major > Attachments: image-2023-08-09-20-26-11-507.png > > > [https://github.com/LuciferYang/spark/actions/runs/5666554831/job/15395578871] > > {code:java} > # > 17177# A fatal error has been detected by the Java Runtime Environment: > 17178# > 17179# SIGSEGV (0xb) at pc=0x7f8a077d2743, pid=4403, > tid=0x7f89cadff640 > 17180# > 17181# JRE version: OpenJDK Runtime Environment (8.0_372-b07) (build > 1.8.0_372-b07) > 17182# Java VM: OpenJDK 64-Bit Server VM (25.372-b07 mixed mode linux-amd64 > compressed oops) > 17183# Problematic frame: > 17184# C [librocksdbjni886380103972770161.so+0x3d2743] > rocksdb::DBImpl::FailIfCfHasTs(rocksdb::ColumnFamilyHandle const*) const+0x23 > 17185# > 17186# Failed to write core dump. Core dumps have been disabled. To enable > core dumping, try "ulimit -c unlimited" before starting Java again > 17187# > 17188# An error report file with more information is saved as: > 17189# /home/runner/work/spark/spark/sql/core/hs_err_pid4403.log > 17190# > 17191# If you would like to submit a bug report, please visit: > 17192# https://github.com/adoptium/adoptium-support/issues > 17193# The crash happened outside the Java Virtual Machine in native code. > 17194# See problematic frame for where to report the bug. > 17195# {code} > > This is my first time encountering this problem, and I am unsure of the root > cause now > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44729) Add canonical links to the PySpark docs page
[ https://issues.apache.org/jira/browse/SPARK-44729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752561#comment-17752561 ] Ruifeng Zheng commented on SPARK-44729: --- [~panbingkun] HI, bingkun, would you mind taking a look at this one? > Add canonical links to the PySpark docs page > > > Key: SPARK-44729 > URL: https://issues.apache.org/jira/browse/SPARK-44729 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > We should add the canonical link to the PySpark docs page > [https://spark.apache.org/docs/latest/api/python/index.html] so that the > search engine can return the latest PySpark docs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44747) Add Dataset.Builder methods to Scala Client
[ https://issues.apache.org/jira/browse/SPARK-44747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44747. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42419 [https://github.com/apache/spark/pull/42419] > Add Dataset.Builder methods to Scala Client > --- > > Key: SPARK-44747 > URL: https://issues.apache.org/jira/browse/SPARK-44747 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0, 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44749) Support named arguments in Python UDTF
Takuya Ueshin created SPARK-44749: - Summary: Support named arguments in Python UDTF Key: SPARK-44749 URL: https://issues.apache.org/jira/browse/SPARK-44749 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44740) Allow configuring the session ID for a spark connect client in the remote string
[ https://issues.apache.org/jira/browse/SPARK-44740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44740: Assignee: Martin Grund > Allow configuring the session ID for a spark connect client in the remote > string > > > Key: SPARK-44740 > URL: https://issues.apache.org/jira/browse/SPARK-44740 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44740) Allow configuring the session ID for a spark connect client in the remote string
[ https://issues.apache.org/jira/browse/SPARK-44740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44740. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42415 [https://github.com/apache/spark/pull/42415] > Allow configuring the session ID for a spark connect client in the remote > string > > > Key: SPARK-44740 > URL: https://issues.apache.org/jira/browse/SPARK-44740 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Fix For: 3.5.0, 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44745) Document shuffle data recovery from the remounted K8s PVCs
[ https://issues.apache.org/jira/browse/SPARK-44745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44745: - Assignee: Dongjoon Hyun > Document shuffle data recovery from the remounted K8s PVCs > -- > > Key: SPARK-44745 > URL: https://issues.apache.org/jira/browse/SPARK-44745 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.4.2, 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44745) Document shuffle data recovery from the remounted K8s PVCs
[ https://issues.apache.org/jira/browse/SPARK-44745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44745. --- Fix Version/s: 3.5.0 4.0.0 3.4.2 Resolution: Fixed Issue resolved by pull request 42417 [https://github.com/apache/spark/pull/42417] > Document shuffle data recovery from the remounted K8s PVCs > -- > > Key: SPARK-44745 > URL: https://issues.apache.org/jira/browse/SPARK-44745 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.4.2, 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.5.0, 4.0.0, 3.4.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
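For context, the documentation in question covers recovering shuffle data by letting a replacement executor remount a previously used PVC. A rough sketch of the kind of configuration involved follows; the authoritative list of keys and values is the merged documentation, not this sketch, and the volume name and mount path below are placeholders.
{code}
spark.kubernetes.driver.ownPersistentVolumeClaim=true
spark.kubernetes.driver.reusePersistentVolumeClaim=true
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false
# A shuffle IO plugin that scans the remounted directory is also part of the recipe,
# e.g. spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO
{code}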
[jira] [Commented] (SPARK-44748) Query execution to support PARTITION BY and ORDER BY clause for table arguments
[ https://issues.apache.org/jira/browse/SPARK-44748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752531#comment-17752531 ] Daniel commented on SPARK-44748: I can work on this one > Query execution to support PARTITION BY and ORDER BY clause for table > arguments > --- > > Key: SPARK-44748 > URL: https://issues.apache.org/jira/browse/SPARK-44748 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44503) Query planning to support PARTITION BY and ORDER BY clause for table arguments
[ https://issues.apache.org/jira/browse/SPARK-44503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel resolved SPARK-44503. Resolution: Fixed > Query planning to support PARTITION BY and ORDER BY clause for table arguments > -- > > Key: SPARK-44503 > URL: https://issues.apache.org/jira/browse/SPARK-44503 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44748) Query execution to support PARTITION BY and ORDER BY clause for table arguments
Daniel created SPARK-44748: -- Summary: Query execution to support PARTITION BY and ORDER BY clause for table arguments Key: SPARK-44748 URL: https://issues.apache.org/jira/browse/SPARK-44748 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Daniel -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
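As background, the clause applies to TABLE arguments passed to table-valued functions such as Python UDTFs. A hedged illustration of the intended call shape (the function and column names are placeholders, and the exact grammar accepted is defined by the parent task, not by this sketch):
{code:python}
# Rows are grouped by user_id; within each partition the UDTF sees rows ordered by event_time.
spark.sql("""
    SELECT *
    FROM my_udtf(TABLE(events) PARTITION BY user_id ORDER BY event_time)
""").show()
{code}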
[jira] [Updated] (SPARK-44503) Query planning to support PARTITION BY and ORDER BY clause for table arguments
[ https://issues.apache.org/jira/browse/SPARK-44503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel updated SPARK-44503: --- Summary: Query planning to support PARTITION BY and ORDER BY clause for table arguments (was: Support PARTITION BY and ORDER BY clause for table arguments) > Query planning to support PARTITION BY and ORDER BY clause for table arguments > -- > > Key: SPARK-44503 > URL: https://issues.apache.org/jira/browse/SPARK-44503 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44646) Migrate Log4j 2.x in Spark 3.4.1 to Logback
[ https://issues.apache.org/jira/browse/SPARK-44646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752524#comment-17752524 ] Yu Tian commented on SPARK-44646: - Hi [~viirya] Thanks for the suggestion. We spent some time evaluating the log4j-to-slf4j approach; unfortunately, it does not seem to work. Compared to log4j-over-slf4j for log4j 1.x, log4j-to-slf4j is mainly an adapter for log4j-core, which means we still need the log4j-core dependency. Since logback and log4j-core are two separate implementations, slf4j will complain about having both on the classpath. Below is the diagram of what we tried: !Screenshot 2023-08-09 at 2.40.12 PM.png! If there are no better solutions, we may need to rewrite the existing logging logic with log4j 2.x. Thanks. > Migrate Log4j 2.x in Spark 3.4.1 to Logback > --- > > Key: SPARK-44646 > URL: https://issues.apache.org/jira/browse/SPARK-44646 > Project: Spark > Issue Type: Brainstorming > Components: Build >Affects Versions: 3.4.1 >Reporter: Yu Tian >Priority: Major > Attachments: Screenshot 2023-08-09 at 2.40.12 PM.png > > > Hi, > We are working on the spark 3.4.1 upgrade from spark 3.1.3, in our logging > system, we are using logback framework, it is working with spark 3.1.3 since > it is using log4j 1.x. However, when we upgrade spark to 3.4.1, based on the > [release > notes|https://spark.apache.org/docs/latest/core-migration-guide.html], spark > is migrating from log4j 2.x from log4j 1.x, the way we are replacing the > log4j with logback is causing build failures in spark master start process. > Error: Unable to initialize main class org.apache.spark.deploy.master.Master > Caused by: java.lang.NoClassDefFoundError: > org/apache/logging/log4j/core/Filter > In our current approach, we are using log4j-over-slf4j to replace the > log4j-core, it is only applicable to log4j 1.x library. And there is no > log4j-over-slf4j for log4j 2.x out there yet. (please correct me if I am > wrong). > I am also curious that why spark choose to use log4j 2.x instead of using > SPI, which gives the users less flexibility to choose whatever logger > implementation they want to use. > I want to share this issue and see if anyone else has been reported this and > if there is any work-around or alternative solutions for it. Any suggestions > are appreciated, thanks. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44646) Migrate Log4j 2.x in Spark 3.4.1 to Logback
[ https://issues.apache.org/jira/browse/SPARK-44646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Tian updated SPARK-44646: Attachment: Screenshot 2023-08-09 at 2.40.12 PM.png > Migrate Log4j 2.x in Spark 3.4.1 to Logback > --- > > Key: SPARK-44646 > URL: https://issues.apache.org/jira/browse/SPARK-44646 > Project: Spark > Issue Type: Brainstorming > Components: Build >Affects Versions: 3.4.1 >Reporter: Yu Tian >Priority: Major > Attachments: Screenshot 2023-08-09 at 2.40.12 PM.png > > > Hi, > We are working on the spark 3.4.1 upgrade from spark 3.1.3, in our logging > system, we are using logback framework, it is working with spark 3.1.3 since > it is using log4j 1.x. However, when we upgrade spark to 3.4.1, based on the > [release > notes|https://spark.apache.org/docs/latest/core-migration-guide.html], spark > is migrating from log4j 2.x from log4j 1.x, the way we are replacing the > log4j with logback is causing build failures in spark master start process. > Error: Unable to initialize main class org.apache.spark.deploy.master.Master > Caused by: java.lang.NoClassDefFoundError: > org/apache/logging/log4j/core/Filter > In our current approach, we are using log4j-over-slf4j to replace the > log4j-core, it is only applicable to log4j 1.x library. And there is no > log4j-over-slf4j for log4j 2.x out there yet. (please correct me if I am > wrong). > I am also curious that why spark choose to use log4j 2.x instead of using > SPI, which gives the users less flexibility to choose whatever logger > implementation they want to use. > I want to share this issue and see if anyone else has been reported this and > if there is any work-around or alternative solutions for it. Any suggestions > are appreciated, thanks. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44747) Add Dataset.Builder methods to Scala Client
Herman van Hövell created SPARK-44747: - Summary: Add Dataset.Builder methods to Scala Client Key: SPARK-44747 URL: https://issues.apache.org/jira/browse/SPARK-44747 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.5.0 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44746) Improve the documentation for TABLE input arguments for UDTFs
Allison Wang created SPARK-44746: Summary: Improve the documentation for TABLE input arguments for UDTFs Key: SPARK-44746 URL: https://issues.apache.org/jira/browse/SPARK-44746 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang We should add more examples for using Python UDTFs with TABLE arguments. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44508) Add user guide for Python UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44508: - Component/s: Documentation > Add user guide for Python UDTFs > --- > > Key: SPARK-44508 > URL: https://issues.apache.org/jira/browse/SPARK-44508 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Add documentation for Python UDTFs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44738) Spark Connect Reattach misses metadata propagation
[ https://issues.apache.org/jira/browse/SPARK-44738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juliusz Sompolski updated SPARK-44738: -- Epic Link: SPARK-43754 > Spark Connect Reattach misses metadata propagation > -- > > Key: SPARK-44738 > URL: https://issues.apache.org/jira/browse/SPARK-44738 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Blocker > Fix For: 3.5.0, 4.0.0 > > > Currently, in the Spark Connect Reattach handler client metadata is not > propgated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44745) Document shuffle data recovery from the remounted K8s PVCs
Dongjoon Hyun created SPARK-44745: - Summary: Document shuffle data recovery from the remounted K8s PVCs Key: SPARK-44745 URL: https://issues.apache.org/jira/browse/SPARK-44745 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 3.4.2, 3.5.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42035) Add a config flag to force exit on JDK major version mismatch
[ https://issues.apache.org/jira/browse/SPARK-42035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-42035: - Target Version/s: 4.0.0 > Add a config flag to force exit on JDK major version mismatch > - > > Key: SPARK-42035 > URL: https://issues.apache.org/jira/browse/SPARK-42035 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > > JRE version mismatches can cause errors which are difficult to debug > (potentially correctness with serialization issues). We should add a flag for > platform which wish to "fail fast" and exit on major version mismatch. > > I think this could be a good thing to have default on in Spark 4. > > Generally I expect to see more folks upgrading JRE & JDKs in the coming few > years. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42261) K8s will not allocate more execs if there are any pending execs until next snapshot
[ https://issues.apache.org/jira/browse/SPARK-42261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-42261: - Target Version/s: 4.0.0 > K8s will not allocate more execs if there are any pending execs until next > snapshot > --- > > Key: SPARK-42261 > URL: https://issues.apache.org/jira/browse/SPARK-42261 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0, 3.3.1, 3.3.2, 3.4.0, 3.4.1, 3.5.0 >Reporter: Holden Karau >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44511) Allow insertInto to succeed with partion columns specified when they match those on the target table
[ https://issues.apache.org/jira/browse/SPARK-44511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-44511: - Target Version/s: 4.0.0 > Allow insertInto to succeed with partion columns specified when they match > those on the target table > > > Key: SPARK-44511 > URL: https://issues.apache.org/jira/browse/SPARK-44511 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Holden Karau >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42361) Add an option to use external storage to distribute JAR set in cluster mode on Kube
[ https://issues.apache.org/jira/browse/SPARK-42361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-42361: - Target Version/s: 4.0.0 > Add an option to use external storage to distribute JAR set in cluster mode > on Kube > --- > > Key: SPARK-42361 > URL: https://issues.apache.org/jira/browse/SPARK-42361 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Holden Karau >Priority: Minor > > tl;dr – sometimes the driver can get overwhelmed serving the initial jar set. > You'll see a lot of "Executor fetching spark://.../jar" and then > connection timed out. > > On YARN the jars (in cluster mode) are cached in HDFS. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42260) Log when the K8s Exec Pods Allocator Stalls
[ https://issues.apache.org/jira/browse/SPARK-42260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-42260: - Target Version/s: 4.0.0 > Log when the K8s Exec Pods Allocator Stalls > --- > > Key: SPARK-42260 > URL: https://issues.apache.org/jira/browse/SPARK-42260 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0, 3.4.1 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > > Sometimes if the K8s APIs are being slow the ExecutorPods allocator can stall > and it would be good for us to log this (and how long we've stalled for) so > folks can tell more clearly why Spark is unable to reach the desired target > number of executors. > > This is _somewhat_ related to SPARK-36664 which logs the time spent waiting > for executor allocation but goes a step further for K8s and logs when we've > stalled because we have too many pending pods. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44727) Improve the error message for dynamic allocation conditions
[ https://issues.apache.org/jira/browse/SPARK-44727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752496#comment-17752496 ] Holden Karau commented on SPARK-44727: -- Do you have more context [~chengpan] ? > Improve the error message for dynamic allocation conditions > --- > > Key: SPARK-44727 > URL: https://issues.apache.org/jira/browse/SPARK-44727 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42035) Add a config flag to force exit on JDK major version mismatch
[ https://issues.apache.org/jira/browse/SPARK-42035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-42035: - Description: JRE version mismatches can cause errors which are difficult to debug (potentially correctness with serialization issues). We should add a flag for platform which wish to "fail fast" and exit on major version mismatch. I think this could be a good thing to have default on in Spark 4. Generally I expect to see more folks upgrading JRE & JDKs in the coming few years. was:JRE version mismatches can cause errors which are difficult to debug (potentially correctness with serialization issues). We should add a flag for platform which wish to "fail fast" and exit on major version mismatch. > Add a config flag to force exit on JDK major version mismatch > - > > Key: SPARK-42035 > URL: https://issues.apache.org/jira/browse/SPARK-42035 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > > JRE version mismatches can cause errors which are difficult to debug > (potentially correctness with serialization issues). We should add a flag for > platform which wish to "fail fast" and exit on major version mismatch. > > I think this could be a good thing to have default on in Spark 4. > > Generally I expect to see more folks upgrading JRE & JDKs in the coming few > years. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34337) Reject disk blocks when out of disk space
[ https://issues.apache.org/jira/browse/SPARK-34337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-34337: - Target Version/s: 4.0.0 > Reject disk blocks when out of disk space > - > > Key: SPARK-34337 > URL: https://issues.apache.org/jira/browse/SPARK-34337 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.1, 3.1.2, 3.2.0 >Reporter: Holden Karau >Priority: Major > > Now that we have the ability to store shuffle blocks on disaggregated > storage (when configured), we should add the option to reject storing blocks > locally on an executor at a certain disk pressure threshold. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
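Editor's note: a toy illustration of the kind of threshold check this could imply. The helper and the fraction-based knob are assumptions for illustration, not an existing Spark config or API.

{code:scala}
// Illustrative check: refuse to store a block locally when the disk holding
// the local dir is nearly full. The threshold is a hypothetical knob.
import java.io.File

object DiskPressure {
  def canStoreLocally(localDir: File, blockSizeBytes: Long, minFreeFraction: Double = 0.05): Boolean = {
    val freeAfterWrite = localDir.getUsableSpace - blockSizeBytes
    freeAfterWrite.toDouble / localDir.getTotalSpace > minFreeFraction
  }
}
{code}

If such a check fails, the block could be pushed to the configured disaggregated storage instead of the executor's local disk.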
[jira] [Created] (SPARK-44744) Move DS v2 API to sql/api module
Yihong He created SPARK-44744: - Summary: Move DS v2 API to sql/api module Key: SPARK-44744 URL: https://issues.apache.org/jira/browse/SPARK-44744 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.5.0 Reporter: Yihong He -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44743) Reflect function behavior different from Hive
Nikhil Goyal created SPARK-44743: Summary: Reflect function behavior different from Hive Key: SPARK-44743 URL: https://issues.apache.org/jira/browse/SPARK-44743 Project: Spark Issue Type: New Feature Components: PySpark, SQL Affects Versions: 3.4.1 Reporter: Nikhil Goyal Spark's reflect function will fail if the underlying method call throws an exception, which causes the whole job to fail. In Hive, however, the exception is caught and null is returned. A simple test to reproduce the behavior: {code:java} select reflect('java.net.URLDecoder', 'decode', '%') {code} The workaround would be to wrap this call in a try: [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CallMethodViaReflection.scala#L136] We can support this by adding a new UDF `try_reflect` which mimics Hive's behavior. Please share your thoughts on this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
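Editor's note: for illustration, how the current behavior compares with the proposed function when run from spark-shell. `try_reflect` does not exist; the name is the one suggested in the issue, and the commented results are only indicative.

{code:scala}
// Current behavior: the reflective call throws and the whole query fails.
spark.sql("SELECT reflect('java.net.URLDecoder', 'decode', '%')").show()
// -> fails with the IllegalArgumentException thrown by URLDecoder.decode

// Proposed (hypothetical) behavior, mirroring Hive: catch the exception and return NULL.
spark.sql("SELECT try_reflect('java.net.URLDecoder', 'decode', '%')").show()
// -> a single row containing null
{code}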
[jira] [Created] (SPARK-44742) Add Spark version drop down to the PySpark doc site
Allison Wang created SPARK-44742: Summary: Add Spark version drop down to the PySpark doc site Key: SPARK-44742 URL: https://issues.apache.org/jira/browse/SPARK-44742 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Currently, the PySpark documentation does not have a version dropdown. While by default we want people to land on the latest version, a version dropdown will make it easier for people to find the docs they need. Other libraries such as numpy have such a version dropdown. !image-2023-08-09-09-38-00-805.png|width=214,height=189! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44741) Spark StatsD metrics reported to support metrics filter option
[ https://issues.apache.org/jira/browse/SPARK-44741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752456#comment-17752456 ] rameshkrishnan muthusamy commented on SPARK-44741: -- I am working on a PR for this enhancement > Spark StatsD metrics reported to support metrics filter option > --- > > Key: SPARK-44741 > URL: https://issues.apache.org/jira/browse/SPARK-44741 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: rameshkrishnan muthusamy >Priority: Minor > Labels: Metrics, Sink, statsd > > The Spark StatsD metrics sink currently does not support a metrics > filtering option. > Though this option is available in the reporters, it is not exposed in the > StatsD sink. An example of this can be seen at > [https://github.com/apache/spark/blob/be9ffb37585fe421705ceaa52fe49b89c50703a3/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L76] > > This is a critical option to have when teams do not want all the metrics > exposed by Spark in their metrics monitoring platforms and only switch to > detailed metrics as and when needed. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44741) Spark StatsD metrics reported to support metrics filter option
rameshkrishnan muthusamy created SPARK-44741: Summary: Spark StatsD metrics reported to support metrics filter option Key: SPARK-44741 URL: https://issues.apache.org/jira/browse/SPARK-44741 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.1 Reporter: rameshkrishnan muthusamy The Spark StatsD metrics sink currently does not support a metrics filtering option. Though this option is available in the reporters, it is not exposed in the StatsD sink. An example of this can be seen at [https://github.com/apache/spark/blob/be9ffb37585fe421705ceaa52fe49b89c50703a3/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala#L76] This is a critical option to have when teams do not want all the metrics exposed by Spark in their metrics monitoring platforms and only switch to detailed metrics as and when needed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
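Editor's note: the GraphiteSink line linked above builds its filter from a "regex" property roughly as sketched below, and a StatsD sink could expose the same kind of option. The class name and wiring here are illustrative, not the actual StatsdSink code.

{code:scala}
// Sketch of a regex-based metric filter read from sink properties,
// modeled on the GraphiteSink pattern referenced in the issue.
import java.util.Properties
import com.codahale.metrics.{Metric, MetricFilter}

class RegexMetricFilter(properties: Properties) {
  val filter: MetricFilter = Option(properties.getProperty("regex")) match {
    case Some(pattern) =>
      val regex = pattern.r
      new MetricFilter {
        override def matches(name: String, metric: Metric): Boolean =
          regex.findFirstMatchIn(name).isDefined
      }
    case None => MetricFilter.ALL
  }
}
{code}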
[jira] [Created] (SPARK-44740) Allow configuring the session ID for a spark connect client in the remote string
Martin Grund created SPARK-44740: Summary: Allow configuring the session ID for a spark connect client in the remote string Key: SPARK-44740 URL: https://issues.apache.org/jira/browse/SPARK-44740 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Martin Grund -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
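Editor's note: the issue has no description yet; presumably the idea is along the lines of the sketch below. The `session_id` parameter name is a guess at the proposed syntax, not an existing option of the Spark Connect remote string.

{code:scala}
// Hypothetical: pin the Spark Connect session ID through the remote string.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .remote("sc://connect-host:15002/;session_id=1b8b45f3-7d2e-4a44-a1c2-0a9f6a4f2d10")
  .getOrCreate()
{code}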
[jira] [Resolved] (SPARK-44720) Make Dataset use Encoder instead of AgnosticEncoder
[ https://issues.apache.org/jira/browse/SPARK-44720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-44720. --- Fix Version/s: 3.5.0 Resolution: Fixed > Make Dataset use Encoder instead of AgnosticEncoder > --- > > Key: SPARK-44720 > URL: https://issues.apache.org/jira/browse/SPARK-44720 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44580) RocksDB crashed when testing in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-44580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-44580: - Attachment: image-2023-08-09-20-26-11-507.png > RocksDB crashed when testing in GitHub Actions > -- > > Key: SPARK-44580 > URL: https://issues.apache.org/jira/browse/SPARK-44580 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.5.0, 4.0.0 >Reporter: Yang Jie >Priority: Major > Attachments: image-2023-08-09-20-26-11-507.png > > > [https://github.com/LuciferYang/spark/actions/runs/5666554831/job/15395578871] > > {code:java} > # > 17177# A fatal error has been detected by the Java Runtime Environment: > 17178# > 17179# SIGSEGV (0xb) at pc=0x7f8a077d2743, pid=4403, > tid=0x7f89cadff640 > 17180# > 17181# JRE version: OpenJDK Runtime Environment (8.0_372-b07) (build > 1.8.0_372-b07) > 17182# Java VM: OpenJDK 64-Bit Server VM (25.372-b07 mixed mode linux-amd64 > compressed oops) > 17183# Problematic frame: > 17184# C [librocksdbjni886380103972770161.so+0x3d2743] > rocksdb::DBImpl::FailIfCfHasTs(rocksdb::ColumnFamilyHandle const*) const+0x23 > 17185# > 17186# Failed to write core dump. Core dumps have been disabled. To enable > core dumping, try "ulimit -c unlimited" before starting Java again > 17187# > 17188# An error report file with more information is saved as: > 17189# /home/runner/work/spark/spark/sql/core/hs_err_pid4403.log > 17190# > 17191# If you would like to submit a bug report, please visit: > 17192# https://github.com/adoptium/adoptium-support/issues > 17193# The crash happened outside the Java Virtual Machine in native code. > 17194# See problematic frame for where to report the bug. > 17195# {code} > > This is my first time encountering this problem, and I am unsure of the root > cause now > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44580) RocksDB crashed when testing in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-44580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752402#comment-17752402 ] Yang Jie commented on SPARK-44580: -- A new crash case [https://github.com/yaooqinn/spark/actions/runs/5805477173/job/15736662791] !image-2023-08-09-20-26-11-507.png! > RocksDB crashed when testing in GitHub Actions > -- > > Key: SPARK-44580 > URL: https://issues.apache.org/jira/browse/SPARK-44580 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.5.0, 4.0.0 >Reporter: Yang Jie >Priority: Major > > [https://github.com/LuciferYang/spark/actions/runs/5666554831/job/15395578871] > > {code:java} > # > 17177# A fatal error has been detected by the Java Runtime Environment: > 17178# > 17179# SIGSEGV (0xb) at pc=0x7f8a077d2743, pid=4403, > tid=0x7f89cadff640 > 17180# > 17181# JRE version: OpenJDK Runtime Environment (8.0_372-b07) (build > 1.8.0_372-b07) > 17182# Java VM: OpenJDK 64-Bit Server VM (25.372-b07 mixed mode linux-amd64 > compressed oops) > 17183# Problematic frame: > 17184# C [librocksdbjni886380103972770161.so+0x3d2743] > rocksdb::DBImpl::FailIfCfHasTs(rocksdb::ColumnFamilyHandle const*) const+0x23 > 17185# > 17186# Failed to write core dump. Core dumps have been disabled. To enable > core dumping, try "ulimit -c unlimited" before starting Java again > 17187# > 17188# An error report file with more information is saved as: > 17189# /home/runner/work/spark/spark/sql/core/hs_err_pid4403.log > 17190# > 17191# If you would like to submit a bug report, please visit: > 17192# https://github.com/adoptium/adoptium-support/issues > 17193# The crash happened outside the Java Virtual Machine in native code. > 17194# See problematic frame for where to report the bug. > 17195# {code} > > This is my first time encountering this problem, and I am unsure of the root > cause now > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43429) Add default/active SparkSession APIs
[ https://issues.apache.org/jira/browse/SPARK-43429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43429. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42406 [https://github.com/apache/spark/pull/42406] > Add default/active SparkSession APIs > > > Key: SPARK-43429 > URL: https://issues.apache.org/jira/browse/SPARK-43429 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0, 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42620) Add `inclusive` parameter for (DataFrame|Series).between_time
[ https://issues.apache.org/jira/browse/SPARK-42620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42620. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 40370 [https://github.com/apache/spark/pull/40370] > Add `inclusive` parameter for (DataFrame|Series).between_time > - > > Key: SPARK-42620 > URL: https://issues.apache.org/jira/browse/SPARK-42620 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Fix For: 4.0.0 > > > See https://github.com/pandas-dev/pandas/pull/43248 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44721) Retry Policy Revamp
[ https://issues.apache.org/jira/browse/SPARK-44721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752349#comment-17752349 ] ASF GitHub Bot commented on SPARK-44721: User 'cdkrot' has created a pull request for this issue: https://github.com/apache/spark/pull/42399 > Retry Policy Revamp > --- > > Key: SPARK-44721 > URL: https://issues.apache.org/jira/browse/SPARK-44721 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Alice Sayutina >Priority: Major > > Change the retry logic. With the existing retry logic, the maximum tolerated wait time > can, with small probability, be extremely low. Revamp the logic to guarantee > a certain minimum wait time -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
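Editor's note: a toy sketch of the issue and the fix. With purely randomized exponential backoff every sleep can land near zero, so the total wait across all attempts can, with small probability, be tiny; clamping each sleep to a floor restores a guaranteed minimum. Names and constants are illustrative, not the actual Spark Connect client code.

{code:scala}
// Randomized exponential backoff with a guaranteed per-attempt minimum wait.
import scala.util.Random

object RetryBackoff {
  def waitTimesMs(attempts: Int,
                  initialMs: Long = 50,
                  multiplier: Double = 4.0,
                  minWaitMs: Long = 10,
                  maxWaitMs: Long = 30000): Seq[Long] = {
    (0 until attempts).map { i =>
      val cap = math.min(maxWaitMs, (initialMs * math.pow(multiplier, i)).toLong)
      // Jitter may draw anywhere in [0, cap); the floor keeps the total wait
      // from collapsing to (almost) zero when every draw happens to be small.
      math.max(minWaitMs, (Random.nextDouble() * cap).toLong)
    }
  }
}
{code}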
[jira] [Commented] (SPARK-40909) Reuse the broadcast exchange for bloom filter
[ https://issues.apache.org/jira/browse/SPARK-40909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752348#comment-17752348 ] ASF GitHub Bot commented on SPARK-40909: User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/42395 > Reuse the broadcast exchange for bloom filter > - > > Key: SPARK-40909 > URL: https://issues.apache.org/jira/browse/SPARK-40909 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Currently, if the creation side of the bloom filter can be broadcast, Spark > cannot inject a bloom filter or InSubquery filter into the application side. > In fact, we can inject a bloom filter that reuses the broadcast exchange > and improves performance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44738) Spark Connect Reattach misses metadata propagation
[ https://issues.apache.org/jira/browse/SPARK-44738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752343#comment-17752343 ] ASF GitHub Bot commented on SPARK-44738: User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/42409 > Spark Connect Reattach misses metadata propagation > -- > > Key: SPARK-44738 > URL: https://issues.apache.org/jira/browse/SPARK-44738 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Blocker > Fix For: 3.5.0, 4.0.0 > > > Currently, in the Spark Connect Reattach handler, client metadata is not > propagated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44738) Spark Connect Reattach misses metadata propagation
[ https://issues.apache.org/jira/browse/SPARK-44738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44738. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42409 [https://github.com/apache/spark/pull/42409] > Spark Connect Reattach misses metadata propagation > -- > > Key: SPARK-44738 > URL: https://issues.apache.org/jira/browse/SPARK-44738 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Blocker > Fix For: 3.5.0, 4.0.0 > > > Currently, in the Spark Connect Reattach handler, client metadata is not > propagated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44738) Spark Connect Reattach misses metadata propagation
[ https://issues.apache.org/jira/browse/SPARK-44738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44738: Assignee: Martin Grund > Spark Connect Reattach misses metadata propagation > -- > > Key: SPARK-44738 > URL: https://issues.apache.org/jira/browse/SPARK-44738 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Blocker > Fix For: 3.5.0 > > > Currently, in the Spark Connect Reattach handler, client metadata is not > propagated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44739) Conflicting attribute during join two times the same table (AQE is disabled)
[ https://issues.apache.org/jira/browse/SPARK-44739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kondziolka9ld updated SPARK-44739: -- Description: h2. Issue I come across a something that seems to be bug in *pyspark* (when I disable adaptive queries). It is about joining two times the same dataframe (please look at reproduction steps below). h2. Reproduction steps {code:java} pyspark --conf spark.sql.adaptive.enabled=false Python 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. 23/08/09 10:18:54 WARN Utils: Your hostname, kondziolka-dd-laptop resolves to a loopback address: 127.0.1.1; using 192.168.0.18 instead (on interface wlp0s20f3) 23/08/09 10:18:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 23/08/09 10:18:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.3.0 /_/Using Python version 3.8.10 (default, Nov 14 2022 12:59:47) Spark context Web UI available at http://192.168.0.18:4040 Spark context available as 'sc' (master = local[*], app id = local-1691569137130). SparkSession available as 'spark'. >>> sc.setCheckpointDir("file:///tmp") >>> df1=spark.createDataFrame([(1, 42)], ["id", "fval"]) >>> df2=spark.createDataFrame([(1, 0, "jeden")], ["id", "target", "aux"]) >>> df2.explain() == Physical Plan == *(1) Scan ExistingRDD[id#4L,target#5L,aux#6] >>> j1=df1.join(df2, ["id"]).select("fval", "aux").checkpoint() >>> j1.explain() == Physical Plan == *(1) Scan ExistingRDD[fval#1L,aux#6] >>> # we see that both j1 and df2 refers to the same attribute aux#6 >>> # let's join df2 to j1. Both of them has aux column. >>> j1.join(df2, "aux") Traceback (most recent call last): File "", line 1, in File "/home/kondziolkadd/.local/lib/python3.8/site-packages/pyspark/sql/dataframe.py", line 1539, in join jdf = self._jdf.join(other._jdf, on, how) File "/home/kondziolkadd/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__ File "/home/kondziolkadd/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 196, in deco raise converted from None pyspark.sql.utils.AnalysisException: Failure when resolving conflicting references in Join: 'Join Inner :- LogicalRDD [fval#1L, aux#6], false +- LogicalRDD [id#4L, target#5L, aux#6], false Conflicting attributes: aux#6 ; 'Join Inner :- LogicalRDD [fval#1L, aux#6], false +- LogicalRDD [id#4L, target#5L, aux#6], false {code} h2. Workaround The workaround is about renaming columns twice times - I mean identity rename `X -> X' -> X`. It looks like it forces rewrite of metadata (change attribute id) and in this way it avoids conflict. {code:java} >>> sc.setCheckpointDir("file:///tmp") >>> df1=spark.createDataFrame([(1, 42)], ["id", "fval"]) >>> df2=spark.createDataFrame([(1, 0, "jeden")], ["id", "target", "aux"]) >>> df2.explain() == Physical Plan == *(1) Scan ExistingRDD[id#4L,target#5L,aux#6] >>> j1=df1.join(df2, ["id"]).select("fval", "aux").withColumnRenamed("aux", >>> "_aux").withColumnRenamed("_aux", "aux").checkpoint() >>> j1.explain() == Physical Plan == *(1) Scan ExistingRDD[fval#1L,aux#19] >>> j1.join(df2, "aux") >>> {code} h2. 
Others * Repartition before checkpoint is workaround as well (it does not change id of attribute) {code:java} >>> j1=df1.join(df2, ["id"]).select("fval", >>> "aux").repartition(100).checkpoint() >>> j1.join(df2, "aux") {code} * Without `checkpoint` issue does not occur (although id is the same) {code:java} >>> j1=df1.join(df2, ["id"]).select("fval", "aux") >>> j1.join(df2, "aux") {code} * Without disabling `AQE` it does not occur * I was not able to reproduce it on spark - by saying that I mean that I reproduced it only in `pyspark`. was: h2. Issue I come across a something that seems to be bug in *pyspark* (when I disable adaptive queries). It is about joining two times the same dataframe. It is needed to `checkpoint` dataframe `j1` before joining to expose this issue. h2. Reproduction steps {code:java} pyspark --conf spark.sql.adaptive.enabled=false Python 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. 23/08/09 10:18:54 WARN Utils: Your hostname, kondziolka-dd-laptop resolves to a loopback address: 127.0.1.1; using 192.168.0.18 instead (on interface wlp0s20f3) 23/08/09 10:18
[jira] [Created] (SPARK-44739) Conflicting attribute during join two times the same table (AQE is disabled)
kondziolka9ld created SPARK-44739: - Summary: Conflicting attribute during join two times the same table (AQE is disabled) Key: SPARK-44739 URL: https://issues.apache.org/jira/browse/SPARK-44739 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.3.0 Reporter: kondziolka9ld h2. Issue I came across something that seems to be a bug in *pyspark* (when I disable adaptive queries). It is about joining the same dataframe twice. The dataframe `j1` needs to be `checkpoint`-ed before joining to expose this issue. h2. Reproduction steps {code:java} pyspark --conf spark.sql.adaptive.enabled=false Python 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. 23/08/09 10:18:54 WARN Utils: Your hostname, kondziolka-dd-laptop resolves to a loopback address: 127.0.1.1; using 192.168.0.18 instead (on interface wlp0s20f3) 23/08/09 10:18:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 23/08/09 10:18:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.3.0 /_/Using Python version 3.8.10 (default, Nov 14 2022 12:59:47) Spark context Web UI available at http://192.168.0.18:4040 Spark context available as 'sc' (master = local[*], app id = local-1691569137130). SparkSession available as 'spark'. >>> sc.setCheckpointDir("file:///tmp") >>> df1=spark.createDataFrame([(1, 42)], ["id", "fval"]) >>> df2=spark.createDataFrame([(1, 0, "jeden")], ["id", "target", "aux"]) >>> >>> >>> df2.explain() == Physical Plan == *(1) Scan ExistingRDD[id#4L,target#5L,aux#6] >>> j1=df1.join(df2, ["id"]).select("fval", "aux").checkpoint() >>> j1.explain() == Physical Plan == *(1) Scan ExistingRDD[fval#1L,aux#6] >>> # we see that both j1 and df2 refers to the same attribute aux#6 >>> # let's join df2 to j1. Both of them has aux column. >>> j2=j1.join(df2, "aux") Traceback (most recent call last): File "", line 1, in File "/home/kondziolkadd/.local/lib/python3.8/site-packages/pyspark/sql/dataframe.py", line 1539, in join jdf = self._jdf.join(other._jdf, on, how) File "/home/user/.local/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__ File "/home/user/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 196, in deco raise converted from None pyspark.sql.utils.AnalysisException: Failure when resolving conflicting references in Join: 'Join Inner :- LogicalRDD [fval#1L, aux#6], false +- LogicalRDD [id#4L, target#5L, aux#6], falseConflicting attributes: aux#6 ; 'Join Inner :- LogicalRDD [fval#1L, aux#6], false +- LogicalRDD [id#4L, target#5L, aux#6], false {code} h2. Workaround The workaround is to rename the columns twice - I mean an identity rename `X -> X' -> X`. It looks like it forces a rewrite of the metadata (changing the attribute id) and in this way avoids the conflict. 
{code:java} >>> sc.setCheckpointDir("file:///tmp") >>> df1=spark.createDataFrame([(1, 42)], ["id", "fval"]) >>> df2=spark.createDataFrame([(1, 0, "jeden")], ["id", "target", "aux"]) >>> df2.explain() == Physical Plan == *(1) Scan ExistingRDD[id#4L,target#5L,aux#6] >>> j1=df1.join(df2, ["id"]).select("fval", "aux").withColumnRenamed("aux", >>> "_aux").withColumnRenamed("_aux", "aux").checkpoint() >>> j1.explain() == Physical Plan == *(1) Scan ExistingRDD[fval#1L,aux#19] >>> j2=j1.join(df2, "aux") >>> {code} h2. Others # Repartition of `j1` before checkpoint is a workaround as well (it does not change the id of the attribute) {code:java} j1=df1.join(df2, ["id"]).select("fval", "aux").repartition(100).checkpoint() {code} # Without `checkpoint` the issue does not occur (although the id is the same) {code:java} >>> j1=df1.join(df2, ["id"]).select("fval", "aux") >>> j2=j1.join(df2, "aux") {code} # Without disabling AQE it does not occur # I was not able to reproduce it on spark - by that I mean I only reproduced it in `pyspark`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42849) Session variables
[ https://issues.apache.org/jira/browse/SPARK-42849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42849. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 40474 [https://github.com/apache/spark/pull/40474] > Session variables > - > > Key: SPARK-42849 > URL: https://issues.apache.org/jira/browse/SPARK-42849 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Serge Rielau >Assignee: Serge Rielau >Priority: Major > Fix For: 4.0.0 > > > Provide a type-safe, engine-controlled session variable: > CREATE [ OR REPLACE ] TEMPORARY VARIABLE [ IF NOT EXISTS ] var_name [ type ] [ > DEFAULT expression ] > SET { variable = expression | ( variable [, ...] ) = ( subquery | expression > [, ...] ) } > DROP VARIABLE [ IF EXISTS ] variable_name -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42849) Session variables
[ https://issues.apache.org/jira/browse/SPARK-42849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42849: --- Assignee: Serge Rielau > Session variables > - > > Key: SPARK-42849 > URL: https://issues.apache.org/jira/browse/SPARK-42849 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Serge Rielau >Assignee: Serge Rielau >Priority: Major > > Provide a type-safe, engine-controlled session variable: > CREATE [ OR REPLACE ] TEMPORARY VARIABLE [ IF NOT EXISTS ] var_name [ type ] [ > DEFAULT expression ] > SET { variable = expression | ( variable [, ...] ) = ( subquery | expression > [, ...] ) } > DROP VARIABLE [ IF EXISTS ] variable_name -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
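Editor's note: a small usage sketch based on the syntax quoted in the issue, driven through spark.sql from Scala. The exact SET spelling and the results in comments are illustrative and may differ from the final implementation.

{code:scala}
// Illustrative use of the session-variable statements quoted above.
spark.sql("CREATE TEMPORARY VARIABLE var1 INT DEFAULT 10")
spark.sql("SET VAR var1 = 42")            // assumed spelling of the SET statement
spark.sql("SELECT var1 + 1").show()       // expected to show 43
spark.sql("DROP VARIABLE IF EXISTS var1")
{code}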
[jira] [Created] (SPARK-44738) Spark Connect Reattach misses metadata propagation
Martin Grund created SPARK-44738: Summary: Spark Connect Reattach misses metadata propagation Key: SPARK-44738 URL: https://issues.apache.org/jira/browse/SPARK-44738 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.5.0 Reporter: Martin Grund Fix For: 3.5.0 Currently, in the Spark Connect Reattach handler, client metadata is not propagated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44551) Wrong semantics for null IN (empty list) - IN expression execution
[ https://issues.apache.org/jira/browse/SPARK-44551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-44551. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 42163 [https://github.com/apache/spark/pull/42163] > Wrong semantics for null IN (empty list) - IN expression execution > -- > > Key: SPARK-44551 > URL: https://issues.apache.org/jira/browse/SPARK-44551 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > Fix For: 3.5.0 > > > {{null IN (empty list)}} incorrectly evaluates to null, when it should > evaluate to false. (The reason it should be false is because a IN (b1, b2) is > defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR > which is false. This is specified by ANSI SQL.) > Many places in Spark execution (In, InSet, InSubquery) and optimization > (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that > the Spark behavior for the null IN (empty list) is inconsistent in some > places - literal IN lists generally return null (incorrect), while IN/NOT IN > subqueries mostly return false/true, respectively (correct) in this case. > This is a longstanding correctness issue which has existed since null support > for IN expressions was first added to Spark. > Doc with more details: > [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44551) Wrong semantics for null IN (empty list) - IN expression execution
[ https://issues.apache.org/jira/browse/SPARK-44551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-44551: --- Assignee: Jack Chen > Wrong semantics for null IN (empty list) - IN expression execution > -- > > Key: SPARK-44551 > URL: https://issues.apache.org/jira/browse/SPARK-44551 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > > {{null IN (empty list)}} incorrectly evaluates to null, when it should > evaluate to false. (The reason it should be false is because a IN (b1, b2) is > defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR > which is false. This is specified by ANSI SQL.) > Many places in Spark execution (In, InSet, InSubquery) and optimization > (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that > the Spark behavior for the null IN (empty list) is inconsistent in some > places - literal IN lists generally return null (incorrect), while IN/NOT IN > subqueries mostly return false/true, respectively (correct) in this case. > This is a longstanding correctness issue which has existed since null support > for IN expressions was first added to Spark. > Doc with more details: > [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
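Editor's note: to make the expected semantics concrete, an example using an empty subquery, which the description notes already mostly behaves correctly. The commented results are the ANSI-correct answers, not necessarily what every affected code path returns today.

{code:scala}
// `a IN (b1, b2)` unfolds to `a = b1 OR a = b2`; with an empty list the OR is
// empty and therefore false, even when `a` is NULL.
spark.sql("SELECT CAST(NULL AS INT) IN (SELECT id FROM range(0))").show()
// correct answer: false
spark.sql("SELECT CAST(NULL AS INT) NOT IN (SELECT id FROM range(0))").show()
// correct answer: true
{code}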