[jira] [Commented] (SPARK-43291) Match behavior for DataFrame.cov on string DataFrame
[ https://issues.apache.org/jira/browse/SPARK-43291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731449#comment-17731449 ] Haejoon Lee commented on SPARK-43291: - I'm currently analyzing pandas API usage. We plan to support the most commonly used APIs in the next release, and we expect to handle all other pandas-related breaking changes in version 4.0. We have another meeting scheduled tomorrow to discuss this further, so I will follow up with the final conclusions soon. > Match behavior for DataFrame.cov on string DataFrame > > > Key: SPARK-43291 > URL: https://issues.apache.org/jira/browse/SPARK-43291 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Should enable the test below: > {code:java} > pdf = pd.DataFrame([("1", "2"), ("0", "3"), ("2", "0"), ("1", "1")], > columns=["a", "b"]) > psdf = ps.from_pandas(pdf) > self.assert_eq(pdf.cov(), psdf.cov()) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44025) CSV Table Read Error with CharType(length) column
[ https://issues.apache.org/jira/browse/SPARK-44025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44025: Target Version/s: 3.4.1 > CSV Table Read Error with CharType(length) column > - > > Key: SPARK-44025 > URL: https://issues.apache.org/jira/browse/SPARK-44025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 > Environment: {{apache/spark:v3.4.0 image}} >Reporter: Fengyu Cao >Priority: Major > > Problem: > # read a CSV format table > # table has a `CharType(length)` column > # read table failed with Exception: `org.apache.spark.SparkException: Job > aborted due to stage failure: Task 0 in stage 36.0 failed 4 times, most > recent failure: Lost task 0.3 in stage 36.0 (TID 72) (10.113.9.208 executor > 11): java.lang.IllegalArgumentException: requirement failed: requiredSchema > (struct) should be the subset of dataSchema > (struct).` > > reproduce with official image: > # {{docker run -it apache/spark:v3.4.0 /opt/spark/bin/spark-sql}} > # {{CREATE TABLE csv_bug (name STRING, age INT, job CHAR(4)) USING CSV > OPTIONS ('header' = 'true', 'sep' = ';') LOCATION > "/opt/spark/examples/src/main/resources/people.csv";}} > # SELECT * FROM csv_bug; > # ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.IllegalArgumentException: requirement failed: requiredSchema > (struct) should be the subset of dataSchema > (struct). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43529) Support general expressions as OPTIONS values
[ https://issues.apache.org/jira/browse/SPARK-43529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-43529: -- Assignee: Daniel > Support general expressions as OPTIONS values > -- > > Key: SPARK-43529 > URL: https://issues.apache.org/jira/browse/SPARK-43529 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major >
[jira] [Resolved] (SPARK-43529) Support general expressions as OPTIONS values
[ https://issues.apache.org/jira/browse/SPARK-43529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-43529. Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41191 [https://github.com/apache/spark/pull/41191] > Support general expressions as OPTIONS values > -- > > Key: SPARK-43529 > URL: https://issues.apache.org/jira/browse/SPARK-43529 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 3.5.0 > >
[jira] [Commented] (SPARK-44021) Add a config to avoid generating too many partitions
[ https://issues.apache.org/jira/browse/SPARK-44021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731429#comment-17731429 ] Snoot.io commented on SPARK-44021: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/41545 > Add a config to avoid generating too many partitions > --- > > Key: SPARK-44021 > URL: https://issues.apache.org/jira/browse/SPARK-44021 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major >
[jira] [Created] (SPARK-44025) CSV Table Read Error with CharType(length) column
Fengyu Cao created SPARK-44025: -- Summary: CSV Table Read Error with CharType(length) column Key: SPARK-44025 URL: https://issues.apache.org/jira/browse/SPARK-44025 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Environment: {{apache/spark:v3.4.0 image}} Reporter: Fengyu Cao Problem: # read a CSV format table # table has a `CharType(length)` column # read table failed with Exception: `org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 36.0 failed 4 times, most recent failure: Lost task 0.3 in stage 36.0 (TID 72) (10.113.9.208 executor 11): java.lang.IllegalArgumentException: requirement failed: requiredSchema (struct) should be the subset of dataSchema (struct).` reproduce with official image: # {{docker run -it apache/spark:v3.4.0 /opt/spark/bin/spark-sql}} # {{CREATE TABLE csv_bug (name STRING, age INT, job CHAR(4)) USING CSV OPTIONS ('header' = 'true', 'sep' = ';') LOCATION "/opt/spark/examples/src/main/resources/people.csv";}} # SELECT * FROM csv_bug; # ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalArgumentException: requirement failed: requiredSchema (struct) should be the subset of dataSchema (struct). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32559) Fix the trim logic in UTF8String.toInt/toLong that didn't handle Chinese characters correctly
[ https://issues.apache.org/jira/browse/SPARK-32559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731425#comment-17731425 ] Snoot.io commented on SPARK-32559: -- User 'Kwafoor' has created a pull request for this issue: https://github.com/apache/spark/pull/41535 > Fix the trim logic in UTF8String.toInt/toLong that didn't handle Chinese > characters correctly > --- > > Key: SPARK-32559 > URL: https://issues.apache.org/jira/browse/SPARK-32559 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: EdisonWang >Assignee: EdisonWang >Priority: Major > Labels: correctness > Fix For: 3.0.1 > > Attachments: error.log > > > The trim logic in the Cast expression introduced in > [https://github.com/apache/spark/pull/26622] will trim Chinese characters > unexpectedly. > For example, the SQL select cast("1中文" as float) gives 1 instead of null > --
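To illustrate the expected semantics in a standalone way: a strict numeric parser rejects "1中文" outright, which is why the cast should yield NULL rather than 1. A hedged pure-Python sketch (the helper name `cast_to_float_or_none` is hypothetical, not Spark's implementation):

```python
def cast_to_float_or_none(s: str):
    """Mimic an ANSI-style cast: trim only whitespace, then parse
    strictly; return None (SQL NULL) when parsing fails."""
    try:
        return float(s.strip())
    except ValueError:
        return None

print(cast_to_float_or_none(" 1.5 "))  # 1.5: surrounding whitespace is fine
print(cast_to_float_or_none("1中文"))   # None: CJK characters must not be trimmed away
```

The bug was that the trim step treated certain non-ASCII characters as trimmable, so the trailing characters disappeared before parsing and "1中文" parsed as 1.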
[jira] [Updated] (SPARK-44024) Change to use `map` where `unzip` used to extract a single element
[ https://issues.apache.org/jira/browse/SPARK-44024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-44024: - Summary: Change to use `map` where `unzip` used to extract a single element (was: Change to use map where unzip used to extract a single element ) > Change to use `map` where `unzip` used to extract a single element > --- > > Key: SPARK-44024 > URL: https://issues.apache.org/jira/browse/SPARK-44024 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor >
[jira] [Updated] (SPARK-44024) Change to use `map` where `unzip` used to extract a single element
[ https://issues.apache.org/jira/browse/SPARK-44024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-44024: - Description: For example: Seq((1, 11), (2, 22)).unzip._1 should change to Seq((1, 11), (2, 22)).map(_._1) > Change to use `map` where `unzip` used to extract a single element > --- > > Key: SPARK-44024 > URL: https://issues.apache.org/jira/browse/SPARK-44024 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > > For example: > > Seq((1, 11), (2, 22)).unzip._1 > > should change to > > Seq((1, 11), (2, 22)).map(_._1)
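The point of the refactoring is that `unzip` materializes both halves of the pair sequence when only one is needed. The ticket itself is about Scala's `Seq.unzip`; a hedged Python analogue of the same before/after makes the difference concrete:

```python
pairs = [(1, 11), (2, 22)]

# Analogue of Seq((1, 11), (2, 22)).unzip._1:
# zip(*pairs) builds BOTH columns, then one is thrown away.
firsts_via_unzip = list(zip(*pairs))[0]

# Analogue of Seq((1, 11), (2, 22)).map(_._1):
# a single pass that extracts only the element that is needed.
firsts_via_map = [first for first, _ in pairs]

print(list(firsts_via_unzip))  # [1, 2]
print(firsts_via_map)          # [1, 2]
```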
[jira] [Created] (SPARK-44024) Change to use map where unzip used to extract a single element
Yang Jie created SPARK-44024: Summary: Change to use map where unzip used to extract a single element Key: SPARK-44024 URL: https://issues.apache.org/jira/browse/SPARK-44024 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Yang Jie
[jira] [Created] (SPARK-44023) Add System.gc at beforeEach in PruneFileSourcePartitionsSuite
Yang Jie created SPARK-44023: Summary: Add System.gc at beforeEach in PruneFileSourcePartitionsSuite Key: SPARK-44023 URL: https://issues.apache.org/jira/browse/SPARK-44023 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.5.0 Reporter: Yang Jie
[jira] [Updated] (SPARK-44022) Enforce Java max bytecode version to maven dependencies
[ https://issues.apache.org/jira/browse/SPARK-44022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bowen Liang updated SPARK-44022: Description: Enforce Java's max bytecode version on Maven dependencies by using the `enforceBytecodeVersion` enforcer rule, preventing the introduction of dependencies that require Java 11 or higher, including transitive dependencies. (was: Enforce Java's max bytecode version on Maven dependencies by using the `enforceBytecodeVersion` enforcer rule, preventing the introduction of dependencies that require Java 11 or higher.) > Enforce Java max bytecode version to maven dependencies > --- > > Key: SPARK-44022 > URL: https://issues.apache.org/jira/browse/SPARK-44022 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Bowen Liang >Priority: Major > > Enforce Java's max bytecode version on Maven dependencies by using the > `enforceBytecodeVersion` enforcer rule, preventing the introduction of > dependencies that require Java 11 or higher, including transitive dependencies.
[jira] [Updated] (SPARK-44022) Enforce Java max bytecode version to maven dependencies
[ https://issues.apache.org/jira/browse/SPARK-44022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bowen Liang updated SPARK-44022: Description: Enforce Java's max bytecode version on Maven dependencies by using the `enforceBytecodeVersion` enforcer rule. This prevents introducing dependencies that require Java 11 or higher, including transitive dependencies. was: Enforce Java's max bytecode version on Maven dependencies by using the `enforceBytecodeVersion` enforcer rule. This prevents introducing dependencies that require Java 11 or higher, including transitive dependencies. > Enforce Java max bytecode version to maven dependencies > --- > > Key: SPARK-44022 > URL: https://issues.apache.org/jira/browse/SPARK-44022 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Bowen Liang >Priority: Major > > Enforce Java's max bytecode version on Maven dependencies by using the > `enforceBytecodeVersion` enforcer rule. > This prevents introducing dependencies that require Java 11 or higher, > including transitive dependencies.
[jira] [Created] (SPARK-44022) Enforce Java max bytecode version to maven dependencies
Bowen Liang created SPARK-44022: --- Summary: Enforce Java max bytecode version to maven dependencies Key: SPARK-44022 URL: https://issues.apache.org/jira/browse/SPARK-44022 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: Bowen Liang Enforce Java's max bytecode version on Maven dependencies by using the `enforceBytecodeVersion` enforcer rule, preventing the introduction of dependencies that require Java 11 or higher.
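For context, a minimal sketch of what such an enforcer configuration typically looks like in a `pom.xml`, assuming the `enforceBytecodeVersion` rule from the Mojohaus `extra-enforcer-rules` add-on (version numbers here are illustrative, not taken from Spark's actual build):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>3.3.0</version>
  <dependencies>
    <!-- enforceBytecodeVersion ships in the extra-enforcer-rules add-on -->
    <dependency>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>extra-enforcer-rules</artifactId>
      <version>1.6.1</version>
    </dependency>
  </dependencies>
  <executions>
    <execution>
      <id>enforce-bytecode-version</id>
      <goals><goal>enforce</goal></goals>
      <configuration>
        <rules>
          <enforceBytecodeVersion>
            <!-- Fail the build if any dependency, including transitive
                 ones, ships bytecode newer than Java 8 -->
            <maxJdkVersion>1.8</maxJdkVersion>
          </enforceBytecodeVersion>
        </rules>
        <fail>true</fail>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The rule scans the class files of every resolved dependency, which is what catches a library that silently raised its bytecode target in a minor release.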
[jira] [Assigned] (SPARK-43617) Enable pyspark.pandas.spark.functions.product in Spark Connect.
[ https://issues.apache.org/jira/browse/SPARK-43617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43617: - Assignee: Ruifeng Zheng > Enable pyspark.pandas.spark.functions.product in Spark Connect. > --- > > Key: SPARK-43617 > URL: https://issues.apache.org/jira/browse/SPARK-43617 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Ruifeng Zheng >Priority: Major > > Enable pyspark.pandas.spark.functions.product in Spark Connect.
[jira] [Assigned] (SPARK-43938) Add to_* functions to Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-43938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43938: - Assignee: BingKun Pan > Add to_* functions to Scala and Python > -- > > Key: SPARK-43938 > URL: https://issues.apache.org/jira/browse/SPARK-43938 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: BingKun Pan >Priority: Major > > Add following functions: > * str_to_map > * to_binary > * to_char > * to_number > * to_timestamp_ltz > * to_timestamp_ntz > * to_unix_timestamp > to: > * Scala API > * Python API > * Spark Connect Scala Client > * Spark Connect Python Client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43938) Add to_* functions to Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-43938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43938. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41505 [https://github.com/apache/spark/pull/41505] > Add to_* functions to Scala and Python > -- > > Key: SPARK-43938 > URL: https://issues.apache.org/jira/browse/SPARK-43938 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: BingKun Pan >Priority: Major > Fix For: 3.5.0 > > > Add following functions: > * str_to_map > * to_binary > * to_char > * to_number > * to_timestamp_ltz > * to_timestamp_ntz > * to_unix_timestamp > to: > * Scala API > * Python API > * Spark Connect Scala Client > * Spark Connect Python Client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
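Of the functions listed in SPARK-43938, `str_to_map` has the least obvious semantics: it splits a string into key/value pairs using a pair delimiter (default `,`) and a key/value delimiter (default `:`). A hedged pure-Python sketch of that behavior (the function name mirrors the SQL built-in; this is not Spark's implementation, and edge cases such as duplicate keys may differ):

```python
def str_to_map(text: str, pair_delim: str = ",", kv_delim: str = ":") -> dict:
    """Split `text` into pairs on `pair_delim`, then each pair into
    key/value on the first `kv_delim`; a pair without the delimiter
    maps to None (SQL NULL)."""
    result = {}
    for pair in text.split(pair_delim):
        key, sep, value = pair.partition(kv_delim)
        result[key] = value if sep else None
    return result

print(str_to_map("a:1,b:2,c"))  # {'a': '1', 'b': '2', 'c': None}
```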
[jira] [Created] (SPARK-44021) Add a config to avoid generating too many partitions
Yuming Wang created SPARK-44021: --- Summary: Add a config to avoid generating too many partitions Key: SPARK-44021 URL: https://issues.apache.org/jira/browse/SPARK-44021 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Yuming Wang
[jira] [Resolved] (SPARK-43179) Add option for applications to control saving of metadata in the External Shuffle Service LevelDB
[ https://issues.apache.org/jira/browse/SPARK-43179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-43179. -- Resolution: Fixed Issue resolved by pull request 41502 [https://github.com/apache/spark/pull/41502] > Add option for applications to control saving of metadata in the External > Shuffle Service LevelDB > - > > Key: SPARK-43179 > URL: https://issues.apache.org/jira/browse/SPARK-43179 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.4.0 >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Fix For: 3.5.0 > > > Currently, the External Shuffle Service stores application metadata in > LevelDB. This is necessary to enable the shuffle server to resume serving > shuffle data for an application whose executors registered before the > NodeManager restarts. However, the metadata includes the application secret, > which is stored in LevelDB without encryption. This is a potential security > risk, particularly for applications with high security requirements. While > filesystem access control lists (ACLs) can help protect keys and > certificates, they may not be sufficient for some use cases. In response, we > have decided not to store metadata for these high-security applications in > LevelDB. As a result, these applications may experience more failures in the > event of a node restart, but we believe this trade-off is acceptable given > the increased security risk. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43942) Add string functions to Scala and Python - part 1
[ https://issues.apache.org/jira/browse/SPARK-43942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731340#comment-17731340 ] BingKun Pan commented on SPARK-43942: - I'm working on it. > Add string functions to Scala and Python - part 1 > - > > Key: SPARK-43942 > URL: https://issues.apache.org/jira/browse/SPARK-43942 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major > > Add the following functions: > * char > * btrim > * char_length > * character_length > * chr > * contains > * elt > * find_in_set > * like > * ilike > * lcase > * ucase > * len > * left > * right > to: > * Scala API > * Python API > * Spark Connect Scala Client > * Spark Connect Python Client
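Two of the functions listed above have semantics worth pinning down: `elt(n, ...)` returns the n-th argument (1-based), and `find_in_set(s, csv)` returns the 1-based position of `s` in a comma-separated list, or 0 if absent. A hedged pure-Python sketch of those semantics (not Spark's implementation; the SQL versions have additional NULL-handling edge cases, e.g. `find_in_set` when `s` itself contains a comma):

```python
def elt(n: int, *args):
    """Return the n-th argument, 1-based; None if n is out of range."""
    return args[n - 1] if 1 <= n <= len(args) else None

def find_in_set(s: str, str_list: str) -> int:
    """1-based index of s in a comma-separated list, 0 if not found."""
    items = str_list.split(",")
    return items.index(s) + 1 if s in items else 0

print(elt(2, "scala", "java"))          # java
print(find_in_set("ab", "abc,b,ab,c"))  # 3
print(find_in_set("zz", "abc,b,ab,c"))  # 0
```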
[jira] [Commented] (SPARK-42290) Spark Driver hangs on OOM during Broadcast when AQE is enabled
[ https://issues.apache.org/jira/browse/SPARK-42290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731331#comment-17731331 ] Jia Fan commented on SPARK-42290: - Thanks [~dongjoon] > Spark Driver hangs on OOM during Broadcast when AQE is enabled > --- > > Key: SPARK-42290 > URL: https://issues.apache.org/jira/browse/SPARK-42290 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Shardul Mahadik >Assignee: Jia Fan >Priority: Critical > Fix For: 3.4.1, 3.5.0 > > > Repro steps: > {code} > $ spark-shell --conf spark.driver.memory=1g > val df = spark.range(500).withColumn("str", > lit("abcdabcdabcdabcdabasgasdfsadfasdfasdfasfasfsadfasdfsadfasdf")) > val df2 = spark.range(10).join(broadcast(df), Seq("id"), "left_outer") > df2.collect > {code} > This will cause the driver to hang indefinitely. Heres a thread dump of the > {{main}} thread when its stuck > {code} > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:285) > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$$Lambda$2819/629294880.apply(Unknown > Source) > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:809) > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:236) > => holding Monitor(java.lang.Object@1932537396}) > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:381) > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:354) > 
org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4179) > org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:3420) > org.apache.spark.sql.Dataset$$Lambda$2390/1803372144.apply(Unknown Source) > org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4169) > org.apache.spark.sql.Dataset$$Lambda$2791/1357377136.apply(Unknown Source) > org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526) > org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4167) > org.apache.spark.sql.Dataset$$Lambda$2391/1172042998.apply(Unknown Source) > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118) > org.apache.spark.sql.execution.SQLExecution$$$Lambda$2402/721269425.apply(Unknown > Source) > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195) > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103) > org.apache.spark.sql.execution.SQLExecution$$$Lambda$2392/11632488.apply(Unknown > Source) > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:809) > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) > org.apache.spark.sql.Dataset.withAction(Dataset.scala:4167) > org.apache.spark.sql.Dataset.collect(Dataset.scala:3420) > {code} > When we disable AQE though we get the following exception instead of driver > hang. > {code} > Caused by: org.apache.spark.SparkException: Not enough memory to build and > broadcast the table to all worker nodes. As a workaround, you can either > disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or > increase the spark driver memory by setting spark.driver.memory to a higher > value. > ... 
7 more > Caused by: java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.grow(HashedRelation.scala:834) > at > org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:777) > at > org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:1086) > at > org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:157) > at > org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:1163) > at > org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:1151) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:148) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$Lambda$2999/145945436.apply(Unknown > Source) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCap
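The hang described above follows a general pattern: the main thread blocks on `LinkedBlockingQueue.take()` while the producer thread dies from an OutOfMemoryError before ever enqueuing its event, so nothing wakes the consumer. A hedged Python sketch of the pattern and the usual mitigation (a bounded wait, or propagating the producer's failure through the queue); all names here are mine, not Spark's:

```python
import queue
import threading

events: "queue.Queue[object]" = queue.Queue()

def producer() -> None:
    # Simulates the broadcast thread dying before posting its event.
    raise MemoryError("simulated OOM during broadcast")

t = threading.Thread(target=producer, daemon=True)
t.start()
t.join()

# events.get() with no timeout would now block forever -- the hang in the
# ticket. A bounded wait surfaces the failure instead of hanging:
try:
    events.get(timeout=0.2)
    outcome = "got event"
except queue.Empty:
    outcome = "timed out waiting for producer"

print(outcome)  # timed out waiting for producer
```

The actual fix in Spark was to make the failure reach the waiting consumer rather than rely on a timeout, but the sketch shows why an unbounded `take()` paired with a producer that can die silently is a deadlock waiting to happen.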
[jira] [Commented] (SPARK-43926) Add array_agg, array_size, cardinality, count_min_sketch, mask, named_struct, json_* to Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-43926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731326#comment-17731326 ] Tengfei Huang commented on SPARK-43926: --- I am working on this and will send a PR soon. > Add array_agg, array_size, cardinality, > count_min_sketch, mask, named_struct, json_* to Scala and Python > - > > Key: SPARK-43926 > URL: https://issues.apache.org/jira/browse/SPARK-43926 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major > > Add the following functions: > * array_agg > * array_size > * cardinality > * count_min_sketch > * named_struct > * json_array_length > * json_object_keys > * mask > to: > * Scala API > * Python API > * Spark Connect Scala Client > * Spark Connect Python Client
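Among the functions listed, `count_min_sketch` is the most involved: a Count-Min sketch is a small 2-D array of counters that answers approximate frequency queries with one-sided error (estimates can overcount, never undercount). A hedged pure-Python sketch of the data structure itself, not Spark's implementation (Spark's version also has configurable eps/confidence and a serialized binary format):

```python
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 256, depth: int = 4) -> None:
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item: str, row: int) -> int:
        # One hash function per row, derived by salting with the row id.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += count

    def estimate(self, item: str) -> int:
        # Min over rows: collisions only inflate counters, never deflate.
        return min(self.table[row][self._bucket(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for word in ["spark", "spark", "sql", "spark"]:
    cms.add(word)
print(cms.estimate("spark"))  # at least 3; overcounts only on hash collisions
```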
[jira] [Commented] (SPARK-43438) Fix mismatched column list error on INSERT
[ https://issues.apache.org/jira/browse/SPARK-43438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731321#comment-17731321 ] Max Gekk commented on SPARK-43438: -- [~erico] Would you like to work on this issue? It is related to your resolved ticket SPARK-43387 > Fix mismatched column list error on INSERT > -- > > Key: SPARK-43438 > URL: https://issues.apache.org/jira/browse/SPARK-43438 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > This error message is pretty bad, and common > "_LEGACY_ERROR_TEMP_1038" : { > "message" : [ > "Cannot write to table due to mismatched user specified column > size() and data column size()." > ] > }, > It can perhaps be merged with this one - after giving it an ERROR_CLASS > "_LEGACY_ERROR_TEMP_1168" : { > "message" : [ > " requires that the data to be inserted have the same number of > columns as the target table: target table has column(s) but > the inserted data has column(s), including > partition column(s) having constant value(s)." > ] > }, > Repro: > CREATE TABLE tabtest(c1 INT, c2 INT); > INSERT INTO tabtest SELECT 1; > `spark_catalog`.`default`.`tabtest` requires that the data to be inserted > have the same number of columns as the target table: target table has 2 > column(s) but the inserted data has 1 column(s), including 0 partition > column(s) having constant value(s). > INSERT INTO tabtest(c1) SELECT 1, 2, 3; > Cannot write to table due to mismatched user specified column size(1) and > data column size(3).; line 1 pos 24 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43438) Fix mismatched column list error on INSERT
[ https://issues.apache.org/jira/browse/SPARK-43438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-43438: - Parent: SPARK-37935 Issue Type: Sub-task (was: Improvement) > Fix mismatched column list error on INSERT > -- > > Key: SPARK-43438 > URL: https://issues.apache.org/jira/browse/SPARK-43438 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > This error message is pretty bad, and common > "_LEGACY_ERROR_TEMP_1038" : { > "message" : [ > "Cannot write to table due to mismatched user specified column > size() and data column size()." > ] > }, > It can perhaps be merged with this one - after giving it an ERROR_CLASS > "_LEGACY_ERROR_TEMP_1168" : { > "message" : [ > " requires that the data to be inserted have the same number of > columns as the target table: target table has column(s) but > the inserted data has column(s), including > partition column(s) having constant value(s)." > ] > }, > Repro: > CREATE TABLE tabtest(c1 INT, c2 INT); > INSERT INTO tabtest SELECT 1; > `spark_catalog`.`default`.`tabtest` requires that the data to be inserted > have the same number of columns as the target table: target table has 2 > column(s) but the inserted data has 1 column(s), including 0 partition > column(s) having constant value(s). > INSERT INTO tabtest(c1) SELECT 1, 2, 3; > Cannot write to table due to mismatched user specified column size(1) and > data column size(3).; line 1 pos 24 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42298) Assign name to _LEGACY_ERROR_TEMP_2132
[ https://issues.apache.org/jira/browse/SPARK-42298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731320#comment-17731320 ] ASF GitHub Bot commented on SPARK-42298: User 'Hisoka-X' has created a pull request for this issue: https://github.com/apache/spark/pull/40632 > Assign name to _LEGACY_ERROR_TEMP_2132 > -- > > Key: SPARK-42298 > URL: https://issues.apache.org/jira/browse/SPARK-42298 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major >
[jira] [Commented] (SPARK-40637) Spark-shell can correctly encode BINARY type but Spark-sql cannot
[ https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731318#comment-17731318 ] ASF GitHub Bot commented on SPARK-40637: User 'Hisoka-X' has created a pull request for this issue: https://github.com/apache/spark/pull/41531
> Spark-shell can correctly encode BINARY type but Spark-sql cannot
> -
>
> Key: SPARK-40637
> URL: https://issues.apache.org/jira/browse/SPARK-40637
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.1, 3.4.0
> Reporter: xsys
> Priority: Minor
> Attachments: image-2022-10-18-12-15-05-576.png
>
> h3. Describe the bug
> When we store a BINARY value (e.g. {{BigInt("1").toByteArray}} / {{X'01'}}) via either {{spark-shell}} or {{spark-sql}} and then read it from {{spark-shell}}, it outputs {{[01]}}. However, it does not encode correctly when querying it via {{spark-sql}}.
> i.e.,
> Insert via spark-shell, read via spark-shell: displays correctly
> Insert via spark-shell, read via spark-sql: does not display correctly
> Insert via spark-sql, read via spark-sql: does not display correctly
> Insert via spark-sql, read via spark-shell: displays correctly
> h3. 
To Reproduce
> On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-shell}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> scala> import org.apache.spark.sql.Row
> scala> import org.apache.spark.sql.types._
> scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[356] at parallelize at <console>:28
> scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
> schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,BinaryType,true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: binary]
> scala> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals_shell")
> scala> spark.sql("select * from binary_vals_shell;").show(false)
> +----+
> |c1  |
> +----+
> |[01]|
> +----+{code}
> Then using {{spark-sql}} to (1) query what was inserted via spark-shell into the binary_vals_shell table, and then (2) insert the value via spark-sql into the binary_vals_sql table (we use tee to redirect the log to a file):
> {code:java}
> $SPARK_HOME/bin/spark-sql | tee sql.log{code}
> Execute the following; we only get an empty output in the terminal (but a garbage character in the log file):
> {code:java}
> spark-sql> select * from binary_vals_shell; -- query what is inserted via spark-shell;
> spark-sql> create table binary_vals_sql(c1 BINARY) stored as ORC;
> spark-sql> insert into binary_vals_sql select X'01'; -- try to insert directly in spark-sql;
> spark-sql> select * from binary_vals_sql;
> Time taken: 0.077 seconds, Fetched 1 row(s)
> {code}
> From the log file, we find it shows as a garbage character. (We never encountered this garbage character in logs of other data types.)
> h3. !image-2022-10-18-12-15-05-576.png! 
> We then return to spark-shell again and run the following:
> {code:java}
> scala> spark.sql("select * from binary_vals_sql;").show(false)
> +----+
> |c1  |
> +----+
> |[01]|
> +----+{code}
> The binary value does not display correctly via spark-sql, but it still displays correctly via spark-shell.
> h3. Expected behavior
> We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to behave consistently for the same data type ({{BINARY}}) & input ({{BigInt("1").toByteArray}} / {{X'01'}}) combination.
>
> h3. Additional context
> We also tried Avro and Parquet and encountered the same issue. We believe this is format-independent.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
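The behavior described in the report is consistent with the shell rendering binary cells as hex while the SQL CLI emits the raw bytes, so byte 0x01 reaches the terminal as an unprintable control character. A plain-Python sketch of that difference (an illustration under that assumption, not Spark's actual code):

```python
def show_binary(value: bytes) -> str:
    # Roughly how spark-shell's show() renders a BinaryType cell:
    # each byte as two uppercase hex digits, e.g. b'\x01' -> '[01]'
    return "[" + " ".join(f"{b:02X}" for b in value) + "]"

def cli_print(value: bytes) -> str:
    # Per this report, the spark-sql CLI effectively emits the raw bytes,
    # so 0x01 appears as an unprintable control character (the "garbage
    # character" seen in sql.log) rather than a hex rendering
    return value.decode("utf-8", errors="replace")

v = (1).to_bytes(1, "big")  # analogous to BigInt("1").toByteArray / X'01'
print(show_binary(v))               # [01]
print(cli_print(v).isprintable())   # False
```

This is why the value round-trips fine at the storage layer (spark-shell reads back `[01]` regardless of which interface wrote it): only the CLI's output formatting differs, not the stored bytes.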
[jira] [Resolved] (SPARK-43772) Move version configuration in `connect` module to parent
[ https://issues.apache.org/jira/browse/SPARK-43772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-43772. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41295 [https://github.com/apache/spark/pull/41295] > Move version configuration in `connect` module to parent > > > Key: SPARK-43772 > URL: https://issues.apache.org/jira/browse/SPARK-43772 > Project: Spark > Issue Type: Improvement > Components: Build, Connect >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > > In the pom files of the submodules, there are some common version properties, e.g.: > * guava.version > * guava.failureaccess.version > that need to be moved to the parent pom for better management -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
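The change the ticket describes follows standard Maven property inheritance: declare the shared versions once in the parent `<properties>` and let submodules reference them. A hedged sketch of the pattern; the property names come from the ticket, but the version values and dependency coordinates shown are illustrative, not taken from the actual PR:

```xml
<!-- Parent pom.xml: declare shared version properties once so submodules
     (e.g. connect) inherit them instead of redefining them locally.
     The values below are illustrative placeholders. -->
<properties>
  <guava.version>31.0.1-jre</guava.version>
  <guava.failureaccess.version>1.0.1</guava.failureaccess.version>
</properties>

<!-- Submodule pom.xml: reference the inherited property only. -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>${guava.version}</version>
</dependency>
```

Centralizing the properties this way means a version bump touches one file instead of every submodule pom, which is the "better management" the ticket asks for.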
[jira] [Assigned] (SPARK-43772) Move version configuration in `connect` module to parent
[ https://issues.apache.org/jira/browse/SPARK-43772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-43772: Assignee: BingKun Pan > Move version configuration in `connect` module to parent > > > Key: SPARK-43772 > URL: https://issues.apache.org/jira/browse/SPARK-43772 > Project: Spark > Issue Type: Improvement > Components: Build, Connect >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > > In the pom files of the submodules, there are some common version properties, e.g.: > * guava.version > * guava.failureaccess.version > that need to be moved to the parent pom for better management -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org