[jira] [Commented] (SPARK-44132) nesting full outer joins confuses code generator
[ https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736659#comment-17736659 ] Snoot.io commented on SPARK-44132:
--
User 'steven-aerts' has created a pull request for this issue: https://github.com/apache/spark/pull/41712

> nesting full outer joins confuses code generator
> ------------------------------------------------
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from Spark 3.3 until Spark 3.5.
> Reporter: Steven Aerts
> Priority: Major
>
> We are seeing issues with the code generator when querying Java-bean-encoded data with two nested joins:
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer");
> {code}
> will generate invalid code in the code generator. Depending on the data used, it can produce stack traces like:
> {code:java}
> Caused by: java.lang.NullPointerException
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
> Caused by: java.lang.AssertionError: index (2) should < 2
> at org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
> at org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code, we see that the code generator seems to be mixing up parameters. For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) { //< null check for wrong/left parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes NPE on right parameter here
> {code}
> It is as if nesting two full outer joins confuses the code generator, which then generates invalid code.
> There is one other strange thing. We found this issue when using datasets encoded with the Java bean encoder. We tried to reproduce it in the spark shell or with Scala case classes, but were unable to do so.
> We made a reproduction scenario as unit tests (one for each of the stack traces above) on the Spark code base and attached it to this case as a [pull request|https://github.com/apache/spark/pull/41688].

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
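The join shape in the report can be modeled in plain Java with ordinary collections (this is a sketch of full-outer-join semantics only, not Spark code; the class and method names are invented for illustration). Each full outer join must null-pad whichever side is missing an id, and the second, nested join must respect the nulls already produced by the first. It is exactly this null bookkeeping that the generated smj_consumeFullOuterJoinRow code gets wrong.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java model of dsA.join(dsB, seq("id"), "full_outer")
//                         .join(dsC, seq("id"), "full_outer").
// Each dataset maps id -> value column; a joined row is a List whose
// slots are null when that side had no matching id.
public class FullOuterJoinModel {

    // Full outer join: ids from either side survive; a missing left side
    // contributes leftWidth nulls, a missing right side contributes one null.
    public static Map<Integer, List<String>> fullOuter(
            Map<Integer, List<String>> left, int leftWidth, Map<Integer, String> right) {
        Map<Integer, List<String>> out = new TreeMap<>();
        Set<Integer> ids = new TreeSet<>(left.keySet());
        ids.addAll(right.keySet());
        for (Integer id : ids) {
            List<String> row = new ArrayList<>();
            List<String> l = left.get(id);
            if (l != null) {
                row.addAll(l);
            } else {
                for (int i = 0; i < leftWidth; i++) row.add(null); // null-pad missing left side
            }
            row.add(right.get(id)); // null when id is absent on the right
            out.put(id, row);
        }
        return out;
    }

    // Lift a single-column dataset into the joined-row representation.
    public static Map<Integer, List<String>> wrap(Map<Integer, String> ds) {
        Map<Integer, List<String>> out = new TreeMap<>();
        for (Map.Entry<Integer, String> e : ds.entrySet()) {
            List<String> row = new ArrayList<>();
            row.add(e.getValue());
            out.put(e.getKey(), row);
        }
        return out;
    }

    // The nested-join scenario: (A full-outer B) full-outer C.
    public static Map<Integer, List<String>> demo() {
        Map<Integer, String> dsA = new LinkedHashMap<>();
        dsA.put(1, "a1"); dsA.put(2, "a2");
        Map<Integer, String> dsB = new LinkedHashMap<>();
        dsB.put(2, "b2"); dsB.put(3, "b3");
        Map<Integer, String> dsC = new LinkedHashMap<>();
        dsC.put(3, "c3"); dsC.put(4, "c4");
        Map<Integer, List<String>> ab = fullOuter(wrap(dsA), 1, dsB);
        return fullOuter(ab, 2, dsC); // the outer join must respect ab's nulls
    }

    public static void main(String[] args) {
        demo().forEach((id, row) -> System.out.println(id + " -> " + row));
    }
}
```

In this model, id 3 comes out as [null, b3, c3]: its left slot is genuinely null, so correct generated code must null-check the left row before calling isNullAt on it, whereas the snippet quoted in the report checks one side and then dereferences the other.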
[jira] [Created] (SPARK-44164) Extract toAttribute method from StructField to Util class
Rui Wang created SPARK-44164: Summary: Extract toAttribute method from StructField to Util class Key: SPARK-44164 URL: https://issues.apache.org/jira/browse/SPARK-44164 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.5.0 Reporter: Rui Wang Assignee: Rui Wang
[jira] [Updated] (SPARK-43974) Upgrade buf to v1.22.0
[ https://issues.apache.org/jira/browse/SPARK-43974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-43974: Summary: Upgrade buf to v1.22.0 (was: Upgrade buf to v1.21.0) > Upgrade buf to v1.22.0 > -- > > Key: SPARK-43974 > URL: https://issues.apache.org/jira/browse/SPARK-43974 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor >
[jira] [Resolved] (SPARK-44163) Handle ModuleNotFoundError like ImportError
[ https://issues.apache.org/jira/browse/SPARK-44163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44163. --- Resolution: Invalid > Handle ModuleNotFoundError like ImportError > --- > > Key: SPARK-44163 > URL: https://issues.apache.org/jira/browse/SPARK-44163 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.2, 3.4.1 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Closed] (SPARK-44163) Handle ModuleNotFoundError like ImportError
[ https://issues.apache.org/jira/browse/SPARK-44163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-44163. - > Handle ModuleNotFoundError like ImportError > --- > > Key: SPARK-44163 > URL: https://issues.apache.org/jira/browse/SPARK-44163 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.2, 3.4.1 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Created] (SPARK-44163) Handle ModuleNotFoundError like ImportError
Dongjoon Hyun created SPARK-44163: - Summary: Handle ModuleNotFoundError like ImportError Key: SPARK-44163 URL: https://issues.apache.org/jira/browse/SPARK-44163 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.4.1, 3.3.2 Reporter: Dongjoon Hyun
[jira] [Created] (SPARK-44162) Support G1GC in `spark.eventLog.gcMetrics.*` without warning
Dongjoon Hyun created SPARK-44162: - Summary: Support G1GC in `spark.eventLog.gcMetrics.*` without warning Key: SPARK-44162 URL: https://issues.apache.org/jira/browse/SPARK-44162 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.5.0 Reporter: Dongjoon Hyun
{code}
23/06/23 14:26:53 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
{code}
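Until G1GC is supported without the warning, the workaround the message points at is to list the G1 concurrent collector explicitly. A sketch of a `spark-defaults.conf` fragment follows; the default collector lists shown here are taken from the Spark configuration docs and should be double-checked against your Spark version:

```properties
# Workaround sketch: extend the GC-metrics collector lists so that
# "G1 Concurrent GC" is recognized. The first entries are the documented
# defaults (verify against your version); the last one is the addition.
spark.eventLog.gcMetrics.youngGenerationGarbageCollectors  Copy,PS Scavenge,ParNew,G1 Young Generation
spark.eventLog.gcMetrics.oldGenerationGarbageCollectors    MarkSweepCompact,PS MarkSweep,ConcurrentMarkSweep,G1 Old Generation,G1 Concurrent GC
```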
[jira] [Created] (SPARK-44161) Row as UDF inputs causes encoder errors
Zhen Li created SPARK-44161: --- Summary: Row as UDF inputs causes encoder errors Key: SPARK-44161 URL: https://issues.apache.org/jira/browse/SPARK-44161 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.5.0 Reporter: Zhen Li Ensure row inputs to UDFs can be handled correctly.
[jira] [Updated] (SPARK-44160) Extract shared code from StructType
[ https://issues.apache.org/jira/browse/SPARK-44160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-44160: - Description: StructType has some methods that require CatalystParser and Catalyst expressions. We are not planning to move the parser and expressions to the shared module, so we need to split the code to share as much code as possible between the Scala client and Catalyst. > Extract shared code from StructType > --- > > Key: SPARK-44160 > URL: https://issues.apache.org/jira/browse/SPARK-44160 > Project: Spark > Issue Type: Sub-task > Components: Connect, SQL >Affects Versions: 3.5.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > > StructType has some methods that require CatalystParser and Catalyst expressions. We are not planning to move the parser and expressions to the shared module, so we need to split the code to share as much code as possible between the Scala client and Catalyst.
[jira] [Created] (SPARK-44160) Extract shared code from StructType
Rui Wang created SPARK-44160: Summary: Extract shared code from StructType Key: SPARK-44160 URL: https://issues.apache.org/jira/browse/SPARK-44160 Project: Spark Issue Type: Sub-task Components: Connect, SQL Affects Versions: 3.5.0 Reporter: Rui Wang Assignee: Rui Wang
[jira] [Assigned] (SPARK-44158) Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`
[ https://issues.apache.org/jira/browse/SPARK-44158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44158: - Assignee: Dongjoon Hyun > Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts` > --- > > Key: SPARK-44158 > URL: https://issues.apache.org/jira/browse/SPARK-44158 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.3, 3.4.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor >
[jira] [Resolved] (SPARK-44158) Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`
[ https://issues.apache.org/jira/browse/SPARK-44158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44158. --- Fix Version/s: 3.3.3 3.5.0 3.4.2 Resolution: Fixed Issue resolved by pull request 41713 [https://github.com/apache/spark/pull/41713] > Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts` > --- > > Key: SPARK-44158 > URL: https://issues.apache.org/jira/browse/SPARK-44158 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.3, 3.4.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.3.3, 3.5.0, 3.4.2 > >
[jira] [Created] (SPARK-44159) Commands for writing (InsertIntoHadoopFsRelationCommand and InsertIntoHiveTable) should log what they are doing
Navin Kumar created SPARK-44159: --- Summary: Commands for writing (InsertIntoHadoopFsRelationCommand and InsertIntoHiveTable) should log what they are doing Key: SPARK-44159 URL: https://issues.apache.org/jira/browse/SPARK-44159 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Navin Kumar
Improvements from SPARK-41763 decoupled the execution of the create-table and data-writing commands in a CTAS (see SPARK-41713). This means that while the code is cleaner, with the v1 write implementation limited to InsertIntoHadoopFsRelationCommand and InsertIntoHiveTable, the execution of these operations is less visible than it was before. Previously, the write command was present in the physical plan (see explain output below):
{{== Physical Plan ==}}
{{CommandResult }}
{{+- Execute CreateHiveTableAsSelectCommand [Database: default, TableName: test_hive_text_table, InsertIntoHiveTable]}}
{{+- *(1) Scan ExistingRDD[...]}}
But in Spark 3.4.0, this output is:
{{== Physical Plan ==}}
{{CommandResult }}
{{+- Execute CreateHiveTableAsSelectCommand}}
{{+- CreateHiveTableAsSelectCommand [Database: default, TableName: test_hive_text_table]}}
{{+- Project [...]}}
{{+- SubqueryAlias hive_input_table}}
{{+- View (`hive_input_table`, [...])}}
{{+- LogicalRDD [...], false}}
and the write command is now missing. This makes sense, since execution is decoupled, but because there is no log output from InsertIntoHiveTable, there is no clear way to know that the command actually executed. I would propose that these commands add a log message at the INFO level indicating how many rows were written into which table, to make it easier for a user to tell from the Spark logs what has happened. Another option may be to update the explain output in Spark 3.4 to handle this, but that might be more difficult and make less sense since the operations are now decoupled.
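As a concrete illustration of the proposal, the INFO message could carry the command name, the row count, and the target table. This is a hypothetical sketch only; the class, method, and message format below are invented here and are not Spark's actual logging API:

```java
// Hypothetical shape of the proposed INFO-level write summary.
// The class name, method name, and message format are illustrative only.
public class WriteSummary {
    // Format the message a v1 write command could log after its job commits.
    public static String format(String command, String table, long rows) {
        return String.format("%s: wrote %d row(s) into %s", command, rows, table);
    }

    public static void main(String[] args) {
        System.out.println(format("InsertIntoHiveTable", "default.test_hive_text_table", 12500000L));
    }
}
```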
[jira] [Updated] (SPARK-44158) Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`
[ https://issues.apache.org/jira/browse/SPARK-44158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44158: -- Affects Version/s: 3.2.4 3.1.3 3.0.3 2.4.8 > Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts` > --- > > Key: SPARK-44158 > URL: https://issues.apache.org/jira/browse/SPARK-44158 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.3, 3.4.1 >Reporter: Dongjoon Hyun >Priority: Minor >
[jira] [Updated] (SPARK-44158) Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`
[ https://issues.apache.org/jira/browse/SPARK-44158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44158: -- Affects Version/s: 3.4.1 > Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts` > --- > > Key: SPARK-44158 > URL: https://issues.apache.org/jira/browse/SPARK-44158 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.3, 3.4.1 >Reporter: Dongjoon Hyun >Priority: Minor >
[jira] [Created] (SPARK-44158) Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`
Dongjoon Hyun created SPARK-44158: - Summary: Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts` Key: SPARK-44158 URL: https://issues.apache.org/jira/browse/SPARK-44158 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.3.3 Reporter: Dongjoon Hyun
[jira] [Assigned] (SPARK-44151) Upgrade commons-codec from 1.15 to 1.16.0
[ https://issues.apache.org/jira/browse/SPARK-44151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44151: - Assignee: BingKun Pan > Upgrade commons-codec from 1.15 to 1.16.0 > - > > Key: SPARK-44151 > URL: https://issues.apache.org/jira/browse/SPARK-44151 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor >
[jira] [Resolved] (SPARK-44151) Upgrade commons-codec from 1.15 to 1.16.0
[ https://issues.apache.org/jira/browse/SPARK-44151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44151. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41707 [https://github.com/apache/spark/pull/41707] > Upgrade commons-codec from 1.15 to 1.16.0 > - > > Key: SPARK-44151 > URL: https://issues.apache.org/jira/browse/SPARK-44151 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > >
[jira] [Updated] (SPARK-44156) SortAggregation slows down dropDuplicates()
[ https://issues.apache.org/jira/browse/SPARK-44156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emanuel Velzi updated SPARK-44156: -- Description:
TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate. How can we make Spark use HashAggregate instead of SortAggregate?
--
We have a Spark cluster running on Kubernetes with the following configuration:
* Spark v3.3.2
* Hadoop 3.3.4
* Java 17
We are running a simple job on a dataset (~6 GiB) with almost 600 columns, many of which contain null values. The job involves the following steps:
# Load data from S3.
# Apply dropDuplicates().
# Save the deduplicated data back to S3 using magic committers.
One of the columns is of type "map". When we run dropDuplicates() without specifying any parameters (i.e., using all columns), it throws an error:
{noformat}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column my_column is map>>;{noformat}
To overcome this issue, we used "dropDuplicates(id)", specifying an identifier column. However, the performance of this method was {*}much worse than expected{*}, taking around 30 minutes.
As an alternative approach, we tested converting the "map" column to JSON, applying dropDuplicates() without parameters, and then converting the column back to "map" format:
{code:java}
DataType t = ds.schema().apply("my_column").dataType();
ds = ds.withColumn("my_column", functions.to_json(ds.col("my_column")));
ds = ds.dropDuplicates();
ds = ds.withColumn("my_column", functions.from_json(ds.col("my_column"), t));
{code}
Surprisingly, this approach {*}significantly improved the performance{*}, reducing the execution time to 7 minutes. The only noticeable difference was in the execution plan. In the *slower* case, the execution plan involved {*}SortAggregate{*}, while in the *faster* case it involved {*}HashAggregate{*}.
{noformat}
== Physical Plan [slow case] ==
Execute InsertIntoHadoopFsRelationCommand (13)
+- AdaptiveSparkPlan (12)
   +- == Final Plan ==
      Coalesce (8)
      +- SortAggregate (7)
         +- Sort (6)
            +- ShuffleQueryStage (5), Statistics(sizeInBytes=141.3 GiB, rowCount=1.25E+7)
               +- Exchange (4)
                  +- SortAggregate (3)
                     +- Sort (2)
                        +- Scan parquet (1)
   +- == Initial Plan ==
      Coalesce (11)
      +- SortAggregate (10)
         +- Sort (9)
            +- Exchange (4)
               +- SortAggregate (3)
                  +- Sort (2)
                     +- Scan parquet (1)
{noformat}
{noformat}
== Physical Plan [fast case] ==
Execute InsertIntoHadoopFsRelationCommand (11)
+- AdaptiveSparkPlan (10)
   +- == Final Plan ==
      Coalesce (7)
      +- HashAggregate (6)
         +- ShuffleQueryStage (5), Statistics(sizeInBytes=81.6 GiB, rowCount=1.25E+7)
            +- Exchange (4)
               +- HashAggregate (3)
                  +- Project (2)
                     +- Scan parquet (1)
   +- == Initial Plan ==
      Coalesce (9)
      +- HashAggregate (8)
         +- Exchange (4)
            +- HashAggregate (3)
               +- Project (2)
                  +- Scan parquet (1)
{noformat}
Based on this observation, we concluded that the difference in performance is related to {*}SortAggregate vs. HashAggregate{*}. Is this line of thinking correct? How can we enforce the use of HashAggregate instead of SortAggregate, {*}even when using one column to deduplicate{*}?
*The final result is somewhat counterintuitive* because deduplicating using only one column should theoretically be faster, as it provides a simpler way to compare rows and determine duplicates.

was: TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate. How can we make Spark use HashAggregate instead of SortAggregate? -- We have a Spark cluster running on Kubernetes with the following configuration: * Spark v3.3.2 * Hadoop 3.3.4 * Java 17 We are running a simple job on a dataset (~6 GiB) with almost 600 columns, many of which contain null values. The job involves the following steps: # Load data from S3. # Apply dropDuplicates(). # Save the deduplicated data back to S3 using magic committers. One of the columns is of type "map". When we run dropDuplicates() without specifying any parameters (i.e., using all columns), it throws an error: {noformat} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column my_column is map>>;{noformat} To overcome this issue, we used "dropDuplicates(id)", specifying an identifier column. However, the performance of this method was {*}much worse than expected{*}, taking around 30
[jira] [Updated] (SPARK-44156) SortAggregation slows down dropDuplicates()
[ https://issues.apache.org/jira/browse/SPARK-44156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emanuel Velzi updated SPARK-44156: -- Description:
TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate. How can we make Spark use HashAggregate instead of SortAggregate?
--
We have a Spark cluster running on Kubernetes with the following configuration:
* Spark v3.3.2
* Hadoop 3.3.4
* Java 17
We are running a simple job on a dataset (~6 GiB) with almost 600 columns, many of which contain null values. The job involves the following steps:
# Load data from S3.
# Apply dropDuplicates().
# Save the deduplicated data back to S3 using magic committers.
One of the columns is of type "map". When we run dropDuplicates() without specifying any parameters (i.e., using all columns), it throws an error:
{noformat}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column my_column is map>>;{noformat}
To overcome this issue, we used "dropDuplicates(id)", specifying an identifier column. However, the performance of this method was {*}much worse than expected{*}, taking around 30 minutes.
As an alternative approach, we tested converting the "map" column to JSON, applying dropDuplicates() without parameters, and then converting the column back to "map" format:
{code:java}
DataType t = ds.schema().apply("my_column").dataType();
ds = ds.withColumn("my_column", functions.to_json(ds.col("my_column")));
ds = ds.dropDuplicates();
ds = ds.withColumn("my_column", functions.from_json(ds.col("my_column"), t));
{code}
Surprisingly, this approach {*}significantly improved the performance{*}, reducing the execution time to 7 minutes. The only noticeable difference was in the execution plan. In the *slower* case, the execution plan involved {*}SortAggregate{*}, while in the *faster* case it involved {*}HashAggregate{*}.
{noformat}
== Physical Plan [slow case] ==
Execute InsertIntoHadoopFsRelationCommand (13)
+- AdaptiveSparkPlan (12)
   +- == Final Plan ==
      Coalesce (8)
      +- SortAggregate (7)
         +- Sort (6)
            +- ShuffleQueryStage (5), Statistics(sizeInBytes=141.3 GiB, rowCount=1.25E+7)
               +- Exchange (4)
                  +- SortAggregate (3)
                     +- Sort (2)
                        +- Scan parquet (1)
   +- == Initial Plan ==
      Coalesce (11)
      +- SortAggregate (10)
         +- Sort (9)
            +- Exchange (4)
               +- SortAggregate (3)
                  +- Sort (2)
                     +- Scan parquet (1)
{noformat}
{noformat}
== Physical Plan [fast case] ==
Execute InsertIntoHadoopFsRelationCommand (11)
+- AdaptiveSparkPlan (10)
   +- == Final Plan ==
      Coalesce (7)
      +- HashAggregate (6)
         +- ShuffleQueryStage (5), Statistics(sizeInBytes=81.6 GiB, rowCount=1.25E+7)
            +- Exchange (4)
               +- HashAggregate (3)
                  +- Project (2)
                     +- Scan parquet (1)
   +- == Initial Plan ==
      Coalesce (9)
      +- HashAggregate (8)
         +- Exchange (4)
            +- HashAggregate (3)
               +- Project (2)
                  +- Scan parquet (1)
{noformat}
Based on this observation, we concluded that the difference in performance is related to {*}SortAggregate vs. HashAggregate{*}. Is this line of thinking correct? How can we enforce the use of HashAggregate instead of SortAggregate?
*The final result is somewhat counterintuitive* because deduplicating using only one column should theoretically be faster, as it provides a simpler way to compare rows and determine duplicates.

was: TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate. How can we make Spark use HashAggregate instead of SortAggregate? -- We have a Spark cluster running on Kubernetes with the following configuration: * Spark v3.3.2 * Hadoop 3.3.4 * Java 17 We are running a simple job on a dataset (~6 GiB) with almost 600 columns, many of which contain null values. The job involves the following steps: # Load data from S3. # Apply dropDuplicates(). # Save the deduplicated data back to S3 using magic committers. One of the columns is of type "map". When we run dropDuplicates() without specifying any parameters (i.e., using all columns), it throws an error: {noformat} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column my_column is map>>;{noformat} To overcome this issue, we used "dropDuplicates(id)", specifying an identifier column. However, the performance of this method was {*}much worse than expected{*}, taking around 30 minutes. As an alternative approach, we
[jira] [Updated] (SPARK-44156) SortAggregation slows down dropDuplicates()
[ https://issues.apache.org/jira/browse/SPARK-44156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emanuel Velzi updated SPARK-44156: -- Summary: SortAggregation slows down dropDuplicates() (was: SortAggregation slows down dropDuplicates().)

> SortAggregation slows down dropDuplicates()
> -------------------------------------------
>
> Key: SPARK-44156
> URL: https://issues.apache.org/jira/browse/SPARK-44156
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.2
> Reporter: Emanuel Velzi
> Priority: Major
>
> TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate.
> How can we make Spark use HashAggregate instead of SortAggregate?
> --
> We have a Spark cluster running on Kubernetes with the following configuration:
> * Spark v3.3.2
> * Hadoop 3.3.4
> * Java 17
> We are running a simple job on a dataset (~6 GiB) with almost 600 columns, many of which contain null values. The job involves the following steps:
> # Load data from S3.
> # Apply dropDuplicates().
> # Save the deduplicated data back to S3 using magic committers.
> One of the columns is of type "map". When we run dropDuplicates() without specifying any parameters (i.e., using all columns), it throws an error:
> {noformat}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column my_column is map>>;{noformat}
> To overcome this issue, we used "dropDuplicates(id)", specifying an identifier column.
> However, the performance of this method was {*}much worse than expected{*}, taking around 30 minutes.
> As an alternative approach, we tested converting the "map" column to JSON, applying dropDuplicates() without parameters, and then converting the column back to "map" format:
> {code:java}
> DataType t = ds.schema().apply("my_column").dataType();
> ds = ds.withColumn("my_column", functions.to_json(ds.col("my_column")));
> ds = ds.dropDuplicates();
> ds = ds.withColumn("my_column", functions.from_json(ds.col("my_column"), t));
> {code}
> Surprisingly, this approach {*}significantly improved the performance{*}, reducing the execution time to 7 minutes.
> The only noticeable difference was in the execution plan. In the *slower* case, the execution plan involved {*}SortAggregate{*}, while in the *faster* case it involved {*}HashAggregate{*}.
> {noformat}
> == Physical Plan [slow case] ==
> Execute InsertIntoHadoopFsRelationCommand (13)
> +- AdaptiveSparkPlan (12)
>    +- == Final Plan ==
>       Coalesce (8)
>       +- SortAggregate (7)
>          +- Sort (6)
>             +- ShuffleQueryStage (5), Statistics(sizeInBytes=141.3 GiB, rowCount=1.25E+7)
>                +- Exchange (4)
>                   +- SortAggregate (3)
>                      +- Sort (2)
>                         +- Scan parquet (1)
>    +- == Initial Plan ==
>       Coalesce (11)
>       +- SortAggregate (10)
>          +- Sort (9)
>             +- Exchange (4)
>                +- SortAggregate (3)
>                   +- Sort (2)
>                      +- Scan parquet (1)
> {noformat}
> {noformat}
> == Physical Plan [fast case] ==
> Execute InsertIntoHadoopFsRelationCommand (11)
> +- AdaptiveSparkPlan (10)
>    +- == Final Plan ==
>       Coalesce (7)
>       +- HashAggregate (6)
>          +- ShuffleQueryStage (5), Statistics(sizeInBytes=81.6 GiB, rowCount=1.25E+7)
>             +- Exchange (4)
>                +- HashAggregate (3)
>                   +- Project (2)
>                      +- Scan parquet (1)
>    +- == Initial Plan ==
>       Coalesce (9)
>       +- HashAggregate (8)
>          +- Exchange (4)
>             +- HashAggregate (3)
>                +- Project (2)
>                   +- Scan parquet (1)
> {noformat}
> Based on this observation, we concluded that the difference in performance is related to {*}SortAggregate versus HashAggregate{*}. Is this line of thinking correct?
> How can we enforce the use of HashAggregate instead of SortAggregate?
> *The final result is somewhat counterintuitive* because deduplicating using only one column should theoretically be faster, as it provides a simpler way to compare rows and determine duplicates.
>
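The two plans above differ exactly in how the aggregation behind dropDuplicates is executed. As a plain-Java sketch with ordinary collections (not Spark internals; the class and method names are invented for illustration), hash-based dedup makes one pass over the rows and keeps the first row seen per key, while sort-based dedup must first sort every row by key; with almost 600 wide columns, those extra Sort steps are a plausible home for the 30-minute runtime:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative analogues of the two aggregation strategies in the plans:
// hashDedup ~ HashAggregate, sortDedup ~ SortAggregate. Rows are String[]
// with the dedup key in slot 0.
public class DedupStrategies {

    // HashAggregate-style: single pass, expected O(n); no ordering of rows.
    public static List<String[]> hashDedup(List<String[]> rows) {
        Map<String, String[]> seen = new LinkedHashMap<>();
        for (String[] row : rows) {
            seen.putIfAbsent(row[0], row); // keep first row per key
        }
        return new ArrayList<>(seen.values());
    }

    // SortAggregate-style: sort all rows by key (O(n log n) comparisons),
    // then keep the first row of each equal-key run.
    public static List<String[]> sortDedup(List<String[]> rows) {
        List<String[]> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparing((String[] r) -> r[0]));
        List<String[]> out = new ArrayList<>();
        String prev = null;
        for (String[] row : sorted) {
            if (!row[0].equals(prev)) out.add(row);
            prev = row[0];
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"1", "x"}, new String[]{"2", "y"},
                new String[]{"1", "x"}, new String[]{"3", "z"});
        System.out.println(hashDedup(rows).size() + " " + sortDedup(rows).size());
    }
}
```

Both strategies return one row per key; the difference is purely in the work done to get there, which is why forcing the hash-based plan (for example via the to_json workaround above, which removes the unhashable map column) pays off even though it touches more columns.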
[jira] [Updated] (SPARK-44156) SortAggregation slows down dropDuplicates().
[ https://issues.apache.org/jira/browse/SPARK-44156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emanuel Velzi updated SPARK-44156: -- Summary: SortAggregation slows down dropDuplicates(). (was: Should HashAggregation improve dropDuplicates()?) > SortAggregation slows down dropDuplicates(). > > > Key: SPARK-44156 > URL: https://issues.apache.org/jira/browse/SPARK-44156 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.2 >Reporter: Emanuel Velzi >Priority: Major > > TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate. > How to make Spark to use HashAggregate over SortAggregate? > -- > We have a Spark cluster running on Kubernetes with the following > configurations: > * Spark v3.3.2 > * Hadoop 3.3.4 > * Java 17 > We are running a simple job on a dataset (~6GBi) with almost 600 columns, > many of which contain null values. The job involves the following steps: > # Load data from S3. > # Apply dropDuplicates(). > # Save the deduplicated data back to S3 using magic committers. > One of the columns is of type "map". When we run dropDuplicates() without > specifying any parameters (i.e., using all columns), it throws an error: > > {noformat} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot > have map type columns in DataFrame which calls set operations(intersect, > except, etc.), but the type of column my_column is > map>>;{noformat} > > To overcome this issue, we used "dropDuplicates(id)" by specifying an > identifier column. > However, the performance of this method was {*}much worse than expected{*}, > taking around 30 minutes. 
> As an alternative approach, we tested converting the "map" column to JSON, > applying dropDuplicates() without parameters, and then converting the column > back to "map" format: > > {code:java} > DataType t = ds.schema().apply("my_column").dataType(); > ds = ds.withColumn("my_column", functions.to_json(ds.col("my_column"))); > ds = ds.dropDuplicates(); > ds = ds.withColumn("my_column", functions.from_json(ds.col("my_column"),t)); > {code} > > Surprisingly, this approach {*}significantly improved the performance{*}, > reducing the execution time to 7 minutes. > The only noticeable difference was in the execution plan. In the *slower* > case, the execution plan involved {*}SortAggregate{*}, while in the *faster* > case, it involved {*}HashAggregate{*}. > > {noformat} > == Physical Plan [slow case] == > Execute InsertIntoHadoopFsRelationCommand (13) > +- AdaptiveSparkPlan (12) > +- == Final Plan == > Coalesce (8) > +- SortAggregate (7) > +- Sort (6) > +- ShuffleQueryStage (5), Statistics(sizeInBytes=141.3 GiB, > rowCount=1.25E+7) > +- Exchange (4) > +- SortAggregate (3) > +- Sort (2) > +- Scan parquet (1) > +- == Initial Plan == > Coalesce (11) > +- SortAggregate (10) > +- Sort (9) > +- Exchange (4) > +- SortAggregate (3) > +- Sort (2) > +- Scan parquet (1) > {noformat} > {noformat} > == Physical Plan [fast case] == > Execute InsertIntoHadoopFsRelationCommand (11) > +- AdaptiveSparkPlan (10) > +- == Final Plan == > Coalesce (7) > +- HashAggregate (6) > +- ShuffleQueryStage (5), Statistics(sizeInBytes=81.6 GiB, > rowCount=1.25E+7) > +- Exchange (4) > +- HashAggregate (3) > +- Project (2) > +- Scan parquet (1) > +- == Initial Plan == > Coalesce (9) > +- HashAggregate (8) > +- Exchange (4) > +- HashAggregate (3) > +- Project (2) > +- Scan parquet (1) > {noformat} > > Based on this observation, we concluded that the difference in performance is > related to {*}SortAggregate versus HashAggregate{*}. > Is this line of thinking correct? 
How can we enforce the use of > HashAggregate instead of SortAggregate? > *The final result is somewhat counterintuitive* because deduplicating using > only one column should theoretically be faster, as it provides a simpler way > to compare rows and determine duplicates. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44157) Outdated JARs in PySpark package
Adrian Gonzalez-Martin created SPARK-44157: -- Summary: Outdated JARs in PySpark package Key: SPARK-44157 URL: https://issues.apache.org/jira/browse/SPARK-44157 Project: Spark Issue Type: Bug Components: Build, PySpark Affects Versions: 3.4.1 Reporter: Adrian Gonzalez-Martin The JARs which ship embedded within PySpark's package on PyPI don't seem to be aligned with the deps specified in Spark's own `pom.xml`. For example, in Spark's `pom.xml`, `protobuf-java` is set to `3.21.12`: [https://github.com/apache/spark/blob/6b1ff22dde1ead51cbf370be6e48a802daae58b6/pom.xml#L127] However, if we look at the JARs embedded within the PySpark tarball, the version of `protobuf-java` is `2.5.0` (i.e. `/site-packages/pyspark/jars/protobuf-java-2.5.0.jar`). The same seems to apply to all other dependencies. This introduces a set of CVEs which are fixed in upstream Spark, but are still present in PySpark (e.g. `CVE-2022-3509`, `CVE-2021-22569`, `CVE-2015-5237` and a few others). It also potentially introduces a source of conflicts whenever there's a breaking change in these deps. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44156) Should HashAggregation improve dropDuplicates()?
Emanuel Velzi created SPARK-44156: - Summary: Should HashAggregation improve dropDuplicates()? Key: SPARK-44156 URL: https://issues.apache.org/jira/browse/SPARK-44156 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.2 Reporter: Emanuel Velzi TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate. How can we make Spark use HashAggregate instead of SortAggregate? -- We have a Spark cluster running on Kubernetes with the following configuration: * Spark v3.3.2 * Hadoop 3.3.4 * Java 17 We are running a simple job on a dataset (~6 GiB) with almost 600 columns, many of which contain null values. The job involves the following steps: # Load data from S3. # Apply dropDuplicates(). # Save the deduplicated data back to S3 using magic committers. One of the columns is of type "map". When we run dropDuplicates() without specifying any parameters (i.e., using all columns), it throws an error: {noformat} Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column my_column is map>>;{noformat} To overcome this issue, we used dropDuplicates("id"), specifying an identifier column. However, the performance of this method was {*}much worse than expected{*}, taking around 30 minutes. As an alternative approach, we tested converting the "map" column to JSON, applying dropDuplicates() without parameters, and then converting the column back to "map" format: {code:java} DataType t = ds.schema().apply("my_column").dataType(); ds = ds.withColumn("my_column", functions.to_json(ds.col("my_column"))); ds = ds.dropDuplicates(); ds = ds.withColumn("my_column", functions.from_json(ds.col("my_column"), t)); {code} Surprisingly, this approach {*}significantly improved the performance{*}, reducing the execution time to 7 minutes. The only noticeable difference was in the execution plan. 
In the *slower* case, the execution plan involved {*}SortAggregate{*}, while in the *faster* case, it involved {*}HashAggregate{*}. {noformat} == Physical Plan [slow case] == Execute InsertIntoHadoopFsRelationCommand (13) +- AdaptiveSparkPlan (12) +- == Final Plan == Coalesce (8) +- SortAggregate (7) +- Sort (6) +- ShuffleQueryStage (5), Statistics(sizeInBytes=141.3 GiB, rowCount=1.25E+7) +- Exchange (4) +- SortAggregate (3) +- Sort (2) +- Scan parquet (1) +- == Initial Plan == Coalesce (11) +- SortAggregate (10) +- Sort (9) +- Exchange (4) +- SortAggregate (3) +- Sort (2) +- Scan parquet (1) {noformat} {noformat} == Physical Plan [fast case] == Execute InsertIntoHadoopFsRelationCommand (11) +- AdaptiveSparkPlan (10) +- == Final Plan == Coalesce (7) +- HashAggregate (6) +- ShuffleQueryStage (5), Statistics(sizeInBytes=81.6 GiB, rowCount=1.25E+7) +- Exchange (4) +- HashAggregate (3) +- Project (2) +- Scan parquet (1) +- == Initial Plan == Coalesce (9) +- HashAggregate (8) +- Exchange (4) +- HashAggregate (3) +- Project (2) +- Scan parquet (1) {noformat} Based on this observation, we concluded that the difference in performance is related to {*}SortAggregate versus HashAggregate{*}. Is this line of thinking correct? How can we enforce the use of HashAggregate instead of SortAggregate? *The final result is somewhat counterintuitive* because deduplicating using only one column should theoretically be faster, as it provides a simpler way to compare rows and determine duplicates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
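The SortAggregate vs. HashAggregate difference above can be illustrated outside Spark. Below is a plain-Java sketch (illustrative only; the class and method names are invented, and this is not Spark's implementation) contrasting hash-based dedup, one expected O(n) pass as in HashAggregate, with sort-based dedup, which pays an O(n log n) sort with full row comparisons before collapsing adjacent duplicates, as in Sort + SortAggregate:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupSketch {

    // Hash-based dedup: one pass over the data, expected O(n),
    // analogous to HashAggregate keyed on the dedup columns.
    static List<String> hashDedup(List<String> rows) {
        return new ArrayList<>(new LinkedHashSet<>(rows));
    }

    // Sort-based dedup: sort first (O(n log n), comparing whole rows),
    // then collapse runs of equal rows, analogous to Sort + SortAggregate.
    static List<String> sortDedup(List<String> rows) {
        List<String> sorted = new ArrayList<>(rows);
        Collections.sort(sorted);
        List<String> out = new ArrayList<>();
        for (String r : sorted) {
            if (out.isEmpty() || !out.get(out.size() - 1).equals(r)) {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("b", "a", "b", "c", "a");
        System.out.println(hashDedup(rows)); // prints [b, a, c]
        System.out.println(sortDedup(rows)); // prints [a, b, c]
    }
}
```

A plausible (unconfirmed) explanation for the plan choice itself: with dropDuplicates("id") the remaining ~600 columns become aggregate expressions whose buffer includes the map-typed column, which can disqualify HashAggregate, whereas after the to_json conversion every column is a plain grouping key.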
[jira] [Updated] (SPARK-44134) Can't set resources (GPU/FPGA) to 0 when they are set to positive value in spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-44134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-44134: -- Fix Version/s: 3.4.2 (was: 3.4.1) > Can't set resources (GPU/FPGA) to 0 when they are set to positive value in > spark-defaults.conf > -- > > Key: SPARK-44134 > URL: https://issues.apache.org/jira/browse/SPARK-44134 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > Fix For: 3.3.3, 3.5.0, 3.4.2 > > > With resource aware scheduling, if you specify a default value in the > spark-defaults.conf, a user can't override that to set it to 0. > Meaning spark-defaults.conf has something like: > {{spark.executor.resource.\{resourceName}.amount=1}} > {{spark.task.resource.\{resourceName}.amount}} =1 > If the user tries to override when submitting an application with > {{{}spark.executor.resource.\{resourceName}.amount{}}}=0 and > {{spark.task.resource.\{resourceName}.amount}} =0, it gives the user an error: > > {code:java} > 23/06/21 09:12:57 ERROR Main: Failed to initialize Spark session. 
> org.apache.spark.SparkException: No executor resource configs were not > specified for the following task configs: gpu > at > org.apache.spark.resource.ResourceProfile.calculateTasksAndLimitingResource(ResourceProfile.scala:206) > at > org.apache.spark.resource.ResourceProfile.$anonfun$limitingResource$1(ResourceProfile.scala:139) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.resource.ResourceProfile.limitingResource(ResourceProfile.scala:138) > at > org.apache.spark.resource.ResourceProfileManager.addResourceProfile(ResourceProfileManager.scala:95) > at > org.apache.spark.resource.ResourceProfileManager.(ResourceProfileManager.scala:49) > at org.apache.spark.SparkContext.(SparkContext.scala:455) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704) > at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953){code} > This used to work, my guess is this may have gotten broken with the stage > level scheduling feature. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
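Concretely, the failing combination described above looks like this (illustrative; `gpu` stands in for whatever resource name is actually configured):

```
# spark-defaults.conf: cluster-wide defaults
spark.executor.resource.gpu.amount=1
spark.task.resource.gpu.amount=1

# per-application override at submit time, which currently triggers
# the ResourceProfile error instead of disabling the resource
spark.executor.resource.gpu.amount=0
spark.task.resource.gpu.amount=0
```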
[jira] [Commented] (SPARK-44132) nesting full outer joins confuses code generator
[ https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736490#comment-17736490 ] Steven Aerts commented on SPARK-44132: -- [https://github.com/apache/spark/pull/41712] was created with a proposed fix and reproduction scenario for this problem. Let me know if you prefer to update this Jira ticket, as it is still referring to the BeanEncoder which had nothing to do with it. > nesting full outer joins confuses code generator > > > Key: SPARK-44132 > URL: https://issues.apache.org/jira/browse/SPARK-44132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0 > Environment: We verified the existence of this bug from spark 3.3 > until spark 3.5. >Reporter: Steven Aerts >Priority: Major > > We are seeing issues with the code generator when querying java bean encoded > data with 2 nested joins. > {code:java} > dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); > {code} > will generate invalid code in the code generator. 
Depending on the > data used, it can generate stack traces like: > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > {code} > Or: > {code:java} > Caused by: java.lang.AssertionError: index (2) should < 2 > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > {code} > When we look at the generated code, we see that the code generator seems to be > mixing up parameters. For example: > {code:java} > if (smj_leftOutputRow_0 != null) { //< null > check for wrong/left parameter > boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes > NPE on right parameter here{code} > It is as if the nesting of 2 full outer joins is confusing the code > generator and thus producing invalid code. > There is one other strange thing. We found this issue when using datasets > that use the Java bean encoder. We tried to reproduce this in the > Spark shell or using Scala case classes but were unable to do so. 
> We made a reproduction scenario as unit tests (one for each of the stacktrace > above) on the spark code base and made it available as a [pull > request|https://github.com/apache/spark/pull/41688] to this case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
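The mixed-up guard shown in the comment above boils down to null-checking one row while dereferencing the other. A minimal plain-Java sketch of that failure pattern (invented names; not the actual generated code):

```java
public class WrongGuardSketch {

    // Mirrors the buggy generated code: the guard tests `left`
    // (like smj_leftOutputRow_0) but the body dereferences `right`
    // (like smj_rightOutputRow_0), so a missing right row still NPEs.
    static Boolean isNullAt1(int[] left, int[] right) {
        if (left != null) {
            return right[1] == 0; // NPE here when right == null
        }
        return null;
    }

    public static void main(String[] args) {
        try {
            isNullAt1(new int[]{1, 2}, null);
        } catch (NullPointerException e) {
            System.out.println("NPE from guarding the wrong row");
        }
    }
}
```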
[jira] [Comment Edited] (SPARK-43438) Fix mismatched column list error on INSERT
[ https://issues.apache.org/jira/browse/SPARK-43438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736485#comment-17736485 ] BingKun Pan edited comment on SPARK-43438 at 6/23/23 12:48 PM: --- I checked and found that: 1. When executing "INSERT INTO tabtest SELECT 1", it executes successfully; there is a default-value completion operation. [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397] [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401] [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42] 2. When executing "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`tabtest`, the reason is not enough data columns: Table columns: `c1`, `c2`. Data columns: `1`. 3. When executing "INSERT INTO tabtest(c1) SELECT 1, 2, 3", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`tabtest`, the reason is too many data columns: Table columns: `c1`. Data columns: `1`, `2`, `3`. Items 2 and 3 are in line with our expectations after `[https://github.com/apache/spark/pull/41458]`. But the behavior difference between 1 and 2 is a bit confusing. *Should we align the logic of 1 and 2?* was (Author: panbingkun): I checked and found that after `[https://github.com/apache/spark/pull/41458]`, 1.when execute sql "INSERT INTO tabtest SELECT 1", will execute successfully. There is a default value completion operation. 
[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397] [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401] [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42] 2.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`tabtest`, the reason is not enough data columns: Table columns: `c1`, `c2`. Data columns: `1`. 3.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`tabtest`, the reason is too many data columns: Table columns: `c1`. Data columns: `1`, `2`, `3`. Among them, 2 and 3 are in line with our expectations. But the behavior difference between 1 and 2 is a bit confusing. *Should we align the logic of 1 and 2?* > Fix mismatched column list error on INSERT > -- > > Key: SPARK-43438 > URL: https://issues.apache.org/jira/browse/SPARK-43438 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > This error message is pretty bad, and common > "_LEGACY_ERROR_TEMP_1038" : { > "message" : [ > "Cannot write to table due to mismatched user specified column > size() and data column size()." 
> ] > }, > It can perhaps be merged with this one - after giving it an ERROR_CLASS > "_LEGACY_ERROR_TEMP_1168" : { > "message" : [ > " requires that the data to be inserted have the same number of > columns as the target table: target table has column(s) but > the inserted data has column(s), including > partition column(s) having constant value(s)." > ] > }, > Repro: > CREATE TABLE tabtest(c1 INT, c2 INT); > INSERT INTO tabtest SELECT 1; > `spark_catalog`.`default`.`tabtest` requires that the data to be inserted > have the same number of columns as the target table: target table has 2 > column(s) but the inserted data has 1 column(s), including 0 partition > column(s) having constant value(s). > INSERT INTO tabtest(c1) SELECT 1, 2, 3; > Cannot write to table due to mismatched user specified column size(1) and > data column size(3).; line 1 pos 24 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-43438) Fix mismatched column list error on INSERT
[ https://issues.apache.org/jira/browse/SPARK-43438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736485#comment-17736485 ] BingKun Pan edited comment on SPARK-43438 at 6/23/23 12:44 PM: --- I checked and found that after `[https://github.com/apache/spark/pull/41458]`, 1.when execute sql "INSERT INTO tabtest SELECT 1", will execute successfully. There is a default value completion operation. [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397] [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401] [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42] 2.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`tabtest`, the reason is not enough data columns: Table columns: `c1`, `c2`. Data columns: `1`. 3.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`tabtest`, the reason is too many data columns: Table columns: `c1`. Data columns: `1`, `2`, `3`. Among them, 2 and 3 are in line with our expectations. But the behavior difference between 1 and 2 is a bit confusing. *Should we align the logic of 1 and 2?* was (Author: panbingkun): I checked and found that after `[https://github.com/apache/spark/pull/41458]`, 1.when execute sql "INSERT INTO tabtest SELECT 1", will execute successfully. 
[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397] [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401] [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42] 2.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`tabtest`, the reason is not enough data columns: Table columns: `c1`, `c2`. Data columns: `1`. 3.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`tabtest`, the reason is too many data columns: Table columns: `c1`. Data columns: `1`, `2`, `3`. Among them, 2 and 3 are in line with our expectations. But the behavior difference between 1 and 2 is a bit confusing. *Should we align the logic of 1 and 2?* > Fix mismatched column list error on INSERT > -- > > Key: SPARK-43438 > URL: https://issues.apache.org/jira/browse/SPARK-43438 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > This error message is pretty bad, and common > "_LEGACY_ERROR_TEMP_1038" : { > "message" : [ > "Cannot write to table due to mismatched user specified column > size() and data column size()." 
> ] > }, > It can perhaps be merged with this one - after giving it an ERROR_CLASS > "_LEGACY_ERROR_TEMP_1168" : { > "message" : [ > " requires that the data to be inserted have the same number of > columns as the target table: target table has column(s) but > the inserted data has column(s), including > partition column(s) having constant value(s)." > ] > }, > Repro: > CREATE TABLE tabtest(c1 INT, c2 INT); > INSERT INTO tabtest SELECT 1; > `spark_catalog`.`default`.`tabtest` requires that the data to be inserted > have the same number of columns as the target table: target table has 2 > column(s) but the inserted data has 1 column(s), including 0 partition > column(s) having constant value(s). > INSERT INTO tabtest(c1) SELECT 1, 2, 3; > Cannot write to table due to mismatched user specified column size(1) and > data column size(3).; line 1 pos 24 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-43438) Fix mismatched column list error on INSERT
[ https://issues.apache.org/jira/browse/SPARK-43438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736485#comment-17736485 ] BingKun Pan edited comment on SPARK-43438 at 6/23/23 12:42 PM: --- I checked and found that after `[https://github.com/apache/spark/pull/41458]`, 1.when execute sql "INSERT INTO tabtest SELECT 1", will execute successfully. [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397] [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401] [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42] 2.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`tabtest`, the reason is not enough data columns: Table columns: `c1`, `c2`. Data columns: `1`. 3.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`tabtest`, the reason is too many data columns: Table columns: `c1`. Data columns: `1`, `2`, `3`. Among them, 2 and 3 are in line with our expectations. But the behavior difference between 1 and 2 is a bit confusing. *Should we align the logic of 1 and 2?* was (Author: panbingkun): I checked and found that after `[https://github.com/apache/spark/pull/41458]`, 1.when execute sql "INSERT INTO tabtest SELECT 1", will execute successfully. 
[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397] [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401] [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42] 2.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`tabtest`, the reason is not enough data columns: Table columns: `c1`, `c2`. Data columns: `1`. 3.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`t1`, the reason is too many data columns: Table columns: `c1`. Data columns: `1`, `2`, `3`. Among them, 2 and 3 are in line with our expectations. But the behavior difference between 1 and 2 is a bit confusing. *Should we align the logic of 1 and 2?* > Fix mismatched column list error on INSERT > -- > > Key: SPARK-43438 > URL: https://issues.apache.org/jira/browse/SPARK-43438 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > This error message is pretty bad, and common > "_LEGACY_ERROR_TEMP_1038" : { > "message" : [ > "Cannot write to table due to mismatched user specified column > size() and data column size()." 
> ] > }, > It can perhaps be merged with this one - after giving it an ERROR_CLASS > "_LEGACY_ERROR_TEMP_1168" : { > "message" : [ > " requires that the data to be inserted have the same number of > columns as the target table: target table has column(s) but > the inserted data has column(s), including > partition column(s) having constant value(s)." > ] > }, > Repro: > CREATE TABLE tabtest(c1 INT, c2 INT); > INSERT INTO tabtest SELECT 1; > `spark_catalog`.`default`.`tabtest` requires that the data to be inserted > have the same number of columns as the target table: target table has 2 > column(s) but the inserted data has 1 column(s), including 0 partition > column(s) having constant value(s). > INSERT INTO tabtest(c1) SELECT 1, 2, 3; > Cannot write to table due to mismatched user specified column size(1) and > data column size(3).; line 1 pos 24 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-43438) Fix mismatched column list error on INSERT
[ https://issues.apache.org/jira/browse/SPARK-43438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736485#comment-17736485 ] BingKun Pan edited comment on SPARK-43438 at 6/23/23 12:40 PM: --- I checked and found that after [https://github.com/apache/spark/pull/41458]:
1. When executing the SQL "INSERT INTO tabtest SELECT 1", it executes successfully. [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397] [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401] [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42]
2. When executing the SQL "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`tabtest`, the reason is not enough data columns: Table columns: `c1`, `c2`. Data columns: `1`.
3. When executing the SQL "INSERT INTO tabtest(c1) SELECT 1, 2, 3", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`t1`, the reason is too many data columns: Table columns: `c1`. Data columns: `1`, `2`, `3`.
Among them, 2 and 3 are in line with our expectations, but the behavior difference between 1 and 2 is a bit confusing. *Should we align the logic of 1 and 2?*
> Fix mismatched column list error on INSERT > -- > > Key: SPARK-43438 > URL: https://issues.apache.org/jira/browse/SPARK-43438 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > This error message is pretty bad, and common > "_LEGACY_ERROR_TEMP_1038" : { > "message" : [ > "Cannot write to table due to mismatched user specified column > size() and data column size()." > ] > }, > It can perhaps be merged with this one - after giving it an ERROR_CLASS > "_LEGACY_ERROR_TEMP_1168" : { > "message" : [ > " requires that the data to be inserted have the same number of > columns as the target table: target table has column(s) but > the inserted data has column(s), including > partition column(s) having constant value(s)."
> ] > }, > Repro: > CREATE TABLE tabtest(c1 INT, c2 INT); > INSERT INTO tabtest SELECT 1; > `spark_catalog`.`default`.`tabtest` requires that the data to be inserted > have the same number of columns as the target table: target table has 2 > column(s) but the inserted data has 1 column(s), including 0 partition > column(s) having constant value(s). > INSERT INTO tabtest(c1) SELECT 1, 2, 3; > Cannot write to table due to mismatched user specified column size(1) and > data column size(3).; line 1 pos 24 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43438) Fix mismatched column list error on INSERT
[ https://issues.apache.org/jira/browse/SPARK-43438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736485#comment-17736485 ] BingKun Pan commented on SPARK-43438: - I checked and found that after [https://github.com/apache/spark/pull/41458]:
1. When executing the SQL "INSERT INTO tabtest SELECT 1", it executes successfully. [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397] [https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401]
2. When executing the SQL "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`tabtest`, the reason is not enough data columns: Table columns: `c1`, `c2`. Data columns: `1`.
3. When executing the SQL "INSERT INTO tabtest(c1) SELECT 1, 2, 3", the error is as follows: [INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`t1`, the reason is too many data columns: Table columns: `c1`. Data columns: `1`, `2`, `3`.
Among them, 2 and 3 are in line with our expectations, but the behavior difference between 1 and 2 is a bit confusing.
> Fix mismatched column list error on INSERT > -- > > Key: SPARK-43438 > URL: https://issues.apache.org/jira/browse/SPARK-43438 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > This error message is pretty bad, and common > "_LEGACY_ERROR_TEMP_1038" : { > "message" : [ > "Cannot write to table due to mismatched user specified column > size() and data column size()."
> ] > }, > It can perhaps be merged with this one - after giving it an ERROR_CLASS > "_LEGACY_ERROR_TEMP_1168" : { > "message" : [ > " requires that the data to be inserted have the same number of > columns as the target table: target table has column(s) but > the inserted data has column(s), including > partition column(s) having constant value(s)." > ] > }, > Repro: > CREATE TABLE tabtest(c1 INT, c2 INT); > INSERT INTO tabtest SELECT 1; > `spark_catalog`.`default`.`tabtest` requires that the data to be inserted > have the same number of columns as the target table: target table has 2 > column(s) but the inserted data has 1 column(s), including 0 partition > column(s) having constant value(s). > INSERT INTO tabtest(c1) SELECT 1, 2, 3; > Cannot write to table due to mismatched user specified column size(1) and > data column size(3).; line 1 pos 24 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
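[Editor's note] As a rough illustration of the arity check discussed in the comments above (plain Python, not Spark's actual implementation, which lives in the rules.scala/TableOutputResolver code linked there), the three cases can be sketched as:

```python
# Illustrative sketch only, NOT Spark's code. It mimics how an INSERT arity
# check could classify the cases discussed in this thread.

def check_insert_arity(table_columns, data_columns):
    """Return the error class name for an arity mismatch, or None if OK."""
    if len(data_columns) < len(table_columns):
        return "INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS"
    if len(data_columns) > len(table_columns):
        return "INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS"
    return None  # arity matches; insertion can proceed

# Case 2: INSERT INTO tabtest(c1, c2) SELECT 1  -> not enough data columns
print(check_insert_arity(["c1", "c2"], ["1"]))
# Case 3: INSERT INTO tabtest(c1) SELECT 1, 2, 3 -> too many data columns
print(check_insert_arity(["c1"], ["1", "2", "3"]))
```

If the thread's reading is right, case 1 ("INSERT INTO tabtest SELECT 1" with no column list) succeeds because missing values are filled before a check of this kind runs, which is exactly the behavior difference being questioned.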
[jira] [Commented] (SPARK-44155) Adding a dev utility to improve error messages based on LLM
[ https://issues.apache.org/jira/browse/SPARK-44155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736424#comment-17736424 ] ASF GitHub Bot commented on SPARK-44155: User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/41711 > Adding a dev utility to improve error messages based on LLM > > > Key: SPARK-44155 > URL: https://issues.apache.org/jira/browse/SPARK-44155 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Adding a utility function to assist with error message improvement tasks. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44151) Upgrade commons-codec from 1.15 to 1.16.0
[ https://issues.apache.org/jira/browse/SPARK-44151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736421#comment-17736421 ] ASF GitHub Bot commented on SPARK-44151: User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/41707 > Upgrade commons-codec from 1.15 to 1.16.0 > - > > Key: SPARK-44151 > URL: https://issues.apache.org/jira/browse/SPARK-44151 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44153) Support `Heap Histogram` column in Executor tab
[ https://issues.apache.org/jira/browse/SPARK-44153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736420#comment-17736420 ] ASF GitHub Bot commented on SPARK-44153: User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/41709 > Support `Heap Histogram` column in Executor tab > --- > > Key: SPARK-44153 > URL: https://issues.apache.org/jira/browse/SPARK-44153 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44155) Adding a dev utility to improve error messages based on LLM
Haejoon Lee created SPARK-44155: --- Summary: Adding a dev utility to improve error messages based on LLM Key: SPARK-44155 URL: https://issues.apache.org/jira/browse/SPARK-44155 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Haejoon Lee Adding a utility function to assist with error message improvement tasks. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736404#comment-17736404 ] Ramakrishna commented on SPARK-44152: - [~gurwls223] Is this an issue in Spark 3.4.0? At least I am facing this issue, with all other constraints remaining the same. > Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" > java.nio.file.NoSuchFileException: , although jar is present in the location > --- > > Key: SPARK-44152 > URL: https://issues.apache.org/jira/browse/SPARK-44152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > > I have a spark application that is deployed using k8s and it is of version > 3.3.2. Recently there were some vulnerabilities in spark 3.3.2. > I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my > application jar is built on spark 3.4.0. > However while deploying, I get this error > > *{{Exception in thread "main" java.nio.file.NoSuchFileException: > /spark-assembly-1.0.jar}}* > > I have this in deployment.yaml of the app > > *mainApplicationFile: "local:spark-assembly-1.0.jar"* > > > > > and I have not changed anything related to that. I see that some code has > changed in spark 3.4.0 core's source code regarding jar location. > Has it really changed the functionality? Is there anyone who is facing the same > issue as me? Should the path be specified in a different way? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
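[Editor's note] One possible explanation, an assumption not confirmed anywhere in this thread: the reported URI "local:spark-assembly-1.0.jar" has no slash after the scheme, so stricter URI or path handling in 3.4.0 would see a relative path rather than an absolute one. Generic URI parsing makes the difference visible:

```python
from urllib.parse import urlparse

# Compare the URI from the report against the fully slashed "local:///" form.
for uri in ("local:spark-assembly-1.0.jar", "local:///spark-assembly-1.0.jar"):
    parsed = urlparse(uri)
    print(parsed.scheme, repr(parsed.path))
# local 'spark-assembly-1.0.jar'   <- relative-looking path
# local '/spark-assembly-1.0.jar'  <- absolute path inside the image
```

If this hypothesis holds, writing the path as "local:///spark-assembly-1.0.jar" (or an absolute path) would be worth testing; again, this is a guess, not a confirmed root cause.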
[jira] [Commented] (SPARK-42877) Implement DataFrame.foreach
[ https://issues.apache.org/jira/browse/SPARK-42877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736399#comment-17736399 ] Gurpreet Singh commented on SPARK-42877: [~XinrongM] I am interested in working on this issue. I am new to this codebase, so could you maybe provide some more context on this? Thanks > Implement DataFrame.foreach > --- > > Key: SPARK-42877 > URL: https://issues.apache.org/jira/browse/SPARK-42877 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Xinrong Meng >Priority: Major > > Maybe we can leverage UDFs to implement that, such as > `df.select(udf(*df.schema)).count()`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
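[Editor's note] The UDF trick mentioned in the issue description above can be sketched in plain Python; the names here are illustrative, not the PySpark API. The idea: wrap a side-effecting function as a "UDF" whose dummy result is then counted, forcing evaluation on every row while discarding the results.

```python
# Plain-Python sketch of foreach-via-UDF. Not runnable against Spark;
# foreach_via_udf is a made-up name for illustration.

def foreach_via_udf(rows, f):
    def wrapped(row):
        f(row)        # the side effect we actually care about
        return None   # dummy value, discarded by the caller
    # "Counting" the dummy results forces every row through wrapped(),
    # analogous to df.select(udf(*df.schema)).count() in the description.
    return sum(1 for _ in map(wrapped, rows))

seen = []
n = foreach_via_udf([(1, "a"), (2, "b")], seen.append)
print(n, seen)  # 2 [(1, 'a'), (2, 'b')]
```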
[jira] [Resolved] (SPARK-44153) Support `Heap Histogram` column in Executor tab
[ https://issues.apache.org/jira/browse/SPARK-44153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44153. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41709 [https://github.com/apache/spark/pull/41709] > Support `Heap Histogram` column in Executor tab > --- > > Key: SPARK-44153 > URL: https://issues.apache.org/jira/browse/SPARK-44153 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44153) Support `Heap Histogram` column in Executor tab
[ https://issues.apache.org/jira/browse/SPARK-44153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44153: - Assignee: Dongjoon Hyun > Support `Heap Histogram` column in Executor tab > --- > > Key: SPARK-44153 > URL: https://issues.apache.org/jira/browse/SPARK-44153 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736365#comment-17736365 ] Hyukjin Kwon commented on SPARK-44152: -- This is from https://issues.apache.org/jira/browse/SPARK-44135. I made a mistake in the JIRA number, so I manually switched both JIRAs. > Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" > java.nio.file.NoSuchFileException: , although jar is present in the location > --- > > Key: SPARK-44152 > URL: https://issues.apache.org/jira/browse/SPARK-44152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > > I have a spark application that is deployed using k8s and it is of version > 3.3.2. Recently there were some vulnerabilities in spark 3.3.2. > I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my > application jar is built on spark 3.4.0. > However while deploying, I get this error > > *{{Exception in thread "main" java.nio.file.NoSuchFileException: > /spark-assembly-1.0.jar}}* > > I have this in deployment.yaml of the app > > *mainApplicationFile: "local:spark-assembly-1.0.jar"* > > > > > and I have not changed anything related to that. I see that some code has > changed in spark 3.4.0 core's source code regarding jar location. > Has it really changed the functionality? Is there anyone who is facing the same > issue as me? Should the path be specified in a different way? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44152: - Affects Version/s: 3.4.0 (was: 3.5.0) Description: I have a spark application that is deployed using k8s and it is of version 3.3.2. Recently there were some vulnerabilities in spark 3.3.2. I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my application jar is built on spark 3.4.0. However while deploying, I get this error *{{Exception in thread "main" java.nio.file.NoSuchFileException: /spark-assembly-1.0.jar}}* I have this in deployment.yaml of the app *mainApplicationFile: "local:spark-assembly-1.0.jar"* and I have not changed anything related to that. I see that some code has changed in spark 3.4.0 core's source code regarding jar location. Has it really changed the functionality? Is there anyone who is facing the same issue as me? Should the path be specified in a different way? was: I have a spark application that is deployed using k8s and it is of version 3.3.2. Recently there were some vulnerabilities in spark 3.3.2. I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my application jar is built on spark 3.4.0. However while deploying, I get this error *{{Exception in thread "main" java.nio.file.NoSuchFileException: /spark-assembly-1.0.jar}}* I have this in deployment.yaml of the app *mainApplicationFile: "local:spark-assembly-1.0.jar"* and I have not changed anything related to that. I see that some code has changed in spark 3.4.0 core's source code regarding jar location. Has it really changed the functionality? Is there anyone who is facing the same issue as me? Should the path be specified in a different way?
> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" > java.nio.file.NoSuchFileException: , although jar is present in the location > --- > > Key: SPARK-44152 > URL: https://issues.apache.org/jira/browse/SPARK-44152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > > I have a spark application that is deployed using k8s and it is of version > 3.3.2. Recently there were some vulnerabilities in spark 3.3.2. > I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my > application jar is built on spark 3.4.0. > However while deploying, I get this error > > *{{Exception in thread "main" java.nio.file.NoSuchFileException: > /spark-assembly-1.0.jar}}* > > I have this in deployment.yaml of the app > > *mainApplicationFile: "local:spark-assembly-1.0.jar"* > > > > > and I have not changed anything related to that. I see that some code has > changed in spark 3.4.0 core's source code regarding jar location. > Has it really changed the functionality? Is there anyone who is facing the same > issue as me? Should the path be specified in a different way? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44135) Document Spark Connect only API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736364#comment-17736364 ] Hyukjin Kwon commented on SPARK-44135: -- I made a mistake during the JIRA resolution. I moved the original JIRA here: https://issues.apache.org/jira/browse/SPARK-44152 > Document Spark Connect only API in PySpark > --- > > Key: SPARK-44135 > URL: https://issues.apache.org/jira/browse/SPARK-44135 > Project: Spark > Issue Type: Documentation > Components: Connect, Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ramakrishna >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.5.0 > > > https://issues.apache.org/jira/browse/SPARK-41255 > https://issues.apache.org/jira/browse/SPARK-43509 > https://issues.apache.org/jira/browse/SPARK-43612 > https://issues.apache.org/jira/browse/SPARK-43790 > added four Spark Connect only API to Spark Session. We should document them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44135) Document Spark Connect only API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44135: Assignee: Hyukjin Kwon > Document Spark Connect only API in PySpark > --- > > Key: SPARK-44135 > URL: https://issues.apache.org/jira/browse/SPARK-44135 > Project: Spark > Issue Type: Documentation > Components: Connect, Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ramakrishna >Assignee: Hyukjin Kwon >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-41255 > https://issues.apache.org/jira/browse/SPARK-43509 > https://issues.apache.org/jira/browse/SPARK-43612 > https://issues.apache.org/jira/browse/SPARK-43790 > added four Spark Connect only API to Spark Session. We should document them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44135) Document Spark Connect only API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44135: - Issue Type: Documentation (was: Bug) > Document Spark Connect only API in PySpark > --- > > Key: SPARK-44135 > URL: https://issues.apache.org/jira/browse/SPARK-44135 > Project: Spark > Issue Type: Documentation > Components: Connect, Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ramakrishna >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-41255 > https://issues.apache.org/jira/browse/SPARK-43509 > https://issues.apache.org/jira/browse/SPARK-43612 > https://issues.apache.org/jira/browse/SPARK-43790 > added four Spark Connect only API to Spark Session. We should document them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44135) Document Spark Connect only API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44135: - Epic Link: SPARK-39375 > Document Spark Connect only API in PySpark > --- > > Key: SPARK-44135 > URL: https://issues.apache.org/jira/browse/SPARK-44135 > Project: Spark > Issue Type: Documentation > Components: Connect, Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ramakrishna >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-41255 > https://issues.apache.org/jira/browse/SPARK-43509 > https://issues.apache.org/jira/browse/SPARK-43612 > https://issues.apache.org/jira/browse/SPARK-43790 > added four Spark Connect only API to Spark Session. We should document them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44152: - Issue Type: Bug (was: Documentation) > Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" > java.nio.file.NoSuchFileException: , although jar is present in the location > --- > > Key: SPARK-44152 > URL: https://issues.apache.org/jira/browse/SPARK-44152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Priority: Major > > > I have a spark application that is deployed using k8s and it is of version > 3.3.2. Recently there were some vulnerabilities in spark 3.3.2. > I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my > application jar is built on spark 3.4.0. > However while deploying, I get this error > > *{{Exception in thread "main" java.nio.file.NoSuchFileException: > /spark-assembly-1.0.jar}}* > > I have this in deployment.yaml of the app > > *mainApplicationFile: "local:spark-assembly-1.0.jar"* > > > > > and I have not changed anything related to that. I see that some code has > changed in spark 3.4.0 core's source code regarding jar location. > Has it really changed the functionality? Is there anyone who is facing the same > issue as me? Should the path be specified in a different way? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44135) Document Spark Connect only API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44135. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41708 [https://github.com/apache/spark/pull/41708] > Document Spark Connect only API in PySpark > --- > > Key: SPARK-44135 > URL: https://issues.apache.org/jira/browse/SPARK-44135 > Project: Spark > Issue Type: Documentation > Components: Connect, Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ramakrishna >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.5.0 > > > https://issues.apache.org/jira/browse/SPARK-41255 > https://issues.apache.org/jira/browse/SPARK-43509 > https://issues.apache.org/jira/browse/SPARK-43612 > https://issues.apache.org/jira/browse/SPARK-43790 > added four Spark Connect only API to Spark Session. We should document them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44152: - Epic Link: (was: SPARK-39375) > Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" > java.nio.file.NoSuchFileException: , although jar is present in the location > --- > > Key: SPARK-44152 > URL: https://issues.apache.org/jira/browse/SPARK-44152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Priority: Major > > > I have a spark application that is deployed using k8s and it is of version > 3.3.2. Recently there were some vulnerabilities in spark 3.3.2. > I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my > application jar is built on spark 3.4.0. > However while deploying, I get this error > > *{{Exception in thread "main" java.nio.file.NoSuchFileException: > /spark-assembly-1.0.jar}}* > > I have this in deployment.yaml of the app > > *mainApplicationFile: "local:spark-assembly-1.0.jar"* > > > > > and I have not changed anything related to that. I see that some code has > changed in spark 3.4.0 core's source code regarding jar location. > Has it really changed the functionality? Is there anyone who is facing the same > issue as me? Should the path be specified in a different way? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44152: - Summary: Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location (was: Document Spark Connect only API in PySpark) > Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" > java.nio.file.NoSuchFileException: , although jar is present in the location > --- > > Key: SPARK-44152 > URL: https://issues.apache.org/jira/browse/SPARK-44152 > Project: Spark > Issue Type: Documentation > Components: Connect, Documentation, PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Priority: Major > > > I have a spark application that is deployed using k8s and it is of version > 3.3.2. Recently there were some vulnerabilities in spark 3.3.2. > I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my > application jar is built on spark 3.4.0. > However while deploying, I get this error > > *{{Exception in thread "main" java.nio.file.NoSuchFileException: > /spark-assembly-1.0.jar}}* > > I have this in deployment.yaml of the app > > *mainApplicationFile: "local:spark-assembly-1.0.jar"* > > > > > and I have not changed anything related to that. I see that some code has > changed in spark 3.4.0 core's source code regarding jar location. > Has it really changed the functionality? Is there anyone who is facing the same > issue as me? Should the path be specified in a different way? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44152:
- Component/s: Spark Core (was: Connect) (was: Documentation) (was: PySpark)
[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44135:
- Description:
https://issues.apache.org/jira/browse/SPARK-41255
https://issues.apache.org/jira/browse/SPARK-43509
https://issues.apache.org/jira/browse/SPARK-43612
https://issues.apache.org/jira/browse/SPARK-43790
added four Spark Connect-only APIs to SparkSession. We should document them.

was:
I have a Spark application deployed using k8s, on version 3.3.2. Recently there were some vulnerabilities in Spark 3.3.2, so I changed my Dockerfile to download 3.4.0 instead of 3.3.2; my application jar is also built on Spark 3.4.0.
However, while deploying I get this error:
*{{Exception in thread "main" java.nio.file.NoSuchFileException: /spark-assembly-1.0.jar}}*
I have this in the app's deployment.yaml:
*mainApplicationFile: "local:spark-assembly-1.0.jar"*
and I have not changed anything related to that. I see that some code regarding jar location has changed in the Spark 3.4.0 core source. Has the functionality really changed? Is anyone else facing the same issue? Should the path be specified in a different way?

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main"
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Ramakrishna
> Priority: Blocker
>
> https://issues.apache.org/jira/browse/SPARK-41255
> https://issues.apache.org/jira/browse/SPARK-43509
> https://issues.apache.org/jira/browse/SPARK-43612
> https://issues.apache.org/jira/browse/SPARK-43790
> added four Spark Connect-only APIs to SparkSession. We should document them.
[jira] [Updated] (SPARK-44135) Document Spark Connect only API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44135:
- Priority: Major (was: Blocker)
[jira] [Updated] (SPARK-44135) Document Spark Connect only API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44135:
- Component/s: Connect, Documentation, PySpark (was: Spark Core)
[jira] [Updated] (SPARK-44152) Document Spark Connect only API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44152:
- Description:
I have a Spark application deployed using k8s, on version 3.3.2. Recently there were some vulnerabilities in Spark 3.3.2, so I changed my Dockerfile to download 3.4.0 instead of 3.3.2; my application jar is also built on Spark 3.4.0.
However, while deploying I get this error:
*{{Exception in thread "main" java.nio.file.NoSuchFileException: /spark-assembly-1.0.jar}}*
I have this in the app's deployment.yaml:
*mainApplicationFile: "local:spark-assembly-1.0.jar"*
and I have not changed anything related to that. I see that some code regarding jar location has changed in the Spark 3.4.0 core source. Has the functionality really changed? Is anyone else facing the same issue? Should the path be specified in a different way?

was:
https://issues.apache.org/jira/browse/SPARK-41255
https://issues.apache.org/jira/browse/SPARK-43509
https://issues.apache.org/jira/browse/SPARK-43612
https://issues.apache.org/jira/browse/SPARK-43790
added four Spark Connect-only APIs to SparkSession. We should document them.
[jira] [Updated] (SPARK-44135) Document Spark Connect only API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44135:
- Summary: Document Spark Connect only API in PySpark (was: Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location)