[jira] [Commented] (SPARK-44132) nesting full outer joins confuses code generator

2023-06-23 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736659#comment-17736659
 ] 

Snoot.io commented on SPARK-44132:
--

User 'steven-aerts' has created a pull request for this issue:
https://github.com/apache/spark/pull/41712

> nesting full outer joins confuses code generator
> 
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from spark 3.3 
> until spark 3.5.
>Reporter: Steven Aerts
>Priority: Major
>
> We are seeing issues with the code generator when querying Java bean encoded 
> data with two nested joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
> {code}
> will generate invalid code in the code generator. Depending on the data used, it 
> can produce stack traces like:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code, we see that the code generator seems to be 
> mixing up parameters. For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {                          // null check on the wrong (left) parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); // causes NPE on the right parameter here
> {code}
> It is as if the nesting of two full outer joins is confusing the code 
> generator, causing it to generate invalid code.
> There is one other strange thing. We found this issue when using datasets that 
> use the Java bean encoder. We tried to reproduce it in the Spark shell and with 
> Scala case classes but were unable to do so.
> We made a reproduction scenario as unit tests (one for each of the stack traces 
> above) on the Spark code base and made it available as a [pull 
> request|https://github.com/apache/spark/pull/41688] for this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44164) Extract toAttribute method from StructField to Util class

2023-06-23 Thread Rui Wang (Jira)
Rui Wang created SPARK-44164:


 Summary: Extract toAttribute method from StructField to Util class
 Key: SPARK-44164
 URL: https://issues.apache.org/jira/browse/SPARK-44164
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Rui Wang
Assignee: Rui Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43974) Upgrade buf to v1.22.0

2023-06-23 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43974:

Summary: Upgrade buf to v1.22.0  (was: Upgrade buf to v1.21.0)

> Upgrade buf to v1.22.0
> --
>
> Key: SPARK-43974
> URL: https://issues.apache.org/jira/browse/SPARK-43974
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44163) Handle ModuleNotFoundError like ImportError

2023-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44163.
---
Resolution: Invalid

> Handle ModuleNotFoundError like ImportError
> ---
>
> Key: SPARK-44163
> URL: https://issues.apache.org/jira/browse/SPARK-44163
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.2, 3.4.1
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-44163) Handle ModuleNotFoundError like ImportError

2023-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-44163.
-

> Handle ModuleNotFoundError like ImportError
> ---
>
> Key: SPARK-44163
> URL: https://issues.apache.org/jira/browse/SPARK-44163
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.2, 3.4.1
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44163) Handle ModuleNotFoundError like ImportError

2023-06-23 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44163:
-

 Summary: Handle ModuleNotFoundError like ImportError
 Key: SPARK-44163
 URL: https://issues.apache.org/jira/browse/SPARK-44163
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.4.1, 3.3.2
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44162) Support G1GC in `spark.eventLog.gcMetrics.*` without warning

2023-06-23 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44162:
-

 Summary: Support G1GC in `spark.eventLog.gcMetrics.*` without 
warning
 Key: SPARK-44162
 URL: https://issues.apache.org/jira/browse/SPARK-44162
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Dongjoon Hyun


{code}
23/06/23 14:26:53 WARN GarbageCollectionMetrics: To enable non-built-in 
garbage collector(s) List(G1 Concurrent GC), users should configure 
it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or 
spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
{code}
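
Until that happens, the warning above can be silenced by listing the reported 
collector in the existing configs. A minimal sketch, assuming the collector name 
"G1 Concurrent GC" from the warning and the stock G1 generation names; the actual 
ticket is about making this unnecessary for G1GC:

{code:java}
// Minimal sketch: extend the existing GC metrics configs so that
// "G1 Concurrent GC" (the name reported in the warning above) is tracked.
// The other collector names are the usual G1 generation names and are assumptions.
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class GcMetricsConfig {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .set("spark.eventLog.gcMetrics.youngGenerationGarbageCollectors",
            "G1 Young Generation")
        .set("spark.eventLog.gcMetrics.oldGenerationGarbageCollectors",
            "G1 Old Generation,G1 Concurrent GC");

    SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
    spark.stop();
  }
}
{code}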



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44161) Row as UDF inputs causes encoder errors

2023-06-23 Thread Zhen Li (Jira)
Zhen Li created SPARK-44161:
---

 Summary: Row as UDF inputs causes encoder errors
 Key: SPARK-44161
 URL: https://issues.apache.org/jira/browse/SPARK-44161
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.5.0
Reporter: Zhen Li


Ensure Row inputs to UDFs can be handled correctly.
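
For context, the pattern in question looks roughly like the sketch below: a struct 
column passed to a UDF so the function receives a Row. The column names and data 
are made up for illustration; this is not the reproduction from the ticket.

{code:java}
// Hypothetical illustration only: a UDF whose input is a Row built with struct().
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

public class RowUdfExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();
    Dataset<Row> df = spark.range(3).withColumn("name", lit("x"));

    // The UDF receives the struct("id", "name") column as a Row.
    UserDefinedFunction describeRow = udf(
        (UDF1<Row, String>) r -> r.getLong(0) + ":" + r.getString(1),
        DataTypes.StringType);

    df.select(describeRow.apply(struct(col("id"), col("name")))).show();
  }
}
{code}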



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44160) Extract shared code from StructType

2023-06-23 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-44160:
-
Description: StructType has some methods that require the CatalystParser and 
Catalyst expressions. We are not planning to move the parser and expressions to 
the shared module, so we need to split the code to share as much code as possible 
between the Scala client and Catalyst.

> Extract shared code from StructType
> ---
>
> Key: SPARK-44160
> URL: https://issues.apache.org/jira/browse/SPARK-44160
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, SQL
>Affects Versions: 3.5.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>
> StructType has some methods that require the CatalystParser and Catalyst 
> expressions. We are not planning to move the parser and expressions to the 
> shared module, so we need to split the code to share as much code as possible 
> between the Scala client and Catalyst.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44160) Extract shared code from StructType

2023-06-23 Thread Rui Wang (Jira)
Rui Wang created SPARK-44160:


 Summary: Extract shared code from StructType
 Key: SPARK-44160
 URL: https://issues.apache.org/jira/browse/SPARK-44160
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, SQL
Affects Versions: 3.5.0
Reporter: Rui Wang
Assignee: Rui Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44158) Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`

2023-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44158:
-

Assignee: Dongjoon Hyun

> Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`
> ---
>
> Key: SPARK-44158
> URL: https://issues.apache.org/jira/browse/SPARK-44158
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.3, 3.4.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44158) Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`

2023-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44158.
---
Fix Version/s: 3.3.3
   3.5.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 41713
[https://github.com/apache/spark/pull/41713]

> Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`
> ---
>
> Key: SPARK-44158
> URL: https://issues.apache.org/jira/browse/SPARK-44158
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.3, 3.4.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.3.3, 3.5.0, 3.4.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44159) Commands for writing (InsertIntoHadoopFsRelationCommand and InsertIntoHiveTable) should log what they are doing

2023-06-23 Thread Navin Kumar (Jira)
Navin Kumar created SPARK-44159:
---

 Summary: Commands for writing (InsertIntoHadoopFsRelationCommand 
and InsertIntoHiveTable) should log what they are doing
 Key: SPARK-44159
 URL: https://issues.apache.org/jira/browse/SPARK-44159
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Navin Kumar


Improvements from SPARK-41763 decoupled the execution of create table and data 
writing commands in a CTAS (see SPARK-41713).

This means that while the code is cleaner, with the v1 write implementation limited 
to InsertIntoHadoopFsRelationCommand and InsertIntoHiveTable, the execution of 
these operations is less clear than it was before. Previously, the command was 
present in the physical plan (see the explain output below):
 
{{== Physical Plan ==}}
{{CommandResult }}
{{+- Execute CreateHiveTableAsSelectCommand [Database: default, TableName: 
test_hive_text_table, InsertIntoHiveTable]}}
{{+- *(1) Scan ExistingRDD[...]}}

But in Spark 3.4.0, this output is:

{{== Physical Plan ==}}
{{CommandResult }}
{{+- Execute CreateHiveTableAsSelectCommand}}
{{+- CreateHiveTableAsSelectCommand [Database: default, TableName: 
test_hive_text_table]}}
{{+- Project [...]}}
{{+- SubqueryAlias hive_input_table}}
{{+- View (`hive_input_table`, [...])}}
{{+- LogicalRDD [...], false}}

And the write command is now missing. This makes sense since execution is 
decoupled, but since there is no log output from InsertIntoHiveTable, there is 
no clear way to fully know that the command actually executed. 

I would propose that these commands add a log message at the INFO level indicating 
how many rows were written into which table, to make it easier for a user to know 
what has happened from the Spark logs. Another option may be to update the explain 
output in Spark 3.4 to handle this, but that might be more difficult and make less 
sense since the operations are now decoupled.
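
A sketch of the kind of INFO line proposed here, not actual Spark code: the logger 
name, message wording, and the table/rowCount parameters are illustrative 
assumptions (the real commands are implemented in Scala inside Spark).

{code:java}
// Illustrative only: what an INFO-level "what was written where" message could look like.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class InsertCommandLogging {
  private static final Logger LOG = LoggerFactory.getLogger("InsertIntoHiveTable");

  // Hypothetical hook, called once the write has committed.
  static void logWriteResult(String table, long rowCount) {
    LOG.info("Wrote {} row(s) into table {}", rowCount, table);
  }
}
{code}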



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44158) Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`

2023-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44158:
--
Affects Version/s: 3.2.4
   3.1.3
   3.0.3
   2.4.8

> Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`
> ---
>
> Key: SPARK-44158
> URL: https://issues.apache.org/jira/browse/SPARK-44158
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.3, 3.4.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44158) Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`

2023-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44158:
--
Affects Version/s: 3.4.1

> Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`
> ---
>
> Key: SPARK-44158
> URL: https://issues.apache.org/jira/browse/SPARK-44158
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.3, 3.4.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44158) Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`

2023-06-23 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44158:
-

 Summary: Remove unused 
`spark.kubernetes.executor.lostCheck.maxAttempts`
 Key: SPARK-44158
 URL: https://issues.apache.org/jira/browse/SPARK-44158
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.3.3
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44151) Upgrade commons-codec from 1.15 to 1.16.0

2023-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44151:
-

Assignee: BingKun Pan

> Upgrade commons-codec from 1.15 to 1.16.0
> -
>
> Key: SPARK-44151
> URL: https://issues.apache.org/jira/browse/SPARK-44151
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44151) Upgrade commons-codec from 1.15 to 1.16.0

2023-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44151.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41707
[https://github.com/apache/spark/pull/41707]

> Upgrade commons-codec from 1.15 to 1.16.0
> -
>
> Key: SPARK-44151
> URL: https://issues.apache.org/jira/browse/SPARK-44151
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44156) SortAggregation slows down dropDuplicates()

2023-06-23 Thread Emanuel Velzi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emanuel Velzi updated SPARK-44156:
--
Description: 
TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate.

How can we make Spark use HashAggregate instead of SortAggregate?

--

We have a Spark cluster running on Kubernetes with the following configurations:
 * Spark v3.3.2
 * Hadoop 3.3.4
 * Java 17

We are running a simple job on a dataset (~6 GiB) with almost 600 columns, many 
of which contain null values. The job involves the following steps:
 # Load data from S3.
 # Apply dropDuplicates().
 # Save the deduplicated data back to S3 using magic committers.

One of the columns is of type "map". When we run dropDuplicates() without 
specifying any parameters (i.e., using all columns), it throws an error:

 
{noformat}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot have 
map type columns in DataFrame which calls set operations(intersect, except, 
etc.), but the type of column my_column is 
map>>;{noformat}
 

To overcome this issue, we used "dropDuplicates(id)" by specifying an 
identifier column.

However, the performance of this method was {*}much worse than expected{*}, 
taking around 30 minutes.

As an alternative approach, we tested converting the "map" column to JSON, 
applying dropDuplicates() without parameters, and then converting the column 
back to "map" format:

 
{code:java}
DataType t = ds.schema().apply("my_column").dataType();
ds = ds.withColumn("my_column", functions.to_json(ds.col("my_column")));
ds = ds.dropDuplicates();
ds = ds.withColumn("my_column", functions.from_json(ds.col("my_column"),t)); 
{code}
 

Surprisingly, this approach {*}significantly improved the performance{*}, 
reducing the execution time to 7 minutes.

The only noticeable difference was in the execution plan. In the *slower* case, 
the execution plan involved {*}SortAggregate{*}, while in the *faster* case, it 
involved {*}HashAggregate{*}.

 
{noformat}
== Physical Plan [slow case] == 
Execute InsertIntoHadoopFsRelationCommand (13)
+- AdaptiveSparkPlan (12)
   +- == Final Plan ==
      Coalesce (8)
      +- SortAggregate (7)
         +- Sort (6)
            +- ShuffleQueryStage (5), Statistics(sizeInBytes=141.3 GiB, 
rowCount=1.25E+7)
               +- Exchange (4)
                  +- SortAggregate (3)
                     +- Sort (2)
                        +- Scan parquet  (1)
   +- == Initial Plan ==
      Coalesce (11)
      +- SortAggregate (10)
         +- Sort (9)
            +- Exchange (4)
               +- SortAggregate (3)
                  +- Sort (2)
                     +- Scan parquet  (1)
{noformat}
{noformat}
== Physical Plan [fast case] ==
Execute InsertIntoHadoopFsRelationCommand (11)
+- AdaptiveSparkPlan (10)
   +- == Final Plan ==
      Coalesce (7)
      +- HashAggregate (6)
         +- ShuffleQueryStage (5), Statistics(sizeInBytes=81.6 GiB, 
rowCount=1.25E+7)
            +- Exchange (4)
               +- HashAggregate (3)
                  +- Project (2)
                     +- Scan parquet  (1)
   +- == Initial Plan ==
      Coalesce (9)
      +- HashAggregate (8)
         +- Exchange (4)
            +- HashAggregate (3)
               +- Project (2)
                  +- Scan parquet  (1)
{noformat}
 

Based on this observation, we concluded that the difference in performance is 
related to {*}SortAggregate vs. HashAggregate{*}.

Is this line of thinking correct? How can we enforce the use of HashAggregate 
instead of SortAggregate, {*}even when deduplicating on a single column{*}?

*The final result is somewhat counterintuitive* because deduplicating using 
only one column should theoretically be faster, as it provides a simpler way to 
compare rows and determine duplicates.
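
A quick way to see which aggregate the planner picks for each variant is to print 
the plan. A minimal sketch, using the columns from this report ("id" and 
"my_column"); everything else is illustrative.

{code:java}
// Compare the plans of the slow and fast variants described above.
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataType;

public class DedupPlans {
  static void comparePlans(Dataset<Row> ds) {
    // Slow variant: deduplicate on the identifier column only.
    ds.dropDuplicates(new String[] {"id"}).explain("formatted");

    // Fast variant: serialize the map column, deduplicate on all columns, restore it.
    DataType t = ds.schema().apply("my_column").dataType();
    ds.withColumn("my_column", to_json(ds.col("my_column")))
      .dropDuplicates()
      .withColumn("my_column", from_json(col("my_column"), t))
      .explain("formatted");
  }
}
{code}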

 

  was:
TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate.

How to make Spark to use HashAggregate over SortAggregate? 

--

We have a Spark cluster running on Kubernetes with the following configurations:
 * Spark v3.3.2
 * Hadoop 3.3.4
 * Java 17

We are running a simple job on a dataset (~6GBi) with almost 600 columns, many 
of which contain null values. The job involves the following steps:
 # Load data from S3.
 # Apply dropDuplicates().
 # Save the deduplicated data back to S3 using magic committers.

One of the columns is of type "map". When we run dropDuplicates() without 
specifying any parameters (i.e., using all columns), it throws an error:

 
{noformat}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot have 
map type columns in DataFrame which calls set operations(intersect, except, 
etc.), but the type of column my_column is 
map>>;{noformat}
 

To overcome this issue, we used "dropDuplicates(id)" by specifying an 
identifier column.

However, the performance of this method was {*}much worse than expected{*}, 
taking around 30 

[jira] [Updated] (SPARK-44156) SortAggregation slows down dropDuplicates()

2023-06-23 Thread Emanuel Velzi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emanuel Velzi updated SPARK-44156:
--
Description: 
TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate.

How can we make Spark use HashAggregate instead of SortAggregate?

--

We have a Spark cluster running on Kubernetes with the following configurations:
 * Spark v3.3.2
 * Hadoop 3.3.4
 * Java 17

We are running a simple job on a dataset (~6GBi) with almost 600 columns, many 
of which contain null values. The job involves the following steps:
 # Load data from S3.
 # Apply dropDuplicates().
 # Save the deduplicated data back to S3 using magic committers.

One of the columns is of type "map". When we run dropDuplicates() without 
specifying any parameters (i.e., using all columns), it throws an error:

 
{noformat}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot have 
map type columns in DataFrame which calls set operations(intersect, except, 
etc.), but the type of column my_column is 
map>>;{noformat}
 

To overcome this issue, we used "dropDuplicates(id)" by specifying an 
identifier column.

However, the performance of this method was {*}much worse than expected{*}, 
taking around 30 minutes.

As an alternative approach, we tested converting the "map" column to JSON, 
applying dropDuplicates() without parameters, and then converting the column 
back to "map" format:

 
{code:java}
DataType t = ds.schema().apply("my_column").dataType();
ds = ds.withColumn("my_column", functions.to_json(ds.col("my_column")));
ds = ds.dropDuplicates();
ds = ds.withColumn("my_column", functions.from_json(ds.col("my_column"),t)); 
{code}
 

Surprisingly, this approach {*}significantly improved the performance{*}, 
reducing the execution time to 7 minutes.

The only noticeable difference was in the execution plan. In the *slower* case, 
the execution plan involved {*}SortAggregate{*}, while in the *faster* case, it 
involved {*}HashAggregate{*}.

 
{noformat}
== Physical Plan [slow case] == 
Execute InsertIntoHadoopFsRelationCommand (13)
+- AdaptiveSparkPlan (12)
   +- == Final Plan ==
      Coalesce (8)
      +- SortAggregate (7)
         +- Sort (6)
            +- ShuffleQueryStage (5), Statistics(sizeInBytes=141.3 GiB, 
rowCount=1.25E+7)
               +- Exchange (4)
                  +- SortAggregate (3)
                     +- Sort (2)
                        +- Scan parquet  (1)
   +- == Initial Plan ==
      Coalesce (11)
      +- SortAggregate (10)
         +- Sort (9)
            +- Exchange (4)
               +- SortAggregate (3)
                  +- Sort (2)
                     +- Scan parquet  (1)
{noformat}
{noformat}
== Physical Plan [fast case] ==
Execute InsertIntoHadoopFsRelationCommand (11)
+- AdaptiveSparkPlan (10)
   +- == Final Plan ==
      Coalesce (7)
      +- HashAggregate (6)
         +- ShuffleQueryStage (5), Statistics(sizeInBytes=81.6 GiB, 
rowCount=1.25E+7)
            +- Exchange (4)
               +- HashAggregate (3)
                  +- Project (2)
                     +- Scan parquet  (1)
   +- == Initial Plan ==
      Coalesce (9)
      +- HashAggregate (8)
         +- Exchange (4)
            +- HashAggregate (3)
               +- Project (2)
                  +- Scan parquet  (1)
{noformat}
 

Based on this observation, we concluded that the difference in performance is 
related to {*}SortAggregate vs. HashAggregate{*}.

Is this line of thinking correct? How can we enforce the use of HashAggregate 
instead of SortAggregate?

*The final result is somewhat counterintuitive* because deduplicating using 
only one column should theoretically be faster, as it provides a simpler way to 
compare rows and determine duplicates.

 

  was:
TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate.

How to make Spark to use HashAggregate over SortAggregate? 

--

We have a Spark cluster running on Kubernetes with the following configurations:
 * Spark v3.3.2
 * Hadoop 3.3.4
 * Java 17

We are running a simple job on a dataset (~6GBi) with almost 600 columns, many 
of which contain null values. The job involves the following steps:
 # Load data from S3.
 # Apply dropDuplicates().
 # Save the deduplicated data back to S3 using magic committers.

One of the columns is of type "map". When we run dropDuplicates() without 
specifying any parameters (i.e., using all columns), it throws an error:

 
{noformat}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot have 
map type columns in DataFrame which calls set operations(intersect, except, 
etc.), but the type of column my_column is 
map>>;{noformat}
 

To overcome this issue, we used "dropDuplicates(id)" by specifying an 
identifier column.

However, the performance of this method was {*}much worse than expected{*}, 
taking around 30 minutes.

As an alternative approach, we 

[jira] [Updated] (SPARK-44156) SortAggregation slows down dropDuplicates()

2023-06-23 Thread Emanuel Velzi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emanuel Velzi updated SPARK-44156:
--
Summary: SortAggregation slows down dropDuplicates()  (was: SortAggregation 
slows down dropDuplicates().)

> SortAggregation slows down dropDuplicates()
> ---
>
> Key: SPARK-44156
> URL: https://issues.apache.org/jira/browse/SPARK-44156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2
>Reporter: Emanuel Velzi
>Priority: Major
>
> TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate.
> How can we make Spark use HashAggregate instead of SortAggregate?
> --
> We have a Spark cluster running on Kubernetes with the following 
> configurations:
>  * Spark v3.3.2
>  * Hadoop 3.3.4
>  * Java 17
> We are running a simple job on a dataset (~6GBi) with almost 600 columns, 
> many of which contain null values. The job involves the following steps:
>  # Load data from S3.
>  # Apply dropDuplicates().
>  # Save the deduplicated data back to S3 using magic committers.
> One of the columns is of type "map". When we run dropDuplicates() without 
> specifying any parameters (i.e., using all columns), it throws an error:
>  
> {noformat}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot 
> have map type columns in DataFrame which calls set operations(intersect, 
> except, etc.), but the type of column my_column is 
> map>>;{noformat}
>  
> To overcome this issue, we used "dropDuplicates(id)" by specifying an 
> identifier column.
> However, the performance of this method was {*}much worse than expected{*}, 
> taking around 30 minutes.
> As an alternative approach, we tested converting the "map" column to JSON, 
> applying dropDuplicates() without parameters, and then converting the column 
> back to "map" format:
>  
> {code:java}
> DataType t = ds.schema().apply("my_column").dataType();
> ds = ds.withColumn("my_column", functions.to_json(ds.col("my_column")));
> ds = ds.dropDuplicates();
> ds = ds.withColumn("my_column", functions.from_json(ds.col("my_column"),t)); 
> {code}
>  
> Surprisingly, this approach {*}significantly improved the performance{*}, 
> reducing the execution time to 7 minutes.
> The only noticeable difference was in the execution plan. In the *slower* 
> case, the execution plan involved {*}SortAggregate{*}, while in the *faster* 
> case, it involved {*}HashAggregate{*}.
>  
> {noformat}
> == Physical Plan [slow case] == 
> Execute InsertIntoHadoopFsRelationCommand (13)
> +- AdaptiveSparkPlan (12)
>    +- == Final Plan ==
>       Coalesce (8)
>       +- SortAggregate (7)
>          +- Sort (6)
>             +- ShuffleQueryStage (5), Statistics(sizeInBytes=141.3 GiB, 
> rowCount=1.25E+7)
>                +- Exchange (4)
>                   +- SortAggregate (3)
>                      +- Sort (2)
>                         +- Scan parquet  (1)
>    +- == Initial Plan ==
>       Coalesce (11)
>       +- SortAggregate (10)
>          +- Sort (9)
>             +- Exchange (4)
>                +- SortAggregate (3)
>                   +- Sort (2)
>                      +- Scan parquet  (1)
> {noformat}
> {noformat}
> == Physical Plan [fast case] ==
> Execute InsertIntoHadoopFsRelationCommand (11)
> +- AdaptiveSparkPlan (10)
>    +- == Final Plan ==
>       Coalesce (7)
>       +- HashAggregate (6)
>          +- ShuffleQueryStage (5), Statistics(sizeInBytes=81.6 GiB, 
> rowCount=1.25E+7)
>             +- Exchange (4)
>                +- HashAggregate (3)
>                   +- Project (2)
>                      +- Scan parquet  (1)
>    +- == Initial Plan ==
>       Coalesce (9)
>       +- HashAggregate (8)
>          +- Exchange (4)
>             +- HashAggregate (3)
>                +- Project (2)
>                   +- Scan parquet  (1)
> {noformat}
>  
> Based on this observation, we concluded that the difference in performance is 
> related to {*}SortAggregate versus HashAggregate{*}.
> Is this line of thinking correct? How can we enforce the use of HashAggregate 
> instead of SortAggregate?
> *The final result is somewhat counterintuitive* because deduplicating using 
> only one column should theoretically be faster, as it provides a simpler way 
> to compare rows and determine duplicates.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44156) SortAggregation slows down dropDuplicates().

2023-06-23 Thread Emanuel Velzi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emanuel Velzi updated SPARK-44156:
--
Summary: SortAggregation slows down dropDuplicates().  (was: Should 
HashAggregation improve dropDuplicates()?)

> SortAggregation slows down dropDuplicates().
> 
>
> Key: SPARK-44156
> URL: https://issues.apache.org/jira/browse/SPARK-44156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2
>Reporter: Emanuel Velzi
>Priority: Major
>
> TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate.
> How can we make Spark use HashAggregate instead of SortAggregate?
> --
> We have a Spark cluster running on Kubernetes with the following 
> configurations:
>  * Spark v3.3.2
>  * Hadoop 3.3.4
>  * Java 17
> We are running a simple job on a dataset (~6GBi) with almost 600 columns, 
> many of which contain null values. The job involves the following steps:
>  # Load data from S3.
>  # Apply dropDuplicates().
>  # Save the deduplicated data back to S3 using magic committers.
> One of the columns is of type "map". When we run dropDuplicates() without 
> specifying any parameters (i.e., using all columns), it throws an error:
>  
> {noformat}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot 
> have map type columns in DataFrame which calls set operations(intersect, 
> except, etc.), but the type of column my_column is 
> map>>;{noformat}
>  
> To overcome this issue, we used "dropDuplicates(id)" by specifying an 
> identifier column.
> However, the performance of this method was {*}much worse than expected{*}, 
> taking around 30 minutes.
> As an alternative approach, we tested converting the "map" column to JSON, 
> applying dropDuplicates() without parameters, and then converting the column 
> back to "map" format:
>  
> {code:java}
> DataType t = ds.schema().apply("my_column").dataType();
> ds = ds.withColumn("my_column", functions.to_json(ds.col("my_column")));
> ds = ds.dropDuplicates();
> ds = ds.withColumn("my_column", functions.from_json(ds.col("my_column"),t)); 
> {code}
>  
> Surprisingly, this approach {*}significantly improved the performance{*}, 
> reducing the execution time to 7 minutes.
> The only noticeable difference was in the execution plan. In the *slower* 
> case, the execution plan involved {*}SortAggregate{*}, while in the *faster* 
> case, it involved {*}HashAggregate{*}.
>  
> {noformat}
> == Physical Plan [slow case] == 
> Execute InsertIntoHadoopFsRelationCommand (13)
> +- AdaptiveSparkPlan (12)
>    +- == Final Plan ==
>       Coalesce (8)
>       +- SortAggregate (7)
>          +- Sort (6)
>             +- ShuffleQueryStage (5), Statistics(sizeInBytes=141.3 GiB, 
> rowCount=1.25E+7)
>                +- Exchange (4)
>                   +- SortAggregate (3)
>                      +- Sort (2)
>                         +- Scan parquet  (1)
>    +- == Initial Plan ==
>       Coalesce (11)
>       +- SortAggregate (10)
>          +- Sort (9)
>             +- Exchange (4)
>                +- SortAggregate (3)
>                   +- Sort (2)
>                      +- Scan parquet  (1)
> {noformat}
> {noformat}
> == Physical Plan [fast case] ==
> Execute InsertIntoHadoopFsRelationCommand (11)
> +- AdaptiveSparkPlan (10)
>    +- == Final Plan ==
>       Coalesce (7)
>       +- HashAggregate (6)
>          +- ShuffleQueryStage (5), Statistics(sizeInBytes=81.6 GiB, 
> rowCount=1.25E+7)
>             +- Exchange (4)
>                +- HashAggregate (3)
>                   +- Project (2)
>                      +- Scan parquet  (1)
>    +- == Initial Plan ==
>       Coalesce (9)
>       +- HashAggregate (8)
>          +- Exchange (4)
>             +- HashAggregate (3)
>                +- Project (2)
>                   +- Scan parquet  (1)
> {noformat}
>  
> Based on this observation, we concluded that the difference in performance is 
> related to {*}SortAggregate versus HashAggregate{*}.
> Is this line of thinking correct? How can we enforce the use of HashAggregate 
> instead of SortAggregate?
> *The final result is somewhat counterintuitive* because deduplicating using 
> only one column should theoretically be faster, as it provides a simpler way 
> to compare rows and determine duplicates.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44157) Outdated JARs in PySpark package

2023-06-23 Thread Adrian Gonzalez-Martin (Jira)
Adrian Gonzalez-Martin created SPARK-44157:
--

 Summary: Outdated JARs in PySpark package
 Key: SPARK-44157
 URL: https://issues.apache.org/jira/browse/SPARK-44157
 Project: Spark
  Issue Type: Bug
  Components: Build, PySpark
Affects Versions: 3.4.1
Reporter: Adrian Gonzalez-Martin


The JARs shipped inside PySpark's package on PyPI don't seem to be aligned with 
the dependencies specified in Spark's own `pom.xml`.

For example, in Spark's `pom.xml`, `protobuf-java` is set to `3.21.12`:

[https://github.com/apache/spark/blob/6b1ff22dde1ead51cbf370be6e48a802daae58b6/pom.xml#L127]

However, if we look at the JARs embedded within the PySpark tarball, the version 
of `protobuf-java` is `2.5.0` (i.e. 
`/site-packages/pyspark/jars/protobuf-java-2.5.0.jar`). The same seems to apply 
to all other dependencies.

This introduces a set of CVEs which are fixed in upstream Spark but are still 
present in PySpark (e.g. `CVE-2022-3509`, `CVE-2021-22569`, `CVE-2015-5237` and a 
few others), and it potentially introduces a source of conflict whenever there's a 
breaking change in these dependencies.
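
One way to confirm which protobuf-java jar the PySpark-launched JVM actually loads 
is to ask the class loader where a protobuf class came from. A verification 
sketch, assuming `com.google.protobuf.Message` is present in both the old and the 
new jar:

{code:java}
// Verification sketch only: print the jar that provides com.google.protobuf.Message.
public class CheckProtobufJar {
  public static void main(String[] args) {
    java.net.URL location = com.google.protobuf.Message.class
        .getProtectionDomain()
        .getCodeSource()
        .getLocation();
    // Expected to point at something like .../pyspark/jars/protobuf-java-<version>.jar
    System.out.println("protobuf-java loaded from: " + location);
  }
}
{code}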



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44156) Should HashAggregation improve dropDuplicates()?

2023-06-23 Thread Emanuel Velzi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emanuel Velzi updated SPARK-44156:
--
Description: 
TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate.

How can we make Spark use HashAggregate instead of SortAggregate?

--

We have a Spark cluster running on Kubernetes with the following configurations:
 * Spark v3.3.2
 * Hadoop 3.3.4
 * Java 17

We are running a simple job on a dataset (~6GBi) with almost 600 columns, many 
of which contain null values. The job involves the following steps:
 # Load data from S3.
 # Apply dropDuplicates().
 # Save the deduplicated data back to S3 using magic committers.

One of the columns is of type "map". When we run dropDuplicates() without 
specifying any parameters (i.e., using all columns), it throws an error:

 
{noformat}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot have 
map type columns in DataFrame which calls set operations(intersect, except, 
etc.), but the type of column my_column is 
map>>;{noformat}
 

To overcome this issue, we used "dropDuplicates(id)" by specifying an 
identifier column.

However, the performance of this method was {*}much worse than expected{*}, 
taking around 30 minutes.

As an alternative approach, we tested converting the "map" column to JSON, 
applying dropDuplicates() without parameters, and then converting the column 
back to "map" format:

 
{code:java}
DataType t = ds.schema().apply("my_column").dataType();
ds = ds.withColumn("my_column", functions.to_json(ds.col("my_column")));
ds = ds.dropDuplicates();
ds = ds.withColumn("my_column", functions.from_json(ds.col("my_column"),t)); 
{code}
 

Surprisingly, this approach {*}significantly improved the performance{*}, 
reducing the execution time to 7 minutes.

The only noticeable difference was in the execution plan. In the *slower* case, 
the execution plan involved {*}SortAggregate{*}, while in the *faster* case, it 
involved {*}HashAggregate{*}.

 
{noformat}
== Physical Plan [slow case] == 
Execute InsertIntoHadoopFsRelationCommand (13)
+- AdaptiveSparkPlan (12)
   +- == Final Plan ==
      Coalesce (8)
      +- SortAggregate (7)
         +- Sort (6)
            +- ShuffleQueryStage (5), Statistics(sizeInBytes=141.3 GiB, 
rowCount=1.25E+7)
               +- Exchange (4)
                  +- SortAggregate (3)
                     +- Sort (2)
                        +- Scan parquet  (1)
   +- == Initial Plan ==
      Coalesce (11)
      +- SortAggregate (10)
         +- Sort (9)
            +- Exchange (4)
               +- SortAggregate (3)
                  +- Sort (2)
                     +- Scan parquet  (1)
{noformat}
{noformat}
== Physical Plan [fast case] ==
Execute InsertIntoHadoopFsRelationCommand (11)
+- AdaptiveSparkPlan (10)
   +- == Final Plan ==
      Coalesce (7)
      +- HashAggregate (6)
         +- ShuffleQueryStage (5), Statistics(sizeInBytes=81.6 GiB, 
rowCount=1.25E+7)
            +- Exchange (4)
               +- HashAggregate (3)
                  +- Project (2)
                     +- Scan parquet  (1)
   +- == Initial Plan ==
      Coalesce (9)
      +- HashAggregate (8)
         +- Exchange (4)
            +- HashAggregate (3)
               +- Project (2)
                  +- Scan parquet  (1)
{noformat}
 

Based on this observation, we concluded that the difference in performance is 
related to {*}SortAggregate versus HashAggregate{*}.

Is this line of thinking correct? How can we enforce the use of HashAggregate 
instead of SortAggregate?

*The final result is somewhat counterintuitive* because deduplicating using 
only one column should theoretically be faster, as it provides a simpler way to 
compare rows and determine duplicates.

 

  was:
TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate. 

How to make Spark to use HashAggregate over SortAggregate? 

--

We have a Spark cluster running on Kubernetes with the following configurations:
 * Spark v3.3.2
 * Hadoop 3.3.4
 * Java 17

We are running a simple job on a dataset (~6GBi) with almost 600 columns, many 
of which contain null values. The job involves the following steps:
 # Load data from S3.
 # Apply dropDuplicates().
 # Save the deduplicated data back to S3 using magic committers.

One of the columns is of type "map". When we run dropDuplicates() without 
specifying any parameters (i.e., using all columns), it throws an error:

 
{noformat}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot have 
map type columns in DataFrame which calls set operations(intersect, except, 
etc.), but the type of column my_column is 
map>>;{noformat}
 

To overcome this issue, we used "dropDuplicates(id)" by specifying an 
identifier column.

However, the performance of this method was {*}much worse than expected{*}, 
taking around 30 minutes.

As an alternative approach, we 

[jira] [Created] (SPARK-44156) Should HashAggregation improve dropDuplicates()?

2023-06-23 Thread Emanuel Velzi (Jira)
Emanuel Velzi created SPARK-44156:
-

 Summary: Should HashAggregation improve dropDuplicates()?
 Key: SPARK-44156
 URL: https://issues.apache.org/jira/browse/SPARK-44156
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.2
Reporter: Emanuel Velzi


TL;DR: SortAggregate makes dropDuplicates slower than HashAggregate. 

How can we make Spark use HashAggregate instead of SortAggregate?

--

We have a Spark cluster running on Kubernetes with the following configurations:
 * Spark v3.3.2
 * Hadoop 3.3.4
 * Java 17

We are running a simple job on a dataset (~6GBi) with almost 600 columns, many 
of which contain null values. The job involves the following steps:
 # Load data from S3.
 # Apply dropDuplicates().
 # Save the deduplicated data back to S3 using magic committers.

One of the columns is of type "map". When we run dropDuplicates() without 
specifying any parameters (i.e., using all columns), it throws an error:

 
{noformat}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot have 
map type columns in DataFrame which calls set operations(intersect, except, 
etc.), but the type of column my_column is 
map>>;{noformat}
 

To overcome this issue, we used "dropDuplicates(id)" by specifying an 
identifier column.

However, the performance of this method was {*}much worse than expected{*}, 
taking around 30 minutes.

As an alternative approach, we tested converting the "map" column to JSON, 
applying dropDuplicates() without parameters, and then converting the column 
back to "map" format:

 
{code:java}
DataType t = ds.schema().apply("my_column").dataType();
ds = ds.withColumn("my_column", functions.to_json(ds.col("my_column")));
ds = ds.dropDuplicates();
ds = ds.withColumn("my_column", functions.from_json(ds.col("my_column"),t)); 
{code}
 

Surprisingly, this approach {*}significantly improved the performance{*}, 
reducing the execution time to 7 minutes.

The only noticeable difference was in the execution plan. In the *slower* case, 
the execution plan involved {*}SortAggregate{*}, while in the *faster* case, it 
involved {*}HashAggregate{*}.

 
{noformat}
== Physical Plan [slow case] == 
Execute InsertIntoHadoopFsRelationCommand (13)
+- AdaptiveSparkPlan (12)
   +- == Final Plan ==
      Coalesce (8)
      +- SortAggregate (7)
         +- Sort (6)
            +- ShuffleQueryStage (5), Statistics(sizeInBytes=141.3 GiB, 
rowCount=1.25E+7)
               +- Exchange (4)
                  +- SortAggregate (3)
                     +- Sort (2)
                        +- Scan parquet  (1)
   +- == Initial Plan ==
      Coalesce (11)
      +- SortAggregate (10)
         +- Sort (9)
            +- Exchange (4)
               +- SortAggregate (3)
                  +- Sort (2)
                     +- Scan parquet  (1)
{noformat}
 

 

 
{noformat}
== Physical Plan [fast case] ==
Execute InsertIntoHadoopFsRelationCommand (11)
+- AdaptiveSparkPlan (10)
   +- == Final Plan ==
      Coalesce (7)
      +- HashAggregate (6)
         +- ShuffleQueryStage (5), Statistics(sizeInBytes=81.6 GiB, 
rowCount=1.25E+7)
            +- Exchange (4)
               +- HashAggregate (3)
                  +- Project (2)
                     +- Scan parquet  (1)
   +- == Initial Plan ==
      Coalesce (9)
      +- HashAggregate (8)
         +- Exchange (4)
            +- HashAggregate (3)
               +- Project (2)
                  +- Scan parquet  (1)
{noformat}
 

 

Based on this observation, we concluded that the difference in performance is 
related to {*}SortAggregate versus HashAggregate{*}.

Is this line of thinking correct? How can we enforce the use of HashAggregate 
instead of SortAggregate?

*The final result is somewhat counterintuitive* because deduplicating using 
only one column should theoretically be faster, as it provides a simpler way to 
compare rows and determine duplicates.





 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44134) Can't set resources (GPU/FPGA) to 0 when they are set to positive value in spark-defaults.conf

2023-06-23 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-44134:
--
Fix Version/s: 3.4.2
   (was: 3.4.1)

> Can't set resources (GPU/FPGA) to 0 when they are set to positive value in 
> spark-defaults.conf
> --
>
> Key: SPARK-44134
> URL: https://issues.apache.org/jira/browse/SPARK-44134
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.3.3, 3.5.0, 3.4.2
>
>
> With resource aware scheduling, if you specify a default value in the 
> spark-defaults.conf, a user can't override that to set it to 0.
> Meaning spark-defaults.conf has something like:
> {{spark.executor.resource.\{resourceName}.amount=1}}
> {{spark.task.resource.\{resourceName}.amount=1}}
> If the user tries to override when submitting an application with 
> {{spark.executor.resource.\{resourceName}.amount=0}} and 
> {{spark.task.resource.\{resourceName}.amount=0}}, it gives the user an error:
>  
> {code:java}
> 23/06/21 09:12:57 ERROR Main: Failed to initialize Spark session.
> org.apache.spark.SparkException: No executor resource configs were not 
> specified for the following task configs: gpu
>         at 
> org.apache.spark.resource.ResourceProfile.calculateTasksAndLimitingResource(ResourceProfile.scala:206)
>         at 
> org.apache.spark.resource.ResourceProfile.$anonfun$limitingResource$1(ResourceProfile.scala:139)
>         at scala.Option.getOrElse(Option.scala:189)
>         at 
> org.apache.spark.resource.ResourceProfile.limitingResource(ResourceProfile.scala:138)
>         at 
> org.apache.spark.resource.ResourceProfileManager.addResourceProfile(ResourceProfileManager.scala:95)
>         at 
> org.apache.spark.resource.ResourceProfileManager.(ResourceProfileManager.scala:49)
>         at org.apache.spark.SparkContext.(SparkContext.scala:455)
>         at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
>         at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953){code}
> This used to work; my guess is this may have gotten broken with the stage-level 
> scheduling feature.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44132) nesting full outer joins confuses code generator

2023-06-23 Thread Steven Aerts (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736490#comment-17736490
 ] 

Steven Aerts commented on SPARK-44132:
--

[https://github.com/apache/spark/pull/41712] was created with a proposed fix and 
a reproduction scenario for this problem.
Let me know if you would prefer me to update this Jira ticket, as it still refers 
to the BeanEncoder, which had nothing to do with it.

> nesting full outer joins confuses code generator
> 
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from spark 3.3 
> until spark 3.5.
>Reporter: Steven Aerts
>Priority: Major
>
> We are seeing issues with the code generator when querying Java bean encoded 
> data with two nested joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
> {code}
> will generate invalid code in the code generator. Depending on the data used, it 
> can produce stack traces like:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code, we see that the code generator seems to be 
> mixing up parameters. For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {                          // null check on the wrong (left) parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); // causes NPE on the right parameter here
> {code}
> It is as if the nesting of two full outer joins is confusing the code 
> generator, causing it to generate invalid code.
> There is one other strange thing. We found this issue when using datasets that 
> use the Java bean encoder. We tried to reproduce it in the Spark shell and with 
> Scala case classes but were unable to do so.
> We made a reproduction scenario as unit tests (one for each of the stack traces 
> above) on the Spark code base and made it available as a [pull 
> request|https://github.com/apache/spark/pull/41688] for this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43438) Fix mismatched column list error on INSERT

2023-06-23 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736485#comment-17736485
 ] 

BingKun Pan edited comment on SPARK-43438 at 6/23/23 12:48 PM:
---

I checked and found the following:

1. When executing the SQL "INSERT INTO tabtest SELECT 1", it executes successfully.

There is a default value completion operation:

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42]

 

2. When executing the SQL "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is not enough data columns:
Table columns: `c1`, `c2`.
Data columns: `1`.

 

3. When executing the SQL "INSERT INTO tabtest(c1) SELECT 1, 2, 3", the error is 
as follows:

[INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is too many data columns:
Table columns: `c1`.
Data columns: `1`, `2`, `3`.

 

Among them, 2 and 3 are in line with our expectations after 
`[https://github.com/apache/spark/pull/41458]`.

But the behavior difference between 1 and 2 is a bit confusing.

 

*Should we align the logic of 1 and 2?*
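
For reference, the three cases above as a single runnable sketch. The SQL comes 
from this comment and the ticket's repro; the `USING parquet` clause, the local 
master, and the try/catch are only there so the sketch runs without Hive support.

{code:java}
// Runnable consolidation of the three INSERT cases discussed in this comment.
import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.SparkSession;

public class InsertArityRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().master("local[1]").getOrCreate();
    spark.sql("CREATE TABLE tabtest(c1 INT, c2 INT) USING parquet");

    // Case 1: succeeds per the comment above (c2 is filled by default value completion).
    spark.sql("INSERT INTO tabtest SELECT 1");

    for (String bad : new String[] {
        "INSERT INTO tabtest(c1, c2) SELECT 1",   // case 2: NOT_ENOUGH_DATA_COLUMNS
        "INSERT INTO tabtest(c1) SELECT 1, 2, 3"  // case 3: TOO_MANY_DATA_COLUMNS
    }) {
      try {
        spark.sql(bad);
      } catch (AnalysisException e) {
        System.out.println(bad + " -> " + e.getMessage());
      }
    }
    spark.stop();
  }
}
{code}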


was (Author: panbingkun):
I checked and found that after `[https://github.com/apache/spark/pull/41458]`,

1.when execute sql "INSERT INTO tabtest SELECT 1", will execute successfully.

There is a default value completion operation.

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42]

 

2.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is not enough data columns:
Table columns: `c1`, `c2`.
Data columns: `1`.

 

3.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is too many data columns:
Table columns: `c1`.
Data columns: `1`, `2`, `3`.

 

Among them, 2 and 3 are in line with our expectations.

But the behavior difference between 1 and 2 is a bit confusing.

 

*Should we align the logic of 1 and 2?*

> Fix mismatched column list error on INSERT
> --
>
> Key: SPARK-43438
> URL: https://issues.apache.org/jira/browse/SPARK-43438
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> This error message is pretty bad, and common
> "_LEGACY_ERROR_TEMP_1038" : {
> "message" : [
> "Cannot write to table due to mismatched user specified column 
> size() and data column size()."
> ]
> },
> It can perhaps be merged with this one - after giving it an ERROR_CLASS
> "_LEGACY_ERROR_TEMP_1168" : {
> "message" : [
> " requires that the data to be inserted have the same number of 
> columns as the target table: target table has  column(s) but 
> the inserted data has  column(s), including  
> partition column(s) having constant value(s)."
> ]
> },
> Repro:
> CREATE TABLE tabtest(c1 INT, c2 INT);
> INSERT INTO tabtest SELECT 1;
> `spark_catalog`.`default`.`tabtest` requires that the data to be inserted 
> have the same number of columns as the target table: target table has 2 
> column(s) but the inserted data has 1 column(s), including 0 partition 
> column(s) having constant value(s).
> INSERT INTO tabtest(c1) SELECT 1, 2, 3;
> Cannot write to table due to mismatched user specified column size(1) and 
> data column size(3).; line 1 pos 24
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43438) Fix mismatched column list error on INSERT

2023-06-23 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736485#comment-17736485
 ] 

BingKun Pan edited comment on SPARK-43438 at 6/23/23 12:44 PM:
---

I checked and found that, after `[https://github.com/apache/spark/pull/41458]`:

1. When executing the SQL "INSERT INTO tabtest SELECT 1", it executes successfully.

There is a default-value completion operation; see:

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42]

 

2. When executing the SQL "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is not enough data columns:
Table columns: `c1`, `c2`.
Data columns: `1`.

 

3. When executing the SQL "INSERT INTO tabtest(c1) SELECT 1, 2, 3", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is too many data columns:
Table columns: `c1`.
Data columns: `1`, `2`, `3`.

 

Among them, 2 and 3 are in line with our expectations.

But the behavior difference between 1 and 2 is a bit confusing.

 

*Should we align the logic of 1 and 2?*


was (Author: panbingkun):
I checked and found that after `[https://github.com/apache/spark/pull/41458]`,

1.when execute sql "INSERT INTO tabtest SELECT 1", will execute successfully.

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42]

 

2.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is not enough data columns:
Table columns: `c1`, `c2`.
Data columns: `1`.

 

3.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is too many data columns:
Table columns: `c1`.
Data columns: `1`, `2`, `3`.

 

Among them, 2 and 3 are in line with our expectations.

But the behavior difference between 1 and 2 is a bit confusing.

 

*Should we align the logic of 1 and 2?*

> Fix mismatched column list error on INSERT
> --
>
> Key: SPARK-43438
> URL: https://issues.apache.org/jira/browse/SPARK-43438
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> This error message is pretty bad, and common
> "_LEGACY_ERROR_TEMP_1038" : {
> "message" : [
> "Cannot write to table due to mismatched user specified column 
> size() and data column size()."
> ]
> },
> It can perhaps be merged with this one - after giving it an ERROR_CLASS
> "_LEGACY_ERROR_TEMP_1168" : {
> "message" : [
> " requires that the data to be inserted have the same number of 
> columns as the target table: target table has  column(s) but 
> the inserted data has  column(s), including  
> partition column(s) having constant value(s)."
> ]
> },
> Repro:
> CREATE TABLE tabtest(c1 INT, c2 INT);
> INSERT INTO tabtest SELECT 1;
> `spark_catalog`.`default`.`tabtest` requires that the data to be inserted 
> have the same number of columns as the target table: target table has 2 
> column(s) but the inserted data has 1 column(s), including 0 partition 
> column(s) having constant value(s).
> INSERT INTO tabtest(c1) SELECT 1, 2, 3;
> Cannot write to table due to mismatched user specified column size(1) and 
> data column size(3).; line 1 pos 24
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43438) Fix mismatched column list error on INSERT

2023-06-23 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736485#comment-17736485
 ] 

BingKun Pan edited comment on SPARK-43438 at 6/23/23 12:42 PM:
---

I checked and found that, after `[https://github.com/apache/spark/pull/41458]`:

1. When executing the SQL "INSERT INTO tabtest SELECT 1", it executes successfully.

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42]

 

2. When executing the SQL "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is not enough data columns:
Table columns: `c1`, `c2`.
Data columns: `1`.

 

3. When executing the SQL "INSERT INTO tabtest(c1) SELECT 1, 2, 3", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is too many data columns:
Table columns: `c1`.
Data columns: `1`, `2`, `3`.

 

Among them, 2 and 3 are in line with our expectations.

But the behavior difference between 1 and 2 is a bit confusing.

 

*Should we align the logic of 1 and 2?*


was (Author: panbingkun):
I checked and found that after `[https://github.com/apache/spark/pull/41458]`,

1.when execute sql "INSERT INTO tabtest SELECT 1", will execute successfully.

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42]

 

2.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is not enough data columns:
Table columns: `c1`, `c2`.
Data columns: `1`.

 

3.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`t1`, the reason is too many data columns:
Table columns: `c1`.
Data columns: `1`, `2`, `3`.

 

Among them, 2 and 3 are in line with our expectations.

But the behavior difference between 1 and 2 is a bit confusing.

 

*Should we align the logic of 1 and 2?*

> Fix mismatched column list error on INSERT
> --
>
> Key: SPARK-43438
> URL: https://issues.apache.org/jira/browse/SPARK-43438
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> This error message is pretty bad, and common
> "_LEGACY_ERROR_TEMP_1038" : {
> "message" : [
> "Cannot write to table due to mismatched user specified column 
> size() and data column size()."
> ]
> },
> It can perhaps be merged with this one - after giving it an ERROR_CLASS
> "_LEGACY_ERROR_TEMP_1168" : {
> "message" : [
> " requires that the data to be inserted have the same number of 
> columns as the target table: target table has  column(s) but 
> the inserted data has  column(s), including  
> partition column(s) having constant value(s)."
> ]
> },
> Repro:
> CREATE TABLE tabtest(c1 INT, c2 INT);
> INSERT INTO tabtest SELECT 1;
> `spark_catalog`.`default`.`tabtest` requires that the data to be inserted 
> have the same number of columns as the target table: target table has 2 
> column(s) but the inserted data has 1 column(s), including 0 partition 
> column(s) having constant value(s).
> INSERT INTO tabtest(c1) SELECT 1, 2, 3;
> Cannot write to table due to mismatched user specified column size(1) and 
> data column size(3).; line 1 pos 24
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43438) Fix mismatched column list error on INSERT

2023-06-23 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736485#comment-17736485
 ] 

BingKun Pan edited comment on SPARK-43438 at 6/23/23 12:40 PM:
---

I checked and found that, after `[https://github.com/apache/spark/pull/41458]`:

1. When executing the SQL "INSERT INTO tabtest SELECT 1", it executes successfully.

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42]

 

2. When executing the SQL "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is not enough data columns:
Table columns: `c1`, `c2`.
Data columns: `1`.

 

3. When executing the SQL "INSERT INTO tabtest(c1) SELECT 1, 2, 3", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`t1`, the reason is too many data columns:
Table columns: `c1`.
Data columns: `1`, `2`, `3`.

 

Among them, 2 and 3 are in line with our expectations.

But the behavior difference between 1 and 2 is a bit confusing.

 

*Should we align the logic of 1 and 2?*


was (Author: panbingkun):
I checked and found that after `[https://github.com/apache/spark/pull/41458]`,

1.when execute sql "INSERT INTO tabtest SELECT 1", will execute successfully.

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42]

 

2.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is not enough data columns:
Table columns: `c1`, `c2`.
Data columns: `1`.

 

3.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`t1`, the reason is too many data columns:
Table columns: `c1`.
Data columns: `1`, `2`, `3`.

 

Among them, 2 and 3 are in line with our expectations.

But the behavior difference between 1 and 2 is a bit confusing.

> Fix mismatched column list error on INSERT
> --
>
> Key: SPARK-43438
> URL: https://issues.apache.org/jira/browse/SPARK-43438
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> This error message is pretty bad, and common
> "_LEGACY_ERROR_TEMP_1038" : {
> "message" : [
> "Cannot write to table due to mismatched user specified column 
> size() and data column size()."
> ]
> },
> It can perhaps be merged with this one - after giving it an ERROR_CLASS
> "_LEGACY_ERROR_TEMP_1168" : {
> "message" : [
> " requires that the data to be inserted have the same number of 
> columns as the target table: target table has  column(s) but 
> the inserted data has  column(s), including  
> partition column(s) having constant value(s)."
> ]
> },
> Repro:
> CREATE TABLE tabtest(c1 INT, c2 INT);
> INSERT INTO tabtest SELECT 1;
> `spark_catalog`.`default`.`tabtest` requires that the data to be inserted 
> have the same number of columns as the target table: target table has 2 
> column(s) but the inserted data has 1 column(s), including 0 partition 
> column(s) having constant value(s).
> INSERT INTO tabtest(c1) SELECT 1, 2, 3;
> Cannot write to table due to mismatched user specified column size(1) and 
> data column size(3).; line 1 pos 24
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43438) Fix mismatched column list error on INSERT

2023-06-23 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736485#comment-17736485
 ] 

BingKun Pan edited comment on SPARK-43438 at 6/23/23 12:39 PM:
---

I checked and found that, after `[https://github.com/apache/spark/pull/41458]`:

1. When executing the SQL "INSERT INTO tabtest SELECT 1", it executes successfully.

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L42]

 

2. When executing the SQL "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is not enough data columns:
Table columns: `c1`, `c2`.
Data columns: `1`.

 

3. When executing the SQL "INSERT INTO tabtest(c1) SELECT 1, 2, 3", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`t1`, the reason is too many data columns:
Table columns: `c1`.
Data columns: `1`, `2`, `3`.

 

Among them, 2 and 3 are in line with our expectations.

But the behavior difference between 1 and 2 is a bit confusing.


was (Author: panbingkun):
I checked and found that after `[https://github.com/apache/spark/pull/41458]`,

1.when execute sql "INSERT INTO tabtest SELECT 1", will execute successfully.

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401]

 

2.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is not enough data columns:
Table columns: `c1`, `c2`.
Data columns: `1`.

 

3.when execute sql "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`t1`, the reason is too many data columns:
Table columns: `c1`.
Data columns: `1`, `2`, `3`.

 

Among them, 2 and 3 are in line with our expectations.

But the behavior difference between 1 and 2 is a bit confusing.

> Fix mismatched column list error on INSERT
> --
>
> Key: SPARK-43438
> URL: https://issues.apache.org/jira/browse/SPARK-43438
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> This error message is pretty bad, and common
> "_LEGACY_ERROR_TEMP_1038" : {
> "message" : [
> "Cannot write to table due to mismatched user specified column 
> size() and data column size()."
> ]
> },
> It can perhaps be merged with this one - after giving it an ERROR_CLASS
> "_LEGACY_ERROR_TEMP_1168" : {
> "message" : [
> " requires that the data to be inserted have the same number of 
> columns as the target table: target table has  column(s) but 
> the inserted data has  column(s), including  
> partition column(s) having constant value(s)."
> ]
> },
> Repro:
> CREATE TABLE tabtest(c1 INT, c2 INT);
> INSERT INTO tabtest SELECT 1;
> `spark_catalog`.`default`.`tabtest` requires that the data to be inserted 
> have the same number of columns as the target table: target table has 2 
> column(s) but the inserted data has 1 column(s), including 0 partition 
> column(s) having constant value(s).
> INSERT INTO tabtest(c1) SELECT 1, 2, 3;
> Cannot write to table due to mismatched user specified column size(1) and 
> data column size(3).; line 1 pos 24
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43438) Fix mismatched column list error on INSERT

2023-06-23 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736485#comment-17736485
 ] 

BingKun Pan commented on SPARK-43438:
-

I checked and found that, after `[https://github.com/apache/spark/pull/41458]`:

1. When executing the SQL "INSERT INTO tabtest SELECT 1", it executes successfully.

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L393-L397]

[https://github.com/apache/spark/blob/cd69d4dd18cfaccf58bf64dde6268f7ea1d4415b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L401]

 

2. When executing the SQL "INSERT INTO tabtest(c1, c2) SELECT 1", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`tabtest`, the reason is not enough data columns:
Table columns: `c1`, `c2`.
Data columns: `1`.

 

3. When executing the SQL "INSERT INTO tabtest(c1) SELECT 1, 2, 3", the error is as 
follows:

[INSERT_COLUMN_ARITY_MISMATCH.TOO_MANY_DATA_COLUMNS] Cannot write to 
`spark_catalog`.`default`.`t1`, the reason is too many data columns:
Table columns: `c1`.
Data columns: `1`, `2`, `3`.

 

Among them, 2 and 3 are in line with our expectations.

But the behavior difference between 1 and 2 is a bit confusing.

> Fix mismatched column list error on INSERT
> --
>
> Key: SPARK-43438
> URL: https://issues.apache.org/jira/browse/SPARK-43438
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> This error message is pretty bad, and common
> "_LEGACY_ERROR_TEMP_1038" : {
> "message" : [
> "Cannot write to table due to mismatched user specified column 
> size() and data column size()."
> ]
> },
> It can perhaps be merged with this one - after giving it an ERROR_CLASS
> "_LEGACY_ERROR_TEMP_1168" : {
> "message" : [
> " requires that the data to be inserted have the same number of 
> columns as the target table: target table has  column(s) but 
> the inserted data has  column(s), including  
> partition column(s) having constant value(s)."
> ]
> },
> Repro:
> CREATE TABLE tabtest(c1 INT, c2 INT);
> INSERT INTO tabtest SELECT 1;
> `spark_catalog`.`default`.`tabtest` requires that the data to be inserted 
> have the same number of columns as the target table: target table has 2 
> column(s) but the inserted data has 1 column(s), including 0 partition 
> column(s) having constant value(s).
> INSERT INTO tabtest(c1) SELECT 1, 2, 3;
> Cannot write to table due to mismatched user specified column size(1) and 
> data column size(3).; line 1 pos 24
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44155) Adding a dev utility to improve error messages based on LLM

2023-06-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736424#comment-17736424
 ] 

ASF GitHub Bot commented on SPARK-44155:


User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/41711

>  Adding a dev utility to improve error messages based on LLM
> 
>
> Key: SPARK-44155
> URL: https://issues.apache.org/jira/browse/SPARK-44155
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Adding a utility function to assist with error message improvement tasks.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44151) Upgrade commons-codec from 1.15 to 1.16.0

2023-06-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736421#comment-17736421
 ] 

ASF GitHub Bot commented on SPARK-44151:


User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41707

> Upgrade commons-codec from 1.15 to 1.16.0
> -
>
> Key: SPARK-44151
> URL: https://issues.apache.org/jira/browse/SPARK-44151
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44153) Support `Heap Histogram` column in Executor tab

2023-06-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736420#comment-17736420
 ] 

ASF GitHub Bot commented on SPARK-44153:


User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41709

> Support `Heap Histogram` column in Executor tab
> ---
>
> Key: SPARK-44153
> URL: https://issues.apache.org/jira/browse/SPARK-44153
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44155) Adding a dev utility to improve error messages based on LLM

2023-06-23 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-44155:
---

 Summary:  Adding a dev utility to improve error messages based on 
LLM
 Key: SPARK-44155
 URL: https://issues.apache.org/jira/browse/SPARK-44155
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Adding a utility function to assist with error message improvement tasks.
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-23 Thread Ramakrishna (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736404#comment-17736404
 ] 

Ramakrishna commented on SPARK-44152:
-

[~gurwls223] 

Is this an issue in Spark 3.4.0? At least I am facing this issue, with all other 
constraints remaining unchanged.

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44152
> URL: https://issues.apache.org/jira/browse/SPARK-44152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  
> I have a spark application that is deployed using k8s and it is of version 
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}*
>  
> I have this in deployment.yaml of the app
>  
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>  
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the same 
> issue as me? Should the path be specified in a different way?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42877) Implement DataFrame.foreach

2023-06-23 Thread Gurpreet Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736399#comment-17736399
 ] 

Gurpreet Singh commented on SPARK-42877:


[~XinrongM] I am interested in working on this issue. I am new to this 
codebase, so could you maybe provide some more context on this? Thanks 

> Implement DataFrame.foreach
> ---
>
> Key: SPARK-42877
> URL: https://issues.apache.org/jira/browse/SPARK-42877
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Maybe we can leverage UDFs to implement that, such as 
> `df.select(udf(*df.schema)).count()`.
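
For context, below is a minimal, hedged Java sketch of what the classic
(non-Connect) Dataset API already offers for this operation. The session setup and
sample data are assumptions for illustration only, and this is not the Spark
Connect implementation the ticket asks for; it just shows the behavior that a
UDF-plus-count trick would have to emulate.
{code:java}
import java.util.Arrays;

import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ForeachSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("foreach-sketch")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> df = spark
        .createDataset(Arrays.asList(1, 2, 3), Encoders.INT())
        .toDF("value");

    // Runs the function on executors for every row; results can only escape
    // through side effects, which is why a UDF-based emulation is discussed
    // for Spark Connect.
    df.foreach((ForeachFunction<Row>) row -> System.out.println(row.getInt(0)));

    spark.stop();
  }
}
{code}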



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44153) Support `Heap Histogram` column in Executor tab

2023-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44153.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41709
[https://github.com/apache/spark/pull/41709]

> Support `Heap Histogram` column in Executor tab
> ---
>
> Key: SPARK-44153
> URL: https://issues.apache.org/jira/browse/SPARK-44153
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44153) Support `Heap Histogram` column in Executor tab

2023-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44153:
-

Assignee: Dongjoon Hyun

> Support `Heap Histogram` column in Executor tab
> ---
>
> Key: SPARK-44153
> URL: https://issues.apache.org/jira/browse/SPARK-44153
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736365#comment-17736365
 ] 

Hyukjin Kwon commented on SPARK-44152:
--

This is from https://issues.apache.org/jira/browse/SPARK-44135. I made a 
mistake in the JIRA number, so I manually switched both JIRAs.

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44152
> URL: https://issues.apache.org/jira/browse/SPARK-44152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  
> I have a spark application that is deployed using k8s and it is of version 
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}*
>  
> I have this in deployment.yaml of the app
>  
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>  
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the same 
> issue as me? Should the path be specified in a different way?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44152:
-
Affects Version/s: 3.4.0
   (was: 3.5.0)
  Description: 
 
I have a spark application that is deployed using k8s and it is of version 
3.3.2. Recently there were some vulnerabilities in spark 3.3.2.

I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
application jar is built on spark 3.4.0

However while deploying, I get this error

        

*{{Exception in thread "main" java.nio.file.NoSuchFileException: 
/spark-assembly-1.0.jar}}*

 

I have this in deployment.yaml of the app

 

*mainApplicationFile: "local:spark-assembly-1.0.jar"*

 

 

 

 

and I have not changed anything related to that. I see that some code has 
changed in spark 3.4.0 core's source code regarding jar location.

Has it really changed the functionality? Is there anyone who is facing the same 
issue as me? Should the path be specified in a different way?

  was:


 
I have a spark application that is deployed using k8s and it is of version 
3.3.2 Recently there were some vulneabilities in spark 3.3.2

I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
application jar is built on spark 3.4.0

However while deploying, I get this error

        

*{{Exception in thread "main" java.nio.file.NoSuchFileException: 
/spark-assembly-1.0.jar}}*

 

I have this in deployment.yaml of the app

 

*mainApplicationFile: "local:spark-assembly-1.0.jar"*

 

 

 

 

and I have not changed anything related to that. I see that some code has 
changed in spark 3.4.0 core's source code regarding jar location.

Has it really changed the functionality ? Is there anyone who is facing same 
issue as me ? Should the path be specified in a different way?


> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44152
> URL: https://issues.apache.org/jira/browse/SPARK-44152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  
> I have a spark application that is deployed using k8s and it is of version 
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}*
>  
> I have this in deployment.yaml of the app
>  
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>  
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the same 
> issue as me? Should the path be specified in a different way?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44135) Document Spark Connect only API in PySpark

2023-06-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736364#comment-17736364
 ] 

Hyukjin Kwon commented on SPARK-44135:
--

I made a mistake during the JIRA resolution. I moved the original JIRA here: 
https://issues.apache.org/jira/browse/SPARK-44152

> Document Spark Connect only API in PySpark 
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Documentation
>  Components: Connect, Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.5.0
>
>
> https://issues.apache.org/jira/browse/SPARK-41255
> https://issues.apache.org/jira/browse/SPARK-43509
> https://issues.apache.org/jira/browse/SPARK-43612
> https://issues.apache.org/jira/browse/SPARK-43790
> added four Spark Connect only API to Spark Session. We should document them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44135) Document Spark Connect only API in PySpark

2023-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44135:


Assignee: Hyukjin Kwon

> Document Spark Connect only API in PySpark 
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Documentation
>  Components: Connect, Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Assignee: Hyukjin Kwon
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-41255
> https://issues.apache.org/jira/browse/SPARK-43509
> https://issues.apache.org/jira/browse/SPARK-43612
> https://issues.apache.org/jira/browse/SPARK-43790
> added four Spark Connect only API to Spark Session. We should document them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44135) Document Spark Connect only API in PySpark

2023-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44135:
-
Issue Type: Documentation  (was: Bug)

> Document Spark Connect only API in PySpark 
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Documentation
>  Components: Connect, Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-41255
> https://issues.apache.org/jira/browse/SPARK-43509
> https://issues.apache.org/jira/browse/SPARK-43612
> https://issues.apache.org/jira/browse/SPARK-43790
> added four Spark Connect only API to Spark Session. We should document them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44135) Document Spark Connect only API in PySpark

2023-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44135:
-
Epic Link: SPARK-39375

> Document Spark Connect only API in PySpark 
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Documentation
>  Components: Connect, Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-41255
> https://issues.apache.org/jira/browse/SPARK-43509
> https://issues.apache.org/jira/browse/SPARK-43612
> https://issues.apache.org/jira/browse/SPARK-43790
> added four Spark Connect only API to Spark Session. We should document them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44152:
-
Issue Type: Bug  (was: Documentation)

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44152
> URL: https://issues.apache.org/jira/browse/SPARK-44152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  
> I have a spark application that is deployed using k8s and it is of version 
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}*
>  
> I have this in deployment.yaml of the app
>  
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>  
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the same 
> issue as me? Should the path be specified in a different way?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44135) Document Spark Connect only API in PySpark

2023-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44135.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41708
[https://github.com/apache/spark/pull/41708]

> Document Spark Connect only API in PySpark 
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Documentation
>  Components: Connect, Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.5.0
>
>
> https://issues.apache.org/jira/browse/SPARK-41255
> https://issues.apache.org/jira/browse/SPARK-43509
> https://issues.apache.org/jira/browse/SPARK-43612
> https://issues.apache.org/jira/browse/SPARK-43790
> added four Spark Connect only API to Spark Session. We should document them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44152:
-
Epic Link:   (was: SPARK-39375)

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44152
> URL: https://issues.apache.org/jira/browse/SPARK-44152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  
> I have a spark application that is deployed using k8s and it is of version 
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}*
>  
> I have this in deployment.yaml of the app
>  
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>  
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the same 
> issue as me? Should the path be specified in a different way?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44152:
-
Summary: Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
java.nio.file.NoSuchFileException: , although jar is present in the location  
(was: Document Spark Connect only API in PySpark)

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44152
> URL: https://issues.apache.org/jira/browse/SPARK-44152
> Project: Spark
>  Issue Type: Documentation
>  Components: Connect, Documentation, PySpark
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  
> I have a spark application that is deployed using k8s and it is of version 
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}*
>  
> I have this in deployment.yaml of the app
>  
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>  
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the same 
> issue as me? Should the path be specified in a different way?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44152:
-
Component/s: Spark Core
 (was: Connect)
 (was: Documentation)
 (was: PySpark)

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44152
> URL: https://issues.apache.org/jira/browse/SPARK-44152
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  
> I have a spark application that is deployed using k8s and it is of version 
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}*
>  
> I have this in deployment.yaml of the app
>  
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>  
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the same 
> issue as me? Should the path be specified in a different way?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44135:
-
Description: 
https://issues.apache.org/jira/browse/SPARK-41255
https://issues.apache.org/jira/browse/SPARK-43509
https://issues.apache.org/jira/browse/SPARK-43612
https://issues.apache.org/jira/browse/SPARK-43790

added four Spark Connect only API to Spark Session. We should document them.


  was:
 
I have a spark application that is deployed using k8s and it is of version 
3.3.2 Recently there were some vulneabilities in spark 3.3.2

I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
application jar is built on spark 3.4.0

However while deploying, I get this error

        

*{{Exception in thread "main" java.nio.file.NoSuchFileException: 
/spark-assembly-1.0.jar}}*

 

I have this in deployment.yaml of the app

 

*mainApplicationFile: "local:spark-assembly-1.0.jar"*

 

 

 

 

and I have not changed anything related to that. I see that some code has 
changed in spark 3.4.0 core's source code regarding jar location.

Has it really changed the functionality ? Is there anyone who is facing same 
issue as me ? Should the path be specified in a different way?


> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Priority: Blocker
>
> https://issues.apache.org/jira/browse/SPARK-41255
> https://issues.apache.org/jira/browse/SPARK-43509
> https://issues.apache.org/jira/browse/SPARK-43612
> https://issues.apache.org/jira/browse/SPARK-43790
> added four Spark Connect only API to Spark Session. We should document them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44135) Document Spark Connect only API in PySpark

2023-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44135:
-
Priority: Major  (was: Blocker)

> Document Spark Connect only API in PySpark 
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-41255
> https://issues.apache.org/jira/browse/SPARK-43509
> https://issues.apache.org/jira/browse/SPARK-43612
> https://issues.apache.org/jira/browse/SPARK-43790
> added four Spark Connect only API to Spark Session. We should document them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44135) Document Spark Connect only API in PySpark

2023-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44135:
-
Component/s: Connect
 Documentation
 PySpark
 (was: Spark Core)

> Document Spark Connect only API in PySpark 
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Priority: Blocker
>
> https://issues.apache.org/jira/browse/SPARK-41255
> https://issues.apache.org/jira/browse/SPARK-43509
> https://issues.apache.org/jira/browse/SPARK-43612
> https://issues.apache.org/jira/browse/SPARK-43790
> added four Spark Connect only API to Spark Session. We should document them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44152) Document Spark Connect only API in PySpark

2023-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44152:
-
Description: 


 
I have a spark application that is deployed using k8s and it is of version 
3.3.2. Recently there were some vulnerabilities in spark 3.3.2.

I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
application jar is built on spark 3.4.0

However while deploying, I get this error

        

*{{Exception in thread "main" java.nio.file.NoSuchFileException: 
/spark-assembly-1.0.jar}}*

 

I have this in deployment.yaml of the app

 

*mainApplicationFile: "local:spark-assembly-1.0.jar"*

 

 

 

 

and I have not changed anything related to that. I see that some code has 
changed in spark 3.4.0 core's source code regarding jar location.

Has it really changed the functionality? Is there anyone who is facing the same 
issue as me? Should the path be specified in a different way?

  was:
https://issues.apache.org/jira/browse/SPARK-41255
https://issues.apache.org/jira/browse/SPARK-43509
https://issues.apache.org/jira/browse/SPARK-43612
https://issues.apache.org/jira/browse/SPARK-43790

added four Spark Connect only API to Spark Session. We should document them.


> Document Spark Connect only API in PySpark
> --
>
> Key: SPARK-44152
> URL: https://issues.apache.org/jira/browse/SPARK-44152
> Project: Spark
>  Issue Type: Documentation
>  Components: Connect, Documentation, PySpark
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  
> I have a spark application that is deployed using k8s and it is of version 
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}*
>  
> I have this in deployment.yaml of the app
>  
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>  
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the same 
> issue as me? Should the path be specified in a different way?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44135) Document Spark Connect only API in PySpark

2023-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44135:
-
Summary: Document Spark Connect only API in PySpark   (was: Upgrade to 
spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
java.nio.file.NoSuchFileException: , although jar is present in the location)

> Document Spark Connect only API in PySpark 
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Priority: Blocker
>
> https://issues.apache.org/jira/browse/SPARK-41255
> https://issues.apache.org/jira/browse/SPARK-43509
> https://issues.apache.org/jira/browse/SPARK-43612
> https://issues.apache.org/jira/browse/SPARK-43790
> added four Spark Connect only API to Spark Session. We should document them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org