[jira] [Created] (SPARK-44552) Remove useless private object `ParseState` from `IntervalUtils`

2023-07-25 Thread Yang Jie (Jira)
Yang Jie created SPARK-44552:


 Summary: Remove useless private object `ParseState` from 
`IntervalUtils`
 Key: SPARK-44552
 URL: https://issues.apache.org/jira/browse/SPARK-44552
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Yang Jie


SPARK-44326 
(https://github.com/apache/spark/pull/41885/files#diff-e695bcf363031df04cc16c96f1b62fb35093807ee1157c3cbecae3cccb9a3f4e)
 moved the relevant code from `IntervalUtils` to `SparkIntervalUtils` (including 
a new copy of `private object ParseState`), but did not delete the now-unused 
definition of `private object ParseState` in `IntervalUtils`. 






[jira] [Commented] (SPARK-43968) Improve error messages for Python UDTFs with wrong number of outputs

2023-07-25 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747262#comment-17747262
 ] 

Snoot.io commented on SPARK-43968:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/42157

> Improve error messages for Python UDTFs with wrong number of outputs
> 
>
> Key: SPARK-43968
> URL: https://issues.apache.org/jira/browse/SPARK-43968
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> Improve the error messages for Python UDTFs when the number of outputs does 
> not match the number of outputs specified in the return type of the UDTF.
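
For context, a minimal sketch of the failure mode being targeted, assuming the 
Spark 3.5 {{udtf}} decorator and an active SparkSession; the class name and 
column names below are illustrative only:

{code:python}
from pyspark.sql.functions import udtf

# Declares two output columns but yields only one value per row, so the
# number of outputs does not match the declared return type.
@udtf(returnType="a: int, b: int")
class MismatchedUDTF:
    def eval(self):
        yield (1,)

# With this improvement, the failure should surface as a clear, user-facing
# error; the exact message is whatever this ticket settles on.
MismatchedUDTF().show()
{code}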






[jira] [Commented] (SPARK-44479) Support Python UDTFs with empty schema

2023-07-25 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747258#comment-17747258
 ] 

Snoot.io commented on SPARK-44479:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/42161

> Support Python UDTFs with empty schema
> --
>
> Key: SPARK-44479
> URL: https://issues.apache.org/jira/browse/SPARK-44479
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Support UDTFs with empty schema, for example:
> {code:python}
> >>> class TestUDTF:
> ...   def eval(self):
> ... yield tuple()
> {code}
> Currently it fails with `useArrow=True`:
> {code:python}
> >>> udtf(TestUDTF, returnType=StructType())().collect()
> Traceback (most recent call last):
> ...
> ValueError: not enough values to unpack (expected 2, got 0)
> {code}
> whereas without Arrow:
> {code:python}
> >>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect()
> [Row()]
> {code}
> Otherwise, if we decide not to support it, we should raise an error in the 
> non-Arrow case as well, to be consistent.






[jira] [Resolved] (SPARK-44535) Move Streaming API to sql/api

2023-07-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44535.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Move Streaming API to sql/api
> -
>
> Key: SPARK-44535
> URL: https://issues.apache.org/jira/browse/SPARK-44535
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 3.4.1
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Resolved] (SPARK-44532) Move ArrowUtil to sql/api

2023-07-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44532.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Move ArrowUtil to sql/api
> -
>
> Key: SPARK-44532
> URL: https://issues.apache.org/jira/browse/SPARK-44532
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 3.4.1
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Updated] (SPARK-44550) Wrong semantics for null IN (empty list)

2023-07-25 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-44550:
--
Fix Version/s: (was: 3.5.0)

> Wrong semantics for null IN (empty list)
> 
>
> Key: SPARK-44550
> URL: https://issues.apache.org/jira/browse/SPARK-44550
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
>
> {{null IN (empty list)}} incorrectly evaluates to null, when it should 
> evaluate to false. (It should be false because a IN (b1, b2) is defined as 
> a = b1 OR a = b2, and an empty IN list is treated as an empty OR, which is 
> false. This is specified by ANSI SQL.)
> Many places in Spark execution (In, InSet, InSubquery) and optimization 
> (OptimizeIn, NullPropagation) implement this incorrect behavior. Note also that 
> Spark's behavior for null IN (empty list) is inconsistent in some places: 
> literal IN lists generally return null (incorrect), while IN/NOT IN subqueries 
> mostly return false/true, respectively (correct) in this case.
> This is a longstanding correctness issue that has existed since null support 
> for IN expressions was first added to Spark.
> Doc with more details: 
> [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
>  






[jira] [Created] (SPARK-44551) Wrong semantics for null IN (empty list) - IN expression execution

2023-07-25 Thread Jack Chen (Jira)
Jack Chen created SPARK-44551:
-

 Summary: Wrong semantics for null IN (empty list) - IN expression 
execution
 Key: SPARK-44551
 URL: https://issues.apache.org/jira/browse/SPARK-44551
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jack Chen


{{null IN (empty list)}} incorrectly evaluates to null, when it should evaluate 
to false. (It should be false because a IN (b1, b2) is defined as a = b1 OR 
a = b2, and an empty IN list is treated as an empty OR, which is false. This is 
specified by ANSI SQL.)

Many places in Spark execution (In, InSet, InSubquery) and optimization 
(OptimizeIn, NullPropagation) implement this incorrect behavior. Note also that 
Spark's behavior for null IN (empty list) is inconsistent in some places: 
literal IN lists generally return null (incorrect), while IN/NOT IN subqueries 
mostly return false/true, respectively (correct) in this case.

This is a longstanding correctness issue that has existed since null support 
for IN expressions was first added to Spark.

Doc with more details: 
[https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
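
To make the inconsistency concrete, a hedged PySpark sketch (assuming an active 
SparkSession bound to {{spark}}, and assuming that {{Column.isin()}} with no 
arguments produces a literal empty IN list):

{code:python}
from pyspark.sql.functions import lit

# Literal empty IN list: per the report this generally returns NULL (incorrect).
spark.range(1).select(lit(None).cast("int").isin().alias("lit_in")).show()

# Empty IN subquery: per the report this mostly returns false (correct).
spark.sql("""
    SELECT CAST(NULL AS INT) IN (SELECT id FROM range(10) WHERE id < 0) AS sub_in
""").show()
{code}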
 






[jira] [Updated] (SPARK-44431) Wrong semantics for null IN (empty list) - optimization rules

2023-07-25 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-44431:
--
Parent: SPARK-44550
Issue Type: Sub-task  (was: Bug)

> Wrong semantics for null IN (empty list) - optimization rules
> -
>
> Key: SPARK-44431
> URL: https://issues.apache.org/jira/browse/SPARK-44431
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
> Fix For: 3.5.0
>
>
> {{null IN (empty list)}} incorrectly evaluates to null, when it should 
> evaluate to false. (It should be false because a IN (b1, b2) is defined as 
> a = b1 OR a = b2, and an empty IN list is treated as an empty OR, which is 
> false. This is specified by ANSI SQL.)
> Many places in Spark execution (In, InSet, InSubquery) and optimization 
> (OptimizeIn, NullPropagation) implement this incorrect behavior. Note also that 
> Spark's behavior for null IN (empty list) is inconsistent in some places: 
> literal IN lists generally return null (incorrect), while IN/NOT IN subqueries 
> mostly return false/true, respectively (correct) in this case.
> This is a longstanding correctness issue that has existed since null support 
> for IN expressions was first added to Spark.
> Doc with more details: 
> [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
>  






[jira] [Updated] (SPARK-44431) Wrong semantics for null IN (empty list) - optimization rules

2023-07-25 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-44431:
--
Summary: Wrong semantics for null IN (empty list) - optimization rules  
(was: Wrong semantics for null IN (empty list))

> Wrong semantics for null IN (empty list) - optimization rules
> -
>
> Key: SPARK-44431
> URL: https://issues.apache.org/jira/browse/SPARK-44431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
> Fix For: 3.5.0
>
>
> {{null IN (empty list)}} incorrectly evaluates to null, when it should 
> evaluate to false. (It should be false because a IN (b1, b2) is defined as 
> a = b1 OR a = b2, and an empty IN list is treated as an empty OR, which is 
> false. This is specified by ANSI SQL.)
> Many places in Spark execution (In, InSet, InSubquery) and optimization 
> (OptimizeIn, NullPropagation) implement this incorrect behavior. Note also that 
> Spark's behavior for null IN (empty list) is inconsistent in some places: 
> literal IN lists generally return null (incorrect), while IN/NOT IN subqueries 
> mostly return false/true, respectively (correct) in this case.
> This is a longstanding correctness issue that has existed since null support 
> for IN expressions was first added to Spark.
> Doc with more details: 
> [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
>  






[jira] [Created] (SPARK-44550) Wrong semantics for null IN (empty list)

2023-07-25 Thread Jack Chen (Jira)
Jack Chen created SPARK-44550:
-

 Summary: Wrong semantics for null IN (empty list)
 Key: SPARK-44550
 URL: https://issues.apache.org/jira/browse/SPARK-44550
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jack Chen
Assignee: Jack Chen
 Fix For: 3.5.0


{{null IN (empty list)}} incorrectly evaluates to null, when it should evaluate 
to false. (It should be false because a IN (b1, b2) is defined as a = b1 OR 
a = b2, and an empty IN list is treated as an empty OR, which is false. This is 
specified by ANSI SQL.)

Many places in Spark execution (In, InSet, InSubquery) and optimization 
(OptimizeIn, NullPropagation) implement this incorrect behavior. Note also that 
Spark's behavior for null IN (empty list) is inconsistent in some places: 
literal IN lists generally return null (incorrect), while IN/NOT IN subqueries 
mostly return false/true, respectively (correct) in this case.

This is a longstanding correctness issue that has existed since null support 
for IN expressions was first added to Spark.

Doc with more details: 
[https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
 






[jira] [Commented] (SPARK-44494) K8s-it test failed

2023-07-25 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747248#comment-17747248
 ] 

Snoot.io commented on SPARK-44494:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/42162

> K8s-it test failed
> --
>
> Key: SPARK-44494
> URL: https://issues.apache.org/jira/browse/SPARK-44494
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> * [https://github.com/apache/spark/actions/runs/5607397734/jobs/10258527838]
> {code:java}
> [info] - PVs with local hostpath storage on statefulsets *** FAILED *** (3 
> minutes, 11 seconds)
> 3786[info]   The code passed to eventually never returned normally. Attempted 
> 7921 times over 3.000105988813 minutes. Last failure message: "++ id -u
> 3787[info]   + myuid=185
> 3788[info]   ++ id -g
> 3789[info]   + mygid=0
> 3790[info]   + set +e
> 3791[info]   ++ getent passwd 185
> 3792[info]   + uidentry=
> 3793[info]   + set -e
> 3794[info]   + '[' -z '' ']'
> 3795[info]   + '[' -w /etc/passwd ']'
> 3796[info]   + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
> 3797[info]   + '[' -z /opt/java/openjdk ']'
> 3798[info]   + SPARK_CLASSPATH=':/opt/spark/jars/*'
> 3799[info]   + grep SPARK_JAVA_OPT_
> 3800[info]   + sort -t_ -k4 -n
> 3801[info]   + sed 's/[^=]*=\(.*\)/\1/g'
> 3802[info]   + env
> 3803[info]   ++ command -v readarray
> 3804[info]   + '[' readarray ']'
> 3805[info]   + readarray -t SPARK_EXECUTOR_JAVA_OPTS
> 3806[info]   + '[' -n '' ']'
> 3807[info]   + '[' -z ']'
> 3808[info]   + '[' -z ']'
> 3809[info]   + '[' -n '' ']'
> 3810[info]   + '[' -z ']'
> 3811[info]   + '[' -z x ']'
> 3812[info]   + SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
> 3813[info]   + 
> SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*:/opt/spark/work-dir'
> 3814[info]   + case "$1" in
> 3815[info]   + shift 1
> 3816[info]   + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --conf 
> "spark.executorEnv.SPARK_DRIVER_POD_IP=$SPARK_DRIVER_BIND_ADDRESS" 
> --deploy-mode client "$@")
> 3817[info]   + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=10.244.0.45 --conf 
> spark.executorEnv.SPARK_DRIVER_POD_IP=10.244.0.45 --deploy-mode client 
> --properties-file /opt/spark/conf/spark.properties --class 
> org.apache.spark.examples.MiniReadWriteTest 
> local:///opt/spark/examples/jars/spark-examples_2.12-4.0.0-SNAPSHOT.jar 
> /opt/spark/pv-tests/tmp3727659354473892032.txt
> 3818[info]   Files 
> local:///opt/spark/examples/jars/spark-examples_2.12-4.0.0-SNAPSHOT.jar from 
> /opt/spark/examples/jars/spark-examples_2.12-4.0.0-SNAPSHOT.jar to 
> /opt/spark/work-dir/spark-examples_2.12-4.0.0-SNAPSHOT.jar
> 3819[info]   23/07/20 06:15:15 WARN NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable
> 3820[info]   Performing local word count from 
> /opt/spark/pv-tests/tmp3727659354473892032.txt
> 3821[info]   File contents are List(test PVs)
> 3822[info]   Creating SparkSession
> 3823[info]   23/07/20 06:15:15 INFO SparkContext: Running Spark version 
> 4.0.0-SNAPSHOT
> 3824[info]   23/07/20 06:15:15 INFO SparkContext: OS info Linux, 
> 5.15.0-1041-azure, amd64
> 3825[info]   23/07/20 06:15:15 INFO SparkContext: Java version 17.0.7
> 3826[info]   23/07/20 06:15:15 INFO ResourceUtils: 
> ==
> 3827[info]   23/07/20 06:15:15 INFO ResourceUtils: No custom resources 
> configured for spark.driver.
> 3828[info]   23/07/20 06:15:15 INFO ResourceUtils: 
> ==
> 3829[info]   23/07/20 06:15:15 INFO SparkContext: Submitted application: Mini 
> Read Write Test
> 3830[info]   23/07/20 06:15:16 INFO ResourceProfile: Default ResourceProfile 
> created, executor resources: Map(cores -> name: cores, amount: 1, script: , 
> vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap 
> -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> 
> name: cpus, amount: 1.0) {code}
> The tests have been failing for the past two days.






[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM

2023-07-25 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44546:
---
Description: 
h2. Summary

This ticket adds a dev utility script that helps generate PySpark tests using 
LLM responses. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly and avoid introducing regressions in 
the codebase. Historically, PySpark has had code regressions due to 
insufficient testing of public APIs.

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # Strings
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float(“nan”)
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. Strings
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multi columns with the same name
 * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’
 * Column of special types, e.g. nested type;
 * Column containing special values, e.g. Null;

h3. 6. Multi column / column names
 * Empty input; e.g DataFrame.drop()
 * Special cases for each single column
 * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”)
 * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty dataframe; e.g. spark.range(5).limit(0)
 * Dataframe with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing unsupported datatype
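
As one illustration of the edge-case categories listed above, a minimal 
hand-written sketch in pytest style (the fixture name, the choice of 
DataFrame.drop, and the expected outcomes reflect current DataFrame semantics 
and are not produced by the utility itself):

{code:python}
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="module")
def spark():
    return SparkSession.builder.master("local[1]").appName("edge-cases").getOrCreate()


def test_drop_empty_input(spark):
    # 6. Multi column / column names: empty input should be a no-op.
    assert spark.range(5).drop().columns == ["id"]


def test_drop_nonexistent_column(spark):
    # 5. Single column / column name: dropping a non-existent column is a no-op.
    assert spark.range(5).drop("does_not_exist").columns == ["id"]


def test_drop_all_columns(spark):
    # 7. DataFrame argument: a DataFrame with 0 columns still keeps its rows.
    df = spark.range(5).drop("id")
    assert df.columns == [] and df.count() == 5
{code}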

  was:
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. Historically, PySpark has had code regressions due to 
insufficient testing of public APIs (see 
[https://databricks.atlassian.net/browse/ES-705815]).

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # Strings
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float(“nan”)
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. Strings
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multi columns with the same name
 * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’
 * Column of special types, e.g. nested type;
 * Column containing special values, e.g. Null;

h3. 6. Multi column / column names
 * Empty input; e.g DataFrame.drop()
 * Special cases for each single column
 * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”)
 * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty dataframe; e.g. spark.range(5).limit(0)
 * Dataframe with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing unsupported datatype


> Add a dev utility to generate PySpark tests with LLM
> 
>
> Key: SPARK-44546
> URL: https://issues.apache.org/jira/browse/SPARK-44546
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> h2. Summary
> This ticket adds a dev utility script to help generate PySpark tests using 
> LLM response. The purpose of this experimental script is 

[jira] [Resolved] (SPARK-44530) Move SparkBuildInfo to common/util

2023-07-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44530.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Move SparkBuildInfo to common/util
> --
>
> Key: SPARK-44530
> URL: https://issues.apache.org/jira/browse/SPARK-44530
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Created] (SPARK-44549) Support correlated references under window functions

2023-07-25 Thread Andrey Gubichev (Jira)
Andrey Gubichev created SPARK-44549:
---

 Summary: Support correlated references under window functions
 Key: SPARK-44549
 URL: https://issues.apache.org/jira/browse/SPARK-44549
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1
Reporter: Andrey Gubichev


We should support subqueries with correlated references under a window function 
operator.
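
For illustration only, one possible shape of such a query (table and column 
names are made up, and whether this exact query is currently rejected depends 
on the analyzer's correlation checks); the correlated predicate 
{{t2.grp = t1.grp}} sits below the window operator inside the subquery:

{code:python}
# Assumes an active SparkSession bound to `spark`.
spark.range(10).selectExpr("id", "id % 3 AS grp").createOrReplaceTempView("t1")
spark.range(30).selectExpr("id", "id % 3 AS grp", "id * 2 AS v").createOrReplaceTempView("t2")

spark.sql("""
    SELECT t1.id,
           (SELECT max(r)
            FROM (SELECT rank() OVER (ORDER BY t2.v) AS r
                  FROM t2
                  WHERE t2.grp = t1.grp) s) AS max_rank
    FROM t1
""").show()
{code}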






[jira] [Created] (SPARK-44548) Add support for pandas DataFrame assertDataFrameEqual

2023-07-25 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44548:
--

 Summary: Add support for pandas DataFrame assertDataFrameEqual
 Key: SPARK-44548
 URL: https://issues.apache.org/jira/browse/SPARK-44548
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu
Assignee: Amanda Liu
 Fix For: 3.5.0, 4.0.0


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
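
A hedged sketch of the intended usage once pandas support lands, assuming the 
existing {{assertDataFrameEqual}} entry point in {{pyspark.testing}} simply 
starts accepting {{pandas.DataFrame}} inputs:

{code:python}
import pandas as pd
from pyspark.testing import assertDataFrameEqual

actual = pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]})
expected = pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]})

# With this change, pandas DataFrames should be comparable directly,
# alongside the already supported PySpark DataFrames.
assertDataFrameEqual(actual, expected)
{code}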






[jira] [Updated] (SPARK-43968) Improve error messages for Python UDTFs with wrong number of outputs

2023-07-25 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-43968:
-
Description: Improve the error messages for Python UDTFs when the number of 
outputs does not match the number of outputs specified in the return type of 
the UDTF.  (was: Add more compile time checks when creating UDTFs and throw 
user-friendly error messages when the UDTF is invalid.)

> Improve error messages for Python UDTFs with wrong number of outputs
> 
>
> Key: SPARK-43968
> URL: https://issues.apache.org/jira/browse/SPARK-43968
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> Improve the error messages for Python UDTFs when the number of outputs does 
> not match the number of outputs specified in the return type of the UDTF.






[jira] [Updated] (SPARK-43968) Improve error messages for Python UDTFs with wrong number of outputs

2023-07-25 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-43968:
-
Summary: Improve error messages for Python UDTFs with wrong number of 
outputs  (was: Add more compile-time checks when creating Python UDTFs)

> Improve error messages for Python UDTFs with wrong number of outputs
> 
>
> Key: SPARK-43968
> URL: https://issues.apache.org/jira/browse/SPARK-43968
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> Add more compile time checks when creating UDTFs and throw user-friendly 
> error messages when the UDTF is invalid.






[jira] [Updated] (SPARK-44005) Support returning non-tuple values for regular Python UDTFs

2023-07-25 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-44005:
-
Description: 
Currently, if you have a UDTF like this:
{code:java}
class TestUDTF:
def eval(self, a: int):
yield a {code}
and run the UDTF, it will fail with a confusing error message like
{code:java}
Unexpected tuple 1 with StructType {code}
Note that this works when Arrow is enabled. We should support this use case for 
regular UDTFs as well.
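
For reference, a small sketch of the current workaround versus the behavior 
this ticket asks for on the non-Arrow path (the decorator form and names are 
illustrative):

{code:python}
from pyspark.sql.functions import udtf

@udtf(returnType="a: int")
class SingleColumnUDTF:
    def eval(self, a: int):
        # Works today: wrap the single value in a 1-tuple.
        yield (a,)
        # Desired after this change (and already accepted with Arrow enabled):
        # yield a
{code}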

  was:
Currently, if you have a UDTF like this:
{code:java}
class TestUDTF:
def eval(self, a: int):
yield a {code}
and run the UDTF, it will fail with a confusing error message like
{code:java}
Unexpected tuple 1 with StructType {code}
We should support this use case.


> Support returning non-tuple values for regular Python UDTFs
> ---
>
> Key: SPARK-44005
> URL: https://issues.apache.org/jira/browse/SPARK-44005
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, if you have a UDTF like this:
> {code:java}
> class TestUDTF:
> def eval(self, a: int):
> yield a {code}
> and run the UDTF, it will fail with a confusing error message like
> {code:java}
> Unexpected tuple 1 with StructType {code}
> Note this works when arrow is enabled. We should support this use case for 
> regular UDTFs.






[jira] [Updated] (SPARK-44005) Support returning non-tuple values for regular Python UDTFs

2023-07-25 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-44005:
-
Description: 
Currently, if you have a UDTF like this:
{code:java}
class TestUDTF:
def eval(self, a: int):
yield a {code}
and run the UDTF, it will fail with a confusing error message like
{code:java}
Unexpected tuple 1 with StructType {code}
We should support this use case.

  was:
Currently, if you have a UDTF like this:
{code:java}
class TestUDTF:
def eval(self, a: int):
yield a {code}
and run the UDTF, it will fail with a confusing error message like
{code:java}
Unexpected tuple 1 with StructType {code}
We should make this error message more user-friendly.


> Support returning non-tuple values for regular Python UDTFs
> ---
>
> Key: SPARK-44005
> URL: https://issues.apache.org/jira/browse/SPARK-44005
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, if you have a UDTF like this:
> {code:java}
> class TestUDTF:
> def eval(self, a: int):
> yield a {code}
> and run the UDTF, it will fail with a confusing error message like
> {code:java}
> Unexpected tuple 1 with StructType {code}
> We should support this use case.






[jira] [Updated] (SPARK-44005) Support returning non-tuple values for regular Python UDTFs

2023-07-25 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-44005:
-
Summary: Support returning non-tuple values for regular Python UDTFs  (was: 
Support returning a non-tuple value for regular Python UDTFs)

> Support returning non-tuple values for regular Python UDTFs
> ---
>
> Key: SPARK-44005
> URL: https://issues.apache.org/jira/browse/SPARK-44005
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, if you have a UDTF like this:
> {code:java}
> class TestUDTF:
> def eval(self, a: int):
> yield a {code}
> and run the UDTF, it will fail with a confusing error message like
> {code:java}
> Unexpected tuple 1 with StructType {code}
> We should make this error message more user-friendly.






[jira] [Updated] (SPARK-44005) Support returning a non-tuple value for regular Python UDTFs

2023-07-25 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-44005:
-
Summary: Support returning a non-tuple value for regular Python UDTFs  
(was: Improve the error messages when a UDTF returns a non-tuple value)

> Support returning a non-tuple value for regular Python UDTFs
> 
>
> Key: SPARK-44005
> URL: https://issues.apache.org/jira/browse/SPARK-44005
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, if you have a UDTF like this:
> {code:java}
> class TestUDTF:
> def eval(self, a: int):
> yield a {code}
> and run the UDTF, it will fail with a confusing error message like
> {code:java}
> Unexpected tuple 1 with StructType {code}
> We should make this error message more user-friendly.






[jira] [Updated] (SPARK-44547) BlockManagerDecommissioner throws exceptions when migrating RDD cached blocks to fallback storage

2023-07-25 Thread Frank Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Yin updated SPARK-44547:
--
Description: 
Looks like the RDD cache doesn't support fallback storage and we should stop 
the migration if the only viable peer is the fallback storage. 

 [^spark-error.log]

23/07/25 05:12:58 WARN BlockManager: Failed to replicate rdd_18_25 to 
BlockManagerId(fallback, remote, 7337, None), failure #0
java.io.IOException: Failed to connect to remote:7337
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:288)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at 
org.apache.spark.network.netty.NettyBlockTransferService.uploadBlock(NettyBlockTransferService.scala:168)
at 
org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:121)
at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1784)
at 
org.apache.spark.storage.BlockManager.$anonfun$replicateBlock$2(BlockManager.scala:1721)
at 
org.apache.spark.storage.BlockManager.$anonfun$replicateBlock$2$adapted(BlockManager.scala:1707)
at scala.Option.forall(Option.scala:390)
at 
org.apache.spark.storage.BlockManager.replicateBlock(BlockManager.scala:1707)
at 
org.apache.spark.storage.BlockManagerDecommissioner.migrateBlock(BlockManagerDecommissioner.scala:356)
at 
org.apache.spark.storage.BlockManagerDecommissioner.$anonfun$decommissionRddCacheBlocks$3(BlockManagerDecommissioner.scala:340)
at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at 
org.apache.spark.storage.BlockManagerDecommissioner.decommissionRddCacheBlocks(BlockManagerDecommissioner.scala:339)
at 
org.apache.spark.storage.BlockManagerDecommissioner$$anon$1.run(BlockManagerDecommissioner.scala:214)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.net.UnknownHostException: remote
at java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source)
at java.base/java.net.InetAddress.getAllByName0(Unknown Source)
at java.base/java.net.InetAddress.getAllByName(Unknown Source)
at java.base/java.net.InetAddress.getAllByName(Unknown Source)
at java.base/java.net.InetAddress.getByName(Unknown Source)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:156)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:153)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at 
io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:153)
at 
io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:41)
at 
io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:61)
at 
io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:53)
at 
io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:55)
at 
io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:31)
at 
io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:106)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:206)
at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:46)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:180)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:166)
at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
at 

[jira] [Updated] (SPARK-44547) BlockManagerDecommissioner throws exceptions when migrating RDD cached blocks to fallback storage

2023-07-25 Thread Frank Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Yin updated SPARK-44547:
--
Attachment: spark-error.log

> BlockManagerDecommissioner throws exceptions when migrating RDD cached blocks 
> to fallback storage
> -
>
> Key: SPARK-44547
> URL: https://issues.apache.org/jira/browse/SPARK-44547
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Frank Yin
>Priority: Major
> Attachments: spark-error.log
>
>
> Looks like the RDD cache doesn't support fallback storage and we should stop 
> the migration if the only viable peer is the fallback storage. 
>  {{23/07/25 05:12:58 WARN BlockManager: Failed to replicate rdd_18_25 to 
> BlockManagerId(fallback, remote, 7337, None), failure #0
> java.io.IOException: Failed to connect to remote:7337
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:288)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService.uploadBlock(NettyBlockTransferService.scala:168)
>   at 
> org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:121)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1784)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$replicateBlock$2(BlockManager.scala:1721)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$replicateBlock$2$adapted(BlockManager.scala:1707)
>   at scala.Option.forall(Option.scala:390)
>   at 
> org.apache.spark.storage.BlockManager.replicateBlock(BlockManager.scala:1707)
>   at 
> org.apache.spark.storage.BlockManagerDecommissioner.migrateBlock(BlockManagerDecommissioner.scala:356)
>   at 
> org.apache.spark.storage.BlockManagerDecommissioner.$anonfun$decommissionRddCacheBlocks$3(BlockManagerDecommissioner.scala:340)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.storage.BlockManagerDecommissioner.decommissionRddCacheBlocks(BlockManagerDecommissioner.scala:339)
>   at 
> org.apache.spark.storage.BlockManagerDecommissioner$$anon$1.run(BlockManagerDecommissioner.scala:214)
>   at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>   at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
> Source)
>   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
> Source)
>   at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: java.net.UnknownHostException: remote
>   at java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source)
>   at java.base/java.net.InetAddress.getAllByName0(Unknown Source)
>   at java.base/java.net.InetAddress.getAllByName(Unknown Source)
>   at java.base/java.net.InetAddress.getAllByName(Unknown Source)
>   at java.base/java.net.InetAddress.getByName(Unknown Source)
>   at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:156)
>   at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:153)
>   at java.base/java.security.AccessController.doPrivileged(Native Method)
>   at 
> io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:153)
>   at 
> io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:41)
>   at 
> io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:61)
>   at 
> io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:53)
>   at 
> io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:55)
>   at 
> io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:31)
>   at 
> io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:106)
>   at 

[jira] [Updated] (SPARK-44547) BlockManagerDecommissioner throws exceptions when migrating RDD cached blocks to fallback storage

2023-07-25 Thread Frank Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Yin updated SPARK-44547:
--
Description: 
Looks like the RDD cache doesn't support fallback storage and we should stop 
the migration if the only viable peer is the fallback storage. 

 {{23/07/25 05:12:58 WARN BlockManager: Failed to replicate rdd_18_25 to 
BlockManagerId(fallback, remote, 7337, None), failure #0
java.io.IOException: Failed to connect to remote:7337
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:288)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at 
org.apache.spark.network.netty.NettyBlockTransferService.uploadBlock(NettyBlockTransferService.scala:168)
at 
org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:121)
at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1784)
at 
org.apache.spark.storage.BlockManager.$anonfun$replicateBlock$2(BlockManager.scala:1721)
at 
org.apache.spark.storage.BlockManager.$anonfun$replicateBlock$2$adapted(BlockManager.scala:1707)
at scala.Option.forall(Option.scala:390)
at 
org.apache.spark.storage.BlockManager.replicateBlock(BlockManager.scala:1707)
at 
org.apache.spark.storage.BlockManagerDecommissioner.migrateBlock(BlockManagerDecommissioner.scala:356)
at 
org.apache.spark.storage.BlockManagerDecommissioner.$anonfun$decommissionRddCacheBlocks$3(BlockManagerDecommissioner.scala:340)
at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at 
org.apache.spark.storage.BlockManagerDecommissioner.decommissionRddCacheBlocks(BlockManagerDecommissioner.scala:339)
at 
org.apache.spark.storage.BlockManagerDecommissioner$$anon$1.run(BlockManagerDecommissioner.scala:214)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.net.UnknownHostException: remote
at java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source)
at java.base/java.net.InetAddress.getAllByName0(Unknown Source)
at java.base/java.net.InetAddress.getAllByName(Unknown Source)
at java.base/java.net.InetAddress.getAllByName(Unknown Source)
at java.base/java.net.InetAddress.getByName(Unknown Source)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:156)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:153)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at 
io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:153)
at 
io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:41)
at 
io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:61)
at 
io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:53)
at 
io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:55)
at 
io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:31)
at 
io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:106)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:206)
at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:46)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:180)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:166)
at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
at 
io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)

[jira] [Created] (SPARK-44547) BlockManagerDecommissioner throws exceptions when migrating RDD cached blocks to fallback storage

2023-07-25 Thread Frank Yin (Jira)
Frank Yin created SPARK-44547:
-

 Summary: BlockManagerDecommissioner throws exceptions when 
migrating RDD cached blocks to fallback storage
 Key: SPARK-44547
 URL: https://issues.apache.org/jira/browse/SPARK-44547
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: Frank Yin


Looks like the RDD cache doesn't support fallback storage and we should stop 
the migration if the only viable peer is the fallback storage. 

```
23/07/25 05:12:58 WARN BlockManager: Failed to replicate rdd_18_25 to 
BlockManagerId(fallback, remote, 7337, None), failure #0
java.io.IOException: Failed to connect to remote:7337
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:288)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at 
org.apache.spark.network.netty.NettyBlockTransferService.uploadBlock(NettyBlockTransferService.scala:168)
at 
org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:121)
at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1784)
at 
org.apache.spark.storage.BlockManager.$anonfun$replicateBlock$2(BlockManager.scala:1721)
at 
org.apache.spark.storage.BlockManager.$anonfun$replicateBlock$2$adapted(BlockManager.scala:1707)
at scala.Option.forall(Option.scala:390)
at org.apache.spark.storage.BlockManager.replicateBlock(BlockManager.scala:1707)
at 
org.apache.spark.storage.BlockManagerDecommissioner.migrateBlock(BlockManagerDecommissioner.scala:356)
at 
org.apache.spark.storage.BlockManagerDecommissioner.$anonfun$decommissionRddCacheBlocks$3(BlockManagerDecommissioner.scala:340)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at 
org.apache.spark.storage.BlockManagerDecommissioner.decommissionRddCacheBlocks(BlockManagerDecommissioner.scala:339)
at 
org.apache.spark.storage.BlockManagerDecommissioner$$anon$1.run(BlockManagerDecommissioner.scala:214)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.net.UnknownHostException: remote
at java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source)
at java.base/java.net.InetAddress.getAllByName0(Unknown Source)
at java.base/java.net.InetAddress.getAllByName(Unknown Source)
at java.base/java.net.InetAddress.getAllByName(Unknown Source)
at java.base/java.net.InetAddress.getByName(Unknown Source)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:156)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:153)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:153)
at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:41)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:61)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:53)
at 
io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:55)
at 
io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:31)
at 
io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:106)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:206)
at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:46)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:180)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:166)
at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
at io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:605)
at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
at 

[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM

2023-07-25 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44546:
---
Description: 
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. Historically, PySpark has had code regressions due to 
insufficient testing of public APIs (see 
[https://databricks.atlassian.net/browse/ES-705815]).

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # Strings
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float(“nan”)
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. Strings
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multi columns with the same name
 * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’
 * Column of special types, e.g. nested type;
 * Column containing special values, e.g. Null;

h3. 6. Multi column / column names
 * Empty input; e.g DataFrame.drop()
 * Special cases for each single column
 * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”)
 * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty dataframe; e.g. spark.range(5).limit(0)
 * Dataframe with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing unsupported datatype

  was:
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. 

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # Strings
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float(“nan”)
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. Strings
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multi columns with the same name
 * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’
 * Column of special types, e.g. nested type;
 * Column containing special values, e.g. Null;

h3. 6. Multi column / column names
 * Empty input; e.g DataFrame.drop()
 * Special cases for each single column
 * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”)
 * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty dataframe; e.g. spark.range(5).limit(0)
 * Dataframe with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing unsupported datatype


> Add a dev utility to generate PySpark tests with LLM
> 
>
> Key: SPARK-44546
> URL: https://issues.apache.org/jira/browse/SPARK-44546
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> h2. Summary
> This ticket adds a dev utility script to help generate PySpark tests using 
> LLM response. The purpose of this experimental script is to encourage PySpark 
> developers to test their code thoroughly, to avoid introducing 

[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM

2023-07-25 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44546:
---
Description: 
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. 

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # Strings
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float(“nan”)
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. Strings
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multi columns with the same name
 * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’
 * Column of special types, e.g. nested type;
 * Column containing special values, e.g. Null;

h3. 6. Multi column / column names
 * Empty input; e.g DataFrame.drop()
 * Special cases for each single column
 * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”)
 * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty dataframe; e.g. spark.range(5).limit(0)
 * Dataframe with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing unsupported datatype

  was:
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. 

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # String
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float(“nan”)
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. String
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multi columns with the same name
 * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’
 * Column of special types, e.g. nested type;
 * Column containing special values, e.g. Null;

h3. 6. Multi column / column names
 * Empty input; e.g DataFrame.drop()
 * Special cases for each single column
 * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”)
 * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty dataframe; e.g. spark.range(5).limit(0)
 * Dataframe with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing unsupported datatype


> Add a dev utility to generate PySpark tests with LLM
> 
>
> Key: SPARK-44546
> URL: https://issues.apache.org/jira/browse/SPARK-44546
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> h2. Summary
> This ticket adds a dev utility script to help generate PySpark tests using 
> LLM response. The purpose of this experimental script is to encourage PySpark 
> developers to test their code thoroughly, to avoid introducing regressions in 
> the codebase. 
> Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
> from the perspective of arguments. Many 

[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM

2023-07-25 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44546:
---
Description: 
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. 

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # String
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float(“nan”)
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. String
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multi columns with the same name
 * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’
 * Column of special types, e.g. nested type;
 * Column containing special values, e.g. Null;

h3. 6. Multi column / column names
 * Empty input; e.g DataFrame.drop()
 * Special cases for each single column
 * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”)
 * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty dataframe; e.g. spark.range(5).limit(0)
 * Dataframe with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing unsupported datatype

  was:
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. 

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float(“nan”)
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. String
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multi columns with the same name
 * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’
 * Column of special types, e.g. nested type;
 * Column containing special values, e.g. Null;

h3. 6. Multi column / column names
 * Empty input; e.g DataFrame.drop()
 * Special cases for each single column
 * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”)
 * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty dataframe; e.g. spark.range(5).limit(0)
 * Dataframe with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing unsupported datatype


> Add a dev utility to generate PySpark tests with LLM
> 
>
> Key: SPARK-44546
> URL: https://issues.apache.org/jira/browse/SPARK-44546
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> h2. Summary
> This ticket adds a dev utility script to help generate PySpark tests using 
> LLM response. The purpose of this experimental script is to encourage PySpark 
> developers to test their code thoroughly, to avoid introducing regressions in 
> the codebase. 
> Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
> from the perspective of arguments. Many of these edge 

[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM

2023-07-25 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44546:
---
Description: 
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. 

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float(“nan”)
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. String
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multi columns with the same name
 * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’
 * Column of special types, e.g. nested type;
 * Column containing special values, e.g. Null;

h3. 6. Multi column / column names
 * Empty input; e.g DataFrame.drop()
 * Special cases for each single column
 * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”)
 * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty dataframe; e.g. spark.range(5).limit(0)
 * Dataframe with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing unsupported datatype

> Add a dev utility to generate PySpark tests with LLM
> 
>
> Key: SPARK-44546
> URL: https://issues.apache.org/jira/browse/SPARK-44546
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> h2. Summary
> This ticket adds a dev utility script to help generate PySpark tests using 
> LLM response. The purpose of this experimental script is to encourage PySpark 
> developers to test their code thoroughly, to avoid introducing regressions in 
> the codebase. 
> Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
> from the perspective of arguments. Many of these edge cases are passed into 
> the LLM through the script's base prompt.
> Please note that this list is not exhaustive, but rather a starting point. 
> Some of these cases may not apply, depending on the situation. We encourage 
> all PySpark developers to carefully consider edge case scenarios when writing 
> tests.
> h2. Table of Contents
>  # None
>  # Ints
>  # Floats
>  # Single column / column name
>  # Multi column / column names
>  # DataFrame argument
> h3. 1. None
>  * Empty input
>  * None type
> h3. 2. Ints
>  * Negatives
>  * 0
>  * value > Int.MaxValue
>  * value < Int.MinValue
> h3. 3. Floats
>  * Negatives
>  * 0.0
>  * Float(“nan”)
>  * Float("inf")
>  * Float("-inf")
>  * decimal.Decimal
>  * numpy.float16
> h3. 4. String
>  * Special characters
>  * Spaces
>  * Empty strings
> h3. 5. Single column / column name
>  * Non-existent column
>  * Empty column name
>  * Column name with special characters, e.g. dots
>  * Multi columns with the same name
>  * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’
>  * Column of special types, e.g. nested type;
>  * Column containing special values, e.g. Null;
> h3. 6. Multi column / column names
>  * Empty input; e.g DataFrame.drop()
>  * Special cases for each single column
>  * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”)
>  * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”))
> h3. 7. DataFrame argument
>  * DataFrame argument
>  * Empty dataframe; e.g. spark.range(5).limit(0)
>  * Dataframe with 0 columns, e.g. spark.range(5).drop('id')
>  * Dataset with repeated arguments
>  * Local dataset (pd.DataFrame) containing unsupported datatype



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43402) FileSourceScanExec supports push down data filter with scalar subquery

2023-07-25 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747067#comment-17747067
 ] 

Ignite TC Bot commented on SPARK-43402:
---

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/41088

> FileSourceScanExec supports push down data filter with scalar subquery
> --
>
> Key: SPARK-43402
> URL: https://issues.apache.org/jira/browse/SPARK-43402
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> A scalar subquery can be pushed down as a data filter at runtime, since we always 
> execute the subquery first.
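A hedged sketch of the query shape described above; the table and column names (sales, dim_dates, day) are made up for illustration, and the temp views here are in-memory rather than file-based:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

spark.range(100).selectExpr("id", "id % 10 AS day").createOrReplaceTempView("sales")
spark.range(10).selectExpr("id AS day").createOrReplaceTempView("dim_dates")

# The scalar subquery (SELECT max(day) FROM dim_dates) is executed first, so its
# result could in principle be pushed down to the scan as a data filter at runtime.
spark.sql("""
    SELECT * FROM sales
    WHERE day = (SELECT max(day) FROM dim_dates)
""").show()
{code}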



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44545) It's impossible to get column by index if names are not unique

2023-07-25 Thread Alexey Dmitriev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Dmitriev updated SPARK-44545:

Description: 
So, I have a dataframe with non-unique column names:

 
{code:java}
df = spark_session.createDataFrame([[1,2,3], [4,5,6]], ['a', 'a', 'c']) {code}
 

It works fine.

 

Now I try to get a column with a non-unique name by index
{code:java}
df[0]
{code}
Expectation: it returns the first of the columns.

Note: the 
[doc|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.__getitem__.html]
 doesn't mention non-unique names as a precondition.

 

Actual result:

It throws exception:

 
{noformat}
AnalysisException Traceback (most recent call last)
Cell In[71], line 1
> 1 df[0]

File /usr/local/spark/python/pyspark/sql/dataframe.py:2935, in 
DataFrame.__getitem__(self, item)
   2933 return self.select(*item)
   2934 elif isinstance(item, int):
-> 2935 jc = self._jdf.apply(self.columns[item])
   2936 return Column(jc)
   2937 else:

File 
/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in 
JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317 self.command_header +\
   1318 args_command +\
   1319 proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323 answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326 if hasattr(temp_arg, "_detach"):

File /usr/local/spark/python/pyspark/errors/exceptions/captured.py:175, in 
capture_sql_exception..deco(*a, **kw)
171 converted = convert_exception(e.java_exception)
172 if not isinstance(converted, UnknownException):
173 # Hide where the exception came from that shows a non-Pythonic
174 # JVM exception message.
--> 175 raise converted from None
176 else:
177 raise

AnalysisException: [AMBIGUOUS_REFERENCE] Reference `a` is ambiguous, could be: 
[`a`, `a`].{noformat}
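A hedged workaround sketch (not the proposed fix): renaming the columns to unique names first makes positional access resolve without the ambiguous reference; spark_session is the session from the report above:

{code:python}
df = spark_session.createDataFrame([[1, 2, 3], [4, 5, 6]], ['a', 'a', 'c'])

# Rename to unique placeholder names, then index positionally.
unique = df.toDF(*[f"c{i}" for i in range(len(df.columns))])
first_col = unique[0]          # resolves to column `c0` without ambiguity
unique.select(first_col).show()
{code}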
 

  was:
So, I have a dataframe with non-unique columns names:

 
{code:java}
df = spark_session.createDataFrame([[1,2,3], [4,5,6]], ['a', 'a', 'c']) {code}
 

It works fine.

 

Now I try to get a column with non-unique name by index
{code:java}
df[0]
{code}
Expectation: It returns first of the columns

Note, the 
[doc|[https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.__getitem__.html]]
 doesn't mention non-unique name as a precondition.

 

Actual result:

It throws exception:

 
{noformat}
AnalysisException Traceback (most recent call last)
Cell In[71], line 1
> 1 df[0]

File /usr/local/spark/python/pyspark/sql/dataframe.py:2935, in 
DataFrame.__getitem__(self, item)
   2933 return self.select(*item)
   2934 elif isinstance(item, int):
-> 2935 jc = self._jdf.apply(self.columns[item])
   2936 return Column(jc)
   2937 else:

File 
/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in 
JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317 self.command_header +\
   1318 args_command +\
   1319 proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323 answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326 if hasattr(temp_arg, "_detach"):

File /usr/local/spark/python/pyspark/errors/exceptions/captured.py:175, in 
capture_sql_exception..deco(*a, **kw)
171 converted = convert_exception(e.java_exception)
172 if not isinstance(converted, UnknownException):
173 # Hide where the exception came from that shows a non-Pythonic
174 # JVM exception message.
--> 175 raise converted from None
176 else:
177 raise

AnalysisException: [AMBIGUOUS_REFERENCE] Reference `a` is ambiguous, could be: 
[`a`, `a`].{noformat}
 


> It's impossible to get column by index if names are not unique
> --
>
> Key: SPARK-44545
> URL: https://issues.apache.org/jira/browse/SPARK-44545
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
> Environment: I have python 3.11, pyspark 3.4.0
>Reporter: Alexey Dmitriev
>Priority: Major
>
> So, I have a dataframe with non-unique columns names:
>  
> {code:java}
> df = spark_session.createDataFrame([[1,2,3], [4,5,6]], ['a', 'a', 'c']) {code}
>  
> It works fine.
>  
> Now I try to get a column with non-unique name by index
> {code:java}
> df[0]
> {code}
> Expectation: It returns first of the columns
> Note, the 
> 

[jira] [Resolved] (SPARK-44541) Remove useless function `hasRangeExprAgainstEventTimeCol` from `UnsupportedOperationChecker`

2023-07-25 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-44541.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42147
[https://github.com/apache/spark/pull/42147]

> Remove useless function `hasRangeExprAgainstEventTimeCol` from 
> `UnsupportedOperationChecker`
> 
>
> Key: SPARK-44541
> URL: https://issues.apache.org/jira/browse/SPARK-44541
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 4.0.0
>
>
> Function `hasRangeExprAgainstEventTimeCol` was introduced by SPARK-40940 and is 
> no longer used after SPARK-42376
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44541) Remove useless function `hasRangeExprAgainstEventTimeCol` from `UnsupportedOperationChecker`

2023-07-25 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-44541:


Assignee: Yang Jie

> Remove useless function `hasRangeExprAgainstEventTimeCol` from 
> `UnsupportedOperationChecker`
> 
>
> Key: SPARK-44541
> URL: https://issues.apache.org/jira/browse/SPARK-44541
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> Function `hasRangeExprAgainstEventTimeCol` was introduced by SPARK-40940 and is 
> no longer used after SPARK-42376
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM

2023-07-25 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44546:
--

 Summary: Add a dev utility to generate PySpark tests with LLM
 Key: SPARK-44546
 URL: https://issues.apache.org/jira/browse/SPARK-44546
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44545) It's impossible to get column by index if names are not unique

2023-07-25 Thread Alexey Dmitriev (Jira)
Alexey Dmitriev created SPARK-44545:
---

 Summary: It's impossible to get column by index if names are not 
unique
 Key: SPARK-44545
 URL: https://issues.apache.org/jira/browse/SPARK-44545
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.4.0
 Environment: I have python 3.11, pyspark 3.4.0
Reporter: Alexey Dmitriev


So, I have a dataframe with non-unique column names:

 
{code:java}
df = spark_session.createDataFrame([[1,2,3], [4,5,6]], ['a', 'a', 'c']) {code}
 

It works fine.

 

Now I try to get a column with a non-unique name by index
{code:java}
df[0]
{code}
Expectation: it returns the first of the columns.

Note: the 
[doc|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.__getitem__.html]
 doesn't mention non-unique names as a precondition.

 

Actual result:

It throws exception:

 
{noformat}
AnalysisException Traceback (most recent call last)
Cell In[71], line 1
> 1 df[0]

File /usr/local/spark/python/pyspark/sql/dataframe.py:2935, in 
DataFrame.__getitem__(self, item)
   2933 return self.select(*item)
   2934 elif isinstance(item, int):
-> 2935 jc = self._jdf.apply(self.columns[item])
   2936 return Column(jc)
   2937 else:

File 
/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in 
JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317 self.command_header +\
   1318 args_command +\
   1319 proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323 answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326 if hasattr(temp_arg, "_detach"):

File /usr/local/spark/python/pyspark/errors/exceptions/captured.py:175, in 
capture_sql_exception..deco(*a, **kw)
171 converted = convert_exception(e.java_exception)
172 if not isinstance(converted, UnknownException):
173 # Hide where the exception came from that shows a non-Pythonic
174 # JVM exception message.
--> 175 raise converted from None
176 else:
177 raise

AnalysisException: [AMBIGUOUS_REFERENCE] Reference `a` is ambiguous, could be: 
[`a`, `a`].{noformat}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44355) Move commands to CTEDef code path and deprecate CTE inline path

2023-07-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-44355:
---

Assignee: Wenchen Fan

> Move commands to CTEDef code path and deprecate CTE inline path
> ---
>
> Key: SPARK-44355
> URL: https://issues.apache.org/jira/browse/SPARK-44355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 4.0.0
>
>
> Right now our CTE resolution code path is diverged: queries with commands go 
> into the CTE inline code path whereas non-command queries go into the CTEDef 
> code path (see 
> https://github.com/apache/spark/blob/42719d9425b9a24ef016b5c2874e522b960cf114/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala#L50
>  ).
> In the longer term we should migrate command queries to go through CTEDef as 
> well and deprecate the CTE inline path.
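A hedged illustration of the two query categories mentioned above; `demo_tbl` is a hypothetical table name and this is not the resolution logic itself:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Non-command query with a CTE: resolved through the CTEDef code path.
spark.sql("WITH t AS (SELECT 1 AS id) SELECT * FROM t").show()

# Command query with a CTE (CTAS): historically took the CTE inline path,
# which this ticket migrates to the CTEDef path as well.
spark.sql("CREATE TABLE demo_tbl AS WITH t AS (SELECT 1 AS id) SELECT * FROM t")
{code}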



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44355) Move commands to CTEDef code path and deprecate CTE inline path

2023-07-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44355.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42036
[https://github.com/apache/spark/pull/42036]

> Move commands to CTEDef code path and deprecate CTE inline path
> ---
>
> Key: SPARK-44355
> URL: https://issues.apache.org/jira/browse/SPARK-44355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Major
> Fix For: 4.0.0
>
>
> Right now our CTE resolution code path is diverged: queries with commands go 
> into the CTE inline code path whereas non-command queries go into the CTEDef 
> code path (see 
> https://github.com/apache/spark/blob/42719d9425b9a24ef016b5c2874e522b960cf114/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala#L50
>  ).
> In the longer term we should migrate command queries to go through CTEDef as 
> well and deprecate the CTE inline path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44544) Move python packaging tests to a separate module

2023-07-25 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-44544:
-

 Summary: Move python packaging tests to a separate module
 Key: SPARK-44544
 URL: https://issues.apache.org/jira/browse/SPARK-44544
 Project: Spark
  Issue Type: Test
  Components: Project Infra
Affects Versions: 3.5.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43602) Rebalance PySpark test suites

2023-07-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43602.
---
Resolution: Resolved

> Rebalance PySpark test suites
> -
>
> Key: SPARK-43602
> URL: https://issues.apache.org/jira/browse/SPARK-43602
> Project: Spark
>  Issue Type: Test
>  Components: Connect, PySpark, Tests
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>
> Some Python tests are time-consuming; we should split them in order to:
> 1. speed up debugging during development;
> 2. make them more suitable for parallelism downstream;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41401) spark3 stagedir can't be change

2023-07-25 Thread Dipayan Dev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746978#comment-17746978
 ] 

Dipayan Dev commented on SPARK-41401:
-

This doesn't work in Google Cloud Storage even in Spark 2.x. 
Can I pick this issue?

> spark3 stagedir can't be change 
> 
>
> Key: SPARK-41401
> URL: https://issues.apache.org/jira/browse/SPARK-41401
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2, 3.2.3
>Reporter: sinlang
>Priority: Major
>
> I want to use a different staging dir when writing temporary data, but Spark 3 
> seems to only be able to write under the table path.
> The spark.yarn.stagingDir parameter only works when using Spark 2.
>  
> in org.apache.spark.internal.io.FileCommitProtocol  file :  
>   def getStagingDir(path: String, jobId: String): Path = {
>     new Path(path, ".spark-staging-" + jobId)
>   }
> }
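For reference, a hedged sketch of setting spark.yarn.stagingDir from PySpark; the GCS path is hypothetical. Note that, per the snippet above, FileCommitProtocol.getStagingDir still places ".spark-staging-<jobId>" under the output path, which this setting does not change:

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("staging-dir-sketch")
    .config("spark.yarn.stagingDir", "gs://my-bucket/yarn-staging")  # hypothetical path
    .getOrCreate()
)
{code}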



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44543) Cleanup .spark-staging directories when yarn application fails

2023-07-25 Thread Dipayan Dev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dipayan Dev updated SPARK-44543:

Description: 
Spark creates staging directories like .hive-staging and .spark-staging when you 
run a dynamic insert overwrite to a partitioned table. Spark spends most of its 
time renaming the partitioned files, and because GCS renames are slow, there are 
frequent scenarios where YARN fails due to network errors etc.

Such directories will remain forever in Google Cloud Storage if the YARN 
application master gets killed.
 
Over time this piles up and incurs a lot of cloud storage cost.
 
Can we update the file committer to clean up the temporary directories if the 
job commit fails?

PS : This request is specifically for GCS.
Image for reference 
 

!image-2023-07-25-17-52-55-006.png|width=458,height=177!
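In the meantime, a hedged sketch of manually removing leftover staging directories via the Hadoop FileSystem API; this is not the committer-side fix requested above, and the bucket/table paths are hypothetical:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
Path = jvm.org.apache.hadoop.fs.Path

table_path = "gs://my-bucket/warehouse/my_table"   # hypothetical
fs = Path(table_path).getFileSystem(conf)

# Delete any .spark-staging-* directories left behind by a killed application.
for status in fs.globStatus(Path(table_path + "/.spark-staging-*")) or []:
    fs.delete(status.getPath(), True)  # recursive delete
{code}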

  was:
Spark creates the staging directories like .hive-staging, .spark-staging etc 
which get created when you run an insert overwrite to a partitioned table. 
Spark spends maximum time in renaming the partitioned files, and because GCS 
renaming are too slow, there are frequent scenarios where YARN fails due to 
network error etc.

Such directories will remain forever in Google Cloud Storage, in case the yarn 
application manager gets killed.
 
Over time this pileup and incurs a lot of cloud storage cost.
 
Can we update our File committer to clean up the temporary directories in case 
the job commit fails.

PS : This request is specifically for GCS.
Image for reference 
 

!image-2023-07-25-17-52-55-006.png|width=458,height=177!


> Cleanup .spark-staging directories when yarn application fails
> --
>
> Key: SPARK-44543
> URL: https://issues.apache.org/jira/browse/SPARK-44543
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.4.1
>Reporter: Dipayan Dev
>Priority: Major
> Attachments: image-2023-07-25-17-52-55-006.png
>
>
> Spark creates the staging directories like .hive-staging, .spark-staging etc 
> which get created when you run an dynamic insert overwrite to a partitioned 
> table. Spark spends maximum time in renaming the partitioned files, and 
> because GCS renaming are too slow, there are frequent scenarios where YARN 
> fails due to network error etc.
> Such directories will remain forever in Google Cloud Storage, in case the 
> yarn application manager gets killed.
>  
> Over time this pileup and incurs a lot of cloud storage cost.
>  
> Can we update our File committer to clean up the temporary directories in 
> case the job commit fails.
> PS : This request is specifically for GCS.
> Image for reference 
>  
> !image-2023-07-25-17-52-55-006.png|width=458,height=177!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44543) Cleanup .spark-staging directories when yarn application fails

2023-07-25 Thread Dipayan Dev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dipayan Dev updated SPARK-44543:

Description: 
Spark creates the staging directories like .hive-staging, .spark-staging etc 
which get created when you run an insert overwrite to a partitioned table. 
Spark spends maximum time in renaming the partitioned files, and because GCS 
renaming are too slow, there are frequent scenarios where YARN fails due to 
network error etc.

Such directories will remain forever in Google Cloud Storage, in case the yarn 
application manager gets killed.
 
Over time this pileup and incurs a lot of cloud storage cost.
 
Can we update our File committer to clean up the temporary directories in case 
the job commit fails.

PS : This request is specifically for GCS.
Image for reference 
 

!image-2023-07-25-17-52-55-006.png|width=458,height=177!

  was:
Spark creates the staging directories like .hive-staging, .spark-staging etc 
which get created when you run an upsert to a partitioned table. Such 
directories will remain forever in Google Cloud Storage, in case the yarn 
application manager gets killed.
 
Over time this pileup and incurs a lot of cloud storage cost.
 
Can we update our File committer to clean up the temporary directories in case 
the job commit fails.
 

!image-2023-07-25-17-52-55-006.png|width=458,height=177!


> Cleanup .spark-staging directories when yarn application fails
> --
>
> Key: SPARK-44543
> URL: https://issues.apache.org/jira/browse/SPARK-44543
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.4.1
>Reporter: Dipayan Dev
>Priority: Major
> Attachments: image-2023-07-25-17-52-55-006.png
>
>
> Spark creates the staging directories like .hive-staging, .spark-staging etc 
> which get created when you run an insert overwrite to a partitioned table. 
> Spark spends maximum time in renaming the partitioned files, and because GCS 
> renaming are too slow, there are frequent scenarios where YARN fails due to 
> network error etc.
> Such directories will remain forever in Google Cloud Storage, in case the 
> yarn application manager gets killed.
>  
> Over time this pileup and incurs a lot of cloud storage cost.
>  
> Can we update our File committer to clean up the temporary directories in 
> case the job commit fails.
> PS : This request is specifically for GCS.
> Image for reference 
>  
> !image-2023-07-25-17-52-55-006.png|width=458,height=177!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44543) Cleanup .spark-staging directories when yarn application fails

2023-07-25 Thread Dipayan Dev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746974#comment-17746974
 ] 

Dipayan Dev commented on SPARK-44543:
-

Happy to contribute to the development of this feature. 

> Cleanup .spark-staging directories when yarn application fails
> --
>
> Key: SPARK-44543
> URL: https://issues.apache.org/jira/browse/SPARK-44543
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.4.1
>Reporter: Dipayan Dev
>Priority: Major
> Attachments: image-2023-07-25-17-52-55-006.png
>
>
> Spark creates the staging directories like .hive-staging, .spark-staging etc 
> which get created when you run an insert overwrite to a partitioned table. 
> Spark spends maximum time in renaming the partitioned files, and because GCS 
> renaming are too slow, there are frequent scenarios where YARN fails due to 
> network error etc.
> Such directories will remain forever in Google Cloud Storage, in case the 
> yarn application manager gets killed.
>  
> Over time this pileup and incurs a lot of cloud storage cost.
>  
> Can we update our File committer to clean up the temporary directories in 
> case the job commit fails.
> PS : This request is specifically for GCS.
> Image for reference 
>  
> !image-2023-07-25-17-52-55-006.png|width=458,height=177!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44543) Cleanup .spark-staging directories when yarn application fails

2023-07-25 Thread Dipayan Dev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dipayan Dev updated SPARK-44543:

Description: 
Spark creates the staging directories like .hive-staging, .spark-staging etc 
which get created when you run an upsert to a partitioned table. Such 
directories will remain forever in Google Cloud Storage, in case the yarn 
application manager gets killed.
 
Over time this pileup and incurs a lot of cloud storage cost.
 
Can we update our File committer to clean up the temporary directories in case 
the job commit fails.
 

!image-2023-07-25-17-52-55-006.png|width=458,height=177!

  was:
Spark creates the staging directories like .hive-staging, .spark-staging etc 
which get created when you run an upsert to a partitioned table. Such 
directories will remain forever in Google Cloud Storage, in case the yarn 
application manager gets killed.
 
Over time this pileup and incurs a lot of cloud storage cost.
 
Can we update our File committer to clean up the temporary directories in case 
the job commit fails.
 


> Cleanup .spark-staging directories when yarn application fails
> --
>
> Key: SPARK-44543
> URL: https://issues.apache.org/jira/browse/SPARK-44543
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.4.1
>Reporter: Dipayan Dev
>Priority: Major
> Attachments: image-2023-07-25-17-52-55-006.png
>
>
> Spark creates the staging directories like .hive-staging, .spark-staging etc 
> which get created when you run an upsert to a partitioned table. Such 
> directories will remain forever in Google Cloud Storage, in case the yarn 
> application manager gets killed.
>  
> Over time this pileup and incurs a lot of cloud storage cost.
>  
> Can we update our File committer to clean up the temporary directories in 
> case the job commit fails.
>  
> !image-2023-07-25-17-52-55-006.png|width=458,height=177!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44543) Cleanup .spark-staging directories when yarn application fails

2023-07-25 Thread Dipayan Dev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dipayan Dev updated SPARK-44543:

Attachment: image-2023-07-25-17-52-55-006.png

> Cleanup .spark-staging directories when yarn application fails
> --
>
> Key: SPARK-44543
> URL: https://issues.apache.org/jira/browse/SPARK-44543
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.4.1
>Reporter: Dipayan Dev
>Priority: Major
> Attachments: image-2023-07-25-17-52-55-006.png
>
>
> Spark creates the staging directories like .hive-staging, .spark-staging etc 
> which get created when you run an upsert to a partitioned table. Such 
> directories will remain forever in Google Cloud Storage, in case the yarn 
> application manager gets killed.
>  
> Over time this pileup and incurs a lot of cloud storage cost.
>  
> Can we update our File committer to clean up the temporary directories in 
> case the job commit fails.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44543) Cleanup .spark-staging directories when yarn application fails

2023-07-25 Thread Dipayan Dev (Jira)
Dipayan Dev created SPARK-44543:
---

 Summary: Cleanup .spark-staging directories when yarn application 
fails
 Key: SPARK-44543
 URL: https://issues.apache.org/jira/browse/SPARK-44543
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Spark Shell
Affects Versions: 3.4.1
Reporter: Dipayan Dev


Spark creates the staging directories like .hive-staging, .spark-staging etc 
which get created when you run an upsert to a partitioned table. Such 
directories will remain forever in Google Cloud Storage, in case the yarn 
application manager gets killed.
 
Over time this pileup and incurs a lot of cloud storage cost.
 
Can we update our File committer to clean up the temporary directories in case 
the job commit fails.
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746849#comment-17746849
 ] 

Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 11:36 AM:
-

Bumping to blocker because I believe this is a potentially very serious issue 
in the query planner, which may affect other queries.

(sort() then select(), where the sorting column is not in select(): the query 
planner would use the wrong column to sort.)


was (Author: JIRAUSER301473):
bumping to blocker because I believe this is a potentially very serious issue 
in the query planner (sort() then select(), and the sorting column is not in 
select(), then query plan would use the wrong column to sort), which may affect 
other queries

> dataset.sort.select.write.partitionBy sorts wrong column
> 
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Blocker
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false.
> After further investigation, Spark actually sorted the wrong column in the 
> following code.
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.
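The reporter's full Java reproduction is not included in this excerpt; a hedged PySpark rendering of the call chain described above, with made-up data and the Tuple3-style _1/_2/_3 column names from the report, would look like:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

dataset = spark.createDataFrame(
    [(3, "b", "y"), (1, "a", "x"), (2, "a", "z")], ["_1", "_2", "_3"]
)

(dataset.sort("_1")
    .select("_2", "_3")
    .write
    .partitionBy("_2")
    .mode("overwrite")
    .text("output"))   # expectation: rows in each partition follow the _1 sort order
{code}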



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44540) Remove unused stylesheet and javascript files of jsonFormatter

2023-07-25 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-44540:


Assignee: Kent Yao

> Remove unused stylesheet and javascript files of jsonFormatter
> --
>
> Key: SPARK-44540
> URL: https://issues.apache.org/jira/browse/SPARK-44540
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> jsonFormatter.min.css and jsonFormatter.min.js is unreached



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44540) Remove unused stylesheet and javascript files of jsonFormatter

2023-07-25 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-44540.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42145
[https://github.com/apache/spark/pull/42145]

> Remove unused stylesheet and javascript files of jsonFormatter
> --
>
> Key: SPARK-44540
> URL: https://issues.apache.org/jira/browse/SPARK-44540
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> jsonFormatter.min.css and jsonFormatter.min.js is unreached



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746849#comment-17746849
 ] 

Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 10:06 AM:
-

bumping to blocker because I believe this is a potentially very serious issue 
in the query planner (sort().select() and the original sorting column is not in 
select(), then query plan would use the wrong column to sort), which may affect 
other queries


was (Author: JIRAUSER301473):
bumping to blocker because I believe this is a potentially very serious issue 
in the query planner, which may affect other queries

> dataset.sort.select.write.partitionBy sorts wrong column
> 
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Blocker
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also have the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false.
> After further investigation, spark actually sorted wrong column in the 
> following code.
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746849#comment-17746849
 ] 

Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 10:06 AM:
-

bumping to blocker because I believe this is a potentially very serious issue 
in the query planner (sort() then select(), and the sorting column is not in 
select(), then query plan would use the wrong column to sort), which may affect 
other queries


was (Author: JIRAUSER301473):
bumping to blocker because I believe this is a potentially very serious issue 
in the query planner (sort().select() and the original sorting column is not in 
select(), then query plan would use the wrong column to sort), which may affect 
other queries

> dataset.sort.select.write.partitionBy sorts wrong column
> 
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Blocker
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also have the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false.
> After further investigation, spark actually sorted wrong column in the 
> following code.
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43744) Spark Connect scala UDF serialization pulling in unrelated classes not available on server

2023-07-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746863#comment-17746863
 ] 

ASF GitHub Bot commented on SPARK-43744:


User 'zhenlineo' has created a pull request for this issue:
https://github.com/apache/spark/pull/42069

> Spark Connect scala UDF serialization pulling in unrelated classes not 
> available on server
> --
>
> Key: SPARK-43744
> URL: https://issues.apache.org/jira/browse/SPARK-43744
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Juliusz Sompolski
>Priority: Major
>  Labels: SPARK-43745
>
> [https://github.com/apache/spark/pull/41487] moved "interrupt all - 
> background queries, foreground interrupt" and "interrupt all - foreground 
> queries, background interrupt" tests from ClientE2ETestSuite into a new 
> isolated suite SparkSessionE2ESuite to avoid an inexplicable UDF 
> serialization issue.
>  
> When these tests are moved back to ClientE2ETestSuite and when testing with
> {code:java}
> build/mvn clean install -DskipTests -Phive
> build/mvn test -pl connector/connect/client/jvm -Dtest=none 
> -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite{code}
>  
> the tests fail with
> {code:java}
> 23/05/22 15:44:11 ERROR SparkConnectService: Error during: execute. UserId: . 
> SessionId: 0f4013ca-3af9-443b-a0e5-e339a827e0cf.
> java.lang.NoClassDefFoundError: 
> org/apache/spark/sql/connect/client/SparkResult
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
> at java.lang.Class.getDeclaredMethod(Class.java:2128)
> at java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1643)
> at java.io.ObjectStreamClass.access$1700(ObjectStreamClass.java:79)
> at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:520)
> at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:494)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.io.ObjectStreamClass.(ObjectStreamClass.java:494)
> at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:391)
> at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:681)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
> at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1815)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1640)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
> at org.apache.spark.util.Utils$.deserialize(Utils.scala:148)
> at 
> org.apache.spark.sql.connect.planner.SparkConnectPlanner.org$apache$spark$sql$connect$planner$SparkConnectPlanner$$unpackUdf(SparkConnectPlanner.scala:1353)
> at 
> org.apache.spark.sql.connect.planner.SparkConnectPlanner$TypedScalaUdf$.apply(SparkConnectPlanner.scala:761)
> at 
> org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformTypedMapPartitions(SparkConnectPlanner.scala:531)
> at 
> org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformMapPartitions(SparkConnectPlanner.scala:495)
> at 
> org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:143)
> at 
> org.apache.spark.sql.connect.service.SparkConnectStreamHandler.handlePlan(SparkConnectStreamHandler.scala:100)
> at 
> org.apache.spark.sql.connect.service.SparkConnectStreamHandler.$anonfun$handle$2(SparkConnectStreamHandler.scala:87)
> at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> at 

[jira] [Commented] (SPARK-43744) Spark Connect scala UDF serialization pulling in unrelated classes not available on server

2023-07-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746862#comment-17746862
 ] 

ASF GitHub Bot commented on SPARK-43744:


User 'zhenlineo' has created a pull request for this issue:
https://github.com/apache/spark/pull/42069

> Spark Connect scala UDF serialization pulling in unrelated classes not 
> available on server
> --
>
> Key: SPARK-43744
> URL: https://issues.apache.org/jira/browse/SPARK-43744
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Juliusz Sompolski
>Priority: Major
>  Labels: SPARK-43745
>
> [https://github.com/apache/spark/pull/41487] moved "interrupt all - 
> background queries, foreground interrupt" and "interrupt all - foreground 
> queries, background interrupt" tests from ClientE2ETestSuite into a new 
> isolated suite SparkSessionE2ESuite to avoid an inexplicable UDF 
> serialization issue.
>  
> When these tests are moved back to ClientE2ETestSuite and when testing with
> {code:java}
> build/mvn clean install -DskipTests -Phive
> build/mvn test -pl connector/connect/client/jvm -Dtest=none 
> -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite{code}
>  
> the tests fail with
> {code:java}
> 23/05/22 15:44:11 ERROR SparkConnectService: Error during: execute. UserId: . 
> SessionId: 0f4013ca-3af9-443b-a0e5-e339a827e0cf.
> java.lang.NoClassDefFoundError: 
> org/apache/spark/sql/connect/client/SparkResult
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
> at java.lang.Class.getDeclaredMethod(Class.java:2128)
> at java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1643)
> at java.io.ObjectStreamClass.access$1700(ObjectStreamClass.java:79)
> at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:520)
> at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:494)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.io.ObjectStreamClass.(ObjectStreamClass.java:494)
> at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:391)
> at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:681)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
> at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1815)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1640)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
> at org.apache.spark.util.Utils$.deserialize(Utils.scala:148)
> at 
> org.apache.spark.sql.connect.planner.SparkConnectPlanner.org$apache$spark$sql$connect$planner$SparkConnectPlanner$$unpackUdf(SparkConnectPlanner.scala:1353)
> at 
> org.apache.spark.sql.connect.planner.SparkConnectPlanner$TypedScalaUdf$.apply(SparkConnectPlanner.scala:761)
> at 
> org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformTypedMapPartitions(SparkConnectPlanner.scala:531)
> at 
> org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformMapPartitions(SparkConnectPlanner.scala:495)
> at 
> org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:143)
> at 
> org.apache.spark.sql.connect.service.SparkConnectStreamHandler.handlePlan(SparkConnectStreamHandler.scala:100)
> at 
> org.apache.spark.sql.connect.service.SparkConnectStreamHandler.$anonfun$handle$2(SparkConnectStreamHandler.scala:87)
> at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> at 
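
For readers unfamiliar with the failure mode: a typed UDF is shipped as a Java-serialized 
closure, and if the lambda touches any member of its enclosing test suite it captures the 
whole suite instance, whose class descriptor can reference client-only classes such as 
SparkResult. The sketch below only illustrates that capture mechanism under assumed names 
(CaptureExampleSuite, offset, inspect are made up); it is not the actual ClientE2ETestSuite 
code, and the root cause tracked in this ticket may turn out to be different.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connect.client.SparkResult

// Hypothetical client-side suite; all names here are made up for illustration.
class CaptureExampleSuite extends Serializable {

  private val offset = 1L // instance member: a lambda that reads it captures `this`

  // A method signature that mentions a client-only class. When the server
  // deserializes a closure that captured `this`, ObjectStreamClass walks the
  // declared methods of CaptureExampleSuite and can fail with
  // NoClassDefFoundError: org/apache/spark/sql/connect/client/SparkResult.
  private def inspect(result: SparkResult[java.lang.Long]): Unit = ()

  def run(spark: SparkSession): Array[Long] = {
    import spark.implicits._
    spark.range(10)
      // Reading `offset` drags the whole suite instance into the UDF payload.
      .mapPartitions(it => it.map(x => x.longValue + offset))
      .collect()
  }
}
{code}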

[jira] [Commented] (SPARK-44509) Fine grained interrupt in Python Spark Connect

2023-07-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746858#comment-17746858
 ] 

ASF GitHub Bot commented on SPARK-44509:


User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/42149

> Fine grained interrupt in Python Spark Connect
> --
>
> Key: SPARK-44509
> URL: https://issues.apache.org/jira/browse/SPARK-44509
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Same as https://issues.apache.org/jira/browse/SPARK-44422 but need it for 
> Python
>  
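
For context, a minimal sketch of the tag-based interrupt API that SPARK-44422 added to the 
Scala Spark Connect client, which this ticket ports to Python. The connection URL, tag name, 
and query below are made up, and the sketch assumes that operations started from a thread 
after spark.addTag(...) can later be cancelled with spark.interruptTag(...); the PySpark 
counterpart is expected to mirror these method names.
{code:scala}
import org.apache.spark.sql.SparkSession

// Cancel one tagged query instead of the whole session (spark.interruptAll).
// Assumes a Spark Connect server reachable at the given URL.
object InterruptByTagExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()

    val longRunning = new Thread(() => {
      spark.addTag("nightly-report") // tag operations started from this thread
      try {
        spark.range(10000000000L).selectExpr("sum(id)").collect()
      } catch {
        case _: Throwable => () // an interrupted query surfaces as an exception
      }
    })
    longRunning.start()

    Thread.sleep(2000) // give the query time to reach the server
    // From any thread: cancel only the operations carrying this tag.
    spark.interruptTag("nightly-report")
    longRunning.join()
    spark.close()
  }
}
{code}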



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44542) easily load SparkExitCode class in SparkUncaughtExceptionHandler

2023-07-25 Thread YE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YE updated SPARK-44542:
---
Description: 
There are two pieces of background for this improvement proposal:

1. When running Spark on YARN, a disk might become corrupted while the 
application is running. The corrupted disk might contain the Spark jars (the 
cached archive from spark.yarn.archive). In that case the executor JVM cannot 
load any Spark-related classes any more.

2. Spark leverages the OutputCommitCoordinator to avoid data races between 
speculative tasks, so that no two tasks can commit the same partition at the 
same time. In other words, once a task's commit request is allowed, other 
commit requests are denied until the committing task fails.

We encountered a corner case combining the two situations above, which makes 
Spark hang. A short timeline:
 # task 5372 (tid: 21662) starts running at 21:55
 # the disk containing the Spark archive for that task/executor becomes 
corrupted around 22:00, making the archive inaccessible from the executor 
JVM's perspective
 # the task keeps running; at 22:05 it requests commit permission from the 
coordinator and performs the commit
 # however, due to the corrupted disk, an exception is raised in the executor JVM
 # the SparkUncaughtExceptionHandler kicks in, but because the jar/disk is 
corrupted, the handler itself throws an exception, and the halt process throws 
an exception too
 # the executor hangs there and no more tasks are running, yet the authorized 
commit request is still considered valid on the driver side
 # speculative tasks start to kick in, but since they cannot obtain commit 
permission, all of them are killed/denied
 # the job hangs until our SRE kills the container from outside

Some screenshots are provided below.

!image-2023-07-25-16-46-03-989.png!

!image-2023-07-25-16-46-28-158.png!

!image-2023-07-25-16-46-42-522.png!

For this specific case, I'd like to propose eagerly loading the SparkExitCode 
class in SparkUncaughtExceptionHandler, so that the halt process can be 
executed rather than throwing an exception because SparkExitCode is not 
loadable in the scenario above.

  was:
There are two pieces of background for this improvement proposal:

1. When running Spark on YARN, a disk might become corrupted while the 
application is running. The corrupted disk might contain the Spark jars (the 
cached archive from spark.yarn.archive). In that case the executor JVM cannot 
load any Spark-related classes any more.

2. Spark leverages the OutputCommitCoordinator to avoid data races between 
speculative tasks, so that no two tasks can commit the same partition at the 
same time. In other words, once a task's commit request is allowed, other 
commit requests are denied until the committing task fails.

We encountered a corner case combining the two situations above, which makes 
Spark hang. A short timeline:
 # task 5372 (tid: 21662) starts running at 21:55
 # the disk containing the Spark archive for that task/executor becomes 
corrupted around 22:00, making the archive inaccessible from the executor 
JVM's perspective
 # the task keeps running; at 22:05 it requests commit permission from the 
coordinator and performs the commit
 # however, due to the corrupted disk, an exception is raised in the executor JVM
 # the SparkUncaughtExceptionHandler kicks in, but because the jar/disk is 
corrupted, the handler itself throws an exception, and the halt process throws 
an exception too
 # the executor hangs there and no more tasks are running, yet the authorized 
commit request is still considered valid on the driver side
 # speculative tasks start to kick in, but since they cannot obtain commit 
permission, all of them are killed/denied
 # the job hangs until our SRE kills the container from outside

Some screenshots are provided below.

!image-2023-07-25-16-46-03-989.png!

For this specific case, I'd like to propose eagerly loading the SparkExitCode 
class in SparkUncaughtExceptionHandler, so that the halt process can be 
executed rather than throwing an exception because SparkExitCode is not 
loadable in the scenario above.


> easily load SparkExitCode class in SparkUncaughtExceptionHandler
> 
>
> Key: SPARK-44542
> URL: https://issues.apache.org/jira/browse/SPARK-44542
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.3, 3.3.2, 3.4.1
>Reporter: YE
>Priority: Major
> Attachments: image-2023-07-25-16-46-03-989.png, 
> image-2023-07-25-16-46-28-158.png, image-2023-07-25-16-46-42-522.png
>
>
> There are two pieces of background for this improvement proposal:
> 1. When running Spark on YARN, a disk might become corrupted while the 
> application is running. The corrupted disk might contain the Spark 

[jira] [Updated] (SPARK-44542) easily load SparkExitCode class in SparkUncaughtExceptionHandler

2023-07-25 Thread YE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YE updated SPARK-44542:
---
Attachment: image-2023-07-25-16-46-42-522.png

> easily load SparkExitCode class in SparkUncaughtExceptionHandler
> 
>
> Key: SPARK-44542
> URL: https://issues.apache.org/jira/browse/SPARK-44542
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.3, 3.3.2, 3.4.1
>Reporter: YE
>Priority: Major
> Attachments: image-2023-07-25-16-46-03-989.png, 
> image-2023-07-25-16-46-28-158.png, image-2023-07-25-16-46-42-522.png
>
>
> There are two pieces of background for this improvement proposal:
> 1. When running Spark on YARN, a disk might become corrupted while the 
> application is running. The corrupted disk might contain the Spark jars (the 
> cached archive from spark.yarn.archive). In that case the executor JVM cannot 
> load any Spark-related classes any more.
> 2. Spark leverages the OutputCommitCoordinator to avoid data races between 
> speculative tasks, so that no two tasks can commit the same partition at the 
> same time. In other words, once a task's commit request is allowed, other 
> commit requests are denied until the committing task fails.
>  
> We encountered a corner case combining the two situations above, which makes 
> Spark hang. A short timeline:
>  # task 5372 (tid: 21662) starts running at 21:55
>  # the disk containing the Spark archive for that task/executor becomes 
> corrupted around 22:00, making the archive inaccessible from the executor 
> JVM's perspective
>  # the task keeps running; at 22:05 it requests commit permission from the 
> coordinator and performs the commit
>  # however, due to the corrupted disk, an exception is raised in the executor 
> JVM
>  # the SparkUncaughtExceptionHandler kicks in, but because the jar/disk is 
> corrupted, the handler itself throws an exception, and the halt process 
> throws an exception too
>  # the executor hangs there and no more tasks are running, yet the authorized 
> commit request is still considered valid on the driver side
>  # speculative tasks start to kick in, but since they cannot obtain commit 
> permission, all of them are killed/denied
>  # the job hangs until our SRE kills the container from outside
> Some screenshots are provided below.
> !image-2023-07-25-16-46-03-989.png!
> !image-2023-07-25-16-46-28-158.png!
> !image-2023-07-25-16-46-42-522.png!
> For this specific case, I'd like to propose eagerly loading the SparkExitCode 
> class in SparkUncaughtExceptionHandler, so that the halt process can be 
> executed rather than throwing an exception because SparkExitCode is not 
> loadable in the scenario above.
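
A rough sketch of the eager-loading idea, using a standalone handler rather than the real 
org.apache.spark.util.SparkUncaughtExceptionHandler; the class name and the exit-code value 
are illustrative, not the actual Spark patch. The point is that everything the failure path 
needs is resolved at construction time, while the jars are still readable, so halt() never 
has to load a class from the corrupted disk.
{code:scala}
// Standalone sketch of the "eager loading" idea; the real proposal would do
// the analogous thing inside org.apache.spark.util.SparkUncaughtExceptionHandler
// with org.apache.spark.util.SparkExitCode.
class EagerUncaughtExceptionHandler extends Thread.UncaughtExceptionHandler {

  // Resolve everything the failure path needs while the class files are still
  // readable (i.e. at executor startup). If the disk holding the jars is
  // corrupted later, halt() below does not trigger any new class loading.
  private val uncaughtExceptionExitCode: Int = 50 // mirrors SparkExitCode.UNCAUGHT_EXCEPTION
  private val fallbackExitCode: Int = 1

  override def uncaughtException(thread: Thread, exception: Throwable): Unit = {
    try {
      System.err.println(s"Uncaught exception in thread ${thread.getName}: $exception")
      Runtime.getRuntime.halt(uncaughtExceptionExitCode)
    } catch {
      // Even the logging above can fail if more classes would have to be
      // loaded; make sure the JVM still goes down.
      case _: Throwable => Runtime.getRuntime.halt(fallbackExitCode)
    }
  }
}

// Installed once, early in executor startup:
//   Thread.setDefaultUncaughtExceptionHandler(new EagerUncaughtExceptionHandler)
{code}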



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44542) easily load SparkExitCode class in SparkUncaughtExceptionHandler

2023-07-25 Thread YE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YE updated SPARK-44542:
---
Attachment: image-2023-07-25-16-46-28-158.png

> easily load SparkExitCode class in SparkUncaughtExceptionHandler
> 
>
> Key: SPARK-44542
> URL: https://issues.apache.org/jira/browse/SPARK-44542
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.3, 3.3.2, 3.4.1
>Reporter: YE
>Priority: Major
> Attachments: image-2023-07-25-16-46-03-989.png, 
> image-2023-07-25-16-46-28-158.png, image-2023-07-25-16-46-42-522.png
>
>
> There are two pieces of background for this improvement proposal:
> 1. When running Spark on YARN, a disk might become corrupted while the 
> application is running. The corrupted disk might contain the Spark jars (the 
> cached archive from spark.yarn.archive). In that case the executor JVM cannot 
> load any Spark-related classes any more.
> 2. Spark leverages the OutputCommitCoordinator to avoid data races between 
> speculative tasks, so that no two tasks can commit the same partition at the 
> same time. In other words, once a task's commit request is allowed, other 
> commit requests are denied until the committing task fails.
>  
> We encountered a corner case combining the two situations above, which makes 
> Spark hang. A short timeline:
>  # task 5372 (tid: 21662) starts running at 21:55
>  # the disk containing the Spark archive for that task/executor becomes 
> corrupted around 22:00, making the archive inaccessible from the executor 
> JVM's perspective
>  # the task keeps running; at 22:05 it requests commit permission from the 
> coordinator and performs the commit
>  # however, due to the corrupted disk, an exception is raised in the executor 
> JVM
>  # the SparkUncaughtExceptionHandler kicks in, but because the jar/disk is 
> corrupted, the handler itself throws an exception, and the halt process 
> throws an exception too
>  # the executor hangs there and no more tasks are running, yet the authorized 
> commit request is still considered valid on the driver side
>  # speculative tasks start to kick in, but since they cannot obtain commit 
> permission, all of them are killed/denied
>  # the job hangs until our SRE kills the container from outside
> Some screenshots are provided below.
> !image-2023-07-25-16-46-03-989.png!
> For this specific case, I'd like to propose eagerly loading the SparkExitCode 
> class in SparkUncaughtExceptionHandler, so that the halt process can be 
> executed rather than throwing an exception because SparkExitCode is not 
> loadable in the scenario above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44542) easily load SparkExitCode class in SparkUncaughtExceptionHandler

2023-07-25 Thread YE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YE updated SPARK-44542:
---
Description: 
There are two pieces of background for this improvement proposal:

1. When running Spark on YARN, a disk might become corrupted while the 
application is running. The corrupted disk might contain the Spark jars (the 
cached archive from spark.yarn.archive). In that case the executor JVM cannot 
load any Spark-related classes any more.

2. Spark leverages the OutputCommitCoordinator to avoid data races between 
speculative tasks, so that no two tasks can commit the same partition at the 
same time. In other words, once a task's commit request is allowed, other 
commit requests are denied until the committing task fails.

We encountered a corner case combining the two situations above, which makes 
Spark hang. A short timeline:
 # task 5372 (tid: 21662) starts running at 21:55
 # the disk containing the Spark archive for that task/executor becomes 
corrupted around 22:00, making the archive inaccessible from the executor 
JVM's perspective
 # the task keeps running; at 22:05 it requests commit permission from the 
coordinator and performs the commit
 # however, due to the corrupted disk, an exception is raised in the executor JVM
 # the SparkUncaughtExceptionHandler kicks in, but because the jar/disk is 
corrupted, the handler itself throws an exception, and the halt process throws 
an exception too
 # the executor hangs there and no more tasks are running, yet the authorized 
commit request is still considered valid on the driver side
 # speculative tasks start to kick in, but since they cannot obtain commit 
permission, all of them are killed/denied
 # the job hangs until our SRE kills the container from outside

Some screenshots are provided below.

!image-2023-07-25-16-46-03-989.png!

For this specific case, I'd like to propose eagerly loading the SparkExitCode 
class in SparkUncaughtExceptionHandler, so that the halt process can be 
executed rather than throwing an exception because SparkExitCode is not 
loadable in the scenario above.

  was:
There are two pieces of background for this improvement proposal:

1. When running Spark on YARN, a disk might become corrupted while the 
application is running. The corrupted disk might contain the Spark jars (the 
cached archive from spark.yarn.archive). In that case the executor JVM cannot 
load any Spark-related classes any more.

2. Spark leverages the OutputCommitCoordinator to avoid data races between 
speculative tasks, so that no two tasks can commit the same partition at the 
same time. In other words, once a task's commit request is allowed, other 
commit requests are denied until the committing task fails.

We encountered a corner case combining the two situations above, which makes 
Spark hang. A short timeline:
 # task 5372 (tid: 21662) starts running at 21:55
 # the disk containing the Spark archive for that task/executor becomes 
corrupted around 22:00, making the archive inaccessible from the executor 
JVM's perspective
 # the task keeps running; at 22:05 it requests commit permission from the 
coordinator and performs the commit
 # however, due to the corrupted disk, an exception is raised in the executor JVM
 # the SparkUncaughtExceptionHandler kicks in, but because the jar/disk is 
corrupted, the handler itself throws an exception, and the halt process throws 
an exception too
 # the executor hangs there and no more tasks are running, yet the authorized 
commit request is still considered valid on the driver side
 # speculative tasks start to kick in, but since they cannot obtain commit 
permission, all of them are killed/denied
 # the job hangs until our SRE kills the container from outside

Some screenshots are provided below.

!image-2023-07-25-16-37-16-821.png!

!image-2023-07-25-16-38-52-270.png!

!image-2023-07-25-16-39-40-182.png!

For this specific case, I'd like to propose eagerly loading the SparkExitCode 
class in SparkUncaughtExceptionHandler, so that the halt process can be 
executed rather than throwing an exception because SparkExitCode is not 
loadable in the scenario above.


> easily load SparkExitCode class in SparkUncaughtExceptionHandler
> 
>
> Key: SPARK-44542
> URL: https://issues.apache.org/jira/browse/SPARK-44542
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.3, 3.3.2, 3.4.1
>Reporter: YE
>Priority: Major
> Attachments: image-2023-07-25-16-46-03-989.png, 
> image-2023-07-25-16-46-28-158.png, image-2023-07-25-16-46-42-522.png
>
>
> There are two pieces of background for this improvement proposal:
> 1. When running Spark on YARN, a disk might become corrupted while the 
> application is running. The corrupted disk might contain the 

[jira] [Updated] (SPARK-44542) easily load SparkExitCode class in SparkUncaughtExceptionHandler

2023-07-25 Thread YE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YE updated SPARK-44542:
---
Attachment: image-2023-07-25-16-46-03-989.png

> easily load SparkExitCode class in SparkUncaughtExceptionHandler
> 
>
> Key: SPARK-44542
> URL: https://issues.apache.org/jira/browse/SPARK-44542
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.3, 3.3.2, 3.4.1
>Reporter: YE
>Priority: Major
> Attachments: image-2023-07-25-16-46-03-989.png, 
> image-2023-07-25-16-46-28-158.png, image-2023-07-25-16-46-42-522.png
>
>
> There are two pieces of background for this improvement proposal:
> 1. When running Spark on YARN, a disk might become corrupted while the 
> application is running. The corrupted disk might contain the Spark jars (the 
> cached archive from spark.yarn.archive). In that case the executor JVM cannot 
> load any Spark-related classes any more.
> 2. Spark leverages the OutputCommitCoordinator to avoid data races between 
> speculative tasks, so that no two tasks can commit the same partition at the 
> same time. In other words, once a task's commit request is allowed, other 
> commit requests are denied until the committing task fails.
>  
> We encountered a corner case combining the two situations above, which makes 
> Spark hang. A short timeline:
>  # task 5372 (tid: 21662) starts running at 21:55
>  # the disk containing the Spark archive for that task/executor becomes 
> corrupted around 22:00, making the archive inaccessible from the executor 
> JVM's perspective
>  # the task keeps running; at 22:05 it requests commit permission from the 
> coordinator and performs the commit
>  # however, due to the corrupted disk, an exception is raised in the executor 
> JVM
>  # the SparkUncaughtExceptionHandler kicks in, but because the jar/disk is 
> corrupted, the handler itself throws an exception, and the halt process 
> throws an exception too
>  # the executor hangs there and no more tasks are running, yet the authorized 
> commit request is still considered valid on the driver side
>  # speculative tasks start to kick in, but since they cannot obtain commit 
> permission, all of them are killed/denied
>  # the job hangs until our SRE kills the container from outside
> Some screenshots are provided below.
> !image-2023-07-25-16-37-16-821.png!
> !image-2023-07-25-16-38-52-270.png!
> !image-2023-07-25-16-39-40-182.png!
>  
> For this specific case, I'd like to propose eagerly loading the SparkExitCode 
> class in SparkUncaughtExceptionHandler, so that the halt process can be 
> executed rather than throwing an exception because SparkExitCode is not 
> loadable in the scenario above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44542) easily load SparkExitCode class in SparkUncaughtExceptionHandler

2023-07-25 Thread YE (Jira)
YE created SPARK-44542:
--

 Summary: easily load SparkExitCode class in 
SparkUncaughtExceptionHandler
 Key: SPARK-44542
 URL: https://issues.apache.org/jira/browse/SPARK-44542
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.1, 3.3.2, 3.1.3
Reporter: YE


There are two pieces of background for this improvement proposal:

1. When running Spark on YARN, a disk might become corrupted while the 
application is running. The corrupted disk might contain the Spark jars (the 
cached archive from spark.yarn.archive). In that case the executor JVM cannot 
load any Spark-related classes any more.

2. Spark leverages the OutputCommitCoordinator to avoid data races between 
speculative tasks, so that no two tasks can commit the same partition at the 
same time. In other words, once a task's commit request is allowed, other 
commit requests are denied until the committing task fails.

We encountered a corner case combining the two situations above, which makes 
Spark hang. A short timeline:
 # task 5372 (tid: 21662) starts running at 21:55
 # the disk containing the Spark archive for that task/executor becomes 
corrupted around 22:00, making the archive inaccessible from the executor 
JVM's perspective
 # the task keeps running; at 22:05 it requests commit permission from the 
coordinator and performs the commit
 # however, due to the corrupted disk, an exception is raised in the executor JVM
 # the SparkUncaughtExceptionHandler kicks in, but because the jar/disk is 
corrupted, the handler itself throws an exception, and the halt process throws 
an exception too
 # the executor hangs there and no more tasks are running, yet the authorized 
commit request is still considered valid on the driver side
 # speculative tasks start to kick in, but since they cannot obtain commit 
permission, all of them are killed/denied
 # the job hangs until our SRE kills the container from outside

Some screenshots are provided below.

!image-2023-07-25-16-37-16-821.png!

!image-2023-07-25-16-38-52-270.png!

!image-2023-07-25-16-39-40-182.png!

For this specific case, I'd like to propose eagerly loading the SparkExitCode 
class in SparkUncaughtExceptionHandler, so that the halt process can be 
executed rather than throwing an exception because SparkExitCode is not 
loadable in the scenario above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746849#comment-17746849
 ] 

Yiu-Chung Lee commented on SPARK-44512:
---

Bumping to Blocker because I believe this is a potentially very serious issue 
in the query planner, which may affect other queries.

> dataset.sort.select.write.partitionBy sorts wrong column
> 
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Blocker
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false.
> After further investigation, Spark actually sorted the wrong column in the 
> following code.
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.
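
A self-contained Scala sketch of the reported pattern (the reporter's code is Java, but the 
API is the same); the sample rows and the "output" path are made up. With 
spark.sql.optimizer.plannedWrite.enabled left at its default of true, the rows inside each 
partition directory may come out ordered by _2 rather than by the intended _1, per the 
report; the commented-out line shows the configuration workaround.
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of the sort -> select -> write.partitionBy pattern from the report.
object SortSelectPartitionByRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sort-select-partitionBy-repro")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Workaround mentioned in the report: disable planned writes so the
    // requested sort on _1 is preserved.
    // spark.conf.set("spark.sql.optimizer.plannedWrite.enabled", "false")

    val dataset = Seq(
      ("3", "a", "x"), ("1", "a", "y"), ("2", "b", "z")
    ).toDS()                   // columns are _1, _2, _3

    dataset.sort("_1")         // intended ordering within each output file
      .select("_2", "_3")
      .write
      .partitionBy("_2")
      .text("output")          // expects the remaining _3 values sorted by _1

    spark.stop()
  }
}
{code}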



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

2023-07-25 Thread Yiu-Chung Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiu-Chung Lee updated SPARK-44512:
--
Priority: Blocker  (was: Major)

> dataset.sort.select.write.partitionBy sorts wrong column
> 
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Blocker
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false.
> After further investigation, Spark actually sorted the wrong column in the 
> following code.
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746833#comment-17746833
 ] 

Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 8:31 AM:


After inspecting the SQL plan, below are the differences

spark.sql.optimizer.plannedWrite.enabled=false (correct result)
 !Test-Details-for-Query-0.png! 

spark.sql.optimizer.plannedWrite.enabled=true (incorrect result)
 !Test-Details-for-Query-1.png! 

It appears Spark generates a sort on the incorrect column if 
spark.sql.optimizer.plannedWrite.enabled=true
(it should sort _1, but it actually sorted _2 instead)


was (Author: JIRAUSER301473):
After inspecting the SQL plan, below are the differences

spark.sql.optimizer.plannedWrite.enabled=false (correct result)
 !Test-Details-for-Query-0.png! 

spark.sql.optimizer.plannedWrite.enabled=true (incorrect result)
 !Test-Details-for-Query-1.png! 

It appears Spark generates a sort on the incorrect column if 
spark.sql.optimizer.plannedWrite.enabled=true
(it should sort _1, but it actually sorted _2 instead)

> dataset.sort.select.write.partitionBy sorts wrong column
> 
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false.
> After further investigation, Spark actually sorted the wrong column in the 
> following code.
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746833#comment-17746833
 ] 

Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 8:31 AM:


After inspecting the SQL plan, below are the differences

spark.sql.optimizer.plannedWrite.enabled=false (correct result)
 !Test-Details-for-Query-0.png! 

spark.sql.optimizer.plannedWrite.enabled=true (incorrect result)
 !Test-Details-for-Query-1.png! 

It appears Spark sorted the incorrect column if 
spark.sql.optimizer.plannedWrite.enabled=true
(it should sort _1, but it actually sorted _2 instead)


was (Author: JIRAUSER301473):
After inspecting the SQL plan, below are the differences

spark.sql.optimizer.plannedWrite.enabled=false (correct result)
 !Test-Details-for-Query-0.png! 

spark.sql.optimizer.plannedWrite.enabled=true (incorrect result)
 !Test-Details-for-Query-1.png! 

It appears Spark generates a sort on the incorrect column if 
spark.sql.optimizer.plannedWrite.enabled=true
(it should sort _1, but it actually sorted _2 instead)

> dataset.sort.select.write.partitionBy sorts wrong column
> 
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false.
> After further investigation, Spark actually sorted the wrong column in the 
> following code.
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

2023-07-25 Thread Yiu-Chung Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiu-Chung Lee updated SPARK-44512:
--
Description: 
(In this example the dataset is of type Tuple3, and the columns are named _1, 
_2 and _3)

 

I found -then when AQE is enabled,- that the following code does not produce 
sorted output (.drop() also has the same problem), unless 
spark.sql.optimizer.plannedWrite.enabled is set to false.

After further investigation, Spark actually sorted the wrong column in the 
following code.

{{dataset.sort("_1")}}
{{.select("_2", "_3")}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

 
(the following workaround is no longer necessary)
-However, if I insert an identity mapper between select and write, the output 
would be sorted as expected.-

-{{dataset = dataset.sort("_1")}}-
-{{.select("_2", "_3");}}-
-{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
-{{.write()}}-
-{{{}.{}}}{{{}partitionBy("_2"){}}}-
-{{.text("output")}}-

Below is the complete code that reproduces the problem.

  was:
(In this example the dataset is of type Tuple3, and the columns are named _1, 
_2 and _3)

 

I found -then when AQE is enabled,- that the following code does not produce 
sorted output (.drop() also has the same problem), unless 
spark.sql.optimizer.plannedWrite.enabled is set to false

{{dataset.sort("_1")}}
{{.select("_2", "_3")}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

 
(the following workaround is no longer necessary)
-However, if I insert an identity mapper between select and write, the output 
would be sorted as expected.-

-{{dataset = dataset.sort("_1")}}-
-{{.select("_2", "_3");}}-
-{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
-{{.write()}}-
-{{{}.{}}}{{{}partitionBy("_2"){}}}-
-{{.text("output")}}-

Below is the complete code that reproduces the problem.


> dataset.sort.select.write.partitionBy sorts wrong column
> 
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false.
> After further investigation, Spark actually sorted the wrong column in the 
> following code.
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column

2023-07-25 Thread Yiu-Chung Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiu-Chung Lee updated SPARK-44512:
--
Summary: dataset.sort.select.write.partitionBy sorts wrong column  (was: 
dataset.sort.select.write.partitionBy sorts wong column)

> dataset.sort.select.write.partitionBy sorts wrong column
> 
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wong column

2023-07-25 Thread Yiu-Chung Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiu-Chung Lee updated SPARK-44512:
--
Summary: dataset.sort.select.write.partitionBy sorts wong column  (was: 
dataset.sort.select.write.partitionBy sorts incorrect column)

> dataset.sort.select.write.partitionBy sorts wong column
> ---
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy sorts incorrect column

2023-07-25 Thread Yiu-Chung Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiu-Chung Lee updated SPARK-44512:
--
Summary: dataset.sort.select.write.partitionBy sorts incorrect column  
(was: dataset.sort.select.write.partitionBy sorts incorrect field)

> dataset.sort.select.write.partitionBy sorts incorrect column
> 
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts incorrect field

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746833#comment-17746833
 ] 

Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 8:24 AM:


After inspecting the SQL plan, below are the differences

spark.sql.optimizer.plannedWrite.enabled=false (correct result)
 !Test-Details-for-Query-0.png! 

spark.sql.optimizer.plannedWrite.enabled=true (incorrect result)
 !Test-Details-for-Query-1.png! 

It appears Spark generates a sort on the incorrect column if 
spark.sql.optimizer.plannedWrite.enabled=true
(it should sort _1, but it actually sorted _2 instead)


was (Author: JIRAUSER301473):
After inspecting the SQL plan, below are the differences

spark.sql.optimizer.plannedWrite.enabled=false (correct result)
 !Test-Details-for-Query-0.png! 

spark.sql.optimizer.plannedWrite.enabled=true (incorrect result)
 !Test-Details-for-Query-1.png! 

It appears Spark generates a sort on the incorrect column if 
spark.sql.optimizer.plannedWrite.enabled=true

> dataset.sort.select.write.partitionBy sorts incorrect field
> ---
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy sorts incorrect field

2023-07-25 Thread Yiu-Chung Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiu-Chung Lee updated SPARK-44512:
--
Summary: dataset.sort.select.write.partitionBy sorts incorrect field  (was: 
dataset.sort.select.write.partitionBy does not return a sorted output)

> dataset.sort.select.write.partitionBy sorts incorrect field
> ---
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746833#comment-17746833
 ] 

Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 8:21 AM:


After inspecting the SQL plan, below are the differences

spark.sql.optimizer.plannedWrite.enabled=false (correct result)
 !Test-Details-for-Query-0.png! 

spark.sql.optimizer.plannedWrite.enabled=true (incorrect result)
 !Test-Details-for-Query-1.png! 

It appears Spark generates a sort on the incorrect column if 
spark.sql.optimizer.plannedWrite.enabled=true


was (Author: JIRAUSER301473):
After inspecting the SQL plan, below are the differences

spark.sql.optimizer.plannedWrite.enabled=false (correct result)
 !Test-Details-for-Query-0.png! 

spark.sql.optimizer.plannedWrite.enabled=true (incorrect result)
 !Test-Details-for-Query-1.png! 

It appears Spark generates an incorrect sorting plan if 
spark.sql.optimizer.plannedWrite.enabled=true

> dataset.sort.select.write.partitionBy does not return a sorted output
> -
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746833#comment-17746833
 ] 

Yiu-Chung Lee commented on SPARK-44512:
---

After inspecting the SQL plan, below are the differences

spark.sql.optimizer.plannedWrite.enabled=false (correct result)
 !Test-Details-for-Query-0.png! 

spark.sql.optimizer.plannedWrite.enabled=true (incorrect result)
 !Test-Details-for-Query-1.png! 

It appears Spark generates an incorrect sorting plan if 
spark.sql.optimizer.plannedWrite.enabled=true

> dataset.sort.select.write.partitionBy does not return a sorted output
> -
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output

2023-07-25 Thread Yiu-Chung Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiu-Chung Lee updated SPARK-44512:
--
Attachment: Test-Details-for-Query-1.png

> dataset.sort.select.write.partitionBy does not return a sorted output
> -
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png, 
> Test-Details-for-Query-1.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also has the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output

2023-07-25 Thread Yiu-Chung Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiu-Chung Lee updated SPARK-44512:
--
Attachment: Test-Details-for-Query-0.png

> dataset.sort.select.write.partitionBy does not return a sorted output
> -
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
> Attachments: Test-Details-for-Query-0.png
>
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also have the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output

2023-07-25 Thread Yiu-Chung Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiu-Chung Lee updated SPARK-44512:
--
Attachment: screenshot-1.png

> dataset.sort.select.write.partitionBy does not return a sorted output
> -
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also have the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output

2023-07-25 Thread Yiu-Chung Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiu-Chung Lee updated SPARK-44512:
--
Attachment: (was: screenshot-1.png)

> dataset.sort.select.write.partitionBy does not return a sorted output
> -
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also have the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746025#comment-17746025
 ] 

Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 7:32 AM:


Here is the 
[gist|https://gist.github.com/leeyc0/2bdab65901fe5754c471832acdc00890] that 
reproduces the issue.

To compile: javac Test.java && jar cvf Test.jar Test.class
To reproduce the bug: spark-submit --class Test Test.jar
No bug (correct output): spark-submit --conf spark.sql.optimizer.plannedWrite.enabled=false 
--class Test Test.jar

(the following commands are either invalid or no longer necessary)
-no bug if workaround is enabled: spark-submit --class Test Test.jar workaround-
-no bug too if AQE is disabled: spark-submit --conf 
spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each 
partition key)-

 


was (Author: JIRAUSER301473):
Here is the 
[gist|https://gist.github.com/leeyc0/2bdab65901fe5754c471832acdc00890] that 
reproduces the issue.

To compile: javac Test.java && jar cvf Test.jar Test.class
bug reproduce: spark-submit --class Test Test.jar
no bug: spark-submit --conf spark.sql.optimizer.plannedWrite.enabled=false 
--class Test Test.jar
-no bug if workaround is enabled: spark-submit --class Test Test.jar workaround-
-no bug too if AQE is disabled: spark-submit --conf 
spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each 
partition key)-

 

> dataset.sort.select.write.partitionBy does not return a sorted output
> -
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also have the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output

2023-07-25 Thread Yiu-Chung Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiu-Chung Lee updated SPARK-44512:
--
Description: 
(In this example the dataset is of type Tuple3, and the columns are named _1, 
_2 and _3)

 

I found -then when AQE is enabled,- that the following code does not produce 
sorted output (.drop() also have the same problem), unless 
spark.sql.optimizer.plannedWrite.enabled is set to false

{{dataset.sort("_1")}}
{{.select("_2", "_3")}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

 
(the following workaround is no longer necessary)
-However, if I insert an identity mapper between select and write, the output 
would be sorted as expected.-

-{{dataset = dataset.sort("_1")}}-
-{{.select("_2", "_3");}}-
-{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
-{{.write()}}-
-{{{}.{}}}{{{}partitionBy("_2"){}}}-
-{{.text("output")}}-

Below is the complete code that reproduces the problem.

  was:
(In this example the dataset is of type Tuple3, and the columns are named _1, 
_2 and _3)

 

I found -then when AQE is enabled,- that the following code does not produce 
sorted output (.drop() also have the same problem)

{{dataset.sort("_1")}}
{{.select("_2", "_3")}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

 
(the following workaround is no longer necessary)
-However, if I insert an identity mapper between select and write, the output 
would be sorted as expected.-

-{{dataset = dataset.sort("_1")}}-
-{{.select("_2", "_3");}}-
-{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
-{{.write()}}-
-{{{}.{}}}{{{}partitionBy("_2"){}}}-
-{{.text("output")}}-

Below is the complete code that reproduces the problem.


> dataset.sort.select.write.partitionBy does not return a sorted output
> -
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also have the same problem), unless 
> spark.sql.optimizer.plannedWrite.enabled is set to false
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output

2023-07-25 Thread Yiu-Chung Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746025#comment-17746025
 ] 

Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 7:30 AM:


Here is the 
[gist|https://gist.github.com/leeyc0/2bdab65901fe5754c471832acdc00890] that 
reproduces the issue.

To compile: javac Test.java && jar cvf Test.jar Test.class
bug reproduce: spark-submit --class Test Test.jar
no bug: spark-submit --conf spark.sql.optimizer.plannedWrite.enabled=false 
--class Test Test.jar
-no bug if workaround is enabled: spark-submit --class Test Test.jar workaround-
-no bug too if AQE is disabled: spark-submit --conf 
spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each 
partition key)-

 


was (Author: JIRAUSER301473):
Here is the 
[gist|https://gist.github.com/leeyc0/2bdab65901fe5754c471832acdc00890] that 
reproduces the issue.

To compile: javac Test.java && jar cvf Test.jar Test.class
bug reproduce: spark-submit --class Test Test.jar
no bug if workaround is enabled: spark-submit --class Test Test.jar workaround
-no bug too if AQE is disabled: spark-submit --conf 
spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each 
partition key)-

 

> dataset.sort.select.write.partitionBy does not return a sorted output
> -
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also have the same problem)
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output

2023-07-25 Thread Yiu-Chung Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiu-Chung Lee updated SPARK-44512:
--
Description: 
(In this example the dataset is of type Tuple3, and the columns are named _1, 
_2 and _3)

 

I found -then when AQE is enabled,- that the following code does not produce 
sorted output (.drop() also have the same problem)

{{dataset.sort("_1")}}
{{.select("_2", "_3")}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

 
(the following workaround is no longer necessary)
However, if I insert an identity mapper between select and write, the output 
would be sorted as expected.

{{dataset = dataset.sort("_1")}}
{{.select("_2", "_3");}}
{{dataset.map((MapFunction) row -> row, dataset.encoder())}}
{{.write()}}
{{{}.{}}}{{{}partitionBy("_2"){}}}
{{.text("output")}}

Below is the complete code that reproduces the problem.

  was:
(In this example the dataset is of type Tuple3, and the columns are named _1, 
_2 and _3)

 

I found -then when AQE is enabled,- that the following code does not produce 
sorted output (.drop() also have the same problem)

{{dataset.sort("_1")}}
{{.select("_2", "_3")}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

 
(the following workaround is no longer necessary)
-However, if I insert an identity mapper between select and write, the output 
would be sorted as expected.

{{dataset = dataset.sort("_1")}}
{{.select("_2", "_3");}}
{{dataset.map((MapFunction) row -> row, dataset.encoder())}}
{{.write()}}
{{{}.{}}}{{{}partitionBy("_2"){}}}
{{.text("output")}}

Below is the complete code that reproduces the problem.


> dataset.sort.select.write.partitionBy does not return a sorted output
> -
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also have the same problem)
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.
> {{dataset = dataset.sort("_1")}}
> {{.select("_2", "_3");}}
> {{dataset.map((MapFunction) row -> row, dataset.encoder())}}
> {{.write()}}
> {{{}.{}}}{{{}partitionBy("_2"){}}}
> {{.text("output")}}
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output

2023-07-25 Thread Yiu-Chung Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiu-Chung Lee updated SPARK-44512:
--
Description: 
(In this example the dataset is of type Tuple3, and the columns are named _1, 
_2 and _3)

 

I found -then when AQE is enabled,- that the following code does not produce 
sorted output (.drop() also have the same problem)

{{dataset.sort("_1")}}
{{.select("_2", "_3")}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

 
(the following workaround is no longer necessary)
-However, if I insert an identity mapper between select and write, the output 
would be sorted as expected.

{{dataset = dataset.sort("_1")}}
{{.select("_2", "_3");}}
{{dataset.map((MapFunction) row -> row, dataset.encoder())}}
{{.write()}}
{{{}.{}}}{{{}partitionBy("_2"){}}}
{{.text("output")}}

Below is the complete code that reproduces the problem.

  was:
(In this example the dataset is of type Tuple3, and the columns are named _1, 
_2 and _3)

 

I found -then when AQE is enabled,- that the following code does not produce 
sorted output (.drop() also have the same problem)

{{dataset.sort("_1")}}
{{.select("_2", "_3")}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

 
(the following workaround is not necessary)
-However, if I insert an identity mapper between select and write, the output 
would be sorted as expected.

{{dataset = dataset.sort("_1")}}
{{.select("_2", "_3");}}
{{dataset.map((MapFunction) row -> row, dataset.encoder())}}
{{.write()}}
{{{}.{}}}{{{}partitionBy("_2"){}}}
{{.text("output")}}-

Below is the complete code that reproduces the problem.


> dataset.sort.select.write.partitionBy does not return a sorted output
> -
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also have the same problem)
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.
> {{dataset = dataset.sort("_1")}}
> {{.select("_2", "_3");}}
> {{dataset.map((MapFunction) row -> row, dataset.encoder())}}
> {{.write()}}
> {{{}.{}}}{{{}partitionBy("_2"){}}}
> {{.text("output")}}
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output

2023-07-25 Thread Yiu-Chung Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiu-Chung Lee updated SPARK-44512:
--
Description: 
(In this example the dataset is of type Tuple3, and the columns are named _1, 
_2 and _3)

 

I found -then when AQE is enabled,- that the following code does not produce 
sorted output (.drop() also have the same problem)

{{dataset.sort("_1")}}
{{.select("_2", "_3")}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

 
(the following workaround is no longer necessary)
-However, if I insert an identity mapper between select and write, the output 
would be sorted as expected.-

-{{dataset = dataset.sort("_1")}}-
-{{.select("_2", "_3");}}-
-{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
-{{.write()}}-
-{{{}.{}}}{{{}partitionBy("_2"){}}}-
-{{.text("output")}}-

Below is the complete code that reproduces the problem.

  was:
(In this example the dataset is of type Tuple3, and the columns are named _1, 
_2 and _3)

 

I found -then when AQE is enabled,- that the following code does not produce 
sorted output (.drop() also have the same problem)

{{dataset.sort("_1")}}
{{.select("_2", "_3")}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

 
(the following workaround is no longer necessary)
However, if I insert an identity mapper between select and write, the output 
would be sorted as expected.

{{dataset = dataset.sort("_1")}}
{{.select("_2", "_3");}}
{{dataset.map((MapFunction) row -> row, dataset.encoder())}}
{{.write()}}
{{{}.{}}}{{{}partitionBy("_2"){}}}
{{.text("output")}}

Below is the complete code that reproduces the problem.


> dataset.sort.select.write.partitionBy does not return a sorted output
> -
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also have the same problem)
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is no longer necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.-
> -{{dataset = dataset.sort("_1")}}-
> -{{.select("_2", "_3");}}-
> -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}-
> -{{.write()}}-
> -{{{}.{}}}{{{}partitionBy("_2"){}}}-
> -{{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output

2023-07-25 Thread Yiu-Chung Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiu-Chung Lee updated SPARK-44512:
--
Description: 
(In this example the dataset is of type Tuple3, and the columns are named _1, 
_2 and _3)

 

I found -then when AQE is enabled,- that the following code does not produce 
sorted output (.drop() also have the same problem)

{{dataset.sort("_1")}}
{{.select("_2", "_3")}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

 
(the following workaround is not necessary)
-However, if I insert an identity mapper between select and write, the output 
would be sorted as expected.

{{dataset = dataset.sort("_1")}}
{{.select("_2", "_3");}}
{{dataset.map((MapFunction) row -> row, dataset.encoder())}}
{{.write()}}
{{{}.{}}}{{{}partitionBy("_2"){}}}
{{.text("output")}}-

Below is the complete code that reproduces the problem.

  was:
(In this example the dataset is of type Tuple3, and the columns are named _1, 
_2 and _3)

 

I found -then when AQE is enabled,- that the following code does not produce 
sorted output (.drop() also have the same problem)

{{dataset.sort("_1")}}
{{.select("_2", "_3")}}
{{.write()}}
{{.partitionBy("_2")}}
{{.text("output");}}

 

However, if I insert an identity mapper between select and write, the output 
would be sorted as expected.

{{dataset = dataset.sort("_1")}}
{{.select("_2", "_3");}}
{{dataset.map((MapFunction) row -> row, dataset.encoder())}}
{{.write()}}
{{{}.{}}}{{{}partitionBy("_2"){}}}
{{.text("output")}}

Below is the complete code that reproduces the problem.


> dataset.sort.select.write.partitionBy does not return a sorted output
> -
>
> Key: SPARK-44512
> URL: https://issues.apache.org/jira/browse/SPARK-44512
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.4.1
>Reporter: Yiu-Chung Lee
>Priority: Major
>  Labels: correctness
>
> (In this example the dataset is of type Tuple3, and the columns are named _1, 
> _2 and _3)
>  
> I found -then when AQE is enabled,- that the following code does not produce 
> sorted output (.drop() also have the same problem)
> {{dataset.sort("_1")}}
> {{.select("_2", "_3")}}
> {{.write()}}
> {{.partitionBy("_2")}}
> {{.text("output");}}
>  
> (the following workaround is not necessary)
> -However, if I insert an identity mapper between select and write, the output 
> would be sorted as expected.
> {{dataset = dataset.sort("_1")}}
> {{.select("_2", "_3");}}
> {{dataset.map((MapFunction) row -> row, dataset.encoder())}}
> {{.write()}}
> {{{}.{}}}{{{}partitionBy("_2"){}}}
> {{.text("output")}}-
> Below is the complete code that reproduces the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44534) Handle only shuffle files in KubernetesLocalDiskShuffleExecutorComponents

2023-07-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44534:
-

Assignee: Dongjoon Hyun

> Handle only shuffle files in KubernetesLocalDiskShuffleExecutorComponents
> -
>
> Key: SPARK-44534
> URL: https://issues.apache.org/jira/browse/SPARK-44534
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44534) Handle only shuffle files in KubernetesLocalDiskShuffleExecutorComponents

2023-07-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44534.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 42138
[https://github.com/apache/spark/pull/42138]

> Handle only shuffle files in KubernetesLocalDiskShuffleExecutorComponents
> -
>
> Key: SPARK-44534
> URL: https://issues.apache.org/jira/browse/SPARK-44534
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org