[jira] [Created] (SPARK-44552) Remove useless private object `ParseState` from `IntervalUtils`
Yang Jie created SPARK-44552: Summary: Remove useless private object `ParseState` from `IntervalUtils` Key: SPARK-44552 URL: https://issues.apache.org/jira/browse/SPARK-44552 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yang Jie SPARK-44326 (https://github.com/apache/spark/pull/41885/files#diff-e695bcf363031df04cc16c96f1b62fb35093807ee1157c3cbecae3cccb9a3f4e) moved the relevant code from `IntervalUtils` to `SparkIntervalUtils` (including the definition of another `private object ParseState`), but did not delete the definition of `private object ParseState` in `IntervalUtils`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43968) Improve error messages for Python UDTFs with wrong number of outputs
[ https://issues.apache.org/jira/browse/SPARK-43968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747262#comment-17747262 ] Snoot.io commented on SPARK-43968: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/42157 > Improve error messages for Python UDTFs with wrong number of outputs > > > Key: SPARK-43968 > URL: https://issues.apache.org/jira/browse/SPARK-43968 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Improve the error messages for Python UDTFs when the number of outputs > mismatches the number of outputs specified in the return type of the UDTFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
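The kind of mismatch this ticket covers can be sketched in plain Python, independent of PySpark. The helper name and the message wording below are hypothetical, not Spark's actual implementation; they only illustrate checking each yielded row against the declared schema so the user gets a descriptive error instead of an opaque unpacking failure.

```python
# Hypothetical sketch (not Spark's code): validate that each row a UDTF yields
# has as many values as the declared return schema has fields.

def check_udtf_output(rows, schema_fields):
    """Raise a descriptive error when a row's length differs from the schema."""
    for i, row in enumerate(rows):
        if len(row) != len(schema_fields):
            raise ValueError(
                f"UDTF returned {len(row)} column(s) in row {i}, "
                f"but its return type declares {len(schema_fields)} field(s): {schema_fields}"
            )
    return rows

class TestUDTF:
    def eval(self):
        yield (1,)          # one value per row...

# ...but the declared schema has two fields, so the check fails loudly:
rows = list(TestUDTF().eval())
try:
    check_udtf_output(rows, ["a", "b"])
except ValueError as e:
    print(e)
```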
[jira] [Commented] (SPARK-44479) Support Python UDTFs with empty schema
[ https://issues.apache.org/jira/browse/SPARK-44479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747258#comment-17747258 ] Snoot.io commented on SPARK-44479: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/42161 > Support Python UDTFs with empty schema > -- > > Key: SPARK-44479 > URL: https://issues.apache.org/jira/browse/SPARK-44479 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Priority: Major > > Support UDTFs with empty schema, for example: > {code:python} > >>> class TestUDTF: > ... def eval(self): > ... yield tuple() > {code} > Currently it fails with `useArrow=True`: > {code:python} > >>> udtf(TestUDTF, returnType=StructType())().collect() > Traceback (most recent call last): > ... > ValueError: not enough values to unpack (expected 2, got 0) > {code} > whereas without Arrow: > {code:python} > >>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect() > [Row()] > {code} > Otherwise, we should raise an error without Arrow, too, to be consistent. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44535) Move Streaming API to sql/api
[ https://issues.apache.org/jira/browse/SPARK-44535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-44535. --- Fix Version/s: 3.5.0 Resolution: Fixed > Move Streaming API to sql/api > - > > Key: SPARK-44535 > URL: https://issues.apache.org/jira/browse/SPARK-44535 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44532) Move ArrowUtil to sql/api
[ https://issues.apache.org/jira/browse/SPARK-44532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-44532. --- Fix Version/s: 3.5.0 Resolution: Fixed > Move ArrowUtil to sql/api > - > > Key: SPARK-44532 > URL: https://issues.apache.org/jira/browse/SPARK-44532 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44550) Wrong semantics for null IN (empty list)
[ https://issues.apache.org/jira/browse/SPARK-44550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-44550: -- Fix Version/s: (was: 3.5.0) > Wrong semantics for null IN (empty list) > > > Key: SPARK-44550 > URL: https://issues.apache.org/jira/browse/SPARK-44550 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > > {{null IN (empty list)}} incorrectly evaluates to null, when it should > evaluate to false. (The reason it should be false is because a IN (b1, b2) is > defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR > which is false. This is specified by ANSI SQL.) > Many places in Spark execution (In, InSet, InSubquery) and optimization > (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that > the Spark behavior for the null IN (empty list) is inconsistent in some > places - literal IN lists generally return null (incorrect), while IN/NOT IN > subqueries mostly return false/true, respectively (correct) in this case. > This is a longstanding correctness issue which has existed since null support > for IN expressions was first added to Spark. > Doc with more details: > [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44551) Wrong semantics for null IN (empty list) - IN expression execution
Jack Chen created SPARK-44551: - Summary: Wrong semantics for null IN (empty list) - IN expression execution Key: SPARK-44551 URL: https://issues.apache.org/jira/browse/SPARK-44551 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Jack Chen {{null IN (empty list)}} incorrectly evaluates to null, when it should evaluate to false. (The reason it should be false is because a IN (b1, b2) is defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR which is false. This is specified by ANSI SQL.) Many places in Spark execution (In, InSet, InSubquery) and optimization (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that the Spark behavior for the null IN (empty list) is inconsistent in some places - literal IN lists generally return null (incorrect), while IN/NOT IN subqueries mostly return false/true, respectively (correct) in this case. This is a longstanding correctness issue which has existed since null support for IN expressions was first added to Spark. Doc with more details: [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
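The ANSI semantics described above can be demonstrated with a small three-valued-logic sketch in plain Python, using `None` for SQL NULL (the function names are illustrative, not Spark internals). Because `a IN (b1, ..., bn)` unrolls to `a = b1 OR ... OR a = bn` and the empty OR is FALSE, `null IN ()` must evaluate to FALSE, not NULL:

```python
# Three-valued SQL logic sketch: None stands in for SQL NULL.

def sql_eq(a, b):
    """SQL equality: NULL if either side is NULL, else an ordinary boolean."""
    if a is None or b is None:
        return None
    return a == b

def sql_or(x, y):
    """SQL OR: TRUE dominates, then NULL, then FALSE."""
    if x is True or y is True:
        return True
    if x is None or y is None:
        return None
    return False

def sql_in(a, values):
    result = False            # the empty OR is FALSE
    for v in values:
        result = sql_or(result, sql_eq(a, v))
    return result

print(sql_in(None, []))       # False -- the correct ANSI answer
print(sql_in(None, [1]))      # None  -- unknown whether null = 1
print(sql_in(2, [1, 2]))      # True
```

The first line is exactly the case the ticket says Spark gets wrong: literal IN lists currently return NULL for it.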
[jira] [Updated] (SPARK-44431) Wrong semantics for null IN (empty list) - optimization rules
[ https://issues.apache.org/jira/browse/SPARK-44431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-44431: -- Parent: SPARK-44550 Issue Type: Sub-task (was: Bug) > Wrong semantics for null IN (empty list) - optimization rules > - > > Key: SPARK-44431 > URL: https://issues.apache.org/jira/browse/SPARK-44431 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > Fix For: 3.5.0 > > > {{null IN (empty list)}} incorrectly evaluates to null, when it should > evaluate to false. (The reason it should be false is because a IN (b1, b2) is > defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR > which is false. This is specified by ANSI SQL.) > Many places in Spark execution (In, InSet, InSubquery) and optimization > (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that > the Spark behavior for the null IN (empty list) is inconsistent in some > places - literal IN lists generally return null (incorrect), while IN/NOT IN > subqueries mostly return false/true, respectively (correct) in this case. > This is a longstanding correctness issue which has existed since null support > for IN expressions was first added to Spark. > Doc with more details: > [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44431) Wrong semantics for null IN (empty list) - optimization rules
[ https://issues.apache.org/jira/browse/SPARK-44431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-44431: -- Summary: Wrong semantics for null IN (empty list) - optimization rules (was: Wrong semantics for null IN (empty list)) > Wrong semantics for null IN (empty list) - optimization rules > - > > Key: SPARK-44431 > URL: https://issues.apache.org/jira/browse/SPARK-44431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > Fix For: 3.5.0 > > > {{null IN (empty list)}} incorrectly evaluates to null, when it should > evaluate to false. (The reason it should be false is because a IN (b1, b2) is > defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR > which is false. This is specified by ANSI SQL.) > Many places in Spark execution (In, InSet, InSubquery) and optimization > (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that > the Spark behavior for the null IN (empty list) is inconsistent in some > places - literal IN lists generally return null (incorrect), while IN/NOT IN > subqueries mostly return false/true, respectively (correct) in this case. > This is a longstanding correctness issue which has existed since null support > for IN expressions was first added to Spark. > Doc with more details: > [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44550) Wrong semantics for null IN (empty list)
Jack Chen created SPARK-44550: - Summary: Wrong semantics for null IN (empty list) Key: SPARK-44550 URL: https://issues.apache.org/jira/browse/SPARK-44550 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Jack Chen Assignee: Jack Chen Fix For: 3.5.0 {{null IN (empty list)}} incorrectly evaluates to null, when it should evaluate to false. (The reason it should be false is because a IN (b1, b2) is defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR which is false. This is specified by ANSI SQL.) Many places in Spark execution (In, InSet, InSubquery) and optimization (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that the Spark behavior for the null IN (empty list) is inconsistent in some places - literal IN lists generally return null (incorrect), while IN/NOT IN subqueries mostly return false/true, respectively (correct) in this case. This is a longstanding correctness issue which has existed since null support for IN expressions was first added to Spark. Doc with more details: [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44494) K8s-it test failed
[ https://issues.apache.org/jira/browse/SPARK-44494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747248#comment-17747248 ] Snoot.io commented on SPARK-44494: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/42162 > K8s-it test failed > -- > > Key: SPARK-44494 > URL: https://issues.apache.org/jira/browse/SPARK-44494 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > * [https://github.com/apache/spark/actions/runs/5607397734/jobs/10258527838] > {code:java} > [info] - PVs with local hostpath storage on statefulsets *** FAILED *** (3 > minutes, 11 seconds) > [info] The code passed to eventually never returned normally. Attempted > 7921 times over 3.000105988813 minutes. Last failure message: "++ id -u > [info] + myuid=185 > [info] ++ id -g > [info] + mygid=0 > [info] + set +e > [info] ++ getent passwd 185 > [info] + uidentry= > [info] + set -e > [info] + '[' -z '' ']' > [info] + '[' -w /etc/passwd ']' > [info] + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false' > [info] + '[' -z /opt/java/openjdk ']' > [info] + SPARK_CLASSPATH=':/opt/spark/jars/*' > [info] + grep SPARK_JAVA_OPT_ > [info] + sort -t_ -k4 -n > [info] + sed 's/[^=]*=\(.*\)/\1/g' > [info] + env > [info] ++ command -v readarray > [info] + '[' readarray ']' > [info] + readarray -t SPARK_EXECUTOR_JAVA_OPTS > [info] + '[' -n '' ']' > [info] + '[' -z ']' > [info] + '[' -z ']' > [info] + '[' -n '' ']' > [info] + '[' -z ']' > [info] + '[' -z x ']' > [info] + SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*' > [info] + > SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*:/opt/spark/work-dir' > [info] + case "$1" in > [info] + shift 1 > [info] + CMD=("$SPARK_HOME/bin/spark-submit" 
--conf > "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --conf > "spark.executorEnv.SPARK_DRIVER_POD_IP=$SPARK_DRIVER_BIND_ADDRESS" > --deploy-mode client "$@") > [info] + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=10.244.0.45 --conf > spark.executorEnv.SPARK_DRIVER_POD_IP=10.244.0.45 --deploy-mode client > --properties-file /opt/spark/conf/spark.properties --class > org.apache.spark.examples.MiniReadWriteTest > local:///opt/spark/examples/jars/spark-examples_2.12-4.0.0-SNAPSHOT.jar > /opt/spark/pv-tests/tmp3727659354473892032.txt > [info] Files > local:///opt/spark/examples/jars/spark-examples_2.12-4.0.0-SNAPSHOT.jar from > /opt/spark/examples/jars/spark-examples_2.12-4.0.0-SNAPSHOT.jar to > /opt/spark/work-dir/spark-examples_2.12-4.0.0-SNAPSHOT.jar > [info] 23/07/20 06:15:15 WARN NativeCodeLoader: Unable to load > native-hadoop library for your platform... using builtin-java classes where > applicable > [info] Performing local word count from > /opt/spark/pv-tests/tmp3727659354473892032.txt > [info] File contents are List(test PVs) > [info] Creating SparkSession > [info] 23/07/20 06:15:15 INFO SparkContext: Running Spark version > 4.0.0-SNAPSHOT > [info] 23/07/20 06:15:15 INFO SparkContext: OS info Linux, > 5.15.0-1041-azure, amd64 > [info] 23/07/20 06:15:15 INFO SparkContext: Java version 17.0.7 > [info] 23/07/20 06:15:15 INFO ResourceUtils: > == > [info] 23/07/20 06:15:15 INFO ResourceUtils: No custom resources > configured for spark.driver. 
> [info] 23/07/20 06:15:15 INFO ResourceUtils: > == > [info] 23/07/20 06:15:15 INFO SparkContext: Submitted application: Mini > Read Write Test > [info] 23/07/20 06:15:16 INFO ResourceProfile: Default ResourceProfile > created, executor resources: Map(cores -> name: cores, amount: 1, script: , > vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap > -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> > name: cpus, amount: 1.0) {code} > The tests have been failing for the past two days -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM
[ https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44546: --- Description: h2. Summary This ticket adds a dev utility script to help generate PySpark tests using LLM response. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly, to avoid introducing regressions in the codebase. Historically, PySpark has had code regressions due to insufficient testing of public APIs. Below, we outline some common edge case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed into the LLM through the script's base prompt. Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge case scenarios when writing tests. h2. Table of Contents # None # Ints # Floats # Strings # Single column / column name # Multi column / column names # DataFrame argument h3. 1. None * Empty input * None type h3. 2. Ints * Negatives * 0 * value > Int.MaxValue * value < Int.MinValue h3. 3. Floats * Negatives * 0.0 * Float(“nan”) * Float("inf") * Float("-inf") * decimal.Decimal * numpy.float16 h3. 4. Strings * Special characters * Spaces * Empty strings h3. 5. Single column / column name * Non-existent column * Empty column name * Column name with special characters, e.g. dots * Multi columns with the same name * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’ * Column of special types, e.g. nested type; * Column containing special values, e.g. Null; h3. 6. Multi column / column names * Empty input; e.g DataFrame.drop() * Special cases for each single column * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”) * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”)) h3. 7. 
DataFrame argument * DataFrame argument * Empty dataframe; e.g. spark.range(5).limit(0) * Dataframe with 0 columns, e.g. spark.range(5).drop('id') * Dataset with repeated arguments * Local dataset (pd.DataFrame) containing unsupported datatype was: h2. Summary This ticket adds a dev utility script to help generate PySpark tests using LLM response. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly, to avoid introducing regressions in the codebase. Historically, PySpark has had code regressions due to insufficient testing of public APIs (see [https://databricks.atlassian.net/browse/ES-705815]). Below, we outline some common edge case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed into the LLM through the script's base prompt. Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge case scenarios when writing tests. h2. Table of Contents # None # Ints # Floats # Strings # Single column / column name # Multi column / column names # DataFrame argument h3. 1. None * Empty input * None type h3. 2. Ints * Negatives * 0 * value > Int.MaxValue * value < Int.MinValue h3. 3. Floats * Negatives * 0.0 * Float(“nan”) * Float("inf") * Float("-inf") * decimal.Decimal * numpy.float16 h3. 4. Strings * Special characters * Spaces * Empty strings h3. 5. Single column / column name * Non-existent column * Empty column name * Column name with special characters, e.g. dots * Multi columns with the same name * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’ * Column of special types, e.g. nested type; * Column containing special values, e.g. Null; h3. 6. Multi column / column names * Empty input; e.g DataFrame.drop() * Special cases for each single column * Mix column with column names; e.g. 
DataFrame.drop(“col1”, df.col2, “col3”) * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”)) h3. 7. DataFrame argument * DataFrame argument * Empty dataframe; e.g. spark.range(5).limit(0) * Dataframe with 0 columns, e.g. spark.range(5).drop('id') * Dataset with repeated arguments * Local dataset (pd.DataFrame) containing unsupported datatype > Add a dev utility to generate PySpark tests with LLM > > > Key: SPARK-44546 > URL: https://issues.apache.org/jira/browse/SPARK-44546 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > h2. Summary > This ticket adds a dev utility script to help generate PySpark tests using > LLM response. The purpose of this experimental script is
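The edge-case taxonomy in the description above can be encoded as reusable value pools for parameterized tests. This is a hypothetical sketch, not the utility script the ticket adds; all names here are illustrative:

```python
# Hypothetical value pools mirroring the edge-case taxonomy: ints around
# Int.MaxValue/MinValue, special floats, and adversarial strings. None/empty
# input applies to every parameter type.
import decimal

INT_EDGE_CASES = [-1, 0, 2**31 - 1, 2**31, -(2**31) - 1]
FLOAT_EDGE_CASES = [-1.5, 0.0, float("nan"), float("inf"), float("-inf"),
                    decimal.Decimal("1.23")]
STRING_EDGE_CASES = ["", " ", "col.with.dots", "`a.b.c`", "sp3ci@l!"]

def edge_cases_for(annotation):
    """Map a parameter's type annotation to a pool of adversarial inputs."""
    pools = {int: INT_EDGE_CASES, float: FLOAT_EDGE_CASES, str: STRING_EDGE_CASES}
    return [None] + pools.get(annotation, [])

print(edge_cases_for(int))
```

A test generator (LLM-assisted or not) can then iterate each pool against a target API instead of relying on ad-hoc inputs.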
[jira] [Resolved] (SPARK-44530) Move SparkBuildInfo to common/util
[ https://issues.apache.org/jira/browse/SPARK-44530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-44530. --- Fix Version/s: 3.5.0 Resolution: Fixed > Move SparkBuildInfo to common/util > -- > > Key: SPARK-44530 > URL: https://issues.apache.org/jira/browse/SPARK-44530 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44549) Support correlated references under window functions
Andrey Gubichev created SPARK-44549: --- Summary: Support correlated references under window functions Key: SPARK-44549 URL: https://issues.apache.org/jira/browse/SPARK-44549 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1 Reporter: Andrey Gubichev We should support subqueries with correlated references under a window function operator. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
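For illustration, the shape of query this ticket targets looks like the following. The example runs on SQLite via Python's sqlite3 module (not Spark), and the table and column names are made up: a correlated scalar subquery whose inner plan contains a window function, with the outer column referenced below that window operator.

```python
# Sketch of a correlated reference under a window function, using SQLite only
# to make the example runnable; Spark's plan-level limitation is analogous.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers(name TEXT);
    INSERT INTO customers VALUES ('a'), ('b');
    CREATE TABLE orders(customer TEXT, amount INT);
    INSERT INTO orders VALUES ('a', 10), ('a', 20), ('b', 5);
""")

# The subquery filters orders by the OUTER column c.name (the correlated
# reference), then evaluates a window aggregate over the filtered rows.
rows = con.execute("""
    SELECT c.name,
           (SELECT SUM(o.amount) OVER ()
            FROM orders o
            WHERE o.customer = c.name
            LIMIT 1) AS total
    FROM customers c
    ORDER BY c.name
""").fetchall()

print(rows)
```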
[jira] [Created] (SPARK-44548) Add support for pandas DataFrame assertDataFrameEqual
Amanda Liu created SPARK-44548: -- Summary: Add support for pandas DataFrame assertDataFrameEqual Key: SPARK-44548 URL: https://issues.apache.org/jira/browse/SPARK-44548 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu Assignee: Amanda Liu Fix For: 3.5.0, 4.0.0 SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
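As background, DataFrame equality assertions of this kind typically compare rows regardless of order and compare floats approximately. The sketch below is illustrative only, not the PySpark implementation, and its helper name and tolerance parameter are hypothetical:

```python
# Illustrative row-set comparison: unordered rows, approximate floats.
import math

def rows_equal(actual, expected, rtol=1e-5):
    """Compare two row lists ignoring row order, with float tolerance."""
    def key(row):
        return tuple(str(v) for v in row)
    if len(actual) != len(expected):
        return False
    for a, b in zip(sorted(actual, key=key), sorted(expected, key=key)):
        for x, y in zip(a, b):
            if isinstance(x, float) and isinstance(y, float):
                if not math.isclose(x, y, rel_tol=rtol):
                    return False
            elif x != y:
                return False
    return True

print(rows_equal([(1, 1.0000001), (2, 2.0)], [(2, 2.0), (1, 1.0)]))  # True
```

Supporting pandas DataFrames would mean accepting `pd.DataFrame` inputs and reducing them to comparable rows in the same way.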
[jira] [Updated] (SPARK-43968) Improve error messages for Python UDTFs with wrong number of outputs
[ https://issues.apache.org/jira/browse/SPARK-43968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-43968: - Description: Improve the error messages for Python UDTFs when the number of outputs mismatches the number of outputs specified in the return type of the UDTFs. (was: Add more compile time checks when creating UDTFs and throw user-friendly error messages when the UDTF is invalid.) > Improve error messages for Python UDTFs with wrong number of outputs > > > Key: SPARK-43968 > URL: https://issues.apache.org/jira/browse/SPARK-43968 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Improve the error messages for Python UDTFs when the number of outputs > mismatches the number of outputs specified in the return type of the UDTFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43968) Improve error messages for Python UDTFs with wrong number of outputs
[ https://issues.apache.org/jira/browse/SPARK-43968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-43968: - Summary: Improve error messages for Python UDTFs with wrong number of outputs (was: Add more compile-time checks when creating Python UDTFs) > Improve error messages for Python UDTFs with wrong number of outputs > > > Key: SPARK-43968 > URL: https://issues.apache.org/jira/browse/SPARK-43968 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Add more compile time checks when creating UDTFs and throw user-friendly > error messages when the UDTF is invalid. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44005) Support returning non-tuple values for regular Python UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44005: - Description: Currently, if you have a UDTF like this: {code:java} class TestUDTF: def eval(self, a: int): yield a {code} and run the UDTF, it will fail with a confusing error message like {code:java} Unexpected tuple 1 with StructType {code} Note this works when arrow is enabled. We should support this use case for regular UDTFs. was: Currently, if you have a UDTF like this: {code:java} class TestUDTF: def eval(self, a: int): yield a {code} and run the UDTF, it will fail with a confusing error message like {code:java} Unexpected tuple 1 with StructType {code} We should support this use case. > Support returning non-tuple values for regular Python UDTFs > --- > > Key: SPARK-44005 > URL: https://issues.apache.org/jira/browse/SPARK-44005 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Currently, if you have a UDTF like this: > {code:java} > class TestUDTF: > def eval(self, a: int): > yield a {code} > and run the UDTF, it will fail with a confusing error message like > {code:java} > Unexpected tuple 1 with StructType {code} > Note this works when arrow is enabled. We should support this use case for > regular UDTFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
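The requested behavior can be sketched in plain Python: when a regular (non-Arrow) UDTF yields a bare value, treat it as a single-column row instead of failing with "Unexpected tuple 1 with StructType". This is a hypothetical sketch, not Spark's actual code:

```python
# Hypothetical normalization step for regular UDTF output.

def normalize_row(value):
    """Wrap a non-tuple yield into a 1-tuple so it matches a one-field schema."""
    if isinstance(value, tuple):
        return value
    return (value,)

class TestUDTF:
    def eval(self, a: int):
        yield a               # bare value, not a tuple

rows = [normalize_row(v) for v in TestUDTF().eval(1)]
print(rows)                   # [(1,)]
```

This mirrors the Arrow path, which already accepts the bare-value form.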
[jira] [Updated] (SPARK-44005) Support returning non-tuple values for regular Python UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44005: - Description: Currently, if you have a UDTF like this: {code:java} class TestUDTF: def eval(self, a: int): yield a {code} and run the UDTF, it will fail with a confusing error message like {code:java} Unexpected tuple 1 with StructType {code} We should support this use case. was: Currently, if you have a UDTF like this: {code:java} class TestUDTF: def eval(self, a: int): yield a {code} and run the UDTF, it will fail with a confusing error message like {code:java} Unexpected tuple 1 with StructType {code} We should make this error message more user-friendly. > Support returning non-tuple values for regular Python UDTFs > --- > > Key: SPARK-44005 > URL: https://issues.apache.org/jira/browse/SPARK-44005 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Currently, if you have a UDTF like this: > {code:java} > class TestUDTF: > def eval(self, a: int): > yield a {code} > and run the UDTF, it will fail with a confusing error message like > {code:java} > Unexpected tuple 1 with StructType {code} > We should support this use case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44005) Support returning non-tuple values for regular Python UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44005: - Summary: Support returning non-tuple values for regular Python UDTFs (was: Support returning a non-tuple value for regular Python UDTFs) > Support returning non-tuple values for regular Python UDTFs > --- > > Key: SPARK-44005 > URL: https://issues.apache.org/jira/browse/SPARK-44005 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Currently, if you have a UDTF like this: > {code:java} > class TestUDTF: > def eval(self, a: int): > yield a {code} > and run the UDTF, it will fail with a confusing error message like > {code:java} > Unexpected tuple 1 with StructType {code} > We should make this error message more user-friendly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44005) Support returning a non-tuple value for regular Python UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44005: - Summary: Support returning a non-tuple value for regular Python UDTFs (was: Improve the error messages when a UDTF returns a non-tuple value) > Support returning a non-tuple value for regular Python UDTFs > > > Key: SPARK-44005 > URL: https://issues.apache.org/jira/browse/SPARK-44005 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Currently, if you have a UDTF like this: > {code:java} > class TestUDTF: > def eval(self, a: int): > yield a {code} > and run the UDTF, it will fail with a confusing error message like > {code:java} > Unexpected tuple 1 with StructType {code} > We should make this error message more user-friendly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44547) BlockManagerDecommissioner throws exceptions when migrating RDD cached blocks to fallback storage
[ https://issues.apache.org/jira/browse/SPARK-44547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Yin updated SPARK-44547: -- Description: Looks like the RDD cache doesn't support fallback storage, and we should stop the migration if the only viable peer is the fallback storage. [^spark-error.log]

```
23/07/25 05:12:58 WARN BlockManager: Failed to replicate rdd_18_25 to BlockManagerId(fallback, remote, 7337, None), failure #0
java.io.IOException: Failed to connect to remote:7337
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:288)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
    at org.apache.spark.network.netty.NettyBlockTransferService.uploadBlock(NettyBlockTransferService.scala:168)
    at org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:121)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1784)
    at org.apache.spark.storage.BlockManager.$anonfun$replicateBlock$2(BlockManager.scala:1721)
    at org.apache.spark.storage.BlockManager.$anonfun$replicateBlock$2$adapted(BlockManager.scala:1707)
    at scala.Option.forall(Option.scala:390)
    at org.apache.spark.storage.BlockManager.replicateBlock(BlockManager.scala:1707)
    at org.apache.spark.storage.BlockManagerDecommissioner.migrateBlock(BlockManagerDecommissioner.scala:356)
    at org.apache.spark.storage.BlockManagerDecommissioner.$anonfun$decommissionRddCacheBlocks$3(BlockManagerDecommissioner.scala:340)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at org.apache.spark.storage.BlockManagerDecommissioner.decommissionRddCacheBlocks(BlockManagerDecommissioner.scala:339)
    at org.apache.spark.storage.BlockManagerDecommissioner$$anon$1.run(BlockManagerDecommissioner.scala:214)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.net.UnknownHostException: remote
    at java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source)
    at java.base/java.net.InetAddress.getAllByName0(Unknown Source)
    at java.base/java.net.InetAddress.getAllByName(Unknown Source)
    at java.base/java.net.InetAddress.getAllByName(Unknown Source)
    at java.base/java.net.InetAddress.getByName(Unknown Source)
    at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:156)
    at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:153)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:153)
    at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:41)
    at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:61)
    at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:53)
    at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:55)
    at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:31)
    at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:106)
    at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:206)
    at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:46)
    at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:180)
    at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:166)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
    at
```
[jira] [Updated] (SPARK-44547) BlockManagerDecommissioner throws exceptions when migrating RDD cached blocks to fallback storage
[ https://issues.apache.org/jira/browse/SPARK-44547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Yin updated SPARK-44547: -- Attachment: spark-error.log > BlockManagerDecommissioner throws exceptions when migrating RDD cached blocks > to fallback storage > - > > Key: SPARK-44547 > URL: https://issues.apache.org/jira/browse/SPARK-44547 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 3.4.1 > Reporter: Frank Yin > Priority: Major > Attachments: spark-error.log
[jira] [Updated] (SPARK-44547) BlockManagerDecommissioner throws exceptions when migrating RDD cached blocks to fallback storage
[ https://issues.apache.org/jira/browse/SPARK-44547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Yin updated SPARK-44547: -- Description: Looks like the RDD cache doesn't support fallback storage, and we should stop the migration if the only viable peer is the fallback storage.
[jira] [Created] (SPARK-44547) BlockManagerDecommissioner throws exceptions when migrating RDD cached blocks to fallback storage
Frank Yin created SPARK-44547: - Summary: BlockManagerDecommissioner throws exceptions when migrating RDD cached blocks to fallback storage Key: SPARK-44547 URL: https://issues.apache.org/jira/browse/SPARK-44547 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.1 Reporter: Frank Yin Looks like the RDD cache doesn't support fallback storage and we should stop the migration if the only viable peer is the fallback storage.
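The behavior the ticket asks for can be sketched in a few lines. This is a hypothetical illustration, not the actual Spark patch: the names `FALLBACK_BLOCK_MANAGER_ID`, `viable_rdd_migration_peers`, and `should_migrate_rdd_blocks` are assumptions introduced here; in Spark itself the fallback storage appears as a pseudo-peer `BlockManagerId(fallback, remote, 7337, None)`.

```python
# Hypothetical sketch of the peer-filtering rule SPARK-44547 proposes:
# RDD cache blocks cannot be replicated to the fallback storage, so the
# decommissioner should skip migration when it is the only candidate peer.
FALLBACK_BLOCK_MANAGER_ID = "fallback"  # assumed sentinel for the pseudo-peer

def viable_rdd_migration_peers(peers):
    """Exclude the fallback-storage pseudo-peer from the candidate list,
    since replicating an RDD cache block to it fails (UnknownHostException)."""
    return [p for p in peers if p != FALLBACK_BLOCK_MANAGER_ID]

def should_migrate_rdd_blocks(peers):
    """Stop the RDD cache migration when no real peer remains."""
    return bool(viable_rdd_migration_peers(peers))
```

With only the fallback pseudo-peer available, `should_migrate_rdd_blocks(["fallback"])` returns `False`, so the replication attempt that produced the exception above would never be issued.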
[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM
[ https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44546: --- Description: h2. Summary This ticket adds a dev utility script to help generate PySpark tests using LLM response. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly, to avoid introducing regressions in the codebase. Historically, PySpark has had code regressions due to insufficient testing of public APIs (see [https://databricks.atlassian.net/browse/ES-705815]). Below, we outline some common edge case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed into the LLM through the script's base prompt. Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge case scenarios when writing tests. h2. Table of Contents # None # Ints # Floats # Strings # Single column / column name # Multi column / column names # DataFrame argument h3. 1. None * Empty input * None type h3. 2. Ints * Negatives * 0 * value > Int.MaxValue * value < Int.MinValue h3. 3. Floats * Negatives * 0.0 * Float(“nan”) * Float("inf") * Float("-inf") * decimal.Decimal * numpy.float16 h3. 4. Strings * Special characters * Spaces * Empty strings h3. 5. Single column / column name * Non-existent column * Empty column name * Column name with special characters, e.g. dots * Multi columns with the same name * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’ * Column of special types, e.g. nested type; * Column containing special values, e.g. Null; h3. 6. Multi column / column names * Empty input; e.g DataFrame.drop() * Special cases for each single column * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”) * Duplicated columns; e.g. 
DataFrame.drop(“col1”, col(“col1”)) h3. 7. DataFrame argument * DataFrame argument * Empty dataframe; e.g. spark.range(5).limit(0) * Dataframe with 0 columns, e.g. spark.range(5).drop('id') * Dataset with repeated arguments * Local dataset (pd.DataFrame) containing unsupported datatype was: h2. Summary This ticket adds a dev utility script to help generate PySpark tests using LLM response. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly, to avoid introducing regressions in the codebase. Below, we outline some common edge case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed into the LLM through the script's base prompt. Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge case scenarios when writing tests. h2. Table of Contents # None # Ints # Floats # Strings # Single column / column name # Multi column / column names # DataFrame argument h3. 1. None * Empty input * None type h3. 2. Ints * Negatives * 0 * value > Int.MaxValue * value < Int.MinValue h3. 3. Floats * Negatives * 0.0 * Float(“nan”) * Float("inf") * Float("-inf") * decimal.Decimal * numpy.float16 h3. 4. Strings * Special characters * Spaces * Empty strings h3. 5. Single column / column name * Non-existent column * Empty column name * Column name with special characters, e.g. dots * Multi columns with the same name * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’ * Column of special types, e.g. nested type; * Column containing special values, e.g. Null; h3. 6. Multi column / column names * Empty input; e.g DataFrame.drop() * Special cases for each single column * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”) * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”)) h3. 7. 
DataFrame argument * DataFrame argument * Empty dataframe; e.g. spark.range(5).limit(0) * Dataframe with 0 columns, e.g. spark.range(5).drop('id') * Dataset with repeated arguments * Local dataset (pd.DataFrame) containing unsupported datatype > Add a dev utility to generate PySpark tests with LLM > > > Key: SPARK-44546 > URL: https://issues.apache.org/jira/browse/SPARK-44546 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > h2. Summary > This ticket adds a dev utility script to help generate PySpark tests using > LLM response. The purpose of this experimental script is to encourage PySpark > developers to test their code thoroughly, to avoid introducing
[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM
[ https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44546: --- Description: h2. Summary This ticket adds a dev utility script to help generate PySpark tests using LLM response. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly, to avoid introducing regressions in the codebase. Below, we outline some common edge case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed into the LLM through the script's base prompt. Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge case scenarios when writing tests. h2. Table of Contents # None # Ints # Floats # Single column / column name # Multi column / column names # DataFrame argument h3. 1. None * Empty input * None type h3. 2. Ints * Negatives * 0 * value > Int.MaxValue * value < Int.MinValue h3. 3. Floats * Negatives * 0.0 * Float(“nan”) * Float("inf") * Float("-inf") * decimal.Decimal * numpy.float16 h3. 4. String * Special characters * Spaces * Empty strings h3. 5. Single column / column name * Non-existent column * Empty column name * Column name with special characters, e.g. dots * Multi columns with the same name * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’ * Column of special types, e.g. nested type; * Column containing special values, e.g. Null; h3. 6. Multi column / column names * Empty input; e.g DataFrame.drop() * Special cases for each single column * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”) * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”)) h3. 7. DataFrame argument * DataFrame argument * Empty dataframe; e.g. spark.range(5).limit(0) * Dataframe with 0 columns, e.g. 
spark.range(5).drop('id') * Dataset with repeated arguments * Local dataset (pd.DataFrame) containing unsupported datatype > Add a dev utility to generate PySpark tests with LLM > > > Key: SPARK-44546 > URL: https://issues.apache.org/jira/browse/SPARK-44546 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > h2. Summary > This ticket adds a dev utility script to help generate PySpark tests using > LLM response. The purpose of this experimental script is to encourage PySpark > developers to test their code thoroughly, to avoid introducing regressions in > the codebase. > Below, we outline some common edge case scenarios for PySpark DataFrame APIs, > from the perspective of arguments. Many of these edge cases are passed into > the LLM through the script's base prompt. > Please note that this list is not exhaustive, but rather a starting point. > Some of these cases may not apply, depending on the situation. We encourage > all PySpark developers to carefully consider edge case scenarios when writing > tests. > h2. Table of Contents > # None > # Ints > # Floats > # Single column / column name > # Multi column / column names > # DataFrame argument > h3. 1. None > * Empty input > * None type > h3. 2. Ints > * Negatives > * 0 > * value > Int.MaxValue > * value < Int.MinValue > h3. 3. Floats > * Negatives > * 0.0 > * Float(“nan”) > * Float("inf") > * Float("-inf") > * decimal.Decimal > * numpy.float16 > h3. 4. String > * Special characters > * Spaces > * Empty strings > h3. 5. Single column / column name > * Non-existent column > * Empty column name > * Column name with special characters, e.g. dots > * Multi columns with the same name > * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’ > * Column of special types, e.g. nested type; > * Column containing special values, e.g. Null; > h3. 6. 
Multi column / column names > * Empty input; e.g DataFrame.drop() > * Special cases for each single column > * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”) > * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”)) > h3. 7. DataFrame argument > * DataFrame argument > * Empty dataframe; e.g. spark.range(5).limit(0) > * Dataframe with 0 columns, e.g. spark.range(5).drop('id') > * Dataset with repeated arguments > * Local dataset (pd.DataFrame) containing unsupported datatype -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
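As a sketch, the argument edge cases enumerated in the description above can be collected into a plain-Python catalog of the kind such a base prompt might feed to the LLM. All names below are illustrative, not part of the actual dev script:

```python
import decimal
import math

# Illustrative catalog of the argument edge cases listed in SPARK-44546;
# the proposed script passes cases like these to the LLM via its base prompt.
INT_EDGE_CASES = [-1, 0, 2**31, -(2**31) - 1]   # 2**31 exceeds Int.MaxValue (2**31 - 1)
FLOAT_EDGE_CASES = [-1.0, 0.0, float("nan"), float("inf"),
                    float("-inf"), decimal.Decimal("0.1")]
STRING_EDGE_CASES = ["", " ", "col.with.dots", "`a.b.c`", "a.b.c"]
NONE_EDGE_CASES = [None, []]

def edge_cases_for(kind: str):
    """Return candidate edge-case inputs for a parameter of the given kind."""
    return {
        "int": INT_EDGE_CASES,
        "float": FLOAT_EDGE_CASES,
        "string": STRING_EDGE_CASES,
        "none": NONE_EDGE_CASES,
    }[kind]
```

Such a catalog is only a starting point, mirroring the ticket's caveat that the list is not exhaustive and not every case applies to every API.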
[jira] [Commented] (SPARK-43402) FileSourceScanExec supports push down data filter with scalar subquery
[ https://issues.apache.org/jira/browse/SPARK-43402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747067#comment-17747067 ] Ignite TC Bot commented on SPARK-43402: --- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/41088 > FileSourceScanExec supports push down data filter with scalar subquery > -- > > Key: SPARK-43402 > URL: https://issues.apache.org/jira/browse/SPARK-43402 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > Scalar subquery can be pushed down as data filter at runtime, since we always > execute subquery first. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
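For context, a hedged illustration of the query shape this improvement targets (table and column names here are hypothetical): the scalar subquery yields a single value before the main scan executes, so at runtime that value can act as an ordinary data filter on the file scan instead of a post-scan filter.

```python
# Hypothetical query shape for SPARK-43402. Because Spark always executes
# the scalar subquery first, its single result is known by the time
# FileSourceScanExec runs, and `amount > <result>` can be pushed down
# into the scan as a data filter.
query = """
SELECT *
FROM sales
WHERE amount > (SELECT avg(amount) FROM sales_history)
"""
```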
[jira] [Updated] (SPARK-44545) It's impossible to get column by index if names are not unique
[ https://issues.apache.org/jira/browse/SPARK-44545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Dmitriev updated SPARK-44545: Description: So, I have a dataframe with non-unique columns names: {code:java} df = spark_session.createDataFrame([[1,2,3], [4,5,6]], ['a', 'a', 'c']) {code} It works fine. Now I try to get a column with non-unique name by index {code:java} df[0] {code} Expectation: It returns first of the columns Note, the [doc|[https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.__getitem__.html]] doesn't mention non-unique name as a precondition. Actual result: It throws exception: {noformat} AnalysisException Traceback (most recent call last) Cell In[71], line 1 > 1 df[0] File /usr/local/spark/python/pyspark/sql/dataframe.py:2935, in DataFrame.__getitem__(self, item) 2933 return self.select(*item) 2934 elif isinstance(item, int): -> 2935 jc = self._jdf.apply(self.columns[item]) 2936 return Column(jc) 2937 else: File /usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args) 1316 command = proto.CALL_COMMAND_NAME +\ 1317 self.command_header +\ 1318 args_command +\ 1319 proto.END_COMMAND_PART 1321 answer = self.gateway_client.send_command(command) -> 1322 return_value = get_return_value( 1323 answer, self.gateway_client, self.target_id, self.name) 1325 for temp_arg in temp_args: 1326 if hasattr(temp_arg, "_detach"): File /usr/local/spark/python/pyspark/errors/exceptions/captured.py:175, in capture_sql_exception..deco(*a, **kw) 171 converted = convert_exception(e.java_exception) 172 if not isinstance(converted, UnknownException): 173 # Hide where the exception came from that shows a non-Pythonic 174 # JVM exception message. 
--> 175 raise converted from None 176 else: 177 raise AnalysisException: [AMBIGUOUS_REFERENCE] Reference `a` is ambiguous, could be: [`a`, `a`].{noformat} was: So, I have a dataframe with non-unique columns names: {code:java} df = spark_session.createDataFrame([[1,2,3], [4,5,6]], ['a', 'a', 'c']) {code} It works fine. Now I try to get a column with non-unique name by index {code:java} df[0] {code} Expectation: It returns first of the columns Note, the [doc|[https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.__getitem__.html]] doesn't mention non-unique name as a precondition. Actual result: It throws exception: {noformat} AnalysisException Traceback (most recent call last) Cell In[71], line 1 > 1 df[0] File /usr/local/spark/python/pyspark/sql/dataframe.py:2935, in DataFrame.__getitem__(self, item) 2933 return self.select(*item) 2934 elif isinstance(item, int): -> 2935 jc = self._jdf.apply(self.columns[item]) 2936 return Column(jc) 2937 else: File /usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args) 1316 command = proto.CALL_COMMAND_NAME +\ 1317 self.command_header +\ 1318 args_command +\ 1319 proto.END_COMMAND_PART 1321 answer = self.gateway_client.send_command(command) -> 1322 return_value = get_return_value( 1323 answer, self.gateway_client, self.target_id, self.name) 1325 for temp_arg in temp_args: 1326 if hasattr(temp_arg, "_detach"): File /usr/local/spark/python/pyspark/errors/exceptions/captured.py:175, in capture_sql_exception..deco(*a, **kw) 171 converted = convert_exception(e.java_exception) 172 if not isinstance(converted, UnknownException): 173 # Hide where the exception came from that shows a non-Pythonic 174 # JVM exception message. 
--> 175 raise converted from None 176 else: 177 raise AnalysisException: [AMBIGUOUS_REFERENCE] Reference `a` is ambiguous, could be: [`a`, `a`].{noformat} > It's impossible to get column by index if names are not unique > -- > > Key: SPARK-44545 > URL: https://issues.apache.org/jira/browse/SPARK-44545 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 > Environment: I have python 3.11, pyspark 3.4.0 >Reporter: Alexey Dmitriev >Priority: Major > > So, I have a dataframe with non-unique columns names: > > {code:java} > df = spark_session.createDataFrame([[1,2,3], [4,5,6]], ['a', 'a', 'c']) {code} > > It works fine. > > Now I try to get a column with non-unique name by index > {code:java} > df[0] > {code} > Expectation: It returns first of the columns > Note, the >
[jira] [Resolved] (SPARK-44541) Remove useless function `hasRangeExprAgainstEventTimeCol` from `UnsupportedOperationChecker`
[ https://issues.apache.org/jira/browse/SPARK-44541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-44541. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42147 [https://github.com/apache/spark/pull/42147] > Remove useless function `hasRangeExprAgainstEventTimeCol` from > `UnsupportedOperationChecker` > > > Key: SPARK-44541 > URL: https://issues.apache.org/jira/browse/SPARK-44541 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 4.0.0 > > > The function `hasRangeExprAgainstEventTimeCol` was introduced by SPARK-40940 and > is no longer used after SPARK-42376 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44541) Remove useless function `hasRangeExprAgainstEventTimeCol` from `UnsupportedOperationChecker`
[ https://issues.apache.org/jira/browse/SPARK-44541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-44541: Assignee: Yang Jie > Remove useless function `hasRangeExprAgainstEventTimeCol` from > `UnsupportedOperationChecker` > > > Key: SPARK-44541 > URL: https://issues.apache.org/jira/browse/SPARK-44541 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > The function `hasRangeExprAgainstEventTimeCol` was introduced by SPARK-40940 and > is no longer used after SPARK-42376 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM
Amanda Liu created SPARK-44546: -- Summary: Add a dev utility to generate PySpark tests with LLM Key: SPARK-44546 URL: https://issues.apache.org/jira/browse/SPARK-44546 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44545) It's impossible to get column by index if names are not unique
Alexey Dmitriev created SPARK-44545: --- Summary: It's impossible to get column by index if names are not unique Key: SPARK-44545 URL: https://issues.apache.org/jira/browse/SPARK-44545 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.4.0 Environment: I have python 3.11, pyspark 3.4.0 Reporter: Alexey Dmitriev So, I have a dataframe with non-unique columns names: {code:java} df = spark_session.createDataFrame([[1,2,3], [4,5,6]], ['a', 'a', 'c']) {code} It works fine. Now I try to get a column with non-unique name by index {code:java} df[0] {code} Expectation: It returns first of the columns Note, the [doc|[https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.__getitem__.html]] doesn't mention non-unique name as a precondition. Actual result: It throws exception: {noformat} AnalysisException Traceback (most recent call last) Cell In[71], line 1 > 1 df[0] File /usr/local/spark/python/pyspark/sql/dataframe.py:2935, in DataFrame.__getitem__(self, item) 2933 return self.select(*item) 2934 elif isinstance(item, int): -> 2935 jc = self._jdf.apply(self.columns[item]) 2936 return Column(jc) 2937 else: File /usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args) 1316 command = proto.CALL_COMMAND_NAME +\ 1317 self.command_header +\ 1318 args_command +\ 1319 proto.END_COMMAND_PART 1321 answer = self.gateway_client.send_command(command) -> 1322 return_value = get_return_value( 1323 answer, self.gateway_client, self.target_id, self.name) 1325 for temp_arg in temp_args: 1326 if hasattr(temp_arg, "_detach"): File /usr/local/spark/python/pyspark/errors/exceptions/captured.py:175, in capture_sql_exception..deco(*a, **kw) 171 converted = convert_exception(e.java_exception) 172 if not isinstance(converted, UnknownException): 173 # Hide where the exception came from that shows a non-Pythonic 174 # JVM exception message. 
--> 175 raise converted from None 176 else: 177 raise AnalysisException: [AMBIGUOUS_REFERENCE] Reference `a` is ambiguous, could be: [`a`, `a`].{noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
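The traceback above shows the failure mode: `__getitem__` maps the positional index back to a name (`self._jdf.apply(self.columns[item])`) before resolving it, and name resolution is ambiguous once names repeat. A plain-Python sketch of the problem and a rename-based workaround (function names are illustrative, not PySpark API):

```python
# Sketch of why index lookup fails: the int index is converted to a name
# first, and resolving the name "a" is ambiguous when two columns share it.
def resolve_by_name(columns, name):
    matches = [i for i, c in enumerate(columns) if c == name]
    if len(matches) > 1:
        raise ValueError(f"[AMBIGUOUS_REFERENCE] Reference `{name}` is ambiguous")
    return matches[0]

def get_column_index(columns, item):
    # Mirrors DataFrame.__getitem__ for ints: index -> name -> resolve by name.
    return resolve_by_name(columns, columns[item])

# Workaround used in practice: rename to unique names first
# (e.g. via df.toDF(*unique_names)); positional access is then unambiguous.
def uniquify(columns):
    seen = {}
    out = []
    for c in columns:
        n = seen.get(c, 0)
        out.append(c if n == 0 else f"{c}_{n}")
        seen[c] = n + 1
    return out
```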
[jira] [Assigned] (SPARK-44355) Move commands to CTEDef code path and deprecate CTE inline path
[ https://issues.apache.org/jira/browse/SPARK-44355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-44355: --- Assignee: Wenchen Fan > Move commands to CTEDef code path and deprecate CTE inline path > --- > > Key: SPARK-44355 > URL: https://issues.apache.org/jira/browse/SPARK-44355 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Assignee: Wenchen Fan >Priority: Major > Fix For: 4.0.0 > > > Right now our CTE resolution code path is divergent: queries with commands go > into the CTE inline code path, whereas non-command queries go into the CTEDef code path > (see > https://github.com/apache/spark/blob/42719d9425b9a24ef016b5c2874e522b960cf114/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala#L50 > ). > In the longer term we should migrate command queries to go through CTEDef as well > and deprecate the CTE inline path. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44355) Move commands to CTEDef code path and deprecate CTE inline path
[ https://issues.apache.org/jira/browse/SPARK-44355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-44355. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42036 [https://github.com/apache/spark/pull/42036] > Move commands to CTEDef code path and deprecate CTE inline path > --- > > Key: SPARK-44355 > URL: https://issues.apache.org/jira/browse/SPARK-44355 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Priority: Major > Fix For: 4.0.0 > > > Right now our CTE resolution code path is diverged: query with commands go > into CTE inline code path where non-command queries go into CTEDef code path > (see > https://github.com/apache/spark/blob/42719d9425b9a24ef016b5c2874e522b960cf114/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala#L50 > ). > For longer term we should migrate command queries go through CTEDef as well > and deprecate the CTE inline path. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44544) Move python packaging tests to a separate module
Ruifeng Zheng created SPARK-44544: - Summary: Move python packaging tests to a separate module Key: SPARK-44544 URL: https://issues.apache.org/jira/browse/SPARK-44544 Project: Spark Issue Type: Test Components: Project Infra Affects Versions: 3.5.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43602) Rebalance PySpark test suites
[ https://issues.apache.org/jira/browse/SPARK-43602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43602. --- Resolution: Resolved > Rebalance PySpark test suites > - > > Key: SPARK-43602 > URL: https://issues.apache.org/jira/browse/SPARK-43602 > Project: Spark > Issue Type: Test > Components: Connect, PySpark, Tests >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Minor > > Some Python tests are time-consuming; we should split them in order to: > 1. speed up debugging during development; > 2. make them more suitable for parallelism downstream; -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41401) spark3 stagedir can't be change
[ https://issues.apache.org/jira/browse/SPARK-41401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746978#comment-17746978 ] Dipayan Dev commented on SPARK-41401: - This doesn't work in Google Cloud Storage even in Spark 2.x. Can I pick this issue? > spark3 stagedir can't be change > > > Key: SPARK-41401 > URL: https://issues.apache.org/jira/browse/SPARK-41401 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.2, 3.2.3 >Reporter: sinlang >Priority: Major > > I want to use a different staging dir when writing temporary data, but > Spark 3 seems able to write only inside the table path. > The spark.yarn.stagingDir parameter only works in Spark 2. > > In the org.apache.spark.internal.io.FileCommitProtocol file: > def getStagingDir(path: String, jobId: String): Path = { > new Path(path, ".spark-staging-" + jobId) > } -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44543) Cleanup .spark-staging directories when yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-44543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dipayan Dev updated SPARK-44543: Description: Spark creates the staging directories like .hive-staging, .spark-staging etc which get created when you run an dynamic insert overwrite to a partitioned table. Spark spends maximum time in renaming the partitioned files, and because GCS renaming are too slow, there are frequent scenarios where YARN fails due to network error etc. Such directories will remain forever in Google Cloud Storage, in case the yarn application manager gets killed. Over time this pileup and incurs a lot of cloud storage cost. Can we update our File committer to clean up the temporary directories in case the job commit fails. PS : This request is specifically for GCS. Image for reference !image-2023-07-25-17-52-55-006.png|width=458,height=177! was: Spark creates the staging directories like .hive-staging, .spark-staging etc which get created when you run an insert overwrite to a partitioned table. Spark spends maximum time in renaming the partitioned files, and because GCS renaming are too slow, there are frequent scenarios where YARN fails due to network error etc. Such directories will remain forever in Google Cloud Storage, in case the yarn application manager gets killed. Over time this pileup and incurs a lot of cloud storage cost. Can we update our File committer to clean up the temporary directories in case the job commit fails. PS : This request is specifically for GCS. Image for reference !image-2023-07-25-17-52-55-006.png|width=458,height=177! 
> Cleanup .spark-staging directories when yarn application fails > -- > > Key: SPARK-44543 > URL: https://issues.apache.org/jira/browse/SPARK-44543 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Spark Shell >Affects Versions: 3.4.1 >Reporter: Dipayan Dev >Priority: Major > Attachments: image-2023-07-25-17-52-55-006.png > > > Spark creates staging directories like .hive-staging, .spark-staging etc. > when you run a dynamic insert overwrite to a partitioned table. Spark spends most of its time > renaming the partitioned files, and > because GCS renames are slow, there are frequent scenarios where YARN > fails due to network errors etc. > Such directories will remain forever in Google Cloud Storage if the > YARN application master gets killed. > > Over time these pile up and incur a lot of cloud storage cost. > > Can we update the file committer to clean up the temporary directories when > the job commit fails? > PS : This request is specifically for GCS. > Image for reference > > !image-2023-07-25-17-52-55-006.png|width=458,height=177! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44543) Cleanup .spark-staging directories when yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-44543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dipayan Dev updated SPARK-44543: Description: Spark creates the staging directories like .hive-staging, .spark-staging etc which get created when you run an insert overwrite to a partitioned table. Spark spends maximum time in renaming the partitioned files, and because GCS renaming are too slow, there are frequent scenarios where YARN fails due to network error etc. Such directories will remain forever in Google Cloud Storage, in case the yarn application manager gets killed. Over time this pileup and incurs a lot of cloud storage cost. Can we update our File committer to clean up the temporary directories in case the job commit fails. PS : This request is specifically for GCS. Image for reference !image-2023-07-25-17-52-55-006.png|width=458,height=177! was: Spark creates the staging directories like .hive-staging, .spark-staging etc which get created when you run an upsert to a partitioned table. Such directories will remain forever in Google Cloud Storage, in case the yarn application manager gets killed. Over time this pileup and incurs a lot of cloud storage cost. Can we update our File committer to clean up the temporary directories in case the job commit fails. !image-2023-07-25-17-52-55-006.png|width=458,height=177! > Cleanup .spark-staging directories when yarn application fails > -- > > Key: SPARK-44543 > URL: https://issues.apache.org/jira/browse/SPARK-44543 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Spark Shell >Affects Versions: 3.4.1 >Reporter: Dipayan Dev >Priority: Major > Attachments: image-2023-07-25-17-52-55-006.png > > > Spark creates the staging directories like .hive-staging, .spark-staging etc > which get created when you run an insert overwrite to a partitioned table. 
> Spark spends maximum time in renaming the partitioned files, and because GCS > renaming are too slow, there are frequent scenarios where YARN fails due to > network error etc. > Such directories will remain forever in Google Cloud Storage, in case the > yarn application manager gets killed. > > Over time this pileup and incurs a lot of cloud storage cost. > > Can we update our File committer to clean up the temporary directories in > case the job commit fails. > PS : This request is specifically for GCS. > Image for reference > > !image-2023-07-25-17-52-55-006.png|width=458,height=177! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44543) Cleanup .spark-staging directories when yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-44543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746974#comment-17746974 ] Dipayan Dev commented on SPARK-44543: - Happy to contribute to this development of this feature. > Cleanup .spark-staging directories when yarn application fails > -- > > Key: SPARK-44543 > URL: https://issues.apache.org/jira/browse/SPARK-44543 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Spark Shell >Affects Versions: 3.4.1 >Reporter: Dipayan Dev >Priority: Major > Attachments: image-2023-07-25-17-52-55-006.png > > > Spark creates the staging directories like .hive-staging, .spark-staging etc > which get created when you run an insert overwrite to a partitioned table. > Spark spends maximum time in renaming the partitioned files, and because GCS > renaming are too slow, there are frequent scenarios where YARN fails due to > network error etc. > Such directories will remain forever in Google Cloud Storage, in case the > yarn application manager gets killed. > > Over time this pileup and incurs a lot of cloud storage cost. > > Can we update our File committer to clean up the temporary directories in > case the job commit fails. > PS : This request is specifically for GCS. > Image for reference > > !image-2023-07-25-17-52-55-006.png|width=458,height=177! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44543) Cleanup .spark-staging directories when yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-44543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dipayan Dev updated SPARK-44543: Description: Spark creates the staging directories like .hive-staging, .spark-staging etc which get created when you run an upsert to a partitioned table. Such directories will remain forever in Google Cloud Storage, in case the yarn application manager gets killed. Over time this pileup and incurs a lot of cloud storage cost. Can we update our File committer to clean up the temporary directories in case the job commit fails. !image-2023-07-25-17-52-55-006.png|width=458,height=177! was: Spark creates the staging directories like .hive-staging, .spark-staging etc which get created when you run an upsert to a partitioned table. Such directories will remain forever in Google Cloud Storage, in case the yarn application manager gets killed. Over time this pileup and incurs a lot of cloud storage cost. Can we update our File committer to clean up the temporary directories in case the job commit fails. > Cleanup .spark-staging directories when yarn application fails > -- > > Key: SPARK-44543 > URL: https://issues.apache.org/jira/browse/SPARK-44543 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Spark Shell >Affects Versions: 3.4.1 >Reporter: Dipayan Dev >Priority: Major > Attachments: image-2023-07-25-17-52-55-006.png > > > Spark creates the staging directories like .hive-staging, .spark-staging etc > which get created when you run an upsert to a partitioned table. Such > directories will remain forever in Google Cloud Storage, in case the yarn > application manager gets killed. > > Over time this pileup and incurs a lot of cloud storage cost. > > Can we update our File committer to clean up the temporary directories in > case the job commit fails. > > !image-2023-07-25-17-52-55-006.png|width=458,height=177! 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44543) Cleanup .spark-staging directories when yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-44543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dipayan Dev updated SPARK-44543: Attachment: image-2023-07-25-17-52-55-006.png > Cleanup .spark-staging directories when yarn application fails > -- > > Key: SPARK-44543 > URL: https://issues.apache.org/jira/browse/SPARK-44543 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Spark Shell >Affects Versions: 3.4.1 >Reporter: Dipayan Dev >Priority: Major > Attachments: image-2023-07-25-17-52-55-006.png > > > Spark creates the staging directories like .hive-staging, .spark-staging etc > which get created when you run an upsert to a partitioned table. Such > directories will remain forever in Google Cloud Storage, in case the yarn > application manager gets killed. > > Over time this pileup and incurs a lot of cloud storage cost. > > Can we update our File committer to clean up the temporary directories in > case the job commit fails. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44543) Cleanup .spark-staging directories when yarn application fails
Dipayan Dev created SPARK-44543: --- Summary: Cleanup .spark-staging directories when yarn application fails Key: SPARK-44543 URL: https://issues.apache.org/jira/browse/SPARK-44543 Project: Spark Issue Type: New Feature Components: Spark Core, Spark Shell Affects Versions: 3.4.1 Reporter: Dipayan Dev Spark creates the staging directories like .hive-staging, .spark-staging etc which get created when you run an upsert to a partitioned table. Such directories will remain forever in Google Cloud Storage, in case the yarn application manager gets killed. Over time this pileup and incurs a lot of cloud storage cost. Can we update our File committer to clean up the temporary directories in case the job commit fails. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746849#comment-17746849 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 11:36 AM: - bumping to blocker because I believe this is a potentially very serious issue in the query planner, which may affect other queries (sort() then select(), but the sorting column is not in select(), then query planner would use wrong column to sort) was (Author: JIRAUSER301473): bumping to blocker because I believe this is a potentially very serious issue in the query planner (sort() then select(), and the sorting column is not in select(), then query plan would use the wrong column to sort), which may affect other queries > dataset.sort.select.write.partitionBy sorts wrong column > > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Blocker > Labels: correctness > Attachments: Test-Details-for-Query-0.png, > Test-Details-for-Query-1.png > > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem), unless > spark.sql.optimizer.plannedWrite.enabled is set to false. > After further investigation, spark actually sorted wrong column in the > following code. 
> {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > (the following workaround is no longer necessary) > -However, if I insert an identity mapper between select and write, the output > would be sorted as expected.- > -{{dataset = dataset.sort("_1")}}- > -{{.select("_2", "_3");}}- > -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- > -{{.write()}}- > -{{{}.{}}}{{{}partitionBy("_2"){}}}- > -{{.text("output")}}- > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44540) Remove unused stylesheet and javascript files of jsonFormatter
[ https://issues.apache.org/jira/browse/SPARK-44540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-44540: Assignee: Kent Yao > Remove unused stylesheet and javascript files of jsonFormatter > -- > > Key: SPARK-44540 > URL: https://issues.apache.org/jira/browse/SPARK-44540 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.5.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > jsonFormatter.min.css and jsonFormatter.min.js is unreached -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44540) Remove unused stylesheet and javascript files of jsonFormatter
[ https://issues.apache.org/jira/browse/SPARK-44540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-44540. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42145 [https://github.com/apache/spark/pull/42145] > Remove unused stylesheet and javascript files of jsonFormatter > -- > > Key: SPARK-44540 > URL: https://issues.apache.org/jira/browse/SPARK-44540 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.5.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > jsonFormatter.min.css and jsonFormatter.min.js is unreached -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746849#comment-17746849 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 10:06 AM: - bumping to blocker because I believe this is a potentially very serious issue in the query planner (sort().select() and the original sorting column is not in select(), then query plan would use the wrong column to sort), which may affect other queries was (Author: JIRAUSER301473): bumping to blocker because I believe this is a potentially very serious issue in the query planner, which may affect other queries > dataset.sort.select.write.partitionBy sorts wrong column > > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Blocker > Labels: correctness > Attachments: Test-Details-for-Query-0.png, > Test-Details-for-Query-1.png > > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem), unless > spark.sql.optimizer.plannedWrite.enabled is set to false. > After further investigation, spark actually sorted wrong column in the > following code. > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > (the following workaround is no longer necessary) > -However, if I insert an identity mapper between select and write, the output > would be sorted as expected.- > -{{dataset = dataset.sort("_1")}}- > -{{.select("_2", "_3");}}- > -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- > -{{.write()}}- > -{{{}.{}}}{{{}partitionBy("_2"){}}}- > -{{.text("output")}}- > Below is the complete code that reproduces the problem. 
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746849#comment-17746849 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 10:06 AM: - bumping to blocker because I believe this is a potentially very serious issue in the query planner (sort() then select(), and the sorting column is not in select(), then query plan would use the wrong column to sort), which may affect other queries was (Author: JIRAUSER301473): bumping to blocker because I believe this is a potentially very serious issue in the query planner (sort().select() and the original sorting column is not in select(), then query plan would use the wrong column to sort), which may affect other queries > dataset.sort.select.write.partitionBy sorts wrong column > > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Blocker > Labels: correctness > Attachments: Test-Details-for-Query-0.png, > Test-Details-for-Query-1.png > > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem), unless > spark.sql.optimizer.plannedWrite.enabled is set to false. > After further investigation, spark actually sorted wrong column in the > following code. 
> {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > (the following workaround is no longer necessary) > -However, if I insert an identity mapper between select and write, the output > would be sorted as expected.- > -{{dataset = dataset.sort("_1")}}- > -{{.select("_2", "_3");}}- > -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- > -{{.write()}}- > -{{{}.{}}}{{{}partitionBy("_2"){}}}- > -{{.text("output")}}- > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43744) Spark Connect scala UDF serialization pulling in unrelated classes not available on server
[ https://issues.apache.org/jira/browse/SPARK-43744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746863#comment-17746863 ] ASF GitHub Bot commented on SPARK-43744: User 'zhenlineo' has created a pull request for this issue: https://github.com/apache/spark/pull/42069 > Spark Connect scala UDF serialization pulling in unrelated classes not > available on server > -- > > Key: SPARK-43744 > URL: https://issues.apache.org/jira/browse/SPARK-43744 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Juliusz Sompolski >Priority: Major > Labels: SPARK-43745 > > [https://github.com/apache/spark/pull/41487] moved "interrupt all - > background queries, foreground interrupt" and "interrupt all - foreground > queries, background interrupt" tests from ClientE2ETestSuite into a new > isolated suite SparkSessionE2ESuite to avoid an inexplicable UDF > serialization issue. > > When these tests are moved back to ClientE2ETestSuite and when testing with > {code:java} > build/mvn clean install -DskipTests -Phive > build/mvn test -pl connector/connect/client/jvm -Dtest=none > -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite{code} > > the tests fail with > {code:java} > 23/05/22 15:44:11 ERROR SparkConnectService: Error during: execute. UserId: . > SessionId: 0f4013ca-3af9-443b-a0e5-e339a827e0cf. 
> java.lang.NoClassDefFoundError: > org/apache/spark/sql/connect/client/SparkResult > at java.lang.Class.getDeclaredMethods0(Native Method) > at java.lang.Class.privateGetDeclaredMethods(Class.java:2701) > at java.lang.Class.getDeclaredMethod(Class.java:2128) > at java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1643) > at java.io.ObjectStreamClass.access$1700(ObjectStreamClass.java:79) > at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:520) > at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:494) > at java.security.AccessController.doPrivileged(Native Method) > at java.io.ObjectStreamClass.(ObjectStreamClass.java:494) > at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:391) > at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:681) > at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005) > at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852) > at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1815) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1640) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) > at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) > at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461) > at org.apache.spark.util.Utils$.deserialize(Utils.scala:148) > at > org.apache.spark.sql.connect.planner.SparkConnectPlanner.org$apache$spark$sql$connect$planner$SparkConnectPlanner$$unpackUdf(SparkConnectPlanner.scala:1353) > at > org.apache.spark.sql.connect.planner.SparkConnectPlanner$TypedScalaUdf$.apply(SparkConnectPlanner.scala:761) > at > org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformTypedMapPartitions(SparkConnectPlanner.scala:531) > at > org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformMapPartitions(SparkConnectPlanner.scala:495) > at > org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:143) > at > org.apache.spark.sql.connect.service.SparkConnectStreamHandler.handlePlan(SparkConnectStreamHandler.scala:100) > at > org.apache.spark.sql.connect.service.SparkConnectStreamHandler.$anonfun$handle$2(SparkConnectStreamHandler.scala:87) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at
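The "pulling in unrelated classes" failure mode above has a compact analogy in Python's pickle (this is an analogy only, not the JVM/Scala closure-cleaning mechanics from the report; `HeavyClient` and `Suite` are hypothetical stand-ins for something like `SparkResult` and the test suite):

```python
# Analogy: serializing a function that captures its enclosing object drags
# along everything that object references, even classes the function never
# uses -- the receiver then needs all of them on its path to deserialize.
import pickle

class HeavyClient:
    """Stand-in for an unrelated client-side class (like SparkResult)."""
    def __init__(self):
        self.cache = list(range(5))

class Suite:
    """Stand-in for a test suite whose method becomes a UDF."""
    def __init__(self):
        self.client = HeavyClient()  # never used by the method below
        self.offset = 10

    def add(self, x):
        # Only touches self.offset, but pickling this bound method pickles
        # the whole Suite instance, HeavyClient and all.
        return x + self.offset

suite = Suite()
blob = pickle.dumps(suite.add)

# The unrelated class name rides along inside the serialized payload; a
# receiver without HeavyClient available would fail to deserialize, much
# like the NoClassDefFoundError in the server-side stack trace above.
print(b"HeavyClient" in blob)
```

This is why moving the tests to an isolated suite sidesteps the problem: the captured enclosing object no longer references client-only classes.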
[jira] [Commented] (SPARK-44509) Fine grained interrupt in Python Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-44509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746858#comment-17746858 ] ASF GitHub Bot commented on SPARK-44509: User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/42149 > Fine grained interrupt in Python Spark Connect > -- > > Key: SPARK-44509 > URL: https://issues.apache.org/jira/browse/SPARK-44509 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Same as https://issues.apache.org/jira/browse/SPARK-44422 but need it for > Python > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44542) easily load SparkExitCode class in SparkUncaughtExceptionHandler
[ https://issues.apache.org/jira/browse/SPARK-44542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YE updated SPARK-44542: --- Description: There are two pieces of background for this improvement proposal: 1. When running Spark on YARN, a disk might become corrupted while the application is running. The corrupted disk might contain the Spark jars (the cached archive from spark.yarn.archive). In that case, the executor JVM can no longer load any Spark-related classes. 2. Spark leverages the OutputCommitCoordinator to avoid data races between speculative tasks, so that no two tasks can commit the same partition at the same time. In other words, once a task's commit request is allowed, other commit requests are denied until the committing task fails. We encountered a corner case combining the two situations above, which makes Spark hang. A short timeline is described below: # task 5372 (tid: 21662) starts running at 21:55 # the disk containing the Spark archive for that task/executor becomes corrupted, making the archive inaccessible from the executor JVM's perspective; this happened around 22:00 # the task continues running; at 22:05 it requests a commit from the coordinator and performs the commit. # however, due to the corrupted disk, an exception is raised in the executor JVM. # The SparkUncaughtExceptionHandler kicks in, but as the jar/disk is corrupted, the handler itself throws an exception, and the halt process throws an exception too. # The executor hangs there and no more tasks run. However, the authorized commit request is still valid on the driver side. # Speculative tasks start to kick in; with no commit permission, all speculative tasks are killed/denied. # The job hangs until our SRE kills the container from outside. Some screenshots are provided below. !image-2023-07-25-16-46-03-989.png! !image-2023-07-25-16-46-28-158.png! !image-2023-07-25-16-46-42-522.png! 
For this specific case, I'd like to propose eagerly loading the SparkExitCode class in SparkUncaughtExceptionHandler, so that the halt process can execute rather than throw an exception because SparkExitCode is not loadable in the scenario above. was: There are two pieces of background for this improvement proposal: 1. When running Spark on YARN, a disk might become corrupted while the application is running. The corrupted disk might contain the Spark jars (the cached archive from spark.yarn.archive). In that case, the executor JVM can no longer load any Spark-related classes. 2. Spark leverages the OutputCommitCoordinator to avoid data races between speculative tasks, so that no two tasks can commit the same partition at the same time. In other words, once a task's commit request is allowed, other commit requests are denied until the committing task fails. We encountered a corner case combining the two situations above, which makes Spark hang. A short timeline is described below: # task 5372 (tid: 21662) starts running at 21:55 # the disk containing the Spark archive for that task/executor becomes corrupted, making the archive inaccessible from the executor JVM's perspective; this happened around 22:00 # the task continues running; at 22:05 it requests a commit from the coordinator and performs the commit. # however, due to the corrupted disk, an exception is raised in the executor JVM. # The SparkUncaughtExceptionHandler kicks in, but as the jar/disk is corrupted, the handler itself throws an exception, and the halt process throws an exception too. # The executor hangs there and no more tasks run. However, the authorized commit request is still valid on the driver side. # Speculative tasks start to kick in; with no commit permission, all speculative tasks are killed/denied. # The job hangs until our SRE kills the container from outside. Some screenshots are provided below. !image-2023-07-25-16-46-03-989.png! 
For this specific case, I'd like to propose eagerly loading the SparkExitCode class in SparkUncaughtExceptionHandler, so that the halt process can execute rather than throw an exception because SparkExitCode is not loadable in the scenario above. > easily load SparkExitCode class in SparkUncaughtExceptionHandler > > > Key: SPARK-44542 > URL: https://issues.apache.org/jira/browse/SPARK-44542 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.3, 3.3.2, 3.4.1 >Reporter: YE >Priority: Major > Attachments: image-2023-07-25-16-46-03-989.png, > image-2023-07-25-16-46-28-158.png, image-2023-07-25-16-46-42-522.png > > > There are two pieces of background for this improvement proposal: > 1. When running Spark on YARN, a disk might become corrupted while the application > is running. The corrupted disk might contain the spark
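The proposal above boils down to a general pattern: resolve everything the crash path needs when the handler is installed, not when it fires. The sketch below is illustrative Python, not Spark's Scala `SparkUncaughtExceptionHandler`; the names `EagerHandler` and `UNCAUGHT_EXCEPTION` are assumptions standing in for the real handler and `SparkExitCode` constant:

```python
# Sketch: an uncaught-exception handler that pre-resolves its exit function
# and exit code at construction time, so that at crash time it performs no
# imports or class lookups that could fail (the JVM analogue: the class file
# living on a now-corrupted disk).
import sys

UNCAUGHT_EXCEPTION = 50  # stand-in for SparkExitCode.UNCAUGHT_EXCEPTION

class EagerHandler:
    def __init__(self, halt=sys.exit):
        # Capture live references eagerly. Even if module/class loading
        # becomes impossible later, the handler still holds what it needs.
        self._halt = halt
        self._code = UNCAUGHT_EXCEPTION

    def __call__(self, exc_type, exc, tb):
        # Crash path: only uses already-resolved references.
        self._halt(self._code)

calls = []
handler = EagerHandler(halt=calls.append)  # inject a fake halt for the demo
handler(RuntimeError, RuntimeError("boom"), None)
print(calls)
```

In real use the handler would be installed once at startup (e.g. via `sys.excepthook` in Python, or as the JVM default uncaught-exception handler), after which the halt path no longer depends on anything loadable from disk.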
[jira] [Updated] (SPARK-44542) easily load SparkExitCode class in SparkUncaughtExceptionHandler
[ https://issues.apache.org/jira/browse/SPARK-44542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YE updated SPARK-44542: --- Attachment: image-2023-07-25-16-46-42-522.png > easily load SparkExitCode class in SparkUncaughtExceptionHandler > > > Key: SPARK-44542 > URL: https://issues.apache.org/jira/browse/SPARK-44542 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.3, 3.3.2, 3.4.1 >Reporter: YE >Priority: Major > Attachments: image-2023-07-25-16-46-03-989.png, > image-2023-07-25-16-46-28-158.png, image-2023-07-25-16-46-42-522.png > > > There are two pieces of background for this improvement proposal: > 1. When running Spark on YARN, a disk might become corrupted while the application > is running. The corrupted disk might contain the Spark jars (the cached archive from > spark.yarn.archive). In that case, the executor JVM can no longer load any > Spark-related classes. > 2. Spark leverages the OutputCommitCoordinator to avoid data races between > speculative tasks, so that no two tasks can commit the same partition at the same > time. In other words, once a task's commit request is allowed, other commit > requests are denied until the committing task fails. > > We encountered a corner case combining the two situations above, which makes > Spark hang. A short timeline is described below: > # task 5372 (tid: 21662) starts running at 21:55 > # the disk containing the Spark archive for that task/executor becomes corrupted, > making the archive inaccessible from the executor JVM's perspective; this > happened around 22:00 > # the task continues running; at 22:05 it requests a commit from the coordinator > and performs the commit. > # however, due to the corrupted disk, an exception is raised in the executor JVM. > # The SparkUncaughtExceptionHandler kicks in, but as the jar/disk is > corrupted, the handler itself throws an exception, and the halt process > throws an exception too. 
> # The executor hangs there and no more tasks run. However, the > authorized commit request is still valid on the driver side. > # Speculative tasks start to kick in; with no commit permission, all > speculative tasks are killed/denied. > # The job hangs until our SRE kills the container from outside. > Some screenshots are provided below. > !image-2023-07-25-16-46-03-989.png! > !image-2023-07-25-16-46-28-158.png! > !image-2023-07-25-16-46-42-522.png! > For this specific case, I'd like to propose eagerly loading the SparkExitCode > class in SparkUncaughtExceptionHandler, so that the halt process can execute > rather than throw an exception because SparkExitCode is not loadable in the > scenario above.
[jira] [Updated] (SPARK-44542) easily load SparkExitCode class in SparkUncaughtExceptionHandler
[ https://issues.apache.org/jira/browse/SPARK-44542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YE updated SPARK-44542: --- Attachment: image-2023-07-25-16-46-28-158.png > easily load SparkExitCode class in SparkUncaughtExceptionHandler > > > Key: SPARK-44542 > URL: https://issues.apache.org/jira/browse/SPARK-44542 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.3, 3.3.2, 3.4.1 >Reporter: YE >Priority: Major > Attachments: image-2023-07-25-16-46-03-989.png, > image-2023-07-25-16-46-28-158.png, image-2023-07-25-16-46-42-522.png > > > There are two pieces of background for this improvement proposal: > 1. When running Spark on YARN, a disk might become corrupted while the application > is running. The corrupted disk might contain the Spark jars (the cached archive from > spark.yarn.archive). In that case, the executor JVM can no longer load any > Spark-related classes. > 2. Spark leverages the OutputCommitCoordinator to avoid data races between > speculative tasks, so that no two tasks can commit the same partition at the same > time. In other words, once a task's commit request is allowed, other commit > requests are denied until the committing task fails. > > We encountered a corner case combining the two situations above, which makes > Spark hang. A short timeline is described below: > # task 5372 (tid: 21662) starts running at 21:55 > # the disk containing the Spark archive for that task/executor becomes corrupted, > making the archive inaccessible from the executor JVM's perspective; this > happened around 22:00 > # the task continues running; at 22:05 it requests a commit from the coordinator > and performs the commit. > # however, due to the corrupted disk, an exception is raised in the executor JVM. > # The SparkUncaughtExceptionHandler kicks in, but as the jar/disk is > corrupted, the handler itself throws an exception, and the halt process > throws an exception too. 
> # The executor hangs there and no more tasks run. However, the > authorized commit request is still valid on the driver side. > # Speculative tasks start to kick in; with no commit permission, all > speculative tasks are killed/denied. > # The job hangs until our SRE kills the container from outside. > Some screenshots are provided below. > !image-2023-07-25-16-46-03-989.png! > For this specific case, I'd like to propose eagerly loading the SparkExitCode > class in SparkUncaughtExceptionHandler, so that the halt process can execute > rather than throw an exception because SparkExitCode is not loadable in the > scenario above.
[jira] [Updated] (SPARK-44542) easily load SparkExitCode class in SparkUncaughtExceptionHandler
[ https://issues.apache.org/jira/browse/SPARK-44542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YE updated SPARK-44542: --- Description: There are two pieces of background for this improvement proposal: 1. When running Spark on YARN, a disk might become corrupted while the application is running. The corrupted disk might contain the Spark jars (the cached archive from spark.yarn.archive). In that case, the executor JVM can no longer load any Spark-related classes. 2. Spark leverages the OutputCommitCoordinator to avoid data races between speculative tasks, so that no two tasks can commit the same partition at the same time. In other words, once a task's commit request is allowed, other commit requests are denied until the committing task fails. We encountered a corner case combining the two situations above, which makes Spark hang. A short timeline is described below: # task 5372 (tid: 21662) starts running at 21:55 # the disk containing the Spark archive for that task/executor becomes corrupted, making the archive inaccessible from the executor JVM's perspective; this happened around 22:00 # the task continues running; at 22:05 it requests a commit from the coordinator and performs the commit. # however, due to the corrupted disk, an exception is raised in the executor JVM. # The SparkUncaughtExceptionHandler kicks in, but as the jar/disk is corrupted, the handler itself throws an exception, and the halt process throws an exception too. # The executor hangs there and no more tasks run. However, the authorized commit request is still valid on the driver side. # Speculative tasks start to kick in; with no commit permission, all speculative tasks are killed/denied. # The job hangs until our SRE kills the container from outside. Some screenshots are provided below. !image-2023-07-25-16-46-03-989.png! 
For this specific case, I'd like to propose eagerly loading the SparkExitCode class in SparkUncaughtExceptionHandler, so that the halt process can execute rather than throw an exception because SparkExitCode is not loadable in the scenario above. was: There are two pieces of background for this improvement proposal: 1. When running Spark on YARN, a disk might become corrupted while the application is running. The corrupted disk might contain the Spark jars (the cached archive from spark.yarn.archive). In that case, the executor JVM can no longer load any Spark-related classes. 2. Spark leverages the OutputCommitCoordinator to avoid data races between speculative tasks, so that no two tasks can commit the same partition at the same time. In other words, once a task's commit request is allowed, other commit requests are denied until the committing task fails. We encountered a corner case combining the two situations above, which makes Spark hang. A short timeline is described below: # task 5372 (tid: 21662) starts running at 21:55 # the disk containing the Spark archive for that task/executor becomes corrupted, making the archive inaccessible from the executor JVM's perspective; this happened around 22:00 # the task continues running; at 22:05 it requests a commit from the coordinator and performs the commit. # however, due to the corrupted disk, an exception is raised in the executor JVM. # The SparkUncaughtExceptionHandler kicks in, but as the jar/disk is corrupted, the handler itself throws an exception, and the halt process throws an exception too. # The executor hangs there and no more tasks run. However, the authorized commit request is still valid on the driver side. # Speculative tasks start to kick in; with no commit permission, all speculative tasks are killed/denied. # The job hangs until our SRE kills the container from outside. Some screenshots are provided below. !image-2023-07-25-16-37-16-821.png! !image-2023-07-25-16-38-52-270.png! 
!image-2023-07-25-16-39-40-182.png! For this specific case, I'd like to propose eagerly loading the SparkExitCode class in SparkUncaughtExceptionHandler, so that the halt process can execute rather than throw an exception because SparkExitCode is not loadable in the scenario above. > easily load SparkExitCode class in SparkUncaughtExceptionHandler > > > Key: SPARK-44542 > URL: https://issues.apache.org/jira/browse/SPARK-44542 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.3, 3.3.2, 3.4.1 >Reporter: YE >Priority: Major > Attachments: image-2023-07-25-16-46-03-989.png, > image-2023-07-25-16-46-28-158.png, image-2023-07-25-16-46-42-522.png > > > There are two pieces of background for this improvement proposal: > 1. When running Spark on YARN, a disk might become corrupted while the application > is running. The corrupted disk might contain the
[jira] [Updated] (SPARK-44542) easily load SparkExitCode class in SparkUncaughtExceptionHandler
[ https://issues.apache.org/jira/browse/SPARK-44542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YE updated SPARK-44542: --- Attachment: image-2023-07-25-16-46-03-989.png > easily load SparkExitCode class in SparkUncaughtExceptionHandler > > > Key: SPARK-44542 > URL: https://issues.apache.org/jira/browse/SPARK-44542 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.3, 3.3.2, 3.4.1 >Reporter: YE >Priority: Major > Attachments: image-2023-07-25-16-46-03-989.png, > image-2023-07-25-16-46-28-158.png, image-2023-07-25-16-46-42-522.png > > > There are two background for this improvement proposal: > 1. When running spark on yarn, the disk might be corrupted during application > running. The corrupted disk might contain the spark jars(cache archive from > spark.yarn.archive). In that case , the executor JVM cannot load any spark > related classes any more. > 2. Spark leverages the OutputCommitCoordinator to avoid data race between > speculate tasks so that no tasks could commit the same partition in the same > time. In other words, once a task's commit request is allowed, other commit > requests would be denied until the committing task is failed. > > We encountered a corner case combined the above two cases, which makes the > spark hangs. A short timeline could be described as below: > # task 5372(tid: 21662) starts running in 21:55 > # the disk contains the spark archive for that task/executor is corrupted, > thus making the archive inaccessible from executor's JVM perspective, it > happened around 22:00 > # the task continues running, at 22:05, it requests commit from coordinator > and performs the commit. > # however due the corrupted disk, some exception raised in the executor JVM. > # The SparkUncaughtExceptionHandler kicks in, however as the jar/disk is > corrupted, the handler itself throws an exception, and the halt process > throws an exception too. 
> # The executor is hanging there, no more tasks are running. However the > authorized commit request is still valid in the driver side > # Speculate tasks start to click in, due to no commit permission, all > speculate tasks are killed/denied. > # The job is hanging until our SRE killed the container from outside. > Some screenshot are provided below. > !image-2023-07-25-16-37-16-821.png! > !image-2023-07-25-16-38-52-270.png! > !image-2023-07-25-16-39-40-182.png! > > For this specific case: I'd like to the propose to eagerly load SparkExitCode > class in the > SparkUncaughtExceptionHandler, so that the halt process could be executed > rather than throws an exception as SparkExitCode is not loadable during the > previous scenario. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
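As context for the commit deadlock in the timeline above: the coordinator authorizes exactly one attempt per partition and denies all others until the authorized attempt is reported as failed. The sketch below is a hypothetical plain-Python model of that protocol (the class and method names are invented, not Spark's actual OutputCommitCoordinator API); it shows why a hung committer that never fails starves every speculative attempt.

```python
# Hypothetical, simplified model of the commit-arbitration protocol
# (invented names; not Spark's real API): the first attempt that asks to
# commit a partition is authorized, and every other attempt is denied
# until the authorized attempt is reported as failed.
class CommitCoordinator:
    def __init__(self):
        self._holder = {}  # partition -> attempt holding the commit right

    def can_commit(self, partition, attempt):
        holder = self._holder.setdefault(partition, attempt)
        return holder == attempt

    def task_failed(self, partition, attempt):
        # The commit right is released only on an explicit failure report.
        if self._holder.get(partition) == attempt:
            del self._holder[partition]

coord = CommitCoordinator()
assert coord.can_commit(partition=5372, attempt=0)      # first attempt wins
assert not coord.can_commit(partition=5372, attempt=1)  # speculative attempt denied
# If attempt 0 hangs instead of failing (as in the timeline above), no
# task_failed() report ever arrives, so attempt 1 is denied forever and
# the job hangs. Only an explicit failure frees the partition:
coord.task_failed(partition=5372, attempt=0)
assert coord.can_commit(partition=5372, attempt=1)      # freed after failure
```

This is why the corrupted-disk hang is so sticky: the executor neither commits nor fails, so from the coordinator's point of view the partition stays locked indefinitely.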
[jira] [Created] (SPARK-44542) easily load SparkExitCode class in SparkUncaughtExceptionHandler
YE created SPARK-44542: -- Summary: easily load SparkExitCode class in SparkUncaughtExceptionHandler Key: SPARK-44542 URL: https://issues.apache.org/jira/browse/SPARK-44542 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.1, 3.3.2, 3.1.3 Reporter: YE There are two pieces of background for this improvement proposal: 1. When running Spark on YARN, a disk may become corrupted while the application is running. The corrupted disk might hold the Spark jars (the cached archive from spark.yarn.archive); in that case, the executor JVM can no longer load any Spark-related classes. 2. Spark leverages the OutputCommitCoordinator to avoid data races between speculative tasks, so that no two tasks can commit the same partition at the same time. In other words, once one task's commit request is authorized, other commit requests are denied until the committing task fails. We encountered a corner case combining the two situations above, which caused the Spark job to hang. A short timeline: # task 5372 (tid: 21662) started running at 21:55 # the disk holding the Spark archive for that task/executor became corrupted around 22:00, making the archive inaccessible from the executor JVM's perspective # the task kept running; at 22:05 it requested a commit from the coordinator and performed the commit # due to the corrupted disk, an exception was raised in the executor JVM # the SparkUncaughtExceptionHandler kicked in, but because the jar/disk was corrupted the handler itself threw an exception, and the halt process threw an exception too # the executor hung there with no more tasks running, while the authorized commit request remained valid on the driver side # speculative tasks started to kick in, and with no commit permission all of them were killed/denied # the job hung until our SRE killed the container from outside. Some screenshots are provided below. !image-2023-07-25-16-37-16-821.png! 
!image-2023-07-25-16-38-52-270.png! !image-2023-07-25-16-39-40-182.png! For this specific case, I'd like to propose eagerly loading the SparkExitCode class in the SparkUncaughtExceptionHandler, so that the halt process can be executed rather than throwing an exception because SparkExitCode is not loadable in the scenario above. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
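The actual fix would live in Spark's Scala code (forcing the JVM to load SparkExitCode when the handler is installed rather than when the first uncaught exception arrives). As a language-neutral illustration of why eager loading helps, here is a hedged Python analogue: module imports stand in for JVM classloading, and a meta-path hook simulates the backing files going bad mid-run. All names in this sketch are invented.

```python
import sys

class LazyHandler:
    def handle(self):
        # Dependency resolved only when the crash handler runs -- too
        # late if the module's backing files are unreadable by then.
        from json import dumps
        return dumps({"exit": 50})

class EagerHandler:
    def __init__(self):
        # Dependency resolved up front, while everything is still readable.
        from json import dumps
        self._dumps = dumps

    def handle(self):
        return self._dumps({"exit": 50})

eager = EagerHandler()

# Simulate the dependency becoming unloadable after startup, like the
# corrupted disk that held the spark.yarn.archive jars.
class CorruptedDisk:
    def find_spec(self, name, path=None, target=None):
        if name == "json":
            raise ImportError("simulated corrupted disk")

sys.meta_path.insert(0, CorruptedDisk())
sys.modules.pop("json", None)

eager_ok = eager.handle() == '{"exit": 50}'  # already loaded, still works
try:
    LazyHandler().handle()
    lazy_failed = False
except ImportError:
    lazy_failed = True                       # dies, like the executor JVM

sys.meta_path.pop(0)                         # undo the simulation

assert eager_ok and lazy_failed
```

The eager handler keeps working because its dependency was resolved while the "disk" was healthy; the lazy one fails at exactly the moment it is most needed, which mirrors the halt-process failure described in the report.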
[jira] [Commented] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746849#comment-17746849 ] Yiu-Chung Lee commented on SPARK-44512: --- bumping to blocker because I believe this is a potentially very serious issue in the query planner, which may affect other queries > dataset.sort.select.write.partitionBy sorts wrong column > > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Blocker > Labels: correctness > Attachments: Test-Details-for-Query-0.png, > Test-Details-for-Query-1.png > > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem), unless > spark.sql.optimizer.plannedWrite.enabled is set to false. > After further investigation, spark actually sorted wrong column in the > following code. > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > (the following workaround is no longer necessary) > -However, if I insert an identity mapper between select and write, the output > would be sorted as expected.- > -{{dataset = dataset.sort("_1")}}- > -{{.select("_2", "_3");}}- > -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- > -{{.write()}}- > -{{{}.{}}}{{{}partitionBy("_2"){}}}- > -{{.text("output")}}- > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
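The expected behavior the report describes can be stated without Spark at all: after sort("_1").select("_2", "_3").write().partitionBy("_2"), rows inside each output partition should still appear in _1 order even though _1 has been projected away. Below is a plain-Python model of those semantics (not Spark code; the data is made up for illustration):

```python
# Plain-Python model of sort("_1") -> select("_2","_3") -> partitionBy("_2"):
# within each partition, rows must keep the order given by _1, even
# though _1 itself is dropped by the projection.
rows = [(3, "a", "r3"), (1, "b", "r1"), (2, "a", "r2")]

ordered = sorted(rows, key=lambda r: r[0])       # sort("_1")
projected = [(r[1], r[2]) for r in ordered]      # select("_2", "_3")

partitions = {}                                  # partitionBy("_2")
for _2, _3 in projected:
    partitions.setdefault(_2, []).append(_3)

# Rows inside each partition preserve the _1 ordering:
assert partitions == {"b": ["r1"], "a": ["r2", "r3"]}
```

Per the report, with spark.sql.optimizer.plannedWrite.enabled=true Spark instead sorts on _2 (the partition column), so the within-partition _1 ordering above is lost.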
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Priority: Blocker (was: Major) > dataset.sort.select.write.partitionBy sorts wrong column > > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Blocker > Labels: correctness > Attachments: Test-Details-for-Query-0.png, > Test-Details-for-Query-1.png > > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem), unless > spark.sql.optimizer.plannedWrite.enabled is set to false. > After further investigation, spark actually sorted wrong column in the > following code. > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > (the following workaround is no longer necessary) > -However, if I insert an identity mapper between select and write, the output > would be sorted as expected.- > -{{dataset = dataset.sort("_1")}}- > -{{.select("_2", "_3");}}- > -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- > -{{.write()}}- > -{{{}.{}}}{{{}partitionBy("_2"){}}}- > -{{.text("output")}}- > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746833#comment-17746833 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 8:31 AM: After inspecting the SQL plan, below are the differences spark.sql.optimizer.plannedWrite.enabled=false (correct result) !Test-Details-for-Query-0.png! spark.sql.optimizer.plannedWrite.enabled=true (incorrect result) !Test-Details-for-Query-1.png! It appears spark generates sorted incorrect column if spark.sql.optimizer.plannedWrite.enabled=true (it should sort _1, but it actually sorted _2 instead) was (Author: JIRAUSER301473): After inspecting the SQL plan, bekow the differences spark.sql.optimizer.plannedWrite.enabled=false (correct result) !Test-Details-for-Query-0.png! spark.sql.optimizer.plannedWrite.enabled=true (incorrect result) !Test-Details-for-Query-1.png! It appears spark generates sorted incorrect column if spark.sql.optimizer.plannedWrite.enabled=true (it should sort _1, but it actually sorted _2 instead) > dataset.sort.select.write.partitionBy sorts wrong column > > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > Attachments: Test-Details-for-Query-0.png, > Test-Details-for-Query-1.png > > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem), unless > spark.sql.optimizer.plannedWrite.enabled is set to false. > After further investigation, spark actually sorted wrong column in the > following code. 
> {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > (the following workaround is no longer necessary) > -However, if I insert an identity mapper between select and write, the output > would be sorted as expected.- > -{{dataset = dataset.sort("_1")}}- > -{{.select("_2", "_3");}}- > -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- > -{{.write()}}- > -{{{}.{}}}{{{}partitionBy("_2"){}}}- > -{{.text("output")}}- > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746833#comment-17746833 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 8:31 AM: After inspecting the SQL plan, below are the differences spark.sql.optimizer.plannedWrite.enabled=false (correct result) !Test-Details-for-Query-0.png! spark.sql.optimizer.plannedWrite.enabled=true (incorrect result) !Test-Details-for-Query-1.png! It appears spark sorted incorrect column if spark.sql.optimizer.plannedWrite.enabled=true (it should sort _1, but it actually sorted _2 instead) was (Author: JIRAUSER301473): After inspecting the SQL plan, below are the differences spark.sql.optimizer.plannedWrite.enabled=false (correct result) !Test-Details-for-Query-0.png! spark.sql.optimizer.plannedWrite.enabled=true (incorrect result) !Test-Details-for-Query-1.png! It appears spark generates sorted incorrect column if spark.sql.optimizer.plannedWrite.enabled=true (it should sort _1, but it actually sorted _2 instead) > dataset.sort.select.write.partitionBy sorts wrong column > > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > Attachments: Test-Details-for-Query-0.png, > Test-Details-for-Query-1.png > > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem), unless > spark.sql.optimizer.plannedWrite.enabled is set to false. > After further investigation, spark actually sorted wrong column in the > following code. 
> {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > (the following workaround is no longer necessary) > -However, if I insert an identity mapper between select and write, the output > would be sorted as expected.- > -{{dataset = dataset.sort("_1")}}- > -{{.select("_2", "_3");}}- > -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- > -{{.write()}}- > -{{{}.{}}}{{{}partitionBy("_2"){}}}- > -{{.text("output")}}- > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Description: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found -then when AQE is enabled,- that the following code does not produce sorted output (.drop() also have the same problem), unless spark.sql.optimizer.plannedWrite.enabled is set to false. After further investigation, spark actually sorted wrong column in the following code. {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} (the following workaround is no longer necessary) -However, if I insert an identity mapper between select and write, the output would be sorted as expected.- -{{dataset = dataset.sort("_1")}}- -{{.select("_2", "_3");}}- -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- -{{.write()}}- -{{{}.{}}}{{{}partitionBy("_2"){}}}- -{{.text("output")}}- Below is the complete code that reproduces the problem. was: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found -then when AQE is enabled,- that the following code does not produce sorted output (.drop() also have the same problem), unless spark.sql.optimizer.plannedWrite.enabled is set to false {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} (the following workaround is no longer necessary) -However, if I insert an identity mapper between select and write, the output would be sorted as expected.- -{{dataset = dataset.sort("_1")}}- -{{.select("_2", "_3");}}- -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- -{{.write()}}- -{{{}.{}}}{{{}partitionBy("_2"){}}}- -{{.text("output")}}- Below is the complete code that reproduces the problem. 
> dataset.sort.select.write.partitionBy sorts wrong column > > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > Attachments: Test-Details-for-Query-0.png, > Test-Details-for-Query-1.png > > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem), unless > spark.sql.optimizer.plannedWrite.enabled is set to false. > After further investigation, spark actually sorted wrong column in the > following code. > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > (the following workaround is no longer necessary) > -However, if I insert an identity mapper between select and write, the output > would be sorted as expected.- > -{{dataset = dataset.sort("_1")}}- > -{{.select("_2", "_3");}}- > -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- > -{{.write()}}- > -{{{}.{}}}{{{}partitionBy("_2"){}}}- > -{{.text("output")}}- > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wrong column
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Summary: dataset.sort.select.write.partitionBy sorts wrong column (was: dataset.sort.select.write.partitionBy sorts wong column) > dataset.sort.select.write.partitionBy sorts wrong column > > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > Attachments: Test-Details-for-Query-0.png, > Test-Details-for-Query-1.png > > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem), unless > spark.sql.optimizer.plannedWrite.enabled is set to false > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > (the following workaround is no longer necessary) > -However, if I insert an identity mapper between select and write, the output > would be sorted as expected.- > -{{dataset = dataset.sort("_1")}}- > -{{.select("_2", "_3");}}- > -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- > -{{.write()}}- > -{{{}.{}}}{{{}partitionBy("_2"){}}}- > -{{.text("output")}}- > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy sorts wong column
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Summary: dataset.sort.select.write.partitionBy sorts wong column (was: dataset.sort.select.write.partitionBy sorts incorrect column) > dataset.sort.select.write.partitionBy sorts wong column > --- > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > Attachments: Test-Details-for-Query-0.png, > Test-Details-for-Query-1.png > > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem), unless > spark.sql.optimizer.plannedWrite.enabled is set to false > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > (the following workaround is no longer necessary) > -However, if I insert an identity mapper between select and write, the output > would be sorted as expected.- > -{{dataset = dataset.sort("_1")}}- > -{{.select("_2", "_3");}}- > -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- > -{{.write()}}- > -{{{}.{}}}{{{}partitionBy("_2"){}}}- > -{{.text("output")}}- > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy sorts incorrect column
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Summary: dataset.sort.select.write.partitionBy sorts incorrect column (was: dataset.sort.select.write.partitionBy sorts incorrect field) > dataset.sort.select.write.partitionBy sorts incorrect column > > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > Attachments: Test-Details-for-Query-0.png, > Test-Details-for-Query-1.png > > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem), unless > spark.sql.optimizer.plannedWrite.enabled is set to false > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > (the following workaround is no longer necessary) > -However, if I insert an identity mapper between select and write, the output > would be sorted as expected.- > -{{dataset = dataset.sort("_1")}}- > -{{.select("_2", "_3");}}- > -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- > -{{.write()}}- > -{{{}.{}}}{{{}partitionBy("_2"){}}}- > -{{.text("output")}}- > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy sorts incorrect field
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746833#comment-17746833 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 8:24 AM: After inspecting the SQL plan, bekow the differences spark.sql.optimizer.plannedWrite.enabled=false (correct result) !Test-Details-for-Query-0.png! spark.sql.optimizer.plannedWrite.enabled=true (incorrect result) !Test-Details-for-Query-1.png! It appears spark generates sorted incorrect column if spark.sql.optimizer.plannedWrite.enabled=true (it should sort _1, but it actually sorted _2 instead) was (Author: JIRAUSER301473): After inspecting the SQL plan, bekow the differences spark.sql.optimizer.plannedWrite.enabled=false (correct result) !Test-Details-for-Query-0.png! spark.sql.optimizer.plannedWrite.enabled=true (incorrect result) !Test-Details-for-Query-1.png! It appears spark generates sorted incorrect column if spark.sql.optimizer.plannedWrite.enabled=true > dataset.sort.select.write.partitionBy sorts incorrect field > --- > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > Attachments: Test-Details-for-Query-0.png, > Test-Details-for-Query-1.png > > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem), unless > spark.sql.optimizer.plannedWrite.enabled is set to false > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > (the following workaround is no longer necessary) > -However, if I insert an identity mapper between select and write, the output > would be sorted as expected.- > -{{dataset = dataset.sort("_1")}}- > -{{.select("_2", 
"_3");}}- > -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- > -{{.write()}}- > -{{{}.{}}}{{{}partitionBy("_2"){}}}- > -{{.text("output")}}- > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy sorts incorrect field
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Summary: dataset.sort.select.write.partitionBy sorts incorrect field (was: dataset.sort.select.write.partitionBy does not return a sorted output) > dataset.sort.select.write.partitionBy sorts incorrect field > --- > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > Attachments: Test-Details-for-Query-0.png, > Test-Details-for-Query-1.png > > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem), unless > spark.sql.optimizer.plannedWrite.enabled is set to false > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > (the following workaround is no longer necessary) > -However, if I insert an identity mapper between select and write, the output > would be sorted as expected.- > -{{dataset = dataset.sort("_1")}}- > -{{.select("_2", "_3");}}- > -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- > -{{.write()}}- > -{{{}.{}}}{{{}partitionBy("_2"){}}}- > -{{.text("output")}}- > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746833#comment-17746833 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 8:21 AM: After inspecting the SQL plan, bekow the differences spark.sql.optimizer.plannedWrite.enabled=false (correct result) !Test-Details-for-Query-0.png! spark.sql.optimizer.plannedWrite.enabled=true (incorrect result) !Test-Details-for-Query-1.png! It appears spark generates sorted incorrect column if spark.sql.optimizer.plannedWrite.enabled=true was (Author: JIRAUSER301473): After inspecting the SQL plan, bekow the differences spark.sql.optimizer.plannedWrite.enabled=false (correct result) !Test-Details-for-Query-0.png! spark.sql.optimizer.plannedWrite.enabled=true (incorrect result) !Test-Details-for-Query-1.png! It appears spark generates incorrect sorting plan if spark.sql.optimizer.plannedWrite.enabled=true > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > Attachments: Test-Details-for-Query-0.png, > Test-Details-for-Query-1.png > > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem), unless > spark.sql.optimizer.plannedWrite.enabled is set to false > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > (the following workaround is no longer necessary) > -However, if I insert an identity mapper between select and write, the output > would be sorted as expected.- > -{{dataset = dataset.sort("_1")}}- > -{{.select("_2", "_3");}}- > -{{dataset.map((MapFunction) row -> 
row, dataset.encoder())}}- > -{{.write()}}- > -{{{}.{}}}{{{}partitionBy("_2"){}}}- > -{{.text("output")}}- > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746833#comment-17746833 ] Yiu-Chung Lee commented on SPARK-44512: --- After inspecting the SQL plan, bekow the differences spark.sql.optimizer.plannedWrite.enabled=false (correct result) !Test-Details-for-Query-0.png! spark.sql.optimizer.plannedWrite.enabled=true (incorrect result) !Test-Details-for-Query-1.png! It appears spark generates incorrect sorting plan if spark.sql.optimizer.plannedWrite.enabled=true > dataset.sort.select.write.partitionBy does not return a sorted output > - > > Key: SPARK-44512 > URL: https://issues.apache.org/jira/browse/SPARK-44512 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.4.1 >Reporter: Yiu-Chung Lee >Priority: Major > Labels: correctness > Attachments: Test-Details-for-Query-0.png, > Test-Details-for-Query-1.png > > > (In this example the dataset is of type Tuple3, and the columns are named _1, > _2 and _3) > > I found -then when AQE is enabled,- that the following code does not produce > sorted output (.drop() also have the same problem), unless > spark.sql.optimizer.plannedWrite.enabled is set to false > {{dataset.sort("_1")}} > {{.select("_2", "_3")}} > {{.write()}} > {{.partitionBy("_2")}} > {{.text("output");}} > > (the following workaround is no longer necessary) > -However, if I insert an identity mapper between select and write, the output > would be sorted as expected.- > -{{dataset = dataset.sort("_1")}}- > -{{.select("_2", "_3");}}- > -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- > -{{.write()}}- > -{{{}.{}}}{{{}partitionBy("_2"){}}}- > -{{.text("output")}}- > Below is the complete code that reproduces the problem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Attachment: Test-Details-for-Query-1.png -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Attachment: Test-Details-for-Query-0.png -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Attachment: screenshot-1.png -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Attachment: (was: screenshot-1.png) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746025#comment-17746025 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 7:32 AM: Here is the [gist|https://gist.github.com/leeyc0/2bdab65901fe5754c471832acdc00890] that reproduces the issue. To compile: javac Test.java && jar cvf Test.jar Test.class bug reproduce: spark-submit --class Test Test.jar no bug: spark-submit --conf spark.sql.optimizer.plannedWrite.enabled=false --class Test Test.jar (the following commands are either invalid or no longer necessary) -no bug if workaround is enabled: spark-submit --class Test Test.jar workaround- -no bug too if AQE is disabled: spark-submit --conf spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each partition key)- was (Author: JIRAUSER301473): Here is the [gist|https://gist.github.com/leeyc0/2bdab65901fe5754c471832acdc00890] that reproduces the issue. To compile: javac Test.java && jar cvf Test.jar Test.class bug reproduce: spark-submit --class Test Test.jar no bug: spark-submit --conf spark.sql.optimizer.plannedWrite.enabled=false --class Test Test.jar -no bug if workaround is enabled: spark-submit --class Test Test.jar workaround- -no bug too if AQE is disabled: spark-submit --conf spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each partition key)- -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Description: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found -then when AQE is enabled,- that the following code does not produce sorted output (.drop() also have the same problem), unless spark.sql.optimizer.plannedWrite.enabled is set to false {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} (the following workaround is no longer necessary) -However, if I insert an identity mapper between select and write, the output would be sorted as expected.- -{{dataset = dataset.sort("_1")}}- -{{.select("_2", "_3");}}- -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- -{{.write()}}- -{{{}.{}}}{{{}partitionBy("_2"){}}}- -{{.text("output")}}- Below is the complete code that reproduces the problem. was: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found -then when AQE is enabled,- that the following code does not produce sorted output (.drop() also have the same problem) {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} (the following workaround is no longer necessary) -However, if I insert an identity mapper between select and write, the output would be sorted as expected.- -{{dataset = dataset.sort("_1")}}- -{{.select("_2", "_3");}}- -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- -{{.write()}}- -{{{}.{}}}{{{}partitionBy("_2"){}}}- -{{.text("output")}}- Below is the complete code that reproduces the problem. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746025#comment-17746025 ] Yiu-Chung Lee edited comment on SPARK-44512 at 7/25/23 7:30 AM: Here is the [gist|https://gist.github.com/leeyc0/2bdab65901fe5754c471832acdc00890] that reproduces the issue. To compile: javac Test.java && jar cvf Test.jar Test.class bug reproduce: spark-submit --class Test Test.jar no bug: spark-submit --conf spark.sql.optimizer.plannedWrite.enabled=false --class Test Test.jar -no bug if workaround is enabled: spark-submit --class Test Test.jar workaround- -no bug too if AQE is disabled: spark-submit --conf spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each partition key)- was (Author: JIRAUSER301473): Here is the [gist|https://gist.github.com/leeyc0/2bdab65901fe5754c471832acdc00890] that reproduces the issue. To compile: javac Test.java && jar cvf Test.jar Test.class bug reproduce: spark-submit --class Test Test.jar no bug if workaround is enabled: spark-submit --class Test Test.jar workaround -no bug too if AQE is disabled: spark-submit --conf spark.sql.adaptive.enabled=false --class Test Test.jar (3 output files in each partition key)- -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Description: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found -then when AQE is enabled,- that the following code does not produce sorted output (.drop() also have the same problem) {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} (the following workaround is no longer necessary) However, if I insert an identity mapper between select and write, the output would be sorted as expected. {{dataset = dataset.sort("_1")}} {{.select("_2", "_3");}} {{dataset.map((MapFunction) row -> row, dataset.encoder())}} {{.write()}} {{{}.{}}}{{{}partitionBy("_2"){}}} {{.text("output")}} Below is the complete code that reproduces the problem. was: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found -then when AQE is enabled,- that the following code does not produce sorted output (.drop() also have the same problem) {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} (the following workaround is no longer necessary) -However, if I insert an identity mapper between select and write, the output would be sorted as expected. {{dataset = dataset.sort("_1")}} {{.select("_2", "_3");}} {{dataset.map((MapFunction) row -> row, dataset.encoder())}} {{.write()}} {{{}.{}}}{{{}partitionBy("_2"){}}} {{.text("output")}} Below is the complete code that reproduces the problem. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Description: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found -then when AQE is enabled,- that the following code does not produce sorted output (.drop() also have the same problem) {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} (the following workaround is no longer necessary) -However, if I insert an identity mapper between select and write, the output would be sorted as expected. {{dataset = dataset.sort("_1")}} {{.select("_2", "_3");}} {{dataset.map((MapFunction) row -> row, dataset.encoder())}} {{.write()}} {{{}.{}}}{{{}partitionBy("_2"){}}} {{.text("output")}} Below is the complete code that reproduces the problem. was: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found -then when AQE is enabled,- that the following code does not produce sorted output (.drop() also have the same problem) {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} (the following workaround is not necessary) -However, if I insert an identity mapper between select and write, the output would be sorted as expected. {{dataset = dataset.sort("_1")}} {{.select("_2", "_3");}} {{dataset.map((MapFunction) row -> row, dataset.encoder())}} {{.write()}} {{{}.{}}}{{{}partitionBy("_2"){}}} {{.text("output")}}- Below is the complete code that reproduces the problem. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Description: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found -then when AQE is enabled,- that the following code does not produce sorted output (.drop() also have the same problem) {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} (the following workaround is no longer necessary) -However, if I insert an identity mapper between select and write, the output would be sorted as expected.- -{{dataset = dataset.sort("_1")}}- -{{.select("_2", "_3");}}- -{{dataset.map((MapFunction) row -> row, dataset.encoder())}}- -{{.write()}}- -{{{}.{}}}{{{}partitionBy("_2"){}}}- -{{.text("output")}}- Below is the complete code that reproduces the problem. was: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found -then when AQE is enabled,- that the following code does not produce sorted output (.drop() also have the same problem) {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} (the following workaround is no longer necessary) However, if I insert an identity mapper between select and write, the output would be sorted as expected. {{dataset = dataset.sort("_1")}} {{.select("_2", "_3");}} {{dataset.map((MapFunction) row -> row, dataset.encoder())}} {{.write()}} {{{}.{}}}{{{}partitionBy("_2"){}}} {{.text("output")}} Below is the complete code that reproduces the problem. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44512) dataset.sort.select.write.partitionBy does not return a sorted output
[ https://issues.apache.org/jira/browse/SPARK-44512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiu-Chung Lee updated SPARK-44512: -- Description: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found -then when AQE is enabled,- that the following code does not produce sorted output (.drop() also have the same problem) {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} (the following workaround is not necessary) -However, if I insert an identity mapper between select and write, the output would be sorted as expected. {{dataset = dataset.sort("_1")}} {{.select("_2", "_3");}} {{dataset.map((MapFunction) row -> row, dataset.encoder())}} {{.write()}} {{{}.{}}}{{{}partitionBy("_2"){}}} {{.text("output")}}- Below is the complete code that reproduces the problem. was: (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3) I found -then when AQE is enabled,- that the following code does not produce sorted output (.drop() also have the same problem) {{dataset.sort("_1")}} {{.select("_2", "_3")}} {{.write()}} {{.partitionBy("_2")}} {{.text("output");}} However, if I insert an identity mapper between select and write, the output would be sorted as expected. {{dataset = dataset.sort("_1")}} {{.select("_2", "_3");}} {{dataset.map((MapFunction) row -> row, dataset.encoder())}} {{.write()}} {{{}.{}}}{{{}partitionBy("_2"){}}} {{.text("output")}} Below is the complete code that reproduces the problem. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44534) Handle only shuffle files in KubernetesLocalDiskShuffleExecutorComponents
[ https://issues.apache.org/jira/browse/SPARK-44534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44534: - Assignee: Dongjoon Hyun > Handle only shuffle files in KubernetesLocalDiskShuffleExecutorComponents > - > > Key: SPARK-44534 > URL: https://issues.apache.org/jira/browse/SPARK-44534 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44534) Handle only shuffle files in KubernetesLocalDiskShuffleExecutorComponents
[ https://issues.apache.org/jira/browse/SPARK-44534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44534. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 42138 [https://github.com/apache/spark/pull/42138] > Handle only shuffle files in KubernetesLocalDiskShuffleExecutorComponents > - > > Key: SPARK-44534 > URL: https://issues.apache.org/jira/browse/SPARK-44534 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org