[jira] [Updated] (SPARK-42170) Files added to the spark-submit command with master K8s and deploy mode cluster, end up in a non deterministic location inside the driver.
[ https://issues.apache.org/jira/browse/SPARK-42170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santosh Pingale updated SPARK-42170: Description: Files added to the spark-submit command with master K8s and deploy mode cluster, end up in a non deterministic location inside the driver. eg: {{spark-submit --files myfile --master k8s.. --deploy-mode cluster` will upload the files to /tmp/spark-uuid/myfile}} The issue happens because [Utils.createTempDir()|https://github.com/apache/spark/blob/v3.3.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L344] creates a directory with a uuid in the directory name. This issue does not affect the --archives option, because we `unarchive` the archives into the destination directory which is relative to the working dir. This bug affects file access pre & post app creation. For example if we distribute python dependencies with pex, we need to use --files to attach the pex file and change the spark.pyspark.python to point to this file. But the file location can not be determined before submitting the app. On the other hand, after the app is created, referencing the files without using `SparkFiles.get` also does not work was: Files added to the spark-submit command with master K8s and deploy mode cluster, end up in a non deterministic location inside the driver. eg: {{spark-submit --files myfile --master k8s.. --deploy-mode cluster` will upload the files to /tmp/spark-uuid/myfile}} The issue happens because [Utils.createTempDir()|https://github.com/apache/spark/blob/v3.3.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L344] creates a directory with a uuid in the directory name. This issue does not affect the `--archives` option, because we `unarchive` the archives into the destination directory which is relative to the working dir. This bug affects file access pre & post app creation. For example if we distribute python dependencies with pex, we need to use `--files` to attach the pex file and change the spark.pyspark.python to point to this file. But the file location can not be determined before submitting the app. On the other hand, after the app is created, referencing the files without using `SparkFiles.get` also does not work > Files added to the spark-submit command with master K8s and deploy mode > cluster, end up in a non deterministic location inside the driver. > -- > > Key: SPARK-42170 > URL: https://issues.apache.org/jira/browse/SPARK-42170 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Submit >Affects Versions: 3.3.0, 3.2.2 >Reporter: Santosh Pingale >Priority: Major > > Files added to the spark-submit command with master K8s and deploy mode > cluster, end up in a non deterministic location inside the driver. > eg: > {{spark-submit --files myfile --master k8s.. --deploy-mode cluster` will > upload the files to /tmp/spark-uuid/myfile}} > The issue happens because > [Utils.createTempDir()|https://github.com/apache/spark/blob/v3.3.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L344] > creates a directory with a uuid in the directory name. This issue does not > affect the --archives option, because we `unarchive` the archives into the > destination directory which is relative to the working dir. This bug affects > file access pre & post app creation. For example if we distribute python > dependencies with pex, we need to use --files to attach the pex file and > change the spark.pyspark.python to point to this file. But the file location > can not be determined before submitting the app. On the other hand, after the > app is created, referencing the files without using `SparkFiles.get` also > does not work -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42170) Files added to the spark-submit command with master K8s and deploy mode cluster, end up in a non deterministic location inside the driver.
Santosh Pingale created SPARK-42170: --- Summary: Files added to the spark-submit command with master K8s and deploy mode cluster, end up in a non deterministic location inside the driver. Key: SPARK-42170 URL: https://issues.apache.org/jira/browse/SPARK-42170 Project: Spark Issue Type: Bug Components: Kubernetes, Spark Submit Affects Versions: 3.2.2, 3.3.0 Reporter: Santosh Pingale Files added to the spark-submit command with master K8s and deploy mode cluster, end up in a non deterministic location inside the driver. eg: {{spark-submit --files myfile --master k8s.. --deploy-mode cluster` will upload the files to /tmp/spark-uuid/myfile}} The issue happens because [Utils.createTempDir()|https://github.com/apache/spark/blob/v3.3.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L344] creates a directory with a uuid in the directory name. This issue does not affect the `--archives` option, because we `unarchive` the archives into the destination directory which is relative to the working dir. This bug affects file access pre & post app creation. For example if we distribute python dependencies with pex, we need to use `--files` to attach the pex file and change the spark.pyspark.python to point to this file. But the file location can not be determined before submitting the app. On the other hand, after the app is created, referencing the files without using `SparkFiles.get` also does not work -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42168) CoGroup with window function returns incorrect result when partition keys differ in order
[ https://issues.apache.org/jira/browse/SPARK-42168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42168: Assignee: (was: Apache Spark) > CoGroup with window function returns incorrect result when partition keys > differ in order > - > > Key: SPARK-42168 > URL: https://issues.apache.org/jira/browse/SPARK-42168 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.3, 3.1.3, 3.2.3 >Reporter: Enrico Minack >Priority: Major > Labels: correctness > > The following example returns an incorrect result: > {code:java} > import pandas as pd > from pyspark.sql import SparkSession, Window > from pyspark.sql.functions import col, lit, sum > spark = SparkSession \ > .builder \ > .getOrCreate() > ids = 1000 > days = 1000 > parts = 10 > id_df = spark.range(ids) > day_df = spark.range(days).withColumnRenamed("id", "day") > id_day_df = id_df.join(day_df) > left_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), > lit("left").alias("side")).repartition(parts).cache() > right_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), > lit("right").alias("side")).repartition(parts).cache() > #.withColumnRenamed("id", "id2") > # note the column order is different to the groupBy("id", "day") column order > below > window = Window.partitionBy("day", "id") > left_grouped_df = left_df.groupBy("id", "day") > right_grouped_df = right_df.withColumn("day_sum", > sum(col("day")).over(window)).groupBy("id", "day") > def cogroup(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame: > return pd.DataFrame([{ > "id": left["id"][0] if not left.empty else (right["id"][0] if not > right.empty else None), > "day": left["day"][0] if not left.empty else (right["day"][0] if not > right.empty else None), > "lefts": len(left.index), > "rights": len(right.index) > }]) > df = left_grouped_df.cogroup(right_grouped_df) \ > .applyInPandas(cogroup, schema="id long, day long, lefts integer, > rights integer") > df.explain() > df.show(5) > {code} > Output is > {code} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, > day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L), [id#64L, day#65L, > lefts#66, rights#67] >:- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, > [plan_id=117] >: +- ... >+- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0 > +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L] > +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS day_sum#54L], [day#30L, id#29L] > +- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], false, > 0 >+- Exchange hashpartitioning(day#30L, id#29L, 200), > ENSURE_REQUIREMENTS, [plan_id=112] > +- ... > +---+---+-+--+ > | id|day|lefts|rights| > +---+---+-+--+ > | 0| 3|0| 1| > | 0| 4|0| 1| > | 0| 13|1| 0| > | 0| 27|0| 1| > | 0| 31|0| 1| > +---+---+-+--+ > only showing top 5 rows > {code} > The first child is hash-partitioned by {{id}} and {{{}day{}}}, while the > second child is hash-partitioned by {{day}} and {{id}} (required by the > window function). Therefore, rows end up in different partitions. > This has been fixed in Spark 3.3 by > [#32875|https://github.com/apache/spark/pull/32875/files#diff-e938569a4ca4eba8f7e10fe473d4f9c306ea253df151405bcaba880a601f075fR75-R76]: > {code} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, > day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L)#63, [id#64L, day#65L, > lefts#66, rights#67] >:- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, > [plan_id=117] >: +- ... >+- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#29L, day#30L, 200), > ENSURE_REQUIREMENTS, [plan_id=118] > +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L] > +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS day_sum#54L], [day#30L, id#29L] >+- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], > false, 0 >
[jira] [Assigned] (SPARK-42168) CoGroup with window function returns incorrect result when partition keys differ in order
[ https://issues.apache.org/jira/browse/SPARK-42168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42168: Assignee: Apache Spark > CoGroup with window function returns incorrect result when partition keys > differ in order > - > > Key: SPARK-42168 > URL: https://issues.apache.org/jira/browse/SPARK-42168 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.3, 3.1.3, 3.2.3 >Reporter: Enrico Minack >Assignee: Apache Spark >Priority: Major > Labels: correctness > > The following example returns an incorrect result: > {code:java} > import pandas as pd > from pyspark.sql import SparkSession, Window > from pyspark.sql.functions import col, lit, sum > spark = SparkSession \ > .builder \ > .getOrCreate() > ids = 1000 > days = 1000 > parts = 10 > id_df = spark.range(ids) > day_df = spark.range(days).withColumnRenamed("id", "day") > id_day_df = id_df.join(day_df) > left_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), > lit("left").alias("side")).repartition(parts).cache() > right_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), > lit("right").alias("side")).repartition(parts).cache() > #.withColumnRenamed("id", "id2") > # note the column order is different to the groupBy("id", "day") column order > below > window = Window.partitionBy("day", "id") > left_grouped_df = left_df.groupBy("id", "day") > right_grouped_df = right_df.withColumn("day_sum", > sum(col("day")).over(window)).groupBy("id", "day") > def cogroup(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame: > return pd.DataFrame([{ > "id": left["id"][0] if not left.empty else (right["id"][0] if not > right.empty else None), > "day": left["day"][0] if not left.empty else (right["day"][0] if not > right.empty else None), > "lefts": len(left.index), > "rights": len(right.index) > }]) > df = left_grouped_df.cogroup(right_grouped_df) \ > .applyInPandas(cogroup, schema="id long, day long, lefts integer, > rights integer") > df.explain() > df.show(5) > {code} > Output is > {code} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, > day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L), [id#64L, day#65L, > lefts#66, rights#67] >:- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, > [plan_id=117] >: +- ... >+- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0 > +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L] > +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS day_sum#54L], [day#30L, id#29L] > +- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], false, > 0 >+- Exchange hashpartitioning(day#30L, id#29L, 200), > ENSURE_REQUIREMENTS, [plan_id=112] > +- ... > +---+---+-+--+ > | id|day|lefts|rights| > +---+---+-+--+ > | 0| 3|0| 1| > | 0| 4|0| 1| > | 0| 13|1| 0| > | 0| 27|0| 1| > | 0| 31|0| 1| > +---+---+-+--+ > only showing top 5 rows > {code} > The first child is hash-partitioned by {{id}} and {{{}day{}}}, while the > second child is hash-partitioned by {{day}} and {{id}} (required by the > window function). Therefore, rows end up in different partitions. > This has been fixed in Spark 3.3 by > [#32875|https://github.com/apache/spark/pull/32875/files#diff-e938569a4ca4eba8f7e10fe473d4f9c306ea253df151405bcaba880a601f075fR75-R76]: > {code} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, > day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L)#63, [id#64L, day#65L, > lefts#66, rights#67] >:- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, > [plan_id=117] >: +- ... >+- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#29L, day#30L, 200), > ENSURE_REQUIREMENTS, [plan_id=118] > +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L] > +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS day_sum#54L], [day#30L, id#29L] >+- Sort [day#30L ASC NULLS FIRST, id#29L ASC
[jira] [Commented] (SPARK-42168) CoGroup with window function returns incorrect result when partition keys differ in order
[ https://issues.apache.org/jira/browse/SPARK-42168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680229#comment-17680229 ] Apache Spark commented on SPARK-42168: -- User 'EnricoMi' has created a pull request for this issue: https://github.com/apache/spark/pull/39717 > CoGroup with window function returns incorrect result when partition keys > differ in order > - > > Key: SPARK-42168 > URL: https://issues.apache.org/jira/browse/SPARK-42168 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.3, 3.1.3, 3.2.3 >Reporter: Enrico Minack >Priority: Major > Labels: correctness > > The following example returns an incorrect result: > {code:java} > import pandas as pd > from pyspark.sql import SparkSession, Window > from pyspark.sql.functions import col, lit, sum > spark = SparkSession \ > .builder \ > .getOrCreate() > ids = 1000 > days = 1000 > parts = 10 > id_df = spark.range(ids) > day_df = spark.range(days).withColumnRenamed("id", "day") > id_day_df = id_df.join(day_df) > left_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), > lit("left").alias("side")).repartition(parts).cache() > right_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), > lit("right").alias("side")).repartition(parts).cache() > #.withColumnRenamed("id", "id2") > # note the column order is different to the groupBy("id", "day") column order > below > window = Window.partitionBy("day", "id") > left_grouped_df = left_df.groupBy("id", "day") > right_grouped_df = right_df.withColumn("day_sum", > sum(col("day")).over(window)).groupBy("id", "day") > def cogroup(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame: > return pd.DataFrame([{ > "id": left["id"][0] if not left.empty else (right["id"][0] if not > right.empty else None), > "day": left["day"][0] if not left.empty else (right["day"][0] if not > right.empty else None), > "lefts": len(left.index), > "rights": len(right.index) > }]) > df = left_grouped_df.cogroup(right_grouped_df) \ > .applyInPandas(cogroup, schema="id long, day long, lefts integer, > rights integer") > df.explain() > df.show(5) > {code} > Output is > {code} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, > day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L), [id#64L, day#65L, > lefts#66, rights#67] >:- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, > [plan_id=117] >: +- ... >+- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0 > +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L] > +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS day_sum#54L], [day#30L, id#29L] > +- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], false, > 0 >+- Exchange hashpartitioning(day#30L, id#29L, 200), > ENSURE_REQUIREMENTS, [plan_id=112] > +- ... > +---+---+-+--+ > | id|day|lefts|rights| > +---+---+-+--+ > | 0| 3|0| 1| > | 0| 4|0| 1| > | 0| 13|1| 0| > | 0| 27|0| 1| > | 0| 31|0| 1| > +---+---+-+--+ > only showing top 5 rows > {code} > The first child is hash-partitioned by {{id}} and {{{}day{}}}, while the > second child is hash-partitioned by {{day}} and {{id}} (required by the > window function). Therefore, rows end up in different partitions. > This has been fixed in Spark 3.3 by > [#32875|https://github.com/apache/spark/pull/32875/files#diff-e938569a4ca4eba8f7e10fe473d4f9c306ea253df151405bcaba880a601f075fR75-R76]: > {code} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, > day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L)#63, [id#64L, day#65L, > lefts#66, rights#67] >:- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, > [plan_id=117] >: +- ... >+- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#29L, day#30L, 200), > ENSURE_REQUIREMENTS, [plan_id=118] > +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L] > +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS day_sum#54L], [day#30L,
[jira] [Assigned] (SPARK-42044) Fix wrong error message for `MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY`
[ https://issues.apache.org/jira/browse/SPARK-42044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42044: Assignee: Haejoon Lee > Fix wrong error message for `MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY` > --- > > Key: SPARK-42044 > URL: https://issues.apache.org/jira/browse/SPARK-42044 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > The current error message > "Correlated scalar subqueries in the GROUP BY clause must also be in the > aggregate expressions" > is incorrect since it's not related with group by clause. > We should fix it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42044) Fix wrong error message for `MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY`
[ https://issues.apache.org/jira/browse/SPARK-42044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42044. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39543 [https://github.com/apache/spark/pull/39543] > Fix wrong error message for `MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY` > --- > > Key: SPARK-42044 > URL: https://issues.apache.org/jira/browse/SPARK-42044 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > The current error message > "Correlated scalar subqueries in the GROUP BY clause must also be in the > aggregate expressions" > is incorrect since it's not related with group by clause. > We should fix it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42168) CoGroup with window function returns incorrect result when partition keys differ in order
[ https://issues.apache.org/jira/browse/SPARK-42168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-42168: -- Description: The following example returns an incorrect result: {code:java} import pandas as pd from pyspark.sql import SparkSession, Window from pyspark.sql.functions import col, lit, sum spark = SparkSession \ .builder \ .getOrCreate() ids = 1000 days = 1000 parts = 10 id_df = spark.range(ids) day_df = spark.range(days).withColumnRenamed("id", "day") id_day_df = id_df.join(day_df) left_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), lit("left").alias("side")).repartition(parts).cache() right_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), lit("right").alias("side")).repartition(parts).cache() #.withColumnRenamed("id", "id2") # note the column order is different to the groupBy("id", "day") column order below window = Window.partitionBy("day", "id") left_grouped_df = left_df.groupBy("id", "day") right_grouped_df = right_df.withColumn("day_sum", sum(col("day")).over(window)).groupBy("id", "day") def cogroup(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame: return pd.DataFrame([{ "id": left["id"][0] if not left.empty else (right["id"][0] if not right.empty else None), "day": left["day"][0] if not left.empty else (right["day"][0] if not right.empty else None), "lefts": len(left.index), "rights": len(right.index) }]) df = left_grouped_df.cogroup(right_grouped_df) \ .applyInPandas(cogroup, schema="id long, day long, lefts integer, rights integer") df.explain() df.show(5) {code} Output is {code} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L), [id#64L, day#65L, lefts#66, rights#67] :- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, [plan_id=117] : +- ... +- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0 +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L] +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS day_sum#54L], [day#30L, id#29L] +- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(day#30L, id#29L, 200), ENSURE_REQUIREMENTS, [plan_id=112] +- ... +---+---+-+--+ | id|day|lefts|rights| +---+---+-+--+ | 0| 3|0| 1| | 0| 4|0| 1| | 0| 13|1| 0| | 0| 27|0| 1| | 0| 31|0| 1| +---+---+-+--+ only showing top 5 rows {code} The first child is hash-partitioned by {{id}} and {{{}day{}}}, while the second child is hash-partitioned by {{day}} and {{id}} (required by the window function). Therefore, rows end up in different partitions. This has been fixed in Spark 3.3 by [#32875|https://github.com/apache/spark/pull/32875/files#diff-e938569a4ca4eba8f7e10fe473d4f9c306ea253df151405bcaba880a601f075fR75-R76]: {code} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L)#63, [id#64L, day#65L, lefts#66, rights#67] :- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, [plan_id=117] : +- ... +- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#29L, day#30L, 200), ENSURE_REQUIREMENTS, [plan_id=118] +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L] +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS day_sum#54L], [day#30L, id#29L] +- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(day#30L, id#29L, 200), ENSURE_REQUIREMENTS, [plan_id=112] +- ... +---+---+-+--+ | id|day|lefts|rights| +---+---+-+--+ | 0| 13|1| 1| | 0| 63|1| 1| | 0| 89|1| 1| | 0| 95|1| 1| | 0| 96|1| 1| +---+---+-+--+ only showing top 5 rows {code} Only PySpark is to be affected ({{FlatMapCoGroupsInPandas }}), as Scala API uses {{CoGroup}}. {{FlatMapCoGroupsInPandas}} reports required child distribution {{ClusteredDistribution}}, while {{CoGroup}} reports {{HashClusteredDistribution}}. The {{EnsureRequirements}} rule correctly recognizes a {{HashClusteredDistribution(id, day)}} as not compatible with
[jira] [Created] (SPARK-42169) Implement code generation for `to_csv` function (StructsToCsv)
Narek Karapetian created SPARK-42169: Summary: Implement code generation for `to_csv` function (StructsToCsv) Key: SPARK-42169 URL: https://issues.apache.org/jira/browse/SPARK-42169 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.4.0 Reporter: Narek Karapetian Fix For: 3.4.0 Implement code generation for `to_csv` function instead of extending it from CodegenFallback trait. {code:java} org.apache.spark.sql.catalyst.expressions.StructsToCsv.doGenCode(...){code} This is good to have from performance point of view. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42168) CoGroup with window function returns incorrect result when partition keys differ in order
[ https://issues.apache.org/jira/browse/SPARK-42168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-42168: -- Description: The following example returns an incorrect result: {code:java} import pandas as pd from pyspark.sql import SparkSession, Window from pyspark.sql.functions import col, lit, sum spark = SparkSession \ .builder \ .getOrCreate() ids = 1000 days = 1000 parts = 10 id_df = spark.range(ids) day_df = spark.range(days).withColumnRenamed("id", "day") id_day_df = id_df.join(day_df) left_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), lit("left").alias("side")).repartition(parts).cache() right_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), lit("right").alias("side")).repartition(parts).cache() #.withColumnRenamed("id", "id2") # note the column order is different to the groupBy("id", "day") column order below window = Window.partitionBy("day", "id") left_grouped_df = left_df.groupBy("id", "day") right_grouped_df = right_df.withColumn("day_sum", sum(col("day")).over(window)).groupBy("id", "day") def cogroup(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame: return pd.DataFrame([{ "id": left["id"][0] if not left.empty else (right["id"][0] if not right.empty else None), "day": left["day"][0] if not left.empty else (right["day"][0] if not right.empty else None), "lefts": len(left.index), "rights": len(right.index) }]) df = left_grouped_df.cogroup(right_grouped_df) \ .applyInPandas(cogroup, schema="id long, day long, lefts integer, rights integer") df.explain() df.show(5) {code} Output is {code} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L), [id#64L, day#65L, lefts#66, rights#67] :- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, [plan_id=117] : +- ... +- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0 +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L] +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS day_sum#54L], [day#30L, id#29L] +- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(day#30L, id#29L, 200), ENSURE_REQUIREMENTS, [plan_id=112] +- ... +---+---+-+--+ | id|day|lefts|rights| +---+---+-+--+ | 0| 3|0| 1| | 0| 4|0| 1| | 0| 13|1| 0| | 0| 27|0| 1| | 0| 31|0| 1| +---+---+-+--+ only showing top 5 rows {code} The first child is hash-partitioned by {{id}} and {{{}day{}}}, while the second child is hash-partitioned by {{day}} and {{id}} (required by the window function). Therefore, rows end up in different partitions. This has been fixed in Spark 3.3 by [#32875|https://github.com/apache/spark/pull/32875/files#diff-e938569a4ca4eba8f7e10fe473d4f9c306ea253df151405bcaba880a601f075fR75-R76]: {code} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L)#63, [id#64L, day#65L, lefts#66, rights#67] :- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, [plan_id=117] : +- ... +- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#29L, day#30L, 200), ENSURE_REQUIREMENTS, [plan_id=118] +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L] +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS day_sum#54L], [day#30L, id#29L] +- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(day#30L, id#29L, 200), ENSURE_REQUIREMENTS, [plan_id=112] +- ... +---+---+-+--+ | id|day|lefts|rights| +---+---+-+--+ | 0| 13|1| 1| | 0| 63|1| 1| | 0| 89|1| 1| | 0| 95|1| 1| | 0| 96|1| 1| +---+---+-+--+ only showing top 5 rows {code} Only PySpark seems to be affected. was: The following example returns an incorrect result: {code:java} import pandas as pd from pyspark.sql import SparkSession, Window from pyspark.sql.functions import col, lit, sum spark = SparkSession \ .builder \ .getOrCreate() ids = 1000 days = 1000 parts = 10 id_df = spark.range(ids) day_df =
[jira] [Created] (SPARK-42168) CoGroup with window function returns incorrect result when partition keys differ in order
Enrico Minack created SPARK-42168: - Summary: CoGroup with window function returns incorrect result when partition keys differ in order Key: SPARK-42168 URL: https://issues.apache.org/jira/browse/SPARK-42168 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.2.3, 3.1.3, 3.0.3 Reporter: Enrico Minack The following example returns an incorrect result: {code:java} import pandas as pd from pyspark.sql import SparkSession, Window from pyspark.sql.functions import col, lit, sum spark = SparkSession \ .builder \ .getOrCreate() ids = 1000 days = 1000 parts = 10 id_df = spark.range(ids) day_df = spark.range(days).withColumnRenamed("id", "day") id_day_df = id_df.join(day_df) left_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), lit("left").alias("side")).repartition(parts).cache() right_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), lit("right").alias("side")).repartition(parts).cache() #.withColumnRenamed("id", "id2") # note the column order is different to the groupBy("id", "day") column order below window = Window.partitionBy("day", "id") left_grouped_df = left_df.groupBy("id", "day") right_grouped_df = right_df.withColumn("day_sum", sum(col("day")).over(window)).groupBy("id", "day") def cogroup(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame: return pd.DataFrame([{ "id": left["id"][0] if not left.empty else (right["id"][0] if not right.empty else None), "day": left["day"][0] if not left.empty else (right["day"][0] if not right.empty else None), "lefts": len(left.index), "rights": len(right.index) }]) df = left_grouped_df.cogroup(right_grouped_df) \ .applyInPandas(cogroup, schema="id long, day long, lefts integer, rights integer") df.explain() df.show(5) {code} Output is {code} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L), [id#64L, day#65L, lefts#66, rights#67] :- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, [plan_id=117] : +- InMemoryTableScan [id#8L, day#9L, id#8L, day#9L, side#10] : +- InMemoryRelation [id#8L, day#9L, side#10], StorageLevel(disk, memory, deserialized, 1 replicas) : +- Exchange RoundRobinPartitioning(10), REPARTITION_BY_NUM, [plan_id=33] :+- *(2) Project [id#0L, day#4L, left AS side#10] : +- *(2) BroadcastNestedLoopJoin BuildRight, Inner : :- *(2) Range (0, 1000, step=1, splits=16) : +- BroadcastExchange IdentityBroadcastMode, [plan_id=28] : +- *(1) Project [id#2L AS day#4L] :+- *(1) Range (0, 1000, step=1, splits=16) +- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0 +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L] +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS day_sum#54L], [day#30L, id#29L] +- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(day#30L, id#29L, 200), ENSURE_REQUIREMENTS, [plan_id=112] +- InMemoryTableScan [id#29L, day#30L, side#31] +- InMemoryRelation [id#29L, day#30L, side#31], StorageLevel(disk, memory, deserialized, 1 replicas) +- Exchange RoundRobinPartitioning(10), REPARTITION_BY_NUM, [plan_id=79] +- *(2) Project [id#0L, day#4L, right AS side#31] +- *(2) BroadcastNestedLoopJoin BuildRight, Inner :- *(2) Range (0, 1000, step=1, splits=16) +- BroadcastExchange IdentityBroadcastMode, [plan_id=74] +- *(1) Project [id#2L AS day#4L] +- *(1) Range (0, 1000, step=1, splits=16) +---+---+-+--+ | id|day|lefts|rights| +---+---+-+--+ | 0| 3|0| 1| | 0| 4|0| 1| | 0| 13|1| 0| | 0| 27|0| 1| | 0| 31|0| 1| +---+---+-+--+ only showing top 5 rows {code} The first child is hash-partitioned by {{id}} and {{{}day{}}}, while the second child is hash-partitioned by {{day}} and {{id}} (required by the window function). Therefore, rows end up in different partitions. This has been fixed in Spark 3.3 by
[jira] [Updated] (SPARK-42166) Make `docker-image-tool.sh` usage message up-to-date
[ https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42166: -- Summary: Make `docker-image-tool.sh` usage message up-to-date (was: Make docker-image-tool.sh help up-to-date) > Make `docker-image-tool.sh` usage message up-to-date > > > Key: SPARK-42166 > URL: https://issues.apache.org/jira/browse/SPARK-42166 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42166) Make docker-image-tool.sh help up-to-date
[ https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42166. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39714 [https://github.com/apache/spark/pull/39714] > Make docker-image-tool.sh help up-to-date > - > > Key: SPARK-42166 > URL: https://issues.apache.org/jira/browse/SPARK-42166 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42166) Make docker-image-tool.sh help up-to-date
[ https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42166: - Assignee: Dongjoon Hyun > Make docker-image-tool.sh help up-to-date > - > > Key: SPARK-42166 > URL: https://issues.apache.org/jira/browse/SPARK-42166 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42167) Improve GitHub Action `lint` job to stop on failures earlier
[ https://issues.apache.org/jira/browse/SPARK-42167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42167: Assignee: (was: Apache Spark) > Improve GitHub Action `lint` job to stop on failures earlier > > > Key: SPARK-42167 > URL: https://issues.apache.org/jira/browse/SPARK-42167 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42167) Improve GitHub Action `lint` job to stop on failures earlier
[ https://issues.apache.org/jira/browse/SPARK-42167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42167: Assignee: Apache Spark > Improve GitHub Action `lint` job to stop on failures earlier > > > Key: SPARK-42167 > URL: https://issues.apache.org/jira/browse/SPARK-42167 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42167) Improve GitHub Action `lint` job to stop on failures earlier
[ https://issues.apache.org/jira/browse/SPARK-42167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680165#comment-17680165 ] Apache Spark commented on SPARK-42167: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39716 > Improve GitHub Action `lint` job to stop on failures earlier > > > Key: SPARK-42167 > URL: https://issues.apache.org/jira/browse/SPARK-42167 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42167) Improve GitHub Action `lint` job to stop on failures earlier
Dongjoon Hyun created SPARK-42167: - Summary: Improve GitHub Action `lint` job to stop on failures earlier Key: SPARK-42167 URL: https://issues.apache.org/jira/browse/SPARK-42167 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41775) Implement training functions as input
[ https://issues.apache.org/jira/browse/SPARK-41775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680154#comment-17680154 ] Apache Spark commented on SPARK-41775: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39715 > Implement training functions as input > - > > Key: SPARK-41775 > URL: https://issues.apache.org/jira/browse/SPARK-41775 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.4.0 >Reporter: Rithwik Ediga Lakhamsani >Assignee: Rithwik Ediga Lakhamsani >Priority: Major > Fix For: 3.4.0 > > > Sidenote: make formatting updates described in > https://github.com/apache/spark/pull/39188 > > Currently, `Distributor().run(...)` takes only files as input. Now we will > add in additional functionality to take in functions as well. This will > require us to go through the following process on each task in the executor > nodes: > 1. take the input function and args and pickle them > 2. Create a temp train.py file that looks like > {code:java} > import cloudpickle > import os > if _name_ == "_main_": > train, args = cloudpickle.load(f"{tempdir}/train_input.pkl") > output = train(*args) > if output and os.environ.get("RANK", "") == "0": # this is for > partitionId == 0 > cloudpickle.dump(f"{tempdir}/train_output.pkl") {code} > 3. Run that train.py file with `torchrun` > 4. Check if `train_output.pkl` has been created on process on partitionId == > 0, if it has, then deserialize it and return that output through `.collect()` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42166) Make docker-image-tool.sh help up-to-date
[ https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680145#comment-17680145 ] Apache Spark commented on SPARK-42166: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39714 > Make docker-image-tool.sh help up-to-date > - > > Key: SPARK-42166 > URL: https://issues.apache.org/jira/browse/SPARK-42166 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42166) Make docker-image-tool.sh help up-to-date
[ https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42166: Assignee: (was: Apache Spark) > Make docker-image-tool.sh help up-to-date > - > > Key: SPARK-42166 > URL: https://issues.apache.org/jira/browse/SPARK-42166 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42166) Make docker-image-tool.sh help up-to-date
[ https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680144#comment-17680144 ] Apache Spark commented on SPARK-42166: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39714 > Make docker-image-tool.sh help up-to-date > - > > Key: SPARK-42166 > URL: https://issues.apache.org/jira/browse/SPARK-42166 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42166) Make docker-image-tool.sh help up-to-date
[ https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42166: Assignee: Apache Spark > Make docker-image-tool.sh help up-to-date > - > > Key: SPARK-42166 > URL: https://issues.apache.org/jira/browse/SPARK-42166 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42166) Make docker-image-tool.sh help up-to-date
[ https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42166: -- Summary: Make docker-image-tool.sh help up-to-date (was: Fix docker-image-tool.sh help message) > Make docker-image-tool.sh help up-to-date > - > > Key: SPARK-42166 > URL: https://issues.apache.org/jira/browse/SPARK-42166 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42166) Fix docker-image-tool.sh help message
[ https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42166: -- Issue Type: Improvement (was: Bug) > Fix docker-image-tool.sh help message > - > > Key: SPARK-42166 > URL: https://issues.apache.org/jira/browse/SPARK-42166 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42166) Fix docker-image-tool.sh help message
Dongjoon Hyun created SPARK-42166: - Summary: Fix docker-image-tool.sh help message Key: SPARK-42166 URL: https://issues.apache.org/jira/browse/SPARK-42166 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42165) Integrate all errors that causes re-create of a view failing into one error class
Haejoon Lee created SPARK-42165: --- Summary: Integrate all errors that causes re-create of a view failing into one error class Key: SPARK-42165 URL: https://issues.apache.org/jira/browse/SPARK-42165 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Haejoon Lee Should integrate errors that causes re-create of view failing into single error class, and make the subclasses for distinct cases. e.g. * CANNOT UPCAST * TABLE NOT FOUND/UNRESOLVED * UNRESOLVED FUNCTION * DATATYPE MISMATCH -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org