[jira] [Updated] (SPARK-42170) Files added to the spark-submit command with master K8s and deploy mode cluster, end up in a non deterministic location inside the driver.

2023-01-24 Thread Santosh Pingale (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santosh Pingale updated SPARK-42170:

Description: 
Files added to the spark-submit command with master K8s and deploy mode 
cluster, end up in a non deterministic location inside the driver.

eg:

{{spark-submit --files myfile --master k8s.. --deploy-mode cluster` will upload 
the files to /tmp/spark-uuid/myfile}}

The issue happens because 
[Utils.createTempDir()|https://github.com/apache/spark/blob/v3.3.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L344]
 creates a directory with a uuid in the directory name. This issue does not 
affect the --archives option, because we `unarchive` the archives into the 
destination directory which is relative to the working dir. This bug affects 
file access pre & post app creation. For example if we distribute python 
dependencies with pex, we need to use --files to attach the pex file and change 
the spark.pyspark.python to point to this file. But the file location can not 
be determined before submitting the app. On the other hand, after the app is 
created, referencing the files without using `SparkFiles.get` also does not work

  was:
Files added to the spark-submit command with master K8s and deploy mode 
cluster, end up in a non deterministic location inside the driver.

eg:

{{spark-submit --files myfile --master k8s.. --deploy-mode cluster` will upload 
the files to /tmp/spark-uuid/myfile}}

The issue happens because 
[Utils.createTempDir()|https://github.com/apache/spark/blob/v3.3.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L344]
 creates a directory with a uuid in the directory name. This issue does not 
affect the `--archives` option, because we `unarchive` the archives into the 
destination directory which is relative to the working dir. This bug affects 
file access pre & post app creation. For example if we distribute python 
dependencies with pex, we need to use `--files` to attach the pex file and 
change the spark.pyspark.python to point to this file. But the file location 
can not be determined before submitting the app. On the other hand, after the 
app is created, referencing the files without using `SparkFiles.get` also does 
not work


> Files added to the spark-submit command with master K8s and deploy mode 
> cluster, end up in a non deterministic location inside the driver.
> --
>
> Key: SPARK-42170
> URL: https://issues.apache.org/jira/browse/SPARK-42170
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Submit
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Santosh Pingale
>Priority: Major
>
> Files added to the spark-submit command with master K8s and deploy mode 
> cluster, end up in a non deterministic location inside the driver.
> eg:
> {{spark-submit --files myfile --master k8s.. --deploy-mode cluster` will 
> upload the files to /tmp/spark-uuid/myfile}}
> The issue happens because 
> [Utils.createTempDir()|https://github.com/apache/spark/blob/v3.3.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L344]
>  creates a directory with a uuid in the directory name. This issue does not 
> affect the --archives option, because we `unarchive` the archives into the 
> destination directory which is relative to the working dir. This bug affects 
> file access pre & post app creation. For example if we distribute python 
> dependencies with pex, we need to use --files to attach the pex file and 
> change the spark.pyspark.python to point to this file. But the file location 
> can not be determined before submitting the app. On the other hand, after the 
> app is created, referencing the files without using `SparkFiles.get` also 
> does not work



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42170) Files added to the spark-submit command with master K8s and deploy mode cluster, end up in a non deterministic location inside the driver.

2023-01-24 Thread Santosh Pingale (Jira)
Santosh Pingale created SPARK-42170:
---

 Summary: Files added to the spark-submit command with master K8s 
and deploy mode cluster, end up in a non deterministic location inside the 
driver.
 Key: SPARK-42170
 URL: https://issues.apache.org/jira/browse/SPARK-42170
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Spark Submit
Affects Versions: 3.2.2, 3.3.0
Reporter: Santosh Pingale


Files added to the spark-submit command with master K8s and deploy mode 
cluster, end up in a non deterministic location inside the driver.

eg:

{{spark-submit --files myfile --master k8s.. --deploy-mode cluster` will upload 
the files to /tmp/spark-uuid/myfile}}

The issue happens because 
[Utils.createTempDir()|https://github.com/apache/spark/blob/v3.3.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L344]
 creates a directory with a uuid in the directory name. This issue does not 
affect the `--archives` option, because we `unarchive` the archives into the 
destination directory which is relative to the working dir. This bug affects 
file access pre & post app creation. For example if we distribute python 
dependencies with pex, we need to use `--files` to attach the pex file and 
change the spark.pyspark.python to point to this file. But the file location 
can not be determined before submitting the app. On the other hand, after the 
app is created, referencing the files without using `SparkFiles.get` also does 
not work



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42168) CoGroup with window function returns incorrect result when partition keys differ in order

2023-01-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42168:


Assignee: (was: Apache Spark)

> CoGroup with window function returns incorrect result when partition keys 
> differ in order
> -
>
> Key: SPARK-42168
> URL: https://issues.apache.org/jira/browse/SPARK-42168
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.3, 3.1.3, 3.2.3
>Reporter: Enrico Minack
>Priority: Major
>  Labels: correctness
>
> The following example returns an incorrect result:
> {code:java}
> import pandas as pd
> from pyspark.sql import SparkSession, Window
> from pyspark.sql.functions import col, lit, sum
> spark = SparkSession \
> .builder \
> .getOrCreate()
> ids = 1000
> days = 1000
> parts = 10
> id_df = spark.range(ids)
> day_df = spark.range(days).withColumnRenamed("id", "day")
> id_day_df = id_df.join(day_df)
> left_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), 
> lit("left").alias("side")).repartition(parts).cache()
> right_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), 
> lit("right").alias("side")).repartition(parts).cache()  
> #.withColumnRenamed("id", "id2")
> # note the column order is different to the groupBy("id", "day") column order 
> below
> window = Window.partitionBy("day", "id")
> left_grouped_df = left_df.groupBy("id", "day")
> right_grouped_df = right_df.withColumn("day_sum", 
> sum(col("day")).over(window)).groupBy("id", "day")
> def cogroup(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
> return pd.DataFrame([{
> "id": left["id"][0] if not left.empty else (right["id"][0] if not 
> right.empty else None),
> "day": left["day"][0] if not left.empty else (right["day"][0] if not 
> right.empty else None),
> "lefts": len(left.index),
> "rights": len(right.index)
> }])
> df = left_grouped_df.cogroup(right_grouped_df) \
>  .applyInPandas(cogroup, schema="id long, day long, lefts integer, 
> rights integer")
> df.explain()
> df.show(5)
> {code}
> Output is
> {code}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, 
> day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L), [id#64L, day#65L, 
> lefts#66, rights#67]
>:- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, 
> [plan_id=117]
>: +- ...
>+- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0
>   +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L]
>  +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS day_sum#54L], [day#30L, id#29L]
> +- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], false, > 0
>+- Exchange hashpartitioning(day#30L, id#29L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=112]
>   +- ...
> +---+---+-+--+
> | id|day|lefts|rights|
> +---+---+-+--+
> |  0|  3|0| 1|
> |  0|  4|0| 1|
> |  0| 13|1| 0|
> |  0| 27|0| 1|
> |  0| 31|0| 1|
> +---+---+-+--+
> only showing top 5 rows
> {code}
> The first child is hash-partitioned by {{id}} and {{{}day{}}}, while the 
> second child is hash-partitioned by {{day}} and {{id}} (required by the 
> window function). Therefore, rows end up in different partitions.
> This has been fixed in Spark 3.3 by 
> [#32875|https://github.com/apache/spark/pull/32875/files#diff-e938569a4ca4eba8f7e10fe473d4f9c306ea253df151405bcaba880a601f075fR75-R76]:
> {code}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, 
> day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L)#63, [id#64L, day#65L, 
> lefts#66, rights#67]
>:- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, 
> [plan_id=117]
>: +- ...
>+- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(id#29L, day#30L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=118]
>  +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L]
> +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS day_sum#54L], [day#30L, id#29L]
>+- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], 
> false, 0
>   

[jira] [Assigned] (SPARK-42168) CoGroup with window function returns incorrect result when partition keys differ in order

2023-01-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42168:


Assignee: Apache Spark

> CoGroup with window function returns incorrect result when partition keys 
> differ in order
> -
>
> Key: SPARK-42168
> URL: https://issues.apache.org/jira/browse/SPARK-42168
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.3, 3.1.3, 3.2.3
>Reporter: Enrico Minack
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
>
> The following example returns an incorrect result:
> {code:java}
> import pandas as pd
> from pyspark.sql import SparkSession, Window
> from pyspark.sql.functions import col, lit, sum
> spark = SparkSession \
> .builder \
> .getOrCreate()
> ids = 1000
> days = 1000
> parts = 10
> id_df = spark.range(ids)
> day_df = spark.range(days).withColumnRenamed("id", "day")
> id_day_df = id_df.join(day_df)
> left_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), 
> lit("left").alias("side")).repartition(parts).cache()
> right_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), 
> lit("right").alias("side")).repartition(parts).cache()  
> #.withColumnRenamed("id", "id2")
> # note the column order is different to the groupBy("id", "day") column order 
> below
> window = Window.partitionBy("day", "id")
> left_grouped_df = left_df.groupBy("id", "day")
> right_grouped_df = right_df.withColumn("day_sum", 
> sum(col("day")).over(window)).groupBy("id", "day")
> def cogroup(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
> return pd.DataFrame([{
> "id": left["id"][0] if not left.empty else (right["id"][0] if not 
> right.empty else None),
> "day": left["day"][0] if not left.empty else (right["day"][0] if not 
> right.empty else None),
> "lefts": len(left.index),
> "rights": len(right.index)
> }])
> df = left_grouped_df.cogroup(right_grouped_df) \
>  .applyInPandas(cogroup, schema="id long, day long, lefts integer, 
> rights integer")
> df.explain()
> df.show(5)
> {code}
> Output is
> {code}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, 
> day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L), [id#64L, day#65L, 
> lefts#66, rights#67]
>:- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, 
> [plan_id=117]
>: +- ...
>+- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0
>   +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L]
>  +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS day_sum#54L], [day#30L, id#29L]
> +- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], false, > 0
>+- Exchange hashpartitioning(day#30L, id#29L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=112]
>   +- ...
> +---+---+-+--+
> | id|day|lefts|rights|
> +---+---+-+--+
> |  0|  3|0| 1|
> |  0|  4|0| 1|
> |  0| 13|1| 0|
> |  0| 27|0| 1|
> |  0| 31|0| 1|
> +---+---+-+--+
> only showing top 5 rows
> {code}
> The first child is hash-partitioned by {{id}} and {{{}day{}}}, while the 
> second child is hash-partitioned by {{day}} and {{id}} (required by the 
> window function). Therefore, rows end up in different partitions.
> This has been fixed in Spark 3.3 by 
> [#32875|https://github.com/apache/spark/pull/32875/files#diff-e938569a4ca4eba8f7e10fe473d4f9c306ea253df151405bcaba880a601f075fR75-R76]:
> {code}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, 
> day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L)#63, [id#64L, day#65L, 
> lefts#66, rights#67]
>:- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, 
> [plan_id=117]
>: +- ...
>+- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(id#29L, day#30L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=118]
>  +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L]
> +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS day_sum#54L], [day#30L, id#29L]
>+- Sort [day#30L ASC NULLS FIRST, id#29L ASC 

[jira] [Commented] (SPARK-42168) CoGroup with window function returns incorrect result when partition keys differ in order

2023-01-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680229#comment-17680229
 ] 

Apache Spark commented on SPARK-42168:
--

User 'EnricoMi' has created a pull request for this issue:
https://github.com/apache/spark/pull/39717

> CoGroup with window function returns incorrect result when partition keys 
> differ in order
> -
>
> Key: SPARK-42168
> URL: https://issues.apache.org/jira/browse/SPARK-42168
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.3, 3.1.3, 3.2.3
>Reporter: Enrico Minack
>Priority: Major
>  Labels: correctness
>
> The following example returns an incorrect result:
> {code:java}
> import pandas as pd
> from pyspark.sql import SparkSession, Window
> from pyspark.sql.functions import col, lit, sum
> spark = SparkSession \
> .builder \
> .getOrCreate()
> ids = 1000
> days = 1000
> parts = 10
> id_df = spark.range(ids)
> day_df = spark.range(days).withColumnRenamed("id", "day")
> id_day_df = id_df.join(day_df)
> left_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), 
> lit("left").alias("side")).repartition(parts).cache()
> right_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), 
> lit("right").alias("side")).repartition(parts).cache()  
> #.withColumnRenamed("id", "id2")
> # note the column order is different to the groupBy("id", "day") column order 
> below
> window = Window.partitionBy("day", "id")
> left_grouped_df = left_df.groupBy("id", "day")
> right_grouped_df = right_df.withColumn("day_sum", 
> sum(col("day")).over(window)).groupBy("id", "day")
> def cogroup(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
> return pd.DataFrame([{
> "id": left["id"][0] if not left.empty else (right["id"][0] if not 
> right.empty else None),
> "day": left["day"][0] if not left.empty else (right["day"][0] if not 
> right.empty else None),
> "lefts": len(left.index),
> "rights": len(right.index)
> }])
> df = left_grouped_df.cogroup(right_grouped_df) \
>  .applyInPandas(cogroup, schema="id long, day long, lefts integer, 
> rights integer")
> df.explain()
> df.show(5)
> {code}
> Output is
> {code}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, 
> day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L), [id#64L, day#65L, 
> lefts#66, rights#67]
>:- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, 
> [plan_id=117]
>: +- ...
>+- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0
>   +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L]
>  +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS day_sum#54L], [day#30L, id#29L]
> +- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], false, > 0
>+- Exchange hashpartitioning(day#30L, id#29L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=112]
>   +- ...
> +---+---+-+--+
> | id|day|lefts|rights|
> +---+---+-+--+
> |  0|  3|0| 1|
> |  0|  4|0| 1|
> |  0| 13|1| 0|
> |  0| 27|0| 1|
> |  0| 31|0| 1|
> +---+---+-+--+
> only showing top 5 rows
> {code}
> The first child is hash-partitioned by {{id}} and {{{}day{}}}, while the 
> second child is hash-partitioned by {{day}} and {{id}} (required by the 
> window function). Therefore, rows end up in different partitions.
> This has been fixed in Spark 3.3 by 
> [#32875|https://github.com/apache/spark/pull/32875/files#diff-e938569a4ca4eba8f7e10fe473d4f9c306ea253df151405bcaba880a601f075fR75-R76]:
> {code}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, 
> day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L)#63, [id#64L, day#65L, 
> lefts#66, rights#67]
>:- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, 
> [plan_id=117]
>: +- ...
>+- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(id#29L, day#30L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=118]
>  +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L]
> +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS day_sum#54L], [day#30L, 

[jira] [Assigned] (SPARK-42044) Fix wrong error message for `MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY`

2023-01-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42044:


Assignee: Haejoon Lee

> Fix wrong error message for `MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY`
> ---
>
> Key: SPARK-42044
> URL: https://issues.apache.org/jira/browse/SPARK-42044
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> The current error message
> "Correlated scalar subqueries in the GROUP BY clause must also be in the 
> aggregate expressions"
> is incorrect since it's not related with group by clause.
> We should fix it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42044) Fix wrong error message for `MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY`

2023-01-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42044.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39543
[https://github.com/apache/spark/pull/39543]

> Fix wrong error message for `MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY`
> ---
>
> Key: SPARK-42044
> URL: https://issues.apache.org/jira/browse/SPARK-42044
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> The current error message
> "Correlated scalar subqueries in the GROUP BY clause must also be in the 
> aggregate expressions"
> is incorrect since it's not related with group by clause.
> We should fix it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42168) CoGroup with window function returns incorrect result when partition keys differ in order

2023-01-24 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-42168:
--
Description: 
The following example returns an incorrect result:
{code:java}
import pandas as pd

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, lit, sum

spark = SparkSession \
.builder \
.getOrCreate()

ids = 1000
days = 1000
parts = 10

id_df = spark.range(ids)
day_df = spark.range(days).withColumnRenamed("id", "day")
id_day_df = id_df.join(day_df)
left_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), 
lit("left").alias("side")).repartition(parts).cache()
right_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), 
lit("right").alias("side")).repartition(parts).cache()  
#.withColumnRenamed("id", "id2")

# note the column order is different to the groupBy("id", "day") column order 
below
window = Window.partitionBy("day", "id")

left_grouped_df = left_df.groupBy("id", "day")
right_grouped_df = right_df.withColumn("day_sum", 
sum(col("day")).over(window)).groupBy("id", "day")

def cogroup(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
return pd.DataFrame([{
"id": left["id"][0] if not left.empty else (right["id"][0] if not 
right.empty else None),
"day": left["day"][0] if not left.empty else (right["day"][0] if not 
right.empty else None),
"lefts": len(left.index),
"rights": len(right.index)
}])

df = left_grouped_df.cogroup(right_grouped_df) \
 .applyInPandas(cogroup, schema="id long, day long, lefts integer, 
rights integer")

df.explain()
df.show(5)
{code}
Output is
{code}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, 
day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L), [id#64L, day#65L, 
lefts#66, rights#67]
   :- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, 
[plan_id=117]
   : +- ...
   +- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0
  +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L]
 +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
AS day_sum#54L], [day#30L, id#29L]
+- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(day#30L, id#29L, 200), 
ENSURE_REQUIREMENTS, [plan_id=112]
  +- ...


+---+---+-+--+
| id|day|lefts|rights|
+---+---+-+--+
|  0|  3|0| 1|
|  0|  4|0| 1|
|  0| 13|1| 0|
|  0| 27|0| 1|
|  0| 31|0| 1|
+---+---+-+--+
only showing top 5 rows
{code}
The first child is hash-partitioned by {{id}} and {{{}day{}}}, while the second 
child is hash-partitioned by {{day}} and {{id}} (required by the window 
function). Therefore, rows end up in different partitions.

This has been fixed in Spark 3.3 by 
[#32875|https://github.com/apache/spark/pull/32875/files#diff-e938569a4ca4eba8f7e10fe473d4f9c306ea253df151405bcaba880a601f075fR75-R76]:
{code}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, 
day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L)#63, [id#64L, day#65L, 
lefts#66, rights#67]
   :- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, 
[plan_id=117]
   : +- ...
   +- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(id#29L, day#30L, 200), ENSURE_REQUIREMENTS, 
[plan_id=118]
 +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L]
+- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
AS day_sum#54L], [day#30L, id#29L]
   +- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], 
false, 0
  +- Exchange hashpartitioning(day#30L, id#29L, 200), 
ENSURE_REQUIREMENTS, [plan_id=112]
 +- ...

+---+---+-+--+
| id|day|lefts|rights|
+---+---+-+--+
|  0| 13|1| 1|
|  0| 63|1| 1|
|  0| 89|1| 1|
|  0| 95|1| 1|
|  0| 96|1| 1|
+---+---+-+--+
only showing top 5 rows
{code}

Only PySpark is to be affected ({{FlatMapCoGroupsInPandas }}), as Scala API 
uses {{CoGroup}}. {{FlatMapCoGroupsInPandas}} reports required child 
distribution {{ClusteredDistribution}}, while {{CoGroup}} reports 
{{HashClusteredDistribution}}. The {{EnsureRequirements}} rule correctly 
recognizes a {{HashClusteredDistribution(id, day)}} as not compatible with 

[jira] [Created] (SPARK-42169) Implement code generation for `to_csv` function (StructsToCsv)

2023-01-24 Thread Narek Karapetian (Jira)
Narek Karapetian created SPARK-42169:


 Summary: Implement code generation for `to_csv` function 
(StructsToCsv)
 Key: SPARK-42169
 URL: https://issues.apache.org/jira/browse/SPARK-42169
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Narek Karapetian
 Fix For: 3.4.0


Implement code generation for `to_csv` function instead of extending it from 
CodegenFallback trait.
{code:java}
org.apache.spark.sql.catalyst.expressions.StructsToCsv.doGenCode(...){code}
 

This is good to have from performance point of view.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42168) CoGroup with window function returns incorrect result when partition keys differ in order

2023-01-24 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-42168:
--
Description: 
The following example returns an incorrect result:
{code:java}
import pandas as pd

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, lit, sum

spark = SparkSession \
.builder \
.getOrCreate()

ids = 1000
days = 1000
parts = 10

id_df = spark.range(ids)
day_df = spark.range(days).withColumnRenamed("id", "day")
id_day_df = id_df.join(day_df)
left_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), 
lit("left").alias("side")).repartition(parts).cache()
right_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), 
lit("right").alias("side")).repartition(parts).cache()  
#.withColumnRenamed("id", "id2")

# note the column order is different to the groupBy("id", "day") column order 
below
window = Window.partitionBy("day", "id")

left_grouped_df = left_df.groupBy("id", "day")
right_grouped_df = right_df.withColumn("day_sum", 
sum(col("day")).over(window)).groupBy("id", "day")

def cogroup(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
return pd.DataFrame([{
"id": left["id"][0] if not left.empty else (right["id"][0] if not 
right.empty else None),
"day": left["day"][0] if not left.empty else (right["day"][0] if not 
right.empty else None),
"lefts": len(left.index),
"rights": len(right.index)
}])

df = left_grouped_df.cogroup(right_grouped_df) \
 .applyInPandas(cogroup, schema="id long, day long, lefts integer, 
rights integer")

df.explain()
df.show(5)
{code}
Output is
{code}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, 
day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L), [id#64L, day#65L, 
lefts#66, rights#67]
   :- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, 
[plan_id=117]
   : +- ...
   +- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0
  +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L]
 +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
AS day_sum#54L], [day#30L, id#29L]
+- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(day#30L, id#29L, 200), 
ENSURE_REQUIREMENTS, [plan_id=112]
  +- ...


+---+---+-+--+
| id|day|lefts|rights|
+---+---+-+--+
|  0|  3|0| 1|
|  0|  4|0| 1|
|  0| 13|1| 0|
|  0| 27|0| 1|
|  0| 31|0| 1|
+---+---+-+--+
only showing top 5 rows
{code}
The first child is hash-partitioned by {{id}} and {{{}day{}}}, while the second 
child is hash-partitioned by {{day}} and {{id}} (required by the window 
function). Therefore, rows end up in different partitions.

This has been fixed in Spark 3.3 by 
[#32875|https://github.com/apache/spark/pull/32875/files#diff-e938569a4ca4eba8f7e10fe473d4f9c306ea253df151405bcaba880a601f075fR75-R76]:
{code}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, 
day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L)#63, [id#64L, day#65L, 
lefts#66, rights#67]
   :- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, 
[plan_id=117]
   : +- ...
   +- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(id#29L, day#30L, 200), ENSURE_REQUIREMENTS, 
[plan_id=118]
 +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L]
+- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
AS day_sum#54L], [day#30L, id#29L]
   +- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], 
false, 0
  +- Exchange hashpartitioning(day#30L, id#29L, 200), 
ENSURE_REQUIREMENTS, [plan_id=112]
 +- ...

+---+---+-+--+
| id|day|lefts|rights|
+---+---+-+--+
|  0| 13|1| 1|
|  0| 63|1| 1|
|  0| 89|1| 1|
|  0| 95|1| 1|
|  0| 96|1| 1|
+---+---+-+--+
only showing top 5 rows
{code}
Only PySpark seems to be affected.

  was:
The following example returns an incorrect result:
{code:java}
import pandas as pd

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, lit, sum

spark = SparkSession \
.builder \
.getOrCreate()

ids = 1000
days = 1000
parts = 10

id_df = spark.range(ids)
day_df = 

[jira] [Created] (SPARK-42168) CoGroup with window function returns incorrect result when partition keys differ in order

2023-01-24 Thread Enrico Minack (Jira)
Enrico Minack created SPARK-42168:
-

 Summary: CoGroup with window function returns incorrect result 
when partition keys differ in order
 Key: SPARK-42168
 URL: https://issues.apache.org/jira/browse/SPARK-42168
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.2.3, 3.1.3, 3.0.3
Reporter: Enrico Minack


The following example returns an incorrect result:
{code:java}
import pandas as pd

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, lit, sum

spark = SparkSession \
.builder \
.getOrCreate()

ids = 1000
days = 1000
parts = 10

id_df = spark.range(ids)
day_df = spark.range(days).withColumnRenamed("id", "day")
id_day_df = id_df.join(day_df)
left_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), 
lit("left").alias("side")).repartition(parts).cache()
right_df = id_day_df.select(col("id").alias("id"), col("day").alias("day"), 
lit("right").alias("side")).repartition(parts).cache()  
#.withColumnRenamed("id", "id2")

# note the column order is different to the groupBy("id", "day") column order 
below
window = Window.partitionBy("day", "id")

left_grouped_df = left_df.groupBy("id", "day")
right_grouped_df = right_df.withColumn("day_sum", 
sum(col("day")).over(window)).groupBy("id", "day")

def cogroup(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
return pd.DataFrame([{
"id": left["id"][0] if not left.empty else (right["id"][0] if not 
right.empty else None),
"day": left["day"][0] if not left.empty else (right["day"][0] if not 
right.empty else None),
"lefts": len(left.index),
"rights": len(right.index)
}])

df = left_grouped_df.cogroup(right_grouped_df) \
 .applyInPandas(cogroup, schema="id long, day long, lefts integer, 
rights integer")

df.explain()
df.show(5)
{code}
Output is
{code}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- FlatMapCoGroupsInPandas [id#8L, day#9L], [id#29L, day#30L], cogroup(id#8L, 
day#9L, side#10, id#29L, day#30L, side#31, day_sum#54L), [id#64L, day#65L, 
lefts#66, rights#67]
   :- Sort [id#8L ASC NULLS FIRST, day#9L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#8L, day#9L, 200), ENSURE_REQUIREMENTS, 
[plan_id=117]
   : +- InMemoryTableScan [id#8L, day#9L, id#8L, day#9L, side#10]
   :   +- InMemoryRelation [id#8L, day#9L, side#10], StorageLevel(disk, 
memory, deserialized, 1 replicas)
   : +- Exchange RoundRobinPartitioning(10), 
REPARTITION_BY_NUM, [plan_id=33]
   :+- *(2) Project [id#0L, day#4L, left AS side#10]
   :   +- *(2) BroadcastNestedLoopJoin BuildRight, Inner
   :  :- *(2) Range (0, 1000, step=1, splits=16)
   :  +- BroadcastExchange IdentityBroadcastMode, 
[plan_id=28]
   : +- *(1) Project [id#2L AS day#4L]
   :+- *(1) Range (0, 1000, step=1, splits=16)
   +- Sort [id#29L ASC NULLS FIRST, day#30L ASC NULLS FIRST], false, 0
  +- Project [id#29L, day#30L, id#29L, day#30L, side#31, day_sum#54L]
 +- Window [sum(day#30L) windowspecdefinition(day#30L, id#29L, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
AS day_sum#54L], [day#30L, id#29L]
+- Sort [day#30L ASC NULLS FIRST, id#29L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(day#30L, id#29L, 200), 
ENSURE_REQUIREMENTS, [plan_id=112]
  +- InMemoryTableScan [id#29L, day#30L, side#31]
+- InMemoryRelation [id#29L, day#30L, side#31], 
StorageLevel(disk, memory, deserialized, 1 replicas)
  +- Exchange RoundRobinPartitioning(10), 
REPARTITION_BY_NUM, [plan_id=79]
 +- *(2) Project [id#0L, day#4L, right AS 
side#31]
+- *(2) BroadcastNestedLoopJoin BuildRight, 
Inner
   :- *(2) Range (0, 1000, step=1, 
splits=16)
   +- BroadcastExchange 
IdentityBroadcastMode, [plan_id=74]
  +- *(1) Project [id#2L AS day#4L]
 +- *(1) Range (0, 1000, step=1, 
splits=16)


+---+---+-+--+
| id|day|lefts|rights|
+---+---+-+--+
|  0|  3|0| 1|
|  0|  4|0| 1|
|  0| 13|1| 0|
|  0| 27|0| 1|
|  0| 31|0| 1|
+---+---+-+--+
only showing top 5 rows
{code}
The first child is hash-partitioned by {{id}} and {{{}day{}}}, while the second 
child is hash-partitioned by {{day}} and {{id}} (required by the window 
function). Therefore, rows end up in different partitions.

This has been fixed in Spark 3.3 by 

[jira] [Updated] (SPARK-42166) Make `docker-image-tool.sh` usage message up-to-date

2023-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42166:
--
Summary: Make `docker-image-tool.sh` usage message up-to-date  (was: Make 
docker-image-tool.sh help up-to-date)

> Make `docker-image-tool.sh` usage message up-to-date
> 
>
> Key: SPARK-42166
> URL: https://issues.apache.org/jira/browse/SPARK-42166
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42166) Make docker-image-tool.sh help up-to-date

2023-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42166.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39714
[https://github.com/apache/spark/pull/39714]

> Make docker-image-tool.sh help up-to-date
> -
>
> Key: SPARK-42166
> URL: https://issues.apache.org/jira/browse/SPARK-42166
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42166) Make docker-image-tool.sh help up-to-date

2023-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42166:
-

Assignee: Dongjoon Hyun

> Make docker-image-tool.sh help up-to-date
> -
>
> Key: SPARK-42166
> URL: https://issues.apache.org/jira/browse/SPARK-42166
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42167) Improve GitHub Action `lint` job to stop on failures earlier

2023-01-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42167:


Assignee: (was: Apache Spark)

> Improve GitHub Action `lint` job to stop on failures earlier
> 
>
> Key: SPARK-42167
> URL: https://issues.apache.org/jira/browse/SPARK-42167
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42167) Improve GitHub Action `lint` job to stop on failures earlier

2023-01-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42167:


Assignee: Apache Spark

> Improve GitHub Action `lint` job to stop on failures earlier
> 
>
> Key: SPARK-42167
> URL: https://issues.apache.org/jira/browse/SPARK-42167
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42167) Improve GitHub Action `lint` job to stop on failures earlier

2023-01-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680165#comment-17680165
 ] 

Apache Spark commented on SPARK-42167:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39716

> Improve GitHub Action `lint` job to stop on failures earlier
> 
>
> Key: SPARK-42167
> URL: https://issues.apache.org/jira/browse/SPARK-42167
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42167) Improve GitHub Action `lint` job to stop on failures earlier

2023-01-24 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-42167:
-

 Summary: Improve GitHub Action `lint` job to stop on failures 
earlier
 Key: SPARK-42167
 URL: https://issues.apache.org/jira/browse/SPARK-42167
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41775) Implement training functions as input

2023-01-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680154#comment-17680154
 ] 

Apache Spark commented on SPARK-41775:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39715

> Implement training functions as input
> -
>
> Key: SPARK-41775
> URL: https://issues.apache.org/jira/browse/SPARK-41775
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Assignee: Rithwik Ediga Lakhamsani
>Priority: Major
> Fix For: 3.4.0
>
>
> Sidenote: make formatting updates described in 
> https://github.com/apache/spark/pull/39188
>  
> Currently, `Distributor().run(...)` takes only files as input. Now we will 
> add in additional functionality to take in functions as well. This will 
> require us to go through the following process on each task in the executor 
> nodes:
> 1. take the input function and args and pickle them
> 2. Create a temp train.py file that looks like
> {code:java}
> import cloudpickle
> import os
> if _name_ == "_main_":
>     train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")
>     output = train(*args)
>     if output and os.environ.get("RANK", "") == "0": # this is for 
> partitionId == 0
>         cloudpickle.dump(f"{tempdir}/train_output.pkl") {code}
> 3. Run that train.py file with `torchrun`
> 4. Check if `train_output.pkl` has been created on process on partitionId == 
> 0, if it has, then deserialize it and return that output through `.collect()`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42166) Make docker-image-tool.sh help up-to-date

2023-01-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680145#comment-17680145
 ] 

Apache Spark commented on SPARK-42166:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39714

> Make docker-image-tool.sh help up-to-date
> -
>
> Key: SPARK-42166
> URL: https://issues.apache.org/jira/browse/SPARK-42166
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42166) Make docker-image-tool.sh help up-to-date

2023-01-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42166:


Assignee: (was: Apache Spark)

> Make docker-image-tool.sh help up-to-date
> -
>
> Key: SPARK-42166
> URL: https://issues.apache.org/jira/browse/SPARK-42166
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42166) Make docker-image-tool.sh help up-to-date

2023-01-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680144#comment-17680144
 ] 

Apache Spark commented on SPARK-42166:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39714

> Make docker-image-tool.sh help up-to-date
> -
>
> Key: SPARK-42166
> URL: https://issues.apache.org/jira/browse/SPARK-42166
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42166) Make docker-image-tool.sh help up-to-date

2023-01-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42166:


Assignee: Apache Spark

> Make docker-image-tool.sh help up-to-date
> -
>
> Key: SPARK-42166
> URL: https://issues.apache.org/jira/browse/SPARK-42166
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42166) Make docker-image-tool.sh help up-to-date

2023-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42166:
--
Summary: Make docker-image-tool.sh help up-to-date  (was: Fix 
docker-image-tool.sh help message)

> Make docker-image-tool.sh help up-to-date
> -
>
> Key: SPARK-42166
> URL: https://issues.apache.org/jira/browse/SPARK-42166
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42166) Fix docker-image-tool.sh help message

2023-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42166:
--
Issue Type: Improvement  (was: Bug)

> Fix docker-image-tool.sh help message
> -
>
> Key: SPARK-42166
> URL: https://issues.apache.org/jira/browse/SPARK-42166
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42166) Fix docker-image-tool.sh help message

2023-01-24 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-42166:
-

 Summary: Fix docker-image-tool.sh help message
 Key: SPARK-42166
 URL: https://issues.apache.org/jira/browse/SPARK-42166
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42165) Integrate all errors that causes re-create of a view failing into one error class

2023-01-24 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-42165:
---

 Summary: Integrate all errors that causes re-create of a view 
failing into one error class
 Key: SPARK-42165
 URL: https://issues.apache.org/jira/browse/SPARK-42165
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Haejoon Lee


Should integrate errors that causes re-create of view failing into single error 
class, and make the subclasses for distinct cases.

e.g.
 * CANNOT UPCAST
 * TABLE NOT FOUND/UNRESOLVED
 * UNRESOLVED FUNCTION
 * DATATYPE MISMATCH



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



<    1   2