[jira] [Commented] (SPARK-33230) FileOutputWriter jobs have duplicate JobIDs if launched in same second
[ https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221963#comment-17221963 ]

Dongjoon Hyun commented on SPARK-33230:
---------------------------------------

Thank you for sharing the status, [~ste...@apache.org]!

> FileOutputWriter jobs have duplicate JobIDs if launched in same second
>
> Key: SPARK-33230
> URL: https://issues.apache.org/jira/browse/SPARK-33230
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.7, 3.0.1
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
> The Hadoop S3A staging committer has problems with more than one Spark SQL
> query being launched simultaneously, as it uses the job ID for its path in
> the cluster FS to pass the commit information from tasks to the job
> committer.
> If two queries are launched in the same second, they conflict: the output of
> job 1 includes all of job 2's files written so far, and job 2 will fail with
> an FNFE (FileNotFoundException).
> Proposed: set {{"spark.sql.sources.writeJobUUID"}} in the job conf to the
> value of {{WriteJobDescription.uuid}}.
> That is the property name which used to serve this purpose; any committers
> already written against this property will pick it up without needing any
> changes.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
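The collision described above can be modeled outside Spark. This is a minimal, hypothetical Python sketch (the function names are illustrative, not Spark APIs): Hadoop-style job IDs embed a timestamp with second granularity, so two jobs launched in the same second collide, while a per-job UUID (the role of {{spark.sql.sources.writeJobUUID}}) keeps staging paths distinct.

```python
import uuid
from datetime import datetime, timezone

def job_id_from_timestamp(ts: datetime) -> str:
    # Second-granularity job ID: two jobs launched in the same
    # second get the same ID, hence the same staging path.
    return "job_" + ts.strftime("%Y%m%d%H%M%S")

def job_id_with_uuid(ts: datetime) -> str:
    # Appending a per-job UUID disambiguates same-second launches.
    return job_id_from_timestamp(ts) + "_" + uuid.uuid4().hex

t = datetime(2020, 10, 26, 12, 0, 0, tzinfo=timezone.utc)
assert job_id_from_timestamp(t) == job_id_from_timestamp(t)  # collision
assert job_id_with_uuid(t) != job_id_with_uuid(t)            # unique paths
```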
[jira] [Resolved] (SPARK-33183) Bug in optimizer rule EliminateSorts
[ https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-33183.
---------------------------------
Fix Version/s: 3.1.0
Resolution: Fixed

Issue resolved by pull request 30093
[https://github.com/apache/spark/pull/30093]

> Bug in optimizer rule EliminateSorts
>
> Key: SPARK-33183
> URL: https://issues.apache.org/jira/browse/SPARK-33183
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.8, 3.0.2, 3.1.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
> Fix For: 3.1.0
>
> Currently, the rule {{EliminateSorts}} removes a global sort node if its
> child plan already satisfies the required sort order, without checking
> whether the child plan's ordering is local or global. For example, in the
> following scenario, the first sort should not be removed, because it gives a
> stronger guarantee than the second sort even though the sort orders are the
> same for both sorts.
> {code:java}
> Sort(orders, global = True, ...)
> Sort(orders, global = False, ...)
> {code}
[jira] [Assigned] (SPARK-33183) Bug in optimizer rule EliminateSorts
[ https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-33183:
-----------------------------------
Assignee: Allison Wang

> Bug in optimizer rule EliminateSorts
>
> Key: SPARK-33183
> URL: https://issues.apache.org/jira/browse/SPARK-33183
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.8, 3.0.2, 3.1.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
>
> Currently, the rule {{EliminateSorts}} removes a global sort node if its
> child plan already satisfies the required sort order, without checking
> whether the child plan's ordering is local or global. For example, in the
> following scenario, the first sort should not be removed, because it gives a
> stronger guarantee than the second sort even though the sort orders are the
> same for both sorts.
> {code:java}
> Sort(orders, global = True, ...)
> Sort(orders, global = False, ...)
> {code}
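The corrected check can be sketched as a toy rule. This is an assumption-laden Python model of the idea (not Spark's Catalyst code): an outer sort may be removed only when the child sorts by the same keys and gives at least as strong a guarantee, so a global sort is never satisfied by a merely per-partition one.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sort:
    orders: tuple                      # sort keys
    is_global: bool                    # global vs per-partition ordering
    child: Optional["Sort"] = None

def eliminate_redundant_sort(node: Sort) -> Sort:
    # Remove the outer sort only if the child's ordering matches AND
    # is at least as strong: a global requirement needs a global child.
    child = node.child
    if (child is not None and child.orders == node.orders
            and (not node.is_global or child.is_global)):
        return child
    return node

local = Sort(("a",), is_global=False)
outer_global = Sort(("a",), is_global=True, child=local)
# The buggy rule dropped outer_global; the fixed check keeps it.
assert eliminate_redundant_sort(outer_global) is outer_global

inner_global = Sort(("a",), is_global=True)
outer_local = Sort(("a",), is_global=False, child=inner_global)
# A local sort over an identical global sort is genuinely redundant.
assert eliminate_redundant_sort(outer_local) is inner_global
```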
[jira] [Assigned] (SPARK-33174) Migrate DROP TABLE to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-33174:
-----------------------------------
Assignee: Terry Kim

> Migrate DROP TABLE to new resolution framework
>
> Key: SPARK-33174
> URL: https://issues.apache.org/jira/browse/SPARK-33174
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Terry Kim
> Assignee: Terry Kim
> Priority: Minor
>
> Migrate "DROP TABLE t" to the new resolution framework so that it has a
> consistent resolution rule for v1/v2 commands.
> Currently, the following resolves to a table instead of a temp view:
> {code:java}
> sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
> sql("CREATE TEMPORARY VIEW t AS SELECT 2 as id")
> sql("USE testcat.ns")
> sql("DROP TABLE t") // 't' is resolved to testcat.ns.t
> {code}
[jira] [Resolved] (SPARK-33174) Migrate DROP TABLE to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-33174.
---------------------------------
Fix Version/s: 3.1.0
Resolution: Fixed

Issue resolved by pull request 30079
[https://github.com/apache/spark/pull/30079]

> Migrate DROP TABLE to new resolution framework
>
> Key: SPARK-33174
> URL: https://issues.apache.org/jira/browse/SPARK-33174
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Terry Kim
> Assignee: Terry Kim
> Priority: Minor
> Fix For: 3.1.0
>
> Migrate "DROP TABLE t" to the new resolution framework so that it has a
> consistent resolution rule for v1/v2 commands.
> Currently, the following resolves to a table instead of a temp view:
> {code:java}
> sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
> sql("CREATE TEMPORARY VIEW t AS SELECT 2 as id")
> sql("USE testcat.ns")
> sql("DROP TABLE t") // 't' is resolved to testcat.ns.t
> {code}
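The intended precedence can be illustrated with a small resolver model. This is a hypothetical Python sketch (not Spark's analyzer): a consistent rule checks temporary views before falling through to the current catalog and namespace, so a bare {{t}} hits the temp view rather than {{testcat.ns.t}}.

```python
def resolve_table(name, temp_views, catalog_tables, current_namespace):
    # Single-part names resolve against temporary views first;
    # only then do they fall through to the current catalog/namespace.
    if "." not in name and name in temp_views:
        return ("temp_view", name)
    qualified = name if "." in name else current_namespace + "." + name
    if qualified in catalog_tables:
        return ("table", qualified)
    raise LookupError(name)

temp_views = {"t"}
catalog = {"testcat.ns.t"}
# With the migrated rule, DROP TABLE t sees the temp view first.
assert resolve_table("t", temp_views, catalog, "testcat.ns") == ("temp_view", "t")
# Fully qualified names still reach the catalog table.
assert resolve_table("testcat.ns.t", temp_views, catalog,
                     "testcat.ns") == ("table", "testcat.ns.t")
```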
[jira] [Commented] (SPARK-33240) Fail fast when fails to instantiate configured v2 session catalog
[ https://issues.apache.org/jira/browse/SPARK-33240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221955#comment-17221955 ]

Apache Spark commented on SPARK-33240:
--------------------------------------

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/30167

> Fail fast when fails to instantiate configured v2 session catalog
>
> Key: SPARK-33240
> URL: https://issues.apache.org/jira/browse/SPARK-33240
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.0.1, 3.1.0
> Reporter: Jungtaek Lim
> Assignee: Jungtaek Lim
> Priority: Major
> Fix For: 3.1.0
>
> Currently Spark falls back to the "default catalog" when it fails to
> instantiate the configured v2 session catalog.
> The error log message says nothing about why the instantiation failed, and
> it pollutes the log file (it is logged every time the catalog is resolved).
> Beyond that, the fallback should be considered incorrect behavior: end users
> set the custom catalog intentionally, and Spark sometimes ignores it, which
> is against that intention.
> We should simply fail in this case so that end users notice the failure
> earlier and can fix the issue.
[jira] [Resolved] (SPARK-33240) Fail fast when fails to instantiate configured v2 session catalog
[ https://issues.apache.org/jira/browse/SPARK-33240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-33240.
---------------------------------
Fix Version/s: 3.1.0
Resolution: Fixed

Issue resolved by pull request 30147
[https://github.com/apache/spark/pull/30147]

> Fail fast when fails to instantiate configured v2 session catalog
>
> Key: SPARK-33240
> URL: https://issues.apache.org/jira/browse/SPARK-33240
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.0.1, 3.1.0
> Reporter: Jungtaek Lim
> Assignee: Jungtaek Lim
> Priority: Major
> Fix For: 3.1.0
>
> Currently Spark falls back to the "default catalog" when it fails to
> instantiate the configured v2 session catalog.
> The error log message says nothing about why the instantiation failed, and
> it pollutes the log file (it is logged every time the catalog is resolved).
> Beyond that, the fallback should be considered incorrect behavior: end users
> set the custom catalog intentionally, and Spark sometimes ignores it, which
> is against that intention.
> We should simply fail in this case so that end users notice the failure
> earlier and can fix the issue.
[jira] [Assigned] (SPARK-33240) Fail fast when fails to instantiate configured v2 session catalog
[ https://issues.apache.org/jira/browse/SPARK-33240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-33240:
-----------------------------------
Assignee: Jungtaek Lim

> Fail fast when fails to instantiate configured v2 session catalog
>
> Key: SPARK-33240
> URL: https://issues.apache.org/jira/browse/SPARK-33240
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.0.1, 3.1.0
> Reporter: Jungtaek Lim
> Assignee: Jungtaek Lim
> Priority: Major
>
> Currently Spark falls back to the "default catalog" when it fails to
> instantiate the configured v2 session catalog.
> The error log message says nothing about why the instantiation failed, and
> it pollutes the log file (it is logged every time the catalog is resolved).
> Beyond that, the fallback should be considered incorrect behavior: end users
> set the custom catalog intentionally, and Spark sometimes ignores it, which
> is against that intention.
> We should simply fail in this case so that end users notice the failure
> earlier and can fix the issue.
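The fail-fast behavior argued for above can be sketched generically. This is a hedged Python model, not Spark's catalog loader; the registry and class names are hypothetical. The key point is propagating the instantiation error instead of silently substituting a default.

```python
class DefaultCatalog:
    """Stand-in for the built-in session catalog."""

def load_session_catalog(configured_class: str, registry: dict):
    # Fail fast: if the configured catalog class cannot be found or
    # constructed, surface the original cause rather than falling
    # back to the default catalog and logging an opaque message.
    try:
        cls = registry[configured_class]
        return cls()
    except Exception as e:
        raise RuntimeError(
            f"Cannot instantiate configured session catalog "
            f"{configured_class!r}") from e

registry = {"default": DefaultCatalog}
assert isinstance(load_session_catalog("default", registry), DefaultCatalog)
try:
    load_session_catalog("com.example.MissingCatalog", registry)
    raised = False
except RuntimeError:
    raised = True
assert raised  # the misconfiguration is visible immediately
```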
[jira] [Updated] (SPARK-33266) Add total duration, read duration, and write duration as task level metrics
[ https://issues.apache.org/jira/browse/SPARK-33266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noritaka Sekiyama updated SPARK-33266:
--------------------------------------
Description:

Sometimes we need to identify performance bottlenecks, for example, how long it took to read from a data store, or how long it took to write into another data store.

It would be great if we could have total duration, read duration, and write duration as task-level metrics.

Currently, neither `InputMetrics` nor `OutputMetrics` has duration-related metrics.
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/InputMetrics.scala#L42-L58]
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/OutputMetrics.scala#L41-L56]

On the other hand, other metrics such as `ShuffleWriteMetrics` include write time. We might need similar metrics for input/output.
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ShuffleWriteMetrics.scala]

was: (the same description, with minor whitespace differences)

> Add total duration, read duration, and write duration as task level metrics
>
> Key: SPARK-33266
> URL: https://issues.apache.org/jira/browse/SPARK-33266
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.1
> Reporter: Noritaka Sekiyama
> Priority: Major
>
> Sometimes we need to identify performance bottlenecks, for example, how long
> it took to read from a data store, or how long it took to write into another
> data store.
> It would be great if we could have total duration, read duration, and write
> duration as task-level metrics.
> Currently, neither `InputMetrics` nor `OutputMetrics` has duration-related
> metrics.
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/InputMetrics.scala#L42-L58]
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/OutputMetrics.scala#L41-L56]
> On the other hand, other metrics such as `ShuffleWriteMetrics` include write
> time. We might need similar metrics for input/output.
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ShuffleWriteMetrics.scala]
[jira] [Created] (SPARK-33266) Add total duration, read duration, and write duration as task level metrics
Noritaka Sekiyama created SPARK-33266:
--------------------------------------

Summary: Add total duration, read duration, and write duration as task level metrics
Key: SPARK-33266
URL: https://issues.apache.org/jira/browse/SPARK-33266
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.0.1
Reporter: Noritaka Sekiyama

Sometimes we need to identify performance bottlenecks, for example, how long it took to read from a data store, or how long it took to write into another data store.

It would be great if we could have total duration, read duration, and write duration as task-level metrics.

Currently, neither `InputMetrics` nor `OutputMetrics` has duration-related metrics.
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/InputMetrics.scala#L42-L58]
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/OutputMetrics.scala#L41-L56]

On the other hand, other metrics such as `ShuffleWriteMetrics` include write time. We might need similar metrics for input/output.
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ShuffleWriteMetrics.scala]
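The kind of duration metric requested above amounts to wrapping each read or write with a monotonic clock and accumulating the elapsed time, analogous to how `ShuffleWriteMetrics` records write time. A minimal, hypothetical Python sketch (names are illustrative, not Spark APIs):

```python
import time

def timed(fn, *args):
    # Measure one read/write call with a monotonic clock; returning
    # the result plus elapsed nanoseconds lets a task accumulate
    # per-operation read/write durations as metrics.
    start = time.monotonic_ns()
    result = fn(*args)
    return result, time.monotonic_ns() - start

def fake_read():
    # Stand-in for reading a partition from a data store.
    time.sleep(0.01)
    return [1, 2, 3]

rows, read_ns = timed(fake_read)
assert rows == [1, 2, 3]
assert read_ns > 0  # elapsed time is observable as a metric
```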
[jira] [Resolved] (SPARK-33264) Add a dedicated page for SQL-on-file in SQL documents
[ https://issues.apache.org/jira/browse/SPARK-33264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro resolved SPARK-33264.
--------------------------------------
Fix Version/s: 3.1.0
               3.0.2
Assignee: Takeshi Yamamuro
Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/30165

> Add a dedicated page for SQL-on-file in SQL documents
>
> Key: SPARK-33264
> URL: https://issues.apache.org/jira/browse/SPARK-33264
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, SQL
> Affects Versions: 3.0.2, 3.1.0
> Reporter: Takeshi Yamamuro
> Assignee: Takeshi Yamamuro
> Priority: Major
> Fix For: 3.0.2, 3.1.0
>
> This ticket intends to add a dedicated page for SQL-on-file in SQL documents.
> This comes from the comment:
> [https://github.com/apache/spark/pull/30095/files#r508965149]
[jira] [Assigned] (SPARK-33265) Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-33265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33265:
------------------------------------
Assignee: Apache Spark (was: Kousuke Saruta)

> Rename classOf[Seq] to classOf[scala.collection.Seq] in
> PostgresIntegrationSuite for Scala 2.13
>
> Key: SPARK-33265
> URL: https://issues.apache.org/jira/browse/SPARK-33265
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 3.1.0
> Reporter: Kousuke Saruta
> Assignee: Apache Spark
> Priority: Minor
>
> In PostgresIntegrationSuite, evaluation of classOf[Seq].isAssignableFrom
> fails due to ClassCastException.
> The reason is the same as what was resolved in SPARK-29292, but this happens
> at test time, not compile time.
[jira] [Commented] (SPARK-33265) Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-33265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221898#comment-17221898 ]

Apache Spark commented on SPARK-33265:
--------------------------------------

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/30166

> Rename classOf[Seq] to classOf[scala.collection.Seq] in
> PostgresIntegrationSuite for Scala 2.13
>
> Key: SPARK-33265
> URL: https://issues.apache.org/jira/browse/SPARK-33265
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 3.1.0
> Reporter: Kousuke Saruta
> Assignee: Kousuke Saruta
> Priority: Minor
>
> In PostgresIntegrationSuite, evaluation of classOf[Seq].isAssignableFrom
> fails due to ClassCastException.
> The reason is the same as what was resolved in SPARK-29292, but this happens
> at test time, not compile time.
[jira] [Assigned] (SPARK-33265) Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-33265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33265:
------------------------------------
Assignee: Kousuke Saruta (was: Apache Spark)

> Rename classOf[Seq] to classOf[scala.collection.Seq] in
> PostgresIntegrationSuite for Scala 2.13
>
> Key: SPARK-33265
> URL: https://issues.apache.org/jira/browse/SPARK-33265
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 3.1.0
> Reporter: Kousuke Saruta
> Assignee: Kousuke Saruta
> Priority: Minor
>
> In PostgresIntegrationSuite, evaluation of classOf[Seq].isAssignableFrom
> fails due to ClassCastException.
> The reason is the same as what was resolved in SPARK-29292, but this happens
> at test time, not compile time.
[jira] [Updated] (SPARK-33265) Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-33265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kousuke Saruta updated SPARK-33265:
-----------------------------------
Priority: Minor (was: Major)

> Rename classOf[Seq] to classOf[scala.collection.Seq] in
> PostgresIntegrationSuite for Scala 2.13
>
> Key: SPARK-33265
> URL: https://issues.apache.org/jira/browse/SPARK-33265
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 3.1.0
> Reporter: Kousuke Saruta
> Assignee: Kousuke Saruta
> Priority: Minor
>
> In PostgresIntegrationSuite, evaluation of classOf[Seq].isAssignableFrom
> fails due to ClassCastException.
> The reason is the same as what was resolved in SPARK-29292, but this happens
> at test time, not compile time.
[jira] [Created] (SPARK-33265) Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13
Kousuke Saruta created SPARK-33265:
-----------------------------------

Summary: Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13
Key: SPARK-33265
URL: https://issues.apache.org/jira/browse/SPARK-33265
Project: Spark
Issue Type: Sub-task
Components: Tests
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta

In PostgresIntegrationSuite, evaluation of classOf[Seq].isAssignableFrom fails due to ClassCastException.
The reason is the same as what was resolved in SPARK-29292, but this happens at test time, not compile time.
[jira] [Assigned] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot be opened
[ https://issues.apache.org/jira/browse/SPARK-33204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kousuke Saruta reassigned SPARK-33204:
--------------------------------------
Assignee: akiyamaneko (was: Apache Spark)

> `Event Timeline` in Spark Job UI sometimes cannot be opened
>
> Key: SPARK-33204
> URL: https://issues.apache.org/jira/browse/SPARK-33204
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 3.0.1
> Reporter: akiyamaneko
> Assignee: akiyamaneko
> Priority: Minor
> Fix For: 3.1.0
> Attachments: reproduce.gif
>
> The Event Timeline area cannot be expanded when a Spark application has some
> failed jobs, as shown in the attachment.
[jira] [Assigned] (SPARK-33264) Add a dedicated page for SQL-on-file in SQL documents
[ https://issues.apache.org/jira/browse/SPARK-33264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33264:
------------------------------------
Assignee: (was: Apache Spark)

> Add a dedicated page for SQL-on-file in SQL documents
>
> Key: SPARK-33264
> URL: https://issues.apache.org/jira/browse/SPARK-33264
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, SQL
> Affects Versions: 3.0.2, 3.1.0
> Reporter: Takeshi Yamamuro
> Priority: Major
>
> This ticket intends to add a dedicated page for SQL-on-file in SQL documents.
> This comes from the comment:
> [https://github.com/apache/spark/pull/30095/files#r508965149]
[jira] [Commented] (SPARK-33264) Add a dedicated page for SQL-on-file in SQL documents
[ https://issues.apache.org/jira/browse/SPARK-33264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221880#comment-17221880 ]

Apache Spark commented on SPARK-33264:
--------------------------------------

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/30165

> Add a dedicated page for SQL-on-file in SQL documents
>
> Key: SPARK-33264
> URL: https://issues.apache.org/jira/browse/SPARK-33264
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, SQL
> Affects Versions: 3.0.2, 3.1.0
> Reporter: Takeshi Yamamuro
> Priority: Major
>
> This ticket intends to add a dedicated page for SQL-on-file in SQL documents.
> This comes from the comment:
> [https://github.com/apache/spark/pull/30095/files#r508965149]
[jira] [Assigned] (SPARK-33264) Add a dedicated page for SQL-on-file in SQL documents
[ https://issues.apache.org/jira/browse/SPARK-33264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33264:
------------------------------------
Assignee: Apache Spark

> Add a dedicated page for SQL-on-file in SQL documents
>
> Key: SPARK-33264
> URL: https://issues.apache.org/jira/browse/SPARK-33264
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, SQL
> Affects Versions: 3.0.2, 3.1.0
> Reporter: Takeshi Yamamuro
> Assignee: Apache Spark
> Priority: Major
>
> This ticket intends to add a dedicated page for SQL-on-file in SQL documents.
> This comes from the comment:
> [https://github.com/apache/spark/pull/30095/files#r508965149]
[jira] [Created] (SPARK-33264) Add a dedicated page for SQL-on-file in SQL documents
Takeshi Yamamuro created SPARK-33264:
-------------------------------------

Summary: Add a dedicated page for SQL-on-file in SQL documents
Key: SPARK-33264
URL: https://issues.apache.org/jira/browse/SPARK-33264
Project: Spark
Issue Type: Improvement
Components: Documentation, SQL
Affects Versions: 3.0.2, 3.1.0
Reporter: Takeshi Yamamuro

This ticket intends to add a dedicated page for SQL-on-file in SQL documents.
This comes from the comment:
[https://github.com/apache/spark/pull/30095/files#r508965149]
[jira] [Assigned] (SPARK-33258) Add asc_nulls_* and desc_nulls_* methods to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-33258:
------------------------------------
Assignee: Maciej Szymkiewicz

> Add asc_nulls_* and desc_nulls_* methods to SparkR
>
> Key: SPARK-33258
> URL: https://issues.apache.org/jira/browse/SPARK-33258
> Project: Spark
> Issue Type: Bug
> Components: R, SQL
> Affects Versions: 3.1.0
> Reporter: Maciej Szymkiewicz
> Assignee: Maciej Szymkiewicz
> Priority: Major
>
> At the moment SparkR provides only
> - {{asc}}
> - {{desc}}
> but the {{NULL}}-handling variants
> - {{asc_nulls_first}}
> - {{asc_nulls_last}}
> - {{desc_nulls_first}}
> - {{desc_nulls_last}}
> are missing.
[jira] [Resolved] (SPARK-33258) Add asc_nulls_* and desc_nulls_* methods to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-33258.
----------------------------------
Fix Version/s: 3.1.0
Resolution: Fixed

Issue resolved by pull request 30159
[https://github.com/apache/spark/pull/30159]

> Add asc_nulls_* and desc_nulls_* methods to SparkR
>
> Key: SPARK-33258
> URL: https://issues.apache.org/jira/browse/SPARK-33258
> Project: Spark
> Issue Type: Bug
> Components: R, SQL
> Affects Versions: 3.1.0
> Reporter: Maciej Szymkiewicz
> Assignee: Maciej Szymkiewicz
> Priority: Major
> Fix For: 3.1.0
>
> At the moment SparkR provides only
> - {{asc}}
> - {{desc}}
> but the {{NULL}}-handling variants
> - {{asc_nulls_first}}
> - {{asc_nulls_last}}
> - {{desc_nulls_first}}
> - {{desc_nulls_last}}
> are missing.
[jira] [Updated] (SPARK-33228) Don't uncache data when replacing an existing view having the same plan
[ https://issues.apache.org/jira/browse/SPARK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33228: - Fix Version/s: 2.4.8 > Don't uncache data when replacing an existing view having the same plan > --- > > Key: SPARK-33228 > URL: https://issues.apache.org/jira/browse/SPARK-33228 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > SPARK-30494 updated the `CreateViewCommand` code to implicitly drop the cache > when replacing an existing view. But this change drops the cache even when > replacing a view having the same logical plan. A sequence of queries to > reproduce this is as follows: > {code} > scala> val df = spark.range(1).selectExpr("id a", "id b") > scala> df.cache() > scala> df.explain() > == Physical Plan == > *(1) ColumnarToRow > +- InMemoryTableScan [a#2L, b#3L] > +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L] > +- *(1) Range (0, 1, step=1, splits=4) > scala> df.createOrReplaceTempView("t") > scala> sql("select * from t").explain() > == Physical Plan == > *(1) ColumnarToRow > +- InMemoryTableScan [a#2L, b#3L] > +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L] > +- *(1) Range (0, 1, step=1, splits=4) > // If one re-runs the same query `df.createOrReplaceTempView("t")`, the > cache is swept away > scala> df.createOrReplaceTempView("t") > scala> sql("select * from t").explain() > == Physical Plan == > *(1) Project [id#0L AS a#2L, id#0L AS b#3L] > +- *(1) Range (0, 1, step=1, splits=4) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
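The shape of the fix can be shown with a toy catalog. This Python sketch is hypothetical (it is not Spark's `CreateViewCommand`, and the class and method names are invented): the key change is that the cache is dropped only when the replacement plan actually differs from the existing one.

```python
class ViewCatalog:
    """Toy model of the fix: replacing a view with an identical logical
    plan must not invalidate data cached for that plan."""

    def __init__(self):
        self.views = {}      # view name -> logical plan (any comparable value)
        self.cached = set()  # plans whose results are materialized

    def create_or_replace(self, name, plan):
        old = self.views.get(name)
        # The pre-fix behavior uncached unconditionally on replacement;
        # the fix keeps the cache when old and new plans are equal.
        if old is not None and old != plan:
            self.cached.discard(old)
        self.views[name] = plan

    def cache(self, name):
        self.cached.add(self.views[name])
```

Re-running `create_or_replace("t", same_plan)` now leaves the cached entry intact, matching the repro transcript above where the second `createOrReplaceTempView("t")` should not have swept the cache away.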
[jira] [Updated] (SPARK-32090) UserDefinedType.equal() does not have symmetry
[ https://issues.apache.org/jira/browse/SPARK-32090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-32090: - Fix Version/s: 3.0.2 2.4.8 > UserDefinedType.equal() does not have symmetry > --- > > Key: SPARK-32090 > URL: https://issues.apache.org/jira/browse/SPARK-32090 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1, 2.2.0, 2.3.0, 2.4.0, 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass > val udt1 = new ExampleBaseTypeUDT > val udt2 = new ExampleSubTypeUDT > println(udt1 == udt2) // true > println(udt2 == udt1) // false -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33246) Spark SQL null semantics documentation is incorrect
[ https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-33246. -- Fix Version/s: 3.1.0 3.0.2 Assignee: Stuart White Resolution: Fixed Resolved by https://github.com/apache/spark/pull/30161 > Spark SQL null semantics documentation is incorrect > --- > > Key: SPARK-33246 > URL: https://issues.apache.org/jira/browse/SPARK-33246 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Stuart White >Assignee: Stuart White >Priority: Trivial > Fix For: 3.0.2, 3.1.0 > > > The documentation of Spark SQL's null semantics is (I believe) incorrect. > The documentation states that "NULL AND False" yields NULL, when in fact it > yields False. > {noformat} > Seq[(java.lang.Boolean, java.lang.Boolean)]( > (true, null), > (false, null), > (null, true), > (null, false), > (null, null) > ) > .toDF("left_operand", "right_operand") > .withColumn("OR", 'left_operand || 'right_operand) > .withColumn("AND", 'left_operand && 'right_operand) > .show(truncate = false) > +------------+-------------+----+-----+ > |left_operand|right_operand|OR  |AND  | > +------------+-------------+----+-----+ > |true        |null         |true|null | > |false       |null         |null|false| > |null        |true         |true|null | > |null        |false        |null|false|  <- this line is incorrect in the docs > |null        |null         |null|null | > +------------+-------------+----+-----+ > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33246) Spark SQL null semantics documentation is incorrect
[ https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33246: - Affects Version/s: (was: 3.0.1) 3.1.0 3.0.2 > Spark SQL null semantics documentation is incorrect > --- > > Key: SPARK-33246 > URL: https://issues.apache.org/jira/browse/SPARK-33246 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Stuart White >Priority: Trivial > > The documentation of Spark SQL's null semantics is (I believe) incorrect. > The documentation states that "NULL AND False" yields NULL, when in fact it > yields False. > {noformat} > Seq[(java.lang.Boolean, java.lang.Boolean)]( > (true, null), > (false, null), > (null, true), > (null, false), > (null, null) > ) > .toDF("left_operand", "right_operand") > .withColumn("OR", 'left_operand || 'right_operand) > .withColumn("AND", 'left_operand && 'right_operand) > .show(truncate = false) > +------------+-------------+----+-----+ > |left_operand|right_operand|OR  |AND  | > +------------+-------------+----+-----+ > |true        |null         |true|null | > |false       |null         |null|false| > |null        |true         |true|null | > |null        |false        |null|false|  <- this line is incorrect in the docs > |null        |null         |null|null | > +------------+-------------+----+-----+ > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33246) Spark SQL null semantics documentation is incorrect
[ https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33246: - Component/s: SQL > Spark SQL null semantics documentation is incorrect > --- > > Key: SPARK-33246 > URL: https://issues.apache.org/jira/browse/SPARK-33246 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 3.0.1 >Reporter: Stuart White >Priority: Trivial > > The documentation of Spark SQL's null semantics is (I believe) incorrect. > The documentation states that "NULL AND False" yields NULL, when in fact it > yields False. > {noformat} > Seq[(java.lang.Boolean, java.lang.Boolean)]( > (true, null), > (false, null), > (null, true), > (null, false), > (null, null) > ) > .toDF("left_operand", "right_operand") > .withColumn("OR", 'left_operand || 'right_operand) > .withColumn("AND", 'left_operand && 'right_operand) > .show(truncate = false) > +------------+-------------+----+-----+ > |left_operand|right_operand|OR  |AND  | > +------------+-------------+----+-----+ > |true        |null         |true|null | > |false       |null         |null|false| > |null        |true         |true|null | > |null        |false        |null|false|  <- this line is incorrect in the docs > |null        |null         |null|null | > +------------+-------------+----+-----+ > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
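The truth table in the report follows directly from SQL's three-valued logic, which can be written out in a few lines of Python (using `None` for SQL NULL; these helper names are illustrative, not a Spark API). The key asymmetry is that a definite `False` dominates NULL under AND, and a definite `True` dominates NULL under OR, which is why "NULL AND False" is False, not NULL.

```python
def sql_and(a, b):
    # SQL three-valued AND: a known False wins over NULL (None here),
    # so NULL AND False is False, never NULL.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def sql_or(a, b):
    # SQL three-valued OR: a known True wins over NULL.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

print(sql_and(None, False))  # False -- the row the docs got wrong
```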
[jira] [Updated] (SPARK-33183) Bug in optimizer rule EliminateSorts
[ https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33183: - Issue Type: Bug (was: Improvement) > Bug in optimizer rule EliminateSorts > > > Key: SPARK-33183 > URL: https://issues.apache.org/jira/browse/SPARK-33183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Allison Wang >Priority: Major > > Currently, the rule {{EliminateSorts}} removes a global sort node if its > child plan already satisfies the required sort order without checking if the > child plan's ordering is local or global. For example, in the following > scenario, the first sort shouldn't be removed because it has a stronger > guarantee than the second sort even if the sort orders are the same for both > sorts. > {code:java} > Sort(orders, global = True, ...) > Sort(orders, global = False, ...){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33183) Bug in optimizer rule EliminateSorts
[ https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33183: - Issue Type: Improvement (was: Bug) > Bug in optimizer rule EliminateSorts > > > Key: SPARK-33183 > URL: https://issues.apache.org/jira/browse/SPARK-33183 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Allison Wang >Priority: Major > > Currently, the rule {{EliminateSorts}} removes a global sort node if its > child plan already satisfies the required sort order without checking if the > child plan's ordering is local or global. For example, in the following > scenario, the first sort shouldn't be removed because it has a stronger > guarantee than the second sort even if the sort orders are the same for both > sorts. > {code:java} > Sort(orders, global = True, ...) > Sort(orders, global = False, ...){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
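The corrected rule's condition can be sketched abstractly. This Python model is hypothetical (it is not Spark's `EliminateSorts`; the node and field names are invented): matching sort keys alone do not justify removal, because a global sort guarantees a total order across partitions while a local sort only orders within each partition.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sort:
    orders: tuple                  # sort keys, e.g. ("a", "b")
    is_global: bool                # True = total ordering across partitions
    child: Optional["Sort"] = None

def eliminate_sort(node: Sort) -> Sort:
    """Remove a sort whose child already guarantees the required order.
    The fix: a global sort may only be removed when the child's ordering
    is also global; a local child's guarantee is too weak."""
    child = node.child
    if isinstance(child, Sort) and child.orders == node.orders:
        if not node.is_global or child.is_global:
            return child           # child's guarantee is at least as strong
    return node                    # keep the sort node
```

In the scenario quoted above, the outer `Sort(orders, global = True)` over `Sort(orders, global = False)` is kept, while the reverse nesting can still be simplified safely.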
[jira] [Assigned] (SPARK-32919) Add support in Spark driver to coordinate the shuffle map stage in push-based shuffle by selecting external shuffle services for merging shuffle partitions
[ https://issues.apache.org/jira/browse/SPARK-32919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32919: Assignee: (was: Apache Spark) > Add support in Spark driver to coordinate the shuffle map stage in push-based > shuffle by selecting external shuffle services for merging shuffle partitions > --- > > Key: SPARK-32919 > URL: https://issues.apache.org/jira/browse/SPARK-32919 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > > At the beginning of a shuffle map stage, the driver needs to select external > shuffle services as the mergers of the shuffle partitions for the > corresponding shuffle. > We currently leverage the immediately available information about current and > past executor locations for this selection purpose. Ideally, this > would be behind a pluggable interface so that we can potentially leverage > information tracked outside of a Spark application for better load balancing > or for a disaggregated deployment environment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32919) Add support in Spark driver to coordinate the shuffle map stage in push-based shuffle by selecting external shuffle services for merging shuffle partitions
[ https://issues.apache.org/jira/browse/SPARK-32919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32919: Assignee: Apache Spark > Add support in Spark driver to coordinate the shuffle map stage in push-based > shuffle by selecting external shuffle services for merging shuffle partitions > --- > > Key: SPARK-32919 > URL: https://issues.apache.org/jira/browse/SPARK-32919 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Apache Spark >Priority: Major > > At the beginning of a shuffle map stage, the driver needs to select external > shuffle services as the mergers of the shuffle partitions for the > corresponding shuffle. > We currently leverage the immediately available information about current and > past executor locations for this selection purpose. Ideally, this > would be behind a pluggable interface so that we can potentially leverage > information tracked outside of a Spark application for better load balancing > or for a disaggregated deployment environment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32918) RPC implementation to support control plane coordination for push-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-32918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32918: Assignee: (was: Apache Spark) > RPC implementation to support control plane coordination for push-based > shuffle > --- > > Key: SPARK-32918 > URL: https://issues.apache.org/jira/browse/SPARK-32918 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > > RPCs to facilitate coordination of shuffle map/reduce stages. Notifications > to external shuffle services to finalize shuffle block merge for a given > shuffle are carried through this RPC. It also sends the metadata about > a merged shuffle partition back to the caller. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32919) Add support in Spark driver to coordinate the shuffle map stage in push-based shuffle by selecting external shuffle services for merging shuffle partitions
[ https://issues.apache.org/jira/browse/SPARK-32919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221822#comment-17221822 ] Apache Spark commented on SPARK-32919: -- User 'venkata91' has created a pull request for this issue: https://github.com/apache/spark/pull/30164 > Add support in Spark driver to coordinate the shuffle map stage in push-based > shuffle by selecting external shuffle services for merging shuffle partitions > --- > > Key: SPARK-32919 > URL: https://issues.apache.org/jira/browse/SPARK-32919 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > > At the beginning of a shuffle map stage, the driver needs to select external > shuffle services as the mergers of the shuffle partitions for the > corresponding shuffle. > We currently leverage the immediately available information about current and > past executor locations for this selection purpose. Ideally, this > would be behind a pluggable interface so that we can potentially leverage > information tracked outside of a Spark application for better load balancing > or for a disaggregated deployment environment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32918) RPC implementation to support control plane coordination for push-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-32918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221821#comment-17221821 ] Apache Spark commented on SPARK-32918: -- User 'zhouyejoe' has created a pull request for this issue: https://github.com/apache/spark/pull/30163 > RPC implementation to support control plane coordination for push-based > shuffle > --- > > Key: SPARK-32918 > URL: https://issues.apache.org/jira/browse/SPARK-32918 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > > RPCs to facilitate coordination of shuffle map/reduce stages. Notifications > to external shuffle services to finalize shuffle block merge for a given > shuffle are carried through this RPC. It also sends the metadata about > a merged shuffle partition back to the caller. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32918) RPC implementation to support control plane coordination for push-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-32918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32918: Assignee: Apache Spark > RPC implementation to support control plane coordination for push-based > shuffle > --- > > Key: SPARK-32918 > URL: https://issues.apache.org/jira/browse/SPARK-32918 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Apache Spark >Priority: Major > > RPCs to facilitate coordination of shuffle map/reduce stages. Notifications > to external shuffle services to finalize shuffle block merge for a given > shuffle are carried through this RPC. It also sends the metadata about > a merged shuffle partition back to the caller. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32919) Add support in Spark driver to coordinate the shuffle map stage in push-based shuffle by selecting external shuffle services for merging shuffle partitions
[ https://issues.apache.org/jira/browse/SPARK-32919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221824#comment-17221824 ] Apache Spark commented on SPARK-32919: -- User 'venkata91' has created a pull request for this issue: https://github.com/apache/spark/pull/30164 > Add support in Spark driver to coordinate the shuffle map stage in push-based > shuffle by selecting external shuffle services for merging shuffle partitions > --- > > Key: SPARK-32919 > URL: https://issues.apache.org/jira/browse/SPARK-32919 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > > At the beginning of a shuffle map stage, the driver needs to select external > shuffle services as the mergers of the shuffle partitions for the > corresponding shuffle. > We currently leverage the immediately available information about current and > past executor locations for this selection purpose. Ideally, this > would be behind a pluggable interface so that we can potentially leverage > information tracked outside of a Spark application for better load balancing > or for a disaggregated deployment environment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
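The selection heuristic described in SPARK-32919 can be sketched as a small pure function. This Python sketch is an assumption-laden illustration (the function name and signature are invented, and the real pluggable interface is not modeled): prefer hosts with live executors, fall back to hosts that ran executors earlier, deduplicate, and stop once enough merger locations are chosen.

```python
def select_merger_locations(active_hosts, past_hosts, num_mergers):
    """Pick external-shuffle-service hosts to merge shuffle partitions:
    hosts with current executors first, then hosts from past executor
    placements, without duplicates, up to num_mergers entries."""
    seen, chosen = set(), []
    for host in list(active_hosts) + list(past_hosts):
        if host not in seen:
            seen.add(host)
            chosen.append(host)
            if len(chosen) == num_mergers:
                break
    return chosen
```

Putting this behind an interface, as the issue proposes, would let a cluster operator substitute a policy driven by external load information without touching the driver's stage-coordination logic.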
[jira] [Commented] (SPARK-33263) Configurable StateStore compression codec
[ https://issues.apache.org/jira/browse/SPARK-33263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221770#comment-17221770 ] Apache Spark commented on SPARK-33263: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/30162 > Configurable StateStore compression codec > - > > Key: SPARK-33263 > URL: https://issues.apache.org/jira/browse/SPARK-33263 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Currently the compression codec of StateStore is not configurable and > hard-coded to be lz4. It is better if we can follow other Spark modules to > configure the compression codec of StateStore. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33263) Configurable StateStore compression codec
[ https://issues.apache.org/jira/browse/SPARK-33263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33263: Assignee: Apache Spark (was: L. C. Hsieh) > Configurable StateStore compression codec > - > > Key: SPARK-33263 > URL: https://issues.apache.org/jira/browse/SPARK-33263 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > Currently the compression codec of StateStore is not configurable and > hard-coded to be lz4. It is better if we can follow other Spark modules to > configure the compression codec of StateStore. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33263) Configurable StateStore compression codec
[ https://issues.apache.org/jira/browse/SPARK-33263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33263: Assignee: L. C. Hsieh (was: Apache Spark) > Configurable StateStore compression codec > - > > Key: SPARK-33263 > URL: https://issues.apache.org/jira/browse/SPARK-33263 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Currently the compression codec of StateStore is not configurable and > hard-coded to be lz4. It is better if we can follow other Spark modules to > configure the compression codec of StateStore. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33263) Configurable StateStore compression codec
L. C. Hsieh created SPARK-33263: --- Summary: Configurable StateStore compression codec Key: SPARK-33263 URL: https://issues.apache.org/jira/browse/SPARK-33263 Project: Spark Issue Type: Improvement Components: SQL, Structured Streaming Affects Versions: 3.1.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh Currently the compression codec of StateStore is not configurable and hard-coded to be lz4. It is better if we can follow other Spark modules to configure the compression codec of StateStore. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
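The configuration pattern the issue asks for, resolving a codec by name from a conf map instead of hard-coding it, can be sketched as follows. This is a hypothetical Python illustration: the registry, the default, and the config key shown are assumptions (the key name may not match what the actual Spark patch introduces), and `zlib` stands in for Spark's lz4/zstd/snappy codecs.

```python
import zlib

# Hypothetical codec registry mapping a config value to
# (compress, decompress) function pairs.
CODECS = {
    "uncompressed": (lambda b: b, lambda b: b),
    "zlib": (zlib.compress, zlib.decompress),
}

def state_store_codec(conf,
                      key="spark.sql.streaming.stateStore.compression.codec"):
    """Look the codec up in the session conf, falling back to a default,
    and fail fast on an unknown name."""
    name = conf.get(key, "zlib")
    if name not in CODECS:
        raise ValueError(f"unknown compression codec: {name}")
    return CODECS[name]
```

The same lookup-with-default shape is how other Spark modules expose their codec choice (e.g. shuffle and broadcast compression), which is the consistency the issue is after.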
[jira] [Commented] (SPARK-33260) SortExec produces incorrect results if sortOrder is a Stream
[ https://issues.apache.org/jira/browse/SPARK-33260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221725#comment-17221725 ] Dongjoon Hyun commented on SPARK-33260: --- I added `correctness` label. > SortExec produces incorrect results if sortOrder is a Stream > > > Key: SPARK-33260 > URL: https://issues.apache.org/jira/browse/SPARK-33260 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Ankur Dave >Assignee: Ankur Dave >Priority: Major > Labels: correctness > Fix For: 3.0.2, 3.1.0 > > > The following query produces incorrect results. The query has two essential > features: (1) it contains a string aggregate, resulting in a {{SortExec}} > node, and (2) it contains a duplicate grouping key, causing > {{RemoveRepetitionFromGroupExpressions}} to produce a sort order stored as a > Stream. > SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string)) > FROM table_4 > GROUP BY bigint_col_1, bigint_col_9, bigint_col_9 > When the sort order is stored as a {{Stream}}, the line > {{ordering.map(_.child.genCode(ctx))}} in > {{GenerateOrdering#createOrderKeys()}} produces unpredictable side effects to > {{ctx}}. This is because {{genCode(ctx)}} modifies {{ctx}}. When {{ordering}} > is a {{Stream}}, the modifications will not happen immediately as intended, > but will instead occur lazily when the returned {{Stream}} is used later. > Similar bugs have occurred at least three times in the past: > https://issues.apache.org/jira/browse/SPARK-24500, > https://issues.apache.org/jira/browse/SPARK-25767, > https://issues.apache.org/jira/browse/SPARK-26680. > The fix is to check if {{ordering}} is a {{Stream}} and force the > modifications to happen immediately if so. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33260) SortExec produces incorrect results if sortOrder is a Stream
[ https://issues.apache.org/jira/browse/SPARK-33260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33260: -- Labels: correctness (was: ) > SortExec produces incorrect results if sortOrder is a Stream > > > Key: SPARK-33260 > URL: https://issues.apache.org/jira/browse/SPARK-33260 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Ankur Dave >Assignee: Ankur Dave >Priority: Major > Labels: correctness > Fix For: 3.0.2, 3.1.0 > > > The following query produces incorrect results. The query has two essential > features: (1) it contains a string aggregate, resulting in a {{SortExec}} > node, and (2) it contains a duplicate grouping key, causing > {{RemoveRepetitionFromGroupExpressions}} to produce a sort order stored as a > Stream. > SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string)) > FROM table_4 > GROUP BY bigint_col_1, bigint_col_9, bigint_col_9 > When the sort order is stored as a {{Stream}}, the line > {{ordering.map(_.child.genCode(ctx))}} in > {{GenerateOrdering#createOrderKeys()}} produces unpredictable side effects to > {{ctx}}. This is because {{genCode(ctx)}} modifies {{ctx}}. When {{ordering}} > is a {{Stream}}, the modifications will not happen immediately as intended, > but will instead occur lazily when the returned {{Stream}} is used later. > Similar bugs have occurred at least three times in the past: > https://issues.apache.org/jira/browse/SPARK-24500, > https://issues.apache.org/jira/browse/SPARK-25767, > https://issues.apache.org/jira/browse/SPARK-26680. > The fix is to check if {{ordering}} is a {{Stream}} and force the > modifications to happen immediately if so. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33260) SortExec produces incorrect results if sortOrder is a Stream
[ https://issues.apache.org/jira/browse/SPARK-33260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33260. --- Fix Version/s: 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 30160 [https://github.com/apache/spark/pull/30160] > SortExec produces incorrect results if sortOrder is a Stream > > > Key: SPARK-33260 > URL: https://issues.apache.org/jira/browse/SPARK-33260 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Ankur Dave >Assignee: Ankur Dave >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > The following query produces incorrect results. The query has two essential > features: (1) it contains a string aggregate, resulting in a {{SortExec}} > node, and (2) it contains a duplicate grouping key, causing > {{RemoveRepetitionFromGroupExpressions}} to produce a sort order stored as a > Stream. > SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string)) > FROM table_4 > GROUP BY bigint_col_1, bigint_col_9, bigint_col_9 > When the sort order is stored as a {{Stream}}, the line > {{ordering.map(_.child.genCode(ctx))}} in > {{GenerateOrdering#createOrderKeys()}} produces unpredictable side effects to > {{ctx}}. This is because {{genCode(ctx)}} modifies {{ctx}}. When {{ordering}} > is a {{Stream}}, the modifications will not happen immediately as intended, > but will instead occur lazily when the returned {{Stream}} is used later. > Similar bugs have occurred at least three times in the past: > https://issues.apache.org/jira/browse/SPARK-24500, > https://issues.apache.org/jira/browse/SPARK-25767, > https://issues.apache.org/jira/browse/SPARK-26680. > The fix is to check if {{ordering}} is a {{Stream}} and force the > modifications to happen immediately if so. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
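The deferred-side-effect failure mode behind SPARK-33260 has a direct Python analogue, since `map()` is lazy like a Scala `Stream`. The sketch below is hypothetical scaffolding (not Spark's `GenerateOrdering`): `gen_code` mutates a shared context as a side effect, and when the mapping is lazy, that mutation happens only when the result is finally consumed, after the context has already been read.

```python
def gen_code(ctx, expr):
    # Like Expression.genCode in spirit: returns code *and* mutates the
    # codegen context as a side effect.
    ctx.append(f"state for {expr}")
    return f"compare({expr})"

def create_order_keys(ctx, ordering, force=False):
    keys = map(lambda e: gen_code(ctx, e), ordering)  # lazy, like a Stream
    if force:
        keys = list(keys)  # the fix: evaluate now so ctx is mutated up front
    return keys

lazy_ctx = []
lazy_keys = create_order_keys(lazy_ctx, ["a", "b"])
state_seen_early = len(lazy_ctx)  # 0: code emitted here would miss the state
list(lazy_keys)                   # side effects finally run, too late
```

Forcing evaluation (`force=True` above, or materializing the `Stream` in the Scala fix) makes the context complete before anything downstream reads it.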
[jira] [Resolved] (SPARK-33262) Keep pending pods in account while scheduling new pods
[ https://issues.apache.org/jira/browse/SPARK-33262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33262. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30155 [https://github.com/apache/spark/pull/30155] > Keep pending pods in account while scheduling new pods > -- > > Key: SPARK-33262 > URL: https://issues.apache.org/jira/browse/SPARK-33262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33231) Make podCreationTimeout configurable
[ https://issues.apache.org/jira/browse/SPARK-33231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33231. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30155 [https://github.com/apache/spark/pull/30155] > Make podCreationTimeout configurable > > > Key: SPARK-33231 > URL: https://issues.apache.org/jira/browse/SPARK-33231 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > Fix For: 3.1.0 > > > The Executor Monitor & Pod Allocator have differing views of the world, which can > lead to pod thrashing. > The executor monitor can be notified of an executor coming up before a > snapshot is delivered to the PodAllocator. This can cause the executor > monitor to believe it needs to delete a pod, and the pod allocator to believe > that it needs to create a new pod. This happens if the podCreationTimeout is > too low for the cluster. Currently podCreationTimeout can only be configured > by increasing the batch delay, but that has additional consequences leading to > slower spin-up. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
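The accounting change behind the related SPARK-33262 ("keep pending pods in account") reduces exactly this kind of allocator/monitor disagreement, and its core can be sketched in one function. This Python sketch is an assumption, not Spark's actual Kubernetes allocator code: the allocator requests only the shortfall after counting pods that are running, pending, or created but not yet visible in a snapshot.

```python
def pods_to_request(target, running, pending, newly_created):
    """Number of new executor pods to request from Kubernetes.
    Counting pending and just-created pods as outstanding prevents the
    allocator from over-provisioning, which would otherwise leave the
    executor monitor deleting the surplus (pod thrashing)."""
    outstanding = running + pending + newly_created
    return max(0, target - outstanding)
```

With a target of 10 executors, 4 running, 3 pending, and 1 created but not yet reported, only 2 new pods are requested instead of 6.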
[jira] [Assigned] (SPARK-33231) Make podCreationTimeout configurable
[ https://issues.apache.org/jira/browse/SPARK-33231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33231: - Assignee: Holden Karau > Make podCreationTimeout configurable > > > Key: SPARK-33231 > URL: https://issues.apache.org/jira/browse/SPARK-33231 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > > The Executor Monitor & Pod Allocator have differing views of the world, which can > lead to pod thrashing. > The executor monitor can be notified of an executor coming up before a > snapshot is delivered to the PodAllocator. This can cause the executor > monitor to believe it needs to delete a pod, and the pod allocator to believe > that it needs to create a new pod. This happens if the podCreationTimeout is > too low for the cluster. Currently podCreationTimeout can only be configured > by increasing the batch delay, but that has additional consequences leading to > slower spin-up. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33262) Keep pending pods in account while scheduling new pods
[ https://issues.apache.org/jira/browse/SPARK-33262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33262: - Assignee: Holden Karau > Keep pending pods in account while scheduling new pods > -- > > Key: SPARK-33262 > URL: https://issues.apache.org/jira/browse/SPARK-33262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33262) Keep pending pods in account while scheduling new pods
[ https://issues.apache.org/jira/browse/SPARK-33262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33262: Assignee: (was: Apache Spark) > Keep pending pods in account while scheduling new pods > -- > > Key: SPARK-33262 > URL: https://issues.apache.org/jira/browse/SPARK-33262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33262) Keep pending pods in account while scheduling new pods
[ https://issues.apache.org/jira/browse/SPARK-33262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221644#comment-17221644 ] Apache Spark commented on SPARK-33262: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/30155 > Keep pending pods in account while scheduling new pods > -- > > Key: SPARK-33262 > URL: https://issues.apache.org/jira/browse/SPARK-33262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33262) Keep pending pods in account while scheduling new pods
[ https://issues.apache.org/jira/browse/SPARK-33262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33262: Assignee: Apache Spark > Keep pending pods in account while scheduling new pods > -- > > Key: SPARK-33262 > URL: https://issues.apache.org/jira/browse/SPARK-33262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33262) Keep pending pods in account while scheduling new pods
Holden Karau created SPARK-33262: Summary: Keep pending pods in account while scheduling new pods Key: SPARK-33262 URL: https://issues.apache.org/jira/browse/SPARK-33262 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.1.0 Reporter: Holden Karau -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
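The idea behind SPARK-33262 — counting pods that are still pending against the allocation target so the allocator does not re-request pods that are simply slow to start — can be sketched as follows. This is an illustrative sketch; `pods_to_request` and its parameters are hypothetical names, not Spark's actual allocator API.

```python
def pods_to_request(target: int, running: int, pending: int) -> int:
    """How many new executor pods to ask Kubernetes for.

    Counting pending (created but not yet running) pods against the
    target prevents over-requesting while pods are still starting up,
    which is what causes pod churn.
    """
    return max(0, target - running - pending)

# With 10 executors wanted, 4 running and 6 still pending, request nothing;
# ignoring the pending pods would over-request by 6.
print(pods_to_request(10, 4, 6))
```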
[jira] [Created] (SPARK-33261) Allow people to extend the pod feature steps
Holden Karau created SPARK-33261: Summary: Allow people to extend the pod feature steps Key: SPARK-33261 URL: https://issues.apache.org/jira/browse/SPARK-33261 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.1.0 Reporter: Holden Karau While we allow people to specify pod templates, some deployments could benefit from being able to add a feature step. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33230) FileOutputWriter jobs have duplicate JobIDs if launched in same second
[ https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221581#comment-17221581 ] Steve Loughran commented on SPARK-33230: Thanks. Still got some changes to work through my side to make sure there are no assumptions that the app attempt is unique. For the curious, see HADOOP-17318 * Staging committer is using task attemptID (jobId + taskId + task-attempt) for a path to the local temp dir * Magic committer uses app attemptId for the path under the dest/__magic dir. Only an issue once that committer allows >1 job to write to the same dest. > FileOutputWriter jobs have duplicate JobIDs if launched in same second > -- > > Key: SPARK-33230 > URL: https://issues.apache.org/jira/browse/SPARK-33230 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 3.0.1 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > The Hadoop S3A staging committer has problems with >1 spark sql query being > launched simultaneously, as it uses the jobID for its path in the clusterFS > to pass the commit information from tasks to job committer. > If two queries are launched in the same second, they conflict and the output > of job 1 includes that of all job2 files written so far; job 2 will fail with > FNFE. > Proposed: > job conf to set {{"spark.sql.sources.writeJobUUID"}} to the value of > {{WriteJobDescription.uuid}} > That was the property name which used to serve this purpose; any committers > already written which use this property will pick it up without needing any > changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
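The collision described in SPARK-33230 comes from job IDs that embed a timestamp with one-second granularity. A minimal sketch of the failure mode, assuming a Hadoop-style `job_<yyyyMMddHHmmss>_<number>` layout (illustrative only, not Spark's actual ID-generation code):

```python
from datetime import datetime

def hadoop_style_job_id(launch_time: datetime, job_number: int) -> str:
    # The timestamp component only has second granularity, so two jobs
    # launched within the same second, each numbering its own jobs from
    # zero, end up with identical IDs -- and hence identical commit paths.
    return f"job_{launch_time.strftime('%Y%m%d%H%M%S')}_{job_number:04d}"

t = datetime(2020, 10, 27, 12, 0, 0)
# Two independent queries, same launch second, same job number: collision.
print(hadoop_style_job_id(t, 0) == hadoop_style_job_id(t, 0))
```

Passing a per-query UUID such as `WriteJobDescription.uuid` through `"spark.sql.sources.writeJobUUID"` sidesteps the timestamp entirely, which is why the proposed fix works without committer changes.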
[jira] [Assigned] (SPARK-22390) Aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22390: Assignee: Apache Spark > Aggregate push down > --- > > Key: SPARK-22390 > URL: https://issues.apache.org/jira/browse/SPARK-22390 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22390) Aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22390: Assignee: (was: Apache Spark) > Aggregate push down > --- > > Key: SPARK-22390 > URL: https://issues.apache.org/jira/browse/SPARK-22390 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22390) Aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221548#comment-17221548 ] Apache Spark commented on SPARK-22390: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/29695 > Aggregate push down > --- > > Key: SPARK-22390 > URL: https://issues.apache.org/jira/browse/SPARK-22390 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22390) Aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221546#comment-17221546 ] Huaxin Gao commented on SPARK-22390: Hi [~baibaichen], I am working on this. I put it under a different jira. I will link it here too. > Aggregate push down > --- > > Key: SPARK-22390 > URL: https://issues.apache.org/jira/browse/SPARK-22390 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33246) Spark SQL null semantics documentation is incorrect
[ https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33246: Assignee: Apache Spark > Spark SQL null semantics documentation is incorrect > --- > > Key: SPARK-33246 > URL: https://issues.apache.org/jira/browse/SPARK-33246 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Stuart White >Assignee: Apache Spark >Priority: Trivial > > The documentation of Spark SQL's null semantics is (I believe) incorrect. > The documentation states that "NULL AND False" yields NULL, when in fact it > yields False.
> {noformat}
> Seq[(java.lang.Boolean, java.lang.Boolean)](
>   (true, null),
>   (false, null),
>   (null, true),
>   (null, false),
>   (null, null)
> )
>   .toDF("left_operand", "right_operand")
>   .withColumn("OR", 'left_operand || 'right_operand)
>   .withColumn("AND", 'left_operand && 'right_operand)
>   .show(truncate = false)
>
> +------------+-------------+----+-----+
> |left_operand|right_operand|OR  |AND  |
> +------------+-------------+----+-----+
> |true        |null         |true|null |
> |false       |null         |null|false|
> |null        |true         |true|null |
> |null        |false        |null|false|   <--- this line is incorrect in the docs
> |null        |null         |null|null |
> +------------+-------------+----+-----+
> {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33138) unify temp view and permanent view behaviors
[ https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221506#comment-17221506 ] Erik Krogen commented on SPARK-33138: - I see, thanks for clarifying [~cloud_fan]. > unify temp view and permanent view behaviors > > > Key: SPARK-33138 > URL: https://issues.apache.org/jira/browse/SPARK-33138 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > Currently, a temp view stores a mapping from the view name to its logicalPlan, while a > permanent view stored in the HMS keeps its original SQL text. > So when a permanent view is referenced, its SQL text > is parsed, analyzed, optimized, and planned again with the current SQLConf and > SparkSession context, so its output might keep changing when the SQLConf and context > differ each time. > To unify the behaviors of temp and permanent views, it is proposed > that we keep the original SQL text for both, and also > record the SQLConf at the time the view was created. Each time the view is referenced, we use the snapshotted SQLConf to parse-analyze-optimize-plan > the SQL text; this way, we can make sure the output of the created view stays > stable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
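The proposal in SPARK-33138 — store a view's original SQL text together with a snapshot of the SQLConf taken at creation time, and replay both when the view is referenced — can be sketched as below. All names here (`ViewDefinition`, `resolve_view`) are hypothetical, chosen only to illustrate the mechanism, not Spark's catalog API.

```python
class ViewDefinition:
    """A view stored as its original SQL text plus a frozen SQLConf snapshot."""
    def __init__(self, sql_text: str, conf_snapshot: dict):
        self.sql_text = sql_text
        self.conf_snapshot = dict(conf_snapshot)  # frozen at CREATE VIEW time

def resolve_view(view: ViewDefinition, session_conf: dict) -> dict:
    # Analyze the view under its captured conf, not the session's current
    # conf: captured entries override the session, so the view's output
    # stays stable across sessions with different settings.
    return {**session_conf, **view.conf_snapshot}

v = ViewDefinition("SELECT * FROM t", {"spark.sql.ansi.enabled": "false"})
later_session = {"spark.sql.ansi.enabled": "true"}
print(resolve_view(v, later_session)["spark.sql.ansi.enabled"])
```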
[jira] [Commented] (SPARK-33246) Spark SQL null semantics documentation is incorrect
[ https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221507#comment-17221507 ] Apache Spark commented on SPARK-33246: -- User 'stwhit' has created a pull request for this issue: https://github.com/apache/spark/pull/30161 > Spark SQL null semantics documentation is incorrect > --- > > Key: SPARK-33246 > URL: https://issues.apache.org/jira/browse/SPARK-33246 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Stuart White >Priority: Trivial > > The documentation of Spark SQL's null semantics is (I believe) incorrect. > The documentation states that "NULL AND False" yields NULL, when in fact it > yields False.
> {noformat}
> Seq[(java.lang.Boolean, java.lang.Boolean)](
>   (true, null),
>   (false, null),
>   (null, true),
>   (null, false),
>   (null, null)
> )
>   .toDF("left_operand", "right_operand")
>   .withColumn("OR", 'left_operand || 'right_operand)
>   .withColumn("AND", 'left_operand && 'right_operand)
>   .show(truncate = false)
>
> +------------+-------------+----+-----+
> |left_operand|right_operand|OR  |AND  |
> +------------+-------------+----+-----+
> |true        |null         |true|null |
> |false       |null         |null|false|
> |null        |true         |true|null |
> |null        |false        |null|false|   <--- this line is incorrect in the docs
> |null        |null         |null|null |
> +------------+-------------+----+-----+
> {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33246) Spark SQL null semantics documentation is incorrect
[ https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stuart White updated SPARK-33246: - Attachment: (was: null-semantics.patch) > Spark SQL null semantics documentation is incorrect > --- > > Key: SPARK-33246 > URL: https://issues.apache.org/jira/browse/SPARK-33246 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Stuart White >Priority: Trivial > > The documentation of Spark SQL's null semantics is (I believe) incorrect. > The documentation states that "NULL AND False" yields NULL, when in fact it > yields False.
> {noformat}
> Seq[(java.lang.Boolean, java.lang.Boolean)](
>   (true, null),
>   (false, null),
>   (null, true),
>   (null, false),
>   (null, null)
> )
>   .toDF("left_operand", "right_operand")
>   .withColumn("OR", 'left_operand || 'right_operand)
>   .withColumn("AND", 'left_operand && 'right_operand)
>   .show(truncate = false)
>
> +------------+-------------+----+-----+
> |left_operand|right_operand|OR  |AND  |
> +------------+-------------+----+-----+
> |true        |null         |true|null |
> |false       |null         |null|false|
> |null        |true         |true|null |
> |null        |false        |null|false|   <--- this line is incorrect in the docs
> |null        |null         |null|null |
> +------------+-------------+----+-----+
> {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33246) Spark SQL null semantics documentation is incorrect
[ https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221505#comment-17221505 ] Apache Spark commented on SPARK-33246: -- User 'stwhit' has created a pull request for this issue: https://github.com/apache/spark/pull/30161 > Spark SQL null semantics documentation is incorrect > --- > > Key: SPARK-33246 > URL: https://issues.apache.org/jira/browse/SPARK-33246 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Stuart White >Priority: Trivial > > The documentation of Spark SQL's null semantics is (I believe) incorrect. > The documentation states that "NULL AND False" yields NULL, when in fact it > yields False.
> {noformat}
> Seq[(java.lang.Boolean, java.lang.Boolean)](
>   (true, null),
>   (false, null),
>   (null, true),
>   (null, false),
>   (null, null)
> )
>   .toDF("left_operand", "right_operand")
>   .withColumn("OR", 'left_operand || 'right_operand)
>   .withColumn("AND", 'left_operand && 'right_operand)
>   .show(truncate = false)
>
> +------------+-------------+----+-----+
> |left_operand|right_operand|OR  |AND  |
> +------------+-------------+----+-----+
> |true        |null         |true|null |
> |false       |null         |null|false|
> |null        |true         |true|null |
> |null        |false        |null|false|   <--- this line is incorrect in the docs
> |null        |null         |null|null |
> +------------+-------------+----+-----+
> {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33246) Spark SQL null semantics documentation is incorrect
[ https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33246: Assignee: (was: Apache Spark) > Spark SQL null semantics documentation is incorrect > --- > > Key: SPARK-33246 > URL: https://issues.apache.org/jira/browse/SPARK-33246 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Stuart White >Priority: Trivial > > The documentation of Spark SQL's null semantics is (I believe) incorrect. > The documentation states that "NULL AND False" yields NULL, when in fact it > yields False.
> {noformat}
> Seq[(java.lang.Boolean, java.lang.Boolean)](
>   (true, null),
>   (false, null),
>   (null, true),
>   (null, false),
>   (null, null)
> )
>   .toDF("left_operand", "right_operand")
>   .withColumn("OR", 'left_operand || 'right_operand)
>   .withColumn("AND", 'left_operand && 'right_operand)
>   .show(truncate = false)
>
> +------------+-------------+----+-----+
> |left_operand|right_operand|OR  |AND  |
> +------------+-------------+----+-----+
> |true        |null         |true|null |
> |false       |null         |null|false|
> |null        |true         |true|null |
> |null        |false        |null|false|   <--- this line is incorrect in the docs
> |null        |null         |null|null |
> +------------+-------------+----+-----+
> {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
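The truth table in SPARK-33246 follows SQL's three-valued (Kleene) logic, which can be sketched in a few lines of Python; representing NULL as `None` here is an illustration, and note that this deliberately differs from Python's own `and`, which would return `None` for `None and False`.

```python
def sql_and(a, b):
    # SQL three-valued AND over True / False / None (None = NULL):
    # False dominates NULL, so NULL AND False yields False,
    # which is the corrected row in the docs.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def sql_or(a, b):
    # Dually, True dominates NULL for OR.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

print(sql_and(None, False))  # the row the documentation gets wrong
```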
[jira] [Assigned] (SPARK-33137) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (PostgreSQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-33137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33137: --- Assignee: Huaxin Gao > Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of > columns (PostgreSQL dialect) > - > > Key: SPARK-33137 > URL: https://issues.apache.org/jira/browse/SPARK-33137 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > Override the default SQL strings for: > ALTER TABLE UPDATE COLUMN TYPE > ALTER TABLE UPDATE COLUMN NULLABILITY > in the following PostgreSQL JDBC dialect according to official documentation. > Write PostgreSQL integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33137) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (PostgreSQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-33137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33137. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30089 [https://github.com/apache/spark/pull/30089] > Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of > columns (PostgreSQL dialect) > - > > Key: SPARK-33137 > URL: https://issues.apache.org/jira/browse/SPARK-33137 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.1.0 > > > Override the default SQL strings for: > ALTER TABLE UPDATE COLUMN TYPE > ALTER TABLE UPDATE COLUMN NULLABILITY > in the following PostgreSQL JDBC dialect according to official documentation. > Write PostgreSQL integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix
[ https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20044: Assignee: Gengliang Wang (was: Apache Spark) > Support Spark UI behind front-end reverse proxy using a path prefix > --- > > Key: SPARK-20044 > URL: https://issues.apache.org/jira/browse/SPARK-20044 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0 >Reporter: Oliver Koeth >Assignee: Gengliang Wang >Priority: Minor > Labels: reverse-proxy, sso > > Purpose: allow running the Spark web UI behind a reverse proxy with URLs > prefixed by a context root, like www.mydomain.com/spark. In particular, this > allows accessing multiple Spark clusters through the same virtual host, only > distinguishing them by context root, like www.mydomain.com/cluster1, > www.mydomain.com/cluster2, and it allows running the Spark UI in a common > cookie domain (for SSO) with other services. > [SPARK-15487] introduced some support for front-end reverse proxies by > allowing all Spark UI requests to be routed through the master UI as a single > endpoint and also added a spark.ui.reverseProxyUrl setting to define > another proxy sitting in front of Spark. However, as noted in the comments on > [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl > includes a context root like the examples above: Most links generated by the > Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not > account for a path prefix (context root) and work only if the Spark UI "owns" > the entire virtual host. In fact, the only place in the UI where the > reverseProxyUrl seems to be used is the back-link from the worker UI to the > master UI.
> The discussion on [SPARK-15487] proposes to open a new issue for the problem, > but that does not seem to have happened, so this issue aims to address the > remaining shortcomings of spark.ui.reverseProxyUrl > The problem can be partially worked around by doing content rewrite in a > front-end proxy and prefixing src="/..." or href="/..." links with a context > root. However, detecting and patching URLs in HTML output is not a robust > approach and breaks down for URLs included in custom REST responses. E.g. the > "allexecutors" REST call used from the Spark 2.1.0 application/executors page > returns links for log viewing that direct to the worker UI and do not work in > this scenario. > This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL > generation. Experiments indicate that most of this can simply be achieved by > using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase > system property. Beyond that, the places that require adaption are > - worker and application links in the master web UI > - webui URLs returned by REST interfaces > Note: It seems that returned redirect location headers do not need to be > adapted, since URL rewriting for these is commonly done in front-end proxies > and has a well-defined interface -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
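The fix the issue proposes — prepending the path component of spark.ui.reverseProxyUrl to the existing proxyBase — can be sketched as follows. `prefixed_proxy_base` is a hypothetical helper for illustration, not Spark's actual implementation.

```python
from urllib.parse import urlparse

def prefixed_proxy_base(reverse_proxy_url: str, proxy_base: str) -> str:
    # Extract the context root (path component) of reverseProxyUrl and
    # prepend it to proxyBase, so generated links survive a front-end
    # proxy that serves the UI under a path prefix.
    context_root = urlparse(reverse_proxy_url).path.rstrip("/")
    return f"{context_root}{proxy_base}"

# A UI served at www.mydomain.com/cluster1 keeps its /proxy/... links
# under the /cluster1 context root:
print(prefixed_proxy_base("https://www.mydomain.com/cluster1", "/proxy/app-1"))
```

When reverseProxyUrl carries no path (the UI owns the whole virtual host), the helper degenerates to the unchanged proxyBase, matching today's behavior.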
[jira] [Commented] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix
[ https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221457#comment-17221457 ] Gengliang Wang commented on SPARK-20044: I am taking this over in https://github.com/apache/spark/pull/29820 > Support Spark UI behind front-end reverse proxy using a path prefix > --- > > Key: SPARK-20044 > URL: https://issues.apache.org/jira/browse/SPARK-20044 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0 >Reporter: Oliver Koeth >Assignee: Apache Spark >Priority: Minor > Labels: reverse-proxy, sso > > Purpose: allow to run the Spark web UI behind a reverse proxy with URLs > prefixed by a context root, like www.mydomain.com/spark. In particular, this > allows to access multiple Spark clusters through the same virtual host, only > distinguishing them by context root, like www.mydomain.com/cluster1, > www.mydomain.com/cluster2, and it allows to run the Spark UI in a common > cookie domain (for SSO) with other services. > [SPARK-15487] introduced some support for front-end reverse proxies by > allowing all Spark UI requests to be routed through the master UI as a single > endpoint and also added a spark.ui.reverseProxyUrl setting to define a > another proxy sitting in front of Spark. However, as noted in the comments on > [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl > includes a context root like the examples above: Most links generated by the > Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not > account for a path prefix (context root) and work only if the Spark UI "owns" > the entire virtual host. In fact, the only place in the UI where the > reverseProxyUrl seems to be used is the back-link from the worker UI to the > master UI. 
> The discussion on [SPARK-15487] proposes to open a new issue for the problem, > but that does not seem to have happened, so this issue aims to address the > remaining shortcomings of spark.ui.reverseProxyUrl > The problem can be partially worked around by doing content rewrite in a > front-end proxy and prefixing src="/..." or href="/..." links with a context > root. However, detecting and patching URLs in HTML output is not a robust > approach and breaks down for URLs included in custom REST responses. E.g. the > "allexecutors" REST call used from the Spark 2.1.0 application/executors page > returns links for log viewing that direct to the worker UI and do not work in > this scenario. > This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL > generation. Experiments indicate that most of this can simply be achieved by > using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase > system property. Beyond that, the places that require adaption are > - worker and application links in the master web UI > - webui URLs returned by REST interfaces > Note: It seems that returned redirect location headers do not need to be > adapted, since URL rewriting for these is commonly done in front-end proxies > and has a well-defined interface -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix
[ https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20044: Assignee: Apache Spark (was: Gengliang Wang) > Support Spark UI behind front-end reverse proxy using a path prefix > --- > > Key: SPARK-20044 > URL: https://issues.apache.org/jira/browse/SPARK-20044 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0 >Reporter: Oliver Koeth >Assignee: Apache Spark >Priority: Minor > Labels: reverse-proxy, sso > > Purpose: allow to run the Spark web UI behind a reverse proxy with URLs > prefixed by a context root, like www.mydomain.com/spark. In particular, this > allows to access multiple Spark clusters through the same virtual host, only > distinguishing them by context root, like www.mydomain.com/cluster1, > www.mydomain.com/cluster2, and it allows to run the Spark UI in a common > cookie domain (for SSO) with other services. > [SPARK-15487] introduced some support for front-end reverse proxies by > allowing all Spark UI requests to be routed through the master UI as a single > endpoint and also added a spark.ui.reverseProxyUrl setting to define a > another proxy sitting in front of Spark. However, as noted in the comments on > [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl > includes a context root like the examples above: Most links generated by the > Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not > account for a path prefix (context root) and work only if the Spark UI "owns" > the entire virtual host. In fact, the only place in the UI where the > reverseProxyUrl seems to be used is the back-link from the worker UI to the > master UI. 
> The discussion on [SPARK-15487] proposes to open a new issue for the problem, > but that does not seem to have happened, so this issue aims to address the > remaining shortcomings of spark.ui.reverseProxyUrl > The problem can be partially worked around by doing content rewrite in a > front-end proxy and prefixing src="/..." or href="/..." links with a context > root. However, detecting and patching URLs in HTML output is not a robust > approach and breaks down for URLs included in custom REST responses. E.g. the > "allexecutors" REST call used from the Spark 2.1.0 application/executors page > returns links for log viewing that direct to the worker UI and do not work in > this scenario. > This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL > generation. Experiments indicate that most of this can simply be achieved by > using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase > system property. Beyond that, the places that require adaption are > - worker and application links in the master web UI > - webui URLs returned by REST interfaces > Note: It seems that returned redirect location headers do not need to be > adapted, since URL rewriting for these is commonly done in front-end proxies > and has a well-defined interface -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix
[ https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-20044: --- Affects Version/s: 2.2.0 2.3.0 2.4.0 3.0.0 > Support Spark UI behind front-end reverse proxy using a path prefix > --- > > Key: SPARK-20044 > URL: https://issues.apache.org/jira/browse/SPARK-20044 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0 >Reporter: Oliver Koeth >Assignee: Gengliang Wang >Priority: Minor > Labels: reverse-proxy, sso > > Purpose: allow to run the Spark web UI behind a reverse proxy with URLs > prefixed by a context root, like www.mydomain.com/spark. In particular, this > allows to access multiple Spark clusters through the same virtual host, only > distinguishing them by context root, like www.mydomain.com/cluster1, > www.mydomain.com/cluster2, and it allows to run the Spark UI in a common > cookie domain (for SSO) with other services. > [SPARK-15487] introduced some support for front-end reverse proxies by > allowing all Spark UI requests to be routed through the master UI as a single > endpoint and also added a spark.ui.reverseProxyUrl setting to define a > another proxy sitting in front of Spark. However, as noted in the comments on > [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl > includes a context root like the examples above: Most links generated by the > Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not > account for a path prefix (context root) and work only if the Spark UI "owns" > the entire virtual host. In fact, the only place in the UI where the > reverseProxyUrl seems to be used is the back-link from the worker UI to the > master UI. 
> The discussion on [SPARK-15487] proposes to open a new issue for the problem, > but that does not seem to have happened, so this issue aims to address the > remaining shortcomings of spark.ui.reverseProxyUrl. > The problem can be partially worked around by doing content rewriting in a > front-end proxy and prefixing src="/..." or href="/..." links with a context > root. However, detecting and patching URLs in HTML output is not a robust > approach and breaks down for URLs included in custom REST responses. E.g. the > "allexecutors" REST call used from the Spark 2.1.0 application/executors page > returns links for log viewing that direct to the worker UI and do not work in > this scenario. > This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL > generation. Experiments indicate that most of this can simply be achieved by > using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase > system property. Beyond that, the places that require adaptation are > - worker and application links in the master web UI > - webui URLs returned by REST interfaces > Note: It seems that returned redirect location headers do not need to be > adapted, since URL rewriting for these is commonly done in front-end proxies > and has a well-defined interface. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
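The prefixing the issue proposes is mechanically simple, and a minimal sketch may help readers of the digest. This is a hedged Python illustration of the idea only, not Spark's actual implementation; the helper name `ui_url` and its signature are hypothetical:

```python
# Hypothetical sketch: every generated UI link passes through a helper that
# prepends the configured reverse-proxy context root (the value of
# spark.ui.reverseProxyUrl) to the absolute path the UI would otherwise emit.
def ui_url(path: str, reverse_proxy_url: str = "") -> str:
    """Prefix an absolute UI path with the proxy context root, if any."""
    prefix = reverse_proxy_url.rstrip("/")  # tolerate a trailing slash in config
    return prefix + path

# Without a prefix, the link only works if the Spark UI "owns" the whole host:
print(ui_url("/proxy/app-1/jobs/"))               # /proxy/app-1/jobs/
# With a context root, the same link works behind www.mydomain.com/cluster1:
print(ui_url("/proxy/app-1/jobs/", "/cluster1"))  # /cluster1/proxy/app-1/jobs/
```

The point of the sketch is that the fix belongs at URL-generation time, where the context root is known, rather than in fragile post-hoc HTML rewriting by the front-end proxy.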
[jira] [Reopened] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix
[ https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reopened SPARK-20044: Assignee: Gengliang Wang > Support Spark UI behind front-end reverse proxy using a path prefix > --- > > Key: SPARK-20044 > URL: https://issues.apache.org/jira/browse/SPARK-20044 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Oliver Koeth >Assignee: Gengliang Wang >Priority: Minor > Labels: reverse-proxy, sso > > Purpose: allow running the Spark web UI behind a reverse proxy with URLs > prefixed by a context root, like www.mydomain.com/spark. In particular, this > allows accessing multiple Spark clusters through the same virtual host, only > distinguishing them by context root, like www.mydomain.com/cluster1, > www.mydomain.com/cluster2, and it allows running the Spark UI in a common > cookie domain (for SSO) with other services. > [SPARK-15487] introduced some support for front-end reverse proxies by > allowing all Spark UI requests to be routed through the master UI as a single > endpoint and also added a spark.ui.reverseProxyUrl setting to define another > proxy sitting in front of Spark. However, as noted in the comments on > [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl > includes a context root like the examples above: Most links generated by the > Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not > account for a path prefix (context root) and work only if the Spark UI "owns" > the entire virtual host. In fact, the only place in the UI where the > reverseProxyUrl seems to be used is the back-link from the worker UI to the > master UI. 
> The discussion on [SPARK-15487] proposes to open a new issue for the problem, > but that does not seem to have happened, so this issue aims to address the > remaining shortcomings of spark.ui.reverseProxyUrl. > The problem can be partially worked around by doing content rewriting in a > front-end proxy and prefixing src="/..." or href="/..." links with a context > root. However, detecting and patching URLs in HTML output is not a robust > approach and breaks down for URLs included in custom REST responses. E.g. the > "allexecutors" REST call used from the Spark 2.1.0 application/executors page > returns links for log viewing that direct to the worker UI and do not work in > this scenario. > This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL > generation. Experiments indicate that most of this can simply be achieved by > using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase > system property. Beyond that, the places that require adaptation are > - worker and application links in the master web UI > - webui URLs returned by REST interfaces > Note: It seems that returned redirect location headers do not need to be > adapted, since URL rewriting for these is commonly done in front-end proxies > and has a well-defined interface. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33225) Extract AliasHelper trait
[ https://issues.apache.org/jira/browse/SPARK-33225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-33225. -- Fix Version/s: 3.1.0 Assignee: Tanel Kiis Resolution: Fixed Resolved by https://github.com/apache/spark/pull/30134 > Extract AliasHelper trait > - > > Key: SPARK-33225 > URL: https://issues.apache.org/jira/browse/SPARK-33225 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Assignee: Tanel Kiis >Priority: Major > Fix For: 3.1.0 > > > During SPARK-33122 we saw that there are several alias-related methods > duplicated between optimizers and analyzers. To keep that PR more concise, > let's extract AliasHelper in a separate PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33259) Joining 3 streams results in incorrect output
[ https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221425#comment-17221425 ] Michael commented on SPARK-33259: - Oh I see, thanks for pointing to the section in the documentation! For our team's use case, having this working would definitely be very useful, as we do a lot of such data joins... > Joining 3 streams results in incorrect output > - > > Key: SPARK-33259 > URL: https://issues.apache.org/jira/browse/SPARK-33259 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.1 >Reporter: Michael >Priority: Major > > I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN > B) INNER JOIN C) operation. Below you can see example code I [posted on > Stackoverflow|https://stackoverflow.com/questions/64503539/]... > I created a minimal example of "sessions" that have "start" and "end" events > and optionally some "metadata". > The script generates two outputs: {{sessionStartsWithMetadata}} results from > "start" events that are left-joined with the "metadata" events, based on > {{sessionId}}. A "left join" is used, since we'd like to get an output event > even when no corresponding metadata exists. > Additionally, a DataFrame {{endedSessionsWithMetadata}} is created by joining > "end" events to the previously created DataFrame. Here an "inner join" is > used, since we only want some output when a session has ended for sure. 
> This code can be executed in {{spark-shell}}:
> {code:scala}
> import java.sql.Timestamp
> import org.apache.spark.sql.execution.streaming.{MemoryStream, StreamingQueryWrapper}
> import org.apache.spark.sql.streaming.StreamingQuery
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.functions.{col, expr, lit}
> import spark.implicits._
> implicit val sqlContext: SQLContext = spark.sqlContext
> // Main data processing, regardless whether batch or stream processing
> def process(
>     sessionStartEvents: DataFrame,
>     sessionOptionalMetadataEvents: DataFrame,
>     sessionEndEvents: DataFrame
> ): (DataFrame, DataFrame) = {
>   val sessionStartsWithMetadata: DataFrame = sessionStartEvents
>     .join(
>       sessionOptionalMetadataEvents,
>       sessionStartEvents("sessionId") === sessionOptionalMetadataEvents("sessionId") &&
>         sessionStartEvents("sessionStartTimestamp").between(
>           sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL 1 seconds")),
>           sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL 1 seconds"))
>         ),
>       "left" // metadata is optional
>     )
>     .select(
>       sessionStartEvents("sessionId"),
>       sessionStartEvents("sessionStartTimestamp"),
>       sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
>     )
>   val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
>     sessionEndEvents,
>     sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") &&
>       sessionStartsWithMetadata("sessionStartTimestamp").between(
>         sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 seconds")),
>         sessionEndEvents("sessionEndTimestamp")
>       )
>   )
>   (sessionStartsWithMetadata, endedSessionsWithMetadata)
> }
> def streamProcessing(
>     sessionStartData: Seq[(Timestamp, Int)],
>     sessionOptionalMetadata: Seq[(Timestamp, Int)],
>     sessionEndData: Seq[(Timestamp, Int)]
> ): (StreamingQuery, StreamingQuery) = {
>   val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = MemoryStream[(Timestamp, Int)]
>   sessionStartEventsStream.addData(sessionStartData)
>   val sessionStartEvents: DataFrame = sessionStartEventsStream
>     .toDS()
>     .toDF("sessionStartTimestamp", "sessionId")
>     .withWatermark("sessionStartTimestamp", "1 second")
>   val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = MemoryStream[(Timestamp, Int)]
>   sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)
>   val sessionOptionalMetadataEvents: DataFrame = sessionOptionalMetadataEventsStream
>     .toDS()
>     .toDF("sessionOptionalMetadataTimestamp", "sessionId")
>     .withWatermark("sessionOptionalMetadataTimestamp", "1 second")
>   val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = MemoryStream[(Timestamp, Int)]
>   sessionEndEventsStream.addData(sessionEndData)
>   val sessionEndEvents: DataFrame = sessionEndEventsStream
>     .toDS()
>     .toDF("sessionEndTimestamp", "sessionId")
>     .withWatermark("sessionEndTimestamp", "1 second")
>   val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
>     process(sessionStartEvents, sessionOptionalMetadataEvents,
[jira] [Commented] (SPARK-33259) Joining 3 streams results in incorrect output
[ https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221409#comment-17221409 ] Jungtaek Lim commented on SPARK-33259: -- As you already figured out, this is a known limitation, and at least for now we ended up documenting the limitation to compensate. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#limitation-of-global-watermark This would require a major change to the concept of watermarks, so without significant demand it is unlikely to be addressed. > Joining 3 streams results in incorrect output > - > > Key: SPARK-33259 > URL: https://issues.apache.org/jira/browse/SPARK-33259 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.1 >Reporter: Michael >Priority: Major > > I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN > B) INNER JOIN C) operation. Below you can see example code I [posted on > Stackoverflow|https://stackoverflow.com/questions/64503539/]... > I created a minimal example of "sessions" that have "start" and "end" events > and optionally some "metadata". > The script generates two outputs: {{sessionStartsWithMetadata}} results from > "start" events that are left-joined with the "metadata" events, based on > {{sessionId}}. A "left join" is used, since we'd like to get an output event > even when no corresponding metadata exists. > Additionally, a DataFrame {{endedSessionsWithMetadata}} is created by joining > "end" events to the previously created DataFrame. Here an "inner join" is > used, since we only want some output when a session has ended for sure. 
> This code can be executed in {{spark-shell}}:
> {code:scala}
> import java.sql.Timestamp
> import org.apache.spark.sql.execution.streaming.{MemoryStream, StreamingQueryWrapper}
> import org.apache.spark.sql.streaming.StreamingQuery
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.functions.{col, expr, lit}
> import spark.implicits._
> implicit val sqlContext: SQLContext = spark.sqlContext
> // Main data processing, regardless whether batch or stream processing
> def process(
>     sessionStartEvents: DataFrame,
>     sessionOptionalMetadataEvents: DataFrame,
>     sessionEndEvents: DataFrame
> ): (DataFrame, DataFrame) = {
>   val sessionStartsWithMetadata: DataFrame = sessionStartEvents
>     .join(
>       sessionOptionalMetadataEvents,
>       sessionStartEvents("sessionId") === sessionOptionalMetadataEvents("sessionId") &&
>         sessionStartEvents("sessionStartTimestamp").between(
>           sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL 1 seconds")),
>           sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL 1 seconds"))
>         ),
>       "left" // metadata is optional
>     )
>     .select(
>       sessionStartEvents("sessionId"),
>       sessionStartEvents("sessionStartTimestamp"),
>       sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
>     )
>   val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
>     sessionEndEvents,
>     sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") &&
>       sessionStartsWithMetadata("sessionStartTimestamp").between(
>         sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 seconds")),
>         sessionEndEvents("sessionEndTimestamp")
>       )
>   )
>   (sessionStartsWithMetadata, endedSessionsWithMetadata)
> }
> def streamProcessing(
>     sessionStartData: Seq[(Timestamp, Int)],
>     sessionOptionalMetadata: Seq[(Timestamp, Int)],
>     sessionEndData: Seq[(Timestamp, Int)]
> ): (StreamingQuery, StreamingQuery) = {
>   val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = MemoryStream[(Timestamp, Int)]
>   sessionStartEventsStream.addData(sessionStartData)
>   val sessionStartEvents: DataFrame = sessionStartEventsStream
>     .toDS()
>     .toDF("sessionStartTimestamp", "sessionId")
>     .withWatermark("sessionStartTimestamp", "1 second")
>   val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = MemoryStream[(Timestamp, Int)]
>   sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)
>   val sessionOptionalMetadataEvents: DataFrame = sessionOptionalMetadataEventsStream
>     .toDS()
>     .toDF("sessionOptionalMetadataTimestamp", "sessionId")
>     .withWatermark("sessionOptionalMetadataTimestamp", "1 second")
>   val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = MemoryStream[(Timestamp, Int)]
>   sessionEndEventsStream.addData(sessionEndData)
>   val sessionEndEvents: DataFrame = sessionEndEventsStream
>     .toDS()
>     .toDF("sessionEndTimestamp",
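The ((A LEFT JOIN B) INNER JOIN C) session logic described in this issue can be approximated in batch terms with a small plain-Python analogue. This is a hedged illustration with made-up session IDs and timestamps, not the reporter's Spark code; timestamps are simplified to floats and the interval conditions mirror the `between` bounds above:

```python
# Batch analogue of: starts LEFT JOIN metadata (within +/- 1s),
# then INNER JOIN ends (end within 10s after start).
starts = [(1, 10.0), (2, 20.0)]   # (sessionId, startTs); session 2 has no metadata
metadata = [(1, 10.5)]            # (sessionId, metaTs) -- optional per session
ends = [(1, 15.0)]                # (sessionId, endTs); session 2 has not ended

# LEFT JOIN: keep every start, attach metadata within +/- 1 second if present
starts_with_meta = [
    (sid, ts, next((mts for mid, mts in metadata
                    if mid == sid and abs(mts - ts) <= 1.0), None))
    for sid, ts in starts
]

# INNER JOIN: emit only sessions whose end falls within 10 seconds of the start
ended = [
    (sid, ts, mts, ets)
    for sid, ts, mts in starts_with_meta
    for eid, ets in ends
    if eid == sid and 0 <= ets - ts <= 10.0
]

print(starts_with_meta)  # [(1, 10.0, 10.5), (2, 20.0, None)]
print(ended)             # [(1, 10.0, 10.5, 15.0)]
```

In batch mode both joins behave this way; the issue is that the streaming execution of the same `process` function drops rows, because the intermediate left join advances the global watermark past data the inner join still needs.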
[jira] [Commented] (SPARK-33138) unify temp view and permanent view behaviors
[ https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221392#comment-17221392 ] Wenchen Fan commented on SPARK-33138: - > It seems to me that the current behavior of re-optimizing the SQL plan at > query execution time, not at view creation time, is correct. This won't change. After the analysis phase, the view becomes a part of the query plan, and must be optimized/planned/executed together with the query plan. The captured configs can only take effect in the parsing and analysis phases, which I think makes sense: it keeps the view semantically consistent with when it was created. It would be better to capture only the parser/analyzer configs, but there seems to be no easy way to do that. > unify temp view and permanent view behaviors > > > Key: SPARK-33138 > URL: https://issues.apache.org/jira/browse/SPARK-33138 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > Currently, temp views store a mapping from the view name to its logicalPlan, > while permanent views stored in the HMS keep their original SQL text. > So for a permanent view, each time it is referenced, its SQL text > is parsed-analyzed-optimized-planned again with the current SQLConf and > SparkSession context, so its result might keep changing when the SQLConf and context > are different each time. > In order to unify the behaviors of temp and permanent views, it is proposed > that we keep the original SQL text for both temp and permanent views, and also > keep a record of the SQLConf when the view was created. Each time we refer to > the view, we use the snapshot SQLConf to parse-analyze-optimize-plan > the SQL text; in this way, we can make sure the output of the created view is > stable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
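The "capture the SQLConf at view creation, replay it on reference" proposal can be sketched in a few lines. This is a hypothetical Python model of the mechanism only (the class and function names are invented, and real views capture far more state than a dict):

```python
# Hypothetical model: a view stores its SQL text plus a frozen snapshot of the
# session config taken at creation time.
class View:
    def __init__(self, sql_text, conf_snapshot):
        self.sql_text = sql_text
        self.conf_snapshot = dict(conf_snapshot)  # copy, frozen at creation

def resolve(view, current_conf, parse):
    # Parse/analyze under the captured conf: captured values win over the
    # current session's, so the view's semantics stay stable over time.
    merged = {**current_conf, **view.conf_snapshot}
    return parse(view.sql_text, merged)

conf = {"spark.sql.ansi.enabled": "false"}
v = View("SELECT 1", conf)                 # snapshot taken here
conf["spark.sql.ansi.enabled"] = "true"    # session conf changes later

# A toy "parser" that just reports which conf value it saw:
plan = resolve(v, conf, lambda sql, c: (sql, c["spark.sql.ansi.enabled"]))
print(plan)  # ('SELECT 1', 'false') -- the view still sees the captured value
```

As the comment above notes, the captured configs would only matter during parsing/analysis; optimization and execution still happen with the enclosing query's context.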
[jira] [Assigned] (SPARK-33140) make Analyzer rules using SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-33140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33140: --- Assignee: Leanken.Lin > make Analyzer rules using SQLConf.get > - > > Key: SPARK-33140 > URL: https://issues.apache.org/jira/browse/SPARK-33140 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Assignee: Leanken.Lin >Priority: Major > > TODO -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33140) make Analyzer rules using SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-33140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33140. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30097 [https://github.com/apache/spark/pull/30097] > make Analyzer rules using SQLConf.get > - > > Key: SPARK-33140 > URL: https://issues.apache.org/jira/browse/SPARK-33140 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Assignee: Leanken.Lin >Priority: Major > Fix For: 3.1.0 > > > TODO -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23539) Add support for Kafka headers in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-23539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221334#comment-17221334 ] Dongjin Lee commented on SPARK-23539: - [~ckessler] Hi Calvin, It is totally up to the committers. > Add support for Kafka headers in Structured Streaming > - > > Key: SPARK-23539 > URL: https://issues.apache.org/jira/browse/SPARK-23539 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Dongjin Lee >Priority: Major > Fix For: 3.0.0 > > > Kafka headers were added in 0.11. We should expose them through our kafka > data source in both batch and streaming queries. > This is currently blocked on version of Kafka in Spark from 0.10.1 to 1.0+ > SPARK-18057 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33260) SortExec produces incorrect results if sortOrder is a Stream
[ https://issues.apache.org/jira/browse/SPARK-33260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221333#comment-17221333 ] Apache Spark commented on SPARK-33260: -- User 'ankurdave' has created a pull request for this issue: https://github.com/apache/spark/pull/30160 > SortExec produces incorrect results if sortOrder is a Stream > > > Key: SPARK-33260 > URL: https://issues.apache.org/jira/browse/SPARK-33260 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Ankur Dave >Assignee: Ankur Dave >Priority: Major > > The following query produces incorrect results. The query has two essential > features: (1) it contains a string aggregate, resulting in a {{SortExec}} > node, and (2) it contains a duplicate grouping key, causing > {{RemoveRepetitionFromGroupExpressions}} to produce a sort order stored as a > Stream. > SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string)) > FROM table_4 > GROUP BY bigint_col_1, bigint_col_9, bigint_col_9 > When the sort order is stored as a {{Stream}}, the line > {{ordering.map(_.child.genCode(ctx))}} in > {{GenerateOrdering#createOrderKeys()}} produces unpredictable side effects to > {{ctx}}. This is because {{genCode(ctx)}} modifies {{ctx}}. When {{ordering}} > is a {{Stream}}, the modifications will not happen immediately as intended, > but will instead occur lazily when the returned {{Stream}} is used later. > Similar bugs have occurred at least three times in the past: > https://issues.apache.org/jira/browse/SPARK-24500, > https://issues.apache.org/jira/browse/SPARK-25767, > https://issues.apache.org/jira/browse/SPARK-26680. > The fix is to check if {{ordering}} is a {{Stream}} and force the > modifications to happen immediately if so. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33260) SortExec produces incorrect results if sortOrder is a Stream
[ https://issues.apache.org/jira/browse/SPARK-33260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221332#comment-17221332 ] Apache Spark commented on SPARK-33260: -- User 'ankurdave' has created a pull request for this issue: https://github.com/apache/spark/pull/30160 > SortExec produces incorrect results if sortOrder is a Stream > > > Key: SPARK-33260 > URL: https://issues.apache.org/jira/browse/SPARK-33260 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Ankur Dave >Assignee: Ankur Dave >Priority: Major > > The following query produces incorrect results. The query has two essential > features: (1) it contains a string aggregate, resulting in a {{SortExec}} > node, and (2) it contains a duplicate grouping key, causing > {{RemoveRepetitionFromGroupExpressions}} to produce a sort order stored as a > Stream. > SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string)) > FROM table_4 > GROUP BY bigint_col_1, bigint_col_9, bigint_col_9 > When the sort order is stored as a {{Stream}}, the line > {{ordering.map(_.child.genCode(ctx))}} in > {{GenerateOrdering#createOrderKeys()}} produces unpredictable side effects to > {{ctx}}. This is because {{genCode(ctx)}} modifies {{ctx}}. When {{ordering}} > is a {{Stream}}, the modifications will not happen immediately as intended, > but will instead occur lazily when the returned {{Stream}} is used later. > Similar bugs have occurred at least three times in the past: > https://issues.apache.org/jira/browse/SPARK-24500, > https://issues.apache.org/jira/browse/SPARK-25767, > https://issues.apache.org/jira/browse/SPARK-26680. > The fix is to check if {{ordering}} is a {{Stream}} and force the > modifications to happen immediately if so. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33260) SortExec produces incorrect results if sortOrder is a Stream
[ https://issues.apache.org/jira/browse/SPARK-33260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-33260: --- Affects Version/s: (was: 2.4.7) 3.0.0 > SortExec produces incorrect results if sortOrder is a Stream > > > Key: SPARK-33260 > URL: https://issues.apache.org/jira/browse/SPARK-33260 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Ankur Dave >Assignee: Ankur Dave >Priority: Major > > The following query produces incorrect results. The query has two essential > features: (1) it contains a string aggregate, resulting in a {{SortExec}} > node, and (2) it contains a duplicate grouping key, causing > {{RemoveRepetitionFromGroupExpressions}} to produce a sort order stored as a > Stream. > SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string)) > FROM table_4 > GROUP BY bigint_col_1, bigint_col_9, bigint_col_9 > When the sort order is stored as a {{Stream}}, the line > {{ordering.map(_.child.genCode(ctx))}} in > {{GenerateOrdering#createOrderKeys()}} produces unpredictable side effects to > {{ctx}}. This is because {{genCode(ctx)}} modifies {{ctx}}. When {{ordering}} > is a {{Stream}}, the modifications will not happen immediately as intended, > but will instead occur lazily when the returned {{Stream}} is used later. > Similar bugs have occurred at least three times in the past: > https://issues.apache.org/jira/browse/SPARK-24500, > https://issues.apache.org/jira/browse/SPARK-25767, > https://issues.apache.org/jira/browse/SPARK-26680. > The fix is to check if {{ordering}} is a {{Stream}} and force the > modifications to happen immediately if so. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33260) SortExec produces incorrect results if sortOrder is a Stream
Ankur Dave created SPARK-33260: -- Summary: SortExec produces incorrect results if sortOrder is a Stream Key: SPARK-33260 URL: https://issues.apache.org/jira/browse/SPARK-33260 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 2.4.7 Reporter: Ankur Dave Assignee: Ankur Dave The following query produces incorrect results. The query has two essential features: (1) it contains a string aggregate, resulting in a {{SortExec}} node, and (2) it contains a duplicate grouping key, causing {{RemoveRepetitionFromGroupExpressions}} to produce a sort order stored as a Stream. SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string)) FROM table_4 GROUP BY bigint_col_1, bigint_col_9, bigint_col_9 When the sort order is stored as a {{Stream}}, the line {{ordering.map(_.child.genCode(ctx))}} in {{GenerateOrdering#createOrderKeys()}} produces unpredictable side effects to {{ctx}}. This is because {{genCode(ctx)}} modifies {{ctx}}. When {{ordering}} is a {{Stream}}, the modifications will not happen immediately as intended, but will instead occur lazily when the returned {{Stream}} is used later. Similar bugs have occurred at least three times in the past: https://issues.apache.org/jira/browse/SPARK-24500, https://issues.apache.org/jira/browse/SPARK-25767, https://issues.apache.org/jira/browse/SPARK-26680. The fix is to check if {{ordering}} is a {{Stream}} and force the modifications to happen immediately if so. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
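The lazy-evaluation pitfall behind this bug is easy to reproduce outside Spark. The following is a Python analogue, not Spark code (Python's `map` is fully lazy, whereas Scala's `Stream` evaluates only the head eagerly, but the failure mode is the same: side effects on shared mutable state are deferred until the sequence is consumed):

```python
# `ctx` stands in for Spark's mutable codegen context that genCode(ctx) mutates.
ctx = []

def gen_code(x):
    ctx.append(x)          # side effect on shared state, like genCode(ctx)
    return f"code_{x}"

lazy = map(gen_code, [1, 2, 3])   # analogous to ordering.map(_.child.genCode(ctx))
assert ctx == []                  # side effects have NOT happened yet

forced = list(lazy)               # forcing evaluation, as the proposed fix does
assert ctx == [1, 2, 3]           # now the mutations are visible
print(forced)                     # ['code_1', 'code_2', 'code_3']
```

This mirrors the fix described above: when the sort order is a `Stream`, force the `map` immediately so the modifications to `ctx` happen before the generated code is assembled.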
[jira] [Commented] (SPARK-22390) Aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221286#comment-17221286 ] Chang chen commented on SPARK-22390: Spark 3.0 already supports JDBC DataSource v2, so is there any update? > Aggregate push down > --- > > Key: SPARK-22390 > URL: https://issues.apache.org/jira/browse/SPARK-22390 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33259) Joining 3 streams results in incorrect output
[ https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael updated SPARK-33259: Description: I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN B) INNER JOIN C) operation. Below you can see the example code I [posted on Stackoverflow|https://stackoverflow.com/questions/64503539/]... I created a minimal example of "sessions" that have "start" and "end" events and optionally some "metadata". The script generates two outputs: {{sessionStartsWithMetadata}} results from "start" events that are left-joined with the "metadata" events, based on {{sessionId}}. A "left join" is used, since we'd like to get an output event even when no corresponding metadata exists. Additionally, a DataFrame {{endedSessionsWithMetadata}} is created by joining "end" events to the previously created DataFrame. Here an "inner join" is used, since we only want output once a session has ended for sure. This code can be executed in {{spark-shell}}:
{code:scala}
import java.sql.Timestamp
import org.apache.spark.sql.execution.streaming.{MemoryStream, StreamingQueryWrapper}
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{col, expr, lit}
import spark.implicits._

implicit val sqlContext: SQLContext = spark.sqlContext

// Main data processing, regardless of whether batch or stream processing
def process(
    sessionStartEvents: DataFrame,
    sessionOptionalMetadataEvents: DataFrame,
    sessionEndEvents: DataFrame
): (DataFrame, DataFrame) = {
  val sessionStartsWithMetadata: DataFrame = sessionStartEvents
    .join(
      sessionOptionalMetadataEvents,
      sessionStartEvents("sessionId") === sessionOptionalMetadataEvents("sessionId") &&
        sessionStartEvents("sessionStartTimestamp").between(
          sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr("INTERVAL 1 seconds")),
          sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr("INTERVAL 1 seconds"))
        ),
      "left" // metadata is optional
    )
    .select(
      sessionStartEvents("sessionId"),
      sessionStartEvents("sessionStartTimestamp"),
      sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
    )

  val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
    sessionEndEvents,
    sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") &&
      sessionStartsWithMetadata("sessionStartTimestamp").between(
        sessionEndEvents("sessionEndTimestamp").minus(expr("INTERVAL 10 seconds")),
        sessionEndEvents("sessionEndTimestamp")
      )
  )

  (sessionStartsWithMetadata, endedSessionsWithMetadata)
}

def streamProcessing(
    sessionStartData: Seq[(Timestamp, Int)],
    sessionOptionalMetadata: Seq[(Timestamp, Int)],
    sessionEndData: Seq[(Timestamp, Int)]
): (StreamingQuery, StreamingQuery) = {
  val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = MemoryStream[(Timestamp, Int)]
  sessionStartEventsStream.addData(sessionStartData)
  val sessionStartEvents: DataFrame = sessionStartEventsStream
    .toDS()
    .toDF("sessionStartTimestamp", "sessionId")
    .withWatermark("sessionStartTimestamp", "1 second")

  val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = MemoryStream[(Timestamp, Int)]
  sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)
  val sessionOptionalMetadataEvents: DataFrame = sessionOptionalMetadataEventsStream
    .toDS()
    .toDF("sessionOptionalMetadataTimestamp", "sessionId")
    .withWatermark("sessionOptionalMetadataTimestamp", "1 second")

  val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = MemoryStream[(Timestamp, Int)]
  sessionEndEventsStream.addData(sessionEndData)
  val sessionEndEvents: DataFrame = sessionEndEventsStream
    .toDS()
    .toDF("sessionEndTimestamp", "sessionId")
    .withWatermark("sessionEndTimestamp", "1 second")

  val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
    process(sessionStartEvents, sessionOptionalMetadataEvents, sessionEndEvents)

  val sessionStartsWithMetadataQuery = sessionStartsWithMetadata
    .select(lit("sessionStartsWithMetadata"), col("*")) // Add label to see which query's output it is
    .writeStream
    .outputMode("append")
    .format("console")
    .option("truncate", "false")
    .option("numRows", "1000")
    .start()

  val endedSessionsWithMetadataQuery = endedSessionsWithMetadata
    .select(lit("endedSessionsWithMetadata"), col("*")) // Add label to see which query's output it is
    .writeStream
    .outputMode("append")
    .format("console")
    .option("truncate", "false")
    .option("numRows", "1000")
    .start()

  (sessionStartsWithMetadataQuery, endedSessionsWithMetadataQuery)
}

def batchProcessing( sessionStartData: Seq[(Timestamp, Int)],
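The two join conditions in the report above can be sketched in plain Python, independent of Spark's streaming machinery. The helper names and the list-of-tuples representation below are illustrative only, not part of the report; timestamps are modeled as plain seconds.

```python
# Plain-Python sketch of the two interval joins described above: a LEFT join
# of session starts with optional metadata (match within +/- 1 second), then
# an INNER join with session ends (start within [end - 10s, end]).

def left_join_metadata(starts, metadata):
    """starts/metadata: lists of (session_id, timestamp_in_seconds)."""
    out = []
    for sid, start_ts in starts:
        matches = [m_ts for m_id, m_ts in metadata
                   if m_id == sid and m_ts - 1 <= start_ts <= m_ts + 1]
        # Left join: emit the start even when no metadata matched.
        out.append((sid, start_ts, matches[0] if matches else None))
    return out

def inner_join_ends(starts_with_meta, ends):
    out = []
    for sid, start_ts, meta_ts in starts_with_meta:
        for e_id, end_ts in ends:
            # Inner join: only sessions whose start falls within
            # [end - 10s, end] produce output.
            if e_id == sid and end_ts - 10 <= start_ts <= end_ts:
                out.append((sid, start_ts, meta_ts, end_ts))
    return out
```

For example, a start with no metadata still flows through the left join (with `None` in the metadata column), but only produces final output once a matching end event arrives.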
[jira] [Commented] (SPARK-33258) Add asc_nulls_* and desc_nulls_* methods to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221280#comment-17221280 ] Apache Spark commented on SPARK-33258: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/30159 > Add asc_nulls_* and desc_nulls_* methods to SparkR > -- > > Key: SPARK-33258 > URL: https://issues.apache.org/jira/browse/SPARK-33258 > Project: Spark > Issue Type: Bug > Components: R, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > At the moment Spark provides only > - {{asc}} > - {{desc}} > but {{NULL}} handling variants > - {{asc_nulls_first}} > - {{asc_nulls_last}} > - {{desc_nulls_first}} > - {{desc_nulls_last}} > are missing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33258) Add asc_nulls_* and desc_nulls_* methods to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33258: Assignee: Apache Spark > Add asc_nulls_* and desc_nulls_* methods to SparkR > -- > > Key: SPARK-33258 > URL: https://issues.apache.org/jira/browse/SPARK-33258 > Project: Spark > Issue Type: Bug > Components: R, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > > At the moment Spark provides only > - {{asc}} > - {{desc}} > but {{NULL}} handling variants > - {{asc_nulls_first}} > - {{asc_nulls_last}} > - {{desc_nulls_first}} > - {{desc_nulls_last}} > are missing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33258) Add asc_nulls_* and desc_nulls_* methods to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33258: Assignee: (was: Apache Spark) > Add asc_nulls_* and desc_nulls_* methods to SparkR > -- > > Key: SPARK-33258 > URL: https://issues.apache.org/jira/browse/SPARK-33258 > Project: Spark > Issue Type: Bug > Components: R, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > At the moment Spark provides only > - {{asc}} > - {{desc}} > but {{NULL}} handling variants > - {{asc_nulls_first}} > - {{asc_nulls_last}} > - {{desc_nulls_first}} > - {{desc_nulls_last}} > are missing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33259) Joining 3 streams results in incorrect output
Michael created SPARK-33259: --- Summary: Joining 3 streams results in incorrect output Key: SPARK-33259 URL: https://issues.apache.org/jira/browse/SPARK-33259 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.1 Reporter: Michael I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN B) JOIN C) operation. Below you can see example code I [posted on Stackoverflow|https://stackoverflow.com/questions/64503539/]...
[jira] [Created] (SPARK-33258) Add asc_nulls_* and desc_nulls_* methods to SparkR
Maciej Szymkiewicz created SPARK-33258: -- Summary: Add asc_nulls_* and desc_nulls_* methods to SparkR Key: SPARK-33258 URL: https://issues.apache.org/jira/browse/SPARK-33258 Project: Spark Issue Type: Bug Components: R, SQL Affects Versions: 3.1.0 Reporter: Maciej Szymkiewicz At the moment Spark provides only - {{asc}} - {{desc}} but {{NULL}} handling variants - {{asc_nulls_first}} - {{asc_nulls_last}} - {{desc_nulls_first}} - {{desc_nulls_last}} are missing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
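For reference, the NULL-ordering semantics the missing variants encode can be sketched with plain-Python sort keys, with `None` standing in for SQL `NULL`. The functions below are illustrative stand-ins that mirror the Spark names, not SparkR code:

```python
# Illustrative sketch of the four NULL-handling sort orders. Each key is a
# (nulls-go-last?, value) pair so that None rows sort before or after the
# non-null rows as the corresponding Spark function would place them.

def asc_nulls_first(values):
    # NULLs first, then ascending by value.
    return sorted(values, key=lambda v: (v is not None, v if v is not None else 0))

def asc_nulls_last(values):
    # Ascending by value, NULLs last.
    return sorted(values, key=lambda v: (v is None, v if v is not None else 0))

def desc_nulls_first(values):
    # NULLs first, then descending by value.
    return sorted(values, key=lambda v: (v is not None, -v if v is not None else 0))

def desc_nulls_last(values):
    # Descending by value, NULLs last.
    return sorted(values, key=lambda v: (v is None, -v if v is not None else 0))
```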
[jira] [Created] (SPARK-33257) Support Column inputs in PySpark ordering functions (asc*, desc*)
Maciej Szymkiewicz created SPARK-33257: -- Summary: Support Column inputs in PySpark ordering functions (asc*, desc*) Key: SPARK-33257 URL: https://issues.apache.org/jira/browse/SPARK-33257 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.1.0 Reporter: Maciej Szymkiewicz According to SPARK-26979, PySpark functions should support both {{Column}} and {{str}} arguments, when possible. However, the following ordering functions - {{asc}} - {{desc}} - {{asc_nulls_first}} - {{asc_nulls_last}} - {{desc_nulls_first}} - {{desc_nulls_last}} support only {{str}}. This is because the Scala side doesn't provide {{Column => Column}} variants. To fix this, we can do one of the following: - Call the corresponding {{Column}} methods, as [suggested|https://github.com/apache/spark/pull/30143#discussion_r512366978] by [~hyukjin.kwon] - Add the missing signatures on the Scala side. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
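The first suggested fix (delegating to the corresponding {{Column}} method) can be sketched as follows. The tiny `Column` class here is an illustrative stand-in, not PySpark's actual class:

```python
# Sketch of the str-or-Column dispatch pattern: accept either a column name
# (str) or a Column object and delegate to the Column method, so the
# function-level API works for both argument types.

class Column:
    """Stand-in for pyspark.sql.Column, for illustration only."""
    def __init__(self, name):
        self.name = name

    def asc_nulls_first(self):
        # Real PySpark would return a sort-ordered Column; a string
        # representation is enough to show the dispatch.
        return f"{self.name} ASC NULLS FIRST"

def asc_nulls_first(col):
    # Accept both str and Column, per the SPARK-26979 convention.
    column = Column(col) if isinstance(col, str) else col
    return column.asc_nulls_first()
```

The same wrapper shape would apply to the other five ordering functions; no Scala-side changes are needed for this variant.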
[jira] [Commented] (SPARK-33249) Add status plugin for live application
[ https://issues.apache.org/jira/browse/SPARK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221244#comment-17221244 ] Apache Spark commented on SPARK-33249: -- User 'kuwii' has created a pull request for this issue: https://github.com/apache/spark/pull/30158 > Add status plugin for live application > -- > > Key: SPARK-33249 > URL: https://issues.apache.org/jira/browse/SPARK-33249 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Affects Versions: 2.4.7, 3.0.1 >Reporter: Weiyi Kong >Priority: Minor > > There are cases where a developer may want to extend the current REST API of the Web > UI. In most cases, adding an external module is a better option than directly > editing the original Spark code. > For an external module to extend the REST API of the Web UI, two things may > need to be done: > * Add an extra API to provide extra status info. This can be simply done by > implementing another ApiRequestContext, which will be automatically loaded. > * If the info cannot be calculated from the original data in the store, add > extra listeners to generate it. > For the history server, there is an interface called AppHistoryServerPlugin, > which is loaded based on SPI and provides a method to create listeners. In a live > application, the only way is spark.extraListeners, based on > Utils.loadExtensions, but this is not enough for these cases. > To let the API get the status info, the data needs to be written to the > AppStatusStore, which is the only store that an API can get by accessing > "ui.store" or "ui.sc.statusStore". But listeners created by > Utils.loadExtensions only get a SparkConf at construction, and are unable to > write to the AppStatusStore. > So I think we still need a plugin like AppHistoryServerPlugin for the live UI. For > concerns like SPARK-22786, the plugin for the live app can be separated from the > history server one, and also loaded using Utils.loadExtensions with an extra > configuration. So by default, nothing will be loaded. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
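The proposed loading mechanism can be sketched in Python. The `spark.ui.statusPlugins` key and all other names below are hypothetical, meant only to illustrate the Utils.loadExtensions-style pattern of instantiating configured classes and handing their listeners the live status store:

```python
# Sketch of the proposed live-app status plugin loader: classes named in an
# extra configuration key are instantiated reflectively and asked to create
# listeners, receiving the status store that plain spark.extraListeners
# listeners (which only get a SparkConf) cannot reach.

def load_status_plugins(conf, status_store):
    """conf: dict of config keys; returns the listeners the plugins create."""
    listeners = []
    for class_path in conf.get("spark.ui.statusPlugins", "").split(","):
        if not class_path:
            continue  # nothing configured -> nothing loaded (the default)
        module_name, _, cls_name = class_path.rpartition(".")
        module = __import__(module_name, fromlist=[cls_name])
        plugin = getattr(module, cls_name)()
        # Unlike extraListeners, the plugin can write to the status store.
        listeners.extend(plugin.create_listeners(status_store))
    return listeners
```

With no configuration set, the loader returns an empty list, matching the "by default, nothing will be loaded" behavior described above.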
[jira] [Assigned] (SPARK-33249) Add status plugin for live application
[ https://issues.apache.org/jira/browse/SPARK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33249: Assignee: Apache Spark > Add status plugin for live application > -- > > Key: SPARK-33249 > URL: https://issues.apache.org/jira/browse/SPARK-33249 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Affects Versions: 2.4.7, 3.0.1 >Reporter: Weiyi Kong >Assignee: Apache Spark >Priority: Minor > > There are cases where a developer may want to extend the current REST API of the Web > UI. In most cases, adding an external module is a better option than directly > editing the original Spark code. > For an external module to extend the REST API of the Web UI, two things may > need to be done: > * Add an extra API to provide extra status info. This can be simply done by > implementing another ApiRequestContext, which will be automatically loaded. > * If the info cannot be calculated from the original data in the store, add > extra listeners to generate it. > For the history server, there is an interface called AppHistoryServerPlugin, > which is loaded based on SPI and provides a method to create listeners. In a live > application, the only way is spark.extraListeners, based on > Utils.loadExtensions, but this is not enough for these cases. > To let the API get the status info, the data needs to be written to the > AppStatusStore, which is the only store that an API can get by accessing > "ui.store" or "ui.sc.statusStore". But listeners created by > Utils.loadExtensions only get a SparkConf at construction, and are unable to > write to the AppStatusStore. > So I think we still need a plugin like AppHistoryServerPlugin for the live UI. For > concerns like SPARK-22786, the plugin for the live app can be separated from the > history server one, and also loaded using Utils.loadExtensions with an extra > configuration. So by default, nothing will be loaded. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org