[jira] [Commented] (SPARK-33230) FileOutputWriter jobs have duplicate JobIDs if launched in same second

2020-10-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221963#comment-17221963
 ] 

Dongjoon Hyun commented on SPARK-33230:
---

Thank you for sharing the status, [~ste...@apache.org]!

> FileOutputWriter jobs have duplicate JobIDs if launched in same second
> --
>
> Key: SPARK-33230
> URL: https://issues.apache.org/jira/browse/SPARK-33230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> The Hadoop S3A staging committer has problems when more than one Spark SQL 
> query is launched simultaneously, as it uses the job ID for its path in the 
> cluster filesystem to pass the commit information from tasks to the job 
> committer. 
> If two queries are launched in the same second, they conflict: the output of 
> job 1 includes all job 2 files written so far, and job 2 will fail with an 
> FNFE (FileNotFoundException).
> Proposed:
> in the job conf, set {{"spark.sql.sources.writeJobUUID"}} to the value of 
> {{WriteJobDescription.uuid}}.
> That was the property name which used to serve this purpose; any committers 
> already written to use this property will pick it up without needing any 
> changes.
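
A minimal sketch of how a committer could pick up the per-query UUID from the 
job conf, preferring it over the second-granularity Hadoop job ID. This is not 
the committer's actual code; the method name and fallback logic are 
illustrative only.

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.JobContext

// Sketch: prefer the per-query UUID over the (second-granularity) job ID.
def jobUniqueId(context: JobContext): String = {
  val conf: Configuration = context.getConfiguration
  val uuid = conf.getTrimmed("spark.sql.sources.writeJobUUID", "")
  if (uuid.nonEmpty) uuid else context.getJobID.toString
}
{code}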



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33183) Bug in optimizer rule EliminateSorts

2020-10-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33183.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30093
[https://github.com/apache/spark/pull/30093]

> Bug in optimizer rule EliminateSorts
> 
>
> Key: SPARK-33183
> URL: https://issues.apache.org/jira/browse/SPARK-33183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, the rule {{EliminateSorts}} removes a global sort node if its 
> child plan already satisfies the required sort order without checking if the 
> child plan's ordering is local or global. For example, in the following 
> scenario, the first sort shouldn't be removed because it has a stronger 
> guarantee than the second sort even if the sort orders are the same for both 
> sorts. 
> {code:java}
> Sort(orders, global = True, ...)
>   Sort(orders, global = False, ...){code}
>  
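
The distinction matters because a partition-local sort does not imply a total 
ordering of the result. A minimal DataFrame-level sketch, not taken from the 
ticket, assuming an active {{SparkSession}} named {{spark}}:

{code:java}
// sortWithinPartitions is a local sort (global = false);
// orderBy is a global sort (global = true).
val df = spark.range(100).toDF("id").repartition(4)
val locallySorted = df.sortWithinPartitions("id")
// The outer global sort must not be removed: without it, rows are only
// ordered within each partition, not across the whole result.
val globallySorted = locallySorted.orderBy("id")
{code}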



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33183) Bug in optimizer rule EliminateSorts

2020-10-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33183:
---

Assignee: Allison Wang

> Bug in optimizer rule EliminateSorts
> 
>
> Key: SPARK-33183
> URL: https://issues.apache.org/jira/browse/SPARK-33183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>
> Currently, the rule {{EliminateSorts}} removes a global sort node if its 
> child plan already satisfies the required sort order without checking if the 
> child plan's ordering is local or global. For example, in the following 
> scenario, the first sort shouldn't be removed because it has a stronger 
> guarantee than the second sort even if the sort orders are the same for both 
> sorts. 
> {code:java}
> Sort(orders, global = True, ...)
>   Sort(orders, global = False, ...){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33174) Migrate DROP TABLE to new resolution framework

2020-10-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33174:
---

Assignee: Terry Kim

> Migrate DROP TABLE to new resolution framework
> --
>
> Key: SPARK-33174
> URL: https://issues.apache.org/jira/browse/SPARK-33174
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
>
> Migrate "DROP TABLE t" to new resolution framework so that it has a 
> consistent resolution rule for v1/v2 commands.
> Currently, the following resolves to a table instead of a temp view:
> {code:java}
> sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
> sql("CREATE TEMPORARY VIEW t AS SELECT 2 as id")
> sql("USE testcat.ns")
> sql("DROP TABLE t") // 't' is resolved to testcat.ns.t
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33174) Migrate DROP TABLE to new resolution framework

2020-10-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33174.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30079
[https://github.com/apache/spark/pull/30079]

> Migrate DROP TABLE to new resolution framework
> --
>
> Key: SPARK-33174
> URL: https://issues.apache.org/jira/browse/SPARK-33174
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
> Fix For: 3.1.0
>
>
> Migrate "DROP TABLE t" to new resolution framework so that it has a 
> consistent resolution rule for v1/v2 commands.
> Currently, the following resolves to a table instead of a temp view:
> {code:java}
> sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
> sql("CREATE TEMPORARY VIEW t AS SELECT 2 as id")
> sql("USE testcat.ns")
> sql("DROP TABLE t") // 't' is resolved to testcat.ns.t
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33240) Fail fast when fails to instantiate configured v2 session catalog

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221955#comment-17221955
 ] 

Apache Spark commented on SPARK-33240:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/30167

> Fail fast when fails to instantiate configured v2 session catalog
> -
>
> Key: SPARK-33240
> URL: https://issues.apache.org/jira/browse/SPARK-33240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.1.0
>
>
> Spark currently falls back to the default catalog when it fails to 
> instantiate the configured v2 session catalog.
> The error log message says nothing about why the instantiation failed, and it 
> pollutes the log file (it is logged every time the catalog is resolved). More 
> importantly, the behavior is incorrect: end users intend Spark to use the 
> custom catalog they configured, yet Spark sometimes silently ignores it.
> We should simply fail in this case so that end users notice the failure 
> earlier and can fix the issue.
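
For context, the v2 session catalog is configured via 
{{spark.sql.catalog.spark_catalog}}; a sketch of such a setup, where a typo or 
a missing jar currently triggers the silent fallback described above. The 
catalog class name below is hypothetical.

{code:java}
import org.apache.spark.sql.SparkSession

// The catalog implementation class below is a placeholder, not a real class.
val spark = SparkSession.builder()
  .appName("custom-v2-session-catalog")
  .config("spark.sql.catalog.spark_catalog", "com.example.MySessionCatalog")
  .getOrCreate()
{code}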



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33240) Fail fast when fails to instantiate configured v2 session catalog

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221954#comment-17221954
 ] 

Apache Spark commented on SPARK-33240:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/30167

> Fail fast when fails to instantiate configured v2 session catalog
> -
>
> Key: SPARK-33240
> URL: https://issues.apache.org/jira/browse/SPARK-33240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.1.0
>
>
> Spark currently falls back to the default catalog when it fails to 
> instantiate the configured v2 session catalog.
> The error log message says nothing about why the instantiation failed, and it 
> pollutes the log file (it is logged every time the catalog is resolved). More 
> importantly, the behavior is incorrect: end users intend Spark to use the 
> custom catalog they configured, yet Spark sometimes silently ignores it.
> We should simply fail in this case so that end users notice the failure 
> earlier and can fix the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33240) Fail fast when fails to instantiate configured v2 session catalog

2020-10-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33240.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30147
[https://github.com/apache/spark/pull/30147]

> Fail fast when fails to instantiate configured v2 session catalog
> -
>
> Key: SPARK-33240
> URL: https://issues.apache.org/jira/browse/SPARK-33240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.1.0
>
>
> Spark currently falls back to the default catalog when it fails to 
> instantiate the configured v2 session catalog.
> The error log message says nothing about why the instantiation failed, and it 
> pollutes the log file (it is logged every time the catalog is resolved). More 
> importantly, the behavior is incorrect: end users intend Spark to use the 
> custom catalog they configured, yet Spark sometimes silently ignores it.
> We should simply fail in this case so that end users notice the failure 
> earlier and can fix the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33240) Fail fast when fails to instantiate configured v2 session catalog

2020-10-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33240:
---

Assignee: Jungtaek Lim

> Fail fast when fails to instantiate configured v2 session catalog
> -
>
> Key: SPARK-33240
> URL: https://issues.apache.org/jira/browse/SPARK-33240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> Spark currently falls back to the default catalog when it fails to 
> instantiate the configured v2 session catalog.
> The error log message says nothing about why the instantiation failed, and it 
> pollutes the log file (it is logged every time the catalog is resolved). More 
> importantly, the behavior is incorrect: end users intend Spark to use the 
> custom catalog they configured, yet Spark sometimes silently ignores it.
> We should simply fail in this case so that end users notice the failure 
> earlier and can fix the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33266) Add total duration, read duration, and write duration as task level metrics

2020-10-27 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-33266:
--
Description: 
Sometimes we need to identify performance bottlenecks, for example, how long it 
took to read from a data store and how long it took to write into another data 
store.

It would be great if we could have total duration, read duration, and write 
duration as task-level metrics.

Currently, it seems that neither `InputMetrics` nor `OutputMetrics` has any 
duration-related metrics.

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/InputMetrics.scala#L42-L58]

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/OutputMetrics.scala#L41-L56]

 

On the other hand, other metrics, such as `ShuffleWriteMetrics`, already 
include a write time. We might need similar metrics for input/output.

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ShuffleWriteMetrics.scala]

 

  was:
Sometimes we need to identify performance bottlenecks, for example, how long it 
took to read from data store, how long it took to write into another data store.

It would be great if we can have total duration, read duration, and write 
duration as task level metrics.

Currently it seems that both `InputMetrics` and `OutputMetrics` do not have 
duration related metrics.

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/InputMetrics.scala#L42-L58]

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/OutputMetrics.scala#L41-L56
]

On the other hand, other metrics such as `ShuffleWriteMetrics` has write time. 
We might need similar metrics for input/output.

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ShuffleWriteMetrics.scala]

 


> Add total duration, read duration, and write duration as task level metrics
> ---
>
> Key: SPARK-33266
> URL: https://issues.apache.org/jira/browse/SPARK-33266
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> Sometimes we need to identify performance bottlenecks, for example, how long 
> it took to read from a data store and how long it took to write into another 
> data store.
> It would be great if we could have total duration, read duration, and write 
> duration as task-level metrics.
> Currently, it seems that neither `InputMetrics` nor `OutputMetrics` has any 
> duration-related metrics.
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/InputMetrics.scala#L42-L58]
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/OutputMetrics.scala#L41-L56]
>  
> On the other hand, other metrics, such as `ShuffleWriteMetrics`, already 
> include a write time. We might need similar metrics for input/output.
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ShuffleWriteMetrics.scala]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33266) Add total duration, read duration, and write duration as task level metrics

2020-10-27 Thread Noritaka Sekiyama (Jira)
Noritaka Sekiyama created SPARK-33266:
-

 Summary: Add total duration, read duration, and write duration as 
task level metrics
 Key: SPARK-33266
 URL: https://issues.apache.org/jira/browse/SPARK-33266
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: Noritaka Sekiyama


Sometimes we need to identify performance bottlenecks, for example, how long it 
took to read from a data store and how long it took to write into another data 
store.

It would be great if we could have total duration, read duration, and write 
duration as task-level metrics.

Currently, it seems that neither `InputMetrics` nor `OutputMetrics` has any 
duration-related metrics.

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/InputMetrics.scala#L42-L58]

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/OutputMetrics.scala#L41-L56]

On the other hand, other metrics, such as `ShuffleWriteMetrics`, already 
include a write time. We might need similar metrics for input/output.

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ShuffleWriteMetrics.scala]
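
For comparison, a sketch (assuming an active {{SparkSession}} named {{spark}}) 
of the task-level metrics that exist today, observed through a 
{{SparkListener}}: a shuffle write time is available, but `InputMetrics` and 
`OutputMetrics` only expose byte/record counters, not durations.

{code:java}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class TaskMetricsLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      // Byte/record counters exist; duration counters for input/output do not.
      println(s"input bytes=${m.inputMetrics.bytesRead} " +
        s"output bytes=${m.outputMetrics.bytesWritten} " +
        s"shuffle write time (ns)=${m.shuffleWriteMetrics.writeTime}")
    }
  }
}

spark.sparkContext.addSparkListener(new TaskMetricsLogger)
{code}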

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33264) Add a dedicated page for SQL-on-file in SQL documents

2020-10-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-33264.
--
Fix Version/s: 3.1.0
   3.0.2
 Assignee: Takeshi Yamamuro
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/30165

> Add a dedicated page for SQL-on-file in SQL documents
> -
>
> Key: SPARK-33264
> URL: https://issues.apache.org/jira/browse/SPARK-33264
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> This ticket intends to add a dedicated page for SQL-on-file in SQL documents.
> This comes from the comment: 
> [https://github.com/apache/spark/pull/30095/files#r508965149]
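
For readers unfamiliar with the feature, "SQL-on-file" means querying a file 
path directly with a format-qualified identifier. A small sketch, assuming an 
active {{SparkSession}} named {{spark}}; the file paths are placeholders:

{code:java}
// Query files directly, without registering a table first.
spark.sql("SELECT * FROM parquet.`/path/to/data.parquet`").show()
spark.sql("SELECT * FROM csv.`/path/to/data.csv`").show()
{code}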



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33265) Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33265:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Rename classOf[Seq] to classOf[scala.collection.Seq] in 
> PostgresIntegrationSuite for Scala 2.13
> ---
>
> Key: SPARK-33265
> URL: https://issues.apache.org/jira/browse/SPARK-33265
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> In PostgresIntegrationSuite, evaluation of classOf[Seq].isAssignableFrom 
> fails due to a ClassCastException. 
> The cause is the same as the one resolved in SPARK-29292, but it happens at 
> test time, not compile time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33265) Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221898#comment-17221898
 ] 

Apache Spark commented on SPARK-33265:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/30166

> Rename classOf[Seq] to classOf[scala.collection.Seq] in 
> PostgresIntegrationSuite for Scala 2.13
> ---
>
> Key: SPARK-33265
> URL: https://issues.apache.org/jira/browse/SPARK-33265
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In PostgresIntegrationSuite, evaluation of classOf[Seq].isAssignableFrom 
> fails due to a ClassCastException. 
> The cause is the same as the one resolved in SPARK-29292, but it happens at 
> test time, not compile time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33265) Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33265:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Rename classOf[Seq] to classOf[scala.collection.Seq] in 
> PostgresIntegrationSuite for Scala 2.13
> ---
>
> Key: SPARK-33265
> URL: https://issues.apache.org/jira/browse/SPARK-33265
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> In PostgresIntegrationSuite, evaluation of classOf[Seq].isAssignableFrom 
> fails due to a ClassCastException. 
> The cause is the same as the one resolved in SPARK-29292, but it happens at 
> test time, not compile time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33265) Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33265:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Rename classOf[Seq] to classOf[scala.collection.Seq] in 
> PostgresIntegrationSuite for Scala 2.13
> ---
>
> Key: SPARK-33265
> URL: https://issues.apache.org/jira/browse/SPARK-33265
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In PostgresIntegrationSuite, evaluation of classOf[Seq].isAssignableFrom 
> fails due to a ClassCastException. 
> The cause is the same as the one resolved in SPARK-29292, but it happens at 
> test time, not compile time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33265) Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221899#comment-17221899
 ] 

Apache Spark commented on SPARK-33265:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/30166

> Rename classOf[Seq] to classOf[scala.collection.Seq] in 
> PostgresIntegrationSuite for Scala 2.13
> ---
>
> Key: SPARK-33265
> URL: https://issues.apache.org/jira/browse/SPARK-33265
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In PostgresIntegrationSuite, evaluation of classOf[Seq].isAssignableFrom 
> fails due to a ClassCastException. 
> The cause is the same as the one resolved in SPARK-29292, but it happens at 
> test time, not compile time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33265) Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13

2020-10-27 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-33265:
---
Priority: Minor  (was: Major)

> Rename classOf[Seq] to classOf[scala.collection.Seq] in 
> PostgresIntegrationSuite for Scala 2.13
> ---
>
> Key: SPARK-33265
> URL: https://issues.apache.org/jira/browse/SPARK-33265
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In PostgresIntegrationSuite, evaluation of classOf[Seq].isAssignableFrom 
> fails due to a ClassCastException. 
> The cause is the same as the one resolved in SPARK-29292, but it happens at 
> test time, not compile time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33265) Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13

2020-10-27 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-33265:
--

 Summary: Rename classOf[Seq] to classOf[scala.collection.Seq] in 
PostgresIntegrationSuite for Scala 2.13
 Key: SPARK-33265
 URL: https://issues.apache.org/jira/browse/SPARK-33265
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


In PostgresIntegrationSuite, evaluation of classOf[Seq].isAssignableFrom fails 
due to a ClassCastException. 
The cause is the same as the one resolved in SPARK-29292, but it happens at 
test time, not compile time.
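
For background: in Scala 2.13 the default {{Seq}} alias points to 
{{scala.collection.immutable.Seq}}, so a class check written against {{Seq}} 
no longer accepts a plain {{scala.collection.Seq}} value. A small standalone 
sketch of the difference:

{code:java}
// Scala 2.13: `Seq` means scala.collection.immutable.Seq.
val value: scala.collection.Seq[Int] = scala.collection.mutable.ArrayBuffer(1, 2, 3)

// Too narrow in 2.13: immutable.Seq is not a supertype of ArrayBuffer.
println(classOf[Seq[_]].isAssignableFrom(value.getClass))                   // false
// What the suite actually needs: the broader scala.collection.Seq.
println(classOf[scala.collection.Seq[_]].isAssignableFrom(value.getClass))  // true
{code}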



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot be opened

2020-10-27 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta reassigned SPARK-33204:
--

Assignee: akiyamaneko  (was: Apache Spark)

> `Event Timeline`  in Spark Job UI sometimes cannot be opened
> 
>
> Key: SPARK-33204
> URL: https://issues.apache.org/jira/browse/SPARK-33204
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Assignee: akiyamaneko
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: reproduce.gif
>
>
> The Event Timeline area cannot be expanded when a Spark application has some 
> failed jobs, as shown in the attached reproduce.gif.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33264) Add a dedicated page for SQL-on-file in SQL documents

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33264:


Assignee: (was: Apache Spark)

> Add a dedicated page for SQL-on-file in SQL documents
> -
>
> Key: SPARK-33264
> URL: https://issues.apache.org/jira/browse/SPARK-33264
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket intends to add a dedicated page for SQL-on-file in SQL documents.
> This comes from the comment: 
> [https://github.com/apache/spark/pull/30095/files#r508965149]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33264) Add a dedicated page for SQL-on-file in SQL documents

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221880#comment-17221880
 ] 

Apache Spark commented on SPARK-33264:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/30165

> Add a dedicated page for SQL-on-file in SQL documents
> -
>
> Key: SPARK-33264
> URL: https://issues.apache.org/jira/browse/SPARK-33264
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket intends to add a dedicated page for SQL-on-file in SQL documents.
> This comes from the comment: 
> [https://github.com/apache/spark/pull/30095/files#r508965149]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33264) Add a dedicated page for SQL-on-file in SQL documents

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33264:


Assignee: Apache Spark

> Add a dedicated page for SQL-on-file in SQL documents
> -
>
> Key: SPARK-33264
> URL: https://issues.apache.org/jira/browse/SPARK-33264
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Major
>
> This ticket intends to add a dedicated page for SQL-on-file in SQL documents.
> This comes from the comment: 
> [https://github.com/apache/spark/pull/30095/files#r508965149]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33264) Add a dedicated page for SQL-on-file in SQL documents

2020-10-27 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-33264:


 Summary: Add a dedicated page for SQL-on-file in SQL documents
 Key: SPARK-33264
 URL: https://issues.apache.org/jira/browse/SPARK-33264
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.0.2, 3.1.0
Reporter: Takeshi Yamamuro


This ticket intends to add a dedicated page for SQL-on-file in SQL documents.
This comes from the comment: 
[https://github.com/apache/spark/pull/30095/files#r508965149]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33258) Add asc_nulls_* and desc_nulls_* methods to SparkR

2020-10-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33258:


Assignee: Maciej Szymkiewicz

> Add asc_nulls_* and desc_nulls_* methods to SparkR
> --
>
> Key: SPARK-33258
> URL: https://issues.apache.org/jira/browse/SPARK-33258
> Project: Spark
>  Issue Type: Bug
>  Components: R, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> At the moment Spark provides only
> - {{asc}}
> - {{desc}}
> but {{NULL}} handling variants
> - {{asc_nulls_first}}
> - {{asc_nulls_last}}
> - {{desc_nulls_first}}
> - {{desc_nulls_last}}
> are missing.
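
For reference, the Scala {{Column}} API already exposes these NULL-ordering 
variants; a small sketch of their behavior, assuming an active 
{{SparkSession}} named {{spark}}. This ticket adds the SparkR counterparts.

{code:java}
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq(Some(2), None, Some(1)).toDF("x")

df.orderBy(col("x").asc_nulls_first).show()   // NULL row first, then 1, 2
df.orderBy(col("x").desc_nulls_last).show()   // 2, 1, then the NULL row last
{code}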



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33258) Add asc_nulls_* and desc_nulls_* methods to SparkR

2020-10-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33258.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30159
[https://github.com/apache/spark/pull/30159]

> Add asc_nulls_* and desc_nulls_* methods to SparkR
> --
>
> Key: SPARK-33258
> URL: https://issues.apache.org/jira/browse/SPARK-33258
> Project: Spark
>  Issue Type: Bug
>  Components: R, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> At the moment Spark provides only
> - {{asc}}
> - {{desc}}
> but {{NULL}} handling variants
> - {{asc_nulls_first}}
> - {{asc_nulls_last}}
> - {{desc_nulls_first}}
> - {{desc_nulls_last}}
> are missing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33228) Don't uncache data when replacing an existing view having the same plan

2020-10-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33228:
-
Fix Version/s: 2.4.8

> Don't uncache data when replacing an existing view having the same plan
> ---
>
> Key: SPARK-33228
> URL: https://issues.apache.org/jira/browse/SPARK-33228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> SPARK-30494 updated the `CreateViewCommand` code to implicitly drop the cache 
> when replacing an existing view. But this change drops the cache even when 
> replacing a view having the same logical plan. A sequence of queries to 
> reproduce this is as follows:
> {code}
> scala> val df = spark.range(1).selectExpr("id a", "id b")
> scala> df.cache()
> scala> df.explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>  +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>  +- *(1) Range (0, 1, step=1, splits=4)
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>  +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>  +- *(1) Range (0, 1, step=1, splits=4)
> // If one re-runs the same query `df.createOrReplaceTempView("t")`, the 
> // cache is swept away
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
> +- *(1) Range (0, 1, step=1, splits=4)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32090) UserDefinedType.equal() does not have symmetry

2020-10-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32090:
-
Fix Version/s: 3.0.2
   2.4.8

> UserDefinedType.equal() does not have symmetry 
> ---
>
> Key: SPARK-32090
> URL: https://issues.apache.org/jira/browse/SPARK-32090
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0, 2.3.0, 2.4.0, 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
> val udt1 = new ExampleBaseTypeUDT
> val udt2 = new ExampleSubTypeUDT
> println(udt1 == udt2) // true
> println(udt2 == udt1) // false
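
The asymmetry above is what one would expect from an equality check built on a 
one-directional subclass test over the user classes. A simplified standalone 
sketch (not the actual UDT code) of why such a check cannot be symmetric:

{code:java}
class Base
class Sub extends Base

val baseClass: Class[_] = classOf[Base]
val subClass: Class[_]  = classOf[Sub]

// isAssignableFrom is not symmetric, so neither is any equals() that
// delegates to something like a.userClass.isAssignableFrom(b.userClass).
println(baseClass.isAssignableFrom(subClass)) // true
println(subClass.isAssignableFrom(baseClass)) // false
{code}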



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33246) Spark SQL null semantics documentation is incorrect

2020-10-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-33246.
--
Fix Version/s: 3.1.0
   3.0.2
 Assignee: Stuart White
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/30161

> Spark SQL null semantics documentation is incorrect
> ---
>
> Key: SPARK-33246
> URL: https://issues.apache.org/jira/browse/SPARK-33246
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Stuart White
>Assignee: Stuart White
>Priority: Trivial
> Fix For: 3.0.2, 3.1.0
>
>
> The documentation of Spark SQL's null semantics is (I believe) incorrect.
> The documentation states that "NULL AND False" yields NULL, when in fact it 
> yields False.
> {noformat}
> Seq[(java.lang.Boolean, java.lang.Boolean)](
>   (true, null),
>   (false, null),
>   (null, true),
>   (null, false),
>   (null, null)
> )
>   .toDF("left_operand", "right_operand")
>   .withColumn("OR", 'left_operand || 'right_operand)
>   .withColumn("AND", 'left_operand && 'right_operand)
>   .show(truncate = false)
> +------------+-------------+----+-----+
> |left_operand|right_operand|OR  |AND  |
> +------------+-------------+----+-----+
> |true        |null         |true|null |
> |false       |null         |null|false|
> |null        |true         |true|null |
> |null        |false        |null|false|   <--- this line is incorrect in the docs
> |null        |null         |null|null |
> +------------+-------------+----+-----+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33246) Spark SQL null semantics documentation is incorrect

2020-10-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33246:
-
Affects Version/s: (was: 3.0.1)
   3.1.0
   3.0.2

> Spark SQL null semantics documentation is incorrect
> ---
>
> Key: SPARK-33246
> URL: https://issues.apache.org/jira/browse/SPARK-33246
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Stuart White
>Priority: Trivial
>
> The documentation of Spark SQL's null semantics is (I believe) incorrect.
> The documentation states that "NULL AND False" yields NULL, when in fact it 
> yields False.
> {noformat}
> Seq[(java.lang.Boolean, java.lang.Boolean)](
>   (true, null),
>   (false, null),
>   (null, true),
>   (null, false),
>   (null, null)
> )
>   .toDF("left_operand", "right_operand")
>   .withColumn("OR", 'left_operand || 'right_operand)
>   .withColumn("AND", 'left_operand && 'right_operand)
>   .show(truncate = false)
> +------------+-------------+----+-----+
> |left_operand|right_operand|OR  |AND  |
> +------------+-------------+----+-----+
> |true        |null         |true|null |
> |false       |null         |null|false|
> |null        |true         |true|null |
> |null        |false        |null|false|   <--- this line is incorrect in the docs
> |null        |null         |null|null |
> +------------+-------------+----+-----+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33246) Spark SQL null semantics documentation is incorrect

2020-10-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33246:
-
Component/s: SQL

> Spark SQL null semantics documentation is incorrect
> ---
>
> Key: SPARK-33246
> URL: https://issues.apache.org/jira/browse/SPARK-33246
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.0.1
>Reporter: Stuart White
>Priority: Trivial
>
> The documentation of Spark SQL's null semantics is (I believe) incorrect.
> The documentation states that "NULL AND False" yields NULL, when in fact it 
> yields False.
> {noformat}
> Seq[(java.lang.Boolean, java.lang.Boolean)](
>   (true, null),
>   (false, null),
>   (null, true),
>   (null, false),
>   (null, null)
> )
>   .toDF("left_operand", "right_operand")
>   .withColumn("OR", 'left_operand || 'right_operand)
>   .withColumn("AND", 'left_operand && 'right_operand)
>   .show(truncate = false)
> +------------+-------------+----+-----+
> |left_operand|right_operand|OR  |AND  |
> +------------+-------------+----+-----+
> |true        |null         |true|null |
> |false       |null         |null|false|
> |null        |true         |true|null |
> |null        |false        |null|false|   <--- this line is incorrect in the docs
> |null        |null         |null|null |
> +------------+-------------+----+-----+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33183) Bug in optimizer rule EliminateSorts

2020-10-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33183:
-
Issue Type: Bug  (was: Improvement)

> Bug in optimizer rule EliminateSorts
> 
>
> Key: SPARK-33183
> URL: https://issues.apache.org/jira/browse/SPARK-33183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, the rule {{EliminateSorts}} removes a global sort node if its 
> child plan already satisfies the required sort order without checking if the 
> child plan's ordering is local or global. For example, in the following 
> scenario, the first sort shouldn't be removed because it has a stronger 
> guarantee than the second sort even if the sort orders are the same for both 
> sorts. 
> {code:java}
> Sort(orders, global = True, ...)
>   Sort(orders, global = False, ...){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33183) Bug in optimizer rule EliminateSorts

2020-10-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33183:
-
Issue Type: Improvement  (was: Bug)

> Bug in optimizer rule EliminateSorts
> 
>
> Key: SPARK-33183
> URL: https://issues.apache.org/jira/browse/SPARK-33183
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, the rule {{EliminateSorts}} removes a global sort node if its 
> child plan already satisfies the required sort order without checking if the 
> child plan's ordering is local or global. For example, in the following 
> scenario, the first sort shouldn't be removed because it has a stronger 
> guarantee than the second sort even if the sort orders are the same for both 
> sorts. 
> {code:java}
> Sort(orders, global = True, ...)
>   Sort(orders, global = False, ...){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32919) Add support in Spark driver to coordinate the shuffle map stage in push-based shuffle by selecting external shuffle services for merging shuffle partitions

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32919:


Assignee: (was: Apache Spark)

> Add support in Spark driver to coordinate the shuffle map stage in push-based 
> shuffle by selecting external shuffle services for merging shuffle partitions
> ---
>
> Key: SPARK-32919
> URL: https://issues.apache.org/jira/browse/SPARK-32919
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> At the beginning of a shuffle map stage, the driver needs to select external 
> shuffle services as the mergers of the shuffle partitions for the 
> corresponding shuffle.
> We currently leverage the immediately available information about current and 
> past executor locations for this selection. Ideally, this would be behind a 
> pluggable interface so that we can potentially leverage information tracked 
> outside of a Spark application for better load balancing or for a 
> disaggregated deployment environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32919) Add support in Spark driver to coordinate the shuffle map stage in push-based shuffle by selecting external shuffle services for merging shuffle partitions

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32919:


Assignee: Apache Spark

> Add support in Spark driver to coordinate the shuffle map stage in push-based 
> shuffle by selecting external shuffle services for merging shuffle partitions
> ---
>
> Key: SPARK-32919
> URL: https://issues.apache.org/jira/browse/SPARK-32919
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Apache Spark
>Priority: Major
>
> At the beginning of a shuffle map stage, the driver needs to select external 
> shuffle services as the mergers of the shuffle partitions for the 
> corresponding shuffle.
> We currently leverage the immediately available information about current and 
> past executor locations for this selection. Ideally, this would be behind a 
> pluggable interface so that we can potentially leverage information tracked 
> outside of a Spark application for better load balancing or for a 
> disaggregated deployment environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32918) RPC implementation to support control plane coordination for push-based shuffle

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32918:


Assignee: (was: Apache Spark)

> RPC implementation to support control plane coordination for push-based 
> shuffle
> ---
>
> Key: SPARK-32918
> URL: https://issues.apache.org/jira/browse/SPARK-32918
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> RPCs to facilitate coordination of shuffle map/reduce stages. Notifications 
> to external shuffle services to finalize the shuffle block merge for a given 
> shuffle are carried through this RPC. It also sends the metadata about a 
> merged shuffle partition back to the caller.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32919) Add support in Spark driver to coordinate the shuffle map stage in push-based shuffle by selecting external shuffle services for merging shuffle partitions

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221822#comment-17221822
 ] 

Apache Spark commented on SPARK-32919:
--

User 'venkata91' has created a pull request for this issue:
https://github.com/apache/spark/pull/30164

> Add support in Spark driver to coordinate the shuffle map stage in push-based 
> shuffle by selecting external shuffle services for merging shuffle partitions
> ---
>
> Key: SPARK-32919
> URL: https://issues.apache.org/jira/browse/SPARK-32919
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> At the beginning of a shuffle map stage, the driver needs to select external 
> shuffle services as the mergers of the shuffle partitions for the 
> corresponding shuffle.
> We currently leverage the immediately available information about current and 
> past executor locations for this selection. Ideally, this would be behind a 
> pluggable interface so that we can potentially leverage information tracked 
> outside of a Spark application for better load balancing or for a 
> disaggregated deployment environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32918) RPC implementation to support control plane coordination for push-based shuffle

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221821#comment-17221821
 ] 

Apache Spark commented on SPARK-32918:
--

User 'zhouyejoe' has created a pull request for this issue:
https://github.com/apache/spark/pull/30163

> RPC implementation to support control plane coordination for push-based 
> shuffle
> ---
>
> Key: SPARK-32918
> URL: https://issues.apache.org/jira/browse/SPARK-32918
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> RPCs to facilitate coordination of shuffle map/reduce stages. Notifications 
> to external shuffle services to finalize the shuffle block merge for a given 
> shuffle are carried through this RPC. It also sends the metadata about a 
> merged shuffle partition back to the caller.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32918) RPC implementation to support control plane coordination for push-based shuffle

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32918:


Assignee: Apache Spark

> RPC implementation to support control plane coordination for push-based 
> shuffle
> ---
>
> Key: SPARK-32918
> URL: https://issues.apache.org/jira/browse/SPARK-32918
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Apache Spark
>Priority: Major
>
> RPCs to facilitate coordination of shuffle map/reduce stages. Notifications 
> to external shuffle services to finalize the shuffle block merge for a given 
> shuffle are carried through this RPC. It also sends the metadata about a 
> merged shuffle partition back to the caller.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32919) Add support in Spark driver to coordinate the shuffle map stage in push-based shuffle by selecting external shuffle services for merging shuffle partitions

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221824#comment-17221824
 ] 

Apache Spark commented on SPARK-32919:
--

User 'venkata91' has created a pull request for this issue:
https://github.com/apache/spark/pull/30164

> Add support in Spark driver to coordinate the shuffle map stage in push-based 
> shuffle by selecting external shuffle services for merging shuffle partitions
> ---
>
> Key: SPARK-32919
> URL: https://issues.apache.org/jira/browse/SPARK-32919
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> At the beginning of a shuffle map stage, the driver needs to select external 
> shuffle services as the mergers of the shuffle partitions for the 
> corresponding shuffle.
> We currently leverage the immediately available information about current and 
> past executor locations for this selection. Ideally, this would be behind a 
> pluggable interface so that we can potentially leverage information tracked 
> outside of a Spark application for better load balancing or for a 
> disaggregated deployment environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33263) Configurable StateStore compression codec

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221770#comment-17221770
 ] 

Apache Spark commented on SPARK-33263:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30162

> Configurable StateStore compression codec
> -
>
> Key: SPARK-33263
> URL: https://issues.apache.org/jira/browse/SPARK-33263
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently the compression codec of StateStore is not configurable and is 
> hard-coded to lz4. It would be better if we could follow other Spark modules 
> and make the StateStore compression codec configurable.
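
For comparison, other parts of Spark already expose the codec through 
configuration (for example {{spark.io.compression.codec}} for shuffle and 
broadcast data); the ask here is an analogous knob for the state store. A 
sketch of the existing pattern:

{code:java}
import org.apache.spark.sql.SparkSession

// Existing, configurable codec for Spark core data (shuffle, broadcast, ...).
// Supported values include lz4 (the default), lzf, snappy and zstd.
val spark = SparkSession.builder()
  .config("spark.io.compression.codec", "zstd")
  .getOrCreate()
{code}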



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33263) Configurable StateStore compression codec

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33263:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Configurable StateStore compression codec
> -
>
> Key: SPARK-33263
> URL: https://issues.apache.org/jira/browse/SPARK-33263
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Currently the compression codec of StateStore is not configurable and is 
> hard-coded to lz4. It would be better if we could follow other Spark modules 
> and make the StateStore compression codec configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33263) Configurable StateStore compression codec

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33263:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Configurable StateStore compression codec
> -
>
> Key: SPARK-33263
> URL: https://issues.apache.org/jira/browse/SPARK-33263
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently the compression codec of StateStore is not configurable and is 
> hard-coded to lz4. It would be better if, as in other Spark modules, the 
> compression codec of StateStore could be configured.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33263) Configurable StateStore compression codec

2020-10-27 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-33263:
---

 Summary: Configurable StateStore compression codec
 Key: SPARK-33263
 URL: https://issues.apache.org/jira/browse/SPARK-33263
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Structured Streaming
Affects Versions: 3.1.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


Currently the compression codec of StateStore is not configurable and is 
hard-coded to lz4. It would be better if, as in other Spark modules, the 
compression codec of StateStore could be configured.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33260) SortExec produces incorrect results if sortOrder is a Stream

2020-10-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221725#comment-17221725
 ] 

Dongjoon Hyun commented on SPARK-33260:
---

I added the `correctness` label.

> SortExec produces incorrect results if sortOrder is a Stream
> 
>
> Key: SPARK-33260
> URL: https://issues.apache.org/jira/browse/SPARK-33260
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.2, 3.1.0
>
>
> The following query produces incorrect results. The query has two essential 
> features: (1) it contains a string aggregate, resulting in a {{SortExec}} 
> node, and (2) it contains a duplicate grouping key, causing 
> {{RemoveRepetitionFromGroupExpressions}} to produce a sort order stored as a 
> Stream.
> SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
> FROM table_4
> GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
> When the sort order is stored as a {{Stream}}, the line 
> {{ordering.map(_.child.genCode(ctx))}} in 
> {{GenerateOrdering#createOrderKeys()}} produces unpredictable side effects on 
> {{ctx}}. This is because {{genCode(ctx)}} modifies {{ctx}}. When {{ordering}} 
> is a {{Stream}}, the modifications will not happen immediately as intended, 
> but will instead occur lazily when the returned {{Stream}} is used later.
> Similar bugs have occurred at least three times in the past: 
> https://issues.apache.org/jira/browse/SPARK-24500, 
> https://issues.apache.org/jira/browse/SPARK-25767, 
> https://issues.apache.org/jira/browse/SPARK-26680.
> The fix is to check if {{ordering}} is a {{Stream}} and force the 
> modifications to happen immediately if so.
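
The failure mode is reproducible with plain Scala collections, independent of Spark; 
a minimal sketch:

{code:scala}
// Side effects inside Stream.map are deferred, so mutations of a shared context
// happen later than the caller expects; forcing the collection (as the fix does)
// restores the intended ordering.
var evaluated = 0
val ordering = Stream(1, 2, 3)
val mapped = ordering.map { x => evaluated += 1; x * 2 }
println(s"after map:   evaluated = $evaluated")   // the tail has not been evaluated yet
val forced = mapped.toIndexedSeq                  // analogous to forcing the Stream eagerly
println(s"after force: evaluated = $evaluated")   // now all three elements have run
{code}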



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33260) SortExec produces incorrect results if sortOrder is a Stream

2020-10-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33260:
--
Labels: correctness  (was: )

> SortExec produces incorrect results if sortOrder is a Stream
> 
>
> Key: SPARK-33260
> URL: https://issues.apache.org/jira/browse/SPARK-33260
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.2, 3.1.0
>
>
> The following query produces incorrect results. The query has two essential 
> features: (1) it contains a string aggregate, resulting in a {{SortExec}} 
> node, and (2) it contains a duplicate grouping key, causing 
> {{RemoveRepetitionFromGroupExpressions}} to produce a sort order stored as a 
> Stream.
> SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
> FROM table_4
> GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
> When the sort order is stored as a {{Stream}}, the line 
> {{ordering.map(_.child.genCode(ctx))}} in 
> {{GenerateOrdering#createOrderKeys()}} produces unpredictable side effects on 
> {{ctx}}. This is because {{genCode(ctx)}} modifies {{ctx}}. When {{ordering}} 
> is a {{Stream}}, the modifications will not happen immediately as intended, 
> but will instead occur lazily when the returned {{Stream}} is used later.
> Similar bugs have occurred at least three times in the past: 
> https://issues.apache.org/jira/browse/SPARK-24500, 
> https://issues.apache.org/jira/browse/SPARK-25767, 
> https://issues.apache.org/jira/browse/SPARK-26680.
> The fix is to check if {{ordering}} is a {{Stream}} and force the 
> modifications to happen immediately if so.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33260) SortExec produces incorrect results if sortOrder is a Stream

2020-10-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33260.
---
Fix Version/s: 3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 30160
[https://github.com/apache/spark/pull/30160]

> SortExec produces incorrect results if sortOrder is a Stream
> 
>
> Key: SPARK-33260
> URL: https://issues.apache.org/jira/browse/SPARK-33260
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> The following query produces incorrect results. The query has two essential 
> features: (1) it contains a string aggregate, resulting in a {{SortExec}} 
> node, and (2) it contains a duplicate grouping key, causing 
> {{RemoveRepetitionFromGroupExpressions}} to produce a sort order stored as a 
> Stream.
> SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
> FROM table_4
> GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
> When the sort order is stored as a {{Stream}}, the line 
> {{ordering.map(_.child.genCode(ctx))}} in 
> {{GenerateOrdering#createOrderKeys()}} produces unpredictable side effects on 
> {{ctx}}. This is because {{genCode(ctx)}} modifies {{ctx}}. When {{ordering}} 
> is a {{Stream}}, the modifications will not happen immediately as intended, 
> but will instead occur lazily when the returned {{Stream}} is used later.
> Similar bugs have occurred at least three times in the past: 
> https://issues.apache.org/jira/browse/SPARK-24500, 
> https://issues.apache.org/jira/browse/SPARK-25767, 
> https://issues.apache.org/jira/browse/SPARK-26680.
> The fix is to check if {{ordering}} is a {{Stream}} and force the 
> modifications to happen immediately if so.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33262) Keep pending pods in account while scheduling new pods

2020-10-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33262.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30155
[https://github.com/apache/spark/pull/30155]

> Keep pending pods in account while scheduling new pods
> --
>
> Key: SPARK-33262
> URL: https://issues.apache.org/jira/browse/SPARK-33262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33231) Make podCreationTimeout configurable

2020-10-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33231.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30155
[https://github.com/apache/spark/pull/30155]

> Make podCreationTimeout configurable
> 
>
> Key: SPARK-33231
> URL: https://issues.apache.org/jira/browse/SPARK-33231
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.1.0
>
>
> The Executor Monitor & Pod Allocator have differing views of the world, which 
> can lead to pod thrashing.
> The executor monitor can be notified of an executor coming up before a 
> snapshot is delivered to the PodAllocator. This can cause the executor 
> monitor to believe it needs to delete a pod, and the pod allocator to believe 
> that it needs to create a new pod. This happens if the podCreationTimeout is 
> too low for the cluster. Currently podCreationTimeout can only be configured 
> by increasing the batch delay, but that has additional consequences leading to 
> slower spin-up.
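
A sketch of what "configurable" would mean for users; the new key name below is an 
assumption (the final name is up to the PR), while the batch delay key already exists:

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kubernetes.allocation.batch.delay", "1s")          // existing setting
  .set("spark.kubernetes.allocation.executor.timeout", "600s")   // hypothetical new setting
{code}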



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33231) Make podCreationTimeout configurable

2020-10-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33231:
-

Assignee: Holden Karau

> Make podCreationTimeout configurable
> 
>
> Key: SPARK-33231
> URL: https://issues.apache.org/jira/browse/SPARK-33231
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> The Executor Monitor & Pod Allocator have differing views of the world, which 
> can lead to pod thrashing.
> The executor monitor can be notified of an executor coming up before a 
> snapshot is delivered to the PodAllocator. This can cause the executor 
> monitor to believe it needs to delete a pod, and the pod allocator to believe 
> that it needs to create a new pod. This happens if the podCreationTimeout is 
> too low for the cluster. Currently podCreationTimeout can only be configured 
> by increasing the batch delay, but that has additional consequences leading to 
> slower spin-up.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33262) Keep pending pods in account while scheduling new pods

2020-10-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33262:
-

Assignee: Holden Karau

> Keep pending pods in account while scheduling new pods
> --
>
> Key: SPARK-33262
> URL: https://issues.apache.org/jira/browse/SPARK-33262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33262) Keep pending pods in account while scheduling new pods

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33262:


Assignee: (was: Apache Spark)

> Keep pending pods in account while scheduling new pods
> --
>
> Key: SPARK-33262
> URL: https://issues.apache.org/jira/browse/SPARK-33262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33262) Keep pending pods in account while scheduling new pods

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221644#comment-17221644
 ] 

Apache Spark commented on SPARK-33262:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30155

> Keep pending pods in account while scheduling new pods
> --
>
> Key: SPARK-33262
> URL: https://issues.apache.org/jira/browse/SPARK-33262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33262) Keep pending pods in account while scheduling new pods

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33262:


Assignee: Apache Spark

> Keep pending pods in account while scheduling new pods
> --
>
> Key: SPARK-33262
> URL: https://issues.apache.org/jira/browse/SPARK-33262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33262) Keep pending pods in account while scheduling new pods

2020-10-27 Thread Holden Karau (Jira)
Holden Karau created SPARK-33262:


 Summary: Keep pending pods in account while scheduling new pods
 Key: SPARK-33262
 URL: https://issues.apache.org/jira/browse/SPARK-33262
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.1.0
Reporter: Holden Karau






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33261) Allow people to extend the pod feature steps

2020-10-27 Thread Holden Karau (Jira)
Holden Karau created SPARK-33261:


 Summary: Allow people to extend the pod feature steps
 Key: SPARK-33261
 URL: https://issues.apache.org/jira/browse/SPARK-33261
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.1.0
Reporter: Holden Karau


While we allow people to specify pod templates, some deployments could benefit 
from being able to add a feature step.
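
A rough sketch of the idea, with simplified stand-in types (this is not Spark's 
internal KubernetesFeatureConfigStep API, just an illustration of a user-supplied 
feature step):

{code:scala}
// Simplified stand-in for a pod specification.
case class PodSpec(labels: Map[String, String], annotations: Map[String, String])

// Hypothetical extension point: a user-supplied step that mutates the pod spec,
// applied alongside the built-in feature steps.
trait UserPodFeatureStep {
  def configurePod(pod: PodSpec): PodSpec
}

// Example: a step that stamps every pod with a team label.
class TeamLabelStep extends UserPodFeatureStep {
  override def configurePod(pod: PodSpec): PodSpec =
    pod.copy(labels = pod.labels + ("team" -> "data-platform"))
}
{code}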



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33230) FileOutputWriter jobs have duplicate JobIDs if launched in same second

2020-10-27 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221581#comment-17221581
 ] 

Steve Loughran commented on SPARK-33230:


Thanks. Still got some changes to work through on my side to make sure there are 
no assumptions that the app attempt is unique.

For the curious, see HADOOP-17318:
* The staging committer uses the task attempt ID (jobId + taskId + task-attempt) for 
a path to the local temp dir.
* The magic committer uses the app attempt ID for the path under the dest/__magic dir. 
This is only an issue once that committer allows >1 job to write to the same dest.



> FileOutputWriter jobs have duplicate JobIDs if launched in same second
> --
>
> Key: SPARK-33230
> URL: https://issues.apache.org/jira/browse/SPARK-33230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> The Hadoop S3A staging committer has problems with >1 spark sql query being 
> launched simultaneously, as it uses the jobID for its path in the clusterFS 
> to pass the commit information from tasks to job committer. 
> If two queries are launched in the same second, they conflict and the output 
> of job 1 includes that of all job2 files written so far; job 2 will fail with 
> FNFE.
> Proposed:
> job conf to set {{"spark.sql.sources.writeJobUUID"}} to the value of 
> {{WriteJobDescription.uuid}}
> That was the property name which used to serve this purpose; any committers 
> already written which use this property will pick it up without needing any 
> changes.
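
A hedged sketch of how a committer might pick up that property, assuming 
{{"spark.sql.sources.writeJobUUID"}} is set per write job as proposed above; when 
the property is absent it falls back to the timestamp-based job ID (which is exactly 
what collides for jobs launched in the same second):

{code:scala}
import org.apache.hadoop.mapreduce.JobContext

def uniqueJobId(context: JobContext): String = {
  val conf = context.getConfiguration
  Option(conf.get("spark.sql.sources.writeJobUUID"))
    .filter(_.nonEmpty)
    .getOrElse(context.getJobID.toString)  // second-granularity ID, the source of the clashes
}
{code}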



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22390) Aggregate push down

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22390:


Assignee: Apache Spark

> Aggregate push down
> ---
>
> Key: SPARK-22390
> URL: https://issues.apache.org/jira/browse/SPARK-22390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22390) Aggregate push down

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22390:


Assignee: (was: Apache Spark)

> Aggregate push down
> ---
>
> Key: SPARK-22390
> URL: https://issues.apache.org/jira/browse/SPARK-22390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22390) Aggregate push down

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221548#comment-17221548
 ] 

Apache Spark commented on SPARK-22390:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29695

> Aggregate push down
> ---
>
> Key: SPARK-22390
> URL: https://issues.apache.org/jira/browse/SPARK-22390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22390) Aggregate push down

2020-10-27 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221546#comment-17221546
 ] 

Huaxin Gao commented on SPARK-22390:


Hi [~baibaichen], I am working on this. I put it under a different jira. I will 
link it here too.

> Aggregate push down
> ---
>
> Key: SPARK-22390
> URL: https://issues.apache.org/jira/browse/SPARK-22390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33246) Spark SQL null semantics documentation is incorrect

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33246:


Assignee: Apache Spark

> Spark SQL null semantics documentation is incorrect
> ---
>
> Key: SPARK-33246
> URL: https://issues.apache.org/jira/browse/SPARK-33246
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Stuart White
>Assignee: Apache Spark
>Priority: Trivial
>
> The documentation of Spark SQL's null semantics is (I believe) incorrect.
> The documentation states that "NULL AND False" yields NULL, when in fact it 
> yields False.
> {noformat}
> Seq[(java.lang.Boolean, java.lang.Boolean)](
>   (true, null),
>   (false, null),
>   (null, true),
>   (null, false),
>   (null, null)
> )
>   .toDF("left_operand", "right_operand")
>   .withColumn("OR", 'left_operand || 'right_operand)
>   .withColumn("AND", 'left_operand && 'right_operand)
>   .show(truncate = false)
> +------------+-------------+----+-----+
> |left_operand|right_operand|OR  |AND  |
> +------------+-------------+----+-----+
> |true        |null         |true|null |
> |false       |null         |null|false|
> |null        |true         |true|null |
> |null        |false        |null|false|  <--- this line is incorrect in the docs
> |null        |null         |null|null |
> +------------+-------------+----+-----+
> {noformat}
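
A quick way to confirm the three-valued logic in spark-shell (NULL AND FALSE is 
FALSE, and symmetrically NULL OR TRUE is TRUE):

{code:scala}
spark.sql("SELECT (NULL AND FALSE) AS null_and_false, (NULL OR TRUE) AS null_or_true").show()
// prints a single row: null_and_false = false, null_or_true = true
{code}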



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33138) unify temp view and permanent view behaviors

2020-10-27 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221506#comment-17221506
 ] 

Erik Krogen commented on SPARK-33138:
-

I see, thanks for clarifying [~cloud_fan]. 

> unify temp view and permanent view behaviors
> 
>
> Key: SPARK-33138
> URL: https://issues.apache.org/jira/browse/SPARK-33138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, a temp view stores a mapping from the temp view name to its 
> logicalPlan, while a permanent view stored in the HMS keeps its original SQL 
> text.
> So for a permanent view, each time the view is referenced, its SQL text is 
> parsed, analyzed, optimized and planned again with the current SQLConf and 
> SparkSession context, so its result might keep changing whenever the SQLConf 
> and context differ.
> In order to unify the behaviors of temp views and permanent views, it is 
> proposed that we keep the original SQL text for both temp and permanent 
> views, and also record the SQLConf at the time the view was created. Each 
> time the view is referenced, we use the snapshot SQLConf to parse, analyze, 
> optimize and plan the SQL text; in this way, we can make sure the output of 
> the created view is stable.
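
A hedged sketch of the proposal; the names below are illustrative, not Spark's 
internal API, and some confs may not be settable at runtime, but it shows the 
intended plan-under-a-snapshot behavior:

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Store the view's SQL text together with a snapshot of the relevant conf entries.
case class ViewDefinition(sqlText: String, capturedConf: Map[String, String])

def resolveView(spark: SparkSession, view: ViewDefinition): DataFrame = {
  // Remember the caller's current values so the session is left untouched afterwards.
  val previous = view.capturedConf.keys.map(k => k -> spark.conf.getOption(k)).toMap
  try {
    view.capturedConf.foreach { case (k, v) => spark.conf.set(k, v) }  // apply the snapshot
    spark.sql(view.sqlText)                                            // plan under the snapshot
  } finally {
    previous.foreach { case (k, old) => old.fold(spark.conf.unset(k))(spark.conf.set(k, _)) }
  }
}
{code}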



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33246) Spark SQL null semantics documentation is incorrect

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221507#comment-17221507
 ] 

Apache Spark commented on SPARK-33246:
--

User 'stwhit' has created a pull request for this issue:
https://github.com/apache/spark/pull/30161

> Spark SQL null semantics documentation is incorrect
> ---
>
> Key: SPARK-33246
> URL: https://issues.apache.org/jira/browse/SPARK-33246
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Stuart White
>Priority: Trivial
>
> The documentation of Spark SQL's null semantics is (I believe) incorrect.
> The documentation states that "NULL AND False" yields NULL, when in fact it 
> yields False.
> {noformat}
> Seq[(java.lang.Boolean, java.lang.Boolean)](
>   (true, null),
>   (false, null),
>   (null, true),
>   (null, false),
>   (null, null)
> )
>   .toDF("left_operand", "right_operand")
>   .withColumn("OR", 'left_operand || 'right_operand)
>   .withColumn("AND", 'left_operand && 'right_operand)
>   .show(truncate = false)
> +------------+-------------+----+-----+
> |left_operand|right_operand|OR  |AND  |
> +------------+-------------+----+-----+
> |true        |null         |true|null |
> |false       |null         |null|false|
> |null        |true         |true|null |
> |null        |false        |null|false|  <--- this line is incorrect in the docs
> |null        |null         |null|null |
> +------------+-------------+----+-----+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33246) Spark SQL null semantics documentation is incorrect

2020-10-27 Thread Stuart White (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart White updated SPARK-33246:
-
Attachment: (was: null-semantics.patch)

> Spark SQL null semantics documentation is incorrect
> ---
>
> Key: SPARK-33246
> URL: https://issues.apache.org/jira/browse/SPARK-33246
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Stuart White
>Priority: Trivial
>
> The documentation of Spark SQL's null semantics is (I believe) incorrect.
> The documentation states that "NULL AND False" yields NULL, when in fact it 
> yields False.
> {noformat}
> Seq[(java.lang.Boolean, java.lang.Boolean)](
>   (true, null),
>   (false, null),
>   (null, true),
>   (null, false),
>   (null, null)
> )
>   .toDF("left_operand", "right_operand")
>   .withColumn("OR", 'left_operand || 'right_operand)
>   .withColumn("AND", 'left_operand && 'right_operand)
>   .show(truncate = false)
> +------------+-------------+----+-----+
> |left_operand|right_operand|OR  |AND  |
> +------------+-------------+----+-----+
> |true        |null         |true|null |
> |false       |null         |null|false|
> |null        |true         |true|null |
> |null        |false        |null|false|  <--- this line is incorrect in the docs
> |null        |null         |null|null |
> +------------+-------------+----+-----+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33246) Spark SQL null semantics documentation is incorrect

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221505#comment-17221505
 ] 

Apache Spark commented on SPARK-33246:
--

User 'stwhit' has created a pull request for this issue:
https://github.com/apache/spark/pull/30161

> Spark SQL null semantics documentation is incorrect
> ---
>
> Key: SPARK-33246
> URL: https://issues.apache.org/jira/browse/SPARK-33246
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Stuart White
>Priority: Trivial
>
> The documentation of Spark SQL's null semantics is (I believe) incorrect.
> The documentation states that "NULL AND False" yields NULL, when in fact it 
> yields False.
> {noformat}
> Seq[(java.lang.Boolean, java.lang.Boolean)](
>   (true, null),
>   (false, null),
>   (null, true),
>   (null, false),
>   (null, null)
> )
>   .toDF("left_operand", "right_operand")
>   .withColumn("OR", 'left_operand || 'right_operand)
>   .withColumn("AND", 'left_operand && 'right_operand)
>   .show(truncate = false)
> +------------+-------------+----+-----+
> |left_operand|right_operand|OR  |AND  |
> +------------+-------------+----+-----+
> |true        |null         |true|null |
> |false       |null         |null|false|
> |null        |true         |true|null |
> |null        |false        |null|false|  <--- this line is incorrect in the docs
> |null        |null         |null|null |
> +------------+-------------+----+-----+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33246) Spark SQL null semantics documentation is incorrect

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33246:


Assignee: (was: Apache Spark)

> Spark SQL null semantics documentation is incorrect
> ---
>
> Key: SPARK-33246
> URL: https://issues.apache.org/jira/browse/SPARK-33246
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Stuart White
>Priority: Trivial
>
> The documentation of Spark SQL's null semantics is (I believe) incorrect.
> The documentation states that "NULL AND False" yields NULL, when in fact it 
> yields False.
> {noformat}
> Seq[(java.lang.Boolean, java.lang.Boolean)](
>   (true, null),
>   (false, null),
>   (null, true),
>   (null, false),
>   (null, null)
> )
>   .toDF("left_operand", "right_operand")
>   .withColumn("OR", 'left_operand || 'right_operand)
>   .withColumn("AND", 'left_operand && 'right_operand)
>   .show(truncate = false)
> +------------+-------------+----+-----+
> |left_operand|right_operand|OR  |AND  |
> +------------+-------------+----+-----+
> |true        |null         |true|null |
> |false       |null         |null|false|
> |null        |true         |true|null |
> |null        |false        |null|false|  <--- this line is incorrect in the docs
> |null        |null         |null|null |
> +------------+-------------+----+-----+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33137) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (PostgreSQL dialect)

2020-10-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33137:
---

Assignee: Huaxin Gao

> Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of 
> columns (PostgreSQL dialect)
> -
>
> Key: SPARK-33137
> URL: https://issues.apache.org/jira/browse/SPARK-33137
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Override the default SQL strings for:
> ALTER TABLE UPDATE COLUMN TYPE
> ALTER TABLE UPDATE COLUMN NULLABILITY
> in the PostgreSQL JDBC dialect, following the official PostgreSQL documentation.
> Write PostgreSQL integration tests for JDBC.
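
For reference, a sketch of the two PostgreSQL statements involved (the helper names 
below are illustrative and only loosely mirror the JDBC dialect hooks):

{code:scala}
def updateColumnTypeSql(table: String, column: String, newType: String): String =
  s"ALTER TABLE $table ALTER COLUMN $column TYPE $newType"

def updateColumnNullabilitySql(table: String, column: String, nullable: Boolean): String = {
  val action = if (nullable) "DROP NOT NULL" else "SET NOT NULL"
  s"ALTER TABLE $table ALTER COLUMN $column $action"
}

// e.g. updateColumnTypeSql("public.users", "age", "BIGINT")
//   => ALTER TABLE public.users ALTER COLUMN age TYPE BIGINT
{code}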



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33137) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (PostgreSQL dialect)

2020-10-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33137.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30089
[https://github.com/apache/spark/pull/30089]

> Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of 
> columns (PostgreSQL dialect)
> -
>
> Key: SPARK-33137
> URL: https://issues.apache.org/jira/browse/SPARK-33137
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.1.0
>
>
> Override the default SQL strings for:
> ALTER TABLE UPDATE COLUMN TYPE
> ALTER TABLE UPDATE COLUMN NULLABILITY
> in the PostgreSQL JDBC dialect, following the official PostgreSQL documentation.
> Write PostgreSQL integration tests for JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20044:


Assignee: Gengliang Wang  (was: Apache Spark)

> Support Spark UI behind front-end reverse proxy using a path prefix
> ---
>
> Key: SPARK-20044
> URL: https://issues.apache.org/jira/browse/SPARK-20044
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0
>Reporter: Oliver Koeth
>Assignee: Gengliang Wang
>Priority: Minor
>  Labels: reverse-proxy, sso
>
> Purpose: allow running the Spark web UI behind a reverse proxy with URLs 
> prefixed by a context root, like www.mydomain.com/spark. In particular, this 
> allows accessing multiple Spark clusters through the same virtual host, only 
> distinguishing them by context root, like www.mydomain.com/cluster1, 
> www.mydomain.com/cluster2, and it allows to run the Spark UI in a common 
> cookie domain (for SSO) with other services.
> [SPARK-15487] introduced some support for front-end reverse proxies by 
> allowing all Spark UI requests to be routed through the master UI as a single 
> endpoint and also added a spark.ui.reverseProxyUrl setting to define 
> another proxy sitting in front of Spark. However, as noted in the comments on 
> [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl 
> includes a context root like the examples above: Most links generated by the 
> Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not 
> account for a path prefix (context root) and work only if the Spark UI "owns" 
> the entire virtual host. In fact, the only place in the UI where the 
> reverseProxyUrl seems to be used is the back-link from the worker UI to the 
> master UI.
> The discussion on [SPARK-15487] proposes to open a new issue for the problem, 
> but that does not seem to have happened, so this issue aims to address the 
> remaining shortcomings of spark.ui.reverseProxyUrl
> The problem can be partially worked around by doing content rewrite in a 
> front-end proxy and prefixing src="/..." or href="/..." links with a context 
> root. However, detecting and patching URLs in HTML output is not a robust 
> approach and breaks down for URLs included in custom REST responses. E.g. the 
> "allexecutors" REST call used from the Spark 2.1.0 application/executors page 
> returns links for log viewing that direct to the worker UI and do not work in 
> this scenario.
> This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL 
> generation. Experiments indicate that most of this can simply be achieved by 
> using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase 
> system property. Beyond that, the places that require adaptation are
> - worker and application links in the master web UI
> - webui URLs returned by REST interfaces
> Note: It seems that returned redirect location headers do not need to be 
> adapted, since URL rewriting for these is commonly done in front-end proxies 
> and has a well-defined interface
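
For context, the two existing settings involved (values are illustrative); the issue 
is that a reverseProxyUrl containing a path prefix such as /cluster1 is not honored 
consistently in the generated UI links:

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.reverseProxy", "true")
  .set("spark.ui.reverseProxyUrl", "https://www.mydomain.com/cluster1")
{code}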



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix

2020-10-27 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221457#comment-17221457
 ] 

Gengliang Wang commented on SPARK-20044:


I am taking this over in https://github.com/apache/spark/pull/29820 

> Support Spark UI behind front-end reverse proxy using a path prefix
> ---
>
> Key: SPARK-20044
> URL: https://issues.apache.org/jira/browse/SPARK-20044
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0
>Reporter: Oliver Koeth
>Assignee: Apache Spark
>Priority: Minor
>  Labels: reverse-proxy, sso
>
> Purpose: allow running the Spark web UI behind a reverse proxy with URLs 
> prefixed by a context root, like www.mydomain.com/spark. In particular, this 
> allows accessing multiple Spark clusters through the same virtual host, only 
> distinguishing them by context root, like www.mydomain.com/cluster1, 
> www.mydomain.com/cluster2, and it allows to run the Spark UI in a common 
> cookie domain (for SSO) with other services.
> [SPARK-15487] introduced some support for front-end reverse proxies by 
> allowing all Spark UI requests to be routed through the master UI as a single 
> endpoint and also added a spark.ui.reverseProxyUrl setting to define 
> another proxy sitting in front of Spark. However, as noted in the comments on 
> [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl 
> includes a context root like the examples above: Most links generated by the 
> Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not 
> account for a path prefix (context root) and work only if the Spark UI "owns" 
> the entire virtual host. In fact, the only place in the UI where the 
> reverseProxyUrl seems to be used is the back-link from the worker UI to the 
> master UI.
> The discussion on [SPARK-15487] proposes to open a new issue for the problem, 
> but that does not seem to have happened, so this issue aims to address the 
> remaining shortcomings of spark.ui.reverseProxyUrl
> The problem can be partially worked around by doing content rewrite in a 
> front-end proxy and prefixing src="/..." or href="/..." links with a context 
> root. However, detecting and patching URLs in HTML output is not a robust 
> approach and breaks down for URLs included in custom REST responses. E.g. the 
> "allexecutors" REST call used from the Spark 2.1.0 application/executors page 
> returns links for log viewing that direct to the worker UI and do not work in 
> this scenario.
> This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL 
> generation. Experiments indicate that most of this can simply be achieved by 
> using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase 
> system property. Beyond that, the places that require adaptation are
> - worker and application links in the master web UI
> - webui URLs returned by REST interfaces
> Note: It seems that returned redirect location headers do not need to be 
> adapted, since URL rewriting for these is commonly done in front-end proxies 
> and has a well-defined interface



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20044:


Assignee: Apache Spark  (was: Gengliang Wang)

> Support Spark UI behind front-end reverse proxy using a path prefix
> ---
>
> Key: SPARK-20044
> URL: https://issues.apache.org/jira/browse/SPARK-20044
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0
>Reporter: Oliver Koeth
>Assignee: Apache Spark
>Priority: Minor
>  Labels: reverse-proxy, sso
>
> Purpose: allow running the Spark web UI behind a reverse proxy with URLs 
> prefixed by a context root, like www.mydomain.com/spark. In particular, this 
> allows accessing multiple Spark clusters through the same virtual host, only 
> distinguishing them by context root, like www.mydomain.com/cluster1, 
> www.mydomain.com/cluster2, and it allows to run the Spark UI in a common 
> cookie domain (for SSO) with other services.
> [SPARK-15487] introduced some support for front-end reverse proxies by 
> allowing all Spark UI requests to be routed through the master UI as a single 
> endpoint and also added a spark.ui.reverseProxyUrl setting to define 
> another proxy sitting in front of Spark. However, as noted in the comments on 
> [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl 
> includes a context root like the examples above: Most links generated by the 
> Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not 
> account for a path prefix (context root) and work only if the Spark UI "owns" 
> the entire virtual host. In fact, the only place in the UI where the 
> reverseProxyUrl seems to be used is the back-link from the worker UI to the 
> master UI.
> The discussion on [SPARK-15487] proposes to open a new issue for the problem, 
> but that does not seem to have happened, so this issue aims to address the 
> remaining shortcomings of spark.ui.reverseProxyUrl
> The problem can be partially worked around by doing content rewrite in a 
> front-end proxy and prefixing src="/..." or href="/..." links with a context 
> root. However, detecting and patching URLs in HTML output is not a robust 
> approach and breaks down for URLs included in custom REST responses. E.g. the 
> "allexecutors" REST call used from the Spark 2.1.0 application/executors page 
> returns links for log viewing that direct to the worker UI and do not work in 
> this scenario.
> This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL 
> generation. Experiments indicate that most of this can simply be achieved by 
> using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase 
> system property. Beyond that, the places that require adaptation are
> - worker and application links in the master web UI
> - webui URLs returned by REST interfaces
> Note: It seems that returned redirect location headers do not need to be 
> adapted, since URL rewriting for these is commonly done in front-end proxies 
> and has a well-defined interface



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix

2020-10-27 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-20044:
---
Affects Version/s: 2.2.0
   2.3.0
   2.4.0
   3.0.0

> Support Spark UI behind front-end reverse proxy using a path prefix
> ---
>
> Key: SPARK-20044
> URL: https://issues.apache.org/jira/browse/SPARK-20044
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0
>Reporter: Oliver Koeth
>Assignee: Gengliang Wang
>Priority: Minor
>  Labels: reverse-proxy, sso
>
> Purpose: allow running the Spark web UI behind a reverse proxy with URLs 
> prefixed by a context root, like www.mydomain.com/spark. In particular, this 
> allows accessing multiple Spark clusters through the same virtual host, only 
> distinguishing them by context root, like www.mydomain.com/cluster1, 
> www.mydomain.com/cluster2, and it allows to run the Spark UI in a common 
> cookie domain (for SSO) with other services.
> [SPARK-15487] introduced some support for front-end reverse proxies by 
> allowing all Spark UI requests to be routed through the master UI as a single 
> endpoint and also added a spark.ui.reverseProxyUrl setting to define 
> another proxy sitting in front of Spark. However, as noted in the comments on 
> [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl 
> includes a context root like the examples above: Most links generated by the 
> Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not 
> account for a path prefix (context root) and work only if the Spark UI "owns" 
> the entire virtual host. In fact, the only place in the UI where the 
> reverseProxyUrl seems to be used is the back-link from the worker UI to the 
> master UI.
> The discussion on [SPARK-15487] proposes to open a new issue for the problem, 
> but that does not seem to have happened, so this issue aims to address the 
> remaining shortcomings of spark.ui.reverseProxyUrl
> The problem can be partially worked around by doing content rewrite in a 
> front-end proxy and prefixing src="/..." or href="/..." links with a context 
> root. However, detecting and patching URLs in HTML output is not a robust 
> approach and breaks down for URLs included in custom REST responses. E.g. the 
> "allexecutors" REST call used from the Spark 2.1.0 application/executors page 
> returns links for log viewing that direct to the worker UI and do not work in 
> this scenario.
> This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL 
> generation. Experiments indicate that most of this can simply be achieved by 
> using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase 
> system property. Beyond that, the places that require adaptation are
> - worker and application links in the master web UI
> - webui URLs returned by REST interfaces
> Note: It seems that returned redirect location headers do not need to be 
> adapted, since URL rewriting for these is commonly done in front-end proxies 
> and has a well-defined interface



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix

2020-10-27 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reopened SPARK-20044:

  Assignee: Gengliang Wang

> Support Spark UI behind front-end reverse proxy using a path prefix
> ---
>
> Key: SPARK-20044
> URL: https://issues.apache.org/jira/browse/SPARK-20044
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Oliver Koeth
>Assignee: Gengliang Wang
>Priority: Minor
>  Labels: reverse-proxy, sso
>
> Purpose: allow running the Spark web UI behind a reverse proxy with URLs 
> prefixed by a context root, like www.mydomain.com/spark. In particular, this 
> allows accessing multiple Spark clusters through the same virtual host, only 
> distinguishing them by context root, like www.mydomain.com/cluster1, 
> www.mydomain.com/cluster2, and it allows to run the Spark UI in a common 
> cookie domain (for SSO) with other services.
> [SPARK-15487] introduced some support for front-end reverse proxies by 
> allowing all Spark UI requests to be routed through the master UI as a single 
> endpoint and also added a spark.ui.reverseProxyUrl setting to define 
> another proxy sitting in front of Spark. However, as noted in the comments on 
> [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl 
> includes a context root like the examples above: Most links generated by the 
> Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not 
> account for a path prefix (context root) and work only if the Spark UI "owns" 
> the entire virtual host. In fact, the only place in the UI where the 
> reverseProxyUrl seems to be used is the back-link from the worker UI to the 
> master UI.
> The discussion on [SPARK-15487] proposes to open a new issue for the problem, 
> but that does not seem to have happened, so this issue aims to address the 
> remaining shortcomings of spark.ui.reverseProxyUrl
> The problem can be partially worked around by doing content rewrite in a 
> front-end proxy and prefixing src="/..." or href="/..." links with a context 
> root. However, detecting and patching URLs in HTML output is not a robust 
> approach and breaks down for URLs included in custom REST responses. E.g. the 
> "allexecutors" REST call used from the Spark 2.1.0 application/executors page 
> returns links for log viewing that direct to the worker UI and do not work in 
> this scenario.
> This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL 
> generation. Experiments indicate that most of this can simply be achieved by 
> using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase 
> system property. Beyond that, the places that require adaptation are
> - worker and application links in the master web UI
> - webui URLs returned by REST interfaces
> Note: It seems that returned redirect location headers do not need to be 
> adapted, since URL rewriting for these is commonly done in front-end proxies 
> and has a well-defined interface



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33225) Extract AliasHelper trait

2020-10-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-33225.
--
Fix Version/s: 3.1.0
 Assignee: Tanel Kiis
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/30134#

> Extract AliasHelper trait
> -
>
> Key: SPARK-33225
> URL: https://issues.apache.org/jira/browse/SPARK-33225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Assignee: Tanel Kiis
>Priority: Major
> Fix For: 3.1.0
>
>
> During SPARK-33122 we saw that there are several alias-related methods 
> duplicated between the optimizer and the analyzer. To keep that PR more 
> concise, let's extract AliasHelper in a separate PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33259) Joining 3 streams results in incorrect output

2020-10-27 Thread Michael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221425#comment-17221425
 ] 

Michael commented on SPARK-33259:
-

Oh I see, thanks for pointing me to the section in the documentation!

For our team's use case, having this work would definitely be very useful, as 
we do a lot of such data joins...

> Joining 3 streams results in incorrect output
> -
>
> Key: SPARK-33259
> URL: https://issues.apache.org/jira/browse/SPARK-33259
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Michael
>Priority: Major
>
> I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN 
> B) INNER JOIN C) operation. Below you can see example code I [posted on 
> Stackoverflow|https://stackoverflow.com/questions/64503539/]...
> I created a minimal example of "sessions", that have "start" and "end" events 
> and optionally some "metadata".
> The script generates two outputs: {{sessionStartsWithMetadata}} result from 
> "start" events that are left-joined with the "metadata" events, based on 
> {{sessionId}}. A "left join" is used, since we like to get an output event 
> even when no corresponding metadata exists.
> Additionally a DataFrame {{endedSessionsWithMetadata}} is created by joining 
> "end" events to the previously created DataFrame. Here an "inner join" is 
> used, since we only want some output when a session has ended for sure.
> This code can be executed in {{spark-shell}}:
> {code:scala}
> import java.sql.Timestamp
> import org.apache.spark.sql.execution.streaming.{MemoryStream, 
> StreamingQueryWrapper}
> import org.apache.spark.sql.streaming.StreamingQuery
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.functions.{col, expr, lit}
> import spark.implicits._
> implicit val sqlContext: SQLContext = spark.sqlContext
> // Main data processing, regardless whether batch or stream processing
> def process(
> sessionStartEvents: DataFrame,
> sessionOptionalMetadataEvents: DataFrame,
> sessionEndEvents: DataFrame
> ): (DataFrame, DataFrame) = {
>   val sessionStartsWithMetadata: DataFrame = sessionStartEvents
> .join(
>   sessionOptionalMetadataEvents,
>   sessionStartEvents("sessionId") === 
> sessionOptionalMetadataEvents("sessionId") &&
> sessionStartEvents("sessionStartTimestamp").between(
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL
>  1 seconds")),
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL
>  1 seconds"))
> ),
>   "left" // metadata is optional
> )
> .select(
>   sessionStartEvents("sessionId"),
>   sessionStartEvents("sessionStartTimestamp"),
>   sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
> )
>   val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
> sessionEndEvents,
> sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") 
> &&
>   sessionStartsWithMetadata("sessionStartTimestamp").between(
> sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 
> seconds")),
> sessionEndEvents("sessionEndTimestamp")
>   )
>   )
>   (sessionStartsWithMetadata, endedSessionsWithMetadata)
> }
> def streamProcessing(
> sessionStartData: Seq[(Timestamp, Int)],
> sessionOptionalMetadata: Seq[(Timestamp, Int)],
> sessionEndData: Seq[(Timestamp, Int)]
> ): (StreamingQuery, StreamingQuery) = {
>   val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionStartEventsStream.addData(sessionStartData)
>   val sessionStartEvents: DataFrame = sessionStartEventsStream
> .toDS()
> .toDF("sessionStartTimestamp", "sessionId")
> .withWatermark("sessionStartTimestamp", "1 second")
>   val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)
>   val sessionOptionalMetadataEvents: DataFrame = 
> sessionOptionalMetadataEventsStream
> .toDS()
> .toDF("sessionOptionalMetadataTimestamp", "sessionId")
> .withWatermark("sessionOptionalMetadataTimestamp", "1 second")
>   val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionEndEventsStream.addData(sessionEndData)
>   val sessionEndEvents: DataFrame = sessionEndEventsStream
> .toDS()
> .toDF("sessionEndTimestamp", "sessionId")
> .withWatermark("sessionEndTimestamp", "1 second")
>   val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
> process(sessionStartEvents, sessionOptionalMetadataEvents, 
> 

[jira] [Commented] (SPARK-33259) Joining 3 streams results in incorrect output

2020-10-27 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221409#comment-17221409
 ] 

Jungtaek Lim commented on SPARK-33259:
--

As you already figured out, this is a known limitation, and at least for now we 
ended up documenting the limitation to compensate.

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#limitation-of-global-watermark

This would require a major change to the concept of watermarks, so without strong 
demand for it, it is unlikely to be addressed.

> Joining 3 streams results in incorrect output
> -
>
> Key: SPARK-33259
> URL: https://issues.apache.org/jira/browse/SPARK-33259
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Michael
>Priority: Major
>
> I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN 
> B) INNER JOIN C) operation. Below you can see example code I [posted on 
> Stackoverflow|https://stackoverflow.com/questions/64503539/]...
> I created a minimal example of "sessions" that have "start" and "end" events 
> and optionally some "metadata".
> The script generates two outputs: {{sessionStartsWithMetadata}} results from 
> "start" events that are left-joined with the "metadata" events, based on 
> {{sessionId}}. A "left join" is used, since we would like to get an output 
> event even when no corresponding metadata exists.
> Additionally, a DataFrame {{endedSessionsWithMetadata}} is created by joining 
> "end" events to the previously created DataFrame. Here an "inner join" is 
> used, since we only want output once a session has definitely ended.
> This code can be executed in {{spark-shell}}:
> {code:scala}
> import java.sql.Timestamp
> import org.apache.spark.sql.execution.streaming.{MemoryStream, 
> StreamingQueryWrapper}
> import org.apache.spark.sql.streaming.StreamingQuery
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.functions.{col, expr, lit}
> import spark.implicits._
> implicit val sqlContext: SQLContext = spark.sqlContext
> // Main data processing, regardless whether batch or stream processing
> def process(
> sessionStartEvents: DataFrame,
> sessionOptionalMetadataEvents: DataFrame,
> sessionEndEvents: DataFrame
> ): (DataFrame, DataFrame) = {
>   val sessionStartsWithMetadata: DataFrame = sessionStartEvents
> .join(
>   sessionOptionalMetadataEvents,
>   sessionStartEvents("sessionId") === 
> sessionOptionalMetadataEvents("sessionId") &&
> sessionStartEvents("sessionStartTimestamp").between(
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL
>  1 seconds")),
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL
>  1 seconds"))
> ),
>   "left" // metadata is optional
> )
> .select(
>   sessionStartEvents("sessionId"),
>   sessionStartEvents("sessionStartTimestamp"),
>   sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
> )
>   val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
> sessionEndEvents,
> sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") 
> &&
>   sessionStartsWithMetadata("sessionStartTimestamp").between(
> sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 
> seconds")),
> sessionEndEvents("sessionEndTimestamp")
>   )
>   )
>   (sessionStartsWithMetadata, endedSessionsWithMetadata)
> }
> def streamProcessing(
> sessionStartData: Seq[(Timestamp, Int)],
> sessionOptionalMetadata: Seq[(Timestamp, Int)],
> sessionEndData: Seq[(Timestamp, Int)]
> ): (StreamingQuery, StreamingQuery) = {
>   val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionStartEventsStream.addData(sessionStartData)
>   val sessionStartEvents: DataFrame = sessionStartEventsStream
> .toDS()
> .toDF("sessionStartTimestamp", "sessionId")
> .withWatermark("sessionStartTimestamp", "1 second")
>   val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)
>   val sessionOptionalMetadataEvents: DataFrame = 
> sessionOptionalMetadataEventsStream
> .toDS()
> .toDF("sessionOptionalMetadataTimestamp", "sessionId")
> .withWatermark("sessionOptionalMetadataTimestamp", "1 second")
>   val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionEndEventsStream.addData(sessionEndData)
>   val sessionEndEvents: DataFrame = sessionEndEventsStream
> .toDS()
> .toDF("sessionEndTimestamp", 

[jira] [Commented] (SPARK-33138) unify temp view and permanent view behaviors

2020-10-27 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221392#comment-17221392
 ] 

Wenchen Fan commented on SPARK-33138:
-

> It seems to me that the current behavior of re-optimizing the SQL plan at 
> query execution time, not at view creation time, is correct.

This won't change. After the analysis phase, the view becomes a part of the 
query plan, and must be optimized/planned/executed together with the query 
plan. The captured configs can only take effect in the parsing and analysis 
phase, which I think makes sense: it keeps the view semantically consistent with 
the time it was created.

It would be better to capture only the parser/analyzer configs, but there seems 
to be no easy way to do that.
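
To make the intended semantics concrete, here is a hedged sketch of the proposed 
behavior (not how Spark works today); it assumes a pre-existing table {{events}} 
with a column {{eventTime}}:

{code:scala}
// Hypothetical illustration of the proposal: the view snapshots the SQLConf at
// creation time and re-applies it when its stored SQL text is parsed/analyzed.
spark.conf.set("spark.sql.caseSensitive", "false")
spark.sql("CREATE TEMPORARY VIEW recent AS SELECT EVENTTIME FROM events")

// Later, someone flips an analyzer config in the same session.
spark.conf.set("spark.sql.caseSensitive", "true")

// Under the proposal, resolving `recent` re-analyzes its SQL text with the
// captured config (caseSensitive = false), so EVENTTIME still resolves against
// eventTime and the view keeps the semantics it had at creation time.
spark.sql("SELECT * FROM recent").show()
{code}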

> unify temp view and permanent view behaviors
> 
>
> Key: SPARK-33138
> URL: https://issues.apache.org/jira/browse/SPARK-33138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, a temp view stores a mapping from the view name to its logicalPlan, 
> while a permanent view stored in the HMS keeps its original SQL text.
> So when a permanent view is referenced, its SQL text is 
> parsed-analyzed-optimized-planned again with the current SQLConf and 
> SparkSession context, so its result may keep changing whenever the SQLConf and 
> context differ.
> In order to unify the behaviors of temp views and permanent views, it is 
> proposed that we keep the original SQL text for both, and also record the 
> SQLConf at the time the view was created. Each time the view is referenced, we 
> use the snapshot SQLConf to parse-analyze-optimize-plan the SQL text; this way 
> the output of the created view stays stable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33140) make Analyzer rules using SQLConf.get

2020-10-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33140:
---

Assignee: Leanken.Lin

> make Analyzer rules using SQLConf.get
> -
>
> Key: SPARK-33140
> URL: https://issues.apache.org/jira/browse/SPARK-33140
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
>
> TODO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33140) make Analyzer rules using SQLConf.get

2020-10-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33140.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30097
[https://github.com/apache/spark/pull/30097]

> make Analyzer rules using SQLConf.get
> -
>
> Key: SPARK-33140
> URL: https://issues.apache.org/jira/browse/SPARK-33140
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
> Fix For: 3.1.0
>
>
> TODO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23539) Add support for Kafka headers in Structured Streaming

2020-10-27 Thread Dongjin Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221334#comment-17221334
 ] 

Dongjin Lee commented on SPARK-23539:
-

[~ckessler] Hi Calvin, It is totally up to the committers.

> Add support for Kafka headers in Structured Streaming
> -
>
> Key: SPARK-23539
> URL: https://issues.apache.org/jira/browse/SPARK-23539
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Dongjin Lee
>Priority: Major
> Fix For: 3.0.0
>
>
> Kafka headers were added in 0.11. We should expose them through our Kafka 
> data source in both batch and streaming queries. 
> This is currently blocked on upgrading the Kafka version in Spark from 0.10.1 
> to 1.0+ (SPARK-18057).
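
For illustration, a minimal sketch of how record headers are surfaced once this 
feature is available (assuming Spark 3.0+, a broker at localhost:9092 and a topic 
named "events"; the {{includeHeaders}} option and the resulting {{headers}} column 
are as described in the Structured Streaming + Kafka integration guide):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-headers-sketch").getOrCreate()

// Reading with includeHeaders adds a `headers` column: array<struct<key, value>>.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("includeHeaders", "true")
  .load()

// Project key, value and headers to the console sink just to show the shape.
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
  .writeStream
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination()
{code}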



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33260) SortExec produces incorrect results if sortOrder is a Stream

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221333#comment-17221333
 ] 

Apache Spark commented on SPARK-33260:
--

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/30160

> SortExec produces incorrect results if sortOrder is a Stream
> 
>
> Key: SPARK-33260
> URL: https://issues.apache.org/jira/browse/SPARK-33260
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
>
> The following query produces incorrect results. The query has two essential 
> features: (1) it contains a string aggregate, resulting in a {{SortExec}} 
> node, and (2) it contains a duplicate grouping key, causing 
> {{RemoveRepetitionFromGroupExpressions}} to produce a sort order stored as a 
> Stream.
> SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
> FROM table_4
> GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
> When the sort order is stored as a {{Stream}}, the line 
> {{ordering.map(_.child.genCode(ctx))}} in 
> {{GenerateOrdering#createOrderKeys()}} produces unpredictable side effects to 
> {{ctx}}. This is because {{genCode(ctx)}} modifies {{ctx}}. When {{ordering}} 
> is a {{Stream}}, the modifications will not happen immediately as intended, 
> but will instead occur lazily when the returned {{Stream}} is used later.
> Similar bugs have occurred at least three times in the past: 
> https://issues.apache.org/jira/browse/SPARK-24500, 
> https://issues.apache.org/jira/browse/SPARK-25767, 
> https://issues.apache.org/jira/browse/SPARK-26680.
> The fix is to check if {{ordering}} is a {{Stream}} and force the 
> modifications to happen immediately if so.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33260) SortExec produces incorrect results if sortOrder is a Stream

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221332#comment-17221332
 ] 

Apache Spark commented on SPARK-33260:
--

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/30160

> SortExec produces incorrect results if sortOrder is a Stream
> 
>
> Key: SPARK-33260
> URL: https://issues.apache.org/jira/browse/SPARK-33260
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
>
> The following query produces incorrect results. The query has two essential 
> features: (1) it contains a string aggregate, resulting in a {{SortExec}} 
> node, and (2) it contains a duplicate grouping key, causing 
> {{RemoveRepetitionFromGroupExpressions}} to produce a sort order stored as a 
> Stream.
> SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
> FROM table_4
> GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
> When the sort order is stored as a {{Stream}}, the line 
> {{ordering.map(_.child.genCode(ctx))}} in 
> {{GenerateOrdering#createOrderKeys()}} produces unpredictable side effects to 
> {{ctx}}. This is because {{genCode(ctx)}} modifies {{ctx}}. When {{ordering}} 
> is a {{Stream}}, the modifications will not happen immediately as intended, 
> but will instead occur lazily when the returned {{Stream}} is used later.
> Similar bugs have occurred at least three times in the past: 
> https://issues.apache.org/jira/browse/SPARK-24500, 
> https://issues.apache.org/jira/browse/SPARK-25767, 
> https://issues.apache.org/jira/browse/SPARK-26680.
> The fix is to check if {{ordering}} is a {{Stream}} and force the 
> modifications to happen immediately if so.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33260) SortExec produces incorrect results if sortOrder is a Stream

2020-10-27 Thread Ankur Dave (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-33260:
---
Affects Version/s: (was: 2.4.7)
   3.0.0

> SortExec produces incorrect results if sortOrder is a Stream
> 
>
> Key: SPARK-33260
> URL: https://issues.apache.org/jira/browse/SPARK-33260
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Major
>
> The following query produces incorrect results. The query has two essential 
> features: (1) it contains a string aggregate, resulting in a {{SortExec}} 
> node, and (2) it contains a duplicate grouping key, causing 
> {{RemoveRepetitionFromGroupExpressions}} to produce a sort order stored as a 
> Stream.
> SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
> FROM table_4
> GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
> When the sort order is stored as a {{Stream}}, the line 
> {{ordering.map(_.child.genCode(ctx))}} in 
> {{GenerateOrdering#createOrderKeys()}} produces unpredictable side effects to 
> {{ctx}}. This is because {{genCode(ctx)}} modifies {{ctx}}. When {{ordering}} 
> is a {{Stream}}, the modifications will not happen immediately as intended, 
> but will instead occur lazily when the returned {{Stream}} is used later.
> Similar bugs have occurred at least three times in the past: 
> https://issues.apache.org/jira/browse/SPARK-24500, 
> https://issues.apache.org/jira/browse/SPARK-25767, 
> https://issues.apache.org/jira/browse/SPARK-26680.
> The fix is to check if {{ordering}} is a {{Stream}} and force the 
> modifications to happen immediately if so.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33260) SortExec produces incorrect results if sortOrder is a Stream

2020-10-27 Thread Ankur Dave (Jira)
Ankur Dave created SPARK-33260:
--

 Summary: SortExec produces incorrect results if sortOrder is a 
Stream
 Key: SPARK-33260
 URL: https://issues.apache.org/jira/browse/SPARK-33260
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 2.4.7
Reporter: Ankur Dave
Assignee: Ankur Dave


The following query produces incorrect results. The query has two essential 
features: (1) it contains a string aggregate, resulting in a {{SortExec}} node, 
and (2) it contains a duplicate grouping key, causing 
{{RemoveRepetitionFromGroupExpressions}} to produce a sort order stored as a 
Stream.

SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
FROM table_4
GROUP BY bigint_col_1, bigint_col_9, bigint_col_9

When the sort order is stored as a {{Stream}}, the line 
{{ordering.map(_.child.genCode(ctx))}} in 
{{GenerateOrdering#createOrderKeys()}} produces unpredictable side effects to 
{{ctx}}. This is because {{genCode(ctx)}} modifies {{ctx}}. When {{ordering}} 
is a {{Stream}}, the modifications will not happen immediately as intended, but 
will instead occur lazily when the returned {{Stream}} is used later.

Similar bugs have occurred at least three times in the past: 
https://issues.apache.org/jira/browse/SPARK-24500, 
https://issues.apache.org/jira/browse/SPARK-25767, 
https://issues.apache.org/jira/browse/SPARK-26680.

The fix is to check if {{ordering}} is a {{Stream}} and force the modifications 
to happen immediately if so.
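
For a feel of the underlying Scala (2.12) pitfall in isolation, here is a small 
self-contained sketch in plain Scala; the names {{ctx}} and {{genCode}} are 
stand-ins for illustration, not Spark's actual codegen classes:

{code:scala}
// genCode has a side effect on a shared mutable "context", just like
// child.genCode(ctx) mutates the CodegenContext in Spark.
val ctx = new StringBuilder

def genCode(i: Int): String = {
  ctx.append(s"code$i;")   // side effect on the shared context
  s"expr$i"                // returned "expression"
}

val ordering: Seq[Int] = Stream(1, 2, 3)   // a sort order that happens to be a Stream
val keys = ordering.map(genCode)           // lazy: only the head is evaluated here

println(ctx)            // "code1;" -- code for elements 2 and 3 has not been generated
keys.foreach(_ => ())   // forcing the Stream runs the deferred calls
println(ctx)            // "code1;code2;code3;"
{code}

The fix described above then amounts to materializing the {{Stream}} eagerly 
before mapping, so all side effects on the context happen up front.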



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22390) Aggregate push down

2020-10-27 Thread Chang chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221286#comment-17221286
 ] 

Chang chen commented on SPARK-22390:


Spark 3.0 already supports JDBC DataSource v2, so is there any update?

> Aggregate push down
> ---
>
> Key: SPARK-22390
> URL: https://issues.apache.org/jira/browse/SPARK-22390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33259) Joining 3 streams results in incorrect output

2020-10-27 Thread Michael (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael updated SPARK-33259:

Description: 
I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN B) 
INNER JOIN C) operation. Below you can see example code I [posted on 
Stackoverflow|https://stackoverflow.com/questions/64503539/]...

I created a minimal example of "sessions" that have "start" and "end" events 
and optionally some "metadata".

The script generates two outputs: {{sessionStartsWithMetadata}} results from 
"start" events that are left-joined with the "metadata" events, based on 
{{sessionId}}. A "left join" is used, since we would like to get an output event 
even when no corresponding metadata exists.

Additionally, a DataFrame {{endedSessionsWithMetadata}} is created by joining 
"end" events to the previously created DataFrame. Here an "inner join" is used, 
since we only want output once a session has definitely ended.

This code can be executed in {{spark-shell}}:
{code:scala}
import java.sql.Timestamp
import org.apache.spark.sql.execution.streaming.{MemoryStream, 
StreamingQueryWrapper}
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{col, expr, lit}

import spark.implicits._
implicit val sqlContext: SQLContext = spark.sqlContext

// Main data processing, regardless whether batch or stream processing
def process(
sessionStartEvents: DataFrame,
sessionOptionalMetadataEvents: DataFrame,
sessionEndEvents: DataFrame
): (DataFrame, DataFrame) = {
  val sessionStartsWithMetadata: DataFrame = sessionStartEvents
.join(
  sessionOptionalMetadataEvents,
  sessionStartEvents("sessionId") === 
sessionOptionalMetadataEvents("sessionId") &&
sessionStartEvents("sessionStartTimestamp").between(
  
sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL
 1 seconds")),
  
sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL
 1 seconds"))
),
  "left" // metadata is optional
)
.select(
  sessionStartEvents("sessionId"),
  sessionStartEvents("sessionStartTimestamp"),
  sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
)

  val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
sessionEndEvents,
sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") &&
  sessionStartsWithMetadata("sessionStartTimestamp").between(
sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 
seconds")),
sessionEndEvents("sessionEndTimestamp")
  )
  )

  (sessionStartsWithMetadata, endedSessionsWithMetadata)
}

def streamProcessing(
sessionStartData: Seq[(Timestamp, Int)],
sessionOptionalMetadata: Seq[(Timestamp, Int)],
sessionEndData: Seq[(Timestamp, Int)]
): (StreamingQuery, StreamingQuery) = {

  val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = 
MemoryStream[(Timestamp, Int)]
  sessionStartEventsStream.addData(sessionStartData)

  val sessionStartEvents: DataFrame = sessionStartEventsStream
.toDS()
.toDF("sessionStartTimestamp", "sessionId")
.withWatermark("sessionStartTimestamp", "1 second")

  val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = 
MemoryStream[(Timestamp, Int)]
  sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)

  val sessionOptionalMetadataEvents: DataFrame = 
sessionOptionalMetadataEventsStream
.toDS()
.toDF("sessionOptionalMetadataTimestamp", "sessionId")
.withWatermark("sessionOptionalMetadataTimestamp", "1 second")

  val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = 
MemoryStream[(Timestamp, Int)]
  sessionEndEventsStream.addData(sessionEndData)

  val sessionEndEvents: DataFrame = sessionEndEventsStream
.toDS()
.toDF("sessionEndTimestamp", "sessionId")
.withWatermark("sessionEndTimestamp", "1 second")

  val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
process(sessionStartEvents, sessionOptionalMetadataEvents, sessionEndEvents)

  val sessionStartsWithMetadataQuery = sessionStartsWithMetadata
.select(lit("sessionStartsWithMetadata"), col("*")) // Add label to see 
which query's output it is
.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")
.option("numRows", "1000")
.start()

  val endedSessionsWithMetadataQuery = endedSessionsWithMetadata
.select(lit("endedSessionsWithMetadata"), col("*")) // Add label to see 
which query's output it is
.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")
.option("numRows", "1000")
.start()

  (sessionStartsWithMetadataQuery, endedSessionsWithMetadataQuery)
}

def batchProcessing(
sessionStartData: Seq[(Timestamp, Int)],

[jira] [Commented] (SPARK-33258) Add asc_nulls_* and desc_nulls_* methods to SparkR

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221280#comment-17221280
 ] 

Apache Spark commented on SPARK-33258:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/30159

> Add asc_nulls_* and desc_nulls_* methods to SparkR
> --
>
> Key: SPARK-33258
> URL: https://issues.apache.org/jira/browse/SPARK-33258
> Project: Spark
>  Issue Type: Bug
>  Components: R, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> At the moment Spark provides only
> - {{asc}}
> - {{desc}}
> but {{NULL}} handling variants
> - {{asc_nulls_first}}
> - {{asc_nulls_last}}
> - {{desc_nulls_first}}
> - {{desc_nulls_last}}
> are missing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33258) Add asc_nulls_* and desc_nulls_* methods to SparkR

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33258:


Assignee: Apache Spark

> Add asc_nulls_* and desc_nulls_* methods to SparkR
> --
>
> Key: SPARK-33258
> URL: https://issues.apache.org/jira/browse/SPARK-33258
> Project: Spark
>  Issue Type: Bug
>  Components: R, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> At the moment Spark provides only
> - {{asc}}
> - {{desc}}
> but {{NULL}} handling variants
> - {{asc_nulls_first}}
> - {{asc_nulls_last}}
> - {{desc_nulls_first}}
> - {{desc_nulls_last}}
> are missing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33258) Add asc_nulls_* and desc_nulls_* methods to SparkR

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33258:


Assignee: (was: Apache Spark)

> Add asc_nulls_* and desc_nulls_* methods to SparkR
> --
>
> Key: SPARK-33258
> URL: https://issues.apache.org/jira/browse/SPARK-33258
> Project: Spark
>  Issue Type: Bug
>  Components: R, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> At the moment Spark provides only
> - {{asc}}
> - {{desc}}
> but {{NULL}} handling variants
> - {{asc_nulls_first}}
> - {{asc_nulls_last}}
> - {{desc_nulls_first}}
> - {{desc_nulls_last}}
> are missing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33259) Joining 3 streams results in incorrect output

2020-10-27 Thread Michael (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael updated SPARK-33259:

Description: 
I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN B) 
JOIN C) operation. Below you can see example code I [posted on 
Stackoverflow|https://stackoverflow.com/questions/64503539/]...

I created a minimal example of "sessions" that have "start" and "end" events 
and optionally some "metadata".

The script generates two outputs: {{sessionStartsWithMetadata}} results from 
"start" events that are left-joined with the "metadata" events, based on 
{{sessionId}}. A "left join" is used, since we would like to get an output event 
even when no corresponding metadata exists.

Additionally, a DataFrame {{endedSessionsWithMetadata}} is created by joining 
"end" events to the previously created DataFrame. Here an "inner join" is used, 
since we only want output once a session has definitely ended.

This code can be executed in {{spark-shell}}:
{code:scala}
import java.sql.Timestamp
import org.apache.spark.sql.execution.streaming.{MemoryStream, 
StreamingQueryWrapper}
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{col, expr, lit}

import spark.implicits._
implicit val sqlContext: SQLContext = spark.sqlContext

// Main data processing, regardless whether batch or stream processing
def process(
sessionStartEvents: DataFrame,
sessionOptionalMetadataEvents: DataFrame,
sessionEndEvents: DataFrame
): (DataFrame, DataFrame) = {
  val sessionStartsWithMetadata: DataFrame = sessionStartEvents
.join(
  sessionOptionalMetadataEvents,
  sessionStartEvents("sessionId") === 
sessionOptionalMetadataEvents("sessionId") &&
sessionStartEvents("sessionStartTimestamp").between(
  
sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL
 1 seconds")),
  
sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL
 1 seconds"))
),
  "left" // metadata is optional
)
.select(
  sessionStartEvents("sessionId"),
  sessionStartEvents("sessionStartTimestamp"),
  sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
)

  val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
sessionEndEvents,
sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") &&
  sessionStartsWithMetadata("sessionStartTimestamp").between(
sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 
seconds")),
sessionEndEvents("sessionEndTimestamp")
  )
  )

  (sessionStartsWithMetadata, endedSessionsWithMetadata)
}

def streamProcessing(
sessionStartData: Seq[(Timestamp, Int)],
sessionOptionalMetadata: Seq[(Timestamp, Int)],
sessionEndData: Seq[(Timestamp, Int)]
): (StreamingQuery, StreamingQuery) = {

  val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = 
MemoryStream[(Timestamp, Int)]
  sessionStartEventsStream.addData(sessionStartData)

  val sessionStartEvents: DataFrame = sessionStartEventsStream
.toDS()
.toDF("sessionStartTimestamp", "sessionId")
.withWatermark("sessionStartTimestamp", "1 second")

  val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = 
MemoryStream[(Timestamp, Int)]
  sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)

  val sessionOptionalMetadataEvents: DataFrame = 
sessionOptionalMetadataEventsStream
.toDS()
.toDF("sessionOptionalMetadataTimestamp", "sessionId")
.withWatermark("sessionOptionalMetadataTimestamp", "1 second")

  val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = 
MemoryStream[(Timestamp, Int)]
  sessionEndEventsStream.addData(sessionEndData)

  val sessionEndEvents: DataFrame = sessionEndEventsStream
.toDS()
.toDF("sessionEndTimestamp", "sessionId")
.withWatermark("sessionEndTimestamp", "1 second")

  val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
process(sessionStartEvents, sessionOptionalMetadataEvents, sessionEndEvents)

  val sessionStartsWithMetadataQuery = sessionStartsWithMetadata
.select(lit("sessionStartsWithMetadata"), col("*")) // Add label to see 
which query's output it is
.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")
.option("numRows", "1000")
.start()

  val endedSessionsWithMetadataQuery = endedSessionsWithMetadata
.select(lit("endedSessionsWithMetadata"), col("*")) // Add label to see 
which query's output it is
.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")
.option("numRows", "1000")
.start()

  (sessionStartsWithMetadataQuery, endedSessionsWithMetadataQuery)
}

def batchProcessing(
sessionStartData: Seq[(Timestamp, Int)],

[jira] [Commented] (SPARK-33258) Add asc_nulls_* and desc_nulls_* methods to SparkR

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221279#comment-17221279
 ] 

Apache Spark commented on SPARK-33258:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/30159

> Add asc_nulls_* and desc_nulls_* methods to SparkR
> --
>
> Key: SPARK-33258
> URL: https://issues.apache.org/jira/browse/SPARK-33258
> Project: Spark
>  Issue Type: Bug
>  Components: R, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> At the moment Spark provides only
> - {{asc}}
> - {{desc}}
> but {{NULL}} handling variants
> - {{asc_nulls_first}}
> - {{asc_nulls_last}}
> - {{desc_nulls_first}}
> - {{desc_nulls_last}}
> are missing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33259) Joining 3 streams results in incorrect output

2020-10-27 Thread Michael (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael updated SPARK-33259:

Description: 
I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN B) 
JOIN C) operation. Below you can see example code I [posted on 
Stackoverflow|https://stackoverflow.com/questions/64503539/]...

I created a minimal example of "sessions" that have "start" and "end" events 
and optionally some "metadata".

The script generates two outputs: {{sessionStartsWithMetadata}} results from 
"start" events that are left-joined with the "metadata" events, based on 
{{sessionId}}. A "left join" is used, since we would like to get an output event 
even when no corresponding metadata exists.

Additionally, a DataFrame {{endedSessionsWithMetadata}} is created by joining 
"end" events to the previously created DataFrame. Here an "inner join" is used, 
since we only want output once a session has definitely ended.

This code can be executed in {{spark-shell}}:
{code:scala}
import java.sql.Timestamp
import org.apache.spark.sql.execution.streaming.{MemoryStream, 
StreamingQueryWrapper}
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{col, expr, lit}

import spark.implicits._
implicit val sqlContext: SQLContext = spark.sqlContext

// Main data processing, regardless whether batch or stream processing
def process(
sessionStartEvents: DataFrame,
sessionOptionalMetadataEvents: DataFrame,
sessionEndEvents: DataFrame
): (DataFrame, DataFrame) = {
  val sessionStartsWithMetadata: DataFrame = sessionStartEvents
.join(
  sessionOptionalMetadataEvents,
  sessionStartEvents("sessionId") === 
sessionOptionalMetadataEvents("sessionId") &&
sessionStartEvents("sessionStartTimestamp").between(
  
sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL
 1 seconds")),
  
sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL
 1 seconds"))
),
  "left" // metadata is optional
)
.select(
  sessionStartEvents("sessionId"),
  sessionStartEvents("sessionStartTimestamp"),
  sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
)

  val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
sessionEndEvents,
sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") &&
  sessionStartsWithMetadata("sessionStartTimestamp").between(
sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 
seconds")),
sessionEndEvents("sessionEndTimestamp")
  )
  )

  (sessionStartsWithMetadata, endedSessionsWithMetadata)
}

def streamProcessing(
sessionStartData: Seq[(Timestamp, Int)],
sessionOptionalMetadata: Seq[(Timestamp, Int)],
sessionEndData: Seq[(Timestamp, Int)]
): (StreamingQuery, StreamingQuery) = {

  val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = 
MemoryStream[(Timestamp, Int)]
  sessionStartEventsStream.addData(sessionStartData)

  val sessionStartEvents: DataFrame = sessionStartEventsStream
.toDS()
.toDF("sessionStartTimestamp", "sessionId")
.withWatermark("sessionStartTimestamp", "1 second")

  val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = 
MemoryStream[(Timestamp, Int)]
  sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)

  val sessionOptionalMetadataEvents: DataFrame = 
sessionOptionalMetadataEventsStream
.toDS()
.toDF("sessionOptionalMetadataTimestamp", "sessionId")
.withWatermark("sessionOptionalMetadataTimestamp", "1 second")

  val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = 
MemoryStream[(Timestamp, Int)]
  sessionEndEventsStream.addData(sessionEndData)

  val sessionEndEvents: DataFrame = sessionEndEventsStream
.toDS()
.toDF("sessionEndTimestamp", "sessionId")
.withWatermark("sessionEndTimestamp", "1 second")

  val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
process(sessionStartEvents, sessionOptionalMetadataEvents, sessionEndEvents)

  val sessionStartsWithMetadataQuery = sessionStartsWithMetadata
.select(lit("sessionStartsWithMetadata"), col("*")) // Add label to see 
which query's output it is
.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")
.option("numRows", "1000")
.start()

  val endedSessionsWithMetadataQuery = endedSessionsWithMetadata
.select(lit("endedSessionsWithMetadata"), col("*")) // Add label to see 
which query's output it is
.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")
.option("numRows", "1000")
.start()

  (sessionStartsWithMetadataQuery, endedSessionsWithMetadataQuery)
}

def batchProcessing(
sessionStartData: Seq[(Timestamp, Int)],

[jira] [Updated] (SPARK-33259) Joining 3 streams results in incorrect output

2020-10-27 Thread Michael (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael updated SPARK-33259:

Description: 
I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN B) 
JOIN C) operation. Below you can see example code I [posted on 
Stackoverflow|https://stackoverflow.com/questions/64503539/]...

I created a minimal example of "sessions" that have "start" and "end" events 
and optionally some "metadata".

The script generates two outputs: {{sessionStartsWithMetadata}} results from 
"start" events that are left-joined with the "metadata" events, based on 
{{sessionId}}. A "left join" is used, since we would like to get an output event 
even when no corresponding metadata exists.

Additionally, a DataFrame {{endedSessionsWithMetadata}} is created by joining 
"end" events to the previously created DataFrame. Here an "inner join" is used, 
since we only want output once a session has definitely ended.

This code can be executed in {{spark-shell}}:

{code:scala}
import java.sql.Timestamp
import org.apache.spark.sql.execution.streaming.{MemoryStream, 
StreamingQueryWrapper}
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{col, expr, lit}

import spark.implicits._
implicit val sqlContext: SQLContext = spark.sqlContext

// Main data processing, regardless whether batch or stream processing
def process(
sessionStartEvents: DataFrame,
sessionOptionalMetadataEvents: DataFrame,
sessionEndEvents: DataFrame
): (DataFrame, DataFrame) = {
  val sessionStartsWithMetadata: DataFrame = sessionStartEvents
.join(
  sessionOptionalMetadataEvents,
  sessionStartEvents("sessionId") === 
sessionOptionalMetadataEvents("sessionId") &&
sessionStartEvents("sessionStartTimestamp").between(
  
sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL
 1 seconds")),
  
sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL
 1 seconds"))
),
  "left" // metadata is optional
)
.select(
  sessionStartEvents("sessionId"),
  sessionStartEvents("sessionStartTimestamp"),
  sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
)

  val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
sessionEndEvents,
sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") &&
  sessionStartsWithMetadata("sessionStartTimestamp").between(
sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 
seconds")),
sessionEndEvents("sessionEndTimestamp")
  )
  )

  (sessionStartsWithMetadata, endedSessionsWithMetadata)
}

def streamProcessing(
sessionStartData: Seq[(Timestamp, Int)],
sessionOptionalMetadata: Seq[(Timestamp, Int)],
sessionEndData: Seq[(Timestamp, Int)]
): (StreamingQuery, StreamingQuery) = {

  val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = 
MemoryStream[(Timestamp, Int)]
  sessionStartEventsStream.addData(sessionStartData)

  val sessionStartEvents: DataFrame = sessionStartEventsStream
.toDS()
.toDF("sessionStartTimestamp", "sessionId")
.withWatermark("sessionStartTimestamp", "1 second")

  val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = 
MemoryStream[(Timestamp, Int)]
  sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)

  val sessionOptionalMetadataEvents: DataFrame = 
sessionOptionalMetadataEventsStream
.toDS()
.toDF("sessionOptionalMetadataTimestamp", "sessionId")
.withWatermark("sessionOptionalMetadataTimestamp", "1 second")

  val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = 
MemoryStream[(Timestamp, Int)]
  sessionEndEventsStream.addData(sessionEndData)

  val sessionEndEvents: DataFrame = sessionEndEventsStream
.toDS()
.toDF("sessionEndTimestamp", "sessionId")
.withWatermark("sessionEndTimestamp", "1 second")

  val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
process(sessionStartEvents, sessionOptionalMetadataEvents, sessionEndEvents)

  val sessionStartsWithMetadataQuery = sessionStartsWithMetadata
.select(lit("sessionStartsWithMetadata"), col("*")) // Add label to see 
which query's output it is
.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")
.option("numRows", "1000")
.start()

  val endedSessionsWithMetadataQuery = endedSessionsWithMetadata
.select(lit("endedSessionsWithMetadata"), col("*")) // Add label to see 
which query's output it is
.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")
.option("numRows", "1000")
.start()

  (sessionStartsWithMetadataQuery, endedSessionsWithMetadataQuery)
}

def batchProcessing(
sessionStartData: Seq[(Timestamp, Int)],

[jira] [Created] (SPARK-33259) Joining 3 streams results in incorrect output

2020-10-27 Thread Michael (Jira)
Michael created SPARK-33259:
---

 Summary: Joining 3 streams results in incorrect output
 Key: SPARK-33259
 URL: https://issues.apache.org/jira/browse/SPARK-33259
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.1
Reporter: Michael


I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN B) 
JOIN C) operation. Below you can see example code I [posted on 
Stackoverflow|https://stackoverflow.com/questions/64503539/]...

I created a minimal example of "sessions" that have "start" and "end" events 
and optionally some "metadata".

The script generates two outputs: `sessionStartsWithMetadata` results from 
"start" events that are left-joined with the "metadata" events, based on 
`sessionId`. A "left join" is used, since we would like to get an output event 
even when no corresponding metadata exists.

Additionally, a DataFrame `endedSessionsWithMetadata` is created by joining 
"end" events to the previously created DataFrame. Here an "inner join" is used, 
since we only want output once a session has definitely ended.

This code can be executed in `spark-shell`:

{code:scala}
import java.sql.Timestamp
import org.apache.spark.sql.execution.streaming.{MemoryStream, 
StreamingQueryWrapper}
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{col, expr, lit}

import spark.implicits._
implicit val sqlContext: SQLContext = spark.sqlContext

// Main data processing, regardless whether batch or stream processing
def process(
sessionStartEvents: DataFrame,
sessionOptionalMetadataEvents: DataFrame,
sessionEndEvents: DataFrame
): (DataFrame, DataFrame) = {
  val sessionStartsWithMetadata: DataFrame = sessionStartEvents
.join(
  sessionOptionalMetadataEvents,
  sessionStartEvents("sessionId") === 
sessionOptionalMetadataEvents("sessionId") &&
sessionStartEvents("sessionStartTimestamp").between(
  
sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL
 1 seconds")),
  
sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL
 1 seconds"))
),
  "left" // metadata is optional
)
.select(
  sessionStartEvents("sessionId"),
  sessionStartEvents("sessionStartTimestamp"),
  sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
)

  val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
sessionEndEvents,
sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") &&
  sessionStartsWithMetadata("sessionStartTimestamp").between(
sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 
seconds")),
sessionEndEvents("sessionEndTimestamp")
  )
  )

  (sessionStartsWithMetadata, endedSessionsWithMetadata)
}

def streamProcessing(
sessionStartData: Seq[(Timestamp, Int)],
sessionOptionalMetadata: Seq[(Timestamp, Int)],
sessionEndData: Seq[(Timestamp, Int)]
): (StreamingQuery, StreamingQuery) = {

  val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = 
MemoryStream[(Timestamp, Int)]
  sessionStartEventsStream.addData(sessionStartData)

  val sessionStartEvents: DataFrame = sessionStartEventsStream
.toDS()
.toDF("sessionStartTimestamp", "sessionId")
.withWatermark("sessionStartTimestamp", "1 second")

  val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = 
MemoryStream[(Timestamp, Int)]
  sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)

  val sessionOptionalMetadataEvents: DataFrame = 
sessionOptionalMetadataEventsStream
.toDS()
.toDF("sessionOptionalMetadataTimestamp", "sessionId")
.withWatermark("sessionOptionalMetadataTimestamp", "1 second")

  val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = 
MemoryStream[(Timestamp, Int)]
  sessionEndEventsStream.addData(sessionEndData)

  val sessionEndEvents: DataFrame = sessionEndEventsStream
.toDS()
.toDF("sessionEndTimestamp", "sessionId")
.withWatermark("sessionEndTimestamp", "1 second")

  val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
process(sessionStartEvents, sessionOptionalMetadataEvents, sessionEndEvents)

  val sessionStartsWithMetadataQuery = sessionStartsWithMetadata
.select(lit("sessionStartsWithMetadata"), col("*")) // Add label to see 
which query's output it is
.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")
.option("numRows", "1000")
.start()

  val endedSessionsWithMetadataQuery = endedSessionsWithMetadata
.select(lit("endedSessionsWithMetadata"), col("*")) // Add label to see 
which query's output it is
.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")

[jira] [Created] (SPARK-33258) Add asc_nulls_* and desc_nulls_* methods to SparkR

2020-10-27 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-33258:
--

 Summary: Add asc_nulls_* and desc_nulls_* methods to SparkR
 Key: SPARK-33258
 URL: https://issues.apache.org/jira/browse/SPARK-33258
 Project: Spark
  Issue Type: Bug
  Components: R, SQL
Affects Versions: 3.1.0
Reporter: Maciej Szymkiewicz


At the moment Spark provides only

- {{asc}}
- {{desc}}

but {{NULL}} handling variants

- {{asc_nulls_first}}
- {{asc_nulls_last}}
- {{desc_nulls_first}}
- {{desc_nulls_last}}

are missing.
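
Though the missing piece is the SparkR wrappers, the JVM side already exposes 
these variants; a minimal Scala sketch (assuming a running {{spark}} session in 
{{spark-shell}}) of the calls the proposed R bindings would ultimately delegate 
to:

{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions.{asc_nulls_last, desc_nulls_first}

// Sort with explicit NULL placement; these functions already exist in the
// Scala/Java API and are what the proposed SparkR wrappers would expose.
val df = Seq(Some(1), None, Some(3)).toDF("x")
df.orderBy(asc_nulls_last("x")).show()    // 1, 3, null
df.orderBy(desc_nulls_first("x")).show()  // null, 3, 1
{code}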



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33257) Support Column inputs in PySpark ordering functions (asc*, desc*)

2020-10-27 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-33257:
--

 Summary: Support Column inputs in PySpark ordering functions 
(asc*, desc*)
 Key: SPARK-33257
 URL: https://issues.apache.org/jira/browse/SPARK-33257
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.1.0
Reporter: Maciej Szymkiewicz


According to SPARK-26979, PySpark functions should support both {{Column}} and 
{{str}} arguments, when possible.

However, the following ordering functions

- {{asc}}
- {{desc}}
- {{asc_nulls_first}}
- {{asc_nulls_last}}
- {{desc_nulls_first}}
- {{desc_nulls_last}}

support only {{str}}. This is because the Scala side doesn't provide {{Column => 
Column}} variants.

To fix this, we can do one of the following:

- Call the corresponding {{Column}} methods, as 
[suggested|https://github.com/apache/spark/pull/30143#discussion_r512366978] by 
[~hyukjin.kwon]
- Add the missing signatures on the Scala side (sketched below).
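
A hedged sketch of what the second option could look like; these 
{{Column}}-accepting overloads are hypothetical additions, not existing Spark 
API, and simply forward to methods {{Column}} already has:

{code:scala}
import org.apache.spark.sql.Column

// Hypothetical Column => Column overloads (names mirror the existing str
// variants); each one just forwards to the corresponding Column method.
object OrderingFunctions {
  def asc(e: Column): Column = e.asc
  def desc(e: Column): Column = e.desc
  def asc_nulls_first(e: Column): Column = e.asc_nulls_first
  def asc_nulls_last(e: Column): Column = e.asc_nulls_last
  def desc_nulls_first(e: Column): Column = e.desc_nulls_first
  def desc_nulls_last(e: Column): Column = e.desc_nulls_last
}
{code}

With overloads like these on the JVM side, the PySpark functions could accept 
either a column name or a {{Column}} and dispatch to the same entry points.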



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33249) Add status plugin for live application

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221244#comment-17221244
 ] 

Apache Spark commented on SPARK-33249:
--

User 'kuwii' has created a pull request for this issue:
https://github.com/apache/spark/pull/30158

> Add status plugin for live application
> --
>
> Key: SPARK-33249
> URL: https://issues.apache.org/jira/browse/SPARK-33249
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Weiyi Kong
>Priority: Minor
>
> There are cases where a developer may want to extend the current REST API of 
> the Web UI. In most cases, adding an external module is a better option than 
> directly editing the original Spark code.
> For an external module to extend the REST API of the Web UI, two things may 
> need to be done:
>  * Add an extra API to provide extra status info. This can simply be done by 
> implementing another ApiRequestContext, which is loaded automatically.
>  * If the info cannot be computed from the data already in the store, add 
> extra listeners to generate it.
> For the history server, there is an interface called AppHistoryServerPlugin, 
> which is loaded via SPI and provides a method to create listeners. In a live 
> application, the only option is spark.extraListeners, based on 
> Utils.loadExtensions, but that is not enough for these cases.
> For the API to get the status info, the data needs to be written to the 
> AppStatusStore, which is the only store an API can reach by accessing 
> "ui.store" or "ui.sc.statusStore". But listeners created by 
> Utils.loadExtensions only get a SparkConf at construction, and are unable to 
> write to the AppStatusStore.
> So I think we still need a plugin like AppHistoryServerPlugin for the live UI. 
> For concerns like SPARK-22786, the plugin for live apps can be separated from 
> the history server one, and also loaded using Utils.loadExtensions with an 
> extra configuration. So by default, nothing would be loaded.
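
As a purely hypothetical sketch of what such a live-app plugin hook could look 
like (the trait and its name are made up for illustration and do not exist in 
Spark; only SparkConf, SparkListener and the KVStore interface are existing 
classes):

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.scheduler.SparkListener
import org.apache.spark.util.kvstore.KVStore

// Hypothetical interface, loaded via Utils.loadExtensions from an opt-in config,
// mirroring what AppHistoryServerPlugin does for the history server: the plugin
// gets both the conf and the live app's KV store, so its listeners can write the
// extra status data that a custom REST endpoint later reads back.
trait LiveAppStatusPlugin {
  def createListeners(conf: SparkConf, store: KVStore): Seq[SparkListener]
}
{code}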



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33249) Add status plugin for live application

2020-10-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33249:


Assignee: Apache Spark

> Add status plugin for live application
> --
>
> Key: SPARK-33249
> URL: https://issues.apache.org/jira/browse/SPARK-33249
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Weiyi Kong
>Assignee: Apache Spark
>Priority: Minor
>
> There are cases where a developer may want to extend the current REST API of 
> the Web UI. In most cases, adding an external module is a better option than 
> directly editing the original Spark code.
> For an external module to extend the REST API of the Web UI, two things may 
> need to be done:
>  * Add an extra API to provide extra status info. This can simply be done by 
> implementing another ApiRequestContext, which is loaded automatically.
>  * If the info cannot be computed from the data already in the store, add 
> extra listeners to generate it.
> For the history server, there is an interface called AppHistoryServerPlugin, 
> which is loaded via SPI and provides a method to create listeners. In a live 
> application, the only option is spark.extraListeners, based on 
> Utils.loadExtensions, but that is not enough for these cases.
> For the API to get the status info, the data needs to be written to the 
> AppStatusStore, which is the only store an API can reach by accessing 
> "ui.store" or "ui.sc.statusStore". But listeners created by 
> Utils.loadExtensions only get a SparkConf at construction, and are unable to 
> write to the AppStatusStore.
> So I think we still need a plugin like AppHistoryServerPlugin for the live UI. 
> For concerns like SPARK-22786, the plugin for live apps can be separated from 
> the history server one, and also loaded using Utils.loadExtensions with an 
> extra configuration. So by default, nothing would be loaded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33249) Add status plugin for live application

2020-10-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221243#comment-17221243
 ] 

Apache Spark commented on SPARK-33249:
--

User 'kuwii' has created a pull request for this issue:
https://github.com/apache/spark/pull/30158

> Add status plugin for live application
> --
>
> Key: SPARK-33249
> URL: https://issues.apache.org/jira/browse/SPARK-33249
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Weiyi Kong
>Priority: Minor
>
> There are cases where a developer may want to extend the current REST API of 
> the Web UI. In most cases, adding an external module is a better option than 
> directly editing the original Spark code.
> For an external module to extend the REST API of the Web UI, two things may 
> need to be done:
>  * Add an extra API to provide extra status info. This can simply be done by 
> implementing another ApiRequestContext, which is loaded automatically.
>  * If the info cannot be computed from the data already in the store, add 
> extra listeners to generate it.
> For the history server, there is an interface called AppHistoryServerPlugin, 
> which is loaded via SPI and provides a method to create listeners. In a live 
> application, the only option is spark.extraListeners, based on 
> Utils.loadExtensions, but that is not enough for these cases.
> For the API to get the status info, the data needs to be written to the 
> AppStatusStore, which is the only store an API can reach by accessing 
> "ui.store" or "ui.sc.statusStore". But listeners created by 
> Utils.loadExtensions only get a SparkConf at construction, and are unable to 
> write to the AppStatusStore.
> So I think we still need a plugin like AppHistoryServerPlugin for the live UI. 
> For concerns like SPARK-22786, the plugin for live apps can be separated from 
> the history server one, and also loaded using Utils.loadExtensions with an 
> extra configuration. So by default, nothing would be loaded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


