[jira] [Resolved] (SPARK-31270) Expose executor memory metrics at the task detail, in the Stages tab

2021-01-02 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu resolved SPARK-31270.
---
Resolution: Won't Fix

> Expose executor memory metrics at the task detail, in the Stages tab
> ---
>
> Key: SPARK-31270
> URL: https://issues.apache.org/jira/browse/SPARK-31270
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26399) Add new stage-level REST APIs and parameters

2021-01-02 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257669#comment-17257669
 ] 

angerszhu commented on SPARK-26399:
---

working on this PR

> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http://<host>:18080/api/v1/applications/<application id>/<stage id>/<stage attempt>/executorSummary{code}
> Add parameters to the stages REST API to specify:
> *  filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
> * task metric quantiles, and adding the task summary if specified
> * executor metric quantiles, and adding the executor summary if specified






[jira] [Commented] (SPARK-33915) Allow json expression to be pushable column

2021-01-02 Thread Ted Yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257647#comment-17257647
 ] 

Ted Yu commented on SPARK-33915:


Here is sample code for capturing the column and fields in a downstream 
PredicatePushDown.scala:
{code}
private val JSONCapture = "`GetJsonObject\\((.*),(.*)\\)`".r

private def transformGetJsonObject(p: Predicate): Predicate = {
  val eq = p.asInstanceOf[sources.EqualTo]
  eq.attribute match {
    case JSONCapture(column, field) =>
      val colName = column.toString.split("#")(0)
      val names = field.toString.split("\\.")
        .foldLeft(List[String]()) { (z, n) => z :+ "->'" + n + "'" }
      sources.EqualTo(colName + names.slice(1, names.size).mkString(""),
        eq.value).asInstanceOf[Predicate]
    case _ => sources.EqualTo("foo", "bar").asInstanceOf[Predicate]
  }
}
{code}
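For readers who want to see what the capture does, here is a hypothetical stand-alone illustration of the same rewrite in Python (the attribute format and the Postgres-style `col->'field'` target syntax are assumptions taken from the snippet above, not from Spark itself):

```python
import re

# Hypothetical Python illustration of the rewrite above: capture the column
# and JSON path from a GetJsonObject attribute name and rebuild it as a
# Postgres-style col->'field' expression.
JSON_CAPTURE = re.compile(r"`GetJsonObject\((.*),(.*)\)`")

def transform_get_json_object(attribute: str) -> str:
    m = JSON_CAPTURE.fullmatch(attribute)
    if m is None:
        return attribute  # not a GetJsonObject attribute; leave unchanged
    column, field = m.groups()
    col_name = column.split("#")[0]  # "phone#12" -> "phone"
    # "$.code" -> ["$", "code"]; drop the leading "$" and wrap the rest
    parts = ["->'%s'" % p for p in field.split(".")][1:]
    return col_name + "".join(parts)

print(transform_get_json_object("`GetJsonObject(phone#12,$.code)`"))
# phone->'code'
```

Like the Scala snippet, this only handles the simple EqualTo case and greedy matching on the single comma inside the expression.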

> Allow json expression to be pushable column
> ---
>
> Key: SPARK-33915
> URL: https://issues.apache.org/jira/browse/SPARK-33915
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Ted Yu
>Assignee: Apache Spark
>Priority: Major
>
> Currently PushableColumnBase provides no support for json / jsonb expression.
> Example of json expression:
> {code}
> get_json_object(phone, '$.code') = '1200'
> {code}
> If non-string literal is part of the expression, the presence of cast() would 
> complicate the situation.
> Implication is that implementation of SupportsPushDownFilters doesn't have a 
> chance to perform pushdown even if third party DB engine supports json 
> expression pushdown.
> This issue is for discussion and implementation of Spark core changes which 
> would allow json expression to be recognized as pushable column.






[jira] [Issue Comment Deleted] (SPARK-33915) Allow json expression to be pushable column

2021-01-02 Thread Ted Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-33915:
---
Comment: was deleted

(was: Opened https://github.com/apache/spark/pull/30984)

> Allow json expression to be pushable column
> ---
>
> Key: SPARK-33915
> URL: https://issues.apache.org/jira/browse/SPARK-33915
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Ted Yu
>Assignee: Apache Spark
>Priority: Major
>
> Currently PushableColumnBase provides no support for json / jsonb expression.
> Example of json expression:
> {code}
> get_json_object(phone, '$.code') = '1200'
> {code}
> If non-string literal is part of the expression, the presence of cast() would 
> complicate the situation.
> Implication is that implementation of SupportsPushDownFilters doesn't have a 
> chance to perform pushdown even if third party DB engine supports json 
> expression pushdown.
> This issue is for discussion and implementation of Spark core changes which 
> would allow json expression to be recognized as pushable column.






[jira] [Commented] (SPARK-26399) Add new stage-level REST APIs and parameters

2021-01-02 Thread Ron Hu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257639#comment-17257639
 ] 

Ron Hu commented on SPARK-26399:


Hi [~Baohe Zhang], this ticket proposes a new REST API:

http://<host>:18080/api/v1/applications/<application id>/<stage id>/<stage attempt>/executorSummary

It is meant to display the percentile distribution of peak memory metrics among 
the executors used in a given stage.  It can help Spark users debug/monitor a 
bottleneck of a stage.

The ticket https://issues.apache.org/jira/browse/SPARK-32446 proposed adding a 
REST API which can display the percentile distribution of peak memory metrics 
for all executors used in an application.  That REST API is:

http://<host>:18080/api/v1/applications/<application id>/<attempt id>/executorSummary

Hence this ticket displays executorSummary for a given stage inside an 
application, while SPARK-32446 displays executorSummary for the entire 
application.  They are different.

 

> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http://<host>:18080/api/v1/applications/<application id>/<stage id>/<stage attempt>/executorSummary{code}
> Add parameters to the stages REST API to specify:
> *  filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
> * task metric quantiles, and adding the task summary if specified
> * executor metric quantiles, and adding the executor summary if specified






[jira] [Assigned] (SPARK-33963) `isCached` returns `false` for cached Hive table

2021-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33963:


Assignee: Maxim Gekk

> `isCached` returns `false` for cached Hive table
> ---
>
> Key: SPARK-33963
> URL: https://issues.apache.org/jira/browse/SPARK-33963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> The same code works in Spark 3.0 but fails in Spark 3.2.0-SNAPSHOT (and 
> probably in 3.1.0):
> *Spark 3.0:*
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.1
>   /_/
> scala> sql("CREATE TABLE tbl (col int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res3: Boolean = false
> scala> sql("CACHE TABLE tbl")
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res5: Boolean = true
> {code}
> *Spark master:*
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>   /_/
> scala> sql("CREATE TABLE tbl (col int)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res2: Boolean = false
> scala> sql("CACHE TABLE tbl")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res4: Boolean = false
> {code}






[jira] [Resolved] (SPARK-33963) `isCached` returns `false` for cached Hive table

2021-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33963.
--
Fix Version/s: 3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 30995
[https://github.com/apache/spark/pull/30995]

> `isCached` returns `false` for cached Hive table
> ---
>
> Key: SPARK-33963
> URL: https://issues.apache.org/jira/browse/SPARK-33963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> The same code works in Spark 3.0 but fails in Spark 3.2.0-SNAPSHOT (and 
> probably in 3.1.0):
> *Spark 3.0:*
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.1
>   /_/
> scala> sql("CREATE TABLE tbl (col int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res3: Boolean = false
> scala> sql("CACHE TABLE tbl")
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res5: Boolean = true
> {code}
> *Spark master:*
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>   /_/
> scala> sql("CREATE TABLE tbl (col int)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res2: Boolean = false
> scala> sql("CACHE TABLE tbl")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res4: Boolean = false
> {code}






[jira] [Resolved] (SPARK-33959) Improve the statistics estimation of the Tail

2021-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33959.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30991
[https://github.com/apache/spark/pull/30991]

> Improve the statistics estimation of the Tail
> -
>
> Key: SPARK-33959
> URL: https://issues.apache.org/jira/browse/SPARK-33959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> {code:scala}
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as 
> e").write.saveAsTable("t1")
> println(Tail(Literal(5), spark.sql("SELECT * FROM 
> t1").queryExecution.logical).queryExecution.explainString(org.apache.spark.sql.execution.CostMode))
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=3.8 KiB)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}






[jira] [Assigned] (SPARK-33959) Improve the statistics estimation of the Tail

2021-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33959:


Assignee: Yuming Wang

> Improve the statistics estimation of the Tail
> -
>
> Key: SPARK-33959
> URL: https://issues.apache.org/jira/browse/SPARK-33959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> {code:scala}
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as 
> e").write.saveAsTable("t1")
> println(Tail(Literal(5), spark.sql("SELECT * FROM 
> t1").queryExecution.logical).queryExecution.explainString(org.apache.spark.sql.execution.CostMode))
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=3.8 KiB)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}






[jira] [Resolved] (SPARK-33960) LimitPushDown support Sort

2021-01-02 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-33960.
-
Resolution: Not A Problem

> LimitPushDown support Sort
> --
>
> Key: SPARK-33960
> URL: https://issues.apache.org/jira/browse/SPARK-33960
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> LimitPushDown support Sort.






[jira] [Commented] (SPARK-33922) Fix error test SparkLauncherSuite.testSparkLauncherGetError

2021-01-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257620#comment-17257620
 ] 

Hyukjin Kwon commented on SPARK-33922:
--

Would you mind sharing the full logs? Meanwhile, it might be helpful to take a 
look at https://spark.apache.org/developer-tools.html for how to run a test.

> Fix error test SparkLauncherSuite.testSparkLauncherGetError
> ---
>
> Key: SPARK-33922
> URL: https://issues.apache.org/jira/browse/SPARK-33922
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.1
>Reporter: dengziming
>Priority: Minor
>
> org.apache.spark.launcher.SparkLauncherSuite.testSparkLauncherGetError fails 
> every time it is executed; note that it is not a flaky test, because it fails 
> every time.
> ```
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:87)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at org.junit.Assert.assertTrue(Assert.java:53)
>   at org.apache.spark.launcher.SparkLauncherSuite.testSparkLauncherGetError
> ```






[jira] [Updated] (SPARK-33961) Upgrade SBT to 1.4.6

2021-01-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33961:
--
Affects Version/s: (was: 3.2.0)

> Upgrade SBT to 1.4.6
> 
>
> Key: SPARK-33961
> URL: https://issues.apache.org/jira/browse/SPARK-33961
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.1.0
>
>







[jira] [Assigned] (SPARK-33961) Upgrade SBT to 1.4.6

2021-01-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33961:
-

Assignee: Dongjoon Hyun

> Upgrade SBT to 1.4.6
> 
>
> Key: SPARK-33961
> URL: https://issues.apache.org/jira/browse/SPARK-33961
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>







[jira] [Resolved] (SPARK-33961) Upgrade SBT to 1.4.6

2021-01-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33961.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30993
[https://github.com/apache/spark/pull/30993]

> Upgrade SBT to 1.4.6
> 
>
> Key: SPARK-33961
> URL: https://issues.apache.org/jira/browse/SPARK-33961
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.1.0
>
>







[jira] [Commented] (SPARK-33964) Combine distinct unions in more cases

2021-01-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257499#comment-17257499
 ] 

Apache Spark commented on SPARK-33964:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30996

> Combine distinct unions in more cases
> -
>
> Key: SPARK-33964
> URL: https://issues.apache.org/jira/browse/SPARK-33964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Priority: Major
>
> In several TPCDS queries the CombineUnions rule does not manage to combine 
> unions, because they have noop Projects between them.
> The Projects will be removed by RemoveNoopOperators, but by then 
> ReplaceDistinctWithAggregate has been applied and there are aggregates 
> between the unions.






[jira] [Assigned] (SPARK-33964) Combine distinct unions in more cases

2021-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33964:


Assignee: Apache Spark

> Combine distinct unions in more cases
> -
>
> Key: SPARK-33964
> URL: https://issues.apache.org/jira/browse/SPARK-33964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Assignee: Apache Spark
>Priority: Major
>
> In several TPCDS queries the CombineUnions rule does not manage to combine 
> unions, because they have noop Projects between them.
> The Projects will be removed by RemoveNoopOperators, but by then 
> ReplaceDistinctWithAggregate has been applied and there are aggregates 
> between the unions.






[jira] [Commented] (SPARK-33964) Combine distinct unions in more cases

2021-01-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257498#comment-17257498
 ] 

Apache Spark commented on SPARK-33964:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30996

> Combine distinct unions in more cases
> -
>
> Key: SPARK-33964
> URL: https://issues.apache.org/jira/browse/SPARK-33964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Priority: Major
>
> In several TPCDS queries the CombineUnions rule does not manage to combine 
> unions, because they have noop Projects between them.
> The Projects will be removed by RemoveNoopOperators, but by then 
> ReplaceDistinctWithAggregate has been applied and there are aggregates 
> between the unions.






[jira] [Assigned] (SPARK-33964) Combine distinct unions in more cases

2021-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33964:


Assignee: (was: Apache Spark)

> Combine distinct unions in more cases
> -
>
> Key: SPARK-33964
> URL: https://issues.apache.org/jira/browse/SPARK-33964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Priority: Major
>
> In several TPCDS queries the CombineUnions rule does not manage to combine 
> unions, because they have noop Projects between them.
> The Projects will be removed by RemoveNoopOperators, but by then 
> ReplaceDistinctWithAggregate has been applied and there are aggregates 
> between the unions.






[jira] [Updated] (SPARK-33964) Combine distinct unions in more cases

2021-01-02 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-33964:
---
Description: 
In several TPCDS queries the CombineUnions rule does not manage to combine 
unions, because they have noop Projects between them.
The Projects will be removed by RemoveNoopOperators, but by then 
ReplaceDistinctWithAggregate has been applied and there are aggregates between 
the unions.
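The interaction described above can be modeled with a toy sketch (this is illustrative plain Python, not Spark's optimizer; the plan encoding is invented):

```python
# Toy model of the rule ordering described above: a "CombineUnions"-style
# rule only flattens a Union whose children are themselves Unions, so a
# no-op Project between two Unions blocks it.  Removing no-op Projects
# first lets the flattening fire.
def combine_unions(plan):
    if plan[0] != "Union":
        return plan
    children = []
    for child in plan[1]:
        child = combine_unions(child)
        if child[0] == "Union":
            children.extend(child[1])   # flatten nested Union into parent
        else:
            children.append(child)      # any other node blocks flattening
    return ("Union", children)

def remove_noop_projects(plan):
    if plan[0] == "NoopProject":
        return remove_noop_projects(plan[1])   # drop the no-op wrapper
    if plan[0] == "Union":
        return ("Union", [remove_noop_projects(c) for c in plan[1]])
    return plan

plan = ("Union", [("NoopProject", ("Union", [("Scan", "a"), ("Scan", "b")])),
                  ("Scan", "c")])

print(len(combine_unions(plan)[1]))                        # 2: Project blocks it
print(len(combine_unions(remove_noop_projects(plan))[1]))  # 3: flattened
```

In Spark the analogous fix is one of ordering: the unions can only be combined if the no-op Projects are gone before ReplaceDistinctWithAggregate turns the Distinct into aggregates.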

> Combine distinct unions in more cases
> -
>
> Key: SPARK-33964
> URL: https://issues.apache.org/jira/browse/SPARK-33964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Priority: Major
>
> In several TPCDS queries the CombineUnions rule does not manage to combine 
> unions, because they have noop Projects between them.
> The Projects will be removed by RemoveNoopOperators, but by then 
> ReplaceDistinctWithAggregate has been applied and there are aggregates 
> between the unions.






[jira] [Created] (SPARK-33964) Combine distinct unions in more cases

2021-01-02 Thread Tanel Kiis (Jira)
Tanel Kiis created SPARK-33964:
--

 Summary: Combine distinct unions in more cases
 Key: SPARK-33964
 URL: https://issues.apache.org/jira/browse/SPARK-33964
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Tanel Kiis









[jira] [Commented] (SPARK-33963) `isCached` returns `false` for cached Hive table

2021-01-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257490#comment-17257490
 ] 

Apache Spark commented on SPARK-33963:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30995

> `isCached` returns `false` for cached Hive table
> ---
>
> Key: SPARK-33963
> URL: https://issues.apache.org/jira/browse/SPARK-33963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The same code works in Spark 3.0 but fails in Spark 3.2.0-SNAPSHOT (and 
> probably in 3.1.0):
> *Spark 3.0:*
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.1
>   /_/
> scala> sql("CREATE TABLE tbl (col int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res3: Boolean = false
> scala> sql("CACHE TABLE tbl")
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res5: Boolean = true
> {code}
> *Spark master:*
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>   /_/
> scala> sql("CREATE TABLE tbl (col int)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res2: Boolean = false
> scala> sql("CACHE TABLE tbl")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res4: Boolean = false
> {code}






[jira] [Assigned] (SPARK-33963) `isCached` returns `false` for cached Hive table

2021-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33963:


Assignee: Apache Spark

> `isCached` returns `false` for cached Hive table
> ---
>
> Key: SPARK-33963
> URL: https://issues.apache.org/jira/browse/SPARK-33963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The same code works in Spark 3.0 but fails in Spark 3.2.0-SNAPSHOT (and 
> probably in 3.1.0):
> *Spark 3.0:*
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.1
>   /_/
> scala> sql("CREATE TABLE tbl (col int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res3: Boolean = false
> scala> sql("CACHE TABLE tbl")
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res5: Boolean = true
> {code}
> *Spark master:*
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>   /_/
> scala> sql("CREATE TABLE tbl (col int)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res2: Boolean = false
> scala> sql("CACHE TABLE tbl")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res4: Boolean = false
> {code}






[jira] [Assigned] (SPARK-33963) `isCached` returns `false` for cached Hive table

2021-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33963:


Assignee: (was: Apache Spark)

> `isCached` returns `false` for cached Hive table
> ---
>
> Key: SPARK-33963
> URL: https://issues.apache.org/jira/browse/SPARK-33963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The same code works in Spark 3.0 but fails in Spark 3.2.0-SNAPSHOT (and 
> probably in 3.1.0):
> *Spark 3.0:*
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.1
>   /_/
> scala> sql("CREATE TABLE tbl (col int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res3: Boolean = false
> scala> sql("CACHE TABLE tbl")
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res5: Boolean = true
> {code}
> *Spark master:*
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>   /_/
> scala> sql("CREATE TABLE tbl (col int)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res2: Boolean = false
> scala> sql("CACHE TABLE tbl")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res4: Boolean = false
> {code}






[jira] [Commented] (SPARK-33963) `isCached` returns `false` for cached Hive table

2021-01-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257489#comment-17257489
 ] 

Apache Spark commented on SPARK-33963:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30995

> `isCached` returns `false` for cached Hive table
> ---
>
> Key: SPARK-33963
> URL: https://issues.apache.org/jira/browse/SPARK-33963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The same code works in Spark 3.0 but fails in Spark 3.2.0-SNAPSHOT (and 
> probably in 3.1.0):
> *Spark 3.0:*
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.1
>   /_/
> scala> sql("CREATE TABLE tbl (col int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res3: Boolean = false
> scala> sql("CACHE TABLE tbl")
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res5: Boolean = true
> {code}
> *Spark master:*
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>   /_/
> scala> sql("CREATE TABLE tbl (col int)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res2: Boolean = false
> scala> sql("CACHE TABLE tbl")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res4: Boolean = false
> {code}






[jira] [Commented] (SPARK-33963) `isCached` return `false` for cached Hive table

2021-01-02 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257483#comment-17257483
 ] 

Maxim Gekk commented on SPARK-33963:


I am working on a bug fix.

> `isCached` return `false` for cached Hive table
> ---
>
> Key: SPARK-33963
> URL: https://issues.apache.org/jira/browse/SPARK-33963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The same scenario works in Spark 3.0 but fails in Spark 3.2.0-SNAPSHOT (and 
> probably 3.1.0):
> *Spark 3.0:*
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.1
>   /_/
> scala> sql("CREATE TABLE tbl (col int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res3: Boolean = false
> scala> sql("CACHE TABLE tbl")
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res5: Boolean = true
> {code}
> *Spark master:*
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>   /_/
> scala> sql("CREATE TABLE tbl (col int)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res2: Boolean = false
> scala> sql("CACHE TABLE tbl")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res4: Boolean = false
> {code}






[jira] [Created] (SPARK-33963) `isCached` return `false` for cached Hive table

2021-01-02 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33963:
--

 Summary: `isCached` return `false` for cached Hive table
 Key: SPARK-33963
 URL: https://issues.apache.org/jira/browse/SPARK-33963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0, 3.2.0
Reporter: Maxim Gekk


The same scenario works in Spark 3.0 but fails in Spark 3.2.0-SNAPSHOT (and 
probably 3.1.0):
*Spark 3.0:*
{code:scala}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
  /_/
scala> sql("CREATE TABLE tbl (col int)")
res2: org.apache.spark.sql.DataFrame = []
scala> spark.catalog.isCached("tbl")
res3: Boolean = false

scala> sql("CACHE TABLE tbl")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.catalog.isCached("tbl")
res5: Boolean = true
{code}

*Spark master:*
{code:scala}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
  /_/
scala> sql("CREATE TABLE tbl (col int)")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.catalog.isCached("tbl")
res2: Boolean = false

scala> sql("CACHE TABLE tbl")
res3: org.apache.spark.sql.DataFrame = []

scala> spark.catalog.isCached("tbl")
res4: Boolean = false
{code}
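As a language-agnostic sketch of how such a miss can happen (the key names below are purely hypothetical, not Spark's actual cache internals): if the cache manager stores entries keyed by the logical plan produced at caching time, and a later lookup resolves the same table to a structurally different plan, the membership test misses even though the data is cached.

```python
# Hypothetical sketch of a cache-lookup miss: the table was cached under one
# plan representation, but the isCached lookup uses another. The key names are
# illustrative only, not Spark's actual internals.
cache = {}

# CACHE TABLE tbl: entry stored under the plan produced at caching time.
cache[("HiveTableRelation", "default.tbl")] = "<cached relation>"

# spark.catalog.isCached("tbl"): lookup resolves the table to a different
# (though logically equivalent) plan, so the membership test misses.
lookup_key = ("LogicalRelation", "default.tbl")
is_cached = lookup_key in cache

print(is_cached)  # False, even though the table's data is cached
```

Fixing such a bug typically means normalizing the plan (or comparing by table identity) before the lookup, so that logically equivalent plans hit the same entry.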






[jira] [Updated] (SPARK-33962) Fix incorrect min partition condition in getRanges

2021-01-02 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-33962:

Issue Type: Bug  (was: Improvement)

> Fix incorrect min partition condition in getRanges
> --
>
> Key: SPARK-33962
> URL: https://issues.apache.org/jira/browse/SPARK-33962
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> When calculating offset ranges, we consider the minPartitions configuration. If 
> minPartitions is not set, or is less than or equal to the size of the given 
> ranges, there are already enough partitions on the Kafka side, so we don't need 
> to split offsets to satisfy the minimum partition requirement. But the current 
> condition is offsetRanges.size > minPartitions.get, which is not correct: 
> getRanges currently splits offsets in unnecessary cases.






[jira] [Assigned] (SPARK-33962) Fix incorrect min partition condition in getRanges

2021-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33962:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Fix incorrect min partition condition in getRanges
> --
>
> Key: SPARK-33962
> URL: https://issues.apache.org/jira/browse/SPARK-33962
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> When calculating offset ranges, we consider the minPartitions configuration. If 
> minPartitions is not set, or is less than or equal to the size of the given 
> ranges, there are already enough partitions on the Kafka side, so we don't need 
> to split offsets to satisfy the minimum partition requirement. But the current 
> condition is offsetRanges.size > minPartitions.get, which is not correct: 
> getRanges currently splits offsets in unnecessary cases.






[jira] [Commented] (SPARK-33962) Fix incorrect min partition condition in getRanges

2021-01-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257466#comment-17257466
 ] 

Apache Spark commented on SPARK-33962:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30994

> Fix incorrect min partition condition in getRanges
> --
>
> Key: SPARK-33962
> URL: https://issues.apache.org/jira/browse/SPARK-33962
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> When calculating offset ranges, we consider the minPartitions configuration. If 
> minPartitions is not set, or is less than or equal to the size of the given 
> ranges, there are already enough partitions on the Kafka side, so we don't need 
> to split offsets to satisfy the minimum partition requirement. But the current 
> condition is offsetRanges.size > minPartitions.get, which is not correct: 
> getRanges currently splits offsets in unnecessary cases.






[jira] [Assigned] (SPARK-33962) Fix incorrect min partition condition in getRanges

2021-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33962:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Fix incorrect min partition condition in getRanges
> --
>
> Key: SPARK-33962
> URL: https://issues.apache.org/jira/browse/SPARK-33962
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> When calculating offset ranges, we consider the minPartitions configuration. If 
> minPartitions is not set, or is less than or equal to the size of the given 
> ranges, there are already enough partitions on the Kafka side, so we don't need 
> to split offsets to satisfy the minimum partition requirement. But the current 
> condition is offsetRanges.size > minPartitions.get, which is not correct: 
> getRanges currently splits offsets in unnecessary cases.






[jira] [Created] (SPARK-33962) Fix incorrect min partition condition in getRanges

2021-01-02 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-33962:
---

 Summary: Fix incorrect min partition condition in getRanges
 Key: SPARK-33962
 URL: https://issues.apache.org/jira/browse/SPARK-33962
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


When calculating offset ranges, we consider the minPartitions configuration. If 
minPartitions is not set, or is less than or equal to the size of the given 
ranges, there are already enough partitions on the Kafka side, so we don't need 
to split offsets to satisfy the minimum partition requirement. But the current 
condition is offsetRanges.size > minPartitions.get, which is not correct: 
getRanges currently splits offsets in unnecessary cases.
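The decision described above can be sketched as a tiny predicate (an illustration of the described logic in plain Python, not Spark's actual code): splitting is only needed when minPartitions is set and exceeds the number of Kafka offset ranges.

```python
from typing import Optional

def needs_split(num_offset_ranges: int, min_partitions: Optional[int]) -> bool:
    """Illustrative predicate for the logic described above (not Spark's code):
    split offsets only when minPartitions is set and there are fewer Kafka
    offset ranges than the requested minimum."""
    return min_partitions is not None and num_offset_ranges < min_partitions

# Enough Kafka partitions already: no split needed.
print(needs_split(10, 10))   # False
# Fewer ranges than the requested minimum: split.
print(needs_split(4, 10))    # True
# minPartitions not set: never split for this reason.
print(needs_split(4, None))  # False
```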






[jira] [Updated] (SPARK-33961) Upgrade SBT to 1.4.6

2021-01-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33961:
--
Affects Version/s: 3.1.0

> Upgrade SBT to 1.4.6
> 
>
> Key: SPARK-33961
> URL: https://issues.apache.org/jira/browse/SPARK-33961
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>







[jira] [Commented] (SPARK-33961) Upgrade SBT to 1.4.6

2021-01-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257459#comment-17257459
 ] 

Apache Spark commented on SPARK-33961:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30993

> Upgrade SBT to 1.4.6
> 
>
> Key: SPARK-33961
> URL: https://issues.apache.org/jira/browse/SPARK-33961
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>







[jira] [Issue Comment Deleted] (SPARK-33961) Upgrade SBT to 1.4.6

2021-01-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33961:
--
Comment: was deleted

(was: User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30993)

> Upgrade SBT to 1.4.6
> 
>
> Key: SPARK-33961
> URL: https://issues.apache.org/jira/browse/SPARK-33961
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>







[jira] [Assigned] (SPARK-33961) Upgrade SBT to 1.4.6

2021-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33961:


Assignee: Apache Spark

> Upgrade SBT to 1.4.6
> 
>
> Key: SPARK-33961
> URL: https://issues.apache.org/jira/browse/SPARK-33961
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Assigned] (SPARK-33961) Upgrade SBT to 1.4.6

2021-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33961:


Assignee: (was: Apache Spark)

> Upgrade SBT to 1.4.6
> 
>
> Key: SPARK-33961
> URL: https://issues.apache.org/jira/browse/SPARK-33961
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>







[jira] [Commented] (SPARK-33961) Upgrade SBT to 1.4.6

2021-01-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257458#comment-17257458
 ] 

Apache Spark commented on SPARK-33961:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30993

> Upgrade SBT to 1.4.6
> 
>
> Key: SPARK-33961
> URL: https://issues.apache.org/jira/browse/SPARK-33961
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>







[jira] [Created] (SPARK-33961) Upgrade SBT to 1.4.6

2021-01-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-33961:
-

 Summary: Upgrade SBT to 1.4.6
 Key: SPARK-33961
 URL: https://issues.apache.org/jira/browse/SPARK-33961
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-33961) Upgrade SBT to 1.4.6

2021-01-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33961:
--
Priority: Minor  (was: Major)

> Upgrade SBT to 1.4.6
> 
>
> Key: SPARK-33961
> URL: https://issues.apache.org/jira/browse/SPARK-33961
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>







[jira] [Commented] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE

2021-01-02 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257437#comment-17257437
 ] 

Dongjoon Hyun commented on SPARK-33933:
---

Thank you for reporting a bug, [~zhongyu09]. I converted this to a subtask of 
SPARK-33828.

> Broadcast timeout happened unexpectedly in AQE 
> ---
>
> Key: SPARK-33933
> URL: https://issues.apache.org/jira/browse/SPARK-33933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yu Zhong
>Priority: Major
>
> In Spark 3.0, when AQE is enabled, broadcast timeouts often occur in normal 
> queries, as below.
>  
> {code:java}
> Could not execute broadcast in 300 secs. You can increase the timeout for 
> broadcasts via spark.sql.broadcastTimeout or disable broadcast join by 
> setting spark.sql.autoBroadcastJoinThreshold to -1
> {code}
>  
> This usually happens when a broadcast join (with or without a hint) follows a 
> long-running shuffle (more than 5 minutes). Disabling AQE makes the issue 
> disappear.
> The workaround is to increase spark.sql.broadcastTimeout, and it works. But 
> because the data to broadcast is very small, that doesn't make sense.
> After investigation, the root cause appears to be this: when AQE is enabled, in 
> getFinalPhysicalPlan, Spark traverses the physical plan bottom up, creates query 
> stages for the materialized parts via createQueryStages, and materializes those 
> newly created query stages to submit map stages or broadcasts. When a 
> ShuffleQueryStage is materializing before a BroadcastQueryStage, the map job and 
> the broadcast job are submitted almost at the same time, but the map job holds 
> all the computing resources. If the map job runs slowly (when there is a lot of 
> data to process and resources are limited), the broadcast job cannot be started 
> (and finished) before spark.sql.broadcastTimeout, which causes the whole job to 
> fail (introduced in SPARK-31475).
> Code to reproduce:
>  
> {code:java}
> import java.util.UUID
> import scala.util.Random
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SparkSession
> val spark = SparkSession.builder()
>   .master("local[2]")
>   .appName("Test Broadcast").getOrCreate()
> import spark.implicits._
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> val sc = spark.sparkContext
> sc.setLogLevel("INFO")
> val uuid = UUID.randomUUID
> val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
>   for (i <- Range(0, 1 + Random.nextInt(1)))
> yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
> }).toDF("index", "part", "pv", "uuid")
>   .withColumn("md5", md5($"uuid"))
> val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
> val dim = dim_data.toDF("name", "index")
> val result = df.groupBy("index")
>   .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
>   .join(dim, Seq("index"))
>   .collect(){code}
>  
>  
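The timing race described above can be illustrated with back-of-the-envelope arithmetic (all numbers are made up): the broadcast's timeout clock starts at submission, but the broadcast cannot run until the map job releases the executors.

```python
# Illustrative timeline (made-up numbers) for the race described above:
# both jobs are submitted at t=0, but the broadcast only starts once the
# map job frees resources, while its timeout clock started at submission.
BROADCAST_TIMEOUT_S = 300    # default spark.sql.broadcastTimeout

map_job_runtime_s = 360      # long shuffle occupying all executors
broadcast_runtime_s = 2      # tiny broadcast, cheap once it can actually run

broadcast_finish_s = map_job_runtime_s + broadcast_runtime_s
timed_out = broadcast_finish_s > BROADCAST_TIMEOUT_S
print(timed_out)  # True: broadcast fails despite broadcasting little data
```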






[jira] [Updated] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE

2021-01-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33933:
--
Parent: SPARK-33828
Issue Type: Sub-task  (was: Bug)

> Broadcast timeout happened unexpectedly in AQE 
> ---
>
> Key: SPARK-33933
> URL: https://issues.apache.org/jira/browse/SPARK-33933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yu Zhong
>Priority: Major
>
> In Spark 3.0, when AQE is enabled, broadcast timeouts often occur in normal 
> queries, as below.
>  
> {code:java}
> Could not execute broadcast in 300 secs. You can increase the timeout for 
> broadcasts via spark.sql.broadcastTimeout or disable broadcast join by 
> setting spark.sql.autoBroadcastJoinThreshold to -1
> {code}
>  
> This usually happens when a broadcast join (with or without a hint) follows a 
> long-running shuffle (more than 5 minutes). Disabling AQE makes the issue 
> disappear.
> The workaround is to increase spark.sql.broadcastTimeout, and it works. But 
> because the data to broadcast is very small, that doesn't make sense.
> After investigation, the root cause appears to be this: when AQE is enabled, in 
> getFinalPhysicalPlan, Spark traverses the physical plan bottom up, creates query 
> stages for the materialized parts via createQueryStages, and materializes those 
> newly created query stages to submit map stages or broadcasts. When a 
> ShuffleQueryStage is materializing before a BroadcastQueryStage, the map job and 
> the broadcast job are submitted almost at the same time, but the map job holds 
> all the computing resources. If the map job runs slowly (when there is a lot of 
> data to process and resources are limited), the broadcast job cannot be started 
> (and finished) before spark.sql.broadcastTimeout, which causes the whole job to 
> fail (introduced in SPARK-31475).
> Code to reproduce:
>  
> {code:java}
> import java.util.UUID
> import scala.util.Random
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SparkSession
> val spark = SparkSession.builder()
>   .master("local[2]")
>   .appName("Test Broadcast").getOrCreate()
> import spark.implicits._
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> val sc = spark.sparkContext
> sc.setLogLevel("INFO")
> val uuid = UUID.randomUUID
> val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
>   for (i <- Range(0, 1 + Random.nextInt(1)))
> yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
> }).toDF("index", "part", "pv", "uuid")
>   .withColumn("md5", md5($"uuid"))
> val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
> val dim = dim_data.toDF("name", "index")
> val result = df.groupBy("index")
>   .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
>   .join(dim, Seq("index"))
>   .collect(){code}
>  
>  






[jira] [Assigned] (SPARK-33956) Add rowCount for Range operator

2021-01-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33956:
-

Assignee: Yuming Wang

> Add rowCount for Range operator
> ---
>
> Key: SPARK-33956
> URL: https://issues.apache.org/jira/browse/SPARK-33956
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> {code:scala}
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.sql("select id from range(100)").explain("cost")
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B, 
> rowCount=100)
> {noformat}






[jira] [Resolved] (SPARK-33956) Add rowCount for Range operator

2021-01-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33956.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30989
[https://github.com/apache/spark/pull/30989]

> Add rowCount for Range operator
> ---
>
> Key: SPARK-33956
> URL: https://issues.apache.org/jira/browse/SPARK-33956
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> {code:scala}
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.sql("select id from range(100)").explain("cost")
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B, 
> rowCount=100)
> {noformat}






[jira] [Assigned] (SPARK-33960) LimitPushDown support Sort

2021-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33960:


Assignee: Apache Spark

> LimitPushDown support Sort
> --
>
> Key: SPARK-33960
> URL: https://issues.apache.org/jira/browse/SPARK-33960
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> LimitPushDown support Sort.









[jira] [Commented] (SPARK-33960) LimitPushDown support Sort

2021-01-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257426#comment-17257426
 ] 

Apache Spark commented on SPARK-33960:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30992

> LimitPushDown support Sort
> --
>
> Key: SPARK-33960
> URL: https://issues.apache.org/jira/browse/SPARK-33960
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> LimitPushDown support Sort.






[jira] [Assigned] (SPARK-33960) LimitPushDown support Sort

2021-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33960:


Assignee: (was: Apache Spark)

> LimitPushDown support Sort
> --
>
> Key: SPARK-33960
> URL: https://issues.apache.org/jira/browse/SPARK-33960
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> LimitPushDown support Sort.






[jira] [Created] (SPARK-33960) LimitPushDown support Sort

2021-01-02 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33960:
---

 Summary: LimitPushDown support Sort
 Key: SPARK-33960
 URL: https://issues.apache.org/jira/browse/SPARK-33960
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


LimitPushDown support Sort.
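Pushing a limit below a sort is the classic top-k rewrite: instead of fully sorting and then taking the first n rows, only n candidates need to be kept while scanning. A minimal sketch of the equivalence (plain Python, not Spark's optimizer):

```python
import heapq

def sort_then_limit(rows, n):
    # Naive plan: full sort, then take the first n rows.
    return sorted(rows)[:n]

def limit_pushed_down(rows, n):
    # Top-k plan: keep only the n smallest candidates while scanning,
    # which avoids sorting the whole input.
    return heapq.nsmallest(n, rows)

data = [7, 3, 9, 1, 5, 8, 2]
print(sort_then_limit(data, 3))    # [1, 2, 3]
print(limit_pushed_down(data, 3))  # [1, 2, 3] -- same result, less work
```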






[jira] [Commented] (SPARK-33888) JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis

2021-01-02 Thread Duc Hoa Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257370#comment-17257370
 ] 

Duc Hoa Nguyen commented on SPARK-33888:


It seems the PR has been accepted. Is anything else needed before it can be 
merged?

> JDBC SQL TIME type represents incorrectly as TimestampType, it should be 
> physical Int in millis
> ---
>
> Key: SPARK-33888
> URL: https://issues.apache.org/jira/browse/SPARK-33888
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3, 3.0.0, 3.0.1
>Reporter: Duc Hoa Nguyen
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, for JDBC, the SQL TIME type is incorrectly represented as Spark 
> TimestampType. It should instead be represented as a physical int in 
> milliseconds: a time of day, with no reference to a particular calendar, time 
> zone, or date, with a precision of one millisecond, stored as the number of 
> milliseconds after midnight, 00:00:00.000.
> We encountered this as the Avro logical type `TimeMillis` not being converted 
> correctly to the Spark `Timestamp` struct type by the `SchemaConverters`; it 
> converts to a regular `int` instead. Reproducible by ingesting data from a 
> MySQL table with a column of TIME type: the Spark JDBC dataframe gets the 
> correct type (Timestamp), but enforcing our Avro schema 
> (`{"type": "int", "logicalType": "time-millis"}`) externally fails with the 
> following exception:
> {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type 
> for schema of int}}
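The proposed physical representation, milliseconds after midnight, is straightforward to compute. A sketch of the encoding (plain Python, not Spark's actual converter):

```python
import datetime

def time_to_millis(t: datetime.time) -> int:
    """Encode a time of day as milliseconds after midnight 00:00:00.000,
    as described above (illustrative, not Spark's actual converter)."""
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1000 + t.microsecond // 1000

print(time_to_millis(datetime.time(0, 0, 0)))        # 0
print(time_to_millis(datetime.time(1, 2, 3, 4000)))  # 3723004
```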






[jira] [Commented] (SPARK-33958) spark sql DoubleType(0 * (-1)) return "-0.0"

2021-01-02 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257347#comment-17257347
 ] 

Yuming Wang commented on SPARK-33958:
-

PostgreSQL also returns -0:
{noformat}
postgres=# create table test_zjg(a float8);
CREATE TABLE
postgres=# insert into test_zjg values(-1.0);
INSERT 0 1
postgres=# select a*0 from test_zjg;
 ?column?
--
   -0
(1 row)
{noformat}

> spark sql DoubleType(0 * (-1))  return "-0.0"
> -
>
> Key: SPARK-33958
> URL: https://issues.apache.org/jira/browse/SPARK-33958
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.5, 3.0.0
>Reporter: Zhang Jianguo
>Priority: Minor
>
> spark version: 2.3.2
> {code:java}
> create table test_zjg(a double);
> insert into test_zjg values(-1.0);
> select a*0 from test_zjg
> {code}
> After the select, *{color:#de350b}we get -0.0 where 0.0 was expected:{color}*
> {noformat}
> +-----------------------+
> |(a * CAST(0 AS DOUBLE))|
> +-----------------------+
> |-0.0                   |
> +-----------------------+
> {noformat}
>  
>  
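The behaviour follows IEEE 754 floating-point semantics: multiplying a negative value by zero yields negative zero, which compares equal to positive zero but prints with a sign. A quick illustration in plain Python (which uses the same IEEE 754 doubles):

```python
import math

x = -1.0 * 0
print(x)                    # -0.0
print(x == 0.0)             # True: -0.0 compares equal to 0.0
print(math.copysign(1, x))  # -1.0: the sign bit is still set
```

So -0.0 is numerically equal to 0.0; only its textual rendering differs.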






[jira] [Commented] (SPARK-33959) Improve the statistics estimation of the Tail

2021-01-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257329#comment-17257329
 ] 

Apache Spark commented on SPARK-33959:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30991

> Improve the statistics estimation of the Tail
> -
>
> Key: SPARK-33959
> URL: https://issues.apache.org/jira/browse/SPARK-33959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as 
> e").write.saveAsTable("t1")
> println(Tail(Literal(5), spark.sql("SELECT * FROM 
> t1").queryExecution.logical).queryExecution.explainString(org.apache.spark.sql.execution.CostMode))
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=3.8 KiB)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}
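The expected plan above amounts to a limit-style statistics rule: cap the row count at the tail length and scale sizeInBytes proportionally to the retained rows. A hedged Java sketch of that arithmetic (the helper name and the 40-bytes-per-row figure are illustrative assumptions, not Spark's actual API):

```java
public class TailStatsSketch {
    // Hypothetical estimator mirroring what a limit-like operator (e.g. Tail n)
    // could derive from its child statistics: rowCount = min(n, child rows),
    // sizeInBytes scaled by the average row size.
    static long[] estimate(long limit, long childRowCount, long childSizeInBytes) {
        long rows = Math.min(limit, childRowCount);
        long perRow = Math.max(1, childSizeInBytes / Math.max(1, childRowCount));
        return new long[] { rows, rows * perRow };
    }

    public static void main(String[] args) {
        // Child: 100 rows at an assumed ~40 bytes each; Tail 5 keeps 5 rows.
        long[] s = estimate(5, 100, 4000);
        System.out.println("rowCount=" + s[0] + ", sizeInBytes=" + s[1] + " B");
        // prints rowCount=5, sizeInBytes=200 B
    }
}
```

With those assumed inputs the sketch lands on the 200.0 B / rowCount=5 shape shown in the expected plan; Spark's real estimator works from the schema's per-column sizes rather than a flat per-row figure.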






[jira] [Assigned] (SPARK-33959) Improve the statistics estimation of the Tail

2021-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33959:


Assignee: (was: Apache Spark)

> Improve the statistics estimation of the Tail
> -
>
> Key: SPARK-33959
> URL: https://issues.apache.org/jira/browse/SPARK-33959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as 
> e").write.saveAsTable("t1")
> println(Tail(Literal(5), spark.sql("SELECT * FROM 
> t1").queryExecution.logical).queryExecution.explainString(org.apache.spark.sql.execution.CostMode))
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=3.8 KiB)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}






[jira] [Commented] (SPARK-33959) Improve the statistics estimation of the Tail

2021-01-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257330#comment-17257330
 ] 

Apache Spark commented on SPARK-33959:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30991

> Improve the statistics estimation of the Tail
> -
>
> Key: SPARK-33959
> URL: https://issues.apache.org/jira/browse/SPARK-33959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as 
> e").write.saveAsTable("t1")
> println(Tail(Literal(5), spark.sql("SELECT * FROM 
> t1").queryExecution.logical).queryExecution.explainString(org.apache.spark.sql.execution.CostMode))
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=3.8 KiB)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}






[jira] [Assigned] (SPARK-33959) Improve the statistics estimation of the Tail

2021-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33959:


Assignee: Apache Spark

> Improve the statistics estimation of the Tail
> -
>
> Key: SPARK-33959
> URL: https://issues.apache.org/jira/browse/SPARK-33959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> {code:scala}
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as 
> e").write.saveAsTable("t1")
> println(Tail(Literal(5), spark.sql("SELECT * FROM 
> t1").queryExecution.logical).queryExecution.explainString(org.apache.spark.sql.execution.CostMode))
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=3.8 KiB)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}






[jira] [Updated] (SPARK-33959) Improve the statistics estimation of the Tail

2021-01-02 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33959:

Description: 
{code:scala}
spark.sql("set spark.sql.cbo.enabled=true")
spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as 
e").write.saveAsTable("t1")
println(Tail(Literal(5), spark.sql("SELECT * FROM 
t1").queryExecution.logical).queryExecution.explainString(org.apache.spark.sql.execution.CostMode))

{code}

Current:
{noformat}
== Optimized Logical Plan ==
Tail 5, Statistics(sizeInBytes=3.8 KiB)
+- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
{noformat}

Expected:
{noformat}
== Optimized Logical Plan ==
Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5)
+- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
{noformat}


> Improve the statistics estimation of the Tail
> -
>
> Key: SPARK-33959
> URL: https://issues.apache.org/jira/browse/SPARK-33959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as 
> e").write.saveAsTable("t1")
> println(Tail(Literal(5), spark.sql("SELECT * FROM 
> t1").queryExecution.logical).queryExecution.explainString(org.apache.spark.sql.execution.CostMode))
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=3.8 KiB)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}






[jira] [Created] (SPARK-33959) Improve the statistics estimation of the Tail

2021-01-02 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33959:
---

 Summary: Improve the statistics estimation of the Tail
 Key: SPARK-33959
 URL: https://issues.apache.org/jira/browse/SPARK-33959
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang









[jira] [Updated] (SPARK-33958) spark sql DoubleType(0 * (-1)) return "-0.0"

2021-01-02 Thread Zhang Jianguo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang Jianguo updated SPARK-33958:
--
Description: 
spark version: 2.3.2
{code:java}
create table test_zjg(a double);
insert into test_zjg values(-1.0);
select a*0 from test_zjg
{code}
 After the select operation, *{color:#de350b}we get -0.0, where 0.0 is expected:{color}*

+-----------------------+
|(a * CAST(0 AS DOUBLE))|
+-----------------------+
|-0.0                   |
+-----------------------+

 

 

  was:
spark version: 2.3.2
{code:java}
create table test_zjg(a double);
insert into test_zjg values(-1.0);
select a*0 from test_zjg
{code}
 After the select operation, *{color:#de350b}we get -0.0, where 0.0 is expected:{color}*

+-----------------------+
|(a * CAST(0 AS DOUBLE))|
+-----------------------+
|-0.0                   |
+-----------------------+

 

 


> spark sql DoubleType(0 * (-1))  return "-0.0"
> -
>
> Key: SPARK-33958
> URL: https://issues.apache.org/jira/browse/SPARK-33958
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.5, 3.0.0
>Reporter: Zhang Jianguo
>Priority: Minor
>
> spark version: 2.3.2
> {code:java}
> create table test_zjg(a double);
> insert into test_zjg values(-1.0);
> select a*0 from test_zjg
> {code}
>  After the select operation, *{color:#de350b}we get -0.0, where 0.0 is expected:{color}*
> +-----------------------+
> |(a * CAST(0 AS DOUBLE))|
> +-----------------------+
> |-0.0                   |
> +-----------------------+
>  
>  






[jira] [Updated] (SPARK-33958) spark sql DoubleType(0 * (-1)) return "-0.0"

2021-01-02 Thread Zhang Jianguo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang Jianguo updated SPARK-33958:
--
Description: 
spark version: 2.3.2
{code:java}
create table test_zjg(a double);
insert into test_zjg values(-1.0);
select a*0 from test_zjg
{code}
 After the select operation, *{color:#de350b}we get -0.0, where 0.0 is expected:{color}*

+-----------------------+
|(a * CAST(0 AS DOUBLE))|
+-----------------------+
|-0.0                   |
+-----------------------+

 

 

  was:
spark version: 2.3.2
{code:java}
create table test_zjg(a double);
insert into test_zjg values(-1.0);
select a*0 from test_zjg
{code}
 After the select operation, we get -0.0, where 0.0 is expected:

+-----------------------+
|(a * CAST(0 AS DOUBLE))|
+-----------------------+
|-0.0                   |
+-----------------------+

 

 


> spark sql DoubleType(0 * (-1))  return "-0.0"
> -
>
> Key: SPARK-33958
> URL: https://issues.apache.org/jira/browse/SPARK-33958
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.5, 3.0.0
>Reporter: Zhang Jianguo
>Priority: Minor
>
> spark version: 2.3.2
> {code:java}
> create table test_zjg(a double);
> insert into test_zjg values(-1.0);
> select a*0 from test_zjg
> {code}
>  After the select operation, *{color:#de350b}we get -0.0, where 0.0 is expected:{color}*
> +-----------------------+
> |(a * CAST(0 AS DOUBLE))|
> +-----------------------+
> |-0.0                   |
> +-----------------------+
>  
>  






[jira] [Updated] (SPARK-33958) spark sql DoubleType(0 * (-1)) return "-0.0"

2021-01-02 Thread Zhang Jianguo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang Jianguo updated SPARK-33958:
--
Description: 
spark version: 2.3.2
{code:java}
create table test_zjg(a double);
insert into test_zjg values(-1.0);
select a*0 from test_zjg
{code}
 After the select operation, we get -0.0, where 0.0 is expected:

+-----------------------+
|(a * CAST(0 AS DOUBLE))|
+-----------------------+
|-0.0                   |
+-----------------------+

 

 

  was:
spark version: 2.3.2

 

```sql

create table test_zjg(a double);

insert into test_zjg values(-1.0);

select a*0 from test_zjg

```

 

After the select operation, we get -0.0, where 0.0 is expected:

+-----------------------+
|(a * CAST(0 AS DOUBLE))|
+-----------------------+
|-0.0                   |
+-----------------------+

 

 


> spark sql DoubleType(0 * (-1))  return "-0.0"
> -
>
> Key: SPARK-33958
> URL: https://issues.apache.org/jira/browse/SPARK-33958
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.5, 3.0.0
>Reporter: Zhang Jianguo
>Priority: Minor
>
> spark version: 2.3.2
> {code:java}
> create table test_zjg(a double);
> insert into test_zjg values(-1.0);
> select a*0 from test_zjg
> {code}
>  After the select operation, we get -0.0, where 0.0 is expected:
> +-----------------------+
> |(a * CAST(0 AS DOUBLE))|
> +-----------------------+
> |-0.0                   |
> +-----------------------+
>  
>  






[jira] [Updated] (SPARK-33958) spark sql DoubleType(0 * (-1)) return "-0.0"

2021-01-02 Thread Zhang Jianguo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang Jianguo updated SPARK-33958:
--
Description: 
spark version: 2.3.2

 

```sql

create table test_zjg(a double);

insert into test_zjg values(-1.0);

select a*0 from test_zjg

```

 

After the select operation, we get -0.0, where 0.0 is expected:

+-----------------------+
|(a * CAST(0 AS DOUBLE))|
+-----------------------+
|-0.0                   |
+-----------------------+

 

 

  was:
spark version: 2.3.2

 

```sql

create table test_zjg(a double);

insert into test_zjg values(-1.0);

select a*0 from test_zjg

```

 

After the select operation, we get -0.0, where 0.0 is expected:

+-----------------------+
|(a * CAST(0 AS DOUBLE))|
+-----------------------+
|-0.0                   |
+-----------------------+

 

 


> spark sql DoubleType(0 * (-1))  return "-0.0"
> -
>
> Key: SPARK-33958
> URL: https://issues.apache.org/jira/browse/SPARK-33958
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.5, 3.0.0
>Reporter: Zhang Jianguo
>Priority: Minor
>
> spark version: 2.3.2
>  
> ```sql
> create table test_zjg(a double);
> insert into test_zjg values(-1.0);
> select a*0 from test_zjg
> ```
>  
> After the select operation, we get -0.0, where 0.0 is expected:
> +-----------------------+
> |(a * CAST(0 AS DOUBLE))|
> +-----------------------+
> |-0.0                   |
> +-----------------------+
>  
>  






[jira] [Updated] (SPARK-33958) spark sql DoubleType(0 * (-1)) return "-0.0"

2021-01-02 Thread Zhang Jianguo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang Jianguo updated SPARK-33958:
--
Description: 
spark version: 2.3.2

 

```sql

create table test_zjg(a double);

insert into test_zjg values(-1.0);

select a*0 from test_zjg

```

 

After the select operation, we get -0.0, where 0.0 is expected:

+-----------------------+
|(a * CAST(0 AS DOUBLE))|
+-----------------------+
|-0.0                   |
+-----------------------+

 

 

  was:
spark version: 2.3.2

 

```sql

create table test_zjg(a double);

insert into test_zjg values(-1.0);

select a*0 from test_zjg

```

 

After the select operation, we get -0.0, where 0.0 is expected:

+-----------------------+
|(a * CAST(0 AS DOUBLE))|
+-----------------------+
|-0.0                   |
+-----------------------+

 

 


> spark sql DoubleType(0 * (-1))  return "-0.0"
> -
>
> Key: SPARK-33958
> URL: https://issues.apache.org/jira/browse/SPARK-33958
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.5, 3.0.0
>Reporter: Zhang Jianguo
>Priority: Minor
>
> spark version: 2.3.2
>  
> ```sql
> create table test_zjg(a double);
> insert into test_zjg values(-1.0);
> select a*0 from test_zjg
> ```
>  
> After the select operation, we get -0.0, where 0.0 is expected:
> +-----------------------+
> |(a * CAST(0 AS DOUBLE))|
> +-----------------------+
> |-0.0                   |
> +-----------------------+
>  
>  






[jira] [Created] (SPARK-33958) spark sql DoubleType(0 * (-1)) return "-0.0"

2021-01-02 Thread Zhang Jianguo (Jira)
Zhang Jianguo created SPARK-33958:
-

 Summary: spark sql DoubleType(0 * (-1))  return "-0.0"
 Key: SPARK-33958
 URL: https://issues.apache.org/jira/browse/SPARK-33958
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0, 2.4.5, 2.3.2
Reporter: Zhang Jianguo


spark version: 2.3.2

 

```sql

create table test_zjg(a double);

insert into test_zjg values(-1.0);

select a*0 from test_zjg

```

 

After the select operation, we get -0.0, where 0.0 is expected:

+-----------------------+
|(a * CAST(0 AS DOUBLE))|
+-----------------------+
|-0.0                   |
+-----------------------+

 

 


