[jira] [Commented] (SPARK-33911) Update SQL migration guide about changes in HiveClientImpl

2020-12-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254966#comment-17254966
 ] 

Apache Spark commented on SPARK-33911:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30933

> Update SQL migration guide about changes in HiveClientImpl
> --
>
> Key: SPARK-33911
> URL: https://issues.apache.org/jira/browse/SPARK-33911
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> 1. https://github.com/apache/spark/pull/30802
> 2. https://github.com/apache/spark/pull/30711






[jira] [Commented] (SPARK-33911) Update SQL migration guide about changes in HiveClientImpl

2020-12-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254965#comment-17254965
 ] 

Apache Spark commented on SPARK-33911:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30933

> Update SQL migration guide about changes in HiveClientImpl
> --
>
> Key: SPARK-33911
> URL: https://issues.apache.org/jira/browse/SPARK-33911
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> 1. https://github.com/apache/spark/pull/30802
> 2. https://github.com/apache/spark/pull/30711






[jira] [Assigned] (SPARK-33897) Can't set option 'cross' in join method.

2020-12-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33897:


Assignee: GokiMori

> Can't set option 'cross' in join method.
> 
>
> Key: SPARK-33897
> URL: https://issues.apache.org/jira/browse/SPARK-33897
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: GokiMori
>Assignee: GokiMori
>Priority: Minor
>
> [The PySpark 
> documentation|https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join]
>  says "Must be one of: inner, cross, outer, full, fullouter, full_outer, 
> left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, 
> left_semi, anti, leftanti and left_anti."
> However, I get the following error when I set the cross option.
>  
> {code:java}
> scala> val df1 = spark.createDataFrame(Seq((1,"a"),(2,"b")))
> df1: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
> scala> val df2 = spark.createDataFrame(Seq((1,"A"),(2,"B"), (3, "C")))
> df2: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
> scala> df1.join(right = df2, usingColumns = Seq("_1"), joinType = 
> "cross").show()
> java.lang.IllegalArgumentException: requirement failed: Unsupported using 
> join type Cross
>  at scala.Predef$.require(Predef.scala:281)
>  at org.apache.spark.sql.catalyst.plans.UsingJoin.<init>(joinTypes.scala:106)
>  at org.apache.spark.sql.Dataset.join(Dataset.scala:1025)
>  ... 53 elided
> {code}
>  
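For affected versions (before this fix), a minimal workaround sketch in the 
spark-shell, reusing the df1/df2 from the snippet above (the exact form chosen 
is an assumption, not part of the report):
{code}
// 1) Use the dedicated cross-join API and filter on the key explicitly.
df1.crossJoin(df2).where(df1("_1") === df2("_1")).show()

// 2) Or pass an explicit join condition instead of usingColumns; the reported
//    "cross" using-join on "_1" behaves like an equi-join on that column.
df1.join(df2, df1("_1") === df2("_1")).show()
{code}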






[jira] [Resolved] (SPARK-33897) Can't set option 'cross' in join method.

2020-12-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33897.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30803
[https://github.com/apache/spark/pull/30803]

> Can't set option 'cross' in join method.
> 
>
> Key: SPARK-33897
> URL: https://issues.apache.org/jira/browse/SPARK-33897
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: GokiMori
>Assignee: GokiMori
>Priority: Minor
> Fix For: 3.1.0
>
>
> [The PySpark 
> documentation|https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join]
>  says "Must be one of: inner, cross, outer, full, fullouter, full_outer, 
> left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, 
> left_semi, anti, leftanti and left_anti."
> However, I get the following error when I set the cross option.
>  
> {code:java}
> scala> val df1 = spark.createDataFrame(Seq((1,"a"),(2,"b")))
> df1: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
> scala> val df2 = spark.createDataFrame(Seq((1,"A"),(2,"B"), (3, "C")))
> df2: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
> scala> df1.join(right = df2, usingColumns = Seq("_1"), joinType = 
> "cross").show()
> java.lang.IllegalArgumentException: requirement failed: Unsupported using 
> join type Cross
>  at scala.Predef$.require(Predef.scala:281)
>  at org.apache.spark.sql.catalyst.plans.UsingJoin.<init>(joinTypes.scala:106)
>  at org.apache.spark.sql.Dataset.join(Dataset.scala:1025)
>  ... 53 elided
> {code}
>  






[jira] [Commented] (SPARK-33911) Update SQL migration guide about changes in HiveClientImpl

2020-12-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254960#comment-17254960
 ] 

Apache Spark commented on SPARK-33911:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30932

> Update SQL migration guide about changes in HiveClientImpl
> --
>
> Key: SPARK-33911
> URL: https://issues.apache.org/jira/browse/SPARK-33911
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> 1. https://github.com/apache/spark/pull/30802
> 2. https://github.com/apache/spark/pull/30711






[jira] [Commented] (SPARK-33911) Update SQL migration guide about changes in HiveClientImpl

2020-12-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254959#comment-17254959
 ] 

Apache Spark commented on SPARK-33911:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30931

> Update SQL migration guide about changes in HiveClientImpl
> --
>
> Key: SPARK-33911
> URL: https://issues.apache.org/jira/browse/SPARK-33911
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> 1. https://github.com/apache/spark/pull/30802
> 2. https://github.com/apache/spark/pull/30711






[jira] [Commented] (SPARK-33911) Update SQL migration guide about changes in HiveClientImpl

2020-12-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254958#comment-17254958
 ] 

Apache Spark commented on SPARK-33911:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30931

> Update SQL migration guide about changes in HiveClientImpl
> --
>
> Key: SPARK-33911
> URL: https://issues.apache.org/jira/browse/SPARK-33911
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> 1. https://github.com/apache/spark/pull/30802
> 2. https://github.com/apache/spark/pull/30711






[jira] [Comment Edited] (SPARK-33915) Allow json expression to be pushable column

2020-12-25 Thread Ted Yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254944#comment-17254944
 ] 

Ted Yu edited comment on SPARK-33915 at 12/26/20, 4:14 AM:
---

I have experimented with two patches which allow a json expression to be a 
pushable column (without the presence of a cast).

The first involves duplicating GetJsonObject as GetJsonString, whose eval() 
method returns UTF8String. FunctionRegistry.scala and functions.scala would 
have corresponding changes.
In the PushableColumnBase class, a new case for GetJsonString is added.

Since code duplication is undesirable and a case class cannot be directly 
subclassed, there is more work to be done in this direction.

The second is simpler: the return type of GetJsonObject#eval is changed to 
UTF8String.
I have run the catalyst / sql core unit tests, which passed.
In the PushableColumnBase class, a new case for GetJsonObject is added.

In the test output, I observed the following (for the second patch; output for 
the first patch is similar):
{code}
Post-Scan Filters: (get_json_object(phone#37, $.code) = 1200)
{code}
Comments on the two approaches (or another approach) are welcome.

Below is a snippet for the second approach.
{code}
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
index c22b68890a..8004ddd735 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@@ -139,7 +139,7 @@ case class GetJsonObject(json: Expression, path: Expression)

   @transient private lazy val parsedPath = 
parsePath(path.eval().asInstanceOf[UTF8String])

-  override def eval(input: InternalRow): Any = {
+  override def eval(input: InternalRow): UTF8String = {
 val jsonStr = json.eval(input).asInstanceOf[UTF8String]
 if (jsonStr == null) {
   return null
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
index e4f001d61a..1cdc2642ba 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
@@ -723,6 +725,12 @@ abstract class PushableColumnBase {
 }
   case s: GetStructField if nestedPredicatePushdownEnabled =>
 helper(s.child).map(_ :+ s.childSchema(s.ordinal).name)
+  case GetJsonObject(col, field) =>
+Some(Seq("GetJsonObject(" + col + "," + field + ")"))
   case _ => None
 }
 helper(e).map(_.quoted)
{code}


was (Author: yuzhih...@gmail.com):
I have experimented with two patches which allow a json expression to be a 
pushable column (without the presence of a cast).

The first involves duplicating GetJsonObject as GetJsonString, whose eval() 
method returns UTF8String. FunctionRegistry.scala and functions.scala would 
have corresponding changes.
In the PushableColumnBase class, a new case for GetJsonString is added.

Since code duplication is undesirable and a case class cannot be directly 
subclassed, there is more work to be done in this direction.

The second is simpler: the return type of GetJsonObject#eval is changed to 
UTF8String.
I have run the catalyst / sql core unit tests, which passed.
In the PushableColumnBase class, a new case for GetJsonObject is added.

In the test output, I observed the following (for the second patch; output for 
the first patch is similar):
{code}
Post-Scan Filters: (get_json_object(phone#37, $.phone) = 1200)
{code}
Comments on the two approaches (or another approach) are welcome.

Below is a snippet for the second approach.
{code}
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
index c22b68890a..8004ddd735 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@@ -139,7 +139,7 @@ case class GetJsonObject(json: Expression, path: Expression)

   @transient private lazy val parsedPath = 
parsePath(path.eval().asInstanceOf[UTF8String])

-  override def eval(input: InternalRow): Any = {
+  override def eval(input: InternalRow): UTF8String = {
 val jsonStr = json.eval(input).asInstanceOf[UTF8String]
 if (jsonStr == null) {
   return null
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
index 

[jira] [Comment Edited] (SPARK-33915) Allow json expression to be pushable column

2020-12-25 Thread Ted Yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254944#comment-17254944
 ] 

Ted Yu edited comment on SPARK-33915 at 12/26/20, 4:14 AM:
---

I have experimented with two patches which allow a json expression to be a 
pushable column (without the presence of a cast).

The first involves duplicating GetJsonObject as GetJsonString, whose eval() 
method returns UTF8String. FunctionRegistry.scala and functions.scala would 
have corresponding changes.
In the PushableColumnBase class, a new case for GetJsonString is added.

Since code duplication is undesirable and a case class cannot be directly 
subclassed, there is more work to be done in this direction.

The second is simpler: the return type of GetJsonObject#eval is changed to 
UTF8String.
I have run the catalyst / sql core unit tests, which passed.
In the PushableColumnBase class, a new case for GetJsonObject is added.

In the test output, I observed the following (for the second patch; output for 
the first patch is similar):
{code}
Post-Scan Filters: (get_json_object(phone#37, $.phone) = 1200)
{code}
Comments on the two approaches (or another approach) are welcome.

Below is a snippet for the second approach.
{code}
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
index c22b68890a..8004ddd735 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@@ -139,7 +139,7 @@ case class GetJsonObject(json: Expression, path: Expression)

   @transient private lazy val parsedPath = 
parsePath(path.eval().asInstanceOf[UTF8String])

-  override def eval(input: InternalRow): Any = {
+  override def eval(input: InternalRow): UTF8String = {
 val jsonStr = json.eval(input).asInstanceOf[UTF8String]
 if (jsonStr == null) {
   return null
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
index e4f001d61a..1cdc2642ba 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
@@ -723,6 +725,12 @@ abstract class PushableColumnBase {
 }
   case s: GetStructField if nestedPredicatePushdownEnabled =>
 helper(s.child).map(_ :+ s.childSchema(s.ordinal).name)
+  case GetJsonObject(col, field) =>
+Some(Seq("GetJsonObject(" + col + "," + field + ")"))
   case _ => None
 }
 helper(e).map(_.quoted)
{code}


was (Author: yuzhih...@gmail.com):
I have experimented with two patches which allow a json expression to be a 
pushable column (without the presence of a cast).

The first involves duplicating GetJsonObject as GetJsonString, whose eval() 
method returns UTF8String. FunctionRegistry.scala and functions.scala would 
have corresponding changes.
In the PushableColumnBase class, a new case for GetJsonString is added.

Since code duplication is undesirable and a case class cannot be directly 
subclassed, there is more work to be done in this direction.

The second is simpler: the return type of GetJsonObject#eval is changed to 
UTF8String.
I have run the catalyst / sql core unit tests, which passed.
In the PushableColumnBase class, a new case for GetJsonObject is added.

In the test output, I observed the following (for the second patch; output for 
the first patch is similar):

Post-Scan Filters: (get_json_object(phone#37, $.phone) = 1200)

Comments on the two approaches (or another approach) are welcome.

Below is a snippet for the second approach.
{code}
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
index c22b68890a..8004ddd735 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@@ -139,7 +139,7 @@ case class GetJsonObject(json: Expression, path: Expression)

   @transient private lazy val parsedPath = 
parsePath(path.eval().asInstanceOf[UTF8String])

-  override def eval(input: InternalRow): Any = {
+  override def eval(input: InternalRow): UTF8String = {
 val jsonStr = json.eval(input).asInstanceOf[UTF8String]
 if (jsonStr == null) {
   return null
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
index 

[jira] [Commented] (SPARK-33915) Allow json expression to be pushable column

2020-12-25 Thread Ted Yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254944#comment-17254944
 ] 

Ted Yu commented on SPARK-33915:


I have experimented with two patches which allow a json expression to be a 
pushable column (without the presence of a cast).

The first involves duplicating GetJsonObject as GetJsonString, whose eval() 
method returns UTF8String. FunctionRegistry.scala and functions.scala would 
have corresponding changes.
In the PushableColumnBase class, a new case for GetJsonString is added.

Since code duplication is undesirable and a case class cannot be directly 
subclassed, there is more work to be done in this direction.

The second is simpler: the return type of GetJsonObject#eval is changed to 
UTF8String.
I have run the catalyst / sql core unit tests, which passed.
In the PushableColumnBase class, a new case for GetJsonObject is added.

In the test output, I observed the following (for the second patch; output for 
the first patch is similar):

Post-Scan Filters: (get_json_object(phone#37, $.phone) = 1200)

Comments on the two approaches (or another approach) are welcome.

Below is a snippet for the second approach.
{code}
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
index c22b68890a..8004ddd735 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@@ -139,7 +139,7 @@ case class GetJsonObject(json: Expression, path: Expression)

   @transient private lazy val parsedPath = 
parsePath(path.eval().asInstanceOf[UTF8String])

-  override def eval(input: InternalRow): Any = {
+  override def eval(input: InternalRow): UTF8String = {
 val jsonStr = json.eval(input).asInstanceOf[UTF8String]
 if (jsonStr == null) {
   return null
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
index e4f001d61a..1cdc2642ba 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
@@ -723,6 +725,12 @@ abstract class PushableColumnBase {
 }
   case s: GetStructField if nestedPredicatePushdownEnabled =>
 helper(s.child).map(_ :+ s.childSchema(s.ordinal).name)
+  case GetJsonObject(col, field) =>
+Some(Seq("GetJsonObject(" + col + "," + field + ")"))
   case _ => None
 }
 helper(e).map(_.quoted)
{code}
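
For reference, a short sketch of the query shape this proposal targets (the 
DataFrame `df` and its string column `phone` holding json are assumptions based 
on the plan output above); with the current code the predicate stays in 
Post-Scan Filters instead of being offered to SupportsPushDownFilters:
{code}
import org.apache.spark.sql.functions.{col, get_json_object}

// Filter on a field inside a json string column; today this predicate is not
// handed to the data source as a pushable filter.
df.filter(get_json_object(col("phone"), "$.code") === "1200").explain()
{code}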

> Allow json expression to be pushable column
> ---
>
> Key: SPARK-33915
> URL: https://issues.apache.org/jira/browse/SPARK-33915
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Ted Yu
>Priority: Major
>
> Currently PushableColumnBase provides no support for json / jsonb expression.
> Example of json expression:
> {code}
> get_json_object(phone, '$.code') = '1200'
> {code}
> If non-string literal is part of the expression, the presence of cast() would 
> complicate the situation.
> Implication is that implementation of SupportsPushDownFilters doesn't have a 
> chance to perform pushdown even if third party DB engine supports json 
> expression pushdown.
> This issue is for discussion and implementation of Spark core changes which 
> would allow json expression to be recognized as pushable column.






[jira] [Updated] (SPARK-33915) Allow json expression to be pushable column

2020-12-25 Thread Ted Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-33915:
---
Description: 
Currently PushableColumnBase provides no support for json / jsonb expression.

Example of json expression:
{code}
get_json_object(phone, '$.code') = '1200'
{code}
If non-string literal is part of the expression, the presence of cast() would 
complicate the situation.

Implication is that implementation of SupportsPushDownFilters doesn't have a 
chance to perform pushdown even if third party DB engine supports json 
expression pushdown.

This issue is for discussion and implementation of Spark core changes which 
would allow json expression to be recognized as pushable column.

  was:
Currently PushableColumnBase provides no support for json / jsonb expression.

Example of json expression:
get_json_object(phone, '$.code') = '1200'

If non-string literal is part of the expression, the presence of cast() would 
complicate the situation.

Implication is that implementation of SupportsPushDownFilters doesn't have a 
chance to perform pushdown even if third party DB engine supports json 
expression pushdown.

This issue is for discussion and implementation of Spark core changes which 
would allow json expression to be recognized as pushable column.


> Allow json expression to be pushable column
> ---
>
> Key: SPARK-33915
> URL: https://issues.apache.org/jira/browse/SPARK-33915
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Ted Yu
>Priority: Major
>
> Currently PushableColumnBase provides no support for json / jsonb expression.
> Example of json expression:
> {code}
> get_json_object(phone, '$.code') = '1200'
> {code}
> If non-string literal is part of the expression, the presence of cast() would 
> complicate the situation.
> Implication is that implementation of SupportsPushDownFilters doesn't have a 
> chance to perform pushdown even if third party DB engine supports json 
> expression pushdown.
> This issue is for discussion and implementation of Spark core changes which 
> would allow json expression to be recognized as pushable column.






[jira] [Created] (SPARK-33915) Allow json expression to be pushable column

2020-12-25 Thread Ted Yu (Jira)
Ted Yu created SPARK-33915:
--

 Summary: Allow json expression to be pushable column
 Key: SPARK-33915
 URL: https://issues.apache.org/jira/browse/SPARK-33915
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: Ted Yu


Currently PushableColumnBase provides no support for json / jsonb expression.

Example of json expression:
get_json_object(phone, '$.code') = '1200'

If non-string literal is part of the expression, the presence of cast() would 
complicate the situation.

Implication is that implementation of SupportsPushDownFilters doesn't have a 
chance to perform pushdown even if third party DB engine supports json 
expression pushdown.

This issue is for discussion and implementation of Spark core changes which 
would allow json expression to be recognized as pushable column.






[jira] [Resolved] (SPARK-27367) Faster RoaringBitmap Serialization with v0.8.0

2020-12-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-27367.
-
Resolution: Duplicate

Issue fixed by SPARK-32437.

> Faster RoaringBitmap Serialization with v0.8.0
> --
>
> Key: SPARK-27367
> URL: https://issues.apache.org/jira/browse/SPARK-27367
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Imran Rashid
>Priority: Major
>
> RoaringBitmap 0.8.0 adds faster serde, but also requires us to change how we 
> call the serde routines slightly to take advantage of it.  This is probably a 
> worthwhile optimization, as every shuffle map task with a large # of 
> partitions generates these bitmaps, and the driver especially has to 
> deserialize many of these messages.
> See 
> * https://github.com/apache/spark/pull/24264#issuecomment-479675572
> * https://github.com/RoaringBitmap/RoaringBitmap/pull/325
> * https://github.com/RoaringBitmap/RoaringBitmap/issues/319
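
For illustration only, a rough sketch of the ByteBuffer-based serde that the 
linked RoaringBitmap changes expose (the exact overloads are an assumption and 
should be checked against the 0.8.0 release, not taken from the PRs above):
{code}
import java.nio.ByteBuffer
import org.roaringbitmap.RoaringBitmap

val bitmap = new RoaringBitmap()
bitmap.add(3)
bitmap.add(1000000)

// Instead of wrapping an OutputStream in a DataOutputStream, write straight
// into a ByteBuffer (assumed 0.8.0-style overload), then read it back.
val buffer = ByteBuffer.allocate(bitmap.serializedSizeInBytes())
bitmap.serialize(buffer)
buffer.flip()

val roundTrip = new RoaringBitmap()
roundTrip.deserialize(buffer)
{code}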






[jira] [Commented] (SPARK-33070) Optimizer rules for HigherOrderFunctions

2020-12-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254910#comment-17254910
 ] 

Apache Spark commented on SPARK-33070:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30930

> Optimizer rules for HigherOrderFunctions
> 
>
> Key: SPARK-33070
> URL: https://issues.apache.org/jira/browse/SPARK-33070
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Priority: Minor
>
> SimpleHigherOrderFunctions like ArrayTransform, ArrayFilter, etc., can be 
> combined and reordered to achieve a more optimal plan.
> Possible rules:
> * Combine 2 consecutive array transforms
> * Combine 2 consecutive array filters
> * Push array filter through array sort
> * Remove array sort before array exists and array forall.
> * Combine 2 consecutive map filters
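
To make the first rule in the list above concrete, a small sketch at the 
DataFrame API level (the array column name xs is an assumption): two 
consecutive array transforms can be collapsed into one transform with the 
composed lambda, avoiding the intermediate array.
{code}
import org.apache.spark.sql.functions.{col, transform}

// Two chained transforms over the same array column ...
val chained  = transform(transform(col("xs"), x => x + 1), x => x * 2)

// ... are equivalent to a single transform with the composed lambda, which is
// the kind of rewrite such a rule could perform.
val combined = transform(col("xs"), x => (x + 1) * 2)
{code}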






[jira] [Assigned] (SPARK-33070) Optimizer rules for HigherOrderFunctions

2020-12-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33070:


Assignee: Apache Spark

> Optimizer rules for HigherOrderFunctions
> 
>
> Key: SPARK-33070
> URL: https://issues.apache.org/jira/browse/SPARK-33070
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Assignee: Apache Spark
>Priority: Minor
>
> SimpleHigherOrderFunctions like ArrayTransform, ArrayFilter, etc., can be 
> combined and reordered to achieve a more optimal plan.
> Possible rules:
> * Combine 2 consecutive array transforms
> * Combine 2 consecutive array filters
> * Push array filter through array sort
> * Remove array sort before array exists and array forall.
> * Combine 2 consecutive map filters






[jira] [Commented] (SPARK-33070) Optimizer rules for HigherOrderFunctions

2020-12-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254909#comment-17254909
 ] 

Apache Spark commented on SPARK-33070:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30930

> Optimizer rules for HigherOrderFunctions
> 
>
> Key: SPARK-33070
> URL: https://issues.apache.org/jira/browse/SPARK-33070
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Priority: Minor
>
> SimpleHigherOrderFunctions like ArrayTransform, ArrayFilter, etc., can be 
> combined and reordered to achieve a more optimal plan.
> Possible rules:
> * Combine 2 consecutive array transforms
> * Combine 2 consecutive array filters
> * Push array filter through array sort
> * Remove array sort before array exists and array forall.
> * Combine 2 consecutive map filters






[jira] [Assigned] (SPARK-33070) Optimizer rules for HigherOrderFunctions

2020-12-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33070:


Assignee: (was: Apache Spark)

> Optimizer rules for HigherOrderFunctions
> 
>
> Key: SPARK-33070
> URL: https://issues.apache.org/jira/browse/SPARK-33070
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Priority: Minor
>
> SimpleHigherOrderFunctions like ArrayTransform, ArrayFilter, etc., can be 
> combined and reordered to achieve a more optimal plan.
> Possible rules:
> * Combine 2 consecutive array transforms
> * Combine 2 consecutive array filters
> * Push array filter through array sort
> * Remove array sort before array exists and array forall.
> * Combine 2 consecutive map filters






[jira] [Updated] (SPARK-33070) Optimizer rules for HigherOrderFunctions

2020-12-25 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-33070:
---
Summary: Optimizer rules for HigherOrderFunctions  (was: Optimizer rules 
for collection datatypes and SimpleHigherOrderFunction)

> Optimizer rules for HigherOrderFunctions
> 
>
> Key: SPARK-33070
> URL: https://issues.apache.org/jira/browse/SPARK-33070
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Priority: Minor
>
> SimpleHigherOrderFunctions like ArrayTransform, ArrayFilter, etc., can be 
> combined and reordered to achieve a more optimal plan.
> Possible rules:
> * Combine 2 consecutive array transforms
> * Combine 2 consecutive array filters
> * Push array filter through array sort
> * Remove array sort before array exists and array forall.
> * Combine 2 consecutive map filters






[jira] [Assigned] (SPARK-33914) Describe the structure of unified v1 and v2 tests

2020-12-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33914:


Assignee: (was: Apache Spark)

> Describe the structure of unified v1 and v2 tests
> -
>
> Key: SPARK-33914
> URL: https://issues.apache.org/jira/browse/SPARK-33914
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Add comments for unified v1 and v2 tests and describe their structure.






[jira] [Commented] (SPARK-33914) Describe the structure of unified v1 and v2 tests

2020-12-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254866#comment-17254866
 ] 

Apache Spark commented on SPARK-33914:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30929

> Describe the structure of unified v1 and v2 tests
> -
>
> Key: SPARK-33914
> URL: https://issues.apache.org/jira/browse/SPARK-33914
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Add comments for unified v1 and v2 tests and describe their structure.






[jira] [Assigned] (SPARK-33914) Describe the structure of unified v1 and v2 tests

2020-12-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33914:


Assignee: Apache Spark

> Describe the structure of unified v1 and v2 tests
> -
>
> Key: SPARK-33914
> URL: https://issues.apache.org/jira/browse/SPARK-33914
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Add comments for unified v1 and v2 tests and describe their structure.






[jira] [Commented] (SPARK-33914) Describe the structure of unified v1 and v2 tests

2020-12-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254865#comment-17254865
 ] 

Apache Spark commented on SPARK-33914:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30929

> Describe the structure of unified v1 and v2 tests
> -
>
> Key: SPARK-33914
> URL: https://issues.apache.org/jira/browse/SPARK-33914
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Add comments for unified v1 and v2 tests and describe their structure.






[jira] [Created] (SPARK-33914) Describe the structure of unified v1 and v2 tests

2020-12-25 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33914:
--

 Summary: Describe the structure of unified v1 and v2 tests
 Key: SPARK-33914
 URL: https://issues.apache.org/jira/browse/SPARK-33914
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk


Add comments for unified v1 and v2 tests and describe their structure.






[jira] [Created] (SPARK-33913) Upgrade Kafka to 2.7.0

2020-12-25 Thread dengziming (Jira)
dengziming created SPARK-33913:
--

 Summary: Upgrade Kafka to 2.7.0
 Key: SPARK-33913
 URL: https://issues.apache.org/jira/browse/SPARK-33913
 Project: Spark
  Issue Type: Improvement
  Components: Build, DStreams
Affects Versions: 3.2.0
Reporter: dengziming


 
The Apache Kafka community has released Apache Kafka 2.7.0. Some of its 
features are useful for Spark, for example KAFKA-9893 (configurable TCP 
connection timeout). More details: 
https://downloads.apache.org/kafka/2.7.0/RELEASE_NOTES.html






[jira] [Updated] (SPARK-33910) Simplify/Optimize conditional expressions

2020-12-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33910:

Description: 
1. Push down the foldable expressions through CaseWhen/If
2. Simplify conditional in predicate
3. Push the UnaryExpression into (if / case) branches
4. Simplify CaseWhen if elseValue is None
5. Simplify conditional if all branches are foldable boolean type

Common use cases are:
{code:sql}
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as
select 'a' as event_type, * from t1   
union all 
select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2
{code}

1. Reduce reading of the whole table.
{noformat}
explain select * from v1 where event_type = 'a';
Before simplify:
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#7, id#9L]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], 
Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct
+- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
id#10L]
   +- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a)
  +- *(2) ColumnarToRow
 +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: 
[(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct

After simplify:
== Physical Plan ==
*(1) Project [a AS event_type#8, id#4L]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], 
Format: Parquet
{noformat}

2. Push down the conditional expressions to data source.
{noformat}
explain select * from v1 where event_type = 'b';
Before simplify:
== Physical Plan ==
Union
:- LocalTableScan , [event_type#7, id#9L]
+- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
id#10L]
   +- *(1) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = b)
  +- *(1) ColumnarToRow
 +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: 
[(CASE WHEN (id#10L = 1) THEN b ELSE c END = b)], Format: Parquet, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct

After simplify:
== Physical Plan ==
*(1) Project [CASE WHEN (id#5L = 1) THEN b ELSE c END AS event_type#8, id#5L AS 
id#4L]
+- *(1) Filter (isnotnull(id#5L) AND (id#5L = 1))
   +- *(1) ColumnarToRow
  +- FileScan parquet default.t2[id#5L] Batched: true, DataFilters: 
[isnotnull(id#5L), (id#5L = 1)], Format: Parquet, PartitionFilters: [], 
PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct
{noformat}

3. Reduce the amount of calculation.
{noformat}
Before simplify:
explain select event_type = 'e' from v1;
== Physical Plan ==
Union
:- *(1) Project [false AS (event_type = e)#37]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[] Batched: true, DataFilters: [], Format: 
Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
+- *(2) Project [(CASE WHEN (id#21L = 1) THEN b ELSE c END = e) AS (event_type 
= e)#38]
   +- *(2) ColumnarToRow
  +- FileScan parquet default.t2[id#21L] Batched: true, DataFilters: [], 
Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct

After simplify:
== Physical Plan ==
Union
:- *(1) Project [false AS (event_type = e)#10]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[] Batched: true, DataFilters: [], Format: 
Parquet,
+- *(2) Project [false AS (event_type = e)#14]
   +- *(2) ColumnarToRow
  +- FileScan parquet default.t2[] Batched: true, DataFilters: [], Format: 
Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>

{noformat}
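
A small sketch of the rewrites behind the plans above, written against the 
DataFrame API (column and table names follow the example; the exact rule names 
are not implied):
{code}
import org.apache.spark.sql.functions.{col, when}

// event_type as built for t2 in the view above.
val eventType = when(col("id") === 1, "b").otherwise("c")

// Filter 1: no branch can ever be "a", so the predicate folds to `false`
// and the whole t2 scan can be pruned.
val pred1 = eventType === "a"   // simplifies to a literal false

// Filter 2: pushing the comparison into the branches yields
// CASE WHEN id = 1 THEN true ELSE false END, i.e. `id = 1` (plus an
// isnotnull(id) guard), which becomes a pushable data source filter.
val pred2 = eventType === "b"
{code}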



  was:
1. Push down the foldable expressions through CaseWhen/If
2. Simplify conditional in predicate
3. Push the UnaryExpression into (if / case) branches
4. Simplify CaseWhen if elseValue is None
5. Simplify conditional if all branches are foldable boolean type

 

Common use cases are:
{code:sql}
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as
select 'a' as event_type, * from t1   
union all 
select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2
{code}

1. Reduce reading of the whole table.
{noformat}
explain select * from v1 where event_type = 'a';
Before simplify:
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#7, id#9L]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], 
Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: 

[jira] [Updated] (SPARK-33910) Simplify/Optimize conditional expressions

2020-12-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33910:

Description: 
1. Push down the foldable expressions through CaseWhen/If
2. Simplify conditional in predicate
3. Push the UnaryExpression into (if / case) branches
4. Simplify CaseWhen if elseValue is None
5. Simplify conditional if all branches are foldable boolean type

 

Common use cases are:
{code:sql}
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as
select 'a' as event_type, * from t1   
union all 
select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2
{code}

1. Reduce reading of the whole table.
{noformat}
explain select * from v1 where event_type = 'a';
Before simplify:
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#7, id#9L]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], 
Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct
+- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
id#10L]
   +- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a)
  +- *(2) ColumnarToRow
 +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: 
[(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct

After simplify:
== Physical Plan ==
*(1) Project [a AS event_type#8, id#4L]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], 
Format: Parquet
{noformat}

2. Push down the conditional expressions to data source.
{noformat}
explain select * from v1 where event_type = 'b';
Before simplify:
== Physical Plan ==
Union
:- LocalTableScan , [event_type#7, id#9L]
+- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
id#10L]
   +- *(1) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = b)
  +- *(1) ColumnarToRow
 +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: 
[(CASE WHEN (id#10L = 1) THEN b ELSE c END = b)], Format: Parquet, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct
After simplify:
== Physical Plan ==
*(1) Project [CASE WHEN (id#5L = 1) THEN b ELSE c END AS event_type#8, id#5L AS 
id#4L]
+- *(1) Filter (isnotnull(id#5L) AND (id#5L = 1))
   +- *(1) ColumnarToRow
  +- FileScan parquet default.t2[id#5L] Batched: true, DataFilters: 
[isnotnull(id#5L), (id#5L = 1)], Format: Parquet, PartitionFilters: [], 
PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct
{noformat}

3. Reduce the amount of calculation.
{noformat}
Before simplify:
spark-sql> explain select event_type = 'e' from v1;
== Physical Plan ==
Union
:- *(1) Project [false AS (event_type = e)#37]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[] Batched: true, DataFilters: [], Format: 
Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
+- *(2) Project [(CASE WHEN (id#21L = 1) THEN b ELSE c END = e) AS (event_type 
= e)#38]
   +- *(2) ColumnarToRow
  +- FileScan parquet default.t2[id#21L] Batched: true, DataFilters: [], 
Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct
After simplify:
== Physical Plan ==
Union
:- *(1) Project [false AS (event_type = e)#10]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[] Batched: true, DataFilters: [], Format: 
Parquet,
+- *(2) Project [false AS (event_type = e)#14]
   +- *(2) ColumnarToRow
  +- FileScan parquet default.t2[] Batched: true, DataFilters: [], Format: 
Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>

{noformat}



  was:
1. Push down the foldable expressions through CaseWhen/If
2. Simplify conditional in predicate
3. Push the UnaryExpression into (if / case) branches
4. Simplify CaseWhen if elseValue is None
5. Simplify conditional if all branches are foldable boolean type

 

Common use cases are:
{code:sql}
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as
select 'a' as event_type, * from t1   
union all 
select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2
{code}

1. Reduce reading of the whole table.
{noformat}
explain select * from v1 where event_type = 'a';
Before simplify:
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#7, id#9L]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], 
Format: Parquet, Location: 

[jira] [Updated] (SPARK-33910) Simplify/Optimize conditional expressions

2020-12-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33910:

Description: 
1. Push down the foldable expressions through CaseWhen/If
2. Simplify conditional in predicate
3. Push the UnaryExpression into (if / case) branches
4. Simplify CaseWhen if elseValue is None
5. Simplify conditional if all branches are foldable boolean type

 

Common use cases are:
{code:sql}
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as
select 'a' as event_type, * from t1   
union all 
select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2
{code}

1. Reduce reading of the whole table.
{noformat}
explain select * from v1 where event_type = 'a';
Before simplify:
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#7, id#9L]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], 
Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct
+- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
id#10L]
   +- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a)
  +- *(2) ColumnarToRow
 +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: 
[(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct

After simplify:
== Physical Plan ==
*(1) Project [a AS event_type#8, id#4L]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], 
Format: Parquet
{noformat}

2. Push down the conditional expressions to data source.
{noformat}
explain select * from v1 where event_type = 'b';
Before simplify:
== Physical Plan ==
Union
:- LocalTableScan , [event_type#7, id#9L]
+- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
id#10L]
   +- *(1) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = b)
  +- *(1) ColumnarToRow
 +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: 
[(CASE WHEN (id#10L = 1) THEN b ELSE c END = b)], Format: Parquet, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct

After simplify:
== Physical Plan ==
*(1) Project [CASE WHEN (id#5L = 1) THEN b ELSE c END AS event_type#8, id#5L AS 
id#4L]
+- *(1) Filter (isnotnull(id#5L) AND (id#5L = 1))
   +- *(1) ColumnarToRow
  +- FileScan parquet default.t2[id#5L] Batched: true, DataFilters: 
[isnotnull(id#5L), (id#5L = 1)], Format: Parquet, PartitionFilters: [], 
PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct
{noformat}

3. Reduce the amount of calculation.
{noformat}
Before simplify:
spark-sql> explain select event_type = 'e' from v1;
== Physical Plan ==
Union
:- *(1) Project [false AS (event_type = e)#37]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[] Batched: true, DataFilters: [], Format: 
Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
+- *(2) Project [(CASE WHEN (id#21L = 1) THEN b ELSE c END = e) AS (event_type 
= e)#38]
   +- *(2) ColumnarToRow
  +- FileScan parquet default.t2[id#21L] Batched: true, DataFilters: [], 
Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct

After simplify:
== Physical Plan ==
Union
:- *(1) Project [false AS (event_type = e)#10]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[] Batched: true, DataFilters: [], Format: 
Parquet,
+- *(2) Project [false AS (event_type = e)#14]
   +- *(2) ColumnarToRow
  +- FileScan parquet default.t2[] Batched: true, DataFilters: [], Format: 
Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>

{noformat}



  was:
1. Push down the foldable expressions through CaseWhen/If
2. Simplify conditional in predicate
3. Push the UnaryExpression into (if / case) branches
4. Simplify CaseWhen if elseValue is None
5. Simplify conditional if all branches are foldable boolean type

 

Common use cases are:
{code:sql}
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as
select 'a' as event_type, * from t1   
union all 
select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2
{code}

1. Reduce reading of the whole table.
{noformat}
explain select * from v1 where event_type = 'a';
Before simplify:
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#7, id#9L]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], 
Format: Parquet, PartitionFilters: [], PushedFilters: [], 

[jira] [Updated] (SPARK-33910) Simplify/Optimize conditional expressions

2020-12-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33910:

Description: 
1. Push down the foldable expressions through CaseWhen/If
2. Simplify conditional in predicate
3. Push the UnaryExpression into (if / case) branches
4. Simplify CaseWhen if elseValue is None
5. Simplify conditional if all branches are foldable boolean type

 

Common use cases are:
{code:sql}
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as
select 'a' as event_type, * from t1   
union all 
select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2
{code}

1. Reduce reading of the whole table.
{noformat}
explain select * from v1 where event_type = 'a';
Before simplify:
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#7, id#9L]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], 
Format: Parquet, Location: 
InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/t1], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct
+- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
id#10L]
   +- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a)
  +- *(2) ColumnarToRow
 +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: 
[(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct

After simplify:
== Physical Plan ==
*(1) Project [a AS event_type#8, id#4L]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], 
Format: Parquet
{noformat}

2. Push down the conditional expressions to data source.
{noformat}
explain select * from v1 where event_type = 'b';
Before simplify:
== Physical Plan ==
Union
:- LocalTableScan , [event_type#7, id#9L]
+- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
id#10L]
   +- *(1) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = b)
  +- *(1) ColumnarToRow
 +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: 
[(CASE WHEN (id#10L = 1) THEN b ELSE c END = b)], Format: Parquet, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct
After simplify:
== Physical Plan ==
*(1) Project [CASE WHEN (id#5L = 1) THEN b ELSE c END AS event_type#8, id#5L AS 
id#4L]
+- *(1) Filter (isnotnull(id#5L) AND (id#5L = 1))
   +- *(1) ColumnarToRow
  +- FileScan parquet default.t2[id#5L] Batched: true, DataFilters: 
[isnotnull(id#5L), (id#5L = 1)], Format: Parquet, PartitionFilters: [], 
PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct
{noformat}


  was:
1. Push down the foldable expressions through CaseWhen/If
2. Simplify conditional in predicate
3. Push the UnaryExpression into (if / case) branches
4. Simplify CaseWhen if elseValue is None
5. Simplify conditional if all branches are foldable boolean type

 

Common use cases are:
{code:sql}
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as
select 'a' as event_type, * from t1   
union all 
select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2
{code}

1. Reduce reading of the whole table.
{noformat}
explain select * from v1 where event_type = 'a';
Before simplify:
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#7, id#9L]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], 
Format: Parquet, Location: 
InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/t1], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct
+- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
id#10L]
   +- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a)
  +- *(2) ColumnarToRow
 +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: 
[(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct

After simplify:
== Physical Plan ==
*(1) Project [a AS event_type#8, id#4L]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], 
Format: Parquet
{noformat}

2. Push down the filter.
{noformat}
explain select * from v1 where event_type = 'b';
Before simplify:
== Physical Plan ==
Union
:- LocalTableScan <empty>, [event_type#7, id#9L]
+- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
id#10L]
   +- *(1) Filter (CASE WHEN (id#10L = 1) THEN b 

[jira] [Updated] (SPARK-33910) Simplify/Optimize conditional expressions

2020-12-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33910:

Description: 
1. Push down the foldable expressions through CaseWhen/If
2. Simplify conditional in predicate
3. Push the UnaryExpression into (if / case) branches
4. Simplify CaseWhen if elseValue is None
5. Simplify conditional if all branches are foldable boolean type

 

Common use cases are:
{code:sql}
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as
select 'a' as event_type, * from t1   
union all 
select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2
{code}

1. Avoid reading the whole table.
{noformat}
explain select * from v1 where event_type = 'a';
Before simplify:
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#7, id#9L]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], 
Format: Parquet, Location: 
InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/t1], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct
+- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
id#10L]
   +- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a)
  +- *(2) ColumnarToRow
 +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: 
[(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct

After simplify:
== Physical Plan ==
*(1) Project [a AS event_type#8, id#4L]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], 
Format: Parquet
{noformat}

2. Push down the filter.
{noformat}
explain select * from v1 where event_type = 'b';
Before simplify:
== Physical Plan ==
Union
:- LocalTableScan <empty>, [event_type#7, id#9L]
+- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
id#10L]
   +- *(1) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = b)
  +- *(1) ColumnarToRow
 +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: 
[(CASE WHEN (id#10L = 1) THEN b ELSE c END = b)], Format: Parquet, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct
After simplify:
== Physical Plan ==
*(1) Project [CASE WHEN (id#5L = 1) THEN b ELSE c END AS event_type#8, id#5L AS 
id#4L]
+- *(1) Filter (isnotnull(id#5L) AND (id#5L = 1))
   +- *(1) ColumnarToRow
  +- FileScan parquet default.t2[id#5L] Batched: true, DataFilters: 
[isnotnull(id#5L), (id#5L = 1)], Format: Parquet, PartitionFilters: [], 
PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct
{noformat}


  was:
Simplify/Optimize conditional expressions. We can improve these cases:
1. Reduce reads from the data source.
2. Simplify CaseWhen/If to support filter push down.
{code:sql}
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as  
   
select 'a' as event_type, * from t1 
   
union all   
   
select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * from 
t2 

explain select * from v1 where event_type = 'a';
{code}

Before this PR:

{noformat}
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#30533, id#30535L]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: [], 
Format: Parquet
+- *(2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c 
END AS event_type#30534, id#30536L]
   +- *(2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c 
END = a)
  +- *(2) ColumnarToRow
 +- FileScan parquet default.t2[id#30536L] Batched: true, DataFilters: 
[(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)], 
Format: Parquet
{noformat}

After this PR:


{noformat}
== Physical Plan ==
*(1) Project [a AS event_type#8, id#4L]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], 
Format: Parquet
{noformat}



>  Simplify/Optimize conditional expressions
> --
>
> Key: SPARK-33910
> URL: https://issues.apache.org/jira/browse/SPARK-33910
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> 1. Push down the foldable expressions through CaseWhen/If
> 2. Simplify conditional in predicate
> 3. Push the UnaryExpression into (if / case) 

[jira] [Updated] (SPARK-33848) Push the UnaryExpression into (if / case) branches

2020-12-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33848:

Summary: Push the UnaryExpression into (if / case) branches  (was: Push the 
cast into (if / case) branches)

> Push the UnaryExpression into (if / case) branches
> --
>
> Key: SPARK-33848
> URL: https://issues.apache.org/jira/browse/SPARK-33848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Push the cast into (if / case) branches. The use case is:
> {code:sql}
> create table t1 using parquet as select id from range(10);
> explain select id from t1 where (CASE WHEN id = 1 THEN '1' WHEN id = 3 THEN 
> '2' end) > 3;
> {code}
> Before this pr:
> {noformat}
> == Physical Plan ==
> *(1) Filter (cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as 
> int) > 3)
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: 
> [(cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > 
> 3)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {noformat}
> After this pr:
> {noformat}
> == Physical Plan ==
> LocalTableScan <empty>, [id#1L]
> {noformat}
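> To make the rewrite explicit (an illustrative note, not part of the original 
> ticket description): once the cast is pushed into each branch it can be 
> constant-folded, which is why the plan above collapses to an empty 
> LocalTableScan.
> {code:sql}
> -- After the cast is pushed into the branches and folded, the predicate is
> -- equivalent to:
> select id from t1 where (case when id = 1 then 1 when id = 3 then 2 end) > 3;
> -- No branch can exceed 3 and the implicit ELSE is NULL, so the filter never
> -- matches and the whole subtree can be pruned to an empty scan.
> {code}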



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33910) Simplify/Optimize conditional expressions

2020-12-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33910:

Description: 
Simplify/Optimize conditional expressions. We can improve these cases:
1. Reduce reads from the data source.
2. Simplify CaseWhen/If to support filter push down.
{code:sql}
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as  
   
select 'a' as event_type, * from t1 
   
union all   
   
select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * from 
t2 

explain select * from v1 where event_type = 'a';
{code}

Before this PR:

{noformat}
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#30533, id#30535L]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: [], 
Format: Parquet
+- *(2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c 
END AS event_type#30534, id#30536L]
   +- *(2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c 
END = a)
  +- *(2) ColumnarToRow
 +- FileScan parquet default.t2[id#30536L] Batched: true, DataFilters: 
[(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)], 
Format: Parquet
{noformat}

After this PR:


{noformat}
== Physical Plan ==
*(1) Project [a AS event_type#8, id#4L]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], 
Format: Parquet
{noformat}


  was:
Simplify CaseWhen/If conditionals. We can improve these cases:
1. Reduce reads from the data source.
2. Simplify CaseWhen/If to support filter push down.
{code:sql}
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as  
   
select 'a' as event_type, * from t1 
   
union all   
   
select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * from 
t2 

explain select * from v1 where event_type = 'a';
{code}

Before this PR:

{noformat}
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#30533, id#30535L]
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: [], 
Format: Parquet
+- *(2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c 
END AS event_type#30534, id#30536L]
   +- *(2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c 
END = a)
  +- *(2) ColumnarToRow
 +- FileScan parquet default.t2[id#30536L] Batched: true, DataFilters: 
[(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)], 
Format: Parquet
{noformat}

After this PR:


{noformat}
== Physical Plan ==
*(1) Project [a AS event_type#8, id#4L]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], 
Format: Parquet
{noformat}



>  Simplify/Optimize conditional expressions
> --
>
> Key: SPARK-33910
> URL: https://issues.apache.org/jira/browse/SPARK-33910
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Simplify/Optimize conditional expressions. We can improve these cases:
> 1. Reduce reads from the data source.
> 2. Simplify CaseWhen/If to support filter push down.
> {code:sql}
> create table t1 using parquet as select * from range(100);
> create table t2 using parquet as select * from range(200);
> create temp view v1 as
>  
> select 'a' as event_type, * from t1   
>  
> union all 
>  
> select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * 
> from t2 
> explain select * from v1 where event_type = 'a';
> {code}
> Before this PR:
> {noformat}
> == Physical Plan ==
> Union
> :- *(1) Project [a AS event_type#30533, id#30535L]
> :  +- *(1) ColumnarToRow
> : +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: 
> [], Format: Parquet
> +- *(2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c 
> END AS event_type#30534, id#30536L]
>+- *(2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN 
> c END = a)
>   +- *(2) ColumnarToRow
>  +- FileScan parquet default.t2[id#30536L] Batched: true, 
> DataFilters: [(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN 

[jira] [Updated] (SPARK-33910) Simplify/Optimize conditional expressions

2020-12-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33910:

Summary:  Simplify/Optimize conditional expressions  (was:  Simplify 
conditional)

>  Simplify/Optimize conditional expressions
> --
>
> Key: SPARK-33910
> URL: https://issues.apache.org/jira/browse/SPARK-33910
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Simplify CaseWhen/If conditionals. We can improve these cases:
> 1. Reduce reads from the data source.
> 2. Simplify CaseWhen/If to support filter push down.
> {code:sql}
> create table t1 using parquet as select * from range(100);
> create table t2 using parquet as select * from range(200);
> create temp view v1 as
>  
> select 'a' as event_type, * from t1   
>  
> union all 
>  
> select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * 
> from t2 
> explain select * from v1 where event_type = 'a';
> {code}
> Before this PR:
> {noformat}
> == Physical Plan ==
> Union
> :- *(1) Project [a AS event_type#30533, id#30535L]
> :  +- *(1) ColumnarToRow
> : +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: 
> [], Format: Parquet
> +- *(2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c 
> END AS event_type#30534, id#30536L]
>+- *(2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN 
> c END = a)
>   +- *(2) ColumnarToRow
>  +- FileScan parquet default.t2[id#30536L] Batched: true, 
> DataFilters: [(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c 
> END = a)], Format: Parquet
> {noformat}
> After this PR:
> {noformat}
> == Physical Plan ==
> *(1) Project [a AS event_type#8, id#4L]
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], 
> Format: Parquet
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33070) Optimizer rules for collection datatypes and SimpleHigherOrderFunction

2020-12-25 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-33070:
---
Affects Version/s: (was: 3.1.0)
   3.2.0

> Optimizer rules for collection datatypes and SimpleHigherOrderFunction
> --
>
> Key: SPARK-33070
> URL: https://issues.apache.org/jira/browse/SPARK-33070
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Priority: Minor
>
> SimpleHigherOrderFunctions like ArrayTransform, ArrayFilter, etc., can be 
> combined and reordered to achieve a more optimal plan.
> Possible rules:
> * Combine 2 consecutive array transforms
> * Combine 2 consecutive array filters
> * Push array filter through array sort
> * Remove array sort before array exists and array forall.
> * Combine 2 consecutive map filters
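> A hypothetical sketch of the first rule above (combining two consecutive array 
> transforms); the query and data are invented for illustration:
> {code:sql}
> select transform(transform(xs, x -> x + 1), x -> x * 2) as ys
> from values (array(1, 2, 3)) as t(xs);
> -- the two lambdas could be fused into a single pass:
> -- transform(xs, x -> (x + 1) * 2)
> {code}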



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33888) JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis

2020-12-25 Thread Duc Hoa Nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duc Hoa Nguyen updated SPARK-33888:
---
Summary: JDBC SQL TIME type represents incorrectly as TimestampType, it 
should be physical Int in millis  (was: AVRO SchemaConverts - logicalType 
TimeMillis not being converted to Timestamp type)

> JDBC SQL TIME type represents incorrectly as TimestampType, it should be 
> physical Int in millis
> ---
>
> Key: SPARK-33888
> URL: https://issues.apache.org/jira/browse/SPARK-33888
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3, 3.0.0, 3.0.1
>Reporter: Duc Hoa Nguyen
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, for JDBC, the SQL TIME type is incorrectly represented as Spark 
> TimestampType. It should be represented as a physical int in millis: a time of 
> day, with no reference to a particular calendar, time zone or date, with a 
> precision of one millisecond. It stores the number of milliseconds after 
> midnight, 00:00:00.000.
> We encountered an issue where the Avro logical type `TimeMillis` is not 
> converted to the Spark `Timestamp` type by `SchemaConverters`; it is converted 
> to a plain `int` instead. Reproducible by ingesting data from a MySQL table 
> with a column of TIME type: the Spark JDBC dataframe will get the correct type 
> (Timestamp), but enforcing our Avro schema 
> (`{"type": "int", "logicalType": "time-millis"}`) externally will fail to 
> apply with the following exception:
> {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type 
> for schema of int}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33888) AVRO SchemaConverts - logicalType TimeMillis not being converted to Timestamp type

2020-12-25 Thread Duc Hoa Nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duc Hoa Nguyen updated SPARK-33888:
---
Description: 
Currently, for JDBC, the SQL TIME type is incorrectly represented as Spark 
TimestampType. It should be represented as a physical int in millis: a time of 
day, with no reference to a particular calendar, time zone or date, with a 
precision of one millisecond. It stores the number of milliseconds after 
midnight, 00:00:00.000.

We encountered an issue where the Avro logical type `TimeMillis` is not 
converted to the Spark `Timestamp` type by `SchemaConverters`; it is converted 
to a plain `int` instead. Reproducible by ingesting data from a MySQL table with 
a column of TIME type: the Spark JDBC dataframe will get the correct type 
(Timestamp), but enforcing our Avro schema 
(`{"type": "int", "logicalType": "time-millis"}`) externally will fail to apply 
with the following exception:

{{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type 
for schema of int}}

  was:
We encountered an issue where the Avro logical type `TimeMillis` is not 
converted to the Spark `Timestamp` type by `SchemaConverters`; it is converted 
to a plain `int` instead. Reproducible by ingesting data from a MySQL table with 
a column of TIME type: the Spark JDBC dataframe will get the correct type 
(Timestamp), but enforcing our Avro schema 
(`{"type": "int", "logicalType": "time-millis"}`) externally will fail to apply 
with the following exception:

{{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type 
for schema of int}}




> AVRO SchemaConverts - logicalType TimeMillis not being converted to Timestamp 
> type
> --
>
> Key: SPARK-33888
> URL: https://issues.apache.org/jira/browse/SPARK-33888
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3, 3.0.0, 3.0.1
>Reporter: Duc Hoa Nguyen
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, for JDBC, the SQL TIME type is incorrectly represented as Spark 
> TimestampType. It should be represented as a physical int in millis: a time of 
> day, with no reference to a particular calendar, time zone or date, with a 
> precision of one millisecond. It stores the number of milliseconds after 
> midnight, 00:00:00.000.
> We encountered an issue where the Avro logical type `TimeMillis` is not 
> converted to the Spark `Timestamp` type by `SchemaConverters`; it is converted 
> to a plain `int` instead. Reproducible by ingesting data from a MySQL table 
> with a column of TIME type: the Spark JDBC dataframe will get the correct type 
> (Timestamp), but enforcing our Avro schema 
> (`{"type": "int", "logicalType": "time-millis"}`) externally will fail to 
> apply with the following exception:
> {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type 
> for schema of int}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33912) Refactor DependencyUtils ivy property parameter

2020-12-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254785#comment-17254785
 ] 

Apache Spark commented on SPARK-33912:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30928

> Refactor DependencyUtils ivy property parameter
> 
>
> Key: SPARK-33912
> URL: https://issues.apache.org/jira/browse/SPARK-33912
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Minor
>
> according to https://github.com/apache/spark/pull/29966#discussion_r533573137



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33912) Refactor DependencyUtils ivy property parameter

2020-12-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33912:


Assignee: Apache Spark

> Refactor DependencyUtils ivy property parameter
> 
>
> Key: SPARK-33912
> URL: https://issues.apache.org/jira/browse/SPARK-33912
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Minor
>
> according to https://github.com/apache/spark/pull/29966#discussion_r533573137



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33912) Refactor DependencyUtils ivy property parameter

2020-12-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33912:


Assignee: (was: Apache Spark)

> Refactor DependencyUtils ivy property parameter
> 
>
> Key: SPARK-33912
> URL: https://issues.apache.org/jira/browse/SPARK-33912
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Minor
>
> according to https://github.com/apache/spark/pull/29966#discussion_r533573137



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33888) AVRO SchemaConverts - logicalType TimeMillis not being converted to Timestamp type

2020-12-25 Thread Duc Hoa Nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duc Hoa Nguyen updated SPARK-33888:
---
Issue Type: Bug  (was: New Feature)

> AVRO SchemaConverts - logicalType TimeMillis not being converted to Timestamp 
> type
> --
>
> Key: SPARK-33888
> URL: https://issues.apache.org/jira/browse/SPARK-33888
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3, 3.0.0, 3.0.1
>Reporter: Duc Hoa Nguyen
>Assignee: Apache Spark
>Priority: Minor
>
> We encountered an issue where the Avro logical type `TimeMillis` is not 
> converted to the Spark `Timestamp` type by `SchemaConverters`; it is converted 
> to a plain `int` instead. Reproducible by ingesting data from a MySQL table 
> with a column of TIME type: the Spark JDBC dataframe will get the correct type 
> (Timestamp), but enforcing our Avro schema 
> (`{"type": "int", "logicalType": "time-millis"}`) externally will fail to 
> apply with the following exception:
> {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type 
> for schema of int}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28123) String Functions: Add support btrim

2020-12-25 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-28123:
---
Affects Version/s: (was: 3.0.0)
   3.2.0

> String Functions: Add support btrim
> ---
>
> Key: SPARK-28123
> URL: https://issues.apache.org/jira/browse/SPARK-28123
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Return Type||Description||Example||Result||
> |{{btrim(string bytea, bytes bytea)}}|{{bytea}}|Remove the longest string 
> containing only bytes appearing in _bytes_ from the start and end of 
> _string_|{{btrim('\000trim\001'::bytea, '\000\001'::bytea)}}|{{trim}}|
> More details: https://www.postgresql.org/docs/11/functions-binarystring.html
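> A usage sketch of the string form described in the PostgreSQL documentation 
> referenced above (illustrative only, not Spark output):
> {code:sql}
> SELECT btrim('xyxtrimyyx', 'xy');
> -- result: 'trim'
> {code}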



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30789) Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE

2020-12-25 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30789:
---
Affects Version/s: (was: 3.1.0)
   3.2.0

> Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE
> --
>
> Key: SPARK-30789
> URL: https://issues.apache.org/jira/browse/SPARK-30789
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> All of LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE support IGNORE NULLS | 
> RESPECT NULLS. For example:
> {code:java}
> LEAD (value_expr [, offset ])
> [ IGNORE NULLS | RESPECT NULLS ]
> OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ){code}
>  
> {code:java}
> LAG (value_expr [, offset ])
> [ IGNORE NULLS | RESPECT NULLS ]
> OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ){code}
>  
> {code:java}
> NTH_VALUE (expr, offset)
> [ IGNORE NULLS | RESPECT NULLS ]
> OVER
> ( [ PARTITION BY window_partition ]
> [ ORDER BY window_ordering 
>  frame_clause ] ){code}
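> A concrete usage sketch of the requested syntax (illustrative only; the table 
> and column names are invented):
> {code:sql}
> SELECT id,
>        LAG(price) IGNORE NULLS
>        OVER (PARTITION BY item ORDER BY ts) AS prev_non_null_price
> FROM sales;
> {code}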
>  
> *Oracle:*
> [https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0]
> *Redshift*
> [https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html]
> *Presto*
> [https://prestodb.io/docs/current/functions/window.html]
> *DB2*
> [https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1513.htm]
> *Teradata*
> [https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/GjCT6l7trjkIEjt~7Dhx4w]
> *Snowflake*
> [https://docs.snowflake.com/en/sql-reference/functions/lead.html]
> [https://docs.snowflake.com/en/sql-reference/functions/lag.html]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33804) Cleanup "view bounds are deprecated" compilation warnings

2020-12-25 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254767#comment-17254767
 ] 

Maxim Gekk commented on SPARK-33804:


[~LuciferYang] Could you convert this (and your other similar JIRA tickets) to 
a sub-task of the umbrella ticket 
https://issues.apache.org/jira/browse/SPARK-30165

> Cleanup "view bounds are deprecated" compilation warnings
> -
>
> Key: SPARK-33804
> URL: https://issues.apache.org/jira/browse/SPARK-33804
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> There are only 3 compilation warnings related to `view bounds are deprecated` 
> in SequenceFileRDDFunctions:
> [WARNING] 
> /spark-source/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala:35:
>  view bounds are deprecated; use an implicit parameter instead.
> [WARNING] 
> /spark-source/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala:35:
>  view bounds are deprecated; use an implicit parameter instead.
> [WARNING] 
> /spark-source/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala:55:
>  view bounds are deprecated; use an implicit parameter instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org