[jira] [Commented] (SPARK-33911) Update SQL migration guide about changes in HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-33911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254966#comment-17254966 ] Apache Spark commented on SPARK-33911: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30933 > Update SQL migration guide about changes in HiveClientImpl > -- > > Key: SPARK-33911 > URL: https://issues.apache.org/jira/browse/SPARK-33911 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > 1. https://github.com/apache/spark/pull/30802 > 2. https://github.com/apache/spark/pull/30711 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33911) Update SQL migration guide about changes in HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-33911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254965#comment-17254965 ] Apache Spark commented on SPARK-33911: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30933 > Update SQL migration guide about changes in HiveClientImpl > -- > > Key: SPARK-33911 > URL: https://issues.apache.org/jira/browse/SPARK-33911 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > 1. https://github.com/apache/spark/pull/30802 > 2. https://github.com/apache/spark/pull/30711 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33897) Can't set option 'cross' in join method.
[ https://issues.apache.org/jira/browse/SPARK-33897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33897: Assignee: GokiMori > Can't set option 'cross' in join method. > > > Key: SPARK-33897 > URL: https://issues.apache.org/jira/browse/SPARK-33897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: GokiMori >Assignee: GokiMori >Priority: Minor > > [The PySpark > documentation|https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join] > says "Must be one of: inner, cross, outer, full, fullouter, full_outer, > left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, > left_semi, anti, leftanti and left_anti." > However, I get the following error when I set the cross option. > > {code:java} > scala> val df1 = spark.createDataFrame(Seq((1,"a"),(2,"b"))) > df1: org.apache.spark.sql.DataFrame = [_1: int, _2: string] > scala> val df2 = spark.createDataFrame(Seq((1,"A"),(2,"B"), (3, "C"))) > df2: org.apache.spark.sql.DataFrame = [_1: int, _2: string] > scala> df1.join(right = df2, usingColumns = Seq("_1"), joinType = > "cross").show() > java.lang.IllegalArgumentException: requirement failed: Unsupported using > join type Cross > at scala.Predef$.require(Predef.scala:281) > at org.apache.spark.sql.catalyst.plans.UsingJoin.(joinTypes.scala:106) > at org.apache.spark.sql.Dataset.join(Dataset.scala:1025) > ... 53 elided > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33897) Can't set option 'cross' in join method.
[ https://issues.apache.org/jira/browse/SPARK-33897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33897. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30803 [https://github.com/apache/spark/pull/30803] > Can't set option 'cross' in join method. > > > Key: SPARK-33897 > URL: https://issues.apache.org/jira/browse/SPARK-33897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: GokiMori >Assignee: GokiMori >Priority: Minor > Fix For: 3.1.0 > > > [The PySpark > documentation|https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join] > says "Must be one of: inner, cross, outer, full, fullouter, full_outer, > left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, > left_semi, anti, leftanti and left_anti." > However, I get the following error when I set the cross option. > > {code:java} > scala> val df1 = spark.createDataFrame(Seq((1,"a"),(2,"b"))) > df1: org.apache.spark.sql.DataFrame = [_1: int, _2: string] > scala> val df2 = spark.createDataFrame(Seq((1,"A"),(2,"B"), (3, "C"))) > df2: org.apache.spark.sql.DataFrame = [_1: int, _2: string] > scala> df1.join(right = df2, usingColumns = Seq("_1"), joinType = > "cross").show() > java.lang.IllegalArgumentException: requirement failed: Unsupported using > join type Cross > at scala.Predef$.require(Predef.scala:281) > at org.apache.spark.sql.catalyst.plans.UsingJoin.(joinTypes.scala:106) > at org.apache.spark.sql.Dataset.join(Dataset.scala:1025) > ... 53 elided > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33911) Update SQL migration guide about changes in HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-33911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254960#comment-17254960 ] Apache Spark commented on SPARK-33911: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30932 > Update SQL migration guide about changes in HiveClientImpl > -- > > Key: SPARK-33911 > URL: https://issues.apache.org/jira/browse/SPARK-33911 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > 1. https://github.com/apache/spark/pull/30802 > 2. https://github.com/apache/spark/pull/30711 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33911) Update SQL migration guide about changes in HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-33911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254959#comment-17254959 ] Apache Spark commented on SPARK-33911: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30931 > Update SQL migration guide about changes in HiveClientImpl > -- > > Key: SPARK-33911 > URL: https://issues.apache.org/jira/browse/SPARK-33911 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > 1. https://github.com/apache/spark/pull/30802 > 2. https://github.com/apache/spark/pull/30711 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33911) Update SQL migration guide about changes in HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-33911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254958#comment-17254958 ] Apache Spark commented on SPARK-33911: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30931 > Update SQL migration guide about changes in HiveClientImpl > -- > > Key: SPARK-33911 > URL: https://issues.apache.org/jira/browse/SPARK-33911 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > 1. https://github.com/apache/spark/pull/30802 > 2. https://github.com/apache/spark/pull/30711 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254944#comment-17254944 ] Ted Yu edited comment on SPARK-33915 at 12/26/20, 4:14 AM: --- I have experimented with two patches which allow json expression to be pushable column (without the presence of cast). The first involves duplicating GetJsonObject as GetJsonString where eval() method returns UTF8String. FunctionRegistry.scala and functions.scala would have corresponding change. In PushableColumnBase class, new case for GetJsonString is added. Since code duplication is undesirable and case class cannot be directly subclassed, there is more work to be done in this direction. The second is simpler. The return type of GetJsonObject#eval is changed to UTF8String. I have run through catalyst / sql core unit tests which passed. In PushableColumnBase class, new case for GetJsonObject is added. In test output, I observed the following (for second patch, output for first patch is similar): {code} Post-Scan Filters: (get_json_object(phone#37, $.code) = 1200) {code} Comment on the two approaches (and other approach) is welcome. Below is snippet for second approach. 
{code} diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala index c22b68890a..8004ddd735 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala @@ -139,7 +139,7 @@ case class GetJsonObject(json: Expression, path: Expression) @transient private lazy val parsedPath = parsePath(path.eval().asInstanceOf[UTF8String]) - override def eval(input: InternalRow): Any = { + override def eval(input: InternalRow): UTF8String = { val jsonStr = json.eval(input).asInstanceOf[UTF8String] if (jsonStr == null) { return null diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala index e4f001d61a..1cdc2642ba 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala @@ -723,6 +725,12 @@ abstract class PushableColumnBase { } case s: GetStructField if nestedPredicatePushdownEnabled => helper(s.child).map(_ :+ s.childSchema(s.ordinal).name) + case GetJsonObject(col, field) => +Some(Seq("GetJsonObject(" + col + "," + field + ")")) case _ => None } helper(e).map(_.quoted) {code} was (Author: yuzhih...@gmail.com): I have experimented with two patches which allow json expression to be pushable column (without the presence of cast). The first involves duplicating GetJsonObject as GetJsonString where eval() method returns UTF8String. FunctionRegistry.scala and functions.scala would have corresponding change. In PushableColumnBase class, new case for GetJsonString is added. 
Since code duplication is undesirable and case class cannot be directly subclassed, there is more work to be done in this direction. The second is simpler. The return type of GetJsonObject#eval is changed to UTF8String. I have run through catalyst / sql core unit tests which passed. In PushableColumnBase class, new case for GetJsonObject is added. In test output, I observed the following (for second patch, output for first patch is similar): {code} Post-Scan Filters: (get_json_object(phone#37, $.phone) = 1200) {code} Comment on the two approaches (and other approach) is welcome. Below is snippet for second approach. {code} diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala index c22b68890a..8004ddd735 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala @@ -139,7 +139,7 @@ case class GetJsonObject(json: Expression, path: Expression) @transient private lazy val parsedPath = parsePath(path.eval().asInstanceOf[UTF8String]) - override def eval(input: InternalRow): Any = { + override def eval(input: InternalRow): UTF8String = { val jsonStr = json.eval(input).asInstanceOf[UTF8String] if (jsonStr == null) { return null diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala index
[jira] [Comment Edited] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254944#comment-17254944 ] Ted Yu edited comment on SPARK-33915 at 12/26/20, 4:14 AM: --- I have experimented with two patches which allow json expression to be pushable column (without the presence of cast). The first involves duplicating GetJsonObject as GetJsonString where eval() method returns UTF8String. FunctionRegistry.scala and functions.scala would have corresponding change. In PushableColumnBase class, new case for GetJsonString is added. Since code duplication is undesirable and case class cannot be directly subclassed, there is more work to be done in this direction. The second is simpler. The return type of GetJsonObject#eval is changed to UTF8String. I have run through catalyst / sql core unit tests which passed. In PushableColumnBase class, new case for GetJsonObject is added. In test output, I observed the following (for second patch, output for first patch is similar): {code} Post-Scan Filters: (get_json_object(phone#37, $.phone) = 1200) {code} Comment on the two approaches (and other approach) is welcome. Below is snippet for second approach. 
{code} diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala index c22b68890a..8004ddd735 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala @@ -139,7 +139,7 @@ case class GetJsonObject(json: Expression, path: Expression) @transient private lazy val parsedPath = parsePath(path.eval().asInstanceOf[UTF8String]) - override def eval(input: InternalRow): Any = { + override def eval(input: InternalRow): UTF8String = { val jsonStr = json.eval(input).asInstanceOf[UTF8String] if (jsonStr == null) { return null diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala index e4f001d61a..1cdc2642ba 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala @@ -723,6 +725,12 @@ abstract class PushableColumnBase { } case s: GetStructField if nestedPredicatePushdownEnabled => helper(s.child).map(_ :+ s.childSchema(s.ordinal).name) + case GetJsonObject(col, field) => +Some(Seq("GetJsonObject(" + col + "," + field + ")")) case _ => None } helper(e).map(_.quoted) {code} was (Author: yuzhih...@gmail.com): I have experimented with two patches which allow json expression to be pushable column (without the presence of cast). The first involves duplicating GetJsonObject as GetJsonString where eval() method returns UTF8String. FunctionRegistry.scala and functions.scala would have corresponding change. In PushableColumnBase class, new case for GetJsonString is added. 
Since code duplication is undesirable and case class cannot be directly subclassed, there is more work to be done in this direction. The second is simpler. The return type of GetJsonObject#eval is changed to UTF8String. I have run through catalyst / sql core unit tests which passed. In PushableColumnBase class, new case for GetJsonObject is added. In test output, I observed the following (for second patch, output for first patch is similar): Post-Scan Filters: (get_json_object(phone#37, $.phone) = 1200) Comment on the two approaches (and other approach) is welcome. Below is snippet for second approach. {code} diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala index c22b68890a..8004ddd735 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala @@ -139,7 +139,7 @@ case class GetJsonObject(json: Expression, path: Expression) @transient private lazy val parsedPath = parsePath(path.eval().asInstanceOf[UTF8String]) - override def eval(input: InternalRow): Any = { + override def eval(input: InternalRow): UTF8String = { val jsonStr = json.eval(input).asInstanceOf[UTF8String] if (jsonStr == null) { return null diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala index
[jira] [Commented] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254944#comment-17254944 ] Ted Yu commented on SPARK-33915: I have experimented with two patches which allow json expression to be pushable column (without the presence of cast). The first involves duplicating GetJsonObject as GetJsonString where eval() method returns UTF8String. FunctionRegistry.scala and functions.scala would have corresponding change. In PushableColumnBase class, new case for GetJsonString is added. Since code duplication is undesirable and case class cannot be directly subclassed, there is more work to be done in this direction. The second is simpler. The return type of GetJsonObject#eval is changed to UTF8String. I have run through catalyst / sql core unit tests which passed. In PushableColumnBase class, new case for GetJsonObject is added. In test output, I observed the following (for second patch, output for first patch is similar): Post-Scan Filters: (get_json_object(phone#37, $.phone) = 1200) Comment on the two approaches (and other approach) is welcome. Below is snippet for second approach. 
{code} diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala index c22b68890a..8004ddd735 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala @@ -139,7 +139,7 @@ case class GetJsonObject(json: Expression, path: Expression) @transient private lazy val parsedPath = parsePath(path.eval().asInstanceOf[UTF8String]) - override def eval(input: InternalRow): Any = { + override def eval(input: InternalRow): UTF8String = { val jsonStr = json.eval(input).asInstanceOf[UTF8String] if (jsonStr == null) { return null diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala index e4f001d61a..1cdc2642ba 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala @@ -723,6 +725,12 @@ abstract class PushableColumnBase { } case s: GetStructField if nestedPredicatePushdownEnabled => helper(s.child).map(_ :+ s.childSchema(s.ordinal).name) + case GetJsonObject(col, field) => +Some(Seq("GetJsonObject(" + col + "," + field + ")")) case _ => None } helper(e).map(_.quoted) {code} > Allow json expression to be pushable column > --- > > Key: SPARK-33915 > URL: https://issues.apache.org/jira/browse/SPARK-33915 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Ted Yu >Priority: Major > > Currently PushableColumnBase provides no support for json / jsonb expression. 
> Example of json expression: > {code} > get_json_object(phone, '$.code') = '1200' > {code} > If non-string literal is part of the expression, the presence of cast() would > complicate the situation. > Implication is that implementation of SupportsPushDownFilters doesn't have a > chance to perform pushdown even if third party DB engine supports json > expression pushdown. > This issue is for discussion and implementation of Spark core changes which > would allow json expression to be recognized as pushable column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-33915: --- Description: Currently PushableColumnBase provides no support for json / jsonb expression. Example of json expression: {code} get_json_object(phone, '$.code') = '1200' {code} If non-string literal is part of the expression, the presence of cast() would complicate the situation. Implication is that implementation of SupportsPushDownFilters doesn't have a chance to perform pushdown even if third party DB engine supports json expression pushdown. This issue is for discussion and implementation of Spark core changes which would allow json expression to be recognized as pushable column. was: Currently PushableColumnBase provides no support for json / jsonb expression. Example of json expression: get_json_object(phone, '$.code') = '1200' If non-string literal is part of the expression, the presence of cast() would complicate the situation. Implication is that implementation of SupportsPushDownFilters doesn't have a chance to perform pushdown even if third party DB engine supports json expression pushdown. This issue is for discussion and implementation of Spark core changes which would allow json expression to be recognized as pushable column. > Allow json expression to be pushable column > --- > > Key: SPARK-33915 > URL: https://issues.apache.org/jira/browse/SPARK-33915 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Ted Yu >Priority: Major > > Currently PushableColumnBase provides no support for json / jsonb expression. > Example of json expression: > {code} > get_json_object(phone, '$.code') = '1200' > {code} > If non-string literal is part of the expression, the presence of cast() would > complicate the situation. 
> Implication is that implementation of SupportsPushDownFilters doesn't have a > chance to perform pushdown even if third party DB engine supports json > expression pushdown. > This issue is for discussion and implementation of Spark core changes which > would allow json expression to be recognized as pushable column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33915) Allow json expression to be pushable column
Ted Yu created SPARK-33915: -- Summary: Allow json expression to be pushable column Key: SPARK-33915 URL: https://issues.apache.org/jira/browse/SPARK-33915 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.1 Reporter: Ted Yu Currently PushableColumnBase provides no support for json / jsonb expression. Example of json expression: get_json_object(phone, '$.code') = '1200' If non-string literal is part of the expression, the presence of cast() would complicate the situation. Implication is that implementation of SupportsPushDownFilters doesn't have a chance to perform pushdown even if third party DB engine supports json expression pushdown. This issue is for discussion and implementation of Spark core changes which would allow json expression to be recognized as pushable column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27367) Faster RoaringBitmap Serialization with v0.8.0
[ https://issues.apache.org/jira/browse/SPARK-27367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-27367. - Resolution: Duplicate Issue fixed by SPARK-32437. > Faster RoaringBitmap Serialization with v0.8.0 > -- > > Key: SPARK-27367 > URL: https://issues.apache.org/jira/browse/SPARK-27367 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Imran Rashid >Priority: Major > > RoaringBitmap 0.8.0 adds faster serde, but also requires us to change how we > call the serde routines slightly to take advantage of it. This is probably a > worthwhile optimization as the every shuffle map task with a large # of > partitions generates these bitmaps, and the driver especially has to > deserialize many of these messages. > See > * https://github.com/apache/spark/pull/24264#issuecomment-479675572 > * https://github.com/RoaringBitmap/RoaringBitmap/pull/325 > * https://github.com/RoaringBitmap/RoaringBitmap/issues/319 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33070) Optimizer rules for HigherOrderFunctions
[ https://issues.apache.org/jira/browse/SPARK-33070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254910#comment-17254910 ] Apache Spark commented on SPARK-33070: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/30930 > Optimizer rules for HigherOrderFunctions > > > Key: SPARK-33070 > URL: https://issues.apache.org/jira/browse/SPARK-33070 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Priority: Minor > > SimpleHigherOrderFunction like ArrayTransform, ArrayFilter, etc, can be > combined and reordered to achieve more optimal plan. > Possible rules: > * Combine 2 consecutive array transforms > * Combine 2 consecutive array filters > * Push array filter through array sort > * Remove array sort before array exists and array forall. > * Combine 2 consecutive map filters -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33070) Optimizer rules for HigherOrderFunctions
[ https://issues.apache.org/jira/browse/SPARK-33070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33070: Assignee: Apache Spark > Optimizer rules for HigherOrderFunctions > > > Key: SPARK-33070 > URL: https://issues.apache.org/jira/browse/SPARK-33070 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Assignee: Apache Spark >Priority: Minor > > SimpleHigherOrderFunction like ArrayTransform, ArrayFilter, etc, can be > combined and reordered to achieve more optimal plan. > Possible rules: > * Combine 2 consecutive array transforms > * Combine 2 consecutive array filters > * Push array filter through array sort > * Remove array sort before array exists and array forall. > * Combine 2 consecutive map filters -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33070) Optimizer rules for HigherOrderFunctions
[ https://issues.apache.org/jira/browse/SPARK-33070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254909#comment-17254909 ] Apache Spark commented on SPARK-33070: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/30930 > Optimizer rules for HigherOrderFunctions > > > Key: SPARK-33070 > URL: https://issues.apache.org/jira/browse/SPARK-33070 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Priority: Minor > > SimpleHigherOrderFunction like ArrayTransform, ArrayFilter, etc, can be > combined and reordered to achieve more optimal plan. > Possible rules: > * Combine 2 consecutive array transforms > * Combine 2 consecutive array filters > * Push array filter through array sort > * Remove array sort before array exists and array forall. > * Combine 2 consecutive map filters -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33070) Optimizer rules for HigherOrderFunctions
[ https://issues.apache.org/jira/browse/SPARK-33070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33070: Assignee: (was: Apache Spark) > Optimizer rules for HigherOrderFunctions > > > Key: SPARK-33070 > URL: https://issues.apache.org/jira/browse/SPARK-33070 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Priority: Minor > > SimpleHigherOrderFunction like ArrayTransform, ArrayFilter, etc, can be > combined and reordered to achieve more optimal plan. > Possible rules: > * Combine 2 consecutive array transforms > * Combine 2 consecutive array filters > * Push array filter through array sort > * Remove array sort before array exists and array forall. > * Combine 2 consecutive map filters -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33070) Optimizer rules for HigherOrderFunctions
[ https://issues.apache.org/jira/browse/SPARK-33070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanel Kiis updated SPARK-33070: --- Summary: Optimizer rules for HigherOrderFunctions (was: Optimizer rules for collection datatypes and SimpleHigherOrderFunction) > Optimizer rules for HigherOrderFunctions > > > Key: SPARK-33070 > URL: https://issues.apache.org/jira/browse/SPARK-33070 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Priority: Minor > > SimpleHigherOrderFunction like ArrayTransform, ArrayFilter, etc, can be > combined and reordered to achieve more optimal plan. > Possible rules: > * Combine 2 consecutive array transforms > * Combine 2 consecutive array filters > * Push array filter through array sort > * Remove array sort before array exists and array forall. > * Combine 2 consecutive map filters -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33914) Describe the structure of unified v1 and v2 tests
[ https://issues.apache.org/jira/browse/SPARK-33914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33914: Assignee: (was: Apache Spark) > Describe the structure of unified v1 and v2 tests > - > > Key: SPARK-33914 > URL: https://issues.apache.org/jira/browse/SPARK-33914 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Add comments for unified v1 and v2 tests and describe their structure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33914) Describe the structure of unified v1 and v2 tests
[ https://issues.apache.org/jira/browse/SPARK-33914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254866#comment-17254866 ] Apache Spark commented on SPARK-33914: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30929 > Describe the structure of unified v1 and v2 tests > - > > Key: SPARK-33914 > URL: https://issues.apache.org/jira/browse/SPARK-33914 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Add comments for unified v1 and v2 tests and describe their structure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33914) Describe the structure of unified v1 and v2 tests
[ https://issues.apache.org/jira/browse/SPARK-33914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33914: Assignee: Apache Spark > Describe the structure of unified v1 and v2 tests > - > > Key: SPARK-33914 > URL: https://issues.apache.org/jira/browse/SPARK-33914 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Add comments for unified v1 and v2 tests and describe their structure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33914) Describe the structure of unified v1 and v2 tests
[ https://issues.apache.org/jira/browse/SPARK-33914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254865#comment-17254865 ] Apache Spark commented on SPARK-33914: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30929 > Describe the structure of unified v1 and v2 tests > - > > Key: SPARK-33914 > URL: https://issues.apache.org/jira/browse/SPARK-33914 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Add comments for unified v1 and v2 tests and describe their structure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33914) Describe the structure of unified v1 and v2 tests
Maxim Gekk created SPARK-33914: -- Summary: Describe the structure of unified v1 and v2 tests Key: SPARK-33914 URL: https://issues.apache.org/jira/browse/SPARK-33914 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk Add comments for unified v1 and v2 tests and describe their structure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33913) Upgrade Kafka to 2.7.0
dengziming created SPARK-33913: -- Summary: Upgrade Kafka to 2.7.0 Key: SPARK-33913 URL: https://issues.apache.org/jira/browse/SPARK-33913 Project: Spark Issue Type: Improvement Components: Build, DStreams Affects Versions: 3.2.0 Reporter: dengziming The Apache Kafka community has released Apache Kafka 2.7.0. Some of its features are useful here, for example the configurable TCP connection timeout from KAFKA-9893. More details: https://downloads.apache.org/kafka/2.7.0/RELEASE_NOTES.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33910) Simplify/Optimize conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33910: Description: 1. Push down the foldable expressions through CaseWhen/If 2. Simplify conditional in predicate 3. Push the UnaryExpression into (if / case) branches 4. Simplify CaseWhen if elseValue is None 5. Simplify conditional if all branches are foldable boolean type Common use cases are: {code:sql} create table t1 using parquet as select * from range(100); create table t2 using parquet as select * from range(200); create temp view v1 as select 'a' as event_type, * from t1 union all select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2 {code} 1. Reduce read the whole table. {noformat} explain select * from v1 where event_type = 'a'; Before simplify: == Physical Plan == Union :- *(1) Project [a AS event_type#7, id#9L] : +- *(1) ColumnarToRow : +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct +- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, id#10L] +- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a) +- *(2) ColumnarToRow +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: [(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct After simplify: == Physical Plan == *(1) Project [a AS event_type#8, id#4L] +- *(1) ColumnarToRow +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet {noformat} 2. Push down the conditional expressions to data source. 
{noformat} explain select * from v1 where event_type = 'b'; Before simplify: == Physical Plan == Union :- LocalTableScan , [event_type#7, id#9L] +- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, id#10L] +- *(1) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = b) +- *(1) ColumnarToRow +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: [(CASE WHEN (id#10L = 1) THEN b ELSE c END = b)], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct After simplify: == Physical Plan == *(1) Project [CASE WHEN (id#5L = 1) THEN b ELSE c END AS event_type#8, id#5L AS id#4L] +- *(1) Filter (isnotnull(id#5L) AND (id#5L = 1)) +- *(1) ColumnarToRow +- FileScan parquet default.t2[id#5L] Batched: true, DataFilters: [isnotnull(id#5L), (id#5L = 1)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct {noformat} 3. Reduce the amount of calculation. {noformat} Before simplify: explain select event_type = 'e' from v1; == Physical Plan == Union :- *(1) Project [false AS (event_type = e)#37] : +- *(1) ColumnarToRow : +- FileScan parquet default.t1[] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> +- *(2) Project [(CASE WHEN (id#21L = 1) THEN b ELSE c END = e) AS (event_type = e)#38] +- *(2) ColumnarToRow +- FileScan parquet default.t2[id#21L] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct After simplify: == Physical Plan == Union :- *(1) Project [false AS (event_type = e)#10] : +- *(1) ColumnarToRow : +- FileScan parquet default.t1[] Batched: true, DataFilters: [], Format: Parquet, +- *(2) Project [false AS (event_type = e)#14] +- *(2) ColumnarToRow +- FileScan parquet default.t2[] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> {noformat} was: 1. 
Push down the foldable expressions through CaseWhen/If 2. Simplify conditional in predicate 3. Push the UnaryExpression into (if / case) branches 4. Simplify CaseWhen if elseValue is None 5. Simplify conditional if all branches are foldable boolean type Common use cases are: {code:sql} create table t1 using parquet as select * from range(100); create table t2 using parquet as select * from range(200); create temp view v1 as select 'a' as event_type, * from t1 union all select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2 {code} 1. Reduce read the whole table. {noformat} explain select * from v1 where event_type = 'a'; Before simplify: == Physical Plan == Union :- *(1) Project [a AS event_type#7, id#9L] : +- *(1) ColumnarToRow : +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema:
[jira] [Updated] (SPARK-33910) Simplify/Optimize conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33910: Description: 1. Push down the foldable expressions through CaseWhen/If 2. Simplify conditional in predicate 3. Push the UnaryExpression into (if / case) branches 4. Simplify CaseWhen if elseValue is None 5. Simplify conditional if all branches are foldable boolean type Common use cases are: {code:sql} create table t1 using parquet as select * from range(100); create table t2 using parquet as select * from range(200); create temp view v1 as select 'a' as event_type, * from t1 union all select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2 {code} 1. Reduce read the whole table. {noformat} explain select * from v1 where event_type = 'a'; Before simplify: == Physical Plan == Union :- *(1) Project [a AS event_type#7, id#9L] : +- *(1) ColumnarToRow : +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct +- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, id#10L] +- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a) +- *(2) ColumnarToRow +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: [(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct After simplify: == Physical Plan == *(1) Project [a AS event_type#8, id#4L] +- *(1) ColumnarToRow +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet {noformat} 2. Push down the conditional expressions to data source. 
{noformat} explain select * from v1 where event_type = 'b'; Before simplify: == Physical Plan == Union :- LocalTableScan , [event_type#7, id#9L] +- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, id#10L] +- *(1) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = b) +- *(1) ColumnarToRow +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: [(CASE WHEN (id#10L = 1) THEN b ELSE c END = b)], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct After simplify: == Physical Plan == *(1) Project [CASE WHEN (id#5L = 1) THEN b ELSE c END AS event_type#8, id#5L AS id#4L] +- *(1) Filter (isnotnull(id#5L) AND (id#5L = 1)) +- *(1) ColumnarToRow +- FileScan parquet default.t2[id#5L] Batched: true, DataFilters: [isnotnull(id#5L), (id#5L = 1)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct {noformat} 3. Reduce the amount of calculation. {noformat} Before simplify: spark-sql> explain select event_type = 'e' from v1; == Physical Plan == Union :- *(1) Project [false AS (event_type = e)#37] : +- *(1) ColumnarToRow : +- FileScan parquet default.t1[] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> +- *(2) Project [(CASE WHEN (id#21L = 1) THEN b ELSE c END = e) AS (event_type = e)#38] +- *(2) ColumnarToRow +- FileScan parquet default.t2[id#21L] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct After simplify: == Physical Plan == Union :- *(1) Project [false AS (event_type = e)#10] : +- *(1) ColumnarToRow : +- FileScan parquet default.t1[] Batched: true, DataFilters: [], Format: Parquet, +- *(2) Project [false AS (event_type = e)#14] +- *(2) ColumnarToRow +- FileScan parquet default.t2[] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> {noformat} was: 1. 
Push down the foldable expressions through CaseWhen/If 2. Simplify conditional in predicate 3. Push the UnaryExpression into (if / case) branches 4. Simplify CaseWhen if elseValue is None 5. Simplify conditional if all branches are foldable boolean type Common use cases are: {code:sql} create table t1 using parquet as select * from range(100); create table t2 using parquet as select * from range(200); create temp view v1 as select 'a' as event_type, * from t1 union all select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2 {code} 1. Reduce read the whole table. {noformat} explain select * from v1 where event_type = 'a'; Before simplify: == Physical Plan == Union :- *(1) Project [a AS event_type#7, id#9L] : +- *(1) ColumnarToRow : +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], Format: Parquet, Location:
[jira] [Updated] (SPARK-33910) Simplify/Optimize conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33910: Description: 1. Push down the foldable expressions through CaseWhen/If 2. Simplify conditional in predicate 3. Push the UnaryExpression into (if / case) branches 4. Simplify CaseWhen if elseValue is None 5. Simplify conditional if all branches are foldable boolean type Common use cases are: {code:sql} create table t1 using parquet as select * from range(100); create table t2 using parquet as select * from range(200); create temp view v1 as select 'a' as event_type, * from t1 union all select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2 {code} 1. Reduce read the whole table. {noformat} explain select * from v1 where event_type = 'a'; Before simplify: == Physical Plan == Union :- *(1) Project [a AS event_type#7, id#9L] : +- *(1) ColumnarToRow : +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct +- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, id#10L] +- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a) +- *(2) ColumnarToRow +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: [(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct After simplify: == Physical Plan == *(1) Project [a AS event_type#8, id#4L] +- *(1) ColumnarToRow +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet {noformat} 2. Push down the conditional expressions to data source. 
{noformat} explain select * from v1 where event_type = 'b'; Before simplify: == Physical Plan == Union :- LocalTableScan , [event_type#7, id#9L] +- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, id#10L] +- *(1) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = b) +- *(1) ColumnarToRow +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: [(CASE WHEN (id#10L = 1) THEN b ELSE c END = b)], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct After simplify: == Physical Plan == *(1) Project [CASE WHEN (id#5L = 1) THEN b ELSE c END AS event_type#8, id#5L AS id#4L] +- *(1) Filter (isnotnull(id#5L) AND (id#5L = 1)) +- *(1) ColumnarToRow +- FileScan parquet default.t2[id#5L] Batched: true, DataFilters: [isnotnull(id#5L), (id#5L = 1)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct {noformat} was: 1. Push down the foldable expressions through CaseWhen/If 2. Simplify conditional in predicate 3. Push the UnaryExpression into (if / case) branches 4. Simplify CaseWhen if elseValue is None 5. Simplify conditional if all branches are foldable boolean type Common use cases are: {code:sql} create table t1 using parquet as select * from range(100); create table t2 using parquet as select * from range(200); create temp view v1 as select 'a' as event_type, * from t1 union all select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2 {code} 1. Reduce read the whole table. 
{noformat} explain select * from v1 where event_type = 'a'; Before simplify: == Physical Plan == Union :- *(1) Project [a AS event_type#7, id#9L] : +- *(1) ColumnarToRow : +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct +- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, id#10L] +- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a) +- *(2) ColumnarToRow +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: [(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct After simplify: == Physical Plan == *(1) Project [a AS event_type#8, id#4L] +- *(1) ColumnarToRow +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet {noformat} 2. Push down the filter. {noformat} explain select * from v1 where event_type = 'b'; Before simplify: == Physical Plan == Union :- LocalTableScan , [event_type#7, id#9L] +- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, id#10L] +- *(1) Filter (CASE WHEN (id#10L = 1) THEN b
[jira] [Updated] (SPARK-33910) Simplify/Optimize conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33910: Description: 1. Push down the foldable expressions through CaseWhen/If 2. Simplify conditional in predicate 3. Push the UnaryExpression into (if / case) branches 4. Simplify CaseWhen if elseValue is None 5. Simplify conditional if all branches are foldable boolean type Common use cases are: {code:sql} create table t1 using parquet as select * from range(100); create table t2 using parquet as select * from range(200); create temp view v1 as select 'a' as event_type, * from t1 union all select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2 {code} 1. Reduce read the whole table. {noformat} explain select * from v1 where event_type = 'a'; Before simplify: == Physical Plan == Union :- *(1) Project [a AS event_type#7, id#9L] : +- *(1) ColumnarToRow : +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct +- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, id#10L] +- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a) +- *(2) ColumnarToRow +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: [(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct After simplify: == Physical Plan == *(1) Project [a AS event_type#8, id#4L] +- *(1) ColumnarToRow +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet {noformat} 2. Push down the filter. 
{noformat} explain select * from v1 where event_type = 'b'; Before simplify: == Physical Plan == Union :- LocalTableScan , [event_type#7, id#9L] +- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, id#10L] +- *(1) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = b) +- *(1) ColumnarToRow +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: [(CASE WHEN (id#10L = 1) THEN b ELSE c END = b)], Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct After simplify: == Physical Plan == *(1) Project [CASE WHEN (id#5L = 1) THEN b ELSE c END AS event_type#8, id#5L AS id#4L] +- *(1) Filter (isnotnull(id#5L) AND (id#5L = 1)) +- *(1) ColumnarToRow +- FileScan parquet default.t2[id#5L] Batched: true, DataFilters: [isnotnull(id#5L), (id#5L = 1)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct {noformat} was: Simplify/Optimize conditional expressions. We can improve these cases: 1. Reduce read datasource. 2. Simple CaseWhen/If to support filter push down. 
{code:sql} create table t1 using parquet as select * from range(100); create table t2 using parquet as select * from range(200); create temp view v1 as select 'a' as event_type, * from t1 union all select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * from t2 explain select * from v1 where event_type = 'a'; {code} Before this PR: {noformat} == Physical Plan == Union :- *(1) Project [a AS event_type#30533, id#30535L] : +- *(1) ColumnarToRow : +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: [], Format: Parquet +- *(2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END AS event_type#30534, id#30536L] +- *(2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a) +- *(2) ColumnarToRow +- FileScan parquet default.t2[id#30536L] Batched: true, DataFilters: [(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)], Format: Parquet {noformat} After this PR: {noformat} == Physical Plan == *(1) Project [a AS event_type#8, id#4L] +- *(1) ColumnarToRow +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet {noformat} > Simplify/Optimize conditional expressions > -- > > Key: SPARK-33910 > URL: https://issues.apache.org/jira/browse/SPARK-33910 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > 1. Push down the foldable expressions through CaseWhen/If > 2. Simplify conditional in predicate > 3. Push the UnaryExpression into (if / case)
[jira] [Updated] (SPARK-33848) Push the UnaryExpression into (if / case) branches
[ https://issues.apache.org/jira/browse/SPARK-33848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33848: Summary: Push the UnaryExpression into (if / case) branches (was: Push the cast into (if / case) branches) > Push the UnaryExpression into (if / case) branches > -- > > Key: SPARK-33848 > URL: https://issues.apache.org/jira/browse/SPARK-33848 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.2.0 > > > Push the cast into (if / case) branches. The use case is: > {code:sql} > create table t1 using parquet as select id from range(10); > explain select id from t1 where (CASE WHEN id = 1 THEN '1' WHEN id = 3 THEN > '2' end) > 3; > {code} > Before this pr: > {noformat} > == Physical Plan == > *(1) Filter (cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as > int) > 3) > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: > [(cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > > 3)], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > {noformat} > After this pr: > {noformat} > == Physical Plan == > LocalTableScan , [id#1L] > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
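[Editor's note] The rewrite in SPARK-33848 can be sketched abstractly: pushing a unary expression (such as a cast) into the branches of a CASE turns each branch into a foldable constant, which is what lets the filter above collapse to a LocalTableScan. A hypothetical Python model follows; it is not Spark code, the helper names are invented, and Python's `int` merely stands in for the SQL cast:

```python
# Hypothetical model of "push the UnaryExpression into (if / case) branches":
# f(CASE WHEN p THEN a ELSE b END) is rewritten to
# CASE WHEN p THEN f(a) ELSE f(b) END. When a and b are literals,
# f(a) and f(b) can be constant-folded at planning time.

def case_when(pred, then_val, else_val):
    # Analogue of a two-branch CASE WHEN expression.
    return then_val if pred else else_val

def push_unary_into_branches(f, then_val, else_val):
    # Apply f to each (constant) branch instead of to the CASE result.
    return lambda pred: case_when(pred, f(then_val), f(else_val))

original = lambda pred: int(case_when(pred, '1', '2'))
rewritten = push_unary_into_branches(int, '1', '2')
for p in (True, False):
    assert original(p) == rewritten(p)
```

After the rewrite, a predicate like `CASE WHEN … THEN 1 WHEN … THEN 2 END > 3` compares folded integer constants, so the optimizer can decide it statically.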
[jira] [Updated] (SPARK-33910) Simplify/Optimize conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33910: Description: Simplify/Optimize conditional expressions. We can improve these cases: 1. Reduce datasource reads. 2. Simplify CaseWhen/If to support filter push down. {code:sql} create table t1 using parquet as select * from range(100); create table t2 using parquet as select * from range(200); create temp view v1 as select 'a' as event_type, * from t1 union all select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * from t2; explain select * from v1 where event_type = 'a'; {code} Before this PR: {noformat} == Physical Plan == Union :- *(1) Project [a AS event_type#30533, id#30535L] : +- *(1) ColumnarToRow : +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: [], Format: Parquet +- *(2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END AS event_type#30534, id#30536L] +- *(2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a) +- *(2) ColumnarToRow +- FileScan parquet default.t2[id#30536L] Batched: true, DataFilters: [(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)], Format: Parquet {noformat} After this PR: {noformat} == Physical Plan == *(1) Project [a AS event_type#8, id#4L] +- *(1) ColumnarToRow +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet {noformat} was: Simplify CaseWhen/If conditionals. We can improve these cases: 1. Reduce datasource reads. 2. Simplify CaseWhen/If to support filter push down. 
{code:sql} create table t1 using parquet as select * from range(100); create table t2 using parquet as select * from range(200); create temp view v1 as select 'a' as event_type, * from t1 union all select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * from t2; explain select * from v1 where event_type = 'a'; {code} Before this PR: {noformat} == Physical Plan == Union :- *(1) Project [a AS event_type#30533, id#30535L] : +- *(1) ColumnarToRow : +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: [], Format: Parquet +- *(2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END AS event_type#30534, id#30536L] +- *(2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a) +- *(2) ColumnarToRow +- FileScan parquet default.t2[id#30536L] Batched: true, DataFilters: [(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)], Format: Parquet {noformat} After this PR: {noformat} == Physical Plan == *(1) Project [a AS event_type#8, id#4L] +- *(1) ColumnarToRow +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet {noformat} > Simplify/Optimize conditional expressions > -- > > Key: SPARK-33910 > URL: https://issues.apache.org/jira/browse/SPARK-33910 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > Simplify/Optimize conditional expressions. We can improve these cases: > 1. Reduce datasource reads. > 2. Simplify CaseWhen/If to support filter push down. 
> {code:sql} > create table t1 using parquet as select * from range(100); > create table t2 using parquet as select * from range(200); > create temp view v1 as > > select 'a' as event_type, * from t1 > > union all > > select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * > from t2; > explain select * from v1 where event_type = 'a'; > {code} > Before this PR: > {noformat} > == Physical Plan == > Union > :- *(1) Project [a AS event_type#30533, id#30535L] > : +- *(1) ColumnarToRow > : +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: > [], Format: Parquet > +- *(2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c > END AS event_type#30534, id#30536L] >+- *(2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN > c END = a) > +- *(2) ColumnarToRow > +- FileScan parquet default.t2[id#30536L] Batched: true, > DataFilters: [(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c > END = a)], Format: Parquet > {noformat} > After this PR: > {noformat} > == Physical Plan == > *(1) Project [a AS event_type#8, id#4L] > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], > Format: Parquet > {noformat}
[jira] [Updated] (SPARK-33910) Simplify/Optimize conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33910: Summary: Simplify/Optimize conditional expressions (was: Simplify conditional) > Simplify/Optimize conditional expressions > -- > > Key: SPARK-33910 > URL: https://issues.apache.org/jira/browse/SPARK-33910 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > Simplify CaseWhen/If conditionals. We can improve these cases: > 1. Reduce datasource reads. > 2. Simplify CaseWhen/If to support filter push down. > {code:sql} > create table t1 using parquet as select * from range(100); > create table t2 using parquet as select * from range(200); > create temp view v1 as > > select 'a' as event_type, * from t1 > > union all > > select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * > from t2; > explain select * from v1 where event_type = 'a'; > {code} > Before this PR: > {noformat} > == Physical Plan == > Union > :- *(1) Project [a AS event_type#30533, id#30535L] > : +- *(1) ColumnarToRow > : +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: > [], Format: Parquet > +- *(2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c > END AS event_type#30534, id#30536L] >+- *(2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN > c END = a) > +- *(2) ColumnarToRow > +- FileScan parquet default.t2[id#30536L] Batched: true, > DataFilters: [(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c > END = a)], Format: Parquet > {noformat} > After this PR: > {noformat} > == Physical Plan == > *(1) Project [a AS event_type#8, id#4L] > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], > Format: Parquet > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
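The rewrite proposed in SPARK-33910 hinges on a simple observation: when no branch of a CASE WHEN (nor its implicit NULL else) can produce the compared literal, the whole filter is unsatisfiable and the t2 side of the union can be pruned. A minimal pure-Python sketch of that reasoning (not Spark's actual optimizer rule; the function name is illustrative):

```python
# Sketch: decide whether a filter of the form
#   CASE WHEN c1 THEN v1 WHEN c2 THEN v2 [ELSE e] END = literal
# can ever be true. Branches are (condition, value) pairs; conditions are
# treated as opaque. A missing ELSE yields NULL, and NULL = literal is never
# true under SQL three-valued logic.

def case_when_equals_is_unsatisfiable(branches, else_value, literal):
    """Return True if the CASE WHEN expression can never equal `literal`."""
    possible_values = [v for _, v in branches]
    if else_value is not None:
        possible_values.append(else_value)
    return literal not in possible_values

# The filter from the example plan: CASE WHEN id=1 THEN 'b' WHEN id=3 THEN 'c' END = 'a'
branches = [("id = 1", "b"), ("id = 3", "c")]
assert case_when_equals_is_unsatisfiable(branches, None, "a")      # t2 branch prunable
assert not case_when_equals_is_unsatisfiable(branches, None, "b")  # 'b' is reachable
```

This is exactly why the "After this PR" plan scans only default.t1: the filter `event_type = 'a'` folds to false on the t2 branch.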
[jira] [Updated] (SPARK-33070) Optimizer rules for collection datatypes and SimpleHigherOrderFunction
[ https://issues.apache.org/jira/browse/SPARK-33070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanel Kiis updated SPARK-33070: --- Affects Version/s: (was: 3.1.0) 3.2.0 > Optimizer rules for collection datatypes and SimpleHigherOrderFunction > -- > > Key: SPARK-33070 > URL: https://issues.apache.org/jira/browse/SPARK-33070 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Priority: Minor > > SimpleHigherOrderFunction expressions like ArrayTransform, ArrayFilter, etc., can be > combined and reordered to achieve a more optimal plan. > Possible rules: > * Combine 2 consecutive array transforms > * Combine 2 consecutive array filters > * Push array filter through array sort > * Remove array sort before array exists and array forall. > * Combine 2 consecutive map filters -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
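The first two rules proposed above rest on standard fusion identities. A pure-Python model (not Spark's ArrayTransform/ArrayFilter implementation) showing why fusing two consecutive transforms or filters into one pass preserves the result:

```python
# transform(transform(xs, f), g) == transform(xs, g . f)
# filter(filter(xs, p), q)       == filter(xs, x -> p(x) and q(x))

def array_transform(xs, f):
    """Apply f to every element (one pass over the array)."""
    return [f(x) for x in xs]

def array_filter(xs, p):
    """Keep elements satisfying p (one pass over the array)."""
    return [x for x in xs if p(x)]

xs = [1, 2, 3, 4, 5]
f = lambda x: x + 1
g = lambda x: x * 2

# Two passes over the array ...
two_pass = array_transform(array_transform(xs, f), g)
# ... collapse into one pass with the composed lambda.
one_pass = array_transform(xs, lambda x: g(f(x)))
assert two_pass == one_pass == [4, 6, 8, 10, 12]

p = lambda x: x % 2 == 0
q = lambda x: x > 2
assert array_filter(array_filter(xs, p), q) == array_filter(xs, lambda x: p(x) and q(x)) == [4]
```

The sort-related rules follow the same spirit: filtering commutes with sorting, and `exists`/`forall` are order-insensitive, so a preceding sort is dead work.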
[jira] [Updated] (SPARK-33888) JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis
[ https://issues.apache.org/jira/browse/SPARK-33888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duc Hoa Nguyen updated SPARK-33888: --- Summary: JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis (was: AVRO SchemaConverts - logicalType TimeMillis not being converted to Timestamp type) > JDBC SQL TIME type represents incorrectly as TimestampType, it should be > physical Int in millis > --- > > Key: SPARK-33888 > URL: https://issues.apache.org/jira/browse/SPARK-33888 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3, 3.0.0, 3.0.1 >Reporter: Duc Hoa Nguyen >Assignee: Apache Spark >Priority: Minor > > Currently, for JDBC, the SQL TIME type is incorrectly represented as Spark > TimestampType. It should be represented as a physical int in millis: a time of day, > with no reference to a particular calendar, time zone or date, > with a precision of one millisecond, storing the number of milliseconds > after midnight, 00:00:00.000. > We encountered the issue of the Avro logical type `TimeMillis` not being > converted correctly to the Spark `Timestamp` struct type using the > `SchemaConverters`; it converts to a regular `int` instead. Reproducible by > ingesting data from a MySQL table with a column of TIME type: the Spark JDBC dataframe > will get the correct type (Timestamp), but enforcing our avro schema > (`{"type": "int", "logicalType": "time-millis"}`) externally will fail to > apply with the following exception: > {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type > for schema of int}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33888) AVRO SchemaConverts - logicalType TimeMillis not being converted to Timestamp type
[ https://issues.apache.org/jira/browse/SPARK-33888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duc Hoa Nguyen updated SPARK-33888: --- Description: Currently, for JDBC, the SQL TIME type is incorrectly represented as Spark TimestampType. It should be represented as a physical int in millis: a time of day, with no reference to a particular calendar, time zone or date, with a precision of one millisecond, storing the number of milliseconds after midnight, 00:00:00.000. We encountered the issue of the Avro logical type `TimeMillis` not being converted correctly to the Spark `Timestamp` struct type using the `SchemaConverters`; it converts to a regular `int` instead. Reproducible by ingesting data from a MySQL table with a column of TIME type: the Spark JDBC dataframe will get the correct type (Timestamp), but enforcing our avro schema (`{"type": "int", "logicalType": "time-millis"}`) externally will fail to apply with the following exception: {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type for schema of int}} was: We encountered the issue of the Avro logical type `TimeMillis` not being converted correctly to the Spark `Timestamp` struct type using the `SchemaConverters`; it converts to a regular `int` instead. 
Reproducible by ingesting data from a MySQL table with a column of TIME type: the Spark JDBC dataframe will get the correct type (Timestamp), but enforcing our avro schema (`{"type": "int", "logicalType": "time-millis"}`) externally will fail to apply with the following exception: {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type for schema of int}} > AVRO SchemaConverts - logicalType TimeMillis not being converted to Timestamp > type > -- > > Key: SPARK-33888 > URL: https://issues.apache.org/jira/browse/SPARK-33888 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3, 3.0.0, 3.0.1 >Reporter: Duc Hoa Nguyen >Assignee: Apache Spark >Priority: Minor > > Currently, for JDBC, the SQL TIME type is incorrectly represented as Spark > TimestampType. It should be represented as a physical int in millis: a time of day, > with no reference to a particular calendar, time zone or date, > with a precision of one millisecond, storing the number of milliseconds > after midnight, 00:00:00.000. > We encountered the issue of the Avro logical type `TimeMillis` not being > converted correctly to the Spark `Timestamp` struct type using the > `SchemaConverters`; it converts to a regular `int` instead. Reproducible by > ingesting data from a MySQL table with a column of TIME type: the Spark JDBC dataframe > will get the correct type (Timestamp), but enforcing our avro schema > (`{"type": "int", "logicalType": "time-millis"}`) externally will fail to > apply with the following exception: > {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type > for schema of int}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
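The representation the issue argues for (Avro `time-millis`) is just an int counting milliseconds after midnight. A small Python sketch of the round trip — illustrative helpers only, not Spark or Avro API:

```python
# A SQL TIME value carried as an int of milliseconds after midnight
# (the Avro time-millis physical representation), rather than a full
# timestamp tied to a date and time zone.
import datetime

def time_to_millis(t: datetime.time) -> int:
    """Milliseconds after midnight, 00:00:00.000 (one-millisecond precision)."""
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1000 + t.microsecond // 1000

def millis_to_time(ms: int) -> datetime.time:
    """Inverse of time_to_millis."""
    seconds, millis = divmod(ms, 1000)
    minutes, sec = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return datetime.time(hour, minute, sec, millis * 1000)

t = datetime.time(13, 45, 30, 250_000)   # 13:45:30.250
ms = time_to_millis(t)
assert ms == 49_530_250
assert millis_to_time(ms) == t
```

The full day fits comfortably in a 32-bit int (86,399,999 max), which is why Avro stores `time-millis` as `int`.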
[jira] [Commented] (SPARK-33912) Refactor DependencyUtils ivy property parameter
[ https://issues.apache.org/jira/browse/SPARK-33912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254785#comment-17254785 ] Apache Spark commented on SPARK-33912: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/30928 > Refactor DependencyUtils ivy property parameter > > > Key: SPARK-33912 > URL: https://issues.apache.org/jira/browse/SPARK-33912 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Minor > > according to https://github.com/apache/spark/pull/29966#discussion_r533573137 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33912) Refactor DependencyUtils ivy property parameter
[ https://issues.apache.org/jira/browse/SPARK-33912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33912: Assignee: Apache Spark > Refactor DependencyUtils ivy property parameter > > > Key: SPARK-33912 > URL: https://issues.apache.org/jira/browse/SPARK-33912 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Minor > > according to https://github.com/apache/spark/pull/29966#discussion_r533573137 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33912) Refactor DependencyUtils ivy property parameter
[ https://issues.apache.org/jira/browse/SPARK-33912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33912: Assignee: (was: Apache Spark) > Refactor DependencyUtils ivy property parameter > > > Key: SPARK-33912 > URL: https://issues.apache.org/jira/browse/SPARK-33912 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Minor > > according to https://github.com/apache/spark/pull/29966#discussion_r533573137 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33888) AVRO SchemaConverts - logicalType TimeMillis not being converted to Timestamp type
[ https://issues.apache.org/jira/browse/SPARK-33888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duc Hoa Nguyen updated SPARK-33888: --- Issue Type: Bug (was: New Feature) > AVRO SchemaConverts - logicalType TimeMillis not being converted to Timestamp > type > -- > > Key: SPARK-33888 > URL: https://issues.apache.org/jira/browse/SPARK-33888 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3, 3.0.0, 3.0.1 >Reporter: Duc Hoa Nguyen >Assignee: Apache Spark >Priority: Minor > > We encountered the issue of the Avro logical type `TimeMillis` not being > converted correctly to the Spark `Timestamp` struct type using the > `SchemaConverters`; it converts to a regular `int` instead. Reproducible by > ingesting data from a MySQL table with a column of TIME type: the Spark JDBC dataframe > will get the correct type (Timestamp), but enforcing our avro schema > (`{"type": "int", "logicalType": "time-millis"}`) externally will fail to > apply with the following exception: > {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type > for schema of int}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28123) String Functions: Add support btrim
[ https://issues.apache.org/jira/browse/SPARK-28123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-28123: --- Affects Version/s: (was: 3.0.0) 3.2.0 > String Functions: Add support btrim > --- > > Key: SPARK-28123 > URL: https://issues.apache.org/jira/browse/SPARK-28123 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > ||Function||Return Type||Description||Example||Result|| > |{{btrim(_string_ bytea, _bytes_ bytea)}}|{{bytea}}|Remove the longest string containing > only bytes appearing in _bytes_ from the start and end of > _string_|{{btrim('\000trim\001'::bytea, '\000\001'::bytea)}}|{{trim}}| > More details: https://www.postgresql.org/docs/11/functions-binarystring.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
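The btrim semantics quoted from PostgreSQL — remove the longest run of the given bytes from both ends — happen to match what Python's `bytes.strip` already does, which makes for a compact reference model (a sketch of the semantics, not Spark's implementation):

```python
# Reference model for btrim on byte strings: strip every leading and
# trailing byte that appears in `chars`, leaving the interior untouched.

def btrim(string: bytes, chars: bytes) -> bytes:
    return string.strip(chars)

# The example from the table above: btrim('\000trim\001'::bytea, '\000\001'::bytea) -> trim
assert btrim(b"\x00trim\x01", b"\x00\x01") == b"trim"
# Interior occurrences of the trim bytes are preserved.
assert btrim(b"xyxhelloxyx", b"xy") == b"hello"
```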
[jira] [Updated] (SPARK-30789) Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE
[ https://issues.apache.org/jira/browse/SPARK-30789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-30789: --- Affects Version/s: (was: 3.1.0) 3.2.0 > Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE > -- > > Key: SPARK-30789 > URL: https://issues.apache.org/jira/browse/SPARK-30789 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiaan.geng >Priority: Major > > All of LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE support IGNORE NULLS | > RESPECT NULLS. For example: > {code:java} > LEAD (value_expr [, offset ]) > [ IGNORE NULLS | RESPECT NULLS ] > OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ){code} > > {code:java} > LAG (value_expr [, offset ]) > [ IGNORE NULLS | RESPECT NULLS ] > OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ){code} > > {code:java} > NTH_VALUE (expr, offset) > [ IGNORE NULLS | RESPECT NULLS ] > OVER > ( [ PARTITION BY window_partition ] > [ ORDER BY window_ordering > frame_clause ] ){code} > > *Oracle:* > [https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0] > *Redshift* > [https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html] > *Presto* > [https://prestodb.io/docs/current/functions/window.html] > *DB2* > [https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1513.htm] > *Teradata* > [https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/GjCT6l7trjkIEjt~7Dhx4w] > *Snowflake* > [https://docs.snowflake.com/en/sql-reference/functions/lead.html] > [https://docs.snowflake.com/en/sql-reference/functions/lag.html] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
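The behavioral difference the ticket asks for can be seen with a pure-Python model of LAG over an already partitioned-and-ordered column (illustrative only, not Spark's window execution): RESPECT NULLS looks back a fixed number of rows, while IGNORE NULLS looks back over non-null values only.

```python
# LAG(value, offset) over an ordered column, with both null-handling modes.

def lag_respect_nulls(col, offset=1):
    """Default LAG: the value `offset` rows back, null or not."""
    return [col[i - offset] if i - offset >= 0 else None for i in range(len(col))]

def lag_ignore_nulls(col, offset=1):
    """LAG ... IGNORE NULLS: the offset-th most recent NON-null value."""
    out = []
    for i in range(len(col)):
        seen = [v for v in col[:i] if v is not None]  # prior non-null values
        out.append(seen[-offset] if len(seen) >= offset else None)
    return out

col = [10, None, 20, None, 30]
assert lag_respect_nulls(col) == [None, 10, None, 20, None]
assert lag_ignore_nulls(col)  == [None, 10, 10, 20, 20]
```

NTH_VALUE, FIRST_VALUE, and LAST_VALUE get the analogous treatment: skip nulls when counting positions under IGNORE NULLS.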
[jira] [Commented] (SPARK-33804) Cleanup "view bounds are deprecated" compilation warnings
[ https://issues.apache.org/jira/browse/SPARK-33804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254767#comment-17254767 ] Maxim Gekk commented on SPARK-33804: [~LuciferYang] Could you convert this (and your other similar JIRA tickets) to a sub-task of the umbrella ticket https://issues.apache.org/jira/browse/SPARK-30165 > Cleanup "view bounds are deprecated" compilation warnings > - > > Key: SPARK-33804 > URL: https://issues.apache.org/jira/browse/SPARK-33804 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > There are only 3 compilation warnings related to `view bounds are deprecated` > in SequenceFileRDDFunctions: > [WARNING] > /spark-source/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala:35: > view bounds are deprecated; use an implicit parameter instead. > [WARNING] > /spark-source/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala:35: > view bounds are deprecated; use an implicit parameter instead. > [WARNING] > /spark-source/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala:55: > view bounds are deprecated; use an implicit parameter instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org