[jira] [Comment Edited] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255389#comment-17255389 ] Ted Yu edited comment on SPARK-33915 at 12/31/20, 3:16 PM:
---
Here is the plan prior to predicate pushdown:
{code}
2020-12-26 03:28:59,926 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- Filter (get_json_object(phone#37, $.code) = 1200)
      +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
         - Cassandra Filters: []
         - Requested Columns: [id,address,phone]
{code}
Here is the plan with pushdown:
{code}
2020-12-28 01:40:08,150 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
      - Cassandra Filters: [[phone->'code' = ?, 1200]]
      - Requested Columns: [id,address,phone]
{code}

was (Author: yuzhih...@gmail.com):
Here is the plan prior to predicate pushdown:
{code}
2020-12-26 03:28:59,926 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- Filter (get_json_object(phone#37, $.phone) = 1200)
      +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
         - Cassandra Filters: []
         - Requested Columns: [id,address,phone]
{code}
Here is the plan with pushdown:
{code}
2020-12-28 01:40:08,150 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
      - Cassandra Filters: [[phone->'phone' = ?, 1200]]
      - Requested Columns: [id,address,phone]
{code}

> Allow json expression to be pushable column
> --------------------------------------------
>
>                 Key: SPARK-33915
>                 URL: https://issues.apache.org/jira/browse/SPARK-33915
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.1
>            Reporter: Ted Yu
>            Priority: Major
>
> Currently PushableColumnBase provides no support for json / jsonb expressions.
> Example of a json expression:
> {code}
> get_json_object(phone, '$.code') = '1200'
> {code}
> If a non-string literal is part of the expression, the presence of cast() would complicate the situation.
> The implication is that an implementation of SupportsPushDownFilters doesn't get a chance to perform pushdown even if the third-party DB engine supports json expression pushdown.
> This issue is for discussion and implementation of Spark core changes which would allow a json expression to be recognized as a pushable column.
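For context, a query of roughly the following shape would reproduce the plans above. This is a minimal sketch, not the actual test: the keyspace/table (test.person) and the column names are taken from the plans, while the SparkSession setup and the Cassandra connector format/options are assumptions.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

val spark = SparkSession.builder()
  .appName("json-pushdown-repro")  // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Load test.person via the DataSource v2 Cassandra connector
// (format string and options are assumptions, not from the test above).
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "person"))
  .load()

// Without the proposed change, this predicate stays in a Filter node above
// the BatchScan; with it, it reaches the scan as a Cassandra filter.
df.filter(get_json_object($"phone", "$.code") === "1200")
  .orderBy($"id")
  .explain(true)
{code}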
[jira] [Comment Edited] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255389#comment-17255389 ] Ted Yu edited comment on SPARK-33915 at 12/29/20, 6:51 PM:
---
Here is the plan prior to predicate pushdown:
{code}
2020-12-26 03:28:59,926 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- Filter (get_json_object(phone#37, $.phone) = 1200)
      +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
         - Cassandra Filters: []
         - Requested Columns: [id,address,phone]
{code}
Here is the plan with pushdown:
{code}
2020-12-28 01:40:08,150 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
      - Cassandra Filters: [[phone->'phone' = ?, 1200]]
      - Requested Columns: [id,address,phone]
{code}

was (Author: yuzhih...@gmail.com):
Here is the plan prior to predicate pushdown:
{code}
2020-12-26 03:28:59,926 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- Filter (get_json_object(phone#37, $.phone) = 1200)
      +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
         - Cassandra Filters: []
         - Requested Columns: [id,address,phone]
{code}
Here is the plan with pushdown:
{code}
2020-12-28 01:40:08,150 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
      - Cassandra Filters: [["`GetJsonObject(phone#37,$.phone)`" = ?, 1200]]
      - Requested Columns: [id,address,phone]
{code}
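For what it's worth, the backticked name in the earlier revision above (`GetJsonObject(phone#37,$.phone)`) is what a connector would see as the filter's attribute once PushableColumnBase quotes the synthesized column name. Below is a sketch of how a DSv2 source might pick such filters up, assuming the PushableColumnBase change discussed in the next comment is in place. The class and its matching logic are hypothetical; only the SupportsPushDownFilters interface and the Filter classes are Spark's.
{code}
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters}
import org.apache.spark.sql.sources.{EqualTo, Filter}

// Hypothetical scan builder: with the PushableColumnBase change applied, the
// json predicate arrives as e.g. EqualTo("`GetJsonObject(phone#37,$.phone)`", 1200)
// and the source can translate it into its native json-path syntax.
class JsonAwareScanBuilder(delegate: ScanBuilder) extends SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition {
      // PushableColumnBase quotes the synthesized name, hence the backtick.
      case EqualTo(attr, _) => attr.stripPrefix("`").startsWith("GetJsonObject(")
      case _ => false
    }
    pushed = supported
    unsupported // anything not translated stays as a post-scan Filter
  }

  override def pushedFilters(): Array[Filter] = pushed
  override def build(): Scan = delegate.build()
}
{code}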
[jira] [Comment Edited] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254944#comment-17254944 ] Ted Yu edited comment on SPARK-33915 at 12/26/20, 4:14 AM:
---
I have experimented with two patches which allow a json expression to be recognized as a pushable column (without the presence of cast()).

The first involves duplicating GetJsonObject as GetJsonString, whose eval() method returns UTF8String. FunctionRegistry.scala and functions.scala would have corresponding changes, and a new case for GetJsonString is added in the PushableColumnBase class. Since code duplication is undesirable and a case class cannot be directly subclassed, there is more work to be done in this direction.

The second is simpler: the return type of GetJsonObject#eval is changed to UTF8String, and a new case for GetJsonObject is added in the PushableColumnBase class. I have run the catalyst / sql core unit tests, which passed.

In the test output, I observed the following (for the second patch; output for the first patch is similar):
{code}
Post-Scan Filters: (get_json_object(phone#37, $.code) = 1200)
{code}
Comments on the two approaches (or on another approach) are welcome. Below is the snippet for the second approach.
{code}
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
index c22b68890a..8004ddd735 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@@ -139,7 +139,7 @@ case class GetJsonObject(json: Expression, path: Expression)
   @transient private lazy val parsedPath = parsePath(path.eval().asInstanceOf[UTF8String])

-  override def eval(input: InternalRow): Any = {
+  override def eval(input: InternalRow): UTF8String = {
     val jsonStr = json.eval(input).asInstanceOf[UTF8String]
     if (jsonStr == null) {
       return null
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
index e4f001d61a..1cdc2642ba 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
@@ -723,6 +725,12 @@ abstract class PushableColumnBase {
       }
       case s: GetStructField if nestedPredicatePushdownEnabled =>
         helper(s.child).map(_ :+ s.childSchema(s.ordinal).name)
+      case GetJsonObject(col, field) =>
+        Some(Seq("GetJsonObject(" + col + "," + field + ")"))
       case _ => None
     }
     helper(e).map(_.quoted)
{code}

was (Author: yuzhih...@gmail.com):
I have experimented with two patches which allow a json expression to be recognized as a pushable column (without the presence of cast()).

The first involves duplicating GetJsonObject as GetJsonString, whose eval() method returns UTF8String. FunctionRegistry.scala and functions.scala would have corresponding changes, and a new case for GetJsonString is added in the PushableColumnBase class. Since code duplication is undesirable and a case class cannot be directly subclassed, there is more work to be done in this direction.

The second is simpler: the return type of GetJsonObject#eval is changed to UTF8String, and a new case for GetJsonObject is added in the PushableColumnBase class. I have run the catalyst / sql core unit tests, which passed.

In the test output, I observed the following (for the second patch; output for the first patch is similar):
{code}
Post-Scan Filters: (get_json_object(phone#37, $.phone) = 1200)
{code}
Comments on the two approaches (or on another approach) are welcome. Below is the snippet for the second approach.
{code}
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
index c22b68890a..8004ddd735 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@@ -139,7 +139,7 @@ case class GetJsonObject(json: Expression, path: Expression)
   @transient private lazy val parsedPath = parsePath(path.eval().asInstanceOf[UTF8String])

-  override def eval(input: InternalRow): Any = {
+  override def eval(input: InternalRow): UTF8String = {
     val jsonStr = json.eval(input).asInstanceOf[UTF8String]
     if (jsonStr == null) {
       return null
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
index e4f001d61a..1cdc2642ba 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
@@ -723,6 +725,12 @@ abstract class PushableColumnBase {
       }
       case s: GetStructField if nestedPredicatePushdownEnabled =>
         helper(s.child).map(_ :+ s.childSchema(s.ordinal).name)
+      case GetJsonObject(col, field) =>
+        Some(Seq("GetJsonObject(" + col + "," + field + ")"))
       case _ => None
     }
     helper(e).map(_.quoted)
{code}
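For the first approach, no snippet is shown above, but a duplicated expression might look roughly like the following. This is a hypothetical sketch against Spark 3.0.x (the affected version): the name GetJsonString comes from the comment, while delegating to an internal GetJsonObject is just one way to sidestep the code-duplication concern mentioned above; none of it is from an actual patch.
{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, ExpectsInputTypes, Expression, GetJsonObject}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical duplicate of GetJsonObject whose eval() is typed as UTF8String,
// leaving GetJsonObject itself untouched. Since a case class cannot be
// subclassed directly, it delegates to an internal GetJsonObject instance.
case class GetJsonString(json: Expression, path: Expression)
  extends BinaryExpression with ExpectsInputTypes with CodegenFallback {

  // Reuse GetJsonObject's json-path evaluation rather than copying it.
  @transient private lazy val delegate = GetJsonObject(json, path)

  override def left: Expression = json
  override def right: Expression = path
  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
  override def dataType: DataType = StringType
  override def nullable: Boolean = true
  override def prettyName: String = "get_json_string"

  override def eval(input: InternalRow): UTF8String =
    delegate.eval(input).asInstanceOf[UTF8String]
}
{code}
Registration in FunctionRegistry.scala / functions.scala and the GetJsonString case in PushableColumnBase would then mirror the GetJsonObject versions shown in the diff above.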