[jira] [Commented] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257647#comment-17257647 ] Ted Yu commented on SPARK-33915: Here is sample code for capturing the column and fields in downstream PredicatePushDown.scala {code} private val JSONCapture = "`GetJsonObject\\((.*),(.*)\\)`".r private def transformGetJsonObject(p: Predicate): Predicate = { val eq = p.asInstanceOf[sources.EqualTo] eq.attribute match { case JSONCapture(column,field) => val colName = column.toString.split("#")(0) val names = field.toString.split("\\.").foldLeft(List[String]()){(z, n) => z :+ "->'"+n+"'" } sources.EqualTo(colName + names.slice(1, names.size).mkString(""), eq.value).asInstanceOf[Predicate] case _ => sources.EqualTo("foo", "bar").asInstanceOf[Predicate] } } {code} > Allow json expression to be pushable column > --- > > Key: SPARK-33915 > URL: https://issues.apache.org/jira/browse/SPARK-33915 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Ted Yu >Assignee: Apache Spark >Priority: Major > > Currently PushableColumnBase provides no support for json / jsonb expression. > Example of json expression: > {code} > get_json_object(phone, '$.code') = '1200' > {code} > If non-string literal is part of the expression, the presence of cast() would > complicate the situation. > Implication is that implementation of SupportsPushDownFilters doesn't have a > chance to perform pushdown even if third party DB engine supports json > expression pushdown. > This issue is for discussion and implementation of Spark core changes which > would allow json expression to be recognized as pushable column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257027#comment-17257027 ] Apache Spark commented on SPARK-33915: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/30984 > Allow json expression to be pushable column > --- > > Key: SPARK-33915 > URL: https://issues.apache.org/jira/browse/SPARK-33915 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Ted Yu >Priority: Major > > Currently PushableColumnBase provides no support for json / jsonb expression. > Example of json expression: > {code} > get_json_object(phone, '$.code') = '1200' > {code} > If non-string literal is part of the expression, the presence of cast() would > complicate the situation. > Implication is that implementation of SupportsPushDownFilters doesn't have a > chance to perform pushdown even if third party DB engine supports json > expression pushdown. > This issue is for discussion and implementation of Spark core changes which > would allow json expression to be recognized as pushable column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257025#comment-17257025 ] Ted Yu commented on SPARK-33915: Opened https://github.com/apache/spark/pull/30984 > Allow json expression to be pushable column > --- > > Key: SPARK-33915 > URL: https://issues.apache.org/jira/browse/SPARK-33915 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Ted Yu >Priority: Major > > Currently PushableColumnBase provides no support for json / jsonb expression. > Example of json expression: > {code} > get_json_object(phone, '$.code') = '1200' > {code} > If non-string literal is part of the expression, the presence of cast() would > complicate the situation. > Implication is that implementation of SupportsPushDownFilters doesn't have a > chance to perform pushdown even if third party DB engine supports json > expression pushdown. > This issue is for discussion and implementation of Spark core changes which > would allow json expression to be recognized as pushable column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256840#comment-17256840 ] Yuanjian Li commented on SPARK-33915: - +1 for the second one if it can pass all the tests. Feel free to submit a PR. I think it's an essential improvement. Also cc [~cloud_fan] > Allow json expression to be pushable column > --- > > Key: SPARK-33915 > URL: https://issues.apache.org/jira/browse/SPARK-33915 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Ted Yu >Priority: Major > > Currently PushableColumnBase provides no support for json / jsonb expression. > Example of json expression: > {code} > get_json_object(phone, '$.code') = '1200' > {code} > If non-string literal is part of the expression, the presence of cast() would > complicate the situation. > Implication is that implementation of SupportsPushDownFilters doesn't have a > chance to perform pushdown even if third party DB engine supports json > expression pushdown. > This issue is for discussion and implementation of Spark core changes which > would allow json expression to be recognized as pushable column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255631#comment-17255631 ] Ted Yu commented on SPARK-33915: [~codingcat] [~XuanYuan][~viirya][~Alex Herman] Can you provide your comment ? Thanks > Allow json expression to be pushable column > --- > > Key: SPARK-33915 > URL: https://issues.apache.org/jira/browse/SPARK-33915 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Ted Yu >Priority: Major > > Currently PushableColumnBase provides no support for json / jsonb expression. > Example of json expression: > {code} > get_json_object(phone, '$.code') = '1200' > {code} > If non-string literal is part of the expression, the presence of cast() would > complicate the situation. > Implication is that implementation of SupportsPushDownFilters doesn't have a > chance to perform pushdown even if third party DB engine supports json > expression pushdown. > This issue is for discussion and implementation of Spark core changes which > would allow json expression to be recognized as pushable column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255389#comment-17255389 ] Ted Yu commented on SPARK-33915: Here is the plan prior to predicate pushdown: {code} 2020-12-26 03:28:59,926 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan: Sort [id#34 ASC NULLS FIRST], true, 0 +- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33] +- Filter (get_json_object(phone#37, $.phone) = 1200) +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person - Cassandra Filters: [] - Requested Columns: [id,address,phone] {code} Here is the plan with pushdown: {code} 2020-12-28 01:40:08,150 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan: Sort [id#34 ASC NULLS FIRST], true, 0 +- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33] +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person - Cassandra Filters: [["`GetJsonObject(phone#37,$.phone)`" = ?, 1200]] - Requested Columns: [id,address,phone] {code} > Allow json expression to be pushable column > --- > > Key: SPARK-33915 > URL: https://issues.apache.org/jira/browse/SPARK-33915 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Ted Yu >Priority: Major > > Currently PushableColumnBase provides no support for json / jsonb expression. > Example of json expression: > {code} > get_json_object(phone, '$.code') = '1200' > {code} > If non-string literal is part of the expression, the presence of cast() would > complicate the situation. > Implication is that implementation of SupportsPushDownFilters doesn't have a > chance to perform pushdown even if third party DB engine supports json > expression pushdown. > This issue is for discussion and implementation of Spark core changes which > would allow json expression to be recognized as pushable column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254944#comment-17254944 ] Ted Yu commented on SPARK-33915: I have experimented with two patches which allow json expression to be pushable column (without the presence of cast). The first involves duplicating GetJsonObject as GetJsonString where eval() method returns UTF8String. FunctionRegistry.scala and functions.scala would have corresponding change. In PushableColumnBase class, new case for GetJsonString is added. Since code duplication is undesirable and case class cannot be directly subclassed, there is more work to be done in this direction. The second is simpler. The return type of GetJsonObject#eval is changed to UTF8String. I have run through catalyst / sql core unit tests which passed. In PushableColumnBase class, new case for GetJsonObject is added. In test output, I observed the following (for second patch, output for first patch is similar): Post-Scan Filters: (get_json_object(phone#37, $.phone) = 1200) Comment on the two approaches (and other approach) is welcome. Below is snippet for second approach. {code} diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala index c22b68890a..8004ddd735 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala @@ -139,7 +139,7 @@ case class GetJsonObject(json: Expression, path: Expression) @transient private lazy val parsedPath = parsePath(path.eval().asInstanceOf[UTF8String]) - override def eval(input: InternalRow): Any = { + override def eval(input: InternalRow): UTF8String = { val jsonStr = json.eval(input).asInstanceOf[UTF8String] if (jsonStr == null) { return null diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala index e4f001d61a..1cdc2642ba 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala @@ -723,6 +725,12 @@ abstract class PushableColumnBase { } case s: GetStructField if nestedPredicatePushdownEnabled => helper(s.child).map(_ :+ s.childSchema(s.ordinal).name) + case GetJsonObject(col, field) => +Some(Seq("GetJsonObject(" + col + "," + field + ")")) case _ => None } helper(e).map(_.quoted) {code} > Allow json expression to be pushable column > --- > > Key: SPARK-33915 > URL: https://issues.apache.org/jira/browse/SPARK-33915 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Ted Yu >Priority: Major > > Currently PushableColumnBase provides no support for json / jsonb expression. > Example of json expression: > {code} > get_json_object(phone, '$.code') = '1200' > {code} > If non-string literal is part of the expression, the presence of cast() would > complicate the situation. > Implication is that implementation of SupportsPushDownFilters doesn't have a > chance to perform pushdown even if third party DB engine supports json > expression pushdown. > This issue is for discussion and implementation of Spark core changes which > would allow json expression to be recognized as pushable column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org