[jira] [Comment Edited] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255389#comment-17255389 ] Ted Yu edited comment on SPARK-33915 at 12/31/20, 3:16 PM:
---
Here is the plan prior to predicate pushdown:
{code}
2020-12-26 03:28:59,926 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- Filter (get_json_object(phone#37, $.code) = 1200)
      +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
         - Cassandra Filters: []
         - Requested Columns: [id,address,phone]
{code}
Here is the plan with pushdown:
{code}
2020-12-28 01:40:08,150 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
      - Cassandra Filters: [[phone->'code' = ?, 1200]]
      - Requested Columns: [id,address,phone]
{code}

was (Author: yuzhih...@gmail.com):
Here is the plan prior to predicate pushdown:
{code}
2020-12-26 03:28:59,926 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- Filter (get_json_object(phone#37, $.phone) = 1200)
      +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
         - Cassandra Filters: []
         - Requested Columns: [id,address,phone]
{code}
Here is the plan with pushdown:
{code}
2020-12-28 01:40:08,150 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
      - Cassandra Filters: [[phone->'phone' = ?, 1200]]
      - Requested Columns: [id,address,phone]
{code}

> Allow json expression to be pushable column
> --------------------------------------------
>
>                 Key: SPARK-33915
>                 URL: https://issues.apache.org/jira/browse/SPARK-33915
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.1
>            Reporter: Ted Yu
>            Priority: Major
>
> Currently PushableColumnBase provides no support for json / jsonb expressions.
> Example of a json expression:
> {code}
> get_json_object(phone, '$.code') = '1200'
> {code}
> If a non-string literal is part of the expression, the presence of cast() would complicate the situation.
> The implication is that an implementation of SupportsPushDownFilters doesn't get a chance to perform pushdown even if the third-party DB engine supports json expression pushdown.
> This issue is for discussion and implementation of Spark core changes which would allow a json expression to be recognized as a pushable column.
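For context, a query of roughly the following shape would reproduce the plans above. This is a minimal sketch, not the actual test: the keyspace/table (test.person) and the column names are taken from the plans, while the SparkSession setup and the Cassandra connector format/options are assumptions.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

val spark = SparkSession.builder()
  .appName("json-pushdown-repro")  // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Load test.person via the DataSource v2 Cassandra connector
// (format string and options are assumptions, not from the test above).
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "person"))
  .load()

// Without the proposed change, this predicate stays in a Filter node above
// the BatchScan; with it, it reaches the scan as a Cassandra filter.
df.filter(get_json_object($"phone", "$.code") === "1200")
  .orderBy($"id")
  .explain(true)
{code}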
[jira] [Comment Edited] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255389#comment-17255389 ] Ted Yu edited comment on SPARK-33915 at 12/29/20, 6:51 PM:
---
Here is the plan prior to predicate pushdown:
{code}
2020-12-26 03:28:59,926 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- Filter (get_json_object(phone#37, $.phone) = 1200)
      +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
         - Cassandra Filters: []
         - Requested Columns: [id,address,phone]
{code}
Here is the plan with pushdown:
{code}
2020-12-28 01:40:08,150 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
      - Cassandra Filters: [[phone->'phone' = ?, 1200]]
      - Requested Columns: [id,address,phone]
{code}

was (Author: yuzhih...@gmail.com):
Here is the plan prior to predicate pushdown:
{code}
2020-12-26 03:28:59,926 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- Filter (get_json_object(phone#37, $.phone) = 1200)
      +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
         - Cassandra Filters: []
         - Requested Columns: [id,address,phone]
{code}
Here is the plan with pushdown:
{code}
2020-12-28 01:40:08,150 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan:
Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33]
   +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
      - Cassandra Filters: [["`GetJsonObject(phone#37,$.phone)`" = ?, 1200]]
      - Requested Columns: [id,address,phone]
{code}
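For what it's worth, the backticked name in the earlier revision above (`GetJsonObject(phone#37,$.phone)`) is what a connector would see as the filter's attribute once PushableColumnBase quotes the synthesized column name. Below is a sketch of how a DSv2 source might pick such filters up, assuming the PushableColumnBase change discussed in the next comment is in place. The class and its matching logic are hypothetical; only the SupportsPushDownFilters interface and the Filter classes are Spark's.
{code}
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters}
import org.apache.spark.sql.sources.{EqualTo, Filter}

// Hypothetical scan builder: with the PushableColumnBase change applied, the
// json predicate arrives as e.g. EqualTo("`GetJsonObject(phone#37,$.phone)`", 1200)
// and the source can translate it into its native json-path syntax.
class JsonAwareScanBuilder(delegate: ScanBuilder) extends SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition {
      // PushableColumnBase quotes the synthesized name, hence the backtick.
      case EqualTo(attr, _) => attr.stripPrefix("`").startsWith("GetJsonObject(")
      case _ => false
    }
    pushed = supported
    unsupported // anything not translated stays as a post-scan Filter
  }

  override def pushedFilters(): Array[Filter] = pushed
  override def build(): Scan = delegate.build()
}
{code}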
[jira] [Comment Edited] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254944#comment-17254944 ] Ted Yu edited comment on SPARK-33915 at 12/26/20, 4:14 AM:
---
I have experimented with two patches which allow a json expression to be recognized as a pushable column (without the presence of cast()).

The first involves duplicating GetJsonObject as GetJsonString, whose eval() method returns UTF8String. FunctionRegistry.scala and functions.scala would have corresponding changes, and a new case for GetJsonString is added in the PushableColumnBase class. Since code duplication is undesirable and a case class cannot be directly subclassed, there is more work to be done in this direction.

The second is simpler: the return type of GetJsonObject#eval is changed to UTF8String, and a new case for GetJsonObject is added in the PushableColumnBase class. I have run the catalyst / sql core unit tests, which passed.

In the test output, I observed the following (for the second patch; output for the first patch is similar):
{code}
Post-Scan Filters: (get_json_object(phone#37, $.code) = 1200)
{code}
Comments on the two approaches (or on another approach) are welcome. Below is the snippet for the second approach.
{code}
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
index c22b68890a..8004ddd735 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@@ -139,7 +139,7 @@ case class GetJsonObject(json: Expression, path: Expression)
   @transient private lazy val parsedPath = parsePath(path.eval().asInstanceOf[UTF8String])

-  override def eval(input: InternalRow): Any = {
+  override def eval(input: InternalRow): UTF8String = {
     val jsonStr = json.eval(input).asInstanceOf[UTF8String]
     if (jsonStr == null) {
       return null
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
index e4f001d61a..1cdc2642ba 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
@@ -723,6 +725,12 @@ abstract class PushableColumnBase {
       }
       case s: GetStructField if nestedPredicatePushdownEnabled =>
         helper(s.child).map(_ :+ s.childSchema(s.ordinal).name)
+      case GetJsonObject(col, field) =>
+        Some(Seq("GetJsonObject(" + col + "," + field + ")"))
       case _ => None
     }
     helper(e).map(_.quoted)
{code}

was (Author: yuzhih...@gmail.com):
I have experimented with two patches which allow a json expression to be recognized as a pushable column (without the presence of cast()).

The first involves duplicating GetJsonObject as GetJsonString, whose eval() method returns UTF8String. FunctionRegistry.scala and functions.scala would have corresponding changes, and a new case for GetJsonString is added in the PushableColumnBase class. Since code duplication is undesirable and a case class cannot be directly subclassed, there is more work to be done in this direction.

The second is simpler: the return type of GetJsonObject#eval is changed to UTF8String, and a new case for GetJsonObject is added in the PushableColumnBase class. I have run the catalyst / sql core unit tests, which passed.

In the test output, I observed the following (for the second patch; output for the first patch is similar):
{code}
Post-Scan Filters: (get_json_object(phone#37, $.phone) = 1200)
{code}
Comments on the two approaches (or on another approach) are welcome. Below is the snippet for the second approach.
{code}
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
index c22b68890a..8004ddd735 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@@ -139,7 +139,7 @@ case class GetJsonObject(json: Expression, path: Expression)
   @transient private lazy val parsedPath = parsePath(path.eval().asInstanceOf[UTF8String])

-  override def eval(input: InternalRow): Any = {
+  override def eval(input: InternalRow): UTF8String = {
     val jsonStr = json.eval(input).asInstanceOf[UTF8String]
     if (jsonStr == null) {
       return null
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
index e4f001d61a..1cdc2642ba 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
@@ -723,6 +725,12 @@ abstract class PushableColumnBase {
       }
       case s: GetStructField if nestedPredicatePushdownEnabled =>
         helper(s.child).map(_ :+ s.childSchema(s.ordinal).name)
+      case GetJsonObject(col, field) =>
+        Some(Seq("GetJsonObject(" + col + "," + field + ")"))
       case _ => None
     }
     helper(e).map(_.quoted)
{code}
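For the first approach, no snippet is shown above, but a duplicated expression might look roughly like the following. This is a hypothetical sketch against Spark 3.0.x (the affected version): the name GetJsonString comes from the comment, while delegating to an internal GetJsonObject is just one way to sidestep the code-duplication concern mentioned above; none of it is from an actual patch.
{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, ExpectsInputTypes, Expression, GetJsonObject}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical duplicate of GetJsonObject whose eval() is typed as UTF8String,
// leaving GetJsonObject itself untouched. Since a case class cannot be
// subclassed directly, it delegates to an internal GetJsonObject instance.
case class GetJsonString(json: Expression, path: Expression)
  extends BinaryExpression with ExpectsInputTypes with CodegenFallback {

  // Reuse GetJsonObject's json-path evaluation rather than copying it.
  @transient private lazy val delegate = GetJsonObject(json, path)

  override def left: Expression = json
  override def right: Expression = path
  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
  override def dataType: DataType = StringType
  override def nullable: Boolean = true
  override def prettyName: String = "get_json_string"

  override def eval(input: InternalRow): UTF8String =
    delegate.eval(input).asInstanceOf[UTF8String]
}
{code}
Registration in FunctionRegistry.scala / functions.scala and the GetJsonString case in PushableColumnBase would then mirror the GetJsonObject versions shown in the diff above.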