[ 
https://issues.apache.org/jira/browse/SPARK-34017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259423#comment-17259423
 ] 

Ted Yu commented on SPARK-34017:
--------------------------------

For PushDownUtils#pruneColumns, I am experimenting with the following:
{code}
      case r: SupportsPushDownRequiredColumns if SQLConf.get.nestedSchemaPruningEnabled =>
        // Matches the string form of a get_json_object call: group 1 is the column, group 2 the JSON path.
        val JSONCapture = "get_json_object\\((.*), *(.*)\\)".r
        val jsonRootFields: ArrayBuffer[RootField] = ArrayBuffer()
        // Visit every node of every projection and record a root field for each JSON access found.
        projects.foreach { _.foreach { f => f.toString match {
          case JSONCapture(column, field) =>
            jsonRootFields += RootField(StructField(column, f.dataType, f.nullable),
              derivedFromAtt = false, prunedIfAnyChildAccessed = true)
          case _ => logDebug("else " + f)
        }}}
        val rootFields = SchemaPruning.identifyRootFields(projects, filters) ++ jsonRootFields
{code}
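
To check what the regex above actually captures, here is a minimal standalone sketch with no Spark dependency. It assumes the expression strings look like the ones in the debug log from the issue description; the object name JsonCaptureDemo and the helper stripExprId are hypothetical and not part of Spark.

```scala
// Standalone sketch: only the regex from the experiment, applied to plain strings.
object JsonCaptureDemo {
  // Same pattern as in the experiment: group 1 is the column, group 2 the JSON path.
  val JSONCapture = "get_json_object\\((.*), *(.*)\\)".r

  // Returns (column, path) only when the whole string is a get_json_object call.
  def capture(expr: String): Option[(String, String)] = expr match {
    case JSONCapture(column, path) => Some((column, path))
    case _ => None
  }

  // Hypothetical helper (an assumption, not a Spark API): drops a trailing "#<digits>" expression id.
  def stripExprId(name: String): String = name.replaceAll("#\\d+$", "")

  def main(args: Array[String]): Unit = {
    val call = "get_json_object(phone#36, $.code)"
    val aliased = call + " AS get_json_object(phone, $.code)#37"
    println(capture(call))    // Some((phone#36,$.code))
    println(capture(aliased)) // None: the alias suffix prevents a full match
    println(capture(call).map { case (c, p) => (stripExprId(c), p) }) // Some((phone,$.code))
  }
}
```

Note that the captured column still carries the expression id suffix (phone#36), so the StructField built from it would be named "phone#36" rather than "phone". Stripping the suffix as shown is one possible workaround. Matching on the expression type instead of its string form may be more robust.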

> Pass json column information via pruneColumns()
> -----------------------------------------------
>
>                 Key: SPARK-34017
>                 URL: https://issues.apache.org/jira/browse/SPARK-34017
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Ted Yu
>            Priority: Major
>
> Currently PushDownUtils#pruneColumns only passes root fields to 
> SupportsPushDownRequiredColumns implementation(s).
> {code}
> 2021-01-05 19:36:07,437 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] nested schema projection List(id#33, address#34, phone#36, get_json_object(phone#36, $.code) AS get_json_object(phone, $.code)#37)
> 2021-01-05 19:36:07,438 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] nested schema StructType(StructField(id,IntegerType,false), StructField(address,StringType,true), StructField(phone,StringType,true))
> {code}
> The first line shows the projections and the second line shows the pruned
> schema. The expression get_json_object(phone#36, $.code), which retrieves
> the field 'code' from the JSON column phone, is dropped: the pruned schema
> carries only the root field phone.
> We should allow JSON column information to be passed via pruneColumns().



