Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22905#discussion_r230914377
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -306,7 +306,15 @@ case class FileSourceScanExec(
           withOptPartitionCount
         }
     
    -    withSelectedBucketsCount
    +    val withOptColumnCount = relation.fileFormat match {
    +      case columnar: ColumnarFileFormat =>
    +        val sqlConf = relation.sparkSession.sessionState.conf
    +        val columnCount = columnar.columnCountForSchema(sqlConf, requiredSchema)
    +        withSelectedBucketsCount + ("ColumnCount" -> columnCount.toString)
    --- End diff ---
    
    > In that way, all other logs should be put in metadata.
    
    Who wants that? If someone wants to put other metadata into the physical plan, let them open a PR and make the case for it. Otherwise, why would we add things to the plan metadata that no one has asked for? That does not make sense.
    
    > For instance, we're not even showing the actual filters (not catalyst but I mean the actual pushed filters that are going to apply at each source implementation level such as filters from ParquetFilters.createFilter) in Spark side.
    
    That's right; however, that's not the aim of this PR. The aim is to ensure that plan-time schema pruning is working as expected, and I assert that this information alone is valuable enough to warrant inclusion in the data source metadata. That's speaking from experience, not conjecture. This metadata has been a _very valuable_ tool for us, and it incurs little burden in practice.
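    For context, the diff above conditionally appends a `ColumnCount` entry to the scan node's metadata map when the file format is columnar. The pattern can be sketched in plain Scala; the traits and method below are simplified stand-ins for the PR's Spark types (`ColumnarFileFormat`, `columnCountForSchema`), not the real API:

    ```scala
    // Simplified stand-ins for the PR's types; not Spark's actual classes.
    trait FileFormat
    trait ColumnarFileFormat extends FileFormat {
      // The PR's version takes a SQLConf and the required schema;
      // here we just count fields directly.
      def columnCountForSchema(fieldCount: Int): Int
    }

    object ColumnCountSketch {
      case class ParquetLike() extends ColumnarFileFormat {
        def columnCountForSchema(fieldCount: Int): Int = fieldCount
      }
      case class RowBasedLike() extends FileFormat

      // Mirrors the diff: only columnar formats contribute a ColumnCount
      // entry; row-based formats leave the metadata map unchanged.
      def metadata(format: FileFormat,
                   base: Map[String, String],
                   fieldCount: Int): Map[String, String] =
        format match {
          case columnar: ColumnarFileFormat =>
            base + ("ColumnCount" -> columnar.columnCountForSchema(fieldCount).toString)
          case _ =>
            base
        }

      def main(args: Array[String]): Unit = {
        val base = Map("Format" -> "Parquet")
        // Columnar format: metadata gains a ColumnCount entry.
        println(metadata(ParquetLike(), base, 3))
        // Non-columnar format: metadata is unchanged.
        println(metadata(RowBasedLike(), Map("Format" -> "CSV"), 3))
      }
    }
    ```

    With a pruned `requiredSchema`, the `ColumnCount` shown in the plan drops accordingly, which is exactly what makes it useful for confirming that schema pruning happened at plan time.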


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
