Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22905#discussion_r230914377

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---

    @@ -306,7 +306,15 @@ case class FileSourceScanExec(
             withOptPartitionCount
           }

    -      withSelectedBucketsCount
    +      val withOptColumnCount = relation.fileFormat match {
    +        case columnar: ColumnarFileFormat =>
    +          val sqlConf = relation.sparkSession.sessionState.conf
    +          val columnCount = columnar.columnCountForSchema(sqlConf, requiredSchema)
    +          withSelectedBucketsCount + ("ColumnCount" -> columnCount.toString)

--- End diff --

> In that way, all other logs should be put in metadata. Who wants that?

If someone wants to put metadata somewhere in the physical plan, let them open a PR and make a case for it. Otherwise, why would we put things into the plan metadata that no one has asked for? That does not make sense.

> For instance, we're not even showing the actual filters (not catalyst but I mean the actual pushed filters that are going to apply at each source implementation level such as filters from ParquetFilters.createFilter) in Spark side.

That's right; however, that's not the aim of this PR. The aim is to verify that plan-time schema pruning is working as expected, and I assert that that information alone is valuable enough to warrant inclusion in the data source metadata. That's speaking from experience, not conjecture. This metadata has been a _very valuable_ tool for us, and it incurs little burden in practice.
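For context, a minimal sketch of the `ColumnarFileFormat` hook the diff above relies on. The trait name and the `columnCountForSchema` signature come from the diff itself; the trait body and the `LeafCountingFormat` implementation below are assumptions for illustration, not the PR's actual code.

```scala
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{DataType, StructType}

// Assumed shape of the trait referenced in the diff: a columnar source
// reports how many physical columns it will actually read for a schema.
trait ColumnarFileFormat {
  def columnCountForSchema(conf: SQLConf, schema: StructType): Int
}

// Hypothetical implementation that counts leaf fields, so that nested-field
// pruning is reflected in the "ColumnCount" plan metadata entry.
object LeafCountingFormat extends ColumnarFileFormat {
  override def columnCountForSchema(conf: SQLConf, schema: StructType): Int = {
    def leaves(dt: DataType): Int = dt match {
      case s: StructType => s.fields.map(f => leaves(f.dataType)).sum
      case _             => 1
    }
    leaves(schema)
  }
}
```

With such a hook, the scan's `metadata` map gains a `"ColumnCount"` entry, making plan-time schema pruning visible directly in `explain` output rather than requiring source-level logging.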