Github user vkhristenko commented on the issue:

    https://github.com/apache/spark/pull/16578

Hi,

My name is Viktor and I'm working at CERN on a ROOT I/O DataSource for the JVM and an interface for Spark. ROOT I/O is the format used for CERN's LHC data. ROOT is a columnar data format similar to Parquet, and it is likewise amenable to nested-field pruning. I'm new to contributing to Apache Spark, which is why I'm spelling all of this out explicitly.

- I found that this PR is more general than just Parquet!
- Using my source with this PR, the buildReader function, https://github.com/diana-hep/spark-root/blob/master/src/main/scala/org/dianahep/sparkroot/experimental/package.scala#L86 , receives only the schema required by the df.select("") statement.
- A minor change is needed, though: "parquetFormat: ParquetFileFormat" should be replaced by "fileFormat: FileFormat", since there is no dependency on the actual ParquetFileFormat class defined in the parquet package: https://github.com/apache/spark/pull/16578/files?diff=unified#diff-3bad814b3336a83f360d7395bd740759R38
- It may also be worth renaming ParquetSchemaPruning and moving it out of the parquet package, since it is considerably more general than Parquet alone; otherwise I would have to add a separate Rule here: https://github.com/apache/spark/pull/16578/files?diff=unified#diff-2370d8ed85930c93ef8e5ce67abca53fR35

Thanks!
VK
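To make the nested-field pruning idea concrete, here is a minimal, self-contained Scala sketch (not Spark's actual implementation; the type and object names are hypothetical) of what "the reader receives only the schema required by df.select(...)" means: given a full nested schema and the set of selected leaf paths, only the referenced branches survive.

```scala
// Hypothetical miniature schema model; Spark's real StructType is richer.
sealed trait DataType
case class StructType(fields: List[(String, DataType)]) extends DataType
case object AtomicType extends DataType

object SchemaPruning {
  // Prune `schema` down to the leaves named by dot-separated `paths`,
  // e.g. Set("event.energy") keeps only event.energy.
  def prune(schema: StructType, paths: Set[String]): StructType =
    StructType(schema.fields.flatMap {
      case (name, s: StructType) =>
        // Paths that descend into this struct, with the prefix stripped.
        val sub = paths.collect {
          case p if p.startsWith(name + ".") => p.stripPrefix(name + ".")
        }
        if (paths.contains(name)) Some(name -> s)          // whole struct requested
        else if (sub.nonEmpty) Some(name -> prune(s, sub)) // recurse into struct
        else None                                          // not referenced: drop
      case (name, t) if paths.contains(name) => Some(name -> t)
      case _ => None
    })
}

val full = StructType(List(
  "event" -> StructType(List("id" -> AtomicType, "energy" -> AtomicType)),
  "run"   -> AtomicType))

// df.select("event.energy") should leave only that branch:
val pruned = SchemaPruning.prune(full, Set("event.energy"))
println(pruned)
```

The point of the PR, as I read it, is that this pruned schema is computed once in the optimizer and handed to any FileFormat's buildReader, which is why the rule need not depend on ParquetFileFormat specifically.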