[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

mallman Fri, 15 Sep 2017 20:01:32 -0700

Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/16578
  
    > * There is a minor change needed though, "parquetFormat: 
ParquetFileFormat" should be replaced by "fileFormat: FileFormat" as there is 
no dependency on the actual ParquetFileFormat class defined in parquet package
    
https://github.com/apache/spark/pull/16578/files?diff=unified#diff-3bad814b3336a83f360d7395bd740759R38
    
    This is true. I hesitate to weaken the match to all instances of 
`FileFormat` because the only format I have extensive experience and knowledge 
of is Parquet. There are other formats that could expand upon this work, such 
as ORC and ROOT, but I have no practical experience working with those. I'd 
prefer someone who does have such experience build on this PR to make it work 
with those file formats.
    
    Incidentally, I added a `ColumnarFileFormat` trait in this PR. You might 
consider it a marker for columnar file formats, but all it really does right 
now is compute the number of physical columns read for a given catalyst schema. 
This value is used in the description of a physical plan to help a dev/user 
ensure that their expected column pruning is in fact occurring.
    
    >  * And may be renaming this ParquetSchemaPruning and taking it outside of 
the parquet package as it is quite more general than just for parquet, 
otherwise I have to add a special Rule here, 
https://github.com/apache/spark/pull/16578/files?diff=unified#diff-2370d8ed85930c93ef8e5ce67abca53fR35
 ???
    
    Moving and renaming `ParquetSchemaPruning` makes sense if it's generalized 
to other file formats.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

Reply via email to