[GitHub] [flink] AHeise commented on pull request #15725: [FLINK-21389] determine parquet schema from file instead of taking it from user

GitBox Thu, 10 Jun 2021 01:44:05 -0700


AHeise commented on pull request #15725:
URL: https://github.com/apache/flink/pull/15725#issuecomment-858434406



   >  But still, there is a thing I don't get: what is the point of providing a 
schema in _ParquetInputFormat_ constructor, it is not the one that will be used 
**at the actual reading time**, it is the writer schema (the one extracted in 
_#open_) that will be used. Hence the fact that I qualified the ticket as a bug 
with current flink master state.
   
   Have a look at `ParquetTableSource` and the `ParquetRowInputFormat` it uses. 
The idea is that the projection is pushed down into the source: for a query like 
`SELECT a, b FROM parquet_file`, you only need to read columns `a` and `b`. In a 
columnar format, that means skipping a large chunk of the data.
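
To illustrate the idea, here is a minimal, self-contained sketch of deriving a read schema from a file schema and a list of projected columns. It is not Flink or Parquet code: `ProjectionSketch`, `project`, and the string type names are hypothetical and only model a schema as an ordered map from column name to type.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Flink: a schema is modeled as column name -> type name.
final class ProjectionSketch {

    // Builds the read schema: only the requested columns, in the requested order,
    // with their types looked up in the schema stored in the file.
    static Map<String, String> project(Map<String, String> fileSchema, List<String> requested) {
        Map<String, String> readSchema = new LinkedHashMap<>();
        for (String column : requested) {
            String type = fileSchema.get(column);
            if (type == null) {
                throw new IllegalArgumentException("Column not in file schema: " + column);
            }
            readSchema.put(column, type);
        }
        return readSchema;
    }

    public static void main(String[] args) {
        // Assumed file (writer) schema with three columns.
        Map<String, String> fileSchema = new LinkedHashMap<>();
        fileSchema.put("a", "INT64");
        fileSchema.put("b", "BINARY");
        fileSchema.put("c", "DOUBLE");

        // SELECT a, b FROM parquet_file: column c never needs to be read.
        System.out.println(project(fileSchema, Arrays.asList("a", "b")));
    }
}
```

In a columnar file, a reader given only the projected schema can skip the data of every other column, which is where the savings come from.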
   
   In the other formats, a user would need to supply the desired read schema 
manually. That is probably quite rare, but I'd keep the option, as it costs just 
one small constructor per format.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
