luoyuxia commented on PR #21149:
URL: https://github.com/apache/flink/pull/21149#issuecomment-1341903988

   @saLeox The code changes look good to me. But after an offline discussion with @Tartarus0zm, I have some other concerns about this PR.
   > However, some fields may be missing only in some of the historical parquet files, while existing in the latest ones.
   
   In which cases will this happen? In FLINK-23715, we made `unknownFieldsIndices` a member of `ParquetVectorizedInputFormat` instead of `ParquetReader` by design. The background is that in a data warehouse we usually add columns, so it may well happen that some historical partitions miss some columns, and we make those partitions still readable. Also, since the normal flow is to add the column first and then write a **new** partition, we assume here that a single partition has a single schema; files with one schema should not be mixed with files of another schema in one partition.
   So, in this case, we chose the more conservative strategy of throwing an exception instead of reading missing columns as null values when reading a single partition.
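   To make the distinction concrete, below is a minimal, self-contained sketch of the two strategies. It is not Flink's actual code; the names `MissingColumnSketch`, `MissingColumnPolicy`, and `checkFileSchema` are hypothetical. It only illustrates recording the indices of projected columns that are absent from a file (so they can later be filled with nulls) versus failing fast as soon as one is missing:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Illustrative sketch only (not Flink's implementation): compares two ways of
 * handling projected columns that are absent from a parquet file's schema.
 */
public class MissingColumnSketch {

    enum MissingColumnPolicy {
        FILL_WITH_NULL,   // tolerant: track missing columns and read them as null
        THROW_EXCEPTION   // conservative: refuse to read the file at all
    }

    /**
     * Returns the indices of projected fields the file does not contain.
     * With FILL_WITH_NULL those indices would later be populated with nulls;
     * with THROW_EXCEPTION any missing field aborts the read.
     */
    static Set<Integer> checkFileSchema(
            List<String> projectedFields,
            Set<String> fileFields,
            MissingColumnPolicy policy) {
        Set<Integer> unknownFieldsIndices = new HashSet<>();
        for (int i = 0; i < projectedFields.size(); i++) {
            if (!fileFields.contains(projectedFields.get(i))) {
                if (policy == MissingColumnPolicy.THROW_EXCEPTION) {
                    throw new IllegalStateException(
                            "Column '" + projectedFields.get(i)
                                    + "' is missing in the parquet file.");
                }
                unknownFieldsIndices.add(i);
            }
        }
        return unknownFieldsIndices;
    }

    public static void main(String[] args) {
        List<String> projected = List.of("id", "name", "new_col");
        Set<String> oldFile = Set.of("id", "name"); // historical file without new_col

        // Tolerant strategy: new_col is reported as unknown and read as null.
        System.out.println(
                checkFileSchema(projected, oldFile, MissingColumnPolicy.FILL_WITH_NULL));

        // Conservative strategy: the same file makes the read fail fast.
        try {
            checkFileSchema(projected, oldFile, MissingColumnPolicy.THROW_EXCEPTION);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```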
   
   Personally, I think the changes are fine, and the case you describe can happen. But I'm wondering whether this is the normal case in your production environment, and whether mixing files with different schemas in one single partition is a best practice.
   If so, I agree it should be merged into Flink's codebase.
   