luoyuxia commented on PR #21149: URL: https://github.com/apache/flink/pull/21149#issuecomment-1341903988

@saLeox The code changes look good to me. But after an offline discussion with @Tartarus0zm, I have some other concerns about this PR.

> However, some fields may only be nonexistent in some of the historical parquet files, while they exist in the latest ones.

In which case will that happen? In FLINK-23715, we made `unknownFieldsIndices` a member of `ParquetVectorizedInputFormat` instead of `ParquetReader` by design. The background is that in a data warehouse we usually add columns over time, so it may well happen that some historical partitions miss some columns, and we make those partitions still readable. Also, since the normal case is that we first add the column and then write a **new** partition, we assume here that a single partition has a single schema: files with one schema should not be mixed with files with another schema in the same partition. So we chose the more conservative strategy of throwing an exception, instead of reading the missing columns as null values, when reading a single partition.

Personally, I think the changes are fine, and the case you describe can happen. But I'm wondering whether it is a normal case in your production environment, and whether mixing files with different schemas in one single partition is a best practice. If so, I agree it should be merged into Flink's codebase.
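To make the trade-off concrete, here is a minimal, self-contained sketch of the two strategies being discussed (throw vs. fill with null for columns absent from a file's schema). The class and method names are illustrative only, not Flink's actual API; Flink's real reader works on `ParquetVectorizedInputFormat` and vectorized column batches, not per-row maps.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a table schema with a newly added column, and a reader
// that either throws on a missing column (the conservative, single-schema-per-
// partition strategy) or fills it with null (the behavior this PR proposes).
public class SchemaEvolutionSketch {

    static final List<String> TABLE_SCHEMA = List.of("id", "name", "added_col");

    // Projects a row onto the table schema; columns absent from the file's
    // schema are either rejected or returned as null, depending on `lenient`.
    static Object[] readRow(List<String> fileSchema, Map<String, Object> row, boolean lenient) {
        Object[] out = new Object[TABLE_SCHEMA.size()];
        for (int i = 0; i < TABLE_SCHEMA.size(); i++) {
            String col = TABLE_SCHEMA.get(i);
            if (fileSchema.contains(col)) {
                out[i] = row.get(col);
            } else if (lenient) {
                out[i] = null; // historical file written before the column existed
            } else {
                throw new IllegalStateException("Column " + col + " not found in file schema");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A historical file written before `added_col` was added to the table.
        List<String> oldFileSchema = List.of("id", "name");
        Object[] row = readRow(oldFileSchema, Map.of("id", 1, "name", "a"), true);
        System.out.println(Arrays.toString(row)); // [1, a, null]
    }
}
```

With `lenient = false`, the same call throws, which is the conservative behavior described above for a partition that is expected to contain only one schema.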
@saLeox The code changes looks good to me. But after an offline discussion with @Tartarus0zm, I have some other conerns about this pr. > However, some fields may only be nonexistent in some of the historical parquet files, while exist in latest ones. In which case will it happen? In FLINK-23715, we make `unknownFieldsIndices` a memeber in `ParquetVectorizedInputFormat` instead of `ParquetReader ` by design. The basic background is that we usually add column in dataware house, so it may well happen that some historical partitions miss some columns, then we make it can be still read. And also, considering the normal case is that we first add column and write a **new** partition, so in here, we assume single one partition will have single one schema. There shouldn't mix some files with one shema and some files with another schema in one partition. So, in here, we choose a more conservative strategy which will throw exception instead of reading miss columns as null value when reading one single partition. Personally, I think the changes are fine, and the case you descibe will happen. But I'm wondering is it the normal case in your production env and is it the best practice that mix different schema files in one single partition. If so, I aggree it should be merged to Flink's codebase. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org