[jira] [Commented] (FLINK-29527) Make unknownFieldsIndices work for single ParquetReader
    [ https://issues.apache.org/jira/browse/FLINK-29527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630132#comment-17630132 ]

Sun Shun commented on FLINK-29527:
----------------------------------

[~lirui] Could you please take a look at this PR when you are free? Thanks.

> Make unknownFieldsIndices work for single ParquetReader
> -------------------------------------------------------
>
>                 Key: FLINK-29527
>                 URL: https://issues.apache.org/jira/browse/FLINK-29527
>             Project: Flink
>          Issue Type: Bug
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.16.0
>            Reporter: Sun Shun
>            Assignee: Sun Shun
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, since the improvement FLINK-23715, Flink uses a collection named
> `unknownFieldsIndices` to track nonexistent fields. It is kept inside
> the `ParquetVectorizedInputFormat` and applied to all Parquet files under
> the given path.
> However, some fields may be missing only from some of the historical
> Parquet files while existing in the latest ones. Based on
> `unknownFieldsIndices`, Flink will always skip these fields, even though
> they exist in the later Parquet files.
> As a result, the values of these fields become empty whenever the fields are
> missing from some historical Parquet files.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (FLINK-29527) Make unknownFieldsIndices work for single ParquetReader
    [ https://issues.apache.org/jira/browse/FLINK-29527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617444#comment-17617444 ]

Rui Li commented on FLINK-29527:
--------------------------------

[~suns] Assigned. Thanks for taking the issue
[jira] [Commented] (FLINK-29527) Make unknownFieldsIndices work for single ParquetReader
    [ https://issues.apache.org/jira/browse/FLINK-29527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613400#comment-17613400 ]

Sun Shun commented on FLINK-29527:
----------------------------------

Please assign this issue to me, thanks. I have already fixed it.
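
The behaviour described in the issue can be illustrated with a small, self-contained Java sketch. It is a simplified model only and does not use Flink or Parquet classes: `UnknownFieldsSketch`, `file` and `fieldsRead` are hypothetical names, and a "file" is reduced to the set of column names it contains. The sketch assumes that a field whose index is recorded as unknown is skipped, and compares one set of unknown indices shared across all files with a fresh set per file (per reader).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class UnknownFieldsSketch {

    /** Hypothetical stand-in for a Parquet file: only the column names it contains. */
    static Set<String> file(String... columns) {
        return new HashSet<>(Arrays.asList(columns));
    }

    /**
     * Returns, for each file, the requested fields that would actually be read.
     * With shareAcrossFiles = true, one set of unknown indices is reused for every
     * file (the situation described in the issue). With false, each file starts
     * with an empty set (tracking per reader).
     */
    static List<List<String>> fieldsRead(
            List<String> requested, List<Set<String>> files, boolean shareAcrossFiles) {
        Set<Integer> shared = new HashSet<>();
        List<List<String>> result = new ArrayList<>();
        for (Set<String> columns : files) {
            Set<Integer> unknown = shareAcrossFiles ? shared : new HashSet<>();
            // Record the index of every requested field that this file lacks.
            for (int i = 0; i < requested.size(); i++) {
                if (!columns.contains(requested.get(i))) {
                    unknown.add(i);
                }
            }
            // Read only the fields whose index is not marked as unknown.
            List<String> read = new ArrayList<>();
            for (int i = 0; i < requested.size(); i++) {
                if (!unknown.contains(i)) {
                    read.add(requested.get(i));
                }
            }
            result.add(read);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> requested = Arrays.asList("id", "name", "email");
        // An older file without "email", followed by a newer file that has it.
        List<Set<String>> files =
                Arrays.asList(file("id", "name"), file("id", "name", "email"));
        System.out.println("shared:     " + fieldsRead(requested, files, true));
        System.out.println("per reader: " + fieldsRead(requested, files, false));
    }
}
```

In this model the shared set causes `email` to be skipped for the second file even though that file contains it, while the per-file set reads `email` from the second file as expected.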