flyrain commented on PR #6026: URL: https://github.com/apache/iceberg/pull/6026#issuecomment-1287876977
> @flyrain are you asking if there are other cases where _pos will still be absent after this fix and we need ReadConf#startRowPositions() to return a valid startRowPositions? Yes, I have concerns for this case. But I guess it is fine since `VectorizedArrowReader::setRowGroupInfo()` doesn't consume the `rowPosition`, which means it still works fine even `rowPosition` is off. The problem happens only if we want to rely on the row positions. https://github.com/apache/iceberg/blob/5688d59f6b8ea4784de1f45c9386e13618803673/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java#L396 By digging a bit more, `ReadConf.columnChunkMetadataForRowGroups` keeps the row group metadata. Wondering if we can calculate the row positions while generating the `columnChunkMetadataForRowGroups`, so that the row position will always be correct and we don't have to read the metadata twice. What do you think, @chenjunjiedada? I guess that's what you mentioned to avoid optimization. https://github.com/apache/iceberg/blob/5688d59f6b8ea4784de1f45c9386e13618803673/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L237-L237 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
