flyrain commented on PR #6026:
URL: https://github.com/apache/iceberg/pull/6026#issuecomment-1287876977

   > @flyrain are you asking if there are other cases where _pos will still be 
absent after this fix and we need ReadConf#startRowPositions() to return a 
valid startRowPositions?
   
   Yes, I have concerns for this case. But I guess it is fine since 
`VectorizedArrowReader::setRowGroupInfo()` doesn't consume the `rowPosition`, 
which means it still works fine even `rowPosition` is off. The problem happens 
only if we want to rely on the row positions. 
https://github.com/apache/iceberg/blob/5688d59f6b8ea4784de1f45c9386e13618803673/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java#L396
   
   By digging a bit more, `ReadConf.columnChunkMetadataForRowGroups` keeps the 
row group metadata. Wondering if we can calculate the row positions while 
generating the `columnChunkMetadataForRowGroups`, so that the row position will 
always be correct and we don't have to read the metadata twice. What do you 
think, @chenjunjiedada? I guess that's what you mentioned to avoid 
optimization. 
https://github.com/apache/iceberg/blob/5688d59f6b8ea4784de1f45c9386e13618803673/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L237-L237
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to