sunchao commented on pull request #31998: URL: https://github.com/apache/spark/pull/31998#issuecomment-850570200
@lxian I think the extra cost is just incrementing two indexes at the same time, so it should be fairly cheap. You can also refer to how [SynchronizingColumnReader](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/SynchronizingColumnReader.java#L89) does that. Porting that logic to Spark is a bit tricky though, especially when it comes to handling the RLE-encoded definition levels. Let me try experimenting with this idea on my side too.
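
A minimal sketch of one way to read the "two indexes" idea, not the actual Spark or parquet-mr code. It assumes a flat optional column whose definition levels have already been decoded into an array (so it sidesteps the RLE handling mentioned above), and a single inclusive row range. The names `SyncReadSketch`, `readRows`, `rowIdx` and `valueIdx` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SyncReadSketch {
    // Returns the values for rows in [firstRow, lastRow] (inclusive); nulls are kept as null.
    // Assumption: one definition level per row (flat column), and `values` holds only
    // the non-null entries, in row order.
    static List<Integer> readRows(int[] defLevels, int[] values, int maxDefLevel,
                                  long firstRow, long lastRow) {
        List<Integer> out = new ArrayList<>();
        int valueIdx = 0; // position in the packed non-null values
        for (int rowIdx = 0; rowIdx < defLevels.length; rowIdx++) {
            boolean isNull = defLevels[rowIdx] < maxDefLevel;
            boolean selected = rowIdx >= firstRow && rowIdx <= lastRow;
            if (selected) {
                if (isNull) {
                    out.add(null);
                } else {
                    out.add(values[valueIdx]);
                }
            }
            // The value index advances for every non-null row, selected or not,
            // so skipped rows still consume their value.
            if (!isNull) {
                valueIdx++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] defLevels = {1, 0, 1, 1, 0, 1}; // rows 1 and 4 are null
        int[] values = {10, 20, 30, 40};      // packed non-null values
        System.out.println(readRows(defLevels, values, 1, 2, 4));
    }
}
```

The per-row overhead here is one comparison against the row range plus the two increments, which is the reason to expect the cost to be small. A real port would have to do the same bookkeeping across RLE runs of definition levels instead of one level at a time.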