yordan-pavlov edited a comment on issue #171: URL: https://github.com/apache/arrow-rs/issues/171#issuecomment-991955458
@tustvold this latest approach you describe will probably work in many cases, but there is usually a reason for partial dictionary encoding in parquet files - my understanding is that the usual reason is that the dictionary grew too big. Having to reconstruct a big dictionary from plain-encoded parquet data sounds expensive, and I suspect it would result in suboptimal performance and increased memory use. If a mix of dictionary-encoded and plain-encoded pages is just how parquet works, then isn't this something that has to be abstracted / hidden?

Furthermore, if we take the DataFusion parquet reader as an example, here https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/file_format/parquet.rs#L436 we can see that it does not care about the number of rows in the record batches as long as the record batch iterator does not return `None`.

Finally, how would users know that they have to change `batch_size` depending on the use of dictionary encoding in their parquet files (so that record batches do not span row groups)?
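To illustrate the point about the DataFusion reader: a minimal sketch (plain Rust, no parquet dependency; `Batch` is a hypothetical stand-in for arrow's `RecordBatch`) of the consumption pattern at the linked line - the loop just drains the iterator until it yields `None`, without ever inspecting how many rows each batch holds:

```rust
// Hypothetical stand-in for arrow::record_batch::RecordBatch.
struct Batch {
    num_rows: usize,
}

// Mirrors the consumption pattern in DataFusion's parquet.rs: accept
// batches of any size (e.g. short batches at row-group boundaries)
// and stop only when the iterator returns None.
fn consume_all<I: Iterator<Item = Batch>>(batches: I) -> usize {
    let mut total_rows = 0;
    for batch in batches {
        total_rows += batch.num_rows;
    }
    total_rows
}

fn main() {
    // Simulate uneven batch sizes, as produced when record batches
    // are not allowed to span row groups.
    let batches = vec![
        Batch { num_rows: 1024 },
        Batch { num_rows: 512 },
        Batch { num_rows: 7 },
    ];
    let total = consume_all(batches.into_iter());
    println!("{}", total);
}
```

So from the consumer's side, uneven batch sizes are already tolerated; the question is only whether the producer should be forced to align batches with row-group boundaries.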
