yordan-pavlov edited a comment on issue #171: URL: https://github.com/apache/arrow-rs/issues/171#issuecomment-991955458
@tustvold this latest approach you describe will probably work in many cases, but there is usually a reason for partial dictionary encoding in parquet files - my understanding is that the usual reason is that the dictionary grew too big. Having to reconstruct a big dictionary from plain-encoded parquet data sounds expensive, and I suspect it would result in suboptimal performance and increased memory use. If a mix of dictionary-encoded and plain-encoded pages is just how parquet works, then isn't this something that has to be abstracted / hidden?

Furthermore, if we take the DataFusion parquet reader as an example, here https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/file_format/parquet.rs#L436 we can see that it does not care about the number of rows in the record batches as long as the record batch iterator does not return `None`.

Finally, how would users know that they have to change `batch_size` depending on the use of dictionary encoding in their parquet files (so that record batches do not span row groups)?
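To illustrate the point about the DataFusion reader: a minimal sketch (plain Rust, no parquet dependency; `Batch` is a hypothetical stand-in for arrow's `RecordBatch`) of the consumption pattern at the linked line - the loop just drains the iterator until it yields `None`, without ever inspecting how many rows each batch holds:

```rust
// Hypothetical stand-in for arrow::record_batch::RecordBatch.
struct Batch {
    num_rows: usize,
}

// Mirrors the consumption pattern in DataFusion's parquet.rs: accept
// batches of any size (e.g. short batches at row-group boundaries)
// and stop only when the iterator returns None.
fn consume_all<I: Iterator<Item = Batch>>(batches: I) -> usize {
    let mut total_rows = 0;
    for batch in batches {
        total_rows += batch.num_rows;
    }
    total_rows
}

fn main() {
    // Simulate uneven batch sizes, as produced when record batches
    // are not allowed to span row groups.
    let batches = vec![
        Batch { num_rows: 1024 },
        Batch { num_rows: 512 },
        Batch { num_rows: 7 },
    ];
    let total = consume_all(batches.into_iter());
    println!("{}", total);
}
```

So from the consumer's side, uneven batch sizes are already tolerated; the question is only whether the producer should be forced to align batches with row-group boundaries.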
