[ https://issues.apache.org/jira/browse/ARROW-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rok Mihevc updated ARROW-5129:
------------------------------
    External issue URL: https://github.com/apache/arrow/issues/21613

> [Rust][Parquet] Column writer bug: check dictionary encoder when adding a new data page
> ---------------------------------------------------------------------------------------
>
>                 Key: ARROW-5129
>                 URL: https://issues.apache.org/jira/browse/ARROW-5129
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust
>        Environment: N/A
>            Reporter: Ivan Sadikov
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> As part of my weekly routine, I glanced over the code in the Parquet column writer and found that the way we check when to add a new data page is buggy. The idea is to check the current encoder and decide whether we have written enough bytes to construct a page. The problem is that we only check the value encoder, regardless of whether the dictionary encoder is enabled.
> Here is how we do it now: the actual check (https://github.com/apache/arrow/blob/master/rust/parquet/src/column/writer.rs#L378) and the buggy function (https://github.com/apache/arrow/blob/master/rust/parquet/src/column/writer.rs#L423).
> In the case of a sparse column with the dictionary encoder enabled, we would write a single data page even though we would have accumulated enough bytes for more than one page in the encoder (the value encoder will be empty, so its size will always be below the page size limit).
> I forgot that parquet-cpp has `current_encoder` as either the value encoder or the dictionary encoder (https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_writer.cc#L544), but in parquet-rs we keep them separate.
> So the fix could be something like this:
> {code}
> /// Returns true if there is enough data for a data page, false otherwise.
> #[inline]
> fn should_add_data_page(&self) -> bool {
>     match self.dict_encoder {
>         Some(ref encoder) => {
>             encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
>         },
>         None => {
>             self.encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
>         }
>     }
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
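To make the effect of the proposed fix concrete, here is a minimal, self-contained sketch. It does not use the real parquet-rs types: `MockEncoder`, `MockColumnWriter`, the inlined `data_pagesize_limit` field, and the byte counts are hypothetical stand-ins for the `ColumnWriter` internals, kept only to contrast the buggy check against the fixed one.

```rust
// Stand-in for an encoder that buffers some number of encoded bytes.
struct MockEncoder {
    buffered_bytes: usize,
}

impl MockEncoder {
    fn estimated_data_encoded_size(&self) -> usize {
        self.buffered_bytes
    }
}

// Stand-in for the column writer: like parquet-rs, it keeps the value
// encoder and the optional dictionary encoder as separate fields.
struct MockColumnWriter {
    dict_encoder: Option<MockEncoder>,
    encoder: MockEncoder,
    data_pagesize_limit: usize,
}

impl MockColumnWriter {
    /// Buggy version: only ever consults the value encoder, which stays
    /// near-empty while dictionary encoding is active.
    fn should_add_data_page_buggy(&self) -> bool {
        self.encoder.estimated_data_encoded_size() >= self.data_pagesize_limit
    }

    /// Fixed version: consult whichever encoder is actually buffering data.
    fn should_add_data_page(&self) -> bool {
        match self.dict_encoder {
            Some(ref encoder) => {
                encoder.estimated_data_encoded_size() >= self.data_pagesize_limit
            }
            None => {
                self.encoder.estimated_data_encoded_size() >= self.data_pagesize_limit
            }
        }
    }
}

fn main() {
    // Dictionary encoding active: 2 MiB buffered in the dictionary encoder,
    // value encoder empty, 1 MiB page size limit.
    let writer = MockColumnWriter {
        dict_encoder: Some(MockEncoder { buffered_bytes: 2 * 1024 * 1024 }),
        encoder: MockEncoder { buffered_bytes: 0 },
        data_pagesize_limit: 1024 * 1024,
    };
    // The buggy check never fires, so all values land in a single data page.
    assert!(!writer.should_add_data_page_buggy());
    // The fixed check sees the dictionary encoder's buffer and fires.
    assert!(writer.should_add_data_page());
    println!("ok");
}
```

With no dictionary encoder (`dict_encoder: None`) the two versions agree, which is why the bug only shows up on dictionary-encoded columns.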