tustvold edited a comment on issue #171:
URL: https://github.com/apache/arrow-rs/issues/171#issuecomment-988120175


   Spent some time digging into this and think I have a plan of action. 
   
   I am off for the next day or so but will be looking to make a start on my 
return
   
   **Background**
   
   The lowest-level parquet API consists of:
   
   * `FileReader` provides the ability to obtain metadata and `RowGroupReader` 
for each row group
   * `RowGroupReader` provides the ability to get a `PageReader` for the pages 
within a single column chunk of a row group
   * `PageIterator` provides the ability to get `PageReader` for a specific 
column across a selection of row groups
   
   The next level up is `ColumnReader`:
   
   * An enumeration containing `ColumnReaderImpl<T>` for each of the 
physical parquet types
   * Created by `RowGroupReader::get_column_reader`
   * Provides the ability to read slices of data from a single column chunk
   * Has a stateful `read_batch` that populates a physically typed slice (i.e. 
integer, float, byte array)
   * Reads `batch_size` levels unless exhausted
   * Only reads number of values corresponding to definition levels read
   * Uses `Decoder` implementations to decode from the raw byte data to this 
slice
   * On encountering a dictionary page, will register a specialised 
`DictDecoder` with this dictionary
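
   As a rough illustration of the dictionary mechanism (a self-contained toy, 
not the crate's actual `DictDecoder`): the dictionary page supplies the 
distinct values, and subsequent data pages carry indices into it.

```rust
// Toy model of parquet's dictionary encoding (illustrative only, not the
// crate's `DictDecoder`): a dictionary page supplies the distinct values,
// and data pages carry RLE/bit-packed indices into it.
fn decode_dictionary_page(raw: &[&str]) -> Vec<String> {
    raw.iter().map(|s| s.to_string()).collect()
}

fn decode_data_page(dictionary: &[String], indices: &[usize]) -> Vec<String> {
    // Materialise each index into its dictionary value.
    indices.iter().map(|&i| dictionary[i].clone()).collect()
}

fn main() {
    let dict = decode_dictionary_page(&["apple", "banana"]);
    let values = decode_data_page(&dict, &[0, 1, 1, 0]);
    assert_eq!(values, ["apple", "banana", "banana", "apple"]);
}
```

   Preserving the dictionary means skipping the materialisation step in 
`decode_data_page` and keeping the indices as-is.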
   
   The next level is `RecordReader<T>`
   
   * Wraps a `ColumnReaderImpl<T>`
   * Seems to exist to handle repetition levels, in particular to avoid 
splitting repeated fields across reads, but appears to be incomplete
   * Computes a null bitmask and pads NULLs in the values array
   * Could avoid having to pad the values array if `ColumnReaderImpl` used 
`Decoder::get_spaced`
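
   A sketch of what that padding step does (hypothetical names, not the 
crate's actual code): packed non-null values plus definition levels are 
expanded into a "spaced" array with a validity bitmask, which is the extra 
pass that a `get_spaced`-style decoder could avoid.

```rust
// Toy illustration of "spaced" decoding (hypothetical names, not the crate's
// API). Packed values and definition levels come out of the decoder; NULLs
// are then padded into position alongside a validity bitmask.
fn pad_nulls(packed: &[i32], def_levels: &[i16], max_def: i16) -> (Vec<i32>, Vec<bool>) {
    let mut spaced = Vec::with_capacity(def_levels.len());
    let mut validity = Vec::with_capacity(def_levels.len());
    let mut next = packed.iter();
    for &d in def_levels {
        if d == max_def {
            // Fully defined: take the next packed value.
            spaced.push(*next.next().expect("fewer values than non-null levels"));
            validity.push(true);
        } else {
            // NULL slot: arbitrary fill value, marked invalid in the bitmask.
            spaced.push(0);
            validity.push(false);
        }
    }
    (spaced, validity)
}

fn main() {
    let (values, validity) = pad_nulls(&[1, 2], &[1, 0, 1], 1);
    assert_eq!(values, [1, 0, 2]);
    assert_eq!(validity, [true, false, true]);
}
```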
   
   Next up we have the `ArrayReader` implementations:
   
   * `ArrayReader::next_batch` returns `Result<ArrayRef>`
   * Implementations are selected by `TypeVisitor` on the parquet schema
   * Each `ArrayReader` is provided with a `FilePageIterator`
   * `StructArrayReader` assumes that `ArrayReader::next_batch` will return 
arrays with the same number of rows for all children
   * `ArrayReader`s are agnostic to row group boundaries
   
   Finally we have `ArrowReader`, the only implementation of which is 
`ParquetFileArrowReader`:
   
   * `ParquetFileArrowReader` is created from an `Arc<dyn FileReader>`
   * `ArrowReader::get_record_reader` and 
`ArrowReader::get_record_reader_by_columns` return `ParquetRecordBatchReader`
   
   `ParquetRecordBatchReader` provides the ability to stream `RecordBatch` from 
a parquet file:
   
   * Wraps a `StructArrayReader` created with the root schema of the parquet 
file
   * Converts the `ArrayReader` to `Iterator<Item=ArrowResult<RecordBatch>>`
   * `ArrayReader::next_batch` returning an empty `RecordBatch` is used as the 
termination condition
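
   A simplified model of that termination condition (a toy, not the actual 
`ParquetRecordBatchReader`): `next_batch` is polled until it yields zero rows, 
at which point the iterator ends.

```rust
// Simplified model of the empty-batch termination condition (not the actual
// `ParquetRecordBatchReader`): row counts stand in for `RecordBatch`es.
struct ToyArrayReader { remaining: usize, batch_size: usize }

impl ToyArrayReader {
    // Stands in for `ArrayReader::next_batch`, returning a row count.
    fn next_batch(&mut self) -> usize {
        let n = self.remaining.min(self.batch_size);
        self.remaining -= n;
        n
    }
}

struct ToyRecordBatchReader { inner: ToyArrayReader }

impl Iterator for ToyRecordBatchReader {
    type Item = usize; // a row count standing in for a `RecordBatch`
    fn next(&mut self) -> Option<usize> {
        match self.inner.next_batch() {
            0 => None, // an empty batch terminates the stream
            n => Some(n),
        }
    }
}

fn main() {
    let reader = ToyRecordBatchReader {
        inner: ToyArrayReader { remaining: 5, batch_size: 2 },
    };
    let batches: Vec<_> = reader.collect();
    // 5 rows in batches of 2 yields 2, 2, then a final partial batch of 1.
    assert_eq!(batches, [2, 2, 1]);
}
```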
   
   There is an additional `ArrowArrayReader`, `ArrayConverter`, `ValueDecoder` 
implementation added in #384:
   
   * This was originally intended to replace `PrimitiveArrayReader` and 
`ComplexObjectArrayReader` but this appears to have stalled
   * Only used for StringArray as far as I can tell
   * Since then `MapArrayReader` has been added (#491)
   
   Given this, I think it is acceptable to add a new `ArrayReader` 
implementation and avoid this complexity
   
   **Problems**
   
   * RLE dictionaries are per-column chunk, a demarcation the `ArrayReader` API 
does not currently respect
   * A ColumnChunk may contain mixed encodings
   * There isn't an efficient way to determine if all pages in a ColumnChunk 
have a given encoding
   * `ColumnReaderImpl` hides encoding as an implementation detail
   
   **Proposal**
   
   Add two settings to `ParquetRecordBatchReader` (or higher) with defaults of 
`false`:
   
   * `delimit_row_groups` - don't yield RecordBatches spanning row group 
boundaries
   * `preserve_dictionaries` - preserve dictionary encoding for dictionary 
columns
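
   One way the two settings might surface (a hypothetical sketch using the 
names from this proposal, not an actual API):

```rust
// Hypothetical builder-style surface for the two proposed opt-in settings.
// Names come from the proposal above; this is a sketch, not the crate's API.
#[derive(Default)]
struct ReadOptions {
    delimit_row_groups: bool,     // don't yield batches spanning row groups
    preserve_dictionaries: bool,  // keep dictionary encoding for dict columns
}

impl ReadOptions {
    fn delimit_row_groups(mut self, v: bool) -> Self {
        self.delimit_row_groups = v;
        self
    }
    fn preserve_dictionaries(mut self, v: bool) -> Self {
        self.preserve_dictionaries = v;
        self
    }
}

fn main() {
    // Both default to false, preserving existing behaviour for clients.
    let defaults = ReadOptions::default();
    assert!(!defaults.delimit_row_groups && !defaults.preserve_dictionaries);

    let opts = ReadOptions::default()
        .delimit_row_groups(true)
        .preserve_dictionaries(true);
    assert!(opts.delimit_row_groups && opts.preserve_dictionaries);
}
```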
   
   When reading, if `delimit_row_groups` is set, each `ArrayReader` will be 
given a `PageIterator` for a single column chunk (i.e. a `PageReader`)
   
   Within `build_for_primitive_type_inner`, if the arrow type is a dictionary of 
a primitive type, `preserve_dictionaries` is enabled, and the column chunk has a 
dictionary page:
   
   * If `delimit_row_groups` is not set return an error
   * Return a `DictionaryArrayReader` as the `ArrayReader`
   
   This `DictionaryArrayReader` will return an error if it encounters a 
non-dictionary-encoded page.
   
   Otherwise it will produce `DictionaryArray`s, preserving the RLE dictionary. It 
will likely need to duplicate some logic from `ColumnReaderImpl` to achieve this
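
   The selection logic described above can be sketched roughly as follows 
(hypothetical function and flags; the real check would live in 
`build_for_primitive_type_inner`):

```rust
// Sketch of the proposed reader-selection logic. All names here are
// hypothetical stand-ins for the conditions described in the proposal.
#[derive(Debug, PartialEq)]
enum ReaderChoice {
    Dictionary, // the proposed `DictionaryArrayReader`
    Plain,      // the existing reader path
}

fn choose_reader(
    is_dictionary_type: bool,    // arrow type is a dictionary of a primitive
    preserve_dictionaries: bool, // proposed opt-in flag
    has_dictionary_page: bool,   // column chunk starts with a dictionary page
    delimit_row_groups: bool,    // proposed opt-in flag
) -> Result<ReaderChoice, String> {
    if is_dictionary_type && preserve_dictionaries && has_dictionary_page {
        if !delimit_row_groups {
            // Dictionaries are per column chunk, so batches must not span
            // row group boundaries when preserving them.
            return Err("preserve_dictionaries requires delimit_row_groups".to_string());
        }
        return Ok(ReaderChoice::Dictionary);
    }
    Ok(ReaderChoice::Plain)
}

fn main() {
    assert_eq!(choose_reader(true, true, true, true), Ok(ReaderChoice::Dictionary));
    assert!(choose_reader(true, true, true, false).is_err());
    assert_eq!(choose_reader(true, false, true, true), Ok(ReaderChoice::Plain));
}
```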
   
   I think this will avoid making any breaking changes to clients, whilst 
allowing them to opt in to better behaviour when they know that their parquet 
files are such that this optimisation can be performed.

