Hi Matthieu,

This is the current intended behavior. All dictionary arrays are read back
with int32 indices [1]. I'm not sure whether that is hard-coded because int32
is the most efficient width for reading the indices, or because it wasn't
considered worth implementing additional decode paths for other bit widths.

So your two options right now are to either always use 32-bit indices (if
that is an option for you) or to write a function that casts the dictionary
columns back to their original index bit width after reading. To do the
latter, access the indices with DictionaryArray::indices(), cast them, and
then reconstruct the column with DictionaryArray::FromArrays(). To get the
original schema, I think you might have to parse it yourself: it is stored
under the key ARROW:schema in the file metadata and is base64-encoded [2].
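
For illustration, here's a rough, untested sketch of what that cast-back
helper could look like for a single dictionary array. NarrowDictionaryIndices
and target_index_type are made-up names; in practice you'd loop over the
chunks of each dictionary column and take the target index type from the
decoded ARROW:schema:

#include <memory>
#include <arrow/api.h>
#include <arrow/compute/api.h>

// Rebuild a dictionary array with narrower indices (e.g. int8) after the
// Parquet reader has widened them to int32. Assumes the dictionary is small
// enough that the cast to target_index_type is safe.
arrow::Result<std::shared_ptr<arrow::Array>> NarrowDictionaryIndices(
    const std::shared_ptr<arrow::Array>& array,
    const std::shared_ptr<arrow::DataType>& target_index_type) {
  auto dict_array = std::static_pointer_cast<arrow::DictionaryArray>(array);

  // Cast the int32 indices down to the original bit width.
  ARROW_ASSIGN_OR_RAISE(
      auto narrow_indices,
      arrow::compute::Cast(*dict_array->indices(), target_index_type));

  // Reassemble the column from the casted indices and the original
  // dictionary values.
  return arrow::DictionaryArray::FromArrays(
      arrow::dictionary(target_index_type, dict_array->dictionary()->type()),
      narrow_indices, dict_array->dictionary());
}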

Best,

Will Jones

[1]
https://github.com/apache/arrow/blob/aac30873fd9b112d236a7359fcc8a60f04ebfead/cpp/src/parquet/arrow/reader_internal.cc#L478
[2] https://arrow.apache.org/docs/cpp/parquet.html#serialization-details

On Wed, Dec 28, 2022 at 12:51 AM Matthieu Bolt <[email protected]> wrote:

> Hi,
>
> I have encountered a problem while saving an Apache Arrow table as a
> Parquet file and reading the table back again. The dictionary types of the
> table read from the Parquet file are not the same as the dictionary types
> of the written table; more specifically, all index types come back as
> int32. For example, a column that is stored with int8 indices is read in
> again with int32 indices:
> storing: column name: dictionary<values=string, indices=int8, ordered=0>
> reading: column name: dictionary<values=string, indices=int32, ordered=0>
> I'm on Arrow version 8.0.0 on Windows 10.
>
> Please advise on how to correct/prevent this bug/feature so that the
> original dictionary types are preserved.
>
> Best regards,
>
> Matthieu
>
> Code used to write the parquet file:
> auto Table2Pqt(const std::shared_ptr<arrow::Table>& t,
>                const std::string& output_filepath) {
>   auto p = arrow::io::FileOutputStream::Open(output_filepath);
>   const auto st = parquet::arrow::WriteTable(
>       *t,
>       arrow::default_memory_pool(),
>       *p,
>       t->num_rows(),
>       parquet::default_writer_properties(),
>       parquet::ArrowWriterProperties::Builder().store_schema()->build());
> }
>
> Code used to read the parquet file:
> auto Pqt2Table(std::shared_ptr<arrow::Table>& t,
>                const std::string& pqt_filepath) {
>   arrow::fs::LocalFileSystem fs;
>   const auto input_file = fs.OpenInputFile(pqt_filepath);
>   const auto& input = input_file.ValueOrDie();
>   std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
>   parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &arrow_reader);
>   arrow_reader->ReadTable(&t);
> }
>
