lesterfan commented on issue #30302:
URL: https://github.com/apache/arrow/issues/30302#issuecomment-2700272620
I ran into this issue recently as well. In case it helps, I spent some time
making a minimal C++ repro, since fixing this would involve a C++ code change
(i.e. maybe this could be turned into a unit test). It took me a while to
realize that I needed to explicitly call `store_schema()` on the
`ArrowWriterProperties` builder, because schema storage defaults to `false`
in C++ but to `true` in Python.
```
#include <cstdio>
#include <memory>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

arrow::Status arrow_test() {
  arrow::MemoryPool* pool = arrow::default_memory_pool();

  // Build a dictionary-encoded string column: ["abc", "def", "abc"].
  std::shared_ptr<arrow::Array> values;
  arrow::DictionaryBuilder<arrow::StringType> dict_builder(pool);
  ARROW_RETURN_NOT_OK(dict_builder.Append("abc"));
  ARROW_RETURN_NOT_OK(dict_builder.Append("def"));
  ARROW_RETURN_NOT_OK(dict_builder.Append("abc"));
  ARROW_RETURN_NOT_OK(dict_builder.Finish(&values));

  auto dict_type = dict_builder.type();
  auto schema = arrow::schema({arrow::field("x", dict_type)});
  auto table = arrow::Table::Make(schema, {values});

  // NOTE: Replace with a real filepath
  const char* filepath = "/not/a/real/filepath";

  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  arrow::io::FileOutputStream::Open(filepath).ValueOrDie().swap(outfile);

  auto writer_props = parquet::WriterProperties::Builder().build();
  // NOTE: This is important; if store_schema is not set, the writer won't add
  // the "ARROW:schema": <base 64 encoded schema> metadata to the parquet file,
  // and the reader will read the column as a string rather than as a
  // dictionary.
  auto arrow_writer_props =
      parquet::ArrowWriterProperties::Builder().store_schema()->build();
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table, pool, outfile, 1024, writer_props, arrow_writer_props));

  // Read the file back and compare the schemas of the two tables.
  std::shared_ptr<arrow::io::ReadableFile> infile;
  arrow::io::ReadableFile::Open(filepath).ValueOrDie().swap(infile);
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(infile, pool, &reader));
  std::shared_ptr<arrow::Table> table_round_trip;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table_round_trip));

  printf("table: %s\n", table->ToString().c_str());
  printf("table_round_trip: %s\n", table_round_trip->ToString().c_str());
  return arrow::Status::OK();
}
```
This code gives the following output:
```
table: x: dictionary<values=string, indices=int8, ordered=0>
----
x:
[
-- dictionary:
[
"abc",
"def"
]
-- indices:
[
0,
1,
0
]
]
table_round_trip: x: dictionary<values=string, indices=int32, ordered=0>
----
x:
[
-- dictionary:
[
"abc",
"def"
]
-- indices:
[
0,
1,
0
]
]
ok
```
The expected behavior is for the `indices` of `table_round_trip` to be `int8`
rather than `int32`. (The `int8` comes from the `DictionaryBuilder`'s
`AdaptiveIntBuilder`, which picks the smallest index width that can
accommodate the dictionary indices.)
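
To make the "smallest index size" point concrete, here is a rough standalone sketch of the width selection. It is not Arrow's implementation: `SmallestIndexType` is a hypothetical helper I made up for illustration, and it only assumes that the narrowest signed integer type able to hold the largest index is chosen.

```
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical helper (NOT an Arrow API): returns the name of the narrowest
// signed integer type that can hold every index in [0, max_index].
std::string SmallestIndexType(std::int64_t max_index) {
  assert(max_index >= 0);  // dictionary indices are non-negative
  if (max_index <= std::numeric_limits<std::int8_t>::max()) return "int8";
  if (max_index <= std::numeric_limits<std::int16_t>::max()) return "int16";
  if (max_index <= std::numeric_limits<std::int32_t>::max()) return "int32";
  return "int64";
}
```

In the repro the dictionary is `{"abc", "def"}`, so the largest index is 1 and `SmallestIndexType(1)` gives `int8`, which matches the type of the original `table`.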