nevi-me opened a new pull request #270:
URL: https://github.com/apache/arrow-rs/pull/270
# Which issue does this PR close?
Closes #245.
# Rationale for this change
This addresses bugs in the Rust Parquet writer and reader, where we were:
* Not reading lists correctly depending on whether or not they have a null
bitmap. We were creating null bitmaps even when a list was non-null.
* Not incrementing definition levels correctly for some combinations of lists
and structs
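To make the level bookkeeping concrete, here is a minimal sketch (not the code in this PR) of how definition and repetition levels could be derived for a single list column under the standard three-level Parquet list layout. The function name `list_levels` and its parameters are hypothetical, and it uses only the standard library rather than the Arrow arrays the writer actually works with.

```rust
/// Hypothetical helper: computes (definition, repetition) levels for one
/// list-of-i32 column, assuming the three-level Parquet LIST layout.
/// `list_nullable` and `item_nullable` mirror the Arrow field nullability.
fn list_levels(
    rows: &[Option<Vec<Option<i32>>>],
    list_nullable: bool,
    item_nullable: bool,
) -> (Vec<i16>, Vec<i16>) {
    // A nullable list reserves level 0 for "the list itself is null",
    // so an empty list is defined one level deeper.
    let empty_def = list_nullable as i16;
    // The repeated group always adds one level; a null item stops there.
    let null_item_def = empty_def + 1;
    // A nullable item adds one more level for "value is present".
    let value_def = empty_def + 1 + item_nullable as i16;

    let mut defs = Vec::new();
    let mut reps = Vec::new();
    for row in rows {
        match row {
            None => {
                assert!(list_nullable, "null list in a non-nullable column");
                defs.push(0);
                reps.push(0);
            }
            Some(items) if items.is_empty() => {
                defs.push(empty_def);
                reps.push(0);
            }
            Some(items) => {
                for (i, item) in items.iter().enumerate() {
                    // Repetition 0 starts a new row; 1 continues the list.
                    reps.push(if i == 0 { 0 } else { 1 });
                    match item {
                        Some(_) => defs.push(value_def),
                        None => {
                            assert!(item_nullable, "null item in a non-nullable list");
                            defs.push(null_item_def);
                        }
                    }
                }
            }
        }
    }
    (defs, reps)
}
```

Note how the same data yields different definition levels once the list is declared non-nullable: every level shifts down by one, which is the kind of case that is easy to get wrong if nullability is ignored.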
# What changes are included in this PR?
This PR:
* Fixes the reader so that roundtrip tests pass under conditions that
were previously failing (mostly when lists are set as non-nullable).
* Combines a few loose variables into a `LevelType` enum that carries enough
information about the Arrow types when calculating levels. This is a lighter
solution than passing Arrow fields around when computing levels, and could
allow us to reuse the levels logic elsewhere in the codebase.
* Enables the nullability conditions that were previously failing
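As a rough illustration of the idea behind `LevelType`, the sketch below shows how an enum that records only the kind of node and its nullability is enough to compute maximum levels along a column path. The variants, the method names, and `max_levels` are assumptions made for this example; they are not necessarily the definitions used in the PR.

```rust
/// Illustrative only: the variant set and payloads are assumed.
/// Each bool records whether the corresponding Arrow field is nullable.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LevelType {
    Root,
    List(bool),
    Struct(bool),
    Primitive(bool),
}

impl LevelType {
    /// How much this node adds to the maximum definition level.
    fn definition_increment(self) -> i16 {
        match self {
            LevelType::Root => 0,
            // The repeated group always adds 1; nullability adds another.
            LevelType::List(nullable) => 1 + nullable as i16,
            LevelType::Struct(nullable) | LevelType::Primitive(nullable) => nullable as i16,
        }
    }

    /// Only lists add a repetition level.
    fn repetition_increment(self) -> i16 {
        matches!(self, LevelType::List(_)) as i16
    }
}

/// Hypothetical helper: (max definition, max repetition) for a column path
/// given from the root down to the leaf.
fn max_levels(path: &[LevelType]) -> (i16, i16) {
    path.iter().fold((0, 0), |(def, rep), node| {
        (
            def + node.definition_increment(),
            rep + node.repetition_increment(),
        )
    })
}
```

Under these assumptions, a nullable struct containing a non-nullable list of nullable primitives has a maximum definition level of 3 and a maximum repetition level of 1, without any Arrow `Field` having to be passed around.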
In working on this PR, I:
* Wrote the test record batch to an IPC file
* Wrote the test record batch to Parquet
* Read the IPC file with `pyarrow`, and wrote it to disk with `pyarrow.parquet`
* Read both Parquet files with `pyarrow.parquet`, and confirmed that the
results were identical
* Read both Parquet files with `pyspark`, and confirmed that the results
were identical
An interesting observation is that `pyspark` always interpreted all of the
Parquet columns as nullable.
# Are there any user-facing changes?
No. All changes are within crate-level structs.