Rafferty97 opened a new pull request, #9496:
URL: https://github.com/apache/arrow-rs/pull/9496
# Which issue does this PR close?
N/A
# Rationale for this change
There are JSON files in the wild that are structured as a single array at
the root level, i.e. `[{...}, {...}, {...}]`. At present, the tape decoder
reads such a file as one single array-valued record. This is not ideal,
because each nested object is what constitutes a "row".
So, this PR extends the `TapeDecoder` to support these kinds of JSON files
via an opt-in configuration option.
A PR was recently merged into DataFusion to support this exact use case, but
it currently employs a streaming converter to transform "top-level array" JSON
sources into ND-JSON. If this PR is merged, that feature could be refactored
to use the `TapeDecoder` directly, which should noticeably improve
performance. See
https://github.com/apache/datafusion/issues/19920 for context.
# What changes are included in this PR?
This PR modestly refactors the `TapeDecoder` to facilitate this use case, by
adding an option called `flatten_top_level_arrays`. When enabled, any top-level
arrays are "flattened" such that their elements each become an individual row
in the output batch, rather than the entire array becoming a single row as
would otherwise happen.
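
To illustrate the intended row semantics only, here is a minimal standalone sketch. It does not use the `TapeDecoder` or any arrow-rs API: the `Value` enum and the `rows` function are hypothetical names invented for this example, and only the flag name `flatten_top_level_arrays` comes from the PR.

```rust
// Toy JSON value type, for illustration only (not an arrow-rs type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Number(i64),
    Object(Vec<(String, Value)>),
    Array(Vec<Value>),
}

// Hypothetical helper: maps the top-level values of a JSON source to rows.
// With the flag off, every top-level value is one row.
// With the flag on, each element of a top-level array becomes its own row.
// Arrays nested inside objects are never touched.
fn rows(top_level: Vec<Value>, flatten_top_level_arrays: bool) -> Vec<Value> {
    let mut out = Vec::new();
    for value in top_level {
        match value {
            Value::Array(items) if flatten_top_level_arrays => out.extend(items),
            other => out.push(other),
        }
    }
    out
}
```

Under these assumptions, a source of the form `[{"a": 1}, {"a": 2}, {"a": 3}]` yields one row with the flag disabled and three rows with it enabled.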
# Are these changes tested?
Yes, these changes pass all existing unit tests, and I've added a new unit
test for this feature specifically.
# Are there any user-facing changes?
The primary change is the addition of a `new_with_options` method on
`TapeDecoder`, which allows the user to specify a value for the new
configuration option. I figured a config struct was more future-proof. We
may also want to mark it as `#[non_exhaustive]`, but that's debatable.
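
For discussion, here is a rough sketch of what a `#[non_exhaustive]` config struct could look like. The struct name `DecoderOptions` and the builder method `with_flatten_top_level_arrays` are hypothetical and are not necessarily what this PR adds; only the field name mirrors the option described above.

```rust
/// Hypothetical options struct, for illustration only.
/// `#[non_exhaustive]` stops downstream crates from building it with a
/// struct literal, so new fields can be added later without a breaking change.
#[non_exhaustive]
#[derive(Debug, Clone, Default, PartialEq)]
pub struct DecoderOptions {
    /// When true, elements of a top-level array each become a row.
    pub flatten_top_level_arrays: bool,
}

impl DecoderOptions {
    /// Builder-style setter, since downstream crates cannot use a literal.
    pub fn with_flatten_top_level_arrays(mut self, flatten: bool) -> Self {
        self.flatten_top_level_arrays = flatten;
        self
    }
}
```

The trade-off is that callers outside the crate must go through `Default` plus setters such as the one above, which is slightly more verbose than a struct literal.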