Rafferty97 opened a new pull request, #9496:
URL: https://github.com/apache/arrow-rs/pull/9496

   # Which issue does this PR close?
   
   N/A
   
   # Rationale for this change
   
   There are JSON files in the wild that are structured as a single array at 
the root level, i.e. `[{...}, {...}, {...}]`. At present, the tape decoder 
reads such a file as one single array-valued record. This is not ideal, 
because each nested object is what actually constitutes a "row".
   
   So, this PR extends the `TapeDecoder` to support these kinds of JSON files 
via an opt-in configuration option.
   
   A PR was recently merged into DataFusion to support this exact use case, but 
it currently employs a streaming converter to transform "top-level array" JSON 
sources into newline-delimited JSON (NDJSON). If this PR is merged, that 
feature could be refactored to use the `TapeDecoder` directly, which should 
noticeably improve performance. See 
https://github.com/apache/datafusion/issues/19920 for context.
   
   # What changes are included in this PR?
   
   This PR modestly refactors the `TapeDecoder` to facilitate this use case, by 
adding an option called `flatten_top_level_arrays`. When enabled, any top-level 
arrays are "flattened" such that their elements each become an individual row 
in the output batch, rather than the entire array becoming a single row as 
would otherwise happen.
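
To illustrate the intended semantics, here is a minimal, self-contained sketch. It is not the code in this PR: the `Value` enum and the `to_rows` and `obj` helpers are hypothetical stand-ins, and the real `TapeDecoder` operates on a flat tape rather than a value tree. It only shows how the row count differs with the option on and off.

```rust
// Toy JSON value model (hypothetical; the real decoder uses a flat tape).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Number(i64),
    Object(Vec<(String, Value)>),
    Array(Vec<Value>),
}

/// Turns a sequence of top-level values into rows.
/// With flattening enabled, each element of a top-level array becomes its
/// own row; nested arrays inside those elements are left untouched.
fn to_rows(top_level: Vec<Value>, flatten_top_level_arrays: bool) -> Vec<Value> {
    let mut rows = Vec::new();
    for value in top_level {
        match value {
            Value::Array(elements) if flatten_top_level_arrays => rows.extend(elements),
            other => rows.push(other),
        }
    }
    rows
}

/// Helper that builds a single-field object such as `{"a": 1}`.
fn obj(key: &str, n: i64) -> Value {
    Value::Object(vec![(key.to_string(), Value::Number(n))])
}

fn main() {
    // Models the input `[{"a": 1}, {"a": 2}, {"a": 3}]`.
    let input = vec![Value::Array(vec![obj("a", 1), obj("a", 2), obj("a", 3)])];

    // Option disabled: the whole array is one row.
    assert_eq!(to_rows(input.clone(), false).len(), 1);

    // Option enabled: each object is its own row.
    assert_eq!(to_rows(input, true).len(), 3);
}
```

Values that are not arrays pass through unchanged in this sketch, so ordinary newline-delimited input would behave the same with or without the option.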
   
   # Are these changes tested?
   
   Yes, these changes pass all existing unit tests, and I've added a new unit 
test for this feature specifically.
   
   # Are there any user-facing changes?
   
   The primary change is the addition of a `new_with_options` method on 
`TapeDecoder` that allows the user to specify a value for the new 
configuration option. I figured a config struct was more future-proof. We 
may also want to mark it as `#[non_exhaustive]`, but that's debatable.
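
As a rough sketch of the `#[non_exhaustive]` trade-off: the struct name `TapeDecoderOptions` and the `with_flatten_top_level_arrays` setter below are assumptions made for illustration and may not match what the PR defines. Only the field name `flatten_top_level_arrays` comes from the description above.

```rust
/// Hypothetical options struct; the name and the setter are assumptions.
/// `#[non_exhaustive]` stops downstream crates from building this with a
/// struct literal, so adding fields later is not a breaking change.
#[non_exhaustive]
#[derive(Debug, Clone, Default, PartialEq)]
pub struct TapeDecoderOptions {
    /// When true, elements of a top-level array each become a row.
    pub flatten_top_level_arrays: bool,
}

impl TapeDecoderOptions {
    /// Builder-style setter, so callers start from `Default` and opt in.
    pub fn with_flatten_top_level_arrays(mut self, flatten: bool) -> Self {
        self.flatten_top_level_arrays = flatten;
        self
    }
}

fn main() {
    // The default keeps the existing behaviour (no flattening).
    assert!(!TapeDecoderOptions::default().flatten_top_level_arrays);

    // Callers opt in explicitly.
    let options = TapeDecoderOptions::default().with_flatten_top_level_arrays(true);
    assert!(options.flatten_top_level_arrays);
}
```

The cost of `#[non_exhaustive]` is that users in other crates must go through `Default` plus setters (or a constructor) rather than a struct literal, which is presumably the debatable part.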


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
