Re: [PR] Speed up JSON schema inference by ~2.8x [arrow-rs]

via GitHub Wed, 24 Jun 2026 16:17:20 -0700


Rafferty97 commented on PR #9494:
URL: https://github.com/apache/arrow-rs/pull/9494#issuecomment-4794432498


   Hi @alamb, thanks for having another look over the PR. I definitely 
appreciate that it's a little on the heavier side, and the aversion to changing 
the code's behaviour.
   
   You're right that these changes would cause some schema inferences that 
previously succeeded to return an error. In particular, it would now reject 
cases where a given key appears with both scalar and array values, which is 
what `mixed_arrays.json` specifically tests:
   
   ```
   {"a":1, "b":[2.0, 1.3, -6.1], "c":[false, true], "d":4.1}
   {"a":-10, "b":[2.0, 1.3, -6.1], "c":null, "d":null}
   {"a":2, "b":[2.0, null, -6.1], "c":[false, null], "d":"text"}
   {"a":3, "b":4, "c": true, "d":[1, false, "array", 2.4]}
   ```
   
   Here, the keys `b`, `c` and `d` appear in some rows as scalar values, and as 
other rows as array values. In the existing code, these keys get inferred as 
array types, on the assumption that when the file is read into actual 
`RecordBatches`, those scalar values will be coerced into single-element arrays.
   
   However, the JSON reader doesn't actually do this. It appears that it used 
to but that this functionality was removed. So, even though the schema 
inference succeeds, the JSON file will fail to be read at query execution time. 
So, my opinion is that it would actually be better for the inference itself to 
fail, as this is more informative to the user.
   
   With that being said, I'm open to preserving the existing behaviour for now 
to get the PR over the line. If that's your preference, I'll re-implement that 
functionality and reinstate those tests I removed.
   
   I also agree with your earlier comment about preferring not to commit gziped 
files in the repo itself. I'd be happy to change those tests that need them to 
instead do the compression themselves rather than relying on an 
already-compressed file existing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Speed up JSON schema inference by ~2.8x [arrow-rs]

Reply via email to