scovich commented on PR #7092: URL: https://github.com/apache/arrow-rs/pull/7092#issuecomment-2649013122
> > Whilst this probably has some additional overheads, I'd be curious to see these quantified, e.g. compared to the approach of not checking. I suspect these are low relative to the inherent costs of JSON decoding, and such an approach still benefits from the vectorised tape->array conversion.
>
> In the common case where all strings contain correct JSON, the check should be branch-predicted away. It's ultimately just checking two variables that should already be hot in CPU cache, if not in registers, and both branches should be not-taken almost always.
>
> In any case tho -- this enables the user of a `Decoder` to express correctness constraints they care about, and the small performance overhead would be totally acceptable.

The change doesn't impact normal parsing at all. Actually... sending in a bunch of small (not I/O-optimal) strings one at a time will probably be the biggest overhead. If we didn't need boundary validation, we could _probably_ just pass the entire underlying byte array from the `StringArray` in a single call. But the per-string calls are unavoidable, especially because I don't think the underlying byte array is required to be tightly packed (there could be regions of invalid bytes between strings).
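For illustration, here is a minimal sketch of the per-string decode pattern being discussed, using the existing `arrow_json::ReaderBuilder`/`Decoder` API; the schema and field names are hypothetical, and error/partial-consumption handling is simplified:

```rust
use std::sync::Arc;

use arrow_array::{Array, RecordBatch, StringArray};
use arrow_json::ReaderBuilder;
use arrow_schema::{ArrowError, DataType, Field, Schema};

/// Decode a StringArray of JSON documents by feeding each string to the
/// tape decoder separately, so per-value boundaries remain visible.
fn json_strings_to_batch(strings: &StringArray) -> Result<Option<RecordBatch>, ArrowError> {
    // Hypothetical schema, for illustration only.
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, true)]));
    let mut decoder = ReaderBuilder::new(schema).build_decoder()?;

    for i in 0..strings.len() {
        if strings.is_valid(i) {
            // One decode() call per string: many small buffers, hence more
            // per-call overhead than a single bulk decode over the values
            // buffer would have. A complete implementation would also check
            // the returned byte count for partial consumption.
            decoder.decode(strings.value(i).as_bytes())?;
        }
    }
    decoder.flush()
}
```

The bulk alternative mentioned above would pass the whole values buffer (e.g. `strings.value_data()`) in one call, but since that buffer is not guaranteed to be tightly packed, per-string calls are what preserve the boundary validation this PR cares about.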
