scovich commented on PR #7092: URL: https://github.com/apache/arrow-rs/pull/7092#issuecomment-2649013122
> > Whilst this probably has some additional overheads, I'd be curious to see these quantified, e.g. compared to the approach of not checking. I suspect these are low relative to the inherent costs of JSON decoding, and such an approach still benefits from the vectorised tape->array conversion.
>
> In the common case where all strings contain correct JSON, the check should be branch-predicted away. It's ultimately just checking two variables that should already be hot in CPU cache, if not in registers, and both branches should be not-taken almost always.
>
> In any case tho -- this enables the user of a `Decoder` to express correctness constraints they care about, and the small performance overhead would be totally acceptable.

The change doesn't impact normal parsing at all. Actually... sending in a bunch of small (not I/O-optimal) strings one at a time will probably be the biggest overhead. If we didn't need boundary validation, we could _probably_ just pass the entire underlying byte array from the `StringArray` in a single call. But the per-string calls are unavoidable, especially because I don't think the underlying byte array is required to be tightly packed (there could be regions of invalid bytes between strings).
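For illustration, here is a minimal sketch of the per-string decode pattern being discussed, using the existing `arrow_json::ReaderBuilder`/`Decoder` API; the schema and field names are hypothetical, and error/partial-consumption handling is simplified:

```rust
use std::sync::Arc;

use arrow_array::{Array, RecordBatch, StringArray};
use arrow_json::ReaderBuilder;
use arrow_schema::{ArrowError, DataType, Field, Schema};

/// Decode a StringArray of JSON documents by feeding each string to the
/// tape decoder separately, so per-value boundaries remain visible.
fn json_strings_to_batch(strings: &StringArray) -> Result<Option<RecordBatch>, ArrowError> {
    // Hypothetical schema, for illustration only.
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, true)]));
    let mut decoder = ReaderBuilder::new(schema).build_decoder()?;

    for i in 0..strings.len() {
        if strings.is_valid(i) {
            // One decode() call per string: many small buffers, hence more
            // per-call overhead than a single bulk decode over the values
            // buffer would have. A complete implementation would also check
            // the returned byte count for partial consumption.
            decoder.decode(strings.value(i).as_bytes())?;
        }
    }
    decoder.flush()
}
```

The bulk alternative mentioned above would pass the whole values buffer (e.g. `strings.value_data()`) in one call, but since that buffer is not guaranteed to be tightly packed, per-string calls are what preserve the boundary validation this PR cares about.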
