scovich commented on PR #7092: URL: https://github.com/apache/arrow-rs/pull/7092#issuecomment-2648951247
> This makes sense to me, my understanding being this allows deserializing `StringArray` one value at a time, ensuring records are not split across value boundaries.

That's a good description of what I hoped to achieve, yes.

> Whilst this probably has some additional overheads, I'd be curious to see these quantified e.g. compared to the approach of not checking, I suspect these are low relative to the inherent costs of JSON decoding, and such an approach still benefits from the vectorised tape->array conversion.

In the common case where all strings contain correct JSON, the check should be branch-predicted away. It's ultimately just checking two variables that should already be hot in CPU cache, if not in registers, and both branches should be not-taken almost always.

In any case tho -- this enables the user of a `Decoder` to express correctness constraints they care about, and the small performance overhead would be totally acceptable. The change doesn't impact normal parsing at all.
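To make the per-value check concrete, here is a minimal sketch of the idea. It is not the arrow-rs implementation: `ToyDecoder`, its fields, and `decode_values` are hypothetical stand-ins that only count brackets, and the method names `has_partial_record` and `len` are assumed here for illustration. After each string value is fed to the decoder, the caller inspects two cheap pieces of state and rejects any value that leaves a record open or does not hold exactly one record.

```rust
/// Toy stand-in for a streaming JSON decoder (hypothetical, not arrow-rs).
/// It tracks only enough state to tell whether a top-level object or array
/// is still open; it does not validate JSON.
#[derive(Default)]
struct ToyDecoder {
    depth: usize,    // number of `{` / `[` currently open
    in_string: bool, // currently inside a string literal
    escaped: bool,   // previous byte was a backslash inside a string
    complete: usize, // top-level records fully consumed so far
}

impl ToyDecoder {
    /// Feed a chunk of bytes; state carries over between calls.
    fn decode(&mut self, buf: &[u8]) {
        for &b in buf {
            if self.in_string {
                if self.escaped {
                    self.escaped = false;
                } else if b == b'\\' {
                    self.escaped = true;
                } else if b == b'"' {
                    self.in_string = false;
                }
                continue;
            }
            match b {
                b'"' => self.in_string = true,
                b'{' | b'[' => self.depth += 1,
                b'}' | b']' if self.depth > 0 => {
                    self.depth -= 1;
                    if self.depth == 0 {
                        self.complete += 1;
                    }
                }
                _ => {}
            }
        }
    }

    /// True if the input so far stops in the middle of a record.
    fn has_partial_record(&self) -> bool {
        self.depth > 0 || self.in_string
    }

    /// Number of complete records seen so far.
    fn len(&self) -> usize {
        self.complete
    }
}

/// Decode one record per string value, rejecting values that split a record
/// or that hold zero or several records. The two `if`s are the cheap checks:
/// both are expected to be not-taken when the input is well formed.
fn decode_values(values: &[&str]) -> Result<usize, String> {
    let mut decoder = ToyDecoder::default();
    for (i, value) in values.iter().enumerate() {
        decoder.decode(value.as_bytes());
        if decoder.has_partial_record() {
            return Err(format!("value {i} ends in the middle of a record"));
        }
        if decoder.len() != i + 1 {
            return Err(format!("value {i} does not contain exactly one record"));
        }
    }
    Ok(decoder.len())
}
```

With this sketch, `decode_values(&["{\"a\": 1}", "{\"b\": 2}"])` succeeds, while a record split across two values, or two records packed into one value, is reported as an error. A caller that does not care about value boundaries can skip both checks and just keep feeding bytes.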
