zeroshade commented on issue #448: URL: https://github.com/apache/arrow-go/issues/448#issuecomment-3114581236
Okay, I've dug into this a bit and have a theory plus a possible workaround and fix. But first a question: @caldempsey, your example benchmark ends up putting NUL characters into the values for `metadata`, which the JSON encoder converts to unicode escapes when writing them out. In your tests using spark-connect-go where you discovered this performance issue, did the JSON you were decoding contain unicode escape sequences?

This ultimately looks like an issue with the `goccy/go-json` library that we're using, so a temporary workaround that seems to solve the performance problem is to build with `-tags arrow_json_stdlib`, which switches back to the stdlib `encoding/json` and doesn't exhibit this issue.

When I profiled it, 99% of the runtime in the slow case was spent at https://github.com/goccy/go-json/blob/master/internal/decoder/string.go#L152 doing a `runtime.memmove` because of the ever-growing buffer. The big difference between `RecordFromJSON` and `NewJSONReader` is the implementation of `UnmarshalJSON` for `RecordBuilder` vs `StructBuilder`. Both call `NewDecoder` on each call to `UnmarshalJSON`, but `StructBuilder` iterates every row of the struct column with a single decoder, while `RecordBuilder` decodes only a single row per call to `UnmarshalJSON`. Since `RecordFromJSON` just decodes into a struct and then converts that to a record, the `goccy/go-json` `StreamDecoder` ends up repeatedly reallocating the buffer over and over to convert the unicode escapes. If I run the same benchmark with the unicode-escaped characters removed, `RecordFromJSON` performs identically to `NewJSONReader`.

I was able to come up with a workaround that solves the performance problem for `RecordFromJSON` even with the unicode characters, but ultimately anyone decoding a large enough string column containing many unicode characters in each element would run into the same problem.
So I'm going to see if I can figure out a solution that works for `goccy/go-json` (which is definitely faster in scenarios that don't hit this case, particularly for larger JSON strings). Let me know if the suggestions above work for you, even if they are only temporary workarounds. I've also not been able to reproduce the problem where `RecordFromJSON` produces only a single row for a non-ndjson string, but I'll take a deeper look at what I can figure out.