zeroshade commented on issue #448: URL: https://github.com/apache/arrow-go/issues/448#issuecomment-3114581236
Okay, I've dug into this a bit and have a theory plus a possible workaround and fix. But first a question: @caldempsey, your example benchmark ends up putting NUL characters into the values for `metadata`, which the JSON encoder converts to unicode escapes when writing them out. In your tests using spark-connect-go where you discovered this performance issue, did the JSON you were decoding contain unicode escape sequences?

This ultimately looks like an issue with the `goccy/go-json` library that we're using, so a temporary workaround that seems to solve the performance problem is to build with `-tags arrow_json_stdlib`, which switches back to the stdlib `encoding/json` and doesn't exhibit this issue.

When I profiled it, 99% of the runtime in the slow case was spent at https://github.com/goccy/go-json/blob/master/internal/decoder/string.go#L152 doing a `runtime.memmove` because of the ever-growing buffer. The big difference between `RecordFromJSON` and `NewJSONReader` is the implementation of `UnmarshalJSON` for `RecordBuilder` vs `StructBuilder`. Both call `NewDecoder` on each call to `UnmarshalJSON`, but `StructBuilder` iterates every row of the struct column with a single decoder, while `RecordBuilder` decodes only a single row per call to `UnmarshalJSON`. Since `RecordFromJSON` just decodes into a struct and then converts that to a record, the `goccy/go-json` `StreamDecoder` ends up repeatedly reallocating the buffer over and over to convert the unicode escapes. If I run the same benchmark with the unicode-escaped characters removed, `RecordFromJSON` performs identically to `NewJSONReader`.

I was able to come up with a workaround that solves the performance problem for `RecordFromJSON` even with the unicode characters, but ultimately anyone decoding a large enough string column containing many unicode characters in each element would run into the same problem.
So I'm going to see if I can figure out a solution that works for `goccy/go-json` (which is definitely faster in scenarios that don't hit this case, particularly for larger JSON strings). Let me know if the suggestions above work for you, even if they are only temporary workarounds. I've also not been able to reproduce the problem where `RecordFromJSON` produces only a single row for a non-ndjson string, but I'll take a deeper look at what I can figure out.