zeroshade commented on issue #2021: URL: https://github.com/apache/arrow-adbc/issues/2021#issuecomment-2248902554
Strap in friends, I've narrowed down the cause and it's a doozy! In case you don't want to follow along, here's the TL;DR:

https://github.com/golang/go/commit/3560cf0afb introduced a change in go1.22.5 (backported to go1.21.12) which forces Go to always update a goroutine's stack boundaries whenever C calls back into Go. This causes a significant performance degradation in the scenario of repeated calls to the ADBC driver (in the example @lidavidm provided above we go from ~2s to over ~5.5s). For the time being this can be mitigated by building our distributable binaries with go1.22.4 or go1.21.11, which do not have this change, but going forward this is a problem. At the moment, the only likely solution I can think of is to file an issue on the golang/go repo and see if someone can fix this.

--------------------------

Now, if you want the gory details, let's have some fun!

https://github.com/golang/go/commit/3560cf0afb was introduced to address https://github.com/golang/go/issues/62440, where the underlying problem was that, in between calls to a Go callback from C, the C code was swapping out the stack, leading to a crash when the goroutine's known stack bounds no longer left enough space to operate. The fix was to always have the goroutine update its stack boundaries before executing a callback from C.

So naturally we assumed that the performance issue was due to the calls into Go from C being more expensive, but we didn't know *why* that would be the case: fetching the stack boundaries shouldn't be an expensive operation, right?

After getting the example working with `pprof`, I did some digging to see if the issue could be in the upstream Arrow library (the `cdata` package) or in how we package the drivers here. What stuck out to me was this:

*(pprof flamegraph, go1.22.5)*

Notice that it just says `runtime._ExternalCode`, which is calling into `libc.so.6` and taking up over 50% of the runtime. This was confusing: according to this, the slowdown is in `libc`, not in the Go runtime? How could that be?

For comparison, if I compile with go1.22.4 instead, I get the following flamegraph:

*(pprof flamegraph, go1.22.4)*

Note how small the `runtime._ExternalCode` entry is there.

**Important note**: I had to disable all our manual calls to `runtime.GC()` in the `driver.go` file, otherwise the GC calls, mark and sweep, and all of that end up dominating the profiles, making it much more difficult to pin down the cause.
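(For anyone who wants to reproduce these profiles: since the driver is built as a c-shared library there's no `main` to start and stop the profiler from, so you need some kind of hook. The following is just a rough sketch of one way to do it; the `ADBC_DRIVER_CPUPROFILE` variable and `stopProfiling` helper are made up for illustration and don't exist in the driver today.)

```go
// profiling_hook.go - hypothetical sketch, not actual arrow-adbc code.
package driver // assumed package name

import (
	"log"
	"os"
	"runtime/pprof"
)

var cpuProfile *os.File

func init() {
	// Opt in via an (invented) env var, e.g.:
	//   ADBC_DRIVER_CPUPROFILE=/tmp/driver.pprof python repro.py
	path := os.Getenv("ADBC_DRIVER_CPUPROFILE")
	if path == "" {
		return
	}
	f, err := os.Create(path)
	if err != nil {
		log.Printf("could not create CPU profile: %v", err)
		return
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Printf("could not start CPU profile: %v", err)
		f.Close()
		return
	}
	cpuProfile = f
}

// stopProfiling would need to be wired into the driver's shutdown/release
// path so the profile gets flushed before the host process exits.
func stopProfiling() {
	if cpuProfile != nil {
		pprof.StopCPUProfile()
		cpuProfile.Close()
		cpuProfile = nil
	}
}
```

With something like that in place, `go tool pprof -http=:8080 /tmp/driver.pprof` gives you the flamegraph views shown above.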
So I did a bunch of research and found out that `runtime._ExternalCode` is just what `pprof` reports when it has no insight into the code being profiled, i.e. native code (C or assembly) that `pprof` can't attribute samples to. So now what?

Well, there is the handy dandy `perf` tool on Linux! After reading a bunch of tutorials and figuring out how it all works, I gave it a shot using `perf record -g -- python repro.py`. When compiling with go1.22.4, I got the following results from `perf`:

*(perf report, go1.22.4)*

This breaks down to roughly:

* 66.59% -> `blas_thread_server` (i.e. numpy doing stuff! not sure why, but weird!)
* 13.60% -> `runtime.goexit.abi0` (grpc cleanup callbacks)
* the next several entries, collectively ~38% -> cgo callbacks (i.e. the driver implementation functions)

When I compile with go1.22.5 I get this instead:

*(perf report, go1.22.5)*

Which breaks down to:

* 35.85% -> `blas_thread_server` (i.e. numpy. seriously, wtf??)
* 7.74% -> `runtime.goexit.abi0` (grpc cleanup)
* the cgo callbacks for the driver, collectively ~42%
* 21.56% -> `__isoc99_sscanf`

This is where I had a huge WTF moment: how the hell did `sscanf` get involved here? And why the hell is it taking up so much time??

After a few hours, I was finally able to figure it out. I played around with the options in `perf` and googled a LOT, and ultimately found that I could change the way it determines the call graph and frames. I tried `--call-graph dwarf`, but that failed horribly, as I expected, since we're not using DWARF-enabled executables. But when I changed it to `--call-graph lbr` I was able to get the following output:

*(perf report with `--call-graph lbr`)*

In that screenshot you can see the full chain:

* Before executing the Go code for the cgo callback, the runtime calls `runtime.callbackUpdateSystemStack`
* Which internally calls [asmcgocall to x_cgo_getstackbound](https://github.com/golang/go/blob/792a26130347c9b9db344ba56f86645679a1a9d9/src/runtime/cgocall.go#L296)
* Which then calls [`pthread_getattr_np`](https://github.com/golang/go/blob/792a26130347c9b9db344ba56f86645679a1a9d9/src/runtime/cgo/gcc_stack_unix.c#L28)
* Which, apparently, calls `__isoc99_sscanf` (which is just the symbol GCC emits whenever you use `sscanf` in C code)

And the calls to `sscanf` end up completely and utterly dominating the entire cost of the release callbacks, which is ludicrous and hilarious.

Ultimately, I think I'm going to have to file an issue on the golang/go repo to address this, because I can't see any other way around it.
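P.S. For anyone wondering *why* `pthread_getattr_np` is so slow here: as far as I can tell from the glibc source, for threads created via `pthread_create` the stack bounds are cached in the thread descriptor, but for the process's *initial* thread there is no cache, so every call re-opens `/proc/self/maps` and `sscanf`s its way through it line by line. In our repro that initial thread is presumably the Python main thread making the driver calls, which would explain the numbers above. Here's a tiny standalone sketch (not arrow-adbc code, just an illustration) that times that exact call sequence from the initial thread:

```go
// stackbound_cost.go - hypothetical standalone sketch, not arrow-adbc code.
// It times the pthread_getattr_np/pthread_attr_getstack sequence that the Go
// runtime's x_cgo_getstackbound performs on every C-to-Go callback since
// go1.22.5 / go1.21.12.
package main

/*
#cgo LDFLAGS: -pthread
#define _GNU_SOURCE
#include <pthread.h>

// Mirror x_cgo_getstackbound: ask glibc for the calling thread's stack bounds.
static size_t query_stack_bound(void) {
	pthread_attr_t attr;
	void *addr;
	size_t size = 0;
	if (pthread_getattr_np(pthread_self(), &attr) == 0) {
		pthread_attr_getstack(&attr, &addr, &size);
		pthread_attr_destroy(&attr);
	}
	return size;
}
*/
import "C"

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// The main goroutine starts on the process's initial OS thread; pin it
	// there so we hit glibc's /proc/self/maps fallback instead of the cached
	// bounds it keeps for pthread-created threads.
	runtime.LockOSThread()

	const n = 100_000
	start := time.Now()
	for i := 0; i < n; i++ {
		C.query_stack_bound()
	}
	elapsed := time.Since(start)
	fmt.Printf("%d calls: %v total, %v per call\n", n, elapsed, elapsed/time.Duration(n))
}
```

Run from the initial thread, essentially all of the time here should be glibc parsing `/proc/self/maps`, which lines up with `__isoc99_sscanf` dominating the go1.22.5 profile above.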
