zeroshade commented on issue #2021:
URL: https://github.com/apache/arrow-adbc/issues/2021#issuecomment-2248902554

   Strap in, friends, I've narrowed down the cause and it's a doozy!
   
   In case you don't want to follow along, here's the TL;DR: 
   
   https://github.com/golang/go/commit/3560cf0afb introduced a change in go1.22.5 (and was backported to go1.21.12) which forces Go to always update the goroutine's stack bounds when handling a cgo callback (a call from C into Go). This causes a significant performance degradation when making repeated calls into the ADBC driver (in the example @lidavidm provided above we go from ~2s to >~5.5s). For the time being this can be mitigated by building our distributable binaries with go1.22.4 or go1.21.11, which do not have this change, but going forward this is a problem. At the moment, the only likely solution I can think of is to file an issue on the golang/go repo and see if someone can fix this.
   
   --------------------------
   
   Now, if you want the gory details, let's have some fun!
   
   https://github.com/golang/go/commit/3560cf0afb was introduced to address https://github.com/golang/go/issues/62440, where the root cause was that, in between calls to a Go callback from C, the C side was swapping out the stack. This led to a crash when the goroutine's known stack bounds no longer left enough space on the stack to operate. The fix was to have the goroutine always update its stack boundaries before going on to execute a callback from C.
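   To make that scenario concrete, here's a minimal sketch (not the real driver, just the shape of the problem): C code repeatedly calling back into an exported Go function, with the two files living in one package. On go1.22.5 / go1.21.12 every one of those C-to-Go transitions now goes through the new stack-bounds update.

```go
// main.go -- repeated C-to-Go callbacks; the callee is deliberately trivial
// so that any per-call cost comes from the runtime's callback entry path.
package main

/*
void callLoop(int n); // defined in loop.go's preamble, since cgo forbids C
                      // definitions in the preamble of a file that exports
                      // Go functions
*/
import "C"

import (
	"fmt"
	"time"
)

//export goNoop
func goNoop() {
	// Intentionally empty: the cost being measured is the C-to-Go
	// transition itself (runtime.cgocallback), not the callee.
}

func main() {
	start := time.Now()
	C.callLoop(1_000_000) // one cgo callback into Go per iteration
	fmt.Println("1e6 C-to-Go callbacks took", time.Since(start))
}
```

```go
// loop.go -- holds the C definition of the loop that drives the callbacks.
package main

/*
extern void goNoop(void);

void callLoop(int n) {
    for (int i = 0; i < n; i++) {
        goNoop(); // each call re-enters Go via runtime.cgocallback
    }
}
*/
import "C"
```

   Building the same two files with go1.22.4 versus go1.22.5 should make the per-callback difference visible.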
   
   So naturally, we assumed that the performance issue was due to the calls into Go from C being more expensive, but we didn't know *why* that would be the case, since fetching the stack boundaries shouldn't be an expensive operation, right?
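   (A quick aside on setup: since the driver is built as a c-shared library loaded from Python, there's no Go `main` to instrument, so `pprof` needs an explicit hook. Something along these lines works; this is just a sketch with a made-up `DRIVER_CPUPROFILE` env var, not necessarily how I wired it up here.)

```go
// profilehook.go -- start a CPU profile from inside a c-shared Go library
// when a (hypothetical) environment variable is set. There's no "main
// returns" moment in a shared library, so stop after a fixed window that's
// long enough to cover the repro script.
package main

import (
	"os"
	"runtime/pprof"
	"time"
)

func init() {
	path := os.Getenv("DRIVER_CPUPROFILE") // hypothetical knob
	if path == "" {
		return
	}
	f, err := os.Create(path)
	if err != nil {
		return
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		f.Close()
		return
	}
	time.AfterFunc(60*time.Second, func() {
		pprof.StopCPUProfile()
		f.Close()
	})
}
```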
   
   After getting the example working with `pprof`, I did some digging to see if 
the issue could be in the upstream Arrow library (the `cdata` package) or in 
how we package the drivers here. What stuck out to me was this:
   
   
![image](https://github.com/user-attachments/assets/6c5ee6cc-7878-45f2-8fa7-da6ff990b2cf)
   
   Notice that it just says `runtime._ExternalCode`, which is calling into `libc.so.6` and taking up over 50% of the runtime. This was confusing: according to this, the slowdown is in `libc`, not in the Go runtime. How could that be?
   
   For comparison, if I compile with go1.22.4 instead, I get the following 
flamegraph:
   
   
![image](https://github.com/user-attachments/assets/7fca8de6-5873-4f15-b4d8-2cad91c1fba7)
   
   Note how small the `runtime._ExternalCode` entry there is.
   
   **Important note**: I had to disable all of our manual `runtime.GC()` calls in `driver.go`, otherwise the GC work (mark, sweep, and so on) ends up dominating the profiles and makes it much harder to pin down the actual cause.
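   (If you want to reproduce the profiles, one easy way to make those collections skippable is to gate them behind an environment variable; the sketch below uses made-up names and is not the actual `driver.go` code.)

```go
// A sketch of gating the driver's manual collections behind an environment
// variable so that they don't drown out everything else in a CPU profile.
package main

import (
	"os"
	"runtime"
)

// skipManualGC is a hypothetical knob, not an existing driver option.
var skipManualGC = os.Getenv("ADBC_SKIP_MANUAL_GC") != ""

// maybeGC stands in for the direct runtime.GC() calls in driver.go.
func maybeGC() {
	if !skipManualGC {
		runtime.GC()
	}
}
```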
   
   So I did a bunch of research and found out that `runtime._ExternalCode` is just the placeholder `pprof` uses for samples it has no insight into, i.e. native code (C or assembly) that it can't attribute to anything. So now what? Well, there is the handy dandy `perf` tool on Linux!
   
   After reading a bunch of tutorials and figuring out how it all works, I gave 
it a shot and got the following by using `perf record -g -- python repro.py`...
   
   When compiling with `go1.22.4` I got the following results from `perf`:
   
![image](https://github.com/user-attachments/assets/db6d6e3c-7f90-44ae-90e7-ff4d1dd389ef)
   
   This breaks down to roughly:
   * 66.59% -> `blas_thread_server` (i.e. numpy doing stuff! not sure why, but 
weird!)
   * 13.60% -> `runtime.goexit.abi0` (grpc cleanup callbacks)
   * the next several ones collectively add up to ~38% -> cgo callbacks (i.e. 
the Driver implementation functions)
   
   When I compile with `go1.22.5` I get this instead:
   
   
![image](https://github.com/user-attachments/assets/026779fa-a14a-4087-a1fa-01c5c7ba3439)
   
   Which breaks down to:
   * 35.85% -> `blas_thread_server` (i.e. numpy. seriously, wtf??)
   * 7.74% -> `runtime.goexit.abi0` (grpc cleanup)
   * the cgo callbacks for the driver collectively add up to ~42%
   * 21.56% -> `__isoc99_sscanf`
   
   This is where I had a huge WTF moment: how the hell did `sscanf` get 
involved here? And why the hell is it taking up so much time?? After a few 
hours, I was finally able to figure it out.
   
   I played around with perf's options and googled a LOT, and ultimately found that I could change the way it attempts to determine the call graph and frames. I tried `--call-graph dwarf`, but that failed horribly, as I expected, because we're not using DWARF executables. But when I switched to `--call-graph lbr` (i.e. `perf record --call-graph lbr -- python repro.py`) I was able to get the following output:
   
   
![image](https://github.com/user-attachments/assets/4420744c-086e-489e-9ec8-ecf5465fc24a)
   
   In that screenshot you can see the full chain:
   
   * Before executing the Go code from the CGO callback, it calls 
`runtime.callbackUpdateSystemStack`
   * Which internally calls [asmcgocall to 
x_cgo_getstackbound](https://github.com/golang/go/blob/792a26130347c9b9db344ba56f86645679a1a9d9/src/runtime/cgocall.go#L296)
   * Which then calls 
[`pthread_getattr_np`](https://github.com/golang/go/blob/792a26130347c9b9db344ba56f86645679a1a9d9/src/runtime/cgo/gcc_stack_unix.c#L28)
   * Which, apparently, calls `__isoc99_sscanf` (which is just the symbol you get whenever you use `sscanf` in C code compiled against glibc): glibc's `pthread_getattr_np` finds the initial thread's stack bounds by parsing `/proc/self/maps` line by line, and that parsing is done with `sscanf`.
   
   And the calls to `sscanf` end up completely and utterly dominating the 
entire cost of the release callbacks, which is ludicrous and hilarious. 
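   If you want to see just how expensive that last step is in isolation, here's a throwaway benchmark (a sketch of mine, not anything from the runtime or the driver) that times repeated `pthread_getattr_np` calls on the process's initial thread, where glibc has to redo that `/proc/self/maps` parse on every call:

```go
// bench.go -- time repeated pthread_getattr_np calls on the initial thread,
// roughly what the Go runtime now does on every cgo callback.
package main

/*
#cgo CFLAGS: -D_GNU_SOURCE
#cgo LDFLAGS: -pthread
#include <pthread.h>

static void getbounds(void) {
    pthread_attr_t attr;
    void *addr;
    size_t size;
    pthread_getattr_np(pthread_self(), &attr);
    pthread_attr_getstack(&attr, &addr, &size);
    pthread_attr_destroy(&attr);
}
*/
import "C"

import (
	"fmt"
	"runtime"
	"time"
)

func init() {
	// Keep main() on the initial OS thread so we hit glibc's slow path;
	// for other threads the stack bounds come straight out of the thread
	// descriptor and the call is cheap.
	runtime.LockOSThread()
}

func main() {
	const n = 100_000
	start := time.Now()
	for i := 0; i < n; i++ {
		C.getbounds()
	}
	fmt.Printf("%d pthread_getattr_np calls took %v\n", n, time.Since(start))
}
```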
   
   Ultimately, I think I'm gonna have to file an issue on the golang/go repo to address this, because I can't see any other way around it.
   
   

