zeroshade commented on issue #2021: URL: https://github.com/apache/arrow-adbc/issues/2021#issuecomment-2248902554
Strap in friends, I've narrowed down the cause and it's a doozy! In case you don't want to follow along, here's the TL;DR:

https://github.com/golang/go/commit/3560cf0afb introduced a change in go1.22.5 (backported to go1.21.12) which forces Go to always update a goroutine's stack boundaries whenever C calls back into Go. This causes a significant performance degradation in the scenario of repeated calls to the ADBC driver (in the example @lidavidm provided above we go from ~2s to over ~5.5s). For the time being this can be mitigated by building our distributable binaries with go1.22.4 or go1.21.11, which do not have this change, but going forward this is a problem. At the moment, the only likely solution I can think of is to file an issue on the golang/go repo and see if someone can fix this.

--------------------------

Now, if you want the gory details, let's have some fun!

https://github.com/golang/go/commit/3560cf0afb was introduced to address https://github.com/golang/go/issues/62440, where the underlying problem was that, in between calls to a Go callback from C, the C code was swapping out the stack, leading to a crash when the goroutine's known stack bounds no longer left enough space to operate. The fix was to always have the goroutine update its stack boundaries before executing a callback from C.

So naturally we assumed that the performance issue was due to the calls into Go from C being more expensive, but we didn't know *why* that would be the case: fetching the stack boundaries shouldn't be an expensive operation, right?

After getting the example working with `pprof`, I did some digging to see if the issue could be in the upstream Arrow library (the `cdata` package) or in how we package the drivers here. What stuck out to me was this:

*(pprof flamegraph, go1.22.5)*

Notice that it just says `runtime._ExternalCode`, which is calling into `libc.so.6` and taking up over 50% of the runtime. This was confusing: according to this, the slowdown is in `libc`, not in the Go runtime? How could that be?

For comparison, if I compile with go1.22.4 instead, I get the following flamegraph:

*(pprof flamegraph, go1.22.4)*

Note how small the `runtime._ExternalCode` entry is there.

**Important note**: I had to disable all our manual calls to `runtime.GC()` in the `driver.go` file, otherwise the GC calls, mark and sweep, and all of that end up dominating the profiles, making it much more difficult to pin down the cause.
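(For anyone who wants to reproduce these profiles: since the driver is built as a c-shared library there's no `main` to start and stop the profiler from, so you need some kind of hook. The following is just a rough sketch of one way to do it; the `ADBC_DRIVER_CPUPROFILE` variable and `stopProfiling` helper are made up for illustration and don't exist in the driver today.)

```go
// profiling_hook.go - hypothetical sketch, not actual arrow-adbc code.
package driver // assumed package name

import (
	"log"
	"os"
	"runtime/pprof"
)

var cpuProfile *os.File

func init() {
	// Opt in via an (invented) env var, e.g.:
	//   ADBC_DRIVER_CPUPROFILE=/tmp/driver.pprof python repro.py
	path := os.Getenv("ADBC_DRIVER_CPUPROFILE")
	if path == "" {
		return
	}
	f, err := os.Create(path)
	if err != nil {
		log.Printf("could not create CPU profile: %v", err)
		return
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Printf("could not start CPU profile: %v", err)
		f.Close()
		return
	}
	cpuProfile = f
}

// stopProfiling would need to be wired into the driver's shutdown/release
// path so the profile gets flushed before the host process exits.
func stopProfiling() {
	if cpuProfile != nil {
		pprof.StopCPUProfile()
		cpuProfile.Close()
		cpuProfile = nil
	}
}
```

With something like that in place, `go tool pprof -http=:8080 /tmp/driver.pprof` gives you the flamegraph views shown above.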
So I did a bunch of research and found out that `runtime._ExternalCode` is just what `pprof` reports when it has no insight into the code being profiled, i.e. native code (C or assembly) that `pprof` can't attribute samples to. So now what?

Well, there is the handy dandy `perf` tool on Linux! After reading a bunch of tutorials and figuring out how it all works, I gave it a shot using `perf record -g -- python repro.py`. When compiling with go1.22.4, I got the following results from `perf`:

*(perf report, go1.22.4)*

This breaks down to roughly:

* 66.59% -> `blas_thread_server` (i.e. numpy doing stuff! not sure why, but weird!)
* 13.60% -> `runtime.goexit.abi0` (grpc cleanup callbacks)
* the next several entries, collectively ~38% -> cgo callbacks (i.e. the driver implementation functions)

When I compile with go1.22.5 I get this instead:

*(perf report, go1.22.5)*

Which breaks down to:

* 35.85% -> `blas_thread_server` (i.e. numpy. seriously, wtf??)
* 7.74% -> `runtime.goexit.abi0` (grpc cleanup)
* the cgo callbacks for the driver, collectively ~42%
* 21.56% -> `__isoc99_sscanf`

This is where I had a huge WTF moment: how the hell did `sscanf` get involved here? And why the hell is it taking up so much time??

After a few hours, I was finally able to figure it out. I played around with the options in `perf` and googled a LOT, and ultimately found that I could change the way it determines the call graph and frames. I tried `--call-graph dwarf`, but that failed horribly, as I expected, since we're not using DWARF-enabled executables. But when I changed it to `--call-graph lbr` I was able to get the following output:

*(perf report with `--call-graph lbr`)*

In that screenshot you can see the full chain:

* Before executing the Go code for the cgo callback, the runtime calls `runtime.callbackUpdateSystemStack`
* Which internally calls [asmcgocall to x_cgo_getstackbound](https://github.com/golang/go/blob/792a26130347c9b9db344ba56f86645679a1a9d9/src/runtime/cgocall.go#L296)
* Which then calls [`pthread_getattr_np`](https://github.com/golang/go/blob/792a26130347c9b9db344ba56f86645679a1a9d9/src/runtime/cgo/gcc_stack_unix.c#L28)
* Which, apparently, calls `__isoc99_sscanf` (which is just the symbol GCC emits whenever you use `sscanf` in C code)

And the calls to `sscanf` end up completely and utterly dominating the entire cost of the release callbacks, which is ludicrous and hilarious.

Ultimately, I think I'm going to have to file an issue on the golang/go repo to address this, because I can't see any other way around it.
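P.S. For anyone wondering *why* `pthread_getattr_np` is so slow here: as far as I can tell from the glibc source, for threads created via `pthread_create` the stack bounds are cached in the thread descriptor, but for the process's *initial* thread there is no cache, so every call re-opens `/proc/self/maps` and `sscanf`s its way through it line by line. In our repro that initial thread is presumably the Python main thread making the driver calls, which would explain the numbers above. Here's a tiny standalone sketch (not arrow-adbc code, just an illustration) that times that exact call sequence from the initial thread:

```go
// stackbound_cost.go - hypothetical standalone sketch, not arrow-adbc code.
// It times the pthread_getattr_np/pthread_attr_getstack sequence that the Go
// runtime's x_cgo_getstackbound performs on every C-to-Go callback since
// go1.22.5 / go1.21.12.
package main

/*
#cgo LDFLAGS: -pthread
#define _GNU_SOURCE
#include <pthread.h>

// Mirror x_cgo_getstackbound: ask glibc for the calling thread's stack bounds.
static size_t query_stack_bound(void) {
	pthread_attr_t attr;
	void *addr;
	size_t size = 0;
	if (pthread_getattr_np(pthread_self(), &attr) == 0) {
		pthread_attr_getstack(&attr, &addr, &size);
		pthread_attr_destroy(&attr);
	}
	return size;
}
*/
import "C"

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// The main goroutine starts on the process's initial OS thread; pin it
	// there so we hit glibc's /proc/self/maps fallback instead of the cached
	// bounds it keeps for pthread-created threads.
	runtime.LockOSThread()

	const n = 100_000
	start := time.Now()
	for i := 0; i < n; i++ {
		C.query_stack_bound()
	}
	elapsed := time.Since(start)
	fmt.Printf("%d calls: %v total, %v per call\n", n, elapsed, elapsed/time.Duration(n))
}
```

Run from the initial thread, essentially all of the time here should be glibc parsing `/proc/self/maps`, which lines up with `__isoc99_sscanf` dominating the go1.22.5 profile above.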
