I am interested in having less overhead for Go-C-Go roundtrips, for C
programs that I know (or at least am very much sure) will behave, for
the most part, like non-preemptible (as in prior to Go 1.14) loops
in Go. Concretely, I have a self-imposed exercise of making a
frankenprogram that talks graphics using Vulkan (this part is in C)
and talks to other computers over the network (this part is in Go).
Background
A certain sizable class of Vulkan programs spends the overwhelming
majority of its CPU time in a loop recording command buffers. This means
calling vkCmd* functions some 1 to 10 or more times per
second. The frequency with which these functions are invoked makes it
infeasible to use the common Cgo mechanism because its constant
overhead becomes significantly larger than the time these functions
individually run for. In fact, it is likely that most of the vkCmd*
functions these programs are interested in merely copy their arguments
to an array and bump a counter. This property makes Cgo's assumption
that a C function may block redundant. But this is an implementation
detail of a Vulkan driver and differs between drivers (we still assume
that the driver is well behaved and won't do nasty things like hanging
forever). Even if we knew the memory layout of the command buffer and
replicated the relevant vkCmd* functions from, say, mesa radv in Go,
this would be broken by an inevitable driver update (the .so part of a
driver is overwhelmingly linked dynamically) and would not work on a
different driver such as mesa anvil (the Intel Vulkan driver).
There are also a number of minor nuisances, such as the dynamic
cgocheck constantly complaining about pointers to Go memory being
passed to C (runtime.KeepAlive calls were carefully placed in the
program) and the general discomfort of writing "Go-looking C" in Go.
I suspect there is an alternative way of making the graphics card do
work by using indirect draws, similar to the OpenGL AZDO approach (this
is speculation, since I'm not at all familiar with this
approach). This would let us just write indirect draw commands to a
large array from Go and call into C much less frequently. This
approach appears to come at a big cost in ergonomics: we have no way of
interleaving binding of anything with the draw calls. We would need a
separate logical buffer (an offset in vkCmdDraw*Indirect) of indirect
draw commands between any two vkCmdBind* calls.
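A minimal sketch of what writing indirect draw commands from Go might look like. The four uint32 fields follow Vulkan's VkDrawIndirectCommand; everything else (IndirectBuffer, Segment, Cut) is a hypothetical name made up for illustration, and a plain byte slice stands in for mapped buffer memory. It assumes a little-endian target.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// DrawIndirectCommand has the same four uint32 fields, in the same order,
// as Vulkan's VkDrawIndirectCommand.
type DrawIndirectCommand struct {
	VertexCount   uint32
	InstanceCount uint32
	FirstVertex   uint32
	FirstInstance uint32
}

// drawStride is the byte size of one tightly packed command (4 x uint32).
const drawStride = 16

// Segment is a hypothetical record of one run of draws between two binds:
// the byte offset and draw count a single vkCmdDrawIndirect would be given.
type Segment struct {
	Offset    int
	DrawCount int
}

// IndirectBuffer stands in for mapped, host-visible buffer memory.
type IndirectBuffer struct {
	mem   []byte
	count int // commands written so far
	start int // index of the first command in the open segment
}

func NewIndirectBuffer(maxDraws int) *IndirectBuffer {
	return &IndirectBuffer{mem: make([]byte, maxDraws*drawStride)}
}

// Append encodes one command; it reports false when the buffer is full.
func (b *IndirectBuffer) Append(c DrawIndirectCommand) bool {
	off := b.count * drawStride
	if off+drawStride > len(b.mem) {
		return false
	}
	le := binary.LittleEndian // assumption: little-endian target
	le.PutUint32(b.mem[off:], c.VertexCount)
	le.PutUint32(b.mem[off+4:], c.InstanceCount)
	le.PutUint32(b.mem[off+8:], c.FirstVertex)
	le.PutUint32(b.mem[off+12:], c.FirstInstance)
	b.count++
	return true
}

// Cut closes the open segment; call it wherever a bind would be needed.
func (b *IndirectBuffer) Cut() Segment {
	s := Segment{Offset: b.start * drawStride, DrawCount: b.count - b.start}
	b.start = b.count
	return s
}

func main() {
	b := NewIndirectBuffer(1024)
	b.Append(DrawIndirectCommand{VertexCount: 3, InstanceCount: 1})
	b.Append(DrawIndirectCommand{VertexCount: 6, InstanceCount: 2})
	first := b.Cut() // a vkCmdBind* call would separate the two segments
	b.Append(DrawIndirectCommand{VertexCount: 36, InstanceCount: 1})
	second := b.Cut()
	fmt.Println(first, second)
}
```

Only the per-segment calls would cross into C here, which is where the ergonomic cost shows: every bind forces a Cut.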
Another approach would be to have an intermediate command buffer of
our own, which would be, for example, an array of tagged unions (structs
with an integer tag and a union) describing which vkCmd* to call and
what parameters to give it.
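A rough sketch of such an intermediate command buffer, written entirely in Go so it can be run as is. All names here (Tag, Cmd, Recorder, Backend, logBackend) are hypothetical. In the real program the replay loop would presumably live in C and be entered with one Cgo call per batch; here a Go interface stands in for the C side.

```go
package main

import "fmt"

// Tag selects which vkCmd* function a record stands for.
type Tag uint32

const (
	TagBindPipeline Tag = iota
	TagDraw
)

// Cmd is a fixed-size tagged record. Go has no unions, so Args is sized
// for the largest variant and interpreted according to Tag.
type Cmd struct {
	Tag  Tag
	Args [4]uint64
}

// Recorder accumulates commands in Go memory without calling into C.
type Recorder struct {
	cmds []Cmd
}

func (r *Recorder) BindPipeline(pipeline uint64) {
	r.cmds = append(r.cmds, Cmd{Tag: TagBindPipeline, Args: [4]uint64{pipeline}})
}

func (r *Recorder) Draw(vertexCount, instanceCount, firstVertex, firstInstance uint32) {
	r.cmds = append(r.cmds, Cmd{Tag: TagDraw, Args: [4]uint64{
		uint64(vertexCount), uint64(instanceCount),
		uint64(firstVertex), uint64(firstInstance),
	}})
}

// Backend is a hypothetical stand-in for the C side that calls vkCmd*.
type Backend interface {
	BindPipeline(pipeline uint64)
	Draw(vertexCount, instanceCount, firstVertex, firstInstance uint32)
}

// Replay dispatches every recorded command in order, resets the recorder,
// and returns how many commands were replayed.
func (r *Recorder) Replay(b Backend) int {
	n := len(r.cmds)
	for _, c := range r.cmds {
		switch c.Tag {
		case TagBindPipeline:
			b.BindPipeline(c.Args[0])
		case TagDraw:
			b.Draw(uint32(c.Args[0]), uint32(c.Args[1]),
				uint32(c.Args[2]), uint32(c.Args[3]))
		}
	}
	r.cmds = r.cmds[:0]
	return n
}

// logBackend records calls as text so the replay order can be inspected.
type logBackend struct{ log []string }

func (l *logBackend) BindPipeline(p uint64) {
	l.log = append(l.log, fmt.Sprintf("bind %d", p))
}

func (l *logBackend) Draw(vc, ic, fv, fi uint32) {
	l.log = append(l.log, fmt.Sprintf("draw %d %d %d %d", vc, ic, fv, fi))
}

func main() {
	var r Recorder
	r.BindPipeline(1)
	r.Draw(3, 1, 0, 0)
	r.BindPipeline(2)
	r.Draw(6, 1, 0, 0)
	b := &logBackend{}
	n := r.Replay(b)
	fmt.Println(n, b.log)
}
```

Unlike the indirect-draw variant, binds and draws interleave freely here, at the price of decoding every record a second time on the C side.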
This hoop-jumping motivates us to write a considerable portion of the
program in C, which is not necessarily what we desire, which in
turn motivates us to find ways to call classes of C programs that we
hope will not misbehave, with overhead that is significantly less than
that of Cgo.
rustgo: calling Rust from Go with near-zero overhead
I recalled stumbling upon https://blog.filippo.io/rustgo/ and reread
the article. The approach described is simple but extremely dangerous
for opaque C calls (which is what vkCmd* are). We have no idea how
much stack the C function is going to need, and even if we made a
conservative guess, we would still want a stack guard at the
bottom, to be safe.
https://github.com/minio/c2goasm lets us do what is described in the
article in a less involved manner but (as far as I understood) assumes
that the C function is a leaf function.
I never tried installing stack guards on goroutine stacks in Go, but in
my own exercise of re-implementing rsc's libtask I tried mprotecting
4k at the bottom of the stack to PROT_READ. I noted that I couldn't
have more than about 15k tasks. This is due to the default limit of
about 32k memory maps in Linux. mprotecting to PROT_READ and then
restoring the protection at each context switch removed this handicap
but slowed down context switching considerably. In Go, I imagine, a
stack guard for C functions could be done as follows:
Withguard(func() {
    // C functions are called here in the way described in the rustgo article
})
where Withguard would ask for a large morestack + 8k at the bottom,
install a PROT_READ page somewhere in the 8k part near the bottom of
the stack, call the function passed to it, restore the page protection
and leave. But I speculate this would lead to deadly interactions with
the GC, such as occasional segfaults whenever the GC, for whatever
unknown reason, accesses the PROT_READ part of the stack.
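To make the arm/restore cycle concrete, here is a small sketch of the mprotect part only. It operates on a separately mmapped region, not on a goroutine stack, so it says nothing about the GC interactions speculated about above. GuardedRegion, Arm, Disarm and WithGuard are hypothetical names; the syscall package calls are the standard ones, and Linux is assumed.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// GuardedRegion is an anonymous mapping whose lowest page can be made
// read-only for the duration of a call. It is NOT a goroutine stack.
type GuardedRegion struct {
	mem  []byte
	page int
}

// NewGuardedRegion maps at least two pages, rounded up to a page multiple.
func NewGuardedRegion(size int) (*GuardedRegion, error) {
	page := os.Getpagesize()
	if size < 2*page {
		size = 2 * page
	}
	size = (size + page - 1) / page * page
	mem, err := syscall.Mmap(-1, 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		return nil, err
	}
	return &GuardedRegion{mem: mem, page: page}, nil
}

// Arm makes the lowest page read-only, so a write into it faults.
func (g *GuardedRegion) Arm() error {
	return syscall.Mprotect(g.mem[:g.page], syscall.PROT_READ)
}

// Disarm restores read-write protection on the lowest page.
func (g *GuardedRegion) Disarm() error {
	return syscall.Mprotect(g.mem[:g.page], syscall.PROT_READ|syscall.PROT_WRITE)
}

// WithGuard arms the guard page, runs f, and disarms it again.
func (g *GuardedRegion) WithGuard(f func()) (err error) {
	if err = g.Arm(); err != nil {
		return err
	}
	defer func() {
		if derr := g.Disarm(); err == nil {
			err = derr
		}
	}()
	f()
	return nil
}

// Close unmaps the region.
func (g *GuardedRegion) Close() error {
	return syscall.Munmap(g.mem)
}

func main() {
	g, err := NewGuardedRegion(64 * 1024)
	if err != nil {
		fmt.Println("mmap:", err)
		return
	}
	defer g.Close()
	err = g.WithGuard(func() {
		g.mem[g.page] = 1 // just above the guard page: allowed
		// Writing g.mem[0] here would fault, since the page is read-only.
	})
	fmt.Println("guarded call returned, err =", err)
}
```

Each call pays two mprotect system calls, which matches the context-switch slowdown described for the libtask experiment.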
runtime.systemstack
I stumbled upon this function when exploring the Go runtime. It lets me
achieve the things I wanted Withguard for. It has an important
caveat: while we're on the system stack, we may not be preempted (also
no defers). This means that we probably should not allocate anything,
for fear of dangerous interactions with the GC. We should also probably
leave the system stack soon, because I suspect the GC will not be able
to see the roots
of the goroutine that switched to s