Language-neutral interface specifications (research)

Jakob Stoklund Olesen Tue, 12 Jul 2022 00:37:56 -0700

I saw this project on the wiki, and it reminds me of a problem I have
been trying to understand better:
    https://wiki.netbsd.org/projects/project/language-neutral-interfaces/


I am a retired compiler engineer. I used to work on LLVM and other
compilers, including LLVM's code generator and register allocators. I
brought up the sparc64 support and implemented the System V ABI for that
architecture in Clang and LLVM. I haven't contributed directly to NetBSD
before, but my name appears once or twice in src/external.

I am building an Ada compiler as a hobby project. I realize this is a
huge undertaking that I will probably never finish. It's fun anyway. The
compiler will generate native code for various platforms and
architectures, but to help with bootstrapping, I also want it to be able
to generate portable C++11/POSIX code. This means I have to think about
the difference between API and ABI on POSIX platforms.


As I see it, there are three levels of interface definition to consider:

1. The API level, or source code compatibility level. Standards like
POSIX tend to describe interfaces in terms of what your C source code
should look like:

    #include <sys/stat.h>
    #include <errno.h>

    int check_dir(const char *pathname)
    {
        struct stat buffer;

        if (stat(pathname, &buffer) != 0)
            return errno;

        if (S_ISDIR(buffer.st_mode))
            return 0;
        else
            return ENOTDIR;
    }

POSIX doesn't specify the exact contents of struct stat nor the value of
ENOTDIR. It only defines that you can use those symbols and struct
members in C code after including the right headers.


2. The Pure-C level. The C compiler lowers the C code to something
self-contained. It will:

    - Run the preprocessor,
    - Expand typedefs, and
    - Expand inline functions.

(I'm not saying compilers literally work this way, but they behave as
if these steps happened.)

This leaves code consisting of only C primitives:

    struct stat {
        int st_mode;
        ...
    };
    extern __tls int errno;

    int check_dir(const char *pathname)
    {
        struct stat buffer;

        if (stat(pathname, &buffer) != 0)
            return errno;

        if ((buffer.st_mode & 0170000) == 0040000)
            return 0;
        else
            return 20;
    }

This Pure-C code is not portable. It differs between platforms and
architectures, and even some compiler flags can affect it. Note that
Pure-C also doesn't have to be standard C. It is common to use
vendor-specific extensions like __attribute__ to get the right
alignments and calling conventions. I see NetBSD sources are using a
__RENAME macro to change linkage names.
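
To make the lowering concrete, here is a minimal sketch that compares
the API-level macro with a hand-expanded mask test on the host. The
helper names is_dir_expanded, forms_agree and print_host_constants are
hypothetical. S_IFMT and S_IFDIR are the symbolic names (XSI in POSIX)
for the two octal constants in the example above; their numeric values
are whatever the host's <sys/stat.h> says, and the assumption that
S_ISDIR() expands to this mask test is typical but not guaranteed.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* ask for the XSI names S_IFMT and S_IFDIR */
#endif
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: the mask test that S_ISDIR() typically expands to. */
int is_dir_expanded(mode_t mode)
{
    return (mode & S_IFMT) == S_IFDIR;
}

/* Returns 1 if the API-level macro and the expanded form agree. */
int forms_agree(mode_t mode)
{
    return (S_ISDIR(mode) != 0) == is_dir_expanded(mode);
}

/* Print the host's values; these are the numbers Pure-C hard-codes. */
void print_host_constants(void)
{
    printf("S_IFMT  = 0%lo\n", (unsigned long)S_IFMT);
    printf("S_IFDIR = 0%lo\n", (unsigned long)S_IFDIR);
}
```

Calling print_host_constants() on two different platforms is a quick
way to see whether their Pure-C forms of check_dir() would match.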


3. The ABI level. Documents like the System V ABI describe how Pure-C is
translated to machine code calls. This includes size and alignment of
primitive types, layout of structs, and how arguments and return values
are passed in function calls.
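
As a rough illustration of what the ABI level pins down, the following
sketch asks the host compiler for a few layout facts about struct stat.
The type layout_report and both function names are hypothetical, and
the sketch assumes a C11 compiler for _Alignof. The numbers it reports
are only valid for the platform, architecture and compiler flags used
to build it.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical record of layout facts fixed by the ABI for one platform. */
struct layout_report {
    size_t stat_size;        /* sizeof(struct stat) */
    size_t stat_align;       /* alignment requirement of struct stat */
    size_t st_mode_offset;   /* byte offset of the st_mode member */
    size_t st_mode_size;     /* size in bytes of the st_mode member */
};

/* Collect the layout facts as seen by the compiler building this file. */
struct layout_report probe_stat_layout(void)
{
    struct layout_report r;

    r.stat_size = sizeof(struct stat);
    r.stat_align = _Alignof(struct stat);
    r.st_mode_offset = offsetof(struct stat, st_mode);
    r.st_mode_size = sizeof(((struct stat *)0)->st_mode);
    return r;
}

/* Print the facts in a human-readable form. */
void print_stat_layout(void)
{
    struct layout_report r = probe_stat_layout();

    printf("sizeof(struct stat)   = %zu\n", r.stat_size);
    printf("alignof(struct stat)  = %zu\n", r.stat_align);
    printf("offsetof(st_mode)     = %zu\n", r.st_mode_offset);
    printf("sizeof(st_mode)       = %zu\n", r.st_mode_size);
}
```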


The translation from Pure-C to ABI is a pretty well understood problem.
There are ABI documents describing the standard stuff, and you may have
to deal with a few compiler extensions for alignment, SIMD types, and
alternate calling conventions. This is all stuff that compilers need to
deal with anyway.

The translation from API level to Pure-C level is more difficult (for
me, anyway). It seems like you basically have to run C code through the
system C compiler to make sure you covered all the corner cases. It is
not very satisfying for an Ada compiler to have to depend on the system
C compiler in order to generate binary code that interacts with the
system.
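
The dependency described above usually takes the form of a small probe
program. Here is a minimal sketch, assuming the binding generator is
willing to compile and run C on the target: it prints host-specific
values as Ada named numbers. The package name Host_Constants and the
function names are hypothetical, and a real probe would have to cover
far more than three constants.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* ask for the XSI names S_IFMT and S_IFDIR */
#endif
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/stat.h>

/* Format one Ada named number, e.g. "   ENOTDIR : constant := 20;".
   Returns what snprintf returns: the length of the formatted line. */
int format_ada_constant(char *buf, size_t size,
                        const char *name, unsigned long value)
{
    return snprintf(buf, size, "   %s : constant := %lu;", name, value);
}

/* Emit a hypothetical Ada package spec with values taken from the
   host's headers at the time this file is compiled. */
void emit_host_constants(FILE *out)
{
    char line[128];

    fprintf(out, "package Host_Constants is\n");
    format_ada_constant(line, sizeof line, "ENOTDIR",
                        (unsigned long)ENOTDIR);
    fprintf(out, "%s\n", line);
    format_ada_constant(line, sizeof line, "S_IFMT",
                        (unsigned long)S_IFMT);
    fprintf(out, "%s\n", line);
    format_ada_constant(line, sizeof line, "S_IFDIR",
                        (unsigned long)S_IFDIR);
    fprintf(out, "%s\n", line);
    fprintf(out, "end Host_Constants;\n");
}
```

Calling emit_host_constants(stdout) produces a spec that is correct for
the build host only, which is exactly the coupling to the system C
compiler that feels unsatisfying.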

Ada actually has a standardized foreign function interface that can be
used to interface with C and Fortran. The problem is that it interfaces
to the Pure-C level, not the API level. I can't use it to call stat() in
a portable way.

I am interested in an Interface Description Language that can be used
to:

    a. Define the source-level API in a way that is detached from the C
       language.
    b. Define stronger types than C allows: S_ISDIR() is only supposed
       to work on a mode_t returned from stat(), not any old integer.
    c. Define data flow better than C allows: The struct stat* argument
       to stat() is only meant to move data out of the function. The
       pathname pointer isn't captured by the function call.
    d. Describe how the API level gets translated to the Pure-C level or
       similar. This is different for different platforms and
       architectures.
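
Points (b) and (c) can be partly approximated in C itself, which may
help show what the IDL would need to express. This is only a sketch
under my own assumptions: file_mode, file_mode_result, get_file_mode
and file_mode_is_dir are hypothetical names, not part of POSIX or of
any existing binding.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* (b) A distinct type: a plain integer cannot be passed where a
   file_mode is expected without an explicit conversion. */
typedef struct { mode_t raw; } file_mode;

/* (c) The data that stat() moves out is returned by value, so the
   direction of the data flow is visible in the signature. */
typedef struct {
    int error;        /* 0 on success, otherwise an errno value */
    file_mode mode;   /* meaningful only when error == 0 */
} file_mode_result;

/* pathname is only read during the call and is not retained. C cannot
   express that, so it is stated here as a comment. */
file_mode_result get_file_mode(const char *pathname)
{
    struct stat buffer;
    file_mode_result result = { 0, { 0 } };

    if (stat(pathname, &buffer) != 0)
        result.error = errno;
    else
        result.mode.raw = buffer.st_mode;
    return result;
}

/* Only accepts a file_mode, not any old integer. */
int file_mode_is_dir(file_mode mode)
{
    return S_ISDIR(mode.raw) != 0;
}
```

The wrapper struct gives some type safety, but the capture and
direction properties still live in comments, which is the gap an IDL
would have to close.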

This IDL would make it possible for me to generate portable Ada bindings
for POSIX and other APIs without having to rely on the system C
compiler. I think it could also be useful for a project like NetBSD to
be able to track source compatibility and binary compatibility
individually.


I haven't done anything concrete yet, and I agree that it is a good idea
to research prior art. There is a lot of it:

- Wikipedia has a long list of IDLs:
    https://en.wikipedia.org/wiki/Interface_description_language
- CORBA IDL: https://www.omg.org/spec/IDL
- ASN.1: https://www.itu.int/en/ITU-T/asn1/Pages/asn1_project.aspx
- Apache Thrift: https://thrift.apache.org
- Google protobuf: https://developers.google.com/protocol-buffers/
- Zephyr ASDL:
    https://www.usenix.org/conference/dsl-97/zephyr-abstract-syntax-description-language
- SWIG: https://swig.org
- Rust's bindgen: https://rust-lang.github.io/rust-bindgen/

I don't know if any of these are usable for us. IDLs are often
associated with a specific protocol or data format, and they tend to
expose the quirks of those use cases. They are good for defining new
things, but not so good for describing existing things. This is why
there are so many IDLs.

It seems to me that we have slightly different use cases for an IDL, but
there is a lot of overlap, and it could be possible to use the same
language for both. I am not looking to do a GSOC project or anything like
that, but I would like to research this some more, and perhaps learn
from your expertise.

In terms of existing IDLs, do you have anything in mind that you think
could work?

Best,
Jakob
