Thanks Uroš for reporting and Faidon for the analysis!

On Sun, 18 Aug 2024 21:15:23 +0300 Faidon
Liambotis <parav...@debian.org> wrote:

> On Sun, Aug 18, 2024 at 04:47:39PM +0200, Uroš Knupleš wrote:
>
> [...]
> 
> > Interestingly, this kernel message pops up every time a container
> > is brought up as a non-root user:
> > 
> > [  361.611472] audit: type=1326 audit(1723988353.266:23): auid=1000 
> > uid=1000 gid=1000 ses=1 subj=pasta pid=1394 comm="pasta" 
> > exe="/usr/bin/pasta" sig=31 arch=40000003 syscall=403 compat=0 
> > ip=0xb7fb0579 code=0x80000000
> 
> This is indeed the smoking gun. You can parse these messages manually
> (by looking at audit.h, syscall etc. values in headers), or just install
> auditd (apt install auditd), and tail /var/log/audit/audit.log.
> 
> In this case, 1326 is type=SECCOMP, and syscall 403 is
> "clock_gettime64".

Right, and this terminates the process right away, as clock_gettime()
is called at every event (inbound data or outbound packets received),
in passt.c, after the seccomp filter is in place (isolate_postfork(),
isolation.c).
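
For anybody who wants to decode such a record without installing auditd,
here is a minimal Python sketch. The constants are the ones I'd expect
from the kernel UAPI headers (audit.h, seccomp.h); the syscall table is
deliberately limited to the single i386 number from this report, and the
helper names (parse_audit, describe) are made up for illustration.

```python
import re

# Values as defined in the Linux UAPI headers (audit.h, seccomp.h, signal.h).
AUDIT_SECCOMP = 1326                   # audit record type for seccomp actions
AUDIT_ARCH_I386 = 0x40000003           # EM_386 | __AUDIT_ARCH_LE
SIGSYS = 31                            # signal delivered on a seccomp kill
SECCOMP_RET_KILL_PROCESS = 0x80000000  # filter return action

# Only the one i386 syscall number discussed here, not a full table.
I386_SYSCALLS = {403: "clock_gettime64"}

RECORD = (
    'audit: type=1326 audit(1723988353.266:23): auid=1000 '
    'uid=1000 gid=1000 ses=1 subj=pasta pid=1394 comm="pasta" '
    'exe="/usr/bin/pasta" sig=31 arch=40000003 syscall=403 compat=0 '
    'ip=0xb7fb0579 code=0x80000000'
)


def parse_audit(record):
    """Return the key=value fields of an audit record as a dict of strings."""
    pairs = re.findall(r'(\w+)=("[^"]*"|\S+)', record)
    return {key: value.strip('"') for key, value in pairs}


def describe(record):
    """Interpret the fields that matter for a seccomp violation."""
    f = parse_audit(record)
    return {
        "seccomp": int(f["type"]) == AUDIT_SECCOMP,
        "i386": int(f["arch"], 16) == AUDIT_ARCH_I386,
        "sigsys": int(f["sig"]) == SIGSYS,
        "kill_process": int(f["code"], 16) == SECCOMP_RET_KILL_PROCESS,
        "syscall": I386_SYSCALLS.get(int(f["syscall"]), "unknown"),
        "comm": f["comm"],
    }


if __name__ == "__main__":
    for key, value in describe(RECORD).items():
        print(key, value)
```

The arch field matters here: syscall numbers differ between
architectures, so 403 only means clock_gettime64 once arch=40000003 has
been matched to i386.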

> It looks like the passt source code includes a shell script, that parses
> "syscall:" comments and generates seccomp filters for them. (It does not
> use libseccomp).

Correct, we don't use libseccomp for several reasons, including the
advantage of this mechanism based on comments, but also to avoid a
dependency, and to optimise the system call lookup tree in the BPF
program to our (simple) needs.

> In this case, there is a comment that states:
>   * #syscalls clock_gettime arm:clock_gettime64
> ...but on i386, and likely other 32-bit architectures (like 32-bit arm,
> which is seemingly already handled), glibc's clock_gettime() is wrapping
> the clock_gettime64 syscall.

I tested the full functionality on armhf quite recently, so I don't
think there should be issues with this, but I'll give it another run.
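
To illustrate why the annotation quoted above covers arm but not i686,
here is a rough Python model of the selection rule. This is not
seccomp.sh (which is a shell script and handles more cases); it only
assumes that an unprefixed name applies to every architecture and that
an "arch:name" token applies to that architecture alone. The function
name syscalls_for is hypothetical.

```python
def syscalls_for(arch, lines):
    """Collect syscall names from '#syscalls' comment lines for one arch."""
    allowed = set()
    for line in lines:
        _, marker, rest = line.partition("#syscalls")
        if not marker:
            continue                      # not an annotation line
        for token in rest.split():
            prefix, sep, name = token.partition(":")
            if not sep:
                allowed.add(token)        # no prefix: applies to every arch
            elif prefix == arch:
                allowed.add(name)         # prefix matches the target arch
    return allowed


ORIGINAL = [" * #syscalls clock_gettime arm:clock_gettime64"]
PATCHED = [" * #syscalls clock_gettime arm:clock_gettime64 i686:clock_gettime64"]

if __name__ == "__main__":
    print(sorted(syscalls_for("i686", ORIGINAL)))
    print(sorted(syscalls_for("i686", PATCHED)))
```

With the original line, an i686 build only gets clock_gettime, so the
clock_gettime64 call made by glibc falls outside the filter.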

> Adding i686:clock_gettime64¹ to that line addresses this specific
> occurrence, but moves the goalpost a bit further: after a few iterations,
> I found that the "fcntl64", "socketcall" and "recvmmsg_time64" also need
> to be allowlisted.

Thanks, that will definitely save some time.

> By adjusting source code comments to add these 4 syscalls in their
> relevant spots and rebuilding passt, I managed to get "podman run --rm
> -it" to work on i386. Note however that this is a rudimentary test and
> for example only exercises the "pasta" code path; someone more familiar
> with passt/pasta should probably verify other code paths as well. It'd
> be a good idea to involve upstream.

I'll run the full test suite on i686 and check if anything is missing.
Unfortunately, I can't easily turn the existing upstream test suite
into an autopkgtest, because it's rather complicated as it involves
setting up guests with throughput tests.

But we're working on a new approach to the test framework that should
eventually enable some degree of modularity, and make running it as part
of autopkgtests feasible.

> Hope this helps!
> Faidon
> 
> 1: "i686" because seccomp.sh calls `uname -m` if there is no TARGET
> specified, which I think is a (cross-)portability bug of its own...

On Debian builds (and I guess most other distribution builds, really)
TARGET is always passed. We use `uname -m` as a fallback option only,
so that if you just want to type 'make', you can do that. See also
https://salsa.debian.org/reproducible-builds/reprotest/-/issues/6#note_386163
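
The fallback logic amounts to something like the Python sketch below.
It is only a model of the behaviour described above, not the actual
build scripts: target_arch is a made-up name, and the exact format of
the TARGET value is an assumption.

```python
import os


def target_arch(env, fallback=lambda: os.uname().machine):
    """Return env['TARGET'] if set and non-empty, else the host machine name.

    The default fallback is the equivalent of `uname -m`, which describes
    the build host and not necessarily the architecture being built for.
    """
    return env.get("TARGET") or fallback()


if __name__ == "__main__":
    print(target_arch(dict(os.environ)))
```

If a 32-bit package is built on a 64-bit host without TARGET, the
fallback reports the host architecture, which is the portability concern
raised in the footnote.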

-- 
Stefano
