[OE-core] pseudo database integrity checking

Richard Purdie Mon, 27 Jun 2022 06:52:47 -0700

I've been worrying a bit about pseudo. Since we made it stricter about
inode mismatches, we see a trickle of reports of pseudo aborts
(fakeroot tasks showing exit 134, which is 128 + 6, i.e. SIGABRT).
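
For illustration, a minimal sketch of how an exit status such as 134 maps
back to a signal, assuming the usual shell convention of 128 + signal
number. The helper name describe_exit_status is hypothetical and is not
part of bitbake or pseudo.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Map a shell-style exit status to a signal name where possible.

    Assumes the common convention that a status above 128 means the
    process was killed by signal (status - 128).
    """
    if status > 128:
        signum = status - 128
        try:
            return signal.Signals(signum).name
        except ValueError:
            return f"unknown signal {signum}"
    return f"exit code {status}"


if __name__ == "__main__":
    print(describe_exit_status(134))
```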


The issue occurs when a file in the pseudo database is removed outside
of pseudo's context. The filesystem can then reuse the inode stored in
the database for a new file, which triggers path mismatch errors.

Since pseudo is an LD_PRELOAD, even getting a sensible error to the
user is hard. The error occurs in the pseudo server process (which has
the database) and is reported back over a connection to the library
code wrapping some libc call in some user application. All we can
really do is abort(); we can't print to stdout/stderr since we don't
even know whether they are available or where the output might go.

One of the worries is about build determinism. Rather than randomly
hitting these issues, could we hit them more consistently? There are
two and a half ideas I've had there:

a) Adding in a startup DB integrity check. I have a patch which does
this, i.e. when the server loads, it just exits if the DB inodes don't
match those on disk. The trouble is the server is usually spawned
through some application making a glibc call, so reporting any sensible
error is near impossible; we can only abort(). We can put a decent
error in pseudo.log but that isn't something seen on the console, which
is particularly problematic for CI. Locally in testing, I do see
occasional issues with missing files in /tmp/ with this.
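
As a rough sketch of what such an integrity check amounts to, the Python
below compares recorded (device, inode) pairs against what is on disk.
It is not the patch mentioned above: the DbEntry record layout and the
find_mismatches helper are hypothetical, and the real check would read
pseudo's own database rather than an in-memory list.

```python
import os
from typing import Iterable, List, NamedTuple


class DbEntry(NamedTuple):
    # Hypothetical record layout for illustration, not pseudo's schema.
    path: str
    dev: int
    ino: int


def find_mismatches(entries: Iterable[DbEntry]) -> List[str]:
    """Return a description of every entry that disagrees with the disk."""
    problems = []
    for entry in entries:
        try:
            # lstat so that symlinks are checked themselves, not their targets.
            st = os.lstat(entry.path)
        except FileNotFoundError:
            problems.append(f"{entry.path}: missing on disk")
            continue
        if (st.st_dev, st.st_ino) != (entry.dev, entry.ino):
            problems.append(
                f"{entry.path}: inode mismatch "
                f"(db {entry.dev}:{entry.ino}, disk {st.st_dev}:{st.st_ino})"
            )
    return problems
```

Returning the full list of problems, rather than stopping at the first
one, would let whatever reports the error show everything that is wrong
in one go.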

The second issue here is the server startup retry code. It takes pseudo
about 80s to time out when starting a server, due to the backoff+retry
algorithm it understandably has. bitbake sits looking confused during
this time (no tasks running) as the worker processes never report in.

b) We could add a new command to run an integrity check on the DB to
pseudo. If we do that, we would then be able to show the user a decent
error and avoid the timeout issue. The question is where/when to
trigger it and whether races could occur against the check (e.g. where
multiple fakeroot tasks are running in parallel against the same
WORKDIR).

c) We could add specialist code to bitbake such that when a fakeroot
worker exits with 134, we dump the tail end of the pseudo log if
present. That doesn't directly fix the issue but would help users debug
problems. This does come at a cost of making the bitbake code pseudo
specific.
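
A minimal sketch of idea c), assuming the caller already knows the
worker's exit status and where pseudo.log lives. The function name
pseudo_log_tail, its parameters, and the 20-line default are all
hypothetical; none of this is existing bitbake code.

```python
from collections import deque
from pathlib import Path
from typing import List

PSEUDO_ABORT_STATUS = 134  # 128 + SIGABRT (6)


def pseudo_log_tail(exit_status: int, log_path: str, lines: int = 20) -> List[str]:
    """Return the last `lines` lines of the pseudo log after an abort.

    Returns an empty list if the worker did not exit with 134 or if the
    log file is not present.
    """
    if exit_status != PSEUDO_ABORT_STATUS:
        return []
    path = Path(log_path)
    if not path.is_file():
        return []
    with path.open(errors="replace") as f:
        # A bounded deque keeps only the final `lines` lines in memory.
        return [line.rstrip("\n") for line in deque(f, maxlen=lines)]
```

The caller would then print the returned lines alongside the task
failure, so the pseudo error shows up on the console and in CI logs.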


Unfortunately the position of pseudo maintainer is effectively open. I
know some people have expressed interest, but nobody is really working
on issues like this. I am open to people's thoughts on the ideas above,
or on whether there is some other approach anyone can see...

Cheers,

Richard
