Hi, On Wed, Sep 16, 2020 at 11:14:58PM +0200, Pierre-Elliott Bécue wrote: > Le mercredi 16 septembre 2020 à 22:08:24+0200, Pierre-Elliott Bécue a écrit : > > Le vendredi 11 septembre 2020 à 11:12:25+0200, Iñaki Malerba a écrit : > > > Hi Pebs, > > > > > > Thanks for checking this. > > > > > > On Sat, 5 Sep 2020 23:23:30 +0200 Pierre-Elliott =?utf-8?B?QsOpY3Vl?= > > > <p...@debian.org> wrote:> > > > > LXC's devs told me that 4.0.4 should solve it. I'm uploading this > > > > release now. Please don't hesitate to tell me if it helps. > > > > > > Run a pipeline removing the pinning of lxc, and the behaviour seems to > > > be the same. > > > > > > Image building: > > > https://salsa.debian.org/ina/pipeline/-/jobs/990332 > > > > Setting up lxc (1:4.0.4-1) .. > > > > > > Running lxc: > > > https://salsa.debian.org/ina/pipeline/-/jobs/990352 > > > > <VirtSubproc>: failure: ['sudo', 'timeout', '600', 'lxc-stop', > > > '--quiet', '--kill', '--name', 'ci-254-b2fcad5f'] failed (exit status 1, > > > stderr '') > > > > > > Please let me know if you want us to test something else. > > > > > > Abrazos, > > > > Could you get me a full trace like the previous time? I have no > > technical means of running proper tests currently, sorry. :/ > > > > Cheers! > > I found a way to run tests on my own. > > Turns out I tried to add a lxc-attach autopkgtest-stable-amd64 -- ps > auxf to see the process tree in case I could find something useful, > and… the container successfully stopped that time. I retried and it kept > working. > > The process tree I see is: > ─( 23:09:35 )─< /home/becue/tmp > >───────────────────────────────────────────────[ 0 ]─ > root@dawaj # docker run --rm --privileged -i autopkgtest > Starting LXC network bridge: :Starting LXC autoboot containers: :USER > PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND > root 4 0.0 0.0 7644 2760 ? R 21:09 0:00 ps auxf > root 1 0.0 0.0 20904 7492 ? Ds 21:09 0:00 /sbin/init > ok > > After some more tests, it seems that lxc-start && lxc-stop isn't working > properly because the signal is sent before the container is ready to > handle it. > > After this test I decided to add a sleep 2 before the lxc-attach ... -- > ps command: > > Starting LXC network bridge: :Starting LXC autoboot containers: :USER > PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND > root 52 0.0 0.0 7644 2768 ? R 21:10 0:00 ps auxf > root 1 2.5 0.1 21524 9596 ? Ss 21:10 0:00 /sbin/init > root 17 0.5 0.1 27444 8404 ? Ss 21:10 0:00 > /lib/systemd/systemd-journald > root 27 0.0 0.0 2348 1772 ? Ss 21:10 0:00 /sbin/ifup > -a --read-environment > root 42 0.0 0.0 2392 764 ? S 21:10 0:00 \_ > /bin/sh -c /sbin/dhclient -4 -v -i -pf /run/dhclient.eth0.pid -lf > /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases > eth0 . > root 43 0.0 0.0 8456 1936 ? S 21:10 0:00 \_ > /sbin/dhclient -4 -v -i -pf /run/dhclient.eth0.pid -lf > /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases > eth0 > root 44 0.0 0.0 9492 5644 ? S 21:10 0:00 \_ > /sbin/dhclient -4 -v -i -pf /run/dhclient.eth0.pid -lf > /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases > eth0 > message+ 50 0.0 0.0 8696 3636 ? Ss 21:10 0:00 > /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile > --systemd-activation --syslog-only > root 51 0.5 0.0 19308 6376 ? Ss 21:10 0:00 > /lib/systemd/systemd-logind > ok > > Turns out your lxc-stop is really fast, and therefore, not catched > properly by LXC. > > While I appreciate it shouldn't be a corner case that makes things > explode, do you think there's a way to take this realization into > account to lower the severity of this bug, having a temporary fix set up > in place? > > I'll still try to see what upstream could offer to handle this in a > better way.
I did a little bit of debugging on this today. I think the provided Dockerfile does not really reproduce the real issue. The lxc-stop call from lxc comes after the tests are run, so plenty of time after the lxc-start call. I installed autopkgtest inside the container, did manually the mount steps listed in the Dockerfile, and tried it directly. Failed as expected: root@665d38b2f9e3:~/pkg# autopkgtest -B . -- lxc autopkgtest-stable-amd64 autopkgtest [12:41:01]: starting date: 2020-09-17 autopkgtest [12:41:01]: version 5.14 autopkgtest [12:41:01]: host 665d38b2f9e3; command line: /usr/bin/autopkgtest -B . -- lxc autopkgtest-stable-amd64 autopkgtest [12:41:07]: testbed dpkg architecture: amd64 autopkgtest [12:41:08]: testbed running kernel: Linux 5.8.0-1-amd64 #1 SMP Debian 5.8.7-1 (2020-09-05) autopkgtest [12:41:08]: @@@@@@@@@@@@@@@@@@@@ built-tree . autopkgtest [12:41:08]: test command1: preparing testbed autopkgtest [12:41:08]: test command1: true autopkgtest [12:41:08]: test command1: [----------------------- autopkgtest [12:41:09]: test command1: -----------------------] autopkgtest [12:41:09]: test command1: - - - - - - - - - - results - - - - - - - - - - command1 PASS autopkgtest [12:41:09]: @@@@@@@@@@@@@@@@@@@@ summary command1 PASS <VirtSubproc>: failure: ['lxc-stop', '--quiet', '--kill', '--name', 'autopkgtest-lxc-jodrho'] failed (exit status 1, stderr '') autopkgtest [12:41:09]: ERROR: autopkgtest : error cleaning up: autopkgtest [12:41:09]: ERROR: testbed failure: cannot send to testbed: [Errno 32] Broken pipe Note however the despite lxc-stop call returning non-zero, the container is stopped. I then tried starting the container again, and stopping it by hand. Failed as well: root@665d38b2f9e3:~/pkg# lxc-start --name autopkgtest-lxc-jodrho root@665d38b2f9e3:~/pkg# lxc-attach --name autopkgtest-lxc-jodrho -- sh -c 'sleep 1 && echo HI' HI root@665d38b2f9e3:~/pkg# lxc-stop --kill --name autopkgtest-lxc-jodrho lxc-stop: autopkgtest-lxc-jodrho: commands.c: lxc_cmd_stop: 775 No such file or directory - Failed to stop container "autopkgtest-lxc-jodrho" Note however that this time I omitted the --quiet flag that autopkgtest uses, which gives us at least some extra info. No such file of directory. I tried lxc-stop with the --kill flag: root@665d38b2f9e3:~/pkg# lxc-start --name autopkgtest-lxc-jodrho root@665d38b2f9e3:~/pkg# lxc-attach --name autopkgtest-lxc-jodrho -- sh -c 'sleep 1 && echo HI' HI root@665d38b2f9e3:~/pkg# lxc-stop --name autopkgtest-lxc-jodrho Works! So this seems related to the --kill flag. I then tried to strace the lxc-stop --kill call. By the end of the log, there is this: openat(AT_FDCWD, "/lxc/autopkgtest-lxc-jodrho/partial", O_RDWR|O_CLOEXEC) = -1 ENOENT (No such file or directory) socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3 connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/autopkgtest-lxc-jodrho/command"}, 38) = 0 getpid() = 3567 getuid() = 0 getgid() = 0 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3567, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16 recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\232\r\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16 close(3) = 0 socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3 connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/autopkgtest-lxc-jodrho/command"}, 38) = 0 getpid() = 3567 getuid() = 0 getgid() = 0 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3567, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16 recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16 close(3) = 0 socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3 connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/autopkgtest-lxc-jodrho/command"}, 38) = 0 getpid() = 3567 getuid() = 0 getgid() = 0 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3567, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16 recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16 close(3) = 0 write(2, "lxc-stop: autopkgtest-lxc-jodrho"..., 34lxc-stop: autopkgtest-lxc-jodrho: ) = 34 write(2, "commands.c: lxc_cmd_stop: 775 ", 30commands.c: lxc_cmd_stop: 775 ) = 30 write(2, "No such file or directory - Fail"..., 77No such file or directory - Failed to stop container "autopkgtest-lxc-jodrho") = 77 write(2, "\n", 1 ) = 1 exit_group(1) = ? +++ exited with 1 +++ First I though that it could be related to /lxc/autopkgtest-lxc-jodrho/partial, but strace'ing a lxc-stop call without --kill, there is a similar error, so that's not it. For comparison, the end of the logs in that case look like this: openat(AT_FDCWD, "/lxc/test/partial", O_RDWR|O_CLOEXEC) = -1 ENOENT (No such file or directory) socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3 connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0 getpid() = 3926 getuid() = 0 getgid() = 0 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16 recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\f\17\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16 close(3) = 0 socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3 connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0 getpid() = 3926 getuid() = 0 getgid() = 0 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16 recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16 close(3) = 0 socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3 connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0 getpid() = 3926 getuid() = 0 getgid() = 0 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16 recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16 close(3) = 0 socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3 connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0 getpid() = 3926 getuid() = 0 getgid() = 0 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\22\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16 recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[4]}], msg_controllen=24, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16 close(3) = 0 socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3 connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0 getpid() = 3926 getuid() = 0 getgid() = 0 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16 recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\f\17\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16 close(3) = 0 openat(AT_FDCWD, "/proc/3852/status", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 read(3, "Name:\tsystemd\nUmask:\t0000\nState:"..., 1024) = 1024 close(3) = 0 socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3 connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0 getpid() = 3926 getuid() = 0 getgid() = 0 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\n\0\0\0 \0\0\0\320\210s\352\376\177\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16 sendto(3, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32, MSG_NOSIGNAL, NULL, 0) = 32 recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\10\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16 pidfd_send_signal(4, SIGRT_5, NULL, 0) = 0 poll([{fd=4, events=POLLIN}], 1, 60000) = 1 ([{fd=4, revents=POLLIN}]) setsockopt(3, SOL_SOCKET, SO_RCVTIMEO_OLD, "<\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 recvfrom(3, "", 264, 0, NULL, NULL) = 0 close(3) = 0 close(4) = 0 exit_group(0) = ? +++ exited with 0 +++ I just ran out of time. Maybe the above is useful for someone else to figure the issue out.
signature.asc
Description: PGP signature