[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
Update: we've had this deployed at scale since my last comment and haven't noticed any such issues. I don't know if this information will be of use to anyone, but at least our experience with this patch has been a good one. -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to apt in Ubuntu. https://bugs.launchpad.net/bugs/2003851 Title: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04 Status in apt package in Ubuntu: Confirmed Bug description: Hi! Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms. We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks. Example: ``` # apt-get update Reading package lists... Done E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get) N: Be aware that removing the lock file is not a solution and may break your system. E: Unable to lock directory /var/lib/apt/lists/ ``` Task listing: ``` root 153929 \_ /usr/sbin/CRON -f -P root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron root 154026 \_ apt-get update _apt 154029 \_ /usr/lib/apt/methods/http _apt 154030 \_ /usr/lib/apt/methods/http _apt 154031 \_ /usr/lib/apt/methods/http _apt 154033 \_ /usr/lib/apt/methods/gpgv ``` Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?): ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http ``` All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL). The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=5}, NULL). The http sockets in the children were at fd=3. Parent lsof: ``` # lsof -p 154026 +E ... apt-get 154026 root4uW REG8,10 262281 /var/lib/apt/lists/lock apt-get 154026 root5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w apt-get 154026 root6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w apt-get 154026 root7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w apt-get 154026 root8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r apt-get 154026 root9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r http 154029 _apt0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w http 154029 _apt1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r http 154030 _apt0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w http 154030 _apt1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get,6r http 154031 _apt0r FIFO 0,13 0t0 4015193 pipe 154026,apt-get,12w http 154031 _apt1w FIFO 0,13 0t0 4015192 pipe 154026,apt-get,7r gpgv 154033 _apt0r FIFO 0,13 0t0 4015234 pipe 154026,apt-get,14w gpgv 154033 _apt1w FIFO 0,13 0t0 4015233 pipe 154026,apt-get,9r ``` So: - apt-get is waiting for any data written by any of its four children (at fd 5/6/7/9) - http and gpgv are waiting for any data written by their parent (at their respective fd 0) Parent backtrace: ``` #0 0x7f420116a74d in select () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x7f420153fb5d in pkgAcquire::Run(int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #2 0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #3 0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #4 0x7f42016d127b in DoUpdate (CmdL=...) at ./apt-private/private-update.cc:73 #5 0x7f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch const*, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #6 0x7f420169fa97 in DispatchCommandLine (CmdL=..., Cmds=std::vector
[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
Following up, `Acquire::Retries=0` seems to be avoiding the bug in prod, which strengthens my suspicion that it's related to retries. As for the actual bug, when apt is hanging, I notice that the work queue has items in it, but they have not been delegated to a worker. There is a condition that is supposed to "unstuck" the queue here[1], but for some reason `GetFetchAfter` is returning a zeroed time_point which avoids that condition[2]. The way we're setting `FetchAfter` looks OK to me[3] though I'm not familiar with std::chrono. Maybe there's some other place where we're partially copying the item or instantiating a new item and failing to copy the `FetchAfter` value. A bandaid patch is attached. I'm not sure if there's any downside in calling `Cycle` there. A real bugfix IMO would address the zeroed `FetchAfter` value described above, or potentially refactor so that work items are immediately delegated instead of having to rely on calling `Cycle` at the right time. I am missing a lot of context so that maybe that's not possible. [1] https://salsa.debian.org/apt- team/apt/-/blob/db797312548f6e7d6d1d1faa9f1e48c1e521d290/apt- pkg/acquire.cc#L738-740 [2] https://salsa.debian.org/apt- team/apt/-/blob/db797312548f6e7d6d1d1faa9f1e48c1e521d290/apt- pkg/acquire.cc#L733-736 [3] https://salsa.debian.org/apt- team/apt/-/blob/db797312548f6e7d6d1d1faa9f1e48c1e521d290/apt- pkg/acquire-worker.cc#L651-655 ** Attachment added: "patch" https://bugs.launchpad.net/ubuntu/+source/apt/+bug/2003851/+attachment/5759339/+files/patch -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to apt in Ubuntu. https://bugs.launchpad.net/bugs/2003851 Title: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04 Status in apt package in Ubuntu: Confirmed Bug description: Hi! Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms. We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks. Example: ``` # apt-get update Reading package lists... Done E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get) N: Be aware that removing the lock file is not a solution and may break your system. E: Unable to lock directory /var/lib/apt/lists/ ``` Task listing: ``` root 153929 \_ /usr/sbin/CRON -f -P root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron root 154026 \_ apt-get update _apt 154029 \_ /usr/lib/apt/methods/http _apt 154030 \_ /usr/lib/apt/methods/http _apt 154031 \_ /usr/lib/apt/methods/http _apt 154033 \_ /usr/lib/apt/methods/gpgv ``` Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?): ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http ``` All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL). The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=5}, NULL). The http sockets in the children were at fd=3. Parent lsof: ``` # lsof -p 154026 +E ... apt-get 154026 root4uW REG8,10 262281 /var/lib/apt/lists/lock apt-get 154026 root5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w apt-get 154026 root6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w apt-get 154026 root7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w apt-get 154026 root8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r apt-get 154026 root9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r http 154029 _apt0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w http 154029 _apt1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r http 154030 _apt0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w http 154030 _apt1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get
[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
Hello, also affected by this. I'm able to reproduce the bug using Walter's mock server. I suspect it may be related to the retry code that kicks in when a transient error is encountered (like a 503). Retries are enabled by default since apt 2.3.2, which seems to fit the regression window people have mentioned here. Frustratingly, I can't reproduce while step debugging, I'm guessing because the bug is clock-dependent. Setting either `Acquire::Retries::Delay=false` or `Acquire::Retries=0` seems to mitigate it. I'm curious if that works for others as well. I'll be testing this in prod soon and will report back. In the mean time I'll keep trying to find the actual bug. -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to apt in Ubuntu. https://bugs.launchpad.net/bugs/2003851 Title: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04 Status in apt package in Ubuntu: Confirmed Bug description: Hi! Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms. We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks. Example: ``` # apt-get update Reading package lists... Done E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get) N: Be aware that removing the lock file is not a solution and may break your system. E: Unable to lock directory /var/lib/apt/lists/ ``` Task listing: ``` root 153929 \_ /usr/sbin/CRON -f -P root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron root 154026 \_ apt-get update _apt 154029 \_ /usr/lib/apt/methods/http _apt 154030 \_ /usr/lib/apt/methods/http _apt 154031 \_ /usr/lib/apt/methods/http _apt 154033 \_ /usr/lib/apt/methods/gpgv ``` Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?): ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http ``` All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL). The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=5}, NULL). The http sockets in the children were at fd=3. Parent lsof: ``` # lsof -p 154026 +E ... apt-get 154026 root4uW REG8,10 262281 /var/lib/apt/lists/lock apt-get 154026 root5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w apt-get 154026 root6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w apt-get 154026 root7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w apt-get 154026 root8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r apt-get 154026 root9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r http 154029 _apt0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w http 154029 _apt1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r http 154030 _apt0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w http 154030 _apt1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get,6r http 154031 _apt0r FIFO 0,13 0t0 4015193 pipe 154026,apt-get,12w http 154031 _apt1w FIFO 0,13 0t0 4015192 pipe 154026,apt-get,7r gpgv 154033 _apt0r FIFO 0,13 0t0 4015234 pipe 154026,apt-get,14w gpgv 154033 _apt1w FIFO 0,13 0t0 4015233 pipe 154026,apt-get,9r ``` So: - apt-get is waiting for any data written by any of its four children (at fd 5/6/7/9) - http and gpgv are waiting for any data written by their parent (at their respective fd 0) Parent backtrace: ``` #0 0x7f420116a74d in select () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x7f420153fb5d in pkgAcquire::Run(int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #2 0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6
[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
We're seeing a similar issue here. At first we thought it was an issue specific to a prometheus collector (https://bugs.debian.org/cgi- bin/bugreport.cgi?bug=1028212 / https://github.com/prometheus- community/node-exporter-textfile-collector-scripts/issues/179) but now that I see this bug report, I can't help but think this is an issue in apt itself. I should also mention this seems like a regression between bullseye and bookworm. For context, we have a nightly job that runs apt-update here as well, a home-grown (`dsa-update-apt-status`, from DSA) script that spews warnings through cron when there's an issue. In my mailbox where I track those, I do have instances of this before we started the bookworm upgrade, but those were rare. Starting from the beginning of our bookworm deployment though, we are seeing more and more of this as we upgrade machines over. We're now seeing daily warnings, as the `dsa- update-apt-status` runs into lock contention with another job (`apt_info.py` from the above collector, runs every 15m) more frequently. We were getting daily warnings from the fleet, all bookworm machines, with locks sometimes being held for hours. Our current workaround has been to set a time limit to the `apt_info.py` job, but we're *still* seeing errors, which is interesting in itself, as it means the issue is *not* specific to that script: it's a global apt issue. We've had unattended-upgrades.py hanging forever as well now, which we've never seen before. So I think this is an apt issue. Perhaps Acquire::http::Timeout=120 is a valid workaround, but I can't help but think this is an issue that was specifically introduced between bullseye and bookworm (2.2 vs 2.6). ** Bug watch added: Debian Bug tracker #1028212 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1028212 ** Bug watch added: github.com/prometheus-community/node-exporter-textfile-collector-scripts/issues #179 https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/issues/179 -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to apt in Ubuntu. https://bugs.launchpad.net/bugs/2003851 Title: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04 Status in apt package in Ubuntu: Confirmed Bug description: Hi! Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms. We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks. Example: ``` # apt-get update Reading package lists... Done E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get) N: Be aware that removing the lock file is not a solution and may break your system. E: Unable to lock directory /var/lib/apt/lists/ ``` Task listing: ``` root 153929 \_ /usr/sbin/CRON -f -P root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron root 154026 \_ apt-get update _apt 154029 \_ /usr/lib/apt/methods/http _apt 154030 \_ /usr/lib/apt/methods/http _apt 154031 \_ /usr/lib/apt/methods/http _apt 154033 \_ /usr/lib/apt/methods/gpgv ``` Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?): ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http ``` All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL). The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=5}, NULL). The http sockets in the children were at fd=3. Parent lsof: ``` # lsof -p 154026 +E ... apt-get 154026 root4uW REG8,10 262281 /var/lib/apt/lists/lock apt-get 154026 root5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w apt-get 154026 root6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w apt-get 154026 root7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w apt-get 154026 root8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r apt-get 154026 root9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r apt-get 154026 root 12w
[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
FYI, ran into this same issue with Debian bookworm's apt-get (2.6.1) hanging indefinitely talking to an apt-cacher-ng 3.7.x instance, easily reproducible by having ansible run an `apt-get update` on ~20-25 machines simultaneously. Problem disappears when switching back to apt- cacher-ng 3.6.x, so +1 for the apt-cacher-ng bug linked above. Still, the problem persists even when invoking apt-get with `-o Acquire::http::Timeout=120`, so presumably there must still be something bugged on the side of apt itself not properly adhering to the configured timeout. -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to apt in Ubuntu. https://bugs.launchpad.net/bugs/2003851 Title: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04 Status in apt package in Ubuntu: Confirmed Bug description: Hi! Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms. We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks. Example: ``` # apt-get update Reading package lists... Done E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get) N: Be aware that removing the lock file is not a solution and may break your system. E: Unable to lock directory /var/lib/apt/lists/ ``` Task listing: ``` root 153929 \_ /usr/sbin/CRON -f -P root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron root 154026 \_ apt-get update _apt 154029 \_ /usr/lib/apt/methods/http _apt 154030 \_ /usr/lib/apt/methods/http _apt 154031 \_ /usr/lib/apt/methods/http _apt 154033 \_ /usr/lib/apt/methods/gpgv ``` Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?): ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http ``` All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL). The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=5}, NULL). The http sockets in the children were at fd=3. Parent lsof: ``` # lsof -p 154026 +E ... apt-get 154026 root4uW REG8,10 262281 /var/lib/apt/lists/lock apt-get 154026 root5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w apt-get 154026 root6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w apt-get 154026 root7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w apt-get 154026 root8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r apt-get 154026 root9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r http 154029 _apt0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w http 154029 _apt1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r http 154030 _apt0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w http 154030 _apt1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get,6r http 154031 _apt0r FIFO 0,13 0t0 4015193 pipe 154026,apt-get,12w http 154031 _apt1w FIFO 0,13 0t0 4015192 pipe 154026,apt-get,7r gpgv 154033 _apt0r FIFO 0,13 0t0 4015234 pipe 154026,apt-get,14w gpgv 154033 _apt1w FIFO 0,13 0t0 4015233 pipe 154026,apt-get,9r ``` So: - apt-get is waiting for any data written by any of its four children (at fd 5/6/7/9) - http and gpgv are waiting for any data written by their parent (at their respective fd 0) Parent backtrace: ``` #0 0x7f420116a74d in select () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x7f420153fb5d in pkgAcquire::Run(int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #2 0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #3 0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) () from /lib/x86_64-linux-gnu
[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
I suggest a temporary solution for cron . wrap apt update in timeout -k 60 1200 or add to systemd unit add JobTimeoutSec=1200 -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to apt in Ubuntu. https://bugs.launchpad.net/bugs/2003851 Title: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04 Status in apt package in Ubuntu: Confirmed Bug description: Hi! Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms. We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks. Example: ``` # apt-get update Reading package lists... Done E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get) N: Be aware that removing the lock file is not a solution and may break your system. E: Unable to lock directory /var/lib/apt/lists/ ``` Task listing: ``` root 153929 \_ /usr/sbin/CRON -f -P root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron root 154026 \_ apt-get update _apt 154029 \_ /usr/lib/apt/methods/http _apt 154030 \_ /usr/lib/apt/methods/http _apt 154031 \_ /usr/lib/apt/methods/http _apt 154033 \_ /usr/lib/apt/methods/gpgv ``` Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?): ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http ``` All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL). The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=5}, NULL). The http sockets in the children were at fd=3. Parent lsof: ``` # lsof -p 154026 +E ... apt-get 154026 root4uW REG8,10 262281 /var/lib/apt/lists/lock apt-get 154026 root5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w apt-get 154026 root6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w apt-get 154026 root7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w apt-get 154026 root8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r apt-get 154026 root9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r http 154029 _apt0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w http 154029 _apt1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r http 154030 _apt0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w http 154030 _apt1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get,6r http 154031 _apt0r FIFO 0,13 0t0 4015193 pipe 154026,apt-get,12w http 154031 _apt1w FIFO 0,13 0t0 4015192 pipe 154026,apt-get,7r gpgv 154033 _apt0r FIFO 0,13 0t0 4015234 pipe 154026,apt-get,14w gpgv 154033 _apt1w FIFO 0,13 0t0 4015233 pipe 154026,apt-get,9r ``` So: - apt-get is waiting for any data written by any of its four children (at fd 5/6/7/9) - http and gpgv are waiting for any data written by their parent (at their respective fd 0) Parent backtrace: ``` #0 0x7f420116a74d in select () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x7f420153fb5d in pkgAcquire::Run(int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #2 0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #3 0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #4 0x7f42016d127b in DoUpdate (CmdL=...) at ./apt-private/private-update.cc:73 #5 0x7f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch const*, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #6 0x7f420169fa97 in DispatchCommandLine (CmdL=..., Cmds=std::vector of length 27, capacity 32 = {...}) at ./apt-private/private-cmndline.cc:588 #7 0x561f
[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
Hi Julian, > [There] isn't much we can do otherwise, this highly depends on network specifics like latency and mtus and is never reliably reproducible. So, with my bogus python PoC server I mentioned in the comments, it is quite easily reproducible. If I can assist you in getting it up and running, do let me know. Cheers, Walter -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to apt in Ubuntu. https://bugs.launchpad.net/bugs/2003851 Title: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04 Status in apt package in Ubuntu: Confirmed Bug description: Hi! Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms. We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks. Example: ``` # apt-get update Reading package lists... Done E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get) N: Be aware that removing the lock file is not a solution and may break your system. E: Unable to lock directory /var/lib/apt/lists/ ``` Task listing: ``` root 153929 \_ /usr/sbin/CRON -f -P root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron root 154026 \_ apt-get update _apt 154029 \_ /usr/lib/apt/methods/http _apt 154030 \_ /usr/lib/apt/methods/http _apt 154031 \_ /usr/lib/apt/methods/http _apt 154033 \_ /usr/lib/apt/methods/gpgv ``` Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?): ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http ``` All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL). The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=5}, NULL). The http sockets in the children were at fd=3. Parent lsof: ``` # lsof -p 154026 +E ... apt-get 154026 root4uW REG8,10 262281 /var/lib/apt/lists/lock apt-get 154026 root5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w apt-get 154026 root6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w apt-get 154026 root7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w apt-get 154026 root8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r apt-get 154026 root9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r http 154029 _apt0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w http 154029 _apt1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r http 154030 _apt0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w http 154030 _apt1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get,6r http 154031 _apt0r FIFO 0,13 0t0 4015193 pipe 154026,apt-get,12w http 154031 _apt1w FIFO 0,13 0t0 4015192 pipe 154026,apt-get,7r gpgv 154033 _apt0r FIFO 0,13 0t0 4015234 pipe 154026,apt-get,14w gpgv 154033 _apt1w FIFO 0,13 0t0 4015233 pipe 154026,apt-get,9r ``` So: - apt-get is waiting for any data written by any of its four children (at fd 5/6/7/9) - http and gpgv are waiting for any data written by their parent (at their respective fd 0) Parent backtrace: ``` #0 0x7f420116a74d in select () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x7f420153fb5d in pkgAcquire::Run(int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #2 0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #3 0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #4 0x7f42016d127b in DoUpdate (CmdL=...) at ./apt-private/private-update.cc:73 #5 0x7f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch const*, bool) () from /lib/x86_64-
[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
** Tags added: jammy -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to apt in Ubuntu. https://bugs.launchpad.net/bugs/2003851 Title: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04 Status in apt package in Ubuntu: Confirmed Bug description: Hi! Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms. We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks. Example: ``` # apt-get update Reading package lists... Done E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get) N: Be aware that removing the lock file is not a solution and may break your system. E: Unable to lock directory /var/lib/apt/lists/ ``` Task listing: ``` root 153929 \_ /usr/sbin/CRON -f -P root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron root 154026 \_ apt-get update _apt 154029 \_ /usr/lib/apt/methods/http _apt 154030 \_ /usr/lib/apt/methods/http _apt 154031 \_ /usr/lib/apt/methods/http _apt 154033 \_ /usr/lib/apt/methods/gpgv ``` Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?): ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http ``` All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL). The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=5}, NULL). The http sockets in the children were at fd=3. Parent lsof: ``` # lsof -p 154026 +E ... apt-get 154026 root4uW REG8,10 262281 /var/lib/apt/lists/lock apt-get 154026 root5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w apt-get 154026 root6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w apt-get 154026 root7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w apt-get 154026 root8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r apt-get 154026 root9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r http 154029 _apt0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w http 154029 _apt1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r http 154030 _apt0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w http 154030 _apt1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get,6r http 154031 _apt0r FIFO 0,13 0t0 4015193 pipe 154026,apt-get,12w http 154031 _apt1w FIFO 0,13 0t0 4015192 pipe 154026,apt-get,7r gpgv 154033 _apt0r FIFO 0,13 0t0 4015234 pipe 154026,apt-get,14w gpgv 154033 _apt1w FIFO 0,13 0t0 4015233 pipe 154026,apt-get,9r ``` So: - apt-get is waiting for any data written by any of its four children (at fd 5/6/7/9) - http and gpgv are waiting for any data written by their parent (at their respective fd 0) Parent backtrace: ``` #0 0x7f420116a74d in select () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x7f420153fb5d in pkgAcquire::Run(int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #2 0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #3 0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #4 0x7f42016d127b in DoUpdate (CmdL=...) at ./apt-private/private-update.cc:73 #5 0x7f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch const*, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #6 0x7f420169fa97 in DispatchCommandLine (CmdL=..., Cmds=std::vector of length 27, capacity 32 = {...}) at ./apt-private/private-cmndline.cc:588 #7 0x561fc06bafbd in main (argc=2, argv=0x7ffc0e14aa38) at ./cmdline/apt-get.cc:447 ``` Child backtrace:
[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
Status changed to 'Confirmed' because the bug affects multiple users. ** Changed in: apt (Ubuntu) Status: New => Confirmed -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to apt in Ubuntu. https://bugs.launchpad.net/bugs/2003851 Title: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04 Status in apt package in Ubuntu: Confirmed Bug description: Hi! Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms. We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks. Example: ``` # apt-get update Reading package lists... Done E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get) N: Be aware that removing the lock file is not a solution and may break your system. E: Unable to lock directory /var/lib/apt/lists/ ``` Task listing: ``` root 153929 \_ /usr/sbin/CRON -f -P root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron root 154026 \_ apt-get update _apt 154029 \_ /usr/lib/apt/methods/http _apt 154030 \_ /usr/lib/apt/methods/http _apt 154031 \_ /usr/lib/apt/methods/http _apt 154033 \_ /usr/lib/apt/methods/gpgv ``` Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?): ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http ``` All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL). The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=5}, NULL). The http sockets in the children were at fd=3. Parent lsof: ``` # lsof -p 154026 +E ... apt-get 154026 root4uW REG8,10 262281 /var/lib/apt/lists/lock apt-get 154026 root5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w apt-get 154026 root6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w apt-get 154026 root7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w apt-get 154026 root8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r apt-get 154026 root9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r http 154029 _apt0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w http 154029 _apt1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r http 154030 _apt0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w http 154030 _apt1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get,6r http 154031 _apt0r FIFO 0,13 0t0 4015193 pipe 154026,apt-get,12w http 154031 _apt1w FIFO 0,13 0t0 4015192 pipe 154026,apt-get,7r gpgv 154033 _apt0r FIFO 0,13 0t0 4015234 pipe 154026,apt-get,14w gpgv 154033 _apt1w FIFO 0,13 0t0 4015233 pipe 154026,apt-get,9r ``` So: - apt-get is waiting for any data written by any of its four children (at fd 5/6/7/9) - http and gpgv are waiting for any data written by their parent (at their respective fd 0) Parent backtrace: ``` #0 0x7f420116a74d in select () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x7f420153fb5d in pkgAcquire::Run(int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #2 0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #3 0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #4 0x7f42016d127b in DoUpdate (CmdL=...) at ./apt-private/private-update.cc:73 #5 0x7f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch const*, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #6 0x7f420169fa97 in DispatchCommandLine (CmdL=..., Cmds=std::vector of length 27, capacity 32 = {...}) at ./apt-private/private-cmndline.cc:588 #7 0x000
[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
Relevant bug in apt-cacher-ng: https://bugs.launchpad.net/ubuntu/+source/apt-cacher-ng/+bug/1983856 -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to apt in Ubuntu. https://bugs.launchpad.net/bugs/2003851 Title: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04 Status in apt package in Ubuntu: New Bug description: Hi! Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms. We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks. Example: ``` # apt-get update Reading package lists... Done E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get) N: Be aware that removing the lock file is not a solution and may break your system. E: Unable to lock directory /var/lib/apt/lists/ ``` Task listing: ``` root 153929 \_ /usr/sbin/CRON -f -P root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron root 154026 \_ apt-get update _apt 154029 \_ /usr/lib/apt/methods/http _apt 154030 \_ /usr/lib/apt/methods/http _apt 154031 \_ /usr/lib/apt/methods/http _apt 154033 \_ /usr/lib/apt/methods/gpgv ``` Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?): ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http ``` All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL). The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=5}, NULL). The http sockets in the children were at fd=3. Parent lsof: ``` # lsof -p 154026 +E ... apt-get 154026 root4uW REG8,10 262281 /var/lib/apt/lists/lock apt-get 154026 root5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w apt-get 154026 root6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w apt-get 154026 root7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w apt-get 154026 root8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r apt-get 154026 root9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r http 154029 _apt0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w http 154029 _apt1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r http 154030 _apt0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w http 154030 _apt1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get,6r http 154031 _apt0r FIFO 0,13 0t0 4015193 pipe 154026,apt-get,12w http 154031 _apt1w FIFO 0,13 0t0 4015192 pipe 154026,apt-get,7r gpgv 154033 _apt0r FIFO 0,13 0t0 4015234 pipe 154026,apt-get,14w gpgv 154033 _apt1w FIFO 0,13 0t0 4015233 pipe 154026,apt-get,9r ``` So: - apt-get is waiting for any data written by any of its four children (at fd 5/6/7/9) - http and gpgv are waiting for any data written by their parent (at their respective fd 0) Parent backtrace: ``` #0 0x7f420116a74d in select () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x7f420153fb5d in pkgAcquire::Run(int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #2 0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #3 0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #4 0x7f42016d127b in DoUpdate (CmdL=...) at ./apt-private/private-update.cc:73 #5 0x7f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch const*, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #6 0x7f420169fa97 in DispatchCommandLine (CmdL=..., Cmds=std::vector of length 27, capacity 32 = {...}) at ./apt-private/private-cmndline.cc:588 #7 0x561fc06bafbd in main (argc=2, argv=0
[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
Got some more pcaps. I can also reproduce by sending a 503 after a 200, without aborting during/before body transfer (*). server.py logs 13:29:55 [3666956] client said: b'GET /ubuntu/dists/jammy/InRelease HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n' 13:29:55 [3666956] we say: 200 and serve local InRelease 13:29:55 [3666956] client said: b'GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n' 13:29:55 [3666956] we say: 503 503 request/response (valid length, no early FIN) GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1 Host: junk.devs.nu:3851 Cache-Control: max-age=0 Accept: text/* User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive HTTP/1.1 503 OK date: Fri, 17 Feb 2023 15:11:49 GMT content-type: text/plain content-length: 19 something is broken I present to you: - https://junk.devs.nu/2023/lp2003851/not-reproduced-port-34634.txt - https://junk.devs.nu/2023/lp2003851/not-reproduced-port-34634.pcap - https://junk.devs.nu/2023/lp2003851/reproduced-port-58014.txt - https://junk.devs.nu/2023/lp2003851/reproduced-port-58014.pcap In the not-reproduced case, the error is acknowledged and the download is retried. Then it fails quickly. In the reproduced case, the error is not acknowledged, and the result is a hanging apt-get. Logs ending with: Hit:3 http://repo.zabbix.com/zabbix/6.2/ubuntu jammy InRelease 0% [Working] (*) observed in the wild with these req/resp: GET /ubuntu/dists/jammy-backports/InRelease HTTP/1.1 Host: apt.osso.nl Cache-Control: max-age=0 Accept: text/* If-Modified-Since: Tue, 21 Feb 2023 19:32:00 GMT User-Agent: Debian APT-HTTP/1.3 (2.4.8) non-interactive HTTP/1.1 200 OK date: Wed, 22 Feb 2023 02:36:08 GMT content-type: octet/stream content-length: 106807 last-modified: Wed, 22 Feb 2023 01:49:00 GMT x-original-source: http://nl.archive.ubuntu.com/ubuntu/dists/jammy-backports/InRelease (+ valid body) GET /ubuntu-security/dists/jammy-security/InRelease HTTP/1.1 Host: apt.osso.nl Cache-Control: max-age=0 Accept: text/* If-Modified-Since: Tue, 21 Feb 2023 23:11:00 GMT User-Agent: Debian APT-HTTP/1.3 (2.4.8) non-interactive HTTP/1.1 503 Connection timeout date: Wed, 22 Feb 2023 02:36:08 GMT content-type: text/html content-length: 486 (+ valid body) extra contemplations: We've only recently added the /ubuntu-security/ (extra path, on the same server). Maybe that is the reason why this is biting us right now. -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to apt in Ubuntu. https://bugs.launchpad.net/bugs/2003851 Title: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04 Status in apt package in Ubuntu: New Bug description: Hi! Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms. We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks. Example: ``` # apt-get update Reading package lists... Done E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get) N: Be aware that removing the lock file is not a solution and may break your system. E: Unable to lock directory /var/lib/apt/lists/ ``` Task listing: ``` root 153929 \_ /usr/sbin/CRON -f -P root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron root 154026 \_ apt-get update _apt 154029 \_ /usr/lib/apt/methods/http _apt 154030 \_ /usr/lib/apt/methods/http _apt 154031 \_ /usr/lib/apt/methods/http _apt 154033 \_ /usr/lib/apt/methods/gpgv ``` Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?): ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http ``` All children (including gpgv)
[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
Okay. I got some pcaps now. And apparently our apt-cacher-ng proxy sometimes serves: - an InRelease - with a 200 OK - and a non-zero Content-Length - BUT, no content, but instead a TCP-FIN. Under the right circumstances, this causes apt to stall forever. The right circumstances seem to include: - a valid download (or 304) from the same host - a valid download from another host --- I created a test server, and got this to reproduce with apt 2.4.9 (from jammy-proposed): = sources.list = deb http://junk.devs.nu:3851/ubuntu jammy main deb http://junk.devs.nu:3851/ubuntu jammy-updates main deb [arch=amd64 signed-by=/etc/apt/keyrings/zabbix.gpg] http://repo.zabbix.com/zabbix/6.2/ubuntu jammy main = server.py logs = 13:31:58 [2771009] client connected: ('217.21.199.56', 42838) 13:31:58 [2771009] client said: b'GET /ubuntu/dists/jammy/InRelease HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n' 13:31:58 [2771009] we say: 200 and serve local InRelease 13:31:58 [2771009] client said: b'GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n' 13:31:58 [2771009] we say: 200 but serve a FIN 13:31:58 [2771009] client said: b'' 13:31:58 [2771009] client disconnected: ('217.21.199.56', 42838) = apt-get update -o Debug::Acquire::http=1 = 0% [Connecting to junk.devs.nu] [Connecting to repo.zabbix.com (2604:a880:2:d0::2062:d001)]GET /zabbix/6.2/ubuntu/dists/jammy/InRelease HTTP/1.1 Host: repo.zabbix.com Cache-Control: max-age=0 Accept: text/* If-Modified-Since: Wed, 01 Feb 2023 15:02:57 GMT User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive Answer for: http://repo.zabbix.com/zabbix/6.2/ubuntu/dists/jammy/InRelease HTTP/1.1 304 Not Modified Server: nginx/1.14.2 Date: Tue, 21 Feb 2023 12:31:54 GMT Last-Modified: Wed, 01 Feb 2023 15:02:57 GMT Connection: keep-alive ETag: "63da7f21-135e" Hit:1 http://repo.zabbix.com/zabbix/6.2/ubuntu jammy InRelease 0% [Connecting to junk.devs.nu]GET /ubuntu/dists/jammy/InRelease HTTP/1.1 Host: junk.devs.nu:3851 Cache-Control: max-age=0 Accept: text/* User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive Answer for: http://junk.devs.nu:3851/ubuntu/dists/jammy/InRelease HTTP/1.1 200 OK date: Fri, 17 Feb 2023 15:11:49 GMT content-type: octet/stream content-length: 270087 last-modified: Fri, 17 Feb 2023 13:29:00 GMT x-original-source: http://de.archive.ubuntu.com/ubuntu/dists/jammy/InRelease Get:2 http://junk.devs.nu:3851/ubuntu jammy InRelease [270 kB] 0% [2 InRelease 0 B/270 kB 0%]GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1 Host: junk.devs.nu:3851 Cache-Control: max-age=0 Accept: text/* User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive Answer for: http://junk.devs.nu:3851/ubuntu/dists/jammy-updates/InRelease HTTP/1.1 200 OK date: Fri, 17 Feb 2023 15:11:49 GMT content-type: octet/stream content-length: 118747 last-modified: Fri, 17 Feb 2023 13:29:00 GMT x-original-source: http://de.archive.ubuntu.com/ubuntu/dists/jammy-updates/InRelease Get:3 http://junk.devs.nu:3851/ubuntu jammy-updates InRelease [119 kB] Ign:3 http://junk.devs.nu:3851/ubuntu jammy-updates InRelease 0% [Working] = ps faxu = root 3549798 0.0 0.0 16652 6556 pts/3S+ 12:31 0:00 \_ apt-get update -o Debug::Acquire::http=1 _apt 3549801 0.0 0.0 22068 6540 pts/3S+ 12:31 0:00 \_ /usr/lib/apt/methods/http _apt 3549802 0.0 0.0 22068 6552 pts/3S+ 12:31 0:00 \_ /usr/lib/apt/methods/http _apt 3549804 0.0 0.0 15440 5016 pts/3S+ 12:31 0:00 \_ /usr/lib/apt/methods/gpgv = netstat = tcp1 0 217.21.199.56:44494 178.128.6.101:80CLOSE_WAIT 3549801/http (this IP belongs to the zabbix repo. the connection to the junk.devs.nu broken server is already gone.) = preliminary info = The connection to the zabbix repo is unclosed. The connection to the broken server is closed. The sequence was: 12:31:58.082726 eth0 In IP 217.21.198.165.3851 > 217.21.199.56.42838: Flags [F.], seq 270320:270560, ack 349, win 507, options [nop,nop,TS val 4041174334 ecr 558653630], length 240 12:31:58.083017 eth0 Out IP 217.21.199.56.42838 > 217.21.198.165.3851: Flags [F.], seq 349, ack 270561, win 1022, options [
[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
Checking in. I've had an `apt-get update` running in a while loop since Feb 6 on a machine: no luck getting it to stall. Halfway, I added a `iptables -A INPUT -m statistic --mode random --probability 0.05 -j DROP` but this did not affect anything. But, in the mean time, on production I have had two more of these since filing this bug report... -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to apt in Ubuntu. https://bugs.launchpad.net/bugs/2003851 Title: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04 Status in apt package in Ubuntu: New Bug description: Hi! Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms. We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks. Example: ``` # apt-get update Reading package lists... Done E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get) N: Be aware that removing the lock file is not a solution and may break your system. E: Unable to lock directory /var/lib/apt/lists/ ``` Task listing: ``` root 153929 \_ /usr/sbin/CRON -f -P root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron root 154026 \_ apt-get update _apt 154029 \_ /usr/lib/apt/methods/http _apt 154030 \_ /usr/lib/apt/methods/http _apt 154031 \_ /usr/lib/apt/methods/http _apt 154033 \_ /usr/lib/apt/methods/gpgv ``` Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?): ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http ``` All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL). The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=5}, NULL). The http sockets in the children were at fd=3. Parent lsof: ``` # lsof -p 154026 +E ... apt-get 154026 root4uW REG8,10 262281 /var/lib/apt/lists/lock apt-get 154026 root5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w apt-get 154026 root6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w apt-get 154026 root7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w apt-get 154026 root8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r apt-get 154026 root9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r http 154029 _apt0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w http 154029 _apt1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r http 154030 _apt0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w http 154030 _apt1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get,6r http 154031 _apt0r FIFO 0,13 0t0 4015193 pipe 154026,apt-get,12w http 154031 _apt1w FIFO 0,13 0t0 4015192 pipe 154026,apt-get,7r gpgv 154033 _apt0r FIFO 0,13 0t0 4015234 pipe 154026,apt-get,14w gpgv 154033 _apt1w FIFO 0,13 0t0 4015233 pipe 154026,apt-get,9r ``` So: - apt-get is waiting for any data written by any of its four children (at fd 5/6/7/9) - http and gpgv are waiting for any data written by their parent (at their respective fd 0) Parent backtrace: ``` #0 0x7f420116a74d in select () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x7f420153fb5d in pkgAcquire::Run(int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #2 0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #3 0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #4 0x7f42016d127b in DoUpdate (CmdL=...) at ./apt-private/private-update.cc:73 #5 0x7f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch const*, bool) () from /lib
[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
Thanks for the reply. Julian. Let's assume that the problem is indeed latency/dropped packets/whatever on our side. IMO, an (occasionally) broken network should not cause apt- get to hang indefinitely. Do you think it should? Also, it doesn't address that the behaviour seems recent. We have not observed anything like this with Focal or Bionic. In the mean time, I've added notification code so at least we can we track when this occurs. I'll fill you in if I get more details (like versions that are (not) affected). Additional debug info: I called recv(3..) on the http socket with a recv-q. It returns 0 (no error, EOF). ``` (gdb) call (ssize_t)recv(3, "abcdef", 1, 0) $1 = 0 ``` ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp1 0 10.91.52.91:60868 217.21.205.139:80 CLOSE_WAIT 154030/http tcp1 0 10.91.52.91:40756 178.128.6.101:80CLOSE_WAIT 154029/http tcp0 0 10.91.52.91:56818 185.37.124.14:80CLOSE_WAIT 154031/http ``` Next, I called close(0) on the stdin socket on one of the processes, and this awoke the whole list of tasks: ``` (gdb) call (int)close(0) $1 = 0 ``` This yielded the following error message: ``` E: Method http has died unexpectedly! ``` And the apt-get process got unstuck / was reaped succesfully. -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to apt in Ubuntu. https://bugs.launchpad.net/bugs/2003851 Title: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04 Status in apt package in Ubuntu: New Bug description: Hi! Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms. We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks. Example: ``` # apt-get update Reading package lists... Done E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get) N: Be aware that removing the lock file is not a solution and may break your system. E: Unable to lock directory /var/lib/apt/lists/ ``` Task listing: ``` root 153929 \_ /usr/sbin/CRON -f -P root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron root 154026 \_ apt-get update _apt 154029 \_ /usr/lib/apt/methods/http _apt 154030 \_ /usr/lib/apt/methods/http _apt 154031 \_ /usr/lib/apt/methods/http _apt 154033 \_ /usr/lib/apt/methods/gpgv ``` Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?): ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http ``` All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL). The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=5}, NULL). The http sockets in the children were at fd=3. Parent lsof: ``` # lsof -p 154026 +E ... apt-get 154026 root4uW REG8,10 262281 /var/lib/apt/lists/lock apt-get 154026 root5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w apt-get 154026 root6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w apt-get 154026 root7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w apt-get 154026 root8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r apt-get 154026 root9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r http 154029 _apt0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w http 154029 _apt1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r http 154030 _apt0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w http 154030 _apt1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get,6r http 154031 _apt0r FIFO 0,13 0t0 4015193 pipe 154026,apt-get,12w http 154031 _apt1w FIFO 0,13 0t0 4015192 pipe 154026,apt-get,7r gpgv
[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
Patches are welcome but unfortunately there isn't much we can do otherwise, this highly depends on network specifics like latency and mtus and is never reliably reproducible. -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to apt in Ubuntu. https://bugs.launchpad.net/bugs/2003851 Title: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04 Status in apt package in Ubuntu: New Bug description: Hi! Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms. We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks. Example: ``` # apt-get update Reading package lists... Done E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get) N: Be aware that removing the lock file is not a solution and may break your system. E: Unable to lock directory /var/lib/apt/lists/ ``` Task listing: ``` root 153929 \_ /usr/sbin/CRON -f -P root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron root 154026 \_ apt-get update _apt 154029 \_ /usr/lib/apt/methods/http _apt 154030 \_ /usr/lib/apt/methods/http _apt 154031 \_ /usr/lib/apt/methods/http _apt 154033 \_ /usr/lib/apt/methods/gpgv ``` Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?): ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http ``` All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL). The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=5}, NULL). The http sockets in the children were at fd=3. Parent lsof: ``` # lsof -p 154026 +E ... apt-get 154026 root4uW REG8,10 262281 /var/lib/apt/lists/lock apt-get 154026 root5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w apt-get 154026 root6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w apt-get 154026 root7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w apt-get 154026 root8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r apt-get 154026 root9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r http 154029 _apt0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w http 154029 _apt1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r http 154030 _apt0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w http 154030 _apt1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get,6r http 154031 _apt0r FIFO 0,13 0t0 4015193 pipe 154026,apt-get,12w http 154031 _apt1w FIFO 0,13 0t0 4015192 pipe 154026,apt-get,7r gpgv 154033 _apt0r FIFO 0,13 0t0 4015234 pipe 154026,apt-get,14w gpgv 154033 _apt1w FIFO 0,13 0t0 4015233 pipe 154026,apt-get,9r ``` So: - apt-get is waiting for any data written by any of its four children (at fd 5/6/7/9) - http and gpgv are waiting for any data written by their parent (at their respective fd 0) Parent backtrace: ``` #0 0x7f420116a74d in select () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x7f420153fb5d in pkgAcquire::Run(int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #2 0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #3 0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #4 0x7f42016d127b in DoUpdate (CmdL=...) at ./apt-private/private-update.cc:73 #5 0x7f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch const*, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #6 0x7f420169fa97 in DispatchCommandLine (CmdL=..., Cmds=std::vector of length 27, capacity 32 = {...}) at ./apt-priva