Got some more pcaps. I can also reproduce by sending a 503 after a 200, without aborting during/before body transfer (*).
============================ server.py logs ============================ 13:29:55 [3666956] client said: b'GET /ubuntu/dists/jammy/InRelease HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n' 13:29:55 [3666956] we say: 200 and serve local InRelease 13:29:55 [3666956] client said: b'GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n' 13:29:55 [3666956] we say: 503 ============================ 503 request/response (valid length, no early FIN) ============================ GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1 Host: junk.devs.nu:3851 Cache-Control: max-age=0 Accept: text/* User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive HTTP/1.1 503 OK date: Fri, 17 Feb 2023 15:11:49 GMT content-type: text/plain content-length: 19 something is broken ============================ I present to you: - https://junk.devs.nu/2023/lp2003851/not-reproduced-port-34634.txt - https://junk.devs.nu/2023/lp2003851/not-reproduced-port-34634.pcap - https://junk.devs.nu/2023/lp2003851/reproduced-port-58014.txt - https://junk.devs.nu/2023/lp2003851/reproduced-port-58014.pcap In the not-reproduced case, the error is acknowledged and the download is retried. Then it fails quickly. In the reproduced case, the error is not acknowledged, and the result is a hanging apt-get. Logs ending with: Hit:3 http://repo.zabbix.com/zabbix/6.2/ubuntu jammy InRelease 0% [Working] ============================ (*) observed in the wild with these req/resp: ============================ GET /ubuntu/dists/jammy-backports/InRelease HTTP/1.1 Host: apt.osso.nl Cache-Control: max-age=0 Accept: text/* If-Modified-Since: Tue, 21 Feb 2023 19:32:00 GMT User-Agent: Debian APT-HTTP/1.3 (2.4.8) non-interactive HTTP/1.1 200 OK date: Wed, 22 Feb 2023 02:36:08 GMT content-type: octet/stream content-length: 106807 last-modified: Wed, 22 Feb 2023 01:49:00 GMT x-original-source: http://nl.archive.ubuntu.com/ubuntu/dists/jammy-backports/InRelease (+ valid body) GET /ubuntu-security/dists/jammy-security/InRelease HTTP/1.1 Host: apt.osso.nl Cache-Control: max-age=0 Accept: text/* If-Modified-Since: Tue, 21 Feb 2023 23:11:00 GMT User-Agent: Debian APT-HTTP/1.3 (2.4.8) non-interactive HTTP/1.1 503 Connection timeout date: Wed, 22 Feb 2023 02:36:08 GMT content-type: text/html content-length: 486 (+ valid body) ============================ extra contemplations: ============================ We've only recently added the /ubuntu-security/ (extra path, on the same server). Maybe that is the reason why this is biting us right now. -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to apt in Ubuntu. https://bugs.launchpad.net/bugs/2003851 Title: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04 Status in apt package in Ubuntu: New Bug description: Hi! Yesterday I spotted several machines of ours where a period `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms. We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks. Example: ``` # apt-get update Reading package lists... Done E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get) N: Be aware that removing the lock file is not a solution and may break your system. E: Unable to lock directory /var/lib/apt/lists/ ``` Task listing: ``` root 153929 \_ /usr/sbin/CRON -f -P root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron root 154026 \_ apt-get update _apt 154029 \_ /usr/lib/apt/methods/http _apt 154030 \_ /usr/lib/apt/methods/http _apt 154031 \_ /usr/lib/apt/methods/http _apt 154033 \_ /usr/lib/apt/methods/gpgv ``` Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?): ``` # netstat -apn | grep -E '154026|154029|154030|154031|154033' tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http ``` All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL). The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=500000000}, NULL). The http sockets in the children were at fd=3. Parent lsof: ``` # lsof -p 154026 +E ... apt-get 154026 root 4uW REG 8,1 0 262281 /var/lib/apt/lists/lock apt-get 154026 root 5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w apt-get 154026 root 6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w apt-get 154026 root 7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w apt-get 154026 root 8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r apt-get 154026 root 9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r http 154029 _apt 0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w http 154029 _apt 1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r http 154030 _apt 0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w http 154030 _apt 1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get,6r http 154031 _apt 0r FIFO 0,13 0t0 4015193 pipe 154026,apt-get,12w http 154031 _apt 1w FIFO 0,13 0t0 4015192 pipe 154026,apt-get,7r gpgv 154033 _apt 0r FIFO 0,13 0t0 4015234 pipe 154026,apt-get,14w gpgv 154033 _apt 1w FIFO 0,13 0t0 4015233 pipe 154026,apt-get,9r ``` So: - apt-get is waiting for any data written by any of its four children (at fd 5/6/7/9) - http and gpgv are waiting for any data written by their parent (at their respective fd 0) Parent backtrace: ``` #0 0x00007f420116a74d in select () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007f420153fb5d in pkgAcquire::Run(int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #2 0x00007f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #3 0x00007f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #4 0x00007f42016d127b in DoUpdate (CmdL=...) at ./apt-private/private-update.cc:73 #5 0x00007f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch const*, bool) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #6 0x00007f420169fa97 in DispatchCommandLine (CmdL=..., Cmds=std::vector of length 27, capacity 32 = {...}) at ./apt-private/private-cmndline.cc:588 #7 0x0000561fc06bafbd in main (argc=2, argv=0x7ffc0e14aa38) at ./cmdline/apt-get.cc:447 ``` Child backtrace: ``` #0 0x00007f58b1c9b74d in select () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007f58b237cb43 in WaitFd(int, bool, unsigned long) () from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0 #2 0x00005643e37bc7b3 in BaseHttpMethod::Loop ( this=0x7ffdeaed56c0) at ./methods/basehttp.cc:611 #3 main (argv=<optimized out>) at ./methods/http.cc:1052 ``` Where this is: ``` (gdb) print *this $1 = {<aptAuthConfMethod> = {<aptMethod> = {<pkgAcqMethod> = {<No data fields>}, <aptConfigWrapperForMethods> = { methodNames = std::vector of length 2, capacity 2 = { "http", "http::XXXXXX.nl"}}, Binary = "http", SeccompFlags = 6}, authconfs = std::vector of length 1, capacity 1 = { std::unique_ptr<FileFd> = {get() = 0x5643e5343100}}}, Server = std::unique_ptr<ServerState> = { get() = 0x5643e534fba0}, NextURI = "", AllowRedirect = true, Debug = false, PipelineDepth = 10, static FailFile = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x5643e534d650 "/var/lib/apt/lists/partial/apt.osso.nl_ubuntu-security_dists_jammy-security_InRelease"}, _M_string_length = 85, { _M_local_buf = "\226", '\000' <repeats 14 times>, _M_allocated_capacity = 150}}, static FailFd = -1, static FailTime = 1674071760} ``` Relevant code in parent: ``` pkgAcquire::RunResult pkgAcquire::Run(int PulseInterval) { ... // Run till all things have been acquired struct timeval tv = SteadyDurationToTimeVal(std::chrono::microseconds(PulseInterval)); while (ToFetch > 0) ... int Res; do { Res = select(Highest+1,&RFds,&WFds,0,&tv); } while (Res < 0 && errno == EINTR); ``` Relevant code in child: ``` int BaseHttpMethod::Loop() { signal(SIGTERM,SigTerm); signal(SIGINT,SigTerm); Server = 0; int FailCounter = 0; while (1) { // We have no commands, wait for some to arrive if (Queue == 0) { if (WaitFd(STDIN_FILENO) == false) return 0; ``` Versions: ``` Ubuntu 22.04.1 LTS (Jammy Jellyfish) apt 2.4.8 ``` Preliminiary conclusion/notes: - I don't know why it stalls, but it does. And this was not a single incident. All the stalling apt processes had stopped at a different time/day. - Apparently there is a situation possible where there are no commands queued from the parent, and the parent isn't planning on sending any either. - Our apt-get is invoked by a cron job which calls either `update- notifier-common` or `apt-get update`: https://github.com/ossobv/zabbix-agent- osso/blob/c61aee6087c0d03e66d03013c61acd9f65d0eaab/scripts/dpkg.updates#L14-L31 -- on the machines where apt-get was hanging, there was no `update- notifier-common`. - I don't know if this is 2.4.x related (on Focal we have apt 2.0.x), but it sure looks like it. I did check a few other Focal machines that lacked `update-notifier-common`, but they were not having a stalled apt-get (although _not_ seeing a hung process proves nothing) Let me know if there's anything I can get you. Should this be filed elsewhere? (Debian Salsa?) Cheers, Walter Doekes OSSO B.V. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/apt/+bug/2003851/+subscriptions -- Mailing list: https://launchpad.net/~touch-packages Post to : touch-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~touch-packages More help : https://help.launchpad.net/ListHelp