Okay. I got some pcaps now. And apparently our apt-cacher-ng proxy
sometimes serves:

- an InRelease
- with a 200 OK
- and a non-zero Content-Length
- BUT, no content, but instead a TCP-FIN.

Under the right circumstances, this causes apt to stall forever.

The right circumstances seem to include:

- a valid download (or 304) from the same host
- a valid download from another host

---

I created a test server, and got this to reproduce with apt 2.4.9 (from
jammy-proposed):

=====================================
sources.list
=====================================
deb http://junk.devs.nu:3851/ubuntu jammy main
deb http://junk.devs.nu:3851/ubuntu jammy-updates main
deb [arch=amd64 signed-by=/etc/apt/keyrings/zabbix.gpg] 
http://repo.zabbix.com/zabbix/6.2/ubuntu jammy main

=====================================
server.py logs
=====================================
13:31:58 [2771009] client connected: ('217.21.199.56', 42838)
13:31:58 [2771009] client said: b'GET /ubuntu/dists/jammy/InRelease 
HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: 
text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n'
13:31:58 [2771009] we say: 200 and serve local InRelease
13:31:58 [2771009] client said: b'GET /ubuntu/dists/jammy-updates/InRelease 
HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: 
text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n'
13:31:58 [2771009] we say: 200 but serve a FIN
13:31:58 [2771009] client said: b''
13:31:58 [2771009] client disconnected: ('217.21.199.56', 42838)

=====================================
apt-get update -o Debug::Acquire::http=1
=====================================
0% [Connecting to junk.devs.nu] [Connecting to repo.zabbix.com 
(2604:a880:2:d0::2062:d001)]GET /zabbix/6.2/ubuntu/dists/jammy/InRelease 
HTTP/1.1
Host: repo.zabbix.com
Cache-Control: max-age=0
Accept: text/*
If-Modified-Since: Wed, 01 Feb 2023 15:02:57 GMT
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive


Answer for: http://repo.zabbix.com/zabbix/6.2/ubuntu/dists/jammy/InRelease
HTTP/1.1 304 Not Modified
Server: nginx/1.14.2
Date: Tue, 21 Feb 2023 12:31:54 GMT
Last-Modified: Wed, 01 Feb 2023 15:02:57 GMT
Connection: keep-alive
ETag: "63da7f21-135e"

Hit:1 http://repo.zabbix.com/zabbix/6.2/ubuntu jammy InRelease                  
           
0% [Connecting to junk.devs.nu]GET /ubuntu/dists/jammy/InRelease HTTP/1.1       
           
Host: junk.devs.nu:3851
Cache-Control: max-age=0
Accept: text/*
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive


Answer for: http://junk.devs.nu:3851/ubuntu/dists/jammy/InRelease
HTTP/1.1 200 OK
date: Fri, 17 Feb 2023 15:11:49 GMT
content-type: octet/stream
content-length: 270087
last-modified: Fri, 17 Feb 2023 13:29:00 GMT
x-original-source: http://de.archive.ubuntu.com/ubuntu/dists/jammy/InRelease

Get:2 http://junk.devs.nu:3851/ubuntu jammy InRelease [270 kB]
0% [2 InRelease 0 B/270 kB 0%]GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1
Host: junk.devs.nu:3851
Cache-Control: max-age=0
Accept: text/*
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive


Answer for: http://junk.devs.nu:3851/ubuntu/dists/jammy-updates/InRelease
HTTP/1.1 200 OK
date: Fri, 17 Feb 2023 15:11:49 GMT
content-type: octet/stream
content-length: 118747
last-modified: Fri, 17 Feb 2023 13:29:00 GMT
x-original-source: 
http://de.archive.ubuntu.com/ubuntu/dists/jammy-updates/InRelease

Get:3 http://junk.devs.nu:3851/ubuntu jammy-updates InRelease [119 kB]
Ign:3 http://junk.devs.nu:3851/ubuntu jammy-updates InRelease
0% [Working]

=====================================
ps faxu
=====================================
root     3549798  0.0  0.0  16652  6556 pts/3    S+   12:31   0:00              
            \_ apt-get update -o Debug::Acquire::http=1
_apt     3549801  0.0  0.0  22068  6540 pts/3    S+   12:31   0:00              
                \_ /usr/lib/apt/methods/http
_apt     3549802  0.0  0.0  22068  6552 pts/3    S+   12:31   0:00              
                \_ /usr/lib/apt/methods/http
_apt     3549804  0.0  0.0  15440  5016 pts/3    S+   12:31   0:00              
                \_ /usr/lib/apt/methods/gpgv

=====================================
netstat
=====================================
tcp        1      0 217.21.199.56:44494     178.128.6.101:80        CLOSE_WAIT  
3549801/http        

(this IP belongs to the zabbix repo. the connection to the junk.devs.nu
broken server is already gone.)

=====================================
preliminary info
=====================================
The connection to the zabbix repo is unclosed. The connection to the broken 
server is closed. The sequence was:

12:31:58.082726 eth0  In  IP 217.21.198.165.3851 > 217.21.199.56.42838: Flags 
[F.], seq 270320:270560, ack 349, win 507, options [nop,nop,TS val 4041174334 
ecr 558653630], length 240
12:31:58.083017 eth0  Out IP 217.21.199.56.42838 > 217.21.198.165.3851: Flags 
[F.], seq 349, ack 270561, win 1022, options [nop,nop,TS val 558653631 ecr 
4041174334], length 0
12:31:58.083199 eth0  In  IP 217.21.198.165.3851 > 217.21.199.56.42838: Flags 
[.], ack 350, win 507, options [nop,nop,TS val 4041174335 ecr 558653631], 
length 0

Where that last request was:

GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1
Host: junk.devs.nu:3851
Cache-Control: max-age=0
Accept: text/*
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive

And the last response was:

HTTP/1.1 200 OK
date: Fri, 17 Feb 2023 15:11:49 GMT
content-type: octet/stream
content-length: 118747
last-modified: Fri, 17 Feb 2023 13:29:00 GMT
x-original-source: 
http://de.archive.ubuntu.com/ubuntu/dists/jammy-updates/InRelease

This response was sent in the FIN packet. So no actual content would be
delivered.

The apt-get process which got the broken server has this backtrace:

(gdb) bt
#0  0x00007efc90d3774d in __GI___select (nfds=1, readfds=0x7ffe64790920, 
writefds=0x0, exceptfds=0x0, timeout=0x0) at 
../sysdeps/unix/sysv/linux/select.c:69
#1  0x00007efc91418b43 in WaitFd(int, bool, unsigned long) () from 
/lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#2  0x0000555a612d47b3 in BaseHttpMethod::Loop (this=0x7ffe64790bf0) at 
./methods/basehttp.cc:611
#3  main (argv=<optimized out>) at ./methods/http.cc:1052

The process which got the good server (zabbix) has a similar backtrace:

(gdb) bt
#0  0x00007f8e345e674d in __GI___select (nfds=1, readfds=0x7ffc6f6b9990, 
writefds=0x0, exceptfds=0x0, timeout=0x0) at 
../sysdeps/unix/sysv/linux/select.c:69
#1  0x00007f8e34cc7b43 in WaitFd(int, bool, unsigned long) () from 
/lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#2  0x00005614260967b3 in BaseHttpMethod::Loop (this=0x7ffc6f6b9c60) at 
./methods/basehttp.cc:611
#3  main (argv=<optimized out>) at ./methods/http.cc:1052

=====================================
server.py
=====================================
http://junk.devs.nu/2023/lp2003851/server.py
http://junk.devs.nu/2023/lp2003851/2023-02-21-13-31.pcap

=====================================

I'll go take a look at why apt-cacher-ng misbehaves.

[~juliank]: Maybe this is sufficient info for you to work with? Even
though our server is at fault, we still don't want an infinitely hanging
apt.

NOTE: Maybe this isn't an apt 2.4.x thing, but something that started
happening for Jammy-served apt-cacher-ng responses and therefore only
visible on Jammy? In that case this isn't a newly introduced problem in
apt.

Cheers,
Walter Doekes
OSSO B.V.

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  New

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root      153929  \_ /usr/sbin/CRON -f -P
  root      153942      \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root      153943          \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root      154026              \_ apt-get update
  _apt      154029                  \_ /usr/lib/apt/methods/http
  _apt      154030                  \_ /usr/lib/apt/methods/http
  _apt      154031                  \_ /usr/lib/apt/methods/http
  _apt      154033                  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http         
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http         
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http         
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).

  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
  NULL, {tv_sec=0, tv_nsec=500000000}, NULL).

  The http sockets in the children were at fd=3.

  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root    4uW  REG    8,1        0  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root    5r  FIFO   0,13      0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root    6r  FIFO   0,13      0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root    7r  FIFO   0,13      0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root    8w  FIFO   0,13      0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root    9r  FIFO   0,13      0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13      0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w  FIFO   0,13      0t0 4015193 pipe 154031,http,0r
  apt-get   154026 root   14w  FIFO   0,13      0t0 4015234 pipe 154033,gpgv,0r
  http      154029 _apt    0r  FIFO   0,13      0t0 4015177 pipe 
154026,apt-get,8w
  http      154029 _apt    1w  FIFO   0,13      0t0 4015176 pipe 
154026,apt-get,5r
  http      154030 _apt    0r  FIFO   0,13      0t0 4012449 pipe 
154026,apt-get,10w
  http      154030 _apt    1w  FIFO   0,13      0t0 4012448 pipe 
154026,apt-get,6r
  http      154031 _apt    0r  FIFO   0,13      0t0 4015193 pipe 
154026,apt-get,12w
  http      154031 _apt    1w  FIFO   0,13      0t0 4015192 pipe 
154026,apt-get,7r
  gpgv      154033 _apt    0r  FIFO   0,13      0t0 4015234 pipe 
154026,apt-get,14w
  gpgv      154033 _apt    1w  FIFO   0,13      0t0 4015233 pipe 
154026,apt-get,9r
  ```
  So:
  - apt-get is waiting for any data written by any of its four children (at fd 
5/6/7/9)
  - http and gpgv are waiting for any data written by their parent (at their 
respective fd 0)

  Parent backtrace:
  ```
  #0  0x00007f420116a74d in select ()
     from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x00007f420153fb5d in pkgAcquire::Run(int) ()
     from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #2  0x00007f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) ()
     from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #3  0x00007f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) 
()
     from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #4  0x00007f42016d127b in DoUpdate (CmdL=...)
      at ./apt-private/private-update.cc:73
  #5  0x00007f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch 
const*, bool) ()
     from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #6  0x00007f420169fa97 in DispatchCommandLine (CmdL=..., 
      Cmds=std::vector of length 27, capacity 32 = {...})
      at ./apt-private/private-cmndline.cc:588
  #7  0x0000561fc06bafbd in main (argc=2, argv=0x7ffc0e14aa38)
      at ./cmdline/apt-get.cc:447
  ```
  Child backtrace:
  ```
  #0  0x00007f58b1c9b74d in select ()
     from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x00007f58b237cb43 in WaitFd(int, bool, unsigned long) ()
     from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #2  0x00005643e37bc7b3 in BaseHttpMethod::Loop (
      this=0x7ffdeaed56c0) at ./methods/basehttp.cc:611
  #3  main (argv=<optimized out>) at ./methods/http.cc:1052
  ```
  Where this is:
  ```
  (gdb) print *this
  $1 = {<aptAuthConfMethod> = {<aptMethod> = {<pkgAcqMethod> = {<No data 
fields>}, <aptConfigWrapperForMethods> = {
          methodNames = std::vector of length 2, capacity 2 = {
            "http", "http::XXXXXX.nl"}}, Binary = "http", 
        SeccompFlags = 6}, 
      authconfs = std::vector of length 1, capacity 1 = {
        std::unique_ptr<FileFd> = {get() = 0x5643e5343100}}}, 
    Server = std::unique_ptr<ServerState> = {
      get() = 0x5643e534fba0}, NextURI = "", 
    AllowRedirect = true, Debug = false, PipelineDepth = 10, 
    static FailFile = {static npos = 18446744073709551615, 
      _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> 
= {<No data fields>}, <No data fields>}, 
        _M_p = 0x5643e534d650 
"/var/lib/apt/lists/partial/apt.osso.nl_ubuntu-security_dists_jammy-security_InRelease"},
 
      _M_string_length = 85, {
        _M_local_buf = "\226", '\000' <repeats 14 times>, 
        _M_allocated_capacity = 150}}, static FailFd = -1, 
    static FailTime = 1674071760}
  ```

  Relevant code in parent:
  ```
  pkgAcquire::RunResult pkgAcquire::Run(int PulseInterval)
  {
  ...
     // Run till all things have been acquired
     struct timeval tv = 
SteadyDurationToTimeVal(std::chrono::microseconds(PulseInterval));
     while (ToFetch > 0)
  ...
        int Res;
        do
        {
           Res = select(Highest+1,&RFds,&WFds,0,&tv);
        }
        while (Res < 0 && errno == EINTR);
  ```
  Relevant code in child:
  ```
  int BaseHttpMethod::Loop()
  {
     signal(SIGTERM,SigTerm);
     signal(SIGINT,SigTerm);
     
     Server = 0;
     
     int FailCounter = 0;
     while (1)
     {      
        // We have no commands, wait for some to arrive
        if (Queue == 0)
        {
           if (WaitFd(STDIN_FILENO) == false)
              return 0;
  ```

  Versions:
  ```
  Ubuntu 22.04.1 LTS (Jammy Jellyfish)
  apt 2.4.8
  ```

  Preliminiary conclusion/notes:

  - I don't know why it stalls, but it does. And this was not a single
  incident. All the stalling apt processes had stopped at a different
  time/day.

  - Apparently there is a situation possible where there are no commands
  queued from the parent, and the parent isn't planning on sending any
  either.

  - Our apt-get is invoked by a cron job which calls either `update-
  notifier-common` or `apt-get update`:
  https://github.com/ossobv/zabbix-agent-
  
osso/blob/c61aee6087c0d03e66d03013c61acd9f65d0eaab/scripts/dpkg.updates#L14-L31
  -- on the machines where apt-get was hanging, there was no `update-
  notifier-common`.

  - I don't know if this is 2.4.x related (on Focal we have apt 2.0.x),
  but it sure looks like it. I did check a few other Focal machines that
  lacked `update-notifier-common`, but they were not having a stalled
  apt-get (although _not_ seeing a hung process proves nothing)

  
  Let me know if there's anything I can get you. Should this be filed 
elsewhere? (Debian Salsa?)

  Cheers,
  Walter Doekes
  OSSO B.V.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/apt/+bug/2003851/+subscriptions


-- 
Mailing list: https://launchpad.net/~touch-packages
Post to     : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to