[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2024-03-25 Thread Adam Saponara
Following up, `Acquire::Retries=0` seems to be avoiding the bug in prod,
which strengthens my suspicion that it's related to retries.

As for the actual bug, when apt is hanging, I notice that the work queue
has items in it, but they have not been delegated to a worker. There is
a condition that is supposed to "unstuck" the queue here[1], but for
some reason `GetFetchAfter` is returning a zeroed time_point which
avoids that condition[2].

The way we're setting `FetchAfter` looks OK to me[3] though I'm not
familiar with std::chrono. Maybe there's some other place where we're
partially copying the item or instantiating a new item and failing to
copy the `FetchAfter` value.

A bandaid patch is attached. I'm not sure if there's any downside in
calling `Cycle` there. A real bugfix IMO would address the zeroed
`FetchAfter` value described above, or potentially refactor so that work
items are immediately delegated instead of having to rely on calling
`Cycle` at the right time. I am missing a lot of context so that maybe
that's not possible.

[1] https://salsa.debian.org/apt-
team/apt/-/blob/db797312548f6e7d6d1d1faa9f1e48c1e521d290/apt-
pkg/acquire.cc#L738-740

[2] https://salsa.debian.org/apt-
team/apt/-/blob/db797312548f6e7d6d1d1faa9f1e48c1e521d290/apt-
pkg/acquire.cc#L733-736

[3] https://salsa.debian.org/apt-
team/apt/-/blob/db797312548f6e7d6d1d1faa9f1e48c1e521d290/apt-
pkg/acquire-worker.cc#L651-655


** Attachment added: "patch"
   
https://bugs.launchpad.net/ubuntu/+source/apt/+bug/2003851/+attachment/5759339/+files/patch

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  Confirmed

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root  153929  \_ /usr/sbin/CRON -f -P
  root  153942  \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root  153943  \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root  154026  \_ apt-get update
  _apt  154029  \_ /usr/lib/apt/methods/http
  _apt  154030  \_ /usr/lib/apt/methods/http
  _apt  154031  \_ /usr/lib/apt/methods/http
  _apt  154033  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http 
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http 
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http 
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).

  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
  NULL, {tv_sec=0, tv_nsec=5}, NULL).

  The http sockets in the children were at fd=3.

  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root4uW  REG8,10  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root5r  FIFO   0,13  0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root6r  FIFO   0,13  0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root7r  FIFO   0,13  0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root8w  FIFO   0,13  0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root9r  FIFO   0,13  0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13  0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w  FIFO   0,13  0t0 4015193 pipe 154031,http,0r
  apt-get   154026 root   14w  FIFO   0,13  0t0 4015234 pipe 154033,gpgv,0r
  http  154029 _apt0r  FIFO   0,13  0t0 4015177 pipe 
154026,apt-get,8w
  http  154029 _apt1w  FIFO   0,13  0t0 4015176 pipe 
154026,apt-get,5r
  http  154030 _apt0r  FIFO   0,13  0t0 4012449 pipe 
154026,apt-get,10w
  http  154030 _apt1w  FIFO   0,13  0t0 4012448 pipe 

[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2024-03-21 Thread Adam Saponara
Hello, also affected by this.

I'm able to reproduce the bug using Walter's mock server. I suspect it
may be related to the retry code that kicks in when a transient error is
encountered (like a 503). Retries are enabled by default since apt
2.3.2, which seems to fit the regression window people have mentioned
here.

Frustratingly, I can't reproduce while step debugging, I'm guessing
because the bug is clock-dependent.

Setting either `Acquire::Retries::Delay=false` or `Acquire::Retries=0`
seems to mitigate it. I'm curious if that works for others as well. I'll
be testing this in prod soon and will report back.

In the mean time I'll keep trying to find the actual bug.

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  Confirmed

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root  153929  \_ /usr/sbin/CRON -f -P
  root  153942  \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root  153943  \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root  154026  \_ apt-get update
  _apt  154029  \_ /usr/lib/apt/methods/http
  _apt  154030  \_ /usr/lib/apt/methods/http
  _apt  154031  \_ /usr/lib/apt/methods/http
  _apt  154033  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http 
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http 
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http 
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).

  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
  NULL, {tv_sec=0, tv_nsec=5}, NULL).

  The http sockets in the children were at fd=3.

  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root4uW  REG8,10  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root5r  FIFO   0,13  0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root6r  FIFO   0,13  0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root7r  FIFO   0,13  0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root8w  FIFO   0,13  0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root9r  FIFO   0,13  0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13  0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w  FIFO   0,13  0t0 4015193 pipe 154031,http,0r
  apt-get   154026 root   14w  FIFO   0,13  0t0 4015234 pipe 154033,gpgv,0r
  http  154029 _apt0r  FIFO   0,13  0t0 4015177 pipe 
154026,apt-get,8w
  http  154029 _apt1w  FIFO   0,13  0t0 4015176 pipe 
154026,apt-get,5r
  http  154030 _apt0r  FIFO   0,13  0t0 4012449 pipe 
154026,apt-get,10w
  http  154030 _apt1w  FIFO   0,13  0t0 4012448 pipe 
154026,apt-get,6r
  http  154031 _apt0r  FIFO   0,13  0t0 4015193 pipe 
154026,apt-get,12w
  http  154031 _apt1w  FIFO   0,13  0t0 4015192 pipe 
154026,apt-get,7r
  gpgv  154033 _apt0r  FIFO   0,13  0t0 4015234 pipe 
154026,apt-get,14w
  gpgv  154033 _apt1w  FIFO   0,13  0t0 4015233 pipe 
154026,apt-get,9r
  ```
  So:
  - apt-get is waiting for any data written by any of its four children (at fd 
5/6/7/9)
  - http and gpgv are waiting for any data written by their parent (at their 
respective fd 0)

  Parent backtrace:
  ```
  #0  0x7f420116a74d in select ()
 from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x7f420153fb5d in pkgAcquire::Run(int) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #2  0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) ()
 from 

[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2023-10-13 Thread anarcat
We're seeing a similar issue here. At first we thought it was an issue
specific to a prometheus collector (https://bugs.debian.org/cgi-
bin/bugreport.cgi?bug=1028212 / https://github.com/prometheus-
community/node-exporter-textfile-collector-scripts/issues/179) but now
that I see this bug report, I can't help but think this is an issue in
apt itself.

I should also mention this seems like a regression between bullseye and
bookworm. For context, we have a nightly job that runs apt-update here
as well, a home-grown (`dsa-update-apt-status`, from DSA) script that
spews warnings through cron when there's an issue. In my mailbox where I
track those, I do have instances of this before we started the bookworm
upgrade, but those were rare. Starting from the beginning of our
bookworm deployment though, we are seeing more and more of this as we
upgrade machines over. We're now seeing daily warnings, as the `dsa-
update-apt-status` runs into lock contention with another job
(`apt_info.py` from the above collector, runs every 15m) more
frequently. We were getting daily warnings from the fleet, all bookworm
machines, with locks sometimes being held for hours.

Our current workaround has been to set a time limit to the `apt_info.py`
job, but we're *still* seeing errors, which is interesting in itself, as
it means the issue is *not* specific to that script: it's a global apt
issue. We've had unattended-upgrades.py hanging forever as well now,
which we've never seen before.

So I think this is an apt issue. Perhaps Acquire::http::Timeout=120 is a
valid workaround, but I can't help but think this is an issue that was
specifically introduced between bullseye and bookworm (2.2 vs 2.6).

** Bug watch added: Debian Bug tracker #1028212
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1028212

** Bug watch added: 
github.com/prometheus-community/node-exporter-textfile-collector-scripts/issues 
#179
   
https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/issues/179

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  Confirmed

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root  153929  \_ /usr/sbin/CRON -f -P
  root  153942  \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root  153943  \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root  154026  \_ apt-get update
  _apt  154029  \_ /usr/lib/apt/methods/http
  _apt  154030  \_ /usr/lib/apt/methods/http
  _apt  154031  \_ /usr/lib/apt/methods/http
  _apt  154033  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http 
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http 
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http 
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).

  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
  NULL, {tv_sec=0, tv_nsec=5}, NULL).

  The http sockets in the children were at fd=3.

  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root4uW  REG8,10  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root5r  FIFO   0,13  0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root6r  FIFO   0,13  0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root7r  FIFO   0,13  0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root8w  FIFO   0,13  0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root9r  FIFO   0,13  0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13  0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w 

[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2023-10-04 Thread Jay D.R.
FYI, ran into this same issue with Debian bookworm's apt-get (2.6.1)
hanging indefinitely talking to an apt-cacher-ng 3.7.x instance, easily
reproducible by having ansible run an `apt-get update` on ~20-25
machines simultaneously. Problem disappears when switching back to apt-
cacher-ng 3.6.x, so +1 for the apt-cacher-ng bug linked above.

Still, the problem persists even when invoking apt-get with `-o
Acquire::http::Timeout=120`, so presumably there must still be something
bugged on the side of apt itself not properly adhering to the configured
timeout.

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  Confirmed

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root  153929  \_ /usr/sbin/CRON -f -P
  root  153942  \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root  153943  \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root  154026  \_ apt-get update
  _apt  154029  \_ /usr/lib/apt/methods/http
  _apt  154030  \_ /usr/lib/apt/methods/http
  _apt  154031  \_ /usr/lib/apt/methods/http
  _apt  154033  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http 
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http 
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http 
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).

  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
  NULL, {tv_sec=0, tv_nsec=5}, NULL).

  The http sockets in the children were at fd=3.

  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root4uW  REG8,10  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root5r  FIFO   0,13  0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root6r  FIFO   0,13  0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root7r  FIFO   0,13  0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root8w  FIFO   0,13  0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root9r  FIFO   0,13  0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13  0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w  FIFO   0,13  0t0 4015193 pipe 154031,http,0r
  apt-get   154026 root   14w  FIFO   0,13  0t0 4015234 pipe 154033,gpgv,0r
  http  154029 _apt0r  FIFO   0,13  0t0 4015177 pipe 
154026,apt-get,8w
  http  154029 _apt1w  FIFO   0,13  0t0 4015176 pipe 
154026,apt-get,5r
  http  154030 _apt0r  FIFO   0,13  0t0 4012449 pipe 
154026,apt-get,10w
  http  154030 _apt1w  FIFO   0,13  0t0 4012448 pipe 
154026,apt-get,6r
  http  154031 _apt0r  FIFO   0,13  0t0 4015193 pipe 
154026,apt-get,12w
  http  154031 _apt1w  FIFO   0,13  0t0 4015192 pipe 
154026,apt-get,7r
  gpgv  154033 _apt0r  FIFO   0,13  0t0 4015234 pipe 
154026,apt-get,14w
  gpgv  154033 _apt1w  FIFO   0,13  0t0 4015233 pipe 
154026,apt-get,9r
  ```
  So:
  - apt-get is waiting for any data written by any of its four children (at fd 
5/6/7/9)
  - http and gpgv are waiting for any data written by their parent (at their 
respective fd 0)

  Parent backtrace:
  ```
  #0  0x7f420116a74d in select ()
 from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x7f420153fb5d in pkgAcquire::Run(int) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #2  0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #3  0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) 
()
 from 

[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2023-07-21 Thread Alexey Gavrilov
I suggest a temporary solution for cron . wrap apt update in timeout -k
60 1200 or add to systemd unit add JobTimeoutSec=1200

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  Confirmed

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root  153929  \_ /usr/sbin/CRON -f -P
  root  153942  \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root  153943  \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root  154026  \_ apt-get update
  _apt  154029  \_ /usr/lib/apt/methods/http
  _apt  154030  \_ /usr/lib/apt/methods/http
  _apt  154031  \_ /usr/lib/apt/methods/http
  _apt  154033  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http 
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http 
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http 
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).

  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
  NULL, {tv_sec=0, tv_nsec=5}, NULL).

  The http sockets in the children were at fd=3.

  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root4uW  REG8,10  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root5r  FIFO   0,13  0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root6r  FIFO   0,13  0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root7r  FIFO   0,13  0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root8w  FIFO   0,13  0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root9r  FIFO   0,13  0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13  0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w  FIFO   0,13  0t0 4015193 pipe 154031,http,0r
  apt-get   154026 root   14w  FIFO   0,13  0t0 4015234 pipe 154033,gpgv,0r
  http  154029 _apt0r  FIFO   0,13  0t0 4015177 pipe 
154026,apt-get,8w
  http  154029 _apt1w  FIFO   0,13  0t0 4015176 pipe 
154026,apt-get,5r
  http  154030 _apt0r  FIFO   0,13  0t0 4012449 pipe 
154026,apt-get,10w
  http  154030 _apt1w  FIFO   0,13  0t0 4012448 pipe 
154026,apt-get,6r
  http  154031 _apt0r  FIFO   0,13  0t0 4015193 pipe 
154026,apt-get,12w
  http  154031 _apt1w  FIFO   0,13  0t0 4015192 pipe 
154026,apt-get,7r
  gpgv  154033 _apt0r  FIFO   0,13  0t0 4015234 pipe 
154026,apt-get,14w
  gpgv  154033 _apt1w  FIFO   0,13  0t0 4015233 pipe 
154026,apt-get,9r
  ```
  So:
  - apt-get is waiting for any data written by any of its four children (at fd 
5/6/7/9)
  - http and gpgv are waiting for any data written by their parent (at their 
respective fd 0)

  Parent backtrace:
  ```
  #0  0x7f420116a74d in select ()
 from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x7f420153fb5d in pkgAcquire::Run(int) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #2  0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #3  0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) 
()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #4  0x7f42016d127b in DoUpdate (CmdL=...)
  at ./apt-private/private-update.cc:73
  #5  0x7f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch 
const*, bool) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #6  0x7f420169fa97 in DispatchCommandLine (CmdL=..., 
  Cmds=std::vector of length 27, capacity 32 = {...})
  at ./apt-private/private-cmndline.cc:588
  #7  

[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2023-05-02 Thread Walter
Hi Julian,

> [There] isn't much we can do otherwise, this highly depends on network
specifics like latency and mtus and is never reliably reproducible.

So, with my bogus python PoC server I mentioned in the comments, it is
quite easily reproducible.

If I can assist you in getting it up and running, do let me know.

Cheers,
Walter

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  Confirmed

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root  153929  \_ /usr/sbin/CRON -f -P
  root  153942  \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root  153943  \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root  154026  \_ apt-get update
  _apt  154029  \_ /usr/lib/apt/methods/http
  _apt  154030  \_ /usr/lib/apt/methods/http
  _apt  154031  \_ /usr/lib/apt/methods/http
  _apt  154033  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http 
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http 
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http 
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).

  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
  NULL, {tv_sec=0, tv_nsec=5}, NULL).

  The http sockets in the children were at fd=3.

  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root4uW  REG8,10  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root5r  FIFO   0,13  0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root6r  FIFO   0,13  0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root7r  FIFO   0,13  0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root8w  FIFO   0,13  0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root9r  FIFO   0,13  0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13  0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w  FIFO   0,13  0t0 4015193 pipe 154031,http,0r
  apt-get   154026 root   14w  FIFO   0,13  0t0 4015234 pipe 154033,gpgv,0r
  http  154029 _apt0r  FIFO   0,13  0t0 4015177 pipe 
154026,apt-get,8w
  http  154029 _apt1w  FIFO   0,13  0t0 4015176 pipe 
154026,apt-get,5r
  http  154030 _apt0r  FIFO   0,13  0t0 4012449 pipe 
154026,apt-get,10w
  http  154030 _apt1w  FIFO   0,13  0t0 4012448 pipe 
154026,apt-get,6r
  http  154031 _apt0r  FIFO   0,13  0t0 4015193 pipe 
154026,apt-get,12w
  http  154031 _apt1w  FIFO   0,13  0t0 4015192 pipe 
154026,apt-get,7r
  gpgv  154033 _apt0r  FIFO   0,13  0t0 4015234 pipe 
154026,apt-get,14w
  gpgv  154033 _apt1w  FIFO   0,13  0t0 4015233 pipe 
154026,apt-get,9r
  ```
  So:
  - apt-get is waiting for any data written by any of its four children (at fd 
5/6/7/9)
  - http and gpgv are waiting for any data written by their parent (at their 
respective fd 0)

  Parent backtrace:
  ```
  #0  0x7f420116a74d in select ()
 from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x7f420153fb5d in pkgAcquire::Run(int) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #2  0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #3  0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) 
()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #4  0x7f42016d127b in DoUpdate (CmdL=...)
  at ./apt-private/private-update.cc:73
  #5  0x7f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch 
const*, bool) ()
 from 

[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2023-04-24 Thread Ken Sharp
** Tags added: jammy

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  Confirmed

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root  153929  \_ /usr/sbin/CRON -f -P
  root  153942  \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root  153943  \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root  154026  \_ apt-get update
  _apt  154029  \_ /usr/lib/apt/methods/http
  _apt  154030  \_ /usr/lib/apt/methods/http
  _apt  154031  \_ /usr/lib/apt/methods/http
  _apt  154033  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http 
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http 
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http 
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).

  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
  NULL, {tv_sec=0, tv_nsec=5}, NULL).

  The http sockets in the children were at fd=3.

  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root4uW  REG8,10  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root5r  FIFO   0,13  0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root6r  FIFO   0,13  0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root7r  FIFO   0,13  0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root8w  FIFO   0,13  0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root9r  FIFO   0,13  0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13  0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w  FIFO   0,13  0t0 4015193 pipe 154031,http,0r
  apt-get   154026 root   14w  FIFO   0,13  0t0 4015234 pipe 154033,gpgv,0r
  http  154029 _apt0r  FIFO   0,13  0t0 4015177 pipe 
154026,apt-get,8w
  http  154029 _apt1w  FIFO   0,13  0t0 4015176 pipe 
154026,apt-get,5r
  http  154030 _apt0r  FIFO   0,13  0t0 4012449 pipe 
154026,apt-get,10w
  http  154030 _apt1w  FIFO   0,13  0t0 4012448 pipe 
154026,apt-get,6r
  http  154031 _apt0r  FIFO   0,13  0t0 4015193 pipe 
154026,apt-get,12w
  http  154031 _apt1w  FIFO   0,13  0t0 4015192 pipe 
154026,apt-get,7r
  gpgv  154033 _apt0r  FIFO   0,13  0t0 4015234 pipe 
154026,apt-get,14w
  gpgv  154033 _apt1w  FIFO   0,13  0t0 4015233 pipe 
154026,apt-get,9r
  ```
  So:
  - apt-get is waiting for any data written by any of its four children (at fd 
5/6/7/9)
  - http and gpgv are waiting for any data written by their parent (at their 
respective fd 0)

  Parent backtrace:
  ```
  #0  0x7f420116a74d in select ()
 from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x7f420153fb5d in pkgAcquire::Run(int) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #2  0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #3  0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) 
()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #4  0x7f42016d127b in DoUpdate (CmdL=...)
  at ./apt-private/private-update.cc:73
  #5  0x7f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch 
const*, bool) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #6  0x7f420169fa97 in DispatchCommandLine (CmdL=..., 
  Cmds=std::vector of length 27, capacity 32 = {...})
  at ./apt-private/private-cmndline.cc:588
  #7  0x561fc06bafbd in main (argc=2, argv=0x7ffc0e14aa38)
  at ./cmdline/apt-get.cc:447
  ```
  Child 

[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2023-02-23 Thread Launchpad Bug Tracker
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: apt (Ubuntu)
   Status: New => Confirmed

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  Confirmed

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root  153929  \_ /usr/sbin/CRON -f -P
  root  153942  \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root  153943  \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root  154026  \_ apt-get update
  _apt  154029  \_ /usr/lib/apt/methods/http
  _apt  154030  \_ /usr/lib/apt/methods/http
  _apt  154031  \_ /usr/lib/apt/methods/http
  _apt  154033  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http 
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http 
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http 
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).

  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
  NULL, {tv_sec=0, tv_nsec=5}, NULL).

  The http sockets in the children were at fd=3.

  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root4uW  REG8,10  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root5r  FIFO   0,13  0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root6r  FIFO   0,13  0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root7r  FIFO   0,13  0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root8w  FIFO   0,13  0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root9r  FIFO   0,13  0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13  0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w  FIFO   0,13  0t0 4015193 pipe 154031,http,0r
  apt-get   154026 root   14w  FIFO   0,13  0t0 4015234 pipe 154033,gpgv,0r
  http  154029 _apt0r  FIFO   0,13  0t0 4015177 pipe 
154026,apt-get,8w
  http  154029 _apt1w  FIFO   0,13  0t0 4015176 pipe 
154026,apt-get,5r
  http  154030 _apt0r  FIFO   0,13  0t0 4012449 pipe 
154026,apt-get,10w
  http  154030 _apt1w  FIFO   0,13  0t0 4012448 pipe 
154026,apt-get,6r
  http  154031 _apt0r  FIFO   0,13  0t0 4015193 pipe 
154026,apt-get,12w
  http  154031 _apt1w  FIFO   0,13  0t0 4015192 pipe 
154026,apt-get,7r
  gpgv  154033 _apt0r  FIFO   0,13  0t0 4015234 pipe 
154026,apt-get,14w
  gpgv  154033 _apt1w  FIFO   0,13  0t0 4015233 pipe 
154026,apt-get,9r
  ```
  So:
  - apt-get is waiting for any data written by any of its four children (at fd 
5/6/7/9)
  - http and gpgv are waiting for any data written by their parent (at their 
respective fd 0)

  Parent backtrace:
  ```
  #0  0x7f420116a74d in select ()
 from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x7f420153fb5d in pkgAcquire::Run(int) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #2  0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #3  0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) 
()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #4  0x7f42016d127b in DoUpdate (CmdL=...)
  at ./apt-private/private-update.cc:73
  #5  0x7f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch 
const*, bool) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #6  0x7f420169fa97 in DispatchCommandLine (CmdL=..., 
  Cmds=std::vector of length 27, capacity 32 = {...})
  at ./apt-private/private-cmndline.cc:588
  #7  

[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2023-02-23 Thread Walter
Relevant bug in apt-cacher-ng:
https://bugs.launchpad.net/ubuntu/+source/apt-cacher-ng/+bug/1983856

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  New

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root  153929  \_ /usr/sbin/CRON -f -P
  root  153942  \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root  153943  \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root  154026  \_ apt-get update
  _apt  154029  \_ /usr/lib/apt/methods/http
  _apt  154030  \_ /usr/lib/apt/methods/http
  _apt  154031  \_ /usr/lib/apt/methods/http
  _apt  154033  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http 
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http 
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http 
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).

  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
  NULL, {tv_sec=0, tv_nsec=5}, NULL).

  The http sockets in the children were at fd=3.

  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root4uW  REG8,10  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root5r  FIFO   0,13  0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root6r  FIFO   0,13  0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root7r  FIFO   0,13  0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root8w  FIFO   0,13  0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root9r  FIFO   0,13  0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13  0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w  FIFO   0,13  0t0 4015193 pipe 154031,http,0r
  apt-get   154026 root   14w  FIFO   0,13  0t0 4015234 pipe 154033,gpgv,0r
  http  154029 _apt0r  FIFO   0,13  0t0 4015177 pipe 
154026,apt-get,8w
  http  154029 _apt1w  FIFO   0,13  0t0 4015176 pipe 
154026,apt-get,5r
  http  154030 _apt0r  FIFO   0,13  0t0 4012449 pipe 
154026,apt-get,10w
  http  154030 _apt1w  FIFO   0,13  0t0 4012448 pipe 
154026,apt-get,6r
  http  154031 _apt0r  FIFO   0,13  0t0 4015193 pipe 
154026,apt-get,12w
  http  154031 _apt1w  FIFO   0,13  0t0 4015192 pipe 
154026,apt-get,7r
  gpgv  154033 _apt0r  FIFO   0,13  0t0 4015234 pipe 
154026,apt-get,14w
  gpgv  154033 _apt1w  FIFO   0,13  0t0 4015233 pipe 
154026,apt-get,9r
  ```
  So:
  - apt-get is waiting for any data written by any of its four children (at fd 
5/6/7/9)
  - http and gpgv are waiting for any data written by their parent (at their 
respective fd 0)

  Parent backtrace:
  ```
  #0  0x7f420116a74d in select ()
 from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x7f420153fb5d in pkgAcquire::Run(int) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #2  0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #3  0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) 
()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #4  0x7f42016d127b in DoUpdate (CmdL=...)
  at ./apt-private/private-update.cc:73
  #5  0x7f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch 
const*, bool) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #6  0x7f420169fa97 in DispatchCommandLine (CmdL=..., 
  Cmds=std::vector of length 27, capacity 32 = {...})
  at ./apt-private/private-cmndline.cc:588
  #7  0x561fc06bafbd in main (argc=2, 

[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2023-02-23 Thread Walter
Got some more pcaps. I can also reproduce by sending a 503 after a 200,
without aborting during/before body transfer (*).


server.py logs

13:29:55 [3666956] client said: b'GET /ubuntu/dists/jammy/InRelease 
HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: 
text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n'
13:29:55 [3666956] we say: 200 and serve local InRelease
13:29:55 [3666956] client said: b'GET /ubuntu/dists/jammy-updates/InRelease 
HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: 
text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n'
13:29:55 [3666956] we say: 503


503 request/response (valid length, no early FIN)

GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1
Host: junk.devs.nu:3851
Cache-Control: max-age=0
Accept: text/*
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive


HTTP/1.1 503 OK
date: Fri, 17 Feb 2023 15:11:49 GMT
content-type: text/plain
content-length: 19

something is broken



I present to you:

- https://junk.devs.nu/2023/lp2003851/not-reproduced-port-34634.txt
- https://junk.devs.nu/2023/lp2003851/not-reproduced-port-34634.pcap
- https://junk.devs.nu/2023/lp2003851/reproduced-port-58014.txt
- https://junk.devs.nu/2023/lp2003851/reproduced-port-58014.pcap

In the not-reproduced case, the error is acknowledged and the download
is retried. Then it fails quickly.

In the reproduced case, the error is not acknowledged, and the result is
a hanging apt-get. Logs ending with:

Hit:3 http://repo.zabbix.com/zabbix/6.2/ubuntu jammy InRelease
0% [Working]


(*) observed in the wild with these req/resp:

GET /ubuntu/dists/jammy-backports/InRelease HTTP/1.1
Host: apt.osso.nl
Cache-Control: max-age=0
Accept: text/*
If-Modified-Since: Tue, 21 Feb 2023 19:32:00 GMT
User-Agent: Debian APT-HTTP/1.3 (2.4.8) non-interactive

HTTP/1.1 200 OK
date: Wed, 22 Feb 2023 02:36:08 GMT
content-type: octet/stream
content-length: 106807
last-modified: Wed, 22 Feb 2023 01:49:00 GMT
x-original-source: 
http://nl.archive.ubuntu.com/ubuntu/dists/jammy-backports/InRelease
(+ valid body)

GET /ubuntu-security/dists/jammy-security/InRelease HTTP/1.1
Host: apt.osso.nl
Cache-Control: max-age=0
Accept: text/*
If-Modified-Since: Tue, 21 Feb 2023 23:11:00 GMT
User-Agent: Debian APT-HTTP/1.3 (2.4.8) non-interactive

HTTP/1.1 503 Connection timeout
date: Wed, 22 Feb 2023 02:36:08 GMT
content-type: text/html
content-length: 486
(+ valid body)


extra contemplations:

We've only recently added the /ubuntu-security/ (extra path, on the same 
server). Maybe that is the reason why this is biting us right now.

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  New

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root  153929  \_ /usr/sbin/CRON -f -P
  root  153942  \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root  153943  \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root  154026  \_ apt-get update
  _apt  154029  \_ /usr/lib/apt/methods/http
  _apt  154030  \_ /usr/lib/apt/methods/http
  _apt  154031  \_ /usr/lib/apt/methods/http
  _apt  154033  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http 
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http 
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http 
  ```
  All children (including gpgv) 

[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2023-02-21 Thread Walter
Okay. I got some pcaps now. And apparently our apt-cacher-ng proxy
sometimes serves:

- an InRelease
- with a 200 OK
- and a non-zero Content-Length
- BUT, no content, but instead a TCP-FIN.

Under the right circumstances, this causes apt to stall forever.

The right circumstances seem to include:

- a valid download (or 304) from the same host
- a valid download from another host

---

I created a test server, and got this to reproduce with apt 2.4.9 (from
jammy-proposed):

=
sources.list
=
deb http://junk.devs.nu:3851/ubuntu jammy main
deb http://junk.devs.nu:3851/ubuntu jammy-updates main
deb [arch=amd64 signed-by=/etc/apt/keyrings/zabbix.gpg] 
http://repo.zabbix.com/zabbix/6.2/ubuntu jammy main

=
server.py logs
=
13:31:58 [2771009] client connected: ('217.21.199.56', 42838)
13:31:58 [2771009] client said: b'GET /ubuntu/dists/jammy/InRelease 
HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: 
text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n'
13:31:58 [2771009] we say: 200 and serve local InRelease
13:31:58 [2771009] client said: b'GET /ubuntu/dists/jammy-updates/InRelease 
HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: 
text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n'
13:31:58 [2771009] we say: 200 but serve a FIN
13:31:58 [2771009] client said: b''
13:31:58 [2771009] client disconnected: ('217.21.199.56', 42838)

=
apt-get update -o Debug::Acquire::http=1
=
0% [Connecting to junk.devs.nu] [Connecting to repo.zabbix.com 
(2604:a880:2:d0::2062:d001)]GET /zabbix/6.2/ubuntu/dists/jammy/InRelease 
HTTP/1.1
Host: repo.zabbix.com
Cache-Control: max-age=0
Accept: text/*
If-Modified-Since: Wed, 01 Feb 2023 15:02:57 GMT
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive


Answer for: http://repo.zabbix.com/zabbix/6.2/ubuntu/dists/jammy/InRelease
HTTP/1.1 304 Not Modified
Server: nginx/1.14.2
Date: Tue, 21 Feb 2023 12:31:54 GMT
Last-Modified: Wed, 01 Feb 2023 15:02:57 GMT
Connection: keep-alive
ETag: "63da7f21-135e"

Hit:1 http://repo.zabbix.com/zabbix/6.2/ubuntu jammy InRelease  
   
0% [Connecting to junk.devs.nu]GET /ubuntu/dists/jammy/InRelease HTTP/1.1   
   
Host: junk.devs.nu:3851
Cache-Control: max-age=0
Accept: text/*
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive


Answer for: http://junk.devs.nu:3851/ubuntu/dists/jammy/InRelease
HTTP/1.1 200 OK
date: Fri, 17 Feb 2023 15:11:49 GMT
content-type: octet/stream
content-length: 270087
last-modified: Fri, 17 Feb 2023 13:29:00 GMT
x-original-source: http://de.archive.ubuntu.com/ubuntu/dists/jammy/InRelease

Get:2 http://junk.devs.nu:3851/ubuntu jammy InRelease [270 kB]
0% [2 InRelease 0 B/270 kB 0%]GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1
Host: junk.devs.nu:3851
Cache-Control: max-age=0
Accept: text/*
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive


Answer for: http://junk.devs.nu:3851/ubuntu/dists/jammy-updates/InRelease
HTTP/1.1 200 OK
date: Fri, 17 Feb 2023 15:11:49 GMT
content-type: octet/stream
content-length: 118747
last-modified: Fri, 17 Feb 2023 13:29:00 GMT
x-original-source: 
http://de.archive.ubuntu.com/ubuntu/dists/jammy-updates/InRelease

Get:3 http://junk.devs.nu:3851/ubuntu jammy-updates InRelease [119 kB]
Ign:3 http://junk.devs.nu:3851/ubuntu jammy-updates InRelease
0% [Working]

=
ps faxu
=
root 3549798  0.0  0.0  16652  6556 pts/3S+   12:31   0:00  
\_ apt-get update -o Debug::Acquire::http=1
_apt 3549801  0.0  0.0  22068  6540 pts/3S+   12:31   0:00  
\_ /usr/lib/apt/methods/http
_apt 3549802  0.0  0.0  22068  6552 pts/3S+   12:31   0:00  
\_ /usr/lib/apt/methods/http
_apt 3549804  0.0  0.0  15440  5016 pts/3S+   12:31   0:00  
\_ /usr/lib/apt/methods/gpgv

=
netstat
=
tcp1  0 217.21.199.56:44494 178.128.6.101:80CLOSE_WAIT  
3549801/http

(this IP belongs to the zabbix repo. the connection to the junk.devs.nu
broken server is already gone.)

=
preliminary info
=
The connection to the zabbix repo is unclosed. The connection to the broken 
server is closed. The sequence was:

12:31:58.082726 eth0  In  IP 217.21.198.165.3851 > 217.21.199.56.42838: Flags 
[F.], seq 270320:270560, ack 349, win 507, options [nop,nop,TS val 4041174334 
ecr 558653630], length 240
12:31:58.083017 eth0  Out IP 217.21.199.56.42838 > 217.21.198.165.3851: Flags 
[F.], seq 349, ack 270561, win 1022, options 

[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2023-02-10 Thread Walter
Checking in.

I've had an `apt-get update` running in a while loop since Feb 6 on a
machine: no luck getting it to stall.

Halfway, I added a `iptables -A INPUT -m statistic --mode random
--probability 0.05 -j DROP` but this did not affect anything.

But, in the mean time, on production I have had two more of these since
filing this bug report...

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  New

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root  153929  \_ /usr/sbin/CRON -f -P
  root  153942  \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root  153943  \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root  154026  \_ apt-get update
  _apt  154029  \_ /usr/lib/apt/methods/http
  _apt  154030  \_ /usr/lib/apt/methods/http
  _apt  154031  \_ /usr/lib/apt/methods/http
  _apt  154033  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http 
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http 
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http 
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).

  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
  NULL, {tv_sec=0, tv_nsec=5}, NULL).

  The http sockets in the children were at fd=3.

  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root4uW  REG8,10  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root5r  FIFO   0,13  0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root6r  FIFO   0,13  0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root7r  FIFO   0,13  0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root8w  FIFO   0,13  0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root9r  FIFO   0,13  0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13  0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w  FIFO   0,13  0t0 4015193 pipe 154031,http,0r
  apt-get   154026 root   14w  FIFO   0,13  0t0 4015234 pipe 154033,gpgv,0r
  http  154029 _apt0r  FIFO   0,13  0t0 4015177 pipe 
154026,apt-get,8w
  http  154029 _apt1w  FIFO   0,13  0t0 4015176 pipe 
154026,apt-get,5r
  http  154030 _apt0r  FIFO   0,13  0t0 4012449 pipe 
154026,apt-get,10w
  http  154030 _apt1w  FIFO   0,13  0t0 4012448 pipe 
154026,apt-get,6r
  http  154031 _apt0r  FIFO   0,13  0t0 4015193 pipe 
154026,apt-get,12w
  http  154031 _apt1w  FIFO   0,13  0t0 4015192 pipe 
154026,apt-get,7r
  gpgv  154033 _apt0r  FIFO   0,13  0t0 4015234 pipe 
154026,apt-get,14w
  gpgv  154033 _apt1w  FIFO   0,13  0t0 4015233 pipe 
154026,apt-get,9r
  ```
  So:
  - apt-get is waiting for any data written by any of its four children (at fd 
5/6/7/9)
  - http and gpgv are waiting for any data written by their parent (at their 
respective fd 0)

  Parent backtrace:
  ```
  #0  0x7f420116a74d in select ()
 from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x7f420153fb5d in pkgAcquire::Run(int) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #2  0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #3  0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) 
()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #4  0x7f42016d127b in DoUpdate (CmdL=...)
  at ./apt-private/private-update.cc:73
  #5  0x7f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch 
const*, bool) ()
 from 

[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2023-01-25 Thread Walter
Thanks for the reply. Julian.

Let's assume that the problem is indeed latency/dropped packets/whatever
on our side. IMO, an (occasionally) broken network should not cause apt-
get to hang indefinitely. Do you think it should?

Also, it doesn't address that the behaviour seems recent. We have not
observed anything like this with Focal or Bionic.

In the mean time, I've added notification code so at least we can we
track when this occurs. I'll fill you in if I get more details (like
versions that are (not) affected).



Additional debug info:

I called recv(3..) on the http socket with a recv-q. It returns 0 (no error, 
EOF).
```
(gdb) call (ssize_t)recv(3, "abcdef", 1, 0)
$1 = 0
```
```
# netstat -apn | grep -E '154026|154029|154030|154031|154033'
tcp1  0 10.91.52.91:60868   217.21.205.139:80   CLOSE_WAIT  
154030/http 
tcp1  0 10.91.52.91:40756   178.128.6.101:80CLOSE_WAIT  
154029/http 
tcp0  0 10.91.52.91:56818   185.37.124.14:80CLOSE_WAIT  
154031/http 
```

Next, I called close(0) on the stdin socket on one of the processes, and this 
awoke the whole list of tasks:
```
(gdb) call (int)close(0)
$1 = 0
```
This yielded the following error message:
```
E: Method http has died unexpectedly!
```
And the apt-get process got unstuck / was reaped succesfully.

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  New

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root  153929  \_ /usr/sbin/CRON -f -P
  root  153942  \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root  153943  \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root  154026  \_ apt-get update
  _apt  154029  \_ /usr/lib/apt/methods/http
  _apt  154030  \_ /usr/lib/apt/methods/http
  _apt  154031  \_ /usr/lib/apt/methods/http
  _apt  154033  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http 
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http 
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http 
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).

  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
  NULL, {tv_sec=0, tv_nsec=5}, NULL).

  The http sockets in the children were at fd=3.

  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root4uW  REG8,10  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root5r  FIFO   0,13  0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root6r  FIFO   0,13  0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root7r  FIFO   0,13  0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root8w  FIFO   0,13  0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root9r  FIFO   0,13  0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13  0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w  FIFO   0,13  0t0 4015193 pipe 154031,http,0r
  apt-get   154026 root   14w  FIFO   0,13  0t0 4015234 pipe 154033,gpgv,0r
  http  154029 _apt0r  FIFO   0,13  0t0 4015177 pipe 
154026,apt-get,8w
  http  154029 _apt1w  FIFO   0,13  0t0 4015176 pipe 
154026,apt-get,5r
  http  154030 _apt0r  FIFO   0,13  0t0 4012449 pipe 
154026,apt-get,10w
  http  154030 _apt1w  FIFO   0,13  0t0 4012448 pipe 
154026,apt-get,6r
  http  154031 _apt0r  FIFO   0,13  0t0 4015193 pipe 
154026,apt-get,12w
  http  154031 _apt1w  FIFO   0,13  0t0 4015192 pipe 
154026,apt-get,7r
  gpgv 

[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2023-01-25 Thread Julian Andres Klode
Patches are welcome but unfortunately there isn't much we can do
otherwise, this highly depends on network specifics like latency and
mtus and is never reliably reproducible.

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  New

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root  153929  \_ /usr/sbin/CRON -f -P
  root  153942  \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root  153943  \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root  154026  \_ apt-get update
  _apt  154029  \_ /usr/lib/apt/methods/http
  _apt  154030  \_ /usr/lib/apt/methods/http
  _apt  154031  \_ /usr/lib/apt/methods/http
  _apt  154033  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http 
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http 
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http 
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).

  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
  NULL, {tv_sec=0, tv_nsec=5}, NULL).

  The http sockets in the children were at fd=3.

  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root4uW  REG8,10  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root5r  FIFO   0,13  0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root6r  FIFO   0,13  0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root7r  FIFO   0,13  0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root8w  FIFO   0,13  0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root9r  FIFO   0,13  0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13  0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w  FIFO   0,13  0t0 4015193 pipe 154031,http,0r
  apt-get   154026 root   14w  FIFO   0,13  0t0 4015234 pipe 154033,gpgv,0r
  http  154029 _apt0r  FIFO   0,13  0t0 4015177 pipe 
154026,apt-get,8w
  http  154029 _apt1w  FIFO   0,13  0t0 4015176 pipe 
154026,apt-get,5r
  http  154030 _apt0r  FIFO   0,13  0t0 4012449 pipe 
154026,apt-get,10w
  http  154030 _apt1w  FIFO   0,13  0t0 4012448 pipe 
154026,apt-get,6r
  http  154031 _apt0r  FIFO   0,13  0t0 4015193 pipe 
154026,apt-get,12w
  http  154031 _apt1w  FIFO   0,13  0t0 4015192 pipe 
154026,apt-get,7r
  gpgv  154033 _apt0r  FIFO   0,13  0t0 4015234 pipe 
154026,apt-get,14w
  gpgv  154033 _apt1w  FIFO   0,13  0t0 4015233 pipe 
154026,apt-get,9r
  ```
  So:
  - apt-get is waiting for any data written by any of its four children (at fd 
5/6/7/9)
  - http and gpgv are waiting for any data written by their parent (at their 
respective fd 0)

  Parent backtrace:
  ```
  #0  0x7f420116a74d in select ()
 from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x7f420153fb5d in pkgAcquire::Run(int) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #2  0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #3  0x7f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) 
()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #4  0x7f42016d127b in DoUpdate (CmdL=...)
  at ./apt-private/private-update.cc:73
  #5  0x7f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch 
const*, bool) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #6  0x7f420169fa97 in DispatchCommandLine (CmdL=..., 
  Cmds=std::vector of length 27, capacity 32 = {...})
  at