Okay, upstream actually settled on this follow-up[1] rather than a revert of my patch, so I will do an upload with that additional commit.
[1] https://github.com/systemd/systemd/commit/20572a8938dd017a23478a66c04f5768b763cb56

** Description changed:

[Impact]

cloud-init makes DNS queries early in boot, before systemd-networkd is active. The query causes systemd-resolved to query registered hooks, which then socket-activates systemd-networkd. But, this causes a boot delay, because systemd-networkd cannot start until the relevant cloud-init phase has completed. Hence, the user impact is a delay in boot on certain clouds.

[Test Plan]

When a new version of systemd hits `-proposed`, CPC will do the following:

- * an image based on systemd from `-proposed` (using a custom line in our bootstraps, cpc_packaging.extra) will be built for GCE and published to a private project
- * the CPC team will run internal validations (CTF) and Google's upstream test suite `cloud-image-tests` (CIT)
- * Special attention will be paid to the image's bootspeed as that was the symptom of this specific bug
+ * an image based on systemd from `-proposed` (using a custom line in our bootstraps, cpc_packaging.extra) will be built for GCE and published to a private project
+ * the CPC team will run internal validations (CTF) and Google's upstream test suite `cloud-image-tests` (CIT)
+ * Special attention will be paid to the image's bootspeed as that was the symptom of this specific bug

If all the testing indicates that the package is acceptable/fixes the issue, verification can be considered done and the tags will be updated as such.

[Where problems could occur]

The fix is to order systemd-networkd-resolve-hook.socket After=network-pre.target. This avoids the situation where the socket for the resolve hook is bound before systemd-networkd could actually be started. Hence, problems introduced by this would most likely be related to unit ordering or dependency problems.
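The ordering described under [Where problems could occur] can be checked on a booted system. The following is a minimal sketch, assuming `systemctl show --property=After --value <unit>` prints the unit's ordering dependencies as one space-separated line. The helper names `ordered_after` and `read_after` are hypothetical and are not part of the upstream fix or the test plan.

```python
import shutil
import subprocess
from typing import Optional

UNIT = "systemd-networkd-resolve-hook.socket"
TARGET = "network-pre.target"


def ordered_after(after_value: str, target: str) -> bool:
    """Return True if target appears in a space-separated After= value."""
    return target in after_value.split()


def read_after(unit: str) -> Optional[str]:
    """Return the unit's After= value, or None if it cannot be read.

    Assumption: systemctl is on PATH and the system was booted with systemd.
    """
    if shutil.which("systemctl") is None:
        return None
    try:
        result = subprocess.run(
            ["systemctl", "show", "--property=After", "--value", unit],
            capture_output=True,
            text=True,
            check=False,
            timeout=10,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def check(unit: str = UNIT, target: str = TARGET) -> Optional[bool]:
    """Return True/False for the ordering, or None if it is unknown."""
    value = read_after(unit)
    return None if value is None else ordered_after(value, target)
```

On an image with the fix, `check()` would be expected to return `True`; `None` only means the property could not be read, not that the ordering is absent.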
[Other info]

- The fix comes from upstream: systemd-networkd-resolve-hook.socket
+ The fix comes from upstream:
+
+ https://github.com/systemd/systemd/commit/37adb410a2b62716b666dbf8359edf8a6546ff94
+ https://github.com/systemd/systemd/commit/20572a8938dd017a23478a66c04f5768b763cb56

[Original Description]

`cloud-init`: `26.1-0ubuntu1`
`systemd`: `259.5-0ubuntu2`
`datasource`: `DataSourceGCELocal`
`interface`: `ens4`
DHCP used by `cloud-init`: `dhcpcd 10.3.0`
Confirmed clouds affected: Google Cloud

On dev resolute images in GCE, first boot is delayed by about 30s during `cloud-init`'s local datasource discovery, specifically when `/etc/resolv.conf` is symlinked to the stub resolver (`/etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf`), which is the default.

On initial boot, `cloud-init analyze blame` shows that `init-local/search-GCELocal` takes about 31s, and `/var/log/cloud-init.log` shows that although `DHCP` completes quickly, the `DNS` resolution of `metadata.google.internal` takes about 30s before the first metadata request is sent. The same slow boot’s logs also show that `systemd-resolved` times out on the `io.systemd.Network` resolve hook, and that `systemd-networkd.service` doesn't start until after `cloud-init-local.service` *finishes*.

To test the theory I manually mounted a vanilla image (I had to do it like this because `livecd-rootfs` forces a specific /etc/resolv.conf file on you...) and edited it so that `/etc/resolv.conf` pointed to the non-stubbed `systemd-resolved` file instead (`/etc/resolv.conf -> ../run/systemd/resolve/resolv.conf`). This reduced first boot `init-local/search-GCELocal` back down to about 2s and reduced the metadata `hostname` resolution from about 30s to ~0.03s.

This indicates to me that the boot delay is caused by `systemd-resolved`’s stub resolver path interacting badly with the `io.systemd.Network` resolve hook path in `systemd` v259 (?) during `cloud-init-local`.
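The stub versus non-stub comparison above comes down to timing one hostname lookup under each `/etc/resolv.conf` symlink. The following is a minimal sketch of such a measurement, assuming the lookup goes through the system resolver via Python's `socket.getaddrinfo`. The function name `time_resolution` and its `resolver` parameter are hypothetical; this is not how `cloud-init` produced its "Resolving URL took ..." log line.

```python
import socket
import time


def time_resolution(host, port=80, resolver=socket.getaddrinfo):
    """Time a single lookup of host.

    Returns (elapsed_seconds, error), where error is None on success.
    The resolver argument exists only so the timing logic can be exercised
    without touching the network.
    """
    start = time.monotonic()
    error = None
    try:
        resolver(host, port)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        error = exc
    elapsed = time.monotonic() - start
    return elapsed, error
```

Calling `time_resolution("metadata.google.internal")` during the local stage, once with each symlink target, would give numbers comparable in spirit to the ~30s and ~0.03s figures reported here, though the exact values depend on when in boot the call is made.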
## Longer explanation/moar logs

`cloud-init status` showed that `cloud-init` completed successfully with no reported errors:

```
$ sudo cloud-init status --long
status: done
extended_status: done
boot_status_code: enabled-by-generator
detail: DataSourceGCELocal
errors: []
recoverable_errors: {}
```

but that most of the time is spent in the `local` GCE `datasource` search:

```
$ sudo cloud-init analyze blame
-- Boot Record 01 --
30.98900s (init-local/search-GCELocal)
06.80700s (modules-config/config-apt_configure)
01.62200s (modules-config/config-ntp)
01.09700s (modules-final/config-keys_to_console)
00.94900s (init-network/config-growpart)
00.91800s (modules-final/config-ssh_authkey_fingerprints)
00.58400s (init-network/config-ssh)
...
...
```

`cloud-init analyze` shows:

```
$ sudo cloud-init analyze show
-- Boot Record 01 --
Starting stage: init-local
|`->no cache found @00.01300s +00.00000s
|`->found local data from DataSourceGCELocal @00.02700s +30.98900s
Finished stage: (init-local) 31.93500 seconds
```

... but an important finding from `/var/log/cloud-init.log` (see below) is that:

* DHCP was quick
* routes were added immediately
* the large stall comes from `hostname` resolution
* and once resolution finished the metadata requests themselves succeed straight away

```
2026-04-15 11:16:18,096 - dhcp.py[DEBUG]: Performing a dhcp discovery on ens4
2026-04-15 11:16:18,978 - performance.py[DEBUG]: Running ['dhcpcd', '--ipv4only', '--waitip', '--persistent', '--noarp', '--debug', '--script=/bin/true', 'ens4'] took 0.869 seconds
2026-04-15 11:16:18,984 - dhcp.py[DEBUG]: Parsing dhcpcd lease for interface ens4: 'reason=BOUND
...
...
domain_name_servers=169.254.169.254
...
...
dhcp_server_identifier=169.254.169.254
...
...
2026-04-15 11:16:18,988 - ephemeral.py[DEBUG]: Received dhcp lease on ens4 for 10.164.0.17/255.255.255.255
2026-04-15 11:16:18,992 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'append', '10.164.0.1/32', 'dev', 'ens4'] ...
2026-04-15 11:16:18,994 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'append', '0.0.0.0/0', 'via', '10.164.0.1', 'dev', 'ens4'] ...
2026-04-15 11:16:49,030 - performance.py[DEBUG]: Resolving URL took 30.033 seconds
2026-04-15 11:16:49,031 - url_helper.py[DEBUG]: [0/6] open 'http://metadata.google.internal/computeMetadata/v1/instance/id' ...
2026-04-15 11:16:49,039 - url_helper.py[DEBUG]: Read from http://metadata.google.internal/computeMetadata/v1/instance/id (200, 18b) after 1 attempts
2026-04-15 11:16:49,041 - url_helper.py[DEBUG]: Read from http://metadata.google.internal/computeMetadata/v1/instance/zone (200, 41b) after 1 attempts
2026-04-15 11:16:49,044 - url_helper.py[DEBUG]: Read from http://metadata.google.internal/computeMetadata/v1/instance/hostname (200, 55b) after 1 attempts
...
...
2026-04-15 11:16:49,063 - performance.py[DEBUG]: Getting metadata took 30.988 seconds
```

`journalctl` shows the resolve hook timeout and the ordering of `cloud-init-local` vs. `systemd-networkd` more clearly:

```
$ sudo journalctl -b 0 -o short-monotonic \
    -u cloud-init-local.service \
    -u systemd-resolved.service \
    -u systemd-networkd.service
[ 3.494233] localhost systemd[1]: Starting systemd-resolved.service - Network Name Resolution...
[ 3.644166] localhost systemd[1]: Started systemd-resolved.service - Network Name Resolution.
[ 8.555137] localhost systemd[1]: Starting cloud-init-local.service - Cloud-init: Local Stage (pre-network)...
[ 39.023176] resolute-quick systemd-resolved[196]: System hostname changed to 'resolute-quick'.
[ 39.124219] resolute-quick systemd-resolved[196]: Query on hook '/run/systemd/resolve.hook/io.systemd.Network' failed with error 'io.systemd.TimedOut', ignoring.
[ 39.124478] resolute-quick systemd-resolved[196]: Query on hook '/run/systemd/resolve.hook/io.systemd.Network' failed with error 'io.systemd.TimedOut', ignoring.
[ 39.658145] resolute-quick systemd[1]: Finished cloud-init-local.service - Cloud-init: Local Stage (pre-network).
[ 39.693138] resolute-quick systemd[1]: Starting systemd-networkd.service - Network Management...
[ 39.804399] resolute-quick systemd[1]: Started systemd-networkd.service - Network Management.
```

In other words:

* `cloud-init-local` starts and needs to resolve `metadata.google.internal`
* with `stub-resolv.conf` that lookup goes through `systemd-resolved`
* `systemd-resolved` attempts to use the io.systemd.Network resolve hook, but that hook times out after 30 seconds
* and finally `systemd-networkd` only starts after `cloud-init-local` has finished

The ordering and socket relationships make this seem plausible:

* `cloud-init-local.service` has `Wants=network-pre.target` && `Before=network-pre.target`
* `systemd-networkd.service` has `After=network-pre.target` && `... systemd-networkd-resolve-hook.socket`
* `systemd-networkd-resolve-hook.socket` listens on `/run/systemd/resolve.hook/io.systemd.Network`

### When using an image with the modified symlink

On a fresh boot of the modified image:

```
$ cloud-init analyze blame
-- Boot Record 01 --
03.94800s (modules-config/config-ntp)
02.19300s (init-local/search-GCELocal)
01.29000s (modules-final/config-ssh_authkey_fingerprints)
00.88100s (init-network/config-growpart)
00.67000s (modules-final/config-keys_to_console)
00.41300s (modules-config/config-apt_configure)
...
...
```

and

```
$ cat /var/log/cloud-init.log
...
...
64:2026-04-15 20:10:24,412 - performance.py[DEBUG]: Resolving URL took 0.031 seconds
...
...
```

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2148619

Title:
  `cloud-init` local datasource discovery takes ~30s to complete
