** Description changed:

- ====== [Bug Description] ======
+ [ Impact ]
+
+  * On AWS Minimal EC2 instances and Google C3-metal instances, cloud-init may
+    fail to detect the correct datasource during the local init stage if the
+    kernel hasn’t initialized the NIC in time. This can prevent metadata
+    fetching and block SSH on first boot.
+
+  * Users of these instances may be unable to log in immediately after first
+    boot, breaking automated deployments, cloud-init-based provisioning, and
+    scripts relying on metadata.
+
+  * Including `ENA_ETHERNET` as built-in on the AWS kernel and including `IDPF`
+    as built-in on GCP kernels would minimize the race condition between
+    cloud-init local and the kernel initializing the NIC, allowing the NIC to
+    come up early enough for cloud-init local to detect the correct
+    datasource.
+
+ [ Test Plan ]
+
+  * ----- AWS Instance -----
+
+  * 1. Launch an Ubuntu 24.04 Noble Minimal EC2 AMI. Use an EC2 instance type
+    where network interfaces may not be immediately available at early boot
+    (e.g. hpc7a.96xlarge reproduces consistently).
+
+  * 2. Wait several minutes after the instance reaches running, then attempt to
+    SSH into the instance. You will see this error:
+    Permission denied (publickey).
+
+  * 3. Apply the patch with ENA_ETHERNET built-in.
+
+  * 4. Clean cloud-init logs with
+    `sudo cloud-init clean --logs --config all`
+
+  * 5. Reboot the machine and check the following:
+
+      * SSH works immediately
+
+      * Check the journal logs with the following command:
+        journalctl -b 0 -o short-monotonic |
+        grep -E "cloud-init|ena|enp34s0|wait-online"
+
+      * In the journal logs, the NIC should come up before
+        cloud-init local finishes running
+
+      * In the journal logs, the NIC should have LINK UP,
+        gained carrier, and DHCP acquired before Net device
+        info is printed
+
+      * `cloud-init status --long` should show the
+        correct datasource (DataSourceEc2Local)
+
+  * ----- GCP Instance -----
+
+  * 1. Launch a Questing c3-standard-192-metal machine on Google Cloud.
+
+  * 2. Wait several minutes after the instance reaches running, then attempt to
+    SSH into the instance. You will see this error:
+    Permission denied (publickey).
+
+  * 3. Apply the SRU with IDPF built-in.
+
+  * 4. Clean cloud-init logs with
+    `sudo cloud-init clean --logs --config all`
+
+  * 5. Reboot the machine and check the following:
+
+      * SSH works immediately
+
+      * Check the journal logs with the following command:
+        journalctl -b 0 -o short-monotonic |
+        grep -E "cloud-init|idpf|enp5s0f0|wait-online"
+
+      * In the journal logs, the NIC should come up before
+        cloud-init local finishes running
+
+      * In the journal logs, the NIC should have LINK UP,
+        gained carrier, and DHCP acquired before Net device
+        info is printed
+
+      * `cloud-init status --long` should show the
+        correct datasource (DataSourceGCELocal)
+
+
+ [ Where problems could occur ]
+
+  * Making `ENA_ETHERNET` or `IDPF` built-in does not address the underlying
+    race condition between cloud-init local and the kernel finishing NIC
+    initialization. It only narrows the race window enough to serve as a
+    temporary workaround.
+
+  * Problems may occur if `ENA_ETHERNET` or `IDPF` need to support features
+    that rely on loadable drivers, such as RDMA.
+
+
+ [ Other Info ]
+
+  * See [Related Links] in [ORIGINAL Bug Description] below.
+
+  * We can try and revisit modularization once the cloud-init first-boot races
+    are fully resolved.
+
+
+ ====== [ORIGINAL Bug Description] ======

  ----- AWS Instance -----

  On Ubuntu 24.04 Minimal EC2 AMIs, cloud-init may fail to retrieve EC2
  metadata and userdata on the first boot for certain instance types
  (notably hpc7a.*, every time on hpc7a.96xlarge). This results in SSH
  keys from user-data not being applied and prevents SSH access until the
  instance is rebooted.
  The issue appears to be caused by a race condition where no eligible
  network interfaces are present during the init-local stage when the EC2
  datasource attempts metadata discovery.

  ----- GCP Instance -----

  C3-metal instances on Google Cloud fail to boot properly because
  cloud-init runs before the network interface (NIC) is up. As a result,
  cloud-init cannot detect any instance datasource.

  The issue appears to be caused by a race condition where no eligible
  network interfaces are present during the init-local stage when the GCE
  datasource attempts metadata discovery.

  ====== [Reproducer] ======

  ----- AWS Instance -----

  1. Launch an Ubuntu 24.04 Noble Minimal EC2 AMI
  2. Use an EC2 instance type where network interfaces may not be
     immediately available at early boot (e.g. hpc7a.96xlarge reproduces
     consistently)
  3. Wait several minutes after the instance reaches running, then attempt
     to SSH into the instance

  You will see this error:
  ubuntu@<public-ip>: Permission denied (publickey).

  If you access the machine through the AWS console, you will see the
  following cloud-init errors:

  * Unable to get metadata
- * The instance must have at least one eligible NICattempts metadata discovery.
+ * The instance must have at least one eligible NIC attempts metadata discovery.

  ----- GCP Instance -----

  1. Launch a c3-standard-192-metal machine on Google Cloud.
  2. Wait several minutes after the instance reaches running, then attempt
     to SSH into the instance

  You will see this error:
  Permission denied (publickey).

  If you access the machine through the Google Cloud console, you will see
  the following cloud-init errors:

  * No instance datasource found! Likely bad things to come!
  * Getting data from <class 'cloudinit.sources.DataSourceGCE.DataSourceGCELocal'> failed

  ====== [Environment Details] ======

  ----- AWS Instance -----

  * Cloud-init version: cloud-init 25.2-0ubuntu1~24.04.1
  * Operating System Distribution: Ubuntu 24.04 LTS (Noble) Minimal AMI
  * Cloud provider: Amazon EC2, hpc7a.* (notably hpc7a.96xlarge) instances
  * Kernel: linux-image-6.14.0-1018-aws

  ----- GCP Instance -----

  * Cloud-init version: cloud-init 25.3~2g890873f5-0ubuntu2
  * Operating System Distribution: Ubuntu Questing (25.10) and Ubuntu Noble (24.04)
  * Cloud provider, platform or installer type: Google
  * Kernel: linux-image-6.17.0-1008-gcp

  ====== [Suggested Fixes] ======

- * Include the `ENA_ETHERNET` driver as a built-in module in the affected aws
- kernels.
+ * Include the `ENA_ETHERNET` driver as a built-in module in the affected aws
+ kernels.
  * Include the `IDPF` driver as a built-in module in the affected gcp kernels.

  ====== [Related Links] ======

  cloud-init bug AWS: https://github.com/canonical/cloud-init/issues/6697
  cloud-init bug GCP: https://github.com/canonical/cloud-init/issues/6737
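
To confirm on a running instance whether the driver named in [Suggested Fixes] is built in or still a loadable module, the kernel config can be inspected. The following is only a minimal sketch: the helper name `driver_link_mode` is hypothetical, and it assumes the kernel config is installed at `/boot/config-$(uname -r)` and that the Kconfig symbols appear with the usual `CONFIG_` prefix (`CONFIG_ENA_ETHERNET`, `CONFIG_IDPF`).

```shell
#!/bin/sh
# Hypothetical helper (not part of cloud-init or the kernel packages).
# Reports how a kernel config symbol is linked.
# Usage: driver_link_mode CONFIG_FILE SYMBOL
driver_link_mode() {
    config_file=$1
    symbol=$2
    # Kernel config files contain lines such as CONFIG_ENA_ETHERNET=m, or
    # "# CONFIG_ENA_ETHERNET is not set" when the option is disabled.
    value=$(sed -n "s/^CONFIG_${symbol}=\(.*\)$/\1/p" "$config_file" | head -n 1)
    case "$value" in
        y) echo "built-in" ;;   # compiled into the kernel image
        m) echo "module" ;;     # loaded later, after the initial kernel boot
        *) echo "unset" ;;      # disabled, missing, or not a y/m value
    esac
}
```

For example, `driver_link_mode "/boot/config-$(uname -r)" ENA_ETHERNET` would be expected to print `built-in` on an AWS kernel that carries the suggested change and `module` on one that does not; the same call with `IDPF` applies to the GCP kernels.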
-- 
https://bugs.launchpad.net/bugs/2144694

Title:
  Delayed NIC initialization on AWS and GCP instances lead to first-boot metadata failures
