[Yahoo-eng-team] [Bug 1960944] Re: cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes
While we do have sporadic messages like this in our nginx error.log, they started piling up around the time this issue was reported to us, starting with this message:

2022/02/15 01:49:24 [error] 3341359#3341359: *1929977 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.229.95.139, server: , request: "POST /MAAS/metadata/status/ww4mgk HTTP/1.1", upstream: "http://10.155.212.2:5240/MAAS/metadata/status/ww4mgk", host: "10.229.32.21:5248"

Around this time we started seeing these pile up in rackd.log:

2022-02-15 01:40:07 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://localhost:5240/MAAS).

Our regiond processes are running, and I don't see anything that seems abnormal in the regiond log around this time. However, these symptoms reminded me of a similar issue in bug 1908452, so I started debugging it similarly. Like bug 1908452, I see one regiond process stuck in a recv call:

root@maas:/var/snap/maas/common/log# strace -p 3340720
strace: Process 3340720 attached
recvfrom(23,

All the other regiond processes are making progress, but not this one. The server it is talking to appears to be this Canonical server, which I can't currently resolve:

root@maas:/var/snap/maas/common/log# lsof -i -a -p 3340720 | grep 23
python3 3340720 root 23u IPv4 3487880288 0t0 TCP maas:42848->https-services.aerodent.canonical.com:http (ESTABLISHED)
root@maas:/var/snap/maas/common/log# host https-services.aerodent.canonical.com
Host https-services.aerodent.canonical.com not found: 3(NXDOMAIN)

However, I suspect it may be related to image fetching again.
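The stuck recvfrom above is what a blocking HTTP read looks like when the peer stays connected but never answers and the client has set no timeout. The following is only a minimal sketch of that failure mode and of how a per-request timeout bounds it. It is not simplestreams or MAAS code: the silent test server, the helper names and the choice of urllib are all assumptions made for illustration.

```python
import socket
import threading
import urllib.error
import urllib.request


def start_silent_server():
    """Listen on a free localhost port; accept one connection, never reply."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    held = []  # keep the accepted socket open so the connection stays ESTABLISHED

    def accept_once():
        conn, _ = srv.accept()
        held.append(conn)

    threading.Thread(target=accept_once, daemon=True).start()
    return srv, held


def fetch_with_timeout(url, timeout):
    """Return 'ok' if a response arrives, 'timed out' if the peer stays silent."""
    # An empty ProxyHandler keeps environment proxy settings out of the demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            resp.read()
        return "ok"
    except (socket.timeout, TimeoutError):
        # Raised directly when waiting for the response times out.
        return "timed out"
    except urllib.error.URLError as exc:
        # A timeout while connecting or sending is wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timed out"
        raise


if __name__ == "__main__":
    server, _held = start_silent_server()
    port = server.getsockname()[1]
    print(fetch_with_timeout(f"http://127.0.0.1:{port}/", timeout=1.0))
```

Without the timeout argument the same request would wait in recv indefinitely, which matches the strace output, though the report does not establish which code path in regiond opened that socket.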
In our regiond logs, I see that the last log entry related to images appears to have been about an hour before things locked up:

root@maas:/var/snap/maas/common/log# grep image regiond.log | tail -1
2022-02-15 00:38:51 regiond: [info] 127.0.0.1 GET /MAAS/images-stream/streams/v1/maas:v2:download.json HTTP/1.1 --> 200 OK (referrer: -; agent: python-simplestreams/0.1)

Prior to that, we have log entries every hour, but none after. So maybe simplestreams has other places that need a timeout?

** Changed in: cloud-init
Status: New => Invalid

** Also affects: simplestreams
Importance: Undecided
Status: New

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1960944

Title: cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes

Status in cloud-init: Invalid
Status in MAAS: New
Status in simplestreams: New

Bug description:
Not able to deploy baremetal (arm64 and amd64) on a snap-based MAAS: 3.1.0 (maas 3.1.0-10901-g.f1f8f1505 18199 3.1/stable)

from MAAS event log:
```
Tue, 15 Feb. 2022 17:35:33 Node changed status - From 'Deploying' to 'Failed deployment'
Tue, 15 Feb. 2022 17:35:33 Marking node failed - Node operation 'Deploying' timed out after 30 minutes.
Tue, 15 Feb. 2022 17:07:44 Node installation - 'cloudinit' searching for network data from DataSourceMAAS
Tue, 15 Feb. 2022 17:06:44 Node installation - 'cloudinit' attempting to read from cache [trust]
Tue, 15 Feb. 2022 17:06:42 Node installation - 'cloudinit' attempting to read from cache [check]
Tue, 15 Feb. 2022 17:05:29 Performing PXE boot
Tue, 15 Feb. 2022 17:05:29 PXE Request - installation
Tue, 15 Feb. 2022 17:03:52 Node powered on
```

Server console log shows:
```
ubuntu login: Starting Message of the Day...
[ OK ] Listening on Socket unix for snap application lxd.daemon.
Starting Service for snap application lxd.activate...
[ OK ] Finished Service for snap application lxd.activate.
[ OK ] Started snap.lxd.hook.conf…-4400-96a8-0c5c9e438c51.scope.
Starting Time & Date Service...
[ OK ] Started Time & Date Service.
[ OK ] Finished Wait until snapd is fully seeded.
Starting Apply the settings specified in cloud-config...
[ OK ] Reached target Multi-User System.
[ OK ] Reached target Graphical Interface.
Starting Update UTMP about System Runlevel Changes...
[ OK ] Finished Update UTMP about System Runlevel Changes.
[ 322.036861] cloud-init[2034]: Can not apply stage config, no datasource found! Likely bad things to come!
[ 322.037477] cloud-init[2034]:
[ 322.037907] cloud-init[2034]: Traceback (most recent call last):
[ 322.038341] cloud-init[2034]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 521, in main_modules
[ 322.038783] cloud-init[2034]: init.fetch(existing="trust")
[ 322.039181] cloud-init[2034]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 411, in fetch
[ 322.039584] cloud-init[2034]: return self._get_data_source
```
[Yahoo-eng-team] [Bug 1936972] [NEW] MAAS deploys fail if host has NIC w/ random MAC
Public bug reported:

The Nvidia DGX A100 server includes a USB Redfish Host Interface NIC. This NIC apparently provides no MAC address of its own, so the driver generates a random MAC for it:

./drivers/net/usb/cdc_ether.c:

static int usbnet_cdc_zte_bind(struct usbnet *dev, struct usb_interface *intf)
{
        int status = usbnet_cdc_bind(dev, intf);

        if (!status && (dev->net->dev_addr[0] & 0x02))
                eth_hw_addr_random(dev->net);

        return status;
}

This causes a problem with MAAS because, during deployment, MAAS sees this as a normal NIC and records the MAC. The post-install reboot then fails:

[ 43.652573] cloud-init[3761]: init.apply_network_config(bring_up=not args.local)
[ 43.700516] cloud-init[3761]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 735, in apply_network_config
[ 43.724496] cloud-init[3761]: self.distro.networking.wait_for_physdevs(netcfg)
[ 43.740509] cloud-init[3761]: File "/usr/lib/python3/dist-packages/cloudinit/distros/networking.py", line 177, in wait_for_physdevs
[ 43.764523] cloud-init[3761]: raise RuntimeError(msg)
[ 43.780511] cloud-init[3761]: RuntimeError: Not all expected physical devices present: {'fe:b8:63:69:9f:71'}

I'm not sure what the best answer for MAAS is here, but here are some thoughts:

1) Ignore all Redfish system interfaces. These are a connection between the host and the BMC, so they don't really have a use case in the MAAS model AFAICT. These devices can be identified using the SMBIOS data described in the Redfish Host Interface Specification, section 8:
https://www.dmtf.org/sites/default/files/standards/documents/DSP0270_1.3.0.pdf
That data can be read from within Linux using dmidecode.

2) Ignore (or specially handle) all NICs with randomly generated MAC addresses. While this is the only time I've seen the random MAC with production server hardware, it is something I've seen on e.g. ARM development boards. Problem is, I don't know how to detect a generated MAC.
I'd hoped the permanent MAC (ethtool -P) would be NULL, but it seems to also be set to the generated MAC :(

fyi, 2 workarounds for this that seem to work:

1) Delete the NIC from the MAAS model in the MAAS UI after every commissioning.
2) Use a tag's kernel_opts field to modprobe.blacklist the driver used for the Redfish NIC.

** Affects: cloud-init
Importance: Undecided
Status: New

** Affects: curtin
Importance: Undecided
Status: New

** Affects: maas
Importance: Undecided
Status: New

** Also affects: cloud-init
Importance: Undecided
Status: New

** Also affects: curtin
Importance: Undecided
Status: New

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1936972

Title: MAAS deploys fail if host has NIC w/ random MAC

Status in cloud-init: New
Status in curtin: New
Status in MAAS: New
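On the question of detecting a generated MAC: one partial hint is the locally administered bit (0x02 in the first octet), the same bit the driver snippet tests. The kernel's random address helper produces unicast addresses with that bit set, and the MAC in the traceback (fe:b8:63:69:9f:71) has it set. The sketch below only checks that bit. It is an illustration with hypothetical helper names, not something MAAS, curtin or cloud-init is known to do. It is also not proof of randomness, since virtual interfaces and manually assigned addresses commonly set the same bit.

```python
def parse_mac(mac: str) -> bytes:
    """Parse a colon-separated MAC address into its six octets."""
    parts = mac.split(":")
    if len(parts) != 6:
        raise ValueError(f"not a 6-octet MAC address: {mac!r}")
    return bytes(int(part, 16) for part in parts)


def is_locally_administered(mac: str) -> bool:
    """True if bit 0x02 of the first octet is set (not a vendor-assigned address)."""
    return bool(parse_mac(mac)[0] & 0x02)


def is_multicast(mac: str) -> bool:
    """True if bit 0x01 of the first octet is set (group address)."""
    return bool(parse_mac(mac)[0] & 0x01)


def may_be_generated(mac: str) -> bool:
    """Heuristic only: unicast and locally administered."""
    return is_locally_administered(mac) and not is_multicast(mac)


if __name__ == "__main__":
    for address in ("fe:b8:63:69:9f:71", "00:11:22:33:44:55"):
        print(address, may_be_generated(address))
```

A tool using such a check would still need a policy for legitimate locally administered addresses, so it is better suited to flagging a NIC for special handling than to ignoring it outright.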
[Yahoo-eng-team] [Bug 1858615] Re: dmidecode triggers system reboot on Inforce 6640
** Also affects: dmidecode (Ubuntu Xenial)
Importance: Undecided
Status: New

** Changed in: dmidecode (Ubuntu Xenial)
Status: New => In Progress

** Changed in: dmidecode (Ubuntu Xenial)
Assignee: (unassigned) => dann frazier (dannf)

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1858615

Title: dmidecode triggers system reboot on Inforce 6640

Status in cloud-init: Invalid
Status in dmidecode package in Ubuntu: Fix Released
Status in dmidecode source package in Xenial: In Progress
Status in dmidecode source package in Bionic: In Progress
Status in dmidecode source package in Eoan: In Progress
Status in dmidecode source package in Focal: Fix Released
Status in dmidecode package in Debian: Unknown

Bug description:

[Impact]
Running 'sudo dmidecode' on non-UEFI ARM systems can cause them to crash/reboot. cloud-init apparently runs dmidecode as root, so it breaks any cloud-init based installation.

[Test Case]
sudo dmidecode

[Fix]
Upstream has the following fix:

commit e12ec26e19e02281d3e7258c3aabb88a5cf5ec1d
Author: Jean Delvare
Date: Mon Aug 26 14:20:15 2019 +0200

    dmidecode: Only scan /dev/mem for entry point on x86

[Regression Risk]
In Ubuntu, dmidecode only builds on amd64, arm64, armhf & i386. The fix is to disable code on !x86, so the regression risk is restricted to ARM platforms, where we know /dev/mem trolling is bad news.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1858615/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help : https://help.launchpad.net/ListHelp
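The idea behind the upstream fix ("Only scan /dev/mem for entry point on x86") can be sketched as a small decision function. This is a simplified illustration, not dmidecode's actual logic: the function name, the return values and the machine list are assumptions, and the real tool also has an EFI lookup path that is left out here. The only detail taken as given is that recent kernels expose the SMBIOS entry point under /sys/firmware/dmi/tables.

```python
import os
import tempfile

# Machine names (as reported by uname -m) treated as x86 in this sketch.
X86_MACHINES = frozenset({"x86_64", "i386", "i486", "i586", "i686"})


def pick_dmi_source(machine, sysfs_dmi_dir="/sys/firmware/dmi/tables"):
    """Choose where to look for the SMBIOS entry point.

    Returns "sysfs" when the kernel exports the tables, "devmem-scan" only
    on x86 machines, and None when no safe source is available.
    """
    entry_point = os.path.join(sysfs_dmi_dir, "smbios_entry_point")
    if os.path.exists(entry_point):
        return "sysfs"
    if machine in X86_MACHINES:
        # Legacy fallback: scanning physical memory is only defined on x86.
        return "devmem-scan"
    # On other architectures, poking at /dev/mem may crash the board.
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as empty_dir:
        # Simulates a non-UEFI ARM board with no exported DMI tables.
        print(pick_dmi_source("aarch64", empty_dir))
        print(pick_dmi_source("x86_64", empty_dir))
```

With this shape, a board like the one in the report gets "no DMI data" instead of a memory scan, which is the behaviour the regression-risk note describes.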
[Yahoo-eng-team] [Bug 1858615] Re: dmidecode triggers system reboot on Inforce 6640
Sure Colin, I'll take it from here - thanks for your analysis so far. As a next step, I'll wait for Ethan - or someone else w/ hw access - to verify the PPA build in Comment #12.

** Bug watch added: Debian Bug tracker #946911
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=946911

** Also affects: dmidecode (Debian) via
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=946911
Importance: Unknown
Status: Unknown

** Also affects: dmidecode (Ubuntu Eoan)
Importance: Undecided
Status: New

** Also affects: dmidecode (Ubuntu Focal)
Importance: Medium
Assignee: Colin Ian King (colin-king)
Status: In Progress

** Also affects: dmidecode (Ubuntu Bionic)
Importance: Undecided
Status: New

** Changed in: dmidecode (Ubuntu Focal)
Assignee: Colin Ian King (colin-king) => (unassigned)

** Changed in: dmidecode (Ubuntu Eoan)
Assignee: (unassigned) => dann frazier (dannf)

** Changed in: dmidecode (Ubuntu Bionic)
Assignee: (unassigned) => dann frazier (dannf)

** Changed in: dmidecode (Ubuntu Eoan)
Status: New => In Progress

** Changed in: dmidecode (Ubuntu Bionic)
Status: New => In Progress

** Changed in: dmidecode (Ubuntu Focal)
Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1858615

Title: dmidecode triggers system reboot on Inforce 6640

Status in cloud-init: Invalid
Status in dmidecode package in Ubuntu: Fix Released
Status in dmidecode source package in Bionic: In Progress
Status in dmidecode source package in Eoan: In Progress
Status in dmidecode source package in Focal: Fix Released
Status in dmidecode package in Debian: Unknown

Bug description:

Device: Inforce 6640
https://www.inforcecomputing.com/products/single-board-computers-sbc/qualcomm-snapdragon-820-inforce-6640-sbc
SoC: Snapdragon 820

sysname='Linux', nodename='ubuntu', release='4.15.0-1069-snapdragon', version='#76-Ubuntu SMP Tue Nov 26 16:10:14 UTC 2019', machine='aarch64'

The issue is caused by the following commit. The Inforce 6640 doesn't have a functional dmidecode; the system reboots when dmidecode is executed.

commit 3416e2ee7f65defdb15aab861a85767d13e8c34c
Author: Robert Schweikert
Date: Sat Oct 29 09:29:53 2016 -0400

    dmidecode: Allow dmidecode to be used on aarch64

    aarch64 systems have functional dmidecode, so allow that to be used.
    - aarch64 has support for dmidecode as well

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1858615/+subscriptions
[Yahoo-eng-team] [Bug 1771662] Re: [bionic] libvirtError: Node device not found: no node device with matching name
Fixes have now landed upstream:

04983c3c6a util: Fixing invalid error checking from virPCIGetNetname()
8fac64db5e util: Fix for NULL dereference
10bca495e0 util: Code simplification
6452e2f5e1 util: fixing wrong assumption that PF has to have netdev assigned

** Also affects: libvirt (Ubuntu Disco)
Importance: Undecided
Status: In Progress

** Also affects: libvirt (Ubuntu Cosmic)
Importance: Undecided
Status: New

** Also affects: libvirt (Ubuntu Bionic)
Importance: Undecided
Status: New

** Changed in: libvirt (Ubuntu Disco)
Status: In Progress => Triaged

** Changed in: nova
Status: New => Invalid

** Changed in: libvirt (Ubuntu Cosmic)
Status: New => Triaged

** Changed in: libvirt (Ubuntu Bionic)
Status: New => Triaged

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1771662

Title: [bionic] libvirtError: Node device not found: no node device with matching name

Status in OpenStack nova-compute charm: Invalid
Status in OpenStack Compute (nova): Invalid
Status in libvirt package in Ubuntu: Triaged
Status in libvirt source package in Bionic: Triaged
Status in libvirt source package in Cosmic: Triaged
Status in libvirt source package in Disco: Triaged

Bug description:

After deploying openstack on arm64 using bionic and queens, no hypervisors show up. On my compute nodes, I have an error like this in my /var/log/nova/nova-compute.log:

2018-05-16 19:23:08.165 282170 ERROR nova.compute.manager libvirtError: Node device not found: no node device with matching name 'net_enP2p1s0f1_40_8d_5c_ba_b8_d2'

I'm not sure why this is happening - I don't use enP2p1s0f1 for anything.
There are a lot of interesting messages about that interface in syslog:
http://paste.ubuntu.com/p/8WT8NqCbCf/

Here is my bundle:
http://paste.ubuntu.com/p/fWWs6r8Nr5/

The same bundle works fine for xenial-queens, with the source changed to the cloud-archive, and using stable charms rather than -next. I hit this same issue on bionic queens using either stable or next charms.

This thread has some related info, I think:
https://www.spinics.net/linux/fedora/libvir/msg160975.html

This is with juju 2.4 beta 2.

Package versions on affected system:
http://paste.ubuntu.com/p/yfQH3KJzng/

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-nova-compute/+bug/1771662/+subscriptions
[Yahoo-eng-team] [Bug 1623871] Re: Nova hugepage support does not include aarch64
** Also affects: nova (Ubuntu)
Importance: Undecided
Status: New

** Also affects: nova (Ubuntu Xenial)
Importance: Undecided
Status: New

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1623871

Title: Nova hugepage support does not include aarch64

Status in OpenStack Compute (nova): In Progress
Status in nova package in Ubuntu: New
Status in nova source package in Xenial: New

Bug description:

Although aarch64 supports spawning a VM with hugepages, the libvirt driver in the nova code considers only x86_64 and I686. AARCH64 needs to be added for both NUMA and hugepage support. Due to this bug, a VM cannot be launched with hugepages using OpenStack on aarch64 servers.

Steps to reproduce:
On an openstack environment running on aarch64:
1. Configure compute to use hugepages.
2. Set mem_page_size="2048" for a flavor
3. Launch a VM using the above flavor.

Expected result:
VM should be launched with hugepages and the libvirt xml should have

Actual result:
VM is launched without hugepages. There are no error logs in nova-scheduler.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1623871/+subscriptions
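The shape of the problem described above is an architecture allow-list that omits aarch64. The sketch below illustrates that shape and the proposed change. Every identifier in it is hypothetical and chosen for this example; none of them are the names used in nova's libvirt driver.

```python
from enum import Enum


class Arch(str, Enum):
    """A few machine architectures, named for this sketch only."""
    I686 = "i686"
    X86_64 = "x86_64"
    AARCH64 = "aarch64"


# Allow-list as the report describes it: x86 variants only.
HUGEPAGE_ARCHES_REPORTED = frozenset({Arch.I686, Arch.X86_64})

# Allow-list with the change the report asks for.
HUGEPAGE_ARCHES_PROPOSED = HUGEPAGE_ARCHES_REPORTED | {Arch.AARCH64}


def host_supports_hugepages(host_arch: str, supported) -> bool:
    """True if the host architecture is in the given allow-list."""
    try:
        return Arch(host_arch) in supported
    except ValueError:
        # Architectures unknown to this sketch are treated as unsupported.
        return False


if __name__ == "__main__":
    print(host_supports_hugepages("aarch64", HUGEPAGE_ARCHES_REPORTED))
    print(host_supports_hugepages("aarch64", HUGEPAGE_ARCHES_PROPOSED))
```

A check of this kind that quietly returns False would also be consistent with the "Actual result" above, where the VM launches without hugepages and nothing is logged.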