[Yahoo-eng-team] [Bug 1960944] Re: cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes

2022-02-16 Thread dann frazier
While we do have sporadic messages like this in our nginx error.log,
they started piling up around the time this issue was reported to us,
starting with this message:

2022/02/15 01:49:24 [error] 3341359#3341359: *1929977 upstream timed out
(110: Connection timed out) while reading response header from upstream,
client: 10.229.95.139, server: , request: "POST
/MAAS/metadata/status/ww4mgk HTTP/1.1", upstream:
"http://10.155.212.2:5240/MAAS/metadata/status/ww4mgk", host:
"10.229.32.21:5248"

Around this time we started seeing these pile up in rackd.log:
2022-02-15 01:40:07 provisioningserver.rpc.clusterservice: [critical] Failed to 
contact region. (While requesting RPC info at http://localhost:5240/MAAS).

Our regiond processes are running, and I don't see anything that seems
abnormal in the regiond log around this time. However, these symptoms
reminded me of a similar issue in bug 1908452, so I started debugging it
similarly. Like bug 1908452, I see one regiond process stuck in a recv
call:

root@maas:/var/snap/maas/common/log# strace -p 3340720
strace: Process 3340720 attached
recvfrom(23, 

All the other regiond processes are making progress, but not this one.

The server it is talking to appears to be this canonical server, which I
can't currently resolve:

root@maas:/var/snap/maas/common/log# lsof -i -a -p  3340720 | grep 23
python3 3340720 root   23u  IPv4 3487880288  0t0  TCP 
maas:42848->https-services.aerodent.canonical.com:http (ESTABLISHED)
root@maas:/var/snap/maas/common/log# host https-services.aerodent.canonical.com
Host https-services.aerodent.canonical.com not found: 3(NXDOMAIN)

However, I suspect it may be related to image fetching again. In our
regiond logs, I see that the last log entry related to images
appears to have been about an hour before things locked up:

root@maas:/var/snap/maas/common/log# grep image regiond.log | tail -1
2022-02-15 00:38:51 regiond: [info] 127.0.0.1 GET 
/MAAS/images-stream/streams/v1/maas:v2:download.json HTTP/1.1 --> 200 OK 
(referrer: -; agent: python-simplestreams/0.1)

Prior to that, we have log entries every hour, but none after. So maybe
simplestreams has other places that need a timeout?
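
To illustrate the failure mode suspected above, here is a minimal sketch of a
read that gives up instead of blocking forever in recv. It is not
simplestreams code: fetch_with_timeout and start_silent_server are
hypothetical names, and the silent local server only stands in for a peer that
accepts a connection and then never answers.

```python
import socket
import threading


def fetch_with_timeout(host, port, request, timeout=5.0):
    """Send `request` and read one reply, giving up after `timeout` seconds."""
    # The timeout passed to create_connection() also applies to later
    # send/recv calls on the returned socket.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        try:
            return sock.recv(4096)
        except socket.timeout:
            return None  # peer accepted the connection but never answered


def start_silent_server():
    """Hypothetical stand-in for a peer that accepts and then goes quiet."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    held = []  # keep accepted sockets open so the client never sees EOF

    def accept_once():
        conn, _ = srv.accept()
        held.append(conn)

    threading.Thread(target=accept_once, daemon=True).start()
    return srv, held


server, _held = start_silent_server()
reply = fetch_with_timeout("127.0.0.1", server.getsockname()[1],
                           b"GET / HTTP/1.0\r\n\r\n", timeout=0.5)
print("reply:", reply)
server.close()
```

Without the timeout argument the recv call would sit there indefinitely, which
is what the strace output above looks like.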

** Changed in: cloud-init
   Status: New => Invalid

** Also affects: simplestreams
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1960944

Title:
  cloudinit.sources.DataSourceNotFoundException: Did not find any data
  source, searched classes

Status in cloud-init:
  Invalid
Status in MAAS:
  New
Status in simplestreams:
  New

Bug description:
  Not able to deploy baremetal (arm64 and amd64) on a 
  snap-based MAAS: 3.1.0 (maas  3.1.0-10901-g.f1f8f1505  18199  3.1/stable) 

  from MAAS event log: 
  ```
  Tue, 15 Feb. 2022 17:35:33   Node changed status - From 'Deploying' to 'Failed deployment'
  Tue, 15 Feb. 2022 17:35:33   Marking node failed - Node operation 'Deploying' timed out after 30 minutes.
  Tue, 15 Feb. 2022 17:07:44   Node installation - 'cloudinit' searching for network data from DataSourceMAAS
  Tue, 15 Feb. 2022 17:06:44   Node installation - 'cloudinit' attempting to read from cache [trust]
  Tue, 15 Feb. 2022 17:06:42   Node installation - 'cloudinit' attempting to read from cache [check]
  Tue, 15 Feb. 2022 17:05:29   Performing PXE boot
  Tue, 15 Feb. 2022 17:05:29   PXE Request - installation
  Tue, 15 Feb. 2022 17:03:52   Node powered on
  ```

  
  Server console log shows: 

  ```
  ubuntu login:  Starting Message of the Day...
  [  OK  ] Listening on Socket unix for snap application lxd.daemon.
   Starting Service for snap application lxd.activate...
  [  OK  ] Finished Service for snap application lxd.activate.
  [  OK  ] Started snap.lxd.hook.conf…-4400-96a8-0c5c9e438c51.scope.
   Starting Time & Date Service...
  [  OK  ] Started Time & Date Service.
  [  OK  ] Finished Wait until snapd is fully seeded.
   Starting Apply the settings specified in cloud-config...
  [  OK  ] Reached target Multi-User System.
  [  OK  ] Reached target Graphical Interface.
   Starting Update UTMP about System Runlevel Changes...
  [  OK  ] Finished Update UTMP about System Runlevel Changes.
  [  322.036861] cloud-init[2034]: Can not apply stage config, no datasource found! Likely bad things to come!
  [  322.037477] cloud-init[2034]: 
  [  322.037907] cloud-init[2034]: Traceback (most recent call last):
  [  322.038341] cloud-init[2034]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 521, in main_modules
  [  322.038783] cloud-init[2034]: init.fetch(existing="trust")
  [  322.039181] cloud-init[2034]:   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 411, in fetch
  [  322.039584] cloud-init[2034]: return self._get_data_source

[Yahoo-eng-team] [Bug 1936972] [NEW] MAAS deploys fail if host has NIC w/ random MAC

2021-07-20 Thread dann frazier
Public bug reported:

The Nvidia DGX A100 server includes a USB Redfish Host Interface NIC.
This NIC apparently provides no MAC address of its own, so the driver
generates a random MAC for it:

./drivers/net/usb/cdc_ether.c:

static int usbnet_cdc_zte_bind(struct usbnet *dev, struct usb_interface *intf)
{
        int status = usbnet_cdc_bind(dev, intf);

        if (!status && (dev->net->dev_addr[0] & 0x02))
                eth_hw_addr_random(dev->net);

        return status;
}

This causes a problem with MAAS because, during deployment, MAAS sees
this as a normal NIC and records the MAC. The post-install reboot then
fails:

[   43.652573] cloud-init[3761]: init.apply_network_config(bring_up=not args.local)
[   43.700516] cloud-init[3761]:   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 735, in apply_network_config
[   43.724496] cloud-init[3761]: self.distro.networking.wait_for_physdevs(netcfg)
[   43.740509] cloud-init[3761]:   File "/usr/lib/python3/dist-packages/cloudinit/distros/networking.py", line 177, in wait_for_physdevs
[   43.764523] cloud-init[3761]: raise RuntimeError(msg)
[   43.780511] cloud-init[3761]: RuntimeError: Not all expected physical devices present: {'fe:b8:63:69:9f:71'}

I'm not sure what the best answer for MAAS is here, but here are some
thoughts:

1) Ignore all Redfish system interfaces. These are a connection between the
host and the BMC, so they don't really have a use case in the MAAS model
AFAICT. These devices can be identified using the SMBIOS as described in the
Redfish Host Interface Specification, section 8:
  https://www.dmtf.org/sites/default/files/standards/documents/DSP0270_1.3.0.pdf
The SMBIOS can be read from within Linux using dmidecode.

2) Ignore (or specially handle) all NICs with randomly generated MAC
addresses. While this is the only time I've seen a random MAC on
production server hardware, it is something I've seen on e.g. ARM
development boards. Problem is, I don't know how to detect a generated
MAC. I'd hoped the permanent MAC (ethtool -P) would be NULL, but it
seems to also be set to the generated MAC :(
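
One possible avenue for option 2, sketched below under stated assumptions: the
kernel's sysfs-class-net ABI exposes /sys/class/net/<iface>/addr_assign_type,
where 1 means the address was randomly generated. Whether that value is
actually reported as 1 on the DGX A100 has not been checked here. The helper
names and the interface name "usb0" are hypothetical, and the second helper
only tests the same 0x02 (locally administered) bit that the driver code above
tests.

```python
import os
import tempfile

# Values of /sys/class/net/<iface>/addr_assign_type as documented in the
# kernel's sysfs-class-net ABI.
ADDR_ASSIGN_TYPES = {0: "permanent", 1: "random", 2: "stolen", 3: "set"}


def is_locally_administered(mac):
    """True if the locally-administered bit (0x02 of the first octet) is set."""
    return bool(int(mac.split(":")[0], 16) & 0x02)


def addr_assign_type(iface, sysfs_root="/sys/class/net"):
    """Return how the kernel says the MAC was assigned, or None if unknown."""
    path = os.path.join(sysfs_root, iface, "addr_assign_type")
    try:
        with open(path) as f:
            return ADDR_ASSIGN_TYPES.get(int(f.read().strip()))
    except (OSError, ValueError):
        return None


# Demo against a fake sysfs tree so the sketch runs on any machine.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "usb0"))  # hypothetical interface name
    with open(os.path.join(root, "usb0", "addr_assign_type"), "w") as f:
        f.write("1\n")
    print(addr_assign_type("usb0", sysfs_root=root))

# The MAC from the RuntimeError above has the locally-administered bit set.
print(is_locally_administered("fe:b8:63:69:9f:71"))
```

A locally administered address is not necessarily random, so the bit test
alone would be a heuristic at best.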

fyi, 2 workarounds for this that seem to work:
 1) Delete the NIC from the MAAS model in the MAAS UI after every commissioning.
 2) Use a tag's kernel_opts field to modprobe.blacklist the driver used for the 
Redfish NIC.

** Affects: cloud-init
 Importance: Undecided
 Status: New

** Affects: curtin
 Importance: Undecided
 Status: New

** Affects: maas
 Importance: Undecided
 Status: New

** Also affects: cloud-init
   Importance: Undecided
   Status: New

** Also affects: curtin
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1936972

Title:
  MAAS deploys fail if host has NIC w/ random MAC

Status in cloud-init:
  New
Status in curtin:
  New
Status in MAAS:
  New


[Yahoo-eng-team] [Bug 1858615] Re: dmidecode triggers system reboot on Inforce 6640

2020-01-27 Thread dann frazier
** Also affects: dmidecode (Ubuntu Xenial)
   Importance: Undecided
   Status: New

** Changed in: dmidecode (Ubuntu Xenial)
   Status: New => In Progress

** Changed in: dmidecode (Ubuntu Xenial)
 Assignee: (unassigned) => dann frazier (dannf)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1858615

Title:
  dmidecode triggers system reboot on Inforce 6640

Status in cloud-init:
  Invalid
Status in dmidecode package in Ubuntu:
  Fix Released
Status in dmidecode source package in Xenial:
  In Progress
Status in dmidecode source package in Bionic:
  In Progress
Status in dmidecode source package in Eoan:
  In Progress
Status in dmidecode source package in Focal:
  Fix Released
Status in dmidecode package in Debian:
  Unknown

Bug description:
  [Impact]
  Running 'sudo dmidecode' on non-UEFI ARM systems can cause them to 
crash/reboot. cloud-init apparently runs dmidecode as root, so it breaks any 
cloud-init based installation.

  [Test Case]
  sudo dmidecode

  [Fix]
  Upstream has the following fix:

  commit e12ec26e19e02281d3e7258c3aabb88a5cf5ec1d
  Author: Jean Delvare 
  Date: Mon Aug 26 14:20:15 2019 +0200

  dmidecode: Only scan /dev/mem for entry point on x86

  [Regression Risk]
  In Ubuntu, dmidecode only builds on amd64, arm64, armhf & i386.
  The fix is to disable code on !x86, so the regression risk is restricted to 
ARM platforms, where we know /dev/mem trolling is bad news.
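
As an illustration only, the idea behind the fix (scan /dev/mem for the entry
point only on x86) can be sketched as an architecture guard. This is not the
upstream C change: may_scan_dev_mem and the X86_MACHINES set are hypothetical
names chosen for this sketch.

```python
import platform

# Machine strings treated as x86 for the purpose of this sketch (assumption).
X86_MACHINES = {"x86_64", "amd64", "i386", "i486", "i586", "i686"}


def may_scan_dev_mem(machine=None):
    """Return True only on x86, where scanning /dev/mem is considered safe."""
    machine = (machine or platform.machine()).lower()
    return machine in X86_MACHINES


for arch in ("x86_64", "aarch64", "armv7l"):
    print(arch, may_scan_dev_mem(arch))
```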

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1858615/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1858615] Re: dmidecode triggers system reboot on Inforce 6640

2020-01-27 Thread dann frazier
Sure Colin, I'll take it from here - thanks for your analysis so far. As
a next step, I'll wait for Ethan - or someone else w/ hw access - to
verify the PPA build in Comment #12.

** Bug watch added: Debian Bug tracker #946911
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=946911

** Also affects: dmidecode (Debian) via
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=946911
   Importance: Unknown
   Status: Unknown

** Also affects: dmidecode (Ubuntu Eoan)
   Importance: Undecided
   Status: New

** Also affects: dmidecode (Ubuntu Focal)
   Importance: Medium
 Assignee: Colin Ian King (colin-king)
   Status: In Progress

** Also affects: dmidecode (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Changed in: dmidecode (Ubuntu Focal)
 Assignee: Colin Ian King (colin-king) => (unassigned)

** Changed in: dmidecode (Ubuntu Eoan)
 Assignee: (unassigned) => dann frazier (dannf)

** Changed in: dmidecode (Ubuntu Bionic)
 Assignee: (unassigned) => dann frazier (dannf)

** Changed in: dmidecode (Ubuntu Eoan)
   Status: New => In Progress

** Changed in: dmidecode (Ubuntu Bionic)
   Status: New => In Progress

** Changed in: dmidecode (Ubuntu Focal)
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1858615

Title:
  dmidecode triggers system reboot on Inforce 6640

Status in cloud-init:
  Invalid
Status in dmidecode package in Ubuntu:
  Fix Released
Status in dmidecode source package in Bionic:
  In Progress
Status in dmidecode source package in Eoan:
  In Progress
Status in dmidecode source package in Focal:
  Fix Released
Status in dmidecode package in Debian:
  Unknown

Bug description:
  Device: Inforce 6640
  
https://www.inforcecomputing.com/products/single-board-computers-sbc/qualcomm-snapdragon-820-inforce-6640-sbc
  SoC: Snapdragon 820

  sysname='Linux',
  nodename='ubuntu',
  release='4.15.0-1069-snapdragon', 
  version='#76-Ubuntu SMP Tue Nov 26 16:10:14 UTC 2019', 
  machine='aarch64'

  The issue is caused by the following commit.
  Inforce 6640 doesn't have functional dmidecode.
  The system will reboot when executing dmidecode.

  commit 3416e2ee7f65defdb15aab861a85767d13e8c34c
  Author: Robert Schweikert 
  Date: Sat Oct 29 09:29:53 2016 -0400
  dmidecode: Allow dmidecode to be used on aarch64
  aarch64 systems have functional dmidecode, so allow that to be used.
  - aarch64 has support for dmidecode as well

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1858615/+subscriptions



[Yahoo-eng-team] [Bug 1771662] Re: [bionic] libvirtError: Node device not found: no node device with matching name

2019-01-23 Thread dann frazier
Fixes have now landed upstream:

04983c3c6a util: Fixing invalid error checking from virPCIGetNetname()
8fac64db5e util: Fix for NULL dereference
10bca495e0 util: Code simplification
6452e2f5e1 util: fixing wrong assumption that PF has to have netdev assigned


** Also affects: libvirt (Ubuntu Disco)
   Importance: Undecided
   Status: In Progress

** Also affects: libvirt (Ubuntu Cosmic)
   Importance: Undecided
   Status: New

** Also affects: libvirt (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Changed in: libvirt (Ubuntu Disco)
   Status: In Progress => Triaged

** Changed in: nova
   Status: New => Invalid

** Changed in: libvirt (Ubuntu Cosmic)
   Status: New => Triaged

** Changed in: libvirt (Ubuntu Bionic)
   Status: New => Triaged

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1771662

Title:
  [bionic] libvirtError: Node device not found: no node device with
  matching name

Status in OpenStack nova-compute charm:
  Invalid
Status in OpenStack Compute (nova):
  Invalid
Status in libvirt package in Ubuntu:
  Triaged
Status in libvirt source package in Bionic:
  Triaged
Status in libvirt source package in Cosmic:
  Triaged
Status in libvirt source package in Disco:
  Triaged

Bug description:
  After deploying openstack on arm64 using bionic and queens, no
  hypervisors show up. On my compute nodes, I have an error like:

  2018-05-16 19:23:08.165 282170 ERROR nova.compute.manager
  libvirtError: Node device not found: no node device with matching name
  'net_enP2p1s0f1_40_8d_5c_ba_b8_d2'

  In my /var/log/nova/nova-compute.log

  I'm not sure why this is happening - I don't use enP2p1s0f1 for
  anything.

  There are a lot of interesting messages about that interface in syslog:
  http://paste.ubuntu.com/p/8WT8NqCbCf/

  Here is my bundle: http://paste.ubuntu.com/p/fWWs6r8Nr5/

  The same bundle works fine for xenial-queens, with the source changed
  to the cloud-archive, and using stable charms rather than -next. I hit
  this same issue on bionic queens using either stable or next charms.

  This thread has some related info, I think:
  https://www.spinics.net/linux/fedora/libvir/msg160975.html

  This is with juju 2.4 beta 2.

  Package versions on affected system:
  http://paste.ubuntu.com/p/yfQH3KJzng/

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-nova-compute/+bug/1771662/+subscriptions



[Yahoo-eng-team] [Bug 1623871] Re: Nova hugepage support does not include aarch64

2016-09-27 Thread dann frazier
** Also affects: nova (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: nova (Ubuntu Xenial)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1623871

Title:
  Nova hugepage support does not include aarch64

Status in OpenStack Compute (nova):
  In Progress
Status in nova package in Ubuntu:
  New
Status in nova source package in Xenial:
  New

Bug description:
  Although aarch64 supports spawning a VM with hugepages, in the nova code
  the libvirt driver considers only x86_64 and I686. AARCH64 needs to be
  added for both NUMA and hugepage support. Due to this bug, a VM cannot
  be launched with hugepages using OpenStack on aarch64 servers.
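
The shape of the problem described above can be sketched as an architecture
allow-list check. This is not nova's actual code: the set names and
host_supports_hugepages are hypothetical, and only the architecture names come
from the description.

```python
# Hypothetical allow-lists; the real check lives in nova's libvirt driver.
HUGEPAGE_ARCHES_BEFORE = {"x86_64", "i686"}
HUGEPAGE_ARCHES_AFTER = HUGEPAGE_ARCHES_BEFORE | {"aarch64"}


def host_supports_hugepages(host_arch, supported):
    """Return True if the host architecture is on the allow-list."""
    return host_arch in supported


# With the old list an aarch64 host is silently treated as unsupported,
# which matches "VM is launched without hugepages" and no error logs.
print(host_supports_hugepages("aarch64", HUGEPAGE_ARCHES_BEFORE))
print(host_supports_hugepages("aarch64", HUGEPAGE_ARCHES_AFTER))
```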

  Steps to reproduce:
  On an openstack environment running on aarch64:
  1. Configure compute to use hugepages.
  2. Set mem_page_size="2048" for a flavor
  3. Launch a VM using the above flavor. 

  Expected result:
  VM should be launched with hugepages and the libvirt xml should have 



  



  Actual result:
  VM is launched without hugepages.

  There are no error logs in nova-scheduler.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1623871/+subscriptions
