[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken
I'm not sure why a "broken" Upstream DNS helps repro this bug, but I was not able to repro when the Upstream DNS was working. -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1710278 Title: [2.3a1] named stuck on reload, DNS broken To manage notifications about this bug go to: https://bugs.launchpad.net/bind/+bug/1710278/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken
repro.py attempts to trigger DNS queries during DNS reloads. It first deploys all 50 machines. Then, one at a time (not all at once!), it releases a machine, waits, deploys it again, and moves on to the next machine. At some point one machine will be releasing (triggering DNS reloads) while others are starting to deploy (sending DNS queries). This is the sweet spot. If one simply deployed all 50 machines simultaneously, the DNS reloads would occur without any DNS queries (because none of the machines would have PXE booted yet).
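The attached repro.py is not reproduced in this thread. As a rough sketch of the loop described above, the function below takes the MAAS operations as plain callables. The names `rolling_redeploy`, `deploy`, `release` and `wait` are hypothetical stand-ins, not the MAAS API and not the actual contents of repro.py.

```python
from typing import Callable, List, Sequence, Tuple


def rolling_redeploy(
    machines: Sequence[str],
    deploy: Callable[[str], None],
    release: Callable[[str], None],
    wait: Callable[[], None],
    rounds: int = 1,
) -> List[Tuple[str, str]]:
    """Deploy every machine once, then cycle them one at a time.

    The callables are hypothetical hooks; a real script would wire them
    to whatever MAAS client it uses. Returns the ordered list of actions.
    """
    actions: List[Tuple[str, str]] = []

    # Phase 1: bring every machine to the deployed state.
    for machine in machines:
        deploy(machine)
        actions.append(("deploy", machine))

    # Phase 2: release, wait, and redeploy one machine at a time, so that
    # releases (DNS reloads) overlap with deploys (DNS queries).
    for _ in range(rounds):
        for machine in machines:
            release(machine)
            actions.append(("release", machine))
            wait()
            deploy(machine)
            actions.append(("deploy", machine))

    return actions
```

Passing the operations in as callables keeps the ordering logic separate from the MAAS calls, so it can be exercised without touching real machines.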
[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken
repro.py attached

** Attachment added: "repro.py"
   https://bugs.launchpad.net/ubuntu/+source/bind9/+bug/1710278/+attachment/5276146/+files/repro.py
[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken
OK - I was able to repro again, this time with MAAS 2.6. Here are the steps.

PREP WORK
1) Have 50 machines in Ready state, each with one interface enabled and configured as 'Autoassign' on the Default VLAN PXE subnet (auto assign so that every deploy/release causes MAAS to reload DNS)
2) Clear out any DNS entries in the PXE subnet (this forces nodes to send DNS queries to MAAS)
3) Settings -> Network Services -> DNS -> Upstream DNS -> enter valid upstream DNS IP
4) Settings -> Network Services -> DNS -> DNSSEC -> Automatic (for some reason this breaks Upstream DNS)
5) Verify that Upstream DNS is broken
   a) Rescue Mode one machine
   b) ssh to Rescue machine
   c) dig www.google.com
   d) (dig should timeout/fail)
   e) MAAS -> Settings -> Network Services -> DNS -> DNSSEC -> Disable
   f) dig www.google.com
   g) (dig should succeed)
   h) MAAS -> Settings -> Network Services -> DNS -> DNSSEC -> Automatic
   i) Release Rescue machine

REPRO
1) run repro.py (attached; WARNING: this code will use all machines available to MAAS)
2) wait up to 3 hours, checking whether bind9 is hung by regularly running `sudo rndc status` on MAAS

MONITORING STEPS (optional)
(See DNS query activity) in one ssh window to MAAS run
   sudo tcpdump dst -i ens3 and dst port 53
(See DNS reloads, and why) in another ssh window to MAAS run
   sudo tail -f /var/log/maas/regiond.log |grep Reloaded -A 3
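The periodic `sudo rndc status` check in REPRO step 2 could be automated along these lines. This is only a sketch: it assumes that a hung named shows up as `rndc status` not returning within a timeout, and that sudo does not prompt for a password. The helper name `command_hangs` and the `SLOW_COMMAND` stand-in are made up for illustration.

```python
import subprocess
import sys
from typing import Sequence

# The check from the repro steps; assumes passwordless sudo on the MAAS host.
RNDC_STATUS = ["sudo", "rndc", "status"]

# Stand-in for a command that never answers, handy for trying the helper
# on a machine without bind9 (hypothetical, for illustration only).
SLOW_COMMAND = [sys.executable, "-c", "import time; time.sleep(5)"]


def command_hangs(cmd: Sequence[str], timeout_s: float) -> bool:
    """Return True if cmd does not finish within timeout_s seconds.

    A non-zero exit status is not treated as a hang; only a timeout is.
    """
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return True
    return False
```

On the MAAS host one would call `command_hangs(RNDC_STATUS, 10.0)` from a loop with a sleep between iterations; the 10 second timeout is an arbitrary choice, not a value from this bug.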
[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken
Hi Mark,

Still seeing it with 18.04 and 2.6. The sweet spot seems to be when MAAS is receiving lots of DNS requests while simultaneously doing DNS reloads (as you alluded to in this case). I'm attempting to set up a simplified repro scenario which will basically do this:

1) enlist 50+ new machines on an untagged subnet *with DNS left blank*, forcing nodes to send DNS queries to MAAS
2) leave the machines' PXE interface on Autoassign IP (so every deploy/release forces a DNS reload)
3) deploy and release (repeat until error)

Will report back with findings.
[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken
Mark,

Do you have any updated repro steps? I'm seeing this failure with MAAS v2.5.3. I suspect that when v2.5 moved the DNS logic from the region to the rack controller, some of the mitigation logic was lost and thus this bug manifests more frequently. When I compare our v2.5.3 install to our v2.4.2 install, the number of rndc reloads is vastly higher on v2.5.3.

[2.4.2]
journalctl -b -u bind9.service |grep received.control
Jun 22 00:22:05 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 22 00:22:08 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 22 00:22:54 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 24 16:27:06 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:53:34 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:53:41 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:54:51 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:55:22 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'

[2.5.3]
journalctl -b -u bind9.service |grep received.control
Jun 26 14:23:59 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:04 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:09 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:11 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:15 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:18 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:22 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:27 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:31 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:36 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:40 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:42 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'

I had to trim the 2.5.3 output because it was way too long to fit in this comment, but as you can see 2.5.3 is spamming reloads compared to 2.4.2. On 2.4.2 it may reload 4 times in an _entire day_, whereas 2.5.3 is doing hundreds if not thousands a day.
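To put numbers on the comparison above, the journalctl output can be tallied per day. The sketch below assumes the line layout shown in the excerpts (month, day, time, host, `named[pid]:`); `reloads_per_day` and `SAMPLE_2_4_2` are illustrative names, and the sample is the 2.4.2 excerpt copied from this comment.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
# Jun 22 00:22:05 host named[907]: received control channel command 'reload'
RELOAD_RE = re.compile(
    r"^(?P<day>\w{3}\s+\d{1,2}) \d{2}:\d{2}:\d{2} \S+ named\[\d+\]: "
    r"received control channel command 'reload'"
)


def reloads_per_day(lines: Iterable[str]) -> Counter:
    """Count rndc reload commands per calendar day (keyed by 'Mon DD')."""
    counts: Counter = Counter()
    for line in lines:
        match = RELOAD_RE.match(line.strip())
        if match:
            counts[match.group("day")] += 1
    return counts


# The 2.4.2 excerpt quoted in this comment.
SAMPLE_2_4_2 = """\
Jun 22 00:22:05 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 22 00:22:08 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 22 00:22:54 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 24 16:27:06 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:53:34 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:53:41 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:54:51 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:55:22 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
""".splitlines()
```

Feeding it the full output of the journalctl command on each host would give a per-day count for both versions without trimming anything by hand.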
[Bug 1446822] Re: maas erase disk cannot be canceled
Same here. It takes hours to erase drives on our servers, and even after allowing the server to finish erasing the drives, MAAS still shows the `Disk Erasing` state. We also cannot `Abort` or `Mark Fixed`, as it errors with
```
Error:Node failed to be marked broken, because of the following error: mark-broken action is not available for this node.
```

-- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1446822 Title: maas erase disk cannot be canceled To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/maas/+bug/1446822/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1677668] Re: no GARPs during ephemeral boot
In our case, we don't need GARP on every boot. Only during MaaS Deploy stage, where MaaS ephemeral boot image is trying to communicate with MaaS region controller (in a different VLAN). The irony is, even if there was a way to add our own GARP instructions in cloud-init config, the region controller would have no way of sending the commands to the maas machine. -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1677668 Title: no GARPs during ephemeral boot To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1677668/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1677668] Re: no GARPs during ephemeral boot
Hi Chris,

Yes, you are correct; updated pic attached. I don't disagree that the PXE/DHCP client should be sending GARPs, but shouldn't any OS that binds to an IP send a GARP as part of its TCP stack initialization? That is, shouldn't the ephemeral boot image itself send a GARP (independent of whether there was one from the PXE client)?

** Attachment added: "updateddrawing.png"
   https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1677668/+attachment/4854773/+files/updateddrawing.png
[Bug 1677668] Re: no GARPs during ephemeral boot
attached pic

** Attachment added: "ascii-art.png"
   https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1677668/+attachment/4851597/+files/ascii-art.png
[Bug 1677668] Re: no GARPs during ephemeral boot
[Diagram: ROUTER with ARP CACHE (expires 4 hours; 10.1.1.11 -> 22:22, 10.1.2.100 -> 33:33), connected to SWITCH A and SWITCH B; below the switches are MAAS MACHINE 2 (MAC 22:22), MAAS MACHINE 1 (10.1.1.11, 255.255.255.0, MAC 11:11) and REGION CTLR (10.1.2.100, 255.255.255.0).]
[Bug 1677668] Re: no GARPs during ephemeral boot
I forgot to mention, the TFTP conversation is happening between the Region Controller (DHCP/TFTP) and the Machine which both live on the same subnet, so the router's ARP Cache is not a factor.
[Bug 1677668] Re: no GARPs during ephemeral boot
yikes! that did not format well...and I can't edit my own comment. Let me try again...
[Bug 1677668] Re: no GARPs during ephemeral boot
I forgot to mention, Region and Rack Controllers are in separate VLANs. So the TFTP conversation is happening between the RACK Controller (DHCP/TFTP) and the Machine which both live on the same subnet, so the router's ARP Cache is not a factor.
[Bug 1677668] Re: no GARPs during ephemeral boot
Hi Chris,

Some new clarifications are in order. Please disregard the "ARP Inspection" claim. That feature wasn't even enabled.

Here's a very simplified drawing of the setup.

[Diagram: ROUTER with ARP CACHE (expires 4 hours; 10.1.1.11 -> 22:22, 10.1.2.100 -> 33:33), connected to SWITCH A and SWITCH B; below the switches are MAAS MACHINE 2 (MAC 22:22), MAAS MACHINE 1 (10.1.1.11, 255.255.255.0, MAC 11:11) and REGION CTLR (10.1.2.100, 255.255.255.0, MAC 33:33).]

1) Assume Machine #2 was last deployed and then released within the past 4 hours, using the IP 10.1.1.11. The router therefore already has an ARP entry in its cache mapping 10.1.1.11 to MAC 22:22.
2) Machine #1 is starting Deployment and happens to receive 10.1.1.11 from the Controller to use as its ephemeral PXE IP.
3) Machine #1 sends a packet to 10.1.2.100:5240
4) Controller sees the packet from 10.1.1.11
5) Controller responds to 10.1.1.11
6) Machine #1 never sees the response packet

We suspect the response packet was sent to Machine #2. We are actively parsing the pcap data to confirm.
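The suspected sequence in steps 1 to 6 can be illustrated with a toy model of the router's ARP cache. This is a deliberately simplified sketch, not a model of any real router: it assumes the cache is only updated when a GARP is received, which is the behaviour suspected in this report and not yet confirmed by the pcap data. The `Router` class and `simulate` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Router:
    """Toy router that forwards purely on its (possibly stale) ARP cache."""

    arp_cache: Dict[str, str] = field(default_factory=dict)

    def receive_garp(self, ip: str, mac: str) -> None:
        # A gratuitous ARP announces "ip is now at mac" and refreshes the entry.
        self.arp_cache[ip] = mac

    def next_hop_mac(self, dst_ip: str) -> Optional[str]:
        # The MAC the router would address a frame for dst_ip to.
        return self.arp_cache.get(dst_ip)


def simulate(machine1_sends_garp: bool) -> Optional[str]:
    """Return the MAC that receives the controller's reply to 10.1.1.11."""
    # Step 1: the cache still maps 10.1.1.11 to Machine #2 (MAC 22:22).
    router = Router({"10.1.1.11": "22:22", "10.1.2.100": "33:33"})

    # Step 2: Machine #1 (MAC 11:11) is handed 10.1.1.11 for its ephemeral boot.
    if machine1_sends_garp:
        router.receive_garp("10.1.1.11", "11:11")

    # Steps 3 to 5: the controller replies to 10.1.1.11 via the router.
    return router.next_hop_mac("10.1.1.11")
```

Without the GARP the reply is addressed to 22:22 (Machine #2), which matches step 6; with it, the reply goes to 11:11.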
[Bug 1677668] [NEW] no GARPs during ephemeral boot
Public bug reported:

Deploys time out with an error on the console that says, "Can not apply stage final, no datasource found! Likely bad things to come!"

How to duplicate:
MAAS Version 2.1.3+bzr5573-0ubuntu1 (16.04.1)
1) Rack Controller and Region Controller in different VLANs
2) Use Cisco ASA as the router with "ARP Inspection" enabled
3) Clear the router ARP cache
4) Deploy 2 maas machines with interfaces set to "Static assign"
5) Observe that both deploy successfully
6) Release both machines and swap their IPs
7) Redeploy the same 2 machines
8) Observe deploy failure with the machine consoles stuck at the "ubuntu login" screen with "Can not apply stage final, no datasource found! Likely bad things to come!"

The root cause is that no GARPs are sent during ephemeral PXE booting, which in our environment causes our router (Cisco ASA) to hold on to ARP table entries until they expire (default = 4 hours). Combined with the ASA feature "ARP Inspection", the router then drops packets from a MaaS machine that is using an IP previously used by a different MaaS machine.

The ephemeral boot image is ephemeral-ubuntu-amd64-ga-16.04-xenial-daily. Running tcpdump on the Rack Controller showed no GARPs from the deploying MaaS machine. If GARPs were sent, the router would refresh its ARP cache, thus avoiding the ARP Inspection drops.

[Excerpt from Cisco ASA]
http://www.cisco.com/c/en/us/td/docs/security/asa/asa94/config-guides/cli/general/asa-94-general-config/basic-arp-mac.pdf
When you enable ARP inspection, the ASA compares the MAC address, IP address, and source interface in all ARP packets to static entries in the ARP table, and takes the following actions:
• If the IP address, MAC address, and source interface match an ARP entry, the packet is passed through.
• If there is a mismatch between the MAC address, the IP address, or the interface, then the ASA drops the packet.
• If the ARP packet does not match any entries in the static ARP table, then you can set the ASA to either forward the packet out all interfaces (flood), or to drop the packet.

** Affects: cloud-init (Ubuntu)
     Importance: Undecided
         Status: New
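The three rules in the Cisco excerpt can be read as a small decision function. The sketch below is one interpretation of that text, not Cisco's implementation: it treats a "mismatch" as a packet that shares an IP or a MAC with a static entry without matching all three fields. The function name `arp_inspect` and the interface name "inside" in the usage note are made up for illustration.

```python
from typing import Iterable, Tuple

# A static ARP entry as (ip, mac, interface).
ArpEntry = Tuple[str, str, str]


def arp_inspect(
    ip: str,
    mac: str,
    iface: str,
    static_table: Iterable[ArpEntry],
    flood_on_miss: bool = True,
) -> str:
    """Return 'pass', 'drop' or 'flood' for one ARP packet."""
    entries = list(static_table)

    # Rule 1: IP, MAC and source interface all match an entry.
    if (ip, mac, iface) in entries:
        return "pass"

    # Rule 2: some entry knows this IP or this MAC, but not with
    # this exact combination of fields.
    if any(e_ip == ip or e_mac == mac for e_ip, e_mac, _ in entries):
        return "drop"

    # Rule 3: nothing in the static table relates to this packet;
    # the action is configurable.
    return "flood" if flood_on_miss else "drop"
```

For example, with a static entry `("10.1.1.11", "22:22", "inside")`, an ARP packet claiming 10.1.1.11 from MAC 11:11 falls under rule 2 and is dropped, which is the failure mode this report attributes to the reused IP.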
[Bug 1677668] Re: no GARPs during ephemeral boot
Forgot to mention that we didn't want to "Static assign" IPs in MaaS. We prefer using "Auto assign" but observed that MaaS will sometimes reuse an IP previously used by a different MaaS machine. By using "Static assign" we can reliably work around the issue (or, in this ticket's case, force a failure on demand).