[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken

2019-07-10 Thread Sam Lee
I'm not sure why a "broken" Upstream DNS helps repro this bug, but I was
not able to repro when the Upstream DNS was working.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1710278

Title:
  [2.3a1] named stuck on reload, DNS broken

To manage notifications about this bug go to:
https://bugs.launchpad.net/bind/+bug/1710278/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken

2019-07-10 Thread Sam Lee
repro.py attempts to trigger DNS queries during DNS reloads.

It does so by first deploying all 50 machines.
Then, one by one (not all at once!), it releases a machine, waits,
redeploys that machine, and moves on to the next machine.

At some point one machine will be releasing (DNS reloads) while others
are starting to deploy (DNS queries).  This is the sweet spot.

If one simply deploys all 50 machines simultaneously, the DNS reload
would occur but without any DNS queries (because none of the machines
has PXE booted yet).
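
The release/wait/deploy cycle described above can be sketched roughly as follows. This is not the attached repro.py; it is a minimal illustration in which `release`, `deploy`, the machine names, and the `settle_seconds` default are all hypothetical stand-ins for whatever MAAS client calls the real script makes.

```python
import time
from typing import Callable, Iterable


def rolling_redeploy(
    machines: Iterable[str],
    release: Callable[[str], None],
    deploy: Callable[[str], None],
    settle_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Release and redeploy machines one at a time; return the cycle count.

    `release` and `deploy` are hypothetical callables supplied by the
    caller; they are assumed to start the action and return without
    waiting for it to finish.
    """
    cycles = 0
    for machine in machines:
        release(machine)       # release triggers a DNS reload
        sleep(settle_seconds)  # wait before redeploying
        deploy(machine)        # the booting machine sends DNS queries
        cycles += 1            # move on while this machine is still deploying
    return cycles
```

Because the loop moves on without waiting for each deploy to finish, earlier machines are still booting (and querying DNS) while later ones are being released.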

[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken

2019-07-10 Thread Sam Lee
repro.py attached

** Attachment added: "repro.py"
   
https://bugs.launchpad.net/ubuntu/+source/bind9/+bug/1710278/+attachment/5276146/+files/repro.py

[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken

2019-07-10 Thread Sam Lee
OK - I was able to repro again, and this time with MAAS 2.6.

Here are the steps.

PREP WORK
1) Have 50 machines in Ready state, each with one enabled interface
   configured as 'Autoassign' on the Default VLAN PXE subnet (auto assign
   so that every deploy/release causes MAAS to reload DNS)
2) Clear out any DNS entries in the PXE subnet (this forces nodes to send
   DNS queries to MAAS)
3) Settings -> Network Services -> DNS -> Upstream DNS -> enter a valid
   upstream DNS IP
4) Settings -> Network Services -> DNS -> DNSSEC -> Automatic (for some
   reason this breaks Upstream DNS)
5) Verify that Upstream DNS is broken
   a) Put one machine into Rescue Mode
   b) ssh to the Rescue machine
   c) dig www.google.com
   d) (dig should time out/fail)
   e) MAAS -> Settings -> Network Services -> DNS -> DNSSEC -> Disable
   f) dig www.google.com
   g) (dig should succeed)
   h) MAAS -> Settings -> Network Services -> DNS -> DNSSEC -> Automatic
   i) Release the Rescue machine

REPRO
1) Run repro.py (attached; WARNING: this code will use all machines
   available to MAAS)
2) Wait up to 3 hours, checking whether bind9 is hung by regularly running
   `sudo rndc status` on MAAS
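
The periodic `sudo rndc status` check in REPRO step 2 could be automated along these lines. This is only a sketch: it assumes that a hung named shows up as `rndc status` not returning, and the helper name and the 10-second default timeout are arbitrary choices, not anything taken from MAAS or BIND.

```python
import subprocess


def command_hangs(cmd, timeout_seconds=10.0):
    """Return True if `cmd` does not finish within `timeout_seconds`.

    A non-zero exit code still counts as "finished"; only a timeout is
    treated as a hang.
    """
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return True
    return False
```

Calling `command_hangs(["sudo", "rndc", "status"])` every few minutes on the MAAS host would then flag the moment bind9 stops answering its control channel.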

MONITORING STEPS (optional)
To see DNS query activity, in one ssh window to MAAS run:
sudo tcpdump dst  -i ens3 and dst port 53
To see DNS reloads, and why, in another ssh window to MAAS run:
sudo tail -f /var/log/maas/regiond.log |grep Reloaded -A 3

[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken

2019-07-09 Thread Sam Lee
Hi Mark,

Still seeing it with 18.04 and 2.6.  The sweet spot seems to be when
MAAS is receiving lots of DNS requests while simultaneously doing DNS
reloads (as you alluded to in this case).

I'm attempting to set up a simplified repro scenario which will
basically do this:

1) Enlist 50+ new machines on an untagged subnet *with DNS left blank*,
   forcing nodes to send their DNS queries to MAAS
2) Leave each machine's PXE interface with Autoassign IP (so every
   deploy/release forces a DNS reload)
3) Deploy and release (repeat until error)

I will report back with findings.

[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken

2019-06-26 Thread Sam Lee
Mark,

Do you have any updated repro steps?

I'm seeing this failure with MAAS v2.5.3.  I suspect that when v2.5 moved
the DNS logic from the region to the rack controller, some of the
mitigation logic was lost, and thus this bug manifests more frequently.

When I compare our v2.5.3 install with our v2.4.2 install, the number of
rndc reloads is vastly higher on v2.5.3.

[2.4.2]
journalctl -b -u bind9.service |grep received.control
Jun 22 00:22:05 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 22 00:22:08 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 22 00:22:54 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 24 16:27:06 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:53:34 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:53:41 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:54:51 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'
Jun 25 13:55:22 wdc1-p01-s01-maas-18 named[907]: received control channel command 'reload'

[2.5.3]
journalctl -b -u bind9.service |grep received.control
Jun 26 14:23:59 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:04 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:09 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:11 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:15 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:18 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:22 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:27 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:31 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:36 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:40 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'
Jun 26 14:24:42 ch31-p01-s01-maas-18 named[1041]: received control channel command 'reload'

I had to trim the 2.5.3 output because it was way too long to fit in
this comment, but as you can see 2.5.3 is spamming reloads compared to
2.4.2.  2.4.2 may reload 4 times in an _entire day_, whereas 2.5.3 is
doing hundreds if not thousands a day.
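
To put numbers on the comparison rather than eyeballing the journal, the reload lines can be tallied per day. The sketch below assumes each journal entry sits on a single line, as journalctl prints it; the function name and regular expression are my own, not part of MAAS or BIND.

```python
import re
from collections import Counter

# Matches e.g. "Jun 22 00:22:05 host named[907]: received control channel command 'reload'"
RELOAD_RE = re.compile(
    r"^(\w{3}\s+\d{1,2}) .*received control channel command 'reload'"
)


def reloads_per_day(lines):
    """Count rndc reload log entries, keyed by the 'Mon DD' date prefix."""
    counts = Counter()
    for line in lines:
        match = RELOAD_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It could be fed from something like `journalctl -b -u bind9.service | python3 count_reloads.py` (the script name is hypothetical), reading `sys.stdin` as the `lines` argument.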

[Bug 1446822] Re: maas erase disk cannot be canceled

2019-01-29 Thread Sam Lee
Same here.  It takes hours to erase the drives on our servers, and even
after allowing the server to finish erasing them, MAAS still shows the
`Disk Erasing` state.  We also cannot `Abort` or `Mark Fixed`, as it
errors with

```
Error:Node failed to be marked broken, because of the following error: mark-broken action is not available for this node.
```

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1446822

Title:
  maas erase disk cannot be canceled

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/maas/+bug/1446822/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1677668] Re: no GARPs during ephemeral boot

2017-05-24 Thread Sam Lee
In our case, we don't need GARP on every boot.  Only during MaaS Deploy
stage, where MaaS ephemeral boot image is trying to communicate with
MaaS region controller (in a different VLAN).

The irony is, even if there was a way to add our own GARP instructions
in cloud-init config, the region controller would have no way of sending
the commands to the maas machine.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1677668

Title:
  no GARPs during ephemeral boot

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1677668/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 1677668] Re: no GARPs during ephemeral boot

2017-04-04 Thread Sam Lee
Hi Chris, Yes you are correct, and attached updated pic.

I don't disagree that the PXE/DHCP client should be sending GARPs, but
shouldn't any OS that binds to an IP send a GARP as part of its TCP/IP
stack initialization?  That is, shouldn't the ephemeral boot image
itself send a GARP (independent of whether there was one from the PXE
client)?

** Attachment added: "updateddrawing.png"
   
https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1677668/+attachment/4854773/+files/updateddrawing.png


[Bug 1677668] Re: no GARPs during ephemeral boot

2017-03-31 Thread Sam Lee
attached pic

** Attachment added: "ascii-art.png"
   
https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1677668/+attachment/4851597/+files/ascii-art.png


[Bug 1677668] Re: no GARPs during ephemeral boot

2017-03-31 Thread Sam Lee
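                    +-----------------------------+   +-------------------+
                    |                             |   | ARP CACHE         |
                    |           ROUTER            |   | (expires 4 hours) |
                    |                             |   | 10.1.1.11   22:22 |
                    |                             |   | 10.1.2.100  33:33 |
                    +-------+-------------------+-+   +-------------------+
                            |                   |
    +-----------------------+----+      +-------+--------+
    |          SWITCH A          |      |    SWITCH B    |
    +---+-------------------+----+      +-------+--------+
        |                   |                   |
+-------+--------+  +-------+--------+  +-------+--------+
| MAAS MACHINE 2 |  | MAAS MACHINE 1 |  | REGION CTLR    |
|                |  | 10.1.1.11      |  | 10.1.2.100     |
|                |  | 255.255.255.0  |  | 255.255.255.0  |
| MAC 22:22      |  | MAC 11:11      |  |                |
+----------------+  +----------------+  +----------------+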


[Bug 1677668] Re: no GARPs during ephemeral boot

2017-03-31 Thread Sam Lee
I forgot to mention, the TFTP conversation is happening between the
Region Controller (DHCP/TFTP) and the Machine which both live on the
same subnet, so the router's ARP Cache is not a factor.


[Bug 1677668] Re: no GARPs during ephemeral boot

2017-03-31 Thread Sam Lee
yikes! that did not format well...and I can't edit my own comment.  Let
me try again...


[Bug 1677668] Re: no GARPs during ephemeral boot

2017-03-31 Thread Sam Lee
I forgot to mention, Region and Rack Controllers are in separate VLANs.
So the TFTP conversation is happening between the RACK Controller
(DHCP/TFTP) and the Machine which both live on the same subnet, so the
router's ARP Cache is not a factor.


[Bug 1677668] Re: no GARPs during ephemeral boot

2017-03-31 Thread Sam Lee
Hi Chris,

Some new clarifications are in order.  Please disregard the "ARP
Inspection" claim.  That feature wasn't even enabled.

Here's a very simplified drawing of the setup.



 
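                    +-----------------------------+   +-------------------+
                    |                             |   | ARP CACHE         |
                    |           ROUTER            |   | (expires 4 hours) |
                    |                             |   | 10.1.1.11   22:22 |
                    |                             |   | 10.1.2.100  33:33 |
                    +-------+-------------------+-+   +-------------------+
                            |                   |
    +-----------------------+----+      +-------+--------+
    |          SWITCH A          |      |    SWITCH B    |
    +---+-------------------+----+      +-------+--------+
        |                   |                   |
+-------+--------+  +-------+--------+  +-------+--------+
| MAAS MACHINE 2 |  | MAAS MACHINE 1 |  | REGION CTLR    |
|                |  | 10.1.1.11      |  | 10.1.2.100     |
|                |  | 255.255.255.0  |  | 255.255.255.0  |
| MAC 22:22      |  | MAC 11:11      |  | MAC 33:33      |
+----------------+  +----------------+  +----------------+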


1) Assume Machine #2 was last deployed and then released within the past
   4 hours, using the IP 10.1.1.11.  Thus the router already has an ARP
   entry in its cache mapping 10.1.1.11 to MAC 22:22.
2) Machine #1 starts deployment and happens to receive 10.1.1.11 from the
   Controller to use as its ephemeral PXE IP.
3) Machine #1 sends a packet to 10.1.2.100:5240
4) Controller sees the packet from 10.1.1.11
5) Controller responds to 10.1.1.11
6) Machine #1 never sees the response packet

We suspect the response packet was sent to Machine #2.  We are actively
parsing the pcap data to confirm.
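
The suspected failure can be illustrated with a toy model of the router's ARP cache. This is a sketch of the hypothesis only, not a description of how the ASA is implemented; the class, its methods, and the timestamps are invented for illustration, and the abbreviated MACs are the ones from the drawing.

```python
ARP_TTL_SECONDS = 4 * 60 * 60  # cache lifetime from the drawing: 4 hours


class ArpCache:
    """Toy ARP cache: remembers IP -> MAC until the entry's TTL expires."""

    def __init__(self, ttl=ARP_TTL_SECONDS):
        self.ttl = ttl
        self._entries = {}  # ip -> (mac, time the entry was learned)

    def learn(self, ip, mac, now):
        """Record or refresh a mapping, e.g. on seeing a GARP."""
        self._entries[ip] = (mac, now)

    def lookup(self, ip, now):
        """Return the cached MAC, or None if unknown or expired."""
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, learned_at = entry
        if now - learned_at >= self.ttl:
            del self._entries[ip]
            return None
        return mac


router = ArpCache()
router.learn("10.1.1.11", "22:22", now=0)             # Machine #2's old lease
stale_mac = router.lookup("10.1.1.11", now=3600)      # Machine #1 now owns the IP
router.learn("10.1.1.11", "11:11", now=3600)          # what a GARP would do
fresh_mac = router.lookup("10.1.1.11", now=3600)
```

Without the GARP, `stale_mac` is still Machine #2's MAC one hour later, so the Controller's reply would be forwarded to the wrong machine; after the refresh, `fresh_mac` is Machine #1's.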


[Bug 1677668] [NEW] no GARPs during ephemeral boot

2017-03-30 Thread Sam Lee
Public bug reported:

Deploys time out with an error on the console that says,

"Can not apply stage final, no datasource found! Likely bad things to
come!"

How to duplicate:
MAAS Version 2.1.3+bzr5573-0ubuntu1 (16.04.1)
1) Rack Controller and Region Controller in different VLANs
2) Use Cisco ASA as the router with "ARP Inspection" enabled
3) Clear the router ARP cache
4) Deploy 2 maas machines with interfaces set to "Static assign"
5) Observe that both deploy successfully
6) Release both machines and swap their IPs
7) Redeploy the same 2 machines
8) Observe deploy failure, with the machine consoles stuck at the "ubuntu
   login" screen showing "Can not apply stage final, no datasource found!
   Likely bad things to come!"
 
The root cause is that no GARPs are sent during ephemeral PXE booting,
which in our environment causes our router (Cisco ASA) to hold on to ARP
table entries until they expire (default = 4 hours).  Combined with the
ASA "ARP Inspection" feature, the router then drops packets from a MaaS
machine that is using an IP previously used by a different MaaS machine.

The ephemeral boot image is ephemeral-ubuntu-amd64-ga-16.04-xenial-daily.

Running tcpdump on the Rack Controller showed no GARPs from the
deploying MaaS machine.  If GARPs had been sent, the router would have
refreshed its ARP cache, thus avoiding the ARP Inspection drops.
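
For reference, a gratuitous ARP is just an ARP packet in which the sender and target IP are the same, broadcast so that neighbours update their caches. The sketch below only builds the 42-byte Ethernet frame (request form); it does not send anything, since that needs a raw socket and root. The function name and the example MAC in the usage note are made up.

```python
import socket
import struct


def build_garp_frame(mac: str, ip: str) -> bytes:
    """Build a broadcast gratuitous ARP request announcing ip -> mac."""
    hw = bytes.fromhex(mac.replace(":", ""))
    if len(hw) != 6:
        raise ValueError("MAC address must be 6 bytes")
    proto = socket.inet_aton(ip)  # 4-byte IPv4 address

    # Ethernet header: broadcast destination, our source, EtherType ARP.
    ethernet = b"\xff" * 6 + hw + struct.pack("!H", 0x0806)

    # ARP header: Ethernet/IPv4, 6-byte MACs, 4-byte IPs, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += hw + proto          # sender MAC and IP
    arp += b"\x00" * 6 + proto  # target MAC unset, target IP == sender IP

    return ethernet + arp
```

For example, `build_garp_frame("52:54:00:11:11:11", "10.1.1.11")` yields the frame Machine #1 would need to broadcast after taking over 10.1.1.11.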

[Excerpt from Cisco ASA]
http://www.cisco.com/c/en/us/td/docs/security/asa/asa94/config-guides/cli/general/asa-94-general-config/basic-arp-mac.pdf
When you enable ARP inspection, the ASA compares the MAC address, IP
address, and source interface in all ARP packets to static entries in the
ARP table, and takes the following actions:
• If the IP address, MAC address, and source interface match an ARP entry,
  the packet is passed through.
• If there is a mismatch between the MAC address, the IP address, or the
  interface, then the ASA drops the packet.
• If the ARP packet does not match any entries in the static ARP table,
  then you can set the ASA to either forward the packet out all interfaces
  (flood), or to drop the packet.
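
The three rules in the excerpt amount to a small decision function. The sketch below is my reading of the excerpt, not Cisco's implementation; keying the static table by IP address is a simplifying assumption, and all names are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    DROP = "drop"
    FLOOD = "flood"


def inspect_arp(static_table, ip, mac, interface, flood_on_miss=True):
    """Apply the excerpt's rules to one ARP packet.

    `static_table` maps IP -> (MAC, interface); this keying is an
    assumption made for the sketch.
    """
    entry = static_table.get(ip)
    if entry is None:
        # No static entry at all: configurable flood or drop.
        return Verdict.FLOOD if flood_on_miss else Verdict.DROP
    if entry == (mac, interface):
        return Verdict.PASS  # IP, MAC, and interface all match
    return Verdict.DROP      # entry exists but MAC or interface differs
```

With a table of `{"10.1.1.11": ("22:22", "inside")}` (the interface name is invented), an ARP packet for 10.1.1.11 from MAC 11:11 would come back as `Verdict.DROP`.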

** Affects: cloud-init (Ubuntu)
 Importance: Undecided
 Status: New

[Bug 1677668] Re: no GARPs during ephemeral boot

2017-03-30 Thread Sam Lee
Forgot to mention that we didn't want to "Static assign" IPs in MaaS.
We prefer using "Auto assign" but observed that MaaS will sometimes
reuse an IP previously used by a different MaaS machine.  With "Static
assign" we can reliably work around the issue (or, in this ticket's
case, force a failure on demand).
