Public bug reported: Sometimes (like 1 in 100 times), my Juju/Azure deployment hangs with a machine stuck in 'pending'. On these machines, cloud-init-output.log says:
... Setting up qemu-utils (2.0.0+dfsg-2ubuntu1.24) ... Setting up cloud-image-utils (0.27-0ubuntu9.2) ... Setting up cloud-utils (0.27-0ubuntu9.2) ... Processing triggers for libc-bin (2.19-0ubuntu6.9) ... And cloud-init.log says: ... Jun 1 16:44:11 machine-3 [CLOUDINIT] cloud-init[DEBUG]: Ran 19 modules with 0 failures Jun 1 16:44:11 machine-3 [CLOUDINIT] util.py[DEBUG]: Reading from /proc/uptime (quiet=False) Jun 1 16:44:11 machine-3 [CLOUDINIT] util.py[DEBUG]: Read 12 bytes from /proc/uptime Jun 1 16:44:11 machine-3 [CLOUDINIT] util.py[DEBUG]: cloud-init mode 'modules' took 48.449 seconds (48.45) This seems to be related to this failure in waagent.log: root@machine-3:/var/log# cat waagent.log 2016/06/01 16:43:07 Azure Linux Agent Version: WALinuxAgent-2.0.16 2016/06/01 16:43:07 Linux Distribution Detected : Ubuntu 2016/06/01 16:43:07 Module /lib/modules/3.19.0-59-generic/kernel/drivers/ata/ata_piix.ko driver for ATAPI CD-ROM does not exist. 2016/06/01 16:43:07 mount: block device /dev/sr0 is write-protected, mounting read-only 2016/06/01 16:43:07 mount: you didn't specify a filesystem type for /dev/sr0 2016/06/01 16:43:07 I will try type udf 2016/06/01 16:43:07 mount: you didn't specify a filesystem type for /dev/sr0 2016/06/01 16:43:07 I will try type udf 2016/06/01 16:43:07 /dev/sr0 on /mnt/cdrom/secure type udf (ro) 2016/06/01 16:43:07 mount succeeded on attempt #1 2016/06/01 16:43:07 VMM Init script not found. Provisioning for Azure 2016/06/01 16:43:07 IPv4 address: 10.0.0.6 2016/06/01 16:43:07 MAC address: 00:0D:3A:71:79:7D 2016/06/01 16:43:07 Probing for Azure environment. 2016/06/01 16:43:07 DoDhcpWork: Setting socket.timeout=10, entering recv 2016/06/01 16:43:07 Set default gateway: 10.0.0.1 2016/06/01 16:43:07 Discovered Azure endpoint: 168.63.129.16 2016/06/01 16:43:07 Fabric preferred wire protocol version: 2015-04-05 2016/06/01 16:43:07 Negotiated wire protocol version: 2012-11-30 2016/06/01 16:43:07 SetBlockDeviceTimeout: Update the device sda with timeout 300 2016/06/01 16:43:07 SetBlockDeviceTimeout: Update the device sdb with timeout 300 2016/06/01 16:43:08 Retrieved GoalState from Azure Fabric. 2016/06/01 16:43:08 ExpectedState: Started 2016/06/01 16:43:08 ContainerId: adea2037-ec24-4752-b6f0-0b53f7dee188 2016/06/01 16:43:08 RoleInstanceId: 2d0f948c-f310-4290-b3dc-bde08efeb3d9._machine-3 2016/06/01 16:43:08 Public cert with thumbprint: 0C420F72F3884BD2CE799D50B9C1AFDB1B0C1072 was retrieved. 2016/06/01 16:43:18 ERROR:Can't find host key: /etc/ssh/ssh_host_rsa_key.pub 2016/06/01 16:43:22 Finished processing ExtensionsConfig.xml 2016/06/01 16:43:22 Successfully reported handler status 2016/06/01 16:55:24 ERROR:Socket IOError The read operation timed out, args:('The read operation timed out',) 2016/06/01 16:55:24 ERROR:Retry=0 2016/06/01 16:55:24 ERROR:HTTP Req: HEAD https://p437b4r90zrnkr2vcf51fm9q.blob.core.windows.net/osvhds/machine-3.988b8739-235f-4b55-a0ce-9e896a7eb595.status?sv=2014-02-14&sr=b&sig=tEiFGWxfACZ8Y2%2FEF%2Bl6ocG9BdhSwgNRQj%2Btpr3JKHY%3D&se=9999-01-01T00%3A00%3A00Z&sp=rw 2016/06/01 16:55:24 ERROR:HTTP Req: Data=None 2016/06/01 16:55:24 ERROR:HTTP Req: Header={'x-ms-version': '2014-02-14', 'x-ms-date': '2016-06-01T16:55:14Z'} 2016/06/01 16:55:24 ERROR:HTTP Err: response is empty. I can wget the file that it's complaining about, though perhaps it's trying to get it earlier when it doesn't exist (or is empty). Contents of that file look like this: {"version":"1.0","timestampUTC":"2016-06-01T17:40:34Z","aggregateStatus":{"guestAgentStatus":{"version":"WALinuxAgent-2.0.16","status":"Ready","formattedMessage":{"lang ":"en-US","message":"GuestAgent is running and accepting new configurations."}},"handlerAggregateStatus":[]}} I've let machines sit like this for hours, but they don't seem to unstick themselves. My quick solution is to ssh onto a 'pending' machine and reboot it. The subsequent boot triggers cloud-init and it finishes whatever it's supposed to do. Juju then takes over and all is good. Is a waagent failure (if that is indeed the problem) something that cloud-init can watch for and handle? Logs coming shortly... ** Affects: cloud-init Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to cloud-init. https://bugs.launchpad.net/bugs/1588014 Title: cloud-init doesn't always complete on azure machines Status in cloud-init: New Bug description: Sometimes (like 1 in 100 times), my Juju/Azure deployment hangs with a machine stuck in 'pending'. On these machines, cloud-init-output.log says: ... Setting up qemu-utils (2.0.0+dfsg-2ubuntu1.24) ... Setting up cloud-image-utils (0.27-0ubuntu9.2) ... Setting up cloud-utils (0.27-0ubuntu9.2) ... Processing triggers for libc-bin (2.19-0ubuntu6.9) ... And cloud-init.log says: ... Jun 1 16:44:11 machine-3 [CLOUDINIT] cloud-init[DEBUG]: Ran 19 modules with 0 failures Jun 1 16:44:11 machine-3 [CLOUDINIT] util.py[DEBUG]: Reading from /proc/uptime (quiet=False) Jun 1 16:44:11 machine-3 [CLOUDINIT] util.py[DEBUG]: Read 12 bytes from /proc/uptime Jun 1 16:44:11 machine-3 [CLOUDINIT] util.py[DEBUG]: cloud-init mode 'modules' took 48.449 seconds (48.45) This seems to be related to this failure in waagent.log: root@machine-3:/var/log# cat waagent.log 2016/06/01 16:43:07 Azure Linux Agent Version: WALinuxAgent-2.0.16 2016/06/01 16:43:07 Linux Distribution Detected : Ubuntu 2016/06/01 16:43:07 Module /lib/modules/3.19.0-59-generic/kernel/drivers/ata/ata_piix.ko driver for ATAPI CD-ROM does not exist. 2016/06/01 16:43:07 mount: block device /dev/sr0 is write-protected, mounting read-only 2016/06/01 16:43:07 mount: you didn't specify a filesystem type for /dev/sr0 2016/06/01 16:43:07 I will try type udf 2016/06/01 16:43:07 mount: you didn't specify a filesystem type for /dev/sr0 2016/06/01 16:43:07 I will try type udf 2016/06/01 16:43:07 /dev/sr0 on /mnt/cdrom/secure type udf (ro) 2016/06/01 16:43:07 mount succeeded on attempt #1 2016/06/01 16:43:07 VMM Init script not found. Provisioning for Azure 2016/06/01 16:43:07 IPv4 address: 10.0.0.6 2016/06/01 16:43:07 MAC address: 00:0D:3A:71:79:7D 2016/06/01 16:43:07 Probing for Azure environment. 2016/06/01 16:43:07 DoDhcpWork: Setting socket.timeout=10, entering recv 2016/06/01 16:43:07 Set default gateway: 10.0.0.1 2016/06/01 16:43:07 Discovered Azure endpoint: 168.63.129.16 2016/06/01 16:43:07 Fabric preferred wire protocol version: 2015-04-05 2016/06/01 16:43:07 Negotiated wire protocol version: 2012-11-30 2016/06/01 16:43:07 SetBlockDeviceTimeout: Update the device sda with timeout 300 2016/06/01 16:43:07 SetBlockDeviceTimeout: Update the device sdb with timeout 300 2016/06/01 16:43:08 Retrieved GoalState from Azure Fabric. 2016/06/01 16:43:08 ExpectedState: Started 2016/06/01 16:43:08 ContainerId: adea2037-ec24-4752-b6f0-0b53f7dee188 2016/06/01 16:43:08 RoleInstanceId: 2d0f948c-f310-4290-b3dc-bde08efeb3d9._machine-3 2016/06/01 16:43:08 Public cert with thumbprint: 0C420F72F3884BD2CE799D50B9C1AFDB1B0C1072 was retrieved. 2016/06/01 16:43:18 ERROR:Can't find host key: /etc/ssh/ssh_host_rsa_key.pub 2016/06/01 16:43:22 Finished processing ExtensionsConfig.xml 2016/06/01 16:43:22 Successfully reported handler status 2016/06/01 16:55:24 ERROR:Socket IOError The read operation timed out, args:('The read operation timed out',) 2016/06/01 16:55:24 ERROR:Retry=0 2016/06/01 16:55:24 ERROR:HTTP Req: HEAD https://p437b4r90zrnkr2vcf51fm9q.blob.core.windows.net/osvhds/machine-3.988b8739-235f-4b55-a0ce-9e896a7eb595.status?sv=2014-02-14&sr=b&sig=tEiFGWxfACZ8Y2%2FEF%2Bl6ocG9BdhSwgNRQj%2Btpr3JKHY%3D&se=9999-01-01T00%3A00%3A00Z&sp=rw 2016/06/01 16:55:24 ERROR:HTTP Req: Data=None 2016/06/01 16:55:24 ERROR:HTTP Req: Header={'x-ms-version': '2014-02-14', 'x-ms-date': '2016-06-01T16:55:14Z'} 2016/06/01 16:55:24 ERROR:HTTP Err: response is empty. I can wget the file that it's complaining about, though perhaps it's trying to get it earlier when it doesn't exist (or is empty). Contents of that file look like this: {"version":"1.0","timestampUTC":"2016-06-01T17:40:34Z","aggregateStatus":{"guestAgentStatus":{"version":"WALinuxAgent-2.0.16","status":"Ready","formattedMessage":{"lang ":"en-US","message":"GuestAgent is running and accepting new configurations."}},"handlerAggregateStatus":[]}} I've let machines sit like this for hours, but they don't seem to unstick themselves. My quick solution is to ssh onto a 'pending' machine and reboot it. The subsequent boot triggers cloud-init and it finishes whatever it's supposed to do. Juju then takes over and all is good. Is a waagent failure (if that is indeed the problem) something that cloud-init can watch for and handle? Logs coming shortly... To manage notifications about this bug go to: https://bugs.launchpad.net/cloud-init/+bug/1588014/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp