[Bug 62958] New: Instances fail to initialize on initial boot due to network communication failures

bugzilla-daemon Sat, 22 Mar 2014 09:46:27 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=62958


            Bug ID: 62958
           Summary: Instances fail to initialize on initial boot due to
                    network communication failures
           Product: Wikimedia Labs
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: critical
          Priority: Unprioritized
         Component: General
          Assignee: wikibugs-l@lists.wikimedia.org
          Reporter: bda...@wikimedia.org
                CC: abog...@wikimedia.org, benap...@gmail.com,
                    rlan...@gmail.com
       Web browser: ---
   Mobile Platform: ---

Created attachment 14879
  --> https://bugzilla.wikimedia.org/attachment.cgi?id=14879&action=edit
Initial boot console log for failing instance

I'm trying to build the four m1.large Elasticsearch hosts for beta.eqiad in the
deployment-prep project. Instance creation via the wikitech web interface
succeeds and the hosts begin their initial boot process. During this first boot
the hosts fail to communicate with the LDAP servers and the labs puppetmaster.
This leaves them in an unusable state where normal users cannot ssh in.
Rebooting the instances does not seem to correct the problem, possibly because
the initial puppet run never completed.
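
To tell a network-level failure apart from a service-level one, a plain TCP
probe can be run from an affected instance (where root access is available) or
from a working instance in the same project. This is a minimal sketch, not the
check used in the debugging session quoted later in this report: the host names
in TARGETS are hypothetical placeholders, the function names are illustrative,
and the ports assume the conventional defaults for LDAP (389) and the puppet
master (8140).

```python
import socket

# Hypothetical placeholders: substitute the real LDAP and puppetmaster hosts.
# Ports are the conventional defaults (LDAP 389, puppet master 8140).
TARGETS = {
    "ldap": ("ldap.example.org", 389),
    "puppetmaster": ("puppetmaster.example.org", 8140),
}


def probe(host, port, timeout=5.0):
    """Try a TCP connect; return (ok, detail) instead of raising."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        return False, "timed out"
    except OSError as exc:
        # Covers refused connections, unreachable networks and DNS failures.
        return False, str(exc)


def report(targets, timeout=5.0):
    """Probe every target and return one human-readable line per service."""
    lines = []
    for name, (host, port) in targets.items():
        ok, detail = probe(host, port, timeout)
        status = "OK" if ok else "FAIL"
        lines.append(f"{name} {host}:{port} {status} ({detail})")
    return lines
```

Printing the lines returned by `report(TARGETS)` would show whether each
service times out, refuses the connection, or accepts it. A timeout on both
targets would point at the network path rather than at either service.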

The failure does not seem to be isolated to the deployment-prep project or the
m1.large instance size. I can reproduce the problem in the wikimania-support
and logstash projects and with small, medium, large and xlarge instances.

I first saw this around 2014-03-21T22:49Z, but that was also my first attempt
to build new instances that day, so it may have started earlier. The problem
persists today. Times in the IRC logs below are MDT (UTC-6).
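
To line the IRC timestamps up with server-side logs kept in UTC, the offset can
be applied mechanically. This is a small sketch that assumes every log line
falls on the local date 2014-03-21 and uses the fixed UTC-6 offset stated
above; `MDT` and `irc_to_utc` are illustrative names, not part of any tool
mentioned in this report.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset taken from the report: IRC log times are MDT (UTC-6).
MDT = timezone(timedelta(hours=-6), "MDT")


def irc_to_utc(hhmmss, day="2014-03-21"):
    """Convert an IRC log time (MDT) on the given day to an ISO 8601 UTC string."""
    local = datetime.strptime(f"{day} {hhmmss}", "%Y-%m-%d %H:%M:%S")
    local = local.replace(tzinfo=MDT)
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(irc_to_utc("16:49:29"))  # 2014-03-21T22:49:29Z
```

The first log line, 16:49:29 MDT, maps to 22:49:29Z, which matches the
first-seen time given above.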

[16:49:29] <bd808> Coren: I'm trying to create some new instances for the
deployment-prep project in eqiad and they are blowing up on initial boot with
ldap connection timeout errors to the instance console.
[16:49:49] <Coren> o_O
[16:49:50] <bd808> Instances are deployment-es[012]
[16:50:37] <Coren> bd808: Checking.
[16:50:56] <bd808> The last time I saw this Andrew eventually found out the
server they were placed on was missing a network cable
[16:51:24] <Coren> bd808: That's not the case here; the box is actually alive
and reachable.
[16:51:32] <bd808> i-0000026[cde] if that helps
[16:52:01] <Coren> It also seems to have only a partial puppet run done.
[17:00:07] <Coren> bd808: I'm honestly not seeing anything wrong with your
instances, except for the fact that it doesn't look like puppet ran correctly.
[17:00:39] <Coren> bd808: LDAP is up, at least, so I don't know where the
connection errors might come from except, perhaps, that the config files
weren't puppeted in?
[17:00:58] <Coren> Stupid question, have you tried rebooting them to force a
new puppet run?
[17:01:46] <bd808> Coren: So … reboot and hope?
[17:01:53] <bd808> jinx
[17:02:02] <bd808> I can totally do that
[17:02:15] <bd808> and I can nuke them and try again if that doesn't work
[17:04:24] <bd808> "deployment-es0 puppet-agent[1200]: Could not request
certificate: Connection timed out"
[17:04:51] * bd808 will blow them up and start over
[17:04:57] <Coren> Wait, that has nothing to do with LDAP; that's the puppet
master being out of its gourd (which would explain why you don't have a
complete puppet run)
[17:06:38] <bd808> They look jacked up. "Could not set 'directory on ensure:
File exists - /var/run/puppet"
[17:07:14] <bd808> puppet agent failed to start on reboot
[17:07:37] <Coren> Well yeah, if it doesn't have a cert then it can't work.
[17:07:41] *** andrewbogott_afk is now known as andrewbogott
[17:07:56] <Coren> bd808: Try just one at first.  I want to see why the first
run failed.
[17:08:25] <bd808> Ok. I'll start with es0
[17:11:13] <bd808> Coren: Could not parse configuration file: Certificate names
must be lower case; see #1168
[17:11:20] <bd808> Coren: Starting puppet agent [fail]
[17:12:05] <bd808> That's initial boot on the "new" es0 (i-0000026f)
[17:19:56] <bd808> Coren: Same final result "deployment-es0 puppet-agent[1194]:
Could not request certificate: Connection timed out - connect(2)"

[17:22:33] <andrewbogott> bd808: what project is this?
[17:22:44] <bd808> andrewbogott: deployment-prep
[17:24:06] <bd808> 4 m1.large image creations in a row have died on first boot
with logs full of ldap timeouts from nslcd followed by failure to get the cert
from the puppet master
[17:24:56] <andrewbogott> bd808: is it just large instances that fail?
[17:25:16] <bd808> andrewbogott: I haven't tried other sizes today
[17:26:49] <bd808> I'm setting up the cirrus cluster. Created 3 m1.large in
rapid succession, got an "instance not created" error when trying to create the
4th. Went to console of es0 (first one made) and saw these errors.
[17:27:27] <bd808> The next two instances showed the same error logs. Nuked es0
and created it again
[17:27:31] <bd808> same outcome
[17:28:54] <andrewbogott> bd808: your project was pushed right up against the
quota for cores.  I don't know if that was the problem, but… I just raised it
quite a bit.

[17:51:31] <andrewbogott> bd808: have you ever had a large size instance work?
[17:51:51] <andrewbogott> I just tried, small is working but large is not…
trying medium now
[17:52:07] <andrewbogott> Why would that affect network connectivity?  I cannot
guess.
[17:52:20] <bd808> andrewbogott: That's a good question. I don't know that I've
tried to build one before. small and xl have worked in the past
[17:53:53] <bd808> I'm sure Nik wouldn't mind having xlarge instances if that's
the case
[18:00:00] <bd808> andrewbogott: Not totally confirmed yet, but it looks like
xlarge may be having the same issues
[18:00:25] <andrewbogott> Yeah, I can't make anything but 'small' start up.

