https://bugzilla.wikimedia.org/show_bug.cgi?id=62958
Bug ID: 62958
Summary: Instances fail to initialize on initial boot due to network communication failures
Product: Wikimedia Labs
Version: unspecified
Hardware: All
OS: All
Status: NEW
Severity: critical
Priority: Unprioritized
Component: General
Assignee: wikibugs-l@lists.wikimedia.org
Reporter: bda...@wikimedia.org
CC: abog...@wikimedia.org, benap...@gmail.com, rlan...@gmail.com
Web browser: ---
Mobile Platform: ---

Created attachment 14879
--> https://bugzilla.wikimedia.org/attachment.cgi?id=14879&action=edit
Initial boot console log for failing instance

I'm trying to build the four m1.large elasticsearch hosts for beta.eqiad in the deployment-prep project. Instance creation via the wikitech web interface succeeds and the hosts begin their initial boot process. During this first boot the hosts experience failures communicating with the LDAP servers and the labs puppetmaster. This leaves them in an unusable state where ssh by normal users is not possible. Rebooting the instances does not seem to correct the issues. This is possibly due to the failure of the initial puppet run.

The failure does not seem to be isolated to the deployment-prep project or the m1.large image. I can reproduce the problem in the wikimania-support and logstash projects and with small, medium, large and xlarge instances.

First seen by me around 2014-03-21T22:49Z, but this was the first time I had tried to build new instances that day. The problem persists today.

Times in irc logs are MDT (GMT-6).

[16:49:29] <bd808> Coren: I'm trying to create some new instances for the deploymnet-prep project in eqiad and they are blowing up on initial boot with ldap connection timeout errors to the instance console.
[16:49:49] <Coren> o_O
[16:49:50] <bd808> Instances are deploymnet-es[012]
[16:50:37] <Coren> bd808: Checking.
[16:50:56] <bd808> The last time I saw this Andrew eventually found out the server they were placed on was missing a network cable
[16:51:24] <Coren> bd808: That's not the case here; the box is actually alive and reachable.
[16:51:32] <bd808> i-0000026[cde] if that helps
[16:52:01] <Coren> It also seems to have only a partial puppet run done.
[17:00:07] <Coren> bd808: I'm honestly not seeing anything wrong with your instances, except for the fact that it doesn't look like puppet ran correctly.
[17:00:39] <Coren> bd808: LDAP is up, at least, so I don't know where the connection errors might come from except, perhaps, that the config files weren't puppeted in?
[17:00:58] <Coren> Stupid question, have you tried rebooting them to force a new puppet run?
[17:01:46] <bd808> Coren: So … reboot and hope?
[17:01:53] <bd808> jinx
[17:02:02] <bd808> I can totally do that
[17:02:15] <bd808> and I can nuke them and try again if that doesn't work
[17:04:24] <bd808> "deployment-es0 puppet-agent[1200]: Could not request certificate: Connection timed out"
[17:04:51] * bd808 will blow them up and start over
[17:04:57] <Coren> Wait, that has nothing to do with LDAP; that's the puppet master being out of its gourd (which would explain why you don't have a complete puppet run)
[17:06:38] <bd808> They look jacked up. "Could not set 'directory on ensure: File exists - /var/run/puppet"
[17:07:14] <bd808> puppet agent failed to start on reboot
[17:07:37] <Coren> Well yeah, if it doesn't have a cert then it can't work.
[17:07:41] *** andrewbogott_afk is now known as andrewbogott
[17:07:56] <Coren> bd808: Try just one at first. I want to see why the first run failed.
[17:08:25] <bd808> Ok. I'll start with es0
[17:11:13] <bd808> Coren: Could not parse configuration file: Certificate names must be lower case; see #1168
[17:11:20] <bd808> Coren: Starting puppet agent [fail]
[17:12:05] <bd808> That's initial boot on the "new" es0 (i-0000026f)
[17:19:56] <bd808> Coren: Same final result "deployment-es0 puppet-agent[1194]: Could not request certificate: Connection timed out - connect(2)"
[17:22:33] <andrewbogott> bd808: what project is this?
[17:22:44] <bd808> andrewbogott: deployment-prep
[17:24:06] <bd808> 4 m1.large image creations in a row have died on first boot with logs full of ldap timeouts from nslcd followed by failure to get the cert from the puppet master
[17:24:56] <andrewbogott> bd808: is it just large instances that fail?
[17:25:16] <bd808> andrewbogott: I haven't tried other sizes today
[17:26:49] <bd808> I'm setting up the cirrus cluster. Created 3 m1.large in rapid succession, got an "instance not created" error when trying to create the 4th. Went to console of es0 (first one made) and saw these errors.
[17:27:27] <bd808> The next two instances showed the same error logs. Nuked es0 and created it again
[17:27:31] <bd808> same outcome
[17:28:54] <andrewbogott> bd808: your project was pushed right up against the quota for cores. I don't know if that was the problem, but… I just raised it quite a bit.
[17:51:31] <andrewbogott> bd808: have you ever had a large size instance work?
[17:51:51] <andrewbogott> I just tried, small is working but large is not… trying medium now
[17:52:07] <andrewbogott> Why would that affect network connectivity? I cannot guess.
[17:52:20] <bd808> andrewbogott: That's a good question. I don't know that I've tried to build one before. small and xl have worked in the past
[17:53:53] <bd808> I'm sure Nik wouldn't mind having xlarge instances if that's the case
[18:00:00] <bd808> andrewbogott: Not totally confirmed yet, but it looks like xlarge may be having the same issues
[18:00:25] <andrewbogott> Yeah, I can't make anything but 'small' start up.

--
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.

_______________________________________________
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
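The two symptoms reported above — nslcd timing out against LDAP and the puppet agent's "Could not request certificate: Connection timed out" — both reduce to the instance failing to open outbound TCP connections during first boot. A minimal probe along these lines could be run from an affected instance's console to confirm that; note the hostnames below are placeholders (assumptions), not the actual Labs LDAP or puppetmaster addresses, though 389 and 8140 are the standard LDAP and puppet master ports:

```python
import socket

def check_port(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Placeholder endpoints -- substitute the real LDAP servers and labs
    # puppetmaster for the instance being debugged.
    checks = [
        ("ldap", "ldap.example.invalid", 389),            # nslcd lookups
        ("puppetmaster", "puppet.example.invalid", 8140), # agent cert request
    ]
    for name, host, port in checks:
        status = "ok" if check_port(host, port) else "TIMEOUT/REFUSED"
        print(f"{name:12s} {host}:{port} -> {status}")
```

If the first-boot failures really are plain network reachability problems, both probes should report TIMEOUT/REFUSED from an affected instance while succeeding from a working one, which would help separate a network-layer fault from a missing or broken puppeted config.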