** Description changed:

  We are gathering more data, but it feels like we have enough to start a
  bug report.
  
  On Precise, on a 16 core/32 thread EC2 machine (instance type
  cc2.8xlarge), when starting approximately eight or more Lucid containers
- simultaneously, after the first seven or so connect to dnsmasq within
- just a few seconds, the next group takes just under three minutes to
- connect.  We will provide short steps to duplicate this.
+ simultaneously: after the first seven or so connect to dnsmasq within
+ just a few seconds, the next group waits just under three minutes before
+ they start to connect.
  
- We also have past experience indicating that more grouping occurs,
- because simultaneously starting 32 containers occasionally (about once
- in 15 attempts) will result in one container that still has not
- connected to dnsmasq after ten minutes.  We have not tried to duplicate
- this, because we hope that tackling the smaller issue will also address
- the larger one.
+ This affects our ability to ssh into the containers and do work.
  
- I have not yet had enough disk space on this machine to duplicate this
- without lxc-start-ephemeral.  I intend to do so.
- 
- I have tried with a precise container.  *Precise containers do not
- exhibit this bug* in the experiment I performed, at least up to 16
- containers.  I plan to duplicate this at least a couple more times to
- verify, but I'm comfortable saying initially that I believe this is
- specific to Lucid containers.  I'll report back.
+ We will provide short steps to duplicate this.  Please note also the
+ extra data we include at the end of this report, showing experiments we
+ have run to rule out certain causes and scenarios.
  
  To duplicate (again, I am doing this on a specific EC2 machine with a
  lot of cores, but I believe you can see something similar to this even
  on a 4 core/8 hyperthread machine):
  
  sudo apt-get install moreutils
  sudo lxc-create -t ubuntu -n lucid -- -r lucid -a i386 -b ubuntu
  parallel -j 16 bash -c "lxc-start-ephemeral -d -o lucid" -- 1 2 3 4 5 6 7 8 9 
10 11 12 13 14 15 16
  
  Now look at /var/log/syslog at the most recent dnsmasq entries.  You
- will see about seven containers get DHCPACK from dnsmasq within about 20
- seconds; then you will see a pause of about three minutes; then you will
- see more containers report to dnsmasq.  Here's an excerpt (the
- approximate three minute jump happens between lines 5 and 6):
+ will see that about seven containers get DHCPACK from dnsmasq within
+ about 20 seconds; then you will see a pause of about three minutes; then
+ you will see more containers report to dnsmasq.  Here's an excerpt (the
+ approximate three minute jump happens between lines 5 and 6, and then I
+ include all lines up to the next ACK):
  
  ...
  Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 
10.0.3.161 00:16:3e:38:9f:e2 lptests-temp-5uKcEua
- Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPREQUEST(lxcbr0) 
10.0.3.37 00:16:3e:9c:ae:16 
+ Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPREQUEST(lxcbr0) 
10.0.3.37 00:16:3e:9c:ae:16
  Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 10.0.3.37 
00:16:3e:9c:ae:16 lptests-temp-1HxiwbU
- Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPREQUEST(lxcbr0) 
10.0.3.62 00:16:3e:69:83:2b 
+ Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPREQUEST(lxcbr0) 
10.0.3.62 00:16:3e:69:83:2b
  Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 10.0.3.62 
00:16:3e:69:83:2b lptests-temp-6o30rvH
- Jun 18 20:43:05 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:0d:7b:65 
- Jun 18 20:43:05 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.203 00:16:3e:0d:7b:65 
- Jun 18 20:43:08 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:32:5b:99 
- Jun 18 20:43:08 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.68 00:16:3e:32:5b:99 
- Jun 18 20:43:11 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:7c:21:38 
- Jun 18 20:43:11 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.35 00:16:3e:7c:21:38 
- Jun 18 20:43:14 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:72:0c:64 
- Jun 18 20:43:14 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.208 00:16:3e:72:0c:64 
- Jun 18 20:43:17 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:ed:b3:a0 
- Jun 18 20:43:17 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.82 00:16:3e:ed:b3:a0 
- Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:e3:8f:1a 
- Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.247 00:16:3e:e3:8f:1a 
- Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:5c:99:20 
- Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.225 00:16:3e:5c:99:20 
- Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:ed:b3:a0 
- Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.82 00:16:3e:ed:b3:a0 
- Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPREQUEST(lxcbr0) 
10.0.3.203 00:16:3e:0d:7b:65 
+ Jun 18 20:43:05 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:0d:7b:65
+ Jun 18 20:43:05 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.203 00:16:3e:0d:7b:65
+ Jun 18 20:43:08 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:32:5b:99
+ Jun 18 20:43:08 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.68 00:16:3e:32:5b:99
+ Jun 18 20:43:11 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:7c:21:38
+ Jun 18 20:43:11 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.35 00:16:3e:7c:21:38
+ Jun 18 20:43:14 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:72:0c:64
+ Jun 18 20:43:14 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.208 00:16:3e:72:0c:64
+ Jun 18 20:43:17 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:ed:b3:a0
+ Jun 18 20:43:17 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.82 00:16:3e:ed:b3:a0
+ Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:e3:8f:1a
+ Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.247 00:16:3e:e3:8f:1a
+ Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:5c:99:20
+ Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.225 00:16:3e:5c:99:20
+ Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 
10.0.3.123 00:16:3e:ed:b3:a0
+ Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 
10.0.3.82 00:16:3e:ed:b3:a0
+ Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPREQUEST(lxcbr0) 
10.0.3.203 00:16:3e:0d:7b:65
  Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 
10.0.3.203 00:16:3e:0d:7b:65 lptests-temp-C5po26y
  ...
  
- What I expect to see is all of the containers reporting without a three
- minute break, so that all containers are ready for work much sooner.
+ What I expect to see is all of the containers reporting approximately
+ together, without a three minute break, so that all containers are ready
+ for work much sooner.
  
  I looked in the syslogs of the started containers.  The first one to
- connect to dnsmasq *after* the three minute pause showed a three minute
- pause in its own syslog.  This is an excerpt from the syslog of that
- machine (notice the jump between 20:40 to 20:43).  The "udevtrigger
- post-stop process" lines and following dhclient startup lines, down
- through the "plymouth-splash" line, are not found in the two fast
- machine syslogs that I looked at.
+ connect to dnsmasq *after* the three minute pause also showed a three
+ minute pause in its own syslog.  The following is an excerpt from the
+ syslog of that machine (notice the jump between 20:40 to 20:43).  The
+ "udevtrigger post-stop process" lines and following dhclient startup
+ lines, down through the "plymouth-splash" line, are not found in the two
+ fast machine syslogs that I looked at.
  
  Jun 18 20:40:10 lptests-temp-C5po26y ntpd[542]: frequency initialized 0.000 
PPM from /var/lib/ntp/ntp.drift
  Jun 18 20:40:12 lptests-temp-C5po26y kernel: 6n[8su4vP >25vs3e6e3 rr[2zIr
  Jun 18 20:40:12 lptests-temp-C5po26y kernel: <7[9]zPrt26t6pl3 w
  Jun 18 20:40:13 lptests-temp-C5po26y kernel: 
03rntt8hot<7[91h:6st42nre72.]Oorr67eq6 44vqvs
  Jun 18 20:43:02 lptests-temp-C5po26y init: udevtrigger post-stop process (65) 
terminated with status 1
  Jun 18 20:43:02 lptests-temp-C5po26y dhclient: Internet Systems Consortium 
DHCP Client V3.1.3
  Jun 18 20:43:02 lptests-temp-C5po26y dhclient: Copyright 2004-2009 Internet 
Systems Consortium.
  Jun 18 20:43:02 lptests-temp-C5po26y dhclient: All rights reserved.
  Jun 18 20:43:02 lptests-temp-C5po26y dhclient: For info, please visit 
https://www.isc.org/software/dhcp/
- Jun 18 20:43:02 lptests-temp-C5po26y dhclient: 
+ Jun 18 20:43:02 lptests-temp-C5po26y dhclient:
  Jun 18 20:43:02 lptests-temp-C5po26y init: plymouth-splash main process (639) 
terminated with status 1
  Jun 18 20:43:02 lptests-temp-C5po26y dhclient: Listening on 
LPF/eth0/00:16:3e:0d:7b:65
  Jun 18 20:43:02 lptests-temp-C5po26y dhclient: Sending on 
LPF/eth0/00:16:3e:0d:7b:65
  Jun 18 20:43:02 lptests-temp-C5po26y dhclient: Sending on Socket/fallback
  Jun 18 20:43:02 lptests-temp-C5po26y dhclient: DHCPDISCOVER on eth0 to 
255.255.255.255 port 67 interval 5
  Jun 18 20:43:05 lptests-temp-C5po26y dhclient: DHCPOFFER of 10.0.3.203 from 
10.0.3.1
  Jun 18 20:43:05 lptests-temp-C5po26y dhclient: DHCPREQUEST of 10.0.3.203 on 
eth0 to 255.255.255.255 port 67
+ 
+ Here's a bit more data to close with.
+ 
+ We have experience indicating that more waiting occurs when you add more
+ containers to be run, because simultaneously starting 32 containers will
+ occasionally (about once in 15 attempts) result in one container that
+ still has not connected to dnsmasq after ten minutes.  We have not tried
+ to duplicate this, because we hope that tackling the smaller issue
+ described here will also address the larger one.
+ 
+ I have tried with a precise container.  *Precise containers do not
+ exhibit this bug* in the experiment I performed, at least up to 16
+ containers.  I plan to duplicate this at least a couple more times to
+ verify, but I'm comfortable saying initially that I believe this is
+ specific to Lucid containers.  I'll report back.
+ 
+ I have not yet had enough disk space on this machine to duplicate this
+ without lxc-start-ephemeral.  I intend to do so.  Because the Precise
+ container did not exhibit the problem for me even with ephemeral
+ containers, I suspected that the unusual ephemeral characteristics, such
+ as the use of overlayfs, might not be involved.
  
  Finally, according to the stats we gathered while doing this, as far as
  we could tell, the following are not causes of contention, even when
  starting 24 simultaneous containers on a 16 core hyperthreaded machine:
  
  * cpu (never got higher than about 80% utilization, load average still had 
some headroom)
  * RAM (never got even half utilization of RAM)
  * ioutil% never got above 1 or 2%
  * dnsmasq itself does not seem to be an issue, looking at syslog, since we do 
not see restarts there
  
  We plan to do a bit more analysis, as mentioned above, but we think this
  data might be good enough to ask for your input at this point.  Thank
  you.

-- 
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to lxc in Ubuntu.
https://bugs.launchpad.net/bugs/1014916

Title:
  simultaneously started lucid containers pause while starting after the
  first seven

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1014916/+subscriptions

-- 
Ubuntu-server-bugs mailing list
Ubuntu-server-bugs@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/ubuntu-server-bugs

Reply via email to