...

> 
>> The log size didn't come up again in this email. Not sure if you
>> meant separately or just got lost in the message length.

I didn't explicitly enumerate it, but it follows from section (4).
Bringing up 1000 units triggers 1000*1000 (1M) calls each to
Service.Life and CharmURL. Each request+response pair is 2 log lines,
so adding 1000 units generates at least 4M lines of log.

Each line averages ~189 bytes. So 1000 units => 2*2*1M*189 bytes ≈ 756MB.
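
Back-of-envelope, assuming the ~189-byte average holds across all of
those lines:

    1000 units * 1000 peer wakeups        = 1M watcher events
    1M events * 2 calls (Life, CharmURL)  = 2M request/response pairs
    2M pairs * 2 log lines                = 4M log lines
    4M lines * 189 bytes/line             ≈ 756MB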

> Once all the agents are up and running, they actually are very
> quiet (almost 0 log statements).
> 
> 
> 4) If I bring up the units one by one (for i in `seq 500`; do for j
> in `seq 10`; do juju add-unit --to $j & done; time wait; done), it
> ends up triggering O(N^2) behavior in the system. Each unit agent
> seems to have a watcher for other units of the same service. So when
> you add 1 unit, it wakes up all existing units to let them know
> about it. In theory this is on a 5s rate limit (only 1 wakeup per 5
> seconds). In
...

> 
> 5) Along with load, we weren't caching the IP address of the API 
> machine, which caused us to read the provider-state file from
> object storage and then ask EC2 for the IP address of that
> machine. Log of 1 unit agent's connection:
> http://paste.ubuntu.com/6329661/
> 
> 
>> Just to be clear for other readers (wasn't clear to me without
>> checking the src): this isn't the agent resolving the api server
>> address from provider-state which would mean provider credentials
>> available to each agent, but each agent periodically requesting
>> via the api the address of the api servers. So the cache here is
>> on the api server.

The cache does need to be either in the DB or on the API server. The
trigger is that running a hook includes the API Addresses in the hook
context. So every hook triggers a call to API Addresses (not sure if
hooks fired in sequence cache the state between calls).

And that triggers the API server to make a request from EC2.
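
A trivial time-based cache on the API server side would probably be
enough. Rough sketch only (the names here are made up, this isn't the
actual jujud code):

package cache

import (
    "sync"
    "time"
)

// addrCache caches the API server addresses so that every hook
// context asking for them doesn't turn into a fresh EC2 query.
type addrCache struct {
    mu      sync.Mutex
    addrs   []string
    fetched time.Time
    ttl     time.Duration
    fetch   func() ([]string, error) // asks the provider (EC2)
}

// Addresses returns the cached addresses, only going back to the
// provider when the cached value is older than ttl.
func (c *addrCache) Addresses() ([]string, error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.addrs != nil && time.Since(c.fetched) < c.ttl {
        return c.addrs, nil
    }
    addrs, err := c.fetch()
    if err != nil {
        // If EC2 is rate limiting us, serving stale addresses is
        // better than failing every hook.
        if c.addrs != nil {
            return c.addrs, nil
        }
        return nil, err
    }
    c.addrs = addrs
    c.fetched = time.Now()
    return addrs, nil
}

Even a ttl of a minute or two would collapse the per-hook EC2 calls
down to almost nothing.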

Dave Cheney has a bug where hooks that trigger lots of
relation-changed events end up DOSing your EC2 account: EC2 starts
rate limiting the account, and then you are unable to use your EC2
creds to kill the service.
...

> 
> 
> 6) If you restart jujud (say after an upgrade) it causes all unit 
> agents to restart the 41 requests for startup. This seems to be
> rate limited by the jujud process (up to 600% CPU) and a little bit
> Mongo (almost 100% CPU).
> 
> It seems to take a while but with enough horsepower and GOMAXPROCS 
> enabled it does seem to recover (IIRC it took about 20minutes).
> 
> 
>> It might be worth exploring how we do upgrades to keep the client
>> socket open (ala nginx) to avoid the extra thundering herd on
>> restart, ie serialize extant watch state and exec with open fds.
>> Upgrade is effectively already triggering a thundering herd with
>> the agents as they restart individually, and then the api server
>> restart does a restart for another herd.
> 
>> There's also an extant bug  that restart of juju agents causes 
>> unconditional config-changed hook execution even if there is no
>> delta on config to the unit.

We've had a few discussions around upgrade. One option is to bring up
all units in an "upgrade-pending" mode, which is still a slow-starting
new herd, but would at least prevent the double-thunder.
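
On the keep-the-socket-open idea: the mechanics of handing a
listening socket to a new process are pretty simple in Go. Very rough
sketch (not what jujud does today, it ignores serializing the watch
state, and the port/env var are just for illustration):

package main

import (
    "fmt"
    "net"
    "os"
    "os/exec"
)

func main() {
    if os.Getenv("INHERIT_LISTENER") != "" {
        // New server: rebuild the listener from fd 3, which the old
        // server passed down via ExtraFiles. Clients never see the
        // socket close, so there is no reconnect herd.
        ln, err := net.FileListener(os.NewFile(3, "listener"))
        if err != nil {
            panic(err)
        }
        fmt.Println("new server accepting on", ln.Addr())
        // ... serve API connections on ln ...
        return
    }

    // Old server: listen as usual.
    ln, err := net.Listen("tcp", ":17070")
    if err != nil {
        panic(err)
    }
    f, err := ln.(*net.TCPListener).File() // dup of the listening fd
    if err != nil {
        panic(err)
    }

    // On upgrade: start the new binary with the duplicated fd and
    // exit once it is up, instead of closing the socket.
    cmd := exec.Command(os.Args[0])
    cmd.Env = append(os.Environ(), "INHERIT_LISTENER=1")
    cmd.ExtraFiles = []*os.File{f} // shows up as fd 3 in the child
    cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
    if err := cmd.Start(); err != nil {
        panic(err)
    }
}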

...

John
=:->
