Re: Effect of lease tweaks

2017-11-02 Thread Mark Shuttleworth

Just to say the analytical approach is exactly right, thanks for setting
the pace.

Mark


Re: Effect of lease tweaks

2017-11-01 Thread Andrew Wilkins
On Wed, Nov 1, 2017 at 10:43 PM John A Meinel wrote:

> So I wanted to know whether Andrew's changes in 2.3 are going to have a
> noticeable effect at scale on Leadership. So I set up a test with HA
> controllers running 10 machines, each with 3 containers, and then
> distributed ~500 applications, each with 3 units, across everything.
> I started at commit 2e50e5cf4c3, which is just before Andrew's Lease patch
> landed.
>
> juju bootstrap aws/eu-west-2 --bootstrap-constraints instance-type=m4.xlarge --config vpc-id=
> juju enable-ha -n3
> # Wait for things to stabilize
> juju deploy -B cs:~jameinel/ubuntu-lite -n10 --constraints instance-type=m4.xlarge
> # wait
>
> # set up the containers
> for i in `seq 0 9`; do
>   juju deploy -n3 -B cs:~jameinel/ubuntu-leader ul${i} --to lxd:${i},lxd:${i},lxd:${i}
> done
>
> # Scale up. I did this in batches of a few at a time, but slowly grew all
> # the way up.
> for j in `seq 1 49`; do
>   echo $j
>   for i in `seq 0 9`; do
>     juju deploy -B -n3 cs:~jameinel/ubuntu-leader ul${i}${j} --to ${i}/lxd/0,${i}/lxd/1,${i}/lxd/2 &
>   done
>   time wait
> done
>
> I let it go for a while until "juju status" was happy that everything was
> up and running. Note that this was 1500 units, 500 applications in a single
> model.
> "time juju status" took around 4-10s.
>
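
[A rough way to script the "wait until status is happy" step -- a sketch that
assumes jq is installed and uses the workload-status field from
"juju status --format=json":

    # Poll until every unit reports workload status "active".
    while juju status --format=json \
        | jq -e '[.applications[].units[]?."workload-status".current] | any(. != "active")' \
        > /dev/null; do
      sleep 30
    done
]
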
> I was running 'mongotop' and watching 'top' while it was running.
>
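
[For reference, mongotop can be pointed at the controller's MongoDB roughly
like this -- a sketch assuming the usual Juju controller setup (mongod on port
37017 with TLS, credentials from the machine agent's agent.conf); <user> and
<password> are placeholders:

    # Sample the controller's mongod every 5 seconds.
    mongotop --host 127.0.0.1 --port 37017 \
        --ssl --sslAllowInvalidCertificates \
        --authenticationDatabase admin \
        -u '<user>' -p '<password>' \
        5
]
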
> I then upgraded to the latest juju dev (c49dd0d88a).
> Now, the controller immediately started thrashing, with bad lease
> documents in the database, and eventually got to the point that it ran out
> of open file descriptors. Theoretically upgrading 2.2 => 2.3 won't have the
> same problem because the actual upgrade step should run.
>

The upgrade steps do run for upgrades of 2.3-beta2.M -> 2.3-beta2.N,
because the build number changes. I've tested that.

It appears that the errors are preventing Juju from getting far enough
along to run the upgrade steps. I'll continue looking today.

Thanks very much for the load testing; that's very encouraging.


> However, once I did "db.leases.remove({})" things started to recover. I
> still had to restart mongo and jujud to clear out the open file handles,
> but it did eventually recover.
>
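
[A sketch of that cleanup as a one-liner from the controller machine, again
assuming the usual Juju mongod on port 37017 with TLS and credentials taken
from the machine agent's agent.conf; <user> and <password> are placeholders:

    # Drop all lease documents (this is what recovered the controller here).
    mongo --ssl --sslAllowInvalidCertificates \
        --authenticationDatabase admin -u '<user>' -p '<password>' \
        localhost:37017/juju --eval 'db.leases.remove({})'
]
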
> At this point, I waited again for everything to look happy, and watched
> mongotop and top again.
>
> These aren't super careful results; ideally I would run things for an
> hour each and check the load over that whole time. Really I should have
> set up prometheus monitoring. But as a quick check, these are the top
> values for mongotop before:
>
>   ns                         total   read  write
>   local.oplog.rs             181ms  181ms    0ms
>   juju.txns                  120ms   10ms  110ms
>   juju.leases                 80ms   34ms   46ms
>   juju.txns.log               24ms    4ms   19ms
>
>   ns                         total   read  write
>   local.oplog.rs             208ms  208ms    0ms
>   juju.txns                  140ms   12ms  128ms
>   juju.leases                 98ms   42ms   56ms
>   juju.charms                 43ms   43ms    0ms
>
>   ns                         total   read  write
>   local.oplog.rs             220ms  220ms    0ms
>   juju.txns                  161ms   14ms  146ms
>   juju.leases                115ms   52ms   63ms
>   presence.presence.beings    69ms   68ms    0ms
>
>   ns                         total   read  write
>   local.oplog.rs             213ms  213ms    0ms
>   juju.txns                  164ms   15ms  149ms
>   juju.leases                 82ms   35ms   47ms
>   presence.presence.beings    79ms   78ms    0ms
>
>   ns                         total   read  write
>   local.oplog.rs             221ms  221ms    0ms
>   juju.txns                  168ms   13ms  154ms
>   juju.leases                 95ms   40ms   55ms
>   juju.statuses               33ms   16ms   17ms
>
> totals:
> 1043 local.oplog.rs
>  868 juju.txns
>  470 juju.leases
>
> and after
>
>   ns                         total   read  write
>   local.oplog.rs              95ms   95ms    0ms
>   juju.txns                   68ms    6ms   61ms
>   juju.leases                 33ms   13ms   19ms
>   juju.txns.log               13ms    3ms   10ms
>
>   ns                         total   read  write
>   local.oplog.rs             200ms  200ms    0ms
>   juju.txns                  160ms   10ms  150ms
>   juju.leases                 78ms   35ms   42ms
>   juju.txns.log               29ms    4ms   24ms
>
>   ns                         total   read  write
>   local.oplog.rs             151ms  151ms    0ms
>   juju.txns                  103ms    6ms   97ms
>   juju.leases                 45ms   20ms   25ms
>   juju.txns.log               21ms    6ms   15ms
>
>   ns                         total   read  write
>   local.oplog.rs             138ms  138ms    0ms
>

Re: Effect of lease tweaks

2017-11-01 Thread John Meinel
(sent too soon)

Summary:
before:
1043 local.oplog.rs
 868 juju.txns
 470 juju.leases

after:
 802 local.oplog.rs
 625 juju.txns
 267 juju.leases

So there seems to be a fairly noticeable decrease in load on the system
around leases: juju.leases dropped from 470 to 267 in these samples (a bit
over 40% less), and juju.txns from 868 to 625 (roughly 28% less). Again, not
super scientific, because I didn't measure over enough time or account for
variation, all that kind of stuff. But at least at a glance it looks pretty
good.
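
[The per-collection totals above look like the "total" column summed across
the mongotop samples; that kind of tallying can be scripted rather than added
up by hand. A rough sketch, assuming the mongotop output was saved to a file
named mongotop-before.txt (a hypothetical name):

    # Sum the 'total' column per namespace across all mongotop samples.
    awk '$2 ~ /ms$/ { gsub(/ms/, "", $2); sum[$1] += $2 }
         END { for (ns in sum) printf "%6dms  %s\n", sum[ns], ns }' \
        mongotop-before.txt | sort -rn
]
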
As far as load around the global clock, mongotop showed:
juju.globalclock   5ms   3ms   1ms
juju.globalclock  10ms   8ms   1ms
etc.

So it's noticeable, but not an issue in itself.

Hopefully we'll see similar improvements in live systems. The main thing is
to make sure the upgrade from 2.2 to 2.3 is smooth, since the lease issue
caused a pretty major crash of the system.

John
=:->
