We have enough tooling in place to see whether these changes affect performance. What we will need is a clear demarcation of when the change is made on each OS. Ideally the change would go out across all OSes at the same (or nearly the same) time. If we can at least get the change rolled out across all OSes within the same few hours (or at worst the same day), that will greatly help us determine whether this change has any impact on performance testing.

Thanks for the heads-up.

Clint
On 10/17/2013 09:00 AM, John O'Duinn wrote:
tl;dr: We recently installed system monitoring software on our buildbot
masters, build-not-test slaves, and various other RelEng machines. IT
want to continue this rollout, deploying monitoring software onto RelEng
production test machines, which raises a concern about possible impact
to performance numbers. If you see any production impact, please let us
know.

======

We are being asked by IT to deploy monitoring tools onto all build,
unittest and performance testing machines. These are to help gather
system level statistics about CPU, memory, disk utilization, etc. This
is so IT can monitor efficiency of production jobs run on these systems.

This monitoring software has already been installed on buildbot masters,
linux+mac builders, and some misc other servers. As those changes were
zero-risk to production, we didn't need to forewarn these newsgroups.
However, installing this software on production win32/64 builders and
win/mac/linux performance testers has a small-but-non-zero risk that the
act of running these tools will change the timing results in performance
test jobs. Hence this advance notice.

Exact timing of this rollout is waiting on some unrelated win64
toolchain builder fixes to finish being deployed into production. We all
agreed that adding these monitoring tools *at the same time* as doing the
Windows toolchain upgrade would unnecessarily complicate problem detection.

Once everything is ready for final deploy, another post will be sent to
newsgroup (and sheriffs), to help with any possible after-the-fact
regression range hunting. If there is any performance result wobble
because of these changes, I've been told we can tolerate minor
performance result disruption for a week or so, without impacting
releases. Currently, this experiment is slated to run for 2 weeks, but
obviously, if this monitoring introduces larger disruption, we will
disable it asap. Sheriffs and RelEng buildduty will be monitoring
closely, but as always, if you see anything weird, please make sure they
know asap.

No downtime is required, as our systems will pick up these changes
between test runs as machines reboot.

The curious can follow along in bug#920626 (deploy collectd to RelEng
mac+linux test systems) and bug#920629 (deploy graphite client to RelEng
Windows build and test systems).

If you have any questions or concerns, please let me know.
John.

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
