On Oct 27, 2006, at 7:39 AM, Jeff Squyres wrote:
On Oct 25, 2006, at 10:37 AM, Josh Hursey wrote:
The discussion started with the bug characteristics of v1.2 versus
the trunk.
Gotcha.
It seemed from the call that IU was the only institution that can
assess this via MTT, as no one else spoke up. Since people were
interested in seeing what was breaking, I suggested that I start
forwarding the IU internal MTT reports (run nightly and weekly) to
test...@open-mpi.org. This was met by Brian insisting that it would
result in "thousands" of emails to the development list. I clarified
that it is only 3 - 4 messages a day from IU. However, if all other
institutions did this, it would be a lot of email (where 'a lot'
would still be less than 'thousands'). That's how we got to the 'we
need a single summary presented to the group' comment. It should be
noted that we brought up IU sending to the 'test...@open-mpi.org'
list as a band-aid until MTT can do it better.
How about sending them to me and Ethan?
Sure, I can add you both to the list if you like.
This single summary can be email or a webpage that people can
check. Rich said that he would prefer a webpage, and no one else
really had a comment. That got us talking about the current summary
page that MTT generates. Tim M mentioned that it is difficult to
figure out how to get the answers you need from the current website.
I agree; it is hard [usability-wise] for someone to go to the summary
page and answer the question "So what failed from IU last night, and
how does that differ from yesterday -- e.g., what regressed and
progressed yesterday at IU?". The website is flexible enough to do
it, but having a couple of basic summary pages would be nice for
basic users. What that should look like we can discuss further.
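Just to make that concrete, here is a minimal sketch of the kind of
'regressed/progressed' diff I have in mind; the results format (test
name mapped to pass/fail) is entirely made up for illustration and is
not how MTT actually stores things:

# Hypothetical sketch (not MTT code): diff two nights of per-test results
# to answer "what regressed / progressed at IU?".  The input format
# (test name -> "pass"/"fail") is invented purely for illustration.

def diff_nights(yesterday, today):
    """Return (regressed, progressed) lists of test names."""
    regressed = [t for t, r in today.items()
                 if r == "fail" and yesterday.get(t) == "pass"]
    progressed = [t for t, r in today.items()
                  if r == "pass" and yesterday.get(t) == "fail"]
    return regressed, progressed

if __name__ == "__main__":
    yesterday = {"ibm/collective/allreduce": "pass",
                 "intel/p2p/sendrecv": "fail"}
    today = {"ibm/collective/allreduce": "fail",
             "intel/p2p/sendrecv": "pass"}
    reg, prog = diff_nights(yesterday, today)
    print("Regressed: ", reg)
    print("Progressed:", prog)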
Agreed; we aren't super-fond of the current web page, either. Do you
guys want to have a teleconf to go over the current status of MTT,
where you want it to go, etc.? I consider IU's input here quite
important, since you're the ones pushing the boundaries, flexing
MTT's muscles, etc.
In my previous email I suggested a couple of questions that I would
like a webpage to answer. A teleconf might be good to talk about some
of the things that IU is trying to do around MTT.
The IU group really likes the emails that we currently generate: a
plain-text summary of the previous run. I posted copies on the MTT
bug tracker here:
http://svn.open-mpi.org/trac/mtt/ticket/61
Currently we have not put in the work to aggregate the runs, so for
each ini file that we run we get one email to the IU group. This is
fine for the moment, but as we add the rest of the clusters and
dimensions in the testing matrix we will need MTT to aggregate the
results for us and generate such an email.
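For reference, a rough sketch of the kind of roll-up we would like,
in the spirit of the emails attached to ticket 61; the per-run data
structure here is invented purely for illustration and is not MTT
code:

# Hypothetical sketch (not MTT code): roll several per-.ini-file result
# sets up into one plain-text summary.  The per-run data structure
# (phase -> pass/fail counts) is invented for illustration.
from collections import defaultdict

def aggregate(runs):
    """Sum pass/fail counts per phase across all per-ini-file runs."""
    totals = defaultdict(lambda: {"pass": 0, "fail": 0})
    for run in runs:                        # one entry per .ini file
        for phase, counts in run.items():   # e.g. "MPI Install", "Test Run"
            totals[phase]["pass"] += counts.get("pass", 0)
            totals[phase]["fail"] += counts.get("fail", 0)
    return totals

def format_summary(totals):
    lines = ["Phase          Pass   Fail"]
    for phase in sorted(totals):
        c = totals[phase]
        lines.append("%-14s %4d   %4d" % (phase, c["pass"], c["fail"]))
    return "\n".join(lines)

if __name__ == "__main__":
    runs = [{"MPI Install": {"pass": 2, "fail": 0},
             "Test Run": {"pass": 180, "fail": 4}},
            {"MPI Install": {"pass": 2, "fail": 1},
             "Test Run": {"pass": 95, "fail": 0}}]
    print(format_summary(aggregate(runs)))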
Ok.
We created another ticket yesterday to make a new MTT Reporter (one
of our internal plugin types) that duplicates this output format. It actually
shouldn't be that hard -- we don't have to do parsing to get the
numbers that you're reporting; we have access to the actual data. So
it's mostly caching the data, calculating the totals that you're
calculating, and printing in your output format.
Ethan has some other short tasks to do before he gets to this, but
it's near the top of the priority list. You can see the current
workflow on the wiki (this is a living document; it keeps changing as
requirements, etc. change):
http://svn.open-mpi.org/trac/mtt/wiki/TaskPlan
Awesome, thanks! :)
So I think the general feel of the discussion is that we need the
following from MTT:
- A 'basic' summary page providing answers to some frequently asked
queries. The current interface is too advanced for the current users.
We have the summary.php page, but I personally have never found it
too useful. :-)
We're getting towards a full revamp of reporter.php (got some other
tasks to complete first, but we're definitely starting to think about
it) -- got any ideas / input? Our "haven't thought about it much
yet" idea is to be more menu/Q-A driven with a few common queries
easily available (rather than a huge, complicated single screen).
See previous email for some general ideas. Tim M might have a few
more that he would like to see since he is the one at IU that is
watching the nightly results the closest.
- A summary email [in plain text, preferably] similar to the one
that IU generates, showing an aggregation of the previous night's
results for (a) all reporters and (b) my institution [so I can track
them down and file bugs].
For the moment, we don't have the dynamic capability for you to login
to the web page, create a report, and say "mail this to me nightly".
However, Ethan can make up custom reports on the server quite easily
-- if you want some IU-specific reports, just file a ticket and
Ethan
can Make It So.
Cool. We'll talk it over and see what we would like.
- 1 email a day on the previous night's testing results.
That's what we intended for the mails that are coming today, but it
seemed not to be sufficient -- we ended up with 4 nightly mails, one
for each relevant phase's failures and a fourth showing the stderr of
MPI installs.
Some relevant bugs currently in existence:
http://svn.open-mpi.org/trac/mtt/ticket/92
http://svn.open-mpi.org/trac/mtt/ticket/61
http://svn.open-mpi.org/trac/mtt/ticket/94
The other concern is that, given the frequency of testing, someone
needs to make sure the bug tracker is updated as bugs appear from the
testing. I think the group is unclear about how this is done.
Meaning: when MTT identifies a test as failed, who is responsible
for putting the bug in the bug tracker?
At the moment, I've been manually examining the mails every day and
firing off e-mails to those responsible. However, due to travel last
week and this week, I've gotten quite behind. :-(
I wonder if there is a way to do something more automated. Probably
too advanced for MTT 2.0 or 3.0, but something to think about. Maybe
tie it in with the bug tracker, and send a "Bug Master Engineer" an
aggregated list of failures that can be easily put into Trac. Dunno...
just an idea to help take the burden off of you.
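A minimal sketch of what I mean, assuming the failures have already
been pulled out of the nightly results; the failure record layout and
the Trac-flavored output are just a guess at what such a 'Bug Master'
digest could look like:

# Hypothetical sketch: turn a list of failures into a digest that a
# "Bug Master Engineer" could paste into Trac tickets.  The failure
# record layout and output style are invented for illustration.

def trac_digest(failures):
    """Group failures by (cluster, branch) and emit a wiki-ish digest."""
    grouped = {}
    for f in failures:
        grouped.setdefault((f["cluster"], f["branch"]), []).append(f)
    out = []
    for (cluster, branch), items in sorted(grouped.items()):
        out.append("== %s / %s ==" % (cluster, branch))
        for f in items:
            out.append(" * %s: %s" % (f["test"], f["error"]))
    return "\n".join(out)

if __name__ == "__main__":
    failures = [{"cluster": "odin", "branch": "trunk",
                 "test": "intel/p2p/sendrecv", "error": "timeout after 600s"},
                {"cluster": "bigred", "branch": "v1.2",
                 "test": "ibm/collective/bcast", "error": "MPI_ERR_TRUNCATE"}]
    print(trac_digest(failures))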
The obvious solution is that the institution that identified the bug
files it. [Warning: my opinion] But that becomes unwieldy for IU
since we have a large testing matrix, and we would need to commit
someone to doing this every day (and it may take all day to properly
track a set of bugs). This also kind of punishes an institution for
testing more, instead of providing an incentive to test.
True. I don't know the proper answer to this, either -- I know the
"Jeff look at e-mail" solution doesn't scale well.
------ Page Break -- Context switch ------
In case you all want to know what we are doing here at IU, I have
attached our planned MTT testing matrix to this email. Currently we
have BigRed and Odin running the complete matrix less the BLACS
tests. Wotan and Thor will come online as we get more resources to
support them.
To cover such a complex testing matrix we use various .ini files,
and since some of the dimensions in the matrix are large, we break
some of the tests across a couple of .ini files that are submitted
concurrently so they run in a reasonable time.
<MTT-testing-matrix.txt>
Awesome.
I would like to schedule some phone time with you guys and Ethan and
me to talk about what's working, what's not working, etc. One
obvious question I have is: is the INI config file format suitable?
Do we need to do something more complex that would allow
consolidation of your various configurations? ...etc.
Tim M and I spent the better part of two days revamping our current
setup to do some more 'advanced' things (parallel builds, etc.). We
are putting all of these scripts in ompi-tests/iu/mtt in case anyone
wants to see how we are doing it and use that as an example for doing
something similar.
Basically our problems are:
- Testing results come in at various times as they complete; we
would really like a 'status report' at 8 am every day, finished or not.
- The combinatorial nature of MTT lends itself to some obvious
parallelism. Can we harness that to reduce the time to complete the
testing cycle?
- We will soon have 4 clusters [wotan, bigred, odin, thor], each
running 3 branches [trunk, v1.2, v1.1] and 2 different builds [64-bit
gcc, 32-bit gcc] every night! That's 24 sets of nightly tests, and
we have biweekly tests in there as well :o. That means a lot of ini
files that basically say the same thing.
What we are trying to do:
- Generalize the INI files with default sets that can be plugged in.
- Make the scripts more general so they can be used easily across
all clusters
- Reduce the number of emails from the nightly runs to at most 2
per cluster [Progress and Final] -- we are not using SLURM, LL, and
a hostlist in our runs.
- Increase the parallelism per stage as much as possible, in as
general a way as possible.
- 8 am (or 10 am) status report from our script to check on the run
as it goes.
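As one possible shape for the 'default sets' idea, even just
expanding a single template across the cluster x branch x build
matrix would remove most of the duplicated .ini text. The sketch
below is our own convention, not an MTT feature; the real MTT
sections would be filled in from shared defaults:

# Hypothetical sketch (our own convention, not an MTT feature): expand one
# template across the testing matrix (4 clusters x 3 branches x 2 builds =
# 24 nightly configurations) instead of hand-maintaining 24 .ini files.
import itertools

CLUSTERS = ["wotan", "bigred", "odin", "thor"]
BRANCHES = ["trunk", "v1.2", "v1.1"]
BUILDS = ["gcc-64bit", "gcc-32bit"]

TEMPLATE = """\
# Generated nightly config for {cluster} / {branch} / {build}
# (the real MTT sections -- MPI get, MPI install, Test run, Reporter --
#  would be filled in here from shared default snippets)
"""

def generate():
    for cluster, branch, build in itertools.product(CLUSTERS, BRANCHES, BUILDS):
        fname = "nightly-%s-%s-%s.ini" % (cluster, branch, build)
        with open(fname, "w") as f:
            f.write(TEMPLATE.format(cluster=cluster, branch=branch, build=build))
        yield fname

if __name__ == "__main__":
    print("Wrote %d files" % len(list(generate())))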
We already have a list of refinements that we would like to add to
this new script setup, but those are a bit more advanced (e.g., using
a manager/worker model to use allocations as they become available,
using a queue to order the tests by importance, etc.).
One thing that would be nice for MTT to do, though it would initially
be institution-specific, is a custom aggregation trigger on the MTT
server. The problem is that we currently get 2 emails from each
cluster every night (not counting the weekly runs), so that will be
8 emails a day, which can be a bit hard to parse. If we put the
aggregation code close to the server (or just had a way for us to
query the DB from the IU side via ODBC) then we could have the
aggregation function generate 2 emails that include results from all
clusters -- 2 giant emails instead of 8 smaller emails.
Just an idea, but if you gave me the information I need to send
queries to the MTT database, I could mock something up and we could
all experiment with it to see if we can generalize it a bit.
Obviously the 'guest' DB account that this aggregation function uses
would only have read access, since we don't want it modifying the DB.
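To show roughly what I have in mind for that read-only query, here is
a sketch using pyodbc as a stand-in for whatever ODBC layer we end up
with; the DSN, table, and column names are all invented, since I do
not know the real MTT schema:

# Hypothetical sketch of the read-only aggregation query: connect through a
# 'guest' ODBC DSN and roll up last night's IU results across all clusters.
# The DSN name, table name, and column names are all invented -- the real
# MTT schema would have to be substituted in.
import pyodbc

QUERY = """
SELECT cluster, phase,
       SUM(CASE WHEN result = 'pass' THEN 1 ELSE 0 END) AS passed,
       SUM(CASE WHEN result = 'fail' THEN 1 ELSE 0 END) AS failed
FROM   results                        -- hypothetical table name
WHERE  submitter = 'IU'
  AND  start_time >= CURRENT_DATE - 1
GROUP  BY cluster, phase
ORDER  BY cluster, phase
"""

def nightly_rollup(dsn="MTT_GUEST"):      # read-only guest DSN (assumed)
    conn = pyodbc.connect("DSN=%s" % dsn)
    try:
        rows = conn.cursor().execute(QUERY).fetchall()
    finally:
        conn.close()
    lines = ["%-8s %-12s %6s %6s" % ("Cluster", "Phase", "Pass", "Fail")]
    for cluster, phase, passed, failed in rows:
        lines.append("%-8s %-12s %6d %6d" % (cluster, phase, passed, failed))
    return "\n".join(lines)

if __name__ == "__main__":
    print(nightly_rollup())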
So, generally, yes: I think we would like to have a teleconf to talk
about our experiences with MTT and what we have done around it to
fit our needs. We realize that we are pushing it a bit further than
others, so we are fine with a home-brewed solution for a while until
MTT is able to replicate the functionality.
Thanks!
Josh
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
----
Josh Hursey
jjhur...@open-mpi.org
http://www.open-mpi.org/