Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the 
following link:
https://bugzilla.lustre.org/show_bug.cgi?id=10969



(In reply to comment #67)

> The brw_stats looks enough for the time being. But please keep in mind that
> users do NOT have access to the servers and the brw_stats info stored on the
> OSTs will NOT be available to the apps perf tool directly. 

Yes, I'm just trying to get a handle on whether we're collecting the right data
in the first place.  Collecting/presenting it is a different challenge.
 
> Anomalies were meant over all the clients and per client as well. It was
> suggested as an idea to keep track of a slow client or a slow server for the
> duration of an application. Also it can be a very powerful tool when combined
> with the timestamps (see below please).

With the current stats, at any given moment we could compare e.g. the average
ost_setattr execution time across OSTs and note that OST5 is 10% slower than
the average OST, or that client7 has the highest average write size on OST2.
I think one of the most difficult parts of this tool will be deciding how to
prune the data we present down to a comprehensible amount.
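
To make the comparison concrete, here is a minimal sketch of ranking OSTs by
how far their average time deviates from the cross-OST mean.  The function
names and the input dictionary are hypothetical; it assumes something else has
already gathered one average ost_setattr time per OST.

```python
from statistics import mean

def relative_to_average(avg_usec_by_ost):
    """Percent deviation of each OST's average time from the mean over all OSTs.

    avg_usec_by_ost: hypothetical mapping of OST name -> average time in usec.
    """
    overall = mean(avg_usec_by_ost.values())
    return {ost: 100.0 * (t - overall) / overall
            for ost, t in avg_usec_by_ost.items()}

def slowest(avg_usec_by_ost, top=3):
    """Return the `top` OSTs with the largest positive deviation, slowest first."""
    dev = relative_to_average(avg_usec_by_ost)
    return sorted(dev.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Reporting only the top few deviations is one possible way to prune the output;
the cutoff of three is arbitrary.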

> Timestamped info means the ability to playback the I/O for the duration of an
> application. It does not need to be very fine grained (i.e. aggregate
> timestamped summary info for every X msecs/secs per each client/server should
> be sufficient).

e.g. something like:
11:02 client7 7MB w, 10MB r, 3004 RPCs, waited for 5 locks, 10 locks revoked
The more concrete we can make our examples, the better.
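
A rough sketch of how such per-interval, per-client lines might be produced
from a stream of events.  The event tuple layout, the counter names
(write_bytes, read_bytes, rpcs, lock_waits, lock_revokes) and the use of
seconds since midnight as the timestamp are all assumptions made for
illustration; none of them come from existing Lustre code.

```python
from collections import defaultdict

def summarize(events, interval_s=60):
    """Sum event amounts per (interval start, client).

    events: iterable of (timestamp_s, client, kind, amount) tuples, where
    timestamp_s is seconds since midnight (an assumption of this sketch).
    """
    buckets = defaultdict(lambda: defaultdict(int))
    for ts, client, kind, amount in events:
        start = int(ts) - int(ts) % interval_s  # round down to interval start
        buckets[(start, client)][kind] += amount
    return buckets

def format_line(start, client, counters):
    """Render one summary line in the style of the example above."""
    hh, mm = divmod(start // 60, 60)
    return "%02d:%02d %s %dMB w, %dMB r, %d RPCs, waited for %d locks, %d locks revoked" % (
        hh % 24, mm, client,
        counters["write_bytes"] >> 20,   # bytes -> MB
        counters["read_bytes"] >> 20,
        counters["rpcs"],
        counters["lock_waits"],
        counters["lock_revokes"],
    )
```

The interval length is the "every X msecs/secs" knob; a coarser interval gives
less data to store at the cost of a blurrier playback.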

> Yes, we meant RPC request queues (e.g. time spent on queue, queue depth).

Ok.  We already collect this information per server.
req_waittime              117364 samples [usec] 34 23445 21251101 7973894281
req_qdepth                117364 samples [reqs] 0 8 29906 30464
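
For reference, a small sketch of turning one of these lines into a mean and
standard deviation.  It assumes the four trailing numbers are min, max, sum and
sum of squares, which is consistent with the two lines above; parse_stat is a
hypothetical helper, not an existing tool.

```python
import math

def parse_stat(line):
    """Parse 'name count samples [unit] min max sum sumsq' into a dict.

    Assumes the trailing four fields are min, max, sum and sum of squares.
    """
    f = line.split()
    count = int(f[1])
    lo, hi, total, sumsq = (int(x) for x in f[4:8])
    avg = total / count
    # Variance from running sums; clamp at 0 to absorb rounding error.
    var = max(sumsq / count - avg * avg, 0.0)
    return {
        "name": f[0],
        "count": count,
        "unit": f[3].strip("[]"),
        "min": lo,
        "max": hi,
        "mean": avg,
        "stddev": math.sqrt(var),
    }
```

Under that assumption, the req_waittime line works out to a mean of roughly
181 usec against a maximum of 23445 usec.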

> Probably not ALL RPC related info. I am assuming "ALL" would be overwhelming
> to analyze and digest. Perhaps, we need to list the most striking ones. What
> do you suggest Nathan?

The slow outliers would probably be the most interesting.
server info:
 - at 11:02 req 1002 type 42 from client7 took 102s to process
 - at that time, the q depth was 5, the avg waittime was 10s, and the average
   req of that type took 6s
client info:
 - req 1002 from process 7 "ior" opc=fsync
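
One possible way to pick out the slow outliers, sketched under the assumption
that we have a per-request record with an id, type, client and processing time.
The record layout and the threshold factor are hypothetical.

```python
from collections import defaultdict

def slow_outliers(requests, factor=5.0):
    """Return (request, type_average) pairs for unusually slow requests.

    requests: list of dicts with hypothetical keys 'id', 'type', 'client'
    and 'seconds'.  A request is flagged when its processing time exceeds
    `factor` times the average for requests of the same type.
    """
    totals = defaultdict(lambda: [0.0, 0])  # type -> [sum of seconds, count]
    for r in requests:
        t = totals[r["type"]]
        t[0] += r["seconds"]
        t[1] += 1
    flagged = []
    for r in requests:
        total, count = totals[r["type"]]
        avg = total / count
        if r["seconds"] > factor * avg:
            flagged.append((r, avg))
    return flagged
```

Note that the average here includes the outlier itself, so a single very slow
request raises the bar it is compared against; a median or a trimmed mean might
be a better baseline.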
