On our DDNs I've noticed that during large reads the backend bandwidth
of the DDN is pegged (700-800 MB/s) while the total bandwidth being
delivered through the FC interfaces is in the low hundreds of MB/s. This
leads me to believe that the DDN cache is being thrashed heavily due to
overly aggressive read-ahead by many OSTs and by the DDN itself.
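As a rough way to put a number on that, here is a minimal sketch of the
read-amplification arithmetic (Python; the backend figure is the midpoint
of the 700-800 MB/s observed above, and the "low hundreds" delivered over
FC is assumed to be about 200 MB/s for illustration):

    # Backend bytes moved vs. bytes actually delivered to clients.
    # Both values are illustrative, taken from the observation above.
    backend_mb_s = 750.0     # midpoint of the observed 700-800 MB/s
    delivered_mb_s = 200.0   # assumed "low hundreds" of MB/s over FC

    amplification = backend_mb_s / delivered_mb_s
    print(f"read amplification: {amplification:.1f}x")
    # A ratio well above 1 suggests the controller is prefetching data
    # that gets evicted from the shared cache before any OST asks for it.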
I haven't re-run these tests with the 2.6 kernel (Cray 1.4XX) so I'm not
sure whether this phenomenon still exists, but the theory may still have
some validity since the DDN cache is shared between all the OSTs connected
to it.
paul
Lee Ward wrote:
On Tue, 2006-11-21 at 20:48 +0200, Oleg Drokin wrote:
Hello!
On Tue, Nov 21, 2006 at 10:24:42AM -0700, Peter Braam wrote:
Sandia has made available a very interesting set of graphs with some
questions. They study single file per process and shared file IO on Red
Storm.
The test was unfair, see:
With one stripe per file for file per process, 320 OSTs are used for 10k
clients; that's roughly 31 clients per OST.
Now with the shared file, stripe size (? count, I presume) is 157. Since
there is only one file, that means only 157 OSTs were in use, or roughly
64 clients per OST.
There are 2 OSTs per OSS, so I don't think the test is that unfair unless
somehow it failed to utilize every Fibre Channel link available. So if one
OST was used on each OSS, then in terms of network and disk bandwidth the
tests are equivalent.
So for a meaningful comparison we should compare 10k clients file per
process with 5k clients shared file. This "only" gives us a 2x difference,
which is still better than 4x.
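To make that load arithmetic concrete, a small sketch (Python; the client,
OST and stripe counts are the figures quoted in this thread):

    # Clients-per-OST arithmetic from the numbers quoted above.
    clients = 10_000

    # File per process: one stripe per file, spread over 320 OSTs.
    osts_fpp = 320
    print(f"file per process: ~{clients / osts_fpp:.0f} clients per OST")    # ~31

    # Single shared file: stripe count 157, so only 157 OSTs in use.
    osts_shared = 157
    print(f"shared file:      ~{clients / osts_shared:.0f} clients per OST") # ~64

    # With 2 OSTs per OSS, the fair comparison is 10k clients file per
    # process against ~5k clients on the shared file, i.e. roughly a 2x
    # load difference rather than the apparent 4-5x slowdown.
    ratio = (clients / osts_shared) / (clients / osts_fpp)
    print(f"load difference: ~{ratio:.1f}x")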
Also, stripe size is not specified; what was it set to?
Please define "stripe size"?
The following issues seem key ones:
- the single shared file is a factor 4-5 too slow, what is the
overhead?
Only 2x it seems.
- why are reads so slow?
No proper readahead by backend disk?
MF is set to "on", max-prefetch is set to ("x" and "1")
- why is there a significant read dropoff?
Writes can be cached, nicely aggregated, and written to disk in nice
linear chunks, whereas the disk backend cannot do proper readahead for
a seemingly random pattern of 1M here, 1M there?
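To illustrate what that pattern looks like from one OST's point of view,
here is a small sketch (Python; the 1 MB stripe size matches the "1M"
above, but the stripe count, client count and segment size are
hypothetical, chosen only to keep the example small):

    # Model a few clients reading disjoint contiguous regions of one shared,
    # striped file, and look at the order of 1 MB requests arriving at a
    # single OST when the clients issue reads concurrently.
    STRIPE_SIZE = 1 << 20        # 1 MB stripes
    STRIPE_COUNT = 4             # hypothetical, stands in for 157
    CLIENTS = 4
    SEGMENT = 16 * STRIPE_SIZE   # contiguous file region each client reads

    def ost_request_order(ost_index):
        """Round-robin the clients' request streams and keep the requests
        that land on the given OST, reporting each object offset (in MB)."""
        per_client = []
        for c in range(CLIENTS):
            base = c * SEGMENT
            offs = []
            for off in range(base, base + SEGMENT, STRIPE_SIZE):
                stripe = off // STRIPE_SIZE
                if stripe % STRIPE_COUNT == ost_index:
                    # offset of this stripe within the OST object, in MB
                    offs.append(stripe // STRIPE_COUNT)
            per_client.append(offs)
        # clients issue their reads concurrently; interleave them round-robin
        order = []
        for i in range(max(map(len, per_client))):
            for offs in per_client:
                if i < len(offs):
                    order.append(offs[i])
        return order

    print(ost_request_order(0))
    # Each client's stream is sequential on its own, but interleaved together
    # the OST (and the disks behind it) sees object offsets jumping back and
    # forth, which defeats simple sequential readahead.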
Was write cache enabled?
Yes. Not partitioned.
- why are two cores so much slower than a single core?
Were two cores on a single chip counted as a single client on the graph,
or as two clients? Probably the latter? Could it be some local bus
bottleneck due to increased load on the same bus/network?
It's counted as 2 clients. The node architecture *is* 2 clients in this
scenario. Memory is partitioned, etc. The only thing shared is the NIC.
I suppose the HT is shared as well but it is so much faster than the NIC
that it would seem to need an architectural deficiency to figure in
here.
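A back-of-envelope illustration of that point (Python; both link speeds
here are hypothetical placeholders, not figures from this thread):

    # Sharing the NIC two ways halves each client's injection bandwidth,
    # while even a halved HT link remains far faster than the NIC.
    nic_gb_s = 2.0    # hypothetical per-node NIC injection bandwidth
    ht_gb_s = 6.4     # hypothetical HyperTransport bandwidth
    cores = 2         # two cores counted as two clients, sharing one NIC

    print(f"per-client NIC share: {nic_gb_s / cores:.1f} GB/s")
    print(f"per-client HT share:  {ht_gb_s / cores:.1f} GB/s")
    # So an HT bottleneck would point to an architectural deficiency rather
    # than simple bandwidth sharing, as noted above.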
_______________________________________________
Lustre-devel mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-devel