Hello!

On Wed, Nov 22, 2006 at 11:39:02AM -0700, Lee Ward wrote:
> > So for meaningful comparison we should compare 10k clients file per process
> > with 5k clients shared file. This "only" gives us a 2x difference which is
> > still better than 4x.
> > Also stripe size is not specified, what was it set to?
> Please define "stripe size"?

Stripe size is the number of bytes that go into a single stripe on one OST
before we switch to the next OST for the next stripe.
With lfs setstripe it is the first of the three arguments.
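
To make the term concrete, here is a minimal sketch (plain round-robin
striping arithmetic, not actual Lustre code; the function name is mine) of
how a byte offset in a striped file maps onto an OST:

    # Sketch of RAID-0 style striping: stripe_size and stripe_count are the
    # values you would pass to lfs setstripe.
    def offset_to_ost(offset, stripe_size, stripe_count):
        """Return (ost_index, stripe_number) for a byte offset in the file."""
        stripe_number = offset // stripe_size      # which stripe overall
        ost_index = stripe_number % stripe_count   # which OST holds that stripe
        return ost_index, stripe_number

    # Example: 1MB stripes over 31 OSTs -- offset 32MB lands on OST 1.
    print(offset_to_ost(32 * 2**20, 1 * 2**20, 31))   # -> (1, 32)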

> > >  - why are reads so slow?
> > No proper readahead by backend disk?
> MF is set to "on", max-prefetch is set to ("x" and "1")

I am not familiar with those options you are speaking about, unfortunately.
Is there a simple description or something like that?
How large is such readahead with these settings?
See, if all clients read data sequentially, 1MB each, and we have 31 clients
competing for reads from one particular OST, then every such client has to
read 1MB out of every 31MB of the object that lives on this OST.
It is also complicated by the fact that there is no particular order in which
those requests arrive.
E.g. if we get a request for the last 1MB chunk of that 31MB area, the
readahead logic in the backend is probably going to read data forward (how far
forward?), but when the other requests come in they are uncached and we have
to wait for them to hit the disk. And this is the single-file case, which is
quite favourable; in fact you can see that read speed does not drop off as
badly in this case.
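
Just to illustrate the pattern I mean, here is a toy sketch (my assumption:
each of the 31 clients is responsible for a different 1MB chunk of the current
31MB window of the object):

    import random

    MB = 2**20
    clients = 31
    xfer = 1 * MB

    # One 1MB request per client into the same 31MB window of the OST object,
    # arriving in no particular order.
    requests = [i * xfer for i in range(clients)]
    random.shuffle(requests)

    print([off // MB for off in requests])
    # e.g. [30, 4, 17, 0, ...] -- if the request for the last chunk arrives
    # first, forward readahead started there does not help the other chunks.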

For file per process there is a separate file object for every process.
Those objects live in different (ext3) subdirectories, and the allocator might
well decide to place them in different parts of the disk (that is what the
default ext3 allocator does, I think). So one client reading its data most
likely does not prefetch any data for other clients on a first read. And
unless readahead reads gigabytes of data in advance for every object, there is
going to be some contention in the backend storage to read all those places
scattered across various areas of the disk (how many parallel I/O streams can
the disk backend sustain?).
For 2MB I/O the figures above need to be doubled, of course.
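
Rough back-of-the-envelope contrast (my numbers, just to show what I mean
about the parallel streams, not measured data):

    MB = 2**20
    clients_per_ost = 31
    xfer = 2 * MB                       # 2MB transfer size used in most runs

    shared_file_regions = 1                     # one object, largely sequential
    file_per_process_regions = clients_per_ost  # one object per process

    outstanding_mb = clients_per_ost * xfer // MB
    print(outstanding_mb, "MB outstanding over",
          file_per_process_regions, "regions vs",
          shared_file_regions, "region for the shared file")
    # -> 62 MB outstanding over 31 regions vs 1 region for the shared file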

Was there a reason the single-file, single-core case was measured with a 1MB
transfer size when all the other cases used a 2MB transfer size?

> > >  - why is there a significant read dropoff?
> > Writes can be cached, nicely aggregated and written to disk in nice linear
> > chunks and disk backend cannot do proper readahead for such a seemingly
> > random 1M here, 1M there?
> > Was write cache enabled?
> Yes. Not partitioned.

That is likely the answer to why writes are so fast and do not drop off as the
number of clients goes up, as long as the data fits into the cache (which is
what we see on the first graph). It is not very clear why on the second graph
write performance starts to go down after some number of clients; I guess this
suggests a bottleneck somewhere other than the disk backend.

> > >  - why is two cores so much slower than single core?
> > Were two cores on a single chip counted as a single client on the graph,
> > or as two clients? Probably the latter? Could it be some local bus
> > bottleneck due to increased load on the same bus/network?
> It's counted as 2 clients. The node architecture *is* 2 clients in this
> scenario. Memory is partitioned, etc. The only thing shared is the NIC.
> I suppose the HT is shared as well but it is so much faster than the NIC
> that it would seem to need an architectural deficiency to figure in
> here.

So could it be that you saturate the NIC or parts of the network? (You have
twice as much traffic in the dual-core node case over the same network.)
Is there a way to measure this?

Bye,
    Oleg
