On Wed, 2006-11-22 at 21:11 +0200, Oleg Drokin wrote:
> Hello!
> 
> On Wed, Nov 22, 2006 at 11:39:02AM -0700, Lee Ward wrote:
> > > So for a meaningful comparison we should compare 10k clients file-per-process
> > > with 5k clients shared-file. This "only" gives us a 2x difference, which is
> > > still better than 4x.
> > > Also, stripe size is not specified; what was it set to?
> > Please define "stripe size"?
> 
> Stripe size is the number of bytes that go into a single stripe on one OST
> before we switch to the next OST for another stripe.
> With lfs setstripe it is the first of the three arguments.

Ah. Ok. It's 2 MiB.
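
Just so we're sure we mean the same thing, here's a toy sketch of the
layout as I understand it (plain round-robin striping across OSTs; the
stripe count of 4 and the function name are made up for illustration):

    # Illustrative only: map a file offset to (OST index, offset within that
    # OST's object) under round-robin striping.
    def stripe_of(offset, stripe_size=2 * 1024 * 1024, stripe_count=4):
        stripe_nr = offset // stripe_size        # which stripe of the file
        ost_index = stripe_nr % stripe_count     # which OST that stripe lands on
        obj_offset = ((stripe_nr // stripe_count) * stripe_size
                      + offset % stripe_size)    # where it sits in that object
        return ost_index, obj_offset

    # e.g. with 4 OSTs and 2 MiB stripes, byte 5 MiB of the file lands on OST
    # index 2, 1 MiB into that OST's object:
    print(stripe_of(5 * 1024 * 1024))            # -> (2, 1048576)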

> 
> > > >  - why are reads so slow?
> > > No proper readahead by backend disk?
> > MF is set to "on", max-prefetch is set to ("x" and "1")
> 
> Unfortunately, I am not familiar with the options you are speaking about.
> Is there a simple description or something like that?
> How large is the readahead with these settings?

Ok. Our interpretation is that these settings mean readahead is enabled,
up to a 64K prefetch.

The DDN site has the technical docs for the controller online. Feel free
to try to interpret the settings yourself :-)
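
If that interpretation is right, then a quick back-of-the-envelope check
(purely illustrative, using the transfer sizes quoted in this thread)
says the controller prefetch is tiny next to a single client request:

    # How much of the next 1 MiB or 2 MiB client read a 64 KiB controller
    # prefetch could possibly satisfy.
    prefetch = 64 * 1024
    for xfer in (1 * 1024 * 1024, 2 * 1024 * 1024):
        print(f"{xfer >> 20} MiB xfer: prefetch covers {prefetch / xfer:.1%}")
    # -> 1 MiB xfer: prefetch covers 6.2%
    #    2 MiB xfer: prefetch covers 3.1%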

> See, if all clients read data sequentially, 1 MB each, we have 31 clients
> competing for reads from one particular OST. Each such client has to read
> 1 MB out of every 31 MB of an object that lives on this OST.
> It is also complicated by the fact that there is no particular order in which
> those requests arrive.

But they should arrive very close to one another (all began at the same
time and proceed in order through the file, which tends to keep them in
step), and with the elevator seek-sort applied they are supposed to be
reordered so that the controller typically sees long, contiguous access
requests -- provided the FS did an extent-based allocation, of course.
You seem to indicate, in the paragraph after the next, that it does
extent-based allocation for the file-per-process case. Are you thinking
it does not do extent allocation for the objects involved in the
shared-file environment?
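
To make that concrete, a toy sketch (made-up names, and it obviously
ignores timing and caching) of why the sorted request stream over one
31 MiB region of the object should look contiguous to the controller:

    # 31 clients each read their own 1 MiB slice of a 31 MiB region; arrival
    # order is arbitrary, but an elevator sort by offset turns the slices back
    # into one contiguous run.
    import random

    MIB = 1024 * 1024
    requests = [(client * MIB, MIB) for client in range(31)]  # (offset, length)
    random.shuffle(requests)                  # requests arrive in no set order

    sorted_reqs = sorted(requests)            # what the seek-sort should do
    contiguous = all(off == i * MIB for i, (off, _len) in enumerate(sorted_reqs))
    print("contiguous after seek-sort:", contiguous)          # -> True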

> E.g. if we get a request for the last 1 MB chunk of that 31 MB area, the
> readahead logic in the backend is probably going to read data forward (how far
> forward?), but when the other requests come in they are uncached and we wait
> for them to hit the disk. And this is the single-file case, which is quite
> favourable; in fact you can see the read speed does not drop as badly in this
> case.

This would make sense without extent-based allocation and re-ordering.
With it, though, things should work out just fine.

Paul Nowicki, in a previous message in this thread, seems to have
evidence that prefetching at the controller might not be a good idea.
Could that be contributing? Are we wasting back-end bandwidth in the
RAID controller? I can see that explaining the fall-off in the
file-per-process case, but it doesn't explain the simply bad performance
of the single-shared-file case. There, prefetch should be in the
noise -- again, given a proper extent allocator and seek-sort algorithm.

> 
> For file-per-process there is a separate file 'object' for every process.
> Those objects live in different (ext3) 'subdirs', and the allocator may well
> decide to allocate them in different parts of the disk (that's what the ext3
> default allocator does, I think). So one client reading data most likely does
> not prefetch any data for the other clients on its first read. And if the
> readahead does not read gigabytes of data in advance for every object, there
> is going to be some contention on the backend storage to read all those
> places in various areas of the storage (how many parallel I/O streams can the
> disk backend sustain?)
> For 2 MB I/O the figures above need to be doubled, of course.
> 
> Was there a reason the single-file, single-core case was measured with a 1M
> xfer size when all the other cases were with a 2M xfer size?

Dunno, but the ranges and shapes are similar. I'm thinking, then, that
there's not much difference due to this.

> 
> > > >  - why is there a significant read dropoff?
> > > Writes can be cached, nicely aggregated, and written to disk in nice
> > > linear chunks, while the disk backend cannot do proper readahead for such
> > > seemingly random 1M-here, 1M-there reads?
> > > Was write cache enabled?
> > Yes. Not partitioned.
> 
> That's likely an answer to why writes are so fast and do not drop off as the
> number of clients goes up, as long as the data fits into the cache (which is
> what we can see on the first graph). It is not very clear why on the second
> graph the write performance starts to go down after some number of clients,
> and I guess this suggests a bottleneck in some place other than the disk
> backend.

I agree. NIC arbitration, maybe? I'm more concerned about the single-file
case than the file-per-process case, though. So long as we can get
*something* usable we can counsel our users, and such a thing exists for
the file-per-process case. It just doesn't exist for the single shared
file.

> 
> > > >  - why are two cores so much slower than a single core?
> > > Were two cores on a single chip counted as a single client on the graph,
> > > or as two clients? Probably the latter? Could it be some local bus
> > > bottleneck due to the increased load on the same bus/network?
> > It's counted as 2 clients. The node architecture *is* 2 clients in this
> > scenario. Memory is partitioned, etc. The only thing shared is the NIC.
> > I suppose the HT is shared as well but it is so much faster than the NIC
> > that it would seem to need an architectural deficiency to figure in
> > here.
> 
> So it could be that you saturate the NIC or parts of the network? (You have 2x
> as much traffic in the dual-core node case for the same network.)
> Is there a way to measure this?

I can't see that from these graphs. The peak is the same, and at the same
node count -- i.e., the client can drive things sufficiently whether
single- or dual-core. So it isn't the client side. As things go out
further, the *number* of messages injected into the network is the same,
so I can't get to a point where I would call out the network.

For the service side, it's the same NIC, receiving the same number of
messages. The server is *always* single-core. My only question would be
whether the same NID is used by a dual-core client, and whether Lustre/LNET
is doing something different in that case. Ruth or Jim, do you know if the
client node is using the same NID for both virtual machines?


> 
> Bye,
>     Oleg
> 

