Re: Increasing MAXPHYS

2010-03-20 Thread Matthew Dillon

:All above I have successfully tested last months with MAXPHYS of 1MB on
:i386 and amd64 platforms.
:
:So my questions are:
:- does somebody know any issues denying increasing MAXPHYS in HEAD?
:- are there any specific opinions about value? 512K, 1MB, MD?
:
:-- 
:Alexander Motin

(nswbuf * MAXPHYS) of KVM is reserved for pbufs, so on i386 you
might hit up against KVM exhaustion issues in unrelated subsystems.
nswbuf typically maxes out at around 256.  For i386 1MB is probably
too large (256M of reserved KVM is a lot for i386).  On amd64 there
shouldn't be a problem.
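
(Illustrative arithmetic only: the short standalone C program below just
multiplies out (nswbuf * MAXPHYS) for a few candidate MAXPHYS values,
assuming nswbuf has scaled to its usual cap of 256; it is not kernel code.)

#include <stdio.h>

int
main(void)
{
	const unsigned long nswbuf = 256;	/* typical upper limit */
	const unsigned long maxphys[] = {
		128 * 1024, 256 * 1024, 512 * 1024, 1024 * 1024
	};
	unsigned i;

	for (i = 0; i < sizeof(maxphys) / sizeof(maxphys[0]); i++)
		printf("MAXPHYS %4luK -> %4luM of KVM reserved for pbufs\n",
		    maxphys[i] / 1024,
		    nswbuf * maxphys[i] / (1024 * 1024));
	return (0);
}

(With only 1G of kernel address space on i386, the 256M in the last line
is what makes 1MB uncomfortable there.)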

Diminishing returns get hit pretty quickly with larger MAXPHYS values.
As long as the I/O can be pipelined the reduced transaction rate
becomes less interesting when the transaction rate is less than a
certain level.  Off the cuff I'd say 2000 tps is a good basis for
considering whether it is an issue or not.  256K is actually quite
a reasonable value.  Even 128K is reasonable.

Nearly all the issues I've come up against in the last few years have
been related more to pipeline algorithms breaking down and less with
I/O size.  The cluster_read() code is especially vulnerable to
algorithmic breakdowns when fast media (such as a SSD) is involved.
e.g.  I/Os queued from the previous cluster op can create stall
conditions in subsequent cluster ops before they can issue new I/Os
to keep the pipeline hot.

-Matt
Matthew Dillon 



Re: Increasing MAXPHYS

2010-03-20 Thread Scott Long
On Mar 20, 2010, at 11:53 AM, Matthew Dillon wrote:
> 
> :All above I have successfully tested last months with MAXPHYS of 1MB on
> :i386 and amd64 platforms.
> :
> :So my questions are:
> :- does somebody know any issues denying increasing MAXPHYS in HEAD?
> :- are there any specific opinions about value? 512K, 1MB, MD?
> :
> :-- 
> :Alexander Motin
> 
>(nswbuf * MAXPHYS) of KVM is reserved for pbufs, so on i386 you
>might hit up against KVM exhaustion issues in unrelated subsystems.
>nswbuf typically maxes out at around 256.  For i386 1MB is probably
>too large (256M of reserved KVM is a lot for i386).  On amd64 there
>shouldn't be a problem.
> 

Yes, this needs to be addressed.  I've never gotten a clear answer from
VM people like Peter Wemm and Alan Cox on what should be done.

>Diminishing returns get hit pretty quickly with larger MAXPHYS values.
>As long as the I/O can be pipelined the reduced transaction rate
>becomes less interesting when the transaction rate is less than a
>certain level.  Off the cuff I'd say 2000 tps is a good basis for
>considering whether it is an issue or not.  256K is actually quite
>a reasonable value.  Even 128K is reasonable.
> 

I agree completely.  I did quite a bit of testing on this in 2008 and 2009.
I even added some hooks into CAM to support this, and I thought that I had
discussed this extensively with Alexander at the time.  Guess it was yet another
wasted conversation with him =-(  I'll repeat it here for the record.

What I call the silly-i/o-test, filling a disk up with the dd command, yields
performance improvements up to a MAXPHYS of 512K.  Beyond that the gain is
negligible, and it actually starts running into contention on the VM page
queues lock.  There is some work to break down this lock, so it's worth
revisiting in the future.

For the non-silly-i/o-test, where I do real file i/o using various sequential and
random patterns, there was a modest improvement up to 256K, and a slight
improvement up to 512K.  This surprised me as I figured that most filesystem
i/o would be in UFS block sized chunks.  Then I realized that the UFS clustering
code was actually taking advantage of the larger I/O's.  The improvement really
depends on the workload, of course, and I wouldn't expect it to be noticeable
for most people unless they're running something like a media server.

Besides the nswbuf sizing problem, there is a real problem that a lot of drivers
have incorrectly assumed over the years that MAXPHYS and DFLTPHYS are
particular values, and they've sized their data structures accordingly.  Before
these values are changed, an audit needs to be done OF EVERY SINGLE
STORAGE DRIVER.  No exceptions.  This isn't a case of changing MAXPHYS
in the ata driver, testing that your machine boots, and then committing the change
to source control.  Some drivers will have non-obvious restrictions based on
the number of SG elements allowed in a particular command format.  MPT
comes to mind (its multi message SG code seems to be broken when I tried
testing large MAXPHYS on it), but I bet that there are others.

Windows has a MAXPHYS equivalent of 1M.  Linux has an equivalent of an
odd number less than 512k.  For the purpose of benchmarking against these
OS's, having comparable capabilities is essential; Linux easily beats FreeBSD
in the silly-i/o-test because of the MAXPHYS difference (though FreeBSD typically
stomps linux in real I/O because of vastly better latency and caching algorithms).
I'm fine with raising MAXPHYS in production once the problems are addressed.


>Nearly all the issues I've come up against in the last few years have
>been related more to pipeline algorithms breaking down and less with
>I/O size.  The cluster_read() code is especially vulnerable to
>algorithmic breakdowns when fast media (such as a SSD) is involved.
>e.g.  I/Os queued from the previous cluster op can create stall
>conditions in subsequent cluster ops before they can issue new I/Os
>to keep the pipeline hot.
> 

Yes, this is another very good point.  It's time to start really figuring out 
what SSD
means for FreeBSD I/O.

Scott



Re: Increasing MAXPHYS

2010-03-20 Thread Matthew Dillon
:Pardon my ignorance, but wouldn't so much KVM make small embedded
:devices like Soekris boards with 128 MB of physical RAM totally unusable
:then? On my net4801, running RELENG_8:
:
:vm.kmem_size: 40878080
:
:hw.physmem: 125272064
:hw.usermem: 84840448
:hw.realmem: 134217728

KVM != physical memory.  On i386 by default the kernel has 1G of KVM
and userland has 3G.  While the partition can be moved to increase
available KVM on i386 (e.g. 2G/2G), it isn't recommended.

So the KVM reserved for various things does not generally impact
physical memory use.

The number of swap buffers (nswbuf) is scaled to 1/4 nbufs with a
maximum of 256.  Systems with small amounts of memory should not be
impacted.
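
(A sketch of that scaling, for reference; the real logic lives in the
buffer cache setup code and may differ in detail, but it amounts to
roughly one swap buffer per four buf cache buffers, clamped to a sane
range.)

static int
scale_nswbuf(int nbuf)
{
	int nswbuf;

	/* ~1/4 of nbuf, never more than 256, never fewer than 16. */
	nswbuf = nbuf / 4;
	if (nswbuf > 256)
		nswbuf = 256;
	if (nswbuf < 16)
		nswbuf = 16;
	return (nswbuf);
}

(Small machines therefore end up with a small nswbuf and a
correspondingly small pbuf reservation, whatever MAXPHYS is.)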

The issue with regard to KVM exhaustion on i386 is mostly restricted to
systems with 2G+ of ram where the kernel's various internal parameters
are scaled to their maximum values or limits.  On systems with less ram
the kernel's internal parameters are usually scaled down sufficiently
that there is very little chance of the kernel running out of KVM.

-Matt



Re: Increasing MAXPHYS

2010-03-20 Thread C. P. Ghost
On Sat, Mar 20, 2010 at 6:53 PM, Matthew Dillon wrote:
>
> :All above I have successfully tested last months with MAXPHYS of 1MB on
> :i386 and amd64 platforms.
> :
> :So my questions are:
> :- does somebody know any issues denying increasing MAXPHYS in HEAD?
> :- are there any specific opinions about value? 512K, 1MB, MD?
> :
> :--
> :Alexander Motin
>
>    (nswbuf * MAXPHYS) of KVM is reserved for pbufs, so on i386 you
>    might hit up against KVM exhaustion issues in unrelated subsystems.
>    nswbuf typically maxes out at around 256.  For i386 1MB is probably
>    too large (256M of reserved KVM is a lot for i386).  On amd64 there
>    shouldn't be a problem.

Pardon my ignorance, but wouldn't so much KVM make small embedded
devices like Soekris boards with 128 MB of physical RAM totally unusable
then? On my net4801, running RELENG_8:

vm.kmem_size: 40878080

hw.physmem: 125272064
hw.usermem: 84840448
hw.realmem: 134217728

>    Diminishing returns get hit pretty quickly with larger MAXPHYS values.
>    As long as the I/O can be pipelined the reduced transaction rate
>    becomes less interesting when the transaction rate is less than a
>    certain level.  Off the cuff I'd say 2000 tps is a good basis for
>    considering whether it is an issue or not.  256K is actually quite
>    a reasonable value.  Even 128K is reasonable.
>
>    Nearly all the issues I've come up against in the last few years have
>    been related more to pipeline algorithms breaking down and less with
>    I/O size.  The cluster_read() code is especially vulnerable to
>    algorithmic breakdowns when fast media (such as a SSD) is involved.
>    e.g.  I/Os queued from the previous cluster op can create stall
>    conditions in subsequent cluster ops before they can issue new I/Os
>    to keep the pipeline hot.

Thanks,
-cpghost.

-- 
Cordula's Web. http://www.cordula.ws/


Re: Increasing MAXPHYS

2010-03-20 Thread Alexander Motin
Scott Long wrote:
> On Mar 20, 2010, at 11:53 AM, Matthew Dillon wrote:
>>Diminishing returns get hit pretty quickly with larger MAXPHYS values.
>>As long as the I/O can be pipelined the reduced transaction rate
>>becomes less interesting when the transaction rate is less than a
>>certain level.  Off the cuff I'd say 2000 tps is a good basis for
>>considering whether it is an issue or not.  256K is actually quite
>>a reasonable value.  Even 128K is reasonable.
> 
> I agree completely.  I did quite a bit of testing on this in 2008 and 2009.
> I even added some hooks into CAM to support this, and I thought that I had
> discussed this extensively with Alexander at the time.  Guess it was yet 
> another
> wasted conversation with him =-(  I'll repeat it here for the record.

AFAIR, at that time you agreed that 256K gives improvements, and that the
64K DFLTPHYS limit on most SCSI SIMs is too small. That's why you
implemented those hooks in CAM. I have not forgotten that conversation (a
pity that it quietly died for SCSI SIMs). I agree that too high a value
could just be a waste of resources. As you may see, I haven't blindly
committed it, but asked for public opinion. If you think 256K is OK - let
it be 256K. If you think 256K is needed only for media servers - OK, but
let's make it usable there.

> Besides the nswbuf sizing problem, there is a real problem that a lot of 
> drivers
> have incorrectly assumed over the years that MAXPHYS and DFLTPHYS are
> particular values, and they've sized their data structures accordingly.  
> Before
> these values are changed, an audit needs to be done OF EVERY SINGLE
> STORAGE DRIVER.  No exceptions.  This isn't a case of changing MAXPHYS
> in the ata driver, testing that your machine boots, and then committing the 
> change
> to source control.  Some drivers will have non-obvious restrictions based on
> the number of SG elements allowed in a particular command format.  MPT
> comes to mind (its multi message SG code seems to be broken when I tried
> testing large MAXPHYS on it), but I bet that there are others.

As you should remember, we made it in such a way that all unchecked
drivers keep using DFLTPHYS, which is never going to be changed. So
there is no problem there. I would worry more about non-CAM storage and
the layers above, like some rare GEOM classes.
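
(For readers outside CAM, a sketch of the convention being referred to.
cpi->maxio, DFLTPHYS and MAXPHYS are the real names; the two functions
below are illustrative, not code from the tree.)

#include <sys/param.h>
#include <cam/cam.h>
#include <cam/cam_ccb.h>

/* SIM side: advertise the largest I/O the driver has been audited for. */
static void
example_sim_path_inq(struct ccb_pathinq *cpi)
{
	/* ... the usual XPT_PATH_INQ fields ... */
	cpi->maxio = MAXPHYS;	/* leaving this 0 means "unaudited" */
	cpi->ccb_h.status = CAM_REQ_CMP;
}

/* Peripheral side: size transfers from what the SIM reported. */
static u_int
example_max_xfer(const struct ccb_pathinq *cpi)
{
	if (cpi->maxio == 0)
		return (DFLTPHYS);	/* unaudited SIM: stay at the safe 64K */
	return (MIN(cpi->maxio, MAXPHYS));
}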

> I'm fine with raising MAXPHYS in production once the problems are
> addressed.

That's why in my post I asked people about any known problems. I've
addressed several related issues in recent months, and I am looking for
more. To address problems, it would be nice to know about them first.

-- 
Alexander Motin


Re: Increasing MAXPHYS

2010-03-20 Thread Alan Cox
2010/3/20 Alexander Motin 

> Hi.
>
> With set of changes done to ATA, CAM and GEOM subsystems last time we
> may now get use for increased MAXPHYS (maximum physical I/O size) kernel
> constant from 128K to some bigger value.


[snip]


> All above I have successfully tested last months with MAXPHYS of 1MB on
> i386 and amd64 platforms.
>
> So my questions are:
> - does somebody know any issues denying increasing MAXPHYS in HEAD?
> - are there any specific opinions about value? 512K, 1MB, MD?
>
>
For now, I think it should be machine-dependent.  The virtual memory system
should have no problems with MAXPHYS of 1MB on amd64 and ia64.

Alan


Re: Increasing MAXPHYS

2010-03-20 Thread Julian Elischer

Alexander Motin wrote:

Scott Long wrote:

On Mar 20, 2010, at 11:53 AM, Matthew Dillon wrote:

   Diminishing returns get hit pretty quickly with larger MAXPHYS values.
   As long as the I/O can be pipelined the reduced transaction rate
   becomes less interesting when the transaction rate is less than a
   certain level.  Off the cuff I'd say 2000 tps is a good basis for
   considering whether it is an issue or not.  256K is actually quite
   a reasonable value.  Even 128K is reasonable.

I agree completely.  I did quite a bit of testing on this in 2008 and 2009.
I even added some hooks into CAM to support this, and I thought that I had
discussed this extensively with Alexander at the time.  Guess it was yet another
wasted conversation with him =-(  I'll repeat it here for the record.


In the Fusion-io driver we find that the limiting factor is not the
size of MAXPHYS, but the fact that we can not push more than
170k tps through geom. (in my test machine. I've seen more on some
beefier machines), but that is only a limit on small transactions,
or in the case of large transfers the DMA engine tops out before a 
bigger MAXPHYS would make any difference.


Where it may make a difference is that Linux only pushes 128k
at a time, it looks like, so many hardware engines have likely
never been tested with anything greater (not sure about Windows).
Some drivers may also be written with the assumption that they
will not see more. Of course, they should be able to limit the
transaction size down themselves if they are written well.





AFAIR at that time you've agreed that 256K gives improvements, and 64K
of DFLTPHYS limiting most SCSI SIMs is too small. That's why you've
implemented that hooks in CAM. I have not forgot that conversation (pity
that it quietly died for SCSI SIMs). I agree that too high value could
be just a waste of resources. As you may see I haven't blindly committed
it, but asked public opinion. If you think 256K is OK - let it be 256K.
If you think that 256K needed only for media servers - OK, but lets make
it usable there.


Besides the nswbuf sizing problem, there is a real problem that a lot of drivers
have incorrectly assumed over the years that MAXPHYS and DFLTPHYS are
particular values, and they've sized their data structures accordingly.  Before
these values are changed, an audit needs to be done OF EVERY SINGLE
STORAGE DRIVER.  No exceptions.  This isn't a case of changing MAXPHYS
in the ata driver, testing that your machine boots, and then committing the 
change
to source control.  Some drivers will have non-obvious restrictions based on
the number of SG elements allowed in a particular command format.  MPT
comes to mind (its multi message SG code seems to be broken when I tried
testing large MAXPHYS on it), but I bet that there are others.


As you should remember, we have made it in such way, that all unchecked
drivers keep using DFLTPHYS, which is not going to be changed ever. So
there is no problem. I would more worry about non-CAM storages and above
stuff, like some rare GEOM classes.


I'm fine with raising MAXPHYS in production once the problems are
addressed.


That's why in my post I've asked people about any known problems. I've
addressed several related issues in last months, and I am looking for
more. To address problems, it would be nice to know about them first.





Re: Increasing MAXPHYS

2010-03-20 Thread Julian Elischer

Ivan Voras wrote:

Julian Elischer wrote:

Alexander Motin wrote:

Scott Long wrote:

On Mar 20, 2010, at 11:53 AM, Matthew Dillon wrote:
   Diminishing returns get hit pretty quickly with larger MAXPHYS 
values.

   As long as the I/O can be pipelined the reduced transaction rate
   becomes less interesting when the transaction rate is less than a
   certain level.  Off the cuff I'd say 2000 tps is a good basis for
   considering whether it is an issue or not.  256K is actually quite
   a reasonable value.  Even 128K is reasonable.
I agree completely.  I did quite a bit of testing on this in 2008 
and 2009.
I even added some hooks into CAM to support this, and I thought that 
I had
discussed this extensively with Alexander at the time.  Guess it was 
yet another

wasted conversation with him =-(  I'll repeat it here for the record.


In the Fusion-io driver we find that the limiting factor is not the
size of MAXPHYS, but the fact that we can not push more than
170k tps through geom. (in my test machine. I've seen more on some
beefier machines), but that is only a limit on small transactions,


Do the GEOM threads (g_up, g_down) go into saturation? Effectively all 
IO is serialized through them.


basically..

You can get better throughput by using TSC for timing because the geom
and devstat code does a bit of timing. Geom can be told to turn off
its timing but devstat can't. The 170 ktps is with TSC as timer,
and geom timing turned off.
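
(For reference, the knobs in question appear to be kern.geom.collectstats
and kern.timecounter.hardware; the userland sketch below flips them with
sysctlbyname(), equivalent to setting them with sysctl(8).  Treat the knob
names as an assumption and check them on your own system first.)

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	int off = 0;
	const char *tc = "TSC";

	/* Stop GEOM from timing and accounting every request. */
	if (sysctlbyname("kern.geom.collectstats", NULL, NULL,
	    &off, sizeof(off)) != 0)
		warn("kern.geom.collectstats");

	/* Select the cheap TSC as the active timecounter. */
	if (sysctlbyname("kern.timecounter.hardware", NULL, NULL,
	    tc, strlen(tc) + 1) != 0)
		warn("kern.timecounter.hardware");

	printf("GEOM stats off, timecounter set to %s\n", tc);
	return (0);
}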

It could just be the sheer weight of the work being done.
Linux on the same machine using the same driver code (with different
wrappers) gets 225k tps.







Re: Increasing MAXPHYS

2010-03-21 Thread Alexander Motin
Julian Elischer wrote:
> In the Fusion-io driver we find that the limiting factor is not the
> size of MAXPHYS, but the fact that we can not push more than
> 170k tps through geom. (in my test machine. I've seen more on some
> beefier machines), but that is only a limit on small transactions,
> or in the case of large transfers the DMA engine tops out before a
> bigger MAXPHYS would make any difference.

Yes, GEOM is quite CPU-hungry at high request rates due to the number of
context switches. But the impact can probably be reduced from two sides: by
reducing the overhead per request, or by reducing the number of requests.
Both ways may give benefits.

If the common opinion is not to touch the defaults now - OK, agreed. (Note,
Scott, I have agreed :)) But returning to the original question, does
somebody know of a real situation where an increased MAXPHYS still causes
problems? At least so we can make it safe.

-- 
Alexander Motin



Re: Increasing MAXPHYS

2010-03-21 Thread Alexander Motin
Ivan Voras wrote:
> Julian Elischer wrote:
>> You can get better throughput by using TSC for timing because the geom
>> and devstat code does a bit of timing.. Geom can be told to turn off
>> its timing but devstat can't. The 170 ktps is with TSC as timer,
>> and geom timing turned off.
> 
> I see. I just ran randomio on a gzero device and with 10 userland
> threads (this is a slow 2xquad machine) I get g_up and g_down saturated
> fast with ~~ 120 ktps. Randomio uses gettimeofday() for measurements.

I've just got 140Ktps from two real Intel X25-M SSDs on an ICH10R AHCI
controller and a single Core2Quad CPU. So at least in synthetic tests it
is potentially reachable even with commodity hardware, though it completely
saturated the quad-core CPU.

> Hmm, it looks like it could be easy to spawn more g_* threads (and,
> barring specific class behaviour, it has a fair chance of working out of
> the box) but the incoming queue will need to also be broken up for
> greater effect.

According to "notes", looks there is a good chance to obtain races, as
some places expect only one up and one down thread.

-- 
Alexander Motin


Re: Increasing MAXPHYS

2010-03-21 Thread Andriy Gapon
on 21/03/2010 16:05 Alexander Motin said the following:
> Ivan Voras wrote:
>> Hmm, it looks like it could be easy to spawn more g_* threads (and,
>> barring specific class behaviour, it has a fair chance of working out of
>> the box) but the incoming queue will need to also be broken up for
>> greater effect.
> 
> According to "notes", looks there is a good chance to obtain races, as
> some places expect only one up and one down thread.

I haven't given any deep thought to this issue, but I remember us discussing
them over beer :-)
I think one idea was making sure (somehow) that requests traveling over the same
edge of a geom graph (in the same direction) do it using the same queue/thread.
Another idea was to bring some netgraph-like optimization where some (carefully
chosen) geom vertices pass requests by a direct call instead of requeuing.

-- 
Andriy Gapon


Re: Increasing MAXPHYS

2010-03-21 Thread Julian Elischer

Alexander Motin wrote:

Julian Elischer wrote:

In the Fusion-io driver we find that the limiting factor is not the
size of MAXPHYS, but the fact that we can not push more than
170k tps through geom. (in my test machine. I've seen more on some
beefier machines), but that is only a limit on small transacrtions,
or in the case of large transfers the DMA engine tops out before a
bigger MAXPHYS would make any difference.


Yes, GEOM is quite CPU-hungry on high request rates due to number of
context switches. But impact probably may be reduced from two sides: by
reducing overhead per request, or by reducing number of requests. Both
ways may give benefits.

If common opinion is not to touch defaults now - OK, agreed. (Note,
Scott, I have agreed :)) But returning to the original question, does
somebody knows real situation when increased MAXPHYS still causes
problems? At least to make it safe.



well, I know we haven't tested our BSD driver yet with MAXPHYS > 128KB
at this time.  Must try that some time :-)




Re: Increasing MAXPHYS

2010-03-21 Thread Julian Elischer

Andriy Gapon wrote:

on 21/03/2010 16:05 Alexander Motin said the following:

Ivan Voras wrote:

Hmm, it looks like it could be easy to spawn more g_* threads (and,
barring specific class behaviour, it has a fair chance of working out of
the box) but the incoming queue will need to also be broken up for
greater effect.

According to "notes", looks there is a good chance to obtain races, as
some places expect only one up and one down thread.


I haven't given any deep thought to this issue, but I remember us discussing
them over beer :-)
I think one idea was making sure (somehow) that requests traveling over the same
edge of a geom graph (in the same direction) do it using the same queue/thread.
Another idea was to bring some netgraph-like optimization where some (carefully
chosen) geom vertices pass requests by a direct call instead of requeuing.



yeah, like the 1:1 single-provider case (which we and most of our
customers mostly use on our cards), i.e. no slicing or dicing, just
the raw flash card presented as /dev/fio0




Re: Increasing MAXPHYS

2010-03-21 Thread Scott Long

On Mar 21, 2010, at 8:05 AM, Alexander Motin wrote:

> Ivan Voras wrote:
>> Julian Elischer wrote:
>>> You can get better throughput by using TSC for timing because the geom
>>> and devstat code does a bit of timing.. Geom can be told to turn off
>>> its timing but devstat can't. The 170 ktps is with TSC as timer,
>>> and geom timing turned off.
>> 
>> I see. I just ran randomio on a gzero device and with 10 userland
>> threads (this is a slow 2xquad machine) I get g_up and g_down saturated
>> fast with ~~ 120 ktps. Randomio uses gettimeofday() for measurements.
> 
> I've just got 140Ktps from two real Intel X25-M SSDs on ICH10R AHCI
> controller and single Core2Quad CPU. So at least on synthetic tests it
> is potentially reachable even with casual hardware, while it completely
> saturated quad-core CPU.
> 
>> Hmm, it looks like it could be easy to spawn more g_* threads (and,
>> barring specific class behaviour, it has a fair chance of working out of
>> the box) but the incoming queue will need to also be broken up for
>> greater effect.
> 
> According to "notes", looks there is a good chance to obtain races, as
> some places expect only one up and one down thread.
> 

I agree that more threads just creates many more race complications.  Even if 
it didn't, the storage driver is a serialization point; it doesn't matter if 
you have a dozen g_* threads if only one of them can be in the top half of the 
driver at a time.  No amount of fine-grained locking is going to help this.

I'd like to go in the opposite direction.  The queue-dispatch-queue model of 
GEOM is elegant and easy to extend, but very wasteful for the simple case, 
where the simple case is one or two simple partition transforms (mbr, bsdlabel) 
and/or a simple stripe/mirror transform.  None of these need a dedicated 
dispatch context in order to operate.  What I'd like to explore is compiling 
the GEOM stack at creation time into a linear array of operations that happen 
without a g_down/g_up context switch.  As providers and consumers taste each 
other and build a stack, that stack gets compiled into a graph, and that graph 
gets executed directly from the calling context, both from the dev_strategy() 
side on the top and the bio_done() on the bottom.  GEOM classes that need a 
detached context can mark themselves as such, doing so will prevent a graph 
from being created, and the current dispatch model will be retained.
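
(A minimal sketch of that short-circuit.  G_FLAG_DIRECT and
g_io_request_direct() are invented here purely to illustrate running a
class's start method in the caller's context; they are not existing GEOM
interfaces.)

#include <sys/param.h>
#include <sys/bio.h>
#include <geom/geom.h>

#define	G_FLAG_DIRECT	0x8000	/* hypothetical "direct dispatch safe" flag */

static void
g_io_request_direct(struct bio *bp, struct g_consumer *cp)
{
	struct g_provider *pp = cp->provider;

	if ((pp->geom->flags & G_FLAG_DIRECT) == 0) {
		/* Class not marked direct-safe: classic queue/g_down handoff. */
		g_io_request(bp, cp);
		return;
	}

	/*
	 * The class declared itself safe for direct dispatch: run its
	 * start method right here, with no context switch.
	 */
	bp->bio_from = cp;
	bp->bio_to = pp;
	pp->geom->start(bp);
}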

I expect that this will reduce i/o latency by a great margin, thus directly 
addressing the performance problem that FusionIO makes an example of.  I'd like 
to also explore having the g_bio model not require a malloc at every stage in 
the stack/graph; even though going through UMA is fairly fast, it still 
represents overhead that can be eliminated.  It also represents an 
out-of-memory failure case that can be prevented.

I might try to work on this over the summer.  It's really a research project in 
my head at this point, but I'm hopeful that it'll show results.

Scott



Re: Increasing MAXPHYS

2010-03-21 Thread Scott Long
On Mar 21, 2010, at 8:56 AM, Andriy Gapon wrote:

> on 21/03/2010 16:05 Alexander Motin said the following:
>> Ivan Voras wrote:
>>> Hmm, it looks like it could be easy to spawn more g_* threads (and,
>>> barring specific class behaviour, it has a fair chance of working out of
>>> the box) but the incoming queue will need to also be broken up for
>>> greater effect.
>> 
>> According to "notes", looks there is a good chance to obtain races, as
>> some places expect only one up and one down thread.
> 
> I haven't given any deep thought to this issue, but I remember us discussing
> them over beer :-)
> I think one idea was making sure (somehow) that requests traveling over the 
> same
> edge of a geom graph (in the same direction) do it using the same 
> queue/thread.
> Another idea was to bring some netgraph-like optimization where some 
> (carefully
> chosen) geom vertices pass requests by a direct call instead of requeuing.
> 

Ah, I see that we were thinking about similar things.  Another tactic, and one 
that is
easier to prototype and implement than moving GEOM to a graph, is to allow 
separate
but related bio's to be chained.  If a caller, like maybe physio or the 
bufdaemon or 
even a middle geom transform, knows that it's going to send multiple bio's at 
once,
it chains them together into a single request, and that request gets pipelined 
through
the stack.  Each layer operates on the entire chain before requeueing to the 
next layer.
Layers/classes that can't operate this way will get the bio serialized 
automatically for them,
breaking the chain, but those won't be the common cases.  This will bring cache
locality benefits, and is something that is known to benefit
high-transaction-load network applications.
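
(A sketch of just the submission side of that idea; struct bio_link,
struct bio_chain and g_io_request_chain() are invented for illustration
and are not GEOM interfaces.  The win Scott describes would come from the
layers below treating the whole batch as one unit rather than one request
per g_down wakeup.)

#include <sys/param.h>
#include <sys/queue.h>
#include <sys/bio.h>
#include <geom/geom.h>

struct bio_link {
	struct bio		*bl_bio;
	STAILQ_ENTRY(bio_link)	 bl_next;
};

STAILQ_HEAD(bio_chain, bio_link);

/* Hand a whole chain of related bio's to the consumer back to back. */
static void
g_io_request_chain(struct bio_chain *bc, struct g_consumer *cp)
{
	struct bio_link *bl;

	STAILQ_FOREACH(bl, bc, bl_next)
		g_io_request(bl->bl_bio, cp);
}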

Scott



Re: Increasing MAXPHYS

2010-03-21 Thread Ulrich Spörlein
On Sat, 20.03.2010 at 12:17:33 -0600, Scott Long wrote:
> Windows has a MAXPHYS equivalent of 1M.  Linux has an equivalent of an
> odd number less than 512k.  For the purpose of benchmarking against these
> OS's, having comparable capabilities is essential; Linux easily beats FreeBSD
> in the silly-i/o-test because of the MAXPHYS difference (though FreeBSD 
> typically
> stomps linux in real I/O because of vastly better latency and caching 
> algorithms).
> I'm fine with raising MAXPHYS in production once the problems are addressed.

Hi Scott,

while I'm sure that most of the FreeBSD admins are aware of "silly"
benchmarks where Linux I/O seems to dwarf FreeBSD, do you have some
pointers regarding your statement that FreeBSD triumphs for real-world
I/O loads? Can this be simulated using iozone, bonnie, etc? More
importantly, is there a way to do this file system independently?

Regards,
Uli


Re: Increasing MAXPHYS

2010-03-21 Thread Scott Long

On Mar 20, 2010, at 1:26 PM, Alexander Motin wrote:

> Scott Long wrote:
>> On Mar 20, 2010, at 11:53 AM, Matthew Dillon wrote:
>>>   Diminishing returns get hit pretty quickly with larger MAXPHYS values.
>>>   As long as the I/O can be pipelined the reduced transaction rate
>>>   becomes less interesting when the transaction rate is less than a
>>>   certain level.  Off the cuff I'd say 2000 tps is a good basis for
>>>   considering whether it is an issue or not.  256K is actually quite
>>>   a reasonable value.  Even 128K is reasonable.
>> 
>> I agree completely.  I did quite a bit of testing on this in 2008 and 2009.
>> I even added some hooks into CAM to support this, and I thought that I had
>> discussed this extensively with Alexander at the time.  Guess it was yet 
>> another
>> wasted conversation with him =-(  I'll repeat it here for the record.
> 
> AFAIR at that time you've agreed that 256K gives improvements, and 64K
> of DFLTPHYS limiting most SCSI SIMs is too small. That's why you've
> implemented that hooks in CAM. I have not forgot that conversation (pity
> that it quietly died for SCSI SIMs). I agree that too high value could
> be just a waste of resources. As you may see I haven't blindly committed
> it, but asked public opinion. If you think 256K is OK - let it be 256K.
> If you think that 256K needed only for media servers - OK, but lets make
> it usable there.
> 

I think that somewhere in the range of 128-512k is appropriate for a given 
platform.
Maybe big-iron gets 512k and notebooks and embedded systems get 128k?  It's
partially a platform architecture issue, and partially a platform application 
issue.
Ultimately, it should be possible to have up to 1M, and maybe even more.  I 
don't
know how best to make that selectable, or whether it should just be the default.

>> Besides the nswbuf sizing problem, there is a real problem that a lot of 
>> drivers
>> have incorrectly assumed over the years that MAXPHYS and DFLTPHYS are
>> particular values, and they've sized their data structures accordingly.  
>> Before
>> these values are changed, an audit needs to be done OF EVERY SINGLE
>> STORAGE DRIVER.  No exceptions.  This isn't a case of changing MAXPHYS
>> in the ata driver, testing that your machine boots, and then committing the 
>> change
>> to source control.  Some drivers will have non-obvious restrictions based on
>> the number of SG elements allowed in a particular command format.  MPT
>> comes to mind (its multi message SG code seems to be broken when I tried
>> testing large MAXPHYS on it), but I bet that there are others.
> 
> As you should remember, we have made it in such way, that all unchecked
> drivers keep using DFLTPHYS, which is not going to be changed ever. So
> there is no problem. I would more worry about non-CAM storages and above
> stuff, like some rare GEOM classes.

And that's why I say that everything needs to be audited.  Are there CAM drivers
that default to being silent on cpi->maxio, but still look at DFLTPHYS and 
MAXPHYS?
Are there non-CAM drivers that look at MAXPHYS, or that silently assume that
MAXPHYS will never be more than 128k?

Scott



Re: Increasing MAXPHYS

2010-03-21 Thread Scott Long
On Mar 21, 2010, at 10:30 AM, Ulrich Spörlein wrote:
> On Sat, 20.03.2010 at 12:17:33 -0600, Scott Long wrote:
>> Windows has a MAXPHYS equivalent of 1M.  Linux has an equivalent of an
>> odd number less than 512k.  For the purpose of benchmarking against these
>> OS's, having comparable capabilities is essential; Linux easily beats FreeBSD
>> in the silly-i/o-test because of the MAXPHYS difference (though FreeBSD 
>> typically
>> stomps linux in real I/O because of vastly better latency and caching 
>> algorithms).
>> I'm fine with raising MAXPHYS in production once the problems are addressed.
> 
> Hi Scott,
> 
> while I'm sure that most of the FreeBSD admins are aware of "silly"
> benchmarks where Linux I/O seems to dwarf FreeBSD, do you have some
> pointers regarding your statement that FreeBSD triumphs for real-world
> I/O loads? Can this be simulated using iozone, bonnie, etc? More
> importantly, is there a way to do this file system independently?
> 

iozone and bonnie tend to be good at testing serialized I/O latency; each read 
and write is serialized without any buffering.  My experience is that they give 
mixed results, sometimes they favor freebsd, sometimes linux, sometimes it's a
wash, all because they are so sensitive to latency.  And that's where it also
gets hard to have a "universal" benchmark; what are you really trying to model, 
and how does that model reflect your actual workload?  Are you running a 
single-instance, single threaded application that is sensitive to latency?  Are 
you running a multi-instance/multi-threaded app that is sensitive to bandwidth? 
 Are you operating on a single file, or on a large tree of files, or on a raw 
device?  Are you sharing a small number of relatively stable file descriptors, 
or constantly creating and deleting files and truncating space?


Re: Increasing MAXPHYS

2010-03-21 Thread Ulrich Spörlein
[CC trimmed]
On Sun, 21.03.2010 at 10:39:10 -0600, Scott Long wrote:
> On Mar 21, 2010, at 10:30 AM, Ulrich Spörlein wrote:
> > On Sat, 20.03.2010 at 12:17:33 -0600, Scott Long wrote:
> >> Windows has a MAXPHYS equivalent of 1M.  Linux has an equivalent of an
> >> odd number less than 512k.  For the purpose of benchmarking against these
> >> OS's, having comparable capabilities is essential; Linux easily beats 
> >> FreeBSD
> >> in the silly-i/o-test because of the MAXPHYS difference (though FreeBSD 
> >> typically
> >> stomps linux in real I/O because of vastly better latency and caching 
> >> algorithms).
> >> I'm fine with raising MAXPHYS in production once the problems are 
> >> addressed.
> > 
> > Hi Scott,
> > 
> > while I'm sure that most of the FreeBSD admins are aware of "silly"
> > benchmarks where Linux I/O seems to dwarf FreeBSD, do you have some
> > pointers regarding your statement that FreeBSD triumphs for real-world
> > I/O loads? Can this be simulated using iozone, bonnie, etc? More
> > importantly, is there a way to do this file system independently?
> > 
> 
> iozone and bonnie tend to be good at testing serialized I/O latency; each 
> read and write is serialized without any buffering.  My experience is that 
> they give mixed results, sometimes they favor freebsd, sometimes linux, 
> sometimes it's a wash, all because they are so sensitive to latency.  And 
> that's where it also gets hard to have a "universal" benchmark; what are you 
> really trying to model, and how does that model reflect your actual workload? 
>  Are you running a single-instance, single threaded application that is 
> sensitive to latency?  Are you running a multi-instance/multi-threaded app 
> that is sensitive to bandwidth?  Are you operating on a single file, or on a 
> large tree of files, or on a raw device?  Are you sharing a small number of 
> relatively stable file descriptors, or constantly creating and deleting files 
> and truncating space?

All true; that's why I wanted to know from you which real-world
situations you encountered where FreeBSD did/does outperform Linux with
regard to I/O throughput and/or latency (depending on the scenario, of
course).

I hope you don't mind,
Uli


Re: Increasing MAXPHYS

2010-03-21 Thread Alexander Motin
Scott Long wrote:
> On Mar 20, 2010, at 1:26 PM, Alexander Motin wrote:
>> As you should remember, we have made it in such way, that all unchecked
>> drivers keep using DFLTPHYS, which is not going to be changed ever. So
>> there is no problem. I would more worry about non-CAM storages and above
>> stuff, like some rare GEOM classes.
> 
> And that's why I say that everything needs to be audited.  Are there CAM 
> drivers
> that default to being silent on cpi->maxio, but still look at DFLTPHYS and 
> MAXPHYS?

If some (most) drivers are silent on cpi->maxio, they will be limited to
the safe level of DFLTPHYS, which should never be changed. There should
be no problem.

> Are there non-CAM drivers that look at MAXPHYS, or that silently assume that
> MAXPHYS will never be more than 128k?

That is a question.

-- 
Alexander Motin


Re: Increasing MAXPHYS

2010-03-21 Thread Scott Long
On Mar 21, 2010, at 10:53 AM, Ulrich Spörlein wrote:
> [CC trimmed]
> On Sun, 21.03.2010 at 10:39:10 -0600, Scott Long wrote:
>> On Mar 21, 2010, at 10:30 AM, Ulrich Spörlein wrote:
>>> On Sat, 20.03.2010 at 12:17:33 -0600, Scott Long wrote:
 Windows has a MAXPHYS equivalent of 1M.  Linux has an equivalent of an
 odd number less than 512k.  For the purpose of benchmarking against these
 OS's, having comparable capabilities is essential; Linux easily beats 
 FreeBSD
 in the silly-i/o-test because of the MAXPHYS difference (though FreeBSD 
 typically
 stomps linux in real I/O because of vastly better latency and caching 
 algorithms).
 I'm fine with raising MAXPHYS in production once the problems are 
 addressed.
>>> 
>>> Hi Scott,
>>> 
>>> while I'm sure that most of the FreeBSD admins are aware of "silly"
>>> benchmarks where Linux I/O seems to dwarf FreeBSD, do you have some
>>> pointers regarding your statement that FreeBSD triumphs for real-world
>>> I/O loads? Can this be simulated using iozone, bonnie, etc? More
>>> importantly, is there a way to do this file system independently?
>>> 
>> 
>> iozone and bonnie tend to be good at testing serialized I/O latency; each 
>> read and write is serialized without any buffering.  My experience is that 
>> they give mixed results, sometimes they favor freebsd, sometimes linux, 
>> sometimes it's a wash, all because they are so sensitive to latency.  And 
>> that's where it also gets hard to have a "universal" benchmark; what are you 
>> really trying to model, and how does that model reflect your actual 
>> workload?  Are you running a single-instance, single threaded application 
>> that is sensitive to latency?  Are you running a 
>> multi-instance/multi-threaded app that is sensitive to bandwidth?  Are you 
>> operating on a single file, or on a large tree of files, or on a raw device? 
>>  Are you sharing a small number of relatively stable file descriptors, or 
>> constantly creating and deleting files and truncating space?
> 
> All true, that's why I wanted to know from you, which real world
> situations you encountered where FreeBSD did/does outperform Linux in
> regards to I/O throughput and/or latency (depending on scenario, of
> course).


I have some tests that spawn N threads and then do sequential and
random i/o either into a filesystem or a raw disk.  FreeBSD gets more work done
with fewer I/O's than linux when you're operating through the filesystem,
thanks to softupdates and the block layer.  Linux has a predictive cache that
often will generate too much i/o in a vain attempt to aggressively 
prefetch blocks. So even then it's hard to measure in a simple way; linux will 
do more i/o, but less of it will be useful to the application, thereby 
increasing latency and increasing application runtime.  Sorry I can't be more 
specific, but you're asking for something that I explicitly say I can't provide.

Scott



Re: Increasing MAXPHYS

2010-03-21 Thread Julian Elischer

Scott Long wrote:


I agree that more threads just creates many more race
complications.  Even if it didn't, the storage driver is a
serialization point; it doesn't matter if you have a dozen g_*
threads if only one of them can be in the top half of the driver at
a time.  No amount of fine-grained locking is going to help this.


Well that depends on the driver and device..
We have multiple linux threads coming in the top under some
setups so it wouldn't be a problem.



I'd like to go in the opposite direction.  The queue-dispatch-queue
model of GEOM is elegant and easy to extend, but very wasteful for
the simple case, where the simple case is one or two simple
partition transforms (mbr, bsdlabel) and/or a simple stripe/mirror
transform.  None of these need a dedicated dispatch context in
order to operate.  What I'd like to explore is compiling the GEOM
stack at creation time into a linear array of operations that
happen without a g_down/g_up context switch.  As providers and
consumers taste each other and build a stack, that stack gets
compiled into a graph, and that graph gets executed directly from
the calling context, both from the dev_strategy() side on the top
and the bio_done() on the bottom.  GEOM classes that need a
detached context can mark themselves as such, doing so will prevent
a graph from being created, and the current dispatch model will be
retained.


I've considered similar ideas, like providing a non-queuing option for
some simple transformations.




I expect that this will reduce i/o latency by a great margin, thus
directly addressing the performance problem that FusionIO makes an
example of.  I'd like to also explore having the g_bio model not
require a malloc at every stage in the stack/graph; even though
going through UMA is fairly fast, it still represents overhead that
can be eliminated.  It also represents an out-of-memory failure
case that can be prevented.

I might try to work on this over the summer.  It's really a
research project in my head at this point, but I'm hopeful that
it'll show results.

Scott



Re: Increasing MAXPHYS

2010-03-21 Thread jhell


On Sun, 21 Mar 2010 10:04, mav@ wrote:

Julian Elischer wrote:

In the Fusion-io driver we find that the limiting factor is not the
size of MAXPHYS, but the fact that we can not push more than
170k tps through geom. (in my test machine. I've seen more on some
beefier machines), but that is only a limit on small transactions,
or in the case of large transfers the DMA engine tops out before a
bigger MAXPHYS would make any difference.


Yes, GEOM is quite CPU-hungry on high request rates due to number of
context switches. But impact probably may be reduced from two sides: by
reducing overhead per request, or by reducing number of requests. Both
ways may give benefits.

If common opinion is not to touch defaults now - OK, agreed. (Note,
Scott, I have agreed :)) But returning to the original question, does
somebody knows real situation when increased MAXPHYS still causes
problems? At least to make it safe.




I played with it on one re-compile of a kernel, setting DFLTPHYS=128 and
MAXPHYS=256 for the sake of it, and found out that I could not cause a crash
dump to be performed upon request (reboot -d) due to the 65536 DMA boundary
being hit. Obviously this would have to be adjusted in ata-dma.c.


I suppose that there would have to be a better way to get the real 
allowable boundary from the running system instead of setting it statically.


Other than the above I do not see a reason why not... It is HEAD and this 
is the type of experimental stuff it was meant for.


Regards,

--

 jhell



Re: Increasing MAXPHYS

2010-03-21 Thread jhell


On Sun, 21 Mar 2010 20:54, jhell@ wrote:


On Sun, 21 Mar 2010 10:04, mav@ wrote:

Julian Elischer wrote:

In the Fusion-io driver we find that the limiting factor is not the
size of MAXPHYS, but the fact that we can not push more than
170k tps through geom. (in my test machine. I've seen more on some
beefier machines), but that is only a limit on small transactions,
or in the case of large transfers the DMA engine tops out before a
bigger MAXPHYS would make any difference.


Yes, GEOM is quite CPU-hungry on high request rates due to number of
context switches. But impact probably may be reduced from two sides: by
reducing overhead per request, or by reducing number of requests. Both
ways may give benefits.

If common opinion is not to touch defaults now - OK, agreed. (Note,
Scott, I have agreed :)) But returning to the original question, does
somebody knows real situation when increased MAXPHYS still causes
problems? At least to make it safe.




I played with it on one re-compile of a kernel and for the sake of it 
DFLTPHYS=128 MAXPHYS=256 and found out that I could not cause a crash dump to 
be performed upon request (reboot -d) due to the boundary being hit for DMA 
which is 65536. Obviously this would have to be adjusted in ata-dma.c.


I suppose that there would have to be a better way to get the real allowable 
boundary from the running system instead of setting it statically.


Other than the above I do not see a reason why not... It is HEAD and this is 
the type of experimental stuff it was meant for.


Regards,




I should have also said that I also repeated the above without setting 
DFLTPHYS and setting MAXPHYS to 256.


Regards,

--

 jhell



Re: Increasing MAXPHYS

2010-03-21 Thread Alexander Motin
jhell wrote:
> On Sun, 21 Mar 2010 20:54, jhell@ wrote:
>> I played with it on one re-compile of a kernel and for the sake of it
>> DFLTPHYS=128 MAXPHYS=256 and found out that I could not cause a crash
>> dump to be performed upon request (reboot -d) due to the boundary
>> being hit for DMA which is 65536. Obviously this would have to be
>> adjusted in ata-dma.c.
>>
>> I suppose that there would have to be a better way to get the real
>> allowable boundary from the running system instead of setting it
>> statically.
>>
>> Other than the above I do not see a reason why not... It is HEAD and
>> this is the type of experimental stuff it was meant for.
> 
> I should have also said that I also repeated the above without setting
> DFLTPHYS and setting MAXPHYS to 256.

It was a bad idea to increase DFLTPHYS. It is not intended to be increased.

About the DMA boundary, I do not quite understand the problem. Yes, legacy
ATA has a DMA boundary of 64K, but there is no problem submitting an S/G
list of several segments. How long ago did you try it, on which controller,
and what diagnostics do you have?
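
(Plain arithmetic to illustrate the point: a 64K boundary does not cap
the size of the transfer, it only caps how much of it any one S/G segment
may carry.  Illustrative program, not driver code.)

#include <stdio.h>

int
main(void)
{
	const unsigned long boundary = 64 * 1024;	/* legacy ATA DMA limit */
	const unsigned long xfer = 256 * 1024;		/* one larger-MAXPHYS I/O */

	printf("a %luK transfer needs at least %lu S/G segments "
	    "with a %luK boundary\n",
	    xfer / 1024, (xfer + boundary - 1) / boundary, boundary / 1024);
	return (0);
}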

-- 
Alexander Motin


Re: Increasing MAXPHYS

2010-03-22 Thread Poul-Henning Kamp
In message <4ba633a0.2090...@icyb.net.ua>, Andriy Gapon writes:
>on 21/03/2010 16:05 Alexander Motin said the following:
>> Ivan Voras wrote:
>>> Hmm, it looks like it could be easy to spawn more g_* threads (and,
>>> barring specific class behaviour, it has a fair chance of working out of
>>> the box) but the incoming queue will need to also be broken up for
>>> greater effect.
>> 
>> According to "notes", looks there is a good chance to obtain races, as
>> some places expect only one up and one down thread.
>
>I haven't given any deep thought to this issue, but I remember us discussing
>them over beer :-)

The easiest way to obtain more parallelism, is to divide the mesh into
multiple independent meshes.

This will do you no good if you have five disks in a RAID-5 config, but
if you have two disks each mounted on its own filesystem, you can run
a g_up & g_down for each of them.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


Re: Increasing MAXPHYS

2010-03-22 Thread Gary Jennejohn
On Sun, 21 Mar 2010 19:03:56 +0200
Alexander Motin  wrote:

> Scott Long wrote:
> > Are there non-CAM drivers that look at MAXPHYS, or that silently assume that
> > MAXPHYS will never be more than 128k?
> 
> That is a question.
> 

I only did a quick&dirty grep looking for MAXPHYS in /sys.

Some drivers redefine MAXPHYS to be 512KiB.  Some use their own local
MAXPHYS which is usually 128KiB.

Some look at MAXPHYS to figure out other things; the details escape me.

There's one driver which actually uses 100*MAXPHYS for something, but I
didn't check the details.

Lots of them were non-CAM drivers AFAICT.

--
Gary Jennejohn


Re: Increasing MAXPHYS

2010-03-22 Thread Alexander Leidinger

Quoting Scott Long  (from Sat, 20 Mar 2010 12:17:33 -0600):

code was actually taking advantage of the larger I/O's.  The improvement really
depends on the workload, of course, and I wouldn't expect it to be noticeable
for most people unless they're running something like a media server.


I don't think this is limited to media servers; think about situations
where you process a large amount of data sequentially... (the sequential
access case in a big data-warehouse scenario, or a 3D render farm which
gets its huge amount of data from a shared resource (the "how many
render-clients can I support at the same time with my disk
infrastructure" scenario), or some of the bigtable/nosql stuff which
seems to be more and more popular at some sites). There are enough
situations where sequential file access is the key performance metric
that I wouldn't say that only media servers depend upon large
sequential I/O's.


Bye,
Alexander.

--
That's life.
What's life?
A magazine.
How much does it cost?
Two-fifty.
I only have a dollar.
That's life.

http://www.Leidinger.netAlexander @ Leidinger.net: PGP ID = B0063FE7
http://www.FreeBSD.org   netchild @ FreeBSD.org  : PGP ID = 72077137


Re: Increasing MAXPHYS

2010-03-22 Thread John Baldwin
On Monday 22 March 2010 7:40:18 am Gary Jennejohn wrote:
> On Sun, 21 Mar 2010 19:03:56 +0200
> Alexander Motin  wrote:
> 
> > Scott Long wrote:
> > > Are there non-CAM drivers that look at MAXPHYS, or that silently assume 
that
> > > MAXPHYS will never be more than 128k?
> > 
> > That is a question.
> > 
> 
> I only did a quick&dirty grep looking for MAXPHYS in /sys.
> 
> Some drivers redefine MAXPHYS to be 512KiB.  Some use their own local
> MAXPHYS which is usually 128KiB.
> 
> Some look at MAXPHYS to figure out other things; the details escape me.
> 
> There's one driver which actually uses 100*MAXPHYS for something, but I
> didn't check the details.
> 
> Lots of them were non-CAM drivers AFAICT.

The problem is the drivers that _don't_ reference MAXPHYS.  The driver author 
at the time "knew" that MAXPHYS was 128k, so he did the MAXPHYS-dependent 
calculation and just put the result in the driver (e.g. only supporting up to 
32 segments (32 4k pages == 128k) in a bus dma tag as a magic number to 
bus_dma_tag_create() w/o documenting that the '32' was derived from 128k or 
what the actual hardware limit on nsegments is).  These cannot be found by a 
simple grep, they require manually inspecting each driver.
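
(To make that concrete, a hedged sketch of the kind of tag creation John is
describing; example_softc, its fields and HW_MAX_SEGS are hypothetical, and
the point is only the nsegments argument, derived from MAXPHYS and the
hardware limit instead of a bare 32.)

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/bus.h>
#include <machine/bus.h>

#define	HW_MAX_SEGS	255	/* hypothetical controller S/G limit */

struct example_softc {
	struct mtx	mtx;
	bus_dma_tag_t	data_dmat;
};

static int
example_create_data_tag(device_t dev, struct example_softc *sc)
{
	return (bus_dma_tag_create(
	    bus_get_dma_tag(dev),		/* parent */
	    1, 0,				/* alignment, boundary */
	    BUS_SPACE_MAXADDR,			/* lowaddr */
	    BUS_SPACE_MAXADDR,			/* highaddr */
	    NULL, NULL,				/* filter, filterarg */
	    MAXPHYS,				/* maxsize: track the kernel */
	    MIN(btoc(MAXPHYS) + 1, HW_MAX_SEGS),/* nsegments: not a bare 32 */
	    BUS_SPACE_MAXSIZE_32BIT,		/* maxsegsz */
	    0,					/* flags */
	    busdma_lock_mutex, &sc->mtx,	/* lockfunc, lockfuncarg */
	    &sc->data_dmat));
}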

-- 
John Baldwin


Re: Increasing MAXPHYS

2010-03-22 Thread jhell


On Mon, 22 Mar 2010 01:53, Alexander Motin wrote:
In Message-Id: <4ba705cb.9090...@freebsd.org>


jhell wrote:

On Sun, 21 Mar 2010 20:54, jhell@ wrote:

I played with it on one re-compile of a kernel and for the sake of it
DFLTPHYS=128 MAXPHYS=256 and found out that I could not cause a crash
dump to be performed upon request (reboot -d) due to the boundary
being hit for DMA which is 65536. Obviously this would have to be
adjusted in ata-dma.c.

I suppose that there would have to be a better way to get the real
allowable boundary from the running system instead of setting it
statically.

Other then the above I do not see a reason why not... It is HEAD and
this is the type of experimental stuff it was meant for.


I should have also said that I repeated the above without setting
DFLTPHYS, only setting MAXPHYS to 256.


It was a bad idea to increase DFLTPHYS. It is not intended to be increased.



I just wanted to see what I could break; when I increased DFLTPHYS it was
just for that purpose.  It booted and everything was running afterwards.
It wasn't running long enough to do any damage.



About the DMA boundary, I do not quite understand the problem. Yes, legacy
ATA has a DMA boundary of 64K, but there is no problem with submitting an
S/G list of several segments. How long ago did you try it, on which
controller, and what diagnostics do you have?




atap...@pci0:0:31:1:
class=0x01018a card=0x01271028 chip=0x24cb8086 rev=0x01 hdr=0x00
vendor = 'Intel Corporation'
device = '82801DB/DBL (ICH4/ICH4-L) UltraATA/100 EIDE Controller'
class  = mass storage
subclass   = ATA

I do not have any diagnostics, but if any are requested I do have the
kernels that I have tuned to the above values readily available to run
again.


I first tuned MAXPHYS roughly 7 weeks ago.  That lasted until, a week
later, I noticed I could not get a crash dump for a problem I was having
and had to revert back to the default setting of 128.  That problem
itself was unrelated.


Two days ago, when I saw this thread, I recalled having modified MAXPHYS
but could not remember the problem it had caused, so I re-enabled it to
reproduce the problem and be sure.


If there is anything else you need, please ask.

Regards,

--

 jhell

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Increasing MAXPHYS

2010-03-22 Thread Alexander Sack
On Mon, Mar 22, 2010 at 8:39 AM, John Baldwin  wrote:
> On Monday 22 March 2010 7:40:18 am Gary Jennejohn wrote:
>> On Sun, 21 Mar 2010 19:03:56 +0200
>> Alexander Motin  wrote:
>>
>> > Scott Long wrote:
>> > > Are there non-CAM drivers that look at MAXPHYS, or that silently assume
> that
>> > > MAXPHYS will never be more than 128k?
>> >
>> > That is a question.
>> >
>>
>> I only did a quick&dirty grep looking for MAXPHYS in /sys.
>>
>> Some drivers redefine MAXPHYS to be 512KiB.  Some use their own local
>> MAXPHYS which is usually 128KiB.
>>
>> Some look at MAXPHYS to figure out other things; the details escape me.
>>
>> There's one driver which actually uses 100*MAXPHYS for something, but I
>> didn't check the details.
>>
>> Lots of them were non-CAM drivers AFAICT.
>
> The problem is the drivers that _don't_ reference MAXPHYS.  The driver author
> at the time "knew" that MAXPHYS was 128k, so he did the MAXPHYS-dependent
> calculation and just put the result in the driver (e.g. only supporting up to
> 32 segments (32 4k pages == 128k) in a bus dma tag as a magic number to
> bus_dma_tag_create() w/o documenting that the '32' was derived from 128k or
> what the actual hardware limit on nsegments is).  These cannot be found by a
> simple grep, they require manually inspecting each driver.

100% awesome comment.  On another kernel, I myself was guilty of this
crime (I did have a nice comment though above the def).

This has been a great thread, since our application really needs some
of the optimizations that are being thrown around here.  We have found
in real-life performance testing that we are almost always either
controller bound (i.e. adding more disks to spread IOPs has little to
no effect on throughput in large array configurations; we suspect that
is hitting the RAID controller's firmware limitations) or tps bound;
either way, I never thought going from 128k -> 256k per transaction
would have a dramatic effect on throughput (but I never verified it).

Back to HBAs: AFAIK, every modern iteration of the most popular HBAs
can easily do way more than a 128k scatter/gather I/O.  Do you guys
know of any *modern* HBA (from within the last 3-4 years) that cannot
do more than 128k at a shot?

In other words, I've always thought the limit was kernel imposed and
not what the memory controller on the card can do (I certainly never
got the impression talking with some of the IHVs over the years that
they were designing their hardware for a 128k limit - I sure hope
not!).

-aps
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Increasing MAXPHYS

2010-03-22 Thread Scott Long
On Mar 22, 2010, at 9:52 AM, Alexander Sack wrote:
> On Mon, Mar 22, 2010 at 8:39 AM, John Baldwin  wrote:
>> On Monday 22 March 2010 7:40:18 am Gary Jennejohn wrote:
>>> On Sun, 21 Mar 2010 19:03:56 +0200
>>> Alexander Motin  wrote:
>>> 
 Scott Long wrote:
> Are there non-CAM drivers that look at MAXPHYS, or that silently assume
>> that
> MAXPHYS will never be more than 128k?
 
 That is a question.
 
>>> 
>>> I only did a quick&dirty grep looking for MAXPHYS in /sys.
>>> 
>>> Some drivers redefine MAXPHYS to be 512KiB.  Some use their own local
>>> MAXPHYS which is usually 128KiB.
>>> 
>>> Some look at MAXPHYS to figure out other things; the details escape me.
>>> 
>>> There's one driver which actually uses 100*MAXPHYS for something, but I
>>> didn't check the details.
>>> 
>>> Lots of them were non-CAM drivers AFAICT.
>> 
>> The problem is the drivers that _don't_ reference MAXPHYS.  The driver author
>> at the time "knew" that MAXPHYS was 128k, so he did the MAXPHYS-dependent
>> calculation and just put the result in the driver (e.g. only supporting up to
>> 32 segments (32 4k pages == 128k) in a bus dma tag as a magic number to
>> bus_dma_tag_create() w/o documenting that the '32' was derived from 128k or
>> what the actual hardware limit on nsegments is).  These cannot be found by a
>> simple grep, they require manually inspecting each driver.
> 
> 100% awesome comment.  On another kernel, I myself was guilty of this
> crime (I did have a nice comment though above the def).
> 
> This has been a great thread since our application really needs some
> of the optimizations that are being thrown around here.  We have found
> in real live performance testing that we are almost always either
> controller bound (i.e. adding more disks to spread IOPs has little to
> no effect in large array configurations on throughput, we suspect that
> is hitting the RAID controller's firmware limitations) or tps bound,
> i.e. I never thought going from 128k -> 256k per transaction would
> have a dramatic effect on throughput (but I never verified).
> 
> Back to HBAs,  AFAIK, every modern iteration of the most popular HBAs
> can easily do way more than a 128k scatter/gather I/O.  Do you guys
> know of any *modern* (circa within the last 3-4 years) that can not do
> more than 128k at a shot?

Anything over 64K is broken in MPT at the moment.  The hardware can do it,
the driver thinks it can do it, but it fails.  AAC hardware traditionally
cannot, but maybe the firmware has been improved in the past few years.  I
know that there are other low-performance devices that can't do more than
64 or 128K, but none are coming to mind at the moment.  Still, it shouldn't
be a universal assumption that all hardware can do big I/O's.

Another consideration is that some hardware can do big I/O's, but not very 
efficiently.  Not all DMA engines are created equal, and moving to compound 
commands and excessively long S/G lists can be a pessimization.  For example, 
MFI hardware does a hinted prefetch on the segment list, but once you exceed a 
certain limit, that prefetch doesn't work anymore and the firmware has to take 
the slow path to execute the i/o.  I haven't quantified this penalty yet, but 
it's something that should be thought about.
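
To put rough numbers on that: a page-aligned transfer needs MAXPHYS/PAGE_SIZE
segments in the worst case where no pages happen to be physically contiguous,
i.e. 32 entries at 128k, 64 at 256k and 256 at 1M with 4k pages (one more each
if the buffer isn't page aligned).  If a controller's fast path only
prefetches some fixed number of entries (a purely hypothetical 64-entry
window, say), then raising MAXPHYS past 256k pushes every large I/O onto the
slow path even though the hardware nominally supports it.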

> 
> In other words, I've always thought the limit was kernel imposed and
> not what the memory controller on the card can do (I certainly never
> got the impression talking with some of the IHVs over the years that
> they were designing their hardware for a 128k limit - I sure hope
> not!).

You'd be surprised at the engineering compromises and handicaps that are
committed at IHVs because of misguided marketers.

Scott

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Increasing MAXPHYS

2010-03-22 Thread M. Warner Losh
In message: 
Scott Long  writes:
: I'd like to go in the opposite direction.  The queue-dispatch-queue
: model of GEOM is elegant and easy to extend, but very wasteful for
: the simple case, where the simple case is one or two simple
: partition transforms (mbr, bsdlabel) and/or a simple stripe/mirror
: transform.  None of these need a dedicated dispatch context in order
: to operate.  What I'd like to explore is compiling the GEOM stack at
: creation time into a linear array of operations that happen without
: a g_down/g_up context switch.  As providers and consumers taste each
: other and build a stack, that stack gets compiled into a graph, and
: that graph gets executed directly from the calling context, both
: from the dev_strategy() side on the top and the bio_done() on the
: bottom.  GEOM classes that need a detached context can mark
: themselves as such, doing so will prevent a graph from being
: created, and the current dispatch model will be retained.

I have a few things to say on this.

First, I've done similar things at past companies for systems that are
similar to geom's queueing environment.  It is possible to convert the
queueing nodes in the graph to filtering nodes in the graph.  Another
way to look at this is to say you're implementing direct dispatch into
geom's stack.  This can be both good and bad, but should reduce
latency a lot.

One problem that I see is that you are calling into the driver from a
different set of contexts.  The queueing stuff was there to protect
the driver from LoRs due to its routines being called from many
different contexts, sometimes with other locks held (fact of life
often in the kernel).

So this certainly is something worth exploring, especially if we have
optimized paths for up/down for certain geom classes while still
allowing the current robust, but slow, paths for the more complicated
nodes in the tree.  It remains to be seen if there are going to be issues
around locking order, but we've hit that with both geom and ifnet in
the past, so caution (e.g., running with WITNESS turned on early and
often) is advised.
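
As a purely illustrative sketch of what such a compiled fast path could look
like (none of the g_flat_* names exist in GEOM; only g_io_request() and
g_io_deliver() are real, and bio cloning, statistics and the completion side
are all glossed over):

#include <sys/param.h>
#include <sys/bio.h>
#include <geom/geom.h>

/* One entry per transform that was "compiled" out of the GEOM stack. */
struct g_flat_stage {
        int     (*fs_start)(void *sc, struct bio *bp); /* e.g. add offset */
        void    *fs_softc;
};

struct g_flat_path {
        int                     fp_nstages;
        struct g_flat_stage     fp_stage[8];    /* mbr + bsdlabel + ... */
        struct g_consumer       *fp_disk;       /* bottom of the stack */
};

/* Run every transform in the caller's context; no g_down handoff. */
static void
g_flat_strategy(struct g_flat_path *fp, struct bio *bp)
{
        int error, i;

        for (i = 0; i < fp->fp_nstages; i++) {
                error = fp->fp_stage[i].fs_start(fp->fp_stage[i].fs_softc, bp);
                if (error != 0) {
                        g_io_deliver(bp, error);        /* fail in-line */
                        return;
                }
        }
        g_io_request(bp, fp->fp_disk);  /* straight to the disk provider */
}

The locking concern above lands exactly in fs_start(): it now runs with
whatever locks the original caller happens to hold, which is where WITNESS
earns its keep.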

Warner
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Increasing MAXPHYS

2010-03-22 Thread Alexander Sack
On Mon, Mar 22, 2010 at 2:45 PM, M. Warner Losh  wrote:
> In message: 
>            Scott Long  writes:
> : I'd like to go in the opposite direction.  The queue-dispatch-queue
> : model of GEOM is elegant and easy to extend, but very wasteful for
> : the simple case, where the simple case is one or two simple
> : partition transforms (mbr, bsdlabel) and/or a simple stripe/mirror
> : transform.  None of these need a dedicated dispatch context in order
> : to operate.  What I'd like to explore is compiling the GEOM stack at
> : creation time into a linear array of operations that happen without
> : a g_down/g_up context switch.  As providers and consumers taste each
> : other and build a stack, that stack gets compiled into a graph, and
> : that graph gets executed directly from the calling context, both
> : from the dev_strategy() side on the top and the bio_done() on the
> : bottom.  GEOM classes that need a detached context can mark
> : themselves as such, doing so will prevent a graph from being
> : created, and the current dispatch model will be retained.
>
> I have a few things to say on this.
>
> First, I've done similar things at past companies for systems that are
> similar to geom's queueing environment.  It is possible to convert the
> queueing nodes in the graph to filtering nodes in the graph.  Another
> way to look at this is to say you're implementing direct dispatch into
> geom's stack.  This can be both good and bad, but should reduce
> latency a lot.
>
> One problem that I see is that you are calling into the driver from a
> different set of contexts.  The queueing stuff was there to protect
> the driver from LoRs due to its routines being called from many
> different contexts, sometimes with other locks held (fact of life
> often in the kernel).
>
> So this certainly is something worth exploring, especially if we have
> optimized paths for up/down for certain geom classes while still
> allowing the current robust, but slow, paths for the more complicated
> nodes in the tree.  It remains to be see if there's going to be issues
> around locking order, but we've hit that with both geom and ifnet in
> the past, so caution (eg, running with WITNESS turned on early and
> often) is advised.

Am I going crazy or does this sound a lot like Sun/SVR's stream based
network stack?

(Both the design and the problems: the stream stack's locking was
notoriously tricky for exactly the issue mentioned above, different
running contexts with different locking granularity/requirements.)

-aps
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Increasing MAXPHYS

2010-03-22 Thread Poul-Henning Kamp
In message <3c0b01821003221207p4e4eecabqb4f448813bf5a...@mail.gmail.com>,
Alexander Sack writes:

>Am I going crazy or does this sound a lot like Sun/SVR's stream based
>network stack?

That is a good and pertinent observation.

I did investigate a number of optimizations to the g_up/g_down scheme
I eventually adopted, but found none that gained anything justifying
the complexity they brought.

In some cases, the optimizations used more CPU cycles than the straight
g_up/g_down path, but obviously, the circumstances are vastly different
with CPUs having 10 times higher clock, multiple cores and SSD disks,
so a fresh look at this tradeoff is in order.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Increasing MAXPHYS

2010-03-22 Thread Pawel Jakub Dawidek
On Mon, Mar 22, 2010 at 08:23:43AM +, Poul-Henning Kamp wrote:
> In message <4ba633a0.2090...@icyb.net.ua>, Andriy Gapon writes:
> >on 21/03/2010 16:05 Alexander Motin said the following:
> >> Ivan Voras wrote:
> >>> Hmm, it looks like it could be easy to spawn more g_* threads (and,
> >>> barring specific class behaviour, it has a fair chance of working out of
> >>> the box) but the incoming queue will need to also be broken up for
> >>> greater effect.
> >> 
> >> According to "notes", looks there is a good chance to obtain races, as
> >> some places expect only one up and one down thread.
> >
> >I haven't given any deep thought to this issue, but I remember us discussing
> >them over beer :-)
> 
> The easiest way to obtain more parallelism, is to divide the mesh into
> multiple independent meshes.
> 
> This will do you no good if you have five disks in a RAID-5 config, but
> if you have two disks each mounted on its own filesystem, you can run
> a g_up & g_down for each of them.

A class is supposed to interact with other classes only via GEOM, so I
think it should be safe to choose g_up/g_down threads for each class
individually, for example:

/dev/ad0s1a (DEV)
   |
g_up_0 + g_down_0
   |
 ad0s1a (BSD)
   |
g_up_1 + g_down_1
   |
 ad0s1 (MBR)
   |
g_up_2 + g_down_2
   |
 ad0 (DISK)

We could easily calculate the g_down thread based on bio_to->geom->class
and the g_up thread based on bio_from->geom->class, so we know I/O requests
for our class are always coming from the same threads.

If we could make the same assumption for geoms it would allow for even
better distribution.
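
A small sketch of that selection, assuming a hypothetical fixed pool of
g_down threads with one queue each (the pool, queue and function names are
made up; only the bio_to->geom->class lookup is existing GEOM structure, and
per-queue locking is omitted for brevity):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <geom/geom.h>

#define G_DOWN_NTHREADS 4               /* made-up pool size */

static struct bio_queue_head g_down_queue[G_DOWN_NTHREADS];

static int
g_down_idx(struct bio *bp)
{
        /* All I/O for a given class always lands on the same thread. */
        uintptr_t key = (uintptr_t)bp->bio_to->geom->class;

        return ((key >> 4) % G_DOWN_NTHREADS);
}

static void
g_io_schedule_down_per_class(struct bio *bp)
{
        int idx = g_down_idx(bp);

        bioq_insert_tail(&g_down_queue[idx], bp);
        wakeup(&g_down_queue[idx]);     /* that thread's sleep channel */
}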

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: Increasing MAXPHYS

2010-03-22 Thread Scott Long

On Mar 22, 2010, at 5:36 PM, Pawel Jakub Dawidek wrote:

> On Mon, Mar 22, 2010 at 08:23:43AM +, Poul-Henning Kamp wrote:
>> In message <4ba633a0.2090...@icyb.net.ua>, Andriy Gapon writes:
>>> on 21/03/2010 16:05 Alexander Motin said the following:
 Ivan Voras wrote:
> Hmm, it looks like it could be easy to spawn more g_* threads (and,
> barring specific class behaviour, it has a fair chance of working out of
> the box) but the incoming queue will need to also be broken up for
> greater effect.
 
 According to "notes", looks there is a good chance to obtain races, as
 some places expect only one up and one down thread.
>>> 
>>> I haven't given any deep thought to this issue, but I remember us discussing
>>> them over beer :-)
>> 
>> The easiest way to obtain more parallelism, is to divide the mesh into
>> multiple independent meshes.
>> 
>> This will do you no good if you have five disks in a RAID-5 config, but
>> if you have two disks each mounted on its own filesystem, you can run
>> a g_up & g_down for each of them.
> 
> A class is suppose to interact with other classes only via GEOM, so I
> think it should be safe to choose g_up/g_down threads for each class
> individually, for example:
> 
>   /dev/ad0s1a (DEV)
>  |
>   g_up_0 + g_down_0
>  |
>ad0s1a (BSD)
>  |
>   g_up_1 + g_down_1
>  |
>ad0s1 (MBR)
>  |
>   g_up_2 + g_down_2
>  |
>ad0 (DISK)
> 
> We could easly calculate g_down thread based on bio_to->geom->class and
> g_up thread based on bio_from->geom->class, so we know I/O requests for
> our class are always coming from the same threads.
> 
> If we could make the same assumption for geoms it would allow for even
> better distribution.

The whole point of the discussion, sans PHK's interlude, is to reduce the 
context switches and indirection, not to increase it.  But if you can show 
decreased latency/higher-iops benefits of increasing it, more power to you.  I 
would think that the results of DFly's experiment with 
parallelism-via-more-queues would serve as a good warning, though.

Scott

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Increasing MAXPHYS

2010-03-22 Thread Julian Elischer

Pawel Jakub Dawidek wrote:

On Mon, Mar 22, 2010 at 08:23:43AM +, Poul-Henning Kamp wrote:

In message <4ba633a0.2090...@icyb.net.ua>, Andriy Gapon writes:

on 21/03/2010 16:05 Alexander Motin said the following:

Ivan Voras wrote:

Hmm, it looks like it could be easy to spawn more g_* threads (and,
barring specific class behaviour, it has a fair chance of working out of
the box) but the incoming queue will need to also be broken up for
greater effect.

According to "notes", looks there is a good chance to obtain races, as
some places expect only one up and one down thread.

I haven't given any deep thought to this issue, but I remember us discussing
them over beer :-)

The easiest way to obtain more parallelism, is to divide the mesh into
multiple independent meshes.

This will do you no good if you have five disks in a RAID-5 config, but
if you have two disks each mounted on its own filesystem, you can run
a g_up & g_down for each of them.


A class is suppose to interact with other classes only via GEOM, so I
think it should be safe to choose g_up/g_down threads for each class
individually, for example:

/dev/ad0s1a (DEV)
   |
g_up_0 + g_down_0
   |
 ad0s1a (BSD)
   |
g_up_1 + g_down_1
   |
 ad0s1 (MBR)
   |
g_up_2 + g_down_2
   |
 ad0 (DISK)

We could easly calculate g_down thread based on bio_to->geom->class and
g_up thread based on bio_from->geom->class, so we know I/O requests for
our class are always coming from the same threads.

If we could make the same assumption for geoms it would allow for even
better distribution.


That doesn't really help my problem, however.  I just want to access the
base provider directly with no geom thread involved.






___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Increasing MAXPHYS

2010-03-23 Thread Matthew Dillon
:The whole point of the discussion, sans PHK's interlude, is to reduce the
:context switches and indirection, not to increase it.  But if you can show
:decreased latency/higher-iops benefits of increasing it, more power to you.  I
:would think that the results of DFly's experiment with
:parallelism-via-more-queues would serve as a good warning, though.
:
:Scott

Well, I'm not sure what experiment you are referring to, but I'll assume
it's the network threading, which works quite well actually.  The protocol
threads can be matched against the toeplitz function and in that case
the entire packet stream operates lockless.  Even without the matching
we still get good benefits from batching (e.g. via ether_input_chain())
which drops the IPI and per-packet switch overhead basically to zero.
We have other issues but the protocol threads aren't one of them.

In any case, the lesson to learn with batching to a thread is that you
don't want the thread to immediately preempt the sender (if it happens
to be on the same cpu), or to generate an instant IPI (if going between
cpus).  This creates a degenerate case where you wind up with a
thread switch on each message or an excessive messaging interrupt
rate... THAT is what seriously screws up performance.  The key is to
be able to batch multiple messages per thread switch when under load
and to be able to maintain a pipeline.

A single user-process test case will always have a bit more latency
and can wind up being inefficient for a variety of other reasons
(e.g. whether the target thread is on the same cpu or not),
but that becomes less relevant when the machine is under load, so it's
a self-correcting problem for the most part.

Once the machine is under load batching becomes highly efficient.
That is, latency != cpu cycle cost under load.  When the threads
have enough work to do they can pick up the next message without the
cost of entering a sleep state or needing a wakeup (or needing to
generate an actual IPI interrupt, etc).  Plus you can run lockless
and you get excellent cache locality.  So as long as you ensure these
optimal operations become the norm under load you win.

Getting the threads to pipeline properly and avoid unnecessary
tsleeps and wakeups is the hard part.
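
A generic sketch of this batching pattern (made-up names, not code from
either kernel): the producer only issues a wakeup on the empty-to-non-empty
transition, and the worker drains everything it can find per wakeup, so under
load messages ride an already-hot pipeline instead of paying a sleep/wakeup
per message.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

struct msg {
        TAILQ_ENTRY(msg) m_link;
};

static TAILQ_HEAD(, msg) msg_queue = TAILQ_HEAD_INITIALIZER(msg_queue);
static struct mtx msg_mtx;              /* mtx_init()ed elsewhere */

static void
msg_send(struct msg *m)
{
        int was_empty;

        mtx_lock(&msg_mtx);
        was_empty = TAILQ_EMPTY(&msg_queue);
        TAILQ_INSERT_TAIL(&msg_queue, m, m_link);
        mtx_unlock(&msg_mtx);
        if (was_empty)
                wakeup(&msg_queue);     /* no wakeup while the worker is busy */
}

static void
msg_worker(void *arg)
{
        struct msg *m;

        mtx_lock(&msg_mtx);
        for (;;) {
                while ((m = TAILQ_FIRST(&msg_queue)) != NULL) {
                        TAILQ_REMOVE(&msg_queue, m, m_link);
                        mtx_unlock(&msg_mtx);
                        /* process m; under load we loop without sleeping */
                        mtx_lock(&msg_mtx);
                }
                msleep(&msg_queue, &msg_mtx, 0, "msgq", 0);
        }
}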

--

But with regard to geom, I'd have to agree with you.  You don't want
to pipeline a single N-stage request through N threads.  One thread,
sure...  that can be batched to reduce overhead.  N stages through
N threads just creates unnecessary latency, complicates your ability
to maintain a pipeline, and has a multiplicative effect on thread
activity that negates the advantage of having multiple cpus (and
destroys cache locality as well).

You could possibly use a different trick at least for some of the
simpler transformations, and that is to replicate the control structures
on a per-cpu basis.  If you replicate the control structures on a
per-cpu basis then you can parallelize independent operations running
through the same set of devices and remove the bottlenecks.  The set of
transformations for a single BIO would be able to run lockless within
a single thread and the control system as a whole would have one
thread per cpu.  (Of course, a RAID layer would require some rendezvous
to deal with contention/conflicts, but that's easily dealt with).
That would be my suggestion.

We use that trick for our route tables in DFly, and also for listen
socket PCBs to remove choke points, and a few other things like
statistics gathering.
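
A minimal illustration of that replication trick (generic, with made-up
xform_* names, not the DFly route table code): each cpu only ever touches
its own copy on the hot path, so no lock is taken there, and anything that
truly spans cpus, like reading totals or a RAID rendezvous, becomes the
explicit slow path.

#include <sys/param.h>
#include <sys/pcpu.h>
#include <sys/smp.h>

/* Hypothetical per-cpu copy of some transform bookkeeping. */
struct xform_pcpu {
        uint64_t        xp_ios;         /* per-cpu counter, no atomics */
        /* ... per-cpu free lists, cached partition tables, etc ... */
} __aligned(CACHE_LINE_SIZE);           /* avoid false sharing */

static struct xform_pcpu xform_pcpu[MAXCPU];

static __inline struct xform_pcpu *
xform_self(void)
{
        /* Only valid while we stay on this cpu (e.g. a bound worker). */
        return (&xform_pcpu[curcpu]);
}

/* Slow path: reading totals walks every cpu's copy. */
static uint64_t
xform_total_ios(void)
{
        uint64_t total = 0;
        int i;

        for (i = 0; i < mp_ncpus; i++)
                total += xform_pcpu[i].xp_ios;
        return (total);
}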

-Matt
Matthew Dillon 

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Increasing MAXPHYS

2010-03-23 Thread Poul-Henning Kamp
In message <20100322233607.gb1...@garage.freebsd.pl>, Pawel Jakub Dawidek
writes:

>A class is suppose to interact with other classes only via GEOM, so I
>think it should be safe to choose g_up/g_down threads for each class
>individually, for example:
>
>   /dev/ad0s1a (DEV)
>  |
>   g_up_0 + g_down_0
>  |
>ad0s1a (BSD)
>  |
>   g_up_1 + g_down_1
>  |
>ad0s1 (MBR)
>  |
>   g_up_2 + g_down_2
>  |
>ad0 (DISK)

Uhm, that way you get _more_ context switches than today; today g_down
will typically push the requests all the way down through the stack
without a context switch.  (Similar for g_up.)


-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"