Re: [Qemu-devel] scsi-generic and max request size

2010-12-22 Thread Hannes Reinecke
On 12/21/2010 11:05 PM, Benjamin Herrenschmidt wrote:
 So back to square 1 ... my vscsi (and virtio-blk too btw) can
 technically pass a max size to the guest, but we don't have a way to
 interrogate scsi-generic (and the underlying block driver) which is the
 main issue (that plus the fact that the ioctl seems to be broken in
 compat mode for /dev/sg specifically)...

 Ah, the warm and fuzzy feeling knowing to be not alone in this ...

 This is basically the same issue I brought up with the first
 submission round of my megasas emulation.
 
 heh.
 
 As we're passing scatter-gather lists directly to the underlying
 device we might end up sending a request which is improperly
 formatted. The linux block layer has three limits onto which a
 request has to be formatted:
 - Max length of the scatter-gather list (max_sectors)
 - Max overall request size (max_segments)
 
 Didn't you swap the 2 above ? max_sectors is the max overall req. size
 and max_segments the max number of SG elements afaik :-)
 
Yeah, could be. 'twas only meant for illustration anyway.

 - Max length of individual sg elements (max_segment_size)
 
 newer kernels export these limits; they have been exported with
 commit c77a5710b7e23847bfdb81fcaa10b585f65c960a.
 For older kernels, however, we're being left in the dark here.
 
 Well, first of all, sg is not there so that doesn't help with the
 scsi-generic problem much, then parsing sysfs... yuck.
 
Well, sort of. 'sg' doesn't have any block queue limits directly as the
block queue is attached to the block device (surprise, surprise :-).
But nevertheless any commands sent via SG_IO are being placed on the
block queue, hence the same limits apply here, too.

 So on newer kernels we probably could be doing a quick check on the
 block queue limits and reformat the I/O if required.
 
 Maybe but then, sg isn't there. We could I suppose use sr as an
 indication tho when we know it's a cdrom.
 
If it were me I would be using
 Instead of reformatting we could be sending each element of an sg
 list individually. Thereby we would be introducing some slowdown as
 the sg lists have to be reassembled again by the lower layers, but
 we would be insulated from any sg list mismatch.
 However, this won't cover requests with too large sg elements.
 For those we could probably use some simple divide-by-two algorithm
 on the element to make them fit.
 
 How can we ? We need a single request to match a single sg list anyways
 no ?
 
Yes, true. That's what I was trying to illustrate here.

 Let's say you get a READ10 from the guest for 200Kb and your underlying
 max_sectors is 128Kb. How do you want to break that up ? The only way
 would be to make it two different READ10's and that's a can of worms
 especially if you start putting tags into the picture...
 
Precisely. Hence I didn't try to implement anything in that area :-)

 But seeing we have to split the I/O requests anyway we might as well
 use the divide-by-two algorithm for the sg lists, too.

 Easiest would be if we could just transfer the available bits and
 push the request back to the guest as a partial completion.
 Sadly the I/O stack on the guest will choose to interpret this as an
 I/O error instead of retrying the remainder :-(

 So in the long run I fear we have to implement some sort of I/O
 request splitting in Qemu, using the values from sysfs.
 
 So in my case, I'm happy for the time being to continue doing bounce
 buffering and so my only problem at the moment is the max request size
 (aka max_sectors). Also I -can- tell the guest what my limitation is,
 it's part of the vscsi login protocol. I can look into doing DMA
 directly to the guest SG lists later maybe.
 
 However, I can't quite figure out how to reliably obtain that
 information in my driver since on one hand, the ioctl doesn't seem to
 work in mixed 32/64-bit environments, and on the other hand, sysfs
 doesn't seem to have anything for sg in /sys/class/block... Besides,
 those are both Linux-isms... so we'd have to be extra careful there too.
 
Yes. I've been bashing my head against this, too.

IMO the whole problem arises from the fact that we're deliberately
destroying information here.
Most modern HBAs are using separate codepaths for streaming/block I/O
anyway, but when using 'scsi-generic' we are forced to discard this
information. We have to fake a SCSI READ/WRITE command, and send it via
SG_IO to the underlying device and keep fingers crossed that we're not
exceeding any device limitations.

The whole problem would just go away if we could use the standard block
read()/write() calls here. Then the iovec would be placed _as
scatter-gather list_ on the request-queue and the block layer would take
care of the whole issue.

I've tried to advocate this approach once, but (again) was being told
that it's a misuse of scsi-generic and I should be using scsi-disk instead.

However, since Alex Graf is facing similar problems with the AHCI HBA of
his maybe we could retry again ...


Re: [Qemu-devel] scsi-generic and max request size

2010-12-22 Thread Christoph Hellwig
On Wed, Dec 22, 2010 at 02:54:54PM +0100, Hannes Reinecke wrote:
 Most modern HBAs are using separate codepaths for streaming/block I/O
 anyway,

That's not true at all.  Every normal HBA just passes normal SCSI
commands to the SCSI targets.  It's just raid adapters that take special
commands, and the megaraid one is extremely special as it actually
emulates a few SCSI commands even in RAID mode, which almost no other
HBA does.  Strictly speaking we should not allow scsi-generic with
megaraid_sas, except for the separate passthrough channels that the real
hardware has for things like tape drives.

 However, since Alex Graf is facing similar problems with the AHCI HBA of
 his maybe we could retry again ...

AHCI is an ATA adapter, and should never be used with scsi-generic for
disks.  Only for the ATAPI-attached cdroms/tapes/etc it could be used,
although it's quite pointless.




Re: [Qemu-devel] scsi-generic and max request size

2010-12-22 Thread Benjamin Herrenschmidt
On Wed, 2010-12-22 at 14:54 +0100, Hannes Reinecke wrote:

 Well, sort of. 'sg' doesn't have any block queue limits directly as the
 block queue is attached to the block device (surprise, surprise :-).
 But nevertheless any commands sent via SG_IO are being placed on the
 block queue, hence the same limits apply here, too.

Right, tho is there a simple way to map sg to the appropriate block
driver to retrieve the info via sysfs ? It looks possible from a quick
peek there but it also looks like an ungodly mess.
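
For illustration, a rough sketch of what such a lookup could look like.
It assumes the block device behind an sg node shows up under
/sys/class/scsi_generic/<sg>/device/block/<dev>/queue/, which may not
hold on older kernels or for devices without a block queue entry
(tapes, for instance). The function name and the sysfs_root parameter
are made up for this example.

```python
import glob
import os


def sg_queue_limits(sg_name, sysfs_root="/sys"):
    """Return block queue limits for the block device behind an sg node.

    Assumed layout (may differ between kernel versions):
      <sysfs_root>/class/scsi_generic/<sg>/device/block/<dev>/queue/
    Returns an empty dict when no block device is found.
    """
    pattern = os.path.join(sysfs_root, "class", "scsi_generic", sg_name,
                           "device", "block", "*", "queue")
    limits = {}
    for queue_dir in sorted(glob.glob(pattern)):
        for attr in ("max_sectors_kb", "max_hw_sectors_kb",
                     "max_segments", "max_segment_size"):
            path = os.path.join(queue_dir, attr)
            try:
                with open(path) as f:
                    limits[attr] = int(f.read().strip())
            except (OSError, ValueError):
                # Attribute missing (older kernel) or unreadable: skip it.
                continue
        break  # assumption: one block device per SCSI device
    return limits
```

Something like sg_queue_limits("sg1") would then return whichever of the
four attributes the running kernel exports, or {} if the mapping fails.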
 
 If it were me I would be using

I think you meant to type more here :-)

  However, I can't quite figure out how to reliably obtain that
  information in my driver since on one hand, the ioctl doesn't seem to
  work in mixed 32/64-bit environments, and on the other hand, sysfs
  doesn't seem to have anything for sg in /sys/class/block... Besides,
  those are both Linux-isms... so we'd have to be extra careful there too.
  
 Yes. I've been bashing my head against this, too.

Christoph, any suggestion there ?

 IMO the whole problem arises from the fact that we're deliberately
 destroying information here.
 Most modern HBAs are using separate codepaths for streaming/block I/O
 anyway, but when using 'scsi-generic' we are forced to discard this
 information. We have to fake a SCSI READ/WRITE command, and send it via
 SG_IO to the underlying device and keep fingers crossed that we're not
 exceeding any device limitations.

I wouldn't say it like that no.

It's a transport problem. In my case I'm not faking anything, vscsi is
just a transport (a variant of SRP). The problem is that when
'emulating' a HW HBA, you have no way to express the intrinsic
limitations of the underlying HBA, but that's not a problem I have with
vscsi which is meant to be a transport and as such does have means to
convey that sort of information (tho in my case, I have some issues due
to assumptions/bugs in the existing ibm vscsi client driver but that's a
different topic).

So I think there's a significant difference here between emulating a HW
HBA and doing something like vscsi. The former has problems that cannot
be easily solved I believe. The latter's problems, on the other hand, can be
solved, the means to do so are there, but we have to deal with
interface issues ... plumbing problems.

The non working compat ioctl is one, the fact that sg has
no /sys/class/block (or /sys/block) entries is another, etc... Ie, we
are faced with a problem with Linux not exposing that information in
an easy to retrieve way, and no proper cross-platform way to obtain
that information either.

 The whole problem would just go away if we could use the standard block
 read()/write() calls here. Then the iovec would be placed _as
 scatter-gather list_ on the request-queue and the block layer would take
 care of the whole issue.

That would be somewhat cheating with the concept of just being a SCSI
transport layer :-) You would interpret some requests and turn them into
something else. That would be interesting when your user starts using
tags and make assumptions about what's in flight and what not etc...

 I've tried to advocate this approach once, but (again) was being told
 that it's a misuse of scsi-generic and I should be using scsi-disk instead.
 
 However, since Alex Graf is facing similar problems with the AHCI HBA of
 his maybe we could retry again ...

Again, I'd say different problems :-) To some extent scsi-disk will
solve the issues with basic read/write operations, but there's some more
nasty SCSI commands that you want through for things like DVD burning
for example, unless we start building higher level abstractions into the
kernel. So you -still- end up acting somewhat as a SCSI transport layer,
and potentially hit the problem with limits again.

Cheers,
Ben.

 Cheers,
 
 Hannes
 --
 Dr. Hannes Reinecke zSeries  Storage
 h...@suse.de+49 911 74053 688
 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
 GF: Markus Rex, HRB 16746 (AG Nürnberg)





Re: [Qemu-devel] scsi-generic and max request size

2010-12-22 Thread Benjamin Herrenschmidt
On Wed, 2010-12-22 at 14:27 +0100, Christoph Hellwig wrote:
 On Wed, Dec 22, 2010 at 02:54:54PM +0100, Hannes Reinecke wrote:
  Most modern HBAs are using separate codepaths for streaming/block I/O
  anyway,
 
 That's not true at all.  Every normal HBA just passes normal SCSI
 commands to the SCSI targets.  It's just raid adapters that take special
 commands, and the megaraid one is extremely special as it actually
 emulates a few SCSI commands even in RAID mode, which almost no other
 HBA does.  Strictly speaking we should not allow scsi-generic with
 megaraid_sas, except for the separate passthrough channels that the real
 hardware has for things like tape drives.

Actually, I would put it differently here.

scsi-generic is -fundamentally- busted for HBA HW emulation since you
simply cannot convey the limits of the underlying real HBA.

If you are on top of usb-storage with a 120K max_sectors and try to
emulate a piece of HBA with no such limitation how in hell do you make
your guest know not to give you requests larger than 120K at a time and what do you
do if it does ? You're stuffed basically...

Hence, the only way scsi-generic can make sense imho, is for something
like vscsi which I'm doing now, which is just a transport and does have
the ability to convey to the client/guest some of those limitations...
provided it can get to them in the first place (see the discussion, it's
really non trivial, which makes /dev/sg even less useful even for normal
userspace :-)

In the Megaraid case, the fact that it has this separate read/write
channel on the contrary should make it -easier- to solve that problem
typically by allowing the emulation layer to construct sequences of
READ/WRITE requests that match the limitations. I.e. it makes
scsi-generic a possibility while it would otherwise (and is) broken in
unfixable ways with other HBA emulation.

  However, since Alex Graf is facing similar problems with the AHCI HBA of
  his maybe we could retry again ...
 
 AHCI is an ATA adapter, and should never be used with scsi-generic for
 disks.  Only for the ATAPI-attached cdroms/tapes/etc it could be used,
 although it's quite pointless.

Right, but in that case (cdroms etc...) it would have the exact same
problem. I'm not familiar with AHCI HW, and so I don't know whether
there's a way for the HW to convey limits to the driver, but if not,
then operating via scsi-generic would be busted the same way anything
else is.

Basically, scsi-generic cannot work for emulating an HBA. In fact, I
would go as far as saying that it's not possible to generically emulate
an HBA that just passes through any SCSI command, simply due to the
inability to convey those limits.

vscsi is a special case (and other paravirt drivers that may exist)
because being explicitly designed for acting as such transports, they
-do- convey the necessary limit information. I don't know iscsi but I
would be surprised if it didn't provide similar facilities.

So what we need here is a way for qemu to retrieve those reliably when
using scsi-generic. That's the missing piece of the puzzle on my side.

Cheers,
Ben.





Re: [Qemu-devel] scsi-generic and max request size

2010-12-22 Thread Alexander Graf

On 22.12.2010, at 14:27, Christoph Hellwig wrote:

 On Wed, Dec 22, 2010 at 02:54:54PM +0100, Hannes Reinecke wrote:
 Most modern HBAs are using separate codepaths for streaming/block I/O
 anyway,
 
 That's not true at all.  Every normal HBA just passes normal SCSI
 commands to the SCSI targets.  It's just raid adapters that take special
 commands, and the megaraid one is extremely special as it actually
 emulates a few SCSI commands even in RAID mode, which almost no other
 HBA does.  Strictly speaking we should not allow scsi-generic with
 megaraid_sas, except for the separate passthrough channels that the real
 hardware has for things like tape drives.
 
 However, since Alex Graf is facing similar problems with the AHCI HBA of
 his maybe we could retry again ...
 
 AHCI is an ATA adapter, and should never be used with scsi-generic for
 disks.  Only for the ATAPI-attached cdroms/tapes/etc it could be used,
 although it's quite pointless.

It's not 100% pointless - ATAPI passthrough is a feature requested by users.
If we were to model ATAPI properly, it would end up using whatever SCSI layers 
we have below - which means ATAPI passthrough would be a mere matter of 
replacing the scsi-cdrom backend with a scsi-passthrough backend.

Now for the fun part. ATAPI can also do NCQ. So we actually end up having the
exact same thing here as megasas if we were to do ATAPI passthrough on CD-ROM
with NCQ: NCQ goes through the normal read/write path of a block backend,
while passthrough would do SG_IO.


Alex





Re: [Qemu-devel] scsi-generic and max request size

2010-12-22 Thread Alexander Graf

On 22.12.2010, at 22:59, Benjamin Herrenschmidt wrote:

 On Wed, 2010-12-22 at 14:54 +0100, Hannes Reinecke wrote:
 
 Well, sort of. 'sg' doesn't have any block queue limits directly as the
 block queue is attached to the block device (surprise, surprise :-).
 But nevertheless any commands sent via SG_IO are being placed on the
 block queue, hence the same limits apply here, too.
 
 Right, tho is there a simple way to map sg to the appropriate block
 driver to retrieve the info via sysfs ? It looks possible from a quick
 peek there but it also looks like an ungodly mess.
 
 If it were me I would be using
 
 I think you meant to type more here :-)
 
 However, I can't quite figure out how to reliably obtain that
 information in my driver since on one hand, the ioctl doesn't seem to
 work in mixed 32/64-bit environments, and on the other hand, sysfs
 doesn't seem to have anything for sg in /sys/class/block... Besides,
 those are both Linux-isms... so we'd have to be extra careful there too.
 
 Yes. I've been bashing my head against this, too.
 
 Christoph, any suggestion there ?
 
 IMO the whole problem arises from the fact that we're deliberately
 destroying information here.
 Most modern HBAs are using separate codepaths for streaming/block I/O
 anyway, but when using 'scsi-generic' we are forced to discard this
 information. We have to fake a SCSI READ/WRITE command, and send it via
 SG_IO to the underlying device and keep fingers crossed that we're not
 exceeding any device limitations.
 
 I wouldn't say it like that no.
 
 It's a transport problem. In my case I'm not faking anything, vscsi is
 just a transport (a variant of SRP). The problem is that when
 'emulating' a HW HBA, you have no way to express the intrinsic
 limitations of the underlying HBA, but that's not a problem I have with
 vscsi which is meant to be a transport and as such does have means to
 convey that sort of information (tho in my case, I have some issues due
 to assumptions/bugs in the existing ibm vscsi client driver but that's a
 different topic).
 
 So I think there's a significant difference here between emulating a HW
 HBA and doing something like vscsi. The former has problems that cannot
 be easily solved I believe. The latter's problems, on the other hand, can be
 solved, the means to do so are there, but we have to deal with
 interface issues ... plumbing problems.
 
 The non working compat ioctl is one, the fact that sg has
 no /sys/class/block (or /sys/block) entries is another, etc... Ie, we
 are faced with a problem with Linux not exposing that information in
 an easy to retrieve way, and no proper cross-platform way to obtain
 that information either.

Why would you care about cross-platform here? Not saying I fully understand 
what information exactly you're lacking. But it's either SG_IO max request size 
in which case you don't need any equivalent on other platforms, as it's not 
available anywhere else. Or it's something else in which case you can just set 
it to some safe small default value and call it a day :).


Alex




Re: [Qemu-devel] scsi-generic and max request size

2010-12-22 Thread Benjamin Herrenschmidt
On Thu, 2010-12-23 at 00:23 +0100, Alexander Graf wrote:
  The non working compat ioctl is one, the fact that sg has
  no /sys/class/block (or /sys/block) entries is another, etc... Ie, we
  are faced with a problem with Linux not exposing that information in
  an easy to retrieve way, and no proper cross-platform way to obtain
  that information either.
 
 Why would you care about cross-platform here? Not saying I fully
 understand what information exactly you're lacking. But it's either
 SG_IO max request size in which case you don't need any equivalent on
 other platforms, as it's not available anywhere else. Or it's
 something else in which case you can just set it to some safe small
 default value and call it a day :).

Well, do we support something like scsi-generic on windows or BSD
hosts ? dunno... just asking :-) They -have- mechanisms (at least
Windows does) to pass SCSI requests down the stack. In that case, they'll
have similar limitations (at the very least the max request size).

So we'd want some way to expose that... but if scsi-generic today is
linux only, then I can try to add linux-isms in there as a stop-gap to
try to at least retrieve the max req. size which is the main issue for
me right now... at least until I start trying to have the SG_IO
read/write directly into guest memory without bouncing :-) At that
point, the SG limits might become trouble as well.

Cheers,
Ben.




Re: [Qemu-devel] scsi-generic and max request size

2010-12-22 Thread Alexander Graf

On 23.12.2010, at 00:35, Benjamin Herrenschmidt wrote:

 On Thu, 2010-12-23 at 00:23 +0100, Alexander Graf wrote:
 The non working compat ioctl is one, the fact that sg has
 no /sys/class/block (or /sys/block) entries is another, etc... Ie, we
 are faced with a problem with Linux not exposing that information in
 an easy to retrieve way, and no proper cross-platform way to obtain
 that information either.
 
 Why would you care about cross-platform here? Not saying I fully
 understand what information exactly you're lacking. But it's either
 SG_IO max request size in which case you don't need any equivalent on
 other platforms, as it's not available anywhere else. Or it's
 something else in which case you can just set it to some safe small
 default value and call it a day :).
 
 Well, do we support something like scsi-generic on windows or BSD
 hosts ? dunno... just asking :-) They -have- mechanisms (at least
 Windows does) to pass SCSI requests down the stack. In that case, they'll
 have similar limitations (at the very least the max request size).
 
 So we'd want some way to expose that... but if scsi-generic today is
 linux only, then I can try to add linux-isms in there as a stop-gap to
 try to at least retrieve the max req. size which is the main issue for
 me right now... at least until I start trying to have the SG_IO
 read/write directly into guest memory without bouncing :-) At that
 point, the SG limits might become trouble as well.

This all belongs in the block layer. If you create a callback function or
property in the block struct, Windows can implement its own limits when someone
sits down to implement SG_IO on Windows.
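
As a rough sketch of that idea (all class and function names below are
hypothetical, this is not QEMU's block API): each backend may report a
limit through one hook, and the code that talks to the guest falls back
to a safe small default when the backend cannot say. The 64 KiB default
is an assumption picked for the example.

```python
class BlockBackend:
    """Hypothetical backend interface; not QEMU's actual block API."""

    def max_transfer_bytes(self):
        # None means "this backend cannot report a limit".
        return None


class LinuxSgBackend(BlockBackend):
    """Backend that knows its limit, e.g. from an ioctl or sysfs lookup."""

    def __init__(self, limit_bytes):
        self._limit = limit_bytes

    def max_transfer_bytes(self):
        return self._limit


def advertised_max_transfer(backend, transport_max, safe_default=64 * 1024):
    """Pick the per-request size limit to expose to the guest."""
    limit = backend.max_transfer_bytes()
    if limit is None:
        limit = safe_default  # fall back to a safe small default
    # Never advertise more than the transport itself can carry.
    return min(limit, transport_max)
```

A host platform without any way to query the limit would simply keep
the base-class behaviour and get the default.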


Alex




Re: [Qemu-devel] scsi-generic and max request size

2010-12-22 Thread Benjamin Herrenschmidt
On Thu, 2010-12-23 at 00:39 +0100, Alexander Graf wrote:
 This all belongs in the block layer. If you create a callback
 function or property in the block struct, Windows can implement its
 own limits when someone sits down to implement SG_IO on Windows.

Right and we do have generic ways it seems to interrogate those
limits .. except they seem to be broken for sg :-)

Also I've spotted some oddities where the ioctl for the max request size
sometimes put_user as an int * and sometimes as a short * ... ooops...

Cheers,
Ben.





Re: [Qemu-devel] scsi-generic and max request size

2010-12-22 Thread Alexander Graf

On 23.12.2010, at 00:44, Benjamin Herrenschmidt wrote:

 On Thu, 2010-12-23 at 00:39 +0100, Alexander Graf wrote:
 This all belongs in the block layer. If you create a callback
 function or property in the block struct, Windows can implement its
 own limits when someone sits down to implement SG_IO on Windows.
 
 Right and we do have generic ways it seems to interrogate those
 limits .. except they seem to be broken for sg :-)
 
 Also I've spotted some oddities where the ioctl for the max request size
 sometimes put_user as an int * and sometimes as a short * ... ooops...

Congratulations for finding lots of Linux bugs :). Look at it this way:
You'll most likely be the very first person actually using sg properly. So 
after you're done, others won't have to fix it :).


Alex




Re: [Qemu-devel] scsi-generic and max request size

2010-12-22 Thread Benjamin Herrenschmidt
On Thu, 2010-12-23 at 00:49 +0100, Alexander Graf wrote:
 
 Congratulations for finding lots of Linux bugs :). Look at it
 this way: You'll most likely be the very first person actually using
 sg properly. So after you're done, others won't have to fix it :).

Hahah, I doubt it :-) Makes me wonder whether sg can be used properly
to be honest...

Cheers,
Ben.





Re: [Qemu-devel] scsi-generic and max request size

2010-12-21 Thread Hannes Reinecke
On 12/21/2010 04:52 AM, Benjamin Herrenschmidt wrote:
 On Tue, 2010-12-21 at 14:38 +1100, ronnie sahlberg wrote:
 Ben,

 Since it is a scsi device you can try the Inquiry command with
 pagecode 0xb0  :  Block Limit VPD Page.
 That page shows optimal and maximum request sizes.

 This is for SBC, in the Vital Product Data chapter.

 Unfortunately this page is not mandatory so some devices might not
 understand it. :-(

 sg_inq --page=0x00 /dev/sg?
 will show you what inq pages your device supports.
 
 Well, that won't help much figuring what the limit is since in most cases
 the limit seems to come from the host linux HBA (ie, usb-storage for
 example artificially clamps the max request size to deal with bogus
 USB-ATA bridges).
 
Indeed. The request size is pretty much limited by the driver/scsi
layer, so the above page won't help much here.

 As for using this to try to inform the guest OS as to what the limit
 is, this could be done by patching the result of that command on the
 fly in qemu, but that is nasty, and would only work if the guest OS
 actually uses the said command in the first place. AFAIK, neither sr.c
 nor sd.c do in Linux.
 
And you'll be getting yelled at by hch to boot.

 So back to square 1 ... my vscsi (and virtio-blk too btw) can
 technically pass a max size to the guest, but we don't have a way to
 interrogate scsi-generic (and the underlying block driver) which is the
 main issue (that plus the fact that the ioctl seems to be broken in
 compat mode for /dev/sg specifically)...
 
Ah, the warm and fuzzy feeling knowing to be not alone in this ...

This is basically the same issue I brought up with the first
submission round of my megasas emulation.

As we're passing scatter-gather lists directly to the underlying
device we might end up sending a request which is improperly
formatted. The linux block layer has three limits onto which a
request has to be formatted:
- Max length of the scatter-gather list (max_sectors)
- Max overall request size (max_segments)
- Max length of individual sg elements (max_segment_size)

newer kernels export these limits; they have been exported with
commit c77a5710b7e23847bfdb81fcaa10b585f65c960a.
For older kernels, however, we're being left in the dark here.

So on newer kernels we probably could be doing a quick check on the
block queue limits and reformat the I/O if required.
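
A minimal sketch of such a check, for illustration only. It uses the
meanings as corrected in the follow-up (max_sectors is the overall
request size in 512-byte sectors, max_segments the number of sg
elements) and assumes the three values have already been obtained
somehow; the QueueLimits type and violations() are names invented for
this example.

```python
from typing import List, NamedTuple


class QueueLimits(NamedTuple):
    max_sectors: int        # overall request size, in 512-byte sectors
    max_segments: int       # number of scatter-gather elements
    max_segment_size: int   # bytes per scatter-gather element


def violations(sg_lengths: List[int], limits: QueueLimits) -> List[str]:
    """Return the names of the limits a scatter-gather list exceeds."""
    broken = []
    if sum(sg_lengths) > limits.max_sectors * 512:
        broken.append("max_sectors")
    if len(sg_lengths) > limits.max_segments:
        broken.append("max_segments")
    if any(length > limits.max_segment_size for length in sg_lengths):
        broken.append("max_segment_size")
    return broken
```

An empty result means the request can be passed down unchanged; anything
else tells us which kind of reformatting would be needed.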

Instead of reformatting we could be sending each element of an sg
list individually. Thereby we would be introducing some slowdown as
the sg lists have to be reassembled again by the lower layers, but
we would be insulated from any sg list mismatch.
However, this won't cover requests with too large sg elements.
For those we could probably use some simple divide-by-two algorithm
on the element to make them fit.
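
To make the divide-by-two idea concrete, a small sketch (the function
name is made up, and sector alignment of the resulting pieces is
ignored here, which a real implementation could not do):

```python
def halve_until_fit(length, max_segment_size):
    """Split one sg element length into pieces no larger than
    max_segment_size by repeatedly halving it.

    Returns the piece lengths in order; their sum equals length.
    """
    if max_segment_size <= 0:
        raise ValueError("max_segment_size must be positive")
    if length <= max_segment_size:
        return [length]
    first = length // 2
    return (halve_until_fit(first, max_segment_size)
            + halve_until_fit(length - first, max_segment_size))
```

Note that every split adds elements to the list, so the result still
has to be checked against the limit on the number of sg elements.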

But seeing we have to split the I/O requests anyway we might as well
use the divide-by-two algorithm for the sg lists, too.

Easiest would be if we could just transfer the available bits and
push the request back to the guest as a partial completion.
Sadly the I/O stack on the guest will choose to interpret this as an
I/O error instead of retrying the remainder :-(

So in the long run I fear we have to implement some sort of I/O
request splitting in Qemu, using the values from sysfs.

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)



Re: [Qemu-devel] scsi-generic and max request size

2010-12-21 Thread Benjamin Herrenschmidt
  So back to square 1 ... my vscsi (and virtio-blk too btw) can
  technically pass a max size to the guest, but we don't have a way to
  interrogate scsi-generic (and the underlying block driver) which is the
  main issue (that plus the fact that the ioctl seems to be broken in
  compat mode for /dev/sg specifically)...
  
 Ah, the warm and fuzzy feeling knowing to be not alone in this ...
 
 This is basically the same issue I brought up with the first
 submission round of my megasas emulation.

heh.

 As we're passing scatter-gather lists directly to the underlying
 device we might end up sending a request which is improperly
 formatted. The linux block layer has three limits onto which a
 request has to be formatted:
 - Max length of the scatter-gather list (max_sectors)
 - Max overall request size (max_segments)

Didn't you swap the 2 above ? max_sectors is the max overall req. size
and max_segments the max number of SG elements afaik :-)

 - Max length of individual sg elements (max_segment_size)

 newer kernels export these limits; they have been exported with
 commit c77a5710b7e23847bfdb81fcaa10b585f65c960a.
 For older kernels, however, we're being left in the dark here.

Well, first of all, sg is not there so that doesn't help with the
scsi-generic problem much, then parsing sysfs... yuck.

 So on newer kernels we probably could be doing a quick check on the
 block queue limits and reformat the I/O if required.

Maybe but then, sg isn't there. We could I suppose use sr as an
indication tho when we know it's a cdrom.

 Instead of reformatting we could be sending each element of an eg
 list individually. Thereby we would be introducing some slowdown as
 the sg lists have to be reassembled again by the lower layers, but
 we would be insulated from any sg list mismatch.
 However, this won't cover requests with too large sg elements.
 For those we could probably use some simple divide-by-two algorithm
 on the element to make them fit.

How can we ? We need a single request to match a single sg list anyways
no ?

Let's say you get a READ10 from the guest for 200Kb and your underlying
max_sectors is 128Kb. How do you want to break that up ? The only way
would be to make it two different READ10's and that's a can of worms
especially if you start putting tags into the picture...
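
Just to show the mechanical part of that split (and only that part), a
sketch assuming 512-byte logical blocks and a made-up helper name. It
builds the READ(10) CDBs and does nothing about the hard bits: tags,
ordering, and what to report to the guest when only one half fails.

```python
import struct


def split_read10(lba, num_blocks, max_bytes, block_size=512):
    """Return READ(10) CDBs covering num_blocks blocks starting at lba,
    each transferring at most max_bytes."""
    # READ(10) carries a 16-bit transfer length, so cap at 0xFFFF blocks.
    max_blocks = min(max_bytes // block_size, 0xFFFF)
    if max_blocks < 1:
        raise ValueError("limit smaller than one block")
    cdbs = []
    while num_blocks > 0:
        count = min(num_blocks, max_blocks)
        # opcode 0x28, flags, 32-bit LBA, group number, 16-bit length, control
        cdbs.append(struct.pack(">BBIBHB", 0x28, 0, lba, 0, count, 0))
        lba += count
        num_blocks -= count
    return cdbs
```

With a 200K request (400 blocks) and a 128K limit this yields two CDBs,
one for 256 blocks and one for the remaining 144.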

 But seeing we have to split the I/O requests anyway we might as well
 use the divide-by-two algorithm for the sg lists, too.
 
 Easiest would be if we could just transfer the available bits and
 push the request back to the guest as a partial completion.
 Sadly the I/O stack on the guest will choose to interpret this as an
 I/O error instead of retrying the remainder :-(
 
 So in the long run I fear we have to implement some sort of I/O
 request splitting in Qemu, using the values from sysfs.

So in my case, I'm happy for the time being to continue doing bounce
buffering and so my only problem at the moment is the max request size
(aka max_sectors). Also I -can- tell the guest what my limitation is,
it's part of the vscsi login protocol. I can look into doing DMA
directly to the guest SG lists later maybe.

However, I can't quite figure out how to reliably obtain that
information in my driver since on one hand, the ioctl doesn't seem to
work in mixed 32/64-bit environments, and on the other hand, sysfs
doesn't seem to have anything for sg in /sys/class/block... Besides,
those are both Linux-isms... so we'd have to be extra careful there too.

Cheers,
Ben.

 Cheers,
 
 Hannes
 --
 Dr. Hannes Reinecke zSeries  Storage
 h...@suse.de+49 911 74053 688
 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
 GF: Markus Rex, HRB 16746 (AG Nürnberg)





[Qemu-devel] scsi-generic and max request size

2010-12-20 Thread Benjamin Herrenschmidt
Hi folks !

There's an odd problem I've encountered with my scsi host (basically a
PowerPC vscsi compatible with IBM PAPR).

When using /dev/sg (ie, scsi-generic), there seems to be no way I can
find to retrieve the underlying driver's max request transfer size.

This can normally be obtained with the BLKSECTGET ioctl under Linux (I'm
not familiar with other OSes here). However, this is a bit buggy as
well, ie, afaik, this doesn't work with 32-bit binaries on 64-bit
kernels (the compat ioctl doesn't seem to work on /dev/sg).
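
For reference, a sketch of the query from user space. The ioctl number
is _IO(0x12, 103) from <linux/fs.h>. The decoding is an assumption to
verify against the target kernel: as far as I can tell block devices
write an unsigned short counted in 512-byte sectors while /dev/sg
writes an int counted in bytes. Function names are made up, and
guessing "is this sg" from the node name is a shortcut for the example.

```python
import fcntl
import os
import struct

BLKSECTGET = 0x1267  # _IO(0x12, 103) from <linux/fs.h>


def decode_blksectget(raw, is_sg):
    """Decode the buffer filled in by BLKSECTGET into a byte count.

    Assumption: block devices report an unsigned short in 512-byte
    sectors, /dev/sg* reports an int in bytes.
    """
    if is_sg:
        return struct.unpack_from("=i", raw)[0]
    return struct.unpack_from("=H", raw)[0] * 512


def max_transfer_bytes(path):
    """Query the maximum transfer size of a device node, in bytes."""
    is_sg = os.path.basename(path).startswith("sg")
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    try:
        buf = bytearray(4)
        fcntl.ioctl(fd, BLKSECTGET, buf)  # the kernel writes into buf
        return decode_blksectget(bytes(buf), is_sg)
    finally:
        os.close(fd)
```

A 32-bit process on a 64-bit kernel would go through the compat ioctl
path, which is exactly where this seems to fall over for /dev/sg.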

For now, qemu doesn't pass that from its bdev layer, which means that
scsi-generic doesn't pass it to its own upper layer either.

What that means is twofold I suppose:

 - For real SCSI HBAs, how do you limit the transfer size anyways ? You
can't start breaking up user requests without taking risks with tags
etc...

 - For vscsi, I can expose the limit I want via the SRP interface, but
scsi-generic doesn't tell me what it is :-)

This is a real problem in practice. IE. the USB CD-ROM on this POWER7
blade limits transfers to 0x1e000 bytes for example and the Linux sr
driver on the guest is going to try to give me bigger requests than that
if I don't start limiting them, which will cause all sorts of errors.

Cheers,
Ben.






Re: [Qemu-devel] scsi-generic and max request size

2010-12-20 Thread ronnie sahlberg
Ben,

Since it is a scsi device you can try the Inquiry command with
pagecode 0xb0  :  Block Limits VPD page.
That page shows optimal and maximum request sizes.

This is for SBC, in the Vital Product Data chapter.

Unfortunately this page is not mandatory so some devices might not
understand it. :-(

sg_inq --page=0x00 /dev/sg?
will show you what inq pages your device supports.
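
For a device that does return page 0xb0, the fields could be decoded along these lines. This is a sketch, not a complete decoder: the function name and the result keys are made up, only the first 16 bytes are examined, and the offsets follow my reading of the SBC Block Limits page (big-endian, lengths in logical blocks, 0 meaning no limit reported).

```python
import struct


def parse_block_limits_vpd(page):
    """Decode the transfer-length fields of a Block Limits VPD page (0xb0).

    `page` is the raw INQUIRY response. Lengths are in logical blocks;
    a value of 0 means the device reports no limit.
    """
    if len(page) < 16:
        raise ValueError("response too short for a Block Limits page")
    if page[1] != 0xB0:
        raise ValueError("not a Block Limits page: 0x%02x" % page[1])
    # Bytes 6-7: optimal transfer length granularity
    # Bytes 8-11: maximum transfer length
    # Bytes 12-15: optimal transfer length
    granularity, max_len, opt_len = struct.unpack_from(">HII", page, 6)
    return {
        "granularity_blocks": granularity,
        "max_transfer_blocks": max_len,
        "optimal_transfer_blocks": opt_len,
    }
```

Multiplying max_transfer_blocks by the device's logical block size gives the limit in bytes. Note that this only reflects what the device itself claims, not any clamp applied by the host driver.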


regards
ronnie sahlberg










Re: [Qemu-devel] scsi-generic and max request size

2010-12-20 Thread Benjamin Herrenschmidt
On Tue, 2010-12-21 at 14:38 +1100, ronnie sahlberg wrote:
 Ben,
 
 Since it is a scsi device you can try the Inquiry command with
 pagecode 0xb0  :  Block Limits VPD page.
 That page shows optimal and maximum request sizes.
 
 This is for SBC, in the Vital Product Data chapter.
 
 Unfortunately this page is not mandatory so some devices might not
 understand it. :-(
 
 sg_inq --page=0x00 /dev/sg?
 will show you what inq pages your device supports.

Well, that won't help much in figuring out what the limit is, since in
most cases the limit seems to come from the host Linux HBA driver (ie,
usb-storage for example artificially clamps the max request size to deal
with bogus USB-ATA bridges).

As for using this to try to inform the guest OS as to what the limit
is, this could be done by patching the result of that command on the
fly in qemu, but that is nasty, and would only work if the guest OS
actually uses the said command in the first place. AFAIK, neither sr.c
nor sd.c do in Linux.

So back to square 1 ... my vscsi (and virtio-blk too btw) can
technically pass a max size to the guest, but we don't have a way to
interrogate scsi-generic (and the underlying block driver) which is the
main issue (that plus the fact that the ioctl seems to be broken in
compat mode for /dev/sg specifically)...

Cheers,
Ben.


 