Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-12-07 Thread Hannes Reinecke
Gerd Hoffmann wrote:
 On 11/27/09 12:08, Gerd Hoffmann wrote:
 On 11/26/09 16:50, Hannes Reinecke wrote:
 So indeed, this approach would only work if we signal some sense code
 back to the host.
 I, OTOH, don't have any qualms with returning HARDWARE_ERROR,
 0x26/0x08 (TOO MANY SEGMENT DESCRIPTORS) resp. 0x26/0x0B (INLINE DATA
 LENGTH EXCEEDED).
 Feels only fair to notify the guest it has done something wrong.

 Also set the info field which linux uses to figure how many sectors it
 actually got.
 
 Hmm.  Well.  Seems to work out at least for linux, i.e. it figures it
 got a bunch of sectors and tries to continue.  Linux logs an I/O error.
  Also I didn't try other guests (yet).
 
 Using that as a way to limit scsi-disk request sizes probably isn't a
 good idea.  For scsi-generic that would be an improvement over the
 current situation though.
 
Yes, quite.

But for scsi-disk we could always fall back to using bounce buffers,
could we not?
Provided we get a detailed enough error code, but this could be arranged,
methinks.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-12-07 Thread Gerd Hoffmann

On 12/07/09 09:28, Hannes Reinecke wrote:

Hmm.  Well.  Seems to work out at least for linux, i.e. it figures it
got a bunch of sectors and tries to continue.  Linux logs an I/O error.
  Also I didn't try other guests (yet).

Using that as a way to limit scsi-disk request sizes probably isn't a
good idea.  For scsi-generic that would be an improvement over the
current situation though.


Yes, quite.

But for scsi-disk we could always fall back to using bounce buffers,
could we not?


We want to limit the bounce buffer size though.  As there seems to be no
easy way to make sure the guest doesn't submit requests larger than a
certain limit, I guess there is no way around splitting the request into
pieces for the bounce buffer case.


We could add offset+size arguments to scsi_req_buf() to accomplish this.
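
A rough sketch of what that splitting could look like, assuming a 128k
bounce buffer (mirroring the kernel limit mentioned elsewhere in this
thread); the submit callback is a hypothetical stand-in for an
offset/size-aware scsi_req_buf(), not the actual API:

#include <stdint.h>
#include <stdlib.h>

#define BOUNCE_BUF_SIZE (128 * 1024)   /* assumed cap, matching the 128k kernel limit */

/* hypothetical per-piece submit hook; the real code would call an
 * offset/size-aware scsi_req_buf() variant here */
typedef int (*submit_piece_fn)(void *opaque, uint8_t *buf,
                               uint64_t offset, size_t len);

static int submit_in_pieces(void *opaque, uint64_t total_len, submit_piece_fn submit)
{
    uint8_t *bounce = malloc(BOUNCE_BUF_SIZE);
    uint64_t off;
    int rc = 0;

    if (!bounce) {
        return -1;
    }
    for (off = 0; off < total_len && rc == 0; off += BOUNCE_BUF_SIZE) {
        uint64_t left = total_len - off;
        size_t chunk = left < BOUNCE_BUF_SIZE ? (size_t)left : BOUNCE_BUF_SIZE;
        /* one bounce-buffer-sized piece at a time */
        rc = submit(opaque, bounce, off, chunk);
    }
    free(bounce);
    return rc;
}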

cheers,
  Gerd


PS: FYI: I suspect I won't find time this year to continue working on
this seriously, especially on the time-consuming testing part. Top
priority right now is finishing touches for the 0.12 release.
x-mas will be family time.




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-12-02 Thread Gerd Hoffmann

On 11/27/09 12:08, Gerd Hoffmann wrote:

On 11/26/09 16:50, Hannes Reinecke wrote:

So indeed, this approach would only work if we signal some sense code
back to the host.
I, OTOH, don't have any qualms with returning HARDWARE_ERROR,
0x26/0x08 (TOO MANY SEGMENT DESCRIPTORS) resp. 0x26/0x0B (INLINE DATA
LENGTH EXCEEDED).
Feels only fair to notify the guest it has done something wrong.


Also set the info field which linux uses to figure how many sectors it
actually got.


Hmm.  Well.  Seems to work out at least for linux, i.e. it figures it 
got a bunch of sectors and tries to continue.  Linux logs an I/O error. 
 Also I didn't try other guests (yet).


Using that as a way to limit scsi-disk request sizes probably isn't a 
good idea.  For scsi-generic that would be an improvement over the
current situation though.


cheers,
  Gerd




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-27 Thread Gerd Hoffmann

On 11/26/09 16:50, Hannes Reinecke wrote:

So indeed, this approach would only work if we signal some sense code
back to the host.
I, OTOH, don't have any qualms with returning HARDWARE_ERROR,
0x26/0x08 (TOO MANY SEGMENT DESCRIPTORS) resp. 0x26/0x0B (INLINE DATA LENGTH
EXCEEDED).
Feels only fair to notify the guest it has done something wrong.


Also set the info field which linux uses to figure how many sectors it 
actually got.
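
For reference, a minimal sketch of fixed-format sense data with the VALID
bit and the INFORMATION field filled in; the HARDWARE ERROR / 0x26 codes
are just the values floated in this thread, and whether they are the right
choice is exactly what is being debated:

#include <stdint.h>
#include <string.h>

/* Fixed-format sense data (SPC style): VALID bit set, INFORMATION field
 * carrying the number of blocks actually transferred.  Sense key and
 * ASC/ASCQ are only the values suggested above. */
static void build_sense_with_info(uint8_t sense[18], uint32_t blocks_done)
{
    memset(sense, 0, 18);
    sense[0]  = 0x70 | 0x80;        /* current error, fixed format, VALID = 1 */
    sense[2]  = 0x04;               /* sense key: HARDWARE ERROR */
    sense[3]  = blocks_done >> 24;  /* INFORMATION field (big endian) */
    sense[4]  = blocks_done >> 16;
    sense[5]  = blocks_done >> 8;
    sense[6]  = blocks_done;
    sense[7]  = 10;                 /* additional sense length */
    sense[12] = 0x26;               /* ASC,  e.g. TOO MANY SEGMENT DESCRIPTORS */
    sense[13] = 0x08;               /* ASCQ */
}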


Now I really need some more up-to-date SCSI specs.  Unfortunately t10.org
doesn't allow public access to the drafts any more.  Seems to be a
recent change, and googling found me tons of references to t10.org but
not the documents themselves.


Does anyone have a source for me?

cheers,
  Gerd





Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-26 Thread Gerd Hoffmann

  Hi,


Answering large requests with "Illegal request, Invalid field in CDB"
doesn't make linux try smaller requests; instead it reports I/O errors
to the syslog.

Hmm.


Can't we just put residuals to good use here?
I.e. finish the request up to the size we can handle, and return the
original request with the transfer size set correctly.


I'll try.


Should be straightforward to implement, one would assume.


The infrastructure and the HBAs have to handle that already.  It 
frequently happens with MODE SENSE for example (guest passing a 256 byte 
buffer and we have less data to fill in).  So it should be easy.



And we could probably encapsulate it entirely within the bdrv
as we don't actually need to expose those limits when the block
driver layer is handling it correctly.


Hmm, I don't think so.  SG_IO just returns EINVAL.

cheers,
  Gerd





Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-26 Thread Hannes Reinecke
Gerd Hoffmann wrote:
   Hi,
 
 Answering large requests with "Illegal request, Invalid field in CDB"
 doesn't make linux try smaller requests; instead it reports I/O errors
 to the syslog.

 Hmm.

 Can't we just put residuals to good use here?
 I.e. finish the request up to the size we can handle, and return the
 original request with the transfer size set correctly.
 
 I'll try.
 
 Should be straightforward to implement, one would assume.
 
 The infrastructure and the HBAs have to handle that already.  It
 frequently happens with MODE SENSE for example (guest passing a 256 byte
 buffer and we have less data to fill in).  So it should be easy.
 
 And we could probably encapsulate it entirely within the bdrv
 as we don't actually need to expose those limits when the block
 driver layer is handling it correctly.
 
 Hmm, I don't think so.  SG_IO just returns EINVAL.
 
Yes, but we can hook into scsi_generic_map() to cap the resulting
iovec to the limits of the underlying block device.
Then we submit an iovec which matches the capabilities of the
underlying device.
Of course we should take care to update the resulting xferlen
values to match the actually submitted data, but that should
be easily done.

Then the guest would see a partial request and retry the
remainder of the request, which (possibly after some iterations)
would result in all data transferred, albeit at a lower speed.
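
A simplified sketch of that capping step, assuming the iovec has already
been mapped; scsi_generic_map() itself is not shown, and the caller would
still have to fix up the CDB transfer length as described above:

#include <stddef.h>
#include <sys/uio.h>

/* Trim an iovec so it covers at most 'limit' bytes; returns the capped total
 * and shrinks *iovcnt accordingly.  The untransferred remainder is what the
 * guest would then retry. */
static size_t cap_iovec(struct iovec *iov, int *iovcnt, size_t limit)
{
    size_t total = 0;
    int i;

    for (i = 0; i < *iovcnt; i++) {
        if (total + iov[i].iov_len > limit) {
            iov[i].iov_len = limit - total;          /* shorten the crossing element */
            *iovcnt = iov[i].iov_len ? i + 1 : i;    /* drop everything after it */
            return limit;
        }
        total += iov[i].iov_len;
    }
    return total;   /* request already fits within the device limit */
}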

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-26 Thread Gerd Hoffmann

On 11/26/09 11:57, Hannes Reinecke wrote:

Then the guest would see a partial request and retry the
remainder of the request, which (possibly after some iterations)
would result in all data transferred, albeit at a lower speed.


Except they don't.

/me looks at drivers/scsi/sd.c:sd_done()

I can't see any sane way to tell linux that the request was too big.

cheers,
  Gerd




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-26 Thread Gerd Hoffmann

On 11/26/09 12:20, Hannes Reinecke wrote:

Gerd Hoffmann wrote:

/me looks at drivers/scsi/sd.c:sd_done()

I can't see any sane way to tell linux that the request was too big.


residuals are the key:

drivers/scsi/scsi.c:scsi_finish_command()


good_bytes = scsi_bufflen(cmd);
if (cmd->request->cmd_type != REQ_TYPE_BLOCK_PC) {
        int old_good_bytes = good_bytes;
        drv = scsi_cmd_to_driver(cmd);
        if (drv->done)
                good_bytes = drv->done(cmd);


drv->done() actually is sd_done() mentioned above.

sd_done() tries to figure how many sectors it actually got for serious 
errors.  I don't feel like signaling a medium error for the first sector
behind our limit just because we'd like to have smaller requests.



/*
 * USB may not give sense identifying bad sector and
 * simply return a residue instead, so subtract off the
 * residue if drv->done() error processing indicates no
 * change to the completion length.
 */
if (good_bytes == old_good_bytes)
        good_bytes -= scsi_get_resid(cmd);


Poor man's bad sector identification.  Same issue as above IMHO.  On top
of that I wouldn't expect all other guest OSes to have the same quirk in
there to handle USB disks.


cheers,
  Gerd




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-26 Thread Hannes Reinecke
Gerd Hoffmann wrote:
 On 11/26/09 12:20, Hannes Reinecke wrote:
 Gerd Hoffmann wrote:
 /me looks at drivers/scsi/sd.c:sd_done()

 I can't see any sane way to tell linux that the request was too big.

 residuals are the key:

 drivers/scsi/scsi.c:scsi_finish_command()


 good_bytes = scsi_bufflen(cmd);
 if (cmd->request->cmd_type != REQ_TYPE_BLOCK_PC) {
         int old_good_bytes = good_bytes;
         drv = scsi_cmd_to_driver(cmd);
         if (drv->done)
                 good_bytes = drv->done(cmd);
 
 drv->done() actually is sd_done() mentioned above.
 
 sd_done() tries to figure how many sectors it actually got for serious
 errors.  I don't feel like signaling a medium error for the first sector
 behind our limit just because we'd like to have smaller requests.
 
 /*
  * USB may not give sense identifying bad sector and
  * simply return a residue instead, so subtract off the
  * residue if drv->done() error processing indicates no
  * change to the completion length.
  */
 if (good_bytes == old_good_bytes)
         good_bytes -= scsi_get_resid(cmd);
 
 Poor man's bad sector identification.  Same issue as above IMHO.  On top
 of that I wouldn't expect all other guest OSes to have the same quirk in
 there to handle USB disks.
 
Sigh. Communication seems to be tricky here.

Please, just ignore scsi_finish_command(), it's not doing anything
useful here.
Do skip this and study the next snippet, scsi_io_completion():

drivers/scsi/scsi_lib.c:scsi_io_completion()

/*
 * A number of bytes were successfully read.  If there
 * are leftovers and there is some kind of error
 * (result != 0), retry the rest.
 */
if (scsi_end_request(cmd, error, good_bytes, result == 0) == NULL)
        return;

scsi_end_request is being called with the number of bytes _actually processed_,
which might be less than the number of bytes requested.
And the remainder will be retried by the upper layers.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-26 Thread Gerd Hoffmann

On 11/26/09 15:27, Hannes Reinecke wrote:

Gerd Hoffmann wrote:

sd_done() tries to figure how many sectors it actually got for serious
errors.  I don't feel like signaling a medium error for the first sector
behind our limit just because we'd like to have smaller requests.



scsi_end_request is being called with the number of bytes _actually processed_,
which might be less than the number of bytes requested.
And the remainder will be retried by the upper layers.


The number of bytes _actually processed_ must come from somewhere. 
And that somewhere is sd_done() in case of scsi-disks.  See my point now?


cheers,
  Gerd




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-26 Thread Hannes Reinecke
Gerd Hoffmann wrote:
 On 11/26/09 15:27, Hannes Reinecke wrote:
 Gerd Hoffmann wrote:
 sd_done() tries to figure how many sectors it actually got for serious
 errors.  I don't feel like signaling a medium error for the first sector
 behind our limit just because we'd like to have smaller requests.
 
 scsi_end_request is being called with the number of bytes _actually
 processed_,
 which might be less than the number of bytes requested.
 And the remainder will be retried by the upper layers.
 
 The number of bytes _actually processed_ must come from somewhere. And
 that somewhere is sd_done() in case of scsi-disks.  See my point now?
 
Ah. Now. I concur.

So indeed, this approach would only work if we signal some sense code
back to the host.
I, OTOH, don't have any qualms with returning HARDWARE_ERROR,
0x26/0x08 (TOO MANY SEGMENT DESCRIPTORS) resp. 0x26/0x0B (INLINE DATA LENGTH
EXCEEDED).
Feels only fair to notify the guest it has done something wrong.

The alternative would be to split the I/O into smaller sections.
But this really feels a bit like overkill.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-25 Thread Hannes Reinecke
Gerd Hoffmann wrote:
 On 11/24/09 14:51, Paul Brook wrote:
 On Tuesday 24 November 2009, Gerd Hoffmann wrote:
 On 11/16/09 19:53, Paul Brook wrote:
 Capping the amount of memory required for a transfer *is*
 implemented, in
 both LSI and virtio-blk.  The exception being SCSI passthrough where
 the
 kernel API makes it impossible.

 Well.  Figured while doing more testing:  The allowed request size is
 limited by the kernel, so scsi-generic requests larger than (currently)
 128k fail.

 Now, how to handle *that*?  Is there some way to signal to the guest
 that the request was too big?

 Same as real hardware.  Probably also want to populate the Block
 Limits VPD
 page appropriately
 
 Some experiments later.
 
 Linux reads the Block Limits VPD, but seems to ignore the hard limit.
 
Ah. Maybe we should fix that up then ...
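
As a reference point, a minimal sketch of how the Block Limits VPD page
(0xB0) Paul mentions could be populated with a MAXIMUM TRANSFER LENGTH;
the field offsets follow SBC, and a real implementation would fill in the
remaining fields too:

#include <stdint.h>
#include <string.h>

/* Fill a minimal Block Limits VPD page (0xB0) with only the
 * MAXIMUM TRANSFER LENGTH (in logical blocks) populated. */
static int fill_block_limits_vpd(uint8_t *buf, size_t buflen, uint32_t max_xfer_blocks)
{
    if (buflen < 16) {
        return -1;
    }
    memset(buf, 0, 16);
    buf[0]  = 0x00;                    /* peripheral device type: direct access */
    buf[1]  = 0xB0;                    /* page code: Block Limits */
    buf[3]  = 12;                      /* page length (bytes after byte 3) */
    buf[8]  = max_xfer_blocks >> 24;   /* MAXIMUM TRANSFER LENGTH */
    buf[9]  = max_xfer_blocks >> 16;
    buf[10] = max_xfer_blocks >> 8;
    buf[11] = max_xfer_blocks;
    return 16;
}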

 Answering large requests with "Illegal request, Invalid field in CDB"
 doesn't make linux try smaller requests; instead it reports I/O errors
 to the syslog.
 
 Hmm.

Can't we just put residuals to good use here?
I.e. finish the request up to the size we can handle, and return the
original request with the transfer size set correctly.
Then the SCSI stack of the guest should retry the leftovers, which
then we should be able to handle. Or redo from start until we are
able to.
Should be straightforward to implement, one would assume.
And we could probably encapsulate it entirely within the bdrv
as we don't actually need to expose those limits when the block
driver layer is handling it correctly.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-24 Thread Gerd Hoffmann

On 11/16/09 19:53, Paul Brook wrote:

Capping the amount of memory required for a transfer *is* implemented, in both
LSI and virtio-blk.  The exception being SCSI passthrough where the kernel API
makes it impossible.


Well.  Figured while doing more testing:  The allowed request size is 
limited by the kernel, so scsi-generic requests larger than (currently) 
128k fail.


Now, how to handle *that*?  Is there some way to signal to the guest 
that the request was too big?


At least for known commands such as READ+WRITE which are likely to be 
big we could split the request internally into two (or more) if needed.
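
A rough sketch of what such internal splitting could look like for READ(10),
assuming a cap of 256 blocks (128k with 512-byte sectors); the actual
submission path (SG_IO or the block layer) is left out:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_XFER_BLOCKS 256u   /* assumed cap: 128k with 512-byte sectors */

/* Build a READ(10) CDB for one sub-request. */
static void build_read10(uint8_t cdb[10], uint32_t lba, uint16_t blocks)
{
    memset(cdb, 0, 10);
    cdb[0] = 0x28;              /* READ(10) opcode */
    cdb[2] = lba >> 24;         /* logical block address, big endian */
    cdb[3] = lba >> 16;
    cdb[4] = lba >> 8;
    cdb[5] = lba;
    cdb[7] = blocks >> 8;       /* transfer length in blocks */
    cdb[8] = blocks;
}

/* Split an oversized read into kernel-acceptable pieces. */
static void split_read(uint32_t lba, uint32_t blocks)
{
    while (blocks) {
        uint16_t chunk = blocks > MAX_XFER_BLOCKS ? MAX_XFER_BLOCKS : (uint16_t)blocks;
        uint8_t cdb[10];

        build_read10(cdb, lba, chunk);
        /* submit cdb here, e.g. via SG_IO for scsi-generic */
        printf("READ(10) lba=%u blocks=%u\n", lba, chunk);
        lba += chunk;
        blocks -= chunk;
    }
}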


cheers,
  Gerd




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-24 Thread Paul Brook
On Tuesday 24 November 2009, Gerd Hoffmann wrote:
 On 11/16/09 19:53, Paul Brook wrote:
  Capping the amount of memory required for a transfer *is* implemented, in
  both LSI and virtio-blk.  The exception being SCSI passthrough where the
  kernel API makes it impossible.
 
 Well.  Figured while doing more testing:  The allowed request size is
 limited by the kernel, so scsi-generic requests larger than (currently)
 128k fail.
 
 Now, how to handle *that*?  Is there some way to signal to the guest
 that the request was too big?

Same as real hardware.  Probably also want to populate the Block Limits VPD 
page appropriately

Paul




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-16 Thread Gerd Hoffmann

On 11/11/09 17:38, Paul Brook wrote:

That cap is important.
For scsi-generic you probably don't have a choice because of the way the
kernel interface works.


Exactly.  And why is the cap important for scsi-disk if scsi-generic
does fine without?


With scsi-generic you're at the mercy of what the kernel API gives you, and if
the guest hardware/OS isn't cooperative then you lose.


The guest will lose with unreasonably large requests.  qemu_malloc() ->
oom() -> abort() -> guest is gone.


We can also limit the amount of host memory we allow the guest to
consume, so uncooperative guests can't push the host into swap.  This is 
not implemented today, indicating that it hasn't been a problem so far. 
 And with zerocopy it will be even less of a problem as we don't need host
memory to buffer the data ...



It doesn't.  The disconnect and thus the opportunity to submit more
commands while the device is busy doing the actual I/O is there.


Disconnecting on the first DMA request (after switching to a data phase and
transferring zero bytes) is bizarre behavior, but probably allowable.


The new lsi code doesn't.  The old code could do that under certain 
circumstances.  And what is bizarre about that?  A real hard drive will 
most likely do exactly that on reads (unless it has the data cached and 
can start the transfer instantly).



However by my reading DMA transfers must be performed synchronously by the
SCRIPTS engine, so you need to do a lot of extra checking to prove that you
can safely continue execution without actually performing the transfer.


I'll happily add a 'strict' mode which does data transfers synchronously 
in case any compatibility issues show up.


Such a mode would be slower of course.  We'll have to either do the I/O 
in lots of little chunks or lose zerocopy.  Large transfers + memcpy is
probably the faster option.



It may be that it's
hard/impossible to get both command queueing and zero-copy.


I have it working.


More likely you have a nasty hack that happens to work with the Linux drivers.


Linux works.  Windows XP works.  FreeBSD works.  More regression testing 
is planned.


Suggestions for other guests which might be sensitive to changes like 
this?  Maybe even some which don't work with the current lsi emulation?



IIUC you're pretending that the DMA completed and eventually disconnecting the
device, assuming that nothing will read that data until the command complete
notification is received.


Yes.


Consider the case where the guest transfers the data from a single command in
two blocks, and has the HBA raise an interrupt in-between so that it can start
processing before the command completes. This processing could even be done by
the SCRIPTS engine itself, or a guest could even reuse the buffer for the
second DMA block.


In theory you can do all sorts of crazy stuff with the lsi scripts 
engine, like implementing an ext2 filesystem driver.


In practice I expect guests will use the scripts engine only for stuff 
which is also supported by other scsi controllers.


cheers,
  Gerd




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-16 Thread Paul Brook
On Monday 16 November 2009, Gerd Hoffmann wrote:
 On 11/11/09 17:38, Paul Brook wrote:
  That cap is important.
  For scsi-generic you probably don't have a choice because of the way
  the kernel interface works.
 
  Exactly.  And why is the cap important for scsi-disk if scsi-generic
  does fine without?
 
  With scsi-generic you're at the mercy of what the kernel API gives you,
  and if the guest hardware/OS isn't cooperative then you lose.
 
 The guest will lose with unreasonably large requests.  qemu_malloc() ->
 oom() -> abort() -> guest is gone.

Exactly. This lossage is not acceptable.

 We can also limit the amount of host memory we allow the guest to
 consume, so uncooperative guests can't push the host into swap.  This is
 not implemented today, indicating that it hasn't been a problem so far.

Capping the amount of memory required for a transfer *is* implemented, in both 
LSI and virtio-blk.  The exception being SCSI passthrough where the kernel API 
makes it impossible.

   And with zerocopy it will be even less of a problem as we don't need host
 memory to buffer the data ...

zero-copy isn't possible in many cases. You must handle the other cases 
gracefully.

  It doesn't.  The disconnect and thus the opportunity to submit more
  commands while the device is busy doing the actual I/O is there.
 
  Disconnecting on the first DMA request (after switching to a data phase
  and transferring zero bytes) is bizarre behavior, but probably allowable.
 
 The new lsi code doesn't.  The old code could do that under certain
 circumstances.  And what is bizarre about that?  A real hard drive will
 most likely do exactly that on reads (unless it has the data cached and
 can start the transfer instantly).

No. The old code goes directly from the command phase to the message 
(disconnect) phase.

  However by my reading DMA transfers must be performed synchronously by
  the SCRIPTS engine, so you need to do a lot of extra checking to prove
  that you can safely continue execution without actually performing the
  transfer.
 
 I'll happily add a 'strict' mode which does data transfers synchronously
 in case any compatibility issues show up.
 
 Such a mode would be slower of course.  We'll have to either do the I/O
 in lots of little chunks or lose zerocopy.  Large transfers + memcpy is
 probably the faster option.

But as you agreed above, large transfers+memcpy is not a realistic option 
because it can have excessive memory requirements.

Paul




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-16 Thread Ryan Harper
* Gerd Hoffmann kra...@redhat.com [2009-11-16 12:38]:
 On 11/11/09 17:38, Paul Brook wrote:

 It may be that it's
 hard/impossible to get both command queueing and zero-copy.
 
 I have it working.
 
 More likely you have a nasty hack that happens to work with the Linux 
 drivers.
 
 Linux works.  Windows XP works.  FreeBSD works.  More regression testing 
 is planned.
 
 Suggestions for other guests which might be sensitive to changes like 
 this?  Maybe even some which don't work with the current lsi emulation?

XP 64-bit, 2k3 32/64, 2k8 32/64

64-bit is important because the DMA mode of the lsi driver is different
with the 64-bit versions of the Windows drivers.

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-16 Thread Gerd Hoffmann

  Hi,


Suggestions for other guests which might be sensitive to changes like
this?  Maybe even some which don't work with the current lsi emulation?


XP 64-bit, 2k3 32/64, 2k8 32/64

64-bit is important because the DMA mode of the lsi driver is different
with the 64-bit versions of the Windows drivers.


Are these in the "needs regression testing" or in the "known broken" group?

thanks,
  Gerd




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-16 Thread Ryan Harper
* Gerd Hoffmann kra...@redhat.com [2009-11-16 14:42]:
   Hi,
 
 Suggestions for other guests which might be sensitive to changes like
 this?  Maybe even some which don't work with the current lsi emulation?
 
 XP 64-bit, 2k3 32/64, 2k8 32/64
 
 64-bit is important because the DMA mode of the lsi driver is different
 with the 64-bit versions of the Windows drivers.
 
 Are these in the "needs regression testing" or in the "known broken" group?

Regression:
xp-64
2k3-32/64

Known Broken:
2k8-32/64

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-16 Thread Gerd Hoffmann

On 11/16/09 19:53, Paul Brook wrote:

We can also limit the amount of host memory we allow the guest to
consume, so uncooperative guests can't push the host into swap.  This is
not implemented today, indicating that it hasn't been a problem so far.


Capping the amount of memory required for a transfer *is* implemented, in both
LSI and virtio-blk.  The exception being SCSI passthrough where the kernel API
makes it impossible.


I was talking about scsi-generic.  There is no option to reject
excessively large requests, and it has not been a problem so far.



   And with zerocopy it will be even less of a problem as we don't need host
memory to buffer the data ...


zero-copy isn't possible in many cases. You must handle the other cases
gracefully.


I haven't yet found a guest OS where lsi can't do zerocopy.
Name one where it doesn't work and I'll have a look.


Disconnecting on the first DMA request (after switching to a data phase
and transferring zero bytes) is bizarre behavior, but probably allowable.


The new lsi code doesn't.  The old code could do that under certain
circumstances.  And what is bizarre about that?  A real hard drive will
most likely do exactly that on reads (unless it has the data cached and
can start the transfer instantly).


No. The old code goes directly from the command phase to the message
(disconnect) phase.


Hmm, well.  It switches from DI / DO to MI before the guest runs again 
so the guest will not notice the switch ...



However by my reading DMA transfers must be performed synchronously by
the SCRIPTS engine, so you need to do a lot of extra checking to prove
that you can safely continue execution without actually performing the
transfer.


I'll happily add a 'strict' mode which does data transfers synchronously
in case any compatibility issues show up.

Such a mode would be slower of course.  We'll have to either do the I/O
in lots of little chunks or lose zerocopy.  Large transfers + memcpy is
probably the faster option.


But as you agreed above, large transfers+memcpy is not a realistic option
because it can have excessive memory requirements.


"Large" here refers to normal request sizes (which are large compared to
page-sized scatter list entries).  Having a 64k request submitted as a
single I/O plus a memcpy is most likely faster than submitting 16 I/O
requests of 4k each, one after another.  Buffering would be no problem
here.


But I still don't expect problems with zerocopy.  And zerocopy has no
noticeable host memory requirements, even for excessively large requests.


cheers,
  Gerd




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-11 Thread Gerd Hoffmann

On 11/11/09 05:06, Paul Brook wrote:

On Friday 06 November 2009, Gerd Hoffmann wrote:

Hi,

http://repo.or.cz/w/qemu/kraxel.git/shortlog/refs/heads/scsi.v6

What is in there?
   (3) New interface for HBA <=> SCSIDevice interaction:
   * building on the new SCSIRequest struct.
   * the silly read_data/write_data ping-pong game is gone.
   * support for scatter lists and zerocopy I/O is there.


The silly ping-pong game is there for a reason.

Your code assumes that the whole transfer will be completed in a single
request.


... to the qemu block layer.  Yes.

The HBA <=> guest driver communication is a completely different story
though.



This is not true for any of the current HBAs.  Expecting the HBA to
cache the entire response is simply not acceptable.


The current qemu code *does* cache the response.  scsi-disk caps the 
buffer at 128k (which is big enough for any request I've seen in my 
testing).  scsi-generic has no cap.


With the old interface scsi-disk and scsi-generic do the caching. 
Unconditionally.


With the new interface the HBA has to handle the caching if needed.  But 
the HBA also has the option to pass scatter lists, in which case qemu 
doesn't do any caching; the data is transferred directly from/to guest
memory.  Which is clearly an improvement IMO.



Remember that a single
transfer can be very large (greater than available ram on the host). Even with
a SG capable HBA, you can not assume all the data will be transferred in a
single request.


scsi-generic: It must be a single request anyway, and it already is today.

scsi-disk: dma_bdrv_{read,write} will split it into smaller chunks if 
needed.



You should also consider how this interacts with command queueing.
IIUC an Initiator (HBA) typically sends a tagged command, then disconnects
from the target (disk). At some later time the target reconnects, and the
initiator starts the DMA transfer. By my reading your code does not issue any
IO requests until after the HBA starts transferring data.


Hmm?  What code are you looking at?

For esp and usb-storage, reads and writes are handled slightly differently.
  They roughly work like this:


read requests:
  - allocate + parse scsi command.  scsi_req_get+scsi_req_parse
  - submit I/O to qemu block layer. scsi_req_buf
  - copy data do guest.
  - return status, release request  scsi_req_put

write requests:
  - allocate + parse scsi command.  scsi_req_get+scsi_req_parse
  - copy data from guest.
  - submit I/O to qemu block layer. scsi_req_buf
  - return status, release request  scsi_req_put

Oh, and both do not support command queuing anyway.

lsi (the only in-tree HBA with TCQ support) works like this:

- allocate + parse scsi command.scsi_req_get+scsi_req_parse
- continue script processing, collect
  DMA addresses and stick them into
  a scatter list until it is complete.
- queue command and disconnect.
- submit I/O to the qemu block layerscsi_req_sgl

*can process more scsi commands here*

- when I/O is finished reselect tag
- return status, release request.   scsi_req_put
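
A self-contained illustration of the "collect DMA addresses into a scatter
list until the command is complete" step; the types below are stand-ins for
illustration only, not the actual QEMUSGList/scsi_req_sgl API from the patches:

#include <stdint.h>
#include <stdlib.h>

/* Stand-in scatter list accumulator; the real code uses QEMUSGList. */
typedef struct {
    struct { uint64_t addr; uint64_t len; } *sg;
    int nsg, alloced;
    uint64_t collected;   /* bytes gathered so far */
    uint64_t expected;    /* bytes the parsed CDB asks for */
} SGAccum;

/* Add one DMA block produced by SCRIPTS processing. */
static int sg_add(SGAccum *a, uint64_t addr, uint64_t len)
{
    if (a->nsg == a->alloced) {
        int n = a->alloced ? a->alloced * 2 : 8;
        void *p = realloc(a->sg, n * sizeof(*a->sg));
        if (!p) {
            return -1;
        }
        a->sg = p;
        a->alloced = n;
    }
    a->sg[a->nsg].addr = addr;
    a->sg[a->nsg].len = len;
    a->nsg++;
    a->collected += len;
    return 0;
}

/* Once the scatter list covers the whole transfer, the command can be
 * queued, the device disconnects, and the I/O gets submitted. */
static int sg_complete(const SGAccum *a)
{
    return a->collected >= a->expected;
}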

[ Yes, this should go to the changelog.  As mentioned in the
  announcement the commit comments need some work ... ]


The only way to
achieve this is for the target to pretend it has data available immediately,
at which point the transfer stalls and we lose the opportunity for parallel
command queueing.


Note that command parsing and I/O submission are separate now.  So the
HBA knows how much data is going to be transferred by the command before
actually submitting the I/O.


cheers,
  Gerd





Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-11 Thread Paul Brook
 The current qemu code *does* cache the response.  scsi-disk caps the
 buffer at 128k (which is big enough for any request I've seen in my
 testing).  scsi-generic has no cap.

That cap is important.
For scsi-generic you probably don't have a choice because of the way the 
kernel interface works.
 
 With the new interface the HBA has to handle the caching if needed.  But
 the HBA also has the option to pass scatter lists, in which case qemu
 doesn't do any caching; the data is transferred directly from/to guest
 memory.  Which is clearly an improvement IMO.

  Remember that a single
  transfer can be very large (greater than available ram on the host). Even
  with a SG capable HBA, you can not assume all the data will be
  transferred in a single request.
 
 scsi-generic: It must be a single request anyway, and it already is today.
 
 scsi-disk: dma_bdrv_{read,write} will split it into smaller chunks if
 needed.

You seem to be assuming the HBA knows where it's going to put the data before 
it issues the command. This is not true (see below).

  You should also consider how this interacts with command queueing.
  IIUC an Initiator (HBA) typically sends a tagged command, then
  disconnects from the target (disk). At some later time the target
  reconnects, and the initiator starts the DMA transfer. By my reading your
  code does not issue any IO requests until after the HBA starts
  transferring data.
 
 Hmm?  What code you are looking at?
 
 For esp and usb-storage, reads and writes are handled slightly differently.
   They roughly work like this:

Neither ESP nor usb-storage implements command queueing, so they aren't interesting.

 lsi (the only in-tree HBA with TCQ support) works like this:
 
 - allocate + parse scsi command.scsi_req_get+scsi_req_parse
 - continue script processing, collect
DMA addresses and stick them into
a scatter list until it is complete.
 - queue command and disconnect.
 - submit I/O to the qemu block layerscsi_req_sgl
 
 *can process more scsi commands here*

 - when I/O is finished reselect tag
 - return status, release request.   scsi_req_put

I'm pretty sure this is wrong, and what actually happens is:

1) Wait for device to reconnect (goto 5), or commands from host (goto 2).

2) SCRIPTS connect to device, and send command.
3) If device has data immediately (metadata command) then goto 6
4) Device disconnects. goto 1

5) Device has data ready, and reconnects
6) SCRIPTS locate the next DMA block for this command, and initiate a (linear) 
DMA transfer.
7) DATA is transferred. Note that DMA stalls the SCRIPTS processor until the 
transfer completes.
8) If the device still has data then goto 6.
9) If the device runs out of data before the command completes then goto 3.
10) Command complete. goto 1

Note that the IO command is parsed at stage 2, but the data transfer is not 
requested until stage 6. i.e. after the command has partially completed. This 
window between issue and data transfer is where other commands are issued.

The only way to make your API work is to skip straight from step 3 to step 6, 
which effectively loses the command queueing capability. It may be that it's
hard/impossible to get both command queueing and zero-copy. In that case I say 
command queueing wins.

Also note that use of self-modifying SCRIPTS is common.

Paul




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-11 Thread Paul Brook
 That cap is important.
 For scsi-generic you probably don't have a choice because of the way the
 kernel interface works.

Exactly.  And why is the cap important for scsi-disk if scsi-generic
does fine without?

With scsi-generic you're at the mercy of what the kernel API gives you, and if 
the guest hardware/OS isn't cooperative then you lose. With scsi-disk we can
do significantly better.

  The only way to make your API work is to skip straight from step 3 to
  step 6, which effectively loses the command queueing capability.
 
 It doesn't.  The disconnect and thus the opportunity to submit more
 commands while the device is busy doing the actual I/O is there.

Disconnecting on the first DMA request (after switching to a data phase and 
transferring zero bytes) is bizarre behavior, but probably allowable.

However by my reading DMA transfers must be performed synchronously by the 
SCRIPTS engine, so you need to do a lot of extra checking to prove that you 
can safely continue execution without actually performing the transfer.

  It may be that it's
  hard/impossible to get both command queueing and zero-copy.
 
 I have it working.

More likely you have a nasty hack that happens to work with the Linux drivers. 
IIUC you're pretending that the DMA completed and eventually disconnecting the 
device, assuming that nothing will read that data until the command complete 
notification is received.

Consider the case where the guest transfers the data from a single command in
two blocks, and has the HBA raise an interrupt in-between so that it can start 
processing before the command completes. This processing could even be done by 
the SCRIPTS engine itself, or a guest could even reuse the buffer for the 
second DMA block.

Paul




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-09 Thread Gerd Hoffmann

On 11/07/09 16:22, Blue Swirl wrote:

In general the commits look good, there are many obviously correct cleanups.

However, what happens to the DPRINTFs?  It looks like they are removed
in the process.


I guess you are talking about the ones for each emulated command in 
scsi-disk.c?  There is scsi_print_req() now, filling this hole.  I'll 
stick in a call, wrapped into #ifdef DEBUG_SCSI, so you'll get this 
printed by default when compiling with debugging enabled.



You are also moving the compilation to Makefile.hw, which is not
exactly an improvement. Is this needed because of the QEMUIOVector
stuff?


Almost correct ;)

It is because of QEMUSGList which drags in a target_phys_addr_t dependency.

cheers,
  Gerd




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-09 Thread Avi Kivity

On 11/09/2009 11:08 AM, Gerd Hoffmann wrote:



You are also moving the compilation to Makefile.hw, which is not
exactly an improvement. Is this needed because of the QEMUIOVector
stuff?


Almost correct ;)

It is because of QEMUSGList which drags in a target_phys_addr_t 
dependency.


As Michael notes, devices have physical address sizes independent of the 
target platform; a PCI device that supports 64-bit addresses can be 
plugged into a motherboard that supports 32-bit address bus processors.


We can fix this in several ways:
- creating QEMUSG64List and QEMUSG32List (and typedefing PCISGList to 
the former)
- making QEMUSGList always use 64-bit addresses since it will almost 
always be used with devices (which are often 64-bit capable)
- making target_phys_addr_t always 64-bit (which loses some performance 
with 32-on-32 emulation)

- others?

--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-09 Thread Gerd Hoffmann

On 11/09/09 13:37, Avi Kivity wrote:

On 11/09/2009 11:08 AM, Gerd Hoffmann wrote:

It is because of QEMUSGList which drags in a target_phys_addr_t
dependency.


As Michael notes, devices have physical address sizes independent of the
target platform; a PCI device that supports 64-bit addresses can be
plugged into a motherboard that supports 32-bit address bus processors.

We can fix this in several ways:
- creating QEMUSG64List and QEMUSG32List (and typedefing PCISGList to
the former)


Doesn't fly.  Either every SCSI HBA uses QEMUSG64List (thus making this 
equivalent to the next option from the scsi point of view), or
scsi-disk would have to support both formats.



- making QEMUSGList always use 64-bit addresses since it will almost
always be used with devices (which are often 64-bit capable)


Possible.


- making target_phys_addr_t always 64-bit (which loses some performance
with 32-on-32 emulation)


Possible too.

Has the advantage that more code can be compiled only once; often
target_phys_addr_t is the only reason a source file is in Makefile.hw 
instead of Makefile.


Question is how big the performance hit actually is and whether we are
willing to accept that or not ...


cheers,
  Gerd





Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-09 Thread Avi Kivity

On 11/09/2009 03:03 PM, Gerd Hoffmann wrote:

On 11/09/09 13:37, Avi Kivity wrote:

On 11/09/2009 11:08 AM, Gerd Hoffmann wrote:

It is because of QEMUSGList which drags in a target_phys_addr_t
dependency.


As Michael notes, devices have physical address sizes independent of the
target platform; a PCI device that supports 64-bit addresses can be
plugged into a motherboard that supports 32-bit address bus processors.

We can fix this in several ways:
- creating QEMUSG64List and QEMUSG32List (and typedefing PCISGList to
the former)


Doesn't fly.  Either every SCSI HBA uses QEMUSG64List (thus making 
this equivalent to the next option from the scsi point of view), or
scsi-disk would have to support both formats.


A device or device set which is always 64-bit would always use 
QEMUSG64List.  A device which is always 32-bit would use QEMUSG32List.





- making QEMUSGList always use 64-bit addresses since it will almost
always be used with devices (which are often 64-bit capable)


Possible.


This is my preferred option btw.  There's no performance impact since 
actual device emulation will dwarf any 64-bit overhead.



- making target_phys_addr_t always 64-bit (which loses some performance
with 32-on-32 emulation)


Possible too.

Has the advantage that more code can be compiled only once; often
target_phys_addr_t is the only reason a source file is in Makefile.hw 
instead of Makefile.


Question is how big the performance hit actually is and whether we
are willing to accept that or not ...




Probably unmeasurable, but the qemu-devel thread size will be very 
measurable so not worth it.


Note platform-independent devices (like pci) shouldn't depend on 
target_phys_addr_t anyway; QEMUSG64List is one part of this.


--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-09 Thread Gerd Hoffmann

  Hi,


Doesn't fly. Either every SCSI HBA uses QEMUSG64List (thus making this
equivalent to the next option from the scsi point of view), or
scsi-disk would have to support both formats.


A device or device set which is always 64-bit would always use
QEMUSG64List. A device which is always 32-bit would use QEMUSG32List.


A scsi-disk can be connected to both 32-bit and 64-bit HBAs though ...


Question is how big the performance hit actually is and whether we
are willing to accept that or not ...


Probably unmeasurable, but the qemu-devel thread size will be very
measurable so not worth it.


;)

cheers,
  Gerd





Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-09 Thread Avi Kivity

On 11/09/2009 03:39 PM, Gerd Hoffmann wrote:

  Hi,


Doesn't fly. Either every SCSI HBA uses QEMUSG64List (thus making this
equivalent to the next option from the scsi point of view), or
scsi-disk would have to support both formats.


A device or device set which is always 64-bit would always use
QEMUSG64List. A device which is always 32-bit would use QEMUSG32List.


A scsi-disk can be connected to both 32-bit and 64-bit HBAs though ...



A line has to be drawn.  Since we have a limited number of HBAs, and 
since the performance impact will be minimal, we can safely use 64-bit 
for scsi.


--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-09 Thread Blue Swirl
On Mon, Nov 9, 2009 at 2:37 PM, Avi Kivity a...@redhat.com wrote:
 On 11/09/2009 11:08 AM, Gerd Hoffmann wrote:

 You are also moving the compilation to Makefile.hw, which is not
 exactly an improvement. Is this needed because of the QEMUIOVector
 stuff?

 Almost correct ;)

 It is because of QEMUSGList which drags in a target_phys_addr_t
 dependency.

 As Michael notes, devices have physical address sizes independent of the
 target platform; a PCI device that supports 64-bit addresses can be plugged
 into a motherboard that supports 32-bit address bus processors.

True. But I think the solution is not to make all buses maximum width,
but to support multiple buses with width translation in between.

 We can fix this in several ways:
 - creating QEMUSG64List and QEMUSG32List (and typedefing PCISGList to the
 former)
 - making QEMUSGList always use 64-bit addresses since it will almost always
 be used with devices (which are often 64-bit capable)
 - making target_phys_addr_t always 64-bit (which loses some performance with
 32-on-32 emulation)
 - others?

We could improve the compilation system so that on 64-bit hosts the
benefit of a single-size target_phys_addr_t results in compiling the
files only once.




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-09 Thread Gerd Hoffmann

On 11/09/09 21:38, Blue Swirl wrote:

As Michael notes, devices have physical address sizes independent of the
target platform; a PCI device that supports 64-bit addresses can be plugged
into a motherboard that supports 32-bit address bus processors.


True. But I think the solution is not to make all buses maximum width,
but to support multiple buses with width translation in between.


I suspect translating between 32 and 64 bit is more complex and adds more
overhead than simply using 64-bit all the time.



We could improve the compilation system so that on 64-bit hosts the
benefit of a single-size target_phys_addr_t results in compiling the
files only once.


I think this is already the case.  When building on an x86_64 host (both
32-bit and 64-bit targets) I find the libhw32 directory empty ...


cheers,
  Gerd




Re: [Qemu-devel] [sneak preview] major scsi overhaul

2009-11-07 Thread Blue Swirl
On Sat, Nov 7, 2009 at 1:09 AM, Gerd Hoffmann kra...@redhat.com wrote:
  Hi,

 http://repo.or.cz/w/qemu/kraxel.git/shortlog/refs/heads/scsi.v6

 What is in there?

  (1) Reworked scsi data structures.  There is a SCSIRequest struct now.
  (2) Restructured emulation in the scsi-disk driver.
  (3) New interface for HBA <=> SCSIDevice interaction:
      * building on the new SCSIRequest struct.
      * the silly read_data/write_data ping-pong game is gone.
      * support for scatter lists and zerocopy I/O is there.

Cool.

  (4) Added support for the new interface to scsi-disk and scsi-generic.
  (5) Switched esp and usb-storage to the new interface.
      * fixed usb-storage state engine along the way.  And, no, a
        backport to stable isn't going to happen.  With the new
        interface state tracking is simpler and easier to get right.
  (6) A bunch of misc fixes and improvements.

 What needs to be done?

  (1) Better patch descriptions.
  (2) Submit patches to the list for review.
  (3) Switch over lsi to the new interface.
  (4) Zap old interface, killing tons of dead code.
  (5) Final cleanups.

In general the commits look good, there are many obviously correct cleanups.

However, what happens to the DPRINTFs?  It looks like they are removed
in the process.

You are also moving the compilation to Makefile.hw, which is not
exactly an improvement. Is this needed because of the QEMUIOVector
stuff?