Re: [Qemu-devel] [sneak preview] major scsi overhaul
Gerd Hoffmann wrote:
> On 11/27/09 12:08, Gerd Hoffmann wrote:
>> On 11/26/09 16:50, Hannes Reinecke wrote:
>>> So indeed, this approach would only work if we signal some sense code back to the host. I, OTOH, don't have any qualms with returning HARDWARE_ERROR, 0x26/0x08 (TOO MANY SEGMENT DESCRIPTORS) resp. 0x26/0x0B (INLINE DATA LENGTH EXCEEDED). Feels only fair to notify the guest it has done something wrong. Also set the info field which Linux uses to figure out how many sectors it actually got.
> Hmm. Well. Seems to work out at least for Linux, i.e. it figures it got a bunch of sectors and tries to continue. Linux logs an I/O error. Also I didn't try other guests (yet). Using that as a way to limit scsi-disk request sizes probably isn't a good idea. For scsi-generic that would be an improvement over the current situation though.

Yes, quite. But for scsi-disk we could always fall back to using bounce buffers, could we not? Provided we get a detailed enough error code, but this could be arranged, methinks.

Cheers,

Hannes
--
Dr. Hannes Reinecke                   zSeries Storage
h...@suse.de                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 12/07/09 09:28, Hannes Reinecke wrote:
>> Hmm. Well. Seems to work out at least for Linux, i.e. it figures it got a bunch of sectors and tries to continue. Linux logs an I/O error. Also I didn't try other guests (yet). Using that as a way to limit scsi-disk request sizes probably isn't a good idea. For scsi-generic that would be an improvement over the current situation though.
> Yes, quite. But for scsi-disk we could always fall back to using bounce buffers, could we not?

We want to limit the bounce buffer size though. As there seems to be no easy way to make sure the guest doesn't submit requests larger than a certain limit, I guess there is no way around splitting the request into pieces for the bounce buffer case. We could add offset+size arguments to scsi_req_buf() to accomplish this.

cheers,
  Gerd

PS: FYI: I suspect I won't find time this year to continue working on this seriously, especially on the time-consuming testing part. Top priority right now is finishing touches for the 0.12 release. X-mas will be family time.
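The splitting Gerd describes could look roughly like the sketch below: feed one large transfer to a bounce buffer in pieces by calling a submit function with (offset, size) arguments. MAX_BOUNCE, piece_fn and split_request() are invented names for illustration -- the actual change would be adding offset+size arguments to scsi_req_buf().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not actual QEMU code: cap on the bounce buffer. */
#define MAX_BOUNCE (128 * 1024)

typedef void (*piece_fn)(size_t offset, size_t size, void *opaque);

/* Submit the request piecewise; returns the number of pieces.  A real
 * implementation would chain asynchronous completions rather than loop. */
static size_t split_request(size_t total, piece_fn submit, void *opaque)
{
    size_t offset = 0, pieces = 0;

    while (offset < total) {
        size_t size = total - offset;
        if (size > MAX_BOUNCE) {
            size = MAX_BOUNCE;
        }
        submit(offset, size, opaque);  /* e.g. scsi_req_buf(req, offset, size) */
        offset += size;
        pieces++;
    }
    return pieces;
}

/* Tiny callback counting submitted bytes, just for demonstration. */
static void count_bytes(size_t offset, size_t size, void *opaque)
{
    (void)offset;
    *(size_t *)opaque += size;
}
```

A 300k request would then go to the bounce buffer as two 128k pieces plus one 44k piece, with each piece completing before the next is submitted.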
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 11/27/09 12:08, Gerd Hoffmann wrote:
> On 11/26/09 16:50, Hannes Reinecke wrote:
>> So indeed, this approach would only work if we signal some sense code back to the host. I, OTOH, don't have any qualms with returning HARDWARE_ERROR, 0x26/0x08 (TOO MANY SEGMENT DESCRIPTORS) resp. 0x26/0x0B (INLINE DATA LENGTH EXCEEDED). Feels only fair to notify the guest it has done something wrong. Also set the info field which Linux uses to figure out how many sectors it actually got.

Hmm. Well. Seems to work out at least for Linux, i.e. it figures it got a bunch of sectors and tries to continue. Linux logs an I/O error. Also I didn't try other guests (yet). Using that as a way to limit scsi-disk request sizes probably isn't a good idea. For scsi-generic that would be an improvement over the current situation though.

cheers,
  Gerd
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 11/26/09 16:50, Hannes Reinecke wrote:
> So indeed, this approach would only work if we signal some sense code back to the host. I, OTOH, don't have any qualms with returning HARDWARE_ERROR, 0x26/0x08 (TOO MANY SEGMENT DESCRIPTORS) resp. 0x26/0x0B (INLINE DATA LENGTH EXCEEDED). Feels only fair to notify the guest it has done something wrong. Also set the info field which Linux uses to figure out how many sectors it actually got.

Now I really need some more up-to-date SCSI specs. Unfortunately t10.org doesn't allow public access to the drafts any more. Seems to be a recent change, and googling found me tons of references to t10.org but not the documents themselves. Does anyone have a source for me?

cheers,
  Gerd
Re: [Qemu-devel] [sneak preview] major scsi overhaul
Hi,

>> Answering large requests with "Illegal request, Invalid field in CDB" doesn't make Linux try smaller requests; instead it reports I/O errors to the syslog.
> Hmm. Can't we just put residuals to good use here? I.e. finish up the request up to the size we can handle, and return the original request with the transfer size set correctly.

I'll try.

> Should be straightforward to implement, one would assume.

The infrastructure and the HBAs have to handle that already. It frequently happens with MODE SENSE for example (guest passing a 256 byte buffer and we have less data to fill in). So it should be easy.

> And we could probably encapsulate it entirely within the bdrv as we don't actually need to expose those limits when the block driver layer is handling it correctly.

Hmm, I don't think so. SG_IO just returns EINVAL.

cheers,
  Gerd
Re: [Qemu-devel] [sneak preview] major scsi overhaul
Gerd Hoffmann wrote:
> Hi,
>>>> Answering large requests with "Illegal request, Invalid field in CDB" doesn't make Linux try smaller requests; instead it reports I/O errors to the syslog.
>>> Hmm. Can't we just put residuals to good use here? I.e. finish up the request up to the size we can handle, and return the original request with the transfer size set correctly.
> I'll try.
>>> Should be straightforward to implement, one would assume.
> The infrastructure and the HBAs have to handle that already. It frequently happens with MODE SENSE for example (guest passing a 256 byte buffer and we have less data to fill in). So it should be easy.
>>> And we could probably encapsulate it entirely within the bdrv as we don't actually need to expose those limits when the block driver layer is handling it correctly.
> Hmm, I don't think so. SG_IO just returns EINVAL.

Yes, but we can hook into scsi_generic_map() to cap the resulting iovec to the limits of the underlying block device. Then we submit an iovec which matches the capabilities of the underlying device. Of course we should take care to update the resulting xferlen values to match the actually submitted data, but that should be easily done. Then the guest would see a partial request and retry the remainder of the request, which (possibly after some iterations) would result in all data transferred, albeit at a lower speed.

Cheers,

Hannes
--
Dr. Hannes Reinecke                   zSeries Storage
h...@suse.de                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
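The capping Hannes proposes could be sketched as below: before submitting SG_IO, trim the iovec built by scsi_generic_map() to the host device's transfer limit and report the capped byte count, so the xferlen (and thus the residual seen by the guest) can be adjusted. The function name and signature here are invented for illustration, not the actual QEMU API.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical sketch: shorten an iovec so the total transfer does not
 * exceed max_bytes, and return the number of bytes that remain. */
static size_t cap_iovec(struct iovec *iov, int *niov, size_t max_bytes)
{
    size_t total = 0;
    int i;

    for (i = 0; i < *niov; i++) {
        if (total + iov[i].iov_len > max_bytes) {
            /* shorten this element and drop everything after it */
            iov[i].iov_len = max_bytes - total;
            *niov = iov[i].iov_len ? i + 1 : i;
            return max_bytes;
        }
        total += iov[i].iov_len;
    }
    return total;  /* request already within the limit */
}
```

The return value is what would go into the adjusted xferlen; the difference from the original length becomes the residual the guest retries.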
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 11/26/09 11:57, Hannes Reinecke wrote:
> Then the guest would see a partial request and retry the remainder of the request, which (possibly after some iterations) would result in all data transferred, albeit at a lower speed.

Except they don't.

/me looks at drivers/scsi/sd.c:sd_done()

I can't see any sane way to tell Linux that the request was too big.

cheers,
  Gerd
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 11/26/09 12:20, Hannes Reinecke wrote:
> Gerd Hoffmann wrote:
>> /me looks at drivers/scsi/sd.c:sd_done() I can't see any sane way to tell Linux that the request was too big.
> residuals is the key:
>
> drivers/scsi/scsi.c:scsi_finish_command()
>
>     good_bytes = scsi_bufflen(cmd);
>     if (cmd->request->cmd_type != REQ_TYPE_BLOCK_PC) {
>         int old_good_bytes = good_bytes;
>
>         drv = scsi_cmd_to_driver(cmd);
>         if (drv->done)
>             good_bytes = drv->done(cmd);

drv->done() actually is sd_done() mentioned above. sd_done() tries to figure out how many sectors it actually got for serious errors. I don't feel like signaling a medium error for the first sector behind our limit just because we'd like to have smaller requests.

>     /*
>      * USB may not give sense identifying bad sector and
>      * simply return a residue instead, so subtract off the
>      * residue if drv->done() error processing indicates no
>      * change to the completion length.
>      */
>     if (good_bytes == old_good_bytes)
>         good_bytes -= scsi_get_resid(cmd);

Poor man's bad sector identification. Same issue as above IMHO. On top of that I wouldn't expect all other guest OSes to have the same quirk in there to handle USB disks.

cheers,
  Gerd
Re: [Qemu-devel] [sneak preview] major scsi overhaul
Gerd Hoffmann wrote:
> On 11/26/09 12:20, Hannes Reinecke wrote:
>> Gerd Hoffmann wrote:
>>> /me looks at drivers/scsi/sd.c:sd_done() I can't see any sane way to tell Linux that the request was too big.
>> residuals is the key:
>>
>> drivers/scsi/scsi.c:scsi_finish_command()
>>
>>     good_bytes = scsi_bufflen(cmd);
>>     if (cmd->request->cmd_type != REQ_TYPE_BLOCK_PC) {
>>         int old_good_bytes = good_bytes;
>>
>>         drv = scsi_cmd_to_driver(cmd);
>>         if (drv->done)
>>             good_bytes = drv->done(cmd);
> drv->done() actually is sd_done() mentioned above. sd_done() tries to figure out how many sectors it actually got for serious errors. I don't feel like signaling a medium error for the first sector behind our limit just because we'd like to have smaller requests.
>>     /*
>>      * USB may not give sense identifying bad sector and
>>      * simply return a residue instead, so subtract off the
>>      * residue if drv->done() error processing indicates no
>>      * change to the completion length.
>>      */
>>     if (good_bytes == old_good_bytes)
>>         good_bytes -= scsi_get_resid(cmd);
> Poor man's bad sector identification. Same issue as above IMHO. On top of that I wouldn't expect all other guest OSes to have the same quirk in there to handle USB disks.

Sigh. Communication seems to be tricky here. Please, just ignore scsi_finish_command(), it's not doing anything useful here. Do skip this and study the next snippet, scsi_io_completion():

drivers/scsi/scsi_lib.c:scsi_io_completion()

    /*
     * A number of bytes were successfully read. If there
     * are leftovers and there is some kind of error
     * (result != 0), retry the rest.
     */
    if (scsi_end_request(cmd, error, good_bytes, result == 0) == NULL)
        return;

scsi_end_request is being called with the number of bytes _actually processed_, which might be less than the number of bytes requested. And the remainder will be retried by the upper layers.

Cheers,

Hannes
--
Dr. Hannes Reinecke                   zSeries Storage
h...@suse.de                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 11/26/09 15:27, Hannes Reinecke wrote:
> Gerd Hoffmann wrote:
>> sd_done() tries to figure out how many sectors it actually got for serious errors. I don't feel like signaling a medium error for the first sector behind our limit just because we'd like to have smaller requests.
> scsi_end_request is being called with the number of bytes _actually processed_, which might be less than the number of bytes requested. And the remainder will be retried by the upper layers.

The number of bytes _actually processed_ must come from somewhere. And that somewhere is sd_done() in case of scsi-disks. See my point now?

cheers,
  Gerd
Re: [Qemu-devel] [sneak preview] major scsi overhaul
Gerd Hoffmann wrote:
> On 11/26/09 15:27, Hannes Reinecke wrote:
>> Gerd Hoffmann wrote:
>>> sd_done() tries to figure out how many sectors it actually got for serious errors. I don't feel like signaling a medium error for the first sector behind our limit just because we'd like to have smaller requests.
>> scsi_end_request is being called with the number of bytes _actually processed_, which might be less than the number of bytes requested. And the remainder will be retried by the upper layers.
> The number of bytes _actually processed_ must come from somewhere. And that somewhere is sd_done() in case of scsi-disks. See my point now?

Ah. Now. I concur.

So indeed, this approach would only work if we signal some sense code back to the host. I, OTOH, don't have any qualms with returning HARDWARE_ERROR, 0x26/0x08 (TOO MANY SEGMENT DESCRIPTORS) resp. 0x26/0x0B (INLINE DATA LENGTH EXCEEDED). Feels only fair to notify the guest it has done something wrong.

The alternative would be to split the I/O into smaller sections. But this really feels a bit like overkill.

Cheers,

Hannes
--
Dr. Hannes Reinecke                   zSeries Storage
h...@suse.de                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
Re: [Qemu-devel] [sneak preview] major scsi overhaul
Gerd Hoffmann wrote:
> On 11/24/09 14:51, Paul Brook wrote:
>> On Tuesday 24 November 2009, Gerd Hoffmann wrote:
>>> On 11/16/09 19:53, Paul Brook wrote:
>>>> Capping the amount of memory required for a transfer *is* implemented, in both LSI and virtio-blk. The exception being SCSI passthrough where the kernel API makes it impossible.
>>> Well. Figured while doing more testing: The allowed request size is limited by the kernel, so scsi-generic requests larger than (currently) 128k fail. Now, how to handle *that*? Is there some way to signal to the guest that the request was too big?
>> Same as real hardware. Probably also want to populate the Block Limits VPD page appropriately.
> Some experiments later: Linux reads the block limits vpd, but seems to ignore the hard limit.

Ah. Maybe we should fix that up then ...

> Answering large requests with "Illegal request, Invalid field in CDB" doesn't make Linux try smaller requests; instead it reports I/O errors to the syslog.

Hmm. Can't we just put residuals to good use here? I.e. finish up the request up to the size we can handle, and return the original request with the transfer size set correctly. Then the SCSI stack of the guest should retry the leftovers, which then we should be able to handle. Or redo from start until we are able to. Should be straightforward to implement, one would assume.

And we could probably encapsulate it entirely within the bdrv as we don't actually need to expose those limits when the block driver layer is handling it correctly.

Cheers,

Hannes
--
Dr. Hannes Reinecke                   zSeries Storage
h...@suse.de                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 11/16/09 19:53, Paul Brook wrote:
> Capping the amount of memory required for a transfer *is* implemented, in both LSI and virtio-blk. The exception being SCSI passthrough where the kernel API makes it impossible.

Well. Figured while doing more testing: The allowed request size is limited by the kernel, so scsi-generic requests larger than (currently) 128k fail. Now, how to handle *that*? Is there some way to signal to the guest that the request was too big?

At least for known commands such as READ+WRITE which are likely to be big we could split the request internally into two (or more) if needed.

cheers,
  Gerd
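The internal split Gerd suggests could look roughly like this for READ(10): a CDB whose transfer length exceeds the kernel's limit is turned into two smaller CDBs covering the same range. split_read10() and max_sectors are illustrative names, not actual QEMU code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: split an oversized READ(10) CDB into two.
 * Returns the number of CDBs written to out[]. */
static int split_read10(const uint8_t *cdb, unsigned max_sectors,
                        uint8_t out[2][10])
{
    uint32_t lba = ((uint32_t)cdb[2] << 24) | (cdb[3] << 16) |
                   (cdb[4] << 8) | cdb[5];
    uint16_t count = (cdb[7] << 8) | cdb[8];

    memcpy(out[0], cdb, 10);
    if (count <= max_sectors) {
        return 1;                       /* fits, no split needed */
    }

    /* first piece: original LBA, capped transfer length */
    out[0][7] = max_sectors >> 8;
    out[0][8] = max_sectors & 0xff;

    /* second piece: advance the LBA, carry the remainder */
    memcpy(out[1], cdb, 10);
    uint32_t lba2 = lba + max_sectors;
    uint16_t rest = count - max_sectors;
    out[1][2] = lba2 >> 24;
    out[1][3] = lba2 >> 16;
    out[1][4] = lba2 >> 8;
    out[1][5] = lba2;
    out[1][7] = rest >> 8;
    out[1][8] = rest & 0xff;
    return 2;
}
```

A real implementation would also have to merge the completion status and sense of the pieces back into one reply to the guest.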
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On Tuesday 24 November 2009, Gerd Hoffmann wrote:
> On 11/16/09 19:53, Paul Brook wrote:
>> Capping the amount of memory required for a transfer *is* implemented, in both LSI and virtio-blk. The exception being SCSI passthrough where the kernel API makes it impossible.
> Well. Figured while doing more testing: The allowed request size is limited by the kernel, so scsi-generic requests larger than (currently) 128k fail. Now, how to handle *that*? Is there some way to signal to the guest that the request was too big?

Same as real hardware. Probably also want to populate the Block Limits VPD page appropriately.

Paul
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 11/11/09 17:38, Paul Brook wrote:
>>> That cap is important. For scsi-generic you probably don't have a choice because of the way the kernel interface works.
>> Exactly. And why is the cap important for scsi-disk if scsi-generic does fine without?
> With scsi-generic you're at the mercy of what the kernel API gives you, and if the guest hardware/OS isn't cooperative then you lose.

The guest will lose with unreasonably large requests: qemu_malloc() -> oom() -> abort() -> guest is gone.

We can also limit the amount of host memory we allow the guest to consume, so uncooperative guests can't push the host into swap. This is not implemented today, indicating that it hasn't been a problem so far. And with zerocopy it will be even less of a problem as we don't need host memory to buffer the data ...

>> It doesn't. The disconnect and thus the opportunity to submit more commands while the device is busy doing the actual I/O is there.
> Disconnecting on the first DMA request (after switching to a data phase and transferring zero bytes) is bizarre behavior, but probably allowable.

The new lsi code doesn't. The old code could do that under certain circumstances. And what is bizarre about that? A real hard drive will most likely do exactly that on reads (unless it has the data cached and can start the transfer instantly).

> However by my reading DMA transfers must be performed synchronously by the SCRIPTS engine, so you need to do a lot of extra checking to prove that you can safely continue execution without actually performing the transfer.

I'll happily add a 'strict' mode which does data transfers synchronously in case any compatibility issues show up. Such a mode would be slower of course. We'll have to either do the I/O in lots of little chunks or lose zerocopy. Large transfers + memcpy is probably the faster option.

> It may be that it's hard/impossible to get both command queueing and zero-copy.

I have it working.

> More likely you have a nasty hack that happens to work with the Linux drivers.

Linux works. Windows XP works. FreeBSD works. More regression testing is planned. Suggestions for other guests which might be sensitive to changes like this? Maybe even some which don't work with the current lsi emulation?

> IIUC you're pretending that the DMA completed and eventually disconnecting the device, assuming that nothing will read that data until the command complete notification is received.

Yes.

> Consider the case where the guest transfers the data from a single command in two blocks, and has the HBA raise an interrupt in-between so that it can start processing before the command completes. This processing could even be done by the SCRIPTS engine itself, or a guest could even reuse the buffer for the second DMA block.

In theory you can do all sorts of crazy stuff with the lsi SCRIPTS engine, like implementing an ext2 filesystem driver. In practice I expect guests will use the SCRIPTS engine only for stuff which is also supported by other SCSI controllers.

cheers,
  Gerd
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On Monday 16 November 2009, Gerd Hoffmann wrote:
> On 11/11/09 17:38, Paul Brook wrote:
>>>> That cap is important. For scsi-generic you probably don't have a choice because of the way the kernel interface works.
>>> Exactly. And why is the cap important for scsi-disk if scsi-generic does fine without?
>> With scsi-generic you're at the mercy of what the kernel API gives you, and if the guest hardware/OS isn't cooperative then you lose.
> The guest will lose with unreasonably large requests: qemu_malloc() -> oom() -> abort() -> guest is gone.

Exactly. This lossage is not acceptable.

> We can also limit the amount of host memory we allow the guest to consume, so uncooperative guests can't push the host into swap. This is not implemented today, indicating that it hasn't been a problem so far.

Capping the amount of memory required for a transfer *is* implemented, in both LSI and virtio-blk. The exception being SCSI passthrough where the kernel API makes it impossible.

> And with zerocopy it will be even less of a problem as we don't need host memory to buffer the data ...

zero-copy isn't possible in many cases. You must handle the other cases gracefully.

>>> It doesn't. The disconnect and thus the opportunity to submit more commands while the device is busy doing the actual I/O is there.
>> Disconnecting on the first DMA request (after switching to a data phase and transferring zero bytes) is bizarre behavior, but probably allowable.
> The new lsi code doesn't. The old code could do that under certain circumstances. And what is bizarre about that? A real hard drive will most likely do exactly that on reads (unless it has the data cached and can start the transfer instantly).

No. The old code goes directly from the command phase to the message (disconnect) phase.

>> However by my reading DMA transfers must be performed synchronously by the SCRIPTS engine, so you need to do a lot of extra checking to prove that you can safely continue execution without actually performing the transfer.
> I'll happily add a 'strict' mode which does data transfers synchronously in case any compatibility issues show up. Such a mode would be slower of course. We'll have to either do the I/O in lots of little chunks or lose zerocopy. Large transfers + memcpy is probably the faster option.

But as you agreed above, large transfers+memcpy is not a realistic option because it can have excessive memory requirements.

Paul
Re: [Qemu-devel] [sneak preview] major scsi overhaul
* Gerd Hoffmann kra...@redhat.com [2009-11-16 12:38]:
> On 11/11/09 17:38, Paul Brook wrote:
>> It may be that it's hard/impossible to get both command queueing and zero-copy.
> I have it working.
>> More likely you have a nasty hack that happens to work with the Linux drivers.
> Linux works. Windows XP works. FreeBSD works. More regression testing is planned. Suggestions for other guests which might be sensitive to changes like this? Maybe even some which don't work with the current lsi emulation?

XP 64-bit, 2k3 32/64, 2k8 32/64. 64-bit is important because the DMA mode of the lsi driver is different with the 64-bit versions of the win drivers.

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com
Re: [Qemu-devel] [sneak preview] major scsi overhaul
Hi,

>> Suggestions for other guests which might be sensitive to changes like this? Maybe even some which don't work with the current lsi emulation?
> XP 64-bit, 2k3 32/64, 2k8 32/64. 64-bit is important because the DMA mode of the lsi driver is different with the 64-bit versions of the win drivers.

Are these in the "needs regression testing" or in the "known broken" group?

thanks,
  Gerd
Re: [Qemu-devel] [sneak preview] major scsi overhaul
* Gerd Hoffmann kra...@redhat.com [2009-11-16 14:42]:
> Hi,
>>> Suggestions for other guests which might be sensitive to changes like this? Maybe even some which don't work with the current lsi emulation?
>> XP 64-bit, 2k3 32/64, 2k8 32/64. 64-bit is important because the DMA mode of the lsi driver is different with the 64-bit versions of the win drivers.
> Are these in the "needs regression testing" or in the "known broken" group?

Regression: xp-64, 2k3-32/64
Known broken: 2k8-32/64

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 11/16/09 19:53, Paul Brook wrote:
>> We can also limit the amount of host memory we allow the guest to consume, so uncooperative guests can't push the host into swap. This is not implemented today, indicating that it hasn't been a problem so far.
> Capping the amount of memory required for a transfer *is* implemented, in both LSI and virtio-blk. The exception being SCSI passthrough where the kernel API makes it impossible.

I was talking about scsi-generic. There is no option to reject excessively large requests, and it was no problem so far.

>> And with zerocopy it will be even less of a problem as we don't need host memory to buffer the data ...
> zero-copy isn't possible in many cases. You must handle the other cases gracefully.

I haven't yet found a guest OS where lsi can't do zerocopy. Name one where it doesn't work and I'll have a look.

>>> Disconnecting on the first DMA request (after switching to a data phase and transferring zero bytes) is bizarre behavior, but probably allowable.
>> The new lsi code doesn't. The old code could do that under certain circumstances. And what is bizarre about that? A real hard drive will most likely do exactly that on reads (unless it has the data cached and can start the transfer instantly).
> No. The old code goes directly from the command phase to the message (disconnect) phase.

Hmm, well. It switches from DI / DO to MI before the guest runs again, so the guest will not notice the switch ...

>>> However by my reading DMA transfers must be performed synchronously by the SCRIPTS engine, so you need to do a lot of extra checking to prove that you can safely continue execution without actually performing the transfer.
>> I'll happily add a 'strict' mode which does data transfers synchronously in case any compatibility issues show up. Such a mode would be slower of course. We'll have to either do the I/O in lots of little chunks or lose zerocopy. Large transfers + memcpy is probably the faster option.
> But as you agreed above, large transfers+memcpy is not a realistic option because it can have excessive memory requirements.

This "large" refers to normal request sizes (which are large compared to page-sized scatter list entries). Having a 64k request submitted as a single I/O, then memcpy is most likely faster than submitting 16 I/O requests with 4k each one after another. Buffering would be no problem here.

But I still don't expect problems with zerocopy though. And zerocopy hasn't noticeable host memory requirements even on excessively large requests.

cheers,
  Gerd
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 11/11/09 05:06, Paul Brook wrote:
> On Friday 06 November 2009, Gerd Hoffmann wrote:
>> Hi,
>>
>> http://repo.or.cz/w/qemu/kraxel.git/shortlog/refs/heads/scsi.v6
>>
>> What is in there?
>>
>> (3) New interface for HBA <-> SCSIDevice interaction:
>>     * building on the new SCSIRequest struct.
>>     * the silly read_data/write_data ping-pong game is gone.
>>     * support for scatter lists and zerocopy I/O is there.
> The silly ping-pong game is there for a reason. Your code assumes that the whole transfer will be completed in a single request.

... to the qemu block layer. Yes. The HBA <-> guest driver communication is a completely different story though.

> This is not true for any of the current HBAs. Expecting the HBA to cache the entire response is simply not acceptable.

The current qemu code *does* cache the response. scsi-disk caps the buffer at 128k (which is big enough for any request I've seen in my testing). scsi-generic has no cap.

With the old interface scsi-disk and scsi-generic do the caching. Unconditionally. With the new interface the HBA has to handle the caching if needed. But the HBA also has the option to pass scatter lists, in which case qemu doesn't do any caching; the data is transferred directly from/to guest memory. Which is clearly an improvement IMO.

> Remember that a single transfer can be very large (greater than available ram on the host). Even with a SG capable HBA, you can not assume all the data will be transferred in a single request.

scsi-generic: It must be a single request anyway, and it already is today.
scsi-disk: dma_bdrv_{read,write} will split it into smaller chunks if needed.

> You should also consider how this interacts with command queueing. IIUC an Initiator (HBA) typically sends a tagged command, then disconnects from the target (disk). At some later time the target reconnects, and the initiator starts the DMA transfer. By my reading your code does not issue any IO requests until after the HBA starts transferring data.

Hmm? What code are you looking at?

For esp and usb-storage, reads and writes are handled slightly differently. They roughly work like this:

read requests:
 - allocate + parse scsi command      (scsi_req_get + scsi_req_parse)
 - submit I/O to qemu block layer     (scsi_req_buf)
 - copy data to guest
 - return status, release request     (scsi_req_put)

write requests:
 - allocate + parse scsi command      (scsi_req_get + scsi_req_parse)
 - copy data from guest
 - submit I/O to qemu block layer     (scsi_req_buf)
 - return status, release request     (scsi_req_put)

Oh, and both do not support command queuing anyway.

lsi (only one in-tree with TCQ support) works like this:
 - allocate + parse scsi command      (scsi_req_get + scsi_req_parse)
 - continue script processing, collect DMA addresses and stick them into
   a scatter list until it is complete
 - queue command and disconnect
 - submit I/O to the qemu block layer (scsi_req_sgl)
   *can process more scsi commands here*
 - when I/O is finished reselect tag
 - return status, release request     (scsi_req_put)

[ Yes, this should go to the changelog. As mentioned in the announcement the commit comments need some work ... ]

> The only way to achieve this is for the target to pretend it has data available immediately, at which point the transfer stalls and we lose the opportunity for parallel command queueing.

Note that command parsing and I/O submitting are separate now. So the HBA knows how much data is going to be transferred by the command before actually submitting the I/O.

cheers,
  Gerd
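The get/parse/submit/put lifecycle described above can be modeled with a toy sketch: a refcounted request whose transfer size is known after parsing and before the I/O is submitted. Everything here is invented for illustration (only READ(10) is decoded); it mirrors only the shape of the described API, not actual QEMU code.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of the request lifecycle: get -> parse -> submit -> put. */
typedef struct ToyReq {
    int refcount;
    unsigned long xfer_len;   /* known after parse, before submit */
    int io_submitted;
} ToyReq;

static ToyReq *toy_req_get(void)
{
    ToyReq *req = calloc(1, sizeof(*req));
    req->refcount = 1;
    return req;
}

/* Decode only READ(10): bytes 7..8 of the CDB hold the sector count. */
static void toy_req_parse(ToyReq *req, const unsigned char *cdb)
{
    req->xfer_len = (unsigned long)((cdb[7] << 8) | cdb[8]) * 512;
}

static void toy_req_submit(ToyReq *req)
{
    req->io_submitted = 1;    /* stand-in for scsi_req_buf/scsi_req_sgl */
}

static int toy_req_put(ToyReq *req)   /* returns 1 when released */
{
    if (--req->refcount == 0) {
        free(req);
        return 1;
    }
    return 0;
}
```

The point of the separation is visible in the model: after toy_req_parse() the HBA can already size buffers or build a scatter list from xfer_len, and only then call the submit step.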
Re: [Qemu-devel] [sneak preview] major scsi overhaul
> The current qemu code *does* cache the response. scsi-disk caps the buffer at 128k (which is big enough for any request I've seen in my testing). scsi-generic has no cap.

That cap is important. For scsi-generic you probably don't have a choice because of the way the kernel interface works.

> With the new interface the HBA has to handle the caching if needed. But the HBA also has the option to pass scatter lists, in which case qemu doesn't do any caching; the data is transferred directly from/to guest memory. Which is clearly an improvement IMO.
>> Remember that a single transfer can be very large (greater than available ram on the host). Even with a SG capable HBA, you can not assume all the data will be transferred in a single request.
> scsi-generic: It must be a single request anyway, and it already is today.
> scsi-disk: dma_bdrv_{read,write} will split it into smaller chunks if needed.

You seem to be assuming the HBA knows where it's going to put the data before it issues the command. This is not true (see below).

>> You should also consider how this interacts with command queueing. IIUC an Initiator (HBA) typically sends a tagged command, then disconnects from the target (disk). At some later time the target reconnects, and the initiator starts the DMA transfer. By my reading your code does not issue any IO requests until after the HBA starts transferring data.
> Hmm? What code are you looking at? For esp and usb-storage, reads and writes are handled slightly differently. They roughly work like this:

Neither ESP nor usb-storage implement command queueing, so they aren't interesting.

> lsi (only one in-tree with TCQ support) works like this:
>  - allocate + parse scsi command      (scsi_req_get + scsi_req_parse)
>  - continue script processing, collect DMA addresses and stick them into
>    a scatter list until it is complete
>  - queue command and disconnect
>  - submit I/O to the qemu block layer (scsi_req_sgl)
>    *can process more scsi commands here*
>  - when I/O is finished reselect tag
>  - return status, release request     (scsi_req_put)

I'm pretty sure this is wrong, and what actually happens is:

1) Wait for device to reconnect (goto 5), or commands from host (goto 2).
2) SCRIPTS connect to device, and send command.
3) If device has data immediately (metadata command) then goto 6.
4) Device disconnects. goto 1
5) Device has data ready, and reconnects.
6) SCRIPTS locate the next DMA block for this command, and initiate a (linear) DMA transfer.
7) DATA is transferred. Note that DMA stalls the SCRIPTS processor until the transfer completes.
8) If the device still has data then goto 6.
9) If the device runs out of data before the command completes then goto 3.
10) Command complete. goto 1

Note that the IO command is parsed at stage 2, but the data transfer is not requested until stage 6, i.e. after the command has partially completed. This window between issue and data transfer is where other commands are issued.

The only way to make your API work is to skip straight from step 3 to step 6, which effectively loses the command queueing capability. It may be that it's hard/impossible to get both command queueing and zero-copy. In that case I say command queueing wins.

Also note that use of self-modifying SCRIPTS is common.

Paul
Re: [Qemu-devel] [sneak preview] major scsi overhaul
>> That cap is important. For scsi-generic you probably don't have a choice because of the way the kernel interface works.
> Exactly. And why is the cap important for scsi-disk if scsi-generic does fine without?

With scsi-generic you're at the mercy of what the kernel API gives you, and if the guest hardware/OS isn't cooperative then you lose. With scsi-disk we can do significantly better.

>> The only way to make your API work is to skip straight from step 3 to step 6, which effectively loses the command queueing capability.
> It doesn't. The disconnect and thus the opportunity to submit more commands while the device is busy doing the actual I/O is there.

Disconnecting on the first DMA request (after switching to a data phase and transferring zero bytes) is bizarre behavior, but probably allowable. However by my reading DMA transfers must be performed synchronously by the SCRIPTS engine, so you need to do a lot of extra checking to prove that you can safely continue execution without actually performing the transfer.

>> It may be that it's hard/impossible to get both command queueing and zero-copy.
> I have it working.

More likely you have a nasty hack that happens to work with the Linux drivers. IIUC you're pretending that the DMA completed and eventually disconnecting the device, assuming that nothing will read that data until the command complete notification is received. Consider the case where the guest transfers the data from a single command in two blocks, and has the HBA raise an interrupt in-between so that it can start processing before the command completes. This processing could even be done by the SCRIPTS engine itself, or a guest could even reuse the buffer for the second DMA block.

Paul
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 11/07/09 16:22, Blue Swirl wrote:
> In general the commits look good, there are many obviously correct cleanups. However, what happens to the DPRINTFs, it looks like they are removed in the process?

I guess you are talking about the ones for each emulated command in scsi-disk.c? There is scsi_print_req() now, filling this hole. I'll stick in a call, wrapped into #ifdef DEBUG_SCSI, so you'll get this printed by default when compiling with debugging enabled.

> You are also moving the compilation to Makefile.hw, which is not exactly an improvement. Is this needed because of the QEMUIOVector stuff?

Almost correct ;) It is because of QEMUSGList which drags in a target_phys_addr_t dependency.

cheers,
  Gerd
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 11/09/2009 11:08 AM, Gerd Hoffmann wrote:
>> You are also moving the compilation to Makefile.hw, which is not exactly an improvement. Is this needed because of the QEMUIOVector stuff?
>
> Almost correct ;) It is because of QEMUSGList which drags in a target_phys_addr_t dependency.

As Michael notes, devices have physical address sizes independent of the target platform; a PCI device that supports 64-bit addresses can be plugged into a motherboard that supports 32-bit address bus processors. We can fix this in several ways:

- creating QEMUSG64List and QEMUSG32List (and typedefing PCISGList to the former)
- making QEMUSGList always use 64-bit addresses since it will almost always be used with devices (which are often 64-bit capable)
- making target_phys_addr_t always 64-bit (which loses some performance with 32-on-32 emulation)
- others?

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 11/09/09 13:37, Avi Kivity wrote:
> On 11/09/2009 11:08 AM, Gerd Hoffmann wrote:
>> It is because of QEMUSGList which drags in a target_phys_addr_t dependency.
>
> As Michael notes, devices have physical address sizes independent of the target platform; a PCI device that supports 64-bit addresses can be plugged into a motherboard that supports 32-bit address bus processors. We can fix this in several ways:
>
> - creating QEMUSG64List and QEMUSG32List (and typedefing PCISGList to the former)

Doesn't fly. Either every SCSI HBA uses QEMUSG64List (thus making this equivalent to the next option from the scsi point of view), or scsi-disk would have to support both formats.

> - making QEMUSGList always use 64-bit addresses since it will almost always be used with devices (which are often 64-bit capable)

Possible.

> - making target_phys_addr_t always 64-bit (which loses some performance with 32-on-32 emulation)

Possible too. Has the advantage that more code can be compiled only once; often target_phys_addr_t is the only reason a source file is in Makefile.hw instead of Makefile. Question is how big the performance hit actually is and whether we are willing to accept that or not ...

cheers,
  Gerd
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 11/09/2009 03:03 PM, Gerd Hoffmann wrote:
> On 11/09/09 13:37, Avi Kivity wrote:
>> On 11/09/2009 11:08 AM, Gerd Hoffmann wrote:
>>> It is because of QEMUSGList which drags in a target_phys_addr_t dependency.
>>
>> As Michael notes, devices have physical address sizes independent of the target platform; a PCI device that supports 64-bit addresses can be plugged into a motherboard that supports 32-bit address bus processors. We can fix this in several ways:
>>
>> - creating QEMUSG64List and QEMUSG32List (and typedefing PCISGList to the former)
>
> Doesn't fly. Either every SCSI HBA uses QEMUSG64List (thus making this equivalent to the next option from the scsi point of view), or scsi-disk would have to support both formats.

A device or device set which is always 64-bit would always use QEMUSG64List. A device which is always 32-bit would use QEMUSG32List.

>> - making QEMUSGList always use 64-bit addresses since it will almost always be used with devices (which are often 64-bit capable)
>
> Possible.

This is my preferred option btw. There's no performance impact since actual device emulation will dwarf a 64-bit impact.

>> - making target_phys_addr_t always 64-bit (which loses some performance with 32-on-32 emulation)
>
> Possible too. Has the advantage that more code can be compiled only once; often target_phys_addr_t is the only reason a source file is in Makefile.hw instead of Makefile. Question is how big the performance hit actually is and whether we are willing to accept that or not ...

Probably unmeasurable, but the qemu-devel thread size will be very measurable, so not worth it.

Note that platform-independent devices (like pci) shouldn't depend on target_phys_addr_t anyway; QEMUSG64List is one part of this.

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [sneak preview] major scsi overhaul
Hi,

>> Doesn't fly. Either every SCSI HBA uses QEMUSG64List (thus making this equivalent to the next option from the scsi point of view), or scsi-disk would have to support both formats.
>
> A device or device set which is always 64-bit would always use QEMUSG64List. A device which is always 32-bit would use QEMUSG32List.

A scsi-disk can be connected to both 32-bit and 64-bit HBAs though ...

>> Question is how big the performance hit actually is and whether we are willing to accept that or not ...
>
> Probably unmeasurable, but the qemu-devel thread size will be very measurable, so not worth it.

;)

cheers,
  Gerd
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 11/09/2009 03:39 PM, Gerd Hoffmann wrote:
>> A device or device set which is always 64-bit would always use QEMUSG64List. A device which is always 32-bit would use QEMUSG32List.
>
> A scsi-disk can be connected to both 32-bit and 64-bit HBAs though ...

A line has to be drawn. Since we have a limited number of HBAs, and since the performance impact will be minimal, we can safely use 64-bit for scsi.

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On Mon, Nov 9, 2009 at 2:37 PM, Avi Kivity <a...@redhat.com> wrote:
> On 11/09/2009 11:08 AM, Gerd Hoffmann wrote:
>>> You are also moving the compilation to Makefile.hw, which is not exactly an improvement. Is this needed because of the QEMUIOVector stuff?
>>
>> Almost correct ;) It is because of QEMUSGList which drags in a target_phys_addr_t dependency.
>
> As Michael notes, devices have physical address sizes independent of the target platform; a PCI device that supports 64-bit addresses can be plugged into a motherboard that supports 32-bit address bus processors.

True. But I think the solution is not to make all buses maximum width, but to support multiple buses with width translation in between.

> We can fix this in several ways:
>
> - creating QEMUSG64List and QEMUSG32List (and typedefing PCISGList to the former)
> - making QEMUSGList always use 64-bit addresses since it will almost always be used with devices (which are often 64-bit capable)
> - making target_phys_addr_t always 64-bit (which loses some performance with 32-on-32 emulation)
> - others?

We could improve the compilation system so that on 64-bit hosts the benefit of a single-size target_phys_addr_t results in compiling the files only once.
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On 11/09/09 21:38, Blue Swirl wrote:
>> As Michael notes, devices have physical address sizes independent of the target platform; a PCI device that supports 64-bit addresses can be plugged into a motherboard that supports 32-bit address bus processors.
>
> True. But I think the solution is not to make all buses maximum width, but to support multiple buses with width translation in between.

I suspect translating between 32 and 64 bit is more complex and more overhead than simply using 64 bit all the time.

> We could improve the compilation system so that on 64-bit hosts the benefit of a single-size target_phys_addr_t results in compiling the files only once.

I think this is already the case. When building on an x86_64 host (both 32-bit and 64-bit targets) I find the libhw32 directory empty ...

cheers,
  Gerd
Re: [Qemu-devel] [sneak preview] major scsi overhaul
On Sat, Nov 7, 2009 at 1:09 AM, Gerd Hoffmann <kra...@redhat.com> wrote:
> Hi,
>
> http://repo.or.cz/w/qemu/kraxel.git/shortlog/refs/heads/scsi.v6
>
> What is in there?
>
> (1) Reworked scsi data structures. There is a SCSIRequest struct now.
> (2) Restructured emulation in the scsi-disk driver.
> (3) New interface for HBA <=> SCSIDevice interaction:
>     * building on the new SCSIRequest struct.
>     * the silly read_data/write_data ping-pong game is gone.
>     * support for scatter lists and zerocopy I/O is there.

Cool.

> (4) Added support for the new interface to scsi-disk and scsi-generic.
> (5) Switched esp and usb-storage to the new interface.
>     * fixed usb-storage state engine along the way. And, no, a backport to stable isn't going to happen. With the new interface state tracking is simpler and easier to get right.
> (6) A bunch of misc fixes and improvements.
>
> What needs to be done?
>
> (1) Better patch descriptions.
> (2) Submit patches to the list for review.
> (3) Switch over lsi to the new interface.
> (4) Zap old interface, killing tons of dead code.
> (5) Final cleanups.

In general the commits look good; there are many obviously correct cleanups. However, what happens to the DPRINTFs? It looks like they are removed in the process.

You are also moving the compilation to Makefile.hw, which is not exactly an improvement. Is this needed because of the QEMUIOVector stuff?