Re: [PATCH] virtio: a new vcpu watchdog driver
On 7/31/23 04:30, Xuan Zhuo wrote:
On Mon, 31 Jul 2023 09:25:12 +0800, "zhanghao1" wrote:
A new virtio pci driver is added for listening to vcpus inside the guest. Each vcpu creates a corresponding thread to periodically send data to qemu's back-end watchdog device. If a vCPU is in the stall state, data cannot be sent to the back-end virtio device. As a result, the back-end device can detect that the guest is in the stall state. The driver is mainly used with the back-end watchdog device of qemu. The qemu backend watchdog device is implemented as follows: https://lore.kernel.org/qemu-devel/20230705081813.411526-1-zhangh...@kylinos.cn/

I think we need to introduce this new device in the virtio spec first.

And when having a watchdog driver: shouldn't you watch the virtio queue _itself_? There is no guarantee that the virtio queue makes forward progress ...

Cheers, Hannes
--
Dr. Hannes Reinecke, Kernel Storage Architect
h...@suse.de, +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Martje Boudien Moerman
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
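The stall-detection idea in the patch above (per-vCPU heartbeats, with the backend flagging a vCPU whose beats stop arriving) can be modelled in a few lines of plain C. This is an illustrative userspace sketch, not the proposed driver; all names are invented:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-vCPU heartbeat: the guest thread
 * records a timestamp each time it successfully posts to the watchdog
 * queue; the backend declares a stall once no beat has arrived within
 * the timeout window. */
struct vcpu_heartbeat {
    uint64_t last_beat_ms;   /* time of last successful send, in ms */
};

static void heartbeat_send(struct vcpu_heartbeat *hb, uint64_t now_ms)
{
    hb->last_beat_ms = now_ms;
}

static bool heartbeat_stalled(const struct vcpu_heartbeat *hb,
                              uint64_t now_ms, uint64_t timeout_ms)
{
    return now_ms - hb->last_beat_ms > timeout_ms;
}
```

Note that this also illustrates Hannes's objection: the model only detects that *beats* stopped, not why; if the virtio queue itself stops making forward progress, the beat never reaches the backend either way.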
Re: [RFC PATCH v2 48/48] sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)
On 3/29/23 16:13, David Howells wrote:
[!] Note: This is a work in progress. At the moment, some things won't build if this patch is applied. nvme, kcm, smc, tls.

Remove ->sendpage() and ->sendpage_locked(). sendmsg() with MSG_SPLICE_PAGES should be used instead. This allows multiple pages and multipage folios to be passed through.

Signed-off-by: David Howells
Acked-by: Marc Kleine-Budde # for net/can
cc: "David S. Miller" cc: Eric Dumazet cc: Jakub Kicinski cc: Paolo Abeni cc: Jens Axboe cc: Matthew Wilcox cc: b...@vger.kernel.org cc: d...@vger.kernel.org cc: linux-...@lists.infradead.org cc: linux-arm-...@vger.kernel.org cc: linux-...@vger.kernel.org cc: linux-cry...@vger.kernel.org cc: linux-...@vger.kernel.org cc: linux-h...@vger.kernel.org cc: linux-ker...@vger.kernel.org cc: linux-r...@vger.kernel.org cc: linux-s...@vger.kernel.org cc: linux-w...@vger.kernel.org cc: linux-...@vger.kernel.org cc: mp...@lists.linux.dev cc: net...@vger.kernel.org cc: rds-de...@oss.oracle.com cc: tipc-discuss...@lists.sourceforge.net cc: virtualization@lists.linux-foundation.org
---
 Documentation/networking/scaling.rst |   4 +-
 crypto/af_alg.c                      |  29 --
 crypto/algif_aead.c                  |  22 +
 crypto/algif_rng.c                   |   2 -
 crypto/algif_skcipher.c              |  14 ---
 include/linux/net.h                  |   8 --
 include/net/inet_common.h            |   2 -
 include/net/sock.h                   |   6 --
 net/appletalk/ddp.c                  |   1 -
 net/atm/pvc.c                        |   1 -
 net/atm/svc.c                        |   1 -
 net/ax25/af_ax25.c                   |   1 -
 net/caif/caif_socket.c               |   2 -
 net/can/bcm.c                        |   1 -
 net/can/isotp.c                      |   1 -
 net/can/j1939/socket.c               |   1 -
 net/can/raw.c                        |   1 -
 net/core/sock.c                      |  35 +--
 net/dccp/ipv4.c                      |   1 -
 net/dccp/ipv6.c                      |   1 -
 net/ieee802154/socket.c              |   2 -
 net/ipv4/af_inet.c                   |  21
 net/ipv4/tcp.c                       |  34 ---
 net/ipv4/tcp_bpf.c                   |  21 +---
 net/ipv4/tcp_ipv4.c                  |   1 -
 net/ipv4/udp.c                       |  22 -
 net/ipv4/udp_impl.h                  |   2 -
 net/ipv4/udplite.c                   |   1 -
 net/ipv6/af_inet6.c                  |   3 -
 net/ipv6/raw.c                       |   1 -
 net/ipv6/tcp_ipv6.c                  |   1 -
 net/key/af_key.c                     |   1 -
 net/l2tp/l2tp_ip.c                   |   1 -
 net/l2tp/l2tp_ip6.c                  |   1 -
 net/llc/af_llc.c                     |   1 -
 net/mctp/af_mctp.c                   |   1 -
 net/mptcp/protocol.c                 |   2 -
 net/netlink/af_netlink.c             |   1 -
 net/netrom/af_netrom.c               |   1 -
 net/packet/af_packet.c               |   2 -
 net/phonet/socket.c                  |   2 -
 net/qrtr/af_qrtr.c                   |   1 -
 net/rds/af_rds.c                     |   1 -
 net/rose/af_rose.c                   |   1 -
 net/rxrpc/af_rxrpc.c                 |   1 -
 net/sctp/protocol.c                  |   1 -
 net/socket.c                         |  48 -
 net/tipc/socket.c                    |   3 -
 net/unix/af_unix.c                   | 139 ---
 net/vmw_vsock/af_vsock.c             |   3 -
 net/x25/af_x25.c                     |   1 -
 net/xdp/xsk.c                        |   1 -
 52 files changed, 9 insertions(+), 447 deletions(-)

Weelll ... what happens to consumers of kernel_sendpage()? (Let's call them nvme ...) Should they be moved over, too? Or what is the general consensus here?

(And what do we do with TLS? It does have a ->sendpage() version, too ...)

Cheers, Hannes
Re: Virtio-net - add timeouts to control commands
On 8/24/22 11:42, Alvaro Karsz wrote: Hi Hannes, a) let the device do the timeout: pass in a timeout value with the command, and allow the device to return an ETIMEDOUT error when the timeout expires. Then it's up to the device to do the necessary timeout handling; the server won't be involved at all (except for handling an ETIMEDOUT error) This won't work if the device crashes. b) implement an 'abort' command. With that the server controls the timeout, and is allowed to send an 'abort' command when the timeout expires. That requires the device to be able to abort commands (which not all devices are able to), but avoids having to implement a timeout handling in the device. I actually thought about this idea. This may work, but you'll still have a few moments when the server assumes that the command failed, and the network device assumes that it succeeded. So the server may still receive packets in an unexpected queue. No. The server may only assume that the command failed until it gets the response for the abort command. Before that the state of the command is undefined, and the server may not assume anything here. And then we get into the fun topic of timing out aborts, which really can only be resolved if you have a fool-proof way of resetting the queue itself. But I guess virtio can do that (right?). I am very much in favour of having timeouts for virtio commands; we've had several massive customer escalations which could have been solved if we were able to set the command timeout in the VM. As this was for virtio-scsi/virtio-block I would advocate to have a generic virtio command timeout, not a protocol-specific one. This may be difficult to implement. Especially when multiple commands may be queued at the same time, and the device can handle the commands in any order. We'll need to add identifiers for every command. Why, but of course. 
You cannot assume in-order delivery of the completions; in fact, that's the whole _point_ of having a queue-based I/O command delivery method. I'm actually referring here to the Linux kernel implementation of virtnet control commands, in which the server spins for a response. Sheesh. Spinning for a response is never a good idea, as this means you'll end up with a non-interruptible command in the guest (essentially an ioctl into the hypervisor). And really, identifying the command isn't hard. Each command already has an identifier (namely the virtio ring index), so if in doubt you can always use that. To be foolproof, though, you might want to add a 'real' identifier (like a 32 or 64 bit command tag), which would even allow you to catch uninitialized / completed commands. Tends to be quite important when implementing an 'abort' command, as the command referred to by the 'abort' command might have been completed by the time the hypervisor processes the abort command. Cheers, Hannes
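The "real identifier" argument can be illustrated with a tiny tag table. This is a hypothetical userspace model, not kernel code: a completion is only honoured if its tag refers to a command that is still in flight, which is exactly what lets a late completion (e.g. one arriving after an abort) be recognised and dropped:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tag table showing why commands need identifiers: a
 * completion that arrives after the command was aborted (or was never
 * issued) must be recognisable and dropped, not matched to whatever
 * command happens to occupy the slot now. */
enum cmd_state { CMD_FREE, CMD_INFLIGHT, CMD_ABORTED };

#define MAX_TAGS 8
static enum cmd_state tags[MAX_TAGS];

/* Allocate a free tag and mark the command in flight; -1 if full. */
static int cmd_issue(void)
{
    for (int t = 0; t < MAX_TAGS; t++)
        if (tags[t] == CMD_FREE) {
            tags[t] = CMD_INFLIGHT;
            return t;
        }
    return -1;
}

static void cmd_abort(int tag)
{
    tags[tag] = CMD_ABORTED;
}

/* Returns true only if the completion matched an in-flight command;
 * a late completion for an aborted tag just frees the slot. */
static bool cmd_complete(int tag)
{
    if (tags[tag] != CMD_INFLIGHT) {
        if (tags[tag] == CMD_ABORTED)
            tags[tag] = CMD_FREE;
        return false;
    }
    tags[tag] = CMD_FREE;
    return true;
}
```

The state check in `cmd_complete()` is the crux: without it, a completion for an aborted command would be credited to the next command reusing that ring slot.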
Re: Virtio-net - add timeouts to control commands
On 8/24/22 11:06, Jason Wang wrote: On Wed, Aug 24, 2022 at 3:52 PM Alvaro Karsz wrote: I think that we should add a timeout to the control virtqueue commands. If the hypervisor crashes while handling a control command, the guest will spin forever. This may not be necessary for a virtual environment, when both the hypervisor and the guest OS run in the same bare metal, but this is needed for a physical network device compatible with VirtIO. (In these cases, the network device acts as the hypervisor, and the server acts as the guest OS). The network device may fail to answer a control command, or may crash, leading to a stall in the server. My idea is to add a big enough timeout, to allow the slow devices to complete the command. I wrote a simple patch that returns false from virtnet_send_command in case of timeouts. The timeout approach introduces some serious problems in cases when the network device does answer the control command, but after the timeout. * The device will think that the command succeeded, while the server won't. This may be serious with the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command. The server may receive packets in an unexpected queue. * virtqueue_get_buf will return the previous response for the next control command. Addressing this case by adding a timeout to the spec won't be easy, since the network device and the server have different clocks, and the server won't know when exactly the network device noticed the kick. So maybe we should call virtnet_remove if we reach a timeout. Or reset but can we simply use interrupt instead of the busy waiting? There are two possible ways of handling this: a) let the device do the timeout: pass in a timeout value with the command, and allow the device to return an ETIMEDOUT error when the timeout expires. Then it's up to the device to do the necessary timeout handling; the server won't be involved at all (except for handling an ETIMEDOUT error) b) implement an 'abort' command. 
With that the server controls the timeout, and is allowed to send an 'abort' command when the timeout expires. That requires the device to be able to abort commands (which not all devices are able to), but avoids having to implement a timeout handling in the device. We can actually specify both methods, and have configuration bits indicating which method is supported by the device. I am very much in favour of having timeouts for virtio commands; we've had several massive customer escalations which could have been solved if we were able to set the command timeout in the VM. As this was for virtio-scsi/virtio-block I would advocate to have a generic virtio command timeout, not a protocol-specific one. Cheers, Hannes
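The "specify both methods and advertise them via configuration bits" suggestion could be sketched as follows. The feature-bit names here are invented for illustration and are not part of the virtio spec:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits -- NOT part of the virtio spec; they only
 * illustrate how a device could advertise which timeout mechanism
 * (if any) it implements. */
#define VIRTIO_F_CTRL_CMD_TIMEOUT  (1u << 0)  /* device honours a per-command timeout */
#define VIRTIO_F_CTRL_CMD_ABORT    (1u << 1)  /* device implements an 'abort' command */

/* Driver-side policy: prefer the device-side timeout, fall back to
 * driver-issued aborts, else there is no recovery short of a queue
 * reset. */
enum timeout_policy { TIMEOUT_DEVICE, TIMEOUT_ABORT, TIMEOUT_NONE };

static enum timeout_policy pick_policy(uint32_t features)
{
    if (features & VIRTIO_F_CTRL_CMD_TIMEOUT)
        return TIMEOUT_DEVICE;
    if (features & VIRTIO_F_CTRL_CMD_ABORT)
        return TIMEOUT_ABORT;
    return TIMEOUT_NONE;
}
```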
Re: Virtio-net - add timeouts to control commands
On 8/24/22 11:16, Alvaro Karsz wrote: Hi Jason, Or reset but can we simply use interrupt instead of the busy waiting? I agree that timeouts are not needed using interrupts. Quite the contrary. There is no guarantee that a completion for a given command will be signaled at all; if the completion is never posted to the virtio queue the VM will spin forever. This can be done, but seems like a big change. All functions calling virtnet_send_command expect to get a response immediately. This may be difficult, since some of the commands are sent in the probe function. This is a simplistic approach. _Any_ command sent via virtio will have to have some sort of handling in the host, so an immediate response is not guaranteed. Especially virtio-block commands will _not_ be handled immediately. Cheers, Hannes
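A minimal sketch of what "not spinning forever" means in practice: bound the wait and return an error once the deadline passes, instead of looping unconditionally. This is a hedged userspace model using nanosleep(), not the virtnet code, and the 1 ms tick is only an approximation of elapsed time:

```c
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical replacement for the busy-wait: sleep briefly between
 * checks and give up with -ETIMEDOUT once the deadline passes, so the
 * guest CPU is not pinned while the device (or hypervisor) is dead.
 * The loop counts 1 ms ticks, so the bound is approximate. */
static int wait_for_response(volatile bool *done, unsigned int timeout_ms)
{
    struct timespec tick = { 0, 1000000 };  /* 1 ms */

    for (unsigned int waited = 0; waited < timeout_ms; waited++) {
        if (*done)
            return 0;
        nanosleep(&tick, NULL);
    }
    return *done ? 0 : -ETIMEDOUT;
}
```

In the kernel the same shape would normally be an interrupt-driven completion with a timeout rather than a sleep loop, but the contract is identical: the caller always gets back control, with either success or -ETIMEDOUT.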
Re: [RFC] virtio_scsi: to poll and kick the virtqueue in timeout handler
On 5/25/21 6:47 PM, Stefan Hajnoczi wrote: On Mon, May 24, 2021 at 11:33:33PM -0700, Dongli Zhang wrote: On 5/24/21 6:24 AM, Stefan Hajnoczi wrote: On Sun, May 23, 2021 at 09:39:51AM +0200, Hannes Reinecke wrote: On 5/23/21 8:38 AM, Dongli Zhang wrote: This RFC is to trigger the discussion about to poll and kick the virtqueue on purpose in virtio-scsi timeout handler. The virtio-scsi relies on the virtio vring shared between VM and host. The VM side produces requests to vring and kicks the virtqueue, while the host side produces responses to vring and interrupts the VM side. By default the virtio-scsi handler depends on the host timeout handler by BLK_EH_RESET_TIMER to give host a chance to perform EH. However, this is not helpful for the case that the responses are available on vring but the notification from host to VM is lost. How can this happen? If responses are lost the communication between VM and host is broken, and we should rather reset the virtio rings themselves. I agree. In principle it's fine to poll the virtqueue at any time, but I don't understand the failure scenario here. It's not clear to me why the device-to-driver vq notification could be lost. One example is the CPU hotplug issue before the commit bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline") was available. The issue is equivalent to loss of interrupt. Without the CPU hotplug fix, while NVMe driver relies on the timeout handler to complete inflight IO requests, the PV virtio-scsi may hang permanently. In addition, as the virtio/vhost/QEMU are complex software, we are not able to guarantee there is no further lost of interrupt/kick issue in the future. It is really painful if we encounter such issue in production environment. Any number of hardware or software bugs might exist that we don't know about, yet we don't pre-emptively add workarounds for them because where do you draw the line? 
I checked other SCSI/block drivers and found it's rare to poll in the timeout function so there does not seem to be a consensus that it's useful to do this. Not only this; it's downright dangerous attempting to do that in SCSI. In SCSI we don't have the fixed lifetime guarantees that NVMe has, so there will be a race condition between timeout and command completion. Plus there is no interface in SCSI allowing to 'poll' for completions in a meaningful manner. That said, it's technically fine to do it, the virtqueue APIs are there and can be used like this. So if you and others think this is necessary, then it's a pretty small change and I'm not against merging a patch like this. I would rather _not_ put more functionality into the virtio_scsi timeout handler; this only serves to assume that the timeout handler has some functionality in virtio. Which it patently hasn't, as the prime reason for a timeout handler is to _abort_ a command, which we can't on virtio. Well, we can on virtio, but qemu as the main user will re-route the I/O from virtio into doing async-I/O, and there is no way we can abort outstanding asynchronous I/O. Or any other ioctl, for that matter. Cheers, Hannes
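The "responses are available on the vring but the notification was lost" case that started this thread boils down to comparing indices. A hypothetical sketch (a real split virtqueue keeps a free-running 16-bit used index in shared memory; this model keeps only the index arithmetic):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the "poll before reset" idea: compare the
 * device's used index against what the driver has consumed. If new
 * responses are sitting in the ring, only the notification was lost,
 * and the driver can reap them without a full device reset. */
struct vring_state {
    uint16_t used_idx;   /* advanced by the device */
    uint16_t last_seen;  /* advanced by the driver as it consumes */
};

/* Number of responses available but not yet consumed; the uint16_t
 * subtraction handles index wraparound naturally. */
static uint16_t poll_pending(const struct vring_state *vq)
{
    return (uint16_t)(vq->used_idx - vq->last_seen);
}
```

If `poll_pending()` returns non-zero in a timeout handler, the completions were merely unsignaled; if it returns zero, the host genuinely has not produced responses and (per the discussion above) a ring reset is the more honest remedy.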
Re: [RFC] virtio_scsi: to poll and kick the virtqueue in timeout handler
On 5/23/21 8:38 AM, Dongli Zhang wrote: This RFC is to trigger the discussion about to poll and kick the virtqueue on purpose in virtio-scsi timeout handler. The virtio-scsi relies on the virtio vring shared between VM and host. The VM side produces requests to vring and kicks the virtqueue, while the host side produces responses to vring and interrupts the VM side. By default the virtio-scsi handler depends on the host timeout handler by BLK_EH_RESET_TIMER to give host a chance to perform EH. However, this is not helpful for the case that the responses are available on vring but the notification from host to VM is lost. How can this happen? If responses are lost the communication between VM and host is broken, and we should rather reset the virtio rings themselves. Cheers, Hannes
Re: [PATCH RFC v2 02/24] scsi: allocate separate queue for reserved commands
On 4/23/20 4:13 PM, John Garry wrote: On 07/04/2020 17:30, Christoph Hellwig wrote: On Tue, Apr 07, 2020 at 04:00:10PM +0200, Hannes Reinecke wrote: My concern is this:

struct scsi_device *scsi_get_host_dev(struct Scsi_Host *shost)
{
	[ .. ]
	starget = scsi_alloc_target(&shost->shost_gendev, 0, shost->this_id);
	[ .. ]

and we have typically:

drivers/scsi/hisi_sas/hisi_sas_v3_hw.c:	.this_id = -1,

It's _very_ uncommon to have a negative number as the SCSI target device; in fact, it _is_ an unsigned int already. But alright, I'll give it a go; let's see what I'll end up with.

But this shouldn't be exposed anywhere. And I prefer that over having magic requests/scsi_cmnd that do not have a valid ->device pointer.

(just looking at this again) Hi Christoph, So how would this look added in scsi_lib.c:

struct scsi_cmnd *scsi_get_reserved_cmd(struct Scsi_Host *shost)
{
	struct scsi_cmnd *scmd;
	struct request *rq;
	struct scsi_device *sdev = scsi_get_host_dev(shost);

	if (!sdev)
		return NULL;

	rq = blk_mq_alloc_request(sdev->request_queue,
				  REQ_OP_DRV_OUT | REQ_NOWAIT,
				  BLK_MQ_REQ_RESERVED);
	if (IS_ERR(rq))	// fix tidy-up
		return NULL;

	WARN_ON(rq->tag == -1);
	scmd = blk_mq_rq_to_pdu(rq);
	scmd->request = rq;
	scmd->device = sdev;

	return scmd;
}
EXPORT_SYMBOL_GPL(scsi_get_reserved_cmd);

void scsi_put_reserved_cmd(struct scsi_cmnd *scmd)
{
	struct request *rq = blk_mq_rq_from_pdu(scmd);

	if (blk_mq_rq_is_reserved(rq)) {
		struct scsi_device *sdev = scmd->device;

		blk_mq_free_request(rq);
		scsi_free_host_dev(sdev);
	}
}
EXPORT_SYMBOL_GPL(scsi_put_reserved_cmd);

Not sure if we want a static scsi_device per host, or alloc and free dynamically. (@Hannes, I also have some proper patches for libsas if you want to add it)

Hold your horses. I'm currently preparing a patchset implementing things by improving the current scsi_get_host_dev() etc. RFC should be ready by the end of the week.

Cheers, Hannes
Re: [PATCH RFC v2 02/24] scsi: allocate separate queue for reserved commands
On 4/7/20 4:35 PM, John Garry wrote: On 07/04/2020 15:00, Hannes Reinecke wrote: On 4/7/20 1:54 PM, John Garry wrote: On 06/04/2020 10:05, Hannes Reinecke wrote:
[ .. ]
This would be okay if 'this_id' would have been defined by the driver; sadly, most drivers which are affected here do set 'this_id' to -1. So we wouldn't have a nice target ID to allocate the device from, let alone the problem that we would have to emulate a complete scsi device with all required minimal command support etc. And I'm not quite sure how well that would play with the existing SCSI host template; the device we'll be allocating would have basically nothing in common with the 'normal' SCSI devices. What we could do, though, is to try it the other way round: Lift the request queue from scsi_get_host_dev() into the scsi host itself, so that scsi_get_host_dev() can use that queue, but we also would be able to use it without a SCSI device attached.

wouldn't that limit 1x scsi device per host, not that I know if any more would ever be required? But it does still seem better to use the request queue in the scsi device.

My concern is this:

struct scsi_device *scsi_get_host_dev(struct Scsi_Host *shost)
{
	[ .. ]
	starget = scsi_alloc_target(&shost->shost_gendev, 0, shost->this_id);
	[ .. ]

and we have typically:

drivers/scsi/hisi_sas/hisi_sas_v3_hw.c:	.this_id = -1,

It's _very_ uncommon to have a negative number as the SCSI target device; in fact, it _is_ an unsigned int already.

FWIW, the only other driver (gdth) which I see uses this API has this_id = -1 in the scsi host template.

But alright, I'll give it a go; let's see what I'll end up with.

note: If we want a fixed scsi_device per host, calling scsi_mq_setup_tags() -> scsi_get_host_dev() will fail as shost state is not running. Maybe we need to juggle some things there to provide a generic solution.
It might even get worse, as during device setup things like 'slave_alloc' etc. are getting called, which have a fair chance of getting confused for non-existing devices. Cf. qla2xxx: qla2xx_slave_alloc() is calling starget_to_rport(), which will get us a nice oops when accessing a target which is _not_ the child of a fc remote port. And this is why I'm not utterly keen on this approach; auditing all these callbacks is _not_ fun.

Cheers, Hannes
Re: [PATCH RFC v2 02/24] scsi: allocate separate queue for reserved commands
On 4/7/20 1:54 PM, John Garry wrote: On 06/04/2020 10:05, Hannes Reinecke wrote: On 3/11/20 7:22 AM, Christoph Hellwig wrote: On Tue, Mar 10, 2020 at 09:08:56PM +, John Garry wrote: On 10/03/2020 18:32, Christoph Hellwig wrote: On Wed, Mar 11, 2020 at 12:25:28AM +0800, John Garry wrote: From: Hannes Reinecke

Allocate a separate 'reserved_cmd_q' for sending reserved commands.

Why? Reserved command specifically are not in any way tied to queues.

So the v1 series used a combination of the sdev queue and the per-host reserved_cmd_q. Back then you questioned using the sdev queue for virtio scsi, and the unconfirmed conclusion was to use a common per-host q. This is the best link I can find now: https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg83177.html

That was just a question on why virtio uses the per-device tags, which didn't look like it made any sense. What I'm worried about here is mixing up the concept of reserved tags in the tagset, and queues to use them. Note that we already have the scsi_get_host_dev to allocate a scsi_device and thus a request_queue for the host itself. That seems like the better interface to use a tag for a host wide command vs introducing a parallel path.

Thinking about it some more, I don't think that scsi_get_host_dev() is the best way of handling it. Problem is that it'll create a new scsi_device with , which will then show up via eg 'lsscsi'.

are you sure? Doesn't this function just allocate the sdev, but do nothing with it, like probing it? I bludgeoned it in here for PoC: https://github.com/hisilicon/kernel-dev/commit/ef0ae8540811e32776f64a5b42bd76cbed17ba47

And then still:

john@ubuntu:~$ lsscsi
[0:0:0:0]  disk     SEAGATE  ST2000NM0045     N004  /dev/sda
[0:0:1:0]  disk     SEAGATE  ST2000NM0045     N004  /dev/sdb
[0:0:2:0]  disk     ATA      SAMSUNG HM320JI  0_01  /dev/sdc
[0:0:3:0]  disk     SEAGATE  ST1000NM0023     0006  /dev/sdd
[0:0:4:0]  enclosu  HUAWEI   Expander 12Gx16  128-
john@ubuntu:~$

Some proper plumbing would be needed, though.

This would be okay if 'this_id' would have been defined by the driver; sadly, most drivers which are affected here do set 'this_id' to -1. So we wouldn't have a nice target ID to allocate the device from, let alone the problem that we would have to emulate a complete scsi device with all required minimal command support etc. And I'm not quite sure how well that would play with the existing SCSI host template; the device we'll be allocating would have basically nothing in common with the 'normal' SCSI devices.

What we could do, though, is to try it the other way round: Lift the request queue from scsi_get_host_dev() into the scsi host itself, so that scsi_get_host_dev() can use that queue, but we also would be able to use it without a SCSI device attached.

wouldn't that limit 1x scsi device per host, not that I know if any more would ever be required? But it does still seem better to use the request queue in the scsi device.

My concern is this:

struct scsi_device *scsi_get_host_dev(struct Scsi_Host *shost)
{
	[ .. ]
	starget = scsi_alloc_target(&shost->shost_gendev, 0, shost->this_id);
	[ .. ]

and we have typically:

drivers/scsi/hisi_sas/hisi_sas_v3_hw.c:	.this_id = -1,

It's _very_ uncommon to have a negative number as the SCSI target device; in fact, it _is_ an unsigned int already.

But alright, I'll give it a go; let's see what I'll end up with.

Cheers, Hannes
Re: [PATCH RFC v2 02/24] scsi: allocate separate queue for reserved commands
On 3/11/20 7:22 AM, Christoph Hellwig wrote:
> On Tue, Mar 10, 2020 at 09:08:56PM +, John Garry wrote:
>> On 10/03/2020 18:32, Christoph Hellwig wrote:
>>> On Wed, Mar 11, 2020 at 12:25:28AM +0800, John Garry wrote:
>>>> From: Hannes Reinecke
>>>>
>>>> Allocate a separate 'reserved_cmd_q' for sending reserved commands.
>>>
>>> Why? Reserved command specifically are not in any way tied to queues.
>>>
>> So the v1 series used a combination of the sdev queue and the per-host
>> reserved_cmd_q. Back then you questioned using the sdev queue for virtio
>> scsi, and the unconfirmed conclusion was to use a common per-host q. This is
>> the best link I can find now:
>>
>> https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg83177.html
>
> That was just a question on why virtio uses the per-device tags, which
> didn't look like it made any sense. What I'm worried about here is
> mixing up the concept of reserved tags in the tagset, and queues to use
> them. Note that we already have the scsi_get_host_dev to allocate
> a scsi_device and thus a request_queue for the host itself. That seems
> like the better interface to use a tag for a host wide command vs
> introducing a parallel path.
>
Thinking about it some more, I don't think that scsi_get_host_dev() is the best way of handling it. Problem is that it'll create a new scsi_device with , which will then show up via eg 'lsscsi'. This would be okay if 'this_id' would have been defined by the driver; sadly, most drivers which are affected here do set 'this_id' to -1. So we wouldn't have a nice target ID to allocate the device from, let alone the problem that we would have to emulate a complete scsi device with all required minimal command support etc. And I'm not quite sure how well that would play with the existing SCSI host template; the device we'll be allocating would have basically nothing in common with the 'normal' SCSI devices.
What we could do, though, is to try it the other way round: Lift the request queue from scsi_get_host_dev() into the scsi host itself, so that scsi_get_host_dev() can use that queue, but we also would be able to use it without a SCSI device attached. Cheers, Hannes
Re: [PATCH RFC v2 12/24] hpsa: use reserved commands
On 3/17/20 10:38 AM, John Garry wrote: On 11/03/2020 08:10, Ming Lei wrote:

ands(struct ctlr_info *h)
@@ -5803,6 +5803,7 @@ static int hpsa_scsi_host_alloc(struct ctlr_info *h)
 	sh->max_lun = HPSA_MAX_LUN;
 	sh->max_id = HPSA_MAX_LUN;
 	sh->can_queue = h->nr_cmds - HPSA_NRESERVED_CMDS;
+	sh->nr_reserved_cmds = HPSA_NRESERVED_CMDS;

Now .nr_reserved_cmds has been passed to blk-mq, you need to increase sh->can_queue to h->nr_cmds, because .can_queue is the whole queue depth (include the part of reserved tags), otherwise, IO tags will be decreased.

Sounds correct.

I will be having a look at the patchset; I thought I did a patch to modify .can_queue so that it would cover only the usable tags, not the reserved ones.

Cheers, Hannes
Re: [PATCH RFC v2 02/24] scsi: allocate separate queue for reserved commands
On 3/11/20 7:22 AM, Christoph Hellwig wrote:
> On Tue, Mar 10, 2020 at 09:08:56PM +, John Garry wrote:
>> On 10/03/2020 18:32, Christoph Hellwig wrote:
>>> On Wed, Mar 11, 2020 at 12:25:28AM +0800, John Garry wrote:
>>>> From: Hannes Reinecke
>>>>
>>>> Allocate a separate 'reserved_cmd_q' for sending reserved commands.
>>>
>>> Why? Reserved command specifically are not in any way tied to queues.
>>>
>> So the v1 series used a combination of the sdev queue and the per-host
>> reserved_cmd_q. Back then you questioned using the sdev queue for virtio
>> scsi, and the unconfirmed conclusion was to use a common per-host q. This is
>> the best link I can find now:
>>
>> https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg83177.html
>
> That was just a question on why virtio uses the per-device tags, which
> didn't look like it made any sense. What I'm worried about here is
> mixing up the concept of reserved tags in the tagset, and queues to use
> them. Note that we already have the scsi_get_host_dev to allocate
> a scsi_device and thus a request_queue for the host itself. That seems
> like the better interface to use a tag for a host wide command vs
> introducing a parallel path.
>
Ah. Right. Will be looking into that, and convert the patchset over to it. And the problem of the separate queue is the fact that I'll need a queue to reserve tags from; trying to allocate a tag directly from the bitmap turns out to be major surgery in the blocklayer with no immediate gain. And I can't use per-device queues as for some drivers the reserved commands are used to query the HBA itself to figure out how many devices are present.

Cheers, Hannes
Re: [PATCH RFC v2 01/24] scsi: add 'nr_reserved_cmds' field to the SCSI host template
On 3/11/20 12:08 AM, Ming Lei wrote:
> On Wed, Mar 11, 2020 at 12:25:27AM +0800, John Garry wrote:
>> From: Hannes Reinecke
>>
>> Add a new field 'nr_reserved_cmds' to the SCSI host template to
>> instruct the block layer to set aside a tag space for reserved
>> commands.
>>
>> Signed-off-by: Hannes Reinecke
>> ---
>>  drivers/scsi/scsi_lib.c  | 1 +
>>  include/scsi/scsi_host.h | 6 ++
>>  2 files changed, 7 insertions(+)
>>
>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>> index 610ee41fa54c..2967325df7a0 100644
>> --- a/drivers/scsi/scsi_lib.c
>> +++ b/drivers/scsi/scsi_lib.c
>> @@ -1896,6 +1896,7 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
>>  	shost->tag_set.ops = &scsi_mq_ops_no_commit;
>>  	shost->tag_set.nr_hw_queues = shost->nr_hw_queues ? : 1;
>>  	shost->tag_set.queue_depth = shost->can_queue;
>> +	shost->tag_set.reserved_tags = shost->nr_reserved_cmds;
>
> You reserve tags for special usage, meantime the whole queue depth
> isn't increased, that means the tags for IO request is decreased given
> reserved tags can't be used for IO, so IO performance may be effected.
>
> If not the case, please explain a bit in commit log.
>
The overall idea of this patchset is to fold _existing_ management command handling into using the blk-mq bitmap. Currently, quite a lot of drivers are using management commands internally, which typically use the same hardware tag pool (ie they are being allocated from the same hardware resources) as the 'normal' I/O commands. But as they are using the same tagpool these drivers already decrement the available number of commands; check eg.
hpsa:

static int hpsa_scsi_host_alloc(struct ctlr_info *h)
{
	struct Scsi_Host *sh;

	sh = scsi_host_alloc(&hpsa_driver_template, sizeof(h));
	if (sh == NULL) {
		dev_err(&h->pdev->dev, "scsi_host_alloc failed\n");
		return -ENOMEM;
	}

	sh->io_port = 0;
	sh->n_io_port = 0;
	sh->this_id = -1;
	sh->max_channel = 3;
	sh->max_cmd_len = MAX_COMMAND_SIZE;
	sh->max_lun = HPSA_MAX_LUN;
	sh->max_id = HPSA_MAX_LUN;
	sh->can_queue = h->nr_cmds - HPSA_NRESERVED_CMDS;
	sh->cmd_per_lun = sh->can_queue;

So the idea of this patchset is to convert existing use-cases; seeing that they already reserve commands internally, performance will not be affected.

Cheers,

Hannes
--
Dr. Hannes Reinecke            Kernel Storage Architect
h...@suse.de                   +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
Re: [PATCH] virtio-blk: remove VIRTIO_BLK_F_SCSI support
On 12/12/19 5:37 PM, Christoph Hellwig wrote:
> Since the need for a special flag to support SCSI passthrough on a
> block device was added in May 2017, the SCSI passthrough support in
> virtio-blk has been disabled. It has always been a bad idea (just ask
> the original author..) and we have virtio-scsi for proper passthrough.
> The feature also never made it into the virtio 1.0 or later
> specifications.
>
> Signed-off-by: Christoph Hellwig
> ---
>  arch/powerpc/configs/guest.config |   1 -
>  drivers/block/Kconfig             |  10 -
>  drivers/block/virtio_blk.c        | 115 +-
>  3 files changed, 1 insertion(+), 125 deletions(-)
>
I'm all for it.

Reviewed-by: Hannes Reinecke

Cheers,

Hannes
--
Dr. Hannes Reinecke            Teamlead Storage & Networking
h...@suse.de                   +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
Re: [Qemu-devel] [RFC 0/3] VirtIO RDMA
On 4/15/19 12:35 PM, Yuval Shaia wrote:

On Thu, Apr 11, 2019 at 07:02:15PM +0200, Cornelia Huck wrote:

On Thu, 11 Apr 2019 14:01:54 +0300 Yuval Shaia wrote:

Data center backends use more and more RDMA or RoCE devices, and more and more software runs in virtualized environments. There is a need for a standard to enable RDMA/RoCE on Virtual Machines.

Virtio is the optimal solution, since it is the de-facto para-virtualization technology and also because the Virtio specification allows hardware vendors to support the Virtio protocol natively in order to achieve bare-metal performance.

This RFC is an effort to address challenges in defining the RDMA/RoCE Virtio Specification and a look forward on possible implementation techniques.

Open issues/Todo list: The list is huge; this is only the starting point of the project. Anyway, here is one example of an item on the list:

- Multi VirtQ: Every QP has two rings and every CQ has one. This means that in order to support for example 32K QPs we will need 64K VirtQs. Not sure that this is reasonable, so one option is to have one for all and multiplex the traffic on it. This is not a good approach, as by design it introduces a potential starvation. Another approach would be multiple queues and round-robin (for example) between them.

Typically there will be a one-to-one mapping between QPs and CPUs (on the guest). So while one would need to be prepared to support quite some QPs, the expectation is that the actual number of QPs used will be rather low. In a similar vein, multiplexing QPs would be defeating the purpose, as the overall idea was to have _independent_ QPs to enhance parallelism.

Expectations from this posting: In general, any comment is welcome, starting from "hey, drop this as it is a very bad idea" to "yeah, go ahead, we really want it". The idea here is that since it is not a minor effort, I first want to know if there is some sort of interest in the community for such a device.
My first reaction is: Sounds sensible, but it would be good to have a spec for this :)

You'll need a spec if you want this to go forward anyway, so at least a sketch would be good to answer questions such as how many virtqueues you use for which purpose, what is actually put on the virtqueues, whether there are negotiable features, and what the expectations for the device and the driver are. It also makes it easier to understand how this is supposed to work in practice.

If folks agree that this sounds useful, the next step would be to reserve an id for the device type.

Thanks for the tips, will sure do that. It is just that first I wanted to make sure there is a use case here.

Waiting for any feedback from the community.

I really do like the idea; in fact, it saved me from coding a similar thing myself :-)

However, I'm still curious about the overall intent of this driver. Where would the I/O be routed _to_? It's nice that we have a virtualized driver, but this driver is intended to do I/O (even if it doesn't _do_ any I/O ATM :-), and this I/O needs to be sent to (and possibly received from) something. So what exactly is this something?

An existing piece of HW on the host? If so, wouldn't it be more efficient to use vfio, either by using SR-IOV or by using virtio-mdev?

Another guest? If so, how would we route the I/O from one guest to the other? Shared memory? Implementing a full-blown RDMA switch in qemu?

Oh, and I would _love_ to have a discussion about this at KVM Forum. Maybe I'll manage to whip up a guest-to-guest RDMA connection using ivshmem ... let's see.

Cheers,

Hannes
--
Dr. Hannes Reinecke            Teamlead Storage & Networking
h...@suse.de                   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
Re: Question about eliminating virtio-scsi 30s timeout and hot-unplug
On 12/21/2017 08:29 PM, Jason Behmer via Virtualization wrote:
> Hi Paolo,
> I had a question about this patch that eliminates the virtio-scsi timeout
> on IOs - https://patchwork.kernel.org/patch/9802047/. How does this work
> when we have IOs outstanding to a device that is then hot-unplugged?
> I'm under the impression that those IOs will never get returned with any
> status (please correct me if I'm wrong), and we will then end up waiting
> on them forever with this patch.
>
It's no different from the virtio-block path; the I/O is pending in the qemu process. When a device is hot-unplugged the I/O will be returned to the qemu process, which then will return the final status to the guest.

Cheers,

Hannes
--
Dr. Hannes Reinecke            Teamlead Storage & Networking
h...@suse.de                   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH 09/12] virtio_blk: Use blk_rq_is_scsi()
On 08/18/2017 01:23 AM, Bart Van Assche wrote:
> This patch does not change any functionality.
>
> Signed-off-by: Bart Van Assche
> Cc: Michael S. Tsirkin
> Cc: Jason Wang
> Cc: virtualization@lists.linux-foundation.org
> ---
>  drivers/block/virtio_blk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 1498b899a593..0ba1eb911a42 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -265,7 +265,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	}
>
>  	spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
> -	if (req_op(req) == REQ_OP_SCSI_IN || req_op(req) == REQ_OP_SCSI_OUT)
> +	if (blk_rq_is_scsi(req))
>  		err = virtblk_add_req_scsi(vblk->vqs[qid].vq, vbr, vbr->sg, num);
>  	else
>  		err = virtblk_add_req(vblk->vqs[qid].vq, vbr, vbr->sg, num);
>
Reviewed-by: Hannes Reinecke

Cheers,

Hannes
--
Dr. Hannes Reinecke            Teamlead Storage & Networking
h...@suse.de                   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH 0/2] virtio-scsi: Implement FC_HOST feature
On 01/16/2017 05:04 PM, Fam Zheng wrote:
> This series implements the proposed fc_host feature of virtio-scsi.
>
> The first patch updates the data structure changes according to the spec
> proposal; the second patch actually implements the operations.
>
> Fam Zheng (2):
>   virtio_scsi: Add fc_host definitions
>   virtio_scsi: Implement fc_host
>
>  drivers/scsi/virtio_scsi.c       | 55 +++-
>  include/uapi/linux/virtio_scsi.h |  6 +
>  2 files changed, 60 insertions(+), 1 deletion(-)
>
Hmm. How is this supposed to work?

You do export the WWPN/WWNN of the associated host to the guest (nb: will get interesting for non-NPIV setups ...), but virtio-scsi will still do a LUN remapping. IE the LUNs you see on the host will be different from the LUNs presented to the guest. This makes reliable operations very tricky.

Plus you don't _actually_ expose the FC host, but rather the WWPN of the host presenting the LUN. So how do you handle LUNs from different FC hosts on the guest?

From what I've seen, virtio will either present you with one host per LUN (so you'll end up with different SCSI hosts on the guest, each having the _same_ WWPN/WWNN), or a single host presenting all LUNs, making the WWPN setting ... interesting.

Overall, I'm not overly happy with this approach.

You already added WWPN ids to the virtio transport, so why didn't you update the LUN field, too, to avoid this ominous LUN remapping? And we really should make sure to have a single FC host in the guest presenting all LUNs.

Cheers,

Hannes
--
Dr. Hannes Reinecke            zSeries & Storage
h...@suse.de                   +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
Re: [PATCH v10] virtio-net: add Max MTU configuration field
On 09/13/2016 05:18 PM, Aaron Conole wrote:
> It is helpful for a host to indicate its MTU to be set on guest NICs
> other than the assumed 1500 byte value. This helps in situations where
> the host network is using Jumbo Frames, or aiding in PMTU discovery by
> configuring a homogeneous network. It is also helpful for sizing receive
> buffers correctly.
>
> The change adds a new field to the configuration area of network
> devices. It will be used to pass a maximum MTU from the device to
> the driver. This will be used by the driver as a maximum value for
> packet sizes during transmission, without segmentation offloading.
>
> In addition, in order to support backward and forward compatibility,
> we introduce a new feature bit called VIRTIO_NET_F_MTU.
>
> VIRTIO-152
> Signed-off-by: Aaron Conole
> Cc: Victor Kaplansky
> Acked-by: Michael S. Tsirkin
> Acked-by: Maxime Coquelin
> ---
Very good. I'm needing this for my FCoE-over-virtio work.

Reviewed-by: Hannes Reinecke

Cheers,

Hannes
--
Dr. Hannes Reinecke            zSeries & Storage
h...@suse.de                   +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
Re: req->nr_phys_segments > queue_max_segments (was Re: kernel BUG at drivers/block/virtio_blk.c:172!)
On 10/01/2015 11:00 AM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 03:10:14AM +0200, Thomas D. wrote:
>> Hi,
>>
>> I have a virtual machine which fails to boot linux-4.1.8 while mounting
>> file systems:
>>
>>> * Mounting local filesystem ...
>>> [ cut here ]
>>> kernel BUG at drivers/block/virtio_blk.c:172!
>>> invalid opcode: 0000 [#1] SMP
>>> Modules linked in: pcspkr psmouse dm_log_userspace virtio_net e1000 fuse
>>> nfs lockd grace sunrpc fscache dm_snapshot dm_bufio dm_mirror
>>> dm_region_hash dm_log usbhid usb_storage sr_mod cdrom
>>> CPU: 7 PID: 2254 Comm: dmcrypt_write Not tainted 4.1.8-gentoo #1
>>> Hardware name: Red Hat KVM, BIOS seabios-1.7.5-8.el7 04/01/2014
>>> task: 88061fb7 ti: 88061ff3 task.ti: 88061ff3
>>> RIP: 0010:[] [] virtio_queue_rq+0x210/0x2b0
>>> RSP: 0018:88061ff33ba8 EFLAGS: 00010202
>>> RAX: 00b1 RBX: 88061fb2fc00 RCX: 88061ff33c30
>>> RDX: 0008 RSI: 88061ff33c50 RDI: 88061fb2fc00
>>> RBP: 88061ff33bf8 R08: 88061eef3540 R09: 88061ff33c30
>>> R10: R11: 00af R12:
>>> R13: 88061eef3540 R14: 88061eef3540 R15: 880622c7ca80
>>> FS: () GS:88063fdc() knlGS:
>>> CS: 0010 DS: ES: CR0: 80050033
>>> CR2: 01ffe468 CR3: bb343000 CR4: 001406e0
>>> Stack:
>>> 880622d4c478 88061ff33bd8 88061fb2f
>>> 0001 88061fb2fc00 88061ff33c30 0
>>> 88061eef3540 88061ff33c98 b43eb
>>>
>>> Call Trace:
>>> [] __blk_mq_run_hw_queue+0x1d0/0x370
>>> [] blk_mq_run_hw_queue+0x95/0xb0
>>> [] blk_mq_flush_plug_list+0x129/0x140
>>> [] blk_finish_plug+0x18/0x50
>>> [] dmcrypt_write+0x1da/0x1f0
>>> [] ? wake_up_state+0x20/0x20
>>> [] ? crypt_iv_lmk_dtr+0x60/0x60
>>> [] kthread_create_on_node+0x180/0x180
>>> [] ret_from_fork+0x42/0x70
>>> [] ?
kthread_create_on_node+0x180/0x180
>>> Code: 00 41 c7 85 78 01 00 00 08 00 00 00 49 c7 85 80 01 00 00 00 00
>>> 00 00 41 89 85 7c 01 00 00 e9 93 fe ff ff 66 0f 1f 44 00 00 <0f> 0b 66 0f
>>> 1f 44 00 00 49 8b 87 b0 00 00 00 41 83 e6 ef 4a 8b
>>> RIP [] virtio_queue_rq+0x210/0x2b0
>>> RSP:
>>> ---[ end trace 8078357c459d5fc0 ]---
>
> So this BUG_ON is from 1cf7e9c68fe84248174e998922b39e508375e7c1:
>
> commit 1cf7e9c68fe84248174e998922b39e508375e7c1
> Author: Jens Axboe
> Date: Fri Nov 1 10:52:52 2013 -0600
>
>     virtio_blk: blk-mq support
>
> BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
>
> On probe, we do
>
> 	/* We can handle whatever the host told us to handle. */
> 	blk_queue_max_segments(q, vblk->sg_elems-2);
>
> To debug this, maybe you can print out sg_elems at init time and when
> this fails, to make sure some kind of memory corruption does not change
> sg_elems after initialization?
>
> Jens, how may we get more segments than blk_queue_max_segments?
> Is the driver expected to validate and drop such requests?
>
Whee! I'm not alone anymore!

I have seen similar issues even on non-mq systems; occasionally I'm hitting this bug in drivers/scsi/scsi_lib.c:scsi_init_io():

	count = blk_rq_map_sg(req->q, req, sdb->table.sgl);
	BUG_ON(count > sdb->table.nents);

There are actually two problems here:

The first is that blk_rq_map_sg() requires a table (ie the last argument), but doesn't have any indication of how large the table is. So one needs to check whether the returned number of mapped sg elements exceeds the number of elements in the table. If so, we already _have_ a memory overflow, and the only thing we can do is sit in a corner and cry. This really would need to be fixed up, eg by adding another argument for the table size.

The other problem is that this _really_ shouldn't happen, and points to some issue with the block layer in general. Which I've been trying to find for several months now, but to no avail :-(

Cheers,

Hannes
--
Dr.
Hannes Reinecke            zSeries & Storage
h...@suse.de                   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [RFC PATCH 0/2] virtio nvme
On 09/27/2015 07:01 AM, Nicholas A. Bellinger wrote:
> On Wed, 2015-09-23 at 15:58 -0700, Ming Lin wrote:
>> On Fri, 2015-09-18 at 14:09 -0700, Nicholas A. Bellinger wrote:
>>> On Fri, 2015-09-18 at 11:12 -0700, Ming Lin wrote:
>>>> On Thu, 2015-09-17 at 17:55 -0700, Nicholas A. Bellinger wrote:
>
> Btw, after chatting with Dr. Hannes this week at SDC, here are his
> original rts-megasas -v6 patches from Feb 2013.
>
> Note they are standalone patches that require a sufficiently old
> LIO + QEMU to actually build + function.
>
> https://github.com/Datera/rts-megasas/blob/master/rts_megasas-qemu-v6.patch
> https://github.com/Datera/rts-megasas/blob/master/rts_megasas-fabric-v6.patch
>
> For groking purposes, they demonstrate the principle design for a host
> kernel level driver, along with the megasas firmware interface (MFI)
> specific emulation magic that makes up the bulk of the code.
>
And indeed, Nic persuaded me to have them updated to the latest qemu. Which I'll be doing shortly.
Stay tuned.

Cheers,

Hannes
--
Dr. Hannes Reinecke            zSeries & Storage
h...@suse.de                   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: DEFINE_IDA causing memory leaks? (was Re: [PATCH 1/2] virtio: fix memory leak of virtio ida cache layers)
On 09/17/2015 07:33 AM, Michael S. Tsirkin wrote:
> On Wed, Sep 16, 2015 at 07:29:17PM -0500, Suman Anna wrote:
>> The virtio core uses a static ida named virtio_index_ida for
>> assigning index numbers to virtio devices during registration.
>> The ida core may allocate some internal idr cache layers and
>> an ida bitmap upon any ida allocation, and all these layers are
>> truly freed only upon the ida destruction. The virtio_index_ida
>> is not destroyed at present, leading to a memory leak when using
>> the virtio core as a module and at least one virtio device is
>> registered and unregistered.
>>
>> Fix this by invoking ida_destroy() in the virtio core module
>> exit.
>>
>> Cc: "Michael S. Tsirkin"
>> Signed-off-by: Suman Anna
>
> Interesting.
> Will the same apply to e.g. sd_index_ida in drivers/scsi/sd.c
> or iscsi_sess_ida in drivers/scsi/scsi_transport_iscsi.c?
>
> If no, why not?
>
> One doesn't generally expect to have to free global variables.
> Maybe we should forbid DEFINE_IDA in modules?
>
> James, could you comment on this please?
>
Well, looking at the code, 'ida_destroy' only needs to be called if you want/need to do a general cleanup. It shouldn't be required if you do correct reference counting on your objects and call idr_remove() on each of them. Unless I'm misreading something.

Seems like a topic for KS; Johannes had a larger patchset recently to clean up idr, which ran into very much the same issues.

Cheers,

Hannes
--
Dr. Hannes Reinecke            zSeries & Storage
h...@suse.de                   +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
Re: sometimes ping fails when enable stp
On 04/11/2014 09:31 AM, longguang.yue wrote:
> hi, all
> sometimes ping fails when stp is enabled. why?
>
Probably stp initialisation isn't finished yet, and the switch will block any traffic until that happens. During that time any traffic will return with -EHOSTUNREACH.

Cheers,

Hannes
--
Dr. Hannes Reinecke            zSeries & Storage
h...@suse.de                   +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
Re: updated: kvm PCI todo wiki
On 08/21/2013 12:48 PM, Michael S. Tsirkin wrote:
> Hey guys,
> I've put up a wiki page with a kvm PCI todo list, mainly to avoid
> effort duplication, but also in the hope to draw attention to what I
> think we should try addressing in KVM:
>
> http://www.linux-kvm.org/page/PCITodo
>
> This page could cover all PCI related activity in KVM; it is very
> incomplete. We should probably add e.g. IOMMU related stuff.
>
> Note: if there's no developer listed for an item, this just means I
> don't know of anyone actively working on an issue at the moment, not
> that no one intends to.
>
> I would appreciate it if others working on one of the items on this
> list would add their names so we can communicate better.
>
> If others like this wiki page, please go ahead and add stuff you are
> working on if any. It would be especially nice to add testing
> projects. Also, feel free to add links to bugzilla items.
>
On a related note, did anyone ever try to test MSI/MSI-X with a Windows guest?

I've tried to enable it for virtio, but for some reason Windows didn't want to enable it. AHCI was even worse; the stock Windows version doesn't support MSI and the Intel one doesn't like our implementation :-(

Anyone ever managed to get this to work? If not, it'd be a good topic for the wiki ...

Cheers,

Hannes
--
Dr. Hannes Reinecke            zSeries & Storage
h...@suse.de                   +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
Re: binding/unbinding devices to vfio-pci
On 07/02/2013 05:13 PM, Yoder Stuart-B08248 wrote:

-----Original Message-----
From: Alex Williamson [mailto:alex.william...@redhat.com]
Sent: Tuesday, July 02, 2013 9:46 AM
To: Yoder Stuart-B08248
Cc: k...@vger.kernel.org list; Alexander Graf; Bhushan Bharat-R65777; a.mota...@virtualopensystems.com; virtualization@lists.linux-foundation.org
Subject: Re: binding/unbinding devices to vfio-pci

On Tue, 2013-07-02 at 14:15 +0000, Yoder Stuart-B08248 wrote:

Alex,

I'm trying to think through how binding/unbinding of devices will work with VFIO for platform devices and have a couple of questions about how vfio-pci works.

When you bind a device to vfio-pci, e.g.:

    # echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id

...I understand that the echo into 'new_id' tells the vfio pci driver that it now handles the specified PCI ID. But now there are 2 drivers that handle that PCI ID, the original host driver and vfio-pci.

Say that you hotplug a PCI device that matches that ID. Which of the 2 drivers is going to get bound to the device?

Also, if you unbind a device from vfio-pci and want to bind it again to the normal host driver, you would just echo the full device info into the 'bind' sysfs file for the host driver, right?

    echo 0000:06:0d.0 > /sys/bus/pci/drivers/...

Hi Stuart,

The driver binding interface is far from perfect. In your scenario where you've added the ID for one device, then hotplug another device with the same ID, the results are indeterminate. Both vfio-pci and the host driver, assuming it's still loaded, can claim the device; it's just a matter of which gets probed first. Generally that window should be very short though.
To bind a device, the user should do:

1) echo 0000:bb:dd.f > /sys/bus/pci/devices/0000:bb:dd.f/driver/unbind
2) echo <vendor> <device> > /sys/bus/pci/drivers/vfio-pci/new_id
3) echo 0000:bb:dd.f > /sys/bus/pci/drivers/vfio-pci/bind
4) echo <vendor> <device> > /sys/bus/pci/drivers/vfio-pci/remove_id

There are actually a number of ways you can do this, and the default autoprobe behavior really makes step 3) unnecessary, as the driver core will probe any unbound devices as soon as a new_id is added to vfio-pci. That can be changed by:

    # echo 0 > /sys/bus/pci/drivers_autoprobe

But then we have to worry about races from any devices that might have been hotplugged in the interim.

But, even apart from hot-plugged devices, what about the device we just unbound? There are now 2 host drivers that can handle the device when the autoprobe happens. Is it just luck that vfio-pci is the one that gets the device?

Probably the best way of handling this would be to disallow autobinding for vfio in general. Then the 'normal' driver would be bound to the device, and vfio must be enabled via manual binding. Much like it is today.

Cheers,

Hannes
--
Dr. Hannes Reinecke            zSeries & Storage
h...@suse.de                   +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
Re: PCI device not properly reset after VFIO
On 10/18/2012 04:40 PM, Alex Williamson wrote:

Hi Hannes,

Thanks for testing vfio.

On Thu, 2012-10-18 at 08:47 +0200, Hannes Reinecke wrote:

Hi Alex,

I've been playing around with VFIO and megasas (of course). What I did now was switching between VFIO and 'normal' operation, ie emulated access.

megasas is happily running under VFIO, but when I do an emergency stop like killing the Qemu session the PCI device is not properly reset. IE when I load 'megaraid_sas' after unbinding the vfio_pci module the driver cannot initialize the card and waits forever for the firmware state to change. I need to do a proper pci reset via

    echo 1 > /sys/bus/pci/device//reset

to get it into a working state again.

Looking at vfio_pci_disable(), pci reset is called before the config state and BARs are restored. Seeing that vfio_pci_enable() calls pci reset right at the start, too, before modifying anything, I do wonder whether the pci reset is at the correct location for disable.

I would have expected to call pci reset in vfio_pci_disable() _after_ we have restored the configuration, to ensure a sane state after reset. And, as experience shows, we do need to call it there.

So what is the rationale for the pci reset? Can we move it to the end of vfio_pci_disable() or do we need to call pci reset twice?

I believe the rationale was that by resetting the device before we restore the state we stop anything that the device was doing. Restoring the saved state on a running device seems like it could cause problems, so you may be right and we actually need to do reset, load, restore, reset. Does adding another call to pci_reset_function in the pci_restore_state path (as below) solve the problem?

Traditional KVM device assignment has a nearly identical path; does it have this same bug?

It's actually the first time I've been able to test this (the hardware is a bit tricky to set up ...), so I cannot tell (yet) if KVM exhibited the same thing.
Thanks,
Alex

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 6c11994..d07a45c 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -107,9 +107,10 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
 	pci_reset_function(vdev->pdev);
 
 	if (pci_load_and_free_saved_state(vdev->pdev,
-					  &vdev->pci_saved_state) == 0)
+					  &vdev->pci_saved_state) == 0) {
 		pci_restore_state(vdev->pdev);
-	else
+		pci_reset_function(vdev->pdev);
+	} else
 		pr_info("%s: Couldn't reload %s saved state\n",
 			__func__, dev_name(&vdev->pdev->dev));

I would have called reset after unmapping the BARs; the HBA I'm working with does need to access the BARs, so the content of them might be relevant, too. But then I'm not really a PCI expert. Maybe we should ask Tony Luck or Bjorn Helgaas.

Cheers,

Hannes
--
Dr. Hannes Reinecke            zSeries & Storage
h...@suse.de                   +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
PCI device not properly reset after VFIO
Hi Alex,

I've been playing around with VFIO and megasas (of course). What I did now was switching between VFIO and 'normal' operation, ie emulated access.

megasas is happily running under VFIO, but when I do an emergency stop like killing the Qemu session the PCI device is not properly reset. IE when I load 'megaraid_sas' after unbinding the vfio_pci module the driver cannot initialize the card and waits forever for the firmware state to change. I need to do a proper pci reset via

    echo 1 > /sys/bus/pci/device//reset

to get it into a working state again.

Looking at vfio_pci_disable(), pci reset is called before the config state and BARs are restored. Seeing that vfio_pci_enable() calls pci reset right at the start, too, before modifying anything, I do wonder whether the pci reset is at the correct location for disable.

I would have expected to call pci reset in vfio_pci_disable() _after_ we have restored the configuration, to ensure a sane state after reset. And, as experience shows, we do need to call it there.

So what is the rationale for the pci reset? Can we move it to the end of vfio_pci_disable() or do we need to call pci reset twice?

Cheers,

Hannes
--
Dr. Hannes Reinecke            zSeries & Storage
h...@suse.de                   +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
Re: [PATCH 1/2 v1] blkdrv: Add queue limits parameters for sg block drive
On 08/24/2012 09:56 AM, Paolo Bonzini wrote:

On 24/08/2012 02:45, Nicholas A. Bellinger wrote:

So up until very recently, TCM would accept an I/O request for a DATA I/O type CDB with a max_sectors larger than the reported max_sectors for its TCM backend (regardless of backend type), and silently generate N backend 'tasks' to complete the single initiator-generated command.

This is what QEMU does if you use scsi-block, except for MMC devices (because of the insanity of the commands used for burning).

Also FYI for Paolo, for control type CDBs I've never actually seen an allocation length exceed max_sectors, so in practice AFAIK this only happens for DATA I/O type CDBs.

Yes, that was my impression as well.

This was historically required by the pSCSI backend driver (using a number of old SCSI passthrough interfaces) in order to support this very type of case described above, but over the years the logic ended up creeping into various other non-passthrough backend drivers like IBLOCK+FILEIO.

So for v3.6-rc1 code, hch ended up removing the 'task' logic, thus letting the backends (and the layers below) do the I/O sectors > max_sectors handling, allowing modern pSCSI using struct request to do the same. (hch assured me this works now for pSCSI)

So now LIO and QEMU work the same. (Did he test tapes too?)

Anyways, I think having the guest limit virtio-scsi DATA I/O to max_sectors based upon the host-accessible block limits is a reasonable approach to consider. Reducing this value even further, based upon the lowest max_sectors available amongst possible migration hosts, would be a good idea here to avoid having to reject any I/Os exceeding a new host's device block queue limits.

Yeah, it's reasonable _assuming it is needed at all_. For disks, it is not needed. For CD-ROMs it is, but right now we have only one report, and it is using USB, so we don't know if the problem is in the drive or rather in the USB bridge (whose quality usually leaves much to be desired).
So in the only observed case, the fix would really be a workaround; the right thing to do with USB devices is to use USB passthrough.

Hehe. So finally someone else stumbled across this one.

All is fine and dandy as long as you're able to use scsi-disk. As soon as you're forced to use scsi-generic we're in trouble. With scsi-generic we actually have two problems:

1) scsi-generic just acts as a pass-through and passes the commands as-is, including the scatter-gather information as formatted by the guest. So the guest could easily format an SG_IO command which will not be compatible with the host.

2) The host is not able to differentiate between a malformed SG_IO command and a real I/O error; in both cases it'll return -EIO.

So we can fix this by either:

a) ignoring it (as we do nowadays :-)
b) fixing up scsi-generic to inspect and modify SG_IO information to ensure the host limits are respected
c) fixing up the host to differentiate between a malformed SG_IO and a real I/O error.

c) would only be feasible for Linux et al. _Personally_ I would prefer that approach, as I fail to see why we cannot return a proper error code here. But I can already hear the outraged cry 'POSIX! POSIX!', so I guess it's not going to happen anytime soon.

So I would vote for b). Yes, it's painful. But in the long run we'll have to do SG_IO inspection anyway, otherwise we'll always be susceptible to malicious SG_IO attacks.

Cheers,

Hannes
--
Dr. Hannes Reinecke            zSeries & Storage
h...@suse.de                   +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
Re: virtio-scsi spec (was Re: [PATCH] Add virtio-scsi to the virtio spec)
On 11/30/2011 05:36 PM, Paolo Bonzini wrote: On 11/30/2011 03:17 PM, Hannes Reinecke wrote: seg_max is the maximum number of segments that can be in a command. A bidirectional command can include seg_max input segments and seg_max output segments. I would like to have the other request_queue limitations exposed here, too. Most notably we're missing the maximum size of an individual segment and the maximum size of the overall I/O request. The virtio transport does not put any limit, as far as I know. Virtio doesn't, but the underlying device/driver might. And if we don't expose these values we cannot format the request correctly. As this is the host specification I really would like to see a host identifier somewhere in there. Otherwise we won't be able to reliably identify a virtio SCSI host. I thought about it, but I couldn't figure out exactly how to use it. If it's just allocating 64 bits in the configuration space (with the stipulation that they could be zero), let's do it now. Otherwise a controlq command is indeed better, and it can come later. But even if it's just a 64-bit value, then: 1) where would you place it in sysfs for userspace? I can make up a random name, but existing user tools won't find it and that's against the design of virtio-scsi. 2) How would it be encoded as a transport ID? Is it FC, or firewire, or SAS, or what? I was thinking of something along the lines of the TransportID as defined in SPC. The main idea is to have a unique ID by which we can identify a given virtio-scsi host. Admittedly it might not be useful in general, so it might be an idea to delegate this to another controlq command. Plus you can't calculate the ITL nexus information, making Persistent Reservations impossible. They are not impossible, only some features such as SPEC_I_PT. If you use NPIV or iSCSI in the host, then the persistent reservations will already get the correct initiator port. If not, much more work is needed. 
Yes, for a shared (physical) SCSI host persistent reservations will be tricky. Cheers, Hannes
Re: virtio-scsi spec (was Re: [PATCH] Add virtio-scsi to the virtio spec)
u32 residual; > u16 status_qualifier; > u8 status; > u8 response; > u8 sense[sense_size]; > char datain[]; > }; > > /* command-specific response values */ > #define VIRTIO_SCSI_S_OK 0 > #define VIRTIO_SCSI_S_UNDERRUN 1 > #define VIRTIO_SCSI_S_ABORTED 2 > #define VIRTIO_SCSI_S_BAD_TARGET 3 > #define VIRTIO_SCSI_S_RESET 4 > #define VIRTIO_SCSI_S_TRANSPORT_FAILURE 5 > #define VIRTIO_SCSI_S_TARGET_FAILURE 6 > #define VIRTIO_SCSI_S_NEXUS_FAILURE 7 > #define VIRTIO_SCSI_S_FAILURE 8 > > /* task_attr */ > #define VIRTIO_SCSI_S_SIMPLE 0 > #define VIRTIO_SCSI_S_ORDERED 1 > #define VIRTIO_SCSI_S_HEAD 2 > #define VIRTIO_SCSI_S_ACA 3 > > The lun field addresses a target and logical unit in the > virtio-scsi device's SCSI domain. In this version of the spec, > the only supported format for the LUN field is: first byte set to > 1, second byte set to target, third and fourth byte representing > a single level LUN structure, followed by four zero bytes. With > this representation, a virtio-scsi device can serve up to 256 > targets and 16384 LUNs per target. > > The id field is the command identifier ("tag"). > > Task_attr, prio and crn should be left to zero: command priority > is explicitly not supported by this version of the device; > task_attr defines the task attribute as in the table above, but > all task attributes may be mapped to SIMPLE by the device; crn > may also be provided by clients, but is generally expected to be > 0. The maximum CRN value defined by the protocol is 255, since > CRN is stored in an 8-bit integer. > > All of these fields are defined in SAM. They are always > read-only, as are the cdb and dataout field. The cdb_size is > taken from the configuration space. > > sense and subsequent fields are always write-only. The sense_len > field indicates the number of bytes actually written to the sense > buffer. The residual field indicates the residual size, > calculated as "data_length - number_of_transferred_bytes", for > read or write operations. 
For bidirectional commands, the > number_of_transferred_bytes includes both read and written bytes. > A residual field that is less than the size of datain means that > the dataout field was processed entirely. A residual field that > exceeds the size of datain means that the dataout field was > processed partially and the datain field was not processed at > all. > > The status byte is written by the device to be the status > code as defined by SAM. > > The response byte is written by the device to be one of the > following: > > VIRTIO_SCSI_S_OK when the request was completed and the status > byte is filled with a SCSI status code (not necessarily > "GOOD"). > > VIRTIO_SCSI_S_UNDERRUN if the content of the CDB requires > transferring more data than is available in the data buffers. > > VIRTIO_SCSI_S_ABORTED if the request was cancelled due to an > ABORT TASK or ABORT TASK SET task management function. > > VIRTIO_SCSI_S_BAD_TARGET if the request was never processed > because the target indicated by the lun field does not exist. > > VIRTIO_SCSI_S_RESET if the request was cancelled due to a bus > or device reset. > > VIRTIO_SCSI_S_TRANSPORT_FAILURE if the request failed due to a > problem in the connection between the host and the target > (severed link). > > VIRTIO_SCSI_S_TARGET_FAILURE if the target is suffering a > failure and the guest should not retry on other paths. > > VIRTIO_SCSI_S_NEXUS_FAILURE if the nexus is suffering a failure > but retrying on other paths might yield a different result. > > VIRTIO_SCSI_S_FAILURE for other host or guest error. In > particular, if neither dataout nor datain is empty, and the > VIRTIO_SCSI_F_INOUT feature has not been negotiated, the > request will be immediately returned with a response equal to > VIRTIO_SCSI_S_FAILURE. > We should be adding VIRTIO_SCSI_S_BUSY for a temporary failure, indicating that a command retry might be sufficient to clear this situation. 
Equivalent to VIRTIO_SCSI_S_NEXUS_FAILURE, but issuing a retry on the same path. Thanks for the write-up. Cheers, Hannes
Re: virtio scsi host draft specification, v3
On 07/01/2011 08:41 AM, Paolo Bonzini wrote: > On 06/29/2011 11:39 AM, Stefan Hajnoczi wrote: >> > > Of course, when doing so we would lose the ability to >> freely remap >> > > LUNs. But then remapping LUNs doesn't gain you much imho. >> > > Plus you could always use qemu block backend here if you want >> > > to hide the details. >> > >> > And you could always use the QEMU block backend with >> > scsi-generic if you want to remap LUNs, instead of true >> > passthrough via the kernel target. >> >> IIUC the in-kernel target always does remapping. It passes through >> individual LUNs rather than entire targets and you pick LU Numbers to >> map to the backing storage (which may or may not be a SCSI >> pass-through device). Nicholas Bellinger can confirm whether this is >> correct. > > But then I don't understand. If you pick LU numbers both with the > in-kernel target and with QEMU, you do not need to use e.g. WWPNs > with fiber channel, because we are not passing through the details > of the transport protocol (one day we might have virtio-fc, but more > likely not). So the LUNs you use might as well be represented by > hierarchical LUNs. > Actually, the kernel does _not_ do a LUN remapping. It just so happens that most storage arrays will present the LUN starting with 0, so normally you wouldn't notice. However, some arrays have an array-wide LUN range, so you start seeing LUNs at odd places: [3:0:5:0] disk LSI INF-01-00 0750 /dev/sdw [3:0:5:7] disk LSI Universal Xport 0750 /dev/sdx > Using NPIV with KVM would be done by mapping the same virtual N_Port > ID in the host(s) to the same LU number in the guest. You might > already do this now with virtio-blk, in fact. > The point here is not the mapping. The point is rescanning. You can map existing NPIV devices already. But you _cannot_ rescan the host/device whatever _from the guest_ to detect if new devices are present. That is the problem I'm trying to describe here. 
To be more explicit: Currently you have to map existing devices directly as individual block or scsi devices to the guest. And a rescan within the guest can only be sent to that device, so the only information you will be able to gather is whether the device itself is still present. You are unable to detect if there are other devices attached to your guest which you should connect to. So we have to have an enclosing instance (ie the equivalent of a SCSI target), which is capable of telling us exactly this. > Put in another way: the virtio-scsi device is itself a SCSI target, > so yes, there is a single target port identifier in virtio-scsi. But > this SCSI target just passes requests down to multiple real targets, > and so will let you do ALUA and all that. > Argl. No way. The virtio-scsi device has to map to a single LUN. I thought I mentioned this already, but I'd better clarify this again: The SCSI spec itself only deals with LUNs, so anything you'll read in there obviously will only handle the interaction between the initiator (read: host) and the LUN itself. However, the actual command is sent via an intermediate target, hence you'll always see the reference to the ITL (initiator-target-lun) nexus. The SCSI spec details discovery of the individual LUNs presented by a given target; it does _NOT_ detail the discovery of the targets themselves. That is being delegated to the underlying transport, in most cases SAS or Fibre Channel. For the same reason the SCSI spec can afford to disdain any reference to path failure, device hot-plugging etc; all of these things are being delegated to the transport. In our context the virtio-scsi device should map to the LUN, and the virtio-scsi _host_ backend should map to the target. The virtio-scsi _guest_ driver will then map to the initiator. So we should be able to attach more than one device to the backend, which then will be presented to the initiator. 
In the case of NPIV it would make sense to map the virtual SCSI host to the backend, so that all devices presented to the virtual SCSI host will be presented to the backend, too. However, when doing so these devices will normally be referenced by their original LUN, as these will be presented to the guest via eg 'REPORT LUNS'. The above thread now tries to figure out if we should remap those LUN numbers or just expose them as they are. If we decide on remapping, we have to emulate _all_ commands referring explicitly to those LUN numbers (persistent reservations, anyone?). If we don't, we would expose some hardware detail to the guest, but it would save us _a lot_ of processing. I'm all for the latter. Cheers, Hannes
Re: virtio scsi host draft specification, v3
On 06/29/2011 12:07 PM, Christoph Hellwig wrote: > On Wed, Jun 29, 2011 at 10:39:42AM +0100, Stefan Hajnoczi wrote: >> I think we're missing a level of addressing. We need the ability to >> talk to multiple target ports in order for "list target ports" to make >> sense. Right now there is one implicit target that handles all >> commands. That means there is one fixed I_T Nexus. >> >> If we introduce "list target ports" we also need a way to say "This >> CDB is destined for target port #0". Then it is possible to enumerate >> target ports and address targets independently of the LUN field in the >> CDB. >> >> I'm pretty sure this is also how SAS and other transports work. In >> their framing they include the target port. > > Yes, exactly. Hierarchical LUNs are a nasty fringe feature that we should > avoid as much as possible, that is for everything but IBM vSCSI which is > braindead enough to force them. > Yep. >> The question is whether we really need to support multiple targets on >> a virtio-scsi adapter or not. If you are selectively mapping LUNs >> that the guest may access, then multiple targets are not necessary. >> If we want to do pass-through of the entire SCSI bus then we need >> multiple targets but I'm not sure if there are other challenges like >> dependencies on the transport (Fibre Channel, SAS, etc) which make it >> impossible to pass through bus-level access? > > I don't think bus-level pass-through is either easily possible or > desirable. What multiple targets are useful for is allowing more > virtual disks than we have virtual PCI slots. We could do this by > supporting multiple LUNs, but given that many SCSI resources are > target-based, doing multiple targets most likely is the more scalable > and more logical variant. E.g. we could much more easily have one > virtqueue per target than per LUN. > The general idea here is that we can support NPIV. 
With NPIV we'll have several scsi_hosts, each of which is assigned a different set of LUNs by the array. With virtio we need to be able to react to LUN remapping on the array side, ie we need to be able to issue a 'REPORT LUNS' command and add/remove LUNs on the fly. This means we have to expose the scsi_host in some way via virtio. This is impossible with a one-to-one mapping between targets and LUNs. The actual bus-level pass-through will be just on the SCSI layer, ie 'REPORT LUNS' should be possible. If and how we do a LUN remapping internally on the host is a totally different matter. Same goes for the transport details; I doubt we will expose all the dingy details of the various transports, but rather restrict ourselves to an abstract transport. Cheers, Hannes
Re: virtio scsi host draft specification, v3
On 06/12/2011 09:51 AM, Michael S. Tsirkin wrote: > On Fri, Jun 10, 2011 at 02:55:35PM +0200, Hannes Reinecke wrote: >>> Device operation: request queues >>> >>> >>> The driver queues requests to an arbitrary request queue, and they are >>> used by the device on that same queue. >>> >> What about request ordering? >> If requests are placed on arbitrary queues you'll inevitably run into >> locking issues to ensure strict request ordering. >> I would add here: >> >> If a device uses more than one queue it is the responsibility of the >> device to ensure strict request ordering. > > Maybe I misunderstand - how can this be the responsibility of > the device if the device does not get the information about > the original ordering of the requests? > > For example, if the driver is crazy enough to put > all write requests on one queue and all barriers > on another one, how is the device supposed to ensure > ordering? > Which is exactly the problem I was referring to. When using more than one channel the request ordering _as seen by the initiator_ has to be preserved. This is quite hard to do from a device's perspective; it might be able to process the requests _in the order_ they've arrived, but it won't be able to figure out the latency of each request, ie how long it'll take for a request to be delivered to the initiator. What we need to do here is to ensure that virtio will deliver the requests in-order across all virtqueues. Not sure whether it does this already. Cheers, Hannes
Re: virtio scsi host draft specification, v3
On 06/10/2011 04:35 PM, Paolo Bonzini wrote: >> If requests are placed on arbitrary queues you'll inevitably run into >> locking issues to ensure strict request ordering. >> I would add here: >> >> If a device uses more than one queue it is the responsibility of the >> device to ensure strict request ordering. > > Applied with s/device/guest/g. > >> Please do not rely on bus/target/lun here. These are leftovers from >> parallel SCSI and do not have any meaning in modern SCSI >> implementations (eg FC or SAS). Rephrase that to >> >> The lun field is the Logical Unit Number as defined in SAM. > > Ok. > >>> The status byte is written by the device to be the SCSI status >>> code. >> >> ?? I doubt that exists. Make that: >> >> The status byte is written by the device to be the status code as >> defined in SAM. > > Ok. > >>> The response byte is written by the device to be one of the >>> following: >>> >>> - VIRTIO_SCSI_S_OK when the request was completed and the >>> status byte >>> is filled with a SCSI status code (not necessarily "GOOD"). >>> >>> - VIRTIO_SCSI_S_UNDERRUN if the content of the CDB requires >>> transferring >>> more data than is available in the data buffers. >>> >>> - VIRTIO_SCSI_S_ABORTED if the request was cancelled due to a >>> reset >>> or another task management function. >>> >>> - VIRTIO_SCSI_S_FAILURE for other host or guest error. In >>> particular, >>> if neither dataout nor datain is empty, and the >>> VIRTIO_SCSI_F_INOUT >>> feature has not been negotiated, the request will be >>> immediately >>> returned with a response equal to VIRTIO_SCSI_S_FAILURE. >>> >> And, of course: >> >> VIRTIO_SCSI_S_DISCONNECT if the request could not be processed due >> to a communication failure (eg device was removed or could not be >> reached). > > Ok. > >> This specification implies a strict one-to-one mapping between host >> and target. IE there is no way of specifying more than one target >> per host. 
> > Actually no, the intention is to use hierarchical LUNs to support > more than one target per host. > Can't. Hierarchical LUNs are a target-internal representation. The initiator (ie guest OS) should _not_ try to assume anything about the internal structure and should just use the LUN as an opaque number. Reason being that the LUN addressing is not unique, and there are several choices on how to represent a given LUN. So the consensus here is that different LUN numbers represent different physical devices, regardless of the (internal) LUN representation. Which in turn means we cannot use the LUN number to convey anything other than a device identification relative to a target. Cf SAM-3 paragraph 4.8: A logical unit number is a field (see 4.9) containing 64 bits that identifies the logical unit within a SCSI target device when accessed by a SCSI target port. IE the LUN is dependent on the target, but you cannot make assumptions on the target. Consequently, it's the host's responsibility to figure out the targets in the system. After that it invokes the 'scan' function from the SCSI midlayer. You can't start from a LUN and try to figure out the targets ... If you want to support more than one target per host you need some sort of enumeration/callback which allows the host to figure out the number of available targets. But in general the targets are referenced by the target port identifier as specified in the appropriate standard (eg FC or SAS). Sadly, we don't have any standard to fall back on for this. If, however, we decide to expose some details about the backend, we could be using the values from the backend directly. EG we could be forwarding the SCSI target port identifier here (if backed by real hardware) or creating our own SAS-type identifier when backed by qemu block. Then we could just query the backend via a new command on the controlq (eg 'list target ports') and wouldn't have to worry about any protocol-specific details here. 
Of course, when doing so we would lose the ability to freely remap LUNs. But then remapping LUNs doesn't gain you much imho. Plus you could always use qemu block backend here if you want to hide the details. But we would finally be able to use NPIV for KVM, something I wanted to do for a _long_ time. I personally _really_ would like to see the real backing device details exposed to the guest. Otherwise the more advanced stuff like
Re: virtio scsi host draft specification, v3
--- > > The driver queues requests to an arbitrary request queue, and they are > used by the device on that same queue. > What about request ordering? If requests are placed on arbitrary queues you'll inevitably run into locking issues to ensure strict request ordering. I would add here: If a device uses more than one queue it is the responsibility of the device to ensure strict request ordering. > Requests have the following format: > > struct virtio_scsi_req_cmd { > u8 lun[8]; > u64 id; > u8 task_attr; > u8 prio; > u8 crn; > char cdb[cdb_size]; > char dataout[]; > > u8 sense[sense_size]; > u32 sense_len; > u32 residual; > u16 status_qualifier; > u8 status; > u8 response; > char datain[]; > }; > > /* command-specific response values */ > #define VIRTIO_SCSI_S_OK 0 > #define VIRTIO_SCSI_S_UNDERRUN 1 > #define VIRTIO_SCSI_S_ABORTED 2 > #define VIRTIO_SCSI_S_FAILURE 3 > > /* task_attr */ > #define VIRTIO_SCSI_S_SIMPLE 0 > #define VIRTIO_SCSI_S_ORDERED 1 > #define VIRTIO_SCSI_S_HEAD 2 > #define VIRTIO_SCSI_S_ACA 3 > > The lun field addresses a bus, target and logical unit in the SCSI > host. The id field is the command identifier as defined in SAM. > Please do not rely on bus/target/lun here. These are leftovers from parallel SCSI and do not have any meaning in modern SCSI implementations (eg FC or SAS). Rephrase that to The lun field is the Logical Unit Number as defined in SAM. > Task_attr, prio and CRN are defined in SAM. The prio field should > always be zero, as command priority is explicitly not supported by > this version of the device. task_attr defines the task attribute as > in the table above. Note that all task attributes may be mapped to > SIMPLE by the device. CRN is generally expected to be 0, but clients > can provide it. The maximum CRN value defined by the protocol is 255, > since CRN is stored in an 8-bit integer. > > All of these fields are always read-only, as are the cdb and dataout > field. sense and subsequent fields are always write-only. 
> > The sense_len field indicates the number of bytes actually written > to the sense buffer. The residual field indicates the residual > size, calculated as data_length - number_of_transferred_bytes, for > read or write operations. > > The status byte is written by the device to be the SCSI status code. > ?? I doubt that exists. Make that: The status byte is written by the device to be the status code as defined in SAM. > The response byte is written by the device to be one of the following: > > - VIRTIO_SCSI_S_OK when the request was completed and the status byte >is filled with a SCSI status code (not necessarily "GOOD"). > > - VIRTIO_SCSI_S_UNDERRUN if the content of the CDB requires transferring >more data than is available in the data buffers. > > - VIRTIO_SCSI_S_ABORTED if the request was cancelled due to a reset >or another task management function. > > - VIRTIO_SCSI_S_FAILURE for other host or guest error. In particular, >if neither dataout nor datain is empty, and the VIRTIO_SCSI_F_INOUT >feature has not been negotiated, the request will be immediately >returned with a response equal to VIRTIO_SCSI_S_FAILURE. > And, of course: VIRTIO_SCSI_S_DISCONNECT if the request could not be processed due to a communication failure (eg device was removed or could not be reached). The remaining bits seem to be okay. One general question: This specification implies a strict one-to-one mapping between host and target. IE there is no way of specifying more than one target per host. This will make things like ALUA (Asymmetric Logical Unit Access) a bit tricky to implement, as the port states there are bound to target port groups. So with the virtio host spec we would need to specify two hosts to represent that. If that's the intention here I'm fine, but maybe we should be specifying this expressis verbis somewhere. Cheers, Hannes
Re: [Qemu-devel] [PATCH 0/4] megaraid_sas HBA emulation
Gerd Hoffmann wrote: > On 10/29/09 05:37, Christoph Hellwig wrote: >>> So something like >>> - Get next request >>> - Attach iovec/bounce buffer >>> - handle request (command/write/read) >>> - complete request (callback) >> >> Btw, from some previous attempts to sort out this code here are some >> things that I think would be beneficial: > > Trying to go forward in review+bisect friendly baby steps. Here is what > I have now: > > http://repo.or.cz/w/qemu/kraxel.git?a=shortlog;h=refs/heads/scsi.v1 > > It is far from being completed, will continue tomorrow. Should give an > idea of the direction I'm heading to though. Comments welcome. > Yep, this looks good. Cheers, Hannes
Re: [Qemu-devel] [PATCH 0/4] megaraid_sas HBA emulation
On Wed 28 Oct 2009 02:58:33 PM CET, Gerd Hoffmann wrote: > Hi, > >> From a really quick view fixing up the data xfer code paths doesn't >> look too bad. Think I'll give it a try. > > Oh well. The interface was pretty obviously designed for the ESP, which is > the oldest SCSI adapter in qemu ... > > ESP: There is no scatter-gather support in the hardware. So for large > reads/writes there are quite a few switches between OS and ESP: The OS > saying "dma next sectors to this location" via ioports, the ESP doing > it and raising an IRQ when done, next round. The existing callback > mechanism models that pretty closely. > > USB: streams the data in small packets (smaller than sector size, 64 > bytes IIRC). Current interface works well enough. > > LSI: Jumps through quite a few hoops to work with the existing > interface. Current emulation reads one lsi script command at a time > and does reads/writes in small pieces like the ESP. I think it could > do a lot better: parse lsi scripts into scatter lists and submit larger > requests. Maybe even have multiple requests in flight at the same > time. That probably means putting the lsi script parsing code upside > down though. > > MEGASAS: I guess you have scatter lists at hand and want to submit them > directly to the block layer for zerocopy block I/O. > Precisely. > So, where to go from here? > > I'm tempted to zap the complete read-in-pieces logic. For read/write > transfers storage must be passed where everything fits in. The > completion callback is called on command completion and nothing else. > Agree. That would be my idea here, as well. > I think we'll need two modes here: xfer from/to host-allocated bounce > buffer (linear buffer) and xfer from/to guest memory (scatter list). > I don't think we really need two modes. My preferred interface here is to pass down scatter-gather lists with every xfer; this way it'll be the responsibility of the driver to create the lists in the first place. 
If it has hardware scatter-gather support it can just pass them down, if not it can as easily create a scatter-gather list with just one element as a bounce buffer. Much like the current code does now, only with explicit passing of iovecs. > That means (emulated) hardware without scatter-gather support must use > the bounce buffer mode and can't do zerocopy I/O. I don't think this > is a big problem though. Lots of small I/O requests don't perform very > well, so one big request filling the bounce buffer then memcpy() > from/to guest memory will most likely be faster anyway. > I would update the existing interface to split off the request generation from the command completion code; then it would be easy to attach the iovec to the request. So something like - Get next request - Attach iovec/bounce buffer - handle request (command/write/read) - complete request (callback) would be an idea. Cheers, Hannes --- No .sig today as I'm writing from my laptop.
Re: [Qemu-devel] [PATCH 3/4] scsi-disk: Factor out SCSI command emulation
On Wed 28 Oct 2009 09:42:47 AM CET, Christoph Hellwig wrote: > On Tue, Oct 27, 2009 at 04:28:59PM +0100, Hannes Reinecke wrote: >> >> Other drivers might want to use SCSI command emulation without >> going through the SCSI disk abstraction, as this imposes too >> many limits on the emulation. > > Might be a good idea to move something like this first into the series > and share the CDB decoder between scsi and ide (atapi) as a start. A > little bit of refactoring of the CDB decoder, e.g. into one function > per opcode (-family) won't hurt either. > Yes, that was on my to-do list, as well. Eg even the megasas already emulates REPORT LUNS, as we certainly don't want to pass the original LUN information back to the guest. And it would make sense to trap some other commands (like INQUIRY), too, to strip off any unwanted information. Cheers, Hannes --- No .sig today as this is from my laptop.
Re: [Qemu-devel] [PATCH 0/4] megaraid_sas HBA emulation
On Tue 27 Oct 2009 05:47:21 PM CET, Gerd Hoffmann wrote: > Hi, >> The device can be accessed by >> >> -drive if=raid,file=XXX > > Don't extend that qemu automagic please. The new way to handle this is: > > -drive if=none,id=mydisk,file=/path/to/some/disk.img > -device megasas,id=raid > -device scsi-disk,bus=raid.0,scsi-id=1,drive=mydisk > Alright, no problem there. Didn't know this. >> In order to support SCSI command emulation I had to update / >> patch up the existing SCSI disk support. This might not be >> to everyone's taste, so I'm open to alternative >> suggestions. >> >> But I certainly do _not_ want to update the SCSI disk >> emulation, as this is really quite tied to the SCSI parallel >> interface used by the old lsi53c895a.c. > > --verbose please. I'd prefer to fix scsi-disk bugs and/or limitations > instead of working around them. > The problem is I don't have any documentation for the LSI parallel SCSI controller. So I don't know if and in what shape I/O is passed down, nor anything else. And as the SCSI disk emulation is really tied into the LSI parallel SCSI controller, any change in the former is likely to break the latter. And with no way for me to fix it. Hence I decided on this approach. I surely can go ahead and patch up the scsi disk emulation, but it's quite likely to break the LSI controller. If that's okay with everybody, I'll surely go ahead there. >> Plus it doesn't do scatter-gather list handling, > > Which should be fixed anyway. > Quite. But as I said, the LSI parallel SCSI controller is going to suffer. Cheers, Hannes
[PATCH 4/4] megasas: Add SCSI command emulation
Now that we can use SCSI command emulation without using the SCSI disk abstraction we can easily add it to the megasas HBA. Signed-off-by: Hannes Reinecke --- hw/megasas.c | 88 +++--- 1 files changed, 53 insertions(+), 35 deletions(-) diff --git a/hw/megasas.c b/hw/megasas.c index a57e8e0..f32b313 100644 --- a/hw/megasas.c +++ b/hw/megasas.c @@ -661,40 +661,55 @@ static int megasas_handle_scsi(MPTState *s, uint8_t fcmd, } } -memset(&cmd->hdr, 0, sizeof(struct sg_io_hdr)); -cmd->hdr.interface_id = 'S'; -cmd->hdr.cmd_len = cdb_len; -cmd->hdr.cmdp = cdb; -cmd->hdr.iovec_count = cmd->sge_count; -cmd->hdr.dxferp = cmd->iov; -for (n = 0; n < cmd->sge_count; n++) - cmd->hdr.dxfer_len += cmd->iov[n].iov_len; -if (cmd->sge_count) { - if (dir) - cmd->hdr.dxfer_direction = SG_DXFER_TO_DEV; - else - cmd->hdr.dxfer_direction = SG_DXFER_FROM_DEV; -} else { - cmd->hdr.dxfer_direction = SG_DXFER_NONE; -} -cmd->hdr.sbp = cmd->sense; -cmd->hdr.mx_sb_len = cmd->sense_len; +if (bdrv_is_sg(cmd->lun->bdrv)) { + memset(&cmd->hdr, 0, sizeof(struct sg_io_hdr)); + cmd->hdr.interface_id = 'S'; + cmd->hdr.cmd_len = cdb_len; + cmd->hdr.cmdp = cdb; + cmd->hdr.iovec_count = cmd->sge_count; + cmd->hdr.dxferp = cmd->iov; + for (n = 0; n < cmd->sge_count; n++) + cmd->hdr.dxfer_len += cmd->iov[n].iov_len; + if (cmd->sge_count) { + if (dir) + cmd->hdr.dxfer_direction = SG_DXFER_TO_DEV; + else + cmd->hdr.dxfer_direction = SG_DXFER_FROM_DEV; + } else { + cmd->hdr.dxfer_direction = SG_DXFER_NONE; + } + cmd->hdr.sbp = cmd->sense; + cmd->hdr.mx_sb_len = cmd->sense_len; -ret = bdrv_ioctl(cmd->lun->bdrv, SG_IO, &cmd->hdr); -if (ret) { - DPRINTF("SCSI pthru dev %x lun %x failed with %d\n", - target, lun, errno); - sense_len = scsi_build_sense(cmd->sense, SENSE_IO_ERROR); - cmd->sge_size = 0; - scsi_status = SAM_STAT_CHECK_CONDITION; -} else if (cmd->hdr.status) { - sense_len = cmd->hdr.sb_len_wr; - scsi_status = cmd->hdr.status; - cmd->sge_size = cmd->hdr.dxfer_len; - scsi_status = SAM_STAT_CHECK_CONDITION; 
+ ret = bdrv_ioctl(cmd->lun->bdrv, SG_IO, &cmd->hdr); + if (ret) { + DPRINTF("SCSI pthru dev %x lun %x failed with %d\n", + target, lun, errno); + sense_len = scsi_build_sense(cmd->sense, SENSE_IO_ERROR); + cmd->sge_size = 0; + scsi_status = SAM_STAT_CHECK_CONDITION; + } else if (cmd->hdr.status) { + sense_len = cmd->hdr.sb_len_wr; + scsi_status = cmd->hdr.status; + cmd->sge_size = cmd->hdr.dxfer_len; + scsi_status = SAM_STAT_CHECK_CONDITION; + } else { + sense_len = 0; + cmd->sge_size = cmd->hdr.dxfer_len; + } } else { - sense_len = 0; - cmd->sge_size = cmd->hdr.dxfer_len; + uint32_t sense; + + DPRINTF("Emulate SCSI pthru cmd %x\n", cdb[0]); + sense = scsi_emulate_command(cmd->lun->bdrv, 0, cdb, +cmd->iov[0].iov_len, +cmd->iov[0].iov_base, +&cmd->sge_size); + sense_len = scsi_build_sense(cmd->sense, sense); + if (sense_len) + scsi_status = SAM_STAT_CHECK_CONDITION; + else + scsi_status = SAM_STAT_GOOD; } out: megasas_unmap_sense(cmd, sense_len); @@ -1105,13 +1120,16 @@ static int megasas_scsi_init(PCIDevice *dev) lun->bdrv = NULL; continue; } + bdrv_set_tcq(lun->bdrv, 1); /* check if we can use SG_IO */ ret = bdrv_ioctl(lun->bdrv, SG_IO, &hdr); if (ret) { - DPRINTF("SCSI cmd passthrough not available on dev %d (error %d)\n", + DPRINTF("Using SCSI cmd emulation on dev %d (error %d)\n", unit, ret); - lun->sdev = NULL; - lun->bdrv = NULL; + bdrv_set_sg(lun->bdrv, 0); + } else { + DPRINTF("Using SCSI cmd passthrough on dev %d\n", unit); + bdrv_set_sg(lun->bdrv, 1); } } register_savevm("megasas", -1, 0, megasas_scsi_save, megasas_scsi_load, s); -- 1.6.0.2 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
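The hunk above makes megasas_handle_scsi() branch between SG_IO passthrough and the new emulation path, depending on whether the backing device is a /dev/sg* node. A stripped-down model of that dispatch; the types and callbacks are placeholders, not QEMU's API:

```c
#include <stdint.h>

/* Minimal model of the patch's control flow: devices flagged as sg at
 * init time take the SG_IO passthrough path, everything else goes
 * through scsi_emulate_command(). All names here are illustrative. */
typedef int (*sg_io_fn)(void *bdrv, const uint8_t *cdb, uint8_t cdb_len);
typedef uint32_t (*emulate_fn)(void *bdrv, const uint8_t *cdb);

struct backend {
    int is_sg;              /* cached result of the SG_IO probe */
    void *bdrv;
    sg_io_fn passthrough;   /* returns nonzero on failure */
    emulate_fn emulate;     /* returns packed sense, 0 == GOOD */
};

/* Returns a SCSI status: 0 = GOOD, 2 = CHECK CONDITION. */
static int handle_cdb(struct backend *be, const uint8_t *cdb, uint8_t len)
{
    if (be->is_sg)
        return be->passthrough(be->bdrv, cdb, len) ? 2 : 0;
    return be->emulate(be->bdrv, cdb) ? 2 : 0;
}
```

The design point mirrors the patch: the probe happens once at init (megasas_scsi_init), so the per-command path is a single flag test rather than a repeated ioctl.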
[PATCH 3/4] scsi-disk: Factor out SCSI command emulation
Other drives might want to use SCSI command emulation without going through the SCSI disk abstraction, as this imposes too many limits on the emulation. Signed-off-by: Hannes Reinecke --- block.c| 15 ++ block.h|3 + block_int.h|1 + hw/scsi-disk.c | 610 ++-- hw/scsi-disk.h |3 + 5 files changed, 346 insertions(+), 286 deletions(-) diff --git a/block.c b/block.c index 33f3d65..06f92c4 100644 --- a/block.c +++ b/block.c @@ -930,6 +930,21 @@ int bdrv_is_sg(BlockDriverState *bs) return bs->sg; } +void bdrv_set_sg(BlockDriverState *bs, int set) +{ +bs->sg = set; +} + +int bdrv_get_tcq(BlockDriverState *bs) +{ +return bs->tcq; +} + +void bdrv_set_tcq(BlockDriverState *bs, int set) +{ +bs->tcq = set; +} + int bdrv_enable_write_cache(BlockDriverState *bs) { return bs->enable_write_cache; diff --git a/block.h b/block.h index a966afb..7862fa0 100644 --- a/block.h +++ b/block.h @@ -134,9 +134,12 @@ void bdrv_get_geometry_hint(BlockDriverState *bs, int *pcyls, int *pheads, int *psecs); int bdrv_get_type_hint(BlockDriverState *bs); int bdrv_get_translation_hint(BlockDriverState *bs); +int bdrv_get_tcq(BlockDriverState *bs); +void bdrv_set_tcq(BlockDriverState *bs, int set); int bdrv_is_removable(BlockDriverState *bs); int bdrv_is_read_only(BlockDriverState *bs); int bdrv_is_sg(BlockDriverState *bs); +void bdrv_set_sg(BlockDriverState *bs, int set); int bdrv_enable_write_cache(BlockDriverState *bs); int bdrv_is_inserted(BlockDriverState *bs); int bdrv_media_changed(BlockDriverState *bs); diff --git a/block_int.h b/block_int.h index 8e72abe..e5ee57b 100644 --- a/block_int.h +++ b/block_int.h @@ -129,6 +129,7 @@ struct BlockDriverState { int encrypted; /* if true, the media is encrypted */ int valid_key; /* if true, a valid encryption key has been set */ int sg;/* if true, the device is a /dev/sg* */ +int tcq; /* if true, the device supports tagged command queueing */ /* event callback when inserting/removing */ void (*change_cb)(void *opaque); void *change_opaque; diff --git 
a/hw/scsi-disk.c b/hw/scsi-disk.c index 68b4e83..3e68518 100644 --- a/hw/scsi-disk.c +++ b/hw/scsi-disk.c @@ -56,7 +56,7 @@ typedef struct SCSIRequest { /* Both sector and sector_count are in terms of qemu 512 byte blocks. */ uint64_t sector; uint32_t sector_count; -struct iovec iov; +struct iovec *iov; QEMUIOVector qiov; BlockDriverAIOCB *aiocb; struct SCSIRequest *next; @@ -72,7 +72,8 @@ struct SCSIDiskState This is the number of 512 byte blocks in a single scsi sector. */ int cluster_size; uint64_t max_lba; -int sense; +uint8_t sense[SCSI_SENSE_LEN]; +uint8_t sense_len; char drive_serial_str[21]; QEMUBH *bh; }; @@ -90,13 +91,12 @@ static SCSIRequest *scsi_new_request(SCSIDevice *d, uint32_t tag) free_requests = r->next; } else { r = qemu_malloc(sizeof(SCSIRequest)); -r->iov.iov_base = qemu_memalign(512, SCSI_DMA_BUF_SIZE); + r->iov = NULL; } r->bus = scsi_bus_from_device(d); r->dev = s; r->tag = tag; r->sector_count = 0; -r->iov.iov_len = 0; r->aiocb = NULL; r->status = 0; @@ -126,6 +126,17 @@ static void scsi_remove_request(SCSIRequest *r) free_requests = r; } +static void *scsi_allocate_iovec(SCSIRequest *r) { +if (!r->iov) { + r->iov = qemu_malloc(sizeof(struct iovec)); + if (!r->iov) + return NULL; + r->iov->iov_base = qemu_memalign(512, SCSI_DMA_BUF_SIZE); + r->iov->iov_len = SCSI_DMA_BUF_SIZE; +} +return r->iov; +} + static SCSIRequest *scsi_find_request(SCSIDiskState *s, uint32_t tag) { SCSIRequest *r; @@ -137,12 +148,11 @@ static SCSIRequest *scsi_find_request(SCSIDiskState *s, uint32_t tag) return r; } -/* Helper function to build a sense block */ int32_t scsi_build_sense(uint8_t *sense_buf, uint32_t sense) { memset(sense_buf, 0, SCSI_SENSE_LEN); if (!sense) - return 0; + return 0; sense_buf[0] = 0xf0; /* current, fixed format */ sense_buf[2] = (sense >> 16) & 0x0F; @@ -154,15 +164,19 @@ int32_t scsi_build_sense(uint8_t *sense_buf, uint32_t sense) } /* Helper function for command completion. 
*/ -static void scsi_command_complete(SCSIRequest *r, int status, int sense) +static void scsi_command_complete(SCSIRequest *r, int status, uint32_t sense) { SCSIDiskState *s = r->dev; uint32_t tag; -DPRINTF("Command complete tag=0x%x status=%d sense=%d\n", r->tag, status, sense); -s->sense = sense; + +DPRINTF("Command complete tag=0x%x status=%d sense=%d\n", r->tag, + status, s->sense_len); +if (status == STATUS_CHECK_CONDITION
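The scsi_build_sense() helper factored out of scsi-disk.c expands a packed key/ASC/ASCQ word into a fixed-format sense block. A standalone re-sketch mirroring the patch's layout (build_fixed_sense is a renamed stand-in, not the actual QEMU symbol):

```c
#include <stdint.h>
#include <string.h>

#define SCSI_SENSE_LEN 18

/* Expand a packed 0x0KAAQQ sense word (K = sense key, AA = ASC,
 * QQ = ASCQ) into an SPC fixed-format sense block, as the patch's
 * scsi_build_sense() does. */
static int32_t build_fixed_sense(uint8_t *buf, uint32_t sense)
{
    memset(buf, 0, SCSI_SENSE_LEN);
    if (!sense)
        return 0;

    buf[0]  = 0xf0;                 /* valid bit, current error, fixed format */
    buf[2]  = (sense >> 16) & 0x0f; /* sense key */
    buf[7]  = 10;                   /* additional sense length */
    buf[12] = (sense >> 8) & 0xff;  /* additional sense code (ASC) */
    buf[13] = sense & 0xff;         /* additional sense code qualifier (ASCQ) */
    return SCSI_SENSE_LEN;
}
```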
[PATCH 2/4] megasas: LSI MegaRAID SAS HBA emulation
This patch add an emulation for the LSI MegaRAID SAS HBA. It is using SG_IO to forward / pass through SCSI commands to the underlying block driver, so no emulation is done currently. Signed-off-by: Hannes Reinecke --- Makefile.hw |2 +- hw/megasas.c | 1134 ++ hw/pci_ids.h |2 + 3 files changed, 1137 insertions(+), 1 deletions(-) create mode 100644 hw/megasas.c diff --git a/Makefile.hw b/Makefile.hw index de1db31..cae35f9 100644 --- a/Makefile.hw +++ b/Makefile.hw @@ -33,7 +33,7 @@ obj-y += wdt_i6300esb.o obj-y += ne2000.o # SCSI layer -obj-y += lsi53c895a.o +obj-y += lsi53c895a.o megasas.o obj-$(CONFIG_ESP) += esp.o obj-y += dma-helpers.o sysbus.o isa-bus.o diff --git a/hw/megasas.c b/hw/megasas.c new file mode 100644 index 000..a57e8e0 --- /dev/null +++ b/hw/megasas.c @@ -0,0 +1,1134 @@ +/* + * QEMU MegaRAID SAS 8708EM2 Host Bus Adapter emulation + * + * Copyright (c) 2009 Hannes Reinecke, SUSE Linux Products GmbH + * + * This code is licenced under the LGPL. + */ + + +#include + +#include "hw.h" +#include "pci.h" +#include "scsi.h" +#include "scsi-disk.h" +#include "block_int.h" +#ifdef __linux__ +# include +#endif + +#undef DEBUG_MEGASAS +#undef DEBUG_MEGASAS_REG +#undef DEBUG_MEGASAS_QUEUE +#undef DEBUG_MEGASAS_MFI + +#ifdef DEBUG_MEGASAS +#define DPRINTF(fmt, ...) \ +do { printf("megasas: " fmt , ## __VA_ARGS__); } while (0) +#define BADF(fmt, ...) \ +do { fprintf(stderr, "megasas: error: " fmt , ## __VA_ARGS__); exit(1);} while (0) +#else +#define DPRINTF(fmt, ...) do {} while(0) +#define BADF(fmt, ...) 
\ +do { fprintf(stderr, "megasas: error: " fmt , ## __VA_ARGS__);} while (0) +#endif + +/* Static definitions */ +#define MEGASAS_MAX_FRAMES 64 +#define MEGASAS_MAX_SGE 8 + +/* SCSI definitions */ +#define SAM_STAT_GOOD0x00 +#define SAM_STAT_CHECK_CONDITION 0x02 +#define SAM_STAT_CONDITION_MET 0x04 +#define SAM_STAT_BUSY0x08 +#define SAM_STAT_TASK_ABORTED0x40 + +/* Register definitions */ +#defineMEGASAS_INBOUND_MSG_0 0x0010 +#defineMEGASAS_INBOUND_MSG_1 0x0014 +#defineMEGASAS_OUTBOUND_MSG_0 0x0018 +#defineMEGASAS_OUTBOUND_MSG_1 0x001C +#defineMEGASAS_INBOUND_DOORBELL0x0020 +#defineMEGASAS_INBOUND_INTR_STATUS 0x0024 +#defineMEGASAS_INBOUND_INTR_MASK 0x0028 +#defineMEGASAS_OUTBOUND_DOORBELL 0x002C +#defineMEGASAS_OUTBOUND_INTR_STATUS0x0030 +#defineMEGASAS_OUTBOUND_INTR_MASK 0x0034 +#defineMEGASAS_INBOUND_QUEUE_PORT 0x0040 +#defineMEGASAS_OUTBOUND_QUEUE_PORT 0x0044 +#defineMEGASAS_OUTBOUND_DOORBELL_CLEAR 0x00A0 +#defineMEGASAS_OUTBOUND_SCRATCH_PAD0x00B0 +#defineMEGASAS_INBOUND_LOW_QUEUE_PORT 0x00C0 +#defineMEGASAS_INBOUND_HIGH_QUEUE_PORT 0x00C4 + +/* FW commands */ +#define MFI_INIT_ABORT 0x0001 +#define MFI_INIT_READY 0x0002 +#define MFI_INIT_MFIMODE 0x0004 +#define MFI_INIT_CLEAR_HANDSHAKE 0x0008 +#define MFI_INIT_HOTPLUG 0x0010 +#define MFI_STOP_ADP 0x0020 + +/* MFI states */ +#define MFI_STATE_UNDEFINED0x0 +#define MFI_STATE_BB_INIT 0x1 +#define MFI_STATE_FW_INIT 0x4 +#define MFI_STATE_WAIT_HANDSHAKE 0x6 +#define MFI_STATE_FW_INIT_20x7 +#define MFI_STATE_DEVICE_SCAN 0x8 +#define MFI_STATE_BOOT_MESSAGE_PENDING 0x9 +#define MFI_STATE_FLUSH_CACHE 0xA +#define MFI_STATE_READY0xB +#define MFI_STATE_OPERATIONAL 0xC +#define MFI_STATE_FAULT0xF + +/* + * MFI command opcodes + */ +#define MFI_CMD_INIT 0x00 +#define MFI_CMD_LD_READ0x01 +#define MFI_CMD_LD_WRITE 0x02 +#define MFI_CMD_LD_SCSI_IO 0x03 +#define MFI_CMD_PD_SCSI_IO 0x04 +#define MFI_CMD_DCMD 0x05 +#define MFI_CMD_ABORT 0x06 +#define MFI_CMD_SMP0x07 +#define MFI_CMD_STP0x08 + +#define MR_DCMD_CTRL_GET_INFO 0x0101 
+ +#define MR_DCMD_CTRL_CACHE_FLUSH 0x01101000 +#define MR_FLUSH_CTRL_CACHE0x01 +#define MR_FLUSH_DISK_CACHE0x02 + +#define MR_DCMD_CTRL_SHUTDOWN 0x0105 +#define MR_DCMD_HIBERNATE_SHUTDOWN 0x0106 +#define MR
[PATCH 0/4] megaraid_sas HBA emulation
Hi all,

this patchset implements an emulation for the megaraid_sas HBA. It emulates an LSI MegaRAID SAS 8708EM2 HBA, i.e. it presents a virtual SCSI adapter to the guest. Internally it uses aio for read/write requests and either SG_IO or SCSI command emulation for everything else.

The reason for choosing the megaraid_sas HBA and not, say, implementing a virtio SCSI interface is:
- the megaraid_sas uses a very simple firmware interface, comparable to virtio
- the HBA drivers already exist, so I only have to write the backend :-)

The device can be accessed by

-drive if=raid,file=XXX

In order to support SCSI command emulation I had to update / patch up the existing SCSI disk support. This might not be to everyone's taste, so I'm open to alternative suggestions. But I certainly do _not_ want to update the SCSI disk emulation, as this is really quite tied to the SCSI parallel interface used by the old lsi53c895a.c. Plus it doesn't do scatter-gather list handling, which is quite impossible to fix without proper documentation. Of course, if anyone else would step in here, I won't object :-)

It currently runs guests with 2.6.27 and up; Windows XP support is not quite there yet. Anything else might work; if not, enable debugging and send me the logfile.

As usual, comments / suggestions etc. welcome.

Cheers,

Hannes
--
Dr. Hannes Reinecke            zSeries & Storage
h...@suse.de                   +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
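For reference, the two invocation styles discussed in the thread side by side. The binary name and image path are placeholders; the option strings themselves are taken verbatim from the emails:

```shell
# Legacy shorthand from the cover letter:
qemu-system-x86_64 -drive if=raid,file=/path/to/disk.img

# Explicit -device form suggested by Gerd in the follow-up:
qemu-system-x86_64 \
    -drive if=none,id=mydisk,file=/path/to/disk.img \
    -device megasas,id=raid \
    -device scsi-disk,bus=raid.0,scsi-id=1,drive=mydisk
```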
[PATCH 1/4] Add 'raid' interface class
This patch adds a 'raid' interface class. It is basically a clone of the existing 'scsi' interface, only allowing up to 128 disks. Signed-off-by: Hannes Reinecke --- hw/pc.c |5 + hw/pci-hotplug.c |1 + hw/scsi-disk.c | 17 + hw/scsi-disk.h | 20 +++- qemu-config.c|2 +- sysemu.h |3 ++- vl.c |8 ++-- 7 files changed, 51 insertions(+), 5 deletions(-) diff --git a/hw/pc.c b/hw/pc.c index 83012a9..26aad4c 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -1345,6 +1345,11 @@ static void pc_init1(ram_addr_t ram_size, for (bus = 0; bus <= max_bus; bus++) { pci_create_simple(pci_bus, -1, "lsi53c895a"); } + + max_bus = drive_get_max_bus(IF_RAID); + for (bus = 0; bus <= max_bus; bus++) { + pci_create_simple(pci_bus, -1, "megasas"); + } } if (extboot_drive) { diff --git a/hw/pci-hotplug.c b/hw/pci-hotplug.c index 410fa3f..855a1ad 100644 --- a/hw/pci-hotplug.c +++ b/hw/pci-hotplug.c @@ -85,6 +85,7 @@ void drive_hot_add(Monitor *mon, const QDict *qdict) switch (type) { case IF_SCSI: +case IF_RAID: if (pci_read_devaddr(mon, pci_addr, &dom, &pci_bus, &slot)) { goto err; } diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c index 2a9268a..68b4e83 100644 --- a/hw/scsi-disk.c +++ b/hw/scsi-disk.c @@ -41,6 +41,7 @@ do { fprintf(stderr, "scsi-disk: " fmt , ## __VA_ARGS__); } while (0) #define SCSI_DMA_BUF_SIZE131072 #define SCSI_MAX_INQUIRY_LEN 256 +#define SCSI_SENSE_LEN 18 #define SCSI_REQ_STATUS_RETRY 0x01 @@ -136,6 +137,22 @@ static SCSIRequest *scsi_find_request(SCSIDiskState *s, uint32_t tag) return r; } +/* Helper function to build a sense block */ +int32_t scsi_build_sense(uint8_t *sense_buf, uint32_t sense) +{ +memset(sense_buf, 0, SCSI_SENSE_LEN); +if (!sense) + return 0; + +sense_buf[0] = 0xf0; /* current, fixed format */ +sense_buf[2] = (sense >> 16) & 0x0F; +sense_buf[7] = 10; +sense_buf[12] = (sense >> 8 ) & 0xFF; +sense_buf[13] = sense & 0xFF; + +return SCSI_SENSE_LEN; +} + /* Helper function for command completion. 
*/ static void scsi_command_complete(SCSIRequest *r, int status, int sense) { diff --git a/hw/scsi-disk.h b/hw/scsi-disk.h index b6b6c12..5b54272 100644 --- a/hw/scsi-disk.h +++ b/hw/scsi-disk.h @@ -9,6 +9,23 @@ enum scsi_reason { SCSI_REASON_DATA /* Transfer complete, more data required. */ }; +/* LUN not ready, Manual intervention required */ +#define SENSE_LUN_NOT_READY 0x020403 +/* Hardware error, I/O process terminated */ +#define SENSE_IO_ERROR 0x040006 +/* Hardware error, I_T Nexus loss occured */ +#define SENSE_TAG_NOT_FOUND 0x042907 +/* Hardware error, internal target failure */ +#define SENSE_TARGET_FAILURE 0x044400 +/* Illegal request, invalid command operation code */ +#define SENSE_INVALID_OPCODE 0x052000 +/* Illegal request, LBA out of range */ +#define SENSE_LBA_OUT_OF_RANGE 0x052100 +/* Illegal request, Invalid field in CDB */ +#define SENSE_INVALID_FIELD 0x052400 +/* Illegal request, LUN not supported */ +#define SENSE_LUN_NOT_SUPPORTED 0x052500 + typedef struct SCSIBus SCSIBus; typedef struct SCSIDevice SCSIDevice; typedef struct SCSIDeviceInfo SCSIDeviceInfo; @@ -49,7 +66,7 @@ struct SCSIBus { int tcq, ndev; scsi_completionfn complete; -SCSIDevice *devs[8]; +SCSIDevice *devs[128]; }; void scsi_bus_new(SCSIBus *bus, DeviceState *host, int tcq, int ndev, @@ -63,5 +80,6 @@ static inline SCSIBus *scsi_bus_from_device(SCSIDevice *d) SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, DriveInfo *dinfo, int unit); void scsi_bus_legacy_handle_cmdline(SCSIBus *bus); +int32_t scsi_build_sense(uint8_t *sense_buf, uint32_t sense); #endif diff --git a/qemu-config.c b/qemu-config.c index 4fb7898..8d7a137 100644 --- a/qemu-config.c +++ b/qemu-config.c @@ -18,7 +18,7 @@ QemuOptsList qemu_drive_opts = { },{ .name = "if", .type = QEMU_OPT_STRING, -.help = "interface (ide, scsi, sd, mtd, floppy, pflash, virtio)", +.help = "interface (ide, scsi, raid, sd, mtd, floppy, pflash, virtio)", },{ .name = "index", .type = QEMU_OPT_NUMBER, diff --git a/sysemu.h b/sysemu.h 
index 2ef3797..8ed0b8c 100644 --- a/sysemu.h +++ b/sysemu.h @@ -159,7 +159,7 @@ extern unsigned int nb_prom_envs; typedef enum { IF_NONE, -IF_IDE, IF_SCSI, IF_FLOPPY, IF_PFLASH, IF_MTD, IF_SD, IF_VIRTIO, IF_XEN, +IF_IDE, IF_SCSI, IF_RAID, IF_FLOPPY, IF_PFLASH, IF_MTD, IF_SD, IF_VIRTIO, IF_XEN, IF_COUNT } BlockInterfaceType; @@ -185,6 +185,7 @@ typedef struct DriveInfo { #define MAX_IDE_DEVS 2 #define MAX_SCSI_DEVS 7 +#define MAX_RAID_DEVS 128 #define MAX_DRIVES 32 exte
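The SENSE_* constants added to scsi-disk.h pack sense key, additional sense code (ASC) and qualifier (ASCQ) into one word as 0x0KAAQQ. A small unpacking helper for logging or debugging — illustrative only, not part of the patch:

```c
#include <stdint.h>

/* Split a packed 0x0KAAQQ sense constant (as used by SENSE_LUN_NOT_READY
 * etc. in the patch) into its key/ASC/ASCQ components. */
struct sense_triple {
    uint8_t key;    /* sense key, e.g. 0x05 ILLEGAL REQUEST */
    uint8_t asc;    /* additional sense code */
    uint8_t ascq;   /* additional sense code qualifier */
};

static struct sense_triple unpack_sense(uint32_t code)
{
    struct sense_triple t = {
        .key  = (code >> 16) & 0x0f,
        .asc  = (code >> 8) & 0xff,
        .ascq = code & 0xff,
    };
    return t;
}
```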
Re: [RFC] [PATCH] SCSI passthrough for virtio-blk
Hi Christian,

Christian Borntraeger wrote:
> Am Freitag, 29. August 2008 schrieb Hannes Reinecke:
>> Hmm. Works here, using an unpatched kvm-73.
>> Which version did you use?
>
> I use the s390 userspace prototype kuli, which uses a virtio transport
> similar to lguest.
>
> I retried and it seems to race. Most of the time it works fine, but sometimes
> sdparm hangs. I will have a 2nd look.
>
> sysrq-t gives me the following trace:
>
> Call Trace:
> ([<04000755bc78>] 0x4000755bc78)
> sdparm  D 0043659e     0  2493      1
>  0012004a 0744f740 0744f778 001896469fd23785
>  0744f778 009e5500 0043f230 00120130
>  0744f778 06d39400 06d39f80 0001
>  009e6f00 076bf8e8 0744f7c8 07530670
>  0043f610 00435e66 0744f7c8 0744f868
> Call Trace:
> ([<00435e66>] schedule+0x32e/0x7ec)
> [<0043659e>] schedule_timeout+0xba/0x10c
> [<004358da>] wait_for_common+0xbe/0x1a8
> [<0027ec3e>] blk_execute_rq+0x86/0xc4
> [<00282768>] sg_io+0x1a4/0x360
> [<00282f8c>] scsi_cmd_ioctl+0x2bc/0x3f0
> [<002c3108>] virtblk_ioctl+0x44/0x58
> [<0027ff18>] blkdev_driver_ioctl+0x98/0xa4
> [<0027ffd8>] blkdev_ioctl+0xb4/0x7f8
> [<001e1572>] block_ioctl+0x3a/0x48
> [<001bca0a>] vfs_ioctl+0x52/0xdc
> [<001bcb0a>] do_vfs_ioctl+0x76/0x350
> [<001bce6e>] sys_ioctl+0x8a/0xa0
> [<0011282c>] sysc_tracego+0xe/0x14
> [<020000114286>] 0x2114286

I'm tempted to say 'not my fault'; the submitted SCSI request on the _host_ hangs and doesn't come back. Looks more like a SCSI problem on the host ...

Cheers,

Hannes
--
Dr. Hannes Reinecke            zSeries & Storage
[EMAIL PROTECTED]              +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
Re: [RFC] [PATCH] SCSI passthrough for virtio-blk
Hi Christian,

Christian Borntraeger wrote:
> Am Freitag, 29. August 2008 schrieb Hannes Reinecke:
>> So when using '-drive file=/dev/sgXX,if=virtio,format=host_device' you
>> can happily call any sg_XX command on the resulting vdX device. Quite
>> neat, methinks. And it's even backwards compatible, so each of these
>> patches should work without the other one applied.
>
> Does not work here. If the host does not support the pass-through, the device
> driver waits for a response. I tried sdparm /dev/vda with a patched kernel
> and an unpatched userspace.

Hmm. Works here, using an unpatched kvm-73. Which version did you use?

Cheers,

Hannes
--
Dr. Hannes Reinecke            zSeries & Storage
[EMAIL PROTECTED]              +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
[RFC] [PATCH] SCSI passthrough for virtio-blk
Hi all,

I got bored and implemented SCSI passthrough for the virtio-blk driver. The principle is quite simple: just put the missing fields (CDB, sense buffer and status header) on the virtio queue and then call the SG_IO ioctl on the host.

So when using '-drive file=/dev/sgXX,if=virtio,format=host_device' you can happily call any sg_XX command on the resulting vdX device. Quite neat, methinks. And it's even backwards compatible, so each of these patches should work without the other one applied.

As one would have guessed there are two patches: one for the Linux kernel to modify the virtio-blk driver in the guest, and one for the qemu/kvm userland program to modify the virtio-blk driver on the host. This patch is relative to avi's kvm-userland tree from kernel.org.

As usual, comments etc. to me.

Cheers,

Hannes
--
Dr. Hannes Reinecke            zSeries & Storage
[EMAIL PROTECTED]              +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

virtio: Implement SCSI passthrough for virtio-blk

This patch implements SCSI passthrough for any virtio-blk device. The data on the virtio queue will only be modified for a SCSI command, so the normal I/O flow is unchanged.
Signed-off-by: Hannes Reinecke <[EMAIL PROTECTED]> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c index 4225109..46f03d2 100644 --- a/drivers/block/virtio_blk.c +++ b/drivers/block/virtio_blk.c @@ -35,6 +35,7 @@ struct virtblk_req struct list_head list; struct request *req; struct virtio_blk_outhdr out_hdr; + struct virtio_blk_inhdr in_hdr; u8 status; }; @@ -47,20 +48,29 @@ static void blk_done(struct virtqueue *vq) spin_lock_irqsave(&vblk->lock, flags); while ((vbr = vblk->vq->vq_ops->get_buf(vblk->vq, &len)) != NULL) { - int uptodate; + int error; + unsigned int bytes; switch (vbr->status) { case VIRTIO_BLK_S_OK: - uptodate = 1; + error = 0; break; case VIRTIO_BLK_S_UNSUPP: - uptodate = -ENOTTY; + error = -ENOTTY; break; default: - uptodate = 0; + error = -EIO; break; } - end_dequeued_request(vbr->req, uptodate); + if (blk_pc_request(vbr->req)) { + vbr->req->data_len = vbr->in_hdr.residual; + bytes = vbr->in_hdr.data_len; + vbr->req->sense_len = vbr->in_hdr.sense_len; + vbr->req->errors = vbr->in_hdr.status; + } else + bytes = blk_rq_bytes(vbr->req); + + __blk_end_request(vbr->req, error, bytes); list_del(&vbr->list); mempool_free(vbr, vblk->pool); } @@ -72,7 +82,7 @@ static void blk_done(struct virtqueue *vq) static bool do_req(struct request_queue *q, struct virtio_blk *vblk, struct request *req) { - unsigned long num, out, in; + unsigned long num, out = 0, in = 0; struct virtblk_req *vbr; vbr = mempool_alloc(vblk->pool, GFP_ATOMIC); @@ -99,20 +109,31 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk, /* This init could be done at vblk creation time */ sg_init_table(vblk->sg, VIRTIO_MAX_SG); - sg_set_buf(&vblk->sg[0], &vbr->out_hdr, sizeof(vbr->out_hdr)); - num = blk_rq_map_sg(q, vbr->req, vblk->sg+1); - sg_set_buf(&vblk->sg[num+1], &vbr->status, sizeof(vbr->status)); - - if (rq_data_dir(vbr->req) == WRITE) { - vbr->out_hdr.type |= VIRTIO_BLK_T_OUT; - out = 1 + num; - in = 1; - } else { - vbr->out_hdr.type |= 
VIRTIO_BLK_T_IN; - out = 1; - in = 1 + num; + sg_set_buf(&vblk->sg[out], &vbr->out_hdr, sizeof(vbr->out_hdr)); + out++; + if (blk_pc_request(vbr->req)) { + sg_set_buf(&vblk->sg[out], vbr->req->cmd, vbr->req->cmd_len); + out++; + } + num = blk_rq_map_sg(q, vbr->req, vblk->sg+out); + if (blk_pc_request(vbr->req)) { + sg_set_buf(&vblk->sg[num+out+in], vbr->req->sense, 96); + in++; + sg_set_buf(&vblk->sg[num+out+in], &vbr->in_hdr, + sizeof(vbr->in_hdr)); + in++; + } + sg_set_buf(&vblk->sg[num+out+in], &vbr->status, sizeof(vbr->status)); + in++;
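The do_req() changes above extend the virtqueue request layout: driver-readable buffers come first (out_hdr, then the CDB for SCSI requests, then write data), followed by device-writable ones (read data, the 96-byte sense buffer, in_hdr, and the status byte). A toy model of the resulting descriptor counts, under the assumption that data buffers keep following rq_data_dir() as in the original driver:

```c
/* Count the driver-to-device (out) and device-to-driver (in)
 * descriptors that the patched do_req() would queue. Simplified
 * model for illustration, not the kernel code itself. */
struct vq_layout {
    unsigned out;   /* driver-to-device descriptors */
    unsigned in;    /* device-to-driver descriptors */
};

static struct vq_layout virtblk_layout(int is_scsi, int is_write,
                                       unsigned data_sgs)
{
    struct vq_layout l = { .out = 1, .in = 1 }; /* out_hdr / status byte */

    if (is_scsi) {
        l.out += 1;     /* the CDB */
        l.in  += 2;     /* 96-byte sense buffer + in_hdr */
    }
    if (is_write)
        l.out += data_sgs;
    else
        l.in += data_sgs;
    return l;
}
```

This ordering matters because virtio requires all device-readable descriptors to precede the device-writable ones in a request chain, which is why the patch threads the running `out`/`in` counters through the sg_set_buf() calls.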