Re: [PATCH] virtio: a new vcpu watchdog driver

2023-07-31 Thread Hannes Reinecke

On 7/31/23 04:30, Xuan Zhuo wrote:

On Mon, 31 Jul 2023 09:25:12 +0800, "zhanghao1"  wrote:

A new virtio pci driver is added for listening to vcpus
inside the guest. Each vcpu creates a corresponding thread to
periodically send data to qemu's back-end watchdog device.
If a vCPU is in the stall state, data cannot be sent to the
back-end virtio device. As a result, the back-end device
can detect that the guest is in the stall state.

The driver is mainly used with the back-end watchdog device of qemu.

The qemu backend watchdog device is implemented as follows:
https://lore.kernel.org/qemu-devel/20230705081813.411526-1-zhangh...@kylinos.cn/
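
(For illustration: the mechanism described above boils down to one kthread
per vCPU that periodically queues a heartbeat buffer and kicks the device.
The sketch below is not the submitted driver and all names are made up;
each such thread would additionally be bound to its vCPU with kthread_bind().)

static int vwdt_pet_thread(void *data)
{
	struct vwdt_vcpu *v = data;	/* hypothetical per-vCPU state */
	struct scatterlist sg;

	while (!kthread_should_stop()) {
		sg_init_one(&sg, &v->cpu_id, sizeof(v->cpu_id));
		/* a stalled vCPU never gets here, so the host side stops
		 * seeing heartbeats on this queue */
		if (!virtqueue_add_outbuf(v->vq, &sg, 1, v, GFP_KERNEL))
			virtqueue_kick(v->vq);
		schedule_timeout_interruptible(msecs_to_jiffies(v->period_ms));
	}
	return 0;
}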



I think we need to introduce this new device to the virtio spec first.

And when having a watchdog driver: shouldn't you watch the virtio queue 
_itself_? There is no guarantee that the virtio queue makes forward 
progress ...


Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


Re: [RFC PATCH v2 48/48] sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)

2023-03-29 Thread Hannes Reinecke

On 3/29/23 16:13, David Howells wrote:

[!] Note: This is a work in progress.  At the moment, some things won't
 build if this patch is applied.  nvme, kcm, smc, tls.

Remove ->sendpage() and ->sendpage_locked().  sendmsg() with
MSG_SPLICE_PAGES should be used instead.  This allows multiple pages and
multipage folios to be passed through.

Signed-off-by: David Howells 
Acked-by: Marc Kleine-Budde  # for net/can
cc: "David S. Miller" 
cc: Eric Dumazet 
cc: Jakub Kicinski 
cc: Paolo Abeni 
cc: Jens Axboe 
cc: Matthew Wilcox 
cc: b...@vger.kernel.org
cc: d...@vger.kernel.org
cc: linux-...@lists.infradead.org
cc: linux-arm-...@vger.kernel.org
cc: linux-...@vger.kernel.org
cc: linux-cry...@vger.kernel.org
cc: linux-...@vger.kernel.org
cc: linux-h...@vger.kernel.org
cc: linux-ker...@vger.kernel.org
cc: linux-r...@vger.kernel.org
cc: linux-s...@vger.kernel.org
cc: linux-w...@vger.kernel.org
cc: linux-...@vger.kernel.org
cc: mp...@lists.linux.dev
cc: net...@vger.kernel.org
cc: rds-de...@oss.oracle.com
cc: tipc-discuss...@lists.sourceforge.net
cc: virtualization@lists.linux-foundation.org
---
  Documentation/networking/scaling.rst |   4 +-
  crypto/af_alg.c  |  29 --
  crypto/algif_aead.c  |  22 +
  crypto/algif_rng.c   |   2 -
  crypto/algif_skcipher.c  |  14 ---
  include/linux/net.h  |   8 --
  include/net/inet_common.h|   2 -
  include/net/sock.h   |   6 --
  net/appletalk/ddp.c  |   1 -
  net/atm/pvc.c|   1 -
  net/atm/svc.c|   1 -
  net/ax25/af_ax25.c   |   1 -
  net/caif/caif_socket.c   |   2 -
  net/can/bcm.c|   1 -
  net/can/isotp.c  |   1 -
  net/can/j1939/socket.c   |   1 -
  net/can/raw.c|   1 -
  net/core/sock.c  |  35 +--
  net/dccp/ipv4.c  |   1 -
  net/dccp/ipv6.c  |   1 -
  net/ieee802154/socket.c  |   2 -
  net/ipv4/af_inet.c   |  21 
  net/ipv4/tcp.c   |  34 ---
  net/ipv4/tcp_bpf.c   |  21 +---
  net/ipv4/tcp_ipv4.c  |   1 -
  net/ipv4/udp.c   |  22 -
  net/ipv4/udp_impl.h  |   2 -
  net/ipv4/udplite.c   |   1 -
  net/ipv6/af_inet6.c  |   3 -
  net/ipv6/raw.c   |   1 -
  net/ipv6/tcp_ipv6.c  |   1 -
  net/key/af_key.c |   1 -
  net/l2tp/l2tp_ip.c   |   1 -
  net/l2tp/l2tp_ip6.c  |   1 -
  net/llc/af_llc.c |   1 -
  net/mctp/af_mctp.c   |   1 -
  net/mptcp/protocol.c |   2 -
  net/netlink/af_netlink.c |   1 -
  net/netrom/af_netrom.c   |   1 -
  net/packet/af_packet.c   |   2 -
  net/phonet/socket.c  |   2 -
  net/qrtr/af_qrtr.c   |   1 -
  net/rds/af_rds.c |   1 -
  net/rose/af_rose.c   |   1 -
  net/rxrpc/af_rxrpc.c |   1 -
  net/sctp/protocol.c  |   1 -
  net/socket.c |  48 -
  net/tipc/socket.c|   3 -
  net/unix/af_unix.c   | 139 ---
  net/vmw_vsock/af_vsock.c |   3 -
  net/x25/af_x25.c |   1 -
  net/xdp/xsk.c|   1 -
  52 files changed, 9 insertions(+), 447 deletions(-)


Weelll ... what happens to consumers of kernel_sendpage()?
(Let's call them nvme ...)
Should they be moved over, too?

Or what is the general consensus here?

(And what do we do with TLS? It does have a ->sendpage() version, too ...)
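
(For reference, converting such a kernel_sendpage() call site would
presumably look roughly like the following under the proposed interface;
this is only a sketch, not code taken from the series:)

	/* before: rc = kernel_sendpage(sock, page, offset, len, flags); */
	struct bio_vec bvec;
	struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT };

	bvec_set_page(&bvec, page, len, offset);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
	rc = sock_sendmsg(sock, &msg);	/* pages are spliced, not copied */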

Cheers,

Hannes



Re: Virtio-net - add timeouts to control commands

2022-08-24 Thread Hannes Reinecke

On 8/24/22 11:42, Alvaro Karsz wrote:

Hi Hannes,


a) let the device do the timeout: pass in a timeout value with the
command, and allow the device to return an ETIMEDOUT error when the
timeout expires. Then it's up to the device to do the necessary timeout
handling; the server won't be involved at all (except for handling an
ETIMEDOUT error)



This won't work if the device crashes.



b) implement an 'abort' command. With that the server controls the
timeout, and is allowed to send an 'abort' command when the timeout
expires. That requires the device to be able to abort commands (which
not all devices are able to), but avoids having to implement a timeout
handling in the device.



I actually thought about this idea.
This may work, but you'll still have a few moments when the server
assumes that the command failed, and the network device assumes that
it succeeded.
So the server may still receive packets in an unexpected queue.


No. The server may only assume that the command failed once it gets the 
response for the abort command.
Before that the state of the command is undefined, and the server may 
not assume anything here.


And then we get into the fun topic of timing out aborts, which really 
can only be resolved if you have a fool-proof way of resetting the queue 
itself. But I guess virtio can do that (right?).




I am very much in favour of having timeouts for virtio commands; we've
had several massive customer escalations which could have been solved if
we were able to set the command timeout in the VM.
As this was for virtio-scsi/virtio-block I would advocate to have a
generic virtio command timeout, not a protocol-specific one.


This may be difficult to implement.
Especially when multiple commands may be queued at the same time, and
the device can handle the commands in any order.
We'll need to add identifiers for every command.

Why, but of course. You cannot assume in-order delivery of the 
completions; in fact, that's the whole _point_ of having a queue-based 
I/O command delivery method.



I'm actually referring here to the Linux kernel implementation of
virtnet control commands, in which the server spins for a response.


Sheesh. Spinning for a response is never a good idea, as this means 
you'll end up with a non-interruptible command in the guest (essentially 
an ioctl into the hypervisor).


And really, identifying the command isn't hard. Each command already has 
an identifier (namely the virtio ring index), so if in doubt you can 
always use that.
To be foolproof, though, you might want to add a 'real' identifier (like 
a 32 or 64 bit command tag), which would even allow you to catch 
uninitialized / completed commands. Tends to be quite important when 
implementing an 'abort' command, as the command referred to by the 
'abort' command might have been completed by the time the hypervisor 
processes the abort command.
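
(A sketch of that 'real identifier' idea; nothing below exists in
virtio-net today, the names are purely illustrative:)

struct ctrl_cmd {
	__le64 tag;			/* driver-generated, never reused */
	struct completion done;
	int status;
};

static atomic64_t ctrl_tag = ATOMIC64_INIT(0);

static void ctrl_cmd_init(struct ctrl_cmd *cmd)
{
	cmd->tag = cpu_to_le64(atomic64_inc_return(&ctrl_tag));
	init_completion(&cmd->done);
}

/* An abort request would carry the tag of the command to be aborted; if
 * the device has already completed that tag it simply answers "not
 * found", which is exactly the race described above. */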


Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

Re: Virtio-net - add timeouts to control commands

2022-08-24 Thread Hannes Reinecke

On 8/24/22 11:06, Jason Wang wrote:

On Wed, Aug 24, 2022 at 3:52 PM Alvaro Karsz  wrote:


I think that we should add a timeout to the control virtqueue commands.
If the hypervisor crashes while handling a control command, the guest
will spin forever.
This may not be necessary for a virtual environment, where both the
hypervisor and the guest OS run on the same bare metal, but this
is needed for a physical network device compatible with VirtIO.

(In these cases, the network device acts as the hypervisor, and the
server acts as
the guest OS).

The network device may fail to answer a control command, or may crash, leading
to a stall in the server.

My idea is to add a big enough timeout, to allow the slow devices to
complete the command.

I wrote a simple patch that returns false from virtnet_send_command in
case of timeouts.

The timeout approach introduces some serious problems in cases when
the network device does answer the control command, but after the
timeout.

* The device will think that the command succeeded, while the server won't.
This may be serious with the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command.
The server may receive packets in an unexpected queue.

* virtqueue_get_buf will return the previous response for the next
control command.

Addressing this case by adding a timeout  to the spec won't be easy,
since the network device and the server have different clocks, and the
server won't know when exactly the network device noticed the kick.

So maybe we should call virtnet_remove if we reach a timeout.


Or reset but can we simply use interrupt instead of the busy waiting?



There are two possible ways of handling this:
a) let the device do the timeout: pass in a timeout value with the 
command, and allow the device to return an ETIMEDOUT error when the 
timeout expires. Then it's up to the device to do the necessary timeout 
handling; the server won't be involved at all (except for handling an 
ETIMEDOUT error)
b) implement an 'abort' command. With that the server controls the 
timeout, and is allowed to send an 'abort' command when the timeout 
expires. That requires the device to be able to abort commands (which 
not all devices are able to), but avoids having to implement a timeout 
handling in the device.


We can actually specify both methods, and have configuration bits 
indicating which method is supported by the device.
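
(Purely hypothetical feature bits to illustrate the idea of specifying
both methods and letting the device advertise which one it supports;
neither bit exists in the spec:)

#define VIRTIO_NET_F_CTRL_ABORT		62	/* device can abort a pending ctrl command */
#define VIRTIO_NET_F_CTRL_TIMEOUT	63	/* device honours a per-command timeout */

	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_ABORT)) {
		/* driver owns the timeout: arm a timer and send an abort
		 * command referencing the original one when it expires */
	} else if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_TIMEOUT)) {
		/* device owns the timeout: pass the value with the command
		 * and handle an ETIMEDOUT-style completion status */
	}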


I am very much in favour of having timeouts for virtio commands; we've 
had several massive customer escalations which could have been solved if 
we were able to set the command timeout in the VM.
As this was for virtio-scsi/virtio-block I would advocate to have a 
generic virtio command timeout, not a protocol-specific one.


Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

Re: Virtio-net - add timeouts to control commands

2022-08-24 Thread Hannes Reinecke

On 8/24/22 11:16, Alvaro Karsz wrote:

Hi Jason,


Or reset but can we simply use interrupt instead of the busy waiting?


I agree that timeouts are not needed using interrupts.


Quite the contrary.
There is no guarantee that a completion for a given command will be 
signaled at all; if the completion is never posted to the virtio queue 
the VM will spin forever.



This can be done, but seems like a big change.
All functions calling virtnet_send_command expect to get a response immediately.
This may be difficult, since some of the commands are sent in the
probe function.


This is a simplistic approach. _Any_ command sent via virtio will have 
to have some sort of handling in the host, so an immediate response is 
not guaranteed.

Especially virtio-block commands will _not_ be handled immediately.
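
(A sketch of what an interrupt-driven wait with a timeout could look like
instead of the current busy-wait; the names are illustrative, this is not
the existing virtnet_send_command():)

static void ctrl_vq_done(struct virtqueue *vq)
{
	struct my_ctrl *ctrl = vq->vdev->priv;	/* hypothetical driver state */

	complete(&ctrl->done);
}

	/* submission side */
	reinit_completion(&ctrl->done);
	if (virtqueue_add_sgs(vq, sgs, out_num, in_num, ctrl, GFP_ATOMIC))
		return false;
	virtqueue_kick(vq);

	if (!wait_for_completion_timeout(&ctrl->done, 30 * HZ))
		return false;		/* device never answered */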

Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

Re: [RFC] virtio_scsi: to poll and kick the virtqueue in timeout handler

2021-05-25 Thread Hannes Reinecke

On 5/25/21 6:47 PM, Stefan Hajnoczi wrote:

On Mon, May 24, 2021 at 11:33:33PM -0700, Dongli Zhang wrote:

On 5/24/21 6:24 AM, Stefan Hajnoczi wrote:

On Sun, May 23, 2021 at 09:39:51AM +0200, Hannes Reinecke wrote:

On 5/23/21 8:38 AM, Dongli Zhang wrote:

This RFC is to trigger the discussion about polling and kicking the
virtqueue on purpose in the virtio-scsi timeout handler.

The virtio-scsi relies on the virtio vring shared between VM and host.
The VM side produces requests to vring and kicks the virtqueue, while the
host side produces responses to vring and interrupts the VM side.

By default the virtio-scsi handler depends on the host timeout handler
by BLK_EH_RESET_TIMER to give host a chance to perform EH.

However, this is not helpful for the case that the responses are available
on vring but the notification from host to VM is lost.


How can this happen?
If responses are lost the communication between VM and host is broken, and
we should rather reset the virtio rings themselves.


I agree. In principle it's fine to poll the virtqueue at any time, but I
don't understand the failure scenario here. It's not clear to me why the
device-to-driver vq notification could be lost.



One example is the CPU hotplug issue before the commit bf0beec0607d ("blk-mq:
drain I/O when all CPUs in a hctx are offline") was available. The issue is
equivalent to a loss of interrupt. Without the CPU hotplug fix, while the NVMe driver
relies on the timeout handler to complete inflight IO requests, the PV
virtio-scsi may hang permanently.

In addition, as virtio/vhost/QEMU are complex software, we are not able to
guarantee there is no further loss-of-interrupt/kick issue in the future. It is
really painful if we encounter such an issue in a production environment.


Any number of hardware or software bugs might exist that we don't know
about, yet we don't pre-emptively add workarounds for them because where
do you draw the line?

I checked other SCSI/block drivers and found it's rare to poll in the
timeout function so there does not seem to be a consensus that it's
useful to do this.


Not only this; it's downright dangerous attempting to do that in SCSI.
In SCSI we don't have the fixed lifetime guarantees that NVMe has, so there 
will be a race condition between timeout and command completion.
Plus there is no interface in SCSI allowing one to 'poll' for completions in 
a meaningful manner.



That said, it's technically fine to do it, the virtqueue APIs are there
and can be used like this. So if you and others think this is necessary,
then it's a pretty small change and I'm not against merging a patch like
this.

I would rather _not_ put more functionality into the virtio_scsi timeout 
handler; this only serves to assume that the timeout handler has some 
functionality in virtio.
Which it patently hasn't, as the prime reason for a timeout handler is 
to _abort_ a command, which we can't do on virtio.
Well, we can on virtio, but qemu as the main user will re-route the I/O 
from virtio into doing async I/O, and there is no way we can abort 
outstanding asynchronous I/O.

Or any other ioctl, for that matter.

Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


Re: [RFC] virtio_scsi: to poll and kick the virtqueue in timeout handler

2021-05-23 Thread Hannes Reinecke

On 5/23/21 8:38 AM, Dongli Zhang wrote:

This RFC is to trigger the discussion about polling and kicking the
virtqueue on purpose in the virtio-scsi timeout handler.

The virtio-scsi relies on the virtio vring shared between VM and host.
The VM side produces requests to vring and kicks the virtqueue, while the
host side produces responses to vring and interrupts the VM side.

By default the virtio-scsi handler depends on the host timeout handler
by BLK_EH_RESET_TIMER to give host a chance to perform EH.

However, this is not helpful for the case that the responses are available
on vring but the notification from host to VM is lost.


How can this happen?
If responses are lost the communication between VM and host is broken, 
and we should rather reset the virtio rings themselves.


Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

Re: [PATCH RFC v2 02/24] scsi: allocate separate queue for reserved commands

2020-04-23 Thread Hannes Reinecke

On 4/23/20 4:13 PM, John Garry wrote:

On 07/04/2020 17:30, Christoph Hellwig wrote:

On Tue, Apr 07, 2020 at 04:00:10PM +0200, Hannes Reinecke wrote:

My concern is this:

struct scsi_device *scsi_get_host_dev(struct Scsi_Host *shost)
{
[ .. ]
starget = scsi_alloc_target(&shost->shost_gendev, 0, 
shost->this_id);

[ .. ]

and we have typically:

drivers/scsi/hisi_sas/hisi_sas_v3_hw.c: .this_id    = -1,

It's _very_ uncommon to have a negative number as the SCSI target 
device; in fact, it _is_ an unsigned int already.

But alright, I'll give it a go; let's see what I'll end up with.


But this shouldn't be exposed anywhere.  And I prefer that over having
magic requests/scsi_cmnd that do not have a valid ->device pointer.
.



(just looking at this again)

Hi Christoph,

So how would this look added in scsi_lib.c:

struct scsi_cmnd *scsi_get_reserved_cmd(struct Scsi_Host *shost)
{
 struct scsi_cmnd *scmd;
 struct request *rq;
 struct scsi_device *sdev = scsi_get_host_dev(shost);

 if (!sdev)
     return NULL;

 rq = blk_mq_alloc_request(sdev->request_queue,
   REQ_OP_DRV_OUT | REQ_NOWAIT,
   BLK_MQ_REQ_RESERVED);
 if (IS_ERR(rq)) // fix tidy-up
     return NULL;
 WARN_ON(rq->tag == -1);
 scmd = blk_mq_rq_to_pdu(rq);
 scmd->request = rq;
 scmd->device = sdev;

 return scmd;
}
EXPORT_SYMBOL_GPL(scsi_get_reserved_cmd);

void scsi_put_reserved_cmd(struct scsi_cmnd *scmd)
{
 struct request *rq = blk_mq_rq_from_pdu(scmd);

 if (blk_mq_rq_is_reserved(rq)) {
     struct scsi_device *sdev = scmd->device;
     blk_mq_free_request(rq);
     scsi_free_host_dev(sdev);
 }
}
EXPORT_SYMBOL_GPL(scsi_put_reserved_cmd);

Not sure if we want a static scsi_device per host, or alloc and free 
dynamically.


(@Hannes, I also have some proper patches for libsas if you want to add it)


Hold your horses.
I'm currently preparing a patchset implementing things by improving the 
current scsi_get_host_dev() etc.


RFC should be ready by the end of the week.

Cheers,

Hannes
--
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

Re: [PATCH RFC v2 02/24] scsi: allocate separate queue for reserved commands

2020-04-07 Thread Hannes Reinecke

On 4/7/20 4:35 PM, John Garry wrote:

On 07/04/2020 15:00, Hannes Reinecke wrote:

On 4/7/20 1:54 PM, John Garry wrote:

On 06/04/2020 10:05, Hannes Reinecke wrote:

[ .. ]

This would be okay if 'this_id' would have been defined by the driver;
sadly, most drivers which are affected here do set 'this_id' to -1.
So we wouldn't have a nice target ID to allocate the device from, let
alone the problem that we would have to emulate a complete scsi device
with all required minimal command support etc.
And I'm not quite sure how well that would play with the existing SCSI
host template; the device we'll be allocating would have basically
nothing in common with the 'normal' SCSI devices.

What we could do, though, is to try it the other way round:
Lift the request queue from scsi_get_host_dev() into the scsi host
itself, so that scsi_get_host_dev() can use that queue, but we also
would be able to use it without a SCSI device attached.


wouldn't that limit 1x scsi device per host, not that I know if any 
more would ever be required? But it does still seem better to use the 
request queue in the scsi device.



My concern is this:

struct scsi_device *scsi_get_host_dev(struct Scsi_Host *shost)
{
 [ .. ]
 starget = scsi_alloc_target(&shost->shost_gendev, 0, 
shost->this_id);

 [ .. ]

and we have typically:

drivers/scsi/hisi_sas/hisi_sas_v3_hw.c: .this_id    = -1,

It's _very_ uncommon to have a negative number as the SCSI target 
device; in fact, it _is_ an unsigned int already.




FWIW, the only other driver (gdth) which I see uses this API has this_id 
= -1 in the scsi host template.



But alright, I'll give it a go; let's see what I'll end up with.


note: If we want a fixed scsi_device per host, calling 
scsi_mq_setup_tags() -> scsi_get_host_dev() will fail as shost state is 
not running. Maybe we need to juggle some things there to provide a 
generic solution.


It might even get worse, as during device setup things like 
'slave_alloc' etc. are getting called, which have a fair chance of getting 
confused by non-existing devices.
Cf. qla2xxx: qla2xxx_slave_alloc() is calling starget_to_rport(), which 
will get us a nice oops when accessing a target which is _not_ the child 
of a fc remote port.
And this is why I'm not utterly keen on this approach; auditing all 
these callbacks is _not_ fun.


Cheers,

Hannes
--
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

Re: [PATCH RFC v2 02/24] scsi: allocate separate queue for reserved commands

2020-04-07 Thread Hannes Reinecke

On 4/7/20 1:54 PM, John Garry wrote:

On 06/04/2020 10:05, Hannes Reinecke wrote:

On 3/11/20 7:22 AM, Christoph Hellwig wrote:

On Tue, Mar 10, 2020 at 09:08:56PM +, John Garry wrote:

On 10/03/2020 18:32, Christoph Hellwig wrote:

On Wed, Mar 11, 2020 at 12:25:28AM +0800, John Garry wrote:

From: Hannes Reinecke 

Allocate a separate 'reserved_cmd_q' for sending reserved commands.


Why?  Reserved command specifically are not in any way tied to queues.
.



So the v1 series used a combination of the sdev queue and the per-host
reserved_cmd_q. Back then you questioned using the sdev queue for 
virtio
scsi, and the unconfirmed conclusion was to use a common per-host q. 
This is

the best link I can find now:

https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg83177.html


That was just a question on why virtio uses the per-device tags, which
didn't look like it made any sense.  What I'm worried about here is
mixing up the concept of reserved tags in the tagset, and queues to use
them.  Note that we already have the scsi_get_host_dev to allocate
a scsi_device and thus a request_queue for the host itself.  That seems
like the better interface to use a tag for a host wide command vs
introducing a parallel path.


Thinking about it some more, I don't think that scsi_get_host_dev() is
the best way of handling it.
Problem is that it'll create a new scsi_device with ,
which will then show up via eg 'lsscsi'.


are you sure? Doesn't this function just allocate the sdev, but do 
nothing with it, like probing it?


I bludgeoned it in here for PoC:

https://github.com/hisilicon/kernel-dev/commit/ef0ae8540811e32776f64a5b42bd76cbed17ba47 



And then still:

john@ubuntu:~$ lsscsi
[0:0:0:0] disk SEAGATE  ST2000NM0045  N004  /dev/sda
[0:0:1:0] disk SEAGATE  ST2000NM0045  N004  /dev/sdb
[0:0:2:0] disk ATASAMSUNG HM320JI  0_01  /dev/sdc
[0:0:3:0] disk SEAGATE  ST1000NM0023  0006  /dev/sdd
[0:0:4:0] enclosu HUAWEIExpander 12Gx16  128-
john@ubuntu:~$

Some proper plumbing would be needed, though.


This would be okay if 'this_id' would have been defined by the driver;
sadly, most drivers which are affected here do set 'this_id' to -1.
So we wouldn't have a nice target ID to allocate the device from, let
alone the problem that we would have to emulate a complete scsi device
with all required minimal command support etc.
And I'm not quite sure how well that would play with the existing SCSI
host template; the device we'll be allocating would have basically
nothing in common with the 'normal' SCSI devices.

What we could do, though, is to try it the other way round:
Lift the request queue from scsi_get_host_dev() into the scsi host
itself, so that scsi_get_host_dev() can use that queue, but we also
would be able to use it without a SCSI device attached.


wouldn't that limit 1x scsi device per host, not that I know if any more 
would ever be required? But it does still seem better to use the request 
queue in the scsi device.



My concern is this:

struct scsi_device *scsi_get_host_dev(struct Scsi_Host *shost)
{
[ .. ]
starget = scsi_alloc_target(&shost->shost_gendev, 0, shost->this_id);
[ .. ]

and we have typically:

drivers/scsi/hisi_sas/hisi_sas_v3_hw.c: .this_id= -1,

It's _very_ uncommon to have a negative number as the SCSI target 
device; in fact, it _is_ an unsigned int already.


But alright, I'll give it a go; let's see what I'll end up with.

Cheers,

Hannes
--
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

Re: [PATCH RFC v2 02/24] scsi: allocate separate queue for reserved commands

2020-04-06 Thread Hannes Reinecke
On 3/11/20 7:22 AM, Christoph Hellwig wrote:
> On Tue, Mar 10, 2020 at 09:08:56PM +, John Garry wrote:
>> On 10/03/2020 18:32, Christoph Hellwig wrote:
>>> On Wed, Mar 11, 2020 at 12:25:28AM +0800, John Garry wrote:
>>>> From: Hannes Reinecke 
>>>>
>>>> Allocate a separate 'reserved_cmd_q' for sending reserved commands.
>>>
>>> Why?  Reserved command specifically are not in any way tied to queues.
>>> .
>>>
>>
>> So the v1 series used a combination of the sdev queue and the per-host
>> reserved_cmd_q. Back then you questioned using the sdev queue for virtio
>> scsi, and the unconfirmed conclusion was to use a common per-host q. This is
>> the best link I can find now:
>>
>> https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg83177.html
> 
> That was just a question on why virtio uses the per-device tags, which
> didn't look like it made any sense.  What I'm worried about here is
> mixing up the concept of reserved tags in the tagset, and queues to use
> them.  Note that we already have the scsi_get_host_dev to allocate
> a scsi_device and thus a request_queue for the host itself.  That seems
> like the better interface to use a tag for a host wide command vs
> introducing a parallel path.
> 
Thinking about it some more, I don't think that scsi_get_host_dev() is
the best way of handling it.
Problem is that it'll create a new scsi_device with ,
which will then show up via eg 'lsscsi'.
This would be okay if 'this_id' would have been defined by the driver;
sadly, most drivers which are affected here do set 'this_id' to -1.
So we wouldn't have a nice target ID to allocate the device from, let
alone the problem that we would have to emulate a complete scsi device
with all required minimal command support etc.
And I'm not quite sure how well that would play with the existing SCSI
host template; the device we'll be allocating would have basically
nothing in common with the 'normal' SCSI devices.

What we could do, though, is to try it the other way round:
Lift the request queue from scsi_get_host_dev() into the scsi host
itself, so that scsi_get_host_dev() can use that queue, but we also
would be able to use it without a SCSI device attached.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

Re: [PATCH RFC v2 12/24] hpsa: use reserved commands

2020-03-17 Thread Hannes Reinecke

On 3/17/20 10:38 AM, John Garry wrote:

On 11/03/2020 08:10, Ming Lei wrote:

@@ -5803,6 +5803,7 @@ static int hpsa_scsi_host_alloc(struct ctlr_info *h)

  sh->max_lun = HPSA_MAX_LUN;
  sh->max_id = HPSA_MAX_LUN;
  sh->can_queue = h->nr_cmds - HPSA_NRESERVED_CMDS;
+    sh->nr_reserved_cmds = HPSA_NRESERVED_CMDS;

Now .nr_reserved_cmds has been passed to blk-mq, you need to increase
sh->can_queue to h->nr_cmds, because .can_queue is the whole queue depth
(include the part of reserved tags), otherwise, IO tags will be
decreased.



Sounds correct.

I will be having a look at the patchset; I thought I did a patch to 
modify .can_queue so that it would cover only the usable tags, not the 
reserved ones.


Cheers,

Hannes
--
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

Re: [PATCH RFC v2 02/24] scsi: allocate separate queue for reserved commands

2020-03-10 Thread Hannes Reinecke
On 3/11/20 7:22 AM, Christoph Hellwig wrote:
> On Tue, Mar 10, 2020 at 09:08:56PM +, John Garry wrote:
>> On 10/03/2020 18:32, Christoph Hellwig wrote:
>>> On Wed, Mar 11, 2020 at 12:25:28AM +0800, John Garry wrote:
>>>> From: Hannes Reinecke 
>>>>
>>>> Allocate a separate 'reserved_cmd_q' for sending reserved commands.
>>>
>>> Why?  Reserved command specifically are not in any way tied to queues.
>>> .
>>>
>>
>> So the v1 series used a combination of the sdev queue and the per-host
>> reserved_cmd_q. Back then you questioned using the sdev queue for virtio
>> scsi, and the unconfirmed conclusion was to use a common per-host q. This is
>> the best link I can find now:
>>
>> https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg83177.html
> 
> That was just a question on why virtio uses the per-device tags, which
> didn't look like it made any sense.  What I'm worried about here is
> mixing up the concept of reserved tags in the tagset, and queues to use
> them.  Note that we already have the scsi_get_host_dev to allocate
> a scsi_device and thus a request_queue for the host itself.  That seems
> like the better interface to use a tag for a host wide command vs
> introducing a parallel path.
> 
Ah. Right.
Will be looking into that and converting the patchset over to it.

And the problem of the separate queue is the fact that I'll need a queue
to reserve tags from; trying to allocate a tag directly from the bitmap
turns out to be major surgery in the block layer with no immediate gain.
And I can't use per-device queues as for some drivers the reserved
commands are used to query the HBA itself to figure out how many devices
are present.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

Re: [PATCH RFC v2 01/24] scsi: add 'nr_reserved_cmds' field to the SCSI host template

2020-03-10 Thread Hannes Reinecke
On 3/11/20 12:08 AM, Ming Lei wrote:
> On Wed, Mar 11, 2020 at 12:25:27AM +0800, John Garry wrote:
>> From: Hannes Reinecke 
>>
>> Add a new field 'nr_reserved_cmds' to the SCSI host template to
>> instruct the block layer to set aside a tag space for reserved
>> commands.
>>
>> Signed-off-by: Hannes Reinecke 
>> ---
>>  drivers/scsi/scsi_lib.c  | 1 +
>>  include/scsi/scsi_host.h | 6 ++
>>  2 files changed, 7 insertions(+)
>>
>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>> index 610ee41fa54c..2967325df7a0 100644
>> --- a/drivers/scsi/scsi_lib.c
>> +++ b/drivers/scsi/scsi_lib.c
>> @@ -1896,6 +1896,7 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
>>  shost->tag_set.ops = &scsi_mq_ops_no_commit;
>>  shost->tag_set.nr_hw_queues = shost->nr_hw_queues ? : 1;
>>  shost->tag_set.queue_depth = shost->can_queue;
>> +shost->tag_set.reserved_tags = shost->nr_reserved_cmds;
> 
> You reserve tags for special usage, meanwhile the whole queue depth
> isn't increased; that means the tags for IO requests are decreased given
> reserved tags can't be used for IO, so IO performance may be affected.
> 
> If not the case, please explain a bit in commit log.
> 
The overall idea of this patchset is to fold _existing_ management
command handling into using the blk-mq bitmap.
Currently, quite a lot of drivers are using management commands
internally, which typically use the same hardware tag pool (ie they are
being allocated from the same hardware resources) as the 'normal' I/O
commands.
But as they are using the same tag pool these drivers already decrement
the available number of commands; check e.g. hpsa:

static int hpsa_scsi_host_alloc(struct ctlr_info *h)
{
struct Scsi_Host *sh;

sh = scsi_host_alloc(&hpsa_driver_template, sizeof(h));
if (sh == NULL) {
dev_err(&h->pdev->dev, "scsi_host_alloc failed\n");
return -ENOMEM;
}

sh->io_port = 0;
sh->n_io_port = 0;
sh->this_id = -1;
sh->max_channel = 3;
sh->max_cmd_len = MAX_COMMAND_SIZE;
sh->max_lun = HPSA_MAX_LUN;
sh->max_id = HPSA_MAX_LUN;
sh->can_queue = h->nr_cmds - HPSA_NRESERVED_CMDS;
sh->cmd_per_lun = sh->can_queue;

So the idea of this patchset is to convert existing use-cases; seeing
that they already reserve commands internally performance will not be
affected.
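
(Concretely, for hpsa the conversion amounts to something like the
following; whether .can_queue should then cover the full depth or only
the usable part is exactly the open question from the comment above:)

	/* hpsa already holds back HPSA_NRESERVED_CMDS internally, so
	 * exposing that via nr_reserved_cmds keeps the number of tags
	 * usable for regular I/O unchanged */
	sh->can_queue        = h->nr_cmds - HPSA_NRESERVED_CMDS;
	sh->nr_reserved_cmds = HPSA_NRESERVED_CMDS;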

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

Re: [PATCH] virtio-blk: remove VIRTIO_BLK_F_SCSI support

2019-12-13 Thread Hannes Reinecke

On 12/12/19 5:37 PM, Christoph Hellwig wrote:

Since the need for a special flag to support SCSI passthrough on a
block device was added in May 2017 the SCSI passthrough support in
virtio-blk has been disabled.  It has always been a bad idea
(just ask the original author..) and we have virtio-scsi for proper
passthrough.  The feature also never made it into the virtio 1.0
or later specifications.

Signed-off-by: Christoph Hellwig 
---
  arch/powerpc/configs/guest.config |   1 -
  drivers/block/Kconfig |  10 ---
  drivers/block/virtio_blk.c| 115 +-
  3 files changed, 1 insertion(+), 125 deletions(-)


I'm all for it.

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

Re: [Qemu-devel] [RFC 0/3] VirtIO RDMA

2019-04-19 Thread Hannes Reinecke

On 4/15/19 12:35 PM, Yuval Shaia wrote:

On Thu, Apr 11, 2019 at 07:02:15PM +0200, Cornelia Huck wrote:

On Thu, 11 Apr 2019 14:01:54 +0300
Yuval Shaia  wrote:


Data center backends use more and more RDMA or RoCE devices and more and
more software runs in virtualized environments.
There is a need for a standard to enable RDMA/RoCE on Virtual Machines.

Virtio is the optimal solution since it is the de-facto para-virtualization
technology and also because the Virtio specification
allows Hardware Vendors to support Virtio protocol natively in order to
achieve bare metal performance.

This RFC is an effort to address challenges in defining the RDMA/RoCE
Virtio Specification and a look forward on possible implementation
techniques.

Open issues/Todo list:
List is huge, this is only start point of the project.
Anyway, here is one example of item in the list:
- Multi VirtQ: Every QP has two rings and every CQ has one. This means that
   in order to support for example 32K QPs we will need 64K VirtQ. Not sure
   that this is reasonable so one option is to have one for all and
   multiplex the traffic on it. This is not a good approach as by design it
   introduces potential starvation. Another approach would be multiple
   queues and round-robin (for example) between them.

Typically there will be a one-to-one mapping between QPs and CPUs (on 
the guest). So while one would need to be prepared to support quite some 
QPs, the expectation is that the actual number of QPs used will be 
rather low.
In a similar vein, multiplexing QPs would be defeating the purpose, as 
the overall idea was to have _independent_ QPs to enhance parallelism.



Expectations from this posting:
In general, any comment is welcome, starting from hey, drop this as it is a
very bad idea, to yeah, go ahead, we really want it.
The idea here is that since it is not a minor effort I first want to know if
there is some sort of interest in the community for such a device.


My first reaction is: Sounds sensible, but it would be good to have a
spec for this :)

You'll need a spec if you want this to go forward anyway, so at least a
sketch would be good to answer questions such as how many virtqueues
you use for which purpose, what is actually put on the virtqueues,
whether there are negotiable features, and what the expectations for
the device and the driver are. It also makes it easier to understand
how this is supposed to work in practice.

If folks agree that this sounds useful, the next step would be to
reserve an id for the device type.


Thanks for the tips, will surely do that; it is just that first I wanted to make
sure there is a use case here.

Waiting for any feedback from the community.

I really do like the idea; in fact, it saved me from coding a similar 
thing myself :-)


However, I'm still curious about the overall intent of this driver. 
Where would the I/O be routed _to_ ?

It's nice that we have a virtualized driver, but this driver is
intended to do I/O (even if it doesn't _do_ any I/O ATM :-)
And this I/O needs to be send to (and possibly received from)
something.

So what exactly is this something?
An existing piece of HW on the host?
If so, wouldn't it be more efficient to use vfio, either by using SR-IOV 
or by using virtio-mdev?


Another guest?
If so, how would we route the I/O from one guest to the other?
Shared memory? Implementing a full-blown RDMA switch in qemu?

Oh, and I would _love_ to have a discussion about this at KVM Forum.
Maybe I'll manage to whip up guest-to-guest RDMA connection using 
ivshmem ... let's see.


Cheers,

Hannes
--
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de  +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

Re: Question about eliminating virtio-scsi 30s timeout and hot-unplug

2018-01-10 Thread Hannes Reinecke
On 12/21/2017 08:29 PM, Jason Behmer via Virtualization wrote:
> Hi Paolo,
> I had a question about this patch that eliminates virtio-scsi timeout on
> IOs - https://patchwork.kernel.org/patch/9802047/.  How does this work
> when we have IOs outstanding to a device that is then hot-unplugged. 
> I'm under the impression that those IOs will never get returned with any
> status (please correct me if I'm wrong), and we will then end up waiting
> on them forever with this patch.
> 
It's no different from the virtio-block path; the I/O is pending in the
qemu process. When a device is hot-unplugged the I/O will be returned to
the qemu process, which then will return the final status to the guest.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: [PATCH 09/12] virtio_blk: Use blk_rq_is_scsi()

2017-08-18 Thread Hannes Reinecke
On 08/18/2017 01:23 AM, Bart Van Assche wrote:
> This patch does not change any functionality.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Michael S. Tsirkin 
> Cc: Jason Wang 
> Cc: virtualization@lists.linux-foundation.org
> ---
>  drivers/block/virtio_blk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 1498b899a593..0ba1eb911a42 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -265,7 +265,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx 
> *hctx,
>   }
>  
>   spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
> - if (req_op(req) == REQ_OP_SCSI_IN || req_op(req) == REQ_OP_SCSI_OUT)
> + if (blk_rq_is_scsi(req))
>   err = virtblk_add_req_scsi(vblk->vqs[qid].vq, vbr, vbr->sg, 
> num);
>   else
>   err = virtblk_add_req(vblk->vqs[qid].vq, vbr, vbr->sg, num);
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: [PATCH 0/2] virtio-scsi: Implement FC_HOST feature

2017-01-16 Thread Hannes Reinecke
On 01/16/2017 05:04 PM, Fam Zheng wrote:
> This series implements the proposed fc_host feature of virtio-scsi.
> 
> The first patch updates the data structure changes according to the spec
> proposal; the second patch actually implements the operations.
> 
> Fam Zheng (2):
>   virtio_scsi: Add fc_host definitions
>   virtio_scsi: Implement fc_host
> 
>  drivers/scsi/virtio_scsi.c   | 55 
> +++-
>  include/uapi/linux/virtio_scsi.h |  6 +
>  2 files changed, 60 insertions(+), 1 deletion(-)
> 
Hmm.

How it this supposed to work?
You do export the WWPN/WWNN of the associated host to the guest (nb:
will get interesting for non-NPIV setups ...), but virtio scsi will
still do a LUN remapping.
I.e. the LUNs you see on the host will be different from the LUNs
presented to the guest.

This makes reliable operations very tricky.
Plus you don't _actually_ expose the FC host, but rather the WWPN of the
host presenting the LUN.
So how do you handle LUNs from different FC hosts on the guest?
From what I've seen virtio will either present you with one host
per LUN (so you'll end up with different SCSI hosts on the guest each
having the _same_ WWPN/WWNN), or a single host presenting all LUNs,
making the WWPN setting ... interesting.

Overall, I'm not overly happy with this approach.
You already added WWPN ids to the virtio transport, so why didn't you
update the LUN field, too, to avoid this ominous LUN remapping?
And we really should make sure to have a single FC host in the guest
presenting all LUNs.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


Re: [PATCH v10] virtio-net: add Max MTU configuration field

2016-09-13 Thread Hannes Reinecke
On 09/13/2016 05:18 PM, Aaron Conole wrote:
> It is helpful for a host to indicate its MTU to be set on guest NICs
> other than the assumed 1500 byte value.  This helps in situations where
> the host network is using Jumbo Frames, or aiding in PMTU discovery by
> configuring a homogeneous network.  It is also helpful for sizing receive
> buffers correctly.
> 
> The change adds a new field to configuration area of network
> devices.  It will be used to pass a maximum MTU from the device to
> the driver.  This will be used by the driver as a maximum value for
> packet sizes during transmission, without segmentation offloading.
> 
> In addition, in order to support backward and forward compatibility,
> we introduce a new feature bit called VIRTIO_NET_F_MTU.
> 
> VIRTIO-152
> Signed-off-by: Aaron Conole 
> Cc: Victor Kaplansky 
> Acked-by: Michael S. Tsirkin 
> Acked-by: Maxime Coquelin 
> ---
Very good. I'm needing this for my FCoE over virtio work.

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


Re: req->nr_phys_segments > queue_max_segments (was Re: kernel BUG at drivers/block/virtio_blk.c:172!)

2015-10-01 Thread Hannes Reinecke
On 10/01/2015 11:00 AM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 03:10:14AM +0200, Thomas D. wrote:
>> Hi,
>>
>> I have a virtual machine which fails to boot linux-4.1.8 while mounting
>> file systems:
>>
>>> * Mounting local filesystem ...
>>> [ cut here ]
>>> kernel BUG at drivers/block/virtio_blk.c:172!
>>> invalid opcode: 000 [#1] SMP
>>> Modules linked in: pcspkr psmouse dm_log_userspace virtio_net e1000 fuse 
>>> nfs lockd grace sunrpc fscache dm_snapshot dm_bufio dm_mirror 
>>> dm_region_hash dm_log usbhid usb_storage sr_mod cdrom
>>> CPU: 7 PIDL 2254 Comm: dmcrypt_write Not tainted 4.1.8-gentoo #1
>>> Hardware name: Red Hat KVM, BIOS seabios-1.7.5-8.el7 04/01/2014
>>> task: 88061fb7 ti: 88061ff3 task.ti: 88061ff3
>>> RIP: 0010:[] [] 
>>> virtio_queue_rq+0x210/0x2b0
>>> RSP: 0018:88061ff33ba8 EFLAGS: 00010202
>>> RAX: 00b1 RBX: 88061fb2fc00 RCX: 88061ff33c30
>>> RDX: 0008 RSI: 88061ff33c50 RDI: 88061fb2fc00
>>> RBP: 88061ff33bf8 R08: 88061eef3540 R09: 88061ff33c30
>>> R10:  R11: 00af R12: 
>>> R13: 88061eef3540 R14: 88061eef3540 R15: 880622c7ca80
>>> FS:  () GS:88063fdc() knlGS:
>>> CS:  0010 DS:  ES:  CR0: 80050033
>>> CR2: 01ffe468 CR3: bb343000 CR4: 001406e0
>>> Stack:
>>>  880622d4c478  88061ff33bd8 88061fb2f
>>>  0001 88061fb2fc00 88061ff33c30 0
>>>  88061eef3540  88061ff33c98 b43eb
>>>
>>> Call Trace:
>>>  [] __blk_mq_run_hw_queue+0x1d0/0x370
>>>  [] blk_mq_run_hw_queue+0x95/0xb0
>>>  [] blk_mq_flush_plug_list+0x129/0x140
>>>  [] blk_finish_plug+0x18/0x50
>>>  [] dmcrypt_write+0x1da/0x1f0
>>>  [] ? wake_up_state+0x20/0x20
>>>  [] ? crypt_iv_lmk_dtr+0x60/0x60
>>>  [] kthread_create_on_node+0x180/0x180
>>>  [] ret_from_fork+0x42/0x70
>>>  [] ? kthread_create_on_node+0x180/0x180
>>> Code: 00  41 c7 85 78 01 00 00 08 00 00 00 49 c7 85 80 01 00 00 00 00 
>>> 00 00 41 89 85 7c 01 00 00 e9 93 fe ff ff 66 0f 1f 44 00 00 <0f> 0b 66 0f 
>>> 1f 44 00 00 49 8b 87 b0 00 00 00 41 83 e6 ef 4a 8b
>>> RIP [] virtio_queue_rq+0x210/0x2b0
>>>  RSP: 
>>> ---[ end trace 8078357c459d5fc0 ]---
> 
> 
> So this BUG_ON is from 1cf7e9c68fe84248174e998922b39e508375e7c1.
>   commit 1cf7e9c68fe84248174e998922b39e508375e7c1
>   Author: Jens Axboe 
>   Date:   Fri Nov 1 10:52:52 2013 -0600
> 
>   virtio_blk: blk-mq support
> 
> 
>   BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
> 
> 
> On probe, we do
> /* We can handle whatever the host told us to handle. */
> blk_queue_max_segments(q, vblk->sg_elems-2);
> 
> 
> To debug this,
> maybe you can print out sg_elems at init time and when this fails,
> to make sure some kind of memory corruption
> does not change sg_elems after initialization?
> 
> 
> Jens, how may we get more segments than blk_queue_max_segments?
> Is driver expected to validate and drop such requests?
> 
Whee! I'm not alone anymore!

I have seen similar issues even on non-mq systems; occasionally
I'm hitting this bug in drivers/scsi/scsi_lib.c:scsi_init_io()

count = blk_rq_map_sg(req->q, req, sdb->table.sgl);
BUG_ON(count > sdb->table.nents);

There are actually two problems here:
The one is that blk_rq_map_sg() requires a table (ie the last
argument), but doesn't have any indication of how large the
table is.
So one needs to check if the returned number of mapped sg
elements exceeds the number of elements in the table.
If so we already _have_ a memory overflow, and the only
thing we can do is sit in a corner and cry.
This really would need to be fixed up, eg by adding
another argument for the table size.
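
(A sketch of that suggested interface change; blk_rq_map_sg_checked() is
hypothetical, and the pre-check is deliberately conservative since
merging can only reduce the number of mapped elements:)

static int blk_rq_map_sg_checked(struct request_queue *q, struct request *rq,
				 struct scatterlist *sglist, unsigned int nents)
{
	/* refuse up front instead of scribbling past the end of the table */
	if (rq->nr_phys_segments > nents)
		return -EOVERFLOW;

	return blk_rq_map_sg(q, rq, sglist);
}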

The other problem is that this _really_ shouldn't happen,
and points to some issue with the block layer in general.
Which I've been trying to find for several months now,
but to no avail :-(

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    zSeries & Storage
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [RFC PATCH 0/2] virtio nvme

2015-09-27 Thread Hannes Reinecke
On 09/27/2015 07:01 AM, Nicholas A. Bellinger wrote:
> On Wed, 2015-09-23 at 15:58 -0700, Ming Lin wrote:
>> On Fri, 2015-09-18 at 14:09 -0700, Nicholas A. Bellinger wrote:
>>> On Fri, 2015-09-18 at 11:12 -0700, Ming Lin wrote:
>>>> On Thu, 2015-09-17 at 17:55 -0700, Nicholas A. Bellinger wrote:
> 
> 
> 



> 
> Btw, after chatting with Dr. Hannes this week at SDC here are his
> original rts-megasas -v6 patches from Feb 2013.
> 
> Note they are standalone patches that require a sufficiently old enough
> LIO + QEMU to actually build + function.
> 
> https://github.com/Datera/rts-megasas/blob/master/rts_megasas-qemu-v6.patch
> https://github.com/Datera/rts-megasas/blob/master/rts_megasas-fabric-v6.patch
> 
> For groking purposes, they demonstrate the principle design for a host
> kernel level driver, along with the megasas firmware interface (MFI)
> specific emulation magic that makes up the bulk of the code.
> 
And indeed, Nic persuaded me to have them updated to qemu latest.
Which I'll be doing shortly. Stay tuned.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    zSeries & Storage
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: DEFINE_IDA causing memory leaks? (was Re: [PATCH 1/2] virtio: fix memory leak of virtio ida cache layers)

2015-09-16 Thread Hannes Reinecke
On 09/17/2015 07:33 AM, Michael S. Tsirkin wrote:
> On Wed, Sep 16, 2015 at 07:29:17PM -0500, Suman Anna wrote:
>> The virtio core uses a static ida named virtio_index_ida for
>> assigning index numbers to virtio devices during registration.
>> The ida core may allocate some internal idr cache layers and
>> an ida bitmap upon any ida allocation, and all these layers are
>> truely freed only upon the ida destruction. The virtio_index_ida
>> is not destroyed at present, leading to a memory leak when using
>> the virtio core as a module and atleast one virtio device is
>> registered and unregistered.
>>
>> Fix this by invoking ida_destroy() in the virtio core module
>> exit.
>>
>> Cc: "Michael S. Tsirkin" 
>> Signed-off-by: Suman Anna 
> 
> Interesting.
> Will the same apply to e.g. sd_index_ida in drivers/scsi/sd.c
> or iscsi_sess_ida in drivers/scsi/scsi_transport_iscsi.c?
> 
> If no, why not?
> 
> One doesn't generally expect to have to free global variables.
> Maybe we should forbid DEFINE_IDA in modules?
> 
> James, could you comment on this please?
> 
Well, looking at the code 'ida_destroy' only needs to be called
if you want/need to do a general cleanup.
It shouldn't be required if you do correct reference counting
on your objects, and call idr_remove() on each of them.

Unless I'm misreading something.
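
(The two cleanup levels in question, sketched with the ida API;
ida_alloc()/ida_free() are the current names for what was
ida_simple_get()/ida_simple_remove() back then, and my_* is made up:)

static DEFINE_IDA(my_index_ida);

static void my_device_release(struct my_dev *dev)
{
	/* per object: hand the index back when the device goes away */
	ida_free(&my_index_ida, dev->index);
}

static void __exit my_exit(void)
{
	/* module exit: frees the ida's internal cache layers themselves */
	ida_destroy(&my_index_ida);
}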

Seems like a topic for KS; Johannes had a larger patchset recently to
clean up idr, which ran into very much the same issues.

Cheers,

Hannes

-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


Re: sometimes ping fails when enable stp

2014-04-11 Thread Hannes Reinecke
On 04/11/2014 09:31 AM, longguang.yue wrote:
> hi, all
> sometimes ping fails when stp is enabled, why?
>  
Probably stp initialisation isn't yet finished, and the switch will
block any traffic until that happens.
During that time any traffic will return with -EHOSTUNREACH.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


Re: updated: kvm PCI todo wiki

2013-08-21 Thread Hannes Reinecke

On 08/21/2013 12:48 PM, Michael S. Tsirkin wrote:

Hey guys,
I've put up a wiki page with a kvm PCI todo list,
mainly to avoid effort duplication, but also in the hope
to draw attention to what I think we should try addressing
in KVM:

http://www.linux-kvm.org/page/PCITodo

This page could cover all PCI related activity in KVM,
it is very incomplete.
We should probably add e.g. IOMMU related stuff.

Note: if there's no developer listed for an item,
this just means I don't know of anyone actively working
on an issue at the moment, not that no one intends to.

I would appreciate it if others working on one of the items on this list
would add their names so we can communicate better.  If others like this
wiki page, please go ahead and add stuff you are working on if any.

It would be especially nice to add testing projects.

Also, feel free to add links to bugzillas items.

On a related note, did anyone ever try to test MSI / MSI-X with a 
Windows guest? I've tried to enable it for virtio but for some reason 
Windows didn't want to enable it. AHCI was even worse; the stock 
Windows version doesn't support MSI and the Intel one doesn't like our 
implementation :-(.


Anyone ever managed to get this to work?

If not it'd be a good topic for the wiki ...

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


Re: binding/unbinding devices to vfio-pci

2013-07-02 Thread Hannes Reinecke

On 07/02/2013 05:13 PM, Yoder Stuart-B08248 wrote:




-Original Message-
From: Alex Williamson [mailto:alex.william...@redhat.com]
Sent: Tuesday, July 02, 2013 9:46 AM
To: Yoder Stuart-B08248
Cc: k...@vger.kernel.org list; Alexander Graf; Bhushan Bharat-R65777; 
a.mota...@virtualopensystems.com;
virtualization@lists.linux-foundation.org
Subject: Re: binding/unbinding devices to vfio-pci

On Tue, 2013-07-02 at 14:15 +, Yoder Stuart-B08248 wrote:

Alex,

I'm trying to think through how binding/unbinding of devices will
work with VFIO for platform devices and have a couple of questions
about how vfio-pci works.

When you bind a device to vfio-pci, e.g.:
# echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id

...I understand that the echo into 'new_id' tells the
vfio pci driver that it now handles the specified PCI ID.

But now there are 2 drivers that handle that PCI ID,
the original host driver and vfio-pci.   Say that
you hotplug a PCI device that matches that ID.   Which of
the 2 drivers are going to get bound to the device?

Also, if you unbind a device from vfio-pci and want to
bind it again to the normal host driver you would just
echo the full device info into the 'bind' sysfs file
for the host driver, right?

 echo :06:0d.0 > /sys/bus/pci/drivers/...


Hi Stuart,

The driver binding interface is far from perfect.  In your scenario
where you've added the ID for one device, then hotplug another device
with the same ID, the results are indeterminate.  Both vfio-pci and the
host driver, assuming it's still loaded, can claim the device, it's just
a matter of which gets probed first.

Generally that window should be very short though.  To bind a device,
the user should do:

1) echo :bb:dd.f > /sys/bus/pci/devices/:bb:dd.f/driver/unbind
2) echo   > /sys/bus/pci/drivers/vfio-pci/new_id
3) echo :bb:dd.f > /sys/bus/pci/drivers/vfio-pci/bind
4) echo   > /sys/bus/pci/drivers/vfio-pci/remove_id

There are actually a number of ways you can do this and the default
autoprobe behavior really makes step 3) unnecessary as the driver core
will probe any unbound devices as soon as a new_id is added to vfio-pci.
That can be changed by:

# echo 0 > /sys/bus/pci/drivers_autoprobe

But then we have to worry about races from any devices that might have
been hotplugged in the interim.


But, even apart from hot-plugged devices, what about the device
we just unbound?  There are now 2 host drivers that can handle the
device when the autoprobe happens.  Is it just luck that vfio-pci
is the one that gets the device?

Probably the best way of handling this would be to disallow autobinding 
for vfio in general.
Then the 'normal' driver would be bound to the device, and vfio must be 
enabled via manual binding.

Much like it is today.

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: PCI device not properly reset after VFIO

2012-10-18 Thread Hannes Reinecke

On 10/18/2012 04:40 PM, Alex Williamson wrote:

Hi Hannes,

Thanks for testing vfio

On Thu, 2012-10-18 at 08:47 +0200, Hannes Reinecke wrote:

Hi Alex,

I've been playing around with VFIO and megasas (of course).
What I did now was switching between VFIO and 'normal' operation, ie
emulated access.

megasas is happily running under VFIO, but when I do an emergency
stop like killing the Qemu session the PCI device is not properly reset.
IE when I load 'megaraid_sas' after unbinding the vfio_pci module
the driver cannot initialize the card and waits forever for the
firmware state to change.

I need to do a proper pci reset via
echo 1 > /sys/bus/pci/device//reset
to get it into a working state again.

Looking at vfio_pci_disable() pci reset is called before the config
state and BARs are restored.
Seeing that vfio_pci_enable() calls pci reset right at the start,
too, before modifying anything I do wonder whether the pci reset is
at the correct location for disable.

I would have expected to call pci reset in vfio_pci_disable()
_after_ we have restored the configuration, to ensure a sane state
after reset.
And, as experience shows, we do need to call it there.

So what is the rationale for the pci reset?
Can we move it to the end of vfio_pci_disable() or do we need to
call pci reset twice?


I believe the rationale was that by resetting the device before we
restore the state we stop anything that the device was doing.  Restoring
the saved state on a running device seems like it could cause problems,
so you may be right and we actually need to do reset, load, restore,
reset.  Does adding another call to pci_reset_function in the
pci_restore_state (as below) solve the problem?  Traditional KVM device
assignment has a nearly identical path, does it have this same bug?


It's actually the first time I've been able to test this (the 
hardware is a bit tricky to set up ...), so I cannot tell (yet)
if KVM exhibits the same thing.


Thanks,

Alex

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 6c11994..d07a45c 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -107,9 +107,10 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
 pci_reset_function(vdev->pdev);

 if (pci_load_and_free_saved_state(vdev->pdev,
- &vdev->pci_saved_state) == 0)
+ &vdev->pci_saved_state) == 0) {
 pci_restore_state(vdev->pdev);
-   else
+   pci_reset_function(vdev->pdev);
+   } else
 pr_info("%s: Couldn't reload %s saved state\n",
 __func__, dev_name(&vdev->pdev->dev));



I would have called reset after unmapping the BARs; the HBA I'm 
working with does need to access the BARs, so their contents 
might be relevant, too.


But then I'm not really a PCI expert.
Maybe we should ask Tony Luck or Bjorn Helgaas.

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

PCI device not properly reset after VFIO

2012-10-17 Thread Hannes Reinecke

Hi Alex,

I've been playing around with VFIO and megasas (of course).
What I did now was switching between VFIO and 'normal' operation, ie 
emulated access.


megasas is happily running under VFIO, but when I do an emergency 
stop like killing the Qemu session the PCI device is not properly reset.

IE when I load 'megaraid_sas' after unbinding the vfio_pci module
the driver cannot initialize the card and waits forever for the 
firmware state to change.


I need to do a proper pci reset via
echo 1 > /sys/bus/pci/device//reset
to get it into a working state again.

Looking at vfio_pci_disable() pci reset is called before the config 
state and BARs are restored.
Seeing that vfio_pci_enable() calls pci reset right at the start, 
too, before modifying anything I do wonder whether the pci reset is 
at the correct location for disable.


I would have expected to call pci reset in vfio_pci_disable() 
_after_ we have restored the configuration, to ensure a sane state 
after reset.

And, as experience shows, we do need to call it there.

So what is the rationale for the pci reset?
Can we move it to the end of vfio_pci_disable() or do we need to 
call pci reset twice?


Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/2 v1] blkdrv: Add queue limits parameters for sg block drive

2012-08-24 Thread Hannes Reinecke

On 08/24/2012 09:56 AM, Paolo Bonzini wrote:

Il 24/08/2012 02:45, Nicholas A. Bellinger ha scritto:

So up until very recently, TCM would accept an I/O request for an DATA
I/O type CDB with a max_sectors larger than the reported max_sectors for
it's TCM backend (regardless of backend type), and silently generate N
backend 'tasks' to complete the single initiator generated command.


This is what QEMU does if you use scsi-block, except for MMC devices
(because of the insanity of the commands used for burning).


Also FYI for Paolo, for control type CDBs I've never actually seen an
allocation length exceed max_sectors, so in practice AFAIK this only
happens for DATA I/O type CDBs.


Yes, that was my impression as well.


This was historically required by the pSCSI backend driver (using a
number of old SCSI passthrough interfaces) in order to support this very
type of case described above, but over the years the logic ended up
creeping into various other non-passthrough backend drivers like IBLOCK
+FILEIO.  So for v3.6-rc1 code, hch ended up removing the 'task' logic
thus allowing backends (and the layers below) to the I/O sectors >
max_sectors handling work, allowing modern pSCSI using struct request to
do the same.  (hch assured me this works now for pSCSI)


So now LIO and QEMU work the same.  (Did he test tapes too?)


Anyways, I think having the guest limit virtio-scsi DATA I/O to
max_sectors based upon the host accessible block limits is reasonable
approach to consider.  Reducing this value even further based upon the
lowest max_sectors available amongst possible migration hosts would be a
good idea here to avoid having to reject any I/O's exceeding a new
host's device block queue limits.


Yeah, it's reasonable _assuming it is needed at all_.  For disks, it is
not needed.  For CD-ROMs it is, but right now we have only one report
and it is using USB so we don't know if the problem is in the drive or
rather in the USB bridge (whose quality usually leaves much to be desired).

So in the only observed case, the fix would really be a workaround; the
right thing to do with USB devices is to use USB passthrough.



Hehe. So finally someone else stumbled across this one.

All is fine and dandy as long as you're able to use scsi-disk.
As soon as you're forced to use scsi-generic we're in trouble.

With scsi-generic we actually have two problems:
1) scsi-generic just acts as a pass-through and passes the commands
   as-is, including the scatter-gather information as formatted by
   the guest. So the guest could easily format an SG_IO command
   which will not be compatible with the host.
2) The host is not able to differentiate between a malformed
   SG_IO command and a real I/O error; in both cases it'll return
   -EIO.

So we can fix this by either
a) ignore (as we do nowadays :-)
b) Fixup scsi-generic to inspect and modify SG_IO information
   to ensure the host-limits are respected
c) Fixup the host to differentiate between a malformed SG_IO
   and a real I/O error.

c) would only be feasible for Linux et al. _Personally_ I would prefer 
that approach, as I fail to see why we cannot return a proper error code 
here.
But I can already hear the outraged cry 'POSIX! POSIX!', so I guess it's 
not going to happen anytime soon.

So I would vote for b).
Yes, it's painful. But in the long run we'll have to do an SG_IO 
inspection anyway, otherwise we'll always be susceptible to malicious 
SG_IO attacks.
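
To make option b) a bit more concrete, here is a minimal sketch of the
kind of check that could be applied before forwarding a guest SG_IO
request. This is not existing qemu code; the limit structure and the
helper name are invented for illustration:

    /* Rough sketch of option b): validate a guest-supplied SG_IO request
     * against the host limits before forwarding it.  The limit names are
     * made up for illustration. */
    #include <scsi/sg.h>
    #include <errno.h>

    struct host_sg_limits {
        unsigned int max_transfer;   /* maximum transfer size, in bytes */
        unsigned int max_segments;   /* maximum scatter-gather elements */
    };

    static int sg_io_fits_host(const struct sg_io_hdr *hdr,
                               const struct host_sg_limits *lim)
    {
        if (hdr->dxfer_len > lim->max_transfer)
            return -EINVAL;          /* larger than the host queue accepts */
        if (hdr->iovec_count > lim->max_segments)
            return -EINVAL;          /* too many scatter-gather elements */
        return 0;
    }

A check like this would at least let us fail such requests cleanly in
the host instead of getting an indistinguishable -EIO back.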


Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: virtio-scsi spec (was Re: [PATCH] Add virtio-scsi to the virtio spec)

2011-11-30 Thread Hannes Reinecke

On 11/30/2011 05:36 PM, Paolo Bonzini wrote:

On 11/30/2011 03:17 PM, Hannes Reinecke wrote:

seg_max is the maximum number of segments that can be in a
command. A bidirectional command can include seg_max input
segments and seg_max output segments.


I would like to have the other request_queue limitations exposed
here, too.
Most notably we're missing the maximum size of an individual segment
and the maximum size of the overall I/O request.


The virtio transport does not put any limit, as far as I know.


Virtio doesn't, but the underlying device/driver might.
And if we don't expose these values we cannot format the request correctly.


As this is the host specification I really would like to see an host
identifier somewhere in there.
Otherwise we won't be able to reliably identify a virtio SCSI host.


I thought about it, but I couldn't figure out exactly how to use it. If
it's just allocating 64 bits in the configuration space (with the
stipulation that they could be zero), let's do it now. Otherwise a
controlq command is indeed better, and it can come later.

But even if it's just a 64-bit value, then: 1) where would you place it
in sysfs for userspace? I can make up a random name, but existing user
tools won't find it and that's against the design of virtio-scsi. 2) How
would it be encoded as a transport ID? Is it FC, or firewire, or SAS, or
what?

I was thinking of something along the lines of the TransportID as 
defined in SPC.
The main idea is to have a unique ID by which we can identify a given 
virtio-scsi host. Admittedly it might not be useful in general, so it 
might be an idea to delegate this to another controlq command.
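
For the "just allocate 64 bits" variant the change would be tiny; a
sketch, where the field name and its placement are purely my assumption
and not part of the draft:

    /* Sketch only: a 64-bit host identifier appended to the virtio-scsi
     * configuration space.  Field name and placement are assumptions. */
    #include <stdint.h>

    struct virtio_scsi_config_sketch {
        /* ... existing configuration fields (seg_max, sense_size, cdb_size, ...) ... */
        uint64_t host_id;   /* 0 = device has no stable identifier */
    };

A value of zero would keep the field optional, and a later controlq
command could still supersede it with a full TransportID.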



Plus you can't calculate the ITL nexus information, making
Persistent Reservations impossible.


They are not impossible, only some features such as SPEC_I_PT. If you
use NPIV or iSCSI in the host, then the persistent reservations will
already get the correct initiator port. If not, much more work is needed.

Yes, for a shared (physical) SCSI host persistent reservations will be 
tricky.


Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: virtio-scsi spec (was Re: [PATCH] Add virtio-scsi to the virtio spec)

2011-11-30 Thread Hannes Reinecke
u32 residual;
> u16 status_qualifier;
> u8 status;
> u8 response;
> u8 sense[sense_size];
> char datain[];
> };
> 
> /* command-specific response values */
> #define VIRTIO_SCSI_S_OK0
> #define VIRTIO_SCSI_S_UNDERRUN  1
> #define VIRTIO_SCSI_S_ABORTED   2
> #define VIRTIO_SCSI_S_BAD_TARGET3
> #define VIRTIO_SCSI_S_RESET 4
> #define VIRTIO_SCSI_S_TRANSPORT_FAILURE 5
> #define VIRTIO_SCSI_S_TARGET_FAILURE6
> #define VIRTIO_SCSI_S_NEXUS_FAILURE 7
> #define VIRTIO_SCSI_S_FAILURE   8
> 
> /* task_attr */
> #define VIRTIO_SCSI_S_SIMPLE0
> #define VIRTIO_SCSI_S_ORDERED   1
> #define VIRTIO_SCSI_S_HEAD  2
> #define VIRTIO_SCSI_S_ACA   3
> 
> The lun field addresses a target and logical unit in the
> virtio-scsi device's SCSI domain. In this version of the spec,
> the only supported format for the LUN field is: first byte set to
> 1, second byte set to target, third and fourth byte representing
> a single level LUN structure, followed by four zero bytes. With
> this representation, a virtio-scsi device can serve up to 256
> targets and 16384 LUNs per target.
> 
> The id field is the command identifier ("tag").
> 
> Task_attr, prio and crn should be left to zero: command priority
> is explicitly not supported by this version of the device;
> task_attr defines the task attribute as in the table above, but
> all task attributes may be mapped to SIMPLE by the device; crn
> may also be provided by clients, but is generally expected to be
> 0. The maximum CRN value defined by the protocol is 255, since
> CRN is stored in an 8-bit integer.
> 
> All of these fields are defined in SAM. They are always
> read-only, as are the cdb and dataout field. The cdb_size is
> taken from the configuration space.
> 
> sense and subsequent fields are always write-only. The sense_len
> field indicates the number of bytes actually written to the sense
> buffer. The residual field indicates the residual size,
> calculated as "data_length - number_of_transferred_bytes", for
> read or write operations. For bidirectional commands, the
> number_of_transferred_bytes includes both read and written bytes.
> A residual field that is less than the size of datain means that
> the dataout field was processed entirely. A residual field that
> exceeds the size of datain means that the dataout field was
> processed partially and the datain field was not processed at
> all.
> 
> The status byte is written by the device to be the status
> code as defined by SAM.
> 
> The response byte is written by the device to be one of the
> following:
> 
>   VIRTIO_SCSI_S_OK when the request was completed and the status
>   byte is filled with a SCSI status code (not necessarily
>   "GOOD").
> 
>   VIRTIO_SCSI_S_UNDERRUN if the content of the CDB requires
>   transferring more data than is available in the data buffers.
> 
>   VIRTIO_SCSI_S_ABORTED if the request was cancelled due to an
>   ABORT TASK or ABORT TASK SET task management function.
> 
>   VIRTIO_SCSI_S_BAD_TARGET if the request was never processed
>   because the target indicated by the lun field does not exist.
> 
>   VIRTIO_SCSI_S_RESET if the request was cancelled due to a bus
>   or device reset.
> 
>   VIRTIO_SCSI_S_TRANSPORT_FAILURE if the request failed due to a
>   problem in the connection between the host and the target
>   (severed link).
> 
>   VIRTIO_SCSI_S_TARGET_FAILURE if the target is suffering a
>   failure and the guest should not retry on other paths.
> 
>   VIRTIO_SCSI_S_NEXUS_FAILURE if the nexus is suffering a failure
>   but retrying on other paths might yield a different result.
> 
>   VIRTIO_SCSI_S_FAILURE for other host or guest error. In
>   particular, if neither dataout nor datain is empty, and the
>   VIRTIO_SCSI_F_INOUT feature has not been negotiated, the
>   request will be immediately returned with a response equal to
>   VIRTIO_SCSI_S_FAILURE.
> 
We should be adding

VIRTIO_SCSI_S_BUSY

for a temporary failure, indicating that a command retry
might be sufficient to clear this situation.
Equivalent to VIRTIO_SCSI_S_NEXUS_FAILURE, but issuing a retry on
the same path.
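
In terms of the list above that would just be one more value; a sketch,
with the numeric code picked arbitrarily for illustration:

    /* Proposed addition; the value is only a suggestion, it would have
     * to be assigned by the spec. */
    #define VIRTIO_SCSI_S_BUSY   9   /* transient condition, retry on the same path */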

Thanks for the write-up.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: virtio scsi host draft specification, v3

2011-07-01 Thread Hannes Reinecke
On 07/01/2011 08:41 AM, Paolo Bonzini wrote:
> On 06/29/2011 11:39 AM, Stefan Hajnoczi wrote:
>> > > Of course, when doing so we would be lose the ability to
>> freely remap
>> > > LUNs. But then remapping LUNs doesn't gain you much imho.
>> > > Plus you could always use qemu block backend here if you want
>> > > to hide the details.
>> >
>> > And you could always use the QEMU block backend with
>> > scsi-generic if you want to remap LUNs, instead of true
 >> > passthrough via the kernel target.
>>
>> IIUC the in-kernel target always does remapping. It passes through
>> individual LUNs rather than entire targets and you pick LU Numbers to
>> map to the backing storage (which may or may not be a SCSI
>> pass-through device). Nicholas Bellinger can confirm whether this is
>> correct.
>
> But then I don't understand. If you pick LU numbers both with the
> in-kernel target and with QEMU, you do not need to use e.g. WWPNs
> with fiber channel, because we are not passing through the details
> of the transport protocol (one day we might have virtio-fc, but more
> likely not). So the LUNs you use might as well be represented by
> hierarchical LUNs.
>

Actually, the kernel does _not_ do any LUN remapping. It just so 
happens that most storage arrays will present their LUNs starting at 
0, so normally you wouldn't notice.

However, some arrays have an array-wide LUN range, so you start 
seeing LUNs at odd places:

[3:0:5:0]diskLSI  INF-01-000750  /dev/sdw
[3:0:5:7]diskLSI  Universal Xport  0750  /dev/sdx

> Using NPIV with KVM would be done by mapping the same virtual N_Port
> ID in the host(s) to the same LU number in the guest. You might
> already do this now with virtio-blk, in fact.
>
The point here is not the mapping. The point is rescanning.

You can map existing NPIV devices already. But you _cannot_ rescan
the host or device _from the guest_ to detect whether new devices
are present.
That is the problem I'm trying to describe here.

To be more explicit:
Currently you have to map existing devices directly as individual 
block or scsi devices to the guest.
And a rescan within the guest can only be sent to that device, so the 
only information you will be able to gather is whether the device itself 
is still present.
You are unable to detect if there are other devices attached to your 
guest which you should connect to.

So we have to have an enclosing instance (ie the equivalent of a 
SCSI target), which is capable of telling us exactly this.
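
As an illustration of what such an enclosing instance could report, a
hypothetical notification structure; none of these names or values are
part of any spec, they only show the kind of information the guest
would need in order to know where to rescan:

    /* Hypothetical hot-plug notification the 'enclosing instance' could
     * post on an event queue. */
    #include <stdint.h>

    #define VIRTIO_SCSI_T_LUN_ADDED    0   /* made-up event codes */
    #define VIRTIO_SCSI_T_LUN_REMOVED  1

    struct virtio_scsi_hotplug_event_sketch {
        uint32_t event;     /* VIRTIO_SCSI_T_LUN_ADDED / _REMOVED */
        uint8_t  lun[8];    /* which target/LUN changed */
    };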

> Put in another way: the virtio-scsi device is itself a SCSI target,
> so yes, there is a single target port identifier in virtio-scsi. But
> this SCSI target just passes requests down to multiple real targets,
> and so will let you do ALUA and all that.
>
Argl. No way. The virtio-scsi device has to map to a single LUN.

I thought I mentioned this already, but I'd better clarify this again:

The SCSI spec itself only deals with LUNs, so anything you'll read 
in there obviously will only handle the interaction between the 
initiator (read: host) and the LUN itself. However, the actual 
command is sent via an intermediate target, hence you'll always see 
the reference to the ITL (initiator-target-lun) nexus.
The SCSI spec details discovery of the individual LUNs presented by 
a given target, it does _NOT_ detail the discovery of the targets 
themselves.
That is being delegated to the underlying transport, in most cases 
SAS or FibreChannel.
For the same reason the SCSI spec can afford to disdain any 
reference to path failure, device hot-plugging etc; all of these 
things are being delegated to the transport.

In our context the virtio-scsi device should map to the LUN, and the 
virtio-scsi _host_ backend should map to the target.
The virtio-scsi _guest_ driver will then map to the initiator.

So we should be able to attach more than one device to the backend,
which then will be presented to the initiator.

In the case of NPIV it would make sense to map the virtual SCSI host 
to the backend, so that all devices presented to the virtual SCSI 
host will be presented to the backend, too.
However, when doing so these devices will normally be referenced by 
their original LUN, as these will be presented to the guest via eg 
'REPORT LUNS'.

The above thread now tries to figure out if we should remap those 
LUN numbers or just expose them as they are.
If we decide on remapping, we have to emulate _all_ commands 
referring explicitly to those LUN numbers (persistent reservations, 
anyone?). If we don't, we would expose some hardware detail to the 
guest, but it would save us _a lot_ of processing.

I'm all for the latter.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  

Re: virtio scsi host draft specification, v3

2011-06-29 Thread Hannes Reinecke
On 06/29/2011 12:07 PM, Christoph Hellwig wrote:
> On Wed, Jun 29, 2011 at 10:39:42AM +0100, Stefan Hajnoczi wrote:
>> I think we're missing a level of addressing.  We need the ability to
>> talk to multiple target ports in order for "list target ports" to make
>> sense.  Right now there is one implicit target that handles all
>> commands.  That means there is one fixed I_T Nexus.
>>
>> If we introduce "list target ports" we also need a way to say "This
>> CDB is destined for target port #0".  Then it is possible to enumerate
>> target ports and address targets independently of the LUN field in the
>> CDB.
>>
>> I'm pretty sure this is also how SAS and other transports work.  In
>> their framing they include the target port.
>
> Yes, exactly.  Hierachial LUNs are a nasty fringe feature that we should
> avoid as much as possible, that is for everything but IBM vSCSI which is
> braindead enough to force them.
>
Yep.

>> The question is whether we really need to support multiple targets on
>> a virtio-scsi adapter or not.  If you are selectively mapping LUNs
>> that the guest may access, then multiple targets are not necessary.
>> If we want to do pass-through of the entire SCSI bus then we need
>> multiple targets but I'm not sure if there are other challenges like
>> dependencies on the transport (Fibre Channel, SAS, etc) which make it
>> impossible to pass through bus-level access?
>
> I don't think bus-level pass through is either easily possible nor
> desirable.  What multiple targets are useful for is allowing more
> virtual disks than we have virtual PCI slots.  We could do this by
> supporting multiple LUNs, but given that many SCSI ressources are
> target-based doing multiple targets most likely is the more scabale
> and more logical variant.  E.g. we could much more easily have one
> virtqueue per target than per LUN.
>
The general idea here is that we can support NPIV.
With NPIV we'll have several scsi_hosts, each of which is assigned a 
different set of LUNs by the array.
With virtio we need to be able to react to LUN remapping on the array 
side, ie we need to be able to issue a 'REPORT LUNS' command and 
add/remove LUNs on the fly. This means we have to expose the 
scsi_host in some way via virtio.

This is impossible with a one-to-one mapping between targets and 
LUNs. The actual bus-level pass-through will be just on the SCSI 
layer, ie 'REPORT LUNS' should be possible. If and how we do a LUN 
remapping internally on the host is a totally different matter.
Same goes for the transport details; I doubt we will expose all the 
dingy details of the various transports, but rather restrict 
ourselves to an abstract transport.
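
For reference, walking a REPORT LUNS response on the guest side is
straightforward; a minimal sketch (payload layout as defined in SPC,
helper name made up):

    /* Minimal sketch of parsing a REPORT LUNS response after a rescan:
     * 4-byte big-endian list length, 4 reserved bytes, then one 8-byte
     * entry per LUN. */
    #include <stdint.h>
    #include <stdio.h>

    static void walk_report_luns(const uint8_t *resp, size_t resp_len)
    {
        uint32_t list_len;
        size_t off;

        if (resp_len < 8)
            return;
        list_len = (resp[0] << 24) | (resp[1] << 16) |
                   (resp[2] << 8)  |  resp[3];

        for (off = 8; off + 8 <= resp_len && off < 8 + list_len; off += 8) {
            uint64_t lun = 0;
            int i;

            for (i = 0; i < 8; i++)
                lun = (lun << 8) | resp[off + i];
            printf("LUN entry: 0x%016llx\n", (unsigned long long)lun);
        }
    }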

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: virtio scsi host draft specification, v3

2011-06-14 Thread Hannes Reinecke
On 06/12/2011 09:51 AM, Michael S. Tsirkin wrote:
> On Fri, Jun 10, 2011 at 02:55:35PM +0200, Hannes Reinecke wrote:
>>> Device operation: request queues
>>> 
>>>
>>> The driver queues requests to an arbitrary request queue, and they are
>>> used by the device on that same queue.
>>>
>> What about request ordering?
>> If requests are placed on arbitrary queues you'll inevitably run on
>> locking issues to ensure strict request ordering.
>> I would add here:
>>
>> If a device uses more than one queue it is the responsibility of the
>> device to ensure strict request ordering.
>
> Maybe I misunderstand - how can this be the responsibility of
> the device if the device does not get the information about
> the original ordering of the requests?
>
> For example, if the driver is crazy enough to put
> all write requests on one queue and all barriers
> on another one, how is the device supposed to ensure
> ordering?
>
Which is exactly the problem I was referring to.
When using more than one channel the request ordering
_as seen by the initiator_ has to be preserved.

This is quite hard to do from a device's perspective;
it might be able to process the requests _in the order_ they've 
arrived, but it won't be able to figure out the latency of each 
request, ie how long it will take for the request to be delivered to 
the initiator.

What we need to do here is to ensure that virtio will deliver
the requests in-order across all virtqueues. Not sure whether it 
does this already.
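
Purely as an illustration of the alternative, should virtio itself not
guarantee it: the driver could stamp every request with a global
sequence number and let the device re-sort before submission. Nothing
like this is in the draft; the names below are invented:

    /* Invented example: a per-request header allowing the device to
     * restore the initiator-side ordering across several virtqueues. */
    #include <stdint.h>

    struct ordered_req_hdr_sketch {
        uint64_t seqno;      /* driver-wide monotonically increasing counter */
        uint16_t queue_id;   /* virtqueue the request was placed on */
    };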

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: virtio scsi host draft specification, v3

2011-06-14 Thread Hannes Reinecke
On 06/10/2011 04:35 PM, Paolo Bonzini wrote:
>> If requests are placed on arbitrary queues you'll inevitably run on
>> locking issues to ensure strict request ordering.
>> I would add here:
>>
>> If a device uses more than one queue it is the responsibility of the
>> device to ensure strict request ordering.
>
> Applied with s/device/guest/g.
>
>> Please do not rely in bus/target/lun here. These are leftovers from
>> parallel SCSI and do not have any meaning on modern SCSI
>> implementation (eg FC or SAS). Rephrase that to
>>
>> The lun field is the Logical Unit Number as defined in SAM.
>
> Ok.
>
>>>   The status byte is written by the device to be the SCSI status
>>>   code.
>>
>> ?? I doubt that exists. Make that:
>>
>> The status byte is written by the device to be the status code as
>> defined in SAM.
>
> Ok.
>
>>>   The response byte is written by the device to be one of the
>>>   following:
>>>
>>>   - VIRTIO_SCSI_S_OK when the request was completed and the
>>>   status byte
>>> is filled with a SCSI status code (not necessarily "GOOD").
>>>
>>>   - VIRTIO_SCSI_S_UNDERRUN if the content of the CDB requires
>>>   transferring
>>> more data than is available in the data buffers.
>>>
>>>   - VIRTIO_SCSI_S_ABORTED if the request was cancelled due to a
>>>   reset
>>> or another task management function.
>>>
>>>   - VIRTIO_SCSI_S_FAILURE for other host or guest error. In
>>>   particular,
>>> if neither dataout nor datain is empty, and the
>>> VIRTIO_SCSI_F_INOUT
>>> feature has not been negotiated, the request will be
>>> immediately
>>> returned with a response equal to VIRTIO_SCSI_S_FAILURE.
>>>
>> And, of course:
>>
>> VIRTIO_SCSI_S_DISCONNECT if the request could not be processed due
>> to a communication failure (eg device was removed or could not be
>> reached).
>
> Ok.
>
>> This specification implies a strict one-to-one mapping between host
>> and target. IE there is no way of specifying more than one target
>> per host.
>
> Actually no, the intention is to use hierarchical LUNs to support
> more than one target per host.
>
Can't.

Hierarchical LUNs are a target-internal representation.
The initiator (ie guest OS) should _not_ try to assume anything 
about the internal structure and just use the LUN as an opaque number.

The reason is that LUN addressing is not unique, and there are 
several choices on how to represent a given LUN.
So the consensus here is that different LUN numbers represent
different physical devices, regardless of the (internal) LUN 
representation.
Which in turn means we cannot use the LUN number to convey anything 
else than a device identification relative to a target.

Cf SAM-3 paragraph 4.8:

A logical unit number is a field (see 4.9) containing 64 bits that 
identifies the logical unit within a SCSI target device
when accessed by a SCSI target port.

IE the LUN is dependent on the target, but you cannot make 
assumptions about the target.

Consequently, it is the host's responsibility to figure out the 
targets in the system. After that it invokes the 'scan' function 
from the SCSI midlayer.
You can't start from a LUN and try to figure out the targets ...

If you want to support more than one target per host you need some 
sort of enumeration/callback which allows the host to figure out
the number of available targets.
But in general the targets are referenced by the target port 
identifier as specified in the appropriate standard (eg FC or SAS).
Sadly, we don't have any standard to fall back on for this.

If, however, we decide to expose some details about the backend, we 
could be using the values from the backend directly.
EG we could be forwarding the SCSI target port identifier here
(if backed by real hardware) or creating our own SAS-type
identifier when backed by qemu block. Then we could just query
the backend via a new command on the controlq
(eg 'list target ports') and wouldn't have to worry about any 
protocol specific details here.
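
To make the idea a bit more tangible, this is roughly what such a
controlq exchange could look like; all names, types and values here are
invented for illustration and are not part of the draft spec:

    /* Invented layout for a 'list target ports' control queue command. */
    #include <stdint.h>

    #define VIRTIO_SCSI_T_LIST_TARGET_PORTS  0x100   /* made-up request type */

    struct virtio_scsi_ctrl_list_ports_req {
        uint32_t type;         /* VIRTIO_SCSI_T_LIST_TARGET_PORTS */
        uint32_t first_port;   /* continuation index for long listings */
    };

    struct virtio_scsi_ctrl_list_ports_resp {
        uint8_t  response;     /* VIRTIO_SCSI_S_OK etc. */
        uint8_t  reserved[3];
        uint32_t num_ports;    /* entries that follow */
        uint64_t port_id[];    /* eg SAS-style WWNs generated by the host */
    };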

Of course, when doing so we would lose the ability to freely 
remap LUNs. But then remapping LUNs doesn't gain you much imho.
Plus you could always use qemu block backend here if you want
to hide the details.
But we would be finally able to use NPIV for KVM, something
I wanted to do for a _long_ time.

I personally _really_ would like to see the real backing device 
details exposed to the guest.
Otherwise the more advanced stuff like 

Re: virtio scsi host draft specification, v3

2011-06-10 Thread Hannes Reinecke
---
>
> The driver queues requests to an arbitrary request queue, and they are
> used by the device on that same queue.
>
What about request ordering?
If requests are placed on arbitrary queues you'll inevitably run into 
locking issues to ensure strict request ordering.
I would add here:

If a device uses more than one queue it is the responsibility of the 
device to ensure strict request ordering.

> Requests have the following format:
>
>  struct virtio_scsi_req_cmd {
>  u8 lun[8];
>  u64 id;
>  u8 task_attr;
>  u8 prio;
>  u8 crn;
>  char cdb[cdb_size];
>  char dataout[];
>
>  u8 sense[sense_size];
>  u32 sense_len;
>  u32 residual;
>  u16 status_qualifier;
>  u8 status;
>  u8 response;
>  char datain[];
>  };
>
>  /* command-specific response values */
>  #define VIRTIO_SCSI_S_OK  0
>  #define VIRTIO_SCSI_S_UNDERRUN1
>  #define VIRTIO_SCSI_S_ABORTED 2
>  #define VIRTIO_SCSI_S_FAILURE 3
>
>  /* task_attr */
>  #define VIRTIO_SCSI_S_SIMPLE  0
>  #define VIRTIO_SCSI_S_ORDERED 1
>  #define VIRTIO_SCSI_S_HEAD2
>  #define VIRTIO_SCSI_S_ACA 3
>
>  The lun field addresses a bus, target and logical unit in the SCSI
>  host.  The id field is the command identifier as defined in SAM.
>
Please do not rely on bus/target/lun here. These are leftovers from 
parallel SCSI and do not have any meaning on modern SCSI 
implementations (eg FC or SAS). Rephrase that to

The lun field is the Logical Unit Number as defined in SAM.

>  Task_attr, prio and CRN are defined in SAM.  The prio field should
>  always be zero, as command priority is explicitly not supported by
>  this version of the device.  task_attr defines the task attribute as
>  in the table above, Note that all task attributes may be mapped to
>  SIMPLE by the device.  CRN is generally expected to be 0, but clients
>  can provide it.  The maximum CRN value defined by the protocol is 255,
>  since CRN is stored in an 8-bit integer.
>
>  All of these fields are always read-only, as are the cdb and dataout
>  field.  sense and subsequent fields are always write-only.
>
>  The sense_len field indicates the number of bytes actually written
>  to the sense buffer.  The residual field indicates the residual
>  size, calculated as data_length - number_of_transferred_bytes, for
>  read or write operations.
>
>  The status byte is written by the device to be the SCSI status code.
>
?? I doubt that exists. Make that:

The status byte is written by the device to be the status code as 
defined in SAM.

>  The response byte is written by the device to be one of the following:
>
>  - VIRTIO_SCSI_S_OK when the request was completed and the status byte
>is filled with a SCSI status code (not necessarily "GOOD").
>
>  - VIRTIO_SCSI_S_UNDERRUN if the content of the CDB requires transferring
>more data than is available in the data buffers.
>
>  - VIRTIO_SCSI_S_ABORTED if the request was cancelled due to a reset
>or another task management function.
>
>  - VIRTIO_SCSI_S_FAILURE for other host or guest error.  In particular,
>if neither dataout nor datain is empty, and the VIRTIO_SCSI_F_INOUT
>feature has not been negotiated, the request will be immediately
>returned with a response equal to VIRTIO_SCSI_S_FAILURE.
>
And, of course:

VIRTIO_SCSI_S_DISCONNECT if the request could not be processed due 
to a communication failure (eg device was removed or could not be
reached).

The remaining bits seem to be okay.

One general question:

This specification implies a strict one-to-one mapping between host 
and target. IE there is no way of specifying more than one target 
per host.
This will make things like ALUA (Asymmetric Logical Unit Access)
a bit tricky to implement, as the port states there are bound to 
target port groups. So with the virtio host spec we would need to 
specify two hosts to represent that.

If that's the intention here I'm fine, but maybe we should be 
specifying this expressis verbis somewhere.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] [PATCH 0/4] megaraid_sas HBA emulation

2009-10-30 Thread Hannes Reinecke
Gerd Hoffmann wrote:
> On 10/29/09 05:37, Christoph Hellwig wrote:
>>> So something like
>>> - Get next request
>>> - Attach iovec/bounc-buffer
>>> - handle request (command/write/read)
>>> - complete request (callback)
>>
>> Btw, from some previuous attempts to sort out this code here are some
>> thing that I think would be beneficial:
> 
> Trying to go forward in review+bisect friendly baby steps.  Here is what
> I have now:
> 
> http://repo.or.cz/w/qemu/kraxel.git?a=shortlog;h=refs/heads/scsi.v1
> 
> It is far from being completed, will continue tomorrow.  Should give a
> idea of the direction I'm heading to though.  Comments welcome.
> 
Yep, this looks good.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] [PATCH 0/4] megaraid_sas HBA emulation

2009-10-28 Thread Hannes Reinecke
Am Wed 28 Oct 2009 02:58:33 PM CET schrieb Gerd Hoffmann :

>   Hi,
>
>> From a really quick view fixing up the data xfer code paths doesn't
>> look too bad. Think I'll give it a try.
>
> Oh well.  The interface pretty obviously designed for the esp, which is
> the oldest scsi adapter in qemu ...
>
> ESP: There is no scatter-gather support in the hardware.  So for large
> reads/writes there are quite switches between OS and ESP:  The OS
> saying "dma next sectors to this location" via ioports, the ESP doing
> it and raising a IRQ when done, next round.  The existing callback
> mechanism models that pretty closely.
>
> USB: streams the data in small packets (smaller than sector size, 64
> bytes IIRC).  Current interface works good enougth.
>
> LSI: Hops through quite a few loops to work with the existing
> interface.  Current emulation reads one lsi script command at a time
> and does reads/writes in small pieces like the ESP.  I think it could
> do alot better: parse lsi scripts into scatter lists and submit larger
> requests.  Maybe even have multiple requests in flight at the same
> time.  That probably means putting the lsi script parsing code upside
> down though.
>
> MEGASAS: I guess you have scatter lists at hand and want to submit them
> directly to the block layer for zerocopy block I/O.
>
Precisely.

> So, where to go from here?
>
> I'm tempted to zap the complete read-in-pieces logic.  For read/write
> transfers storage must be passed where everything fits in.  The
> completion callback is called on command completion and nothing else.
>
Agree. That would be my idea here, as well.

> I think we'll need to modes here: xfer from/to host-allocated bounce
> buffer (linear buffer) and xfer from/to guest memory (scatter list).
>
I don't think we really need two modes.
My preferred interface here is to pass scatter-gather lists down
with every xfer; this way it'll be the responsibility of the driver to
create the lists in the first place. If it has hardware scatter-gather
support it can just pass them down, if not it can just as easily create a
scatter-gather list with just one element as a bounce buffer.
Much like the current code does now, only with explicit passing of iovecs.

> That means (emulated) hardware without scatter-gather support must use
> the bounce buffer mode can can't do zerocopy I/O.  I don't think this
> is a big problem though.  Lots of small I/O requests don't perform very
> well, so one big request filling the bounce buffer then memcpy()
> from/to guest memory will most likely be faster anyway.
>
I would update the existing interface to split off the request generation from
the command completion code; then it would be easy to attach the iovec to the
request.

So something like
- Get next request
- Attach iovec/bounc-buffer
- handle request (command/write/read)
- complete request (callback)

would be an idea.
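
Spelled out as an interface it might look roughly like this; every name
here is a placeholder, not an existing qemu function:

    /* Placeholder sketch of the four steps above as an API. */
    #include <stdint.h>
    #include <sys/uio.h>

    typedef struct SCSIRequestSketch SCSIRequestSketch;
    typedef void (*scsi_complete_fn)(SCSIRequestSketch *req, int status);

    struct SCSIRequestSketch {
        uint32_t tag;
        struct iovec *iov;        /* scatter-gather list built by the HBA emulation */
        int iov_cnt;              /* a single element when bouncing */
        scsi_complete_fn complete;
    };

    /* 1. get (or allocate) the request for a tag */
    SCSIRequestSketch *scsi_req_get(void *bus, uint32_t tag);
    /* 2. attach the scatter-gather list or a one-element bounce buffer */
    void scsi_req_attach_iov(SCSIRequestSketch *req, struct iovec *iov, int cnt);
    /* 3. submit the CDB plus the attached data */
    int scsi_req_handle(SCSIRequestSketch *req, const uint8_t *cdb, int cdb_len);
    /* 4. 'complete' is invoked exactly once, on command completion */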

Cheers,

Hannes
---
No .sig today as I'm writing from my laptop.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] [PATCH 3/4] scsi-disk: Factor out SCSI command emulation

2009-10-28 Thread Hannes Reinecke
Am Wed 28 Oct 2009 09:42:47 AM CET schrieb Christoph Hellwig :

> On Tue, Oct 27, 2009 at 04:28:59PM +0100, Hannes Reinecke wrote:
>>
>> Other drives might want to use SCSI command emulation without
>> going through the SCSI disk abstraction, as this imposes too
>> many limits on the emulation.
>
> Might be a good idea to move something like this first into the series
> and share the CDB decoder between scsi and ide (atapi) as a start.  A
> little bit of refactoring of the CDB decoder, e.g. into one function
> per opcode (-family) won't hurt either.
>

Yes, that was on my to-do list, as well.
Eg even the megasas emulation already emulates REPORT LUNS, as we certainly 
don't want to pass the original LUN information back to the guest.
And it would make sense to trap some other commands (like INQUIRY), too, to
strip off any unwanted information.
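
A sketch of the kind of filtering meant here, rewriting a passed-through
REPORT LUNS payload so the guest only sees the LUN number the emulation
assigned; the helper name and the single-LUN assumption are mine:

    /* Illustration only: rewrite a REPORT LUNS payload in place so a
     * single emulated LUN (< 256, peripheral addressing) is reported. */
    #include <stdint.h>
    #include <string.h>

    static void filter_report_luns(uint8_t *buf, size_t len, uint8_t guest_lun)
    {
        if (len < 16)
            return;             /* need the header plus one 8-byte entry */
        memset(buf, 0, len);
        buf[3] = 8;             /* list length: exactly one entry */
        buf[9] = guest_lun;     /* bytes 8..15: single level LUN */
    }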

Cheers,

Hannes
---
No .sig today as this is from my laptop.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] [PATCH 0/4] megaraid_sas HBA emulation

2009-10-28 Thread Hannes Reinecke
Am Tue 27 Oct 2009 05:47:21 PM CET schrieb Gerd Hoffmann :

>   Hi,
>
>> The device can be accessed by
>>
>> -drive if=raid,file=XXX
>
> Don't extend that qemu automagic please.  The new way to handle this is:
>
>   -drive if=none,id=mydisk,file=/path/to/some/disk.img
>   -device megasas,id=raid
>   -device scsi-disk,bus=raid.0,scsi-id=1,drive=mydisk
>
Alright, no problem there. Didn't know this.

>> In order to support SCSI command emulation I had to update /
>> patch up the existing SCSI disk support. This might be
>> not to everyones taste, so I'm open to alternative
>> suggestions.
>>
>> But I certainly do _not_ want to update the SCSI disk
>> emulation, as this is really quite tied to the SCSI parallel
>> interface used by the old lsi53c895a.c.
>
> --verbose please.  I'd prefer to fix scsi-disk bugs and/or limitations
> instead of working around them.
>
The problem is I don't have any documentation for the LSI parallel
SCSI controller. So I don't know if and in what shape I/O is passed
down, nor anything else. And as the SCSI disk emulation is really
tied into the LSI parallel SCSI controller, any change in the former
is likely to break the latter.
And I have no way of fixing it. Hence I decided on this approach.

I can certainly go ahead and patch up the scsi disk emulation, but it's
quite likely to break the LSI controller. If that's okay with everybody,
I'll go ahead with it.

>> Plus it doesn't do scatter-gather list handling,
>
> Which should be fixed anyway.
>
Quite. But as I said, the LSI parallel SCSI controller is going to
suffer.

Cheers,

Hannes
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 4/4] megasas: Add SCSI command emulation

2009-10-27 Thread Hannes Reinecke

Now that we can use SCSI command emulation without using the SCSI
disk abstraction we can easily add it to the megasas HBA.

Signed-off-by: Hannes Reinecke 
---
 hw/megasas.c |   88 +++---
 1 files changed, 53 insertions(+), 35 deletions(-)

diff --git a/hw/megasas.c b/hw/megasas.c
index a57e8e0..f32b313 100644
--- a/hw/megasas.c
+++ b/hw/megasas.c
@@ -661,40 +661,55 @@ static int megasas_handle_scsi(MPTState *s, uint8_t fcmd,
 }
 }
 
-memset(&cmd->hdr, 0, sizeof(struct sg_io_hdr));
-cmd->hdr.interface_id = 'S';
-cmd->hdr.cmd_len = cdb_len;
-cmd->hdr.cmdp = cdb;
-cmd->hdr.iovec_count = cmd->sge_count;
-cmd->hdr.dxferp = cmd->iov;
-for (n = 0; n < cmd->sge_count; n++)
-   cmd->hdr.dxfer_len += cmd->iov[n].iov_len;
-if (cmd->sge_count) {
-   if (dir)
-   cmd->hdr.dxfer_direction = SG_DXFER_TO_DEV;
-   else
-   cmd->hdr.dxfer_direction = SG_DXFER_FROM_DEV;
-} else {
-   cmd->hdr.dxfer_direction = SG_DXFER_NONE;
-}
-cmd->hdr.sbp = cmd->sense;
-cmd->hdr.mx_sb_len = cmd->sense_len;
+if (bdrv_is_sg(cmd->lun->bdrv)) {
+   memset(&cmd->hdr, 0, sizeof(struct sg_io_hdr));
+   cmd->hdr.interface_id = 'S';
+   cmd->hdr.cmd_len = cdb_len;
+   cmd->hdr.cmdp = cdb;
+   cmd->hdr.iovec_count = cmd->sge_count;
+   cmd->hdr.dxferp = cmd->iov;
+   for (n = 0; n < cmd->sge_count; n++)
+   cmd->hdr.dxfer_len += cmd->iov[n].iov_len;
+   if (cmd->sge_count) {
+   if (dir)
+   cmd->hdr.dxfer_direction = SG_DXFER_TO_DEV;
+   else
+   cmd->hdr.dxfer_direction = SG_DXFER_FROM_DEV;
+   } else {
+   cmd->hdr.dxfer_direction = SG_DXFER_NONE;
+   }
+   cmd->hdr.sbp = cmd->sense;
+   cmd->hdr.mx_sb_len = cmd->sense_len;
 
-ret = bdrv_ioctl(cmd->lun->bdrv, SG_IO, &cmd->hdr);
-if (ret) {
-   DPRINTF("SCSI pthru dev %x lun %x failed with %d\n",
-   target, lun, errno);
-   sense_len = scsi_build_sense(cmd->sense, SENSE_IO_ERROR);
-   cmd->sge_size = 0;
-   scsi_status = SAM_STAT_CHECK_CONDITION;
-} else if (cmd->hdr.status) {
-   sense_len = cmd->hdr.sb_len_wr;
-   scsi_status = cmd->hdr.status;
-   cmd->sge_size = cmd->hdr.dxfer_len;
-   scsi_status = SAM_STAT_CHECK_CONDITION;
+   ret = bdrv_ioctl(cmd->lun->bdrv, SG_IO, &cmd->hdr);
+   if (ret) {
+   DPRINTF("SCSI pthru dev %x lun %x failed with %d\n",
+   target, lun, errno);
+   sense_len = scsi_build_sense(cmd->sense, SENSE_IO_ERROR);
+   cmd->sge_size = 0;
+   scsi_status = SAM_STAT_CHECK_CONDITION;
+   } else if (cmd->hdr.status) {
+   sense_len = cmd->hdr.sb_len_wr;
+   scsi_status = cmd->hdr.status;
+   cmd->sge_size = cmd->hdr.dxfer_len;
+   scsi_status = SAM_STAT_CHECK_CONDITION;
+   } else {
+   sense_len = 0;
+   cmd->sge_size = cmd->hdr.dxfer_len;
+   }
 } else {
-   sense_len = 0;
-   cmd->sge_size = cmd->hdr.dxfer_len;
+   uint32_t sense;
+
+   DPRINTF("Emulate SCSI pthru cmd %x\n", cdb[0]);
+   sense = scsi_emulate_command(cmd->lun->bdrv, 0, cdb,
+cmd->iov[0].iov_len,
+cmd->iov[0].iov_base,
+&cmd->sge_size);
+   sense_len = scsi_build_sense(cmd->sense, sense);
+   if (sense_len)
+   scsi_status = SAM_STAT_CHECK_CONDITION;
+   else
+   scsi_status = SAM_STAT_GOOD;
 }
 out:
 megasas_unmap_sense(cmd, sense_len);
@@ -1105,13 +1120,16 @@ static int megasas_scsi_init(PCIDevice *dev)
lun->bdrv = NULL;
continue;
}
+   bdrv_set_tcq(lun->bdrv, 1);
/* check if we can use SG_IO */
ret = bdrv_ioctl(lun->bdrv, SG_IO, &hdr);
if (ret) {
-   DPRINTF("SCSI cmd passthrough not available on dev %d (error %d)\n",
+   DPRINTF("Using SCSI cmd emulation on dev %d (error %d)\n",
unit, ret);
-   lun->sdev = NULL;
-   lun->bdrv = NULL;
+   bdrv_set_sg(lun->bdrv, 0);
+   } else {
+   DPRINTF("Using SCSI cmd passthrough on dev %d\n", unit);
+   bdrv_set_sg(lun->bdrv, 1);
}
 }
 register_savevm("megasas", -1, 0, megasas_scsi_save, megasas_scsi_load, s);
-- 
1.6.0.2

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 3/4] scsi-disk: Factor out SCSI command emulation

2009-10-27 Thread Hannes Reinecke

Other drivers might want to use SCSI command emulation without
going through the SCSI disk abstraction, as this imposes too
many limits on the emulation.

Signed-off-by: Hannes Reinecke 
---
 block.c|   15 ++
 block.h|3 +
 block_int.h|1 +
 hw/scsi-disk.c |  610 ++--
 hw/scsi-disk.h |3 +
 5 files changed, 346 insertions(+), 286 deletions(-)

diff --git a/block.c b/block.c
index 33f3d65..06f92c4 100644
--- a/block.c
+++ b/block.c
@@ -930,6 +930,21 @@ int bdrv_is_sg(BlockDriverState *bs)
 return bs->sg;
 }
 
+void bdrv_set_sg(BlockDriverState *bs, int set)
+{
+bs->sg = set;
+}
+
+int bdrv_get_tcq(BlockDriverState *bs)
+{
+return bs->tcq;
+}
+
+void bdrv_set_tcq(BlockDriverState *bs, int set)
+{
+bs->tcq = set;
+}
+
 int bdrv_enable_write_cache(BlockDriverState *bs)
 {
 return bs->enable_write_cache;
diff --git a/block.h b/block.h
index a966afb..7862fa0 100644
--- a/block.h
+++ b/block.h
@@ -134,9 +134,12 @@ void bdrv_get_geometry_hint(BlockDriverState *bs,
 int *pcyls, int *pheads, int *psecs);
 int bdrv_get_type_hint(BlockDriverState *bs);
 int bdrv_get_translation_hint(BlockDriverState *bs);
+int bdrv_get_tcq(BlockDriverState *bs);
+void bdrv_set_tcq(BlockDriverState *bs, int set);
 int bdrv_is_removable(BlockDriverState *bs);
 int bdrv_is_read_only(BlockDriverState *bs);
 int bdrv_is_sg(BlockDriverState *bs);
+void bdrv_set_sg(BlockDriverState *bs, int set);
 int bdrv_enable_write_cache(BlockDriverState *bs);
 int bdrv_is_inserted(BlockDriverState *bs);
 int bdrv_media_changed(BlockDriverState *bs);
diff --git a/block_int.h b/block_int.h
index 8e72abe..e5ee57b 100644
--- a/block_int.h
+++ b/block_int.h
@@ -129,6 +129,7 @@ struct BlockDriverState {
 int encrypted; /* if true, the media is encrypted */
 int valid_key; /* if true, a valid encryption key has been set */
 int sg;/* if true, the device is a /dev/sg* */
+int tcq;   /* if true, the device supports tagged command queueing */
 /* event callback when inserting/removing */
 void (*change_cb)(void *opaque);
 void *change_opaque;
diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
index 68b4e83..3e68518 100644
--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -56,7 +56,7 @@ typedef struct SCSIRequest {
 /* Both sector and sector_count are in terms of qemu 512 byte blocks.  */
 uint64_t sector;
 uint32_t sector_count;
-struct iovec iov;
+struct iovec *iov;
 QEMUIOVector qiov;
 BlockDriverAIOCB *aiocb;
 struct SCSIRequest *next;
@@ -72,7 +72,8 @@ struct SCSIDiskState
This is the number of 512 byte blocks in a single scsi sector.  */
 int cluster_size;
 uint64_t max_lba;
-int sense;
+uint8_t sense[SCSI_SENSE_LEN];
+uint8_t sense_len;
 char drive_serial_str[21];
 QEMUBH *bh;
 };
@@ -90,13 +91,12 @@ static SCSIRequest *scsi_new_request(SCSIDevice *d, 
uint32_t tag)
 free_requests = r->next;
 } else {
 r = qemu_malloc(sizeof(SCSIRequest));
-r->iov.iov_base = qemu_memalign(512, SCSI_DMA_BUF_SIZE);
+   r->iov = NULL;
 }
 r->bus = scsi_bus_from_device(d);
 r->dev = s;
 r->tag = tag;
 r->sector_count = 0;
-r->iov.iov_len = 0;
 r->aiocb = NULL;
 r->status = 0;
 
@@ -126,6 +126,17 @@ static void scsi_remove_request(SCSIRequest *r)
 free_requests = r;
 }
 
+static void *scsi_allocate_iovec(SCSIRequest *r) {
+if (!r->iov) {
+   r->iov = qemu_malloc(sizeof(struct iovec));
+   if (!r->iov)
+   return NULL;
+   r->iov->iov_base = qemu_memalign(512, SCSI_DMA_BUF_SIZE);
+   r->iov->iov_len = SCSI_DMA_BUF_SIZE;
+}
+return r->iov;
+}
+
 static SCSIRequest *scsi_find_request(SCSIDiskState *s, uint32_t tag)
 {
 SCSIRequest *r;
@@ -137,12 +148,11 @@ static SCSIRequest *scsi_find_request(SCSIDiskState *s, 
uint32_t tag)
 return r;
 }
 
-/* Helper function to build a sense block */
 int32_t scsi_build_sense(uint8_t *sense_buf, uint32_t sense)
 {
 memset(sense_buf, 0, SCSI_SENSE_LEN);
 if (!sense)
-   return 0;
+   return 0;
 
 sense_buf[0] = 0xf0; /* current, fixed format */
 sense_buf[2] = (sense >> 16) & 0x0F;
@@ -154,15 +164,19 @@ int32_t scsi_build_sense(uint8_t *sense_buf, uint32_t 
sense)
 }
 
 /* Helper function for command completion.  */
-static void scsi_command_complete(SCSIRequest *r, int status, int sense)
+static void scsi_command_complete(SCSIRequest *r, int status, uint32_t sense)
 {
 SCSIDiskState *s = r->dev;
 uint32_t tag;
-DPRINTF("Command complete tag=0x%x status=%d sense=%d\n", r->tag, status, 
sense);
-s->sense = sense;
+
+DPRINTF("Command complete tag=0x%x status=%d sense=%d\n", r->tag,
+   status, s->sense_len);
+if (status == STATUS_CHECK_CONDITION

[PATCH 2/4] megasas: LSI MegaRAID SAS HBA emulation

2009-10-27 Thread Hannes Reinecke

This patch adds an emulation for the LSI MegaRAID SAS HBA. It is
using SG_IO to forward / pass through SCSI commands to the
underlying block driver, so no emulation is done currently.

Signed-off-by: Hannes Reinecke 
---
 Makefile.hw  |2 +-
 hw/megasas.c | 1134 ++
 hw/pci_ids.h |2 +
 3 files changed, 1137 insertions(+), 1 deletions(-)
 create mode 100644 hw/megasas.c

diff --git a/Makefile.hw b/Makefile.hw
index de1db31..cae35f9 100644
--- a/Makefile.hw
+++ b/Makefile.hw
@@ -33,7 +33,7 @@ obj-y += wdt_i6300esb.o
 obj-y += ne2000.o
 
 # SCSI layer
-obj-y += lsi53c895a.o
+obj-y += lsi53c895a.o megasas.o
 obj-$(CONFIG_ESP) += esp.o
 
 obj-y += dma-helpers.o sysbus.o isa-bus.o
diff --git a/hw/megasas.c b/hw/megasas.c
new file mode 100644
index 000..a57e8e0
--- /dev/null
+++ b/hw/megasas.c
@@ -0,0 +1,1134 @@
+/*
+ * QEMU MegaRAID SAS 8708EM2 Host Bus Adapter emulation
+ *
+ * Copyright (c) 2009 Hannes Reinecke, SUSE Linux Products GmbH
+ *
+ * This code is licenced under the LGPL.
+ */
+
+
+#include 
+
+#include "hw.h"
+#include "pci.h"
+#include "scsi.h"
+#include "scsi-disk.h"
+#include "block_int.h"
+#ifdef __linux__
+# include 
+#endif
+
+#undef DEBUG_MEGASAS
+#undef DEBUG_MEGASAS_REG
+#undef DEBUG_MEGASAS_QUEUE
+#undef DEBUG_MEGASAS_MFI
+
+#ifdef DEBUG_MEGASAS
+#define DPRINTF(fmt, ...) \
+do { printf("megasas: " fmt , ## __VA_ARGS__); } while (0)
+#define BADF(fmt, ...) \
+do { fprintf(stderr, "megasas: error: " fmt , ## __VA_ARGS__); exit(1);} while 
(0)
+#else
+#define DPRINTF(fmt, ...) do {} while(0)
+#define BADF(fmt, ...) \
+do { fprintf(stderr, "megasas: error: " fmt , ## __VA_ARGS__);} while (0)
+#endif
+
+/* Static definitions */
+#define MEGASAS_MAX_FRAMES 64
+#define MEGASAS_MAX_SGE 8
+
+/* SCSI definitions */
+#define SAM_STAT_GOOD0x00
+#define SAM_STAT_CHECK_CONDITION 0x02
+#define SAM_STAT_CONDITION_MET   0x04
+#define SAM_STAT_BUSY0x08
+#define SAM_STAT_TASK_ABORTED0x40
+
+/* Register definitions */
+#defineMEGASAS_INBOUND_MSG_0   0x0010
+#defineMEGASAS_INBOUND_MSG_1   0x0014
+#defineMEGASAS_OUTBOUND_MSG_0  0x0018
+#defineMEGASAS_OUTBOUND_MSG_1  0x001C
+#defineMEGASAS_INBOUND_DOORBELL0x0020
+#defineMEGASAS_INBOUND_INTR_STATUS 0x0024
+#defineMEGASAS_INBOUND_INTR_MASK   0x0028
+#defineMEGASAS_OUTBOUND_DOORBELL   0x002C
+#defineMEGASAS_OUTBOUND_INTR_STATUS0x0030
+#defineMEGASAS_OUTBOUND_INTR_MASK  0x0034
+#defineMEGASAS_INBOUND_QUEUE_PORT  0x0040
+#defineMEGASAS_OUTBOUND_QUEUE_PORT 0x0044
+#defineMEGASAS_OUTBOUND_DOORBELL_CLEAR 0x00A0
+#defineMEGASAS_OUTBOUND_SCRATCH_PAD0x00B0
+#defineMEGASAS_INBOUND_LOW_QUEUE_PORT  0x00C0
+#defineMEGASAS_INBOUND_HIGH_QUEUE_PORT 0x00C4
+
+/* FW commands */
+#define MFI_INIT_ABORT 0x0001
+#define MFI_INIT_READY 0x0002
+#define MFI_INIT_MFIMODE   0x0004
+#define MFI_INIT_CLEAR_HANDSHAKE   0x0008
+#define MFI_INIT_HOTPLUG   0x0010
+#define MFI_STOP_ADP   0x0020
+
+/* MFI states */
+#define MFI_STATE_UNDEFINED0x0
+#define MFI_STATE_BB_INIT  0x1
+#define MFI_STATE_FW_INIT  0x4
+#define MFI_STATE_WAIT_HANDSHAKE   0x6
+#define MFI_STATE_FW_INIT_20x7
+#define MFI_STATE_DEVICE_SCAN  0x8
+#define MFI_STATE_BOOT_MESSAGE_PENDING 0x9
+#define MFI_STATE_FLUSH_CACHE  0xA
+#define MFI_STATE_READY0xB
+#define MFI_STATE_OPERATIONAL  0xC
+#define MFI_STATE_FAULT0xF
+
+/*
+ * MFI command opcodes
+ */
+#define MFI_CMD_INIT   0x00
+#define MFI_CMD_LD_READ0x01
+#define MFI_CMD_LD_WRITE   0x02
+#define MFI_CMD_LD_SCSI_IO 0x03
+#define MFI_CMD_PD_SCSI_IO 0x04
+#define MFI_CMD_DCMD   0x05
+#define MFI_CMD_ABORT  0x06
+#define MFI_CMD_SMP0x07
+#define MFI_CMD_STP0x08
+
+#define MR_DCMD_CTRL_GET_INFO  0x0101
+
+#define MR_DCMD_CTRL_CACHE_FLUSH   0x01101000
+#define MR_FLUSH_CTRL_CACHE0x01
+#define MR_FLUSH_DISK_CACHE0x02
+
+#define MR_DCMD_CTRL_SHUTDOWN  0x0105
+#define MR_DCMD_HIBERNATE_SHUTDOWN 0x0106
+#define MR

[PATCH 0/4] megaraid_sas HBA emulation

2009-10-27 Thread Hannes Reinecke
Hi all,

this patchset implements an emulation for the megaraid_sas HBA.
It emulates an LSI MegaRAID SAS 8708EM2 HBA, ie
presenting a virtual SCSI adapter to the guest.
Internally it is using aio for read/write requests and
either SG_IO or SCSI command emulation for everything else.

The reason for choosing the megaraid_sas HBA and not, say,
implementing a virtio scsi interface is:
- the megaraid_sas uses a very simple firmware interface,
  comparable to virtio
- the HBA drivers already exist, so I only have to
  write the backend :-)

The device can be accessed by

-drive if=raid,file=XXX
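
For example (a made-up invocation; the paths and the rest of the
command line are just placeholders):

  qemu-system-x86_64 -m 1024 -hda guest.img \
      -drive if=raid,file=/dev/sdc,format=host_device \
      -drive if=raid,file=scratch.img

Every if=raid drive shows up as another disk behind the emulated
megasas adapter in the guest.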

In order to support SCSI command emulation I had to update /
patch up the existing SCSI disk support. This might not be
to everyone's taste, so I'm open to alternative
suggestions.

But I certainly do _not_ want to update the SCSI disk
emulation, as this is really quite tied to the SCSI parallel
interface used by the old lsi53c895a.c.
Plus it doesn't do scatter-gather list handling, which
is quite impossible to fix without proper documentation.

Of course, if anyone else would step in here, I won't object :-)

It currently runs guests with 2.6.27 and up; Windows XP
support is not quite there yet. Anything else might work;
if not, enable debugging and send me the logfile.
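
(Enabling debugging means flipping the #undef at the top of
hw/megasas.c and rebuilding, i.e.

  -#undef DEBUG_MEGASAS
  +#define DEBUG_MEGASAS

which turns the DPRINTF() statements into printf output; the
DEBUG_MEGASAS_REG/_QUEUE/_MFI switches presumably gate their
respective areas the same way.)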

As usual, comments / suggestions etc. are welcome.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 1/4] Add 'raid' interface class

2009-10-27 Thread Hannes Reinecke

This patch adds a 'raid' interface class. It is basically a clone
of the existing 'scsi' interface, only allowing up to 128 disks.

Signed-off-by: Hannes Reinecke 
---
 hw/pc.c  |5 +
 hw/pci-hotplug.c |1 +
 hw/scsi-disk.c   |   17 +
 hw/scsi-disk.h   |   20 +++-
 qemu-config.c|2 +-
 sysemu.h |3 ++-
 vl.c |8 ++--
 7 files changed, 51 insertions(+), 5 deletions(-)
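
A note on the SENSE_* constants added to scsi-disk.h below (just an
illustration, not part of the patch): each one packs the fixed-format
sense key, ASC and ASCQ into a single integer, which scsi_build_sense()
then unpacks into the sense buffer. Roughly:

  /* sketch only: sense key in bits 16+, ASC in bits 8-15, ASCQ in bits 0-7 */
  #define SENSE_CODE(key, asc, ascq) (((key) << 16) | ((asc) << 8) | (ascq))

  /* e.g. SENSE_LUN_NOT_READY == SENSE_CODE(0x02, 0x04, 0x03):
   *      NOT READY, manual intervention required */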

diff --git a/hw/pc.c b/hw/pc.c
index 83012a9..26aad4c 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1345,6 +1345,11 @@ static void pc_init1(ram_addr_t ram_size,
for (bus = 0; bus <= max_bus; bus++) {
 pci_create_simple(pci_bus, -1, "lsi53c895a");
 }
+
+   max_bus = drive_get_max_bus(IF_RAID);
+   for (bus = 0; bus <= max_bus; bus++) {
+   pci_create_simple(pci_bus, -1, "megasas");
+   }
 }
 
 if (extboot_drive) {
diff --git a/hw/pci-hotplug.c b/hw/pci-hotplug.c
index 410fa3f..855a1ad 100644
--- a/hw/pci-hotplug.c
+++ b/hw/pci-hotplug.c
@@ -85,6 +85,7 @@ void drive_hot_add(Monitor *mon, const QDict *qdict)
 
 switch (type) {
 case IF_SCSI:
+case IF_RAID:
 if (pci_read_devaddr(mon, pci_addr, &dom, &pci_bus, &slot)) {
 goto err;
 }
diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
index 2a9268a..68b4e83 100644
--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -41,6 +41,7 @@ do { fprintf(stderr, "scsi-disk: " fmt , ## __VA_ARGS__); } while (0)
 
 #define SCSI_DMA_BUF_SIZE131072
 #define SCSI_MAX_INQUIRY_LEN 256
+#define SCSI_SENSE_LEN 18
 
 #define SCSI_REQ_STATUS_RETRY 0x01
 
@@ -136,6 +137,22 @@ static SCSIRequest *scsi_find_request(SCSIDiskState *s, uint32_t tag)
 return r;
 }
 
+/* Helper function to build a sense block */
+int32_t scsi_build_sense(uint8_t *sense_buf, uint32_t sense)
+{
+memset(sense_buf, 0, SCSI_SENSE_LEN);
+if (!sense)
+   return 0;
+
+sense_buf[0] = 0xf0; /* current, fixed format */
+sense_buf[2] = (sense >> 16) & 0x0F;
+sense_buf[7] = 10;
+sense_buf[12] = (sense >> 8 ) & 0xFF;
+sense_buf[13] = sense & 0xFF;
+
+return SCSI_SENSE_LEN;
+}
+
 /* Helper function for command completion.  */
 static void scsi_command_complete(SCSIRequest *r, int status, int sense)
 {
diff --git a/hw/scsi-disk.h b/hw/scsi-disk.h
index b6b6c12..5b54272 100644
--- a/hw/scsi-disk.h
+++ b/hw/scsi-disk.h
@@ -9,6 +9,23 @@ enum scsi_reason {
 SCSI_REASON_DATA  /* Transfer complete, more data required.  */
 };
 
+/* LUN not ready, Manual intervention required */
+#define SENSE_LUN_NOT_READY 0x020403
+/* Hardware error, I/O process terminated */
+#define SENSE_IO_ERROR 0x040006
+/* Hardware error, I_T Nexus loss occurred */
+#define SENSE_TAG_NOT_FOUND 0x042907
+/* Hardware error, internal target failure */
+#define SENSE_TARGET_FAILURE 0x044400
+/* Illegal request, invalid command operation code */
+#define SENSE_INVALID_OPCODE 0x052000
+/* Illegal request, LBA out of range */
+#define SENSE_LBA_OUT_OF_RANGE 0x052100
+/* Illegal request, Invalid field in CDB */
+#define SENSE_INVALID_FIELD 0x052400
+/* Illegal request, LUN not supported */
+#define SENSE_LUN_NOT_SUPPORTED 0x052500
+
 typedef struct SCSIBus SCSIBus;
 typedef struct SCSIDevice SCSIDevice;
 typedef struct SCSIDeviceInfo SCSIDeviceInfo;
@@ -49,7 +66,7 @@ struct SCSIBus {
 int tcq, ndev;
 scsi_completionfn complete;
 
-SCSIDevice *devs[8];
+SCSIDevice *devs[128];
 };
 
 void scsi_bus_new(SCSIBus *bus, DeviceState *host, int tcq, int ndev,
@@ -63,5 +80,6 @@ static inline SCSIBus *scsi_bus_from_device(SCSIDevice *d)
 
 SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, DriveInfo *dinfo, int unit);
 void scsi_bus_legacy_handle_cmdline(SCSIBus *bus);
+int32_t scsi_build_sense(uint8_t *sense_buf, uint32_t sense);
 
 #endif
diff --git a/qemu-config.c b/qemu-config.c
index 4fb7898..8d7a137 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -18,7 +18,7 @@ QemuOptsList qemu_drive_opts = {
 },{
 .name = "if",
 .type = QEMU_OPT_STRING,
-.help = "interface (ide, scsi, sd, mtd, floppy, pflash, virtio)",
+.help = "interface (ide, scsi, raid, sd, mtd, floppy, pflash, virtio)",
 },{
 .name = "index",
 .type = QEMU_OPT_NUMBER,
diff --git a/sysemu.h b/sysemu.h
index 2ef3797..8ed0b8c 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -159,7 +159,7 @@ extern unsigned int nb_prom_envs;
 
 typedef enum {
 IF_NONE,
-IF_IDE, IF_SCSI, IF_FLOPPY, IF_PFLASH, IF_MTD, IF_SD, IF_VIRTIO, IF_XEN,
+IF_IDE, IF_SCSI, IF_RAID, IF_FLOPPY, IF_PFLASH, IF_MTD, IF_SD, IF_VIRTIO, IF_XEN,
 IF_COUNT
 } BlockInterfaceType;
 
@@ -185,6 +185,7 @@ typedef struct DriveInfo {
 
 #define MAX_IDE_DEVS   2
 #define MAX_SCSI_DEVS  7
+#define MAX_RAID_DEVS  128
 #define MAX_DRIVES 32
 
 exte

Re: [RFC] [PATCH] SCSI passthrough for virtio-blk

2008-08-29 Thread Hannes Reinecke
Hi Christian,

Christian Borntraeger wrote:
> Am Freitag, 29. August 2008 schrieb Hannes Reinecke:
>> Hmm. Works here, using an unpatched kvm-73.
>> Which version did you use?
> 
> I use the s390 userspace prototype kuli which uses an virtio transport 
> similar 
> to lguest.
> 
> I retried and it seems to race. Most of the time it works fine, but sometimes 
> sdparm hangs. I will have a 2nd look. 
> 
> sysrq-t gives me the following trace:
> 
> Call Trace:
> ([<04000755bc78>] 0x4000755bc78)
> sdparmD 0043659e 0  2493  1
> 0012004a 0744f740 0744f778 001896469fd23785
>0744f778 009e5500 0043f230 00120130
>0744f778 06d39400 06d39f80 0001
>009e6f00 076bf8e8 0744f7c8 07530670
>0043f610 00435e66 0744f7c8 0744f868
> Call Trace:
> ([<00435e66>] schedule+0x32e/0x7ec)
>  [<0043659e>] schedule_timeout+0xba/0x10c
>  [<004358da>] wait_for_common+0xbe/0x1a8
>  [<0027ec3e>] blk_execute_rq+0x86/0xc4
>  [<00282768>] sg_io+0x1a4/0x360
>  [<00282f8c>] scsi_cmd_ioctl+0x2bc/0x3f0
>  [<002c3108>] virtblk_ioctl+0x44/0x58
>  [<0027ff18>] blkdev_driver_ioctl+0x98/0xa4
>  [<0027ffd8>] blkdev_ioctl+0xb4/0x7f8
>  [<001e1572>] block_ioctl+0x3a/0x48
>  [<001bca0a>] vfs_ioctl+0x52/0xdc
>  [<001bcb0a>] do_vfs_ioctl+0x76/0x350
>  [<001bce6e>] sys_ioctl+0x8a/0xa0
>  [<0011282c>] sysc_tracego+0xe/0x14
>  [<020000114286>] 0x2114286

I'm tempted to say 'not my fault'; the submitted SCSI request on
the _host_ hangs and doesn't come back.
Looks more like a SCSI problem on the host ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
[EMAIL PROTECTED] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [RFC] [PATCH] SCSI passthrough for virtio-blk

2008-08-29 Thread Hannes Reinecke
Hi Christian,

Christian Borntraeger wrote:
> Am Freitag, 29. August 2008 schrieb Hannes Reinecke:
>> So when using '-drive file=/dev/sgXX,if=virtio,format=host_device' you
>> can happily call any sg_XX command on the resulting vdX device. Quite
>> neat, methinks. And it's even backwards compatible, so each of these
>> patches should work without the other one applied.
> 
> Does not work here. If the host does not support the pass-through, the device 
> drivers waits for an response. I tried  sdparm /dev/vda with a patched kernel 
> and an unpatched userspace.
> 
Hmm. Works here, using an unpatched kvm-73.
Which version did you use?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
[EMAIL PROTECTED] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[RFC] [PATCH] SCSI passthrough for virtio-blk

2008-08-29 Thread Hannes Reinecke

Hi all,

I got bored and implemented SCSI passthrough for the virtio-blk driver.
The principle is quite simple: put the missing fields (cdb, sense and
status header) on the virtio queue and then call the SG_IO ioctl on the
host.

So when using '-drive file=/dev/sgXX,if=virtio,format=host_device' you
can happily call any sg_XX command on the resulting vdX device. Quite
neat, methinks. And it's even backwards compatible, so each of these
patches should work without the other one applied.
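
For testing from inside the guest any SG_IO user will do; below is a
minimal sketch (assuming the disk shows up as /dev/vda) that sends an
INQUIRY through the new path. sdparm and the sg3_utils tools exercise
the same ioctl.

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <scsi/sg.h>

  int main(void)
  {
          unsigned char cdb[6] = { 0x12, 0, 0, 0, 96, 0 }; /* INQUIRY, 96 bytes */
          unsigned char buf[96], sense[32];
          struct sg_io_hdr hdr;
          int fd = open("/dev/vda", O_RDONLY);

          if (fd < 0)
                  return 1;
          memset(&hdr, 0, sizeof(hdr));
          hdr.interface_id = 'S';
          hdr.cmdp = cdb;
          hdr.cmd_len = sizeof(cdb);
          hdr.dxferp = buf;
          hdr.dxfer_len = sizeof(buf);
          hdr.dxfer_direction = SG_DXFER_FROM_DEV;
          hdr.sbp = sense;
          hdr.mx_sb_len = sizeof(sense);
          hdr.timeout = 5000;             /* milliseconds */
          /* travels guest block layer -> virtqueue -> host SG_IO */
          if (ioctl(fd, SG_IO, &hdr) < 0) {
                  close(fd);
                  return 1;
          }
          printf("vendor: %.8s model: %.16s\n", buf + 8, buf + 16);
          close(fd);
          return 0;
  }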

As one would have guessed there are two patches: one for the Linux
kernel, modifying the virtio-blk driver in the guest, and one for the
qemu/kvm userland program, modifying the virtio-blk driver on the host.
The latter is relative to avi's kvm-userland tree from kernel.org.

As usual, comments etc to me.

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
[EMAIL PROTECTED] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
virtio: Implement SCSI passthrough for virtio-blk

This patch implements SCSI passthrough for any virtio-blk device.
The data on the virtio queue will only be modified for a SCSI command,
so the normal I/O flow is unchanged.
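
(Roughly, judging from the do_req() changes below, the virtqueue
layout for a SCSI request then becomes:

  /* host-readable (out) */
  struct virtio_blk_outhdr out_hdr;  /* type, ioprio, sector */
  u8 cdb[cmd_len];                   /* the SCSI command block */
  /* data buffers, direction depending on the command */
  /* host-writable (in) */
  u8 sense[96];                      /* sense data */
  struct virtio_blk_inhdr in_hdr;    /* residual, data_len, sense_len, status */
  u8 status;                         /* VIRTIO_BLK_S_* */

while ordinary reads and writes keep the old out_hdr / data / status
layout.)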

Signed-off-by: Hannes Reinecke <[EMAIL PROTECTED]>

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4225109..46f03d2 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -35,6 +35,7 @@ struct virtblk_req
struct list_head list;
struct request *req;
struct virtio_blk_outhdr out_hdr;
+   struct virtio_blk_inhdr in_hdr;
u8 status;
 };
 
@@ -47,20 +48,29 @@ static void blk_done(struct virtqueue *vq)
 
spin_lock_irqsave(&vblk->lock, flags);
while ((vbr = vblk->vq->vq_ops->get_buf(vblk->vq, &len)) != NULL) {
-   int uptodate;
+   int error;
+   unsigned int bytes;
switch (vbr->status) {
case VIRTIO_BLK_S_OK:
-   uptodate = 1;
+   error = 0;
break;
case VIRTIO_BLK_S_UNSUPP:
-   uptodate = -ENOTTY;
+   error = -ENOTTY;
break;
default:
-   uptodate = 0;
+   error = -EIO;
break;
}
 
-   end_dequeued_request(vbr->req, uptodate);
+   if (blk_pc_request(vbr->req)) {
+   vbr->req->data_len = vbr->in_hdr.residual;
+   bytes = vbr->in_hdr.data_len;
+   vbr->req->sense_len = vbr->in_hdr.sense_len;
+   vbr->req->errors = vbr->in_hdr.status;
+   } else
+   bytes = blk_rq_bytes(vbr->req);
+
+   __blk_end_request(vbr->req, error, bytes);
list_del(&vbr->list);
mempool_free(vbr, vblk->pool);
}
@@ -72,7 +82,7 @@ static void blk_done(struct virtqueue *vq)
 static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
   struct request *req)
 {
-   unsigned long num, out, in;
+   unsigned long num, out = 0, in = 0;
struct virtblk_req *vbr;
 
vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
@@ -99,20 +109,31 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 
/* This init could be done at vblk creation time */
sg_init_table(vblk->sg, VIRTIO_MAX_SG);
-   sg_set_buf(&vblk->sg[0], &vbr->out_hdr, sizeof(vbr->out_hdr));
-   num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
-   sg_set_buf(&vblk->sg[num+1], &vbr->status, sizeof(vbr->status));
-
-   if (rq_data_dir(vbr->req) == WRITE) {
-   vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
-   out = 1 + num;
-   in = 1;
-   } else {
-   vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
-   out = 1;
-   in = 1 + num;
+   sg_set_buf(&vblk->sg[out], &vbr->out_hdr, sizeof(vbr->out_hdr));
+   out++;
+   if (blk_pc_request(vbr->req)) {
+   sg_set_buf(&vblk->sg[out], vbr->req->cmd, vbr->req->cmd_len);
+   out++;
+   }
+   num = blk_rq_map_sg(q, vbr->req, vblk->sg+out);
+   if (blk_pc_request(vbr->req)) {
+   sg_set_buf(&vblk->sg[num+out+in], vbr->req->sense, 96);
+   in++;
+   sg_set_buf(&vblk->sg[num+out+in], &vbr->in_hdr,
+  sizeof(vbr->in_hdr));
+   in++;
+   }
+   sg_set_buf(&vblk->sg[num+out+in], &vbr->status, sizeof(vbr->status));
+   in++;