Re: [PATCH 1/2] hw/net/net_tx_pkt: add function to check pkt->max_raw_frags
On 2020/7/29 上午12:26, Mauro Matteo Cascella wrote: On Tue, Jul 28, 2020 at 6:06 AM Jason Wang wrote: On 2020/7/28 上午1:08, Mauro Matteo Cascella wrote: This patch introduces a new function in hw/net/net_tx_pkt.{c,h} to check the current data fragment against the maximum number of data fragments. I wonder whether it's better to do the check in net_tx_pkt_add_raw_fragment() and fail there. Given the assertion, I assumed the caller is responsible for the check, but moving the check in net_tx_pkt_add_raw_fragment() totally makes sense to me. Want to send a new version for this? Btw, I find net_tx_pkt_add_raw_fragment() does not unmap dma when returning to true, is this a bug? Isn't it unmapped in net_tx_pkt_reset()? Probably but see how it was used in e1000e, the net_tx_pkt_reset() is only called when eop is set. Is this a bug? Thanks Thanks Reported-by: Ziming Zhang Signed-off-by: Mauro Matteo Cascella --- hw/net/net_tx_pkt.c | 5 + hw/net/net_tx_pkt.h | 8 2 files changed, 13 insertions(+) diff --git a/hw/net/net_tx_pkt.c b/hw/net/net_tx_pkt.c index 9560e4a49e..d035618f2c 100644 --- a/hw/net/net_tx_pkt.c +++ b/hw/net/net_tx_pkt.c @@ -400,6 +400,11 @@ bool net_tx_pkt_add_raw_fragment(struct NetTxPkt *pkt, hwaddr pa, } } +bool net_tx_pkt_exceed_max_fragments(struct NetTxPkt *pkt) +{ +return pkt->raw_frags >= pkt->max_raw_frags; +} + bool net_tx_pkt_has_fragments(struct NetTxPkt *pkt) { return pkt->raw_frags > 0; diff --git a/hw/net/net_tx_pkt.h b/hw/net/net_tx_pkt.h index 4ec8bbe9bd..e2ee46ae03 100644 --- a/hw/net/net_tx_pkt.h +++ b/hw/net/net_tx_pkt.h @@ -179,6 +179,14 @@ bool net_tx_pkt_send_loopback(struct NetTxPkt *pkt, NetClientState *nc); */ bool net_tx_pkt_parse(struct NetTxPkt *pkt); +/** +* indicates if the current data fragment exceeds max_raw_frags +* +* @pkt:packet +* +*/ +bool net_tx_pkt_exceed_max_fragments(struct NetTxPkt *pkt); + /** * indicates if there are data fragments held by this packet object. *
Re: [PATCH] docs/nvdimm: add 'pmem=on' for the device dax backend file
> At the end of live migration, QEMU uses msync() to flush the data to > the backend storage. When the backend file is a character device dax, > the pages explicitly avoid the page cache. It will return failure from > msync(). > The following warning is output. > > "warning: qemu_ram_msync: failed to sync memory range" > > So we add 'pmem=on' to avoid calling msync(), use the QEMU command line: > > -object memory-backend-file,id=mem1,pmem=on,mem-path=/dev/dax0.0,size=4G > > Reviewed-by: Stefan Hajnoczi > Signed-off-by: Jingqi Liu > --- > docs/nvdimm.txt | 7 +++ > 1 file changed, 7 insertions(+) > > diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt > index c2c6e441b3..31048aff5e 100644 > --- a/docs/nvdimm.txt > +++ b/docs/nvdimm.txt > @@ -243,6 +243,13 @@ use the QEMU command line: > > -object memory-backend-file,id=nv_mem,mem-path=/XXX/yyy,size=4G,pmem=on > > +At the end of live migration, QEMU uses msync() to flush the data to the > +backend storage. When the backend file is a character device dax, the pages > +explicitly avoid the page cache. It will return failure from msync(). > +So we add 'pmem=on' to avoid calling msync(), use the QEMU command line: > + > +-object memory-backend-file,id=mem1,pmem=on,mem-path=/dev/dax0.0,size=4G > + > References > -- > > -- Good to document this. Reviewed-by: Pankaj Gupta > 2.17.1 > >
Re: [PATCH] introduce VFIO-over-socket protocol specification
Thanos is on vacation. My comments embedded. JJ > On Jul 29, 2020, at 5:48 AM, Stefan Hajnoczi wrote: > > On Wed, Jul 22, 2020 at 11:42:26AM +, Thanos Makatos wrote: diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over- >>> socket.rst new file mode 100644 index 000..723b944 --- /dev/null +++ b/docs/devel/vfio-over-socket.rst @@ -0,0 +1,1135 @@ +*** +VFIO-over-socket Protocol Specification +*** + +Version 0.1 >>> >>> Please include a reference to the section below explaining how >>> versioning works. >> >> I'm not sure I understand, do you mean we should add something like the >> following (right below "Version 0.1"): >> >> "Refer to section 1.2.3 on how versioning works." >> >> ? > > Yes, coming across this version number the reader has no idea about its > meaning and how the protocol is versioned. The spec then moves on to > other things. It would be helpful to reference the section that explains > how versioning works so that the reader knows where to go to understand > the meaning of the number. > OK, we’ll add a forward reference >>> What about the ordering semantics at the vfio-user protocol level? For >>> example, if a client sends multiple VFIO_USER_DMA_MAP/UNMAP >>> messages >>> then the order matters. What are the rules? I wonder if maybe >>> concurrency is a special case and really only applies to a subset of >>> protocol commands? >> >> All commands are executed in the order they were sent, regardless of whether >> a >> reply is needed. >> >>> >>> I'm not sure how a client would exploit parallelism in this protocol. >>> Can you give an example of a case where there would be multiple commands >>> pending on a single device? >> >> For instance, a client can issue the following operations back to back >> without >> waiting for the first two to complete: >> 1. map a DMA region >> 2. trigger some device-specific operation that results in data being read >> into >> that DMA region, and >> 3. 
unmap the DMA region > > That is pipelining, but I don't see the ability to "reorder > non-conflicting requests in parallel". That example has no parallelism. > > It's unclear to me what the spec means by concurrency and parallelism. > > If the intention is just to allow pipelining then request identifiers > aren't really necessary. The client can keep track of which request will > complete next. So I'm wondering if there is some parallelism somewhere > that I'm missing… > The reason we have message IDs is so the client knows which request is being acknowledged if it has more than one non-ack'd request outstanding. Requests are executed in-order; the only time parallelism can happen is if multiple client threads send requests in parallel. A single thread can pipeline requests, but those requests are not executed out of order by the server. I’ll try to re-word it to be clearer. > >>> + +Socket Disconnection Behavior +- +The server and the client can disconnect from each other, either >>> intentionally +or unexpectedly. Both the client and the server need to know how to >>> handle such +events. + +Server Disconnection + +A server disconnecting from the client may indicate that: + +1) A virtual device has been restarted, either intentionally (e.g. because of >>> a +device update) or unintentionally (e.g. because of a crash). In any case, >>> the +virtual device will come back so the client should not do anything (e.g. >>> simply +reconnect and retry failed operations). + +2) A virtual device has been shut down with no intention to be restarted. + +It is impossible for the client to know whether or not a failure is +intermittent or innocuous and should be retried, therefore the client >>> should +attempt to reconnect to the socket. Since an intentional server restart >>> (e.g. +due to an upgrade) might take some time, a reasonable timeout should >>> be used. +In cases where the disconnection is expected (e.g. 
the guest shutting >>> down), no +new requests will be sent anyway so this situation doesn't pose a >>> problem. The +control stack will clean up accordingly. + +Parametrizing this behaviour by having the virtual device advertise a >>> >>> s/Parametrizing/Parameterizing/ >> >> OK. >> >>> +reasonable reconnect is deferred to a future version of the protocol. >>> >>> No mention is made of recovering state or how disconnect maps to VFIO >>> device types (PCI, etc.). Does a disconnect mean that the device has >>> been reset? >> >> Regarding recovering state, I believe that since all the building blocks are >> there and the client is pretty much the master in the
Re: [PATCH for-5.2 0/6] Continue booting in case the first device is not bootable
On 29/07/2020 13.42, Viktor Mihajlovski wrote: > > > On 7/28/20 8:37 PM, Thomas Huth wrote: >> If the user did not specify a "bootindex" property, the s390-ccw bios >> tries to find a bootable device on its own. Unfortunately, it always >> stops at the very first device that it can find, no matter whether it's >> bootable or not. That causes some weird behavior, for example while >> >> qemu-system-s390x -hda bootable.qcow2 >> >> boots perfectly fine, the bios refuses to work if you just specify >> a virtio-scsi controller in front of it: >> >> qemu-system-s390x -device virtio-scsi -hda bootable.qcow2 >> >> Since this is quite uncomfortable and confusing for the users, and >> all major firmwares on other architectures correctly boot in such >> cases, too, let's also try to teach the s390-ccw bios how to boot >> in such cases. >> >> For this, we have to get rid of the various panic()s and IPL_assert() >> statements at the "low-level" function and let the main code handle >> the decision instead whether a boot from a device should fail or not, >> so that the main code can continue searching in case it wants to. >> > > Looking at it from an architectural perspective: If an IPL Information > Block specifying the boot device has been set and can be retrieved using > Diagnose 308 it has to be respected, even if the device doesn't contain > a bootable program. The boot has to fail in this case. > > I had not the bandwidth to follow all code paths, but I gather that this > is still the case with the series. Right. Just to be sure, I just double-checked with: ... -device virtio-blk,drive=baddrive,bootindex=1 \ -device virtio-blk,drive=gooddrive and indeed, the s390-ccw bios only checks the "baddrive" here and refuses to boot. > So one can argue that these changes > are taking care of an undefined situation (real hardware will always > have the IPIB set). > > As long as the architecture is not violated, I can live with the > proposed changes. Thanks!
> I however would like to point out that this only > covers a corner case (no -boot or -device ..,bootindex specified). Sure. We™ should/could maybe also add some more documentation to https://www.qemu.org/docs/master/system/target-s390x.html to make it more clear to the "unexperienced" qemu-system-s390x users that "bootindex" is the preferred / architected way of booting there. > Please don't create the impression that this patches will lead to the > same behavior as on other platforms. Ok, I'll try to state that more clearly in the cover letter of v2. Thomas
[PATCH 2/2] target/arm: Fix compile error.
When I compile qemu as follows: git clone https://git.qemu.org/git/qemu.git cd qemu git submodule init git submodule update --recursive ./configure make There is an error log: /home/LiKaige/qemu/target/arm/translate-a64.c: In function ‘disas_ldst’: /home/LiKaige/qemu/target/arm/translate-a64.c:3392:5: error: ‘fn’ may be used uninitialized in this function [-Werror=maybe-uninitialized] fn(cpu_reg(s, rt), clean_addr, tcg_rs, get_mem_index(s), ^ /home/LiKaige/qemu/target/arm/translate-a64.c:3318:22: note: ‘fn’ was declared here AtomicThreeOpFn *fn; ^ cc1: all warnings being treated as errors So, add an initialization value for fn to fix this. Signed-off-by: Kaige Li --- target/arm/translate-a64.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c index 8c07649..910a91f 100644 --- a/target/arm/translate-a64.c +++ b/target/arm/translate-a64.c @@ -3315,7 +3315,7 @@ static void disas_ldst_atomic(DisasContext *s, uint32_t insn, bool r = extract32(insn, 22, 1); bool a = extract32(insn, 23, 1); TCGv_i64 tcg_rs, clean_addr; -AtomicThreeOpFn *fn; +AtomicThreeOpFn *fn = tcg_gen_atomic_fetch_add_i64; if (is_vector || !dc_isar_feature(aa64_atomics, s)) { unallocated_encoding(s); -- 2.1.0
[PATCH 1/2] virtio-mem: Change PRIx32 to PRIXPTR to fix compile error.
When I compile qemu as follows: git clone https://git.qemu.org/git/qemu.git cd qemu git submodule init git submodule update --recursive ./configure make There is an error log: /home/LiKaige/qemu/hw/virtio/virtio-mem.c: In function ‘virtio_mem_set_block_size’: /home/LiKaige/qemu/hw/virtio/virtio-mem.c:756:9: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 7 has type ‘uintptr_t’ [-Werror=format=] error_setg(errp, "'%s' property has to be at least 0x%" PRIx32, name, ^ cc1: all warnings being treated as errors /home/LiKaige/qemu/rules.mak:69: recipe for target 'hw/virtio/virtio-mem.o' failed So, change PRIx32 to PRIXPTR to fix this. Signed-off-by: Kaige Li --- hw/virtio/virtio-mem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index c12e9f7..3dcaf9a 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -753,7 +753,7 @@ static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name, } if (value < VIRTIO_MEM_MIN_BLOCK_SIZE) { -error_setg(errp, "'%s' property has to be at least 0x%" PRIx32, name, +error_setg(errp, "'%s' property has to be at least 0x%" PRIXPTR "\n", name, VIRTIO_MEM_MIN_BLOCK_SIZE); return; } else if (!is_power_of_2(value)) { -- 2.1.0
Re: device compatibility interface for live migration with assigned devices
On Wed, Jul 29, 2020 at 01:12:55PM -0600, Alex Williamson wrote: > On Wed, 29 Jul 2020 12:28:46 +0100 > Sean Mooney wrote: > > > On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote: > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: > > > > On Mon, 27 Jul 2020 15:24:40 +0800 > > > > Yan Zhao wrote: > > > > > > > > > > > As you indicate, the vendor driver is responsible for checking > > > > > > > version > > > > > > > information embedded within the migration stream. Therefore a > > > > > > > migration should fail early if the devices are incompatible. Is > > > > > > > it > > > > > > > > > > > > but as I know, currently in VFIO migration protocol, we have no way > > > > > > to > > > > > > get vendor specific compatibility checking string in migration > > > > > > setup stage > > > > > > (i.e. .save_setup stage) before the device is set to _SAVING state. > > > > > > In this way, for devices who does not save device data in precopy > > > > > > stage, > > > > > > the migration compatibility checking is as late as in stop-and-copy > > > > > > stage, which is too late. > > > > > > do you think we need to add the getting/checking of vendor specific > > > > > > compatibility string early in save_setup stage? > > > > > > > > > > > > > > > > hi Alex, > > > > > after an offline discussion with Kevin, I realized that it may not be > > > > > a > > > > > problem if migration compatibility check in vendor driver occurs late > > > > > in > > > > > stop-and-copy phase for some devices, because if we report device > > > > > compatibility attributes clearly in an interface, the chances for > > > > > libvirt/openstack to make a wrong decision is little. > > > > > > > > I think it would be wise for a vendor driver to implement a pre-copy > > > > phase, even if only to send version information and verify it at the > > > > target. 
Deciding you have no device state to send during pre-copy does > > > > not mean your vendor driver needs to opt-out of the pre-copy phase > > > > entirely. Please also note that pre-copy is at the user's discretion, > > > > we've defined that we can enter stop-and-copy at any point, including > > > > without a pre-copy phase, so I would recommend that vendor drivers > > > > validate compatibility at the start of both the pre-copy and the > > > > stop-and-copy phases. > > > > > > > > > > ok. got it! > > > > > > > > so, do you think we are now arriving at an agreement that we'll give > > > > > up > > > > > the read-and-test scheme and start to defining one interface (perhaps > > > > > in > > > > > json format), from which libvirt/openstack is able to parse and find > > > > > out > > > > > compatibility list of a source mdev/physical device? > > > > > > > > Based on the feedback we've received, the previously proposed interface > > > > is not viable. I think there's agreement that the user needs to be > > > > able to parse and interpret the version information. Using json seems > > > > viable, but I don't know if it's the best option. Is there any > > > > precedent of markup strings returned via sysfs we could follow? > > > > > > I found some examples of using formatted string under /sys, mostly under > > > tracing. maybe we can do a similar implementation. 
> > > > > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format > > > > > > name: kvm_mmio > > > ID: 32 > > > format: > > > field:unsigned short common_type; offset:0; size:2; > > > signed:0; > > > field:unsigned char common_flags; offset:2; size:1; > > > signed:0; > > > field:unsigned char common_preempt_count; offset:3; > > > size:1; signed:0; > > > field:int common_pid; offset:4; size:4; signed:1; > > > > > > field:u32 type; offset:8; size:4; signed:0; > > > field:u32 len; offset:12; size:4; signed:0; > > > field:u64 gpa; offset:16; size:8; signed:0; > > > field:u64 val; offset:24; size:8; signed:0; > > > > > > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", > > > __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" > > > }, { 2, "write" }), REC->len, REC->gpa, REC->val > > > > > this is not json fromat and its not supper frendly to parse. > > > > > > #cat /sys/devices/pci:00/:00:02.0/uevent > > > DRIVER=vfio-pci > > > PCI_CLASS=3 > > > PCI_ID=8086:591D > > > PCI_SUBSYS_ID=8086:2212 > > > PCI_SLOT_NAME=:00:02.0 > > > MODALIAS=pci:v8086d591Dsv8086sd2212bc03sc00i00 > > > > > this is ini format or conf formant > > this is pretty simple to parse whichi would be fine. > > that said you could also have a version or capablitiy directory with a file > > for each key and a singel value. > > > > i would prefer to only have to do one read personally the list the files in > > directory and then read tehm all ot build the datastucture myself but that > > is > > doable though the simple ini
Re: Adding VHOST_USER_PROTOCOL_F_CONFIG_MEM_SLOTS to 5.1 release notes
How about something like: "A new feature, VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS, has been added to the vhost-user protocol which, when negotiated, changes the way QEMU transmits memory regions to backend devices. Instead of sending all regions in a single VHOST_USER_SET_MEM_TABLE message, QEMU will send supporting backends individual VHOST_USER_ADD_MEM_REG and VHOST_USER_REM_MEM_REG messages to update the device's memory tables. VMs with vhost-user device backends which support this feature will not be subject to the max RAM slots limit of 8 and will be able to hot-add memory as many times as the target platform supports. Backends which do not support VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS are unaffected." I don't have permission to edit the wiki. How can I get permission? Or can someone post it for me? On Wed, Jul 29, 2020 at 8:19 AM Michael S. Tsirkin wrote: > > On Tue, Jul 28, 2020 at 09:16:10PM -0600, Raphael Norwitz wrote: > > Hi mst, > > > > Looking at the current changelog > > https://wiki.qemu.org/ChangeLog/5.1#virtio, I don't see any mention of > > the VHOST_USER_PROTOCOL_F_CONFIG_MEM_SLOTS protocol feature. It is a > > user visible change so shouldn't we add a note? > > > > Thanks, > > Raphael > > I didn't look at updating the changelog yet. > Would be great if you could write up new vhost user things. > > -- > MST >
[PATCH 0/1] linux-user: Add support for SG_IO and SG_GET_VERSION_NUM raw SCSI ioctls
Hi. This is my first time trying to contribute to qemu. This patch works correctly for architectures with the same bit-width, for example 32bit arm host and i386 user binary. Here is an example with the sg_simple2 executable from https://github.com/hreinecke/sg3_utils 32-bit ARM native: strace -e trace=ioctl ./sg_simple2 /dev/sg0 ioctl(3, SG_GET_VERSION_NUM, [30536]) = 0 ioctl(3, SG_IO, {interface_id='S', dxfer_direction=SG_DXFER_FROM_DEV, cmd_len=6, cmdp="\x12\x00\x00\x00\x60\x00", mx_sb_len=32, iovec_count=0, dxfer_len=96, timeout=2, flags=0, dxferp="\x05\x80\x00\x32\x5b\x00\x00\x00\x48\x4c\x2d\x44\x54\x2d\x53\x54\x42\x44\x2d\x52\x45\x20\x20\x57\x48\x31\x36\x4e\x53\x34\x30\x20"..., status=0, masked_status=0, msg_status=0, sb_len_wr=0, sbp="", host_status=0, driver_status=0, resid=0, duration=3, info=0}) = 0 Some of the INQUIRY command's results: HL-DT-ST BD-RE WH16NS40 1.05 [wide=0 sync=0 cmdque=0 sftre=0] ioctl(3, SG_IO, {interface_id='S', dxfer_direction=SG_DXFER_NONE, cmd_len=6, cmdp="\x00\x00\x00\x00\x00\x00", mx_sb_len=32, iovec_count=0, dxfer_len=0, timeout=2, flags=0, status=0, masked_status=0, msg_status=0, sb_len_wr=0, sbp="", host_status=0, driver_status=0, resid=0, duration=4, info=0}) = 0 Test Unit Ready successful so unit is ready! 
+++ exited with 0 +++ i386 binary on 32-bit arm host: strace -f -e trace=ioctl qemu/build/i386-linux-user/qemu-i386 sg3_utils/examples/sg_simple2 /dev/sg0 strace: Process 690 attached [pid 689] ioctl(3, SG_GET_VERSION_NUM, [30536]) = 0 [pid 689] ioctl(3, SG_IO, {interface_id='S', dxfer_direction=SG_DXFER_FROM_DEV, cmd_len=6, cmdp="\x12\x00\x00\x00\x60\x00", mx_sb_len=32, iovec_count=0, dxfer_len=96, timeout=2, flags=0, dxferp="\x05\x80\x00\x32\x5b\x00\x00\x00\x48\x4c\x2d\x44\x54\x2d\x53\x54\x42\x44\x2d\x52\x45\x20\x20\x57\x48\x31\x36\x4e\x53\x34\x30\x20"..., status=0, masked_status=0, msg_status=0, sb_len_wr=0, sbp="", host_status=0, driver_status=0, resid=0, duration=3, info=0}) = 0 Some of the INQUIRY command's results: HL-DT-ST BD-RE WH16NS40 1.05 [wide=0 sync=0 cmdque=0 sftre=0] [pid 689] ioctl(3, SG_IO, {interface_id='S', dxfer_direction=SG_DXFER_NONE, cmd_len=6, cmdp="\x00\x00\x00\x00\x00\x00", mx_sb_len=32, iovec_count=0, dxfer_len=0, timeout=2, flags=0, status=0, masked_status=0, msg_status=0, sb_len_wr=0, sbp="", host_status=0, driver_status=0, resid=0, duration=3, info=0}) = 0 Test Unit Ready successful so unit is ready! [pid 690] +++ exited with 0 +++ +++ exited with 0 +++ However when I try i386 guest on x86_64 host, the cmdp bytes in the first SG_IO call are zero, incorrectly. I assume that is because I need to write a special ioctl handler. Is that correct? Should I be calling lock_user(VERIFY_WRITE...) to copy the buffers over? Also, is the current patch acceptable as is, or does it need to be reworked until the ioctl works with different architecture bit-widths? Thanks! Leif N Huhn (1): linux-user: Add support for SG_IO and SG_GET_VERSION_NUM raw SCSI ioctls linux-user/ioctls.h| 2 ++ linux-user/syscall.c | 1 + linux-user/syscall_defs.h | 33 + linux-user/syscall_types.h | 5 + 4 files changed, 41 insertions(+) -- 2.28.0
[PATCH 1/1] linux-user: Add support for SG_IO and SG_GET_VERSION_NUM raw SCSI ioctls
This patch implements functionalities of following ioctls: SG_GET_VERSION_NUM - Returns SG driver version number The sg version numbers are of the form "x.y.z" and the single number given by the SG_GET_VERSION_NUM ioctl() is calculated by (x * 1 + y * 100 + z). SG_IO - Permits user applications to send SCSI commands to a device It is logically equivalent to a write followed by a read. Implementation notes: For SG_GET_VERSION_NUM the value is an int and the implementation is straightforward. For SG_IO, the generic thunk mechanism is used, and works correctly when the host and guest architecture have the same pointer size. A special ioctl handler may be needed in other situations and is not covered in this implementation. Signed-off-by: Leif N Huhn --- linux-user/ioctls.h| 2 ++ linux-user/syscall.c | 1 + linux-user/syscall_defs.h | 33 + linux-user/syscall_types.h | 5 + 4 files changed, 41 insertions(+) diff --git a/linux-user/ioctls.h b/linux-user/ioctls.h index 0713ae1311..92e2f65e05 100644 --- a/linux-user/ioctls.h +++ b/linux-user/ioctls.h @@ -333,6 +333,8 @@ IOCTL(CDROM_DRIVE_STATUS, 0, TYPE_NULL) IOCTL(CDROM_DISC_STATUS, 0, TYPE_NULL) IOCTL(CDROMAUDIOBUFSIZ, 0, TYPE_INT) + IOCTL(SG_GET_VERSION_NUM, 0, TYPE_INT) + IOCTL(SG_IO, IOC_RW, MK_PTR(MK_STRUCT(STRUCT_sg_io_hdr))) #if 0 IOCTL(SNDCTL_COPR_HALT, IOC_RW, MK_PTR(TYPE_INT)) diff --git a/linux-user/syscall.c b/linux-user/syscall.c index 945fc25279..d846ef1af2 100644 --- a/linux-user/syscall.c +++ b/linux-user/syscall.c @@ -115,6 +115,7 @@ #ifdef HAVE_DRM_H #include #endif +#include #include "linux_loop.h" #include "uname.h" diff --git a/linux-user/syscall_defs.h b/linux-user/syscall_defs.h index 3c261cff0e..0e3004eb31 100644 --- a/linux-user/syscall_defs.h +++ b/linux-user/syscall_defs.h @@ -2774,4 +2774,37 @@ struct target_statx { /* 0x100 */ }; +/* from kernel's include/scsi/sg.h */ + +#define TARGET_SG_GET_VERSION_NUM 0x2282 /* Example: version 2.1.34 yields 20134 */ +/* synchronous SCSI command ioctl, (only 
in version 3 interface) */ +#define TARGET_SG_IO 0x2285 /* similar effect as write() followed by read() */ + +struct target_sg_io_hdr +{ +int interface_id; /* [i] 'S' for SCSI generic (required) */ +int dxfer_direction;/* [i] data transfer direction */ +unsigned char cmd_len; /* [i] SCSI command length */ +unsigned char mx_sb_len;/* [i] max length to write to sbp */ +unsigned short iovec_count; /* [i] 0 implies no scatter gather */ +unsigned int dxfer_len; /* [i] byte count of data transfer */ +abi_ulongdxferp; /* [i], [*io] points to data transfer memory + or scatter gather list */ +abi_ulongcmdp; /* [i], [*i] points to command to perform */ +abi_ulongsbp; /* [i], [*o] points to sense_buffer memory */ +unsigned int timeout; /* [i] MAX_UINT->no timeout (unit: millisec) */ +unsigned int flags; /* [i] 0 -> default, see SG_FLAG... */ +int pack_id;/* [i->o] unused internally (normally) */ +abi_ulong usr_ptr; /* [i->o] unused internally */ +unsigned char status; /* [o] scsi status */ +unsigned char masked_status;/* [o] shifted, masked scsi status */ +unsigned char msg_status; /* [o] messaging level data (optional) */ +unsigned char sb_len_wr;/* [o] byte count actually written to sbp */ +unsigned short host_status; /* [o] errors from host adapter */ +unsigned short driver_status;/* [o] errors from software driver */ +int resid; /* [o] dxfer_len - actual_transferred */ +unsigned int duration; /* [o] time taken by cmd (unit: millisec) */ +unsigned int info; /* [o] auxiliary information */ +}; /* 64 bytes long (on i386) */ + #endif diff --git a/linux-user/syscall_types.h b/linux-user/syscall_types.h index 3f1f033464..3752d217e2 100644 --- a/linux-user/syscall_types.h +++ b/linux-user/syscall_types.h @@ -59,6 +59,11 @@ STRUCT(cdrom_read_audio, TYPE_CHAR, TYPE_CHAR, TYPE_CHAR, TYPE_CHAR, TYPE_CHAR, TYPE_INT, TYPE_PTRVOID, TYPE_NULL) +STRUCT(sg_io_hdr, + TYPE_INT, TYPE_INT, TYPE_CHAR, TYPE_CHAR, TYPE_SHORT, TYPE_INT, TYPE_PTRVOID, + TYPE_PTRVOID, TYPE_PTRVOID, TYPE_INT, 
TYPE_INT, TYPE_INT, TYPE_PTRVOID, TYPE_CHAR, + TYPE_CHAR, TYPE_CHAR, TYPE_CHAR, TYPE_SHORT, TYPE_SHORT, TYPE_INT, TYPE_INT, TYPE_INT) + STRUCT(hd_geometry, TYPE_CHAR, TYPE_CHAR, TYPE_SHORT, TYPE_ULONG) -- 2.28.0
Re: device compatibility interface for live migration with assigned devices
On Wed, Jul 29, 2020 at 12:28:46PM +0100, Sean Mooney wrote: > On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote: > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: > > > On Mon, 27 Jul 2020 15:24:40 +0800 > > > Yan Zhao wrote: > > > > > > > > > As you indicate, the vendor driver is responsible for checking > > > > > > version > > > > > > information embedded within the migration stream. Therefore a > > > > > > migration should fail early if the devices are incompatible. Is it > > > > > > > > > > > > > > > > but as I know, currently in VFIO migration protocol, we have no way to > > > > > get vendor specific compatibility checking string in migration setup > > > > > stage > > > > > (i.e. .save_setup stage) before the device is set to _SAVING state. > > > > > In this way, for devices who does not save device data in precopy > > > > > stage, > > > > > the migration compatibility checking is as late as in stop-and-copy > > > > > stage, which is too late. > > > > > do you think we need to add the getting/checking of vendor specific > > > > > compatibility string early in save_setup stage? > > > > > > > > > > > > > hi Alex, > > > > after an offline discussion with Kevin, I realized that it may not be a > > > > problem if migration compatibility check in vendor driver occurs late in > > > > stop-and-copy phase for some devices, because if we report device > > > > compatibility attributes clearly in an interface, the chances for > > > > libvirt/openstack to make a wrong decision is little. > > > > > > I think it would be wise for a vendor driver to implement a pre-copy > > > phase, even if only to send version information and verify it at the > > > target. Deciding you have no device state to send during pre-copy does > > > not mean your vendor driver needs to opt-out of the pre-copy phase > > > entirely. 
Please also note that pre-copy is at the user's discretion, > > > we've defined that we can enter stop-and-copy at any point, including > > > without a pre-copy phase, so I would recommend that vendor drivers > > > validate compatibility at the start of both the pre-copy and the > > > stop-and-copy phases. > > > > > > > ok. got it! > > > > > > so, do you think we are now arriving at an agreement that we'll give up > > > > the read-and-test scheme and start to defining one interface (perhaps in > > > > json format), from which libvirt/openstack is able to parse and find out > > > > compatibility list of a source mdev/physical device? > > > > > > Based on the feedback we've received, the previously proposed interface > > > is not viable. I think there's agreement that the user needs to be > > > able to parse and interpret the version information. Using json seems > > > viable, but I don't know if it's the best option. Is there any > > > precedent of markup strings returned via sysfs we could follow? > > > > I found some examples of using formatted string under /sys, mostly under > > tracing. maybe we can do a similar implementation. > > > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format > > > > name: kvm_mmio > > ID: 32 > > format: > > field:unsigned short common_type; offset:0; size:2; > > signed:0; > > field:unsigned char common_flags; offset:2; size:1; > > signed:0; > > field:unsigned char common_preempt_count; offset:3; > > size:1; signed:0; > > field:int common_pid; offset:4; size:4; signed:1; > > > > field:u32 type; offset:8; size:4; signed:0; > > field:u32 len; offset:12; size:4; signed:0; > > field:u64 gpa; offset:16; size:8; signed:0; > > field:u64 val; offset:24; size:8; signed:0; > > > > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", > > __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" > > }, { 2, "write" }), REC->len, REC->gpa, REC->val > > > this is not json fromat and its not supper frendly to parse. 
yes, it's just an example. It's exported to be used by userspace perf & trace_cmd. > > > > #cat /sys/devices/pci:00/:00:02.0/uevent > > DRIVER=vfio-pci > > PCI_CLASS=3 > > PCI_ID=8086:591D > > PCI_SUBSYS_ID=8086:2212 > > PCI_SLOT_NAME=:00:02.0 > > MODALIAS=pci:v8086d591Dsv8086sd2212bc03sc00i00 > > > this is ini format or conf formant > this is pretty simple to parse whichi would be fine. > that said you could also have a version or capablitiy directory with a file > for each key and a singel value. > if this is easy for openstack, maybe we can organize the data in the way below?

|- [device]
   |- migration
      |- self
      |- compatible1
      |- compatible2

e.g.
#cat /sys/bus/pci/devices/:00:02.0/UUID1/migration/self
field1=xxx
field2=xxx
field3=xxx
field3=xxx

#cat /sys/bus/pci/devices/:00:02.0/UUID1/migration/compatible
field1=xxx
field2=xxx
field3=xxx
field3=xxx

or in a
[Bug 1888601] Re: QEMU v5.1.0-rc0/rc1 hang with nested virtualization
Thanks for the information. It could be fixed by this commit upstream. a48aaf882b100b30111b5c7c75e1d9e83fe76cfd ("virtio-pci: fix wrong index in virtio_pci_queue_enabled") Please try. Thanks -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1888601 Title: QEMU v5.1.0-rc0/rc1 hang with nested virtualization Status in QEMU: New Bug description: We're running Kata Containers using QEMU and with v5.1.0rc0 and rc1 have noticed a problem at startup where QEMu appears to hang. We are not seeing this problem on our bare metal nodes and only on a VSI that supports nested virtualization. We unfortunately see nothing at all in the QEMU logs to help understand the problem and a hung process is just a guess at this point. Using git bisect we first see the problem with... --- f19bcdfedd53ee93412d535a842a89fa27cae7f2 is the first bad commit commit f19bcdfedd53ee93412d535a842a89fa27cae7f2 Author: Jason Wang Date: Wed Jul 1 22:55:28 2020 +0800 virtio-pci: implement queue_enabled method With version 1, we can detect whether a queue is enabled via queue_enabled. Signed-off-by: Jason Wang Signed-off-by: Cindy Lu Message-Id: <20200701145538.22333-5-l...@redhat.com> Reviewed-by: Michael S. Tsirkin Signed-off-by: Michael S. Tsirkin Acked-by: Jason Wang hw/virtio/virtio-pci.c | 13 + 1 file changed, 13 insertions(+) --- Reverting this commit (on top of 5.1.0-rc1) seems to work and prevent the hanging. 
--- Here's how kata ends up launching qemu in our environment -- /opt/kata/bin/qemu-system-x86_64 -name sandbox-849df14c6065931adedb9d18bc9260a6d896f1814a8c5cfa239865772f1b7a5f -uuid 6bec458e-1da7-4847-a5d7-5ab31d4d2465 -machine pc,accel=kvm,kernel_irqchip -cpu host,pmu=off -qmp unix:/run/vc/vm/849df14c6065931adedb9d18bc9260a6d896f1814a8c5cfa239865772f1b7a5f/qmp.sock,server,nowait -m 4096M,slots=10,maxmem=30978M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2,romfile= -device virtio-serial-pci,disable-modern=true,id=serial0,romfile= -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/vm/849df14c6065931adedb9d18bc9260a6d896f1814a8c5cfa239865772f1b7a5f/console.sock,server,nowait -device virtio-scsi-pci,id=scsi0,disable-modern=true,romfile= -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0,romfile= -device virtserialport,chardev=charch0,id=channel0,name=agent.channel.0 -chardev socket,id=charch0,path=/run/vc/vm/849df14c6065931adedb9d18bc9260a6d896f1814a8c5cfa239865772f1b7a5f/kata.sock,server,nowait -chardev socket,id=char-396c5c3e19e29353,path=/run/vc/vm/849df14c6065931adedb9d18bc9260a6d896f1814a8c5cfa239865772f1b7a5f/vhost-fs.sock -device vhost-user-fs-pci,chardev=char-396c5c3e19e29353,tag=kataShared,romfile= -netdev tap,id=network-0,vhost=on,vhostfds=3:4,fds=5:6 -device driver=virtio-net-pci,netdev=network-0,mac=52:ac:2d:02:1f:6f,disable-modern=true,mq=on,vectors=6,romfile= -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic -daemonize -object memory-backend-file,id=dimm1,size=4096M,mem-path=/dev/shm,share=on -numa node,memdev=dimm1 -kernel /opt/kata/share/kata-containers/vmlinuz-5.7.9-74 -initrd /opt/kata/share/kata-containers/kata-containers-initrd_alpine_1.11.2-6_agent.initrd -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k 
console=hvc0 console=hvc1 iommu=off cryptomgr.notests net.ifnames=0 pci=lastbus=0 debug panic=1 nr_cpus=4 agent.use_vsock=false scsi_mod.scan=none init=/usr/bin/kata-agent -pidfile /run/vc/vm/849df14c6065931adedb9d18bc9260a6d896f1814a8c5cfa239865772f1b7a5f/pid -D /run/vc/vm/849df14c6065931adedb9d18bc9260a6d896f1814a8c5cfa239865772f1b7a5f/qemu.log -smp 2,cores=1,threads=1,sockets=4,maxcpus=4 --- To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1888601/+subscriptions
Re: [PATCH v5] hw/pci-host: save/restore pci host config register for old ones
> On Tue, Jul 28, 2020 at 11:27:09AM +0800, Hogan Wang wrote:
> > The i440fx and q35 machines integrate the i440FX or MCH PCI device by
> > default. Referring to the i440FX and ICH9-LPC specifications, there are
> > some reserved configuration registers that can be used to save/restore
> > PCIHostState.config_reg. It's nasty but friendly to old ones.
> >
> > Reproducer steps:
> > step 1. Make modifications to seabios and qemu to increase reproduction
> > efficiency: write 0xf0 to port 0x402 to notify qemu to stop the vcpu after
> > port 0x0cf8 wrote the i440 configure register. qemu stops the vcpu when it
> > catches 0xf0 written to port 0x402.
> >
> > seabios:/src/hw/pci.c
> > @@ -52,6 +52,11 @@ void pci_config_writeb(u16 bdf, u32 addr, u8 val)
> >          writeb(mmconfig_addr(bdf, addr), val);
> >      } else {
> >          outl(ioconfig_cmd(bdf, addr), PORT_PCI_CMD);
> > +        if (bdf == 0 && addr == 0x72 && val == 0xa) {
> > +            dprintf(1, "stop vcpu\n");
> > +            outb(0xf0, 0x402); // notify qemu to stop vcpu
> > +            dprintf(1, "resume vcpu\n");
> > +        }
> >          outb(val, PORT_PCI_DATA + (addr & 3));
> >      }
> > }
> >
> > qemu:hw/char/debugcon.c
> > @@ -60,6 +61,9 @@ static void debugcon_ioport_write(void *opaque, hwaddr addr, uint64_t val,
> >      printf(" [debugcon: write addr=0x%04" HWADDR_PRIx " val=0x%02" PRIx64 "]\n", addr, val);
> >  #endif
> >
> > +    if (ch == 0xf0) {
> > +        vm_stop(RUN_STATE_PAUSED);
> > +    }
> >      /* XXX this blocks entire thread. Rewrite to use
> >       * qemu_chr_fe_write and background I/O callbacks */
> >      qemu_chr_fe_write_all(&s->chr, &ch, 1);
> >
> > step 2. start vm1 by the following command line, and then the vm stopped.
> > $ qemu-system-x86_64 -machine pc-i440fx-5.0,accel=kvm\ -netdev > > tap,ifname=tap-test,id=hostnet0,vhost=on,downscript=no,script=no\ > > -device > > virtio-net-pci,netdev=hostnet0,id=net0,bus=pci.0,addr=0x13,bootindex=3 > > \ -device cirrus-vga,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2\ > > -chardev file,id=seabios,path=/var/log/test.seabios,append=on\ > > -device isa-debugcon,iobase=0x402,chardev=seabios\ > > -monitor stdio > > > > step 3. start vm2 to accept vm1 state. > > $ qemu-system-x86_64 -machine pc-i440fx-5.0,accel=kvm\ -netdev > > tap,ifname=tap-test1,id=hostnet0,vhost=on,downscript=no,script=no\ > > -device > > virtio-net-pci,netdev=hostnet0,id=net0,bus=pci.0,addr=0x13,bootindex=3 > > \ -device cirrus-vga,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2\ > > -chardev file,id=seabios,path=/var/log/test.seabios,append=on\ > > -device isa-debugcon,iobase=0x402,chardev=seabios\ > > -monitor stdio \ > > -incoming tcp:127.0.0.1:8000 > > > > step 4. execute the following qmp command in vm1 to migrate. > > (qemu) migrate tcp:127.0.0.1:8000 > > > > step 5. execute the following qmp command in vm2 to resume vcpu. > > (qemu) cont > > > > Before this patch, we get KVM "emulation failure" error on vm2. > > This patch fixes it. > > > > Signed-off-by: Hogan Wang > > --- > > hw/pci-host/i440fx.c | 46 > > hw/pci-host/q35.c| 44 ++ > > 2 files changed, 90 insertions(+) > > > > diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c index > > 8ed2417f0c..419e27c21a 100644 > > --- a/hw/pci-host/i440fx.c > > +++ b/hw/pci-host/i440fx.c > > @@ -64,6 +64,14 @@ typedef struct I440FXState { > > */ > > #define I440FX_COREBOOT_RAM_SIZE 0x57 > > > > +/* Older I440FX machines (5.0 and older) do not support > > +i440FX-pcihost state > > + * migration, use some reserved INTEL 82441 configuration registers > > +to > > + * save/restore i440FX-pcihost config register. 
Refer to [INTEL 440FX PCISET
> > + * 82441FX PCI AND MEMORY CONTROLLER (PMC) AND 82442FX DATA BUS ACCELERATOR
> > + * (DBX) Table 1. PMC Configuration Space] */
> > +#define I440FX_PCI_HOST_CONFIG_REG 0x94
> > +
> >  static void i440fx_update_memory_mappings(PCII440FXState *d)
> >  {
> >      int i;
> > @@ -98,15 +106,53 @@ static void i440fx_write_config(PCIDevice *dev,
> >  static int i440fx_post_load(void *opaque, int version_id)
> >  {
> >      PCII440FXState *d = opaque;
> > +    PCIDevice *dev;
> > +    PCIHostState *s = OBJECT_CHECK(PCIHostState,
> > +                                   object_resolve_path("/machine/i440fx", NULL),
> > +                                   TYPE_PCI_HOST_BRIDGE);
> >
> >      i440fx_update_memory_mappings(d);
> > +
> > +    if (!s->mig_enabled) {

> Thinking more about it, I think we should rename mig_enabled to
> config_reg_mig_enabled or something like this.

Thanks for your pertinent suggestions, I will resend a new patch to fix it.

> > +        dev = PCI_DEVICE(d);
> > +        s->config_reg = pci_get_long(&dev->config[I440FX_PCI_HOST_CONFIG_REG]);
> > +        pci_set_long(&dev->config[I440FX_PCI_HOST_CONFIG_REG], 0);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int i440fx_pre_save(void *opaque) {
[PATCH 2/6] hw/pci-host: save/restore pci host config register for old ones
The i440fx and q35 machines integrate the i440FX or MCH PCI device by default.
Referring to the i440FX and ICH9-LPC specifications, there are some reserved
configuration registers that can be used to save/restore
PCIHostState.config_reg. It's nasty but friendly to old ones.

Reproducer steps:

step 1. Make modifications to seabios and qemu to increase reproduction
efficiency: write 0xf0 to port 0x402 to notify qemu to stop the vcpu after
port 0x0cf8 wrote the i440 configure register. qemu stops the vcpu when it
catches 0xf0 written to port 0x402.

seabios:/src/hw/pci.c
@@ -52,6 +52,11 @@ void pci_config_writeb(u16 bdf, u32 addr, u8 val)
         writeb(mmconfig_addr(bdf, addr), val);
     } else {
         outl(ioconfig_cmd(bdf, addr), PORT_PCI_CMD);
+        if (bdf == 0 && addr == 0x72 && val == 0xa) {
+            dprintf(1, "stop vcpu\n");
+            outb(0xf0, 0x402); // notify qemu to stop vcpu
+            dprintf(1, "resume vcpu\n");
+        }
         outb(val, PORT_PCI_DATA + (addr & 3));
     }
 }

qemu:hw/char/debugcon.c
@@ -60,6 +61,9 @@ static void debugcon_ioport_write(void *opaque, hwaddr addr, uint64_t val,
     printf(" [debugcon: write addr=0x%04" HWADDR_PRIx " val=0x%02" PRIx64 "]\n", addr, val);
 #endif

+    if (ch == 0xf0) {
+        vm_stop(RUN_STATE_PAUSED);
+    }
     /* XXX this blocks entire thread. Rewrite to use
      * qemu_chr_fe_write and background I/O callbacks */
     qemu_chr_fe_write_all(&s->chr, &ch, 1);

step 2. start vm1 by the following command line, and then the vm stopped.

$ qemu-system-x86_64 -machine pc-i440fx-5.0,accel=kvm \
  -netdev tap,ifname=tap-test,id=hostnet0,vhost=on,downscript=no,script=no \
  -device virtio-net-pci,netdev=hostnet0,id=net0,bus=pci.0,addr=0x13,bootindex=3 \
  -device cirrus-vga,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 \
  -chardev file,id=seabios,path=/var/log/test.seabios,append=on \
  -device isa-debugcon,iobase=0x402,chardev=seabios \
  -monitor stdio

step 3. start vm2 to accept vm1 state.
$ qemu-system-x86_64 -machine pc-i440fx-5.0,accel=kvm\ -netdev tap,ifname=tap-test1,id=hostnet0,vhost=on,downscript=no,script=no\ -device virtio-net-pci,netdev=hostnet0,id=net0,bus=pci.0,addr=0x13,bootindex=3\ -device cirrus-vga,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2\ -chardev file,id=seabios,path=/var/log/test.seabios,append=on\ -device isa-debugcon,iobase=0x402,chardev=seabios\ -monitor stdio \ -incoming tcp:127.0.0.1:8000 step 4. execute the following qmp command in vm1 to migrate. (qemu) migrate tcp:127.0.0.1:8000 step 5. execute the following qmp command in vm2 to resume vcpu. (qemu) cont Before this patch, we get KVM "emulation failure" error on vm2. This patch fixes it. Signed-off-by: Hogan Wang --- hw/pci-host/i440fx.c | 46 +++ hw/pci-host/q35.c | 44 + hw/pci/pci_host.c | 4 ++-- include/hw/pci/pci_host.h | 2 +- 4 files changed, 93 insertions(+), 3 deletions(-) diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c index 8ed2417f0c..707e7e9dfb 100644 --- a/hw/pci-host/i440fx.c +++ b/hw/pci-host/i440fx.c @@ -64,6 +64,14 @@ typedef struct I440FXState { */ #define I440FX_COREBOOT_RAM_SIZE 0x57 +/* Older I440FX machines (5.0 and older) do not support i440FX-pcihost state + * migration, use some reserved INTEL 82441 configuration registers to + * save/restore i440FX-pcihost config register. Refer to [INTEL 440FX PCISET + * 82441FX PCI AND MEMORY CONTROLLER (PMC) AND 82442FX DATA BUS ACCELERATOR + * (DBX) Table 1. 
PMC Configuration Space] */
+#define I440FX_PCI_HOST_CONFIG_REG 0x94
+
 static void i440fx_update_memory_mappings(PCII440FXState *d)
 {
     int i;
@@ -98,15 +106,53 @@ static void i440fx_write_config(PCIDevice *dev,
 static int i440fx_post_load(void *opaque, int version_id)
 {
     PCII440FXState *d = opaque;
+    PCIDevice *dev;
+    PCIHostState *s = OBJECT_CHECK(PCIHostState,
+                                   object_resolve_path("/machine/i440fx", NULL),
+                                   TYPE_PCI_HOST_BRIDGE);

     i440fx_update_memory_mappings(d);
+
+    if (!s->config_reg_mig_enabled) {
+        dev = PCI_DEVICE(d);
+        s->config_reg = pci_get_long(&dev->config[I440FX_PCI_HOST_CONFIG_REG]);
+        pci_set_long(&dev->config[I440FX_PCI_HOST_CONFIG_REG], 0);
+    }
+    return 0;
+}
+
+static int i440fx_pre_save(void *opaque)
+{
+    PCIDevice *dev = opaque;
+    PCIHostState *s = OBJECT_CHECK(PCIHostState,
+                                   object_resolve_path("/machine/i440fx", NULL),
+                                   TYPE_PCI_HOST_BRIDGE);
+    if (!s->config_reg_mig_enabled) {
+        pci_set_long(&dev->config[I440FX_PCI_HOST_CONFIG_REG],
+                     s->config_reg);
+    }
+    return 0;
+}
+
+static int i440fx_post_save(void *opaque)
+{
+    PCIDevice *dev = opaque;
+    PCIHostState *s = OBJECT_CHECK(PCIHostState,
+                                   object_resolve_path("/machine/i440fx", NULL),
+
[PATCH] docs/nvdimm: add 'pmem=on' for the device dax backend file
At the end of live migration, QEMU uses msync() to flush the data to the
backend storage. When the backend file is a DAX character device, its pages
explicitly avoid the page cache, so msync() returns failure and the following
warning is output:

"warning: qemu_ram_msync: failed to sync memory range"

So we add 'pmem=on' to avoid calling msync(); use the QEMU command line:

-object memory-backend-file,id=mem1,pmem=on,mem-path=/dev/dax0.0,size=4G

Reviewed-by: Stefan Hajnoczi
Signed-off-by: Jingqi Liu
---
 docs/nvdimm.txt | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
index c2c6e441b3..31048aff5e 100644
--- a/docs/nvdimm.txt
+++ b/docs/nvdimm.txt
@@ -243,6 +243,13 @@ use the QEMU command line:

 -object memory-backend-file,id=nv_mem,mem-path=/XXX/yyy,size=4G,pmem=on

+At the end of live migration, QEMU uses msync() to flush the data to the
+backend storage. When the backend file is a character device dax, the pages
+explicitly avoid the page cache. It will return failure from msync().
+So we add 'pmem=on' to avoid calling msync(), use the QEMU command line:
+
+-object memory-backend-file,id=mem1,pmem=on,mem-path=/dev/dax0.0,size=4G
+
 References
 ----------
--
2.17.1
Re: [PATCH 1/1] docs: adding NUMA documentation for pseries
On Wed, Jul 29, 2020 at 09:57:56AM -0300, Daniel Henrique Barboza wrote:
> This patch adds a new documentation file, ppc-spapr-numa.rst, informing
> what developers and users can expect of the NUMA distance support for
> the pseries machine, up to QEMU 5.1.
>
> In the (hopefully soon) future, when we rework the NUMA mechanics of the
> pseries machine to at least attempt to contemplate user choice, this doc
> will be extended to inform about the new support.
>
> Signed-off-by: Daniel Henrique Barboza

Applied to ppc-for-5.2, thanks.

> ---
> docs/specs/ppc-spapr-numa.rst | 191 ++
> 1 file changed, 191 insertions(+)
> create mode 100644 docs/specs/ppc-spapr-numa.rst
>
> diff --git a/docs/specs/ppc-spapr-numa.rst b/docs/specs/ppc-spapr-numa.rst
> new file mode 100644
> index 00..e762038022
> --- /dev/null
> +++ b/docs/specs/ppc-spapr-numa.rst
> @@ -0,0 +1,191 @@
> +===========================================
> +NUMA mechanics for sPAPR (pseries machines)
> +===========================================
> +
> +NUMA in sPAPR works differently from the System Locality Distance
> +Information Table (SLIT) in ACPI. The logic is explained in LOPAPR 1.1
> +chapter 15, "Non Uniform Memory Access (NUMA) Option". This document
> +aims to complement that specification, providing details of the
> +elements that impact how QEMU views NUMA in pseries.
> +
> +Associativity and ibm,associativity property
> +--------------------------------------------
> +
> +Associativity is defined as a group of platform resources that has
> +similar mean performance (or, in our context here, distance) relative
> +to everyone else outside of the group.
> +
> +The format of the ibm,associativity property varies with the value of
> +bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with
> +bit 0 equal to zero is deprecated. The current format, with bit 0 equal
> +to one, makes the ibm,associativity property represent the physical
> +hierarchy of the platform, as one or more lists that start with the
> +highest-level grouping and go down to the smallest.
Considering the
> +following topology:
> +
> +::
> +
> +  Mem M1 ---- Proc P1    |
> +  -----------------------| Socket S1 ---|
> +       chip C1           |              |
> +                                        | HW module 1 (MOD1)
> +  Mem M2 ---- Proc P2    |              |
> +  -----------------------| Socket S2 ---|
> +       chip C2           |
> +
> +The ibm,associativity property for the processors would be:
> +
> +* P1: {MOD1, S1, C1, P1}
> +* P2: {MOD1, S2, C2, P2}
> +
> +Each allocable resource has an ibm,associativity property. The LOPAPR
> +specification allows multiple lists to be present in this property,
> +considering that the same resource can have multiple connections to the
> +platform.
> +
> +Relative Performance Distance and ibm,associativity-reference-points
> +--------------------------------------------------------------------
> +
> +The ibm,associativity-reference-points property is an array that is used
> +to define the relevant performance/distance related boundaries, defining
> +the NUMA levels for the platform.
> +
> +The definition of its elements also varies with the value of bit 0 of byte 5
> +of the ibm,architecture-vec-5 property. The format with bit 0 equal to zero
> +is also deprecated. With the current format, each integer of the
> +ibm,associativity-reference-points represents a 1-based ordinal index (i.e.
> +the first element is 1) into the ibm,associativity array. The first
> +boundary is the most significant to application performance, followed by
> +less significant boundaries. Allocated resources that belong to the
> +same performance boundaries are expected to have relative NUMA distances
> +that match the relevancy of the boundary itself. Resources that belong
> +to the same first boundary will have the shortest distance from each
> +other. Subsequent boundaries represent greater distances and degraded
> +performance.
> +
> +Using the previous example, the following reference-point setting defines
> +three NUMA levels:
> +
> +* ibm,associativity-reference-points = {0x3, 0x2, 0x1}
> +
> +The first NUMA level (0x3) is interpreted as the third element of each
> +ibm,associativity array, the second level is the second element and
> +the third level is the first element. Let's also consider that elements
> +belonging to the first NUMA level have distance equal to 10 from each
> +other, and each NUMA level doubles the distance from the previous one.
> +This means that the second level would be 20 and the third level 40. For
> +the P1 and P2 processors, we would have the following NUMA levels:
> +
> +::
> +
> +  * ibm,associativity-reference-points = {0x3, 0x2, 0x1}
> +
> +  * P1: associativity{MOD1, S1, C1, P1}
> +
> +    First NUMA level (0x3)  => associativity[2] = C1
> +    Second NUMA level (0x2) => associativity[1] = S1
> +    Third NUMA level (0x1)  => associativity[0] = MOD1
> +
> +  * P2: associativity{MOD1, S2,
Re: [PATCH v3 0/8] Generalize start-powered-off property from ARM
On Tue, Jul 28, 2020 at 09:56:36PM -0300, Thiago Jung Bauermann wrote:
> Thiago Jung Bauermann writes:
>
> > The ARM code has a start-powered-off property in ARMCPU, which is a
> > subclass of CPUState. This property causes arm_cpu_reset() to set
> > CPUState::halted to 1, signalling that the CPU should start in a halted
> > state. Other architectures also have code which aims to achieve the same
> > effect, but without using a property.
> >
> > The ppc/spapr version has a bug where QEMU does a KVM_RUN on the vcpu
> > before cs->halted is set to 1, causing the vcpu to run while it's still
> > in an uninitialized state (more details in patch 3).
>
> Since this series fixes a bug, is it eligible for 5.1, at least the
> patches that were already approved by the appropriate maintainers?

Ok by me.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[PATCH] virtio-mem: Work around format specifier mismatch for RISC-V
This likely affects other, less popular host architectures as well. Less
common host architectures under linux get QEMU_VMALLOC_ALIGN (from which
VIRTIO_MEM_MIN_BLOCK_SIZE is derived) defined to a variable of type
uintptr_t, which isn't compatible with the format specifier used to print a
user message. Since this particular usage of the underlying data seems
unique, the simple fix is to just cast it to match the format specifier.

Signed-off-by: Bruce Rogers
---
 hw/virtio/virtio-mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index c12e9f79b0..fd01ffd83e 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -754,7 +754,7 @@ static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name,
     if (value < VIRTIO_MEM_MIN_BLOCK_SIZE) {
         error_setg(errp, "'%s' property has to be at least 0x%" PRIx32, name,
-                   VIRTIO_MEM_MIN_BLOCK_SIZE);
+                   (unsigned int)VIRTIO_MEM_MIN_BLOCK_SIZE);
         return;
     } else if (!is_power_of_2(value)) {
         error_setg(errp, "'%s' property has to be a power of two", name);
--
2.27.0
Re: [PATCH 04/16] hw/block/nvme: remove redundant has_sg member
> -----Original Message-----
> From: Qemu-devel On Behalf Of Klaus Jensen
> Sent: Thursday, July 30, 2020 3:29 AM
> To: Minwoo Im
> Cc: Kevin Wolf ; qemu-bl...@nongnu.org; Klaus Jensen ;
> qemu-devel@nongnu.org; Max Reitz ; Keith Busch
> Subject: Re: [PATCH 04/16] hw/block/nvme: remove redundant has_sg member
>
> On Jul 30 00:29, Minwoo Im wrote:
> > Klaus,
>
> Hi Minwoo,
>
> Thanks for the reviews and welcome to the party! :)
>
> > On 20-07-20 13:37:36, Klaus Jensen wrote:
> > > From: Klaus Jensen
> > >
> > > Remove the has_sg member from NvmeRequest since it's redundant.
> > >
> > > Also, make sure the request iov is destroyed at completion time.
> > >
> > > Signed-off-by: Klaus Jensen
> > > Reviewed-by: Maxim Levitsky
> > > ---
> > >  hw/block/nvme.c | 11 ++-
> > >  hw/block/nvme.h |  1 -
> > >  2 files changed, 6 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index cb236d1c8c46..6a1a1626b87b 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -548,16 +548,20 @@ static void nvme_rw_cb(void *opaque, int ret)
> > >          block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
> > >          req->status = NVME_INTERNAL_DEV_ERROR;
> > >      }
> > > -    if (req->has_sg) {
> > > +
> > > +    if (req->qsg.nalloc) {
> >
> > Personally, I prefer has_xxx or is_xxx to check whether the request is
> > based on sg or iov as an inline function, but 'nalloc' is also fine to
> > figure out the meaning of purpose here.
>
> What I really want to do is get rid of this duality with qsg and iovs at
> some point. I kinda wanna get rid of the dma helpers and the qsg
> entirely and do the DMA handling directly.
>
> Maybe an `int flags` member in NvmeRequest would be better for this,
> such as NVME_REQ_DMA etc.

That looks better, but it looks like this is out of scope of this series.
Anyway, please take my tag for this patch.
Reviewed-by: Minwoo Im

> > > +        qemu_sglist_destroy(&req->qsg);
> > >      }
> > > +    if (req->iov.nalloc) {
> > > +        qemu_iovec_destroy(&req->iov);
> > > +    }
> > > +
> >
> > Maybe this can be in a separated commit?
>
> Yeah. I guess whenever a commit message includes "Also, ..." you really
> should factor the change out ;)
>
> I'll split it.
>
> > Otherwise, it looks good to me.
> >
> > Thanks,
>
> Does that mean I can add your R-b? :)
Re: [PATCH] introduce VFIO-over-socket protocol specificaion
On Thu, 16 Jul 2020 08:31:43 -0700 Thanos Makatos wrote: > This patch introduces the VFIO-over-socket protocol specification, which > is designed to allow devices to be emulated outside QEMU, in a separate > process. VFIO-over-socket reuses the existing VFIO defines, structs and > concepts. > > It has been earlier discussed as an RFC in: > "RFC: use VFIO over a UNIX domain socket to implement device offloading" > > Signed-off-by: John G Johnson > Signed-off-by: Thanos Makatos > --- > docs/devel/vfio-over-socket.rst | 1135 > +++ > 1 files changed, 1135 insertions(+), 0 deletions(-) > create mode 100644 docs/devel/vfio-over-socket.rst > > diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-socket.rst > new file mode 100644 > index 000..723b944 > --- /dev/null > +++ b/docs/devel/vfio-over-socket.rst > @@ -0,0 +1,1135 @@ > +*** > +VFIO-over-socket Protocol Specification > +*** > + > +Version 0.1 > + > +Introduction > + > +VFIO-over-socket, also known as vfio-user, is a protocol that allows a device > +to be virtualized in a separate process outside of QEMU. VFIO-over-socket > +devices consist of a generic VFIO device type, living inside QEMU, which we > +call the client, and the core device implementation, living outside QEMU, > which > +we call the server. VFIO-over-socket can be the main transport mechanism for > +multi-process QEMU, however it can be used by other applications offering > +device virtualization. Explaining the advantages of a > +disaggregated/multi-process QEMU, and device virtualization outside QEMU in > +general, is beyond the scope of this document. > + > +This document focuses on specifying the VFIO-over-socket protocol. VFIO has > +been chosen for the following reasons: > + > +1) It is a mature and stable API, backed by an extensively used framework. > +2) The existing VFIO client implementation (qemu/hw/vfio/) can be largely > + reused. 
> + > +In a proof of concept implementation it has been demonstrated that using VFIO > +over a UNIX domain socket is a viable option. VFIO-over-socket is designed > with > +QEMU in mind, however it could be used by other client applications. The > +VFIO-over-socket protocol does not require that QEMU's VFIO client > +implementation is used in QEMU. None of the VFIO kernel modules are required > +for supporting the protocol, neither in the client nor the server, only the > +source header files are used. > + > +The main idea is to allow a virtual device to function in a separate process > in > +the same host over a UNIX domain socket. A UNIX domain socket (AF_UNIX) is > +chosen because we can trivially send file descriptors over it, which in turn > +allows: > + > +* Sharing of guest memory for DMA with the virtual device process. > +* Sharing of virtual device memory with the guest for fast MMIO. > +* Efficient sharing of eventfd's for triggering interrupts. > + > +However, other socket types could be used which allows the virtual device > +process to run in a separate guest in the same host (AF_VSOCK) or remotely > +(AF_INET). Theoretically the underlying transport doesn't necessarily have to > +be a socket, however we don't examine such alternatives. In this document we > +focus on using a UNIX domain socket and introduce basic support for the other > +two types of sockets without considering performance implications. > + > +This document does not yet describe any internal details of the server-side > +implementation, however QEMU's VFIO client implementation will have to be > +adapted according to this protocol in order to support VFIO-over-socket > virtual > +devices. > + > +VFIO > + > +VFIO is a framework that allows a physical device to be securely passed > through > +to a user space process; the kernel does not drive the device at all. > +Typically, the user space process is a VM and the device is passed through to > +it in order to achieve high performance. 
VFIO provides an API and the > required > +functionality in the kernel. QEMU has adopted VFIO to allow a guest virtual > +machine to directly access physical devices, instead of emulating them in > +software > + > +VFIO-over-socket reuses the core VFIO concepts defined in its API, but > +implements them as messages to be sent over a UNIX-domain socket. It does not > +change the kernel-based VFIO in any way, in fact none of the VFIO kernel > +modules need to be loaded to use VFIO-over-socket. It is also possible for > QEMU > +to concurrently use the current kernel-based VFIO for one guest device, and > use > +VFIO-over-socket for another device in the same guest. > + > +VFIO Device Model > +- > +A device under VFIO presents a standard VFIO model to the user process. Many > +of the VFIO operations in the existing kernel model use the ioctl() system > +call, and references to the existing model are called the ioctl() >
Re: [PATCH v6 2/2] nvme: allow cmb and pmr to be enabled on same device
On Jul 29 15:01, Andrzej Jakowski wrote:
> So far it was not possible to have CMB and PMR emulated on the same
> device, because BAR2 was used exclusively for either the PMR or the CMB.
> This patch places the CMB at a BAR4 offset so it does not conflict with
> the MSI-X vectors.
>
> Signed-off-by: Andrzej Jakowski
> ---

Well, I'm certainly happy now. LGTM!

Reviewed-by: Klaus Jensen
Re: [PATCH-for-5.2 v4] hw/core/qdev: Increase qdev_realize() kindness
On 29/07/20 09:39, Markus Armbruster wrote:
> Taking a step back, I disagree with the notion that assertions should be
> avoided just because we have an Error **. A programming error doesn't
> become less wrong, and continuing when the program is in disorder
> doesn't become any safer when you add an Error ** parameter to a
> function.

I don't think it is actually unsafe to continue after passing a bus-less
device with a bus_type to qdev_realize. It will fail, but in an orderly way.
So even though it's a programming error, it should not be a big deal to
avoid the assertion here: either the caller will pass &error_abort, or it
will print a nice error message and let the user go on with their business.

I'm not particularly attached to the change, but it seemed inconsistent to
use error_setg(&error_abort).

Paolo
[PATCH] pci_dma_rw: return correct value instead of 0
pci_dma_rw currently always returns 0, regardless of the result of
dma_memory_rw. Adjust it to return the correct value.

Signed-off-by: Emanuele Giuseppe Esposito
---
 include/hw/pci/pci.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index c1bf7d5356..41c4ab5932 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -787,8 +787,7 @@ static inline AddressSpace *pci_get_address_space(PCIDevice *dev)
 static inline int pci_dma_rw(PCIDevice *dev, dma_addr_t addr,
                              void *buf, dma_addr_t len, DMADirection dir)
 {
-    dma_memory_rw(pci_get_address_space(dev), addr, buf, len, dir);
-    return 0;
+    return dma_memory_rw(pci_get_address_space(dev), addr, buf, len, dir);
 }

 static inline int pci_dma_read(PCIDevice *dev, dma_addr_t addr,
--
2.17.1
[PATCH 4/5] virtiofsd: Open lo->source while setting up root in sandbox=NONE mode
In sandbox=NONE mode, lo->source points to the directory which is being exported. We have not done any chroot()/pivot_root(). So open lo->source. Signed-off-by: Vivek Goyal --- tools/virtiofsd/passthrough_ll.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 76ef891105..a6fa816b6c 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -3209,7 +3209,10 @@ static void setup_root(struct lo_data *lo, struct lo_inode *root) int fd, res; struct stat stat; -fd = open("/", O_PATH); +if (lo->sandbox == SANDBOX_NONE) +fd = open(lo->source, O_PATH); +else +fd = open("/", O_PATH); if (fd == -1) { fuse_log(FUSE_LOG_ERR, "open(%s, O_PATH): %m\n", lo->source); exit(1); -- 2.25.4
[PATCH 1/5] virtiofsd: Add notion of unprivileged mode
At startup, if we are running as a non-root user, internally set unprivileged
mode. Also add a notion of sandbox NONE and select it internally in
unprivileged mode, since setting up namespaces and chroot() fails when one
does not have privileges.

Signed-off-by: Vivek Goyal
---
 tools/virtiofsd/passthrough_ll.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index e2fbc614fd..cd91c4a831 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -147,11 +147,13 @@ enum {
 enum {
     SANDBOX_NAMESPACE,
     SANDBOX_CHROOT,
+    SANDBOX_NONE,
 };

 struct lo_data {
     pthread_mutex_t mutex;
     int sandbox;
+    bool unprivileged;
     int debug;
     int writeback;
     int flock;
@@ -3288,6 +3290,12 @@ int main(int argc, char *argv[])
     lo_map_init(_map);
     lo_map_init(_map);

+    if (geteuid() != 0) {
+        lo.unprivileged = true;
+        lo.sandbox = SANDBOX_NONE;
+        fuse_log(FUSE_LOG_DEBUG, "Running in unprivileged passthrough mode.\n");
+    }
+
     if (fuse_parse_cmdline(&args, &opts) != 0) {
         goto err_out1;
     }
--
2.25.4
[PATCH 5/5] virtiofsd: Skip setup_capabilities() in sandbox=NONE mode
While running as an unprivileged user, setup_capabilities() fails for me, so
we are doing some operation which requires privileges. For now simply skip
it in sandbox=NONE mode.

Signed-off-by: Vivek Goyal
---
 tools/virtiofsd/passthrough_ll.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index a6fa816b6c..1a0b24cbf2 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -3030,7 +3030,8 @@ static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
     }

     setup_seccomp(enable_syslog);
-    setup_capabilities(g_strdup(lo->modcaps));
+    if (lo->sandbox != SANDBOX_NONE)
+        setup_capabilities(g_strdup(lo->modcaps));
 }

 /* Set the maximum number of open file descriptors */
--
2.25.4
[PATCH 3/5] virtiofsd: open /proc/self/fd/ in sandbox=NONE mode
We need /proc/self/fd descriptor even in sandbox=NONE mode. Signed-off-by: Vivek Goyal --- tools/virtiofsd/passthrough_ll.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index cd91c4a831..76ef891105 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -2969,6 +2969,15 @@ static void setup_capabilities(char *modcaps_in) pthread_mutex_unlock(); } +static void setup_none(struct lo_data *lo) +{ +lo->proc_self_fd = open("/proc/self/fd", O_PATH); +if (lo->proc_self_fd == -1) { +fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/fd\", O_PATH): %m\n"); +exit(1); +} +} + /* * Use chroot as a weaker sandbox for environments where the process is * launched without CAP_SYS_ADMIN. @@ -3014,8 +3023,10 @@ static void setup_sandbox(struct lo_data *lo, struct fuse_session *se, if (lo->sandbox == SANDBOX_NAMESPACE) { setup_namespaces(lo, se); setup_mounts(lo->source); -} else { +} else if (lo->sandbox == SANDBOX_CHROOT) { setup_chroot(lo); +} else { +setup_none(lo); } setup_seccomp(enable_syslog); -- 2.25.4
[PATCH v2 12/16] hw/block/nvme: be consistent about zeros vs zeroes
From: Klaus Jensen The NVM Express specification generally uses 'zeroes' and not 'zeros', so let us align with it. Cc: Fam Zheng Signed-off-by: Klaus Jensen Reviewed-by: Minwoo Im Reviewed-by: Maxim Levitsky --- block/nvme.c | 4 ++-- hw/block/nvme.c | 8 include/block/nvme.h | 4 ++-- 3 files changed, 8 insertions(+), 8 deletions(-) diff --git a/block/nvme.c b/block/nvme.c index c1c4c07ac6cc..05485fdd1189 100644 --- a/block/nvme.c +++ b/block/nvme.c @@ -537,7 +537,7 @@ static void nvme_identify(BlockDriverState *bs, int namespace, Error **errp) s->page_size / sizeof(uint64_t) * s->page_size); oncs = le16_to_cpu(idctrl->oncs); -s->supports_write_zeroes = !!(oncs & NVME_ONCS_WRITE_ZEROS); +s->supports_write_zeroes = !!(oncs & NVME_ONCS_WRITE_ZEROES); s->supports_discard = !!(oncs & NVME_ONCS_DSM); memset(resp, 0, 4096); @@ -1201,7 +1201,7 @@ static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs, } NvmeCmd cmd = { -.opcode = NVME_CMD_WRITE_ZEROS, +.opcode = NVME_CMD_WRITE_ZEROES, .nsid = cpu_to_le32(s->nsid), .cdw10 = cpu_to_le32((offset >> s->blkshift) & 0x), .cdw11 = cpu_to_le32(((offset >> s->blkshift) >> 32) & 0x), diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 60034ea62ca8..2acde838986c 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -616,7 +616,7 @@ static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, return NVME_NO_COMPLETE; } -static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, +static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, NvmeRequest *req) { NvmeRwCmd *rw = (NvmeRwCmd *)cmd; @@ -716,8 +716,8 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) switch (cmd->opcode) { case NVME_CMD_FLUSH: return nvme_flush(n, ns, cmd, req); -case NVME_CMD_WRITE_ZEROS: -return nvme_write_zeros(n, ns, cmd, req); +case NVME_CMD_WRITE_ZEROES: +return nvme_write_zeroes(n, ns, cmd, req); case NVME_CMD_WRITE: case NVME_CMD_READ: return nvme_rw(n, ns, cmd, 
req); @@ -2337,7 +2337,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev) id->sqes = (0x6 << 4) | 0x6; id->cqes = (0x4 << 4) | 0x4; id->nn = cpu_to_le32(n->num_namespaces); -id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP | +id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROES | NVME_ONCS_TIMESTAMP | NVME_ONCS_FEATURES); subnqn = g_strdup_printf("nqn.2019-08.org.qemu:%s", n->params.serial); diff --git a/include/block/nvme.h b/include/block/nvme.h index 370df7fc0570..65e68a82c897 100644 --- a/include/block/nvme.h +++ b/include/block/nvme.h @@ -460,7 +460,7 @@ enum NvmeIoCommands { NVME_CMD_READ = 0x02, NVME_CMD_WRITE_UNCOR= 0x04, NVME_CMD_COMPARE= 0x05, -NVME_CMD_WRITE_ZEROS= 0x08, +NVME_CMD_WRITE_ZEROES = 0x08, NVME_CMD_DSM= 0x09, }; @@ -838,7 +838,7 @@ enum NvmeIdCtrlOncs { NVME_ONCS_COMPARE = 1 << 0, NVME_ONCS_WRITE_UNCORR = 1 << 1, NVME_ONCS_DSM = 1 << 2, -NVME_ONCS_WRITE_ZEROS = 1 << 3, +NVME_ONCS_WRITE_ZEROES = 1 << 3, NVME_ONCS_FEATURES = 1 << 4, NVME_ONCS_RESRVATIONS = 1 << 5, NVME_ONCS_TIMESTAMP = 1 << 6, -- 2.27.0
[RFC PATCH 0/5] virtiofsd: Add notion of unprivileged mode
Hi, Daniel Berrange mentioned that having an unprivileged mode in virtiofsd might be useful for certain use cases. Hence I decided to give it a try. This is an RFC patch series to allow running virtiofsd as an unprivileged user. This is still work in progress. I am posting it to get some early feedback. These patches are dependent on Stefan's patch series for sandbox=chroot. https://www.redhat.com/archives/virtio-fs/2020-July/msg00078.html I can now run virtiofsd as user "test" and also export a directory into a VM running as user test. This is ideal for the cases where user "test" inside the VM will operate on this virtiofs mount point. Any filesystem operations which can't be done with the creds of the "test" user on the host will fail. Thanks Vivek Vivek Goyal (5): virtiofsd: Add notion of unprivileged mode virtiofsd: create lock/pid file in per user cache dir virtiofsd: open /proc/self/fd/ in sandbox=NONE mode virtiofsd: Open lo->source while setting up root in sandbox=NONE mode virtiofsd: Skip setup_capabilities() in sandbox=NONE mode tools/virtiofsd/fuse_virtio.c| 40 tools/virtiofsd/passthrough_ll.c | 29 --- 2 files changed, 61 insertions(+), 8 deletions(-) -- 2.25.4
[PATCH v2 16/16] hw/block/nvme: remove explicit qsg/iov parameters
From: Klaus Jensen Since nvme_map_prp always operates on the request-scoped qsg/iovs, just pass a single pointer to the NvmeRequest instead of two for each of the qsg and iov. Suggested-by: Minwoo Im Signed-off-by: Klaus Jensen --- hw/block/nvme.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 55b1a68ced8c..aea8a8b6946c 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -284,8 +284,8 @@ static uint16_t nvme_map_addr(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov, return NVME_SUCCESS; } -static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, - uint64_t prp2, uint32_t len, NvmeCtrl *n) +static uint16_t nvme_map_prp(NvmeCtrl *n, uint64_t prp1, uint64_t prp2, + uint32_t len, NvmeRequest *req) { hwaddr trans_len = n->page_size - (prp1 % n->page_size); trans_len = MIN(len, trans_len); @@ -293,6 +293,9 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, uint16_t status; bool prp_list_in_cmb = false; +QEMUSGList *qsg = &req->qsg; +QEMUIOVector *iov = &req->iov; + trace_pci_nvme_map_prp(trans_len, len, prp1, prp2, num_prps); if (unlikely(!prp1)) { @@ -386,7 +389,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, { uint16_t status = NVME_SUCCESS; -status = nvme_map_prp(&req->qsg, &req->iov, prp1, prp2, len, n); +status = nvme_map_prp(n, prp1, prp2, len, req); if (status) { return status; } @@ -431,7 +434,7 @@ static uint16_t nvme_map_dptr(NvmeCtrl *n, size_t len, NvmeRequest *req) uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1); uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2); -return nvme_map_prp(&req->qsg, &req->iov, prp1, prp2, len, n); +return nvme_map_prp(n, prp1, prp2, len, req); } static void nvme_post_cqes(void *opaque) -- 2.27.0
[PATCH v2 13/16] hw/block/nvme: add ns/cmd references in NvmeRequest
From: Klaus Jensen Instead of passing around the NvmeNamespace and the NvmeCmd, add them as members in the NvmeRequest structure. Signed-off-by: Klaus Jensen Reviewed-by: Minwoo Im Reviewed-by: Maxim Levitsky --- hw/block/nvme.c | 187 ++-- hw/block/nvme.h | 2 + 2 files changed, 104 insertions(+), 85 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 2acde838986c..3d7275eae369 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -211,6 +211,12 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq) } } +static void nvme_req_clear(NvmeRequest *req) +{ +req->ns = NULL; +memset(>cqe, 0x0, sizeof(req->cqe)); +} + static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr addr, size_t len) { @@ -428,9 +434,9 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len, return status; } -static uint16_t nvme_map_dptr(NvmeCtrl *n, NvmeCmd *cmd, size_t len, - NvmeRequest *req) +static uint16_t nvme_map_dptr(NvmeCtrl *n, size_t len, NvmeRequest *req) { +NvmeCmd *cmd = >cmd; uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1); uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2); @@ -606,8 +612,7 @@ static void nvme_rw_cb(void *opaque, int ret) nvme_enqueue_req_completion(cq, req); } -static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, -NvmeRequest *req) +static uint16_t nvme_flush(NvmeCtrl *n, NvmeRequest *req) { block_acct_start(blk_get_stats(n->conf.blk), >acct, 0, BLOCK_ACCT_FLUSH); @@ -616,10 +621,10 @@ static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, return NVME_NO_COMPLETE; } -static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, -NvmeRequest *req) +static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req) { -NvmeRwCmd *rw = (NvmeRwCmd *)cmd; +NvmeRwCmd *rw = (NvmeRwCmd *)>cmd; +NvmeNamespace *ns = req->ns; const uint8_t lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas); const uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds; uint64_t slba = 
le64_to_cpu(rw->slba); @@ -643,10 +648,10 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, return NVME_NO_COMPLETE; } -static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, -NvmeRequest *req) +static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req) { -NvmeRwCmd *rw = (NvmeRwCmd *)cmd; +NvmeRwCmd *rw = (NvmeRwCmd *)>cmd; +NvmeNamespace *ns = req->ns; uint32_t nlb = le32_to_cpu(rw->nlb) + 1; uint64_t slba = le64_to_cpu(rw->slba); @@ -674,7 +679,7 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, return status; } -if (nvme_map_dptr(n, cmd, data_size, req)) { +if (nvme_map_dptr(n, data_size, req)) { block_acct_invalid(blk_get_stats(n->conf.blk), acct); return NVME_INVALID_FIELD | NVME_DNR; } @@ -700,29 +705,29 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, return NVME_NO_COMPLETE; } -static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) +static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req) { -NvmeNamespace *ns; -uint32_t nsid = le32_to_cpu(cmd->nsid); +uint32_t nsid = le32_to_cpu(req->cmd.nsid); -trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req), cmd->opcode); +trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req), + req->cmd.opcode); if (unlikely(nsid == 0 || nsid > n->num_namespaces)) { trace_pci_nvme_err_invalid_ns(nsid, n->num_namespaces); return NVME_INVALID_NSID | NVME_DNR; } -ns = >namespaces[nsid - 1]; -switch (cmd->opcode) { +req->ns = >namespaces[nsid - 1]; +switch (req->cmd.opcode) { case NVME_CMD_FLUSH: -return nvme_flush(n, ns, cmd, req); +return nvme_flush(n, req); case NVME_CMD_WRITE_ZEROES: -return nvme_write_zeroes(n, ns, cmd, req); +return nvme_write_zeroes(n, req); case NVME_CMD_WRITE: case NVME_CMD_READ: -return nvme_rw(n, ns, cmd, req); +return nvme_rw(n, req); default: -trace_pci_nvme_err_invalid_opc(cmd->opcode); +trace_pci_nvme_err_invalid_opc(req->cmd.opcode); return NVME_INVALID_OPCODE | NVME_DNR; } } @@ 
-738,10 +743,10 @@ static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n) } } -static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd) +static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req) { -NvmeDeleteQ *c = (NvmeDeleteQ *)cmd; -NvmeRequest *req, *next; +NvmeDeleteQ *c = (NvmeDeleteQ *)>cmd; +NvmeRequest *r, *next; NvmeSQueue *sq; NvmeCQueue *cq; uint16_t qid = le16_to_cpu(c->qid); @@ -755,19 +760,19 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd
[PATCH 2/5] virtiofsd: create lock/pid file in per user cache dir
Right now we create the lock/pid file in /usr/local/var/... and an unprivileged user does not have access to create files there. So create this file in a per-user cache dir, as specified by the environment variable XDG_RUNTIME_DIR. Note: "su $USER" does not update XDG_RUNTIME_DIR and it still points to the root user's directory. So for now I create a directory /tmp/$UID to save the lock/pid file. Dan pointed out that it can be a problem if a malicious app already has /tmp/$UID created. So we probably need to get rid of this. Signed-off-by: Vivek Goyal --- tools/virtiofsd/fuse_virtio.c | 40 ++- 1 file changed, 35 insertions(+), 5 deletions(-) diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c index 6b21a93841..f763a70ba5 100644 --- a/tools/virtiofsd/fuse_virtio.c +++ b/tools/virtiofsd/fuse_virtio.c @@ -972,13 +972,43 @@ static bool fv_socket_lock(struct fuse_session *se) g_autofree gchar *pidfile = NULL; g_autofree gchar *dir = NULL; Error *local_err = NULL; +gboolean unprivileged = false; -dir = qemu_get_local_state_pathname("run/virtiofsd"); +if (geteuid() != 0) +unprivileged = true; -if (g_mkdir_with_parents(dir, S_IRWXU) < 0) { -fuse_log(FUSE_LOG_ERR, "%s: Failed to create directory %s: %s", - __func__, dir, strerror(errno)); -return false; +/* + * Unprivileged users don't have access to /usr/local/var. Hence + * store lock/pid file in per user directory. Use environment + * variable XDG_RUNTIME_DIR. + * If one logs into the system as root and then does "su" then + * XDG_RUNTIME_DIR still points to root user directory. In that + * case create a directory for user in /tmp/$UID + */ +if (unprivileged) { +gchar *user_dir = NULL; +gboolean create_dir = false; +user_dir = g_strdup(g_get_user_runtime_dir()); +if (!user_dir || g_str_has_suffix(user_dir, "/0")) { +user_dir = g_strdup_printf("/tmp/%d", geteuid()); +create_dir = true; +} + +if (create_dir && g_mkdir_with_parents(user_dir, S_IRWXU) < 0) { +fuse_log(FUSE_LOG_ERR, "%s: Failed to create directory %s: %s", + __func__, user_dir, strerror(errno)); +g_free(user_dir); +return false; +} +dir = g_strdup(user_dir); +g_free(user_dir); +} else { +dir = qemu_get_local_state_pathname("run/virtiofsd"); +if (g_mkdir_with_parents(dir, S_IRWXU) < 0) { +fuse_log(FUSE_LOG_ERR, "%s: Failed to create directory %s: %s", + __func__, dir, strerror(errno)); +return false; +} } sk_name = g_strdup(se->vu_socket_path); -- 2.25.4
[PATCH v2 05/16] hw/block/nvme: destroy request iov before reuse
From: Klaus Jensen Make sure the request iov is destroyed before reuse; fixing a memory leak. Signed-off-by: Klaus Jensen --- hw/block/nvme.c | 4 1 file changed, 4 insertions(+) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index a9d9a2912655..8f8257e06eed 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -554,6 +554,10 @@ static void nvme_rw_cb(void *opaque, int ret) if (req->qsg.nalloc) { qemu_sglist_destroy(>qsg); } +if (req->iov.nalloc) { +qemu_iovec_destroy(>iov); +} + nvme_enqueue_req_completion(cq, req); } -- 2.27.0
[PATCH v2 06/16] hw/block/nvme: refactor dma read/write
From: Klaus Jensen Refactor the nvme_dma_{read,write}_prp functions into a common function taking a DMADirection parameter. Signed-off-by: Klaus Jensen Reviewed-by: Maxim Levitsky Reviewed-by: Minwoo Im --- hw/block/nvme.c | 91 + 1 file changed, 46 insertions(+), 45 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 8f8257e06eed..571635ebe9f9 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -363,55 +363,53 @@ unmap: return status; } -static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len, - uint64_t prp1, uint64_t prp2) +static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len, + uint64_t prp1, uint64_t prp2, DMADirection dir) { QEMUSGList qsg; QEMUIOVector iov; uint16_t status = NVME_SUCCESS; -if (nvme_map_prp(, , prp1, prp2, len, n)) { -return NVME_INVALID_FIELD | NVME_DNR; +status = nvme_map_prp(, , prp1, prp2, len, n); +if (status) { +return status; } + +/* assert that only one of qsg and iov carries data */ +assert((qsg.nsg > 0) != (iov.niov > 0)); + if (qsg.nsg > 0) { -if (dma_buf_write(ptr, len, )) { -status = NVME_INVALID_FIELD | NVME_DNR; +uint64_t residual; + +if (dir == DMA_DIRECTION_TO_DEVICE) { +residual = dma_buf_write(ptr, len, ); +} else { +residual = dma_buf_read(ptr, len, ); } -qemu_sglist_destroy(); -} else { -if (qemu_iovec_to_buf(, 0, ptr, len) != len) { -status = NVME_INVALID_FIELD | NVME_DNR; -} -qemu_iovec_destroy(); -} -return status; -} -static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len, -uint64_t prp1, uint64_t prp2) -{ -QEMUSGList qsg; -QEMUIOVector iov; -uint16_t status = NVME_SUCCESS; - -trace_pci_nvme_dma_read(prp1, prp2); - -if (nvme_map_prp(, , prp1, prp2, len, n)) { -return NVME_INVALID_FIELD | NVME_DNR; -} -if (qsg.nsg > 0) { -if (unlikely(dma_buf_read(ptr, len, ))) { +if (unlikely(residual)) { trace_pci_nvme_err_invalid_dma(); status = NVME_INVALID_FIELD | NVME_DNR; } + qemu_sglist_destroy(); } else { -if (unlikely(qemu_iovec_from_buf(, 0, ptr, len) != 
len)) { +size_t bytes; + +if (dir == DMA_DIRECTION_TO_DEVICE) { +bytes = qemu_iovec_to_buf(, 0, ptr, len); +} else { +bytes = qemu_iovec_from_buf(, 0, ptr, len); +} + +if (unlikely(bytes != len)) { trace_pci_nvme_err_invalid_dma(); status = NVME_INVALID_FIELD | NVME_DNR; } + qemu_iovec_destroy(); } + return status; } @@ -843,8 +841,8 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae, nvme_clear_events(n, NVME_AER_TYPE_SMART); } -return nvme_dma_read_prp(n, (uint8_t *) + off, trans_len, prp1, - prp2); +return nvme_dma_prp(n, (uint8_t *) + off, trans_len, prp1, prp2, +DMA_DIRECTION_FROM_DEVICE); } static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len, @@ -865,8 +863,8 @@ static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len, trans_len = MIN(sizeof(fw_log) - off, buf_len); -return nvme_dma_read_prp(n, (uint8_t *) _log + off, trans_len, prp1, - prp2); +return nvme_dma_prp(n, (uint8_t *) _log + off, trans_len, prp1, prp2, +DMA_DIRECTION_FROM_DEVICE); } static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae, @@ -890,7 +888,8 @@ static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae, trans_len = MIN(sizeof(errlog) - off, buf_len); -return nvme_dma_read_prp(n, (uint8_t *), trans_len, prp1, prp2); +return nvme_dma_prp(n, (uint8_t *), trans_len, prp1, prp2, +DMA_DIRECTION_FROM_DEVICE); } static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) @@ -1045,8 +1044,8 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c) trace_pci_nvme_identify_ctrl(); -return nvme_dma_read_prp(n, (uint8_t *)>id_ctrl, sizeof(n->id_ctrl), -prp1, prp2); +return nvme_dma_prp(n, (uint8_t *)>id_ctrl, sizeof(n->id_ctrl), prp1, +prp2, DMA_DIRECTION_FROM_DEVICE); } static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c) @@ -1065,8 +1064,8 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c) ns = >namespaces[nsid - 1]; -return nvme_dma_read_prp(n, (uint8_t 
*)>id_ns, sizeof(ns->id_ns), -prp1, prp2); +return nvme_dma_prp(n, (uint8_t *)>id_ns, sizeof(ns->id_ns), prp1, +prp2,
[PATCH v2 11/16] hw/block/nvme: add check for mdts
From: Klaus Jensen Add 'mdts' device parameter to control the Maximum Data Transfer Size of the controller and check that it is respected. Signed-off-by: Klaus Jensen Reviewed-by: Maxim Levitsky Reviewed-by: Minwoo Im --- hw/block/nvme.c | 30 +- hw/block/nvme.h | 1 + hw/block/trace-events | 1 + 3 files changed, 31 insertions(+), 1 deletion(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index c35b35ed41c4..60034ea62ca8 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -20,7 +20,8 @@ * -device nvme,drive=,serial=,id=, \ * cmb_size_mb=, \ * [pmrdev=,] \ - * max_ioqpairs= + * max_ioqpairs=, \ + * mdts= * * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at * offset 0 in BAR2 and supports only WDS, RDS and SQS for now. @@ -555,6 +556,17 @@ static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type) } } +static inline uint16_t nvme_check_mdts(NvmeCtrl *n, size_t len) +{ +uint8_t mdts = n->params.mdts; + +if (mdts && len > n->page_size << mdts) { +return NVME_INVALID_FIELD | NVME_DNR; +} + +return NVME_SUCCESS; +} + static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace *ns, uint64_t slba, uint32_t nlb) { @@ -648,6 +660,13 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, trace_pci_nvme_rw(is_write ? 
"write" : "read", nlb, data_size, slba); +status = nvme_check_mdts(n, data_size); +if (status) { +trace_pci_nvme_err_mdts(nvme_cid(req), data_size); +block_acct_invalid(blk_get_stats(n->conf.blk), acct); +return status; +} + status = nvme_check_bounds(n, ns, slba, nlb); if (status) { trace_pci_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze); @@ -941,6 +960,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) uint32_t numdl, numdu; uint64_t off, lpol, lpou; size_t len; +uint16_t status; numdl = (dw10 >> 16); numdu = (dw11 & 0x); @@ -956,6 +976,12 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) trace_pci_nvme_get_log(nvme_cid(req), lid, lsp, rae, len, off); +status = nvme_check_mdts(n, len); +if (status) { +trace_pci_nvme_err_mdts(nvme_cid(req), len); +return status; +} + switch (lid) { case NVME_LOG_ERROR_INFO: return nvme_error_info(n, cmd, rae, len, off, req); @@ -2284,6 +2310,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev) id->ieee[0] = 0x00; id->ieee[1] = 0x02; id->ieee[2] = 0xb3; +id->mdts = n->params.mdts; id->ver = cpu_to_le32(NVME_SPEC_VER); id->oacs = cpu_to_le16(0); @@ -2403,6 +2430,7 @@ static Property nvme_props[] = { DEFINE_PROP_UINT16("msix_qsize", NvmeCtrl, params.msix_qsize, 65), DEFINE_PROP_UINT8("aerl", NvmeCtrl, params.aerl, 3), DEFINE_PROP_UINT32("aer_max_queued", NvmeCtrl, params.aer_max_queued, 64), +DEFINE_PROP_UINT8("mdts", NvmeCtrl, params.mdts, 7), DEFINE_PROP_END_OF_LIST(), }; diff --git a/hw/block/nvme.h b/hw/block/nvme.h index 5519b5cc7686..137cd8c2bf20 100644 --- a/hw/block/nvme.h +++ b/hw/block/nvme.h @@ -11,6 +11,7 @@ typedef struct NvmeParams { uint32_t cmb_size_mb; uint8_t aerl; uint32_t aer_max_queued; +uint8_t mdts; } NvmeParams; typedef struct NvmeAsyncEvent { diff --git a/hw/block/trace-events b/hw/block/trace-events index f20c59a4b542..82c123230780 100644 --- a/hw/block/trace-events +++ b/hw/block/trace-events @@ -85,6 +85,7 @@ 
pci_nvme_mmio_shutdown_set(void) "shutdown bit set" pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared" # nvme traces for error conditions +pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %"PRIu64"" pci_nvme_err_invalid_dma(void) "PRP/SGL is too small for transfer size" pci_nvme_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64"" pci_nvme_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64"" -- 2.27.0
[PATCH v2 14/16] hw/block/nvme: consolidate qsg/iov clearing
From: Klaus Jensen Always destroy the request qsg/iov at the end of request use. Signed-off-by: Klaus Jensen --- hw/block/nvme.c | 52 - 1 file changed, 21 insertions(+), 31 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 3d7275eae369..045dd55376a5 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -217,6 +217,17 @@ static void nvme_req_clear(NvmeRequest *req) memset(>cqe, 0x0, sizeof(req->cqe)); } +static void nvme_req_exit(NvmeRequest *req) +{ +if (req->qsg.sg) { +qemu_sglist_destroy(>qsg); +} + +if (req->iov.iov) { +qemu_iovec_destroy(>iov); +} +} + static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr addr, size_t len) { @@ -297,15 +308,14 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, status = nvme_map_addr(n, qsg, iov, prp1, trans_len); if (status) { -goto unmap; +return status; } len -= trans_len; if (len) { if (unlikely(!prp2)) { trace_pci_nvme_err_invalid_prp2_missing(); -status = NVME_INVALID_FIELD | NVME_DNR; -goto unmap; +return NVME_INVALID_FIELD | NVME_DNR; } if (len > n->page_size) { @@ -326,13 +336,11 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, if (i == n->max_prp_ents - 1 && len > n->page_size) { if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) { trace_pci_nvme_err_invalid_prplist_ent(prp_ent); -status = NVME_INVALID_FIELD | NVME_DNR; -goto unmap; +return NVME_INVALID_FIELD | NVME_DNR; } if (prp_list_in_cmb != nvme_addr_is_cmb(n, prp_ent)) { -status = NVME_INVALID_USE_OF_CMB | NVME_DNR; -goto unmap; +return NVME_INVALID_USE_OF_CMB | NVME_DNR; } i = 0; @@ -345,14 +353,13 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) { trace_pci_nvme_err_invalid_prplist_ent(prp_ent); -status = NVME_INVALID_FIELD | NVME_DNR; -goto unmap; +return NVME_INVALID_FIELD | NVME_DNR; } trans_len = MIN(len, n->page_size); status = nvme_map_addr(n, qsg, iov, prp_ent, 
trans_len); if (status) { -goto unmap; +return status; } len -= trans_len; @@ -361,27 +368,16 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, } else { if (unlikely(prp2 & (n->page_size - 1))) { trace_pci_nvme_err_invalid_prp2_align(prp2); -status = NVME_INVALID_FIELD | NVME_DNR; -goto unmap; +return NVME_INVALID_FIELD | NVME_DNR; } status = nvme_map_addr(n, qsg, iov, prp2, len); if (status) { -goto unmap; +return status; } } } + return NVME_SUCCESS; - -unmap: -if (iov && iov->iov) { -qemu_iovec_destroy(iov); -} - -if (qsg && qsg->sg) { -qemu_sglist_destroy(qsg); -} - -return status; } static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len, @@ -466,6 +462,7 @@ static void nvme_post_cqes(void *opaque) nvme_inc_cq_tail(cq); pci_dma_write(>parent_obj, addr, (void *)>cqe, sizeof(req->cqe)); +nvme_req_exit(req); QTAILQ_INSERT_TAIL(>req_list, req, entry); } if (cq->tail != cq->head) { @@ -602,13 +599,6 @@ static void nvme_rw_cb(void *opaque, int ret) req->status = NVME_INTERNAL_DEV_ERROR; } -if (req->qsg.nalloc) { -qemu_sglist_destroy(>qsg); -} -if (req->iov.nalloc) { -qemu_iovec_destroy(>iov); -} - nvme_enqueue_req_completion(cq, req); } -- 2.27.0
[PATCH v2 09/16] hw/block/nvme: verify validity of prp lists in the cmb
From: Klaus Jensen Before this patch the device already supported PRP lists in the CMB, but it neither checked their validity nor announced the support in the Identify Controller data structure LISTS field. If some of the PRPs in a PRP list are in the CMB, then ALL entries must be there. This patch verifies that requirement and properly announces support for PRP lists in the CMB. Signed-off-by: Klaus Jensen Reviewed-by: Maxim Levitsky Reviewed-by: Minwoo Im --- hw/block/nvme.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 198a26890e0c..45e4060d52d9 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -273,6 +273,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, trans_len = MIN(len, trans_len); int num_prps = (len >> n->page_bits) + 1; uint16_t status; +bool prp_list_in_cmb = false; trace_pci_nvme_map_prp(trans_len, len, prp1, prp2, num_prps); @@ -299,11 +300,16 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, status = NVME_INVALID_FIELD | NVME_DNR; goto unmap; } + if (len > n->page_size) { uint64_t prp_list[n->max_prp_ents]; uint32_t nents, prp_trans; int i = 0; +if (nvme_addr_is_cmb(n, prp2)) { +prp_list_in_cmb = true; +} + nents = (len + n->page_size - 1) >> n->page_bits; prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t); nvme_addr_read(n, prp2, (void *)prp_list, prp_trans); @@ -317,6 +323,11 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, goto unmap; } +if (prp_list_in_cmb != nvme_addr_is_cmb(n, prp_ent)) { +status = NVME_INVALID_USE_OF_CMB | NVME_DNR; +goto unmap; +} + i = 0; nents = (len + n->page_size - 1) >> n->page_bits; prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t); @@ -336,6 +347,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, if (status) { goto unmap; } + len -= trans_len; i++; } @@ -2153,7 +2165,7 @@
static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev) NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1); NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0); -NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0); +NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 1); NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1); NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1); NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */ -- 2.27.0
[PATCH v2 04/16] hw/block/nvme: remove redundant has_sg member
From: Klaus Jensen Remove the has_sg member from NvmeRequest since it's redundant. Signed-off-by: Klaus Jensen --- hw/block/nvme.c | 7 ++- hw/block/nvme.h | 1 - 2 files changed, 2 insertions(+), 6 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index d60b19e1840f..a9d9a2912655 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -550,7 +550,8 @@ static void nvme_rw_cb(void *opaque, int ret) block_acct_failed(blk_get_stats(n->conf.blk), >acct); req->status = NVME_INTERNAL_DEV_ERROR; } -if (req->has_sg) { + +if (req->qsg.nalloc) { qemu_sglist_destroy(>qsg); } nvme_enqueue_req_completion(cq, req); @@ -559,7 +560,6 @@ static void nvme_rw_cb(void *opaque, int ret) static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, NvmeRequest *req) { -req->has_sg = false; block_acct_start(blk_get_stats(n->conf.blk), >acct, 0, BLOCK_ACCT_FLUSH); req->aiocb = blk_aio_flush(n->conf.blk, nvme_rw_cb, req); @@ -585,7 +585,6 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, return NVME_LBA_RANGE | NVME_DNR; } -req->has_sg = false; block_acct_start(blk_get_stats(n->conf.blk), >acct, 0, BLOCK_ACCT_WRITE); req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count, @@ -623,7 +622,6 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, } if (req->qsg.nsg > 0) { -req->has_sg = true; block_acct_start(blk_get_stats(n->conf.blk), >acct, req->qsg.size, acct); req->aiocb = is_write ? @@ -632,7 +630,6 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, dma_blk_read(n->conf.blk, >qsg, data_offset, BDRV_SECTOR_SIZE, nvme_rw_cb, req); } else { -req->has_sg = false; block_acct_start(blk_get_stats(n->conf.blk), >acct, req->iov.size, acct); req->aiocb = is_write ? 
diff --git a/hw/block/nvme.h b/hw/block/nvme.h index 0b6a8ae66559..5519b5cc7686 100644 --- a/hw/block/nvme.h +++ b/hw/block/nvme.h @@ -22,7 +22,6 @@ typedef struct NvmeRequest { struct NvmeSQueue *sq; BlockAIOCB *aiocb; uint16_tstatus; -boolhas_sg; NvmeCqe cqe; BlockAcctCookie acct; QEMUSGList qsg; -- 2.27.0
[PATCH v2 15/16] hw/block/nvme: use preallocated qsg/iov in nvme_dma_prp
From: Klaus Jensen Since clean up of the request qsg/iov is now always done post-use, there is no need to use a stack-allocated qsg/iov in nvme_dma_prp. Signed-off-by: Klaus Jensen Acked-by: Keith Busch Reviewed-by: Maxim Levitsky --- hw/block/nvme.c | 41 ++--- 1 file changed, 18 insertions(+), 23 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 045dd55376a5..55b1a68ced8c 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -381,50 +381,45 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, } static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len, - uint64_t prp1, uint64_t prp2, DMADirection dir) + uint64_t prp1, uint64_t prp2, DMADirection dir, + NvmeRequest *req) { -QEMUSGList qsg; -QEMUIOVector iov; uint16_t status = NVME_SUCCESS; -status = nvme_map_prp(, , prp1, prp2, len, n); +status = nvme_map_prp(>qsg, >iov, prp1, prp2, len, n); if (status) { return status; } /* assert that only one of qsg and iov carries data */ -assert((qsg.nsg > 0) != (iov.niov > 0)); +assert((req->qsg.nsg > 0) != (req->iov.niov > 0)); -if (qsg.nsg > 0) { +if (req->qsg.nsg > 0) { uint64_t residual; if (dir == DMA_DIRECTION_TO_DEVICE) { -residual = dma_buf_write(ptr, len, ); +residual = dma_buf_write(ptr, len, >qsg); } else { -residual = dma_buf_read(ptr, len, ); +residual = dma_buf_read(ptr, len, >qsg); } if (unlikely(residual)) { trace_pci_nvme_err_invalid_dma(); status = NVME_INVALID_FIELD | NVME_DNR; } - -qemu_sglist_destroy(); } else { size_t bytes; if (dir == DMA_DIRECTION_TO_DEVICE) { -bytes = qemu_iovec_to_buf(, 0, ptr, len); +bytes = qemu_iovec_to_buf(>iov, 0, ptr, len); } else { -bytes = qemu_iovec_from_buf(, 0, ptr, len); +bytes = qemu_iovec_from_buf(>iov, 0, ptr, len); } if (unlikely(bytes != len)) { trace_pci_nvme_err_invalid_dma(); status = NVME_INVALID_FIELD | NVME_DNR; } - -qemu_iovec_destroy(); } return status; @@ -893,7 +888,7 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len, } 
return nvme_dma_prp(n, (uint8_t *) &smart + off, trans_len, prp1, prp2, -DMA_DIRECTION_FROM_DEVICE); +DMA_DIRECTION_FROM_DEVICE, req); } static uint16_t nvme_fw_log_info(NvmeCtrl *n, uint32_t buf_len, uint64_t off, @@ -916,7 +911,7 @@ static uint16_t nvme_fw_log_info(NvmeCtrl *n, uint32_t buf_len, uint64_t off, trans_len = MIN(sizeof(fw_log) - off, buf_len); return nvme_dma_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1, prp2, -DMA_DIRECTION_FROM_DEVICE); +DMA_DIRECTION_FROM_DEVICE, req); } static uint16_t nvme_error_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len, @@ -941,7 +936,7 @@ static uint16_t nvme_error_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len, trans_len = MIN(sizeof(errlog) - off, buf_len); return nvme_dma_prp(n, (uint8_t *)&errlog, trans_len, prp1, prp2, -DMA_DIRECTION_FROM_DEVICE); +DMA_DIRECTION_FROM_DEVICE, req); } static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req) @@ -1107,7 +1102,7 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req) trace_pci_nvme_identify_ctrl(); return nvme_dma_prp(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl), prp1, -prp2, DMA_DIRECTION_FROM_DEVICE); +prp2, DMA_DIRECTION_FROM_DEVICE, req); } static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req) @@ -1128,7 +1123,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req) ns = &n->namespaces[nsid - 1]; return nvme_dma_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns), prp1, -prp2, DMA_DIRECTION_FROM_DEVICE); +prp2, DMA_DIRECTION_FROM_DEVICE, req); } static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req) @@ -1165,7 +1160,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req) } } ret = nvme_dma_prp(n, (uint8_t *)list, data_len, prp1, prp2, - DMA_DIRECTION_FROM_DEVICE); + DMA_DIRECTION_FROM_DEVICE, req); g_free(list); return ret; } @@ -1208,7 +1203,7 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req) stl_be_p(&ns_descrs->uuid.v, nsid); return nvme_dma_prp(n, list, NVME_IDENTIFY_DATA_SIZE, prp1, prp2, 
-DMA_DIRECTION_FROM_DEVICE); +DMA_DIRECTION_FROM_DEVICE, req);
[PATCH v2 07/16] hw/block/nvme: add tracing to nvme_map_prp
From: Klaus Jensen Add tracing to nvme_map_prp. Signed-off-by: Klaus Jensen --- hw/block/nvme.c | 2 ++ hw/block/trace-events | 1 + 2 files changed, 3 insertions(+) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 571635ebe9f9..952afbb05175 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -274,6 +274,8 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, int num_prps = (len >> n->page_bits) + 1; uint16_t status; +trace_pci_nvme_map_prp(trans_len, len, prp1, prp2, num_prps); + if (unlikely(!prp1)) { trace_pci_nvme_err_invalid_prp(); return NVME_INVALID_FIELD | NVME_DNR; diff --git a/hw/block/trace-events b/hw/block/trace-events index f3b2d004e078..f20c59a4b542 100644 --- a/hw/block/trace-events +++ b/hw/block/trace-events @@ -35,6 +35,7 @@ pci_nvme_irq_masked(void) "IRQ is masked" pci_nvme_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64"" pci_nvme_map_addr(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len %"PRIu64"" pci_nvme_map_addr_cmb(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len %"PRIu64"" +pci_nvme_map_prp(uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d" pci_nvme_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8"" pci_nvme_admin_cmd(uint16_t cid, uint16_t sqid, uint8_t opcode) "cid %"PRIu16" sqid %"PRIu16" opc 0x%"PRIx8"" pci_nvme_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64"" -- 2.27.0
[PATCH v2 00/16] hw/block/nvme: dma handling and address mapping cleanup
From: Klaus Jensen This series consists of patches that refactor dma read/write and add a number of address mapping helper functions. v2: * hw/block/nvme: add mapping helpers - Add an assert in case of out of bounds array access. (Maxim) * hw/block/nvme: remove redundant has_sg member - Split the fix for the missing qemu_iov_destroy into a fresh patch ("hw/block/nvme: destroy request iov before reuse"). (Minwoo) * hw/block/nvme: pass request along for tracing [DROPPED] - Dropped the patch and replaced it with a simple patch that just adds tracing to the nvme_map_prp function ("hw/block/nvme: add tracing to nvme_map_prp"). (Minwoo) * hw/block/nvme: add request mapping helper - Changed the name from nvme_map to nvme_map_dptr. (Minwoo, Maxim) * hw/block/nvme: add check for mdts - Don't touch the documentation for the cmb_size_mb and max_ioqpairs parameters in this patch. (Minwoo) * hw/block/nvme: refactor NvmeRequest clearing [DROPPED] - Keep NvmeRequest structure clearing as "before use". (Maxim) * hw/block/nvme: add a namespace reference in NvmeRequest * hw/block/nvme: remove NvmeCmd parameter - Squash these two patches together into "hw/block/nvme: add ns/cmd references in NvmeRequest". * hw/block/nvme: consolidate qsg/iov clearing - Move the qsg/iov destroys to a new nvme_req_exit function that is called after the cqe has been posted. * hw/block/nvme: remove explicit qsg/iov parameters - New patch. The nvme_map_prp() function doesn't require the qsg and iov parameters since it can just get them from the passed NvmeRequest. 
Based-on: <20200706061303.246057-1-...@irrelevant.dk> Klaus Jensen (16): hw/block/nvme: memset preallocated requests structures hw/block/nvme: add mapping helpers hw/block/nvme: replace dma_acct with blk_acct equivalent hw/block/nvme: remove redundant has_sg member hw/block/nvme: destroy request iov before reuse hw/block/nvme: refactor dma read/write hw/block/nvme: add tracing to nvme_map_prp hw/block/nvme: add request mapping helper hw/block/nvme: verify validity of prp lists in the cmb hw/block/nvme: refactor request bounds checking hw/block/nvme: add check for mdts hw/block/nvme: be consistent about zeros vs zeroes hw/block/nvme: add ns/cmd references in NvmeRequest hw/block/nvme: consolidate qsg/iov clearing hw/block/nvme: use preallocated qsg/iov in nvme_dma_prp hw/block/nvme: remove explicit qsg/iov parameters block/nvme.c | 4 +- hw/block/nvme.c | 506 +++--- hw/block/nvme.h | 4 +- hw/block/trace-events | 4 + include/block/nvme.h | 4 +- 5 files changed, 340 insertions(+), 182 deletions(-) -- 2.27.0
[PATCH v2 10/16] hw/block/nvme: refactor request bounds checking
From: Klaus Jensen Hoist bounds checking into its own function and check for wrap-around. Signed-off-by: Klaus Jensen Reviewed-by: Maxim Levitsky Reviewed-by: Minwoo Im --- hw/block/nvme.c | 26 +- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 45e4060d52d9..c35b35ed41c4 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -555,6 +555,18 @@ static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type) } } +static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace *ns, + uint64_t slba, uint32_t nlb) +{ +uint64_t nsze = le64_to_cpu(ns->id_ns.nsze); + +if (unlikely(UINT64_MAX - slba < nlb || slba + nlb > nsze)) { +return NVME_LBA_RANGE | NVME_DNR; +} + +return NVME_SUCCESS; +} + static void nvme_rw_cb(void *opaque, int ret) { NvmeRequest *req = opaque; @@ -602,12 +614,14 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, uint32_t nlb = le16_to_cpu(rw->nlb) + 1; uint64_t offset = slba << data_shift; uint32_t count = nlb << data_shift; +uint16_t status; trace_pci_nvme_write_zeroes(nvme_cid(req), slba, nlb); -if (unlikely(slba + nlb > ns->id_ns.nsze)) { +status = nvme_check_bounds(n, ns, slba, nlb); +if (status) { trace_pci_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze); -return NVME_LBA_RANGE | NVME_DNR; +return status; } block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0, @@ -630,13 +644,15 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, uint64_t data_offset = slba << data_shift; int is_write = rw->opcode == NVME_CMD_WRITE ? 1 : 0; enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ; +uint16_t status; trace_pci_nvme_rw(is_write ? 
"write" : "read", nlb, data_size, slba); -if (unlikely((slba + nlb) > ns->id_ns.nsze)) { -block_acct_invalid(blk_get_stats(n->conf.blk), acct); +status = nvme_check_bounds(n, ns, slba, nlb); +if (status) { trace_pci_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze); -return NVME_LBA_RANGE | NVME_DNR; +block_acct_invalid(blk_get_stats(n->conf.blk), acct); +return status; } if (nvme_map_dptr(n, cmd, data_size, req)) { -- 2.27.0
[PATCH v2 01/16] hw/block/nvme: memset preallocated requests structures
From: Klaus Jensen This is preparatory to subsequent patches that change how QSGs/IOVs are handled. It is important that the qsg and iov members of the NvmeRequest are initially zeroed. Signed-off-by: Klaus Jensen Reviewed-by: Maxim Levitsky --- hw/block/nvme.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index a1fd1fe14ba7..25b87486ea5a 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -653,7 +653,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr, sq->size = size; sq->cqid = cqid; sq->head = sq->tail = 0; -sq->io_req = g_new(NvmeRequest, sq->size); +sq->io_req = g_new0(NvmeRequest, sq->size); QTAILQ_INIT(&sq->req_list); QTAILQ_INIT(&sq->out_req_list); -- 2.27.0
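The reason the g_new() → g_new0() switch matters: later patches in the series decide whether qsg/iov cleanup is needed by inspecting req->qsg.nsg and req->iov.niov, so the preallocated request structs must start out zeroed. A plain-C stand-in (calloc plays the role of GLib's g_new0; FakeRequest is an illustrative struct, not QEMU's NvmeRequest):

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the fields the series branches on. */
typedef struct {
    int nsg;   /* stand-in for QEMUSGList.nsg */
    int niov;  /* stand-in for QEMUIOVector.niov */
} FakeRequest;

/* g_new0(NvmeRequest, n) equivalent: allocation plus zero-fill. With a bare
 * malloc (g_new), nsg/niov would be indeterminate and cleanup code reading
 * them would misbehave. */
static FakeRequest *alloc_requests_zeroed(size_t n)
{
    return calloc(n, sizeof(FakeRequest));
}
```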
[PATCH v2 08/16] hw/block/nvme: add request mapping helper
From: Klaus Jensen Introduce the nvme_map helper to remove some noise in the main nvme_rw function. Signed-off-by: Klaus Jensen Reviewed-by: Maxim Levitsky --- hw/block/nvme.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 952afbb05175..198a26890e0c 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -415,6 +415,15 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len, return status; } +static uint16_t nvme_map_dptr(NvmeCtrl *n, NvmeCmd *cmd, size_t len, + NvmeRequest *req) +{ +uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1); +uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2); + +return nvme_map_prp(&req->qsg, &req->iov, prp1, prp2, len, n); +} + static void nvme_post_cqes(void *opaque) { NvmeCQueue *cq = opaque; @@ -602,8 +611,6 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, NvmeRwCmd *rw = (NvmeRwCmd *)cmd; uint32_t nlb = le32_to_cpu(rw->nlb) + 1; uint64_t slba = le64_to_cpu(rw->slba); -uint64_t prp1 = le64_to_cpu(rw->dptr.prp1); -uint64_t prp2 = le64_to_cpu(rw->dptr.prp2); uint8_t lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas); uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds; @@ -620,7 +627,7 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, return NVME_LBA_RANGE | NVME_DNR; } -if (nvme_map_prp(&req->qsg, &req->iov, prp1, prp2, data_size, n)) { +if (nvme_map_dptr(n, cmd, data_size, req)) { block_acct_invalid(blk_get_stats(n->conf.blk), acct); return NVME_INVALID_FIELD | NVME_DNR; } -- 2.27.0
[PATCH v2 03/16] hw/block/nvme: replace dma_acct with blk_acct equivalent
From: Klaus Jensen The QSG isn't always initialized, so accounting could be wrong. Issue a call to blk_acct_start instead with the size taken from the QSG or IOV depending on the kind of I/O. Signed-off-by: Klaus Jensen Reviewed-by: Maxim Levitsky Reviewed-by: Minwoo Im --- hw/block/nvme.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 0422d4397d76..d60b19e1840f 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -622,9 +622,10 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, return NVME_INVALID_FIELD | NVME_DNR; } -dma_acct_start(n->conf.blk, &req->acct, &req->qsg, acct); if (req->qsg.nsg > 0) { req->has_sg = true; +block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->qsg.size, + acct); req->aiocb = is_write ? dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE, nvme_rw_cb, req) : @@ -632,6 +633,8 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, nvme_rw_cb, req); } else { req->has_sg = false; +block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->iov.size, + acct); req->aiocb = is_write ? blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb, req) : -- 2.27.0
[PATCH v2 02/16] hw/block/nvme: add mapping helpers
From: Klaus Jensen Add nvme_map_addr, nvme_map_addr_cmb and nvme_addr_to_cmb helpers and use them in nvme_map_prp. This fixes a bug where in the case of a CMB transfer, the device would map to the buffer with a wrong length. Fixes: b2b2b67a00574 ("nvme: Add support for Read Data and Write Data in CMBs.") Signed-off-by: Klaus Jensen Reviewed-by: Maxim Levitsky Reviewed-by: Minwoo Im Reviewed-by: Andrzej Jakowski --- hw/block/nvme.c | 111 +++--- hw/block/trace-events | 2 + 2 files changed, 96 insertions(+), 17 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 25b87486ea5a..0422d4397d76 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -117,10 +117,17 @@ static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr) return addr >= low && addr < hi; } +static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr) +{ +assert(nvme_addr_is_cmb(n, addr)); + +return &n->cmbuf[addr - n->ctrl_mem.addr]; +} + static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size) { if (n->bar.cmbsz && nvme_addr_is_cmb(n, addr)) { -memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size); +memcpy(buf, nvme_addr_to_cmb(n, addr), size); return; } @@ -203,29 +210,91 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq) } } +static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr addr, + size_t len) +{ +if (!len) { +return NVME_SUCCESS; +} + +trace_pci_nvme_map_addr_cmb(addr, len); + +if (!nvme_addr_is_cmb(n, addr) || !nvme_addr_is_cmb(n, addr + len - 1)) { +return NVME_DATA_TRAS_ERROR; +} + +qemu_iovec_add(iov, nvme_addr_to_cmb(n, addr), len); + +return NVME_SUCCESS; +} + +static uint16_t nvme_map_addr(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov, + hwaddr addr, size_t len) +{ +if (!len) { +return NVME_SUCCESS; +} + +trace_pci_nvme_map_addr(addr, len); + +if (nvme_addr_is_cmb(n, addr)) { +if (qsg && qsg->sg) { +return NVME_INVALID_USE_OF_CMB | NVME_DNR; +} + +assert(iov); + +if (!iov->iov) { +qemu_iovec_init(iov, 1); +} + +return nvme_map_addr_cmb(n, iov, addr, len); +} + +if (iov && iov->iov) { +return NVME_INVALID_USE_OF_CMB | NVME_DNR; +} + +assert(qsg); + +if (!qsg->sg) { +pci_dma_sglist_init(qsg, &n->parent_obj, 1); +} + +qemu_sglist_add(qsg, addr, len); + +return NVME_SUCCESS; +} + static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, uint64_t prp2, uint32_t len, NvmeCtrl *n) { hwaddr trans_len = n->page_size - (prp1 % n->page_size); trans_len = MIN(len, trans_len); int num_prps = (len >> n->page_bits) + 1; +uint16_t status; if (unlikely(!prp1)) { trace_pci_nvme_err_invalid_prp(); return NVME_INVALID_FIELD | NVME_DNR; -} else if (n->bar.cmbsz && prp1 >= n->ctrl_mem.addr && - prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) { -qsg->nsg = 0; +} + +if (nvme_addr_is_cmb(n, prp1)) { qemu_iovec_init(iov, num_prps); -qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], trans_len); } else { pci_dma_sglist_init(qsg, &n->parent_obj, num_prps); -qemu_sglist_add(qsg, prp1, trans_len); } + +status = nvme_map_addr(n, qsg, iov, prp1, trans_len); +if (status) { +goto unmap; +} + len -= trans_len; if (len) { if (unlikely(!prp2)) { trace_pci_nvme_err_invalid_prp2_missing(); +status = NVME_INVALID_FIELD | NVME_DNR; goto unmap; } if (len > n->page_size) { @@ -242,6 +311,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, if (i == n->max_prp_ents - 1 && len > n->page_size) { if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) { trace_pci_nvme_err_invalid_prplist_ent(prp_ent); +status = NVME_INVALID_FIELD | NVME_DNR; goto unmap; } @@ -255,14 +325,14 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) { trace_pci_nvme_err_invalid_prplist_ent(prp_ent); +status = NVME_INVALID_FIELD | NVME_DNR; goto unmap; } trans_len = MIN(len, n->page_size); -if (qsg->nsg){ -qemu_sglist_add(qsg, prp_ent, trans_len); -} else { -qemu_iovec_add(iov, (void *)&n->cmbuf[prp_ent - n->ctrl_mem.addr], trans_len); +status =
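The core rule nvme_map_addr() enforces is that a single request gathers its data either entirely through the DMA scatter list (qsg) or entirely through the CMB iovec (iov); mixing the two yields NVME_INVALID_USE_OF_CMB. A minimal stand-in sketch of that dispatch (types and names are illustrative, not QEMU's):

```c
/* Toy model of nvme_map_addr()'s qsg-vs-iov dispatch: once a request has
 * started gathering via one mechanism, an address belonging to the other
 * mechanism is rejected. */
enum { MAP_OK = 0, MAP_MIXED = 1 };

typedef struct { int nsg; } SgList;   /* counts DMA segments gathered */
typedef struct { int niov; } IoVec;   /* counts CMB segments gathered */

static int map_addr(SgList *qsg, IoVec *iov, int addr_in_cmb)
{
    if (addr_in_cmb) {
        if (qsg->nsg > 0) {
            return MAP_MIXED;   /* already gathering via DMA */
        }
        iov->niov++;            /* gather through the CMB iovec */
    } else {
        if (iov->niov > 0) {
            return MAP_MIXED;   /* already gathering via CMB */
        }
        qsg->nsg++;             /* gather through the DMA scatter list */
    }
    return MAP_OK;
}
```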
[PATCH v6] nvme: allow cmb and pmr emulation on same device
Resending the series recently posted on the mailing list related to the nvme device extension, with a couple of fixes after review. This patch series does the following: - Fixes a problem where the CMBS bit was not set in the controller capabilities register, so support for CMB was not correctly advertised to the guest. This is a resend of a patch that has been submitted and reviewed by Klaus [1] - Introduces BAR4 sharing between MSI-X vectors and CMB. This allows CMB and PMR to be emulated on the same device. This extension was indicated by Keith [2] v6: - instead of using the memory_region_to_absolute_addr() function, a local helper has been defined (nvme_cmb_to_absolute_addr()) to calculate the absolute address of the CMB in a simplified way. Also a number of code style changes have been made (function rename, use literal instead of macro definition, etc.) v5: - fixed a problem introduced in v4 where the CMB buffer was represented as a subregion of the BAR4 memory region. In that case the CMB address was used incorrectly, as it was relative to BAR4 and not absolute. Appropriate changes were added in v5 to calculate the CMB address properly ([6]) v4: - modified BAR4 initialization, so now it consists of CMB, MSIX and PBA memory regions overlapping on top of it. 
This reduces patch complexity significantly (Klaus [5]) v3: - code style fixes including: removal of spurious line break, moving define into define section and code alignment (Klaus [4]) - removed unintended code reintroduction (Klaus [4]) v2: - rebase on Kevin's latest block branch (Klaus [3]) - improved comments section (Klaus [3]) - simplified calculation of BAR4 size (Klaus [3]) v1: - initial push of the patch [1]: https://lore.kernel.org/qemu-devel/20200408055607.g2ii7gwqbnv6cd3w@apples.localdomain/ [2]: https://lore.kernel.org/qemu-devel/20200330165518.ga8...@redsun51.ssa.fujisawa.hgst.com/ [3]: https://lore.kernel.org/qemu-devel/20200605181043.28782-1-andrzej.jakow...@linux.intel.com/ [4]: https://lore.kernel.org/qemu-devel/20200618092524.posxi5mysb3jjtpn@apples.localdomain/ [5]: https://lore.kernel.org/qemu-devel/20200626055033.6vxqvi4s5pll7som@apples.localdomain/ [6]: https://lore.kernel.org/qemu-devel/9143a543-d32d-f3e7-c37b-b3df7f853...@linux.intel.com/
[PATCH v6 1/2] nvme: indicate CMB support through controller capabilities register
This patch sets CMBS bit in controller capabilities register when user configures NVMe driver with CMB support, so capabilites are correctly reported to guest OS. Signed-off-by: Andrzej Jakowski Reviewed-by: Klaus Jensen Reviewed-by: Maxim Levitsky --- hw/block/nvme.c | 1 + include/block/nvme.h | 10 +++--- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 841c18920c..43866b744f 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -2198,6 +2198,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev) NVME_CAP_SET_TO(n->bar.cap, 0xf); NVME_CAP_SET_CSS(n->bar.cap, 1); NVME_CAP_SET_MPSMAX(n->bar.cap, 4); +NVME_CAP_SET_CMBS(n->bar.cap, n->params.cmb_size_mb ? 1 : 0); n->bar.vs = NVME_SPEC_VER; n->bar.intmc = n->bar.intms = 0; diff --git a/include/block/nvme.h b/include/block/nvme.h index 370df7fc05..d641ca6649 100644 --- a/include/block/nvme.h +++ b/include/block/nvme.h @@ -36,6 +36,7 @@ enum NvmeCapShift { CAP_MPSMIN_SHIFT = 48, CAP_MPSMAX_SHIFT = 52, CAP_PMR_SHIFT = 56, +CAP_CMB_SHIFT = 57, }; enum NvmeCapMask { @@ -49,6 +50,7 @@ enum NvmeCapMask { CAP_MPSMIN_MASK= 0xf, CAP_MPSMAX_MASK= 0xf, CAP_PMR_MASK = 0x1, +CAP_CMB_MASK = 0x1, }; #define NVME_CAP_MQES(cap) (((cap) >> CAP_MQES_SHIFT) & CAP_MQES_MASK) @@ -78,9 +80,11 @@ enum NvmeCapMask { #define NVME_CAP_SET_MPSMIN(cap, val) (cap |= (uint64_t)(val & CAP_MPSMIN_MASK)\ << CAP_MPSMIN_SHIFT) #define NVME_CAP_SET_MPSMAX(cap, val) (cap |= (uint64_t)(val & CAP_MPSMAX_MASK)\ -<< CAP_MPSMAX_SHIFT) -#define NVME_CAP_SET_PMRS(cap, val) (cap |= (uint64_t)(val & CAP_PMR_MASK)\ -<< CAP_PMR_SHIFT) + << CAP_MPSMAX_SHIFT) +#define NVME_CAP_SET_PMRS(cap, val) (cap |= (uint64_t)(val & CAP_PMR_MASK) \ + << CAP_PMR_SHIFT) +#define NVME_CAP_SET_CMBS(cap, val) (cap |= (uint64_t)(val & CAP_CMB_MASK) \ + << CAP_CMB_SHIFT) enum NvmeCcShift { CC_EN_SHIFT = 0, -- 2.25.4
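The CAP accessors are plain shift/mask macros; the sketch below reproduces the setter this patch adds together with a getter in the same style as the header's other accessors (the getter name is assumed here for illustration), showing how bit 57 (CMBS) is set and read back.

```c
#include <stdint.h>

/* CMBS lives at bit 57 of the 64-bit CAP register, per the patch's
 * CAP_CMB_SHIFT / CAP_CMB_MASK definitions. */
#define CAP_CMB_SHIFT 57
#define CAP_CMB_MASK  0x1

/* Getter in the style of the header's NVME_CAP_MQES() etc. (name assumed). */
#define NVME_CAP_CMBS(cap)          (((cap) >> CAP_CMB_SHIFT) & CAP_CMB_MASK)
/* Setter as added by the patch: mask the value, shift into place, OR in. */
#define NVME_CAP_SET_CMBS(cap, val) ((cap) |= (uint64_t)((val) & CAP_CMB_MASK) \
                                     << CAP_CMB_SHIFT)
```

The `(uint64_t)` cast before the shift is what keeps the macro correct: without it, shifting an `int` by 57 would be undefined behavior.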
[PATCH v6 2/2] nvme: allow cmb and pmr to be enabled on same device
So far it was not possible to have CMB and PMR emulated on the same device, because BAR2 was used exclusively by either PMR or CMB. This patch places the CMB at a BAR4 offset so it does not conflict with MSI-X vectors. Signed-off-by: Andrzej Jakowski --- hw/block/nvme.c | 124 +-- hw/block/nvme.h | 1 + include/block/nvme.h | 4 +- 3 files changed, 89 insertions(+), 40 deletions(-) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 43866b744f..292bca445f 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -22,12 +22,13 @@ * [pmrdev=,] \ * max_ioqpairs= * - * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at - * offset 0 in BAR2 and supports only WDS, RDS and SQS for now. + * Note cmb_size_mb denotes size of CMB in MB. CMB when configured is assumed + * to be resident in BAR4 at offset that is 2MiB aligned. When CMB is emulated + * on Linux guest it is recommended to make cmb_size_mb multiple of 2. Both + * size and alignment restrictions are imposed by Linux guest. * - * cmb_size_mb= and pmrdev= options are mutually exclusive due to limitation - * in available BAR's. cmb_size_mb= will take precedence over pmrdev= when - * both provided. + * pmrdev is assumed to be resident in BAR2/BAR3. When configured it consumes + * whole BAR2/BAR3 exclusively. * Enabling pmr emulation can be achieved by pointing to memory-backend-file. 
* For example: * -object memory-backend-file,id=,share=on,mem-path=, \ @@ -57,7 +58,6 @@ #define NVME_MAX_IOQPAIRS 0x #define NVME_DB_SIZE 4 #define NVME_SPEC_VER 0x00010300 -#define NVME_CMB_BIR 2 #define NVME_PMR_BIR 2 #define NVME_TEMPERATURE 0x143 #define NVME_TEMPERATURE_WARNING 0x157 @@ -109,18 +109,25 @@ static uint16_t nvme_sqid(NvmeRequest *req) return le16_to_cpu(req->sq->sqid); } +static inline hwaddr nvme_cmb_to_absolute_addr(NvmeCtrl *n) +{ +return n->bar4.addr + n->ctrl_mem.addr; +} + static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr) { -hwaddr low = n->ctrl_mem.addr; -hwaddr hi = n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size); +hwaddr low = nvme_cmb_to_absolute_addr(n); +hwaddr hi = low + int128_get64(n->ctrl_mem.size); return addr >= low && addr < hi; } static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size) { +hwaddr cmb_addr = nvme_cmb_to_absolute_addr(n); + if (n->bar.cmbsz && nvme_addr_is_cmb(n, addr)) { -memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size); +memcpy(buf, (void *)&n->cmbuf[addr - cmb_addr], size); return; } @@ -207,17 +214,18 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, uint64_t prp2, uint32_t len, NvmeCtrl *n) { hwaddr trans_len = n->page_size - (prp1 % n->page_size); +hwaddr cmb_addr = nvme_cmb_to_absolute_addr(n); trans_len = MIN(len, trans_len); int num_prps = (len >> n->page_bits) + 1; if (unlikely(!prp1)) { trace_pci_nvme_err_invalid_prp(); return NVME_INVALID_FIELD | NVME_DNR; -} else if (n->bar.cmbsz && prp1 >= n->ctrl_mem.addr && - prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) { +} else if (n->bar.cmbsz && prp1 >= cmb_addr && + prp1 < cmb_addr + int128_get64(n->ctrl_mem.size)) { qsg->nsg = 0; qemu_iovec_init(iov, num_prps); -qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], trans_len); +qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - cmb_addr], trans_len); } else { pci_dma_sglist_init(qsg, &n->parent_obj, num_prps); qemu_sglist_add(qsg, prp1, 
trans_len); @@ -262,7 +270,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, if (qsg->nsg){ qemu_sglist_add(qsg, prp_ent, trans_len); } else { -qemu_iovec_add(iov, (void *)&n->cmbuf[prp_ent - n->ctrl_mem.addr], trans_len); +qemu_iovec_add(iov, (void *)&n->cmbuf[prp_ent - cmb_addr], trans_len); } len -= trans_len; i++; @@ -275,7 +283,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1, if (qsg->nsg) { qemu_sglist_add(qsg, prp2, len); } else { -qemu_iovec_add(iov, (void *)&n->cmbuf[prp2 - n->ctrl_mem.addr], trans_len); +qemu_iovec_add(iov, (void *)&n->cmbuf[prp2 - cmb_addr], trans_len); } } } @@ -1980,7 +1988,7 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp) return; } -if (!n->params.cmb_size_mb && n->pmrdev) { +if (n->pmrdev) { if (host_memory_backend_is_mapped(n->pmrdev)) { char *path = object_get_canonical_path_component(OBJECT(n->pmrdev)); error_setg(errp, "can't use already
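With the CMB moved into BAR4, the guest-visible CMB address is no longer ctrl_mem.addr itself but BAR4's base plus the CMB's offset within BAR4, which is what nvme_cmb_to_absolute_addr() computes. A self-contained sketch of that calculation and the range check built on it (FakeNvmeCtrl is a minimal stand-in, not the real NvmeCtrl):

```c
#include <stdint.h>

typedef uint64_t hwaddr;

typedef struct {
    struct { hwaddr addr; } bar4;                   /* BAR4 base (absolute)  */
    struct { hwaddr addr; hwaddr size; } ctrl_mem;  /* CMB offset within BAR4 */
} FakeNvmeCtrl;

/* Mirrors the patch's nvme_cmb_to_absolute_addr(). */
static hwaddr cmb_to_absolute_addr(FakeNvmeCtrl *n)
{
    return n->bar4.addr + n->ctrl_mem.addr;
}

/* Mirrors the reworked nvme_addr_is_cmb(): half-open range [low, hi). */
static int addr_is_cmb(FakeNvmeCtrl *n, hwaddr addr)
{
    hwaddr low = cmb_to_absolute_addr(n);
    hwaddr hi = low + n->ctrl_mem.size;
    return addr >= low && addr < hi;
}
```

The v5 bug described in the cover letter is exactly the failure mode this guards against: treating the BAR4-relative offset as if it were the absolute address.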
Re: [PATCH 02/16] hw/block/nvme: add mapping helpers
On Jul 29 14:51, Andrzej Jakowski wrote: > On 7/29/20 2:24 PM, Klaus Jensen wrote: > > On Jul 29 13:40, Andrzej Jakowski wrote: > >> On 7/20/20 4:37 AM, Klaus Jensen wrote: > >>> From: Klaus Jensen > >>> > >>> Add nvme_map_addr, nvme_map_addr_cmb and nvme_addr_to_cmb helpers and > >>> use them in nvme_map_prp. > >>> > >>> This fixes a bug where in the case of a CMB transfer, the device would > >>> map to the buffer with a wrong length. > >>> > >>> Fixes: b2b2b67a00574 ("nvme: Add support for Read Data and Write Data in > >>> CMBs.") > >>> Signed-off-by: Klaus Jensen > >>> --- > >>> hw/block/nvme.c | 109 +++--- > >>> hw/block/trace-events | 2 + > >>> 2 files changed, 94 insertions(+), 17 deletions(-) > >>> > >>> diff --git a/hw/block/nvme.c b/hw/block/nvme.c > >>> index 4d7b730a62b6..9b1a080cdc70 100644 > >>> --- a/hw/block/nvme.c > >>> +++ b/hw/block/nvme.c > >>> @@ -270,20 +338,27 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, > >>> QEMUIOVector *iov, uint64_t prp1, > >>> } else { > >>> if (unlikely(prp2 & (n->page_size - 1))) { > >>> trace_pci_nvme_err_invalid_prp2_align(prp2); > >>> +status = NVME_INVALID_FIELD | NVME_DNR; > >>> goto unmap; > >>> } > >>> -if (qsg->nsg) { > >>> -qemu_sglist_add(qsg, prp2, len); > >>> -} else { > >>> -qemu_iovec_add(iov, (void *)>cmbuf[prp2 - > >>> n->ctrl_mem.addr], trans_len); > >>> +status = nvme_map_addr(n, qsg, iov, prp2, len); > >>> +if (status) { > >>> +goto unmap; > >>> } > >>> } > >>> } > >>> return NVME_SUCCESS; > >>> > >>> - unmap: > >>> -qemu_sglist_destroy(qsg); > >>> -return NVME_INVALID_FIELD | NVME_DNR; > >>> +unmap: > >>> +if (iov && iov->iov) { > >>> +qemu_iovec_destroy(iov); > >>> +} > >>> + > >>> +if (qsg && qsg->sg) { > >>> +qemu_sglist_destroy(qsg); > >>> +} > >>> + > >>> +return status; > >> > >> I think it would make sense to move whole unmap block to a separate > >> function. 
> >> That function could be called from here and after completing IO and would > >> contain > >> unified deinitialization block - so no code repetitions would be necessary. > >> Other than that it looks good to me. Thx! > >> > >> Reviewed-by: Andrzej Jakowski > >> > > > > Hi Andrzej, > > > > Thanks for the review :) > > > > Yes, this is done in a later patch ("hw/block/nvme: consolidate qsg/iov > > clearing"), but kept here to reduce churn. > > > Yep, noticed that after sending email :) > Do you plan to submit second version of these patches incorporating some > of the feedback? > Yes, so you can defer your reviews for v2 ;)
Re: [PATCH 02/16] hw/block/nvme: add mapping helpers
On 7/29/20 2:24 PM, Klaus Jensen wrote: > On Jul 29 13:40, Andrzej Jakowski wrote: >> On 7/20/20 4:37 AM, Klaus Jensen wrote: >>> From: Klaus Jensen >>> >>> Add nvme_map_addr, nvme_map_addr_cmb and nvme_addr_to_cmb helpers and >>> use them in nvme_map_prp. >>> >>> This fixes a bug where in the case of a CMB transfer, the device would >>> map to the buffer with a wrong length. >>> >>> Fixes: b2b2b67a00574 ("nvme: Add support for Read Data and Write Data in >>> CMBs.") >>> Signed-off-by: Klaus Jensen >>> --- >>> hw/block/nvme.c | 109 +++--- >>> hw/block/trace-events | 2 + >>> 2 files changed, 94 insertions(+), 17 deletions(-) >>> >>> diff --git a/hw/block/nvme.c b/hw/block/nvme.c >>> index 4d7b730a62b6..9b1a080cdc70 100644 >>> --- a/hw/block/nvme.c >>> +++ b/hw/block/nvme.c >>> @@ -270,20 +338,27 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, >>> QEMUIOVector *iov, uint64_t prp1, >>> } else { >>> if (unlikely(prp2 & (n->page_size - 1))) { >>> trace_pci_nvme_err_invalid_prp2_align(prp2); >>> +status = NVME_INVALID_FIELD | NVME_DNR; >>> goto unmap; >>> } >>> -if (qsg->nsg) { >>> -qemu_sglist_add(qsg, prp2, len); >>> -} else { >>> -qemu_iovec_add(iov, (void *)>cmbuf[prp2 - >>> n->ctrl_mem.addr], trans_len); >>> +status = nvme_map_addr(n, qsg, iov, prp2, len); >>> +if (status) { >>> +goto unmap; >>> } >>> } >>> } >>> return NVME_SUCCESS; >>> >>> - unmap: >>> -qemu_sglist_destroy(qsg); >>> -return NVME_INVALID_FIELD | NVME_DNR; >>> +unmap: >>> +if (iov && iov->iov) { >>> +qemu_iovec_destroy(iov); >>> +} >>> + >>> +if (qsg && qsg->sg) { >>> +qemu_sglist_destroy(qsg); >>> +} >>> + >>> +return status; >> >> I think it would make sense to move whole unmap block to a separate function. >> That function could be called from here and after completing IO and would >> contain >> unified deinitialization block - so no code repetitions would be necessary. >> Other than that it looks good to me. Thx! 
>> >> Reviewed-by: Andrzej Jakowski >> > > Hi Andrzej, > > Thanks for the review :) > > Yes, this is done in a later patch ("hw/block/nvme: consolidate qsg/iov > clearing"), but kept here to reduce churn. > Yep, noticed that after sending email :) Do you plan to submit second version of these patches incorporating some of the feedback?
RE: [PATCH v2 1/3] hw/i386: Initialize topo_ids from CpuInstanceProperties
Igor, Sorry. Few more questions. > -Original Message- > From: Igor Mammedov > Sent: Wednesday, July 29, 2020 9:12 AM > To: Moger, Babu > Cc: pbonz...@redhat.com; r...@twiddle.net; qemu-devel@nongnu.org; > ehabk...@redhat.com > Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids from > CpuInstanceProperties > > On Mon, 27 Jul 2020 18:59:42 -0500 > Babu Moger wrote: > > > > -Original Message- > > > From: Igor Mammedov > > > Sent: Monday, July 27, 2020 12:14 PM > > > To: Moger, Babu > > > Cc: qemu-devel@nongnu.org; pbonz...@redhat.com; > ehabk...@redhat.com; > > > r...@twiddle.net > > > Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids from > > > CpuInstanceProperties > > > > > > On Mon, 27 Jul 2020 10:49:08 -0500 > > > Babu Moger wrote: > > > > > > > > -Original Message- > > > > > From: Igor Mammedov > > > > > Sent: Friday, July 24, 2020 12:05 PM > > > > > To: Moger, Babu > > > > > Cc: qemu-devel@nongnu.org; pbonz...@redhat.com; > > > ehabk...@redhat.com; > > > > > r...@twiddle.net > > > > > Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids from > > > > > CpuInstanceProperties > > > > > > > > > > On Mon, 13 Jul 2020 14:30:29 -0500 Babu Moger > > > > > wrote: > > > > > > > > > > > > -Original Message- > > > > > > > From: Igor Mammedov > > > > > > > Sent: Monday, July 13, 2020 12:32 PM > > > > > > > To: Moger, Babu > > > > > > > Cc: pbonz...@redhat.com; r...@twiddle.net; > > > > > > > ehabk...@redhat.com; > > > > > > > qemu- de...@nongnu.org > > > > > > > Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids > > > > > > > from CpuInstanceProperties > > > > > > > > > > > > > > On Mon, 13 Jul 2020 11:43:33 -0500 Babu Moger > > > > > > > wrote: > > > > > > > > > > > > > > > On 7/13/20 11:17 AM, Igor Mammedov wrote: > > > > > > > > > On Mon, 13 Jul 2020 10:02:22 -0500 Babu Moger > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > >>> -Original Message- > > > > > > > > >>> From: Igor Mammedov > > > > > > > > >>> Sent: Monday, July 13, 2020 4:08 
AM > > > > > > > > >>> To: Moger, Babu > > > > > > > > >>> Cc: pbonz...@redhat.com; r...@twiddle.net; > > > > > > > > >>> ehabk...@redhat.com; > > > > > > > > >>> qemu- de...@nongnu.org > > > > > > > > >>> Subject: Re: [PATCH v2 1/3] hw/i386: Initialize > > > > > > > > >>> topo_ids from CpuInstanceProperties > > > > > > > > > [...] > > > > > > > > + > > > > > > > > +/* > > > > > > > > + * Initialize topo_ids from CpuInstanceProperties > > > > > > > > + * node_id in CpuInstanceProperties(or in CPU > > > > > > > > +device) is a sequential > > > > > > > > + * number, but while building the topology > > > > > > > > >>> > > > > > > > > we need to separate it for > > > > > > > > + * each socket(mod nodes_per_pkg). > > > > > > > > >>> could you clarify a bit more on why this is necessary? > > > > > > > > >> > > > > > > > > >> If you have two sockets and 4 numa nodes, node_id in > > > > > > > > >> CpuInstanceProperties will be number sequentially as 0, 1, > > > > > > > > >> 2, 3. > > > > > > > > >> But in EPYC topology, it will be 0, 1, 0, 1( Basically > > > > > > > > >> mod % number of nodes > > > > > > > per socket). > > > > > > > > > > > > > > > > > > I'm confused, let's suppose we have 2 EPYC sockets with > > > > > > > > > 2 nodes per socket so APIC id woulbe be composed like: > > > > > > > > > > > > > > > > > > 1st socket > > > > > > > > >pkg_id(0) | node_id(0) > > > > > > > > >pkg_id(0) | node_id(1) > > > > > > > > > > > > > > > > > > 2nd socket > > > > > > > > >pkg_id(1) | node_id(0) > > > > > > > > >pkg_id(1) | node_id(1) > > > > > > > > > > > > > > > > > > if that's the case, then EPYC's node_id here doesn't > > > > > > > > > look like a NUMA node in the sense it's usually used > > > > > > > > > (above config would have 4 different memory controllers > > > > > > > > > => 4 conventional > > > NUMA nodes). > > > > > > > > > > > > > > > > EPIC model uses combination of socket id and node id to > > > > > > > > identify the numa nodes. 
So, it internally uses all the > > > > > > > > information. > > > > > > > > > > > > > > well with above values, EPYC's node_id doesn't look like > > > > > > > it's specifying a machine numa node, but rather a node index > > > > > > > within single socket. In which case, it doesn't make much > > > > > > > sense calling it NUMA node_id, it's rather some index within > > > > > > > a socket. (it starts looking like terminology is all mixed > > > > > > > up) > > > > > > > > > > > > > > If you have access to a multi-socket EPYC machine, can you > > > > > > > dump and post here its apic ids, pls? > > > > > > > > > > > > Here is the output from my EPYC machine with 2 sockets and > > > > > > totally > > > > > > 8 nodes(SMT disabled). The cpus 0-31 are in socket 0 and cpus > > > > > > 32-63 in socket 1. > > > > > > > > > > > > # lscpu > > > > > > Architecture:x86_64 > > > > > > CPU op-mode(s):
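The normalization being discussed above — turning QEMU's machine-wide sequential NUMA node ids into the per-socket node index that the EPYC APIC-ID encoding expects — can be sketched as a one-line helper. The function name and parameters here are illustrative, not the actual QEMU code:

```c
/*
 * Toy illustration: QEMU numbers NUMA nodes sequentially across the
 * machine (0, 1, 2, 3), while the EPYC APIC-ID layout wants a node
 * index relative to its socket (0, 1, 0, 1 for two sockets with two
 * nodes each).  The "mod nodes_per_pkg" from the quoted comment is
 * exactly this operation.
 */
static unsigned node_id_in_socket(unsigned seq_node_id,
                                  unsigned nodes_per_socket)
{
    return seq_node_id % nodes_per_socket;
}
```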
Re: [PATCH 02/16] hw/block/nvme: add mapping helpers
On Jul 29 13:40, Andrzej Jakowski wrote: > On 7/20/20 4:37 AM, Klaus Jensen wrote: > > From: Klaus Jensen > > > > Add nvme_map_addr, nvme_map_addr_cmb and nvme_addr_to_cmb helpers and > > use them in nvme_map_prp. > > > > This fixes a bug where in the case of a CMB transfer, the device would > > map to the buffer with a wrong length. > > > > Fixes: b2b2b67a00574 ("nvme: Add support for Read Data and Write Data in > > CMBs.") > > Signed-off-by: Klaus Jensen > > --- > > hw/block/nvme.c | 109 +++--- > > hw/block/trace-events | 2 + > > 2 files changed, 94 insertions(+), 17 deletions(-) > > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > > index 4d7b730a62b6..9b1a080cdc70 100644 > > --- a/hw/block/nvme.c > > +++ b/hw/block/nvme.c > > @@ -270,20 +338,27 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, > > QEMUIOVector *iov, uint64_t prp1, > > } else { > > if (unlikely(prp2 & (n->page_size - 1))) { > > trace_pci_nvme_err_invalid_prp2_align(prp2); > > +status = NVME_INVALID_FIELD | NVME_DNR; > > goto unmap; > > } > > -if (qsg->nsg) { > > -qemu_sglist_add(qsg, prp2, len); > > -} else { > > -qemu_iovec_add(iov, (void *)&n->cmbuf[prp2 - > > n->ctrl_mem.addr], trans_len); > > +status = nvme_map_addr(n, qsg, iov, prp2, len); > > +if (status) { > > +goto unmap; > > } > > } > > } > > return NVME_SUCCESS; > > > > - unmap: > > -qemu_sglist_destroy(qsg); > > -return NVME_INVALID_FIELD | NVME_DNR; > > +unmap: > > +if (iov && iov->iov) { > > +qemu_iovec_destroy(iov); > > +} > > + > > +if (qsg && qsg->sg) { > > +qemu_sglist_destroy(qsg); > > +} > > + > > +return status; > > I think it would make sense to move the whole unmap block to a separate function. > That function could be called from here and after completing IO and would > contain > a unified deinitialization block - so no code repetitions would be necessary. > Other than that it looks good to me. Thx! 
> > Reviewed-by: Andrzej Jakowski > Hi Andrzej, Thanks for the review :) Yes, this is done in a later patch ("hw/block/nvme: consolidate qsg/iov clearing"), but kept here to reduce churn.
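The refactor suggested in the review above can be sketched in a self-contained way: one teardown helper shared by the `goto unmap` error path and by I/O completion. The `nvme_unmap()` name and the stand-in types are illustrative only — QEMU's real `QEMUIOVector`/`QEMUSGList` and the eventual consolidation patch differ in detail:

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins for QEMU's QEMUIOVector/QEMUSGList; only the fields the
 * cleanup logic inspects are modeled. */
typedef struct { void *iov; } QEMUIOVector;
typedef struct { void *sg; } QEMUSGList;

static void qemu_iovec_destroy(QEMUIOVector *iov)
{
    free(iov->iov);
    iov->iov = NULL;
}

static void qemu_sglist_destroy(QEMUSGList *qsg)
{
    free(qsg->sg);
    qsg->sg = NULL;
}

/* One deinitialization point: tears down whichever of the two mappings
 * was initialized, and is safe to call again, so the same function can
 * serve both the error path in nvme_map_prp() and post-I/O completion. */
static void nvme_unmap(QEMUSGList *qsg, QEMUIOVector *iov)
{
    if (iov && iov->iov) {
        qemu_iovec_destroy(iov);
    }
    if (qsg && qsg->sg) {
        qemu_sglist_destroy(qsg);
    }
}
```

Because the helper checks which mapping is live before destroying it, callers need not track whether the request used the CMB (iovec) or DMA (sglist) path.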
[Bug 1888601] Re: QEMU v5.1.0-rc0/rc1 hang with nested virtualization
Here's what I get with 5.1.0-rc2 ``` (gdb) thread apply all bt Thread 5 (LWP 23730): #0 0x7f9ae6040ebb in ioctl () #1 0x7f9ae57cf98b in kvm_vcpu_ioctl (cpu=cpu@entry=0x57539ea0, type=type@entry=44672) at /root/qemu/accel/kvm/kvm-all.c:2631 #2 0x7f9ae57cfac5 in kvm_cpu_exec (cpu=cpu@entry=0x57539ea0) at /root/qemu/accel/kvm/kvm-all.c:2468 #3 0x7f9ae586703c in qemu_kvm_cpu_thread_fn (arg=0x57539ea0) at /root/qemu/softmmu/cpus.c:1188 #4 qemu_kvm_cpu_thread_fn (arg=arg@entry=0x57539ea0) at /root/qemu/softmmu/cpus.c:1160 #5 0x7f9ae5bd2f13 in qemu_thread_start (args=) at util/qemu-thread-posix.c:521 #6 0x7f9ae5f97109 in start_thread (arg=) at pthread_create.c:477 #7 0x7f9ae6045353 in clone () Thread 4 (LWP 23729): #0 0x7f9ae5fed337 in __strcmp_sse2 () #1 0x7f9ae5d9a8ad in g_str_equal () at pthread_create.c:679 #2 0x7f9ae5d99a9d in g_hash_table_lookup () at pthread_create.c:679 #3 0x7f9ae5ac728f in type_table_lookup (name=0x7f9ae609c9dd "virtio-bus") at qom/object.c:84 #4 type_get_by_name (name=0x7f9ae609c9dd "virtio-bus") at qom/object.c:171 #5 object_class_dynamic_cast (class=class@entry=0x572d1ac0, typename=typename@entry=0x7f9ae609c9dd "virtio-bus") at qom/object.c:879 #6 0x7f9ae5ac75b5 in object_class_dynamic_cast_assert (class=0x572d1ac0, typename=typename@entry=0x7f9ae609c9dd "virtio-bus", file=file@entry=0x7f9ae60a80b8 "/root/qemu/hw/virtio/virtio.c", line=line@entry=3290, func=func@entry=0x7f9ae60a8c30 <__func__.31954> "virtio_queue_enabled") at qom/object.c:935 #7 0x7f9ae5817842 in virtio_queue_enabled (vdev=0x58418be0, n=0) at /root/qemu/hw/virtio/virtio.c:3290 #8 0x7f9ae59c2837 in vhost_net_start_one (dev=0x58418be0, net=0x574d8ca0) at hw/net/vhost_net.c:259 #9 vhost_net_start (dev=dev@entry=0x58418be0, ncs=0x5842e030, total_queues=total_queues@entry=2) at hw/net/vhost_net.c:351 #10 0x7f9ae57f4d98 in virtio_net_vhost_status (status=, n=0x58418be0) at /root/qemu/hw/net/virtio-net.c:268 #11 virtio_net_set_status (vdev=0x58418be0, status=) at 
/root/qemu/hw/net/virtio-net.c:349 #12 0x7f9ae5815bdb in virtio_set_status (vdev=vdev@entry=0x58418be0, val=val@entry=7 '\a') at /root/qemu/hw/virtio/virtio.c:1956 #13 0x7f9ae5a5ddf0 in virtio_ioport_write (val=7, addr=18, opaque=0x58410a50) at hw/virtio/virtio-pci.c:331 #14 virtio_pci_config_write (opaque=0x58410a50, addr=18, val=, size=) at hw/virtio/virtio-pci.c:455 #15 0x7f9ae5870b2a in memory_region_write_accessor (attrs=..., mask=255, shift=, size=1, value=0x7f99dfffd5f8, addr=, mr=0x58411340) at /root/qemu/softmmu/memory.c:483 #16 access_with_adjusted_size (attrs=..., mr=0x58411340, access_fn=, access_size_max=, access_size_min=, size=1, value=0x7f99dfffd5f8, addr=18) at /root/qemu/softmmu/memory.c:544 #17 memory_region_dispatch_write (mr=mr@entry=0x58411340, addr=, data=, op=, attrs=..., attrs@entry=...) at /root/qemu/softmmu/memory.c:1465 #18 0x7f9ae57ab4b2 in flatview_write_continue (fv=0x7f99d007ef80, addr=addr@entry=53394, attrs=..., attrs@entry=..., ptr=ptr@entry=0x7f9ae43f1000, len=len@entry=1, addr1=, l=, mr=0x58411340) at /root/qemu/include/qemu/host-utils.h:164 #19 0x7f9ae57afc4d in flatview_write (len=1, buf=0x7f9ae43f1000, attrs=..., addr=53394, fv=) at /root/qemu/exec.c:3216 #20 address_space_write (len=1, buf=0x7f9ae43f1000, attrs=..., addr=53394, as=0x7f9ae43f1000) at /root/qemu/exec.c:3307 #21 address_space_rw (as=as@entry=0x7f9ae6846d60 , addr=addr@entry=53394, attrs=attrs@entry=..., buf=0x7f9ae43f1000, len=len@entry=1, is_write=is_write@entry=true) at /root/qemu/exec.c:3317 #22 0x7f9ae57cfd5f in kvm_handle_io (count=1, size=1, direction=, data=, attrs=..., port=53394) at /root/qemu/accel/kvm/kvm-all.c:2262 #23 kvm_cpu_exec (cpu=cpu@entry=0x574f3ac0) at /root/qemu/accel/kvm/kvm-all.c:2508 #24 0x7f9ae586703c in qemu_kvm_cpu_thread_fn (arg=0x574f3ac0) at /root/qemu/softmmu/cpus.c:1188 #25 qemu_kvm_cpu_thread_fn (arg=arg@entry=0x574f3ac0) at /root/qemu/softmmu/cpus.c:1160 #26 0x7f9ae5bd2f13 in qemu_thread_start (args=) at 
util/qemu-thread-posix.c:521 #27 0x7f9ae5f97109 in start_thread (arg=) at pthread_create.c:477 #28 0x7f9ae6045353 in clone () Thread 3 (LWP 23728): #0 0x7f9ae603fd0f in poll () #1 0x7f9ae5dac5de in g_main_context_iterate.isra () at pthread_create.c:679 #2 0x7f9ae5dac963 in g_main_loop_run () at pthread_create.c:679 #3 0x7f9ae58a7b71 in iothread_run (opaque=opaque@entry=0x5734b800) at iothread.c:82 #4 0x7f9ae5bd2f13 in qemu_thread_start (args=) at util/qemu-thread-posix.c:521 #5 0x7f9ae5f97109 in start_thread (arg=) at pthread_create.c:477 #6 0x7f9ae6045353 in clone () Thread 2 (LWP 23723): #0 0x7f9ae604207d in syscall () #1 0x7f9ae5bd3e32 in qemu_futex_wait (val=, f=) at /root/qemu/include/qemu/futex.h:29 #2
Re: qemu repo lockdown message for a WHPX PR
On Wed, 29 Jul 2020 at 21:29, Paolo Bonzini wrote: > I was not referring to github pull requests, but rather to a maintainer > pull request. This is also sent to the mailing list. There is no > QEMU-specific documentation since maintainers are generally experienced > enough to have observed how these are sent. We do have some notes on the wiki: https://wiki.qemu.org/Contribute/SubmitAPullRequest but they're more in the nature of points to check assuming you have a basic understanding of the workflow rather than a complete tutorial. thanks -- PMM
[Bug 1888601] Re: QEMU v5.1.0-rc0/rc1 hang with nested virtualization
``` (gdb) thread apply all bt Thread 5 (LWP 211759): #0 0x7ff56a9988d8 in g_str_hash () #1 0x7ff56a997a0c in g_hash_table_lookup () #2 0x7ff56a6c528f in type_table_lookup (name=0x7ff56ac9a9dd "virtio-bus") at qom/object.c:84 #3 type_get_by_name (name=0x7ff56ac9a9dd "virtio-bus") at qom/object.c:171 #4 object_class_dynamic_cast (class=class@entry=0x56d92ac0, typename=typename@entry=0x7ff56ac9a9dd "virtio-bus") at qom/object.c:879 #5 0x7ff56a6c55b5 in object_class_dynamic_cast_assert (class=0x56d92ac0, typename=typename@entry=0x7ff56ac9a9dd "virtio-bus", file=file@entry=0x7ff56aca60b8 "/root/qemu/hw/virtio/virtio.c", line=line@entry=3290, func=func@entry=0x7ff56aca6c30 <__func__.31954> "virtio_queue_enabled") at qom/object.c:935 #6 0x7ff56a415842 in virtio_queue_enabled (vdev=0x57ed9be0, n=0) at /root/qemu/hw/virtio/virtio.c:3290 #7 0x7ff56a5c0837 in vhost_net_start_one (dev=0x57ed9be0, net=0x56f99ca0) at hw/net/vhost_net.c:259 #8 vhost_net_start (dev=dev@entry=0x57ed9be0, ncs=0x57eef030, total_queues=total_queues@entry=2) at hw/net/vhost_net.c:351 #9 0x7ff56a3f2d98 in virtio_net_vhost_status (status=, n=0x57ed9be0) at /root/qemu/hw/net/virtio-net.c:268 #10 virtio_net_set_status (vdev=0x57ed9be0, status=) at /root/qemu/hw/net/virtio-net.c:349 #11 0x7ff56a413bdb in virtio_set_status (vdev=vdev@entry=0x57ed9be0, val=val@entry=7 '\a') at /root/qemu/hw/virtio/virtio.c:1956 #12 0x7ff56a65bdf0 in virtio_ioport_write (val=7, addr=18, opaque=0x57ed1a50) at hw/virtio/virtio-pci.c:331 #13 virtio_pci_config_write (opaque=0x57ed1a50, addr=18, val=, size=) at hw/virtio/virtio-pci.c:455 #14 0x7ff56a46eb2a in memory_region_write_accessor (attrs=..., mask=255, shift=, size=1, value=0x7ff463ffd5f8, addr=, mr=0x57ed2340) at /root/qemu/softmmu/memory.c:483 #15 access_with_adjusted_size (attrs=..., mr=0x57ed2340, access_fn=, access_size_max=, access_size_min=, size=1, value=0x7ff463ffd5f8, addr=18) at /root/qemu/softmmu/memory.c:544 #16 memory_region_dispatch_write 
(mr=mr@entry=0x57ed2340, addr=, data=, op=, attrs=..., attrs@entry=...) at /root/qemu/softmmu/memory.c:1465 #17 0x7ff56a3a94b2 in flatview_write_continue (fv=0x7ff45426a7c0, addr=addr@entry=53394, attrs=..., attrs@entry=..., ptr=ptr@entry=0x7ff5687eb000, len=len@entry=1, addr1=, l=, mr=0x57ed2340) at /root/qemu/include/qemu/host-utils.h:164 #18 0x7ff56a3adc4d in flatview_write (len=1, buf=0x7ff5687eb000, attrs=..., addr=53394, fv=) at /root/qemu/exec.c:3216 #19 address_space_write (len=1, buf=0x7ff5687eb000, attrs=..., addr=53394, as=0x7ff5687eb000) at /root/qemu/exec.c:3307 #20 address_space_rw (as=as@entry=0x7ff56b444d60 , addr=addr@entry=53394, attrs=attrs@entry=..., buf=0x7ff5687eb000, len=len@entry=1, is_write=is_write@entry=true) at /root/qemu/exec.c:3317 #21 0x7ff56a3cdd5f in kvm_handle_io (count=1, size=1, direction=, data=, attrs=..., port=53394) at /root/qemu/accel/kvm/kvm-all.c:2262 #22 kvm_cpu_exec (cpu=cpu@entry=0x56ffaea0) at /root/qemu/accel/kvm/kvm-all.c:2508 #23 0x7ff56a46503c in qemu_kvm_cpu_thread_fn (arg=0x56ffaea0) at /root/qemu/softmmu/cpus.c:1188 #24 qemu_kvm_cpu_thread_fn (arg=arg@entry=0x56ffaea0) at /root/qemu/softmmu/cpus.c:1160 #25 0x7ff56a7d0f13 in qemu_thread_start (args=) at util/qemu-thread-posix.c:521 #26 0x7ff56ab95109 in start_thread (arg=) at pthread_create.c:477 #27 0x7ff56ac43353 in clone () Thread 4 (LWP 211758): #0 0x7ff56ac3eebb in ioctl () #1 0x7ff56a3cd98b in kvm_vcpu_ioctl (cpu=cpu@entry=0x56fb4ac0, type=type@entry=44672) at /root/qemu/accel/kvm/kvm-all.c:2631 #2 0x7ff56a3cdac5 in kvm_cpu_exec (cpu=cpu@entry=0x56fb4ac0) at /root/qemu/accel/kvm/kvm-all.c:2468 #3 0x7ff56a46503c in qemu_kvm_cpu_thread_fn (arg=0x56fb4ac0) at /root/qemu/softmmu/cpus.c:1188 #4 qemu_kvm_cpu_thread_fn (arg=arg@entry=0x56fb4ac0) at /root/qemu/softmmu/cpus.c:1160 #5 0x7ff56a7d0f13 in qemu_thread_start (args=) at util/qemu-thread-posix.c:521 #6 0x7ff56ab95109 in start_thread (arg=) at pthread_create.c:477 #7 0x7ff56ac43353 in clone () Thread 3 (LWP 
211757): #0 0x7ff56ac3dd0f in poll () #1 0x7ff56a9aa5de in g_main_context_iterate.isra () at pthread_create.c:679 #2 0x7ff56a9aa963 in g_main_loop_run () at pthread_create.c:679 #3 0x7ff56a4a5b71 in iothread_run (opaque=opaque@entry=0x56e0c800) at iothread.c:82 #4 0x7ff56a7d0f13 in qemu_thread_start (args=) at util/qemu-thread-posix.c:521 #5 0x7ff56ab95109 in start_thread (arg=) at pthread_create.c:477 #6 0x7ff56ac43353 in clone () Thread 2 (LWP 211752): #0 0x7ff56ac4007d in syscall () #1 0x7ff56a7d1e32 in qemu_futex_wait (val=, f=) at /root/qemu/include/qemu/futex.h:29 #2 qemu_event_wait () at util/qemu-thread-posix.c:460 #3 0x7ff56a7dc0f2 in call_rcu_thread () at util/rcu.c:258 #4
Re: [PATCH 02/16] hw/block/nvme: add mapping helpers
On 7/20/20 4:37 AM, Klaus Jensen wrote: > From: Klaus Jensen > > Add nvme_map_addr, nvme_map_addr_cmb and nvme_addr_to_cmb helpers and > use them in nvme_map_prp. > > This fixes a bug where in the case of a CMB transfer, the device would > map to the buffer with a wrong length. > > Fixes: b2b2b67a00574 ("nvme: Add support for Read Data and Write Data in > CMBs.") > Signed-off-by: Klaus Jensen > --- > hw/block/nvme.c | 109 +++--- > hw/block/trace-events | 2 + > 2 files changed, 94 insertions(+), 17 deletions(-) > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > index 4d7b730a62b6..9b1a080cdc70 100644 > --- a/hw/block/nvme.c > +++ b/hw/block/nvme.c > @@ -109,6 +109,11 @@ static uint16_t nvme_sqid(NvmeRequest *req) > return le16_to_cpu(req->sq->sqid); > } > > +static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr) > +{ > +return &n->cmbuf[addr - n->ctrl_mem.addr]; > +} > + > static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr) > { > hwaddr low = n->ctrl_mem.addr; > @@ -120,7 +125,7 @@ static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr) > static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size) > { > if (n->bar.cmbsz && nvme_addr_is_cmb(n, addr)) { > -memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size); > +memcpy(buf, nvme_addr_to_cmb(n, addr), size); > return; > } > > @@ -203,29 +208,91 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue > *cq) > } > } > > +static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr > addr, > + size_t len) > +{ > +if (!len) { > +return NVME_SUCCESS; > +} > + > +trace_pci_nvme_map_addr_cmb(addr, len); > + > +if (!nvme_addr_is_cmb(n, addr) || !nvme_addr_is_cmb(n, addr + len - 1)) { > +return NVME_DATA_TRAS_ERROR; > +} > + > +qemu_iovec_add(iov, nvme_addr_to_cmb(n, addr), len); > + > +return NVME_SUCCESS; > +} > + > +static uint16_t nvme_map_addr(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector > *iov, > + hwaddr addr, size_t len) > +{ > +if (!len) { > +return NVME_SUCCESS; > +} > + > 
+trace_pci_nvme_map_addr(addr, len); > + > +if (nvme_addr_is_cmb(n, addr)) { > +if (qsg && qsg->sg) { > +return NVME_INVALID_USE_OF_CMB | NVME_DNR; > +} > + > +assert(iov); > + > +if (!iov->iov) { > +qemu_iovec_init(iov, 1); > +} > + > +return nvme_map_addr_cmb(n, iov, addr, len); > +} > + > +if (iov && iov->iov) { > +return NVME_INVALID_USE_OF_CMB | NVME_DNR; > +} > + > +assert(qsg); > + > +if (!qsg->sg) { > +pci_dma_sglist_init(qsg, &n->parent_obj, 1); > +} > + > +qemu_sglist_add(qsg, addr, len); > + > +return NVME_SUCCESS; > +} > + > static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t > prp1, > uint64_t prp2, uint32_t len, NvmeCtrl *n) > { > hwaddr trans_len = n->page_size - (prp1 % n->page_size); > trans_len = MIN(len, trans_len); > int num_prps = (len >> n->page_bits) + 1; > +uint16_t status; > > if (unlikely(!prp1)) { > trace_pci_nvme_err_invalid_prp(); > return NVME_INVALID_FIELD | NVME_DNR; > -} else if (n->bar.cmbsz && prp1 >= n->ctrl_mem.addr && > - prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) { > -qsg->nsg = 0; > +} > + > +if (nvme_addr_is_cmb(n, prp1)) { > qemu_iovec_init(iov, num_prps); > -qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], > trans_len); > } else { > pci_dma_sglist_init(qsg, &n->parent_obj, num_prps); > -qemu_sglist_add(qsg, prp1, trans_len); > } > + > +status = nvme_map_addr(n, qsg, iov, prp1, trans_len); > +if (status) { > +goto unmap; > +} > + > len -= trans_len; > if (len) { > if (unlikely(!prp2)) { > trace_pci_nvme_err_invalid_prp2_missing(); > +status = NVME_INVALID_FIELD | NVME_DNR; > goto unmap; > } > if (len > n->page_size) { > @@ -242,6 +309,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, > QEMUIOVector *iov, uint64_t prp1, > if (i == n->max_prp_ents - 1 && len > n->page_size) { > if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) { > trace_pci_nvme_err_invalid_prplist_ent(prp_ent); > +status = NVME_INVALID_FIELD | NVME_DNR; > goto unmap; > } > > @@ -255,14 +323,14 @@ static 
uint16_t nvme_map_prp(QEMUSGList *qsg, > QEMUIOVector *iov, uint64_t prp1, > > if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) { > trace_pci_nvme_err_invalid_prplist_ent(prp_ent); > +
Re: [RFC v2 27/76] target/riscv: rvv-0.9: load/store whole register instructions
On 7/22/20 2:15 AM, frank.ch...@sifive.com wrote: > +static void > +vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t > desc, > +vext_ldst_elem_fn *ldst_elem, uint32_t esz, uintptr_t ra, > +MMUAccessType access_type) > +{ > +uint32_t i, k; > +uint32_t nf = vext_nf(desc); > +uint32_t vlmax = vext_maxsz(desc) / esz; > +uint32_t vlenb = env_archcpu(env)->cfg.vlen >> 3; > + > +/* probe every access */ > +probe_pages(env, base, vlenb * nf * esz, ra, access_type); > + > +/* load bytes from guest memory */ > +for (i = 0; i < vlenb; i++) { > +k = 0; > +while (k < nf) { > +target_ulong addr = base + (i * nf + k) * esz; > +ldst_elem(env, addr, i + k * vlmax, vd, ra); > +k++; > +} > +} > +} First, nf != 0 is reserved, so you shouldn't attempt to support it here. Second, even then the note in the spec suggests that these two loops should be interchanged -- but I'll also grant that the language could use improvement. Indeed, the whole vector load/store section seems to need improvement. For instance, nowhere does it say how EEW < SEW load operations are extended. From reading Spike source code I can infer that it's sign-extended. But that's something a spec should explicitly say. r~
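The loop interchange being suggested can be shown with a toy address model: with the field (`nf`) loop outermost, each destination register is filled from one contiguous run of guest addresses, instead of the element-interleaved pattern the quoted patch produces. The constants and function below are arbitrary illustrations — the real helper indexes into `CPURISCVState` and uses `ldst_elem` callbacks:

```c
#include <stdint.h>

/* Example parameters only: 2 register fields, 8 elements per register,
 * 4-byte elements. */
enum { NF = 2, VLENB = 8, ESZ = 4 };

/* Guest address touched for (field k, element i) with the field loop
 * outermost, i.e. addr = base + (k * VLENB + i) * ESZ, mirroring the
 * interchange suggested in the review. */
static void whole_reg_addrs(uint64_t base, uint64_t out[NF][VLENB])
{
    for (int k = 0; k < NF; k++) {
        for (int i = 0; i < VLENB; i++) {
            out[k][i] = base + (uint64_t)(k * VLENB + i) * ESZ;
        }
    }
}
```

With this ordering, register k's addresses are contiguous, and register k+1 begins exactly where register k ended — the memory layout whole-register loads are meant to have.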
Re: qemu repo lockdown message for a WHPX PR
On 28/07/20 21:19, Sunil Muthuswamy wrote: > Hey Paolo, > > Following your suggestion of creating PRs for WHPX changes, I tried creating > a PR https://github.com/qemu/qemu/pull/95 > > But, I am getting repo-lockdown message. What do I need to do differently? Hi, I was not referring to github pull requests, but rather to a maintainer pull request. This is also sent to the mailing list. There is no QEMU-specific documentation since maintainers are generally experienced enough to have observed how these are sent, but the Linux documentation more or less applies: https://www.kernel.org/doc/html/latest/maintainer/pull-requests.html In any case, I suspect this misunderstanding is a sign that I should be handling WHPX patches for some more time, which I will gladly do. :) Thanks! Paolo
RE: [EXTERNAL] Re: qemu repo lockdown message for a WHPX PR
No, I am trying to submit a pull request as suggested by Paolo in this post: https://patchwork.ozlabs.org/project/qemu-devel/patch/sn4pr2101mb08804d23439166e81ff151f7c0...@sn4pr2101mb0880.namprd21.prod.outlook.com/#2373829 > -Original Message- > From: Eric Blake > Sent: Wednesday, July 29, 2020 1:19 PM > To: Sunil Muthuswamy ; Paolo Bonzini > ; Peter Maydell > Cc: qemu-devel@nongnu.org > Subject: [EXTERNAL] Re: qemu repo lockdown message for a WHPX PR > > On 7/29/20 3:05 PM, Sunil Muthuswamy wrote: > > Adding Peter Maydell as well. > > > >> -Original Message- > >> From: Sunil Muthuswamy > >> Sent: Tuesday, July 28, 2020 12:20 PM > >> To: Paolo Bonzini > >> Cc: qemu-devel@nongnu.org > >> Subject: qemu repo lockdown message for a WHPX PR > >> > >> Hey Paolo, > >> > >> Following your suggestion of creating PRs for WHPX changes, I tried > >> creating a PR https://github.com/qemu/qemu/pull/95 > >> > >> But, I am getting repo-lockdown message. What do I need to do differently? > > Are you trying to submit a patch? If so, we prefer submissions to the > mailing list rather than via github pull requests: > > https://wiki.qemu.org/Contribute/SubmitAPatch > > -- > Eric Blake, Principal Software Engineer > Red Hat, Inc. +1-919-301-3226 > Virtualization: qemu.org | libvirt.org
Re: qemu repo lockdown message for a WHPX PR
On 7/29/20 3:05 PM, Sunil Muthuswamy wrote: Adding Peter Maydell as well. -Original Message- From: Sunil Muthuswamy Sent: Tuesday, July 28, 2020 12:20 PM To: Paolo Bonzini Cc: qemu-devel@nongnu.org Subject: qemu repo lockdown message for a WHPX PR Hey Paolo, Following your suggestion of creating PRs for WHPX changes, I tried creating a PR https://github.com/qemu/qemu/pull/95 But, I am getting repo-lockdown message. What do I need to do differently? Are you trying to submit a patch? If so, we prefer submissions to the mailing list rather than via github pull requests: https://wiki.qemu.org/Contribute/SubmitAPatch -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Re: [PATCH v5 3/3] nvme: allow cmb and pmr to be enabled on same device
On Jul 27 11:59, Andrzej Jakowski wrote: > On 7/27/20 2:06 AM, Klaus Jensen wrote: > > On Jul 23 09:03, Andrzej Jakowski wrote: > >> So far it was not possible to have CMB and PMR emulated on the same > >> device, because BAR2 was used exclusively either of PMR or CMB. This > >> patch places CMB at BAR4 offset so it not conflicts with MSI-X vectors. > >> > >> Signed-off-by: Andrzej Jakowski > >> --- > >> hw/block/nvme.c | 120 +-- > >> hw/block/nvme.h | 1 + > >> include/block/nvme.h | 4 +- > >> 3 files changed, 85 insertions(+), 40 deletions(-) > >> > >> diff --git a/hw/block/nvme.c b/hw/block/nvme.c > >> index 43866b744f..d55a71a346 100644 > >> --- a/hw/block/nvme.c > >> +++ b/hw/block/nvme.c > >> @@ -22,12 +22,13 @@ > >> * [pmrdev=,] \ > >> * max_ioqpairs= > >> * > >> - * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at > >> - * offset 0 in BAR2 and supports only WDS, RDS and SQS for now. > >> + * Note cmb_size_mb denotes size of CMB in MB. CMB when configured is > >> assumed > >> + * to be resident in BAR4 at offset that is 2MiB aligned. When CMB is > >> emulated > >> + * on Linux guest it is recommended to make cmb_size_mb multiple of 2. > >> Both > >> + * size and alignment restrictions are imposed by Linux guest. > >> * > >> - * cmb_size_mb= and pmrdev= options are mutually exclusive due to > >> limitation > >> - * in available BAR's. cmb_size_mb= will take precedence over pmrdev= when > >> - * both provided. > >> + * pmrdev is assumed to be resident in BAR2/BAR3. When configured it > >> consumes > >> + * whole BAR2/BAR3 exclusively. > >> * Enabling pmr emulation can be achieved by pointing to > >> memory-backend-file. 
> >> * For example: > >> * -object memory-backend-file,id=,share=on,mem-path=, > >> \ > >> @@ -57,8 +58,8 @@ > >> #define NVME_MAX_IOQPAIRS 0x > >> #define NVME_DB_SIZE 4 > >> #define NVME_SPEC_VER 0x00010300 > >> -#define NVME_CMB_BIR 2 > >> #define NVME_PMR_BIR 2 > >> +#define NVME_MSIX_BIR 4 > > > > I think that either we keep the CMB constant (but updated with '4' of > > course) or we just get rid of both NVME_{CMB,MSIX}_BIR and use a literal > > '4' in nvme_bar4_init. It is very clear that it is only BAR 4 we use. > > > >> #define NVME_TEMPERATURE 0x143 > >> #define NVME_TEMPERATURE_WARNING 0x157 > >> #define NVME_TEMPERATURE_CRITICAL 0x175 > >> @@ -111,16 +112,18 @@ static uint16_t nvme_sqid(NvmeRequest *req) > >> > >> static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr) > >> { > >> -hwaddr low = n->ctrl_mem.addr; > >> -hwaddr hi = n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size); > >> +hwaddr low = memory_region_to_absolute_addr(&n->ctrl_mem, 0); > >> +hwaddr hi = low + int128_get64(n->ctrl_mem.size); > > > > Are we really really sure we want to use a global helper like this? What > > are the chances/risk that we ever introduce another overlay? I'd say > > zero. We are not even using a *real* overlay, it's just an io memory > > region (ctrl_mem) on top of a pure container (bar4). Can't we live with > > an internal helper doing `n->bar4.addr + n->ctrl_mem.addr` and be done > > with it? It also removes a data structure walk on each invocation of > > nvme_addr_is_cmb (which is done for **all** addresses in PRP lists and > > SGLs). > > Thx! > My understanding of memory_region_absolute_addr()([1]) function is that it > walks > the memory hierarchy up to root while incrementing the absolute addr. It is very > similar to the n->bar4.addr + n->ctrl_mem.addr approach with the following > differences: > * n->bar4.addr + n->ctrl_mem.addr assumes a single level hierarchy. Updates > would >be needed when another memory level is added. 
> * memory_region_to_absolute_addr() works for any-level hierarchy at tradeoff >of dereferencing data structure. > > I don't have data for likelihood of adding new memory level, nor how much more > memory_region_to_absolute_addr() vs n->bar4.addr + n->ctrl_mem.addr costs. > > Please let me know which approach is preferred. > Since you are directly asking me for my preference, then that is "n->bar4.addr + n->ctrl_mem.addr". I don't like the walk, even though it is super short. I know that the raw addition assumes single level hierarchy, but I am fine with that. I would still like it to be in an inline helper though.
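The inline-helper approach preferred here is simple enough to sketch in a self-contained way. The struct and helper names below are stand-ins for `NvmeCtrl` and the eventual QEMU code — the point is just that the CMB's absolute address is the BAR 4 container address plus the subregion offset, one addition instead of a memory-hierarchy walk on every address check:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Minimal stand-in for the relevant NvmeCtrl fields: bar4 is the
 * container region, ctrl_mem the CMB subregion at some offset inside
 * it. */
typedef struct {
    struct { hwaddr addr; } bar4;                     /* BAR 4 base */
    struct { hwaddr addr; uint64_t size; } ctrl_mem;  /* offset in BAR 4 */
} NvmeCtrl;

/* Proposed inline helper: absolute CMB base without walking the
 * memory-region hierarchy. */
static inline hwaddr nvme_cmb_addr(NvmeCtrl *n)
{
    return n->bar4.addr + n->ctrl_mem.addr;
}

static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
{
    hwaddr lo = nvme_cmb_addr(n);

    return addr >= lo && addr < lo + n->ctrl_mem.size;
}
```

Since `nvme_addr_is_cmb()` runs for every address in a PRP list or SGL, keeping it to one addition and two compares is the performance argument made above.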
Re: [PATCH v3 08/18] hw/block/nvme: add support for the asynchronous event request command
On Jul 29 21:45, Maxim Levitsky wrote: > On Wed, 2020-07-29 at 15:37 +0200, Klaus Jensen wrote: > > On Jul 29 13:43, Maxim Levitsky wrote: > > > On Mon, 2020-07-06 at 08:12 +0200, Klaus Jensen wrote: > > > > +DEFINE_PROP_UINT8("aerl", NvmeCtrl, params.aerl, 3), > > > So this is the number of AERs that we allow the user to have outstanding > > > > Yeah, and per the spec, 0's based. > > > > > > +DEFINE_PROP_UINT32("aer_max_queued", NvmeCtrl, > > > > params.aer_max_queued, 64), > > > And this is the number of AERs that we keep in our internal AER queue > > > until the user posts an AER so that we > > > can complete it. > > > > > > > Correct. > > Yep - this is what I understood after examining all of the patch, but from > the names themselves it is hard to understand this. > Maybe a comment next to the property to at least make it easier for an advanced user > (e.g. a user that reads code) > to understand? > > (I often end up reading source to understand various qemu device parameters). > I should add this in docs/specs/nvme.txt (which shows up in one of my next series when I add a new PCI id for the device). For now, I will add it to the top of the file like the rest of the parameters. Subsequent series contain a lot more additions of new parameters that are directly from the spec, and to me it really only makes sense that they share the names if they can. We could consider having them under a "spec namespace"? So, say, we do DEFINE_PROP_UINT("spec.aerl", ...)?
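The interaction between the two knobs clarified above can be modeled in a few lines. The semantics follow the discussion — `aerl` is the 0's based cap on outstanding AERs, `aer_max_queued` caps events buffered while no AER is available — but the struct and function names are illustrative, not the device's implementation:

```c
#include <stdbool.h>

typedef struct {
    unsigned aerl;           /* 0's based: aerl + 1 AERs may be outstanding */
    unsigned aer_max_queued; /* events buffered while no AER is available */
    unsigned outstanding;    /* AERs posted by the host, not yet completed */
    unsigned queued;         /* events waiting for an AER to complete */
} AerModel;

/* Host posts an AER; rejected once aerl + 1 are already outstanding. */
static bool post_aer(AerModel *m)
{
    if (m->outstanding > m->aerl) {
        return false;
    }
    m->outstanding++;
    return true;
}

/* Controller raises an event: complete an outstanding AER if one is
 * available, otherwise queue it -- or drop it when the queue is full. */
static bool raise_event(AerModel *m)
{
    if (m->outstanding) {
        m->outstanding--;   /* the event completes this AER */
        return true;
    }
    if (m->queued < m->aer_max_queued) {
        m->queued++;
        return true;
    }
    return false;           /* queue full: event dropped */
}
```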
[PATCH] linux-user: Map signal numbers in fcntl
Map signal numbers in fcntl F_SETSIG and F_GETSIG. Signed-off-by: Timothy E Baldwin --- linux-user/syscall.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/linux-user/syscall.c b/linux-user/syscall.c index 945fc25279..8456bad109 100644 --- a/linux-user/syscall.c +++ b/linux-user/syscall.c @@ -6583,10 +6583,16 @@ static abi_long do_fcntl(int fd, int cmd, abi_ulong arg) break; #endif - case TARGET_F_SETOWN: - case TARGET_F_GETOWN: case TARGET_F_SETSIG: + ret = get_errno(safe_fcntl(fd, host_cmd, target_to_host_signal(arg))); + break; + case TARGET_F_GETSIG: + ret = host_to_target_signal(get_errno(safe_fcntl(fd, host_cmd, arg))); + break; + + case TARGET_F_SETOWN: + case TARGET_F_GETOWN: case TARGET_F_SETLEASE: case TARGET_F_GETLEASE: case TARGET_F_SETPIPE_SZ: -- 2.25.1
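The reason the patch above translates the `F_SETSIG`/`F_GETSIG` argument is that guest and host signal numbers need not coincide (most visibly for real-time signals, whose numbering depends on the architecture and libc). A toy round-trip illustrates the idea — the mapping table here is made up; QEMU's real tables live in linux-user's signal handling code:

```c
/* Made-up target->host mapping for illustration: pretend the target's
 * signal N corresponds to host signal target_to_host[N].  Any real
 * table must be injective for the round trip to work. */
static const int target_to_host[] = { 0, 1, 2, 34, 33 };
enum { NSIG_DEMO = 5 };

static int target_to_host_signal(int sig)
{
    return target_to_host[sig];
}

static int host_to_target_signal(int sig)
{
    for (int i = 0; i < NSIG_DEMO; i++) {
        if (target_to_host[i] == sig) {
            return i;
        }
    }
    return sig; /* unmapped: pass through unchanged */
}
```

Without the translation, a guest setting `F_SETSIG` to its notion of a signal number could have the kernel deliver a different signal than the guest expects, and `F_GETSIG` would leak host numbering back to the guest.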
RE: qemu repo lockdown message for a WHPX PR
Adding Peter Maydell as well. > -Original Message- > From: Sunil Muthuswamy > Sent: Tuesday, July 28, 2020 12:20 PM > To: Paolo Bonzini > Cc: qemu-devel@nongnu.org > Subject: qemu repo lockdown message for a WHPX PR > > Hey Paolo, > > Following your suggestion of creating PRs for WHPX changes, I tried creating > a PR https://github.com/qemu/qemu/pull/95 > > But, I am getting repo-lockdown message. What do I need to do differently? > > Thanks, > Sunil
Re: [PATCH 15/16] hw/block/nvme: remove NvmeCmd parameter
On Jul 29 21:25, Maxim Levitsky wrote: > On Mon, 2020-07-20 at 13:37 +0200, Klaus Jensen wrote: > > From: Klaus Jensen > > > > Keep a copy of the raw nvme command in the NvmeRequest and remove the > > now redundant NvmeCmd parameter. > > Shouldn't you clear the req->cmd in nvme_req_clear too for consistency? It always gets unconditionally overwritten with a memcpy in nvme_process_sq, so we are not leaving anything dangling (like we would do with the namespace reference, because it's usually not initialized for Admin commands).
[PATCH] configure: actually disable 'git_update' mode with --disable-git-update
The --disable-git-update configure param sets git_update=no, but some
later checks only look for the .git dir. This changes the
--enable-git-update to set git_update=yes but also fail if it does not
find a .git dir. Then all the later checks for the .git dir can just be
changed to a check for $git_update = "yes".

Also update the Makefile to skip the 'git_update' checks if it has been
disabled.

This is needed because downstream packagers, e.g. Debian, Ubuntu, etc,
also keep the source code in git, but do not want to enable the
'git_update' mode; with the current code, that's not possible even if
the downstream package specifies --disable-git-update.

Signed-off-by: Dan Streetman
---
Note this is a rebased resend of a previous email to qemu-trivial:
https://lists.nongnu.org/archive/html/qemu-trivial/2020-07/msg00180.html

 Makefile  | 15 +--
 configure | 21 +
 2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/Makefile b/Makefile
index c2120d8d48..42550ae086 100644
--- a/Makefile
+++ b/Makefile
@@ -25,6 +25,8 @@ git-submodule-update:

 .PHONY: git-submodule-update

+# If --disable-git-update specified, skip these git checks
+ifneq (no,$(GIT_UPDATE))
 git_module_status := $(shell \
   cd '$(SRC_PATH)' && \
   GIT="$(GIT)" ./scripts/git-submodule.sh status $(GIT_SUBMODULES); \
@@ -32,7 +34,12 @@ git_module_status := $(shell \
 )

 ifeq (1,$(git_module_status))
-ifeq (no,$(GIT_UPDATE))
+ifeq (yes,$(GIT_UPDATE))
+git-submodule-update:
+	$(call quiet-command, \
+	  (cd $(SRC_PATH) && GIT="$(GIT)" ./scripts/git-submodule.sh update $(GIT_SUBMODULES)), \
+	  "GIT","$(GIT_SUBMODULES)")
+else
 git-submodule-update:
 	$(call quiet-command, \
 	  echo && \
@@ -41,11 +48,7 @@ git-submodule-update:
 	  echo "from the source directory checkout $(SRC_PATH)" && \
 	  echo && \
 	  exit 1)
-else
-git-submodule-update:
-	$(call quiet-command, \
-	  (cd $(SRC_PATH) && GIT="$(GIT)" ./scripts/git-submodule.sh update $(GIT_SUBMODULES)), \
-	  "GIT","$(GIT_SUBMODULES)")
+endif
 endif
 endif

diff --git a/configure b/configure
index 2acc4d1465..e7a241e971 100755
--- a/configure
+++ b/configure
@@ -318,7 +318,7 @@ then
     git_submodules="$git_submodules tests/fp/berkeley-testfloat-3"
     git_submodules="$git_submodules tests/fp/berkeley-softfloat-3"
 else
-    git_update=no
+    git_update=""
     git_submodules=""

     if ! test -f "$source_path/ui/keycodemapdb/README"
@@ -1598,7 +1598,12 @@ for opt do
   ;;
   --with-git=*) git="$optarg"
   ;;
-  --enable-git-update) git_update=yes
+  --enable-git-update)
+      git_update=yes
+      if test ! -e "$source_path/.git"; then
+          echo "ERROR: cannot --enable-git-update without .git"
+          exit 1
+      fi
   ;;
   --disable-git-update) git_update=no
   ;;
@@ -2011,7 +2016,7 @@ fi
 # Consult white-list to determine whether to enable werror
 # by default.  Only enable by default for git builds
 if test -z "$werror" ; then
-    if test -e "$source_path/.git" && \
+    if test "$git_update" = "yes" && \
         { test "$linux" = "yes" || test "$mingw32" = "yes"; }; then
         werror="yes"
     else
@@ -4412,10 +4417,10 @@ EOF
       fdt=system
   else
       # have GIT checkout, so activate dtc submodule
-      if test -e "${source_path}/.git" ; then
+      if test "$git_update" = "yes" ; then
           git_submodules="${git_submodules} dtc"
       fi
-      if test -d "${source_path}/dtc/libfdt" || test -e "${source_path}/.git" ; then
+      if test -d "${source_path}/dtc/libfdt" || test "$git_update" = "yes" ; then
           fdt=git
           mkdir -p dtc
           if [ "$pwd_is_source_path" != "y" ] ; then
@@ -5385,7 +5390,7 @@ case "$capstone" in
   "" | yes)
     if $pkg_config capstone; then
       capstone=system
-    elif test -e "${source_path}/.git" && test $git_update = 'yes' ; then
+    elif test "$git_update" = "yes" ; then
       capstone=git
     elif test -e "${source_path}/capstone/Makefile" ; then
       capstone=internal
@@ -6414,7 +6419,7 @@ case "$slirp" in
   "" | yes)
     if $pkg_config slirp; then
       slirp=system
-    elif test -e "${source_path}/.git" && test $git_update = 'yes' ; then
+    elif test "$git_update" = "yes" ; then
       slirp=git
     elif test -e "${source_path}/slirp/Makefile" ; then
       slirp=internal
@@ -6776,7 +6781,7 @@ if test "$cpu" = "s390x" ; then
   roms="$roms s390-ccw"
   # SLOF is required for building the s390-ccw firmware on s390x,
   # since it is using the libnet code from SLOF for network booting.
-  if test -e "${source_path}/.git" ; then
+  if test "$git_update" = "yes" ; then
     git_submodules="${git_submodules} roms/SLOF"
   fi
 fi
--
2.25.1
Re: [PATCH 16/16] hw/block/nvme: use preallocated qsg/iov in nvme_dma_prp
On Jul 30 01:15, Minwoo Im wrote:
> On 20-07-20 13:37:48, Klaus Jensen wrote:
> > From: Klaus Jensen
> >
> > Since clean up of the request qsg/iov is now always done post-use, there
> > is no need to use a stack-allocated qsg/iov in nvme_dma_prp.
> >
> > Signed-off-by: Klaus Jensen
> > Acked-by: Keith Busch
> > Reviewed-by: Maxim Levitsky
> > ---
> >  hw/block/nvme.c | 18 ++
> >  1 file changed, 6 insertions(+), 12 deletions(-)
> >
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 0b3dceccc89b..b6da5a9f3fc6 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -381,45 +381,39 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> >                               uint64_t prp1, uint64_t prp2, DMADirection dir,
> >                               NvmeRequest *req)
> >  {
> > -    QEMUSGList qsg;
> > -    QEMUIOVector iov;
> >      uint16_t status = NVME_SUCCESS;
> >
> > -    status = nvme_map_prp(n, &qsg, &iov, prp1, prp2, len, req);
> > +    status = nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
>
> After this change, can we make nvme_map_prp() just receive
> NvmeRequest *req without &req->qsg, &req->iov by retrieving them from
> inside of nvme_map_prp()?

Absolutely. I added a follow up patch to do this :)
Re: [PATCH v3 0/7] Fix scsi devices plug/unplug races w.r.t virtio-scsi iothread
On Wed, 2020-07-15 at 18:01 +0300, Maxim Levitsky wrote:
> Hi!
>
> This is a patch series that is a result of my discussion with Paolo on
> how to correctly fix the root cause of BZ #1812399.
>
> The root cause of this bug is the fact that the IO thread is running
> mostly unlocked versus the main thread on which device hotplug is done.
>
> qdev_device_add first creates the device object, then places it on the
> bus, and only then realizes it.
>
> However some drivers (currently only virtio-scsi) enumerate their child
> bus devices on each request that is received from the guest, and that
> can happen on the IO thread.
>
> Thus we have a window when a new device is on the bus but not realized,
> and can be accessed by the virtio-scsi driver in that state.
>
> Fix that by doing two things:
>
> 1. Add partial RCU protection to the list of a bus's child devices.
>    This allows the scsi IO thread to safely enumerate the child devices
>    while it races with the hotplug placing the device on the bus.
>
> 2. Make the virtio-scsi driver check the .realized property of the scsi
>    device and avoid touching the device if it isn't.
>
> Note that in the particular bug report the issue wasn't a race but
> rather due to a combination of things: the .realize code in the middle
> managed to trigger IO on the virtqueue, which caused the virtio-scsi
> driver to access the half-realized device. However since this can
> happen as well with a real IO thread, this patch series was done,
> which fixes this as well.
>
> Changes from V1:
>  * Patch 2 is new, as suggested by Stefan; added drain_call_rcu() to
>    fix the failing unit test, make check passes now
>
>  * Patches 6,7 are new as well: I added scsi_device_get as suggested by
>    Stefan, although this is more a refactoring than anything else as it
>    doesn't solve an existing race.
>
>  * Addressed most of the review feedback from V1
>    - still need to decide if we need QTAILQ_FOREACH_WITH_RCU_READ_LOCK
>
> Changes from V2:
>
>  * No longer RFC
>  * Addressed most of the feedback from Stefan
>  * Fixed reference count leak in patch 7 when device is about to be
>    unrealized
>  * Better testing
>
> This series was tested by adding a virtio-scsi drive with an iothread,
> then running a fio stress job in the guest in a loop, and then
> adding/removing the scsi drive on the host in a loop.
> This test was usually failing on the 1st iteration without this patch
> series, and now it seems to work smoothly.
>
> Best regards,
>     Maxim Levitsky
>
> Maxim Levitsky (7):
>   scsi/scsi_bus: switch search direction in scsi_device_find
>   Implement drain_call_rcu and use it in hmp_device_del
>   device-core: use RCU for list of childs of a bus
>   device-core: use atomic_set on .realized property
>   virtio-scsi: don't touch scsi devices that are not yet realized or
>     about to be un-realized
>   scsi: Add scsi_device_get
>   virtio-scsi: use scsi_device_get
>
>  hw/core/bus.c          | 28 +
>  hw/core/qdev.c         | 56 +++---
>  hw/scsi/scsi-bus.c     | 48 +++-
>  hw/scsi/virtio-scsi.c  | 47 ---
>  include/hw/qdev-core.h | 11 +
>  include/hw/scsi/scsi.h |  2 ++
>  include/qemu/rcu.h     |  1 +
>  qdev-monitor.c         | 22 +
>  util/rcu.c             | 55 +
>  9 files changed, 230 insertions(+), 40 deletions(-)
>
> --
> 2.26.2
>

Very gentle ping about this patch series.

Best regards,
    Maxim Levitsky
Re: [PATCH 06/16] hw/block/nvme: pass request along for tracing
On Jul 30 00:49, Minwoo Im wrote:
> Klaus,
>
> On 20-07-20 13:37:38, Klaus Jensen wrote:
> > From: Klaus Jensen
> >
> > Pass along the NvmeRequest in various functions since it is very useful
> > for tracing.
>
> One doubt here.
> This patch has put the NvmeRequest argument into nvme_map_prp() to trace
> the request's command id. But can we just trace the cid before this
> kind of prp mapping, somewhere like at the nvme_process_sq() level? Then
> we can figure out which request the tracing for the prp mapping is from.
>
> Tracing for cid is definitely great, but it feels like too much cost to
> pass an argument to trace 'cid' in the middle of the dma mapping stage.
>

Good point Minwoo. I ended up dropping the patch and just replacing it
with a patch that adds tracing to nvme_map_prp.
Re: [PATCH 14/16] hw/block/nvme: consolidate qsg/iov clearing
On Jul 29 21:18, Maxim Levitsky wrote:
> On Mon, 2020-07-20 at 13:37 +0200, Klaus Jensen wrote:
> > From: Klaus Jensen
> >
> > Always destroy the request qsg/iov at the end of request use.
> >
> > Signed-off-by: Klaus Jensen
> > ---
> >  hw/block/nvme.c | 48 +---
> >  1 file changed, 17 insertions(+), 31 deletions(-)
> >
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 54cd20f1ce22..b53afdeb3fb6 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -213,6 +213,14 @@ static void nvme_req_clear(NvmeRequest *req)
> >  {
> >      req->ns = NULL;
> >      memset(&req->cqe, 0x0, sizeof(req->cqe));
> > +
> > +    if (req->qsg.sg) {
> > +        qemu_sglist_destroy(&req->qsg);
> > +    }
> > +
> > +    if (req->iov.iov) {
> > +        qemu_iovec_destroy(&req->iov);
> > +    }
> >  }
> >
> >  static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr addr,
> > @@ -297,15 +305,14 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> >
> >      status = nvme_map_addr(n, qsg, iov, prp1, trans_len);
> >      if (status) {
> > -        goto unmap;
> > +        return status;
> >      }
> >
> >      len -= trans_len;
> >      if (len) {
> >          if (unlikely(!prp2)) {
> >              trace_pci_nvme_err_invalid_prp2_missing();
> > -            status = NVME_INVALID_FIELD | NVME_DNR;
> > -            goto unmap;
> > +            return NVME_INVALID_FIELD | NVME_DNR;
> >          }
> >
> >          if (len > n->page_size) {
> > @@ -326,13 +333,11 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> >                  if (i == n->max_prp_ents - 1 && len > n->page_size) {
> >                      if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
> >                          trace_pci_nvme_err_invalid_prplist_ent(prp_ent);
> > -                        status = NVME_INVALID_FIELD | NVME_DNR;
> > -                        goto unmap;
> > +                        return NVME_INVALID_FIELD | NVME_DNR;
> >                      }
> >
> >                      if (prp_list_in_cmb != nvme_addr_is_cmb(n, prp_ent)) {
> > -                        status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > -                        goto unmap;
> > +                        return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> >                      }
> >
> >                      i = 0;
> > @@ -345,14 +350,13 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> >
> >                  if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
> >                      trace_pci_nvme_err_invalid_prplist_ent(prp_ent);
> > -                    status = NVME_INVALID_FIELD | NVME_DNR;
> > -                    goto unmap;
> > +                    return NVME_INVALID_FIELD | NVME_DNR;
> >                  }
> >
> >                  trans_len = MIN(len, n->page_size);
> >                  status = nvme_map_addr(n, qsg, iov, prp_ent, trans_len);
> >                  if (status) {
> > -                    goto unmap;
> > +                    return status;
> >                  }
> >
> >                  len -= trans_len;
> > @@ -361,27 +365,16 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> >          } else {
> >              if (unlikely(prp2 & (n->page_size - 1))) {
> >                  trace_pci_nvme_err_invalid_prp2_align(prp2);
> > -                status = NVME_INVALID_FIELD | NVME_DNR;
> > -                goto unmap;
> > +                return NVME_INVALID_FIELD | NVME_DNR;
> >              }
> >              status = nvme_map_addr(n, qsg, iov, prp2, len);
> >              if (status) {
> > -                goto unmap;
> > +                return status;
> >              }
> >          }
> >      }
> > +
> >      return NVME_SUCCESS;
> > -
> > -unmap:
> > -    if (iov && iov->iov) {
> > -        qemu_iovec_destroy(iov);
> > -    }
> > -
> > -    if (qsg && qsg->sg) {
> > -        qemu_sglist_destroy(qsg);
> > -    }
> > -
> > -    return status;
> >  }
> >
> >  static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > @@ -601,13 +594,6 @@ static void nvme_rw_cb(void *opaque, int ret)
> >          req->status = NVME_INTERNAL_DEV_ERROR;
> >      }
> >
> > -    if (req->qsg.nalloc) {
> > -        qemu_sglist_destroy(&req->qsg);
> > -    }
> > -    if (req->iov.nalloc) {
> > -        qemu_iovec_destroy(&req->iov);
> > -    }
> > -
> >      nvme_enqueue_req_completion(cq, req);
> >  }
> >
>
> This and the former patch I guess answer my own question about why to
> clear the request after its cqe got posted.
>
> Looks reasonable.
>

I ended up with a compromise. I keep clearing as a "before-use" job, but
we don't want to keep the qsg and iovs hanging around until the request
gets reused, so I'm adding a nvme_req_exit() to free that memory when
the cqe has been posted.
Re: [PATCH for-5.1] qapi/machine.json: Fix missing newline in doc comment
On 7/29/20 9:10 PM, Peter Maydell wrote: > In commit 176d2cda0dee9f4 we added the @die-id field > to the CpuInstanceProperties struct, but in the process > accidentally removed the newline between the doc-comment > lines for @core-id and @thread-id. > > Put the newline back in; this fixes a misformatting in the > generated HTML QMP reference manual. > > Signed-off-by: Peter Maydell > --- > Not very important but I've suggested for-5.1 as it's a safe > docs fix. You can see the misrendered doc at > https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#index-CpuInstanceProperties > > qapi/machine.json | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/qapi/machine.json b/qapi/machine.json > index f59144023ca..daede5ab149 100644 > --- a/qapi/machine.json > +++ b/qapi/machine.json > @@ -825,7 +825,8 @@ > # @node-id: NUMA node ID the CPU belongs to > # @socket-id: socket number within node/board the CPU belongs to > # @die-id: die number within node/board the CPU belongs to (Since 4.1) > -# @core-id: core number within die the CPU belongs to# @thread-id: thread > number within core the CPU belongs to > +# @core-id: core number within die the CPU belongs to > +# @thread-id: thread number within core the CPU belongs to > # > # Note: currently there are 5 properties that could be present > # but management should be prepared to pass through other > Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH 15/16] hw/block/nvme: remove NvmeCmd parameter
On Jul 30 01:10, Minwoo Im wrote:
> On 20-07-20 13:37:47, Klaus Jensen wrote:
> > From: Klaus Jensen
> >
> > Keep a copy of the raw nvme command in the NvmeRequest and remove the
> > now redundant NvmeCmd parameter.
> >
> > Signed-off-by: Klaus Jensen
>
> I would really have suggested this change from the 13th patch!
>
> Reviewed-by: Minwoo Im
>

I squashed the two patches. I don't think it makes the patch much harder
to review since the added namespace reference is a pretty small change.
Re: [PATCH 10/16] hw/block/nvme: add check for mdts
On Jul 30 01:00, Minwoo Im wrote:
> On 20-07-20 13:37:42, Klaus Jensen wrote:
> > From: Klaus Jensen
> >
> > Add 'mdts' device parameter to control the Maximum Data Transfer Size of
> > the controller and check that it is respected.
> >
> > Signed-off-by: Klaus Jensen
> > Reviewed-by: Maxim Levitsky
> > ---
> >  hw/block/nvme.c       | 32 ++--
> >  hw/block/nvme.h       |  1 +
> >  hw/block/trace-events |  1 +
> >  3 files changed, 32 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 35bc1a7b7e21..10fe53873ae9 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -18,9 +18,10 @@
> >   * Usage: add options:
> >   *      -drive file=,if=none,id=
> >   *      -device nvme,drive=,serial=,id=, \
> > - *              cmb_size_mb=, \
> > + *              [cmb_size_mb=,] \
> >   *              [pmrdev=,] \
> > - *              max_ioqpairs=
> > + *              [max_ioqpairs=,] \
> > + *              [mdts=]
>
> Nitpick:
> cmb and ioqpairs-things could be in another thread. :)
>

So, with that I wanted to align the way optional parameters were
described. And I actually messed it up anyway. I'll remove the "fixes"
and just keep the addition of mdts there.
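For readers unfamiliar with the field being added above: in the NVMe specification, MDTS (Maximum Data Transfer Size) is reported in Identify Controller as a power of two in units of the minimum memory page size (CAP.MPSMIN), with 0 meaning no limit. The following sketch (an illustration of the spec-defined encoding, not code from the patch) shows the byte limit the device checks transfers against:

```python
def mdts_bytes(mdts, mpsmin=0):
    """Return the maximum data transfer size in bytes, or None if unlimited.

    mdts is the Identify Controller MDTS field. Its unit is the minimum
    memory page size, i.e. 2^(12 + CAP.MPSMIN) bytes, so the limit is
    2^mdts of those pages. An MDTS of 0 reports no limit.
    """
    if mdts == 0:
        return None
    page_size = 1 << (12 + mpsmin)  # minimum memory page size in bytes
    return (1 << mdts) * page_size

# With mdts=7 and a 4 KiB minimum page size, transfers are capped at 512 KiB.
assert mdts_bytes(7) == 512 * 1024
assert mdts_bytes(0) is None
```

A request whose data length exceeds this value should fail with an Invalid Field status, which is what the patch's new check enforces.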
Re: [PATCH for-5.1] qapi/machine.json: Fix missing newline in doc comment
On 7/29/20 2:10 PM, Peter Maydell wrote: In commit 176d2cda0dee9f4 we added the @die-id field to the CpuInstanceProperties struct, but in the process accidentally removed the newline between the doc-comment lines for @core-id and @thread-id. Put the newline back in; this fixes a misformatting in the generated HTML QMP reference manual. Signed-off-by: Peter Maydell --- Not very important but I've suggested for-5.1 as it's a safe docs fix. You can see the misrendered doc at https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#index-CpuInstanceProperties Reviewed-by: Eric Blake qapi/machine.json | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/qapi/machine.json b/qapi/machine.json index f59144023ca..daede5ab149 100644 --- a/qapi/machine.json +++ b/qapi/machine.json @@ -825,7 +825,8 @@ # @node-id: NUMA node ID the CPU belongs to # @socket-id: socket number within node/board the CPU belongs to # @die-id: die number within node/board the CPU belongs to (Since 4.1) -# @core-id: core number within die the CPU belongs to# @thread-id: thread number within core the CPU belongs to +# @core-id: core number within die the CPU belongs to +# @thread-id: thread number within core the CPU belongs to # # Note: currently there are 5 properties that could be present # but management should be prepared to pass through other -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Re: [PATCH 07/16] hw/block/nvme: add request mapping helper
On Jul 29 21:31, Maxim Levitsky wrote: > On Thu, 2020-07-30 at 00:52 +0900, Minwoo Im wrote: > > Klaus, > > > > On 20-07-20 13:37:39, Klaus Jensen wrote: > > > From: Klaus Jensen > > > > > > Introduce the nvme_map helper to remove some noise in the main nvme_rw > > > function. > > > > > > Signed-off-by: Klaus Jensen > > > Reviewed-by: Maxim Levitsky > > > --- > > > hw/block/nvme.c | 13 ++--- > > > 1 file changed, 10 insertions(+), 3 deletions(-) > > > > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > > > index f1e04608804b..68c33a11c144 100644 > > > --- a/hw/block/nvme.c > > > +++ b/hw/block/nvme.c > > > @@ -413,6 +413,15 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t > > > *ptr, uint32_t len, > > > return status; > > > } > > > > > > +static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, size_t len, > > > + NvmeRequest *req) > > > > Can we specify what is going to be mapped in this function? like > > nvme_map_dptr? > I also once complained about the name, and I do like this idea! > Hehe. I will change it ;) Note that when I post support for metadata, it will have to change again! Because then the function will be mapping both DPTR and MPTR. But lets discuss naming when we get to that ;)
Re: device compatibility interface for live migration with assigned devices
On Wed, 29 Jul 2020 12:28:46 +0100
Sean Mooney wrote:

> On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
> > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > > On Mon, 27 Jul 2020 15:24:40 +0800
> > > Yan Zhao wrote:
> > > > > > As you indicate, the vendor driver is responsible for checking version
> > > > > > information embedded within the migration stream. Therefore a
> > > > > > migration should fail early if the devices are incompatible. Is it
> > > > >
> > > > > but as I know, currently in VFIO migration protocol, we have no way to
> > > > > get vendor specific compatibility checking string in migration setup stage
> > > > > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > > > > In this way, for devices who does not save device data in precopy stage,
> > > > > the migration compatibility checking is as late as in stop-and-copy
> > > > > stage, which is too late.
> > > > > do you think we need to add the getting/checking of vendor specific
> > > > > compatibility string early in save_setup stage?
> > > > >
> > > >
> > > > hi Alex,
> > > > after an offline discussion with Kevin, I realized that it may not be a
> > > > problem if migration compatibility check in vendor driver occurs late in
> > > > stop-and-copy phase for some devices, because if we report device
> > > > compatibility attributes clearly in an interface, the chances for
> > > > libvirt/openstack to make a wrong decision is little.
> > >
> > > I think it would be wise for a vendor driver to implement a pre-copy
> > > phase, even if only to send version information and verify it at the
> > > target. Deciding you have no device state to send during pre-copy does
> > > not mean your vendor driver needs to opt-out of the pre-copy phase
> > > entirely. Please also note that pre-copy is at the user's discretion,
> > > we've defined that we can enter stop-and-copy at any point, including
> > > without a pre-copy phase, so I would recommend that vendor drivers
> > > validate compatibility at the start of both the pre-copy and the
> > > stop-and-copy phases.
> > >
> >
> > ok. got it!
> >
> > > > so, do you think we are now arriving at an agreement that we'll give up
> > > > the read-and-test scheme and start to defining one interface (perhaps in
> > > > json format), from which libvirt/openstack is able to parse and find out
> > > > compatibility list of a source mdev/physical device?
> > >
> > > Based on the feedback we've received, the previously proposed interface
> > > is not viable. I think there's agreement that the user needs to be
> > > able to parse and interpret the version information. Using json seems
> > > viable, but I don't know if it's the best option. Is there any
> > > precedent of markup strings returned via sysfs we could follow?
> >
> > I found some examples of using formatted strings under /sys, mostly under
> > tracing. maybe we can do a similar implementation.
> >
> > # cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
> >
> > name: kvm_mmio
> > ID: 32
> > format:
> >     field:unsigned short common_type;          offset:0;  size:2;  signed:0;
> >     field:unsigned char common_flags;          offset:2;  size:1;  signed:0;
> >     field:unsigned char common_preempt_count;  offset:3;  size:1;  signed:0;
> >     field:int common_pid;                      offset:4;  size:4;  signed:1;
> >
> >     field:u32 type;  offset:8;   size:4;  signed:0;
> >     field:u32 len;   offset:12;  size:4;  signed:0;
> >     field:u64 gpa;   offset:16;  size:8;  signed:0;
> >     field:u64 val;   offset:24;  size:8;  signed:0;
> >
> > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx",
> > __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" },
> > { 2, "write" }), REC->len, REC->gpa, REC->val
> >
> this is not json format and it's not super friendly to parse.
>
> > # cat /sys/devices/pci0000:00/0000:00:02.0/uevent
> > DRIVER=vfio-pci
> > PCI_CLASS=3
> > PCI_ID=8086:591D
> > PCI_SUBSYS_ID=8086:2212
> > PCI_SLOT_NAME=0000:00:02.0
> > MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00
> >
> this is ini format or conf format.
> This is pretty simple to parse, which would be fine.
> That said, you could also have a version or capability directory with a
> file for each key and a single value.
>
> I would personally prefer to only have to do one read; listing the files
> in a directory and then reading them all to build the data structure
> myself is doable, though the simple ini format used for uevent seems the
> best of the 3 options provided above.
>
> > > Your idea of having both a "self" object and an array of "compatible"
> > > objects is perhaps something we can build on, but we must not assume
> > > PCI devices at the root level of the object. Providing both the
> > > mdev-type and the driver is a bit
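Sean's point that the uevent-style key=value format is trivial for management tools to consume can be illustrated with a few lines. This is purely a sketch of consuming that existing format; the sample keys come from the uevent output quoted in the thread, and nothing here is a proposed interface:

```python
def parse_uevent(text):
    """Parse a uevent-style KEY=VALUE blob into a dict.

    Empty lines and lines without '=' are skipped; values keep any
    embedded '=' characters (relevant for MODALIAS-like strings).
    """
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        result[key] = value
    return result

sample = """\
DRIVER=vfio-pci
PCI_CLASS=3
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
"""
attrs = parse_uevent(sample)
assert attrs["DRIVER"] == "vfio-pci"
assert attrs["PCI_ID"] == "8086:591D"
```

A JSON document would need a full parser but supports nesting (e.g. a "compatible" array), which is the trade-off being debated above.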
[PATCH for-5.1] qapi/machine.json: Fix missing newline in doc comment
In commit 176d2cda0dee9f4 we added the @die-id field to the CpuInstanceProperties struct, but in the process accidentally removed the newline between the doc-comment lines for @core-id and @thread-id. Put the newline back in; this fixes a misformatting in the generated HTML QMP reference manual. Signed-off-by: Peter Maydell --- Not very important but I've suggested for-5.1 as it's a safe docs fix. You can see the misrendered doc at https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#index-CpuInstanceProperties qapi/machine.json | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/qapi/machine.json b/qapi/machine.json index f59144023ca..daede5ab149 100644 --- a/qapi/machine.json +++ b/qapi/machine.json @@ -825,7 +825,8 @@ # @node-id: NUMA node ID the CPU belongs to # @socket-id: socket number within node/board the CPU belongs to # @die-id: die number within node/board the CPU belongs to (Since 4.1) -# @core-id: core number within die the CPU belongs to# @thread-id: thread number within core the CPU belongs to +# @core-id: core number within die the CPU belongs to +# @thread-id: thread number within core the CPU belongs to # # Note: currently there are 5 properties that could be present # but management should be prepared to pass through other -- 2.20.1
Re: device compatibility interface for live migration with assigned devices
* Alex Williamson (alex.william...@redhat.com) wrote: > On Mon, 27 Jul 2020 15:24:40 +0800 > Yan Zhao wrote: > > > > > As you indicate, the vendor driver is responsible for checking version > > > > information embedded within the migration stream. Therefore a > > > > migration should fail early if the devices are incompatible. Is it > > > but as I know, currently in VFIO migration protocol, we have no way to > > > get vendor specific compatibility checking string in migration setup stage > > > (i.e. .save_setup stage) before the device is set to _SAVING state. > > > In this way, for devices who does not save device data in precopy stage, > > > the migration compatibility checking is as late as in stop-and-copy > > > stage, which is too late. > > > do you think we need to add the getting/checking of vendor specific > > > compatibility string early in save_setup stage? > > > > > hi Alex, > > after an offline discussion with Kevin, I realized that it may not be a > > problem if migration compatibility check in vendor driver occurs late in > > stop-and-copy phase for some devices, because if we report device > > compatibility attributes clearly in an interface, the chances for > > libvirt/openstack to make a wrong decision is little. > > I think it would be wise for a vendor driver to implement a pre-copy > phase, even if only to send version information and verify it at the > target. Deciding you have no device state to send during pre-copy does > not mean your vendor driver needs to opt-out of the pre-copy phase > entirely. Please also note that pre-copy is at the user's discretion, > we've defined that we can enter stop-and-copy at any point, including > without a pre-copy phase, so I would recommend that vendor drivers > validate compatibility at the start of both the pre-copy and the > stop-and-copy phases. 
That's quite curious; from a migration point of view I'd expect if you did want to skip pre-copy, that you'd go through the motions of entering it and then not saving any data and then going to stop-and-copy, rather than having two flows. Note that failing at a late stage of stop-and-copy is a pain; if you've just spent an hour migrating your huge busy VM over, you're going to be pretty annoyed when it goes pop near the end. Dave > > so, do you think we are now arriving at an agreement that we'll give up > > the read-and-test scheme and start to defining one interface (perhaps in > > json format), from which libvirt/openstack is able to parse and find out > > compatibility list of a source mdev/physical device? > > Based on the feedback we've received, the previously proposed interface > is not viable. I think there's agreement that the user needs to be > able to parse and interpret the version information. Using json seems > viable, but I don't know if it's the best option. Is there any > precedent of markup strings returned via sysfs we could follow? > > Your idea of having both a "self" object and an array of "compatible" > objects is perhaps something we can build on, but we must not assume > PCI devices at the root level of the object. Providing both the > mdev-type and the driver is a bit redundant, since the former includes > the latter. We can't have vendor specific versioning schemes though, > ie. gvt-version. We need to agree on a common scheme and decide which > fields the version is relative to, ex. just the mdev type? > > I had also proposed fields that provide information to create a > compatible type, for example to create a type_x2 device from a type_x1 > mdev type, they need to know to apply an aggregation attribute. 
If we > need to explicitly list every aggregation value and the resulting type, > I think we run aground of what aggregation was trying to avoid anyway, > so we might need to pick a language that defines variable substitution > or some kind of tagging. For example if we could define ${aggr} as an > integer within a specified range, then we might be able to define a type > relative to that value (type_x${aggr}) which requires an aggregation > attribute using the same value. I dunno, just spit balling. Thanks, > > Alex -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
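Alex's spit-balled idea above — defining `${aggr}` as an integer within a specified range and deriving a type name like `type_x${aggr}` from it — can be sketched in a few lines. The template syntax, the variable name `aggr`, and the range are all hypothetical, taken only from the idea floated in the thread:

```python
import string

def expand_type(template, **values):
    """Expand a type-name template such as 'type_x${aggr}' with concrete
    values, using shell-style ${var} substitution."""
    return string.Template(template).substitute(**values)

def valid_aggr(value, lo=1, hi=4):
    """Check a candidate aggregation value against a declared range
    (the range itself would be advertised by the vendor driver)."""
    return lo <= value <= hi

# A management tool could derive the concrete type from the template
# instead of the interface listing every aggregation value explicitly.
template = "type_x${aggr}"
assert expand_type(template, aggr=2) == "type_x2"
assert valid_aggr(2) and not valid_aggr(8)
```

This is the kind of variable substitution that would let the interface avoid enumerating every `(aggregation value, resulting type)` pair, which is what aggregation was trying to avoid in the first place.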
Re: [PATCH 12/16] hw/block/nvme: refactor NvmeRequest clearing
On Jul 29 20:47, Maxim Levitsky wrote:
> On Mon, 2020-07-20 at 13:37 +0200, Klaus Jensen wrote:
> > From: Klaus Jensen
> >
> > Move clearing of the structure from "clear before use" to "clear
> > after use".
> >
> > Signed-off-by: Klaus Jensen
> > ---
> >  hw/block/nvme.c | 7 ++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index e2932239c661..431f26c2f589 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -209,6 +209,11 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
> >      }
> >  }
> >
> > +static void nvme_req_clear(NvmeRequest *req)
> > +{
> > +    memset(&req->cqe, 0x0, sizeof(req->cqe));
> > +}
> > +
> >  static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr addr,
> >                                    size_t len)
> >  {
> > @@ -458,6 +463,7 @@ static void nvme_post_cqes(void *opaque)
> >          nvme_inc_cq_tail(cq);
> >          pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
> >                        sizeof(req->cqe));
> > +        nvme_req_clear(req);
>
> Don't we need some barrier here to avoid reordering the writes?
> pci_dma_write does seem to include a barrier prior to the write it does
> but not afterward.
>
> Also what is the motivation of switching the order?

This was just preference. But I did not consider that I would be
breaking any DMA rules here.

> I think somewhat that it is a good thing to clear a buffer,
> before it is setup.
>

I'll reverse my preference and keep the status quo since I have no
better motivation than personal preference. The introduction of
nvme_req_clear is just in preparation for consolidating more cleanup
here, but I'll drop this patch and introduce nvme_req_clear later.
[PATCH] schemas: Add vim modeline
The various schemas included in QEMU use a JSON-based format which is, however, strictly speaking not valid JSON. As a consequence, when vim tries to apply syntax highlight rules for JSON (as guessed from the file name), the result is an unreadable mess which mostly consist of red markers pointing out supposed errors in, well, pretty much everything. Using Python syntax highlighting produces much better results, and in fact these files already start with specially-formatted comments that instruct Emacs to process them as if they were Python files. This commit adds the equivalent special comments for vim. Signed-off-by: Andrea Bolognani --- docs/interop/firmware.json| 1 + docs/interop/vhost-user.json | 1 + qapi/authz.json | 1 + qapi/block-core.json | 1 + qapi/block.json | 1 + qapi/char.json| 1 + qapi/common.json | 1 + qapi/control.json | 1 + qapi/crypto.json | 1 + qapi/dump.json| 1 + qapi/error.json | 1 + qapi/introspect.json | 1 + qapi/job.json | 1 + qapi/machine-target.json | 1 + qapi/machine.json | 1 + qapi/migration.json | 1 + qapi/misc-target.json | 1 + qapi/misc.json| 1 + qapi/net.json | 1 + qapi/qapi-schema.json | 1 + qapi/qdev.json| 1 + qapi/qom.json | 1 + qapi/rdma.json| 1 + qapi/rocker.json | 1 + qapi/run-state.json | 1 + qapi/sockets.json | 1 + qapi/tpm.json | 1 + qapi/transaction.json | 1 + qapi/ui.json | 1 + qga/qapi-schema.json | 1 + storage-daemon/qapi/qapi-schema.json | 1 + tests/qapi-schema/doc-good.json | 2 ++ tests/qapi-schema/include/sub-module.json | 1 + tests/qapi-schema/qapi-schema-test.json | 1 + tests/qapi-schema/sub-sub-module.json | 1 + 35 files changed, 36 insertions(+) diff --git a/docs/interop/firmware.json b/docs/interop/firmware.json index 240f565397..989f10b626 100644 --- a/docs/interop/firmware.json +++ b/docs/interop/firmware.json @@ -1,4 +1,5 @@ # -*- Mode: Python -*- +# vim: filetype=python # # Copyright (C) 2018 Red Hat, Inc. 
# diff --git a/docs/interop/vhost-user.json b/docs/interop/vhost-user.json index ef8ac5941f..feb5fe58ca 100644 --- a/docs/interop/vhost-user.json +++ b/docs/interop/vhost-user.json @@ -1,4 +1,5 @@ # -*- Mode: Python -*- +# vim: filetype=python # # Copyright (C) 2018 Red Hat, Inc. # diff --git a/qapi/authz.json b/qapi/authz.json index 1c836a3abd..f3e9745426 100644 --- a/qapi/authz.json +++ b/qapi/authz.json @@ -1,4 +1,5 @@ # -*- Mode: Python -*- +# vim: filetype=python # # QAPI authz definitions diff --git a/qapi/block-core.json b/qapi/block-core.json index 943df1926a..5f72b50149 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -1,4 +1,5 @@ # -*- Mode: Python -*- +# vim: filetype=python ## # == Block core (VM unrelated) diff --git a/qapi/block.json b/qapi/block.json index 2ddbfa8306..c54a393cf3 100644 --- a/qapi/block.json +++ b/qapi/block.json @@ -1,4 +1,5 @@ # -*- Mode: Python -*- +# vim: filetype=python ## # = Block devices diff --git a/qapi/char.json b/qapi/char.json index daceb20f84..8aeedf96b2 100644 --- a/qapi/char.json +++ b/qapi/char.json @@ -1,4 +1,5 @@ # -*- Mode: Python -*- +# vim: filetype=python # ## diff --git a/qapi/common.json b/qapi/common.json index 7b9cbcd97b..716712d4b3 100644 --- a/qapi/common.json +++ b/qapi/common.json @@ -1,4 +1,5 @@ # -*- Mode: Python -*- +# vim: filetype=python ## # = Common data types diff --git a/qapi/control.json b/qapi/control.json index 6b816bb61f..de51e9916c 100644 --- a/qapi/control.json +++ b/qapi/control.json @@ -1,4 +1,5 @@ # -*- Mode: Python -*- +# vim: filetype=python # ## diff --git a/qapi/crypto.json b/qapi/crypto.json index b2a4cff683..c41e869e31 100644 --- a/qapi/crypto.json +++ b/qapi/crypto.json @@ -1,4 +1,5 @@ # -*- Mode: Python -*- +# vim: filetype=python # ## diff --git a/qapi/dump.json b/qapi/dump.json index a1eed7b15c..f7c4267e3f 100644 --- a/qapi/dump.json +++ b/qapi/dump.json @@ -1,4 +1,5 @@ # -*- Mode: Python -*- +# vim: filetype=python # # This work is licensed under the terms of 
the GNU GPL, version 2 or later. # See the COPYING file in the top-level directory. diff --git a/qapi/error.json b/qapi/error.json index 3fad08f506..94a6502de9 100644 --- a/qapi/error.json +++ b/qapi/error.json @@ -1,4 +1,5 @@ # -*- Mode: Python -*- +# vim: filetype=python
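For reference, after this patch each schema file opens with the pair of editor hints below — the first line is the pre-existing Emacs file-variables comment, the second is the vim modeline being added. Both make the editor apply Python syntax rules despite the .json extension:

```python
# -*- Mode: Python -*-
# vim: filetype=python
```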
Re: [PATCH v3 12/18] hw/block/nvme: support the get/set features select and save fields
On Wed, 2020-07-29 at 15:48 +0200, Klaus Jensen wrote: > On Jul 29 16:17, Maxim Levitsky wrote: > > On Mon, 2020-07-06 at 08:12 +0200, Klaus Jensen wrote: > > > From: Klaus Jensen > > > > > > Since the device does not have any persistent state storage, no > > > features are "saveable" and setting the Save (SV) field in any Set > > > Features command will result in a Feature Identifier Not Saveable status > > > code. > > > > > > Similarly, if the Select (SEL) field is set to request saved values, the > > > devices will (as it should) return the default values instead. > > > > > > Since this also introduces "Supported Capabilities", the nsid field is > > > now also checked for validity wrt. the feature being get/set'ed. > > > > > > Signed-off-by: Klaus Jensen > > > --- > > > hw/block/nvme.c | 103 +- > > > hw/block/trace-events | 4 +- > > > include/block/nvme.h | 27 ++- > > > 3 files changed, 119 insertions(+), 15 deletions(-) > > > > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > > > index 2d85e853403f..df8b786e4875 100644 > > > --- a/hw/block/nvme.c > > > +++ b/hw/block/nvme.c > > > @@ -1083,20 +1091,47 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, > > > NvmeCmd *cmd, NvmeRequest *req) > > > { > > > uint32_t dw10 = le32_to_cpu(cmd->cdw10); > > > uint32_t dw11 = le32_to_cpu(cmd->cdw11); > > > +uint32_t nsid = le32_to_cpu(cmd->nsid); > > > uint32_t result; > > > uint8_t fid = NVME_GETSETFEAT_FID(dw10); > > > +NvmeGetFeatureSelect sel = NVME_GETFEAT_SELECT(dw10); > > > uint16_t iv; > > > > > > static const uint32_t nvme_feature_default[NVME_FID_MAX] = { > > > [NVME_ARBITRATION] = NVME_ARB_AB_NOLIMIT, > > > }; > > > > > > -trace_pci_nvme_getfeat(nvme_cid(req), fid, dw11); > > > +trace_pci_nvme_getfeat(nvme_cid(req), fid, sel, dw11); > > > > > > if (!nvme_feature_support[fid]) { > > > return NVME_INVALID_FIELD | NVME_DNR; > > > } > > > > > > +if (nvme_feature_cap[fid] & NVME_FEAT_CAP_NS) { > > > +if (!nsid || nsid > n->num_namespaces) { > > > +/* > > > + * The 
Reservation Notification Mask and Reservation > > > Persistence > > > + * features require a status code of Invalid Field in > > > Command when > > > + * NSID is 0x. Since the device does not support > > > those > > > + * features we can always return Invalid Namespace or Format > > > as we > > > + * should do for all other features. > > > + */ > > > +return NVME_INVALID_NSID | NVME_DNR; > > > +} > > > +} > > > + > > > +switch (sel) { > > > +case NVME_GETFEAT_SELECT_CURRENT: > > > +break; > > > +case NVME_GETFEAT_SELECT_SAVED: > > > +/* no features are saveable by the controller; fallthrough */ > > > +case NVME_GETFEAT_SELECT_DEFAULT: > > > +goto defaults; > > > > I hate to say it, but while I have nothing against using 'goto' (unlike > > some types I met), > > In this particular case it feels like it would be better to have a > > separate function for > > defaults, or have even have a a separate function per feature and have it > > return current/default/saved/whatever > > value. The later would allow to have each feature self contained in its own > > function. > > > > But on the other hand I see that you fail back to defaults for unchangeble > > features, which does make > > sense. In other words, I don't have strong opinion against using goto here > > after all. > > > > When feature code will be getting more features in the future (pun > > intended) you probably will have to split it,\ > > like I suggest to keep code complexity low. > > > > Argh... I know you are right. > > Since you are "accepting" the current state with your R-b and it already > carries one from Dmitry I think I'll let this stay for now, but I will > fix this in a follow up patch for sure. Yep, this is exactly what I was thinking. 
Best regards, Maxim Levitsky > > > > @@ -926,6 +949,8 @@ typedef struct NvmeLBAF { > > > uint8_t rp; > > > } NvmeLBAF; > > > > > > +#define NVME_NSID_BROADCAST 0x > > > > Cool, you probably want to eventually go over the code and > > change all the places that use the number to the define. > > (No need to do this now) > > > > True. Noted :) >
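Maxim's suggestion of per-feature functions can be sketched like this: each feature gets its own handler that is asked explicitly which value class to return, removing the need for a shared `goto defaults` label. Everything below is a hypothetical illustration, not the actual QEMU code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-feature handler, as suggested in the review above.
 * The controller has no persistent storage, so a request for the SAVED
 * value falls back to the default, mirroring the behavior of the patch. */

enum feat_select { SEL_CURRENT, SEL_DEFAULT, SEL_SAVED };

typedef struct FeatCtrl {
    uint32_t arb_current;   /* current Arbitration feature value */
    uint32_t arb_default;   /* default Arbitration feature value */
} FeatCtrl;

static uint32_t feat_arbitration(FeatCtrl *n, enum feat_select sel)
{
    switch (sel) {
    case SEL_CURRENT:
        return n->arb_current;
    case SEL_SAVED:
        /* no features are saveable; fall through to the default */
    case SEL_DEFAULT:
    default:
        return n->arb_default;
    }
}
```

With one such function per feature identifier, the get-features dispatch collapses to a table lookup plus a call, which keeps complexity low as more features are added.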
Re: [PATCH v3 08/18] hw/block/nvme: add support for the asynchronous event request command
On Wed, 2020-07-29 at 15:37 +0200, Klaus Jensen wrote: > On Jul 29 13:43, Maxim Levitsky wrote: > > On Mon, 2020-07-06 at 08:12 +0200, Klaus Jensen wrote: > > > From: Klaus Jensen > > > > > > Add support for the Asynchronous Event Request command. Required for > > > compliance with NVMe revision 1.3d. See NVM Express 1.3d, Section 5.2 > > > ("Asynchronous Event Request command"). > > > > > > Mostly imported from Keith's qemu-nvme tree. Modified with a max number > > > of queued events (controllable with the aer_max_queued device > > > parameter). The spec states that the controller *should* retain > > > events, so we do best effort here. > > > > > > Signed-off-by: Klaus Jensen > > > Signed-off-by: Klaus Jensen > > > Acked-by: Keith Busch > > > Reviewed-by: Maxim Levitsky > > > Reviewed-by: Dmitry Fomichev > > > --- > > > hw/block/nvme.c | 180 -- > > > hw/block/nvme.h | 10 ++- > > > hw/block/trace-events | 9 +++ > > > include/block/nvme.h | 8 +- > > > 4 files changed, 198 insertions(+), 9 deletions(-) > > > > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > > > index 7cb3787638f6..80c7285bc1cf 100644 > > > --- a/hw/block/nvme.c > > > +++ b/hw/block/nvme.c > > > @@ -356,6 +356,85 @@ static void nvme_enqueue_req_completion(NvmeCQueue > > > *cq, NvmeRequest *req) > > > timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500); > > > } > > > > > > +static void nvme_process_aers(void *opaque) > > > +{ > > > +NvmeCtrl *n = opaque; > > > +NvmeAsyncEvent *event, *next; > > > + > > > +trace_pci_nvme_process_aers(n->aer_queued); > > > + > > > +QTAILQ_FOREACH_SAFE(event, >aer_queue, entry, next) { > > > +NvmeRequest *req; > > > +NvmeAerResult *result; > > > + > > > +/* can't post cqe if there is nothing to complete */ > > > +if (!n->outstanding_aers) { > > > +trace_pci_nvme_no_outstanding_aers(); > > > +break; > > > +} > > > + > > > +/* ignore if masked (cqe posted, but event not cleared) */ > > > +if (n->aer_mask & (1 << event->result.event_type)) { > > > 
+trace_pci_nvme_aer_masked(event->result.event_type, > > > n->aer_mask); > > > +continue; > > > +} > > > + > > > +QTAILQ_REMOVE(>aer_queue, event, entry); > > > +n->aer_queued--; > > > + > > > +n->aer_mask |= 1 << event->result.event_type; > > > +n->outstanding_aers--; > > > + > > > +req = n->aer_reqs[n->outstanding_aers]; > > > + > > > +result = (NvmeAerResult *) >cqe.result; > > > +result->event_type = event->result.event_type; > > > +result->event_info = event->result.event_info; > > > +result->log_page = event->result.log_page; > > > +g_free(event); > > > + > > > +req->status = NVME_SUCCESS; > > > + > > > +trace_pci_nvme_aer_post_cqe(result->event_type, > > > result->event_info, > > > +result->log_page); > > > + > > > +nvme_enqueue_req_completion(>admin_cq, req); > > > +} > > > +} > > > + > > > +static void nvme_enqueue_event(NvmeCtrl *n, uint8_t event_type, > > > + uint8_t event_info, uint8_t log_page) > > > +{ > > > +NvmeAsyncEvent *event; > > > + > > > +trace_pci_nvme_enqueue_event(event_type, event_info, log_page); > > > + > > > +if (n->aer_queued == n->params.aer_max_queued) { > > > +trace_pci_nvme_enqueue_event_noqueue(n->aer_queued); > > > +return; > > > +} > > > + > > > +event = g_new(NvmeAsyncEvent, 1); > > > +event->result = (NvmeAerResult) { > > > +.event_type = event_type, > > > +.event_info = event_info, > > > +.log_page = log_page, > > > +}; > > > + > > > +QTAILQ_INSERT_TAIL(>aer_queue, event, entry); > > > +n->aer_queued++; > > > + > > > +nvme_process_aers(n); > > > +} > > > + > > > +static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type) > > > +{ > > > +n->aer_mask &= ~(1 << event_type); > > > +if (!QTAILQ_EMPTY(>aer_queue)) { > > > +nvme_process_aers(n); > > > +} > > > +} > > > + > > > static void nvme_rw_cb(void *opaque, int ret) > > > { > > > NvmeRequest *req = opaque; > > > @@ -606,8 +685,9 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd > > > *cmd) > > > return NVME_SUCCESS; > > > } > > > > > > -static uint16_t 
nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t > > > buf_len, > > > -uint64_t off, NvmeRequest *req) > > > +static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae, > > > +uint32_t buf_len, uint64_t off, > > > +NvmeRequest *req) > > > { > > > uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1); > > > uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2); > > > @@ -655,6 +735,10 @@ static uint16_t
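The "best effort" retention in the commit message is the interesting design point: the spec says the controller *should* (not *shall*) retain events, so dropping anything beyond aer_max_queued is permitted. A stripped-down model of that bound (names illustrative, not the real QEMU structures):

```c
#include <assert.h>

/* Toy model of the bounded AER event queue described above: once
 * AER_MAX_QUEUED events are pending, further events are dropped.
 * This matches the "should retain" wording in the spec. */

#define AER_MAX_QUEUED 4

typedef struct AerCtrl {
    int aer_queued;
} AerCtrl;

/* returns 1 if the event was queued, 0 if it was dropped */
static int aer_enqueue_event(AerCtrl *n)
{
    if (n->aer_queued == AER_MAX_QUEUED) {
        return 0;   /* queue full: best effort, drop the event */
    }
    n->aer_queued++;
    return 1;
}
```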
Re: [PATCH v5] hw/pci-host: save/restore pci host config register for old ones
* Michael S. Tsirkin (m...@redhat.com) wrote: > On Tue, Jul 28, 2020 at 11:27:09AM +0800, Hogan Wang wrote: > > The i440fx and q35 machines integrate i440FX or MCH PCI device by default. > > Refer to i440FX and ICH9-LPC spcifications, there are some reserved > > configuration registers can used to save/restore PCIHostState.config_reg. > > It's nasty but friendly to old ones. > > > > Reproducer steps: > > step 1. Make modifications to seabios and qemu for increase reproduction > > efficiency, write 0xf0 to 0x402 port notify qemu to stop vcpu after > > 0x0cf8 port wrote i440 configure register. qemu stop vcpu when catch > > 0x402 port wrote 0xf0. > > > > seabios:/src/hw/pci.c > > @@ -52,6 +52,11 @@ void pci_config_writeb(u16 bdf, u32 addr, u8 val) > > writeb(mmconfig_addr(bdf, addr), val); > > } else { > > outl(ioconfig_cmd(bdf, addr), PORT_PCI_CMD); > > + if (bdf == 0 && addr == 0x72 && val == 0xa) { > > +dprintf(1, "stop vcpu\n"); > > +outb(0xf0, 0x402); // notify qemu to stop vcpu > > +dprintf(1, "resume vcpu\n"); > > +} > > outb(val, PORT_PCI_DATA + (addr & 3)); > > } > > } > > > > qemu:hw/char/debugcon.c > > @@ -60,6 +61,9 @@ static void debugcon_ioport_write(void *opaque, hwaddr > > addr, uint64_t val, > > printf(" [debugcon: write addr=0x%04" HWADDR_PRIx " val=0x%02" PRIx64 > > "]\n", addr, val); > > #endif > > > > +if (ch == 0xf0) { > > +vm_stop(RUN_STATE_PAUSED); > > +} > > /* XXX this blocks entire thread. Rewrite to use > > * qemu_chr_fe_write and background I/O callbacks */ > > qemu_chr_fe_write_all(>chr, , 1); > > > > step 2. start vm1 by the following command line, and then vm stopped. 
> > $ qemu-system-x86_64 -machine pc-i440fx-5.0,accel=kvm\ > > -netdev tap,ifname=tap-test,id=hostnet0,vhost=on,downscript=no,script=no\ > > -device > > virtio-net-pci,netdev=hostnet0,id=net0,bus=pci.0,addr=0x13,bootindex=3\ > > -device cirrus-vga,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2\ > > -chardev file,id=seabios,path=/var/log/test.seabios,append=on\ > > -device isa-debugcon,iobase=0x402,chardev=seabios\ > > -monitor stdio > > > > step 3. start vm2 to accept vm1 state. > > $ qemu-system-x86_64 -machine pc-i440fx-5.0,accel=kvm\ > > -netdev tap,ifname=tap-test1,id=hostnet0,vhost=on,downscript=no,script=no\ > > -device > > virtio-net-pci,netdev=hostnet0,id=net0,bus=pci.0,addr=0x13,bootindex=3\ > > -device cirrus-vga,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2\ > > -chardev file,id=seabios,path=/var/log/test.seabios,append=on\ > > -device isa-debugcon,iobase=0x402,chardev=seabios\ > > -monitor stdio \ > > -incoming tcp:127.0.0.1:8000 > > > > step 4. execute the following qmp command in vm1 to migrate. > > (qemu) migrate tcp:127.0.0.1:8000 > > > > step 5. execute the following qmp command in vm2 to resume vcpu. > > (qemu) cont > > > > Before this patch, we get KVM "emulation failure" error on vm2. > > This patch fixes it. > > > > Signed-off-by: Hogan Wang > > --- > > hw/pci-host/i440fx.c | 46 > > hw/pci-host/q35.c| 44 ++ > > 2 files changed, 90 insertions(+) > > > > diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c > > index 8ed2417f0c..419e27c21a 100644 > > --- a/hw/pci-host/i440fx.c > > +++ b/hw/pci-host/i440fx.c > > @@ -64,6 +64,14 @@ typedef struct I440FXState { > > */ > > #define I440FX_COREBOOT_RAM_SIZE 0x57 > > > > +/* Older I440FX machines (5.0 and older) do not support i440FX-pcihost > > state > > + * migration, use some reserved INTEL 82441 configuration registers to > > + * save/restore i440FX-pcihost config register. 
Refer to [INTEL 440FX > > PCISET > > + * 82441FX PCI AND MEMORY CONTROLLER (PMC) AND 82442FX DATA BUS ACCELERATOR > > + * (DBX) Table 1. PMC Configuration Space] > > + */ > > +#define I440FX_PCI_HOST_CONFIG_REG 0x94 > > + > > static void i440fx_update_memory_mappings(PCII440FXState *d) > > { > > int i; > > @@ -98,15 +106,53 @@ static void i440fx_write_config(PCIDevice *dev, > > static int i440fx_post_load(void *opaque, int version_id) > > { > > PCII440FXState *d = opaque; > > +PCIDevice *dev; > > +PCIHostState *s = OBJECT_CHECK(PCIHostState, > > + object_resolve_path("/machine/i440fx", > > NULL), > > + TYPE_PCI_HOST_BRIDGE); > > > > i440fx_update_memory_mappings(d); > > + > > +if (!s->mig_enabled) { > > Thinking more about it, I think we should rename mig_enabled to > config_reg_mig_enabled or something like this. Agreed. Dave > > > +dev = PCI_DEVICE(d); > > +s->config_reg = > > pci_get_long(>config[I440FX_PCI_HOST_CONFIG_REG]); > > +pci_set_long(>config[I440FX_PCI_HOST_CONFIG_REG], 0); > > +} > > +return 0; > > +} > > + > > +static int i440fx_pre_save(void *opaque) > > +{ > > +PCIDevice *dev
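The trick in the patch reduces to a pre_save/post_load pair: stash PCIHostState.config_reg in a reserved config-space register before migration, then pull it back out (and clear the register) after load. The sketch below uses a plain memcpy instead of QEMU's pci_get_long/pci_set_long, so it ignores endianness; the 0x94 offset follows the patch, everything else is illustrative:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of the stashing trick: when the machine type is
 * too old to carry a proper vmstate field, the host bridge's 0xcf8
 * latch (config_reg) is smuggled through a reserved PCI config-space
 * register so it survives migration.  Simplified, host-endian only. */

#define CFG_STASH_OFF 0x94          /* reserved register, as in the patch */

typedef struct Host {
    uint8_t  config[256];           /* PCI config space */
    uint32_t config_reg;            /* 0xcf8 latch to preserve */
} Host;

static void stash_pre_save(Host *s)
{
    memcpy(&s->config[CFG_STASH_OFF], &s->config_reg, 4);
}

static void stash_post_load(Host *s)
{
    memcpy(&s->config_reg, &s->config[CFG_STASH_OFF], 4);
    memset(&s->config[CFG_STASH_OFF], 0, 4);   /* restore reserved reg */
}
```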
Re: [PATCH 05/16] hw/block/nvme: refactor dma read/write
On Jul 29 20:35, Maxim Levitsky wrote: > On Mon, 2020-07-20 at 13:37 +0200, Klaus Jensen wrote: > > From: Klaus Jensen > > > > Refactor the nvme_dma_{read,write}_prp functions into a common function > > taking a DMADirection parameter. > > > > Signed-off-by: Klaus Jensen > > Reviewed-by: Maxim Levitsky > > --- > > hw/block/nvme.c | 88 - > > 1 file changed, 43 insertions(+), 45 deletions(-) > > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > > index 6a1a1626b87b..d314a604db81 100644 > > --- a/hw/block/nvme.c > > +++ b/hw/block/nvme.c > > @@ -361,55 +361,50 @@ unmap: > > return status; > > } > > > > -static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len, > > - uint64_t prp1, uint64_t prp2) > > +static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len, > > + uint64_t prp1, uint64_t prp2, DMADirection > > dir) > > { > > QEMUSGList qsg; > > QEMUIOVector iov; > > uint16_t status = NVME_SUCCESS; > > > > -if (nvme_map_prp(, , prp1, prp2, len, n)) { > > -return NVME_INVALID_FIELD | NVME_DNR; > > +status = nvme_map_prp(, , prp1, prp2, len, n); > > +if (status) { > > +return status; > > } > > + > > if (qsg.nsg > 0) { > > -if (dma_buf_write(ptr, len, )) { > > -status = NVME_INVALID_FIELD | NVME_DNR; > > +uint64_t residual; > > + > > +if (dir == DMA_DIRECTION_TO_DEVICE) { > > +residual = dma_buf_write(ptr, len, ); > > +} else { > > +residual = dma_buf_read(ptr, len, ); > > } > > -qemu_sglist_destroy(); > > -} else { > > -if (qemu_iovec_to_buf(, 0, ptr, len) != len) { > > -status = NVME_INVALID_FIELD | NVME_DNR; > > -} > > -qemu_iovec_destroy(); > > -} > > -return status; > > -} > > > > -static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len, > > -uint64_t prp1, uint64_t prp2) > > -{ > > -QEMUSGList qsg; > > -QEMUIOVector iov; > > -uint16_t status = NVME_SUCCESS; > > - > > -trace_pci_nvme_dma_read(prp1, prp2); > > - > > -if (nvme_map_prp(, , prp1, prp2, len, n)) { > > -return NVME_INVALID_FIELD | NVME_DNR; > > -} > > -if 
(qsg.nsg > 0) { > > -if (unlikely(dma_buf_read(ptr, len, &qsg))) { > > +if (unlikely(residual)) { > > trace_pci_nvme_err_invalid_dma(); > > status = NVME_INVALID_FIELD | NVME_DNR; > > } > > + > > qemu_sglist_destroy(&qsg); > > } else { > > -if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) { > > +size_t bytes; > > + > > +if (dir == DMA_DIRECTION_TO_DEVICE) { > > +bytes = qemu_iovec_to_buf(&iov, 0, ptr, len); > > +} else { > > +bytes = qemu_iovec_from_buf(&iov, 0, ptr, len); > > +} > > + > > +if (unlikely(bytes != len)) { > > trace_pci_nvme_err_invalid_dma(); > > status = NVME_INVALID_FIELD | NVME_DNR; > > } > > + > > qemu_iovec_destroy(&iov); > > } > > + > I know I reviewed this, but thinking now, why not add an assert here > that we don't have both iov and qsg with data? > Good point. I added it after the nvme_map_prp call.
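The shape of the refactor is easy to model: one helper takes a direction flag and returns the residual byte count, so the read and write paths share all the bounds handling. This stand-in copies against a fake device buffer rather than using QEMU's dma_buf_read/dma_buf_write; the names are illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal model of a direction-parameterized DMA helper.  The real code
 * drives a QEMUSGList; this sketch just copies between a buffer and a
 * fake "device memory" region and reports the residual, mirroring the
 * return convention of dma_buf_read/dma_buf_write. */

typedef enum { DMA_TO_DEVICE, DMA_FROM_DEVICE } DmaDir;

static size_t dma_copy(uint8_t *dev_mem, uint8_t *buf, size_t len,
                       size_t dev_capacity, DmaDir dir)
{
    size_t n = len < dev_capacity ? len : dev_capacity;

    if (dir == DMA_TO_DEVICE) {
        memcpy(dev_mem, buf, n);
    } else {
        memcpy(buf, dev_mem, n);
    }

    return len - n;   /* residual: bytes that could not be transferred */
}
```

A non-zero residual is what the refactored function maps to NVME_INVALID_FIELD, in both directions, with a single check.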
Re: [PATCH v3 07/18] hw/block/nvme: add support for the get log page command
On Wed, 2020-07-29 at 13:44 +0200, Klaus Jensen wrote: > On Jul 29 13:24, Maxim Levitsky wrote: > > On Mon, 2020-07-06 at 08:12 +0200, Klaus Jensen wrote: > > > From: Klaus Jensen > > > > > > Add support for the Get Log Page command and basic implementations of > > > the mandatory Error Information, SMART / Health Information and Firmware > > > Slot Information log pages. > > > > > > In violation of the specification, the SMART / Health Information log > > > page does not persist information over the lifetime of the controller > > > because the device has no place to store such persistent state. > > > > > > Note that the LPA field in the Identify Controller data structure > > > intentionally has bit 0 cleared because there is no namespace specific > > > information in the SMART / Health information log page. > > > > > > Required for compliance with NVMe revision 1.3d. See NVM Express 1.3d, > > > Section 5.14 ("Get Log Page command"). > > > > > > Signed-off-by: Klaus Jensen > > > Signed-off-by: Klaus Jensen > > > Acked-by: Keith Busch > > > --- > > > hw/block/nvme.c | 140 +- > > > hw/block/nvme.h | 2 + > > > hw/block/trace-events | 2 + > > > include/block/nvme.h | 8 ++- > > > 4 files changed, 149 insertions(+), 3 deletions(-) > > > > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > > > index b6bc75eb61a2..7cb3787638f6 100644 > > > --- a/hw/block/nvme.c > > > +++ b/hw/block/nvme.c > > > @@ -606,6 +606,140 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd > > > *cmd) > > > return NVME_SUCCESS; > > > } > > > > > > +static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t > > > buf_len, > > > +uint64_t off, NvmeRequest *req) > > > +{ > > > +uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1); > > > +uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2); > > > +uint32_t nsid = le32_to_cpu(cmd->nsid); > > > + > > > +uint32_t trans_len; > > > +time_t current_ms; > > > +uint64_t units_read = 0, units_written = 0; > > > +uint64_t read_commands = 0, write_commands = 0; > > 
> +NvmeSmartLog smart; > > > +BlockAcctStats *s; > > > + > > > +if (nsid && nsid != 0x) { > > > +return NVME_INVALID_FIELD | NVME_DNR; > > > +} > > Correct. > > > + > > > +s = blk_get_stats(n->conf.blk); > > > + > > > +units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS; > > > +units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS; > > > +read_commands = s->nr_ops[BLOCK_ACCT_READ]; > > > +write_commands = s->nr_ops[BLOCK_ACCT_WRITE]; > > > + > > > +if (off > sizeof(smart)) { > > > +return NVME_INVALID_FIELD | NVME_DNR; > > > +} > > > + > > > +trans_len = MIN(sizeof(smart) - off, buf_len); > > > + > > > +memset(, 0x0, sizeof(smart)); > > > + > > > +smart.data_units_read[0] = cpu_to_le64(units_read / 1000); > > > +smart.data_units_written[0] = cpu_to_le64(units_written / 1000); > > Tiny nitpick - the spec asks the value to be rounded up > > > > Ouch. You are correct. I'll swap that for a DIV_ROUND_UP. Not a big deal though as these values don't matter much to anybody since we don't have way of storing them permanently. > > > > +static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) > > > +{ > > > +uint32_t dw10 = le32_to_cpu(cmd->cdw10); > > > +uint32_t dw11 = le32_to_cpu(cmd->cdw11); > > > +uint32_t dw12 = le32_to_cpu(cmd->cdw12); > > > +uint32_t dw13 = le32_to_cpu(cmd->cdw13); > > > +uint8_t lid = dw10 & 0xff; > > > +uint8_t lsp = (dw10 >> 8) & 0xf; > > > +uint8_t rae = (dw10 >> 15) & 0x1; > > > +uint32_t numdl, numdu; > > > +uint64_t off, lpol, lpou; > > > +size_t len; > > > + > > Nitpick: don't we want to check NSID=0 || NSID=0x here too? > > > > The spec lists Get Log Page with "Yes" under "Namespace Identifier Used" > but no log pages in v1.3 or v1.4 are namespace specific so we expect > NSID to always be 0 or 0x. But, there are TPs that have > namespace specific log pages (i.e. TP 4053 Zoned Namepaces). So, it is > not invalid to have NSID set to something. 
> > So, I think we have to defer handling of NSID values to the individual > log pages (like we do for the SMART page). Ah, OK. > Best regards, Maxim Levitsky
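The rounding nitpick is worth making concrete. The SMART data-units counters are expressed in thousands of 512-byte units and the spec asks for the value to be rounded up, so plain integer division (which truncates) undercounts; QEMU's DIV_ROUND_UP macro, reproduced here, gives the required behavior:

```c
#include <assert.h>
#include <stdint.h>

/* DIV_ROUND_UP as defined in QEMU's osdep.h */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* data units = thousands of 512-byte units, rounded up per the spec */
static uint64_t data_units(uint64_t blocks_512)
{
    return DIV_ROUND_UP(blocks_512, 1000);
}
```

With truncating division, a single 512-byte read would report zero data units; the round-up makes it report one.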
Re: [PATCH 07/16] hw/block/nvme: add request mapping helper
On Thu, 2020-07-30 at 00:52 +0900, Minwoo Im wrote: > Klaus, > > On 20-07-20 13:37:39, Klaus Jensen wrote: > > From: Klaus Jensen > > > > Introduce the nvme_map helper to remove some noise in the main nvme_rw > > function. > > > > Signed-off-by: Klaus Jensen > > Reviewed-by: Maxim Levitsky > > --- > > hw/block/nvme.c | 13 ++--- > > 1 file changed, 10 insertions(+), 3 deletions(-) > > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > > index f1e04608804b..68c33a11c144 100644 > > --- a/hw/block/nvme.c > > +++ b/hw/block/nvme.c > > @@ -413,6 +413,15 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t > > *ptr, uint32_t len, > > return status; > > } > > > > +static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, size_t len, > > + NvmeRequest *req) > > Can we specify what is going to be mapped in this function? like > nvme_map_dptr? I also once complained about the name, and I do like this idea! Best regards, Maxim Levitsky > > Thanks, >
Re: [PATCH 04/16] hw/block/nvme: remove redundant has_sg member
On Jul 30 00:29, Minwoo Im wrote: > Klaus, > Hi Minwoo, Thanks for the reviews and welcome to the party! :) > On 20-07-20 13:37:36, Klaus Jensen wrote: > > From: Klaus Jensen > > > > Remove the has_sg member from NvmeRequest since it's redundant. > > > > Also, make sure the request iov is destroyed at completion time. > > > > Signed-off-by: Klaus Jensen > > Reviewed-by: Maxim Levitsky > > --- > > hw/block/nvme.c | 11 ++- > > hw/block/nvme.h | 1 - > > 2 files changed, 6 insertions(+), 6 deletions(-) > > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > > index cb236d1c8c46..6a1a1626b87b 100644 > > --- a/hw/block/nvme.c > > +++ b/hw/block/nvme.c > > @@ -548,16 +548,20 @@ static void nvme_rw_cb(void *opaque, int ret) > > block_acct_failed(blk_get_stats(n->conf.blk), &req->acct); > > req->status = NVME_INTERNAL_DEV_ERROR; > > } > > -if (req->has_sg) { > > + > > +if (req->qsg.nalloc) { > > Personally, I prefer has_xxx or is_xxx to check whether the request is > based on sg or iov as an inline function, but 'nalloc' is also fine to > figure out the purpose here. > What I really want to do is get rid of this duality with qsg and iovs at some point. I kinda wanna get rid of the dma helpers and the qsg entirely and do the DMA handling directly. Maybe an `int flags` member in NvmeRequest would be better for this, such as NVME_REQ_DMA etc. > > qemu_sglist_destroy(&req->qsg); > > } > > +if (req->iov.nalloc) { > > +qemu_iovec_destroy(&req->iov); > > +} > > + > > Maybe this can be in a separate commit? > Yeah. I guess whenever a commit message includes "Also, ..." you really should factor the change out ;) I'll split it. > Otherwise, it looks good to me. > > Thanks, > Does that mean I can add your R-b? :)
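The `int flags` idea floated above could look something like this. The flag names are hypothetical (only NVME_REQ_DMA was actually mentioned), but they would replace the implicit qsg.nalloc / iov.nalloc probing with an explicit tag on the request:

```c
#include <assert.h>

/* Hypothetical request flags, sketching the idea from the reply above:
 * tag how the request was mapped instead of inspecting qsg/iov state. */

enum {
    NVME_REQ_FLAGS_DMA = 1 << 0,   /* mapped through a QEMUSGList */
    NVME_REQ_FLAGS_CMB = 1 << 1,   /* mapped via an iovec into the CMB */
};

typedef struct Req {
    int flags;
} Req;

static int req_is_dma(const Req *req)
{
    return (req->flags & NVME_REQ_FLAGS_DMA) != 0;
}
```

The completion path then branches on the flag, and the flag is cleared together with the rest of the request state when the request is recycled.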
Re: [PATCH 15/16] hw/block/nvme: remove NvmeCmd parameter
On Mon, 2020-07-20 at 13:37 +0200, Klaus Jensen wrote: > From: Klaus Jensen > > Keep a copy of the raw nvme command in the NvmeRequest and remove the > now redundant NvmeCmd parameter. Shouldn't you clear the req->cmd in nvme_req_clear too for consistency? Other than that looks OK, but I might have missed something. Best regards, Maxim Levitsky > > Signed-off-by: Klaus Jensen > --- > hw/block/nvme.c | 177 +--- > hw/block/nvme.h | 1 + > 2 files changed, 93 insertions(+), 85 deletions(-) > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > index b53afdeb3fb6..0b3dceccc89b 100644 > --- a/hw/block/nvme.c > +++ b/hw/block/nvme.c > @@ -425,9 +425,9 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, > uint32_t len, > return status; > } > > -static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, size_t len, > - NvmeRequest *req) > +static uint16_t nvme_map(NvmeCtrl *n, size_t len, NvmeRequest *req) > { > +NvmeCmd *cmd = >cmd; > uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1); > uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2); > > @@ -597,7 +597,7 @@ static void nvme_rw_cb(void *opaque, int ret) > nvme_enqueue_req_completion(cq, req); > } > > -static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) > +static uint16_t nvme_flush(NvmeCtrl *n, NvmeRequest *req) > { > block_acct_start(blk_get_stats(n->conf.blk), >acct, 0, > BLOCK_ACCT_FLUSH); > @@ -606,9 +606,9 @@ static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, > NvmeRequest *req) > return NVME_NO_COMPLETE; > } > > -static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest > *req) > +static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req) > { > -NvmeRwCmd *rw = (NvmeRwCmd *)cmd; > +NvmeRwCmd *rw = (NvmeRwCmd *)>cmd; > NvmeNamespace *ns = req->ns; > const uint8_t lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas); > const uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds; > @@ -633,9 +633,9 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeCmd > *cmd, NvmeRequest *req) > return 
NVME_NO_COMPLETE; > } > > -static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) > +static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req) > { > -NvmeRwCmd *rw = (NvmeRwCmd *)cmd; > +NvmeRwCmd *rw = (NvmeRwCmd *)>cmd; > NvmeNamespace *ns = req->ns; > uint32_t nlb = le32_to_cpu(rw->nlb) + 1; > uint64_t slba = le64_to_cpu(rw->slba); > @@ -664,7 +664,7 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, > NvmeRequest *req) > return status; > } > > -if (nvme_map(n, cmd, data_size, req)) { > +if (nvme_map(n, data_size, req)) { > block_acct_invalid(blk_get_stats(n->conf.blk), acct); > return NVME_INVALID_FIELD | NVME_DNR; > } > @@ -690,11 +690,12 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, > NvmeRequest *req) > return NVME_NO_COMPLETE; > } > > -static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) > +static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req) > { > -uint32_t nsid = le32_to_cpu(cmd->nsid); > +uint32_t nsid = le32_to_cpu(req->cmd.nsid); > > -trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req), cmd->opcode); > +trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req), > + req->cmd.opcode); > > if (unlikely(nsid == 0 || nsid > n->num_namespaces)) { > trace_pci_nvme_err_invalid_ns(nsid, n->num_namespaces); > @@ -702,16 +703,16 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, > NvmeRequest *req) > } > > req->ns = >namespaces[nsid - 1]; > -switch (cmd->opcode) { > +switch (req->cmd.opcode) { > case NVME_CMD_FLUSH: > -return nvme_flush(n, cmd, req); > +return nvme_flush(n, req); > case NVME_CMD_WRITE_ZEROES: > -return nvme_write_zeroes(n, cmd, req); > +return nvme_write_zeroes(n, req); > case NVME_CMD_WRITE: > case NVME_CMD_READ: > -return nvme_rw(n, cmd, req); > +return nvme_rw(n, req); > default: > -trace_pci_nvme_err_invalid_opc(cmd->opcode); > +trace_pci_nvme_err_invalid_opc(req->cmd.opcode); > return NVME_INVALID_OPCODE | NVME_DNR; > } > } > @@ -727,10 +728,10 @@ static void 
nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n) > } > } > > -static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd) > +static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req) > { > -NvmeDeleteQ *c = (NvmeDeleteQ *)cmd; > -NvmeRequest *req, *next; > +NvmeDeleteQ *c = (NvmeDeleteQ *)>cmd; > +NvmeRequest *r, *next; > NvmeSQueue *sq; > NvmeCQueue *cq; > uint16_t qid = le16_to_cpu(c->qid); > @@ -744,19 +745,19 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd) > > sq = n->sq[qid]; > while (!QTAILQ_EMPTY(>out_req_list))
Re: [PATCH 02/16] hw/block/nvme: add mapping helpers
On Jul 29 16:57, Maxim Levitsky wrote: > On Mon, 2020-07-20 at 13:37 +0200, Klaus Jensen wrote: > > From: Klaus Jensen > > > > Add nvme_map_addr, nvme_map_addr_cmb and nvme_addr_to_cmb helpers and > > use them in nvme_map_prp. > > > > This fixes a bug where in the case of a CMB transfer, the device would > > map to the buffer with a wrong length. > > > > Fixes: b2b2b67a00574 ("nvme: Add support for Read Data and Write Data in > > CMBs.") > > Signed-off-by: Klaus Jensen > > --- > > hw/block/nvme.c | 109 +++--- > > hw/block/trace-events | 2 + > > 2 files changed, 94 insertions(+), 17 deletions(-) > > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > > index 4d7b730a62b6..9b1a080cdc70 100644 > > --- a/hw/block/nvme.c > > +++ b/hw/block/nvme.c > > @@ -109,6 +109,11 @@ static uint16_t nvme_sqid(NvmeRequest *req) > > return le16_to_cpu(req->sq->sqid); > > } > > > > +static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr) > > +{ > > +return &n->cmbuf[addr - n->ctrl_mem.addr]; > I would add an assert here just in case we do out of bounds array access. We never end up in nvme_addr_to_cmb without nvme_addr_is_cmb checking the bounds. But an assert can't hurt if someone decides to use it in error. I'll add it!
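The assert being discussed makes the accessor self-checking: a caller that skips the nvme_addr_is_cmb() bounds check trips immediately instead of indexing out of bounds. A simplified stand-alone version, with the QEMU types reduced to plain structs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Simplified stand-ins for the QEMU controller state: the CMB is a
 * host buffer mapped at a guest-visible base address. */
typedef struct CmbCtrl {
    uint8_t *cmbuf;     /* host backing of the CMB */
    hwaddr   cmb_addr;  /* guest-visible base of the CMB window */
    size_t   cmb_size;  /* size of the CMB window */
} CmbCtrl;

static bool nvme_addr_is_cmb(CmbCtrl *n, hwaddr addr)
{
    return addr >= n->cmb_addr && addr < n->cmb_addr + n->cmb_size;
}

static void *nvme_addr_to_cmb(CmbCtrl *n, hwaddr addr)
{
    /* the assert requested in the review: out-of-window addresses
     * fail loudly here instead of producing a wild pointer */
    assert(nvme_addr_is_cmb(n, addr));
    return &n->cmbuf[addr - n->cmb_addr];
}
```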
Re: [PATCH 1/4] hw/hppa: Sync hppa_hardware.h file with SeaBIOS sources
On 7/27/20 11:46 PM, Helge Deller wrote: > The hppa_hardware.h file is shared with SeaBIOS. Sync it. > > Signed-off-by: Helge Deller > --- > hw/hppa/hppa_hardware.h | 6 ++ > hw/hppa/lasi.c | 2 -- > 2 files changed, 6 insertions(+), 2 deletions(-) > > diff --git a/hw/hppa/hppa_hardware.h b/hw/hppa/hppa_hardware.h > index 4a2fe2df60..cdb7fa6240 100644 > --- a/hw/hppa/hppa_hardware.h > +++ b/hw/hppa/hppa_hardware.h > @@ -17,6 +17,7 @@ > #define LASI_UART_HPA 0xffd05000 > #define LASI_SCSI_HPA 0xffd06000 > #define LASI_LAN_HPA0xffd07000 > +#define LASI_RTC_HPA0xffd09000 I find the line you are removing cleaner: -#define LASI_RTC_HPA(LASI_HPA + 0x9000) "Offset in the LASI memory region". Anyway not a blocker. Having these values sorted would help. > #define LASI_LPT_HPA0xffd02000 > #define LASI_AUDIO_HPA 0xffd04000 > #define LASI_PS2KBD_HPA 0xffd08000 > @@ -37,10 +38,15 @@ > #define PORT_PCI_CMD(PCI_HPA + DINO_PCI_ADDR) > #define PORT_PCI_DATA (PCI_HPA + DINO_CONFIG_DATA) > > +/* QEMU fw_cfg interface port */ > +#define QEMU_FW_CFG_IO_BASE (MEMORY_HPA + 0x80) > + > #define PORT_SERIAL1(DINO_UART_HPA + 0x800) > #define PORT_SERIAL2(LASI_UART_HPA + 0x800) > > #define HPPA_MAX_CPUS 8 /* max. number of SMP CPUs */ > #define CPU_CLOCK_MHZ 250 /* emulate a 250 MHz CPU */ > > +#define CPU_HPA_CR_REG 7 /* store CPU HPA in cr7 (SeaBIOS internal) */ > + > #endif > diff --git a/hw/hppa/lasi.c b/hw/hppa/lasi.c > index 19974034f3..ffcbb988b8 100644 > --- a/hw/hppa/lasi.c > +++ b/hw/hppa/lasi.c > @@ -54,8 +54,6 @@ > #define LASI_CHIP(obj) \ > OBJECT_CHECK(LasiState, (obj), TYPE_LASI_CHIP) > > -#define LASI_RTC_HPA(LASI_HPA + 0x9000) > - > typedef struct LasiState { > PCIHostState parent_obj; > > -- > 2.21.3 > >
Re: [PATCH 14/16] hw/block/nvme: consolidate qsg/iov clearing
On Mon, 2020-07-20 at 13:37 +0200, Klaus Jensen wrote: > From: Klaus Jensen > > Always destroy the request qsg/iov at the end of request use. > > Signed-off-by: Klaus Jensen > --- > hw/block/nvme.c | 48 +--- > 1 file changed, 17 insertions(+), 31 deletions(-) > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > index 54cd20f1ce22..b53afdeb3fb6 100644 > --- a/hw/block/nvme.c > +++ b/hw/block/nvme.c > @@ -213,6 +213,14 @@ static void nvme_req_clear(NvmeRequest *req) > { > req->ns = NULL; > memset(>cqe, 0x0, sizeof(req->cqe)); > + > +if (req->qsg.sg) { > +qemu_sglist_destroy(>qsg); > +} > + > +if (req->iov.iov) { > +qemu_iovec_destroy(>iov); > +} > } > > static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr > addr, > @@ -297,15 +305,14 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList > *qsg, QEMUIOVector *iov, > > status = nvme_map_addr(n, qsg, iov, prp1, trans_len); > if (status) { > -goto unmap; > +return status; > } > > len -= trans_len; > if (len) { > if (unlikely(!prp2)) { > trace_pci_nvme_err_invalid_prp2_missing(); > -status = NVME_INVALID_FIELD | NVME_DNR; > -goto unmap; > +return NVME_INVALID_FIELD | NVME_DNR; > } > > if (len > n->page_size) { > @@ -326,13 +333,11 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList > *qsg, QEMUIOVector *iov, > if (i == n->max_prp_ents - 1 && len > n->page_size) { > if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) { > trace_pci_nvme_err_invalid_prplist_ent(prp_ent); > -status = NVME_INVALID_FIELD | NVME_DNR; > -goto unmap; > +return NVME_INVALID_FIELD | NVME_DNR; > } > > if (prp_list_in_cmb != nvme_addr_is_cmb(n, prp_ent)) { > -status = NVME_INVALID_USE_OF_CMB | NVME_DNR; > -goto unmap; > +return NVME_INVALID_USE_OF_CMB | NVME_DNR; > } > > i = 0; > @@ -345,14 +350,13 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList > *qsg, QEMUIOVector *iov, > > if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) { > trace_pci_nvme_err_invalid_prplist_ent(prp_ent); > -status = NVME_INVALID_FIELD 
| NVME_DNR; > -goto unmap; > +return NVME_INVALID_FIELD | NVME_DNR; > } > > trans_len = MIN(len, n->page_size); > status = nvme_map_addr(n, qsg, iov, prp_ent, trans_len); > if (status) { > -goto unmap; > +return status; > } > > len -= trans_len; > @@ -361,27 +365,16 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList > *qsg, QEMUIOVector *iov, > } else { > if (unlikely(prp2 & (n->page_size - 1))) { > trace_pci_nvme_err_invalid_prp2_align(prp2); > -status = NVME_INVALID_FIELD | NVME_DNR; > -goto unmap; > +return NVME_INVALID_FIELD | NVME_DNR; > } > status = nvme_map_addr(n, qsg, iov, prp2, len); > if (status) { > -goto unmap; > +return status; > } > } > } > + > return NVME_SUCCESS; > - > -unmap: > -if (iov && iov->iov) { > -qemu_iovec_destroy(iov); > -} > - > -if (qsg && qsg->sg) { > -qemu_sglist_destroy(qsg); > -} > - > -return status; > } > > static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len, > @@ -601,13 +594,6 @@ static void nvme_rw_cb(void *opaque, int ret) > req->status = NVME_INTERNAL_DEV_ERROR; > } > > -if (req->qsg.nalloc) { > -qemu_sglist_destroy(&req->qsg); > -} > -if (req->iov.nalloc) { > -qemu_iovec_destroy(&req->iov); > -} > - > nvme_enqueue_req_completion(cq, req); > } > This and the former patch, I guess, answer my own question about why the request is cleared after its cqe is posted. Looks reasonable. Reviewed-by: Maxim Levitsky Best regards, Maxim Levitsky
Re: [PATCH 13/16] hw/block/nvme: add a namespace reference in NvmeRequest
On Mon, 2020-07-20 at 13:37 +0200, Klaus Jensen wrote: > From: Klaus Jensen > > Instead of passing around the NvmeNamespace, add it as a member in the > NvmeRequest structure. > > Signed-off-by: Klaus Jensen > --- > hw/block/nvme.c | 21 ++--- > hw/block/nvme.h | 1 + > 2 files changed, 11 insertions(+), 11 deletions(-) > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > index 431f26c2f589..54cd20f1ce22 100644 > --- a/hw/block/nvme.c > +++ b/hw/block/nvme.c > @@ -211,6 +211,7 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq) > > static void nvme_req_clear(NvmeRequest *req) > { > +req->ns = NULL; > memset(>cqe, 0x0, sizeof(req->cqe)); > } > > @@ -610,8 +611,7 @@ static void nvme_rw_cb(void *opaque, int ret) > nvme_enqueue_req_completion(cq, req); > } > > -static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, > -NvmeRequest *req) > +static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) > { > block_acct_start(blk_get_stats(n->conf.blk), >acct, 0, > BLOCK_ACCT_FLUSH); > @@ -620,10 +620,10 @@ static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace > *ns, NvmeCmd *cmd, > return NVME_NO_COMPLETE; > } > > -static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd > *cmd, > -NvmeRequest *req) > +static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest > *req) > { > NvmeRwCmd *rw = (NvmeRwCmd *)cmd; > +NvmeNamespace *ns = req->ns; > const uint8_t lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas); > const uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds; > uint64_t slba = le64_to_cpu(rw->slba); > @@ -647,10 +647,10 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, > NvmeNamespace *ns, NvmeCmd *cmd, > return NVME_NO_COMPLETE; > } > > -static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd, > -NvmeRequest *req) > +static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) > { > NvmeRwCmd *rw = (NvmeRwCmd *)cmd; > +NvmeNamespace *ns = req->ns; > uint32_t nlb = 
le32_to_cpu(rw->nlb) + 1; > uint64_t slba = le64_to_cpu(rw->slba); > > @@ -706,7 +706,6 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, > NvmeCmd *cmd, > > static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) > { > -NvmeNamespace *ns; > uint32_t nsid = le32_to_cpu(cmd->nsid); > > trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req), cmd->opcode); > @@ -716,15 +715,15 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, > NvmeRequest *req) > return NVME_INVALID_NSID | NVME_DNR; > } > > -ns = >namespaces[nsid - 1]; > +req->ns = >namespaces[nsid - 1]; > switch (cmd->opcode) { > case NVME_CMD_FLUSH: > -return nvme_flush(n, ns, cmd, req); > +return nvme_flush(n, cmd, req); > case NVME_CMD_WRITE_ZEROES: > -return nvme_write_zeroes(n, ns, cmd, req); > +return nvme_write_zeroes(n, cmd, req); > case NVME_CMD_WRITE: > case NVME_CMD_READ: > -return nvme_rw(n, ns, cmd, req); > +return nvme_rw(n, cmd, req); > default: > trace_pci_nvme_err_invalid_opc(cmd->opcode); > return NVME_INVALID_OPCODE | NVME_DNR; > diff --git a/hw/block/nvme.h b/hw/block/nvme.h > index 137cd8c2bf20..586fd3d62700 100644 > --- a/hw/block/nvme.h > +++ b/hw/block/nvme.h > @@ -21,6 +21,7 @@ typedef struct NvmeAsyncEvent { > > typedef struct NvmeRequest { > struct NvmeSQueue *sq; > +struct NvmeNamespace*ns; > BlockAIOCB *aiocb; > uint16_tstatus; > NvmeCqe cqe; Reviewed-by: Maxim Levitsky Best regards, Maxim Levitsky
Re: [PATCH 12/16] hw/block/nvme: refactor NvmeRequest clearing
On Mon, 2020-07-20 at 13:37 +0200, Klaus Jensen wrote: > From: Klaus Jensen > > Move clearing of the structure from "clear before use" to "clear > after use". > > Signed-off-by: Klaus Jensen > --- > hw/block/nvme.c | 7 ++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > index e2932239c661..431f26c2f589 100644 > --- a/hw/block/nvme.c > +++ b/hw/block/nvme.c > @@ -209,6 +209,11 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue > *cq) > } > } > > +static void nvme_req_clear(NvmeRequest *req) > +{ > +memset(&req->cqe, 0x0, sizeof(req->cqe)); > +} > + > static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr > addr, >size_t len) > { > @@ -458,6 +463,7 @@ static void nvme_post_cqes(void *opaque) > nvme_inc_cq_tail(cq); > pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe, > sizeof(req->cqe)); > +nvme_req_clear(req); Don't we need some barrier here to avoid reordering the writes? pci_dma_write does seem to include a barrier prior to the write it does, but not afterward. Also, what is the motivation for switching the order? I somewhat think it is a good thing to clear a buffer before it is set up. > QTAILQ_INSERT_TAIL(&sq->req_list, req, entry); > } > if (cq->tail != cq->head) { > @@ -1601,7 +1607,6 @@ static void nvme_process_sq(void *opaque) > req = QTAILQ_FIRST(&sq->req_list); > QTAILQ_REMOVE(&sq->req_list, req, entry); > QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry); > -memset(&req->cqe, 0, sizeof(req->cqe)); > req->cqe.cid = cmd.cid; > > status = sq->sqid ? nvme_io_cmd(n, &cmd, req) : Best regards, Maxim Levitsky
Re: [PATCH 11/16] hw/block/nvme: be consistent about zeros vs zeroes
On Mon, 2020-07-20 at 13:37 +0200, Klaus Jensen wrote: > From: Klaus Jensen > > The NVM Express specification generally uses 'zeroes' and not 'zeros', > so let us align with it. > > Cc: Fam Zheng > Signed-off-by: Klaus Jensen > --- > block/nvme.c | 4 ++-- > hw/block/nvme.c | 8 > include/block/nvme.h | 4 ++-- > 3 files changed, 8 insertions(+), 8 deletions(-) > > diff --git a/block/nvme.c b/block/nvme.c > index c1c4c07ac6cc..05485fdd1189 100644 > --- a/block/nvme.c > +++ b/block/nvme.c > @@ -537,7 +537,7 @@ static void nvme_identify(BlockDriverState *bs, int > namespace, Error **errp) >s->page_size / sizeof(uint64_t) * s->page_size); > > oncs = le16_to_cpu(idctrl->oncs); > -s->supports_write_zeroes = !!(oncs & NVME_ONCS_WRITE_ZEROS); > +s->supports_write_zeroes = !!(oncs & NVME_ONCS_WRITE_ZEROES); > s->supports_discard = !!(oncs & NVME_ONCS_DSM); > > memset(resp, 0, 4096); > @@ -1201,7 +1201,7 @@ static coroutine_fn int > nvme_co_pwrite_zeroes(BlockDriverState *bs, > } > > NvmeCmd cmd = { > -.opcode = NVME_CMD_WRITE_ZEROS, > +.opcode = NVME_CMD_WRITE_ZEROES, > .nsid = cpu_to_le32(s->nsid), > .cdw10 = cpu_to_le32((offset >> s->blkshift) & 0x), > .cdw11 = cpu_to_le32(((offset >> s->blkshift) >> 32) & 0x), > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > index 10fe53873ae9..e2932239c661 100644 > --- a/hw/block/nvme.c > +++ b/hw/block/nvme.c > @@ -614,7 +614,7 @@ static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace > *ns, NvmeCmd *cmd, > return NVME_NO_COMPLETE; > } > > -static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd > *cmd, > +static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd > *cmd, > NvmeRequest *req) > { > NvmeRwCmd *rw = (NvmeRwCmd *)cmd; > @@ -714,8 +714,8 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, > NvmeRequest *req) > switch (cmd->opcode) { > case NVME_CMD_FLUSH: > return nvme_flush(n, ns, cmd, req); > -case NVME_CMD_WRITE_ZEROS: > -return nvme_write_zeros(n, ns, cmd, req); > +case 
NVME_CMD_WRITE_ZEROES: > +return nvme_write_zeroes(n, ns, cmd, req); > case NVME_CMD_WRITE: > case NVME_CMD_READ: > return nvme_rw(n, ns, cmd, req); > @@ -2328,7 +2328,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice > *pci_dev) > id->sqes = (0x6 << 4) | 0x6; > id->cqes = (0x4 << 4) | 0x4; > id->nn = cpu_to_le32(n->num_namespaces); > -id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP | > +id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROES | NVME_ONCS_TIMESTAMP | > NVME_ONCS_FEATURES); > > subnqn = g_strdup_printf("nqn.2019-08.org.qemu:%s", n->params.serial); > diff --git a/include/block/nvme.h b/include/block/nvme.h > index 370df7fc0570..65e68a82c897 100644 > --- a/include/block/nvme.h > +++ b/include/block/nvme.h > @@ -460,7 +460,7 @@ enum NvmeIoCommands { > NVME_CMD_READ = 0x02, > NVME_CMD_WRITE_UNCOR= 0x04, > NVME_CMD_COMPARE= 0x05, > -NVME_CMD_WRITE_ZEROS= 0x08, > +NVME_CMD_WRITE_ZEROES = 0x08, > NVME_CMD_DSM= 0x09, > }; > > @@ -838,7 +838,7 @@ enum NvmeIdCtrlOncs { > NVME_ONCS_COMPARE = 1 << 0, > NVME_ONCS_WRITE_UNCORR = 1 << 1, > NVME_ONCS_DSM = 1 << 2, > -NVME_ONCS_WRITE_ZEROS = 1 << 3, > +NVME_ONCS_WRITE_ZEROES = 1 << 3, > NVME_ONCS_FEATURES = 1 << 4, > NVME_ONCS_RESRVATIONS = 1 << 5, > NVME_ONCS_TIMESTAMP = 1 << 6, Nothing against this. Reviewed-by: Maxim Levitsky Best regards, Maxim Levitsky
Re: [PATCH v8 3/7] 9pfs: split out fs driver core of v9fs_co_readdir()
On Wednesday, 29 July 2020 18:02:56 CEST Greg Kurz wrote: > On Wed, 29 Jul 2020 10:11:54 +0200 > > Christian Schoenebeck wrote: > > The implementation of v9fs_co_readdir() has two parts: the outer > > part is executed by the main I/O thread, whereas the inner part is > > executed by the fs driver on a background I/O thread. > > > > Move the inner part to its own new, private function do_readdir(), > > so it can be shared by another upcoming new function. > > > > This is just a preparatory patch for the subsequent patch, with the > > purpose of keeping the next patch from cluttering the overall diff. > > > > Signed-off-by: Christian Schoenebeck > > --- > > > > hw/9pfs/codir.c | 37 +++-- > > 1 file changed, 23 insertions(+), 14 deletions(-) > > > > diff --git a/hw/9pfs/codir.c b/hw/9pfs/codir.c > > index 73f9a751e1..ff57fb8619 100644 > > --- a/hw/9pfs/codir.c > > +++ b/hw/9pfs/codir.c > > @@ -18,28 +18,37 @@ > > > > #include "qemu/main-loop.h" > > #include "coth.h" > > > > +/* > > + * This must solely be executed on a background IO thread. > > + */ > > Well, technically this function could be called from any context > but of course calling it from the main I/O thread when handling > T_readdir would make the request synchronous, which is certainly > not what we want. So I'm not sure this comment brings much. Yeah, the intention was to more clearly separate the functions' intended usage context, either being TH or rather BH focused, by sticking appropriate human-readable API comments to them. But you are right, the TH functions are more limited in this regard, as they usually contain a co-routine dispatch call, whereas BH functions usually preserve a more flexible/universal nature. So maybe rather: /* * Intended to be called from bottom-half (e.g. background I/O thread) * context. */ If in doubt, I can also just drop the comment, as the function is really quite simple. 
> Anyway, the code change is okay so: > > Reviewed-by: Greg Kurz > > > +static int do_readdir(V9fsPDU *pdu, V9fsFidState *fidp, struct dirent > > **dent) +{ > > +int err = 0; > > +V9fsState *s = pdu->s; > > +struct dirent *entry; > > + > > +errno = 0; > > +entry = s->ops->readdir(>ctx, >fs); > > +if (!entry && errno) { > > +*dent = NULL; > > +err = -errno; > > +} else { > > +*dent = entry; > > +} > > +return err; > > +} > > + > > > > int coroutine_fn v9fs_co_readdir(V9fsPDU *pdu, V9fsFidState *fidp, > > > > struct dirent **dent) > > > > { > > > > int err; > > > > -V9fsState *s = pdu->s; > > > > if (v9fs_request_cancelled(pdu)) { > > > > return -EINTR; > > > > } > > > > -v9fs_co_run_in_worker( > > -{ > > -struct dirent *entry; > > - > > -errno = 0; > > -entry = s->ops->readdir(>ctx, >fs); > > -if (!entry && errno) { > > -err = -errno; > > -} else { > > -*dent = entry; > > -err = 0; > > -} > > -}); > > +v9fs_co_run_in_worker({ > > +err = do_readdir(pdu, fidp, dent); > > +}); > > > > return err; > > > > }
Re: [PATCH 05/16] hw/block/nvme: refactor dma read/write
On Mon, 2020-07-20 at 13:37 +0200, Klaus Jensen wrote: > From: Klaus Jensen > > Refactor the nvme_dma_{read,write}_prp functions into a common function > taking a DMADirection parameter. > > Signed-off-by: Klaus Jensen > Reviewed-by: Maxim Levitsky > --- > hw/block/nvme.c | 88 - > 1 file changed, 43 insertions(+), 45 deletions(-) > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > index 6a1a1626b87b..d314a604db81 100644 > --- a/hw/block/nvme.c > +++ b/hw/block/nvme.c > @@ -361,55 +361,50 @@ unmap: > return status; > } > > -static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len, > - uint64_t prp1, uint64_t prp2) > +static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len, > + uint64_t prp1, uint64_t prp2, DMADirection dir) > { > QEMUSGList qsg; > QEMUIOVector iov; > uint16_t status = NVME_SUCCESS; > > -if (nvme_map_prp(, , prp1, prp2, len, n)) { > -return NVME_INVALID_FIELD | NVME_DNR; > +status = nvme_map_prp(, , prp1, prp2, len, n); > +if (status) { > +return status; > } > + > if (qsg.nsg > 0) { > -if (dma_buf_write(ptr, len, )) { > -status = NVME_INVALID_FIELD | NVME_DNR; > +uint64_t residual; > + > +if (dir == DMA_DIRECTION_TO_DEVICE) { > +residual = dma_buf_write(ptr, len, ); > +} else { > +residual = dma_buf_read(ptr, len, ); > } > -qemu_sglist_destroy(); > -} else { > -if (qemu_iovec_to_buf(, 0, ptr, len) != len) { > -status = NVME_INVALID_FIELD | NVME_DNR; > -} > -qemu_iovec_destroy(); > -} > -return status; > -} > > -static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len, > -uint64_t prp1, uint64_t prp2) > -{ > -QEMUSGList qsg; > -QEMUIOVector iov; > -uint16_t status = NVME_SUCCESS; > - > -trace_pci_nvme_dma_read(prp1, prp2); > - > -if (nvme_map_prp(, , prp1, prp2, len, n)) { > -return NVME_INVALID_FIELD | NVME_DNR; > -} > -if (qsg.nsg > 0) { > -if (unlikely(dma_buf_read(ptr, len, ))) { > +if (unlikely(residual)) { > trace_pci_nvme_err_invalid_dma(); > status = NVME_INVALID_FIELD | NVME_DNR; > } > + > 
qemu_sglist_destroy(); > } else { > -if (unlikely(qemu_iovec_from_buf(, 0, ptr, len) != len)) { > +size_t bytes; > + > +if (dir == DMA_DIRECTION_TO_DEVICE) { > +bytes = qemu_iovec_to_buf(, 0, ptr, len); > +} else { > +bytes = qemu_iovec_from_buf(, 0, ptr, len); > +} > + > +if (unlikely(bytes != len)) { > trace_pci_nvme_err_invalid_dma(); > status = NVME_INVALID_FIELD | NVME_DNR; > } > + > qemu_iovec_destroy(); > } > + I know I reviewed this, but thinking now, why not to add an assert here that we don't have both iov and qsg with data. Best regards, Maxim Levitsky > return status; > } > > @@ -840,8 +835,8 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd > *cmd, uint8_t rae, > nvme_clear_events(n, NVME_AER_TYPE_SMART); > } > > -return nvme_dma_read_prp(n, (uint8_t *) + off, trans_len, prp1, > - prp2); > +return nvme_dma_prp(n, (uint8_t *) + off, trans_len, prp1, prp2, > +DMA_DIRECTION_FROM_DEVICE); > } > > static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len, > @@ -862,8 +857,8 @@ static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd > *cmd, uint32_t buf_len, > > trans_len = MIN(sizeof(fw_log) - off, buf_len); > > -return nvme_dma_read_prp(n, (uint8_t *) _log + off, trans_len, prp1, > - prp2); > +return nvme_dma_prp(n, (uint8_t *) _log + off, trans_len, prp1, prp2, > +DMA_DIRECTION_FROM_DEVICE); > } > > static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae, > @@ -887,7 +882,8 @@ static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd > *cmd, uint8_t rae, > > trans_len = MIN(sizeof(errlog) - off, buf_len); > > -return nvme_dma_read_prp(n, (uint8_t *), trans_len, prp1, prp2); > +return nvme_dma_prp(n, (uint8_t *), trans_len, prp1, prp2, > +DMA_DIRECTION_FROM_DEVICE); > } > > static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req) > @@ -1042,8 +1038,8 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, > NvmeIdentify *c) > > trace_pci_nvme_identify_ctrl(); > > -return nvme_dma_read_prp(n, (uint8_t *)>id_ctrl, 
sizeof(n->id_ctrl), > -prp1, prp2); > +return nvme_dma_prp(n, (uint8_t *)>id_ctrl, sizeof(n->id_ctrl), prp1, > +prp2, DMA_DIRECTION_FROM_DEVICE); > } > > static uint16_t
Re: [PATCH 4/4] hw/display/artist.c: fix out of bounds check
On 7/27/20 2:46 PM, Helge Deller wrote: > -for (i = 0; i < pix_count; i++) { > +for (i = 0; i < pix_count && offset + i < buf->size; i++) { > artist_rop8(s, p + offset + pix_count - 1 - i, > (data & 1) ? (s->plane_mask >> 24) : 0); > data >>= 1; This doesn't look right. You're writing to "offset + pix_count - 1 - i" and yet you're checking bounds vs "offset + i". This could be fixed by computing the complete offset into a local variable and then having an inner if to avoid the write, as you do for the second loop. But it would be better to precompute the correct loop bounds. r~ > @@ -398,7 +390,9 @@ static void vram_bit_write(ARTISTState *s, int posx, int > posy, bool incr_x, > for (i = 3; i >= 0; i--) { > if (!(s->image_bitmap_op & 0x2000) || > s->vram_bitmask & (1 << (28 + i))) { > -artist_rop8(s, p + offset + 3 - i, data8[ROP8OFF(i)]); > +if (offset + 3 - i < buf->size) { > +artist_rop8(s, p + offset + 3 - i, data8[ROP8OFF(i)]); > +} > } > } > memory_region_set_dirty(&buf->mr, offset, 3); > @@ -420,7 +414,7 @@ static void vram_bit_write(ARTISTState *s, int posx, int > posy, bool incr_x, > break; > } > > -for (i = 0; i < pix_count; i++) { > +for (i = 0; i < pix_count && offset + i < buf->size; i++) { > mask = 1 << (pix_count - 1 - i); > > if (!(s->image_bitmap_op & 0x2000) ||
Re: [PATCH v8 1/7] tests/virtio-9p: added split readdir tests
On Mittwoch, 29. Juli 2020 17:42:54 CEST Greg Kurz wrote: > On Wed, 29 Jul 2020 10:10:23 +0200 > > Christian Schoenebeck wrote: > > The previous, already existing 'basic' readdir test simply used a > > 'count' parameter big enough to retrieve all directory entries with a > > single Treaddir request. > > > > In the 3 new 'split' readdir tests added by this patch, directory > > entries are retrieved, split over several Treaddir requests by picking > > small 'count' parameters which force the server to truncate the > > response. So the test client sends as many Treaddir requests as > > necessary to get all directory entries. > > > > The following 3 new tests are added (executed in this sequence): > > > > 1. Split readdir test with count=512 > > 2. Split readdir test with count=256 > > 3. Split readdir test with count=128 > > > > This test case sequence is chosen because the smaller the 'count' value, > > the higher the chance of errors in case of implementation bugs on server > > side. > > > > Signed-off-by: Christian Schoenebeck > > --- > > The existing fs_readdir() function for the 'basic' test is a subset > of the new fs_readdir_split() introduced by this patch (quite visible > if you sdiff the code). > > To avoid code duplication, I would have probably tried to do the changes > in fs_readdir() and implement the 'basic' test as: > > static void fs_readdir_basic(void *obj, void *data, > QGuestAllocator *t_alloc) > { > /* > * submit count = msize - 11, because 11 is the header size of Rreaddir > */ > fs_readdir(obj, data, t_alloc, P9_MAX_SIZE - 11); > } You are right of course; there is code duplication. My thought was to preserve the simple readdir test code (at least at this initial stage) as it is really very simple and easy to understand. The split readdir test code is probably already a tad more tedious to read. I keep it in mind though and probably deduplicate this test code a bit later on. But I think it makes sense to start off with this version for now. 
> but anyway this looks good to me so: > > Reviewed-by: Greg Kurz Thanks! > > tests/qtest/virtio-9p-test.c | 108 +++ > > 1 file changed, 108 insertions(+) > > > > diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c > > index 2167322985..de30b717b6 100644 > > --- a/tests/qtest/virtio-9p-test.c > > +++ b/tests/qtest/virtio-9p-test.c > > @@ -578,6 +578,7 @@ static bool fs_dirents_contain_name(struct V9fsDirent > > *e, const char* name)> > > return false; > > > > } > > > > +/* basic readdir test where reply fits into a single response message */ > > > > static void fs_readdir(void *obj, void *data, QGuestAllocator *t_alloc) > > { > > > > QVirtio9P *v9p = obj; > > > > @@ -631,6 +632,89 @@ static void fs_readdir(void *obj, void *data, > > QGuestAllocator *t_alloc)> > > g_free(wnames[0]); > > > > } > > > > +/* readdir test where overall request is split over several messages */ > > +static void fs_readdir_split(void *obj, void *data, QGuestAllocator > > *t_alloc, + uint32_t count) > > +{ > > +QVirtio9P *v9p = obj; > > +alloc = t_alloc; > > +char *const wnames[] = { g_strdup(QTEST_V9FS_SYNTH_READDIR_DIR) }; > > +uint16_t nqid; > > +v9fs_qid qid; > > +uint32_t nentries, npartialentries; > > +struct V9fsDirent *entries, *tail, *partialentries; > > +P9Req *req; > > +int fid; > > +uint64_t offset; > > + > > +fs_attach(v9p, NULL, t_alloc); > > + > > +fid = 1; > > +offset = 0; > > +entries = NULL; > > +nentries = 0; > > +tail = NULL; > > + > > +req = v9fs_twalk(v9p, 0, fid, 1, wnames, 0); > > +v9fs_req_wait_for_reply(req, NULL); > > +v9fs_rwalk(req, , NULL); > > +g_assert_cmpint(nqid, ==, 1); > > + > > +req = v9fs_tlopen(v9p, fid, O_DIRECTORY, 0); > > +v9fs_req_wait_for_reply(req, NULL); > > +v9fs_rlopen(req, , NULL); > > + > > +/* > > + * send as many Treaddir requests as required to get all directory > > + * entries > > + */ > > +while (true) { > > +npartialentries = 0; > > +partialentries = NULL; > > + > > +req = v9fs_treaddir(v9p, fid, offset, count, 0); 
> > +v9fs_req_wait_for_reply(req, NULL); > > +v9fs_rreaddir(req, , , ); > > +if (npartialentries > 0 && partialentries) { > > +if (!entries) { > > +entries = partialentries; > > +nentries = npartialentries; > > +tail = partialentries; > > +} else { > > +tail->next = partialentries; > > +nentries += npartialentries; > > +} > > +while (tail->next) { > > +tail = tail->next; > > +} > > +offset = tail->offset; > > +} else { > > +break; > > +} > > +} > > + > > +g_assert_cmpint(
Re: [PATCH for-5.2 0/6] Continue booting in case the first device is not bootable
[restored cc:s] On Wed, 29 Jul 2020 13:42:05 +0200 Viktor Mihajlovski wrote: > On 7/28/20 8:37 PM, Thomas Huth wrote: > > If the user did not specify a "bootindex" property, the s390-ccw bios > > tries to find a bootable device on its own. Unfortunately, it always > > stops at the very first device that it can find, no matter whether it's > > bootable or not. That causes some weird behavior, for example while > > > > qemu-system-s390x -hda bootable.qcow2 > > > > boots perfectly fine, the bios refuses to work if you just specify > > a virtio-scsi controller in front of it: > > > > qemu-system-s390x -device virtio-scsi -hda bootable.qcow2 > > > > Since this is quite uncomfortable and confusing for the users, and > > all major firmwares on other architectures correctly boot in such > > cases, too, let's also try to teach the s390-ccw bios how to boot > > in such cases. > > > > For this, we have to get rid of the various panic()s and IPL_assert() > > statements at the "low-level" function and let the main code handle > > the decision instead whether a boot from a device should fail or not, > > so that the main code can continue searching in case it wants to. > > > > Looking at it from an architectural perspective: If an IPL Information > Block specifying the boot device has been set and can be retrieved using > Diagnose 308, it has to be respected, even if the device doesn't contain > a bootable program. The boot has to fail in this case. > > I did not have the bandwidth to follow all code paths, but I gather that this > is still the case with the series. So one can argue that these changes > are taking care of an undefined situation (real hardware will always > have the IPIB set). > > As long as the architecture is not violated, I can live with the > proposed changes. I however would like to point out that this only > covers a corner case (no -boot or -device ..,bootindex specified). A VM > defined and started with libvirt will always specify the boot device. 
> Please don't create the impression that these patches will lead to the > same behavior as on other platforms. It is still not possible to have an > ordered list of potential boot devices in an architecture-compliant way. Yes, libvirt will always add this parameter. Still, I've seen confusion generated by this behaviour, so this change sounds like a good idea to me. (Is there any possibility to enhance the architecture to provide a list of devices in the future?)