Re: [PATCH v4 3/7] vdpa: Make vhost_vdpa_net_cvq_map_elem accept any out sg

2022-07-31 Thread Eugenio Perez Martin
On Mon, Jul 25, 2022 at 10:48 AM Jason Wang  wrote:
>
>
> 在 2022/7/22 19:12, Eugenio Pérez 写道:
> > So its generic enough to accept any out sg buffer and we can inject
> > NIC state messages.
> >
> > Signed-off-by: Eugenio Pérez 
> > ---
> >   net/vhost-vdpa.c | 29 +++--
> >   1 file changed, 15 insertions(+), 14 deletions(-)
> >
> > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> > index 1b82ac2e07..bbe1830824 100644
> > --- a/net/vhost-vdpa.c
> > +++ b/net/vhost-vdpa.c
> > @@ -302,35 +302,36 @@ dma_map_err:
> >   }
> >
> >   /**
> > - * Copy the guest element into a dedicated buffer suitable to be sent to 
> > NIC
> > + * Maps out sg and in buffer into dedicated buffers suitable to be sent to 
> > NIC
> >*
> > - * @iov: [0] is the out buffer, [1] is the in one
> > + * @dev_iov: [0] is the out buffer, [1] is the in one
>
>
> This still has assumption on the layout. I wonder if it's better to
> simply use in_sg and out_sg.
>

Sure, I can resend that way.

It complicates the code a little bit because of error paths. We
currently send one out sg and one in sg always. But we can make it
more generic for sure.

Thanks!

> Thanks
>
>
> >*/
> > -static bool vhost_vdpa_net_cvq_map_elem(VhostVDPAState *s,
> > -VirtQueueElement *elem,
> > -struct iovec *iov)
> > +static bool vhost_vdpa_net_cvq_map_sg(VhostVDPAState *s,
> > +  const struct iovec *out, size_t 
> > out_num,
> > +  struct iovec *dev_iov)
> >   {
> >   size_t in_copied;
> >   bool ok;
> >
> > -iov[0].iov_base = s->cvq_cmd_out_buffer;
> > -ok = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, elem->out_sg, 
> > elem->out_num,
> > -vhost_vdpa_net_cvq_cmd_len(), 
> > iov[0].iov_base,
> > -&iov[0].iov_len, false);
> > +dev_iov[0].iov_base = s->cvq_cmd_out_buffer;
> > +ok = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, out, out_num,
> > +vhost_vdpa_net_cvq_cmd_len(),
> > +dev_iov[0].iov_base, &dev_iov[0].iov_len,
> > +false);
> >   if (unlikely(!ok)) {
> >   return false;
> >   }
> >
> > -iov[1].iov_base = s->cvq_cmd_in_buffer;
> > +dev_iov[1].iov_base = s->cvq_cmd_in_buffer;
> >   ok = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, NULL, 0,
> > -sizeof(virtio_net_ctrl_ack), 
> > iov[1].iov_base,
> > -&in_copied, true);
> > +sizeof(virtio_net_ctrl_ack),
> > +dev_iov[1].iov_base, &in_copied, true);
> >   if (unlikely(!ok)) {
> >   vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer);
> >   return false;
> >   }
> >
> > -iov[1].iov_len = sizeof(virtio_net_ctrl_ack);
> > +dev_iov[1].iov_len = sizeof(virtio_net_ctrl_ack);
> >   return true;
> >   }
> >
> > @@ -434,7 +435,7 @@ static int 
> > vhost_vdpa_net_handle_ctrl_avail(VhostShadowVirtqueue *svq,
> >   };
> >   bool ok;
> >
> > -ok = vhost_vdpa_net_cvq_map_elem(s, elem, dev_buffers);
> > +ok = vhost_vdpa_net_cvq_map_sg(s, elem->out_sg, elem->out_num, 
> > dev_buffers);
> >   if (unlikely(!ok)) {
> >   goto out;
> >   }
>




[Bug 1180923] Re: unused memory filled with 0x00 instead of 0xFF

2022-07-31 Thread Yonggang Luo
** Bug watch added: gitlab.com/qemu-project/qemu/-/issues #1133
   https://gitlab.com/qemu-project/qemu/-/issues/1133

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1180923

Title:
  unused memory filled with 0x00 instead of 0xFF

Status in QEMU:
  Expired

Bug description:
  Qemu, ever since it was made (so, since 2003), has this problem in DOS
  (either PC-DOS or MS-DOS and partly Windows 9x) not recognizing the
  memory available when the memory is filled with 0x00 but when it is
  filled with 0xFF it gets recognized properly, where should I patch
  qemu to solve this memory problem?

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1180923/+subscriptions




Re: [Bug 1180923] Re: unused memory filled with 0x00 instead of 0xFF

2022-07-31 Thread Yonggang Luo
I create a issue at https://gitlab.com/qemu-project/qemu/-/issues/1133 to
tracking this

On Mon, Aug 1, 2022 at 5:06 AM David Glover <1180...@bugs.launchpad.net>
wrote:

> I know this is expired but it's still a problem in qemu 7.0.0. For
> example, when running MS-DOS 6.22, "mem" reports 0K of upper memory, and
> memmaker fails to run, complaining that it could not allocate any. I'd
> love to know if there's a workaround.
>
> --
> You received this bug notification because you are a member of qemu-
> devel-ml, which is subscribed to QEMU.
> https://bugs.launchpad.net/bugs/1180923
>
> Title:
>   unused memory filled with 0x00 instead of 0xFF
>
> Status in QEMU:
>   Expired
>
> Bug description:
>   Qemu, ever since it was made (so, since 2003), has this problem in DOS
>   (either PC-DOS or MS-DOS and partly Windows 9x) not recognizing the
>   memory available when the memory is filled with 0x00 but when it is
>   filled with 0xFF it gets recognized properly, where should I patch
>   qemu to solve this memory problem?
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/qemu/+bug/1180923/+subscriptions
>
>
>

-- 
 此致
礼
罗勇刚
Yours
sincerely,
Yonggang Luo


Re: [PATCH] ipmi:smbus: Add a check around a memcpy

2022-07-31 Thread Michael S. Tsirkin
On Sun, Jul 31, 2022 at 06:02:46PM -0500, miny...@acm.org wrote:
> From: Corey Minyard 
> 
> In one case:
> 
>   memcpy(sid->inmsg + sid->inlen, buf, len);
> 
> if len == 0 then sid->inmsg + sig->inlen can point to one past the inmsg
> array if the array is full.  We have to allow len == 0 due to some
> vagueness in the spec, but we don't have to call memcpy.
> 
> Found by Coverity.  This is not a problem in practice, but the results
> are technically (maybe) undefined.  So make Coverity happy.
> 
> Reported-by: Peter Maydell 
> Signed-off-by: Corey Minyard 

Acked-by: Michael S. Tsirkin 

> ---
>  hw/ipmi/smbus_ipmi.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> I think this should do it.
> 
> diff --git a/hw/ipmi/smbus_ipmi.c b/hw/ipmi/smbus_ipmi.c
> index 9ef9112dd5..d0991ab7f9 100644
> --- a/hw/ipmi/smbus_ipmi.c
> +++ b/hw/ipmi/smbus_ipmi.c
> @@ -281,7 +281,9 @@ static int ipmi_write_data(SMBusDevice *dev, uint8_t 
> *buf, uint8_t len)
>   */
>  send = true;
>  }
> -memcpy(sid->inmsg + sid->inlen, buf, len);
> +if (len > 0) {
> +memcpy(sid->inmsg + sid->inlen, buf, len);
> +}
>  sid->inlen += len;
>  break;
>  }
> -- 
> 2.25.1




RE: [PULL V2 25/25] net/colo.c: fix segmentation fault when packet is not parsed correctly

2022-07-31 Thread Zhang, Chen


> -Original Message-
> From: Jason Wang 
> Sent: Monday, August 1, 2022 12:18 PM
> To: Zhang, Chen ; Xu, Tao3 
> Cc: qemu-devel@nongnu.org; Li Zhijian ; Peter
> Maydell 
> Subject: Re: [PULL V2 25/25] net/colo.c: fix segmentation fault when packet
> is not parsed correctly
> 
> 
> 在 2022/7/29 21:58, Peter Maydell 写道:
> > On Wed, 20 Jul 2022 at 10:04, Jason Wang  wrote:
> >> From: Zhang Chen 
> >>
> >> When COLO use only one vnet_hdr_support parameter between
> >> filter-redirector and filter-mirror(or colo-compare), COLO will crash
> >> with segmentation fault. Back track as follow:
> >>
> >> Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation
> fault.
> >> 0x55cb200b in eth_get_l2_hdr_length (p=0x0)
> >>  at /home/tao/project/COLO/colo-qemu/include/net/eth.h:296
> >> 296 uint16_t proto = be16_to_cpu(PKT_GET_ETH_HDR(p)->h_proto);
> >> (gdb) bt
> >> 0  0x55cb200b in eth_get_l2_hdr_length (p=0x0)
> >>  at /home/tao/project/COLO/colo-qemu/include/net/eth.h:296
> >> 1  0x55cb22b4 in parse_packet_early (pkt=0x56a44840) at
> >> net/colo.c:49
> >> 2  0x55cb2b91 in is_tcp_packet (pkt=0x56a44840) at
> >> net/filter-rewriter.c:63
> >>
> >> So wrong vnet_hdr_len will cause pkt->data become NULL. Add check to
> >> raise error and add trace-events to track vnet_hdr_len.
> >>
> >> Signed-off-by: Tao Xu 
> >> Signed-off-by: Zhang Chen 
> >> Reviewed-by: Li Zhijian 
> >> Signed-off-by: Jason Wang 
> > Hi -- Coverity points out that there's a problem with this fix (CID
> > 1490786):
> >
> >> @@ -46,7 +46,14 @@ int parse_packet_early(Packet *pkt)
> >>   static const uint8_t vlan[] = {0x81, 0x00};
> >>   uint8_t *data = pkt->data + pkt->vnet_hdr_len;
> > data here is set to pkt->data + pkt->vnet_hdr_len.
> > If pkt->data is NULL then this addition is C undefined behaviour, and
> > if pkt->data is not NULL but the integer added is such that the
> > pointer ends up not pointing within data, then that is also C
> > undefined behaviour...
> 
> 
> Right. And I don't see how pkt->data can be NULL, maybe a hint of bug
> elsewhere.
> 
> 
> >
> >>   uint16_t l3_proto;
> >> -ssize_t l2hdr_len = eth_get_l2_hdr_length(data);
> >> +ssize_t l2hdr_len;
> >> +
> >> +if (data == NULL) {
> > ...so the compiler is allowed to assume that data cannot be NULL
> > here, and this entire error checking clause is logically dead code.
> >
> > If you're trying to check whether vnet_hdr_len is valid, you
> > need to do that as an integer arithmetic check before you start
> > using it for pointer arithmetic. And you probably want to be
> > checking against some kind of bound, not just for whether the
> > result is going to be NULL.
> 
> 
> Or we can let the caller to validate the Pkt before calling this function.
> 
> 
> >
> > Overall this looks kinda sketchy and could probably use more
> > detailed code review. Where does the bogus vnet_hdr_len come from in
> > the first place? Is this data from the guest, or from the network
> > (ie uncontrolled)?
> 
> 
> If I understand correctly the vnet_hdr_len is set during handshake
> (net_fill_rstate()) which is sent from a network backend.
> 
> Chen or Xu, please post a fix and explain what happens in the changelog.

OK, we will send a patch to fix it.

Thanks
Chen

> 
> Thanks
> 
> 
> >
> >> +trace_colo_proxy_main_vnet_info("This packet is not parsed
> correctly, "
> >> +"pkt->vnet_hdr_len", 
> >> pkt->vnet_hdr_len);
> >> +return 1;
> >> +}
> > thanks
> > -- PMM
> >



Re: [PULL V2 25/25] net/colo.c: fix segmentation fault when packet is not parsed correctly

2022-07-31 Thread Jason Wang



在 2022/7/29 21:58, Peter Maydell 写道:

On Wed, 20 Jul 2022 at 10:04, Jason Wang  wrote:

From: Zhang Chen 

When COLO use only one vnet_hdr_support parameter between
filter-redirector and filter-mirror(or colo-compare), COLO will crash
with segmentation fault. Back track as follow:

Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
0x55cb200b in eth_get_l2_hdr_length (p=0x0)
 at /home/tao/project/COLO/colo-qemu/include/net/eth.h:296
296 uint16_t proto = be16_to_cpu(PKT_GET_ETH_HDR(p)->h_proto);
(gdb) bt
0  0x55cb200b in eth_get_l2_hdr_length (p=0x0)
 at /home/tao/project/COLO/colo-qemu/include/net/eth.h:296
1  0x55cb22b4 in parse_packet_early (pkt=0x56a44840) at
net/colo.c:49
2  0x55cb2b91 in is_tcp_packet (pkt=0x56a44840) at
net/filter-rewriter.c:63

So wrong vnet_hdr_len will cause pkt->data become NULL. Add check to
raise error and add trace-events to track vnet_hdr_len.

Signed-off-by: Tao Xu 
Signed-off-by: Zhang Chen 
Reviewed-by: Li Zhijian 
Signed-off-by: Jason Wang 

Hi -- Coverity points out that there's a problem with this fix
(CID 1490786):


@@ -46,7 +46,14 @@ int parse_packet_early(Packet *pkt)
  static const uint8_t vlan[] = {0x81, 0x00};
  uint8_t *data = pkt->data + pkt->vnet_hdr_len;

data here is set to pkt->data + pkt->vnet_hdr_len.
If pkt->data is NULL then this addition is C undefined behaviour,
and if pkt->data is not NULL but the integer added is such
that the pointer ends up not pointing within data, then that
is also C undefined behaviour...



Right. And I don't see how pkt->data can be NULL, maybe a hint of bug 
elsewhere.






  uint16_t l3_proto;
-ssize_t l2hdr_len = eth_get_l2_hdr_length(data);
+ssize_t l2hdr_len;
+
+if (data == NULL) {

...so the compiler is allowed to assume that data cannot be NULL
here, and this entire error checking clause is logically dead code.

If you're trying to check whether vnet_hdr_len is valid, you
need to do that as an integer arithmetic check before you start
using it for pointer arithmetic. And you probably want to be
checking against some kind of bound, not just for whether the
result is going to be NULL.



Or we can let the caller to validate the Pkt before calling this function.




Overall this looks kinda sketchy and could probably use more
detailed code review. Where does the bogus vnet_hdr_len come from in
the first place? Is this data from the guest, or from the network
(ie uncontrolled)?



If I understand correctly the vnet_hdr_len is set during handshake 
(net_fill_rstate()) which is sent from a network backend.


Chen or Xu, please post a fix and explain what happens in the changelog.

Thanks





+trace_colo_proxy_main_vnet_info("This packet is not parsed correctly, "
+"pkt->vnet_hdr_len", 
pkt->vnet_hdr_len);
+return 1;
+}

thanks
-- PMM






Re: [PULL V2 19/25] vdpa: Extract get features part from vhost_vdpa_get_max_queue_pairs

2022-07-31 Thread Jason Wang



在 2022/7/29 22:08, Peter Maydell 写道:

On Wed, 20 Jul 2022 at 10:04, Jason Wang  wrote:

From: Eugenio Pérez 

To know the device features is needed for CVQ SVQ, so SVQ knows if it
can handle all commands or not. Extract from
vhost_vdpa_get_max_queue_pairs so we can reuse it.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Jason Wang 

Hi; this change introduces a resource leak in the new
error-exit return path in net_init_vhost_vdpa(). Spotted
by Coverity, CID 1490785.


@@ -517,10 +521,11 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
  NetClientState *peer, Error **errp)
  {
  const NetdevVhostVDPAOptions *opts;
+uint64_t features;
  int vdpa_device_fd;
  g_autofree NetClientState **ncs = NULL;
  NetClientState *nc;
-int queue_pairs, i, has_cvq = 0;
+int queue_pairs, r, i, has_cvq = 0;

  assert(netdev->type == NET_CLIENT_DRIVER_VHOST_VDPA);
  opts = &netdev->u.vhost_vdpa;
@@ -534,7 +539,12 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
  return -errno;
  }

-queue_pairs = vhost_vdpa_get_max_queue_pairs(vdpa_device_fd,
+r = vhost_vdpa_get_features(vdpa_device_fd, &features, errp);
+if (unlikely(r < 0)) {
+return r;

At this point in the code we have allocated the file descriptor
vdpa_device_fd, but this return path fails to close it.



Exactly.





+}
+
+queue_pairs = vhost_vdpa_get_max_queue_pairs(vdpa_device_fd, features,
   &has_cvq, errp);
  if (queue_pairs < 0) {
  qemu_close(vdpa_device_fd);

Compare this pre-existing error-exit path, which correctly
calls qemu_close() on the fd.

Related question: is this function supposed to return -1 on
failure, or negative-errno ?



Kind of either:

  if (net_client_init_fun[netdev->type](netdev, netdev->id, peer, errp) 
< 0) {

    /* FIXME drop when all init functions store an Error */
    if (errp && !*errp) {
    error_setg(errp, "Device '%s' could not be initialized",
   NetClientDriver_str(netdev->type));
    }
    return -1;
    }



  At the moment it has a mix
of both. I think that the sole caller only really wants "<0 on
error", in which case the error-exit code paths could probably
be tidied up so that instead of explicitly calling
qemu_close() and returning r, queue_pairs, or whatever
they got back from the function they just called, they
could just 'goto err_svq' which will do the "close the fd
and return -1" work. Better still, by initializing 'i'
to 0 at the top of the function (and naming it something
clearer, ideally), all the code paths after the initial
qemu_open() succeeds could be made to use 'goto err'
for the error-exit case.



Yes, having a consistent goto based error handling seems much better.

Eugenio, please post patch to fix this.

Thanks




thanks
-- PMM






RE: [PATCH] disas/nanomips: Convert nanoMIPS disassembler to C

2022-07-31 Thread Vince Del Vecchio via
On Fri, 29 Jul 2022 at 10:18, Peter Maydell  wrote:
> ...
> Is it possible to break this down into smaller pieces so it isn't one single 
> enormous 5000 line patch ?
> 
> I guess partial conversion is likely to run into compilation difficulties 
> mid-series; if so we could do "disable the disassembler; convert it; reenable 
> it".
> 
> The rationale here is code review -- reviewing a single huge patch is 
> essentially impossible, so it needs to be broken down into coherent smaller 
> pieces to be reviewable.

Hi Peter,

Something like 90% of the patch is purely mechanical transformations of which 
I've excerpted two examples below.  (There are 3-4 chunks of mechanical 
transformations, of which I've shown the most complicated type that accounts 
for ~60% of the changed lines.)  Did you intend to review this by having a 
human inspect all of these?

Would it be feasible instead to provide a sed script to effect the mechanical 
parts of the patch (the reviewer could review the script, apply it themselves 
to verify equivalence if desired, and spot check the result) together with a 
(much smaller) non-mechanical diff?

-Vince


@@ -2004,17 +1740,17 @@ std::string NMD::ADD_S(uint64 instruction)
  * rt -
  *  rs -
  */
-std::string NMD::ADDIU_32_(uint64 instruction)
+static char *ADDIU_32_(uint64 instruction)
 {
 uint64 rt_value = extract_rt_25_24_23_22_21(instruction);
 uint64 rs_value = extract_rs_20_19_18_17_16(instruction);
 uint64 u_value = extract_u_15_to_0(instruction);
 
-std::string rt = GPR(copy(rt_value));
-std::string rs = GPR(copy(rs_value));
-std::string u = IMMEDIATE(copy(u_value));
+const char *rt = GPR(copy_ui(rt_value));
+const char *rs = GPR(copy_ui(rs_value));
+const char *u = IMMEDIATE_UI(copy_ui(u_value));
 
-return img::format("ADDIU %s, %s, %s", rt, rs, u);
+return img_format("ADDIU %s, %s, %s", rt, rs, u);
 }
 
 
@@ -2027,15 +1763,15 @@ std::string NMD::ADDIU_32_(uint64 instruction)
  * rt -
  *  rs -
  */
-std::string NMD::ADDIU_48_(uint64 instruction)
+static char *ADDIU_48_(uint64 instruction)
 {
 uint64 rt_value = extract_rt_41_40_39_38_37(instruction);
 int64 s_value = extract_s__se31_15_to_0_31_to_16(instruction);
 
-std::string rt = GPR(copy(rt_value));
-std::string s = IMMEDIATE(copy(s_value));
+const char *rt = GPR(copy_ui(rt_value));
+const char *s = IMMEDIATE_I(copy_i(s_value));
 
-return img::format("ADDIU %s, %s", rt, s);
+return img_format("ADDIU %s, %s", rt, s);
 }
 


Re: [RFC 0/3] Add Generic SPI GPIO model

2022-07-31 Thread Andrew Jeffery



On Sun, 31 Jul 2022, at 06:48, Cédric Le Goater wrote:
> On 7/29/22 19:30, Peter Delevoryas wrote:
>> Certainly we'd like to use IRQ's instead, but she ran into correctness
>> problems. Maybe we can investigate that further and fix it.

Yes, let's not work around problems that we have the ability to fix.

>
> So much is happening in that mode.

Yes, though while there's no-doubt accidental complexity in terms of 
its implementation, the underlying hardware is also quite haphazard and 
we need an approach that scales to the large number of GPIOs it 
provides. So I'd argue there's a bunch of essential complexity involved 
as well.

Do we start with a fresh implementation that allows us to get the 
expected behaviour and then build that out to replace the current 
implementation?

Or, can we add more tests for the existing model to pin down where the 
bugs are?

> We need more trace events in the Aspeed
> GPIO model at least an extra in aspeed_gpio_update()

We can always fall back to this but my hope is we can get better test 
coverage to iron out the bugs. Maybe that gets us a more refined and 
maintainable model implementation along the way.

Cheers,

Andrew



[RFC v5 08/11] virtio-blk: add zoned storage APIs for zoned devices

2022-07-31 Thread Sam Li
This patch extends virtio-blk emulation to handle zoned device commands
by calling the new block layer APIs to perform zoned device I/O on
behalf of the guest. It supports Report Zone, and four zone oparations (open,
close, finish, reset). The virtio-blk zoned device command specifications
is currently in the reviewing process.

VIRTIO_BLK_F_ZONED will only be set if the host does support zoned block
devices. The regular block device will not be set. The guest os having
zoned device support can use blkzone(8) to test those commands.

Signed-off-by: Sam Li 
---
 block/block-backend.c |  92 
 hw/block/virtio-blk.c | 172 +-
 include/sysemu/block-backend-io.h |   6 ++
 3 files changed, 268 insertions(+), 2 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index ef6a1f33d5..8f2cfcbd9d 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1431,6 +1431,15 @@ typedef struct BlkRwCo {
 void *iobuf;
 int ret;
 BdrvRequestFlags flags;
+union {
+struct {
+unsigned int *nr_zones;
+BlockZoneDescriptor *zones;
+} zone_report;
+struct {
+BlockZoneOp op;
+} zone_mgmt;
+};
 } BlkRwCo;
 
 int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
@@ -1775,6 +1784,89 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
 return ret;
 }
 
+static void blk_aio_zone_report_entry(void *opaque) {
+BlkAioEmAIOCB *acb = opaque;
+BlkRwCo *rwco = &acb->rwco;
+
+rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset,
+   rwco->zone_report.nr_zones,
+   rwco->zone_report.zones);
+blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
+unsigned int *nr_zones,
+BlockZoneDescriptor  *zones,
+BlockCompletionFunc *cb, void *opaque)
+{
+BlkAioEmAIOCB *acb;
+Coroutine *co;
+
+blk_inc_in_flight(blk);
+acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+acb->rwco = (BlkRwCo) {
+.blk= blk,
+.offset = offset,
+.ret= NOT_DONE,
+.zone_report = {
+.zones = zones,
+.nr_zones = nr_zones,
+},
+};
+acb->has_returned = false;
+
+co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
+bdrv_coroutine_enter(blk_bs(blk), co);
+
+acb->has_returned = true;
+if (acb->rwco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+ blk_aio_complete_bh, acb);
+}
+
+return &acb->common;
+}
+
+static void blk_aio_zone_mgmt_entry(void *opaque) {
+BlkAioEmAIOCB *acb = opaque;
+BlkRwCo *rwco = &acb->rwco;
+
+rwco->ret = blk_co_zone_mgmt(rwco->blk, rwco->zone_mgmt.op,
+ rwco->offset, acb->bytes);
+blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+  int64_t offset, int64_t len,
+  BlockCompletionFunc *cb, void *opaque) {
+BlkAioEmAIOCB *acb;
+Coroutine *co;
+
+blk_inc_in_flight(blk);
+acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+acb->rwco = (BlkRwCo) {
+.blk= blk,
+.offset = offset,
+.ret= NOT_DONE,
+.zone_mgmt = {
+.op = op,
+},
+};
+acb->bytes = len;
+acb->has_returned = false;
+
+co = qemu_coroutine_create(blk_aio_zone_mgmt_entry, acb);
+bdrv_coroutine_enter(blk_bs(blk), co);
+
+acb->has_returned = true;
+if (acb->rwco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+ blk_aio_complete_bh, acb);
+}
+
+return &acb->common;
+}
+
 /*
  * Send a zone_report command.
  * offset is a byte offset from the start of the device. No alignment
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index e9ba752f6b..9722f447a2 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -37,6 +37,7 @@
 /* Config size before the discard support (hide associated config fields) */
 #define VIRTIO_BLK_CFG_SIZE offsetof(struct virtio_blk_config, \
  max_discard_sectors)
+
 /*
  * Starting from the discard feature, we can use this array to properly
  * set the config size depending on the features enabled.
@@ -46,6 +47,8 @@ static const VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_blk_config, discard_sector_alignment)},
 {.flags = 1ULL << VIRTIO_BLK_F_WRITE_ZEROES,
  .end = endof(struct virtio_blk_config, write_zeroes_may_unmap)},
+{.flags = 1ULL << VIRTIO_BLK_F_ZONED,
+ .end = endo

[RFC v5 11/11] docs/zoned-storage: add zoned device documentation

2022-07-31 Thread Sam Li
Add the documentation about the zoned device support to virtio-blk
emulation.

Signed-off-by: Sam Li 
---
 docs/devel/zoned-storage.rst   | 68 ++
 docs/system/qemu-block-drivers.rst.inc |  6 +++
 2 files changed, 74 insertions(+)
 create mode 100644 docs/devel/zoned-storage.rst

diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
new file mode 100644
index 00..e62927dceb
--- /dev/null
+++ b/docs/devel/zoned-storage.rst
@@ -0,0 +1,68 @@
+=
+zoned-storage
+=
+
+Zoned Block Devices (ZBDs) devide the LBA space to block regions called zones
+that are larger than the LBA size. It can only allow sequential writes, which
+reduces write amplification in SSD, leading to higher throughput and increased
+capacity. More details about ZBDs can be found at:
+
+https://zonedstorage.io/docs/introduction/zoned-storage
+
+zone emulation
+--
+In its current status, the virtio-blk device is not aware of ZBDs but the guest
+sees host-managed drives as regular drive that will runs correctly under the
+most common write workloads.
+
+The zoned device support aims to let guests (virtual machines) access zoned
+storage devices on the host (hypervisor) through a virtio-blk device. This
+involves extending QEMU's block layer and virtio-blk emulation code.
+
+If the host supports zoned block devices, it can set VIRTIO_BLK_F_ZONED. Then
+in the guest side, it appears following situations:
+1) If the guest virtio-blk driver sees the VIRTIO_BLK_F_ZONED bit set, then it
+will assume that the zoned characteristics fields of the config space are 
valid.
+2) If the guest virtio-blk driver sees a zoned model that is NONE, then it is
+known that is a regular block device.
+3) If the guest virtio-blk driver sees a zoned model that is HM(or HA), then it
+is known that is a zoned block device and probes the other zone fields.
+
+On QEMU sides,
+1) The DEFINE PROP BIT macro must be used to declare that the host supports
+zones.
+2) BlockDrivers can declare zoned device support once known the zoned model
+for the block device is not NONE.
+
+zoned storage APIs
+--
+
+Zone emulation part extends the block layer APIs and virtio-blk emulation 
section
+with the minimum set of zoned commands that are necessary to support zoned
+devices. The commands are - Report Zones, four zone operations and Zone Append
+(developing).
+
+testing
+---
+
+It can be tested on a null_blk device using qemu-io, qemu-iotests or blkzone(8)
+command in the guest os.
+
+1. For example, the command line for zone report using qemu-io is:
+
+$ path/to/qemu-io --image-opts driver=zoned_host_device,filename=/dev/nullb0 -c
+"zrp offset nr_zones"
+
+To enable zoned device in the guest os, the guest kernel must have the 
virtio-blk
+driver with ZBDs support. The link to such patches for the kernel is:
+
+https://github.com/dmitry-fomichev/virtblk-zbd
+
+Then, add the following options to the QEMU command line:
+-blockdev node-name=drive0,driver=zoned_host_device,filename=/dev/nullb0
+
+After the guest os booting, use blkzone(8) to test zone operations:
+blkzone report -o offset -c nr_zones /dev/vda
+
+2. We can also use the qemu-iotests in ./tests/qemu-iotests/tests/zoned.sh.
+
diff --git a/docs/system/qemu-block-drivers.rst.inc 
b/docs/system/qemu-block-drivers.rst.inc
index dfe5d2293d..2a761a4b80 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -430,6 +430,12 @@ Hard disks
   you may corrupt your host data (use the ``-snapshot`` command
   line option or modify the device permissions accordingly).
 
+Zoned block devices
+  Zoned block devices can passed through to the guest if the emulated storage
+  controller supports zoned storage. Use ``--blockdev zoned_host_device,
+  node-name=drive0,filename=/dev/nullb0`` to pass through ``/dev/nullb0``
+  as ``drive0``.
+
 Windows
 ^^^
 
-- 
2.37.1




[RFC v5 09/11] qemu-io: add zoned block device operations.

2022-07-31 Thread Sam Li
Add zoned storage commands of the device: zone_report(zrp), zone_open(zo),
zone_close(zc), zone_reset(zrs), zone_finish(zf).

For example, to test zone_report, use following command:
$ ./build/qemu-io --image-opts driver=zoned_host_device, filename=/dev/nullb0
-c "zrp offset nr_zones"

Signed-off-by: Sam Li 
---
 block/io.c |  24 ++---
 qemu-io-cmds.c | 144 +
 2 files changed, 148 insertions(+), 20 deletions(-)

diff --git a/block/io.c b/block/io.c
index a4625fb0e1..de9ec1d740 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3209,19 +3209,11 @@ int bdrv_co_zone_report(BlockDriverState *bs, int64_t 
offset,
 IO_CODE();
 
 bdrv_inc_in_flight(bs);
-if (!drv || (!drv->bdrv_co_zone_report)) {
+if (!drv || !drv->bdrv_co_zone_report) {
 co.ret = -ENOTSUP;
 goto out;
 }
-
-if (drv->bdrv_co_zone_report) {
-co.ret = drv->bdrv_co_zone_report(bs, offset, nr_zones, zones);
-} else {
-co.ret = -ENOTSUP;
-goto out;
-qemu_coroutine_yield();
-}
-
+co.ret = drv->bdrv_co_zone_report(bs, offset, nr_zones, zones);
 out:
 bdrv_dec_in_flight(bs);
 return co.ret;
@@ -3237,19 +3229,11 @@ int bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp 
op,
 IO_CODE();
 
 bdrv_inc_in_flight(bs);
-if (!drv || (!drv->bdrv_co_zone_mgmt)) {
+if (!drv || !drv->bdrv_co_zone_mgmt) {
 co.ret = -ENOTSUP;
 goto out;
 }
-
-if (drv->bdrv_co_zone_mgmt) {
-co.ret = drv->bdrv_co_zone_mgmt(bs, op, offset, len);
-} else {
-co.ret = -ENOTSUP;
-goto out;
-qemu_coroutine_yield();
-}
-
+co.ret = drv->bdrv_co_zone_mgmt(bs, op, offset, len);
 out:
 bdrv_dec_in_flight(bs);
 return co.ret;
diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 952dc940f1..5a215277c7 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1712,6 +1712,145 @@ static const cmdinfo_t flush_cmd = {
 .oneline= "flush all in-core file state to disk",
 };
 
+static int zone_report_f(BlockBackend *blk, int argc, char **argv)
+{
+int ret;
+int64_t offset;
+unsigned int nr_zones;
+
+++optind;
+offset = cvtnum(argv[optind]);
+++optind;
+nr_zones = cvtnum(argv[optind]);
+
+g_autofree BlockZoneDescriptor *zones = NULL;
+zones = g_new(BlockZoneDescriptor, nr_zones);
+ret = blk_zone_report(blk, offset, &nr_zones, zones);
+if (ret < 0) {
+printf("zone report failed: %s\n", strerror(-ret));
+} else {
+for (int i = 0; i < nr_zones; ++i) {
+printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", "
+   "cap"" 0x%" PRIx64 ",wptr 0x%" PRIx64 ", "
+   "zcond:%u, [type: %u]\n",
+   zones[i].start, zones[i].length, zones[i].cap, zones[i].wp,
+   zones[i].cond, zones[i].type);
+}
+}
+return ret;
+}
+
+static const cmdinfo_t zone_report_cmd = {
+.name = "zone_report",
+.altname = "zrp",
+.cfunc = zone_report_f,
+.argmin = 2,
+.argmax = 2,
+.args = "offset number",
+.oneline = "report zone information",
+};
+
+static int zone_open_f(BlockBackend *blk, int argc, char **argv)
+{
+int ret;
+int64_t offset, len;
+++optind;
+offset = cvtnum(argv[optind]);
+++optind;
+len = cvtnum(argv[optind]);
+ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len);
+if (ret < 0) {
+printf("zone open failed: %s\n", strerror(-ret));
+}
+return ret;
+}
+
+static const cmdinfo_t zone_open_cmd = {
+.name = "zone_open",
+.altname = "zo",
+.cfunc = zone_open_f,
+.argmin = 2,
+.argmax = 2,
+.args = "offset len",
+.oneline = "explicit open a range of zones in zone block device",
+};
+
+static int zone_close_f(BlockBackend *blk, int argc, char **argv)
+{
+int ret;
+int64_t offset, len;
+++optind;
+offset = cvtnum(argv[optind]);
+++optind;
+len = cvtnum(argv[optind]);
+ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len);
+if (ret < 0) {
+printf("zone close failed: %s\n", strerror(-ret));
+}
+return ret;
+}
+
+static const cmdinfo_t zone_close_cmd = {
+.name = "zone_close",
+.altname = "zc",
+.cfunc = zone_close_f,
+.argmin = 2,
+.argmax = 2,
+.args = "offset len",
+.oneline = "close a range of zones in zone block device",
+};
+
+static int zone_finish_f(BlockBackend *blk, int argc, char **argv)
+{
+int ret;
+int64_t offset, len;
+++optind;
+offset = cvtnum(argv[optind]);
+++optind;
+len = cvtnum(argv[optind]);
+ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len);
+if (ret < 0) {
+printf("zone finish failed: %s\n", strerror(-ret));
+}
+return ret;
+}
+
+static const cmdinfo_t zone_finish_cmd = {
+.name = "zone_finish",
+.altname = "zf",
+  

[RFC v5 06/11] raw-format: add zone operations to pass through requests

2022-07-31 Thread Sam Li
raw-format driver usually sits on top of file-posix driver. It needs to
pass through requests of zone commands.

Signed-off-by: Sam Li 
---
 block/raw-format.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/block/raw-format.c b/block/raw-format.c
index 69fd650eaf..6b20bd22ef 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -314,6 +314,17 @@ static int coroutine_fn raw_co_pdiscard(BlockDriverState 
*bs,
 return bdrv_co_pdiscard(bs->file, offset, bytes);
 }
 
+static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t 
offset,
+   unsigned int *nr_zones,
+   BlockZoneDescriptor *zones) {
+return bdrv_co_zone_report(bs->file->bs, offset, nr_zones, zones);
+}
+
+static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+ int64_t offset, int64_t len) {
+return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len);
+}
+
 static int64_t raw_getlength(BlockDriverState *bs)
 {
 int64_t len;
@@ -614,6 +625,8 @@ BlockDriver bdrv_raw = {
 .bdrv_co_pwritev  = &raw_co_pwritev,
 .bdrv_co_pwrite_zeroes = &raw_co_pwrite_zeroes,
 .bdrv_co_pdiscard = &raw_co_pdiscard,
+.bdrv_co_zone_report  = &raw_co_zone_report,
+.bdrv_co_zone_mgmt  = &raw_co_zone_mgmt,
 .bdrv_co_block_status = &raw_co_block_status,
 .bdrv_co_copy_range_from = &raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = &raw_co_copy_range_to,
-- 
2.37.1




[RFC v5 10/11] qemu-iotests: test new zone operations

2022-07-31 Thread Sam Li
We have added new block layer APIs of zoned block devices. Test it with:
Create a null_blk device, run each zone operation on it and see
whether reporting right zone information.

Signed-off-by: Sam Li 
---
 tests/qemu-iotests/tests/zoned.out | 53 ++
 tests/qemu-iotests/tests/zoned.sh  | 86 ++
 2 files changed, 139 insertions(+)
 create mode 100644 tests/qemu-iotests/tests/zoned.out
 create mode 100755 tests/qemu-iotests/tests/zoned.sh

diff --git a/tests/qemu-iotests/tests/zoned.out 
b/tests/qemu-iotests/tests/zoned.out
new file mode 100644
index 00..d09be2ffcd
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.out
@@ -0,0 +1,53 @@
+QA output created by zoned.sh
+Testing a null_blk device:
+Simple cases: if the operations work
+(1) report the first zone:
+start: 0x0, len 0x8, cap 0x8,wptr 0x0, zcond:1, [type: 2]
+
+report the first 10 zones
+start: 0x0, len 0x8, cap 0x8,wptr 0x0, zcond:1, [type: 2]
+start: 0x8, len 0x8, cap 0x8,wptr 0x8, zcond:1, [type: 2]
+start: 0x10, len 0x8, cap 0x8,wptr 0x10, zcond:1, [type: 2]
+start: 0x18, len 0x8, cap 0x8,wptr 0x18, zcond:1, [type: 2]
+start: 0x20, len 0x8, cap 0x8,wptr 0x20, zcond:1, [type: 2]
+start: 0x28, len 0x8, cap 0x8,wptr 0x28, zcond:1, [type: 2]
+start: 0x30, len 0x8, cap 0x8,wptr 0x30, zcond:1, [type: 2]
+start: 0x38, len 0x8, cap 0x8,wptr 0x38, zcond:1, [type: 2]
+start: 0x40, len 0x8, cap 0x8,wptr 0x40, zcond:1, [type: 2]
+start: 0x48, len 0x8, cap 0x8,wptr 0x48, zcond:1, [type: 2]
+
+report the last zone:
+start: 0x1f38, len 0x8, cap 0x8,wptr 0x1f38, zcond:1, [type: 2]
+
+
+(2) opening the first zone
+report after:
+start: 0x0, len 0x8, cap 0x8,wptr 0x0, zcond:3, [type: 2]
+
+opening the second zone
+report after:
+start: 0x8, len 0x8, cap 0x8,wptr 0x8, zcond:3, [type: 2]
+
+opening the last zone
+report after:
+start: 0x1f38, len 0x8, cap 0x8,wptr 0x1f38, zcond:3, [type: 2]
+
+
+(3) closing the first zone
+report after:
+start: 0x0, len 0x8, cap 0x8,wptr 0x0, zcond:1, [type: 2]
+
+closing the last zone
+report after:
+start: 0x1f38, len 0x8, cap 0x8,wptr 0x1f38, zcond:1, [type: 2]
+
+
+(4) finishing the second zone
+After finishing a zone:
+start: 0x8, len 0x8, cap 0x8,wptr 0x10, zcond:14, [type: 2]
+
+
+(5) resetting the second zone
+After resetting a zone:
+start: 0x8, len 0x8, cap 0x8,wptr 0x8, zcond:1, [type: 2]
+*** done
diff --git a/tests/qemu-iotests/tests/zoned.sh 
b/tests/qemu-iotests/tests/zoned.sh
new file mode 100755
index 00..db68aa88d4
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.sh
@@ -0,0 +1,86 @@
+#!/usr/bin/env bash
+#
+# Test zone management operations.
+#
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+status=1 # failure is the default!
+
+_cleanup()
+{
+  _cleanup_test_img
+  sudo rmmod null_blk
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.qemu
+
+# This test only runs on Linux hosts with raw image files.
+_supported_fmt raw
+_supported_proto file
+_supported_os Linux
+
+QEMU_IO="build/qemu-io"
+IMG="--image-opts driver=zoned_host_device,filename=/dev/nullb0"
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
+
+echo "Testing a null_blk device:"
+echo "Simple cases: if the operations work"
+sudo modprobe null_blk nr_devices=1 zoned=1
+
+echo "(1) report the first zone:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "report the first 10 zones"
+sudo $QEMU_IO $IMG -c "zrp 0 10"
+echo
+echo "report the last zone:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e7000 2"
+echo
+echo
+echo "(2) opening the first zone"
+sudo $QEMU_IO $IMG -c "zo 0 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "opening the second zone"
+sudo $QEMU_IO $IMG -c "zo 524288 0x8" # 524288 is the zone sector size
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 268435456 1" # 268435456 / 512 = 524288
+echo
+echo "opening the last zone"
+sudo $QEMU_IO $IMG -c "zo 0x1f38 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e7000 2"
+echo
+echo
+echo "(3) closing the first zone"
+sudo $QEMU_IO $IMG -c "zc 0 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "closing the last zone"
+sudo $QEMU_IO $IMG -c "zc 0x1f38 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e7000 2"
+echo
+echo
+echo "(4) finishing the second zone"
+sudo $QEMU_IO $IMG -c "zf 524288 0x8"
+echo "After finishing a zone:"
+sudo $QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo
+
+echo "(5) resetting the second zone"
+sudo $QEMU_IO $IMG -c "zrs 524288 0x8"
+echo "After resetting a zone:"
+sudo $QEMU_IO $IMG -c "zrp 268435456 1"
+# success, all done
+echo "*** done"
+rm -f $seq.full

[RFC v5 07/11] config: add check to block layer

2022-07-31 Thread Sam Li
Putting zoned/non-zoned BlockDrivers on top of each other is not
allowed.

Signed-off-by: Sam Li 
---
 block.c  | 13 +
 block/file-posix.c   |  2 ++
 block/raw-format.c   |  1 +
 include/block/block_int-common.h | 10 ++
 4 files changed, 26 insertions(+)

diff --git a/block.c b/block.c
index bc85f46eed..8a259b158c 100644
--- a/block.c
+++ b/block.c
@@ -7947,6 +7947,19 @@ void bdrv_add_child(BlockDriverState *parent_bs, 
BlockDriverState *child_bs,
 return;
 }
 
+/*
+ * Non-zoned block drivers do not follow zoned storage constraints
+ * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
+ * drivers in a graph.
+ */
+if (!parent_bs->drv->supports_zoned_children && child_bs->drv->is_zoned) {
+error_setg(errp, "Cannot add a %s child to a %s parent",
+   child_bs->drv->is_zoned ? "zoned" : "non-zoned",
+   parent_bs->drv->supports_zoned_children ?
+   "support zoned children" : "not support zoned children");
+return;
+}
+
 if (!QLIST_EMPTY(&child_bs->parents)) {
 error_setg(errp, "The node %s already has a parent",
child_bs->node_name);
diff --git a/block/file-posix.c b/block/file-posix.c
index 6c045eb6e8..8eb0b7bc9b 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -4023,6 +4023,8 @@ static BlockDriver bdrv_zoned_host_device = {
 .format_name = "zoned_host_device",
 .protocol_name = "zoned_host_device",
 .instance_size = sizeof(BDRVRawState),
+.is_zoned = true,
+.supports_zoned_children = true,
 .bdrv_needs_filename = true,
 .bdrv_probe_device  = hdev_probe_device,
 .bdrv_parse_filename = zoned_host_device_parse_filename,
diff --git a/block/raw-format.c b/block/raw-format.c
index 6b20bd22ef..9441536819 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -614,6 +614,7 @@ static void raw_child_perm(BlockDriverState *bs, BdrvChild 
*c,
 BlockDriver bdrv_raw = {
 .format_name  = "raw",
 .instance_size= sizeof(BDRVRawState),
+.supports_zoned_children = true,
 .bdrv_probe   = &raw_probe,
 .bdrv_reopen_prepare  = &raw_reopen_prepare,
 .bdrv_reopen_commit   = &raw_reopen_commit,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index de44c7b6f4..0476cd0491 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -126,6 +126,16 @@ struct BlockDriver {
  */
 bool is_format;
 
+/*
+ * Set to true if the BlockDriver is a zoned block driver.
+ */
+bool is_zoned;
+
+/*
+ * Set to true if the BlockDriver supports zoned children.
+ */
+bool supports_zoned_children;
+
 /*
  * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
  * this field set to true, except ones that are defined only by their
-- 
2.37.1




[RFC v5 03/11] file-posix: introduce get_sysfs_long_val for the long sysfs attribute

2022-07-31 Thread Sam Li
Use sysfs attribute files to get the long value of zoned device
information.

Signed-off-by: Sam Li 
---
 block/file-posix.c | 23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 48cd096624..bcf898f0cb 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1210,15 +1210,19 @@ static int hdev_get_max_hw_transfer(int fd, struct stat 
*st)
 #endif
 }
 
-static int hdev_get_max_segments(int fd, struct stat *st)
-{
+/*
+ * Get zoned device information (chunk_sectors, zoned_append_max_bytes,
+ * max_open_zones, max_active_zones) through sysfs attribute files.
+ */
+static long get_sysfs_long_val(int fd, struct stat *st,
+   const char *attribute) {
 #ifdef CONFIG_LINUX
 char buf[32];
 const char *end;
 char *sysfspath = NULL;
 int ret;
 int sysfd = -1;
-long max_segments;
+long val;
 
 if (S_ISCHR(st->st_mode)) {
 if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
@@ -1231,8 +1235,9 @@ static int hdev_get_max_segments(int fd, struct stat *st)
 return -ENOTSUP;
 }
 
-sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
-major(st->st_rdev), minor(st->st_rdev));
+sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
+major(st->st_rdev), minor(st->st_rdev),
+attribute);
 sysfd = open(sysfspath, O_RDONLY);
 if (sysfd == -1) {
 ret = -errno;
@@ -1250,9 +1255,9 @@ static int hdev_get_max_segments(int fd, struct stat *st)
 }
 buf[ret] = 0;
 /* The file is ended with '\n', pass 'end' to accept that. */
-ret = qemu_strtol(buf, &end, 10, &max_segments);
+ret = qemu_strtol(buf, &end, 10, &val);
 if (ret == 0 && end && *end == '\n') {
-ret = max_segments;
+ret = val;
 }
 
 out:
@@ -1266,6 +1271,10 @@ out:
 #endif
 }
 
+static int hdev_get_max_segments(int fd, struct stat *st) {
+return get_sysfs_long_val(fd, st, "max_segments");
+}
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
 BDRVRawState *s = bs->opaque;
-- 
2.37.1




[RFC v5 05/11] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-07-31 Thread Sam Li
By adding zone management operations in BlockDriver, storage controller
emulation can use the new block layer APIs including Report Zone and
four zone management operations (open, close, finish, reset).

BlockDriver can get zone information from null_blk device by refreshing
BLockLimits.

Signed-off-by: Sam Li 
---
 block/block-backend.c|  47 ++
 block/coroutines.h   |   6 +
 block/file-posix.c   | 272 ++-
 block/io.c   |  57 +++
 include/block/block-common.h |   1 -
 include/block/block-io.h |  13 ++
 include/block/block_int-common.h |  22 ++-
 include/block/raw-aio.h  |   6 +-
 meson.build  |   1 +
 qapi/block-core.json |   7 +-
 10 files changed, 426 insertions(+), 6 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index d4a5df2ac2..ef6a1f33d5 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1775,6 +1775,53 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
 return ret;
 }
 
+/*
+ * Send a zone_report command.
+ * offset is a byte offset from the start of the device. No alignment
+ * required for offset.
+ * nr_zones represents IN maximum and OUT actual.
+ */
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+unsigned int *nr_zones,
+BlockZoneDescriptor *zones)
+{
+int ret;
+IO_CODE();
+
+blk_inc_in_flight(blk); /* increase before waiting */
+blk_wait_while_drained(blk);
+if (!blk_is_available(blk)) {
+return -ENOMEDIUM;
+}
+ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
+blk_dec_in_flight(blk);
+return ret;
+}
+
+/*
+ * Send a zone_management command.
+ * offset is the starting zone specified as a sector offset.
+ * len is the maximum number of sectors the command should operate on.
+ */
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+int64_t offset, int64_t len)
+{
+int ret;
+IO_CODE();
+
+ret = blk_check_byte_request(blk, offset, len);
+if (ret < 0)
+return ret;
+blk_inc_in_flight(blk);
+blk_wait_while_drained(blk);
+if (!blk_is_available(blk)) {
+return -ENOMEDIUM;
+}
+ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
+blk_dec_in_flight(blk);
+return ret;
+}
+
 void blk_drain(BlockBackend *blk)
 {
 BlockDriverState *bs = blk_bs(blk);
diff --git a/block/coroutines.h b/block/coroutines.h
index 3a2bad564f..e3f62d94e5 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -63,6 +63,12 @@ nbd_co_do_establish_connection(BlockDriverState *bs, bool 
blocking,
Error **errp);
 
 
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+unsigned int *nr_zones,
+BlockZoneDescriptor *zones);
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+  int64_t offset, int64_t len);
+
 /*
  * "I/O or GS" API functions. These functions can run without
  * the BQL, but only in one specific iothread/main loop.
diff --git a/block/file-posix.c b/block/file-posix.c
index 0d8b4acdc7..6c045eb6e8 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -67,6 +67,9 @@
 #include 
 #include 
 #include 
+#if defined(CONFIG_BLKZONED)
+#include 
+#endif
 #include 
 #include 
 #include 
@@ -216,6 +219,13 @@ typedef struct RawPosixAIOData {
 PreallocMode prealloc;
 Error **errp;
 } truncate;
+struct {
+unsigned int *nr_zones;
+BlockZoneDescriptor *zones;
+} zone_report;
+struct {
+BlockZoneOp op;
+} zone_mgmt;
 };
 } RawPosixAIOData;
 
@@ -1386,7 +1396,7 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 #endif
 
 if (bs->sg || S_ISBLK(st.st_mode)) {
-int ret = hdev_get_max_hw_transfer(s->fd, &st);
+ret = hdev_get_max_hw_transfer(s->fd, &st);
 
 if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
 bs->bl.max_hw_transfer = ret;
@@ -1402,6 +1412,27 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 if (ret < 0)
 zoned = BLK_Z_NONE;
 bs->bl.zoned = zoned;
+if (zoned != BLK_Z_NONE) {
+ret = get_sysfs_long_val(s->fd, &st, "chunk_sectors");
+if (ret > 0) {
+bs->bl.zone_sectors = ret;
+}
+
+ret = get_sysfs_long_val(s->fd, &st, "zone_append_max_bytes");
+if (ret > 0) {
+bs->bl.zone_append_max_bytes = ret;
+}
+
+ret = get_sysfs_long_val(s->fd, &st, "max_open_zones");
+if (ret > 0) {
+bs->bl.max_open_zones = ret;
+}
+
+ret = get_sysfs_long_val(s->fd, &st, "max_active_zones");
+if (ret > 0) {
+bs->bl.ma

[RFC v5 04/11] file-posix: introduce get_sysfs_str_val for device zoned model

2022-07-31 Thread Sam Li
Use sysfs attribute files to get the string value of device
zoned model. Then get_sysfs_zoned_model can convert it to
BlockZoneModel type in QEMU.

Signed-off-by: Sam Li 
---
 block/file-posix.c   | 86 
 include/block/block_int-common.h |  3 ++
 2 files changed, 89 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index bcf898f0cb..0d8b4acdc7 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1271,6 +1271,85 @@ out:
 #endif
 }
 
+/*
+ * Convert the zoned attribute file in sysfs to internal value.
+ */
+static int get_sysfs_str_val(int fd, struct stat *st,
+  const char *attribute,
+  char **val) {
+#ifdef CONFIG_LINUX
+char buf[32];
+char *sysfspath = NULL;
+int ret, offset;
+int sysfd = -1;
+
+if (S_ISCHR(st->st_mode)) {
+if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
+return ret;
+}
+return -ENOTSUP;
+}
+
+if (!S_ISBLK(st->st_mode)) {
+return -ENOTSUP;
+}
+
+sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
+major(st->st_rdev), minor(st->st_rdev),
+attribute);
+sysfd = open(sysfspath, O_RDONLY);
+if (sysfd == -1) {
+ret = -errno;
+goto out;
+}
+offset = 0;
+do {
+ret = read(sysfd, buf + offset, sizeof(buf) - 1 + offset);
+if (ret > 0) {
+offset += ret;
+}
+} while (ret == -1);
+/* The file is ended with '\n' */
+if (buf[ret - 1] == '\n') {
+buf[ret - 1] = '\0';
+}
+
+if (!strncpy(*val, buf, ret)) {
+goto out;
+}
+
+out:
+if (sysfd != -1) {
+close(sysfd);
+}
+g_free(sysfspath);
+return ret;
+#else
+return -ENOTSUP;
+#endif
+}
+
+static int get_sysfs_zoned_model(int fd, struct stat *st,
+ BlockZoneModel *zoned) {
+g_autofree char *val = NULL;
+val = g_malloc(32);
+get_sysfs_str_val(fd, st, "zoned", &val);
+if (!val) {
+return -ENOTSUP;
+}
+
+if (strcmp(val, "host-managed") == 0) {
+*zoned = BLK_Z_HM;
+} else if (strcmp(val, "host-aware") == 0) {
+*zoned = BLK_Z_HA;
+} else if (strcmp(val, "none") == 0) {
+*zoned = BLK_Z_NONE;
+} else {
+return -ENOTSUP;
+}
+return 0;
+}
+
 static int hdev_get_max_segments(int fd, struct stat *st) {
 return get_sysfs_long_val(fd, st, "max_segments");
 }
@@ -1279,6 +1358,8 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 {
 BDRVRawState *s = bs->opaque;
 struct stat st;
+int ret;
+BlockZoneModel zoned;
 
 s->needs_alignment = raw_needs_alignment(bs);
 raw_probe_alignment(bs, s->fd, errp);
@@ -1316,6 +1397,11 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 bs->bl.max_hw_iov = ret;
 }
 }
+
+ret = get_sysfs_zoned_model(s->fd, &st, &zoned);
+if (ret < 0)
+zoned = BLK_Z_NONE;
+bs->bl.zoned = zoned;
 }
 
 static int check_for_dasd(int fd)
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 8947abab76..7f7863cc9e 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -825,6 +825,9 @@ typedef struct BlockLimits {
 
 /* maximum number of iovec elements */
 int max_iov;
+
+/* device zone model */
+BlockZoneModel zoned;
 } BlockLimits;
 
 typedef struct BdrvOpBlocker BdrvOpBlocker;
-- 
2.37.1




[RFC v5 02/11] include: import virtio_blk headers from linux with zoned storage support

2022-07-31 Thread Sam Li
Add file from Dmitry's "virtio-blk:add support for zoned block devices"
linux patch using scripts/update-linux-headers.sh. There is a link for
more information: https://github.com/dmitry-fomichev/virtblk-zbd

Signed-off-by: Sam Li 
---
 include/standard-headers/linux/virtio_blk.h | 118 
 1 file changed, 118 insertions(+)

diff --git a/include/standard-headers/linux/virtio_blk.h 
b/include/standard-headers/linux/virtio_blk.h
index 2dcc90826a..5c6856aec3 100644
--- a/include/standard-headers/linux/virtio_blk.h
+++ b/include/standard-headers/linux/virtio_blk.h
@@ -40,6 +40,7 @@
 #define VIRTIO_BLK_F_MQ12  /* support more than one vq */
 #define VIRTIO_BLK_F_DISCARD   13  /* DISCARD is supported */
 #define VIRTIO_BLK_F_WRITE_ZEROES  14  /* WRITE ZEROES is supported */
+#define VIRTIO_BLK_F_ZONED 17  /* Zoned block device */
 
 /* Legacy feature bits */
 #ifndef VIRTIO_BLK_NO_LEGACY
@@ -119,6 +120,20 @@ struct virtio_blk_config {
uint8_t write_zeroes_may_unmap;
 
uint8_t unused1[3];
+
+   /* Secure erase fields that are defined in the virtio spec */
+   uint8_t sec_erase[12];
+
+   /* Zoned block device characteristics (if VIRTIO_BLK_F_ZONED) */
+   struct virtio_blk_zoned_characteristics {
+   __virtio32 zone_sectors;
+   __virtio32 max_open_zones;
+   __virtio32 max_active_zones;
+   __virtio32 max_append_sectors;
+   __virtio32 write_granularity;
+   uint8_t model;
+   uint8_t unused2[3];
+   } zoned;
 } QEMU_PACKED;
 
 /*
@@ -153,6 +168,24 @@ struct virtio_blk_config {
 /* Write zeroes command */
 #define VIRTIO_BLK_T_WRITE_ZEROES  13
 
+/* Zone append command */
+#define VIRTIO_BLK_T_ZONE_APPEND15
+
+/* Report zones command */
+#define VIRTIO_BLK_T_ZONE_REPORT16
+
+/* Open zone command */
+#define VIRTIO_BLK_T_ZONE_OPEN  18
+
+/* Close zone command */
+#define VIRTIO_BLK_T_ZONE_CLOSE 20
+
+/* Finish zone command */
+#define VIRTIO_BLK_T_ZONE_FINISH22
+
+/* Reset zone command */
+#define VIRTIO_BLK_T_ZONE_RESET 24
+
 #ifndef VIRTIO_BLK_NO_LEGACY
 /* Barrier before this op. */
 #define VIRTIO_BLK_T_BARRIER   0x8000
@@ -172,6 +205,84 @@ struct virtio_blk_outhdr {
__virtio64 sector;
 };
 
+/*
+ * Supported zoned device models.
+ */
+
+/* Regular block device */
+#define VIRTIO_BLK_Z_NONE  0
+/* Host-managed zoned device */
+#define VIRTIO_BLK_Z_HM1
+/* Host-aware zoned device */
+#define VIRTIO_BLK_Z_HA2
+
+/* ZBD Management Out ALL flag */
+#define VIRTIO_BLK_ZONED_FLAG_ALL  (1 << 0)
+
+/*
+ * Header for VIRTIO_BLK_T_ZONE_OPEN, VIRTIO_BLK_T_ZONE_CLOSE,
+ * VIRTIO_BLK_T_ZONE_RESET, VIRTIO_BLK_T_ZONE_FINISH requests.
+ */
+struct virtio_blk_zone_mgmt_outhdr {
+   /* Zoned request flags */
+   __virtio32 flags;
+};
+
+/*
+ * Zone descriptor. A part of VIRTIO_BLK_T_ZONE_REPORT command reply.
+ */
+struct virtio_blk_zone_descriptor {
+   /* Zone capacity */
+   __virtio64 z_cap;
+   /* The starting sector of the zone */
+   __virtio64 z_start;
+   /* Zone write pointer position in sectors */
+   __virtio64 z_wp;
+   /* Zone type */
+   uint8_t z_type;
+   /* Zone state */
+   uint8_t z_state;
+   uint8_t reserved[38];
+};
+
+struct virtio_blk_zone_report {
+   __virtio64 nr_zones;
+   uint8_t reserved[56];
+   struct virtio_blk_zone_descriptor zones[];
+};
+
+/*
+ * Supported zone types.
+ */
+
+/* Conventional zone */
+#define VIRTIO_BLK_ZT_CONV 1
+/* Sequential Write Required zone */
+#define VIRTIO_BLK_ZT_SWR  2
+/* Sequential Write Preferred zone */
+#define VIRTIO_BLK_ZT_SWP  3
+
+/*
+ * Zone states that are available for zones of all types.
+ */
+
+/* Not a write pointer (conventional zones only) */
+#define VIRTIO_BLK_ZS_NOT_WP   0
+/* Empty */
+#define VIRTIO_BLK_ZS_EMPTY1
+/* Implicitly Open */
+#define VIRTIO_BLK_ZS_IOPEN2
+/* Explicitly Open */
+#define VIRTIO_BLK_ZS_EOPEN3
+/* Closed */
+#define VIRTIO_BLK_ZS_CLOSED   4
+/* Read-Only */
+#define VIRTIO_BLK_ZS_RDONLY   13
+/* Full */
+#define VIRTIO_BLK_ZS_FULL 14
+/* Offline */
+#define VIRTIO_BLK_ZS_OFFLINE  15
+
 /* Unmap this range (only valid for write zeroes command) */
 #define VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP 0x0001
 
@@ -198,4 +309,11 @@ struct virtio_scsi_inhdr {
 #define VIRTIO_BLK_S_OK0
 #define VIRTIO_BLK_S_IOERR 1
 #define VIRTIO_BLK_S_UNSUPP2
+
+/* Error codes that are specific to zoned block devices */
+#define VIRTIO_BLK_S_ZONE_INVALID_CMD 3
+#define VIRTIO_BLK_S_ZONE_UNALIGNED_WP4
+#define VIRTIO_BLK_S_ZONE_OPEN_RESOURCE   5
+#define VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE 6
+
 #endif /* _LINUX_VIRTIO_BLK_H */
-- 
2.37.1




[RFC v5 01/11] include: add zoned device structs

2022-07-31 Thread Sam Li
Signed-off-by: Sam Li 
---
 include/block/block-common.h | 43 
 1 file changed, 43 insertions(+)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index fdb7306e78..c9d28b1c51 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -49,6 +49,49 @@ typedef struct BlockDriver BlockDriver;
 typedef struct BdrvChild BdrvChild;
 typedef struct BdrvChildClass BdrvChildClass;
 
+typedef enum BlockZoneOp {
+BLK_ZO_OPEN,
+BLK_ZO_CLOSE,
+BLK_ZO_FINISH,
+BLK_ZO_RESET,
+} BlockZoneOp;
+
+typedef enum BlockZoneModel {
+BLK_Z_NONE = 0x0, /* Regular block device */
+BLK_Z_HM = 0x1, /* Host-aware zoned block device */
+BLK_Z_HA = 0x2, /* Host-managed zoned block device */
+} BlockZoneModel;
+
+typedef enum BlockZoneCondition {
+BLK_ZS_NOT_WP = 0x0,
+BLK_ZS_EMPTY = 0x1,
+BLK_ZS_IOPEN = 0x2,
+BLK_ZS_EOPEN = 0x3,
+BLK_ZS_CLOSED = 0x4,
+BLK_ZS_RDONLY = 0xD,
+BLK_ZS_FULL = 0xE,
+BLK_ZS_OFFLINE = 0xF,
+} BlockZoneCondition;
+
+typedef enum BlockZoneType {
+BLK_ZT_CONV = 0x1,
+BLK_ZT_SWR = 0x2,
+BLK_ZT_SWP = 0x3,
+} BlockZoneType;
+
+/*
+ * Zone descriptor data structure.
+ * Provide information on a zone with all position and size values in bytes.
+ */
+typedef struct BlockZoneDescriptor {
+uint64_t start;
+uint64_t length;
+uint64_t cap;
+uint64_t wp;
+BlockZoneType type;
+BlockZoneCondition cond;
+} BlockZoneDescriptor;
+
 typedef struct BlockDriverInfo {
 /* in bytes, 0 if irrelevant */
 int cluster_size;
-- 
2.37.1




[RFC v5 00/11] Add support for zoned device

2022-07-31 Thread Sam Li
Zoned Block Devices (ZBDs) devide the LBA space to block regions called zones
that are larger than the LBA size. It can only allow sequential writes, which
reduces write amplification in SSD, leading to higher throughput and increased
capacity. More details about ZBDs can be found at:

https://zonedstorage.io/docs/introduction/zoned-storage

The zoned device support aims to let guests (virtual machines) access zoned
storage devices on the host (hypervisor) through a virtio-blk device. This
involves extending QEMU's block layer and virtio-blk emulation code.  In its
current status, the virtio-blk device is not aware of ZBDs but the guest sees
host-managed drives as regular drive that will runs correctly under the most
common write workloads.

This patch series extend the block layer APIs and virtio-blk emulation section
with the minimum set of zoned commands that are necessary to support zoned
devices. The commands are - Report Zones, four zone operations and Zone Append
(developing).

It can be tested on a null_blk device using qemu-io, qemu-iotests or blkzone(8)
command in the guest os. For example, the command line for zone report using
qemu-io is:

$ path/to/qemu-io --image-opts driver=zoned_host_device,filename=/dev/nullb0 -c
"zrp offset nr_zones"

To enable zoned device in the guest os, the guest kernel must have the 
virtio-blk
driver with ZBDs support. The link to such patches for the kernel is:
https://github.com/dmitry-fomichev/virtblk-zbd

Then, add the following options to the QEMU command line:
-blockdev node-name=drive0,driver=zoned_host_device,filename=/dev/nullb0

After the guest os booting, use blkzone(8) to test zone operations:
blkzone report -o offset -c nr_zones /dev/vda

v5:
- add zoned storage emulation to virtio-blk device
- add documentation for zoned storage
- address review comments
  * fix qemu-iotests
  * fix check to block layer
  * modify interfaces of sysfs helper functions
  * rename zoned device structs according to QEMU styles
  * reorder patches

v4:
- add virtio-blk headers for zoned device
- add configurations for zoned host device
- add zone operations for raw-format
- address review comments
  * fix memory leak bug in zone_report
  * add checks to block layers
  * fix qemu-iotests format
  * fix sysfs helper functions

v3:
- add helper functions to get sysfs attributes
- address review comments
  * fix zone report bugs
  * fix the qemu-io code path
  * use thread pool to avoid blocking ioctl() calls

v2:
- add qemu-io sub-commands
- address review comments
  * modify interfaces of APIs

v1:
- add block layer APIs resembling Linux ZoneBlockDevice ioctls

Sam Li (11):
  include: add zoned device structs
  include: import virtio_blk headers from linux with zoned storage
support
  file-posix: introduce get_sysfs_long_val for the long sysfs attribute
  file-posix: introduce get_sysfs_str_val for device zoned model
  block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  raw-format: add zone operations to pass through requests
  config: add check to block layer
  virtio-blk: add zoned storage APIs for zoned devices
  qemu-io: add zoned block device operations.
  qemu-iotests: test new zone operations
  docs/zoned-storage: add zoned device documentation

 block.c |  13 +
 block/block-backend.c   | 139 +++
 block/coroutines.h  |   6 +
 block/file-posix.c  | 383 +++-
 block/io.c  |  41 +++
 block/raw-format.c  |  14 +
 docs/devel/zoned-storage.rst|  68 
 docs/system/qemu-block-drivers.rst.inc  |   6 +
 hw/block/virtio-blk.c   | 172 -
 include/block/block-common.h|  44 ++-
 include/block/block-io.h|  13 +
 include/block/block_int-common.h|  35 +-
 include/block/raw-aio.h |   6 +-
 include/standard-headers/linux/virtio_blk.h | 118 ++
 include/sysemu/block-backend-io.h   |   6 +
 meson.build |   1 +
 qapi/block-core.json|   7 +-
 qemu-io-cmds.c  | 144 
 tests/qemu-iotests/tests/zoned.out  |  53 +++
 tests/qemu-iotests/tests/zoned.sh   |  86 +
 20 files changed, 1340 insertions(+), 15 deletions(-)
 create mode 100644 docs/devel/zoned-storage.rst
 create mode 100644 tests/qemu-iotests/tests/zoned.out
 create mode 100755 tests/qemu-iotests/tests/zoned.sh

-- 
2.37.1




[PULL 3/3] Hexagon (tests/tcg/hexagon) reference file for float_convd

2022-07-31 Thread Taylor Simpson
The test is in tests/tcg/multiarch/float_convd.c

Signed-off-by: Taylor Simpson 
Acked-by: Richard Henderson 
Message-Id: <20220718230320.2-4-tsimp...@quicinc.com>
---
 tests/tcg/hexagon/float_convd.ref | 988 ++
 1 file changed, 988 insertions(+)
 create mode 100644 tests/tcg/hexagon/float_convd.ref

diff --git a/tests/tcg/hexagon/float_convd.ref 
b/tests/tcg/hexagon/float_convd.ref
new file mode 100644
index 00..aba1e13e35
--- /dev/null
+++ b/tests/tcg/hexagon/float_convd.ref
@@ -0,0 +1,988 @@
+### Rounding to nearest
+from double: f64(nan:0x007ff4)
+  to single: f32(-nan:0x) (INVALID)
+   to int32: -1 (INVALID)
+   to int64: -1 (INVALID)
+  to uint32: -1 (INVALID)
+  to uint64: -1 (INVALID)
+from double: f64(-nan:0x00fff8)
+  to single: f32(-nan:0x) (OK)
+   to int32: -1 (INVALID)
+   to int64: -1 (INVALID)
+  to uint32: -1 (INVALID)
+  to uint64: -1 (INVALID)
+from double: f64(-inf:0x00fff0)
+  to single: f32(-inf:0xff80) (OK)
+   to int32: -2147483648 (INVALID)
+   to int64: -9223372036854775808 (INVALID)
+  to uint32: 0 (INVALID)
+  to uint64: 0 (INVALID)
+from double: f64(-0x1.f000p+1023:0x00ffef)
+  to single: f32(-inf:0xff80) (OVERFLOW INEXACT )
+   to int32: -2147483648 (INVALID)
+   to int64: -9223372036854775808 (INVALID)
+  to uint32: 0 (INVALID)
+  to uint64: 0 (INVALID)
+from double: f64(-0x1.fe00p+127:0x00c7efe000)
+  to single: f32(-0x1.fe00p+127:0xff7f) (OK)
+   to int32: -2147483648 (INVALID)
+   to int64: -9223372036854775808 (INVALID)
+  to uint32: 0 (INVALID)
+  to uint64: 0 (INVALID)
+from double: f64(-0x1.fe00p+127:0x00c7efe000)
+  to single: f32(-0x1.fe00p+127:0xff7f) (OK)
+   to int32: -2147483648 (INVALID)
+   to int64: -9223372036854775808 (INVALID)
+  to uint32: 0 (INVALID)
+  to uint64: 0 (INVALID)
+from double: f64(-0x1.1874b135ff654000p+103:0x00c661874b135ff654)
+  to single: f32(-0x1.1874b200p+103:0xf30c3a59) (INEXACT )
+   to int32: -2147483648 (INVALID)
+   to int64: -9223372036854775808 (INVALID)
+  to uint32: 0 (INVALID)
+  to uint64: 0 (INVALID)
+from double: f64(-0x1.c0bab523323b9000p+99:0x00c62c0bab523323b9)
+  to single: f32(-0x1.c0bab600p+99:0xf1605d5b) (INEXACT )
+   to int32: -2147483648 (INVALID)
+   to int64: -9223372036854775808 (INVALID)
+  to uint32: 0 (INVALID)
+  to uint64: 0 (INVALID)
+from double: f64(-0x1.p+1:0x00c000)
+  to single: f32(-0x1.p+1:0xc000) (OK)
+   to int32: -2 (OK)
+   to int64: -2 (OK)
+  to uint32: 0 (INVALID)
+  to uint64: 0 (INVALID)
+from double: f64(-0x1.p+0:0x00bff0)
+  to single: f32(-0x1.p+0:0xbf80) (OK)
+   to int32: -1 (OK)
+   to int64: -1 (OK)
+  to uint32: 0 (INVALID)
+  to uint64: 0 (INVALID)
+from double: f64(-0x1.p-1022:0x008010)
+  to single: f32(-0x0.p+0:0x8000) (UNDERFLOW INEXACT )
+   to int32: 0 (INEXACT )
+   to int64: 0 (INEXACT )
+  to uint32: 0 (INVALID)
+  to uint64: 0 (INVALID)
+from double: f64(-0x1.p-126:0x00b810)
+  to single: f32(-0x1.p-126:0x8080) (OK)
+   to int32: 0 (INEXACT )
+   to int64: 0 (INEXACT )
+  to uint32: 0 (INVALID)
+  to uint64: 0 (INVALID)
+from double: f64(0x0.p+0:)
+  to single: f32(0x0.p+0:00) (OK)
+   to int32: 0 (OK)
+   to int64: 0 (OK)
+  to uint32: 0 (OK)
+  to uint64: 0 (OK)
+from double: f64(0x1.p-126:0x003810)
+  to single: f32(0x1.p-126:0x0080) (OK)
+   to int32: 0 (INEXACT )
+   to int64: 0 (INEXACT )
+  to uint32: 0 (INEXACT )
+  to uint64: 0 (INEXACT )
+from double: f64(0x1.0001c5f68000p-25:0x003e60001c5f68)
+  to single: f32(0x1.p-25:0x3300) (INEXACT )
+   to int32: 0 (INEXACT )
+   to int64: 0 (INEXACT )
+  to uint32: 0 (INEXACT )
+  to uint64: 0 (INEXACT )
+from double: f64(0x1.e6cb2fa82000p-25:0x003e6e6cb2fa82)
+  to single: f32(0x1.e600p-25:0x3373) (INEXACT )
+   to int32: 0 (INEXACT )
+   to int64: 0 (INEXACT )
+  to uint32: 0 (INEXACT )
+  to uint64: 0 (INEXACT )
+from double: f64(0x1.ff801a9af58a1000p-15:0x003f0ff801a9af58a1)
+  to single: f32(0x1.ff801a00p-15:0x387fc00d) (INEXACT )
+   to int32: 0 (INEXACT )
+   to int64: 0 (INEXACT )
+  to uint32: 0 (INEXACT )
+  to uint64: 0 (INEXACT )
+from double: f64(0x1.0c06a1ef5000p-14:0x003f10c06a1ef5)
+  to single: f32(0x1.0c00p-14:0x3886) (INEXACT )
+   to int32: 0 (INEXACT )
+   to int64: 0 (INEXACT )
+  to uint32: 0 (INEXACT )
+  to uint64: 0 (INEXACT )
+from double: f64(0x1.p+0:0x003ff0)
+  

[PULL 2/3] Hexagon (tests/tcg/hexagon) Fix alignment in load_unpack.c

2022-07-31 Thread Taylor Simpson
The increment used in :brev tests was causing unaligned addresses
Change the increment and the relevant expected values

Signed-off-by: Taylor Simpson 
Acked-by: Richard Henderson 
Message-Id: <20220718230320.2-3-tsimp...@quicinc.com>
---
 tests/tcg/hexagon/load_unpack.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/tests/tcg/hexagon/load_unpack.c b/tests/tcg/hexagon/load_unpack.c
index 3575a37a28..4aa26fc388 100644
--- a/tests/tcg/hexagon/load_unpack.c
+++ b/tests/tcg/hexagon/load_unpack.c
@@ -245,7 +245,7 @@ TEST_pr(loadbsw4_pr, long long, S, 4, 0xff00ff00LL,
  */
 #define BxW_LOAD_pbr(SZ, RES, PTR) \
 __asm__( \
-"r4 = #(1 << (16 - 3))\n\t" \
+"r4 = #(1 << (16 - 4))\n\t" \
 "m0 = r4\n\t" \
 "%0 = mem" #SZ "(%1++m0:brev)\n\t" \
 : "=r"(RES), "+r"(PTR) \
@@ -273,15 +273,15 @@ void test_##NAME(void) \
 }
 
 TEST_pbr(loadbzw2_pbr, int, Z, 0x,
-0x00020081, 0x00060085, 0x00040083, 0x00080087)
+0x00020081, 0x000a0089, 0x00060085, 0x000e008d)
 TEST_pbr(loadbsw2_pbr, int, S, 0xff00,
-0x00020081, 0x00060085, 0x00040083, 0x00080087)
+0x00020081, 0x000aff89, 0x0006ff85, 0x000eff8d)
 TEST_pbr(loadbzw4_pbr, long long, Z, 0xLL,
-0x0004008300020081LL, 0x0008008700060085LL,
-0x0006008500040083LL, 0x000a008900080087LL)
+0x0004008300020081LL, 0x000c008b000a0089LL,
+0x0008008700060085LL, 0x0010008f000e008dLL)
 TEST_pbr(loadbsw4_pbr, long long, S, 0xff00ff00LL,
-0x0004008300020081LL, 0x0008008700060085LL,
-0x0006008500040083LL, 0x000a008900080087LL)
+0x0004008300020081LL, 0x000cff8b000aff89LL,
+0x0008ff870006ff85LL, 0x0010ff8f000eff8dLL)
 
 /*
  
-- 
2.17.1



[PULL 1/3] Hexagon (target/hexagon) make VyV operands use a unique temp

2022-07-31 Thread Taylor Simpson
VyV operand is only used in the vshuff and vdeal instructions.  These
instructions write to both VyV and VxV operands.  In the case where
both operands are the same register, we need a separate location for
VyV.  We use the existing vtmp field in CPUHexagonState.

Test case added in tests/tcg/hexagon/hvx_misc.c

Signed-off-by: Taylor Simpson 
Reviewed-by: Richard Henderson 
Message-Id: <20220718230320.2-2-tsimp...@quicinc.com>
---
 tests/tcg/hexagon/hvx_misc.c| 45 +
 target/hexagon/gen_tcg_funcs.py |  9 +++
 2 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/tests/tcg/hexagon/hvx_misc.c b/tests/tcg/hexagon/hvx_misc.c
index b896f5897e..6e2c9ab3cd 100644
--- a/tests/tcg/hexagon/hvx_misc.c
+++ b/tests/tcg/hexagon/hvx_misc.c
@@ -498,6 +498,49 @@ static void test_vsubuwsat_dv(void)
 check_output_w(__LINE__, 2);
 }
 
+static void test_vshuff(void)
+{
+/* Test that vshuff works when the two operands are the same register */
+const uint32_t splat = 0x089be55c;
+const uint32_t shuff = 0x454fa926;
+MMVector v0, v1;
+
+memset(expect, 0x12, sizeof(MMVector));
+memset(output, 0x34, sizeof(MMVector));
+
+asm volatile("v25 = vsplat(%0)\n\t"
+ "vshuff(v25, v25, %1)\n\t"
+ "vmem(%2 + #0) = v25\n\t"
+ : /* no outputs */
+ : "r"(splat), "r"(shuff), "r"(output)
+ : "v25", "memory");
+
+/*
+ * The semantics of Hexagon are the operands are pass-by-value, so create
+ * two copies of the vsplat result.
+ */
+for (int i = 0; i < MAX_VEC_SIZE_BYTES / 4; i++) {
+v0.uw[i] = splat;
+v1.uw[i] = splat;
+}
+/* Do the vshuff operation */
+for (int offset = 1; offset < MAX_VEC_SIZE_BYTES; offset <<= 1) {
+if (shuff & offset) {
+for (int k = 0; k < MAX_VEC_SIZE_BYTES; k++) {
+if (!(k & offset)) {
+uint8_t tmp = v0.ub[k];
+v0.ub[k] = v1.ub[k + offset];
+v1.ub[k + offset] = tmp;
+}
+}
+}
+}
+/* Put the result in the expect buffer for verification */
+expect[0] = v1;
+
+check_output_b(__LINE__, 1);
+}
+
 int main()
 {
 init_buffers();
@@ -533,6 +576,8 @@ int main()
 test_vadduwsat();
 test_vsubuwsat_dv();
 
+test_vshuff();
+
 puts(err ? "FAIL" : "PASS");
 return err ? 1 : 0;
 }
diff --git a/target/hexagon/gen_tcg_funcs.py b/target/hexagon/gen_tcg_funcs.py
index 1fd9de95d5..d72c689ad7 100755
--- a/target/hexagon/gen_tcg_funcs.py
+++ b/target/hexagon/gen_tcg_funcs.py
@@ -1,7 +1,7 @@
 #!/usr/bin/env python3
 
 ##
-##  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights 
Reserved.
+##  Copyright(c) 2019-2022 Qualcomm Innovation Center, Inc. All Rights 
Reserved.
 ##
 ##  This program is free software; you can redistribute it and/or modify
 ##  it under the terms of the GNU General Public License as published by
@@ -164,7 +164,9 @@ def genptr_decl(f, tag, regtype, regid, regno):
 (regtype, regid, regno))
 f.write("const intptr_t %s%sV_off =\n" % \
 (regtype, regid))
-if (hex_common.is_tmp_result(tag)):
+if (regid == "y"):
+f.write("offsetof(CPUHexagonState, vtmp);\n")
+elif (hex_common.is_tmp_result(tag)):
 f.write("ctx_tmp_vreg_off(ctx, %s%sN, 1, true);\n" % \
 (regtype, regid))
 else:
@@ -379,9 +381,6 @@ def genptr_src_read(f, tag, regtype, regid):
 f.write("vreg_src_off(ctx, %s%sN),\n" % \
  (regtype, regid))
 f.write("sizeof(MMVector), sizeof(MMVector));\n")
-if (not hex_common.skip_qemu_helper(tag)):
-f.write("tcg_gen_addi_ptr(%s%sV, cpu_env, %s%sV_off);\n" % 
\
- (regtype, regid, regtype, regid))
 else:
 print("Bad register parse: ", regtype, regid)
 elif (regtype == "Q"):
-- 
2.17.1



[PULL 0/3] Hexagon bug fixes and test improvements

2022-07-31 Thread Taylor Simpson
The following changes since commit 3916603e0c1d909e14e09d5ebcbdaa9c9e21adf3:

  Merge tag 'pull-la-20220729' of https://gitlab.com/rth7680/qemu into staging 
(2022-07-29 17:39:17 -0700)

are available in the Git repository at:

  https://github.com/quic/qemu tags/pull-hex-20220731

for you to fetch changes up to 7eabb050ea77e529f549ea1ddaaa18e91ae01e34:

  Hexagon (tests/tcg/hexagon) reference file for float_convd (2022-07-31 
16:22:09 -0700)


Hexagon bug fixes and test improvements

1) Fixes a bug in qemu-hexagon
2) Fixes a bug in a test case
3) Adds reference file for float_convd test case


Taylor Simpson (3):
  Hexagon (target/hexagon) make VyV operands use a unique temp
  Hexagon (tests/tcg/hexagon) Fix alignment in load_unpack.c
  Hexagon (tests/tcg/hexagon) reference file for float_convd

 tests/tcg/hexagon/hvx_misc.c  |  45 ++
 tests/tcg/hexagon/load_unpack.c   |  14 +-
 target/hexagon/gen_tcg_funcs.py   |   9 +-
 tests/tcg/hexagon/float_convd.ref | 988 ++
 4 files changed, 1044 insertions(+), 12 deletions(-)
 create mode 100644 tests/tcg/hexagon/float_convd.ref


[PATCH] ipmi:smbus: Add a check around a memcpy

2022-07-31 Thread minyard
From: Corey Minyard 

In one case:

  memcpy(sid->inmsg + sid->inlen, buf, len);

if len == 0 then sid->inmsg + sig->inlen can point to one past the inmsg
array if the array is full.  We have to allow len == 0 due to some
vagueness in the spec, but we don't have to call memcpy.

Found by Coverity.  This is not a problem in practice, but the results
are technically (maybe) undefined.  So make Coverity happy.

Reported-by: Peter Maydell 
Signed-off-by: Corey Minyard 
---
 hw/ipmi/smbus_ipmi.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

I think this should do it.

diff --git a/hw/ipmi/smbus_ipmi.c b/hw/ipmi/smbus_ipmi.c
index 9ef9112dd5..d0991ab7f9 100644
--- a/hw/ipmi/smbus_ipmi.c
+++ b/hw/ipmi/smbus_ipmi.c
@@ -281,7 +281,9 @@ static int ipmi_write_data(SMBusDevice *dev, uint8_t *buf, 
uint8_t len)
  */
 send = true;
 }
-memcpy(sid->inmsg + sid->inlen, buf, len);
+if (len > 0) {
+memcpy(sid->inmsg + sid->inlen, buf, len);
+}
 sid->inlen += len;
 break;
 }
-- 
2.25.1




[Bug 1180923] Re: unused memory filled with 0x00 instead of 0xFF

2022-07-31 Thread David Glover
I know this is expired but it's still a problem in qemu 7.0.0. For
example, when running MS-DOS 6.22, "mem" reports 0K of upper memory, and
memmaker fails to run, complaining that it could not allocate any. I'd
love to know if there's a workaround.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1180923

Title:
  unused memory filled with 0x00 instead of 0xFF

Status in QEMU:
  Expired

Bug description:
  Qemu, ever since it was made (so, since 2003), has this problem in DOS
  (either PC-DOS or MS-DOS and partly Windows 9x) not recognizing the
  memory available when the memory is filled with 0x00 but when it is
  filled with 0xFF it gets recognized properly, where should I patch
  qemu to solve this memory problem?

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1180923/+subscriptions




Re: virtio: why no full reset on virtio_set_status 0 ?

2022-07-31 Thread Claudio Fontana
On 7/29/22 16:00, Claudio Fontana wrote:
> On 7/29/22 15:21, Alex Bennée wrote:
>>
>> Claudio Fontana  writes:
>>
>>> On 7/29/22 12:13, Michael S. Tsirkin wrote:
 On Fri, Jul 29, 2022 at 11:46:05AM +0200, Claudio Fontana wrote:
>>> @@ -2025,7 +2031,6 @@ void virtio_reset(void *opaque)
>>>  VirtioDeviceClass *k = VIRTIO_DEVICE_GET_CLASS(vdev);
>>>  int i;
>>>  
>>> -virtio_set_status(vdev, 0);
>>>  if (current_cpu) {
>>>  /* Guest initiated reset */
>>>  vdev->device_endian = virtio_current_cpu_endian();
>>> -- 
>>> 2.26.2
>>
>> As you say this is incomplete ... bout could you share a bit more
>> of what issue does this address?
>>
>
> Hi, the problem I am trying to address is a segfault in OVS/dpdk that 
> looks like this:

 Sorry I was not clear. What I mean is, you don't yet know why does removing
 virtio_set_status call here prevent the crash in ovs, do you?

>>>
>>> I have no idea. Trying to collect logs to figure things out, but as
>>> mentioned the logs easily hide the issue.
>>> Likely there is just more to study here.
>>
>> Given the OVS is going off on a NULL ptr deref could it just be it's not
>> handling the disabling/reenabling of the virtqueues as you pause and
>> restart properly? I could certainly imagine a backend jumping the gun to
>> read a queue going very wrong if the current queue state is disabled.
>>
> 
> In this case both the ovs buf_addr and buf_iova are NULL, which is a nice 
> case as they are more detectable,
> however I also have segfaults where the addresses are just garbage.
> 
> I wonder whether it's possible that given the fact that the guest is going 
> away without notification (SIGKILL),
> as the guest driver resets the device and communicates with QEMU, QEMU adapts 
> the state without notifying ovs,
> so ovs happily tries to dequeue data from memory that isn't there. But I am 
> just guessing.
> 
> I am still studying the qemu vhost user side and ovs/dpdk side to try to 
> understand how this whole thing works.
> 
> Thanks,
> 
> CLaudio
> 

I am pursuing this as a DPDK library issue.

It would be cool to have ovs, dpdk and vhost-user with the default test-pmd 
application somehow hooked up in a basic test
in one of these projects..

Thanks,

Claudio





Re: virtio: why no full reset on virtio_set_status 0 ?

2022-07-31 Thread Claudio Fontana
On 7/28/22 12:24, Cornelia Huck wrote:
> On Thu, Jul 28 2022, Claudio Fontana  wrote:
> 
>> On 7/28/22 09:43, Claudio Fontana wrote:
>>> On 7/28/22 03:27, Jason Wang wrote:
 On Wed, Jul 27, 2022 at 11:32 PM Michael S. Tsirkin  
 wrote:
>
> On Wed, Jul 27, 2022 at 12:51:31PM +0200, Claudio Fontana wrote:
>> Hi Michael and all,
>>
>> I have started researching a qemu / ovs / dpdk bug:
>>
>> https://inbox.dpdk.org/dev/322122fb-619d-96f6-5c3e-9eabdbf38...@redhat.com/T/
>>
>> that seems to be affecting multiple parties in the telco space,
>>
>> and during this process I noticed that qemu/hw/virtio/virtio.c does not 
>> do a full virtio reset
>> in virtio_set_status, when receiving a status value of 0.
>>
>> It seems it has always been this way, so I am clearly missing / 
>> forgetting something basic,
>>
>> I checked the virtio spec at https://docs.oasis-open.org/
>>
>> and from:
>>
>> "
>> 4.1.4.3 Common configuration structure layout
>>
>> device_status
>> The driver writes the device status here (see 2.1). Writing 0 into this 
>> field resets the device.
>>
>> "
>>
>> and
>>
>> "
>> 2.4.1 Device Requirements: Device Reset
>> A device MUST reinitialize device status to 0 after receiving a reset.
>> "
> 
> Side note: We can also have a reset without writing 0 to the device
> status (RESET ccw on the virtio-ccw transport).
> 
>>
>> I would conclude that in virtio.c::virtio_set_status we should 
>> unconditionally do a full virtio_reset.
>>
>> Instead, we have just the check:
>>
>> if ((vdev->status & VIRTIO_CONFIG_S_DRIVER_OK) !=
>> (val & VIRTIO_CONFIG_S_DRIVER_OK)) {
>> virtio_set_started(vdev, val & VIRTIO_CONFIG_S_DRIVER_OK);
>> }
>>
>> which just sets the started field,
>>
>> and then we have the call to the virtio device class set_status 
>> (virtio_net...),
>> but the VirtioDevice is not fully reset, as per the virtio_reset() call 
>> we are missing:
>>
>> "
>> vdev->start_on_kick = false;
>> vdev->started = false;
>> vdev->broken = false;
>> vdev->guest_features = 0;
>> vdev->queue_sel = 0;
>> vdev->status = 0;
>> vdev->disabled = false;
>> qatomic_set(&vdev->isr, 0);
>> vdev->config_vector = VIRTIO_NO_VECTOR;
>> virtio_notify_vector(vdev, vdev->config_vector);
>>
>> for(i = 0; i < VIRTIO_QUEUE_MAX; i++) {
>> ... initialize vdev->vq[i] ...
>> }
>> "
>>
>> Doing a full reset seems to fix the problem for me, so I can send 
>> tentative patches if necessary,
>> but what am I missing here?
>>
>> Thanks,
>>
>> Claudio
>>
>> --
>> Claudio Fontana
>> Engineering Manager Virtualization, SUSE Labs Core
>>
>> SUSE Software Solutions Italy Srl
>
>
> So for example for pci:
>
> case VIRTIO_PCI_STATUS:
>
>
> 
>
> if (vdev->status == 0) {
> virtio_pci_reset(DEVICE(proxy));
> }
> 
> FWIW, ccw ends up calling virtio_ccw_reset_virtio() when the driver
> issues a reset command, or when it issues a write status 0 command, or
> when the generic reset function is invoked.
> 
>
> which I suspect is a bug because:
>
> static void virtio_pci_reset(DeviceState *qdev)
> {
> VirtIOPCIProxy *proxy = VIRTIO_PCI(qdev);
> VirtioBusState *bus = VIRTIO_BUS(&proxy->bus);
> PCIDevice *dev = PCI_DEVICE(qdev);
> int i;
>
> virtio_bus_reset(bus);

 Note that we do virtio_reset() here.
>>>
>>>
>>> Yes, thank you, I completely overlooked it, I noticed this in Michael's 
>>> response as well.
>>>
>>> However we end up with multiple calls to k->set_status, one from the 
>>> virtio_set_status call,
>>> and one from the virtio_bus_reset(), which is probably something we don't 
>>> want.
>>>
>>> All in all it is not clear what the meaning of virtio_set_status is 
>>> supposed to be I think,
>>> and I wonder what the assumptions are among all the callers.
>>> If it is supposed to be an implementation of the virtio standard field as 
>>> described, I think we should do the reset right then and there,
>>> but maybe the true meaning of the function is another one I couldn't 
>>> understand, since _some_ of the cases are processes there.
> 
> Hm. Maybe there needs to be a distinction between "we're forwarding the
> status setting by the driver to the core, take any appropriate action"
> and "we've just reset the device, now we just need to zero out the
> status field"?

Right, and the reset function of virtio already sets ->status to 0 manually as 
part of the reset.

Fundamentally, the only issue I am seeing in qemu is this semantic thing,

and the fact that the virtio devic

[PATCH v2 3/5] tests/acpi: allow changes for core_count2 test

2022-07-31 Thread Julia Suvorova
Signed-off-by: Julia Suvorova 
---
 tests/qtest/bios-tables-test-allowed-diff.h | 3 +++
 tests/data/acpi/q35/APIC.core-count2| 0
 tests/data/acpi/q35/DSDT.core-count2| 0
 tests/data/acpi/q35/FACP.core-count2| 0
 4 files changed, 3 insertions(+)
 create mode 100644 tests/data/acpi/q35/APIC.core-count2
 create mode 100644 tests/data/acpi/q35/DSDT.core-count2
 create mode 100644 tests/data/acpi/q35/FACP.core-count2

diff --git a/tests/qtest/bios-tables-test-allowed-diff.h 
b/tests/qtest/bios-tables-test-allowed-diff.h
index dfb8523c8b..e81dc67a2e 100644
--- a/tests/qtest/bios-tables-test-allowed-diff.h
+++ b/tests/qtest/bios-tables-test-allowed-diff.h
@@ -1 +1,4 @@
 /* List of comma-separated changed AML files to ignore */
+"tests/data/acpi/q35/APIC.core-count2",
+"tests/data/acpi/q35/DSDT.core-count2",
+"tests/data/acpi/q35/FACP.core-count2",
diff --git a/tests/data/acpi/q35/APIC.core-count2 
b/tests/data/acpi/q35/APIC.core-count2
new file mode 100644
index 00..e69de29bb2
diff --git a/tests/data/acpi/q35/DSDT.core-count2 
b/tests/data/acpi/q35/DSDT.core-count2
new file mode 100644
index 00..e69de29bb2
diff --git a/tests/data/acpi/q35/FACP.core-count2 
b/tests/data/acpi/q35/FACP.core-count2
new file mode 100644
index 00..e69de29bb2
-- 
2.35.3




[PATCH v2 2/5] bios-tables-test: teach test to use smbios 3.0 tables

2022-07-31 Thread Julia Suvorova
Introduce the 64-bit entry point. Since we no longer have a total
number of structures, stop checking for the new ones at the EOF
structure (type 127).

Signed-off-by: Julia Suvorova 
---
 tests/qtest/bios-tables-test.c | 95 +-
 1 file changed, 71 insertions(+), 24 deletions(-)

diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c
index 359916c228..e352d5249f 100644
--- a/tests/qtest/bios-tables-test.c
+++ b/tests/qtest/bios-tables-test.c
@@ -88,8 +88,8 @@ typedef struct {
 uint64_t rsdp_addr;
 uint8_t rsdp_table[36 /* ACPI 2.0+ RSDP size */];
 GArray *tables;
-uint32_t smbios_ep_addr;
-struct smbios_21_entry_point smbios_ep_table;
+uint64_t smbios_ep_addr[SMBIOS_ENTRY_POINT_TYPE__MAX];
+SmbiosEntryPoint smbios_ep_table;
 uint16_t smbios_cpu_max_speed;
 uint16_t smbios_cpu_curr_speed;
 uint8_t *required_struct_types;
@@ -533,10 +533,9 @@ static void test_acpi_asl(test_data *data)
 free_test_data(&exp_data);
 }
 
-static bool smbios_ep_table_ok(test_data *data)
+static bool smbios_ep2_table_ok(test_data *data, uint32_t addr)
 {
-struct smbios_21_entry_point *ep_table = &data->smbios_ep_table;
-uint32_t addr = data->smbios_ep_addr;
+struct smbios_21_entry_point *ep_table = &data->smbios_ep_table.ep21;
 
 qtest_memread(data->qts, addr, ep_table, sizeof(*ep_table));
 if (memcmp(ep_table->anchor_string, "_SM_", 4)) {
@@ -559,13 +558,29 @@ static bool smbios_ep_table_ok(test_data *data)
 return true;
 }
 
-static void test_smbios_entry_point(test_data *data)
+static bool smbios_ep3_table_ok(test_data *data, uint64_t addr)
+{
+struct smbios_30_entry_point *ep_table = &data->smbios_ep_table.ep30;
+
+qtest_memread(data->qts, addr, ep_table, sizeof(*ep_table));
+if (memcmp(ep_table->anchor_string, "_SM3_", 5)) {
+return false;
+}
+
+if (acpi_calc_checksum((uint8_t *)ep_table, sizeof *ep_table)) {
+return false;
+}
+
+return true;
+}
+
+static SmbiosEntryPointType test_smbios_entry_point(test_data *data)
 {
 uint32_t off;
 
 /* find smbios entry point structure */
 for (off = 0xf; off < 0x10; off += 0x10) {
-uint8_t sig[] = "_SM_";
+uint8_t sig[] = "_SM_", sig3[] = "_SM3_";
 int i;
 
 for (i = 0; i < sizeof sig - 1; ++i) {
@@ -574,14 +589,30 @@ static void test_smbios_entry_point(test_data *data)
 
 if (!memcmp(sig, "_SM_", sizeof sig)) {
 /* signature match, but is this a valid entry point? */
-data->smbios_ep_addr = off;
-if (smbios_ep_table_ok(data)) {
+if (smbios_ep2_table_ok(data, off)) {
+data->smbios_ep_addr[SMBIOS_ENTRY_POINT_TYPE_32] = off;
+}
+}
+
+for (i = 0; i < sizeof sig3 - 1; ++i) {
+sig3[i] = qtest_readb(data->qts, off + i);
+}
+
+if (!memcmp(sig3, "_SM3_", sizeof sig3)) {
+if (smbios_ep3_table_ok(data, off)) {
+data->smbios_ep_addr[SMBIOS_ENTRY_POINT_TYPE_64] = off;
+/* found 64-bit entry point, no need to look for 32-bit one */
 break;
 }
 }
 }
 
-g_assert_cmphex(off, <, 0x10);
+/* found at least one entry point */
+g_assert_true(data->smbios_ep_addr[SMBIOS_ENTRY_POINT_TYPE_32] ||
+  data->smbios_ep_addr[SMBIOS_ENTRY_POINT_TYPE_64]);
+
+return data->smbios_ep_addr[SMBIOS_ENTRY_POINT_TYPE_64] ?
+   SMBIOS_ENTRY_POINT_TYPE_64 : SMBIOS_ENTRY_POINT_TYPE_32;
 }
 
 static inline bool smbios_single_instance(uint8_t type)
@@ -625,16 +656,23 @@ static bool smbios_cpu_test(test_data *data, uint32_t 
addr)
 return true;
 }
 
-static void test_smbios_structs(test_data *data)
+static void test_smbios_structs(test_data *data, SmbiosEntryPointType ep_type)
 {
 DECLARE_BITMAP(struct_bitmap, SMBIOS_MAX_TYPE+1) = { 0 };
-struct smbios_21_entry_point *ep_table = &data->smbios_ep_table;
-uint32_t addr = le32_to_cpu(ep_table->structure_table_address);
-int i, len, max_len = 0;
+
+SmbiosEntryPoint *ep_table = &data->smbios_ep_table;
+int i = 0, len, max_len = 0;
 uint8_t type, prv, crt;
+uint64_t addr;
+
+if (ep_type == SMBIOS_ENTRY_POINT_TYPE_32) {
+addr = le32_to_cpu(ep_table->ep21.structure_table_address);
+} else {
+addr = le64_to_cpu(ep_table->ep30.structure_table_address);
+}
 
 /* walk the smbios tables */
-for (i = 0; i < le16_to_cpu(ep_table->number_of_structures); i++) {
+do {
 
 /* grab type and formatted area length from struct header */
 type = qtest_readb(data->qts, addr);
@@ -660,19 +698,28 @@ static void test_smbios_structs(test_data *data)
 }
 
 /* keep track of max. struct size */
-if (max_len < len) {
+if (ep_type == SMBIOS_ENTRY_POINT_TYPE_32 && max_len < len) {
 max_len = len;
-g_assert_c

[PATCH v2 4/5] bios-tables-test: add test for number of cores > 255

2022-07-31 Thread Julia Suvorova
The new test is run with a large number of cpus and checks if the
core_count field in smbios_cpu_test (structure type 4) is correct.

Choose q35 as it allows to run with -smp > 255.

Signed-off-by: Julia Suvorova 
---
 tests/qtest/bios-tables-test.c | 53 +-
 1 file changed, 40 insertions(+), 13 deletions(-)

diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c
index e352d5249f..cebfa430ac 100644
--- a/tests/qtest/bios-tables-test.c
+++ b/tests/qtest/bios-tables-test.c
@@ -92,6 +92,8 @@ typedef struct {
 SmbiosEntryPoint smbios_ep_table;
 uint16_t smbios_cpu_max_speed;
 uint16_t smbios_cpu_curr_speed;
+uint8_t smbios_core_count;
+uint16_t smbios_core_count2;
 uint8_t *required_struct_types;
 int required_struct_types_len;
 QTestState *qts;
@@ -631,29 +633,38 @@ static inline bool smbios_single_instance(uint8_t type)
 }
 }
 
-static bool smbios_cpu_test(test_data *data, uint32_t addr)
+static void smbios_cpu_test(test_data *data, uint32_t addr)
 {
-uint16_t expect_speed[2];
-uint16_t real;
+uint8_t core_count, expected_core_count = data->smbios_core_count;
+uint16_t speed, core_count2, expected_core_count2 = 
data->smbios_core_count2;
+uint16_t expected_speed[2];
 int offset[2];
 int i;
 
 /* Check CPU speed for backward compatibility */
 offset[0] = offsetof(struct smbios_type_4, max_speed);
 offset[1] = offsetof(struct smbios_type_4, current_speed);
-expect_speed[0] = data->smbios_cpu_max_speed ? : 2000;
-expect_speed[1] = data->smbios_cpu_curr_speed ? : 2000;
+expected_speed[0] = data->smbios_cpu_max_speed ? : 2000;
+expected_speed[1] = data->smbios_cpu_curr_speed ? : 2000;
 
 for (i = 0; i < 2; i++) {
-real = qtest_readw(data->qts, addr + offset[i]);
-if (real != expect_speed[i]) {
-fprintf(stderr, "Unexpected SMBIOS CPU speed: real %u expect %u\n",
-real, expect_speed[i]);
-return false;
-}
+speed = qtest_readw(data->qts, addr + offset[i]);
+g_assert_cmpuint(speed, ==, expected_speed[i]);
 }
 
-return true;
+core_count = qtest_readb(data->qts,
+ addr + offsetof(struct smbios_type_4, 
core_count));
+core_count2 = qtest_readw(data->qts,
+  addr + offsetof(struct smbios_type_4, 
core_count2));
+
+if (expected_core_count) {
+g_assert_cmpuint(core_count, ==, expected_core_count);
+}
+
+/* Core Count has reached its limit, checking Core Count 2 */
+if (expected_core_count == 0xFF && expected_core_count2) {
+g_assert_cmpuint(core_count2, ==, expected_core_count2);
+}
 }
 
 static void test_smbios_structs(test_data *data, SmbiosEntryPointType ep_type)
@@ -686,7 +697,7 @@ static void test_smbios_structs(test_data *data, 
SmbiosEntryPointType ep_type)
 set_bit(type, struct_bitmap);
 
 if (type == 4) {
-g_assert(smbios_cpu_test(data, addr));
+smbios_cpu_test(data, addr);
 }
 
 /* seek to end of unformatted string area of this struct ("\0\0") */
@@ -903,6 +914,21 @@ static void test_acpi_q35_tcg(void)
 free_test_data(&data);
 }
 
+static void test_acpi_q35_tcg_core_count2(void)
+{
+test_data data = {
+.machine = MACHINE_Q35,
+.variant = ".core-count2",
+.required_struct_types = base_required_struct_types,
+.required_struct_types_len = ARRAY_SIZE(base_required_struct_types),
+.smbios_core_count = 0xFF,
+.smbios_core_count2 = 275,
+};
+
+test_acpi_one("-machine smbios-entry-point-type=64 -smp 275", &data);
+free_test_data(&data);
+}
+
 static void test_acpi_q35_tcg_bridge(void)
 {
 test_data data;
@@ -1822,6 +1848,7 @@ int main(int argc, char *argv[])
 qtest_add_func("acpi/piix4/pci-hotplug/off",
test_acpi_piix4_no_acpi_pci_hotplug);
 qtest_add_func("acpi/q35", test_acpi_q35_tcg);
+qtest_add_func("acpi/q35/core-count2", test_acpi_q35_tcg_core_count2);
 qtest_add_func("acpi/q35/bridge", test_acpi_q35_tcg_bridge);
 qtest_add_func("acpi/q35/multif-bridge", test_acpi_q35_multif_bridge);
 qtest_add_func("acpi/q35/mmio64", test_acpi_q35_tcg_mmio64);
-- 
2.35.3




[PATCH v2 5/5] tests/acpi: update tables for new core count test

2022-07-31 Thread Julia Suvorova
Changes in the tables (for 275 cores):
FACP:
+ Use APIC Cluster Model (V4) : 1

APIC:
+[02Ch 0044   1]Subtable Type : 00 [Processor Local APIC]
+[02Dh 0045   1]   Length : 08
+[02Eh 0046   1] Processor ID : 00
+[02Fh 0047   1]Local Apic ID : 00
+[030h 0048   4]Flags (decoded below) : 0001
+   Processor Enabled : 1
...
+
+[81Ch 2076   1]Subtable Type : 00 [Processor Local APIC]
+[81Dh 2077   1]   Length : 08
+[81Eh 2078   1] Processor ID : FE
+[81Fh 2079   1]Local Apic ID : FE
+[820h 2080   4]Flags (decoded below) : 0001
+   Processor Enabled : 1
+  Runtime Online Capable : 0
+
+[824h 2084   1]Subtable Type : 09 [Processor Local x2APIC]
+[825h 2085   1]   Length : 10
+[826h 2086   2] Reserved : 
+[828h 2088   4]  Processor x2Apic ID : 00FF
+[82Ch 2092   4]Flags (decoded below) : 0001
+   Processor Enabled : 1
+[830h 2096   4]Processor UID : 00FF
...

DSDT:
+Processor (C001, 0x01, 0x, 0x00)
+{
+Method (_STA, 0, Serialized)  // _STA: Status
+{
+Return (CSTA (One))
+}
+
+Name (_MAT, Buffer (0x08)  // _MAT: Multiple APIC Table Entry
+{
+ 0x00, 0x08, 0x01, 0x01, 0x01, 0x00, 0x00, 0x00   // 

+})
+Method (_EJ0, 1, NotSerialized)  // _EJx: Eject Device, x=0-9
+{
+CEJ0 (One)
+}
+
+Method (_OST, 3, Serialized)  // _OST: OSPM Status Indication
+{
+COST (One, Arg0, Arg1, Arg2)
+}
+}
...
+Processor (C0FE, 0xFE, 0x, 0x00)
+{
+Method (_STA, 0, Serialized)  // _STA: Status
+{
+Return (CSTA (0xFE))
+}
+
+Name (_MAT, Buffer (0x08)  // _MAT: Multiple APIC Table Entry
+{
+ 0x00, 0x08, 0xFE, 0xFE, 0x01, 0x00, 0x00, 0x00   // 

+})
+Method (_EJ0, 1, NotSerialized)  // _EJx: Eject Device, x=0-9
+{
+CEJ0 (0xFE)
+}
+
+Method (_OST, 3, Serialized)  // _OST: OSPM Status Indication
+{
+COST (0xFE, Arg0, Arg1, Arg2)
+}
+}
+
+Device (C0FF)
+{
+Name (_HID, "ACPI0007" /* Processor Device */)  // _HID: 
Hardware ID
+Name (_UID, 0xFF)  // _UID: Unique ID
+Method (_STA, 0, Serialized)  // _STA: Status
+{
+Return (CSTA (0xFF))
+}
+
+Name (_MAT, Buffer (0x10)  // _MAT: Multiple APIC Table Entry
+{
+/*  */  0x09, 0x10, 0x00, 0x00, 0xFF, 0x00, 0x00, 
0x00,  // 
+/* 0008 */  0x01, 0x00, 0x00, 0x00, 0xFF, 0x00, 0x00, 0x00 
  // 
+})
+Method (_EJ0, 1, NotSerialized)  // _EJx: Eject Device, x=0-9
+{
+CEJ0 (0xFF)
+}
+
+Method (_OST, 3, Serialized)  // _OST: OSPM Status Indication
+{
+COST (0xFF, Arg0, Arg1, Arg2)
+}
+}
+
...

Signed-off-by: Julia Suvorova 
---
 tests/qtest/bios-tables-test-allowed-diff.h |   3 ---
 tests/data/acpi/q35/APIC.core-count2| Bin 0 -> 2478 bytes
 tests/data/acpi/q35/DSDT.core-count2| Bin 0 -> 32414 bytes
 tests/data/acpi/q35/FACP.core-count2| Bin 0 -> 244 bytes
 4 files changed, 3 deletions(-)

diff --git a/tests/qtest/bios-tables-test-allowed-diff.h 
b/tests/qtest/bios-tables-test-allowed-diff.h
index e81dc67a2e..dfb8523c8b 100644
--- a/tests/qtest/bios-tables-test-allowed-diff.h
+++ b/tests/qtest/bios-tables-test-allowed-diff.h
@@ -1,4 +1 @@
 /* List of comma-separated changed AML files to ignore */
-"tests/data/acpi/q35/APIC.core-count2",
-"tests/data/acpi/q35/DSDT.core-count2",
-"tests/data/acpi/q35/FACP.core-count2",
diff --git a/tests/data/acpi/q35/APIC.core-count2 
b/tests/data/acpi/q35/APIC.core-count2
index 
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..a255082ef5bc39f0d92d3e372b91f09dd6d0d9a1
 100644
GIT binary patch
literal 2478
zcmXZeWl$AS7=Youz=a#M-K}6ZwgLuNAQ;$~*xjvQcY@uCVs{~Sf`WpLVt2Rbe!ORA
z_B`J^b9R56*&pj2=M2p(=#;b`y@>U1KQZ2
ztu5Nwq0xx;_UPb%CKH;?XtAKxijI!xf-7uzGc@Q3Gq%
z#9Fnmc5SRv2fe+~#|M4+PE2*{()H?L{rcFT0s8r&zdtr?h>aRyuWjcwXs+qT%Q9ky?e9Xepgju;w>ojPIX&e)|3
zcI}GYx?%V37#4;-dSK6<*sB-z?u~u=VBfyjuOIgBj{^qaz=1eu5Dp%ULx$

[PATCH v2 0/5] hw/smbios: add core_count2 to smbios table type 4

2022-07-31 Thread Julia Suvorova
The SMBIOS 3.0 specification provides the ability to reflect over
255 cores. The 64-bit entry point has been used for a while, but
structure type 4 has not been updated before, so the dmidecode output
looked like this (-smp 280):

Handle 0x0400, DMI type 4, 42 bytes
Processor Information
...
Core Count: 24
Core Enabled: 24
Thread Count: 1
...

Big update in the bios-tables-test as it couldn't work with SMBIOS 3.0.

v2:
* generate tables type 4 of different sizes based on the
  selected smbios version
* use SmbiosEntryPoint* types instead of creating new constants
* refactor smbios_cpu_test [Igor, Ani]
* clarify signature check [Igor]
* add comments with specifications and clarification of the structure loop 
[Ani]

Julia Suvorova (5):
  hw/smbios: add core_count2 to smbios table type 4
  bios-tables-test: teach test to use smbios 3.0 tables
  tests/acpi: allow changes for core_count2 test
  bios-tables-test: add test for number of cores > 255
  tests/acpi: update tables for new core count test

 hw/smbios/smbios_build.h |   9 +-
 include/hw/firmware/smbios.h |  11 ++
 hw/smbios/smbios.c   |  18 +++-
 tests/qtest/bios-tables-test.c   | 148 ---
 tests/data/acpi/q35/APIC.core-count2 | Bin 0 -> 2478 bytes
 tests/data/acpi/q35/DSDT.core-count2 | Bin 0 -> 32414 bytes
 tests/data/acpi/q35/FACP.core-count2 | Bin 0 -> 244 bytes
 7 files changed, 144 insertions(+), 42 deletions(-)
 create mode 100644 tests/data/acpi/q35/APIC.core-count2
 create mode 100644 tests/data/acpi/q35/DSDT.core-count2
 create mode 100644 tests/data/acpi/q35/FACP.core-count2

-- 
2.35.3




[PATCH v2 1/5] hw/smbios: add core_count2 to smbios table type 4

2022-07-31 Thread Julia Suvorova
In order to use the increased number of cpus, we need to bring smbios
tables in line with the SMBIOS 3.0 specification. This allows us to
introduce core_count2 which acts as a duplicate of core_count if we have
fewer cores than 256, and contains the actual core number per socket if
we have more.

core_enabled2 and thread_count2 fields work the same way.

Signed-off-by: Julia Suvorova 
---
 hw/smbios/smbios_build.h |  9 +++--
 include/hw/firmware/smbios.h | 11 +++
 hw/smbios/smbios.c   | 18 +++---
 3 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/hw/smbios/smbios_build.h b/hw/smbios/smbios_build.h
index 56b5a1e3f3..351660024e 100644
--- a/hw/smbios/smbios_build.h
+++ b/hw/smbios/smbios_build.h
@@ -27,6 +27,11 @@ extern unsigned smbios_table_max;
 extern unsigned smbios_table_cnt;
 
 #define SMBIOS_BUILD_TABLE_PRE(tbl_type, tbl_handle, tbl_required)\
+SMBIOS_BUILD_TABLE_PRE_SIZE(tbl_type, tbl_handle, tbl_required,   \
+sizeof(struct smbios_type_##tbl_type))\
+
+#define SMBIOS_BUILD_TABLE_PRE_SIZE(tbl_type, tbl_handle, \
+tbl_required, tbl_len)\
 struct smbios_type_##tbl_type *t; \
 size_t t_off; /* table offset into smbios_tables */   \
 int str_index = 0;\
@@ -39,12 +44,12 @@ extern unsigned smbios_table_cnt;
 /* use offset of table t within smbios_tables */  \
 /* (pointer must be updated after each realloc) */\
 t_off = smbios_tables_len;\
-smbios_tables_len += sizeof(*t);  \
+smbios_tables_len += tbl_len; \
 smbios_tables = g_realloc(smbios_tables, smbios_tables_len);  \
 t = (struct smbios_type_##tbl_type *)(smbios_tables + t_off); \
   \
 t->header.type = tbl_type;\
-t->header.length = sizeof(*t);\
+t->header.length = tbl_len;   \
 t->header.handle = cpu_to_le16(tbl_handle);   \
 } while (0)
 
diff --git a/include/hw/firmware/smbios.h b/include/hw/firmware/smbios.h
index 4b7ad77a44..56f7bf0fea 100644
--- a/include/hw/firmware/smbios.h
+++ b/include/hw/firmware/smbios.h
@@ -18,6 +18,8 @@
 
 
 #define SMBIOS_MAX_TYPE 127
+#define offsetofend(TYPE, MEMBER) \
+   (offsetof(TYPE, MEMBER) + sizeof_field(TYPE, MEMBER))
 
 /* memory area description, used by type 19 table */
 struct smbios_phys_mem_area {
@@ -187,8 +189,17 @@ struct smbios_type_4 {
 uint8_t thread_count;
 uint16_t processor_characteristics;
 uint16_t processor_family2;
+/* SMBIOS spec 3.0.0, Table 21 */
+uint16_t core_count2;
+uint16_t core_enabled2;
+uint16_t thread_count2;
 } QEMU_PACKED;
 
+typedef enum smbios_type_4_len_ver {
+SMBIOS_TYPE_4_LEN_V28 = offsetofend(struct smbios_type_4, 
processor_family2),
+SMBIOS_TYPE_4_LEN_V30 = offsetofend(struct smbios_type_4, thread_count2),
+} smbios_type_4_len_ver;
+
 /* SMBIOS type 11 - OEM strings */
 struct smbios_type_11 {
 struct smbios_structure_header header;
diff --git a/hw/smbios/smbios.c b/hw/smbios/smbios.c
index 60349ee402..657093e5f6 100644
--- a/hw/smbios/smbios.c
+++ b/hw/smbios/smbios.c
@@ -681,8 +681,13 @@ static void smbios_build_type_3_table(void)
 static void smbios_build_type_4_table(MachineState *ms, unsigned instance)
 {
 char sock_str[128];
+size_t tbl_len = SMBIOS_TYPE_4_LEN_V28;
 
-SMBIOS_BUILD_TABLE_PRE(4, T4_BASE + instance, true); /* required */
+if (smbios_ep_type == SMBIOS_ENTRY_POINT_TYPE_64) {
+tbl_len = SMBIOS_TYPE_4_LEN_V30;
+}
+
+SMBIOS_BUILD_TABLE_PRE_SIZE(4, T4_BASE + instance, true, tbl_len); /* 
required */
 
 snprintf(sock_str, sizeof(sock_str), "%s%2x", type4.sock_pfx, instance);
 SMBIOS_TABLE_SET_STR(4, socket_designation_str, sock_str);
@@ -709,8 +714,15 @@ static void smbios_build_type_4_table(MachineState *ms, 
unsigned instance)
 SMBIOS_TABLE_SET_STR(4, serial_number_str, type4.serial);
 SMBIOS_TABLE_SET_STR(4, asset_tag_number_str, type4.asset);
 SMBIOS_TABLE_SET_STR(4, part_number_str, type4.part);
-t->core_count = t->core_enabled = ms->smp.cores;
-t->thread_count = ms->smp.threads;
+
+t->core_count = (ms->smp.cores > 255) ? 0xFF : ms->smp.cores;
+t->core_enabled = t->core_count;
+
+t->core_count2 = t->core_enabled2 = cpu_to_le16(ms->smp.cores);
+
+t->thread_count = (ms->smp.threads > 255) ? 0xFF : ms->smp.threads;
+t->thread_count2 = cpu_to_le16(ms->smp.threads);
+
 t->processor_characteristics = cpu_to_le16(0

Re: [PATCH] linux-user: Implement faccessat2

2022-07-31 Thread Richard Henderson

On 7/29/22 13:14, Richard Henderson wrote:

-ret = get_errno(access(path(p), arg2));
-unlock_user(p, arg1, 0);
-return ret;
+return do_faccessat2(AT_FDCWD, arg1, arg2, 0);


Oops, dropped path().

Should perhaps be incorporated into the helper, because newer targets won't have or use 
plain access()...



r~