Re: [Qemu-devel] [PATCH for-3.1 00/10] Move virtio-ccw devices to separate files

2018-07-26 Thread Cornelia Huck
On Wed, 25 Jul 2018 19:20:59 +0200
Thomas Huth  wrote:

> On 25.07.2018 15:48, Cornelia Huck wrote:
> > On Wed, 25 Jul 2018 14:20:14 +0200
> > Thomas Huth  wrote:
> >   
> >> For more fine-grained control over the build process, it would be good
> >> to have the possibility to disable single virtio devices, too. We already
> >> have CONFIG_VIRTIO_* switches in the Makefiles, but currently all
> >> virtio-ccw devices are compiled in anyway. Move them to separate files
> >> so we can disable them in the Makefile more easily.
> >>
> >> NB: I did not move virtio-blk-ccw and virtio-net-ccw to a separate file
> >> yet since they are essential for the s390x-virtio machine and thus it
> >> does IMHO not make much sense to disable net and blk. But if somebody
> >> needs that possibility, too, I can add two more patches on top...  
> > 
> > I understand virtio-net-ccw (autogenerated network devices), but why
> > virtio-blk-ccw?  
> 
> Because we set mc->block_default_type = IF_VIRTIO ==> if you disable
> virtio-blk, the CLI options like -hda and -cdrom would not work anymore.

I see. Maybe we need some strategic #ifdefs to deal with this.



Re: [Qemu-devel] [PATCH for-3.1 01/10] hw/s390x/virtio-ccw: Consolidate calls to virtio_ccw_unrealize()

2018-07-26 Thread Cornelia Huck
On Wed, 25 Jul 2018 14:20:15 +0200
Thomas Huth  wrote:

> Currently, every virtio-ccw device explicitely sets its unrealize
> function to virtio_ccw_unrealize() in its class_init function.
> We can simplify this by using a common unrealize function, just like
> it is already done for the realize functions.
> 
> Signed-off-by: Thomas Huth 
> ---
>  hw/s390x/virtio-ccw.c | 22 +++---
>  1 file changed, 7 insertions(+), 15 deletions(-)

LGTM now :)



Re: [Qemu-devel] [PATCH v5 24/24] slirp: fix ipv6 timers

2018-07-26 Thread Samuel Thibault
Pavel Dovgalyuk, le jeu. 26 juil. 2018 11:37:57 +0300, a ecrit:
> > From: Samuel Thibault [mailto:samuel.thiba...@gnu.org]
> > Pavel Dovgalyuk, le jeu. 26 juil. 2018 10:08:29 +0300, a ecrit:
> > > virtual clock should be used by the virtual devices.
> > > slirp module is not the virtual device. Therefore processed packets
> > > become visible to the guest after passing to the virtual network card.
> > > Before that it can create any timers that should not change the state of 
> > > the guest.
> > 
> > I'm not sure I understand that part correctly. slirp is not a "device"
> > strictly speaking, but it has a whole foot in the virtual world. All
> > TCP/UDP/ARP/RA timings are related to the guest timing, so
> 
> I don't know all details of slirp, so let me ask:
> if the virtual timer runs very slowly (when it configured this way with 
> icount option),
> should the timings relate this speed?

Yes. Otherwise the guest will not be fast enough to answer promptly
according to slirp's TCP delays.

> Or the timers are related to the network devices (e.g., servers in the
> outer world)?

No.

> > > > > this service is not related to the guest state.
> > 
> > seems incorrect. At the moment the ip6_icmp timer's current value is not
> > saved in the guest state, but in principle it should, so that the guest
> > does see the RAs at a regular rate. In practice we don't care because
> > the timing is randomized anyway.
> 
> Isn't this just a side effect?
> I mean that slirp may be replaced by, say, tap, and the guest should not 
> notice
> the difference.

Well, if a guest is connected through a tap, the virtual time should
really run as fast as the realtime, and it should not be paused.
Otherwise TCP connections will break since the guest won't be able to
reply fast enough, without even knowing about the issue. Slirp can
compensate this thanks to a buffer between what happens in the real
world and what happens in the virtual world. Real world timings are
handled by the OS socket implementation, and virtual world timings are
handled with the qemu timer.

> > > intended to be used for the internal QEMU purposes, but stops when VM
> > > is stopped.
> > 
> > I again don't understand this. The ip6_icmp timing is not for internal
> > QEMU purpose, its whole point is how often RAs are sent to the guest.
> > 
> > slirp's guest part is not a device as directly seen by guest I/O, but
> > it's a router device as seen through guest packets. Think of it like a
> > USB device, which is seen by the guest through USB packets.
> 
> Record/replay implementation creates a line between the guest state and
> the outer world. Everything crossing this line is saved in the log replayed.
> In case of network, this line is implemented with the network filter.
> It takes packets from slirp(or anything) and passes(or not) them to the guest 
> nic.
> When replaying, the saved packets are injected into the filter directly.

> Slirp is the part of the outer world,

In normal uses it is not. It is a virtual world (its DHCP server, tftp
server, TCP connexions, etc.) that lives along the guest.

Now, I understand that for record/replay it's simpler to put the line
after slirp.

Ideally slirp's state should ideally be split it two: the part connected
to the real world (data from/to the sockets), and the part connected to
the virtual world (TCP buffering with the guest). So that when pausing,
going back, going forward etc. the slirp buffers act accordingly, TCP
knowing exactly what is supposed to be sent or not (otherwise, TCP
would for instance be really astonished if the guest happens to insist
requesting old data that it has already ACKed).

But that's tricky, and I understand it's simpler to just put the line
after slirp, and let the replay of frames provide the guest (which for
instance has been reset to an older time) with the missing data, and TCP
will nicely cope with duplicate ACKs and spurious re-emissions from the
guest.

That being said, there will be problems with TCP connections if you
pause the guest for a long time: slirp's TCP will timeout and reset the
connexion. Yes, that happens with tap devices anyway, but slirp acting
as a buffer seems more useful to me.

Samuel



[Qemu-devel] [PATCH v2] block/gluster: defend on legacy ftruncate api use

2018-07-26 Thread Niels de Vos
From: Prasanna Kumar Kalever 

New versions of Glusters libgfapi.so have an updated glfs_ftruncate()
function that returns additional 'struct stat' structures to enable
advanced caching of attributes. This is useful for file servers, not so
much for QEMU. Never the less, the API has changed and needs to be
adopted.

Signed-off-by: Prasanna Kumar Kalever 
Signed-off-by: Niels de Vos 

--
v2: do a compile check as suggested by Eric Blake
---
 block/gluster.c | 15 +--
 configure   | 18 ++
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index 4fd55a9cc5..d1c6f81f5c 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -997,6 +997,7 @@ static int qemu_gluster_do_truncate(struct glfs_fd *fd, 
int64_t offset,
 PreallocMode prealloc, Error **errp)
 {
 int64_t current_length;
+int ret;
 
 current_length = glfs_lseek(fd, 0, SEEK_END);
 if (current_length < 0) {
@@ -1024,7 +1025,12 @@ static int qemu_gluster_do_truncate(struct glfs_fd *fd, 
int64_t offset,
 #endif /* CONFIG_GLUSTERFS_FALLOCATE */
 #ifdef CONFIG_GLUSTERFS_ZEROFILL
 case PREALLOC_MODE_FULL:
-if (glfs_ftruncate(fd, offset)) {
+#ifdef CONFIG_GLUSTERFS_LEGACY_FTRUNCATE
+ret = glfs_ftruncate(fd, offset);
+#else
+ret = glfs_ftruncate(fd, offset, NULL, NULL);
+#endif
+if (ret) {
 error_setg_errno(errp, errno, "Could not resize file");
 return -errno;
 }
@@ -1035,7 +1041,12 @@ static int qemu_gluster_do_truncate(struct glfs_fd *fd, 
int64_t offset,
 break;
 #endif /* CONFIG_GLUSTERFS_ZEROFILL */
 case PREALLOC_MODE_OFF:
-if (glfs_ftruncate(fd, offset)) {
+#ifdef CONFIG_GLUSTERFS_LEGACY_FTRUNCATE
+ret = glfs_ftruncate(fd, offset);
+#else
+ret = glfs_ftruncate(fd, offset, NULL, NULL);
+#endif
+if (ret) {
 error_setg_errno(errp, errno, "Could not resize file");
 return -errno;
 }
diff --git a/configure b/configure
index 2a7796ea80..f3c0918d6b 100755
--- a/configure
+++ b/configure
@@ -451,6 +451,7 @@ glusterfs_xlator_opt="no"
 glusterfs_discard="no"
 glusterfs_fallocate="no"
 glusterfs_zerofill="no"
+glusterfs_legacy_ftruncate="no"
 gtk=""
 gtkabi=""
 gtk_gl="no"
@@ -3947,6 +3948,19 @@ if test "$glusterfs" != "no" ; then
   glusterfs_fallocate="yes"
   glusterfs_zerofill="yes"
 fi
+cat > $TMPC << EOF
+#include 
+
+int
+main(void)
+{
+   /* new glfs_ftruncate() passes two additional args */
+   return glfs_ftruncate(NULL, 0 /*, NULL, NULL */);
+}
+EOF
+if compile_prog "$glusterfs_cflags" "$glusterfs_libs" ; then
+  glusterfs_legacy_ftruncate="yes"
+fi
   else
 if test "$glusterfs" = "yes" ; then
   feature_not_found "GlusterFS backend support" \
@@ -6644,6 +6658,10 @@ if test "$glusterfs_zerofill" = "yes" ; then
   echo "CONFIG_GLUSTERFS_ZEROFILL=y" >> $config_host_mak
 fi
 
+if test "$glusterfs_legacy_ftruncate" = "yes" ; then
+  echo "CONFIG_GLUSTERFS_LEGACY_FTRUNCATE=y" >> $config_host_mak
+fi
+
 if test "$libssh2" = "yes" ; then
   echo "CONFIG_LIBSSH2=m" >> $config_host_mak
   echo "LIBSSH2_CFLAGS=$libssh2_cflags" >> $config_host_mak
-- 
2.17.1




Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Peter Maydell
On 26 July 2018 at 08:44, Ard Biesheuvel  wrote:
> Does that really matter? As I pointed out before, we are not
> interested in running the firmware on actual bare metal. What we do
> care about is not having code in the reference firmware stack that was
> added specifically to deal with the dynamic nature of our QEMU
> platform.

I think we also care in the sense that what we put in the reference
platform is an implicit recommendation. So if we (collectively) think
that good Arm server platforms are likely to or should use XHCI,
we should definitely consider going with that, rather than going with
what happens to be being used on currently shipping hardware.
(We should also consider whether we want to implicitly recommend
that the USB controller should be memory-mapped rather than
hanging off a PCI controller.)

thanks
-- PMM



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Hongbo Zhang
On 25 July 2018 at 19:26, Andrew Jones  wrote:
> On Wed, Jul 25, 2018 at 06:22:17PM +0800, Hongbo Zhang wrote:
>> On 25 July 2018 at 17:54, Andrew Jones  wrote:
>> > On Wed, Jul 25, 2018 at 01:30:52PM +0800, Hongbo Zhang wrote:
>> >> For the Aarch64, there is one machine 'virt', it is primarily meant to
>> >> run on KVM and execute virtualization workloads, but we need an
>> >> environment as faithful as possible to physical hardware, for supporting
>> >> firmware and OS development for pysical Aarch64 machines.
>> >>
>> >> This patch introduces new machine type 'Enterprise' with main features:
>> >>  - Based on 'virt' machine type.
>> >>  - Re-designed memory map.
>> >>  - EL2 and EL3 are enabled by default.
>> >>  - GIC version 3 by default.
>> >>  - AHCI controller attached to system bus, and then CDROM and hard disc
>> >>can be added to it.
>> >>  - EHCI controller attached to system bus, with USB mouse and key board
>> >>installed by default.
>> >>  - E1000E ethernet card on PCIE bus.
>> >>  - VGA display adaptor on PCIE bus.
>> >>  - Default CPU type cortex-a57, 4 cores, and 1G bytes memory.
>> >>  - No virtio functions enabled, since this is to emulate real hardware.
>> >
>> > In the last review it was pointed out that using virtio-pci should still
>> > be "real" enough, so there's not much reason to avoid it. Well, unless
>> > there's some concern as to what drivers are available in the firmware and
>> > guest kernel. But that concern usually only applies to legacy firmwares
>> > and kernels, and therefore shouldn't apply to AArch64.
>> >
>> In real physical arm hardware, *HCI are system memory mapped, not on PCIE.
>> we need a QEMU platform like that. We need firmware developed on this
>> QEMU platform can run on real hardware without change(or only a minor
>> change)
>
> virtio-pci has nothing to do with *HCI. You're adding an E1000e to the
> PCIe bus instead of a virtio-pci nic. Why?
>
No virtio devices are need on this platform, so no virtio-pci either,
on the real Arm server hardware, a NIC is inserted into PCIE, and
E1000E is a typical one.

>> Concern is not only available firmwares, but more emphasis is on new
>> firmwares to be developed on this platform (target is developing
>> firmware for hardware, but using qemu if hardware is not available
>> temporarily), if virtio device used, then the newly developed firmware
>> must include virtio front-end codes, but it isn't needed while run on
>> real hardware at last.
>
> Right. The new firmwares and kernels would need to include virtio-pci nic
> and scsi drivers. Is that a problem? Anyway, this is all the more reason
> not to hard code peripherals. If a particular peripheral is a problem
> for a given firmware, then simply don't add it to the command line, add a
> different one.
>
Yes that is problem, for newly developed firmwares, extra efforts will
be wasted on frontend codes (or glue-layer, whatever we call it), we
want firmwares developed on this platform can run easily on real
hardware, without such change.
Requirement is: some Linaro members just want a qemu platform as true
as possible with real hardware, there should be no problem with such
requirement, problem is 'virt' machine cannot satisfy the requirement,
so a new one is needed.

>>
>> >>  - No paravirtualized fw_cfg device either.
>> >>
>> >> Arm Trusted Firmware and UEFI porting to this are done accordingly.
>> >>
>> >
>> > How will UEFI get the ACPI tables from QEMU without fw-cfg? I didn't
>> > see any sort of reserved ROM region in the patch for them.
>> >
>> UEFI gets ACPI and kernel from network or mass storage, just like the
>> real hardware.
>
> Hmm. I thought for real hardware that the ACPI tables were built into
> the firmware. So, assuming UEFI knows how to read ACPI tables from
> some storage, then how do the QEMU generated ACPI tables get into that
> storage?
>
I should say "mass storage and flash"
There was fw_cfg in v1 patch, it is removed in v2.
If no fw_cfg, just like real hardware, UEFI should include ACPI
support for this SBSA platform, and UEFI/ACPI is load via -pflash,
then the QEMU built-in ACPI isn't used.
But there are side effects too, command 'qemu -bios uefi -kernel'
won't work, I need extra time to evaluate this change.

>> If we develop firmware depends on paravirtualized device like fw_cfg,
>> then we canot run such firmware on real hardware.
>
> Yes, fw-cfg is paravirt, so it's obvious why you'd prefer it not be there,
> but, where real hardware builds its hardware descriptions into its
> platform-specific firmware, or uses some platform-specific means of
> extracting a description from some platform-specific place, QEMU
> machine models use fw-cfg.
>
> Thanks,
> drew



Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots

2018-07-26 Thread Peter Xu
On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote:
> On 25/07/2018 22:04, Andrea Arcangeli wrote:
> > 
> > It may look like the uffd-wp model is wish-feature similar to an
> > optimization, but without the uffd-wp model when the WP fault is
> > triggered by kernel code, the sigsegv model falls apart and requires
> > all kind of ad-hoc changes just for this single feature. Plus uffd-wp
> > has other benefits: it makes it all reliable in terms of not
> > increasing the number of vmas in use during the snapshot. Finally it
> > makes it faster too with no mmap_sem for reading and no sigsegv
> > signals.
> > 
> > The non cooperative features got merged first because there was much
> > activity on the kernel side on that front, but this is just an ideal
> > time to nail down the remaining issues in uffd-wp I think. That I
> > believe is time better spent than trying to emulate it with sigsegv
> > and changing all drivers to send new events down to qemu specific to
> > the sigsegv handling. We considered this before doing uffd for
> > postcopy too but overall it's unreliable and more work (no single
> > change was then needed to KVM code with uffd to handle postcopy and
> > here it should be the same).
> 
> I totally agree.  The hard part in userfaultfd was the changes to the
> kernel get_user_pages API, but the payback was huge because _all_ kernel
> uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
> back to mprotect would be a huge mistake.

Thanks for explaining the bits.  I'd say I wasn't aware of the
difference before I started the investigation (and only until now I
noticed that major difference between mprotect and userfaultfd).  I'm
really glad that it's much clear (at least for me) on which way we
should choose.

Now I'm thinking whether we can move the userfault write protect work
forward.  The latest discussion I saw so far is in 2016, when someone
from Huawei tried to use the write protect feature for that old
version of live snapshot but reported issue:

  https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html

Is that the latest status for userfaultfd wr-protect?

If so, I'm thinking whether I can try to re-verify the work (I tried
his QEMU repository but I failed to compile somehow, so I plan to
write some even simpler code to try) to see whether I can get the same
KVM error he encountered.

Thoughts?

Regards,

-- 
Peter Xu



Re: [Qemu-devel] [PATCH for-3.0] qcow: fix a reference leak

2018-07-26 Thread Kevin Wolf
Am 25.07.2018 um 20:07 hat KONRAD Frederic geschrieben:
> Since 42a3e1ab367cdf38cce093de24eb406b99a4ef96 qemu asserts when using the
> vvfat driver:
> 
> git clone git://qemu.org/qemu.git
> cd qemu
> ./configure --target-list=ppc-softmmu --enable-debug
> make -j8
> mkdir foo
> touch foo/hello
> ./ppc-softmmu/qemu-system-ppc -M prep --nographic --monitor null \
>   -hda fat:rw:./foo
> 
> "Ctrl-C"
> 
> qemu-system-ppc: block.c:3368: bdrv_close_all: Assertion \
>`((&all_bdrv_states)->tqh_first == ((void *)0))' failed.
> 
> This is because we reference bs twice in qcow_co_create(..) one time in
> bdrv_open_blockdev_ref(..) and in blk_insert_bs(..) but we unref it only once
> in blk_unref which leads to the reference leak.
> 
> Note that I didn't tested much QCOW after this change as I don't use it much.
> 
> Signed-off-by: KONRAD Frederic 

Thanks, applied to the block branch.

Kevin



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Hongbo Zhang
On 25 July 2018 at 22:08, Andrew Jones  wrote:
> On Wed, Jul 25, 2018 at 03:46:01PM +0200, Ard Biesheuvel wrote:
>> On 25 July 2018 at 15:38, Andrew Jones  wrote:
>> > On Wed, Jul 25, 2018 at 03:03:40PM +0200, Ard Biesheuvel wrote:
>> >> On 25 July 2018 at 14:59, Andrew Jones  wrote:
>> >> > On Wed, Jul 25, 2018 at 01:47:00PM +0100, Daniel P. Berrangé wrote:
>> >> >> Would iut make any sense to call the machine  "refplatform"  or 
>> >> >> "refboard"
>> >> >> to indicate it is a generic reference platform, not specifically 
>> >> >> following
>> >> >> any particular real impl, albeit influence by the sbsa spec.
>> >> >>
>> >> >
>> >> > That would indeed stop me from whining about a machine model with an
>> >> > 'sbsa' name not strictly implementing a minimal SBSA machine :-)
>> >> >
>> >>
>> >> I still don't get why only a minimal machine is worth considering,
>> >> given that a real-world minimal SBSA machine is not capable of doing
>> >> anything useful.
>> >
>> > One definition of an SBSA machine can be quite different than another.
>> > If we only hard code the required [useless] base, but also provide a
>> > readconfig for the rest, then we get a useful machine without loss of
>> > flexibility.
>> >
>> > That flexibility comes at the cost of platform-bus code (since we need
>> > to add devices to the system bus) and a less concise command line. The
>> > platform-bus code may indeed be too expensive for this purpose, but
>> > we may need to see patches to be sure. I understand the desire to have
>> > a shorter command line, but this is QEMU :)
>> >
>>
>> My concern is not the QEMU side. It is the code that we will need to
>> add to UEFI and ARM-TF to deal with the dynamic nature of the
>> underlying platform. That code has no counterpart in real world
>> hardware, but will surely turn up in production firmware nonetheless
>> if we end up having to add that to our SBSA reference codebase.
>
> It sounds like there's already some sort of informal SBSA reference
> instance specification that UEFI and ARM-TF intend to support. If
> that instance specification were slightly more formal, i.e. documented
> somewhere and given a more descriptive name (not just 'sbsa'), then
> I think it would be a lot more palatable to hard code those specifications
> directly into a QEMU machine model of the same name.
>
Root cause is requirement. This platform and virt serve different use cases.

As to command line and readconfig, my opinion is they are suitable for
change slightly some base/typical platforms, not for change one to
another with big difference, otherwise we can even reduce some
platforms in qemu I guess.

> Thanks,
> drew



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Hongbo Zhang
On 26 July 2018 at 00:15, Igor Mammedov  wrote:
> On Wed, 25 Jul 2018 13:36:45 +0200
> Andrew Jones  wrote:
>
>> On Wed, Jul 25, 2018 at 11:50:41AM +0100, Dr. David Alan Gilbert wrote:
>> > * Andrew Jones (drjo...@redhat.com) wrote:
>> > > On Wed, Jul 25, 2018 at 01:30:52PM +0800, Hongbo Zhang wrote:
>> > > > For the Aarch64, there is one machine 'virt', it is primarily meant to
>> > > > run on KVM and execute virtualization workloads, but we need an
>> > > > environment as faithful as possible to physical hardware, for 
>> > > > supporting
>> > > > firmware and OS development for pysical Aarch64 machines.
>> > > >
>> > > > This patch introduces new machine type 'Enterprise' with main features:
>> > > >  - Based on 'virt' machine type.
>> > > >  - Re-designed memory map.
>> > > >  - EL2 and EL3 are enabled by default.
>> > > >  - GIC version 3 by default.
>> > > >  - AHCI controller attached to system bus, and then CDROM and hard disc
>> > > >can be added to it.
>> > > >  - EHCI controller attached to system bus, with USB mouse and key board
>> > > >installed by default.
>> > > >  - E1000E ethernet card on PCIE bus.
>> > > >  - VGA display adaptor on PCIE bus.
>> > > >  - Default CPU type cortex-a57, 4 cores, and 1G bytes memory.
>> > > >  - No virtio functions enabled, since this is to emulate real hardware.
>> > >
>> > > In the last review it was pointed out that using virtio-pci should still
>> > > be "real" enough, so there's not much reason to avoid it. Well, unless
>> > > there's some concern as to what drivers are available in the firmware and
>> > > guest kernel. But that concern usually only applies to legacy firmwares
>> > > and kernels, and therefore shouldn't apply to AArch64.
>> >
>> > I think the difference from last time is Ard's comments earlier in this
>> > thread:
>> >
>> > The purpose of the SBSA machine is not to provide a minimal
>> > configuration. It is intended to exercise all the moving parts one
>> > might find in a server firmware/OS stack, including pieces that are
>> > not usually found on x86 machines, such as DRAM starting above 4 GB
>> > and SATA/USB controllers that are not PCIe based.
>> >
>> > that suggests that the intent of this board is to provide everything
>> > which a firmware writer might want to test;  that's quite different
>> > from forming the basis of a virtualised machine for real use.
>> >
>>
>> I think I understand the purpose, and I also don't believe anything I've
>> said is counter to it. Whether or not one drives a virtio-pci nic with a
>> virtio-pci-net driver or drives an E1000e, also on the PCIe bus, makes
>> little difference to the firmware, nor to the guest kernel - besides which
>> driver gets used. And, nothing stops somebody from not plugging the
>> virtio-pci nic (use -nodefaults) and then plugging the E1000e (-device)
>> instead. Machine models don't need to hard code these assumptions. For
>> this patch it'd probably be best if we just ensured there were no
>> default devices at all, rather than replace one with another.
>
> with that thinking it might be better to put this machine in completely
> different file so not mess with 'production' virt machine code and
> call it intentionally ambiguous: "[linaro|generic|dev|...]_armv8"
> so it would be just another sbsa compliant board with  a bunch of
> default devices needed for testing purposes if users/authors think
> that it serves their purpose better.
>
Yes, before sending, I had to solutions: a separate file and share with virt.c.
For a internal release, I have a separate file sbsa.c
http://git.linaro.org/people/hongbo.zhang/qemu-enterprise.git/tree/hw/arm/sbsa.c?h=sbsa-v2.12

If separate file is used as above, there are much duplicate with
virt.c, such as both have create_pcie() etc, I need to move all the
common parts to a new file say virt-common.c, then the virt.c and
sbsa.c can only handle their specific parts, in such a way, the
previous still cannot be left untouched.

so I send out this solution for discuss, people mainly concern the
necessity of it till now, not the code and file format yet.
Thanks for you advice.

>> Thanks,
>> drew
>>
>



Re: [Qemu-devel] [Qemu-block] [PATCH v5 1/4 for-3.0] qcow2: A grammar fix in conflicting cache sizing error message

2018-07-26 Thread Kevin Wolf
Am 26.07.2018 um 00:19 hat John Snow geschrieben:
> On 07/25/2018 10:27 AM, Leonid Bloch wrote:
> > Signed-off-by: Leonid Bloch 
> 
> Reviewed-by: John Snow 

Thanks, applied to the block branch.

Kevin



Re: [Qemu-devel] [PATCH v5 2/4 for-3.0] qcow2: Options' documentation fixes

2018-07-26 Thread Kevin Wolf
Am 25.07.2018 um 16:27 hat Leonid Bloch geschrieben:
> Signed-off-by: Leonid Bloch 
> ---
>  docs/qcow2-cache.txt |  3 +++
>  qemu-options.hx  | 10 ++
>  2 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/docs/qcow2-cache.txt b/docs/qcow2-cache.txt
> index 8a09a5cc5f..3673f2be0e 100644
> --- a/docs/qcow2-cache.txt
> +++ b/docs/qcow2-cache.txt
> @@ -130,6 +130,9 @@ There are a few things that need to be taken into account:
> memory as possible to the L2 cache before increasing the refcount
> cache size.
>  
> +- At most two of "l2-cache-size", "refcount-cache-size", and "cache-size"
> +  can be set simultaneously.

The indentation is off here, the other list items have one space more.

>  Unlike L2 tables, refcount blocks are not used during normal I/O but
>  only during allocations and internal snapshots. In most cases they are
>  accessed sequentially (even during random guest I/O) so increasing the
> diff --git a/qemu-options.hx b/qemu-options.hx
> index b1bf0f485f..13ece21cb6 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -752,15 +752,17 @@ image file)
>  
>  @item cache-size
>  The maximum total size of the L2 table and refcount block caches in bytes
> -(default: 1048576 bytes or 8 clusters, whichever is larger)

I think it would be good to still say something about the default.
Maybe something like "default: the sum of l2-cache-size and
refcount-cache-size"?

>  @item l2-cache-size
> -The maximum size of the L2 table cache in bytes
> -(default: 4/5 of the total cache size)
> +The maximum size of the L2 table cache.

Why did you remove "in bytes" and add a period which the other options
don't have? I prefer the old version of this line.

> +(default: if cache-size is not defined - 1048576 bytes or 8 clusters, 
> whichever
> +is larger; otherwise, as large as possible or needed within the cache-size,
> +while permitting the requested or the minimal refcount cache size)
>  
>  @item refcount-cache-size
>  The maximum size of the refcount block cache in bytes
> -(default: 1/5 of the total cache size)
> +(default: 4 times the cluster size, or any portion of the cache-size, if it 
> is
> +specified and large enough, left over after allocating the full L2 cache)

I found the second part hard to understand. Maybe "4 times the cluster
size; or if both cache-size and l2-cache-size are given, the part of
the cache-size that is not used for the L2 cache yet."?

Kevin



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Hongbo Zhang
On 25 July 2018 at 19:44, Andrew Jones  wrote:
> On Wed, Jul 25, 2018 at 06:46:59PM +0800, Hongbo Zhang wrote:
>> On 25 July 2018 at 17:54, Andrew Jones  wrote:
>> > On Wed, Jul 25, 2018 at 01:30:52PM +0800, Hongbo Zhang wrote:
>> >> For the Aarch64, there is one machine 'virt', it is primarily meant to
>> >> run on KVM and execute virtualization workloads, but we need an
>> >> environment as faithful as possible to physical hardware, for supporting
>> >> firmware and OS development for pysical Aarch64 machines.
>> >>
>> >> This patch introduces new machine type 'Enterprise' with main features:
>> >>  - Based on 'virt' machine type.
>> >>  - Re-designed memory map.
>> >>  - EL2 and EL3 are enabled by default.
>> >>  - GIC version 3 by default.
>> >>  - AHCI controller attached to system bus, and then CDROM and hard disc
>> >>can be added to it.
>> >>  - EHCI controller attached to system bus, with USB mouse and key board
>> >>installed by default.
>> >>  - E1000E ethernet card on PCIE bus.
>> >>  - VGA display adaptor on PCIE bus.
>> >>  - Default CPU type cortex-a57, 4 cores, and 1G bytes memory.
>> >>  - No virtio functions enabled, since this is to emulate real hardware.
>> >
>> > In the last review it was pointed out that using virtio-pci should still
>> > be "real" enough, so there's not much reason to avoid it. Well, unless
>> > there's some concern as to what drivers are available in the firmware and
>> > guest kernel. But that concern usually only applies to legacy firmwares
>> > and kernels, and therefore shouldn't apply to AArch64.
>> >
>> For Armv7, there is one typical platform 'vexpress', but for Armv8, no
>
> Wasn't the vexpress model designed for a specific machine? Namely for
> Arm's simulator? Is the vexpress model really something typical among
> all the Armv7 platforms?
>
>> such typical one, the 'virt' is typically for running workloads, one
>> example is using it under OpenStack.
>> So a 'typical' one for Armv8 is needed for firmware and OS
>> development, similar like 'vexpress' for Armv7.
>
> What is a "typical" Armv8 machine? What will a typical Armv8 machine be in
> two years?
>
> Note, I'm not actually opposed to the current definition (because I don't
> really have one myself). I'm just opposed to hard coding one.
>
For x86, we have i440fx and q35 (although they are old), but for
Armv8, simple usage like "qemu -bios/-pflash -cdrom" to install an OS
and "qemu -bios/-pflash -hda" to launch is needed too.
Armv8 has no such ones like x86, but we need, and SBSA could be the one.

Hard coding, user may have different customs, It couldn't be better if
one platform satisfy requirement, if not satisfied we edit it with
command or readconfig slightly, if we always need to change a platform
to another one with huge difference, then why not maintain the other
one, otherwise we don't need to maintain so many platforms at all.

> Thanks,
> drew



Re: [Qemu-devel] tb_flush during softreset

2018-07-26 Thread Sai Pavan Boddu
Hi,

Still looking if anyone can comment on this, Thanks.

Regards,
Sai Pavan

From: Sai Pavan Boddu
Sent: Wednesday, July 25, 2018 1:34 PM
To: qemu-devel@nongnu.org
Cc: Edgar Iglesias 
Subject: tb_flush during softreset

Hi,

We are seeing issue when fetching tb cache, any suggestions for debug would be 
helpful.

Issue: Post soft reset of the core, bootloader(running over different cpu) 
reloads the program memory and releases the reset. At that point CPU crashes 
after first tb cache fetch.

Note: cpu's have mmu disabled. So no tlb_flush_pages are called.

Question: Is tb_flush called when bootloader has written to program memory of 
the target cpu ?

Thanks & Regards,
Sai Pavan



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Hongbo Zhang
On 25 July 2018 at 20:19, Peter Maydell  wrote:
> On 25 July 2018 at 12:44, Andrew Jones  wrote:
>> On Wed, Jul 25, 2018 at 06:46:59PM +0800, Hongbo Zhang wrote:
>>> For Armv7, there is one typical platform 'vexpress', but for Armv8, no
>>
>> Wasn't the vexpress model designed for a specific machine?
>
> Yes.
>
>> Namely for
>> Arm's simulator?
>
> No.
>
>> Is the vexpress model really something typical among
>> all the Armv7 platforms?
>
> No.
>
> "Vexpress" is a model specifically of a development board
> produced by Arm (the versatile express). It's useful if you
> want to run code that runs on that devboard, but (as with
> most of the devboards we model), it's not necessarily ideal,
> because it has all the limitations of the real hardware it's
> modelling (in this case the big ones are limited memory, no PCI).
> The hardware it models is also quite old now (maybe 7 or 8 years)
> and it's not really "typical" of anything. (In the primarily
> embedded space where most v7 CPUs are there's not really anything
> that could be described as "typical" anyway: everything is
> different.)
>
> For most people who just want to run Linux on an emulated v7 CPU,
> I would recommend the "virt" board, for the same reasons I
> recommend it for v8 cores.
>
>>> such typical one, the 'virt' is typically for running workloads, one
>>> example is using it under OpenStack.
>>> So a 'typical' one for Armv8 is needed for firmware and OS
>>> development, similar like 'vexpress' for Armv7.
>>
>> What is a "typical" Armv8 machine? What will a typical Armv8 machine be in
>> two years?
>>
>> Note, I'm not actually opposed to the current definition (because I don't
>> really have one myself). I'm just opposed to hard coding one.
>
> AIUI the aim here is to provide an emulated platform that is
> set up in the way that server-style armv8 machines are
> recommended to be set up, so it can be used as a testbed and
> demonstration for the firmware/OS software stack. The hope
> is that following "best practices" results in a "typical"
> machine :-)  But the word "typical" is probably not really
> very helpful here...
Yes the aim is truely like that.
Let's forget the word 'typical', a none-english speaker may has his
slightly different understanding according to his own language
culture...

>
> I would expect that in the future we'd want this machine type
> to evolve with the recommendations for how to build server
> platform hardware, which might indeed be different in two years,
> since it would be the development platform for writing/testing
> the firmware/OS stack for that two-years-time hardware.
>
> thanks
> -- PMM



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Andrew Jones
On Thu, Jul 26, 2018 at 05:22:14PM +0800, Hongbo Zhang wrote:
> On 25 July 2018 at 19:26, Andrew Jones  wrote:
> > On Wed, Jul 25, 2018 at 06:22:17PM +0800, Hongbo Zhang wrote:
> >> On 25 July 2018 at 17:54, Andrew Jones  wrote:
> >> > On Wed, Jul 25, 2018 at 01:30:52PM +0800, Hongbo Zhang wrote:
> >> >> For the Aarch64, there is one machine 'virt', it is primarily meant to
> >> >> run on KVM and execute virtualization workloads, but we need an
> >> >> environment as faithful as possible to physical hardware, for supporting
> >> >> firmware and OS development for pysical Aarch64 machines.
> >> >>
> >> >> This patch introduces new machine type 'Enterprise' with main features:
> >> >>  - Based on 'virt' machine type.
> >> >>  - Re-designed memory map.
> >> >>  - EL2 and EL3 are enabled by default.
> >> >>  - GIC version 3 by default.
> >> >>  - AHCI controller attached to system bus, and then CDROM and hard disc
> >> >>can be added to it.
> >> >>  - EHCI controller attached to system bus, with USB mouse and key board
> >> >>installed by default.
> >> >>  - E1000E ethernet card on PCIE bus.
> >> >>  - VGA display adaptor on PCIE bus.
> >> >>  - Default CPU type cortex-a57, 4 cores, and 1G bytes memory.
> >> >>  - No virtio functions enabled, since this is to emulate real hardware.
> >> >
> >> > In the last review it was pointed out that using virtio-pci should still
> >> > be "real" enough, so there's not much reason to avoid it. Well, unless
> >> > there's some concern as to what drivers are available in the firmware and
> >> > guest kernel. But that concern usually only applies to legacy firmwares
> >> > and kernels, and therefore shouldn't apply to AArch64.
> >> >
> >> In real physical arm hardware, *HCI are system memory mapped, not on PCIE.
> >> we need a QEMU platform like that. We need firmware developed on this
> >> QEMU platform can run on real hardware without change(or only a minor
> >> change)
> >
> > virtio-pci has nothing to do with *HCI. You're adding an E1000e to the
> > PCIe bus instead of a virtio-pci nic. Why?
> >
> No virtio devices are need on this platform, so no virtio-pci either,
> on the real Arm server hardware, a NIC is inserted into PCIE, and
> E1000E is a typical one.

It is possible to make a real piece of hardware that really goes in a PCIe
slot which knows how to talk VirtIO. The fact that an E1000e driver will
drive an E1000e QEMU model instead of a VirtIO driver driving a VirtIO
backend is, to me, pretty arbitrary. The only reason it should matter for
the guest firmware/kernel is whether or not the firmware/kernel will have
VirtIO drivers available. Do we know that? Is it documented somewhere
that the guest firmware/kernel is guaranteed to have E1000e drivers, but
VirtIO drivers are optional, or even forbidden? If so, where's that
document?

> 
> >> Concern is not only available firmwares, but more emphasis is on new
> >> firmwares to be developed on this platform (target is developing
> >> firmware for hardware, but using qemu if hardware is not available
> >> temporarily), if virtio device used, then the newly developed firmware
> >> must include virtio front-end codes, but it isn't needed while run on
> >> real hardware at last.
> >
> > Right. The new firmwares and kernels would need to include virtio-pci nic
> > and scsi drivers. Is that a problem? Anyway, this is all the more reason
> > not to hard code peripherals. If a particular peripheral is a problem
> > for a given firmware, then simply don't add it to the command line, add a
> > different one.
> >
> Yes that is problem, for newly developed firmwares, extra efforts will
> be wasted on frontend codes (or glue-layer, whatever we call it), we
> want firmwares developed on this platform can run easily on real
> hardware, without such change.
> Requirement is: some Linaro members just want a qemu platform as true
> as possible with real hardware, there should be no problem with such
> requirement, problem is 'virt' machine cannot satisfy the requirement,
> so a new one is needed.

It sounds like somebody knows what drivers are available and what
drivers aren't. If that's not already documented, then it should
be, and a pointer to it should be in this patch series.

> 
> >>
> >> >>  - No paravirtualized fw_cfg device either.
> >> >>
> >> >> Arm Trusted Firmware and UEFI porting to this are done accordingly.
> >> >>
> >> >
> >> > How will UEFI get the ACPI tables from QEMU without fw-cfg? I didn't
> >> > see any sort of reserved ROM region in the patch for them.
> >> >
> >> UEFI gets ACPI and kernel from network or mass storage, just like the
> >> real hardware.
> >
> > Hmm. I thought for real hardware that the ACPI tables were built into
> > the firmware. So, assuming UEFI knows how to read ACPI tables from
> > some storage, then how do the QEMU generated ACPI tables get into that
> > storage?
> >
> I should say "mass storage and flash"
> There was fw_cfg in v1 patch, it is removed in v2.
> If no fw_cfg, just lik

Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Andrew Jones
On Thu, Jul 26, 2018 at 05:46:23PM +0800, Hongbo Zhang wrote:
> On 25 July 2018 at 22:08, Andrew Jones  wrote:
> > On Wed, Jul 25, 2018 at 03:46:01PM +0200, Ard Biesheuvel wrote:
> >> On 25 July 2018 at 15:38, Andrew Jones  wrote:
> >> > On Wed, Jul 25, 2018 at 03:03:40PM +0200, Ard Biesheuvel wrote:
> >> >> On 25 July 2018 at 14:59, Andrew Jones  wrote:
> >> >> > On Wed, Jul 25, 2018 at 01:47:00PM +0100, Daniel P. Berrangé wrote:
> >> >> >> Would iut make any sense to call the machine  "refplatform"  or 
> >> >> >> "refboard"
> >> >> >> to indicate it is a generic reference platform, not specifically 
> >> >> >> following
> >> >> >> any particular real impl, albeit influence by the sbsa spec.
> >> >> >>
> >> >> >
> >> >> > That would indeed stop me from whining about a machine model with an
> >> >> > 'sbsa' name not strictly implementing a minimal SBSA machine :-)
> >> >> >
> >> >>
> >> >> I still don't get why only a minimal machine is worth considering,
> >> >> given that a real-world minimal SBSA machine is not capable of doing
> >> >> anything useful.
> >> >
> >> > One definition of an SBSA machine can be quite different than another.
> >> > If we only hard code the required [useless] base, but also provide a
> >> > readconfig for the rest, then we get a useful machine without loss of
> >> > flexibility.
> >> >
> >> > That flexibility comes at the cost of platform-bus code (since we need
> >> > to add devices to the system bus) and a less concise command line. The
> >> > platform-bus code may indeed be too expensive for this purpose, but
> >> > we may need to see patches to be sure. I understand the desire to have
> >> > a shorter command line, but this is QEMU :)
> >> >
> >>
> >> My concern is not the QEMU side. It is the code that we will need to
> >> add to UEFI and ARM-TF to deal with the dynamic nature of the
> >> underlying platform. That code has no counterpart in real world
> >> hardware, but will surely turn up in production firmware nonetheless
> >> if we end up having to add that to our SBSA reference codebase.
> >
> > It sounds like there's already some sort of informal SBSA reference
> > instance specification that UEFI and ARM-TF intend to support. If
> > that instance specification were slightly more formal, i.e. documented
> > somewhere and given a more descriptive name (not just 'sbsa'), then
> > I think it would be a lot more palatable to hard code those specifications
> > directly into a QEMU machine model of the same name.
> >
> Root cause is requirement. This platform and virt serve different use cases.

Never any disagreement there. I'm just disagreeing with how you're naming
one instance of an SBSA platform with the very generic name of 'sbsa'.

> 
> As to command line and readconfig, my opinion is they are suitable for
> change slightly some base/typical platforms, not for change one to

Yes, command line and readconfig are suitable for changing a base
platform. That's why your new machine model should be a *base* platform,
not a specific platform with hard coded peripheral selections.

Thanks,
drew



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Ard Biesheuvel
On 26 July 2018 at 12:28, Andrew Jones  wrote:
> On Thu, Jul 26, 2018 at 05:22:14PM +0800, Hongbo Zhang wrote:
>> On 25 July 2018 at 19:26, Andrew Jones  wrote:
>> > On Wed, Jul 25, 2018 at 06:22:17PM +0800, Hongbo Zhang wrote:
>> >> On 25 July 2018 at 17:54, Andrew Jones  wrote:
>> >> > On Wed, Jul 25, 2018 at 01:30:52PM +0800, Hongbo Zhang wrote:
>> >> >> For the Aarch64, there is one machine 'virt', it is primarily meant to
>> >> >> run on KVM and execute virtualization workloads, but we need an
>> >> >> environment as faithful as possible to physical hardware, for 
>> >> >> supporting
>> >> >> firmware and OS development for pysical Aarch64 machines.
>> >> >>
>> >> >> This patch introduces new machine type 'Enterprise' with main features:
>> >> >>  - Based on 'virt' machine type.
>> >> >>  - Re-designed memory map.
>> >> >>  - EL2 and EL3 are enabled by default.
>> >> >>  - GIC version 3 by default.
>> >> >>  - AHCI controller attached to system bus, and then CDROM and hard disc
>> >> >>can be added to it.
>> >> >>  - EHCI controller attached to system bus, with USB mouse and key board
>> >> >>installed by default.
>> >> >>  - E1000E ethernet card on PCIE bus.
>> >> >>  - VGA display adaptor on PCIE bus.
>> >> >>  - Default CPU type cortex-a57, 4 cores, and 1G bytes memory.
>> >> >>  - No virtio functions enabled, since this is to emulate real hardware.
>> >> >
>> >> > In the last review it was pointed out that using virtio-pci should still
>> >> > be "real" enough, so there's not much reason to avoid it. Well, unless
>> >> > there's some concern as to what drivers are available in the firmware 
>> >> > and
>> >> > guest kernel. But that concern usually only applies to legacy firmwares
>> >> > and kernels, and therefore shouldn't apply to AArch64.
>> >> >
>> >> In real physical arm hardware, *HCI are system memory mapped, not on PCIE.
>> >> we need a QEMU platform like that. We need firmware developed on this
>> >> QEMU platform can run on real hardware without change(or only a minor
>> >> change)
>> >
>> > virtio-pci has nothing to do with *HCI. You're adding an E1000e to the
>> > PCIe bus instead of a virtio-pci nic. Why?
>> >
>> No virtio devices are need on this platform, so no virtio-pci either,
>> on the real Arm server hardware, a NIC is inserted into PCIE, and
>> E1000E is a typical one.
>
> It is possible to make a real piece of hardware that really goes in a PCIe
> slot which knows how to talk VirtIO. The fact that an E1000e driver will
> drive an E1000e QEMU model instead of a VirtIO driver driving a VirtIO
> backend is, to me, pretty arbitrary. The only reason it should matter for
> the guest firmware/kernel is whether or not the firmware/kernel will have
> VirtIO drivers available. Do we know that? Is it documented somewhere
> that the guest firmware/kernel is guaranteed to have E1000e drivers, but
> VirtIO drivers are optional, or even forbidden? If so, where's that
> document?
>

It is not pretty arbitrary. One is paravirtualization and one is not.

>>
>> >> Concern is not only available firmwares, but more emphasis is on new
>> >> firmwares to be developed on this platform (target is developing
>> >> firmware for hardware, but using qemu if hardware is not available
>> >> temporarily), if virtio device used, then the newly developed firmware
>> >> must include virtio front-end codes, but it isn't needed while run on
>> >> real hardware at last.
>> >
>> > Right. The new firmwares and kernels would need to include virtio-pci nic
>> > and scsi drivers. Is that a problem? Anyway, this is all the more reason
>> > not to hard code peripherals. If a particular peripheral is a problem
>> > for a given firmware, then simply don't add it to the command line, add a
>> > different one.
>> >
>> Yes that is problem, for newly developed firmwares, extra efforts will
>> be wasted on frontend codes (or glue-layer, whatever we call it), we
>> want firmwares developed on this platform can run easily on real
>> hardware, without such change.
>> Requirement is: some Linaro members just want a qemu platform as true
>> as possible with real hardware, there should be no problem with such
>> requirement, problem is 'virt' machine cannot satisfy the requirement,
>> so a new one is needed.
>
> It sounds like somebody knows what drivers are available and what
> drivers aren't. If that's not already documented, then it should
> be, and a pointer to it should be in this patch series.
>

Available where?

UEFI has drivers for ?HCI industry standard hardware. As for the
networking side, we should review whether E1000e is the most
appropriate or this, given the lack of open source drivers. However, I
do agree that discoverable hardware should not be hardcoded, and we
should even try to use the emulated option ROM to provide a UEFI
driver.

>>
>> >>
>> >> >>  - No paravirtualized fw_cfg device either.
>> >> >>
>> >> >> Arm Trusted Firmware and UEFI porting to this are done accordingly.
>> >> >>
>> >> >
>> >> > H

Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Andrew Jones
On Thu, Jul 26, 2018 at 06:17:36PM +0800, Hongbo Zhang wrote:
> On 25 July 2018 at 19:44, Andrew Jones  wrote:
> > On Wed, Jul 25, 2018 at 06:46:59PM +0800, Hongbo Zhang wrote:
> >> On 25 July 2018 at 17:54, Andrew Jones  wrote:
> >> > On Wed, Jul 25, 2018 at 01:30:52PM +0800, Hongbo Zhang wrote:
> >> >> For the Aarch64, there is one machine 'virt', it is primarily meant to
> >> >> run on KVM and execute virtualization workloads, but we need an
> >> >> environment as faithful as possible to physical hardware, for supporting
> >> >> firmware and OS development for pysical Aarch64 machines.
> >> >>
> >> >> This patch introduces new machine type 'Enterprise' with main features:
> >> >>  - Based on 'virt' machine type.
> >> >>  - Re-designed memory map.
> >> >>  - EL2 and EL3 are enabled by default.
> >> >>  - GIC version 3 by default.
> >> >>  - AHCI controller attached to system bus, and then CDROM and hard disc
> >> >>can be added to it.
> >> >>  - EHCI controller attached to system bus, with USB mouse and key board
> >> >>installed by default.
> >> >>  - E1000E ethernet card on PCIE bus.
> >> >>  - VGA display adaptor on PCIE bus.
> >> >>  - Default CPU type cortex-a57, 4 cores, and 1G bytes memory.
> >> >>  - No virtio functions enabled, since this is to emulate real hardware.
> >> >
> >> > In the last review it was pointed out that using virtio-pci should still
> >> > be "real" enough, so there's not much reason to avoid it. Well, unless
> >> > there's some concern as to what drivers are available in the firmware and
> >> > guest kernel. But that concern usually only applies to legacy firmwares
> >> > and kernels, and therefore shouldn't apply to AArch64.
> >> >
> >> For Armv7, there is one typical platform 'vexpress', but for Armv8, no
> >
> > Wasn't the vexpress model designed for a specific machine? Namely for
> > Arm's simulator? Is the vexpress model really something typical among
> > all the Armv7 platforms?
> >
> >> such typical one, the 'virt' is typically for running workloads, one
> >> example is using it under OpenStack.
> >> So a 'typical' one for Armv8 is needed for firmware and OS
> >> development, similar like 'vexpress' for Armv7.
> >
> > What is a "typical" Armv8 machine? What will a typical Armv8 machine be in
> > two years?
> >
> > Note, I'm not actually opposed to the current definition (because I don't
> > really have one myself). I'm just opposed to hard coding one.
> >
> For x86, we have i440fx and q35 (although they are old), but for
> Armv8, simple usage like "qemu -bios/-pflash -cdrom" to install an OS
> and "qemu -bios/-pflash -hda" to launch is needed too.
> Armv8 has no such ones like x86, but we need, and SBSA could be the one.

i440fx and q35 are specific machine types. 'sbsa' is a generic machine
type that conforms to the SBSA specification. If you weren't trying to
memory map AHCI and EHCI controllers, or if memory mapped AHCI and EHCI
controllers were mandated in the SBSA spec, then there wouldn't be
anything to discuss. But that's not the case. You're attempting to
hard code one instance of a generic class of SBSA machines, but then
just call it 'sbsa'.

> 
> Hard coding, user may have different customs, It couldn't be better if
> one platform satisfy requirement, if not satisfied we edit it with
> command or readconfig slightly, if we always need to change a platform
> to another one with huge difference, then why not maintain the other
> one, otherwise we don't need to maintain so many platforms at all.
>

I won't argue about creating a list of default devices for an 'sbsa'
machine that get plugged when '-nodefaults' isn't used, but if you add
them to the memory map then '-nodefaults' won't remove them, which isn't
acceptable for a machine type generically named 'sbsa'.

Thanks,
drew



Re: [Qemu-devel] [PULL 0/2] seccomp branch queue

2018-07-26 Thread Peter Maydell
On 25 July 2018 at 15:16, Eduardo Otubo  wrote:
> The following changes since commit 18a398f6a39df4b08ff86ac0d38384193ca5f4cc:
>
>   Update version for v3.0.0-rc2 release (2018-07-24 22:06:31 +0100)
>
> are available in the Git repository at:
>
>   https://github.com/otubo/qemu.git tags/pull-seccomp-20180725
>
> for you to fetch changes up to 5b2f59307372bae13a2ff95706646674eccb65e0:
>
>   RFC: seccomp: prefer SCMP_ACT_KILL_PROCESS if available (2018-07-25 
> 16:07:31 +0200)
>
> 
> pull-seccomp-20180725
>
> 
> Marc-André Lureau (2):
>   seccomp: use SIGSYS signal instead of killing the thread
>   RFC: seccomp: prefer SCMP_ACT_KILL_PROCESS if available

Hi. This fails to compile with clang:

  CC  qemu-seccomp.o
qemu-seccomp.c:112:1: error: unused function 'qemu_seccomp'
[-Werror,-Wunused-function]
qemu_seccomp(unsigned int operation, unsigned int flags, void *args)
^

This is because clang is stricter about warning about static inline
functions defined in .c files but never used and your ifdef
guard on the callsite is not matched by one around the function
definition.

thanks
-- PMM



Re: [Qemu-devel] tb_flush during softreset

2018-07-26 Thread Peter Maydell
On 25 July 2018 at 09:04, Sai Pavan Boddu  wrote:
> We are seeing issue when fetching tb cache, any suggestions for debug would 
> be helpful.
>
> Issue: Post soft reset of the core, bootloader(running over different cpu) 
> reloads the program memory and releases the reset. At that point CPU crashes 
> after first tb cache fetch.
>
> Note: cpu's have mmu disabled. So no tlb_flush_pages are called.
>
> Question: Is tb_flush called when bootloader has written to program memory of 
> the target cpu ?

(NB: TB flush and TLB flush are different things.)

When anything writes to memory it should result in the relevant
TB entries being flushed -- this happens via code in exec.c
which calls tb_invalidate_phys_page_fast() for writes to
notdirty memory.

Personally I would try debugging this by running QEMU under
rr (rr-project.org), and looking at where the stray tb cache
entry has come from and tracking it backwards to where it
went in, and looking at what happened for the memory writes
that should in theory have resulted in it being flushed.

thanks
-- PMM



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Peter Maydell
On 26 July 2018 at 11:46, Andrew Jones  wrote:
> i440fx and q35 are specific machine types. 'sbsa' is a generic machine
> type that conforms to the SBSA specification. If you weren't trying to
> memory map AHCI and EHCI controllers, or if memory mapped AHCI and EHCI
> controllers were mandated in the SBSA spec, then there wouldn't be
> anything to discuss. But that's not the case. You're attempting to
> hard code one instance of a generic class of SBSA machines, but then
> just call it 'sbsa'.

Modelling a 'generic SBSA machine' is not the goal here (I'm
not sure that's even possible or useful), so let's just
stipulate that we'll call the machine type something else
and move on ?

thanks
-- PMM



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Ard Biesheuvel
On 26 July 2018 at 12:52, Peter Maydell  wrote:
> On 26 July 2018 at 11:46, Andrew Jones  wrote:
>> i440fx and q35 are specific machine types. 'sbsa' is a generic machine
>> type that conforms to the SBSA specification. If you weren't trying to
>> memory map AHCI and EHCI controllers, or if memory mapped AHCI and EHCI
>> controllers were mandated in the SBSA spec, then there wouldn't be
>> anything to discuss. But that's not the case. You're attempting to
>> hard code one instance of a generic class of SBSA machines, but then
>> just call it 'sbsa'.
>
> Modelling a 'generic SBSA machine' is not the goal here (I'm
> not sure that's even possible or useful), so let's just
> stipulate that we'll call the machine type something else
> and move on ?
>

Fine with me. Care to suggest a name? :-)



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Andrew Jones
On Thu, Jul 26, 2018 at 05:55:54PM +0800, Hongbo Zhang wrote:
> On 26 July 2018 at 00:15, Igor Mammedov  wrote:
> > On Wed, 25 Jul 2018 13:36:45 +0200
> > Andrew Jones  wrote:
> >
> >> On Wed, Jul 25, 2018 at 11:50:41AM +0100, Dr. David Alan Gilbert wrote:
> >> > * Andrew Jones (drjo...@redhat.com) wrote:
> >> > > On Wed, Jul 25, 2018 at 01:30:52PM +0800, Hongbo Zhang wrote:
> >> > > > For the Aarch64, there is one machine 'virt', it is primarily meant 
> >> > > > to
> >> > > > run on KVM and execute virtualization workloads, but we need an
> >> > > > environment as faithful as possible to physical hardware, for 
> >> > > > supporting
> >> > > > firmware and OS development for pysical Aarch64 machines.
> >> > > >
> >> > > > This patch introduces new machine type 'Enterprise' with main 
> >> > > > features:
> >> > > >  - Based on 'virt' machine type.
> >> > > >  - Re-designed memory map.
> >> > > >  - EL2 and EL3 are enabled by default.
> >> > > >  - GIC version 3 by default.
> >> > > >  - AHCI controller attached to system bus, and then CDROM and hard 
> >> > > > disc
> >> > > >can be added to it.
> >> > > >  - EHCI controller attached to system bus, with USB mouse and key 
> >> > > > board
> >> > > >installed by default.
> >> > > >  - E1000E ethernet card on PCIE bus.
> >> > > >  - VGA display adaptor on PCIE bus.
> >> > > >  - Default CPU type cortex-a57, 4 cores, and 1G bytes memory.
> >> > > >  - No virtio functions enabled, since this is to emulate real 
> >> > > > hardware.
> >> > >
> >> > > In the last review it was pointed out that using virtio-pci should 
> >> > > still
> >> > > be "real" enough, so there's not much reason to avoid it. Well, unless
> >> > > there's some concern as to what drivers are available in the firmware 
> >> > > and
> >> > > guest kernel. But that concern usually only applies to legacy firmwares
> >> > > and kernels, and therefore shouldn't apply to AArch64.
> >> >
> >> > I think the difference from last time is Ard's comments earlier in this
> >> > thread:
> >> >
> >> > The purpose of the SBSA machine is not to provide a minimal
> >> > configuration. It is intended to exercise all the moving parts one
> >> > might find in a server firmware/OS stack, including pieces that are
> >> > not usually found on x86 machines, such as DRAM starting above 4 GB
> >> > and SATA/USB controllers that are not PCIe based.
> >> >
> >> > that suggests that the intent of this board is to provide everything
> >> > which a firmware writer might want to test;  that's quite different
> >> > from forming the basis of a virtualised machine for real use.
> >> >
> >>
> >> I think I understand the purpose, and I also don't believe anything I've
> >> said is counter to it. Whether or not one drives a virtio-pci nic with a
> >> virtio-pci-net driver or drives an E1000e, also on the PCIe bus, makes
> >> little difference to the firmware, nor to the guest kernel - besides which
> >> driver gets used. And, nothing stops somebody from not plugging the
> >> virtio-pci nic (use -nodefaults) and then plugging the E1000e (-device)
> >> instead. Machine models don't need to hard code these assumptions. For
> >> this patch it'd probably be best if we just ensured there were no
> >> default devices at all, rather than replace one with another.
> >
> > with that thinking it might be better to put this machine in completely
> > different file so not mess with 'production' virt machine code and
> > call it intentionally ambiguous: "[linaro|generic|dev|...]_armv8"
> > so it would be just another sbsa compliant board with  a bunch of
> > default devices needed for testing purposes if users/authors think
> > that it serves their purpose better.
> >
> Yes, before sending, I had to solutions: a separate file and share with 
> virt.c.
> For a internal release, I have a separate file sbsa.c
> http://git.linaro.org/people/hongbo.zhang/qemu-enterprise.git/tree/hw/arm/sbsa.c?h=sbsa-v2.12
> 
> If separate file is used as above, there are much duplicate with
> virt.c, such as both have create_pcie() etc, I need to move all the
> common parts to a new file say virt-common.c, then the virt.c and
> sbsa.c can only handle their specific parts, in such a way, the
> previous still cannot be left untouched.
> 
> so I send out this solution for discuss, people mainly concern the
> necessity of it till now, not the code and file format yet.
> Thanks for you advice.
>

Let's try to figure out how the "sbsa" machine type memory map should look
first. If it isn't adding a bunch of new stuff then it may not matter that
it's in the same file. OTOH, separating virt machine class device and fdt
building functions from virt machine instance parameters would be a nice
cleanup anyway. Doing that cleanup first, and then simply adding a new
machine instance file, would probably be the cleanest approach.

Thanks,
drew



Re: [Qemu-devel] [PATCH v3 2/3] arm: Add Nordic Semiconductor nRF51 SoC

2018-07-26 Thread Julia Suvorova via Qemu-devel

On 26.07.2018 05:36, Joel Stanley wrote:

The nRF51 is a Cortex-M0 microcontroller with an on-board radio module,
plus other common ARM SoC peripherals.

  http://infocenter.nordicsemi.com/pdf/nRF51_RM_v3.0.pdf

This defines a basic model of the CPU and memory, with no peripherals
implemented at this stage.

Signed-off-by: Joel Stanley 
---
v2:
   put memory as struct fileds in state structure
   pass OBJECT(s) as owner, not NULL
   Add missing addresses for ficr
   Fix flash and sram sizes for microbit
   Embed cpu object in state object an initalise it without use of armv7m_init
   Link to datasheet
v3:
   rebase nrf51 on m0 changes
   remove unused kernel_filename
   clarify flash and sram size
   make flash and sram size properties of the soc state
---
  default-configs/arm-softmmu.mak |   1 +
  hw/arm/Makefile.objs|   1 +
  hw/arm/nrf51_soc.c  | 119 
  include/hw/arm/nrf51_soc.h  |  42 +++
  4 files changed, 163 insertions(+)
  create mode 100644 hw/arm/nrf51_soc.c
  create mode 100644 include/hw/arm/nrf51_soc.h

diff --git a/default-configs/arm-softmmu.mak b/default-configs/arm-softmmu.mak
index e704cb6e34d7..3432721d7d08 100644
--- a/default-configs/arm-softmmu.mak
+++ b/default-configs/arm-softmmu.mak
@@ -102,6 +102,7 @@ CONFIG_STM32F2XX_SYSCFG=y
  CONFIG_STM32F2XX_ADC=y
  CONFIG_STM32F2XX_SPI=y
  CONFIG_STM32F205_SOC=y
+CONFIG_NRF51_SOC=y
  
  CONFIG_CMSDK_APB_TIMER=y

  CONFIG_CMSDK_APB_UART=y
diff --git a/hw/arm/Makefile.objs b/hw/arm/Makefile.objs
index b1e4f8f006aa..e31875ec69bc 100644
--- a/hw/arm/Makefile.objs
+++ b/hw/arm/Makefile.objs
@@ -36,3 +36,4 @@ obj-$(CONFIG_MSF2) += msf2-soc.o msf2-som.o
  obj-$(CONFIG_IOTKIT) += iotkit.o
  obj-$(CONFIG_FSL_IMX7) += fsl-imx7.o mcimx7d-sabre.o
  obj-$(CONFIG_ARM_SMMUV3) += smmu-common.o smmuv3.o
+obj-$(CONFIG_NRF51_SOC) += nrf51_soc.o
diff --git a/hw/arm/nrf51_soc.c b/hw/arm/nrf51_soc.c
new file mode 100644
index ..03fa1dfc7456
--- /dev/null
+++ b/hw/arm/nrf51_soc.c
@@ -0,0 +1,119 @@
+/*
+ * Nordic Semiconductor nRF51 SoC
+ * http://infocenter.nordicsemi.com/pdf/nRF51_RM_v3.0.1.pdf
+ *
+ * Copyright 2018 Joel Stanley 
+ *
+ * This code is licensed under the GPL version 2 or later.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu-common.h"
+#include "hw/arm/arm.h"
+#include "hw/sysbus.h"
+#include "hw/boards.h"
+#include "hw/devices.h"
+#include "hw/misc/unimp.h"
+#include "exec/address-spaces.h"
+#include "sysemu/sysemu.h"
+#include "qemu/log.h"
+#include "cpu.h"
+
+#include "hw/arm/nrf51_soc.h"
+
+#define IOMEM_BASE  0x4000
+#define IOMEM_SIZE  0x2000
+
+#define FICR_BASE   0x1000
+#define FICR_SIZE   0x00fc
+
+#define FLASH_BASE  0x
+#define SRAM_BASE   0x2000
+
+/* The size and base is for the NRF51822 part. If other parts
+ * are supported in the future, add a sub-class of NRF51SoC for
+ * the specific variants */
+#define NRF51822_FLASH_SIZE (256 * 1024)
+#define NRF51822_SRAM_SIZE  (16 * 1024)
+
+static void nrf51_soc_realize(DeviceState *dev_soc, Error **errp)
+{
+NRF51State *s = NRF51_SOC(dev_soc);
+Error *err = NULL;
+
+if (!s->board_memory) {
+error_setg(errp, "memory property was not set");
+return;
+}
+
+object_property_set_link(OBJECT(&s->cpu), OBJECT(&s->container), "memory",
+&err);
+object_property_set_bool(OBJECT(&s->cpu), true, "realized", &err);
+
+memory_region_add_subregion_overlap(&s->container, 0, s->board_memory, -1);
+
+memory_region_init_ram(&s->flash, OBJECT(s), "nrf51.flash", s->flash_size,
+&err);
+if (err) {
+error_propagate(errp, err);
+return;
+}
+memory_region_set_readonly(&s->flash, true);
+memory_region_add_subregion(&s->container, FLASH_BASE, &s->flash);
+
+memory_region_init_ram(&s->sram, NULL, "nrf51.sram", s->sram_size, &err);
+if (err) {
+error_propagate(errp, err);
+return;
+}
+memory_region_add_subregion(&s->container, SRAM_BASE, &s->sram);
+
+create_unimplemented_device("nrf51_soc.io", IOMEM_BASE, IOMEM_SIZE);
+create_unimplemented_device("nrf51_soc.ficr", FICR_BASE, FICR_SIZE);
+create_unimplemented_device("nrf51_soc.private", 0xF000, 0x1000);
+}
+
+static void nrf51_soc_init(Object *obj)
+{
+NRF51State *s = NRF51_SOC(obj);
+
+memory_region_init(&s->container, obj, "nrf51-container", UINT64_MAX);
+
+object_initialize(&s->cpu, sizeof(s->cpu), TYPE_ARM_M_PROFILE);
+object_property_add_child(OBJECT(s), "armv6m", OBJECT(&s->cpu), 
&error_abort);
+qdev_set_parent_bus(DEVICE(&s->cpu), sysbus_get_default());
+qdev_prop_set_string(DEVICE(&s->cpu), "cpu-type", 
ARM_CPU_TYPE_NAME("cortex-m0"));
+qdev_prop_set_uint32(DEVICE(&s->cpu), "num-irq", 96);


Where did this number come from? ARMv6-M NVIC supports only 32 interrup

Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Andrew Jones
On Thu, Jul 26, 2018 at 12:35:08PM +0200, Ard Biesheuvel wrote:
> On 26 July 2018 at 12:28, Andrew Jones  wrote:
> > On Thu, Jul 26, 2018 at 05:22:14PM +0800, Hongbo Zhang wrote:
> >> On 25 July 2018 at 19:26, Andrew Jones  wrote:
> >> > On Wed, Jul 25, 2018 at 06:22:17PM +0800, Hongbo Zhang wrote:
> >> >> On 25 July 2018 at 17:54, Andrew Jones  wrote:
> >> >> > On Wed, Jul 25, 2018 at 01:30:52PM +0800, Hongbo Zhang wrote:
> >> >> >> For the Aarch64, there is one machine 'virt', it is primarily meant 
> >> >> >> to
> >> >> >> run on KVM and execute virtualization workloads, but we need an
> >> >> >> environment as faithful as possible to physical hardware, for 
> >> >> >> supporting
> >> >> >> firmware and OS development for pysical Aarch64 machines.
> >> >> >>
> >> >> >> This patch introduces new machine type 'Enterprise' with main 
> >> >> >> features:
> >> >> >>  - Based on 'virt' machine type.
> >> >> >>  - Re-designed memory map.
> >> >> >>  - EL2 and EL3 are enabled by default.
> >> >> >>  - GIC version 3 by default.
> >> >> >>  - AHCI controller attached to system bus, and then CDROM and hard 
> >> >> >> disc
> >> >> >>can be added to it.
> >> >> >>  - EHCI controller attached to system bus, with USB mouse and key 
> >> >> >> board
> >> >> >>installed by default.
> >> >> >>  - E1000E ethernet card on PCIE bus.
> >> >> >>  - VGA display adaptor on PCIE bus.
> >> >> >>  - Default CPU type cortex-a57, 4 cores, and 1G bytes memory.
> >> >> >>  - No virtio functions enabled, since this is to emulate real 
> >> >> >> hardware.
> >> >> >
> >> >> > In the last review it was pointed out that using virtio-pci should 
> >> >> > still
> >> >> > be "real" enough, so there's not much reason to avoid it. Well, unless
> >> >> > there's some concern as to what drivers are available in the firmware 
> >> >> > and
> >> >> > guest kernel. But that concern usually only applies to legacy 
> >> >> > firmwares
> >> >> > and kernels, and therefore shouldn't apply to AArch64.
> >> >> >
> >> >> In real physical arm hardware, *HCI are system memory mapped, not on 
> >> >> PCIE.
> >> >> we need a QEMU platform like that. We need firmware developed on this
> >> >> QEMU platform can run on real hardware without change(or only a minor
> >> >> change)
> >> >
> >> > virtio-pci has nothing to do with *HCI. You're adding an E1000e to the
> >> > PCIe bus instead of a virtio-pci nic. Why?
> >> >
> >> No virtio devices are need on this platform, so no virtio-pci either,
> >> on the real Arm server hardware, a NIC is inserted into PCIE, and
> >> E1000E is a typical one.
> >
> > It is possible to make a real piece of hardware that really goes in a PCIe
> > slot which knows how to talk VirtIO. The fact that an E1000e driver will
> > drive an E1000e QEMU model instead of a VirtIO driver driving a VirtIO
> > backend is, to me, pretty arbitrary. The only reason it should matter for
> > the guest firmware/kernel is whether or not the firmware/kernel will have
> > VirtIO drivers available. Do we know that? Is it documented somewhere
> > that the guest firmware/kernel is guaranteed to have E1000e drivers, but
> > VirtIO drivers are optional, or even forbidden? If so, where's that
> > document?
> >
> 
> It is not pretty arbitrary. One is paravirtualization and one is not.

But the paravirtness is a driver detail, not a platform detail. The
virtio-pci device is just a PCIe device to the platform. Drive it or
not, drive it with knowledge that it's paravirt or not, the platform
doesn't care.

> 
> >>
> >> >> Concern is not only available firmwares, but more emphasis is on new
> >> >> firmwares to be developed on this platform (target is developing
> >> >> firmware for hardware, but using qemu if hardware is not available
> >> >> temporarily), if virtio device used, then the newly developed firmware
> >> >> must include virtio front-end codes, but it isn't needed while run on
> >> >> real hardware at last.
> >> >
> >> > Right. The new firmwares and kernels would need to include virtio-pci nic
> >> > and scsi drivers. Is that a problem? Anyway, this is all the more reason
> >> > not to hard code peripherals. If a particular peripheral is a problem
> >> > for a given firmware, then simply don't add it to the command line, add a
> >> > different one.
> >> >
> >> Yes that is problem, for newly developed firmwares, extra efforts will
> >> be wasted on frontend codes (or glue-layer, whatever we call it), we
> >> want firmwares developed on this platform can run easily on real
> >> hardware, without such change.
> >> Requirement is: some Linaro members just want a qemu platform as true
> >> as possible with real hardware, there should be no problem with such
> >> requirement, problem is 'virt' machine cannot satisfy the requirement,
> >> so a new one is needed.
> >
> > It sounds like somebody knows what drivers are available and what
> > drivers aren't. If that's not already documented, then it should
> > be, and a pointer to it should be in this pat

Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Andrew Jones
On Thu, Jul 26, 2018 at 12:56:22PM +0200, Ard Biesheuvel wrote:
> On 26 July 2018 at 12:52, Peter Maydell  wrote:
> > On 26 July 2018 at 11:46, Andrew Jones  wrote:
> >> i440fx and q35 are specific machine types. 'sbsa' is a generic machine
> >> type that conforms to the SBSA specification. If you weren't trying to
> >> memory map AHCI and EHCI controllers, or if memory mapped AHCI and EHCI
> >> controllers were mandated in the SBSA spec, then there wouldn't be
> >> anything to discuss. But that's not the case. You're attempting to
> >> hard code one instance of a generic class of SBSA machines, but then
> >> just call it 'sbsa'.
> >
> > Modelling a 'generic SBSA machine' is not the goal here (I'm
> > not sure that's even possible or useful), so let's just
> > stipulate that we'll call the machine type something else
> > and move on ?
> >
> 
> Fine with me. Care to suggest a name? :-)
>

Also fine by me, but I'm not going to suggest the name. I fear the
backlash I'd receive after this mail thread!



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Ard Biesheuvel
On 26 July 2018 at 13:11, Andrew Jones  wrote:
> On Thu, Jul 26, 2018 at 12:35:08PM +0200, Ard Biesheuvel wrote:
>> On 26 July 2018 at 12:28, Andrew Jones  wrote:
>> > On Thu, Jul 26, 2018 at 05:22:14PM +0800, Hongbo Zhang wrote:
>> >> On 25 July 2018 at 19:26, Andrew Jones  wrote:
>> >> > On Wed, Jul 25, 2018 at 06:22:17PM +0800, Hongbo Zhang wrote:
>> >> >> On 25 July 2018 at 17:54, Andrew Jones  wrote:
>> >> >> > On Wed, Jul 25, 2018 at 01:30:52PM +0800, Hongbo Zhang wrote:
>> >> >> >> For the Aarch64, there is one machine 'virt', it is primarily meant 
>> >> >> >> to
>> >> >> >> run on KVM and execute virtualization workloads, but we need an
>> >> >> >> environment as faithful as possible to physical hardware, for 
>> >> >> >> supporting
>> >> >> >> firmware and OS development for pysical Aarch64 machines.
>> >> >> >>
>> >> >> >> This patch introduces new machine type 'Enterprise' with main 
>> >> >> >> features:
>> >> >> >>  - Based on 'virt' machine type.
>> >> >> >>  - Re-designed memory map.
>> >> >> >>  - EL2 and EL3 are enabled by default.
>> >> >> >>  - GIC version 3 by default.
>> >> >> >>  - AHCI controller attached to system bus, and then CDROM and hard 
>> >> >> >> disc
>> >> >> >>can be added to it.
>> >> >> >>  - EHCI controller attached to system bus, with USB mouse and key 
>> >> >> >> board
>> >> >> >>installed by default.
>> >> >> >>  - E1000E ethernet card on PCIE bus.
>> >> >> >>  - VGA display adaptor on PCIE bus.
>> >> >> >>  - Default CPU type cortex-a57, 4 cores, and 1G bytes memory.
>> >> >> >>  - No virtio functions enabled, since this is to emulate real 
>> >> >> >> hardware.
>> >> >> >
>> >> >> > In the last review it was pointed out that using virtio-pci should 
>> >> >> > still
>> >> >> > be "real" enough, so there's not much reason to avoid it. Well, 
>> >> >> > unless
>> >> >> > there's some concern as to what drivers are available in the 
>> >> >> > firmware and
>> >> >> > guest kernel. But that concern usually only applies to legacy 
>> >> >> > firmwares
>> >> >> > and kernels, and therefore shouldn't apply to AArch64.
>> >> >> >
>> >> >> In real physical arm hardware, *HCI are system memory mapped, not on 
>> >> >> PCIE.
>> >> >> we need a QEMU platform like that. We need firmware developed on this
>> >> >> QEMU platform can run on real hardware without change(or only a minor
>> >> >> change)
>> >> >
>> >> > virtio-pci has nothing to do with *HCI. You're adding an E1000e to the
>> >> > PCIe bus instead of a virtio-pci nic. Why?
>> >> >
>> >> No virtio devices are need on this platform, so no virtio-pci either,
>> >> on the real Arm server hardware, a NIC is inserted into PCIE, and
>> >> E1000E is a typical one.
>> >
>> > It is possible to make a real piece of hardware that really goes in a PCIe
>> > slot which knows how to talk VirtIO. The fact that an E1000e driver will
>> > drive an E1000e QEMU model instead of a VirtIO driver driving a VirtIO
>> > backend is, to me, pretty arbitrary. The only reason it should matter for
>> > the guest firmware/kernel is whether or not the firmware/kernel will have
>> > VirtIO drivers available. Do we know that? Is it documented somewhere
>> > that the guest firmware/kernel is guaranteed to have E1000e drivers, but
>> > VirtIO drivers are optional, or even forbidden? If so, where's that
>> > document?
>> >
>>
>> It is not pretty arbitrary. One is paravirtualization and one is not.
>
> But the paravirtness is a driver detail, not a platform detail. The
> virtio-pci device is just a PCIe device to the platform. Drive it or
> not, drive it with knowledge that it's paravirt or not, the platform
> doesn't care.
>

That may be true. But we'll still end up with a UEFI build that has
OVMF virtio bus drivers and device drivers included, blurring the line
between emulation and virtualization.

>>
>> >>
>> >> >> Concern is not only available firmwares, but more emphasis is on new
>> >> >> firmwares to be developed on this platform (target is developing
>> >> >> firmware for hardware, but using qemu if hardware is not available
>> >> >> temporarily), if virtio device used, then the newly developed firmware
>> >> >> must include virtio front-end codes, but it isn't needed while run on
>> >> >> real hardware at last.
>> >> >
>> >> > Right. The new firmwares and kernels would need to include virtio-pci 
>> >> > nic
>> >> > and scsi drivers. Is that a problem? Anyway, this is all the more reason
>> >> > not to hard code peripherals. If a particular peripheral is a problem
>> >> > for a given firmware, then simply don't add it to the command line, add 
>> >> > a
>> >> > different one.
>> >> >
>> >> Yes that is problem, for newly developed firmwares, extra efforts will
>> >> be wasted on frontend codes (or glue-layer, whatever we call it), we
>> >> want firmwares developed on this platform can run easily on real
>> >> hardware, without such change.
>> >> Requirement is: some Linaro members just want a qemu platform as true
>> >> as possible wi

[Qemu-devel] [PATCH for-3.0] file-posix: Fix write_zeroes with unmap on block devices

2018-07-26 Thread Kevin Wolf
The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
all-zero afterwards, so don't try to abuse it for zero writing. We try
to only use this if BLKDISCARDZEROES tells us that it is safe, but this
is unreliable on older kernels and a constant 0 in newer kernels. In
other words, this code path is never actually used with newer kernels,
so we don't even try to unmap while writing zeros.

This patch removes the abuse of discard for writing zeroes from
file-posix and instead adds a new function that uses interfaces that are
actually meant to deallocate and zero out at the same time. Only if
those fail, it falls back to zeroing out without unmap. We never fall
back to a discard operation any more that may or may not result in
zeros.

Signed-off-by: Kevin Wolf 
---
 block/file-posix.c | 62 +-
 1 file changed, 47 insertions(+), 15 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 60af4b3d51..9c66cd87d1 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -648,7 +648,7 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 }
 #endif
 
-bs->supported_zero_flags = s->discard_zeroes ? BDRV_REQ_MAY_UNMAP : 0;
+bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
 ret = 0;
 fail:
 if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
@@ -1487,6 +1487,38 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData 
*aiocb)
 return -ENOTSUP;
 }
 
+static ssize_t handle_aiocb_write_zeroes_unmap(RawPosixAIOData *aiocb)
+{
+BDRVRawState *s = aiocb->bs->opaque;
+int ret;
+
+/* First try to write zeros and unmap at the same time */
+
+#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+   aiocb->aio_offset, aiocb->aio_nbytes);
+if (ret != -ENOTSUP) {
+return ret;
+}
+#endif
+
+#ifdef CONFIG_XFS
+if (s->is_xfs) {
+/* xfs_discard() guarantees that the discarded area reads as all-zero
+ * afterwards, so we can use it here. */
+return xfs_discard(s, aiocb->aio_offset, aiocb->aio_nbytes);
+}
+#endif
+
+/* Make the compiler happy if neither of the above is compiled in */
+(void) s;
+
+/* If we couldn't manage to unmap while guaranteed that the area reads as
+ * all-zero afterwards, just write zeroes without unmapping */
+ret = handle_aiocb_write_zeroes(aiocb);
+return ret;
+}
+
 #ifndef HAVE_COPY_FILE_RANGE
 static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
  off_t *out_off, size_t len, unsigned int flags)
@@ -1729,6 +1761,9 @@ static int aio_worker(void *arg)
 case QEMU_AIO_WRITE_ZEROES:
 ret = handle_aiocb_write_zeroes(aiocb);
 break;
+case QEMU_AIO_WRITE_ZEROES | QEMU_AIO_DISCARD:
+ret = handle_aiocb_write_zeroes_unmap(aiocb);
+break;
 case QEMU_AIO_COPY_RANGE:
 ret = handle_aiocb_copy_range(aiocb);
 break;
@@ -2553,15 +2588,13 @@ static int coroutine_fn raw_co_pwrite_zeroes(
 int bytes, BdrvRequestFlags flags)
 {
 BDRVRawState *s = bs->opaque;
+int operation = QEMU_AIO_WRITE_ZEROES;
 
-if (!(flags & BDRV_REQ_MAY_UNMAP)) {
-return paio_submit_co(bs, s->fd, offset, NULL, bytes,
-  QEMU_AIO_WRITE_ZEROES);
-} else if (s->discard_zeroes) {
-return paio_submit_co(bs, s->fd, offset, NULL, bytes,
-  QEMU_AIO_DISCARD);
+if (flags & BDRV_REQ_MAY_UNMAP) {
+operation |= QEMU_AIO_DISCARD;
 }
-return -ENOTSUP;
+
+return paio_submit_co(bs, s->fd, offset, NULL, bytes, operation);
 }
 
 static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
@@ -3054,20 +3087,19 @@ static coroutine_fn int 
hdev_co_pwrite_zeroes(BlockDriverState *bs,
 int64_t offset, int bytes, BdrvRequestFlags flags)
 {
 BDRVRawState *s = bs->opaque;
+int operation = QEMU_AIO_WRITE_ZEROES | QEMU_AIO_BLKDEV;
 int rc;
 
 rc = fd_open(bs);
 if (rc < 0) {
 return rc;
 }
-if (!(flags & BDRV_REQ_MAY_UNMAP)) {
-return paio_submit_co(bs, s->fd, offset, NULL, bytes,
-  QEMU_AIO_WRITE_ZEROES|QEMU_AIO_BLKDEV);
-} else if (s->discard_zeroes) {
-return paio_submit_co(bs, s->fd, offset, NULL, bytes,
-  QEMU_AIO_DISCARD|QEMU_AIO_BLKDEV);
+
+if (flags & BDRV_REQ_MAY_UNMAP) {
+operation |= QEMU_AIO_DISCARD;
 }
-return -ENOTSUP;
+
+return paio_submit_co(bs, s->fd, offset, NULL, bytes, operation);
 }
 
 static int coroutine_fn hdev_co_create_opts(const char *filename, QemuOpts 
*opts,
-- 
2.13.6




Re: [Qemu-devel] [PATCH] s390x/cpumodel: enum type S390FeatGroup now gets generated

2018-07-26 Thread Cornelia Huck
On Wed, 25 Jul 2018 16:36:17 +0200
Michael Mueller  wrote:

> The enumeration type S390FeatGroup is now generated as well.
> This shall simplify the definition of new feature groups
> without the requirement to modify existing code.

So can we expect new feature groups soon? :)

> 
> Signed-off-by: Michael Mueller 
> ---
>  target/s390x/cpu_features.c |  1 -
>  target/s390x/cpu_features.h | 19 +--
>  target/s390x/gen-features.c | 18 +-
>  3 files changed, 18 insertions(+), 20 deletions(-)

Thanks, applied.



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Andrew Jones
On Thu, Jul 26, 2018 at 01:15:15PM +0200, Ard Biesheuvel wrote:
> That may be true. But we'll still end up with a UEFI build that has
> OVMF virtio bus drivers and device drivers included, blurring the line
> between emulation and virtualization.

The UEFI build can drop the virtio-mmio support, and all the virtio-mmio
device drivers. Those haven't been necessary since the virt machine got
a PCIe host bridge. You'd still need the virtio-pci device drivers, which
may or may not "taint" the reference code, depending on its requirements.

> > I don't disagree, but there's no point in making QEMU ACPI generation
> > code changes that will never be consumed. This patch adds tables for
> > the hard coded ?HCI controllers to ACPI. We don't need those changes for
> > the virt machine and, without fw-cfg, you can't use them on the reference
> > machine.
> >
> 
> Ah, indeed. I missed that bit.
> 
> We should not include any changes that modify the DT node or ACPI
> table generation for mach-virt.
>

The patch guards the generation. It'll only modify DT and ACPI for the
new machine type. But, while modifying the DT makes sense, as that
generated DT will get consumed, the ACPI tables have no way of being
consumed without fw-cfg. If the new machine type doesn't want fw-cfg,
then there's no need to touch any ACPI generation code at all in this
patch.

Thanks,
drew



Re: [Qemu-devel] [PULL 0/2] seccomp branch queue

2018-07-26 Thread Eduardo Otubo
On 26/07/2018 - 11:47:46, Peter Maydell wrote:
> On 25 July 2018 at 15:16, Eduardo Otubo  wrote:
> > The following changes since commit 18a398f6a39df4b08ff86ac0d38384193ca5f4cc:
> >
> >   Update version for v3.0.0-rc2 release (2018-07-24 22:06:31 +0100)
> >
> > are available in the Git repository at:
> >
> >   https://github.com/otubo/qemu.git tags/pull-seccomp-20180725
> >
> > for you to fetch changes up to 5b2f59307372bae13a2ff95706646674eccb65e0:
> >
> >   RFC: seccomp: prefer SCMP_ACT_KILL_PROCESS if available (2018-07-25 
> > 16:07:31 +0200)
> >
> > 
> > pull-seccomp-20180725
> >
> > 
> > Marc-André Lureau (2):
> >   seccomp: use SIGSYS signal instead of killing the thread
> >   RFC: seccomp: prefer SCMP_ACT_KILL_PROCESS if available
> 
> Hi. This fails to compile with clang:
> 
>   CC  qemu-seccomp.o
> qemu-seccomp.c:112:1: error: unused function 'qemu_seccomp'
> [-Werror,-Wunused-function]
> qemu_seccomp(unsigned int operation, unsigned int flags, void *args)
> ^
> 
> This is because clang is stricter about warning about static inline
> functions defined in .c files but never used and your ifdef
> guard on the callsite is not matched by one around the function
> definition.
> 

Peter, sorry for not catching that before.
Marc, can you fix and resend?

-- 
Eduardo Otubo


signature.asc
Description: PGP signature


Re: [Qemu-devel] [PULL 0/2] seccomp branch queue

2018-07-26 Thread Marc-André Lureau
Hi

On Thu, Jul 26, 2018 at 12:47 PM, Peter Maydell
 wrote:
> On 25 July 2018 at 15:16, Eduardo Otubo  wrote:
>> The following changes since commit 18a398f6a39df4b08ff86ac0d38384193ca5f4cc:
>>
>>   Update version for v3.0.0-rc2 release (2018-07-24 22:06:31 +0100)
>>
>> are available in the Git repository at:
>>
>>   https://github.com/otubo/qemu.git tags/pull-seccomp-20180725
>>
>> for you to fetch changes up to 5b2f59307372bae13a2ff95706646674eccb65e0:
>>
>>   RFC: seccomp: prefer SCMP_ACT_KILL_PROCESS if available (2018-07-25 
>> 16:07:31 +0200)
>>
>> 
>> pull-seccomp-20180725
>>
>> 
>> Marc-André Lureau (2):
>>   seccomp: use SIGSYS signal instead of killing the thread
>>   RFC: seccomp: prefer SCMP_ACT_KILL_PROCESS if available
>
> Hi. This fails to compile with clang:
>
>   CC  qemu-seccomp.o
> qemu-seccomp.c:112:1: error: unused function 'qemu_seccomp'
> [-Werror,-Wunused-function]
> qemu_seccomp(unsigned int operation, unsigned int flags, void *args)
> ^
>
> This is because clang is stricter about warning about static inline
> functions defined in .c files but never used and your ifdef
> guard on the callsite is not matched by one around the function
> definition.
>

https://lkml.org/lkml/2017/6/6/631 ;)



-- 
Marc-André Lureau



Re: [Qemu-devel] [PULL 0/2] seccomp branch queue

2018-07-26 Thread Marc-André Lureau
H

On Thu, Jul 26, 2018 at 2:04 PM, Eduardo Otubo  wrote:
> On 26/07/2018 - 11:47:46, Peter Maydell wrote:
>> On 25 July 2018 at 15:16, Eduardo Otubo  wrote:
>> > The following changes since commit 
>> > 18a398f6a39df4b08ff86ac0d38384193ca5f4cc:
>> >
>> >   Update version for v3.0.0-rc2 release (2018-07-24 22:06:31 +0100)
>> >
>> > are available in the Git repository at:
>> >
>> >   https://github.com/otubo/qemu.git tags/pull-seccomp-20180725
>> >
>> > for you to fetch changes up to 5b2f59307372bae13a2ff95706646674eccb65e0:
>> >
>> >   RFC: seccomp: prefer SCMP_ACT_KILL_PROCESS if available (2018-07-25 
>> > 16:07:31 +0200)
>> >
>> > 
>> > pull-seccomp-20180725
>> >
>> > 
>> > Marc-André Lureau (2):
>> >   seccomp: use SIGSYS signal instead of killing the thread
>> >   RFC: seccomp: prefer SCMP_ACT_KILL_PROCESS if available
>>
>> Hi. This fails to compile with clang:
>>
>>   CC  qemu-seccomp.o
>> qemu-seccomp.c:112:1: error: unused function 'qemu_seccomp'
>> [-Werror,-Wunused-function]
>> qemu_seccomp(unsigned int operation, unsigned int flags, void *args)
>> ^
>>
>> This is because clang is stricter about warning about static inline
>> functions defined in .c files but never used and your ifdef
>> guard on the callsite is not matched by one around the function
>> definition.
>>
>
> Peter, sorry for not catching that before.
> Marc, can you fix and resend?

I suggest to drop that patch from 3.0. Since it will require a newer
libseccomp to be actually useful, it can be delayed imho.



-- 
Marc-André Lureau



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Peter Maydell
On 26 July 2018 at 12:41, Andrew Jones  wrote:
> The patch guards the generation. It'll only modify DT and ACPI for the
> new machine type. But, while modifying the DT makes sense, as that
> generated DT will get consumed

...will it? Why would we want the firmware to read the
QEMU-generated DT? Real hardware doesn't work that way.

thanks
-- PMM



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Laszlo Ersek
On 07/26/18 13:13, Andrew Jones wrote:
> On Thu, Jul 26, 2018 at 12:56:22PM +0200, Ard Biesheuvel wrote:
>> On 26 July 2018 at 12:52, Peter Maydell  wrote:
>>> On 26 July 2018 at 11:46, Andrew Jones  wrote:
 i440fx and q35 are specific machine types. 'sbsa' is a generic machine
 type that conforms to the SBSA specification. If you weren't trying to
 memory map AHCI and EHCI controllers, or if memory mapped AHCI and EHCI
 controllers were mandated in the SBSA spec, then there wouldn't be
 anything to discuss. But that's not the case. You're attempting to
 hard code one instance of a generic class of SBSA machines, but then
 just call it 'sbsa'.
>>>
>>> Modelling a 'generic SBSA machine' is not the goal here (I'm
>>> not sure that's even possible or useful), so let's just
>>> stipulate that we'll call the machine type something else
>>> and move on ?
>>>
>>
>> Fine with me. Care to suggest a name? :-)
>>
> 
> Also fine by me, but I'm not going to suggest the name. I fear the
> backlash I'd receive after this mail thread!
> 

I suggest "showcase2018". (Dead serious.)

Laszlo



Re: [Qemu-devel] [PATCH v3 3/5] qcow2: Introduce an option for sufficient L2 cache for the entire image

2018-07-26 Thread Leonid Bloch

You mean with QDict? I'll look into that now. But already sent v5 before
reading this email.


Yes, with reading it from the QDict. (Or whatever the simplest way is
that results in the right external interface, but I suppose this is the
one.)


Well, there is a problem with that: I can easily isolate
l2-cache-size from QDict, check if it is "full", and if it is - do 
whatever is needed, and delete this option before parsing. But what if 
it is "foo"? It will not get deleted, and the regular QEMU_OPT_SIZE 
parsing error will appear, stating that l2-cache-size "expects a 
non-negative number..." - no word about that it can expect "full" as 
well. Now, one can try to modify local_err->msg for this particular 
option, but this will require substantial additional logic. I think 
considering this, it would be easier to stick with a dedicated option, 
l2-cache-full.


Do you think there is a smarter way to parse the l2-cache-size option, 
so it would accept both size and "full", while handling errors 
correctly? It seems more elegant to have a single option, but the 
internal handling will be more elegant and simpler with two mutually 
exclusive options.


By the way, the L2 cache resizes now on image resize. Will send the 
changes in v6. Thanks for the suggestion!


Leonid.



Re: [Qemu-devel] [PATCH] target/ppc: only save guest timebase once after stopping

2018-07-26 Thread Michael Roth
Quoting David Gibson (2018-07-26 00:07:46)
> On Thu, May 03, 2018 at 11:20:44PM -0500, Michael Roth wrote:
> > In some cases (e.g. spapr) we record guest timebase after qmp_stop()
> > via a runstate hook so we can restore it on qmp_cont(). If a migration
> > occurs in between those events we end up saving it again, this time
> > based on the current timebase the guest would be seeing had it been
> > running. This has the effect of advancing the guest timebase while
> > it is stopped, which is not what the code intends.
> > 
> > Other than simple jumps in time, this has been seen to trigger what
> > appear to be RCU-related crashes in recent kernels when the advance
> > exceeds rcu_cpu_stall_timeout, and it can be triggered by fairly
> > common operations such as `virsh migrate ... --timeout 60`.
> > 
> > Cc: Alexey Kardashevskiy 
> > Cc: David Gibson 
> > Cc: Laurent Vivier 
> > Cc: qemu-...@nongnu.org
> > Cc: qemu-sta...@nongnu.org
> > Signed-off-by: Michael Roth 
> 
> Sorry, this fell off my radar for ages, but I've finally had a chance
> to look at it properly.

No problem, I had assumed it just wasn't deemed needed based on the
discussion.

> 
> I'm not totally convinced this handle all the possible edge cases
> correctly, but I am convinced it gives behaviour that's more correct
> than we have now.  It doesn't introduce changes to the interface or
> migration stream that would break things in future, so I've applied it
> to ppc-for-3.1.

There's a flaw with the patch that Greg pointed out to me previously
in that it doesn't just avoid the presave hook when vm_stop is issued
manually via monitor, but also when vm_stop is issued by the migration
code itself. The latter wasn't intentional. I'm not sure if we currently
have a good way to distinguish between the 2 cases to rectify that.

As the patch currently stands it is effectively a revert of the
Laurent's original patch.

FWIW, the RCU-related crashes mentioned in the original commit turned
out to be a separate issue that was exacerbated by RCU stall warnings
messages (which *would* be less likely with this patch). So this would
be more about user experience and spurious log messages than an actual
known bug fix. I am still of the opinion that we shouldn't resave the
clock if the vm_stop was manually-issued; it's just that the current
patch oversteps that goal.

> 
> > ---
> >  hw/ppc/ppc.c | 12 
> >  target/ppc/cpu-qom.h |  1 +
> >  2 files changed, 13 insertions(+)
> > 
> > diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
> > index ec4be25f49..ff0a107864 100644
> > --- a/hw/ppc/ppc.c
> > +++ b/hw/ppc/ppc.c
> > @@ -865,6 +865,15 @@ static void timebase_save(PPCTimebase *tb)
> >  uint64_t ticks = cpu_get_host_ticks();
> >  PowerPCCPU *first_ppc_cpu = POWERPC_CPU(first_cpu);
> >  
> > +/* since we generally save timebase just after the guest
> > + * has stopped, avoid trying to save it again since we will
> > + * end up advancing it by the amount of ticks that have
> > + * elapsed in the host since the initial save
> > + */
> > +if (tb->saved) {
> > +return;
> > +}
> > +
> >  if (!first_ppc_cpu->env.tb_env) {
> >  error_report("No timebase object");
> >  return;
> > @@ -877,6 +886,7 @@ static void timebase_save(PPCTimebase *tb)
> >   * there is no need to update it from KVM here
> >   */
> >  tb->guest_timebase = ticks + first_ppc_cpu->env.tb_env->tb_offset;
> > +tb->saved = true;
> >  }
> >  
> >  static void timebase_load(PPCTimebase *tb)
> > @@ -908,6 +918,8 @@ static void timebase_load(PPCTimebase *tb)
> >  &pcpu->env.tb_env->tb_offset);
> >  #endif
> >  }
> > +
> > +tb->saved = false;
> >  }
> >  
> >  void cpu_ppc_clock_vm_state_change(void *opaque, int running,
> > diff --git a/target/ppc/cpu-qom.h b/target/ppc/cpu-qom.h
> > index deaa46a14b..ec2dbcdcae 100644
> > --- a/target/ppc/cpu-qom.h
> > +++ b/target/ppc/cpu-qom.h
> > @@ -210,6 +210,7 @@ typedef struct PowerPCCPUClass {
> >  typedef struct PPCTimebase {
> >  uint64_t guest_timebase;
> >  int64_t time_of_the_day_ns;
> > +bool saved;
> >  } PPCTimebase;
> >  
> >  extern const struct VMStateDescription vmstate_ppc_timebase;
> 
> -- 
> David Gibson| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
> | _way_ _around_!
> http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Andrew Jones
On Thu, Jul 26, 2018 at 01:10:34PM +0100, Peter Maydell wrote:
> On 26 July 2018 at 12:41, Andrew Jones  wrote:
> > The patch guards the generation. It'll only modify DT and ACPI for the
> > new machine type. But, while modifying the DT makes sense, as that
> > generated DT will get consumed
> 
> ...will it? Why would we want the firmware to read the
> QEMU-generated DT? Real hardware doesn't work that way.
>

Good point. If the plan for this reference software is to always
prepare its own DTB and ACPI tables, then this patch shouldn't
touch the DT generation either. Of course a lot of the device
and fdt node creation is intertwined in mach-virt, so it's going
to create a DTB anyway.

(Unless major refactoring is done first...)

Thanks,
drew



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Peter Maydell
On 26 July 2018 at 13:35, Andrew Jones  wrote:
> On Thu, Jul 26, 2018 at 01:10:34PM +0100, Peter Maydell wrote:
>> On 26 July 2018 at 12:41, Andrew Jones  wrote:
>> > The patch guards the generation. It'll only modify DT and ACPI for the
>> > new machine type. But, while modifying the DT makes sense, as that
>> > generated DT will get consumed
>>
>> ...will it? Why would we want the firmware to read the
>> QEMU-generated DT? Real hardware doesn't work that way.
>>
>
> Good point. If the plan for this reference software is to always
> prepare its own DTB and ACPI tables, then this patch shouldn't
> touch the DT generation either. Of course a lot of the device
> and fdt node creation is intertwined in mach-virt, so it's going
> to create a DTB anyway.

I haven't yet looked at this patch so I might change my mind
once I've had time to look at the code, but my initial
thought is to prefer it to be in its own file rather than
trying to share code with the virt board. There's a lot
of stuff 'virt' needs that this doesn't (DT generation,
ACPI generation, switches to disable virtualization or
trustzone support, options to select GICv2, etc etc).

Q: is this new board model intended to be able to work
under KVM, or is that out of scope? (You wouldn't be able
to run guest EL3 firmware under KVM, certainly.)

thanks
-- PMM



Re: [Qemu-devel] [PATCH 1/4] ui: add qapi parser for -display

2018-07-26 Thread Markus Armbruster
Gerd Hoffmann  writes:

> Add parse_display_qapi() function which parses the -display command line
> using a qapi visitor for DisplayOptions.  Wire up as default catch in
> parse_display().
>
> Improves the error message for unknown display types.
>
> Also enables json as -display argument, i.e. -display "{ 'type': 'gtk' }"

This new option argument syntax is undocumented.  Intentional?

> Signed-off-by: Gerd Hoffmann 
> ---
>  vl.c | 29 +++--
>  1 file changed, 27 insertions(+), 2 deletions(-)



Re: [Qemu-devel] [PATCH v2 2/2] hw/arm: Add Arm Enterprise machine type

2018-07-26 Thread Daniel P . Berrangé
On Thu, Jul 26, 2018 at 02:23:46PM +0200, Laszlo Ersek wrote:
> On 07/26/18 13:13, Andrew Jones wrote:
> > On Thu, Jul 26, 2018 at 12:56:22PM +0200, Ard Biesheuvel wrote:
> >> On 26 July 2018 at 12:52, Peter Maydell  wrote:
> >>> On 26 July 2018 at 11:46, Andrew Jones  wrote:
>  i440fx and q35 are specific machine types. 'sbsa' is a generic machine
>  type that conforms to the SBSA specification. If you weren't trying to
>  memory map AHCI and EHCI controllers, or if memory mapped AHCI and EHCI
>  controllers were mandated in the SBSA spec, then there wouldn't be
>  anything to discuss. But that's not the case. You're attempting to
>  hard code one instance of a generic class of SBSA machines, but then
>  just call it 'sbsa'.
> >>>
> >>> Modelling a 'generic SBSA machine' is not the goal here (I'm
> >>> not sure that's even possible or useful), so let's just
> >>> stipulate that we'll call the machine type something else
> >>> and move on ?
> >>>
> >>
> >> Fine with me. Care to suggest a name? :-)
> >>
> > 
> > Also fine by me, but I'm not going to suggest the name. I fear the
> > backlash I'd receive after this mail thread!
> > 
> 
> I suggest "showcase2018". (Dead serious.)

Further suggestions

   "refplatform"
   "sbsareference"
   "sbsademo"
   "phys"   (physical counterpart to "virt")


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



[Qemu-devel] [PATCH v1 for 3.0 1/2] linux-user/mmap.c: handle len = 0 maps correctly

2018-07-26 Thread Alex Bennée
I've slightly re-organised the check to more closely match the
sequence that the kernel uses in do_mmap().

Signed-off-by: Alex Bennée 
Cc: umarcor <1783...@bugs.launchpad.net>
---
 linux-user/mmap.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/linux-user/mmap.c b/linux-user/mmap.c
index d0c50e4888..3ef69fa2d0 100644
--- a/linux-user/mmap.c
+++ b/linux-user/mmap.c
@@ -391,14 +391,22 @@ abi_long target_mmap(abi_ulong start, abi_ulong len, int 
prot,
 }
 #endif
 
-if (offset & ~TARGET_PAGE_MASK) {
+if (!len) {
 errno = EINVAL;
 goto fail;
 }
 
 len = TARGET_PAGE_ALIGN(len);
-if (len == 0)
-goto the_end;
+if (!len) {
+errno = EINVAL;
+goto fail;
+}
+
+if (offset & ~TARGET_PAGE_MASK) {
+errno = EINVAL;
+goto fail;
+}
+
 real_start = start & qemu_host_page_mask;
 host_offset = offset & qemu_host_page_mask;
 
-- 
2.17.1




[Qemu-devel] [PATCH v1 for 3.0 0/2] fix for bug 1783362

2018-07-26 Thread Alex Bennée
Hi,

This fixes a bug where a zero len is passed to mmap. It comes with an
enhancement to the mmap test case.

Alex Bennée (2):
  linux-user/mmap.c: handle len = 0 maps correctly
  tests: add check_invalid_maps to test-mmap

 linux-user/mmap.c   | 14 +++---
 tests/tcg/multiarch/test-mmap.c | 16 +++-
 2 files changed, 26 insertions(+), 4 deletions(-)

-- 
2.17.1




[Qemu-devel] [PATCH v1 for 3.0 2/2] tests: add check_invalid_maps to test-mmap

2018-07-26 Thread Alex Bennée
This adds a test to make sure we fail properly for a 0 length mmap.
There are most likely other failure conditions we should also check.

Signed-off-by: Alex Bennée 
Cc: umarcor <1783...@bugs.launchpad.net>
---
 tests/tcg/multiarch/test-mmap.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/tests/tcg/multiarch/test-mmap.c b/tests/tcg/multiarch/test-mmap.c
index 5c0afe6e49..7f62eba4e9 100644
--- a/tests/tcg/multiarch/test-mmap.c
+++ b/tests/tcg/multiarch/test-mmap.c
@@ -27,7 +27,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 
 #define D(x)
@@ -435,6 +435,19 @@ void checked_write(int fd, const void *buf, size_t count)
 fail_unless(rc == count);
 }
 
+void check_invalid_mmaps(void)
+{
+unsigned char *addr;
+
+/* Attempt to map a zero length page.  */
+addr = mmap(NULL, 0, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+fprintf(stdout, "%s addr=%p", __func__, (void *)addr);
+fail_unless(addr == MAP_FAILED);
+fail_unless(errno == EINVAL);
+
+fprintf(stdout, " passed\n");
+}
+
 int main(int argc, char **argv)
 {
char tempname[] = "/tmp/.cmmapXX";
@@ -476,6 +489,7 @@ int main(int argc, char **argv)
check_file_fixed_mmaps();
check_file_fixed_eof_mmaps();
check_file_unfixed_eof_mmaps();
+   check_invalid_mmaps();
 
/* Fails at the moment.  */
/* check_aligned_anonymous_fixed_mmaps_collide_with_host(); */
-- 
2.17.1




Re: [Qemu-devel] [PATCH v1 for 3.0 1/2] linux-user/mmap.c: handle len = 0 maps correctly

2018-07-26 Thread Laurent Vivier
Le 26/07/2018 à 15:29, Alex Bennée a écrit :
> I've slightly re-organised the check to more closely match the
> sequence that the kernel uses in do_mmap().
> 
> Signed-off-by: Alex Bennée 
> Cc: umarcor <1783...@bugs.launchpad.net>
> ---
>  linux-user/mmap.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/linux-user/mmap.c b/linux-user/mmap.c
> index d0c50e4888..3ef69fa2d0 100644
> --- a/linux-user/mmap.c
> +++ b/linux-user/mmap.c
> @@ -391,14 +391,22 @@ abi_long target_mmap(abi_ulong start, abi_ulong len, 
> int prot,
>  }
>  #endif
>  
> -if (offset & ~TARGET_PAGE_MASK) {
> +if (!len) {
>  errno = EINVAL;
>  goto fail;
>  }
>  
>  len = TARGET_PAGE_ALIGN(len);
> -if (len == 0)
> -goto the_end;
> +if (!len) {
> +errno = EINVAL;
> +goto fail;
> +}

Why do you check twice len?
TARGET_PAGE_ALIGN() rounds up the value, so if it was not 0, it cannot
be now.

Thanks,
Laurent



[Qemu-devel] [PATCH v5 2/3] spapr: introduce a IRQ controller backend to the machine

2018-07-26 Thread Cédric Le Goater
This proposal moves all the related IRQ routines of the sPAPR machine
behind a sPAPR IRQ backend interface 'spapr_irq' to prepare for future
changes. First of which will be to increase the size of the IRQ number
space, then, will follow a new backend for the POWER9 XIVE IRQ controller.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr.h |  11 +-
 include/hw/ppc/spapr_irq.h |  22 
 hw/ppc/spapr.c | 180 +
 hw/ppc/spapr_cpu_core.c|   1 +
 hw/ppc/spapr_irq.c | 230 +
 5 files changed, 259 insertions(+), 185 deletions(-)

diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 73067f5ee8aa..ad4d7cfd97b0 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -4,7 +4,6 @@
 #include "qemu/units.h"
 #include "sysemu/dma.h"
 #include "hw/boards.h"
-#include "hw/ppc/xics.h"
 #include "hw/ppc/spapr_drc.h"
 #include "hw/mem/pc-dimm.h"
 #include "hw/ppc/spapr_ovec.h"
@@ -16,6 +15,7 @@ struct sPAPRNVRAM;
 typedef struct sPAPREventLogEntry sPAPREventLogEntry;
 typedef struct sPAPREventSource sPAPREventSource;
 typedef struct sPAPRPendingHPT sPAPRPendingHPT;
+typedef struct ICSState ICSState;
 
 #define HPTE64_V_HPTE_DIRTY 0x0040ULL
 #define SPAPR_ENTRY_POINT   0x100
@@ -110,6 +110,7 @@ struct sPAPRMachineClass {
   unsigned n_dma, uint32_t *liobns, Error **errp);
 sPAPRResizeHPT resize_hpt_default;
 sPAPRCapabilities default_caps;
+sPAPRIrq *irq;
 };
 
 /**
@@ -780,14 +781,6 @@ int spapr_get_vcpu_id(PowerPCCPU *cpu);
 void spapr_set_vcpu_id(PowerPCCPU *cpu, int cpu_index, Error **errp);
 PowerPCCPU *spapr_find_cpu(int vcpu_id);
 
-int spapr_irq_find(sPAPRMachineState *spapr, int num, bool align,
-   Error **errp);
-#define spapr_irq_findone(spapr, errp) spapr_irq_find(spapr, 1, false, errp)
-int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
-void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
-qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
-
-
 int spapr_caps_pre_load(void *opaque);
 int spapr_caps_pre_save(void *opaque);
 
diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index 6f7f50548809..0e98c4474bb2 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -29,4 +29,26 @@ int spapr_irq_msi_alloc(sPAPRMachineState *spapr, uint32_t 
num, bool align,
 void spapr_irq_msi_free(sPAPRMachineState *spapr, int irq, uint32_t num);
 void spapr_irq_msi_reset(sPAPRMachineState *spapr);
 
+typedef struct sPAPRIrq {
+uint32_tnr_irqs;
+
+void (*init)(sPAPRMachineState *spapr, Error **errp);
+int (*claim)(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
+void (*free)(sPAPRMachineState *spapr, int irq, int num);
+qemu_irq (*qirq)(sPAPRMachineState *spapr, int irq);
+void (*print_info)(sPAPRMachineState *spapr, Monitor *mon);
+} sPAPRIrq;
+
+extern sPAPRIrq spapr_irq_xics;
+
+int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
+void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
+qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
+
+/*
+ * XICS legacy routines
+ */
+int spapr_irq_find(sPAPRMachineState *spapr, int num, bool align, Error 
**errp);
+#define spapr_irq_findone(spapr, errp) spapr_irq_find(spapr, 1, false, errp)
+
 #endif
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 1794952d4e2a..693144e40d49 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -54,7 +54,6 @@
 #include "hw/ppc/spapr.h"
 #include "hw/ppc/spapr_vio.h"
 #include "hw/pci-host/spapr.h"
-#include "hw/ppc/xics.h"
 #include "hw/pci/msi.h"
 
 #include "hw/pci/pci.h"
@@ -117,33 +116,6 @@ static bool spapr_is_thread0_in_vcore(sPAPRMachineState 
*spapr,
 return spapr_get_vcpu_id(cpu) % spapr->vsmt == 0;
 }
 
-static ICSState *spapr_ics_create(sPAPRMachineState *spapr,
-  const char *type_ics,
-  int nr_irqs, Error **errp)
-{
-Error *local_err = NULL;
-Object *obj;
-
-obj = object_new(type_ics);
-object_property_add_child(OBJECT(spapr), "ics", obj, &error_abort);
-object_property_add_const_link(obj, ICS_PROP_XICS, OBJECT(spapr),
-   &error_abort);
-object_property_set_int(obj, nr_irqs, "nr-irqs", &local_err);
-if (local_err) {
-goto error;
-}
-object_property_set_bool(obj, true, "realized", &local_err);
-if (local_err) {
-goto error;
-}
-
-return ICS_BASE(obj);
-
-error:
-error_propagate(errp, local_err);
-return NULL;
-}
-
 static bool pre_2_10_vmstate_dummy_icp_needed(void *opaque)
 {
 /* Dummy entries correspond to unused ICPState objects in older QEMUs,
@@ -184,43 +156,6 @@ static int xics_max_server_number(sPAPRMachineState *spapr)
 return DIV_ROUND_UP(max_cpus * spapr->vsmt, smp_threads);
 }
 
-static void xics_sys

[Qemu-devel] [PATCH v5 0/3] spapr: introduce a fixed IRQ number space and an IRQ controller backend

2018-07-26 Thread Cédric Le Goater
Here is a proposal for a new IRQ number space layout using static
numbers and a bitmap allocator for the MSIs. The previous layout is
kept for compatibly in machines raising the 'legacy_irq_allocation' flag.

The patchset also introduces a sPAPR IRQ interface which offers the
possibility to provide different IRQ controller backend to the sPAPR
machine. This is preparing ground for the new XIVE controller.

The new XICS layout will only be activated when a new pseries-3.1
machine is introduced.

Thanks,

C.

Changes since v4 :

 - assigned IRQ numbers to their default static values
 - improved the reg_to_irq conversion routine for the VIO devices

Changes since v3 :

 - introduced a reg_to_irq conversion routine for the VIO devices

Changes since v2 :

 - renamed 'xics_legacy' to 'legacy_irq_allocation'
 - introduced the sPAPR IRQ backend interface
 - increase the size of the IRQ number space for newer machines

Changes since v1 :

 - removed block allocation
 - spaced the IRQ offsets 
 - check for overlaps when allocating VIO irqs
 - removed 'Error *' arg from spapr_irq_msi_init() 

Cédric Le Goater (3):
  spapr: introduce a fixed IRQ number space
  spapr: introduce a IRQ controller backend to the machine
  spapr: increase the size of the IRQ number space

 include/hw/ppc/spapr.h |  16 +-
 include/hw/ppc/spapr_irq.h |  55 +++
 hw/ppc/spapr.c | 203 -
 hw/ppc/spapr_cpu_core.c|   1 +
 hw/ppc/spapr_events.c  |  12 +-
 hw/ppc/spapr_irq.c | 296 +
 hw/ppc/spapr_pci.c |  29 +++-
 hw/ppc/spapr_vio.c |  65 +++-
 hw/ppc/Makefile.objs   |   2 +-
 9 files changed, 480 insertions(+), 199 deletions(-)
 create mode 100644 include/hw/ppc/spapr_irq.h
 create mode 100644 hw/ppc/spapr_irq.c

-- 
2.17.1




[Qemu-devel] [PATCH v5 1/3] spapr: introduce a fixed IRQ number space

2018-07-26 Thread Cédric Le Goater
This proposal introduces a new IRQ number space layout using static
numbers for all devices, depending on a device index, and a bitmap
allocator for the MSI IRQ numbers which are negotiated by the guest at
runtime.

As the VIO device model does not have a device index but a "reg"
property, we introduce a formula to compute an IRQ number from a "reg"
value. It should minimize most of the collisions.

The previous layout is kept in pre-3.1 machines raising the
'legacy_irq_allocation' machine class flag.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr.h |  5 +++
 include/hw/ppc/spapr_irq.h | 32 +++
 hw/ppc/spapr.c | 32 ++-
 hw/ppc/spapr_events.c  | 12 ---
 hw/ppc/spapr_irq.c | 56 
 hw/ppc/spapr_pci.c | 29 +
 hw/ppc/spapr_vio.c | 65 ++
 hw/ppc/Makefile.objs   |  2 +-
 8 files changed, 214 insertions(+), 19 deletions(-)
 create mode 100644 include/hw/ppc/spapr_irq.h
 create mode 100644 hw/ppc/spapr_irq.c

diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 7e5de1a6fd42..73067f5ee8aa 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -8,6 +8,7 @@
 #include "hw/ppc/spapr_drc.h"
 #include "hw/mem/pc-dimm.h"
 #include "hw/ppc/spapr_ovec.h"
+#include "hw/ppc/spapr_irq.h"
 
 struct VIOsPAPRBus;
 struct sPAPRPHBState;
@@ -101,6 +102,8 @@ struct sPAPRMachineClass {
 bool dr_lmb_enabled;   /* enable dynamic-reconfig/hotplug of LMBs */
 bool use_ohci_by_default;  /* use USB-OHCI instead of XHCI */
 bool pre_2_10_has_unused_icps;
+bool legacy_irq_allocation;
+
 void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
   uint64_t *buid, hwaddr *pio, 
   hwaddr *mmio32, hwaddr *mmio64,
@@ -167,6 +170,8 @@ struct sPAPRMachineState {
 char *kvm_type;
 
 const char *icp_type;
+int32_t irq_map_nr;
+unsigned long *irq_map;
 
 bool cmd_line_caps[SPAPR_CAP_NUM];
 sPAPRCapabilities def, eff, mig;
diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
new file mode 100644
index ..6f7f50548809
--- /dev/null
+++ b/include/hw/ppc/spapr_irq.h
@@ -0,0 +1,32 @@
+/*
+ * QEMU PowerPC sPAPR IRQ backend definitions
+ *
+ * Copyright (c) 2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef HW_SPAPR_IRQ_H
+#define HW_SPAPR_IRQ_H
+
+/*
+ * IRQ range offsets per device type
+ */
+#define SPAPR_IRQ_EPOW   0x1000  /* XICS_IRQ_BASE offset */
+#define SPAPR_IRQ_HOTPLUG0x1001
+#define SPAPR_IRQ_VIO0x1100  /* 256 VIO devices */
+#define SPAPR_IRQ_PCI_LSI0x1200  /* 32+ PHBs devices */
+
+#define SPAPR_IRQ_MSI0x1300  /* Offset of the dynamic range covered
+  * by the bitmap allocator */
+
+typedef struct sPAPRMachineState sPAPRMachineState;
+
+void spapr_irq_msi_init(sPAPRMachineState *spapr, uint32_t nr_msis);
+int spapr_irq_msi_alloc(sPAPRMachineState *spapr, uint32_t num, bool align,
+Error **errp);
+void spapr_irq_msi_free(sPAPRMachineState *spapr, int irq, uint32_t num);
+void spapr_irq_msi_reset(sPAPRMachineState *spapr);
+
+#endif
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 421b2dd09b51..1794952d4e2a 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -189,6 +189,11 @@ static void xics_system_init(MachineState *machine, int 
nr_irqs, Error **errp)
 sPAPRMachineState *spapr = SPAPR_MACHINE(machine);
 Error *local_err = NULL;
 
+/* Initialize the MSI IRQ allocator. */
+if (!SPAPR_MACHINE_GET_CLASS(spapr)->legacy_irq_allocation) {
+spapr_irq_msi_init(spapr, XICS_IRQ_BASE + nr_irqs - SPAPR_IRQ_MSI);
+}
+
 if (kvm_enabled()) {
 if (machine_kernel_irqchip_allowed(machine) &&
 !xics_kvm_init(spapr, &local_err)) {
@@ -1636,6 +1641,10 @@ static void spapr_machine_reset(void)
 ppc_set_compat(first_ppc_cpu, spapr->max_compat_pvr, &error_fatal);
 }
 
+if (!SPAPR_MACHINE_GET_CLASS(spapr)->legacy_irq_allocation) {
+spapr_irq_msi_reset(spapr);
+}
+
 qemu_devices_reset();
 
 /* DRC reset may cause a device to be unplugged. This will cause troubles
@@ -1910,6 +1919,24 @@ static const VMStateDescription vmstate_spapr_patb_entry 
= {
 },
 };
 
+static bool spapr_irq_map_needed(void *opaque)
+{
+sPAPRMachineState *spapr = opaque;
+
+return spapr->irq_map && !bitmap_empty(spapr->irq_map, spapr->irq_map_nr);
+}
+
+static const VMStateDescription vmstate_spapr_irq_map = {
+.name = "spapr_irq_map",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = spapr_irq_map_needed,
+.fields = (VMStateField[]) {
+VMSTATE_BITMAP(irq_map, sPAPRMachineState, 0, irq_map_nr),
+VMSTATE_END_OF_LIST()
+},
+};
+
 static co

[Qemu-devel] [PATCH v5 3/3] spapr: increase the size of the IRQ number space

2018-07-26 Thread Cédric Le Goater
The new layout using static IRQ number does not leave much space to
the dynamic MSI range, only 0x100 IRQ numbers. Increase the total
number of IRQS for newer machines and introduce a legacy XICS backend
for pre-3.1 machines to maintain compatibility.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr_irq.h |  1 +
 hw/ppc/spapr.c |  1 +
 hw/ppc/spapr_irq.c | 12 +++-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index 0e98c4474bb2..626160ba475e 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -40,6 +40,7 @@ typedef struct sPAPRIrq {
 } sPAPRIrq;
 
 extern sPAPRIrq spapr_irq_xics;
+extern sPAPRIrq spapr_irq_xics_legacy;
 
 int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
 void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 693144e40d49..44b7423ebeaa 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -3926,6 +3926,7 @@ static void spapr_machine_3_0_class_options(MachineClass 
*mc)
 sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
 
 smc->legacy_irq_allocation = true;
+smc->irq = &spapr_irq_xics_legacy;
 }
 
 DEFINE_SPAPR_MACHINE(3_0, "3.0", true);
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 0cbb5dd39368..620c49b38455 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -196,7 +196,7 @@ static void spapr_irq_print_info_xics(sPAPRMachineState 
*spapr, Monitor *mon)
 }
 
 sPAPRIrq spapr_irq_xics = {
-.nr_irqs = XICS_IRQS_SPAPR,
+.nr_irqs = 0x1000,
 
 .init= spapr_irq_init_xics,
 .claim   = spapr_irq_claim_xics,
@@ -284,3 +284,13 @@ int spapr_irq_find(sPAPRMachineState *spapr, int num, bool 
align, Error **errp)
 
 return first + ics->offset;
 }
+
+sPAPRIrq spapr_irq_xics_legacy = {
+.nr_irqs = XICS_IRQS_SPAPR,
+
+.init= spapr_irq_init_xics,
+.claim   = spapr_irq_claim_xics,
+.free= spapr_irq_free_xics,
+.qirq= spapr_qirq_xics,
+.print_info  = spapr_irq_print_info_xics,
+};
-- 
2.17.1




Re: [Qemu-devel] [Qemu-ppc] [PATCH v2 2/2] tpm_spapr: Support suspend and resume

2018-07-26 Thread Eric Blake

On 07/26/2018 12:31 AM, David Gibson wrote:


On Tue, Dec 12, 2017 at 03:44:03PM -0500, Stefan Berger wrote:

Signed-off-by: Stefan Berger 
---
  hw/tpm/tpm_spapr.c | 61 ++
  1 file changed, 57 insertions(+), 4 deletions(-)

diff --git a/hw/tpm/tpm_spapr.c b/hw/tpm/tpm_spapr.c
index ef5e62d..0b8a424 100644
--- a/hw/tpm/tpm_spapr.c
+++ b/hw/tpm/tpm_spapr.c
@@ -248,9 +248,8 @@ static int tpm_spapr_do_crq(struct VIOsPAPRDevice *dev, 
uint8_t *crq_data)
  return H_SUCCESS;
  }
  
-static void tpm_spapr_request_completed(TPMIf *ti)

+static void _tpm_spapr_request_completed(SPAPRvTPMState *s)


Don't start identifiers with _, please, they're reserved by the
standard library.


Technically, is is _[_A-Z] reserved by the implementation; use of _[a-z] 
is just fine for file-scoped identifiers, like this one. But it is still 
unusual in our codebase, and best avoided.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-devel] [PATCH 1/2] qstring: Assert size calculations don't overflow

2018-07-26 Thread Eric Blake

On 07/26/2018 01:18 AM, Markus Armbruster wrote:

Signed-off-by: Markus Armbruster 
---
  qobject/qstring.c | 6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/qobject/qstring.c b/qobject/qstring.c
index 18b8eb82f8..7990569c5a 100644
--- a/qobject/qstring.c
+++ b/qobject/qstring.c
@@ -41,17 +41,19 @@ QString *qstring_from_substr(const char *str, size_t start, 
size_t end)
  {
  QString *qstring;
  
+assert(start <= end + 1);


end + 1 can overflow size_t, but it is unsigned so well-defined, and the 
assert will trigger as desired.



@@ -68,7 +70,9 @@ QString *qstring_from_str(const char *str)
  static void capacity_increase(QString *qstring, size_t len)
  {
  if (qstring->capacity < (qstring->length + len)) {
+assert(len <= SIZE_MAX - qstring->capacity);
  qstring->capacity += len;


You've asserted that this addition won't overflow...


+assert(qstring->capacity + len <= SIZE_MAX / 2);


...but now that qstring->capacity is larger, this could overflow. Do you 
really need the +len in here, given that...



  qstring->capacity *= 2; /* use exponential growth */


...you are really only trying to prevent overflow of doubling 
qstring->capacity without adding yet another len in the mix?


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-devel] [PATCH 2/2] qstring: Move qstring_from_substr()'s @end one to the right

2018-07-26 Thread Eric Blake

On 07/26/2018 01:18 AM, Markus Armbruster wrote:

qstring_from_substr() takes the index of the substring's first and
last character.  qstring_from_substr(s, 0, SIZE_MAX) denotes an empty
substring.  Awkward.

Shift the end index one to the right.  This simplifies both
qstring_from_substr() and its callers.

Signed-off-by: Markus Armbruster 
---


Not strictly a bug fix, but found while fixing a bug, thus I'm okay if 
this makes it into 3.0 (your call as maintainer).



  block/blkdebug.c  | 2 +-
  block/blkverify.c | 2 +-
  block/nbd.c   | 2 +-
  qobject/qstring.c | 6 +++---
  tests/check-qobject.c | 2 +-
  tests/check-qstring.c | 2 +-
  6 files changed, 8 insertions(+), 8 deletions(-)



This does fix all users.

Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-devel] [PATCH v2] block/gluster: defend on legacy ftruncate api use

2018-07-26 Thread Eric Blake

On 07/26/2018 04:19 AM, Niels de Vos wrote:

From: Prasanna Kumar Kalever 


When sending a v2, it's best to start a new thread instead of 
in-reply-to v1, so that the automated tooling spots it easier.


Subject line is awkward, may I suggest:

block/gluster: Handle changed glfs_ftruncate signature



New versions of Glusters libgfapi.so have an updated glfs_ftruncate()
function that returns additional 'struct stat' structures to enable
advanced caching of attributes. This is useful for file servers, not so
much for QEMU. Never the less, the API has changed and needs to be


s/Never the less/Nevertheless/


adopted.

Signed-off-by: Prasanna Kumar Kalever 
Signed-off-by: Niels de Vos 

--
v2: do a compile check as suggested by Eric Blake
---
  block/gluster.c | 15 +--
  configure   | 18 ++
  2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index 4fd55a9cc5..d1c6f81f5c 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -997,6 +997,7 @@ static int qemu_gluster_do_truncate(struct glfs_fd *fd, 
int64_t offset,
  PreallocMode prealloc, Error **errp)
  {
  int64_t current_length;
+int ret;
  
  current_length = glfs_lseek(fd, 0, SEEK_END);

  if (current_length < 0) {
@@ -1024,7 +1025,12 @@ static int qemu_gluster_do_truncate(struct glfs_fd *fd, 
int64_t offset,
  #endif /* CONFIG_GLUSTERFS_FALLOCATE */
  #ifdef CONFIG_GLUSTERFS_ZEROFILL
  case PREALLOC_MODE_FULL:
-if (glfs_ftruncate(fd, offset)) {
+#ifdef CONFIG_GLUSTERFS_LEGACY_FTRUNCATE
+ret = glfs_ftruncate(fd, offset);
+#else
+ret = glfs_ftruncate(fd, offset, NULL, NULL);
+#endif


I'm personally a fan of minimizing in-function #ifdef, where it is easy. 
This could be done by a top-level:


#ifdef CONFIG_GLUSTERFS_LEGACY_FTRUNCATE
# define glfs_ftruncate(fd, offs, _1, _2) glfs_ftruncate(fd, offs)
#endif

then just using glfs_ftruncate(fd, offs, NULL, NULL) unconditionally in 
the rest of the file.


At any rate, the configure test looks a lot better than the v1 patch.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-devel] [PULL for-3.0 1/1] trace/simple: fix hang in child after fork(2)

2018-07-26 Thread Stefan Hajnoczi
On Tue, Jul 24, 2018 at 03:41:01PM +0100, Daniel P. Berrangé wrote:
> On Tue, Jul 24, 2018 at 03:35:51PM +0100, Daniel P. Berrangé wrote:
> > On Tue, Jul 24, 2018 at 03:25:04PM +0100, Stefan Hajnoczi wrote:
> > > The simple trace backend spawns a write-out thread which is used to
> > > asynchronously flush the in-memory ring buffer to disk.
> > > 
> > > fork(2) does not clone all threads, only the thread that invoked
> > > fork(2).  As a result there is no write-out thread in the child process!
> > > 
> > > This causes a hang during shutdown when atexit(3) handler installed by
> > > the simple trace backend waits for the non-existent write-out thread.
> > > 
> > > This patch uses pthread_atfork(3) to terminate the write-out thread
> > > before fork and restart it in both the parent and child after fork.
> > > This solves a hang in qemu-iotests 147 due to qemu-nbd --fork usage.
> > 
> > I'm not convinced this is safe, as it looks like it has a window in
> > which both the parent and child processes will be doing write-out to
> > the same file.
> > 
> > In particular in the main QEMU system emulators it means that any
> > time we fork() in QEMU, eg for spawning commands with migration
> > exec: URI, or TAP devuce ifup scripts, etc, we'll be starting a
> > write-out thread in the child.
> 
> I'd be more inclined to have the pthread_atfork() handle simply terminate
> the tracing process, reversing all effects of trace_init_backends(). Then
> after qemu-nbd has called fork(), it can simply call trace_init_backends()
> explicitly to start it running again. This avoids unecessarily starting
> tracing in child processes that are not requiring/expecting it.

Good point.

NACK

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-devel] [PATCH v5 2/4 for-3.0] qcow2: Options' documentation fixes

2018-07-26 Thread Eric Blake

On 07/26/2018 05:02 AM, Kevin Wolf wrote:

Am 25.07.2018 um 16:27 hat Leonid Bloch geschrieben:

Signed-off-by: Leonid Bloch 
---
  docs/qcow2-cache.txt |  3 +++
  qemu-options.hx  | 10 ++
  2 files changed, 9 insertions(+), 4 deletions(-)




+++ b/qemu-options.hx
@@ -752,15 +752,17 @@ image file)
  
  @item cache-size

  The maximum total size of the L2 table and refcount block caches in bytes
-(default: 1048576 bytes or 8 clusters, whichever is larger)


I think it would be good to still say something about the default.
Maybe something like "default: the sum of l2-cache-size and
refcount-cache-size"?


Except what happens if you specify only one of l2-cache-size or 
refcount-cache-size? Is the defaulted cache-size then just that one size 
you specified (and the other cache ignored), or is the total cache size 
still defaulted to the 1M/8-cluster size (assuming its larger than the 
other size specified)?





  @item l2-cache-size
-The maximum size of the L2 table cache in bytes
-(default: 4/5 of the total cache size)
+The maximum size of the L2 table cache.


Why did you remove "in bytes" and add a period which the other options
don't have? I prefer the old version of this line.


+(default: if cache-size is not defined - 1048576 bytes or 8 clusters, whichever
+is larger; otherwise, as large as possible or needed within the cache-size,
+while permitting the requested or the minimal refcount cache size)
  
  @item refcount-cache-size

  The maximum size of the refcount block cache in bytes
-(default: 1/5 of the total cache size)
+(default: 4 times the cluster size, or any portion of the cache-size, if it is
+specified and large enough, left over after allocating the full L2 cache)


I found the second part hard to understand. Maybe "4 times the cluster
size; or if both cache-size and l2-cache-size are given, the part of
the cache-size that is not used for the L2 cache yet."?

Kevin



--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-devel] [PATCH v5 2/4 for-3.0] qcow2: Options' documentation fixes

2018-07-26 Thread Leonid Bloch

On 07/26/2018 01:02 PM, Kevin Wolf wrote:

Am 25.07.2018 um 16:27 hat Leonid Bloch geschrieben:

Signed-off-by: Leonid Bloch 
---
  docs/qcow2-cache.txt |  3 +++
  qemu-options.hx  | 10 ++
  2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/docs/qcow2-cache.txt b/docs/qcow2-cache.txt
index 8a09a5cc5f..3673f2be0e 100644
--- a/docs/qcow2-cache.txt
+++ b/docs/qcow2-cache.txt
@@ -130,6 +130,9 @@ There are a few things that need to be taken into account:
 memory as possible to the L2 cache before increasing the refcount
 cache size.
  
+- At most two of "l2-cache-size", "refcount-cache-size", and "cache-size"

+  can be set simultaneously.


The indentation is off here, the other list items have one space more.


  Unlike L2 tables, refcount blocks are not used during normal I/O but
  only during allocations and internal snapshots. In most cases they are
  accessed sequentially (even during random guest I/O) so increasing the
diff --git a/qemu-options.hx b/qemu-options.hx
index b1bf0f485f..13ece21cb6 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -752,15 +752,17 @@ image file)
  
  @item cache-size

  The maximum total size of the L2 table and refcount block caches in bytes
-(default: 1048576 bytes or 8 clusters, whichever is larger)


I think it would be good to still say something about the default.
Maybe something like "default: the sum of l2-cache-size and
refcount-cache-size"?


Yes, that sounds good!




  @item l2-cache-size
-The maximum size of the L2 table cache in bytes
-(default: 4/5 of the total cache size)
+The maximum size of the L2 table cache.


Why did you remove "in bytes" and add a period which the other options
don't have? I prefer the old version of this line.


I removed "in bytes" because it also accepts k, M, G, ... but on a 
second thought, these are amounts of bytes as well, so changing back to 
the old version.



+(default: if cache-size is not defined - 1048576 bytes or 8 clusters, whichever
+is larger; otherwise, as large as possible or needed within the cache-size,
+while permitting the requested or the minimal refcount cache size)
  
  @item refcount-cache-size

  The maximum size of the refcount block cache in bytes
-(default: 1/5 of the total cache size)
+(default: 4 times the cluster size, or any portion of the cache-size, if it is
+specified and large enough, left over after allocating the full L2 cache)


I found the second part hard to understand. Maybe "4 times the cluster
size; or if both cache-size and l2-cache-size are given, the part of
the cache-size that is not used for the L2 cache yet."?


That is not accurate. Because even if l2-cache-size is not given and 
cache-size is large enough to accommodate enough L2 cache to cover the 
entire image, plus the minimal amount of refcount cache with room to 
spare - refcount cache will use all the rest of the cache-size. How about:


"default: 4 times the cluster size; or if cache-size is specified, the 
part of it which is not used for the L2 cache"


Leonid.


Kevin





Re: [Qemu-devel] [PATCH for-3.0] file-posix: Fix write_zeroes with unmap on block devices

2018-07-26 Thread Eric Blake

On 07/26/2018 06:33 AM, Kevin Wolf wrote:

The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
all-zero afterwards, so don't try to abuse it for zero writing. We try
to only use this if BLKDISCARDZEROES tells us that it is safe, but this
is unreliable on older kernels and a constant 0 in newer kernels. In


For my own curiosity, which kernel commit switched to constant 0, so I 
can read the rationale in that commit message?



other words, this code path is never actually used with newer kernels,
so we don't even try to unmap while writing zeros.

This patch removes the abuse of discard for writing zeroes from
file-posix and instead adds a new function that uses interfaces that are
actually meant to deallocate and zero out at the same time. Only if
those fail, it falls back to zeroing out without unmap. We never fall
back to a discard operation any more that may or may not result in
zeros.


Makes sense.



Signed-off-by: Kevin Wolf 
---
  block/file-posix.c | 62 +-
  1 file changed, 47 insertions(+), 15 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 60af4b3d51..9c66cd87d1 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -648,7 +648,7 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
  }
  #endif
  
-bs->supported_zero_flags = s->discard_zeroes ? BDRV_REQ_MAY_UNMAP : 0;

+bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
  ret = 0;
  fail:
  if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
@@ -1487,6 +1487,38 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData 
*aiocb)
  return -ENOTSUP;
  }
  
+static ssize_t handle_aiocb_write_zeroes_unmap(RawPosixAIOData *aiocb)

+{
+BDRVRawState *s = aiocb->bs->opaque;
+int ret;
+
+/* First try to write zeros and unmap at the same time */
+
+#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+   aiocb->aio_offset, aiocb->aio_nbytes);


Umm, doesn't this have to use FALLOC_FL_ZERO_RANGE? FALLOC_FL_PUNCH_HOLE 
deallocs, but is not required to write zeroes.



+if (ret != -ENOTSUP) {
+return ret;
+}
+#endif
+
+#ifdef CONFIG_XFS
+if (s->is_xfs) {
+/* xfs_discard() guarantees that the discarded area reads as all-zero
+ * afterwards, so we can use it here. */
+return xfs_discard(s, aiocb->aio_offset, aiocb->aio_nbytes);
+}
+#endif
+
+/* Make the compiler happy if neither of the above is compiled in */
+(void) s;


Could also be done in fewer lines by use of:

   BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;

since that attribute means "might be unused, don't warn if it is 
actually unused" (and not the stricter "must be unused, warn if it got 
used anyway")



+
+/* If we couldn't manage to unmap while guaranteed that the area reads as
+ * all-zero afterwards, just write zeroes without unmapping */
+ret = handle_aiocb_write_zeroes(aiocb);
+return ret;
+}
+
  #ifndef HAVE_COPY_FILE_RANGE
  static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
   off_t *out_off, size_t len, unsigned int flags)



--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-devel] [PATCH v5 2/4 for-3.0] qcow2: Options' documentation fixes

2018-07-26 Thread Leonid Bloch

On 07/26/2018 05:20 PM, Eric Blake wrote:

On 07/26/2018 05:02 AM, Kevin Wolf wrote:

Am 25.07.2018 um 16:27 hat Leonid Bloch geschrieben:

Signed-off-by: Leonid Bloch 
---
  docs/qcow2-cache.txt |  3 +++
  qemu-options.hx  | 10 ++
  2 files changed, 9 insertions(+), 4 deletions(-)




+++ b/qemu-options.hx
@@ -752,15 +752,17 @@ image file)
  @item cache-size
  The maximum total size of the L2 table and refcount block caches in 
bytes

-(default: 1048576 bytes or 8 clusters, whichever is larger)


I think it would be good to still say something about the default.
Maybe something like "default: the sum of l2-cache-size and
refcount-cache-size"?


Except what happens if you specify only one of l2-cache-size or 
refcount-cache-size? Is the defaulted cache-size then just that one size 
you specified (and the other cache ignored), or is the total cache size 
still defaulted to the 1M/8-cluster size (assuming its larger than the 
other size specified)?


I think that the meaning here is the sum of the actual l2-cache-size and 
refcount-cache-size. Not the specified options, but the actual caches.


Leonid.



Re: [Qemu-devel] [PATCH v3 3/5] qcow2: Introduce an option for sufficient L2 cache for the entire image

2018-07-26 Thread Kevin Wolf
Am 26.07.2018 um 14:24 hat Leonid Bloch geschrieben:
> > > You mean with QDict? I'll look into that now. But already sent v5 before
> > > reading this email.
> > 
> > Yes, with reading it from the QDict. (Or whatever the simplest way is
> > that results in the right external interface, but I suppose this is the
> > one.)
> 
> Well, there is a problem with that: I can easily isolate
> l2-cache-size from QDict, check if it is "full", and if it is - do whatever
> is needed, and delete this option before parsing. But what if it is "foo"?
> It will not get deleted, and the regular QEMU_OPT_SIZE parsing error will
> appear, stating that l2-cache-size "expects a non-negative number..." - no
> word about that it can expect "full" as well. Now, one can try to modify
> local_err->msg for this particular option, but this will require substantial
> additional logic. I think considering this, it would be easier to stick with
> a dedicated option, l2-cache-full.
> 
> Do you think there is a smarter way to parse the l2-cache-size option, so it
> would accept both size and "full", while handling errors correctly? It seems
> more elegant to have a single option, but the internal handling will be more
> elegant and simpler with two mutually exclusive options.

I think we can live with the suboptimal error message for a while. Once
qcow2 is QAPIfied, it should become easy to improve it. Let's not choose
a worse design (that stays forever) for a temporarily better error
message.

> By the way, the L2 cache resizes now on image resize. Will send the changes
> in v6. Thanks for the suggestion!

Sounds good!

Kevin



Re: [Qemu-devel] [PATCH v5 2/4 for-3.0] qcow2: Options' documentation fixes

2018-07-26 Thread Kevin Wolf
Am 26.07.2018 um 16:27 hat Leonid Bloch geschrieben:
> On 07/26/2018 01:02 PM, Kevin Wolf wrote:
> > Am 25.07.2018 um 16:27 hat Leonid Bloch geschrieben:
> > > Signed-off-by: Leonid Bloch 
> > > ---
> > >   docs/qcow2-cache.txt |  3 +++
> > >   qemu-options.hx  | 10 ++
> > >   2 files changed, 9 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/docs/qcow2-cache.txt b/docs/qcow2-cache.txt
> > > index 8a09a5cc5f..3673f2be0e 100644
> > > --- a/docs/qcow2-cache.txt
> > > +++ b/docs/qcow2-cache.txt
> > > @@ -130,6 +130,9 @@ There are a few things that need to be taken into 
> > > account:
> > >  memory as possible to the L2 cache before increasing the refcount
> > >  cache size.
> > > +- At most two of "l2-cache-size", "refcount-cache-size", and "cache-size"
> > > +  can be set simultaneously.
> > 
> > The indentation is off here, the other list items have one space more.
> > 
> > >   Unlike L2 tables, refcount blocks are not used during normal I/O but
> > >   only during allocations and internal snapshots. In most cases they are
> > >   accessed sequentially (even during random guest I/O) so increasing the
> > > diff --git a/qemu-options.hx b/qemu-options.hx
> > > index b1bf0f485f..13ece21cb6 100644
> > > --- a/qemu-options.hx
> > > +++ b/qemu-options.hx
> > > @@ -752,15 +752,17 @@ image file)
> > >   @item cache-size
> > >   The maximum total size of the L2 table and refcount block caches in 
> > > bytes
> > > -(default: 1048576 bytes or 8 clusters, whichever is larger)
> > 
> > I think it would be good to still say something about the default.
> > Maybe something like "default: the sum of l2-cache-size and
> > refcount-cache-size"?
> 
> Yes, that sounds good!
> 
> > 
> > >   @item l2-cache-size
> > > -The maximum size of the L2 table cache in bytes
> > > -(default: 4/5 of the total cache size)
> > > +The maximum size of the L2 table cache.
> > 
> > Why did you remove "in bytes" and add a period which the other options
> > don't have? I prefer the old version of this line.
> 
> I removed "in bytes" because it also accepts k, M, G, ... but on a second
> thought, these are amounts of bytes as well, so changing back to the old
> version.
> 
> > > +(default: if cache-size is not defined - 1048576 bytes or 8 clusters, 
> > > whichever
> > > +is larger; otherwise, as large as possible or needed within the 
> > > cache-size,
> > > +while permitting the requested or the minimal refcount cache size)
> > >   @item refcount-cache-size
> > >   The maximum size of the refcount block cache in bytes
> > > -(default: 1/5 of the total cache size)
> > > +(default: 4 times the cluster size, or any portion of the cache-size, if 
> > > it is
> > > +specified and large enough, left over after allocating the full L2 cache)
> > 
> > I found the second part hard to understand. Maybe "4 times the cluster
> > size; or if both cache-size and l2-cache-size are given, the part of
> > the cache-size that is not used for the L2 cache yet."?
> 
> That is not accurate. Because even if l2-cache-size is not given and
> cache-size is large enough to accommodate enough L2 cache to cover the
> entire image, plus the minimal amount of refcount cache with room to spare -
> refcount cache will use all the rest of the cache-size.

Oh, you're right! I wasn't aware that we consider the image size there.

> How about:
> 
> "default: 4 times the cluster size; or if cache-size is specified, the part
> of it which is not used for the L2 cache"

Even simpler, easier to understand and more accurate. I like it.

Kevin



Re: [Qemu-devel] [PATCH v3 3/5] qcow2: Introduce an option for sufficient L2 cache for the entire image

2018-07-26 Thread Leonid Bloch

On 07/26/2018 05:42 PM, Kevin Wolf wrote:

Am 26.07.2018 um 14:24 hat Leonid Bloch geschrieben:

You mean with QDict? I'll look into that now. But already sent v5 before
reading this email.


Yes, with reading it from the QDict. (Or whatever the simplest way is
that results in the right external interface, but I suppose this is the
one.)


Well, there is a problem with that: I can easily isolate
l2-cache-size from QDict, check if it is "full", and if it is - do whatever
is needed, and delete this option before parsing. But what if it is "foo"?
It will not get deleted, and the regular QEMU_OPT_SIZE parsing error will
appear, stating that l2-cache-size "expects a non-negative number..." - no
word about that it can expect "full" as well. Now, one can try to modify
local_err->msg for this particular option, but this will require substantial
additional logic. I think considering this, it would be easier to stick with
a dedicated option, l2-cache-full.

Do you think there is a smarter way to parse the l2-cache-size option, so it
would accept both size and "full", while handling errors correctly? It seems
more elegant to have a single option, but the internal handling will be more
elegant and simpler with two mutually exclusive options.


I think we can live with the suboptimal error message for a while. Once
qcow2 is QAPIfied, it should become easy to improve it. Let's not choose
a worse design (that stays forever) for a temporarily better error
message.


OK. I'll add a TODO then.




By the way, the L2 cache resizes now on image resize. Will send the changes
in v6. Thanks for the suggestion!


Sounds good!

Kevin





Re: [Qemu-devel] [PATCH v5 2/4 for-3.0] qcow2: Options' documentation fixes

2018-07-26 Thread Leonid Bloch

How about:

"default: 4 times the cluster size; or if cache-size is specified, the part
of it which is not used for the L2 cache"


Even simpler, easier to understand and more accurate. I like it.


Thanks!



Kevin





Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Stefan Hajnoczi
On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:
> Hey,
> 
> Our team is adding a NIC functional  emulation to QEMU.
> One of the features we are adding to this NIC is SRIOV.
> 
> Here is the error message I get when checking SRIOV support of  our emulated 
> NIC on Win2016 server (the hyper-v VM).
> "
> SR-IOV cannot be used on this system as the PCI Express hardware does not 
> support Access Control Services (ACS) at any root port. Contact your system 
> vendor for further information.
> "

I'm not sure what the status of emulated SR-IOV is so I have CCed
Michael Tsirkin and Marcel Apfelbaum, the PCI maintainers in QEMU.

> 
> Could you please advise about what could be the issue here?
> 
> BTW: I use same configuration (VM XML file attached) when running linux VM 
> (RH7.2) image (instead of Win Hyper-V) over the same host and SRIOV is 
> working for me there.
> 
> Here the XML file I use to define the VM (our emulated NIC is added at the 
> end of XML):
> "
> 
>   nst105
>   0249a525-2ee2-432b-a1f5-a6db83b089a3
>   8388608
>   8388608
>   8
>   
> /machine
>   
>   
> hvm
>   
>   
> 
> 
> 
>   
>   
>   
> 
>   
>   
> SandyBridge
> 
> 
>   
>   
> 
> 
> 
> 
>   
>   destroy
>   restart
>   destroy
>   
> 
> 
>   
>   
> /opt/qemu/bin/qemu-system-x86_64
> 
>   
>   
>   
>   
>   
> 
> 
>function='0x7'/>
> 
> 
>   
>function='0x0' multifunction='on'/>
> 
> 
>   
>function='0x1'/>
> 
> 
>   
>function='0x2'/>
> 
> 
>function='0x2'/>
> 
> 
> 
>   
>   
>function='0x0' multifunction='on'/>
> 
> 
>   
>   
>function='0x1'/>
> 
> 
>   
>   
>function='0x2'/>
> 
> 
>   
>function='0x0'/>
> 
> 
>   
>   
>function='0x0'/>
> 
> 
>   
>   
>   
>   
>function='0x0'/>
> 
> 
>   
> 
> 
>   
> 
> 
>   
> 
> 
> 
>  keymap='en-us'>
>   
> 
> 
>   
>function='0x0'/>
> 
> 
>function='0x0'/>
> 
>   
>   
>   
>   
> 
>  value='pcie-root-port,pref64-reserve=500M,slot=0,id=pcie_port.1'/>
> 
> 
> 
> 
>   
> 
> "
> __
> General info:
> Host OS: RH7.0 (Kernel: 4.14.13)
> QEMU version: 2.11
> libvirt version: 3.2.0
> Running the following on the host shows that both nested and IOMMU are
> enabled:
> ~]#: cat /sys/module/kvm_intel/parameters/nested
> Y
> ~]# dmesg | grep -e DMAR -e IOMMU
> [0.00] DMAR: IOMMU enabled
> 
> Thanks,
> Elijah
> 


signature.asc
Description: PGP signature


Re: [Qemu-devel] [PATCH 20/25] check: Only test usb-xhci-nec when it is compiled in

2018-07-26 Thread Juan Quintela
Thomas Huth  wrote:
> On 17.07.2018 13:33, Juan Quintela wrote:
>> Signed-off-by: Juan Quintela 
>> ---
>>  tests/Makefile.include | 8 
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>> 
>> diff --git a/tests/Makefile.include b/tests/Makefile.include
>> index b3e707e8c3..ccf71bddcc 100644
>> --- a/tests/Makefile.include
>> +++ b/tests/Makefile.include
>> @@ -288,8 +288,8 @@ check-qtest-i386-y += tests/usb-hcd-ehci-test$(EXESUF)
>>  gcov-files-i386-y += hw/usb/hcd-ehci.c
>>  gcov-files-i386-y += hw/usb/dev-hid.c
>>  gcov-files-i386-y += hw/usb/dev-storage.c
>> -check-qtest-i386-y += tests/usb-hcd-xhci-test$(EXESUF)
>> -gcov-files-i386-y += hw/usb/hcd-xhci.c
>> +check-qtest-i386-$(CONFIG_USB_XHCI_NEC) += tests/usb-hcd-xhci-test$(EXESUF)
>> +gcov-files-i386-$(CONFIG_USB_XHCI) += hw/usb/hcd-xhci.c
>>  check-qtest-i386-y += tests/cpu-plug-test$(EXESUF)
>>  check-qtest-i386-y += tests/q35-test$(EXESUF)
>>  check-qtest-i386-y += tests/vmgenid-test$(EXESUF)
>> @@ -347,8 +347,8 @@ check-qtest-ppc64-y += tests/usb-hcd-ohci-test$(EXESUF)
>>  gcov-files-ppc64-y += hw/usb/hcd-ohci.c
>>  check-qtest-ppc64-y += tests/usb-hcd-uhci-test$(EXESUF)
>>  gcov-files-ppc64-y += hw/usb/hcd-uhci.c
>> -check-qtest-ppc64-y += tests/usb-hcd-xhci-test$(EXESUF)
>> -gcov-files-ppc64-y += hw/usb/hcd-xhci.c
>> +check-qtest-ppc64-$(CONFIG_USB_XHCI_NEC) += tests/usb-hcd-xhci-test$(EXESUF)
>> +gcov-files-ppc64-$(CONFIG_USB_XHCI) += hw/usb/hcd-xhci.c
>>  check-qtest-ppc64-y += $(check-qtest-virtio-y)
>>  check-qtest-ppc64-$(CONFIG_SLIRP) += tests/test-netfilter$(EXESUF)
>>  check-qtest-ppc64-$(CONFIG_POSIX) += tests/test-filter-mirror$(EXESUF)
>
> I wonder whether the gcov lines are still up-to-date, or whether they
> should include hcd-xhci-nec.c as well? (See commit 0bbb2f3df1ffd9ccf713
> for the split)

Added the hcd-xhci-nec.c file to gcov.  I still don't know how to launch
gcowv :-(  So no clue about how up-to-date the lines are.

Later, Juan.



Re: [Qemu-devel] [Qemu-block] [PATCH for-3.0] file-posix: Fix write_zeroes with unmap on block devices

2018-07-26 Thread Nir Soffer
On Thu, Jul 26, 2018 at 5:28 PM Eric Blake  wrote:

> On 07/26/2018 06:33 AM, Kevin Wolf wrote:
> > The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
> > all-zero afterwards, so don't try to abuse it for zero writing. We try
> > to only use this if BLKDISCARDZEROES tells us that it is safe, but this
> > is unreliable on older kernels and a constant 0 in newer kernels. In
>
> For my own curiosity, which kernel commit switched to constant 0, so I
> can read the rationale in that commit message?
>

This should help
https://patchwork.kernel.org/patch/9903757/

Nir


Re: [Qemu-devel] [PATCH for-3.0] file-posix: Fix write_zeroes with unmap on block devices

2018-07-26 Thread Kevin Wolf
Am 26.07.2018 um 16:28 hat Eric Blake geschrieben:
> On 07/26/2018 06:33 AM, Kevin Wolf wrote:
> > The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
> > all-zero afterwards, so don't try to abuse it for zero writing. We try
> > to only use this if BLKDISCARDZEROES tells us that it is safe, but this
> > is unreliable on older kernels and a constant 0 in newer kernels. In
> 
> For my own curiosity, which kernel commit switched to constant 0, so I can
> read the rationale in that commit message?

I think 48920ff2a5a9 is the commit that actually changes the value.

Essentially the kernel abused discard for zero writes previously (so it
needed something like BLKDISCARDZEROES) and then switched to a separate
write_zeroes operation like we do in QEMU. Once this was done, second
guessing the behaviour of discard wasn't necessary any more because you
just tell what you really want to do (discard or zero out).

> > other words, this code path is never actually used with newer kernels,
> > so we don't even try to unmap while writing zeros.
> > 
> > This patch removes the abuse of discard for writing zeroes from
> > file-posix and instead adds a new function that uses interfaces that are
> > actually meant to deallocate and zero out at the same time. Only if
> > those fail, it falls back to zeroing out without unmap. We never fall
> > back to a discard operation any more that may or may not result in
> > zeros.
> 
> Makes sense.
> 
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >   block/file-posix.c | 62 
> > +-
> >   1 file changed, 47 insertions(+), 15 deletions(-)
> > 
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 60af4b3d51..9c66cd87d1 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -648,7 +648,7 @@ static int raw_open_common(BlockDriverState *bs, QDict 
> > *options,
> >   }
> >   #endif
> > -bs->supported_zero_flags = s->discard_zeroes ? BDRV_REQ_MAY_UNMAP : 0;
> > +bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
> >   ret = 0;
> >   fail:
> >   if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
> > @@ -1487,6 +1487,38 @@ static ssize_t 
> > handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
> >   return -ENOTSUP;
> >   }
> > +static ssize_t handle_aiocb_write_zeroes_unmap(RawPosixAIOData *aiocb)
> > +{
> > +BDRVRawState *s = aiocb->bs->opaque;
> > +int ret;
> > +
> > +/* First try to write zeros and unmap at the same time */
> > +
> > +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> > +ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > +   aiocb->aio_offset, aiocb->aio_nbytes);
> 
> Umm, doesn't this have to use FALLOC_FL_ZERO_RANGE? FALLOC_FL_PUNCH_HOLE
> deallocs, but is not required to write zeroes.

Yes, it is. See the man page:

Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
range starting at offset and continuing for len bytes. Within the
specified range, partial filesystem blocks are zeroed, and whole
filesystem blocks are removed from the file. After a successful
call, subsequent reads from this range will return zeroes.

FALLOC_FL_ZERO_RANGE in contrast implements write_zeroes without unmap.

> > +if (ret != -ENOTSUP) {
> > +return ret;
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_XFS
> > +if (s->is_xfs) {
> > +/* xfs_discard() guarantees that the discarded area reads as 
> > all-zero
> > + * afterwards, so we can use it here. */
> > +return xfs_discard(s, aiocb->aio_offset, aiocb->aio_nbytes);
> > +}
> > +#endif
> > +
> > +/* Make the compiler happy if neither of the above is compiled in */
> > +(void) s;
> 
> Could also be done in fewer lines by use of:
> 
>BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;
> 
> since that attribute means "might be unused, don't warn if it is actually
> unused" (and not the stricter "must be unused, warn if it got used anyway")

Thanks, I wasn't aware of G_GNUC_UNUSED (and didn't want to add a
__attribute__ here without a macro that wraps it).

Kevin



Re: [Qemu-devel] [PATCH v1] s390x/cpu_models: Add "-cpu max" support

2018-07-26 Thread Eduardo Habkost
On Thu, Jul 26, 2018 at 09:29:44AM +0200, David Hildenbrand wrote:
> On 25.07.2018 22:14, Eduardo Habkost wrote:
> > On Wed, Jul 25, 2018 at 07:50:21PM +0200, David Hildenbrand wrote:
> >> On 25.07.2018 19:09, Eduardo Habkost wrote:
> > [...]
>  +if (local_err) {
>  +g_assert(kvm_enabled());
>  +error_report_err(local_err);
>  +/* fallback to unsupported CPU models */
>  +return;
> >>>
> >>> What about moving this outside instance_init?
> >>
> >> To which place for example? We at least have to copy the CPU model
> >> for each and every CPU instance (so we can modify it without side
> >> effects using properties).
> > 
> > To any code that will look at cpu->model.
> > 
> > You are wrapping an interface that needs to report errors
> > (kvm_s390_get_host_cpu_model()) behind an interface that is not
> > able to report errors (object_new()).  There's nothing that
> > requires you to do that, does it?  You are free to provide an API
> > that is actually able to report errors, instead of relying on
> > object_new() only.
> 
> I see what you mean. One solution would be to preload and store the
> model somewhere globally (not locally). So in the init function, we
> would not have to handle errors.
> 
> But I am not even sure where we could do such a global initialization +
> be able to report errors easily. I remember that we had a hard time to
> get this running smoothly due to the dependency of
> kvm_s390_get_host_cpu_model() on:
> - accelerator
> - machine
> - KVM init state

If we had a S390KVMAccelerator object on machine->accelerator,
S390KVMAccelerator::host_model would be a good candidate?

> 
> And initializing cpu->model in realize() is too late, because all
> properties have to access it. Even a pre_plug handler will not work.

Yeah, the instance_init/realize abstraction seems insufficient
here.  instance_init has too many restrictions, realize is too
late.


> 
> On the other hand, I decided to ignore all errors back then and fallback
> to the "host CPU model unknown" case, because there are some corner
> cases where we still want to allow running the "host" model even though
> there was a problem detecting it.
> 
> So my summary would be: We ignore errors (and rather treat them like
> warnings) for a reason here and fallback to "unsupported CPU models",
> which allows to run + use QEMU even in environments where our CPU model
> detection fails (e.g. on a very strange new CPU model we could have in
> the future).
> 
> Especially "!cpu->model" does not imply that there was an error. It
> includes disabled CPU model support or unavailable CPU model support
> (KVM), which is perfectly fine. Replicating initialization attempts at
> all places where we access "cpu->model" does therefore not sound 100%
> clean to me and most likely makes the code way more complicated.
> 
> Right now the semantics are clear: if we have "!cpu->model" after the
> object has been created, details about the host CPU model are not
> available (models unavailable/unsupported). Modifying properties,
> baselining, expanding is not possible with that model then. But it can
> be used for execution.

This is interesting.  If most users of cpu->model don't care
about kvm_s390_get_host_cpu_model() errors at all, the current
solution sounds more reasonable.

Except for the error_report_err() call inside instance_init.
This still bothers me, but it's not a big deal.

-- 
Eduardo



Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots

2018-07-26 Thread Dr. David Alan Gilbert
* Paolo Bonzini (pbonz...@redhat.com) wrote:
> On 25/07/2018 22:04, Andrea Arcangeli wrote:
> > 
> > It may look like the uffd-wp model is wish-feature similar to an
> > optimization, but without the uffd-wp model when the WP fault is
> > triggered by kernel code, the sigsegv model falls apart and requires
> > all kind of ad-hoc changes just for this single feature. Plus uffd-wp
> > has other benefits: it makes it all reliable in terms of not
> > increasing the number of vmas in use during the snapshot. Finally it
> > makes it faster too with no mmap_sem for reading and no sigsegv
> > signals.
> > 
> > The non cooperative features got merged first because there was much
> > activity on the kernel side on that front, but this is just an ideal
> > time to nail down the remaining issues in uffd-wp I think. That I
> > believe is time better spent than trying to emulate it with sigsegv
> > and changing all drivers to send new events down to qemu specific to
> > the sigsegv handling. We considered this before doing uffd for
> > postcopy too but overall it's unreliable and more work (no single
> > change was then needed to KVM code with uffd to handle postcopy and
> > here it should be the same).
> 
> I totally agree.  The hard part in userfaultfd was the changes to the
> kernel get_user_pages API, but the payback was huge because _all_ kernel
> uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
> back to mprotect would be a huge mistake.

TBF I think Denis just wanted to get things moving, knowing that we've
not got userfault-wp yet;  what I'm curious about though is how Denis
has anything stable given that the faults can land in syscalls and
vhost etc.

Dave

> Paolo
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH for-3.0] file-posix: Fix write_zeroes with unmap on block devices

2018-07-26 Thread Eric Blake

On 07/26/2018 10:06 AM, Kevin Wolf wrote:


+#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+   aiocb->aio_offset, aiocb->aio_nbytes);


Umm, doesn't this have to use FALLOC_FL_ZERO_RANGE? FALLOC_FL_PUNCH_HOLE
deallocs, but is not required to write zeroes.


Yes, it is. See the man page:

 Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
 range starting at offset and continuing for len bytes. Within the
 specified range, partial filesystem blocks are zeroed, and whole
 filesystem blocks are removed from the file. After a successful
 call, subsequent reads from this range will return zeroes.


That's true for file-system fds, but not for block device fds.

As pointed out by Nir,

> https://patchwork.kernel.org/patch/9903757/
Which says, among other things:

>> Do we also know that the blocks were discarded as we do with
>> BLKDISCARD ?
>
> There never was a way to know for sure.
>
> ATA DSM TRIM and SCSI UNMAP are hints by definition. We attempted to
> bend their semantics towards getting predictable behavior but ultimately
> failed. Too many corner cases.
>
>> As I mentioned before. We relied on discard_zeroes_data in mkfs.ext4
>> to make sure that inode tables are zeroed after discard.
>
> The point is that you shouldn't have an if (discard_zeroes_data)
> conditional in the first place.
>
>  - If you need to dellocate a block range and you don't care about its
>contents in the future, use BLKDISCARD / FL_PUNCH_HOLE.
>
>  - If you need to zero a block range, use BLKZEROOUT / FL_ZERO_RANGE.

PUNCH_HOLE deallocates; but can only guarantee a read back of zero on 
file systems.


Hmm - that thread also mentions FALLOC_FL_NO_HIDE_STALE, which is a new 
flag not present/documented on Fedora 28. I wonder if it helps, too.




FALLOC_FL_ZERO_RANGE in contrast implements write_zeroes without unmap.


I thought the opposite: FALLOC_FL_ZERO_RANGE guarantees that you read 
back 0, using whatever is most efficient under the hood (in the case of 
block devices, unmapping that reliably reads back as zero is favored).


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-devel] [PATCH for-3.0] file-posix: Fix write_zeroes with unmap on block devices

2018-07-26 Thread Kevin Wolf
Am 26.07.2018 um 17:23 hat Eric Blake geschrieben:
> On 07/26/2018 10:06 AM, Kevin Wolf wrote:
> 
> > > > +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> > > > +ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | 
> > > > FALLOC_FL_KEEP_SIZE,
> > > > +   aiocb->aio_offset, aiocb->aio_nbytes);
> > > 
> > > Umm, doesn't this have to use FALLOC_FL_ZERO_RANGE? FALLOC_FL_PUNCH_HOLE
> > > deallocs, but is not required to write zeroes.
> > 
> > Yes, it is. See the man page:
> > 
> >  Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
> >  2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
> >  range starting at offset and continuing for len bytes. Within the
> >  specified range, partial filesystem blocks are zeroed, and whole
> >  filesystem blocks are removed from the file. After a successful
> >  call, subsequent reads from this range will return zeroes.
> 
> That's true for file-system fds, but not for block device fds.

It is true for block device fds, too. Look at fs/block_dev.c,
specifically blkdev_fallocate():

switch (mode) {
case FALLOC_FL_ZERO_RANGE:
case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE: 
error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
GFP_KERNEL, BLKDEV_ZERO_NOUNMAP);
break;
case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE: 
error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
 GFP_KERNEL, 
BLKDEV_ZERO_NOFALLBACK);
break;
case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | 
FALLOC_FL_NO_HIDE_STALE:
error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 GFP_KERNEL, 0);
break;
default:
return -EOPNOTSUPP;
}

> As pointed out by Nir,
> 
> > https://patchwork.kernel.org/patch/9903757/
> Which says, among other things:
> 
> >> Do we also know that the blocks were discarded as we do with
> >> BLKDISCARD ?
> >
> > There never was a way to know for sure.
> >
> > ATA DSM TRIM and SCSI UNMAP are hints by definition. We attempted to
> > bend their semantics towards getting predictable behavior but ultimately
> > failed. Too many corner cases.
> >
> >> As I mentioned before. We relied on discard_zeroes_data in mkfs.ext4
> >> to make sure that inode tables are zeroed after discard.
> >
> > The point is that you shouldn't have an if (discard_zeroes_data)
> > conditional in the first place.
> >
> >  - If you need to dellocate a block range and you don't care about its
> >contents in the future, use BLKDISCARD / FL_PUNCH_HOLE.
> >
> >  - If you need to zero a block range, use BLKZEROOUT / FL_ZERO_RANGE.
> 
> PUNCH_HOLE deallocates; but can only guarantee a read back of zero on file
> systems.

As far as I know, the comment you quoted is accurate for BLKDISCARD and
BLKZEROOUT, but not for the fallocate() flags.

> Hmm - that thread also mentions FALLOC_FL_NO_HIDE_STALE, which is a new flag
> not present/documented on Fedora 28. I wonder if it helps, too.
> 
> > 
> > FALLOC_FL_ZERO_RANGE in contrast implements write_zeroes without unmap.
> 
> I thought the opposite: FALLOC_FL_ZERO_RANGE guarantees that you read back
> 0, using whatever is most efficient under the hood (in the case of block
> devices, unmapping that reliably reads back as zero is favored).

See the code I quoted above, FALLOC_FL_ZERO_RANGE calls
blkdev_issue_zeroout() with BLKDEV_ZERO_NOUNMAP internally.

Kevin



Re: [Qemu-devel] [PATCH v5 23/24] replay: add BH oneshot event for block layer

2018-07-26 Thread Alex Bennée


Pavel Dovgalyuk  writes:

> Replay is capable of recording normal BH events, but sometimes
> there are single use callbacks scheduled with aio_bh_schedule_oneshot
> function. This patch enables recording and replaying such callbacks.
> Block layer uses these events for calling the completion function.
> Replaying these calls makes the execution deterministic.
>
> Signed-off-by: Pavel Dovgalyuk 

I'm not sure what about this commit causes the compile breakage I'm
seeing:

  LINKaarch64-linux-user/qemu-aarch64
../libqemuutil.a(cpu-get-icount.o):(.bss+0x0): multiple definition of 
`use_icount'
exec.o:(.bss+0x58): first defined here
collect2: error: ld returned 1 exit status
Makefile:199: recipe for target 'qemu-aarch64' failed
make[1]: *** [qemu-aarch64] Error 1
Makefile:481: recipe for target 'subdir-aarch64-linux-user' failed
make: *** [subdir-aarch64-linux-user] Error 2

It only occurs on a make clean && make -j on that commit though. It's
hidden if you do incremental builds.



> ---
>  block/block-backend.c|3 ++-
>  include/sysemu/replay.h  |3 +++
>  replay/replay-events.c   |   16 
>  replay/replay-internal.h |1 +
>  replay/replay.c  |2 +-
>  stubs/replay.c   |6 ++
>  6 files changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/block/block-backend.c b/block/block-backend.c
> index f2f75a9..232d114 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -17,6 +17,7 @@
>  #include "block/throttle-groups.h"
>  #include "sysemu/blockdev.h"
>  #include "sysemu/sysemu.h"
> +#include "sysemu/replay.h"
>  #include "qapi/error.h"
>  #include "qapi/qapi-events-block.h"
>  #include "qemu/id.h"
> @@ -1370,7 +1371,7 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, 
> int64_t offset, int bytes,
>
>  acb->has_returned = true;
>  if (acb->rwco.ret != NOT_DONE) {
> -aio_bh_schedule_oneshot(blk_get_aio_context(blk),
> +replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
>  blk_aio_complete_bh, acb);
>  }
>
> diff --git a/include/sysemu/replay.h b/include/sysemu/replay.h
> index 8118b00..945bc74 100644
> --- a/include/sysemu/replay.h
> +++ b/include/sysemu/replay.h
> @@ -152,6 +152,9 @@ bool replay_events_enabled(void);
>  void replay_flush_events(void);
>  /*! Adds bottom half event to the queue */
>  void replay_bh_schedule_event(QEMUBH *bh);
> +/*! Adds oneshot bottom half event to the queue */
> +void replay_bh_schedule_oneshot_event(AioContext *ctx,
> +QEMUBHFunc *cb, void *opaque);
>  /*! Adds input event to the queue */
>  void replay_input_event(QemuConsole *src, InputEvent *evt);
>  /*! Adds input sync event to the queue */
> diff --git a/replay/replay-events.c b/replay/replay-events.c
> index 0964a82..0ac8a5c 100644
> --- a/replay/replay-events.c
> +++ b/replay/replay-events.c
> @@ -37,6 +37,9 @@ static void replay_run_event(Event *event)
>  case REPLAY_ASYNC_EVENT_BH:
>  aio_bh_call(event->opaque);
>  break;
> +case REPLAY_ASYNC_EVENT_BH_ONESHOT:
> +((QEMUBHFunc*)event->opaque)(event->opaque2);
> +break;
>  case REPLAY_ASYNC_EVENT_INPUT:
>  qemu_input_event_send_impl(NULL, (InputEvent *)event->opaque);
>  qapi_free_InputEvent((InputEvent *)event->opaque);
> @@ -132,6 +135,17 @@ void replay_bh_schedule_event(QEMUBH *bh)
>  }
>  }
>
> +void replay_bh_schedule_oneshot_event(AioContext *ctx,
> +QEMUBHFunc *cb,void *opaque)
> +{
> +if (events_enabled) {
> +uint64_t id = replay_get_current_step();
> +replay_add_event(REPLAY_ASYNC_EVENT_BH_ONESHOT, cb, opaque, id);
> +} else {
> +aio_bh_schedule_oneshot(ctx, cb, opaque);
> +}
> +}
> +
>  void replay_add_input_event(struct InputEvent *event)
>  {
>  replay_add_event(REPLAY_ASYNC_EVENT_INPUT, event, NULL, 0);
> @@ -162,6 +176,7 @@ static void replay_save_event(Event *event, int 
> checkpoint)
>  /* save event-specific data */
>  switch (event->event_kind) {
>  case REPLAY_ASYNC_EVENT_BH:
> +case REPLAY_ASYNC_EVENT_BH_ONESHOT:
>  replay_put_qword(event->id);
>  break;
>  case REPLAY_ASYNC_EVENT_INPUT:
> @@ -216,6 +231,7 @@ static Event *replay_read_event(int checkpoint)
>  /* Events that has not to be in the queue */
>  switch (replay_state.read_event_kind) {
>  case REPLAY_ASYNC_EVENT_BH:
> +case REPLAY_ASYNC_EVENT_BH_ONESHOT:
>  if (replay_state.read_event_id == -1) {
>  replay_state.read_event_id = replay_get_qword();
>  }
> diff --git a/replay/replay-internal.h b/replay/replay-internal.h
> index 08ef2ec..0c0ed16 100644
> --- a/replay/replay-internal.h
> +++ b/replay/replay-internal.h
> @@ -51,6 +51,7 @@ enum ReplayEvents {
>
>  enum ReplayAsyncEventKind {
>  REPLAY_ASYNC_EVENT_BH,
> +REPLAY_ASYNC_EVENT_BH_ONESHOT,
>  REPLAY_ASYNC_EVENT_INPUT,
>  REPLAY_ASYNC_EVENT_INPUT_SYNC,
> 

Re: [Qemu-devel] [PATCH hack dontapply v2 7/7] pc: HACK: acpi: tie in _CST object to Processor

2018-07-26 Thread Michael S. Tsirkin
On Wed, Jul 25, 2018 at 04:49:08PM +0200, Igor Mammedov wrote:
> On Wed, 25 Jul 2018 15:50:09 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Wed, Jul 25, 2018 at 02:37:24PM +0200, Igor Mammedov wrote:
> > > On Tue, 10 Jul 2018 03:01:34 +0300
> > > "Michael S. Tsirkin"  wrote:
> > >   
> > > > Based on patch by Igor Mammedov.
> > > > 
> > > > This is a hack: we definitely shouldn't do it
> > > > unconditionally, and initialization should be handled
> > > > differently (through an isa device?). io port to use
> > > > should depend on the PC type and should be documented.
> > > > 
> > > > Notifications should be supported.
> > > > 
> > > > Signed-off-by: Michael S. Tsirkin 
> > > > ---
> > > >  include/hw/i386/pc.h |  5 +
> > > >  hw/acpi/cpu.c|  5 +
> > > >  hw/i386/acpi-build.c | 10 +-
> > > >  3 files changed, 19 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> > > > index 316230e570..83b3a84322 100644
> > > > --- a/include/hw/i386/pc.h
> > > > +++ b/include/hw/i386/pc.h
> > > > @@ -20,6 +20,11 @@
> > > >  
> > > >  #define HPET_INTCAP "hpet-intcap"
> > > >  
> > > > +typedef struct PCCstate {
> > > > +uint32_t latency;
> > > > +uint32_t hint;
> > > > +} PCCstate;
> > > > +
> > > >  /**
> > > >   * PCMachineState:
> > > >   * @acpi_dev: link to ACPI PM device that performs ACPI hotplug 
> > > > handling
> > > > diff --git a/hw/acpi/cpu.c b/hw/acpi/cpu.c
> > > > index 5ae595ecbe..e9e207b033 100644
> > > > --- a/hw/acpi/cpu.c
> > > > +++ b/hw/acpi/cpu.c
> > > > @@ -561,6 +561,11 @@ void build_cpus_aml(Aml *table, MachineState 
> > > > *machine, CPUHotplugFeatures opts,
> > > > aml_int(arch_ids->cpus[i].props.node_id)));
> > > >  }
> > > >  
> > > > +if (1) {
> > > > +method = aml_method("_CST", 0, AML_NOTSERIALIZED);
> > > > +aml_append(method, aml_return(aml_call0("CCST")));  
> > > in case of not loaded CCST it would be unresolved reference.
> > > It would be better to have a package /SB.CPUS.CCST filled with some 
> > > startup values
> > > and than that could be updated on the fly with new package.  
> > 
> > I think it will conflict if we do. But yes we could load \SB.CCST and
> > then \SB.CPUS.CCST will override that. Would it help though?
> > ACPI spec does not describe what happens on load failure.
> > For linux it does not matter much -
> > it just handles the error. What happens on windows with an unresolved
> > reference? Is failure to load the object any better?
> I was thinking more in a align of assigning a new temporary package value to
> a statically named CCST:
> 
>   update_cst_METHOD()
>  build package in local0
>  \SB.CPUS.CCST = local0
> 
> so guest gets valid initial CCST with values on source host at boot time
> and on target values are replaced on update.
> 
> It doesn't have to be named variable/could be a method but with always
> valid values.

Right but what are we trying to achieve here?

> > 
> > > > +aml_append(dev, method);
> > > > +}
> > > >  aml_append(cpus_dev, dev);
> > > >  }
> > > >  }
> > > > diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> > > > index fff1059a31..da2c830db7 100644
> > > > --- a/hw/i386/acpi-build.c
> > > > +++ b/hw/i386/acpi-build.c
> > > > @@ -64,6 +64,7 @@
> > > >  #include "hw/i386/intel_iommu.h"
> > > >  
> > > >  #include "hw/acpi/ipmi.h"
> > > > +#include "hw/acpi/cst.h"
> > > >  
> > > >  /* These are used to size the ACPI tables for -M pc-i440fx-1.7 and
> > > >   * -M pc-i440fx-2.0.  Even if the actual amount of AML generated grows
> > > > @@ -1840,7 +1841,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
> > > >  build_legacy_cpu_hotplug_aml(dsdt, machine, 
> > > > pm->cpu_hp_io_base);
> > > >  } else {
> > > >  CPUHotplugFeatures opts = {
> > > > -.apci_1_compatible = true, .has_legacy_cphp = true
> > > > +.apci_1_compatible = true, .has_legacy_cphp = true,  
> > > unrelated change???  
> > 
> > Good catch.
> > 
> > > >  };
> > > >  build_cpus_aml(dsdt, machine, opts, pm->cpu_hp_io_base,
> > > > "\\_SB.PCI0", "\\_GPE._E02");
> > > > @@ -2693,6 +2694,10 @@ void acpi_build(AcpiBuildTables *tables, 
> > > > MachineState *machine)
> > > > tables->vmgenid, tables->linker);
> > > >  }
> > > >  
> > > > +/* TODO: get a free ioport. This one is PIIX specific. */
> > > > +acpi_add_table(table_offsets, tables_blob);
> > > > +cst_build_acpi(tables_blob, tables->linker, 0xaf20);
> > > > +
> > > >  if (misc.has_hpet) {
> > > >  acpi_add_table(table_offsets, tables_blob);
> > > >  build_hpet(tables_blob, tables->linker);
> > > > @@ -2891,6 +2896,9 @@ void acpi_setup(void)
> > > >  
> > > >  build_state = g_malloc0(sizeof *build_state);
> > > >  
> > > > +/* TOD

Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Marcel Apfelbaum

Hi

On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:

On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:

Hey,

Our team is adding a NIC functional  emulation to QEMU.
One of the features we are adding to this NIC is SRIOV.

Here is the error message I get when checking SRIOV support of  our emulated 
NIC on Win2016 server (the hyper-v VM).
"
SR-IOV cannot be used on this system as the PCI Express hardware does not 
support Access Control Services (ACS) at any root port.


QEMU's emulated PCI Express Root Ports do not support ACS yet, however I 
am not sure ACS is a prerequisite
for SR-IOV. We would need ARI support for allowing more than 8VFs, but 
QEMU doesn't support that either (yet).


Knut Omag has some working patches, he successfully implemented SR-IOV 
with QEMU, see:

 https://github.com/knuto/qemu/tree/sriov_patches_v7

The code was not merged since we need at least a device with SR-IOV 
support to justify the addition.



Contact your system vendor for further information.
"

I'm not sure what the status of emulated SR-IOV is so I have CCed
Michael Tsirkin and Marcel Apfelbaum, the PCI maintainers in QEMU.


Thanks,
Marcel


Could you please advise about what could be the issue here?

BTW: I use same configuration (VM XML file attached) when running linux VM 
(RH7.2) image (instead of Win Hyper-V) over the same host and SRIOV is working 
for me there.

Here the XML file I use to define the VM (our emulated NIC is added at the end 
of XML):
"

   nst105
   0249a525-2ee2-432b-a1f5-a6db83b089a3
   8388608
   8388608
   8
   
 /machine
   
   
 hvm
   
   
 
 
 
   
   
   
 
   
   
 SandyBridge
 
 
   
   
 
 
 
 
   
   destroy
   restart
   destroy
   
 
 
   
   
 /opt/qemu/bin/qemu-system-x86_64
 
   
   
   
   
   
 
 
   
 
 
   
   
 
 
   
   
 
 
   
   
 
 
   
 
 
 
   
   
   
 
 
   
   
   
 
 
   
   
   
 
 
   
   
 
 
   
   
   
 
 
   
   
   
   
   
 
 
   
 
 
   
 
 
   
 
 
 
 
   
 
 
   
   
 
 
   
 
   
   
   
   
 
 
 
 
 
 
   

"
__
General info:
Host OS: RH7.0 (Kernel: 4.14.13)
QEMU version: 2.11
libvirt version: 3.2.0
Running the following on the host shows that both nested and IOMMU are
enabled:
~]#: cat /sys/module/kvm_intel/parameters/nested
Y
~]# dmesg | grep -e DMAR -e IOMMU
[0.00] DMAR: IOMMU enabled

Thanks,
Elijah






Re: [Qemu-devel] [PATCH hack dontapply v2 0/7] Dynamic _CST generation

2018-07-26 Thread Michael S. Tsirkin
On Wed, Jul 25, 2018 at 05:53:35PM +0200, Igor Mammedov wrote:
> On Wed, 25 Jul 2018 15:44:37 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Wed, Jul 25, 2018 at 02:32:11PM +0200, Igor Mammedov wrote:
> > > On Tue, 10 Jul 2018 03:01:30 +0300
> > > "Michael S. Tsirkin"  wrote:
> > >   
> > > > Now that basic support for guest CPU PM is upstream, I started looking
> > > > for making it migrateable.  Since a VM can be migrated between different
> > > > hosts, PM info needs to change each time with move the VM to a different
> > > > host.  
> > > Considering notification is async, so there will be a window when
> > > guest will be using old Cstates on new host. What will happen if
> > > machine is migrated to host that doesn't support a Cstate
> > > that was used on source when guest was started?  
> > 
> > My testing shows mwait with a wrong hint works, presumably it just uses
> > a lot of power.
> > 
> > > > This adds infrastructure - based on Load/Unload - to support exactly
> > > > that: QEMU generates AML (changing it on migration) and stores it in
> > > > reserved memory, OSPM loads _CST from there on demand.  
> > > Cool and very tempting idea but I have 2 worries about this approach:
> > > 1. How well does it work with Windows based guests?
> > >(I've tried something similar but generating new AML from AML itself
> > > to get rid of our long if/else chains there to make up Notify target 
> > > name.
> > > I ditched it (unfortunately I don't recall why) )  
> > 
> > After trying it, I can tell you why - it's a horrid mess of
> > unreadable code, with arbitrary restraints on package
> > length etc.
> in my case it was only 4 character NameString, but Windows probably
> wasn't happy or just ignored it.
> Considering recent development (TPM series) it might have been issue
> with SYSTEM_MEMORY not working properly if used as byte buffer field.
> 
> Even if it's an unreadable mess it's still stable mess and very constrained
> one within a single firmware blob that came from source.

That's exectly the argument we had for keeping ACPI
generation in bios. It's just not an interface that scales.

> Hence it's more preferable than split brain in this series.
> 
> But I don't think we even need dynamic AML for _CST usecase at all,
> existing cpuhp interface should work just fine for it and should be
> simpler as all infrastructure is already there.

Not sure I get what you mean. Could you post a patch?

> > > 2. (probably biggest issue) Loading dynamically generated AML
> > >basically would make all AML code ABI, so that static AML
> > >in RAM of old QEMU version would match dynamic generated
> > >one on target side with new QEMU (/me generalizing approach beyond 
> > > _CST support).
> > >I'd try to keep our AML version less as much as possible
> > >and go this route only if there is no other way to do it.  
> > 
> > That's a good point, thanks for bringing this up!
> > 
> > So it seems that we will need to define the ABI, yes. All AML code is
> > an over-statement though, there are specific entry points
> > we must maintain, right?
> Well, I'm rather unsure what and where could break,
> hence I'm afraid of a new beast and it probably won't be easy
> to convince me that ABI would be able to keep things manageable
> and stable.
> Considering big amount of AML code that we already have
> I'm not confident eye balling during review and testing the latest
> firmware/qemu would be sufficient as we suddenly would have exploded
> test matrix where firmware in use is a mixed from old/and new parts.

Well there's one exported method so far and it relies on one container
device in static table set which does not sound too hard to keep stable.

Given how simple the dynamic table is, how about just checking it
with a unit test? It is literally a single return statement.
If it returns a valid package, that is all that we
care about. I can write some firmware to test that constraint.


> > And that in the end isn't fundamentally different from the ABIs
> > mandated by the ACPI spec.
> > 
> > So I have these ideas to try to mitigate the issues:
> > - document the ABI (like we have in the ACPI spec)
> > - use a specific prefix for all external calls (like _ for ACPI spec ones)
> > - use a specific (different) prefix for all internal loaded calls (like
> >   [A-Z] for non-ACPI spec ones)
> ACPI spec horrible mess and bad example (even committee isn't able to keep it 
> consistent)
> but it's necessary evil (compared to no spec at all).
> Adding ABI requirement will complicate coding/reviewing for
> already complex AML compared to current non ABI way.
> Hence from maintainability POV we should stay away from this approach
> if there is a viable alternative.

I agree. Not that I see one.

> > 
> > Thoughts?
> > 
> > > > This is a partially functional prototype.
> > > > 
> > > > What works:
> > > > loading _CST dynamically and reporting it to OSPM
> > > > 
> > > > What doesn't:
> > > > detec

Re: [Qemu-devel] [PATCH for-3.0] file-posix: Fix write_zeroes with unmap on block devices

2018-07-26 Thread Eric Blake

On 07/26/2018 10:33 AM, Kevin Wolf wrote:



As far as I know, the comment you quoted is accurate for BLKDISCARD and
BLKZEROOUT, but not for the fallocate() flags.



I sure wish the man pages were more explicit on what guarantees each 
flag offers.



Hmm - that thread also mentions FALLOC_FL_NO_HIDE_STALE, which is a new flag
not present/documented on Fedora 28. I wonder if it helps, too.



FALLOC_FL_ZERO_RANGE in contrast implements write_zeroes without unmap.


I thought the opposite: FALLOC_FL_ZERO_RANGE guarantees that you read back
0, using whatever is most efficient under the hood (in the case of block
devices, unmapping that reliably reads back as zero is favored).


See the code I quoted above, FALLOC_FL_ZERO_RANGE calls
blkdev_issue_zeroout() with BLKDEV_ZERO_NOUNMAP internally.


Having to resort to reading the kernel code to learn what the guarantees 
are is annoying (it's nice that GPL guarantees that we CAN do that, but 
it means the man pages could use some TLC so that we don't have to). 
Especially since the kernel code has changed over time.


But your paste of current kernel code is rather hard to argue against.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-devel] RDMA wrongly detected as being supported on FreeBSD

2018-07-26 Thread Marcel Apfelbaum




On 07/25/2018 01:59 PM, Dr. David Alan Gilbert wrote:

* Marcel Apfelbaum (marcel.apfelb...@gmail.com) wrote:

Hi,

On 07/25/2018 10:32 AM, Thomas Huth wrote:

On 25.07.2018 06:47, Rebecca Cran wrote:

In commit 18a398f6a39df4b08ff86ac0d38384193ca5f4cc, ./configure on
FreeBSD incorrectly detects RDMA support,

configure is looking for the rdma related libraries,  if they are there,
the rdma code will be compiled into QEMU.

You can try and run configure with --disable-rdma flag until we find
a solution.

Note that this flag is tied to both the RMDA migration and the RDMA
device emulation; RDMA migration may well be working for the reporter.


That's in case they don't care for either RDMA functionality of course -
to be used until we would find a solution.

Thanks,
Marcel


Dave


   with the build subsequently
failing with:

/home/bcran/workspace/qemu/hw/rdma/vmw/pvrdma_cmd.c:19:10: fatal error:
'linux/types.h' file not found
#include 
   ^~~

Do you have the kernel-headers package installed?


1 error generated.
gmake[1]: *** [/home/bcran/workspace/qemu/rules.mak:69:
hw/rdma/vmw/pvrdma_cmd.o] Error 1
gmake[1]: *** Waiting for unfinished jobs
gmake: *** [Makefile:481: subdir-x86_64-softmmu] Error 2

When was it still working for you the last time? Was it still working
with v3.0-rc1 ? Or just with v2.12 ? Any chance that you could bisect
the problem to determine when it has been introduced?

Good idea.

Thanks,
Marcel


   Thomas



--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK





Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Michael S. Tsirkin
On Thu, Jul 26, 2018 at 06:51:13PM +0300, Marcel Apfelbaum wrote:
> Hi
> 
> On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:
> > On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:
> > > Hey,
> > > 
> > > Our team is adding a NIC functional  emulation to QEMU.
> > > One of the features we are adding to this NIC is SRIOV.
> > > 
> > > Here is the error message I get when checking SRIOV support of  our 
> > > emulated NIC on Win2016 server (the hyper-v VM).
> > > "
> > > SR-IOV cannot be used on this system as the PCI Express hardware does not 
> > > support Access Control Services (ACS) at any root port.
> 
> QEMU's emulated PCI Express Root Ports do not support ACS yet, however I am
> not sure ACS is a prerequisite
> for SR-IOV.

Looks like windows blocks dev assignment in nested VMs without it.
Thinking about it, doesn't vfio do the same by default? I think vfio has
a flag to override this though.

> We would need ARI support for allowing more than 8VFs, but QEMU
> doesn't support that either (yet).
> 
> Knut Omag has some working patches, he successfully implemented SR-IOV with
> QEMU, see:
>  https://github.com/knuto/qemu/tree/sriov_patches_v7
> 
> The code was not merged since we need at least a device with SR-IOV support
> to justify the addition.
> 
> > > Contact your system vendor for further information.
> > > "
> > I'm not sure what the status of emulated SR-IOV is so I have CCed
> > Michael Tsirkin and Marcel Apfelbaum, the PCI maintainers in QEMU.
> 
> Thanks,
> Marcel
> 
> > > Could you please advise about what could be the issue here?
> > > 
> > > BTW: I use same configuration (VM XML file attached) when running linux 
> > > VM (RH7.2) image (instead of Win Hyper-V) over the same host and SRIOV is 
> > > working for me there.
> > > 
> > > Here the XML file I use to define the VM (our emulated NIC is added at 
> > > the end of XML):
> > > "
> > >  > > xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
> > >nst105
> > >0249a525-2ee2-432b-a1f5-a6db83b089a3
> > >8388608
> > >8388608
> > >8
> > >
> > >  /machine
> > >
> > >
> > >  hvm
> > >
> > >
> > >  
> > >  
> > >  
> > >
> > >
> > >
> > >  
> > >
> > >
> > >  SandyBridge
> > >  
> > >  
> > >
> > >
> > >  
> > >  
> > >  
> > >  
> > >
> > >destroy
> > >restart
> > >destroy
> > >
> > >  
> > >  
> > >
> > >
> > >  /opt/qemu/bin/qemu-system-x86_64
> > >  
> > >
> > >
> > >
> > >
> > >
> > >  
> > >  
> > > > > function='0x7'/>
> > >  
> > >  
> > >
> > > > > function='0x0' multifunction='on'/>
> > >  
> > >  
> > >
> > > > > function='0x1'/>
> > >  
> > >  
> > >
> > > > > function='0x2'/>
> > >  
> > >  
> > > > > function='0x2'/>
> > >  
> > >  
> > >  
> > >
> > >
> > > > > function='0x0' multifunction='on'/>
> > >  
> > >  
> > >
> > >
> > > > > function='0x1'/>
> > >  
> > >  
> > >
> > >
> > > > > function='0x2'/>
> > >  
> > >  
> > >
> > > > > function='0x0'/>
> > >  
> > >  
> > >
> > >
> > > > > function='0x0'/>
> > >  
> > >  
> > >
> > >
> > >
> > >
> > > > > function='0x0'/>
> > >  
> > >  
> > >
> > >  
> > >  
> > >
> > >  
> > >  
> > >
> > >  
> > >  
> > >  
> > >   > > keymap='en-us'>
> > >
> > >  
> > >  
> > >
> > > > > function='0x0'/>
> > >  
> > >  
> > > > > function='0x0'/>
> > >  
> > >
> > >
> > >
> > >
> > >  
> > >   > > value='pcie-root-port,pref64-reserve=500M,slot=0,id=pcie_port.1'/>
> > >  
> > >  
> > >  
> > >  
> > >
> > > 
> > > "
> > > __
> > > General info:
> > > Host OS: RH7.0 (Kernel: 4.14.13)
> > > QEMU version: 2.11
> > > libvirt version: 3.2.0
> > > Running the following on the host shows that both nested and IOMMU are
> > > enabled:
> > > ~]#: cat /sys/module/kvm_intel/parameters/nested
> > > Y
> > > ~]# dmesg | grep -e DMAR -e IOMMU
> > > [0.00] DMAR: IOMMU enabled
> > > 
> > > Thanks,
> > > Elijah
> > > 



Re: [Qemu-devel] [Qemu-block] [PATCH for-3.0] file-posix: Fix write_zeroes with unmap on block devices

2018-07-26 Thread Nir Soffer
On Thu, Jul 26, 2018 at 2:33 PM Kevin Wolf  wrote:

> The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
> all-zero afterwards, so don't try to abuse it for zero writing. We try
> to only use this if BLKDISCARDZEROES tells us that it is safe, but this
> is unreliable on older kernels and a constant 0 in newer kernels. In
> other words, this code path is never actually used with newer kernels,
> so we don't even try to unmap while writing zeros.
>
> This patch removes the abuse of discard for writing zeroes from
> file-posix and instead adds a new function that uses interfaces that are
> actually meant to deallocate and zero out at the same time. Only if
> those fail, it falls back to zeroing out without unmap. We never fall
> back to a discard operation any more that may or may not result in
> zeros.
>
> Signed-off-by: Kevin Wolf 
> ---
>  block/file-posix.c | 62
> +-
>  1 file changed, 47 insertions(+), 15 deletions(-)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 60af4b3d51..9c66cd87d1 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -648,7 +648,7 @@ static int raw_open_common(BlockDriverState *bs, QDict
> *options,
>  }
>  #endif
>
> -bs->supported_zero_flags = s->discard_zeroes ? BDRV_REQ_MAY_UNMAP : 0;
> +bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
>  ret = 0;
>  fail:
>  if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
> @@ -1487,6 +1487,38 @@ static ssize_t
> handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>  return -ENOTSUP;
>  }
>
> +static ssize_t handle_aiocb_write_zeroes_unmap(RawPosixAIOData *aiocb)
> +{
> +BDRVRawState *s = aiocb->bs->opaque;
> +int ret;
> +
> +/* First try to write zeros and unmap at the same time */
> +
> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +   aiocb->aio_offset, aiocb->aio_nbytes);
> +if (ret != -ENOTSUP) {
>

This fails with ENODEV on RHEL 7.5 when fd is a block device.

The manual says:

   ENODEV fd does not refer to a regular file or a directory.  (If fd
is a pipe or FIFO, a different error results.)

Same issue exists in this patch:
http://lists.gnu.org/archive/html/qemu-devel/2018-03/msg07194.html

+return ret;
> +}
> +#endif
> +
> +#ifdef CONFIG_XFS
> +if (s->is_xfs) {
> +/* xfs_discard() guarantees that the discarded area reads as
> all-zero
> + * afterwards, so we can use it here. */
> +return xfs_discard(s, aiocb->aio_offset, aiocb->aio_nbytes);
> +}
> +#endif
> +
> +/* Make the compiler happy if neither of the above is compiled in */
> +(void) s;
> +
> +/* If we couldn't manage to unmap while guaranteed that the area
> reads as
> + * all-zero afterwards, just write zeroes without unmapping */
> +ret = handle_aiocb_write_zeroes(aiocb);
> +return ret;
> +}
> +
>  #ifndef HAVE_COPY_FILE_RANGE
>  static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
>   off_t *out_off, size_t len, unsigned int
> flags)
> @@ -1729,6 +1761,9 @@ static int aio_worker(void *arg)
>  case QEMU_AIO_WRITE_ZEROES:
>  ret = handle_aiocb_write_zeroes(aiocb);
>  break;
> +case QEMU_AIO_WRITE_ZEROES | QEMU_AIO_DISCARD:
> +ret = handle_aiocb_write_zeroes_unmap(aiocb);
> +break;
>  case QEMU_AIO_COPY_RANGE:
>  ret = handle_aiocb_copy_range(aiocb);
>  break;
> @@ -2553,15 +2588,13 @@ static int coroutine_fn raw_co_pwrite_zeroes(
>  int bytes, BdrvRequestFlags flags)
>  {
>  BDRVRawState *s = bs->opaque;
> +int operation = QEMU_AIO_WRITE_ZEROES;
>
> -if (!(flags & BDRV_REQ_MAY_UNMAP)) {
> -return paio_submit_co(bs, s->fd, offset, NULL, bytes,
> -  QEMU_AIO_WRITE_ZEROES);
> -} else if (s->discard_zeroes) {
> -return paio_submit_co(bs, s->fd, offset, NULL, bytes,
> -  QEMU_AIO_DISCARD);
> +if (flags & BDRV_REQ_MAY_UNMAP) {
> +operation |= QEMU_AIO_DISCARD;
>  }
> -return -ENOTSUP;
> +
> +return paio_submit_co(bs, s->fd, offset, NULL, bytes, operation);
>  }
>
>  static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
> @@ -3054,20 +3087,19 @@ static coroutine_fn int
> hdev_co_pwrite_zeroes(BlockDriverState *bs,
>  int64_t offset, int bytes, BdrvRequestFlags flags)
>  {
>  BDRVRawState *s = bs->opaque;
> +int operation = QEMU_AIO_WRITE_ZEROES | QEMU_AIO_BLKDEV;
>  int rc;
>
>  rc = fd_open(bs);
>  if (rc < 0) {
>  return rc;
>  }
> -if (!(flags & BDRV_REQ_MAY_UNMAP)) {
> -return paio_submit_co(bs, s->fd, offset, NULL, bytes,
> -  QEMU_AIO_WRITE_ZEROES|QEMU_AIO_BLKDEV);
> -} else if (s->discard_zeroes) {
> -return paio_submit_co(bs, s->fd, offset, NULL, bytes,
> -   

Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Marcel Apfelbaum




On 07/26/2018 07:14 PM, Michael S. Tsirkin wrote:

On Thu, Jul 26, 2018 at 06:51:13PM +0300, Marcel Apfelbaum wrote:

Hi

On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:

On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:

Hey,

Our team is adding a NIC functional  emulation to QEMU.
One of the features we are adding to this NIC is SRIOV.

Here is the error message I get when checking SRIOV support of  our emulated 
NIC on Win2016 server (the hyper-v VM).
"
SR-IOV cannot be used on this system as the PCI Express hardware does not 
support Access Control Services (ACS) at any root port.

QEMU's emulated PCI Express Root Ports do not support ACS yet, however I am
not sure ACS is a prerequisite
for SR-IOV.

Looks like windows blocks dev assignment in nested VMs without it.
Thinking about it, doesn't vfio do the same by default?
vfio requires vIOMMU even for nested, as far as I know. They do have a 
"no-iommu"

option, but I don't know how  it can help in this case.

Elijah, what about implementing ACS for QEMU's generic PCIe Root Ports?
Maybe is not too complicated and it seems like a handy feature to have 
anyway.

Patches would be welcomed :)

Thanks,
Marcel


  I think vfio has
a flag to override this though.


We would need ARI support for allowing more than 8VFs, but QEMU
doesn't support that either (yet).

Knut Omag has some working patches, he successfully implemented SR-IOV with
QEMU, see:
  https://github.com/knuto/qemu/tree/sriov_patches_v7

The code was not merged since we need at least a device with SR-IOV support
to justify the addition.


Contact your system vendor for further information.
"

I'm not sure what the status of emulated SR-IOV is so I have CCed
Michael Tsirkin and Marcel Apfelbaum, the PCI maintainers in QEMU.

Thanks,
Marcel


Could you please advise about what could be the issue here?

BTW: I use same configuration (VM XML file attached) when running linux VM 
(RH7.2) image (instead of Win Hyper-V) over the same host and SRIOV is working 
for me there.

Here the XML file I use to define the VM (our emulated NIC is added at the end 
of XML):
"

nst105
0249a525-2ee2-432b-a1f5-a6db83b089a3
8388608
8388608
8

  /machine


  hvm


  
  
  



  


  SandyBridge
  
  


  
  
  
  

destroy
restart
destroy

  
  


  /opt/qemu/bin/qemu-system-x86_64
  





  
  

  
  


  
  


  
  


  
  

  
  
  



  
  



  
  



  
  


  
  



  
  





  
  

  
  

  
  

  
  
  
  

  
  


  
  

  




  
  
  
  
  
  


"
__
General info:
Host OS: RH7.0 (Kernel: 4.14.13)
QEMU version: 2.11
libvirt version: 3.2.0
Running the following on the host shows that both nested and IOMMU are
enabled:
~]#: cat /sys/module/kvm_intel/parameters/nested
Y
~]# dmesg | grep -e DMAR -e IOMMU
[0.00] DMAR: IOMMU enabled

Thanks,
Elijah






Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Michael S. Tsirkin
On Thu, Jul 26, 2018 at 07:34:41PM +0300, Marcel Apfelbaum wrote:
> 
> 
> On 07/26/2018 07:14 PM, Michael S. Tsirkin wrote:
> > On Thu, Jul 26, 2018 at 06:51:13PM +0300, Marcel Apfelbaum wrote:
> > > Hi
> > > 
> > > On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:
> > > > On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:
> > > > > Hey,
> > > > > 
> > > > > Our team is adding a NIC functional  emulation to QEMU.
> > > > > One of the features we are adding to this NIC is SRIOV.
> > > > > 
> > > > > Here is the error message I get when checking SRIOV support of  our 
> > > > > emulated NIC on Win2016 server (the hyper-v VM).
> > > > > "
> > > > > SR-IOV cannot be used on this system as the PCI Express hardware does 
> > > > > not support Access Control Services (ACS) at any root port.
> > > QEMU's emulated PCI Express Root Ports do not support ACS yet, however I 
> > > am
> > > not sure ACS is a prerequisite
> > > for SR-IOV.
> > Looks like windows blocks dev assignment in nested VMs without it.
> > Thinking about it, doesn't vfio do the same by default?
> vfio requires vIOMMU even for nested, as far as I know.

Absolutely.

> They do have a
> "no-iommu"
> option, but I don't know how  it can help in this case.
> 
> Elijah, what about implementing ACS for QEMU's generic PCIe Root Ports?
> Maybe is not too complicated and it seems like a handy feature to have
> anyway.
> Patches would be welcomed :)
> 
> Thanks,
> Marcel
> 
> >   I think vfio has
> > a flag to override this though.
> > 
> > > We would need ARI support for allowing more than 8VFs, but QEMU
> > > doesn't support that either (yet).
> > > 
> > > Knut Omag has some working patches, he successfully implemented SR-IOV 
> > > with
> > > QEMU, see:
> > >   https://github.com/knuto/qemu/tree/sriov_patches_v7
> > > 
> > > The code was not merged since we need at least a device with SR-IOV 
> > > support
> > > to justify the addition.
> > > 
> > > > > Contact your system vendor for further information.
> > > > > "
> > > > I'm not sure what the status of emulated SR-IOV is so I have CCed
> > > > Michael Tsirkin and Marcel Apfelbaum, the PCI maintainers in QEMU.
> > > Thanks,
> > > Marcel
> > > 
> > > > > Could you please advise about what could be the issue here?
> > > > > 
> > > > > BTW: I use same configuration (VM XML file attached) when running 
> > > > > linux VM (RH7.2) image (instead of Win Hyper-V) over the same host 
> > > > > and SRIOV is working for me there.
> > > > > 
> > > > > Here the XML file I use to define the VM (our emulated NIC is added 
> > > > > at the end of XML):
> > > > > "
> > > > >  > > > > xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
> > > > > nst105
> > > > > 0249a525-2ee2-432b-a1f5-a6db83b089a3
> > > > > 8388608
> > > > > 8388608
> > > > > 8
> > > > > 
> > > > >   /machine
> > > > > 
> > > > > 
> > > > >   hvm
> > > > > 
> > > > > 
> > > > >   
> > > > >   
> > > > >   
> > > > > 
> > > > > 
> > > > > 
> > > > >   
> > > > > 
> > > > > 
> > > > >   SandyBridge
> > > > >   
> > > > >   
> > > > > 
> > > > > 
> > > > >   
> > > > >   
> > > > >   
> > > > >   
> > > > > 
> > > > > destroy
> > > > > restart
> > > > > destroy
> > > > > 
> > > > >   
> > > > >   
> > > > > 
> > > > > 
> > > > >   /opt/qemu/bin/qemu-system-x86_64
> > > > >   
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > >  > > > > unit='0'/>
> > > > >   
> > > > >   
> > > > >  > > > > function='0x7'/>
> > > > >   
> > > > >   
> > > > > 
> > > > >  > > > > function='0x0' multifunction='on'/>
> > > > >   
> > > > >   
> > > > > 
> > > > >  > > > > function='0x1'/>
> > > > >   
> > > > >   
> > > > > 
> > > > >  > > > > function='0x2'/>
> > > > >   
> > > > >   
> > > > >  > > > > function='0x2'/>
> > > > >   
> > > > >   
> > > > >   
> > > > > 
> > > > > 
> > > > >  > > > > function='0x0' multifunction='on'/>
> > > > >   
> > > > >   
> > > > > 
> > > > > 
> > > > >  > > > > function='0x1'/>
> > > > >   
> > > > >   
> > > > > 
> > > > > 
> > > > >  > > > > function='0x2'/>
> > > > >   
> > > > >   
> > > > > 
> > > > >  > > > > function='0x0'/>
> > > > >   
> > > > >   
> > > > > 
> > > > > 
> > > > >  > > > > function='0x0'/>
> > > > >   
> > > > >   
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > >  > > > > function='0x0'/>
> > > > >   
> > > > >   
> > > > > 
> > > > >   
> > > > >   
> > > > > 
> > > > >   
> > > > >   
> > > > > 
> > > > >   
> > > > > 

Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Knut Omang
On Thu, 2018-07-26 at 18:51 +0300, Marcel Apfelbaum wrote:
> Hi
> 
> On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:
> > On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:
> > > Hey,
> > > 
> > > Our team is adding a NIC functional  emulation to QEMU.
> > > One of the features we are adding to this NIC is SRIOV.
> > > 
> > > Here is the error message I get when checking SRIOV support of  our
> > > emulated NIC on Win2016 server (the hyper-v VM).
> > > "
> > > SR-IOV cannot be used on this system as the PCI Express hardware does not
> > > support Access Control Services (ACS) at any root port.
> 
> QEMU's emulated PCI Express Root Ports do not support ACS yet, however I 
> am not sure ACS is a prerequisite
> for SR-IOV. We would need ARI support for allowing more than 8VFs, but 
> QEMU doesn't support that either (yet).
> 
> Knut Omag has some working patches, he successfully implemented SR-IOV 
> with QEMU, see:
>   https://github.com/knuto/qemu/tree/sriov_patches_v7
> 
> The code was not merged since we need at least a device with SR-IOV 
> support to justify the addition.

FYI, I recently rebased these to latest master but just didn't get to push them
out. I did just now - they are available here:

   https://github.com/knuto/qemu/tree/sriov_patches_v8

As far as I know ARI support with my patch set works just fine - I have tested
it with lots of VFs.

One of the patches in the series (pci: Make use of the devfn property when
registering new devices) is necessary to make ARI work as it should with SR/IOV.

For the hardware model I developed the SR/IOV patches for, I also added enough
ACS support in the root port (PCIe capability helper patch + usage in
ioh3420) to make VFIO "happy". I haven't submitted them because they
are "questionable" since they likely do not reflect the actual features of the
ioh3420. I can make those available if interesting.

Thanks,
Knut

> > > Contact your system vendor for further information.
> > > "
> > 
> > I'm not sure what the status of emulated SR-IOV is so I have CCed
> > Michael Tsirkin and Marcel Apfelbaum, the PCI maintainers in QEMU.
> 
> Thanks,
> Marcel
> 
> > > Could you please advise about what could be the issue here?
> > > 
> > > BTW: I use same configuration (VM XML file attached) when running linux VM
> > > (RH7.2) image (instead of Win Hyper-V) over the same host and SRIOV is
> > > working for me there.
> > > 
> > > Here the XML file I use to define the VM (our emulated NIC is added at the
> > > end of XML):
> > > "
> > >  > > >;
> > >nst105
> > >0249a525-2ee2-432b-a1f5-a6db83b089a3
> > >8388608
> > >8388608
> > >8
> > >
> > >  /machine
> > >
> > >
> > >  hvm
> > >
> > >
> > >  
> > >  
> > >  
> > >
> > >
> > >
> > >  
> > >
> > >
> > >  SandyBridge
> > >  
> > >  
> > >
> > >
> > >  
> > >  
> > >  
> > >  
> > >
> > >destroy
> > >restart
> > >destroy
> > >
> > >  
> > >  
> > >
> > >
> > >  /opt/qemu/bin/qemu-system-x86_64
> > >  
> > >
> > >
> > >
> > >
> > >
> > >  
> > >  
> > > > > function='0x7'/>
> > >  
> > >  
> > >
> > > > > function='0x0' multifunction='on'/>
> > >  
> > >  
> > >
> > > > > function='0x1'/>
> > >  
> > >  
> > >
> > > > > function='0x2'/>
> > >  
> > >  
> > > > > function='0x2'/>
> > >  
> > >  
> > >  
> > >
> > >
> > > > > function='0x0' multifunction='on'/>
> > >  
> > >  
> > >
> > >
> > > > > function='0x1'/>
> > >  
> > >  
> > >
> > >
> > > > > function='0x2'/>
> > >  
> > >  
> > >
> > > > > function='0x0'/>
> > >  
> > >  
> > >
> > >
> > > > > function='0x0'/>
> > >  
> > >  
> > >
> > >
> > >
> > >
> > > > > function='0x0'/>
> > >  
> > >  
> > >
> > >  
> > >  
> > >
> > >  
> > >  
> > >
> > >  
> > >  
> > >  
> > >   > > keymap='en-us'>
> > >
> > >  
> > >  
> > >
> > > > > function='0x0'/>
> > >  
> > >  
> > > > > function='0x0'/>
> > >  
> > >
> > >
> > >
> > >
> > >  
> > >  
> > >  
> > >  
> > >  
> > >  
> > >
> > > 
> > > "
> > > __
> > > General info:
> > > Host OS: RH7.0 (Kernel: 4.14.13)
> > > QEMU version: 2.11
> > > libvirt version: 3.2.0
> > > Running the following on the host shows that both nested and IOMMU are
> > > enabled:
> > > ~]#: cat /sys/module/kvm_intel/parameters/nested
> > > Y
> > > ~]# dmesg | grep -e DMAR -e IOMMU
> > > [0.00] DMAR: IOMMU enabled
> > > 
> > > Thanks,
> 

Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Michael S. Tsirkin
On Thu, Jul 26, 2018 at 06:38:32PM +0200, Knut Omang wrote:
> On Thu, 2018-07-26 at 18:51 +0300, Marcel Apfelbaum wrote:
> > Hi
> > 
> > On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:
> > > On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:
> > > > Hey,
> > > > 
> > > > Our team is adding a NIC functional  emulation to QEMU.
> > > > One of the features we are adding to this NIC is SRIOV.
> > > > 
> > > > Here is the error message I get when checking SRIOV support of  our
> > > > emulated NIC on Win2016 server (the hyper-v VM).
> > > > "
> > > > SR-IOV cannot be used on this system as the PCI Express hardware does 
> > > > not
> > > > support Access Control Services (ACS) at any root port.
> > 
> > QEMU's emulated PCI Express Root Ports do not support ACS yet, however I 
> > am not sure ACS is a prerequisite
> > for SR-IOV. We would need ARI support for allowing more than 8VFs, but 
> > QEMU doesn't support that either (yet).
> > 
> > Knut Omag has some working patches, he successfully implemented SR-IOV 
> > with QEMU, see:
> >   https://github.com/knuto/qemu/tree/sriov_patches_v7
> > 
> > The code was not merged since we need at least a device with SR-IOV 
> > support to justify the addition.
> 
> FYI, I recently rebased these to latest master but just didn't get to push 
> them
> out. I did just now - they are available here:
> 
>https://github.com/knuto/qemu/tree/sriov_patches_v8
> 
> As far as I know ARI support with my patch set works just fine - I have tested
> it with lots of VFs.
> 
> One of the patches in the series (pci: Make use of the devfn property when
> registering new devices) is necessary to make ARI work as it should with 
> SR/IOV.
> 
> For the hardware model I developed the SR/IOV patches for, I also added enough
> ACS support in the root port (PCIe capability helper patch + usage in
> ioh3420) to make VFIO "happy". I haven't submitted them because they
> are "questionable" since they likely do not reflect the actual features of the
> ioh3420.

In that the actual ioh3420 doesn't support ACS?

> I can make those available if interesting.
> 
> Thanks,
> Knut
> 
> > > > Contact your system vendor for further information.
> > > > "
> > > 
> > > I'm not sure what the status of emulated SR-IOV is so I have CCed
> > > Michael Tsirkin and Marcel Apfelbaum, the PCI maintainers in QEMU.
> > 
> > Thanks,
> > Marcel
> > 
> > > > Could you please advise about what could be the issue here?
> > > > 
> > > > BTW: I use same configuration (VM XML file attached) when running linux 
> > > > VM
> > > > (RH7.2) image (instead of Win Hyper-V) over the same host and SRIOV is
> > > > working for me there.
> > > > 
> > > > Here the XML file I use to define the VM (our emulated NIC is added at 
> > > > the
> > > > end of XML):
> > > > "
> > > >  > > > xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'
> > > > >;
> > > >nst105
> > > >0249a525-2ee2-432b-a1f5-a6db83b089a3
> > > >8388608
> > > >8388608
> > > >8
> > > >
> > > >  /machine
> > > >
> > > >
> > > >  hvm
> > > >
> > > >
> > > >  
> > > >  
> > > >  
> > > >
> > > >
> > > >
> > > >  
> > > >
> > > >
> > > >  SandyBridge
> > > >  
> > > >  
> > > >
> > > >
> > > >  
> > > >  
> > > >  
> > > >  
> > > >
> > > >destroy
> > > >restart
> > > >destroy
> > > >
> > > >  
> > > >  
> > > >
> > > >
> > > >  /opt/qemu/bin/qemu-system-x86_64
> > > >  
> > > >
> > > >
> > > >
> > > >
> > > > > > > unit='0'/>
> > > >  
> > > >  
> > > > > > > function='0x7'/>
> > > >  
> > > >  
> > > >
> > > > > > > function='0x0' multifunction='on'/>
> > > >  
> > > >  
> > > >
> > > > > > > function='0x1'/>
> > > >  
> > > >  
> > > >
> > > > > > > function='0x2'/>
> > > >  
> > > >  
> > > > > > > function='0x2'/>
> > > >  
> > > >  
> > > >  
> > > >
> > > >
> > > > > > > function='0x0' multifunction='on'/>
> > > >  
> > > >  
> > > >
> > > >
> > > > > > > function='0x1'/>
> > > >  
> > > >  
> > > >
> > > >
> > > > > > > function='0x2'/>
> > > >  
> > > >  
> > > >
> > > > > > > function='0x0'/>
> > > >  
> > > >  
> > > >
> > > >
> > > > > > > function='0x0'/>
> > > >  
> > > >  
> > > >
> > > >
> > > >
> > > >
> > > > > > > function='0x0'/>
> > > >  
> > > >  
> > > >
> > > >  
> > > >  
> > > >
> > > >  
> > > >  
> > > >
> > > >  
> > > >  
> > > >  
> > > >   > > > keymap='en-us'>
> > > >
> > > >  
> > > >  
> > > >
> > > > > > > function='

Re: [Qemu-devel] [Qemu-block] [PATCH for-3.0] file-posix: Fix write_zeroes with unmap on block devices

2018-07-26 Thread Kevin Wolf
Am 26.07.2018 um 18:22 hat Nir Soffer geschrieben:
> On Thu, Jul 26, 2018 at 2:33 PM Kevin Wolf  wrote:
> 
> > The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
> > all-zero afterwards, so don't try to abuse it for zero writing. We try
> > to only use this if BLKDISCARDZEROES tells us that it is safe, but this
> > is unreliable on older kernels and a constant 0 in newer kernels. In
> > other words, this code path is never actually used with newer kernels,
> > so we don't even try to unmap while writing zeros.
> >
> > This patch removes the abuse of discard for writing zeroes from
> > file-posix and instead adds a new function that uses interfaces that are
> > actually meant to deallocate and zero out at the same time. Only if
> > those fail, it falls back to zeroing out without unmap. We never fall
> > back to a discard operation any more that may or may not result in
> > zeros.
> >
> > Signed-off-by: Kevin Wolf 
> > ---
> >  block/file-posix.c | 62
> > +-
> >  1 file changed, 47 insertions(+), 15 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 60af4b3d51..9c66cd87d1 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -648,7 +648,7 @@ static int raw_open_common(BlockDriverState *bs, QDict
> > *options,
> >  }
> >  #endif
> >
> > -bs->supported_zero_flags = s->discard_zeroes ? BDRV_REQ_MAY_UNMAP : 0;
> > +bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
> >  ret = 0;
> >  fail:
> >  if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
> > @@ -1487,6 +1487,38 @@ static ssize_t
> > handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
> >  return -ENOTSUP;
> >  }
> >
> > +static ssize_t handle_aiocb_write_zeroes_unmap(RawPosixAIOData *aiocb)
> > +{
> > +BDRVRawState *s = aiocb->bs->opaque;
> > +int ret;
> > +
> > +/* First try to write zeros and unmap at the same time */
> > +
> > +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> > +ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > +   aiocb->aio_offset, aiocb->aio_nbytes);
> > +if (ret != -ENOTSUP) {
> >
> 
> This fails with ENODEV on RHEL 7.5 when fd is a block device.
> 
> The manual says:
> 
>ENODEV fd does not refer to a regular file or a directory.  (If fd
> is a pipe or FIFO, a different error results.)

do_fallocate() is a wrapper around fallocate() that handles EINTR and
translates a few errno values (including ENODEV) into a -ENOTSUP return
code.

Kevin



Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Alex Williamson
On Thu, 26 Jul 2018 19:14:45 +0300
"Michael S. Tsirkin"  wrote:

> On Thu, Jul 26, 2018 at 06:51:13PM +0300, Marcel Apfelbaum wrote:
> > Hi
> > 
> > On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:  
> > > On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:  
> > > > Hey,
> > > > 
> > > > Our team is adding a NIC functional  emulation to QEMU.
> > > > One of the features we are adding to this NIC is SRIOV.
> > > > 
> > > > Here is the error message I get when checking SRIOV support of  our 
> > > > emulated NIC on Win2016 server (the hyper-v VM).
> > > > "
> > > > SR-IOV cannot be used on this system as the PCI Express hardware does 
> > > > not support Access Control Services (ACS) at any root port.  
> > 
> > QEMU's emulated PCI Express Root Ports do not support ACS yet, however I am
> > not sure ACS is a prerequisite
> > for SR-IOV.

ACS is certainly not a prerequisite for SR-IOV.
 
> Looks like windows blocks dev assignment in nested VMs without it.
> Thinking about it, doesn't vfio do the same by default? I think vfio has
> a flag to override this though.

IOMMU grouping in Linux takes isolation via ACS and device specific
mechanisms into account, limiting the granularity with which a
userspace driver can claim ownership of devices, but it doesn't
actually prevent enabling SR-IOV on the endpoint.  It just makes it
less useful if your intention is to use SR-IOV for device assignment.
The only overrides for this are out-of-tree.  Thanks,

Alex



Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Knut Omang
On Thu, 2018-07-26 at 19:41 +0300, Michael S. Tsirkin wrote:
> On Thu, Jul 26, 2018 at 06:38:32PM +0200, Knut Omang wrote:
> > On Thu, 2018-07-26 at 18:51 +0300, Marcel Apfelbaum wrote:
> > > Hi
> > > 
> > > On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:
> > > > On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:
> > > > > Hey,
> > > > > 
> > > > > Our team is adding a NIC functional  emulation to QEMU.
> > > > > One of the features we are adding to this NIC is SRIOV.
> > > > > 
> > > > > Here is the error message I get when checking SRIOV support of  our
> > > > > emulated NIC on Win2016 server (the hyper-v VM).
> > > > > "
> > > > > SR-IOV cannot be used on this system as the PCI Express hardware does
> > > > > not
> > > > > support Access Control Services (ACS) at any root port.
> > > 
> > > QEMU's emulated PCI Express Root Ports do not support ACS yet, however I 
> > > am not sure ACS is a prerequisite
> > > for SR-IOV. We would need ARI support for allowing more than 8VFs, but 
> > > QEMU doesn't support that either (yet).
> > > 
> > > Knut Omag has some working patches, he successfully implemented SR-IOV 
> > > with QEMU, see:
> > >   https://github.com/knuto/qemu/tree/sriov_patches_v7
> > > 
> > > The code was not merged since we need at least a device with SR-IOV 
> > > support to justify the addition.
> > 
> > FYI, I recently rebased these to latest master but just didn't get to push
> > them
> > out. I did just now - they are available here:
> > 
> >https://github.com/knuto/qemu/tree/sriov_patches_v8
> > 
> > As far as I know ARI support with my patch set works just fine - I have
> > tested
> > it with lots of VFs.
> > 
> > One of the patches in the series (pci: Make use of the devfn property when
> > registering new devices) is necessary to make ARI work as it should with
> > SR/IOV.
> > 
> > For the hardware model I developed the SR/IOV patches for, I also added
> > enough
> > ACS support in the root port (PCIe capability helper patch + usage in
> > ioh3420) to make VFIO "happy". I haven't submitted them because they
> > are "questionable" since they likely do not reflect the actual features of
> > the
> > ioh3420.
> 
> In that the actual ioh3420 doesn't support ACS?

yes.. I don't have one so I don't know but that was my assumption..

Knut

> > I can make those available if interesting.
> > 
> > Thanks,
> > Knut
> > 
> > > > > Contact your system vendor for further information.
> > > > > "
> > > > 
> > > > I'm not sure what the status of emulated SR-IOV is so I have CCed
> > > > Michael Tsirkin and Marcel Apfelbaum, the PCI maintainers in QEMU.
> > > 
> > > Thanks,
> > > Marcel
> > > 
> > > > > Could you please advise about what could be the issue here?
> > > > > 
> > > > > BTW: I use same configuration (VM XML file attached) when running
> > > > > linux VM
> > > > > (RH7.2) image (instead of Win Hyper-V) over the same host and SRIOV is
> > > > > working for me there.
> > > > > 
> > > > > Here the XML file I use to define the VM (our emulated NIC is added at
> > > > > the
> > > > > end of XML):
> > > > > "
> > > > >  > > > > > ;
> > > > > 
> > > > >nst105
> > > > >0249a525-2ee2-432b-a1f5-a6db83b089a3
> > > > >8388608
> > > > >8388608
> > > > >8
> > > > >
> > > > >  /machine
> > > > >
> > > > >
> > > > >  hvm
> > > > >
> > > > >
> > > > >  
> > > > >  
> > > > >  
> > > > >
> > > > >
> > > > >
> > > > >  
> > > > >
> > > > >
> > > > >  SandyBridge
> > > > >  
> > > > >  
> > > > >
> > > > >
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >
> > > > >destroy
> > > > >restart
> > > > >destroy
> > > > >
> > > > >  
> > > > >  
> > > > >
> > > > >
> > > > >  /opt/qemu/bin/qemu-system-x86_64
> > > > >  
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > > > > unit='0'/>
> > > > >  
> > > > >  
> > > > > > > > > function='0x7'/>
> > > > >  
> > > > >  
> > > > >
> > > > > > > > > function='0x0' multifunction='on'/>
> > > > >  
> > > > >  
> > > > >
> > > > > > > > > function='0x1'/>
> > > > >  
> > > > >  
> > > > >
> > > > > > > > > function='0x2'/>
> > > > >  
> > > > >  
> > > > > > > > > function='0x2'/>
> > > > >  
> > > > >  
> > > > >  
> > > > >
> > > > >
> > > > > > > > > function='0x0' multifunction='on'/>
> > > > >  
> > > > >  
> > > > >
> > > > >
> > > > > > > > > function='0x1'/>
> > > > >  
> > > > >  
> > > > >
> > > > >
> > > > > > > > > function='0x2'/>
> > > > >  
> > > > >  
> > > > >
> > > > > > > > > function='0x0'/>
> > > > >  
> > > > >  
> > > > >
> > > > >
> > > > > > > > > funct

Re: [Qemu-devel] [PATCH for-3.0] file-posix: Fix write_zeroes with unmap on block devices

2018-07-26 Thread Nir Soffer
On Thu, Jul 26, 2018 at 7:10 PM Eric Blake  wrote:

> On 07/26/2018 10:33 AM, Kevin Wolf wrote:
>
> >
> > As far as I know, the comment you quoted is accurate for BLKDISCARD and
> > BLKZEROOUT, but not for the fallocate() flags.
> >
>
> I sure wish the man pages were more explicit on what guarantees each
> flag offers.
>
> >> Hmm - that thread also mentions FALLOC_FL_NO_HIDE_STALE, which is a new
> flag
> >> not present/documented on Fedora 28. I wonder if it helps, too.
> >>
> >>>
> >>> FALLOC_FL_ZERO_RANGE in contrast implements write_zeroes without unmap.
> >>
> >> I thought the opposite: FALLOC_FL_ZERO_RANGE guarantees that you read
> back
> >> 0, using whatever is most efficient under the hood (in the case of block
> >> devices, unmapping that reliably reads back as zero is favored).
> >
> > See the code I quoted above, FALLOC_FL_ZERO_RANGE calls
> > blkdev_issue_zeroout() with BLKDEV_ZERO_NOUNMAP internally.
>
> Having to resort to reading the kernel code to learn what the guarantees
> are is annoying (it's nice that GPL guarantees that we CAN do that, but
> it means the man pages could use some TLC so that we don't have to).
> Especially since the kernel code has changed over time.
>
> But your paste of current kernel code is rather hard to argue against.
>

I don't think we should depend on undocumented kernel code.

Nir


Re: [Qemu-devel] [Qemu-block] [PATCH for-3.0] file-posix: Fix write_zeroes with unmap on block devices

2018-07-26 Thread Nir Soffer
On Thu, Jul 26, 2018 at 7:41 PM Kevin Wolf  wrote:
...

> > > +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> > > +ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE |
> FALLOC_FL_KEEP_SIZE,
> > > +   aiocb->aio_offset, aiocb->aio_nbytes);
> > > +if (ret != -ENOTSUP) {
> > >
> >
> > This fails with ENODEV on RHEL 7.5 when fd is a block device.
> >
> > The manual says:
> >
> >ENODEV fd does not refer to a regular file or a directory.  (If fd
> > is a pipe or FIFO, a different error results.)
>
> do_fallocate() is a wrapper around fallocate() that handles EINTR and
> translates a few errno values (including ENODEV) into a -ENOTSUP return
> code.
>

I missed that, so this looks correct.

Nir


Re: [Qemu-devel] [v23 1/2] virtio-crypto: Add virtio crypto device specification

2018-07-26 Thread Halil Pasic

Sorry I did not have any time for this last days. And this
to make it worse this is a follow up to something that was
half a year ago. That means I have to re-familiarize myself
with the topic.

If I don't get around to answer in couple of weeks, feel
free to ping me.

Since from what I've seen most of the questions were of type
'Can we do it like this?', maybe sending out a new version
is the best. So people (like me) can read through the whole
thing and point out problems if any.

Regards,
Halil

On 07/20/2018 02:01 PM, Longpeng (Mike) wrote:



On 2018/1/10 1:05, Halil Pasic wrote:




On 12/30/2017 10:35 AM, Longpeng(Mike) wrote:

From: Gonglei 





Re: [Qemu-devel] [PATCH for-3.0] file-posix: Fix write_zeroes with unmap on block devices

2018-07-26 Thread Kevin Wolf
Am 26.07.2018 um 18:46 hat Nir Soffer geschrieben:
> On Thu, Jul 26, 2018 at 7:10 PM Eric Blake  wrote:
> 
> > On 07/26/2018 10:33 AM, Kevin Wolf wrote:
> >
> > >
> > > As far as I know, the comment you quoted is accurate for BLKDISCARD and
> > > BLKZEROOUT, but not for the fallocate() flags.
> > >
> >
> > I sure wish the man pages were more explicit on what guarantees each
> > flag offers.
> >
> > >> Hmm - that thread also mentions FALLOC_FL_NO_HIDE_STALE, which is a new
> > flag
> > >> not present/documented on Fedora 28. I wonder if it helps, too.
> > >>
> > >>>
> > >>> FALLOC_FL_ZERO_RANGE in contrast implements write_zeroes without unmap.
> > >>
> > >> I thought the opposite: FALLOC_FL_ZERO_RANGE guarantees that you read
> > back
> > >> 0, using whatever is most efficient under the hood (in the case of block
> > >> devices, unmapping that reliably reads back as zero is favored).
> > >
> > > See the code I quoted above, FALLOC_FL_ZERO_RANGE calls
> > > blkdev_issue_zeroout() with BLKDEV_ZERO_NOUNMAP internally.
> >
> > Having to resort to reading the kernel code to learn what the guarantees
> > are is annoying (it's nice that GPL guarantees that we CAN do that, but
> > it means the man pages could use some TLC so that we don't have to).
> > Especially since the kernel code has changed over time.
> >
> > But your paste of current kernel code is rather hard to argue against.
> 
> I don't think we should depend on undocumented kernel code.

Yes, the man page says "filesystem blocks" instead of just "blocks". If
you're in a nit-picking mood, that excludes block devices. But I think
it's pretty obvious what the intended meaning is and how it consistently
extends to block devices, and the code confirms it.

That's by far not "undocumented kernel code" in my book.

Kevin



Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Marcel Apfelbaum




On 07/26/2018 07:42 PM, Knut Omang wrote:

On Thu, 2018-07-26 at 19:41 +0300, Michael S. Tsirkin wrote:

On Thu, Jul 26, 2018 at 06:38:32PM +0200, Knut Omang wrote:

On Thu, 2018-07-26 at 18:51 +0300, Marcel Apfelbaum wrote:

Hi

On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:

On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:

Hey,

Our team is adding a NIC functional  emulation to QEMU.
One of the features we are adding to this NIC is SRIOV.

Here is the error message I get when checking SRIOV support of  our
emulated NIC on Win2016 server (the hyper-v VM).
"
SR-IOV cannot be used on this system as the PCI Express hardware does
not
support Access Control Services (ACS) at any root port.

QEMU's emulated PCI Express Root Ports do not support ACS yet, however I
am not sure ACS is a prerequisite
for SR-IOV. We would need ARI support for allowing more than 8VFs, but
QEMU doesn't support that either (yet).

Knut Omag has some working patches, he successfully implemented SR-IOV
with QEMU, see:
   https://github.com/knuto/qemu/tree/sriov_patches_v7

The code was not merged since we need at least a device with SR-IOV
support to justify the addition.

FYI, I recently rebased these to latest master but just didn't get to push
them
out. I did just now - they are available here:

https://github.com/knuto/qemu/tree/sriov_patches_v8

As far as I know ARI support with my patch set works just fine - I have
tested
it with lots of VFs.

One of the patches in the series (pci: Make use of the devfn property when
registering new devices) is necessary to make ARI work as it should with
SR/IOV.

For the hardware model I developed the SR/IOV patches for, I also added
enough
ACS support in the root port (PCIe capability helper patch + usage in
ioh3420) to make VFIO "happy". I haven't submitted them because they
are "questionable" since they likely do not reflect the actual features of
the
ioh3420.

In that the actual ioh3420 doesn't support ACS?

yes.. I don't have one so I don't know but that was my assumption..


Hi Knut,

We have now a generic PCIe Root Port we can add whatever we want to it.
See please hw/pci-bridge/gen_pcie_root_port.c.

So your patches add both ARI and ACS support, nice!
Maybe it worth merging at least these features.

Thanks,
Marcel



Knut


I can make those available if interesting.

Thanks,
Knut


Contact your system vendor for further information.
"

I'm not sure what the status of emulated SR-IOV is so I have CCed
Michael Tsirkin and Marcel Apfelbaum, the PCI maintainers in QEMU.

Thanks,
Marcel


Could you please advise about what could be the issue here?

BTW: I use same configuration (VM XML file attached) when running
linux VM
(RH7.2) image (instead of Win Hyper-V) over the same host and SRIOV is
working for me there.

Here the XML file I use to define the VM (our emulated NIC is added at
the
end of XML):
"

;

nst105
0249a525-2ee2-432b-a1f5-a6db83b089a3
8388608
8388608
8

  /machine


  hvm


  
  
  



  


  SandyBridge
  
  


  
  
  
  

destroy
restart
destroy

  
  


  /opt/qemu/bin/qemu-system-x86_64
  





  
  

  
  


  
  


  
  


  
  

  
  
  



  
  



  
  



  
  


  
  



  
  





  
  

  
  

  
  

  
  
  
  

  
  


  
  

  




  
  
  
  
  
  


"
__
General info:
Host OS: RH7.0 (Kernel: 4.14.13)
QEMU version: 2.11
libvirt version: 3.2.0
Running the following on the host shows that both nested and IOMMU are
enabled:
~]#: cat /sys/module/kvm_intel/parameters/nested
Y
~]# dmesg | grep -e DMAR -e IOMMU
[0.00] DMAR: IOMMU enabled

Thanks,
Elijah








Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Michael S. Tsirkin
On Thu, Jul 26, 2018 at 10:42:52AM -0600, Alex Williamson wrote:
> On Thu, 26 Jul 2018 19:14:45 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Thu, Jul 26, 2018 at 06:51:13PM +0300, Marcel Apfelbaum wrote:
> > > Hi
> > > 
> > > On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:  
> > > > On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:  
> > > > > Hey,
> > > > > 
> > > > > Our team is adding a NIC functional  emulation to QEMU.
> > > > > One of the features we are adding to this NIC is SRIOV.
> > > > > 
> > > > > Here is the error message I get when checking SRIOV support of  our 
> > > > > emulated NIC on Win2016 server (the hyper-v VM).
> > > > > "
> > > > > SR-IOV cannot be used on this system as the PCI Express hardware does 
> > > > > not support Access Control Services (ACS) at any root port.  
> > > 
> > > QEMU's emulated PCI Express Root Ports do not support ACS yet, however I 
> > > am
> > > not sure ACS is a prerequisite
> > > for SR-IOV.
> 
> ACS is certainly not a prerequisite for SR-IOV.
>  
> > Looks like windows blocks dev assignment in nested VMs without it.
> > Thinking about it, doesn't vfio do the same by default? I think vfio has
> > a flag to override this though.
> 
> IOMMU grouping in Linux takes isolation via ACS and device specific
> mechanisms into account, limiting the granularity with which a
> userspace driver can claim ownership of devices,

In that you must assign all devices behind this port as a group?

> but it doesn't
> actually prevent enabling SR-IOV on the endpoint.  It just makes it
> less useful if your intention is to use SR-IOV for device assignment.
> The only overrides for this are out-of-tree.  Thanks,
> 
> Alex




Re: [Qemu-devel] [PATCH] linux-user: ppc64: don't use volatile register during safe_syscall

2018-07-26 Thread Richard Henderson
On 07/25/2018 11:48 PM, Shivaprasad G Bhat wrote:
> r11 is a volatile register on PPC as per calling conventions.
> The safe_syscall code uses it to check if the signal_pending
> is set during the safe_syscall. When a syscall is interrupted
> on return from signal handling, the r11 might be corrupted
> before we retry the syscall leading to a crash. The registers
> r0-r13 are not to be used here as they have
> volatile/designated/reserved usages. Change the code to use
> r14 which is non-volatile and is appropriate for local use in
> safe_syscall.
> 
> Signed-off-by: Shivaprasad G Bhat 
> ---
> Steps to reproduce:
> On PPC host, issue `qemu-ppc64le /usr/bin/cc -E -`
> Attempt Ctrl-C, the issue is reproduced.
> 
> Reference:
> https://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.9.html#REG
> 
>  linux-user/host/ppc64/safe-syscall.inc.S |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/linux-user/host/ppc64/safe-syscall.inc.S 
> b/linux-user/host/ppc64/safe-syscall.inc.S
> index d30050a67c..b0cbbe6a69 100644
> --- a/linux-user/host/ppc64/safe-syscall.inc.S
> +++ b/linux-user/host/ppc64/safe-syscall.inc.S
> @@ -49,7 +49,7 @@ safe_syscall_base:
>*   and returns the result in r3
>* Shuffle everything around appropriately.
>*/
> - mr  11, 3   /* signal_pending */
> + mr  14, 3   /* signal_pending */

I do see that I was incorrect in assuming that r11 would be unmodified.  But
you can't simply write to a call-saved register -- you must preserve its value
for the caller.

Saving the value requires that you find some space on, or create, a stack
frame.  Note that there are two different conventions for _CALL_AIX and
_CALL_ELF==2.


r~



Re: [Qemu-devel] [PATCH v5 23/24] replay: add BH oneshot event for block layer

2018-07-26 Thread Alex Bennée


Pavel Dovgalyuk  writes:

> Replay is capable of recording normal BH events, but sometimes
> there are single use callbacks scheduled with aio_bh_schedule_oneshot
> function. This patch enables recording and replaying such callbacks.
> Block layer uses these events for calling the completion function.
> Replaying these calls makes the execution deterministic.
>
> Signed-off-by: Pavel Dovgalyuk 
> ---
>  block/block-backend.c|3 ++-
>  include/sysemu/replay.h  |3 +++
>  replay/replay-events.c   |   16 
>  replay/replay-internal.h |1 +
>  replay/replay.c  |2 +-
>  stubs/replay.c   |6 ++
>  6 files changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/block/block-backend.c b/block/block-backend.c
> index f2f75a9..232d114 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -17,6 +17,7 @@
>  #include "block/throttle-groups.h"
>  #include "sysemu/blockdev.h"
>  #include "sysemu/sysemu.h"
> +#include "sysemu/replay.h"
>  #include "qapi/error.h"
>  #include "qapi/qapi-events-block.h"
>  #include "qemu/id.h"
> @@ -1370,7 +1371,7 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, 
> int64_t offset, int bytes,
>
>  acb->has_returned = true;
>  if (acb->rwco.ret != NOT_DONE) {
> -aio_bh_schedule_oneshot(blk_get_aio_context(blk),
> +replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
>  blk_aio_complete_bh, acb);
>  }
>
> diff --git a/include/sysemu/replay.h b/include/sysemu/replay.h
> index 8118b00..945bc74 100644
> --- a/include/sysemu/replay.h
> +++ b/include/sysemu/replay.h
> @@ -152,6 +152,9 @@ bool replay_events_enabled(void);
>  void replay_flush_events(void);
>  /*! Adds bottom half event to the queue */
>  void replay_bh_schedule_event(QEMUBH *bh);
> +/*! Adds oneshot bottom half event to the queue */
> +void replay_bh_schedule_oneshot_event(AioContext *ctx,
> +QEMUBHFunc *cb, void *opaque);
>  /*! Adds input event to the queue */
>  void replay_input_event(QemuConsole *src, InputEvent *evt);
>  /*! Adds input sync event to the queue */
> diff --git a/replay/replay-events.c b/replay/replay-events.c
> index 0964a82..0ac8a5c 100644
> --- a/replay/replay-events.c
> +++ b/replay/replay-events.c
> @@ -37,6 +37,9 @@ static void replay_run_event(Event *event)
>  case REPLAY_ASYNC_EVENT_BH:
>  aio_bh_call(event->opaque);
>  break;
> +case REPLAY_ASYNC_EVENT_BH_ONESHOT:
> +((QEMUBHFunc*)event->opaque)(event->opaque2);
> +break;
>  case REPLAY_ASYNC_EVENT_INPUT:
>  qemu_input_event_send_impl(NULL, (InputEvent *)event->opaque);
>  qapi_free_InputEvent((InputEvent *)event->opaque);
> @@ -132,6 +135,17 @@ void replay_bh_schedule_event(QEMUBH *bh)
>  }
>  }
>
> +void replay_bh_schedule_oneshot_event(AioContext *ctx,
> +QEMUBHFunc *cb,void *opaque)
> +{
> +if (events_enabled) {
> +uint64_t id = replay_get_current_step();
> +replay_add_event(REPLAY_ASYNC_EVENT_BH_ONESHOT, cb, opaque, id);
> +} else {
> +aio_bh_schedule_oneshot(ctx, cb, opaque);
> +}
> +}
> +
>  void replay_add_input_event(struct InputEvent *event)
>  {
>  replay_add_event(REPLAY_ASYNC_EVENT_INPUT, event, NULL, 0);
> @@ -162,6 +176,7 @@ static void replay_save_event(Event *event, int 
> checkpoint)
>  /* save event-specific data */
>  switch (event->event_kind) {
>  case REPLAY_ASYNC_EVENT_BH:
> +case REPLAY_ASYNC_EVENT_BH_ONESHOT:
>  replay_put_qword(event->id);
>  break;
>  case REPLAY_ASYNC_EVENT_INPUT:
> @@ -216,6 +231,7 @@ static Event *replay_read_event(int checkpoint)
>  /* Events that has not to be in the queue */
>  switch (replay_state.read_event_kind) {
>  case REPLAY_ASYNC_EVENT_BH:
> +case REPLAY_ASYNC_EVENT_BH_ONESHOT:
>  if (replay_state.read_event_id == -1) {
>  replay_state.read_event_id = replay_get_qword();
>  }
> diff --git a/replay/replay-internal.h b/replay/replay-internal.h
> index 08ef2ec..0c0ed16 100644
> --- a/replay/replay-internal.h
> +++ b/replay/replay-internal.h
> @@ -51,6 +51,7 @@ enum ReplayEvents {
>
>  enum ReplayAsyncEventKind {
>  REPLAY_ASYNC_EVENT_BH,
> +REPLAY_ASYNC_EVENT_BH_ONESHOT,
>  REPLAY_ASYNC_EVENT_INPUT,
>  REPLAY_ASYNC_EVENT_INPUT_SYNC,
>  REPLAY_ASYNC_EVENT_CHAR_READ,
> diff --git a/replay/replay.c b/replay/replay.c
> index 6e82764..061b1e2 100644
> --- a/replay/replay.c
> +++ b/replay/replay.c
> @@ -22,7 +22,7 @@
>
>  /* Current version of the replay mechanism.
> Increase it when file format changes. */
> -#define REPLAY_VERSION  0xe02007
> +#define REPLAY_VERSION  0xe02008
>  /* Size of replay log header */
>  #define HEADER_SIZE (sizeof(uint32_t) + sizeof(uint64_t))
>
> diff --git a/stubs/replay.c b/stubs/replay.c
> index 781974e..cbdac80 100644
> --- a/stubs/replay.c
> +++ b/

Re: [Qemu-devel] [Qemu-ppc] [PATCH v2 1/2] tpm_spapr: Support TPM for ppc64 using CRQ based interface

2018-07-26 Thread Stefan Berger

On 07/26/2018 01:26 AM, David Gibson wrote:

On Tue, Dec 12, 2017 at 03:44:02PM -0500, Stefan Berger wrote:

Implement support for TPM on ppc64 by implementing the vTPM CRQ
interface as a frontend. It can use the tpm_emulator driver
backend with the external swtpm.

The Linux vTPM driver for ppc64 works with this emulation.

This TPM emulator also handles the TPM 2 case.

Signed-off-by: Stefan Berger 

Sorry, this slipped through the cracks and I completely forgot about
it for a very long time.  I don't know enough about TPMs to review in
detail, but nothing looks obviously bogus, so,

Reviewed-by: David Gibson 

However, it looks like it will need a rebase against the current
ppc-for-3.1 tree in order to be applied.


Thanks a lot for the review. I have to see what I will do with it now 
since it requires a lot of extensions to the SLOF firmware as well...


    Stefan




---
  hw/tpm/Makefile.objs |   1 +
  hw/tpm/tpm_spapr.c   | 382 +++
  include/sysemu/tpm.h |   3 +
  qapi/tpm.json|   5 +-
  4 files changed, 389 insertions(+), 2 deletions(-)
  create mode 100644 hw/tpm/tpm_spapr.c

diff --git a/hw/tpm/Makefile.objs b/hw/tpm/Makefile.objs
index 41f0b7a..71ea63e 100644
--- a/hw/tpm/Makefile.objs
+++ b/hw/tpm/Makefile.objs
@@ -1,3 +1,4 @@
  common-obj-$(CONFIG_TPM_TIS) += tpm_tis.o
  common-obj-$(CONFIG_TPM_PASSTHROUGH) += tpm_passthrough.o tpm_util.o
  common-obj-$(CONFIG_TPM_EMULATOR) += tpm_emulator.o tpm_util.o
+obj-$(CONFIG_PSERIES) += tpm_spapr.o
diff --git a/hw/tpm/tpm_spapr.c b/hw/tpm/tpm_spapr.c
new file mode 100644
index 000..ef5e62d
--- /dev/null
+++ b/hw/tpm/tpm_spapr.c
@@ -0,0 +1,382 @@
+/*
+ * QEMU PowerPC pSeries Logical Partition (aka sPAPR) hardware System Emulator
+ *
+ * PAPR Virtual TPM
+ *
+ * Copyright (c) 2015, 2017 IBM Corporation.
+ *
+ * Authors:
+ *Stefan Berger 
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "qapi/error.h"
+
+#include "sysemu/tpm_backend.h"
+#include "tpm_int.h"
+#include "tpm_util.h"
+
+#include "hw/ppc/spapr.h"
+#include "hw/ppc/spapr_vio.h"
+
+#define DEBUG_SPAPR_VTPM 0
+
+#define DPRINTF(fmt, ...) do { \
+if (DEBUG_SPAPR_VTPM) { \
+printf(fmt, ## __VA_ARGS__); \
+} \
+} while (0)
+
+#define VIO_SPAPR_VTPM(obj) \
+ OBJECT_CHECK(SPAPRvTPMState, (obj), TYPE_TPM_SPAPR)
+
+typedef struct VioCRQ {
+uint8_t valid;  /* 0x80: cmd; 0xc0: init crq
+   0x81-0x83: CRQ message response */
+uint8_t msg;/* see below */
+uint16_t len;   /* len of TPM request; len of TPM response */
+uint32_t data;  /* rtce_dma_handle when sending TPM request */
+uint64_t reserved;
+} VioCRQ;
+
+typedef union TPMSpaprCRQ {
+VioCRQ s;
+uint8_t raw[sizeof(VioCRQ)];
+} TPMSpaprCRQ;
+
+#define SPAPR_VTPM_VALID_INIT_CRQ_COMMAND  0xC0
+#define SPAPR_VTPM_VALID_COMMAND   0x80
+#define SPAPR_VTPM_MSG_RESULT  0x80
+
+/* msg types for valid = SPAPR_VTPM_VALID_INIT_CRQ */
+#define SPAPR_VTPM_INIT_CRQ_RESULT   0x1
+#define SPAPR_VTPM_INIT_CRQ_COMPLETE_RESULT  0x2
+
+/* msg types for valid = SPAPR_VTPM_VALID_CMD */
+#define SPAPR_VTPM_GET_VERSION   0x1
+#define SPAPR_VTPM_TPM_COMMAND   0x2
+#define SPAPR_VTPM_GET_RTCE_BUFFER_SIZE  0x3
+#define SPAPR_VTPM_PREPARE_TO_SUSPEND0x4
+
+/* response error messages */
+#define SPAPR_VTPM_VTPM_ERROR0xff
+
+/* error codes */
+#define SPAPR_VTPM_ERR_COPY_IN_FAILED0x3
+#define SPAPR_VTPM_ERR_COPY_OUT_FAILED   0x4
+
+#define MAX_BUFFER_SIZE TARGET_PAGE_SIZE
+
+typedef struct {
+VIOsPAPRDevice vdev;
+
+TPMSpaprCRQ crq; /* track single TPM command */
+
+uint8_t state;
+#define SPAPR_VTPM_STATE_NONE 0
+#define SPAPR_VTPM_STATE_EXECUTION1
+#define SPAPR_VTPM_STATE_COMPLETION   2
+
+unsigned char buffer[MAX_BUFFER_SIZE];
+
+TPMBackendCmd cmd;
+
+TPMBackend *be_driver;
+TPMVersion be_tpm_version;
+
+size_t be_buffer_size;
+} SPAPRvTPMState;
+
+static void tpm_spapr_show_buffer(const unsigned char *buffer,
+  size_t buffer_len, const char *string)
+{
+#if DEBUG_SPAPR_VTPM
+size_t i, len;
+
+len = MIN(tpm_cmd_get_size(buffer), buffer_len);
+printf("spapr_vtpm: %s length = %zu\n", string, len);
+for (i = 0; i < len; i++) {
+if (i && !(i % 16)) {
+printf("\n");
+}
+printf("%.2X ", buffer[i]);
+}
+printf("\n");
+#endif
+}
+
+/*
+ * Send a request to the TPM.
+ */
+static void tpm_spapr_tpm_send(SPAPRvTPMState *s)
+{
+tpm_spapr_show_buffer(s->buffer, sizeof(s->buffer), "spapr_vtpm: Tx TPM");
+
+s->state = SPAPR_VTPM_STATE_EXECUTION;
+s->cmd = (TPMBackendCmd) {
+.locty = 0,
+.in = s->buffer,
+.in_len = MIN(tpm_cmd_get_size(s->buffer), sizeof(

Re: [Qemu-devel] [PATCH v5 23/24] replay: add BH oneshot event for block layer

2018-07-26 Thread Alex Bennée


Alex Bennée  writes:

> Pavel Dovgalyuk  writes:
>
>> Replay is capable of recording normal BH events, but sometimes
>> there are single use callbacks scheduled with aio_bh_schedule_oneshot
>> function. This patch enables recording and replaying such callbacks.
>> Block layer uses these events for calling the completion function.
>> Replaying these calls makes the execution deterministic.
>>
>> Signed-off-by: Pavel Dovgalyuk 
>
> I'm not sure what about this commit causes the compile breakage I'm
> seeing:
>
>   LINKaarch64-linux-user/qemu-aarch64
> ../libqemuutil.a(cpu-get-icount.o):(.bss+0x0): multiple definition of 
> `use_icount'
> exec.o:(.bss+0x58): first defined here
> collect2: error: ld returned 1 exit status
> Makefile:199: recipe for target 'qemu-aarch64' failed
> make[1]: *** [qemu-aarch64] Error 1
> Makefile:481: recipe for target 'subdir-aarch64-linux-user' failed
> make: *** [subdir-aarch64-linux-user] Error 2
>
> It only occurs on a make clean && make -j on that commit though. It's
> hidden if you do incremental builds.

And it seems to be the same failure across all the cross builds:

  https://app.shippable.com/github/stsquad/qemu/runs/538/summary/console

--
Alex Bennée



Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Knut Omang
On Thu, 2018-07-26 at 20:11 +0300, Michael S. Tsirkin wrote:
> On Thu, Jul 26, 2018 at 10:42:52AM -0600, Alex Williamson wrote:
> > On Thu, 26 Jul 2018 19:14:45 +0300
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Thu, Jul 26, 2018 at 06:51:13PM +0300, Marcel Apfelbaum wrote:
> > > > Hi
> > > > 
> > > > On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:  
> > > > > On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:  
> > > > > > Hey,
> > > > > > 
> > > > > > Our team is adding a NIC functional  emulation to QEMU.
> > > > > > One of the features we are adding to this NIC is SRIOV.
> > > > > > 
> > > > > > Here is the error message I get when checking SRIOV support of  our
> > > > > > emulated NIC on Win2016 server (the hyper-v VM).
> > > > > > "
> > > > > > SR-IOV cannot be used on this system as the PCI Express hardware
> > > > > > does not support Access Control Services (ACS) at any root port.  
> > > > 
> > > > QEMU's emulated PCI Express Root Ports do not support ACS yet, however I
> > > > am
> > > > not sure ACS is a prerequisite
> > > > for SR-IOV.
> > 
> > ACS is certainly not a prerequisite for SR-IOV.
> >  
> > > Looks like windows blocks dev assignment in nested VMs without it.
> > > Thinking about it, doesn't vfio do the same by default? I think vfio has
> > > a flag to override this though.
> > 
> > IOMMU grouping in Linux takes isolation via ACS and device specific
> > mechanisms into account, limiting the granularity with which a
> > userspace driver can claim ownership of devices,
> 
> In that you must assign all devices behind this port as a group?

Yes, that's why I needed to add PCIe config space ACS support in ioh3420:
to be able to use VFIO to pass through individual VFs to different L2 VMs with
"plain" kernels on the L1 host which owns the SR/IOV device.

Knut

> > but it doesn't
> > actually prevent enabling SR-IOV on the endpoint.  It just makes it
> > less useful if your intention is to use SR-IOV for device assignment.
> > The only overrides for this are out-of-tree.  Thanks,
> >
> > 
> > Alex
> 
> 



Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Alex Williamson
On Thu, 26 Jul 2018 20:11:44 +0300
"Michael S. Tsirkin"  wrote:

> On Thu, Jul 26, 2018 at 10:42:52AM -0600, Alex Williamson wrote:
> > On Thu, 26 Jul 2018 19:14:45 +0300
> > "Michael S. Tsirkin"  wrote:
> >   
> > > On Thu, Jul 26, 2018 at 06:51:13PM +0300, Marcel Apfelbaum wrote:  
> > > > Hi
> > > > 
> > > > On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:
> > > > > On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:
> > > > > > Hey,
> > > > > > 
> > > > > > Our team is adding a NIC functional  emulation to QEMU.
> > > > > > One of the features we are adding to this NIC is SRIOV.
> > > > > > 
> > > > > > Here is the error message I get when checking SRIOV support of  our 
> > > > > > emulated NIC on Win2016 server (the hyper-v VM).
> > > > > > "
> > > > > > SR-IOV cannot be used on this system as the PCI Express hardware 
> > > > > > does not support Access Control Services (ACS) at any root port.
> > > > 
> > > > QEMU's emulated PCI Express Root Ports do not support ACS yet, however 
> > > > I am
> > > > not sure ACS is a prerequisite
> > > > for SR-IOV.  
> > 
> > ACS is certainly not a prerequisite for SR-IOV.
> >
> > > Looks like windows blocks dev assignment in nested VMs without it.
> > > Thinking about it, doesn't vfio do the same by default? I think vfio has
> > > a flag to override this though.  
> > 
> > IOMMU grouping in Linux takes isolation via ACS and device specific
> > mechanisms into account, limiting the granularity with which a
> > userspace driver can claim ownership of devices,  
> 
> In that you must assign all devices behind this port as a group?

Yup.  Not very useful with SR-IOV.  Thanks,

Alex



Re: [Qemu-devel] [PATCH hack dontapply v2 6/7] acpi: aml generation for _CST

2018-07-26 Thread Michael S. Tsirkin
On Wed, Jul 25, 2018 at 04:39:28PM +0200, Igor Mammedov wrote:
> Fixed dynamic tables are usually more or less self sufficient
> so it's safer to reload so I'm ok with their reloading.

Well it's just a matter of excercising self-control and
building sane APIs to make sure we do.
E.g. let's say we agree on prefix DQ for dynamic tables.

Now, isn't

Name(DQCS, Package() {...} )

which is constrained not to use any names from the main
table not self-sufficient?

To ensure this contraint, we can add APIs like dynamic_name
that add the "DQ" prefix, and we can verify that a dynamic
table does not use a name without this prefix.

> As for current cpuhp interface, it was designed to be extendable
> and we can extend it for CST. It most likely would be much less
> code to extend than it is in this series.

I do have patches that use cpuhp to signal CST changes.
I don't think cpuhp solves the mess that is dynamic
package generation though.

-- 
MST



Re: [Qemu-devel] [PATCH] linux-user: ppc64: don't use volatile register during safe_syscall

2018-07-26 Thread Richard Henderson
On 07/25/2018 11:48 PM, Shivaprasad G Bhat wrote:
> Reference:
> https://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.9.html#REG

This document is for _CALL_ELF < 2.  For ppc64le, the document is at

https://openpowerfoundation.org/wp-content/uploads/2016/03/ABI64BitOpenPOWERv1.1_16July2015_pub4.pdf

In both cases, it appears that we can (ab)use SP+16 to save
the value of r14 across the syscall.  This slot would normally
be used for saving our own return address (LR), but we have no
need to save that value because it *is* preserved across the syscall.


r~



  1   2   >