from:"Peter Maydell"

Re: arm64 syzbot instances

2021-03-22 Thread Peter Maydell

On Mon, 22 Mar 2021 at 16:36, John Garry  wrote:
>
> >>
> >> There's apparently a bit in the PCI spec that reads:
> >>  The host bus bridge, in PC compatible systems, must return all
> >>  1's on a read transaction and discard data on a write transaction
> >>  when terminated with Master-Abort.
> >>
> >> which obviously applies only to "PC compatible systems".
> >
> > Right. As far as I can tell, all ARMv8 and most ARMv7 based SoCs
> > do this to be more compatible with PC style operating systems like
> > Linux, but you are right that the specification here does not
> > mandate that, and the older ARMv5 SoCs seem to be compliant
> > as well based on this.

> >> TBH I'm having difficulty seeing why the kernel should be doing
> >> this at all, though. The device tree tells you you have a PCI
> >> controller; PCI supports enumeration of devices; you know exactly
> >> where everything is mapped because the BARs tell you that.
> >> I don't see anything that justifies the kernel in randomly
> >> dereferencing areas of the IO or memory windows where it hasn't
> >> mapped anything.
>
> BIOS has described a CPU-addressable PIO region in the PCI hostbridge,
> and the kernel has mapped it:
>
> [3.974309][T1] pci-host-generic 401000.pcie:   IO
> 0x003eff..0x003eff -> 0x00
>
> So I don't see why any accesses there should fault.

As requested above, do you have the PCI spec reference for
why the PIO region is supposed to do -1/discard for parts of
the PIO region where the kernel hasn't mapped any devices ?
For classic PCI, at least, the spec does not seem to mandate it.

thanks
-- PMM

Re: arm64 syzbot instances

2021-03-22 Thread Peter Maydell

On Sun, 21 Mar 2021 at 19:00, Arnd Bergmann  wrote:
>
> On Sat, Mar 20, 2021 at 9:43 PM Peter Maydell  
> wrote:
> >
> > On Fri, 12 Mar 2021 at 09:16, Arnd Bergmann  wrote:
> > > So it's probably qemu that triggers the 'synchronous external
> > > abort' when accessing the PCI I/O space, which in turn hints
> > > towards a bug in qemu. Presumably it only returns data from
> > > I/O ports that are actually mapped to a device when real hardware
> > > is supposed to return 0x when reading from unused I/O ports.
> >
> > Do you have a reference to the bit of the PCI spec that mandates
> > this -1/discard behaviour for attempted access to places where
> > there isn't actually a PCI device mapped ? The spec is pretty
> > long and hard to read...
> >
> > (Knowing to what extent this behaviour is mandatory for all
> > PCI systems/host controllers vs just "it would be nice if the
> > gpex host controller worked this way" would help in figuring
> > out where in QEMU to change.)
>
> I spent some more time looking at both really old PCI specifications,
> and new ones.
> The old PCI specs seem to just leave this bit as out of scope because
> it does not concern transactions on the bus. The PCI host controller
> can either report a 'master abort' to the CPU, or ignore it, and each
> bridge can decide to turn master aborts on reads into all 1s.
> We do have support some SoCs in Linux that trigger a CPU exception,
> but we tend to deal with those with an ugly hack that just ignores
> all exceptions from the CPU. Most host bridges fortunately behave
> like an x86 PC though, and do not trigger an exception here.

There's apparently a bit in the PCI spec that reads:
The host bus bridge, in PC compatible systems, must return all
1's on a read transaction and
discard data on a write transaction when terminated with Master-Abort.

which obviously applies only to "PC compatible systems".

> In the PCIe 4.0 specification, I found that the behavior is configurable
> at the root port, using the 'RP PIO Exception Register' at offset 0x1c
> in the DPC Extended Capability. This register defaults to '0', meaning
> that reads from an unknown port that generate a 'Unsupported Request
> Completion' get turned into all 1s. If the firmware or OS enables it,
> this can be turned into an AER log event, generate an interrupt or
> a CPU exception.
>
> Linux has a driver for DPC, which apparently configures it to
> cause an interrupt to log the event, but it does not hook up the
> CPU exception handler to this. I don't see an implementation of DPC
> in qemu, which I take as an indication that it should use the
> default behavior and cause neither an interrupt nor a CPU exception.

Hmm, maybe. We should probably also implement -1/discard just because
we're not intending to have 'surprising' behaviour.

TBH I'm having difficulty seeing why the kernel should be doing
this at all, though. The device tree tells you you have a PCI
controller; PCI supports enumeration of devices; you know exactly
where everything is mapped because the BARs tell you that.
I don't see anything that justifies the kernel in randomly
dereferencing areas of the IO or memory windows where it hasn't
mapped anything. You shouldn't be probing for legacy ISA-port
devices unless you're on a system which might actually have them
(eg an x86 PC).

thanks
-- PMM

Re: arm64 syzbot instances

2021-03-20 Thread Peter Maydell

On Fri, 12 Mar 2021 at 09:16, Arnd Bergmann  wrote:
> So it's probably qemu that triggers the 'synchronous external
> abort' when accessing the PCI I/O space, which in turn hints
> towards a bug in qemu. Presumably it only returns data from
> I/O ports that are actually mapped to a device when real hardware
> is supposed to return 0x when reading from unused I/O ports.

Do you have a reference to the bit of the PCI spec that mandates
this -1/discard behaviour for attempted access to places where
there isn't actually a PCI device mapped ? The spec is pretty
long and hard to read...

(Knowing to what extent this behaviour is mandatory for all
PCI systems/host controllers vs just "it would be nice if the
gpex host controller worked this way" would help in figuring
out where in QEMU to change.)

thanks
-- PMM

Re: [PATCH v9 6/6] KVM: arm64: Document MTE capability and ioctl

2021-03-09 Thread Peter Maydell

On Mon, 1 Mar 2021 at 14:23, Steven Price  wrote:
>
> A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
> granting a guest access to the tags, and provides a mechanism for the
> VMM to enable it.
>
> A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
> access the tags of a guest without having to maintain a PROT_MTE mapping
> in userspace. The above capability gates access to the ioctl.
>
> Signed-off-by: Steven Price 
> ---
>  Documentation/virt/kvm/api.rst | 37 ++
>  1 file changed, 37 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index aed52b0fc16e..1406ea138127 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4939,6 +4939,23 @@ KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO
>  Allows Xen vCPU attributes to be read. For the structure and types,
>  see KVM_XEN_VCPU_SET_ATTR above.
>
> +4.131 KVM_ARM_MTE_COPY_TAGS
> +---
> +
> +:Capability: KVM_CAP_ARM_MTE
> +:Architectures: arm64
> +:Type: vm ioctl
> +:Parameters: struct kvm_arm_copy_mte_tags
> +:Returns: 0 on success, < 0 on error
> +
> +Copies Memory Tagging Extension (MTE) tags to/from guest tag memory.

Mostly virt/kvm/api.rst seems to include documentation of the
associated structs, something like:

::

  struct kvm_arm_copy_mte_tags {
 __u64 guest_ipa;
 __u64 length;
 union {
 void __user *addr;
 __u64 padding;
 };
 __u64 flags;
  };


which saves the reader having to cross-reference against the header file.
It also means you can more naturally use the actual field names in the doc,
eg:

> +The
> +starting address and length of guest memory must be ``PAGE_SIZE`` aligned.

you could say "The guest_ipa and length fields" here.

Also "The addr field must point to a buffer which the tags will
be copied to or from." I assume.

> +The size of the buffer to store the tags is ``(length / MTE_GRANULE_SIZE)``
> +bytes (i.e. 1/16th of the corresponding size).

> + Each byte contains a single tag
> +value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
> +``PTRACE_POKEMTETAGS``.

What are the valid values for 'flags' ? It looks like they specify which
direction the copy is, which we definitely need to document here.

What happens if the caller requests a tag copy for an area of guest
address space which doesn't have tags (eg it has nothing mapped),
or for an area of guest addres space which has tags in some parts
but not in others ?

> +
>  5. The kvm_run structure
>  
>
> @@ -6227,6 +6244,25 @@ KVM_RUN_BUS_LOCK flag is used to distinguish between 
> them.
>  This capability can be used to check / enable 2nd DAWR feature provided
>  by POWER10 processor.
>
> +7.23 KVM_CAP_ARM_MTE
> +
> +
> +:Architectures: arm64
> +:Parameters: none
> +
> +This capability indicates that KVM (and the hardware) supports exposing the
> +Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
> +VMM before the guest will be granted access.
> +
> +When enabled the guest is able to access tags associated with any memory 
> given
> +to the guest. KVM will ensure that the pages are flagged ``PG_mte_tagged`` so
> +that the tags are maintained during swap or hibernation of the host, however

s/,/;/

> +the VMM needs to manually save/restore the tags as appropriate if the VM is
> +migrated.
> +
> +When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
> +perform a bulk copy of tags to/from the guest

"guest."

> +
>  8. Other capabilities.
>  ==
>
> @@ -6716,3 +6752,4 @@ KVM_XEN_HVM_SET_ATTR, KVM_XEN_HVM_GET_ATTR, 
> KVM_XEN_VCPU_SET_ATTR and
>  KVM_XEN_VCPU_GET_ATTR ioctls, as well as the delivery of exception vectors
>  for event channel upcalls when the evtchn_upcall_pending field of a vcpu's
>  vcpu_info is set.
> +
> --
> 2.20.1


Stray whitespace change ?

thanks
-- PMM

Re: [RFC PATCH v8 5/5] KVM: arm64: ioctl to fetch/store tags in a guest

2021-02-08 Thread Peter Maydell

On Fri, 5 Feb 2021 at 13:58, Steven Price  wrote:
>
> The VMM may not wish to have it's own mapping of guest memory mapped
> with PROT_MTE because this causes problems if the VMM has tag checking
> enabled (the guest controls the tags in physical RAM and it's unlikely
> the tags are correct for the VMM).
>
> Instead add a new ioctl which allows the VMM to easily read/write the
> tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
> while the VMM can still read/write the tags for the purpose of
> migration.
>
> Signed-off-by: Steven Price 
> ---
>  arch/arm64/include/uapi/asm/kvm.h | 13 +++
>  arch/arm64/kvm/arm.c  | 57 +++
>  include/uapi/linux/kvm.h  |  1 +
>  3 files changed, 71 insertions(+)

Missing the update to the docs in Documentation/virtual/kvm/api.txt :-)

thanks
-- PMM

Re: [PATCH v5 0/2] MTE support for KVM guest

2020-12-09 Thread Peter Maydell

On Wed, 9 Dec 2020 at 20:13, Richard Henderson
 wrote:
>
> On 12/9/20 12:39 PM, Catalin Marinas wrote:
> >> I would have thought that the best way is to use TCO, so that we don't 
> >> have to
> >> have dual mappings (and however many MB of extra page tables that might 
> >> imply).
> >
> > The problem appears when the VMM wants to use MTE itself (e.g. linked
> > against an MTE-aware glibc), toggling TCO is no longer generic enough,
> > especially when it comes to device emulation.
>
> But we do know exactly when we're manipulating guest memory -- we have special
> routines for that.

Well, yes and no. It's not like every access to guest memory
is through a specific set of "load from guest"/"store from guest"
functions, and in some situations there's a "get a pointer to
guest RAM, keep using it over a long-ish sequence of QEMU code,
then be done with it" pattern. It's because it's not that trivial
to isolate when something is accessing guest RAM that I don't want
to just have it be mapped PROT_MTE into QEMU. I think we'd end
up spending a lot of time hunting down "whoops, turns out this
is accessing guest RAM and sometimes it trips over the tags in
a hard-to-debug way" bugs. I'd much rather the kernel just
provided us with an API for what we want, which is (1) the guest
RAM as just RAM with no tag checking and separately (2) some
mechanism yet-to-be-designed which lets us bulk copy a page's
worth of tags for migration.

thanks
-- PMM

Re: [PATCH v5 0/2] MTE support for KVM guest

2020-12-07 Thread Peter Maydell

On Mon, 7 Dec 2020 at 16:44, Dr. David Alan Gilbert  wrote:
> * Steven Price (steven.pr...@arm.com) wrote:
> > Sorry, I know I simplified it rather by saying it's similar to protected VM.
> > Basically as I see it there are three types of memory access:
> >
> > 1) Debug case - has to go via a special case for decryption or ignoring the
> > MTE tag value. Hopefully this can be abstracted in the same way.
> >
> > 2) Migration - for a protected VM there's likely to be a special method to
> > allow the VMM access to the encrypted memory (AFAIK memory is usually kept
> > inaccessible to the VMM). For MTE this again has to be special cased as we
> > actually want both the data and the tag values.
> >
> > 3) Device DMA - for a protected VM it's usual to unencrypt a small area of
> > memory (with the permission of the guest) and use that as a bounce buffer.
> > This is possible with MTE: have an area the VMM purposefully maps with
> > PROT_MTE. The issue is that this has a performance overhead and we can do
> > better with MTE because it's trivial for the VMM to disable the protection
> > for any memory.
>
> Those all sound very similar to the AMD SEV world;  there's the special
> case for Debug that Peter mentioned; migration is ...complicated and
> needs special case that's still being figured out, and as I understand
> Device DMA also uses a bounce buffer (and swiotlb in the guest to make
> that happen).

Mmm, but for encrypted VMs the VM has to jump through all these
hoops because "don't let the VM directly access arbitrary guest RAM"
is the whole point of the feature. For MTE, we don't want in general
to be doing tag-checked accesses to guest RAM and there is nothing
in the feature "allow guests to use MTE" that requires that the VMM's
guest RAM accesses must do tag-checking. So we should avoid having
a design that require us to jump through all the hoops. Even if
it happens that handling encrypted VMs means that QEMU has to grow
some infrastructure for carefully positioning hoops in appropriate
places, we shouldn't use it unnecessarily... All we actually need is
a mechanism for migrating the tags: I don't think there's ever a
situation where you want tag-checking enabled for the VMM's accesses
to the guest RAM.

thanks
-- PMM

Re: [PATCH v5 0/2] MTE support for KVM guest

2020-12-07 Thread Peter Maydell

On Mon, 7 Dec 2020 at 14:48, Steven Price  wrote:
> Sounds like you are making good progress - thanks for the update. Have
> you thought about how the PROT_MTE mappings might work if QEMU itself
> were to use MTE? My worry is that we end up with MTE in a guest
> preventing QEMU from using MTE itself (because of the PROT_MTE
> mappings). I'm hoping QEMU can wrap its use of guest memory in a
> sequence which disables tag checking (something similar will be needed
> for the "protected VM" use case anyway), but this isn't something I've
> looked into.

It's not entirely the same as the "protected VM" case. For that
the patches currently on list basically special case "this is a
debug access (eg from gdbstub/monitor)" which then either gets
to go via "decrypt guest RAM for debug" or gets failed depending
on whether the VM has a debug-is-ok flag enabled. For an MTE
guest the common case will be guests doing standard DMA operations
to or from guest memory. The ideal API for that from QEMU's
point of view would be "accesses to guest RAM don't do tag
checks, even if tag checks are enabled for accesses QEMU does to
memory it has allocated itself as a normal userspace program".

thanks
-- PMM

Re: [PATCH v5 0/2] MTE support for KVM guest

2020-11-19 Thread Peter Maydell

On Thu, 19 Nov 2020 at 15:57, Steven Price  wrote:
> On 19/11/2020 15:45, Peter Maydell wrote:
> > I'm a bit dubious about requring the VMM to map the guest memory
> > PROT_MTE unless somebody's done at least a sketch of the design
> > for how this would work on the QEMU side. Currently QEMU just
> > assumes the guest memory is guest memory and it can access it
> > without special precautions...
>
> I agree this needs some investigation - I'm hoping Haibo will be able to
> provide some feedback here as he has been looking at the QEMU support.
> However the VMM is likely going to require some significant changes to
> ensure that migration doesn't break, so either way there's work to be done.
>
> Fundamentally most memory will need a mapping with PROT_MTE just so the
> VMM can get at the tags for migration purposes, so QEMU is going to have
> to learn how to treat guest memory specially if it wants to be able to
> enable MTE for both itself and the guest.

If the only reason the VMM needs tag access is for migration it
feels like there must be a nicer way to do it than by
requiring it to map the whole of the guest address space twice
(once for normal use and once to get the tags)...

Anyway, maybe "must map PROT_MTE" is workable, but it seems
a bit premature to fix the kernel ABI as working that way
until we are at least reasonably sure that it is the right
design.

thanks
-- PMM

Re: [PATCH v5 0/2] MTE support for KVM guest

2020-11-19 Thread Peter Maydell

On Thu, 19 Nov 2020 at 15:39, Steven Price  wrote:
> This series adds support for Arm's Memory Tagging Extension (MTE) to
> KVM, allowing KVM guests to make use of it. This builds on the existing
> user space support already in v5.10-rc1, see [1] for an overview.

> The change to require the VMM to map all guest memory PROT_MTE is
> significant as it means that the VMM has to deal with the MTE tags even
> if it doesn't care about them (e.g. for virtual devices or if the VMM
> doesn't support migration). Also unfortunately because the VMM can
> change the memory layout at any time the check for PROT_MTE/VM_MTE has
> to be done very late (at the point of faulting pages into stage 2).

I'm a bit dubious about requring the VMM to map the guest memory
PROT_MTE unless somebody's done at least a sketch of the design
for how this would work on the QEMU side. Currently QEMU just
assumes the guest memory is guest memory and it can access it
without special precautions...

thanks
-- PMM

Re: [PATCH 4/4] arm64: traps: fix -Woverride-init warnings

2020-10-26 Thread Peter Maydell

On Mon, 26 Oct 2020 at 16:23, Mark Rutland  wrote:
>
> Hi Arnd,
>
> On Mon, Oct 26, 2020 at 05:03:31PM +0100, Arnd Bergmann wrote:
> > From: Arnd Bergmann 
> >
> > There are many warnings in this file when we re-enable the
> > Woverride-init flag:
> >
> > arch/arm64/kernel/traps.c:704:26: warning: initialized field overwritten 
> > [-Woverride-init]
> >   704 |  [ESR_ELx_EC_UNKNOWN]  = "Unknown/Uncategorized",
> >   |  ^~~
> > arch/arm64/kernel/traps.c:704:26: note: (near initialization for 
> > 'esr_class_str[0]')
> > arch/arm64/kernel/traps.c:705:22: warning: initialized field overwritten 
> > [-Woverride-init]
> >   705 |  [ESR_ELx_EC_WFx]  = "WFI/WFE",
> >   |  ^
> >
> > This is harmless since they are only informational strings,
> > but it's easy to change the code to ignore missing initialization
> > and instead warn about possible duplicate initializers.
>
> This has come up before, and IMO the warning is more hindrance than
> helpful, given the prevalance of spurious warnings, and the (again IMO)
> the rework needed to avoid those making the code harder to reason about

FWIW in QEMU we turn the clang version of this off with
-Wno-initializer-overrides because we agree that the code is
fine and the compiler is being unhelpful in this case. (There's
a reason gcc doesn't put it in -Wall.)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91688 is a request
for something that would catch bugs without breaking ranged-array
initializer syntax usage, but the gcc devs don't seem to have
responded.

thanks
-- PMM

Re: [PATCH v2 2/2] arm64: kvm: Introduce MTE VCPU feature

2020-09-09 Thread Peter Maydell

On Wed, 9 Sep 2020 at 16:48, Andrew Jones  wrote:
> We either need a KVM cap or a new CPU feature probing interface to avoid
> making userspace try features one at a time. It's too bad that VCPU_INIT
> doesn't clear all offending features from the feature set when returning
> EINVAL, because then userspace could create a scratch VCPU with everything
> it supports in order to see what KVM also supports in one go.

You could add one if you wanted -- add a new feature bit
TELL_ME_WHAT_YOU_HAVE. If the kernel sees that then on filure
it clears out feature bits it doesn't support and also clears
TELL_ME_WHAT_YOU_HAVE. If QEMU sees EINVAL and TELL_ME_WHAT_YOU_HAVE
is still set, then it knows it's dealing with an old kernel
and has to do one-at-a-time probing. If it sees EINVAL but not
TELL_ME_WHAT_YOU_HAVE then it knows it has a new kernel and
has just got all the info.

-- PMM

Re: [PATCH 1/6] syscalls: use uaccess_kernel in addr_limit_user_check

2020-07-20 Thread Peter Maydell

On Mon, 20 Jul 2020 at 15:55, Guenter Roeck  wrote:
> Ah, sorry, you can't use the upstream version of qemu to test mps2-an385
> Linux images. You'll have to use a version from 
> https://github.com/groeck/qemu.
> I'd recommend to use the v5.0.0-local branch.
>
> I had to make some changes to qemu to be able to boot mps2-an385.
> I tried to submit those changes into upstream qemu, but that was
> rejected because, as I was told, the qemu implementation
> would no longer reflect the real hardware with those changes in
> place.

Yes; the rationale is that if you wanted to boot a kernel
on an actual MPS2 board you'd need a bit of guest code to
start it up (and to bundle the initrd/dtb in with it), so
since you need to write that code anyway you could use it for
booting the kernel in QEMU too.

I appreciate that this is awkward for kernel developers (and
perhaps for some other users too), but QEMU's handling of
-kernel and built-in-bootloader code is already a morass of
special cases and do-what-I-mean behaviour that I'm not
enthusiastic about further complicating :-)

(https://lists.gnu.org/archive/html/qemu-arm/2018-06/msg00393.html
has the archive of our original discussion on the point, for
other readers of this post interested in further context and
discussion.)

thanks
-- PMM

Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-24 Thread Peter Maydell

On Wed, 24 Jun 2020 at 12:18, Steven Price  wrote:
> Ah yes, similar to (1) but much lower overhead ;) That's probably the
> best option - it can be hidden in a memcpy_ignoring_tags() function.
> However it still means that the VMM can't directly touch the guest's
> memory which might cause issues for the VMM.

That's kind of awkward, since in general QEMU assumes it can
naturally just access guest RAM[*] (eg emulation of DMAing devices,
virtio, graphics display, gdb stub memory accesses). It would be
nicer to be able to do it the other way around, maybe, so that the
current APIs give you the "just the memory" and if you really want
to do tagged accesses to guest ram you can do it with tag-specific
APIs. I haven't thought about this very much though and haven't
read enough of the MTE spec recently enough to make much
sensible comment. So mostly what I'm trying to encourage here
is that the people implementing the KVM/kernel side of this API
also think about the userspace side of it, so we get one coherent
design rather than a half-a-product that turns out to be awkward
to use :-)

[*] "guest ram is encrypted" also breaks this assumption, of course;
I haven't looked at the efforts in that direction that are already in
QEMU to see how they work, though.

thanks
-- PMM

Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-23 Thread Peter Maydell

On Wed, 17 Jun 2020 at 13:39, Steven Price  wrote:
>
> These patches add support to KVM to enable MTE within a guest. It is
> based on Catalin's v4 MTE user space series[1].
>
> [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
>
> Posting as an RFC as I'd like feedback on the approach taken.

What's your plan for handling tags across VM migration?
Will the kernel expose the tag ram to userspace so we
can copy it from the source machine to the destination
at the same time as we copy the actual ram contents ?

thanks
-- PMM

Re: [PATCH 1/1] KVM: inject data abort if instruction cannot be decoded

2019-09-06 Thread Peter Maydell

On Fri, 6 Sep 2019 at 14:13, Christoffer Dall  wrote:
> I'd prefer leaving it to userspace to worry about, but I thought Peter
> said that had been problematic historically, which I took at face value,
> but I could have misunderstood.
>
> If QEMU, kvmtool, and whatever the crazy^H cool kids are using in
> userspace these days are happy emulating the exception, then that's a
> viable approach.  The main concern I have with that is whether they'll
> all get it right, and since we already have the code in the kernel to do
> this, it might make sense to re-use the kernel logic for it.

Well, for QEMU we've had code that in theory might do this but
in practice it's never been tested. Essentially the problem is
that nobody ever wants to inject an exception from userspace
except in incredibly rare cases like "trying to use h/w breakpoints
simultaneously inside the VM and also to debug the VM from outside"
or "we're running on RAS hardware that just told us that the VM's
RAM was faulty". There's no even vaguely commonly-used usecase
for it today; and this ABI suggestion adds another "this is in
practice almost never going to happen" case to the pile. The
codepath is unreliable in QEMU because (a) it requires getting
syncing of sysreg values to and from the kernel right -- this is
about the only situation where userspace wants to modify sysregs
during execution of the VM, as opposed to just reading them -- and
(b) we try to reuse the code we already have that does TCG exception
injection, which might or might not be a design mistake, and
(c) as noted above it's a never-actually-used untested codepath...

So I think if I were you I wouldn't commit to the kernel ABI until
somebody had at least written some RFC-quality patches for QEMU and
tested that they work and the ABI is OK in that sense. (For the
avoidance of doubt, I'm not volunteering to do that myself.)
I don't object to the idea in principle, though.

PS: the other "injecting exceptions to the guest" oddity is that
if the kernel *does* find the ISV information and returns to userspace
for it to handle the MMIO, there's no way for userspace to say
"actually that address is supposed to generate a data abort".

thanks
-- PMM

Re: [Qemu-devel] Running linux on qemu omap

2019-05-27 Thread Peter Maydell

On Mon, 27 May 2019 at 16:56, Guenter Roeck  wrote:
> I'd be happy to use a different (supported) branch, but the Linaro branch
> was the only one I could find that supports those boards. Unfortunately,
> qemu changed so much since 2.3 that it is all but impossible to merge
> the code into mainline qemu without spending a lot of effort on it.

Even back at 2.3 it wasn't possible to merge the code into mainline
without spending a lot of effort -- that's why it was not merged :-)

thanks
-- PMM

Re: [Qemu-devel] Running linux on qemu omap

2019-05-24 Thread Peter Maydell

On Thu, 23 May 2019 at 19:36, Aaro Koskinen  wrote:
> Cheetah works with serial console. I tried with console on display,
> and it seems to boot up, and the frame buffer window gets correctly
> sized but for some reason it just stays blank.

As a general question, when you're doing these tests are you
using a kernel image that is known to work on real hardware?
One problem we have with some of these older QEMU platforms
is that it turns out that QEMU is only tested with the kernel,
and the kernel support for the platform is only tested with
QEMU, and so you get equal and opposite bugs in QEMU and the
kernel that cancel each other out and don't get noticed...

(On the QEMU side these platforms are all basically orphaned:
if somebody submits patches to fix bugs we'll review them,
but they're unlikely to get a great deal of attention otherwise.
They're also quite near the top of the "maybe we'll just
deprecate and then delete these" list, since we have not
historically had any working guest images to test against.
If there's a real userbase that wants them to continue to
exist that's a different matter, of course.)

thanks
- PMM

Re: rseq/arm32: choosing rseq code signature

2019-04-15 Thread Peter Maydell

On Mon, 15 Apr 2019 at 14:11, Mathieu Desnoyers
 wrote:
>
> - On Apr 11, 2019, at 3:55 PM, peter maydell peter.mayd...@linaro.org 
> wrote:
>
> > On Thu, 11 Apr 2019 at 18:51, Mathieu Desnoyers
> >  wrote:
> >>  * This translates to the following instruction pattern in the T16 
> >> instruction
> >>  * set:
> >>  *
> >>  * little endian:
> >>  * def3udf#243  ; 0xf3
> >>  * e7f5b.n<7f5>
> >>  *
> >>  * big endian:
> >>  * e7f5b.n<7f5>
> >>  * def3udf#243  ; 0xf3
> >
> > Do we really care about big-endian instruction-ordering for Thumb?
> > It requires (AIUI) either an ARMv7R CPU which implements and sets
> > SCTLR.IE to 1, or a v6-or-earlier CPU using BE32, and it's going to
> > be even rarer than normal BE8 big-endian...
>
> I don't think we care enough about it to look for a trick to
> turn the branch into something else (which would not branch away from the
> udf instruction), but considering this signature will be ABI, it's good to
> be thorough documentation-wise and cover all existing cases.

I think if you want to document it it would be helpful to
readers to make it clear that this is the ultra-rare
big-endian-instruction-order "big endian Thumb", not the only
moderately-rare little-endian-instructions-big-endian-data
"big endian Thumb".

thanks
-- PMM

Re: rseq/arm32: choosing rseq code signature

2019-04-11 Thread Peter Maydell

On Thu, 11 Apr 2019 at 18:51, Mathieu Desnoyers
 wrote:
> - On Apr 11, 2019, at 12:42 PM, Will Deacon will.dea...@arm.com wrote:
> > Peter suggests that anything of the form 0xe7fxdefx should trap in both A32
> > and T32, although it does assemble to UDF; B  in T16. I'm not sure we
> > should get too obsessed with trying to encode a signature that universally
> > decodes to a trap.
>
> That's a nice trick.
>
> >
> > Whatever you choose, it would be worth checking that it doesn't clash with
> > other allocations such as software breakpoints in GDB.
>
> GDB seems to have [1] :
>
> #define ARM_LE_BREAKPOINT {0xFE,0xDE,0xFF,0xE7}
> #define ARM_BE_BREAKPOINT {0xE7,0xFF,0xDE,0xFE}
> #define THUMB_LE_BREAKPOINT {0xbe,0xbe}
> #define THUMB_BE_BREAKPOINT {0xbe,0xbe}
>
> None of which match the value you hint at.

Hmm? The ARM BPs match 0xe7fxdefx when considered with
the appropriate endianness (clearly somebody has
been down this line of thought before). Still, as long as
we pick different values for the 8 bits of freedom we
have it should be fine.

> /*
>  * RSEQ_SIG uses the udf A32 instruction with an uncommon immediate operand
>  * value 0x5de3. This traps if user-space reaches this instruction by mistake,
>  * and the uncommon operand ensures the kernel does not move the instruction
>  * pointer to attacker-controlled code on rseq abort.
>  *
>  * The instruction pattern in the A32 instruction set is:
>  *
>  * e7f5def3udf#24035; 0x5de3
>  *
>  * This translates to the following instruction pattern in the T16 instruction
>  * set:
>  *
>  * little endian:
>  * def3udf#243  ; 0xf3
>  * e7f5b.n<7f5>
>  *
>  * big endian:
>  * e7f5b.n<7f5>
>  * def3udf#243  ; 0xf3

Do we really care about big-endian instruction-ordering for Thumb?
It requires (AIUI) either an ARMv7R CPU which implements and sets
SCTLR.IE to 1, or a v6-or-earlier CPU using BE32, and it's going to
be even rarer than normal BE8 big-endian...

thanks
-- PMM

Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space

2019-01-10 Thread Peter Maydell

On Thu, 10 Jan 2019 at 12:09, gengdongjiu  wrote:
> Peter, I summarize James's main idea, James think QEMU does not needs
> to check *something* if Qemu support firmware-first.
> What do we do for your comments?

Unless I'm missing something, the code in your most recent patchset
attempts to update an ACPI table when it gets the SIGBUS from the
host kernel without doing anything to check whether it has ever
created the ACPI table (and set up the QEMU global variable that
tells the code where it is in the guest memory) in the first place.
I don't see how that can work.

> >> I think one question here which it would be good to answer is:
> >> if we are modelling a guest and we haven't specifically provided
> >> it an ACPI table to tell it about memory errors, what do we do
> >> when we get a sigbus from the host? We have basically two choices:
> >>  (1) send the guest an SError (aka asynchronous external abort)
> >>  anyway (with no further info about what the memory error is)
> >
> > For an AR signal an external abort is valid. Its up to the implementation
> > whether these are synchronous or asynchronous. Qemu can only take a signal 
> > for
> > something that was synchronous, so you can choose between the two.
> > Synchronous external abort is marginally better as an unaware OS knows its
> > affects this thread, and may be able to kill it.
> > SError with an imp-def ESR is indistinguishable from 'part of the soc fell 
> > out',
> > and should always result in a panic().
> >
> >
> >>  (2) just stop QEMU (as we would for a memory error in QEMU's
> >>  own memory)
> >
> > This is also valid. A machine may take external-abort to EL3 and then
> > reboot/crash/burn.

We should decide which of these we want to do, and have a comment
explaining what we're doing. If I'm reading your current patchset
correctly, it does neither -- if it can't record the fault in the
ACPI table it just ignores it without either stopping QEMU or
delivering an SError.

I think I favour option (2) here.

thanks
-- PMM

Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

2018-12-30 Thread Peter Maydell

On Sat, 29 Dec 2018 at 16:49, Andy Lutomirski  wrote:
> > Could you use a prctl to set whether you were running in 32 or 64 bit
> > mode?  Or do you change which kind of task you're emulating too often
> > to make this a good idea?

QEMU's linux-user mode always only runs the single process,
which is a fixed guest architecture. But it also wants to
make system calls on its own behalf, as well as the ones it
is passing through from the guest, and I suspect it would
confuse the host libc if we changed the semantics of those
under its feet.

> How would this work?  We already have the separate
> COMPAT_DEFINE_SYSCALL entries *and* in_compat_syscall(). Now we’d have
> a third degree of freedom.
>
> Either the arches people care about should add reasonable ways to
> issue 32-bit syscalls from 64-bit mode or there should be an explicit
> way to ask for the 32-bit directory offsets.

The first of those is not sufficient for QEMU if done
as a per-architecture thing, because there may not even be
a 32-bit syscall interface on the host kernel. The second
sounds better -- there's nothing conceptually architecture
specific about what we want to do or which is tied to the
idea of whether there's a 32-bit compat mode in the host
architecture or not.

thanks
-- PMM

Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

2018-12-28 Thread Peter Maydell

On Fri, 28 Dec 2018 at 23:16, Andreas Dilger  wrot
> On Dec 28, 2018, at 4:18 AM, Peter Maydell  wrote:
> > The problem is that there is no 32-bit API in some cases
> > (unless I have misunderstood the kernel code) -- not all
> > host architectures implement compat syscalls or allow them
> > to be called from 64-bit processes or implement all the older
> > syscall variants that had smaller offets. If there was a guaranteed
> > "this syscall always exists and always gives me 32-bit offsets"
> > we could use it.
>
> The "32bitapi" mount option would use 32-bit hash for seekdir
> and telldir, regardless of what kernel API was used.  That would
> just set the FMODE_32BITHASH flag in the file->f_mode for all files.

A mount option wouldn't be much use to QEMU -- we can't tell
our users how to mount their filesystems, which they're
often doing lots of other things with besides running QEMU.
(Otherwise we could just tell them "don't use ext4", which
would also solve the problem :-)) We need something we can
use at the individual-syscall level.

thanks
-- PMM

Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

2018-12-28 Thread Peter Maydell

On Fri, 28 Dec 2018 at 00:23, Andreas Dilger  wrote:
> On Dec 27, 2018, at 10:41 AM, Peter Maydell  wrote:
> > As you note, this causes breakage for userspace programs which
> > need to implement an API/ABI with 32-bit offset but which only
> > have access to the kernel's 64-bit offset API/ABI.
>
> This is (IMHO) a bit of an oxymoron, isn't it?  Applications using
> the 64-bit API, but storing the value in a 32-bit field?

I didn't say "which choose to store the value in a 32-bit field",
I said "which have to implement an API/ABI which has 32-bit fields".
In QEMU's case, we use the host kernel's ABI, which has 64-bit
offset fields. We implement a syscall ABI for the guest binary
we are running under emulation, which may have 32-bit offset fields
(for instance if we are running a 32-bit Arm binary.) Both of
these ABIs are fixed -- QEMU doesn't have a choice here, it
just has to make the best effort it can with what the host kernel
provides it, to provide the semantics the guest binary needs.
My suggestion in this thread is that the host kernel provides
a wider range of facilities so that QEMU can do the job it's
trying to do.

>  The same
> problem would exist for filesystems with 64-bit inodes or 64-bit
> file offsets trying to store these values in 32-bit variables.
> It might work most of the time, but it can also break randomly.

In general inodes and offsets start from 0 and work up --
so almost all of the time they don't actually overflow.
The problem with ext4 directory hash "offsets" is that they
overflow all the time and immediately, so instead of "works
unless you have a weird edge case" like all the other filesystems,
it's "never works".

> > I think the best fix for this would be for the kernel to either
> > (a) consistently use a 32-bit hash or (b) to provide an API
> > so that userspace can use the FMODE_32BITHASH flag the way
> > that kernel-internal users already can.
>
> It would be relatively straight forward to add a "32bitapi" mount
> option to return a 32-bit directory hash to userspace for operations
> on that mountpoint (ext4 doesn't have 64-bit inode numbers yet).
> However, I can't think of an easy way to do this on a per-process
> basis without just having it call the 32-bit API directly.

The problem is that there is no 32-bit API in some cases
(unless I have misunderstood the kernel code) -- not all
host architectures implement compat syscalls or allow them
to be called from 64-bit processes or implement all the older
syscall variants that had smaller offets. If there was a guaranteed
"this syscall always exists and always gives me 32-bit offsets"
we could use it.

> For ext4 at least, you could just shift the high 32-bit part of
> the 64-bit hash down into a 32-bit value in telldir(), and
> shift it back up when seekdir() is called.

Yes, that has been suggested, but it seemed a bit dubious
to bake in knowledge of ext4's internal implementation details.
Can we rely on this as an ABI promise that will always work
for all versions of all file systems going forwards?

thanks
-- PMM

Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

2018-12-27 Thread Peter Maydell

On Thu, 27 Dec 2018 at 17:19, Florian Weimer  wrote:
> We have a bit of an interesting problem with respect to the d_off
> field in struct dirent.
>
> When running a 64-bit kernel on certain file systems, notably ext4,
> this field uses the full 63 bits even for small directories (strace -v
> output, wrapped here for readability):
>
> getdents(3, [
>   {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, 
> d_name="authorized_keys", d_type=DT_REG},
>   {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", 
> d_type=DT_DIR},
>   {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", 
> d_type=DT_DIR}
> ], 32768) = 88
>
> When running in 32-bit compat mode, this value is somehow truncated to
> 31 bits, for both the getdents and the getdents64 (!) system call (at
> least on i386).

Yes -- look for hash2pos() and friends in fs/ext4/dir.c.
The ext4 code in the kernel uses a 32 bit hash if (a) the kernel
is 32 bit (b) this is a compat syscall (b) some other bit of
the kernel asked it to via the FMODE_32BITHASH flag (currently only
NFS does that I think).

As you note, this causes breakage for userspace programs which
need to implement an API/ABI with 32-bit offset but which only
have access to the kernel's 64-bit offset API/ABI.

I think the best fix for this would be for the kernel to either
(a) consistently use a 32-bit hash or (b) to provide an API
so that userspace can use the FMODE_32BITHASH flag the way
that kernel-internal users already can.

I couldn't think of or find any existing way for userspace
to get the right results here, which is why
32-bit-guest-on-64-bit-host QEMU doesn't work on these filesystems
(depending on what exactly the guest's libc etc do).

> the 32-bit getdents system call emulation in a 64-bit qemu-user
> process would just silently truncate the d_off field as part of
> the translation, not reporting an error.
> [...]
> This truncation has always been a bug; it breaks telldir/seekdir
> at least in some cases.

Yes; you can't fit a quart into a pint pot, so if the guest
only handles 32-bit offsets then truncation is about all we
can do. This works fine if offsets are offsets, assuming the
directory isn't so enormous it would have broken the guest
anyway. I'm not aware of any issues with this other than the
oddball ext4 offsets-are-hashes situation -- could you expand
on the telldir/seekdir issue? (I suppose we should probably
make QEMU's syscall emulation layer return "no more entries"
rather than entries with truncated hashes.)

thanks
-- PMM

Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space

2018-12-19 Thread Peter Maydell

On Mon, 17 Dec 2018 at 15:56, James Morse  wrote:
> I think the root issue here is the name of the cpufeature 'RAS Extensions', 
> this
> doesn't mean RAS is new, or even requires these features. It's just 
> standardised
> records, classification and a barrier.
> Not only is it possible to build a platform that supports RAS without this
> extensions: there are at least three platforms out there that do!
>
>
> On 15/12/2018 00:12, gengdongjiu wrote:
> >> On Fri, 14 Dec 2018 at 13:56, James Morse  wrote:
> >>> On 14/12/2018 10:15, Dongjiu Geng wrote:
>  When user space do memory recovery, it will check whether KVM and
>  guest support the error recovery, only when both of them support,
>  user space will do the error recovery. This patch exports this
>  capability of KVM to user space.
> >>>
> >>> I can understand user-space only wanting to do the work if host and
> >>> guest support the feature. But 'error recovery' isn't a KVM feature,
> >>> its a Linux kernel feature.
>
> [...]
>
> > Thanks Peter's explanation. Frankly speaking, I agree Peter's suggestion.
> >
> > To James, I explain more to you, as peter said QEMU needs to check whether
> > the guest CPU is a type which can handle the error though guest ACPI table.
>
> I don't think this really matters. Its only the NMIlike notifications that the
> guest doesn't have to register or poll. The ones we support today extend the
> architectures existing behaviour: you would have taken an external-abort on a
> real system, whether you know about the additional metadata doesn't matter to 
> Qemu.

Consider the case where we booted the guest using a DTB and no ACPI
table at all -- we certainly can't just call QEMU code that tries to
add entries to a nonexistent table. My main point is that there
needs to be logic in Dongjiu's QEMU patches that checks more than
just "does this KVM feature exist". I'm not sufficiently familiar
with all this RAS stuff to be certain what those checks should
be and what the right choices are; I just know we need to check
*something*...

> > Let us see the X86's QEMU logic:
> > 1. Before the vCPU created, it will set a default env->mcg_cap value with
>
> > MCE_CAP_DEF flag, MCG_SER_P means it expected the guest CPU model supports
> > RAS error recovery.[1] 2. when the vCPU initialize, it will check whether 
> > host
> > kernel support this feature[2]. Only when host kernel and default 
> > env->mcg_cap
> > value all expected this feature, then it will setup vCPU support RAS error
> > recovery[3].
>
> This looks like KVM exposing a CPU capability to Qemu, which then configures 
> the
> behaviour KVM gives to the guest. This doesn't tell you anything about what 
> the
> guest supports.

It tells you what the *guest CPU* supports, which for x86 is a combination
of (a) what did the user/machine model ask for and (b) what can KVM
actually implement. I don't much care whether the guest OS supports
anything or not, that's its business... but it does seem odd to me
that the equivalent Arm code is not similarly saying "what were we
asked for, and what can we do?".

I think one question here which it would be good to answer is:
if we are modelling a guest and we haven't specifically provided
it an ACPI table to tell it about memory errors, what do we do
when we get a sigbus from the host? We have basically two choices:
 (1) send the guest an SError (aka asynchronous external abort)
 anyway (with no further info about what the memory error is)
 (2) just stop QEMU (as we would for a memory error in QEMU's
 own memory)

thanks
-- PMM

Re: [PATCH v6 04/13] arm64/kvm: hide ptrauth from guests

2018-12-19 Thread Peter Maydell

On Mon, 10 Dec 2018 at 20:22, Richard Henderson
 wrote:
>
> On 12/10/18 2:12 PM, Kristina Martsenko wrote:
> > The plan was to disable trapping, yes. However, after that thread there
> > was a retrospective change applied to the architecture, such that the
> > XPACLRI (and XPACD/XPACI) instructions are no longer trapped by
> > HCR_EL2.API. (The public documentation on this has not been updated
> > yet.) This means that no HINT-space instructions should trap anymore.
>
> Ah, thanks for the update.  I'll update my QEMU patch set.

Just to follow up on this loose end, this change to HCR_EL2.API
trap behaviour is documented in the 00bet9 release of the system
register XML which came out today:
https://developer.arm.com/products/architecture/cpu-architecture/a-profile/exploration-tools/system-registers-for-armv8-a

thanks
-- PMM

Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space

2018-12-14 Thread Peter Maydell

On Fri, 14 Dec 2018 at 13:56, James Morse  wrote:
>
> Hi Dongjiu Geng,
>
> On 14/12/2018 10:15, Dongjiu Geng wrote:
> > When user space do memory recovery, it will check whether KVM and
> > guest support the error recovery, only when both of them support,
> > user space will do the error recovery. This patch exports this
> > capability of KVM to user space.
>
> I can understand user-space only wanting to do the work if host and guest
> support the feature. But 'error recovery' isn't a KVM feature, its a Linux
> kernel feature.
>
> KVM will send it's user-space a SIGBUS with MCEERR code whenever its trying to
> map a page at stage2 that the kernel-mm code refuses this because its 
> poisoned.
> (e.g. check_user_page_hwpoison(), get_user_pages() returns -EHWPOISON)
>
> This is exactly the same as happens to a normal user-space process.
>
> I think you really want to know if the host kernel was built with
> CONFIG_MEMORY_FAILURE.

Does userspace need to care about that? Presumably if the host kernel
wasn't built with that support then it will simply never deliver
any memory failure events to QEMU, which is fine.

The point I was trying to make in the email Dongjiu references
(https://patchwork.codeaurora.org/patch/652261/) is simply that
"QEMU gets memory-failure notifications from the host kernel"
does not imply "the guest is prepared to receive memory
failure notifications", and so the code path which handles
the SIGBUS must do some kind of check for whether the guest
CPU is a type which expects them and that the board code
set up the ACPI tables that it wants to fill in.

thanks
-- PMM

Re: [PATCH v5 02/12] KVM: arm/arm64: Document KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION

2018-04-30 Thread Peter Maydell

On 30 April 2018 at 10:07, Eric Auger  wrote:
> We introduce a new KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attribute in
> KVM_DEV_ARM_VGIC_GRP_ADDR group.

Hi; one minor grammar nit in the docs below, otherwise
Reviewed-by: Peter Maydell 

> It allows userspace to provide the
> base address and size of a redistributor region
>
> Compared to KVM_VGIC_V3_ADDR_TYPE_REDIST, this new attribute allows
> to declare several separate redistributor regions.
>
> So the whole redist space does not need to be contiguous anymore.
>
> Signed-off-by: Eric Auger 
>
> ---
> v4 -> v5:
> - Document read access
> - removed Peter's R-b
>
> v3 -> v4:
> - Added Peter's R-b
> ---
>  Documentation/virtual/kvm/devices/arm-vgic-v3.txt | 30 
> +--
>  1 file changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt 
> b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> index 9293b45..dcb7504 100644
> --- a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> +++ b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> @@ -27,16 +27,42 @@ Groups:
>VCPU and all of the redistributor pages are contiguous.
>Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
>This address needs to be 64K aligned.
> +
> +KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION (rw, 64-bit)
> +  The attribute data pointed by kvm_device_attr.addr is a __u64 value:

"pointed to by"

> +  bits: | 63     52  |  51      16 | 15 - 12  |11 - 0
> +  values:   | count  |   base  |  flags   | index
> +  - index encodes the unique redistributor region index
> +  - flags: reserved for future use, currently 0
> +  - base field encodes bits [51:16] of the guest physical base address
> +of the first redistributor in the region.
 > +  - count encodes the number of redistributors in the region. Must be
> +greater than 0.
> +  There are two 64K pages for each redistributor in the region and
> +  redistributors are laid out contiguously within the region. Regions
> +  are filled with redistributors in the index order. The sum of all
> +  region count fields must be greater than or equal to the number of
> +  VCPUs. Redistributor regions must be registered in the incremental
> +  index order, starting from index 0.
> +  The characteristics of a specific redistributor region can be read
> +  by presetting the index field in the attr data.
> +  Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
> +
> +  It is invalid to mix calls with KVM_VGIC_V3_ADDR_TYPE_REDIST and
> +  KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attributes.
> +
>Errors:
>  -E2BIG:  Address outside of addressable IPA range
> --EINVAL: Incorrectly aligned address
> +-EINVAL: Incorrectly aligned address, bad redistributor region
> + count/index, mixed redistributor region attribute usage
>  -EEXIST: Address already configured
> +-ENOENT: Attempt to read the characteristics of a non existing
> + redistributor region
>  -ENXIO:  The group or attribute is unknown/unsupported for this device
>   or hardware support is missing.
>  -EFAULT: Invalid user pointer for attr->addr.
>
>
> -
>KVM_DEV_ARM_VGIC_GRP_DIST_REGS
>KVM_DEV_ARM_VGIC_GRP_REDIST_REGS
>Attributes:
> --
> 2.5.5

thanks
-- PMM

Re: [PATCH v3 02/12] KVM: arm/arm64: Document KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION

2018-04-24 Thread Peter Maydell

On 24 April 2018 at 17:46, Christoffer Dall  wrote:
> On Fri, Apr 13, 2018 at 10:20:48AM +0200, Eric Auger wrote:
>> --- a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
>> +++ b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
>> @@ -27,9 +27,32 @@ Groups:
>>VCPU and all of the redistributor pages are contiguous.
>>Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
>>This address needs to be 64K aligned.
>> +
>> +KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION (rw, 64-bit)
>> +  The attr field of kvm_device_attr encodes 3 values:
>> +  bits: | 63     52  |  51      16 | 15 - 12  |11 - 0
>> +  values:   | count  |   base  |  flags   | index
>> +  - index encodes the unique redistributor region index
>
> I'm not entirely sure I understand the purpose of the index field.
> Isn't a redistributor region identified uniquely by its base address?

You need a way to tell the difference beween:
 (1) redistributors for CPUs 0..63 at 0x4000, redistributors
 for 64..127 at 0x8000
 (2) redistributors for CPUs 0..63 at 0x8000, redistributors
 for 64..127 at 0x4000

The index field tells you which order the redistributor
regions go in.

thanks
-- PMM

Re: [PATCH v2 12/17] kvm: arm/arm64: Expose supported physical address limit for VM

2018-04-13 Thread Peter Maydell

On 27 March 2018 at 14:15, Suzuki K Poulose  wrote:
> Expose the maximum physical address size supported by the host
> for a VM. This could be later used by the userspace to choose the
> appropriate size for a given VM. The limit is determined as the
> minimum of actual CPU limit, the kernel limit (i.e, either 48 or 52)
> and the stage2 page table support limit (which is 40bits at the moment).
> For backward compatibility, we support a minimum of 40bits. The limit
> will be lifted as we add support for the stage2 to support the host
> kernel PA limit.
>
> This value may be different from what is exposed to the VM via
> CPU ID registers. The limit only applies to the stage2 page table.
>
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> Cc: Peter Maydel 
> Signed-off-by: Suzuki K Poulose 
> ---
>  Documentation/virtual/kvm/api.txt | 14 ++
>  arch/arm/include/asm/kvm_mmu.h|  5 +
>  arch/arm64/include/asm/kvm_mmu.h  |  5 +
>  include/uapi/linux/kvm.h  |  6 ++
>  virt/kvm/arm/arm.c|  6 ++
>  5 files changed, 36 insertions(+)
>
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 792fa87..55908a8 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -3500,6 +3500,20 @@ Returns: 0 on success; -1 on error
>  This ioctl can be used to unregister the guest memory region registered
>  with KVM_MEMORY_ENCRYPT_REG_REGION ioctl above.
>
> +4.113 KVM_ARM_GET_MAX_VM_PHYS_SHIFT
> +Capability: basic
> +Architectures: arm, arm64
> +Type: system ioctl
> +Parameters: none
> +Returns: log2(Maximum physical address space size) supported by the
> +hyperviosr.

typo: "hypervisor".

> +
> +This ioctl can be used to identify the maximum physical address space size
> +supported by the hypervisor.

Is that the physical address space on the host, or the physical
address space size we present to the guest?

> The returned value indicates the maximum size
> +of the address that can be resolved by the stage2 translation table on
> +arm/arm64. On arm64, the value is decided based on the host kernel
> +configuration and the system wide safe value of ID_AA64MMFR0_EL1:PARange.
> +This may not match the value exposed to the VM in CPU ID registers.

Isn't it likely to confuse the guest if we lie to it about the PA range it
sees? When would the two values differ?

Do we also need a 'set' operation, so userspace can create a VM
that has a 40 bit userspace on a CPU that supports more than that,
or does it just work?

What's the x86 API for KVM to tell userspace about physical address
range restrictions?

thanks
-- PMM

Re: [PATCH v3 02/12] KVM: arm/arm64: Document KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION

2018-04-13 Thread Peter Maydell

On 13 April 2018 at 09:20, Eric Auger  wrote:
> We introduce a new KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attribute in
> KVM_DEV_ARM_VGIC_GRP_ADDR group. It allows userspace to provide the
> base address and size of a redistributor region
>
> Compared to KVM_VGIC_V3_ADDR_TYPE_REDIST, this new attribute allows
> to declare several separate redistributor regions.
>
> So the whole redist space does not need to be contiguous anymore.
>
> Signed-off-by: Eric Auger 
> ---

Reviewed-by: Peter Maydell 

thanks
-- PMM

Re: [RFC v2 11/12] KVM: arm/arm64: Implement KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION

2018-03-28 Thread Peter Maydell

On 27 March 2018 at 15:04, Eric Auger  wrote:
> Now all the internals are ready to handle multiple redistributor
> regions, let's allow the userspace to register them.
>
> Signed-off-by: Eric Auger 
> ---
>  virt/kvm/arm/vgic/vgic-kvm-device.c | 40 
> +++--
>  virt/kvm/arm/vgic/vgic-mmio-v3.c|  4 ++--
>  virt/kvm/arm/vgic/vgic.h|  9 -
>  3 files changed, 48 insertions(+), 5 deletions(-)
>
> diff --git a/virt/kvm/arm/vgic/vgic-kvm-device.c 
> b/virt/kvm/arm/vgic/vgic-kvm-device.c
> index e7b5a86..a2b99e4 100644
> --- a/virt/kvm/arm/vgic/vgic-kvm-device.c
> +++ b/virt/kvm/arm/vgic/vgic-kvm-device.c
> @@ -65,7 +65,8 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 
> *addr, bool write)
>  {
> int r = 0;
> struct vgic_dist *vgic = &kvm->arch.vgic;
> -   phys_addr_t *addr_ptr, alignment;
> +   phys_addr_t *addr_ptr = NULL;
> +   phys_addr_t alignment;
> uint64_t undef_value = VGIC_ADDR_UNDEF;
>
> mutex_lock(&kvm->lock);
> @@ -92,7 +93,7 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 
> *addr, bool write)
> if (r)
> break;
> if (write) {
> -   r = vgic_v3_set_redist_base(kvm, *addr);
> +   r = vgic_v3_set_redist_base(kvm, 0, *addr, 0);
> goto out;
> }
> rdreg = list_first_entry(&vgic->rd_regions,
> @@ -103,6 +104,40 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, 
> u64 *addr, bool write)
> addr_ptr = &rdreg->base;
> break;
> }
> +   case KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION:
> +   {
> +   struct vgic_redist_region *rdreg;
> +   uint8_t index;
> +
> +   r = vgic_check_type(kvm, KVM_DEV_TYPE_ARM_VGIC_V3);
> +   if (r)
> +   break;
> +
> +   index = *addr & KVM_VGIC_V3_RDIST_INDEX_MASK;
> +
> +   if (write) {
> +   gpa_t base = *addr & KVM_VGIC_V3_RDIST_BASE_MASK;
> +   uint32_t count = (*addr & 
> KVM_VGIC_V3_RDIST_COUNT_MASK)
> +   >> KVM_VGIC_V3_RDIST_COUNT_SHIFT;
> +   uint8_t flags = (*addr & KVM_VGIC_V3_RDIST_FLAGS_MASK)
> +   >> KVM_VGIC_V3_RDIST_FLAGS_SHIFT;
> +
> +   if (!count || flags)
> +   r = -EINVAL;
> +   else
> +   r = vgic_v3_set_redist_base(kvm, index,
> +   base, count);
> +   goto out;
> +   }
> +
> +   rdreg = vgic_v3_rdist_region_from_index(kvm, index);
> +   if (!rdreg)
> +   r = -ENODEV;

Here you check whether rdreg is NULL...

> +
> +   *addr_ptr = rdreg->base & index &

...but here you dereference it anyway. Missing "goto out"/break/other
control flow?

> +   (uint64_t)rdreg->count << 
> KVM_VGIC_V3_RDIST_COUNT_SHIFT;
> +   break;
> +   }

I was looking for the code which checked "EINVAL if you
already used KVM_VGIC_V3_ADDR_TYPE_REDIST", but couldn't
see it. Is that handled by one of the helper functions?

> default:
> r = -ENODEV;
> }
> @@ -674,6 +709,7 @@ static int vgic_v3_has_attr(struct kvm_device *dev,
> switch (attr->attr) {
> case KVM_VGIC_V3_ADDR_TYPE_DIST:
> case KVM_VGIC_V3_ADDR_TYPE_REDIST:
> +   case KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION:
> return 0;
> }
> break;
> diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c 
> b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> index 89faadc..45287a0 100644
> --- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> @@ -766,11 +766,11 @@ static int vgic_v3_insert_redist_region(struct kvm 
> *kvm, uint32_t index,
> return ret;
>  }
>
> -int vgic_v3_set_redist_base(struct kvm *kvm, u64 addr)
> +int vgic_v3_set_redist_base(struct kvm *kvm, u32 index, u64 addr, u32 count)
>  {
> int ret;
>
> -   ret = vgic_v3_insert_redist_region(kvm, 0, addr, 0);
> +   ret = vgic_v3_insert_redist_region(kvm, index, addr, count);
> if (ret)
> return ret;

thanks
-- PMM

Re: [RFC 02/12] KVM: arm/arm64: Document KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION

2018-03-23 Thread Peter Maydell

On 19 March 2018 at 09:20, Eric Auger  wrote:
> We introduce a new KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attribute in
> KVM_DEV_ARM_VGIC_GRP_ADDR group. It allows userspace to provide the
> base address and size of a redistributor region
>
> Compared to KVM_VGIC_V3_ADDR_TYPE_REDIST, this new attribute allows
> to declare several separate redistributor regions.
>
> So the whole redist space does not need to be contiguous anymore.
>
> Signed-off-by: Eric Auger 
> ---
>  Documentation/virtual/kvm/devices/arm-vgic-v3.txt | 12 
>  1 file changed, 12 insertions(+)
>
> diff --git a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt 
> b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> index 9293b45..2c0bedf 100644
> --- a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> +++ b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
> @@ -27,6 +27,18 @@ Groups:
>VCPU and all of the redistributor pages are contiguous.
>Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
>This address needs to be 64K aligned.
> +
> +KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION (rw, 64-bit)
> +  The attr field of kvm_device_attr encodes 3 values:
> +  bits: | 63     52  |  51      12 |11 - 0
> +  values:   | pfns   |   base  | index
> +  - index encodes the unique redistibutor region index

"redistributor"

> +  - base field encodes bits [51:12] the guest physical base address

"of the guest"

> +of the first redistributor in the region. There are two 64K pages
> +for each VCPU and all of the redistributor pages are contiguous
> +within the redistributor region.
> +  - pfns encodes the size of the region in 64kB pages.
> +  Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.

You should say something here about what happens if userspace
tries to use both KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION and
KVM_VGIC_V3_ADDR_TYPE_REDIST. I think this should be an error
(reported by whichever of the two you try to set second).

>Errors:
>  -E2BIG:  Address outside of addressable IPA range
>  -EINVAL: Incorrectly aligned address

Marc wrote:
> Why does base have to include bits [15:12] of the IPA? If it is 64kB
> aligned (as it should), these bits are guaranteed to be 0. This also
> avoid having to return -EINVAL in case of alignment problems.

If you're not using the bits for anything else you want to
check they're 0 anyway. Otherwise we can't guarantee to safely
use them for something else in future, because userspace might
be handing us garbage in those bits without noticing.

thanks
-- PMM

Re: [REPOST PATCH] arm/arm64: KVM: Add PSCI version selection API

2018-03-15 Thread Peter Maydell

On 15 March 2018 at 19:00, Marc Zyngier  wrote:
> On 06/03/18 09:21, Andrew Jones wrote:
>> On Mon, Mar 05, 2018 at 04:47:55PM +0000, Peter Maydell wrote:
>>> On 2 March 2018 at 11:11, Marc Zyngier  wrote:
>>>> On Fri, 02 Mar 2018 10:44:48 +,
>>>> Auger Eric wrote:
>>>>> I understand the get/set is called as part of the migration process.
>>>>> So my understanding is the benefit of this series is migration fails in
>>>>> those cases:
>>>>>
>>>>>> =0.2 source -> 0.1 destination
>>>>> 0.1 source -> >=0.2 destination
>>>>
>>>> It also fails in the case where you migrate a 1.0 guest to something
>>>> that cannot support it.
>>>
>>> I think it would be useful if we could write out the various
>>> combinations of source, destination and what we expect/want to
>>> have happen. My gut feeling here is that we're sacrificing
>>> exact migration compatibility in favour of having the guest
>>> automatically get the variant-2 mitigations, but it's not clear
>>> to me exactly which migration combinations that's intended to
>>> happen for. Marc?
>>>
>>> If this wasn't a mitigation issue the desired behaviour would be
>>> straightforward:
>>>  * kernel should default to 0.2 on the basis that
>>>that's what it did before
>>>  * new QEMU version should enable 1.0 by default for virt-2.12
>>>and 0.2 for virt-2.11 and earlier
>>>  * PSCI version info shouldn't appear in migration stream unless
>>>it's something other than 0.2
>>> But that would leave some setups (which?) unnecessarily without the
>>> mitigation, so we're not doing that. The question is, exactly
>>> what *are* we aiming for?
>>
>> The reason Marc dropped this patch from the series it was first introduced
>> in was because we didn't have the aim 100% understood. We want the
>> mitigation by default, but also to have the least chance of migration
>> failure, and when we must fail (because we're not doing the
>> straightforward approach listed above, which would prevent failures), then
>> we want to fail with the least amount of damage to the user.
>>
>> I experimented with a couple different approaches and provided tables[1]
>> with my results. I even recommended an approach, but I may have changed
>> my mind after reading Marc's follow-up[2]. The thread continues from
>> there as well with follow-ups from Christoffer, Marc, and myself. Anyway,
>> Marc did this repost for us to debate it and work out the best approach
>> here.
> It doesn't look like we've made much progress on this, which makes me
> think that we probably don't need anything of the like.

I was waiting for a better explanation from you of what we're trying to
achieve. If you want to take the "do nothing" approach then a list
also of what migrations succeed/fail/break in that case would also
be useful.

(I am somewhat lazily trying to avoid having to spend time reverse
engineering the "what are we trying to do and what effects are
we accepting" parts from the patch and the code that's already gone
into the kernel.)

thanks
-- PMM

Re: [REPOST PATCH] arm/arm64: KVM: Add PSCI version selection API

2018-03-06 Thread Peter Maydell

On 6 March 2018 at 09:50, Marc Zyngier  wrote:
> On 05/03/18 20:37, Auger Eric wrote:
>> On 05/03/18 17:31, Peter Maydell wrote:
>>> That also means that we will fail migration from a new kernel where
>>> we've specifically asked for PSCI 0.2 to an old PSCI-0.2-only kernel
>>> (because the KVM_REG_ARM_PSCI_VERSION reg will appear in the migration
>>> stream even if its value is the one value that matches the old kernel
>>> behaviour). I don't know if we care about that.
>>
>> Do you know when are we likely to force PSCI 0.2 on a new kernel? At
>> which layer is the decision supposed to be made and on which criteria?
>
> No decent SW should need this. But if you've written a guest that cannot
> work if it doesn't get "2" as response to PSCI_VERSION, you can override it.

...but if you want to be able to migrate back from a new kernel to
an old one, then you need to ask the new kernel for 0.2 so it
behaves the same way as the old one. As it stands this code wouldn't
let you do that migration even if you did specifically ask for 0.2.
(As I said, I don't know if we care about that.)

thanks
-- PMM

Re: [REPOST PATCH] arm/arm64: KVM: Add PSCI version selection API

2018-03-05 Thread Peter Maydell

On 2 March 2018 at 11:11, Marc Zyngier  wrote:
> On Fri, 02 Mar 2018 10:44:48 +,
> Auger Eric wrote:
>> I understand the get/set is called as part of the migration process.
>> So my understanding is the benefit of this series is migration fails in
>> those cases:
>>
>> >=0.2 source -> 0.1 destination
>> 0.1 source -> >=0.2 destination
>
> It also fails in the case where you migrate a 1.0 guest to something
> that cannot support it.

I think it would be useful if we could write out the various
combinations of source, destination and what we expect/want to
have happen. My gut feeling here is that we're sacrificing
exact migration compatibility in favour of having the guest
automatically get the variant-2 mitigations, but it's not clear
to me exactly which migration combinations that's intended to
happen for. Marc?

If this wasn't a mitigation issue the desired behaviour would be
straightforward:
 * kernel should default to 0.2 on the basis that
   that's what it did before
 * new QEMU version should enable 1.0 by default for virt-2.12
   and 0.2 for virt-2.11 and earlier
 * PSCI version info shouldn't appear in migration stream unless
   it's something other than 0.2
But that would leave some setups (which?) unnecessarily without the
mitigation, so we're not doing that. The question is, exactly
what *are* we aiming for?

thanks
-- PMM

Re: [REPOST PATCH] arm/arm64: KVM: Add PSCI version selection API

2018-03-05 Thread Peter Maydell

On 2 March 2018 at 12:26, Auger Eric  wrote:
> Hi Marc,
> On 02/03/18 12:11, Marc Zyngier wrote:
>> On Fri, 02 Mar 2018 10:44:48 +,
>> Auger Eric wrote:
>>> I understand the get/set is called as part of the migration process.
>>> So my understanding is the benefit of this series is migration fails in
>>> those cases:
>>>
 =0.2 source -> 0.1 destination
>>> 0.1 source -> >=0.2 destination
>>
>> It also fails in the case where you migrate a 1.0 guest to something
>> that cannot support it.
>
> That's because on the destination, the number of regs is less than on
> source, right?

I think it fails because the KVM_REG_ARM_PSCI_VERSION register will be
in the migration state but not in the destination's list of registers:
the code in QEMU's target/arm/machine.c:cpu_post_load() function that
checks "register in their list but not ours: fail migration" will
catch this.

That also means that we will fail migration from a new kernel where
we've specifically asked for PSCI 0.2 to an old PSCI-0.2-only kernel
(because the KVM_REG_ARM_PSCI_VERSION reg will appear in the migration
stream even if its value is the one value that matches the old kernel
behaviour). I don't know if we care about that.

thanks
-- PMM

Re: [PATCH v1 01/16] virtio: Validate queue pfn for 32bit transports

2018-01-12 Thread Peter Maydell

On 10 January 2018 at 11:25, Jean-Philippe Brucker
 wrote:
> Hi Peter,
>
> On 10/01/18 11:19, Peter Maydell wrote:
>> Are there uses that make it worthwhile to get virtio-1
>> support added to virtio-mmio, rather than just getting
>> people to move over to virtio-pci instead ?
>
> virtio-iommu uses virtio-mmio transport. It makes little sense to have an
> IOMMU presented as a PCI endpoint.

Having an entire transport just for the IOMMU doesn't make
a great deal of sense either though :-) If we didn't already
have virtio-mmio kicking around would we really have designed
it that way?

thanks
-- PMM

Re: [PATCH v1 01/16] virtio: Validate queue pfn for 32bit transports

2018-01-10 Thread Peter Maydell

On 10 January 2018 at 11:06, Michael S. Tsirkin  wrote:
> For virtio-mmio? I don't seem to see that code in
> hw/virtio/virtio-mmio.c
> For example I still see handling for VIRTIO_MMIO_QUEUE_PFN
> there, and no handling for VIRTIO_MMIO_QUEUE_DESC_LOW
> and such.

Are there uses that make it worthwhile to get virtio-1
support added to virtio-mmio, rather than just getting
people to move over to virtio-pci instead ?

thanks
-- PMM

Re: [PATCH 7/9] KVM: arm/arm64: vgic-its: free caches when GITS_BASER Valid bit is cleared

2017-10-16 Thread Peter Maydell

On 16 October 2017 at 10:26, Christoffer Dall  wrote:
> Hi Eric,
>
> On Mon, Sep 25, 2017 at 03:34:36PM +0200, Eric Auger wrote:
>> When the GITS_BASER.Valid gets cleared, the data structures in
>> guest RAM are not provisionned anymore. The device, collection
>> and LPI lists stored in the in-kernel ITS represent the same
>> information in some form of cache. So let's void the cache.
>
> Just a thought.  What about the opposite case, if the BASERs were
> previously not valid, and then become valid, is the ITS expected restore
> the state from memory?

Architecturally speaking, it's a cache. As soon as the guest
turns the GITS_CTLR.Enabled bit on, the ITS is permitted to
start reading from the tables. It's an implementation choice
whether it wants to preload a bunch of stuff, only load it
up when it becomes necessary or even leave it all in memory
and cache nothing.

Eric wrote:
> Also the spec does not mandate clearing the cache when BASER
> moves to invalid (which this patch does), although this would
> have made sense to me.

The spec says that messing with GITS_BASER while GITS_CTRL.Enabled
is set or GITS_CTLR.Quiescent is 0 is UNPREDICTABLE. And it
says that once the OS has done the Enabled/Quiescent handshake
then the ITS must have written out any dirty data to memory
and stopped doing anything. So the effect is that BASER
can't ever move to invalid while the caches are dirty.

thanks
-- PMM

Re: [PATCH v2 09/10] KVM: arm/arm64: Document KVM_DEV_ARM_ITS_CTRL_RESET

2017-10-12 Thread Peter Maydell

On 27 September 2017 at 14:28, Eric Auger  wrote:
> At the moment, the in-kernel emulated ITS is not properly reset.
> On guest restart/reset some registers keep their old values and
> internal structures like device, ITE, collection lists are not freed.
>
> This may lead to various bugs. Among them, we can have incorrect state
> backup or failure when saving the ITS state at early guest boot stage.
>
> This patch documents a new attribute, KVM_DEV_ARM_ITS_CTRL_RESET in
> the KVM_DEV_ARM_VGIC_GRP_CTRL group.
>
> Upon this action, we can reset registers and especially those
> pointing to tables previously allocated by the guest and free
> the internal data structures storing the list of devices, collections
> and lpis.

I think we could probably expand this to explain why our usual
approach to reset doesn't work, something like:

"The usual approach for device reset of having userspace write
the reset values of the registers to the kernel via the register
read/write APIs doesn't work for the ITS because it has some
internal state (caches) which is not exposed as registers,
and there is no register interface for "drop cached data without
writing it back to RAM". So we need a KVM API which mimics the
hardware's reset line, to provide the equivalent behaviour to
a "pull the power cord out of the back of the machine" reset."

> Signed-off-by: Eric Auger 
> Reported-by: wanghaibin 
>
> ---
>
> v1 -> v2:
> - Describe architecturally-defined reset values
> ---
>  Documentation/virtual/kvm/devices/arm-vgic-its.txt | 16 
>  1 file changed, 16 insertions(+)
>
> diff --git a/Documentation/virtual/kvm/devices/arm-vgic-its.txt 
> b/Documentation/virtual/kvm/devices/arm-vgic-its.txt
> index eb06beb..047358c 100644
> --- a/Documentation/virtual/kvm/devices/arm-vgic-its.txt
> +++ b/Documentation/virtual/kvm/devices/arm-vgic-its.txt
> @@ -33,6 +33,10 @@ Groups:
>request the initialization of the ITS, no additional parameter in
>kvm_device_attr.addr.
>
> +KVM_DEV_ARM_ITS_CTRL_RESET
> +  reset the ITS, no additional parameter in kvm_device_attr.addr.
> +  See "ITS Reset State" section.

Why is the init attribute for the ITS named
KVM_DEV_ARM_VGIC_CTRL_INIT, but the reset attribute is
KVM_DEV_ARM_ITS_CTRL_RESET ? Seems a bit inconsistent.

> +
>  KVM_DEV_ARM_ITS_SAVE_TABLES
>save the ITS table data into guest RAM, at the location provisioned
>by the guest in corresponding registers/table entries.
> @@ -157,3 +161,15 @@ Then vcpus can be started.
>   - pINTID is the physical LPI ID; if zero, it means the entry is not valid
> and other fields are not meaningful.
>   - ICID is the collection ID
> +
> + ITS Reset State:
> + 
> +
> +- the ITS is not enabled and quiescent:
> +  GITS_CTLR.Enabled = 0 .Quiescent=1
> +- caches are empty
> +- No collection or device table is provisionned

"provisioned"

> +  GITS_BASER.Valid = 0
> +- the command queue is not allocated:
> +  GITS_CBASER = 0, GITS_CREADR = 0, GITS_CWRITER = 0
> +- The ABI version corresponds to the one set before reset

Rather than trying to spell out all the reset values here,
it might be better to just say that RESET returns the ITS to
the same state that it is in after it is first created and
inited. (That is the behaviour that userspace wants.)

Happy with the general principle of the new userspace ABI.

thanks
-- PMM

Re: [PATCH v11 6/6] target-arm: kvm64: Handle SError interrupt for the guest OS

2017-09-13 Thread Peter Maydell

On 13 September 2017 at 08:52, gengdongjiu  wrote:
>
>
> On 2017/9/12 0:39, Peter Maydell wrote:
>>>>> +return  kvm_vcpu_ioctl(CPU(cpu), KVM_ARM_SEI, &syndrome);
>>>> This looks odd. If we don't have the RAS extension why do we need to do 
>>>> anything at all here ?
>>> This is because Qemu may need to support non-RAS extension as discussed 
>>> with ARM James before.
>>> That is to say host hardware CPU does not support RAS, but guest supports.
>>> That is under discussion.
>>> When host hardware supports RAS, specify the syndrome to a valid value, 
>>> otherwise, set it to 0.
>> If the guest CPU doesn't support the RAS extension then we have
>> no mechanism for delivering it a notification about the
>> memory problem at all, so setting the syndrome to anything
>> doesn't make sense.
>>
>> I'm not sure what you should do in the case of "host
>> supports telling us about a memory problem and has
>> done so, but guest does not support being told about it",
>> but I'm pretty sure it shouldn't be this.

>in short, if the hardware CPU does not support RAS extension, do you think 
> whether the Qemu or guest OS
> needs to support RAS(generate APEI table / record CPER / Error recovery).

This question seems to be not really related to the review
comment that it is responding to.

(1) If the host does not support notifying us about
errors, then there is clearly nothing to do in this
code, because we will never get a notification.

(2) If the host does support notifying us about errors,
but we choose not to expose RAS to the guest, then
there's not much to do either. We probably just want
to take whatever the default behaviour is for any
application when it touches memory that's bad.
We definitely don't want to tell the guest anything.

(3) If the host supports notification, and we choose
to expose RAS to the guest, then we need to do
whatever we have to do to notify the guest.

If we're in this signal handler and also
arm_feature(env, ARM_FEATURE_RAS) is false then that
is case (2), and my point is that doing anything with
the guest 'syndrome' value looks like the wrong thing.

thanks
-- PMM

Re: [PATCH v11 6/6] target-arm: kvm64: Handle SError interrupt for the guest OS

2017-09-11 Thread Peter Maydell

On 11 September 2017 at 16:17, gengdongjiu  wrote:
>> On 18 August 2017 at 15:23, Dongjiu Geng  wrote:
>> > +static int kvm_inject_arm_sei(CPUState *cs) {
>> > +ARMCPU *cpu = ARM_CPU(cs);
>> > +CPUARMState *env = &cpu->env;
>> > +
>> > +unsigned long syndrome = env->exception.vaddress;
>> > +/* set virtual SError syndrome */
>> > +if (arm_feature(env, ARM_FEATURE_RAS_EXTENSION)) {
>> > +syndrome = syndrome & ARM_EL_ISS_MASK;
>> > +} else {
>> > +syndrome = 0;
>> > +}
>> > +
>> > +return  kvm_vcpu_ioctl(CPU(cpu), KVM_ARM_SEI, &syndrome);
>>
>> This looks odd. If we don't have the RAS extension why do we need to do 
>> anything at all here ?
>
> This is because Qemu may need to support non-RAS extension as discussed with 
> ARM James before.
> That is to say host hardware CPU does not support RAS, but guest supports.
> That is under discussion.
> When host hardware supports RAS, specify the syndrome to a valid value, 
> otherwise, set it to 0.

If the guest CPU doesn't support the RAS extension then we have
no mechanism for delivering it a notification about the
memory problem at all, so setting the syndrome to anything
doesn't make sense.

I'm not sure what you should do in the case of "host
supports telling us about a memory problem and has
done so, but guest does not support being told about it",
but I'm pretty sure it shouldn't be this.

thanks
-- PMM

Re: [PATCH v11 5/6] target-arm: kvm64: handle SIGBUS signal for synchronous External Abort

2017-09-08 Thread Peter Maydell

On 8 September 2017 at 17:17, gengdongjiu  wrote:
>>
>> This code has all just been copied-and-pasted from target/i386/kvm.c.
>> Please instead abstract it out properly into a cpu-independent source file.
>
>
> Yes, it copied from x86.
> Do you mean abstracting this code to a common folder so that i386 and arm 
> platform share it?

I mean it should go into a common source file (perhaps
accel/kvm/kvm-all.c).


>> What is this ??? You should never need to look up things in the cpreg arrays 
>> by fieldoffset.
>
>
> I used it to set the esr_el1's and far_el1's value, for example:
>
>   /* set ESR_EL1 */
>   ret = kvm_arm_cpreg_value(cpu, offsetof(CPUARMState, cp15.esr_el[1]));
>
>   /* set FAR_EL1 */
>   ret = kvm_arm_cpreg_value(cpu, offsetof(CPUARMState, cp15.far_el[1]));

Yes, I saw that, but I have no idea why you think that's
the right way to set register values. No other code in
QEMU does this.

> So use the added API kvm_arm_cpreg_value() to set their value.
> If not used it, do you better method to set their value?

I suggest:

>> The code for handling debug exits (software step, watchpoint,
>> etc) is probably a good place to look for how to deal with register state.

>> Any new globally-visible function prototype in a header should
>> have a doc-comment formatted documentation comment, please.
>
>
> Ok, thanks for this reminder. Do you mean I need to add comments
> for this globally-visible function, such as below:
>
> /*
>  * xx
>  */
> void kvm_hwpoison_page_add(ram_addr_t ram_addr);

It should be in the doc-comment format, which begins
"/**" and has some stylization of how you list parameters
and so on. Lots of examples in the existing headers.

thanks
-- PMM

Re: [PATCH v11 4/6] target-arm: kvm64: detect guest RAS EXTENSION feature

2017-09-08 Thread Peter Maydell

On 8 September 2017 at 15:26, gengdongjiu  wrote:

>> Shouldn't we need to also tell the kernel that we actually want
>> it to expose RAS to the guest? Compare the PMU code in this function,
>> where we set a kvm_init_features bit to do this.

> In the PMU code, it indeed sets a kvm_init_features bit. Here ARM
> James has a concern that we are depend on the host CPU RAS extension,
> He means that if userspace receives the SIGBUS delivered by host
> memory_failure(), user space should record the CPER for guest
> and handling the error regardless whether host CPU supports RAS
> extension. But I think if user space receives the SIGBUS signal,
> that means
> host CPU RAS module detects the error or CPU consumes the poison
> data, thus we should check whether physical CPU support RAS extension.

I don't understand what you have in mind here. If the host does
not support the CPU RAS extension then we should never get a
SIGBUS in the first place.

In any case this doesn't seem relevant to the question of whether it
should be optional to expose the RAS extension to the *guest*.
Even if the host does support RAS, you should be able to run a
VM that knows nothing about RAS.

thanks
-- PMM

Re: [PATCH v6 0/7] Add RAS virtualization support for SEA/SEI notification type in KVM

2017-09-06 Thread Peter Maydell

On 28 August 2017 at 11:38, Dongjiu Geng  wrote:
> In the firmware-first RAS solution, corrupt data is detected in a
> memory location when guest OS application software executing at EL0
> or guest OS kernel El1 software are reading from the memory. The
> memory node records errors in an error record accessible using
> system registers.

Hi Dongjiu -- it looks like this patch set is extending
the API KVM provides to userspace, but it doesn't update
the documentation in Documentation/virtual/kvm/api.txt.
I appreciate the API is still somewhat under discussion,
but if you can include the docs updates it's helpful to
me for reviewing whether the API makes sense from the
userspace consumer end of it.

thanks
-- PMM

Re: [PATCH v11 6/6] target-arm: kvm64: Handle SError interrupt for the guest OS

2017-09-05 Thread Peter Maydell

On 18 August 2017 at 15:23, Dongjiu Geng  wrote:
> When guest OS happens SError interrupt(SEI), it will trap to host.
> Host firstly calls memory failure to deal with this error and decide
> whether it needs to deliver SIGBUS signal to userspace. The advantage
> that using signal to notify is that it can make the notification method
> is general, non-KVM user can also use it. when userspace gets this
> signal and knows this is SError interrupt, it will translate the
> delivered host VA to PA and record this PA to GHES.
>
> Because ARMv8.2 adds an extension to RAS to allow system software insert
> implicit Error Synchronization Barrier operations to isolate the error and
> allow passes specified syndrome to guest OS, so after record the CPER, user
> space calls IOCTL to pass a specified syndrome to KVM, then switch to guest
> OS, guest OS can use the recorded CPER record and syndrome information to
> do the recovery.
>
> The steps are shown below:
> 1. translate the host VA to guest OS PA and record this error PA to HEST 
> table.
> 2. set specified virtual SError syndrome and pass the value to KVM.
>
> Signed-off-by: Dongjiu Geng 
> Signed-off-by: Quanming Wu 
> ---
>  linux-headers/linux/kvm.h |  1 +
>  target/arm/internals.h|  1 +
>  target/arm/kvm64.c| 28 
>  3 files changed, 30 insertions(+)
>
> diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
> index 2aa176e..10dfcab 100644
> --- a/linux-headers/linux/kvm.h
> +++ b/linux-headers/linux/kvm.h
> @@ -1356,6 +1356,7 @@ struct kvm_s390_ucas_mapping {
>  /* Available with KVM_CAP_S390_CMMA_MIGRATION */
>  #define KVM_S390_GET_CMMA_BITS  _IOWR(KVMIO, 0xb8, struct 
> kvm_s390_cmma_log)
>  #define KVM_S390_SET_CMMA_BITS  _IOW(KVMIO, 0xb9, struct 
> kvm_s390_cmma_log)
> +#define KVM_ARM_SEI _IO(KVMIO,   0xb10)
>
>  #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1 << 0)
>  #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
> diff --git a/target/arm/internals.h b/target/arm/internals.h
> index fc0ad6d..18b1cbc 100644
> --- a/target/arm/internals.h
> +++ b/target/arm/internals.h
> @@ -237,6 +237,7 @@ enum arm_exception_class {
>  #define ARM_EL_ISV (1 << ARM_EL_ISV_SHIFT)
>  #define ARM_EL_EC_MASK  ((0x3F) << ARM_EL_EC_SHIFT)
>  #define ARM_EL_FSC_TYPE (0x3C)
> +#define ARM_EL_ISS_MASK ((1 << ARM_EL_IL_SHIFT) - 1)
>
>  #define FSC_SEA (0x10)
>  #define FSC_SEA_TTW0(0x14)
> diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
> index d3bdab2..b84cb49 100644
> --- a/target/arm/kvm64.c
> +++ b/target/arm/kvm64.c
> @@ -616,6 +616,22 @@ static int kvm_arm_cpreg_value(ARMCPU *cpu, ptrdiff_t 
> fieldoffset)
>  return -EINVAL;
>  }
>
> +static int kvm_inject_arm_sei(CPUState *cs)
> +{
> +ARMCPU *cpu = ARM_CPU(cs);
> +CPUARMState *env = &cpu->env;
> +
> +unsigned long syndrome = env->exception.vaddress;
> +/* set virtual SError syndrome */
> +if (arm_feature(env, ARM_FEATURE_RAS_EXTENSION)) {
> +syndrome = syndrome & ARM_EL_ISS_MASK;
> +} else {
> +syndrome = 0;
> +}
> +
> +return  kvm_vcpu_ioctl(CPU(cpu), KVM_ARM_SEI, &syndrome);

This looks odd. If we don't have the RAS extension why do
we need to do anything at all here ?

> +}
> +
>  /* Inject synchronous external abort */
>  static int kvm_inject_arm_sea(CPUState *c)
>  {
> @@ -1007,6 +1023,15 @@ static bool is_abort_sea(unsigned long syndrome)
>  }
>  }
>
> +static bool is_abort_sei(unsigned long syndrome)
> +{
> +uint8_t ec = ((syndrome & ARM_EL_EC_MASK) >> ARM_EL_EC_SHIFT);

You don't need to bother masking here -- in other places in
QEMU we assume that the EC field is at the top of the word,
so just "syndrome >> ARM_EL_EC_SHIFT" is sufficient.

> +if ((ec != EC_SERROR))
> +return false;
> +else
> +return true;

scripts/checkpatch.pl should tell you that this if needs braces
(it's good to get in the habit of running it on all patches; it
is not always correct, so judgement is required, but it will flag
up some common mistakes).

In this particular case, you should just
   return ec == EC_SERROR;
though.

> +}
> +
>  void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>  {
>  ram_addr_t ram_addr;
> @@ -1024,6 +1049,9 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, 
> void *addr)
>  if (is_abort_sea(env->exception.syndrome)) {
>  ghes_update_guest(ACPI_HEST_NOTIFY_SEA, paddr);
>  kvm_inject_arm_sea(c);
> +} else if (is_abort_sei(env->exception.syndrome)) {
> +ghes_update_guest(ACPI_HEST_NOTIFY_SEI, paddr);
> +kvm_inject_arm_sei(c);
>  }
>  return;
>  }
> --
> 1.8.3.1

thanks
-- PMM

Re: [PATCH v11 5/6] target-arm: kvm64: handle SIGBUS signal for synchronous External Abort

2017-09-05 Thread Peter Maydell

On 18 August 2017 at 15:23, Dongjiu Geng  wrote:
> Add SIGBUS signal handler. In this handler, it checks
> the exception type, translates the host VA which is
> delivered by host or KVM to guest PA, then fills this
> PA to CPER, finally injects a Error to guest OS through
> KVM.
>
> Add synchronous external abort injection logic, setup
> spsr_elx, esr_elx, PSTATE, far_elx, elr_elx etc, when
> switch to guest OS, it will jump to the synchronous
> external abort vector table entry.
>
> Signed-off-by: Dongjiu Geng 
> Signed-off-by: Quanming Wu 
> ---
>  include/sysemu/kvm.h  |   2 +-
>  linux-headers/asm-arm64/kvm.h |   5 ++
>  target/arm/internals.h|  13 
>  target/arm/kvm.c  |  34 ++
>  target/arm/kvm64.c| 150 
> ++
>  target/arm/kvm_arm.h  |   1 +
>  6 files changed, 204 insertions(+), 1 deletion(-)

Have you tested whether this patchset builds OK on aarch32 ?

> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index 3a458f5..90c1605 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -361,7 +361,7 @@ bool kvm_vcpu_id_is_valid(int vcpu_id);
>  /* Returns VCPU ID to be used on KVM_CREATE_VCPU ioctl() */
>  unsigned long kvm_arch_vcpu_id(CPUState *cpu);
>
> -#ifdef TARGET_I386
> +#if defined(TARGET_I386) || defined(TARGET_AARCH64)
>  #define KVM_HAVE_MCE_INJECTION 1
>  void kvm_arch_on_sigbus_vcpu(CPUState *cpu, int code, void *addr);
>  #endif
> diff --git a/linux-headers/asm-arm64/kvm.h b/linux-headers/asm-arm64/kvm.h
> index d254700..5909c30 100644
> --- a/linux-headers/asm-arm64/kvm.h
> +++ b/linux-headers/asm-arm64/kvm.h
> @@ -181,6 +181,11 @@ struct kvm_arch_memory_slot {
>  #define KVM_REG_ARM64_SYSREG_OP2_MASK  0x0007
>  #define KVM_REG_ARM64_SYSREG_OP2_SHIFT 0
>
> +/* AArch64 fault registers */
> +#define KVM_REG_ARM64_FAULT (0x0014 << KVM_REG_ARM_COPROC_SHIFT)
> +#define KVM_REG_ARM64_FAULT_ESR_EC  (0)
> +#define KVM_REG_ARM64_FAULT_FAR (1)
> +
>  #define ARM64_SYS_REG_SHIFT_MASK(x,n) \
> (((x) << KVM_REG_ARM64_SYSREG_ ## n ## _SHIFT) & \
> KVM_REG_ARM64_SYSREG_ ## n ## _MASK)

Again, linux-headers changes need to go in their own header sync patch.

> diff --git a/target/arm/internals.h b/target/arm/internals.h
> index 1f6efef..fc0ad6d 100644
> --- a/target/arm/internals.h
> +++ b/target/arm/internals.h
> @@ -235,6 +235,19 @@ enum arm_exception_class {
>  #define ARM_EL_ISV_SHIFT 24
>  #define ARM_EL_IL (1 << ARM_EL_IL_SHIFT)
>  #define ARM_EL_ISV (1 << ARM_EL_ISV_SHIFT)
> +#define ARM_EL_EC_MASK  ((0x3F) << ARM_EL_EC_SHIFT)
> +#define ARM_EL_FSC_TYPE (0x3C)
> +
> +#define FSC_SEA (0x10)
> +#define FSC_SEA_TTW0(0x14)
> +#define FSC_SEA_TTW1(0x15)
> +#define FSC_SEA_TTW2(0x16)
> +#define FSC_SEA_TTW3(0x17)
> +#define FSC_SECC(0x18)
> +#define FSC_SECC_TTW0   (0x1c)
> +#define FSC_SECC_TTW1   (0x1d)
> +#define FSC_SECC_TTW2   (0x1e)
> +#define FSC_SECC_TTW3   (0x1f)
>
>  /* Utility functions for constructing various kinds of syndrome value.
>   * Note that in general we follow the AArch64 syndrome values; in a
> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> index 7c17f0d..2e1716a 100644
> --- a/target/arm/kvm.c
> +++ b/target/arm/kvm.c
> @@ -129,6 +129,39 @@ void kvm_arm_destroy_scratch_host_vcpu(int *fdarray)
>  }
>  }
>
> +typedef struct HWPoisonPage {
> +ram_addr_t ram_addr;
> +QLIST_ENTRY(HWPoisonPage) list;
> +} HWPoisonPage;
> +
> +static QLIST_HEAD(, HWPoisonPage) hwpoison_page_list =
> +QLIST_HEAD_INITIALIZER(hwpoison_page_list);
> +
> +static void kvm_unpoison_all(void *param)
> +{
> +HWPoisonPage *page, *next_page;
> +
> +QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
> +QLIST_REMOVE(page, list);
> +qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
> +g_free(page);
> +}
> +}
> +
> +void kvm_hwpoison_page_add(ram_addr_t ram_addr)
> +{
> +HWPoisonPage *page;
> +
> +QLIST_FOREACH(page, &hwpoison_page_list, list) {
> +if (page->ram_addr == ram_addr) {
> +return;
> +}
> +}
> +page = g_new(HWPoisonPage, 1);
> +page->ram_addr = ram_addr;
> +QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
> +}

This code has all just been copied-and-pasted from target/i386/kvm.c.
Please instead abstract it out properly into a cpu-independent
source file.

>  static void kvm_arm_host_cpu_class_init(ObjectClass *oc, void *data)
>  {
>  ARMHostCPUClass *ahcc = ARM_HOST_CPU_CLASS(oc);
> @@ -182,6 +215,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>
>  cap_has_mp_state = kvm_check_extension(s, KVM_CAP_MP_STATE);
>
> +qemu_register_reset(kvm_unpoison_all, NULL);
>  type_register_static(&host_arm_cpu_type_info);
>
>  return 0;
> diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
> index 0781367..d3bdab2 100644
> --- a/target/arm/kvm64.c
> +++ b/target/arm/

Re: [PATCH v11 4/6] target-arm: kvm64: detect guest RAS EXTENSION feature

2017-09-05 Thread Peter Maydell

On 18 August 2017 at 15:23, Dongjiu Geng  wrote:
> check if kvm supports guest RAS EXTENSION. if so, set
> corresponding feature bit for vcpu.
>
> Signed-off-by: Dongjiu Geng 
> ---
>  linux-headers/linux/kvm.h | 1 +
>  target/arm/cpu.h  | 3 +++
>  target/arm/kvm64.c| 8 
>  3 files changed, 12 insertions(+)
>
> diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
> index 7971a4f..2aa176e 100644
> --- a/linux-headers/linux/kvm.h
> +++ b/linux-headers/linux/kvm.h
> @@ -929,6 +929,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_PPC_SMT_POSSIBLE 147
>  #define KVM_CAP_HYPERV_SYNIC2 148
>  #define KVM_CAP_HYPERV_VP_INDEX 149
> +#define KVM_CAP_ARM_RAS_EXTENSION 150
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>

Hi. Changes to linux-headers need to be done as a patch of their
own created using scripts/update-linux-headers.sh run against a
mainline kernel tree (and with a commit message that quotes the
kernel commit hash used). This ensures that we have a consistent
set of headers that don't diverge from the kernel copy.

> diff --git a/target/arm/cpu.h b/target/arm/cpu.h
> index b39d64a..6b0961b 100644
> --- a/target/arm/cpu.h
> +++ b/target/arm/cpu.h
> @@ -611,6 +611,8 @@ struct ARMCPU {
>
>  /* CPU has memory protection unit */
>  bool has_mpu;
> +/* CPU has ras extension unit */
> +bool has_ras_extension;
>  /* PMSAv7 MPU number of supported regions */
>  uint32_t pmsav7_dregion;
>
> @@ -1229,6 +1231,7 @@ enum arm_features {
>  ARM_FEATURE_THUMB_DSP, /* DSP insns supported in the Thumb encodings */
>  ARM_FEATURE_PMU, /* has PMU support */
>  ARM_FEATURE_VBAR, /* has cp15 VBAR */
> +ARM_FEATURE_RAS_EXTENSION, /*has RAS extension support */

Missing space after '/*' ?

>  };
>
>  static inline int arm_feature(CPUARMState *env, int feature)
> diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
> index a16abc8..0781367 100644
> --- a/target/arm/kvm64.c
> +++ b/target/arm/kvm64.c
> @@ -518,6 +518,14 @@ int kvm_arch_init_vcpu(CPUState *cs)
>  unset_feature(&env->features, ARM_FEATURE_PMU);
>  }
>
> +if (kvm_check_extension(cs->kvm_state, KVM_CAP_ARM_RAS_EXTENSION)) {
> +cpu->has_ras_extension = true;
> +set_feature(&env->features, ARM_FEATURE_RAS_EXTENSION);
> +} else {
> +cpu->has_ras_extension = false;
> +unset_feature(&env->features, ARM_FEATURE_RAS_EXTENSION);
> +}
> +

Shouldn't we need to also tell the kernel that we actually want
it to expose RAS to the guest? Compare the PMU code in this
function, where we set a kvm_init_features bit to do this.
(This suggests that your ABI for the kernel part of this feature
may not be correct?)

You should also not be calling set_feature() here -- if the
CPU features bit doesn't say "this CPU should have the RAS
extensions" we shouldn't create a CPU with them. Instead
you should set it in kvm_arm_get_host_cpu_features() (again,
compare the PMU code).

thanks
-- PMM

Re: [PATCH] fs: Preventing READ_IMPLIES_EXEC Propagation

2017-04-19 Thread Peter Maydell

On 19 April 2017 at 11:33, Catalin Marinas  wrote:
> On Tue, Apr 18, 2017 at 09:01:52PM +0100, Peter Maydell wrote:
>>
>> > That's affecting most architectures with a risk of ABI breakage. We
>> > could do it on arm64 only, though I'm not yet clear on the ABI
>> > implications (at a first look, there shouldn't be any).
>>
>> Is there a reason why it isn't just straightforwardly a bug
>> (which we could fix) to make READ_IMPLIES_EXEC propagate to
>> child processes?
>
> While I agree that it looks like a bug, if there are user programs
> relying on such bug we call it "ABI".

Can there be any? Such a program would behave differently
depending on how the program that spawned it happened to
have been compiled, and for instance could break when
the OS happened to have its init binary updated even if
the kernel didn't change.

>> Behaviour shouldn't be variable across architectures either, I would
>> hope.
>
> The behaviour has already been variable for a long time. Even on x86,
> AFAICT x86_32 differs from x86_64 in this respect.

That also sounds like a bug to me.

> Anyway, the patch should be posted to linux-arch for a cross-arch
> discussion.

Agreed -- there may be something I'm missing, since it looks
like this behaviour of inheriting READ_IMPLIES_EXEC has always
been there.

thanks
-- PMM

Re: [PATCH] fs: Preventing READ_IMPLIES_EXEC Propagation

2017-04-18 Thread Peter Maydell

On 18 April 2017 at 18:01, Catalin Marinas  wrote:
> On Thu, Apr 13, 2017 at 08:33:52PM +0800, dongbo (E) wrote:
>> From: Dong Bo 
>>
>> In load_elf_binary(), once the READ_IMPLIES_EXEC flag is set,
>> the flag is propagated to its child processes, even the elf
>> files are marked as not requiring executable stack. It may
>> cause superfluous operations on some arch, e.g.
>> __sync_icache_dcache on aarch64 due to a PROT_READ mmap is
>> also marked as PROT_EXEC.

> That's affecting most architectures with a risk of ABI breakage. We
> could do it on arm64 only, though I'm not yet clear on the ABI
> implications (at a first look, there shouldn't be any).

Is there a reason why it isn't just straightforwardly a bug
(which we could fix) to make READ_IMPLIES_EXEC propagate to
child processes? AFAICT this should be per-process: just because
init happens not to have been (re)compiled to permit non-executable
stacks doesn't mean every process on the system needs to have
an executable stack. Behaviour shouldn't be variable across
architectures either, I would hope.

thanks
-- PMM

Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-03-28 Thread Peter Maydell

On 28 March 2017 at 12:23, Christoffer Dall  wrote:
> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>> On the host, part of UEFI is involved to generate the CPER records.
>> In a guest?, I don't know.
>> Qemu could generate the records, or drive some other component to do it.
>
> I think I am beginning to understand this a bit.  Since the guet UEFI
> instance is specifically built for the machine it runs on, QEMU's virt
> machine in this case, they could simply agree (by some contract) to
> place the records at some specific location in memory, and if the guest
> kernel asks its guest UEFI for that location, things should just work by
> having logic in QEMU to process error reports and populate guest memory.

Is "write direct to guest memory" the best ABI here or would
it be preferable to use the fw_cfg interface for the guest UEFI
to retrieve the data items on demand?

Is there a pre-existing "this is how it works on x86" implementation?

thanks
-- PMM

Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-03-21 Thread Peter Maydell

On 21 March 2017 at 19:39, Christoffer Dall  wrote:
> My confusion here comes from not thinking about QEMU or KVM as firmware,
> but as the machine, so it would be sort of like the functionality is
> baked into hardware rather than firmware.

There is precedent for that kind of thing -- we implement PSCI
in KVM/QEMU for the guest, though in real hardware it would be
provided by firmware at EL3.

thanks
-- PMM

Re: [RFC v2 10/10] KVM: arm/arm64: Emulate the EL1 phys timer register access

2017-01-30 Thread Peter Maydell

On 30 January 2017 at 17:08, Jintack Lim  wrote:
> On Sun, Jan 29, 2017 at 10:44 AM, Marc Zyngier  wrote:
>> Shouldn't we take the ENABLE bit into account? The ARMv8 ARM version I
>> have at hand (version h) seems to indicate that we should, but we should
>> check with the latest and greatest...
>
> Thanks! I was not clear about this. I have ARM ARM version k, and it
> says that 'When the value of the ENABLE bit is 0, the ISTATUS field is
> UNKNOWN.' So I thought the istatus value doesn't matter if ENABLE is
> 0, and just set istatus bit regardless of ENABLE bit. If this is not
> what the manual meant, then I'm happy to fix this.

It looks like the spec has been relaxed between the doc version
that Marc was looking at and the current one. So it's OK for
an implementation to either (a) set ISTATUS to 0 if ENABLE
is 0, or (b) do what you've done and set ISTATUS according
to the timer comparison whether ENABLE is clear or not
(or even (c) set ISTATUS to a random value if ENABLE is clear,
and other less likely choices).
I think we should add a comment to note that it's architecturally
UNKNOWN and we've made a choice for our implementation convenience.

thanks
-- PMM

Re: [PATCH] irqchip/gic-v3: Configure all interrupts as non-secure Group-1

2016-05-10 Thread Peter Maydell

On 10 May 2016 at 11:21, Marc Zyngier  wrote:
> The GICv3 driver wrongly assumes that it runs on the non-secure
> side of a secure-enabled system, while it could be on a system
> with a single security state, or a GICv3 with GICD_CTLR.DS set.
>
> Either way, it is important to configure this properly, or
> interrupts will simply not be delivered on this HW.
>
> Cc: sta...@vger.kernel.org
> Reported-by: Peter Maydell 
> Signed-off-by: Marc Zyngier 
> ---
> It is getting a bit late for 4.6, so I plan to put this on top of
> the 4.7 stuff, and let it trickle down to the stable trees (hence
> the CC to stable).

Tested-by: Peter Maydell 

(using a QEMU with GICv3 emulation, which is how I found this bug
in the first place.)

thanks
-- PMM

Re: Crashes in arm qemu emulations due to 'cpufreq: governor: Replace timers with utilization ...'

2016-02-15 Thread Peter Maydell

On 15 February 2016 at 17:05, Guenter Roeck  wrote:
> I see crashes in various arm qemu tests due to 'cpufreq: governor: Replace
> timers with utilization update callbacks' with next-20160215. An example
> crash log and bisect results are attached below.
>
> Please let me know if there is anything I can do to help tracking down
> the problem.
>
> Thanks,
> Guenter
>
> ---
>
> Building arm:beagle:multi_v7_defconfig:omap3-beagle ... running . failed 
> (crashed)
> 
> qemu log:

You're using the QEMU beagle board emulation? Can I ask which
QEMU you're using (qemu-linaro?). If the OMAP3 emulation is still
actively useful to people I might have another stab at getting
it into upstream QEMU some day...

thanks
-- PMM

Re: [Qemu-devel] arm64 qemu tests failing in linux-next since 'arm64: kernel: enforce pmuserenr_el0 initialization and restore'

2016-01-07 Thread Peter Maydell

On 7 January 2016 at 17:10, Guenter Roeck  wrote:
> Strictly speaking you may be right (regression is a bit strong, though),
> but for my part I tend to be pragmatic.
>
> A warning message such as "Access to unimplemented register X" may be
> useful

You can get these from QEMU if you pass it "-d unimp", which logs
various kinds of things-not-yet-implemented, with a couple of caveats:
 * the warning is when we translate the code, not when we execute it
 * it won't warn for registers which we implement but not completely
   (eg only partial functionality or dummy reads-as-written)

In this case it printed
"write access to unsupported AArch64 system register op0:3 op1:3 crn:9
crm:14 op2:0"

The 'guest_errors' suboption to -d warns about things which appear
to be errors in the guest OS, for instance some kinds of UNPREDICTABLE,
and may also be of interest.

Neither guest_errors nor unimp are comprehensive (there are many
more situations where we don't warn than where we do) but they can
be helpful sometimes.

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Qemu-devel] arm64 qemu tests failing in linux-next since 'arm64: kernel: enforce pmuserenr_el0 initialization and restore'

2016-01-07 Thread Peter Maydell

On 24 December 2015 at 00:52, Guenter Roeck  wrote:
> Hi all,
>
> since commit 60792ad349f3 ("arm64: kernel: enforce pmuserenr_el0
> initialization
> and restore"), my arm64 qemu tests of linux-next are failing. After this
> commit,
> qemu does not display any output.
>
> Qemu version is 2.5.0. Linux kernel configuration is arm64:defconfig.
>
> qemu command line is as follows:
>
> qemu-system-aarch64 -machine virt -cpu cortex-a57 -machine type=virt
> -nographic -smp 1 \
> -m 512 -kernel arch/arm64/boot/Image -initrd
> rootfs.arm64.cpio -no-reboot \
> -append "console=ttyAMA0"
>
> Any idea what might cause this problem and how to fix it (presumably in
> qemu) ?

This turns out to be because QEMU doesn't currently implement
PMUSERENR_EL0 for AArch64 (we do have an AArch32 implementation),
so you get an immediate UNDEF when the kernel touches it, followed
by an infinite loop of UNDEF exceptions because the instruction
at the UNDEF vector entrypoint is unallocated at this point in
execution.

We had previously been relying on the kernel not attempting to
touch the PMU if the ID_AA64DFR0_EL1 PMUVer bits read 
("Performance Monitors extension System registers not implemented").

Since the v8 ARM ARM states that the Performance Monitors Extension is
an optional feature of an implementation, this seems like a kernel
bug to me. (QEMU should probably get round to implementing the PMU
at some point for feature parity with v7, but this has not been
a priority for us since they're not actually very useful in a
fully emulated setup.)

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Running arm:versatile_defconfig with qemu in linux-next

2015-12-24 Thread Peter Maydell

On 24 December 2015 at 19:51, Guenter Roeck  wrote:
> I can not get the available dtb files for realview (arm-realview-pb1176.dtb
> and arm-realview-pb11mp.dtb) to work with anything I tried.

This is because QEMU does not model either of these two boards.

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v3] Documentation: dt: Add bindings for Secure-only devices

2015-11-24 Thread Peter Maydell

The existing device tree bindings assume that we are only trying to
describe a single address space with a device tree (for ARM, either
the Normal or the Secure world). Some uses for device tree need to
describe both Normal and Secure worlds in a single device tree. Add
documentation of how to do this, by adding extra properties which
describe when a device appears differently in the two worlds or when
it only appears in one of them.

The binding describes the general principles for adding new
properties describing the secure world, but for now we only need a
single new property, "secure-status", which can be used to annotate
devices to indicate that they are only visible in one of the two
worlds.

The primary expected use of this binding is for a virtual machine
like QEMU to describe the VM layout to a TrustZone aware firmware
(which would then use the secure-only devices itself, and pass the DT
on to a kernel running in the non-secure world, which ignores the
secure-only devices and uses the rest).

Signed-off-by: Peter Maydell 
---
This binding doesn't affect the kernel itself, but the kernel
Documentation/ tree is the de-facto current place where all DT
bindings are documented, so Grant suggested this was the right
place to send a doc patch.

Changes v1->v2:
 * list all the status/secure-status combinations explicitly
 * use /* */ comment syntax, not //
Changes v2->v3:
 * use secure-foo rather than secure-reg as the example when
   explaining how the prefixing works
 * say how prefixing a vendor,foo property name works

 Documentation/devicetree/bindings/arm/secure.txt | 53 
 1 file changed, 53 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/secure.txt

diff --git a/Documentation/devicetree/bindings/arm/secure.txt 
b/Documentation/devicetree/bindings/arm/secure.txt
new file mode 100644
index 000..e31303f
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/secure.txt
@@ -0,0 +1,53 @@
+* ARM Secure world bindings
+
+ARM CPUs with TrustZone support have two distinct address spaces,
+"Normal" and "Secure". Most devicetree consumers (including the Linux
+kernel) are not TrustZone aware and run entirely in either the Normal
+world or the Secure world. However some devicetree consumers are
+TrustZone aware and need to be able to determine whether devices are
+visible only in the Secure address space, only in the Normal address
+space, or visible in both. (One example of that situation would be a
+virtual machine which boots Secure firmware and wants to tell the
+firmware about the layout of the machine via devicetree.)
+
+The general principle of the naming scheme for Secure world bindings
+is that any property that needs a different value in the Secure world
+can be supported by prefixing the property name with "secure-". So for
+instance "secure-foo" would override "foo". For property names with
+a vendor prefix, the Secure variant of "vendor,foo" would be
+"vendor,secure-foo". If there is no "secure-" property then the Secure
+world value is the same as specified for the Normal world by the
+non-prefixed property. However, only the properties listed below may
+validly have "secure-" versions; this list will be enlarged on a
+case-by-case basis.
+
+Defining the bindings in this way means that a device tree which has
+been annotated to indicate the presence of Secure-only devices can
+still be processed unmodified by existing Non-secure software (and in
+particular by the kernel).
+
+Note that it is still valid for bindings intended for purely Secure
+world consumers (like kernels that run entirely in Secure) to simply
+describe the view of Secure world using the standard bindings. These
+secure- bindings only need to be used where both the Secure and Normal
+world views need to be described in a single device tree.
+
+Valid Secure world properties:
+
+- secure-status : specifies whether the device is present and usable
+  in the secure world. The combination of this with "status" allows
+  the various possible combinations of device visibility to be
+  specified. If "secure-status" is not specified it defaults to the
+  same value as "status"; if "status" is not specified either then
+  both default to "okay". This means the following combinations are
+  possible:
+
+   /* Neither specified: default to visible in both S and NS */
+   secure-status = "okay";  /* visible in both */
+   status = "okay"; /* visible in both */
+   status = "okay"; secure-status = "okay"; /* visible in both */
+   secure-status = "disabled";  /* NS-only */
+   status = "okay"; secure-status = "disabled"; /* NS-only */
+   status = "disabled"; secur

Re: [RFC/PATCH 0/3] ARM: Use udiv/sdiv for __aeabi_{u}idiv library functions

2015-11-22 Thread Peter Maydell

On 21 November 2015 at 23:21, Arnd Bergmann  wrote:
> Regarding PJ4, it's still unclear whether that has the same
> problem and it only reports idivt when it actually supports idiva,
> or whether the lack of idiva support on PJ4 is instead the reason
> why the ARM ARM was updated to have separate flags.

SDIV/IDIV were originally introduced for R and M profile only
and there the Thumb encodings of SDIV/IDIV are mandatory
whereas the ARM ones are optional (and weren't initially
defined at all). So if you're looking for CPUs with only the
Thumb encodings I would try checking older R profile cores
like the Cortex-R4.

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2] Documentation: dt: Add bindings for Secure-only devices

2015-11-12 Thread Peter Maydell

On 12 November 2015 at 21:33, Rob Herring  wrote:
> On Thu, Nov 12, 2015 at 04:24:50PM +0000, Peter Maydell wrote:
>> The existing device tree bindings assume that we are only trying to
>> describe a single address space with a device tree (for ARM, either
>> the Normal or the Secure world). Some uses for device tree need to
>> describe both Normal and Secure worlds in a single device tree. Add
>> documentation of how to do this, by adding extra properties which
>> describe when a device appears differently in the two worlds or when
>> it only appears in one of them.
>>
>> The binding describes the general principles for adding new
>> properties describing the secure world, but for now we only need a
>> single new property, "secure-status", which can be used to annotate
>> devices to indicate that they are only visible in one of the two
>> worlds.
>>
>> The primary expected use of this binding is for a virtual machine
>> like QEMU to describe the VM layout to a TrustZone aware firmware
>> (which would then use the secure-only devices itself, and pass the DT
>> on to a kernel running in the non-secure world, which ignores the
>> secure-only devices and uses the rest).
>>
>> Signed-off-by: Peter Maydell 
>
> I'd specifically like Mark's ack on this one.
>
>> ---
>> This binding doesn't affect the kernel itself, but the kernel
>> Documentation/ tree is the de-facto current place where all DT
>> bindings are documented, so Grant suggested this was the right
>> place to send a doc patch.
>>
>> Changes v1->v2:
>>  * list all the status/secure-status combinations explicitly
>>  * use /* */ comment syntax, not //
>>
>>  Documentation/devicetree/bindings/arm/secure.txt | 51 
>> 
>>  1 file changed, 51 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/arm/secure.txt
>>
>> diff --git a/Documentation/devicetree/bindings/arm/secure.txt 
>> b/Documentation/devicetree/bindings/arm/secure.txt
>> new file mode 100644
>> index 000..7ed5ed6
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/arm/secure.txt
>> @@ -0,0 +1,51 @@
>> +* ARM Secure world bindings
>> +
>> +ARM CPUs with TrustZone support have two distinct address spaces,
>> +"Normal" and "Secure". Most devicetree consumers (including the Linux
>> +kernel) are not TrustZone aware and run entirely in either the Normal
>> +world or the Secure world. However some devicetree consumers are
>> +TrustZone aware and need to be able to determine whether devices are
>> +visible only in the Secure address space, only in the Normal address
>> +space, or visible in both. (One example of that situation would be a
>> +virtual machine which boots Secure firmware and wants to tell the
>> +firmware about the layout of the machine via devicetree.)
>> +
>> +The general principle of the naming scheme for Secure world bindings
>> +is that any property that needs a different value in the Secure world
>> +can be supported by prefixing the property name with "secure-". So for
>> +instance "secure-reg" would override "reg". If there is no "secure-"
>
> I'd prefer this be "secure-foo" and "foo", rather than reg given I
> specifically have a differing opinion on how to support reg.
>
> Also, would it be secure-vendor,foo or vendor,secure-foo for properties
> with vendor prefix? The latter looks more correct to me, but the former
> would be easier to search for both variants of the property. I'd lean
> towards the latter.

OK, so how about making that para read:

+ The general principle of the naming scheme for Secure world bindings
+ is that any property that needs a different value in the Secure world
+ can be supported by prefixing the property name with "secure-". So for
+ instance "secure-foo" would override "foo". For property names with
+ a vendor prefix, the Secure variant of "vendor,foo" would be
+ "vendor,secure-foo". If there is no "secure-" property then the Secure
+ world value is the same as specified for the Normal world by the
+ non-prefixed property. However, only the properties listed below may
+ validly have "secure-" versions; this list will be enlarged on a
+ case-by-case basis.

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2] Documentation: dt: Add bindings for Secure-only devices

2015-11-12 Thread Peter Maydell

The existing device tree bindings assume that we are only trying to
describe a single address space with a device tree (for ARM, either
the Normal or the Secure world). Some uses for device tree need to
describe both Normal and Secure worlds in a single device tree. Add
documentation of how to do this, by adding extra properties which
describe when a device appears differently in the two worlds or when
it only appears in one of them.

The binding describes the general principles for adding new
properties describing the secure world, but for now we only need a
single new property, "secure-status", which can be used to annotate
devices to indicate that they are only visible in one of the two
worlds.

The primary expected use of this binding is for a virtual machine
like QEMU to describe the VM layout to a TrustZone aware firmware
(which would then use the secure-only devices itself, and pass the DT
on to a kernel running in the non-secure world, which ignores the
secure-only devices and uses the rest).

Signed-off-by: Peter Maydell 
---
This binding doesn't affect the kernel itself, but the kernel
Documentation/ tree is the de-facto current place where all DT
bindings are documented, so Grant suggested this was the right
place to send a doc patch.

Changes v1->v2:
 * list all the status/secure-status combinations explicitly
 * use /* */ comment syntax, not //

 Documentation/devicetree/bindings/arm/secure.txt | 51 
 1 file changed, 51 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/secure.txt

diff --git a/Documentation/devicetree/bindings/arm/secure.txt 
b/Documentation/devicetree/bindings/arm/secure.txt
new file mode 100644
index 000..7ed5ed6
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/secure.txt
@@ -0,0 +1,51 @@
+* ARM Secure world bindings
+
+ARM CPUs with TrustZone support have two distinct address spaces,
+"Normal" and "Secure". Most devicetree consumers (including the Linux
+kernel) are not TrustZone aware and run entirely in either the Normal
+world or the Secure world. However some devicetree consumers are
+TrustZone aware and need to be able to determine whether devices are
+visible only in the Secure address space, only in the Normal address
+space, or visible in both. (One example of that situation would be a
+virtual machine which boots Secure firmware and wants to tell the
+firmware about the layout of the machine via devicetree.)
+
+The general principle of the naming scheme for Secure world bindings
+is that any property that needs a different value in the Secure world
+can be supported by prefixing the property name with "secure-". So for
+instance "secure-reg" would override "reg". If there is no "secure-"
+property then the Secure world value is the same as specified for the
+Normal world by the non-prefixed property. However, only the
+properties listed below may validly have "secure-" versions; this list
+will be enlarged on a case-by-case basis.
+
+Defining the bindings in this way means that a device tree which has
+been annotated to indicate the presence of Secure-only devices can
+still be processed unmodified by existing Non-secure software (and in
+particular by the kernel).
+
+Note that it is still valid for bindings intended for purely Secure
+world consumers (like kernels that run entirely in Secure) to simply
+describe the view of Secure world using the standard bindings. These
+secure- bindings only need to be used where both the Secure and Normal
+world views need to be described in a single device tree.
+
+Valid Secure world properties:
+
+- secure-status : specifies whether the device is present and usable
+  in the secure world. The combination of this with "status" allows
+  the various possible combinations of device visibility to be
+  specified. If "secure-status" is not specified it defaults to the
+  same value as "status"; if "status" is not specified either then
+  both default to "okay". This means the following combinations are
+  possible:
+
+   /* Neither specified: default to visible in both S and NS */
+   secure-status = "okay";  /* visible in both */
+   status = "okay"; /* visible in both */
+   status = "okay"; secure-status = "okay"; /* visible in both */
+   secure-status = "disabled";  /* NS-only */
+   status = "okay"; secure-status = "disabled"; /* NS-only */
+   status = "disabled"; secure-status = "okay"; /* S-only */
+   status = "disabled"; /* disabled in both */
+   status = "disabled"; secure-status = "disabled"; /* disabled in both */
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Documentation: dt: Add bindings for Secure-only devices

2015-11-10 Thread Peter Maydell

On 10 November 2015 at 14:51, Rob Herring  wrote:
> On Fri, Oct 30, 2015 at 08:07:34PM +0000, Peter Maydell wrote:
>> +status = "okay"; secure-status = "okay"; // ditto
>> +secure-status = "okay";  // ditto
>> +// neither explicitly defined: ditto
>>
>> (Do you want the full set of 9 options you get from multiplying
>> out "okay" vs "disabled" vs not-set for each property?)
>
> Better to err on completeness. The inheritance is easily missed.
>
> Also, one nit. Use C style comments so when people copy-n-paste this it
> is the correct style.

OK; I'll send out a v2 with these fixes.

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Documentation: dt: Add bindings for Secure-only devices

2015-10-30 Thread Peter Maydell

On 30 October 2015 at 18:28, Rob Herring  wrote:
> On Thu, Oct 29, 2015 at 9:01 AM, Peter Maydell  
> wrote:
>> +Valid Secure world properties:
>> +
>> +- secure-status : specifies whether the device is present and usable
>> +  in the secure world. The combination of this with "status" allows
>> +  the various possible combinations of device visibility to be
>> +  specified:
>> +   status = "okay"; // visible in S and NS
>
> I assume neither property present or both okay also mean the same.
>
> status = "okay"; secure-status = "okay";
>
> We should be explicit.

Yes; status defaults to "okay" (presumably this is listed in
the overal DT binding spec somewhere), and secure-status
defaults to "same as status, which might in turn be defaulted".
We can list the complete set of options (neither present,
both 'okay', status not present but secure-status present, etc),
though it gets a bit long-winded, especially if we later add
more secure- properties (they'd all have to have verbiage about
"if not present, same as non-prefixed property; if both not
present, both take the default the non-prefixed property takes
if it's not present; if prefixed property not present, it
defaults to same as non-prefixed property", which we already
say in the introductory section). Still, for just status it
would be easy enough to add a couple of lines:

+status = "okay"; secure-status = "okay"; // ditto
+secure-status = "okay";  // ditto
+// neither explicitly defined: ditto

(Do you want the full set of 9 options you get from multiplying
out "okay" vs "disabled" vs not-set for each property?)

>> +   status = "disabled"; secure-status = "okay"; // S-only
>> +   status = "okay"; secure-status = "disabled"; // NS-only
>
> In HKG when we discussed this, 'status = "secure"' was the proposal.
> That would be simpler:
>
> S world can use "okay" or "secure"
> NS world can use "okay" or no property.
>
> That leaves out the case of disabled in S and enabled for NS. We could
> want that for s/w reasons, but can we have h/w like that?

It's perfectly possible to design hardware like that (though
I can't think of a reason to do so offhand). I think it's the desire
to be able to describe all the possible valid h/w combinations
that brought us to this secure- prefix design. Plus it
extends nicely to cover other possibilities as we need it;
for instance "device A is at S-0x1 but NS-0x2" can be
done by specifying a device like:
   status = "okay";
   secure-status = "okay";
   reg = < 0x2 0x1000 >;
   secure-regs = < 0x1 0x1000 >;
(apologies if I've messed the syntax up there).

Just going for 'status=secure' would deal with the immediate
requirement, but my preference is for a description that
lets us describe all the possible configurations, not just
the ones we think are common, and secure-* is a neat way
of doing that (IIRC it was Grant's suggestion; speaking of
whom, I just noticed I forgot to cc him on the original patch).

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] Documentation: dt: Add bindings for Secure-only devices

2015-10-29 Thread Peter Maydell

The existing device tree bindings assume that we are only trying to
describe a single address space with a device tree (for ARM, either
the Normal or the Secure world). Some uses for device tree need to
describe both Normal and Secure worlds in a single device tree. Add
documentation of how to do this, by adding extra properties which
describe when a device appears differently in the two worlds or when
it only appears in one of them.

The binding describes the general principles for adding new
properties describing the secure world, but for now we only need a
single new property, "secure-status", which can be used to annotate
devices to indicate that they are only visible in one of the two
worlds.

The primary expected use of this binding is for a virtual machine
like QEMU to describe the VM layout to a TrustZone aware firmware
(which would then use the secure-only devices itself, and pass the DT
on to a kernel running in the non-secure world, which ignores the
secure-only devices and uses the rest).

Signed-off-by: Peter Maydell 
---
This binding doesn't affect the kernel itself, but the kernel
Documentation/ tree is the de-facto current place where all DT
bindings are documented, so Grant suggested this was the right
place to send a doc patch.

 Documentation/devicetree/bindings/arm/secure.txt | 41 
 1 file changed, 41 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/secure.txt

diff --git a/Documentation/devicetree/bindings/arm/secure.txt 
b/Documentation/devicetree/bindings/arm/secure.txt
new file mode 100644
index 000..4129fba
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/secure.txt
@@ -0,0 +1,41 @@
+* ARM Secure world bindings
+
+ARM CPUs with TrustZone support have two distinct address spaces,
+"Normal" and "Secure". Most devicetree consumers (including the Linux
+kernel) are not TrustZone aware and run entirely in either the Normal
+world or the Secure world. However some devicetree consumers are
+TrustZone aware and need to be able to determine whether devices are
+visible only in the Secure address space, only in the Normal address
+space, or visible in both. (One example of that situation would be a
+virtual machine which boots Secure firmware and wants to tell the
+firmware about the layout of the machine via devicetree.)
+
+The general principle of the naming scheme for Secure world bindings
+is that any property that needs a different value in the Secure world
+can be supported by prefixing the property name with "secure-". So for
+instance "secure-reg" would override "reg". If there is no "secure-"
+property then the Secure world value is the same as specified for the
+Normal world by the non-prefixed property. However, only the
+properties listed below may validly have "secure-" versions; this list
+will be enlarged on a case-by-case basis.
+
+Defining the bindings in this way means that a device tree which has
+been annotated to indicate the presence of Secure-only devices can
+still be processed unmodified by existing Non-secure software (and in
+particular by the kernel).
+
+Note that it is still valid for bindings intended for purely Secure
+world consumers (like kernels that run entirely in Secure) to simply
+describe the view of Secure world using the standard bindings. These
+secure- bindings only need to be used where both the Secure and Normal
+world views need to be described in a single device tree.
+
+Valid Secure world properties:
+
+- secure-status : specifies whether the device is present and usable
+  in the secure world. The combination of this with "status" allows
+  the various possible combinations of device visibility to be
+  specified:
+   status = "okay"; // visible in S and NS
+   status = "disabled"; secure-status = "okay"; // S-only
+   status = "okay"; secure-status = "disabled"; // NS-only
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 0/4] SysFS driver for QEMU fw_cfg device

2015-10-05 Thread Peter Maydell

On 5 October 2015 at 13:40, Gabriel L. Somlo  wrote:
> In addition, Michael's comment earlier in the thread suggests that
> even my current acpi version isn't sufficiently "orthodox" w.r.t.
> ACPI, and I should be providing the hardware access routine as
> an ACPI/AML routine, to avoid race conditions with the rest of ACPI,
> and for encapsulation. I.e. it's even rude to use the fw_cfg node's
> ACPI _CRS method (the part where I'd be treating it like a DT stand-in
> only to query fw_cfg's hardware specifics).

If you want to try to support "firmware might also be reading
fw_cfg at the same time as the kernel" this is a (painful)
problem regardless of how the kernel figures out whether a
fw_cfg device is present. I had assumed that one of the design
assumptions of this series was that firmware would only
read the fw_cfg before booting the guest kernel and never touch
it afterwards. If it might touch it later then letting the
guest kernel also mess with fw_cfg seems like a really bad idea.

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2] arm: change vendor ID for virtio-mmio

2015-07-31 Thread Peter Maydell

On 29 July 2015 at 20:16, Michael S. Tsirkin  wrote:
> ACPI spec 5.0 allows the use of PCI vendor IDs.
>
> Since we have one for virtio, it seems neater to use that
> rather than LNRO. For the device ID, use 103F which is a legacy ID that
> isn't used in virtio PCI spec - seems to make sense since virtio-mmio is
> a legacy device but we don't know the correct device type.
>
> Guests should probably match everything in the range 1000-103F
> (just like legacy pci drivers do) which will allow us to pass in the
> actual ID in the future if we want to.
>
> Signed-off-by: Michael S. Tsirkin 
> ---
>  hw/arm/virt-acpi-build.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index f365140..dea61ba 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -145,7 +145,7 @@ static void acpi_dsdt_add_virtio(Aml *scope,
>
>  for (i = 0; i < num; i++) {
>  Aml *dev = aml_device("VR%02u", i);
> -aml_append(dev, aml_name_decl("_HID", aml_string("LNRO0005")));
> +aml_append(dev, aml_name_decl("_HID", aml_string("1AF4103F")));
>  aml_append(dev, aml_name_decl("_UID", aml_int(i)));

So, I've just checked, and I believe that the kernel that RedHat
are shipping in their RHEL7 dev preview for AArch64 (and probably
thus also the Fedora/Centos one) includes a patch which adds
ACPI support to the virtio-mmio driver using the LNRO0005 ID string.

This to me suggests that we should just stick with that ID,
rather than changing to QEMU, the hex one based on the PCI
vendor ID, or anything else.

We're obviously under no obligation to make life easy for people
who ship kernels full of patches that haven't gone upstream yet,
but in this case there doesn't seem to me to be any benefit to
QEMU from picking an ID string that would break compatibility...

[The kernel I checked was the one in
https://git.centos.org/sources/kernel-aarch64/c7-aarch64/c589ab77889df6d93dbe817c373080631ab3275b
which despite the filename is actually an 80MB .tar.xz archive,
as pointed to by
https://git.centos.org/blob/rpms!kernel-aarch64/910dbce5f13419d68002f58e67ee6e762a93a425/.kernel-aarch64.metadata
]

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Qemu-devel] [PATCH v2] arm: change vendor ID for virtio-mmio

2015-07-30 Thread Peter Maydell

On 30 July 2015 at 09:04, Michael S. Tsirkin  wrote:
> On Thu, Jul 30, 2015 at 09:23:20AM +0800, Shannon Zhao wrote:
>>
>> Why do we drop the previous way using "QEMU"? Something I missed?
>
> So that guests that bind to this interface will work fine with non QEMU
> implementations of virtio-mmio.

I don't understand this sentence. If there are pre-existing
non-QEMU virtio-mmio implementations, then they're using
LNRO0005, and we should use it too. If there are going to
be implementations of virtio-mmio in future, then they will
use whatever identifier we pick here. Either way, we get
interoperability. I don't see any difference between our
saying "the ID for virtio-mmio is QEMU0005" and saying
"the ID for virtio-mmio is 1AF4103F".

(The latter seems unnecessarily opaque to me, to be honest.
At least an ID string QEMU gives you a clue where to
look for who owns the thing.)

Note also that strictly you don't mean "non-QEMU implementations
of virtio-mmio", you mean "non-QEMU implementations of the
ACPI tables". The hardware implementation of virtio-mmio
doesn't care at all about the ACPI ID. (In fact the most
plausible other-implementation would be UEFI using its
own (hard-coded) ACPI tables on top of a QEMU vexpress-a15
model or something similar.)

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] virtio_mmio: add ACPI probing

2015-07-29 Thread Peter Maydell

On 28 July 2015 at 11:33, G Gregory  wrote:
> We assigned LNRO in ASWG to avoid collisions with our prototypes/real
> platforms so it makes sense to me to switch to QEMU.

So just to check, if we switch virtio-mmio from an LNRO0005 ID
to a QEMU ID we aren't going to break any existing widely
shipped or deployed code, right?

If we can change the ID without breaking anything significant
then I think the QEMU ID makes more sense; but it doesn't
really gain us much beyond tidiness.

PS: https://www.kernel.org/doc/Documentation/arm64/arm-acpi.txt
uses virtio-mmio and LNRO0005 as its code example, so if
we change this then it might be nice to update the docs
as a followup.

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] virtio_mmio: add ACPI probing

2015-07-28 Thread Peter Maydell

On 28 July 2015 at 21:28, G Gregory  wrote:
> On 28 July 2015 at 21:12, Peter Maydell  wrote:
>> Mmm. I'm not terribly happy about stuff being in QEMU before the
>> ACPI spec for it has been finalised. We should not be picking
>> stuff randomly on the fly...
>>
>> If we want to fix the ACPI IDs QEMU is using for 2.4 then we
>> really need to do that now (ie within the next day or two).
>>
> It is upto the owner of the QEMU prefix to allocate numbers. This is
> not an issue for ACPI spec at all.

I mean "the specification for how this device should be advertised
in an ACPI table". I don't care whether that's an official ACPI
consortium thing or something less official. The table is
constructed by QEMU and read by the kernel (and possibly
also by UEFI?), so everybody needs to agree on what the
string is...

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] virtio_mmio: add ACPI probing

2015-07-28 Thread Peter Maydell

On 28 July 2015 at 11:27, Michael S. Tsirkin  wrote:
> On Tue, Jul 28, 2015 at 11:12:33AM +0100, Peter Maydell wrote:
>> On 28 July 2015 at 11:08, Michael S. Tsirkin  wrote:
>> > On Tue, Jul 28, 2015 at 10:44:02AM +0100, Graeme Gregory wrote:
>> >> Added the match table and pointers for ACPI probing to the driver.
>> >>
>> >> This uses the same identifier for virt devices as being used for qemu
>> >> ARM64 ACPI support.
>> >>
>> >> http://git.linaro.org/people/shannon.zhao/qemu.git/commit/d0bf1955a3ecbab4b51d46f8c5dda02b7e14a17e
>> >>
>> >> Signed-off-by: Graeme Gregory 

>> >> +#ifdef CONFIG_ACPI
>> >> +static const struct acpi_device_id virtio_mmio_acpi_match[] = {
>> >> + { "LNRO0005", },
>> >> + { }
>> >> +};
>> >
>> > Hmm - we have reserved QEMU in ASWG explicitly for this purpose.
>> >
>> > Pater - do you think it's a good idea to change this before QEMU 2.4
>> > is released?
>>
>> Shannon's call, I guess. I don't know enough about ACPI to say.
>> I thought these ACPI IDs were already fixed because they were
>> what the kernel was looking for...

> Apparently not :)

Mmm. I'm not terribly happy about stuff being in QEMU before the
ACPI spec for it has been finalised. We should not be picking
stuff randomly on the fly...

If we want to fix the ACPI IDs QEMU is using for 2.4 then we
really need to do that now (ie within the next day or two).

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] virtio_mmio: add ACPI probing

2015-07-28 Thread Peter Maydell

On 28 July 2015 at 11:08, Michael S. Tsirkin  wrote:
> On Tue, Jul 28, 2015 at 10:44:02AM +0100, Graeme Gregory wrote:
>> Added the match table and pointers for ACPI probing to the driver.
>>
>> This uses the same identifier for virt devices as being used for qemu
>> ARM64 ACPI support.
>>
>> http://git.linaro.org/people/shannon.zhao/qemu.git/commit/d0bf1955a3ecbab4b51d46f8c5dda02b7e14a17e
>>
>> Signed-off-by: Graeme Gregory 
>> ---
>>  drivers/virtio/virtio_mmio.c | 10 ++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
>> index 10189b5..f499d9d 100644
>> --- a/drivers/virtio/virtio_mmio.c
>> +++ b/drivers/virtio/virtio_mmio.c
>> @@ -58,6 +58,7 @@
>>
>>  #define pr_fmt(fmt) "virtio-mmio: " fmt
>>
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -732,12 +733,21 @@ static struct of_device_id virtio_mmio_match[] = {
>>  };
>>  MODULE_DEVICE_TABLE(of, virtio_mmio_match);
>>
>> +#ifdef CONFIG_ACPI
>> +static const struct acpi_device_id virtio_mmio_acpi_match[] = {
>> + { "LNRO0005", },
>> + { }
>> +};
>
> Hmm - we have reserved QEMU in ASWG explicitly for this purpose.
>
> Pater - do you think it's a good idea to change this before QEMU 2.4
> is released?

Shannon's call, I guess. I don't know enough about ACPI to say.
I thought these ACPI IDs were already fixed because they were
what the kernel was looking for...

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] arm64: KVM: Enable minimalistic support for Thunder

2015-06-29 Thread Peter Maydell

On 29 June 2015 at 18:11, Marc Zyngier  wrote:
> On 29/06/15 18:06, Chalamarla, Tirumalesh wrote:
>>
>>> On Jun 29, 2015, at 1:53 AM, Marc Zyngier  wrote:
>>> Constantly adding new CPUs without providing any insight as to how they
>>> should be emulated only brings churn, and not much benefit.
>>>
>> we are not planning(at this point) to emulate Thunder with QEMU/others.
>
> Fair enough. If cross-cpu VM migration is none of your concern, then
> Suzuki's patch should be enough.

"Emulate with QEMU" is a completely different thing from
"support cross-cpu VM migration" !

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v4 10/12] KVM: arm64: guest debug, HW assisted debug support

2015-05-15 Thread Peter Maydell

On 15 May 2015 at 17:16, Alex Bennée  wrote:
> Mark Rutland  writes:
>> This gets more fun when you consider the context-aware breakpoints are
>> the highest numbered. So the set of (context-aware) breakpoints might
>> not intersect across all CPUs.
>
> I didn't see a reference to that in the ARM ARM. It seemed to imply any
> breakpoint could be context aware is .BT was appropriately set and
> linked to the VR.

No; see D2.9.2; there must be at least one context-aware
breakpoint, but no requirement for more than that.

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v4 03/12] KVM: arm64: guest debug, define API headers

2015-05-15 Thread Peter Maydell

On 15 May 2015 at 16:14, Alex Bennée  wrote:
>
> Mark Rutland  writes:
>
>> On Fri, May 15, 2015 at 03:27:06PM +0100, Alex Bennée wrote:
>>> +/*
>>> + * See v8 ARM ARM D7.3: Debug Registers
>>> + *
>>> + * The control registers are architecturally defined as 32 bits but are
>>> + * stored as 64 bit values alongside the value registers. This is done
>>
>> Stale comment? They're stored as __u32 below.
>
> Gah yes it is.
>
>> It's possible that the registers could grow in future as happened in the
>> case of CLIDR_EL1, so it might be worth treating system registers
>> generally as u64 values.
>
> Really? I mean the existing debug *control* registers have reserved bits
> 24-31 so there is space for expansion.

Other places in the userspace ABI which deal with sysregs (notably
ONE_REG) consistently define them all as 64-bit (which makes sense
anyway since the ISA only provides 64-bit accessors to them).
"Architecturally 32 bits" only means "top 32 bits reserved".

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 06/10] KVM: arm64: guest debug, add SW break point support

2015-04-28 Thread Peter Maydell

On 28 April 2015 at 09:42, Alex Bennée  wrote:
> Peter Maydell  writes:
>> Does the kernel already have a conveniently implemented "inject
>> exception into guest" lump of code? If so it might be less effort
>> to do it that way round, maybe.
>
> So you pointed out we can't just re-inject the exceptions we get as we
> need to map from things like ESR_ELx_EC_WATCHPT_LOW to
> ESR_ELx_EC_WATCHPT_CUR before re-injection.
>
> Of course if it is as simple as modifying the ESR_EL1 register and
> returning +ve in the handle_exit path then I can do that but I assumed
> if any other wrangling needs doing it should be done in userspace.

Well, somebody's got to do it, and it's the same amount of work
either way (fiddling with ESR, making sure we direct the guest
to the right exception vector entry point, maybe a few other
things).

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 06/10] KVM: arm64: guest debug, add SW break point support

2015-04-27 Thread Peter Maydell

On 27 April 2015 at 21:04, Christoffer Dall  wrote:
> On Thu, Apr 23, 2015 at 03:26:53PM +0100, Alex Bennée wrote:
>>
>> Christoffer Dall  writes:
>>
>> > On Tue, Mar 31, 2015 at 04:08:04PM +0100, Alex Bennée wrote:
>> >> + * just need to report the PC and the HSR values to userspace.
>> >> + * Userspace may decide to re-inject the exception and deliver it to
>> >> + * the guest if it wasn't for the host to deal with.
>> >
>> > now I'm confused - does userspace setup the guest to receive an
>> > exception or does it tell KVM to emulate an exception for the guest or
>> > do we execute the breakpoint without trapping the debug exception?
>>
>> I've made it all go through userspace as we may have to translate the
>> hypervisor visible exception code to what the guest was expecting to see.
>>
>
> ok, so I think you should re-phrase something like:
>
> "Userspace may decide that this exception is caused by the guest using
> debugging itself, and may in that case emulate the guest debug exception
> in userspace before resuming KVM."
>
> But does that really work?  Given that we don't support KVM-TCG
> migration, this sounds a little strange.  Did we test it?

The QEMU patches have a TODO note at the point where you'd want
to do this... Design-wise you can do the reinjection in the
kernel or in userspace (ppc QEMU does this in userspace, for
instance) because it's pretty much just setting registers to fake
up the exception-entry into EL1. Code-wise QEMU's ARM code isn't
set up to do it right now, but it shouldn't be too difficult to
persuade the TCG exception-entry code to work for this case too.

Does the kernel already have a conveniently implemented "inject
exception into guest" lump of code? If so it might be less effort
to do it that way round, maybe.

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 03/10] KVM: arm: guest debug, define API headers

2015-04-01 Thread Peter Maydell

On 1 April 2015 at 17:01, Alex Bennée  wrote:
>
> David Hildenbrand  writes:

>>> +/*
>>> + * See ARM ARM D7.3: Debug Registers
>>
>> Maybe drop one ARM
>
> Well technically it is the ARM Architecture Reference Manual but that's
> quite long winded.

Dropping one "ARM" would be actively wrong, so don't do that.
You probably want to say 'v8 ARM ARM", assuming you mean that.

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 1/5] arm: KVM: export vcpi->pause state via MP_STATE ioctls

2015-03-09 Thread Peter Maydell

On 10 March 2015 at 04:29, Christoffer Dall  wrote:
> On Mon, Mar 09, 2015 at 04:34:21PM +, Alex Bennée wrote:
>> - Boot
>> - Power on secondary CPUs
>> - Power off one secondary CPU
>> - Migrate to file (cpu_powered reflects state of each CPU)
>>
>> - Start fresh QEMU
>> - Restore from file (cpu_powered -> vcpu->paused via ioctl)
>> - Run (we continue with the same power state pre-migrate)
>>
>> - PSCI RESET
>> - Does what it does, power all secondaries down?
>> - Kernel boots, turns them on?
>>
> Hmmm. As long as QEMU always inits all VCPUs in the exact same way
> (including the KVM_ARM_VCPU_POWER_OFF feature bit) I guess it works and
> that's probably a reasonable requirement for migration.

We init the VCPUs with the POWER_OFF flag set for exactly the
set of CPUs that we want to start powered off. Typically that
means that the first CPU is inited with POWER_OFF clear and the
rest are inited with it set.

Regardless, if we're doing an incoming migration, then the
incoming data will be used to set the VCPU on/off state
appropriately for resuming the VM. (This might include
powering on a VCPU that's been inited in power-off and never
run, or powering down a VCPU that was inited power-on but
never ran. In the 'migration moves a vcpu to powered-on'
case we'll also set the vcpu's PC and other registers so
when it is run it will DTRT.)

If the resumed guest subsequently does a PSCI reset then
QEMU will re-init the VCPUs with the set of feature bits that
the machine model/etc defines for a reset condition for this
board, which will be the same "first CPU powered on, all the
rest powered off" config we started with.

(It's the user's responsibility to ensure that when doing a
migration the QEMUs at both ends are identically configured,
incidentally.)

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 3/4] arm64: dts: Add mediatek MT8173 SoC and evaluation board dts and Makefile

2014-12-20 Thread Peter Maydell

On 20 December 2014 at 20:07, Arnd Bergmann  wrote:
> On Wednesday 17 December 2014 15:01:29 Marc Zyngier wrote:
>> Also it is worth noticing that given how GICV is placed, it will never
>> work with 64K pages and virtualization. Pretty sad.
>
> Does this mean no VGIC support on this platform so you have to emulate it
> in order to run virtual machines with 64K pages, or does it mean that
> it's impossible to use that way because you can't emulate it?

Currently having the guest use the generic timer requires an
in-kernel GIC, because we don't provide an ABI for having the
kernel say "here is a generic timer interrupt for the userspace
emulated GIC". And at least for QEMU userspace the "virt" board
doesn't provide any other kind of timer, so you'd need to add one
if you wanted to use a userspace GIC. So for practical purposes
"out-of-kernel GIC" is not a supported config for KVM+QEMU.

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v12 7/7] ARM: kprobes: enable OPTPROBES for ARM 32

2014-12-05 Thread Peter Maydell

On 5 December 2014 at 10:10, Jon Medhurst (Tixy)  wrote:
> I don't know much about QEMU and have never used it, but I'm assuming
> QEMU doesn't make any attempt to simulate caches like the data cache,
> instruction cache, TLBs, branch predictor? Does it even emulate multiple
> CPUs with multiple host CPU threads? Basically, I very much doubt QEMU
> is a very good test of kernel code in general, and especially code that
> modifies code and has multiple cpus running in parallel.

You're generally correct here, yes. QEMU doesn't emulate caches
or TLBs or branch predictors, and we currently emulate SMP by
doing round-robin execution on a single host thread (though
we're working on that for performance reasons). There are also
a range of buggy-guest-code conditions (alignment faults, for
instance) which we don't emulate. I tend to think of QEMU's
overall philosophy as "run known-good code quickly" rather
than "diagnose problems in buggy code". So it's definitely wise
to test complicated kernel code like this on real hardware
(though of course QEMU may be very helpful in speeding up the
development cycle compared to h/w).

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC v2] arm:extend the reserved mrmory for initrd to be page aligned

2014-12-05 Thread Peter Maydell

On 5 December 2014 at 18:44, Catalin Marinas  wrote:
> On Fri, Dec 05, 2014 at 05:27:02PM +, Russell King - ARM Linux wrote:
>> which makes the summary line rather misleading, and I really don't think
>> we need to do this on ARM for the simple reason that we've been doing it
>> for soo long that it can't be an issue.
>
> I started this as a revert and then realised that it doesn't solve
> anything for arm32 without changing the poisoning.
>
> Anyway, if you are happy with how it is, I'll drop the arm32 part. As I
> said yesterday, the issue is worse for arm64 with 64K pages.

If you do want to retain the arm32 "mustn't be in the 4K page of
the initrd tail" behaviour then it would probably be a good idea
to document this in the Booting spec.

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC v2] arm:extend the reserved mrmory for initrd to be page aligned

2014-12-05 Thread Peter Maydell

On 5 December 2014 at 17:27, Russell King - ARM Linux
 wrote:
> On Fri, Dec 05, 2014 at 05:07:45PM +, Catalin Marinas wrote:
>> On Fri, Dec 05, 2014 at 12:05:06PM +, Will Deacon wrote:
>> > Care to submit this as a proper patch? We should at least fix Peter's issue
>> > before doing things like extending headers, which won't work for older
>> > kernels anyway.
>>
>> Quick fix is the revert of the whole patch, together with removing
>> PAGE_ALIGN(end) in poison_init_mem() on arm32. If Russell is ok with
>> this patch, we can take it via the arm64 tree, otherwise I'll send you a
>> partial revert only for the arm64 part.
>
> Not really.  Let's look at the history.
>
> For years, we've been poisoning memory, page aligning the end pointer.
> This has never been an issue.

Depends what you mean by "never been an issue". I had to change
QEMU (commit 98ed805c, January 2013) for 32-bit ARM back when the
kernel started trashing the tail end of the page after the initrd
with the poisoning, to 4K-align the dtb so it didn't share a page
with the initrd-tail. That nobody else complained suggests that most
bootloaders don't in practice overlap the two, though (ie that
QEMU is an outlier in how it chooses to arrange things in memory).

I should probably have reported the breakage at the time, but
I took the pragmatic (lazy?) approach of just changing our
bootloader code.

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] KVM: arm/arm64: vgic: add init entry to VGIC KVM device

2014-12-04 Thread Peter Maydell

On 4 December 2014 at 12:01, Eric Auger  wrote:
> Here is the sequence:
> 1) The VGIC early initialization is initiated in a machine init done
> notifier. This notifier is registered in kvm_arm_gic_realize
> (http://lists.gnu.org/archive/html/qemu-devel/2014-12/msg00220.html). It
> executes after vcpu instantiations + dist/cpu interface base address
> setting + nb irq setting.
> 2) the VFIO signaling and irqfd setup is done in a reset notifier
> http://lists.gnu.org/archive/html/qemu-devel/2014-11/msg04365.html

OK. And on x86 VFIO how does this work? Obviously x86's GIC just
initializes as soon as it's created, but do we do the irqfd setup
in a reset notifier there too?

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] KVM: arm/arm64: vgic: add init entry to VGIC KVM device

2014-12-04 Thread Peter Maydell

On 2 December 2014 at 17:54, Eric Auger  wrote:
> as soon as VFIO signaling is set up (the device IRQ index is linked to
> an eventfd, the physical IRQ VFIO handler is installed and the physical
> IRQ is enabled at interrupt controller level), virtual IRQs are likely
> to be injected. With current QEMU code, we setup this VFIO signaling
> *before* the vgic readiness (either on machine init done or reset
> notifier) and we face that issue of early injection. QEMU related
> patches to follow ...

So can you describe in QEMU terms how the lifecycle of these
things works? How do we ensure that we don't start trying
to inject VFIO IRQs before we've even created the vgic, for
instance?

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] KVM: arm/arm64: vgic: add init entry to VGIC KVM device

2014-12-02 Thread Peter Maydell

On 2 December 2014 at 17:27, Eric Auger  wrote:
> Since the advent of dynamic initialization of VGIC, this latter is
> initialized very late, on the first vcpu run. This initialization
> could be initiated much earlier by the user, as soon as it has
> provided the requested dimensioning parameters:
> - number of IRQs and number of vCPUs,
> - DIST and CPU interface base address.
>
> One motivation behind being able to initialize the VGIC sooner is
> related to the setup of IRQ injection in VFIO use case. The VFIO
> signaling, especially when used along with irqfd must be set *after*
> vgic initialization to prevent any virtual IRQ injection before
> VGIC initialization. If virtual IRQ injection occurs before the VGIC
> init, the IRQ cannot be injected and subsequent injection is blocked
> due to VFIO completion mechanism (unmask/mask or forward/unforward).

This implies that you're potentially injecting virtual IRQs
(and changing the state of the VGIC) before we actually
start running the VM (ie before userspace calls KVM_RUN).
Is that right? It seems odd, but maybe vfio works that way?

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/7] KVM: arm64: guest debug, add support for single-step

2014-11-30 Thread Peter Maydell

On 30 November 2014 at 10:10, Christoffer Dall
 wrote:
> In any case, I think it was related to how userspace observes the state
> of the CPU, because when you do the MMIO operation emulation in
> userspace, currently if you observe the PC though GET_ONE_REG, you'll
> see a PC pointing to the next instruction, not the one you're emulating
> which is strange.

Also if we ever add support for userspace to say "this MMIO should
fault" then we definitely need the PC-advance to happen afterwards,
not before.

> Not sure what the relation to a guest single-stepping itself was.

I think it just came up in the course of that discussion, because
single-step handling also needs to perform an action (clear PSTATE.SS)
as part of the "advance over this insn" operation. But I think that
you're right that doing the advance before dropping out to userspace
is no worse for singlestep than it is for any other case.

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/7] KVM: arm64: guest debug, add support for single-step

2014-11-26 Thread Peter Maydell

On 25 November 2014 at 16:10, Alex Bennée  wrote:
> This adds support for single-stepping the guest. As userspace can and
> will manipulate guest registers before restarting any tweaking of the
> registers has to occur just before control is passed back to the guest.
> Furthermore while guest debugging is in effect we need to squash the
> ability of the guest to single-step itself as we have no easy way of
> re-entering the guest after the exception has been delivered to the
> hypervisor.

A corner case I don't think this patch handles: if the debugger
tries to single step an insn which is emulated by the
hypervisor (because it's a load/store which is trapped and
handled as emulated mmio in userspace) then we won't
correctly update the single-step state machine (and so we'll end
up incorrectly stopping after the following insn rather than
before, I think).

You should be able to achieve this effect by simply always clearing
the guest's PSTATE.SS when you advance the PC to skip the emulated
instruction (cf the comment in the pseudocode SSAdvance() function).

I think we should also be doing this PC advance on return from
userspace's handling of the mmio rather than before we drop back
to userspace as we do now, but I can't remember why I think that.
Christoffer, I don't suppose you recall, do you? I think it was
you I had this conversation with on IRC a month or so back...

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/7] KVM: arm64: guest debug, add SW break point support

2014-11-26 Thread Peter Maydell

On 26 November 2014 at 16:07, Andrew Jones  wrote:
> There appears to be a typo in the ARM ARM. Subsection "Software
> Breakpoint Instruction exception" of D1.10.4 says BRK (ESR_EL2_EC_BRK64)
> is 0x39, but the table above that has it correctly as 0x3c.

Thanks for pointing out this typo -- I've reported it to the
appropriate documentation people.

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/7] KVM: arm: guest debug, define API headers

2014-11-25 Thread Peter Maydell

On 25 November 2014 at 17:05, Paolo Bonzini  wrote:
> So there is no register that says "this breakpoint has triggered" or
> "this watchpoint has triggered"?

Nope. You take a debug exception; the syndrome register tells
you if it was a bp or a wp, and if it was a wp the fault address
register tells you the address being accessed (if it was a bp
you know the program counter, obviously). The debugger is expected
to be able to figure it out from there, if it cares.

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/7] KVM: arm: guest debug, define API headers

2014-11-25 Thread Peter Maydell

On 25 November 2014 at 16:10, Alex Bennée  wrote:
> +/* Exit types which define why we did a debug exit */
> +#define KVM_DEBUG_EXIT_ERROR   0x0
> +#define KVM_DEBUG_EXIT_SINGLE_STEP 0x1
> +#define KVM_DEBUG_EXIT_SW_BKPT 0x2
> +#define KVM_DEBUG_EXIT_HW_BKPT 0x3
> +#define KVM_DEBUG_EXIT_HW_WTPT 0x4

The names of these imply that they're generic, but they're
defined in an arch-specific header file...

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2

2014-11-21 Thread Peter Maydell

On 21 November 2014 20:14, Andrea Arcangeli  wrote:
> Hi Peter,
>
> On Wed, Oct 29, 2014 at 05:56:59PM +0000, Peter Maydell wrote:
>> On 29 October 2014 17:46, Andrea Arcangeli  wrote:
>> > After some chat during the KVMForum I've been already thinking it
>> > could be beneficial for some usage to give userland the information
>> > about the fault being read or write
>>
>> ...I wonder if that would let us replace the current nasty
>> mess we use in linux-user to detect read vs write faults
>> (which uses a bunch of architecture-specific hacks including
>> in some cases "look at the insn that triggered this SEGV and
>> decode it to see if it was a load or a store"; see the
>> various cpu_signal_handler() implementations in user-exec.c).
>
> There's currently no plan to deliver to userland read access
> notifications of a present page, simply because the task of the
> userfaultfd is to handle the page fault in userland, but if the page
> is mapped and readable it won't fault in the first place :). I just
> mean it's not like gdb read watch.

If it's mapped and readable-but-not-writable then it should still
fault on write accesses, though? These are cases we currently get
SEGV for, anyway.

> Even if the region would be set to PROT_NONE it would still SEGV
> without triggering an userfault (after all pte_present would still
> true because the page is still mapped despite not being readable, so
> in any case it wouldn't be considered a not-present page fault).

Ah, I guess we have a terminology difference. I was considering
"page fault" to mean (roughly) "anything that causes the CPU to
take an exception on an attempted load/store" and expected that
userfaultfd would notify userspace of any of those. (Well, not
alignment faults, maybe, but I'm definitely surprised that
access permission issues don't get reported the same way as
page-completely-missing issues. In other words I was expecting
that this was "everything previously reported via SIGSEGV or
SIGBUS now comes via userfaultfd".)

> Temporarily removing/moving the page with remap_anon_pages shall be
> much better than using PROT_NONE for this (or alternative syscall name
> to differentiate it further from remap_file_pages, or equivalent
> userfaultfd command if we decide to hide the pte/pmd mangling as
> userfaultfd commands instead of adding new standalone syscalls).

We don't use PROT_NONE for the linux-user situation, we just use
mprotect() to remove the PAGE_WRITE permission so it's still
readable.

I suspect actually linux-user would be better off implementing
something like "if this is a page which we've mapped read-only
because we translated code out of it, then go ahead and remap
it r/w and throw away the translation and retry the access,
otherwise report SEGV to the guest", because taking SEGVs shouldn't
be a fast path in the guest binary. That would let us work without
architecture-specific junk and without requiring new kernel
features either. So you can ignore this whole tangent thread :-)

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] ARM: cacheflush: disallow pending signals during cacheflush

2014-11-13 Thread Peter Maydell

On 13 November 2014 11:26, Will Deacon  wrote:
> Whilst I don't think this is the correct solution, I agree that there's
> a potential issue here. We could change the restart return value to
> -ERESTARTNOINTR instead, but I can imagine something like a periodic
> SIGALRM which could prevent a large cacheflush from ever completing.
> Do we actually care about making forward progress in such a scenario?
>
> It is interesting to note that this change has been in mainline since
> May last year without any reported issues. That could be down to a number
> of reasons:
>
>   (1) People are using old kernels on ARM
>
>   (2) Code doesn't check the return value from the cacheflush system call,
>   because it historically always returned 0

...and the documentation comment in the source code didn't say
anything about the syscall having a return value; it only
described the input parameters. I would actually be surprised
if any userspace caller of this syscall checked its return value
(the libgcc cacheflush function used by gcc's clear_cache builtin
doesn't, to pick one popularly used example).

>   (3) People are getting lucky with timing, as this is likely difficult
>   to hit

(4) The resulting misbehaviour ("my JIT crashes occasionally and
non-reproducibly at some point possibly some while after the
cacheflush call") will be extremely hard to track back
to this kernel change

> This leaves me with the following questions:
>
>   - Has this change been shown to break anything in practice?
>   - Can we change the internal return value to -ERESTARTNOINTR?
>   - What do we do about kernels that *do* return -EINTR? (>=3.12?)

My suggestion would be "treat this as a bugfix, put it into
stable kernels in the usual way (and assume distros will pick
it up if appropriate)".

>   - Can we get a manpage put together to describe this mess?

That would be nice :-)

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2

2014-10-29 Thread Peter Maydell

On 29 October 2014 17:46, Andrea Arcangeli  wrote:
> After some chat during the KVMForum I've been already thinking it
> could be beneficial for some usage to give userland the information
> about the fault being read or write

...I wonder if that would let us replace the current nasty
mess we use in linux-user to detect read vs write faults
(which uses a bunch of architecture-specific hacks including
in some cases "look at the insn that triggered this SEGV and
decode it to see if it was a load or a store"; see the
various cpu_signal_handler() implementations in user-exec.c).

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Qemu-devel] [RFC PATCH 0/2] virtio-mmio: add irqfd support for vhost-net based on virtio-mmio

2014-10-27 Thread Peter Maydell

On 27 October 2014 11:23, Li Liu  wrote:
> So you mean virtio-mmio will be replaced by PCI/PCIe on ARM at last?

That is the plan, yes. I can't make any promises on
timescales at the moment, though...

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Qemu-devel] [RFC PATCH 0/2] virtio-mmio: add irqfd support for vhost-net based on virtio-mmio

2014-10-27 Thread Peter Maydell

On 25 October 2014 09:24, john.liuli  wrote:
> To get the interrupt reason to support such VIRTIO_NET_F_STATUS
> features I add a new register offset VIRTIO_MMIO_ISRMEM which
> will help to establish a shared memory region between qemu and
> virtio-mmio device. Then the interrupt reason can be accessed by
> guest driver through this region. At the same time, the virtio-mmio
> dirver check this region to see irqfd is supported or not during
> the irq handler registration, and different handler will be assigned.

If you want to add a new register you should probably propose
an update to the virtio spec. However, it seems to me it would
be better to get generic PCI/PCIe working on the ARM virt
board instead; then we can let virtio-mmio quietly fade away.
This has been on the todo list for ages (and there have been
RFC patches posted for plain PCI), it's just nobody's had time
to work on it.

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] arm: Handle starting up in secure mode

2014-09-18 Thread Peter Maydell

On 17 September 2014 06:25, Christopher Covington  wrote:
> On 09/16/2014 05:24 PM, Christopher Covington wrote:
>> On 09/16/2014 05:09 PM, Christopher Covington wrote:
>>> ARM Linux currently has the most features available to it in hypervisor
>>> (HYP) mode, so switch to it when possible. This can also ensure proper
>>> reset of newer registers such as CNTVOFF.
>>>
>>> The permissions on the Non-Secure Access Control Register (NSACR) are
>>> used to probe what the security setting currently is when in supervisor
>>> (SVC) mode.
>>
>> Sorry, this doesn't work yet. I was misinterpreting my test results. For what
>> it's worth, my testing and development methodology is to run it after hacked
>> up versions of the semihosting bootwrapper on the simulator that corresponds
>> to rtsm_ve-aemv8a.dtb (AEM VE FVP these days?) and examine the instruction 
>> traces.
>
> Looks like the real problem was that I was hacking up the bootwrapper
> incorrectly--my start-in-secure-mode bootwrapper variant wasn't setting up the
> GIC for non-secure access. With that changed, I've tested the following
> variations using the Image file in a single core configuration.
>
> Start in non-secure SVC with non-secure access to GIC configured.
>
> Start in secure SVC with non-secure access to GIC configured.
>
> Start in secure SVC with non-secure access to GIC configured and hypervisor
> support disabled in the model (-C cluster.has_el2=0). This required setting
> the VBAR again in non-secure SVC but with that fix it seems to work. I'll
> include this change in v2.

If you're relying on the boot loader to set up the GIC to support
non-secure access anyway, why not just have it boot the kernel in Hyp
like the boot protocol document recommends? (The same thing as the GIC
is going to apply for any other hardware that needs configuration to
allow NS access; if we need the firmware to deal with this we might as
well just have it boot us in the right mode too.)

Incidentally, on a v7 CPU without the Security Extensions the VBAR
doesn't exist at all, so your code is going to UNDEF at an earlier
point than you think it is...

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [SMP BUG?] the return value of is_smp() is bug?

2014-09-04 Thread Peter Maydell

On 4 September 2014 02:13, long.wanglong  wrote:
> When i revert the commit bc41b8724f24, the secondary core can boot.
> The problem is that qemu doesn't provide emulation of the SCU base
> address register. When reading the SCU base, qemu just return 0.

You need to upgrade your QEMU -- we improved the emulation
of all our Cortex-A9 boards to make sure they set valid values
for the base address register, in order to fix exactly this regression.
QEMU 2.0 or later should work.

(As an aside, I think the Aegis quirk test is pretty ugly:
"base address happens to be set to zero" is not exactly
a very reliable way to identify a particular SoC. However
in this case QEMU was emulating real h/w insufficiently
accurately, so we fixed it.)

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 >

1 - 100 of 136 matches

Mail list logo