Re: Deprecation/removal of nios2 target support

2024-04-18 Thread Arnd Bergmann
On Thu, Apr 18, 2024, at 17:44, Joseph Myers wrote:
> On Wed, 17 Apr 2024, Sandra Loosemore wrote:
>
>> Therefore I'd like to mark Nios II as obsolete in GCC 14 now, and remove
>> support from all toolchain components after the release is made.  I'm not 
>> sure
>> there is an established process for obsoleting/removing support in other
>> components; besides binutils, GDB, and GLIBC, there's QEMU, newlib/libgloss,
>> and the Linux kernel.  But, we need to get the ball rolling somewhere.
>
> CC:ing Arnd Bergmann regarding the obsolescence in the Linux kernel.

We have not yet marked nios2 as deprecated in the kernel, but that
is mostly because the implementation does not get in the way too
much and Dinh Nguyen is still around as a maintainer and merging
bugfixes.

Almost all nios2 kernel changes I see in the past decade have been
done blindly without testing on hardware, either for treewide
changes, or by code inspection. The only notable exceptions I could
find are from Andreas Oetken and Bernd Weiberg at Siemens and
from Marek Vasut (all added to Cc in case they have something to add).

We should probably remove nios2 from the kernel in the near future,
but even if we decide not to, I think deprecating it from gcc is the
right idea: If there are a few remaining users that still plan
to update their kernels, gcc-14 will still be able to build new
kernels for several years.

 Arnd



Re: possible deprecation and removal of some old QEMU Arm machine types (pxa2xx, omap, sa1110)

2024-02-15 Thread Arnd Bergmann
On Thu, Feb 15, 2024, at 09:31, Andreas Kemnade wrote:
> On Wed, 14 Feb 2024 23:42:58 +0100
> "Arnd Bergmann"  wrote:
>> On Wed, Feb 14, 2024, at 13:26, Dmitry Baryshkov wrote:
>> > On Tue, 13 Feb 2024 at 23:22, Linus Walleij  
>> > wrote:  
>> >> On Tue, Feb 13, 2024 at 9:12 PM Arnd Bergmann  wrote:  
>> >> > On Tue, Feb 13, 2024, at 16:36, Guenter Roeck wrote:  
>> >> > > On Tue, Feb 13, 2024 at 03:14:21PM +, Peter Maydell wrote:  
>> >>
>> >> Andrea Adami and Dmitry Eremin-Solenikov did the work in 2017 to
>> >> modernize it a bit, and Russell helped out. I was under the impression
>> >> that they only used real hardware though!  
>> >
>> > I used both Qemu and actual hardware (having collie, poodle, tosa and
>> > c860 that was easy).
>> >
>> > The biggest issue with Zaurus PDAs was that supporting interesting
>> > parts of the platform (PCMCIA, companion chips) required almost
>> > rebootstrapping of the corresponding drivers.
>> > E.g. I had a separate driver for the LoCoMo chip which worked properly
>> > with the DT systems.
>> > PCMCIA was a huuuge trouble and it didn't play well at all. The driver
>> > must be rewritten to use the component framework.  
>> 
>> If we want to actually go there, I think the best option for PCMCIA
>> support is likely to replace the entire "soc_common" pcmcia driver
>> with a simple drivers/pata/ storage driver and no support for
>> other cards. There was a driver until commit 38943cbd25a2
>> ("ata: remove palmld pata driver") that could serve as an
>> template.
>> 
> hmm, main usage for PCMCIA/CF in those devices was often something else,
> not storage, at least on the IPAQ h2200. Wondering wether that road is
> actually good. When I was mainly using those devices, I was not good in
> mainlining things.

Do we still support any non-storage CF devices that someone might
actually use? Do you have a specific example in mind? These are
the currently supported devices that I see:

git grep -B4 -w depends.*PCMCIA | grep "Kconfig-\(config\|menuconfig\)" | grep 
-v ^drivers/pcmcia
drivers/ata/Kconfig-config PATA_PCMCIA
drivers/bluetooth/Kconfig-config BT_HCIDTL1
drivers/bluetooth/Kconfig-config BT_HCIBT3C
drivers/bluetooth/Kconfig-config BT_HCIBLUECARD
drivers/comedi/Kconfig-menuconfig COMEDI_PCMCIA_DRIVERS
drivers/mmc/host/Kconfig-config MMC_SDRICOH_CS
drivers/mtd/maps/Kconfig-config MTD_PCMCIA
drivers/mtd/maps/Kconfig-config MTD_PCMCIA_ANONYMOUS
drivers/net/arcnet/Kconfig-menuconfig ARCNET
drivers/net/arcnet/Kconfig-config ARCNET_COM20020_CS
drivers/net/can/sja1000/Kconfig-config CAN_EMS_PCMCIA
drivers/net/can/sja1000/Kconfig-config CAN_PEAK_PCMCIA
drivers/net/can/softing/Kconfig-config CAN_SOFTING_CS
drivers/net/ethernet/3com/Kconfig-config NET_VENDOR_3COM
drivers/net/ethernet/3com/Kconfig-config PCMCIA_3C574
drivers/net/ethernet/3com/Kconfig-config PCMCIA_3C589
drivers/net/ethernet/8390/Kconfig-config PCMCIA_AXNET
drivers/net/ethernet/8390/Kconfig-config APNE
drivers/net/ethernet/8390/Kconfig-config PCMCIA_PCNET
drivers/net/ethernet/amd/Kconfig-config PCMCIA_NMCLAN
drivers/net/ethernet/fujitsu/Kconfig-config NET_VENDOR_FUJITSU
drivers/net/ethernet/fujitsu/Kconfig-config PCMCIA_FMVJ18X
drivers/net/ethernet/smsc/Kconfig-config PCMCIA_SMC91C92
drivers/net/ethernet/xircom/Kconfig-config NET_VENDOR_XIRCOM
drivers/net/ethernet/xircom/Kconfig-config PCMCIA_XIRC2PS
drivers/parport/Kconfig-config PARPORT_PC_PCMCIA
drivers/scsi/pcmcia/Kconfig-menuconfig SCSI_LOWLEVEL_PCMCIA
drivers/ssb/Kconfig-config SSB_PCMCIAHOST_POSSIBLE
drivers/tty/Kconfig-config IPWIRELESS
drivers/tty/serial/8250/Kconfig-config SERIAL_8250_CS
drivers/usb/host/Kconfig-config USB_SL811_CS
sound/pcmcia/Kconfig-menuconfig SND_PCMCIA

 Arnd



Re: possible deprecation and removal of some old QEMU Arm machine types (pxa2xx, omap, sa1110)

2024-02-14 Thread Arnd Bergmann
On Wed, Feb 14, 2024, at 13:26, Dmitry Baryshkov wrote:
> On Tue, 13 Feb 2024 at 23:22, Linus Walleij  wrote:
>> On Tue, Feb 13, 2024 at 9:12 PM Arnd Bergmann  wrote:
>> > On Tue, Feb 13, 2024, at 16:36, Guenter Roeck wrote:
>> > > On Tue, Feb 13, 2024 at 03:14:21PM +, Peter Maydell wrote:
>>
>> Andrea Adami and Dmitry Eremin-Solenikov did the work in 2017 to
>> modernize it a bit, and Russell helped out. I was under the impression
>> that they only used real hardware though!
>
> I used both Qemu and actual hardware (having collie, poodle, tosa and
> c860 that was easy).
>
> The biggest issue with Zaurus PDAs was that supporting interesting
> parts of the platform (PCMCIA, companion chips) required almost
> rebootstrapping of the corresponding drivers.
> E.g. I had a separate driver for the LoCoMo chip which worked properly
> with the DT systems.
> PCMCIA was a huuuge trouble and it didn't play well at all. The driver
> must be rewritten to use the component framework.

If we want to actually go there, I think the best option for PCMCIA
support is likely to replace the entire "soc_common" pcmcia driver
with a simple drivers/pata/ storage driver and no support for
other cards. There was a driver until commit 38943cbd25a2
("ata: remove palmld pata driver") that could serve as an
template.

  Arnd



Re: possible deprecation and removal of some old QEMU Arm machine types (pxa2xx, omap, sa1110)

2024-02-14 Thread Arnd Bergmann
On Wed, Feb 14, 2024, at 02:27, Aaro Koskinen wrote:
> On Tue, Feb 13, 2024 at 09:11:38PM +0100, Arnd Bergmann wrote:
>
> I'm one of the OMAP1 Linux kernel maintainers, and I have Palm TE which
> I have been using for testing and development (and reporting bugs,
> regressions) along with those other boards you mentioned.
>
> Since I have the real Palm HW, I haven't used QEMU for that particular
> board. But however I use QEMU SX1 support frequently as it's quickest way
> to check if OMAP1 is bootable, and if the basic peripherals are working.
> SX1 is close to Palm/AMS-Delta, and also it's ARMv4T which is rare these
> days. I think it's useful to keep it in QEMU as long there are hardware
> that people use.
>
> So my wish is to keep at least SX1 support in QEMU as long as ARMv4T
> supported in the Linux kernel.

Makes sense. We have a couple of other ARMv4T systems in the kernel
that are still being tested (ep93xx, at91rm9200, clps71xx, imx1,
nspire, integrator/ap), but none of the others have any qemu
support apparently unless you count "-Mintegratorpb -cpu arm925".
All of these are are using DT or getting there (ep93xx), so we'll
probably keep them around for a while.

Similarly, we support a couple of ARMv4 (non-T) targets in the
kernel (footbridge, sa1100, rpc, moxart, gemini), but the only
one with qemu support here is sa1100/collie.

>> >> > > OMAP2 machines:
>> >> > >
>> >> > > n800 Nokia N800 tablet aka. RX-34 (OMAP2420)
>> >> > > n810 Nokia N810 tablet aka. RX-44 (OMAP2420)
>> >> > >
>> >> > I never managed to get those to boot the Linux kernel.
>> 
>> I think Tony still tests these on both hardware and qemu.
>> The platform side here is much more modern than any of the
>> others above since it does use DT and it has enough RAM
>> to be somewhat usable.
>
> I have also these boards (real hardware) and test them frequently with
> mainline Linux. However, QEMU support I haven't used/needed. I recall it
> was a bit buggy, and some changes in mainline made the kernel unbootable.
> Unless Tony needs the support, I guess they are good to go.

Thanks for confirming.

> (Arnd: RAM isn't everything. Some of the OMAP1 boards today are still
> more useful than N800/N810, even with modern bloaty Linux.)

Obviously RAM isn't everything, but the machines with just 32MB
or less do seem very small for real workloads, so I admit I
dismiss them easily. I am curious what you run on those, are
there any embedded distros that maintain good support for 32MB
systems on modern kernel/musl/Xorg/..., or are you using
omething older or highly customized?

 Arnd



Re: possible deprecation and removal of some old QEMU Arm machine types (pxa2xx, omap, sa1110)

2024-02-14 Thread Arnd Bergmann
On Tue, Feb 13, 2024, at 22:21, Linus Walleij wrote:

> The Collie is popular because it is/was easy to get hold of and
> easy to hack. PXA was in candybar phones (right?) which
> are just veritable fortresses and really hard to hack so that is why
> there is no interest (except for the occasional hyperfocused Harald
> Welte), so those are a bit like the iPhones: you *can* boot something
> custom on them, but it won't be easy or quick, and not as fun and
> rewarding.

The PXA machines that we support in kernel and qemu are either
industrial/embedded machines or upgraded versions of the SA1100
PDAs (sharpsl), not phones.

OTOH, these chips came out between 2003 and 2007 so I can
also see how people at the time had already moved away from
using pure PDAs to Symbian or Blackberry phones, so there
might just be fewer PDAs that got sold with PXA than with
SA1100 before there were smartphones.

  Arnd



Re: possible deprecation and removal of some old QEMU Arm machine types (pxa2xx, omap, sa1110)

2024-02-13 Thread Arnd Bergmann
On Tue, Feb 13, 2024, at 16:36, Guenter Roeck wrote:
> On Tue, Feb 13, 2024 at 03:14:21PM +, Peter Maydell wrote:
>> On Mon, 12 Feb 2024 at 14:36, Guenter Roeck  wrote:
>> > On 2/12/24 04:32, Peter Maydell wrote:
>> > > The machines I have in mind are:
>> > >
>> > > PXA2xx machines:
>> > >
>> > > akitaSharp SL-C1000 (Akita) PDA (PXA270)
>> > > borzoi   Sharp SL-C3100 (Borzoi) PDA (PXA270)
>> > > connex   Gumstix Connex (PXA255)
>> > > mainstoneMainstone II (PXA27x)
>> > > spitzSharp SL-C3000 (Spitz) PDA (PXA270)
>> > > terrier  Sharp SL-C3200 (Terrier) PDA (PXA270)
>> > > tosa Sharp SL-6000 (Tosa) PDA (PXA255)
>> > > verdex   Gumstix Verdex Pro XL6P COMs (PXA270)
>> > > z2   Zipit Z2 (PXA27x)
>> > >
>> > I test akita, borzoi, spitz, and terrier. Upstream Linux removed support
>> > for mainstone, tosa, and z2 from the Linux kernel as of version 6.0, so

It was 6.3 (about one year ago).

>> > I am no longer testing those.
>> >
>> > I never managed to boot connex or verdex.

I kept specifically the pxa boards that would be sensible to port
to devicetree because they had qemu support. gumstix verdex is the
one with the most RAM out of those; spitz, sharpsl and their
variants are the ones that some of us still have around.

Robert had working devicetree support for some machines out of tree
that he has not submitted, and presumably not worked on since.

Unless someone starts working on converting the remaining
pxa board files to DT, we can probably remove them after the
next LTS kernel a year from now.

I have no objection to removing them from qemu if that helps,
the existing qemu releases should be sufficient for anyone
trying the conversion.

>> > > OMAP1 machines:
>> > >
>> > > cheetah  Palm Tungsten|E aka. Cheetah PDA (OMAP310)
>> > > sx1  Siemens SX1 (OMAP310) V2
>> > > sx1-v1   Siemens SX1 (OMAP310) V1
>> > >
>> > I test sx1. I don't think I ever tried cheetah, and I could not get sx1-v1
>> > to work.

This is similar. omap1 development is slightly more active
than pxa, but then again they have no DT support today and
are unlikely to ever get there at this point.

Out of the five machines that are still supported in the
kernel, I think three still run on hardware (osk, ams-delta
and nokia770), while the other ones were left there only
for their qemu support. I don't mind removing them from
the kernel as well if they are gone from qemu.

>> > > OMAP2 machines:
>> > >
>> > > n800 Nokia N800 tablet aka. RX-34 (OMAP2420)
>> > > n810 Nokia N810 tablet aka. RX-44 (OMAP2420)
>> > >
>> > I never managed to get those to boot the Linux kernel.

I think Tony still tests these on both hardware and qemu.
The platform side here is much more modern than any of the
others above since it does use DT and it has enough RAM
to be somewhat usable.

On the other hand, this is the one platform actually using
the cursed arm1136r0 core (not counting imx31 and
realview here as I'm not aware of any real users), and
anything that would get us closer to dropping support for
this CPU would be welcome ;-)

>> > > The one SA1110 machine:
>> > >
>> > > collie   Sharp SL-5500 (Collie) PDA (SA-1110)
>> > >
>> > I do test collie.

Adding Linus Walleij and Stefan Lehner to Cc, as they were
interested in modernizing sa1100 back in 2022. If they
are still interested in that, they might want to keep collie
support.

Surprisingly, at the time I removed unused old board files,
there was a lot more interest in sa1100 than in the newer
pxa platform, which I guess wasn't as appealing for
retrocomputing yet.

>> > All the ones I use still boot the latest Linux kernel.
>> >
>> > > Obviously if we can remove all the machines that used a given
>> > > SoC, that's much more effective than if we just delete one or two.
>> > >
>> > > I don't have any test images for the SA1110 or OMAP1 machines,
>> > > so those are the ones I am most keen to be able to drop.
>> > > I do have test images for a few of the pxa2xx and the OMAP2 machines.
>> > >
>> > I don't mind dropping them, just listing what I use for testing the
>> > Linux kernel. I suspect I may be the only "user" of those boards,
>> > though, both in Linux and qemu.
>> 
>> Mmm; there's not much point in both QEMU and the kernel
>> maintaining code that nobody's using. Are you considering
>> dropping support for any of these SoC families from the kernel?
>> 
> Not me personally. Arnd is the one mostly involved in dropping
> support of obsolete hardware from the kernel

These are all clearly among the least maintained boards we have
left in the kernel after the last purge. At the time I asked
for remaining users in 2022, I kept pretty much anything that
had the slightest chance of still being used and I was already
planning another round for the next LTS kernel in early 2023
that would be more 

Re: [RFC PATCH] linux-user: time64: consolidate rt_sigtimedwait_time64 and rt_sigtimedwait

2022-12-20 Thread Arnd Bergmann
On Fri, Dec 16, 2022, at 19:08, Michael Tokarev wrote:
> Both system calls are exactly the same, the only difference is the
> (optional) timespec conversion. Consolidate the two, and check which
> syscall being emulated in the timespec conversion place.
>
> This is just a PoC. But this brings at least 2 questions.
>
> Let's take pselect6 as an example. There are 2 possible pselects
> in there (actually more, but let's focus on just the two):
> TARGET_NR_pselect6 and TARGET_NR_pselect6_time64.  Both are implemented
> in a single do_pselect6() function, which, depending on its "time64"
> argument, will call either target_to_host_timespec64() or
> target_to_host_timespec().
>
> But these functions, in turn, are guarded by a lot of #if
> defined(foo). In particular, in context of pselect6,
> target_to_host_timespec() is guarded by
>  if defined(TARGET_NR_pselect6),
> while target_to_host_timespec64() is guarded by
>  if defined(TARGET_NR_pselect6_time64).
>
> So basically, if just one of the two TARGET_NR_pselect6*
> is defined, one of target_to_host_timespec*() will not
> be defined, but do_pselect6() calls *both* anyway.  In
> other words, both functions must be provided if any of
> the select6 are to be implemented.
>
> But the good news is that these functions
> (target_to_host_timespec*()) are always defined because
> they're needed for some syscalls anyway, like, eg,
> TARGET_NR_semop or TARGET_NR_utimensat or CONFIG_TIMERFD.
>
> It looks like whole gigantic ifdeffery around these two
> functions should be dropped, and a common function provided,
> with an extra argument like time64, to be used in many
> places where this construct is already used. Like in
> this PoC too, which again calls one of the two depending
> on the actual syscall being used.  Or maybe we can even
> combine the two into one with an extra arg like "time64".

The cleanup you suggest here looks correct to me, but as I
mentioned on IRC, I think this should only be done if the
time64 host side bug is also fixed. At the moment, both
the time32 and time64 target syscalls are always translated
into the time32 variant on 32-bit hosts, which is a mistake
for multiple reasons:

- the libc-defined 'timespec' argument that is passed into
  the kernel-defined syscall has the wrong layout, resulting
  in incorrect data. This could be avoided by using
  the kernel-defined __kernel_old_timespec instead, but then

- time64 target arguments needlessly get truncated to
  the time32 range. This happens e.g. when running 64-bit
  target on 32-bit host, but could be easily avoided
  by converting into the time64 types (__kernel_timespec)
  and calling the time64 syscalls whenever those are
  available.

- On modern hosts, only the time64 syscalls are available, so
  e.g. on riscv32 hosts, the current implementation does not
  even build.

The best solution for all of the above I think would be to
have a wrapper for each of the affected syscalls that calls
the time64 interface first and falls back to the time32
version only if that fails, using __kernel_timespec
on the interface instead of libc timespec.

A simpler but less flexible method would be to detect
sizeof(time_t) at compile time and then use either the
time32 or time64 host syscall numbers, whichever goes
with the libc time_t/timespec.


  Arnd



Re: [RFC PATCH] target/arm: update the cortex-a15 MIDR to latest rev

2022-09-06 Thread Arnd Bergmann
On Tue, Sep 6, 2022, at 7:22 PM, Alex Bennée wrote:
>
> index 3099b38e32..59d5278868 100644
> --- a/target/arm/cpu_tcg.c
> +++ b/target/arm/cpu_tcg.c
> @@ -588,7 +588,9 @@ static void cortex_a15_initfn(Object *obj)
>  set_feature(>env, ARM_FEATURE_EL3);
>  set_feature(>env, ARM_FEATURE_PMU);
>  cpu->kvm_target = QEMU_KVM_ARM_TARGET_CORTEX_A15;
> -cpu->midr = 0x412fc0f1;
> +/* r4p0 cpu, not requiring expensive tlb flush errata */
> +cpu->midr = 0x414fc0f0;
> +cpu->revidr = 0x200;
>  cpu->reset_fpsid = 0x410430f0;
>  cpu->isar.mvfr0 = 0x10110222;
>  cpu->isar.mvfr1 = 0x;

This will work correctly with Linux, but after I checked the
Cortex-A15 MPCore r2/r3/r4 Software Developers Errata Notice again,
I think that setting revidr here is not fully correct:

With an r3p3 CPU, bit 9 of revidr (0x200) indicates that no
workaround is necessary, but with r4p0 and higher, this bit
is marked as reserved and both the Documentation and the
Linux source code assume the hardware works correctly.

So I think this should either be 0x413fc0f3/0x200
or 0x414fc0f0/0.

 Arnd



Re: [RFC PATCH 3/3] hw/openrisc: Add the OpenRISC virtual machine

2022-06-07 Thread Arnd Bergmann
On Tue, Jun 7, 2022 at 2:12 PM Stafford Horne  wrote:
> On Tue, Jun 07, 2022 at 11:43:08AM +0100, Peter Maydell wrote:
>
> However, in a followup mail from Laurent we see:
>
>   https://lore.kernel.org/lkml/cb884368-0226-e913-80d2-62d2b7b2e...@vivier.eu/
>
>   The reference document[1] doesn't define the endianness of goldfish.
>
>   [1] 
> https://android.googlesource.com/platform/external/qemu/+/master/docs/GOLDFISH-VIRTUAL-HARDWARE.TXT
>
>
> The documentation does not clearly specify it.  So maybe maybe or1k should 
> just
> be updated on the linux side and add gf_ioread32/gf_iowrite32 big-endian
> accessors.

I don't think it makes any sense to use big-endian for a new
architecture, just use
the default little-endian implementation on the linux side, and change
the qemu code
to have the backward-compatibility hack for m68k while using big-endian for
the rest.

   Arnd



Re: [RFC PATCH 3/3] hw/openrisc: Add the OpenRISC virtual machine

2022-06-07 Thread Arnd Bergmann
On Tue, Jun 7, 2022 at 11:47 AM Stafford Horne  wrote:
> On Tue, Jun 07, 2022 at 10:42:08AM +0200, Arnd Bergmann wrote:

> > Goldfish is a very old platform, as far as I know only the kernel port is 
> > new.
> > I don't know when qemu started shipping goldfish, but changing it now would
> > surely break compatibility with whatever OS the port was originally made 
> > for.
>
> Hi Arnd,
>
> As far as I can tell goldfish in qemu is not very old. There are 3 devices, 2 
> were
> added for the m68k virt machine, and 1 for riscv virt.
>
> $ git lo -- hw/char/goldfish_tty.c
> 2021-11-09 65b4c8c759 Philippe Mathieu-Daudé hw/m68k: Fix typo in SPDX tag
> 2021-03-15 8c6df16ff6 Laurent Vivier   hw/char: add goldfish-tty
>
> $  git lo -- hw/intc/goldfish_pic.c
> 2021-11-09 65b4c8c759 Philippe Mathieu-Daudé hw/m68k: Fix typo in SPDX tag
> 2021-03-15 8785559390 Laurent Vivier   hw/intc: add goldfish-pic

That is much younger than Laurent made it appear, from his earlier explanations
I expected this to have shipped a long time ago and been used in other
OSs to the
point where it cannot be fixed.

> The mips/loongson3_virt machine now also uses the goldfish_rtc.
>
> The problem with the goldfish device models is that they were defined as
> DEVICE_NATIVE_ENDIAN.
>
> $ grep endianness hw/*/goldfish*
> hw/char/goldfish_tty.c:.endianness = DEVICE_NATIVE_ENDIAN,
> hw/intc/goldfish_pic.c:.endianness = DEVICE_NATIVE_ENDIAN,
> hw/rtc/goldfish_rtc.c:.endianness = DEVICE_NATIVE_ENDIAN,
>
> RISC-V is little-endian so when it was added there was no problem with running
> linux goldfish drivers.
>
> MIPS Longson3, added last year, seems to be running as little-endian well, I
> understand MIPS can support both big and little endian. However according to
> this all Loongson cores are little-endian.
>
> https://en.wikipedia.org/wiki/Loongson
>
> As I understand when endianness of the devices in qemu are defined as
> DEVICE_NATIVE_ENDIAN the device endian takes the endian of the target CPU.
>
> This means that MIPS Loongson3 and RISC-V are affectively running as
> little-endian which is what would be expected.

Not really, the definition of DEVICE_NATIVE_ENDIAN in qemu is much less
well-defined than that as I understand, it is just whatever the person adding
support for that CPU thought would be the right one. A lot of CPUs can
run either big-endian or little-endian code, and e.g. on ARM, qemu
DEVICE_NATIVE_ENDIAN is just always little-endian, regardless of
what the CPU runs, while I think on MIPS it would be whatever the CPU
is actually executing.

  Arnd



Re: [RFC PATCH 3/3] hw/openrisc: Add the OpenRISC virtual machine

2022-06-07 Thread Arnd Bergmann
On Tue, Jun 7, 2022 at 10:11 AM Geert Uytterhoeven  wrote:
> On Sun, Jun 5, 2022 at 9:32 AM Stafford Horne  wrote:
> > On Sun, Jun 05, 2022 at 10:58:14AM +0900, Stafford Horne wrote:
> > It might be a good idea to revisit the qemu implementation and make
> > sure that the extra byteswap is only inserted on m68k and not on
> > other targets, but hopefully there are no new targets based on goldfish
> > anymore and we don't need to care.
> >
> > So, it seems that in addition to my patch we would need something in m68k to
> > switch it back to 'native' (big) endian?
> >
> > Looking at the m68k kernel/qemu interface I see:
> >
> > Pre 5.19:
> >(data) <-- kernel(readl / little) <-- m68k qemu (native / big) - RTC/PIC
> >(data) <-- kernel(__raw_readl / big) <-- m68k qemu (native / big) - TTY
> >
> > 5.19:
> >(data) <-- kernel(gf_ioread32 / big) <-- m68k qemu (native / big) - all
> >
> > The new fixes to add gf_ioread32/gf_iowrite32 fix this for goldfish and 
> > m68k.
> > This wouldn't have been an issue for little-endian platforms where 
> > readl/writel
> > were originally used.
> >
> > Why can't m68k switch to little-endian in qemu and the kernel?  The m68k 
> > virt
> > platform is not that old, 1 year? Are there a lot of users that this would 
> > be a big
> > problem?
> >
> > [1] 
> > https://lore.kernel.org/lkml/CAK8P3a1oN8NrUjkh2X8jHQbyz42Xo6GSa=5n0gd6vqcxrjm...@mail.gmail.com/

Goldfish is a very old platform, as far as I know only the kernel port is new.
I don't know when qemu started shipping goldfish, but changing it now would
surely break compatibility with whatever OS the port was originally made for.

  Arnd



Re: Approaches for same-on-same linux-user execve?

2021-10-08 Thread Arnd Bergmann
On Thu, Oct 7, 2021 at 4:32 PM Alex Bennée  wrote:
>
> Are there any other approaches you could take? Which do you think has
> the most merit?

Reading through the ELF loader code in the kernel, I had another idea:
If qemu-user could be turned into a replacement for /lilb/ld.so and act
as an ELF interpreter rather than a binfmt-misc helper, this might address
a lot of the issues automatically.

It would need to be a statically linked binary so it doesn't itself require
an interpreter. It would have to do the job of ld.so in addition to
the emulation, but it could do that by finding the real ld.so somewhere
else and running that in emulation mode. It would also not work at
all for statically linked executables.

Not sure if that makes the tradeoffs better than your other suggestions,
but it seemed worth bringing up.

   Arnd



Re: Approaches for same-on-same linux-user execve?

2021-10-07 Thread Arnd Bergmann
On Thu, Oct 7, 2021 at 4:32 PM Alex Bennée  wrote:
>
> I came across a use-case this week for ARM although this may be also
> applicable to architectures where QEMU's emulation is ahead of the
> hardware currently widely available - for example if you want to
> exercise SVE code on AArch64. When the linux-user architecture is not
> the same as the host architecture then binfmt_misc works perfectly fine.
>
> However in the case you are running same-on-same you can't use
> binfmt_misc to redirect execution to using QEMU because any attempt to
> trap native binaries will cause your userspace to hang as binfmt_misc
> will be invoked to run the QEMU binary needed to run your application
> and a deadlock ensues.

Can you clarify how the code would run in this case? Does qemu-user
still emulate every single instruction, both the compatible and the incompatible
ones, or is the idea here to run as much as possible natively and only
emulate the instructions that are not available natively, using either
SIGILL or searching through the object code for those instructions?

> Trap execve in QEMU linux-user
> --
>
> We could add a flag to QEMU so at the point of execve it manually
> invokes the new process with QEMU, passing on the flag to persist this
> behaviour.

This sounds like the obvious approach if you already do a full
instruction emulation. You'd still need to run the parent process
by calling qemu-user manually, but I suppose you need to do
something like this in any case.

> Add path mask to binfmt_misc
> 
>
> The other option would be to extend binfmt_misc to have a path mask so
> it only applies it's alternative execution scheme to binaries in a
> particular section of the file-system (or maybe some sort of pattern?).

The main downside I see here is that it requires kernel modification, so
it would not work for old kernels.

> Are there any other approaches you could take? Which do you think has
> the most merit?

If we modify binfmt_misc in the kernel, it might be helpful to do it
by extending it with namespace support, so it could be constrained
to a single container without having to do the emulation outside.
Unfortunately that does not solve the problem of preventing the
qemu-user binary from triggering the binfmt_misc lookup.

   Arnd



Re: [PATCH v2 for-6.0?] hw/pci-host/gpex: Don't fault for unmapped parts of MMIO and PIO windows

2021-04-20 Thread Arnd Bergmann
On Tue, Apr 20, 2021 at 1:52 PM Philippe Mathieu-Daudé  wrote:
> On 4/19/21 3:42 PM, Peter Maydell wrote:
> >>
> >> I suspect PCI-ISA bridges to provide an EISA bus.
> >
> > I'm not sure what you mean here -- there isn't an ISA bridge
> > or an EISA bus involved here. This is purely about the behaviour
> > of the memory window the PCI host controller exposes to the CPU
> > (and in particular the window for when a PCI device's BAR is
> > set to "IO" rather than "MMIO"), though we change both here.
>
> I guess I always interpreted the IO BAR were here to address ISA
> backward compatibility. I don't know well PCI so I'll study it
> more. Sorry for my confused comment.

It is mostly for compatibility, but there are many layers of it:

- PCI supports actual ISA/EISA/VLB/PCMCIA/LPC/PC104/... style
  devices behind a bridge, using I/O ports at their native address.

- PCI devices themselves can have fixed I/O ports at well-known
  addresses, e.g. VGA or IDE/ATA adapters

- PCI devices can behave like legacy devices using port I/O
  but use PCI resource assignment to pick an arbitrary port
  number outside of the legacy range

- PCIe can support all of the above by virtue of being backwards
  compatible with PCI and allowing PCI buses behind bridges,
  though port I/O is deprecated here and often not supported at all

The first two are very rare these days, but Linux still support them
in order to run on old hardware, and any driver for these that
assumes a hardcoded port number can crash the kernel if the
PCI host bridge causes an asynchronous external abort or
similar.

   Arnd



Re: [PATCH 3/5] tools/vhost-user-i2c: Add backend driver

2021-03-26 Thread Arnd Bergmann
On Fri, Mar 26, 2021 at 7:01 AM Viresh Kumar  wrote:
> On 25-03-21, 17:16, Arnd Bergmann wrote:
> > On Wed, Mar 24, 2021 at 8:33 AM Viresh Kumar  
> > wrote:
> >
> > > +static uint8_t vi2c_xfer(VuDev *dev, struct i2c_msg *msg)
> > > +{
> > > +VuI2c *i2c = container_of(dev, VuI2c, dev.parent);
> > > +struct i2c_rdwr_ioctl_data data;
> > > +VI2cAdapter *adapter;
> > > +
> > > +adapter = vi2c_find_adapter(i2c, msg->addr);
> > > +if (!adapter) {
> > > +g_printerr("Failed to find adapter for address: %x\n", 
> > > msg->addr);
> > > +return VIRTIO_I2C_MSG_ERR;
> > > +}
> > > +
> > > +data.nmsgs = 1;
> > > +data.msgs = msg;
> > > +
> > > +if (ioctl(adapter->fd, I2C_RDWR, ) < 0) {
> > > +g_printerr("Failed to transfer data to address %x : %d\n", 
> > > msg->addr, errno);
> > > +return VIRTIO_I2C_MSG_ERR;
> > > +}
> >
> > As you found during testing, this doesn't work for host kernels
> > that only implement the SMBUS protocol. Since most i2c clients
> > only need simple register read/write operations, I think you should
> > at least handle the common ones (and one two byte read/write)
> > here to make it more useful.
>
> I am thinking if that is what we really want to support, then
> shouldn't the i2c virtio spec be updated first to support SMBUS type
> transfers as well?

As far as I can tell, all the simple devices should just work, with
I2C_FUNC_SMBUS_READ_BLOCK_DATA being the main exception,
but it seems that has practically no users.

   Arnd



Re: [PATCH 3/5] tools/vhost-user-i2c: Add backend driver

2021-03-26 Thread Arnd Bergmann
On Fri, Mar 26, 2021 at 8:14 AM Viresh Kumar  wrote:
> On 25-03-21, 17:16, Arnd Bergmann wrote:
> > On Wed, Mar 24, 2021 at 8:33 AM Viresh Kumar  
> > wrote:

> >
> > It looks like you are not handling endianness conversion here. As far as I
> > can tell, the protocol requires little-endian data, but the code might
> > run on a big-endian CPU.
>
> I hope this is all we are required to do here, right ?
>
> @@ -442,7 +421,7 @@ static void vi2c_handle_ctrl(VuDev *dev, int qidx)
>  out_hdr = elem->out_sg[0].iov_base;
>
>  /* Bit 0 is reserved in virtio spec */
> -msg.addr = out_hdr->addr >> 1;
> +msg.addr = le16toh(out_hdr->addr) >> 1;
>
>  /* Read Operation */
>  if (elem->out_num == 1 && elem->in_num == 2) {
> @@ -489,7 +468,7 @@ static void vi2c_handle_ctrl(VuDev *dev, int qidx)
>  in_hdr->status = fail_next ? VIRTIO_I2C_MSG_ERR : vi2c_xfer(dev, 
> );
>  if (in_hdr->status == VIRTIO_I2C_MSG_ERR) {
>  /* We need to fail remaining transfers as well */
> -fail_next = out_hdr->flags & VIRTIO_I2C_FLAGS_FAIL_NEXT;
> +fail_next = le32toh(out_hdr->flags) & VIRTIO_I2C_FLAGS_FAIL_NEXT;
>  }
>
> These are the only fields we are passing apart from buf, which goes
> directly to the client device.

I think so, the in_hdr is only one byte long, so it doesn't have an
endianness.

   Arnd



Re: [PATCH 3/5] tools/vhost-user-i2c: Add backend driver

2021-03-25 Thread Arnd Bergmann
On Wed, Mar 24, 2021 at 8:33 AM Viresh Kumar  wrote:

> +static uint8_t vi2c_xfer(VuDev *dev, struct i2c_msg *msg)
> +{
> +VuI2c *i2c = container_of(dev, VuI2c, dev.parent);
> +struct i2c_rdwr_ioctl_data data;
> +VI2cAdapter *adapter;
> +
> +adapter = vi2c_find_adapter(i2c, msg->addr);
> +if (!adapter) {
> +g_printerr("Failed to find adapter for address: %x\n", msg->addr);
> +return VIRTIO_I2C_MSG_ERR;
> +}
> +
> +data.nmsgs = 1;
> +data.msgs = msg;
> +
> +if (ioctl(adapter->fd, I2C_RDWR, ) < 0) {
> +g_printerr("Failed to transfer data to address %x : %d\n", 
> msg->addr, errno);
> +return VIRTIO_I2C_MSG_ERR;
> +}

As you found during testing, this doesn't work for host kernels
that only implement the SMBUS protocol. Since most i2c clients
only need simple register read/write operations, I think you should
at least handle the common ones (and one two byte read/write)
here to make it more useful.

> +static void vi2c_handle_ctrl(VuDev *dev, int qidx)
> +{
> +VuVirtq *vq = vu_get_queue(dev, qidx);
> +struct i2c_msg msg;
> +struct virtio_i2c_out_hdr *out_hdr;
> +struct virtio_i2c_in_hdr *in_hdr;
> +bool fail_next = false;
> +size_t len, in_hdr_len;
> +
> +for (;;) {
> +VuVirtqElement *elem;
> +
> +elem = vu_queue_pop(dev, vq, sizeof(VuVirtqElement));
> +if (!elem) {
> +break;
> +}
> +
> +g_debug("%s: got queue (in %d, out %d)", __func__, elem->in_num,
> +elem->out_num);
> +
> +/* Validate size of out header */
> +if (elem->out_sg[0].iov_len != sizeof(*out_hdr)) {
> +g_warning("%s: Invalid out hdr %zu : %zu\n", __func__,
> +  elem->out_sg[0].iov_len, sizeof(*out_hdr));
> +continue;
> +}
> +
> +out_hdr = elem->out_sg[0].iov_base;
> +
> +/* Bit 0 is reserved in virtio spec */
> +msg.addr = out_hdr->addr >> 1;
> +
> +/* Read Operation */
> +if (elem->out_num == 1 && elem->in_num == 2) {
> +len = elem->in_sg[0].iov_len;
> +if (!len) {
> +g_warning("%s: Read buffer length can't be zero\n", 
> __func__);
> +continue;
> +}


It looks like you are not handling endianness conversion here. As far as I
can tell, the protocol requires little-endian data, but the code might
run on a big-endian CPU.

Jie Deng also pointed out the type differences, but actually handling
them correctly is more important that describing them the right way.

Arnd



Re: [PATCH] hw/pci-host/gpex: Don't fault for unmapped parts of MMIO and PIO windows

2021-03-22 Thread Arnd Bergmann
6On Mon, Mar 22, 2021 at 9:13 PM Peter Maydell  wrote:
>
> Currently the gpex PCI controller implements no special behaviour for
> guest accesses to areas of the PIO and MMIO where it has not mapped
> any PCI devices, which means that for Arm you end up with a CPU
> exception due to a data abort.
>
> Most host OSes expect "like an x86 PC" behaviour, where bad accesses
> like this return -1 for reads and ignore writes.  In the interests of
> not being surprising, make host CPU accesses to these windows behave
> as -1/discard where there's no mapped PCI device.
>
> Reported-by: Dmitry Vyukov 
> Fixes: https://bugs.launchpad.net/qemu/+bug/1918917
> Signed-off-by: Peter Maydell 
> ---
> Not convinced that this is 6.0 material, because IMHO the
> kernel shouldn't be doing this in the first place.
> Do we need to have the property machinery so that old
> virt-5.2 etc retain the previous behaviour ?

I think it would be sufficient to do this for the ioport window,
which is what old-style ISA drivers access. I am not aware
of any driver accessing hardcoded addresses in the mmio
window, at least not without probing io ports first (the VGA
text console would use both).

I checked which SoCs the kernel supports that do require a special
hook to avoid an abort and found these:

arch/arm/mach-bcm/bcm_5301x.c:  hook_fault_code(16 + 6,
bcm5301x_abort_handler, SIGBUS, BUS_OBJERR,
arch/arm/mach-cns3xxx/pcie.c:   hook_fault_code(16 + 6,
cns3xxx_pcie_abort_handler, SIGBUS, 0,
arch/arm/mach-iop32x/pci.c: hook_fault_code(16+6,
iop3xx_pci_abort, SIGBUS, 0, "imprecise external abort");
arch/arm/mach-ixp4xx/common-pci.c:  hook_fault_code(16+6,
abort_handler, SIGBUS, 0,
drivers/pci/controller/dwc/pci-imx6.c:  hook_fault_code(8,
imx6q_pcie_abort_handler, SIGBUS, 0,
drivers/pci/controller/dwc/pci-keystone.c:  hook_fault_code(17,
ks_pcie_fault, SIGBUS, 0,

The first four (bcm5301x, cns3xxx, iop32x and ixp4xx) generate an
'imprecise external abort' (16+6), imx6q has a "precise external abort on
non-linefetch" (8), and keystone in LPAE mode has an "asynchronous external
abort". The only SoC among those that is emulated by qemu to my knowledge
is the i.MX6q in the 'sabrelite' machine.

It's possible that some of these are not caused by a PCI master abort
but some other error condition on the PCI host though. I think most other
PCI implementations either ignore the error or generate an I/O interrupt.

Arnd



[Bug 1918917] Re: synchronous about on accessing unused I/O ports on aarch64

2021-03-12 Thread Arnd Bergmann
That seems correct, and could probably be improved. In this case, I know
that _outb() only writes within the PCI PIO virtual address between
fbfffe80 and fb80, which according to the boot log
(https://gist.githubusercontent.com/dvyukov/084890d9b4aa7cd54f468e652a9b5881/raw/54c12248ff6a4885ba6c530d56b3adad59bc6187/gistfile1.txt)
was mapped to qemu's PCI IO space, using DEVICE_nGnRnE attributes (PIO
space unlike any other MMIO in the kernel uses 'nE'):

[3.973419][T1] pci-host-generic 401000.pcie: host bridge 
/pcie@1000 ranges:
[3.974309][T1] pci-host-generic 401000.pcie:   IO 
0x003eff..0x003eff -> 0x00
[3.975021][T1] pci-host-generic 401000.pcie:  MEM 
0x001000..0x003efe -> 0x001000

So physical address 0x003eff corresponds to virtual address
fbfffe80, which corresponds to logical PIO port number 0. The
serial port in the kernel was probed at port number 0, so the driver
attempts to write to the 8250 compatible UART_IER at address
fbfffe81, which is in register x19.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1918917

Title:
  synchronous about on accessing unused I/O ports on aarch64

Status in QEMU:
  New

Bug description:
  version: QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-6)
  command line: qemu-system-aarch64 \
-machine virt,virtualization=on,graphics=on,usb=on -cpu cortex-a57 -smp 
2 -m 2G \
-device virtio-blk-device,drive=hd0 \
-drive if=none,format=raw,id=hd0,file=buildroot \
-kernel arch/arm64/boot/Image \
-nographic \
-device virtio-rng-pci \
-net user,host=10.0.2.10,hostfwd=tcp::10022-:22 -net 
nic,model=virtio-net-pci \
-append "root=/dev/vda earlyprintk=serial console=ttyAMA0 earlycon"

  I am observing "synchronous external abort" when kernel tries to
  access unused I/O ports (see below), while hardware/qemu should return
  0x in this case.

  This is factored out of this LKML thread where Arnd describes it in more 
details:
  
https://lore.kernel.org/lkml/CAK8P3a0HVu+x0T6+K3d0v1bvU-Pes0F0CSjqm5x=bxfgv5y...@mail.gmail.com/

  Internal error: synchronous external abort: 9650 [#1] PREEMPT SMP
  Dumping ftrace buffer:
 (ftrace buffer empty)
  Modules linked in:
  CPU: 0 PID: 11231 Comm: syz-executor.1 Not tainted 
5.12.0-rc2-syzkaller-00302-g28806e4d9b97 #0
  Hardware name: linux,dummy-virt (DT)
  pstate: 8085 (Nzcv daIf -PAN -UAO -TCO BTYPE=--)
  pc : __raw_writeb arch/arm64/include/asm/io.h:27 [inline]
  pc : _outb include/asm-generic/io.h:501 [inline]
  pc : logic_outb+0x3c/0x114 lib/logic_pio.c:302
  lr : io_serial_out+0x80/0xc0 drivers/tty/serial/8250/8250_port.c:453
  sp : 15f0f980
  x29: 15f0f980 x28: 80001de0005d 
  x27: 80001601df00 x26: 15f0fc90 
  x25: 80001de0 x24: 80001de0 
  x23: 0e27f600 x22:  
  x21: 0002 x20: 0002 
  x19: fbfffe81 x18: 6a678b48 
  x17:  x16:  
  x15: 8000197be810 x14: 1fffe2be1f0e 
  x13: 1fffe2be1e90 x12: 62be1f39 
  x11: 1fffe2be1f38 x10: 62be1f38 
  x9 : dfff8000 x8 : 0003 
  x7 : 0001 x6 : 0004 
  x5 : 15f0f9c0 x4 : dfff8000 
  x3 : 0001 x2 : 13494e6b 
  x1 : fbfffe80 x0 : 00ffbffe 
  Call trace:
   _outb include/asm-generic/io.h:501 [inline]
   logic_outb+0x3c/0x114 lib/logic_pio.c:302
   io_serial_out+0x80/0xc0 drivers/tty/serial/8250/8250_port.c:453
   serial_out drivers/tty/serial/8250/8250.h:118 [inline]
   serial8250_set_THRI drivers/tty/serial/8250/8250.h:138 [inline]
   __start_tx drivers/tty/serial/8250/8250_port.c:1566 [inline]
   serial8250_start_tx+0x338/0x6c0 drivers/tty/serial/8250/8250_port.c:1666
   __uart_start.isra.0+0x10c/0x154 drivers/tty/serial/serial_core.c:127
   uart_start+0xe0/0x210 drivers/tty/serial/serial_core.c:137
   uart_flush_chars+0x10/0x20 drivers/tty/serial/serial_core.c:573
   __receive_buf drivers/tty/n_tty.c:1646 [inline]
   n_tty_receive_buf_common+0x588/0x22c0 drivers/tty/n_tty.c:1739
   n_tty_receive_buf+0x14/0x20 drivers/tty/n_tty.c:1768
   tiocsti drivers/tty/tty_io.c:2317 [inline]
   tty_ioctl+0xed0/0x1aec drivers/tty/tty_io.c:2718
   vfs_ioctl fs/ioctl.c:48 [inline]
   __do_sys_ioctl fs/ioctl.c:753 [inline]
   __se_sys_ioctl fs/ioctl.c:739 [inline]
   __arm64_sys_ioctl+0x120/0x18c fs/ioctl.c:739
   __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
   invoke_syscall arch/arm64/kernel/syscall.c:49 [inline]
   el0_svc_common.constprop.0+0xf0/0x2c0 arch/arm64/kernel/syscall.c:129
   do_el0_svc+0xa4/0xd0 arch/arm64/kernel/syscall.c:168
   el0_svc+0x24/0x34 arch/arm64/kernel/entry-common.c:416
   el0_sync_handler+0x1a4/0x1b0 

[Bug 1918917] Re: synchronous about on accessing unused I/O ports on aarch64

2021-03-12 Thread Arnd Bergmann
My best guess is that the PCI I/O port handling in qemu only returns
data for ports that are connected to an actual device.

In this case, the kernel attempts to access a 8250 serial port at an
address where none exists. While this is also a bug in the kernel, a
real PCI implementation would not cause an external abort but simply
return 'all-ones' (0xff, 0x, 0x) for a read request and
ignore writes. This is something that the kernel relies on for probing
unknown ISA style devices.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1918917

Title:
  synchronous about on accessing unused I/O ports on aarch64

Status in QEMU:
  New

Bug description:
  version: QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-6)
  command line: qemu-system-aarch64 \
-machine virt,virtualization=on,graphics=on,usb=on -cpu cortex-a57 -smp 
2 -m 2G \
-device virtio-blk-device,drive=hd0 \
-drive if=none,format=raw,id=hd0,file=buildroot \
-kernel arch/arm64/boot/Image \
-nographic \
-device virtio-rng-pci \
-net user,host=10.0.2.10,hostfwd=tcp::10022-:22 -net 
nic,model=virtio-net-pci \
-append "root=/dev/vda earlyprintk=serial console=ttyAMA0 earlycon"

  I am observing "synchronous external abort" when kernel tries to
  access unused I/O ports (see below), while hardware/qemu should return
  0x in this case.

  This is factored out of this LKML thread where Arnd describes it in more 
details:
  
https://lore.kernel.org/lkml/CAK8P3a0HVu+x0T6+K3d0v1bvU-Pes0F0CSjqm5x=bxfgv5y...@mail.gmail.com/

  Internal error: synchronous external abort: 9650 [#1] PREEMPT SMP
  Dumping ftrace buffer:
 (ftrace buffer empty)
  Modules linked in:
  CPU: 0 PID: 11231 Comm: syz-executor.1 Not tainted 
5.12.0-rc2-syzkaller-00302-g28806e4d9b97 #0
  Hardware name: linux,dummy-virt (DT)
  pstate: 8085 (Nzcv daIf -PAN -UAO -TCO BTYPE=--)
  pc : __raw_writeb arch/arm64/include/asm/io.h:27 [inline]
  pc : _outb include/asm-generic/io.h:501 [inline]
  pc : logic_outb+0x3c/0x114 lib/logic_pio.c:302
  lr : io_serial_out+0x80/0xc0 drivers/tty/serial/8250/8250_port.c:453
  sp : 15f0f980
  x29: 15f0f980 x28: 80001de0005d 
  x27: 80001601df00 x26: 15f0fc90 
  x25: 80001de0 x24: 80001de0 
  x23: 0e27f600 x22:  
  x21: 0002 x20: 0002 
  x19: fbfffe81 x18: 6a678b48 
  x17:  x16:  
  x15: 8000197be810 x14: 1fffe2be1f0e 
  x13: 1fffe2be1e90 x12: 62be1f39 
  x11: 1fffe2be1f38 x10: 62be1f38 
  x9 : dfff8000 x8 : 0003 
  x7 : 0001 x6 : 0004 
  x5 : 15f0f9c0 x4 : dfff8000 
  x3 : 0001 x2 : 13494e6b 
  x1 : fbfffe80 x0 : 00ffbffe 
  Call trace:
   _outb include/asm-generic/io.h:501 [inline]
   logic_outb+0x3c/0x114 lib/logic_pio.c:302
   io_serial_out+0x80/0xc0 drivers/tty/serial/8250/8250_port.c:453
   serial_out drivers/tty/serial/8250/8250.h:118 [inline]
   serial8250_set_THRI drivers/tty/serial/8250/8250.h:138 [inline]
   __start_tx drivers/tty/serial/8250/8250_port.c:1566 [inline]
   serial8250_start_tx+0x338/0x6c0 drivers/tty/serial/8250/8250_port.c:1666
   __uart_start.isra.0+0x10c/0x154 drivers/tty/serial/serial_core.c:127
   uart_start+0xe0/0x210 drivers/tty/serial/serial_core.c:137
   uart_flush_chars+0x10/0x20 drivers/tty/serial/serial_core.c:573
   __receive_buf drivers/tty/n_tty.c:1646 [inline]
   n_tty_receive_buf_common+0x588/0x22c0 drivers/tty/n_tty.c:1739
   n_tty_receive_buf+0x14/0x20 drivers/tty/n_tty.c:1768
   tiocsti drivers/tty/tty_io.c:2317 [inline]
   tty_ioctl+0xed0/0x1aec drivers/tty/tty_io.c:2718
   vfs_ioctl fs/ioctl.c:48 [inline]
   __do_sys_ioctl fs/ioctl.c:753 [inline]
   __se_sys_ioctl fs/ioctl.c:739 [inline]
   __arm64_sys_ioctl+0x120/0x18c fs/ioctl.c:739
   __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
   invoke_syscall arch/arm64/kernel/syscall.c:49 [inline]
   el0_svc_common.constprop.0+0xf0/0x2c0 arch/arm64/kernel/syscall.c:129
   do_el0_svc+0xa4/0xd0 arch/arm64/kernel/syscall.c:168
   el0_svc+0x24/0x34 arch/arm64/kernel/entry-common.c:416
   el0_sync_handler+0x1a4/0x1b0 arch/arm64/kernel/entry-common.c:432
   el0_sync+0x170/0x180 arch/arm64/kernel/entry.S:699
  Code: d2bfd001 f2df7fe1 f2e1 8b010273 (39000274) 
  ---[ end trace 79cb47219936c254 ]---

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1918917/+subscriptions



Re: [PATCH v3 RESEND] fcntl: Add 32bit filesystem mode

2020-11-18 Thread Arnd Bergmann
On Wed, Nov 18, 2020 at 12:38 AM Linus Walleij  wrote:
>
> On Tue, Oct 13, 2020 at 11:22 AM Dave Martin  wrote:
>
> > >   case F_SETFD:
> > >   err = 0;
> > >   set_close_on_exec(fd, arg & FD_CLOEXEC);
> > > + if (arg & FD_32BIT_MODE)
> > > + filp->f_mode |= FMODE_32BITHASH;
> > > + else
> > > + filp->f_mode &= ~FMODE_32BITHASH;
> >
> > This seems inconsistent?  F_SETFD is for setting flags on a file
> > descriptor.  Won't setting a flag on filp here instead cause the
> > behaviour to change for all file descriptors across the system that are
> > open on this struct file?  Compare set_close_on_exec().
> >
> > I don't see any discussion on whether this should be an F_SETFL or an
> > F_SETFD, though I see F_SETFD was Ted's suggestion originally.
>
> I cannot honestly say I know the semantic difference.
>
> I would ask the QEMU people how a user program would expect
> the flag to behave.

I agree it should either use F_SETFD to set a bit in the fdtable structure
like set_close_on_exec() or it should use F_SETFL to set a bit in
filp->f_mode.

It appears the reason FMODE_32BITHASH is part of  filp->f_mode
is that the only user today is nfsd, which does not have a file
descriptor but only has a struct file. Similarly, the only code that
understands the difference (ext4_readdir()) has no reference to
the file descriptor.

If this becomes an O_DIR32BITHASH flag for F_SETFL,
I suppose it should also be supported by openat2().

   Arnd



Re: [PATCH 08/12] linux-user: Add support for setting alsa timer enhanced read using ioctl

2020-01-17 Thread Arnd Bergmann
On Fri, Jan 17, 2020 at 9:50 PM Aleksandar Markovic
 wrote:

> Alexandre (and Arnd too, or any other person knowledgeable in the area),
>
> I just need to clarify a couple of details with you, please.
>
> Firstly, here is what man page rtc(4) says:
>
> "The /dev/rtc (or /dev/rtc0, /dev/rtc1, etc.) device can be opened
> only once (until it is closed) and it is read-only. On read(2) and
> select(2) the calling process is blocked until the next interrupt from
> that RTC is received. Following the interrupt, the process can read a
> long integer, of which the least significant byte contains a bit mask
> encoding the types of interrupt that occurred, while the remaining 3
> bytes contain the number of interrupts since the last read(2)."
>
> So, it looks read() will always return only 4 bytes of useful info
> (regardless of host being 32-bit/64-bit).

It says "long integer", which is 64-bit on a 64-bit machine.

> My questions are:
>
> - Is the description in man page genuinely accurate?

Starting with linux-2.6.18, there is another possibility: If an
application asks for exactly four bytes on a 64-bit kernel,
it gets the lower four bytes, as it would on a 32-bit kernel.

This is a hack that was introduced for running 32-bit compat
tasks.

For any other size less than sizeof(long), the kernel reports
an EINVAL error, and for anything larger or equal to sizeof(long)
it attempts to output a long word.

> - To me (but I am really an outsider to using RTC in applications),
> this feature (blocking read()/select()) even looks very nice and
> convenient, in all fairness. But I would like to ask you: Is this
> feature used rarely or frequently by other libraries/tools/etc.? In
> other words, is the feature "obscure" or "crucial" part of RTC kernel
> support? Or, something in between?

> - Does MC146818 support this feature?

No idea, I'll leave these for Alexandre or someone else to answer.

  Arnd



Re: [PATCH 08/12] linux-user: Add support for setting alsa timer enhanced read using ioctl

2020-01-16 Thread Arnd Bergmann
On Thu, Jan 16, 2020 at 12:27 PM Aleksandar Markovic
 wrote:
> On Thursday, January 16, 2020, Aleksandar Markovic 
>  wrote:
>> On Wednesday, January 15, 2020, Laurent Vivier  wrote:
>>> Le 15/01/2020 à 20:17, Filip Bozuta a écrit :
>>> > On 15.1.20. 17:37, Arnd Bergmann wrote:
>>> >> On Wed, Jan 15, 2020 at 5:32 PM Laurent Vivier  wrote:
>>> >>> Le 15/01/2020 à 17:18, Arnd Bergmann a écrit :
>>> >>>> On Wed, Jan 15, 2020 at 4:53 PM Filip Bozuta
>>> >>>>  wrote:
>>> >>>>> This patch implements functionality of following ioctl:
>>> >>>>>
>>> >>>>> SNDRV_TIMER_IOCTL_TREAD - Setting enhanced time read
>>> >>>>>
>>> >>>>>  Sets enhanced time read which is used for reading time with
>>> >>>>> timestamps
>>> >>>>>  and events. The third ioctl's argument is a pointer to an
>>> >>>>> 'int'. Enhanced
>>> >>>>>  reading is set if the third argument is different than 0,
>>> >>>>> otherwise normal
>>> >>>>>  time reading is set.
>>> >>>>>
>>> >>>>> Implementation notes:
>>> >>>>>
>>> >>>>>  Because the implemented ioctl has 'int' as its third argument,
>>> >>>>> the
>>> >>>>>  implementation was straightforward.
>>> >>>>>
>>> >>>>> Signed-off-by: Filip Bozuta 
>>> >>>> I think this one is wrong when you go between 32-bit and 64-bit
>>> >>> kernel uses an "int" and "int" is always 32bit.
>>> >>> The problem is most likely with timespec I think.
>>> >>>
>>> >>>> targets, and it gets worse with the kernel patches that just got
>>> >>>> merged for linux-5.5, which extends the behavior to deal with
>>> >>>> 64-bit time_t on 32-bit architectures.
>>> >>>>
>>> >>>> Please have a look at
>>> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?h=80fe7430c70859
>>> >>>>
>>> >>> Yes, we already had the same kind of problem with SIOCGSTAMP and
>>> >>> SIOCGSTAMPNS.
>>> >>>
>>> >>> Do the kernel patches add new ioctl numbers to differentiate 32bit and
>>> >>> 64bit time_t?
>>> >> Yes, though SNDRV_TIMER_IOCTL_TREAD is worse: the ioctl argument
>>> >> is a pure 'int' that decides what format you get when you 'read' from the
>>> >> same file descriptor.
>>> >>
>>> >> For emulating 64-bit on 32-bit kernels, you have to use the new
>>> >> SNDRV_TIMER_IOCTL_TREAD64, and for emulating 32-bit on
>>> >> 64-bit kernels, you probably have to return -ENOTTY to
>>> >> SNDRV_TIMER_IOCTL_TREAD_OLD unless you also want to
>>> >> emulate the read() behavior.
>>> >> When a 32-bit process calls SNDRV_TIMER_IOCTL_TREAD64,
>>> >> you can translate that into the 64-bit
>>> >> SNDRV_TIMER_IOCTL_TREAD_OLD.
>>> >
>>> > Thank you for bringing this up to my attention. Unfortunately i have
>>> > some duties of academic nature in next month so i won't have much time
>>> > fix this bug. I will try to fix this as soon as possible.
>>>
>>> Could you at least to try to have a mergeable series before you have to
>>> stop to work on this?
>>>
>>> You can only manage the case before the change reported by Arnd (I will
>>> manage the new case myself later).

>>
>> Sorry for interjecting myself into this discussion, but I just want to let 
>> you know about some related practicalities.
>>
>> Filip is a student that is from time to time (usually between two exam 
>> seasons) an intern in our company, working 4 hours each work day. He spent 
>> his internship in different teams in prevous years, and, from around 
>> mid-October 2019, was appointed to QEMU team. After some introductory tasks, 
>> he was assigned his main task: linux-user support for RTCs and ALSA timers. 
>> This series is the result of his work, and, to my great pleasure, is 
>> virtually entirely his independant work. I am positive he can complete the 
>> series by himself, even in the light of additional complexit

Re: [PATCH 08/12] linux-user: Add support for setting alsa timer enhanced read using ioctl

2020-01-15 Thread Arnd Bergmann
On Wed, Jan 15, 2020 at 8:17 PM Filip Bozuta  wrote:
> On 15.1.20. 17:37, Arnd Bergmann wrote:
> > On Wed, Jan 15, 2020 at 5:32 PM Laurent Vivier  wrote:
> >> Le 15/01/2020 à 17:18, Arnd Bergmann a écrit :
> >>> On Wed, Jan 15, 2020 at 4:53 PM Filip Bozuta  
> >>> wrote:
> >> Do the kernel patches add new ioctl numbers to differentiate 32bit and
> >> 64bit time_t?
> >
> > Yes, though SNDRV_TIMER_IOCTL_TREAD is worse: the ioctl argument
> > is a pure 'int' that decides what format you get when you 'read' from the
> > same file descriptor.
> >
> > For emulating 64-bit on 32-bit kernels, you have to use the new
> > SNDRV_TIMER_IOCTL_TREAD64, and for emulating 32-bit on
> > 64-bit kernels, you probably have to return -ENOTTY to
> > SNDRV_TIMER_IOCTL_TREAD_OLD unless you also want to
> > emulate the read() behavior.
> > When a 32-bit process calls SNDRV_TIMER_IOCTL_TREAD64,
> > you can translate that into the 64-bit
> > SNDRV_TIMER_IOCTL_TREAD_OLD.

>
> Thank you for bringing this up to my attention. Unfortunately i have
> some duties of academic nature in next month so i won't have much time
> fix this bug. I will try to fix this as soon as possible.

One more thing: I just realized it gets worse when emulating cross-endian,
as then even without calling SNDRV_TIMER_IOCTL_TREAD, reading
data from /dev/snd/timer requires byteswapping the two words.

 Arnd



Re: [PATCH 08/12] linux-user: Add support for setting alsa timer enhanced read using ioctl

2020-01-15 Thread Arnd Bergmann
On Wed, Jan 15, 2020 at 5:32 PM Laurent Vivier  wrote:
>
> Le 15/01/2020 à 17:18, Arnd Bergmann a écrit :
> > On Wed, Jan 15, 2020 at 4:53 PM Filip Bozuta  wrote:
> >>
> >> This patch implements functionality of following ioctl:
> >>
> >> SNDRV_TIMER_IOCTL_TREAD - Setting enhanced time read
> >>
> >> Sets enhanced time read which is used for reading time with timestamps
> >> and events. The third ioctl's argument is a pointer to an 'int'. 
> >> Enhanced
> >> reading is set if the third argument is different than 0, otherwise 
> >> normal
> >> time reading is set.
> >>
> >> Implementation notes:
> >>
> >> Because the implemented ioctl has 'int' as its third argument, the
> >> implementation was straightforward.
> >>
> >> Signed-off-by: Filip Bozuta 
> >
> > I think this one is wrong when you go between 32-bit and 64-bit
>
> kernel uses an "int" and "int" is always 32bit.
> The problem is most likely with timespec I think.
>
> > targets, and it gets worse with the kernel patches that just got
> > merged for linux-5.5, which extends the behavior to deal with
> > 64-bit time_t on 32-bit architectures.
> >
> > Please have a look at
> > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?h=80fe7430c70859
>
> Yes, we already had the same kind of problem with SIOCGSTAMP and
> SIOCGSTAMPNS.
>
> Do the kernel patches add new ioctl numbers to differentiate 32bit and
> 64bit time_t?

Yes, though SNDRV_TIMER_IOCTL_TREAD is worse: the ioctl argument
is a pure 'int' that decides what format you get when you 'read' from the
same file descriptor.

For emulating 64-bit on 32-bit kernels, you have to use the new
SNDRV_TIMER_IOCTL_TREAD64, and for emulating 32-bit on
64-bit kernels, you probably have to return -ENOTTY to
SNDRV_TIMER_IOCTL_TREAD_OLD unless you also want to
emulate the read() behavior.
When a 32-bit process calls SNDRV_TIMER_IOCTL_TREAD64,
you can translate that into the 64-bit
SNDRV_TIMER_IOCTL_TREAD_OLD.

 Arnd



Re: [PATCH 08/12] linux-user: Add support for setting alsa timer enhanced read using ioctl

2020-01-15 Thread Arnd Bergmann
On Wed, Jan 15, 2020 at 4:53 PM Filip Bozuta  wrote:
>
> This patch implements functionality of following ioctl:
>
> SNDRV_TIMER_IOCTL_TREAD - Setting enhanced time read
>
> Sets enhanced time read which is used for reading time with timestamps
> and events. The third ioctl's argument is a pointer to an 'int'. Enhanced
> reading is set if the third argument is different than 0, otherwise normal
> time reading is set.
>
> Implementation notes:
>
> Because the implemented ioctl has 'int' as its third argument, the
> implementation was straightforward.
>
> Signed-off-by: Filip Bozuta 

I think this one is wrong when you go between 32-bit and 64-bit
targets, and it gets worse with the kernel patches that just got
merged for linux-5.5, which extends the behavior to deal with
64-bit time_t on 32-bit architectures.

Please have a look at
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?h=80fe7430c70859

   Arnd



Re: [Qemu-devel] [PATCH v5] linux-user: fix to handle variably sized SIOCGSTAMP with new kernels

2019-07-14 Thread Arnd Bergmann
On Sun, Jul 14, 2019 at 3:54 PM Laurent Vivier  wrote:
>
> From: Daniel P. Berrangé 
>
> The SIOCGSTAMP symbol was previously defined in the
> asm-generic/sockios.h header file. QEMU sees that header
> indirectly via sys/socket.h
>
> In linux kernel commit 0768e17073dc527ccd18ed5f96ce85f9985e9115
> the asm-generic/sockios.h header no longer defines SIOCGSTAMP.
> Instead it provides only SIOCGSTAMP_OLD, which only uses a
> 32-bit time_t on 32-bit architectures.
>
> The linux/sockios.h header then defines SIOCGSTAMP using
> either SIOCGSTAMP_OLD or SIOCGSTAMP_NEW as appropriate. If
> SIOCGSTAMP_NEW is used, then the tv_sec field is 64-bit even
> on 32-bit architectures
>
> To cope with this we must now convert the old and new type from
> the target to the host one.
>
> Signed-off-by: Daniel P. Berrangé 
> Signed-off-by: Laurent Vivier 

Looks good to me now

Reviewed-by: Arnd Bergmann 



Re: [Qemu-devel] [PATCH v4] linux-user: fix to handle variably sized SIOCGSTAMP with new kernels

2019-07-14 Thread Arnd Bergmann
On Sun, Jul 14, 2019 at 12:41 PM Richard Henderson
 wrote:
>
> On 7/12/19 3:55 PM, Arnd Bergmann wrote:
> > glibc will have to create a definition that matches the kernel, which uses
> >
> > struct __kernel_timespec {
> > __s64 tv_sec;
> > __s64 tv_nsec;
> > };
> >
> > As posix requires tv_nsec to be 'long', you need padding between
> > tv_sec and tv_nsec to have a libc definition matching the kernel's
> > binary layout.
>
> Yes, but that's glibc's lookout.  All qemu cares about emulating is the kernel
> interface.  So I think Laurent is right here, in that two reads handle the
> above structure just fine.

But that only works if the structure defined by qemu matches the kernel's.

The structure that Laurent proposed

struct target_timeval64 {
abi_llong tv_sec;
abi_long tv_usec;
};

is not compatible with the kernel or the glibc structure.

  Arnd



Re: [Qemu-devel] [PATCH v4] linux-user: fix to handle variably sized SIOCGSTAMP with new kernels

2019-07-12 Thread Arnd Bergmann
On Fri, Jul 12, 2019 at 3:50 PM Laurent Vivier  wrote:
> Le 12/07/2019 à 15:36, Arnd Bergmann a écrit :
> >> We don't do memcopy() but we set each field one by one, so the padding 
> >> doesn't
> >> seem needed if we define correctly the user structure:
> >>
> >> struct target_timeval64 {
> >> abi_llong tv_sec;
> >> abi_long tv_usec;
> >> };
> >>
> >> and do something like:
> >>
> >> struct target_timeval64 *target_tv;
> >> struct timeval *host_tv;
> >> ...
> >> __put_user(host_tv->tv_sec, _tv->tv_sec);
> >> __put_user(host_tv->tv_usec, _tv->tv_usec);
> >> ...
> >
> > That still seems wrong. The user application has a definition
> > of 'timeval' that contains the padding, so your definition has
> > to match that.
>
> I don't find this definition with the padding. Where it is defined?
>
> We are at the syscall level, so structures are the ones provided by the
> target to the syscall, and they can be converted by the libc if the one
> from the user space differs.

glibc will have to create a definition that matches the kernel, which uses

struct __kernel_timespec {
__s64 tv_sec;
__s64 tv_nsec;
};

As posix requires tv_nsec to be 'long', you need padding between
tv_sec and tv_nsec to have a libc definition matching the kernel's
binary layout.

  Arnd



Re: [Qemu-devel] [PATCH v4] linux-user: fix to handle variably sized SIOCGSTAMP with new kernels

2019-07-12 Thread Arnd Bergmann
On Fri, Jul 12, 2019 at 3:23 PM Laurent Vivier  wrote:
>
> Le 12/07/2019 à 14:47, Arnd Bergmann a écrit :
> > On Fri, Jul 12, 2019 at 2:17 PM Laurent Vivier  wrote:
> >>
> >> Le 11/07/2019 à 23:05, Arnd Bergmann a écrit :
> >>> On Thu, Jul 11, 2019 at 7:32 PM Laurent Vivier  wrote:
> >>>
> >>>>
> >>>> Notes:
> >>>> v4: [lv] timeval64 and timespec64 are { long long , long }
> >>>
> >>>>
> >>>> +STRUCT(timeval64, TYPE_LONGLONG, TYPE_LONG)
> >>>> +
> >>>> +STRUCT(timespec64, TYPE_LONGLONG, TYPE_LONG)
> >>>> +
> >>>
> >>> This still doesn't look right, see my earlier comment about padding
> >>> on big-endian architectures.
> >>>
> >>> Note that the in-kernel 'timespec64' is different from the uapi
> >>> '__kernel_timespec' exported by the kernel. I also still think you may
> >>> need to convert between SIOCGSTAMP_NEW and SIOCGSTAMP_OLD,
> >>> e.g. when emulating a 32-bit riscv process (which only use
> >>> SIOCGSTAMP_NEW) on a kernel that only understands
> >>> SIOCGSTAMP_OLD.
> >>
> >> I agree.
> >> I'm preparing a patch always using SIOCGSTAMP and SIOCGSTAMPNS on the
> >> host (converting the structure when needed).
> >
> > That in turn would have the problem of breaking in 2038 when the
> > timestamp overflows.
>
> No, because SIOCGSTAMP and SIOCGSTAMPNS are aliased to the _NEW versions
> on system supporting them (yes, we need to rebuild the binary, but we have 19
> years to do that).
>
> #define SIOCGSTAMP  ((sizeof(struct timeval))  == 8 ? \
>  SIOCGSTAMP_OLD   : SIOCGSTAMP_NEW)
> #define SIOCGSTAMPNS((sizeof(struct timespec)) == 8 ? \
>  SIOCGSTAMPNS_OLD : SIOCGSTAMPNS_NEW)

Right, makes sense.

> >
> >> I've added the SH4 variant.> What is special about SH4?
>
> The definition of _OLD is different:
>
> #define SIOCGSTAMP_OLD  _IOR('s', 100, struct timeval) /* Get stamp (timeval) 
> */
> #define SIOCGSTAMPNS_OLD _IOR('s', 101, struct timespec) /* Get stamp 
> (timespec) */

Ah, that one.


> > No, you don't need to swap. The difference is only in the padding.
> > Since the kernel uses a 64/64 structure here, and user space
> > may have use 'long tv_nsec', you need to add the padding on
> > the correct side, like
> >
> > struct timeval64 {
> >long long tv_sec;
> > #if 32bit && big-endian
> >long :32; /* anonymous padding */
> > #endif
> >suseconds_t tv_usec;
> > #if (32bit && little-endian) || sparc64
> >long :32;
> > #endif
> > };
>
> We don't do memcopy() but we set each field one by one, so the padding doesn't
> seem needed if we define correctly the user structure:
>
> struct target_timeval64 {
> abi_llong tv_sec;
> abi_long tv_usec;
> };
>
> and do something like:
>
> struct target_timeval64 *target_tv;
> struct timeval *host_tv;
> ...
> __put_user(host_tv->tv_sec, _tv->tv_sec);
> __put_user(host_tv->tv_usec, _tv->tv_usec);
> ...

That still seems wrong. The user application has a definition
of 'timeval' that contains the padding, so your definition has
to match that.

   Arnd



Re: [Qemu-devel] [PATCH v4] linux-user: fix to handle variably sized SIOCGSTAMP with new kernels

2019-07-12 Thread Arnd Bergmann
On Fri, Jul 12, 2019 at 2:17 PM Laurent Vivier  wrote:
>
> Le 11/07/2019 à 23:05, Arnd Bergmann a écrit :
> > On Thu, Jul 11, 2019 at 7:32 PM Laurent Vivier  wrote:
> >
> >>
> >> Notes:
> >> v4: [lv] timeval64 and timespec64 are { long long , long }
> >
> >>
> >> +STRUCT(timeval64, TYPE_LONGLONG, TYPE_LONG)
> >> +
> >> +STRUCT(timespec64, TYPE_LONGLONG, TYPE_LONG)
> >> +
> >
> > This still doesn't look right, see my earlier comment about padding
> > on big-endian architectures.
> >
> > Note that the in-kernel 'timespec64' is different from the uapi
> > '__kernel_timespec' exported by the kernel. I also still think you may
> > need to convert between SIOCGSTAMP_NEW and SIOCGSTAMP_OLD,
> > e.g. when emulating a 32-bit riscv process (which only use
> > SIOCGSTAMP_NEW) on a kernel that only understands
> > SIOCGSTAMP_OLD.
>
> I agree.
> I'm preparing a patch always using SIOCGSTAMP and SIOCGSTAMPNS on the
> host (converting the structure when needed).

That in turn would have the problem of breaking in 2038 when the
timestamp overflows.

> I've added the SH4 variant.

What is special about SH4?

> I've added the sparc64 variant too: does it means sparc64 use the same
> structure internally for the OLD and NEW version?

I had to look it up in the code. Yes, it does, timeval is 64/32/pad32,
timespec is 64/64 in both OLD and NEW for sparc.

> What about sparc 32bit?

sparc32 is like all other 32-bit targets: OLD uses 32/32, and
new uses 64/64.

> For big-endian, I didn't find in the kernel where the difference is
> managed: a byte swapping of the 64bit value is not enough?

No, you don't need to swap. The difference is only in the padding.
Since the kernel uses a 64/64 structure here, and user space
may have use 'long tv_nsec', you need to add the padding on
the correct side, like

struct timeval64 {
   long long tv_sec;
#if 32bit && big-endian
   long :32; /* anonymous padding */
#endif
   suseconds_t tv_usec;
#if (32bit && little-endian) || sparc64
   long :32;
#endif
};

 Arnd



Re: [Qemu-devel] [PATCH v4] linux-user: fix to handle variably sized SIOCGSTAMP with new kernels

2019-07-11 Thread Arnd Bergmann
On Thu, Jul 11, 2019 at 7:32 PM Laurent Vivier  wrote:

>
> Notes:
> v4: [lv] timeval64 and timespec64 are { long long , long }

>
> +STRUCT(timeval64, TYPE_LONGLONG, TYPE_LONG)
> +
> +STRUCT(timespec64, TYPE_LONGLONG, TYPE_LONG)
> +

This still doesn't look right, see my earlier comment about padding
on big-endian architectures.

Note that the in-kernel 'timespec64' is different from the uapi
'__kernel_timespec' exported by the kernel. I also still think you may
need to convert between SIOCGSTAMP_NEW and SIOCGSTAMP_OLD,
e.g. when emulating a 32-bit riscv process (which only use
SIOCGSTAMP_NEW) on a kernel that only understands
SIOCGSTAMP_OLD.

 Arnd



Re: [Qemu-devel] [PATCH v2] linux-user: fix to handle variably sized SIOCGSTAMP with new kernels

2019-06-17 Thread Arnd Bergmann
On Mon, Jun 17, 2019 at 3:11 PM Daniel P. Berrangé  wrote:
>
> The SIOCGSTAMP symbol was previously defined in the
> asm-generic/sockios.h header file. QEMU sees that header
> indirectly via sys/socket.h
>
> In linux kernel commit 0768e17073dc527ccd18ed5f96ce85f9985e9115
> the asm-generic/sockios.h header no longer defines SIOCGSTAMP.
> Instead it provides only SIOCGSTAMP_OLD, which only uses a
> 32-bit time_t on 32-bit architectures.

This is a bit misleading, as we still define SIOCGSTAMP in the
right place. asm-generic/sockios.h should not be used by normal
user space.

> The linux/sockios.h header then defines SIOCGSTAMP using
> either SIOCGSTAMP_OLD or SIOCGSTAMP_NEW as appropriate. If
> SIOCGSTAMP_NEW is used, then the tv_sec field is 64-bit even
> on 32-bit architectures
>
> To cope with this we must now define two separate syscalls,
> with corresponding old and new sizes, as well as including
> the new linux/sockios.h header.

The overall concept seems right. A few more comments on
details that may have gone wrong here. I'm not familiar with
the qemu-user implementation, so it's mostly guesswork
on my end.

>IOCTL(SIOCGIWNAME, IOC_W | IOC_R, MK_PTR(MK_STRUCT(STRUCT_char_ifreq)))
>IOCTL(SIOCSPGRP, IOC_W, MK_PTR(TYPE_INT)) /* pid_t */
>IOCTL(SIOCGPGRP, IOC_R, MK_PTR(TYPE_INT)) /* pid_t */
> +
> +#ifdef SIOCGSTAMP_OLD
> +  IOCTL(SIOCGSTAMP_OLD, IOC_R, MK_PTR(MK_STRUCT(STRUCT_timeval)))
> +#else
>IOCTL(SIOCGSTAMP, IOC_R, MK_PTR(MK_STRUCT(STRUCT_timeval)))
> +#endif
> +#ifdef SIOCGSTAMPNS_OLD
> +  IOCTL(SIOCGSTAMPNS_OLD, IOC_R, MK_PTR(MK_STRUCT(STRUCT_timespec)))
> +#else
>IOCTL(SIOCGSTAMPNS, IOC_R, MK_PTR(MK_STRUCT(STRUCT_timespec)))
> +#endif
> +#ifdef SIOCGSTAMP_NEW
> +  IOCTL(SIOCGSTAMP_NEW, IOC_R, MK_PTR(MK_STRUCT(STRUCT_timeval64)))
> +#endif
> +#ifdef SIOCGSTAMPNS_NEW
> +  IOCTL(SIOCGSTAMPNS_NEW, IOC_R, MK_PTR(MK_STRUCT(STRUCT_timespec64)))
> +#endif

Is timespec64 a qemu type? How is it defined?

> diff --git a/linux-user/syscall_defs.h b/linux-user/syscall_defs.h
> index 7f141f699c..7830b600e7 100644
> --- a/linux-user/syscall_defs.h
> +++ b/linux-user/syscall_defs.h
> @@ -750,6 +750,11 @@ struct target_pollfd {
>
>  #define TARGET_SIOCGSTAMP  0x8906  /* Get stamp (timeval) */
>  #define TARGET_SIOCGSTAMPNS0x8907  /* Get stamp (timespec) */
> +#define TARGET_SIOCGSTAMP_OLD   0x8906  /* Get stamp (timeval) */
> +#define TARGET_SIOCGSTAMPNS_OLD 0x8907  /* Get stamp (timespec) */

Note that these types are architecture specific. It seems that only
one architecture is left that has its own definitions though, so this
is only broken on arch/sh for current linux (and remains broken).

Future architectures, including 32-bit risc-v should only have the _NEW
version and not support SIOCGSTAMP_OLD at all.

When emulating risc-v user space on old kernels (pre-5.1), you may need to
translate the ioctl command and all system calls that take a 64-bit time_t into
the variants with a 32-bit time_t on the way into the kernel, and then back.

Similarly, running an old user binary on a riscv32 machine, you may
need to do the reverse translation.

> +#define TARGET_SIOCGSTAMP_NEW   TARGET_IOC(TARGET_IOC_READ, 's', 6, 
> sizeof(long long) + sizeof(long)) /* Get stamp (timeval64) */
> +#define TARGET_SIOCGSTAMPNS_NEW TARGET_IOC(TARGET_IOC_READ, 's', 7, 
> sizeof(long long) + sizeof(long)) /* Get stamp (timespec64) */

"sizeof(long long) + sizeof(long)" is not always the size of the argument to
TARGET_SIOCGSTAMP{NS}_NEW. On 32-bit architectures, the size is
two 64-bit values. sparc64 is potentially another special case, as 'struct
timeval is 'long + int' there (12 bytes).

On big-endian architectures, the nanoseconds are returned in the last
four bytes of the 16-byte structure.

>  /* Networking ioctls */
>  #define TARGET_SIOCADDRT   0x890B  /* add routing table entry */
> diff --git a/linux-user/syscall_types.h b/linux-user/syscall_types.h
> index b98a23b0f1..de4c5a5b6f 100644
> --- a/linux-user/syscall_types.h
> +++ b/linux-user/syscall_types.h
> @@ -20,6 +20,10 @@ STRUCT(timeval,
>  STRUCT(timespec,
> MK_ARRAY(TYPE_LONG, 2))
>
> +STRUCT(timeval64, TYPE_LONGLONG, TYPE_LONG)
> +
> +STRUCT(timespec64, TYPE_LONGLONG, TYPE_LONG)

Same here.

Arnd



[Qemu-devel] [PATCH] headers: fix linux/mod_devicetable.h inclusions

2018-07-09 Thread Arnd Bergmann
A couple of drivers produced build errors after the mod_devicetable.h
header was split out from the platform_device one, e.g.

drivers/media/platform/davinci/vpbe_osd.c:42:40: error: array type has 
incomplete element type 'struct platform_device_id'
drivers/media/platform/davinci/vpbe_venc.c:42:40: error: array type has 
incomplete element type 'struct platform_device_id'

This adds the inclusion where needed.

Fixes: ac3167257b9f ("headers: separate linux/mod_devicetable.h from 
linux/platform_device.h")
Signed-off-by: Arnd Bergmann 
---
 drivers/firmware/qemu_fw_cfg.c | 1 +
 drivers/media/platform/davinci/vpbe_osd.c  | 1 +
 drivers/media/platform/davinci/vpbe_venc.c | 1 +
 drivers/media/platform/qcom/venus/vdec.c   | 1 +
 drivers/media/platform/qcom/venus/venc.c   | 1 +
 drivers/media/platform/sti/hva/hva-v4l2.c  | 1 +
 drivers/platform/x86/intel_punit_ipc.c | 1 +
 7 files changed, 7 insertions(+)

diff --git a/drivers/firmware/qemu_fw_cfg.c b/drivers/firmware/qemu_fw_cfg.c
index 14fedbeca724..039e0f91dba8 100644
--- a/drivers/firmware/qemu_fw_cfg.c
+++ b/drivers/firmware/qemu_fw_cfg.c
@@ -28,6 +28,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/media/platform/davinci/vpbe_osd.c 
b/drivers/media/platform/davinci/vpbe_osd.c
index 7f610320426d..c551a25d90d9 100644
--- a/drivers/media/platform/davinci/vpbe_osd.c
+++ b/drivers/media/platform/davinci/vpbe_osd.c
@@ -18,6 +18,7 @@
  *
  */
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/media/platform/davinci/vpbe_venc.c 
b/drivers/media/platform/davinci/vpbe_venc.c
index ba157827192c..ddcad7b3e76c 100644
--- a/drivers/media/platform/davinci/vpbe_venc.c
+++ b/drivers/media/platform/davinci/vpbe_venc.c
@@ -11,6 +11,7 @@
  * GNU General Public License for more details.
  */
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/media/platform/qcom/venus/vdec.c 
b/drivers/media/platform/qcom/venus/vdec.c
index f89a91d43cc9..d4e23c7df347 100644
--- a/drivers/media/platform/qcom/venus/vdec.c
+++ b/drivers/media/platform/qcom/venus/vdec.c
@@ -14,6 +14,7 @@
  */
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/media/platform/qcom/venus/venc.c 
b/drivers/media/platform/qcom/venus/venc.c
index f7a87a3dbb46..0522cf202b75 100644
--- a/drivers/media/platform/qcom/venus/venc.c
+++ b/drivers/media/platform/qcom/venus/venc.c
@@ -14,6 +14,7 @@
  */
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/media/platform/sti/hva/hva-v4l2.c 
b/drivers/media/platform/sti/hva/hva-v4l2.c
index 15080cb00fa7..5a807c7c5e79 100644
--- a/drivers/media/platform/sti/hva/hva-v4l2.c
+++ b/drivers/media/platform/sti/hva/hva-v4l2.c
@@ -6,6 +6,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/platform/x86/intel_punit_ipc.c 
b/drivers/platform/x86/intel_punit_ipc.c
index b5b890127479..f1afc0ebbc68 100644
--- a/drivers/platform/x86/intel_punit_ipc.c
+++ b/drivers/platform/x86/intel_punit_ipc.c
@@ -12,6 +12,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
-- 
2.9.0




[Qemu-devel] [PATCH] fw_cfg: avoid unused function warning

2018-02-28 Thread Arnd Bergmann
The newly introduced fw_cfg_dma_transfer() function is unused when
CONFIG_CRASH_CORE is disabled:

drivers/firmware/qemu_fw_cfg.c:89:16: error: 'fw_cfg_dma_transfer' defined but 
not used [-Werror=unused-function]
 static ssize_t fw_cfg_dma_transfer(void *address, u32 length, u32 control)

This moves it into the #ifdef section that hides its caller to avoid the
warning.

Fixes: 47e78bfb5426 ("fw_cfg: write vmcoreinfo details")
Signed-off-by: Arnd Bergmann <a...@arndb.de>
---
 drivers/firmware/qemu_fw_cfg.c | 60 +-
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/drivers/firmware/qemu_fw_cfg.c b/drivers/firmware/qemu_fw_cfg.c
index 3015e77aebca..f002bb40519b 100644
--- a/drivers/firmware/qemu_fw_cfg.c
+++ b/drivers/firmware/qemu_fw_cfg.c
@@ -66,6 +66,36 @@ static void fw_cfg_sel_endianness(u16 key)
iowrite16(key, fw_cfg_reg_ctrl);
 }
 
+/* read chunk of given fw_cfg blob (caller responsible for sanity-check) */
+static ssize_t fw_cfg_read_blob(u16 key,
+   void *buf, loff_t pos, size_t count)
+{
+   u32 glk = -1U;
+   acpi_status status;
+
+   /* If we have ACPI, ensure mutual exclusion against any potential
+* device access by the firmware, e.g. via AML methods:
+*/
+   status = acpi_acquire_global_lock(ACPI_WAIT_FOREVER, );
+   if (ACPI_FAILURE(status) && status != AE_NOT_CONFIGURED) {
+   /* Should never get here */
+   WARN(1, "fw_cfg_read_blob: Failed to lock ACPI!\n");
+   memset(buf, 0, count);
+   return -EINVAL;
+   }
+
+   mutex_lock(_cfg_dev_lock);
+   fw_cfg_sel_endianness(key);
+   while (pos-- > 0)
+   ioread8(fw_cfg_reg_data);
+   ioread8_rep(fw_cfg_reg_data, buf, count);
+   mutex_unlock(_cfg_dev_lock);
+
+   acpi_release_global_lock(glk);
+   return count;
+}
+
+#ifdef CONFIG_CRASH_CORE
 static inline bool fw_cfg_dma_enabled(void)
 {
return (fw_cfg_rev & FW_CFG_VERSION_DMA) && fw_cfg_reg_dma;
@@ -124,36 +154,6 @@ static ssize_t fw_cfg_dma_transfer(void *address, u32 
length, u32 control)
return ret;
 }
 
-/* read chunk of given fw_cfg blob (caller responsible for sanity-check) */
-static ssize_t fw_cfg_read_blob(u16 key,
-   void *buf, loff_t pos, size_t count)
-{
-   u32 glk = -1U;
-   acpi_status status;
-
-   /* If we have ACPI, ensure mutual exclusion against any potential
-* device access by the firmware, e.g. via AML methods:
-*/
-   status = acpi_acquire_global_lock(ACPI_WAIT_FOREVER, );
-   if (ACPI_FAILURE(status) && status != AE_NOT_CONFIGURED) {
-   /* Should never get here */
-   WARN(1, "fw_cfg_read_blob: Failed to lock ACPI!\n");
-   memset(buf, 0, count);
-   return -EINVAL;
-   }
-
-   mutex_lock(_cfg_dev_lock);
-   fw_cfg_sel_endianness(key);
-   while (pos-- > 0)
-   ioread8(fw_cfg_reg_data);
-   ioread8_rep(fw_cfg_reg_data, buf, count);
-   mutex_unlock(_cfg_dev_lock);
-
-   acpi_release_global_lock(glk);
-   return count;
-}
-
-#ifdef CONFIG_CRASH_CORE
 /* write chunk of given fw_cfg blob (caller responsible for sanity-check) */
 static ssize_t fw_cfg_write_blob(u16 key,
 void *buf, loff_t pos, size_t count)
-- 
2.9.0




Re: [Qemu-devel] [kernel PATCH v2 2/2] devicetree: document ARM bindings for QEMU's Firmware Config interface

2014-12-10 Thread Arnd Bergmann
On Friday 05 December 2014 19:08:46 Peter Maydell wrote:
 On 5 December 2014 at 19:04, Laszlo Ersek ler...@redhat.com wrote:
  On 12/05/14 19:57, Peter Maydell wrote:
  On 30 November 2014 at 16:51, Laszlo Ersek ler...@redhat.com wrote:
  +Example:
  +
  +/ {
  +   #size-cells = 0x2;
  +   #address-cells = 0x2;
  +
  +   fw-cfg@902 {
  +   compatible = qemu,fw-cfg-mmio;
  +   reg = 0x0 0x902 0x0 0x1000;
  +   };
 
  I've just noticed that this example claims a register
  region size of 0x1000. This seems wrong, because the
  underlying device doesn't have a register range that
  big. Surely this should be a size of 0x3 ?
 
  Arnd said I should round up the region to 0x1000.
 
 Right; I replied here as a reasonable place to do so
 on an email with the device-tree folk in cc.
 
  http://thread.gmane.org/gmane.linux.drivers.devicetree/101173/focus=101176
 
 Arnd, what was your reasoning in requesting the round-up?
 I would have expected that a dtb with an overlarge range
 is telling the guest it can access things that in fact
 just aren't there (ie the equivalent of unmapped space which
 on h/w would give you an external abort/decode error).

I was expecting that qemu would be allowed to put additional
registers in there for future extensions. My comment was mainly
directed at having two distinct but consecutive register ranges,
which can be better expressed by having a longer one.

As long as qemu knows exactly how long the register area is
(and it better should know), having the correct length in there
is probably best.


Arnd



Re: [Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface

2014-11-28 Thread Arnd Bergmann
On Friday 28 November 2014 13:26:44 Laszlo Ersek wrote:
 +Example:
 +
 +/ {
 +   #size-cells = 0x2;
 +   #address-cells = 0x2;
 +
 +   fw-cfg@902 {
 +   reg = 0x0 0x902 0x0 0x2 0x0 0x9020002 0x0 0x1;
 +   compatible = fw-cfg,mmio;
 +   };
 +};
 

fw-cfg is not a valid vendor string. Are you emulating an actual piece
of hardware here or is this something that comes from qemu?
How about using qemu,fwcfg as the compatible string, and documenting
qemu as an official vendor string in 
Documentation/devicetree/bindings/vendor-prefixes.txt

We don't normally list contiguous registers separately. Maybe just round
up to one page and make the register property

reg = 0x0 0x902 0x0 0x1000;

Finally, isn't there also an interrupt associated with this virtual device?

Arnd



Re: [Qemu-devel] [Linaro-acpi] [RFC PATCH 0/7] hw/arm/virt: Dynamic ACPI v5.1 table generation

2014-11-12 Thread Arnd Bergmann
On Wednesday 12 November 2014 10:56:40 Mark Rutland wrote:
 On Wed, Nov 12, 2014 at 09:08:55AM +, Claudio Fontana wrote:
  On 11.11.2014 16:29, Mark Rutland wrote:
  
  I tend to mostly agree with this, we might look for alternative
  solutions for speeding up guest startup in the future but in general
  if getting ACPI in the guest for ARM64 requires also getting UEFI,
  then I can personally live with that, especially if we strive to have
  the kind of optimized virtualized UEFI you mention.
 
 Given that UEFI will be required for other guests (e.g. if you want to
 boot a distribution's ISO image), I hope that virtualised UEFI will see
 some optimisation work.

I think the requirement it just for KVM to provide something that
behaves exactly like UEFI, it doesn't have to be the full Tianocore
implementation if it's easier to reimplement the boot interface.

  As mentioned by others, I'd rather see an implementation of ACPI in
  QEMU which learns from the experience of X86 (and possibly shares some
  code if possible), rather than going in a different direction by
  creating device trees first, and then converting them to ACPI tables
  somewhere in the firmware, just because device trees are already
  there, for the reasons which have already been mentioned before by
  Igor and others.
 
 For the features which ACPI provides which device trees do not (e.g. the
 dynamic addition and removal of memory and CPUs), there will need to be
 some sort of interface between QEMU and the ACPI implementation. That's
 already outside of the realm of DT, so as previously mentioned a simple
 conversion doesn't cover the general case.

I think we need to support the low-level interfaces in the kernel for
this anyway, we should not have to use ACPI just to do memory and CPU
hotplugging in KVM, that would be silly. If ACPI is present, it can
provide a wrapper for the same interface, but KVM should not need to
be aware of the fact that ACPI is used in the guest, after it has
passed the initial ACPI blob to the kernel.

 I think any ACPI implemenation for a hypervisor should provide a
 demonstrable useful feature (e.g. hot-add of CPUs) before merging so we
 know the infrastructure is suitable.

  I wouldn't want for ACPI to be sort of supported in QEMU, but with a
  limited functionality which makes it not fully useful in practice. I'd
  rather see it as a first class citizen instead, including the ability
  to run AML code.
 
 I agree that there's no point in having ACPI in a guest unless it
 provides something which dt does not. I don't know how it should be
 structured to provide those useful features.

I see it the opposite way: we shouldn't have to use ACPI just to make
use of some feature in Linux, the only reason why you'd want ACPI support
in KVM is to be able to run Windows. It makes sense for the ACPI
implementation to be compatible with the Linux ACPI code as well so
we can test it better.

Arnd



Re: [Qemu-devel] [Linaro-acpi] [RFC PATCH 0/7] hw/arm/virt: Dynamic ACPI v5.1 table generation

2014-11-12 Thread Arnd Bergmann
On Wednesday 12 November 2014 13:27:14 Paolo Bonzini wrote:
 On 12/11/2014 13:18, Mark Rutland wrote:
  On Wed, Nov 12, 2014 at 11:48:27AM +, Paolo Bonzini wrote:
  Perhaps you could treat it as a shared level-triggered interrupt in DT?
   I don't know.
  
  Putting an interrupt in DT is trivial. The hard part is the rest of the
  interface, which so far there is no specification for.
 
 Have you looked at docs/specs/acpi_{cpu,mem}_hotplug.txt?  Writing a DT
 binding for it is trivial too.  Or are we talking about two different
 things?

Interesting. I agree that doing a DT binding for these will be trivial,
and the implementation in Linux should also be straightforward, thanks
for pointing these out.

However, it seems that the implementation that qemu uses is incompatible
with ARM64, since GPE is not part of the ACPI hardware reduced mode.

Arnd



Re: [Qemu-devel] [Linaro-acpi] [RFC PATCH 0/7] hw/arm/virt: Dynamic ACPI v5.1 table generation

2014-11-12 Thread Arnd Bergmann
On Wednesday 12 November 2014 16:01:14 Claudio Fontana wrote:
 
 Would the last step you mention allow for guests to start with an already 
 existing PCI interrupt map
 and the BAR registers preprogrammed to point to somewhere sane?
 
 I ask because on OSv at the moment, the situation is that for x86 we don't 
 need to reprogram anything on PCI,
 as everything is already nicely set up by the time the guest starts, and thus 
 the BAR addresses can be read directly.
 On ARM we have to reprogram the BARs manually for all devices.
 
 Couldn't we give an easier time to each OS guest by having everything
 nicely set up on AArch64 as well?

Definitely, I think having the OS manually program the BARs only makes sense
in an environment where you don't have a full-featured boot loader or you
don't trust it. In servers and virtual machines, the PCI bus should always come
fully set up. This also implies that the OS should not have to deal with
registers for setting up the translation between PCI and system address
ranges.

Arnd



Re: [Qemu-devel] [Linaro-acpi] [RFC PATCH 0/7] hw/arm/virt: Dynamic ACPI v5.1 table generation

2014-11-12 Thread Arnd Bergmann
On Wednesday 12 November 2014 16:52:25 Paolo Bonzini wrote:
 On 12/11/2014 16:39, Peter Maydell wrote:
   Definitely, I think having the OS manually program the BARs only makes 
   sense
   in an environment where you don't have a full-featured boot loader or you
   don't trust it. In servers and virtual machines, the PCI bus should 
   always come
   fully set up. This also implies that the OS should not have to deal with
   registers for setting up the translation between PCI and system address
   ranges.
 
  It seems to me like complicated stuff like that definitely belongs
  in the UEFI/bootloader blob, though. I'd rather QEMU just modelled
  the hardware and let the guest (or the firmware, which is guest
  code from QEMU's point of view) set it up however it wants.
 
 It definitely doesn't belong in QEMU!

The easiest option would probably be not make all PCI devices have
fixed BARs and not even allow them to be changed. I believe this is
what kvmtool does, but I can see how supporting both modes is much
harder than either one.

How does it work on x86 with qemu?

Arnd



Re: [Qemu-devel] [Linaro-acpi] [RFC PATCH 0/7] hw/arm/virt: Dynamic ACPI v5.1 table generation

2014-11-12 Thread Arnd Bergmann
On Wednesday 12 November 2014 17:04:30 Paolo Bonzini wrote:
 On 12/11/2014 16:57, Arnd Bergmann wrote:
It seems to me like complicated stuff like that definitely belongs
in the UEFI/bootloader blob, though. I'd rather QEMU just modelled
the hardware and let the guest (or the firmware, which is guest
code from QEMU's point of view) set it up however it wants.
   
   It definitely doesn't belong in QEMU!
  
  The easiest option would probably be not make all PCI devices have
  fixed BARs and not even allow them to be changed. I believe this is
  what kvmtool does, but I can see how supporting both modes is much
  harder than either one.
 
 kvmtool does not have firmware; it starts the kernel directly, so it
 does all the setup that usually is done by the firmware.  It implements
 a couple real-mode interfaces that Linux uses when booting, but nothing
 of this deals with PCI.
 
 x86 QEMU always runs firmware.  Even if you specify -kernel, the
 firmware does all the usual initialization and then boots from a small
 ROM.  The ROM contains the bootloader, so it loads and starts the kernel.

Ok, I see.

  How does it work on x86 with qemu?
 
 Same as real hardware.  Firmware (SeaBIOS or OVMF) builds the memory
 map, decides where in the free space the BARs go, and programs the PCI
 devices accordingly.
 
 kvmtool is the special one here.  Xen, VMware, Hyper-V all do the same
 as QEMU.

Right. I guess embedded ARM images in qemu are a third way then, because
these don't have a guest firmware but also don't set up the hardware
the way that kvmtool does.

Claudio's request to do this differently on arm64 seems absolutely
reasonable to me, but I guess that implies having UEFI or something
like it that does the PCI scan. Not sure what the best default for
qemu -kernel image should be though if you don't explicitly pass
a firmware image.

Arnd



Re: [Qemu-devel] [Linaro-acpi] [RFC PATCH 0/7] hw/arm/virt: Dynamic ACPI v5.1 table generation

2014-11-06 Thread Arnd Bergmann
On Thursday 06 November 2014 13:30:01 Mark Rutland wrote:
 
 There is no way of doing this with DT. There has been work into DT
 fragments/overlays where portions can be added to the tree dynamically,
 but that's controlled by the OS rather than the hypervisor, and there's
 no standard for communicating what has been hotplugged to trigger
 changes to the tree, so it's not quite the same. It really only works
 for tightly coupled hw/kernel/userspace combinations (i.e. embedded).
 
 Depending on how you implement the hot-add/remove you might be able to
 get away with an initial static configuration translated from DT. If you
 need to describe what might be hotplugged from the start, then I suspect
 you cannot get away with translating a DT in general.

I believe IBM POWER 5/6/7 servers have an interface to update the DT
from the hypervisor, but it's not a nice interface, so we may want to
avoid duplicating that.

Arnd



Re: [Qemu-devel] Change of TEXT_OFFSET for multi_v7_defconfig

2014-04-22 Thread Arnd Bergmann
On Tuesday 22 April 2014 11:53:25 Jason Gunthorpe wrote:
 On Tue, Apr 22, 2014 at 06:11:42PM +0100, Russell King - ARM Linux wrote:
 
  Put another way, if your platform is part of the multi-platform kernel
  then you are *excluded* from being able to use this... unless you hack
  the Kconfig, and then also provide a constant value for PHYS_OFFSET,
  thereby _tying_ the kernel you built to a _single_ platform.
 
 That is exactly right. To get a fixed LMA you must commit to a
 non-relocatable kernel image.
 
 Realistically this patch would need to be accompanied by something
 that makes ARM_PATCH_PHYS_VIRT optional for multiplatform based on
 EXPERT or similar.

Incidentally, I have this experimental patch in my endless 'randconfig'
queue of things to do properly at a later point.
The idea is to let a person building a kernel intentionally do a
number of things that result in a not-quite-multiplatform configuration
when CONFIG_BROKEN_MULTIPLATFORM is disabled.

The main intention is allow NOMMU builds on platforms that are already
converted to multiplatform, and to prevent people from accidentally
turning on DEBUG_LL, as that breaks all other machines. A couple of
other things are in the same category.

Arnd

From c9917a4a1b3f326e84932d7e860597413a98643b Mon Sep 17 00:00:00 2001
From: Arnd Bergmann a...@arndb.de
Date: Mon, 24 Feb 2014 16:38:34 +0100
Subject: [PATCH] ARM: add CONFIG_BROKEN_MULTIPLATFORM

Signed-off-by: Arnd Bergmann a...@arndb.de

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 65ea715..9ae3c0b 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -233,7 +233,7 @@ config VECTORS_BASE
  in size.
 
 config ARM_PATCH_PHYS_VIRT
-   bool Patch physical to virtual translations at runtime if EMBEDDED
+   bool Patch physical to virtual translations at runtime if 
BROKEN_MULTIPLATFORM
default y
depends on !XIP_KERNEL  MMU
depends on !ARCH_REALVIEW || !SPARSEMEM
@@ -302,16 +302,14 @@ config MMU
 #
 choice
prompt ARM system type
-   default ARCH_VERSATILE if !MMU
-   default ARCH_MULTIPLATFORM if MMU
+   default ARCH_MULTIPLATFORM
 
 config ARCH_MULTIPLATFORM
bool Allow multiple platforms to be selected
-   depends on MMU
select ARCH_WANT_OPTIONAL_GPIOLIB
select ARM_HAS_SG_CHAIN
-   select ARM_PATCH_PHYS_VIRT
select AUTO_ZRELADDR
+   select BROKEN_MULTIPLATFORM if !MMU
select COMMON_CLK
select GENERIC_CLOCKEVENTS
select MULTI_IRQ_HANDLER
@@ -865,6 +863,18 @@ config ARCH_OMAP1
 
 endchoice
 
+config BROKEN_MULTIPLATFORM
+   bool Allow multiplatform builds that may be broken if 
ARCH_MULTIPLATFORM  EXPERT
+   def_bool !ARCH_MULTIPLATFORM
+   help
+ This is a meta option that makes other options visible that
+ are normally hidden when building a kernel with ARCH_MULTIPLATFORM
+ support.
+ You have to enable this option in order to select any option
+ that can cause your kernel to only work on a subset of the
+ platforms that are enabled. This includes disabling MMU
+ or ARM_PATCH_PHYS_VIRT, or enabling ZBOOT_ROM.
+
 menu Multiple platform selection
depends on ARCH_MULTIPLATFORM
 
@@ -1943,6 +1953,7 @@ config DEPRECATED_PARAM_STRUCT
 # TEXT and BSS so we preserve their values in the config files.
 config ZBOOT_ROM_TEXT
hex Compressed ROM boot loader base address
+   depends on BROKEN_MULTIPLATFORM
default 0
help
  The physical address at which the ROM-able zImage is to be
@@ -1954,6 +1965,7 @@ config ZBOOT_ROM_TEXT
 
 config ZBOOT_ROM_BSS
hex Compressed ROM boot loader BSS address
+   depends on BROKEN_MULTIPLATFORM
default 0
help
  The base address of an area of read/write memory in the target
@@ -1967,6 +1979,7 @@ config ZBOOT_ROM_BSS
 
 config ZBOOT_ROM
bool Compressed boot loader in ROM/flash
+   depends on BROKEN_MULTIPLATFORM
depends on ZBOOT_ROM_TEXT != ZBOOT_ROM_BSS
depends on !ARM_APPENDED_DTB  !XIP_KERNEL  !AUTO_ZRELADDR
help
@@ -2092,7 +2105,7 @@ endchoice
 
 config XIP_KERNEL
bool Kernel Execute-In-Place from ROM
-   depends on !ARM_LPAE  !ARCH_MULTIPLATFORM
+   depends on !ARM_LPAE  BROKEN_MULTIPLATFORM
help
  Execute-In-Place allows the kernel to run from non-volatile storage
  directly addressable by the CPU, such as NOR flash. This saves RAM
diff --git a/arch/arm/Kconfig.debug b/arch/arm/Kconfig.debug
index 6175b50..e193881 100644
--- a/arch/arm/Kconfig.debug
+++ b/arch/arm/Kconfig.debug
@@ -80,6 +80,7 @@ config DEBUG_USER
 config DEBUG_LL
bool Kernel low-level debugging functions (read help!)
depends on DEBUG_KERNEL
+   depends on BROKEN_MULTIPLATFORM
help
  Say Y here to include definitions of printascii, printch, printhex
  in the kernel.  This is helpful if you are debugging code

Re: [Qemu-devel] Change of TEXT_OFFSET for multi_v7_defconfig

2014-04-22 Thread Arnd Bergmann
On Tuesday 22 April 2014 19:38:55 Russell King - ARM Linux wrote:
 On Tue, Apr 22, 2014 at 08:32:10PM +0200, Arnd Bergmann wrote:
  @@ -1943,6 +1953,7 @@ config DEPRECATED_PARAM_STRUCT
   # TEXT and BSS so we preserve their values in the config files.
   config ZBOOT_ROM_TEXT
hex Compressed ROM boot loader base address
  + depends on BROKEN_MULTIPLATFORM
default 0
help
  The physical address at which the ROM-able zImage is to be
  @@ -1954,6 +1965,7 @@ config ZBOOT_ROM_TEXT
   
   config ZBOOT_ROM_BSS
hex Compressed ROM boot loader BSS address
  + depends on BROKEN_MULTIPLATFORM
default 0
help
  The base address of an area of read/write memory in the target
 
 Please get rid of the above changes.

Good point, fixed now.

Arnd



Re: [Qemu-devel] [PATCH for-1.5 0/3] hw/pci-host/versatile: Fix issues with newer kernels

2013-05-16 Thread Arnd Bergmann
On Wednesday 15 May 2013, Linus Walleij wrote:
 On Tue, May 14, 2013 at 5:33 PM, Peter Maydell peter.mayd...@linaro.org 
 wrote:
 
  The reworking of the versatile PCI controller model so that it actually
  behaved like hardware included an attempt to autodetect whether the
  guest Linux kernel was assuming the old broken behaviour. Unfortunately
  it turns out that there are several different variant broken kernels
  which behave slightly differently (though none of them will work on
  real hardware). The first two patches in this series improve the
  autodetection so that we will work out of the box on more kernels.
  The third patch adds a property for forcing the behaviour, so that
  if there are further cases we didn't know about, at least users have
  a command line workaround they can enable.
 
 This looks like a viable approach to me. So FWIW:
 Acked-by: Linus Walleij linus.wall...@linaro.org

FWIW, I plan to really get this done in the kernel for 3.11 properly
and rework the entire versatile and realview code base to work without
any platform specific code in arch/arm. The plan is to use the new
infrastructure for PCI and put that code into drivers/pci/host,
and have it scan the hardware using DT only. We can have a backwards
compatibility setup for versatile-pb without DT, but in the long
run, I would prefer to kill off that boot option.

If this works out, any Linux kernel built for any platform should
be able to boot the versatilepb, versatileab, integratorcp, realview-eb,
realview-eb-mpcore, realview-pb-a8, realview-pbx-a9, vexpress-a9
and vexpress-a15 models, as long as you pass a -cpu option matching
a CPU enabled in the kernel, and all the drivers are enabled.

I remember there was a way to autogenerate the dtb blob in qemu at
some point, based on the devices enabled in the model. Did that ever
make it in?

Arnd



Re: [Qemu-devel] [PATCH v2 07/11] versatile_pci: Implement the correct PCI IRQ mapping

2013-03-26 Thread Arnd Bergmann
On Tuesday 26 March 2013, Peter Maydell wrote:
 
 Implement the correct IRQ mapping for the Versatile PCI controller; it
 differs between realview and versatile boards, but the previous QEMU
 implementation was correct only for the first PCI card on a versatile
 board, since we weren't swizzling IRQs based on the slot number.
 
 Since this change would otherwise break any uses of PCI on Linux kernels
 which have an equivalent bug (since they have effectively only been
 tested against QEMU, not real hardware), we implement a mechanism
 for automatically detecting those broken kernels and switching back
 to the old mapping. This works by looking at the values the kernel
 writes to the PCI_INTERRUPT_LINE register in the config space, which
 is effectively the interrupt number the kernel expects the device
 to be using. If this doesn't match reality we use the broken mapping.
 
 Signed-off-by: Peter Maydell peter.mayd...@linaro.org

Yes, very good.  We will probably introduce sparse irq support on
versatile in the near future, and then the value we write into the
PCI_INTERRUPT_LINE field will become arbitrary from qemu's point
of view, but I will make sure that we fix the interrupt mapping
in the kernel at the same time so we always fall into the
s-broken_irq_mapping = false; case.

We also need to find a way to make the new kernel work with
an old qemu, and I think we can do that by using the versatile-dt
board type with a PCI device node that sets all four lines to
27, while using the actual interrupt lines for the default
versatile device tree.

Arnd



Re: [Qemu-devel] [PATCH v2 07/11] versatile_pci: Implement the correct PCI IRQ mapping

2013-03-26 Thread Arnd Bergmann
On Tuesday 26 March 2013, Peter Maydell wrote:
 
 On 26 March 2013 10:54, Arnd Bergmann a...@arndb.de wrote:
  Yes, very good.  We will probably introduce sparse irq support on
  versatile in the near future, and then the value we write into the
  PCI_INTERRUPT_LINE field will become arbitrary from qemu's point
  of view, but I will make sure that we fix the interrupt mapping
  in the kernel at the same time so we always fall into the
  s-broken_irq_mapping = false; case.
 
 Yeah, as long as you avoid the number 27 you're ok :-)

Good point. I guess we'll have to keep using a legacy domain for
versatile then.

  We also need to find a way to make the new kernel work with
  an old qemu, and I think we can do that by using the versatile-dt
  board type with a PCI device node that sets all four lines to
  27, while using the actual interrupt lines for the default
  versatile device tree.
 
 Personally I'd be happy for you to just say needs a new QEMU.
 The broken QEMU is missing so much (including working memory
 windows) that I think it would be a pain to get the kernel to
 cope with it.

But it was working earlier, so I'd definitely try not to break
if at all possible. A lot of people use the verstatile qemu
model to run kernels and I would not want to deal with the
complaints I'd get if we break those. Using a separate dts
file seems easy enough.

Arnd



Re: [Qemu-devel] [PATCH 00/10] Fix versatile_pci (and break versatilepb linux guests!)

2013-03-24 Thread Arnd Bergmann
On Sunday 24 March 2013, Peter Maydell wrote:
 Yeah, ideally being able to detect the buggy kernel would be good;
 I can't see anything at the controller level that would do though,
 and I don't really know enough about PCI to know about generic
 PCI stuff that would work. (Why would the OS need to tell the
 device anything about its IRQ if it's hardwired?)

I think it actually does on versatile and other platforms on which
the kernel probes the PCI bus itself, rather than relying on firmware
to have resources assigned in advance.

IIRC, the PCI_INTERRUPT_LINE pci config space byte (0x3c) is purely
informational and used as a way to communicate the interrupt number
from the bus scan code (assumed to be a PC BIOS in the PCI spec,
but drivers/pci/setup-irq.c in case of versatile+linux) to a device
driver.

So the kernel should actually write the proper interrupt number in
there. In future kernels, this may not necessarily be the hardware
number, but today it is. Can you try out what the kernel writes into
that register in qemu, with and without my patches?

I would expect the numbers to be (64+27) to (64+30), since we
linearize the interrupt numbers so that VIC gets 32 through
63 and SIC gets 64 through 95.

Arnd



Re: [Qemu-devel] [PATCH 00/10] Fix versatile_pci (and break versatilepb linux guests!)

2013-03-24 Thread Arnd Bergmann
On Sunday 24 March 2013, Peter Maydell wrote:
 OK, I'll give that a go tomorrow.
 
 While you're here, do you know what the point of the PCI_SELFID
 register is? The h/w docs are clear that the OS has to write
 the slot number of the board itself in to this register, but
 again I don't see why the OS has to tell the hardware something
 it already knows. On QEMU I just implemented this register as
 a no-op and everything seems to still function...
 

I don't know really, but I think it has something to do with the
versatile board being plugged into a backplane that has multiple
slots. Apparently there is something the hardware cannot figure
out itself, e.g. whether to route the interrupts in or out of the
slot.

Arnd



Re: [Qemu-devel] [PATCH 00/10] Fix versatile_pci (and break versatilepb linux guests!)

2013-03-24 Thread Arnd Bergmann
On Sunday 24 March 2013, Michael S. Tsirkin wrote:
 For future kernels, let's build in some hook that let
 qemu detect a non broken guest. How about writing
 some magic value into revision ID or some other
 readonly field?

Yes, makes sense.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Arnd Bergmann
On Tuesday 07 February 2012, Alexander Graf wrote:
 On 07.02.2012, at 07:58, Michael Ellerman wrote:
 
  On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote:
  You're exposing a large, complex kernel subsystem that does very
  low-level things with the hardware.  It's a potential source of exploits
  (from bugs in KVM or in hardware).  I can see people wanting to be
  selective with access because of that.
  
  Exactly.
  
  In a perfect world I'd agree with Anthony, but in reality I think
  sysadmins are quite happy that they can prevent some users from using
  KVM.
  
  You could presumably achieve something similar with capabilities or
  whatever, but a node in /dev is much simpler.
 
 Well, you could still keep the /dev/kvm node and then have syscalls operate 
 on the fd.
 
 But again, I don't see the problem with the ioctl interface. It's nice, 
 extensible and works great for us.
 

ioctl is good for hardware devices and stuff that you want to enumerate
and/or control permissions on. For something like KVM that is really a
core kernel service, a syscall makes much more sense.

I would certainly never mix the two concepts: If you use a chardev to get
a file descriptor, use ioctl to do operations on it, and if you use a 
syscall to get the file descriptor then use other syscalls to do operations
on it.

I don't really have a good recommendation whether or not to change from an
ioctl based interface to syscall for KVM now. On the one hand I believe it
would be significantly cleaner, on the other hand we cannot remove the
chardev interface any more since there are many existing users.

Arnd



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Arnd Bergmann
On Tuesday 07 February 2012, Alexander Graf wrote:
  
  Not sure we'll ever get there. For PPC, it will probably take another 1-2 
  years until we get the 32-bit targets stabilized. By then we will have new 
  64-bit support though. And then the next gen will come out giving us even 
  more new constraints.
  
  I would expect that newer archs have less constraints, not more.
 
 Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have 
 today on 32-bit, but extends a
 bunch of registers to 64-bit. So what if we laid out stuff wrong before?
 
 I don't even want to imagine what v7 arm vs v8 arm looks like. It's a 
 completely new architecture.
 

I have not seen the source but I'm pretty sure that v7 and v8 they look very
similar regarding virtualization support because they were designed together,
including the concept that on v8 you can run either a v7 compatible 32 bit
hypervisor with 32 bit guests or a 64 bit hypervisor with a combination of
32 and 64 bit guests. Also, the page table layout in v7-LPAE is identical
to the v8 one. The main difference is the instruction set, but then ARMv7
already has four of these (ARM, Thumb, Thumb2, ThumbEE).

Arnd




Re: [Qemu-devel] [RFC][PATCH] Import Linux headers for KVM and vhost

2011-05-03 Thread Arnd Bergmann
On Tuesday 03 May 2011, Jan Kiszka wrote:
 This helps reducing our build-time checks for feature support in the
 available Linux kernel headers. And it helps users that do not have
 sufficiently recent headers installed on their build machine.
 
 Header upstate is triggered via 'make update-linux-headers', optionally
 specifying a kernel directory as source via 'LINUX='.
 
 ---
 
 I'm sure this is not yet perfect from technical POV (e.g. the patch to
 Makefile.target looks like a hack to me, better ideas welcome),
 therefore RFC and no SOB. But it should demonstrate the plan. The
 workflow for adding a new KVM feature support would be first a 'make
 update-linux-header LINUX=/my/local/linux' + git commit, and then the
 commit of the kvm, vhost, whatever changes.
 
 Makefile|   17 +++
 Makefile.target |2 +-
 configure   |  127 ---
 3 files changed, 36 insertions(+), 110 deletions(-)

Very nice!

The old way to poke into the kernel internals was really annoying and confusing.

Arnd



Re: [Qemu-devel] [PATCH v2] Import Linux headers for KVM and vhost

2011-05-03 Thread Arnd Bergmann
On Tuesday 03 May 2011 19:57:18 Scott Wood wrote:
  I agree, it doesn't feel quite right to bring in the headers. I don't have
  any alternative suggestions (besides better HOWTOs/Documentation) though. 
 
 If you try to use the non-sanitized kernel headers, you'll get this warning
 from linux/types.h:
 
 #warning Attempt to use kernel headers from user space, see 
 http://kernelnewbies.org/KernelHeaders;
 

You're welcome ;-)

I tried to make the recommended approach as clear as I could with the
#warning and the wiki page. If there is anything still unclear about
how it should be done, please make suggestions for improving
the wiki text.

Arnd



Re: [Qemu-devel] Re: Network bridging without adding bridge with brctl, possible?

2011-02-24 Thread Arnd Bergmann
On Thursday 24 February 2011, Jan Kiszka wrote:
 On 2011-02-24 07:49, Gerhard Wiesinger wrote:
  On Wed, 23 Feb 2011, Jan Kiszka wrote:
  Right, but if I set IP(eth0) == IP(macvlan0), I'm able to communicate
  between macvlan0 and mactapX, thus between guest and host. Just
  re-checked here, still works (after resolving the usual MAC address mess
  I caused by configuring manually).
  
  Thnx for the tipp.
  
  Did you use MAC(eth0) == MAC(macvlan0) or MAC(eth0)  MAC(macvlan0) to 
  get it to work?
 
 The latter (I just let macvlan/tap choose their MACs).

You cannot set the two to the same MAC address while they are up, but I think
you can do

ip link set eth0 down
ip link set macvlan0 ${MAC_ETH0}
ip link set macvlan0 up

Doing that is a bit tricky if eth0 is your only connection to the machine...

Arnd



Re: [Qemu-devel] Re: Network bridging without adding bridge with brctl, possible?

2011-02-23 Thread Arnd Bergmann
On Wednesday 23 February 2011, Gerhard Wiesinger wrote:
 After some further tests and looking at the iproute ip and kernel code I 
 finally gave up because I thing such a setup it is not possible without 
 breaking up/reconfiguring eth0. When I have to reconfigure eth0 I think a 
 better approach is to configure a bridge which I finally did and works 
 well.
 
 I tried to explain/document the macvtap/macvlan concepts and limitations 
 below. Please comment on it whether this is true or false.
 
 macvtap/macvlan driver concepts and limitations:
 1.) macvlan driver adds a MAC address to a lower interface device where 
 the actual macvlanx device is based on
 2.) macvtap driver is based on macvlan driver and macvtap driver adds 
 additional functionality of interface = external program communication 
 with stdin/stdout channel.
 3.) Limitations: macvtap/macvlan based devices can only communicate with 
 childs based on the same lower device (e.g. eth0 in this sample) but not 
 to the lower device itself, only to the outside world of the interface

Correct.

 finally this makes the macvlan/macvtap approach useless because main eth0 
 interface must still be broken in the chain and reconfigured which was 
 against the requirements that eth0 should not be touched and reconfigured!

Yes, that is unfortunate, but it's the same that you'd get with a bridge
device: 
When you have a bridge on top of eth0, you can no longer assign an IP
address to eth0 and let it communicate with the virtual ports on the
bridge. You need to instead set the IP address on the bridge itself.

Macvlan is slightly better because it allows you to have multiple
host devices that can each have their own MAC/IP address, unlike
the bridge, but of course it can not be connected to anything else
besides macvlan or macvtap ports.

Arnd



Re: [Qemu-devel] Re: Network bridging without adding bridge with brctl, possible?

2011-02-21 Thread Arnd Bergmann
On Monday 21 February 2011, Jan Kiszka wrote:
  Now I think I tried all useful possible combinations:
  1.) macvtap0 and macvlan0 in bridged and non bridge mode
  2.) macvlan0 based on eth0 or based on macvtap0
  3.) Using and ip address on macvlan0 or not (tried both for 7b mac and
  7c mac)
  4.) Using 7b and 7c mac address in KVM (MAC in KVM is always the command
  line configured) and used tap interface from macvtap0 (as no second tap
  devices shows up)
  
  Summary:
  1.) Using macvtap0 only with 7b mac on interface and also at qemu works
  well in bridged mode as well as non bridged mode but with limitation no
  guest/host communication possible, used interface in KVM is tap
  interface created by macvtap0. Quite logically that it doesn't work with
  guest/host because of missing hairpin mode on the switch. But according
  to http://virt.kernelnewbies.org/MacVTap bridge mode should work even
  without hairpinning available on the switch.
  2.) macvlan0 doesn't create any tap interface therefore it can't be used
  as a device on KVM.
  3.) Using 7c mac address in KVM doesn't work at all regardsless of
  setting ip addresses on macvlan0 or any other setup.
  
  @Arnd: Can you explain a setup in detail which should work also with
  host/guest communication. Thnx.
  
  Any further ideas?
  Which kernel is needed to work?
  (current: 2.6.34.7-56.fc13.x86_64)
 
 Actually, I tried this successfully over a 2.6.38-rcX, but I have no
 idea what changes related to macvlan/tap went in since .34 and if this
 is required for this.

We had a few bugs in .34, but it should work in general.

One thing to make sure is that the host has connectivity through the
macvlan interface and the lower interface (eth0) has no IP address assigned
but is 'up'.

It could also be a bug in the NIC, you could try to set the NIC into
promiscious mode using ethtool to work around that.

Also, the lower interface must not be connected to a bridge device.

Arnd



Re: [Qemu-devel] Re: Network bridging without adding bridge with brctl, possible?

2011-02-20 Thread Arnd Bergmann
On Sunday 20 February 2011, Gerhard Wiesinger wrote:
 
  Not sure if this is by design or due to internals of the networking
  stack, but it looks unintuitive from user perspective. Maybe Arnd can
  shed a light on this.

The lower device cannot be in bridge mode, because that would make the
logic in the kernel awfully complex. I agree that it's a bit unfortunate,
but it simplified the design a lot.

 
  Of course, you could also simply offload all that setup to libvirt.
 
 http://wiki.libvirt.org/page/Networking
 But that still needs a bridge on ethernet level and break up the existing 
 interface, right?

No. macvtap is a tap that sits on an external interface, you don't
need a bridge if you use it.
 
 BTW: From: http://virt.kernelnewbies.org/MacVTap
 As of QEMU 0.12:
 qemu -net nic,model=virtio,addr=1a:46:0b:ca:bc:7b -net tap,fd=3 3/dev/tap11
 Is there a newer Syntax with tap interfaces possible (QEMU GIT) without 
 bash redirects?

I made some suggestions how to integrate it with qemu, but they never got in.
Libvirt makes it a lot easier, though. We discussed that it should be cleaned
up when the networking helper scripts make it into qemu. Not sure if that
ever happened.

Arnd



Re: [Qemu-devel] Re: Network bridging without adding bridge with brctl, possible?

2011-02-20 Thread Arnd Bergmann
On Sunday 20 February 2011, Gerhard Wiesinger wrote:
  qemu-system-x86_64 ... some params ... -net
  nic,model=e1000,macaddr=1a:46:0b:ca:bc:7c -net tap,fd=3 3/dev/tap10
 
  Seems to me quite logically because macvtap0 (and not macvlan0) is
  associated with /dev/tap10 but with another mac address set in KVM.
 
  Any furher ideas?
 
  As you already noticed: you mixed up the MAC addresses. KVM's must be
  the same as used for its frontend macvtap. The macvlan is only for the
  host and has a separate one.
 
 I think I did everyting right in the last 1st try but it still didn't 
 work:
 1.) macvtap0: MAC: 1a:46:0b:ca:bc:7b
 2.) macvlan0: MAC: 1a:46:0b:ca:bc:7c, 192.168.0.23
 3.) KVM: MAC: 1a:46:0b:ca:bc:7b, assigned IP from DHCP: 1a:46:0b:ca:bc:7b
 (looks like an IP address conflict inside guest and outside?)
 
 That should be as you explained, right?
 

The qemu command above has the 7c mac address, which does not match.
Do you see the interface in the guest using ip link show ?

Arnd



[Qemu-devel] Re: TODO item: guest programmable mac/vlan filtering with macvtap

2010-10-18 Thread Arnd Bergmann
On Friday 15 October 2010, Michael S. Tsirkin wrote:
 On Thu, Oct 14, 2010 at 11:40:52PM +0200, Dragos Tatulea wrote:
  Hi,
  
  I'm starting a  thread related to the TODO item mentioned in the
  subject. Currently still gathering info and trying to make kvm 
  macvtap play nicely together. I have used this [1] guide to set it up
  but qemu is still complaining about the PCI device address of the
  virtio-net-pci. Tried with latest qemu. Am I missing something here?
  
  [1] - http://virt.kernelnewbies.org/MacVTap
  
 
 It really should be:
  -net nic,model=virtio,netdev=foo -netdev tap,id=foo
 
 Created account but still could not edit
 the wiki. Arnd, know why that is? Could you correct qemu
 command line pls?

I also have lost write access to the wiki, no idea what happened there.
I started the page, but it subsequently became protected.

We never added support for the qemu command line directly, the
plan was to do that using helper scripts.

The only way to do it is to redirect both input and output
to the tap device, so you ned to do

-net nic,model=virtio,netdev=foo -netdev tap,id=foo,fd=3 3

when starting from bash.

Arnd



[Qemu-devel] Re: [PATCH v2] pc: e820 qemu_cfg tables need to be packed

2010-10-15 Thread Arnd Bergmann
On Friday 15 October 2010, Alex Williamson wrote:
 We can't let the compiler define the alignment for qemu_cfg data.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---
 
  v2: Adjust alignment to help non-x86 hosts per Arnd's suggestion

Ok, looks good now. Thanks!

Arnd



Re: [Qemu-devel] Re: [PATCH] pc: e820 qemu_cfg tables need to be packed

2010-10-14 Thread Arnd Bergmann
On Thursday 14 October 2010 21:58:08 Alex Williamson wrote:
 If it works anywhere (I assume it works on 32bit), then it's only
 because it happened to get the alignment right.  This just makes 64bit
 hosts get it right too.  I don't see any compatibility issues,
 non-packed + 64bit = broken.  Thanks,

I would actually assume that only x86-32 hosts got it right, because
all 32 bit hosts I've seen other than x86 also define 8 byte alignment
for uint64_t.

You might however consider making it 

__attribute((__packed__, __aligned__(4)))

instead of just packed, because otherwise you make the alignment one byte,
which is not only different from what it used to be on x86-32 but also
will cause inefficient compiler outpout on platforms that don't have unaligned
word accesses in hardware.

Arnd



Re: [Qemu-devel] Re: [PATCH] pc: e820 qemu_cfg tables need to be packed

2010-10-14 Thread Arnd Bergmann
On Thursday 14 October 2010 22:59:04 Alex Williamson wrote:
 The structs in question only contain 4  8 byte elements, so there
 shouldn't be any change on x86-32 using one-byte aligned packing.

I'm talking about the alignment of the structure, not the members
within the structure. The data structure should be compatible, but
not accesses to it.

 AFAIK, e820 is x86-only, so we don't need to worry about breaking anyone
 else.

You can use qemu to emulate an x86 pc on anything...

 Performance isn't much of a consideration for this type of
 interface since it's only used pre-boot.  In fact, the channel between
 qemu and the bios is only one byte wide, so wider alignment can cost
 extra emulated I/O accesses.

Right, the data gets passed as bytes, so it hardly matters in the end.
Still the e820_add_entry assigns data to the struct members, which
it either does using byte accesses and shifts or a multiple 32 bit
assignment. Just because using a one byte alignment technically
results in correct output doesn't make it the right solution.

I don't care about the few cycles of execution time or the few bytes
you waste in this particular case, but you are setting a wrong example
by using smaller alignment than necessary.

Arnd



Re: [Qemu-devel] vhost_net.c broken by --kerneldir

2010-08-26 Thread Arnd Bergmann
On Wednesday 25 August 2010, Hollis Blanchard wrote:
  We only recently fixed the kernel to have this warning in types.h, which
  triggers more often than kernel.h, where it used to be before. In 2.6.35
  and before, you consequently would not have noticed the problem.
 
 
 Thanks Arnd, that explains it.
 
 It looks like the --kerneldir option needs to be re-thought.

Yes. I believe that we should just kill that option, since there is
no reason for building with set of headers from a different kernel
than the one your glibc normally uses.

You obviously need copies of some headers to match qemu code, e.g.
the kvm headers, but for those we have run-time compatibility code
in qemu. Building with a newer kernel header is pointless because
qemu would not use any of the features that were added after the
release of the qemu tree you are trying to build. Building with
an older kernel header is equally pointless because all it would
do is to require #ifdef magic in qemu that ends up preventing you
from using all features of new kernels.

Arnd



Re: [Qemu-devel] vhost_net.c broken by --kerneldir

2010-08-26 Thread Arnd Bergmann
On Thursday 26 August 2010, Gleb Natapov wrote:
 You forgot about developers. Developer may want to use latest kvm kernel
 headers to compile code that he added to qemu to use new kernel feature.

In that case, you already need to install the kernel in order to test
it, so you might as well install the headers along with it.

Also, you're not normally touching multiple headers at a time, but rather
add a feature in one file, so it's not hard to add it to both copies of
that file. This may seem like an unnecessary step, but using the kernel
headers without installing them first can cause any number of problems
that we don't want anyone to have to deal with.

Arnd



Re: [Qemu-devel] vhost_net.c broken by --kerneldir

2010-08-25 Thread Arnd Bergmann
On Wednesday 25 August 2010, Hollis Blanchard wrote:

 The problem seems to be that jump from /usr/include/bits/sigcontext.h to 
 asm/sigcontext.h inside kerneldir rather than inside /usr/include. It 
 seems like adding -Ikerneldir/arch/foo/include will always be a problem, 
 since it will always be used to satisfy #include asm/bar.h. Only 
 files built with KVM_CFLAGS would be affected, and I guess vhost_net.c 
 accidentally gets into a broken include path where the other KVM_CFLAGS 
 files doesn't.
 
 I'm wondering why this isn't causing more problems for more people. My 
 host is Fedora 12, FWIW, but this doesn't seem like it could at all be 
 related to toolchain version, for example.

We only recently fixed the kernel to have this warning in types.h, which
triggers more often than kernel.h, where it used to be before. In 2.6.35
and before, you consequently would not have noticed the problem.

Arnd



[Qemu-devel] Re: [PATCH] Ignore writes of perf reg (cp15 with crm == 12)

2010-07-28 Thread Arnd Bergmann
On Sunday 25 July 2010, Loïc Minier wrote:
 On ARMv7, ignore writes to cp15 with crm == 12; these are to setup perf
 counters which we don't have.

Thanks Loïc, I have tested this and can confirm that I'm able to boot
a CONFIG_PERF_EVENTS kernel now.

Arnd



Re: [Qemu-devel] Re: KVM call minutes for June 15

2010-06-16 Thread Arnd Bergmann
On Wednesday 16 June 2010, Markus Armbruster wrote:
 Can't hurt reviewer motivation.  Could it be automated?  Find replies,
 extract tags.  If you want your acks to be picked up, you better make
 sure your References header works, and your tags are formatted
 correctly.

I think pwclient (https://patchwork.kernel.org/) can do this for you.

Arnd



[Qemu-devel] Re: [PATCH] Inter-VM shared memory PCI device

2010-03-11 Thread Arnd Bergmann
On Thursday 11 March 2010, Avi Kivity wrote:
  A totally different option that avoids this whole problem would
  be to separate the signalling from the shared memory, making the
  PCI shared memory device a trivial device with a single memory BAR,
  and using something a higher-level concept like a virtio based
  serial line for the actual signalling.
 
 
 That would be much slower.  The current scheme allows for an 
 ioeventfd/irqfd short circuit which allows one guest to interrupt 
 another without involving their qemus at all.

Yes, the serial line approach would be much slower, but my point
was that we can do signaling over something else, which could
well be something building on irqfd.

Arnd




[Qemu-devel] Re: [PATCH] Inter-VM shared memory PCI device

2010-03-11 Thread Arnd Bergmann
On Thursday 11 March 2010, Avi Kivity wrote:
  That would be much slower.  The current scheme allows for an
  ioeventfd/irqfd short circuit which allows one guest to interrupt
  another without involving their qemus at all.
   
  Yes, the serial line approach would be much slower, but my point
  was that we can do signaling over something else, which could
  well be something building on irqfd.
 
 Well, we could, but it seems to make things more complicated?  A card 
 with shared memory, and another card with an interrupt interconnect?

Yes, I agree that it's more complicated if you have a specific application
in mind that needs one of each, and most use cases that want shared memory
also need an interrupt mechanism, but it's not always the case:

- You could use ext2 with -o xip on a private mapping of a shared host file
in order to share the page cache. This does not need any interrupts.

- If you have more than two parties sharing the segment, there are different
ways to communicate, e.g. always send an interrupt to all others, or have
dedicated point-to-point connections. There is also some complexity in
trying to cover all possible cases in one driver.

I have to say that I also really like the idea of futex over shared memory,
which could potentially make this all a lot simpler. I don't know how this
would best be implemented on the host though.

Arnd




[Qemu-devel] Re: [PATCH] Inter-VM shared memory PCI device

2010-03-10 Thread Arnd Bergmann
On Tuesday 09 March 2010, Cam Macdonell wrote:
 
  We could make the masking in RAM, not in registers, like virtio, which would
  require no exits.  It would then be part of the application specific
  protocol and out of scope of of this spec.
 
 
 This kind of implementation would be possible now since with UIO it's
 up to the application whether to mask interrupts or not and what
 interrupts mean.  We could leave the interrupt mask register for those
 who want that behaviour.  Arnd's idea would remove the need for the
 Doorbell and Mask, but we will always need at least one MMIO register
 to send whatever interrupts we do send.

You'd also have to be very careful if the notification is in RAM to
avoid races between one guest triggering an interrupt and another
guest clearing its interrupt mask.

A totally different option that avoids this whole problem would
be to separate the signalling from the shared memory, making the
PCI shared memory device a trivial device with a single memory BAR,
and using something a higher-level concept like a virtio based
serial line for the actual signalling.

Arnd




[Qemu-devel] Re: [PATCH] Inter-VM shared memory PCI device

2010-03-09 Thread Arnd Bergmann
On Monday 08 March 2010, Cam Macdonell wrote:
 enum ivshmem_registers {
 IntrMask = 0,
 IntrStatus = 2,
 Doorbell = 4,
 IVPosition = 6,
 IVLiveList = 8
 };
 
 The first two registers are the interrupt mask and status registers.
 Interrupts are triggered when a message is received on the guest's eventfd 
 from
 another VM.  Writing to the 'Doorbell' register is how synchronization 
 messages
 are sent to other VMs.
 
 The IVPosition register is read-only and reports the guest's ID number.  The
 IVLiveList register is also read-only and reports a bit vector of currently
 live VM IDs.
 
 The Doorbell register is 16-bits, but is treated as two 8-bit values.  The
 upper 8-bits are used for the destination VM ID.  The lower 8-bits are the
 value which will be written to the destination VM and what the guest status
 register will be set to when the interrupt is trigger is the destination 
 guest.
 A value of 255 in the upper 8-bits will trigger a broadcast where the message
 will be sent to all other guests.

This means you have at least two intercepts for each message:

1. Sender writes to doorbell
2. Receiver gets interrupted

With optionally two more intercepts in order to avoid interrupting the
receiver every time:

3. Receiver masks interrupt in order to process data
4. Receiver unmasks interrupt when it's done and status is no longer pending

I believe you can do much better than this, you combine status and mask
bits, making this level triggered, and move to a bitmask of all guests:

In order to send an interrupt to another guest, the sender first checks
the bit for the receiver. If it's '1', no need for any intercept, the
receiver will come back anyway. If it's zero, write a '1' bit, which
gets OR'd into the bitmask by the host. The receiver gets interrupted
at a raising edge and just leaves the bit on, until it's done processing,
then turns the bit off by writing a '1' into its own location in the mask.

Arnd




[Qemu-devel] Re: [PATCH 2/2] net/macvtap: add vhost suppor

2010-02-15 Thread Arnd Bergmann
On Sunday 14 February 2010, Michael S. Tsirkin wrote:
  @@ -473,7 +477,7 @@ static struct socket *get_socket(int fd)
sock = get_raw_socket(fd);
if (!IS_ERR(sock))
return sock;
  - sock = get_tun_socket(fd);
  + sock = get_tap_socket(fd);
if (!IS_ERR(sock))
return sock;
return ERR_PTR(-ENOTSOCK);
 
 This will also need a dependency on macvtap in Kconfig.
 See how it's done for tun.

Ok, I'll add that.

Thanks,

Arnd




[Qemu-devel] [PATCH 1/2] macvtap: rework object lifetime rules

2010-02-13 Thread Arnd Bergmann
The original macvtap code has a number of problems
resulting from the use of RCU for protecting the
access to struct macvtap_queue from open files.

This includes
- need for GFP_ATOMIC allocations for skbs
- potential deadlocks when copy_*_user sleeps
- inability to work with vhost-net

Changing the lifetime of macvtap_queue to always
depend on the open file solves all these. The
RCU reference simply moves one step down to
the reference on the macvlan_dev, which we
only need for nonblocking operations.

Signed-off-by: Arnd Bergmann a...@arndb.de
---
 drivers/net/macvtap.c |  170 +---
 1 files changed, 89 insertions(+), 81 deletions(-)

This patch replaces the previous one that cleans up the locking in a different 
way.

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index ad1f6ef..7050997 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -60,29 +60,19 @@ static struct cdev macvtap_cdev;
 
 /*
  * RCU usage:
- * The macvtap_queue is referenced both from the chardev struct file
- * and from the struct macvlan_dev using rcu_read_lock.
+ * The macvtap_queue and the macvlan_dev are loosely coupled, the
+ * pointers from one to the other can only be read while rcu_read_lock
+ * or macvtap_lock is held.
  *
- * We never actually update the contents of a macvtap_queue atomically
- * with RCU but it is used for race-free destruction of a queue when
- * either the file or the macvlan_dev goes away. Pointers back to
- * the dev and the file are implicitly valid as long as the queue
- * exists.
+ * Both the file and the macvlan_dev hold a reference on the macvtap_queue
+ * through sock_hold(q-sk). When the macvlan_dev goes away first,
+ * q-vlan becomes inaccessible. When the files gets closed,
+ * macvtap_get_queue() fails.
  *
- * The callbacks from macvlan are always done with rcu_read_lock held
- * already, while in the file_operations, we get it ourselves.
- *
- * When destroying a queue, we remove the pointers from the file and
- * from the dev and then synchronize_rcu to make sure no thread is
- * still using the queue. There may still be references to the struct
- * sock inside of the queue from outbound SKBs, but these never
- * reference back to the file or the dev. The data structure is freed
- * through __sk_free when both our references and any pending SKBs
- * are gone.
- *
- * macvtap_lock is only used to prevent multiple concurrent open()
- * calls to assign a new vlan-tap pointer. It could be moved into
- * the macvlan_dev itself but is extremely rarely used.
+ * There may still be references to the struct sock inside of the
+ * queue from outbound SKBs, but these never reference back to the
+ * file or the dev. The data structure is freed through __sk_free
+ * when both our references and any pending SKBs are gone.
  */
 static DEFINE_SPINLOCK(macvtap_lock);
 
@@ -100,11 +90,12 @@ static int macvtap_set_queue(struct net_device *dev, 
struct file *file,
goto out;
 
err = 0;
-   q-vlan = vlan;
+   rcu_assign_pointer(q-vlan, vlan);
rcu_assign_pointer(vlan-tap, q);
+   sock_hold(q-sk);
 
q-file = file;
-   rcu_assign_pointer(file-private_data, q);
+   file-private_data = q;
 
 out:
spin_unlock(macvtap_lock);
@@ -112,28 +103,25 @@ out:
 }
 
 /*
- * We must destroy each queue exactly once, when either
- * the netdev or the file go away.
+ * The file owning the queue got closed, give up both
+ * the reference that the files holds as well as the
+ * one from the macvlan_dev if that still exists.
  *
  * Using the spinlock makes sure that we don't get
  * to the queue again after destroying it.
- *
- * synchronize_rcu serializes with the packet flow
- * that uses rcu_read_lock.
  */
-static void macvtap_del_queue(struct macvtap_queue **qp)
+static void macvtap_put_queue(struct macvtap_queue *q)
 {
-   struct macvtap_queue *q;
+   struct macvlan_dev *vlan;
 
spin_lock(macvtap_lock);
-   q = rcu_dereference(*qp);
-   if (!q) {
-   spin_unlock(macvtap_lock);
-   return;
+   vlan = rcu_dereference(q-vlan);
+   if (vlan) {
+   rcu_assign_pointer(vlan-tap, NULL);
+   rcu_assign_pointer(q-vlan, NULL);
+   sock_put(q-sk);
}
 
-   rcu_assign_pointer(q-vlan-tap, NULL);
-   rcu_assign_pointer(q-file-private_data, NULL);
spin_unlock(macvtap_lock);
 
synchronize_rcu();
@@ -151,21 +139,29 @@ static struct macvtap_queue *macvtap_get_queue(struct 
net_device *dev,
return rcu_dereference(vlan-tap);
 }
 
+/*
+ * The net_device is going away, give up the reference
+ * that it holds on the queue (all the queues one day)
+ * and safely set the pointer from the queues to NULL.
+ */
 static void macvtap_del_queues(struct net_device *dev)
 {
struct macvlan_dev *vlan = netdev_priv(dev);
-   macvtap_del_queue(vlan-tap);
-}
+   struct macvtap_queue *q

[Qemu-devel] [PATCH 2/2] net/macvtap: add vhost support

2010-02-13 Thread Arnd Bergmann
This adds support for passing a macvtap file descriptor into
vhost-net, much like we already do for tun/tap.

Most of the new code is taken from the respective patch
in the tun driver and may get consolidated in the future.

Signed-off-by: Arnd Bergmann a...@arndb.de
---
 drivers/net/macvtap.c  |   98 ++-
 drivers/vhost/net.c|8 +++-
 include/linux/if_macvlan.h |   13 ++
 3 files changed, 96 insertions(+), 23 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 7050997..e354501 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -58,6 +58,8 @@ static unsigned int macvtap_major;
 static struct class *macvtap_class;
 static struct cdev macvtap_cdev;
 
+static const struct proto_ops macvtap_socket_ops;
+
 /*
  * RCU usage:
  * The macvtap_queue and the macvlan_dev are loosely coupled, the
@@ -176,7 +178,7 @@ static int macvtap_forward(struct net_device *dev, struct 
sk_buff *skb)
return -ENOLINK;
 
skb_queue_tail(q-sk.sk_receive_queue, skb);
-   wake_up(q-sk.sk_sleep);
+   wake_up_interruptible_poll(q-sk.sk_sleep, POLLIN | POLLRDNORM | 
POLLRDBAND);
return 0;
 }
 
@@ -242,7 +244,7 @@ static void macvtap_sock_write_space(struct sock *sk)
return;
 
if (sk-sk_sleep  waitqueue_active(sk-sk_sleep))
-   wake_up_interruptible_sync(sk-sk_sleep);
+   wake_up_interruptible_poll(sk-sk_sleep, POLLOUT | POLLWRNORM | 
POLLWRBAND);
 }
 
 static int macvtap_open(struct inode *inode, struct file *file)
@@ -270,6 +272,8 @@ static int macvtap_open(struct inode *inode, struct file 
*file)
init_waitqueue_head(q-sock.wait);
q-sock.type = SOCK_RAW;
q-sock.state = SS_CONNECTED;
+   q-sock.file = file;
+   q-sock.ops = macvtap_socket_ops;
sock_init_data(q-sock, q-sk);
q-sk.sk_write_space = macvtap_sock_write_space;
 
@@ -387,32 +391,20 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
 
rcu_read_lock_bh();
vlan = rcu_dereference(q-vlan);
-   macvlan_count_rx(vlan, len, ret == 0, 0);
+   if (vlan)
+   macvlan_count_rx(vlan, len, ret == 0, 0);
rcu_read_unlock_bh();
 
return ret ? ret : len;
 }
 
-static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
-   unsigned long count, loff_t pos)
+static ssize_t macvtap_do_read(struct macvtap_queue *q, struct kiocb *iocb,
+  const struct iovec *iv, unsigned long len,
+  int noblock)
 {
-   struct file *file = iocb-ki_filp;
-   struct macvtap_queue *q = file-private_data;
-
DECLARE_WAITQUEUE(wait, current);
struct sk_buff *skb;
-   ssize_t len, ret = 0;
-
-   if (!q) {
-   ret = -ENOLINK;
-   goto out;
-   }
-
-   len = iov_length(iv, count);
-   if (len  0) {
-   ret = -EINVAL;
-   goto out;
-   }
+   ssize_t ret = 0;
 
add_wait_queue(q-sk.sk_sleep, wait);
while (len) {
@@ -421,7 +413,7 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const 
struct iovec *iv,
/* Read frames from the queue */
skb = skb_dequeue(q-sk.sk_receive_queue);
if (!skb) {
-   if (file-f_flags  O_NONBLOCK) {
+   if (noblock) {
ret = -EAGAIN;
break;
}
@@ -440,7 +432,24 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const 
struct iovec *iv,
 
current-state = TASK_RUNNING;
remove_wait_queue(q-sk.sk_sleep, wait);
+   return ret;
+}
+
+static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
+   unsigned long count, loff_t pos)
+{
+   struct file *file = iocb-ki_filp;
+   struct macvtap_queue *q = file-private_data;
+   ssize_t len, ret = 0;
 
+   len = iov_length(iv, count);
+   if (len  0) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   ret = macvtap_do_read(q, iocb, iv, len, file-f_flags  O_NONBLOCK);
+   ret = min_t(ssize_t, ret, len); /* XXX copied from tun.c. Why? */
 out:
return ret;
 }
@@ -538,6 +547,53 @@ static const struct file_operations macvtap_fops = {
 #endif
 };
 
+static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock,
+  struct msghdr *m, size_t total_len)
+{
+   struct macvtap_queue *q = container_of(sock, struct macvtap_queue, 
sock);
+   return macvtap_get_user(q, m-msg_iov, total_len,
+   m-msg_flags  MSG_DONTWAIT);
+}
+
+static int macvtap_recvmsg(struct kiocb *iocb, struct socket *sock,
+  struct msghdr *m, size_t total_len,
+  int flags)
+{
+   struct

Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..

2010-01-28 Thread Arnd Bergmann
On Monday 25 January 2010, Dor Laor wrote:
 x86   qemu64
 x86   phenom
 x86 core2duo
 x86kvm64
 x86   qemu32
 x86  coreduo
 x86  486
 x86  pentium
 x86 pentium2
 x86 pentium3
 x86   athlon
 x86 n270

I think a really nice addition would be an autodetect option for those
users (e.g. desktop) that know they do not want to migrate the guest
to a lower-spec machine. 

That option IMHO should just show up as identical to the host cpu, with
the exception of features that are not supported in the guest.

Arnd




Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..

2010-01-28 Thread Arnd Bergmann
On Thursday 28 January 2010, Alexander Graf wrote:
  That option IMHO should just show up as identical to the host cpu, with
  the exception of features that are not supported in the guest.
 
 That's exactly what -cpu host is. IIRC it's the default now.

Ah, cool. Sorry for my ignorance here.

From what I can tell, neither the man page nor 'qemu -cpu \?' show that
information though.

Arnd




Re: [Qemu-devel] Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu

2010-01-26 Thread Arnd Bergmann
On Wednesday 27 January 2010, Anthony Liguori wrote:
  The raw backend can be attached to a physical device
 
 This is equivalent to bridging with tun/tap except that it has the 
 unexpected behaviour of unreliable host/guest networking (which is not 
 universally consistent across platforms either).  This is not a mode we 
 want to encourage users to use.

It's not the most common scenario, but I've seen systems (I remember
one on s/390 with z/VM) where you really want to isolate the guest
network as much as possible from the host network. Besides PCI
passthrough, giving the host device to a guest using a raw socket
is the next best approximation of that.

Then again, macvtap will do that too, if the device driver supports
multiple unicast MAC addresses without forcing promiscous mode.

  , macvlan
 
 macvtap is a superior way to achieve this use case because a macvtap fd 
 can safely be given to a lesser privilege process without allowing 
 escalation of privileges.

Yes.

or SR-IOV VF.
 
 
 This depends on vhost-net.

Why? I don't see anything in this scenario that is vhost-net specific.
I also plan to cover this aspect in macvtap in the future, but the current
code does not do it yet. It also requires device driver changes.

   In general, what I would like to see for 
 this is something more user friendly that dealt specifically with this 
 use-case.  Although honestly, given the recent security concerns around 
 raw sockets, I'm very concerned about supporting raw sockets in qemu at all.
 
 Essentially, you get worse security doing vhost-net + raw + VF then with 
 PCI passthrough + VF because at least in the later case you can run qemu 
 without privileges.  CAP_NET_RAW is a very big privilege.

It can be contained to a large degree with network namespaces. When you
run qemu in its own namespace and add the VF to that, CAP_NET_RAW
should ideally have no effect on other parts of the system (except
bugs in the namespace implementation).

Arnd




[Qemu-devel] Re: [PATCH] Add definitions for current cpu models..

2010-01-20 Thread Arnd Bergmann
On Monday 18 January 2010, john cooper wrote:
 +.name = Conroe,
 +.level = 2,
 +.vendor1 = CPUID_VENDOR_INTEL_1,
 +.vendor2 = CPUID_VENDOR_INTEL_2,
 +.vendor3 = CPUID_VENDOR_INTEL_3,
 +.family = 6,   /* P6 */
 +.model = 2,

 that looks wrong -- what is model 2 actually?

 +.stepping = 3,
 +.features = PPRO_FEATURES | 
 +CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |/* note 1 */
 +CPUID_PSE36,/* note 2 */
 +.ext_features = CPUID_EXT_SSE3 | CPUID_EXT_SSSE3,
 +.ext2_features = (PPRO_FEATURES  CPUID_EXT2_MASK) | 
 +CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
 +.ext3_features = CPUID_EXT3_LAHF_LM,
 +.xlevel = 0x800A,
 +.model_id = Intel Celeron_4x0 (Conroe/Merom Class Core 2),
 +},

Celeron_4x0 is a rather bad example, because it is based on the 
single-core Conroe-L, which is family 6 / model 22 unlike all the dual-
and quad-core Merom/Conroe that are model 15.

 +{
 +.name = Penryn,
 +.level = 2,
 +.vendor1 = CPUID_VENDOR_INTEL_1,
 +.vendor2 = CPUID_VENDOR_INTEL_2,
 +.vendor3 = CPUID_VENDOR_INTEL_3,
 +.family = 6,   /* P6 */
 +.model = 2,
 +.stepping = 3,
 +.features = PPRO_FEATURES | 
 +CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |/* note 1 */
 +CPUID_PSE36,/* note 2 */
 +.ext_features = CPUID_EXT_SSE3 |
 +CPUID_EXT_CX16 | CPUID_EXT_SSSE3 | CPUID_EXT_SSE41,
 +.ext2_features = (PPRO_FEATURES  CPUID_EXT2_MASK) | 
 +CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
 +.ext3_features = CPUID_EXT3_LAHF_LM,
 +.xlevel = 0x800A,
 +.model_id = Intel Core 2 Duo P9xxx (Penryn Class Core 2),
 +},

This would be model 23 for Penryn-class Xeon/Core/Pentium/Celeron processors
without L3 cache.

 +{
 +.name = Nehalem,
 +.level = 2,
 +.vendor1 = CPUID_VENDOR_INTEL_1,
 +.vendor2 = CPUID_VENDOR_INTEL_2,
 +.vendor3 = CPUID_VENDOR_INTEL_3,
 +.family = 6,   /* P6 */
 +.model = 2,
 +.stepping = 3,
 +.features = PPRO_FEATURES | 
 +CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |/* note 1 */
 +CPUID_PSE36,/* note 2 */
 +.ext_features = CPUID_EXT_SSE3 |
 +CPUID_EXT_CX16 | CPUID_EXT_SSSE3 | CPUID_EXT_SSE41 |
 +CPUID_EXT_SSE42 | CPUID_EXT_POPCNT,
 +.ext2_features = (PPRO_FEATURES  CPUID_EXT2_MASK) | 
 +CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
 +.ext3_features = CPUID_EXT3_LAHF_LM,
 +.xlevel = 0x800A,
 +.model_id = Intel Core i7 9xx (Nehalem Class Core i7),
 +},

Apparently, not all the i7-9xx CPUs are Nehalem, the i7-980X is supposed
to be Westmere, which has more features.

Because of the complexity, I'd recommend passing down the *model* number
of the emulated CPU, the interesting Intel ones (those supported by KVM) being:

15-6: CedarMill/Presler/Dempsey/Tulsa (Pentium 4/Pentium D/Xeon 50xx/Xeon 71xx)
6-14: Yonah/Sossaman (Celeron M4xx, Core Solo/Duo, Pentium Dual-Core T1000, 
Xeon ULV)
6-15: Merom/Conroe/Kentsfield/Woodcrest/Clovertown/Tigerton
  (Celeron M5xx/E1xxx/T1xxx, Pentium T2xxx/T3xxx/E2xxx,Core 2 Solo U2xxx,
   Core 2 Duo E4xxx/E6xxx/Q6xxx/T5xxx/T7xxx/L7xxx/U7xxx/SP7xxx,
   Xeon 30xx/32xx/51xx/52xx/72xx/73xx)
6-22: Penryn/Wolfdale/Yorkfield/Harpertown (Celeron 7xx/9xx/SU2xxx/T3xxx/E3xxx,
   Pentium T4xxx/SU2xxx/SU4xxx/E5xxx/E6xxx, Core 2 Solo SU3xxx,
   Core 2 Duo P/SU/T6xxx/x8xxx/x9xxx,
   Xeon 31xx/33xx/52xx/54xx)
6-26: Gainestown/Bloomfield (Xeon 35xx/55xx, Core i7-9xx)
6-28: Atom
6-29: Dunnington (Xeon 74xx)
6-30: Lynnfield/Clarksfield/JasperForest (Xeon 34xx, Core i7-8xx, Core i7-xxxQM,
   Core i5-7xx)
6-37: Arrandale/Clarkdale (Dual-Core Core i3/i5/i7)
6-44: Gulftown (six-core)

Arnd




[Qemu-devel] Re: Guest bridge setup variations

2009-12-16 Thread Arnd Bergmann
On Wednesday 16 December 2009, Leonid Grossman wrote:
   3. Doing the bridging in the NIC using macvlan in passthrough
   mode. This lowers the CPU utilization further compared to 2,
   at the expense of limiting throughput by the performance of
   the PCIe interconnect to the adapter. Whether or not this
   is a win is workload dependent. 
 
 This is certainly true today for pci-e 1.1 and 2.0 devices, but 
 as NICs move to pci-e 3.0 (while remaining almost exclusively dual port
 10GbE for a long while), 
 EVB internal bandwidth will significantly exceed external bandwidth.
 So, #3 can become a win for most inter-guest workloads.

Right, it's also hardware dependent, but it usually comes down
to whether it's cheaper to spend CPU cycles or to spend IO bandwidth.

I would be surprised if all future machines with PCIe 3.0 suddenly have
a huge surplus of bandwidth but no CPU to keep up with that.

   Access controls now happen
   in the NIC. Currently, this is not supported yet, due to lack of
   device drivers, but it will be an important scenario in the future
   according to some people.
 
 Actually, x3100 10GbE drivers support this today via sysfs interface to
 the host driver 
 that can choose to control VEB tables (and therefore MAC addresses, vlan
 memberships, etc. for all passthru interfaces behind the VEB).

Ok, I didn't know about that.

 OF course a more generic vendor-independent interface will be important
 in the future.

Right. I hope we can come up with something soon. I'll have a look at
what your driver does and see if that can be abstracted in some way.
I expect that if we can find an interface between the kernel and device
driver for two or three NIC implementations that it will be good enough
to adapt to everyone else as well.

Arnd 




Re: [Qemu-devel] Guest bridge setup variations

2009-12-14 Thread Arnd Bergmann
On Friday 11 December 2009, Anthony Liguori wrote:
 Arnd Bergmann wrote:
  3) given an fd, treat a vhost-style interface
 
  This could mean two things, not sure which one you mean. Either the
  file descriptor could be the vhost file descriptor, or the socket or tap 
  file
  descriptor from above, with qemu operating on the vhost interface itself.
 
  Either option has its advantages, but I guess we should only implement
  one of the two to keep it simple.

 
 I was thinking the socket/tap descriptor.

ok.

  Right. I wonder if this helper should integrate with netcf as an 
  abstraction,
  or if we should rather do something generic. It may also be a good idea
  to let the helper decide which of the three options you listed to use
  and pass that back to qemu unless the user overrides it. The decision
  probably needs to be host specific, depending e.g. on the availability
  and version of tools (brctl, iproute, maybe tunctl, ...), the respective
  kernel modules (vhost, macvlan, bridge, tun, ...) and policy (VEPA, vlan,
  ebtables). Ideally the approach should be generic enough to work on
  other platforms (BSD, Solaris, Windows, ...).
 
 For helpers, I think I'd like to stick with what we currently support, 
 and then allow for a robust way for there to be independent projects 
 that implement their own helpers.   For instance, I would love it if 
 netcf had a qemu network plugin helper.

Moving netcf specific qemu helpers into the netcf project sounds good.
I'm not sure what you mean with 'stick to what we currently support',
do you mean with helpers that ship with qemu itself? That sounds
reasonable, though I'd obviously like to make sure they also work
with macvtap, which is currently not supported unless you pass an
open file descriptor into -net tap,fd.

 There's just too much in the networking space all wrapped up in layers 
 of policy decisions.  I think it's time to move it out of qemu.

Yes.

Arnd




Re: [Qemu-devel] Re: Guest bridge setup variations

2009-12-10 Thread Arnd Bergmann
On Thursday 10 December 2009 19:14:28 Alexander Graf wrote:
  This is something I also have been thinking about, but it is not what
  I was referring to above. I think it would be good to keep the three
  cases (macvlan, VMDq, SR-IOV) as similar as possible from the user
  perspective, so using macvlan as an infrastructure for all of them
  sounds reasonable to me.
 
 Oh, so you'd basically do -net vt-d,if=eth0 and the rest would
 automatically work? That's a pretty slick idea!

I was only referring to how they get set up under the covers, e.g.
creating the virtual device, configuring the MAC address etc, not
the qemu side, but that would probably make sense as well.

Or even better, qemu should probably not even know the difference
between macvlan and VT-d. In both cases, it would open a macvtap
file, but for VT-d adapters, the macvlan infrastructure can
use hardware support, much in the way that VLAN tagging gets
offloaded automatically to the hardware.

Arnd 




[Qemu-devel] Re: [PATCH, try 2] qemu/tap: add -net tap,dev= option

2009-12-10 Thread Arnd Bergmann
On Wednesday 09 December 2009, Anthony Liguori wrote:
 This isn't really a generic thing and I dislike pretending it is.  This 
 is specifically for macvtap.

Well, depending how how things work out with VMD-q and SR-IOV,
I might move the tap interface stuff into a library module and add more
user-level shims for it that deal with the creation of character devices.

Macvtap essentially implements a tun/tap compatible character device,
while leaving out some of the stuff that doesn't make sense there (e.g.
creation of virtual network interfaces). The idea of my patch was
to deal with anything that is a tap protocol implementation.

One thing that is special for macvtap is that the guest MAC address
has to match the MAC address of the macvtap downstream device,
because macvlan filters out all traffic destined for other unicast
addresses.

 If we were going to do this, I'd rather introduce a -net macvtap that 
 actually allocated the interfaces (similar to how tap works).  The 
 problem with this interface is that it's a two stage process.  You have 
 to create an interface, then hand the name to qemu.  It would be just as 
 easy to hand qemu an fd with a helper.

You can't really allocate the device from qemu in a nice way.
I had mostly implemented macvtap support for your -net bridge
helper when I realized this.

Since it uses netlink (with CAP_NET_ADMIN) to create or destroy
the netdev, it first needs to wait for udev to create the device
node (which is not so bad by itself), and then use netlink again
after shutting down to destroy the network device.

 What's nice about -net tap is that with a little bit of setup 
 (/etc/qemu-ifup), it Just Works.  -net tap,dev= does not share this 
 property.

The qemu-ifup stuff IMHO is just a workaround for the problem of
having to defer network setup until after you open the device.

Macvtap doesn't have this problem. In the same way we deal with
other host resources (raw disk access, USB passthrough, /dev/kvm),
you first set up specifically what the user is allowed to do, either
manually or through libvirt and then just use it.

 I think this is a good place where an exec helper would be a natural fit.

IMHO, the exec helper is a good way to abstract away the difference
between tap, macvtap and possibly others based on the host configuration,
but it doesn't really make sense for a separate -net macvtap backend
like you suggested. You've shown that the helper can be used to enforce
user permissions for tun/tap, which lacks this, but for macvtap we can
just use regular user permissions.

Arnd




Re: [Qemu-devel] Guest bridge setup variations

2009-12-10 Thread Arnd Bergmann
On Wednesday 09 December 2009, Anthony Liguori wrote:
 While we go over all of these things one thing is becoming clear to me.  
 We need to get qemu out of the network configuration business.  There's 
 too much going on here.

Agreed.

 What I'd like to see is the following interfaces supported:
 
 1) given an fd, make socket calls to send packets.  Could be used with a 
 raw socket, a multicast or tcp socket.
 2) given an fd, use tap-style read/write calls to send packets*

yes.

 3) given an fd, treat a vhost-style interface

This could mean two things, not sure which one you mean. Either the
file descriptor could be the vhost file descriptor, or the socket or tap file
descriptor from above, with qemu operating on the vhost interface itself.

Either option has its advantages, but I guess we should only implement
one of the two to keep it simple.

 I believe we should continue supporting the mechanisms we support 
 today.  However, for people that invoke qemu directly from the command 
 line, I believe we should provide a mechanism like the tap helper that 
 can be used to call out to a separate program to create these initial 
 file descriptors.  We'll have to think about how we can make this 
 integrate well so that the syntax isn't clumsy.

Right. I wonder if this helper should integrate with netcf as an abstraction,
or if we should rather do something generic. It may also be a good idea
to let the helper decide which of the three options you listed to use
and pass that back to qemu unless the user overrides it. The decision
probably needs to be host specific, depending e.g. on the availability
and version of tools (brctl, iproute, maybe tunctl, ...), the respective
kernel modules (vhost, macvlan, bridge, tun, ...) and policy (VEPA, vlan,
ebtables). Ideally the approach should be generic enough to work on
other platforms (BSD, Solaris, Windows, ...).

One thing I realized the last time we discussed the helper approach is
that qemu should not need to know or care about the arguments passed
to the helper, otherwise you get all the complexity back in qemu that
you're trying to avoid. Maybe for 0.13 we can convert -net socket and
-net tap to just pass all their options to the helper and move that code
out of qemu, along with introducting the new syntax.

Another unrelated issue that I think needs to be addressed in a
network code cleanup is adding better support for multi-queue
transmit and receive. I've prepared macvtap for that by letting you
open the chardev multiple times to get one queue per guest CPU,
but that needs to be supported by qemu and virtio-net as well
to actually parallelize network operation. Ideally, two guest CPUs
should be able to transmit and receive on separate queues of the
adapter without ever having to access any shared resources.

Arnd




[Qemu-devel] Re: Guest bridge setup variations

2009-12-10 Thread Arnd Bergmann
On Thursday 10 December 2009, Fischer, Anna wrote:
  
  3. Doing the bridging in the NIC using macvlan in passthrough
  mode. This lowers the CPU utilization further compared to 2,
  at the expense of limiting throughput by the performance of
  the PCIe interconnect to the adapter. Whether or not this
  is a win is workload dependent. Access controls now happen
  in the NIC. Currently, this is not supported yet, due to lack of
  device drivers, but it will be an important scenario in the future
  according to some people.
 
 Can you differentiate this option from typical PCI pass-through mode?
 It is not clear to me where macvlan sits in a setup where the NIC does
 bridging.

In this setup (hypothetical so far, the code doesn't exist yet), we use
the configuration logic of macvlan, but not the forwarding. This also
doesn't do PCI pass-through but instead gives all the logical interfaces
to the host, using only the bridging and traffic separation capabilities
of the NIC, but not the PCI-separation.

Intel calls this mode VMDq, as opposed to SR-IOV, which implies
the assignment of the adapter to a guest.

It was confusing of me to call it passthrough above, sorry for that.

 Typically, in a PCI pass-through configuration, all configuration goes
 through the physical function device driver (and all data goes directly
 to the NIC). Are you suggesting to use macvlan as a common
 configuration layer that then configures the underlying NIC?
 I could see some benefit in such a model, though I am not certain I
 understand you correctly.

This is something I also have been thinking about, but it is not what
I was referring to above. I think it would be good to keep the three
cases (macvlan, VMDq, SR-IOV) as similar as possible from the user
perspective, so using macvlan as an infrastructure for all of them
sounds reasonable to me.

The difference between VMDq and SR-IOV in that case would be
that the former uses a virtio-net driver in the guest and a hardware
driver in the host, while the latter uses a hardware driver in the guest
only. The data flow on these two would be identical though, while
in the classic macvlan the data forwarding decisions are made in
the host kernel.

Arnd




[Qemu-devel] Re: [PATCH, try 2] qemu/tap: add -net tap,dev= option

2009-12-09 Thread Arnd Bergmann
On Wednesday 09 December 2009, Michael S. Tsirkin wrote:
 On Tue, Dec 08, 2009 at 06:41:44PM +0100, Arnd Bergmann wrote:
  --- a/net/tap-bsd.c
  +++ b/net/tap-bsd.c
  @@ -40,7 +40,8 @@
   #include util.h
   #endif
   
  -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int 
  vnet_hdr_required)
  +int tap_open(char *ifname, int ifname_size, char *dev, int dev_size,
  +   int *vnet_hdr, int vnet_hdr_required)
   {
   int fd;
   char *dev;
 
 Does this compile?

I don't have a BSD or Solaris machine here, or even just a cross-compiler, so
I could not test. However, I only add two arguments and I did the identical
change in the header file and the linux version of this file, so I am reasonably
confident.
 
  -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int 
  vnet_hdr_required)
  +int tap_open(char *ifname, int ifname_size, char *dev, int dev_size,
 
 dev is never modified, so it should be const char *.

ok.

  +   int *vnet_hdr, int vnet_hdr_required)
   {
   struct ifreq ifr;
   int fd, ret;
  +const char *path;
   
  -TFR(fd = open(/dev/net/tun, O_RDWR));
  +if (dev[0] == '\0')
 
 == 0 checks are ugly. if (*dev) is shorter, and it's a standard C
 idiom to detect non-empty string. Or better pass NULL if no device,
 then you can just path = dev ? dev : /dev/net/tun.

Agreed in principle, but I was following the style that is already used
in the same function, and I think consistency is more important in
this case.
 
  +   path = /dev/net/tun;
  +else
  +   path = dev;
 
 Please do not indent by single space.

For some reason, this file uses four character indentation, which
I followed for consistency. In the patch this gets mangled when
some lines use only spaces for indentation and others use only
tabs.

I could change to using only spaces for indentation if that's preferred.

  diff --git a/net/tap.c b/net/tap.c
  index 0d8b424..14ddf65 100644
  --- a/net/tap.c
  +++ b/net/tap.c
  @@ -343,12 +343,17 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
   {
   int fd, vnet_hdr_required;
   char ifname[128] = {0,};
  +char dev[128] = {0,};
   const char *setup_script;
   
   if (qemu_opt_get(opts, ifname)) {
   pstrcpy(ifname, sizeof(ifname), qemu_opt_get(opts, ifname));
   }
   
  +if (qemu_opt_get(opts, dev)) {
  +pstrcpy(dev, sizeof(dev), qemu_opt_get(opts, dev));
  +}
  +
 
 While in this case this is unlikely to be a problem in practice, we still
 should not add arbitrary limitations on file name lengths.  Just teach
 tap_open to get NULL instead of and empty string, or better supply
 default /dev/net/tun to the option, and you will not need the strcpy.

Right, I will do that, or alternatively make dev an input/output argument,
see below.

   snprintf(s-nc.info_str, sizeof(s-nc.info_str),
  - ifname=%s,script=%s,downscript=%s,
  - ifname, script, downscript);
  + ifname=%s,dev=%s,script=%s,downscript=%s,
 
 This will look strange if dev is not supplied, will it not?
 Also, I wonder whether there might be any scripts parsing
 info string from monitor, that will get broken by the
 extra parameter. How about we only print dev if it
 is not the default?

Right. I originally planned to return /dev/net/tun from tap_open
in that case, but I forgot to put that in. It would also not work well
for Solaris and BSD unless I add untested code there.

I guess it would be consistent to do that, but unless someone insists
on it, I'll just follow your advice here and remove it from being printed.

  +-net 
  tap[,vlan=n][,name=str][,fd=h][,ifname=name][,dev=str][,script=file]\n
  +[,downscript=dfile][,sndbuf=nbytes][,vnet_hdr=on|off]\n
   connect the host TAP network interface to VLAN 'n' 
  and use the\n
   network scripts 'file' (default=%s)\n
   and 'dfile' (default=%s);\n
   use '[down]script=no' to disable script execution;\n
   use 'fd=h' to connect to an already opened TAP 
  interface\n
  +use 'dev=str' to open a specific tap character 
  device\n
 
 please document default /dev/net/tun

ok.

Thanks for the review!

Arnd




[Qemu-devel] [PATCH, try 2, version 2] qemu/tap: add -net tap, dev= option

2009-12-09 Thread Arnd Bergmann
In order to support macvtap, we need a way to select the actual
tap endpoint. While it would be nice to pass it by network interface
name, passing the character device is more flexible, and we can
easily do both in the long run.

This version makes it possible to use macvtap without introducing
any macvtap specific code in qemu, only a natural extension to
the existing interface.

Signed-off-by: Arnd Bergmann a...@arndb.de
Acked-by: Mark McLoughlin mar...@redhat.com
Cc: Michael S. Tsirkin m...@redhat.com
---
Hopefully addressed all comments from Michael. I did compile-test
the non-linux files I touched (the solaris file failed to build for another
reason), and this now prevents passing dev= on non-linux machines.

 net.c |5 +
 net/tap-aix.c |3 ++-
 net/tap-bsd.c |9 -
 net/tap-linux.c   |   12 
 net/tap-solaris.c |7 ++-
 net/tap.c |   17 ++---
 net/tap.h |3 ++-
 qemu-options.hx   |   10 +-
 8 files changed, 54 insertions(+), 12 deletions(-)

diff --git a/net.c b/net.c
index 13bdbb2..deb12fd 100644
--- a/net.c
+++ b/net.c
@@ -955,6 +955,11 @@ static struct {
 },
 #ifndef _WIN32
 {
+.name = dev,
+.type = QEMU_OPT_STRING,
+.help = character device pathname,
+},
+{
 .name = fd,
 .type = QEMU_OPT_STRING,
 .help = file descriptor of an already opened tap,
diff --git a/net/tap-aix.c b/net/tap-aix.c
index 4bc9f2f..238c190 100644
--- a/net/tap-aix.c
+++ b/net/tap-aix.c
@@ -25,7 +25,8 @@
 #include net/tap.h
 #include stdio.h
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int 
vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, const char *dev,
+   int *vnet_hdr, int vnet_hdr_required)
 {
 fprintf(stderr, no tap on AIX\n);
 return -1;
diff --git a/net/tap-bsd.c b/net/tap-bsd.c
index 815997d..8a7334c 100644
--- a/net/tap-bsd.c
+++ b/net/tap-bsd.c
@@ -40,7 +40,8 @@
 #include util.h
 #endif
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int 
vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, const char *devarg,
+   int *vnet_hdr, int vnet_hdr_required)
 {
 int fd;
 char *dev;
@@ -80,6 +81,12 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, 
int vnet_hdr_required
 }
 #endif
 
+if (devarg) {
+qemu_error(dev= not supported\n);
+close(fd);
+return -1;
+}
+
 fstat(fd, s);
 dev = devname(s.st_rdev, S_IFCHR);
 pstrcpy(ifname, ifname_size, dev);
diff --git a/net/tap-linux.c b/net/tap-linux.c
index 6af9e82..d6831c0 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -32,14 +32,18 @@
 #include sysemu.h
 #include qemu-common.h
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int 
vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, const char *dev,
+int *vnet_hdr, int vnet_hdr_required)
 {
 struct ifreq ifr;
 int fd, ret;
 
-TFR(fd = open(/dev/net/tun, O_RDWR));
+if (!*dev)
+dev = /dev/net/tun;
+
+TFR(fd = open(dev, O_RDWR));
 if (fd  0) {
-fprintf(stderr, warning: could not open /dev/net/tun: no virtual 
network emulation\n);
+fprintf(stderr, warning: could not open %s: no virtual network 
emulation\n, dev);
 return -1;
 }
 memset(ifr, 0, sizeof(ifr));
@@ -70,7 +74,7 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, 
int vnet_hdr_required
 pstrcpy(ifr.ifr_name, IFNAMSIZ, tap%d);
 ret = ioctl(fd, TUNSETIFF, (void *) ifr);
 if (ret != 0) {
-fprintf(stderr, warning: could not configure /dev/net/tun: no virtual 
network emulation\n);
+fprintf(stderr, warning: could not configure %s: no virtual network 
emulation\n, dev);
 close(fd);
 return -1;
 }
diff --git a/net/tap-solaris.c b/net/tap-solaris.c
index e14fe36..3d10984 100644
--- a/net/tap-solaris.c
+++ b/net/tap-solaris.c
@@ -171,10 +171,15 @@ static int tap_alloc(char *dev, size_t dev_size)
 return tap_fd;
 }
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int 
vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, const char *devarg,
+   int *vnet_hdr, int vnet_hdr_required)
 {
 char  dev[10]=;
 int fd;
+if (devarg) {
+fprintf(stderr, dev= not supported\n);
+return -1;
+}
 if( (fd = tap_alloc(dev, sizeof(dev)))  0 ){
fprintf(stderr, Cannot allocate TAP device\n);
return -1;
diff --git a/net/tap.c b/net/tap.c
index 0d8b424..8635ae2 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -343,12 +343,15 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
 {
 int fd, vnet_hdr_required;
 char ifname[128] = {0,};
+const char *dev;
 const char *setup_script;
 
 if (qemu_opt_get(opts, ifname

Re: [Qemu-devel] Re: [PATCH, try 2, version 2] qemu/tap: add -net tap, dev= option

2009-12-09 Thread Arnd Bergmann
On Wednesday 09 December 2009, Michael S. Tsirkin wrote:
 On Wed, Dec 09, 2009 at 03:49:04PM +0100, Arnd Bergmann wrote:
   
  -TFR(fd = open(/dev/net/tun, O_RDWR));
  +if (!*dev)
  +dev = /dev/net/tun;
  +
 
 Did you test without dev parameter? I think dev will be NULL
 so this will deference a nullpointer ...
 probably if (!dev) is what you mean?

D'oh. will fix.
  
 will be neater if you put [,dev=str] after [,script=file]
 Also - it does need a string, but only insofar as all options are strings.
 Maybe dev=devfile or dev=file would be clearer.

Yep. Thanks,

Arnd




[Qemu-devel] Guest bridge setup variations

2009-12-08 Thread Arnd Bergmann
As promised, here is my small writeup on which setups I feel
are important in the long run for server-type guests. This
does not cover -net user, which is really for desktop kinds
of applications where you do not want to connect into the
guest from another IP address.

I can see four separate setups that we may or may not want to
support, the main difference being how the forwarding between
guests happens:

1. The current setup, with a bridge and tun/tap devices on ports
of the bridge. This is what Gerhard's work on access controls is
focused on and the only option where the hypervisor actually
is in full control of the traffic between guests. CPU utilization should
be highest this way, and network management can be a burden,
because the controls are done through a Linux, libvirt and/or Director
specific interface.

2. Using macvlan as a bridging mechanism, replacing the bridge
and tun/tap entirely. This should offer the best performance on
inter-guest communication, both in terms of throughput and
CPU utilization, but offer no access control for this traffic at all.
Performance of guest-external traffic should be slightly better
than bridge/tap.

3. Doing the bridging in the NIC using macvlan in passthrough
mode. This lowers the CPU utilization further compared to 2,
at the expense of limiting throughput by the performance of
the PCIe interconnect to the adapter. Whether or not this
is a win is workload dependent. Access controls now happen
in the NIC. Currently, this is not supported yet, due to lack of
device drivers, but it will be an important scenario in the future
according to some people.

4. Using macvlan for actual VEPA on the outbound interface.
This is mostly interesting because it makes the network access
controls visible in an external switch that is already managed.
CPU utilization and guest-external throughput should be
identical to 3, but inter-guest latency can only be worse because
all frames go through the external switch.

In case 2 through 4, we have the choice between macvtap and
the raw packet interface for connecting macvlan to qemu.
Raw sockets are better tested right now, while macvtap has
better permission management (i.e. it does not require
CAP_NET_ADMIN). Neither one is upstream though at the
moment. The raw driver only requires qemu patches, while
macvtap requires both a new kernel driver and a trivial change
in qemu.

In all four cases, vhost-net could be used to move the workload
from user space into the kernel, which may be an advantage.
The decision for or against vhost-net is entirely independent of
the other decisions.

Arnd




[Qemu-devel] [PATCH, try 2] qemu/tap: add -net tap,dev= option

2009-12-08 Thread Arnd Bergmann
In order to support macvtap, we need a way to select the actual
tap endpoint. While it would be nice to pass it by network interface
name, passing the character device is more flexible, and we can
easily do both in the long run.

Signed-off-by: Arnd Bergmann a...@arndb.de
---
diff --git a/net.c b/net.c
index 13bdbb2..deb12fd 100644
--- a/net.c
+++ b/net.c
@@ -955,6 +955,11 @@ static struct {
 },
 #ifndef _WIN32
 {
+.name = dev,
+.type = QEMU_OPT_STRING,
+.help = character device pathname,
+},
+{
 .name = fd,
 .type = QEMU_OPT_STRING,
 .help = file descriptor of an already opened tap,
diff --git a/net/tap-aix.c b/net/tap-aix.c
index 4bc9f2f..b789d06 100644
--- a/net/tap-aix.c
+++ b/net/tap-aix.c
@@ -25,7 +25,8 @@
 #include net/tap.h
 #include stdio.h
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int 
vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, char *dev, int dev_size,
+   int *vnet_hdr, int vnet_hdr_required)
 {
 fprintf(stderr, no tap on AIX\n);
 return -1;
diff --git a/net/tap-bsd.c b/net/tap-bsd.c
index 815997d..09a7780 100644
--- a/net/tap-bsd.c
+++ b/net/tap-bsd.c
@@ -40,7 +40,8 @@
 #include util.h
 #endif
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int 
vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, char *dev, int dev_size,
+   int *vnet_hdr, int vnet_hdr_required)
 {
 int fd;
 char *dev;
diff --git a/net/tap-linux.c b/net/tap-linux.c
index 6af9e82..a6df123 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -32,14 +32,21 @@
 #include sysemu.h
 #include qemu-common.h
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int 
vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, char *dev, int dev_size,
+   int *vnet_hdr, int vnet_hdr_required)
 {
 struct ifreq ifr;
 int fd, ret;
+const char *path;
 
-TFR(fd = open(/dev/net/tun, O_RDWR));
+if (dev[0] == '\0')
+   path = /dev/net/tun;
+else
+   path = dev;
+
+TFR(fd = open(dev, O_RDWR));
 if (fd  0) {
-fprintf(stderr, warning: could not open /dev/net/tun: no virtual 
network emulation\n);
+fprintf(stderr, warning: could not open %s: no virtual network 
emulation\n, path);
 return -1;
 }
 memset(ifr, 0, sizeof(ifr));
@@ -70,7 +77,7 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, 
int vnet_hdr_required
 pstrcpy(ifr.ifr_name, IFNAMSIZ, tap%d);
 ret = ioctl(fd, TUNSETIFF, (void *) ifr);
 if (ret != 0) {
-fprintf(stderr, warning: could not configure /dev/net/tun: no virtual 
network emulation\n);
+fprintf(stderr, warning: could not configure %s: no virtual network 
emulation\n, path);
 close(fd);
 return -1;
 }
diff --git a/net/tap-solaris.c b/net/tap-solaris.c
index e14fe36..72abb78 100644
--- a/net/tap-solaris.c
+++ b/net/tap-solaris.c
@@ -171,7 +171,8 @@ static int tap_alloc(char *dev, size_t dev_size)
 return tap_fd;
 }
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int 
vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, char *dev, int dev_size,
+   int *vnet_hdr, int vnet_hdr_required)
 {
 char  dev[10]=;
 int fd;
diff --git a/net/tap.c b/net/tap.c
index 0d8b424..14ddf65 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -343,12 +343,17 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
 {
 int fd, vnet_hdr_required;
 char ifname[128] = {0,};
+char dev[128] = {0,};
 const char *setup_script;
 
 if (qemu_opt_get(opts, ifname)) {
 pstrcpy(ifname, sizeof(ifname), qemu_opt_get(opts, ifname));
 }
 
+if (qemu_opt_get(opts, dev)) {
+pstrcpy(dev, sizeof(dev), qemu_opt_get(opts, dev));
+}
+
 *vnet_hdr = qemu_opt_get_bool(opts, vnet_hdr, 1);
 if (qemu_opt_get(opts, vnet_hdr)) {
 vnet_hdr_required = *vnet_hdr;
@@ -356,7 +361,8 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
 vnet_hdr_required = 0;
 }
 
-TFR(fd = tap_open(ifname, sizeof(ifname), vnet_hdr, vnet_hdr_required));
+TFR(fd = tap_open(ifname, sizeof(ifname), dev, sizeof(dev),
+   vnet_hdr, vnet_hdr_required));
 if (fd  0) {
 return -1;
 }
@@ -371,6 +377,7 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
 }
 
 qemu_opt_set(opts, ifname, ifname);
+qemu_opt_set(opts, dev, dev);
 
 return fd;
 }
@@ -382,10 +389,12 @@ int net_init_tap(QemuOpts *opts, Monitor *mon, const char 
*name, VLANState *vlan
 
 if (qemu_opt_get(opts, fd)) {
 if (qemu_opt_get(opts, ifname) ||
+qemu_opt_get(opts, dev) ||
 qemu_opt_get(opts, script) ||
 qemu_opt_get(opts, downscript) ||
 qemu_opt_get(opts, vnet_hdr

Re: [Qemu-devel] [PATCH, RFC] tap-linux: support opening arbitrary char devices

2009-12-03 Thread Arnd Bergmann
On Wednesday 02 December 2009, Anthony Liguori wrote:
 Arnd Bergmann wrote:
  With the upcoming macvtap, we will want to open devices other than
  /dev/net/tun but no longer need to call TUNSETIFF.

 
 What are the names of these devices and how do you the character devices 
 get created in the first place?

Like a macvlan device, you use iproute2 to set up the interface, e.g.

ip link add link eth0 type macvtap mode vepa

This is consistent with how you'd set up macvlan, veth or vlan devices.
With my current code, the name that gets used for the character
device in absence of a matching udev rule will be /dev/tap%d with %d
being the interface index of the network device.
 
  This makes it possible to do 'qemu -net tap,ifname=/dev/tap/macvtap0'
  to refer to a chardev in addition to the current way of doing
  'qemu -net tap,ifname=tap0' to refer to a tap network interface
  set with TUNSETIFF. This is consistent with what we do on BSD.
 
 We definitely don't want to overload ifname like this.  
 /dev/tap/macvtap0 is clearly not the interface name.

tap interfaces on other operating systems already differ in this respect,
and macvtap seems to be much closer to what BSD and Solaris do
than the Linux tun/tap interface, judging from how they are used
in qemu.

Assuming you have a virtual interface virteth0 (name made up to
be different from existing ones) with interface index 3, the syntax
would be

* Linux, with tap: (which doesn't really work with arbitrary devices):
qemu -net tap,ifname=virteth0

* Solaris: (opens /dev/tap3)
qemu -net tap,ifname=3

* BSD: (opens /dev/tap3)
qemu -net tap,ifname=tap3

I chose the full path name to be sure it does not conflict with existing
usage, 3 and tap3 are both valid interface names, /dev/tap3 is not.

If you prefer, I can add some logic to determine the character device name
from the network interface name, so that you can call it consistently
with the interface name on linux, wether using tun/tap or macvtap.
That patch will be somewhat more complicated than the one I posted.

Alternatively, I can use the documented name= parameter for
pointing to the character device, but that would be inconsitent with
the current usage on all the supported OSs.

Arnd 




[Qemu-devel] [PATCH, RFC] tap-linux: support opening arbitrary char devices

2009-11-26 Thread Arnd Bergmann
With the upcoming macvtap, we will want to open devices other than
/dev/net/tun but no longer need to call TUNSETIFF.

This makes it possible to do 'qemu -net tap,ifname=/dev/tap/macvtap0'
to refer to a chardev in addition to the current way of doing
'qemu -net tap,ifname=tap0' to refer to a tap network interface
set with TUNSETIFF. This is consistent with what we do on BSD.

This is guaranteed not to cause conflicts with existing usage,
because network devices on Linux cannot start with a '/'.

Alternatively, I could do a patch to use the -net tap,name=
parameter which documented to have a similar purpose but not
currently implemented at all.

Signed-off-by: Arnd Bergmann a...@arndb.de

diff --git a/net/tap-linux.c b/net/tap-linux.c
index 0f621a2..21f0167 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -36,10 +36,20 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, 
int vnet_hdr_required
 {
 struct ifreq ifr;
 int fd, ret;
+int open_named;
+char *devname;
+
+if (ifname_size = 1  *ifname == /) {
+   open_named = 1;
+   devname = ifname;
+} else {
+   open_named = 0;
+   devname = /dev/net/tun;
+}
 
-TFR(fd = open(/dev/net/tun, O_RDWR));
+TFR(fd = open(devname, O_RDWR));
 if (fd  0) {
-fprintf(stderr, warning: could not open /dev/net/tun: no virtual 
network emulation\n);
+fprintf(stderr, warning: could not open %s: no virtual network 
emulation\n, devname);
 return -1;
 }
 memset(ifr, 0, sizeof(ifr));
@@ -62,17 +72,20 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, 
int vnet_hdr_required
 }
 }
 
-if (ifname[0] != '\0')
-pstrcpy(ifr.ifr_name, IFNAMSIZ, ifname);
-else
-pstrcpy(ifr.ifr_name, IFNAMSIZ, tap%d);
-ret = ioctl(fd, TUNSETIFF, (void *) ifr);
-if (ret != 0) {
-fprintf(stderr, warning: could not configure /dev/net/tun: no virtual 
network emulation\n);
-close(fd);
-return -1;
+if (!open_named) {
+   if (ifname[0] != '\0')
+   pstrcpy(ifr.ifr_name, IFNAMSIZ, ifname);
+   else
+   pstrcpy(ifr.ifr_name, IFNAMSIZ, tap%d);
+   ret = ioctl(fd, TUNSETIFF, (void *) ifr);
+   if (ret != 0) {
+   fprintf(stderr, warning: could not configure %s: no virtual 
+   network emulation\n, devname);
+   close(fd);
+   return -1;
+   }
+   pstrcpy(ifname, ifname_size, ifr.ifr_name);
 }
-pstrcpy(ifname, ifname_size, ifr.ifr_name);
 fcntl(fd, F_SETFL, O_NONBLOCK);
 return fd;
 }




Re: [Qemu-devel] [PATCH 10/13] Implement early printk in virtio-console

2009-11-25 Thread Arnd Bergmann
On Wednesday 25 November 2009, Carsten Otte wrote:
 Anthony Liguori wrote:
  Oh, that's bad :-)
  
  That should really be it's own character device.  We don't really have a 
  way to connect two character devices like that.  Maybe muxing?
 It will be a character device, once the device tree is initialized. 
 Better ideas are welcome, just keep in mind basic things like kmalloc 
 don't work yet.
 

Can't you just leave it out for this release? Early printk is great
for debugging, but not essential if you just want to run a guest
as long as you get it past the virtio init phase.

Arnd 




Re: [Qemu-devel] [PATCH 4/4] Add support for -net bridge

2009-11-08 Thread Arnd Bergmann
On Sunday 08 November 2009 08:27:41 Avi Kivity wrote:
 On 11/08/2009 12:11 AM, Anthony Liguori wrote:
 
   You don't need root privileges to use a tap device.
 
  You can access a preconfigured tap device but you cannot allocate a 
  tap device and connect it to a bridge without CAP_NET_ADMIN.
 
 btw, shouldn't we, in the general case, create a bridge per user and use 
 IP NAT?  If we have a global bridge, users can spoof each other's MAC 
 addresses and interfere with their virtual machines.  They can also 
 interfere with the real network.
 
 That's not a concern with most one-user-per-machine configurations, but 
 the default configuration should be safe.

It also depends a lot on what you want to do with the virtual machine.
If you want to run a game or a legacy application in a different operating
system on your desktop, a NATed bridge is ideal, but it does not work
on a server if the guest wants to listen on a socket with its own IP address.

Arnd 




Re: [Qemu-devel] [PATCH 0/4] net-bridge: rootless bridge support for qemu

2009-11-07 Thread Arnd Bergmann
On Saturday 07 November 2009, Anthony Liguori wrote:
 Avi Kivity wrote:
  On 11/07/2009 11:14 AM, Avi Kivity wrote:
  I'd welcome -net bridge as one of them.  But we shouldn't try to 
  invent access control systems or install suid helpers.
 
  We can make the helper a script that does
 
exec sudo /the/real/helper $@
 
  so a user can add it to /etc/sudoers and get pre-authenticated 
  configuration.
 
 The key point of the helper here is that you pass an fd to a socketpair 
 and you then receive an fd over that socket.  What the helper does is 
 really less important.  Whether it's a script like you suggest or 
 something like I proposed doesn't matter from a qemu perspective.

Well, the difference matters from a security perspective. The sudo
script that Avi suggested just means that you can guarantee you don't
introduce any security holes through a suid executable. Fortunately,
it does not impact the contents of your helper either, only the
installation. You could even be clever in qemu and use call the helper
using sudo if qemu is running as unpriviledged user and the helper is
not a suid file.

Arnd