Re: Deprecation/removal of nios2 target support
On Thu, Apr 18, 2024, at 17:44, Joseph Myers wrote: > On Wed, 17 Apr 2024, Sandra Loosemore wrote: > >> Therefore I'd like to mark Nios II as obsolete in GCC 14 now, and remove >> support from all toolchain components after the release is made. I'm not >> sure >> there is an established process for obsoleting/removing support in other >> components; besides binutils, GDB, and GLIBC, there's QEMU, newlib/libgloss, >> and the Linux kernel. But, we need to get the ball rolling somewhere. > > CC:ing Arnd Bergmann regarding the obsolescence in the Linux kernel. We have not yet marked nios2 as deprecated in the kernel, but that is mostly because the implementation does not get in the way too much and Dinh Nguyen is still around as a maintainer and merging bugfixes. Almost all nios2 kernel changes I see in the past decade have been done blindly without testing on hardware, either for treewide changes, or by code inspection. The only notable exceptions I could find are from Andreas Oetken and Bernd Weiberg at Siemens and from Marek Vasut (all added to Cc in case they have something to add). We should probably remove nios2 from the kernel in the near future, but even if we decide not to, I think deprecating it from gcc is the right idea: If there are a few remaining users that still plan to update their kernels, gcc-14 will still be able to build new kernels for several years. Arnd
Re: possible deprecation and removal of some old QEMU Arm machine types (pxa2xx, omap, sa1110)
On Thu, Feb 15, 2024, at 09:31, Andreas Kemnade wrote: > On Wed, 14 Feb 2024 23:42:58 +0100 > "Arnd Bergmann" wrote: >> On Wed, Feb 14, 2024, at 13:26, Dmitry Baryshkov wrote: >> > On Tue, 13 Feb 2024 at 23:22, Linus Walleij >> > wrote: >> >> On Tue, Feb 13, 2024 at 9:12 PM Arnd Bergmann wrote: >> >> > On Tue, Feb 13, 2024, at 16:36, Guenter Roeck wrote: >> >> > > On Tue, Feb 13, 2024 at 03:14:21PM +, Peter Maydell wrote: >> >> >> >> Andrea Adami and Dmitry Eremin-Solenikov did the work in 2017 to >> >> modernize it a bit, and Russell helped out. I was under the impression >> >> that they only used real hardware though! >> > >> > I used both Qemu and actual hardware (having collie, poodle, tosa and >> > c860 that was easy). >> > >> > The biggest issue with Zaurus PDAs was that supporting interesting >> > parts of the platform (PCMCIA, companion chips) required almost >> > rebootstrapping of the corresponding drivers. >> > E.g. I had a separate driver for the LoCoMo chip which worked properly >> > with the DT systems. >> > PCMCIA was a huuuge trouble and it didn't play well at all. The driver >> > must be rewritten to use the component framework. >> >> If we want to actually go there, I think the best option for PCMCIA >> support is likely to replace the entire "soc_common" pcmcia driver >> with a simple drivers/pata/ storage driver and no support for >> other cards. There was a driver until commit 38943cbd25a2 >> ("ata: remove palmld pata driver") that could serve as an >> template. >> > hmm, main usage for PCMCIA/CF in those devices was often something else, > not storage, at least on the IPAQ h2200. Wondering wether that road is > actually good. When I was mainly using those devices, I was not good in > mainlining things. Do we still support any non-storage CF devices that someone might actually use? Do you have a specific example in mind? These are the currently supported devices that I see: git grep -B4 -w depends.*PCMCIA | grep "Kconfig-\(config\|menuconfig\)" | grep -v ^drivers/pcmcia drivers/ata/Kconfig-config PATA_PCMCIA drivers/bluetooth/Kconfig-config BT_HCIDTL1 drivers/bluetooth/Kconfig-config BT_HCIBT3C drivers/bluetooth/Kconfig-config BT_HCIBLUECARD drivers/comedi/Kconfig-menuconfig COMEDI_PCMCIA_DRIVERS drivers/mmc/host/Kconfig-config MMC_SDRICOH_CS drivers/mtd/maps/Kconfig-config MTD_PCMCIA drivers/mtd/maps/Kconfig-config MTD_PCMCIA_ANONYMOUS drivers/net/arcnet/Kconfig-menuconfig ARCNET drivers/net/arcnet/Kconfig-config ARCNET_COM20020_CS drivers/net/can/sja1000/Kconfig-config CAN_EMS_PCMCIA drivers/net/can/sja1000/Kconfig-config CAN_PEAK_PCMCIA drivers/net/can/softing/Kconfig-config CAN_SOFTING_CS drivers/net/ethernet/3com/Kconfig-config NET_VENDOR_3COM drivers/net/ethernet/3com/Kconfig-config PCMCIA_3C574 drivers/net/ethernet/3com/Kconfig-config PCMCIA_3C589 drivers/net/ethernet/8390/Kconfig-config PCMCIA_AXNET drivers/net/ethernet/8390/Kconfig-config APNE drivers/net/ethernet/8390/Kconfig-config PCMCIA_PCNET drivers/net/ethernet/amd/Kconfig-config PCMCIA_NMCLAN drivers/net/ethernet/fujitsu/Kconfig-config NET_VENDOR_FUJITSU drivers/net/ethernet/fujitsu/Kconfig-config PCMCIA_FMVJ18X drivers/net/ethernet/smsc/Kconfig-config PCMCIA_SMC91C92 drivers/net/ethernet/xircom/Kconfig-config NET_VENDOR_XIRCOM drivers/net/ethernet/xircom/Kconfig-config PCMCIA_XIRC2PS drivers/parport/Kconfig-config PARPORT_PC_PCMCIA drivers/scsi/pcmcia/Kconfig-menuconfig SCSI_LOWLEVEL_PCMCIA drivers/ssb/Kconfig-config SSB_PCMCIAHOST_POSSIBLE drivers/tty/Kconfig-config IPWIRELESS drivers/tty/serial/8250/Kconfig-config SERIAL_8250_CS drivers/usb/host/Kconfig-config USB_SL811_CS sound/pcmcia/Kconfig-menuconfig SND_PCMCIA Arnd
Re: possible deprecation and removal of some old QEMU Arm machine types (pxa2xx, omap, sa1110)
On Wed, Feb 14, 2024, at 13:26, Dmitry Baryshkov wrote: > On Tue, 13 Feb 2024 at 23:22, Linus Walleij wrote: >> On Tue, Feb 13, 2024 at 9:12 PM Arnd Bergmann wrote: >> > On Tue, Feb 13, 2024, at 16:36, Guenter Roeck wrote: >> > > On Tue, Feb 13, 2024 at 03:14:21PM +, Peter Maydell wrote: >> >> Andrea Adami and Dmitry Eremin-Solenikov did the work in 2017 to >> modernize it a bit, and Russell helped out. I was under the impression >> that they only used real hardware though! > > I used both Qemu and actual hardware (having collie, poodle, tosa and > c860 that was easy). > > The biggest issue with Zaurus PDAs was that supporting interesting > parts of the platform (PCMCIA, companion chips) required almost > rebootstrapping of the corresponding drivers. > E.g. I had a separate driver for the LoCoMo chip which worked properly > with the DT systems. > PCMCIA was a huuuge trouble and it didn't play well at all. The driver > must be rewritten to use the component framework. If we want to actually go there, I think the best option for PCMCIA support is likely to replace the entire "soc_common" pcmcia driver with a simple drivers/pata/ storage driver and no support for other cards. There was a driver until commit 38943cbd25a2 ("ata: remove palmld pata driver") that could serve as an template. Arnd
Re: possible deprecation and removal of some old QEMU Arm machine types (pxa2xx, omap, sa1110)
On Wed, Feb 14, 2024, at 02:27, Aaro Koskinen wrote: > On Tue, Feb 13, 2024 at 09:11:38PM +0100, Arnd Bergmann wrote: > > I'm one of the OMAP1 Linux kernel maintainers, and I have Palm TE which > I have been using for testing and development (and reporting bugs, > regressions) along with those other boards you mentioned. > > Since I have the real Palm HW, I haven't used QEMU for that particular > board. But however I use QEMU SX1 support frequently as it's quickest way > to check if OMAP1 is bootable, and if the basic peripherals are working. > SX1 is close to Palm/AMS-Delta, and also it's ARMv4T which is rare these > days. I think it's useful to keep it in QEMU as long there are hardware > that people use. > > So my wish is to keep at least SX1 support in QEMU as long as ARMv4T > supported in the Linux kernel. Makes sense. We have a couple of other ARMv4T systems in the kernel that are still being tested (ep93xx, at91rm9200, clps71xx, imx1, nspire, integrator/ap), but none of the others have any qemu support apparently unless you count "-Mintegratorpb -cpu arm925". All of these are are using DT or getting there (ep93xx), so we'll probably keep them around for a while. Similarly, we support a couple of ARMv4 (non-T) targets in the kernel (footbridge, sa1100, rpc, moxart, gemini), but the only one with qemu support here is sa1100/collie. >> >> > > OMAP2 machines: >> >> > > >> >> > > n800 Nokia N800 tablet aka. RX-34 (OMAP2420) >> >> > > n810 Nokia N810 tablet aka. RX-44 (OMAP2420) >> >> > > >> >> > I never managed to get those to boot the Linux kernel. >> >> I think Tony still tests these on both hardware and qemu. >> The platform side here is much more modern than any of the >> others above since it does use DT and it has enough RAM >> to be somewhat usable. > > I have also these boards (real hardware) and test them frequently with > mainline Linux. However, QEMU support I haven't used/needed. I recall it > was a bit buggy, and some changes in mainline made the kernel unbootable. > Unless Tony needs the support, I guess they are good to go. Thanks for confirming. > (Arnd: RAM isn't everything. Some of the OMAP1 boards today are still > more useful than N800/N810, even with modern bloaty Linux.) Obviously RAM isn't everything, but the machines with just 32MB or less do seem very small for real workloads, so I admit I dismiss them easily. I am curious what you run on those, are there any embedded distros that maintain good support for 32MB systems on modern kernel/musl/Xorg/..., or are you using omething older or highly customized? Arnd
Re: possible deprecation and removal of some old QEMU Arm machine types (pxa2xx, omap, sa1110)
On Tue, Feb 13, 2024, at 22:21, Linus Walleij wrote: > The Collie is popular because it is/was easy to get hold of and > easy to hack. PXA was in candybar phones (right?) which > are just veritable fortresses and really hard to hack so that is why > there is no interest (except for the occasional hyperfocused Harald > Welte), so those are a bit like the iPhones: you *can* boot something > custom on them, but it won't be easy or quick, and not as fun and > rewarding. The PXA machines that we support in kernel and qemu are either industrial/embedded machines or upgraded versions of the SA1100 PDAs (sharpsl), not phones. OTOH, these chips came out between 2003 and 2007 so I can also see how people at the time had already moved away from using pure PDAs to Symbian or Blackberry phones, so there might just be fewer PDAs that got sold with PXA than with SA1100 before there were smartphones. Arnd
Re: possible deprecation and removal of some old QEMU Arm machine types (pxa2xx, omap, sa1110)
On Tue, Feb 13, 2024, at 16:36, Guenter Roeck wrote: > On Tue, Feb 13, 2024 at 03:14:21PM +, Peter Maydell wrote: >> On Mon, 12 Feb 2024 at 14:36, Guenter Roeck wrote: >> > On 2/12/24 04:32, Peter Maydell wrote: >> > > The machines I have in mind are: >> > > >> > > PXA2xx machines: >> > > >> > > akitaSharp SL-C1000 (Akita) PDA (PXA270) >> > > borzoi Sharp SL-C3100 (Borzoi) PDA (PXA270) >> > > connex Gumstix Connex (PXA255) >> > > mainstoneMainstone II (PXA27x) >> > > spitzSharp SL-C3000 (Spitz) PDA (PXA270) >> > > terrier Sharp SL-C3200 (Terrier) PDA (PXA270) >> > > tosa Sharp SL-6000 (Tosa) PDA (PXA255) >> > > verdex Gumstix Verdex Pro XL6P COMs (PXA270) >> > > z2 Zipit Z2 (PXA27x) >> > > >> > I test akita, borzoi, spitz, and terrier. Upstream Linux removed support >> > for mainstone, tosa, and z2 from the Linux kernel as of version 6.0, so It was 6.3 (about one year ago). >> > I am no longer testing those. >> > >> > I never managed to boot connex or verdex. I kept specifically the pxa boards that would be sensible to port to devicetree because they had qemu support. gumstix verdex is the one with the most RAM out of those; spitz, sharpsl and their variants are the ones that some of us still have around. Robert had working devicetree support for some machines out of tree that he has not submitted, and presumably not worked on since. Unless someone starts working on converting the remaining pxa board files to DT, we can probably remove them after the next LTS kernel a year from now. I have no objection to removing them from qemu if that helps, the existing qemu releases should be sufficient for anyone trying the conversion. >> > > OMAP1 machines: >> > > >> > > cheetah Palm Tungsten|E aka. Cheetah PDA (OMAP310) >> > > sx1 Siemens SX1 (OMAP310) V2 >> > > sx1-v1 Siemens SX1 (OMAP310) V1 >> > > >> > I test sx1. I don't think I ever tried cheetah, and I could not get sx1-v1 >> > to work. This is similar. omap1 development is slightly more active than pxa, but then again they have no DT support today and are unlikely to ever get there at this point. Out of the five machines that are still supported in the kernel, I think three still run on hardware (osk, ams-delta and nokia770), while the other ones were left there only for their qemu support. I don't mind removing them from the kernel as well if they are gone from qemu. >> > > OMAP2 machines: >> > > >> > > n800 Nokia N800 tablet aka. RX-34 (OMAP2420) >> > > n810 Nokia N810 tablet aka. RX-44 (OMAP2420) >> > > >> > I never managed to get those to boot the Linux kernel. I think Tony still tests these on both hardware and qemu. The platform side here is much more modern than any of the others above since it does use DT and it has enough RAM to be somewhat usable. On the other hand, this is the one platform actually using the cursed arm1136r0 core (not counting imx31 and realview here as I'm not aware of any real users), and anything that would get us closer to dropping support for this CPU would be welcome ;-) >> > > The one SA1110 machine: >> > > >> > > collie Sharp SL-5500 (Collie) PDA (SA-1110) >> > > >> > I do test collie. Adding Linus Walleij and Stefan Lehner to Cc, as they were interested in modernizing sa1100 back in 2022. If they are still interested in that, they might want to keep collie support. Surprisingly, at the time I removed unused old board files, there was a lot more interest in sa1100 than in the newer pxa platform, which I guess wasn't as appealing for retrocomputing yet. >> > All the ones I use still boot the latest Linux kernel. >> > >> > > Obviously if we can remove all the machines that used a given >> > > SoC, that's much more effective than if we just delete one or two. >> > > >> > > I don't have any test images for the SA1110 or OMAP1 machines, >> > > so those are the ones I am most keen to be able to drop. >> > > I do have test images for a few of the pxa2xx and the OMAP2 machines. >> > > >> > I don't mind dropping them, just listing what I use for testing the >> > Linux kernel. I suspect I may be the only "user" of those boards, >> > though, both in Linux and qemu. >> >> Mmm; there's not much point in both QEMU and the kernel >> maintaining code that nobody's using. Are you considering >> dropping support for any of these SoC families from the kernel? >> > Not me personally. Arnd is the one mostly involved in dropping > support of obsolete hardware from the kernel These are all clearly among the least maintained boards we have left in the kernel after the last purge. At the time I asked for remaining users in 2022, I kept pretty much anything that had the slightest chance of still being used and I was already planning another round for the next LTS kernel in early 2023 that would be more
Re: [RFC PATCH] linux-user: time64: consolidate rt_sigtimedwait_time64 and rt_sigtimedwait
On Fri, Dec 16, 2022, at 19:08, Michael Tokarev wrote: > Both system calls are exactly the same, the only difference is the > (optional) timespec conversion. Consolidate the two, and check which > syscall being emulated in the timespec conversion place. > > This is just a PoC. But this brings at least 2 questions. > > Let's take pselect6 as an example. There are 2 possible pselects > in there (actually more, but let's focus on just the two): > TARGET_NR_pselect6 and TARGET_NR_pselect6_time64. Both are implemented > in a single do_pselect6() function, which, depending on its "time64" > argument, will call either target_to_host_timespec64() or > target_to_host_timespec(). > > But these functions, in turn, are guarded by a lot of #if > defined(foo). In particular, in context of pselect6, > target_to_host_timespec() is guarded by > if defined(TARGET_NR_pselect6), > while target_to_host_timespec64() is guarded by > if defined(TARGET_NR_pselect6_time64). > > So basically, if just one of the two TARGET_NR_pselect6* > is defined, one of target_to_host_timespec*() will not > be defined, but do_pselect6() calls *both* anyway. In > other words, both functions must be provided if any of > the select6 are to be implemented. > > But the good news is that these functions > (target_to_host_timespec*()) are always defined because > they're needed for some syscalls anyway, like, eg, > TARGET_NR_semop or TARGET_NR_utimensat or CONFIG_TIMERFD. > > It looks like whole gigantic ifdeffery around these two > functions should be dropped, and a common function provided, > with an extra argument like time64, to be used in many > places where this construct is already used. Like in > this PoC too, which again calls one of the two depending > on the actual syscall being used. Or maybe we can even > combine the two into one with an extra arg like "time64". The cleanup you suggest here looks correct to me, but as I mentioned on IRC, I think this should only be done if the time64 host side bug is also fixed. At the moment, both the time32 and time64 target syscalls are always translated into the time32 variant on 32-bit hosts, which is a mistake for multiple reasons: - the libc-defined 'timespec' argument that is passed into the kernel-defined syscall has the wrong layout, resulting in incorrect data. This could be avoided by using the kernel-defined __kernel_old_timespec instead, but then - time64 target arguments needlessly get truncated to the time32 range. This happens e.g. when running 64-bit target on 32-bit host, but could be easily avoided by converting into the time64 types (__kernel_timespec) and calling the time64 syscalls whenever those are available. - On modern hosts, only the time64 syscalls are available, so e.g. on riscv32 hosts, the current implementation does not even build. The best solution for all of the above I think would be to have a wrapper for each of the affected syscalls that calls the time64 interface first and falls back to the time32 version only if that fails, using __kernel_timespec on the interface instead of libc timespec. A simpler but less flexible method would be to detect sizeof(time_t) at compile time and then use either the time32 or time64 host syscall numbers, whichever goes with the libc time_t/timespec. Arnd
Re: [RFC PATCH] target/arm: update the cortex-a15 MIDR to latest rev
On Tue, Sep 6, 2022, at 7:22 PM, Alex Bennée wrote: > > index 3099b38e32..59d5278868 100644 > --- a/target/arm/cpu_tcg.c > +++ b/target/arm/cpu_tcg.c > @@ -588,7 +588,9 @@ static void cortex_a15_initfn(Object *obj) > set_feature(>env, ARM_FEATURE_EL3); > set_feature(>env, ARM_FEATURE_PMU); > cpu->kvm_target = QEMU_KVM_ARM_TARGET_CORTEX_A15; > -cpu->midr = 0x412fc0f1; > +/* r4p0 cpu, not requiring expensive tlb flush errata */ > +cpu->midr = 0x414fc0f0; > +cpu->revidr = 0x200; > cpu->reset_fpsid = 0x410430f0; > cpu->isar.mvfr0 = 0x10110222; > cpu->isar.mvfr1 = 0x; This will work correctly with Linux, but after I checked the Cortex-A15 MPCore r2/r3/r4 Software Developers Errata Notice again, I think that setting revidr here is not fully correct: With an r3p3 CPU, bit 9 of revidr (0x200) indicates that no workaround is necessary, but with r4p0 and higher, this bit is marked as reserved and both the Documentation and the Linux source code assume the hardware works correctly. So I think this should either be 0x413fc0f3/0x200 or 0x414fc0f0/0. Arnd
Re: [RFC PATCH 3/3] hw/openrisc: Add the OpenRISC virtual machine
On Tue, Jun 7, 2022 at 2:12 PM Stafford Horne wrote: > On Tue, Jun 07, 2022 at 11:43:08AM +0100, Peter Maydell wrote: > > However, in a followup mail from Laurent we see: > > https://lore.kernel.org/lkml/cb884368-0226-e913-80d2-62d2b7b2e...@vivier.eu/ > > The reference document[1] doesn't define the endianness of goldfish. > > [1] > https://android.googlesource.com/platform/external/qemu/+/master/docs/GOLDFISH-VIRTUAL-HARDWARE.TXT > > > The documentation does not clearly specify it. So maybe maybe or1k should > just > be updated on the linux side and add gf_ioread32/gf_iowrite32 big-endian > accessors. I don't think it makes any sense to use big-endian for a new architecture, just use the default little-endian implementation on the linux side, and change the qemu code to have the backward-compatibility hack for m68k while using big-endian for the rest. Arnd
Re: [RFC PATCH 3/3] hw/openrisc: Add the OpenRISC virtual machine
On Tue, Jun 7, 2022 at 11:47 AM Stafford Horne wrote: > On Tue, Jun 07, 2022 at 10:42:08AM +0200, Arnd Bergmann wrote: > > Goldfish is a very old platform, as far as I know only the kernel port is > > new. > > I don't know when qemu started shipping goldfish, but changing it now would > > surely break compatibility with whatever OS the port was originally made > > for. > > Hi Arnd, > > As far as I can tell goldfish in qemu is not very old. There are 3 devices, 2 > were > added for the m68k virt machine, and 1 for riscv virt. > > $ git lo -- hw/char/goldfish_tty.c > 2021-11-09 65b4c8c759 Philippe Mathieu-Daudé hw/m68k: Fix typo in SPDX tag > 2021-03-15 8c6df16ff6 Laurent Vivier hw/char: add goldfish-tty > > $ git lo -- hw/intc/goldfish_pic.c > 2021-11-09 65b4c8c759 Philippe Mathieu-Daudé hw/m68k: Fix typo in SPDX tag > 2021-03-15 8785559390 Laurent Vivier hw/intc: add goldfish-pic That is much younger than Laurent made it appear, from his earlier explanations I expected this to have shipped a long time ago and been used in other OSs to the point where it cannot be fixed. > The mips/loongson3_virt machine now also uses the goldfish_rtc. > > The problem with the goldfish device models is that they were defined as > DEVICE_NATIVE_ENDIAN. > > $ grep endianness hw/*/goldfish* > hw/char/goldfish_tty.c:.endianness = DEVICE_NATIVE_ENDIAN, > hw/intc/goldfish_pic.c:.endianness = DEVICE_NATIVE_ENDIAN, > hw/rtc/goldfish_rtc.c:.endianness = DEVICE_NATIVE_ENDIAN, > > RISC-V is little-endian so when it was added there was no problem with running > linux goldfish drivers. > > MIPS Longson3, added last year, seems to be running as little-endian well, I > understand MIPS can support both big and little endian. However according to > this all Loongson cores are little-endian. > > https://en.wikipedia.org/wiki/Loongson > > As I understand when endianness of the devices in qemu are defined as > DEVICE_NATIVE_ENDIAN the device endian takes the endian of the target CPU. > > This means that MIPS Loongson3 and RISC-V are affectively running as > little-endian which is what would be expected. Not really, the definition of DEVICE_NATIVE_ENDIAN in qemu is much less well-defined than that as I understand, it is just whatever the person adding support for that CPU thought would be the right one. A lot of CPUs can run either big-endian or little-endian code, and e.g. on ARM, qemu DEVICE_NATIVE_ENDIAN is just always little-endian, regardless of what the CPU runs, while I think on MIPS it would be whatever the CPU is actually executing. Arnd
Re: [RFC PATCH 3/3] hw/openrisc: Add the OpenRISC virtual machine
On Tue, Jun 7, 2022 at 10:11 AM Geert Uytterhoeven wrote: > On Sun, Jun 5, 2022 at 9:32 AM Stafford Horne wrote: > > On Sun, Jun 05, 2022 at 10:58:14AM +0900, Stafford Horne wrote: > > It might be a good idea to revisit the qemu implementation and make > > sure that the extra byteswap is only inserted on m68k and not on > > other targets, but hopefully there are no new targets based on goldfish > > anymore and we don't need to care. > > > > So, it seems that in addition to my patch we would need something in m68k to > > switch it back to 'native' (big) endian? > > > > Looking at the m68k kernel/qemu interface I see: > > > > Pre 5.19: > >(data) <-- kernel(readl / little) <-- m68k qemu (native / big) - RTC/PIC > >(data) <-- kernel(__raw_readl / big) <-- m68k qemu (native / big) - TTY > > > > 5.19: > >(data) <-- kernel(gf_ioread32 / big) <-- m68k qemu (native / big) - all > > > > The new fixes to add gf_ioread32/gf_iowrite32 fix this for goldfish and > > m68k. > > This wouldn't have been an issue for little-endian platforms where > > readl/writel > > were originally used. > > > > Why can't m68k switch to little-endian in qemu and the kernel? The m68k > > virt > > platform is not that old, 1 year? Are there a lot of users that this would > > be a big > > problem? > > > > [1] > > https://lore.kernel.org/lkml/CAK8P3a1oN8NrUjkh2X8jHQbyz42Xo6GSa=5n0gd6vqcxrjm...@mail.gmail.com/ Goldfish is a very old platform, as far as I know only the kernel port is new. I don't know when qemu started shipping goldfish, but changing it now would surely break compatibility with whatever OS the port was originally made for. Arnd
Re: Approaches for same-on-same linux-user execve?
On Thu, Oct 7, 2021 at 4:32 PM Alex Bennée wrote: > > Are there any other approaches you could take? Which do you think has > the most merit? Reading through the ELF loader code in the kernel, I had another idea: If qemu-user could be turned into a replacement for /lilb/ld.so and act as an ELF interpreter rather than a binfmt-misc helper, this might address a lot of the issues automatically. It would need to be a statically linked binary so it doesn't itself require an interpreter. It would have to do the job of ld.so in addition to the emulation, but it could do that by finding the real ld.so somewhere else and running that in emulation mode. It would also not work at all for statically linked executables. Not sure if that makes the tradeoffs better than your other suggestions, but it seemed worth bringing up. Arnd
Re: Approaches for same-on-same linux-user execve?
On Thu, Oct 7, 2021 at 4:32 PM Alex Bennée wrote: > > I came across a use-case this week for ARM although this may be also > applicable to architectures where QEMU's emulation is ahead of the > hardware currently widely available - for example if you want to > exercise SVE code on AArch64. When the linux-user architecture is not > the same as the host architecture then binfmt_misc works perfectly fine. > > However in the case you are running same-on-same you can't use > binfmt_misc to redirect execution to using QEMU because any attempt to > trap native binaries will cause your userspace to hang as binfmt_misc > will be invoked to run the QEMU binary needed to run your application > and a deadlock ensues. Can you clarify how the code would run in this case? Does qemu-user still emulate every single instruction, both the compatible and the incompatible ones, or is the idea here to run as much as possible natively and only emulate the instructions that are not available natively, using either SIGILL or searching through the object code for those instructions? > Trap execve in QEMU linux-user > -- > > We could add a flag to QEMU so at the point of execve it manually > invokes the new process with QEMU, passing on the flag to persist this > behaviour. This sounds like the obvious approach if you already do a full instruction emulation. You'd still need to run the parent process by calling qemu-user manually, but I suppose you need to do something like this in any case. > Add path mask to binfmt_misc > > > The other option would be to extend binfmt_misc to have a path mask so > it only applies it's alternative execution scheme to binaries in a > particular section of the file-system (or maybe some sort of pattern?). The main downside I see here is that it requires kernel modification, so it would not work for old kernels. > Are there any other approaches you could take? Which do you think has > the most merit? If we modify binfmt_misc in the kernel, it might be helpful to do it by extending it with namespace support, so it could be constrained to a single container without having to do the emulation outside. Unfortunately that does not solve the problem of preventing the qemu-user binary from triggering the binfmt_misc lookup. Arnd
Re: [PATCH v2 for-6.0?] hw/pci-host/gpex: Don't fault for unmapped parts of MMIO and PIO windows
On Tue, Apr 20, 2021 at 1:52 PM Philippe Mathieu-Daudé wrote: > On 4/19/21 3:42 PM, Peter Maydell wrote: > >> > >> I suspect PCI-ISA bridges to provide an EISA bus. > > > > I'm not sure what you mean here -- there isn't an ISA bridge > > or an EISA bus involved here. This is purely about the behaviour > > of the memory window the PCI host controller exposes to the CPU > > (and in particular the window for when a PCI device's BAR is > > set to "IO" rather than "MMIO"), though we change both here. > > I guess I always interpreted the IO BAR were here to address ISA > backward compatibility. I don't know well PCI so I'll study it > more. Sorry for my confused comment. It is mostly for compatibility, but there are many layers of it: - PCI supports actual ISA/EISA/VLB/PCMCIA/LPC/PC104/... style devices behind a bridge, using I/O ports at their native address. - PCI devices themselves can have fixed I/O ports at well-known addresses, e.g. VGA or IDE/ATA adapters - PCI devices can behave like legacy devices using port I/O but use PCI resource assignment to pick an arbitrary port number outside of the legacy range - PCIe can support all of the above by virtue of being backwards compatible with PCI and allowing PCI buses behind bridges, though port I/O is deprecated here and often not supported at all The first two are very rare these days, but Linux still support them in order to run on old hardware, and any driver for these that assumes a hardcoded port number can crash the kernel if the PCI host bridge causes an asynchronous external abort or similar. Arnd
Re: [PATCH 3/5] tools/vhost-user-i2c: Add backend driver
On Fri, Mar 26, 2021 at 7:01 AM Viresh Kumar wrote: > On 25-03-21, 17:16, Arnd Bergmann wrote: > > On Wed, Mar 24, 2021 at 8:33 AM Viresh Kumar > > wrote: > > > > > +static uint8_t vi2c_xfer(VuDev *dev, struct i2c_msg *msg) > > > +{ > > > +VuI2c *i2c = container_of(dev, VuI2c, dev.parent); > > > +struct i2c_rdwr_ioctl_data data; > > > +VI2cAdapter *adapter; > > > + > > > +adapter = vi2c_find_adapter(i2c, msg->addr); > > > +if (!adapter) { > > > +g_printerr("Failed to find adapter for address: %x\n", > > > msg->addr); > > > +return VIRTIO_I2C_MSG_ERR; > > > +} > > > + > > > +data.nmsgs = 1; > > > +data.msgs = msg; > > > + > > > +if (ioctl(adapter->fd, I2C_RDWR, ) < 0) { > > > +g_printerr("Failed to transfer data to address %x : %d\n", > > > msg->addr, errno); > > > +return VIRTIO_I2C_MSG_ERR; > > > +} > > > > As you found during testing, this doesn't work for host kernels > > that only implement the SMBUS protocol. Since most i2c clients > > only need simple register read/write operations, I think you should > > at least handle the common ones (and one two byte read/write) > > here to make it more useful. > > I am thinking if that is what we really want to support, then > shouldn't the i2c virtio spec be updated first to support SMBUS type > transfers as well? As far as I can tell, all the simple devices should just work, with I2C_FUNC_SMBUS_READ_BLOCK_DATA being the main exception, but it seems that has practically no users. Arnd
Re: [PATCH 3/5] tools/vhost-user-i2c: Add backend driver
On Fri, Mar 26, 2021 at 8:14 AM Viresh Kumar wrote: > On 25-03-21, 17:16, Arnd Bergmann wrote: > > On Wed, Mar 24, 2021 at 8:33 AM Viresh Kumar > > wrote: > > > > It looks like you are not handling endianness conversion here. As far as I > > can tell, the protocol requires little-endian data, but the code might > > run on a big-endian CPU. > > I hope this is all we are required to do here, right ? > > @@ -442,7 +421,7 @@ static void vi2c_handle_ctrl(VuDev *dev, int qidx) > out_hdr = elem->out_sg[0].iov_base; > > /* Bit 0 is reserved in virtio spec */ > -msg.addr = out_hdr->addr >> 1; > +msg.addr = le16toh(out_hdr->addr) >> 1; > > /* Read Operation */ > if (elem->out_num == 1 && elem->in_num == 2) { > @@ -489,7 +468,7 @@ static void vi2c_handle_ctrl(VuDev *dev, int qidx) > in_hdr->status = fail_next ? VIRTIO_I2C_MSG_ERR : vi2c_xfer(dev, > ); > if (in_hdr->status == VIRTIO_I2C_MSG_ERR) { > /* We need to fail remaining transfers as well */ > -fail_next = out_hdr->flags & VIRTIO_I2C_FLAGS_FAIL_NEXT; > +fail_next = le32toh(out_hdr->flags) & VIRTIO_I2C_FLAGS_FAIL_NEXT; > } > > These are the only fields we are passing apart from buf, which goes > directly to the client device. I think so, the in_hdr is only one byte long, so it doesn't have an endianness. Arnd
Re: [PATCH 3/5] tools/vhost-user-i2c: Add backend driver
On Wed, Mar 24, 2021 at 8:33 AM Viresh Kumar wrote: > +static uint8_t vi2c_xfer(VuDev *dev, struct i2c_msg *msg) > +{ > +VuI2c *i2c = container_of(dev, VuI2c, dev.parent); > +struct i2c_rdwr_ioctl_data data; > +VI2cAdapter *adapter; > + > +adapter = vi2c_find_adapter(i2c, msg->addr); > +if (!adapter) { > +g_printerr("Failed to find adapter for address: %x\n", msg->addr); > +return VIRTIO_I2C_MSG_ERR; > +} > + > +data.nmsgs = 1; > +data.msgs = msg; > + > +if (ioctl(adapter->fd, I2C_RDWR, ) < 0) { > +g_printerr("Failed to transfer data to address %x : %d\n", > msg->addr, errno); > +return VIRTIO_I2C_MSG_ERR; > +} As you found during testing, this doesn't work for host kernels that only implement the SMBUS protocol. Since most i2c clients only need simple register read/write operations, I think you should at least handle the common ones (and one two byte read/write) here to make it more useful. > +static void vi2c_handle_ctrl(VuDev *dev, int qidx) > +{ > +VuVirtq *vq = vu_get_queue(dev, qidx); > +struct i2c_msg msg; > +struct virtio_i2c_out_hdr *out_hdr; > +struct virtio_i2c_in_hdr *in_hdr; > +bool fail_next = false; > +size_t len, in_hdr_len; > + > +for (;;) { > +VuVirtqElement *elem; > + > +elem = vu_queue_pop(dev, vq, sizeof(VuVirtqElement)); > +if (!elem) { > +break; > +} > + > +g_debug("%s: got queue (in %d, out %d)", __func__, elem->in_num, > +elem->out_num); > + > +/* Validate size of out header */ > +if (elem->out_sg[0].iov_len != sizeof(*out_hdr)) { > +g_warning("%s: Invalid out hdr %zu : %zu\n", __func__, > + elem->out_sg[0].iov_len, sizeof(*out_hdr)); > +continue; > +} > + > +out_hdr = elem->out_sg[0].iov_base; > + > +/* Bit 0 is reserved in virtio spec */ > +msg.addr = out_hdr->addr >> 1; > + > +/* Read Operation */ > +if (elem->out_num == 1 && elem->in_num == 2) { > +len = elem->in_sg[0].iov_len; > +if (!len) { > +g_warning("%s: Read buffer length can't be zero\n", > __func__); > +continue; > +} It looks like you are not handling endianness conversion here. As far as I can tell, the protocol requires little-endian data, but the code might run on a big-endian CPU. Jie Deng also pointed out the type differences, but actually handling them correctly is more important that describing them the right way. Arnd
Re: [PATCH] hw/pci-host/gpex: Don't fault for unmapped parts of MMIO and PIO windows
6On Mon, Mar 22, 2021 at 9:13 PM Peter Maydell wrote: > > Currently the gpex PCI controller implements no special behaviour for > guest accesses to areas of the PIO and MMIO where it has not mapped > any PCI devices, which means that for Arm you end up with a CPU > exception due to a data abort. > > Most host OSes expect "like an x86 PC" behaviour, where bad accesses > like this return -1 for reads and ignore writes. In the interests of > not being surprising, make host CPU accesses to these windows behave > as -1/discard where there's no mapped PCI device. > > Reported-by: Dmitry Vyukov > Fixes: https://bugs.launchpad.net/qemu/+bug/1918917 > Signed-off-by: Peter Maydell > --- > Not convinced that this is 6.0 material, because IMHO the > kernel shouldn't be doing this in the first place. > Do we need to have the property machinery so that old > virt-5.2 etc retain the previous behaviour ? I think it would be sufficient to do this for the ioport window, which is what old-style ISA drivers access. I am not aware of any driver accessing hardcoded addresses in the mmio window, at least not without probing io ports first (the VGA text console would use both). I checked which SoCs the kernel supports that do require a special hook to avoid an abort and found these: arch/arm/mach-bcm/bcm_5301x.c: hook_fault_code(16 + 6, bcm5301x_abort_handler, SIGBUS, BUS_OBJERR, arch/arm/mach-cns3xxx/pcie.c: hook_fault_code(16 + 6, cns3xxx_pcie_abort_handler, SIGBUS, 0, arch/arm/mach-iop32x/pci.c: hook_fault_code(16+6, iop3xx_pci_abort, SIGBUS, 0, "imprecise external abort"); arch/arm/mach-ixp4xx/common-pci.c: hook_fault_code(16+6, abort_handler, SIGBUS, 0, drivers/pci/controller/dwc/pci-imx6.c: hook_fault_code(8, imx6q_pcie_abort_handler, SIGBUS, 0, drivers/pci/controller/dwc/pci-keystone.c: hook_fault_code(17, ks_pcie_fault, SIGBUS, 0, The first four (bcm5301x, cns3xxx, iop32x and ixp4xx) generate an 'imprecise external abort' (16+6), imx6q has a "precise external abort on non-linefetch" (8), and keystone in LPAE mode has an "asynchronous external abort". The only SoC among those that is emulated by qemu to my knowledge is the i.MX6q in the 'sabrelite' machine. It's possible that some of these are not caused by a PCI master abort but some other error condition on the PCI host though. I think most other PCI implementations either ignore the error or generate an I/O interrupt. Arnd
[Bug 1918917] Re: synchronous about on accessing unused I/O ports on aarch64
That seems correct, and could probably be improved. In this case, I know that _outb() only writes within the PCI PIO virtual address between fbfffe80 and fb80, which according to the boot log (https://gist.githubusercontent.com/dvyukov/084890d9b4aa7cd54f468e652a9b5881/raw/54c12248ff6a4885ba6c530d56b3adad59bc6187/gistfile1.txt) was mapped to qemu's PCI IO space, using DEVICE_nGnRnE attributes (PIO space unlike any other MMIO in the kernel uses 'nE'): [3.973419][T1] pci-host-generic 401000.pcie: host bridge /pcie@1000 ranges: [3.974309][T1] pci-host-generic 401000.pcie: IO 0x003eff..0x003eff -> 0x00 [3.975021][T1] pci-host-generic 401000.pcie: MEM 0x001000..0x003efe -> 0x001000 So physical address 0x003eff corresponds to virtual address fbfffe80, which corresponds to logical PIO port number 0. The serial port in the kernel was probed at port number 0, so the driver attempts to write to the 8250 compatible UART_IER at address fbfffe81, which is in register x19. -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1918917 Title: synchronous about on accessing unused I/O ports on aarch64 Status in QEMU: New Bug description: version: QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-6) command line: qemu-system-aarch64 \ -machine virt,virtualization=on,graphics=on,usb=on -cpu cortex-a57 -smp 2 -m 2G \ -device virtio-blk-device,drive=hd0 \ -drive if=none,format=raw,id=hd0,file=buildroot \ -kernel arch/arm64/boot/Image \ -nographic \ -device virtio-rng-pci \ -net user,host=10.0.2.10,hostfwd=tcp::10022-:22 -net nic,model=virtio-net-pci \ -append "root=/dev/vda earlyprintk=serial console=ttyAMA0 earlycon" I am observing "synchronous external abort" when kernel tries to access unused I/O ports (see below), while hardware/qemu should return 0x in this case. This is factored out of this LKML thread where Arnd describes it in more details: https://lore.kernel.org/lkml/CAK8P3a0HVu+x0T6+K3d0v1bvU-Pes0F0CSjqm5x=bxfgv5y...@mail.gmail.com/ Internal error: synchronous external abort: 9650 [#1] PREEMPT SMP Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 11231 Comm: syz-executor.1 Not tainted 5.12.0-rc2-syzkaller-00302-g28806e4d9b97 #0 Hardware name: linux,dummy-virt (DT) pstate: 8085 (Nzcv daIf -PAN -UAO -TCO BTYPE=--) pc : __raw_writeb arch/arm64/include/asm/io.h:27 [inline] pc : _outb include/asm-generic/io.h:501 [inline] pc : logic_outb+0x3c/0x114 lib/logic_pio.c:302 lr : io_serial_out+0x80/0xc0 drivers/tty/serial/8250/8250_port.c:453 sp : 15f0f980 x29: 15f0f980 x28: 80001de0005d x27: 80001601df00 x26: 15f0fc90 x25: 80001de0 x24: 80001de0 x23: 0e27f600 x22: x21: 0002 x20: 0002 x19: fbfffe81 x18: 6a678b48 x17: x16: x15: 8000197be810 x14: 1fffe2be1f0e x13: 1fffe2be1e90 x12: 62be1f39 x11: 1fffe2be1f38 x10: 62be1f38 x9 : dfff8000 x8 : 0003 x7 : 0001 x6 : 0004 x5 : 15f0f9c0 x4 : dfff8000 x3 : 0001 x2 : 13494e6b x1 : fbfffe80 x0 : 00ffbffe Call trace: _outb include/asm-generic/io.h:501 [inline] logic_outb+0x3c/0x114 lib/logic_pio.c:302 io_serial_out+0x80/0xc0 drivers/tty/serial/8250/8250_port.c:453 serial_out drivers/tty/serial/8250/8250.h:118 [inline] serial8250_set_THRI drivers/tty/serial/8250/8250.h:138 [inline] __start_tx drivers/tty/serial/8250/8250_port.c:1566 [inline] serial8250_start_tx+0x338/0x6c0 drivers/tty/serial/8250/8250_port.c:1666 __uart_start.isra.0+0x10c/0x154 drivers/tty/serial/serial_core.c:127 uart_start+0xe0/0x210 drivers/tty/serial/serial_core.c:137 uart_flush_chars+0x10/0x20 drivers/tty/serial/serial_core.c:573 __receive_buf drivers/tty/n_tty.c:1646 [inline] n_tty_receive_buf_common+0x588/0x22c0 drivers/tty/n_tty.c:1739 n_tty_receive_buf+0x14/0x20 drivers/tty/n_tty.c:1768 tiocsti drivers/tty/tty_io.c:2317 [inline] tty_ioctl+0xed0/0x1aec drivers/tty/tty_io.c:2718 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __arm64_sys_ioctl+0x120/0x18c fs/ioctl.c:739 __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline] invoke_syscall arch/arm64/kernel/syscall.c:49 [inline] el0_svc_common.constprop.0+0xf0/0x2c0 arch/arm64/kernel/syscall.c:129 do_el0_svc+0xa4/0xd0 arch/arm64/kernel/syscall.c:168 el0_svc+0x24/0x34 arch/arm64/kernel/entry-common.c:416 el0_sync_handler+0x1a4/0x1b0
[Bug 1918917] Re: synchronous about on accessing unused I/O ports on aarch64
My best guess is that the PCI I/O port handling in qemu only returns data for ports that are connected to an actual device. In this case, the kernel attempts to access a 8250 serial port at an address where none exists. While this is also a bug in the kernel, a real PCI implementation would not cause an external abort but simply return 'all-ones' (0xff, 0x, 0x) for a read request and ignore writes. This is something that the kernel relies on for probing unknown ISA style devices. -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1918917 Title: synchronous about on accessing unused I/O ports on aarch64 Status in QEMU: New Bug description: version: QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-6) command line: qemu-system-aarch64 \ -machine virt,virtualization=on,graphics=on,usb=on -cpu cortex-a57 -smp 2 -m 2G \ -device virtio-blk-device,drive=hd0 \ -drive if=none,format=raw,id=hd0,file=buildroot \ -kernel arch/arm64/boot/Image \ -nographic \ -device virtio-rng-pci \ -net user,host=10.0.2.10,hostfwd=tcp::10022-:22 -net nic,model=virtio-net-pci \ -append "root=/dev/vda earlyprintk=serial console=ttyAMA0 earlycon" I am observing "synchronous external abort" when kernel tries to access unused I/O ports (see below), while hardware/qemu should return 0x in this case. This is factored out of this LKML thread where Arnd describes it in more details: https://lore.kernel.org/lkml/CAK8P3a0HVu+x0T6+K3d0v1bvU-Pes0F0CSjqm5x=bxfgv5y...@mail.gmail.com/ Internal error: synchronous external abort: 9650 [#1] PREEMPT SMP Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 11231 Comm: syz-executor.1 Not tainted 5.12.0-rc2-syzkaller-00302-g28806e4d9b97 #0 Hardware name: linux,dummy-virt (DT) pstate: 8085 (Nzcv daIf -PAN -UAO -TCO BTYPE=--) pc : __raw_writeb arch/arm64/include/asm/io.h:27 [inline] pc : _outb include/asm-generic/io.h:501 [inline] pc : logic_outb+0x3c/0x114 lib/logic_pio.c:302 lr : io_serial_out+0x80/0xc0 drivers/tty/serial/8250/8250_port.c:453 sp : 15f0f980 x29: 15f0f980 x28: 80001de0005d x27: 80001601df00 x26: 15f0fc90 x25: 80001de0 x24: 80001de0 x23: 0e27f600 x22: x21: 0002 x20: 0002 x19: fbfffe81 x18: 6a678b48 x17: x16: x15: 8000197be810 x14: 1fffe2be1f0e x13: 1fffe2be1e90 x12: 62be1f39 x11: 1fffe2be1f38 x10: 62be1f38 x9 : dfff8000 x8 : 0003 x7 : 0001 x6 : 0004 x5 : 15f0f9c0 x4 : dfff8000 x3 : 0001 x2 : 13494e6b x1 : fbfffe80 x0 : 00ffbffe Call trace: _outb include/asm-generic/io.h:501 [inline] logic_outb+0x3c/0x114 lib/logic_pio.c:302 io_serial_out+0x80/0xc0 drivers/tty/serial/8250/8250_port.c:453 serial_out drivers/tty/serial/8250/8250.h:118 [inline] serial8250_set_THRI drivers/tty/serial/8250/8250.h:138 [inline] __start_tx drivers/tty/serial/8250/8250_port.c:1566 [inline] serial8250_start_tx+0x338/0x6c0 drivers/tty/serial/8250/8250_port.c:1666 __uart_start.isra.0+0x10c/0x154 drivers/tty/serial/serial_core.c:127 uart_start+0xe0/0x210 drivers/tty/serial/serial_core.c:137 uart_flush_chars+0x10/0x20 drivers/tty/serial/serial_core.c:573 __receive_buf drivers/tty/n_tty.c:1646 [inline] n_tty_receive_buf_common+0x588/0x22c0 drivers/tty/n_tty.c:1739 n_tty_receive_buf+0x14/0x20 drivers/tty/n_tty.c:1768 tiocsti drivers/tty/tty_io.c:2317 [inline] tty_ioctl+0xed0/0x1aec drivers/tty/tty_io.c:2718 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __arm64_sys_ioctl+0x120/0x18c fs/ioctl.c:739 __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline] invoke_syscall arch/arm64/kernel/syscall.c:49 [inline] el0_svc_common.constprop.0+0xf0/0x2c0 arch/arm64/kernel/syscall.c:129 do_el0_svc+0xa4/0xd0 arch/arm64/kernel/syscall.c:168 el0_svc+0x24/0x34 arch/arm64/kernel/entry-common.c:416 el0_sync_handler+0x1a4/0x1b0 arch/arm64/kernel/entry-common.c:432 el0_sync+0x170/0x180 arch/arm64/kernel/entry.S:699 Code: d2bfd001 f2df7fe1 f2e1 8b010273 (39000274) ---[ end trace 79cb47219936c254 ]--- To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1918917/+subscriptions
Re: [PATCH v3 RESEND] fcntl: Add 32bit filesystem mode
On Wed, Nov 18, 2020 at 12:38 AM Linus Walleij wrote: > > On Tue, Oct 13, 2020 at 11:22 AM Dave Martin wrote: > > > > case F_SETFD: > > > err = 0; > > > set_close_on_exec(fd, arg & FD_CLOEXEC); > > > + if (arg & FD_32BIT_MODE) > > > + filp->f_mode |= FMODE_32BITHASH; > > > + else > > > + filp->f_mode &= ~FMODE_32BITHASH; > > > > This seems inconsistent? F_SETFD is for setting flags on a file > > descriptor. Won't setting a flag on filp here instead cause the > > behaviour to change for all file descriptors across the system that are > > open on this struct file? Compare set_close_on_exec(). > > > > I don't see any discussion on whether this should be an F_SETFL or an > > F_SETFD, though I see F_SETFD was Ted's suggestion originally. > > I cannot honestly say I know the semantic difference. > > I would ask the QEMU people how a user program would expect > the flag to behave. I agree it should either use F_SETFD to set a bit in the fdtable structure like set_close_on_exec() or it should use F_SETFL to set a bit in filp->f_mode. It appears the reason FMODE_32BITHASH is part of filp->f_mode is that the only user today is nfsd, which does not have a file descriptor but only has a struct file. Similarly, the only code that understands the difference (ext4_readdir()) has no reference to the file descriptor. If this becomes an O_DIR32BITHASH flag for F_SETFL, I suppose it should also be supported by openat2(). Arnd
Re: [PATCH 08/12] linux-user: Add support for setting alsa timer enhanced read using ioctl
On Fri, Jan 17, 2020 at 9:50 PM Aleksandar Markovic wrote: > Alexandre (and Arnd too, or any other person knowledgeable in the area), > > I just need to clarify a couple of details with you, please. > > Firstly, here is what man page rtc(4) says: > > "The /dev/rtc (or /dev/rtc0, /dev/rtc1, etc.) device can be opened > only once (until it is closed) and it is read-only. On read(2) and > select(2) the calling process is blocked until the next interrupt from > that RTC is received. Following the interrupt, the process can read a > long integer, of which the least significant byte contains a bit mask > encoding the types of interrupt that occurred, while the remaining 3 > bytes contain the number of interrupts since the last read(2)." > > So, it looks read() will always return only 4 bytes of useful info > (regardless of host being 32-bit/64-bit). It says "long integer", which is 64-bit on a 64-bit machine. > My questions are: > > - Is the description in man page genuinely accurate? Starting with linux-2.6.18, there is another possibility: If an application asks for exactly four bytes on a 64-bit kernel, it gets the lower four bytes, as it would on a 32-bit kernel. This is a hack that was introduced for running 32-bit compat tasks. For any other size less than sizeof(long), the kernel reports an EINVAL error, and for anything larger or equal to sizeof(long) it attempts to output a long word. > - To me (but I am really an outsider to using RTC in applications), > this feature (blocking read()/select()) even looks very nice and > convenient, in all fairness. But I would like to ask you: Is this > feature used rarely or frequently by other libraries/tools/etc.? In > other words, is the feature "obscure" or "crucial" part of RTC kernel > support? Or, something in between? > - Does MC146818 support this feature? No idea, I'll leave these for Alexandre or someone else to answer. Arnd
Re: [PATCH 08/12] linux-user: Add support for setting alsa timer enhanced read using ioctl
On Thu, Jan 16, 2020 at 12:27 PM Aleksandar Markovic wrote: > On Thursday, January 16, 2020, Aleksandar Markovic > wrote: >> On Wednesday, January 15, 2020, Laurent Vivier wrote: >>> Le 15/01/2020 à 20:17, Filip Bozuta a écrit : >>> > On 15.1.20. 17:37, Arnd Bergmann wrote: >>> >> On Wed, Jan 15, 2020 at 5:32 PM Laurent Vivier wrote: >>> >>> Le 15/01/2020 à 17:18, Arnd Bergmann a écrit : >>> >>>> On Wed, Jan 15, 2020 at 4:53 PM Filip Bozuta >>> >>>> wrote: >>> >>>>> This patch implements functionality of following ioctl: >>> >>>>> >>> >>>>> SNDRV_TIMER_IOCTL_TREAD - Setting enhanced time read >>> >>>>> >>> >>>>> Sets enhanced time read which is used for reading time with >>> >>>>> timestamps >>> >>>>> and events. The third ioctl's argument is a pointer to an >>> >>>>> 'int'. Enhanced >>> >>>>> reading is set if the third argument is different than 0, >>> >>>>> otherwise normal >>> >>>>> time reading is set. >>> >>>>> >>> >>>>> Implementation notes: >>> >>>>> >>> >>>>> Because the implemented ioctl has 'int' as its third argument, >>> >>>>> the >>> >>>>> implementation was straightforward. >>> >>>>> >>> >>>>> Signed-off-by: Filip Bozuta >>> >>>> I think this one is wrong when you go between 32-bit and 64-bit >>> >>> kernel uses an "int" and "int" is always 32bit. >>> >>> The problem is most likely with timespec I think. >>> >>> >>> >>>> targets, and it gets worse with the kernel patches that just got >>> >>>> merged for linux-5.5, which extends the behavior to deal with >>> >>>> 64-bit time_t on 32-bit architectures. >>> >>>> >>> >>>> Please have a look at >>> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?h=80fe7430c70859 >>> >>>> >>> >>> Yes, we already had the same kind of problem with SIOCGSTAMP and >>> >>> SIOCGSTAMPNS. >>> >>> >>> >>> Do the kernel patches add new ioctl numbers to differentiate 32bit and >>> >>> 64bit time_t? >>> >> Yes, though SNDRV_TIMER_IOCTL_TREAD is worse: the ioctl argument >>> >> is a pure 'int' that decides what format you get when you 'read' from the >>> >> same file descriptor. >>> >> >>> >> For emulating 64-bit on 32-bit kernels, you have to use the new >>> >> SNDRV_TIMER_IOCTL_TREAD64, and for emulating 32-bit on >>> >> 64-bit kernels, you probably have to return -ENOTTY to >>> >> SNDRV_TIMER_IOCTL_TREAD_OLD unless you also want to >>> >> emulate the read() behavior. >>> >> When a 32-bit process calls SNDRV_TIMER_IOCTL_TREAD64, >>> >> you can translate that into the 64-bit >>> >> SNDRV_TIMER_IOCTL_TREAD_OLD. >>> > >>> > Thank you for bringing this up to my attention. Unfortunately i have >>> > some duties of academic nature in next month so i won't have much time >>> > fix this bug. I will try to fix this as soon as possible. >>> >>> Could you at least to try to have a mergeable series before you have to >>> stop to work on this? >>> >>> You can only manage the case before the change reported by Arnd (I will >>> manage the new case myself later). >> >> Sorry for interjecting myself into this discussion, but I just want to let >> you know about some related practicalities. >> >> Filip is a student that is from time to time (usually between two exam >> seasons) an intern in our company, working 4 hours each work day. He spent >> his internship in different teams in prevous years, and, from around >> mid-October 2019, was appointed to QEMU team. After some introductory tasks, >> he was assigned his main task: linux-user support for RTCs and ALSA timers. >> This series is the result of his work, and, to my great pleasure, is >> virtually entirely his independant work. I am positive he can complete the >> series by himself, even in the light of additional complexit
Re: [PATCH 08/12] linux-user: Add support for setting alsa timer enhanced read using ioctl
On Wed, Jan 15, 2020 at 8:17 PM Filip Bozuta wrote: > On 15.1.20. 17:37, Arnd Bergmann wrote: > > On Wed, Jan 15, 2020 at 5:32 PM Laurent Vivier wrote: > >> Le 15/01/2020 à 17:18, Arnd Bergmann a écrit : > >>> On Wed, Jan 15, 2020 at 4:53 PM Filip Bozuta > >>> wrote: > >> Do the kernel patches add new ioctl numbers to differentiate 32bit and > >> 64bit time_t? > > > > Yes, though SNDRV_TIMER_IOCTL_TREAD is worse: the ioctl argument > > is a pure 'int' that decides what format you get when you 'read' from the > > same file descriptor. > > > > For emulating 64-bit on 32-bit kernels, you have to use the new > > SNDRV_TIMER_IOCTL_TREAD64, and for emulating 32-bit on > > 64-bit kernels, you probably have to return -ENOTTY to > > SNDRV_TIMER_IOCTL_TREAD_OLD unless you also want to > > emulate the read() behavior. > > When a 32-bit process calls SNDRV_TIMER_IOCTL_TREAD64, > > you can translate that into the 64-bit > > SNDRV_TIMER_IOCTL_TREAD_OLD. > > Thank you for bringing this up to my attention. Unfortunately i have > some duties of academic nature in next month so i won't have much time > fix this bug. I will try to fix this as soon as possible. One more thing: I just realized it gets worse when emulating cross-endian, as then even without calling SNDRV_TIMER_IOCTL_TREAD, reading data from /dev/snd/timer requires byteswapping the two words. Arnd
Re: [PATCH 08/12] linux-user: Add support for setting alsa timer enhanced read using ioctl
On Wed, Jan 15, 2020 at 5:32 PM Laurent Vivier wrote: > > Le 15/01/2020 à 17:18, Arnd Bergmann a écrit : > > On Wed, Jan 15, 2020 at 4:53 PM Filip Bozuta wrote: > >> > >> This patch implements functionality of following ioctl: > >> > >> SNDRV_TIMER_IOCTL_TREAD - Setting enhanced time read > >> > >> Sets enhanced time read which is used for reading time with timestamps > >> and events. The third ioctl's argument is a pointer to an 'int'. > >> Enhanced > >> reading is set if the third argument is different than 0, otherwise > >> normal > >> time reading is set. > >> > >> Implementation notes: > >> > >> Because the implemented ioctl has 'int' as its third argument, the > >> implementation was straightforward. > >> > >> Signed-off-by: Filip Bozuta > > > > I think this one is wrong when you go between 32-bit and 64-bit > > kernel uses an "int" and "int" is always 32bit. > The problem is most likely with timespec I think. > > > targets, and it gets worse with the kernel patches that just got > > merged for linux-5.5, which extends the behavior to deal with > > 64-bit time_t on 32-bit architectures. > > > > Please have a look at > > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?h=80fe7430c70859 > > Yes, we already had the same kind of problem with SIOCGSTAMP and > SIOCGSTAMPNS. > > Do the kernel patches add new ioctl numbers to differentiate 32bit and > 64bit time_t? Yes, though SNDRV_TIMER_IOCTL_TREAD is worse: the ioctl argument is a pure 'int' that decides what format you get when you 'read' from the same file descriptor. For emulating 64-bit on 32-bit kernels, you have to use the new SNDRV_TIMER_IOCTL_TREAD64, and for emulating 32-bit on 64-bit kernels, you probably have to return -ENOTTY to SNDRV_TIMER_IOCTL_TREAD_OLD unless you also want to emulate the read() behavior. When a 32-bit process calls SNDRV_TIMER_IOCTL_TREAD64, you can translate that into the 64-bit SNDRV_TIMER_IOCTL_TREAD_OLD. Arnd
Re: [PATCH 08/12] linux-user: Add support for setting alsa timer enhanced read using ioctl
On Wed, Jan 15, 2020 at 4:53 PM Filip Bozuta wrote: > > This patch implements functionality of following ioctl: > > SNDRV_TIMER_IOCTL_TREAD - Setting enhanced time read > > Sets enhanced time read which is used for reading time with timestamps > and events. The third ioctl's argument is a pointer to an 'int'. Enhanced > reading is set if the third argument is different than 0, otherwise normal > time reading is set. > > Implementation notes: > > Because the implemented ioctl has 'int' as its third argument, the > implementation was straightforward. > > Signed-off-by: Filip Bozuta I think this one is wrong when you go between 32-bit and 64-bit targets, and it gets worse with the kernel patches that just got merged for linux-5.5, which extends the behavior to deal with 64-bit time_t on 32-bit architectures. Please have a look at https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?h=80fe7430c70859 Arnd
Re: [Qemu-devel] [PATCH v5] linux-user: fix to handle variably sized SIOCGSTAMP with new kernels
On Sun, Jul 14, 2019 at 3:54 PM Laurent Vivier wrote: > > From: Daniel P. Berrangé > > The SIOCGSTAMP symbol was previously defined in the > asm-generic/sockios.h header file. QEMU sees that header > indirectly via sys/socket.h > > In linux kernel commit 0768e17073dc527ccd18ed5f96ce85f9985e9115 > the asm-generic/sockios.h header no longer defines SIOCGSTAMP. > Instead it provides only SIOCGSTAMP_OLD, which only uses a > 32-bit time_t on 32-bit architectures. > > The linux/sockios.h header then defines SIOCGSTAMP using > either SIOCGSTAMP_OLD or SIOCGSTAMP_NEW as appropriate. If > SIOCGSTAMP_NEW is used, then the tv_sec field is 64-bit even > on 32-bit architectures > > To cope with this we must now convert the old and new type from > the target to the host one. > > Signed-off-by: Daniel P. Berrangé > Signed-off-by: Laurent Vivier Looks good to me now Reviewed-by: Arnd Bergmann
Re: [Qemu-devel] [PATCH v4] linux-user: fix to handle variably sized SIOCGSTAMP with new kernels
On Sun, Jul 14, 2019 at 12:41 PM Richard Henderson wrote: > > On 7/12/19 3:55 PM, Arnd Bergmann wrote: > > glibc will have to create a definition that matches the kernel, which uses > > > > struct __kernel_timespec { > > __s64 tv_sec; > > __s64 tv_nsec; > > }; > > > > As posix requires tv_nsec to be 'long', you need padding between > > tv_sec and tv_nsec to have a libc definition matching the kernel's > > binary layout. > > Yes, but that's glibc's lookout. All qemu cares about emulating is the kernel > interface. So I think Laurent is right here, in that two reads handle the > above structure just fine. But that only works if the structure defined by qemu matches the kernel's. The structure that Laurent proposed struct target_timeval64 { abi_llong tv_sec; abi_long tv_usec; }; is not compatible with the kernel or the glibc structure. Arnd
Re: [Qemu-devel] [PATCH v4] linux-user: fix to handle variably sized SIOCGSTAMP with new kernels
On Fri, Jul 12, 2019 at 3:50 PM Laurent Vivier wrote: > Le 12/07/2019 à 15:36, Arnd Bergmann a écrit : > >> We don't do memcopy() but we set each field one by one, so the padding > >> doesn't > >> seem needed if we define correctly the user structure: > >> > >> struct target_timeval64 { > >> abi_llong tv_sec; > >> abi_long tv_usec; > >> }; > >> > >> and do something like: > >> > >> struct target_timeval64 *target_tv; > >> struct timeval *host_tv; > >> ... > >> __put_user(host_tv->tv_sec, _tv->tv_sec); > >> __put_user(host_tv->tv_usec, _tv->tv_usec); > >> ... > > > > That still seems wrong. The user application has a definition > > of 'timeval' that contains the padding, so your definition has > > to match that. > > I don't find this definition with the padding. Where it is defined? > > We are at the syscall level, so structures are the ones provided by the > target to the syscall, and they can be converted by the libc if the one > from the user space differs. glibc will have to create a definition that matches the kernel, which uses struct __kernel_timespec { __s64 tv_sec; __s64 tv_nsec; }; As posix requires tv_nsec to be 'long', you need padding between tv_sec and tv_nsec to have a libc definition matching the kernel's binary layout. Arnd
Re: [Qemu-devel] [PATCH v4] linux-user: fix to handle variably sized SIOCGSTAMP with new kernels
On Fri, Jul 12, 2019 at 3:23 PM Laurent Vivier wrote: > > Le 12/07/2019 à 14:47, Arnd Bergmann a écrit : > > On Fri, Jul 12, 2019 at 2:17 PM Laurent Vivier wrote: > >> > >> Le 11/07/2019 à 23:05, Arnd Bergmann a écrit : > >>> On Thu, Jul 11, 2019 at 7:32 PM Laurent Vivier wrote: > >>> > >>>> > >>>> Notes: > >>>> v4: [lv] timeval64 and timespec64 are { long long , long } > >>> > >>>> > >>>> +STRUCT(timeval64, TYPE_LONGLONG, TYPE_LONG) > >>>> + > >>>> +STRUCT(timespec64, TYPE_LONGLONG, TYPE_LONG) > >>>> + > >>> > >>> This still doesn't look right, see my earlier comment about padding > >>> on big-endian architectures. > >>> > >>> Note that the in-kernel 'timespec64' is different from the uapi > >>> '__kernel_timespec' exported by the kernel. I also still think you may > >>> need to convert between SIOCGSTAMP_NEW and SIOCGSTAMP_OLD, > >>> e.g. when emulating a 32-bit riscv process (which only use > >>> SIOCGSTAMP_NEW) on a kernel that only understands > >>> SIOCGSTAMP_OLD. > >> > >> I agree. > >> I'm preparing a patch always using SIOCGSTAMP and SIOCGSTAMPNS on the > >> host (converting the structure when needed). > > > > That in turn would have the problem of breaking in 2038 when the > > timestamp overflows. > > No, because SIOCGSTAMP and SIOCGSTAMPNS are aliased to the _NEW versions > on system supporting them (yes, we need to rebuild the binary, but we have 19 > years to do that). > > #define SIOCGSTAMP ((sizeof(struct timeval)) == 8 ? \ > SIOCGSTAMP_OLD : SIOCGSTAMP_NEW) > #define SIOCGSTAMPNS((sizeof(struct timespec)) == 8 ? \ > SIOCGSTAMPNS_OLD : SIOCGSTAMPNS_NEW) Right, makes sense. > > > >> I've added the SH4 variant.> What is special about SH4? > > The definition of _OLD is different: > > #define SIOCGSTAMP_OLD _IOR('s', 100, struct timeval) /* Get stamp (timeval) > */ > #define SIOCGSTAMPNS_OLD _IOR('s', 101, struct timespec) /* Get stamp > (timespec) */ Ah, that one. > > No, you don't need to swap. The difference is only in the padding. > > Since the kernel uses a 64/64 structure here, and user space > > may have use 'long tv_nsec', you need to add the padding on > > the correct side, like > > > > struct timeval64 { > >long long tv_sec; > > #if 32bit && big-endian > >long :32; /* anonymous padding */ > > #endif > >suseconds_t tv_usec; > > #if (32bit && little-endian) || sparc64 > >long :32; > > #endif > > }; > > We don't do memcopy() but we set each field one by one, so the padding doesn't > seem needed if we define correctly the user structure: > > struct target_timeval64 { > abi_llong tv_sec; > abi_long tv_usec; > }; > > and do something like: > > struct target_timeval64 *target_tv; > struct timeval *host_tv; > ... > __put_user(host_tv->tv_sec, _tv->tv_sec); > __put_user(host_tv->tv_usec, _tv->tv_usec); > ... That still seems wrong. The user application has a definition of 'timeval' that contains the padding, so your definition has to match that. Arnd
Re: [Qemu-devel] [PATCH v4] linux-user: fix to handle variably sized SIOCGSTAMP with new kernels
On Fri, Jul 12, 2019 at 2:17 PM Laurent Vivier wrote: > > Le 11/07/2019 à 23:05, Arnd Bergmann a écrit : > > On Thu, Jul 11, 2019 at 7:32 PM Laurent Vivier wrote: > > > >> > >> Notes: > >> v4: [lv] timeval64 and timespec64 are { long long , long } > > > >> > >> +STRUCT(timeval64, TYPE_LONGLONG, TYPE_LONG) > >> + > >> +STRUCT(timespec64, TYPE_LONGLONG, TYPE_LONG) > >> + > > > > This still doesn't look right, see my earlier comment about padding > > on big-endian architectures. > > > > Note that the in-kernel 'timespec64' is different from the uapi > > '__kernel_timespec' exported by the kernel. I also still think you may > > need to convert between SIOCGSTAMP_NEW and SIOCGSTAMP_OLD, > > e.g. when emulating a 32-bit riscv process (which only use > > SIOCGSTAMP_NEW) on a kernel that only understands > > SIOCGSTAMP_OLD. > > I agree. > I'm preparing a patch always using SIOCGSTAMP and SIOCGSTAMPNS on the > host (converting the structure when needed). That in turn would have the problem of breaking in 2038 when the timestamp overflows. > I've added the SH4 variant. What is special about SH4? > I've added the sparc64 variant too: does it means sparc64 use the same > structure internally for the OLD and NEW version? I had to look it up in the code. Yes, it does, timeval is 64/32/pad32, timespec is 64/64 in both OLD and NEW for sparc. > What about sparc 32bit? sparc32 is like all other 32-bit targets: OLD uses 32/32, and new uses 64/64. > For big-endian, I didn't find in the kernel where the difference is > managed: a byte swapping of the 64bit value is not enough? No, you don't need to swap. The difference is only in the padding. Since the kernel uses a 64/64 structure here, and user space may have use 'long tv_nsec', you need to add the padding on the correct side, like struct timeval64 { long long tv_sec; #if 32bit && big-endian long :32; /* anonymous padding */ #endif suseconds_t tv_usec; #if (32bit && little-endian) || sparc64 long :32; #endif }; Arnd
Re: [Qemu-devel] [PATCH v4] linux-user: fix to handle variably sized SIOCGSTAMP with new kernels
On Thu, Jul 11, 2019 at 7:32 PM Laurent Vivier wrote: > > Notes: > v4: [lv] timeval64 and timespec64 are { long long , long } > > +STRUCT(timeval64, TYPE_LONGLONG, TYPE_LONG) > + > +STRUCT(timespec64, TYPE_LONGLONG, TYPE_LONG) > + This still doesn't look right, see my earlier comment about padding on big-endian architectures. Note that the in-kernel 'timespec64' is different from the uapi '__kernel_timespec' exported by the kernel. I also still think you may need to convert between SIOCGSTAMP_NEW and SIOCGSTAMP_OLD, e.g. when emulating a 32-bit riscv process (which only use SIOCGSTAMP_NEW) on a kernel that only understands SIOCGSTAMP_OLD. Arnd
Re: [Qemu-devel] [PATCH v2] linux-user: fix to handle variably sized SIOCGSTAMP with new kernels
On Mon, Jun 17, 2019 at 3:11 PM Daniel P. Berrangé wrote: > > The SIOCGSTAMP symbol was previously defined in the > asm-generic/sockios.h header file. QEMU sees that header > indirectly via sys/socket.h > > In linux kernel commit 0768e17073dc527ccd18ed5f96ce85f9985e9115 > the asm-generic/sockios.h header no longer defines SIOCGSTAMP. > Instead it provides only SIOCGSTAMP_OLD, which only uses a > 32-bit time_t on 32-bit architectures. This is a bit misleading, as we still define SIOCGSTAMP in the right place. asm-generic/sockios.h should not be used by normal user space. > The linux/sockios.h header then defines SIOCGSTAMP using > either SIOCGSTAMP_OLD or SIOCGSTAMP_NEW as appropriate. If > SIOCGSTAMP_NEW is used, then the tv_sec field is 64-bit even > on 32-bit architectures > > To cope with this we must now define two separate syscalls, > with corresponding old and new sizes, as well as including > the new linux/sockios.h header. The overall concept seems right. A few more comments on details that may have gone wrong here. I'm not familiar with the qemu-user implementation, so it's mostly guesswork on my end. >IOCTL(SIOCGIWNAME, IOC_W | IOC_R, MK_PTR(MK_STRUCT(STRUCT_char_ifreq))) >IOCTL(SIOCSPGRP, IOC_W, MK_PTR(TYPE_INT)) /* pid_t */ >IOCTL(SIOCGPGRP, IOC_R, MK_PTR(TYPE_INT)) /* pid_t */ > + > +#ifdef SIOCGSTAMP_OLD > + IOCTL(SIOCGSTAMP_OLD, IOC_R, MK_PTR(MK_STRUCT(STRUCT_timeval))) > +#else >IOCTL(SIOCGSTAMP, IOC_R, MK_PTR(MK_STRUCT(STRUCT_timeval))) > +#endif > +#ifdef SIOCGSTAMPNS_OLD > + IOCTL(SIOCGSTAMPNS_OLD, IOC_R, MK_PTR(MK_STRUCT(STRUCT_timespec))) > +#else >IOCTL(SIOCGSTAMPNS, IOC_R, MK_PTR(MK_STRUCT(STRUCT_timespec))) > +#endif > +#ifdef SIOCGSTAMP_NEW > + IOCTL(SIOCGSTAMP_NEW, IOC_R, MK_PTR(MK_STRUCT(STRUCT_timeval64))) > +#endif > +#ifdef SIOCGSTAMPNS_NEW > + IOCTL(SIOCGSTAMPNS_NEW, IOC_R, MK_PTR(MK_STRUCT(STRUCT_timespec64))) > +#endif Is timespec64 a qemu type? How is it defined? > diff --git a/linux-user/syscall_defs.h b/linux-user/syscall_defs.h > index 7f141f699c..7830b600e7 100644 > --- a/linux-user/syscall_defs.h > +++ b/linux-user/syscall_defs.h > @@ -750,6 +750,11 @@ struct target_pollfd { > > #define TARGET_SIOCGSTAMP 0x8906 /* Get stamp (timeval) */ > #define TARGET_SIOCGSTAMPNS0x8907 /* Get stamp (timespec) */ > +#define TARGET_SIOCGSTAMP_OLD 0x8906 /* Get stamp (timeval) */ > +#define TARGET_SIOCGSTAMPNS_OLD 0x8907 /* Get stamp (timespec) */ Note that these types are architecture specific. It seems that only one architecture is left that has its own definitions though, so this is only broken on arch/sh for current linux (and remains broken). Future architectures, including 32-bit risc-v should only have the _NEW version and not support SIOCGSTAMP_OLD at all. When emulating risc-v user space on old kernels (pre-5.1), you may need to translate the ioctl command and all system calls that take a 64-bit time_t into the variants with a 32-bit time_t on the way into the kernel, and then back. Similarly, running an old user binary on a riscv32 machine, you may need to do the reverse translation. > +#define TARGET_SIOCGSTAMP_NEW TARGET_IOC(TARGET_IOC_READ, 's', 6, > sizeof(long long) + sizeof(long)) /* Get stamp (timeval64) */ > +#define TARGET_SIOCGSTAMPNS_NEW TARGET_IOC(TARGET_IOC_READ, 's', 7, > sizeof(long long) + sizeof(long)) /* Get stamp (timespec64) */ "sizeof(long long) + sizeof(long)" is not always the size of the argument to TARGET_SIOCGSTAMP{NS}_NEW. On 32-bit architectures, the size is two 64-bit values. sparc64 is potentially another special case, as 'struct timeval is 'long + int' there (12 bytes). On big-endian architectures, the nanoseconds are returned in the last four bytes of the 16-byte structure. > /* Networking ioctls */ > #define TARGET_SIOCADDRT 0x890B /* add routing table entry */ > diff --git a/linux-user/syscall_types.h b/linux-user/syscall_types.h > index b98a23b0f1..de4c5a5b6f 100644 > --- a/linux-user/syscall_types.h > +++ b/linux-user/syscall_types.h > @@ -20,6 +20,10 @@ STRUCT(timeval, > STRUCT(timespec, > MK_ARRAY(TYPE_LONG, 2)) > > +STRUCT(timeval64, TYPE_LONGLONG, TYPE_LONG) > + > +STRUCT(timespec64, TYPE_LONGLONG, TYPE_LONG) Same here. Arnd
[Qemu-devel] [PATCH] headers: fix linux/mod_devicetable.h inclusions
A couple of drivers produced build errors after the mod_devicetable.h header was split out from the platform_device one, e.g. drivers/media/platform/davinci/vpbe_osd.c:42:40: error: array type has incomplete element type 'struct platform_device_id' drivers/media/platform/davinci/vpbe_venc.c:42:40: error: array type has incomplete element type 'struct platform_device_id' This adds the inclusion where needed. Fixes: ac3167257b9f ("headers: separate linux/mod_devicetable.h from linux/platform_device.h") Signed-off-by: Arnd Bergmann --- drivers/firmware/qemu_fw_cfg.c | 1 + drivers/media/platform/davinci/vpbe_osd.c | 1 + drivers/media/platform/davinci/vpbe_venc.c | 1 + drivers/media/platform/qcom/venus/vdec.c | 1 + drivers/media/platform/qcom/venus/venc.c | 1 + drivers/media/platform/sti/hva/hva-v4l2.c | 1 + drivers/platform/x86/intel_punit_ipc.c | 1 + 7 files changed, 7 insertions(+) diff --git a/drivers/firmware/qemu_fw_cfg.c b/drivers/firmware/qemu_fw_cfg.c index 14fedbeca724..039e0f91dba8 100644 --- a/drivers/firmware/qemu_fw_cfg.c +++ b/drivers/firmware/qemu_fw_cfg.c @@ -28,6 +28,7 @@ */ #include +#include #include #include #include diff --git a/drivers/media/platform/davinci/vpbe_osd.c b/drivers/media/platform/davinci/vpbe_osd.c index 7f610320426d..c551a25d90d9 100644 --- a/drivers/media/platform/davinci/vpbe_osd.c +++ b/drivers/media/platform/davinci/vpbe_osd.c @@ -18,6 +18,7 @@ * */ #include +#include #include #include #include diff --git a/drivers/media/platform/davinci/vpbe_venc.c b/drivers/media/platform/davinci/vpbe_venc.c index ba157827192c..ddcad7b3e76c 100644 --- a/drivers/media/platform/davinci/vpbe_venc.c +++ b/drivers/media/platform/davinci/vpbe_venc.c @@ -11,6 +11,7 @@ * GNU General Public License for more details. */ #include +#include #include #include #include diff --git a/drivers/media/platform/qcom/venus/vdec.c b/drivers/media/platform/qcom/venus/vdec.c index f89a91d43cc9..d4e23c7df347 100644 --- a/drivers/media/platform/qcom/venus/vdec.c +++ b/drivers/media/platform/qcom/venus/vdec.c @@ -14,6 +14,7 @@ */ #include #include +#include #include #include #include diff --git a/drivers/media/platform/qcom/venus/venc.c b/drivers/media/platform/qcom/venus/venc.c index f7a87a3dbb46..0522cf202b75 100644 --- a/drivers/media/platform/qcom/venus/venc.c +++ b/drivers/media/platform/qcom/venus/venc.c @@ -14,6 +14,7 @@ */ #include #include +#include #include #include #include diff --git a/drivers/media/platform/sti/hva/hva-v4l2.c b/drivers/media/platform/sti/hva/hva-v4l2.c index 15080cb00fa7..5a807c7c5e79 100644 --- a/drivers/media/platform/sti/hva/hva-v4l2.c +++ b/drivers/media/platform/sti/hva/hva-v4l2.c @@ -6,6 +6,7 @@ */ #include +#include #include #include #include diff --git a/drivers/platform/x86/intel_punit_ipc.c b/drivers/platform/x86/intel_punit_ipc.c index b5b890127479..f1afc0ebbc68 100644 --- a/drivers/platform/x86/intel_punit_ipc.c +++ b/drivers/platform/x86/intel_punit_ipc.c @@ -12,6 +12,7 @@ */ #include +#include #include #include #include -- 2.9.0
[Qemu-devel] [PATCH] fw_cfg: avoid unused function warning
The newly introduced fw_cfg_dma_transfer() function is unused when CONFIG_CRASH_CORE is disabled: drivers/firmware/qemu_fw_cfg.c:89:16: error: 'fw_cfg_dma_transfer' defined but not used [-Werror=unused-function] static ssize_t fw_cfg_dma_transfer(void *address, u32 length, u32 control) This moves it into the #ifdef section that hides its caller to avoid the warning. Fixes: 47e78bfb5426 ("fw_cfg: write vmcoreinfo details") Signed-off-by: Arnd Bergmann <a...@arndb.de> --- drivers/firmware/qemu_fw_cfg.c | 60 +- 1 file changed, 30 insertions(+), 30 deletions(-) diff --git a/drivers/firmware/qemu_fw_cfg.c b/drivers/firmware/qemu_fw_cfg.c index 3015e77aebca..f002bb40519b 100644 --- a/drivers/firmware/qemu_fw_cfg.c +++ b/drivers/firmware/qemu_fw_cfg.c @@ -66,6 +66,36 @@ static void fw_cfg_sel_endianness(u16 key) iowrite16(key, fw_cfg_reg_ctrl); } +/* read chunk of given fw_cfg blob (caller responsible for sanity-check) */ +static ssize_t fw_cfg_read_blob(u16 key, + void *buf, loff_t pos, size_t count) +{ + u32 glk = -1U; + acpi_status status; + + /* If we have ACPI, ensure mutual exclusion against any potential +* device access by the firmware, e.g. via AML methods: +*/ + status = acpi_acquire_global_lock(ACPI_WAIT_FOREVER, ); + if (ACPI_FAILURE(status) && status != AE_NOT_CONFIGURED) { + /* Should never get here */ + WARN(1, "fw_cfg_read_blob: Failed to lock ACPI!\n"); + memset(buf, 0, count); + return -EINVAL; + } + + mutex_lock(_cfg_dev_lock); + fw_cfg_sel_endianness(key); + while (pos-- > 0) + ioread8(fw_cfg_reg_data); + ioread8_rep(fw_cfg_reg_data, buf, count); + mutex_unlock(_cfg_dev_lock); + + acpi_release_global_lock(glk); + return count; +} + +#ifdef CONFIG_CRASH_CORE static inline bool fw_cfg_dma_enabled(void) { return (fw_cfg_rev & FW_CFG_VERSION_DMA) && fw_cfg_reg_dma; @@ -124,36 +154,6 @@ static ssize_t fw_cfg_dma_transfer(void *address, u32 length, u32 control) return ret; } -/* read chunk of given fw_cfg blob (caller responsible for sanity-check) */ -static ssize_t fw_cfg_read_blob(u16 key, - void *buf, loff_t pos, size_t count) -{ - u32 glk = -1U; - acpi_status status; - - /* If we have ACPI, ensure mutual exclusion against any potential -* device access by the firmware, e.g. via AML methods: -*/ - status = acpi_acquire_global_lock(ACPI_WAIT_FOREVER, ); - if (ACPI_FAILURE(status) && status != AE_NOT_CONFIGURED) { - /* Should never get here */ - WARN(1, "fw_cfg_read_blob: Failed to lock ACPI!\n"); - memset(buf, 0, count); - return -EINVAL; - } - - mutex_lock(_cfg_dev_lock); - fw_cfg_sel_endianness(key); - while (pos-- > 0) - ioread8(fw_cfg_reg_data); - ioread8_rep(fw_cfg_reg_data, buf, count); - mutex_unlock(_cfg_dev_lock); - - acpi_release_global_lock(glk); - return count; -} - -#ifdef CONFIG_CRASH_CORE /* write chunk of given fw_cfg blob (caller responsible for sanity-check) */ static ssize_t fw_cfg_write_blob(u16 key, void *buf, loff_t pos, size_t count) -- 2.9.0
Re: [Qemu-devel] [kernel PATCH v2 2/2] devicetree: document ARM bindings for QEMU's Firmware Config interface
On Friday 05 December 2014 19:08:46 Peter Maydell wrote: On 5 December 2014 at 19:04, Laszlo Ersek ler...@redhat.com wrote: On 12/05/14 19:57, Peter Maydell wrote: On 30 November 2014 at 16:51, Laszlo Ersek ler...@redhat.com wrote: +Example: + +/ { + #size-cells = 0x2; + #address-cells = 0x2; + + fw-cfg@902 { + compatible = qemu,fw-cfg-mmio; + reg = 0x0 0x902 0x0 0x1000; + }; I've just noticed that this example claims a register region size of 0x1000. This seems wrong, because the underlying device doesn't have a register range that big. Surely this should be a size of 0x3 ? Arnd said I should round up the region to 0x1000. Right; I replied here as a reasonable place to do so on an email with the device-tree folk in cc. http://thread.gmane.org/gmane.linux.drivers.devicetree/101173/focus=101176 Arnd, what was your reasoning in requesting the round-up? I would have expected that a dtb with an overlarge range is telling the guest it can access things that in fact just aren't there (ie the equivalent of unmapped space which on h/w would give you an external abort/decode error). I was expecting that qemu would be allowed to put additional registers in there for future extensions. My comment was mainly directed at having two distinct but consecutive register ranges, which can be better expressed by having a longer one. As long as qemu knows exactly how long the register area is (and it better should know), having the correct length in there is probably best. Arnd
Re: [Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface
On Friday 28 November 2014 13:26:44 Laszlo Ersek wrote: +Example: + +/ { + #size-cells = 0x2; + #address-cells = 0x2; + + fw-cfg@902 { + reg = 0x0 0x902 0x0 0x2 0x0 0x9020002 0x0 0x1; + compatible = fw-cfg,mmio; + }; +}; fw-cfg is not a valid vendor string. Are you emulating an actual piece of hardware here or is this something that comes from qemu? How about using qemu,fwcfg as the compatible string, and documenting qemu as an official vendor string in Documentation/devicetree/bindings/vendor-prefixes.txt We don't normally list contiguous registers separately. Maybe just round up to one page and make the register property reg = 0x0 0x902 0x0 0x1000; Finally, isn't there also an interrupt associated with this virtual device? Arnd
Re: [Qemu-devel] [Linaro-acpi] [RFC PATCH 0/7] hw/arm/virt: Dynamic ACPI v5.1 table generation
On Wednesday 12 November 2014 10:56:40 Mark Rutland wrote: On Wed, Nov 12, 2014 at 09:08:55AM +, Claudio Fontana wrote: On 11.11.2014 16:29, Mark Rutland wrote: I tend to mostly agree with this, we might look for alternative solutions for speeding up guest startup in the future but in general if getting ACPI in the guest for ARM64 requires also getting UEFI, then I can personally live with that, especially if we strive to have the kind of optimized virtualized UEFI you mention. Given that UEFI will be required for other guests (e.g. if you want to boot a distribution's ISO image), I hope that virtualised UEFI will see some optimisation work. I think the requirement it just for KVM to provide something that behaves exactly like UEFI, it doesn't have to be the full Tianocore implementation if it's easier to reimplement the boot interface. As mentioned by others, I'd rather see an implementation of ACPI in QEMU which learns from the experience of X86 (and possibly shares some code if possible), rather than going in a different direction by creating device trees first, and then converting them to ACPI tables somewhere in the firmware, just because device trees are already there, for the reasons which have already been mentioned before by Igor and others. For the features which ACPI provides which device trees do not (e.g. the dynamic addition and removal of memory and CPUs), there will need to be some sort of interface between QEMU and the ACPI implementation. That's already outside of the realm of DT, so as previously mentioned a simple conversion doesn't cover the general case. I think we need to support the low-level interfaces in the kernel for this anyway, we should not have to use ACPI just to do memory and CPU hotplugging in KVM, that would be silly. If ACPI is present, it can provide a wrapper for the same interface, but KVM should not need to be aware of the fact that ACPI is used in the guest, after it has passed the initial ACPI blob to the kernel. I think any ACPI implemenation for a hypervisor should provide a demonstrable useful feature (e.g. hot-add of CPUs) before merging so we know the infrastructure is suitable. I wouldn't want for ACPI to be sort of supported in QEMU, but with a limited functionality which makes it not fully useful in practice. I'd rather see it as a first class citizen instead, including the ability to run AML code. I agree that there's no point in having ACPI in a guest unless it provides something which dt does not. I don't know how it should be structured to provide those useful features. I see it the opposite way: we shouldn't have to use ACPI just to make use of some feature in Linux, the only reason why you'd want ACPI support in KVM is to be able to run Windows. It makes sense for the ACPI implementation to be compatible with the Linux ACPI code as well so we can test it better. Arnd
Re: [Qemu-devel] [Linaro-acpi] [RFC PATCH 0/7] hw/arm/virt: Dynamic ACPI v5.1 table generation
On Wednesday 12 November 2014 13:27:14 Paolo Bonzini wrote: On 12/11/2014 13:18, Mark Rutland wrote: On Wed, Nov 12, 2014 at 11:48:27AM +, Paolo Bonzini wrote: Perhaps you could treat it as a shared level-triggered interrupt in DT? I don't know. Putting an interrupt in DT is trivial. The hard part is the rest of the interface, which so far there is no specification for. Have you looked at docs/specs/acpi_{cpu,mem}_hotplug.txt? Writing a DT binding for it is trivial too. Or are we talking about two different things? Interesting. I agree that doing a DT binding for these will be trivial, and the implementation in Linux should also be straightforward, thanks for pointing these out. However, it seems that the implementation that qemu uses is incompatible with ARM64, since GPE is not part of the ACPI hardware reduced mode. Arnd
Re: [Qemu-devel] [Linaro-acpi] [RFC PATCH 0/7] hw/arm/virt: Dynamic ACPI v5.1 table generation
On Wednesday 12 November 2014 16:01:14 Claudio Fontana wrote: Would the last step you mention allow for guests to start with an already existing PCI interrupt map and the BAR registers preprogrammed to point to somewhere sane? I ask because on OSv at the moment, the situation is that for x86 we don't need to reprogram anything on PCI, as everything is already nicely set up by the time the guest starts, and thus the BAR addresses can be read directly. On ARM we have to reprogram the BARs manually for all devices. Couldn't we give an easier time to each OS guest by having everything nicely set up on AArch64 as well? Definitely, I think having the OS manually program the BARs only makes sense in an environment where you don't have a full-featured boot loader or you don't trust it. In servers and virtual machines, the PCI bus should always come fully set up. This also implies that the OS should not have to deal with registers for setting up the translation between PCI and system address ranges. Arnd
Re: [Qemu-devel] [Linaro-acpi] [RFC PATCH 0/7] hw/arm/virt: Dynamic ACPI v5.1 table generation
On Wednesday 12 November 2014 16:52:25 Paolo Bonzini wrote: On 12/11/2014 16:39, Peter Maydell wrote: Definitely, I think having the OS manually program the BARs only makes sense in an environment where you don't have a full-featured boot loader or you don't trust it. In servers and virtual machines, the PCI bus should always come fully set up. This also implies that the OS should not have to deal with registers for setting up the translation between PCI and system address ranges. It seems to me like complicated stuff like that definitely belongs in the UEFI/bootloader blob, though. I'd rather QEMU just modelled the hardware and let the guest (or the firmware, which is guest code from QEMU's point of view) set it up however it wants. It definitely doesn't belong in QEMU! The easiest option would probably be not make all PCI devices have fixed BARs and not even allow them to be changed. I believe this is what kvmtool does, but I can see how supporting both modes is much harder than either one. How does it work on x86 with qemu? Arnd
Re: [Qemu-devel] [Linaro-acpi] [RFC PATCH 0/7] hw/arm/virt: Dynamic ACPI v5.1 table generation
On Wednesday 12 November 2014 17:04:30 Paolo Bonzini wrote: On 12/11/2014 16:57, Arnd Bergmann wrote: It seems to me like complicated stuff like that definitely belongs in the UEFI/bootloader blob, though. I'd rather QEMU just modelled the hardware and let the guest (or the firmware, which is guest code from QEMU's point of view) set it up however it wants. It definitely doesn't belong in QEMU! The easiest option would probably be not make all PCI devices have fixed BARs and not even allow them to be changed. I believe this is what kvmtool does, but I can see how supporting both modes is much harder than either one. kvmtool does not have firmware; it starts the kernel directly, so it does all the setup that usually is done by the firmware. It implements a couple real-mode interfaces that Linux uses when booting, but nothing of this deals with PCI. x86 QEMU always runs firmware. Even if you specify -kernel, the firmware does all the usual initialization and then boots from a small ROM. The ROM contains the bootloader, so it loads and starts the kernel. Ok, I see. How does it work on x86 with qemu? Same as real hardware. Firmware (SeaBIOS or OVMF) builds the memory map, decides where in the free space the BARs go, and programs the PCI devices accordingly. kvmtool is the special one here. Xen, VMware, Hyper-V all do the same as QEMU. Right. I guess embedded ARM images in qemu are a third way then, because these don't have a guest firmware but also don't set up the hardware the way that kvmtool does. Claudio's request to do this differently on arm64 seems absolutely reasonable to me, but I guess that implies having UEFI or something like it that does the PCI scan. Not sure what the best default for qemu -kernel image should be though if you don't explicitly pass a firmware image. Arnd
Re: [Qemu-devel] [Linaro-acpi] [RFC PATCH 0/7] hw/arm/virt: Dynamic ACPI v5.1 table generation
On Thursday 06 November 2014 13:30:01 Mark Rutland wrote: There is no way of doing this with DT. There has been work into DT fragments/overlays where portions can be added to the tree dynamically, but that's controlled by the OS rather than the hypervisor, and there's no standard for communicating what has been hotplugged to trigger changes to the tree, so it's not quite the same. It really only works for tightly coupled hw/kernel/userspace combinations (i.e. embedded). Depending on how you implement the hot-add/remove you might be able to get away with an initial static configuration translated from DT. If you need to describe what might be hotplugged from the start, then I suspect you cannot get away with translating a DT in general. I believe IBM POWER 5/6/7 servers have an interface to update the DT from the hypervisor, but it's not a nice interface, so we may want to avoid duplicating that. Arnd
Re: [Qemu-devel] Change of TEXT_OFFSET for multi_v7_defconfig
On Tuesday 22 April 2014 11:53:25 Jason Gunthorpe wrote: On Tue, Apr 22, 2014 at 06:11:42PM +0100, Russell King - ARM Linux wrote: Put another way, if your platform is part of the multi-platform kernel then you are *excluded* from being able to use this... unless you hack the Kconfig, and then also provide a constant value for PHYS_OFFSET, thereby _tying_ the kernel you built to a _single_ platform. That is exactly right. To get a fixed LMA you must commit to a non-relocatable kernel image. Realistically this patch would need to be accompanied by something that makes ARM_PATCH_PHYS_VIRT optional for multiplatform based on EXPERT or similar. Incidentally, I have this experimental patch in my endless 'randconfig' queue of things to do properly at a later point. The idea is to let a person building a kernel intentionally do a number of things that result in a not-quite-multiplatform configuration when CONFIG_BROKEN_MULTIPLATFORM is disabled. The main intention is allow NOMMU builds on platforms that are already converted to multiplatform, and to prevent people from accidentally turning on DEBUG_LL, as that breaks all other machines. A couple of other things are in the same category. Arnd From c9917a4a1b3f326e84932d7e860597413a98643b Mon Sep 17 00:00:00 2001 From: Arnd Bergmann a...@arndb.de Date: Mon, 24 Feb 2014 16:38:34 +0100 Subject: [PATCH] ARM: add CONFIG_BROKEN_MULTIPLATFORM Signed-off-by: Arnd Bergmann a...@arndb.de diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 65ea715..9ae3c0b 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -233,7 +233,7 @@ config VECTORS_BASE in size. config ARM_PATCH_PHYS_VIRT - bool Patch physical to virtual translations at runtime if EMBEDDED + bool Patch physical to virtual translations at runtime if BROKEN_MULTIPLATFORM default y depends on !XIP_KERNEL MMU depends on !ARCH_REALVIEW || !SPARSEMEM @@ -302,16 +302,14 @@ config MMU # choice prompt ARM system type - default ARCH_VERSATILE if !MMU - default ARCH_MULTIPLATFORM if MMU + default ARCH_MULTIPLATFORM config ARCH_MULTIPLATFORM bool Allow multiple platforms to be selected - depends on MMU select ARCH_WANT_OPTIONAL_GPIOLIB select ARM_HAS_SG_CHAIN - select ARM_PATCH_PHYS_VIRT select AUTO_ZRELADDR + select BROKEN_MULTIPLATFORM if !MMU select COMMON_CLK select GENERIC_CLOCKEVENTS select MULTI_IRQ_HANDLER @@ -865,6 +863,18 @@ config ARCH_OMAP1 endchoice +config BROKEN_MULTIPLATFORM + bool Allow multiplatform builds that may be broken if ARCH_MULTIPLATFORM EXPERT + def_bool !ARCH_MULTIPLATFORM + help + This is a meta option that makes other options visible that + are normally hidden when building a kernel with ARCH_MULTIPLATFORM + support. + You have to enable this option in order to select any option + that can cause your kernel to only work on a subset of the + platforms that are enabled. This includes disabling MMU + or ARM_PATCH_PHYS_VIRT, or enabling ZBOOT_ROM. + menu Multiple platform selection depends on ARCH_MULTIPLATFORM @@ -1943,6 +1953,7 @@ config DEPRECATED_PARAM_STRUCT # TEXT and BSS so we preserve their values in the config files. config ZBOOT_ROM_TEXT hex Compressed ROM boot loader base address + depends on BROKEN_MULTIPLATFORM default 0 help The physical address at which the ROM-able zImage is to be @@ -1954,6 +1965,7 @@ config ZBOOT_ROM_TEXT config ZBOOT_ROM_BSS hex Compressed ROM boot loader BSS address + depends on BROKEN_MULTIPLATFORM default 0 help The base address of an area of read/write memory in the target @@ -1967,6 +1979,7 @@ config ZBOOT_ROM_BSS config ZBOOT_ROM bool Compressed boot loader in ROM/flash + depends on BROKEN_MULTIPLATFORM depends on ZBOOT_ROM_TEXT != ZBOOT_ROM_BSS depends on !ARM_APPENDED_DTB !XIP_KERNEL !AUTO_ZRELADDR help @@ -2092,7 +2105,7 @@ endchoice config XIP_KERNEL bool Kernel Execute-In-Place from ROM - depends on !ARM_LPAE !ARCH_MULTIPLATFORM + depends on !ARM_LPAE BROKEN_MULTIPLATFORM help Execute-In-Place allows the kernel to run from non-volatile storage directly addressable by the CPU, such as NOR flash. This saves RAM diff --git a/arch/arm/Kconfig.debug b/arch/arm/Kconfig.debug index 6175b50..e193881 100644 --- a/arch/arm/Kconfig.debug +++ b/arch/arm/Kconfig.debug @@ -80,6 +80,7 @@ config DEBUG_USER config DEBUG_LL bool Kernel low-level debugging functions (read help!) depends on DEBUG_KERNEL + depends on BROKEN_MULTIPLATFORM help Say Y here to include definitions of printascii, printch, printhex in the kernel. This is helpful if you are debugging code
Re: [Qemu-devel] Change of TEXT_OFFSET for multi_v7_defconfig
On Tuesday 22 April 2014 19:38:55 Russell King - ARM Linux wrote: On Tue, Apr 22, 2014 at 08:32:10PM +0200, Arnd Bergmann wrote: @@ -1943,6 +1953,7 @@ config DEPRECATED_PARAM_STRUCT # TEXT and BSS so we preserve their values in the config files. config ZBOOT_ROM_TEXT hex Compressed ROM boot loader base address + depends on BROKEN_MULTIPLATFORM default 0 help The physical address at which the ROM-able zImage is to be @@ -1954,6 +1965,7 @@ config ZBOOT_ROM_TEXT config ZBOOT_ROM_BSS hex Compressed ROM boot loader BSS address + depends on BROKEN_MULTIPLATFORM default 0 help The base address of an area of read/write memory in the target Please get rid of the above changes. Good point, fixed now. Arnd
Re: [Qemu-devel] [PATCH for-1.5 0/3] hw/pci-host/versatile: Fix issues with newer kernels
On Wednesday 15 May 2013, Linus Walleij wrote: On Tue, May 14, 2013 at 5:33 PM, Peter Maydell peter.mayd...@linaro.org wrote: The reworking of the versatile PCI controller model so that it actually behaved like hardware included an attempt to autodetect whether the guest Linux kernel was assuming the old broken behaviour. Unfortunately it turns out that there are several different variant broken kernels which behave slightly differently (though none of them will work on real hardware). The first two patches in this series improve the autodetection so that we will work out of the box on more kernels. The third patch adds a property for forcing the behaviour, so that if there are further cases we didn't know about, at least users have a command line workaround they can enable. This looks like a viable approach to me. So FWIW: Acked-by: Linus Walleij linus.wall...@linaro.org FWIW, I plan to really get this done in the kernel for 3.11 properly and rework the entire versatile and realview code base to work without any platform specific code in arch/arm. The plan is to use the new infrastructure for PCI and put that code into drivers/pci/host, and have it scan the hardware using DT only. We can have a backwards compatibility setup for versatile-pb without DT, but in the long run, I would prefer to kill off that boot option. If this works out, any Linux kernel built for any platform should be able to boot the versatilepb, versatileab, integratorcp, realview-eb, realview-eb-mpcore, realview-pb-a8, realview-pbx-a9, vexpress-a9 and vexpress-a15 models, as long as you pass a -cpu option matching a CPU enabled in the kernel, and all the drivers are enabled. I remember there was a way to autogenerate the dtb blob in qemu at some point, based on the devices enabled in the model. Did that ever make it in? Arnd
Re: [Qemu-devel] [PATCH v2 07/11] versatile_pci: Implement the correct PCI IRQ mapping
On Tuesday 26 March 2013, Peter Maydell wrote: Implement the correct IRQ mapping for the Versatile PCI controller; it differs between realview and versatile boards, but the previous QEMU implementation was correct only for the first PCI card on a versatile board, since we weren't swizzling IRQs based on the slot number. Since this change would otherwise break any uses of PCI on Linux kernels which have an equivalent bug (since they have effectively only been tested against QEMU, not real hardware), we implement a mechanism for automatically detecting those broken kernels and switching back to the old mapping. This works by looking at the values the kernel writes to the PCI_INTERRUPT_LINE register in the config space, which is effectively the interrupt number the kernel expects the device to be using. If this doesn't match reality we use the broken mapping. Signed-off-by: Peter Maydell peter.mayd...@linaro.org Yes, very good. We will probably introduce sparse irq support on versatile in the near future, and then the value we write into the PCI_INTERRUPT_LINE field will become arbitrary from qemu's point of view, but I will make sure that we fix the interrupt mapping in the kernel at the same time so we always fall into the s-broken_irq_mapping = false; case. We also need to find a way to make the new kernel work with an old qemu, and I think we can do that by using the versatile-dt board type with a PCI device node that sets all four lines to 27, while using the actual interrupt lines for the default versatile device tree. Arnd
Re: [Qemu-devel] [PATCH v2 07/11] versatile_pci: Implement the correct PCI IRQ mapping
On Tuesday 26 March 2013, Peter Maydell wrote: On 26 March 2013 10:54, Arnd Bergmann a...@arndb.de wrote: Yes, very good. We will probably introduce sparse irq support on versatile in the near future, and then the value we write into the PCI_INTERRUPT_LINE field will become arbitrary from qemu's point of view, but I will make sure that we fix the interrupt mapping in the kernel at the same time so we always fall into the s-broken_irq_mapping = false; case. Yeah, as long as you avoid the number 27 you're ok :-) Good point. I guess we'll have to keep using a legacy domain for versatile then. We also need to find a way to make the new kernel work with an old qemu, and I think we can do that by using the versatile-dt board type with a PCI device node that sets all four lines to 27, while using the actual interrupt lines for the default versatile device tree. Personally I'd be happy for you to just say needs a new QEMU. The broken QEMU is missing so much (including working memory windows) that I think it would be a pain to get the kernel to cope with it. But it was working earlier, so I'd definitely try not to break if at all possible. A lot of people use the verstatile qemu model to run kernels and I would not want to deal with the complaints I'd get if we break those. Using a separate dts file seems easy enough. Arnd
Re: [Qemu-devel] [PATCH 00/10] Fix versatile_pci (and break versatilepb linux guests!)
On Sunday 24 March 2013, Peter Maydell wrote: Yeah, ideally being able to detect the buggy kernel would be good; I can't see anything at the controller level that would do though, and I don't really know enough about PCI to know about generic PCI stuff that would work. (Why would the OS need to tell the device anything about its IRQ if it's hardwired?) I think it actually does on versatile and other platforms on which the kernel probes the PCI bus itself, rather than relying on firmware to have resources assigned in advance. IIRC, the PCI_INTERRUPT_LINE pci config space byte (0x3c) is purely informational and used as a way to communicate the interrupt number from the bus scan code (assumed to be a PC BIOS in the PCI spec, but drivers/pci/setup-irq.c in case of versatile+linux) to a device driver. So the kernel should actually write the proper interrupt number in there. In future kernels, this may not necessarily be the hardware number, but today it is. Can you try out what the kernel writes into that register in qemu, with and without my patches? I would expect the numbers to be (64+27) to (64+30), since we linearize the interrupt numbers so that VIC gets 32 through 63 and SIC gets 64 through 95. Arnd
Re: [Qemu-devel] [PATCH 00/10] Fix versatile_pci (and break versatilepb linux guests!)
On Sunday 24 March 2013, Peter Maydell wrote: OK, I'll give that a go tomorrow. While you're here, do you know what the point of the PCI_SELFID register is? The h/w docs are clear that the OS has to write the slot number of the board itself in to this register, but again I don't see why the OS has to tell the hardware something it already knows. On QEMU I just implemented this register as a no-op and everything seems to still function... I don't know really, but I think it has something to do with the versatile board being plugged into a backplane that has multiple slots. Apparently there is something the hardware cannot figure out itself, e.g. whether to route the interrupts in or out of the slot. Arnd
Re: [Qemu-devel] [PATCH 00/10] Fix versatile_pci (and break versatilepb linux guests!)
On Sunday 24 March 2013, Michael S. Tsirkin wrote: For future kernels, let's build in some hook that let qemu detect a non broken guest. How about writing some magic value into revision ID or some other readonly field? Yes, makes sense.
Re: [Qemu-devel] [RFC] Next gen kvm api
On Tuesday 07 February 2012, Alexander Graf wrote: On 07.02.2012, at 07:58, Michael Ellerman wrote: On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote: You're exposing a large, complex kernel subsystem that does very low-level things with the hardware. It's a potential source of exploits (from bugs in KVM or in hardware). I can see people wanting to be selective with access because of that. Exactly. In a perfect world I'd agree with Anthony, but in reality I think sysadmins are quite happy that they can prevent some users from using KVM. You could presumably achieve something similar with capabilities or whatever, but a node in /dev is much simpler. Well, you could still keep the /dev/kvm node and then have syscalls operate on the fd. But again, I don't see the problem with the ioctl interface. It's nice, extensible and works great for us. ioctl is good for hardware devices and stuff that you want to enumerate and/or control permissions on. For something like KVM that is really a core kernel service, a syscall makes much more sense. I would certainly never mix the two concepts: If you use a chardev to get a file descriptor, use ioctl to do operations on it, and if you use a syscall to get the file descriptor then use other syscalls to do operations on it. I don't really have a good recommendation whether or not to change from an ioctl based interface to syscall for KVM now. On the one hand I believe it would be significantly cleaner, on the other hand we cannot remove the chardev interface any more since there are many existing users. Arnd
Re: [Qemu-devel] [RFC] Next gen kvm api
On Tuesday 07 February 2012, Alexander Graf wrote: Not sure we'll ever get there. For PPC, it will probably take another 1-2 years until we get the 32-bit targets stabilized. By then we will have new 64-bit support though. And then the next gen will come out giving us even more new constraints. I would expect that newer archs have less constraints, not more. Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have today on 32-bit, but extends a bunch of registers to 64-bit. So what if we laid out stuff wrong before? I don't even want to imagine what v7 arm vs v8 arm looks like. It's a completely new architecture. I have not seen the source but I'm pretty sure that v7 and v8 they look very similar regarding virtualization support because they were designed together, including the concept that on v8 you can run either a v7 compatible 32 bit hypervisor with 32 bit guests or a 64 bit hypervisor with a combination of 32 and 64 bit guests. Also, the page table layout in v7-LPAE is identical to the v8 one. The main difference is the instruction set, but then ARMv7 already has four of these (ARM, Thumb, Thumb2, ThumbEE). Arnd
Re: [Qemu-devel] [RFC][PATCH] Import Linux headers for KVM and vhost
On Tuesday 03 May 2011, Jan Kiszka wrote: This helps reducing our build-time checks for feature support in the available Linux kernel headers. And it helps users that do not have sufficiently recent headers installed on their build machine. Header upstate is triggered via 'make update-linux-headers', optionally specifying a kernel directory as source via 'LINUX='. --- I'm sure this is not yet perfect from technical POV (e.g. the patch to Makefile.target looks like a hack to me, better ideas welcome), therefore RFC and no SOB. But it should demonstrate the plan. The workflow for adding a new KVM feature support would be first a 'make update-linux-header LINUX=/my/local/linux' + git commit, and then the commit of the kvm, vhost, whatever changes. Makefile| 17 +++ Makefile.target |2 +- configure | 127 --- 3 files changed, 36 insertions(+), 110 deletions(-) Very nice! The old way to poke into the kernel internals was really annoying and confusing. Arnd
Re: [Qemu-devel] [PATCH v2] Import Linux headers for KVM and vhost
On Tuesday 03 May 2011 19:57:18 Scott Wood wrote: I agree, it doesn't feel quite right to bring in the headers. I don't have any alternative suggestions (besides better HOWTOs/Documentation) though. If you try to use the non-sanitized kernel headers, you'll get this warning from linux/types.h: #warning Attempt to use kernel headers from user space, see http://kernelnewbies.org/KernelHeaders; You're welcome ;-) I tried to make the recommended approach as clear as I could with the #warning and the wiki page. If there is anything still unclear about how it should be done, please make suggestions for improving the wiki text. Arnd
Re: [Qemu-devel] Re: Network bridging without adding bridge with brctl, possible?
On Thursday 24 February 2011, Jan Kiszka wrote: On 2011-02-24 07:49, Gerhard Wiesinger wrote: On Wed, 23 Feb 2011, Jan Kiszka wrote: Right, but if I set IP(eth0) == IP(macvlan0), I'm able to communicate between macvlan0 and mactapX, thus between guest and host. Just re-checked here, still works (after resolving the usual MAC address mess I caused by configuring manually). Thnx for the tipp. Did you use MAC(eth0) == MAC(macvlan0) or MAC(eth0) MAC(macvlan0) to get it to work? The latter (I just let macvlan/tap choose their MACs). You cannot set the two to the same MAC address while they are up, but I think you can do ip link set eth0 down ip link set macvlan0 ${MAC_ETH0} ip link set macvlan0 up Doing that is a bit tricky if eth0 is your only connection to the machine... Arnd
Re: [Qemu-devel] Re: Network bridging without adding bridge with brctl, possible?
On Wednesday 23 February 2011, Gerhard Wiesinger wrote: After some further tests and looking at the iproute ip and kernel code I finally gave up because I thing such a setup it is not possible without breaking up/reconfiguring eth0. When I have to reconfigure eth0 I think a better approach is to configure a bridge which I finally did and works well. I tried to explain/document the macvtap/macvlan concepts and limitations below. Please comment on it whether this is true or false. macvtap/macvlan driver concepts and limitations: 1.) macvlan driver adds a MAC address to a lower interface device where the actual macvlanx device is based on 2.) macvtap driver is based on macvlan driver and macvtap driver adds additional functionality of interface = external program communication with stdin/stdout channel. 3.) Limitations: macvtap/macvlan based devices can only communicate with childs based on the same lower device (e.g. eth0 in this sample) but not to the lower device itself, only to the outside world of the interface Correct. finally this makes the macvlan/macvtap approach useless because main eth0 interface must still be broken in the chain and reconfigured which was against the requirements that eth0 should not be touched and reconfigured! Yes, that is unfortunate, but it's the same that you'd get with a bridge device: When you have a bridge on top of eth0, you can no longer assign an IP address to eth0 and let it communicate with the virtual ports on the bridge. You need to instead set the IP address on the bridge itself. Macvlan is slightly better because it allows you to have multiple host devices that can each have their own MAC/IP address, unlike the bridge, but of course it can not be connected to anything else besides macvlan or macvtap ports. Arnd
Re: [Qemu-devel] Re: Network bridging without adding bridge with brctl, possible?
On Monday 21 February 2011, Jan Kiszka wrote: Now I think I tried all useful possible combinations: 1.) macvtap0 and macvlan0 in bridged and non bridge mode 2.) macvlan0 based on eth0 or based on macvtap0 3.) Using and ip address on macvlan0 or not (tried both for 7b mac and 7c mac) 4.) Using 7b and 7c mac address in KVM (MAC in KVM is always the command line configured) and used tap interface from macvtap0 (as no second tap devices shows up) Summary: 1.) Using macvtap0 only with 7b mac on interface and also at qemu works well in bridged mode as well as non bridged mode but with limitation no guest/host communication possible, used interface in KVM is tap interface created by macvtap0. Quite logically that it doesn't work with guest/host because of missing hairpin mode on the switch. But according to http://virt.kernelnewbies.org/MacVTap bridge mode should work even without hairpinning available on the switch. 2.) macvlan0 doesn't create any tap interface therefore it can't be used as a device on KVM. 3.) Using 7c mac address in KVM doesn't work at all regardsless of setting ip addresses on macvlan0 or any other setup. @Arnd: Can you explain a setup in detail which should work also with host/guest communication. Thnx. Any further ideas? Which kernel is needed to work? (current: 2.6.34.7-56.fc13.x86_64) Actually, I tried this successfully over a 2.6.38-rcX, but I have no idea what changes related to macvlan/tap went in since .34 and if this is required for this. We had a few bugs in .34, but it should work in general. One thing to make sure is that the host has connectivity through the macvlan interface and the lower interface (eth0) has no IP address assigned but is 'up'. It could also be a bug in the NIC, you could try to set the NIC into promiscious mode using ethtool to work around that. Also, the lower interface must not be connected to a bridge device. Arnd
Re: [Qemu-devel] Re: Network bridging without adding bridge with brctl, possible?
On Sunday 20 February 2011, Gerhard Wiesinger wrote: Not sure if this is by design or due to internals of the networking stack, but it looks unintuitive from user perspective. Maybe Arnd can shed a light on this. The lower device cannot be in bridge mode, because that would make the logic in the kernel awfully complex. I agree that it's a bit unfortunate, but it simplified the design a lot. Of course, you could also simply offload all that setup to libvirt. http://wiki.libvirt.org/page/Networking But that still needs a bridge on ethernet level and break up the existing interface, right? No. macvtap is a tap that sits on an external interface, you don't need a bridge if you use it. BTW: From: http://virt.kernelnewbies.org/MacVTap As of QEMU 0.12: qemu -net nic,model=virtio,addr=1a:46:0b:ca:bc:7b -net tap,fd=3 3/dev/tap11 Is there a newer Syntax with tap interfaces possible (QEMU GIT) without bash redirects? I made some suggestions how to integrate it with qemu, but they never got in. Libvirt makes it a lot easier, though. We discussed that it should be cleaned up when the networking helper scripts make it into qemu. Not sure if that ever happened. Arnd
Re: [Qemu-devel] Re: Network bridging without adding bridge with brctl, possible?
On Sunday 20 February 2011, Gerhard Wiesinger wrote: qemu-system-x86_64 ... some params ... -net nic,model=e1000,macaddr=1a:46:0b:ca:bc:7c -net tap,fd=3 3/dev/tap10 Seems to me quite logically because macvtap0 (and not macvlan0) is associated with /dev/tap10 but with another mac address set in KVM. Any furher ideas? As you already noticed: you mixed up the MAC addresses. KVM's must be the same as used for its frontend macvtap. The macvlan is only for the host and has a separate one. I think I did everyting right in the last 1st try but it still didn't work: 1.) macvtap0: MAC: 1a:46:0b:ca:bc:7b 2.) macvlan0: MAC: 1a:46:0b:ca:bc:7c, 192.168.0.23 3.) KVM: MAC: 1a:46:0b:ca:bc:7b, assigned IP from DHCP: 1a:46:0b:ca:bc:7b (looks like an IP address conflict inside guest and outside?) That should be as you explained, right? The qemu command above has the 7c mac address, which does not match. Do you see the interface in the guest using ip link show ? Arnd
[Qemu-devel] Re: TODO item: guest programmable mac/vlan filtering with macvtap
On Friday 15 October 2010, Michael S. Tsirkin wrote: On Thu, Oct 14, 2010 at 11:40:52PM +0200, Dragos Tatulea wrote: Hi, I'm starting a thread related to the TODO item mentioned in the subject. Currently still gathering info and trying to make kvm macvtap play nicely together. I have used this [1] guide to set it up but qemu is still complaining about the PCI device address of the virtio-net-pci. Tried with latest qemu. Am I missing something here? [1] - http://virt.kernelnewbies.org/MacVTap It really should be: -net nic,model=virtio,netdev=foo -netdev tap,id=foo Created account but still could not edit the wiki. Arnd, know why that is? Could you correct qemu command line pls? I also have lost write access to the wiki, no idea what happened there. I started the page, but it subsequently became protected. We never added support for the qemu command line directly, the plan was to do that using helper scripts. The only way to do it is to redirect both input and output to the tap device, so you ned to do -net nic,model=virtio,netdev=foo -netdev tap,id=foo,fd=3 3 when starting from bash. Arnd
[Qemu-devel] Re: [PATCH v2] pc: e820 qemu_cfg tables need to be packed
On Friday 15 October 2010, Alex Williamson wrote: We can't let the compiler define the alignment for qemu_cfg data. Signed-off-by: Alex Williamson alex.william...@redhat.com --- v2: Adjust alignment to help non-x86 hosts per Arnd's suggestion Ok, looks good now. Thanks! Arnd
Re: [Qemu-devel] Re: [PATCH] pc: e820 qemu_cfg tables need to be packed
On Thursday 14 October 2010 21:58:08 Alex Williamson wrote: If it works anywhere (I assume it works on 32bit), then it's only because it happened to get the alignment right. This just makes 64bit hosts get it right too. I don't see any compatibility issues, non-packed + 64bit = broken. Thanks, I would actually assume that only x86-32 hosts got it right, because all 32 bit hosts I've seen other than x86 also define 8 byte alignment for uint64_t. You might however consider making it __attribute((__packed__, __aligned__(4))) instead of just packed, because otherwise you make the alignment one byte, which is not only different from what it used to be on x86-32 but also will cause inefficient compiler outpout on platforms that don't have unaligned word accesses in hardware. Arnd
Re: [Qemu-devel] Re: [PATCH] pc: e820 qemu_cfg tables need to be packed
On Thursday 14 October 2010 22:59:04 Alex Williamson wrote: The structs in question only contain 4 8 byte elements, so there shouldn't be any change on x86-32 using one-byte aligned packing. I'm talking about the alignment of the structure, not the members within the structure. The data structure should be compatible, but not accesses to it. AFAIK, e820 is x86-only, so we don't need to worry about breaking anyone else. You can use qemu to emulate an x86 pc on anything... Performance isn't much of a consideration for this type of interface since it's only used pre-boot. In fact, the channel between qemu and the bios is only one byte wide, so wider alignment can cost extra emulated I/O accesses. Right, the data gets passed as bytes, so it hardly matters in the end. Still the e820_add_entry assigns data to the struct members, which it either does using byte accesses and shifts or a multiple 32 bit assignment. Just because using a one byte alignment technically results in correct output doesn't make it the right solution. I don't care about the few cycles of execution time or the few bytes you waste in this particular case, but you are setting a wrong example by using smaller alignment than necessary. Arnd
Re: [Qemu-devel] vhost_net.c broken by --kerneldir
On Wednesday 25 August 2010, Hollis Blanchard wrote: We only recently fixed the kernel to have this warning in types.h, which triggers more often than kernel.h, where it used to be before. In 2.6.35 and before, you consequently would not have noticed the problem. Thanks Arnd, that explains it. It looks like the --kerneldir option needs to be re-thought. Yes. I believe that we should just kill that option, since there is no reason for building with set of headers from a different kernel than the one your glibc normally uses. You obviously need copies of some headers to match qemu code, e.g. the kvm headers, but for those we have run-time compatibility code in qemu. Building with a newer kernel header is pointless because qemu would not use any of the features that were added after the release of the qemu tree you are trying to build. Building with an older kernel header is equally pointless because all it would do is to require #ifdef magic in qemu that ends up preventing you from using all features of new kernels. Arnd
Re: [Qemu-devel] vhost_net.c broken by --kerneldir
On Thursday 26 August 2010, Gleb Natapov wrote: You forgot about developers. Developer may want to use latest kvm kernel headers to compile code that he added to qemu to use new kernel feature. In that case, you already need to install the kernel in order to test it, so you might as well install the headers along with it. Also, you're not normally touching multiple headers at a time, but rather add a feature in one file, so it's not hard to add it to both copies of that file. This may seem like an unnecessary step, but using the kernel headers without installing them first can cause any number of problems that we don't want anyone to have to deal with. Arnd
Re: [Qemu-devel] vhost_net.c broken by --kerneldir
On Wednesday 25 August 2010, Hollis Blanchard wrote: The problem seems to be that jump from /usr/include/bits/sigcontext.h to asm/sigcontext.h inside kerneldir rather than inside /usr/include. It seems like adding -Ikerneldir/arch/foo/include will always be a problem, since it will always be used to satisfy #include asm/bar.h. Only files built with KVM_CFLAGS would be affected, and I guess vhost_net.c accidentally gets into a broken include path where the other KVM_CFLAGS files doesn't. I'm wondering why this isn't causing more problems for more people. My host is Fedora 12, FWIW, but this doesn't seem like it could at all be related to toolchain version, for example. We only recently fixed the kernel to have this warning in types.h, which triggers more often than kernel.h, where it used to be before. In 2.6.35 and before, you consequently would not have noticed the problem. Arnd
[Qemu-devel] Re: [PATCH] Ignore writes of perf reg (cp15 with crm == 12)
On Sunday 25 July 2010, Loïc Minier wrote: On ARMv7, ignore writes to cp15 with crm == 12; these are to setup perf counters which we don't have. Thanks Loïc, I have tested this and can confirm that I'm able to boot a CONFIG_PERF_EVENTS kernel now. Arnd
Re: [Qemu-devel] Re: KVM call minutes for June 15
On Wednesday 16 June 2010, Markus Armbruster wrote: Can't hurt reviewer motivation. Could it be automated? Find replies, extract tags. If you want your acks to be picked up, you better make sure your References header works, and your tags are formatted correctly. I think pwclient (https://patchwork.kernel.org/) can do this for you. Arnd
[Qemu-devel] Re: [PATCH] Inter-VM shared memory PCI device
On Thursday 11 March 2010, Avi Kivity wrote: A totally different option that avoids this whole problem would be to separate the signalling from the shared memory, making the PCI shared memory device a trivial device with a single memory BAR, and using something a higher-level concept like a virtio based serial line for the actual signalling. That would be much slower. The current scheme allows for an ioeventfd/irqfd short circuit which allows one guest to interrupt another without involving their qemus at all. Yes, the serial line approach would be much slower, but my point was that we can do signaling over something else, which could well be something building on irqfd. Arnd
[Qemu-devel] Re: [PATCH] Inter-VM shared memory PCI device
On Thursday 11 March 2010, Avi Kivity wrote: That would be much slower. The current scheme allows for an ioeventfd/irqfd short circuit which allows one guest to interrupt another without involving their qemus at all. Yes, the serial line approach would be much slower, but my point was that we can do signaling over something else, which could well be something building on irqfd. Well, we could, but it seems to make things more complicated? A card with shared memory, and another card with an interrupt interconnect? Yes, I agree that it's more complicated if you have a specific application in mind that needs one of each, and most use cases that want shared memory also need an interrupt mechanism, but it's not always the case: - You could use ext2 with -o xip on a private mapping of a shared host file in order to share the page cache. This does not need any interrupts. - If you have more than two parties sharing the segment, there are different ways to communicate, e.g. always send an interrupt to all others, or have dedicated point-to-point connections. There is also some complexity in trying to cover all possible cases in one driver. I have to say that I also really like the idea of futex over shared memory, which could potentially make this all a lot simpler. I don't know how this would best be implemented on the host though. Arnd
[Qemu-devel] Re: [PATCH] Inter-VM shared memory PCI device
On Tuesday 09 March 2010, Cam Macdonell wrote: We could make the masking in RAM, not in registers, like virtio, which would require no exits. It would then be part of the application specific protocol and out of scope of of this spec. This kind of implementation would be possible now since with UIO it's up to the application whether to mask interrupts or not and what interrupts mean. We could leave the interrupt mask register for those who want that behaviour. Arnd's idea would remove the need for the Doorbell and Mask, but we will always need at least one MMIO register to send whatever interrupts we do send. You'd also have to be very careful if the notification is in RAM to avoid races between one guest triggering an interrupt and another guest clearing its interrupt mask. A totally different option that avoids this whole problem would be to separate the signalling from the shared memory, making the PCI shared memory device a trivial device with a single memory BAR, and using something a higher-level concept like a virtio based serial line for the actual signalling. Arnd
[Qemu-devel] Re: [PATCH] Inter-VM shared memory PCI device
On Monday 08 March 2010, Cam Macdonell wrote: enum ivshmem_registers { IntrMask = 0, IntrStatus = 2, Doorbell = 4, IVPosition = 6, IVLiveList = 8 }; The first two registers are the interrupt mask and status registers. Interrupts are triggered when a message is received on the guest's eventfd from another VM. Writing to the 'Doorbell' register is how synchronization messages are sent to other VMs. The IVPosition register is read-only and reports the guest's ID number. The IVLiveList register is also read-only and reports a bit vector of currently live VM IDs. The Doorbell register is 16-bits, but is treated as two 8-bit values. The upper 8-bits are used for the destination VM ID. The lower 8-bits are the value which will be written to the destination VM and what the guest status register will be set to when the interrupt is trigger is the destination guest. A value of 255 in the upper 8-bits will trigger a broadcast where the message will be sent to all other guests. This means you have at least two intercepts for each message: 1. Sender writes to doorbell 2. Receiver gets interrupted With optionally two more intercepts in order to avoid interrupting the receiver every time: 3. Receiver masks interrupt in order to process data 4. Receiver unmasks interrupt when it's done and status is no longer pending I believe you can do much better than this, you combine status and mask bits, making this level triggered, and move to a bitmask of all guests: In order to send an interrupt to another guest, the sender first checks the bit for the receiver. If it's '1', no need for any intercept, the receiver will come back anyway. If it's zero, write a '1' bit, which gets OR'd into the bitmask by the host. The receiver gets interrupted at a raising edge and just leaves the bit on, until it's done processing, then turns the bit off by writing a '1' into its own location in the mask. Arnd
[Qemu-devel] Re: [PATCH 2/2] net/macvtap: add vhost suppor
On Sunday 14 February 2010, Michael S. Tsirkin wrote: @@ -473,7 +477,7 @@ static struct socket *get_socket(int fd) sock = get_raw_socket(fd); if (!IS_ERR(sock)) return sock; - sock = get_tun_socket(fd); + sock = get_tap_socket(fd); if (!IS_ERR(sock)) return sock; return ERR_PTR(-ENOTSOCK); This will also need a dependency on macvtap in Kconfig. See how it's done for tun. Ok, I'll add that. Thanks, Arnd
[Qemu-devel] [PATCH 1/2] macvtap: rework object lifetime rules
The original macvtap code has a number of problems resulting from the use of RCU for protecting the access to struct macvtap_queue from open files. This includes - need for GFP_ATOMIC allocations for skbs - potential deadlocks when copy_*_user sleeps - inability to work with vhost-net Changing the lifetime of macvtap_queue to always depend on the open file solves all these. The RCU reference simply moves one step down to the reference on the macvlan_dev, which we only need for nonblocking operations. Signed-off-by: Arnd Bergmann a...@arndb.de --- drivers/net/macvtap.c | 170 +--- 1 files changed, 89 insertions(+), 81 deletions(-) This patch replaces the previous one that cleans up the locking in a different way. diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c index ad1f6ef..7050997 100644 --- a/drivers/net/macvtap.c +++ b/drivers/net/macvtap.c @@ -60,29 +60,19 @@ static struct cdev macvtap_cdev; /* * RCU usage: - * The macvtap_queue is referenced both from the chardev struct file - * and from the struct macvlan_dev using rcu_read_lock. + * The macvtap_queue and the macvlan_dev are loosely coupled, the + * pointers from one to the other can only be read while rcu_read_lock + * or macvtap_lock is held. * - * We never actually update the contents of a macvtap_queue atomically - * with RCU but it is used for race-free destruction of a queue when - * either the file or the macvlan_dev goes away. Pointers back to - * the dev and the file are implicitly valid as long as the queue - * exists. + * Both the file and the macvlan_dev hold a reference on the macvtap_queue + * through sock_hold(q-sk). When the macvlan_dev goes away first, + * q-vlan becomes inaccessible. When the files gets closed, + * macvtap_get_queue() fails. * - * The callbacks from macvlan are always done with rcu_read_lock held - * already, while in the file_operations, we get it ourselves. - * - * When destroying a queue, we remove the pointers from the file and - * from the dev and then synchronize_rcu to make sure no thread is - * still using the queue. There may still be references to the struct - * sock inside of the queue from outbound SKBs, but these never - * reference back to the file or the dev. The data structure is freed - * through __sk_free when both our references and any pending SKBs - * are gone. - * - * macvtap_lock is only used to prevent multiple concurrent open() - * calls to assign a new vlan-tap pointer. It could be moved into - * the macvlan_dev itself but is extremely rarely used. + * There may still be references to the struct sock inside of the + * queue from outbound SKBs, but these never reference back to the + * file or the dev. The data structure is freed through __sk_free + * when both our references and any pending SKBs are gone. */ static DEFINE_SPINLOCK(macvtap_lock); @@ -100,11 +90,12 @@ static int macvtap_set_queue(struct net_device *dev, struct file *file, goto out; err = 0; - q-vlan = vlan; + rcu_assign_pointer(q-vlan, vlan); rcu_assign_pointer(vlan-tap, q); + sock_hold(q-sk); q-file = file; - rcu_assign_pointer(file-private_data, q); + file-private_data = q; out: spin_unlock(macvtap_lock); @@ -112,28 +103,25 @@ out: } /* - * We must destroy each queue exactly once, when either - * the netdev or the file go away. + * The file owning the queue got closed, give up both + * the reference that the files holds as well as the + * one from the macvlan_dev if that still exists. * * Using the spinlock makes sure that we don't get * to the queue again after destroying it. - * - * synchronize_rcu serializes with the packet flow - * that uses rcu_read_lock. */ -static void macvtap_del_queue(struct macvtap_queue **qp) +static void macvtap_put_queue(struct macvtap_queue *q) { - struct macvtap_queue *q; + struct macvlan_dev *vlan; spin_lock(macvtap_lock); - q = rcu_dereference(*qp); - if (!q) { - spin_unlock(macvtap_lock); - return; + vlan = rcu_dereference(q-vlan); + if (vlan) { + rcu_assign_pointer(vlan-tap, NULL); + rcu_assign_pointer(q-vlan, NULL); + sock_put(q-sk); } - rcu_assign_pointer(q-vlan-tap, NULL); - rcu_assign_pointer(q-file-private_data, NULL); spin_unlock(macvtap_lock); synchronize_rcu(); @@ -151,21 +139,29 @@ static struct macvtap_queue *macvtap_get_queue(struct net_device *dev, return rcu_dereference(vlan-tap); } +/* + * The net_device is going away, give up the reference + * that it holds on the queue (all the queues one day) + * and safely set the pointer from the queues to NULL. + */ static void macvtap_del_queues(struct net_device *dev) { struct macvlan_dev *vlan = netdev_priv(dev); - macvtap_del_queue(vlan-tap); -} + struct macvtap_queue *q
[Qemu-devel] [PATCH 2/2] net/macvtap: add vhost support
This adds support for passing a macvtap file descriptor into vhost-net, much like we already do for tun/tap. Most of the new code is taken from the respective patch in the tun driver and may get consolidated in the future. Signed-off-by: Arnd Bergmann a...@arndb.de --- drivers/net/macvtap.c | 98 ++- drivers/vhost/net.c|8 +++- include/linux/if_macvlan.h | 13 ++ 3 files changed, 96 insertions(+), 23 deletions(-) diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c index 7050997..e354501 100644 --- a/drivers/net/macvtap.c +++ b/drivers/net/macvtap.c @@ -58,6 +58,8 @@ static unsigned int macvtap_major; static struct class *macvtap_class; static struct cdev macvtap_cdev; +static const struct proto_ops macvtap_socket_ops; + /* * RCU usage: * The macvtap_queue and the macvlan_dev are loosely coupled, the @@ -176,7 +178,7 @@ static int macvtap_forward(struct net_device *dev, struct sk_buff *skb) return -ENOLINK; skb_queue_tail(q-sk.sk_receive_queue, skb); - wake_up(q-sk.sk_sleep); + wake_up_interruptible_poll(q-sk.sk_sleep, POLLIN | POLLRDNORM | POLLRDBAND); return 0; } @@ -242,7 +244,7 @@ static void macvtap_sock_write_space(struct sock *sk) return; if (sk-sk_sleep waitqueue_active(sk-sk_sleep)) - wake_up_interruptible_sync(sk-sk_sleep); + wake_up_interruptible_poll(sk-sk_sleep, POLLOUT | POLLWRNORM | POLLWRBAND); } static int macvtap_open(struct inode *inode, struct file *file) @@ -270,6 +272,8 @@ static int macvtap_open(struct inode *inode, struct file *file) init_waitqueue_head(q-sock.wait); q-sock.type = SOCK_RAW; q-sock.state = SS_CONNECTED; + q-sock.file = file; + q-sock.ops = macvtap_socket_ops; sock_init_data(q-sock, q-sk); q-sk.sk_write_space = macvtap_sock_write_space; @@ -387,32 +391,20 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q, rcu_read_lock_bh(); vlan = rcu_dereference(q-vlan); - macvlan_count_rx(vlan, len, ret == 0, 0); + if (vlan) + macvlan_count_rx(vlan, len, ret == 0, 0); rcu_read_unlock_bh(); return ret ? ret : len; } -static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv, - unsigned long count, loff_t pos) +static ssize_t macvtap_do_read(struct macvtap_queue *q, struct kiocb *iocb, + const struct iovec *iv, unsigned long len, + int noblock) { - struct file *file = iocb-ki_filp; - struct macvtap_queue *q = file-private_data; - DECLARE_WAITQUEUE(wait, current); struct sk_buff *skb; - ssize_t len, ret = 0; - - if (!q) { - ret = -ENOLINK; - goto out; - } - - len = iov_length(iv, count); - if (len 0) { - ret = -EINVAL; - goto out; - } + ssize_t ret = 0; add_wait_queue(q-sk.sk_sleep, wait); while (len) { @@ -421,7 +413,7 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv, /* Read frames from the queue */ skb = skb_dequeue(q-sk.sk_receive_queue); if (!skb) { - if (file-f_flags O_NONBLOCK) { + if (noblock) { ret = -EAGAIN; break; } @@ -440,7 +432,24 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv, current-state = TASK_RUNNING; remove_wait_queue(q-sk.sk_sleep, wait); + return ret; +} + +static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv, + unsigned long count, loff_t pos) +{ + struct file *file = iocb-ki_filp; + struct macvtap_queue *q = file-private_data; + ssize_t len, ret = 0; + len = iov_length(iv, count); + if (len 0) { + ret = -EINVAL; + goto out; + } + + ret = macvtap_do_read(q, iocb, iv, len, file-f_flags O_NONBLOCK); + ret = min_t(ssize_t, ret, len); /* XXX copied from tun.c. Why? */ out: return ret; } @@ -538,6 +547,53 @@ static const struct file_operations macvtap_fops = { #endif }; +static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock, + struct msghdr *m, size_t total_len) +{ + struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock); + return macvtap_get_user(q, m-msg_iov, total_len, + m-msg_flags MSG_DONTWAIT); +} + +static int macvtap_recvmsg(struct kiocb *iocb, struct socket *sock, + struct msghdr *m, size_t total_len, + int flags) +{ + struct
Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..
On Monday 25 January 2010, Dor Laor wrote: x86 qemu64 x86 phenom x86 core2duo x86kvm64 x86 qemu32 x86 coreduo x86 486 x86 pentium x86 pentium2 x86 pentium3 x86 athlon x86 n270 I think a really nice addition would be an autodetect option for those users (e.g. desktop) that know they do not want to migrate the guest to a lower-spec machine. That option IMHO should just show up as identical to the host cpu, with the exception of features that are not supported in the guest. Arnd
Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..
On Thursday 28 January 2010, Alexander Graf wrote: That option IMHO should just show up as identical to the host cpu, with the exception of features that are not supported in the guest. That's exactly what -cpu host is. IIRC it's the default now. Ah, cool. Sorry for my ignorance here. From what I can tell, neither the man page nor 'qemu -cpu \?' show that information though. Arnd
Re: [Qemu-devel] Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Wednesday 27 January 2010, Anthony Liguori wrote: The raw backend can be attached to a physical device This is equivalent to bridging with tun/tap except that it has the unexpected behaviour of unreliable host/guest networking (which is not universally consistent across platforms either). This is not a mode we want to encourage users to use. It's not the most common scenario, but I've seen systems (I remember one on s/390 with z/VM) where you really want to isolate the guest network as much as possible from the host network. Besides PCI passthrough, giving the host device to a guest using a raw socket is the next best approximation of that. Then again, macvtap will do that too, if the device driver supports multiple unicast MAC addresses without forcing promiscous mode. , macvlan macvtap is a superior way to achieve this use case because a macvtap fd can safely be given to a lesser privilege process without allowing escalation of privileges. Yes. or SR-IOV VF. This depends on vhost-net. Why? I don't see anything in this scenario that is vhost-net specific. I also plan to cover this aspect in macvtap in the future, but the current code does not do it yet. It also requires device driver changes. In general, what I would like to see for this is something more user friendly that dealt specifically with this use-case. Although honestly, given the recent security concerns around raw sockets, I'm very concerned about supporting raw sockets in qemu at all. Essentially, you get worse security doing vhost-net + raw + VF then with PCI passthrough + VF because at least in the later case you can run qemu without privileges. CAP_NET_RAW is a very big privilege. It can be contained to a large degree with network namespaces. When you run qemu in its own namespace and add the VF to that, CAP_NET_RAW should ideally have no effect on other parts of the system (except bugs in the namespace implementation). Arnd
[Qemu-devel] Re: [PATCH] Add definitions for current cpu models..
On Monday 18 January 2010, john cooper wrote: +.name = Conroe, +.level = 2, +.vendor1 = CPUID_VENDOR_INTEL_1, +.vendor2 = CPUID_VENDOR_INTEL_2, +.vendor3 = CPUID_VENDOR_INTEL_3, +.family = 6, /* P6 */ +.model = 2, that looks wrong -- what is model 2 actually? +.stepping = 3, +.features = PPRO_FEATURES | +CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |/* note 1 */ +CPUID_PSE36,/* note 2 */ +.ext_features = CPUID_EXT_SSE3 | CPUID_EXT_SSSE3, +.ext2_features = (PPRO_FEATURES CPUID_EXT2_MASK) | +CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX, +.ext3_features = CPUID_EXT3_LAHF_LM, +.xlevel = 0x800A, +.model_id = Intel Celeron_4x0 (Conroe/Merom Class Core 2), +}, Celeron_4x0 is a rather bad example, because it is based on the single-core Conroe-L, which is family 6 / model 22 unlike all the dual- and quad-core Merom/Conroe that are model 15. +{ +.name = Penryn, +.level = 2, +.vendor1 = CPUID_VENDOR_INTEL_1, +.vendor2 = CPUID_VENDOR_INTEL_2, +.vendor3 = CPUID_VENDOR_INTEL_3, +.family = 6, /* P6 */ +.model = 2, +.stepping = 3, +.features = PPRO_FEATURES | +CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |/* note 1 */ +CPUID_PSE36,/* note 2 */ +.ext_features = CPUID_EXT_SSE3 | +CPUID_EXT_CX16 | CPUID_EXT_SSSE3 | CPUID_EXT_SSE41, +.ext2_features = (PPRO_FEATURES CPUID_EXT2_MASK) | +CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX, +.ext3_features = CPUID_EXT3_LAHF_LM, +.xlevel = 0x800A, +.model_id = Intel Core 2 Duo P9xxx (Penryn Class Core 2), +}, This would be model 23 for Penryn-class Xeon/Core/Pentium/Celeron processors without L3 cache. +{ +.name = Nehalem, +.level = 2, +.vendor1 = CPUID_VENDOR_INTEL_1, +.vendor2 = CPUID_VENDOR_INTEL_2, +.vendor3 = CPUID_VENDOR_INTEL_3, +.family = 6, /* P6 */ +.model = 2, +.stepping = 3, +.features = PPRO_FEATURES | +CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |/* note 1 */ +CPUID_PSE36,/* note 2 */ +.ext_features = CPUID_EXT_SSE3 | +CPUID_EXT_CX16 | CPUID_EXT_SSSE3 | CPUID_EXT_SSE41 | +CPUID_EXT_SSE42 | CPUID_EXT_POPCNT, +.ext2_features = (PPRO_FEATURES CPUID_EXT2_MASK) | +CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX, +.ext3_features = CPUID_EXT3_LAHF_LM, +.xlevel = 0x800A, +.model_id = Intel Core i7 9xx (Nehalem Class Core i7), +}, Apparently, not all the i7-9xx CPUs are Nehalem, the i7-980X is supposed to be Westmere, which has more features. Because of the complexity, I'd recommend passing down the *model* number of the emulated CPU, the interesting Intel ones (those supported by KVM) being: 15-6: CedarMill/Presler/Dempsey/Tulsa (Pentium 4/Pentium D/Xeon 50xx/Xeon 71xx) 6-14: Yonah/Sossaman (Celeron M4xx, Core Solo/Duo, Pentium Dual-Core T1000, Xeon ULV) 6-15: Merom/Conroe/Kentsfield/Woodcrest/Clovertown/Tigerton (Celeron M5xx/E1xxx/T1xxx, Pentium T2xxx/T3xxx/E2xxx,Core 2 Solo U2xxx, Core 2 Duo E4xxx/E6xxx/Q6xxx/T5xxx/T7xxx/L7xxx/U7xxx/SP7xxx, Xeon 30xx/32xx/51xx/52xx/72xx/73xx) 6-22: Penryn/Wolfdale/Yorkfield/Harpertown (Celeron 7xx/9xx/SU2xxx/T3xxx/E3xxx, Pentium T4xxx/SU2xxx/SU4xxx/E5xxx/E6xxx, Core 2 Solo SU3xxx, Core 2 Duo P/SU/T6xxx/x8xxx/x9xxx, Xeon 31xx/33xx/52xx/54xx) 6-26: Gainestown/Bloomfield (Xeon 35xx/55xx, Core i7-9xx) 6-28: Atom 6-29: Dunnington (Xeon 74xx) 6-30: Lynnfield/Clarksfield/JasperForest (Xeon 34xx, Core i7-8xx, Core i7-xxxQM, Core i5-7xx) 6-37: Arrandale/Clarkdale (Dual-Core Core i3/i5/i7) 6-44: Gulftown (six-core) Arnd
[Qemu-devel] Re: Guest bridge setup variations
On Wednesday 16 December 2009, Leonid Grossman wrote: 3. Doing the bridging in the NIC using macvlan in passthrough mode. This lowers the CPU utilization further compared to 2, at the expense of limiting throughput by the performance of the PCIe interconnect to the adapter. Whether or not this is a win is workload dependent. This is certainly true today for pci-e 1.1 and 2.0 devices, but as NICs move to pci-e 3.0 (while remaining almost exclusively dual port 10GbE for a long while), EVB internal bandwidth will significantly exceed external bandwidth. So, #3 can become a win for most inter-guest workloads. Right, it's also hardware dependent, but it usually comes down to whether it's cheaper to spend CPU cycles or to spend IO bandwidth. I would be surprised if all future machines with PCIe 3.0 suddenly have a huge surplus of bandwidth but no CPU to keep up with that. Access controls now happen in the NIC. Currently, this is not supported yet, due to lack of device drivers, but it will be an important scenario in the future according to some people. Actually, x3100 10GbE drivers support this today via sysfs interface to the host driver that can choose to control VEB tables (and therefore MAC addresses, vlan memberships, etc. for all passthru interfaces behind the VEB). Ok, I didn't know about that. OF course a more generic vendor-independent interface will be important in the future. Right. I hope we can come up with something soon. I'll have a look at what your driver does and see if that can be abstracted in some way. I expect that if we can find an interface between the kernel and device driver for two or three NIC implementations that it will be good enough to adapt to everyone else as well. Arnd
Re: [Qemu-devel] Guest bridge setup variations
On Friday 11 December 2009, Anthony Liguori wrote: Arnd Bergmann wrote: 3) given an fd, treat a vhost-style interface This could mean two things, not sure which one you mean. Either the file descriptor could be the vhost file descriptor, or the socket or tap file descriptor from above, with qemu operating on the vhost interface itself. Either option has its advantages, but I guess we should only implement one of the two to keep it simple. I was thinking the socket/tap descriptor. ok. Right. I wonder if this helper should integrate with netcf as an abstraction, or if we should rather do something generic. It may also be a good idea to let the helper decide which of the three options you listed to use and pass that back to qemu unless the user overrides it. The decision probably needs to be host specific, depending e.g. on the availability and version of tools (brctl, iproute, maybe tunctl, ...), the respective kernel modules (vhost, macvlan, bridge, tun, ...) and policy (VEPA, vlan, ebtables). Ideally the approach should be generic enough to work on other platforms (BSD, Solaris, Windows, ...). For helpers, I think I'd like to stick with what we currently support, and then allow for a robust way for there to be independent projects that implement their own helpers. For instance, I would love it if netcf had a qemu network plugin helper. Moving netcf specific qemu helpers into the netcf project sounds good. I'm not sure what you mean with 'stick to what we currently support', do you mean with helpers that ship with qemu itself? That sounds reasonable, though I'd obviously like to make sure they also work with macvtap, which is currently not supported unless you pass an open file descriptor into -net tap,fd. There's just too much in the networking space all wrapped up in layers of policy decisions. I think it's time to move it out of qemu. Yes. Arnd
Re: [Qemu-devel] Re: Guest bridge setup variations
On Thursday 10 December 2009 19:14:28 Alexander Graf wrote: This is something I also have been thinking about, but it is not what I was referring to above. I think it would be good to keep the three cases (macvlan, VMDq, SR-IOV) as similar as possible from the user perspective, so using macvlan as an infrastructure for all of them sounds reasonable to me. Oh, so you'd basically do -net vt-d,if=eth0 and the rest would automatically work? That's a pretty slick idea! I was only referring to how they get set up under the covers, e.g. creating the virtual device, configuring the MAC address etc, not the qemu side, but that would probably make sense as well. Or even better, qemu should probably not even know the difference between macvlan and VT-d. In both cases, it would open a macvtap file, but for VT-d adapters, the macvlan infrastructure can use hardware support, much in the way that VLAN tagging gets offloaded automatically to the hardware. Arnd
[Qemu-devel] Re: [PATCH, try 2] qemu/tap: add -net tap,dev= option
On Wednesday 09 December 2009, Anthony Liguori wrote: This isn't really a generic thing and I dislike pretending it is. This is specifically for macvtap. Well, depending how how things work out with VMD-q and SR-IOV, I might move the tap interface stuff into a library module and add more user-level shims for it that deal with the creation of character devices. Macvtap essentially implements a tun/tap compatible character device, while leaving out some of the stuff that doesn't make sense there (e.g. creation of virtual network interfaces). The idea of my patch was to deal with anything that is a tap protocol implementation. One thing that is special for macvtap is that the guest MAC address has to match the MAC address of the macvtap downstream device, because macvlan filters out all traffic destined for other unicast addresses. If we were going to do this, I'd rather introduce a -net macvtap that actually allocated the interfaces (similar to how tap works). The problem with this interface is that it's a two stage process. You have to create an interface, then hand the name to qemu. It would be just as easy to hand qemu an fd with a helper. You can't really allocate the device from qemu in a nice way. I had mostly implemented macvtap support for your -net bridge helper when I realized this. Since it uses netlink (with CAP_NET_ADMIN) to create or destroy the netdev, it first needs to wait for udev to create the device node (which is not so bad by itself), and then use netlink again after shutting down to destroy the network device. What's nice about -net tap is that with a little bit of setup (/etc/qemu-ifup), it Just Works. -net tap,dev= does not share this property. The qemu-ifup stuff IMHO is just a workaround for the problem of having to defer network setup until after you open the device. Macvtap doesn't have this problem. In the same way we deal with other host resources (raw disk access, USB passthrough, /dev/kvm), you first set up specifically what the user is allowed to do, either manually or through libvirt and then just use it. I think this is a good place where an exec helper would be a natural fit. IMHO, the exec helper is a good way to abstract away the difference between tap, macvtap and possibly others based on the host configuration, but it doesn't really make sense for a separate -net macvtap backend like you suggested. You've shown that the helper can be used to enforce user permissions for tun/tap, which lacks this, but for macvtap we can just use regular user permissions. Arnd
Re: [Qemu-devel] Guest bridge setup variations
On Wednesday 09 December 2009, Anthony Liguori wrote: While we go over all of these things one thing is becoming clear to me. We need to get qemu out of the network configuration business. There's too much going on here. Agreed. What I'd like to see is the following interfaces supported: 1) given an fd, make socket calls to send packets. Could be used with a raw socket, a multicast or tcp socket. 2) given an fd, use tap-style read/write calls to send packets* yes. 3) given an fd, treat a vhost-style interface This could mean two things, not sure which one you mean. Either the file descriptor could be the vhost file descriptor, or the socket or tap file descriptor from above, with qemu operating on the vhost interface itself. Either option has its advantages, but I guess we should only implement one of the two to keep it simple. I believe we should continue supporting the mechanisms we support today. However, for people that invoke qemu directly from the command line, I believe we should provide a mechanism like the tap helper that can be used to call out to a separate program to create these initial file descriptors. We'll have to think about how we can make this integrate well so that the syntax isn't clumsy. Right. I wonder if this helper should integrate with netcf as an abstraction, or if we should rather do something generic. It may also be a good idea to let the helper decide which of the three options you listed to use and pass that back to qemu unless the user overrides it. The decision probably needs to be host specific, depending e.g. on the availability and version of tools (brctl, iproute, maybe tunctl, ...), the respective kernel modules (vhost, macvlan, bridge, tun, ...) and policy (VEPA, vlan, ebtables). Ideally the approach should be generic enough to work on other platforms (BSD, Solaris, Windows, ...). One thing I realized the last time we discussed the helper approach is that qemu should not need to know or care about the arguments passed to the helper, otherwise you get all the complexity back in qemu that you're trying to avoid. Maybe for 0.13 we can convert -net socket and -net tap to just pass all their options to the helper and move that code out of qemu, along with introducting the new syntax. Another unrelated issue that I think needs to be addressed in a network code cleanup is adding better support for multi-queue transmit and receive. I've prepared macvtap for that by letting you open the chardev multiple times to get one queue per guest CPU, but that needs to be supported by qemu and virtio-net as well to actually parallelize network operation. Ideally, two guest CPUs should be able to transmit and receive on separate queues of the adapter without ever having to access any shared resources. Arnd
[Qemu-devel] Re: Guest bridge setup variations
On Thursday 10 December 2009, Fischer, Anna wrote: 3. Doing the bridging in the NIC using macvlan in passthrough mode. This lowers the CPU utilization further compared to 2, at the expense of limiting throughput by the performance of the PCIe interconnect to the adapter. Whether or not this is a win is workload dependent. Access controls now happen in the NIC. Currently, this is not supported yet, due to lack of device drivers, but it will be an important scenario in the future according to some people. Can you differentiate this option from typical PCI pass-through mode? It is not clear to me where macvlan sits in a setup where the NIC does bridging. In this setup (hypothetical so far, the code doesn't exist yet), we use the configuration logic of macvlan, but not the forwarding. This also doesn't do PCI pass-through but instead gives all the logical interfaces to the host, using only the bridging and traffic separation capabilities of the NIC, but not the PCI-separation. Intel calls this mode VMDq, as opposed to SR-IOV, which implies the assignment of the adapter to a guest. It was confusing of me to call it passthrough above, sorry for that. Typically, in a PCI pass-through configuration, all configuration goes through the physical function device driver (and all data goes directly to the NIC). Are you suggesting to use macvlan as a common configuration layer that then configures the underlying NIC? I could see some benefit in such a model, though I am not certain I understand you correctly. This is something I also have been thinking about, but it is not what I was referring to above. I think it would be good to keep the three cases (macvlan, VMDq, SR-IOV) as similar as possible from the user perspective, so using macvlan as an infrastructure for all of them sounds reasonable to me. The difference between VMDq and SR-IOV in that case would be that the former uses a virtio-net driver in the guest and a hardware driver in the host, while the latter uses a hardware driver in the guest only. The data flow on these two would be identical though, while in the classic macvlan the data forwarding decisions are made in the host kernel. Arnd
[Qemu-devel] Re: [PATCH, try 2] qemu/tap: add -net tap,dev= option
On Wednesday 09 December 2009, Michael S. Tsirkin wrote: On Tue, Dec 08, 2009 at 06:41:44PM +0100, Arnd Bergmann wrote: --- a/net/tap-bsd.c +++ b/net/tap-bsd.c @@ -40,7 +40,8 @@ #include util.h #endif -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required) +int tap_open(char *ifname, int ifname_size, char *dev, int dev_size, + int *vnet_hdr, int vnet_hdr_required) { int fd; char *dev; Does this compile? I don't have a BSD or Solaris machine here, or even just a cross-compiler, so I could not test. However, I only add two arguments and I did the identical change in the header file and the linux version of this file, so I am reasonably confident. -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required) +int tap_open(char *ifname, int ifname_size, char *dev, int dev_size, dev is never modified, so it should be const char *. ok. + int *vnet_hdr, int vnet_hdr_required) { struct ifreq ifr; int fd, ret; +const char *path; -TFR(fd = open(/dev/net/tun, O_RDWR)); +if (dev[0] == '\0') == 0 checks are ugly. if (*dev) is shorter, and it's a standard C idiom to detect non-empty string. Or better pass NULL if no device, then you can just path = dev ? dev : /dev/net/tun. Agreed in principle, but I was following the style that is already used in the same function, and I think consistency is more important in this case. + path = /dev/net/tun; +else + path = dev; Please do not indent by single space. For some reason, this file uses four character indentation, which I followed for consistency. In the patch this gets mangled when some lines use only spaces for indentation and others use only tabs. I could change to using only spaces for indentation if that's preferred. diff --git a/net/tap.c b/net/tap.c index 0d8b424..14ddf65 100644 --- a/net/tap.c +++ b/net/tap.c @@ -343,12 +343,17 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr) { int fd, vnet_hdr_required; char ifname[128] = {0,}; +char dev[128] = {0,}; const char *setup_script; if (qemu_opt_get(opts, ifname)) { pstrcpy(ifname, sizeof(ifname), qemu_opt_get(opts, ifname)); } +if (qemu_opt_get(opts, dev)) { +pstrcpy(dev, sizeof(dev), qemu_opt_get(opts, dev)); +} + While in this case this is unlikely to be a problem in practice, we still should not add arbitrary limitations on file name lengths. Just teach tap_open to get NULL instead of and empty string, or better supply default /dev/net/tun to the option, and you will not need the strcpy. Right, I will do that, or alternatively make dev an input/output argument, see below. snprintf(s-nc.info_str, sizeof(s-nc.info_str), - ifname=%s,script=%s,downscript=%s, - ifname, script, downscript); + ifname=%s,dev=%s,script=%s,downscript=%s, This will look strange if dev is not supplied, will it not? Also, I wonder whether there might be any scripts parsing info string from monitor, that will get broken by the extra parameter. How about we only print dev if it is not the default? Right. I originally planned to return /dev/net/tun from tap_open in that case, but I forgot to put that in. It would also not work well for Solaris and BSD unless I add untested code there. I guess it would be consistent to do that, but unless someone insists on it, I'll just follow your advice here and remove it from being printed. +-net tap[,vlan=n][,name=str][,fd=h][,ifname=name][,dev=str][,script=file]\n +[,downscript=dfile][,sndbuf=nbytes][,vnet_hdr=on|off]\n connect the host TAP network interface to VLAN 'n' and use the\n network scripts 'file' (default=%s)\n and 'dfile' (default=%s);\n use '[down]script=no' to disable script execution;\n use 'fd=h' to connect to an already opened TAP interface\n +use 'dev=str' to open a specific tap character device\n please document default /dev/net/tun ok. Thanks for the review! Arnd
[Qemu-devel] [PATCH, try 2, version 2] qemu/tap: add -net tap, dev= option
In order to support macvtap, we need a way to select the actual tap endpoint. While it would be nice to pass it by network interface name, passing the character device is more flexible, and we can easily do both in the long run. This version makes it possible to use macvtap without introducing any macvtap specific code in qemu, only a natural extension to the existing interface. Signed-off-by: Arnd Bergmann a...@arndb.de Acked-by: Mark McLoughlin mar...@redhat.com Cc: Michael S. Tsirkin m...@redhat.com --- Hopefully addressed all comments from Michael. I did compile-test the non-linux files I touched (the solaris file failed to build for another reason), and this now prevents passing dev= on non-linux machines. net.c |5 + net/tap-aix.c |3 ++- net/tap-bsd.c |9 - net/tap-linux.c | 12 net/tap-solaris.c |7 ++- net/tap.c | 17 ++--- net/tap.h |3 ++- qemu-options.hx | 10 +- 8 files changed, 54 insertions(+), 12 deletions(-) diff --git a/net.c b/net.c index 13bdbb2..deb12fd 100644 --- a/net.c +++ b/net.c @@ -955,6 +955,11 @@ static struct { }, #ifndef _WIN32 { +.name = dev, +.type = QEMU_OPT_STRING, +.help = character device pathname, +}, +{ .name = fd, .type = QEMU_OPT_STRING, .help = file descriptor of an already opened tap, diff --git a/net/tap-aix.c b/net/tap-aix.c index 4bc9f2f..238c190 100644 --- a/net/tap-aix.c +++ b/net/tap-aix.c @@ -25,7 +25,8 @@ #include net/tap.h #include stdio.h -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required) +int tap_open(char *ifname, int ifname_size, const char *dev, + int *vnet_hdr, int vnet_hdr_required) { fprintf(stderr, no tap on AIX\n); return -1; diff --git a/net/tap-bsd.c b/net/tap-bsd.c index 815997d..8a7334c 100644 --- a/net/tap-bsd.c +++ b/net/tap-bsd.c @@ -40,7 +40,8 @@ #include util.h #endif -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required) +int tap_open(char *ifname, int ifname_size, const char *devarg, + int *vnet_hdr, int vnet_hdr_required) { int fd; char *dev; @@ -80,6 +81,12 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required } #endif +if (devarg) { +qemu_error(dev= not supported\n); +close(fd); +return -1; +} + fstat(fd, s); dev = devname(s.st_rdev, S_IFCHR); pstrcpy(ifname, ifname_size, dev); diff --git a/net/tap-linux.c b/net/tap-linux.c index 6af9e82..d6831c0 100644 --- a/net/tap-linux.c +++ b/net/tap-linux.c @@ -32,14 +32,18 @@ #include sysemu.h #include qemu-common.h -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required) +int tap_open(char *ifname, int ifname_size, const char *dev, +int *vnet_hdr, int vnet_hdr_required) { struct ifreq ifr; int fd, ret; -TFR(fd = open(/dev/net/tun, O_RDWR)); +if (!*dev) +dev = /dev/net/tun; + +TFR(fd = open(dev, O_RDWR)); if (fd 0) { -fprintf(stderr, warning: could not open /dev/net/tun: no virtual network emulation\n); +fprintf(stderr, warning: could not open %s: no virtual network emulation\n, dev); return -1; } memset(ifr, 0, sizeof(ifr)); @@ -70,7 +74,7 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required pstrcpy(ifr.ifr_name, IFNAMSIZ, tap%d); ret = ioctl(fd, TUNSETIFF, (void *) ifr); if (ret != 0) { -fprintf(stderr, warning: could not configure /dev/net/tun: no virtual network emulation\n); +fprintf(stderr, warning: could not configure %s: no virtual network emulation\n, dev); close(fd); return -1; } diff --git a/net/tap-solaris.c b/net/tap-solaris.c index e14fe36..3d10984 100644 --- a/net/tap-solaris.c +++ b/net/tap-solaris.c @@ -171,10 +171,15 @@ static int tap_alloc(char *dev, size_t dev_size) return tap_fd; } -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required) +int tap_open(char *ifname, int ifname_size, const char *devarg, + int *vnet_hdr, int vnet_hdr_required) { char dev[10]=; int fd; +if (devarg) { +fprintf(stderr, dev= not supported\n); +return -1; +} if( (fd = tap_alloc(dev, sizeof(dev))) 0 ){ fprintf(stderr, Cannot allocate TAP device\n); return -1; diff --git a/net/tap.c b/net/tap.c index 0d8b424..8635ae2 100644 --- a/net/tap.c +++ b/net/tap.c @@ -343,12 +343,15 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr) { int fd, vnet_hdr_required; char ifname[128] = {0,}; +const char *dev; const char *setup_script; if (qemu_opt_get(opts, ifname
Re: [Qemu-devel] Re: [PATCH, try 2, version 2] qemu/tap: add -net tap, dev= option
On Wednesday 09 December 2009, Michael S. Tsirkin wrote: On Wed, Dec 09, 2009 at 03:49:04PM +0100, Arnd Bergmann wrote: -TFR(fd = open(/dev/net/tun, O_RDWR)); +if (!*dev) +dev = /dev/net/tun; + Did you test without dev parameter? I think dev will be NULL so this will deference a nullpointer ... probably if (!dev) is what you mean? D'oh. will fix. will be neater if you put [,dev=str] after [,script=file] Also - it does need a string, but only insofar as all options are strings. Maybe dev=devfile or dev=file would be clearer. Yep. Thanks, Arnd
[Qemu-devel] Guest bridge setup variations
As promised, here is my small writeup on which setups I feel are important in the long run for server-type guests. This does not cover -net user, which is really for desktop kinds of applications where you do not want to connect into the guest from another IP address. I can see four separate setups that we may or may not want to support, the main difference being how the forwarding between guests happens: 1. The current setup, with a bridge and tun/tap devices on ports of the bridge. This is what Gerhard's work on access controls is focused on and the only option where the hypervisor actually is in full control of the traffic between guests. CPU utilization should be highest this way, and network management can be a burden, because the controls are done through a Linux, libvirt and/or Director specific interface. 2. Using macvlan as a bridging mechanism, replacing the bridge and tun/tap entirely. This should offer the best performance on inter-guest communication, both in terms of throughput and CPU utilization, but offer no access control for this traffic at all. Performance of guest-external traffic should be slightly better than bridge/tap. 3. Doing the bridging in the NIC using macvlan in passthrough mode. This lowers the CPU utilization further compared to 2, at the expense of limiting throughput by the performance of the PCIe interconnect to the adapter. Whether or not this is a win is workload dependent. Access controls now happen in the NIC. Currently, this is not supported yet, due to lack of device drivers, but it will be an important scenario in the future according to some people. 4. Using macvlan for actual VEPA on the outbound interface. This is mostly interesting because it makes the network access controls visible in an external switch that is already managed. CPU utilization and guest-external throughput should be identical to 3, but inter-guest latency can only be worse because all frames go through the external switch. In case 2 through 4, we have the choice between macvtap and the raw packet interface for connecting macvlan to qemu. Raw sockets are better tested right now, while macvtap has better permission management (i.e. it does not require CAP_NET_ADMIN). Neither one is upstream though at the moment. The raw driver only requires qemu patches, while macvtap requires both a new kernel driver and a trivial change in qemu. In all four cases, vhost-net could be used to move the workload from user space into the kernel, which may be an advantage. The decision for or against vhost-net is entirely independent of the other decisions. Arnd
[Qemu-devel] [PATCH, try 2] qemu/tap: add -net tap,dev= option
In order to support macvtap, we need a way to select the actual tap endpoint. While it would be nice to pass it by network interface name, passing the character device is more flexible, and we can easily do both in the long run. Signed-off-by: Arnd Bergmann a...@arndb.de --- diff --git a/net.c b/net.c index 13bdbb2..deb12fd 100644 --- a/net.c +++ b/net.c @@ -955,6 +955,11 @@ static struct { }, #ifndef _WIN32 { +.name = dev, +.type = QEMU_OPT_STRING, +.help = character device pathname, +}, +{ .name = fd, .type = QEMU_OPT_STRING, .help = file descriptor of an already opened tap, diff --git a/net/tap-aix.c b/net/tap-aix.c index 4bc9f2f..b789d06 100644 --- a/net/tap-aix.c +++ b/net/tap-aix.c @@ -25,7 +25,8 @@ #include net/tap.h #include stdio.h -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required) +int tap_open(char *ifname, int ifname_size, char *dev, int dev_size, + int *vnet_hdr, int vnet_hdr_required) { fprintf(stderr, no tap on AIX\n); return -1; diff --git a/net/tap-bsd.c b/net/tap-bsd.c index 815997d..09a7780 100644 --- a/net/tap-bsd.c +++ b/net/tap-bsd.c @@ -40,7 +40,8 @@ #include util.h #endif -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required) +int tap_open(char *ifname, int ifname_size, char *dev, int dev_size, + int *vnet_hdr, int vnet_hdr_required) { int fd; char *dev; diff --git a/net/tap-linux.c b/net/tap-linux.c index 6af9e82..a6df123 100644 --- a/net/tap-linux.c +++ b/net/tap-linux.c @@ -32,14 +32,21 @@ #include sysemu.h #include qemu-common.h -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required) +int tap_open(char *ifname, int ifname_size, char *dev, int dev_size, + int *vnet_hdr, int vnet_hdr_required) { struct ifreq ifr; int fd, ret; +const char *path; -TFR(fd = open(/dev/net/tun, O_RDWR)); +if (dev[0] == '\0') + path = /dev/net/tun; +else + path = dev; + +TFR(fd = open(dev, O_RDWR)); if (fd 0) { -fprintf(stderr, warning: could not open /dev/net/tun: no virtual network emulation\n); +fprintf(stderr, warning: could not open %s: no virtual network emulation\n, path); return -1; } memset(ifr, 0, sizeof(ifr)); @@ -70,7 +77,7 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required pstrcpy(ifr.ifr_name, IFNAMSIZ, tap%d); ret = ioctl(fd, TUNSETIFF, (void *) ifr); if (ret != 0) { -fprintf(stderr, warning: could not configure /dev/net/tun: no virtual network emulation\n); +fprintf(stderr, warning: could not configure %s: no virtual network emulation\n, path); close(fd); return -1; } diff --git a/net/tap-solaris.c b/net/tap-solaris.c index e14fe36..72abb78 100644 --- a/net/tap-solaris.c +++ b/net/tap-solaris.c @@ -171,7 +171,8 @@ static int tap_alloc(char *dev, size_t dev_size) return tap_fd; } -int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required) +int tap_open(char *ifname, int ifname_size, char *dev, int dev_size, + int *vnet_hdr, int vnet_hdr_required) { char dev[10]=; int fd; diff --git a/net/tap.c b/net/tap.c index 0d8b424..14ddf65 100644 --- a/net/tap.c +++ b/net/tap.c @@ -343,12 +343,17 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr) { int fd, vnet_hdr_required; char ifname[128] = {0,}; +char dev[128] = {0,}; const char *setup_script; if (qemu_opt_get(opts, ifname)) { pstrcpy(ifname, sizeof(ifname), qemu_opt_get(opts, ifname)); } +if (qemu_opt_get(opts, dev)) { +pstrcpy(dev, sizeof(dev), qemu_opt_get(opts, dev)); +} + *vnet_hdr = qemu_opt_get_bool(opts, vnet_hdr, 1); if (qemu_opt_get(opts, vnet_hdr)) { vnet_hdr_required = *vnet_hdr; @@ -356,7 +361,8 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr) vnet_hdr_required = 0; } -TFR(fd = tap_open(ifname, sizeof(ifname), vnet_hdr, vnet_hdr_required)); +TFR(fd = tap_open(ifname, sizeof(ifname), dev, sizeof(dev), + vnet_hdr, vnet_hdr_required)); if (fd 0) { return -1; } @@ -371,6 +377,7 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr) } qemu_opt_set(opts, ifname, ifname); +qemu_opt_set(opts, dev, dev); return fd; } @@ -382,10 +389,12 @@ int net_init_tap(QemuOpts *opts, Monitor *mon, const char *name, VLANState *vlan if (qemu_opt_get(opts, fd)) { if (qemu_opt_get(opts, ifname) || +qemu_opt_get(opts, dev) || qemu_opt_get(opts, script) || qemu_opt_get(opts, downscript) || qemu_opt_get(opts, vnet_hdr
Re: [Qemu-devel] [PATCH, RFC] tap-linux: support opening arbitrary char devices
On Wednesday 02 December 2009, Anthony Liguori wrote: Arnd Bergmann wrote: With the upcoming macvtap, we will want to open devices other than /dev/net/tun but no longer need to call TUNSETIFF. What are the names of these devices and how do you the character devices get created in the first place? Like a macvlan device, you use iproute2 to set up the interface, e.g. ip link add link eth0 type macvtap mode vepa This is consistent with how you'd set up macvlan, veth or vlan devices. With my current code, the name that gets used for the character device in absence of a matching udev rule will be /dev/tap%d with %d being the interface index of the network device. This makes it possible to do 'qemu -net tap,ifname=/dev/tap/macvtap0' to refer to a chardev in addition to the current way of doing 'qemu -net tap,ifname=tap0' to refer to a tap network interface set with TUNSETIFF. This is consistent with what we do on BSD. We definitely don't want to overload ifname like this. /dev/tap/macvtap0 is clearly not the interface name. tap interfaces on other operating systems already differ in this respect, and macvtap seems to be much closer to what BSD and Solaris do than the Linux tun/tap interface, judging from how they are used in qemu. Assuming you have a virtual interface virteth0 (name made up to be different from existing ones) with interface index 3, the syntax would be * Linux, with tap: (which doesn't really work with arbitrary devices): qemu -net tap,ifname=virteth0 * Solaris: (opens /dev/tap3) qemu -net tap,ifname=3 * BSD: (opens /dev/tap3) qemu -net tap,ifname=tap3 I chose the full path name to be sure it does not conflict with existing usage, 3 and tap3 are both valid interface names, /dev/tap3 is not. If you prefer, I can add some logic to determine the character device name from the network interface name, so that you can call it consistently with the interface name on linux, wether using tun/tap or macvtap. That patch will be somewhat more complicated than the one I posted. Alternatively, I can use the documented name= parameter for pointing to the character device, but that would be inconsitent with the current usage on all the supported OSs. Arnd
[Qemu-devel] [PATCH, RFC] tap-linux: support opening arbitrary char devices
With the upcoming macvtap, we will want to open devices other than /dev/net/tun but no longer need to call TUNSETIFF. This makes it possible to do 'qemu -net tap,ifname=/dev/tap/macvtap0' to refer to a chardev in addition to the current way of doing 'qemu -net tap,ifname=tap0' to refer to a tap network interface set with TUNSETIFF. This is consistent with what we do on BSD. This is guaranteed not to cause conflicts with existing usage, because network devices on Linux cannot start with a '/'. Alternatively, I could do a patch to use the -net tap,name= parameter which documented to have a similar purpose but not currently implemented at all. Signed-off-by: Arnd Bergmann a...@arndb.de diff --git a/net/tap-linux.c b/net/tap-linux.c index 0f621a2..21f0167 100644 --- a/net/tap-linux.c +++ b/net/tap-linux.c @@ -36,10 +36,20 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required { struct ifreq ifr; int fd, ret; +int open_named; +char *devname; + +if (ifname_size = 1 *ifname == /) { + open_named = 1; + devname = ifname; +} else { + open_named = 0; + devname = /dev/net/tun; +} -TFR(fd = open(/dev/net/tun, O_RDWR)); +TFR(fd = open(devname, O_RDWR)); if (fd 0) { -fprintf(stderr, warning: could not open /dev/net/tun: no virtual network emulation\n); +fprintf(stderr, warning: could not open %s: no virtual network emulation\n, devname); return -1; } memset(ifr, 0, sizeof(ifr)); @@ -62,17 +72,20 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required } } -if (ifname[0] != '\0') -pstrcpy(ifr.ifr_name, IFNAMSIZ, ifname); -else -pstrcpy(ifr.ifr_name, IFNAMSIZ, tap%d); -ret = ioctl(fd, TUNSETIFF, (void *) ifr); -if (ret != 0) { -fprintf(stderr, warning: could not configure /dev/net/tun: no virtual network emulation\n); -close(fd); -return -1; +if (!open_named) { + if (ifname[0] != '\0') + pstrcpy(ifr.ifr_name, IFNAMSIZ, ifname); + else + pstrcpy(ifr.ifr_name, IFNAMSIZ, tap%d); + ret = ioctl(fd, TUNSETIFF, (void *) ifr); + if (ret != 0) { + fprintf(stderr, warning: could not configure %s: no virtual + network emulation\n, devname); + close(fd); + return -1; + } + pstrcpy(ifname, ifname_size, ifr.ifr_name); } -pstrcpy(ifname, ifname_size, ifr.ifr_name); fcntl(fd, F_SETFL, O_NONBLOCK); return fd; }
Re: [Qemu-devel] [PATCH 10/13] Implement early printk in virtio-console
On Wednesday 25 November 2009, Carsten Otte wrote: Anthony Liguori wrote: Oh, that's bad :-) That should really be it's own character device. We don't really have a way to connect two character devices like that. Maybe muxing? It will be a character device, once the device tree is initialized. Better ideas are welcome, just keep in mind basic things like kmalloc don't work yet. Can't you just leave it out for this release? Early printk is great for debugging, but not essential if you just want to run a guest as long as you get it past the virtio init phase. Arnd
Re: [Qemu-devel] [PATCH 4/4] Add support for -net bridge
On Sunday 08 November 2009 08:27:41 Avi Kivity wrote: On 11/08/2009 12:11 AM, Anthony Liguori wrote: You don't need root privileges to use a tap device. You can access a preconfigured tap device but you cannot allocate a tap device and connect it to a bridge without CAP_NET_ADMIN. btw, shouldn't we, in the general case, create a bridge per user and use IP NAT? If we have a global bridge, users can spoof each other's MAC addresses and interfere with their virtual machines. They can also interfere with the real network. That's not a concern with most one-user-per-machine configurations, but the default configuration should be safe. It also depends a lot on what you want to do with the virtual machine. If you want to run a game or a legacy application in a different operating system on your desktop, a NATed bridge is ideal, but it does not work on a server if the guest wants to listen on a socket with its own IP address. Arnd
Re: [Qemu-devel] [PATCH 0/4] net-bridge: rootless bridge support for qemu
On Saturday 07 November 2009, Anthony Liguori wrote: Avi Kivity wrote: On 11/07/2009 11:14 AM, Avi Kivity wrote: I'd welcome -net bridge as one of them. But we shouldn't try to invent access control systems or install suid helpers. We can make the helper a script that does exec sudo /the/real/helper $@ so a user can add it to /etc/sudoers and get pre-authenticated configuration. The key point of the helper here is that you pass an fd to a socketpair and you then receive an fd over that socket. What the helper does is really less important. Whether it's a script like you suggest or something like I proposed doesn't matter from a qemu perspective. Well, the difference matters from a security perspective. The sudo script that Avi suggested just means that you can guarantee you don't introduce any security holes through a suid executable. Fortunately, it does not impact the contents of your helper either, only the installation. You could even be clever in qemu and use call the helper using sudo if qemu is running as unpriviledged user and the helper is not a suid file. Arnd