RE: [PATCH 1/8] colo: Only support the same qemu version on source and destination
> -Original Message- > From: qemu-devel-bounces+eddie.dong=intel@nongnu.org devel-bounces+eddie.dong=intel@nongnu.org> On Behalf Of Lukas > Straub > Sent: Thursday, June 22, 2023 5:15 AM > To: qemu-devel > Cc: Zhang, Hailiang ; Juan Quintela > ; Peter Xu ; Leonardo Bras > ; Zhang, Chen > Subject: [PATCH 1/8] colo: Only support the same qemu version on source > and destination > > Signed-off-by: Lukas Straub > --- > docs/COLO-FT.txt | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/docs/COLO-FT.txt b/docs/COLO-FT.txt index > 2e760a4aee..8e64480dbd 100644 > --- a/docs/COLO-FT.txt > +++ b/docs/COLO-FT.txt > @@ -148,6 +148,8 @@ in test procedure. > Note: Here we are running both instances on the same host for testing, > change the IP Addresses if you want to run it on two hosts. Initially > 127.0.0.1 is the Primary Host and 127.0.0.2 is the Secondary Host. > +Note: COLO is a experimental feature, an experimental feature >so currently is should only be it should ... > +used with the same qemu version on sourcee and target. > > == Startup qemu == > 1. Primary: > -- > 2.39.2
RE: [PATCH 1/3] hw/gpio: add PCA6414 i2c GPIO expander
> -Original Message- > From: qemu-devel-bounces+eddie.dong=intel@nongnu.org devel-bounces+eddie.dong=intel@nongnu.org> On Behalf Of Titus > Rwantare > Sent: Monday, February 6, 2023 11:50 AM > To: peter.mayd...@linaro.org > Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org; Titus Rwantare > ; Hao Wu > Subject: [PATCH 1/3] hw/gpio: add PCA6414 i2c GPIO expander > > This is a simple i2c device that allows i2c capable devices to have GPIOs. This patch is to implement a device model of I2C to GPIO for PCA_xxx, right? Or do you implement a general/common I2C to GPIO device? I think it is for the former one. In this case, the commit message is not clear. > > Reviewed-by: Hao Wu > Signed-off-by: Titus Rwantare > --- > hw/arm/Kconfig | 1 + > hw/gpio/meson.build | 1 + > hw/gpio/pca_i2c_gpio.c | 362 > hw/gpio/trace-events| 5 + > hw/i2c/Kconfig | 4 + > include/hw/gpio/pca_i2c_gpio.h | 72 +++ > tests/qtest/meson.build | 1 + > tests/qtest/pca_i2c_gpio-test.c | 169 +++ > 8 files changed, 615 insertions(+) > create mode 100644 hw/gpio/pca_i2c_gpio.c create mode 100644 > include/hw/gpio/pca_i2c_gpio.h create mode 100644 > tests/qtest/pca_i2c_gpio-test.c > > diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig index > 2d157de9b8..1b533ddd76 100644 > --- a/hw/arm/Kconfig > +++ b/hw/arm/Kconfig > @@ -418,6 +418,7 @@ config NPCM7XX > select SSI > select UNIMP > select PCA954X > +select PCA_I2C_GPIO > > config FSL_IMX25 > bool > diff --git a/hw/gpio/meson.build b/hw/gpio/meson.build index > b726e6d27a..1e5b602002 100644 > --- a/hw/gpio/meson.build > +++ b/hw/gpio/meson.build > @@ -12,3 +12,4 @@ softmmu_ss.add(when: 'CONFIG_OMAP', if_true: > files('omap_gpio.c')) > softmmu_ss.add(when: 'CONFIG_RASPI', if_true: files('bcm2835_gpio.c')) > softmmu_ss.add(when: 'CONFIG_ASPEED_SOC', if_true: > files('aspeed_gpio.c')) > softmmu_ss.add(when: 'CONFIG_SIFIVE_GPIO', if_true: files('sifive_gpio.c')) > +softmmu_ss.add(when: 'CONFIG_PCA_I2C_GPIO', if_true: > +files('pca_i2c_gpio.c')) > diff --git a/hw/gpio/pca_i2c_gpio.c b/hw/gpio/pca_i2c_gpio.c new file > mode 100644 index 00..afae497a22 > --- /dev/null > +++ b/hw/gpio/pca_i2c_gpio.c Should this file be located under hw/i2c ? > @@ -0,0 +1,362 @@ > +/* > + * NXP PCA I2C GPIO Expanders > + * > + * Low-voltage translating 16-bit I2C/SMBus GPIO expander with > +interrupt output, > + * reset, and configuration registers > + * > + * Datasheet: https://www.nxp.com/docs/en/data-sheet/PCA6416A.pdf > + * > + * Copyright 2023 Google LLC > + * > + * SPDX-License-Identifier: GPL-2.0-or-later > + * > + * These devices, by default, are configured to input only. The > +configuration is > + * settable through qom/qmp, or i2c.To set some pins as inputs before > +boot, use > + * the following in the board file of the machine: > + * object_property_set_uint(Object *obj, const char *name, > + * uint64_t value, Error **errp); > + * specifying name as "gpio_config" and the value as a bitfield of the > +inputs > + * e.g. for the pca6416, a value of 0xFFF0, sets pins 0-3 as outputs > +and > + * 4-15 as inputs. > + * This value can also be set at runtime through qmp externally, or by > + * writing to the config register using i2c. > + * > + */ > + > +#include "qemu/osdep.h" > +#include "hw/gpio/pca_i2c_gpio.h" > +#include "hw/irq.h" > +#include "hw/qdev-properties.h" > +#include "migration/vmstate.h" > +#include "qapi/visitor.h" > +#include "qemu/log.h" > +#include "qemu/module.h" > +#include "trace.h" > + > +/* > + * compare new_output to curr_output and update irq to match > new_output > + * > + * The Input port registers (registers 0 and 1) reflect the incoming > +logic > + * levels of the pins, regardless of whether the pin is defined as an > +input or > + * an output by the Configuration register. > + */ > +static void pca_i2c_update_irqs(PCAGPIOState *ps) { > +PCAGPIOClass *pc = PCA_I2C_GPIO_GET_CLASS(ps); > +uint16_t out_diff = ps->new_output ^ ps->curr_output; > +uint16_t in_diff = ps->new_input ^ ps->curr_input; > +uint16_t mask, pin_i; > + > +if (in_diff || out_diff) { The spec says " A pin configured as an output cannot cause an interrupt". Do we need to check out_diff here? > +for (int i = 0; i < pc->num_pins; i++) { At least for PCA_6416A, the state of pins are described in UINT16. We can check all together rather than bit scan/pin scan. > +mask = BIT(i); > +/* pin must be configured as an output to be set here */ > +if (out_diff & ~ps->config & mask) { > +pin_i = mask & ps->new_output; > +qemu_set_irq(ps->output[i], pin_i > 0); > +ps->curr_output &= ~mask; > +ps->curr_output |= pin_i; > +} > + > +if (in_diff &
RE: [PATCH v2 02/11] hw/watchdog/wdt_aspeed: Extend MMIO range to cover more registers
> When booting the Zephyr demo in [1] we get: > > aspeed.io: unimplemented device write (size 4, offset 0x185128, value > 0x030f1ff1) <-- > aspeed.io: unimplemented device write (size 4, offset 0x18512c, value > 0x03f1) > > This corresponds to this Zephyr code [2]: > > static int aspeed_wdt_init(const struct device *dev) > { > const struct aspeed_wdt_config *config = dev->config; > struct aspeed_wdt_data *const data = dev->data; > uint32_t reg_val; > > /* disable WDT by default */ > reg_val = sys_read32(config->ctrl_base + WDT_CTRL_REG); > reg_val &= ~WDT_CTRL_ENABLE; > sys_write32(reg_val, config->ctrl_base + WDT_CTRL_REG); > > sys_write32(data->rst_mask1, > config->ctrl_base + WDT_SW_RESET_MASK1_REG); <-- > sys_write32(data->rst_mask2, > config->ctrl_base + WDT_SW_RESET_MASK2_REG); > > return 0; > } > > The register definitions are [3]: > > #define WDT_RELOAD_VAL_REG 0x0004 > #define WDT_RESTART_REG 0x0008 > #define WDT_CTRL_REG0x000C > #define WDT_TIMEOUT_STATUS_REG 0x0010 > #define WDT_TIMEOUT_STATUS_CLR_REG 0x0014 > #define WDT_RESET_MASK1_REG 0x001C > #define WDT_RESET_MASK2_REG 0x0020 > #define WDT_SW_RESET_MASK1_REG 0x0028 <-- > #define WDT_SW_RESET_MASK2_REG 0x002C > #define WDT_SW_RESET_CTRL_REG 0x0024 > > Currently QEMU only cover a MMIO region of size 0x20: > > #define ASPEED_WDT_REGS_MAX(0x20 / 4) > > Change to map the whole 'iosize' which might be bigger, covering the other The root cause is that ASPEED_WDT_REGS_MAX is too small, right? Probably the Qemu is emulating an old version of the hardware. Given the meaning of ASPEED_WDT_REGS_MAX, it should be larger than iosize, not? Probably ASPEED_WDT_REGS_MAX should be per device type (aspeed_2400/2500), while iosize is for all devices, and its initial value comes from the per device type REGS_MAX. > registers. The MemoryRegionOps read/write handlers will report the accesses > as out-of-bounds guest-errors, but the next commit will report them as > unimplemented. > > [1] https://github.com/AspeedTech-BMC/zephyr/releases/tag/v00.01.07 > [2] https://github.com/AspeedTech-BMC/zephyr/commit/2e99f10ac27b > [3] https://github.com/AspeedTech- > BMC/zephyr/blob/v00.01.08/drivers/watchdog/wdt_aspeed.c#L31 > > Reviewed-by: Peter Delevoryas > Signed-off-by: Philippe Mathieu-Daudé > --- > hw/watchdog/wdt_aspeed.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/hw/watchdog/wdt_aspeed.c b/hw/watchdog/wdt_aspeed.c index > 958725a1b5..eefca31ae4 100644 > --- a/hw/watchdog/wdt_aspeed.c > +++ b/hw/watchdog/wdt_aspeed.c > @@ -260,6 +260,7 @@ static void aspeed_wdt_realize(DeviceState *dev, > Error **errp) { > SysBusDevice *sbd = SYS_BUS_DEVICE(dev); > AspeedWDTState *s = ASPEED_WDT(dev); > +AspeedWDTClass *awc = ASPEED_WDT_GET_CLASS(dev); > > assert(s->scu); > > @@ -271,7 +272,7 @@ static void aspeed_wdt_realize(DeviceState *dev, > Error **errp) > s->pclk_freq = PCLK_HZ; > > memory_region_init_io(>iomem, OBJECT(s), _wdt_ops, s, > - TYPE_ASPEED_WDT, ASPEED_WDT_REGS_MAX * 4); > + TYPE_ASPEED_WDT, awc->iosize); > sysbus_init_mmio(sbd, >iomem); > } > > -- > 2.38.1 >
RE: [PATCH v2 01/11] hw/watchdog/wdt_aspeed: Rename MMIO region size as 'iosize'
> > Avoid confusing two different things: > - the WDT I/O region size ('iosize') > - at which offset the SoC map the WDT ('offset') While it is often the same, > we > can map smaller region sizes at larger offsets. > > Here we are interested in the I/O region size, so rename as 'iosize'. > > Reviewed-by: Peter Delevoryas > Signed-off-by: Philippe Mathieu-Daudé > --- > hw/arm/aspeed_ast10x0.c | 2 +- > hw/arm/aspeed_ast2600.c | 2 +- > hw/arm/aspeed_soc.c | 2 +- > hw/watchdog/wdt_aspeed.c | 8 > include/hw/watchdog/wdt_aspeed.h | 2 +- > 5 files changed, 8 insertions(+), 8 deletions(-) > > diff --git a/hw/arm/aspeed_ast10x0.c b/hw/arm/aspeed_ast10x0.c index > 4d0b9b115f..122b3fd3f3 100644 > --- a/hw/arm/aspeed_ast10x0.c > +++ b/hw/arm/aspeed_ast10x0.c > @@ -325,7 +325,7 @@ static void aspeed_soc_ast1030_realize(DeviceState > *dev_soc, Error **errp) > return; > } > aspeed_mmio_map(s, SYS_BUS_DEVICE(>wdt[i]), 0, > -sc->memmap[ASPEED_DEV_WDT] + i * awc->offset); > +sc->memmap[ASPEED_DEV_WDT] + i * awc->iosize); How does the specification say here (I didn't find it)? I read this is for a case where multiple WDTs are implemented. And offset means the distance io registers of WDT[1] are located from WDT[0]. Or does the spec explicitly says the io registers of WDT[1] is located right after WDT[0] without any gaps in this case, iosize is better)? > } > > /* GPIO */ > diff --git a/hw/arm/aspeed_ast2600.c b/hw/arm/aspeed_ast2600.c index > cd75465c2b..a79e05ddbd 100644 > --- a/hw/arm/aspeed_ast2600.c > +++ b/hw/arm/aspeed_ast2600.c > @@ -472,7 +472,7 @@ static void aspeed_soc_ast2600_realize(DeviceState > *dev, Error **errp) > return; > } > aspeed_mmio_map(s, SYS_BUS_DEVICE(>wdt[i]), 0, > -sc->memmap[ASPEED_DEV_WDT] + i * awc->offset); > +sc->memmap[ASPEED_DEV_WDT] + i * awc->iosize); > } > > /* RAM */ > diff --git a/hw/arm/aspeed_soc.c b/hw/arm/aspeed_soc.c index > b05b9dd416..2c0924d311 100644 > --- a/hw/arm/aspeed_soc.c > +++ b/hw/arm/aspeed_soc.c > @@ -393,7 +393,7 @@ static void aspeed_soc_realize(DeviceState *dev, > Error **errp) > return; > } > aspeed_mmio_map(s, SYS_BUS_DEVICE(>wdt[i]), 0, > -sc->memmap[ASPEED_DEV_WDT] + i * awc->offset); > +sc->memmap[ASPEED_DEV_WDT] + i * awc->iosize); > } > > /* RAM */ > diff --git a/hw/watchdog/wdt_aspeed.c b/hw/watchdog/wdt_aspeed.c index > d753693a2e..958725a1b5 100644 > --- a/hw/watchdog/wdt_aspeed.c > +++ b/hw/watchdog/wdt_aspeed.c > @@ -309,7 +309,7 @@ static void aspeed_2400_wdt_class_init(ObjectClass > *klass, void *data) > AspeedWDTClass *awc = ASPEED_WDT_CLASS(klass); > > dc->desc = "ASPEED 2400 Watchdog Controller"; > -awc->offset = 0x20; > +awc->iosize = 0x20; > awc->ext_pulse_width_mask = 0xff; > awc->reset_ctrl_reg = SCU_RESET_CONTROL1; > awc->wdt_reload = aspeed_wdt_reload; @@ -346,7 +346,7 @@ static void > aspeed_2500_wdt_class_init(ObjectClass *klass, void *data) > AspeedWDTClass *awc = ASPEED_WDT_CLASS(klass); > > dc->desc = "ASPEED 2500 Watchdog Controller"; > -awc->offset = 0x20; > +awc->iosize = 0x20; > awc->ext_pulse_width_mask = 0xf; > awc->reset_ctrl_reg = SCU_RESET_CONTROL1; > awc->reset_pulse = aspeed_2500_wdt_reset_pulse; @@ -369,7 +369,7 > @@ static void aspeed_2600_wdt_class_init(ObjectClass *klass, void *data) > AspeedWDTClass *awc = ASPEED_WDT_CLASS(klass); > > dc->desc = "ASPEED 2600 Watchdog Controller"; > -awc->offset = 0x40; > +awc->iosize = 0x40; > awc->ext_pulse_width_mask = 0xf; /* TODO */ > awc->reset_ctrl_reg = AST2600_SCU_RESET_CONTROL1; > awc->reset_pulse = aspeed_2500_wdt_reset_pulse; @@ -392,7 +392,7 > @@ static void aspeed_1030_wdt_class_init(ObjectClass *klass, void *data) > AspeedWDTClass *awc = ASPEED_WDT_CLASS(klass); > > dc->desc = "ASPEED 1030 Watchdog Controller"; > -awc->offset = 0x80; > +awc->iosize = 0x80; > awc->ext_pulse_width_mask = 0xf; /* TODO */ > awc->reset_ctrl_reg = AST2600_SCU_RESET_CONTROL1; > awc->reset_pulse = aspeed_2500_wdt_reset_pulse; diff --git > a/include/hw/watchdog/wdt_aspeed.h > b/include/hw/watchdog/wdt_aspeed.h > index dfa5dfa424..db91ee6b51 100644 > --- a/include/hw/watchdog/wdt_aspeed.h > +++ b/include/hw/watchdog/wdt_aspeed.h > @@ -40,7 +40,7 @@ struct AspeedWDTState { struct AspeedWDTClass { > SysBusDeviceClass parent_class; > > -uint32_t offset; > +uint32_t iosize; > uint32_t ext_pulse_width_mask; > uint32_t reset_ctrl_reg; > void (*reset_pulse)(AspeedWDTState *s, uint32_t property); > -- > 2.38.1 >
RE: [PATCH v2 3/3] hw/timer: ibex_timer.c: Add support for writes to mtime
Using a union can better reflect this. Also, it can avoid the convert from 2 32-bits register to 64 bits, like the above code does. ibex_timer_update_irqs() also does this conversion. It took me a bit of time, but now I think I understand what you mean: a union of 2 uint32_t's (perhaps packed into a struct or an array) and a uint64_t would make it easier to access the components, is that what you mean? That is pretty handy, thanks. YES. You decide Thanks Eddie
RE: [PATCH v2 3/3] hw/timer: ibex_timer.c: Add support for writes to mtime
From: Tyler Ng Sent: Monday, September 26, 2022 4:38 PM To: Dong, Eddie Cc: open list:RISC-V ; qemu-devel@nongnu.org Developers ; Alistair Francis ; Meng, Bin ; Thomas Huth ; Paolo Bonzini ; Palmer Dabbelt ; Laurent Vivier Subject: Re: [PATCH v2 3/3] hw/timer: ibex_timer.c: Add support for writes to mtime Hi Eddie, On Mon, Sep 26, 2022 at 2:06 PM Dong, Eddie <mailto:eddie.d...@intel.com> wrote: > -Original Message- > From: Qemu-devel mailto:intel@nongnu.org> > On Behalf Of Tyler Ng > Sent: Thursday, September 22, 2022 8:59 AM > To: open list:RISC-V <mailto:qemu-ri...@nongnu.org>; > mailto:qemu-devel@nongnu.org > Developers <mailto:qemu-devel@nongnu.org> > Cc: Alistair Francis <mailto:alistair.fran...@wdc.com>; Meng, Bin > <mailto:bin.m...@windriver.com>; Thomas Huth <mailto:th...@redhat.com>; Paolo > Bonzini <mailto:pbonz...@redhat.com>; Palmer Dabbelt > <mailto:pal...@dabbelt.com>; > Laurent Vivier <mailto:lviv...@redhat.com> > Subject: [PATCH v2 3/3] hw/timer: ibex_timer.c: Add support for writes to > mtime > > 1. Adds fields to hold the value of mtime in timer_upper0 and timer_lower0. > > 2. Changes the read and write functions to use the mtime fields. > > 3. Updates the value of mtime in update_mtime() by extrapolating the time > elapsed. This will need to change if/when the prescalar is implemented. > > 4. Adds a qtest for the ibex timer. > > Signed-off-by: Tyler Ng <mailto:t...@rivosinc.com> > --- > hw/timer/ibex_timer.c | 98 +-- > include/hw/timer/ibex_timer.h | 6 ++ > tests/qtest/ibex-timer-test.c | 178 ++ > tests/qtest/meson.build | 3 +- > 4 files changed, 256 insertions(+), 29 deletions(-) create mode 100644 > tests/qtest/ibex-timer-test.c > > diff --git a/hw/timer/ibex_timer.c b/hw/timer/ibex_timer.c index > d8b8e4e1f6..4230d60e85 100644 > --- a/hw/timer/ibex_timer.c > +++ b/hw/timer/ibex_timer.c > @@ -52,28 +52,56 @@ REG32(INTR_STATE, 0x118) REG32(INTR_TEST, > 0x11C) > FIELD(INTR_TEST, T_0, 0, 1) > > -static uint64_t cpu_riscv_read_rtc(uint32_t timebase_freq) > +static inline uint64_t get_mtime(void *opaque) > { > - return muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL), > - timebase_freq, NANOSECONDS_PER_SECOND); > + IbexTimerState *s = opaque; > + return (s->timer_lower0) | ((uint64_t) s->timer_upper0 << 32); > } > > -static void ibex_timer_update_irqs(IbexTimerState *s) > +/* > + * The goal of this function is to: > + * 1. Check if the timer is enabled. If not, return false, > + * 2. Calculate the amount of time that has passed since. > + * 3. Extrapolate the number of ticks that have passed, and add it to > `mtime`. > + * 4. Return true. > + */ > +static bool update_mtime(IbexTimerState *s) > { > - uint64_t value = s->timer_compare_lower0 | > - ((uint64_t)s->timer_compare_upper0 << 32); So the hardware actually implements 64 bits register (used in 32 bits CPU), why not use an union for this? Same for: + uint32_t timer_lower0; + uint32_t timer_upper0; I'm not too sure what a C union would do for us here? I think what the hardware really implement is a 64 bits register, with 32 bits interface to access. struct IbexTimerState actually defines both of them: uint64_t mtimecmp; uint32_t timer_compare_lower0; uint32_t timer_compare_upper0; Using a union can better reflect this. Also, it can avoid the convert from 2 32-bits register to 64 bits, like the above code does. ibex_timer_update_irqs() also does this conversion. > - uint64_t next, diff; > - uint64_t now = cpu_riscv_read_rtc(s->timebase_freq); > - > if (!(s->timer_ctrl & R_CTRL_ACTIVE_MASK)) { > - /* Timer isn't active */ > + return false; > + } > + /* Get the time then extrapolate the number of ticks that have elapsed */ > + uint64_t mtime = get_mtime(s); > + int64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL); > + int64_t elapsed = now - s->timer_last_update; > + if (elapsed < 0) { > + /* We jumped back in time. */ > + mtime -= muldiv64((uint64_t)(-elapsed), s->timebase_freq, > + NANOSECONDS_PER_SECOND); > + } else { > + mtime += muldiv64(elapsed, s->timebase_freq, > NANOSECONDS_PER_SECOND); > + } > + s->timer_lower0 = mtime & 0x; > + s->timer_upper0 = (mtime >> 32) & 0x; > + /* update last-checkpoint timestamp */ > + s->timer_last_update = now; > + return true; > +} > + > +static void ibex_timer_update_irqs(IbexTimerSt
RE: [PATCH v2 1/3] hw/watchdog: wdt_ibex_aon.c: Implement the watchdog for the OpenTitan
Hi Tyler: Some other comments embedded. > + > +static void ibex_aon_write(void *opaque, > + hwaddr addr, uint64_t value, > + unsigned int size) { > +IbexAONTimerState *s = IBEX_AON_TIMER(opaque); > + > +/* When writing, need to consider if the configuration is locked */ > +switch (addr) { > +case A_ALERT_TEST: > +s->regs[R_ALERT_TEST] = value & 0x1; > +break; > +case A_WKUP_CTRL: > +qemu_log_mask(LOG_UNIMP, "%s: AON wkup not implemented", > __func__); > +break; > +case A_WKUP_THOLD: > +qemu_log_mask(LOG_UNIMP, "%s: AON wkup not implemented", > __func__); > +break; > +case A_WKUP_COUNT: > +qemu_log_mask(LOG_UNIMP, "%s: AON wkup not implemented", > __func__); > +break; > +case A_WDOG_REGWEN: > +if (s->regs[R_WDOG_REGWEN] == IBEX_AONTIMER_WDOG_UNLOCKED) > { > +s->regs[R_WDOG_REGWEN] = value & > IBEX_AONTIMER_WDOG_REGWEN_MASK; > +} else { > +qemu_log_mask(LOG_GUEST_ERROR, > +"%s: AON Timer configuration locked\n", > __func__); > +} > +break; > +case A_WDOG_CTRL: > +if (s->regs[R_WDOG_REGWEN] == IBEX_AONTIMER_WDOG_UNLOCKED) > { > +if (!(s->regs[R_WDOG_CTRL] & IBEX_AONTIMER_WDOG_CTRL_MASK)) > { > +s->wdog_last = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL); > +} > +s->regs[R_WDOG_CTRL] = value & > IBEX_AONTIMER_WDOG_CTRL_MASK; > +ibex_aon_update_bark_timer(s); > +ibex_aon_update_bite_timer(s); In case, the guest clears the ENABLE bit of CTRL register, we want to remove the registered timer, right? The current code path do nothing when ENABLE is cleared. That is incorrect and may lead to bark and bite event happen again. > +} else { > +qemu_log_mask(LOG_GUEST_ERROR, > +"%s: AON Timer configuration locked\n", > __func__); > +} > +break; > +case A_WDOG_BARK_THOLD: > +if (s->regs[R_WDOG_REGWEN] == IBEX_AONTIMER_WDOG_UNLOCKED) > { > +s->regs[R_WDOG_BARK_THOLD] = value; > +ibex_aon_update_bark_timer(s); > +} else { > +qemu_log_mask(LOG_GUEST_ERROR, > +"%s: AON Timer configuration locked\n", > __func__); > +} > +break; > +case A_WDOG_BITE_THOLD: > +if (s->regs[R_WDOG_REGWEN] == IBEX_AONTIMER_WDOG_UNLOCKED) > { > +s->regs[R_WDOG_BITE_THOLD] = value; > +ibex_aon_update_bite_timer(s); > +} else { > +qemu_log_mask(LOG_GUEST_ERROR, > +"%s: AON Timer configuration locked\n", > __func__); > +} > +break; > +case A_WDOG_COUNT: > +s->regs[R_WDOG_COUNT] = value; > +s->wdog_last = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL); > +ibex_aon_update_bark_timer(s); > +ibex_aon_update_bite_timer(s); > +break; > +case A_INTR_STATE: > +/* Service the IRQs by writing 1 to the appropriate field */ > +if ((value & R_INTR_STATE_WDOG_MASK)) { Do we need to clear the bit in s->regs[R_INTR_STATE] ? Otherwise, guest read of the register gets wrong value. > +qemu_irq_lower(s->bark_irq); > +ibex_aon_update_count(s); > +int64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL); > +/* > + * We need to make sure that COUNT < *_THOLD. If it isn't, an > + * interrupt is generated the next clock cycle > + */ > +if (s->regs[R_WDOG_COUNT] >= s->regs[R_WDOG_BARK_THOLD]) { > +if (now + IBEX_AONTIMER_PERIOD_NS < now) { > +timer_mod_ns(s->barker, INT64_MAX); Overflow of an 64 bits timer(in the unit of ns) is too far away, which can never happen. > +} else { > +timer_mod_ns(s->barker, now + IBEX_AONTIMER_PERIOD_NS); We can call the fire API (ibex_aon_barker_expired) directly here. > +} > +} > +} > +break; > +case A_INTR_TEST: > +qemu_log_mask(LOG_UNIMP, "%s: Intr test not implemented", > __func__); > +break; > +case A_WKUP_CAUSE: > +qemu_log_mask(LOG_UNIMP, > +"%s: Wkup cause not implemented", __func__); > +break; > +default: > +qemu_log_mask(LOG_GUEST_ERROR, > +"%s: Invalid register read 0x%" HWADDR_PRIx "\n", > +__func__, addr); > +break; > +} > +} > + > +static void ibex_aon_enter_reset(Object *obj, ResetType type) { > +IbexAONTimerState *s = IBEX_AON_TIMER(obj); > +s->regs[R_ALERT_TEST] = 0x0; > +s->regs[R_WKUP_CTRL] = 0x0; > +s->regs[R_WKUP_THOLD] = 0x0; > +s->regs[R_WKUP_COUNT] = 0x0; > +s->regs[R_WDOG_REGWEN] = 0x1; > +
RE: [PATCH v2 1/3] hw/watchdog: wdt_ibex_aon.c: Implement the watchdog for the OpenTitan
Hi Tyler: > +} > + > +/* Called when the bark timer expires */ static void > +ibex_aon_barker_expired(void *opaque) { This may happen during ibex_aon_update_count(), right? > + IbexAONTimerState *s = IBEX_AON_TIMER(opaque); > + if (ibex_aon_update_count(s) && This may happen during ibex_aon_update_count(). Nested call may lead to incorrect s->regs[R_WDOG_COUNT] & s->wdog_last. Can you elaborate? The timers for bark and bite are not updated in "update_count". When 1st ibex_aon_update_count() is executing, and is in the place PPP (after updating s->regs[R_WDOG_COUNT] but before updating s->wdog_last), a timer (barker) may happen. Inside ibex_aon_barker_expired(), it calls ibex_aon_update_count() again (nest call), and update s->regs[R_WDOG_COUNT] & s->wdog_last, with the new value. After the nest execution ends, and returns to the initial point (PPP) , the s->wdog_last is updated (with the value of 1st execution time), this leads to mismatch of s->regs[R_WDOG_COUNT] & s->wdog_last. This case may not be triggered at normal case, but if the guest read A_WDOG_COUNT, the 1st ibex_aon_update_count() does execute, and bring the potential issue. I think we can solve the problem, by not updating s->regs[R_WDOG_COUNT] & s->wdog_last in the timer call back API. The update is not necessary given that the stored value is anyway not the current COUNT. We only need to update when the guest write the COUNT. > + s->regs[R_WDOG_COUNT] >= s->regs[R_WDOG_BARK_THOLD]) { > + s->regs[R_INTR_STATE] |= (1 << 1); > + qemu_irq_raise(s->bark_irq); > + } > +} > + THX Eddie
RE: [PATCH v2 3/3] hw/timer: ibex_timer.c: Add support for writes to mtime
> -Original Message- > From: Qemu-devel > On Behalf Of Tyler Ng > Sent: Thursday, September 22, 2022 8:59 AM > To: open list:RISC-V ; qemu-devel@nongnu.org > Developers > Cc: Alistair Francis ; Meng, Bin > ; Thomas Huth ; Paolo > Bonzini ; Palmer Dabbelt ; > Laurent Vivier > Subject: [PATCH v2 3/3] hw/timer: ibex_timer.c: Add support for writes to > mtime > > 1. Adds fields to hold the value of mtime in timer_upper0 and timer_lower0. > > 2. Changes the read and write functions to use the mtime fields. > > 3. Updates the value of mtime in update_mtime() by extrapolating the time > elapsed. This will need to change if/when the prescalar is implemented. > > 4. Adds a qtest for the ibex timer. > > Signed-off-by: Tyler Ng > --- > hw/timer/ibex_timer.c | 98 +-- > include/hw/timer/ibex_timer.h | 6 ++ > tests/qtest/ibex-timer-test.c | 178 ++ > tests/qtest/meson.build | 3 +- > 4 files changed, 256 insertions(+), 29 deletions(-) create mode 100644 > tests/qtest/ibex-timer-test.c > > diff --git a/hw/timer/ibex_timer.c b/hw/timer/ibex_timer.c index > d8b8e4e1f6..4230d60e85 100644 > --- a/hw/timer/ibex_timer.c > +++ b/hw/timer/ibex_timer.c > @@ -52,28 +52,56 @@ REG32(INTR_STATE, 0x118) REG32(INTR_TEST, > 0x11C) > FIELD(INTR_TEST, T_0, 0, 1) > > -static uint64_t cpu_riscv_read_rtc(uint32_t timebase_freq) > +static inline uint64_t get_mtime(void *opaque) > { > -return muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL), > -timebase_freq, NANOSECONDS_PER_SECOND); > +IbexTimerState *s = opaque; > +return (s->timer_lower0) | ((uint64_t) s->timer_upper0 << 32); > } > > -static void ibex_timer_update_irqs(IbexTimerState *s) > +/* > + * The goal of this function is to: > + * 1. Check if the timer is enabled. If not, return false, > + * 2. Calculate the amount of time that has passed since. > + * 3. Extrapolate the number of ticks that have passed, and add it to > `mtime`. > + * 4. Return true. > + */ > +static bool update_mtime(IbexTimerState *s) > { > -uint64_t value = s->timer_compare_lower0 | > - ((uint64_t)s->timer_compare_upper0 << 32); So the hardware actually implements 64 bits register (used in 32 bits CPU), why not use an union for this? Same for: +uint32_t timer_lower0; +uint32_t timer_upper0; > -uint64_t next, diff; > -uint64_t now = cpu_riscv_read_rtc(s->timebase_freq); > - > if (!(s->timer_ctrl & R_CTRL_ACTIVE_MASK)) { > -/* Timer isn't active */ > +return false; > +} > +/* Get the time then extrapolate the number of ticks that have elapsed */ > +uint64_t mtime = get_mtime(s); > +int64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL); > +int64_t elapsed = now - s->timer_last_update; > +if (elapsed < 0) { > +/* We jumped back in time. */ > +mtime -= muldiv64((uint64_t)(-elapsed), s->timebase_freq, > + NANOSECONDS_PER_SECOND); > +} else { > +mtime += muldiv64(elapsed, s->timebase_freq, > NANOSECONDS_PER_SECOND); > +} > +s->timer_lower0 = mtime & 0x; > +s->timer_upper0 = (mtime >> 32) & 0x; > +/* update last-checkpoint timestamp */ > +s->timer_last_update = now; > +return true; > +} > + > +static void ibex_timer_update_irqs(IbexTimerState *s) { > +if (!update_mtime(s)) { > +/* Timer is not enabled? */ > return; > } > +uint64_t mtimecmp = s->timer_compare_lower0 | > + ((uint64_t)s->timer_compare_upper0 << 32); > +uint64_t mtime = get_mtime(s); > > /* Update the CPUs mtimecmp */ > -s->mtimecmp = value; > +s->mtimecmp = mtimecmp; > > -if (s->mtimecmp <= now) { > +if (s->mtimecmp <= mtime) { > /* > * If the mtimecmp was in the past raise the interrupt now. > */ > @@ -84,18 +112,17 @@ static void ibex_timer_update_irqs(IbexTimerState > *s) > } > return; > } > - > -/* Setup a timer to trigger the interrupt in the future */ > +/* Update timers: setup a timer to trigger the interrupt in the > + future */ > qemu_irq_lower(s->m_timer_irq); > qemu_set_irq(s->irq, false); > - > -diff = s->mtimecmp - now; > -next = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + > - muldiv64(diff, > - NANOSECONDS_PER_SECOND, > - s->timebase_freq); > - > -if (next < qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL)) { > +/* Compute the difference, and set a timer for the next update. */ > +const uint64_t diff = mtimecmp - mtime; > +const int64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL); > +const uint64_t towait = muldiv64(diff, NANOSECONDS_PER_SECOND, > + s->timebase_freq); > +/* timer_mod takes in a int64_t, not uint64_t! Need
RE: [PATCH v2 1/3] hw/watchdog: wdt_ibex_aon.c: Implement the watchdog for the OpenTitan
> -Original Message- > From: Qemu-devel > On Behalf Of Tyler Ng > Sent: Thursday, September 22, 2022 8:59 AM > To: open list:RISC-V ; qemu-devel@nongnu.org > Developers > Cc: Alistair Francis ; Meng, Bin > ; Thomas Huth ; Paolo > Bonzini ; Palmer Dabbelt ; > Laurent Vivier > Subject: [PATCH v2 1/3] hw/watchdog: wdt_ibex_aon.c: Implement the > watchdog for the OpenTitan > > This commit adds most of an implementation of the OpenTitan Always-On > Timer. The documentation for this timer is found here: > > https://docs.opentitan.org/hw/ip/aon_timer/doc/ > > Using commit 217a0168ba118503c166a9587819e3811eeb0c0c > > The implementation includes most of the watchdog features; it does not > implement the wakeup timer. > > An important note: the OpenTitan board uses the sifive_plic. The plic will not > be able to claim the bark interrupt (159) because the sifive plic sets > priority[159], but checks priority[158] for the priority, so it thinks that > the > interrupt's priority is 0 (effectively disabled). > > Changed Files: > hw/riscv/Kconfig: Add configuration for the watchdog. > hw/riscv/opentitan.c: Connect AON Timer to the OpenTitan board. > > hw/watchdog/Kconfig: Configuration for the watchdog. > hw/watchdog/meson.build: Compile the watchdog. > hw/watchdog/wdt_ibex_aon.c: The watchdog itself. > > include/hw/riscv/opentitan.h: Add watchdog bark/wakeup IRQs and timer. > include/hw/watchdog/wdt_ibex_aon.h: Add watchdog. > > tests/qtest/ibex-aon-timer-test.c: Ibex AON Timer test. > tests/qtest/meson.build: Build the timer test. > > Signed-off-by: Tyler Ng > --- > hw/riscv/Kconfig | 4 + > hw/riscv/opentitan.c | 21 +- > hw/watchdog/Kconfig| 3 + > hw/watchdog/meson.build| 1 + > hw/watchdog/wdt_ibex_aon.c | 405 + > include/hw/riscv/opentitan.h | 7 +- > include/hw/watchdog/wdt_ibex_aon.h | 67 + tests/qtest/ibex-aon- > timer-test.c | 189 ++ > tests/qtest/meson.build| 3 + > 9 files changed, 696 insertions(+), 4 deletions(-) create mode 100644 > hw/watchdog/wdt_ibex_aon.c create mode 100644 > include/hw/watchdog/wdt_ibex_aon.h > create mode 100644 tests/qtest/ibex-aon-timer-test.c > > diff --git a/hw/riscv/Kconfig b/hw/riscv/Kconfig index > 79ff61c464..72094010be 100644 > --- a/hw/riscv/Kconfig > +++ b/hw/riscv/Kconfig > @@ -4,6 +4,9 @@ config RISCV_NUMA > config IBEX > bool > > +config IBEX_AON > +bool > + > config MICROCHIP_PFSOC > bool > select CADENCE_SDHCI > @@ -20,6 +23,7 @@ config MICROCHIP_PFSOC config OPENTITAN > bool > select IBEX > +select IBEX_AON > select UNIMP > > config SHAKTI_C > diff --git a/hw/riscv/opentitan.c b/hw/riscv/opentitan.c index > af13dbe3b1..eae9b2504f 100644 > --- a/hw/riscv/opentitan.c > +++ b/hw/riscv/opentitan.c > @@ -48,7 +48,7 @@ static const MemMapEntry ibex_memmap[] = { > [IBEX_DEV_RSTMGR] = { 0x4041, 0x1000 }, > [IBEX_DEV_CLKMGR] = { 0x4042, 0x1000 }, > [IBEX_DEV_PINMUX] = { 0x4046, 0x1000 }, > -[IBEX_DEV_PADCTRL] ={ 0x4047, 0x1000 }, > +[IBEX_DEV_AON_TIMER] = { 0x4047, 0x1000 }, > [IBEX_DEV_FLASH_CTRL] = { 0x4100, 0x1000 }, > [IBEX_DEV_AES] ={ 0x4110, 0x1000 }, > [IBEX_DEV_HMAC] = { 0x4111, 0x1000 }, > @@ -122,6 +122,8 @@ static void lowrisc_ibex_soc_init(Object *obj) > > object_initialize_child(obj, "timer", >timer, TYPE_IBEX_TIMER); > > +object_initialize_child(obj, "aon_timer", >aon_timer, > TYPE_IBEX_AON_TIMER); > + > for (int i = 0; i < OPENTITAN_NUM_SPI_HOSTS; i++) { > object_initialize_child(obj, "spi_host[*]", >spi_host[i], > TYPE_IBEX_SPI_HOST); @@ -207,6 +209,7 @@ > static void > lowrisc_ibex_soc_realize(DeviceState > *dev_soc, Error **errp) > 3, qdev_get_gpio_in(DEVICE(>plic), > IBEX_UART0_RX_OVERFLOW_IRQ)); > > +/* RV Timer */ > if (!sysbus_realize(SYS_BUS_DEVICE(>timer), errp)) { > return; > } > @@ -218,6 +221,20 @@ static void lowrisc_ibex_soc_realize(DeviceState > *dev_soc, Error **errp) >qdev_get_gpio_in(DEVICE(qemu_get_cpu(0)), > IRQ_M_TIMER)); > > +/* AON Timer */ > +if (!sysbus_realize(SYS_BUS_DEVICE(>aon_timer), errp)) { > +return; > +} > +sysbus_mmio_map(SYS_BUS_DEVICE(>aon_timer), 0, > memmap[IBEX_DEV_AON_TIMER].base); > +sysbus_connect_irq(SYS_BUS_DEVICE(>aon_timer), > + 0, qdev_get_gpio_in(DEVICE(>plic), > + IBEX_AONTIMER_WDOG_BARK)); > +/* > + * Note: There should be a line to pwrmgr but it's not implemented. > + * TODO: Needs a line connected in, "counter-run" (stop the watchdog if > + *
RE: [PATCH] virtio/vhost-user: Fix wrong vhost notifier GPtrArray size
Reviewed-by: Eddie Dong > -Original Message- > From: Qemu-devel bounces+eddie.dong=intel@nongnu.org> On Behalf Of Yajun Wu > Sent: Wednesday, May 25, 2022 8:49 PM > To: qemu-devel@nongnu.org; m...@redhat.com; alex.ben...@linaro.org; > yaj...@nvidia.com > Cc: Parav Pandit > Subject: [PATCH] virtio/vhost-user: Fix wrong vhost notifier GPtrArray size > > In fetch_or_create_notifier, idx begins with 0. So the GPtrArray size should > be idx + 1 and g_ptr_array_set_size should be called with idx + 1. > > This wrong GPtrArray size causes fetch_or_create_notifier return an invalid > address. Passing this invalid pointer to vhost_user_host_notifier_remove > causes assert fail: > > qemu/include/qemu/int128.h:27: int128_get64: Assertion `r == a' failed. > shutting down, reason=crashed > > Backends like dpdk-vdpa which sends out vhost notifier requests almost > always hit qemu crash. > > Fixes: 503e355465 ("virtio/vhost-user: dynamically assign > VhostUserHostNotifiers") > Signed-off-by: Yajun Wu > Acked-by: Parav Pandit > Change-Id: I87e0f7591ca9a59d210879b260704a2d9e9d6bcd > --- > hw/virtio/vhost-user.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index > b040c1ad2b..dbc690d16c 100644 > --- a/hw/virtio/vhost-user.c > +++ b/hw/virtio/vhost-user.c > @@ -1525,7 +1525,7 @@ static VhostUserHostNotifier > *fetch_or_create_notifier(VhostUserState *u, { > VhostUserHostNotifier *n = NULL; > if (idx >= u->notifiers->len) { > -g_ptr_array_set_size(u->notifiers, idx); > +g_ptr_array_set_size(u->notifiers, idx + 1); > } > > n = g_ptr_array_index(u->notifiers, idx); > -- > 2.36.0 >
RE: [RFC PATCH 00/13] Add a plugin to support out-of-band live migration for VFIO pass-through device
> -Original Message- > From: Alex Williamson > Sent: Wednesday, June 1, 2022 11:01 AM > To: Dong, Eddie > Cc: Rao, Lei ; Tian, Kevin ; Zeng, > Jason ; quint...@redhat.com; dgilb...@redhat.com; > Li, Yadong ; Liu, Yi L ; qemu- > de...@nongnu.org > Subject: Re: [RFC PATCH 00/13] Add a plugin to support out-of-band live > migration for VFIO pass-through device > > On Wed, 1 Jun 2022 17:09:25 + > "Dong, Eddie" wrote: > > > > -Original Message- > > > From: Qemu-devel > > bounces+eddie.dong=intel@nongnu.org> On Behalf Of Alex > > > bounces+Williamson > > > On Tue, 24 May 2022 14:18:35 +0800 > > > Lei Rao wrote: > > > > This proposal adopts a plugin mechanism (an example can be found > > > > in > > > > [1]) given that IPU/DPU vendors usually implement proprietary > > > > migration interfaces without a standard. But we are also open if > > > > an alternative option makes better sense, e.g. via loadable > > > > modules (with Qemu supporting gRPC or JSON-RPC support) or an IPC > > > > mechanism similar > > > to vhost-user. > > > > > > AFAIU, QEMU is not interested in supporting plugin modules, > > > especially proprietary ones. I don't see that a case has really > > > been made that this cannot be done in-band, through a vfio-pci > > > variant driver, possibly making use of proprietary interfaces to a > > > userspace agent if necessary (though please don't ask such to be > > > accepted in-tree for the kernel either). A vfio- user device server > > > might also host such proprietary, device specific agents while > > > supporting the standard, in-band migration interface. Thanks, > > > > > > > Thanks Alex. Removing plug-in module is not a problem. > > > > Do you mean to implement the migration and protocol handling inside > > Qemu-client (probably vfio-client after the VFIO-user is merged)? Or > > to build as part of libvfio-user? We can also build it as a separate > > process of Qemu-server as part of Multi-Process Qemu. > > AIUI, the QEMU "client" in a vfio-user configuration is simply QEMU itself. > The vfio-user "server" provides the actual device implementation, which > could support different license models, depending on what libraries or > existing code is incorporated to implement that server. The QEMU remote > machine type is simply a QEMU-based implementation of a vfio-user server. > The vfio-user server is analogous to a vfio-pci variant driver in the > kernel/ioctl interface model. The vfio-user client should be device agnostic, > possibly with similar exceptions we have today via device specific quirk > handling for the vfio kernel interface. > > > In here, the protocol between host CPU and SoC is based on gRPC, which > > support Rust code in client (Host CPU side here) more than C/C++. Do > > you have any suggestion to better support Rust code with Qemu C/C++ > > code? > > I'm not qualified to provide suggestions regarding Rust code integration with > QEMU, but I think that's only required if the device specific migration > support is on the "client". As above, I don't think that's the correct model, > the vfio migration protocol is meant to support any device specific > requirements on the device end of the connection, ie. the "server" end for > vfio-user, which can be an entirely separate, non-QEMU based process. I > think there are also ways to write kernel drivers in Rust, so possibly a > kernel > interface vfio-pci variant driver could also work. Thanks, > Alex: Thanks for your suggestion. Yes, agree Qemu (client) is, by nature, neutral to physical device knowledge. With more thinking, it seems that: 1: Solution to have a separate kernel driver: The way the host CPU talking with the SoC chip of the device is going thru TCP/IP network, plus high level protocol (gRPC or Json..). This is going to be very complicated and might be hard to be accepted by community. 2: Implement as a full qemu-server device model. This way works if we implement a full device model in vfio-user, but given that the device (NVME for now) works in VFIO passthru mode for performance, the issue Kevin Tian raised in another email is a real concern too. 3: Implement partial (or supplemental) feature in Qemu-server device model. This solution defines a Qemu/VFIO migration interface between the client and server for migration. Client migration-proxy uses hardware transparent interface to talk with the remote-migration server. The
RE: [RFC PATCH 00/13] Add a plugin to support out-of-band live migration for VFIO pass-through device
> -Original Message- > From: Qemu-devel bounces+eddie.dong=intel@nongnu.org> On Behalf Of Alex Williamson > Sent: Thursday, May 26, 2022 11:44 AM > To: Rao, Lei > Cc: Tian, Kevin ; Dong, Eddie > ; Zeng, Jason ; > quint...@redhat.com; dgilb...@redhat.com; Li, Yadong > ; Liu, Yi L ; qemu- > de...@nongnu.org > Subject: Re: [RFC PATCH 00/13] Add a plugin to support out-of-band live > migration for VFIO pass-through device > > On Tue, 24 May 2022 14:18:35 +0800 > Lei Rao wrote: > > > Migration of a VFIO passthrough device can be supported by using a > > device specific kernel driver to save/restore the device state thru > > device specific interfaces. But this approach doesn't work for devices > > that lack a state migration interface, e.g. NVMe. > > > > On the other hand, Infrastructure Process Unit (IPU) or Data > > Processing Unit > > (DPU) vendors may choose to implement an out-of-band interface from > > the SoC to help manage the state of such non-migratable devices e.g. > > via gRPC or JSON-RPC protocols. > > > > This RFC attempts to support such out-of-band migration interface by > > introducing the concept of migration backends in vfio. The existing > > logic around vfio migration uAPI is now called the 'local' backend while a > new 'out-of-band' > > backend is further introduced allowing vfio to redirect VMState ops to > > an external plugin. > > > > Currently, the backend migration Ops is defined close to > > SaveVMHandlers. We also considered whether there is value of > > abstracting it in a lower level e.g. close to vfio migration uAPI but > > no clear conclusion. Hence this is one part which we'd like to hear > suggestions. > > > > This proposal adopts a plugin mechanism (an example can be found in > > [1]) given that IPU/DPU vendors usually implement proprietary > > migration interfaces without a standard. But we are also open if an > > alternative option makes better sense, e.g. via loadable modules (with > > Qemu supporting gRPC or JSON-RPC support) or an IPC mechanism similar > to vhost-user. > > AFAIU, QEMU is not interested in supporting plugin modules, especially > proprietary ones. I don't see that a case has really been made that this > cannot be done in-band, through a vfio-pci variant driver, possibly making > use of proprietary interfaces to a userspace agent if necessary (though > please don't ask such to be accepted in-tree for the kernel either). A vfio- > user device server might also host such proprietary, device specific agents > while supporting the standard, in-band migration interface. Thanks, > Thanks Alex. Removing plug-in module is not a problem. Do you mean to implement the migration and protocol handling inside Qemu-client (probably vfio-client after the VFIO-user is merged)? Or to build as part of libvfio-user? We can also build it as a separate process of Qemu-server as part of Multi-Process Qemu. In here, the protocol between host CPU and SoC is based on gRPC, which support Rust code in client (Host CPU side here) more than C/C++. Do you have any suggestion to better support Rust code with Qemu C/C++ code? Thx Eddie > Alex > > > > > The following graph describes the overall component relationship: > > > > ++ > > | QEMU | > > | ++ | > > | |VFIO Live Migration Framework | | > > | |+--+| | > > | || VFIOMigrationOps || | > > | |+---^-^+| | > > | || | | | > > | |+---v---+ +---v+| | > > | || LM Backend Via| | LM Backend Via || | > > | || Device Fd | |Plugins || | > > | |+---^---+ | +--+| | > > | || | |Plugin Ops++-++ > > | || +-+--+| || > > | || | | > > +-v--+ > > | ++---+ | | Vendor Specific > > | > > | | | |Plugins(.so) > > | > > +--+-+ > > +--+-+ >
Re: [Qemu-devel] Asynchronous / synchronous IO emulation
> > The usage is to construct a secondary hot standby VM (SVM), identical with > the primary VM (PVM). > > When an virtual DMA happens in PVM side, we need to know at which > instruction boundary the virtual DMA is delivered, so that we can replay the > virtual DMA event at the 2nd VM side, to keep them identical at any time. > > > > Asynchronous IO emulations seems to be a little bit more complicate to be > deterministic... > > I might be wrong. > > There is a record-replay mode that might be worth investigating. See > docs/replay.txt. > > It sounds like you are not using live migration for micro-checkpoints? > The live migration mechanism would keep memory in sync at each > checkpoint. > > Is this work releated to the COLO effort? > > Stefan Thanks Stefan. COLO is one what we push for high availability. This is yet another project leveraging KVM for deterministic execution, which is still in room. I use VM replicating to explain the concept, but it is not used for high availability. Eddie
Re: [Qemu-devel] Asynchronous / synchronous IO emulation
> > On Thu, Nov 24, 2016 at 08:44:06AM +, Dong, Eddie wrote: > > Under certain situation, we have a requirement to apply the > virtual DMA event in a deterministic way. Current Qemu uses asynchronous > approach to emulate the virtual IO events since quite long time ago. I am > wondering if we still have an option to choose the synchronous emulation > solution. > > Please explain exactly what you mean. > The usage is to construct a secondary hot standby VM (SVM), identical with the primary VM (PVM). When an virtual DMA happens in PVM side, we need to know at which instruction boundary the virtual DMA is delivered, so that we can replay the virtual DMA event at the 2nd VM side, to keep them identical at any time. Asynchronous IO emulations seems to be a little bit more complicate to be deterministic... I might be wrong. Thx Eddie
[Qemu-devel] Asynchronous / synchronous IO emulation
Hi: Under certain situation, we have a requirement to apply the virtual DMA event in a deterministic way. Current Qemu uses asynchronous approach to emulate the virtual IO events since quite long time ago. I am wondering if we still have an option to choose the synchronous emulation solution. Thanks, Eddie
Re: [Qemu-devel] live migration vs device assignment (motivation)
> > > > Even if the device driver doesn't support migration, you still want to > > migrate VM? That maybe risk and we should add the "bad path" for the > > driver at least. > > At a minimum we should have support for hot-plug if we are expecting to > support migration. You would simply have to hot-plug the device before you > start migration and then return it after. That is how the current bonding > approach for this works if I am not mistaken. Hotplug is good to eliminate the device spefic state clone, but bonding approach is very network specific, it doesn’t work for other devices such as FPGA device, QaT devices & GPU devices, which we plan to support gradually :) > > The advantage we are looking to gain is to avoid removing/disabling the > device for as long as possible. Ideally we want to keep the device active > through the warm-up period, but if the guest doesn't do that we should still > be able to fall back on the older approaches if needed. >
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
> On Wed, Nov 25, 2015 at 12:21 AM, Lan Tianyuwrote: > > On 2015年11月25日 13:30, Alexander Duyck wrote: > >> No, what I am getting at is that you can't go around and modify the > >> configuration space for every possible device out there. This > >> solution won't scale. > > > > > > PCI config space regs are emulation by Qemu and so We can find the > > free PCI config space regs for the faked PCI capability. Its position > > can be not permanent. > > Yes, but do you really want to edit every driver on every OS that you plan to > support this on. What about things like direct assignment of regular Ethernet > ports? What you really need is a solution that will work generically on any > existing piece of hardware out there. The fundamental assumption of this patch series is to modify the driver in guest to self-emulate or track the device state, so that the migration may be possible. I don't think we can modify OS, without modifying the drivers, even using the PCIe hotplug mechanism. In the meantime, modifying Windows OS is a big challenge given that only Microsoft can do. While, modifying driver is relatively simple and manageable to device vendors, if the device vendor want to support state-clone based migration. Thx Eddie
Re: [Qemu-devel] [POC]colo-proxy in qemu
> - What's the plan for vhost? Userspace network in qemu is rather slow, most > user will choose vhost. [Dong, Eddie] Hi Jason: How about we take staging approach? In general, COLO opens a door of high performance HA solution, but it will take very long time to make everything perfect. As for the network virtualization, I think we may start from usage with moderate network bandwidth, like 1Gbps. Otherwise, the performance of COLO may be not that good (of course, like David mentioned, the worst case is same with periodic checkpoint). At the moment, how about we start from in Qemu virtio network, and enhance for vhost case in future? The good thing is that we get more people working on the patch series, and glad to see UMU also joined the effort. Thanks and welcome... Thx Eddie
Re: [Qemu-devel] [POC] colo-proxy in qemu
A question here, the packet comparing may be very tricky. For example, some protocol use random data to generate unpredictable id or something else. One example is ipv6_select_ident() in Linux. So COLO needs a mechanism to make sure PVM and SVM can generate same random data? Good question, the random data connection is a big problem for COLO. At present, it will trigger checkpoint processing because of the different random data. I don't think any mechanisms can assure two different machines generate the same random data. If you have any ideas, pls tell us :) Frequent checkpoint can handle this scenario, but maybe will cause the performance poor. :( The assumption is that, after VM checkpoint, SVM and PVM have identical internal state, so the pattern used to generate random data has high possibility to generate identical data at short time, at least... Thx Eddie
Re: [Qemu-devel] [POC] colo-proxy in qemu
BTW, I felt it is better to be called as an agency, rather than a proxy. Any comments from native speaker? Thx Eddie Hi, all We are planning to implement colo-proxy in qemu to cache and compare packets. This module is one of the important component of COLO project and now it is still in early stage, so any comments and feedback are warmly welcomed, thanks in advance. ## Background COLO FT/HA (COarse-grain LOck-stepping Virtual Machines for Non-stop Service) project is a high availability solution. Both Primary VM (PVM) and Secondary VM (SVM) run in parallel. They receive the same request from client, and generate responses in parallel too. If the response packets from PVM and SVM are identical, they are released immediately. Otherwise, a VM checkpoint (on demand) is conducted. Paper: http://www.socc2013.org/home/program/a3-dong.pdf?attredirects=0 COLO on Xen: http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping COLO on Qemu/KVM: http://wiki.qemu.org/Features/COLO By the needs of capturing response packets from PVM and SVM and finding out whether they are identical, we introduce a new module to qemu networking called colo-proxy. This document describes the design of the colo-proxy module ## Glossary PVM - Primary VM, which provides services to clients. SVM - Secondary VM, a hot standby and replication of PVM. PN - Primary Node, the host which PVM runs on SN - Secondary Node, the host which SVM runs on ## Workflow ## The following image show the qemu networking packet datapath between guest's NIC and qemu's backend in colo-proxy. +---++---+ |PN ||SN | +---+--+ +---+--+ +--+ | +---+ | | +---+ | ++ |chkpoint[socket]---chkpoint ++ |PVM | +---^---+ | | +---+---+ |SVM | || +proxy--v+ | | | || || || | | | || | +---+ | | +TCP/IP stack+ | | | +-v---proxy | +---+ | +-|NIC|--+ | || | | | | | +-|NIC|--+ | +^-++ | | ++ | | | | | +TCP/IP stack-+ | +^--+ | | | +-- | | compare| | -[socket]-forward- | ++ | || | | || | +---++ | | | | | | |seqack | | + | | || +-|--+ | | | | | |adjust | | | | | || || | | | | ++ | | | | +---+-copyforward-[socket]--- +-+ | | | +---|---|+ | | +^+ | | | | | | | | | | | | | x | |+--+---v+ | |+-v-+ | | QEMU | backend | | | QEMU | backend | | ++ (tap)+-+ ++ (tap)+-+ +---++---+ ## Our Idea ## ### Net filter In current QEMU, a packet is transported between networking backend(tap) and qemu network adapter(NIC) directly. Backend and adapter is linked by NetClientState-peer in qemu as following ++ v| +NetClientState+ +---+NetClientState+| |info-type=TAP| ||info-type=NIC|| +--+ |+--+| | *peer +---+| *peer ++ +--++--+ |name=tap0 ||name=e1000 | +--++--+ | ... || ... | +--++--+ In COLO QEMU, we insert a net filter named colo-proxy between backend and adapter like below: typedef struct COLOState { NetClientState nc; NetClientState *peer; } COLOState; +---+NetClientState++NetClientState++ ||info-type=TAP||info-type=NIC| | |+--++--+ | +---+ *peer || *peer ++ | |+--++--+ | | | ||name=tap0 ||name=e1000 | | | | |+--++--+ | | | || ... || ... | | | | |+--++--+ | | | |
Re: [Qemu-devel] [POC] colo-proxy in qemu
Hi Stefan: Thanks for your comments! On Mon, Jul 20, 2015 at 02:42:33PM +0800, Li Zhijian wrote: We are planning to implement colo-proxy in qemu to cache and compare packets. I thought there is a kernel module to do that? Yes, that is the previous solution the COLO sub-community choose to go, but we realized it might be not the best choices, and thus we want to bring discussion back here :) More comments are welcome. Why does the proxy need to be part of the QEMU process? -netdev socket or host network stack features allow you to process packets in a separate process. Hailiang did a very good summary, and we don't need privilege ops so far. The main thing that motivated us to revisit is because the former kernel land driver would be a pure virtualization (and for high availability) component. It may have limited user only at very beginning and thus slower to be accepted. And, people typically like to put component to be in user land as if it can. In addition, as a pure virtualization feature, we guess people in Qemu mailing list may be much more interested to support all the VMMs, such as pure Qemu VMM, KVM, etc. Thx Eddie
Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description
-Original Message- From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com] Sent: Tuesday, May 05, 2015 11:24 PM To: Stefan Hajnoczi Cc: Paolo Bonzini; Wen Congyang; Fam Zheng; Kevin Wolf; Lai Jiangshan; qemu block; Jiang, Yunhong; Dong, Eddie; qemu devel; Max Reitz; Gonglei; Yang Hongyang; zhanghailiang; arm...@redhat.com; jc...@redhat.com Subject: Re: [PATCH COLO v3 01/14] docs: block replication's description * Stefan Hajnoczi (stefa...@redhat.com) wrote: On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote: On 24/04/2015 11:38, Wen Congyang wrote: That can be done with drive-mirror. But I think it's too early for that. Do you mean use drive-mirror instead of quorum? Only before starting up a new secondary. Basically you do a migration with non-shared storage, and then start the secondary in colo mode. But it's only for the failover case. Quorum (or a new block/colo.c driver or filter) is fine for normal colo operation. Perhaps this patch series should mirror the Secondary's disk to a Backup Secondary so that the system can be protected very quickly after failover. I think anyone serious about fault tolerance would deploy a Backup Secondary, otherwise the system cannot survive two failures unless a human administrator is lucky/fast enough to set up a new Secondary. I'd assumed that a higher level management layer would do the allocation of a new secondary after the first failover, so no human need be involved. I agree. The cloud OS, such as open stack, will have the capability to handle the case, together with certain API in VMM side for this (libvirt?). Thx Eddie
Re: [Qemu-devel] [RFC] COLO HA Project proposal
I didn't quite understand a couple of things though, perhaps you can explain: 1) If we ignore the TCP sequence number problem, in an SMP machine don't we get other randomnesses - e.g. which core completes something first, or who wins a lock contention, so the output stream might not be identical - so do those normal bits of randomness cause the machines to flag as out-of-sync? It's about COLO agent, CCing Congyang, he can give the detailed explanation. Let me clarify on this issue. COLO didn't ignore the TCP sequence number, but uses a new implementation to make the sequence number to be best effort identical between the primary VM (PVM) and secondary VM (SVM). Likely, VMM has to synchronize the emulation of randomization number generation mechanism between the PVM and SVM, like the lock-stepping mechanism does. Further mnore, for long TCP connection, we can rely on the (on-demand) VM checkpoint to get the identical Sequence number both in PVM and SVM. Thanks, Eddie
Re: [Qemu-devel] [RFC] COLO HA Project proposal
Let me clarify on this issue. COLO didn't ignore the TCP sequence number, but uses a new implementation to make the sequence number to be best effort identical between the primary VM (PVM) and secondary VM (SVM). Likely, VMM has to synchronize the emulation of randomization number generation mechanism between the PVM and SVM, like the lock-stepping mechanism does. Further mnore, for long TCP connection, we can rely on the (on-demand) VM checkpoint to get the identical Sequence number both in PVM and SVM. That wasn't really my question; I was worrying about other forms of randomness, such as winners of lock contention, and other SMP non-determinisms, and I'm also worried by what proportion of time the system can't recover from a failure due to being unable to distinguish an SVM failure from a randomness issue. Thanks Dave: Whether the randomness value/branch/code path the PVM and SVM may have, It is only a performance issue. COLO never assumes the PVM and SVM has same internal Machine state. From correctness p.o.v, as if the PVM and SVM generate Identical response, we can view the SVM is a valid replica of PVM, and the SVM can take over When the PVM suffers from hardware failure. We can view the client is all the way talking with the SVM, without the notion of PVM. Of course, if the SVM dies, we can regenerate a copy of PVM with a new checkpoint too. The SOCC paper has the detail recovery model :) Thanks, Eddie
Re: [Qemu-devel] [RFC] COLO HA Project proposal
Thanks Dave: Whether the randomness value/branch/code path the PVM and SVM may have, It is only a performance issue. COLO never assumes the PVM and SVM has same internal Machine state. From correctness p.o.v, as if the PVM and SVM generate Identical response, we can view the SVM is a valid replica of PVM, and the SVM can take over When the PVM suffers from hardware failure. We can view the client is all the way talking with the SVM, without the notion of PVM. Of course, if the SVM dies, we can regenerate a copy of PVM with a new checkpoint too. The SOCC paper has the detail recovery model :) I've had a read; I think the bit I was asking about was what you labelled 'D' in that papers fig.4 - so I think that does explain it for me. Very good :) But I also have some more questions: 1) 5.3.3 Web server a) In fig 11 it shows Remus's performance dropping off with the number of threads - why is that? Is it just an increase in the amount of memory changes in each snapshot? I didn't dig into details of them, but document the throughput we observed. I felt a bit stranger too, memory dirty page set may be larger than small connection Case, but I am not sure and that is the data we saw :( b) Is fig 11/12 measured with all of the TCP optimisations shown in fig 13 on? Yes. 2) Did you manage to overcome the issue shown in 5.6 with newer guest kernels degredation - could you just fall back to micro checkpointing if the guests diverge too quickly? In general, I would say the COLO performance for these 2 workloads is pretty good, and I actually didn't list the subsection 5.6 initially. It is the conference sepherd who ask me to add this paragraph to make the paper to be balanced :) In summary, COLO can have very good MP-guest performance comparing with Remus, with the payment of potential optimization/modification effort to guest TCP/IP stack. One solution may Not work for all workloads, but it provides a large room for OSVs to provide customized solution for a specific usage -- which I think is very good for open source biz model: make money through consultant. Huawei technology Ltd. announced to support COLO in there cloud OS, Probably for specific usage too. 3) Was the link between the two servers for synchronisation a low-latency dedicated connection? We use 10 Gbps NIC in the paper, and yes it is dedicated link, but the solution itself doesn't require dedicated link. 4) Did you try an ftp PUT benchmark using external storage - i.e. that wouldn't have the local disc synchronisation overhead? Not yet. External network shared storage works, but today the performance may be not that good, because our optimization so far is still very limited. It is just an initial effort to make the 2 common workloads happy. We believe there are large room ahead to make the response of TCP/IP stack more predictable. Once the basic COLO stuff is ready for product and accepted by the industry, it is possible we may impact TCP community to have this kind of predictability in mind for the future protocol development, which will greatly help the performance. Thx Eddie
[Qemu-devel] RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
A VF interrupt usually happens in 4-8KHZ. How about the virtio? I assume virtio will be widely used together w/ leagcy guest with INTx mode. True, but in time it will be replaced by MSI. Note without vhost virtio is also in userspace, so there are lots of exits anyway for the status register. Few months ago, we noticed the interrupt frequency of PV I/O in previous solution is almost same with physical NIC interrupt which ticks in ~4KHZ. Each PV I/O frontend driver (or its interrupt source) has similar interrupt frequency which means Nx more interrupt. I guess virtio is in similar situation. We then did an optimization for PV IO to mitigate the interrupt to guest by setting interrupt throttle in backend side, because native NIC also does in that way -- so called ITR register in Intel NIC. We can see 30-90% CPU utilization saving depending on how many frontend driver interrupt is employed. Not sure if it is adopted in vhost side. One drawback of course is the latency, but it is mostly tolerable if it is reduced to ~1KHZ. Thx, Eddie
[Qemu-devel] RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
Avi Kivity wrote: I am currently investigating a problem with the a guest running Linux malfunctioning in the NMI watchdog code. The problem is that we don't handle NMI delivery mode for the local APIC LINT0 pin; instead we expect ExtInt deliver mode or that the line is disabled completely. In addition the i8254 timer is tied to the BSP, while in this case the timer can broadcast to all vcpus. There is some code that tries to second-guess the guest and provide it the inputs it sees, but this is fragile. The only way to get reliable operation is to emulate the hardware fully. Now I'd much rather do that in userspace, since it's a lot of sensitive work. I'll enumerate below the general motivation, advantages and disadvantages, and a plan for moving forward. Motivation == The original motivation for moving the PIC and IOAPIC into the kernel was performance, especially for assigned devices. Both devices are high interaction since they deal with interrupts; practically after every interrupt there is either a PIC ioport write, or an APIC bus message, both signalling an EOI operation. Moving the PIT into the kernel allowed us to catch up with missed timer interrupt injections, and speeded up guests which read the PIT counters (e.g. tickless guests). However, modern guests running on modern qemu use MSI extensively; both virtio and assigned devices now have MSI support; and the planned VFIO only supports kernel delivery via MSI anyway; line based interrupts will need to be mediated by userspace. The only high frequency non-MSI interrupt sources remaining are the various timers; and the default one, HPET, is in userspace (and having its own scaling problems as a result). So in theory we can move PIC, IOAPIC, and PIT support to userspace and not lose much performance. Moving the implementation to userspace allows us more flexibility, and more consistency in the implementation of timekeeping for the various clock chips; it becomes easier to follow the nuances of real hardware in this area. Interestingly, while the IOAPIC/PIC code was written we proposed making it independent of the local APIC; had we done so, the move would have been much easier (simply dropping the existing code). Advantages of a move 1. Reduced kernel footprint Good for security, and allows fixing bugs without reboots. 2. Centralized timekeeping Instead of having one solution for PIT timekeeping, and another for RTC and HPET timekeeping, we can have all timer chips in userspace. The local APIC timer still needs to be in the kernel - it is much too high bandwidth to be in userspace; but on the other hand it is very different from the other timer chips. 3. Flexibility Easier to have wierd board layouts (multiple IOAPICs, etc.). Not a very strong advantage. Disadvantages = 1. Still need to keep the old code around for a long while We can't just rip it out - old userspace depends on it. So the security advantages are only with cooperating userspace, and the other advantages only show up. 2. Need to bring the qemu code up to date The current qemu ioapic code lags some way behind the kernel; also need PIT timekeeping 3. May need kernel support for interval-timer-follows-thread Currently the timekeeping code has an optimization which causes the hrtimer that models the PIT to follow the BSP (which is most likely to receive the interrupt); this reduces cpu cross-talk. I don't think the kernel interval timer code has such an optimization; we may need to implement it. 4. Much churn This is a lot of work. 5. Risk We may find out after all this is implemented that performance is not acceptable and all the work will have to be dropped. Besides VF IO interrupt and timer interrupt introduced performance overhead risk, EOI message deliver from lapic to ioapic, which becomes in user land now, may have potential scalability issue. For example, if we have a 64 VCPU guest, if each vcpu has 1khz interrupt (or ipi), the EOI from guest will normally have to involve ioapic module for clearance in 64khz which may have long lock contentio. you may reduce the involvement of ioapic eoi by tracking ioapic pin - vector map in kernel, but not sure if it is clean enough.
[Qemu-devel] RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
Avi Kivity wrote: On 06/09/2010 06:59 PM, Dong, Eddie wrote: Besides VF IO interrupt and timer interrupt introduced performance overhead risk, VF usually uses MSI Typo, I mean PV IO. A VF interrupt usually happens in 4-8KHZ. How about the virtio? I assume virtio will be widely used together w/ leagcy guest with INTx mode. EOI message deliver from lapic to ioapic, Only for non-MSI which becomes in user land now, may have potential scalability issue. For example, if we have a 64 VCPU guest, if each vcpu has 1khz interrupt (or ipi), the EOI from guest will normally have to involve ioapic module for clearance in 64khz which may have long lock contentio. No, EOI for IPI or for local APIC timer does not involve the IOAPIC. you may reduce the involvement of ioapic eoi by tracking ioapic pin- vector map in kernel, but not sure if it is clean enough. It's sufficient to look at TMR, no? For edge triggered I don't think we need the EOI. Mmm, I noticed statements difference between new SDM old SDM. In old SDM, IPI can have both edge and level trigger mode. But new SDM says only INIT can have both choice. Given the new SDM eliminates the level trigger mode, it is OK. But, the amount of churn and risk worries me, so I don't think the move is worthwhile. This also remind me the debate at early stage of KVM when Gregory Haskins is working on any arbitrary choice of irqchip. The patch even at that time (without SMP) is very complicated. I agree from ROI point of view, this movement may be not worthwhile. thx, eddie