RE: [PATCH 1/8] colo: Only support the same qemu version on source and destination

2023-06-22 Thread Dong, Eddie



> -Original Message-
> From: qemu-devel-bounces+eddie.dong=intel@nongnu.org  devel-bounces+eddie.dong=intel@nongnu.org> On Behalf Of Lukas
> Straub
> Sent: Thursday, June 22, 2023 5:15 AM
> To: qemu-devel 
> Cc: Zhang, Hailiang ; Juan Quintela
> ; Peter Xu ; Leonardo Bras
> ; Zhang, Chen 
> Subject: [PATCH 1/8] colo: Only support the same qemu version on source
> and destination
> 
> Signed-off-by: Lukas Straub 
> ---
>  docs/COLO-FT.txt | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/docs/COLO-FT.txt b/docs/COLO-FT.txt index
> 2e760a4aee..8e64480dbd 100644
> --- a/docs/COLO-FT.txt
> +++ b/docs/COLO-FT.txt
> @@ -148,6 +148,8 @@ in test procedure.
>  Note: Here we are running both instances on the same host for testing,
> change the IP Addresses if you want to run it on two hosts. Initially
>  127.0.0.1 is the Primary Host and 127.0.0.2 is the Secondary Host.
> +Note: COLO is a experimental feature, 
Typo: should be "an experimental feature".

>so currently is should only be
Typo: should be "it should ...".

> +used with the same qemu version on sourcee and target.
> 
>  == Startup qemu ==
>  1. Primary:
> --
> 2.39.2




RE: [PATCH 1/3] hw/gpio: add PCA6414 i2c GPIO expander

2023-02-06 Thread Dong, Eddie


> -Original Message-
> From: qemu-devel-bounces+eddie.dong=intel@nongnu.org  devel-bounces+eddie.dong=intel@nongnu.org> On Behalf Of Titus
> Rwantare
> Sent: Monday, February 6, 2023 11:50 AM
> To: peter.mayd...@linaro.org
> Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org; Titus Rwantare
> ; Hao Wu 
> Subject: [PATCH 1/3] hw/gpio: add PCA6414 i2c GPIO expander
> 
> This is a simple i2c device that allows i2c capable devices to have GPIOs.

This patch implements a device model of an I2C-to-GPIO expander for the PCA_xxx parts, right?
Or does it implement a generic/common I2C-to-GPIO device?
I think it is the former. In that case, the commit message is not clear.


> 
> Reviewed-by: Hao Wu 
> Signed-off-by: Titus Rwantare 
> ---
>  hw/arm/Kconfig  |   1 +
>  hw/gpio/meson.build |   1 +
>  hw/gpio/pca_i2c_gpio.c  | 362 
>  hw/gpio/trace-events|   5 +
>  hw/i2c/Kconfig  |   4 +
>  include/hw/gpio/pca_i2c_gpio.h  |  72 +++
>  tests/qtest/meson.build |   1 +
>  tests/qtest/pca_i2c_gpio-test.c | 169 +++
>  8 files changed, 615 insertions(+)
>  create mode 100644 hw/gpio/pca_i2c_gpio.c  create mode 100644
> include/hw/gpio/pca_i2c_gpio.h  create mode 100644
> tests/qtest/pca_i2c_gpio-test.c
> 
> diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig index
> 2d157de9b8..1b533ddd76 100644
> --- a/hw/arm/Kconfig
> +++ b/hw/arm/Kconfig
> @@ -418,6 +418,7 @@ config NPCM7XX
>  select SSI
>  select UNIMP
>  select PCA954X
> +select PCA_I2C_GPIO
> 
>  config FSL_IMX25
>  bool
> diff --git a/hw/gpio/meson.build b/hw/gpio/meson.build index
> b726e6d27a..1e5b602002 100644
> --- a/hw/gpio/meson.build
> +++ b/hw/gpio/meson.build
> @@ -12,3 +12,4 @@ softmmu_ss.add(when: 'CONFIG_OMAP', if_true:
> files('omap_gpio.c'))
>  softmmu_ss.add(when: 'CONFIG_RASPI', if_true: files('bcm2835_gpio.c'))
>  softmmu_ss.add(when: 'CONFIG_ASPEED_SOC', if_true:
> files('aspeed_gpio.c'))
>  softmmu_ss.add(when: 'CONFIG_SIFIVE_GPIO', if_true: files('sifive_gpio.c'))
> +softmmu_ss.add(when: 'CONFIG_PCA_I2C_GPIO', if_true:
> +files('pca_i2c_gpio.c'))
> diff --git a/hw/gpio/pca_i2c_gpio.c b/hw/gpio/pca_i2c_gpio.c new file
> mode 100644 index 00..afae497a22
> --- /dev/null
> +++ b/hw/gpio/pca_i2c_gpio.c

Should this file be located under hw/i2c ?

> @@ -0,0 +1,362 @@
> +/*
> + * NXP PCA I2C GPIO Expanders
> + *
> + * Low-voltage translating 16-bit I2C/SMBus GPIO expander with
> +interrupt output,
> + * reset, and configuration registers
> + *
> + * Datasheet: https://www.nxp.com/docs/en/data-sheet/PCA6416A.pdf
> + *
> + * Copyright 2023 Google LLC
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * These devices, by default, are configured to input only. The
> +configuration is
> + * settable through qom/qmp, or i2c.To set some pins as inputs before
> +boot, use
> + * the following in the board file of the machine:
> + *  object_property_set_uint(Object *obj, const char *name,
> + *   uint64_t value, Error **errp);
> + * specifying name as "gpio_config" and the value as a bitfield of the
> +inputs
> + * e.g. for the pca6416, a value of 0xFFF0, sets pins 0-3 as outputs
> +and
> + * 4-15 as inputs.
> + * This value can also be set at runtime through qmp externally, or by
> + * writing to the config register using i2c.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "hw/gpio/pca_i2c_gpio.h"
> +#include "hw/irq.h"
> +#include "hw/qdev-properties.h"
> +#include "migration/vmstate.h"
> +#include "qapi/visitor.h"
> +#include "qemu/log.h"
> +#include "qemu/module.h"
> +#include "trace.h"
> +
> +/*
> + * compare new_output to curr_output and update irq to match
> new_output
> + *
> + * The Input port registers (registers 0 and 1) reflect the incoming
> +logic
> + * levels of the pins, regardless of whether the pin is defined as an
> +input or
> + * an output by the Configuration register.
> + */
> +static void pca_i2c_update_irqs(PCAGPIOState *ps) {
> +PCAGPIOClass *pc = PCA_I2C_GPIO_GET_CLASS(ps);
> +uint16_t out_diff = ps->new_output ^ ps->curr_output;
> +uint16_t in_diff = ps->new_input ^ ps->curr_input;
> +uint16_t mask, pin_i;
> +
> +if (in_diff || out_diff) {
The spec says " A pin configured as an output cannot cause an interrupt".
Do we need to check out_diff here?

> +for (int i = 0; i < pc->num_pins; i++) {
At least for the PCA6416A, the pin state is described by a 16-bit value, so we can
check all the pins together rather than scanning bit by bit / pin by pin.
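Roughly something like this is what I mean (a standalone, untested sketch using
plain variables in place of the PCAGPIOState fields): compute the changed-output
word once, update the stored state in one step, and only walk the bits that
actually changed:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* stand-ins for ps->config, ps->curr_output, ps->new_output */
        uint16_t config      = 0xff00;   /* 1 = input, 0 = output */
        uint16_t curr_output = 0x0000;
        uint16_t new_output  = 0x00ff;

        /* word-wide: which output-configured pins really changed */
        uint16_t out_changed = (new_output ^ curr_output) & ~config;

        /* update the stored output state in one step */
        curr_output = (curr_output & ~out_changed) | (new_output & out_changed);

        /* only the changed pins would still need a qemu_set_irq() call */
        for (int i = 0; i < 16; i++) {
            if (out_changed & (1u << i)) {
                printf("pin %d -> %d\n", i, !!(new_output & (1u << i)));
            }
        }
        return 0;
    }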

> +mask = BIT(i);
> +/* pin must be configured as an output to be set here */
> +if (out_diff & ~ps->config & mask) {
> +pin_i = mask & ps->new_output;
> +qemu_set_irq(ps->output[i], pin_i > 0);
> +ps->curr_output &= ~mask;
> +ps->curr_output |= pin_i;
> +}
> +
> +if (in_diff & 

RE: [PATCH v2 02/11] hw/watchdog/wdt_aspeed: Extend MMIO range to cover more registers

2022-12-31 Thread Dong, Eddie
> When booting the Zephyr demo in [1] we get:
> 
>   aspeed.io: unimplemented device write (size 4, offset 0x185128, value
> 0x030f1ff1) <--
>   aspeed.io: unimplemented device write (size 4, offset 0x18512c, value
> 0x03f1)
> 
> This corresponds to this Zephyr code [2]:
> 
>   static int aspeed_wdt_init(const struct device *dev)
>   {
> const struct aspeed_wdt_config *config = dev->config;
> struct aspeed_wdt_data *const data = dev->data;
> uint32_t reg_val;
> 
> /* disable WDT by default */
> reg_val = sys_read32(config->ctrl_base + WDT_CTRL_REG);
> reg_val &= ~WDT_CTRL_ENABLE;
> sys_write32(reg_val, config->ctrl_base + WDT_CTRL_REG);
> 
> sys_write32(data->rst_mask1,
> config->ctrl_base + WDT_SW_RESET_MASK1_REG);   <--
> sys_write32(data->rst_mask2,
> config->ctrl_base + WDT_SW_RESET_MASK2_REG);
> 
> return 0;
>   }
> 
> The register definitions are [3]:
> 
>   #define WDT_RELOAD_VAL_REG  0x0004
>   #define WDT_RESTART_REG 0x0008
>   #define WDT_CTRL_REG0x000C
>   #define WDT_TIMEOUT_STATUS_REG  0x0010
>   #define WDT_TIMEOUT_STATUS_CLR_REG  0x0014
>   #define WDT_RESET_MASK1_REG 0x001C
>   #define WDT_RESET_MASK2_REG 0x0020
>   #define WDT_SW_RESET_MASK1_REG  0x0028   <--
>   #define WDT_SW_RESET_MASK2_REG  0x002C
>   #define WDT_SW_RESET_CTRL_REG   0x0024
> 
> Currently QEMU only cover a MMIO region of size 0x20:
> 
>   #define ASPEED_WDT_REGS_MAX(0x20 / 4)
> 
> Change to map the whole 'iosize' which might be bigger, covering the other

The root cause is that ASPEED_WDT_REGS_MAX is too small, right?
Probably QEMU is emulating an old version of the hardware.

Given the meaning of ASPEED_WDT_REGS_MAX, shouldn't it be larger than 'iosize'?
Probably ASPEED_WDT_REGS_MAX should be per device type (aspeed_2400/2500),
while 'iosize' is for all devices, with its initial value coming from the
per-device-type REGS_MAX.

> registers. The MemoryRegionOps read/write handlers will report the accesses
> as out-of-bounds guest-errors, but the next commit will report them as
> unimplemented.
> 
> [1] https://github.com/AspeedTech-BMC/zephyr/releases/tag/v00.01.07
> [2] https://github.com/AspeedTech-BMC/zephyr/commit/2e99f10ac27b
> [3] https://github.com/AspeedTech-
> BMC/zephyr/blob/v00.01.08/drivers/watchdog/wdt_aspeed.c#L31
> 
> Reviewed-by: Peter Delevoryas 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  hw/watchdog/wdt_aspeed.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/watchdog/wdt_aspeed.c b/hw/watchdog/wdt_aspeed.c index
> 958725a1b5..eefca31ae4 100644
> --- a/hw/watchdog/wdt_aspeed.c
> +++ b/hw/watchdog/wdt_aspeed.c
> @@ -260,6 +260,7 @@ static void aspeed_wdt_realize(DeviceState *dev,
> Error **errp)  {
>  SysBusDevice *sbd = SYS_BUS_DEVICE(dev);
>  AspeedWDTState *s = ASPEED_WDT(dev);
> +AspeedWDTClass *awc = ASPEED_WDT_GET_CLASS(dev);
> 
>  assert(s->scu);
> 
> @@ -271,7 +272,7 @@ static void aspeed_wdt_realize(DeviceState *dev,
> Error **errp)
>  s->pclk_freq = PCLK_HZ;
> 
>  memory_region_init_io(>iomem, OBJECT(s), _wdt_ops, s,
> -  TYPE_ASPEED_WDT, ASPEED_WDT_REGS_MAX * 4);
> +  TYPE_ASPEED_WDT, awc->iosize);
>  sysbus_init_mmio(sbd, >iomem);
>  }
> 
> --
> 2.38.1
> 



RE: [PATCH v2 01/11] hw/watchdog/wdt_aspeed: Rename MMIO region size as 'iosize'

2022-12-31 Thread Dong, Eddie
> 
> Avoid confusing two different things:
> - the WDT I/O region size ('iosize')
> - at which offset the SoC map the WDT ('offset') While it is often the same, 
> we
> can map smaller region sizes at larger offsets.
> 
> Here we are interested in the I/O region size, so rename as 'iosize'.
> 
> Reviewed-by: Peter Delevoryas 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  hw/arm/aspeed_ast10x0.c  | 2 +-
>  hw/arm/aspeed_ast2600.c  | 2 +-
>  hw/arm/aspeed_soc.c  | 2 +-
>  hw/watchdog/wdt_aspeed.c | 8 
>  include/hw/watchdog/wdt_aspeed.h | 2 +-
>  5 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/arm/aspeed_ast10x0.c b/hw/arm/aspeed_ast10x0.c index
> 4d0b9b115f..122b3fd3f3 100644
> --- a/hw/arm/aspeed_ast10x0.c
> +++ b/hw/arm/aspeed_ast10x0.c
> @@ -325,7 +325,7 @@ static void aspeed_soc_ast1030_realize(DeviceState
> *dev_soc, Error **errp)
>  return;
>  }
>  aspeed_mmio_map(s, SYS_BUS_DEVICE(>wdt[i]), 0,
> -sc->memmap[ASPEED_DEV_WDT] + i * awc->offset);
> +sc->memmap[ASPEED_DEV_WDT] + i * awc->iosize);


What does the specification say here (I didn't find it)?

I read this as the case where multiple WDTs are implemented, and 'offset' means
the distance of WDT[1]'s I/O registers from WDT[0]'s. Or does the spec explicitly
say that WDT[1]'s I/O registers are located right after WDT[0]'s without any gap
(in which case 'iosize' is indeed the better name)?

>  }
> 
>  /* GPIO */
> diff --git a/hw/arm/aspeed_ast2600.c b/hw/arm/aspeed_ast2600.c index
> cd75465c2b..a79e05ddbd 100644
> --- a/hw/arm/aspeed_ast2600.c
> +++ b/hw/arm/aspeed_ast2600.c
> @@ -472,7 +472,7 @@ static void aspeed_soc_ast2600_realize(DeviceState
> *dev, Error **errp)
>  return;
>  }
>  aspeed_mmio_map(s, SYS_BUS_DEVICE(>wdt[i]), 0,
> -sc->memmap[ASPEED_DEV_WDT] + i * awc->offset);
> +sc->memmap[ASPEED_DEV_WDT] + i * awc->iosize);
>  }
> 
>  /* RAM */
> diff --git a/hw/arm/aspeed_soc.c b/hw/arm/aspeed_soc.c index
> b05b9dd416..2c0924d311 100644
> --- a/hw/arm/aspeed_soc.c
> +++ b/hw/arm/aspeed_soc.c
> @@ -393,7 +393,7 @@ static void aspeed_soc_realize(DeviceState *dev,
> Error **errp)
>  return;
>  }
>  aspeed_mmio_map(s, SYS_BUS_DEVICE(>wdt[i]), 0,
> -sc->memmap[ASPEED_DEV_WDT] + i * awc->offset);
> +sc->memmap[ASPEED_DEV_WDT] + i * awc->iosize);
>  }
> 
>  /* RAM  */
> diff --git a/hw/watchdog/wdt_aspeed.c b/hw/watchdog/wdt_aspeed.c index
> d753693a2e..958725a1b5 100644
> --- a/hw/watchdog/wdt_aspeed.c
> +++ b/hw/watchdog/wdt_aspeed.c
> @@ -309,7 +309,7 @@ static void aspeed_2400_wdt_class_init(ObjectClass
> *klass, void *data)
>  AspeedWDTClass *awc = ASPEED_WDT_CLASS(klass);
> 
>  dc->desc = "ASPEED 2400 Watchdog Controller";
> -awc->offset = 0x20;
> +awc->iosize = 0x20;
>  awc->ext_pulse_width_mask = 0xff;
>  awc->reset_ctrl_reg = SCU_RESET_CONTROL1;
>  awc->wdt_reload = aspeed_wdt_reload; @@ -346,7 +346,7 @@ static void
> aspeed_2500_wdt_class_init(ObjectClass *klass, void *data)
>  AspeedWDTClass *awc = ASPEED_WDT_CLASS(klass);
> 
>  dc->desc = "ASPEED 2500 Watchdog Controller";
> -awc->offset = 0x20;
> +awc->iosize = 0x20;
>  awc->ext_pulse_width_mask = 0xf;
>  awc->reset_ctrl_reg = SCU_RESET_CONTROL1;
>  awc->reset_pulse = aspeed_2500_wdt_reset_pulse; @@ -369,7 +369,7
> @@ static void aspeed_2600_wdt_class_init(ObjectClass *klass, void *data)
>  AspeedWDTClass *awc = ASPEED_WDT_CLASS(klass);
> 
>  dc->desc = "ASPEED 2600 Watchdog Controller";
> -awc->offset = 0x40;
> +awc->iosize = 0x40;
>  awc->ext_pulse_width_mask = 0xf; /* TODO */
>  awc->reset_ctrl_reg = AST2600_SCU_RESET_CONTROL1;
>  awc->reset_pulse = aspeed_2500_wdt_reset_pulse; @@ -392,7 +392,7
> @@ static void aspeed_1030_wdt_class_init(ObjectClass *klass, void *data)
>  AspeedWDTClass *awc = ASPEED_WDT_CLASS(klass);
> 
>  dc->desc = "ASPEED 1030 Watchdog Controller";
> -awc->offset = 0x80;
> +awc->iosize = 0x80;
>  awc->ext_pulse_width_mask = 0xf; /* TODO */
>  awc->reset_ctrl_reg = AST2600_SCU_RESET_CONTROL1;
>  awc->reset_pulse = aspeed_2500_wdt_reset_pulse; diff --git
> a/include/hw/watchdog/wdt_aspeed.h
> b/include/hw/watchdog/wdt_aspeed.h
> index dfa5dfa424..db91ee6b51 100644
> --- a/include/hw/watchdog/wdt_aspeed.h
> +++ b/include/hw/watchdog/wdt_aspeed.h
> @@ -40,7 +40,7 @@ struct AspeedWDTState {  struct AspeedWDTClass {
>  SysBusDeviceClass parent_class;
> 
> -uint32_t offset;
> +uint32_t iosize;
>  uint32_t ext_pulse_width_mask;
>  uint32_t reset_ctrl_reg;
>  void (*reset_pulse)(AspeedWDTState *s, uint32_t property);
> --
> 2.38.1
> 



RE: [PATCH v2 3/3] hw/timer: ibex_timer.c: Add support for writes to mtime

2022-09-29 Thread Dong, Eddie
Using a union would reflect this better. It also avoids converting two 32-bit
registers into 64 bits, as the code above does; ibex_timer_update_irqs() does
this conversion as well.

It took me a bit of time, but now I think I understand what you mean: a union 
of 2 uint32_t's (perhaps packed into a struct or an array) and a uint64_t would 
make it easier to access the components, is that what you mean? That is pretty 
handy, thanks.


YES.  You decide
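For illustration, here is a rough standalone sketch of the kind of union I have
in mind (untested; the names are placeholders, not the actual ibex_timer fields,
and the u32[] index order assumes a little-endian host):

    #include <stdint.h>
    #include <stdio.h>

    /* one 64-bit counter, with the two 32-bit register halves overlaid */
    typedef union {
        uint64_t u64;
        uint32_t u32[2];   /* [0] = low word, [1] = high word on little-endian */
    } Reg64;

    int main(void)
    {
        Reg64 mtime = { .u64 = 0 };

        /* a 32-bit bus write to the upper/lower register touches one half... */
        mtime.u32[1] = 0x12345678;   /* TIMER_UPPER0-style write */
        mtime.u32[0] = 0x9abcdef0;   /* TIMER_LOWER0-style write */

        /* ...and the full value is read back without any shift-and-or */
        printf("mtime = 0x%016llx\n", (unsigned long long)mtime.u64);
        return 0;
    }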

Thanks Eddie


RE: [PATCH v2 3/3] hw/timer: ibex_timer.c: Add support for writes to mtime

2022-09-28 Thread Dong, Eddie


From: Tyler Ng  
Sent: Monday, September 26, 2022 4:38 PM
To: Dong, Eddie 
Cc: open list:RISC-V ; qemu-devel@nongnu.org Developers 
; Alistair Francis ; Meng, Bin 
; Thomas Huth ; Paolo Bonzini 
; Palmer Dabbelt ; Laurent Vivier 

Subject: Re: [PATCH v2 3/3] hw/timer: ibex_timer.c: Add support for writes to 
mtime

Hi Eddie,

On Mon, Sep 26, 2022 at 2:06 PM Dong, Eddie <mailto:eddie.d...@intel.com> wrote:


> -Original Message-
> From: Qemu-devel mailto:intel@nongnu.org>
> On Behalf Of Tyler Ng
> Sent: Thursday, September 22, 2022 8:59 AM
> To: open list:RISC-V <mailto:qemu-ri...@nongnu.org>; 
> mailto:qemu-devel@nongnu.org
> Developers <mailto:qemu-devel@nongnu.org>
> Cc: Alistair Francis <mailto:alistair.fran...@wdc.com>; Meng, Bin
> <mailto:bin.m...@windriver.com>; Thomas Huth <mailto:th...@redhat.com>; Paolo
> Bonzini <mailto:pbonz...@redhat.com>; Palmer Dabbelt 
> <mailto:pal...@dabbelt.com>;
> Laurent Vivier <mailto:lviv...@redhat.com>
> Subject: [PATCH v2 3/3] hw/timer: ibex_timer.c: Add support for writes to
> mtime
> 
> 1. Adds fields to hold the value of mtime in timer_upper0 and timer_lower0.
> 
> 2. Changes the read and write functions to use the mtime fields.
> 
> 3. Updates the value of mtime in update_mtime() by extrapolating the time
> elapsed. This will need to change if/when the prescalar is implemented.
> 
> 4. Adds a qtest for the ibex timer.
> 
> Signed-off-by: Tyler Ng <mailto:t...@rivosinc.com>
> ---
>  hw/timer/ibex_timer.c         |  98 +--
>  include/hw/timer/ibex_timer.h |   6 ++
>  tests/qtest/ibex-timer-test.c | 178 ++
>  tests/qtest/meson.build       |   3 +-
>  4 files changed, 256 insertions(+), 29 deletions(-)  create mode 100644
> tests/qtest/ibex-timer-test.c
> 
> diff --git a/hw/timer/ibex_timer.c b/hw/timer/ibex_timer.c index
> d8b8e4e1f6..4230d60e85 100644
> --- a/hw/timer/ibex_timer.c
> +++ b/hw/timer/ibex_timer.c
> @@ -52,28 +52,56 @@ REG32(INTR_STATE, 0x118)  REG32(INTR_TEST,
> 0x11C)
>      FIELD(INTR_TEST, T_0, 0, 1)
> 
> -static uint64_t cpu_riscv_read_rtc(uint32_t timebase_freq)
> +static inline uint64_t get_mtime(void *opaque)
>  {
> -    return muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
> -                    timebase_freq, NANOSECONDS_PER_SECOND);
> +    IbexTimerState *s = opaque;
> +    return (s->timer_lower0) | ((uint64_t) s->timer_upper0 << 32);
>  }
> 
> -static void ibex_timer_update_irqs(IbexTimerState *s)
> +/*
> + * The goal of this function is to:
> + * 1. Check if the timer is enabled. If not, return false,
> + * 2. Calculate the amount of time that has passed since.
> + * 3. Extrapolate the number of ticks that have passed, and add it to 
> `mtime`.
> + * 4. Return true.
> + */
> +static bool update_mtime(IbexTimerState *s)
>  {
> -    uint64_t value = s->timer_compare_lower0 |
> -                         ((uint64_t)s->timer_compare_upper0 << 32);
So the hardware actually implements a 64-bit register (used on a 32-bit CPU); why
not use a union for this?
Same for:
+    uint32_t timer_lower0;
+    uint32_t timer_upper0;
I'm not too sure what a C union would do for us here?
 
I think what the hardware really implements is a 64-bit register, with a 32-bit
interface to access it.
struct IbexTimerState actually defines both of them:
  uint64_t mtimecmp;
  uint32_t timer_compare_lower0;
  uint32_t timer_compare_upper0;

Using a union would reflect this better. It also avoids converting two 32-bit
registers into 64 bits, as the code above does; ibex_timer_update_irqs() does
this conversion as well.


> -    uint64_t next, diff;
> -    uint64_t now = cpu_riscv_read_rtc(s->timebase_freq);
> -
>      if (!(s->timer_ctrl & R_CTRL_ACTIVE_MASK)) {
> -        /* Timer isn't active */
> +        return false;
> +    }
> +    /* Get the time then extrapolate the number of ticks that have elapsed */
> +    uint64_t mtime = get_mtime(s);
> +    int64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
> +    int64_t elapsed = now - s->timer_last_update;
> +    if (elapsed < 0) {
> +        /* We jumped back in time. */
> +        mtime -= muldiv64((uint64_t)(-elapsed), s->timebase_freq,
> +                           NANOSECONDS_PER_SECOND);
> +    } else {
> +        mtime += muldiv64(elapsed, s->timebase_freq,
> NANOSECONDS_PER_SECOND);
> +    }
> +s->timer_lower0 = mtime & 0xffffffff;
> +s->timer_upper0 = (mtime >> 32) & 0xffffffff;
> +    /* update last-checkpoint timestamp */
> +    s->timer_last_update = now;
> +    return true;
> +}
> +
> +static void ibex_timer_update_irqs(IbexTimerSt

RE: [PATCH v2 1/3] hw/watchdog: wdt_ibex_aon.c: Implement the watchdog for the OpenTitan

2022-09-28 Thread Dong, Eddie
Hi Tyler:
Some other comments embedded.


> +
> +static void ibex_aon_write(void *opaque,
> +   hwaddr addr, uint64_t value,
> +   unsigned int size) {
> +IbexAONTimerState *s = IBEX_AON_TIMER(opaque);
> +
> +/* When writing, need to consider if the configuration is locked */
> +switch (addr) {
> +case A_ALERT_TEST:
> +s->regs[R_ALERT_TEST] = value & 0x1;
> +break;
> +case A_WKUP_CTRL:
> +qemu_log_mask(LOG_UNIMP, "%s: AON wkup not implemented",
> __func__);
> +break;
> +case A_WKUP_THOLD:
> +qemu_log_mask(LOG_UNIMP, "%s: AON wkup not implemented",
> __func__);
> +break;
> +case A_WKUP_COUNT:
> +qemu_log_mask(LOG_UNIMP, "%s: AON wkup not implemented",
> __func__);
> +break;
> +case A_WDOG_REGWEN:
> +if (s->regs[R_WDOG_REGWEN] == IBEX_AONTIMER_WDOG_UNLOCKED)
> {
> +s->regs[R_WDOG_REGWEN] = value &
> IBEX_AONTIMER_WDOG_REGWEN_MASK;
> +} else {
> +qemu_log_mask(LOG_GUEST_ERROR,
> +"%s: AON Timer configuration locked\n", 
> __func__);
> +}
> +break;
> +case A_WDOG_CTRL:
> +if (s->regs[R_WDOG_REGWEN] == IBEX_AONTIMER_WDOG_UNLOCKED)
> {
> +if (!(s->regs[R_WDOG_CTRL] & IBEX_AONTIMER_WDOG_CTRL_MASK))
> {
> +s->wdog_last = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
> +}
> +s->regs[R_WDOG_CTRL] = value &
> IBEX_AONTIMER_WDOG_CTRL_MASK;
> +ibex_aon_update_bark_timer(s);
> +ibex_aon_update_bite_timer(s);

In case the guest clears the ENABLE bit of the CTRL register, we want to remove
the registered timers, right?
The current code path does nothing when ENABLE is cleared. That is incorrect and
may lead to the bark and bite events firing again.
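Roughly, I would expect the A_WDOG_CTRL case to look something like this
(untested sketch; 's->biter' is only my placeholder for whatever the bite timer
field is actually called in this patch):

    case A_WDOG_CTRL:
        if (s->regs[R_WDOG_REGWEN] == IBEX_AONTIMER_WDOG_UNLOCKED) {
            if (!(s->regs[R_WDOG_CTRL] & IBEX_AONTIMER_WDOG_CTRL_MASK)) {
                s->wdog_last = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
            }
            s->regs[R_WDOG_CTRL] = value & IBEX_AONTIMER_WDOG_CTRL_MASK;
            if (s->regs[R_WDOG_CTRL] & IBEX_AONTIMER_WDOG_CTRL_MASK) {
                ibex_aon_update_bark_timer(s);
                ibex_aon_update_bite_timer(s);
            } else {
                /* watchdog disabled: cancel any pending bark/bite events */
                timer_del(s->barker);
                timer_del(s->biter);
            }
        } else {
            qemu_log_mask(LOG_GUEST_ERROR,
                          "%s: AON Timer configuration locked\n", __func__);
        }
        break;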

> +} else {
> +qemu_log_mask(LOG_GUEST_ERROR,
> +"%s: AON Timer configuration locked\n", 
> __func__);
> +}
> +break;
> +case A_WDOG_BARK_THOLD:
> +if (s->regs[R_WDOG_REGWEN] == IBEX_AONTIMER_WDOG_UNLOCKED)
> {
> +s->regs[R_WDOG_BARK_THOLD] = value;
> +ibex_aon_update_bark_timer(s);
> +} else {
> +qemu_log_mask(LOG_GUEST_ERROR,
> +"%s: AON Timer configuration locked\n", 
> __func__);
> +}
> +break;
> +case A_WDOG_BITE_THOLD:
> +if (s->regs[R_WDOG_REGWEN] == IBEX_AONTIMER_WDOG_UNLOCKED)
> {
> +s->regs[R_WDOG_BITE_THOLD] = value;
> +ibex_aon_update_bite_timer(s);
> +} else {
> +qemu_log_mask(LOG_GUEST_ERROR,
> +"%s: AON Timer configuration locked\n", 
> __func__);
> +}
> +break;
> +case A_WDOG_COUNT:
> +s->regs[R_WDOG_COUNT] = value;
> +s->wdog_last = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
> +ibex_aon_update_bark_timer(s);
> +ibex_aon_update_bite_timer(s);
> +break;
> +case A_INTR_STATE:
> +/* Service the IRQs by writing 1 to the appropriate field */
> +if ((value & R_INTR_STATE_WDOG_MASK)) {

Do we need to clear the bit in s->regs[R_INTR_STATE] here?
Otherwise, a guest read of the register returns a stale value.
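For example, something like this at the top of the branch (sketch, reusing the
mask macro from this patch) would keep the register view consistent:

    if ((value & R_INTR_STATE_WDOG_MASK)) {
        s->regs[R_INTR_STATE] &= ~R_INTR_STATE_WDOG_MASK;   /* ack the interrupt */
        qemu_irq_lower(s->bark_irq);
        /* ... rest of the handling as in this patch ... */
    }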

> +qemu_irq_lower(s->bark_irq);
> +ibex_aon_update_count(s);
> +int64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
> +/*
> + * We need to make sure that COUNT < *_THOLD. If it isn't, an
> + * interrupt is generated the next clock cycle
> + */
> +if (s->regs[R_WDOG_COUNT] >= s->regs[R_WDOG_BARK_THOLD]) {
> +if (now + IBEX_AONTIMER_PERIOD_NS < now) {
> +timer_mod_ns(s->barker, INT64_MAX);

Overflow of a 64-bit timer (in units of ns) is so far in the future that it can
never happen in practice.

> +} else {
> +timer_mod_ns(s->barker, now + IBEX_AONTIMER_PERIOD_NS);

We can call the fire API (ibex_aon_barker_expired) directly here.

> +}
> +}
> +}
> +break;
> +case A_INTR_TEST:
> +qemu_log_mask(LOG_UNIMP, "%s: Intr test not implemented",
> __func__);
> +break;
> +case A_WKUP_CAUSE:
> +qemu_log_mask(LOG_UNIMP,
> +"%s: Wkup cause not implemented", __func__);
> +break;
> +default:
> +qemu_log_mask(LOG_GUEST_ERROR,
> +"%s: Invalid register read 0x%" HWADDR_PRIx "\n",
> +__func__, addr);
> +break;
> +}
> +}
> +
> +static void ibex_aon_enter_reset(Object *obj, ResetType type) {
> +IbexAONTimerState *s = IBEX_AON_TIMER(obj);
> +s->regs[R_ALERT_TEST]  = 0x0;
> +s->regs[R_WKUP_CTRL]   = 0x0;
> +s->regs[R_WKUP_THOLD]  = 0x0;
> +s->regs[R_WKUP_COUNT]  = 0x0;
> +s->regs[R_WDOG_REGWEN] = 0x1;
> +

RE: [PATCH v2 1/3] hw/watchdog: wdt_ibex_aon.c: Implement the watchdog for the OpenTitan

2022-09-27 Thread Dong, Eddie
Hi Tyler:

> +}
> +
> +/* Called when the bark timer expires */ static void
> +ibex_aon_barker_expired(void *opaque) {
This may happen during ibex_aon_update_count(), right? 

> +    IbexAONTimerState *s = IBEX_AON_TIMER(opaque);
> +    if (ibex_aon_update_count(s) &&
This may be invoked while ibex_aon_update_count() is executing.
The nested call may leave s->regs[R_WDOG_COUNT] and s->wdog_last inconsistent.

Can you elaborate? The timers for bark and bite are not updated in 
"update_count".

When the first ibex_aon_update_count() is executing and has reached the point PPP
(after updating s->regs[R_WDOG_COUNT] but before updating s->wdog_last), the bark
timer may fire.
Inside ibex_aon_barker_expired(), it calls ibex_aon_update_count() again (a nested
call), which updates s->regs[R_WDOG_COUNT] and s->wdog_last with new values.
After the nested execution ends and control returns to the initial point (PPP),
s->wdog_last is overwritten with the value from the first execution, which leaves
s->regs[R_WDOG_COUNT] and s->wdog_last mismatched.

This case may not be triggered in the normal path, but if the guest reads
A_WDOG_COUNT, the first ibex_aon_update_count() does execute, which exposes the
potential issue.

I think we can solve the problem by not updating s->regs[R_WDOG_COUNT] and
s->wdog_last in the timer callbacks. The update is not necessary there, given
that the stored value is not the current COUNT anyway; we only need to update
them when the guest writes COUNT.
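To illustrate, a rough sketch of what I mean (untested; 'ibex_aon_current_count'
is a name I made up, it ignores the clock-jumping-backwards case for brevity, and
it assumes IBEX_AONTIMER_PERIOD_NS is the length of one watchdog counter tick in
nanoseconds):

    /* compute the current COUNT without writing anything back */
    static uint32_t ibex_aon_current_count(IbexAONTimerState *s)
    {
        int64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
        int64_t elapsed = now - s->wdog_last;

        return s->regs[R_WDOG_COUNT] + elapsed / IBEX_AONTIMER_PERIOD_NS;
    }

    /* Called when the bark timer expires */
    static void ibex_aon_barker_expired(void *opaque)
    {
        IbexAONTimerState *s = IBEX_AON_TIMER(opaque);

        if ((s->regs[R_WDOG_CTRL] & IBEX_AONTIMER_WDOG_CTRL_MASK) &&
            ibex_aon_current_count(s) >= s->regs[R_WDOG_BARK_THOLD]) {
            s->regs[R_INTR_STATE] |= R_INTR_STATE_WDOG_MASK;
            qemu_irq_raise(s->bark_irq);
        }
    }

This way the stored COUNT/wdog_last pair is only ever modified from the MMIO
write path, so the timer callbacks can never race with it.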


> +        s->regs[R_WDOG_COUNT] >= s->regs[R_WDOG_BARK_THOLD]) {
> +        s->regs[R_INTR_STATE] |= (1 << 1);
> +        qemu_irq_raise(s->bark_irq);
> +    }
> +}
> +


THX  Eddie


RE: [PATCH v2 3/3] hw/timer: ibex_timer.c: Add support for writes to mtime

2022-09-26 Thread Dong, Eddie


> -Original Message-
> From: Qemu-devel 
> On Behalf Of Tyler Ng
> Sent: Thursday, September 22, 2022 8:59 AM
> To: open list:RISC-V ; qemu-devel@nongnu.org
> Developers 
> Cc: Alistair Francis ; Meng, Bin
> ; Thomas Huth ; Paolo
> Bonzini ; Palmer Dabbelt ;
> Laurent Vivier 
> Subject: [PATCH v2 3/3] hw/timer: ibex_timer.c: Add support for writes to
> mtime
> 
> 1. Adds fields to hold the value of mtime in timer_upper0 and timer_lower0.
> 
> 2. Changes the read and write functions to use the mtime fields.
> 
> 3. Updates the value of mtime in update_mtime() by extrapolating the time
> elapsed. This will need to change if/when the prescalar is implemented.
> 
> 4. Adds a qtest for the ibex timer.
> 
> Signed-off-by: Tyler Ng 
> ---
>  hw/timer/ibex_timer.c |  98 +--
>  include/hw/timer/ibex_timer.h |   6 ++
>  tests/qtest/ibex-timer-test.c | 178 ++
>  tests/qtest/meson.build   |   3 +-
>  4 files changed, 256 insertions(+), 29 deletions(-)  create mode 100644
> tests/qtest/ibex-timer-test.c
> 
> diff --git a/hw/timer/ibex_timer.c b/hw/timer/ibex_timer.c index
> d8b8e4e1f6..4230d60e85 100644
> --- a/hw/timer/ibex_timer.c
> +++ b/hw/timer/ibex_timer.c
> @@ -52,28 +52,56 @@ REG32(INTR_STATE, 0x118)  REG32(INTR_TEST,
> 0x11C)
>  FIELD(INTR_TEST, T_0, 0, 1)
> 
> -static uint64_t cpu_riscv_read_rtc(uint32_t timebase_freq)
> +static inline uint64_t get_mtime(void *opaque)
>  {
> -return muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
> -timebase_freq, NANOSECONDS_PER_SECOND);
> +IbexTimerState *s = opaque;
> +return (s->timer_lower0) | ((uint64_t) s->timer_upper0 << 32);
>  }
> 
> -static void ibex_timer_update_irqs(IbexTimerState *s)
> +/*
> + * The goal of this function is to:
> + * 1. Check if the timer is enabled. If not, return false,
> + * 2. Calculate the amount of time that has passed since.
> + * 3. Extrapolate the number of ticks that have passed, and add it to 
> `mtime`.
> + * 4. Return true.
> + */
> +static bool update_mtime(IbexTimerState *s)
>  {
> -uint64_t value = s->timer_compare_lower0 |
> - ((uint64_t)s->timer_compare_upper0 << 32);
So the hardware actually implements 64 bits register (used in 32 bits CPU), why 
not use an union for this?
Same for:
+uint32_t timer_lower0;
+uint32_t timer_upper0;

> -uint64_t next, diff;
> -uint64_t now = cpu_riscv_read_rtc(s->timebase_freq);
> -
>  if (!(s->timer_ctrl & R_CTRL_ACTIVE_MASK)) {
> -/* Timer isn't active */
> +return false;
> +}
> +/* Get the time then extrapolate the number of ticks that have elapsed */
> +uint64_t mtime = get_mtime(s);
> +int64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
> +int64_t elapsed = now - s->timer_last_update;
> +if (elapsed < 0) {
> +/* We jumped back in time. */
> +mtime -= muldiv64((uint64_t)(-elapsed), s->timebase_freq,
> +   NANOSECONDS_PER_SECOND);
> +} else {
> +mtime += muldiv64(elapsed, s->timebase_freq,
> NANOSECONDS_PER_SECOND);
> +}
> +s->timer_lower0 = mtime & 0xffffffff;
> +s->timer_upper0 = (mtime >> 32) & 0xffffffff;
> +/* update last-checkpoint timestamp */
> +s->timer_last_update = now;
> +return true;
> +}
> +
> +static void ibex_timer_update_irqs(IbexTimerState *s) {
> +if (!update_mtime(s)) {
> +/* Timer is not enabled? */
>  return;
>  }
> +uint64_t mtimecmp = s->timer_compare_lower0 |
> + ((uint64_t)s->timer_compare_upper0 << 32);
> +uint64_t mtime = get_mtime(s);
> 
>  /* Update the CPUs mtimecmp */
> -s->mtimecmp = value;
> +s->mtimecmp = mtimecmp;
> 
> -if (s->mtimecmp <= now) {
> +if (s->mtimecmp <= mtime) {
>  /*
>   * If the mtimecmp was in the past raise the interrupt now.
>   */
> @@ -84,18 +112,17 @@ static void ibex_timer_update_irqs(IbexTimerState
> *s)
>  }
>  return;
>  }
> -
> -/* Setup a timer to trigger the interrupt in the future */
> +/* Update timers: setup a timer to trigger the interrupt in the
> + future */
>  qemu_irq_lower(s->m_timer_irq);
>  qemu_set_irq(s->irq, false);
> -
> -diff = s->mtimecmp - now;
> -next = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
> - muldiv64(diff,
> -  NANOSECONDS_PER_SECOND,
> -  s->timebase_freq);
> -
> -if (next < qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL)) {
> +/* Compute the difference, and set a timer for the next update. */
> +const uint64_t diff = mtimecmp - mtime;
> +const int64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
> +const uint64_t towait = muldiv64(diff, NANOSECONDS_PER_SECOND,
> + s->timebase_freq);
> +/* timer_mod takes in a int64_t, not uint64_t! Need 

RE: [PATCH v2 1/3] hw/watchdog: wdt_ibex_aon.c: Implement the watchdog for the OpenTitan

2022-09-26 Thread Dong, Eddie


> -Original Message-
> From: Qemu-devel 
> On Behalf Of Tyler Ng
> Sent: Thursday, September 22, 2022 8:59 AM
> To: open list:RISC-V ; qemu-devel@nongnu.org
> Developers 
> Cc: Alistair Francis ; Meng, Bin
> ; Thomas Huth ; Paolo
> Bonzini ; Palmer Dabbelt ;
> Laurent Vivier 
> Subject: [PATCH v2 1/3] hw/watchdog: wdt_ibex_aon.c: Implement the
> watchdog for the OpenTitan
> 
> This commit adds most of an implementation of the OpenTitan Always-On
> Timer. The documentation for this timer is found here:
> 
> https://docs.opentitan.org/hw/ip/aon_timer/doc/
> 
> Using commit 217a0168ba118503c166a9587819e3811eeb0c0c
> 
> The implementation includes most of the watchdog features; it does not
> implement the wakeup timer.
> 
> An important note: the OpenTitan board uses the sifive_plic. The plic will not
> be able to claim the bark interrupt (159) because the sifive plic sets
> priority[159], but checks priority[158] for the priority, so it thinks that 
> the
> interrupt's priority is 0 (effectively disabled).
> 
> Changed Files:
> hw/riscv/Kconfig: Add configuration for the watchdog.
> hw/riscv/opentitan.c: Connect AON Timer to the OpenTitan board.
> 
> hw/watchdog/Kconfig: Configuration for the watchdog.
> hw/watchdog/meson.build: Compile the watchdog.
> hw/watchdog/wdt_ibex_aon.c: The watchdog itself.
> 
> include/hw/riscv/opentitan.h: Add watchdog bark/wakeup IRQs and timer.
> include/hw/watchdog/wdt_ibex_aon.h: Add watchdog.
> 
> tests/qtest/ibex-aon-timer-test.c: Ibex AON Timer test.
> tests/qtest/meson.build: Build the timer test.
> 
> Signed-off-by: Tyler Ng 
> ---
>  hw/riscv/Kconfig   |   4 +
>  hw/riscv/opentitan.c   |  21 +-
>  hw/watchdog/Kconfig|   3 +
>  hw/watchdog/meson.build|   1 +
>  hw/watchdog/wdt_ibex_aon.c | 405 +
>  include/hw/riscv/opentitan.h   |   7 +-
>  include/hw/watchdog/wdt_ibex_aon.h |  67 +  tests/qtest/ibex-aon-
> timer-test.c  | 189 ++
>  tests/qtest/meson.build|   3 +
>  9 files changed, 696 insertions(+), 4 deletions(-)  create mode 100644
> hw/watchdog/wdt_ibex_aon.c  create mode 100644
> include/hw/watchdog/wdt_ibex_aon.h
>  create mode 100644 tests/qtest/ibex-aon-timer-test.c
> 
> diff --git a/hw/riscv/Kconfig b/hw/riscv/Kconfig index
> 79ff61c464..72094010be 100644
> --- a/hw/riscv/Kconfig
> +++ b/hw/riscv/Kconfig
> @@ -4,6 +4,9 @@ config RISCV_NUMA
>  config IBEX
>  bool
> 
> +config IBEX_AON
> +bool
> +
>  config MICROCHIP_PFSOC
>  bool
>  select CADENCE_SDHCI
> @@ -20,6 +23,7 @@ config MICROCHIP_PFSOC  config OPENTITAN
>  bool
>  select IBEX
> +select IBEX_AON
>  select UNIMP
> 
>  config SHAKTI_C
> diff --git a/hw/riscv/opentitan.c b/hw/riscv/opentitan.c index
> af13dbe3b1..eae9b2504f 100644
> --- a/hw/riscv/opentitan.c
> +++ b/hw/riscv/opentitan.c
> @@ -48,7 +48,7 @@ static const MemMapEntry ibex_memmap[] = {
>  [IBEX_DEV_RSTMGR] = {  0x4041,  0x1000  },
>  [IBEX_DEV_CLKMGR] = {  0x4042,  0x1000  },
>  [IBEX_DEV_PINMUX] = {  0x4046,  0x1000  },
> -[IBEX_DEV_PADCTRL] ={  0x4047,  0x1000  },
> +[IBEX_DEV_AON_TIMER] =  {  0x4047,  0x1000  },
>  [IBEX_DEV_FLASH_CTRL] = {  0x4100,  0x1000  },
>  [IBEX_DEV_AES] ={  0x4110,  0x1000  },
>  [IBEX_DEV_HMAC] =   {  0x4111,  0x1000  },
> @@ -122,6 +122,8 @@ static void lowrisc_ibex_soc_init(Object *obj)
> 
>  object_initialize_child(obj, "timer", >timer, TYPE_IBEX_TIMER);
> 
> +object_initialize_child(obj, "aon_timer", >aon_timer,
> TYPE_IBEX_AON_TIMER);
> +
>  for (int i = 0; i < OPENTITAN_NUM_SPI_HOSTS; i++) {
>  object_initialize_child(obj, "spi_host[*]", >spi_host[i],
>  TYPE_IBEX_SPI_HOST); @@ -207,6 +209,7 @@ 
> static void
> lowrisc_ibex_soc_realize(DeviceState
> *dev_soc, Error **errp)
> 3, qdev_get_gpio_in(DEVICE(>plic),
> IBEX_UART0_RX_OVERFLOW_IRQ));
> 
> +/* RV Timer */
>  if (!sysbus_realize(SYS_BUS_DEVICE(>timer), errp)) {
>  return;
>  }
> @@ -218,6 +221,20 @@ static void lowrisc_ibex_soc_realize(DeviceState
> *dev_soc, Error **errp)
>qdev_get_gpio_in(DEVICE(qemu_get_cpu(0)),
> IRQ_M_TIMER));
> 
> +/* AON Timer */
> +if (!sysbus_realize(SYS_BUS_DEVICE(>aon_timer), errp)) {
> +return;
> +}
> +sysbus_mmio_map(SYS_BUS_DEVICE(>aon_timer), 0,
> memmap[IBEX_DEV_AON_TIMER].base);
> +sysbus_connect_irq(SYS_BUS_DEVICE(>aon_timer),
> +   0, qdev_get_gpio_in(DEVICE(>plic),
> +   IBEX_AONTIMER_WDOG_BARK));
> +/*
> + * Note: There should be a line to pwrmgr but it's not implemented.
> + * TODO: Needs a line connected in, "counter-run" (stop the watchdog if
> + * 

RE: [PATCH] virtio/vhost-user: Fix wrong vhost notifier GPtrArray size

2022-06-15 Thread Dong, Eddie
Reviewed-by: Eddie Dong 

> -Original Message-
> From: Qemu-devel  bounces+eddie.dong=intel@nongnu.org> On Behalf Of Yajun Wu
> Sent: Wednesday, May 25, 2022 8:49 PM
> To: qemu-devel@nongnu.org; m...@redhat.com; alex.ben...@linaro.org;
> yaj...@nvidia.com
> Cc: Parav Pandit 
> Subject: [PATCH] virtio/vhost-user: Fix wrong vhost notifier GPtrArray size
> 
> In fetch_or_create_notifier, idx begins with 0. So the GPtrArray size should
> be idx + 1 and g_ptr_array_set_size should be called with idx + 1.
> 
> This wrong GPtrArray size causes fetch_or_create_notifier return an invalid
> address. Passing this invalid pointer to vhost_user_host_notifier_remove
> causes assert fail:
> 
> qemu/include/qemu/int128.h:27: int128_get64: Assertion `r == a' failed.
>   shutting down, reason=crashed
> 
> Backends like dpdk-vdpa which sends out vhost notifier requests almost
> always hit qemu crash.
> 
> Fixes: 503e355465 ("virtio/vhost-user: dynamically assign
> VhostUserHostNotifiers")
> Signed-off-by: Yajun Wu 
> Acked-by: Parav Pandit 
> Change-Id: I87e0f7591ca9a59d210879b260704a2d9e9d6bcd
> ---
>  hw/virtio/vhost-user.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index
> b040c1ad2b..dbc690d16c 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -1525,7 +1525,7 @@ static VhostUserHostNotifier
> *fetch_or_create_notifier(VhostUserState *u,  {
>  VhostUserHostNotifier *n = NULL;
>  if (idx >= u->notifiers->len) {
> -g_ptr_array_set_size(u->notifiers, idx);
> +g_ptr_array_set_size(u->notifiers, idx + 1);
>  }
> 
>  n = g_ptr_array_index(u->notifiers, idx);
> --
> 2.36.0
> 
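For anyone following along, the off-by-one is easy to see with a standalone GLib
snippet (my own illustration, not part of the patch):

    #include <glib.h>

    int main(void)
    {
        GPtrArray *a = g_ptr_array_new();
        guint idx = 3;

        /* to make index 'idx' addressable, the array needs idx + 1 slots */
        if (idx >= a->len) {
            g_ptr_array_set_size(a, idx + 1);
        }
        g_assert(a->len == idx + 1);
        g_assert(g_ptr_array_index(a, idx) == NULL);   /* grown slots are NULL */

        g_ptr_array_free(a, TRUE);
        return 0;
    }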



RE: [RFC PATCH 00/13] Add a plugin to support out-of-band live migration for VFIO pass-through device

2022-06-14 Thread Dong, Eddie


> -Original Message-
> From: Alex Williamson 
> Sent: Wednesday, June 1, 2022 11:01 AM
> To: Dong, Eddie 
> Cc: Rao, Lei ; Tian, Kevin ; Zeng,
> Jason ; quint...@redhat.com; dgilb...@redhat.com;
> Li, Yadong ; Liu, Yi L ; qemu-
> de...@nongnu.org
> Subject: Re: [RFC PATCH 00/13] Add a plugin to support out-of-band live
> migration for VFIO pass-through device
> 
> On Wed, 1 Jun 2022 17:09:25 +
> "Dong, Eddie"  wrote:
> 
> > > -Original Message-
> > > From: Qemu-devel  > > bounces+eddie.dong=intel@nongnu.org> On Behalf Of Alex
> > > bounces+Williamson
> > > On Tue, 24 May 2022 14:18:35 +0800
> > > Lei Rao  wrote:
> > > > This proposal adopts a plugin mechanism (an example can be found
> > > > in
> > > > [1]) given that IPU/DPU vendors usually implement proprietary
> > > > migration interfaces without a standard. But we are also open if
> > > > an alternative option makes better sense, e.g. via loadable
> > > > modules (with Qemu supporting gRPC or JSON-RPC support) or an IPC
> > > > mechanism similar
> > > to vhost-user.
> > >
> > > AFAIU, QEMU is not interested in supporting plugin modules,
> > > especially proprietary ones.  I don't see that a case has really
> > > been made that this cannot be done in-band, through a vfio-pci
> > > variant driver, possibly making use of proprietary interfaces to a
> > > userspace agent if necessary (though please don't ask such to be
> > > accepted in-tree for the kernel either).  A vfio- user device server
> > > might also host such proprietary, device specific agents while
> > > supporting the standard, in-band migration interface.  Thanks,
> > >
> >
> > Thanks Alex. Removing plug-in module is not a problem.
> >
> > Do you mean to implement the migration and protocol handling inside
> > Qemu-client (probably vfio-client after the VFIO-user is merged)? Or
> > to build as part of libvfio-user? We can also build it as a separate
> > process of Qemu-server as part of Multi-Process Qemu.
> 
> AIUI, the QEMU "client" in a vfio-user configuration is simply QEMU itself.
> The vfio-user "server" provides the actual device implementation, which
> could support different license models, depending on what libraries or
> existing code is incorporated to implement that server.  The QEMU remote
> machine type is simply a QEMU-based implementation of a vfio-user server.
> The vfio-user server is analogous to a vfio-pci variant driver in the
> kernel/ioctl interface model.  The vfio-user client should be device agnostic,
> possibly with similar exceptions we have today via device specific quirk
> handling for the vfio kernel interface.
> 
> > In here, the protocol between host CPU and SoC is based on gRPC, which
> > support Rust code in client (Host CPU side here) more than C/C++. Do
> > you have any suggestion to better support Rust code with Qemu C/C++
> > code?
> 
> I'm not qualified to provide suggestions regarding Rust code integration with
> QEMU, but I think that's only required if the device specific migration
> support is on the "client".  As above, I don't think that's the correct model,
> the vfio migration protocol is meant to support any device specific
> requirements on the device end of the connection, ie. the "server" end for
> vfio-user, which can be an entirely separate, non-QEMU based process.  I
> think there are also ways to write kernel drivers in Rust, so possibly a 
> kernel
> interface vfio-pci variant driver could also work.  Thanks,
> 


Alex:
	Thanks for your suggestion. Yes, agreed: QEMU (the client) is, by nature,
neutral to physical-device knowledge.
	With more thinking, it seems that:

	1: Solution with a separate kernel driver:
		The way the host CPU talks with the SoC chip of the device goes
through a TCP/IP network plus a high-level protocol (gRPC or JSON...). This is
going to be very complicated and might be hard to get accepted by the community.

	2: Implement it as a full qemu-server device model.
		This works if we implement a full device model in vfio-user, but
given that the device (NVMe for now) works in VFIO passthrough mode for
performance, the issue Kevin Tian raised in another email is a real concern too.

	3: Implement a partial (or supplemental) feature in the Qemu-server device
model.
		This solution defines a QEMU/VFIO migration interface between the
client and server for migration. The client migration proxy uses a
hardware-transparent interface to talk with the remote migration server. The

RE: [RFC PATCH 00/13] Add a plugin to support out-of-band live migration for VFIO pass-through device

2022-06-01 Thread Dong, Eddie


> -Original Message-
> From: Qemu-devel  bounces+eddie.dong=intel@nongnu.org> On Behalf Of Alex Williamson
> Sent: Thursday, May 26, 2022 11:44 AM
> To: Rao, Lei 
> Cc: Tian, Kevin ; Dong, Eddie
> ; Zeng, Jason ;
> quint...@redhat.com; dgilb...@redhat.com; Li, Yadong
> ; Liu, Yi L ; qemu-
> de...@nongnu.org
> Subject: Re: [RFC PATCH 00/13] Add a plugin to support out-of-band live
> migration for VFIO pass-through device
> 
> On Tue, 24 May 2022 14:18:35 +0800
> Lei Rao  wrote:
> 
> > Migration of a VFIO passthrough device can be supported by using a
> > device specific kernel driver to save/restore the device state thru
> > device specific interfaces. But this approach doesn't work for devices
> > that lack a state migration interface, e.g. NVMe.
> >
> > On the other hand, Infrastructure Process Unit (IPU) or Data
> > Processing Unit
> > (DPU) vendors may choose to implement an out-of-band interface from
> > the SoC to help manage the state of such non-migratable devices e.g.
> > via gRPC or JSON-RPC protocols.
> >
> > This RFC attempts to support such out-of-band migration interface by
> > introducing the concept of migration backends in vfio. The existing
> > logic around vfio migration uAPI is now called the 'local' backend while a
> new 'out-of-band'
> > backend is further introduced allowing vfio to redirect VMState ops to
> > an external plugin.
> >
> > Currently, the backend migration Ops is defined close to
> > SaveVMHandlers. We also considered whether there is value of
> > abstracting it in a lower level e.g. close to vfio migration uAPI but
> > no clear conclusion. Hence this is one part which we'd like to hear
> suggestions.
> >
> > This proposal adopts a plugin mechanism (an example can be found in
> > [1]) given that IPU/DPU vendors usually implement proprietary
> > migration interfaces without a standard. But we are also open if an
> > alternative option makes better sense, e.g. via loadable modules (with
> > Qemu supporting gRPC or JSON-RPC support) or an IPC mechanism similar
> to vhost-user.
> 
> AFAIU, QEMU is not interested in supporting plugin modules, especially
> proprietary ones.  I don't see that a case has really been made that this
> cannot be done in-band, through a vfio-pci variant driver, possibly making
> use of proprietary interfaces to a userspace agent if necessary (though
> please don't ask such to be accepted in-tree for the kernel either).  A vfio-
> user device server might also host such proprietary, device specific agents
> while supporting the standard, in-band migration interface.  Thanks,
> 

Thanks Alex. Removing the plug-in module is not a problem.

Do you mean to implement the migration and protocol handling inside the QEMU
client (probably vfio-client after the VFIO-user series is merged)? Or to build
it as part of libvfio-user? We could also build it as a separate QEMU-server
process as part of Multi-Process QEMU.

In here, the protocol between the host CPU and the SoC is based on gRPC, which
supports Rust code on the client (host CPU) side better than C/C++. Do you have
any suggestion on how to better integrate Rust code with QEMU's C/C++ code?


Thx Eddie




> Alex
> 
> >
> > The following graph describes the overall component relationship:
> >
> > [Diagram: inside QEMU, the VFIO Live Migration Framework exposes
> >  VFIOMigrationOps, which dispatches either to the 'LM Backend via Device Fd'
> >  (the existing local backend) or to the 'LM Backend via Plugins', whose
> >  Plugin Ops call out to vendor-specific plugins (.so) outside QEMU.]
>

Re: [Qemu-devel] Asynchronous / synchronous IO emulation

2016-11-25 Thread Dong, Eddie
> > The usage is to construct a secondary hot standby VM (SVM), identical with
> the primary VM (PVM).
> > When an virtual DMA happens in PVM side, we need to know at which
> instruction boundary the virtual DMA is delivered, so that we can replay the
> virtual DMA event at the 2nd VM side, to keep them identical at any time.
> >
> > Asynchronous IO emulations seems to be a little bit more complicate to be
> deterministic...
> > I might be wrong.
> 
> There is a record-replay mode that might be worth investigating.  See
> docs/replay.txt.
> 
> It sounds like you are not using live migration for micro-checkpoints?
> The live migration mechanism would keep memory in sync at each
> checkpoint.
> 
> Is this work releated to the COLO effort?
> 
> Stefan

Thanks Stefan.  COLO is one thing we push for high availability.
This is yet another project leveraging KVM for deterministic execution, which
is still in the works. I used VM replication to explain the concept, but it is
not used for high availability here.
Eddie



Re: [Qemu-devel] Asynchronous / synchronous IO emulation

2016-11-25 Thread Dong, Eddie
> 
> On Thu, Nov 24, 2016 at 08:44:06AM +, Dong, Eddie wrote:
> >  Under certain situation, we have a requirement to apply the
> virtual DMA event in a deterministic way. Current Qemu uses asynchronous
> approach to emulate the virtual IO events since quite long time ago. I am
> wondering if we still have an option to choose the synchronous emulation
> solution.
> 
> Please explain exactly what you mean.
> 
The usage is to construct a secondary hot-standby VM (SVM), identical to the
primary VM (PVM).
When a virtual DMA happens on the PVM side, we need to know at which instruction
boundary the virtual DMA is delivered, so that we can replay the virtual DMA
event on the second VM's side, to keep them identical at all times.

Asynchronous IO emulation seems to be a little more complicated to make
deterministic...
I might be wrong.

Thx Eddie



[Qemu-devel] Asynchronous / synchronous IO emulation

2016-11-24 Thread Dong, Eddie
Hi:
 Under certain situations, we have a requirement to apply virtual
DMA events in a deterministic way. Current QEMU has used an asynchronous approach
to emulate virtual IO events for quite a long time. I am wondering if we still
have an option to choose a synchronous emulation solution.

 Thanks, Eddie


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-27 Thread Dong, Eddie
> >
> > Even if the device driver doesn't support migration, you still want to
> > migrate VM? That maybe risk and we should add the "bad path" for the
> > driver at least.
> 
> At a minimum we should have support for hot-plug if we are expecting to
> support migration.  You would simply have to hot-plug the device before you
> start migration and then return it after.  That is how the current bonding
> approach for this works if I am not mistaken.

Hotplug is good for eliminating the device-specific state clone, but the bonding
approach is very network-specific; it doesn't work for other devices such as
FPGA devices, QAT devices and GPU devices, which we plan to support gradually :)

> 
> The advantage we are looking to gain is to avoid removing/disabling the
> device for as long as possible.  Ideally we want to keep the device active
> through the warm-up period, but if the guest doesn't do that we should still
> be able to fall back on the older approaches if needed.
> 


Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-25 Thread Dong, Eddie
> On Wed, Nov 25, 2015 at 12:21 AM, Lan Tianyu  wrote:
> > On 2015年11月25日 13:30, Alexander Duyck wrote:
> >> No, what I am getting at is that you can't go around and modify the
> >> configuration space for every possible device out there.  This
> >> solution won't scale.
> >
> >
> > PCI config space regs are emulation by Qemu and so We can find the
> > free PCI config space regs for the faked PCI capability. Its position
> > can be not permanent.
> 
> Yes, but do you really want to edit every driver on every OS that you plan to
> support this on.  What about things like direct assignment of regular Ethernet
> ports?  What you really need is a solution that will work generically on any
> existing piece of hardware out there.

The fundamental assumption of this patch series is to modify the driver in the
guest to self-emulate or track the device state, so that migration becomes
possible.
I don't think we can modify the OS without modifying the drivers, even using the
PCIe hotplug mechanism.
In the meantime, modifying the Windows OS is a big challenge, given that only
Microsoft can do that. Modifying a driver, by contrast, is relatively simple and
manageable for device vendors, if the device vendor wants to support state-clone
based migration.

Thx Eddie


Re: [Qemu-devel] [POC]colo-proxy in qemu

2015-11-10 Thread Dong, Eddie
> - What's the plan for vhost? Userspace network in qemu is rather slow, most
> user will choose vhost.
[Dong, Eddie] Hi Jason:
How about we take a staging approach? In general, COLO opens the door to a
high-performance HA solution, but it will take a very long time to make
everything perfect. As for network virtualization, I think we may start from
usages with moderate network bandwidth, like 1 Gbps. Otherwise, the performance
of COLO may not be that good (of course, as David mentioned, the worst case is
the same as a periodic checkpoint). For the moment, how about we start with the
in-QEMU virtio network, and enhance it for the vhost case in the future?

The good thing is that we are getting more people working on the patch series,
and we are glad to see UMU has also joined the effort.  Thanks and welcome...

Thx Eddie


Re: [Qemu-devel] [POC] colo-proxy in qemu

2015-07-30 Thread Dong, Eddie
 
  A question here, the packet comparing may be very tricky. For example,
  some protocol use random data to generate unpredictable id or
  something else. One example is ipv6_select_ident() in Linux. So COLO
  needs a mechanism to make sure PVM and SVM can generate same random
 data?
 
 Good question, the random data connection is a big problem for COLO. At
 present, it will trigger checkpoint processing because of the different random
 data.
 I don't think any mechanisms can assure two different machines generate the
 same random data. If you have any ideas, pls tell us :)
 
 Frequent checkpoint can handle this scenario, but maybe will cause the
 performance poor. :(
 
The assumption is that, after a VM checkpoint, the SVM and PVM have identical
internal state, so the pattern used to generate random data has a high
probability of generating identical data, at least for a short time...

Thx Eddie



Re: [Qemu-devel] [POC] colo-proxy in qemu

2015-07-23 Thread Dong, Eddie
BTW, I feel it would be better to call it an "agency" rather than a proxy. Any
comments from a native speaker?

Thx Eddie



 Hi, all
 
 We are planning to implement colo-proxy in qemu to cache and compare
 packets.
 This module is one of the important component of COLO project and now it is
 still in early stage, so any comments and feedback are warmly welcomed, thanks
 in advance.
 
 ## Background
 COLO FT/HA (COarse-grain LOck-stepping Virtual Machines for Non-stop Service)
 project is a high availability solution. Both Primary VM (PVM) and Secondary
 VM
 (SVM) run in parallel. They receive the same request from client, and generate
 responses in parallel too. If the response packets from PVM and SVM are
 identical, they are released immediately. Otherwise, a VM checkpoint (on
 demand) is conducted.
 Paper:
 http://www.socc2013.org/home/program/a3-dong.pdf?attredirects=0
 COLO on Xen:
 http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping
 COLO on Qemu/KVM:
 http://wiki.qemu.org/Features/COLO
 
 By the needs of capturing response packets from PVM and SVM and finding out
 whether they are identical, we introduce a new module to qemu networking
 called colo-proxy.
 
 This document describes the design of the colo-proxy module
 
 ## Glossary
PVM - Primary VM, which provides services to clients.
SVM - Secondary VM, a hot standby and replication of PVM.
PN - Primary Node, the host which PVM runs on
SN - Secondary Node, the host which SVM runs on
 
 ## Workflow ##
 The following image show the qemu networking packet datapath between
 guest's NIC and qemu's backend in colo-proxy.
 
 [Diagram: packet datapath with colo-proxy. On the Primary Node (PN), packets
 between the PVM's NIC and QEMU's backend (tap) pass through a proxy module with
 its own TCP/IP stack; the proxy copies and forwards packets over sockets to the
 Secondary Node (SN), whose proxy adjusts TCP seq/ack for the SVM and forwards
 the SVM's responses back to the PN-side 'compare' module. A separate 'chkpoint'
 socket links the PVM and SVM for checkpointing.]
 
 ## Our Idea ##
 
 ### Net filter
 In current QEMU, a packet is transported between the networking backend (tap)
 and the QEMU network adapter (NIC) directly. Backend and adapter are linked by
 NetClientState->peer in QEMU as follows:
 [Diagram: two NetClientState structures, one with info->type=TAP (name=tap0)
  and one with info->type=NIC (name=e1000), pointing at each other through
  their *peer fields.]
 
 In COLO QEMU, we insert a net filter named colo-proxy between backend and
 adapter like below:
 typedef struct COLOState {
  NetClientState nc;
  NetClientState *peer;
 } COLOState;
 [Diagram: with the colo-proxy net filter inserted, the TAP NetClientState
  (name=tap0) and the NIC NetClientState (name=e1000) no longer peer with each
  other directly; a COLOState filter sits between them.]

Re: [Qemu-devel] [POC] colo-proxy in qemu

2015-07-23 Thread Dong, Eddie
Hi Stefan:
Thanks for your comments!

 
 On Mon, Jul 20, 2015 at 02:42:33PM +0800, Li Zhijian wrote:
  We are planning to implement colo-proxy in qemu to cache and compare
 packets.
 
 I thought there is a kernel module to do that?

Yes, that is the previous solution the COLO sub-community chose to pursue,
but we realized it might not be the best choice, and thus we want to bring the
discussion back here :)  More comments are welcome.

 
 Why does the proxy need to be part of the QEMU process?  -netdev socket or
 host network stack features allow you to process packets in a separate 
 process.

Hailiang gave a very good summary, and we don't need privileged ops so far.
The main thing that motivated us to revisit this is that the former kernel-land
driver would be a pure virtualization (and high availability) component. It may
have only a limited number of users at the very beginning and thus be slower to
be accepted. And people typically prefer to put a component in userland if they
can.

In addition, as a pure virtualization feature, we guess people on the QEMU
mailing list may be much more interested in supporting all the VMMs, such as the
pure QEMU VMM, KVM, etc.


Thx Eddie 



Re: [Qemu-devel] [PATCH COLO v3 01/14] docs: block replication's description

2015-05-06 Thread Dong, Eddie


 -Original Message-
 From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com]
 Sent: Tuesday, May 05, 2015 11:24 PM
 To: Stefan Hajnoczi
 Cc: Paolo Bonzini; Wen Congyang; Fam Zheng; Kevin Wolf; Lai Jiangshan; qemu
 block; Jiang, Yunhong; Dong, Eddie; qemu devel; Max Reitz; Gonglei; Yang
 Hongyang; zhanghailiang; arm...@redhat.com; jc...@redhat.com
 Subject: Re: [PATCH COLO v3 01/14] docs: block replication's description
 
 * Stefan Hajnoczi (stefa...@redhat.com) wrote:
  On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
  
  
   On 24/04/2015 11:38, Wen Congyang wrote:

 That can be done with drive-mirror.  But I think it's too early for 
 that.
Do you mean use drive-mirror instead of quorum?
  
   Only before starting up a new secondary.  Basically you do a
   migration with non-shared storage, and then start the secondary in colo
 mode.
  
   But it's only for the failover case.  Quorum (or a new block/colo.c
   driver or filter) is fine for normal colo operation.
 
  Perhaps this patch series should mirror the Secondary's disk to a
  Backup Secondary so that the system can be protected very quickly
  after failover.
 
  I think anyone serious about fault tolerance would deploy a Backup
  Secondary, otherwise the system cannot survive two failures unless a
  human administrator is lucky/fast enough to set up a new Secondary.
 
 I'd assumed that a higher level management layer would do the allocation of a
 new secondary after the first failover, so no human need be involved.
 

I agree. The cloud OS, such as OpenStack, will have the capability to handle
this case, together with a certain API on the VMM side (libvirt?).

Thx Eddie



Re: [Qemu-devel] [RFC] COLO HA Project proposal

2014-07-04 Thread Dong, Eddie
 
  I didn't quite understand a couple of things though, perhaps you can
  explain:
 1) If we ignore the TCP sequence number problem, in an SMP machine
  don't we get other randomnesses - e.g. which core completes something
  first, or who wins a lock contention, so the output stream might not
  be identical - so do those normal bits of randomness cause the
  machines to flag as out-of-sync?
 
 It's about COLO agent, CCing Congyang, he can give the detailed
 explanation.
 

Let me clarify this issue. COLO doesn't ignore the TCP sequence number, but
uses a new implementation to make the sequence numbers best-effort identical
between the primary VM (PVM) and secondary VM (SVM). Likely, the VMM has to
synchronize the emulation of the random number generation mechanism between
the PVM and SVM, like the lock-stepping mechanism does.

Furthermore, for long TCP connections, we can rely on the (on-demand) VM
checkpoint to get identical sequence numbers in both the PVM and SVM.
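
As a purely illustrative sketch (not the actual COLO guest/VMM change), both
sides could derive the initial sequence number from a seed synchronized at
every checkpoint plus the connection 4-tuple, so the PVM and SVM agree; the
names below are hypothetical.

    #include <stdint.h>

    /* Synchronized from PVM to SVM as part of every checkpoint. */
    static uint32_t colo_isn_seed;

    /* Both VMs compute the same ISN for the same connection by mixing the
     * shared seed with the 4-tuple (simple FNV-style hash for illustration). */
    static uint32_t colo_initial_seq(uint32_t saddr, uint32_t daddr,
                                     uint16_t sport, uint16_t dport)
    {
        uint32_t h = colo_isn_seed ^ 2166136261u;
        h = (h ^ saddr) * 16777619u;
        h = (h ^ daddr) * 16777619u;
        h = (h ^ (((uint32_t)sport << 16) | dport)) * 16777619u;
        return h;
    }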


Thanks, Eddie



Re: [Qemu-devel] [RFC] COLO HA Project proposal

2014-07-04 Thread Dong, Eddie
 
  Let me clarify on this issue. COLO didn't ignore the TCP sequence
  number, but uses a new implementation to make the sequence number to
  be best effort identical between the primary VM (PVM) and secondary VM
  (SVM). Likely, VMM has to synchronize the emulation of randomization
  number generation mechanism between the PVM and SVM, like the
 lock-stepping mechanism does.
 
  Further mnore, for long TCP connection, we can rely on the (on-demand)
  VM checkpoint to get the identical Sequence number both in PVM and
 SVM.
 
 That wasn't really my question; I was worrying about other forms of
 randomness, such as winners of lock contention, and other SMP
 non-determinisms, and I'm also worried by what proportion of time the
 system can't recover from a failure due to being unable to distinguish an
 SVM failure from a randomness issue.
 
Thanks Dave:
Whatever randomness in values, branches, or code paths the PVM and SVM may
have, it is only a performance issue. COLO never assumes the PVM and SVM have
the same internal machine state. From a correctness point of view, as long as
the PVM and SVM generate identical responses, we can view the SVM as a valid
replica of the PVM, and the SVM can take over when the PVM suffers a hardware
failure. We can view the client as talking to the SVM all the way, without
the notion of a PVM. Of course, if the SVM dies, we can regenerate a copy of
the PVM with a new checkpoint too.
The SOCC paper has the detailed recovery model :)
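
A minimal sketch of the output-comparison idea, assuming a simple packet
buffer type and a checkpoint callback (both hypothetical, not the actual
colo-proxy code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    struct pkt {
        const unsigned char *data;
        size_t len;
    };

    typedef void (*checkpoint_fn)(void);   /* forces a PVM -> SVM state sync */

    /* Compare the PVM's response with the SVM's.  If they are identical the
     * SVM is a valid replica and the packet can be released; otherwise force
     * a checkpoint so the SVM is resynchronized before releasing it. */
    static bool colo_compare_and_release(const struct pkt *pvm_out,
                                         const struct pkt *svm_out,
                                         checkpoint_fn force_checkpoint)
    {
        if (pvm_out->len == svm_out->len &&
            memcmp(pvm_out->data, svm_out->data, pvm_out->len) == 0) {
            return true;                    /* identical response */
        }
        force_checkpoint();                 /* divergence: resync the SVM */
        return true;
    }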

Thanks, Eddie






Re: [Qemu-devel] [RFC] COLO HA Project proposal

2014-07-04 Thread Dong, Eddie
  Thanks Dave:
  Whether the randomness value/branch/code path the PVM and SVM
 may
  have, It is only a performance issue. COLO never assumes the PVM and
  SVM has same internal Machine state.  From correctness p.o.v, as if
  the PVM and SVM generate Identical response, we can view the SVM is a
  valid replica of PVM, and the SVM can take over When the PVM suffers
  from hardware failure. We can view the client is all the way talking
  with the SVM, without the notion of PVM.  Of course, if the SVM dies, we
 can regenerate a copy of PVM with a new checkpoint too.
  The SOCC paper has the detail recovery model :)
 
 I've had a read; I think the bit I was asking about was what you labelled 'D' 
 in
 that papers fig.4 - so I think that does explain it for me.

Very good :)

 But I also have some more questions:
 
   1) 5.3.3 Web server
 a) In fig 11 it shows Remus's performance dropping off with the number
 of threads - why is that? Is it
just an increase in the amount of memory changes in each
 snapshot?

I didn't dig into the details of this, but just documented the throughput we
observed. I found it a bit strange too; the memory dirty page set may be
larger than in the small-connection case, but I am not sure, and that is the
data we saw :(

 b) Is fig 11/12 measured with all of the TCP optimisations shown in fig
 13 on?

Yes.

 
   2) Did you manage to overcome the issue shown in 5.6 with newer guest
 kernels degredation - could you just fall
  back to micro checkpointing if the guests diverge too quickly?

In general, I would say the COLO performance for these 2 workloads is pretty
good, and I actually didn't list subsection 5.6 initially. It was the
conference shepherd who asked me to add this paragraph to make the paper
balanced :)

In summary, COLO can have very good MP-guest performance compared with Remus,
at the cost of potential optimization/modification effort to the guest TCP/IP
stack. One solution may not work for all workloads, but it leaves a large
room for OSVs to provide customized solutions for specific usages -- which I
think is very good for the open source business model: make money through
consulting. Huawei Technology Ltd. announced support for COLO in their cloud
OS, probably for a specific usage too.

 
   3) Was the link between the two servers for synchronisation a low-latency
 dedicated connection?

We used a 10 Gbps NIC in the paper, and yes, it is a dedicated link, but the
solution itself doesn't require a dedicated link.

 
   4) Did you try an ftp PUT benchmark using external storage - i.e. that
 wouldn't have the local disc
  synchronisation overhead?

Not yet.
External network shared storage works, but today the performance may not be
that good, because our optimization so far is still very limited. It is just
an initial effort to make the 2 common workloads happy. We believe there is a
large room ahead to make the response of the TCP/IP stack more predictable.
Once the basic COLO stuff is ready for product and accepted by the industry,
it is possible we may influence the TCP community to keep this kind of
predictability in mind for future protocol development, which will greatly
help performance.


Thx Eddie



[Qemu-devel] RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace

2010-06-10 Thread Dong, Eddie

 A VF interrupt usually happens in 4-8KHZ. How about the virtio?
 I assume virtio will be widely used together w/ leagcy guest with
 INTx mode. 
 
 
 True, but in time it will be replaced by MSI.
 
 Note without vhost virtio is also in userspace, so there are lots of
 exits anyway for the status register.

A few months ago, we noticed that the interrupt frequency of PV I/O in the
previous solution is almost the same as that of a physical NIC, which ticks
at ~4 kHz. Each PV I/O frontend driver (or its interrupt source) has a
similar interrupt frequency, which means Nx more interrupts. I guess virtio
is in a similar situation.

We then did an optimization for PV I/O to mitigate interrupts to the guest by
setting an interrupt throttle on the backend side, because a native NIC also
works that way -- the so-called ITR register in Intel NICs. We saw a 30-90%
CPU utilization saving, depending on how many frontend driver interrupts are
employed. Not sure if this is adopted on the vhost side.

One drawback, of course, is the latency, but it is mostly tolerable if the
rate is reduced to ~1 kHz.
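
A minimal sketch of such backend-side throttling, assuming a 1 kHz cap and
hypothetical names (not the actual vhost/QEMU code):

    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    #define THROTTLE_NS 1000000LL   /* >= 1 ms between injections => <= 1 kHz */

    struct irq_throttle {
        int64_t last_inject_ns;     /* timestamp of the last injected interrupt */
        bool    pending;            /* events arrived while we were throttled   */
    };

    static int64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    /* Called whenever the backend has work for the guest.  Returns true if
     * the interrupt should be injected now; otherwise the event is coalesced
     * and a timer (not shown) flushes t->pending when the window expires. */
    static bool irq_throttle_should_inject(struct irq_throttle *t)
    {
        int64_t now = now_ns();

        if (now - t->last_inject_ns >= THROTTLE_NS) {
            t->last_inject_ns = now;
            t->pending = false;
            return true;
        }
        t->pending = true;          /* coalesce: remember, inject later */
        return false;
    }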

Thx, Eddie


[Qemu-devel] RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace

2010-06-09 Thread Dong, Eddie
Avi Kivity wrote:
 I am currently investigating a problem with the a guest running Linux
 malfunctioning in the NMI watchdog code.  The problem is that we don't
 handle NMI delivery mode for the local APIC LINT0 pin; instead we
 expect ExtInt deliver mode or that the line is disabled completely. 
 In addition the i8254 timer is tied to the BSP, while in this case the
 timer can broadcast to all vcpus.
 
 There is some code that tries to second-guess the guest and provide it
 the inputs it sees, but this is fragile.  The only way to get reliable
 operation is to emulate the hardware fully.
 
 Now I'd much rather do that in userspace, since it's a lot of
 sensitive work.  I'll enumerate below the general motivation,
 advantages and disadvantages, and a plan for moving forward.
 
 Motivation
 ==
 
 The original motivation for moving the PIC and IOAPIC into the kernel
 was performance, especially for assigned devices.  Both devices are
 high interaction since they deal with interrupts; practically after
 every interrupt there is either a PIC ioport write, or an APIC bus
 message, both signalling an EOI operation.  Moving the PIT into the
 kernel allowed us to catch up with missed timer interrupt injections,
 and speeded up guests which read the PIT counters (e.g. tickless
 guests). 
 
 However, modern guests running on modern qemu use MSI extensively;
 both virtio and assigned devices now have MSI support; and the
 planned VFIO only supports kernel delivery via MSI anyway; line based
 interrupts will need to be mediated by userspace.
 
 The only high frequency non-MSI interrupt sources remaining are the
 various timers; and the default one, HPET, is in userspace (and having
 its own scaling problems as a result).  So in theory we can move PIC,
 IOAPIC, and PIT support to userspace and not lose much performance.
 
 Moving the implementation to userspace allows us more flexibility, and
 more consistency in the implementation of timekeeping for the various
 clock chips; it becomes easier to follow the nuances of real hardware
 in this area.
 
 Interestingly, while the IOAPIC/PIC code was written we proposed
 making it independent of the local APIC; had we done so, the move
 would have been much easier (simply dropping the existing code).
 
 
 Advantages of a move
 
 
 1. Reduced kernel footprint
 
 Good for security, and allows fixing bugs without reboots.
 
 2. Centralized timekeeping
 
 Instead of having one solution for PIT timekeeping, and another for
 RTC and HPET timekeeping, we can have all timer chips in userspace. 
 The local APIC timer still needs to be in the kernel - it is much too
 high bandwidth to be in userspace; but on the other hand it is very
 different from the other timer chips.
 
 3. Flexibility
 
 Easier to have wierd board layouts (multiple IOAPICs, etc.).  Not a
 very strong advantage.
 
 Disadvantages
 =
 
 1. Still need to keep the old code around for a long while
 
 We can't just rip it out - old userspace depends on it.  So the
 security advantages are only with cooperating userspace, and the
 other advantages only show up.
 
 2. Need to bring the qemu code up to date
 
 The current qemu ioapic code lags some way behind the kernel; also
 need PIT timekeeping
 
 3. May need kernel support for interval-timer-follows-thread
 
 Currently the timekeeping code has an optimization which causes the
 hrtimer that models the PIT to follow the BSP (which is most likely to
 receive the interrupt); this reduces cpu cross-talk.
 
 I don't think the kernel interval timer code has such an optimization;
 we may need to implement it.
 
 4. Much churn
 
 This is a lot of work.
 
 5. Risk
 
 We may find out after all this is implemented that performance is not
 acceptable and all the work will have to be dropped.
 
 

Besides the performance overhead risk introduced by VF I/O interrupts and
timer interrupts, EOI message delivery from the LAPIC to the IOAPIC, which
would now be in user land, may have a potential scalability issue. For
example, if we have a 64-VCPU guest and each vcpu takes 1 kHz of interrupts
(or IPIs), the EOIs from the guest will normally have to involve the IOAPIC
module for clearance at 64 kHz, which may cause long lock contention. You may
reduce the involvement of the IOAPIC EOI by tracking the IOAPIC pin - vector
map in the kernel, but I am not sure if it is clean enough.
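
A minimal sketch of that pin/vector tracking idea, assuming a per-guest
bitmap of IOAPIC-routed vectors and hypothetical names (this is not the KVM
API):

    #include <stdbool.h>
    #include <stdint.h>

    #define NR_VECTORS 256

    struct eoi_filter {
        uint64_t ioapic_routed[NR_VECTORS / 64];  /* vectors owned by the IOAPIC */
    };

    /* Mark/unmark a vector as routed through the (userspace) IOAPIC. */
    static inline void eoi_filter_set(struct eoi_filter *f, uint8_t vec, bool on)
    {
        if (on)
            f->ioapic_routed[vec / 64] |=  (1ULL << (vec % 64));
        else
            f->ioapic_routed[vec / 64] &= ~(1ULL << (vec % 64));
    }

    /* Called on a guest EOI.  LAPIC-local and edge-triggered vectors (timer,
     * IPI, MSI) are handled entirely in the kernel; only vectors in the map
     * are forwarded to the userspace IOAPIC for level de-assert handling. */
    static inline bool eoi_needs_userspace(const struct eoi_filter *f, uint8_t vec)
    {
        return (f->ioapic_routed[vec / 64] >> (vec % 64)) & 1;
    }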


[Qemu-devel] RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace

2010-06-09 Thread Dong, Eddie
Avi Kivity wrote:
 On 06/09/2010 06:59 PM, Dong, Eddie wrote:
 
 Besides VF IO interrupt and timer interrupt introduced performance
 overhead risk, 
 
 VF usually uses MSI

Typo, I meant PV I/O.
A VF interrupt usually happens at 4-8 kHz. How about virtio?
I assume virtio will be widely used together with legacy guests in INTx mode.

 
   EOI message deliver from lapic to ioapic,
 
 Only for non-MSI
 
   which becomes in user land now, may have potential scalability
 issue. For example, if we have a 64 VCPU guest, if each vcpu has
 1khz interrupt (or ipi), the EOI from guest will normally have to
 involve ioapic module for clearance in 64khz which may have long
 lock contentio.
 
 No, EOI for IPI or for local APIC timer does not involve the IOAPIC.
 
 you may reduce the involvement of ioapic eoi by tracking ioapic
 pin-  vector map in kernel, but not sure if it is clean enough. 
 
 It's sufficient to look at TMR, no?  For edge triggered I don't think
 we need the EOI.

Mmm, I noticed a difference in the statements between the new SDM and the old
SDM. The old SDM says an IPI can have both edge and level trigger modes, but
the new SDM says only INIT has both choices.
Given that the new SDM eliminates the level trigger mode, it is OK.

 
 But, the amount of churn and risk worries me, so I don't think the
 move is worthwhile.

This also reminds me of the debate at the early stage of KVM, when Gregory
Haskins was working on an arbitrary choice of irqchip.
The patch even at that time (without SMP) was very complicated.

I agree that, from an ROI point of view, this movement may not be worthwhile.
thx, eddie