Re: [Qemu-devel] [PATCH v4 00/15] tests: acpi: add UEFI (ARM) testing support

2019-05-02 Thread Wei Xu
Hi Igor,

On 5/2/2019 3:51 PM, Igor Mammedov wrote:
> Changelog:
>   - from v3:
>   * reshuffle patch order a bit
>   * move out acpi_parse_rsdp_table() hunk to
>   "tests: acpi: make pointer to RSDP  64bit"
> where it belongs
>   * move acpi_fetch_rsdp_table(s/uint32_t addr/uint64_t addr/) to
> this patch where it belongs from:
>"tests: acpi: make RSDT test routine handle XSDT"
>   * dropping Reviewed-bys due to acpi_fetch_table() change
> introduced by earlier patch:
>   "tests: acpi: make acpi_fetch_table() take size of fetched table 
> pointer"
>   * update [8/15] commit message to point to commit which introduced
> signature_guid value.
>   * get rid of test_acpi_rsdp_address() in [9/15]
>   * added new patch
>  tests: acpi: allow to override default accelerator
>   * force arm/virt test to use TCG accelerator
>   - from v2:
>   * rebase on top current master (with UEFI blobs merged)
>   * added a Makefile rule to include bios-tables-test to aarch64 tests by
> default into 11/13 (kept Reviewed-bys)
>   * other trivial fixes and cleanups (see per patch changelogs)
>
>   - from v1:
>   * rebase on top
>  (1) [PATCH for-4.1 v3 00/12] bundle edk2 platform firmware with QEMU
> which let me drop the edk2 images and the Makefile magic to unpack them;
> Laszlo's series conveniently does it all for me.
>   * use new path/names for firmware images as supplied by [1]
>   * reorder patches a bit so that UEFI parts would go after generic 
> changes
> 
> The series adds support for ACPI tables located above 4G. It adds the 64-bit
> handling necessary for testing the arm/virt board (i.e. it might not be
> complete wrt the spec) and uses the recently merged UEFI (AAVMF) firmware/test
> disk image, which provides an entry point[1] for fetching ACPI tables (the
> RSDP pointer).
> 
> Git tree for testing:
>https://github.com/imammedo/qemu.git acpi_arm_tests_v4
> 
> Ref to previous version:
>[PATCH v3 00/13] tests: acpi: add UEFI (ARM) testing support
>https://www.mail-archive.com/qemu-devel@nongnu.org/msg612679.html
> 
> CC: Laszlo Ersek 
> CC: "Michael S. Tsirkin" 
> CC: Gonglei 
> CC: Philippe Mathieu-Daudé 
> CC: Shannon Zhao 
> CC: Wei Yang 
> CC: Andrew Jones 
> CC: Shameer Kolothum 
> CC: Ben Warren 
> 
> Igor Mammedov (15):
>   tests: acpi: rename acpi_parse_rsdp_table() into
> acpi_fetch_rsdp_table()
>   tests: acpi: make acpi_fetch_table() take size of fetched table
> pointer
>   tests: acpi: make RSDT test routine handle XSDT
>   tests: acpi: make pointer to RSDP 64bit
>   tests: acpi: fetch X_DSDT if pointer to DSDT is 0
>   tests: acpi: skip FACS table if board uses hw reduced ACPI profile
>   tests: acpi: move boot_sector_init() into x86 tests branch
>   tests: acpi: add acpi_find_rsdp_address_uefi() helper
>   tests: acpi: add a way to start tests with UEFI firmware
>   tests: acpi: ignore SMBIOS tests when UEFI firmware is used
>   tests: acpi: allow to override default accelerator
>   tests: add expected ACPI tables for arm/virt board
>   tests: acpi: add simple arm/virt testcase
>   tests: acpi: refactor rebuild-expected-aml.sh to dump ACPI tables for
> a specified list of targets
>   tests: acpi: print error unable to dump ACPI table during rebuild
> 
>  tests/acpi-utils.h  |   7 +-
>  tests/Makefile.include  |   1 +
>  tests/acpi-utils.c  |  68 +++
>  tests/bios-tables-test.c| 148 +++-
>  tests/data/acpi/rebuild-expected-aml.sh |  23 +++--
>  tests/data/acpi/virt/APIC   | Bin 0 -> 168 bytes
>  tests/data/acpi/virt/DSDT   | Bin 0 -> 18476 bytes
>  tests/data/acpi/virt/FACP   | Bin 0 -> 268 bytes
>  tests/data/acpi/virt/GTDT   | Bin 0 -> 96 bytes
>  tests/data/acpi/virt/MCFG   | Bin 0 -> 60 bytes
>  tests/data/acpi/virt/SPCR   | Bin 0 -> 80 bytes
>  tests/vmgenid-test.c|   6 +-
>  12 files changed, 178 insertions(+), 75 deletions(-)
>  create mode 100644 tests/data/acpi/virt/APIC
>  create mode 100644 tests/data/acpi/virt/DSDT
>  create mode 100644 tests/data/acpi/virt/FACP
>  create mode 100644 tests/data/acpi/virt/GTDT
>  create mode 100644 tests/data/acpi/virt/MCFG
>  create mode 100644 tests/data/acpi/virt/SPCR
> 

Tested the series on the HiSilicon D05 board (arm64 based), so FWIW:

Tested-by: Wei Xu 

Thanks!

Best Regards,
Wei




Re: [Qemu-devel] [PATCH v3 11/13] tests: acpi: add simple arm/virt testcase

2019-05-02 Thread Wei Xu
Hi Igor,

On 5/2/2019 3:24 PM, Igor Mammedov wrote:
> On Fri, 26 Apr 2019 17:28:10 +0100
> Wei Xu  wrote:
> 
>> Hi Igor,
>>
>> On 4/26/2019 12:54 PM, Igor Mammedov wrote:
>>> On Fri, 26 Apr 2019 00:51:56 +0800
>>> x00249684  wrote:
>>>
>>>> Hi Igor,
>>>>
>>>> +static void test_acpi_virt_tcg(void)
>>>> +{
>>>> +test_data data = {
>>>> +.machine = "virt",
>>>> +.uefi_fl1 = "pc-bios/edk2-aarch64-code.fd",
>>>> +.uefi_fl2 = "pc-bios/edk2-arm-vars.fd",
>>>> +.cd = "tests/data/uefi-boot-images/bios-tables-test.aarch64.iso.qcow2",
>>>> +.ram_start = 0x40000000ULL,
>>>> +.scan_len = 128ULL * 1024 * 1024,
>>>> +};
>>>> +
>>>> +test_acpi_one("-cpu cortex-a57 ", &data);
>>>>
>>>> Replaced the cortex-a57 with host and successfully tested on the HiSilicon
>>>> arm64 D05 board. Otherwise it failed with "kvm_init_vcpu failed: Invalid
>>>> argument". Is it possible to set the cpu type like numa-test.c does?
>>>
>>> I think it works with numa-test because that uses TCG only, but
>>> bios-tables-test uses accel="kvm:tcg" to leverage KVM capabilities whenever
>>> possible to speed up the test.
>>>
>>> Now back to our ARM test case: UEFI requires a 64-bit CPU (hence
>>> cortex-a57), but unlike x86 this obviously breaks when the KVM accelerator
>>> is used on an ARM host, since KVM doesn't work with anything other than
>>> the 'host' cpu model.
>>>
>>> I think we still want to use KVM whenever possible, but the problem is
>>> that the user (testcase) has no idea whether the KVM accelerator is
>>> available and whether the host is a 64-bit ARM CPU.
>>>
>>> To sum up, we need to support 2 modes:
>>>   1. host is 64-bit ARM: use kvm with -cpu host
>>>   2. all other cases: use tcg with -cpu cortex-a57
>>>
>>> I can hack up a probe for whether /dev/kvm is accessible and the host is
>>> 64-bit, and use #1, otherwise fall back to #2; or, as a quick fix, do only
>>> #2 initially and think about a better solution for #1.
>>
>> Thanks!
>> Fine to me.
>>
>>>
>>> Are there any other suggestions/opinions on how to approach the issue/proceed?
>>
>> To check whether the host CPU architecture is ARM, maybe we can check
>> whether the "CPU implementer" value in /proc/cpuinfo is 0x41.
> 
> it turned out to be more complicated.
> We also have to pick the correct GIC depending on the host's CPU, and that
> changes the ACPI tables, so the test would work on some hosts and fail on
> others.
Sorry, I did not consider that case.

> 
> I'll add a patch to let the test case pick the accelerator, and force TCG
> for the ARM tests for now.
>

Thanks!

Best Regards,
Wei
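
(For illustration, an override along the lines Igor describes could look like
the sketch below; the .accel field and its plumbing are hypothetical, inferred
from the changelog entry "tests: acpi: allow to override default accelerator"
rather than taken from the final patch.)

    test_data data = {
        .machine = "virt",
        .accel = "tcg",    /* hypothetical override field: skip "kvm:tcg" */
        .uefi_fl1 = "pc-bios/edk2-aarch64-code.fd",
        .uefi_fl2 = "pc-bios/edk2-arm-vars.fd",
    };
    test_acpi_one("-cpu cortex-a57", &data);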

>>
>> Best Regards,
>> Wei
>>
>>>
>>> PS:
>>> we probably would like to reuse this not only for acpi tests but also for 
>>> other
>>> arm/virt test cases that involve running guest code. 
>>>
>>>> Thanks!
>>>>
>>>> Best Regards,
>>>> Wei
>>>
>>>
>>> .
>>>
>>
> 
> 
> .
> 




Re: [Qemu-devel] [PATCH v3 11/13] tests: acpi: add simple arm/virt testcase

2019-04-26 Thread Wei Xu
Hi Igor,

On 4/26/2019 12:54 PM, Igor Mammedov wrote:
> On Fri, 26 Apr 2019 00:51:56 +0800
> x00249684  wrote:
> 
>> Hi Igor,
>>
>> +static void test_acpi_virt_tcg(void)
>> +{
>> +test_data data = {
>> +.machine = "virt",
>> +.uefi_fl1 = "pc-bios/edk2-aarch64-code.fd",
>> +.uefi_fl2 = "pc-bios/edk2-arm-vars.fd",
>> +.cd = "tests/data/uefi-boot-images/bios-tables-test.aarch64.iso.qcow2",
>> +.ram_start = 0x40000000ULL,
>> +.scan_len = 128ULL * 1024 * 1024,
>> +};
>> +
>> +test_acpi_one("-cpu cortex-a57 ", &data);
>>
>> Replaced the cortex-a57 with host and successfully tested on the HiSilicon
>> arm64 D05 board. Otherwise it failed with "kvm_init_vcpu failed: Invalid
>> argument". Is it possible to set the cpu type like numa-test.c does?
> 
> I think it works with numa-test because that uses TCG only, but
> bios-tables-test uses accel="kvm:tcg" to leverage KVM capabilities whenever
> possible to speed up the test.
> 
> Now back to our ARM test case: UEFI requires a 64-bit CPU (hence
> cortex-a57), but unlike x86 this obviously breaks when the KVM accelerator
> is used on an ARM host, since KVM doesn't work with anything other than
> the 'host' cpu model.
> 
> I think we still want to use KVM whenever possible, but the problem is
> that the user (testcase) has no idea whether the KVM accelerator is
> available and whether the host is a 64-bit ARM CPU.
> 
> To sum up, we need to support 2 modes:
>   1. host is 64-bit ARM: use kvm with -cpu host
>   2. all other cases: use tcg with -cpu cortex-a57
> 
> I can hack up a probe for whether /dev/kvm is accessible and the host is
> 64-bit, and use #1, otherwise fall back to #2; or, as a quick fix, do only
> #2 initially and think about a better solution for #1.

Thanks!
Fine to me.

> 
> Are there any other suggestions/opinions on how to approach the issue/proceed?

To check whether the host CPU architecture is ARM, maybe we can check whether
the "CPU implementer" value in /proc/cpuinfo is 0x41.

Best Regards,
Wei

> 
> PS:
> we probably would like to reuse this not only for acpi tests but also for 
> other
> arm/virt test cases that involve running guest code. 
> 
>> Thanks!
>>
>> Best Regards,
>> Wei
> 
> 
> .
> 




Re: [Qemu-devel] [PATCH v4 09/11] virtio-net: update the head descriptor in a chain lastly

2019-02-19 Thread Wei Xu
On Wed, Feb 20, 2019 at 10:34:32AM +0800, Jason Wang wrote:
> 
> On 2019/2/20 9:54 AM, Wei Xu wrote:
> >On Tue, Feb 19, 2019 at 09:09:33PM +0800, Jason Wang wrote:
> >>On 2019/2/19 6:51 PM, Wei Xu wrote:
> >>>On Tue, Feb 19, 2019 at 03:23:01PM +0800, Jason Wang wrote:
> >>>>On 2019/2/14 12:26 PM, w...@redhat.com wrote:
> >>>>>From: Wei Xu 
> >>>>>
> >>>>>This is a helper for packed ring.
> >>>>>
> >>>>>To support packed ring, the head descriptor in a chain should be updated
> >>>>>last, since there is no 'avail_idx' (as in the split ring) to explicitly
> >>>>>tell the driver side that the whole chain is ready; the head descriptor
> >>>>>becomes visible to the driver immediately once it is updated.
> >>>>>
> >>>>>This patch fills the head descriptor after all the other ones are done.
> >>>>>
> >>>>>Signed-off-by: Wei Xu 
> >>>>It's really odd to work around an API issue in the implementation of a
> >>>>device. Please introduce batched used-ring updating helpers instead.
> >>>Can you elaborate a bit more? I don't get it either.
> >>>
> >>>The exact batching done by vhost-net or the dpdk pmd is not supported by
> >>>the userspace backend. The change here is to keep the head descriptor
> >>>updated last in the case of chained descriptors, and a helper might not
> >>>help too much.
> >>>
> >>>Wei
> >>
> >>Of course we can add batching support, why not?
> >It is always good to improve performance, but this could probably be
> >done in a separate series; also we need to bear in mind that the QEMU
> >userspace backend is usually not the first option for performance-oriented
> >users.
> 
> 
> The point is to hide layout specific things from device emulation. If it
> helps for performance, it could be treated as a good byproduct.
> 
> 
> >
> >AFAICT, virtqueue_fill() is a generic API for all relevant userspace virtio
> >devices that do not support batching; without touching virtqueue_fill(),
> >supporting batching changes the meaning of the parameter 'idx', which should
> >be kept as is.
> >
> >To fix it, I have two proposals so far:
> >1). batching support (two APIs needed to keep compatibility)
> >2). save a head elem for a vq instead of caching an array of elems like
> >vhost, and introduce a new API (virtqueue_chain_fill()) that takes an
> > additional parameter 'more' compared to the current virtqueue_fill(), to
> > indicate whether there are more descriptor(s) coming in a chain.
> >
> >Either way it changes the API somehow, and neither seems as clean and clear
> >as wanted.
> 
> 
> It's as simple as accepting an array of elems in e.g
> virtqueue_fill_batched()?

It is trivial for both; my concern is that an array of elements would need to
be allocated dynamically based on the vq size, which no other device is doing,
while a single cached head would be enough for option 2.
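
(A sketch of the batched variant Jason names; the signature is hypothetical,
it is only suggested in this thread and is not an existing QEMU API:)

    /* Hypothetical batched variant of virtqueue_fill(). A packed-ring
     * implementation could then write the head descriptor last internally,
     * hiding the ring-layout-specific ordering from device emulation. */
    static void virtqueue_fill_batched(VirtQueue *vq,
                                       const VirtQueueElement *elems,
                                       const unsigned int *lens,
                                       unsigned int count)
    {
        unsigned int i;

        for (i = 0; i < count; i++) {
            virtqueue_fill(vq, &elems[i], lens[i], i);
        }
    }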

> 
> 
> >
> >Any better idea?
> >
> >>Your code assumes the device knows virtio layout specific details,
> >>which breaks the layering. The device should not care about the actual layout.
> >>
> >Good point, but anyway, changes to the virtio-net receive code path are
> >unavoidable to support both split and packed rings, and batching is more
> >like a new feature somehow.
> 
> 
> It's ok to change the code as a result of introducing a generic helper, but
> it's bad to change the code to work around a bad API.

Agree.

Wei

> 
> Thanks
> 
> 
> >
> >Wei
> >>Thanks
> >>
> >>
> >>>>Thanks
> >>>>
> >>>>
> >>>>>---
> >>>>>  hw/net/virtio-net.c | 11 ++-
> >>>>>  1 file changed, 10 insertions(+), 1 deletion(-)
> >>>>>
> >>>>>diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> >>>>>index 3f319ef..330abea 100644
> >>>>>--- a/hw/net/virtio-net.c
> >>>>>+++ b/hw/net/virtio-net.c
> >>>>>@@ -1251,6 +1251,8 @@ static ssize_t 
> >>>>>virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
> >>>>>  struct virtio_net_hdr_mrg_rxbuf mhdr;
> >>>>>  unsigned mhdr_cnt = 0;
> >>>>>  size_t offset, i, guest_offset;
> >>>>>+VirtQueueElement head;
> >>>>>+int head_len = 0;
> >>>>>  if (!virtio_net_can_receive(nc)) {
> >>>>>  return -1;
> >>>>>@@ -1328,7 +1330,13 @@ static ssize_t 
> >>>>>virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
> >>>>>  }
> >>>>>  /* signal other side */
> >>>>>-virtqueue_fill(q->rx_vq, elem, total, i++);
> >>>>>+if (i == 0) {
> >>>>>+head_len = total;
> >>>>>+head = *elem;
> >>>>>+} else {
> >>>>>+virtqueue_fill(q->rx_vq, elem, len, i);
> >>>>>+}
> >>>>>+i++;
> >>>>>  g_free(elem);
> >>>>>  }
> >>>>>@@ -1339,6 +1347,7 @@ static ssize_t 
> >>>>>virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
> >>>>>   &mhdr.num_buffers, sizeof mhdr.num_buffers);
> >>>>>  }
> >>>>>+virtqueue_fill(q->rx_vq, &head, head_len, 0);
> >>>>>  virtqueue_flush(q->rx_vq, i);
> >>>>>  virtio_notify(vdev, q->rx_vq);
> 



Re: [Qemu-devel] [PATCH v4 08/11] virtio: event suppression support for packed ring

2019-02-19 Thread Wei Xu
On Tue, Feb 19, 2019 at 09:06:42PM +0800, Jason Wang wrote:
> 
> On 2019/2/19 6:40 PM, Wei Xu wrote:
> >On Tue, Feb 19, 2019 at 03:19:58PM +0800, Jason Wang wrote:
> >>On 2019/2/14 12:26 PM, w...@redhat.com wrote:
> >>>From: Wei Xu 
> >>>
> >>>Difference between 'avail_wrap_counter' and 'last_avail_wrap_counter':
> >>>For Tx(guest transmitting), they are the same after each pop of a desc.
> >>>
> >>>For Rx(guest receiving), they are also the same when there are enough
> >>>descriptors to carry the payload for a packet (e.g. usually 16 descs are
> >>>needed for a 64k packet in a typical iperf tcp connection with tso enabled).
> >>>However, when the ring is running out of descriptors while there are
> >>>still a few free ones (e.g. 6 descriptors are available, which is not
> >>>enough to carry an entire packet that needs 16 descriptors), the
> >>>'avail_wrap_counter' should be set to that of the first descriptor pending
> >>>handling by the guest driver in order to get a notification, while
> >>>'last_avail_wrap_counter' should stay unchanged at the head of the
> >>>available descriptors, like below:
> >>>
> >>>Mark meaning:
> >>> | | -- available
> >>> |*| -- used
> >>>
> >>>A Snapshot of the queue:
> >>>   last_avail_idx = 253
> >>>   last_avail_wrap_counter = 1
> >>>  |
> >>> +-+
> >>>  0  | | | |*|*|*|*|*|*|*|*|*|*|*|*|*|*|*|*|*| | | | 255
> >>> +-+
> >>>|
> >>>   shadow_avail_idx = 3
> >>>   avail_wrap_counter = 0
> >>
> >>Well, this might not be the right place to describe the difference between
> >>shadow_avail_idx and last_avail_idx. And the comments above their
> >>definitions look good enough?
> >Sorry, I do not get it, can you elaborate more?
> 
> 
> I meant the comment is good enough to explain what it does. For packed ring,
> the only difference is the wrap counter. You can add e.g. "The wrap counter
> of the next head to pop" above last_avail_wrap_counter, and the same for
> shadow_avail_wrap_counter.

OK, I will remove the example part.

> 
> 
> >
> >This is one of the bug-prone parts of packed ring; it is good to make it
> >clear here.
> >
> >>     /* Next head to pop */
> >>     uint16_t last_avail_idx;
> >>
> >>     /* Last avail_idx read from VQ. */
> >>     uint16_t shadow_avail_idx;
> >>
> >What is the meaning of these comments?
> 
> 
> It's pretty easy to be understood, isn't it?

:)

> 
> 
> > Do you mean I should put them
> >into the comments as well? Thanks.
> >
> >>Instead, how or why event suppression is needed is not mentioned ...
> >Yes, but presumably this knowledge has been acquired from reading through the
> >spec, so I skipped this part.
> 
> 
> You need to at least add a reference to the spec. The commit log is pretty
> important for starters to understand what has been done in the patch before
> reading the code. I'm pretty sure they will get confused reading what you
> wrote here.
> 

OK.

> 
> Thanks
> 
> 
> >
> >Wei
> >
> >>
> >>
> >>>Signed-off-by: Wei Xu 
> >>>---
> >>>  hw/virtio/virtio.c | 137 +
> >>>  1 file changed, 128 insertions(+), 9 deletions(-)
> >>>
> >>>diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >>>index 7e276b4..8cfc7b6 100644
> >>>--- a/hw/virtio/virtio.c
> >>>+++ b/hw/virtio/virtio.c
> >>>@@ -234,6 +234,34 @@ static void vring_desc_read(VirtIODevice *vdev, 
> >>>VRingDesc *desc,
> >>>  virtio_tswap16s(vdev, &desc->next);
> >>>  }
> >>>+static void vring_packed_event_read(VirtIODevice *vdev,
> >>>+MemoryRegionCache *cache, VRingPackedDescEvent *e)
> >>>+{
> >>>+address_space_read_cached(cache, 0, e, sizeof(*e));
> >>>+virtio_tswap16s(vdev, &e->off_wrap);
> >>>+virtio_tswap16s(vdev, &e->flags);
> >>>+}
> >>>+
> >>>+static void vring_packed_off_wrap_write(VirtIODevice *vdev,
> >>>+   

Re: [Qemu-devel] [PATCH v4 09/11] virtio-net: update the head descriptor in a chain lastly

2019-02-19 Thread Wei Xu
On Tue, Feb 19, 2019 at 09:09:33PM +0800, Jason Wang wrote:
> 
> On 2019/2/19 6:51 PM, Wei Xu wrote:
> >On Tue, Feb 19, 2019 at 03:23:01PM +0800, Jason Wang wrote:
> >>On 2019/2/14 12:26 PM, w...@redhat.com wrote:
> >>>From: Wei Xu 
> >>>
> >>>This is a helper for packed ring.
> >>>
> >>>To support packed ring, the head descriptor in a chain should be updated
> >>>last, since there is no 'avail_idx' (as in the split ring) to explicitly
> >>>tell the driver side that the whole chain is ready; the head descriptor
> >>>becomes visible to the driver immediately once it is updated.
> >>>
> >>>This patch fills the head descriptor after all the other ones are done.
> >>>
> >>>Signed-off-by: Wei Xu 
> >>
> >>It's really odd to work around an API issue in the implementation of a
> >>device. Please introduce batched used-ring updating helpers instead.
> >Can you elaborate a bit more? I don't get it either.
> >
> >The exact batching done by vhost-net or the dpdk pmd is not supported by
> >the userspace backend. The change here is to keep the head descriptor
> >updated last in the case of chained descriptors, and a helper might not
> >help too much.
> >
> >Wei
> 
> 
> Of course we can add batching support, why not?

It is always good to improve performance, but this could probably be done in
a separate series; also we need to bear in mind that the QEMU userspace
backend is usually not the first option for performance-oriented users.

AFAICT, virtqueue_fill() is a generic API for all relevant userspace virtio
devices that do not support batching; without touching virtqueue_fill(),
supporting batching changes the meaning of the parameter 'idx', which should
be kept as is.

To fix it, I have two proposals so far:
1). batching support (two APIs needed to keep compatibility)
2). save a head elem for a vq instead of caching an array of elems like vhost,
and introduce a new API (virtqueue_chain_fill()) that takes an additional
parameter 'more' compared to the current virtqueue_fill(), to indicate whether
there are more descriptor(s) coming in a chain.

Either way it changes the API somehow, and neither seems as clean and clear
as wanted.

Any better idea?
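
(To make proposal 2 concrete, the new API could look roughly like this;
everything below is a hypothetical sketch of the 'more' flag idea, not code
from the series:)

    /* Hypothetical sketch of proposal 2: cache the head element and defer
     * exposing it until the last element of the chain has been filled. */
    static void virtqueue_chain_fill(VirtQueue *vq, const VirtQueueElement *elem,
                                     unsigned int len, unsigned int idx, bool more)
    {
        if (idx == 0) {
            vq->chain_head = *elem;     /* hypothetical cached fields */
            vq->chain_head_len = len;
        } else {
            virtqueue_fill(vq, elem, len, idx);
        }
        if (!more) {
            /* Chain complete: now make the head visible to the driver. */
            virtqueue_fill(vq, &vq->chain_head, vq->chain_head_len, 0);
        }
    }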

> 
> Your code assumes the device knows virtio layout specific details,
> which breaks the layering. The device should not care about the actual layout.
>

Good point, but anyway, changes to the virtio-net receive code path are
unavoidable to support both split and packed rings, and batching is more like
a new feature somehow.

Wei
 
> Thanks
> 
> 
> >>Thanks
> >>
> >>
> >>>---
> >>>  hw/net/virtio-net.c | 11 ++-
> >>>  1 file changed, 10 insertions(+), 1 deletion(-)
> >>>
> >>>diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> >>>index 3f319ef..330abea 100644
> >>>--- a/hw/net/virtio-net.c
> >>>+++ b/hw/net/virtio-net.c
> >>>@@ -1251,6 +1251,8 @@ static ssize_t virtio_net_receive_rcu(NetClientState 
> >>>*nc, const uint8_t *buf,
> >>>  struct virtio_net_hdr_mrg_rxbuf mhdr;
> >>>  unsigned mhdr_cnt = 0;
> >>>  size_t offset, i, guest_offset;
> >>>+VirtQueueElement head;
> >>>+int head_len = 0;
> >>>  if (!virtio_net_can_receive(nc)) {
> >>>  return -1;
> >>>@@ -1328,7 +1330,13 @@ static ssize_t 
> >>>virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
> >>>  }
> >>>  /* signal other side */
> >>>-virtqueue_fill(q->rx_vq, elem, total, i++);
> >>>+if (i == 0) {
> >>>+head_len = total;
> >>>+head = *elem;
> >>>+} else {
> >>>+virtqueue_fill(q->rx_vq, elem, len, i);
> >>>+}
> >>>+i++;
> >>>  g_free(elem);
> >>>  }
> >>>@@ -1339,6 +1347,7 @@ static ssize_t virtio_net_receive_rcu(NetClientState 
> >>>*nc, const uint8_t *buf,
> >>>   &mhdr.num_buffers, sizeof mhdr.num_buffers);
> >>>  }
> >>>+virtqueue_fill(q->rx_vq, &head, head_len, 0);
> >>>  virtqueue_flush(q->rx_vq, i);
> >>>  virtio_notify(vdev, q->rx_vq);
> 



Re: [Qemu-devel] [PATCH v4 11/11] virtio: CLI and provide packed ring feature bit by default

2019-02-19 Thread Wei Xu
On Tue, Feb 19, 2019 at 09:33:40PM +0800, Jason Wang wrote:
> 
> On 2019/2/19 7:23 PM, Wei Xu wrote:
> >On Tue, Feb 19, 2019 at 03:32:19PM +0800, Jason Wang wrote:
> >>On 2019/2/14 12:26 PM, w...@redhat.com wrote:
> >>>From: Wei Xu
> >>>
> >>>Add userspace and vhost kernel/user support.
> >>>
> >>>Add CLI "ring_packed=true/false" to enable/disable packed ring provision.
> >>>Usage:
> >>> -device 
> >>> virtio-net-pci,netdev=xx,mac=xx:xx:xx:xx:xx:xx,ring_packed=false
> >>>
> >>>By default it is provided.
> >>Please compat this for old machine types.
> >It is provided by default; how do I make it compatible with old machine
> >types? Hide or provide it?
> >
> >Wei
> >
> 
> Take a look at e.g. how pc_compat_3_1 and hw_compat_3_1 were used.

OK, thanks.
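
(For reference, the mechanism Jason points at amounts to adding an entry to
the 3.1 compat property list so older machine types keep the bit off; the
"ring_packed" entry below is a sketch matching this series' property name,
not merged code:)

    GlobalProperty hw_compat_3_1[] = {
        /* ... existing entries ... */
        /* sketch: turn the new feature off on pre-4.0 machine types */
        { "virtio-device", "ring_packed", "false" },
    };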

Wei
> 
> Thanks
> 
> 



Re: [Qemu-devel] [PATCH v4 07/11] virtio: fill/flush/pop for packed ring

2019-02-19 Thread Wei Xu
On Tue, Feb 19, 2019 at 05:33:57PM +0800, Jason Wang wrote:
> 
> On 2019/2/19 4:21 PM, Wei Xu wrote:
> >On Tue, Feb 19, 2019 at 02:49:42PM +0800, Jason Wang wrote:
> >>On 2019/2/18 10:46 PM, Wei Xu wrote:
> >>>>Do we allow chain more descriptors than vq size in the case of indirect?
> >>>>According to the spec:
> >>>>
> >>>>"
> >>>>
> >>>>The device limits the number of descriptors in a list through a
> >>>>transport-specific and/or device-specific value. If not limited,
> >>>>the maximum number of descriptors in a list is the virt queue
> >>>>size.
> >>>>"
> >>>>
> >>>>This looks possible, so the above is probably wrong if the max number of
> >>>>chained descriptors is negotiated in a device-specific way.
> >>>>
> >>>OK, I will remove this check, though it is necessary to have a limit for
> >>>indirect descriptors anyway; otherwise it is possible to hit an overflow,
> >>>since presumably u16 is used for most numbers/sizes in the spec.
> >>>
> >>Please try to test block and scsi devices with your changes as well.
> >Any idea what kind of tests should be covered? AFAICT, currently packed
> >ring is targeted at virtio-net, as discussed during the spec vote.
> >
> >Wei
> 
> 
> Well, it's not only for net for sure; it should support all kinds of devices.
> For testing, you should test basic function plus migration.

For sure we will support all the other devices; can we make it work for the
virtio-net device first and then move on to other devices?

Also, can anybody give me a CLI example for block and scsi devices?
I will give it a quick shot.

Wei
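
(Typical invocations would be something along these lines; paths and IDs are
placeholders, and ring_packed is this series' property:)

    -drive file=disk.img,if=none,id=drive0 \
    -device virtio-blk-pci,drive=drive0,ring_packed=on

    -device virtio-scsi-pci,id=scsi0,ring_packed=on \
    -drive file=disk.img,if=none,id=drive1 \
    -device scsi-hd,bus=scsi0.0,drive=drive1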

> 
> Thanks
> 
> 
> >
> >>Thanks
> >>
> >>
> 



Re: [Qemu-devel] [PATCH v4 11/11] virtio: CLI and provide packed ring feature bit by default

2019-02-19 Thread Wei Xu
On Tue, Feb 19, 2019 at 03:32:19PM +0800, Jason Wang wrote:
> 
> On 2019/2/14 12:26 PM, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >Add userspace and vhost kernel/user support.
> >
> >Add CLI "ring_packed=true/false" to enable/disable packed ring provision.
> >Usage:
> > -device virtio-net-pci,netdev=xx,mac=xx:xx:xx:xx:xx:xx,ring_packed=false
> >
> >By default it is provided.
> 
> 
> Please compat this for old machine types.

It is provided by default; how do I make it compatible with old machine types?
Hide or provide it?

Wei

> 
> Thanks
> 
> 
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/net/vhost_net.c | 2 ++
> >  include/hw/virtio/virtio.h | 4 +++-
> >  2 files changed, 5 insertions(+), 1 deletion(-)
> >
> >diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> >index e037db6..f593086 100644
> >--- a/hw/net/vhost_net.c
> >+++ b/hw/net/vhost_net.c
> >@@ -53,6 +53,7 @@ static const int kernel_feature_bits[] = {
> >  VIRTIO_F_VERSION_1,
> >  VIRTIO_NET_F_MTU,
> >  VIRTIO_F_IOMMU_PLATFORM,
> >+VIRTIO_F_RING_PACKED,
> >  VHOST_INVALID_FEATURE_BIT
> >  };
> >@@ -78,6 +79,7 @@ static const int user_feature_bits[] = {
> >  VIRTIO_NET_F_MRG_RXBUF,
> >  VIRTIO_NET_F_MTU,
> >  VIRTIO_F_IOMMU_PLATFORM,
> >+VIRTIO_F_RING_PACKED,
> >  /* This bit implies RARP isn't sent by QEMU out of band */
> >  VIRTIO_NET_F_GUEST_ANNOUNCE,
> >diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> >index 9c1fa07..2eb27d2 100644
> >--- a/include/hw/virtio/virtio.h
> >+++ b/include/hw/virtio/virtio.h
> >@@ -264,7 +264,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
> >  DEFINE_PROP_BIT64("any_layout", _state, _field, \
> >VIRTIO_F_ANY_LAYOUT, true), \
> >  DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
> >-  VIRTIO_F_IOMMU_PLATFORM, false)
> >+  VIRTIO_F_IOMMU_PLATFORM, false), \
> >+DEFINE_PROP_BIT64("ring_packed", _state, _field, \
> >+  VIRTIO_F_RING_PACKED, true)
> >  hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
> >  hwaddr virtio_queue_get_avail_addr(VirtIODevice *vdev, int n);
> 



Re: [Qemu-devel] [PATCH v4 10/11] virtio: migration support for packed ring

2019-02-19 Thread Wei Xu
On Tue, Feb 19, 2019 at 03:30:41PM +0800, Jason Wang wrote:
> 
> On 2019/2/14 12:26 PM, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >Both userspace and vhost-net/user are supported with this patch.
> >
> >A new subsection is introduced for packed ring; only 'last_avail_idx'
> >and 'last_avail_wrap_counter' are saved/loaded, presumably because all the
> >other relevant data (inuse, used/avail index and wrap count) should be
> >the same.
> 
> 
> This is probably only true for net device, see comment in virtio_load():
> 
>     /*
>  * Some devices migrate VirtQueueElements that have been popped
>  * from the avail ring but not yet returned to the used ring.
>  * Since max ring size < UINT16_MAX it's safe to use modulo
>  * UINT16_MAX + 1 subtraction.
>  */
>     vdev->vq[i].inuse = (uint16_t)(vdev->vq[i].last_avail_idx -
>     vdev->vq[i].used_idx);
> 
> 
> So you need to migrate used_idx and used_wrap_counter since we don't have
> used idx.

This is trying to align with vhost-net/user as we discussed. Since all we
have done is support the virtio-net device for packed ring, maybe we can
consider supporting other devices after we have got it verified.
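
(To make the modulo point in the quoted comment concrete with a small example:)

    uint16_t last_avail_idx = 3, used_idx = 65534;
    /* (uint16_t)(3 - 65534) == 5: the subtraction wraps modulo UINT16_MAX + 1,
     * which is safe because the ring size is always < UINT16_MAX. */
    uint16_t inuse = (uint16_t)(last_avail_idx - used_idx);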

> 
> 
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 69 +++---
> >  1 file changed, 66 insertions(+), 3 deletions(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index 8cfc7b6..7c5de07 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -2349,6 +2349,13 @@ static bool virtio_virtqueue_needed(void *opaque)
> >  return virtio_host_has_feature(vdev, VIRTIO_F_VERSION_1);
> >  }
> >+static bool virtio_packed_virtqueue_needed(void *opaque)
> >+{
> >+VirtIODevice *vdev = opaque;
> >+
> >+return virtio_host_has_feature(vdev, VIRTIO_F_RING_PACKED);
> >+}
> >+
> >  static bool virtio_ringsize_needed(void *opaque)
> >  {
> >  VirtIODevice *vdev = opaque;
> >@@ -2390,6 +2397,17 @@ static const VMStateDescription vmstate_virtqueue = {
> >  }
> >  };
> >+static const VMStateDescription vmstate_packed_virtqueue = {
> >+.name = "packed_virtqueue_state",
> >+.version_id = 1,
> >+.minimum_version_id = 1,
> >+.fields = (VMStateField[]) {
> >+VMSTATE_UINT16(last_avail_idx, struct VirtQueue),
> >+VMSTATE_BOOL(last_avail_wrap_counter, struct VirtQueue),
> >+VMSTATE_END_OF_LIST()
> >+}
> >+};
> >+
> >  static const VMStateDescription vmstate_virtio_virtqueues = {
> >  .name = "virtio/virtqueues",
> >  .version_id = 1,
> >@@ -2402,6 +2420,18 @@ static const VMStateDescription 
> >vmstate_virtio_virtqueues = {
> >  }
> >  };
> >+static const VMStateDescription vmstate_virtio_packed_virtqueues = {
> >+.name = "virtio/packed_virtqueues",
> >+.version_id = 1,
> >+.minimum_version_id = 1,
> >+.needed = &virtio_packed_virtqueue_needed,
> >+.fields = (VMStateField[]) {
> >+VMSTATE_STRUCT_VARRAY_POINTER_KNOWN(vq, struct VirtIODevice,
> >+  VIRTIO_QUEUE_MAX, 0, vmstate_packed_virtqueue, VirtQueue),
> >+VMSTATE_END_OF_LIST()
> >+}
> >+};
> >+
> >  static const VMStateDescription vmstate_ringsize = {
> >  .name = "ringsize_state",
> >  .version_id = 1,
> >@@ -2522,6 +2552,7 @@ static const VMStateDescription vmstate_virtio = {
> >  &vmstate_virtio_ringsize,
> >  &vmstate_virtio_broken,
> >  &vmstate_virtio_extra_state,
> >+&vmstate_virtio_packed_virtqueues,
> >  NULL
> >  }
> >  };
> >@@ -2794,6 +2825,17 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int 
> >version_id)
> >  virtio_queue_update_rings(vdev, i);
> >  }
> >+if (virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
> >+vdev->vq[i].shadow_avail_idx = vdev->vq[i].last_avail_idx;
> >+vdev->vq[i].avail_wrap_counter =
> >+vdev->vq[i].last_avail_wrap_counter;
> >+
> >+vdev->vq[i].used_idx = vdev->vq[i].last_avail_idx;
> >+vdev->vq[i].used_wrap_counter =
> >+vdev->vq[i].last_avail_wrap_counter;
> >+continue;
> >

Re: [Qemu-devel] [PATCH v4 09/11] virtio-net: update the head descriptor in a chain lastly

2019-02-19 Thread Wei Xu
On Tue, Feb 19, 2019 at 03:23:01PM +0800, Jason Wang wrote:
> 
> On 2019/2/14 12:26 PM, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >This is a helper for packed ring.
> >
> >To support packed ring, the head descriptor in a chain should be updated
> >last, since there is no 'avail_idx' (as in the split ring) to explicitly
> >tell the driver side that the whole chain is ready; the head descriptor
> >becomes visible to the driver immediately once it is updated.
> >
> >This patch fills the head descriptor after all the other ones are done.
> >
> >Signed-off-by: Wei Xu 
> 
> 
> It's really odd to work around an API issue in the implementation of a
> device. Please introduce batched used-ring updating helpers instead.
Can you elaborate a bit more? I don't get it either.

The exact batching done by vhost-net or the dpdk pmd is not supported by the
userspace backend. The change here is to keep the head descriptor updated
last in the case of chained descriptors, and a helper might not help
too much.

Wei
> 
> Thanks
> 
> 
> >---
> >  hw/net/virtio-net.c | 11 ++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> >diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> >index 3f319ef..330abea 100644
> >--- a/hw/net/virtio-net.c
> >+++ b/hw/net/virtio-net.c
> >@@ -1251,6 +1251,8 @@ static ssize_t virtio_net_receive_rcu(NetClientState 
> >*nc, const uint8_t *buf,
> >  struct virtio_net_hdr_mrg_rxbuf mhdr;
> >  unsigned mhdr_cnt = 0;
> >  size_t offset, i, guest_offset;
> >+VirtQueueElement head;
> >+int head_len = 0;
> >  if (!virtio_net_can_receive(nc)) {
> >  return -1;
> >@@ -1328,7 +1330,13 @@ static ssize_t virtio_net_receive_rcu(NetClientState 
> >*nc, const uint8_t *buf,
> >  }
> >  /* signal other side */
> >-virtqueue_fill(q->rx_vq, elem, total, i++);
> >+if (i == 0) {
> >+head_len = total;
> >+head = *elem;
> >+} else {
> >+virtqueue_fill(q->rx_vq, elem, len, i);
> >+}
> >+i++;
> >  g_free(elem);
> >  }
> >@@ -1339,6 +1347,7 @@ static ssize_t virtio_net_receive_rcu(NetClientState 
> >*nc, const uint8_t *buf,
> >   &mhdr.num_buffers, sizeof mhdr.num_buffers);
> >  }
> >+virtqueue_fill(q->rx_vq, &head, head_len, 0);
> >  virtqueue_flush(q->rx_vq, i);
> >  virtio_notify(vdev, q->rx_vq);
> 



Re: [Qemu-devel] [PATCH v4 08/11] virtio: event suppression support for packed ring

2019-02-19 Thread Wei Xu
On Tue, Feb 19, 2019 at 03:19:58PM +0800, Jason Wang wrote:
> 
> On 2019/2/14 12:26 PM, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >Difference between 'avail_wrap_counter' and 'last_avail_wrap_counter':
> >For Tx(guest transmitting), they are the same after each pop of a desc.
> >
> >For Rx(guest receiving), they are also the same when there are enough
> >descriptors to carry the payload for a packet (e.g. usually 16 descs are
> >needed for a 64k packet in a typical iperf tcp connection with tso enabled).
> >However, when the ring is running out of descriptors while there are
> >still a few free ones (e.g. 6 descriptors are available, which is not
> >enough to carry an entire packet that needs 16 descriptors), the
> >'avail_wrap_counter' should be set to that of the first descriptor pending
> >handling by the guest driver in order to get a notification, while
> >'last_avail_wrap_counter' should stay unchanged at the head of the
> >available descriptors, like below:
> >
> >Mark meaning:
> > | | -- available
> > |*| -- used
> >
> >A Snapshot of the queue:
> >   last_avail_idx = 253
> >   last_avail_wrap_counter = 1
> >  |
> > +-+
> >  0  | | | |*|*|*|*|*|*|*|*|*|*|*|*|*|*|*|*|*| | | | 255
> > +-+
> >|
> >   shadow_avail_idx = 3
> >   avail_wrap_counter = 0
> 
> 
> Well, this might not be the right place to describe the difference between
> shadow_avail_idx and last_avail_idx. And the comments above their
> definitions look good enough?

Sorry, I do not get it, can you elaborate more? 

This is one of the bug-prone parts of packed ring; it is good to make it clear here.

> 
>     /* Next head to pop */
>     uint16_t last_avail_idx;
> 
>     /* Last avail_idx read from VQ. */
>     uint16_t shadow_avail_idx;
> 

What is the meaning of these comments? Do you mean I should put them
into the comments as well? Thanks.

> Instead, how or why event suppression is needed is not mentioned ...

Yes, but presumably this knowledge has been acquired from reading through the
spec, so I skipped this part.
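
(For readers without the spec at hand, the suppression area being read and
written here is, per the virtio 1.1 spec:)

    struct VRingPackedDescEvent {
        uint16_t off_wrap; /* bits 0..14: descriptor event offset,
                              bit 15: event wrap counter */
        uint16_t flags;    /* 0: notifications enabled, 1: disabled,
                              2: enable for a specific descriptor (event_idx) */
    };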

Wei

> 
> 
> 
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 137 +
> >  1 file changed, 128 insertions(+), 9 deletions(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index 7e276b4..8cfc7b6 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -234,6 +234,34 @@ static void vring_desc_read(VirtIODevice *vdev, 
> >VRingDesc *desc,
> >  virtio_tswap16s(vdev, &desc->next);
> >  }
> >+static void vring_packed_event_read(VirtIODevice *vdev,
> >+MemoryRegionCache *cache, VRingPackedDescEvent *e)
> >+{
> >+address_space_read_cached(cache, 0, e, sizeof(*e));
> >+virtio_tswap16s(vdev, &e->off_wrap);
> >+virtio_tswap16s(vdev, &e->flags);
> >+}
> >+
> >+static void vring_packed_off_wrap_write(VirtIODevice *vdev,
> >+MemoryRegionCache *cache, uint16_t off_wrap)
> >+{
> >+virtio_tswap16s(vdev, &off_wrap);
> >+address_space_write_cached(cache, offsetof(VRingPackedDescEvent, off_wrap),
> >+&off_wrap, sizeof(off_wrap));
> >+address_space_cache_invalidate(cache,
> >+offsetof(VRingPackedDescEvent, off_wrap), sizeof(off_wrap));
> >+}
> >+
> >+static void vring_packed_flags_write(VirtIODevice *vdev,
> >+MemoryRegionCache *cache, uint16_t flags)
> >+{
> >+virtio_tswap16s(vdev, &flags);
> >+address_space_write_cached(cache, offsetof(VRingPackedDescEvent, flags),
> >+&flags, sizeof(flags));
> >+address_space_cache_invalidate(cache,
> >+offsetof(VRingPackedDescEvent, flags), sizeof(flags));
> >+}
> >+
> >  static VRingMemoryRegionCaches *vring_get_region_caches(struct VirtQueue 
> > *vq)
> >  {
> >  VRingMemoryRegionCaches *caches = atomic_rcu_read(&vq->vring.caches);
> >@@ -340,14 +368,8 @@ static inline void vring_set_avail_event(VirtQueue *vq, 
> >uint16_t val)
> >  address_space_cache_invalidate(&caches->used, pa, sizeof(val));
> >  }
> >-void virtio_queue_set_notification(VirtQueue *vq, int enable)
> >+static void virtio_queue_set_notification_split

Re: [Qemu-devel] [PATCH v4 06/11] virtio: get avail bytes check for packed ring

2019-02-19 Thread Wei Xu
On Tue, Feb 19, 2019 at 02:24:11PM +0800, Jason Wang wrote:
> 
> On 2019/2/19 1:07 AM, Wei Xu wrote:
> >On Mon, Feb 18, 2019 at 03:27:21PM +0800, Jason Wang wrote:
> >>On 2019/2/14 12:26 PM, w...@redhat.com wrote:
> >>>From: Wei Xu 
> >>>
> >>>Add packed ring headcount check.
> >>>
> >>>Common part of split/packed ring are kept.
> >>>
> >>>Signed-off-by: Wei Xu 
> >>>---
> >>>  hw/virtio/virtio.c | 197 -
> >>>  1 file changed, 179 insertions(+), 18 deletions(-)
> >>>
> >>>diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >>>index f2ff980..832287b 100644
> >>>--- a/hw/virtio/virtio.c
> >>>+++ b/hw/virtio/virtio.c
> >>>@@ -368,6 +368,17 @@ int virtio_queue_ready(VirtQueue *vq)
> >>>  return vq->vring.avail != 0;
> >>>  }
> >>>+static void vring_packed_desc_read(VirtIODevice *vdev, VRingPackedDesc *desc,
> >>>+MemoryRegionCache *cache, int i)
> >>>+{
> >>>+address_space_read_cached(cache, i * sizeof(VRingPackedDesc),
> >>>+  desc, sizeof(VRingPackedDesc));
> >>>+virtio_tswap16s(vdev, &desc->flags);
> >>>+virtio_tswap64s(vdev, &desc->addr);
> >>>+virtio_tswap32s(vdev, &desc->len);
> >>>+virtio_tswap16s(vdev, &desc->id);
> >>>+}
> >>>+
> >>>  static void vring_packed_desc_read_flags(VirtIODevice *vdev,
> >>>  VRingPackedDesc *desc, MemoryRegionCache *cache, int i)
> >>>  {
> >>>@@ -667,9 +678,9 @@ static int virtqueue_read_next_desc(VirtIODevice 
> >>>*vdev, VRingDesc *desc,
> >>>  return VIRTQUEUE_READ_DESC_MORE;
> >>>  }
> >>>-void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
> >>>-   unsigned int *out_bytes,
> >>>-   unsigned max_in_bytes, unsigned 
> >>>max_out_bytes)
> >>>+static void virtqueue_split_get_avail_bytes(VirtQueue *vq,
> >>>+unsigned int *in_bytes, unsigned int 
> >>>*out_bytes,
> >>>+unsigned max_in_bytes, unsigned max_out_bytes)
> >>>  {
> >>>  VirtIODevice *vdev = vq->vdev;
> >>>  unsigned int max, idx;
> >>>@@ -679,27 +690,12 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, 
> >>>unsigned int *in_bytes,
> >>>  int64_t len = 0;
> >>>  int rc;
> >>>-if (unlikely(!vq->vring.desc)) {
> >>>-if (in_bytes) {
> >>>-*in_bytes = 0;
> >>>-}
> >>>-if (out_bytes) {
> >>>-*out_bytes = 0;
> >>>-}
> >>>-return;
> >>>-}
> >>>-
> >>>  rcu_read_lock();
> >>>  idx = vq->last_avail_idx;
> >>>  total_bufs = in_total = out_total = 0;
> >>>  max = vq->vring.num;
> >>>  caches = vring_get_region_caches(vq);
> >>>-if (caches->desc.len < max * sizeof(VRingDesc)) {
> >>>-virtio_error(vdev, "Cannot map descriptor ring");
> >>>-goto err;
> >>>-}
> >>>-
> >>>  while ((rc = virtqueue_num_heads(vq, idx)) > 0) {
> >>>  MemoryRegionCache *desc_cache = &caches->desc;
> >>>  unsigned int num_bufs;
> >>>@@ -792,6 +788,171 @@ err:
> >>>  goto done;
> >>>  }
> >>>+static void virtqueue_packed_get_avail_bytes(VirtQueue *vq,
> >>>+unsigned int *in_bytes, unsigned int 
> >>>*out_bytes,
> >>>+unsigned max_in_bytes, unsigned max_out_bytes)
> >>>+{
> >>>+VirtIODevice *vdev = vq->vdev;
> >>>+unsigned int max, idx;
> >>>+unsigned int total_bufs, in_total, out_total;
> >>>+MemoryRegionCache *desc_cache;
> >>>+VRingMemoryRegionCaches *caches;
> >>>+MemoryRegionCache indirect_desc_cache = MEMORY_REGION_CACHE_INVALID;
> >>>+int64_t len = 0;
> >>>+VRingPackedDesc desc;
> >>>+bool wrap_counter;
> >>>+
> >>>+rcu_read_

Re: [Qemu-devel] [PATCH v4 07/11] virtio: fill/flush/pop for packed ring

2019-02-19 Thread Wei Xu
On Tue, Feb 19, 2019 at 02:49:42PM +0800, Jason Wang wrote:
> 
> On 2019/2/18 10:46 PM, Wei Xu wrote:
> >>Do we allow chain more descriptors than vq size in the case of indirect?
> >>According to the spec:
> >>
> >>"
> >>
> >>The device limits the number of descriptors in a list through a
> >>transport-specific and/or device-specific value. If not limited,
> >>the maximum number of descriptors in a list is the virt queue
> >>size.
> >>"
> >>
> >>This looks possible, so the above is probably wrong if the max number of
> >>chained descriptors is negotiated through a device specific way.
> >>
> >OK, I will remove this check, though it is necessary to have a limit for
> >indirect descriptors anyway; otherwise it is possible to hit an overflow,
> >since presumably u16 is used for most numbers/sizes in the spec.
> >
> 
> Please try to test block and scsi devices with your changes as well.

Any idea what kind of tests should be covered? AFAICT, currently packed
ring is targeted at virtio-net, as discussed during the spec vote.

Wei

> 
> Thanks
> 
> 



Re: [Qemu-devel] [PATCH v4 06/11] virtio: get avail bytes check for packed ring

2019-02-18 Thread Wei Xu
On Mon, Feb 18, 2019 at 03:27:21PM +0800, Jason Wang wrote:
> 
> On 2019/2/14 12:26 PM, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >Add packed ring headcount check.
> >
> >Common part of split/packed ring are kept.
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 197 -
> >  1 file changed, 179 insertions(+), 18 deletions(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index f2ff980..832287b 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -368,6 +368,17 @@ int virtio_queue_ready(VirtQueue *vq)
> >  return vq->vring.avail != 0;
> >  }
> >+static void vring_packed_desc_read(VirtIODevice *vdev, VRingPackedDesc *desc,
> >+MemoryRegionCache *cache, int i)
> >+{
> >+address_space_read_cached(cache, i * sizeof(VRingPackedDesc),
> >+  desc, sizeof(VRingPackedDesc));
> >+virtio_tswap16s(vdev, &desc->flags);
> >+virtio_tswap64s(vdev, &desc->addr);
> >+virtio_tswap32s(vdev, &desc->len);
> >+virtio_tswap16s(vdev, &desc->id);
> >+}
> >+
> >  static void vring_packed_desc_read_flags(VirtIODevice *vdev,
> >  VRingPackedDesc *desc, MemoryRegionCache *cache, int i)
> >  {
> >@@ -667,9 +678,9 @@ static int virtqueue_read_next_desc(VirtIODevice *vdev, 
> >VRingDesc *desc,
> >  return VIRTQUEUE_READ_DESC_MORE;
> >  }
> >-void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
> >-   unsigned int *out_bytes,
> >-   unsigned max_in_bytes, unsigned 
> >max_out_bytes)
> >+static void virtqueue_split_get_avail_bytes(VirtQueue *vq,
> >+unsigned int *in_bytes, unsigned int *out_bytes,
> >+unsigned max_in_bytes, unsigned max_out_bytes)
> >  {
> >  VirtIODevice *vdev = vq->vdev;
> >  unsigned int max, idx;
> >@@ -679,27 +690,12 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned 
> >int *in_bytes,
> >  int64_t len = 0;
> >  int rc;
> >-if (unlikely(!vq->vring.desc)) {
> >-if (in_bytes) {
> >-*in_bytes = 0;
> >-}
> >-if (out_bytes) {
> >-*out_bytes = 0;
> >-}
> >-return;
> >-}
> >-
> >  rcu_read_lock();
> >  idx = vq->last_avail_idx;
> >  total_bufs = in_total = out_total = 0;
> >  max = vq->vring.num;
> >  caches = vring_get_region_caches(vq);
> >-if (caches->desc.len < max * sizeof(VRingDesc)) {
> >-virtio_error(vdev, "Cannot map descriptor ring");
> >-goto err;
> >-}
> >-
> >  while ((rc = virtqueue_num_heads(vq, idx)) > 0) {
> >  MemoryRegionCache *desc_cache = &caches->desc;
> >  unsigned int num_bufs;
> >@@ -792,6 +788,171 @@ err:
> >  goto done;
> >  }
> >+static void virtqueue_packed_get_avail_bytes(VirtQueue *vq,
> >+unsigned int *in_bytes, unsigned int *out_bytes,
> >+unsigned max_in_bytes, unsigned max_out_bytes)
> >+{
> >+VirtIODevice *vdev = vq->vdev;
> >+unsigned int max, idx;
> >+unsigned int total_bufs, in_total, out_total;
> >+MemoryRegionCache *desc_cache;
> >+VRingMemoryRegionCaches *caches;
> >+MemoryRegionCache indirect_desc_cache = MEMORY_REGION_CACHE_INVALID;
> >+int64_t len = 0;
> >+VRingPackedDesc desc;
> >+bool wrap_counter;
> >+
> >+rcu_read_lock();
> >+idx = vq->last_avail_idx;
> >+wrap_counter = vq->last_avail_wrap_counter;
> >+total_bufs = in_total = out_total = 0;
> >+
> >+max = vq->vring.num;
> >+caches = vring_get_region_caches(vq);
> >+desc_cache = &caches->desc;
> >+vring_packed_desc_read_flags(vdev, &desc, desc_cache, idx);
> >+while (is_desc_avail(&desc, wrap_counter)) {
> >+unsigned int num_bufs;
> >+unsigned int i = 0;
> >+
> >+num_bufs = total_bufs;
> >+
> >+/* Make sure flags has been read before all the fields. */
> >+smp_rmb();
> >+vring_packed_desc_read(vdev, &desc, desc_cache, idx);
> 
> 
> It's better to have a single function to deal with reading flags and
> descriptors and checking their availability, like packed ring.

There is something di

Re: [Qemu-devel] [PATCH v4 07/11] virtio: fill/flush/pop for packed ring

2019-02-18 Thread Wei Xu
On Mon, Feb 18, 2019 at 03:51:05PM +0800, Jason Wang wrote:
> 
> On 2019/2/14 12:26 PM, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >last_used_idx/wrap_counter should be equal to last_avail_idx/wrap_counter
> >after a successful flush.
> >
> >Batching as in vhost-net & dpdk testpmd is not equivalently supported in
> >the userspace backend, but chained descriptors for Rx are similarly
> >presented as a lightweight batch, so a write barrier is applied only for
> >the first (head) descriptor.
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 291 +
> >  1 file changed, 274 insertions(+), 17 deletions(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index 832287b..7e276b4 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -379,6 +379,25 @@ static void vring_packed_desc_read(VirtIODevice *vdev, 
> >VRingPackedDesc *desc,
> >  virtio_tswap16s(vdev, >id);
> >  }
> >+static void vring_packed_desc_write_data(VirtIODevice *vdev,
> >+VRingPackedDesc *desc, MemoryRegionCache *cache, int i)
> >+{
> >+virtio_tswap32s(vdev, &desc->len);
> >+virtio_tswap16s(vdev, &desc->id);
> >+address_space_write_cached(cache,
> >+  i * sizeof(VRingPackedDesc) + offsetof(VRingPackedDesc, id),
> >+  &desc->id, sizeof(desc->id));
> >+address_space_cache_invalidate(cache,
> >+  i * sizeof(VRingPackedDesc) + offsetof(VRingPackedDesc, id),
> >+  sizeof(desc->id));
> >+address_space_write_cached(cache,
> >+  i * sizeof(VRingPackedDesc) + offsetof(VRingPackedDesc, len),
> >+  &desc->len, sizeof(desc->len));
> >+address_space_cache_invalidate(cache,
> >+  i * sizeof(VRingPackedDesc) + offsetof(VRingPackedDesc, len),
> >+  sizeof(desc->len));
> >+}
> >+
> >  static void vring_packed_desc_read_flags(VirtIODevice *vdev,
> >  VRingPackedDesc *desc, MemoryRegionCache *cache, int i)
> >  {
> >@@ -388,6 +407,18 @@ static void vring_packed_desc_read_flags(VirtIODevice 
> >*vdev,
> >  virtio_tswap16s(vdev, &desc->flags);
> >  }
> >+static void vring_packed_desc_write_flags(VirtIODevice *vdev,
> >+VRingPackedDesc *desc, MemoryRegionCache *cache, int i)
> >+{
> >+virtio_tswap16s(vdev, &desc->flags);
> >+address_space_write_cached(cache,
> >+  i * sizeof(VRingPackedDesc) + offsetof(VRingPackedDesc, flags),
> >+  &desc->flags, sizeof(desc->flags));
> >+address_space_cache_invalidate(cache,
> >+  i * sizeof(VRingPackedDesc) + offsetof(VRingPackedDesc, flags),
> >+  sizeof(desc->flags));
> >+}
> >+
> >  static inline bool is_desc_avail(struct VRingPackedDesc *desc,
> >  bool wrap_counter)
> >  {
> >@@ -554,19 +585,11 @@ bool virtqueue_rewind(VirtQueue *vq, unsigned int num)
> >  }
> >  /* Called within rcu_read_lock().  */
> >-void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
> >+static void virtqueue_split_fill(VirtQueue *vq, const VirtQueueElement *elem,
> >  unsigned int len, unsigned int idx)
> >  {
> >  VRingUsedElem uelem;
> >-trace_virtqueue_fill(vq, elem, len, idx);
> >-
> >-virtqueue_unmap_sg(vq, elem, len);
> >-
> >-if (unlikely(vq->vdev->broken)) {
> >-return;
> >-}
> >-
> >  if (unlikely(!vq->vring.used)) {
> >  return;
> >  }
> >@@ -578,16 +601,71 @@ void virtqueue_fill(VirtQueue *vq, const 
> >VirtQueueElement *elem,
> >  vring_used_write(vq, &uelem, idx);
> >  }
> >-/* Called within rcu_read_lock().  */
> >-void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >+static void virtqueue_packed_fill(VirtQueue *vq, const VirtQueueElement *elem,
> >+unsigned int len, unsigned int idx)
> >  {
> >-uint16_t old, new;
> >+uint16_t head;
> >+VRingMemoryRegionCaches *caches;
> >+VRingPackedDesc desc = {
> >+.flags = 0,
> >+.id = elem->index,
> >+.len = len,
> >+};
> >+bool wrap_counter = vq->used_wrap_counter;
> >+
> >+if (unlikely(!vq->vring.desc)) {
> >+return;
> >+}
> >+
> >+head = vq->used_idx + idx;
>

Re: [Qemu-devel] [PATCH v3 00/11] packed ring virtio-net backends support

2019-02-13 Thread Wei Xu
On Wed, Feb 13, 2019 at 09:17:57AM -0500, Michael S. Tsirkin wrote:
> On Wed, Feb 13, 2019 at 08:25:35AM -0500, w...@redhat.com wrote:
> > From: Wei Xu 
> > 
> > https://github.com/Whishay/qemu.git 
> > 
> > Userspace and vhost-net backend tests have been done with an upstream
> > kernel in the guest.
> 
> Just a general comment: please format *all* patches
> with --subject-prefix "PATCH v3", or with -v3.
> 
> Do not manually change patch 0 subject adding version there.
> 
> This makes it possible to figure out where does each patch go.
> 
OK, thanks a lot.

Wei
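
(i.e. letting git generate the prefix; branch and output names below are
illustrative:)

    git format-patch -v3 --cover-letter -o v3/ master..packed-ring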
> 
> > v2->v3
> > v2/01 - drop it since the header has been synchronized from kernel.(mst 
> > & jason)
> > v3/01 - rename 'avail_wrap_counter' to 'last_avail_wrap_counter',
> > 'event_wrap_counter' to 'avail_wrap_counter' to make it easier
> > to understand.(Jason)
> >   - revise commit message.(Jason)
> > v3/02 - split packed ring areas size calculation to next patch.(Jason)
> > to not break bisect(Jason).
> > v3/03 - initialize packed ring region with correct size and attribute.
> >   - remove unnecessary 'else' checks. (Jason)
> > v3/06 - add commit log.
> >   - replace 'event_wrap-counter' with 'avail_wrap_counter'.
> >   - merge common memory cache size check to 
> > virtqueue_get_avail_bytes().(Jason)
> >   - revise memory barrier comment.(Jason) 
> >   - check indirect descriptors by desc.len/sizeof(desc).(Jason)
> >   - flip wrap counter with '^=1'.(Jason)
> > v3/07 - move desc.id/len initialization to the declaration.(Jason)
> >   - flip wrap counter '!' with '^=1'.(Jason)
> >   - add memory barrier comments in commit message.
> > v3/08 - use offsetof() when writing cache.(Jason)
> >   - avoid duplicated memory region write when turning off event_idx
> > supported notification.(Jason)
> >   - add commit log.(Jason)
> >   - add avail & last_avail wrap counter difference description in 
> > commit log.
> > v3/09 - remove unnecessary used/avail idx/wrap-counter from subsection.
> >   - put new subsection to the end of vmstate_virtio.(Jason)
> >   - squash the two userspace and vhost-net migration patches in 
> > v2.(Jason)
> > v3/10 - reword commit message.
> >   - this is a help not a bug fix so I would like to keep it as a
> > separate patch still.(Proposed a merge it by Jason)
> >   - the virtqueue_fill() is also not like an API so I would prefer 
> > not
> >         to touch it, please correct me if I did not get it in the right
> > way.(Proposed a squash by Jason)
> > v3/11 - squash feature bits for user space and vhost kernel/user 
> > backends.
> >   - enable packed ring feature bit provision on host by 
> > default.(Jason)
> > 
> > Wei Xu (11):
> >   virtio: rename structure for packed ring
> >   virtio: device/driver area size calculation helper for split ring
> >   virtio: initialize packed ring region
> >   virtio: initialize wrap counter for packed ring
> >   virtio: queue/descriptor check helpers for packed ring
> >   virtio: get avail bytes check for packed ring
> >   virtio: fill/flush/pop for packed ring
> >   virtio: event suppression support for packed ring
> >   virtio-net: update the head descriptor in a chain lastly
> >   virtio: migration support for packed ring
> >   virtio: CLI and provide packed ring feature bit by default
> > 
> >  hw/net/vhost_net.c |   2 +
> >  hw/net/virtio-net.c|  11 +-
> >  hw/virtio/virtio.c | 798 +
> >  include/hw/virtio/virtio.h |   4 +-
> >  4 files changed, 757 insertions(+), 58 deletions(-)
> > 
> > -- 
> > 1.8.3.1



Re: [Qemu-devel] [PATCH v1 13/16] virtio: add vhost-net migration of packed ring

2019-01-16 Thread Wei Xu
On Wed, Nov 28, 2018 at 11:34:46AM +0100, Maxime Coquelin wrote:
> 
> 
> On 11/22/18 3:06 PM, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >tweaked vhost-net code to test migration.
> >
> >@@ -1414,64 +1430,20 @@ long vhost_vring_ioctl(struct vhost_dev
> > r = -EFAULT;
> > break;
> > }
> >+   vq->last_avail_idx = s.num & 0x7FFF;
> >+   /* Forget the cached index value. */
> >+   vq->avail_idx = vq->last_avail_idx;
> >+   if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
> >+   vq->last_avail_wrap_counter = !!(s.num & 0x8000);
> >+   vq->avail_wrap_counter = vq->last_avail_wrap_counter;
> >+
> >+   vq->last_used_idx = (s.num >> 16) & 0x7fff;
> >+   vq->last_used_wrap_counter = !!(s.num & 0x80000000);
> >+   }
> >+   break;
> >+   case VHOST_GET_VRING_BASE:
> >+   s.index = idx;
> >+s.num = vq->last_avail_idx;
> >+   if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
> >+   s.num |= vq->last_avail_wrap_counter << 15;
> >+   s.num |= vq->last_used_idx << 16;
> >+       s.num |= vq->last_used_wrap_counter << 31;
> >+   }
> >+   if (copy_to_user(argp, &s, sizeof(s)))
> >+   r = -EFAULT;
> >+   break;
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 35 ++-
> >  include/hw/virtio/virtio.h |  4 ++--
> >  2 files changed, 32 insertions(+), 7 deletions(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index 64d5c04..7487d3d 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -2963,19 +2963,40 @@ hwaddr virtio_queue_get_used_size(VirtIODevice 
> >*vdev, int n)
> >  }
> >  }
> >-uint16_t virtio_queue_get_last_avail_idx(VirtIODevice *vdev, int n)
> >+int virtio_queue_get_last_avail_idx(VirtIODevice *vdev, int n)
> >  {
> >-return vdev->vq[n].last_avail_idx;
> >+int idx;
> >+
> >+if (virtio_host_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
> 
> Also, I think you want to use virtio_vdev_has_feature() here instead,
> else it will set the wrap counter when ring_packed=on is given on the QEMU
> command line but the feature has not been negotiated.
> 
> 
> For example, with ring_packed=on and with stock Fedora 28 kernel, which
> does not support packed ring, I get this warning with DPDK vhost user
> backend:
> 
> VHOST_CONFIG: last_used_idx (32768) and vq->used->idx (0) mismatches;
> some packets maybe resent for Tx and dropped for Rx

Thanks, will fix it.

Wei
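
(In other words, the fix is to gate on the negotiated feature set rather than
the offered one; a sketch of the corrected check:)

    /* virtio_host_has_feature() checks what QEMU offers;
     * virtio_vdev_has_feature() checks what the guest actually accepted. */
    if (virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
        idx |= ((int)vdev->vq[n].avail_wrap_counter) << 15;
    }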

> 
> >+idx = vdev->vq[n].last_avail_idx;
> >+idx |= ((int)vdev->vq[n].avail_wrap_counter) << 15;
> >+idx |= (vdev->vq[n].used_idx) << 16;
> >+idx |= ((int)vdev->vq[n].used_wrap_counter) << 31;
> >+} else {
> >+idx = (int)vdev->vq[n].last_avail_idx;
> >+}
> >+return idx;
> >  }
> >-void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint16_t idx)
> >+void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, int idx)
> >  {
> >-vdev->vq[n].last_avail_idx = idx;
> >-vdev->vq[n].shadow_avail_idx = idx;
> >+if (virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
> >+vdev->vq[n].last_avail_idx = idx & 0x7fff;
> >+vdev->vq[n].avail_wrap_counter = !!(idx & 0x8000);
> >+vdev->vq[n].used_idx = (idx >> 16) & 0x7fff;
> >+vdev->vq[n].used_wrap_counter = !!(idx & 0x80000000);
> >+} else {
> >+vdev->vq[n].last_avail_idx = idx;
> >+vdev->vq[n].shadow_avail_idx = idx;
> >+}
> >  }
> >  void virtio_queue_restore_last_avail_idx(VirtIODevice *vdev, int n)
> >  {
> >+if (virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
> >+return;
> >+}
> >+
> >  rcu_read_lock();
> >  if (vdev->vq[n].vring.desc) {
> >  vdev->vq[n].last_avail_idx = vring_used_idx(&vdev->vq[n]);
> >@@ -2986,6 +3007,10 @@ void virtio_queue_restore_last_avail_idx(VirtIODevice 
> >*vdev, int n)
> >  void virtio_queue_update_used_idx(VirtIODevice *vdev, int n)
> >  {
>

Re: [Qemu-devel] [PATCH v1 12/16] virtio: add userspace migration of packed ring

2019-01-16 Thread Wei Xu
On Thu, Nov 22, 2018 at 10:45:36PM +0800, Jason Wang wrote:
> 
> On 2018/11/22 10:06 PM, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >Signed-off-by: Wei Xu 
> 
> 
> I think you need a subsection. Otherwise you will break migration
> compatibility.

ok, thanks.

Wei
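
(The subsection pattern, as it later appeared in v4 of this series quoted
earlier in this digest, looks roughly like:)

    static const VMStateDescription vmstate_virtio_packed_virtqueues = {
        .name = "virtio/packed_virtqueues",
        .version_id = 1,
        .minimum_version_id = 1,
        /* Only included in the stream when the feature was negotiated,
         * so migration to/from older QEMU stays compatible. */
        .needed = &virtio_packed_virtqueue_needed,
        .fields = (VMStateField[]) {
            VMSTATE_STRUCT_VARRAY_POINTER_KNOWN(vq, struct VirtIODevice,
                VIRTIO_QUEUE_MAX, 0, vmstate_packed_virtqueue, VirtQueue),
            VMSTATE_END_OF_LIST()
        }
    };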

> 
> Thanks
> 
> 
> >---
> >  hw/virtio/virtio.c | 18 ++
> >  1 file changed, 18 insertions(+)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index 240c4e3..64d5c04 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -2558,6 +2558,12 @@ int virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >   */
> >  qemu_put_be64(f, vdev->vq[i].vring.desc);
> >  qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >+qemu_put_8s(f, (const uint8_t *)&vdev->vq[i].avail_wrap_counter);
> >+qemu_put_8s(f, (const uint8_t *)&vdev->vq[i].event_wrap_counter);
> >+qemu_put_8s(f, (const uint8_t *)&vdev->vq[i].used_wrap_counter);
> >+qemu_put_be16s(f, &vdev->vq[i].used_idx);
> >+qemu_put_be16s(f, &vdev->vq[i].shadow_avail_idx);
> >+qemu_put_be32s(f, &vdev->vq[i].inuse);
> >  if (k->save_queue) {
> >  k->save_queue(qbus->parent, i, f);
> >  }
> >@@ -2705,6 +2711,14 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int 
> >version_id)
> >  }
> >  vdev->vq[i].vring.desc = qemu_get_be64(f);
> >  qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >+
> >+qemu_get_8s(f, (uint8_t *)&vdev->vq[i].avail_wrap_counter);
> >+qemu_get_8s(f, (uint8_t *)&vdev->vq[i].event_wrap_counter);
> >+qemu_get_8s(f, (uint8_t *)&vdev->vq[i].used_wrap_counter);
> >+qemu_get_be16s(f, &vdev->vq[i].used_idx);
> >+qemu_get_be16s(f, &vdev->vq[i].shadow_avail_idx);
> >+qemu_get_be32s(f, &vdev->vq[i].inuse);
> >+
> >  vdev->vq[i].signalled_used_valid = false;
> >  vdev->vq[i].notification = true;
> >@@ -2786,6 +2800,10 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int 
> >version_id)
> >  virtio_queue_update_rings(vdev, i);
> >  }
> >+if (virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
> >+continue;
> >+}
> >+
> >  nheads = vring_avail_idx(&vdev->vq[i]) - 
> > vdev->vq[i].last_avail_idx;
> >  /* Check it isn't doing strange things with descriptor 
> > numbers. */
> >  if (nheads > vdev->vq[i].vring.num) {
> 
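
For reference, the subsection Jason asks for would look roughly like the
sketch below, following QEMU's VMStateDescription conventions; the name,
the .needed helper and the (empty) field list are illustrative, and the
actual per-queue field wiring is omitted:

/* Illustrative sketch only: gate the new packed-ring state on a
 * subsection so migration to and from builds without it keeps working. */
static bool virtio_packed_state_needed(void *opaque)
{
    VirtIODevice *vdev = opaque;

    /* Only send the extra state when the feature was negotiated. */
    return virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED);
}

static const VMStateDescription vmstate_virtio_packed = {
    .name = "virtio/packed",
    .version_id = 1,
    .minimum_version_id = 1,
    .needed = &virtio_packed_state_needed,
    .fields = (VMStateField[]) {
        /* per-queue wrap counters, used_idx, inuse, ... would go here */
        VMSTATE_END_OF_LIST()
    }
};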



Re: [Qemu-devel] [PATCH v1 09/16] virtio: fill/flush/pop for packed ring

2019-01-16 Thread Wei Xu
On Fri, Nov 30, 2018 at 01:45:19PM +0100, Maxime Coquelin wrote:
> Hi Wei,
> 
> On 11/22/18 3:06 PM, w...@redhat.com wrote:
> >+void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >+{
> >+if (unlikely(vq->vdev->broken)) {
> >+vq->inuse -= count;
> >+return;
> >+}
> >+
> >+if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> >+virtqueue_packed_flush(vq, count);
> >+} else {
> >+virtqueue_split_flush(vq, count);
> >+}
> >+}
> >+
> >  void virtqueue_push(VirtQueue *vq, const VirtQueueElement *elem,
> >  unsigned int len)
> >  {
> >@@ -1074,7 +1180,7 @@ static void *virtqueue_alloc_element(size_t sz, 
> >unsigned out_num, unsigned in_nu
> >  return elem;
> >  }
> >-void *virtqueue_pop(VirtQueue *vq, size_t sz)
> >+static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
> >  {
> >  unsigned int i, head, max;
> >  VRingMemoryRegionCaches *caches;
> >@@ -1089,9 +1195,6 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
> >  VRingDesc desc;
> >  int rc;
> >-if (unlikely(vdev->broken)) {
> >-return NULL;
> >-}
> >  rcu_read_lock();
> >  if (virtio_queue_empty_rcu(vq)) {
> >  goto done;
> >@@ -1209,6 +1312,159 @@ err_undo_map:
> >  goto done;
> >  }
> >+static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
> >+{
> >+unsigned int i, head, max;
> >+VRingMemoryRegionCaches *caches;
> >+MemoryRegionCache indirect_desc_cache = MEMORY_REGION_CACHE_INVALID;
> >+MemoryRegionCache *cache;
> >+int64_t len;
> >+VirtIODevice *vdev = vq->vdev;
> >+VirtQueueElement *elem = NULL;
> >+unsigned out_num, in_num, elem_entries;
> >+hwaddr addr[VIRTQUEUE_MAX_SIZE];
> >+struct iovec iov[VIRTQUEUE_MAX_SIZE];
> >+VRingPackedDesc desc;
> >+uint16_t id;
> >+
> >+rcu_read_lock();
> >+if (virtio_queue_packed_empty_rcu(vq)) {
> >+goto done;
> >+}
> >+
> >+/* When we start there are none of either input nor output. */
> >+out_num = in_num = elem_entries = 0;
> >+
> >+max = vq->vring.num;
> >+
> >+if (vq->inuse >= vq->vring.num) {
> >+virtio_error(vdev, "Virtqueue size exceeded");
> >+goto done;
> >+}
> >+
> >+head = vq->last_avail_idx;
> >+i = head;
> >+
> >+caches = vring_get_region_caches(vq);
> >+cache = &caches->desc;
> >+
> >+/* Empty check has been done at the beginning, so it is an available
> >+ * entry already, make sure all fields have been exposed by guest */
> >+smp_rmb();
> >+vring_packed_desc_read(vdev, &desc, cache, i);
> >+
> >+id = desc.id;
> >+if (desc.flags & VRING_DESC_F_INDIRECT) {
> >+
> >+if (desc.len % sizeof(VRingPackedDesc)) {
> >+virtio_error(vdev, "Invalid size for indirect buffer table");
> >+goto done;
> >+}
> >+
> >+/* loop over the indirect descriptor table */
> >+len = address_space_cache_init(&indirect_desc_cache, vdev->dma_as,
> >+   desc.addr, desc.len, false);
> >+cache = &indirect_desc_cache;
> >+if (len < desc.len) {
> >+virtio_error(vdev, "Cannot map indirect buffer");
> >+goto done;
> >+}
> >+
> >+max = desc.len / sizeof(VRingPackedDesc);
> >+i = 0;
> >+vring_packed_desc_read(vdev, &desc, cache, i);
> >+/* Make sure we see all the fields */
> >+smp_rmb();
> >+}
> >+
> >+/* Collect all the descriptors */
> >+while (1) {
> >+bool map_ok;
> >+
> >+if (desc.flags & VRING_DESC_F_WRITE) {
> >+map_ok = virtqueue_map_desc(vdev, &in_num, addr + out_num,
> >+iov + out_num,
> >+VIRTQUEUE_MAX_SIZE - out_num, true,
> >+desc.addr, desc.len);
> >+} else {
> >+if (in_num) {
> >+virtio_error(vdev, "Incorrect order for descriptors");
> >+goto err_undo_map;
> >+}
> >+map_ok = virtqueue_map_desc(vdev, &out_num, addr, iov,
> >+VIRTQUEUE_MAX_SIZE, false,
> >+desc.addr, desc.len);
> >+}
> >+if (!map_ok) {
> >+goto err_undo_map;
> >+}
> >+
> >+/* If we've got too many, that implies a descriptor loop. */
> >+if (++elem_entries > max) {
> >+virtio_error(vdev, "Looped descriptor");
> >+goto err_undo_map;
> >+}
> >+
> >+if (++i >= vq->vring.num) {
> >+i -= vq->vring.num;
> >+}
> >+
> >+if (desc.flags & VRING_DESC_F_NEXT) {
> >+vring_packed_desc_read(vq->vdev, &desc, cache, i);
> >+} else {
> >+break;
> >+}
> >+}
> >+
> >+/* Now copy what we have collected and mapped */
> >+elem = virtqueue_alloc_element(sz, out_num, in_num);
> >+

Re: [Qemu-devel] [PATCH v3 12/16] virtio: split virtio net bits rom virtio-pci

2019-01-01 Thread Wei Xu
On Thu, Dec 13, 2018 at 10:00:53PM +0100, Juan Quintela wrote:
> Reviewed-by: Thomas Huth 
> Reviewed-by: Laurent Vivier 
> Signed-off-by: Juan Quintela 
> ---
>  hw/virtio/Makefile.objs|  1 +
>  hw/virtio/virtio-net-pci.c | 96 ++
>  hw/virtio/virtio-pci.c | 57 --
>  hw/virtio/virtio-pci.h | 14 --
>  tests/Makefile.include |  2 +-
>  5 files changed, 98 insertions(+), 72 deletions(-)
>  create mode 100644 hw/virtio/virtio-net-pci.c
> 
> diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
> index 557ad06231..4c31acb017 100644
> --- a/hw/virtio/Makefile.objs
> +++ b/hw/virtio/Makefile.objs
> @@ -23,6 +23,7 @@ obj-$(CONFIG_VIRTIO_BALLOON) += virtio-balloon-pci.o
>  obj-$(CONFIG_VIRTIO_9P) += virtio-9p-pci.o
>  obj-$(CONFIG_VIRTIO_SCSI) += virtio-scsi-pci.o
>  obj-$(CONFIG_VIRTIO_BLK) += virtio-blk-pci.o
> +obj-$(CONFIG_VIRTIO_NET) += virtio-net-pci.o
>  endif
>  endif

s/rom/from/ in the subject for patch 10, 11, 12 and 13.

Reviewed-by: Wei Xu 

>  
> diff --git a/hw/virtio/virtio-net-pci.c b/hw/virtio/virtio-net-pci.c
> new file mode 100644
> index 00..0b676f078d
> --- /dev/null
> +++ b/hw/virtio/virtio-net-pci.c
> @@ -0,0 +1,96 @@
> +/*
> + * Virtio net PCI Bindings
> + *
> + * Copyright IBM, Corp. 2007
> + * Copyright (c) 2009 CodeSourcery
> + *
> + * Authors:
> + *  Anthony Liguori   
> + *  Paul Brook
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Contributions after 2012-01-13 are licensed under the terms of the
> + * GNU GPL, version 2 or (at your option) any later version.
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "hw/virtio/virtio-net.h"
> +#include "virtio-pci.h"
> +#include "qapi/error.h"
> +
> +typedef struct VirtIONetPCI VirtIONetPCI;
> +
> +/*
> + * virtio-net-pci: This extends VirtioPCIProxy.
> + */
> +#define TYPE_VIRTIO_NET_PCI "virtio-net-pci"
> +#define VIRTIO_NET_PCI(obj) \
> +OBJECT_CHECK(VirtIONetPCI, (obj), TYPE_VIRTIO_NET_PCI)
> +
> +struct VirtIONetPCI {
> +VirtIOPCIProxy parent_obj;
> +VirtIONet vdev;
> +};
> +
> +static Property virtio_net_properties[] = {
> +DEFINE_PROP_BIT("ioeventfd", VirtIOPCIProxy, flags,
> +VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT, true),
> +DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3),
> +DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void virtio_net_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
> +{
> +DeviceState *qdev = DEVICE(vpci_dev);
> +VirtIONetPCI *dev = VIRTIO_NET_PCI(vpci_dev);
> +DeviceState *vdev = DEVICE(&dev->vdev);
> +
> +virtio_net_set_netclient_name(&dev->vdev, qdev->id,
> +  object_get_typename(OBJECT(qdev)));
> +qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
> +object_property_set_bool(OBJECT(vdev), true, "realized", errp);
> +}
> +
> +static void virtio_net_pci_class_init(ObjectClass *klass, void *data)
> +{
> +DeviceClass *dc = DEVICE_CLASS(klass);
> +PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
> +VirtioPCIClass *vpciklass = VIRTIO_PCI_CLASS(klass);
> +
> +k->romfile = "efi-virtio.rom";
> +k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
> +k->device_id = PCI_DEVICE_ID_VIRTIO_NET;
> +k->revision = VIRTIO_PCI_ABI_VERSION;
> +k->class_id = PCI_CLASS_NETWORK_ETHERNET;
> +set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
> +dc->props = virtio_net_properties;
> +vpciklass->realize = virtio_net_pci_realize;
> +}
> +
> +static void virtio_net_pci_instance_init(Object *obj)
> +{
> +VirtIONetPCI *dev = VIRTIO_NET_PCI(obj);
> +
> +virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
> +TYPE_VIRTIO_NET);
> +object_property_add_alias(obj, "bootindex", OBJECT(&dev->vdev),
> +  "bootindex", &error_abort);
> +}
> +
> +static const TypeInfo virtio_net_pci_info = {
> +.name  = TYPE_VIRTIO_NET_PCI,
> +.parent= TYPE_VIRTIO_PCI,
> +.instance_size = sizeof(VirtIONetPCI),
> +.instance_init = virtio_net_pci_instance_init,
> +.class_init= virtio_net_pci_class_init,
> +};
> +
> +static void virtio_net_pci_register(void)
> +{
> +type_register_static(&virtio_net_pci_info);
> +}
> +
> +type_init(virtio_net_pci_register)
> diff --git a/hw/virtio/virtio-pci.c b/hw/virtio

Re: [Qemu-devel] [PATCH v1 00/16] packed ring virtio-net backend support

2018-11-25 Thread Wei Xu
On Fri, Nov 23, 2018 at 01:57:37PM +0800, Wei Xu wrote:
> On Thu, Nov 22, 2018 at 06:57:31PM +0100, Maxime Coquelin wrote:
> > Hi Wei,
> > 
> > I just tested your series with Tiwei's v3, and it fails
> > with ctrl vq enabled:
> > qemu-system-x86_64: virtio-net ctrl missing headers
> 
> OK, I haven't tried Tiwei's v3 yet, will give it a try.

Hi Maxime,
It is caused by the _F_NEXT flag bit for indirect descriptors, as
mentioned by Tiwei; the patch below is needed to fix it.

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 7487d3d..8e61e6f 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -1478,7 +1478,11 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t 
sz)
 i -= vq->vring.num;
 }
 
-if (desc.flags & VRING_DESC_F_NEXT) {
+if (cache == &indirect_desc_cache) {
+if (i == max)
+break;
+vring_packed_desc_read(vq->vdev, &desc, cache, i);
+} else if (desc.flags & VRING_DESC_F_NEXT) {
 vring_packed_desc_read(vq->vdev, &desc, cache, i);
 } else {
 



Re: [Qemu-devel] [PATCH v1 00/16] packed ring virtio-net backend support

2018-11-22 Thread Wei Xu
On Thu, Nov 22, 2018 at 06:57:31PM +0100, Maxime Coquelin wrote:
> Hi Wei,
> 
> I just tested your series with Tiwei's v3, and it fails
> with ctrl vq enabled:
> qemu-system-x86_64: virtio-net ctrl missing headers

OK, I haven't tried Tiwei's v3 yet, will give it a try.

Wei

> 
> Regards,
> Maxime
> 
> On 11/22/18 3:06 PM, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >Code base:
> > https://github.com/Whishay/qemu.git
> >
> >rfc v3 -> v1
> >- migration support for both userspace and vhost-net; the vhost
> >   ioctl() needs a tweak to make it work (the code is pasted in the
> >   commit message of vhost migration patch #13).
> >
> >Note:
> >   the high 32 guest feature bits are saved as a subsection for
> >   virtio devices, which makes the packed ring feature bit check unusable
> >   when loading the saved per-queue variables (this is done before loading
> >   the subsection, which is the last action for the device during migration),
> >   so I save and load everything unconditionally for now; any idea how to fix this?
> >
> >- Fixed comments from Jason for rfc v3, sorted by patch #; the two
> >   comments I didn't take are listed here (by patch):
> >09: - introduce new API(virtqueue_fill_n()).
> >   - Didn't take it since userspace backend does not support batching,
> > so only one element is popped and current API should be enough.
> >06 & 07: Refactor split and packed pop()/get_avail_bytes().
> >  - the duplicated code interwined with split/packed ring specific
> >things and it might make it unclear, so I only extracted the few
> >common parts out side rcu and keep the others separate.
> >
> >The other revised comments:
> >02: - reuse current 'avail/used' for 'driver/device' in 
> >VRingMemoryRegionCache.
> > - remove event_idx since shadow_avail_idx works.
> >03: - move size recalculation to a separate patch.
> > - keep 'avail/used' in current calculation function name.
> > - initialize 'desc' memory region as 'false' for 1.0('true' for 1.1)
> >04: - delete 'event_idx'
> >05: - rename 'wc' to wrap_counter.
> >06: - converge common part outside rcu section for 1.0/1.1.
> > - move memory barrier for the first 'desc' in between checking flag
> >   and read other fields.
> > - remove unnecessary memory barriers for indirect descriptors.
> > - no need to destroy indirect memory cache since it is generally done
> >   before return from the function.
> > - remove redundant maximum chained descriptors limitation check.
> > - there are some differences(desc name, wrap idx/counter, flags) between
> >   split and packed rings, so keep them separate for now.
> > - amend the comment when recording index and wrap counter for a kick
> >   from guest.
> >07: - calculate fields in descriptor instead of read it when filling.
> > - put memory barrier correctly before filling the flags in descriptor.
> > - replace full memory barrier with a write barrier in fill.
> > - shift to read descriptor flags and descriptor necessarily and
> >   separately in packed_pop().
> > - correct memory barrier in packed_pop() as in packed_fill().
> >08: - reuse 'shadow_avail_idx' instead of adding a new 'event_idx'.
> > - use the compact and verified vring_packed_need_event()
> >   version for vhost net/user.
> >12: - remove the odd cherry-pick comment.
> > - used bit '15' for wrap_counters.
> >
> >rfc v2->v3
> >- addressed performance issue
> >- fixed feedback from v2
> >
> >rfc v1->v2
> >- sync to tiwei's v5
> >- reuse memory cache function with 1.0
> >- dropped detach patch and notification helper(04 & 05 in v1)
> >- guest virtio-net driver unload/reload support
> >- event suppression support(not tested)
> >- addressed feedback from v1
> >
> >Wei Xu (15):
> >   virtio: introduce packed ring definitions
> >   virtio: redefine structure & memory cache for packed ring
> >   virtio: expand offset calculation for packed ring
> >   virtio: add memory region init for packed ring
> >   virtio: init wrap counter for packed ring
> >   virtio: init and desc empty check for packed ring
> >   virtio: get avail bytes check for packed ring
> >   virtio: fill/flush/pop for packed ring
> >   virtio: event suppression support for packed ring
> >   virtio-net: fill head desc after done all in a chain
> >   virtio: add userspace migration of packed ring
> >   virtio: add vhost-net migration of packed ring
&

Re: [Qemu-devel] [PATCH v1 01/16] Update version for v3.1.0-rc2 release

2018-11-22 Thread Wei Xu
This is an irrelevant patch mistakenly posted, please drop this, sorry.

On Thu, Nov 22, 2018 at 09:06:06AM -0500, w...@redhat.com wrote:
> From: Peter Maydell 
> 
> Signed-off-by: Peter Maydell 
> ---
>  VERSION | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/VERSION b/VERSION
> index 3af1c22..bbcce69 100644
> --- a/VERSION
> +++ b/VERSION
> @@ -1 +1 @@
> -3.0.91
> +3.0.92
> -- 
> 1.8.3.1
> 
> 



Re: [Qemu-devel] [[RFC v3 12/12] virtio: feature vhost-net support for packed ring

2018-11-21 Thread Wei Xu
On Wed, Nov 21, 2018 at 02:03:59PM +0100, Maxime Coquelin wrote:
> Hi Wei,
> 
> On 10/11/18 4:08 PM, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >(cherry picked from commit 305a2c4640c15c5717245067ab937fd10f478ee6)
> >Signed-off-by: Wei Xu 
> >(cherry picked from commit 46476dae6f44c6fef8802a4a0ac7d0d79fe399e3)
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/vhost.c  | 3 +++
> >  hw/virtio/virtio.c | 4 
> >  include/hw/virtio/virtio.h | 1 +
> >  3 files changed, 8 insertions(+)
> >
> >diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >index 9df2da3..de06d55 100644
> >--- a/hw/virtio/vhost.c
> >+++ b/hw/virtio/vhost.c
> >@@ -974,6 +974,9 @@ static int vhost_virtqueue_start(struct vhost_dev *dev,
> >  }
> >  state.num = virtio_queue_get_last_avail_idx(vdev, idx);
> >+if (virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
> >+state.num |= ((int)virtio_queue_packed_get_wc(vdev, idx)) << 31;
> 
> For next version, please note that we agreed to move the wrap counter
> value at bit 15. DPDK vhost lib implemented it that way.

Yes, I have revised it in my next version, thanks for the reminder.

> 
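
A standalone sketch of the agreed layout, with the available index in
bits 0-14 and the avail wrap counter at bit 15 of the base value passed
through vhost_set_vring_base; the helper names are illustrative:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint32_t pack_vring_base(uint16_t last_avail_idx, bool wrap_counter)
{
    return (last_avail_idx & 0x7fff) | ((uint32_t)wrap_counter << 15);
}

static void unpack_vring_base(uint32_t num, uint16_t *idx, bool *wrap)
{
    *idx  = num & 0x7fff;
    *wrap = !!(num & 0x8000);
}

int main(void)
{
    uint16_t idx;
    bool wrap;

    unpack_vring_base(pack_vring_base(0x1234, true), &idx, &wrap);
    assert(idx == 0x1234 && wrap);
    return 0;
}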



Re: [Qemu-devel] [RFC v3 00/12] packed ring virtio-net userspace backend support

2018-11-21 Thread Wei Xu
On Wed, Nov 21, 2018 at 10:39:20PM +0800, Tiwei Bie wrote:
> Hi Wei,
> 
> FYI, the latest packed ring series for guest driver doesn't set
> the _F_NEXT bit for indirect descriptors any more. So below hack
> in guest driver is needed to make it work with this series:

OK, will do a test, thanks.

Wei

> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index cd7e755484e3..42faea7d8cf8 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -980,6 +980,7 @@ static int virtqueue_add_indirect_packed(struct 
> vring_virtqueue *vq,
>   unsigned int i, n, err_idx;
>   u16 head, id;
>   dma_addr_t addr;
> + int c = 0;
>  
>   head = vq->packed.next_avail_idx;
>   desc = alloc_indirect_packed(total_sg, gfp);
> @@ -1001,8 +1002,9 @@ static int virtqueue_add_indirect_packed(struct 
> vring_virtqueue *vq,
>   if (vring_mapping_error(vq, addr))
>   goto unmap_release;
>  
> - desc[i].flags = cpu_to_le16(n < out_sgs ?
> - 0 : VRING_DESC_F_WRITE);
> + desc[i].flags = cpu_to_le16((n < out_sgs ?
> + 0 : VRING_DESC_F_WRITE) |
> + (++c == total_sg ? 0 : VRING_DESC_F_NEXT));
>   desc[i].addr = cpu_to_le64(addr);
>   desc[i].len = cpu_to_le32(sg->length);
>           i++;
> -- 
> 2.14.1
> 
> On Thu, Oct 11, 2018 at 10:08:23AM -0400, w...@redhat.com wrote:
> > From: Wei Xu 
> > 
> > code base:
> > https://github.com/Whishay/qemu.git
> > 
> > Todo:
> > - migration has not been support yet
> > 
> > v2->v3
> > - addressed performance issue
> > - fixed feedback from v2
> > 
> > v1->v2
> > - sync to tiwei's v5
> > - reuse memory cache function with 1.0
> > - dropped detach patch and notification helper(04 & 05 in v1)
> > - guest virtio-net driver unload/reload support
> > - event suppression support(not tested)
> > - addressed feedback from v1
> > 
> > Wei Xu (12):
> >   virtio: introduce packed ring definitions
> >   virtio: redefine structure & memory cache for packed ring
> >   virtio: init memory cache for packed ring
> >   virtio: init wrap counter for packed ring
> >   virtio: init and desc empty check for packed ring
> >   virtio: get avail bytes check for packed ring
> >   virtio: fill/flush/pop for packed ring
> >   virtio: event suppression support for packed ring
> >   virtio-net: fill head desc after done all in a chain
> >   virtio: packed ring feature bit for userspace backend
> >   virtio: enable packed ring via a new command line
> >   virtio: feature vhost-net support for packed ring
> > 
> >  hw/net/vhost_net.c |   1 +
> >  hw/net/virtio-net.c|  11 +-
> >  hw/virtio/vhost.c  |  19 +-
> >  hw/virtio/virtio.c | 685 
> > +++--
> >  include/hw/virtio/virtio.h |   9 +-
> >  include/standard-headers/linux/virtio_config.h |  15 +
> >  include/standard-headers/linux/virtio_ring.h   |  43 ++
> >  7 files changed, 736 insertions(+), 47 deletions(-)
> > 
> > -- 
> > 1.8.3.1
> > 



Re: [Qemu-devel] [PATCH] virtio-net: support RSC v4/v6 tcp traffic for Windows HCK

2018-11-12 Thread Wei Xu
Looks good. I can't recall the status of the last version well, but I
remember Jason gave some comments that the sanity checks are quite
essential; have you addressed them?

Reviewed-by: Wei Xu 

On Fri, Nov 09, 2018 at 04:58:27PM +0200, Yuri Benditovich wrote:
> This commit adds an implementation of RX packet
> coalescing, compatible with the requirements of the Windows
> Hardware Compatibility Kit.
> 
> The device enables feature VIRTIO_NET_F_RSC_EXT in
> host features if it supports extended RSC functionality
> as defined in the specification.
> This feature requires at least one of VIRTIO_NET_F_GUEST_TSO4,
> VIRTIO_NET_F_GUEST_TSO6. Windows guest driver acks
> this feature only if VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
> is also present.
> 
> In case vhost is enabled the feature bit is cleared in
> host_features during device initialization.
> 
> If the guest driver acks the VIRTIO_NET_F_RSC_EXT feature,
> the device coalesces TCPv4 and TCPv6 packets (if the
> respective VIRTIO_NET_F_GUEST_TSO feature is on),
> populates extended RSC information in the virtio header
> and sets the VIRTIO_NET_HDR_F_RSC_INFO bit in the header flags.
> The device does not recalculate checksums in the coalesced
> packet, so they are not valid.
> 
> In this case:
> All the data packets in a TCP connection are cached
> in a single buffer in every receive interval and sent
> out via a timer. 'virtio_net_rsc_timeout' controls the
> interval; this value may impact the performance and
> response time of a TCP connection. 5 (50us) is an
> empirical value chosen to gain a performance improvement;
> since the WHQL test sends packets every 100us, '30' (300us)
> passes the test case and is also the default value. Tune it
> via the command line parameter 'rsc_interval' of the
> 'virtio-net-pci' device, for example, to launch a guest
> with the interval set to '50':
> 
> 'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,
> guest_rsc_ext=on,rsc_interval=50'
> 
> The timer will only be triggered if the packets pool is not empty,
> and it'll drain off all the cached packets.
> 
> 'NetRscChain' is used to save the segments of IPv4/6 in a
> VirtIONet device.
> 
> A new segment becomes a 'Candidate' once it has passed the sanity check;
> the main TCP handling includes the TCP window update, the duplicate
> ACK check and the actual data coalescing.
> 
> A 'Candidate' segment means:
> 1. Segment is within current window and the sequence is the expected one.
> 2. 'ACK' of the segment is in the valid window.
> 
> Sanity check includes:
> 1. Incorrect version in IP header
> 2. An IP options or IP fragment
> 3. Not a TCP packet
> 4. Sanity size check to prevent buffer overflow attack.
> 5. An ECN packet
> 
> Even so, more cases might need to be considered, such as the
> IP identification and other flags, but checking them breaks the test
> because Windows sets the identification to the same value even when
> the packet is not a fragment.
> 
> Normally there are two typical ways to handle a TCP control flag,
> 'bypass' and 'finalize'. 'bypass' means the packet should be sent out
> directly, while 'finalize' means the packet should also be bypassed,
> but only after searching the pool for packets of the same connection
> and draining all of them out; this avoids out-of-order fragments.
> 
> All 'SYN' packets will be bypassed since they always begin a new
> connection; other flags such as 'URG/FIN/RST/CWR/ECE' will trigger a
> finalization, because this normally happens when a connection is about
> to be closed; an 'URG' packet also finalizes the current coalescing unit.
> 
> Statistics can be used to monitor the basic coalescing status; the
> 'out of order' and 'out of window' counters show how many packets were
> retransmitted, and thus describe the performance intuitively.
> 
> Difference between IPv4 and IPv6 processing:
>  The fragment length in the IPv4 header includes the header itself,
>  while it is not included for IPv6, which means IPv6 can carry a
>  full 65535-byte payload.
> 
> Signed-off-by: Wei Xu 
> Signed-off-by: Yuri Benditovich 
> ---
>  hw/net/virtio-net.c | 648 +++-
>  include/hw/virtio/virtio-net.h  |  81 +++
>  include/net/eth.h   |   2 +
>  include/standard-headers/linux/virtio_net.h |   8 +
>  4 files changed, 734 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 385b1a03e9..43a7021409 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -41,6 +41,28 @@
>  #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE
>  #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE
>  
> +#define VIRTIO_NET_IP4_ADDR_SIZE   8/* ipv4 saddr + daddr */
> +
> +#define VIRTIO_NET_
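
A standalone sketch of the 'Candidate' test described above; the struct
and function names are illustrative, not from the patch. A segment
qualifies when its sequence number is the expected one and its ACK falls
within the valid window:

#include <stdbool.h>
#include <stdint.h>

struct rsc_seg {
    uint32_t seq;        /* TCP sequence number */
    uint32_t ack;        /* TCP acknowledgment number */
};

struct rsc_unit {
    uint32_t next_seq;   /* expected next sequence */
    uint32_t ack_low;    /* lowest acceptable ACK */
    uint32_t ack_high;   /* highest acceptable ACK */
};

static bool rsc_is_candidate(const struct rsc_unit *u, const struct rsc_seg *s)
{
    if (s->seq != u->next_seq) {
        return false;    /* out of order: do not coalesce */
    }
    /* Serial-number style comparison keeps sequence wraparound safe. */
    return (int32_t)(s->ack - u->ack_low) >= 0 &&
           (int32_t)(u->ack_high - s->ack) >= 0;
}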

Re: [Qemu-devel] [[RFC v3 08/12] virtio: event suppression support for packed ring

2018-10-15 Thread Wei Xu
On Mon, Oct 15, 2018 at 02:59:48PM +0800, Jason Wang wrote:
> 
> 
> On 2018年10月11日 22:08, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 126 
> > +++--
> >  1 file changed, 123 insertions(+), 3 deletions(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index d12a7e3..1d25776 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -241,6 +241,30 @@ static void vring_desc_read(VirtIODevice *vdev, 
> >VRingDesc *desc,
> >  virtio_tswap16s(vdev, >next);
> >  }
> >+static void vring_packed_event_read(VirtIODevice *vdev,
> >+MemoryRegionCache *cache, VRingPackedDescEvent 
> >*e)
> >+{
> >+address_space_read_cached(cache, 0, e, sizeof(*e));
> >+virtio_tswap16s(vdev, &e->off_wrap);
> >+virtio_tswap16s(vdev, &e->flags);
> >+}
> >+
> >+static void vring_packed_off_wrap_write(VirtIODevice *vdev,
> >+MemoryRegionCache *cache, uint16_t off_wrap)
> >+{
> >+virtio_tswap16s(vdev, &off_wrap);
> >+address_space_write_cached(cache, 0, &off_wrap, sizeof(off_wrap));
> >+address_space_cache_invalidate(cache, 0, sizeof(off_wrap));
> >+}
> >+
> >+static void vring_packed_flags_write(VirtIODevice *vdev,
> >+MemoryRegionCache *cache, uint16_t flags)
> >+{
> >+virtio_tswap16s(vdev, &flags);
> >+address_space_write_cached(cache, sizeof(uint16_t), &flags, 
> >sizeof(flags));
> >+address_space_cache_invalidate(cache, sizeof(uint16_t), sizeof(flags));
> >+}
> >+
> >  static VRingMemoryRegionCaches *vring_get_region_caches(struct VirtQueue 
> > *vq)
> >  {
> >  VRingMemoryRegionCaches *caches = atomic_rcu_read(&vq->vring.caches);
> >@@ -347,7 +371,7 @@ static inline void vring_set_avail_event(VirtQueue *vq, 
> >uint16_t val)
> >  address_space_cache_invalidate(>used, pa, sizeof(val));
> >  }
> >-void virtio_queue_set_notification(VirtQueue *vq, int enable)
> >+static void virtio_queue_set_notification_split(VirtQueue *vq, int enable)
> >  {
> >  vq->notification = enable;
> >@@ -370,6 +394,51 @@ void virtio_queue_set_notification(VirtQueue *vq, int 
> >enable)
> >  rcu_read_unlock();
> >  }
> >+static void virtio_queue_set_notification_packed(VirtQueue *vq, int enable)
> >+{
> >+VRingPackedDescEvent e;
> >+VRingMemoryRegionCaches *caches;
> >+
> >+rcu_read_lock();
> >+caches  = vring_get_region_caches(vq);
> >+vring_packed_event_read(vq->vdev, &caches->device, &e);
> >+
> >+if (!enable) {
> >+e.flags = RING_EVENT_FLAGS_DISABLE;
> >+goto out;
> >+}
> >+
> >+e.flags = RING_EVENT_FLAGS_ENABLE;
> >+if (virtio_vdev_has_feature(vq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
> >+uint16_t off_wrap = vq->event_idx | vq->event_wrap_counter << 15;
> 
> Btw, why not just use shadow_avail_idx here?

It would be nice to do that, but the issue here is that it is 'shadow_avail_idx'
for Rx but 'used_idx' for Tx when setting up for a kick. I haven't figured out a
clean fix; keeping a separate field makes the migration part of the work easier.
Any idea?

Wei

> 
> Thanks
> 
> >+
> >+vring_packed_off_wrap_write(vq->vdev, &caches->device, off_wrap);
> >+/* Make sure off_wrap is written before flags */
> >+smp_wmb();
> >+
> >+e.flags = RING_EVENT_FLAGS_DESC;
> >+}
> >+
> >+out:
> >+vring_packed_flags_write(vq->vdev, &caches->device, e.flags);
> >+rcu_read_unlock();
> >+}
> >+
> >+void virtio_queue_set_notification(VirtQueue *vq, int enable)
> >+{
> >+vq->notification = enable;
> >+
> >+if (!vq->vring.desc) {
> >+return;
> >+}
> >+
> >+if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> >+virtio_queue_set_notification_packed(vq, enable);
> >+} else {
> >+virtio_queue_set_notification_split(vq, enable);
> >+}
> >+}
> >+
> >  int virtio_queue_ready(VirtQueue *vq)
> >  {
> >  return vq->vring.avail != 0;
> >@@ -2103,8 +2172,7 @@ static void virtio_set_isr(VirtIODevice *vdev, int 
> >value)
> >  }
> >  }
> >-/* Called within rcu_read_lock().  */
> >-static bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
> >+static bool virtio_split_should_notify
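
A standalone sketch of the event suppression test this patch feeds, the
vring_packed_need_event() check mentioned elsewhere in the thread; it is
modeled on the virtio spec's event-index formula plus the packed-ring
off_wrap encoding (index in bits 0-14, wrap counter in bit 15), with
illustrative parameter names:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Classic event-index test from the virtio spec. */
static bool vring_need_event(int event_idx, uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

/* Packed-ring variant: 'off_wrap' carries the event index in bits 0-14
 * and the event wrap counter in bit 15; 'wrap' is the device's current
 * wrap counter and 'queue_size' the ring size. */
static bool vring_packed_need_event(uint16_t queue_size, bool wrap,
                                    uint16_t off_wrap, uint16_t new_idx,
                                    uint16_t old_idx)
{
    int event_idx = off_wrap & 0x7fff;

    if (wrap != (bool)(off_wrap >> 15)) {
        event_idx -= queue_size;
    }
    return vring_need_event(event_idx, new_idx, old_idx);
}

int main(void)
{
    /* Device moved from 2 to 5; driver asked for an event at 3. */
    assert(vring_packed_need_event(256, true, 3 | (1 << 15), 5, 2));
    /* Event point 7 has not been reached yet. */
    assert(!vring_packed_need_event(256, true, 7 | (1 << 15), 5, 2));
    return 0;
}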

Re: [Qemu-devel] [[RFC v3 12/12] virtio: feature vhost-net support for packed ring

2018-10-15 Thread Wei Xu
On Mon, Oct 15, 2018 at 03:50:21PM +0800, Jason Wang wrote:
> 
> 
> On 2018年10月11日 22:08, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >(cherry picked from commit 305a2c4640c15c5717245067ab937fd10f478ee6)
> >Signed-off-by: Wei Xu 
> >(cherry picked from commit 46476dae6f44c6fef8802a4a0ac7d0d79fe399e3)
> >Signed-off-by: Wei Xu 
> 
> The cherry-pick tag looks odd.

My bad.

> 
> >---
> >  hw/virtio/vhost.c  | 3 +++
> >  hw/virtio/virtio.c | 4 
> >  include/hw/virtio/virtio.h | 1 +
> >  3 files changed, 8 insertions(+)
> >
> >diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >index 9df2da3..de06d55 100644
> >--- a/hw/virtio/vhost.c
> >+++ b/hw/virtio/vhost.c
> >@@ -974,6 +974,9 @@ static int vhost_virtqueue_start(struct vhost_dev *dev,
> >  }
> >  state.num = virtio_queue_get_last_avail_idx(vdev, idx);
> >+if (virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
> >+state.num |= ((int)virtio_queue_packed_get_wc(vdev, idx)) << 31;
> >+}
> 
> We decide to use bit 15 instead.
> 
> And please refer the recent discussion for the agreement.

OK.

> 
> Thanks
> 
> >  r = dev->vhost_ops->vhost_set_vring_base(dev, &state);
> >  if (r) {
> >  VHOST_OPS_DEBUG("vhost_set_vring_base failed");
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index 1d25776..2a90163 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -2894,6 +2894,10 @@ void virtio_init(VirtIODevice *vdev, const char *name,
> >  vdev->use_guest_notifier_mask = true;
> >  }
> >+bool virtio_queue_packed_get_wc(VirtIODevice *vdev, int n)
> >+{
> >+return vdev->vq[n].avail_wrap_counter;
> >+}
> >  hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n)
> >  {
> >  return vdev->vq[n].vring.desc;
> >diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> >index 9af8839..0bb3be5 100644
> >--- a/include/hw/virtio/virtio.h
> >+++ b/include/hw/virtio/virtio.h
> >@@ -295,6 +295,7 @@ void 
> >virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx,
> >  VirtIOHandleAIOOutput 
> > handle_output);
> >  VirtQueue *virtio_vector_first_queue(VirtIODevice *vdev, uint16_t vector);
> >  VirtQueue *virtio_vector_next_queue(VirtQueue *vq);
> >+bool virtio_queue_packed_get_wc(VirtIODevice *vdev, int n);
> >  static inline void virtio_add_feature(uint64_t *features, unsigned int 
> > fbit)
> >  {
> 
> 



Re: [Qemu-devel] [[RFC v3 09/12] virtio-net: fill head desc after done all in a chain

2018-10-15 Thread Wei Xu
On Mon, Oct 15, 2018 at 03:45:46PM +0800, Jason Wang wrote:
> 
> 
> On 2018年10月11日 22:08, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >With the support of marking a descriptor used/unused in the 'flags'
> >field for 1.1, the current way of filling chained descriptors
> >does not work, since the driver side may get wrong 'num_buffer'
> >information if the head descriptor has been filled in
> >while the subsequent ones are still being processed on the device side.
> >
> >This patch fills the head descriptor after all the others are done.
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/net/virtio-net.c | 11 ++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> >diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> >index 4bdd5b8..186c86cd2 100644
> >--- a/hw/net/virtio-net.c
> >+++ b/hw/net/virtio-net.c
> >@@ -1198,6 +1198,8 @@ static ssize_t virtio_net_receive_rcu(NetClientState 
> >*nc, const uint8_t *buf,
> >  struct virtio_net_hdr_mrg_rxbuf mhdr;
> >  unsigned mhdr_cnt = 0;
> >  size_t offset, i, guest_offset;
> >+VirtQueueElement head;
> >+int head_len = 0;
> >  if (!virtio_net_can_receive(nc)) {
> >  return -1;
> >@@ -1275,7 +1277,13 @@ static ssize_t virtio_net_receive_rcu(NetClientState 
> >*nc, const uint8_t *buf,
> >  }
> >  /* signal other side */
> >-virtqueue_fill(q->rx_vq, elem, total, i++);
> >+if (i == 0) {
> >+head_len = total;
> >+head = *elem;
> >+} else {
> >+virtqueue_fill(q->rx_vq, elem, len, i);
> >+}
> >+i++;
> >  g_free(elem);
> >  }
> >@@ -1286,6 +1294,7 @@ static ssize_t virtio_net_receive_rcu(NetClientState 
> >*nc, const uint8_t *buf,
> >   &mhdr.num_buffers, sizeof mhdr.num_buffers);
> >  }
> >+virtqueue_fill(q->rx_vq, &head, head_len, 0);
> 
> It's not a good idea to fix API in device implementation. Let's introduce
> new API and fix it there.
> 
> E.g virtqueue_fill_n() and update the flag of first elem at the last step.

OK, I haven't considered about the other devices so far.

Wei

> 
> Thanks
> 
> >  virtqueue_flush(q->rx_vq, i);
> >  virtio_notify(vdev, q->rx_vq);
> 
> 
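
A hypothetical sketch of the virtqueue_fill_n() API Jason suggests,
assuming QEMU's existing virtqueue_fill() signature; the point is only
the ordering, with the head element exposed last so a packed-ring driver
never observes a partially filled chain:

static void virtqueue_fill_n(VirtQueue *vq, const VirtQueueElement *elems,
                             const unsigned int *lens, unsigned int count)
{
    unsigned int i;

    /* Write back every non-head element first. */
    for (i = 1; i < count; i++) {
        virtqueue_fill(vq, &elems[i], lens[i], i);
    }
    /* Make sure those entries are visible before the head's flags flip
     * makes the whole chain available to the driver. */
    smp_wmb();
    virtqueue_fill(vq, &elems[0], lens[0], 0);
}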



Re: [Qemu-devel] [[RFC v3 02/12] virtio: redefine structure & memory cache for packed ring

2018-10-15 Thread Wei Xu
On Mon, Oct 15, 2018 at 11:03:52AM +0800, Jason Wang wrote:
> 
> 
> On 2018年10月11日 22:08, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >Redefine the packed ring structure according to QEMU nomenclature;
> >supporting data (event index, wrap counter, etc.) are also introduced.
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 26 --
> >  1 file changed, 24 insertions(+), 2 deletions(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index 94f5c8e..500eecf 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -39,6 +39,13 @@ typedef struct VRingDesc
> >  uint16_t next;
> >  } VRingDesc;
> >+typedef struct VRingPackedDesc {
> >+uint64_t addr;
> >+uint32_t len;
> >+uint16_t id;
> >+uint16_t flags;
> >+} VRingPackedDesc;
> >+
> >  typedef struct VRingAvail
> >  {
> >  uint16_t flags;
> >@@ -62,8 +69,14 @@ typedef struct VRingUsed
> >  typedef struct VRingMemoryRegionCaches {
> >  struct rcu_head rcu;
> >  MemoryRegionCache desc;
> >-MemoryRegionCache avail;
> >-MemoryRegionCache used;
> >+union {
> >+MemoryRegionCache avail;
> >+MemoryRegionCache driver;
> >+};
> 
> Can we reuse avail and used?

Sure.

> 
> >+union {
> >+MemoryRegionCache used;
> >+MemoryRegionCache device;
> >+};
> >  } VRingMemoryRegionCaches;
> >  typedef struct VRing
> >@@ -77,6 +90,11 @@ typedef struct VRing
> >  VRingMemoryRegionCaches *caches;
> >  } VRing;
> >+typedef struct VRingPackedDescEvent {
> >+uint16_t off_wrap;
> >+uint16_t flags;
> >+} VRingPackedDescEvent ;
> >+
> >  struct VirtQueue
> >  {
> >  VRing vring;
> >@@ -87,6 +105,10 @@ struct VirtQueue
> >  /* Last avail_idx read from VQ. */
> >  uint16_t shadow_avail_idx;
> >+uint16_t event_idx;
> 
> Need a comment to explain this field.

Yes, it is the unified name for the interrupt index; what I want to see
is whether we can reuse the 'shadow' and 'used' indexes in the current
code. For the Tx queue, it should be the 'used' index after sending the
last desc has finished. For the Rx queue, it should be the 'shadow' index
when there are not enough descriptors, which might be a few descriptors
ahead of the 'used' index. There are a few indexes already, so this makes
the code a bit redundant.

Will see if I can remove this in the next version, any comments?

Wei


> 
> Thanks
> 
> >+bool event_wrap_counter;
> >+bool avail_wrap_counter;
> >+
> >  uint16_t used_idx;
> >  /* Last used index value we have signalled on */
> 
> 



Re: [Qemu-devel] [[RFC v3 03/12] virtio: init memory cache for packed ring

2018-10-15 Thread Wei Xu
On Mon, Oct 15, 2018 at 11:10:12AM +0800, Jason Wang wrote:
> 
> 
> On 2018年10月11日 22:08, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >Expand 1.0 by adding offset calculation accordingly.
> 
> This is only part of what this patch does, and I suggest a separate
> patch for it.
> 
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/vhost.c  | 16 
> >  hw/virtio/virtio.c | 35 +++
> >  include/hw/virtio/virtio.h |  4 ++--
> >  3 files changed, 33 insertions(+), 22 deletions(-)
> >
> >diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >index 569c405..9df2da3 100644
> >--- a/hw/virtio/vhost.c
> >+++ b/hw/virtio/vhost.c
> >@@ -996,14 +996,14 @@ static int vhost_virtqueue_start(struct vhost_dev *dev,
> >  r = -ENOMEM;
> >  goto fail_alloc_desc;
> >  }
> >-vq->avail_size = s = l = virtio_queue_get_avail_size(vdev, idx);
> >+vq->avail_size = s = l = virtio_queue_get_driver_size(vdev, idx);
> 
> Let's try to use a more consistent name. E.g either use avail/used or
> driver/device.
> 
> I prefer to use avail/used, it can save lots of unnecessary changes.

OK.

> 
> >  vq->avail_phys = a = virtio_queue_get_avail_addr(vdev, idx);
> >  vq->avail = vhost_memory_map(dev, a, &l, 0);
> >  if (!vq->avail || l != s) {
> >  r = -ENOMEM;
> >  goto fail_alloc_avail;
> >  }
> >-vq->used_size = s = l = virtio_queue_get_used_size(vdev, idx);
> >+vq->used_size = s = l = virtio_queue_get_device_size(vdev, idx);
> >  vq->used_phys = a = virtio_queue_get_used_addr(vdev, idx);
> >  vq->used = vhost_memory_map(dev, a, &l, 1);
> >  if (!vq->used || l != s) {
> >@@ -1051,10 +1051,10 @@ static int vhost_virtqueue_start(struct vhost_dev 
> >*dev,
> >  fail_vector:
> >  fail_kick:
> >  fail_alloc:
> >-vhost_memory_unmap(dev, vq->used, virtio_queue_get_used_size(vdev, idx),
> >+vhost_memory_unmap(dev, vq->used, virtio_queue_get_device_size(vdev, 
> >idx),
> > 0, 0);
> >  fail_alloc_used:
> >-vhost_memory_unmap(dev, vq->avail, virtio_queue_get_avail_size(vdev, 
> >idx),
> >+vhost_memory_unmap(dev, vq->avail, virtio_queue_get_driver_size(vdev, 
> >idx),
> > 0, 0);
> >  fail_alloc_avail:
> >  vhost_memory_unmap(dev, vq->desc, virtio_queue_get_desc_size(vdev, 
> > idx),
> >@@ -1101,10 +1101,10 @@ static void vhost_virtqueue_stop(struct vhost_dev 
> >*dev,
> >  vhost_vq_index);
> >  }
> >-vhost_memory_unmap(dev, vq->used, virtio_queue_get_used_size(vdev, idx),
> >-   1, virtio_queue_get_used_size(vdev, idx));
> >-vhost_memory_unmap(dev, vq->avail, virtio_queue_get_avail_size(vdev, 
> >idx),
> >-   0, virtio_queue_get_avail_size(vdev, idx));
> >+vhost_memory_unmap(dev, vq->used, virtio_queue_get_device_size(vdev, 
> >idx),
> >+   1, virtio_queue_get_device_size(vdev, idx));
> >+vhost_memory_unmap(dev, vq->avail, virtio_queue_get_driver_size(vdev, 
> >idx),
> >+   0, virtio_queue_get_driver_size(vdev, idx));
> >  vhost_memory_unmap(dev, vq->desc, virtio_queue_get_desc_size(vdev, 
> > idx),
> > 0, virtio_queue_get_desc_size(vdev, idx));
> >  }
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index 500eecf..bfb3364 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -162,11 +162,8 @@ static void virtio_init_region_cache(VirtIODevice 
> >*vdev, int n)
> >  VRingMemoryRegionCaches *old = vq->vring.caches;
> >  VRingMemoryRegionCaches *new = NULL;
> >  hwaddr addr, size;
> >-int event_size;
> >  int64_t len;
> >-event_size = virtio_vdev_has_feature(vq->vdev, VIRTIO_RING_F_EVENT_IDX) 
> >? 2 : 0;
> >-
> >  addr = vq->vring.desc;
> >  if (!addr) {
> >  goto out_no_cache;
> >@@ -174,13 +171,13 @@ static void virtio_init_region_cache(VirtIODevice 
> >*vdev, int n)
> >  new = g_new0(VRingMemoryRegionCaches, 1);
> >  size = virtio_queue_get_desc_size(vdev, n);
> >  len = address_space_cache_init(&new->desc, vdev->dma_as,
> >-   addr, size, false);
> >+   addr, size, true);
> 
> 

Re: [Qemu-devel] [[RFC v3 05/12] virtio: init and desc empty check for packed ring

2018-10-15 Thread Wei Xu
On Mon, Oct 15, 2018 at 11:18:05AM +0800, Jason Wang wrote:
> 
> 
> On 2018年10月11日 22:08, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >Basic initialization and helpers for packed ring.
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 57 
> > +-
> >  1 file changed, 56 insertions(+), 1 deletion(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index 9185efb..86f88da 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -24,6 +24,9 @@
> >  #include "hw/virtio/virtio-access.h"
> >  #include "sysemu/dma.h"
> >+#define AVAIL_DESC_PACKED(b) ((b) << 7)
> >+#define USED_DESC_PACKED(b)  ((b) << 15)
> >+
> >  /*
> >   * The alignment to use between consumer and producer parts of vring.
> >   * x86 pagesize again. This is the default, used by transports like PCI
> >@@ -372,6 +375,23 @@ int virtio_queue_ready(VirtQueue *vq)
> >  return vq->vring.avail != 0;
> >  }
> >+static void vring_packed_desc_read_flags(VirtIODevice *vdev,
> >+VRingPackedDesc *desc, MemoryRegionCache *cache, int i)
> >+{
> >+address_space_read_cached(cache,
> >+  i * sizeof(VRingPackedDesc) + offsetof(VRingPackedDesc, 
> >flags),
> >+  &desc->flags, sizeof(desc->flags));
> >+}
> >+
> >+static inline bool is_desc_avail(struct VRingPackedDesc *desc, bool wc)
> >+{
> 
> I think it's better use wrap_counter instead of wc here (unless you want to
> use wc everywhere which is a even worse idea).

It was to avoid wrapping a parameter onto a new line since this is a tiny
function; I will change it back.

Wei

> 
> Thanks
> 
> >+bool avail, used;
> >+
> >+avail = !!(desc->flags & AVAIL_DESC_PACKED(1));
> >+used = !!(desc->flags & USED_DESC_PACKED(1));
> >+return (avail != used) && (avail == wc);
> >+}
> >+
> >  /* Fetch avail_idx from VQ memory only when we really need to know if
> >   * guest has added some buffers.
> >   * Called within rcu_read_lock().  */
> >@@ -392,7 +412,7 @@ static int virtio_queue_empty_rcu(VirtQueue *vq)
> >  return vring_avail_idx(vq) == vq->last_avail_idx;
> >  }
> >-int virtio_queue_empty(VirtQueue *vq)
> >+static int virtio_queue_split_empty(VirtQueue *vq)
> >  {
> >  bool empty;
> >@@ -414,6 +434,41 @@ int virtio_queue_empty(VirtQueue *vq)
> >  return empty;
> >  }
> >+static int virtio_queue_packed_empty_rcu(VirtQueue *vq)
> >+{
> >+struct VRingPackedDesc desc;
> >+VRingMemoryRegionCaches *cache;
> >+
> >+if (unlikely(!vq->vring.desc)) {
> >+return 1;
> >+}
> >+
> >+cache = vring_get_region_caches(vq);
> >+vring_packed_desc_read_flags(vq->vdev, &desc, &cache->desc,
> >+vq->last_avail_idx);
> >+
> >+return !is_desc_avail(&desc, vq->avail_wrap_counter);
> >+}
> >+
> >+static int virtio_queue_packed_empty(VirtQueue *vq)
> >+{
> >+bool empty;
> >+
> >+rcu_read_lock();
> >+empty = virtio_queue_packed_empty_rcu(vq);
> >+rcu_read_unlock();
> >+return empty;
> >+}
> >+
> >+int virtio_queue_empty(VirtQueue *vq)
> >+{
> >+if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> >+return virtio_queue_packed_empty(vq);
> >+} else {
> >+return virtio_queue_split_empty(vq);
> >+}
> >+}
> >+
> >  static void virtqueue_unmap_sg(VirtQueue *vq, const VirtQueueElement *elem,
> > unsigned int len)
> >  {
> 
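
The is_desc_avail() helper quoted above can be checked against its
wrap-counter truth table with a small standalone program; the macros
mirror the patch, the rest is illustrative test scaffolding:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AVAIL_DESC_PACKED(b) ((b) << 7)
#define USED_DESC_PACKED(b)  ((b) << 15)

/* A descriptor is available when its avail bit differs from its used
 * bit and the avail bit matches the driver's current wrap counter. */
static bool is_desc_avail(uint16_t flags, bool wrap_counter)
{
    bool avail = !!(flags & AVAIL_DESC_PACKED(1));
    bool used  = !!(flags & USED_DESC_PACKED(1));

    return (avail != used) && (avail == wrap_counter);
}

int main(void)
{
    /* Freshly written by the driver in the first pass: avail=1, used=0. */
    assert(is_desc_avail(AVAIL_DESC_PACKED(1), true));
    /* Same flags seen with the opposite wrap counter: not available. */
    assert(!is_desc_avail(AVAIL_DESC_PACKED(1), false));
    /* Already consumed by the device: avail == used. */
    assert(!is_desc_avail(AVAIL_DESC_PACKED(1) | USED_DESC_PACKED(1), true));
    return 0;
}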



Re: [Qemu-devel] [RFC v2 5/8] virtio: queue pop for packed ring

2018-06-19 Thread Wei Xu
On Wed, Jun 06, 2018 at 11:41:18AM +0800, Jason Wang wrote:
> 
> 
> On 2018年06月06日 11:38, Wei Xu wrote:
> >>>+
> >>>+head = vq->last_avail_idx;
> >>>+i = head;
> >>>+
> >>>+caches = vring_get_region_caches(vq);
> >>>+cache = &caches->desc;
> >>>+vring_packed_desc_read(vdev, &desc, cache, i);
> >>I think we'd better find a way to avoid reading descriptor twice.
> >Do you mean here and the read for empty check?
> >
> >Wei
> >
> 
> Yes.

OK, will figure it out.

> 
> Thanks
> 
> 



Re: [Qemu-devel] [RFC v2 8/8] virtio: guest driver reload for vhost-net

2018-06-19 Thread Wei Xu
On Wed, Jun 06, 2018 at 11:48:19AM +0800, Jason Wang wrote:
> 
> 
> On 2018年06月06日 03:08, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >last_avail, avail_wrap_count, used_idx and used_wrap_count are
> >needed to support the vhost-net backend; all of these are either 16-bit
> >or bool variables. Since state.num is 64 bits wide, it is
> >possible to put them into 'num' without introducing a new case
> >in the ioctl handling.
> >
> >Unload/Reload test has been done successfully with a patch in vhost kernel.
> 
> You need a patch to enable vhost.
> 
> And I think you can only do it for vhost-kenrel now since vhost-user
> protocol needs some extension I believe.

OK.

> 
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 42 ++
> >  1 file changed, 34 insertions(+), 8 deletions(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index 4543974..153f6d7 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -2862,33 +2862,59 @@ hwaddr virtio_queue_get_used_size(VirtIODevice 
> >*vdev, int n)
> >  }
> >  }
> >-uint16_t virtio_queue_get_last_avail_idx(VirtIODevice *vdev, int n)
> >+uint64_t virtio_queue_get_last_avail_idx(VirtIODevice *vdev, int n)
> >  {
> >-return vdev->vq[n].last_avail_idx;
> >+uint64_t num;
> >+
> >+num = vdev->vq[n].last_avail_idx;
> >+if (virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
> >+num |= ((uint64_t)vdev->vq[n].avail_wrap_counter) << 16;
> >+num |= ((uint64_t)vdev->vq[n].used_idx) << 32;
> >+num |= ((uint64_t)vdev->vq[n].used_wrap_counter) << 48;
> 
> So s.num is 32bit, I don't think this can even work.

My mistake, I misread s.num as being 64 bits wide; I will add a new case in the next version.

Wei

> 
> Thanks
> 
> >+}
> >+
> >+return num;
> >  }
> >-void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint16_t 
> >idx)
> >+void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint64_t 
> >num)
> >  {
> >-vdev->vq[n].last_avail_idx = idx;
> >-vdev->vq[n].shadow_avail_idx = idx;
> >+vdev->vq[n].shadow_avail_idx = vdev->vq[n].last_avail_idx = 
> >(uint16_t)(num);
> >+
> >+if (virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
> >+vdev->vq[n].avail_wrap_counter = (uint16_t)(num >> 16);
> >+vdev->vq[n].used_idx = (uint16_t)(num >> 32);
> >+vdev->vq[n].used_wrap_counter = (uint16_t)(num >> 48);
> >+}
> >  }
> >  void virtio_queue_restore_last_avail_idx(VirtIODevice *vdev, int n)
> >  {
> >  rcu_read_lock();
> >-if (vdev->vq[n].vring.desc) {
> >+if (!vdev->vq[n].vring.desc) {
> >+goto out;
> >+}
> >+
> >+if (!virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
> >  vdev->vq[n].last_avail_idx = vring_used_idx(&vdev->vq[n]);
> >-vdev->vq[n].shadow_avail_idx = vdev->vq[n].last_avail_idx;
> >  }
> >+vdev->vq[n].shadow_avail_idx = vdev->vq[n].last_avail_idx;
> >+
> >+out:
> >  rcu_read_unlock();
> >  }
> >  void virtio_queue_update_used_idx(VirtIODevice *vdev, int n)
> >  {
> >  rcu_read_lock();
> >-if (vdev->vq[n].vring.desc) {
> >+if (!vdev->vq[n].vring.desc) {
> >+goto out;
> >+}
> >+
> >+if (!virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
> >  vdev->vq[n].used_idx = vring_used_idx(&vdev->vq[n]);
> >  }
> >+
> >+out:
> >  rcu_read_unlock();
> >  }
> 



Re: [Qemu-devel] [RFC v2 0/8] packed ring virtio-net userspace backend support

2018-06-19 Thread Wei Xu
On Wed, Jun 06, 2018 at 11:49:22AM +0800, Jason Wang wrote:
> 
> 
> On 2018年06月06日 03:07, w...@redhat.com wrote:
> >From: Wei Xu
> >
> >Todo:
> >- address Rx slow performance
> >- event index interrupt suppression test
> 
> And there's something more need to test:
> 
> - vIOMMU support
> - migration

OK, I will test it and put it to next version.

> 
> Thanks



Re: [Qemu-devel] [RFC v2 2/8] virtio: memory cache for packed ring

2018-06-19 Thread Wei Xu
On Wed, Jun 06, 2018 at 10:53:07AM +0800, Jason Wang wrote:
> 
> 
> On 2018年06月06日 03:07, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >Mostly reuse memory cache with 1.0 except for the offset calculation.
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 29 -
> >  1 file changed, 20 insertions(+), 9 deletions(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index e192a9a..f6c0689 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -150,11 +150,8 @@ static void virtio_init_region_cache(VirtIODevice 
> >*vdev, int n)
> >  VRingMemoryRegionCaches *old = vq->vring.caches;
> >  VRingMemoryRegionCaches *new;
> >  hwaddr addr, size;
> >-int event_size;
> >  int64_t len;
> >-event_size = virtio_vdev_has_feature(vq->vdev, VIRTIO_RING_F_EVENT_IDX) 
> >? 2 : 0;
> >-
> >  addr = vq->vring.desc;
> >  if (!addr) {
> >  return;
> >@@ -168,7 +165,7 @@ static void virtio_init_region_cache(VirtIODevice *vdev, 
> >int n)
> >  goto err_desc;
> >  }
> >-size = virtio_queue_get_used_size(vdev, n) + event_size;
> >+size = virtio_queue_get_used_size(vdev, n);
> >  len = address_space_cache_init(&new->used, vdev->dma_as,
> > vq->vring.used, size, true);
> >  if (len < size) {
> >@@ -176,7 +173,7 @@ static void virtio_init_region_cache(VirtIODevice *vdev, 
> >int n)
> >  goto err_used;
> >  }
> >-size = virtio_queue_get_avail_size(vdev, n) + event_size;
> >+size = virtio_queue_get_avail_size(vdev, n);
> >  len = address_space_cache_init(&new->avail, vdev->dma_as,
> > vq->vring.avail, size, false);
> >  if (len < size) {
> >@@ -2320,14 +2317,28 @@ hwaddr virtio_queue_get_desc_size(VirtIODevice 
> >*vdev, int n)
> >  hwaddr virtio_queue_get_avail_size(VirtIODevice *vdev, int n)
> 
> I would rather rename this to virtio_queue_get_driver_size().

Wouldn't this be confusing for 1.0 if the function is shared by both? Otherwise
I will take it into the next version, thanks.

Wei

> 
> >  {
> >-return offsetof(VRingAvail, ring) +
> >-sizeof(uint16_t) * vdev->vq[n].vring.num;
> >+int s;
> >+
> >+if (virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
> >+return sizeof(struct VRingPackedDescEvent);
> >+} else {
> >+s = virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> >+return offsetof(VRingAvail, ring) +
> >+sizeof(uint16_t) * vdev->vq[n].vring.num + s;
> >+}
> >  }
> >  hwaddr virtio_queue_get_used_size(VirtIODevice *vdev, int n)
> 
> virtio_queue_get_device_size().
> 
> Thanks
> 
> >  {
> >-return offsetof(VRingUsed, ring) +
> >-sizeof(VRingUsedElem) * vdev->vq[n].vring.num;
> >+int s;
> >+
> >+if (virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED)) {
> >+return sizeof(struct VRingPackedDescEvent);
> >+} else {
> >+s = virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> >+return offsetof(VRingUsed, ring) +
> >+sizeof(VRingUsedElem) * vdev->vq[n].vring.num + s;
> >+}
> >  }
> >  uint16_t virtio_queue_get_last_avail_idx(VirtIODevice *vdev, int n)
> 
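
For reference, the area sizes being renamed here work out as below; this
is a standalone arithmetic sketch whose constants follow the virtio 1.x
layouts (the VRingAvail/VRingUsed headers, an 8-byte VRingUsedElem and
the 4-byte VRingPackedDescEvent), not QEMU code:

#include <stdint.h>
#include <stdio.h>

static uint64_t split_avail_size(unsigned int num, int event_idx)
{
    return 4 + 2 * num + (event_idx ? 2 : 0);  /* flags+idx+ring[+used_event] */
}

static uint64_t split_used_size(unsigned int num, int event_idx)
{
    return 4 + 8 * num + (event_idx ? 2 : 0);  /* flags+idx+ring[+avail_event] */
}

static uint64_t packed_event_size(void)
{
    return 4;  /* off_wrap + flags, i.e. one VRingPackedDescEvent */
}

int main(void)
{
    /* For packed rings the driver/device areas shrink to one event
     * structure each, regardless of the queue size. */
    printf("split avail(256): %llu\n", (unsigned long long)split_avail_size(256, 1));
    printf("split used(256):  %llu\n", (unsigned long long)split_used_size(256, 1));
    printf("packed event:     %llu\n", (unsigned long long)packed_event_size());
    return 0;
}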



Re: [Qemu-devel] [RFC v2 5/8] virtio: queue pop for packed ring

2018-06-05 Thread Wei Xu
On Wed, Jun 06, 2018 at 11:29:54AM +0800, Jason Wang wrote:
> 
> 
> On 2018年06月06日 03:08, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 145 
> > -
> >  1 file changed, 144 insertions(+), 1 deletion(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index cdbb5af..0160d03 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -1041,7 +1041,7 @@ static void *virtqueue_alloc_element(size_t sz, 
> >unsigned out_num, unsigned in_nu
> >  return elem;
> >  }
> >-void *virtqueue_pop(VirtQueue *vq, size_t sz)
> >+static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
> >  {
> >  unsigned int i, head, max;
> >  VRingMemoryRegionCaches *caches;
> >@@ -1176,6 +1176,149 @@ err_undo_map:
> >  goto done;
> >  }
> >+static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
> >+{
> >+unsigned int i, head, max;
> >+VRingMemoryRegionCaches *caches;
> >+MemoryRegionCache indirect_desc_cache = MEMORY_REGION_CACHE_INVALID;
> >+MemoryRegionCache *cache;
> >+int64_t len;
> >+VirtIODevice *vdev = vq->vdev;
> >+VirtQueueElement *elem = NULL;
> >+unsigned out_num, in_num, elem_entries;
> >+hwaddr addr[VIRTQUEUE_MAX_SIZE];
> >+struct iovec iov[VIRTQUEUE_MAX_SIZE];
> >+VRingDescPacked desc;
> >+
> >+if (unlikely(vdev->broken)) {
> >+return NULL;
> >+}
> >+
> >+rcu_read_lock();
> >+if (virtio_queue_packed_empty_rcu(vq)) {
> >+goto done;
> >+}
> 
> Instead of depending on the barriers inside virtio_queue_packed_empty_rcu().
> I think it's better to keep a smp_rmb() here with comments.

OK.

> 
> >+
> >+/* When we start there are none of either input nor output. */
> >+out_num = in_num = elem_entries = 0;
> >+
> >+max = vq->vring.num;
> >+
> >+if (vq->inuse >= vq->vring.num) {
> >+virtio_error(vdev, "Virtqueue size exceeded");
> >+goto done;
> >+}
> >+
> >+if (virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX)) {
> >+/* FIXME: TBD */
> >+}
> 
> This part could be removed.

My bad, thanks.

> 
> >+
> >+head = vq->last_avail_idx;
> >+i = head;
> >+
> >+caches = vring_get_region_caches(vq);
> >+cache = &caches->desc;
> >+vring_packed_desc_read(vdev, &desc, cache, i);
> 
> I think we'd better find a way to avoid reading descriptor twice.

Do you mean here and the read for empty check?

Wei

> 
> Thanks
> 
> >+if (desc.flags & VRING_DESC_F_INDIRECT) {
> >+if (desc.len % sizeof(VRingDescPacked)) {
> >+virtio_error(vdev, "Invalid size for indirect buffer table");
> >+goto done;
> >+}
> >+
> >+/* loop over the indirect descriptor table */
> >+len = address_space_cache_init(&indirect_desc_cache, vdev->dma_as,
> >+   desc.addr, desc.len, false);
> >+cache = &indirect_desc_cache;
> >+if (len < desc.len) {
> >+virtio_error(vdev, "Cannot map indirect buffer");
> >+goto done;
> >+}
> >+
> >+max = desc.len / sizeof(VRingDescPacked);
> >+i = 0;
> >+vring_packed_desc_read(vdev, &desc, cache, i);
> >+}
> >+
> >+/* Collect all the descriptors */
> >+while (1) {
> >+bool map_ok;
> >+
> >+if (desc.flags & VRING_DESC_F_WRITE) {
> >+map_ok = virtqueue_map_desc(vdev, &in_num, addr + out_num,
> >+iov + out_num,
> >+VIRTQUEUE_MAX_SIZE - out_num, true,
> >+desc.addr, desc.len);
> >+} else {
> >+if (in_num) {
> >+virtio_error(vdev, "Incorrect order for descriptors");
> >+goto err_undo_map;
> >+}
> >+map_ok = virtqueue_map_desc(vdev, &out_num, addr, iov,
> >+VIRTQUEUE_MAX_SIZE, false,
> >+desc.addr, desc.len);
> >+}
> >+if (!map_ok) {
> >+goto err_undo_map;
> >+}
> >+
> >+/* If we've got too many, that impl

Re: [Qemu-devel] [PATCH 4/8] virtio: add detach element for packed ring(1.1)

2018-06-04 Thread Wei Xu
On Mon, Jun 04, 2018 at 04:54:45AM +0300, Michael S. Tsirkin wrote:
> On Mon, Jun 04, 2018 at 09:34:35AM +0800, Wei Xu wrote:
> > On Tue, Apr 10, 2018 at 03:32:53PM +0800, Jason Wang wrote:
> > > 
> > > 
> > > On 2018年04月04日 20:54, w...@redhat.com wrote:
> > > >From: Wei Xu 
> > > >
> > > >helper for packed ring
> > > 
> > > It's odd and hard to review if you put detach patch first. I think this
> > > patch needs to be reordered after the implementation of pop/map.
> > 
> > This patch is not necessary after sync to tiwei's v5, so we can skip it.
> > 
> > Wei
> 
> I suspect we will need to bring detach back eventually but yes,
> it can wait.

Sure, I reuse the code for 1.0 for now.

Wei

> 
> > > 
> > > Thanks
> > > 
> > > >Signed-off-by: Wei Xu 
> > > >---
> > > >  hw/virtio/virtio.c | 21 +++--
> > > >  1 file changed, 19 insertions(+), 2 deletions(-)
> > > >
> > > >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> > > >index 478df3d..fdee40f 100644
> > > >--- a/hw/virtio/virtio.c
> > > >+++ b/hw/virtio/virtio.c
> > > >@@ -561,6 +561,20 @@ static void virtqueue_unmap_sg(VirtQueue *vq, const 
> > > >VirtQueueElement *elem,
> > > >   elem->out_sg[i].iov_len);
> > > >  }
> > > >+static void virtqueue_detach_element_split(VirtQueue *vq,
> > > >+const VirtQueueElement *elem, unsigned int 
> > > >len)
> > > >+{
> > > >+vq->inuse--;
> > > >+virtqueue_unmap_sg(vq, elem, len);
> > > >+}
> > > >+
> > > >+static void virtqueue_detach_element_packed(VirtQueue *vq,
> > > >+const VirtQueueElement *elem, unsigned int 
> > > >len)
> > > >+{
> > > >+vq->inuse -= elem->count;
> > > >+virtqueue_unmap_sg(vq, elem, len);
> > > >+}
> > > >+
> > > >  /* virtqueue_detach_element:
> > > >   * @vq: The #VirtQueue
> > > >   * @elem: The #VirtQueueElement
> > > >@@ -573,8 +587,11 @@ static void virtqueue_unmap_sg(VirtQueue *vq, const 
> > > >VirtQueueElement *elem,
> > > >  void virtqueue_detach_element(VirtQueue *vq, const VirtQueueElement 
> > > > *elem,
> > > >unsigned int len)
> > > >  {
> > > >-vq->inuse--;
> > > >-virtqueue_unmap_sg(vq, elem, len);
> > > >+if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> > > >+virtqueue_detach_element_packed(vq, elem, len);
> > > >+} else {
> > > >+virtqueue_detach_element_split(vq, elem, len);
> > > >+}
> > > >  }
> > > >  /* virtqueue_unpop:
> > > 
> 



Re: [Qemu-devel] [PATCH 8/8] virtio: queue pop support for packed ring

2018-06-04 Thread Wei Xu
On Wed, Apr 11, 2018 at 10:43:40AM +0800, Jason Wang wrote:
> 
> 
> On 2018年04月04日 20:54, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >cloned from split ring pop, a global static length array
> >and the inside-element length array are introduced to
> >easy prototype, this consumes more memory and it is valuable
> >to move to dynamic allocation as the out/in sg does.
> 
> To ease the reviewer, I suggest to reorder this patch to patch 4.

OK.

> 
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 154 
> > -
> >  1 file changed, 153 insertions(+), 1 deletion(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index cf726f3..0eafb38 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -1221,7 +1221,7 @@ static void *virtqueue_alloc_element(size_t sz, 
> >unsigned out_num, unsigned in_nu
> >  return elem;
> >  }
> >-void *virtqueue_pop(VirtQueue *vq, size_t sz)
> >+static void *virtqueue_pop_split(VirtQueue *vq, size_t sz)
> >  {
> >  unsigned int i, head, max;
> >  VRingMemoryRegionCaches *caches;
> >@@ -1356,6 +1356,158 @@ err_undo_map:
> >  goto done;
> >  }
> >+static uint16_t dma_len[VIRTQUEUE_MAX_SIZE];
> 
> This looks odd.

This has been removed.

> 
> >+static void *virtqueue_pop_packed(VirtQueue *vq, size_t sz)
> >+{
> >+unsigned int i, head, max;
> >+VRingMemoryRegionCaches *caches;
> >+MemoryRegionCache indirect_desc_cache = MEMORY_REGION_CACHE_INVALID;
> >+MemoryRegionCache *cache;
> >+int64_t len;
> >+VirtIODevice *vdev = vq->vdev;
> >+VirtQueueElement *elem = NULL;
> >+unsigned out_num, in_num, elem_entries;
> >+hwaddr addr[VIRTQUEUE_MAX_SIZE];
> >+struct iovec iov[VIRTQUEUE_MAX_SIZE];
> >+VRingDescPacked desc;
> >+uint8_t wrap_counter;
> >+
> >+if (unlikely(vdev->broken)) {
> >+return NULL;
> >+}
> >+
> >+vq->last_avail_idx %= vq->packed.num;
> 
> Queue size could not be a power of 2.

I have replaced it with a subtraction.
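
For reference, a minimal sketch of the subtraction-based wrap (my
assumption of how the next version will look; field names follow this
patch):

/* Wrap without assuming the ring size is a power of two; the index
 * only ever overshoots by less than one full ring, so a conditional
 * subtraction is enough. */
if (vq->last_avail_idx >= vq->packed.num) {
    vq->last_avail_idx -= vq->packed.num;
}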

> 
> >+
> >+rcu_read_lock();
> >+if (virtio_queue_empty_packed_rcu(vq)) {
> >+goto done;
> >+}
> >+
> >+/* When we start there are none of either input nor output. */
> >+out_num = in_num = elem_entries = 0;
> >+
> >+max = vq->vring.num;
> >+
> >+if (vq->inuse >= vq->vring.num) {
> >+virtio_error(vdev, "Virtqueue size exceeded");
> >+goto done;
> >+}
> >+
> >+if (virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX)) {
> >+/* FIXME: TBD */
> >+}
> >+
> >+head = vq->last_avail_idx;
> >+i = head;
> >+
> >+caches = vring_get_region_caches(vq);
> >+cache = &caches->desc_packed;
> >+vring_desc_read_packed(vdev, &desc, cache, i);
> >+if (desc.flags & VRING_DESC_F_INDIRECT) {
> >+if (desc.len % sizeof(VRingDescPacked)) {
> >+virtio_error(vdev, "Invalid size for indirect buffer table");
> >+goto done;
> >+}
> >+
> >+/* loop over the indirect descriptor table */
> >+len = address_space_cache_init(&indirect_desc_cache, vdev->dma_as,
> >+   desc.addr, desc.len, false);
> >+cache = &indirect_desc_cache;
> >+if (len < desc.len) {
> >+virtio_error(vdev, "Cannot map indirect buffer");
> >+goto done;
> >+}
> >+
> >+max = desc.len / sizeof(VRingDescPacked);
> >+i = 0;
> >+vring_desc_read_packed(vdev, &desc, cache, i);
> >+}
> >+
> >+wrap_counter = vq->wrap_counter;
> >+/* Collect all the descriptors */
> >+while (1) {
> >+bool map_ok;
> >+
> >+if (desc.flags & VRING_DESC_F_WRITE) {
> >+map_ok = virtqueue_map_desc(vdev, &in_num, addr + out_num,
> >+iov + out_num,
> >+VIRTQUEUE_MAX_SIZE - out_num, true,
> >+desc.addr, desc.len);
> >+} else {
> >+if (in_num) {
> >+virtio_error(vdev, "Incorrect order for descriptors");
> >+goto err_undo_map;
> >+}
> >+map_ok = virtqueue_map_desc(vdev, &out_num, addr, iov,
> >+VIRTQUEUE_MAX_SIZE, false,
> >+desc.addr, desc.len);

Re: [Qemu-devel] [PATCH 7/8] virtio: get avail bytes check for packed ring

2018-06-04 Thread Wei Xu
On Wed, Apr 11, 2018 at 11:03:24AM +0800, Jason Wang wrote:
> 
> 
> On 2018-04-04 20:54, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >mostly the same as 1.0; copied separately for the
> >prototype, needs a refactoring.
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 142 
> > +++--
> >  1 file changed, 139 insertions(+), 3 deletions(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index def07c6..cf726f3 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -836,9 +836,9 @@ static int virtqueue_read_next_desc(VirtIODevice *vdev, 
> >VRingDesc *desc,
> >  return VIRTQUEUE_READ_DESC_MORE;
> >  }
> >-void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
> >-   unsigned int *out_bytes,
> >-   unsigned max_in_bytes, unsigned 
> >max_out_bytes)
> >+static void virtqueue_get_avail_bytes_split(VirtQueue *vq,
> >+unsigned int *in_bytes, unsigned int *out_bytes,
> >+unsigned max_in_bytes, unsigned max_out_bytes)
> >  {
> >  VirtIODevice *vdev = vq->vdev;
> >  unsigned int max, idx;
> >@@ -961,6 +961,142 @@ err:
> >  goto done;
> >  }
> >+static void virtqueue_get_avail_bytes_packed(VirtQueue *vq,
> >+unsigned int *in_bytes, unsigned int *out_bytes,
> >+unsigned max_in_bytes, unsigned max_out_bytes)
> >+{
> >+VirtIODevice *vdev = vq->vdev;
> >+unsigned int max, idx;
> >+unsigned int total_bufs, in_total, out_total;
> >+MemoryRegionCache *desc_cache;
> >+VRingMemoryRegionCaches *caches;
> >+MemoryRegionCache indirect_desc_cache = MEMORY_REGION_CACHE_INVALID;
> >+int64_t len = 0;
> >+VRingDescPacked desc;
> >+
> >+if (unlikely(!vq->packed.desc)) {
> >+if (in_bytes) {
> >+*in_bytes = 0;
> >+}
> >+if (out_bytes) {
> >+*out_bytes = 0;
> >+}
> >+return;
> >+}
> >+
> >+rcu_read_lock();
> >+idx = vq->last_avail_idx;
> >+total_bufs = in_total = out_total = 0;
> >+
> >+max = vq->packed.num;
> >+caches = vring_get_region_caches(vq);
> >+if (caches->desc.len < max * sizeof(VRingDescPacked)) {
> >+virtio_error(vdev, "Cannot map descriptor ring");
> >+goto err;
> >+}
> >+
> >+desc_cache = &caches->desc;
> >+vring_desc_read_packed(vdev, &desc, desc_cache, idx);
> >+while (is_desc_avail(&desc)) {
> >+unsigned int num_bufs;
> >+unsigned int i;
> >+
> >+num_bufs = total_bufs;
> >+
> >+if (desc.flags & VRING_DESC_F_INDIRECT) {
> >+if (desc.len % sizeof(VRingDescPacked)) {
> >+virtio_error(vdev, "Invalid size for indirect buffer 
> >table");
> >+goto err;
> >+}
> >+
> >+/* If we've got too many, that implies a descriptor loop. */
> >+if (num_bufs >= max) {
> >+virtio_error(vdev, "Looped descriptor");
> >+goto err;
> >+}
> >+
> >+/* loop over the indirect descriptor table */
> >+len = address_space_cache_init(&indirect_desc_cache,
> >+   vdev->dma_as,
> >+   desc.addr, desc.len, false);
> >+desc_cache = &indirect_desc_cache;
> >+if (len < desc.len) {
> >+virtio_error(vdev, "Cannot map indirect buffer");
> >+goto err;
> >+}
> >+
> >+max = desc.len / sizeof(VRingDescPacked);
> >+num_bufs = i = 0;
> >+vring_desc_read_packed(vdev, &desc, desc_cache, i);
> >+}
> >+
> >+do {
> >+/* If we've got too many, that implies a descriptor loop. */
> >+if (++num_bufs > max) {
> >+virtio_error(vdev, "Looped descriptor");
> >+goto err;
> >+}
> >+
> >+if (desc.flags & VRING_DESC_F_WRITE) {
> >+in_total += desc.len;
> >+} else {
> >+out_total += desc.len;
> >+}
> >+   

Re: [Qemu-devel] [PATCH 4/8] virtio: add detach element for packed ring(1.1)

2018-06-03 Thread Wei Xu
On Tue, Apr 10, 2018 at 03:32:53PM +0800, Jason Wang wrote:
> 
> 
> On 2018-04-04 20:54, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >helper for packed ring
> 
> It's odd and hard to review if you put detach patch first. I think this
> patch needs to be reordered after the implementation of pop/map.

This patch is no longer necessary after syncing to tiwei's v5, so we can skip it.

Wei

> 
> Thanks
> 
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 21 +++--
> >  1 file changed, 19 insertions(+), 2 deletions(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index 478df3d..fdee40f 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -561,6 +561,20 @@ static void virtqueue_unmap_sg(VirtQueue *vq, const 
> >VirtQueueElement *elem,
> >   elem->out_sg[i].iov_len);
> >  }
> >+static void virtqueue_detach_element_split(VirtQueue *vq,
> >+const VirtQueueElement *elem, unsigned int len)
> >+{
> >+vq->inuse--;
> >+virtqueue_unmap_sg(vq, elem, len);
> >+}
> >+
> >+static void virtqueue_detach_element_packed(VirtQueue *vq,
> >+const VirtQueueElement *elem, unsigned int len)
> >+{
> >+vq->inuse -= elem->count;
> >+virtqueue_unmap_sg(vq, elem, len);
> >+}
> >+
> >  /* virtqueue_detach_element:
> >   * @vq: The #VirtQueue
> >   * @elem: The #VirtQueueElement
> >@@ -573,8 +587,11 @@ static void virtqueue_unmap_sg(VirtQueue *vq, const 
> >VirtQueueElement *elem,
> >  void virtqueue_detach_element(VirtQueue *vq, const VirtQueueElement *elem,
> >unsigned int len)
> >  {
> >-vq->inuse--;
> >-virtqueue_unmap_sg(vq, elem, len);
> >+if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> >+virtqueue_detach_element_packed(vq, elem, len);
> >+} else {
> >+virtqueue_detach_element_split(vq, elem, len);
> >+}
> >  }
> >  /* virtqueue_unpop:
> 



Re: [Qemu-devel] [PATCH 3/8] virtio: add empty check for packed ring

2018-06-03 Thread Wei Xu
On Tue, Apr 10, 2018 at 03:23:03PM +0800, Jason Wang wrote:
> 
> 
> On 2018-04-04 20:53, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >helper for ring empty check.
> 
> And descriptor read.

OK.

> 
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 62 
> > +++---
> >  1 file changed, 59 insertions(+), 3 deletions(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index 73a35a4..478df3d 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -24,6 +24,9 @@
> >  #include "hw/virtio/virtio-access.h"
> >  #include "sysemu/dma.h"
> >+#define AVAIL_DESC_PACKED(b) ((b) << 7)
> >+#define USED_DESC_PACKED(b)  ((b) << 15)
> 
> Can we pass value other than 1 to this macro?

Yes, '0' is also passed in for some clear/reset cases.
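
A minimal usage sketch (assumed call sites, just to show both values):

/* b = 1 builds the set mask, b = 0 yields zero for the clear/reset
 * paths; callers AND/OR the result into desc->flags. */
uint16_t avail_set   = AVAIL_DESC_PACKED(1);   /* 0x0080 */
uint16_t avail_clear = AVAIL_DESC_PACKED(0);   /* 0x0000 */
uint16_t used_set    = USED_DESC_PACKED(1);    /* 0x8000 */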

> 
> >+
> >  /*
> >   * The alignment to use between consumer and producer parts of vring.
> >   * x86 pagesize again. This is the default, used by transports like PCI
> >@@ -446,10 +449,27 @@ int virtio_queue_ready(VirtQueue *vq)
> >  return vq->vring.avail != 0;
> >  }
> >+static void vring_desc_read_packed(VirtIODevice *vdev, VRingDescPacked 
> >*desc,
> >+MemoryRegionCache *cache, int i)
> >+{
> >+address_space_read_cached(cache, i * sizeof(VRingDescPacked),
> >+  desc, sizeof(VRingDescPacked));
> >+virtio_tswap64s(vdev, &desc->addr);
> >+virtio_tswap32s(vdev, &desc->len);
> >+virtio_tswap16s(vdev, &desc->id);
> >+virtio_tswap16s(vdev, &desc->flags);
> >+}
> >+
> >+static inline bool is_desc_avail(struct VRingDescPacked* desc)
> >+{
> >+return (!!(desc->flags & AVAIL_DESC_PACKED(1)) !=
> >+!!(desc->flags & USED_DESC_PACKED(1)));
> 
> Don't we need to care about endian here?

Usually we don't, since the endianness has already been converted during
the read; I will double-check it.

> 
> >+}
> >+
> >  /* Fetch avail_idx from VQ memory only when we really need to know if
> >   * guest has added some buffers.
> >   * Called within rcu_read_lock().  */
> >-static int virtio_queue_empty_rcu(VirtQueue *vq)
> >+static int virtio_queue_empty_split_rcu(VirtQueue *vq)
> >  {
> >  if (unlikely(!vq->vring.avail)) {
> >  return 1;
> >@@ -462,7 +482,7 @@ static int virtio_queue_empty_rcu(VirtQueue *vq)
> >  return vring_avail_idx(vq) == vq->last_avail_idx;
> >  }
> >-int virtio_queue_empty(VirtQueue *vq)
> >+static int virtio_queue_empty_split(VirtQueue *vq)
> >  {
> >  bool empty;
> >@@ -480,6 +500,42 @@ int virtio_queue_empty(VirtQueue *vq)
> >  return empty;
> >  }
> >+static int virtio_queue_empty_packed_rcu(VirtQueue *vq)
> >+{
> >+struct VRingDescPacked desc;
> >+VRingMemoryRegionCaches *cache;
> >+
> >+if (unlikely(!vq->packed.desc)) {
> >+return 1;
> >+}
> >+
> >+cache = vring_get_region_caches(vq);
> >+vring_desc_read_packed(vq->vdev, &desc, &cache->desc_packed,
> >+vq->last_avail_idx);
> >+
> >+/* Make sure we see the updated flag */
> >+smp_mb();
> 
> What we need here is to make sure flag is read before all other fields,
> looks like this barrier can't.

Isn't the flag already up to date by the time it has been read?
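
If the concern is load ordering, a sketch of what I understand you are
asking for (my assumption; virtio_lduw_phys_cached() and smp_rmb()
already exist in QEMU):

/* Read the flags word first, then fence, then read the descriptor
 * body, so the body cannot be observed from before the flags marked
 * the descriptor available. */
hwaddr off = vq->last_avail_idx * sizeof(VRingDescPacked) +
             offsetof(VRingDescPacked, flags);
uint16_t flags = virtio_lduw_phys_cached(vq->vdev, &cache->desc_packed, off);

smp_rmb();  /* order the flags load before the body loads below */
vring_desc_read_packed(vq->vdev, &desc, &cache->desc_packed,
                       vq->last_avail_idx);
/* ...and then test availability on 'flags', not on desc.flags */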

> 
> >+return !is_desc_avail(&desc);
> >+}
> >+
> >+static int virtio_queue_empty_packed(VirtQueue *vq)
> >+{
> >+bool empty;
> >+
> >+rcu_read_lock();
> >+empty = virtio_queue_empty_packed_rcu(vq);
> >+rcu_read_unlock();
> >+return empty;
> >+}
> >+
> >+int virtio_queue_empty(VirtQueue *vq)
> >+{
> >+if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> >+return virtio_queue_empty_packed(vq);
> >+} else {
> >+return virtio_queue_empty_split(vq);
> >+}
> >+}
> >+
> >  static void virtqueue_unmap_sg(VirtQueue *vq, const VirtQueueElement *elem,
> > unsigned int len)
> >  {
> >@@ -951,7 +1007,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
> >  return NULL;
> >  }
> >  rcu_read_lock();
> >-if (virtio_queue_empty_rcu(vq)) {
> >+if (virtio_queue_empty_split_rcu(vq)) {
> 
> I think you'd better have a switch inside virtio_queue_empty_rcu() like
> virtio_queue_empty() here.

OK.
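
Something like this, I suppose (a sketch mirroring the
virtio_queue_empty() dispatch above):

/* Called within rcu_read_lock(). */
static int virtio_queue_empty_rcu(VirtQueue *vq)
{
    if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
        return virtio_queue_empty_packed_rcu(vq);
    } else {
        return virtio_queue_empty_split_rcu(vq);
    }
}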

> 
> Thanks
> 
> >  goto done;
> >  }
> >  /* Needed after virtio_queue_empty(), see comment in
> 



Re: [Qemu-devel] [PATCH 1/8] virtio: feature bit, data structure for packed ring

2018-06-03 Thread Wei Xu
On Tue, Apr 10, 2018 at 03:05:24PM +0800, Jason Wang wrote:
> 
> 
> On 2018-04-04 20:53, w...@redhat.com wrote:
> >From: Wei Xu 
> >
> >Only minimum definitions from the spec are included
> >for prototype.
> >
> >Signed-off-by: Wei Xu 
> >---
> >  hw/virtio/virtio.c | 47 
> > +++---
> >  include/hw/virtio/virtio.h | 12 ++-
> >  include/standard-headers/linux/virtio_config.h |  2 ++
> >  3 files changed, 56 insertions(+), 5 deletions(-)
> >
> >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >index 006d3d1..9a6bfe7 100644
> >--- a/hw/virtio/virtio.c
> >+++ b/hw/virtio/virtio.c
> >@@ -39,6 +39,14 @@ typedef struct VRingDesc
> >  uint16_t next;
> >  } VRingDesc;
> >+typedef struct VRingDescPacked
> >+{
> >+uint64_t addr;
> >+uint32_t len;
> >+uint16_t id;
> >+uint16_t flags;
> >+} VRingDescPacked;
> >+
> >  typedef struct VRingAvail
> >  {
> >  uint16_t flags;
> >@@ -61,9 +69,18 @@ typedef struct VRingUsed
> >  typedef struct VRingMemoryRegionCaches {
> >  struct rcu_head rcu;
> >-MemoryRegionCache desc;
> >-MemoryRegionCache avail;
> >-MemoryRegionCache used;
> >+union {
> >+struct {
> >+MemoryRegionCache desc;
> >+MemoryRegionCache avail;
> >+MemoryRegionCache used;
> >+};
> >+struct {
> >+MemoryRegionCache desc_packed;
> >+MemoryRegionCache driver;
> >+MemoryRegionCache device;
> >+};
> >+};
> 
> I think we can reuse exist memory region caches? Especially consider
> device/driver area could be treated as a renaming of avail/used area.
> 
> E.g desc for desc_packed, avail for driver area and used for device area.

Yes, I will take it.

> 
> >  } VRingMemoryRegionCaches;
> >  typedef struct VRing
> >@@ -77,10 +94,31 @@ typedef struct VRing
> >  VRingMemoryRegionCaches *caches;
> >  } VRing;
> >+typedef struct VRingPackedDescEvent {
> >+uint16_t desc_event_off:15,
> >+ desc_event_wrap:1;
> >+uint16_t desc_event_flags:2;
> >+} VRingPackedDescEvent ;
> >+
> >+typedef struct VRingPacked
> >+{
> >+unsigned int num;
> >+unsigned int num_default;
> >+unsigned int align;
> >+hwaddr desc;
> >+hwaddr driver;
> >+hwaddr device;
> >+VRingMemoryRegionCaches *caches;
> >+} VRingPacked;
> 
> Same here, can we reuse VRing here?

Yes.

> 
> >+
> >  struct VirtQueue
> >  {
> >-VRing vring;
> >+union {
> >+struct VRing vring;
> >+struct VRingPacked packed;
> >+};
> >+uint8_t wrap_counter:1;
> >  /* Next head to pop */
> >  uint16_t last_avail_idx;
> >@@ -1220,6 +1258,7 @@ void virtio_reset(void *opaque)
> >  vdev->vq[i].vring.num = vdev->vq[i].vring.num_default;
> >  vdev->vq[i].inuse = 0;
> >+virtio_virtqueue_reset_region_cache(&vdev->vq[i]);
> >+vdev->vq[i].wrap_counter = 1;
> >  }
> >  }
> >diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> >index 098bdaa..563e88e 100644
> >--- a/include/hw/virtio/virtio.h
> >+++ b/include/hw/virtio/virtio.h
> >@@ -46,6 +46,14 @@ typedef struct VirtQueueElement
> >  unsigned int index;
> >  unsigned int out_num;
> >  unsigned int in_num;
> >+
> >+/* Number of descriptors used by packed ring */
> 
> Do you mean the number of chained descriptors?

These have been removed.

> 
> >+uint16_t count;
> >+uint8_t wrap_counter:1;
> 
> What's the use of this bit? If you refer to my v1 vhost code, I used to have
> this, but it won't work for OOO completion e.g when zerocopy is disabled.
> I've dropped it now.
> 
> This is tricky and can only work when device complete descriptors in order.

Same here.

> 
> >+/* FIXME: Length of every used buffer for a descriptor,
> >+   move to dynamical allocating due to out/in sgs numbers */
> >+uint32_t len[VIRTQUEUE_MAX_SIZE];
> 
> Can you explain more about this?

Also here.

> 
> >+
> >  hwaddr *in_addr;
> >  hwaddr *out_addr;
> >  struct iovec *in_sg;
> >@@ -262,7 +270,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
> >  DEFINE_PROP_BIT64("any_layout", _state, _field, \
> > 

Re: [Qemu-devel] [RFC PATCH 0/8] virtio-net 1.1 userspace backend support

2018-04-10 Thread Wei Xu
On Tue, Apr 10, 2018 at 11:46:47AM +0800, Jason Wang wrote:
>
> 
> On 2018-04-04 20:53, w...@redhat.com wrote:
> >From: Wei Xu <w...@redhat.com>
> >
> >This is a prototype for virtio-net 1.1 support in userspace backend,
> >only minimum part are included in this RFC(roughly synced to v8 as
> >Jason and Tiwei's RFC).
> >
> >Test has been done together with Tiwei's RFC guest virtio-net driver
> >patch, ping and a quick iperf test successfully.
> >
> >Issues:
> >1. Rx performance of Iperf is much slower than TX.
> > TX: 13-15Gb
> > RX: 100-300Mb
> 
> This needs to be investigated. What's the pps of TX/RX then? (Maybe you can
> try Jen's dpdk code too).

Yes, I haven't tuned any tso/gso on the tap, so the pps should match the
bandwidth; I will try some more debugging and try Jen's code if I cannot
resolve it.

Wei

> 
> Thanks
> 
> >
> >Missing:
> >- device and driver
> >- indirect descriptor
> >- migration
> >- vIOMMU support
> >- other revisions since v8
> >- see FIXME
> >
> >Wei Xu (8):
> >   virtio: feature bit, data structure for packed ring
> >   virtio: memory cache for packed ring
> >   virtio: add empty check for packed ring
> >   virtio: add detach element for packed ring(1.1)
> >   virtio: notification tweak for packed ring
> >   virtio: flush/push support for packed ring
> >   virtio: get avail bytes check for packed ring
> >   virtio: queue pop support for packed ring
> >
> >  hw/virtio/virtio.c | 618 
> > +++--
> >  include/hw/virtio/virtio.h |  12 +-
> >  include/standard-headers/linux/virtio_config.h |   2 +
> >  3 files changed, 601 insertions(+), 31 deletions(-)
> >
> 



Re: [Qemu-devel] [Qemu-arm] [PATCH] pl011: do not put into fifo before enabled the interruption

2018-01-29 Thread Wei Xu
Hi Peter,

On 2018/1/26 18:01, Peter Maydell wrote:
> On 26 January 2018 at 17:33, Wei Xu <xuw...@hisilicon.com> wrote:
>> On 2018/1/26 17:15, Peter Maydell wrote:
>>> The pl011 code should call qemu_set_irq(..., 1) when the
>>> guest enables interrupts on the device by writing to the int_enabled
>>> (UARTIMSC) register. That will be a 0-to-1 level change and the KVM
>>> VGIC should report the interrupt to the guest.
>>>
>>
>> Yes.
>> And in the pl011_update, the irq level is set by s->int_level & 
>> s->int_enabled.
>> When writing to the int_enabled, not sure why the int_level is set to
>> 0x20(PL011_INT_TX) but int_enabled is 0x50.
>>
>> It still call qemu_set_irq(..., 0).
>>
>> I added "s->int_level |= PL011_INT_RX" before calling pl011_update
>> when writing to the int_enabled and tested it also works.
> 
> No, that's not right either. int_level should already have the
> RX bit set, because pl011_put_fifo() sets that bit when it gets a
> character from QEMU and puts it into the FIFO.
> 
> Does something else clear the int_level between the character
> going into the FIFO from QEMU and the guest enabling
> interrupts?

Yes. When the guest enables the interrupts, the pl011 driver in
the kernel clears the RX interrupts[1].
I have pasted the code below to make it easier to read.

static void pl011_enable_interrupts(struct uart_amba_port *uap)
{
spin_lock_irq(&uap->port.lock);

/* Clear out any spuriously appearing RX interrupts */
pl011_write(UART011_RTIS | UART011_RXIS, uap, REG_ICR);
uap->im = UART011_RTIM;
if (!pl011_dma_rx_running(uap))
uap->im |= UART011_RXIM;
pl011_write(uap->im, uap, REG_IMSC);
spin_unlock_irq(&uap->port.lock);
}

I tried keeping the RXIS set on the kernel side to test, and found the
issue is still there.
A little confused now :(

[1]: 
https://elixir.free-electrons.com/linux/latest/source/drivers/tty/serial/amba-pl011.c#L1732
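
For clarity, the kernel-side experiment was roughly this one-line change
in pl011_enable_interrupts() (a sketch of what I tried):

/* keep RXIS asserted: only ack the receive-timeout interrupt */
pl011_write(UART011_RTIS, uap, REG_ICR);  /* was UART011_RTIS | UART011_RXIS */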

Best Regards,
Wei

> 
> thanks
> -- PMM
> 
> .
> 




Re: [Qemu-devel] [Qemu-arm] [PATCH] pl011: do not put into fifo before enabled the interruption

2018-01-29 Thread Wei Xu
Hi Andrew,

On 2018/1/29 10:29, Andrew Jones wrote:
> On Fri, Jan 26, 2018 at 06:01:33PM +, Peter Maydell wrote:
>> On 26 January 2018 at 17:33, Wei Xu <xuw...@hisilicon.com> wrote:
>>> On 2018/1/26 17:15, Peter Maydell wrote:
>>>> The pl011 code should call qemu_set_irq(..., 1) when the
>>>> guest enables interrupts on the device by writing to the int_enabled
>>>> (UARTIMSC) register. That will be a 0-to-1 level change and the KVM
>>>> VGIC should report the interrupt to the guest.
>>>>
>>>
>>> Yes.
>>> And in the pl011_update, the irq level is set by s->int_level & 
>>> s->int_enabled.
>>> When writing to the int_enabled, not sure why the int_level is set to
>>> 0x20(PL011_INT_TX) but int_enabled is 0x50.
>>>
>>> It still call qemu_set_irq(..., 0).
>>>
>>> I added "s->int_level |= PL011_INT_RX" before calling pl011_update
>>> when writing to the int_enabled and tested it also works.
>>
>> No, that's not right either. int_level should already have the
>> RX bit set, because pl011_put_fifo() sets that bit when it gets a
>> character from QEMU and puts it into the FIFO.
>>
>> Does something else clear the int_level between the character
>> going into the FIFO from QEMU and the guest enabling
>> interrupts?
> 
> As part of the boot process Linux restarts the UART a few times. When
> Linux drives the PL011 with the SBSA driver then the FIFO doesn't get
> reset prior to being used again, as the SBSA doesn't specify a way to
> do that. I'm not sure if this issue is due to the SBSA attempting to
> be overly simple, or something the Linux driver can deal with. See
> this thread for a discussion I started once.
> 
> https://www.spinics.net/lists/linux-serial/msg23163.html

I am not sure whether it is the same problem or not.
I will check that.
Thanks!

> 
> Wei,
> 
> I assume you're using UEFI/ACPI when booting, as I don't recall this
> problem occurring with the Linux PL011 driver which would be used
> when booting with DT.
>

I am using an ARM64 board; the guest is booted *without* UEFI but the
host is booted with UEFI/ACPI.
The command I am using is as below:
"qemu-system-aarch64 -enable-kvm -m 1024 -cpu host -M virt \
-nographic --kernel Image --initrd roofs.cpio.gz"

Thanks!

Best Regards,
Wei

> Thanks,
> drew
> 
> .
> 




[Qemu-devel] [Qemu-arm] [PATCH] pl011: do not put into fifo before enabled the interruption

2018-01-26 Thread Wei Xu
If the user presses some keys in the console while the guest is booting,
the console will hang after entering the shell. This is because in that
case pl011_can_receive() returns 0, so pl011_receive() will not be
called. That means no interrupt will be injected into the kernel and the
pl011 state cannot be driven any further.

This patch fixes that issue by checking whether the interrupt is enabled
before putting characters into the fifo.

Signed-off-by: Wei Xu <xuw...@hisilicon.com>
---
 hw/char/pl011.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/char/pl011.c b/hw/char/pl011.c
index 2aa277fc4f..6296de9527 100644
--- a/hw/char/pl011.c
+++ b/hw/char/pl011.c
@@ -229,6 +229,8 @@ static int pl011_can_receive(void *opaque)
 PL011State *s = (PL011State *)opaque;
 int r;

+if (!s->int_enabled)
+   return 0;
 if (s->lcr & 0x10) {
 r = s->read_count < 16;
 } else {
-- 
2.11.0




Re: [Qemu-devel] [Qemu-arm] [PATCH] pl011: do not put into fifo before enabled the interruption

2018-01-26 Thread Wei Xu
Hi Peter,

On 2018/1/26 17:15, Peter Maydell wrote:
> On 26 January 2018 at 17:05, Wei Xu <xuw...@hisilicon.com> wrote:
>> On 2018/1/26 16:36, Peter Maydell wrote:
>>> If the user presses keys before interrupts are enabled,
>>> what ought to happen is:
>>>  * we put the key in the FIFO, and update the int_level flags
>>>  * when the FIFO is full, can_receive starts returning 0 and
>>>QEMU stops passing us new characters
>>>  * when the guest driver for the pl011 initializes the
>>>device and enables interrupts then either:
>>> (a) it does something that clears the FIFO, which will
>>> mean can_receive starts allowing new chars again, or
>>> (b) it leaves the FIFO as it is, and we should thus
>>> immediately raise an interrupt for the characters still
>>> in the FIFO; when the guest handles this interrupt and
>>> gets the characters, can_receive will permit new ones
>>>
>>
>> Yes, now it is handled like b.
>>
>>> What is happening in your situation that means this is not
>>> working as expected ?
>>
>> But in the kernel side, the pll011 is triggered as a level interruption.
>> During the booting, if any key is pressed ,the call stack is as below:
>> QEMU side:
>> pl011_update
>> -->qemu_set_irq(level as 0)
>> ---->kvm_arm_gic_set_irq
>>
>> Kernel side:
>> kvm_vm_ioctl_irq_line
>> -->kvm_vgic_inject_irq
>> ---->vgic_validate_injection (if level did not change, return)
>> ---->vgic_queue_irq_unlock
>>
>> Without above changes, in the vgic_validate_injection, because the
>> interruption level is always 0, this irq will not be queued into vgic.
>> And the guest will not read the pl011 fifo.
> 
> The pl011 code should call qemu_set_irq(..., 1) when the
> guest enables interrupts on the device by writing to the int_enabled
> (UARTIMSC) register. That will be a 0-to-1 level change and the KVM
> VGIC should report the interrupt to the guest.
> 

Yes.
And in pl011_update, the irq level is computed as s->int_level & s->int_enabled.
When writing to int_enabled, I am not sure why int_level is set to
0x20 (PL011_INT_TX) while int_enabled is 0x50.
It still calls qemu_set_irq(..., 0).

I added "s->int_level |= PL011_INT_RX" before calling pl011_update
when writing to int_enabled, and tested that it also works.
What do you think about that?
Thanks!
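
For reference, in code the tested change is roughly this (a sketch
against pl011_write(); the read_count guard is my addition to avoid
asserting RX with an empty FIFO):

case 14: /* UARTIMSC */
    s->int_enabled = value;
    if (s->read_count > 0) {
        /* chars still queued: re-assert RX so the 0-to-1 level
           change reaches the KVM VGIC */
        s->int_level |= PL011_INT_RX;
    }
    pl011_update(s);
    break;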

Best Regards,
Wei

> thanks
> -- PMM
> 
> .
> 




Re: [Qemu-devel] [Qemu-arm] [PATCH] pl011: do not put into fifo before enabled the interruption

2018-01-26 Thread Wei Xu
Hi Peter,

On 2018/1/26 16:36, Peter Maydell wrote:
> On 26 January 2018 at 16:00, Wei Xu <xuw...@hisilicon.com> wrote:
>> If the user pressed some keys in the console during the guest booting,
>> the console will be hanged after entering the shell.
>> Because in the above case the pl011_can_receive will return 0 that
>> the pl011_receive will not be called. That means no interruption will
>> be injected in to the kernel and the pl011 state could not be driven
>> further.
>>
>> This patch fixed that issue by checking the interruption is enabled or
>> not before putting into the fifo.
>>
>> Signed-off-by: Wei Xu <xuw...@hisilicon.com>
>> ---
>>  hw/char/pl011.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/hw/char/pl011.c b/hw/char/pl011.c
>> index 2aa277fc4f..6296de9527 100644
>> --- a/hw/char/pl011.c
>> +++ b/hw/char/pl011.c
>> @@ -229,6 +229,8 @@ static int pl011_can_receive(void *opaque)
>>  PL011State *s = (PL011State *)opaque;
>>  int r;
>>
>> +if (!s->int_enabled)
>> +   return 0;
>>  if (s->lcr & 0x10) {
>>  r = s->read_count < 16;
>>  } else {
>> --
> 
> This doesn't look right. You should be able to use the PL011
> in a strictly polling mode, without ever enabling interrupts.
> Returning false in can_receive if interrupts are disabled
> would break that.
> 
> If the user presses keys before interrupts are enabled,
> what ought to happen is:
>  * we put the key in the FIFO, and update the int_level flags
>  * when the FIFO is full, can_receive starts returning 0 and
>QEMU stops passing us new characters
>  * when the guest driver for the pl011 initializes the
>device and enables interrupts then either:
> (a) it does something that clears the FIFO, which will
> mean can_receive starts allowing new chars again, or
> (b) it leaves the FIFO as it is, and we should thus
> immediately raise an interrupt for the characters still
> in the FIFO; when the guest handles this interrupt and
> gets the characters, can_receive will permit new ones
>

Yes, right now it is handled like (b).

> What is happening in your situation that means this is not
> working as expected ?

But on the kernel side, the pl011 is triggered as a level interrupt.
During boot, if any key is pressed, the call stack is as below:
QEMU side:
pl011_update
-->qemu_set_irq(level as 0)
---->kvm_arm_gic_set_irq

Kernel side:
kvm_vm_ioctl_irq_line
-->kvm_vgic_inject_irq
---->vgic_validate_injection (if level did not change, return)
---->vgic_queue_irq_unlock

Without the above changes, in vgic_validate_injection, because the
interrupt level is always 0, this irq will not be queued into the vgic,
and the guest will not read the pl011 FIFO.

Best Regards,
Wei

> 
> thanks
> -- PMM
> 
> .
> 




Re: [Qemu-devel] dropped pkts with Qemu on tap interace (RX)

2018-01-03 Thread Wei Xu
On Wed, Jan 03, 2018 at 04:07:44PM +0100, Stefan Priebe - Profihost AG wrote:
> 
> Am 03.01.2018 um 04:57 schrieb Wei Xu:
> > On Tue, Jan 02, 2018 at 10:17:25PM +0100, Stefan Priebe - Profihost AG 
> > wrote:
> >>
> >> Am 02.01.2018 um 18:04 schrieb Wei Xu:
> >>> On Tue, Jan 02, 2018 at 04:24:33PM +0100, Stefan Priebe - Profihost AG 
> >>> wrote:
> >>>> Hi,
> >>>> Am 02.01.2018 um 15:20 schrieb Wei Xu:
> >>>>> On Tue, Jan 02, 2018 at 12:17:29PM +0100, Stefan Priebe - Profihost AG 
> >>>>> wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> currently i'm trying to fix a problem where we have "random" missing
> >>>>>> packets.
> >>>>>>
> >>>>>> We're doing an ssh connect from machine a to machine b every 5 minutes
> >>>>>> via rsync and ssh.
> >>>>>>
> >>>>>> Sometimes it happens that we get this cron message:
> >>>>>> "Connection to 192.168.0.2 closed by remote host.
> >>>>>> rsync: connection unexpectedly closed (0 bytes received so far) 
> >>>>>> [sender]
> >>>>>> rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.2]
> >>>>>> ssh: connect to host 192.168.0.2 port 22: Connection refused"
> >>>>>
> >>>>> Hi Stefan,
> >>>>> What kind of virtio-net backend are you using? Can you paste your qemu
> >>>>> command line here?
> >>>>
> >>>> Sure netdev part:
> >>>> -netdev
> >>>> type=tap,id=net0,ifname=tap317i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on
> >>>> -device
> >>>> virtio-net-pci,mac=EA:37:42:5C:F3:33,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300
> >>>> -netdev
> >>>> type=tap,id=net1,ifname=tap317i1,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on,queues=4
> >>>> -device
> >>>> virtio-net-pci,mac=6A:8E:74:45:1A:0B,nedev=net1,bus=pci.0,addr=0x13,id=net1,vectors=10,mq=on,bootindex=301
> >>>
> >>> According to what you have mentioned, the traffic is not heavy for the 
> >>> guests,
> >>> the dropping shouldn't happen for regular case.
> >>
> >> The avg traffic is around 300kb/s.
> >>
> >>> What is your hardware platform?
> >>
> >> Dual Intel Xeon E5-2680 v4
> >>
> >>> and Which versions are you using for both
> >>> guest/host kernel
> >> Kernel v4.4.103
> >>
> >>> and qemu?
> >> 2.9.1
> >>
> >>> Are there other VMs on the same host?
> >> Yes.
> > 
> > What about the CPU load? 
> 
> Host:
> 80-90% Idle
> LoadAvg: 6-7
> 
> VM:
> 97%-99% Idle
> 

OK, then this shouldn't be a concern.

> >>>>> 'Connection refused' usually means that the client gets a TCP Reset 
> >>>>> rather
> >>>>> than losing packets, so this might not be a relevant issue.
> >>>>
> >>>> Mhm so you mean these might be two seperate ones?
> >>>
> >>> Yes.
> >>>
> >>>>
> >>>>> Also you can do a tcpdump on both guests and see what happened to SSH 
> >>>>> packets
> >>>>> (tcpdump -i tapXXX port 22).
> >>>>
> >>>> Sadly not as there's too much traffic on that part as rsync is syncing
> >>>> every 5 minutes through ssh.
> >>>
> >>> You can do a tcpdump for the entire traffic from the guest and host and 
> >>> compare
> >>> what kind of packets are dropped if the traffic is not overloaded.
> >>
> >> Are you sure? I don't get why the same amount and same kind of packets
> >> should be received by both tap which are connected to different bridges
> >> to different HW and physical interfaces.
> > 
> > Exactly, possibly this would be a host or guest kernel bug cos than qemu 
> > issue
> > you are using vhost kernel as the backend and the two stats are independent,
> > you might have to check out what is happening inside the traffic.
> 
> What do you mean by inside the traffic?

You might need to figure out what kind of packets are dropped on the host
tap interface: are they random packets or specific ones?

There are a few other tests which can help to see what happened besides
triaging the traffic, or you can try alternative tests according to your
test bed.

1). Upgrade the host & guest kernels to the latest kernel and see if the
issue still shows up; you can use the net-next tree.
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git

2). Run some traffic throughput tests (netperf, iperf, etc.) on both
guests (traffic from guest to host if the guests are isolated, per your
comments) and check the statistics.

Wei

> 
> Stefan
> 



Re: [Qemu-devel] dropped pkts with Qemu on tap interace (RX)

2018-01-02 Thread Wei Xu
On Tue, Jan 02, 2018 at 10:17:25PM +0100, Stefan Priebe - Profihost AG wrote:
> 
> Am 02.01.2018 um 18:04 schrieb Wei Xu:
> > On Tue, Jan 02, 2018 at 04:24:33PM +0100, Stefan Priebe - Profihost AG 
> > wrote:
> >> Hi,
> >> Am 02.01.2018 um 15:20 schrieb Wei Xu:
> >>> On Tue, Jan 02, 2018 at 12:17:29PM +0100, Stefan Priebe - Profihost AG 
> >>> wrote:
> >>>> Hello,
> >>>>
> >>>> currently i'm trying to fix a problem where we have "random" missing
> >>>> packets.
> >>>>
> >>>> We're doing an ssh connect from machine a to machine b every 5 minutes
> >>>> via rsync and ssh.
> >>>>
> >>>> Sometimes it happens that we get this cron message:
> >>>> "Connection to 192.168.0.2 closed by remote host.
> >>>> rsync: connection unexpectedly closed (0 bytes received so far) [sender]
> >>>> rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.2]
> >>>> ssh: connect to host 192.168.0.2 port 22: Connection refused"
> >>>
> >>> Hi Stefan,
> >>> What kind of virtio-net backend are you using? Can you paste your qemu
> >>> command line here?
> >>
> >> Sure netdev part:
> >> -netdev
> >> type=tap,id=net0,ifname=tap317i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on
> >> -device
> >> virtio-net-pci,mac=EA:37:42:5C:F3:33,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300
> >> -netdev
> >> type=tap,id=net1,ifname=tap317i1,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on,queues=4
> >> -device
> >> virtio-net-pci,mac=6A:8E:74:45:1A:0B,nedev=net1,bus=pci.0,addr=0x13,id=net1,vectors=10,mq=on,bootindex=301
> > 
> > According to what you have mentioned, the traffic is not heavy for the 
> > guests,
> > the dropping shouldn't happen for regular case.
> 
> The avg traffic is around 300kb/s.
> 
> > What is your hardware platform?
> 
> Dual Intel Xeon E5-2680 v4
> 
> > and Which versions are you using for both
> > guest/host kernel
> Kernel v4.4.103
> 
> > and qemu?
> 2.9.1
> 
> > Are there other VMs on the same host?
> Yes.

What about the CPU load? 

> 
> 
> >>> 'Connection refused' usually means that the client gets a TCP Reset rather
> >>> than losing packets, so this might not be a relevant issue.
> >>
> >> Mhm so you mean these might be two seperate ones?
> > 
> > Yes.
> > 
> >>
> >>> Also you can do a tcpdump on both guests and see what happened to SSH 
> >>> packets
> >>> (tcpdump -i tapXXX port 22).
> >>
> >> Sadly not as there's too much traffic on that part as rsync is syncing
> >> every 5 minutes through ssh.
> > 
> > You can do a tcpdump for the entire traffic from the guest and host and 
> > compare
> > what kind of packets are dropped if the traffic is not overloaded.
> 
> Are you sure? I don't get why the same amount and same kind of packets
> should be received by both tap which are connected to different bridges
> to different HW and physical interfaces.

Exactly; this is more likely a host or guest kernel bug than a qemu issue,
since you are using the vhost kernel backend and the two stats are
independent; you might have to check out what is happening inside the
traffic.

Wei



Re: [Qemu-devel] dropped pkts with Qemu on tap interace (RX)

2018-01-02 Thread Wei Xu
On Tue, Jan 02, 2018 at 04:24:33PM +0100, Stefan Priebe - Profihost AG wrote:
> Hi,
> Am 02.01.2018 um 15:20 schrieb Wei Xu:
> > On Tue, Jan 02, 2018 at 12:17:29PM +0100, Stefan Priebe - Profihost AG 
> > wrote:
> >> Hello,
> >>
> >> currently i'm trying to fix a problem where we have "random" missing
> >> packets.
> >>
> >> We're doing an ssh connect from machine a to machine b every 5 minutes
> >> via rsync and ssh.
> >>
> >> Sometimes it happens that we get this cron message:
> >> "Connection to 192.168.0.2 closed by remote host.
> >> rsync: connection unexpectedly closed (0 bytes received so far) [sender]
> >> rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.2]
> >> ssh: connect to host 192.168.0.2 port 22: Connection refused"
> > 
> > Hi Stefan,
> > What kind of virtio-net backend are you using? Can you paste your qemu
> > command line here?
> 
> Sure netdev part:
> -netdev
> type=tap,id=net0,ifname=tap317i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on
> -device
> virtio-net-pci,mac=EA:37:42:5C:F3:33,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300
> -netdev
> type=tap,id=net1,ifname=tap317i1,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on,queues=4
> -device
> virtio-net-pci,mac=6A:8E:74:45:1A:0B,nedev=net1,bus=pci.0,addr=0x13,id=net1,vectors=10,mq=on,bootindex=301

According to what you have mentioned, the traffic is not heavy for the
guests, so the dropping shouldn't happen in a regular case.

What is your hardware platform? Which versions are you using for both the
guest/host kernel and qemu? Are there other VMs on the same host?

> 
> 
> > 'Connection refused' usually means that the client gets a TCP Reset rather
> > than losing packets, so this might not be a relevant issue.
> 
> Mhm so you mean these might be two seperate ones?

Yes.

> 
> > Also you can do a tcpdump on both guests and see what happened to SSH 
> > packets
> > (tcpdump -i tapXXX port 22).
> 
> Sadly not as there's too much traffic on that part as rsync is syncing
> every 5 minutes through ssh.

You can do a tcpdump of the entire traffic on both the guest and the host
and compare them to see what kind of packets are dropped, if the traffic
is not overloaded.

Wei

> 
> >> The tap devices on the target vm shows dropped RX packages on BOTH tap
> >> interfaces - strangely with the same amount of pkts?
> >>
> >> # ifconfig tap317i0; ifconfig tap317i1
> >> tap317i0  Link encap:Ethernet  HWaddr 6e:cb:65:94:bb:bf
> >>   UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
> >>   RX packets:2238445 errors:0 dropped:13159 overruns:0 frame:0
> >>   TX packets:9655853 errors:0 dropped:0 overruns:0 carrier:0
> >>   collisions:0 txqueuelen:1000
> >>   RX bytes:177991267 (169.7 MiB)  TX bytes:910412749 (868.2 MiB)
> >>
> >> tap317i1  Link encap:Ethernet  HWaddr 96:f8:b5:d0:9a:07
> >>   UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
> >>   RX packets:1516085 errors:0 dropped:13159 overruns:0 frame:0
> >>   TX packets:1446964 errors:0 dropped:0 overruns:0 carrier:0
> >>   collisions:0 txqueuelen:1000
> >>   RX bytes:1597564313 (1.4 GiB)  TX bytes:3517734365 (3.2 GiB)
> >>
> >> Any ideas how to inspect this issue?
> > 
> > It seems both tap interfaces lose RX pkts, dropping pkts of RX means the
> > host(backend) cann't receive packets from the guest as fast as the guest 
> > sends.
> 
> Inside the guest i see no dropped packets at all. It's only on the host
> and strangely on both taps at the same value? And both are connected to
> absolutely different networks.
> 
> > Are you running some symmetrical test on both guests? 
> 
> No.
> 
> Stefan
> 
> 
> > Wei
> > 
> >>
> >> Greets,
> >> Stefan
> >>



Re: [Qemu-devel] dropped pkts with Qemu on tap interace (RX)

2018-01-02 Thread Wei Xu
On Tue, Jan 02, 2018 at 12:17:29PM +0100, Stefan Priebe - Profihost AG wrote:
> Hello,
> 
> currently i'm trying to fix a problem where we have "random" missing
> packets.
> 
> We're doing an ssh connect from machine a to machine b every 5 minutes
> via rsync and ssh.
> 
> Sometimes it happens that we get this cron message:
> "Connection to 192.168.0.2 closed by remote host.
> rsync: connection unexpectedly closed (0 bytes received so far) [sender]
> rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.2]
> ssh: connect to host 192.168.0.2 port 22: Connection refused"

Hi Stefan,
What kind of virtio-net backend are you using? Can you paste your qemu
command line here?

'Connection refused' usually means that the client gets a TCP Reset rather
than losing packets, so this might not be a relevant issue.

Also you can do a tcpdump on both guests and see what happened to SSH packets
(tcpdump -i tapXXX port 22).

> 
> The tap devices on the target vm shows dropped RX packages on BOTH tap
> interfaces - strangely with the same amount of pkts?
> 
> # ifconfig tap317i0; ifconfig tap317i1
> tap317i0  Link encap:Ethernet  HWaddr 6e:cb:65:94:bb:bf
>   UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
>   RX packets:2238445 errors:0 dropped:13159 overruns:0 frame:0
>   TX packets:9655853 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000
>   RX bytes:177991267 (169.7 MiB)  TX bytes:910412749 (868.2 MiB)
> 
> tap317i1  Link encap:Ethernet  HWaddr 96:f8:b5:d0:9a:07
>   UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
>   RX packets:1516085 errors:0 dropped:13159 overruns:0 frame:0
>   TX packets:1446964 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000
>   RX bytes:1597564313 (1.4 GiB)  TX bytes:3517734365 (3.2 GiB)
> 
> Any ideas how to inspect this issue?

It seems both tap interfaces lose RX pkts; dropped RX pkts mean the
host (backend) can't receive packets from the guest as fast as the guest
sends them.

Are you running some symmetrical test on both guests?

Wei

> 
> Greets,
> Stefan
> 



Re: [Qemu-devel] [PATCH] virtio-scsi: finalize IOMMU support

2017-07-04 Thread Wei Xu
On Tue, Jul 04, 2017 at 08:21:06PM +0800, Jason Wang wrote:
> After converting to use DMA api for virtio devices, we should use
> dma_as instead of address_space_memory. Otherwise it won't work if
> IOMMU is enabled.
> 
> Fixes: commit 8607f5c3072c ("virtio: convert to use DMA api")
> Cc: qemu-sta...@nongnu.org
> Signed-off-by: Jason Wang <jasow...@redhat.com>
> ---
>  hw/scsi/virtio-scsi.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> index f46f06d..d076fe7 100644
> --- a/hw/scsi/virtio-scsi.c
> +++ b/hw/scsi/virtio-scsi.c
> @@ -43,12 +43,13 @@ static inline SCSIDevice 
> *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
>  
>  void virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq, VirtIOSCSIReq *req)
>  {
> +VirtIODevice *vdev = VIRTIO_DEVICE(s);
>  const size_t zero_skip =
>  offsetof(VirtIOSCSIReq, resp_iov) + sizeof(req->resp_iov);
>  
>  req->vq = vq;
>  req->dev = s;
> -qemu_sglist_init(&req->qsgl, DEVICE(s), 8, &address_space_memory);
> +qemu_sglist_init(&req->qsgl, DEVICE(s), 8, vdev->dma_as);
>  qemu_iovec_init(&req->resp_iov, 1);
>  memset((uint8_t *)req + zero_skip, 0, sizeof(*req) - zero_skip);
>  }

Reviewed-by: Wei Xu <w...@redhat.com>

> -- 
> 2.7.4
> 
> 



Re: [Qemu-devel] [PATCH 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic

2016-11-30 Thread Wei Xu

On 2016-11-24 12:17, Jason Wang wrote:



On 2016-11-01 01:41, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

All the data packets in a tcp connection are cached
to a single buffer in every receive interval, and will
be sent out via a timer. The 'virtio_net_rsc_timeout'
parameter controls the interval; this value may impact the
performance and response time of a tcp connection.
5 (50us) is an empirical value that gains a performance
improvement; since the whql test sends packets every 100us,
'30' (300us) passes the test case and is also the default
value. Tune it via the command line parameter
'rsc_interval' of the 'virtio-net-pci' device; for example,
to launch a guest with the interval set to '50':

'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,rsc_interval=50'


The timer will only be triggered if the packets pool is not empty,
and it'll drain off all the cached packets.

'NetRscChain' is used to save the segments of IPv4/6 in a
VirtIONet device.

A new segment becomes a 'Candidate' once it passes the sanity check;
the main handler of TCP includes TCP window update, duplicated
ACK check and the real data coalescing.

A 'Candidate' segment means:
1. Segment is within current window and the sequence is the expected one.
2. 'ACK' of the segment is in the valid window.

Sanity check includes:
1. Incorrect version in IP header
2. IP options present, or an IP fragment
3. Not a TCP packet
4. Sanity size check to prevent buffer overflow attack.
5. An ECN packet

Even so, there might be more cases that should be considered, such as
the ip identification and other flags, but checking those breaks the test
because Windows sets the identification to the same value even when it's
not a fragment.

Normally it includes 2 typical ways to handle a TCP control flag,
'bypass' and 'finalize', 'bypass' means should be sent out directly,
while 'finalize' means the packets should also be bypassed, but this
should be done after search for the same connection packets in the
pool and drain all of them out, this is to avoid out of order fragment.

All 'SYN' packets will be bypassed since they always begin a new
connection; other flags such as 'URG/FIN/RST/CWR/ECE' will trigger a
finalization, because these normally happen when a connection is going
to be closed; an 'URG' packet also finalizes the current coalescing unit.

Statistics can be used to monitor the basic coalescing status; the
'out of order' and 'out of window' counters show how many packets were
retransmitted, and thus describe the performance intuitively.

Signed-off-by: Wei Xu <w...@redhat.com>
---
  hw/net/virtio-net.c | 602
++--
  include/hw/virtio/virtio-net.h  |   5 +-
  include/hw/virtio/virtio.h  |  76 
  include/net/eth.h   |   2 +
  include/standard-headers/linux/virtio_net.h |  14 +
  net/tap.c   |   3 +-
  6 files changed, 670 insertions(+), 32 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 06bfe4b..d1824d9 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -15,10 +15,12 @@
  #include "qemu/iov.h"
  #include "hw/virtio/virtio.h"
  #include "net/net.h"
+#include "net/eth.h"
  #include "net/checksum.h"
  #include "net/tap.h"
  #include "qemu/error-report.h"
  #include "qemu/timer.h"
+#include "qemu/sockets.h"
  #include "hw/virtio/virtio-net.h"
  #include "net/vhost_net.h"
  #include "hw/virtio/virtio-bus.h"
@@ -43,6 +45,24 @@
  #define endof(container, field) \
  (offsetof(container, field) + sizeof(((container *)0)->field))
+#define VIRTIO_NET_IP4_ADDR_SIZE   8 /* ipv4 saddr + daddr */


Only used once in the code, I don't see much value of this macro.


Just to keep it a bit readable.




+
+#define VIRTIO_NET_TCP_FLAG 0x3F
+#define VIRTIO_NET_TCP_HDR_LENGTH   0xF000
+
+/* IPv4 max payload, 16 bits in the header */
+#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
+#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535
+
+/* header length value in ip header without option */
+#define VIRTIO_NET_IP4_HEADER_LENGTH 5
+
+/* Purge coalesced packets timer interval, This value affects the performance
+   a lot, and should be tuned carefully, '30'(300us) is the recommended
+   value to pass the WHQL test, '5' can gain 2x netperf throughput with
+   tso/gso/gro 'off'. */
+#define VIRTIO_NET_RSC_INTERVAL  30


This should be a property for virito-net and the above comment can be
the description of the property.


This is the value for a property; actually I hadn't found a place to put
it.
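
Presumably it could live in the virtio-net property list, something like
(a sketch; the field name 'rsc_timeout' is my assumption):

DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout,
                   VIRTIO_NET_RSC_INTERVAL),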




+
  typedef struct VirtIOFeature {
  uint32_t flags;
  size_t end;
@@ -589,7 +609,12 @@ static uint64_t
virtio_net_guest_offloads_by_features(uint32_t features)
  (1ULL << VIRTIO_NET_F_GUEST_ECN)  |
  (1ULL << VIRTIO_NET_F_GUEST_UFO);

Re: [Qemu-devel] [ RFC Patch v7 0/2] Support Receive-Segment-Offload(RSC) for WHQL

2016-11-24 Thread Wei Xu

On 2016-11-24 12:28, Jason Wang wrote:



On 2016-11-01 01:41, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

This patch is to support WHQL test for Windows guest, while this
feature also benifits other guest works as a kernel 'gro' like
feature with userspace implementation.

Feature information:
http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324

v6->v7
- Change the drain timer from 'virtual' to 'host' since it invisible
   to guest.
- Move the buffer list empty check to virtio_net_rsc_do_coalesc().
- The header comparison is a bit odd for ipv4 in this patch; it
   could be simpler with an equality check, but it is also a helper for ipv6
   in the next patch, and ipv6 uses different-sized address fields, so I
   used an 'address + size' byte comparison for the address, and changed
   the tcp port comparison to an 'int' equality check.
- Add count for packets whose size less than a normal tcp packet in
   sanity check.
- Move constant value comparison to the right side of the equal symbol.
- Use host header length in stead of guest header length to verify a
   packet in virtio_net_rsc_receive(), in case of the different header
   length for guest and host.
- Check whether the packet size is enough to hold a legal packet before
   extract ip unit.
- Bypass ip/tcp ECN packets.
- Expand the feature bit definition from 32 to 64 bits.

Other notes:
- About tcp windows scale, we don't have connection tracking about all
   tcp connections, so we don't know what the exact window size is using,
   thus this feature may get negative influence to it, have to turn this
   feature off for such a user case currently.
- There are 2 new fields in the virtio net header, it's not in either
   kernel tree or maintainer's tree right now, I just put it directly
here.
- The statistics is kept in this version since it's helpful for
   troubleshooting.


Please do not adding more and more stuffs in the same patch. Instead,
you can add them by using new patches on top. This can greatly simplify
the reviewers' work. E.g in this version, it looks like the parts of
virtio extension brings lots of troubles. So I suggest to split the
patch into several parts:

- helpers (e.g macro for ECN bit)
- core coalescing logic which has been reviewed for several version,
please do not add more functions to this part. This part could be even
disabled in the code until virtio part is introduced.
- virtio extension (e.g virtio-net header extension and feature bits)
- stats


OK, I will split it in the next version.



Thanks




Re: [Qemu-devel] [PATCH for 2.8 03/11] intel_iommu: name vtd address space with devfn

2016-09-05 Thread Wei Xu

On 2016-08-30 11:06, Jason Wang wrote:

To avoid duplicated name and ease debugging.

Cc: Michael S. Tsirkin 
Cc: Paolo Bonzini 
Cc: Richard Henderson 
Cc: Eduardo Habkost 
Acked-by: Peter Xu 
Signed-off-by: Jason Wang 
---
  hw/i386/intel_iommu.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 28c31a2..db70310 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2303,6 +2303,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, 
PCIBus *bus, int devfn)
  uintptr_t key = (uintptr_t)bus;
  VTDBus *vtd_bus = g_hash_table_lookup(s->vtd_as_by_busptr, &key);
  VTDAddressSpace *vtd_dev_as;
+char name[128];

  if (!vtd_bus) {
  /* No corresponding free() */
@@ -2316,6 +2317,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, 
PCIBus *bus, int devfn)
  vtd_dev_as = vtd_bus->dev_as[devfn];

  if (!vtd_dev_as) {
+snprintf(name, sizeof(name), "intel_iommu_devfn_%d", devfn);
  vtd_bus->dev_as[devfn] = vtd_dev_as = 
g_malloc0(sizeof(VTDAddressSpace));

  vtd_dev_as->bus = bus;
@@ -2330,7 +2332,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, 
PCIBus *bus, int devfn)
  memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
  &vtd_dev_as->iommu_ir);
  address_space_init(&vtd_dev_as->as,
-   &vtd_dev_as->iommu, "intel_iommu");
+   &vtd_dev_as->iommu, name);


Is there no need to use the name for the iommu region, as before?


  }
  return vtd_dev_as;
  }





Re: [Qemu-devel] [PATCH for 2.8 02/11] virtio: convert to use DMA api

2016-09-04 Thread Wei Xu

On 2016-08-30 11:06, Jason Wang wrote:

@@ -1587,6 +1595,11 @@ static void virtio_pci_device_plugged(DeviceState *d, 
Error **errp)
  }

  if (legacy) {
+if (virtio_host_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM)) {
+error_setg(errp, "VIRTIO_F_IOMMU_PLATFORM was supported by"
+   "neither legacy nor transitional device.");
+return ;
+}


Not sure if i understand it correctly, the transitional device here 
maybe a bit hard to understand, just a tip for your convenience,
besides the denied prompt, can we add what kind of device is supported 
to the message? such as modern device only, like this.


"VIRTIO_F_IOMMU_PLATFORM is supported by modern device only, it
is not supported by either legacy or transitional device."


  /* legacy and transitional */
  pci_set_word(config + PCI_SUBSYSTEM_VENDOR_ID,
   pci_get_word(config + PCI_VENDOR_ID));




Re: [Qemu-devel] [PATCH for 2.8 01/11] linux-headers: update to 4.8-rc4

2016-09-04 Thread Wei Xu



On 2016-08-30 11:06, Jason Wang wrote:

Signed-off-by: Jason Wang 
---
  include/standard-headers/linux/input-event-codes.h | 32 +
  include/standard-headers/linux/input.h |  1 +
  include/standard-headers/linux/virtio_config.h | 10 +-
  include/standard-headers/linux/virtio_ids.h|  1 +
  include/standard-headers/linux/virtio_net.h|  3 ++
  linux-headers/asm-arm/kvm.h|  4 +--
  linux-headers/asm-arm64/kvm.h  |  2 ++
  linux-headers/asm-s390/kvm.h   | 41 ++
  linux-headers/asm-x86/unistd_x32.h |  4 +--
  linux-headers/linux/kvm.h  | 18 --
  linux-headers/linux/vhost.h| 33 +
  11 files changed, 142 insertions(+), 7 deletions(-)

diff --git a/include/standard-headers/linux/input-event-codes.h 
b/include/standard-headers/linux/input-event-codes.h
index 354f0de..5c10f7e 100644
--- a/include/standard-headers/linux/input-event-codes.h
+++ b/include/standard-headers/linux/input-event-codes.h
@@ -611,6 +611,37 @@
  #define KEY_KBDINPUTASSIST_ACCEPT 0x264
  #define KEY_KBDINPUTASSIST_CANCEL 0x265

+/* Diagonal movement keys */
+#define KEY_RIGHT_UP   0x266
+#define KEY_RIGHT_DOWN 0x267
+#define KEY_LEFT_UP 0x268
+#define KEY_LEFT_DOWN  0x269
+
+#define KEY_ROOT_MENU  0x26a /* Show Device's Root Menu */
+/* Show Top Menu of the Media (e.g. DVD) */
+#define KEY_MEDIA_TOP_MENU 0x26b
+#define KEY_NUMERIC_11 0x26c
+#define KEY_NUMERIC_12 0x26d
+/*
+ * Toggle Audio Description: refers to an audio service that helps blind and
+ * visually impaired consumers understand the action in a program. Note: in
+ * some countries this is referred to as "Video Description".
+ */
+#define KEY_AUDIO_DESC 0x26e
+#define KEY_3D_MODE 0x26f
+#define KEY_NEXT_FAVORITE  0x270
+#define KEY_STOP_RECORD 0x271
+#define KEY_PAUSE_RECORD   0x272
+#define KEY_VOD 0x273 /* Video on Demand */
+#define KEY_UNMUTE 0x274
+#define KEY_FASTREVERSE 0x275
+#define KEY_SLOWREVERSE 0x276
+/*
+ * Control a data application associated with the currently viewed channel,
+ * e.g. teletext or data broadcast application (MHEG, MHP, HbbTV, etc.)
+ */
+#define KEY_DATA   0x275
+
  #define BTN_TRIGGER_HAPPY 0x2c0
  #define BTN_TRIGGER_HAPPY10x2c0
  #define BTN_TRIGGER_HAPPY20x2c1
@@ -749,6 +780,7 @@
  #define SW_ROTATE_LOCK 0x0c  /* set = rotate locked/disabled */
  #define SW_LINEIN_INSERT  0x0d  /* set = inserted */
  #define SW_MUTE_DEVICE 0x0e  /* set = device disabled */
+#define SW_PEN_INSERTED 0x0f  /* set = pen inserted */
  #define SW_MAX_   0x0f
  #define SW_CNT (SW_MAX_+1)

diff --git a/include/standard-headers/linux/input.h 
b/include/standard-headers/linux/input.h
index a52b202..7361a16 100644
--- a/include/standard-headers/linux/input.h
+++ b/include/standard-headers/linux/input.h
@@ -244,6 +244,7 @@ struct input_mask {
  #define BUS_ATARI 0x1B
  #define BUS_SPI   0x1C
  #define BUS_RMI   0x1D
+#define BUS_CEC0x1E

  /*
   * MT_TOOL types
diff --git a/include/standard-headers/linux/virtio_config.h 
b/include/standard-headers/linux/virtio_config.h
index b30d0cb..b777069 100644
--- a/include/standard-headers/linux/virtio_config.h
+++ b/include/standard-headers/linux/virtio_config.h
@@ -49,7 +49,7 @@
   * transport being used (eg. virtio_ring), the rest are per-device feature
   * bits. */
  #define VIRTIO_TRANSPORT_F_START  28
-#define VIRTIO_TRANSPORT_F_END 33
+#define VIRTIO_TRANSPORT_F_END 34

  #ifndef VIRTIO_CONFIG_NO_LEGACY
  /* Do we get callbacks when the ring is completely used, even if we've
@@ -63,4 +63,12 @@
  /* v1.0 compliant. */
  #define VIRTIO_F_VERSION_1    32

+/*
+ * If clear - device has the IOMMU bypass quirk feature.
+ * If set - use platform tools to detect the IOMMU.
+ *
+ * Note the reverse polarity (compared to most other features),
+ * this is for compatibility with legacy systems.
+ */
+#define VIRTIO_F_IOMMU_PLATFORM    33
  #endif /* _LINUX_VIRTIO_CONFIG_H */
diff --git a/include/standard-headers/linux/virtio_ids.h 
b/include/standard-headers/linux/virtio_ids.h
index 77925f5..3228d58 100644
--- a/include/standard-headers/linux/virtio_ids.h
+++ b/include/standard-headers/linux/virtio_ids.h
@@ -41,5 +41,6 @@
  #define VIRTIO_ID_CAIF   12 /* Virtio caif */
  #define VIRTIO_ID_GPU  

Re: [Qemu-devel] [RFC Patch 1/3] chardev: add new socket fd parameter for unix socket

2016-06-28 Thread Wei Xu

On 06/28/2016 14:48, Michael S. Tsirkin wrote:

On Thu, Jun 23, 2016 at 12:46:46AM +0800, Wei Xu wrote:

On 06/22/2016 23:39, Eric Blake wrote:

On 06/22/2016 09:25 AM, Wei Xu wrote:

There have been comments on this patch, but I forgot to add it to the list,
so I'm forwarding it again.

When managing VMs via libvirt, qemu often runs with limited permissions,
so qemu can't create a file/socket; this patch adds a new parameter
'sockfd' to accept an fd opened by and passed in from libvirt.

Signed-off-by: Wei Xu <w...@redhat.com>
---
   qapi-schema.json | 3 ++-
   qemu-char.c  | 3 +++
   2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/qapi-schema.json b/qapi-schema.json
index 8483bdf..e9f0268 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -2921,7 +2921,8 @@
   ##
   { 'struct': 'UnixSocketAddress',
 'data': {
-'path': 'str' } }
+'path': 'str',
+'sockfd': 'int32' } }


Missing documentation.

This makes the new 'sockfd' parameter mandatory, but SocketAddress is an
input type.  This is not backwards compatible.  At best, you'd want to
make it optional, but I'm not even convinced you want to add it, since
we already can use the magic /dev/fdset/nnn in 'path' to pass an
arbitrary fd if the underlying code uses qemu_open().


Thanks for commenting on it again; I was going to forward it to the list and
ask some questions. :)

Actually I'm going to try the magic way you suggested; just a few
questions about that.

Command line change should be like this according to my understanding.

Current command line:
-chardev socket,id=char0,path=/var/run/openvswitch/vhost-user1,server

New command line:
qemu-kvm -add-fd fd=3,set=2,opaque="/var/run/openvswitch/vhost-user1"
-chardev socket,id=char0,path=/dev/fdset/3,server


Q1. The 'sockfd' is not used any more, but it looks like the 'path' parameter
is still mandatory AFAIK. A unix domain socket is different from a network
tcp/udp socket: it works like a pipe file for local communication only, and
the 'path' parameter is a must-have before a real connection can be
established, since a bind() to a specific path is needed before either
connect() or listen(). This is because libvirt only takes the responsibility
of creating the socket and passing the 'fd' in; it does nothing about the
bind(), thus I think qemu will have to bind() it by itself, and I'm thinking
maybe 'opaque' can be used to carry the path for this case.


I think libvirt will have to bind it.
Passing in an unbound socket wouldn't make sense.
Yes, but libvirt only cares about the creation of all sockets; I'm not sure
whether this would break that rule and make libvirt take on more than its
responsibility. Actually the 'opaque' parameter in the command line is enough
for qemu to get the path info and bind it itself.




Q2. Do you know how I can test it? I'm going to fork a process first and
create a socket like libvirt does, then exec qemu and pass the fd in; I'm just
wondering how I can map it into the '/dev/fdset/' directory after creating
the socket?


/dev/fdset/ is QEMU syntax to pass fd numbers where path is normally
used.

I see, thanks.
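
For reference, a minimal test-harness sketch of what I have in mind (hedged:
the fd number, set id and paths are illustrative; also note that the
/dev/fdset/ component names the *set* id given to -add-fd, so set=2 should
pair with path=/dev/fdset/2 if I read the qemu_open() convention right):

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_un un;
    int sock = socket(PF_UNIX, SOCK_STREAM, 0);   /* error checks omitted */

    memset(&un, 0, sizeof(un));
    un.sun_family = AF_UNIX;
    strncpy(un.sun_path, "/var/run/openvswitch/vhost-user1",
            sizeof(un.sun_path) - 1);
    unlink(un.sun_path);
    bind(sock, (struct sockaddr *)&un, sizeof(un));
    listen(sock, 1);

    dup2(sock, 3);                  /* give qemu a predictable fd number */
    execlp("qemu-kvm", "qemu-kvm",
           "-add-fd", "fd=3,set=2,opaque=vhost-user1",
           "-chardev", "socket,id=char0,path=/dev/fdset/2,server",
           (char *)NULL);
    return 1;                       /* only reached if exec failed */
}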



Q3.
---
Daniel's comment before:
'Path' refers to a UNIX domain socket path, so the code won't be using
qemu_open() - it'll be using the sockets APIs to open/create a UNIX
socket. I don't think qemu_open() would even work with a socket FD,
since it'll be trying to set various modes on the FD via fcntl() that
don't work with sockets AFAIK
---
It seems what I should do is teach qemu_socket() the trick qemu_open() uses:
check whether the path starts with '/dev/fdset' as qemu_open() does, and just
pick the 'fd' up. Is this enough? Should I check the modes?
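
A minimal sketch of that idea (hedged: strstart(), qemu_parse_fd() and
monitor_fdset_get_fd() are the helpers the qemu_open() path already uses;
wiring them into the socket code like this is just my assumption):

    const char *fdset_id_str;

    if (strstart(saddr->path, "/dev/fdset/", &fdset_id_str)) {
        /* reuse the fd recorded by -add-fd instead of creating a socket */
        sock = monitor_fdset_get_fd(qemu_parse_fd(fdset_id_str), O_RDWR);
    } else {
        sock = qemu_socket(PF_UNIX, SOCK_STREAM, 0);
    }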

Thanks,
Wei






Re: [Qemu-devel] [RFC Patch 1/3] chardev: add new socket fd parameter for unix socket

2016-06-27 Thread Wei Xu
Can anyone help with this, or at least with question 2 about how to test it
using the 'add-fd' command line?


Q2. Does anybody know how I can test it? I'm going to fork a process first
and create a socket like libvirt does, then exec qemu and pass the fd in;
I'm just wondering how I can map the socket into the '/dev/fdset/' directory
after creating it. Another question is whether 'set=2' is the right usage for
a socket fd, any comment?


Command line:
qemu-kvm -add-fd fd=3,set=2,opaque="/var/run/openvswitch/vhost-user1"
 -chardev socket,id=char0,path=/dev/fdset/3,server


On 06/23/2016 00:46, Wei Xu wrote:

On 06/22/2016 23:39, Eric Blake wrote:

On 06/22/2016 09:25 AM, Wei Xu wrote:

There have been comments on this patch, but I forgot to add it to the list,
so I'm forwarding it again.

When managing VMs via libvirt, qemu often runs with limited permissions,
so qemu can't create a file/socket; this patch adds a new parameter
'sockfd' to accept an fd opened by and passed in from libvirt.

Signed-off-by: Wei Xu <w...@redhat.com>
---
  qapi-schema.json | 3 ++-
  qemu-char.c  | 3 +++
  2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/qapi-schema.json b/qapi-schema.json
index 8483bdf..e9f0268 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -2921,7 +2921,8 @@
  ##
  { 'struct': 'UnixSocketAddress',
'data': {
-'path': 'str' } }
+'path': 'str',
+'sockfd': 'int32' } }


Missing documentation.

This makes the new 'sockfd' parameter mandatory, but SocketAddress is an
input type.  This is not backwards compatible.  At best, you'd want to
make it optional, but I'm not even convinced you want to add it, since
we already can use the magic /dev/fdset/nnn in 'path' to pass an
arbitrary fd if the underlying code uses qemu_open().


Thanks for commenting on it again; I was going to forward it to the list
and ask some questions. :)

Actually I'm going to try the magic way you suggested; just a few
questions about that.

Command line change should be like this according to my understanding.

Current command line:
-chardev socket,id=char0,path=/var/run/openvswitch/vhost-user1,server

New command line:
qemu-kvm -add-fd fd=3,set=2,opaque="/var/run/openvswitch/vhost-user1"
-chardev socket,id=char0,path=/dev/fdset/3,server


Q1. The 'sockfd' is not used any more, but it looks like the 'path' parameter
is still mandatory AFAIK. A unix domain socket is different from a network
tcp/udp socket: it works like a pipe file for local communication only, and
the 'path' parameter is a must-have before a real connection can be
established, since a bind() to a specific path is needed before either
connect() or listen(). This is because libvirt only takes the responsibility
of creating the socket and passing the 'fd' in; it does nothing about the
bind(), thus I think qemu will have to bind() it by itself, and I'm thinking
maybe 'opaque' can be used to carry the path for this case.

Q2. Do you know how I can test it? I'm going to fork a process first and
create a socket like libvirt does, then exec qemu and pass the fd in; I'm
just wondering how I can map it into the '/dev/fdset/' directory after
creating the socket?

Q3.
---
Daniel's comment before:
'Path' refers to a UNIX domain socket path, so the code won't be using
qemu_open() - it'll be using the sockets APIs to open/create a UNIX
socket. I don't think qemu_open() would even work with a socket FD,
since it'll be trying to set various modes on the FD via fcntl() that
don't work with sockets AFAIK
---
It seems what I should do is teach qemu_socket() the trick qemu_open() uses:
check whether the path starts with '/dev/fdset' as qemu_open() does, and
just pick the 'fd' up. Is this enough? Should I check the modes?

Thanks,
Wei




Re: [Qemu-devel] [RFC Patch 1/3] chardev: add new socket fd parameter for unix socket

2016-06-22 Thread Wei Xu

On 06/22/2016 23:39, Eric Blake wrote:

On 06/22/2016 09:25 AM, Wei Xu wrote:

There have been comments on this patch, but I forgot to add it to the list,
so I'm forwarding it again.

When managing VMs via libvirt, qemu often runs with limited permissions,
so qemu can't create a file/socket; this patch adds a new parameter
'sockfd' to accept an fd opened by and passed in from libvirt.

Signed-off-by: Wei Xu <w...@redhat.com>
---
  qapi-schema.json | 3 ++-
  qemu-char.c  | 3 +++
  2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/qapi-schema.json b/qapi-schema.json
index 8483bdf..e9f0268 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -2921,7 +2921,8 @@
  ##
  { 'struct': 'UnixSocketAddress',
'data': {
-'path': 'str' } }
+'path': 'str',
+'sockfd': 'int32' } }


Missing documentation.

This makes the new 'sockfd' parameter mandatory, but SocketAddress is an
input type.  This is not backwards compatible.  At best, you'd want to
make it optional, but I'm not even convinced you want to add it, since
we already can use the magic /dev/fdset/nnn in 'path' to pass an
arbitrary fd if the underlying code uses qemu_open().

Thanks for commenting on it again; I was going to forward it to the list
and ask some questions. :)


Actually I'm going to try the magic way you suggested; just a few
questions about that.


Command line change should be like this according to my understanding.

Current command line:
-chardev socket,id=char0,path=/var/run/openvswitch/vhost-user1,server

New command line:
qemu-kvm -add-fd fd=3,set=2,opaque="/var/run/openvswitch/vhost-user1"
-chardev socket,id=char0,path=/dev/fdset/3,server


Q1. The 'sockfd' is not used any more, but it looks like the 'path' parameter
is still mandatory AFAIK. A unix domain socket is different from a network
tcp/udp socket: it works like a pipe file for local communication only, and
the 'path' parameter is a must-have before a real connection can be
established, since a bind() to a specific path is needed before either
connect() or listen(). This is because libvirt only takes the responsibility
of creating the socket and passing the 'fd' in; it does nothing about the
bind(), thus I think qemu will have to bind() it by itself, and I'm thinking
maybe 'opaque' can be used to carry the path for this case.


Q2. Do you know how I can test it? I'm going to fork a process first and
create a socket like libvirt does, then exec qemu and pass the fd in; I'm
just wondering how I can map it into the '/dev/fdset/' directory after
creating the socket?


Q3.
---
Daniel's comment before:
'Path' refers to a UNIX domain socket path, so the code won't be using
qemu_open() - it'll be using the sockets APIs to open/create a UNIX
socket. I don't think qemu_open() would even work with a socket FD,
since it'll be trying to set various modes on the FD via fcntl() that
don't work with sockets AFAIK
---
It seems what I should do is teach qemu_socket() the trick qemu_open() uses:
check whether the path starts with '/dev/fdset' as qemu_open() does, and
just pick the 'fd' up. Is this enough? Should I check the modes?


Thanks,
Wei



[Qemu-devel] [RFC Patch 3/3] sockets: replace creating a new socket with the record one

2016-06-22 Thread Wei Xu
There have been comments on this patch, but I forgot to add it to the list,
so I'm forwarding it again.


Both server mode and client mode are supported.

Signed-off-by: Wei Xu <w...@redhat.com>
---
 util/qemu-sockets.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index 0d6cd1f..e6429d7 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -713,10 +713,14 @@ static int unix_listen_saddr(UnixSocketAddress *saddr,
 struct sockaddr_un un;
 int sock, fd;

-sock = qemu_socket(PF_UNIX, SOCK_STREAM, 0);
-if (sock < 0) {
-error_setg_errno(errp, errno, "Failed to create Unix socket");
-return -1;
+if (saddr->sockfd) {
+sock = saddr->sockfd;
+} else {
+sock = qemu_socket(PF_UNIX, SOCK_STREAM, 0);
+if (sock < 0) {
+error_setg_errno(errp, errno, "Failed to create Unix socket");
+return -1;
+}
 }

 memset(&un, 0, sizeof(un));
@@ -786,11 +790,16 @@ static int unix_connect_saddr(UnixSocketAddress 
*saddr, Error **errp,

 return -1;
 }

-sock = qemu_socket(PF_UNIX, SOCK_STREAM, 0);
-if (sock < 0) {
-error_setg_errno(errp, errno, "Failed to create socket");
-return -1;
+if (saddr->sockfd) {
+sock = saddr->sockfd;
+} else {
+sock = qemu_socket(PF_UNIX, SOCK_STREAM, 0);
+if (sock < 0) {
+error_setg_errno(errp, errno, "Failed to create socket");
+return -1;
+}
 }
+
 if (callback != NULL) {
 connect_state = g_malloc0(sizeof(*connect_state));
 connect_state->callback = callback;
--
2.7.1
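
One follow-up thought on this hunk (a hedged sketch, not part of the patch):
when the recorded fd was already bound by the caller, as libvirt would do,
the bind() that follows in unix_listen_saddr() should probably be skipped
too; 'fd_passed_in' is an illustrative flag:

    if (!fd_passed_in) {
        unlink(un.sun_path);
        if (bind(sock, (struct sockaddr *)&un, sizeof(un)) < 0) {
            error_setg_errno(errp, errno, "Failed to bind socket to %s",
                             un.sun_path);
            goto err;
        }
    }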






[Qemu-devel] [RFC Patch 1/3] chardev: add new socket fd parameter for unix socket

2016-06-22 Thread Wei Xu
There have been comments on this patch, but I forgot to add it to the list,
so I'm forwarding it again.


When managing VMs via libvirt, qemu often runs with limited permissions,
so qemu can't create a file/socket; this patch adds a new parameter
'sockfd' to accept an fd opened by and passed in from libvirt.

Signed-off-by: Wei Xu <w...@redhat.com>
---
 qapi-schema.json | 3 ++-
 qemu-char.c  | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/qapi-schema.json b/qapi-schema.json
index 8483bdf..e9f0268 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -2921,7 +2921,8 @@
 ##
 { 'struct': 'UnixSocketAddress',
   'data': {
-'path': 'str' } }
+'path': 'str',
+'sockfd': 'int32' } }

 ##
 # @SocketAddress
diff --git a/qemu-char.c b/qemu-char.c
index b597ee1..ea9c02e 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -4116,6 +4116,9 @@ QemuOptsList qemu_chardev_opts = {
 .name = "path",
 .type = QEMU_OPT_STRING,
 },{
+.name = "sockfd",
+.type = QEMU_OPT_NUMBER,
+},{
 .name = "host",
 .type = QEMU_OPT_STRING,
 },{
--
2.7.1






[Qemu-devel] [RFC Patch 2/3] chardev: save the passed in 'fd' parameter during parsing

2016-06-22 Thread Wei Xu
There have been comments on this patch, but I forgot to add it to the list,
so I'm forwarding it again.


Save the 'fd' parameter as the unix socket 'sockfd' member.

Signed-off-by: Wei Xu <w...@redhat.com>
---
 qemu-char.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/qemu-char.c b/qemu-char.c
index ea9c02e..8d20494 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -3664,6 +3664,7 @@ static void qemu_chr_parse_socket(QemuOpts *opts, 
ChardevBackend *backend,

 bool is_telnet  = qemu_opt_get_bool(opts, "telnet", false);
 bool do_nodelay = !qemu_opt_get_bool(opts, "delay", true);
 int64_t reconnect   = qemu_opt_get_number(opts, "reconnect", 0);
+const int32_t fd = (int32_t)qemu_opt_get_number(opts, "sockfd", 0);
 const char *path = qemu_opt_get(opts, "path");
 const char *host = qemu_opt_get(opts, "host");
 const char *port = qemu_opt_get(opts, "port");
@@ -3708,6 +3709,12 @@ static void qemu_chr_parse_socket(QemuOpts *opts, 
ChardevBackend *backend,

 addr->type = SOCKET_ADDRESS_KIND_UNIX;
 q_unix = addr->u.q_unix.data = g_new0(UnixSocketAddress, 1);
 q_unix->path = g_strdup(path);
+
+if (fd) {
+q_unix->sockfd = fd;
+} else {
+q_unix->sockfd = 0;
+}
 } else {
 addr->type = SOCKET_ADDRESS_KIND_INET;
 addr->u.inet.data = g_new(InetSocketAddress, 1);
--
2.7.1






[Qemu-devel] [RFC Patch 0/3] Accept passed in socket 'fd' open from outside for unix socket

2016-06-22 Thread Wei Xu
There have been comments on this patch, but I forgot to add it to the list,
so I'm forwarding it again.


Recently I've been working on an fd passing issue: selinux forbids qemu
from creating a unix socket for a chardev when managing VMs with libvirt,
because qemu doesn't have sufficient permissions in this case, and the
proposal from the libvirt team is to open the 'fd' in libvirt and merely
pass it to qemu.

I finished an RFC patch for the unix socket after a glance at the code;
I'm not sure whether this is right or whether there may be other side
effects, please point them out.

I tested it for both server and client mode 'PF_UNIX' sockets with a VM
running vhost-user.

Old command line:
-chardev socket,id=char0,path=/var/run/openvswitch/vhost-user1,server

New command line:
-chardev 
socket,id=char0,path=/var/run/openvswitch/vhost-user1,server,sockfd=$n


Because a unix socket is bundled with a path, the path should be kept even
when the 'fd' is indicated; this looks odd, any comments?

Wei Xu (3):
  chardev: add new socket fd parameter for unix socket
  chardev: save the passed in 'fd' parameter during parsing
  sockets: replace creating a new socket with the record one

 qapi-schema.json|  3 ++-
 qemu-char.c | 10 ++
 util/qemu-sockets.c | 25 +
 3 files changed, 29 insertions(+), 9 deletions(-)

--
2.7.1






Re: [Qemu-devel] [RFC Patch 0/3] Accept passed in socket 'fd' open from outside for unix socket

2016-06-14 Thread Wei Xu

On 06/14/2016 22:23, Aaron Conole wrote:

"Daniel P. Berrange" <berra...@redhat.com> writes:


On Tue, Jun 14, 2016 at 04:03:43PM +0800, Wei Xu wrote:

On 06/09/2016 05:48, Aaron Conole wrote:

Flavio Leitner <f...@redhat.com> writes:


Adding Aaron who is fixing exactly that on the OVS side.

Aaron, please see the last question in the bottom of this email.

On Wed, Jun 08, 2016 at 06:07:29AM -0400, Amnon Ilan wrote:



- Original Message -

From: "Michal Privoznik" <mpriv...@redhat.com>
To: "Daniel P. Berrange" <berra...@redhat.com>
Cc: qemu-devel@nongnu.org, "amit shah" <amit.s...@redhat.com>,
jasow...@redhat.com, "Wei Xu" <w...@redhat.com>,
arm...@redhat.com
Sent: Thursday, June 2, 2016 2:38:53 PM
Subject: Re: [Qemu-devel] [RFC Patch 0/3] Accept passed in socket
'fd' open from outside for unix socket

On 02.06.2016 10:29, Daniel P. Berrange wrote:

On Thu, Jun 02, 2016 at 09:41:56AM +0200, Michal Privoznik wrote:

On 01.06.2016 18:16, Wei Xu wrote:

On 06/01/2016 00:44, Daniel P. Berrange wrote:

On Wed, Jun 01, 2016 at 12:30:44AM +0800, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

Recently I've been working on an fd passing issue: selinux forbids qemu
from creating a unix socket for a chardev when managing VMs with libvirt,
because qemu doesn't have sufficient permissions in this case, and the
proposal from the libvirt team is to open the 'fd' in libvirt and merely
pass it to qemu.


This sounds like a bug in libvirt, or selinux, or a mistaken
configuration
of the guest. It is entirely possible for QEMU to
create a unix socket
- not
least because that is exactly what QEMU uses for its QMP monitor
backend.
Looking at your example command line, I think the
issue is simply that
you
should be putting the sockets in a different location. ie at
/var/lib/libvirt/qemu/$guest-vhost-user1.sock where QEMU has
permission to
create sockets already.

Ah, adjusting the permissions or the file location can solve this problem.
I'm guessing maybe this is more of a security concern: the socket is used
as a network interface for a VM, similar to the qcow image file, and thus
should be protected from being arbitrarily accessed.

Michael, do you have any comment on this?


I haven't seen the patches. But in libvirt we allow users to create a
vhostuser interface and even specify where the socket should be placed:

  <interface type='vhostuser'>
    <mac address='52:54:00:ee:96:6c'/>
    <source type='unix' path='/tmp/vhost1.sock' mode='server'/>
    <model type='virtio'/>
  </interface>

The following cmd line is generated by libvirt then:

-chardev socket,id=charnet1,path=/tmp/vhost1.sock,server \
-netdev type=vhost-user,id=hostnet1,chardev=charnet1 \
-device
virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:ee:96:6c,bus=pci.0,\

Now, if we accept only /var/run/openvwitch path in
/interface/source/@path (or whatever path to OVS is), we don't need this
and have users manually label the dir (unless already labeled). But
since we accept just any path in there, we should make sure that qemu is
then able to create the socket. One possible fix would be to allow qemu
create sockets just anywhere in the system. This, however, brings huge
security risks and it's not acceptable IMO. The other option would be
that libvirt would create the socket, and pass its FD to qemu (since
libvirt already is allowed to create sockets anywhere).


There are plenty of other places where we allow arbitrary paths in the
XML, but which have restrictions imposed by the security drivers. Not
least the <channel> devices which have the exact same scenario as this
network device, and require use of /var/lib/libvirt/qemu
as the directory
for the sockets. We certainly do not want to allow QEMU to
create sockets
anywhere.

I don't think we want to grant QEMU svirt_t permission to create sockets
in the /var/run/openvswitch directory either, really. IMHO,
users of vhost
user should really be using /var/lib/libvirt/qemu, as is used for all
other UNIX sockets we create wrt other devices.


Okay. I can live with that; but in that case we should document it
somewhere, that we guarantee only paths under /var/lib/libvirt/ to be
accessible and for the rest we do our best but maybe require sys admin
intervention (e.g. to label the whole tree for a non-standard location).


Does OVS have some limit that its sockets can only be in
/var/run/openvswitch?


As of a recent commit, it can only be in /var/run/openvswitch or a
subdirectory therein (found in the openvswitch database).

Aaron, thanks for your reply.

Just a question about the usage of openvswitch: in this use case, when
launching a vhostuser/dpdk interface via libvirt, qemu works in server mode
for a socket under /var/run/openvswitch. But per my previous test, ovs/dpdk
always works in server mode, meaning ovs creates the socket and listens for
connections, so qemu works in client mode. Does ovs/dpdk support working in
client mode, i.e. making it qemu's duty to create the socket, with ovs
connecting to it on demand?


Oh, I was assuming that QEMU would be working in server mod

Re: [Qemu-devel] [RFC Patch 0/3] Accept passed in socket 'fd' open from outside for unix socket

2016-06-14 Thread Wei Xu

On 06/09/2016 05:48, Aaron Conole wrote:

Flavio Leitner <f...@redhat.com> writes:


Adding Aaron who is fixing exactly that on the OVS side.

Aaron, please see the last question in the bottom of this email.

On Wed, Jun 08, 2016 at 06:07:29AM -0400, Amnon Ilan wrote:



- Original Message -

From: "Michal Privoznik" <mpriv...@redhat.com>
To: "Daniel P. Berrange" <berra...@redhat.com>
Cc: qemu-devel@nongnu.org, "amit shah" <amit.s...@redhat.com>,
jasow...@redhat.com, "Wei Xu" <w...@redhat.com>,
arm...@redhat.com
Sent: Thursday, June 2, 2016 2:38:53 PM
Subject: Re: [Qemu-devel] [RFC Patch 0/3] Accept passed in socket
'fd' open from outside for unix socket

On 02.06.2016 10:29, Daniel P. Berrange wrote:

On Thu, Jun 02, 2016 at 09:41:56AM +0200, Michal Privoznik wrote:

On 01.06.2016 18:16, Wei Xu wrote:

On 06/01/2016 00:44, Daniel P. Berrange wrote:

On Wed, Jun 01, 2016 at 12:30:44AM +0800, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

Recently I've been working on an fd passing issue: selinux forbids qemu
from creating a unix socket for a chardev when managing VMs with libvirt,
because qemu doesn't have sufficient permissions in this case, and the
proposal from the libvirt team is to open the 'fd' in libvirt and merely
pass it to qemu.


This sounds like a bug in libvirt, or selinux, or a mistaken
configuration
of the guest. It is entirely possible for QEMU to create a unix socket
- not
least because that is exactly what QEMU uses for its QMP monitor
backend.
Looking at your example command line, I think the issue is simply that
you
should be putting the sockets in a different location. ie at
/var/lib/libvirt/qemu/$guest-vhost-user1.sock where QEMU has
permission to
create sockets already.

Ah, adjusting the permissions or the file location can solve this problem.
I'm guessing maybe this is more of a security concern: the socket is used
as a network interface for a VM, similar to the qcow image file, and thus
should be protected from being arbitrarily accessed.

Michael, do you have any comment on this?


I haven't seen the patches. But in libvirt we allow users to create a
vhostuser interface and even specify where the socket should be placed:

 <interface type='vhostuser'>
   <mac address='52:54:00:ee:96:6c'/>
   <source type='unix' path='/tmp/vhost1.sock' mode='server'/>
   <model type='virtio'/>
 </interface>

The following cmd line is generated by libvirt then:

-chardev socket,id=charnet1,path=/tmp/vhost1.sock,server \
-netdev type=vhost-user,id=hostnet1,chardev=charnet1 \
-device
virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:ee:96:6c,bus=pci.0,\

Now, if we accept only /var/run/openvwitch path in
/interface/source/@path (or whatever path to OVS is), we don't need this
and have users manually label the dir (unless already labeled). But
since we accept just any path in there, we should make sure that qemu is
then able to create the socket. One possible fix would be to allow qemu
create sockets just anywhere in the system. This, however, brings huge
security risks and it's not acceptable IMO. The other option would be
that libvirt would create the socket, and pass its FD to qemu (since
libvirt already is allowed to create sockets anywhere).


There are plenty of other places where we allow arbitrary paths in the
XML, but which have restrictions imposed by the security drivers. Not
least the <channel> devices which have the exact same scenario as this
network device, and require use of /var/lib/libvirt/qemu as the directory
for the sockets. We certainly do not want to allow QEMU to create sockets
anywhere.

I don't think we want to grant QEMU svirt_t permission to create sockets
in the /var/run/openvswitch directory either, really. IMHO, users of vhost
user should really be using /var/lib/libvirt/qemu, as is used for all
other UNIX sockets we create wrt other devices.


Okay. I can live with that; but in that case we should document it
somewhere, that we guarantee only paths under /var/lib/libvirt/ to be
accessible and for the rest we do our best but maybe require sys admin
intervention (e.g. to label the whole tree for a non-standard location).


Does OVS have some limit that its sockets can only be in /var/run/openvswitch?


As of a recent commit, it can only be in /var/run/openvswitch or a
subdirectory therein (found in the openvswitch database).

Aaron, thanks for your reply.

Just a question about the usage of openvswitch: in this use case, when
launching a vhostuser/dpdk interface via libvirt, qemu works in server mode
for a socket under /var/run/openvswitch. But per my previous test, ovs/dpdk
always works in server mode, meaning ovs creates the socket and listens for
connections, so qemu works in client mode. Does ovs/dpdk support working in
client mode, i.e. making it qemu's duty to create the socket, with ovs
connecting to it on demand?





Flavio, do you know?
If not, we are good as it is today with /var/lib/libvirt/qemu, right?


Probably not.  There are other issues as well.  From a DAC perspective
(so forgetting selinux at the moment), qemu and ovs run as d

Re: [Qemu-devel] [RFC Patch 0/3] Accept passed in socket 'fd' open from outside for unix socket

2016-06-02 Thread Wei Xu

On 06/02/2016 19:38, Michal Privoznik wrote:

On 02.06.2016 10:29, Daniel P. Berrange wrote:

On Thu, Jun 02, 2016 at 09:41:56AM +0200, Michal Privoznik wrote:

On 01.06.2016 18:16, Wei Xu wrote:

On 06/01/2016 00:44, Daniel P. Berrange wrote:

On Wed, Jun 01, 2016 at 12:30:44AM +0800, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

Recently I've been working on an fd passing issue: selinux forbids qemu
from creating a unix socket for a chardev when managing VMs with libvirt,
because qemu doesn't have sufficient permissions in this case, and the
proposal from the libvirt team is to open the 'fd' in libvirt and merely
pass it to qemu.


This sounds like a bug in libvirt, or selinux, or a mistaken
configuration
of the guest. It is entirely possible for QEMU to create a unix socket
- not
least because that is exactly what QEMU uses for its QMP monitor backend.
Looking at your example command line, I think the issue is simply that
you
should be putting the sockets in a different location. ie at
/var/lib/libvirt/qemu/$guest-vhost-user1.sock where QEMU has
permission to
create sockets already.

Ah, adjusting the permissions or the file location can solve this problem.
I'm guessing maybe this is more of a security concern: the socket is used
as a network interface for a VM, similar to the qcow image file, and thus
should be protected from being arbitrarily accessed.

Michael, do you have any comment on this?


I haven't seen the patches. But in libvirt we allow users to create a
vhostuser interface and even specify where the socket should be placed:

 <interface type='vhostuser'>
   <mac address='52:54:00:ee:96:6c'/>
   <source type='unix' path='/tmp/vhost1.sock' mode='server'/>
   <model type='virtio'/>
 </interface>

The following cmd line is generated by libvirt then:

-chardev socket,id=charnet1,path=/tmp/vhost1.sock,server \
-netdev type=vhost-user,id=hostnet1,chardev=charnet1 \
-device
virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:ee:96:6c,bus=pci.0,\

Now, if we accept only /var/run/openvwitch path in
/interface/source/@path (or whatever path to OVS is), we don't need this
and have users manually label the dir (unless already labeled). But
since we accept just any path in there, we should make sure that qemu is
then able to create the socket. One possible fix would be to allow qemu
create sockets just anywhere in the system. This, however, brings huge
security risks and it's not acceptable IMO. The other option would be
that libvirt would create the socket, and pass its FD to qemu (since
libvirt already is allowed to create sockets anywhere).


There are plenty of other places where we allow arbitrary paths in the
XML, but which have restrictions imposed by the security drivers. Not
least the <channel> devices which have the exact same scenario as this
network device, and require use of /var/lib/libvirt/qemu as the directory
for the sockets. We certainly do not want to allow QEMU to create sockets
anywhere.
AFAIK, vhost-user is an interface for third-party implementations, and
ovs/dpdk is one of the most popular choices. If it limits the socket
location to a fixed and unprivileged directory from qemu's point of view,
that is effectively the default and only option, and this may also be a
security consideration, so we'll have no choice but to ask the sysadmin to
adjust the permissions. Accepting a safely passed-in 'fd' from libvirt looks
more rigorous and convenient; I'm not sure whether this is a typical or a
corner scenario.


Daniel,
What do you think about this as a general-purpose feature? Does qemu need
such a feature?


I don't think we want to grant QEMU svirt_t permission to create sockets
in the /var/run/openvswitch directory either, really. IMHO, users of vhost
user should really be using /var/lib/libvirt/qemu, as is used for all
other UNIX sockets we create wrt other devices.


Okay. I can live with that; but in that case we should document it
somewhere, that we guarantee only paths under /var/lib/libvirt/ to be
accessible and for the rest we do our best but maybe require sys admin
intervention (e.g. to label the whole tree for a non-standard location).

Michal





Re: [Qemu-devel] [ RFC Patch v6 3/3] virtio-net rsc: add 2 new rsc information fields to 'virtio_net_hdr'

2016-06-02 Thread Wei Xu



On 05/30/2016 13:57, Jason Wang wrote:



On 05/29/2016 00:37, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

Field 'coalesced' indicates how many packets were coalesced and field
'dup_ack' how many duplicate acks were merged; the guest driver can use this
information to tell what the exact shape of the original traffic over the
network was.

Signed-off-by: Wei Xu <w...@redhat.com>
---
  hw/net/virtio-net.c | 8 
  include/standard-headers/linux/virtio_net.h | 2 ++
  2 files changed, 10 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index cc8cbe4..20f552a 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1768,6 +1768,10 @@ static size_t
virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
  if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {
  virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
  }
+h->coalesced = seg->packets;
+h->dup_ack = seg->dup_ack;
+h->gso_type = chain->gso_type;
+h->gso_size = chain->max_payload;
  ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
  QTAILQ_REMOVE(&chain->buffers, seg, next);
  g_free(seg->buf);
@@ -2302,9 +2306,13 @@ static ssize_t
virtio_net_receive(NetClientState *nc,
const uint8_t *buf, size_t size)
  {
  VirtIONet *n;
+struct virtio_net_hdr *h;
  n = qemu_get_nic_opaque(nc);
  if (n->host_features & (1ULL << VIRTIO_NET_F_GUEST_RSC)) {
+h = (struct virtio_net_hdr *)buf;
+h->coalesced = 0;
+h->dup_ack = 0;
  return virtio_net_rsc_receive(nc, buf, size);
  } else {
  return virtio_net_do_receive(nc, buf, size);
diff --git a/include/standard-headers/linux/virtio_net.h
b/include/standard-headers/linux/virtio_net.h
index 5b95762..c837417 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -114,6 +114,8 @@ struct virtio_net_hdr {
  __virtio16 gso_size;/* Bytes to append to hdr_len per
frame */
  __virtio16 csum_start;/* Position to start checksumming from */
  __virtio16 csum_offset;/* Offset after that to place
checksum */
+__virtio16 coalesced;   /* packets coalesced by host */


Can we just reuse gso_segs for this?
That's really a good idea; in particular, if we can multiplex the
capability field and the header fields, then I suppose we don't need to
change the spec. Per my latest test, this feature may work if we don't
coalesce any 'dup_ack' packets.



+__virtio16 dup_ack; /* duplicate ack count */
  };
  /* This is the version of the header to use when the MRG_RXBUF






Re: [Qemu-devel] [ RFC Patch v6 2/3] virtio-net rsc: support coalescing ipv6 tcp traffic

2016-06-02 Thread Wei Xu



On 05/30/2016 12:25, Jason Wang wrote:



On 05/29/2016 00:37, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

Most of this works like the ipv4 case; there are 2 differences between ipv4
and ipv6.

1. The length field in the ipv4 header includes the header itself, while it
is not included for ipv6, which means ipv6 can carry a real '65535' payload.

2. The ipv6 header does not need a header checksum calculation.
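
In code terms, a minimal sketch of the two payload computations (field names
follow the structs this patch uses; only the arithmetic is the point here):

    /* ipv4: the length field counts the IP header itself */
    payload = ntohs(ip->ip_len) - ip_hdrlen - tcp_hdrlen;
    /* ipv6: the payload length field excludes the IP header */
    payload = ntohs(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen) - tcp_hdrlen;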

Signed-off-by: Wei Xu <w...@redhat.com>
---
  hw/net/virtio-net.c | 152
+---
  1 file changed, 144 insertions(+), 8 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index b3bb63b..cc8cbe4 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -53,6 +53,10 @@
  /* header length value in ip header without option */
  #define VIRTIO_NET_IP4_HEADER_LENGTH 5
+#define ETH_IP6_HDR_SZ (ETH_HDR_SZ + IP6_HDR_SZ)
+#define VIRTIO_NET_IP6_ADDR_SIZE   32  /* ipv6 saddr + daddr */
+#define VIRTIO_NET_MAX_IP6_PAYLOAD VIRTIO_NET_MAX_TCP_PAYLOAD
+
  /* Purge coalesced packets timer interval, This value affects the
performance
 a lot, and should be tuned carefully, '30'(300us) is the
recommended
 value to pass the WHQL test, '5' can gain 2x netperf
throughput with
@@ -1724,6 +1728,25 @@ static void
virtio_net_rsc_extract_unit4(NetRscChain *chain,
  unit->payload = htons(*unit->ip_plen) - ip_hdrlen -
unit->tcp_hdrlen;
  }
+static void virtio_net_rsc_extract_unit6(NetRscChain *chain,
+ const uint8_t *buf,
NetRscUnit* unit)
+{
+uint16_t hdr_len;
+struct ip6_header *ip6;
+
+hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
+ip6 = (struct ip6_header *)(buf + hdr_len + sizeof(struct
eth_header));
+unit->ip = ip6;
+unit->ip_plen = &(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
+unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip)\
++ sizeof(struct ip6_header));
+unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000)
>> 10;
+
+/* There is a difference between the payload length in ipv4 and v6:
+   the ip header is excluded in ipv6 */
+unit->payload = htons(*unit->ip_plen) - unit->tcp_hdrlen;
+}
+
  static void virtio_net_rsc_ipv4_checksum(struct virtio_net_hdr *vhdr,
   struct ip_header *ip)
  {
@@ -1742,7 +1765,9 @@ static size_t
virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
  struct virtio_net_hdr *h;
  h = (struct virtio_net_hdr *)seg->buf;
-virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
+if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {
+virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
+}
  ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
  QTAILQ_REMOVE(&chain->buffers, seg, next);
  g_free(seg->buf);
@@ -1798,7 +1823,7 @@ static void virtio_net_rsc_cache_buf(NetRscChain
*chain, NetClientState *nc,
  hdr_len = chain->n->guest_hdr_len;
  seg = g_malloc(sizeof(NetRscSeg));
  seg->buf = g_malloc(hdr_len + sizeof(struct eth_header)\
-   + VIRTIO_NET_MAX_TCP_PAYLOAD);
+   + sizeof(struct ip6_header) +
VIRTIO_NET_MAX_TCP_PAYLOAD);
  memcpy(seg->buf, buf, size);
  seg->size = size;
  seg->packets = 1;
@@ -1809,7 +1834,18 @@ static void
virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
  QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
  chain->stat.cache++;
-virtio_net_rsc_extract_unit4(chain, seg->buf, >unit);
+switch (chain->proto) {
+case ETH_P_IP:
+virtio_net_rsc_extract_unit4(chain, seg->buf, >unit);
+break;
+
+case ETH_P_IPV6:
+virtio_net_rsc_extract_unit6(chain, seg->buf, >unit);
+break;
+
+default:
+g_assert_not_reached();
+}
  }
  static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain,
@@ -1929,6 +1965,24 @@ static int32_t
virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
  return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
  }
+static int32_t virtio_net_rsc_coalesce6(NetRscChain *chain, NetRscSeg
*seg,
+const uint8_t *buf, size_t size, NetRscUnit
*unit)
+{
+struct ip6_header *ip1, *ip2;
+
+ip1 = (struct ip6_header *)(unit->ip);
+ip2 = (struct ip6_header *)(seg->unit.ip);
+if (memcmp(&ip1->ip6_src, &ip2->ip6_src, sizeof(struct in6_address))
+|| memcmp(&ip1->ip6_dst, &ip2->ip6_dst, sizeof(struct in6_address))
+|| (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
+|| (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
+chain->stat.no_match++;
+return RSC_NO_MATCH;
+}
+
+return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
+}
+
  /* Packets with 'SYN' should bypass, other flags should be sent after
d

Re: [Qemu-devel] [ RFC Patch v6 1/3] virtio-net rsc: support coalescing ipv4 tcp traffic

2016-06-02 Thread Wei Xu

On 05/30/2016 12:20, Jason Wang wrote:



On 05/29/2016 00:37, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

All the data packets in a tcp connection are cached in a big buffer during
every receive interval, and are sent out via a timer. The
'virtio_net_rsc_timeout' parameter controls the interval; its value
influences the performance and responsiveness of the tcp connection
enormously. 5 (50us) is an empirical value that gains a performance
improvement; since the whql test sends packets every 100us, '30' (300us) can
pass the test case, and this is also the default value. It's tunable via the
command line parameter 'rsc_interval' of the 'virtio-net-pci' device; for
example, the parameter below launches a guest with the interval set to '50'.


Does the value make sense if it is smaller than 1us? If not, why not
just make the unit 1us?
It's an empirical value; 500us is a good candidate for the netperf
throughput test. A too-short interval, less than 50us, incurs an obvious
penalty AFAIK; this is because only a few packets can be coalesced, plus
the cycles wasted on rsc itself.


'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,rsc_interval=50'


The timer will only be triggered if the packet pool is not empty,
and it'll drain off all the cached packets.
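
A minimal sketch of the arming logic (hedged: timer_mod() and
qemu_clock_get_ns() are real QEMU APIs, the rsc field names are illustrative):

    if (QTAILQ_EMPTY(&chain->buffers)) {
        /* first segment cached into an empty pool: arm the drain timer */
        timer_mod(chain->drain_timer,
                  qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout);
    }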

'NetRscChain' is used to save the segments of different protocols in a
VirtIONet device.

The main TCP handler includes the TCP window update, the duplicated ACK
check, and the real data coalescing, if the new segment passes the sanity
check and is identified as a 'wanted' one.

A 'wanted' segment means:
1. The segment is within the current window and its sequence number is the expected one.
2. The 'ACK' of the segment is in the valid window.

Sanity check includes:
1. Incorrect version in IP header
2. IP options & IP fragment
3. Not a TCP packet
4. Sanity size check to prevent buffer overflow attack.

There may be more cases that should be considered, such as the ip
identification and other flags, but checking the identification broke the
test because windows sets it to the same value even when the packet is not
a fragment.

Normally there are 2 typical ways to handle a TCP control flag, 'bypass'
and 'finalize'. 'Bypass' means the packet should be sent out directly;
'finalize' means the packet should also be bypassed, but only after
searching the pool for packets of the same connection and sending all of
them out first, to avoid out-of-order data.

All 'SYN' packets are bypassed since they always begin a new connection;
other flags such as 'FIN/RST' trigger a finalization, because they normally
happen when a connection is about to be closed; an 'URG' packet also
finalizes the current coalescing unit.
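
A minimal sketch of this policy as a helper (names, flag values and return
codes are illustrative, not the patch's actual code):

    enum { TH_FIN = 0x01, TH_SYN = 0x02, TH_RST = 0x04, TH_URG = 0x20 };
    enum { RSC_BYPASS, RSC_FINAL, RSC_CANDIDATE };

    static int virtio_net_rsc_ctrl_policy(uint16_t tcp_flags)
    {
        if (tcp_flags & TH_SYN) {
            return RSC_BYPASS;      /* new connection: send out directly */
        }
        if (tcp_flags & (TH_FIN | TH_RST | TH_URG)) {
            return RSC_FINAL;       /* drain this connection's pool, then send */
        }
        return RSC_CANDIDATE;       /* eligible for coalescing */
    }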

Statistics can be used to monitor the basic coalescing status; the
'out of order' and 'out of window' counters show how many retransmitted
packets were seen, and thus describe the performance intuitively.


We usually don't add device-specific monitor commands. Maybe a new
control vq command for ethtool -S in the guest in the future. I was thinking
of removing those counters since they were never used in this series.
Yes, that's a good idea. Actually I'm doubting whether this should be a
guest feature or a host feature; the spec says it should be more like a
guest feature, but it's provided as a host built-in feature, so how and
where to examine it is a problem. I'm using gdb to debug it currently;
normally I would check these counters directly via a debug command, and
the statistics are quite useful for troubleshooting. They can become
optional once this feature gets more and more robust; can we keep them
for now and decide whether to keep them, and where to display them, along
with QA's testing?




Signed-off-by: Wei Xu <w...@redhat.com>
---
  hw/net/virtio-net.c | 498
+++-
  include/hw/virtio/virtio-net.h  |   2 +
  include/hw/virtio/virtio.h  |  75 +
  include/standard-headers/linux/virtio_net.h |   1 +


For an RFC, it's ok. But for a formal patch, this is not the correct way to
modify Linux headers. There's a script, scripts/update-linux-headers.sh,
which is used to sync them from the Linux source. This means it must be
merged in Linux, or at least in the maintainer's tree, first.


  4 files changed, 575 insertions(+), 1 deletion(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 5798f87..b3bb63b 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -15,10 +15,12 @@
  #include "qemu/iov.h"
  #include "hw/virtio/virtio.h"
  #include "net/net.h"
+#include "net/eth.h"
  #include "net/checksum.h"
  #include "net/tap.h"
  #include "qemu/error-report.h"
  #include "qemu/timer.h"
+#include "qemu/sockets.h"
  #include "hw/virtio/virtio-net.h"
  #include "net/vhost_net.h"
  #include "hw/virtio/virtio-bus.h"
@@ -38,6 +40,25 @@
  #define endof(container, field) \
  (offsetof(container, field) + sizeof(((container *

Re: [Qemu-devel] [RFC Patch 0/3] Accept passed in socket 'fd' open from outside for unix socket

2016-06-01 Thread Wei Xu

On 06/01/2016 00:44, Daniel P. Berrange wrote:

On Wed, Jun 01, 2016 at 12:30:44AM +0800, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

Recently I've been working on an fd passing issue: selinux forbids qemu
from creating a unix socket for a chardev when managing VMs with libvirt,
because qemu doesn't have sufficient permissions in this case, and the
proposal from the libvirt team is to open the 'fd' in libvirt and merely
pass it to qemu.


This sounds like a bug in libvirt, or selinux, or a mistaken configuration
of the guest. It is entirely possible for QEMU to create a unix socket - not
least because that is exactly what QEMU uses for its QMP monitor backend.
Looking at your example command line, I think the issue is simply that you
should be putting the sockets in a different location. ie at
/var/lib/libvirt/qemu/$guest-vhost-user1.sock where QEMU has permission to
create sockets already.
Ah, adjusting the permissions or the file location can solve this problem.
I'm guessing maybe this is more of a security concern: the socket is used
as a network interface for a VM, similar to the qcow image file, and thus
should be protected from being arbitrarily accessed.


Michael, do you have any comment on this?



Alternatively you could enhance the SELinux policy to grant svirt_t the
permission to create sockets under /var/run/openvswitch too.


I finished an RFC patch for the unix socket after a glance at the code;
I'm not sure whether this is right or whether there may be other side
effects, please point them out.

I tested it for both server and client mode 'PF_UNIX' sockets with a VM
running vhost-user.

Old command line:
-chardev socket,id=char0,path=/var/run/openvswitch/vhost-user1,server

New command line:
-chardev socket,id=char0,path=/var/run/openvswitch/vhost-user1,server,sockfd=$n

Because a unix socket is bundled with a path, the path should be kept even
when the 'fd' is indicated; this looks odd, any comments?


Yes, this syntax doesn't really make sense. The passed in sockfd may not
even be a UNIX socket - it could be a TCP socket. As such, the 'sockfd'
option should be mutually exclusive with the 'path' and 'host' options.
ie you can only supply one out of 'sockfd', 'path', or 'host'.
Currently I just added it for the unix socket, and there the connect/listen
syscall must have a path name; an inet socket doesn't need this param at
all, should it be supported also?
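
If we did keep 'sockfd', a minimal sketch of the mutually exclusive schema
Daniel describes would make both members optional (unlike the mandatory
field in this RFC):

{ 'struct': 'UnixSocketAddress',
  'data': {
    '*path': 'str',
    '*sockfd': 'int32' } }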





FWIW, I think the ability to pass a pre-opened socket FD with the
-chardev socket backend is a useful enhancement, even though I don't
think you need this in order to fix the problem you have.
OK, thanks for your quick feedback; I just wonder whether 'add-fd' and
qemu_open() can be a more general solution.


Regards,
Daniel





Re: [Qemu-devel] [RFC Patch 2/3] chardev: save the passed in 'fd' parameter during parsing

2016-06-01 Thread Wei Xu

On 06/01/2016 01:26, Eric Blake wrote:

On 05/31/2016 10:30 AM, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

Save the 'fd' parameter as the unix socket 'sockfd' member.

Signed-off-by: Wei Xu <w...@redhat.com>
---
  qemu-char.c | 7 +++
  1 file changed, 7 insertions(+)

diff --git a/qemu-char.c b/qemu-char.c
index ea9c02e..8d20494 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -3664,6 +3664,7 @@ static void qemu_chr_parse_socket(QemuOpts *opts, 
ChardevBackend *backend,
  bool is_telnet  = qemu_opt_get_bool(opts, "telnet", false);
  bool do_nodelay = !qemu_opt_get_bool(opts, "delay", true);
  int64_t reconnect   = qemu_opt_get_number(opts, "reconnect", 0);
+const int32_t fd = (int32_t)qemu_opt_get_number(opts, "sockfd", 0);
  const char *path = qemu_opt_get(opts, "path");
  const char *host = qemu_opt_get(opts, "host");
  const char *port = qemu_opt_get(opts, "port");
@@ -3708,6 +3709,12 @@ static void qemu_chr_parse_socket(QemuOpts *opts, 
ChardevBackend *backend,
  addr->type = SOCKET_ADDRESS_KIND_UNIX;
  q_unix = addr->u.q_unix.data = g_new0(UnixSocketAddress, 1);
  q_unix->path = g_strdup(path);
+
+if (fd) {
+q_unix->sockfd = fd;
+} else {
+q_unix->sockfd = 0;


0 is a valid fd number; this risks accidentally closing stdin later on.
  Please use -1 for unset, if you must store an fd number.  But given my
comments on patch 1, I'm not sure that you need this addition.
Thanks for your comment; I just wonder what the motivation of qemu_open()
is, it seems more like a regular-file consideration, is it?

Can it be easily extended to socket files?
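
A minimal sketch of the -1-as-unset convention Eric suggests (hedged:
qemu_opt_get_number() is the real helper, the wiring is illustrative):

    q_unix->sockfd = (int32_t)qemu_opt_get_number(opts, "sockfd", -1);

    /* ... and later, in unix_listen_saddr()/unix_connect_saddr(): */
    if (saddr->sockfd >= 0) {
        sock = saddr->sockfd;               /* caller-supplied fd */
    } else {
        sock = qemu_socket(PF_UNIX, SOCK_STREAM, 0);
    }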






Re: [Qemu-devel] [RFC Patch 0/3] Accept passed in socket 'fd' open from outside for unix socket

2016-06-01 Thread Wei Xu

On 06/01/2016 01:22, Eric Blake wrote:

On 05/31/2016 10:30 AM, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

Recently I've been working on an fd passing issue: selinux forbids qemu
from creating a unix socket for a chardev when managing VMs with libvirt,
because qemu doesn't have sufficient permissions in this case, and the
proposal from the libvirt team is to open the 'fd' in libvirt and merely
pass it to qemu.


Any reason this wasn't sent to the list?

Sorry, I just forgot the list; also adding Michal to the loop.




Re: [Qemu-devel] [ RFC Patch v6 0/2] Support Receive-Segment-Offload(RSC) for WHQL

2016-05-29 Thread Wei Xu


On 05/30/2016 12:22, Jason Wang wrote:



On 05/29/2016 00:37, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

Changes in V6:
- Sync upstream code
- Split new fields in 'virtio_net_hdr' to a seperate patch
- Remove feature bit code, replace it with a command line parameter
'guest_rsc'
which is turned off by default.

Changes in V5:
- Passed all IPv4/6 test cases
- Add new fields in 'virtio_net_hdr'
- Set 'gso_type' & 'coalesced packets' in new field.
- Bypass all 'tcp option' packet
- Bypass all 'pure ack' packet
- Bypass all 'duplicate ack' packet
- Change 'guest_rsc' feature bit to 'false' by default
- Feedbacks from v4, typo, etc.


A changelog is very important to ease and speed up review. More details
are more than welcome. But I see some changes were not documented here;
please give a more complete one in the next iteration.

OK.




Note:
There are still a few pending issues with the feature bit that need to be
discussed with the windows driver maintainer, so linux guests with this
patch won't work at the moment. I haven't figured it out yet, but I'm
guessing it's caused by 'gso_type' being set to 'VIRTIO_NET_HDR_GSO_TCPV4/6';
I will fix it after getting the final solution. The test steps and
performance data below are based on v4.


This is probably because you've increased the vnet header length.



Another suggestion from Jason was to adjust part of the code to make it
more readable; since there may still be a few changes to the flowchart in
the future, such as timestamp and duplicate ack handling, I'd like to delay
that temporarily.

Changes in V4:
- Add new host feature bit
- Replace the fixed header length with a dynamic header length in
VirtIONet
- Change ip/ip6 header union in NetRscUnit to void* pointer
- Add macro prefix, adjust code indent, etc.

Changes in V3:
- Removed big param list, replace it with 'NetRscUnit'
- Different virtio header size
- Modify callback function to direct call.
- Needn't check the failure of g_malloc()
- Other code format adjustment, macro naming, etc

Changes in V2:
- Add detailed commit log

This patch is to support the WHQL test for Windows guests, while this
feature also benefits other guests, working as a kernel-'gro'-like feature
with a userspace implementation.
Feature information:
   http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324

Both IPv4 and IPv6 are supported. Though performance with userspace virtio
is slower than vhost-net, there is about a 1.5x to 2x performance
improvement for userspace virtio; this is obtained by turning this feature
on and disabling 'tso/gso/gro' on the corresponding tap interface and guest
interface, while the improvement is smaller with all these features on.
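
For example, the host/guest tuning described above could look like this
(hedged: the interface names are placeholders):

ethtool -K tap0 tso off gso off gro off    # host tap device
ethtool -K eth0 tso off gso off gro off    # inside the guest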

Linux guest performance data(Netperf):
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.2.101 () port 0 AF_INET : nodelay
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384     64    6.00     1221.20
 87380  16384     64    6.00     1260.30

 87380  16384    128    6.00     1978.51
 87380  16384    128    6.00     2286.05

 87380  16384    256    6.00     2677.94
 87380  16384    256    6.00     4615.42

 87380  16384    512    6.00     2956.54
 87380  16384    512    6.00     5356.39

 87380  16384   1024    6.00     2798.17
 87380  16384   1024    6.00     4943.30

 87380  16384   2048    6.00     2681.09
 87380  16384   2048    6.00     4835.81

 87380  16384   4096    6.00     3390.14
 87380  16384   4096    6.00     5391.54

 87380  16384   8092    6.00     3008.27
 87380  16384   8092    6.00     5381.68

 87380  16384  10240    6.00     2999.89
 87380  16384  10240    6.00     5393.11

Test steps:
Although this feature is mainly used for Windows guests, I used a linux
guest to help test the feature; to keep things simple, I used 3 steps to
test the patch as I moved on.

1. With a tcp socket client/server pair running on 2 linux guests, I can
control the traffic and debug the code as I want.
2. Netperf on a linux guest to test the throughput.
3. WHQL test with 2 Windows guests.

Wei Xu (3):
   virtio-net rsc: support coalescing ipv4 tcp traffic
   virtio-net rsc: support coalescing ipv6 tcp traffic
   virtio-net rsc: add 2 new rsc information fields to 'virtio_net_hdr'

  hw/net/virtio-net.c | 642
+++-
  include/hw/virtio/virtio-net.h  |   2 +
  include/hw/virtio/virtio.h  |  75 
  include/standard-headers/linux/virtio_net.h |   3 +
  4 files changed, 721 insertions(+), 1 deletion(-)







Re: [Qemu-devel] [ RFC Patch v5 0/2] Support Receive-Segment-Offload(RSC) for WHQL

2016-05-24 Thread Wei Xu



On 05/24/2016 16:26, Michael S. Tsirkin wrote:

On Tue, May 24, 2016 at 04:03:04PM +0800, Jason Wang wrote:



On 05/24/2016 04:14, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

Changes in V5:
- Passed all IPv4/6 test cases
- Add new fields in 'virtio_net_hdr'
- Set 'gso_type' & 'coalesced packets' in new field.
- Bypass all 'tcp option' packet
- Bypass all 'pure ack' packet
- Bypass all 'duplicate ack' packet
- Change 'guest_rsc' feature bit to 'false' by default
- Feedbacks from v4, typo, etc.


Patch does not apply on master ...



Note:
There are still a few pending issues with the feature bit that need to be
discussed with the windows driver maintainer, so linux guests with this
patch won't work at the moment. I haven't figured it out yet, but I'm
guessing it's caused by 'gso_type' being set to 'VIRTIO_NET_HDR_GSO_TCPV4/6';
I will fix it after getting the final solution. The test steps and
performance data below are based on v4.


Can we split the patches into smaller ones to make review or merging easier?
E.g. can we send the patches without any feature negotiation and vnet header
extension?

We can focus on the coalescing (maybe ipv4) without any guest involvement in
this series. In this way, the issues are limited and can converge soon.
After this has been merged, we can add patches that co-operate with guests
on top (since that needs agreement on the virtio specs). Does this sound
like a good plan?


True but disabling everything when feature is not negotiated
reduces the risk somewhat.

Sure.




Another suggestion from Jason was to adjust part of the code to make it
more readable; since there may still be a few changes to the flowchart in
the future, such as timestamp and duplicate ack handling, I'd like to delay
that temporarily.

Changes in V4:
- Add new host feature bit
- Replace the fixed header length with a dynamic header length in VirtIONet
- Change ip/ip6 header union in NetRscUnit to void* pointer
- Add macro prefix, adjust code indent, etc.

Changes in V3:
- Removed big param list, replace it with 'NetRscUnit'
- Different virtio header size
- Modify callback function to direct call.
- Needn't check the failure of g_malloc()
- Other code format adjustment, macro naming, etc

Changes in V2:
- Add detailed commit log

This patch is to support the WHQL test for Windows guests, while this
feature also benefits other guests, working as a kernel-'gro'-like feature
with a userspace implementation.
Feature information:
   http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324

Both IPv4 and IPv6 are supported. Though performance with userspace virtio
is slower than vhost-net, there is about a 1.5x to 2x performance
improvement for userspace virtio; this is obtained by turning this feature
on and disabling 'tso/gso/gro' on the corresponding tap interface and guest
interface, while the improvement is smaller with all these features on.

Linux guest performance data(Netperf):
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.101 
() port 0 AF_INET : nodelay
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384     64    6.00     1221.20
 87380  16384     64    6.00     1260.30

 87380  16384    128    6.00     1978.51
 87380  16384    128    6.00     2286.05

 87380  16384    256    6.00     2677.94
 87380  16384    256    6.00     4615.42

 87380  16384    512    6.00     2956.54
 87380  16384    512    6.00     5356.39

 87380  16384   1024    6.00     2798.17
 87380  16384   1024    6.00     4943.30

 87380  16384   2048    6.00     2681.09
 87380  16384   2048    6.00     4835.81

 87380  16384   4096    6.00     3390.14
 87380  16384   4096    6.00     5391.54

 87380  16384   8092    6.00     3008.27
 87380  16384   8092    6.00     5381.68

 87380  16384  10240    6.00     2999.89
 87380  16384  10240    6.00     5393.11

Test steps:
Although this feature is mainly used for Windows guests, I used a linux
guest to help test the feature; to keep things simple, I used 3 steps to
test the patch as I moved on.

1. With a tcp socket client/server pair running on 2 linux guest, thus i can
control
the traffic and debugging the code as i want.
2. Netperf on linux guest test the throughput.
3. WHQL test with 2 Windows guests.
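A minimal sketch of the kind of traffic generator step 1 describes, assuming
a plain Linux TCP client (the address, port, message size and 100us cadence
are placeholders, not part of the patch):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    char msg[1024];
    struct sockaddr_in dst = { .sin_family = AF_INET,
                               .sin_port = htons(5001) };
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    inet_pton(AF_INET, "192.168.2.101", &dst.sin_addr);
    if (fd < 0 || connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        perror("connect");
        return 1;
    }
    memset(msg, 'x', sizeof(msg));
    for (;;) {
        if (write(fd, msg, sizeof(msg)) < 0) {   /* fixed-size segments */
            break;
        }
        usleep(100);   /* ~100us pacing, similar to the WHQL test cadence */
    }
    close(fd);
    return 0;
}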

Wei Xu (2):
   virtio-net rsc: support coalescing ipv4 tcp traffic
   virtio-net rsc: support coalescing ipv6 tcp traffic

  hw/net/virtio-net.c | 623 +++-
  include/hw/virtio/virtio-net.h  |   2 +
  include/hw/virtio/virtio.h  |  75 
  include/standard-headers/linux/virtio_net.h |   2 +
  4 files changed, 699 insertions(+), 3 deletions(-)





Re: [Qemu-devel] [ RFC Patch v5 0/2] Support Receive-Segment-Offload(RSC) for WHQL

2016-05-24 Thread Wei Xu



On 05/24/2016 16:03, Jason Wang wrote:



On 05/24/2016 04:14, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

Changes in V5:
- Passed all IPv4/6 test cases
- Add new fields in 'virtio_net_hdr'
- Set 'gso_type' & 'coalesced packets' in new field.
- Bypass all 'tcp option' packets
- Bypass all 'pure ack' packets
- Bypass all 'duplicate ack' packets
- Change the 'guest_rsc' feature bit to 'false' by default
- Feedback from v4, typos, etc.


Patch does not apply on master ...



Note:
There is still a few pending issues about the feature bit, and need to be
discussed with windows driver maintainer, so linux guests with this patch
won't work at current, haven't figure it out yet, but i'm guessing it's
caused by the 'gso_type' is set to 'VIRTIO_NET_HDR_GSO_TCPV4/6',
will fix it after get the final solution, the below test steps and
performance data is based on v4.


Can we split the patches into smaller ones to make review and merging
easier? E.g. can we send the patches without any feature negotiation or
vnet header extension?

OK.


We can focus on the coalescing (maybe ipv4 only) without any guest
involvement in this series. In this way, the issues are limited and can
converge soon. After this has been merged, we can add patches that
co-operate with guests on top (since that needs agreement on the virtio
spec). Does this sound like a good plan?
Exactly; both ipv4/6 actually passed the WHQL test, so maybe leave the
feature bit and header issue alone.




Another suggestion from Jason is to adjust part of the code to make it
more readable; since there may still be a few changes to the flowchart
in the future (such as timestamp and duplicate ACK handling), I'd like
to delay that temporarily.

Changes in V4:
- Add new host feature bit
- Replace the fixed header length with a dynamic header length in VirtIONet
- Change the ip/ip6 header union in NetRscUnit to a void* pointer
- Add macro prefix, adjust code indent, etc.

Changes in V3:
- Removed big param list, replaced it with 'NetRscUnit'
- Different virtio header sizes
- Modified callback functions to direct calls.
- No need to check the failure of g_malloc()
- Other code format adjustments, macro naming, etc.

Changes in V2:
- Add detailed commit log

This patch is to support the WHQL test for Windows guests, while the feature
also benefits other guests: it works like the kernel 'gro' feature, with a
userspace implementation.
Feature information:
   http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324

Both IPv4 and IPv6 are supported. Although performance with userspace virtio
is slower than vhost-net, there is about a 1.5x to 2x performance improvement
for userspace virtio; this is achieved by turning this feature on and disabling
'tso/gso/gro' on the corresponding tap interface and guest interface, while the
improvement is smaller with all these features on.

Linux guest performance data (Netperf):
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.101 () port 0 AF_INET : nodelay
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

87380  16384      64    6.00     1221.20
87380  16384      64    6.00     1260.30

87380  16384     128    6.00     1978.51
87380  16384     128    6.00     2286.05

87380  16384     256    6.00     2677.94
87380  16384     256    6.00     4615.42

87380  16384     512    6.00     2956.54
87380  16384     512    6.00     5356.39

87380  16384    1024    6.00     2798.17
87380  16384    1024    6.00     4943.30

87380  16384    2048    6.00     2681.09
87380  16384    2048    6.00     4835.81

87380  16384    4096    6.00     3390.14
87380  16384    4096    6.00     5391.54

87380  16384    8092    6.00     3008.27
87380  16384    8092    6.00     5381.68

87380  16384   10240    6.00     2999.89
87380  16384   10240    6.00     5393.11

Test steps:
Although this feature is mainly used for windows guests, I used linux
guests to help test the feature. To keep things simple, I used 3 steps to
test the patch as I moved on.

1. With a tcp socket client/server pair running on 2 linux guests, I can
control the traffic and debug the code as I want.
2. Netperf on linux guests to test the throughput.
3. WHQL test with 2 Windows guests.

Wei Xu (2):
   virtio-net rsc: support coalescing ipv4 tcp traffic
   virtio-net rsc: support coalescing ipv6 tcp traffic

  hw/net/virtio-net.c | 623
+++-
  include/hw/virtio/virtio-net.h  |   2 +
  include/hw/virtio/virtio.h  |  75 
  include/standard-headers/linux/virtio_net.h |   2 +
  4 files changed, 699 insertions(+), 3 deletions(-)








Re: [Qemu-devel] [ RFC Patch v4 1/3] virtio-net rsc: add a new host offload(rsc) feature bit

2016-04-10 Thread Wei Xu



On 04/05/2016 16:17, Michael S. Tsirkin wrote:

On Tue, Apr 05, 2016 at 10:05:17AM +0800, Jason Wang wrote:


On 04/04/2016 03:25 AM, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

A new feature bit 'VIRTIO_NET_F_GUEST_RSC' is introduced to support the WHQL
Receive-Segment-Offload test; this feature will coalesce tcp packets in
IPv4/6 for the userspace virtio-net driver.

This feature can be enabled either by 'ACK'ing the feature when loading
the driver in the guest, or by sending the 'VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET'
command to the host via the control queue.

Signed-off-by: Wei Xu <w...@redhat.com>
---
  hw/net/virtio-net.c | 29 +++--
  include/standard-headers/linux/virtio_net.h |  1 +
  2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 5798f87..bd91a4b 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -537,6 +537,7 @@ static uint64_t virtio_net_get_features(VirtIODevice *vdev, uint64_t features,
  virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_TSO4);
  virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_TSO6);
  virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_ECN);
+virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_RSC);

Several questions here:

- I think RSC should work even without vnet_hdr?
That's interesting, but I'm wondering how to test this? Could you please
point me in the right direction?

- Need we differentiate ipv4 and ipv6 like TSO here?

Sure, thanks.

- And looks like this patch should be squash to following patches.

OK.



  }
  
  if (!peer_has_vnet_hdr(n) || !peer_has_ufo(n)) {

@@ -582,7 +583,8 @@ static uint64_t 
virtio_net_guest_offloads_by_features(uint32_t features)
  (1ULL << VIRTIO_NET_F_GUEST_TSO4) |
  (1ULL << VIRTIO_NET_F_GUEST_TSO6) |
  (1ULL << VIRTIO_NET_F_GUEST_ECN)  |
-(1ULL << VIRTIO_NET_F_GUEST_UFO);
+(1ULL << VIRTIO_NET_F_GUEST_UFO)  |
+(1ULL << VIRTIO_NET_F_GUEST_RSC);

Looks like this is unnecessary since we won't set peer offloads based on
GUEST_RSC.
There is an exclusive check when handling the set-features command from the
control queue, so it looks like it will break the check if we don't include
this bit.


  
  return guest_offloads_mask & features;

  }
@@ -1089,7 +1091,8 @@ static int receive_filter(VirtIONet *n, const uint8_t 
*buf, int size)
  return 0;
  }
  
-static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)

+static ssize_t virtio_net_do_receive(NetClientState *nc,
+ const uint8_t *buf, size_t size)
  {
  VirtIONet *n = qemu_get_nic_opaque(nc);
  VirtIONetQueue *q = virtio_net_get_subqueue(nc);
@@ -1685,6 +1688,26 @@ static int virtio_net_load_device(VirtIODevice *vdev, 
QEMUFile *f,
  return 0;
  }
  
+

+static ssize_t virtio_net_rsc_receive(NetClientState *nc,
+  const uint8_t *buf, size_t size)
+{
+return virtio_net_do_receive(nc, buf, size);
+}
+
+static ssize_t virtio_net_receive(NetClientState *nc,
+  const uint8_t *buf, size_t size)
+{
+VirtIONet *n;
+
+n = qemu_get_nic_opaque(nc);
+if (n->curr_guest_offloads & VIRTIO_NET_F_GUEST_RSC) {
+return virtio_net_rsc_receive(nc, buf, size);
+} else {
+return virtio_net_do_receive(nc, buf, size);
+}
+}

The changes here look odd since they do nothing. Like I've mentioned,
better to merge this patch into the following ones.

OK.



+
  static NetClientInfo net_virtio_info = {
  .type = NET_CLIENT_OPTIONS_KIND_NIC,
  .size = sizeof(NICState),
@@ -1909,6 +1932,8 @@ static Property virtio_net_properties[] = {
 TX_TIMER_INTERVAL),
  DEFINE_PROP_INT32("x-txburst", VirtIONet, net_conf.txburst, TX_BURST),
  DEFINE_PROP_STRING("tx", VirtIONet, net_conf.tx),
+DEFINE_PROP_BIT("guest_rsc", VirtIONet, host_features,
+VIRTIO_NET_F_GUEST_RSC, true),

Need to compat the bit for old machine types to unbreak migration, I believe?

And definitely disable it by default.
There may be some windows-specific details about this; I'll discuss with
Yan and update.



Btw, also need a patch for virtio spec.

Sure.


Thanks


  DEFINE_PROP_END_OF_LIST(),
  };
  
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h

index a78f33e..5b95762 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -55,6 +55,7 @@
  #define VIRTIO_NET_F_MQ   22  /* Device supports Receive Flow
 * Steering */
  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
+#define VIRTIO_NET_F_GUEST_RSC  24  /* Guest can coalesce tcp packets */
  
  #ifndef VIRTIO_NET_NO_LEGACY

  #define VIRTIO_NET_F_GSO  6   /* Host handles pkts w/ any GSO type */
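For reference, "compat the bit for old machine types" usually means adding an
entry to the previous release's compat list so the new property defaults to
off there; a hedged sketch (the exact macro name and version are illustrative,
assuming the include/hw/compat.h convention of that era):

#define HW_COMPAT_2_5 \
    { \
        .driver   = "virtio-net-pci", \
        .property = "guest_rsc", \
        .value    = "off", \
    },

With such an entry, an older machine type would keep the feature off and
migration from older QEMUs would not break.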





Re: [Qemu-devel] [ RFC Patch v4 2/3] virtio-net rsc: support coalescing ipv4 tcp traffic

2016-04-08 Thread Wei Xu



On 04/08/2016 16:31, Jason Wang wrote:


On 04/08/2016 03:47 PM, Wei Xu wrote:


On 04/05/2016 10:47, Jason Wang wrote:

On 04/04/2016 03:25 AM, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

All the data packets in a tcp connection will be cached to a big buffer
in every receive interval, and will be sent out via a timer; the
'virtio_net_rsc_timeout' controls the interval. This value will influence
the performance and responsiveness of the tcp connection significantly:
5 (50us) is an empirical value that gains a performance improvement, and
since the whql test sends packets every 100us, '30' (300us) can pass the
test case. This is also the default value, and it is going to be tunable.

The timer will only be triggered if the packet pool is not empty,
and it'll drain off all the cached packets.

'NetRscChain' is used to save the segments of different protocols in a
VirtIONet device.

The main handler of TCP includes TCP window update, duplicated ACK check
and the real data coalescing, if the new segment passed the sanity check
and is identified as a 'wanted' one.

A 'wanted' segment means:
1. The segment is within the current window and its sequence is the
expected one.
2. The ACK of the segment is in the valid window.
3. If the ACK in the segment is a duplicated one, then there must be less
than 2 duplicates; this is to let the upper layer TCP start retransmission,
per the spec.

Sanity checks include:
1. Incorrect version in the IP header
2. IP options & IP fragments
3. Not a TCP packet
4. Size check, to prevent a buffer overflow attack.

There may be more cases that should be considered, such as the ip
identification and other flags, but checking them broke the test because
windows sets the identification to the same value even when the packet is
not a fragment.

Normally there are 2 typical ways to handle a TCP control flag, 'bypass'
and 'finalize'. 'Bypass' means the packet should be sent out directly;
'finalize' means the packet should also be bypassed, but only after
searching the pool for packets of the same connection and sending all of
them out first; this is to avoid out-of-order data.

All 'SYN' packets will be bypassed, since they always begin a new
connection; other flags such as 'FIN/RST' will trigger a finalization,
because this normally happens when a connection is about to be closed; an
'URG' packet also finalizes the current coalescing unit.

Statistics can be used to monitor the basic coalescing status; the 'out of
order' and 'out of window' counters show how many packets were
retransmitted, thus describing the performance intuitively.

Signed-off-by: Wei Xu <w...@redhat.com>
---
   hw/net/virtio-net.c| 480
-
   include/hw/virtio/virtio-net.h |   1 +
   include/hw/virtio/virtio.h |  72 +++
   3 files changed, 552 insertions(+), 1 deletion(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index bd91a4b..81e8e71 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -15,10 +15,12 @@
   #include "qemu/iov.h"
   #include "hw/virtio/virtio.h"
   #include "net/net.h"
+#include "net/eth.h"
   #include "net/checksum.h"
   #include "net/tap.h"
   #include "qemu/error-report.h"
   #include "qemu/timer.h"
+#include "qemu/sockets.h"
   #include "hw/virtio/virtio-net.h"
   #include "net/vhost_net.h"
   #include "hw/virtio/virtio-bus.h"
@@ -38,6 +40,24 @@
   #define endof(container, field) \
   (offsetof(container, field) + sizeof(((container *)0)->field))
   +#define VIRTIO_NET_IP4_ADDR_SIZE   8/* ipv4 saddr + daddr */
+#define VIRTIO_NET_TCP_PORT_SIZE   4/* sport + dport */
+
+/* IPv4 max payload, 16 bits in the header */
+#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
+#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535
+
+/* header lenght value in ip header without option */

typo here.

Thanks.

+#define VIRTIO_NET_IP4_HEADER_LENGTH 5
+
+/* Purge coalesced packets timer interval */
+#define VIRTIO_NET_RSC_INTERVAL  30
+
+/* This value affects the performance a lot, and should be tuned
carefully,
+   '30'(300us) is the recommended value to pass the WHQL test,
'5' can
+   gain 2x netperf throughput with tso/gso/gro 'off'. */
+static uint32_t virtio_net_rsc_timeout = VIRTIO_NET_RSC_INTERVAL;

Like we've discussed in previous versions, need we add another property
for this?

Do you know how to make this a tunable parameter for the guest? Can this
parameter be set via the control queue?

It's possible I think.
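As an illustration of how such a timeout is typically driven in QEMU, a
minimal sketch of arming a purge timer when the first segment is cached (the
drain_timer field name, this helper, and the 10us-per-tick scaling follow the
patch's description but are assumptions, not the posted code):

/* Sketch: (re)arm the per-chain purge timer; with the default
 * virtio_net_rsc_timeout of 30 and one tick = 10us, the timer fires
 * 300us after the first cached segment. */
static void virtio_net_rsc_arm_purge_timer(NetRscChain *chain)
{
    if (!chain->drain_timer) {
        chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
                                          virtio_net_rsc_purge, chain);
    }
    timer_mod(chain->drain_timer,
              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL)
              + virtio_net_rsc_timeout * 10000LL);
}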


[...]


+
+static void virtio_net_rsc_purge(void *opq)
+{
+NetRscChain *chain = (NetRscChain *)opq;
+NetRscSeg *seg, *rn;
+
+QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
+if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+chain->stat.purge_failed++;
+continue;

Is it better to break here, considering we failed to do the receive?

Actually this fails only when the receive fails according to the test, but
shou

Re: [Qemu-devel] [ RFC Patch v4 3/3] virtio-net rsc: support coalescing ipv6 tcp traffic

2016-04-08 Thread Wei Xu



On 04/08/2016 15:27, Jason Wang wrote:


On 04/08/2016 03:06 PM, Wei Xu wrote:


On 04/05/2016 10:50, Jason Wang wrote:

On 04/04/2016 03:25 AM, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

Most things are like ipv4, except there is a significant difference
between ipv4 and ipv6: the fragment lenght in the ipv4 header includes
itself, while it's not

typo

Thanks.

included for ipv6, which means ipv6 can carry a real '65535' payload.

Signed-off-by: Wei Xu <w...@redhat.com>
---
   hw/net/virtio-net.c | 147
+---
   1 file changed, 141 insertions(+), 6 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 81e8e71..2d09352 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -50,6 +50,10 @@
   /* header lenght value in ip header without option */
   #define VIRTIO_NET_IP4_HEADER_LENGTH 5
   +#define ETH_IP6_HDR_SZ (ETH_HDR_SZ + IP6_HDR_SZ)
+#define VIRTIO_NET_IP6_ADDR_SIZE   32  /* ipv6 saddr + daddr */
+#define VIRTIO_NET_MAX_IP6_PAYLOAD VIRTIO_NET_MAX_TCP_PAYLOAD
+
   /* Purge coalesced packets timer interval */
   #define VIRTIO_NET_RSC_INTERVAL  30
   @@ -1725,6 +1729,25 @@ static void
virtio_net_rsc_extract_unit4(NetRscChain *chain,
   unit->payload = htons(*unit->ip_plen) - ip_hdrlen -
unit->tcp_hdrlen;
   }
   +static void virtio_net_rsc_extract_unit6(NetRscChain *chain,
+ const uint8_t *buf,
NetRscUnit* unit)
+{
+uint16_t hdr_len;
+struct ip6_header *ip6;
+
+hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
+ip6 = (struct ip6_header *)(buf + hdr_len + sizeof(struct
eth_header));
+unit->ip = ip6;
+unit->ip_plen = &(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
+unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip)\
++ sizeof(struct ip6_header));
+unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;

+
+/* There is a difference between payload lenght in ipv4 and v6,
+   ip header is excluded in ipv6 */
+unit->payload = htons(*unit->ip_plen) - unit->tcp_hdrlen;
+}
+
   static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
   {
   uint32_t sum;
@@ -1738,7 +1761,9 @@ static size_t
virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
   {
   int ret;
   -virtio_net_rsc_ipv4_checksum(seg->unit.ip);
+if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {
+virtio_net_rsc_ipv4_checksum(seg->unit.ip);
+}

Why not introduce proto specific checksum function for chain?

Since there are only 2 protocols to be supported, and this feature has very
limited extension, mst suggested using direct calls in the v2 patch to make
things simple, and I took it.

Have you tried my suggestion? I think it will actually simplify the
current code (by at least several lines of code).

ok, will give it a try.
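To make the suggestion concrete, a hypothetical sketch of the per-protocol
callback shape being discussed (the NetRscChainOps type and these function
names do not exist in the posted patch):

/* Illustrative only: hang proto-specific handlers off the chain once,
 * so the drain/cache paths stop switching on chain->proto. */
typedef struct NetRscChainOps {
    void (*extract_unit)(NetRscChain *chain, const uint8_t *buf,
                         NetRscUnit *unit);
    void (*finalize_checksum)(NetRscSeg *seg);
} NetRscChainOps;

static void virtio_net_rsc_csum_ipv4(NetRscSeg *seg)
{
    /* ipv4 carries a header checksum that must be recomputed after
     * coalescing rewrote the length field */
    if (seg->is_coalesced) {
        virtio_net_rsc_ipv4_checksum(seg->unit.ip);
    }
}

static void virtio_net_rsc_csum_ipv6(NetRscSeg *seg)
{
    /* ipv6 has no header checksum: nothing to recompute */
}

/* virtio_net_rsc_drain_seg() would then simply call
 *     chain->ops->finalize_checksum(seg);
 * with chain->ops picked once when the chain is created. */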




Re: [Qemu-devel] [ RFC Patch v4 2/3] virtio-net rsc: support coalescing ipv4 tcp traffic

2016-04-08 Thread Wei Xu



On 04/05/2016 10:47, Jason Wang wrote:


On 04/04/2016 03:25 AM, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

All the data packets in a tcp connection will be cached to a big buffer
in every receive interval, and will be sent out via a timer; the
'virtio_net_rsc_timeout' controls the interval. This value will influence
the performance and responsiveness of the tcp connection significantly:
5 (50us) is an empirical value that gains a performance improvement, and
since the whql test sends packets every 100us, '30' (300us) can pass the
test case. This is also the default value, and it is going to be tunable.

The timer will only be triggered if the packet pool is not empty,
and it'll drain off all the cached packets.

'NetRscChain' is used to save the segments of different protocols in a
VirtIONet device.

The main handler of TCP includes TCP window update, duplicated ACK check
and the real data coalescing, if the new segment passed the sanity check
and is identified as a 'wanted' one.

A 'wanted' segment means:
1. The segment is within the current window and its sequence is the
expected one.
2. The ACK of the segment is in the valid window.
3. If the ACK in the segment is a duplicated one, then there must be less
than 2 duplicates; this is to let the upper layer TCP start retransmission,
per the spec.

Sanity checks include:
1. Incorrect version in the IP header
2. IP options & IP fragments
3. Not a TCP packet
4. Size check, to prevent a buffer overflow attack.

There may be more cases that should be considered, such as the ip
identification and other flags, but checking them broke the test because
windows sets the identification to the same value even when the packet is
not a fragment.

Normally there are 2 typical ways to handle a TCP control flag, 'bypass'
and 'finalize'. 'Bypass' means the packet should be sent out directly;
'finalize' means the packet should also be bypassed, but only after
searching the pool for packets of the same connection and sending all of
them out first; this is to avoid out-of-order data.

All 'SYN' packets will be bypassed, since they always begin a new
connection; other flags such as 'FIN/RST' will trigger a finalization,
because this normally happens when a connection is about to be closed; an
'URG' packet also finalizes the current coalescing unit.

Statistics can be used to monitor the basic coalescing status; the 'out of
order' and 'out of window' counters show how many packets were
retransmitted, thus describing the performance intuitively.

Signed-off-by: Wei Xu <w...@redhat.com>
---
  hw/net/virtio-net.c| 480 -
  include/hw/virtio/virtio-net.h |   1 +
  include/hw/virtio/virtio.h |  72 +++
  3 files changed, 552 insertions(+), 1 deletion(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index bd91a4b..81e8e71 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -15,10 +15,12 @@
  #include "qemu/iov.h"
  #include "hw/virtio/virtio.h"
  #include "net/net.h"
+#include "net/eth.h"
  #include "net/checksum.h"
  #include "net/tap.h"
  #include "qemu/error-report.h"
  #include "qemu/timer.h"
+#include "qemu/sockets.h"
  #include "hw/virtio/virtio-net.h"
  #include "net/vhost_net.h"
  #include "hw/virtio/virtio-bus.h"
@@ -38,6 +40,24 @@
  #define endof(container, field) \
  (offsetof(container, field) + sizeof(((container *)0)->field))
  
+#define VIRTIO_NET_IP4_ADDR_SIZE   8/* ipv4 saddr + daddr */

+#define VIRTIO_NET_TCP_PORT_SIZE   4/* sport + dport */
+
+/* IPv4 max payload, 16 bits in the header */
+#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
+#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535
+
+/* header lenght value in ip header without option */

typo here.

Thanks.



+#define VIRTIO_NET_IP4_HEADER_LENGTH 5
+
+/* Purge coalesced packets timer interval */
+#define VIRTIO_NET_RSC_INTERVAL  30
+
+/* This value affects the performance a lot, and should be tuned carefully,
+   '30'(300us) is the recommended value to pass the WHQL test, '5' can
+   gain 2x netperf throughput with tso/gso/gro 'off'. */
+static uint32_t virtio_net_rsc_timeout = VIRTIO_NET_RSC_INTERVAL;

Like we've discussed in previous versions, need we add another property
for this?
Do you know how to make this a tunable parameter for the guest? Can this
parameter be set via the control queue?



+
  typedef struct VirtIOFeature {
  uint32_t flags;
  size_t end;
@@ -1688,11 +1708,467 @@ static int virtio_net_load_device(VirtIODevice *vdev, 
QEMUFile *f,
  return 0;
  }
  
+static void virtio_net_rsc_extract_unit4(NetRscChain *chain,

+ const uint8_t *buf, NetRscUnit* unit)
+{
+uint16_t hdr_len;
+uint16_t ip_hdrlen;
+struct ip_header *ip;
+
+hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;

The cast seems odd. Why not just use VirtIONet * for chain->n?

OK, will take it.



Re: [Qemu-devel] [ RFC Patch v4 3/3] virtio-net rsc: support coalescing ipv6 tcp traffic

2016-04-08 Thread Wei Xu



On 04/05/2016 10:50, Jason Wang wrote:


On 04/04/2016 03:25 AM, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

Most things are like ipv4, except there is a significant difference between ipv4
and ipv6: the fragment lenght in the ipv4 header includes itself, while it's not

typo

Thanks.



included for ipv6, which means ipv6 can carry a real '65535' payload.

Signed-off-by: Wei Xu <w...@redhat.com>
---
  hw/net/virtio-net.c | 147 +---
  1 file changed, 141 insertions(+), 6 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 81e8e71..2d09352 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -50,6 +50,10 @@
  /* header lenght value in ip header without option */
  #define VIRTIO_NET_IP4_HEADER_LENGTH 5
  
+#define ETH_IP6_HDR_SZ (ETH_HDR_SZ + IP6_HDR_SZ)

+#define VIRTIO_NET_IP6_ADDR_SIZE   32  /* ipv6 saddr + daddr */
+#define VIRTIO_NET_MAX_IP6_PAYLOAD VIRTIO_NET_MAX_TCP_PAYLOAD
+
  /* Purge coalesced packets timer interval */
  #define VIRTIO_NET_RSC_INTERVAL  30
  
@@ -1725,6 +1729,25 @@ static void virtio_net_rsc_extract_unit4(NetRscChain *chain,

  unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
  }
  
+static void virtio_net_rsc_extract_unit6(NetRscChain *chain,

+ const uint8_t *buf, NetRscUnit* unit)
+{
+uint16_t hdr_len;
+struct ip6_header *ip6;
+
+hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
+ip6 = (struct ip6_header *)(buf + hdr_len + sizeof(struct eth_header));
+unit->ip = ip6;
+unit->ip_plen = &(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
+unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip)\
++ sizeof(struct ip6_header));
+unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
+
+/* There is a difference between payload lenght in ipv4 and v6,
+   ip header is excluded in ipv6 */
+unit->payload = htons(*unit->ip_plen) - unit->tcp_hdrlen;
+}
+
  static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
  {
  uint32_t sum;
@@ -1738,7 +1761,9 @@ static size_t virtio_net_rsc_drain_seg(NetRscChain 
*chain, NetRscSeg *seg)
  {
  int ret;
  
-virtio_net_rsc_ipv4_checksum(seg->unit.ip);

+if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {
+virtio_net_rsc_ipv4_checksum(seg->unit.ip);
+}

Why not introduce a proto-specific checksum function for the chain?

Since there are only 2 protocols to be supported, and this feature has very
limited extension, mst suggested using direct calls in the v2 patch to make
things simple, and I took it.



  ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
  QTAILQ_REMOVE(&chain->buffers, seg, next);
  g_free(seg->buf);
@@ -1804,7 +1829,18 @@ static void virtio_net_rsc_cache_buf(NetRscChain *chain, 
NetClientState *nc,
  QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
  chain->stat.cache++;
  
-virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);

+switch (chain->proto) {
+case ETH_P_IP:
+virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);

Another call for proto specific callbacks maybe?

Same as above.



+break;
+
+case ETH_P_IPV6:
+virtio_net_rsc_extract_unit6(chain, seg->buf, &seg->unit);
+break;
+
+default:
+g_assert_not_reached();
+}
  }
  
  static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSeg *seg,

@@ -1948,6 +1984,24 @@ static int32_t virtio_net_rsc_coalesce4(NetRscChain 
*chain, NetRscSeg *seg,
  return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
  }
  
+static int32_t virtio_net_rsc_coalesce6(NetRscChain *chain, NetRscSeg *seg,

+const uint8_t *buf, size_t size, NetRscUnit *unit)
+{
+struct ip6_header *ip1, *ip2;
+
+ip1 = (struct ip6_header *)(unit->ip);
+ip2 = (struct ip6_header *)(seg->unit.ip);
+if (memcmp(&ip1->ip6_src, &ip2->ip6_src, sizeof(struct in6_address))
+|| memcmp(&ip1->ip6_dst, &ip2->ip6_dst, sizeof(struct in6_address))
+|| (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
+|| (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
+chain->stat.no_match++;
+return RSC_NO_MATCH;
+}
+
+return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
+}
+
  /* Packets with 'SYN' should bypass, other flags should be sent after drain
   * to prevent out of order */
  static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
@@ -1991,7 +2045,11 @@ static size_t virtio_net_rsc_do_coalesce(NetRscChain 
*chain, NetClientState *nc,
  NetRscSeg *seg, *nseg;
  
  QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {

-ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
+if (chain->proto 
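The ipv4/ipv6 payload-length difference these patches keep referring to can
be shown with a standalone computation (illustrative values, not patch code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t ip4_total_len = 1500; /* ipv4 ip_len: counts the IP header    */
    uint16_t ip4_hdrlen    = 20;   /* 5 words, no options                  */
    uint16_t ip6_plen      = 1480; /* ipv6 payload len: IP header excluded */
    uint16_t tcp_hdrlen    = 20;

    /* ipv4: subtract both the IP and the TCP header */
    printf("ipv4 tcp payload: %u\n",
           (unsigned)(ip4_total_len - ip4_hdrlen - tcp_hdrlen));
    /* ipv6: only the TCP header still has to be subtracted */
    printf("ipv6 tcp payload: %u\n",
           (unsigned)(ip6_plen - tcp_hdrlen));
    return 0; /* both print 1460 */
}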

Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest

2016-03-19 Thread Wei Xu

On 03/17/2016 23:44, Michael S. Tsirkin wrote:

On Thu, Mar 17, 2016 at 11:21:28PM +0800, Wei Xu wrote:


On 03/17/2016 14:47, Jason Wang wrote:

On 03/15/2016 05:17 PM,w...@redhat.com  wrote:

From: Wei Xu<w...@redhat.com>

Fixed issues based on rfc patch v2:
1. Removed big param list, replaced it with 'NetRscUnit'
2. Different virtio header sizes
3. Modified callback functions to direct calls.
4. No need to check the failure of g_malloc()
5. Other code format adjustments, macro naming, etc.

This patch is to support the WHQL test for Windows guests, while the feature
also benefits other guests: it works like the kernel 'gro' feature, with a
userspace implementation.
Feature information:
   http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324

Both IPv4 and IPv6 are supported. Although performance with userspace virtio
is slower than vhost-net, there is about a 1x to 3x performance improvement
for userspace virtio; this is achieved by turning this feature on and disabling
'tso/gso/gro' on the corresponding tap interface and guest interface, while the
improvement is smaller with all these features on.

Test steps:
Although this feature is mainly used for windows guests, I used linux guests
to help test the feature. To keep things simple, I used 3 steps to test the
patch as I moved on.
1. With a tcp socket client/server pair running on 2 linux guests, I can
control the traffic and debug the code as I want.
2. Netperf on linux guests to test the throughput.
3. WHQL test with 2 Windows guests.

Current status:
IPv4 passes all the above tests.
IPv6 has only passed test steps 1 and 2 as described above; the virtio nic
cannot receive any packets in the WHQL test. It looks like the test traffic
is not sent out by the support machine, although the test device can access
both the host and another linux guest. I tried a lot of ways to work it out
but failed; maybe debugging from the windows guest driver side can help
figure it out.

I think you need to figure out where the packet was dropped first. If the
packet was not dropped by the windows guest, you may want to try dropmonitor.

Yes, there was something wrong with my previous description. I added some
debug code and did a new test: the packets are received by
virtio_net_receive(), are put into the vring with no error, and are already
sent to the win guest, but wireshark on the win guest doesn't see them.
Because the test case did some hacking on the filter (it installed another
lightweight filter), I'm not sure how these packets travel in the guest;
maybe they are received but dropped by the driver or the stack, etc.

Add some debug output in the driver, rebuild it and watch packets
as they are received and passed up the stack.

Yes, but this is in the win guest; I tried to build a windows debug binary
but failed. Is there any possible missing path in virtio between pushing a
packet to the vring and notifying the guest successfully? I'm sure about
this part from debugging it with gdb.



I tried 'dropmonitor'; it's very interesting, but it helps only in a very
limited way for a windows guest, since I can only use it with qemu on the host.

Note:
A 'MessageDevice' nic chosen as 'Realtek' will sometimes panic the system
during setup; this can be worked around by replacing it with an 'e1000' nic.

Todo:
More sanity checks and tcp 'ecn' and 'window' scale tests.

Wei Xu (2):
   virtio-net rsc: support coalescing ipv4 tcp traffic
   virtio-net rsc: support coalescing ipv6 tcp traffic

  hw/net/virtio-net.c| 602 -
  include/hw/virtio/virtio-net.h |   1 +
  include/hw/virtio/virtio.h |  75 +
  3 files changed, 677 insertions(+), 1 deletion(-)






Re: [Qemu-devel] [ Patch 2/2] virtio-net rsc: support coalescing ipv6 tcp traffic

2016-03-19 Thread Wei Xu

On 03/17/2016 16:50, Jason Wang wrote:


On 03/15/2016 05:17 PM, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

Most things are like ipv4, except there is a significant difference between ipv4
and ipv6: the fragment length in the ipv4 header includes itself, while it's not
included for ipv6, which means ipv6 can carry a real '65535' unit.

Signed-off-by: Wei Xu <w...@redhat.com>
---
  hw/net/virtio-net.c| 146 -
  include/hw/virtio/virtio.h |   5 +-
  2 files changed, 135 insertions(+), 16 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index c23b45f..ef61b74 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -52,9 +52,14 @@
  #define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
  #define MAX_TCP_PAYLOAD 65535
  
-/* max payload with virtio header */

+#define IP6_HDR_SZ (sizeof(struct ip6_header))
+#define ETH_IP6_HDR_SZ (ETH_HDR_SZ + IP6_HDR_SZ)
+#define IP6_ADDR_SIZE   32  /* ipv6 saddr + daddr */
+#define MAX_IP6_PAYLOAD MAX_TCP_PAYLOAD
+
+/* ip6 max payload, payload in ipv6 don't include the  header */
  #define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
-+ ETH_HDR_SZ + MAX_TCP_PAYLOAD)
++ ETH_IP6_HDR_SZ + MAX_IP6_PAYLOAD)
  
  #define IP4_HEADER_LEN 5 /* header lenght value in ip header without option */
  
@@ -1722,14 +1727,27 @@ static void virtio_net_rsc_extract_unit4(NetRscChain *chain,

  {
  uint16_t ip_hdrlen;
  
-unit->ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);

-ip_hdrlen = ((0xF & unit->ip->ip_ver_len) << 2);
-unit->ip_plen = &unit->ip->ip_len;
-unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
+unit->u_ip.ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
+ip_hdrlen = ((0xF & unit->u_ip.ip->ip_ver_len) << 2);
+unit->ip_plen = &unit->u_ip.ip->ip_len;
+unit->tcp = (struct tcp_header *)(((uint8_t *)unit->u_ip.ip) + ip_hdrlen);
  unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
  unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
  }
  
+static void virtio_net_rsc_extract_unit6(NetRscChain *chain,

+ const uint8_t *buf, NetRscUnit* unit)
+{
+unit->u_ip.ip6 = (struct ip6_header *)(buf + chain->hdr_size + ETH_HDR_SZ);

The u_ip seems a little bit redundant. How about using a simple void * and
casting it to ipv4/ipv6 in the proto-specific callbacks?

Introducing u_ip leads to unnecessary ipv4 code changes for the ipv6
coalescing implementation.

Sure.

+unit->ip_plen = &(unit->u_ip.ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
+unit->tcp = (struct tcp_header *)(((uint8_t *)unit->u_ip.ip6)\
++ IP6_HDR_SZ);
+unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
+/* There is a difference between payload lenght in ipv4 and v6,
+   ip header is excluded in ipv6 */
+unit->payload = htons(*unit->ip_plen) - unit->tcp_hdrlen;
+}
+
  static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
  {
  uint32_t sum;
@@ -1743,7 +1761,10 @@ static size_t virtio_net_rsc_drain_seg(NetRscChain 
*chain, NetRscSeg *seg)
  {
  int ret;
  
-virtio_net_rsc_ipv4_checksum(seg->unit.ip);

+if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {
+virtio_net_rsc_ipv4_checksum(seg->unit.u_ip.ip);
+}
+
  ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
  QTAILQ_REMOVE(&chain->buffers, seg, next);
  g_free(seg->buf);
@@ -1807,7 +1828,11 @@ static void virtio_net_rsc_cache_buf(NetRscChain *chain, 
NetClientState *nc,
  QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
  chain->stat.cache++;
  
-virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);

+if (chain->proto == ETH_P_IP) {
+virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
+} else {

A switch and a g_assert_not_reached() is better than this.

sure.



+virtio_net_rsc_extract_unit6(chain, seg->buf, &seg->unit);
+}
  }
  
  static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSeg *seg,

@@ -1930,8 +1955,8 @@ coalesce:
  static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
  const uint8_t *buf, size_t size, NetRscUnit *unit)
  {
-if ((unit->ip->ip_src ^ seg->unit.ip->ip_src)
-|| (unit->ip->ip_dst ^ seg->unit.ip->ip_dst)
+if ((unit->u_ip.ip->ip_src ^ seg->unit.u_ip.ip->ip_src)
+|| (unit->u_ip.ip->ip_dst ^ seg->unit.u_ip.ip->ip_dst)
  || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
  || (unit->tcp->th_dport ^ seg->un

Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic

2016-03-19 Thread Wei Xu

On 03/17/2016 16:42, Jason Wang wrote:



On 03/15/2016 05:17 PM, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

All the data packets in a tcp connection will be cached to a big buffer
in every receive interval, and will be sent out via a timer; the
'virtio_net_rsc_timeout' controls the interval. This value will influence
the performance and responsiveness of the tcp connection significantly:
5 (50us) is an empirical value that gains a performance improvement, and
since the whql test sends packets every 100us, '30' (300us) can pass the
test case. This is also the default value, and it is going to be tunable.

The timer will only be triggered if the packet pool is not empty,
and it'll drain off all the cached packets.

'NetRscChain' is used to save the segments of different protocols in a
VirtIONet device.

The main handler of TCP includes TCP window update, duplicated ACK check
and the real data coalescing, if the new segment passed the sanity check
and is identified as a 'wanted' one.

A 'wanted' segment means:
1. The segment is within the current window and its sequence is the
expected one.
2. The ACK of the segment is in the valid window.
3. If the ACK in the segment is a duplicated one, then there must be less
than 2 duplicates; this is to let the upper layer TCP start retransmission,
per the spec.

Sanity checks include:
1. Incorrect version in the IP header
2. IP options & IP fragments
3. Not a TCP packet
4. Size check, to prevent a buffer overflow attack.

There may be more cases that should be considered, such as the ip
identification and other flags, but checking them broke the test because
windows sets the identification to the same value even when the packet is
not a fragment.

Normally there are 2 typical ways to handle a TCP control flag, 'bypass'
and 'finalize'. 'Bypass' means the packet should be sent out directly;
'finalize' means the packet should also be bypassed, but only after
searching the pool for packets of the same connection and sending all of
them out first; this is to avoid out-of-order data.

All 'SYN' packets will be bypassed, since they always begin a new
connection; other flags such as 'FIN/RST' will trigger a finalization,
because this normally happens when a connection is about to be closed; an
'URG' packet also finalizes the current coalescing unit, though there may
be protocol differences between OSes.

But an URG packet should be sent as quickly as possible regardless of
ordering, no?


Yes, you're right; URG will terminate the current 'SCU'. I'll amend the commit log.




Statistics can be used to monitor the basic coalescing status; the 'out of
order' and 'out of window' counters show how many packets were retransmitted,
thus describing the performance intuitively.

Signed-off-by: Wei Xu <w...@redhat.com>
---
  hw/net/virtio-net.c| 486 -
  include/hw/virtio/virtio-net.h |   1 +
  include/hw/virtio/virtio.h |  72 ++
  3 files changed, 558 insertions(+), 1 deletion(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 5798f87..c23b45f 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -15,10 +15,12 @@
  #include "qemu/iov.h"
  #include "hw/virtio/virtio.h"
  #include "net/net.h"
+#include "net/eth.h"
  #include "net/checksum.h"
  #include "net/tap.h"
  #include "qemu/error-report.h"
  #include "qemu/timer.h"
+#include "qemu/sockets.h"
  #include "hw/virtio/virtio-net.h"
  #include "net/vhost_net.h"
  #include "hw/virtio/virtio-bus.h"
@@ -38,6 +40,35 @@
  #define endof(container, field) \
  (offsetof(container, field) + sizeof(((container *)0)->field))
  
+#define ETH_HDR_SZ (sizeof(struct eth_header))

+#define IP4_HDR_SZ (sizeof(struct ip_header))
+#define TCP_HDR_SZ (sizeof(struct tcp_header))
+#define ETH_IP4_HDR_SZ (ETH_HDR_SZ + IP4_HDR_SZ)
+
+#define IP4_ADDR_SIZE   8   /* ipv4 saddr + daddr */
+#define TCP_PORT_SIZE   4   /* sport + dport */
+
+/* IPv4 max payload, 16 bits in the header */
+#define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
+#define MAX_TCP_PAYLOAD 65535
+
+/* max payload with virtio header */
+#define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
++ ETH_HDR_SZ + MAX_TCP_PAYLOAD)

Should we use guest_hdr_len instead of sizeof() here? Considering the
vnet_hdr may be extended in the future.


Sure.




+
+#define IP4_HEADER_LEN 5 /* header lenght value in ip header without option */

typo, should be 'length'


ok.




+
+/* Purge coalesced packets timer interval */
+#define RSC_TIMER_INTERVAL  30
+
+/* Switcher to enable/disable rsc */
+static bool virtio_net_rsc_bypass = 1;
+
+/* This value affects the performance a lot, and should be tuned carefully,
+   '30'(300us) is the recommended value to pass the WHQL test, '5' can
+   gain 2x netperf throughput with tso/gso/gro 'off'. */
+static uint32_t virtio_net_rsc_timeout = RSC_TIMER_INTERVAL;
+
  typedef struc
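A standalone sketch of the 'wanted' segment rules quoted in the commit
message above (illustrative, not the patch's exact logic; the duplicate-ACK
limit is omitted for brevity):

#include <stdbool.h>
#include <stdint.h>

static bool rsc_segment_wanted(uint32_t cached_seq, uint16_t cached_payload,
                               uint32_t cached_ack, uint16_t win,
                               uint32_t new_seq, uint32_t new_ack)
{
    /* rule 1: the data must continue exactly where the cached segment
     * ends (unsigned serial arithmetic handles sequence wraparound) */
    if (new_seq != cached_seq + cached_payload) {
        return false;               /* out of order: drain and bypass */
    }
    /* rule 2: the ACK may only move forward, and at most by the
     * advertised window (a backward ACK wraps to a huge delta) */
    if (new_ack - cached_ack > win) {
        return false;               /* out of window */
    }
    return true;
}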

Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest

2016-03-19 Thread Wei Xu



On 03/18/2016 13:21, Jason Wang wrote:


On 03/18/2016 12:24 PM, Wei Xu wrote:


On 03/18/2016 10:22, Jason Wang wrote:

On 03/18/2016 12:57 AM, Wei Xu wrote:

On 03/17/2016 23:44, Michael S. Tsirkin wrote:

On Thu, Mar 17, 2016 at 11:21:28PM +0800, Wei Xu wrote:

On 03/17/2016 14:47, Jason Wang wrote:

On 03/15/2016 05:17 PM,w...@redhat.com  wrote:

From: Wei Xu<w...@redhat.com>

Fixed issues based on rfc patch v2:
1. Removed big param list, replaced it with 'NetRscUnit'
2. Different virtio header sizes
3. Modified callback functions to direct calls.
4. No need to check the failure of g_malloc()
5. Other code format adjustments, macro naming, etc.

This patch is to support the WHQL test for Windows guests, while the feature
also benefits other guests: it works like the kernel 'gro' feature, with a
userspace implementation.
Feature information:

http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324

Both IPv4 and IPv6 are supported. Although performance with userspace virtio
is slower than vhost-net, there is about a 1x to 3x performance improvement
for userspace virtio; this is achieved by turning this feature on and disabling
'tso/gso/gro' on the corresponding tap interface and guest interface, while the
improvement is smaller with all these features on.

Test steps:
Although this feature is mainly used for windows guests, I used linux guests
to help test the feature. To keep things simple, I used 3 steps to test the
patch as I moved on.
1. With a tcp socket client/server pair running on 2 linux guests, I can
control the traffic and debug the code as I want.
2. Netperf on linux guests to test the throughput.
3. WHQL test with 2 Windows guests.

Current status:
IPv4 passes all the above tests.
IPv6 has only passed test steps 1 and 2 as described above; the virtio nic
cannot receive any packets in the WHQL test. It looks like the test traffic
is not sent out by the support machine, although the test device can access
both the host and another linux guest. I tried a lot of ways to work it out
but failed; maybe debugging from the windows guest driver side can help
figure it out.

I think you need to figure out where the packet was dropped first. If the
packet was not dropped by the windows guest, you may want to try
dropmonitor.

Yes, there was something wrong with my previous description. I added some
debug code and did a new test: the packets are received by
virtio_net_receive(), are put into the vring with no error, and are already
sent to the win guest, but wireshark on the win guest doesn't see them.
Because the test case did some hacking on the filter (it installed another
lightweight filter), I'm not sure how these packets travel in the guest;
maybe they are received but dropped by the driver or the stack, etc.

Add some debug output in the driver, rebuild it and watch packets
as they are received and passed up the stack.

Yes, but this is in the win guest; I tried to build a windows debug binary
but failed. Is there any possible missing path in virtio between pushing a
packet to the vring and notifying the guest successfully? I'm sure about
this part from debugging it with gdb.

Is the packet always dropped, or does it help if you turn off some
configuration (e.g. checksum offloads)?

Yes, only the test packets are dropped; there is no checksum for the ipv6
header. I remember I disabled checksum offloads and changed other
features (RSS) in the guest, but it doesn't help. Are there any other
tunable values for qemu?

-device virtio-net-pci,? can give you all the properties.


ok, thanks a lot.



Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest

2016-03-19 Thread Wei Xu



On 03/18/2016 10:22, Jason Wang wrote:


On 03/18/2016 12:57 AM, Wei Xu wrote:

On 03/17/2016 23:44, Michael S. Tsirkin wrote:

On Thu, Mar 17, 2016 at 11:21:28PM +0800, Wei Xu wrote:

On 03/17/2016 14:47, Jason Wang wrote:

On 03/15/2016 05:17 PM,w...@redhat.com  wrote:

From: Wei Xu<w...@redhat.com>

Fixed issues based on rfc patch v2:
1. Removed big param list, replaced it with 'NetRscUnit'
2. Different virtio header sizes
3. Modified callback functions to direct calls.
4. No need to check the failure of g_malloc()
5. Other code format adjustments, macro naming, etc.

This patch is to support the WHQL test for Windows guests, while the feature
also benefits other guests: it works like the kernel 'gro' feature, with a
userspace implementation.
Feature information:
http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324

Both IPv4 and IPv6 are supported. Although performance with userspace virtio
is slower than vhost-net, there is about a 1x to 3x performance improvement
for userspace virtio; this is achieved by turning this feature on and disabling
'tso/gso/gro' on the corresponding tap interface and guest interface, while the
improvement is smaller with all these features on.

Test steps:
Although this feature is mainly used for windows guests, I used linux guests
to help test the feature. To keep things simple, I used 3 steps to test the
patch as I moved on.
1. With a tcp socket client/server pair running on 2 linux guests, I can
control the traffic and debug the code as I want.
2. Netperf on linux guests to test the throughput.
3. WHQL test with 2 Windows guests.

Current status:
IPv4 passes all the above tests.
IPv6 has only passed test steps 1 and 2 as described above; the virtio nic
cannot receive any packets in the WHQL test. It looks like the test traffic
is not sent out by the support machine, although the test device can access
both the host and another linux guest. I tried a lot of ways to work it out
but failed; maybe debugging from the windows guest driver side can help
figure it out.

I think you need to figure out where the packet was dropped first. If the
packet was not dropped by the windows guest, you may want to try
dropmonitor.

Yes, there was something wrong with my previous description. I added some
debug code and did a new test: the packets are received by
virtio_net_receive(), are put into the vring with no error, and are already
sent to the win guest, but wireshark on the win guest doesn't see them.
Because the test case did some hacking on the filter (it installed another
lightweight filter), I'm not sure how these packets travel in the guest;
maybe they are received but dropped by the driver or the stack, etc.

Add some debug output in the driver, rebuild it and watch packets
as they are received and passed up the stack.

Yes, but this is in the win guest; I tried to build a windows debug binary
but failed. Is there any possible missing path in virtio between pushing a
packet to the vring and notifying the guest successfully? I'm sure about
this part from debugging it with gdb.

Is the packet always dropped, or does it help if you turn off some
configuration (e.g. checksum offloads)?

Yes, only the test packets are dropped; there is no checksum for the ipv6
header. I remember I disabled checksum offloads and changed other
features (RSS) in the guest, but it doesn't help. Are there any other
tunable values for qemu?



I tried 'dropmonitor'; it's very interesting, but it helps only in a very
limited way for a windows guest, since I can only use it with qemu on the
host.

Note:
A 'MessageDevice' nic chosen as 'Realtek' will sometimes panic the system
during setup; this can be worked around by replacing it with an 'e1000' nic.

Todo:
More sanity checks and tcp 'ecn' and 'window' scale tests.

Wei Xu (2):
virtio-net rsc: support coalescing ipv4 tcp traffic
virtio-net rsc: support coalescing ipv6 tcp traffic

   hw/net/virtio-net.c| 602
-
   include/hw/virtio/virtio-net.h |   1 +
   include/hw/virtio/virtio.h |  75 +
   3 files changed, 677 insertions(+), 1 deletion(-)








Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic

2016-03-19 Thread Wei Xu



On 03/18/2016 10:03, Jason Wang wrote:


On 03/18/2016 12:45 AM, Wei Xu wrote:

On 2016年03月17日 16:42, Jason Wang wrote:


On 03/15/2016 05:17 PM, w...@redhat.com wrote:

From: Wei Xu <w...@redhat.com>

All the data packets in a tcp connection will be cached to a big buffer
in every receive interval, and will be sent out via a timer; the
'virtio_net_rsc_timeout' controls the interval. This value will influence
the performance and responsiveness of the tcp connection significantly:
5 (50us) is an empirical value that gains a performance improvement, and
since the whql test sends packets every 100us, '30' (300us) can pass the
test case. This is also the default value, and it is going to be tunable.

The timer will only be triggered if the packet pool is not empty,
and it'll drain off all the cached packets.

'NetRscChain' is used to save the segments of different protocols in a
VirtIONet device.

The main handler of TCP includes TCP window update, duplicated ACK check
and the real data coalescing, if the new segment passed the sanity check
and is identified as a 'wanted' one.

A 'wanted' segment means:
1. The segment is within the current window and its sequence is the
expected one.
2. The ACK of the segment is in the valid window.
3. If the ACK in the segment is a duplicated one, then there must be less
than 2 duplicates; this is to let the upper layer TCP start retransmission,
per the spec.

Sanity checks include:
1. Incorrect version in the IP header
2. IP options & IP fragments
3. Not a TCP packet
4. Size check, to prevent a buffer overflow attack.

There may be more cases that should be considered, such as the ip
identification and other flags, but checking them broke the test because
windows sets the identification to the same value even when the packet is
not a fragment.

Normally there are 2 typical ways to handle a TCP control flag, 'bypass'
and 'finalize'. 'Bypass' means the packet should be sent out directly;
'finalize' means the packet should also be bypassed, but only after
searching the pool for packets of the same connection and sending all of
them out first; this is to avoid out-of-order data.

All 'SYN' packets will be bypassed, since they always begin a new
connection; other flags such as 'FIN/RST' will trigger a finalization,
because this normally happens when a connection is about to be closed; an
'URG' packet also finalizes the current coalescing unit, though there may
be protocol differences between OSes.

But an URG packet should be sent as quickly as possible regardless of
ordering, no?

Yes, you're right; URG will terminate the current 'SCU'. I'll amend the
commit log.


Statistics can be used to monitor the basic coalescing status; the 'out of
order' and 'out of window' counters show how many packets were retransmitted,
thus describing the performance intuitively.

Signed-off-by: Wei Xu <w...@redhat.com>
---
   hw/net/virtio-net.c| 486
-
   include/hw/virtio/virtio-net.h |   1 +
   include/hw/virtio/virtio.h |  72 ++
   3 files changed, 558 insertions(+), 1 deletion(-)

[...]


+} else {
+/* Coalesce window update */
+o_tcp->th_win = n_tcp->th_win;
+chain->stat.win_update++;
+return RSC_COALESCE;
+}
+} else {
+/* pure ack, update ack */
+o_tcp->th_ack = n_tcp->th_ack;
+chain->stat.pure_ack++;
+return RSC_COALESCE;

Looks like there's something I missed. The spec said:

"In other words, any pure ACK that is not a duplicate ACK or a window
update triggers an exception and must not be coalesced. All such pure
ACKs must be indicated as individual segments."

Does it mean we *should not* coalesce window updates and pure ACKs?
(Since they can wake up transmission)?

It's also a little bit inexplicit and flexible in the spec; please see the
flowchart on the same page.

Comments about the flowchart:

The first of the following two flowcharts describes the rules for
coalescing segments and updating the TCP headers.
This flowchart refers to mechanisms for distinguishing valid duplicate
ACKs and window updates. The second flowchart describes these mechanisms.

As shown in the flowchart, only status 'C' will break the current SCU and
get finalized; both 'A' and 'B' can be coalesced, afaik.


Interesting, looks like you're right.


+}
+}
+
+static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
NetRscSeg *seg,
+const uint8_t *buf, NetRscUnit
*n_unit)
+{
+void *data;
+uint16_t o_ip_len;
+uint32_t nseq, oseq;
+NetRscUnit *o_unit;
+
+o_unit = &seg->unit;
+o_ip_len = htons(*o_unit->ip_plen);
+nseq = htonl(n_unit->tcp->th_seq);
+oseq = htonl(o_unit->tcp->th_seq);
+
+if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
+/* Log this only for debugging observation */
+chain->
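A standalone sketch of the pure-ACK classification the flowchart exchange
above describes (the enum names and the mapping onto the flowchart's
'A'/'B'/'C' statuses are assumptions for illustration):

#include <stdint.h>

enum { PURE_ACK_ADVANCE, PURE_ACK_DUP, PURE_ACK_WIN_UPDATE };

static int classify_pure_ack(uint32_t old_ack, uint16_t old_win,
                             uint32_t new_ack, uint16_t new_win)
{
    if (new_ack == old_ack) {
        /* same ACK number: either an exact duplicate or a pure window
         * update, the two cases discussed as coalescible above */
        return (new_win == old_win) ? PURE_ACK_DUP : PURE_ACK_WIN_UPDATE;
    }
    return PURE_ACK_ADVANCE;   /* ACK moved forward: wakes up the sender */
}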

Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic

2016-03-19 Thread Wei Xu



On 03/18/2016 14:56, Jason Wang wrote:


On 03/18/2016 02:38 PM, Wei Xu wrote:


On 03/18/2016 13:20, Jason Wang wrote:

On 03/18/2016 12:17 PM, Wei Xu wrote:

+static ssize_t virtio_net_receive(NetClientState *nc,
+  const uint8_t *buf, size_t size)
+{
+if (virtio_net_rsc_bypass) {
+return virtio_net_do_receive(nc, buf, size);

You need a feature bit for this and compat it for older machine
types.
And also need some work on virtio spec I think.

Yes, I'm not sure which way is good to support this: hmp/qmp/ethtool? This
is going to support win guests, so it needs a well-compatible interface.
Any comments?

I think this should be implemented through feature bits/negotiation
instead of something like ethtool.

It looks like this feature should be turned on/off dynamically according to
the spec, so maybe this should be managed from the guest. Is there any
reference code for this?

Then you may want to look at implementation of
VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.

I had a short look at it. Do you know how to control the feature bit, both
when launching the vm and when changing it during runtime?

Virtio spec and maybe windows driver source code can give you the answer.

OK, will check it out.
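For reference, the control command Jason points to has this on-the-wire
shape (the class/command values come from the virtio spec's
VIRTIO_NET_CTRL_GUEST_OFFLOADS definitions; using it for a GUEST_RSC-style
bit is hypothetical):

#include <stdint.h>

struct virtio_net_ctrl_hdr {
    uint8_t class;  /* VIRTIO_NET_CTRL_GUEST_OFFLOADS (5) */
    uint8_t cmd;    /* VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET (0) */
};

/* The payload is a single little-endian 64-bit offload bitmap; the guest
 * would set or clear the (proposed) RSC bit and resend the command, and
 * the device answers with VIRTIO_NET_OK/VIRTIO_NET_ERR. */
struct guest_offloads_set {
    struct virtio_net_ctrl_hdr hdr;
    uint64_t offloads;  /* e.g. bitmap including 1ULL << VIRTIO_NET_F_GUEST_RSC */
};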




Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic

2016-03-19 Thread Wei Xu



On 03/18/2016 13:20, Jason Wang wrote:


On 03/18/2016 12:17 PM, Wei Xu wrote:

+static ssize_t virtio_net_receive(NetClientState *nc,
+  const uint8_t *buf, size_t size)
+{
+if (virtio_net_rsc_bypass) {
+return virtio_net_do_receive(nc, buf, size);

You need a feature bit for this and compat it for older machine types.
And also need some work on virtio spec I think.

Yes, I'm not sure which way is good to support this: hmp/qmp/ethtool? This
is going to support win guests, so it needs a well-compatible interface.
Any comments?

I think this should be implemented through feature bits/negotiation
instead of something like ethtool.

It looks like this feature should be turned on/off dynamically according to
the spec, so maybe this should be managed from the guest. Is there any
reference code for this?

Then you may want to look at implementation of
VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.
I had a short look at it. Do you know how to control the feature bit, both
when launching the vm and when changing it during runtime?




Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest

2016-03-19 Thread Wei Xu



On 03/17/2016 14:47, Jason Wang wrote:

On 03/15/2016 05:17 PM,w...@redhat.com  wrote:

From: Wei Xu<w...@redhat.com>

Fixed issues based on rfc patch v2:
1. Removed big param list, replaced it with 'NetRscUnit'
2. Different virtio header sizes
3. Modified callback functions to direct calls.
4. No need to check the failure of g_malloc()
5. Other code format adjustments, macro naming, etc.

This patch is to support the WHQL test for Windows guests, while the feature
also benefits other guests: it works like the kernel 'gro' feature, with a
userspace implementation.
Feature information:
   http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324

Both IPv4 and IPv6 are supported. Although performance with userspace virtio
is slower than vhost-net, there is about a 1x to 3x performance improvement
for userspace virtio; this is achieved by turning this feature on and disabling
'tso/gso/gro' on the corresponding tap interface and guest interface, while the
improvement is smaller with all these features on.

Test steps:
Although this feature is mainly used for windows guests, I used linux guests
to help test the feature. To keep things simple, I used 3 steps to test the
patch as I moved on.
1. With a tcp socket client/server pair running on 2 linux guests, I can
control the traffic and debug the code as I want.
2. Netperf on linux guests to test the throughput.
3. WHQL test with 2 Windows guests.

Current status:
IPv4 passes all the above tests.
IPv6 has only passed test steps 1 and 2 as described above; the virtio nic
cannot receive any packets in the WHQL test. It looks like the test traffic
is not sent out by the support machine, although the test device can access
both the host and another linux guest. I tried a lot of ways to work it out
but failed; maybe debugging from the windows guest driver side can help
figure it out.

I think you need to figure out where the packet was dropped first. If the
packet was not dropped by the windows guest, you may want to try dropmonitor.
Yes, there was something wrong with my previous description. I added some
debug code and did a new test: the packets are received by
virtio_net_receive(), are put into the vring with no error, and are already
sent to the win guest, but wireshark on the win guest doesn't see them.
Because the test case did some hacking on the filter (it installed another
lightweight filter), I'm not sure how these packets travel in the guest;
maybe they are received but dropped by the driver or the stack, etc.


I tried 'dropmonitor'; it's very interesting, but it helps only in a very
limited way for a windows guest, since I can only use it with qemu on the host.

Note:
A 'MessageDevice' nic chosen as 'Realtek' will sometimes panic the system
during setup; this can be worked around by replacing it with an 'e1000' nic.

Todo:
More sanity checks and tcp 'ecn' and 'window' scale tests.

Wei Xu (2):
   virtio-net rsc: support coalescing ipv4 tcp traffic
   virtio-net rsc: support coalescing ipv6 tcp traffic

  hw/net/virtio-net.c| 602 -
  include/hw/virtio/virtio-net.h |   1 +
  include/hw/virtio/virtio.h |  75 +
  3 files changed, 677 insertions(+), 1 deletion(-)






Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic

2016-03-15 Thread Wei Xu


- Original Message -
From: "Michael S. Tsirkin" <m...@redhat.com>
To: w...@redhat.com
Cc: vict...@redhat.com, jasow...@redhat.com, yvuge...@redhat.com, 
qemu-devel@nongnu.org, mar...@redhat.com, dfley...@redhat.com
Sent: Tuesday, March 15, 2016 6:00:03 PM
Subject: Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic

On Tue, Mar 15, 2016 at 05:17:03PM +0800, w...@redhat.com wrote:
> From: Wei Xu <w...@redhat.com>
> 
> All the data packets in a tcp connection will be cached in a big buffer
> in every receive interval, and will be sent out via a timer; the
> 'virtio_net_rsc_timeout' value controls the interval and strongly
> influences the performance and responsiveness of the tcp connection.
> 5 (50us) is an empirical value chosen for a performance improvement; since
> the whql test sends packets every 100us, '30 (300us)' can pass the test
> case, and this is also the default value. It is going to be made tunable.
> The timer is only triggered if the packet pool is not empty, and it drains
> off all the cached packets
> 
> 'NetRscChain' is used to save the segments of different protocols in a
> VirtIONet device.
> 
> The main handler of TCP includes the TCP window update, the duplicated ACK
> check, and the real data coalescing if the new segment passes the sanity
> check and is identified as a 'wanted' one.
> 
> A 'wanted' segment means (a rough sketch of this check follows below):
> 1. The segment is within the current window and its sequence is the
> expected one.
> 2. The ACK of the segment is in the valid window.
> 3. If the ACK in the segment is a duplicated one, the duplicate count must
> be less than 2; this lets the upper-layer TCP start retransmission as
> required by the spec.
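
A minimal sketch of that 'wanted' test, assuming simplified field names
(this is not the posted code, just an illustration of the three rules):

    /* Illustration only: returns true when a segment may be coalesced. */
    static bool tcp_segment_is_wanted(uint32_t seq, uint32_t expected_seq,
                                      uint32_t ack, uint32_t last_ack,
                                      uint16_t window, uint32_t dup_ack_count)
    {
        if (seq != expected_seq) {
            return false;                 /* rule 1: out of order/window */
        }
        if ((int32_t)(ack - last_ack) < 0 ||
            (uint32_t)(ack - last_ack) > window) {
            return false;                 /* rule 2: ACK outside valid window */
        }
        if (ack == last_ack && dup_ack_count >= 2) {
            return false;                 /* rule 3: let TCP see the dup ACKs */
        }
        return true;
    }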
> 
> Sanity check includes:
> 1. Incorrect version in the IP header
> 2. IP options & IP fragments
> 3. Not a TCP packet
> 4. A size check to prevent buffer overflow attacks.
> 
> There may be more cases that should be considered, such as the ip
> identification and other flags, but checking them broke the test because
> Windows sets the identification to the same value even when the packet is
> not a fragment.
> 
> Normally there are 2 typical ways to handle a TCP control flag, 'bypass'
> and 'finalize'. 'bypass' means the packet should be sent out directly;
> 'finalize' means the packet should also be bypassed, but only after
> searching the pool for packets of the same connection and sending all of
> them out first, to avoid out-of-order data.
> 
> All 'SYN' packets will be bypassed, since a SYN always begins a new
> connection. Other flags such as 'FIN/RST' trigger a finalization, because
> they normally appear when a connection is about to be closed. An 'URG'
> packet also finalizes the current coalescing unit, though there may be
> protocol differences between OSes. (See the sketch after this paragraph.)
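
A compact sketch of that flag policy, assuming the patch's RSC_BYPASS,
RSC_FINAL and RSC_WANT return codes; the flag constants are defined locally
here since they are standard TCP bits, not taken from the patch:

    enum { TH_FIN = 0x01, TH_SYN = 0x02, TH_RST = 0x04, TH_URG = 0x20 };

    /* Illustration: classify a segment by its TCP control flags. */
    static int32_t tcp_ctrl_flag_policy(uint16_t flags)
    {
        if (flags & TH_SYN) {
            return RSC_BYPASS;  /* a SYN always starts a new connection */
        }
        if (flags & (TH_FIN | TH_RST | TH_URG)) {
            return RSC_FINAL;   /* drain this connection's pool, then send */
        }
        return RSC_WANT;        /* candidate for coalescing */
    }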
> 
> Statistics can be used to monitor the basic coalescing status; 'out of
> order' and 'out of window' count the retransmitted packets and thus
> describe the performance intuitively.
> 
> Signed-off-by: Wei Xu <w...@redhat.com>
> ---
>  hw/net/virtio-net.c| 486 
> -
>  include/hw/virtio/virtio-net.h |   1 +
>  include/hw/virtio/virtio.h |  72 ++
>  3 files changed, 558 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 5798f87..c23b45f 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -15,10 +15,12 @@
>  #include "qemu/iov.h"
>  #include "hw/virtio/virtio.h"
>  #include "net/net.h"
> +#include "net/eth.h"
>  #include "net/checksum.h"
>  #include "net/tap.h"
>  #include "qemu/error-report.h"
>  #include "qemu/timer.h"
> +#include "qemu/sockets.h"
>  #include "hw/virtio/virtio-net.h"
>  #include "net/vhost_net.h"
>  #include "hw/virtio/virtio-bus.h"
> @@ -38,6 +40,35 @@
>  #define endof(container, field) \
>  (offsetof(container, field) + sizeof(((container *)0)->field))
>  
> +#define ETH_HDR_SZ (sizeof(struct eth_header))
> +#define IP4_HDR_SZ (sizeof(struct ip_header))
> +#define TCP_HDR_SZ (sizeof(struct tcp_header))
> +#define ETH_IP4_HDR_SZ (ETH_HDR_SZ + IP4_HDR_SZ)

It's better to open-code these imho.

okay.

> +
> +#define IP4_ADDR_SIZE   8   /* ipv4 saddr + daddr */
> +#define TCP_PORT_SIZE   4   /* sport + dport */
> +
> +/* IPv4 max payload, 16 bits in the header */
> +#define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
> +#define MAX_TCP_PAYLOAD 65535
> +
> +/* max payload with virtio header */
> +#define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
> ++ ETH_HDR_SZ + MAX_TCP_PAYLOAD)
> +
>

Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest

2016-03-15 Thread Wei Xu


- Original Message -
From: "Michael S. Tsirkin" <m...@redhat.com>
To: w...@redhat.com
Cc: vict...@redhat.com, jasow...@redhat.com, yvuge...@redhat.com, 
qemu-devel@nongnu.org, mar...@redhat.com, dfley...@redhat.com
Sent: Tuesday, March 15, 2016 6:01:12 PM
Subject: Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for 
WHQL test of Window guest

On Tue, Mar 15, 2016 at 05:17:02PM +0800, w...@redhat.com wrote:
> From: Wei Xu <w...@redhat.com>
> 
> Fixed issues based on rfc patch v2:
> 1. Removed the big parameter list, replaced it with 'NetRscUnit'.
> 2. Handled different virtio header sizes.
> 3. Changed the callback function to a direct call.
> 4. Dropped the needless check of g_malloc() failure.
> 5. Other code format adjustments, macro naming, etc.
> 
> This patch supports the WHQL test for Windows guests; the feature also
> benefits other guests, working like the kernel 'gro' feature but with a
> userspace implementation.
> Feature information:
>   http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324
> 
> Both IPv4 and IPv6 are supported. Although performance with userspace
> virtio is slower than vhost-net, there is about a 1x to 3x performance
> improvement for userspace virtio. This is achieved by turning this feature
> on and disabling 'tso/gso/gro' on the corresponding tap interface and the
> guest interface; the improvement is smaller with all of these features on.
> 
> Test steps:
> Although this feature is mainly used for Windows guests, I used Linux
> guests to help test the feature. To keep things simple, I tested the patch
> in 3 steps as I moved on:
> 1. A tcp socket client/server pair running on 2 Linux guests, so I could
> control the traffic and debug the code as I wanted.
> 2. Netperf between Linux guests to test the throughput.
> 3. WHQL test with 2 Windows guests.
> 
> Current status:
> IPv4 passes all the above tests.
> IPv6 has only passed test steps 1 and 2 as described above; the virtio nic
> cannot receive any packet in the WHQL test. It looks like the test traffic
> is not sent out from the support machine, although the test device can
> access both the host and another Linux guest. I tried a lot of ways to
> work it out but failed; maybe debugging from the Windows guest driver side
> can help figure it out.
> 
> Note:
> A 'MessageDevice' nic chosen as 'Realtek' will sometimes panic the system
> during setup; this can be worked around by replacing it with an 'e1000'
> nic.
> 
> Todo:
> More sanity checks and tcp 'ecn' and 'window scale' tests.

So at this point this is still an RFC; please label it as such in the
subject. Also, the commit log of each patch should include info on how to
activate the feature.

OK, thanks mst.

thanks!

> Wei Xu (2):
>   virtio-net rsc: support coalescing ipv4 tcp traffic
>   virtio-net rsc: support coalescing ipv6 tcp traffic
> 
>  hw/net/virtio-net.c| 602 
> -
>  include/hw/virtio/virtio-net.h |   1 +
>  include/hw/virtio/virtio.h |  75 +
>  3 files changed, 677 insertions(+), 1 deletion(-)
> 
> -- 
> 2.5.0




Re: [Qemu-devel] [RFC Patch v2 06/10] virtio-net rsc: IPv4 checksum

2016-02-01 Thread Wei Xu



On 02/01/2016 02:31 PM, Jason Wang wrote:


On 02/01/2016 02:13 AM, w...@redhat.com wrote:

From: Wei Xu <w...@wei-thinkpad.nay.redhat.com>

If a field in the IPv4 header is modified, then the checksum
has to be recalculated before sending it out.

This in fact breaks bisection. I think you need to either squash this into
the previous patch or introduce virtio_net_rsc_ipv4_checksum() as a helper
before the ipv4 coalescing patch.

OK.

Signed-off-by: Wei Xu <w...@redhat.com>
---
  hw/net/virtio-net.c | 19 +++
  1 file changed, 19 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 93df0d5..88fc4f8 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1630,6 +1630,18 @@ static int virtio_net_load_device(VirtIODevice *vdev, 
QEMUFile *f,
  return 0;
  }
  
+static void virtio_net_rsc_ipv4_checksum(NetRscSeg *seg)

+{
+uint32_t sum;
+struct ip_header *ip;
+
+ip = (struct ip_header *)(seg->buf + IP_OFFSET);
+
+ip->ip_sum = 0;
+sum = net_checksum_add_cont(sizeof(struct ip_header), (uint8_t *)ip, 0);
+ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
+}
+
  static void virtio_net_rsc_purge(void *opq)
  {
  int ret = 0;
@@ -1643,6 +1655,10 @@ static void virtio_net_rsc_purge(void *opq)
  continue;
  }
  
+if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {

+virtio_net_rsc_ipv4_checksum(seg);
+}
+
  ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
  QTAILQ_REMOVE(&chain->buffers, seg, next);
  g_free(seg->buf);
@@ -1853,6 +1869,9 @@ static size_t virtio_net_rsc_callback(NetRscChain *chain, 
NetClientState *nc,
  QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
  ret = coalesce(chain, seg, buf, size);
  if (RSC_FINAL == ret) {
+if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {
+virtio_net_rsc_ipv4_checksum(seg);
+}
  ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
  QTAILQ_REMOVE(&chain->buffers, seg, next);
  g_free(seg->buf);
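
The helper zeroes ip_sum before summing because the IPv4 checksum is defined
over the header with the checksum field set to zero. As a quick self-check
(illustrative usage of the same net/checksum.h helpers, not part of the
patch), recomputing the sum over a header that already carries a valid
checksum must finish to 0:

    static bool ip4_checksum_ok(struct ip_header *ip)
    {
        uint32_t sum = net_checksum_add_cont(sizeof(*ip), (uint8_t *)ip, 0);
        return net_checksum_finish(sum) == 0;
    }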







Re: [Qemu-devel] [RFC Patch v2 05/10] virtio-net rsc: Create timer to drain the packets from the cache pool

2016-02-01 Thread Wei Xu

On 02/01/2016 02:28 PM, Jason Wang wrote:


On 02/01/2016 02:13 AM, w...@redhat.com wrote:

From: Wei Xu <w...@wei-thinkpad.nay.redhat.com>

The timer will only be triggered if the packet pool is not empty,
and it will drain off all the cached packets; this is to reduce the
delay to the upper-layer protocol stack.

Signed-off-by: Wei Xu <w...@redhat.com>
---
  hw/net/virtio-net.c | 38 ++
  1 file changed, 38 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 4f77fbe..93df0d5 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -48,12 +48,17 @@
  
  #define MAX_VIRTIO_IP_PAYLOAD  (65535 + IP_OFFSET)
  
+/* Purge coalesced packets timer interval */

+#define RSC_TIMER_INTERVAL  50

Any hints for choosing this as the default value? Do we need a property so
the user can change it?

This is still under estimation; 300ms-500ms was a good value in the tests,
and this should be configurable.
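
One possible way to make it configurable (a sketch; the property name, the
VirtIONet field, and the default value are assumptions, not the posted
patch):

    static Property virtio_net_rsc_properties[] = {
        /* hypothetical knob: drain interval for the coalescing timer */
        DEFINE_PROP_UINT32("rsc-timeout", VirtIONet, rsc_timeout, 300000),
        DEFINE_PROP_END_OF_LIST(),
    };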

+
  /* Global statistics */
  static uint32_t rsc_chain_no_mem;
  
  /* Switcher to enable/disable rsc */

  static bool virtio_net_rsc_bypass;
  
+static uint32_t rsc_timeout = RSC_TIMER_INTERVAL;

+
  /* Coalesce callback for ipv4/6 */
  typedef int32_t (VirtioNetCoalesce) (NetRscChain *chain, NetRscSeg *seg,
   const uint8_t *buf, size_t size);
@@ -1625,6 +1630,35 @@ static int virtio_net_load_device(VirtIODevice *vdev, 
QEMUFile *f,
  return 0;
  }
  
+static void virtio_net_rsc_purge(void *opq)

+{
+int ret = 0;
+NetRscChain *chain = (NetRscChain *)opq;
+NetRscSeg *seg, *rn;
+
+QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
+if (!qemu_can_send_packet(seg->nc)) {
+/* Should we quit or continue? Not sure whether one or some
+* of the queues failing can happen; try to continue here */

This looks wrong, qemu_can_send_packet() is used for nc's peer not nc
itself.

OK.



+continue;
+}
+
+ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
+QTAILQ_REMOVE(&chain->buffers, seg, next);
+g_free(seg->buf);
+g_free(seg);
+
+if (ret == 0) {
+/* Try next queue */

Try next seg?

Yes, it's seg.



+continue;
+}

Why is the above needed?
Yes, it's optional; it might only help if extra code is added after
this. I will remove it.



+}
+
+if (!QTAILQ_EMPTY(&chain->buffers)) {
+timer_mod(chain->drain_timer,
+  qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + rsc_timeout);

The timer needs to be stopped/started during vm stop/start to save cpu.

Thanks, do you know where I should add the code?
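
A likely place is a vm change state handler (a sketch under that assumption;
the exact integration point is a guess, not a reviewed design):

    static void virtio_net_rsc_vm_state_change(void *opaque, int running,
                                               RunState state)
    {
        NetRscChain *chain = opaque;

        if (running) {
            /* re-arm only if there is still cached data to drain */
            if (!QTAILQ_EMPTY(&chain->buffers)) {
                timer_mod(chain->drain_timer,
                          qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + rsc_timeout);
            }
        } else {
            timer_del(chain->drain_timer);
        }
    }

    /* registered once per chain, e.g. at chain creation: */
    qemu_add_vm_change_state_handler(virtio_net_rsc_vm_state_change, chain);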



+}
+}
  
  static void virtio_net_rsc_cleanup(VirtIONet *n)

  {
@@ -1810,6 +1844,8 @@ static size_t virtio_net_rsc_callback(NetRscChain *chain, 
NetClientState *nc,
  if (!virtio_net_rsc_cache_buf(chain, nc, buf, size)) {
  return 0;
  } else {
+timer_mod(chain->drain_timer,
+  qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + rsc_timeout);
  return size;
  }
  }
@@ -1877,6 +1913,8 @@ static NetRscChain 
*virtio_net_rsc_lookup_chain(NetClientState *nc,
  }
  
  chain->proto = proto;

+chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
+  virtio_net_rsc_purge, chain);
  chain->do_receive = virtio_net_rsc_receive4;
  
  QTAILQ_INIT(&chain->buffers);







Re: [Qemu-devel] [RFC Patch v2 08/10] virtio-net rsc: Sanity check & More bypass cases check

2016-02-01 Thread Wei Xu

On 02/01/2016 02:58 PM, Jason Wang wrote:


On 02/01/2016 02:13 AM, w...@redhat.com wrote:

From: Wei Xu <w...@wei-thinkpad.nay.redhat.com>

More general exception-case checks:
1. Incorrect version in the IP header
2. IP options & IP fragments
3. Not a TCP packet
4. A size check to prevent buffer overflow attacks.

Signed-off-by: Wei Xu <w...@redhat.com>

Let's squash this into the previous patches too, for better bisectability
and a complete implementation.

ok.



---
  hw/net/virtio-net.c | 44 
  1 file changed, 44 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index b0987d0..9b44762 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1948,6 +1948,46 @@ static size_t virtio_net_rsc_drain_one(NetRscChain 
*chain, NetClientState *nc,
  
  return virtio_net_do_receive(nc, buf, size);

  }
+
+static int32_t virtio_net_rsc_filter4(NetRscChain *chain, struct ip_header *ip,
+  const uint8_t *buf, size_t size)

This function checks the ip header, so it needs to be renamed to something
like "virtio_net_rsc_ipv4_filter()".

OK.



+{
+uint16_t ip_len;
+
+if (size < (TCP4_OFFSET + sizeof(struct tcp_header))) {
+return RSC_BYPASS;
+}
+
+/* Not an ipv4 one */
+if (0x4 != ((0xF0 & ip->ip_ver_len) >> 4)) {

Let's not use magic values like 0x4 here.

OK.



+return RSC_BYPASS;
+}
+
+/* Don't handle packets with ip option */
+if (5 != (0xF & ip->ip_ver_len)) {
+return RSC_BYPASS;
+}
+
+/* Don't handle packets with ip fragment */
+if (!(htons(ip->ip_off) & IP_DF)) {
+return RSC_BYPASS;
+}
+
+if (ip->ip_p != IPPROTO_TCP) {
+return RSC_BYPASS;
+}
+
+/* Sanity check */
+ip_len = htons(ip->ip_len);
+if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header))
+|| ip_len > (size - IP_OFFSET)) {
+return RSC_BYPASS;
+}
+
+return RSC_WANT;
+}
+
+
  static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
const uint8_t *buf, size_t size)
  {
@@ -1958,6 +1998,10 @@ static size_t virtio_net_rsc_receive4(void *opq, 
NetClientState* nc,
  chain = (NetRscChain *)opq;
  ip = (struct ip_header *)(buf + IP_OFFSET);
  
+if (RSC_WANT != virtio_net_rsc_filter4(chain, ip, buf, size)) {

+return virtio_net_do_receive(nc, buf, size);
+}
+
  ret = virtio_net_rsc_parse_tcp_ctrl((uint8_t *)ip,
  (0xF & ip->ip_ver_len) << 2);
  if (RSC_BYPASS == ret) {







Re: [Qemu-devel] [RFC Patch v2 03/10] virtio-net rsc: Chain Lookup, Packet Caching and Framework of IPv4

2016-02-01 Thread Wei Xu

On 02/01/2016 01:55 PM, Jason Wang wrote:



On 02/01/2016 02:13 AM, w...@redhat.com wrote:

From: Wei Xu <w...@wei-thinkpad.nay.redhat.com>

When a packet arrives, a corresponding chain will be selected or created,
or the packet will be bypassed if it is not an IPv4 packet.

The callback in the chain will be invoked to do the real coalescing.

Since the coalescing is based on the TCP connection, packets will be
cached if there is no previous data within the same connection.

The framework of IPv4 is also introduced.

This patch depends on patch 2918cf2 (Detailed IPv4 and General TCP data
coalescing)

Then it looks like the order needs to be changed?


OK. As mentioned in other feedback, some of the patches should be merged; I
will adjust the patch set again, thanks.



Signed-off-by: Wei Xu <w...@redhat.com>
---
  hw/net/virtio-net.c | 173 +++-
  1 file changed, 172 insertions(+), 1 deletion(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 4e9458e..cfbac6d 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -14,10 +14,12 @@
  #include "qemu/iov.h"
  #include "hw/virtio/virtio.h"
  #include "net/net.h"
+#include "net/eth.h"
  #include "net/checksum.h"
  #include "net/tap.h"
  #include "qemu/error-report.h"
  #include "qemu/timer.h"
+#include "qemu/sockets.h"
  #include "hw/virtio/virtio-net.h"
  #include "net/vhost_net.h"
  #include "hw/virtio/virtio-bus.h"
@@ -37,6 +39,21 @@
  #define endof(container, field) \
  (offsetof(container, field) + sizeof(((container *)0)->field))
  
+#define VIRTIO_HEADER   12  /* Virtio net header size */

This looks wrong if mrg_rxbuf (VIRTIO_NET_F_MRG_RXBUF) is off


OK.
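
The 12-byte constant only matches struct virtio_net_hdr_mrg_rxbuf; with
VIRTIO_NET_F_MRG_RXBUF off the header is shorter. A sketch of deriving the
size from the negotiated features instead (VirtIONet keeps this in
guest_hdr_len during feature negotiation):

    static size_t virtio_net_rsc_hdr_len(VirtIONet *n)
    {
        /* sizeof(struct virtio_net_hdr) or virtio_net_hdr_mrg_rxbuf */
        return n->guest_hdr_len;
    }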




+#define IP_OFFSET (VIRTIO_HEADER + sizeof(struct eth_header))
+
+#define MAX_VIRTIO_IP_PAYLOAD  (65535 + IP_OFFSET)
+
+/* Global statistics */
+static uint32_t rsc_chain_no_mem;

This is meaningless, see below comments.


Yes, should remove this.




+
+/* Switcher to enable/disable rsc */
+static bool virtio_net_rsc_bypass;
+
+/* Coalesce callback for ipv4/6 */
+typedef int32_t (VirtioNetCoalesce) (NetRscChain *chain, NetRscSeg *seg,
+ const uint8_t *buf, size_t size);
+
  typedef struct VirtIOFeature {
  uint32_t flags;
  size_t end;
@@ -1019,7 +1036,8 @@ static int receive_filter(VirtIONet *n, const uint8_t 
*buf, int size)
  return 0;
  }
  
-static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)

+static ssize_t virtio_net_do_receive(NetClientState *nc,
+  const uint8_t *buf, size_t size)
  {
  VirtIONet *n = qemu_get_nic_opaque(nc);
  VirtIONetQueue *q = virtio_net_get_subqueue(nc);
@@ -1623,6 +1641,159 @@ static void virtio_net_rsc_cleanup(VirtIONet *n)
  }
  }
  
+static int virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,

+const uint8_t *buf, size_t size)
+{
+NetRscSeg *seg;
+
+seg = g_malloc(sizeof(NetRscSeg));
+if (!seg) {
+return 0;
+}

g_malloc() can't fail, no need to check if it succeeded.


OK.




+
+seg->buf = g_malloc(MAX_VIRTIO_IP_PAYLOAD);
+if (!seg->buf) {
+goto out;
+}
+
+memmove(seg->buf, buf, size);
+seg->size = size;
+seg->dup_ack_count = 0;
+seg->is_coalesced = 0;
+seg->nc = nc;
+
+QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
+return size;
+
+out:
+g_free(seg);
+return 0;
+}
+
+
+static int32_t virtio_net_rsc_try_coalesce4(NetRscChain *chain,
+   NetRscSeg *seg, const uint8_t *buf, size_t size)
+{
+/* The real part of this function will be introduced in the next patch;
+*  just return 'final' to keep the compilation going. */
+return RSC_FINAL;
+}
+
+static size_t virtio_net_rsc_callback(NetRscChain *chain, NetClientState *nc,
+const uint8_t *buf, size_t size, VirtioNetCoalesce *coalesce)
+{

Looks like this function was called directly, so "callback" suffix is
not accurate.


OK.




+int ret;
+NetRscSeg *seg, *nseg;
+
+if (QTAILQ_EMPTY(>buffers)) {
+if (!virtio_net_rsc_cache_buf(chain, nc, buf, size)) {
+return 0;
+} else {
+return size;
+}
+}
+
+QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
+ret = coalesce(chain, seg, buf, size);
+if (RSC_FINAL == ret) {

Let's use "ret == RSC_FINAL" for a coding style consistent with the rest of
the qemu code.


OK.




+ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
+QTAILQ_REMOVE(&chain->buffers, seg, next);
+g_free(seg->buf);
+g_free(seg);
+if (ret == 0) {
+/* Send failed */
+retur

Re: [Qemu-devel] [RFC Patch v2 09/10] virtio-net rsc: Add IPv6 support

2016-02-01 Thread Wei Xu



On 02/01/2016 03:14 PM, Jason Wang wrote:


On 02/01/2016 02:13 AM, w...@redhat.com wrote:

From: Wei Xu <w...@wei-thinkpad.nay.redhat.com>

A few more things are needed to support this:
1. The corresponding chain lookup
2. A coalescing callback for the protocol chain
3. Filter & sanity checks.

Signed-off-by: Wei Xu <w...@redhat.com>
---
  hw/net/virtio-net.c | 104 +++-
  1 file changed, 102 insertions(+), 2 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 9b44762..c9f6bfc 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -46,12 +46,19 @@
  #define TCP4_OFFSET (IP_OFFSET + sizeof(struct ip_header)) /* tcp4 header */
  #define TCP4_PORT_OFFSET TCP4_OFFSET/* tcp4 port offset */
  #define IP4_ADDR_SIZE   8   /* ipv4 saddr + daddr */
+
+#define IP6_ADDR_OFFSET (IP_OFFSET + 8) /* ipv6 address start */
+#define TCP6_OFFSET (IP_OFFSET + sizeof(struct ip6_header)) /* tcp6 header */
+#define TCP6_PORT_OFFSET TCP6_OFFSET/* tcp6 port offset */
+#define IP6_ADDR_SIZE   32  /* ipv6 saddr + daddr */
  #define TCP_PORT_SIZE   4   /* sport + dport */
  #define TCP_WINDOW  65535
  
  /* IPv4 max payload, 16 bits in the header */

  #define MAX_IP4_PAYLOAD  (65535 - sizeof(struct ip_header))
  
+/* ip6 max payload; the payload length in ipv6 does not include the header */

+#define MAX_IP6_PAYLOAD  65535
  #define MAX_VIRTIO_IP_PAYLOAD  (65535 + IP_OFFSET)
  
  /* Purge coalesced packets timer interval */

@@ -1856,6 +1863,42 @@ static int32_t virtio_net_rsc_try_coalesce4(NetRscChain 
*chain,
  o_data, &o_ip->ip_len, MAX_IP4_PAYLOAD);
  }
  
+static int32_t virtio_net_rsc_try_coalesce6(NetRscChain *chain,

+NetRscSeg *seg, const uint8_t *buf, size_t size)
+{
+uint16_t o_ip_len, n_ip_len;/* len in ip header field */
+uint16_t n_tcp_len, o_tcp_len;  /* tcp header len */
+uint16_t o_data, n_data;/* payload without virtio/eth/ip/tcp */
+struct ip6_header *n_ip, *o_ip;
+struct tcp_header *n_tcp, *o_tcp;
+
+n_ip = (struct ip6_header *)(buf + IP_OFFSET);
+n_ip_len = htons(n_ip->ip6_ctlun.ip6_un1.ip6_un1_plen);
+n_tcp = (struct tcp_header *)(((uint8_t *)n_ip)\
++ sizeof(struct ip6_header));
+n_tcp_len = (htons(n_tcp->th_offset_flags) & 0xF000) >> 10;
+n_data = n_ip_len - n_tcp_len;
+
+o_ip = (struct ip6_header *)(seg->buf + IP_OFFSET);
+o_ip_len = htons(o_ip->ip6_ctlun.ip6_un1.ip6_un1_plen);
+o_tcp = (struct tcp_header *)(((uint8_t *)o_ip)\
++ sizeof(struct ip6_header));
+o_tcp_len = (htons(o_tcp->th_offset_flags) & 0xF000) >> 10;
+o_data = o_ip_len - o_tcp_len;

Like I replied in previous mails, this needs a helper, or just store
pointers to both the ip and tcp headers in seg.

OK.
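
A possible shape for that helper (illustrative; it mirrors the patch's
pointer arithmetic, but the helper itself is hypothetical):

    static void virtio_net_rsc_extract_unit6(const uint8_t *buf,
                                             struct ip6_header **ip,
                                             struct tcp_header **tcp,
                                             uint16_t *ip_plen,
                                             uint16_t *tcp_hlen)
    {
        *ip = (struct ip6_header *)(buf + IP_OFFSET);
        *ip_plen = htons((*ip)->ip6_ctlun.ip6_un1.ip6_un1_plen);
        *tcp = (struct tcp_header *)((uint8_t *)*ip
                                     + sizeof(struct ip6_header));
        *tcp_hlen = (htons((*tcp)->th_offset_flags) & 0xF000) >> 10;
    }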



+
+if (memcmp(&n_ip->ip6_src, &o_ip->ip6_src, sizeof(struct in6_address))
+|| memcmp(&n_ip->ip6_dst, &o_ip->ip6_dst, sizeof(struct in6_address))
+|| (n_tcp->th_sport ^ o_tcp->th_sport)
+|| (n_tcp->th_dport ^ o_tcp->th_dport)) {
+return RSC_NO_MATCH;
+}

And if you still want to handle coalescing in a layered style, it is better
to delay the check of the ports to the tcp function.

OK.



+
+/* There is a difference between the payload length in ipv4 and v6:
+   the ip header is excluded in ipv6 */
+return virtio_net_rsc_coalesce_tcp(chain, seg, buf,
+   n_tcp, n_tcp_len, n_data, o_tcp, o_tcp_len, o_data,
+   &o_ip->ip6_ctlun.ip6_un1.ip6_un1_plen, MAX_IP6_PAYLOAD);
+}
  
  /* Packets with 'SYN' should bypass; other flags should be sent after a drain
   * to prevent out-of-order data */
@@ -2015,6 +2058,59 @@ static size_t virtio_net_rsc_receive4(void *opq, 
NetClientState* nc,
 virtio_net_rsc_try_coalesce4);
  }
  
+static int32_t virtio_net_rsc_filter6(NetRscChain *chain, struct ip6_header *ip,

+  const uint8_t *buf, size_t size)
+{
+uint16_t ip_len;
+
+if (size < (TCP6_OFFSET + sizeof(struct tcp_header))) {
+return RSC_BYPASS;
+}
+
+if (0x6 != (0xF & ip->ip6_ctlun.ip6_un1.ip6_un1_flow)) {
+return RSC_BYPASS;
+}
+
+/* Both options and the protocol are checked by this */
+if (ip->ip6_ctlun.ip6_un1.ip6_un1_nxt != IPPROTO_TCP) {
+return RSC_BYPASS;
+}
+
+/* Sanity check */
+ip_len = htons(ip->ip6_ctlun.ip6_un1.ip6_un1_plen);
+if (ip_len < sizeof(struct tcp_header)
+|| ip_len > (size - TCP6_OFFSET)) {
+return RSC_BYPASS;
+}
+
+return 0;

RSC_WANT?

Yes, this is new code and it has not been tested.



+}
+
+static size_t virtio_net_rsc_receive6(void *opq, NetClientState* nc,
+  const uint8_t *b
