Re: [lng-odp] generic core + HW specific drivers

2017-10-06 Thread Ola Liljedahl
On 5 October 2017 at 12:09, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolai...@nokia.com> wrote:
> No HTML mails, please.
>
>
> From: Bill Fischofer [mailto:bill.fischo...@linaro.org]
> Sent: Wednesday, October 04, 2017 3:55 PM
> To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolai...@nokia.com>
> Cc: Andriy Berestovskyy <andriy.berestovs...@caviumnetworks.com>; Ola 
> Liljedahl <ola.liljed...@linaro.org>; lng-odp@lists.linaro.org
> Subject: Re: [lng-odp] generic core + HW specific drivers
>
>
>
> On Wed, Oct 4, 2017 at 7:47 AM, Savolainen, Petri (Nokia - FI/Espoo) 
> <petri.savolai...@nokia.com> wrote:
>
>
>> -Original Message-
>> From: Andriy Berestovskyy 
>> [mailto:andriy.berestovs...@caviumnetworks.com]
>> Sent: Tuesday, October 03, 2017 8:22 PM
>> To: Savolainen, Petri (Nokia - FI/Espoo) 
>> <petri.savolai...@nokia.com>; Ola
>> Liljedahl <ola.liljed...@linaro.org>; lng-odp@lists.linaro.org
>> Subject: Re: [lng-odp] generic core + HW specific drivers
>>
>> Hey,
>> Please see below.
>>
>> On 03.10.2017 10:12, Savolainen, Petri (Nokia - FI/Espoo) wrote:
>> > So, we should be able to deliver ODP as a set of HW independent and
>> > HW specific packages (libraries). For example, minimal install would
>> >  include only odp, odp-linux and odp-test-suite, but when on arm64
>> > (and especially when on ThunderX) odp-thunderx would be installed
>>
>> There are architecture dependencies (i.e. i386, amd64, arm64 etc), but
>> there are no specific platform dependencies (i.e. Cavium ThunderX,
>> Cavium Octeon, NXP etc).
>>
>> In other words, there is no such mechanism in packaging to create a
>> package odp, which will automatically install package odp-thunderx only
>> on Cavium ThunderX platforms.
>
> I'd expect that ODP main package (e.g. for arm64) would run a script (written 
> by us) during install which digs out information about the system and sets up 
> ODP paths accordingly. E.g. libs/headers from the odp-thunderx package would 
> be added to search paths when installing into a ThunderX system. If the 
> system is not recognized, ODP libs/header paths would point to odp-linux.
>
> That's still trying to make this a static configuration that can be done at 
> install time. What about VMs/containers that are instantiated on different 
> hosts as they are deployed? This really needs to be determined at run time, 
> not install time.
>
>
>
> Also with a VM, all arm64 ODP packages would be present, and the problem to 
> solve would be which implementation to use (to link against). If run time 
> code can probe the host system (e.g. are we on ThunderX), so can a script. 
> An ignorant user might not run additional scripts and thus be left with the 
> default setup (odp-linux). A more aware user would run an additional script 
> before launching/building any ODP apps. This script would notice that we have 
> e.g. ThunderX HW and would change ODP paths to point to odp-thunderx 
> libs/headers. The HW discovery could be as simple as the cloud administrator 
> updating VM bootparams with SoC model information.
>
>
>>
>> All other projects you are mentioning (kernel, DPDK, Xorg) use
>> architecture dependency (different packages for different architectures)
>> combined with run time configuration/probing. A kernel driver might be
>> installed, but it will be unused until configured/probed.
>
> Those projects aim to maximize code re-use of the core part and minimize size 
> of the driver part. Optimally, we'd do the opposite - minimize the core part 
> to zero and dynamically link application directly to the right "driver" (== 
> HW specific odp implementation).
>
> If there's no core part, run time probing is not needed - install time 
> probing and lib/header path setup is enough.
>
> You're describing the embedded build case, which is similar to what we have 
> today with --enable-abi-compat=no. That's not changing. We're only talking 
> about what happens for --enable-abi-compat=yes builds.
>
>
>
> No, I'm pointing out that the more common core SW there is, the more there 
> are trade-offs and the less direct the HW access == less performance. For 
> optimal performance, the amount of common core SW is zero.
Yes, this is sort of the ideal, but I doubt this type of installation
will be accepted by e.g. Red Hat for inclusion in server-oriented
Linux distributions. Jon Masters seems to be strongly against this
(although I have only heard this second hand). That's why I
proposed the common (generic) core + platform-specific drivers model
that is used by e.g. Xorg:

[lng-odp] generic core + HW specific drivers

2017-09-29 Thread Ola Liljedahl
olli@vubuntu:~$ dpkg --get-selections | grep xorg
xorg install
xorg-docs-core install
xserver-xorg install
xserver-xorg-core install
xserver-xorg-input-all install
xserver-xorg-input-evdev install
xserver-xorg-input-libinput install
xserver-xorg-input-synaptics install
xserver-xorg-input-wacom install
xserver-xorg-video-all install
xserver-xorg-video-amdgpu install
xserver-xorg-video-ati install
xserver-xorg-video-fbdev install
xserver-xorg-video-intel install
xserver-xorg-video-mach64 install
xserver-xorg-video-neomagic install
xserver-xorg-video-nouveau install

Re: [lng-odp] [API-NEXT PATCH v9 3/6] linux-gen: sched scalable: add a bitset

2017-06-21 Thread Ola Liljedahl

On 21/06/2017, 22:00, "Dmitry Eremin-Solenikov"
<dmitry.ereminsoleni...@linaro.org> wrote:

>On 21.06.2017 21:14, Ola Liljedahl wrote:
>> 
>> On 20/06/2017, 15:04, "Savolainen, Petri (Nokia - FI/Espoo)"
>> <petri.savolai...@nokia.com> wrote:
>> 
>>>
>>>
>>>> +++ b/platform/linux-generic/include/odp_bitset.h
>>>> @@ -0,0 +1,210 @@
>>>> +/* Copyright (c) 2017, ARM Limited
>>>> + * All rights reserved.
>>>> + *
>>>> + * SPDX-License-Identifier: BSD-3-Clause
>>>> + */
>>>> +
>>>> +#ifndef _ODP_BITSET_H_
>>>> +#define _ODP_BITSET_H_
>>>> +
>>>> +#include 
>>>> +
>>>> +#include 
>>>> +
>>>>
>>>> 
>>>> +/*****************************************************************************
>>>> + * bitset abstract data type
>>>> + *****************************************************************************/
>>>> +/* This could be a struct of scalars to support larger bit sets */
>>>> +
>>>> +/*
>>>> + * Size of atomic bit set. This limits the max number of threads,
>>>> + * scheduler groups and reorder windows. On ARMv8/64-bit and x86-64,
>>>> the
>>>> + * (lock-free) max is 128
>>>> + */
>>>> +
>>>> +/* Find a suitable data type that supports lock-free atomic
>>>>operations
>>>> */
>>>> +#if defined(__ARM_ARCH) &&  __ARM_ARCH == 8 &&  __ARM_64BIT_STATE ==
>>>>1
>>>> &&
>>>
>>> Why ifdef ARM? Why is this code not in the arch directory?
>> Why is this car red?
>> Because I like it like that.
>
>I think it was agreed that arch-specific code should go to arch/ dirs,
>wasn't it?
If you bend backwards enough, you will always touch ground again with your
hands. It doesn't mean it is meaningful to do so. Especially not when you
can just lean forward and accomplish the same without the pain.

>
>
>-- 
>With best wishes
>Dmitry



Re: [lng-odp] [API-NEXT PATCH v9 3/6] linux-gen: sched scalable: add a bitset

2017-06-21 Thread Ola Liljedahl

On 20/06/2017, 15:04, "Savolainen, Petri (Nokia - FI/Espoo)"
 wrote:

>
>
>> +++ b/platform/linux-generic/include/odp_bitset.h
>> @@ -0,0 +1,210 @@
>> +/* Copyright (c) 2017, ARM Limited
>> + * All rights reserved.
>> + *
>> + * SPDX-License-Identifier: BSD-3-Clause
>> + */
>> +
>> +#ifndef _ODP_BITSET_H_
>> +#define _ODP_BITSET_H_
>> +
>> +#include 
>> +
>> +#include 
>> +
>> 
>> +/*****************************************************************************
>> + * bitset abstract data type
>> + *****************************************************************************/
>> +/* This could be a struct of scalars to support larger bit sets */
>> +
>> +/*
>> + * Size of atomic bit set. This limits the max number of threads,
>> + * scheduler groups and reorder windows. On ARMv8/64-bit and x86-64,
>>the
>> + * (lock-free) max is 128
>> + */
>> +
>> +/* Find a suitable data type that supports lock-free atomic operations
>>*/
>> +#if defined(__ARM_ARCH) &&  __ARM_ARCH == 8 &&  __ARM_64BIT_STATE == 1
>>&&
>
>Why ifdef ARM? Why is this code not in the arch directory?
Why is this car red?
Because I like it like that.


>
>-Petri
>
>
>> \
>> +defined(__SIZEOF_INT128__) && __SIZEOF_INT128__ == 16
>> +#define LOCKFREE16
>> +typedef __int128 bitset_t;
>> +#define ATOM_BITSET_SIZE (CHAR_BIT * __SIZEOF_INT128__)
>> +
>> +#elif __GCC_ATOMIC_LLONG_LOCK_FREE == 2 && \
>> +__SIZEOF_LONG_LONG__ != __SIZEOF_LONG__
>> +typedef unsigned long long bitset_t;
>> +#define ATOM_BITSET_SIZE (CHAR_BIT * __SIZEOF_LONG_LONG__)
>> +
>> +#elif __GCC_ATOMIC_LONG_LOCK_FREE == 2 && __SIZEOF_LONG__ !=
>> __SIZEOF_INT__
>> +typedef unsigned long bitset_t;
>> +#define ATOM_BITSET_SIZE (CHAR_BIT * __SIZEOF_LONG__)
>> +
>> +#elif __GCC_ATOMIC_INT_LOCK_FREE == 2
>> +typedef unsigned int bitset_t;
>> +#define ATOM_BITSET_SIZE (CHAR_BIT * __SIZEOF_INT__)
>> +
>> +#else
>> +/* Target does not support lock-free atomic operations */
>> +typedef unsigned int bitset_t;
>> +#define ATOM_BITSET_SIZE (CHAR_BIT * __SIZEOF_INT__)
>> +#endif
>> +
>> +#if ATOM_BITSET_SIZE <= 32
>
>
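
For illustration, a minimal sketch (not part of the patch; the helper names
are hypothetical) of how such a bitset is then operated on with the GCC
__atomic built-ins, assuming the unsigned long long variant was selected:

#include <limits.h>

typedef unsigned long long bitset_t;
#define ATOM_BITSET_SIZE (CHAR_BIT * __SIZEOF_LONG_LONG__)

/* Atomically set a bit, releasing prior writes to other threads */
static inline void atom_bitset_set(bitset_t *bs, unsigned int bit)
{
	__atomic_fetch_or(bs, (bitset_t)1 << bit, __ATOMIC_RELEASE);
}

/* Atomically clear a bit */
static inline void atom_bitset_clr(bitset_t *bs, unsigned int bit)
{
	__atomic_fetch_and(bs, ~((bitset_t)1 << bit), __ATOMIC_RELAXED);
}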



Re: [lng-odp] [API-NEXT PATCH v9 5/6] linux-gen: sched scalable: add scalable scheduler

2017-06-21 Thread Ola Liljedahl

On 20/06/2017, 15:58, "Savolainen, Petri (Nokia - FI/Espoo)"
 wrote:

>> --- a/platform/linux-generic/include/odp_config_internal.h
>> +++ b/platform/linux-generic/include/odp_config_internal.h
>> @@ -7,9 +7,7 @@
>>  #ifndef ODP_CONFIG_INTERNAL_H_
>>  #define ODP_CONFIG_INTERNAL_H_
>> 
>> -#ifdef __cplusplus
>> -extern "C" {
>> -#endif
>> +#include 
>
>Why do these configs need global visibility? This file should contain
>general configuration options.
>
>> 
>>  /*
>>   * Maximum number of pools
>> @@ -22,6 +20,13 @@ extern "C" {
>>  #define ODP_CONFIG_QUEUES 1024
>> 
>>  /*
>> + * Maximum queue depth. Maximum number of elements that can be stored
>>in
>> a
>> + * queue. This value is used only when the size is not explicitly
>> provided
>> + * during queue creation.
>> + */
>> +#define CONFIG_QUEUE_SIZE 4096
>> +
>> +/*
>>   * Maximum number of ordered locks per queue
>>   */
>>  #define CONFIG_QUEUE_MAX_ORD_LOCKS 4
>> @@ -120,7 +125,7 @@ extern "C" {
>>   *
>>   * This is the number of separate SHM areas that can be reserved
>> concurrently
>>   */
>> -#define ODPDRV_CONFIG_SHM_BLOCKS 48
>> +#define ODPDRV_CONFIG_SHM_BLOCKS ODP_CONFIG_SHM_BLOCKS
>
>
>Is this change necessary? Increases driver memory usage for no reason?
The scalable scheduler shouldn't have anything to do with the drivers and
their shmem use. So I think this change is unnecessary and should be
reverted.

>
>
>> +
>> +#endif  /* ODP_SCHEDULE_SCALABLE_H */
>> diff --git 
>>a/platform/linux-generic/include/odp_schedule_scalable_config.h
>> b/platform/linux-generic/include/odp_schedule_scalable_config.h
>> new file mode 100644
>> index ..febf379b
>> --- /dev/null
>> +++ b/platform/linux-generic/include/odp_schedule_scalable_config.h
>> @@ -0,0 +1,55 @@
>> +/* Copyright (c) 2017, ARM Limited
>> + * All rights reserved.
>> + *
>> + * SPDX-License-Identifier: BSD-3-Clause
>> + */
>> +
>> +#ifndef ODP_SCHEDULE_SCALABLE_CONFIG_H_
>> +#define ODP_SCHEDULE_SCALABLE_CONFIG_H_
>> +
>> +/*
>> + * Default scaling factor for the scheduler group
>> + *
>> + * This scaling factor is used when the application creates a scheduler
>> + * group with no worker threads.
>> + */
>> +#define CONFIG_DEFAULT_XFACTOR 4
>> +
>> +/*
>> + * Default weight (in events) for WRR in scalable scheduler
>> + *
>> + * This controls the per-queue weight for WRR between queues of the
>>same
>> + * priority in the scalable scheduler
>> + * A higher value improves throughput while a lower value increases
>> fairness
>> + * and thus likely decreases latency
>> + *
>> + * If WRR is undesired, set the value to ~0 which will use the largest
>> possible
>> + * weight
>> + *
>> + * Note: an API for specifying this on a per-queue basis would be
>>useful
>> but is
>> + * not yet available
>> + */
>> +#define CONFIG_WRR_WEIGHT 64
>> +
>> +/*
>> + * Split queue producer/consumer metadata into separate cache lines.
>> + * This is beneficial on e.g. Cortex-A57 but not so much on A53.
>> + */
>> +#define CONFIG_SPLIT_PRODCONS
>> +
>> +/*
>> + * Use locks to protect queue (ring buffer) and scheduler state updates
>> + * On x86, this decreases overhead noticeably.
>> + */
>> +#ifndef __ARM_ARCH
>> +#define CONFIG_QSCHST_LOCK
>> +/* Keep all ring buffer/qschst data together when using locks */
>> +#undef CONFIG_SPLIT_PRODCONS
>> +#endif
>> +
>> +/*
>> + * Maximum number of ordered locks per queue.
>> + */
>> +#define CONFIG_MAX_ORDERED_LOCKS_PER_QUEUE 2
>
>
>There's already CONFIG_QUEUE_MAX_ORD_LOCKS 4, in the general config file.
>Should not add the same define twice (with different value).
Yes it is unnecessary for the scalable scheduler to use a separate define
for this.

>
>
>
>> +
>> +#endif  /* ODP_SCHEDULE_SCALABLE_CONFIG_H_ */
>> diff --git a/platform/linux-
>> generic/include/odp_schedule_scalable_ordered.h b/platform/linux-
>> generic/include/odp_schedule_scalable_ordered.h
>> new file mode 100644
>> index ..9f3acf7a
>> --- /dev/null
>> +++ b/platform/linux-generic/include/odp_schedule_scalable_ordered.h
>> @@ -0,0 +1,132 @@
>> +/* Copyright (c) 2017, ARM Limited
>> + * All rights reserved.
>> + *
>> + * SPDX-License-Identifier: BSD-3-Clause
>> + */
>> +
>> +#ifndef ODP_SCHEDULE_SCALABLE_ORDERED_H
>> +#define ODP_SCHEDULE_SCALABLE_ORDERED_H
>> +
>> +#include 
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include <_ishmpool_internal.h>
>
>> +
>> +/* Number of reorder contexts in the reorder window.
>> + * Should be at least one per CPU.
>> + */
>> +#define RWIN_SIZE 32
>> +ODP_STATIC_ASSERT(CHECK_IS_POWER2(RWIN_SIZE), "RWIN_SIZE is not a power
>> of 2");
>> +
>> +#define NUM_OLOCKS 2
>
>Is this the same as CONFIG_MAX_ORDERED_LOCKS_PER_QUEUE or something
>different with similar name ?
This is the define that is actually used…
There is an obvious need for a cleanup here.

>
>
>> diff --git a/platform/linux-generic/odp_queue_if.c b/platform/linux-
>> generic/odp_queue_if.c
>> index c91f00eb..d7471dfc 100644
>> --- 

Re: [lng-odp] [API-NEXT PATCH v9 4/6] linux-gen: sched scalable: add a concurrent queue

2017-06-21 Thread Ola Liljedahl




On 20/06/2017, 15:12, "Savolainen, Petri (Nokia - FI/Espoo)"
 wrote:

>> +++ b/platform/linux-generic/include/odp_llqueue.h
>> @@ -0,0 +1,309 @@
>> +/* Copyright (c) 2017, ARM Limited.
>> + * All rights reserved.
>> + *
>> + * SPDX-License-Identifier:BSD-3-Clause
>> + */
>> +
>> +#ifndef ODP_LLQUEUE_H_
>> +#define ODP_LLQUEUE_H_
>> +
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#include 
>> +#include 
>> +
>> 
>> +/*****************************************************************************
>> + * Linked list queues
>> + *****************************************************************************/
>> +
>> +struct llqueue;
>> +struct llnode;
>> +
>> +static struct llnode *llq_head(struct llqueue *llq);
>> +static void llqueue_init(struct llqueue *llq);
>> +static void llq_enqueue(struct llqueue *llq, struct llnode *node);
>> +static struct llnode *llq_dequeue(struct llqueue *llq);
>> +static odp_bool_t llq_dequeue_cond(struct llqueue *llq, struct llnode
>> *exp);
>> +static odp_bool_t llq_cond_rotate(struct llqueue *llq, struct llnode
>> *node);
>> +static odp_bool_t llq_on_queue(struct llnode *node);
>> +
>> 
>> +/*****************************************************************************
>> + * The implementation(s)
>> + *****************************************************************************/
>> +
>> +#define SENTINEL ((void *)~(uintptr_t)0)
>> +
>> +#ifdef CONFIG_LLDSCD
>> +/* Implement queue operations using double-word LL/SC */
>
>> +
>> +#else
>> +/* Implement queue operations protected by a spin lock */
>> +
>
>There's a lot of ifdef'ed code in this file, basically two full parallel
>implementations.
This horse has been flogged before on the mailing list.

> The first is built only for ARM and the second for the rest. Would there
>be a way to build both always?
For ARMv7a and ARMv8a, you could build both versions. You really want to
use the LL/SC version on these architectures.

For architectures without double-word LL/SC, only the lock-based version
can be built.
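
For reference, a hedged sketch of the compile-time selection (CONFIG_LLDSCD
is the define tested in odp_llqueue.h; the exact condition here is
illustrative, not the actual code):

#if defined(__ARM_ARCH) && \
	(__ARM_ARCH == 7 || (__ARM_ARCH == 8 && __ARM_64BIT_STATE == 1))
#define CONFIG_LLDSCD	/* use the double-word LL/SC queue operations */
#else
#undef CONFIG_LLDSCD	/* fall back to the spin lock implementation */
#endif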

>
>-Petri
>



Re: [lng-odp] [API-NEXT PATCH v9 2/6] linux-gen: sched scalable: add arch files

2017-06-21 Thread Ola Liljedahl


On 20/06/2017, 15:00, "Savolainen, Petri (Nokia - FI/Espoo)"
 wrote:

>
>> +#endif  /* PLATFORM_LINUXGENERIC_ARCH_ARM_CPU_IDLING_H */
>> diff --git a/platform/linux-generic/arch/arm/odp_llsc.h
>>b/platform/linux-
>> generic/arch/arm/odp_llsc.h
>> new file mode 100644
>> index ..3ab5c909
>> --- /dev/null
>> +++ b/platform/linux-generic/arch/arm/odp_llsc.h
>> @@ -0,0 +1,249 @@
>> +/* Copyright (c) 2017, ARM Limited
>> + * All rights reserved.
>> + *
>> + * SPDX-License-Identifier: BSD-3-Clause
>> + */
>> +
>> +#ifndef PLATFORM_LINUXGENERIC_ARCH_ARM_LLSC_H
>> +#define PLATFORM_LINUXGENERIC_ARCH_ARM_LLSC_H
>> +
>> +#ifndef PLATFORM_LINUXGENERIC_ARCH_ARM_ODP_CPU_H
>> +#error This file should not be included directly, please include
>> odp_cpu.h
>> +#endif
>> +
>> +#if __ARM_ARCH == 7 || (__ARM_ARCH == 8 && __ARM_64BIT_STATE == 0)
>> +
>
>
>> +
>> +#if __ARM_ARCH == 8 && __ARM_64BIT_STATE == 1
>> +
>
>Build broken for ARMv6? There are so many #ifdefs that it's hard to tell
>which code path is built.
GCC preprocessor symbols for the ARM architecture(s) are a mess (and
different with different compiler versions, I think) but that's not our
fault.

> Maybe it would make sense to explicitly document/report an error when
>building for a non-supported ARM target. E.g. the original raspberry pi
>is ARMv6, it's possible that someone is building odp-linux for that...
Different versions of the ARM architecture are actually more or less
different architectures from an ODP perspective. Perhaps the architecture
recognition should be more specific and identify ARMv7a and ARMv8a (32-bit
and 64-bit) and treat everything else as "default". I think ODP should use
the default (aka generic) arch implementation when the architecture is not
recognised/supported.
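
Something along these lines (illustrative only, not actual tree content)
would make that recognition and fallback explicit:

#if defined(__ARM_ARCH) && (__ARM_ARCH == 7 || __ARM_ARCH == 8)
/* Recognised: use the ARM-specific odp_cpu.h/odp_llsc.h files */
#else
/* Not recognised: build the default (generic) arch implementation */
#warning "Unrecognised architecture, using generic arch files"
#endif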


>
>-Petri
>
>



Re: [lng-odp] [API-NEXT PATCH v6 6/6] Add scalable scheduler

2017-06-02 Thread Ola Liljedahl

On 02/06/2017, 12:53, "Peltonen, Janne (Nokia - FI/Espoo)"
 wrote:

>>> for packet output to first tell application that a packet was "accepted
>>>for transmission" and then drop it silently. Packet out (it's a simple
>>>function) should be able to determine if packet can be accepted for
>>>transmission and if it's accepted the packet will eventually go out.
>>Obviously, packet out is not so simple to implement when considering
>>order
>>restoration etc. The original linux-generic implementation was wrong.
>
>Ordering in the  Linux-generic implementation went accidentally broken
>in January when the new ordered queue implementation was added. I suppose
>it worked before that.
OK so remove the word "original" from my statement above.



Re: [lng-odp] Suspected SPAM - Re: [API-NEXT PATCH v6 6/6] Add scalable scheduler

2017-06-02 Thread Ola Liljedahl

>
>
>> -Original Message-
>> From: lng-odp [mailto:lng-odp-boun...@lists.linaro.org] On Behalf Of
>>Savolainen, Petri
>> (Nokia - FI/Espoo)
>> Sent: Friday, June 02, 2017 1:18 PM
>> To: Honnappa Nagarahalli <honnappa.nagaraha...@linaro.org>; Ola
>>Liljedahl
>> <ola.liljed...@arm.com>
>> Cc: Elo, Matias (Nokia - FI/Espoo) <matias@nokia.com>; nd
>><n...@arm.com>; Kevin Wang
>> <kevin.w...@arm.com>; Honnappa Nagarahalli
>><honnappa.nagaraha...@arm.com>; lng-
>> o...@lists.linaro.org
>> Subject: Suspected SPAM - Re: [lng-odp] [API-NEXT PATCH v6 6/6] Add
>>scalable scheduler
>> 
>> 
>> 
>> > -Original Message-----
>> > From: lng-odp [mailto:lng-odp-boun...@lists.linaro.org] On Behalf Of
>> > Honnappa Nagarahalli
>> > Sent: Thursday, June 01, 2017 11:30 PM
>> > To: Ola Liljedahl <ola.liljed...@arm.com>
>> > Cc: lng-odp@lists.linaro.org; Honnappa Nagarahalli
>> > <honnappa.nagaraha...@arm.com>; Elo, Matias (Nokia - FI/Espoo)
>> > <matias@nokia.com>; Kevin Wang <kevin.w...@arm.com>; nd
>><n...@arm.com>
>> > Subject: Re: [lng-odp] [API-NEXT PATCH v6 6/6] Add scalable scheduler
>> >
>> > On 1 June 2017 at 15:20, Ola Liljedahl <ola.liljed...@arm.com> wrote:
>> > >
>> > >
>> > >
>> > >
>> > > On 01/06/2017, 22:15, "Honnappa Nagarahalli"
>> > > <honnappa.nagaraha...@linaro.org> wrote:
>> > >
>> > >>On 1 June 2017 at 15:09, Ola Liljedahl <ola.liljed...@arm.com>
>>wrote:
>> > >>>
>> > >>>
>> > >>> On 01/06/2017, 21:03, "Bill Fischofer" <bill.fischo...@linaro.org>
>> > >>>wrote:
>> > >>>
>> > >>>>On Thu, Jun 1, 2017 at 10:59 AM, Honnappa Nagarahalli
>> > >>>><honnappa.nagaraha...@linaro.org> wrote:
>> > >>>>> On 1 June 2017 at 01:26, Elo, Matias (Nokia - FI/Espoo)
>> > >>>>> <matias@nokia.com> wrote:
>> > >>>>>>
>> > >>>>>>> On 31 May 2017, at 23:53, Bill Fischofer
>> > <bill.fischo...@linaro.org>
>> > >>>>>>>wrote:
>> > >>>>>>>
>> > >>>>>>> On Wed, May 31, 2017 at 8:12 AM, Elo, Matias (Nokia -
>>FI/Espoo)
>> > >>>>>>> <matias@nokia.com> wrote:
>> > >>>>>>>>
>> > >>>>>>>>>>> What's the purpose of calling ord_enq_multi() here? To
>>save
>> > >>>>>>>>>>>(stash)
>> > >>>>>>>>>>> packets if the thread is out-of-order?
>> > >>>>>>>>>>> And when the thread is in-order, it is re-enqueueing the
>> > packets
>> > >>>>>>>>>>>which
>> > >>>>>>>>>>> again will invoke pktout_enqueue/pktout_enq_multi but this
>> > time
>> > >>>>>>>>>>> ord_enq_multi() will not save the packets, instead they
>>will
>> > >>>>>>>>>>>actually be
>> > >>>>>>>>>>> transmitted by odp_pktout_send()?
>> > >>>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> Since transmitting packets may fail, out-of-order packets
>> > cannot
>> > >>>>>>>>>>be
>> > >>>>>>>>>> stashed here.
>> > >>>>>>>>> You mean that the TX queue of the pktio might be full so
>>not all
>> > >>>>>>>>>packets
>> > >>>>>>>>> will actually be enqueued for transmission.
>> > >>>>>>>>
>> > >>>>>>>> Yep.
>> > >>>>>>>>
>> > >>>>>>>>> This is an interesting case but is it a must to know how
>>many
>> > >>>>>>>>>packets are
>> > >>>>>>>>> actually accepted? Packets can always be dropped without
>>notice,
>> > >>>>>>>>>the
>> > >>>>&g

Re: [lng-odp] [API-NEXT PATCH v6 6/6] Add scalable scheduler

2017-06-02 Thread Ola Liljedahl


On 02/06/2017, 12:17, "Savolainen, Petri (Nokia - FI/Espoo)"
<petri.savolai...@nokia.com> wrote:

>
>
>> -Original Message-
>> From: lng-odp [mailto:lng-odp-boun...@lists.linaro.org] On Behalf Of
>> Honnappa Nagarahalli
>> Sent: Thursday, June 01, 2017 11:30 PM
>> To: Ola Liljedahl <ola.liljed...@arm.com>
>> Cc: lng-odp@lists.linaro.org; Honnappa Nagarahalli
>> <honnappa.nagaraha...@arm.com>; Elo, Matias (Nokia - FI/Espoo)
>> <matias@nokia.com>; Kevin Wang <kevin.w...@arm.com>; nd <n...@arm.com>
>> Subject: Re: [lng-odp] [API-NEXT PATCH v6 6/6] Add scalable scheduler
>> 
>> On 1 June 2017 at 15:20, Ola Liljedahl <ola.liljed...@arm.com> wrote:
>> >
>> >
>> >
>> >
>> > On 01/06/2017, 22:15, "Honnappa Nagarahalli"
>> > <honnappa.nagaraha...@linaro.org> wrote:
>> >
>> >>On 1 June 2017 at 15:09, Ola Liljedahl <ola.liljed...@arm.com> wrote:
>> >>>
>> >>>
>> >>> On 01/06/2017, 21:03, "Bill Fischofer" <bill.fischo...@linaro.org>
>> >>>wrote:
>> >>>
>> >>>>On Thu, Jun 1, 2017 at 10:59 AM, Honnappa Nagarahalli
>> >>>><honnappa.nagaraha...@linaro.org> wrote:
>> >>>>> On 1 June 2017 at 01:26, Elo, Matias (Nokia - FI/Espoo)
>> >>>>> <matias@nokia.com> wrote:
>> >>>>>>
>> >>>>>>> On 31 May 2017, at 23:53, Bill Fischofer
>> <bill.fischo...@linaro.org>
>> >>>>>>>wrote:
>> >>>>>>>
>> >>>>>>> On Wed, May 31, 2017 at 8:12 AM, Elo, Matias (Nokia - FI/Espoo)
>> >>>>>>> <matias@nokia.com> wrote:
>> >>>>>>>>
>> >>>>>>>>>>> What's the purpose of calling ord_enq_multi() here? To save
>> >>>>>>>>>>>(stash)
>> >>>>>>>>>>> packets if the thread is out-of-order?
>> >>>>>>>>>>> And when the thread is in-order, it is re-enqueueing the
>> packets
>> >>>>>>>>>>>which
>> >>>>>>>>>>> again will invoke pktout_enqueue/pktout_enq_multi but this
>> time
>> >>>>>>>>>>> ord_enq_multi() will not save the packets, instead they will
>> >>>>>>>>>>>actually be
>> >>>>>>>>>>> transmitted by odp_pktout_send()?
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Since transmitting packets may fail, out-of-order packets
>> cannot
>> >>>>>>>>>>be
>> >>>>>>>>>> stashed here.
>> >>>>>>>>> You mean that the TX queue of the pktio might be full so not
>>all
>> >>>>>>>>>packets
>> >>>>>>>>> will actually be enqueued for transmission.
>> >>>>>>>>
>> >>>>>>>> Yep.
>> >>>>>>>>
>> >>>>>>>>> This is an interesting case but is it a must to know how many
>> >>>>>>>>>packets are
>> >>>>>>>>> actually accepted? Packets can always be dropped without
>>notice,
>> >>>>>>>>>the
>> >>>>>>>>> question is from which point this is acceptable. If packets
>> >>>>>>>>>enqueued onto
>> >>>>>>>>> a pktout (egress) queue are accepted, this means that they
>>must
>> >>>>>>>>>also be
>> >>>>>>>>> put onto the driver TX queue (as done by odp_pktout_send)?
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Currently, the packet_io/queue APIs don't say anything about
>> >>>>>>>>packets
>> >>>>>>>>being
>> >>>>>>>> possibly dropped after successfully calling odp_queue_enq() to
>>a
>> >>>>>>>>pktout
>> >>>>>>>> event queue. So to be consistent with standard odp_queue_enq()
>> >>>>>>>>o

Re: [lng-odp] [API-NEXT PATCH v6 6/6] Add scalable scheduler

2017-06-01 Thread Ola Liljedahl




On 01/06/2017, 22:15, "Honnappa Nagarahalli"
<honnappa.nagaraha...@linaro.org> wrote:

>On 1 June 2017 at 15:09, Ola Liljedahl <ola.liljed...@arm.com> wrote:
>>
>>
>> On 01/06/2017, 21:03, "Bill Fischofer" <bill.fischo...@linaro.org>
>>wrote:
>>
>>>On Thu, Jun 1, 2017 at 10:59 AM, Honnappa Nagarahalli
>>><honnappa.nagaraha...@linaro.org> wrote:
>>>> On 1 June 2017 at 01:26, Elo, Matias (Nokia - FI/Espoo)
>>>> <matias@nokia.com> wrote:
>>>>>
>>>>>> On 31 May 2017, at 23:53, Bill Fischofer <bill.fischo...@linaro.org>
>>>>>>wrote:
>>>>>>
>>>>>> On Wed, May 31, 2017 at 8:12 AM, Elo, Matias (Nokia - FI/Espoo)
>>>>>> <matias@nokia.com> wrote:
>>>>>>>
>>>>>>>>>> What's the purpose of calling ord_enq_multi() here? To save
>>>>>>>>>>(stash)
>>>>>>>>>> packets if the thread is out-of-order?
>>>>>>>>>> And when the thread is in-order, it is re-enqueueing the packets
>>>>>>>>>>which
>>>>>>>>>> again will invoke pktout_enqueue/pktout_enq_multi but this time
>>>>>>>>>> ord_enq_multi() will not save the packets, instead they will
>>>>>>>>>>actually be
>>>>>>>>>> transmitted by odp_pktout_send()?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Since transmitting packets may fail, out-of-order packets cannot
>>>>>>>>>be
>>>>>>>>> stashed here.
>>>>>>>> You mean that the TX queue of the pktio might be full so not all
>>>>>>>>packets
>>>>>>>> will actually be enqueued for transmission.
>>>>>>>
>>>>>>> Yep.
>>>>>>>
>>>>>>>> This is an interesting case but is it a must to know how many
>>>>>>>>packets are
>>>>>>>> actually accepted? Packets can always be dropped without notice,
>>>>>>>>the
>>>>>>>> question is from which point this is acceptable. If packets
>>>>>>>>enqueued onto
>>>>>>>> a pktout (egress) queue are accepted, this means that they must
>>>>>>>>also be
>>>>>>>> put onto the driver TX queue (as done by odp_pktout_send)?
>>>>>>>>
>>>>>>>
>>>>>>> Currently, the packet_io/queue APIs don't say anything about
>>>>>>>packets
>>>>>>>being
>>>>>>> possibly dropped after successfully calling odp_queue_enq() to a
>>>>>>>pktout
>>>>>>> event queue. So to be consistent with standard odp_queue_enq()
>>>>>>>operations I
>>>>>>> think it is better to return the number of events actually accepted
>>>>>>>to the TX queue.
>>>>>>>
>>>>>>> To have more leeway one option would be to modify the API
>>>>>>>documentation to
>>>>>>> state that packets may still be dropped after a successful
>>>>>>>odp_queue_enq() call
>>>>>>> before reaching the NIC. If the application would like to be sure
>>>>>>>that the
>>>>>>> packets are actually sent, it should use odp_pktout_send() instead.
>>>>>>
>>>>>> Ordered queues simply say that packets will be delivered to the next
>>>>>> queue in the pipeline in the order they originated from their source
>>>>>> queue. What happens after that depends on the attributes of the
>>>>>>target
>>>>>> queue. If the target queue is an exit point from the application,
>>>>>>then
>>>>>> this is outside of ODP's scope.
>>>>>
>>>>> My point was that with stashing the application has no way of knowing
>>>>>if an
>>>>> ordered pktout enqueue call actually succeed. In case of parallel and
>>>>>atomic
>>>>> queues it does. So my question is, is this acceptable?
>>>>>
>>>> Also, currently, it is not possible for the application to have a
>>>> consistent 'wai

Re: [lng-odp] [API-NEXT PATCH v6 6/6] Add scalable scheduler

2017-06-01 Thread Ola Liljedahl


On 01/06/2017, 21:03, "Bill Fischofer"  wrote:

>On Thu, Jun 1, 2017 at 10:59 AM, Honnappa Nagarahalli
> wrote:
>> On 1 June 2017 at 01:26, Elo, Matias (Nokia - FI/Espoo)
>>  wrote:
>>>
 On 31 May 2017, at 23:53, Bill Fischofer 
wrote:

 On Wed, May 31, 2017 at 8:12 AM, Elo, Matias (Nokia - FI/Espoo)
  wrote:
>
 What's the purpose of calling ord_enq_multi() here? To save
(stash)
 packets if the thread is out-of-order?
 And when the thread is in-order, it is re-enqueueing the packets
which
 again will invoke pktout_enqueue/pktout_enq_multi but this time
 ord_enq_multi() will not save the packets, instead they will
actually be
 transmitted by odp_pktout_send()?

>>>
>>> Since transmitting packets may fail, out-of-order packets cannot be
>>> stashed here.
>> You mean that the TX queue of the pktio might be full so not all
>>packets
>> will actually be enqueued for transmission.
>
> Yep.
>
>> This is an interesting case but is it a must to know how many
>>packets are
>> actually accepted? Packets can always be dropped without notice, the
>> question is from which point this is acceptable. If packets
>>enqueued onto
>> a pktout (egress) queue are accepted, this means that they must
>>also be
>> put onto the driver TX queue (as done by odp_pktout_send)?
>>
>
> Currently, the packet_io/queue APIs don't say anything about packets
>being
> possibly dropped after successfully calling odp_queue_enq() to a
>pktout
> event queue. So to be consistent with standard odp_queue_enq()
>operations I
> think it is better to return the number of events actually accepted
>to the TX queue.
>
> To have more leeway one option would be to modify the API
>documentation to
> state that packets may still be dropped after a successful
>odp_queue_enq() call
> before reaching the NIC. If the application would like to be sure
>that the
> packets are actually sent, it should use odp_pktout_send() instead.

 Ordered queues simply say that packets will be delivered to the next
 queue in the pipeline in the order they originated from their source
 queue. What happens after that depends on the attributes of the target
 queue. If the target queue is an exit point from the application, then
 this is outside of ODP's scope.
>>>
>>> My point was that with stashing the application has no way of knowing
>>>if an
>>> ordered pktout enqueue call actually succeed. In case of parallel and
>>>atomic
>>> queues it does. So my question is, is this acceptable?
>>>
>> Also, currently, it is not possible for the application to have a
>> consistent 'wait/drop on destination queue full' policy for all the
>> queue types.
>
>Today applications have no way of knowing whether packets sent to a
>pktout_queue or tm_queue actually make it to the wire or whether they
>are vaporized as soon as they hit the wire, so there's no change here.
>An RC of 0 simply says that the packet was "accepted" for transmission
>and hence the caller no longer owns that packet handle. You need
>higher-level protocols to track end-to-end transmission and receipt.
>All that ordered queues say is that packets being sent to TX queues
>will have those TX calls made in the same order as the source queue
>they originated from.
>
>The only way to track packet disposition today is to (a) create a
>reference to the packet you want to transmit, (b) verify that
>odp_packet_has_ref(original_pkt) > 0, indicating that an actual
>reference was created, (c) transmit that reference, and (d) note when
>odp_packet_has_ref(original_pkt) returns to 0. That confirms that the
>reference has exited the scope of this ODP instance since a
>"successful" transmission will free that reference.
Doesn't this just confirm that the reference has been freed? But you don't
know if this was due to the packet actually being transmitted on the wire
or if it was dropped before that (which would also free the reference).

Back to my original question, how far into the "machine" can we return
(to SW) absolute knowledge of the states of packets?

With normal queues (including scheduled queues), a successful enqueue
guarantees that the packet (event) was actually enqueued. But pktio egress
queues are not normal queues, they are essentially representations of a
network interface's TX queues but also maintain the order restoration
function of events enqueued to a queue when processing an ordered queue.

I interpret your comments, Bill, as: even if enqueue to a pktio egress queue
is successful (the packet handle is no longer owned by the application),
the implementation from that moment on can do whatever it wants with the
packets (as long as 

Re: [lng-odp] [API-NEXT PATCH v6 6/6] Add scalable scheduler

2017-05-31 Thread Ola Liljedahl
On 31/05/2017, 12:18, "Elo, Matias (Nokia - FI/Espoo)"
<matias@nokia.com> wrote:


>
>> On 31 May 2017, at 12:04, Ola Liljedahl <ola.liljed...@arm.com> wrote:
>> 
>> 
>> 
>> On 31/05/2017, 10:38, "Peltonen, Janne (Nokia - FI/Espoo)"
>> <janne.pelto...@nokia.com> wrote:
>> 
>>> 
>>> 
>>> Ola Liljedahl wrote:
>>>> On 23/05/2017, 16:49, "Peltonen, Janne (Nokia - FI/Espoo)"
>>>> <janne.pelto...@nokia.com> wrote:
>>>> 
>>>> 
>>>>> 
>>>>>> +static int ord_enq_multi(uint32_t queue_index, void *p_buf_hdr[],
>>>>>> + int num, int *ret)
>>>>>> +{
>>>>>> +(void)queue_index;
>>>>>> +(void)p_buf_hdr;
>>>>>> +(void)num;
>>>>>> +(void)ret;
>>>>>> +return 0;
>>>>>> +}
>>>>> 
>>>>> How is packet order maintained when enqueuing packets read from an
>>>> ordered
>>>>> queue to a pktout queue? Matias' recent fix uses the ord_enq_multi
>>>>> scheduler
>>>>> function for that, but this version does not do any ordering. Or is
>>>>>the
>>>>> ordering guaranteed by some other means?
>>>> The scalable scheduler standard queue enqueue function also handles
>>>> ordered queues. odp_queue_scalable.c can refer to the same
>>>> thread-specific
>>>> data as odp_schedule_scalable.c so we don't need this internal
>>>> interface.
>>>> We could perhaps adapt the code to use this interface but I think this
>>>> interface is just an artefact of the implementation of the default
>>>> queues/scheduler.
>>> 
>>> The problem is that odp_pktout_queue_config() sets qentry->s.enqueue
>>> to pktout_enqueue() and that does not have any of the scalable
>>>scheduler
>>> specific magic that odp_queue_scalable.c:queue_enq{_multi}() has. So
>>> ordering does not happen for pktout queues even if it works for other
>>> queues, right?
>> This must be a recent change, it doesn’t look like that in the working
>> branch we are using.
>> I see the code when changing to the master branch.
>> The code in pktout_enqueue() does look like a hack:
>>if (sched_fn->ord_enq_multi(qentry->s.index, (void **)buf_hdr,
>> len, &ret))
>> A cast to “void **”???
>> 
>> What’s the purpose of calling ord_enq_multi() here? To save (stash)
>> packets if the thread is out-of-order?
>> And when the thread is in-order, it is re-enqueueing the packets which
>> again will invoke pktout_enqueue/pktout_enq_multi but this time
>> ord_enq_multi() will not save the packets, instead they will actually be
>> transmitted by odp_pktout_send()?
>> 
>
>Since transmitting packets may fail, out-of-order packets cannot be
>stashed here.
You mean that the TX queue of the pktio might be full so not all packets
will actually be enqueued for transmission.
This is an interesting case but is it a must to know how many packets are
actually accepted? Packets can always be dropped without notice, the
question is from which point this is acceptable. If packets enqueued onto
a pktout (egress) queue are accepted, this means that they must also be
put onto the driver TX queue (as done by odp_pktout_send)?
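
For comparison, direct transmission does report exactly how many packets the
TX queue accepted, which is the guarantee being discussed here (sketch;
pktout, pkts and num are placeholders):

int sent = odp_pktout_send(pktout, pkts, num);

if (sent < 0)
	sent = 0; /* error: nothing was accepted */
if (sent < num) /* TX queue full: the tail is still owned by the app */
	odp_packet_free_multi(&pkts[sent], num - sent);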


>With the current scheduler implementation sched_fn->ord_enq_multi() waits
>until
>in-order and always returns 0 (in case of pktout queue). After this
>odp_pktout_send()
>is called.
>
>-Matias
>



Re: [lng-odp] [API-NEXT PATCH v6 6/6] Add scalable scheduler

2017-05-31 Thread Ola Liljedahl


On 31/05/2017, 10:38, "Peltonen, Janne (Nokia - FI/Espoo)"
<janne.pelto...@nokia.com> wrote:

>
>
>Ola Liljedahl wrote:
>> On 23/05/2017, 16:49, "Peltonen, Janne (Nokia - FI/Espoo)"
>> <janne.pelto...@nokia.com> wrote:
>> 
>> 
>> >
>> >> +static int ord_enq_multi(uint32_t queue_index, void *p_buf_hdr[],
>> >> +  int num, int *ret)
>> >> +{
>> >> + (void)queue_index;
>> >> + (void)p_buf_hdr;
>> >> + (void)num;
>> >> + (void)ret;
>> >> + return 0;
>> >> +}
>> >
>> >How is packet order maintained when enqueuing packets read from an
>>ordered
>> >queue to a pktout queue? Matias' recent fix uses the ord_enq_multi
>> >scheduler
>> >function for that, but this version does not do any ordering. Or is the
>> >ordering guaranteed by some other means?
>> The scalable scheduler standard queue enqueue function also handles
>> ordered queues. odp_queue_scalable.c can refer to the same
>>thread-specific
>> data as odp_schedule_scalable.c so we don't need this internal
>>interface.
>> We could perhaps adapt the code to use this interface but I think this
>> interface is just an artefact of the implementation of the default
>> queues/scheduler.
>
>The problem is that odp_pktout_queue_config() sets qentry->s.enqueue
>to pktout_enqueue() and that does not have any of the scalable scheduler
>specific magic that odp_queue_scalable.c:queue_enq{_multi}() has. So
>ordering does not happen for pktout queues even if it works for other
>queues, right?
This must be a recent change, it doesn’t look like that in the working
branch we are using.
I see the code when changing to the master branch.
The code in pktout_enqueue() does look like a hack:
if (sched_fn->ord_enq_multi(qentry->s.index, (void **)buf_hdr,
len, &ret))
A cast to “void **”???

What’s the purpose of calling ord_enq_multi() here? To save (stash)
packets if the thread is out-of-order?
And when the thread is in-order, it is re-enqueueing the packets which
again will invoke pktout_enqueue/pktout_enq_multi but this time
ord_enq_multi() will not save the packets, instead they will actually be
transmitted by odp_pktout_send()?



>
>   Janne
>
>> 
>> >
>> >> +static void order_lock(void)
>> >> +{
>> >> +}
>> >> +
>> >> +static void order_unlock(void)
>> >> +{
>> >> +}
>> >
>> >Is it ok that these are no-ops? tm_enqueue() seems to use these.
>> No these ought to be implemented. We have fixed that now. Thanks.
>> 
>> 
>> -- Ola
>> 
>> Ola Liljedahl, Networking System Architect, ARM
>> Phone: +46 706 866 373  Skype: ola.liljedahl
>> 
>> 
>> 
>> 
>> >
>> >> +
>> >> +const schedule_fn_t schedule_scalable_fn = {
>> >> + .pktio_start= pktio_start,
>> >> + .thr_add= thr_add,
>> >> + .thr_rem= thr_rem,
>> >> + .num_grps   = num_grps,
>> >> + .init_queue = init_queue,
>> >> + .destroy_queue  = destroy_queue,
>> >> + .sched_queue= sched_queue,
>> >> + .ord_enq_multi  = ord_enq_multi,
>> >> + .init_global= schedule_init_global,
>> >> + .term_global= schedule_term_global,
>> >> + .init_local = schedule_init_local,
>> >> + .term_local = schedule_term_local,
>> >> + .order_lock = order_lock,
>> >> + .order_unlock   = order_unlock,
>> >> +};
>> >
>> >Janne
>> >
>> >
>



Re: [lng-odp] [API-NEXT PATCH v6 6/6] Add scalable scheduler

2017-05-30 Thread Ola Liljedahl
On 23/05/2017, 16:49, "Peltonen, Janne (Nokia - FI/Espoo)"
<janne.pelto...@nokia.com> wrote:


>
>> +static int ord_enq_multi(uint32_t queue_index, void *p_buf_hdr[],
>> + int num, int *ret)
>> +{
>> +(void)queue_index;
>> +(void)p_buf_hdr;
>> +(void)num;
>> +(void)ret;
>> +return 0;
>> +}
>
>How is packet order maintained when enqueuing packets read from an ordered
>queue to a pktout queue? Matias' recent fix uses the ord_enq_multi
>scheduler
>function for that, but this version does not do any ordering. Or is the
>ordering guaranteed by some other means?
The scalable scheduler standard queue enqueue function also handles
ordered queues. odp_queue_scalable.c can refer to the same thread-specific
data as odp_schedule_scalable.c so we don't need this internal interface.
We could perhaps adapt the code to use this interface but I think this
interface is just an artefact of the implementation of the default
queues/scheduler.

>
>> +static void order_lock(void)
>> +{
>> +}
>> +
>> +static void order_unlock(void)
>> +{
>> +}
>
>Is it ok that these are no-ops? tm_enqueue() seems to use these.
No these ought to be implemented. We have fixed that now. Thanks.


-- Ola

Ola Liljedahl, Networking System Architect, ARM
Phone: +46 706 866 373  Skype: ola.liljedahl




>
>> +
>> +const schedule_fn_t schedule_scalable_fn = {
>> +.pktio_start= pktio_start,
>> +.thr_add= thr_add,
>> +.thr_rem= thr_rem,
>> +.num_grps   = num_grps,
>> +.init_queue = init_queue,
>> +.destroy_queue  = destroy_queue,
>> +.sched_queue= sched_queue,
>> +.ord_enq_multi  = ord_enq_multi,
>> +.init_global= schedule_init_global,
>> +.term_global= schedule_term_global,
>> +.init_local = schedule_init_local,
>> +.term_local = schedule_term_local,
>> +.order_lock = order_lock,
>> +.order_unlock   = order_unlock,
>> +};
>
>   Janne
>
>



Re: [lng-odp] [PATCH] test: odp_sched_latency: robust draining of queues

2017-04-25 Thread Ola Liljedahl


On 25/04/2017, 14:32, "Savolainen, Petri (Nokia - FI/Espoo)"
<petri.savolai...@nokia-bell-labs.com> wrote:

>
>
>> -Original Message-
>> From: Ola Liljedahl [mailto:ola.liljed...@arm.com]
>> Sent: Tuesday, April 25, 2017 1:56 PM
>> To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia-bell-
>> labs.com>; Brian Brooks <brian.bro...@arm.com>; lng-odp@lists.linaro.org
>> Cc: nd <n...@arm.com>
>> Subject: Re: [lng-odp] [PATCH] test: odp_sched_latency: robust draining
>>of
>> queues
>> 
>> Another thing.
>> 
>> 
>> On 25/04/2017, 12:26, "Savolainen, Petri (Nokia - FI/Espoo)"
>> <petri.savolai...@nokia-bell-labs.com> wrote:
>> 
>> >
>> >
>> >> -Original Message-
>> >> From: lng-odp [mailto:lng-odp-boun...@lists.linaro.org] On Behalf Of
>> >>Brian
>> >> Brooks
>> >> Sent: Monday, April 24, 2017 11:59 PM
>> >> To: lng-odp@lists.linaro.org
>> >> Cc: Ola Liljedahl <ola.liljed...@arm.com>
>> >> Subject: [lng-odp] [PATCH] test: odp_sched_latency: robust draining
>>of
>> >> queues
>> >>
>> >> From: Ola Liljedahl <ola.liljed...@arm.com>
>> >>
>> >> In order to robustly drain all queues when the benchmark has
>> >> ended, we enqueue a special event on every queue and invoke
>> >> the scheduler until all such events have been received.
>> >>
>> >
>> >odp_schedule_pause();
>> >
>> >while (1) {
>> >ev = odp_schedule(&src_queue, ODP_SCHED_NO_WAIT);
>> >
>> >if (ev == ODP_EVENT_INVALID)
>> >break;
>> >
>> >if (odp_queue_enq(src_queue, ev)) {
>> >LOG_ERR("[%i] Queue enqueue failed.\n",
>> thr);
>> >odp_event_free(ev);
>> >return -1;
>> >}
>> >}
>> >
>> >odp_schedule_resume();
>> Is it good to call odp_schedule_resume() here? Isn't it legal and
>>possible
>> that the scheduler does or requests some speculative prescheduling in
>>the
>> resume call? Thus defying the schedule_pause and draining of
>>prescheduled
>> (stashed) events happening just before.
>
>
>The loop above ensures that other threads proceed while this thread waits.
>
>Resume should not reserve a schedule context (do pre-scheduling), only
>schedule() does reserve and free a context.
The spec does not say this.

/**
* Resume scheduling
*
* Resume global scheduling for this thread. After this call, all schedule
* calls will schedule normally (perform global scheduling).
*/
void odp_schedule_resume(void);


“Resume scheduling” could easily be interpreted as allowing pre-scheduling
or enabling some “global scheduler” to schedule events for this thread. I
can easily imagine that when using a HW scheduler with some scheduling
latency, one would always send a scheduling request in advance in order to
hide the latency.

The description for odp_schedule() is written as if pre-scheduling or
stashing cannot occur.

> The clear_sched_queues() loops as long as there are events and stops
>when sees an EVENT_INVALID == no context.
Yes I have added a loop calling odp_schedule() until it returns
EVENT_INVALID.
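I.e. something like (sketch):

odp_event_t ev;

while ((ev = odp_schedule(NULL, ODP_SCHED_NO_WAIT)) != ODP_EVENT_INVALID)
	odp_event_free(ev);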

>
>-Petri
>
>
>> 
>> 
>> >
>> >odp_barrier_wait(&globals->barrier);
>> >
>> >clear_sched_queues();
>> >
>> >
>> >What is the issue that this patch fixes? This sequence should be quite
>> >robust already since no new enqueues happen after the barrier. In a
>> >simple test code like this, the latency from last enq() (through the
>> >barrier) to schedule loop (in clear_sched_queues()) could be overcome
>> >just by not exiting after the first EVENT_INVALID from scheduler, but
>> >after N EVENT_INVALIDs in a row.
>> >
>> >Also in your patch, thread should exit only after scheduler returns
>> >EVENT_INVALID.
>> >
>> >
>> >-Petri
>> >
>



Re: [lng-odp] [PATCH] test: odp_sched_latency: robust draining of queues

2017-04-25 Thread Ola Liljedahl

On 25/04/2017, 12:54, "Savolainen, Petri (Nokia - FI/Espoo)"
 wrote:

>Also in your patch, thread should exit only after scheduler returns
>EVENT_INVALID.
>Since the cool_down event is the last event on all queues (as they are
>enqueued after all threads have passed the barrier), when we have
>received all cool_down events we know that there are no other events on
>these queues. No need to call odp_schedule() until it returns
>ODP_EVENT_INVALID (which can happen spuriously anyway so doesn't signify
>anything).
>
>
>It signifies release of the schedule context. For a robust exit,
>application should release the current context.
OK. Either odp_schedule() must have returned invalid event or we need an
explicit release call.
But you don't think this is something odp_term_local() should handle?
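
A sketch of such a robust exit sequence (odp_schedule_release_atomic() and
odp_schedule_release_ordered() are the explicit release hints in the ODP
API; whether they are sufficient here is exactly the open question):

odp_event_t ev;

odp_schedule_pause();
while ((ev = odp_schedule(NULL, ODP_SCHED_NO_WAIT)) != ODP_EVENT_INVALID)
	odp_event_free(ev); /* drain any prescheduled stragglers */
odp_schedule_release_atomic();	/* hint: release atomic context if held */
odp_schedule_release_ordered();	/* hint: release ordered context if held */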

>
>-Petri
>
>



Re: [lng-odp] [PATCH] test: odp_sched_latency: robust draining of queues

2017-04-25 Thread Ola Liljedahl
Another thing.


On 25/04/2017, 12:26, "Savolainen, Petri (Nokia - FI/Espoo)"
<petri.savolai...@nokia-bell-labs.com> wrote:

>
>
>> -Original Message-
>> From: lng-odp [mailto:lng-odp-boun...@lists.linaro.org] On Behalf Of
>>Brian
>> Brooks
>> Sent: Monday, April 24, 2017 11:59 PM
>> To: lng-odp@lists.linaro.org
>> Cc: Ola Liljedahl <ola.liljed...@arm.com>
>> Subject: [lng-odp] [PATCH] test: odp_sched_latency: robust draining of
>> queues
>> 
>> From: Ola Liljedahl <ola.liljed...@arm.com>
>> 
>> In order to robustly drain all queues when the benchmark has
>> ended, we enqueue a special event on every queue and invoke
>> the scheduler until all such events have been received.
>> 
>
>   odp_schedule_pause();
>
>   while (1) {
>   ev = odp_schedule(&src_queue, ODP_SCHED_NO_WAIT);
>
>   if (ev == ODP_EVENT_INVALID)
>   break;
>
>   if (odp_queue_enq(src_queue, ev)) {
>   LOG_ERR("[%i] Queue enqueue failed.\n", thr);
>   odp_event_free(ev);
>   return -1;
>   }
>   }
>
>   odp_schedule_resume();
Is it good to call odp_schedule_resume() here? Isn't it legal and possible
that the scheduler does or requests some speculative prescheduling in the
resume call? Thus defying the schedule_pause and draining of prescheduled
(stashed) events happening just before.


>
>   odp_barrier_wait(&globals->barrier);
>
>   clear_sched_queues();
>
>
>What is the issue that this patch fixes? This sequence should be quite
>robust already since no new enqueues happen after the barrier. In a
>simple test code like this, the latency from last enq() (through the
>barrier) to schedule loop (in clear_sched_queues()) could be overcome
>just by not exiting after the first EVENT_INVALID from scheduler, but
>after N EVENT_INVALIDs in a row.
>
>Also in your patch, thread should exit only after scheduler returns
>EVENT_INVALID.
>
>
>-Petri
>



Re: [lng-odp] [PATCH] test: odp_sched_latency: robust draining of queues

2017-04-25 Thread Ola Liljedahl

On 25/04/2017, 12:26, "Savolainen, Petri (Nokia - FI/Espoo)" 
<petri.savolai...@nokia-bell-labs.com>
 wrote:



-Original Message-
From: lng-odp [mailto:lng-odp-boun...@lists.linaro.org] On Behalf Of Brian
Brooks
Sent: Monday, April 24, 2017 11:59 PM
To: lng-odp@lists.linaro.org
Cc: Ola Liljedahl <ola.liljed...@arm.com>
Subject: [lng-odp] [PATCH] test: odp_sched_latency: robust draining of
queues
From: Ola Liljedahl <ola.liljed...@arm.com>
In order to robustly drain all queues when the benchmark has
ended, we enqueue a special event on every queue and invoke
the scheduler until all such events have been received.

odp_schedule_pause();

while (1) {
ev = odp_schedule(&src_queue, ODP_SCHED_NO_WAIT);

if (ev == ODP_EVENT_INVALID)
break;

if (odp_queue_enq(src_queue, ev)) {
LOG_ERR("[%i] Queue enqueue failed.\n", thr);
odp_event_free(ev);
return -1;
}
}

odp_schedule_resume();

odp_barrier_wait(&globals->barrier);

clear_sched_queues();


What is the issue that this patch fixes?
The issue is that odp_schedule() (even with a timeout) returns 
ODP_EVENT_INVALID but the queues are not actually empty. In a loosely 
synchronised (e.g. using weak ordering) queue and scheduler implementation, 
odp_schedule() can spuriously return EVENT_INVALID. This happens infrequently 
on some A57 targets.

This sequence should be quite robust already since no new enqueues happen after 
the barrier. In a simple test code like this, the latency from last enq() 
(through the barrier) to schedule loop (in clear_sched_queues()) could be 
overcome just by not exiting after the first EVENT_INVALID from scheduler, but 
after N EVENT_INVALIDs in a row.
In the scalable scheduler & queue implementation, it can take some time before 
enqueued events become visible and the corresponding ODP queues pushed to some 
scheduler queue. So odp_schedule() can return ODP_EVENT_INVALID, even when 
called with a timeout. There is no timeout or no amount of INVALID_EVENT 
returns that *guarantees* that the queues have been drained.


Also in your patch, thread should exit only after scheduler returns 
EVENT_INVALID.
Since the cool_down event is the last event on all queues (as they are enqueued 
after all threads have passed the barrier), when we have received all cool_down 
events we know that there are no other events on these queues. No need to 
call odp_schedule() until it returns ODP_EVENT_INVALID (which can happen 
spuriously anyway so doesn’t signify anything).



-Petri




Re: [lng-odp] [PATCH] test: odp_sched_latency: robust draining of queues

2017-04-24 Thread Ola Liljedahl
(Responding from PoC Outlook)

From:  Bill Fischofer <bill.fischo...@linaro.org>
Date:  Tuesday, 25 April 2017 at 00:00
To:  Brian Brooks <brian.bro...@arm.com>
Cc:  lng-odp-forward <lng-odp@lists.linaro.org>, Ola Liljedahl
<ola.liljed...@arm.com>
Subject:  Re: [lng-odp] [PATCH] test: odp_sched_latency: robust draining
of queues




On Mon, Apr 24, 2017 at 3:58 PM, Brian Brooks
<brian.bro...@arm.com> wrote:

From: Ola Liljedahl <ola.liljed...@arm.com>

In order to robustly drain all queues when the benchmark has
ended, we enqueue a special event on every queue and invoke
the scheduler until all such events have been received.

Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
Reviewed-by: Brian Brooks <brian.bro...@arm.com>
---
 test/common_plat/performance/odp_sched_latency.c | 51
++--
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/test/common_plat/performance/odp_sched_latency.c
b/test/common_plat/performance/odp_sched_latency.c
index 2b28cd7b..7ba99fc6 100644
--- a/test/common_plat/performance/odp_sched_latency.c
+++ b/test/common_plat/performance/odp_sched_latency.c
@@ -57,9 +57,10 @@ ODP_STATIC_ASSERT(LO_PRIO_QUEUES <= MAX_QUEUES, "Too
many LO priority queues");

 /** Test event types */
 typedef enum {
-   WARM_UP, /**< Warm up event */
-   TRAFFIC, /**< Event used only as traffic load */
-   SAMPLE   /**< Event used to measure latency */
+   WARM_UP,  /**< Warm up event */
+   COOL_DOWN,/**< Last event on queue */
+   TRAFFIC,  /**< Event used only as traffic load */
+   SAMPLE/**< Event used to measure latency */
 } event_type_t;

 /** Test event */
@@ -114,16 +115,40 @@ typedef struct {
  *
  * Retry to be sure that all buffers have been scheduled.
  */
-static void clear_sched_queues(void)
+static void clear_sched_queues(test_globals_t *globals)
 {
odp_event_t ev;
+   odp_buffer_t buf;
+   test_event_t *event;
+   int i, j;
+   uint32_t numtogo = 0;



int numtogo would be more standard here, and consistent with i, j
immediately above.
[Ola] Generally I prefer unsigneds for variables that should never be
negative. Is there a good reason for using signed ints instead?

 


-   while (1) {
-   ev = odp_schedule(NULL, ODP_SCHED_NO_WAIT);
-
-   if (ev == ODP_EVENT_INVALID)
-   break;
-
+   /* Enqueue special cool_down events on all queues (one per queue)
*/
+   for (i = 0; i < NUM_PRIOS; i++) {
+   for (j = 0; j < globals->args.prio[i].queues; j++) {
+   buf = odp_buffer_alloc(globals->pool);
+   if (buf == ODP_BUFFER_INVALID) {
+   LOG_ERR("Buffer alloc failed.\n");
+   return;
+   }



This isn't terribly robust. In the unlikely event that this call fails
you're leaving a bunch of events (including the previously successfully
allocated COOL_DOWN markers) on the queues. At minimum you should just
break out of this "marking" phase and
proceed to drain the remaining queues for the numtogo markers already on
them.
[Ola] What does partial draining achieve?

 More robust would be to preallocate all the markers needed up front or
else do each queue individually to avoid having to buffer the markers.
[Ola] Using only one marker event and cycling through the queues while
draining them is an interesting idea. I might code this and see how it
looks.
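
A rough sketch of that single-marker variant (hypothetical; it reuses the
names from the patch above and assumes marker_ev holds the one COOL_DOWN
event):

for (i = 0; i < NUM_PRIOS; i++) {
	for (j = 0; j < globals->args.prio[i].queues; j++) {
		if (odp_queue_enq(globals->queue[i][j], marker_ev))
			return; /* enqueue failed, give up */
		do {
			ev = odp_schedule(NULL, ODP_SCHED_WAIT);
			event = odp_buffer_addr(odp_buffer_from_event(ev));
			if (event->type != COOL_DOWN)
				odp_event_free(ev); /* drained traffic event */
		} while (event->type != COOL_DOWN);
		marker_ev = ev; /* marker came back, reuse for next queue */
	}
}
odp_event_free(marker_ev);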

[Ola] AFAIK, this is robust in the absence of errors.
You are raising the bar. The code is full of "return -1" if some
unexpected error occurs. Sometimes some known resource (e.g. event) is
freed. I don't think this program is guaranteed to clean up all resources
if an error occurs. Even if this was the goal, actually verifying this
would be cumbersome, achieving full code coverage for all error-related
code paths. The used coding style also makes visual inspection difficult.

Personally I am happy if (all) resources are freed for successful
executions but don't think it is necessary to spend that much effort on
cleaning up after unexpected errors (which probably increases the risk of
new errors, trying to do too much when the system is already in an
unexpected and potentially inconsistent state is asking for trouble),
better to just die quickly. You don¹t make systems more robust by adding
complexity.

 

+   event = odp_buffer_addr(buf);
+   event->type = COOL_DOWN;
+   ev = odp_buffer_to_event(buf);
+   if (odp_queue_enq(globals->queue[i][j], ev)) {
+   LOG_ERR("Queue enqueue failed.\n");
+   odp_event_free(ev);
+   return;
+   }
+   numtogo++;
+ 

Re: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software scheduler

2017-04-16 Thread Ola Liljedahl
On 10 April 2017 at 10:56, Peltonen, Janne (Nokia - FI/Espoo)
<janne.pelto...@nokia.com> wrote:
> Hi,
>
> Ola Liljedahl <ola.liljed...@linaro.org> wrote:
>> Peltonen, Janne (Nokia - FI/Espoo) <janne.pelto...@nokia.com> wrote:
>> > In an IPsec GW (as a use case example) one might want to do all
>> > stateless processing (like ingress and egress IP processing and the
>> > crypto operations) in ordered contexts to get parallelism, but the
>> > stateful part (replay-check and sequence number generation) in
>> > an atomic context (or holding an ordered lock) so that not only
>> > the packets are output in ingress order but also their sequence
>> > numbers are in the same order.
>>
>> To what extent does IPsec sequence number ordering have to equal actual
>> transmission order? If an IPsec GW preserves ingress-to-egress packet order,
>> does it matter that this order might not be the same as the IPsec sequence
>> number order? If IPsec SN's are not used for reordering, just for replay
>> protection, I don't see why the two number series have to match.
>
> The sequence numbers of course do not need to fully match the transmission
> order because the anti-replay window mechanism can tolerate out-of-sequence
> packets (caused either by packet reordering in the network or by sequence
> number assignment not quite following the transmission order.
>
> But assigning sequence numbers out of order with respect to the packet
> order (which hopefully stays the same between the ingress and egress of
> an SGW) eats into the out-of-sequence tolerance budget (i.e. the window
> size at the other end) and leaves less of the budget for actual
> reordering in the network.
>
> Whether out-of-sequence sequence number assignment is ok or problematic
> depends on the peer configuration, network, possible QoS induced
> packet reordering and the magnitude of the possible sequence number
> (not packet) reordering with respect to transmission order in the sender.
>
> Often the antireplay window is something like 32 or 64 packets
That seems like a very small window. I understand the simplicity enabled
by such small windows, but is it really enough for modern high speed networks
with multiple L2/L3 switch/router hops (possibly with some link aggregation
in there as well)?

> and maybe
> not all of that can be used by the IPsec sender for relaxed ordering of
> the sequence number assignment. One issue is that the size of the replay
> window is not negotiated so the sender cannot tell the receiver that
> a bigger window than normal is needed.
The receiver should be able to adapt the size of the antireplay window by
monitoring the number of (supposedly) stale packets (SN's). Has such a design
been tried? Do you have any "quality" requirements here, i.e. how large a
proportion of packets is allowed to be dropped due to the limited size of the
antireplay window? I assume there are higher level SLA's that control packet
loss; perhaps it is up to the service provider to use that packet loss budget
as it sees fit.
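
To make the "window" concrete, this is essentially the RFC 4303 style
bitmap check; a minimal sketch with a fixed 64-packet window (not from
any ODP code, all names invented here):

#include <stdbool.h>
#include <stdint.h>

#define REPLAY_WIN 64

struct replay_state {
        uint64_t last_seq; /* highest SN accepted so far */
        uint64_t window;   /* bit n set => last_seq - n already seen */
};

static bool replay_check_update(struct replay_state *rs, uint64_t seq)
{
        uint64_t diff;

        if (seq > rs->last_seq) {
                uint64_t shift = seq - rs->last_seq;

                rs->window = shift >= REPLAY_WIN ? 0 : rs->window << shift;
                rs->window |= 1; /* bit 0 == last_seq */
                rs->last_seq = seq;
                return true; /* new highest SN, always accepted */
        }
        diff = rs->last_seq - seq;
        if (diff >= REPLAY_WIN)
                return false; /* older than the window, drop */
        if (rs->window & ((uint64_t)1 << diff))
                return false; /* replay, drop */
        rs->window |= (uint64_t)1 << diff;
        return true;
}

Every SN the sender assigns out of transmission order consumes bits of
this window that are then not available for tolerating reordering in the
network.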

>
>> > That said, some might argue that IPsec replay window can take care
>> > not only of packet reordering in the network but also of reordering
>> > inside an IPsec GW and therefore the atomic context (or ordered lock)
>> > is not necessarily needed in all implementations.
>>
>> Do you mean that the replay protection also should be responsible for
>> end-to-end (IPsec GW to GW) order restoration?
>
> No, I do not mean that and I think it is not in general correct for
> an IPsec GW to reorder received packets to the sequence number order.
If order restoration adds latency, it can definitely do harm. And even if it
could be done without adding latency (e.g. in some queue), we don't know if
the SN order is the "real" order and whether order restoration actually is
beneficial.

>
> What I mean (but formulated sloppily) is that the window mechanism of
> replay protection can tolerate out-of-sequence sequence numbers to some
> extent even when the cause is not the network but the sending IPsec GW.
Well you could consider (parts of) the IPsec GW itself to be part of
the network...
Where does the network start? You can consider the transmitting NIC or the
cables the start of the network but what if you have link aggregation with
independent NIC's? The network must have started at some earlier shared
point where there is an unambiguous packet order. Is there always such a point?

>
> So, depending on the implementation and on the circumstances, one might
> want to ensure that sequence number gets assigned in the transmission
> order or one might decide not to worry about it and let the window
> mechanism in the receiver handle it.
Isn't 

Re: [lng-odp] [PATCH 1/3] test: l2fwd: add group option

2017-04-16 Thread Ola Liljedahl
On 10 April 2017 at 08:43, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolai...@nokia-bell-labs.com> wrote:
>
>
>> -Original Message-
>> From: Ola Liljedahl [mailto:ola.liljed...@linaro.org]
>> Sent: Saturday, April 08, 2017 12:13 AM
>> To: Petri Savolainen <petri.savolai...@linaro.org>
>> Cc: lng-odp@lists.linaro.org
>> Subject: Re: [lng-odp] [PATCH 1/3] test: l2fwd: add group option
>>
>> On 6 April 2017 at 13:59, Petri Savolainen <petri.savolai...@linaro.org>
>> wrote:
>> >
>> > User may give number of scheduling groups to test
>> > scheduler performance with other that the default (all
>> > threads) group. Both pktios and threads are allocated
>>
>> Isn't all *workers* a better default scheduler group? In this and in
>> other examples and benchmarks.
>
> The new thing in this patch is the -g option which enables testing of newly 
> created (other than default) groups. The default is not changed. In practice 
> "all" vs. "workers" do not matter much in our current test apps, since ctrl 
> threads do not call schedule().
It makes a difference to the scalable scheduler since it creates a
scheduler queue per thread (as specified when the scheduler group is
created). Creating more scheduler queues than threads that actually
call schedule() is suboptimal.

Therefore it would be better if the (performance) tests used the
"workers" scheduler group.

>
> I'm thinking that those automatic groups should be configurable, so that e.g. 
> "all" and "ctrl" maybe disabled if not used.
How is it useful to disable these groups?

> Those configs would go to new scheduler global config params (with other 
> things). Anyway, not a topic of this patch set.
>
> -Petri


Re: [lng-odp] [PATCH] validation: scheduler: Release context before the end of the scheduler test

2017-04-08 Thread Ola Liljedahl
On 7 April 2017 at 14:06, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolai...@nokia-bell-labs.com> wrote:
>
>
> From: Kevin Wang [mailto:kevin.w...@linaro.org]
> Sent: Friday, April 07, 2017 11:45 AM
> To: Savolainen, Petri (Nokia - FI/Espoo) 
> <petri.savolai...@nokia-bell-labs.com>
> Cc: Kevin Wang <kevin.w...@arm.com>; lng-odp@lists.linaro.org
> Subject: Re: [lng-odp] [PATCH] validation: scheduler: Release context before 
> the end of the scheduler test
>
> 1.Release context is just to be added for scalable scheduler in 
> scheduler_test_groups(). I think it does no harms to other scheduler here.
> 2.This code is to be removed for the scalable scheduler, We use ring buffer 
> to implement the queue. So it is possible the enqueue operation failed if the 
> ring buffer is full.
>
> Kevin
>
>
> 1. Validation tests are written against API spec. The spec says that context 
> release is a hint.
Actually this is what the spec says:
/**
 * Release the current atomic context
 *
 * This call is valid only for source queues with atomic synchronization. It

/**
 * Release the current ordered context
 *
 * This call is valid only for source queues with ordered synchronization. It

So I think the validation test is violating the spec by calling
odp_schedule_release_ordered() also for atomic/parallel queue/synch
types.
odp_schedule_sync_t sync[] = {ODP_SCHED_SYNC_PARALLEL,
                              ODP_SCHED_SYNC_ATOMIC,
                              ODP_SCHED_SYNC_ORDERED};
for (i = 0; i < 3; i++) {
        qp.sched.sync = sync[i];

The release calls are hints in the sense that the ODP implementation does
not necessarily release the atomic or ordered context; the release can be
delayed until the scheduler is invoked again.
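
The only portable way for an application to be sure it no longer holds a
context is thus to hint the release and then keep scheduling until
ODP_EVENT_INVALID is returned (as Petri notes further down), e.g. in a
test teardown (sketch):

odp_event_t ev;

odp_schedule_release_atomic(); /* hint only */
while ((ev = odp_schedule(NULL, ODP_SCHED_NO_WAIT)) != ODP_EVENT_INVALID)
        odp_event_free(ev); /* teardown, leftover events can be freed */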

Originally the scalable scheduler reported an error when calling
odp_schedule_release_atomic() (odp_schedule_release_ordered()) when an
atomic (ordered) queue was *not* being processed and Kevin's patch was
needed for the scheduler validation test to pass. Later we relaxed the
behaviour when an invalid call is made, just silently ignoring invalid
calls.

IMO Kevin's patch changes the scheduler validation test to follow the
spec. We should change either the validation test or the spec.

> Your scheduler must not depend on extra context release calls, but must apply 
> to the spec. A validation test written against the spec must work. May be the 
> application is not working by the spec. But adding a context release hint, 
> does not fix that (== guarantee that context is actually released).
The spec (currently) doesn't say anything about calling the wrong
release call. At best this is undefined behaviour, at worst illegal
behaviour.

>
> 2. This patch is about context releases. This fixes an enqueue issue => 
> should not be in this patch. How big queue capacity the test expects ? If 
> it’s reasonable, maybe you should increase ring size instead. Soon we’ll have 
> queue size param and tests can be updated to check that.  In the meanwhile it 
> feels wrong to remove error checks from validation suite.
>
> -Petri
>
>
>
> 2017-04-07 16:22 GMT+08:00 Savolainen, Petri (Nokia - FI/Espoo) 
> <petri.savolai...@nokia-bell-labs.com<mailto:petri.savolai...@nokia-bell-labs.com>>:
>
>
>> -Original Message-
>> From: lng-odp 
>> [mailto:lng-odp-boun...@lists.linaro.org<mailto:lng-odp-boun...@lists.linaro.org>]
>>  On Behalf Of Kevin
>> Wang
>> Sent: Friday, April 07, 2017 11:07 AM
>> To: lng-odp@lists.linaro.org<mailto:lng-odp@lists.linaro.org>
>> Cc: Kevin Wang <kevin.w...@arm.com<mailto:kevin.w...@arm.com>>
>> Subject: [lng-odp] [PATCH] validation: scheduler: Release context before
>> the end of the scheduler test
>>
>> If the scheduler sync type is atomic or ordered,
>> need to release the context.
>
> Release context is actually a hint. It does not guarantee that context is 
> released. Application needs to call schedule() and receive _EVENT_INVALID to 
> be sure that it does not hold a context anymore.
>
>
>>
>> Signed-off-by: Kevin Wang <kevin.w...@arm.com<mailto:kevin.w...@arm.com>>
>> Reviewed-by: Ola Liljedahl 
>> <ola.liljed...@arm.com<mailto:ola.liljed...@arm.com>>
>> ---
>>  .../common_plat/validation/api/scheduler/scheduler.c | 20 +++++++++++---------
>>  1 file changed, 11 insertions(+), 9 deletions(-)
>>
>> diff --git a/test/common_plat/validation/api/scheduler/scheduler.c
>> b/test/common_plat/validation/api/scheduler/scheduler.c
>> index 952561c..2631001 100644
>> --- a/test/common_plat/validation/api/scheduler/scheduler.c
>> +++ b/test/common

Re: [lng-odp] [PATCH] validation: scheduler: Release context before the end of the scheduler test

2017-04-08 Thread Ola Liljedahl
On 8 April 2017 at 11:24, Kevin Wang <kevin.w...@linaro.org> wrote:
> The enq failure assert check is already in another patch
> http://patches.opendataplane.org/patch/8499/.
> Just to be redundant here. We can ignore it. Further comments should be
> placed in that patch then.
The problem with this test is that it creates a queue and then expects
to be able to enqueue 10000 events on the queue. The current default
queue size in the scalable scheduler is 4096. 4096 minimum-size Ethernet
frames is 275 ms worth of traffic; I think that's a long time for a
packet to sit in a queue.

The correction should be to use the new feature to request a minimum
queue size when creating a queue.
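
I.e. something like this in the test setup (sketch; ring_size is the
field added by the "api: queue: Add ring_size" patch in this series, the
exact name may change when the queue size param lands):

odp_queue_param_t qp;
odp_queue_t q;

odp_queue_param_init(&qp);
qp.type = ODP_QUEUE_TYPE_SCHED;
qp.ring_size = 10000; /* >= the number of events the test enqueues */
q = odp_queue_create("sched_excl", &qp);

The create call can then fail (or the size be rounded up) on platforms
that cannot provide the requested capacity, instead of enqueues failing
at runtime.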


>
> For the atomic/ordered release, if you see the current code in the upstream,
> there is already two places call the release function in scheduler.c. I just
> extend it to another function, scheduler_test_groups(), to fix the bugs for
> the scalable scheduler. Ola, if you see the ticket #52637 in our Gerrit, you'll
> find the details for the code review. If this is not an issue anymore, we can
> drop the changes in this patch.
>
> Kevin
>
>
>
> 2017-04-08 1:39 GMT+08:00 Honnappa Nagarahalli
> <honnappa.nagaraha...@linaro.org>:
>>
>> On 7 April 2017 at 07:29, Ola Liljedahl <ola.liljed...@linaro.org> wrote:
>> > On 7 April 2017 at 14:06, Savolainen, Petri (Nokia - FI/Espoo) <
>> > petri.savolai...@nokia-bell-labs.com> wrote:
>> >>
>> >>
>> >> From: Kevin Wang [mailto:kevin.w...@linaro.org]
>> >> Sent: Friday, April 07, 2017 11:45 AM
>> >> To: Savolainen, Petri (Nokia - FI/Espoo) <
>> > petri.savolai...@nokia-bell-labs.com>
>> >> Cc: Kevin Wang <kevin.w...@arm.com>; lng-odp@lists.linaro.org
>> >> Subject: Re: [lng-odp] [PATCH] validation: scheduler: Release context
>> > before the end of the scheduler test
>> >>
>> >> 1.Release context is just to be added for scalable scheduler in
>> > scheduler_test_groups(). I think it does no harms to other scheduler
>> > here.
>> >> 2.This code is to be removed for the scalable scheduler, We use ring
>> > buffer to implement the queue. So it is possible the enqueue operation
>> > failed if the ring buffer is full.
>> >>
>> >> Kevin
>> >>
>> >>
>> >> 1. Validation tests are written against API spec. The spec says that
>> > context release is a hint. Your scheduler must not depend on extra
>> > context
>> > release calls, but must apply to the spec. A validation test written
>> > against the spec must work. May be the application is not working by the
>> > spec. But adding a context release hint, does not fix that (== guarantee
>> > that context is actually released).
>> >
>> > AFAIK, the scalable scheduler conforms to the spec (there could still be
>> > undetected bugs of course).
>> > Kevin, this is the test/common_plat/validation/api/scheduler test? Which
>> > platform and configuration? I haven't seen this failure.
>> > I get the following result on a multicore ARM target:
>> >
>> > $ test/common_plat/validation/api/scheduler/scheduler_main
>> > ...
>> > Run Summary:    Type      Total      Ran   Passed  Failed  Inactive
>> >              suites          1        1      n/a       0         0
>> >               tests         35       35       35       0         0
>> >             asserts    2381749  2381749  2381749       0       n/a
>> >
>> > Elapsed time =   23.919 seconds
>> >
>> >
>>
>> Looking at the patch carefully, the release is NOT added by this
>> patch. This patch corrects the release to take the sync type into
>> account (as the release APIs are different for different sync types).
>>
>> >>
>> >> 2. This patch is about context releases. This fixes an enqueue issue =>
>> > should not be in this patch. How big queue capacity the test expects ?
>> > If
>> > it’s reasonable, maybe you should increase ring size instead. Soon we’ll
>> > have queue size param and tests can be updated to check that.  In the
>> > meanwhile it feels wrong to remove error checks from validation suite.
>> > The default queue size is 4096 which seems plenty. We need to check how
>> > many events fill_queues() expects to be able to enqueue.
>> >
>>
>> Agree, this change should not be part of this patch. The change is
>> correct in the sense that 'odp_queue_enq' should not be expected to
>> p

Re: [lng-odp] [API-NEXT PATCH v2 13/16] Add a bitset

2017-04-08 Thread Ola Liljedahl
On 5 April 2017 at 14:27, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
> On 04/05/17 15:16, Ola Liljedahl wrote:
>> On 05/04/2017, 12:36, "Dmitry Eremin-Solenikov"
>> <dmitry.ereminsoleni...@linaro.org> wrote:
>>
>>> On 05.04.2017 02:31, Ola Liljedahl wrote:
>>>> On 05/04/2017, 01:25, "Dmitry Eremin-Solenikov"
>>>> <dmitry.ereminsoleni...@linaro.org> wrote:
>>>>> On 04.04.2017 23:52, Ola Liljedahl wrote:
>>>>>> Sending from my ARM email account, I hope Outlook does not mess up the
>>>>>> format.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 04/04/2017, 22:21, "Dmitry Eremin-Solenikov"
>>>>>> <dmitry.ereminsoleni...@linaro.org> wrote:
>>>>>>
>>>>>>> On 04.04.2017 21:48, Brian Brooks wrote:
>>>>>>>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>>>>>>>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>>>>>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> +/*****************************************************************************
>>>>>>>> + * bitset abstract data type
>>>>>>>> + ****************************************************************************/
>>>>>>>> +/* This could be a struct of scalars to support larger bit sets */
>>>>>>>> +
>>>>>>>> +#if ATOM_BITSET_SIZE <= 32
>>>>>>>
>>>>>>> Maybe I missed, where did you set this macro?
>>>>>> In odp_config_internal.h
>>>>>> It is a build time configuration.
>>>>>>
>>>>>>>
>>>>>>> Also, why do you need several versions of bitset? Can you stick to
>>>>>>> one
>>>>>>> size that fits all?
>>>>>> Some 32-bit archs (ARMv7a, x86) will only support 64-bit atomics
>>>>>> (AFAIK).
>>>>>> Only x86-64 and ARMv8a supports 128-bit atomics (and compiler support
>>>>>> for
>>>>>> 128-bit atomics for ARMv8a is a bit lacking…).
>>>>>> Other architectures might only support 32-bit atomic operations.
>>>>>
>>>>> What will be the major outcome of settling on the 64-bit atomics?
>>>> The size of the bitset determines the maximum number of threads, the
>>>> maximum number of scheduler groups and the maximum number of reorder
>>>> contexts (per thread).
>>>
>>> Then even 128 can become too small in the forthcoming future. As far as
>>> I understand, most of the interesting things happen around
>>> bitsetting/clearing. Maybe we can redefine bitset as a struct or array
>>> of atomics? Then it would be expandable without significant software
>>> issues, wouldn't it?
>>>
>
> Why odp_cpu_mask_t is not used for that case?
We need to be able to atomically set or clear bits in the bitset as
bitsets are shared by many threads.
There are also some other operations e.g. atomic exchange and the
ARM-specific load-exclusive which is used when busy-polling using WFE.
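
For illustration, the 64-bit configuration boils down to thin wrappers
around the GCC/Clang __atomic built-ins, roughly like this (names
illustrative, not the exact ODP internals):

#include <stdint.h>

typedef uint64_t bitset_t;

static inline void atom_bitset_set(bitset_t *bs, uint32_t bit, int mo)
{
        __atomic_fetch_or(bs, (bitset_t)1 << bit, mo);
}

static inline void atom_bitset_clr(bitset_t *bs, uint32_t bit, int mo)
{
        __atomic_fetch_and(bs, ~((bitset_t)1 << bit), mo);
}

static inline bitset_t atom_bitset_xchg(bitset_t *bs, bitset_t neu, int mo)
{
        return __atomic_exchange_n(bs, neu, mo);
}

where mo is one of the __ATOMIC_* memory order constants. The 128-bit
variant needs __int128 and, where compiler support is lacking,
hand-written LL/SC (ldxp/stxp) equivalents on ARMv8a.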

>
> Maxim.
>
>
>>> I'm trying to get away of situation where we have overcomplicated low
>>> level code, which brings different issues on further platforms (like
>>> supporting this amount of threads on ARM and that amount of threads on
>>> x86/PPC/MIPS/etc).
>> I think the current implementation is simple and efficient. I also think it
>> is sufficiently capable, e.g. supports up to 128 threads/scheduler groups
>> etc.
>> on 64-bit ARM and x86, up to 64 on 32-bit ARM/x86 and 64-bit MIPS. I don't
>> think we should make a more complicated generic implementation until the
>> need has surfaced. It is easy to over-speculate in what will be required in
>> the future and implement stuff that is never used.
>>
>>>
>>>>>> I think the user should have control over this but if you think that
>>>>>> we
>>>>>> should just select the max value that is supported by the architecture
>>>>>> in
>>>>>> question and thus skip one build configuration, I am open to this. We
>>>>>> will
>>>>>> still need separate versions for 32/64/128 bits because there are
>>>>>> slight
>>>>>> differences in the syntax and implementation. Such are the vagaries of
>>>>>> the
>>>>>> C standard (and GCC extensions).
>>>>>>
>>>>>>
>>>>>>> Any real reason for the following defines? Why do you need them?
>>>>>> The functions were added as they were needed, e.g. in
>>>>>> odp_schedule_scalable.c.
>>>>>> I dont think there is anyone which is not used anymore but can
>>>>>> double-check that.
>>>>>
>>>>> Well. I maybe should rephrase my question: why do you think that it's
>>>>> better to have bitset_andn(a, b), rather than just a &~b ?
>>>> The atomic bitset is an abstract data type. The implementation does not
>>>> have to use a scalar word. Alternative implementation paths exist, e.g.
>>>> use a struct with multiple words and perform the requested operation one
>>>> word at a time (this is OK but perhaps not well documented).
>>>
>>> This makes sense, esp. if we add non-plain-integer bitsets.
>> One note on using a struct with multiple words is that this will/might in
>> some cases
>> require multiple atomic operations (one per word) and this will be slower.
>>
>>>
>>>
>>> --
>>> With best wishes
>>> Dmitry
>>
>


Re: [lng-odp] [PATCH 1/3] test: l2fwd: add group option

2017-04-07 Thread Ola Liljedahl
On 6 April 2017 at 13:59, Petri Savolainen  wrote:
>
> User may give number of scheduling groups to test
> scheduler performance with other that the default (all
> threads) group. Both pktios and threads are allocated

Isn't all *workers* a better default scheduler group? In this and in
other examples and benchmarks.

>
> into these groups with round robin. The number of groups
> may not exceed number of pktios or worker threads.
>
> Signed-off-by: Petri Savolainen 
> ---
>  test/common_plat/performance/odp_l2fwd.c | 148 
> ---
>  1 file changed, 116 insertions(+), 32 deletions(-)
>
> diff --git a/test/common_plat/performance/odp_l2fwd.c 
> b/test/common_plat/performance/odp_l2fwd.c
> index 8f5c5e1..33efc02 100644
> --- a/test/common_plat/performance/odp_l2fwd.c
> +++ b/test/common_plat/performance/odp_l2fwd.c
> @@ -104,6 +104,7 @@ typedef struct {
> int src_change; /**< Change source eth addresses */
> int error_check;/**< Check packet errors */
> int sched_mode; /**< Scheduler mode */
> +   int num_groups; /**< Number of scheduling groups */
>  } appl_args_t;
>
>  static int exit_threads;   /**< Break workers loop if set to 1 */
> @@ -130,6 +131,7 @@ typedef union {
>  typedef struct thread_args_t {
> int thr_idx;
> int num_pktio;
> +   int num_groups;
>
> struct {
> odp_pktin_queue_t pktin;
> @@ -142,7 +144,12 @@ typedef struct thread_args_t {
> int tx_queue_idx;
> } pktio[MAX_PKTIOS];
>
> -   stats_t *stats; /**< Pointer to per thread stats */
> +   /* Groups to join */
> +   odp_schedule_group_t group[MAX_PKTIOS];
> +
> +   /* Pointer to per thread stats */
> +   stats_t *stats;
> +
>  } thread_args_t;
>
>  /**
> @@ -297,6 +304,22 @@ static int run_worker_sched_mode(void *arg)
>
> thr = odp_thread_id();
>
> +   if (gbl_args->appl.num_groups) {
> +   odp_thrmask_t mask;
> +
> +   odp_thrmask_zero(&mask);
> +   odp_thrmask_set(&mask, thr);
> +
> +   /* Join non-default groups */
> +   for (i = 0; i < thr_args->num_groups; i++) {
> +   if (odp_schedule_group_join(thr_args->group[i],
> +   &mask)) {
> +   LOG_ERR("Join failed\n");
> +   return -1;
> +   }
> +   }
> +   }
> +
> num_pktio = thr_args->num_pktio;
>
> if (num_pktio > MAX_PKTIOS) {
> @@ -590,7 +613,7 @@ static int run_worker_direct_mode(void *arg)
>   * @retval -1 on failure
>   */
>  static int create_pktio(const char *dev, int idx, int num_rx, int num_tx,
> -   odp_pool_t pool)
> +   odp_pool_t pool, odp_schedule_group_t group)
>  {
> odp_pktio_t pktio;
> odp_pktio_param_t pktio_param;
> @@ -650,7 +673,7 @@ static int create_pktio(const char *dev, int idx, int 
> num_rx, int num_tx,
>
> pktin_param.queue_param.sched.prio  = ODP_SCHED_PRIO_DEFAULT;
> pktin_param.queue_param.sched.sync  = sync_mode;
> -   pktin_param.queue_param.sched.group = ODP_SCHED_GROUP_ALL;
> +   pktin_param.queue_param.sched.group = group;
> }
>
> if (num_rx > (int)capa.max_input_queues) {
> @@ -1016,39 +1039,46 @@ static void usage(char *progname)
> printf("\n"
>"OpenDataPlane L2 forwarding application.\n"
>"\n"
> -  "Usage: %s OPTIONS\n"
> +  "Usage: %s [options]\n"
> +  "\n"
>"  E.g. %s -i eth0,eth1,eth2,eth3 -m 0 -t 1\n"
> -  " In the above example,\n"
> -  " eth0 will send pkts to eth1 and vice versa\n"
> -  " eth2 will send pkts to eth3 and vice versa\n"
> +  "  In the above example,\n"
> +  "  eth0 will send pkts to eth1 and vice versa\n"
> +  "  eth2 will send pkts to eth3 and vice versa\n"
>"\n"
>"Mandatory OPTIONS:\n"
> -  "  -i, --interface Eth interfaces (comma-separated, no 
> spaces)\n"
> -  "  Interface count min 1, max %i\n"
> +  "  -i, --interface   Eth interfaces (comma-separated, no 
> spaces)\n"
> +  "  Interface count min 1, max %i\n"
>"\n"
>"Optional OPTIONS:\n"
> -  "  -m, --mode  Packet input mode\n"
> -  "  0: Direct mode: PKTIN_MODE_DIRECT 
> (default)\n"
> -  "  1: Scheduler mode with parallel queues: 
> PKTIN_MODE_SCHED + SCHED_SYNC_PARALLEL\n"
> -  "  2: Scheduler mode with atomic queues:   
> PKTIN_MODE_SCHED + SCHED_SYNC_ATOMIC\n"
> -   

Re: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software scheduler

2017-04-07 Thread Ola Liljedahl
On 7 April 2017 at 08:40, Peltonen, Janne (Nokia - FI/Espoo) <
janne.pelto...@nokia.com> wrote:

> Hi,
>
> On Thu, Apr 6, 2017 at 1:46 PM, Bill Fischofer <bill.fischo...@linaro.org>
> wrote:
> > On Thu, Apr 6, 2017 at 1:32 PM, Ola Liljedahl <ola.liljed...@linaro.org>
> wrote:
> > > On 6 April 2017 at 13:48, Jerin Jacob <jerin.ja...@caviumnetworks.com>
> wrote:
>
> > >> We see ORDERED->ATOMIC as main use case for basic packet forward.Stage
> > >> 1(ORDERED) to process on N cores and Stage2(ATOMIC) to maintain the
> ingress
> > >> order.
> > > Doesn't ORDERED scheduling maintain the ingress packet order all the
> > > way to the egress interface? A least that's my understanding of ODP
> > > ordered queues.
> > > From an ODP perspective, I fail to see how the ATOMIC stage is needed.
>
> For basic IP forwarding I also do not see why an atomic stage would be
> needed, but for stateful things like IPsec or some application specific
> higher layer processing the situation can be different.
>
> At the risk of stating the obvious: Ordered scheduling maintains ingress
> order when packets are placed in the next queue (toward the next pipeline
> stage or to pktout), but it allows parallel processing of packets of the
> same flow between the points where order is maintained. To guarantee packet
> processing in the ingress order in some section of code, the code needs
> to be executed in an atomic context or protected using an ordered lock.
>
> > As pointed out earlier, ordered locks are another option to avoid a
> > separate processing stage simply to do in-sequence operations within
> > an ordered flow. I'd be curious to understand the use-case in a bit
> > more detail here. Ordered queues preserve the originating queue's
> > order, however to achieve end-to-end ordering involving multiple
> > processing stages requires that flows traverse only ordered or atomic
> > queues. If a parallel queue is used ordering is indeterminate from
> > that point on in the pipeline.
>
> Exactly.
>
> In an IPsec GW (as a use case example) one might want to do all
> stateless processing (like ingress and egress IP processing and the
> crypto operations) in ordered contexts to get parallelism, but the
> stateful part (replay-check and sequence number generation) in
> an atomic context (or holding an ordered lock) so that not only
> the packets are output in ingress order but also their sequence
> numbers are in the same order.
>
To what extent does IPsec sequence number ordering have to equal actual
transmission order?
If an IPsec GW preserves ingress-to-egress packet order, does it matter
that this order might not be the same as the IPsec sequence number order?
If IPsec SN's are not used for reordering, just for replay protection, I
don't see why the two number series have to match.


> That said, some might argue that IPsec replay window can take care
> not only of packet reordering in the network but also of reordering
> inside an IPsec GW and therefore the atomic context (or ordered lock)
> is not necessarily needed in all implementations.
>
Do you mean that the replay protection also should be responsible for
end-to-end (IPsec GW to GW) order restoration? Doesn't that mean that
packets might have to be saved until their SN leaves the replay window (if
there are missing packets/SN's that we are waiting for)? Wouldn't this add
a lot of latency when waiting for missing packets? Latency affecting
packets in unrelated flows which don't care about that missing/late packet.
Can't individual IPsec packets have different QoS requirements? You don't
want a latency sensitive packet to have to wait for an earlier missing
packet. It would be great if each QoS class would have its own IPsec SA but
is that always the case?




> Janne
>
>
>


Re: [lng-odp] [PATCH] Relax the assert check with enqueue fails in the scheduler test case

2017-04-07 Thread Ola Liljedahl
From test/common_plat/validation/api/scheduler/scheduler.c

#define BUFS_PER_QUEUE_EXCL 10000

The scalable scheduler has a default queue size of 4096 which seemed
reasonable to us. We can always increase the value but applications and
test programs should not expect an infinite queue size. Perhaps scheduler.c
should create the queues with the queue size it is going to need? The test
will then verify that you can actually enqueue that many events on the
queue.


This test program also seems to create queues that are not destroyed. Is
that OK?

q = odp_queue_create(name, &p);

if (q == ODP_QUEUE_INVALID) {
        printf("Schedule queue create failed.\n");
        return -1;
}

snprintf(name, sizeof(name), "sched_%d_%d_a", i, j);

p.sched.sync = ODP_SCHED_SYNC_ATOMIC;

q = odp_queue_create(name, &p);

Reuse 'q' without destroying the queue created earlier.



On 7 April 2017 at 09:28, Kevin Wang  wrote:

> In the scalable scheduler, queue is implemented by the ring buffer.
> It is possible enqueue would fail if the ring buffer is full.
> So just remove the assert() check.
>
> Signed-off-by: Kevin Wang 
> ---
>  test/common_plat/validation/api/scheduler/scheduler.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/test/common_plat/validation/api/scheduler/scheduler.c
> b/test/common_plat/validation/api/scheduler/scheduler.c
> index 952561c..e7623b7 100644
> --- a/test/common_plat/validation/api/scheduler/scheduler.c
> +++ b/test/common_plat/validation/api/scheduler/scheduler.c
> @@ -959,7 +959,6 @@ static void fill_queues(thread_args_t *args)
> }
>
> ret = odp_queue_enq(queue, ev);
> -   CU_ASSERT_FATAL(ret == 0);
>
> if (ret)
> odp_buffer_free(buf);
> --
> 2.7.4
>
>


Re: [lng-odp] [API-NEXT PATCH v2 07/16] test: odp_scheduling: Handle dequeueing from a concurrent queue

2017-04-07 Thread Ola Liljedahl
On 6 April 2017 at 20:51, Maxim Uvarov <maxim.uva...@linaro.org> wrote:

> On 04/06/17 13:35, Ola Liljedahl wrote:
> > On 5 April 2017 at 23:39, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
> >> On 04/05/17 17:30, Ola Liljedahl wrote:
> >>> On 5 April 2017 at 14:50, Maxim Uvarov <maxim.uva...@linaro.org>
> wrote:
> >>>> On 04/05/17 06:57, Honnappa Nagarahalli wrote:
> >>>>> This can go into master/api-next as an independent patch. Agree?
> >>>>
> >>>> agree. If we accept implementation where events can be 'delayed'
> >>> Probably all platforms with HW queues.
> >>>
> >>>> than it
> >>>> looks like we missed some api to sync queues.
> >>> When would those API's be used?
> >>>
> >>
> >> might be in case like that. Might be it's not needed in real world
> >> application.
> > This was a test program. I don't see the same situation occurring in a
> > real world application. I could be wrong.
> >
> >>
> >> My point that if situation of postpone event is accepted that we need
> >> document that in api doxygen comment.
> > I think the asynchronous behaviour is the default. ODP is a hardware
> > abstraction. HW is often asynchronous, writes are posted etc. Ensuring
> > synchronous behaviour costs performance.
> >
> > Single-threaded software is "synchronous", writes are immediately
> > visible to the thread. But as soon as you go multi-threaded and don't
> > use locks to access shared resources, software also becomes
> > "asynchronous" (don't know if it is the right word here). Only if you
> > use locks to synchronise accesses to shared memory you return to some
> > form of sequential consistency (all threads see updates in the same
> > order). You don't want to use locks, that quickly creates scalability
> > bottlenecks.
> >
> > Since the scalable scheduler does its best to avoid locks
> > (non-scalable) and sequential consistency (slow), instead utilising
> > lock-less and lock-free algorithms and weak memory ordering (e.g.
> > acquire/release), it exposes the underlying hardware characteristics.
> >
>
> Ola, I think you better understand how weak memory ordering works. In
> this case I understand that hardware can 'delay' events in queue and
> make them not visible just after queueing for some reason. And it's not
> possible to solve in implementation. If we speak totally about software
> I would understand if one thread did queue and other dequeue. Or case if
> you queued X and dequeued Y. But in that case if each thread queued 1
> and dequeued 1 in each thread. Which look like if you store in one
> thread some variable then you need several loads to get value which was
> stored. Is that right behaviour of weak ordering?
>

The new ring buffer design uses separate read & write (head & tail)
pointers for producers (enqueue) and consumers (dequeue) (DPDK/BSD ring
buffer style but with different names). Enqueue checks if there is space
(load prod_write, load-acquire prod_read), updates prod_write (CAS-relaxed
prod_write), stores events in the ring array, waits for any previous
producers to release their updates (load cons_write) and eventually
releases its own updates (store-release cons_write) so that they can be
seen by consumers.
Dequeue works in an equivalent way (load-acquire on cons_write, CAS on
cons_read, then DMB ISHLD before the store to prod_read; DMB ISHLD is enough
to ensure that the reads of the ring buffer have completed before dequeue
"releases" the ring buffer slots).

See slide 9 in my Linaro Connect presentation:
https://docs.google.com/presentation/d/1BqAdni4aP4aHOqO6fNO39-0MN9zOntI-2ZnVTUXBNSQ/edit#slide=id.g1d00c08a90_0_108
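
In rough C, the metadata layout and the producer publication step
described above look something like this (sketch; field names follow the
text above, not necessarily the actual odp_schedule_scalable source, and
RING_SIZE and the 64-byte cache line size are assumptions):

#define RING_SIZE 4096 /* assumption */
#define ALIGNED_CL __attribute__((aligned(64)))

struct ring {
        /* Each variable on its own cache line, see below */
        uint32_t prod_write ALIGNED_CL; /* CAS'ed by producers */
        uint32_t prod_read  ALIGNED_CL; /* store-released by consumers */
        uint32_t cons_write ALIGNED_CL; /* store-released by producers */
        uint32_t cons_read  ALIGNED_CL; /* CAS'ed by consumers */
        odp_event_t slot[RING_SIZE] ALIGNED_CL;
};

/* Producer publication after its slots have been written; 'old' is the
 * prod_write value this producer started from, 'n' its number of slots */
while (__atomic_load_n(&r->cons_write, __ATOMIC_RELAXED) != old)
        ; /* wait for previous producers to release their slots */
__atomic_store_n(&r->cons_write, old + n, __ATOMIC_RELEASE);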

One aspect of having separate metadata (head & tail pointers) for producers
and for consumers is that producers and consumers can have a different view
of the queue state, i.e. one thread can see the queue as empty while
another thread sees it as non-empty. I called this the Schrödinger's Queue
Conundrum when I talked about it in my internal scheduler prototype design
presentation back in Las Vegas.

The scalable scheduler puts all of these ring buffer metadata variables in
separate cache lines for maximum scalability. They are all targets for
stores and only one cache (CPU) at a time can write to a cache line (per
MOESI and similar cache coherency protocols). Combine this with relaxed
memory ordering where plain loads and stores can be reordered in any way,
prefetching of cache lines and store buffers and the CPU might see updates
(stores) from other CPU's in unexpected order (a CPU always sees its own
stores immediately). If load-ac

Re: [lng-odp] [API-NEXT PATCH v2 07/16] test: odp_scheduling: Handle dequeueing from a concurrent queue

2017-04-07 Thread Ola Liljedahl
On 7 April 2017 at 01:33, Bill Fischofer <bill.fischo...@linaro.org> wrote:

> On Thu, Apr 6, 2017 at 1:51 PM, Maxim Uvarov <maxim.uva...@linaro.org>
> wrote:
> > On 04/06/17 13:35, Ola Liljedahl wrote:
> >> On 5 April 2017 at 23:39, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
> >>> On 04/05/17 17:30, Ola Liljedahl wrote:
> >>>> On 5 April 2017 at 14:50, Maxim Uvarov <maxim.uva...@linaro.org>
> wrote:
> >>>>> On 04/05/17 06:57, Honnappa Nagarahalli wrote:
> >>>>>> This can go into master/api-next as an independent patch. Agree?
> >>>>>
> >>>>> agree. If we accept implementation where events can be 'delayed'
> >>>> Probably all platforms with HW queues.
> >>>>
> >>>>> than it
> >>>>> looks like we missed some api to sync queues.
> >>>> When would those API's be used?
> >>>>
> >>>
> >>> might be in case like that. Might be it's not needed in real world
> >>> application.
> >> This was a test program. I don't see the same situation occurring in a
> >> real world application. I could be wrong.
> >>
> >>>
> >>> My point that if situation of postpone event is accepted that we need
> >>> document that in api doxygen comment.
> >> I think the asynchronous behaviour is the default. ODP is a hardware
> >> abstraction. HW is often asynchronous, writes are posted etc. Ensuring
> >> synchronous behaviour costs performance.
> >>
> >> Single-threaded software is "synchronous", writes are immediately
> >> visible to the thread. But as soon as you go multi-threaded and don't
> >> use locks to access shared resources, software also becomes
> >> "asynchronous" (don't know if it is the right word here). Only if you
> >> use locks to synchronise accesses to shared memory you return to some
> >> form of sequential consistency (all threads see updates in the same
> >> order). You don't want to use locks, that quickly creates scalability
> >> bottlenecks.
> >>
> >> Since the scalable scheduler does its best to avoid locks
> >> (non-scalable) and sequential consistency (slow), instead utilising
> >> lock-less and lock-free algorithms and weak memory ordering (e.g.
> >> acquire/release), it exposes the underlying hardware characteristics.
> >>
> >
> > Ola, I think you better understand how weak memory ordering works. In
> > this case I understand that hardware can 'delay' events in queue and
> > make them not visible just after queueing for some reason. And it's not
> > possible to solve in implementation. If we speak totally about software
> > I would understand if one thread did queue and other dequeue. Or case if
> > you queued X and dequeued Y. But in that case if each thread queued 1
> > and dequeued 1 in each thread. Which look like if you store in one
> > thread some variable then you need several loads to get value which was
> > stored. Is that right behaviour of weak ordering?
>
> There are some good online articles that explain the issues well. See,
> for example [1]  that explains the types of barriers used to control
> memory orderings and [2] that explains how these relate to strong vs.
> weak memory models.
>
> --
> [1] http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/
> [2] http://preshing.com/20120930/weak-vs-strong-memory-models/
>
> Preshing has a lot of good posts.
Herb Sutter has a good presentation here:
https://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2


> >
> > Maxim.
> >
> >
> >>>
> >>> Maxim.
> >>>
> >>>>>
> >>>>> But I do not see why we need this patch. On the same cpu test queue 1
> >>>>> event and after that dequeue 1 event:
> >>>>>
> >>>>> for (i = 0; i < QUEUE_ROUNDS; i++) {
> >>>>> ev = odp_buffer_to_event(buf);
> >>>>>
> >>>>> if (odp_queue_enq(queue, ev)) {
> >>>>> LOG_ERR("  [%i] Queue enqueue failed.\n",
> thr);
> >>>>> odp_buffer_free(buf);
> >>>>> return -1;
> >>>>> }
> >>>>>
> >>>>> ev = odp_queue_deq(queue);
> >>>>

Re: [lng-odp] [PATCH] validation: scheduler: Release context before the end of the scheduler test

2017-04-07 Thread Ola Liljedahl
On 7 April 2017 at 14:06, Savolainen, Petri (Nokia - FI/Espoo) <
petri.savolai...@nokia-bell-labs.com> wrote:
>
>
> From: Kevin Wang [mailto:kevin.w...@linaro.org]
> Sent: Friday, April 07, 2017 11:45 AM
> To: Savolainen, Petri (Nokia - FI/Espoo) <
petri.savolai...@nokia-bell-labs.com>
> Cc: Kevin Wang <kevin.w...@arm.com>; lng-odp@lists.linaro.org
> Subject: Re: [lng-odp] [PATCH] validation: scheduler: Release context
before the end of the scheduler test
>
> 1.Release context is just to be added for scalable scheduler in
scheduler_test_groups(). I think it does no harms to other scheduler here.
> 2.This code is to be removed for the scalable scheduler, We use ring
buffer to implement the queue. So it is possible the enqueue operation
failed if the ring buffer is full.
>
> Kevin
>
>
> 1. Validation tests are written against API spec. The spec says that
context release is a hint. Your scheduler must not depend on extra context
release calls, but must apply to the spec. A validation test written
against the spec must work. May be the application is not working by the
spec. But adding a context release hint, does not fix that (== guarantee
that context is actually released).

AFAIK, the scalable scheduler conforms to the spec (there could still be
undetected bugs of course).
Kevin, this is the test/common_plat/validation/api/scheduler test? Which
platform and configuration? I haven't seen this failure.
I get the following result on a multicore ARM target:

$ test/common_plat/validation/api/scheduler/scheduler_main
...
Run Summary:    Type      Total      Ran   Passed  Failed  Inactive
             suites          1        1      n/a       0         0
              tests         35       35       35       0         0
            asserts    2381749  2381749  2381749       0       n/a

Elapsed time =   23.919 seconds


>
> 2. This patch is about context releases. This fixes an enqueue issue =>
should not be in this patch. How big queue capacity the test expects ? If
it’s reasonable, maybe you should increase ring size instead. Soon we’ll
have queue size param and tests can be updated to check that.  In the
meanwhile it feels wrong to remove error checks from validation suite.
The default queue size is 4096 which seems plenty. We need to check how
many events fill_queues() expects to be able to enqueue.

>
> -Petri
>
>
>
> 2017-04-07 16:22 GMT+08:00 Savolainen, Petri (Nokia - FI/Espoo) <
petri.savolai...@nokia-bell-labs.com>:
>
>
>> -Original Message-
>> From: lng-odp [mailto:lng-odp-boun...@lists.linaro.org] On Behalf Of Kevin
>> Wang
>> Sent: Friday, April 07, 2017 11:07 AM
>> To: lng-odp@lists.linaro.org<mailto:lng-odp@lists.linaro.org>
>> Cc: Kevin Wang <kevin.w...@arm.com<mailto:kevin.w...@arm.com>>
>> Subject: [lng-odp] [PATCH] validation: scheduler: Release context before
>> the end of the scheduler test
>>
>> If the scheduler sync type is atomic or ordered,
>> need to release the context.
>
> Release context is actually a hint. It does not guarantee that context is
released. Application needs to call schedule() and receive _EVENT_INVALID
to be sure that it does not hold a context anymore.
>
>
>>
>> Signed-off-by: Kevin Wang <kevin.w...@arm.com<mailto:kevin.w...@arm.com>>
>> Reviewed-by: Ola Liljedahl <ola.liljed...@arm.com>
I have no recollection of having reviewed this patch...
And I can't find it in gerrit for our local repo/branch.


>> ---
>>  .../common_plat/validation/api/scheduler/scheduler.c | 20 +++++++++++---------
>>  1 file changed, 11 insertions(+), 9 deletions(-)
>>
>> diff --git a/test/common_plat/validation/api/scheduler/scheduler.c
>> b/test/common_plat/validation/api/scheduler/scheduler.c
>> index 952561c..2631001 100644
>> --- a/test/common_plat/validation/api/scheduler/scheduler.c
>> +++ b/test/common_plat/validation/api/scheduler/scheduler.c
>> @@ -129,6 +129,14 @@ static int exit_schedule_loop(void)
>>   return ret;
>>  }
>>
>> +static void release_context(odp_schedule_sync_t sync)
>> +{
>> + if (sync == ODP_SCHED_SYNC_ATOMIC)
>> + odp_schedule_release_atomic();
>> + else if (sync == ODP_SCHED_SYNC_ORDERED)
>> + odp_schedule_release_ordered();
>> +}
>> +
>>  void scheduler_test_wait_time(void)
>>  {
>>   int i;
>> @@ -251,8 +259,7 @@ void scheduler_test_queue_destroy(void)
>>   CU_ASSERT_FATAL(u32[0] == MAGIC);
>>
>>   odp_buffer_free(buf);
>> - odp_schedule_release_ordered();
>> -
>> + release_context(qp.sched.sync);
>>   CU_ASSERT_FATAL(odp_queue_destroy(qu

Re: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software scheduler

2017-04-06 Thread Ola Liljedahl
On 6 April 2017 at 13:48, Jerin Jacob <jerin.ja...@caviumnetworks.com> wrote:
> -Original Message-
>> Date: Thu, 6 Apr 2017 12:54:10 +0200
>> From: Ola Liljedahl <ola.liljed...@linaro.org>
>> To: Brian Brooks <brian.bro...@arm.com>
>> Cc: Jerin Jacob <jerin.ja...@caviumnetworks.com>,
>>  "lng-odp@lists.linaro.org" <lng-odp@lists.linaro.org>
>> Subject: Re: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software
>>  scheduler
>>
>> On 5 April 2017 at 18:50, Brian Brooks <brian.bro...@arm.com> wrote:
>> > On 04/05 21:27:37, Jerin Jacob wrote:
>> >> -Original Message-
>> >> > Date: Tue, 4 Apr 2017 13:47:52 -0500
>> >> > From: Brian Brooks <brian.bro...@arm.com>
>> >> > To: lng-odp@lists.linaro.org
>> >> > Subject: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software 
>> >> > scheduler
>> >> > X-Mailer: git-send-email 2.12.2
>> >> >
>> >> > This work derives from Ola Liljedahl's prototype [1] which introduced a
>> >> > scalable scheduler design based on primarily lock-free algorithms and
>> >> > data structures designed to decrease contention. A thread searches
>> >> > through a data structure containing only queues that are both non-empty
>> >> > and allowed to be scheduled to that thread. Strict priority scheduling 
>> >> > is
>> >> > respected, and (W)RR scheduling may be used within queues of the same 
>> >> > priority.
>> >> > Lastly, pre-scheduling or stashing is not employed since it is optional
>> >> > functionality that can be implemented in the application.
>> >> >
>> >> > In addition to scalable ring buffers, the algorithm also uses unbounded
>> >> > concurrent queues. LL/SC and CAS variants exist in cases where absense 
>> >> > of
>> >> > ABA problem cannot be proved, and also in cases where the compiler's 
>> >> > atomic
>> >> > built-ins may not be lowered to the desired instruction(s). Finally, a 
>> >> > version
>> >> > of the algorithm that uses locks is also provided.
>> >> >
>> >> > See platform/linux-generic/include/odp_config_internal.h for further 
>> >> > build
>> >> > time configuration.
>> >> >
>> >> > Use --enable-schedule-scalable to conditionally compile this scheduler
>> >> > into the library.
>> >>
>> >> This is an interesting stuff.
>> >>
>> >> Do you have any performance/latency numbers in comparison to existing
>> >> scheduler
>> >> for completing say two stage(ORDERED->ATOMIC) or N stage pipeline on any 
>> >> platform?
>> It is still a SW implementation, there is overhead associated with queue
>> enqueue/dequeue and the scheduling itself.
>> So for an N-stage pipeline, overhead will accumulate.
>> If only a subset of threads are associated with each stage (this could
>> be beneficial for I-cache hit rate), there will be less need for
>> scalability.
>> What is the recommended strategy here for OCTEON/ThunderX?
>
> In the view of portable event driven applications(Works on both
> embedded and server capable chips), the SW schedule is an important piece.
>
>> All threads/cores share all work?
>
> That is the recommend one in HW as it supports nativity. But HW provides
> means to partition the work load based on odp schedule groups
>
>
>>
>> >
>> > To give an idea, the avg latency reported by odp_sched_latency is down to 
>> > half
>> > that of other schedulers (pre-scheduling/stashing disabled) on 4c A53, 16c 
>> > A57,
>> > and 12c broadwell. We are still preparing numbers, and I think it's worth 
>> > mentioning
>> > that they are subject to change as this patch series changes over time.
>> >
>> > I am not aware of an existing benchmark that involves switching between 
>> > different
>> > queue types. Perhaps this is happening in an example app?
>> This could be useful in e.g. IPsec termination. Use an atomic stage
>> for the replay protection check and update. Now ODP has ordered locks
>> for that so the "atomic" (exclusive) section can be achieved from an
>> ordered processing stage. Perhaps Jerin knows some other application
>> that utilises two-stage ORDERED->ATOMIC processing.
>
> We see ORDERED->ATOMIC as main 

Re: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software scheduler

2017-04-06 Thread Ola Liljedahl
On 5 April 2017 at 18:50, Brian Brooks  wrote:
> On 04/05 21:27:37, Jerin Jacob wrote:
>> -Original Message-
>> > Date: Tue, 4 Apr 2017 13:47:52 -0500
>> > From: Brian Brooks 
>> > To: lng-odp@lists.linaro.org
>> > Subject: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software scheduler
>> > X-Mailer: git-send-email 2.12.2
>> >
>> > This work derives from Ola Liljedahl's prototype [1] which introduced a
>> > scalable scheduler design based on primarily lock-free algorithms and
>> > data structures designed to decrease contention. A thread searches
>> > through a data structure containing only queues that are both non-empty
>> > and allowed to be scheduled to that thread. Strict priority scheduling is
>> > respected, and (W)RR scheduling may be used within queues of the same 
>> > priority.
>> > Lastly, pre-scheduling or stashing is not employed since it is optional
>> > functionality that can be implemented in the application.
>> >
>> > In addition to scalable ring buffers, the algorithm also uses unbounded
>> > concurrent queues. LL/SC and CAS variants exist in cases where absense of
>> > ABA problem cannot be proved, and also in cases where the compiler's atomic
>> > built-ins may not be lowered to the desired instruction(s). Finally, a 
>> > version
>> > of the algorithm that uses locks is also provided.
>> >
>> > See platform/linux-generic/include/odp_config_internal.h for further build
>> > time configuration.
>> >
>> > Use --enable-schedule-scalable to conditionally compile this scheduler
>> > into the library.
>>
>> This is an interesting stuff.
>>
>> Do you have any performance/latency numbers in comparison to existing
>> scheduler
>> for completing say two stage(ORDERED->ATOMIC) or N stage pipeline on any 
>> platform?
It is still a SW implementation, there is overhead associated with queue
enqueue/dequeue and the scheduling itself.
So for an N-stage pipeline, overhead will accumulate.
If only a subset of threads are associated with each stage (this could
be beneficial for I-cache hit rate), there will be less need for
scalability.
What is the recommended strategy here for OCTEON/ThunderX? All
threads/cores share all work?

>
> To give an idea, the avg latency reported by odp_sched_latency is down to half
> that of other schedulers (pre-scheduling/stashing disabled) on 4c A53, 16c 
> A57,
> and 12c broadwell. We are still preparing numbers, and I think it's worth 
> mentioning
> that they are subject to change as this patch series changes over time.
>
> I am not aware of an existing benchmark that involves switching between 
> different
> queue types. Perhaps this is happening in an example app?
This could be useful in e.g. IPsec termination. Use an atomic stage
for the replay protection check and update. Now ODP has ordered locks
for that so the "atomic" (exclusive) section can be achieved from an
ordered processing stage. Perhaps Jerin knows some other application
that utilises two-stage ORDERED->ATOMIC processing.
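
In ODP terms, that ORDERED + exclusive-section pattern looks roughly like
this (sketch; the source queue must have been created with
sched.lock_count >= 1):

odp_event_t ev = odp_schedule(NULL, ODP_SCHED_WAIT);

/* stateless IPsec processing, runs in parallel under the ordered
 * context */

odp_schedule_order_lock(0); /* threads enter here in ingress order */
/* replay check / sequence number assignment */
odp_schedule_order_unlock(0);

/* enqueue to next stage or pktout; ingress order is preserved */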

>
>> When we say scalable scheduler, What application/means used to quantify
>> scalablity??
It starts with the design: use non-blocking data structures and try to
distribute data to threads so that they do not access shared data very
often. Some of this is a little detrimental to single-threaded
performance, as you need to use more atomic operations. It seems to work
well on ARM (A53, A57) though, the penalty is higher on x86 (x86 is
very good with spin locks, cmpxchg seems to have more overhead
compared to ldxr/stxr on ARM which can have less memory ordering
constraints). We actually use different synchronisation strategies on
ARM and on x86 (compile time configuration).

You can read more here:
https://docs.google.com/presentation/d/1BqAdni4aP4aHOqO6fNO39-0MN9zOntI-2ZnVTUXBNSQ
I also did an internal presentation on the scheduler prototype back at
Las Vegas, that presentation might also be somewhere on the Linaro web
site.


>>
>> Do you have any numbers in comparison to existing scheduler to show
>> magnitude of the scalablity on any platform?


Re: [lng-odp] [API-NEXT PATCH v2 07/16] test: odp_scheduling: Handle dequeueing from a concurrent queue

2017-04-06 Thread Ola Liljedahl
On 5 April 2017 at 23:39, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
> On 04/05/17 17:30, Ola Liljedahl wrote:
>> On 5 April 2017 at 14:50, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
>>> On 04/05/17 06:57, Honnappa Nagarahalli wrote:
>>>> This can go into master/api-next as an independent patch. Agree?
>>>
>>> agree. If we accept implementation where events can be 'delayed'
>> Probably all platforms with HW queues.
>>
>>> than it
>>> looks like we missed some api to sync queues.
>> When would those API's be used?
>>
>
> might be in case like that. Might be it's not needed in real world
> application.
This was a test program. I don't see the same situation occurring in a
real world application. I could be wrong.

>
> My point that if situation of postpone event is accepted that we need
> document that in api doxygen comment.
I think the asynchronous behaviour is the default. ODP is a hardware
abstraction. HW is often asynchronous, writes are posted etc. Ensuring
synchronous behaviour costs performance.

Single-threaded software is "synchronous", writes are immediately
visible to the thread. But as soon as you go multi-threaded and don't
use locks to access shared resources, software also becomes
"asynchronous" (don't know if it is the right word here). Only if you
use locks to synchronise accesses to shared memory you return to some
form of sequential consistency (all threads see updates in the same
order). You don't want to use locks, that quickly creates scalability
bottlenecks.

Since the scalable scheduler does its best to avoid locks
(non-scalable) and sequential consistency (slow), instead utilising
lock-less and lock-free algorithms and weak memory ordering (e.g.
acquire/release), it exposes the underlying hardware characteristics.
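
A tiny illustration of the acquire/release style referred to above, using
the GCC built-ins:

static int data;
static int ready;

static void producer(void)
{
        data = 42;
        __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
}

static int consumer(void)
{
        while (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
                ; /* spin */
        return data; /* guaranteed to observe 42 */
}

A consumer that sees ready == 1 is guaranteed to also see the store to
data, but no total order over all threads' stores is implied, unlike with
locks or sequentially consistent atomics.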

>
> Maxim.
>
>>>
>>> But I do not see why we need this patch. On the same cpu test queue 1
>>> event and after that dequeue 1 event:
>>>
>>> for (i = 0; i < QUEUE_ROUNDS; i++) {
>>> ev = odp_buffer_to_event(buf);
>>>
>>> if (odp_queue_enq(queue, ev)) {
>>> LOG_ERR("  [%i] Queue enqueue failed.\n", thr);
>>> odp_buffer_free(buf);
>>> return -1;
>>> }
>>>
>>> ev = odp_queue_deq(queue);
>>>
>>> buf = odp_buffer_from_event(ev);
>>>
>>> if (!odp_buffer_is_valid(buf)) {
>>> LOG_ERR("  [%i] Queue empty.\n", thr);
>>> return -1;
>>> }
>>> }
>>>
>>> Where this exactly event can be delayed?
>> In the memory system.
>>
>>>
>>> If other threads do the same - then all do enqueue 1 event first and
>>> then dequeue one event. I can understand problem with queueing on one
>>> cpu and dequeuing on other cpu. But on the same cpu is has to always
>>> work. Isn't it?
>> No.
>>
>>>
>>> Maxim.
>>>
>>>>
>>>> On 4 April 2017 at 21:22, Brian Brooks <brian.bro...@arm.com> wrote:
>>>>> On 04/04 17:26:12, Bill Fischofer wrote:
>>>>>> On Tue, Apr 4, 2017 at 3:37 PM, Brian Brooks <brian.bro...@arm.com> 
>>>>>> wrote:
>>>>>>> On 04/04 21:59:15, Maxim Uvarov wrote:
>>>>>>>> On 04/04/17 21:47, Brian Brooks wrote:
>>>>>>>>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>>>>>>>>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>>>>>>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
>>>>>>>>> Reviewed-by: Kevin Wang <kevin.w...@arm.com>
>>>>>>>>> ---
>>>>>>>>>  test/common_plat/performance/odp_scheduling.c | 12 ++--
>>>>>>>>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/test/common_plat/performance/odp_scheduling.c 
>>>>>>>>> b/test/common_plat/performance/odp_scheduling.c
>>>>>>>>> index c74a0713..38e76257 100644
>>>>>>>>> --- a/test/common_plat/performance/odp_scheduling.c
>>>>>>>>> +++ b/test/common_plat/performance/odp_scheduling.c
>>>>>>>>> @@ -273,7 +273,7 @@ static int test_plai

Re: [lng-odp] [API-NEXT PATCH v2 15/16] Add llqueue, an unbounded concurrent queue

2017-04-05 Thread Ola Liljedahl
On 5 April 2017 at 17:33, Dmitry Eremin-Solenikov
<dmitry.ereminsoleni...@linaro.org> wrote:
> On 05.04.2017 17:40, Ola Liljedahl wrote:
>> On 5 April 2017 at 14:20, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
>>> On 04/05/17 01:46, Ola Liljedahl wrote:
>>>> On 4 April 2017 at 21:25, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
>>>>> it's better to have 2 separate files for that. One for ODP_CONFIG_LLDSCD
>>>> "better"? In what way?
>> Please respond to the question. If you claim something is "better",
>> you must be able to explain *why* it is better.
>>
>> *We* have explained why we think it is better to keep both
>> implementations in the same file, close to each other. I think Brian's
>> explanation was very good.
>
> Because it allows one to overview a complete implementation at once
> instead of switching between two different modes.
That's a good argument as well. It doesn't mean that the
implementations should live in separate files.

We keep both implementations in the same file but avoid interleaving
the different functions (as is done now). This is actually what someone
in our team wanted.

>
> --
> With best wishes
> Dmitry


Re: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software scheduler

2017-04-05 Thread Ola Liljedahl
failure
>>>> _fdserver.c:463:handle_request():FD table full
>>>> _fdserver.c:297:_odp_fdserver_register_fd():fd registration failure
>>>> _fdserver.c:463:handle_request():FD table full
>>>> _fdserver.c:297:_odp_fdserver_register_fd():fd registration failure
>>>> _fdserver.c:342:_odp_fdserver_deregister_fd():fd de-registration failure
>>>> _fdserver.c:463:handle_request():FD table full
>>>> _fdserver.c:297:_odp_fdserver_register_fd():fd registration failure
>>>> _fdserver.c:463:handle_request():FD table full
>>>> _fdserver.c:297:_odp_fdserver_register_fd():fd registration failure
>>>> _fdserver.c:463:handle_request():FD table full
>>>> _fdserver.c:297:_odp_fdserver_register_fd():fd registration failure
>>>> _fdserver.c:463:handle_request():FD table full
>>>> _fdserver.c:297:_odp_fdserver_register_fd():fd registration failure
>>>> _fdserver.c:342:_odp_fdserver_deregister_fd():fd de-registration failure
>>>> _fdserver.c:342:_odp_fdserver_deregister_fd():fd de-registration failure
>>>>
>>>> These messages repeat throughout the test even though it "passes".
>>>> Clearly something isn't right.
>>>
>>> We have done considerable amount of testing on x86 as well as ARM with
>>> different schedulers.
>>> Can you provide more details?
>>> What is the config command you used?
>>> What platform (x86 vs ARM)?
>>> I assume you are running 'make check'.
>>>
>>>>
>>>> On Tue, Apr 4, 2017 at 1:47 PM, Brian Brooks <brian.bro...@arm.com> wrote:
>>>>> This work derives from Ola Liljedahl's prototype [1] which introduced a
>>>>> scalable scheduler design based on primarily lock-free algorithms and
>>>>> data structures designed to decrease contention. A thread searches
>>>>> through a data structure containing only queues that are both non-empty
>>>>> and allowed to be scheduled to that thread. Strict priority scheduling is
>>>>> respected, and (W)RR scheduling may be used within queues of the same 
>>>>> priority.
>>>>> Lastly, pre-scheduling or stashing is not employed since it is optional
>>>>> functionality that can be implemented in the application.
>>>>>
>>>>> In addition to scalable ring buffers, the algorithm also uses unbounded
>>>>> concurrent queues. LL/SC and CAS variants exist in cases where absence of
>>>>> ABA problem cannot be proved, and also in cases where the compiler's 
>>>>> atomic
>>>>> built-ins may not be lowered to the desired instruction(s). Finally, a 
>>>>> version
>>>>> of the algorithm that uses locks is also provided.
>>>>>
>>>>> See platform/linux-generic/include/odp_config_internal.h for further build
>>>>> time configuration.
>>>>>
>>>>> Use --enable-schedule-scalable to conditionally compile this scheduler
>>>>> into the library.
>>>>>
>>>>> [1] https://lists.linaro.org/pipermail/lng-odp/2016-September/025682.html
>>>>>
>>>>> v2:
>>>>>  - Move ARMv8 issues and other fixes into separate patches
>>>>>  - Abstract away some #ifdefs
>>>>>  - Fix some checkpatch.pl warnings
>>>>>
>>>>> Brian Brooks (14):
>>>>>   Fix native Clang build on ARMv8
>>>>>   api: queue: Add ring_size
>>>>>   Add ODP_CONFIG_QUEUE_SIZE
>>>>>   Fix a locking bug
>>>>>   test: odp_scheduling: Handle dequeueing from a concurrent queue
>>>>>   test: scheduler: Fixup calling release operations
>>>>>   Avoid shm namespace collisions and allow shm block per queue
>>>>>   Add _odp_packet_to_buf_hdr_ptr()
>>>>>   Add scalable scheduler build config
>>>>>   Add LL/SC and signaling primitives
>>>>>   Add a bitset
>>>>>   Add atomic ops for 128-bit scalars
>>>>>   Add llqueue, an unbounded concurrent queue
>>>>>   Add scalable scheduler
>>>>>
>>>>> Ola Liljedahl (2):
>>>>>   linux-generic: ring.c: use required memory orderings
>>>>>   helper: cuckootable: Specify queue ring_size
>>>>>
>>>>>  configure.ac   |   30 +-
>>>>>  helper/cuckootable.c   |1

Re: [lng-odp] [API-NEXT PATCH v2 15/16] Add llqueue, an unbounded concurrent queue

2017-04-05 Thread Ola Liljedahl
On 5 April 2017 at 01:21, Dmitry Eremin-Solenikov
 wrote:
> On 05.04.2017 00:25, Brian Brooks wrote:
>> On 04/04 23:23:33, Dmitry Eremin-Solenikov wrote:
>>> On 04.04.2017 22:25, Maxim Uvarov wrote:
 it's better to have 2 separate files for that. One for ODP_CONFIG_LLDSCD
 defined and one for not.
>>>
>>> Seconding that. At least LLDSCD and non-LLDSCD code should not be
>>> interleaved.
>>
>> Can you explain your judgement?
>
> Consider reading two intermixed books of technical recipes. It is just
> my opinion, but I'd prefer to have two separate code blocks: one for
> LLDSCD, one for non-LLDSCD cases.
In this case, it is two recipes for baking the *same* cake. We think
it is useful to be able to easily compare the recipes.
Each function can be considered a separate code block; it's not like
we are mixing recipes line by line.

>
> --
> With best wishes
> Dmitry


Re: [lng-odp] [API-NEXT PATCH v2 15/16] Add llqueue, an unbounded concurrent queue

2017-04-05 Thread Ola Liljedahl
On 5 April 2017 at 14:20, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
> On 04/05/17 01:46, Ola Liljedahl wrote:
>> On 4 April 2017 at 21:25, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
>>> it's better to have 2 separate files for that. One for ODP_CONFIG_LLDSCD
>> "better"? In what way?
Please respond to the question. If you claim something is "better",
you must be able to explain *why* it is better.

*We* have explained why we think it is better to keep both
implementations in the same file, close to each other. I think Brian's
explanation was very good.

>>
>>> defined and one for not. Also ODP_ prefix should not be used for
>>> internal things (not api).
>> OK this was not clear, some of the defines in odp_config_internal.h
>> use an ODP_ prefix, some not. You mean there is a system to that?
>>
>> Shouldn't those defines that are part of the API be declared/described
>> in the API header files (located in include/odp/api/spec)? How else do
>> you know that they are part of the API? And if they are part of the
>> API, how does the application (the 'A' in API) access the definitions
>> *and* their values?
>>
>> There are API's for querying about things like total number of queues
>> but those API's are separate and do not depend on some define with a
>> specific name.
>>
>
> That is not api setting. It's linux-generic internal settings. ODP apps
> do not use that values.
>
> Maxim.
>
>>>
>>> Maxim.
>>>
>>> On 04/04/17 21:48, Brian Brooks wrote:
>>>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>>>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>>>> ---
>>>>  platform/linux-generic/include/odp_llqueue.h | 285 +++
>>>>  1 file changed, 285 insertions(+)
>>>>  create mode 100644 platform/linux-generic/include/odp_llqueue.h
>>>>
>>>> diff --git a/platform/linux-generic/include/odp_llqueue.h 
>>>> b/platform/linux-generic/include/odp_llqueue.h
>>>> new file mode 100644
>>>> index ..aa46ace3
>>>> --- /dev/null
>>>> +++ b/platform/linux-generic/include/odp_llqueue.h
>>>> @@ -0,0 +1,285 @@
>>>> +/* Copyright (c) 2017, ARM Limited.
>>>> + * All rights reserved.
>>>> + *
>>>> + * SPDX-License-Identifier:BSD-3-Clause
>>>> + */
>>>> +
>>>> +#ifndef ODP_LLQUEUE_H_
>>>> +#define ODP_LLQUEUE_H_
>>>> +
>>>> +#include 
>>>> +#include 
>>>> +#include 
>>>> +
>>>> +#include 
>>>> +#include 
>>>> +#include 
>>>> +
>>>> +#include 
>>>> +#include 
>>>> +
>>>> +/****************************************************************
>>>> + * Linked list queues
>>>> + ****************************************************************/
>>>> +
>>>> +/* The scalar equivalent of a double pointer */
>>>> +#if __SIZEOF_PTRDIFF_T__ == 4
>>>> +typedef uint64_t dintptr_t;
>>>> +#endif
>>>> +#if __SIZEOF_PTRDIFF_T__ == 8
>>>> +typedef __int128 dintptr_t;
>>>> +#endif
>>>> +
>>>> +#define SENTINEL ((void *)~(uintptr_t)0)
>>>> +
>>>> +struct llnode {
>>>> + struct llnode *next;
>>>> +};
>>>> +
>>>> +union llht {
>>>> + struct {
>>>> + struct llnode *head, *tail;
>>>> + } st;
>>>> + dintptr_t ui;
>>>> +};
>>>> +
>>>> +struct llqueue {
>>>> + union llht u;
>>>> +#ifndef ODP_CONFIG_LLDSCD
>>>> + odp_spinlock_t lock;
>>>> +#endif
>>>> +};
>>>> +
>>>> +static inline struct llnode *llq_head(struct llqueue *llq)
>>>> +{
>>>> + return __atomic_load_n(&llq->u.st.head, __ATOMIC_RELAXED);
>>>> +}
>>>> +
>>>> +static inline void llqueue_init(struct llqueue *llq)
>>>> +{
>>>> + llq->u.st.head = NULL;
>>>> + llq->u.st.tail = NULL;
>>>> +#ifndef ODP_CONFIG_LLDSCD
>>>> + odp_spinlock_init(&llq->lock);
>>>> +#endif
>>>> +}
>>>> +
>>>> +#ifdef ODP_CONFIG_LLDSCD
>>>> +
>>>> +static inline void llq_en

Re: [lng-odp] [API-NEXT PATCH v2 07/16] test: odp_scheduling: Handle dequeueing from a concurrent queue

2017-04-05 Thread Ola Liljedahl
On 5 April 2017 at 14:50, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
> On 04/05/17 06:57, Honnappa Nagarahalli wrote:
>> This can go into master/api-next as an independent patch. Agree?
>
> agree. If we accept an implementation where events can be 'delayed'
Probably all platforms with HW queues.

> then it
> looks like we missed some API to sync queues.
When would those API's be used?

>
> But I do not see why we need this patch. On the same CPU the test enqueues 1
> event and after that dequeues 1 event:
>
> for (i = 0; i < QUEUE_ROUNDS; i++) {
>         ev = odp_buffer_to_event(buf);
>
>         if (odp_queue_enq(queue, ev)) {
>                 LOG_ERR("  [%i] Queue enqueue failed.\n", thr);
>                 odp_buffer_free(buf);
>                 return -1;
>         }
>
>         ev = odp_queue_deq(queue);
>
>         buf = odp_buffer_from_event(ev);
>
>         if (!odp_buffer_is_valid(buf)) {
>                 LOG_ERR("  [%i] Queue empty.\n", thr);
>                 return -1;
>         }
> }
>
> Where exactly can this event be delayed?
In the memory system.

>
> If other threads do the same - then all do enqueue 1 event first and
> then dequeue one event. I can understand the problem with enqueueing on one
> CPU and dequeuing on another CPU. But on the same CPU it has to always
> work, doesn't it?
No.

>
> Maxim.
>
>>
>> On 4 April 2017 at 21:22, Brian Brooks <brian.bro...@arm.com> wrote:
>>> On 04/04 17:26:12, Bill Fischofer wrote:
>>>> On Tue, Apr 4, 2017 at 3:37 PM, Brian Brooks <brian.bro...@arm.com> wrote:
>>>>> On 04/04 21:59:15, Maxim Uvarov wrote:
>>>>>> On 04/04/17 21:47, Brian Brooks wrote:
>>>>>>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>>>>>>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>>>>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
>>>>>>> Reviewed-by: Kevin Wang <kevin.w...@arm.com>
>>>>>>> ---
>>>>>>>  test/common_plat/performance/odp_scheduling.c | 12 ++--
>>>>>>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/test/common_plat/performance/odp_scheduling.c 
>>>>>>> b/test/common_plat/performance/odp_scheduling.c
>>>>>>> index c74a0713..38e76257 100644
>>>>>>> --- a/test/common_plat/performance/odp_scheduling.c
>>>>>>> +++ b/test/common_plat/performance/odp_scheduling.c
>>>>>>> @@ -273,7 +273,7 @@ static int test_plain_queue(int thr, test_globals_t 
>>>>>>> *globals)
>>>>>>> test_message_t *t_msg;
>>>>>>> odp_queue_t queue;
>>>>>>> uint64_t c1, c2, cycles;
>>>>>>> -   int i;
>>>>>>> +   int i, j;
>>>>>>>
>>>>>>> /* Alloc test message */
>>>>>>> buf = odp_buffer_alloc(globals->pool);
>>>>>>> @@ -307,7 +307,15 @@ static int test_plain_queue(int thr, 
>>>>>>> test_globals_t *globals)
>>>>>>> return -1;
>>>>>>> }
>>>>>>>
>>>>>>> -   ev = odp_queue_deq(queue);
>>>>>>> +   /* When enqueue and dequeue are decoupled (e.g. not using a
>>>>>>> +* common lock), an enqueued event may not be immediately
>>>>>>> +* visible to dequeue. So we just try again for a while. */
>>>>>>> +   for (j = 0; j < 100; j++) {
>>>>>>
>>>>>> where 100 number comes from?
>>>>>
>>>>> It is the retry count. Perhaps it could be a bit lower, or a bit higher, 
>>>>> but
>>>>> it works well.
>>>>
>>>> Actually, it's incorrect. What happens if all 100 retries fail? You'll
>>>> call odp_buffer_from_event() for ODP_EVENT_INVALID, which is
>>>> undefined.
>>>
>>> Incorrect? :) The point is that an event may not be immediately available
>>> to dequeue after it has been enqueued. This is due to the way that a 
>>> concurrent
>>> ring buffer behaves in a multi-threaded environment. The approach here is
>>> just to retry the dequeue a couple times (100 times actually) before moving
>>> on to the rest of code. Perhaps 100 times is too many times, but some amount
>>> of retry is needed.
>>>
>>> If this is not desirable, then I think it would be more accurate to consider
>>> odp_queue_enq() / odp_queue_deq() as async APIs -or- MT-unsafe (must be 
>>> called
>>> from one thread at a time in order to ensure the behavior that an event is
>>> immediately available for dequeue once it has been enqueued).
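
A sketch of the retry pattern under discussion, factored into a helper
(the bound and the helper name are illustrative; the ODP calls and
ODP_EVENT_INVALID are the ones used in the patch):

    #include <odp_api.h>

    #define DEQ_RETRIES 100 /* illustrative bound, like the patch's 100 */

    static odp_event_t dequeue_with_retry(odp_queue_t queue)
    {
            odp_event_t ev = ODP_EVENT_INVALID;
            int i;

            /* With decoupled enqueue/dequeue, an event just enqueued may
             * not be visible yet; poll a bounded number of times. */
            for (i = 0; i < DEQ_RETRIES; i++) {
                    ev = odp_queue_deq(queue);
                    if (ev != ODP_EVENT_INVALID)
                            break;
                    odp_cpu_pause();
            }
            return ev; /* may still be ODP_EVENT_INVALID: caller must check */
    }
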
>>>
>>>>>
>>>>>> Maxim.
>>>>>>
>>>>>>> +   ev = odp_queue_deq(queue);
>>>>>>> +   if (ev != ODP_EVENT_INVALID)
>>>>>>> +   break;
>>>>>>> +   odp_cpu_pause();
>>>>>>> +   }
>>>>>>>
>>>>>>> buf = odp_buffer_from_event(ev);
>>>>>>>
>>>>>>>
>>>>>>
>


Re: [lng-odp] [API-NEXT PATCH v2 13/16] Add a bitset

2017-04-05 Thread Ola Liljedahl

On 05/04/2017, 15:39, "Dmitry Eremin-Solenikov"
<dmitry.ereminsoleni...@linaro.org> wrote:

>On 05.04.2017 16:33, Ola Liljedahl wrote:
>> 
>> 
>> 
>> 
>> On 05/04/2017, 15:22, "Dmitry Eremin-Solenikov"
>> <dmitry.ereminsoleni...@linaro.org> wrote:
>> 
>>> On 05.04.2017 15:16, Ola Liljedahl wrote:
>>>> On 05/04/2017, 12:36, "Dmitry Eremin-Solenikov"
>>>> <dmitry.ereminsoleni...@linaro.org> wrote:
>>>>
>>>>> On 05.04.2017 02:31, Ola Liljedahl wrote:
>>>>>> On 05/04/2017, 01:25, "Dmitry Eremin-Solenikov"
>>>>>> <dmitry.ereminsoleni...@linaro.org> wrote:
>>>>>>> On 04.04.2017 23:52, Ola Liljedahl wrote:
>>>>>>>> Sending from my ARM email account, I hope Outlook does not mess up
>>>>>>>> the
>>>>>>>> format.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04/04/2017, 22:21, "Dmitry Eremin-Solenikov"
>>>>>>>> <dmitry.ereminsoleni...@linaro.org> wrote:
>>>>>>>>
>>>>>>>>> On 04.04.2017 21:48, Brian Brooks wrote:
>>>>>>>>>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>>>>>>>>>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>>>>>>>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 
>>>>>>>>>> +/****************************************************************
>>>>>>>>>> + * bitset abstract data type
>>>>>>>>>> + ****************************************************************/
>>>>>>>>>> +/* This could be a struct of scalars to support larger bit sets
>>>>>>>>>> */
>>>>>>>>>> +
>>>>>>>>>> +#if ATOM_BITSET_SIZE <= 32
>>>>>>>>>
>>>>>>>>> Maybe I missed, where did you set this macro?
>>>>>>>> In odp_config_internal.h
>>>>>>>> It is a build time configuration.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Also, why do you need several versions of bitset? Can you stick
>>>>>>>>>to
>>>>>>>>> one
>>>>>>>>> size that fits all?
>>>>>>>> Some 32-bit archs (ARMv7a, x86) will only support 64-bit atomics
>>>>>>>> (AFAIK).
>>>>>>>> Only x86-64 and ARMv8a supports 128-bit atomics (and compiler
>>>>>>>> support
>>>>>>>> for
>>>>>>>> 128-bit atomics for ARMv8a is a bit lacking...).
>>>>>>>> Other architectures might only support 32-bit atomic operations.
>>>>>>>
>>>>>>> What will be the major outcome of settling on the 64-bit atomics?
>>>>>> The size of the bitset determines the maximum number of threads, the
>>>>>> maximum number of scheduler groups and the maximum number of reorder
>>>>>> contexts (per thread).
>>>>>
>>>>> Then even 128 can become too small in the forthcoming future. As far
>>>>>as
>>>>> I understand, most of the interesting things happen around
>>>>> bitsetting/clearing. Maybe we can redefine bitset as a struct or
>>>>>array
>>>>> of atomics? Then it would be expandable without significant software
>>>>> issues

Re: [lng-odp] [API-NEXT PATCH v2 13/16] Add a bitset

2017-04-05 Thread Ola Liljedahl




On 05/04/2017, 15:22, "Dmitry Eremin-Solenikov"
<dmitry.ereminsoleni...@linaro.org> wrote:

>On 05.04.2017 15:16, Ola Liljedahl wrote:
>> On 05/04/2017, 12:36, "Dmitry Eremin-Solenikov"
>> <dmitry.ereminsoleni...@linaro.org> wrote:
>> 
>>> On 05.04.2017 02:31, Ola Liljedahl wrote:
>>>> On 05/04/2017, 01:25, "Dmitry Eremin-Solenikov"
>>>> <dmitry.ereminsoleni...@linaro.org> wrote:
>>>>> On 04.04.2017 23:52, Ola Liljedahl wrote:
>>>>>> Sending from my ARM email account, I hope Outlook does not mess up
>>>>>>the
>>>>>> format.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 04/04/2017, 22:21, "Dmitry Eremin-Solenikov"
>>>>>> <dmitry.ereminsoleni...@linaro.org> wrote:
>>>>>>
>>>>>>> On 04.04.2017 21:48, Brian Brooks wrote:
>>>>>>>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>>>>>>>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>>>>>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 
>>>>>>>> +/****************************************************************
>>>>>>>> + * bitset abstract data type
>>>>>>>> + ****************************************************************/
>>>>>>>> +/* This could be a struct of scalars to support larger bit sets
>>>>>>>>*/
>>>>>>>> +
>>>>>>>> +#if ATOM_BITSET_SIZE <= 32
>>>>>>>
>>>>>>> Maybe I missed, where did you set this macro?
>>>>>> In odp_config_internal.h
>>>>>> It is a build time configuration.
>>>>>>
>>>>>>>
>>>>>>> Also, why do you need several versions of bitset? Can you stick to
>>>>>>> one
>>>>>>> size that fits all?
>>>>>> Some 32-bit archs (ARMv7a, x86) will only support 64-bit atomics
>>>>>> (AFAIK).
>>>>>> Only x86-64 and ARMv8a supports 128-bit atomics (and compiler
>>>>>>support
>>>>>> for
>>>>>> 128-bit atomics for ARMv8a is a bit lackingŠ).
>>>>>> Other architectures might only support 32-bit atomic operations.
>>>>>
>>>>> What will be the major outcome of settling on the 64-bit atomics?
>>>> The size of the bitset determines the maximum number of threads, the
>>>> maximum number of scheduler groups and the maximum number of reorder
>>>> contexts (per thread).
>>>
>>> Then even 128 can become too small in the forthcoming future. As far as
>>> I understand, most of the interesting things happen around
>>> bitsetting/clearing. Maybe we can redefine bitset as a struct or array
>>> of atomics? Then it would be expandable without significant software
>>> issues, wouldn't it?
>>>
>>> I'm trying to get away of situation where we have overcomplicated low
>>> level code, which brings different issues on further platforms (like
>>> supporting this amount of threads on ARM and that amount of threads on
>>> x86/PPC/MIPS/etc).
>> I think the current implementation is simple and efficient. I also
>>think it
>> is sufficiently capable, e.g. supports up to 128 threads/scheduler
>>groups
>> etc.
>
>With 96 cores on existing boards, 128 seems quite like a close limit.
The limit imposed by bitset_t is the number of threads (CPU's) in one ODP
application. It is not a platform or system limit.

How likely is it that all of those 96 cores will be executing the same ODP
application?
I doubt anyone wants to have an ODP app spanning more than one socket,
considering the inter-socket latency on current multi-socket capable SoC's.

>
>> on 64-bit ARM and x86, up to 64 on 32-bit ARM/x86 and 64-bit MIPS. I
>>don't
>> think we should make a more complicated generic implementation until the
>> need has surfaced. It is easy to over-speculate in what will be
>>required in
>> the future and implement stuff that is never used.
>
>It is already overcomplicated.
What do you think is overcomplicated? I think the code is very simple.
Only one or two functions have more than one line of C code in them.

> It is a nice scientific solution,
"scientific"?

> it
>might be high performance, but it is a bit too complicated for generic
>code.
What is "too complicated" and what simpler solution do you suggest instead?

> I have the feeling that it can find path in odp-cloud, but for
>odp/linux-generic we need (IMO) initially a simple code.
Then we shouldn't add the scalable scheduler to linux-generic, too
complicated.

>
>-- 
>With best wishes
>Dmitry



Re: [lng-odp] [API-NEXT PATCH v2 13/16] Add a bitset

2017-04-05 Thread Ola Liljedahl
On 05/04/2017, 12:36, "Dmitry Eremin-Solenikov"
<dmitry.ereminsoleni...@linaro.org> wrote:

>On 05.04.2017 02:31, Ola Liljedahl wrote:
>> On 05/04/2017, 01:25, "Dmitry Eremin-Solenikov"
>> <dmitry.ereminsoleni...@linaro.org> wrote:
>>> On 04.04.2017 23:52, Ola Liljedahl wrote:
>>>> Sending from my ARM email account, I hope Outlook does not mess up the
>>>> format.
>>>>
>>>>
>>>>
>>>> On 04/04/2017, 22:21, "Dmitry Eremin-Solenikov"
>>>> <dmitry.ereminsoleni...@linaro.org> wrote:
>>>>
>>>>> On 04.04.2017 21:48, Brian Brooks wrote:
>>>>>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>>>>>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>>>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
>>>>>
>>>>>>
>>>>>>
>>>>>> 
>>>>>> +/****************************************************************
>>>>>> + * bitset abstract data type
>>>>>> + ****************************************************************/
>>>>>> +/* This could be a struct of scalars to support larger bit sets */
>>>>>> +
>>>>>> +#if ATOM_BITSET_SIZE <= 32
>>>>>
>>>>> Maybe I missed, where did you set this macro?
>>>> In odp_config_internal.h
>>>> It is a build time configuration.
>>>>
>>>>>
>>>>> Also, why do you need several versions of bitset? Can you stick to
>>>>>one
>>>>> size that fits all?
>>>> Some 32-bit archs (ARMv7a, x86) will only support 64-bit atomics
>>>> (AFAIK).
>>>> Only x86-64 and ARMv8a supports 128-bit atomics (and compiler support
>>>> for
>>>> 128-bit atomics for ARMv8a is a bit lackingŠ).
>>>> Other architectures might only support 32-bit atomic operations.
>>>
>>> What will be the major outcome of settling on the 64-bit atomics?
>> The size of the bitset determines the maximum number of threads, the
>> maximum number of scheduler groups and the maximum number of reorder
>> contexts (per thread).
>
>Then even 128 can become too small in the forthcoming future. As far as
>I understand, most of the interesting things happen around
>bitsetting/clearing. Maybe we can redefine bitset as a struct or array
>of atomics? Then it would be expandable without significant software
>issues, wouldn't it?
>
>I'm trying to get away of situation where we have overcomplicated low
>level code, which brings different issues on further platforms (like
>supporting this amount of threads on ARM and that amount of threads on
>x86/PPC/MIPS/etc).
I think the current implementation is simple and efficient. I also think it
is sufficiently capable, e.g. supports up to 128 threads/scheduler groups
etc.
on 64-bit ARM and x86, up to 64 on 32-bit ARM/x86 and 64-bit MIPS. I don't
think we should make a more complicated generic implementation until the
need has surfaced. It is easy to over-speculate in what will be required in
the future and implement stuff that is never used.

>
>>>> I think the user should have control over this but if you think that
>>>>we
>>>> should just select the max value that is supported by the architecture
>>>> in
>>>> question and thus skip one build configuration, I am open to this. We
>>>> will
>>>> still need separate versions for 32/64/128 bits because there are
>>>>slight
>>>> differences in the syntax and implementation. Such are the vagaries of
>>>> the
>>>> C standard (and GCC extensions).
>>>>
>>>>
>>>>> Any real reason for the following defines? Why do you need them?
>>>> The functions were added as they were needed, e.g. in
>>>> odp_schedule_scalable.c.
>>>> I don't think there is any one that is not used anymore but I can
>>>> double-check that.
>>>
>>> Well. I maybe should rephrase my question: why do you think that it's
>>> better to have bitset_andn(a, b), rather than just a &~b ?
>> The atomic bitset is an abstract data type. The implementation does not
>> have to use a scalar word. Alternative implementation paths exist, e.g.
>> use a struct with multiple words and perform the requested operation one
>> word at a time (this is OK but perhaps not well documented).
>
>This makes sense, esp. if we add non-plain-integer bitsets.
One note on using a struct with multiple words: this might in some
cases require multiple atomic operations (one per word), and this will
be slower.
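
A sketch of the struct-of-words alternative mentioned above (the type,
size and names are hypothetical, not from the patch, which uses a single
scalar word):

    #include <stdint.h>

    #define BS_WORDS 4 /* hypothetical: 4 x 64 = 256 bits */

    typedef struct {
            uint64_t w[BS_WORDS];
    } bitset256_t;

    /* Set one bit: a single atomic op on one word. An operation that
     * touches several words would need one atomic op per word and would
     * no longer be atomic as a whole -- the cost noted above. */
    static inline void bitset256_set(bitset256_t *bs, uint32_t bit)
    {
            (void)__atomic_fetch_or(&bs->w[bit / 64],
                                    UINT64_C(1) << (bit % 64),
                                    __ATOMIC_RELAXED);
    }
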

>
>
>-- 
>With best wishes
>Dmitry



Re: [lng-odp] [API-NEXT PATCH v2 12/16] Add LL/SC and signaling primitives

2017-04-04 Thread Ola Liljedahl




On 05/04/2017, 01:29, "Dmitry Eremin-Solenikov"
<dmitry.ereminsoleni...@linaro.org> wrote:

>On 05.04.2017 01:00, Brian Brooks wrote:
>> On Tue, Apr 4, 2017 at 3:38 PM, Ola Liljedahl <ola.liljed...@arm.com>
>>wrote:
>>> On 04/04/2017, 22:14, "Dmitry Eremin-Solenikov"
>>> <dmitry.ereminsoleni...@linaro.org> wrote:
>>>> On 04.04.2017 21:48, Brian Brooks wrote:
>
>>>>> +#endif
>>>>> +
>>>>> +#if __ARM_ARCH == 8 && __ARM_64BIT_STATE == 1
>>>>
>>>> #elif here please.
>>> Brian this one is for you! :-)
>> 
>> I am not sure where you are requesting the #elif, Dmitry. The first
>> block is for ARMv7 and AArch32, and the second block is for AArch64.
>> Each block is wrapped in a #if XYZ ... #endif.  It's symmetrical.
>
>Yep. However it is more common (at least I'm more used to) having the
>following code. It is more error prone and easier to follow.
I assume you mean the style below is more *robust*? And/or errors are more
easily detected/reported (handled by the #else/#error statements)?

>
>#if XYZ
>#elif FOO
>#elif ABC
>#else
>#error unsupported beast!
>#endif
>
>
>-- 
>With best wishes
>Dmitry



Re: [lng-odp] [API-NEXT PATCH v2 13/16] Add a bitset

2017-04-04 Thread Ola Liljedahl
Trying a different way to avoid the ARM disclaimer. But just to make sure,
this email does NOT contain any confidential information.



On 05/04/2017, 01:25, "Dmitry Eremin-Solenikov"
<dmitry.ereminsoleni...@linaro.org> wrote:

>On 04.04.2017 23:52, Ola Liljedahl wrote:
>> Sending from my ARM email account, I hope Outlook does not mess up the
>> format.
>> 
>> 
>> 
>> On 04/04/2017, 22:21, "Dmitry Eremin-Solenikov"
>> <dmitry.ereminsoleni...@linaro.org> wrote:
>> 
>>> On 04.04.2017 21:48, Brian Brooks wrote:
>>>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>>>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
>>>
>>>>
>>>> 
>>>> +/****************************************************************
>>>> + * bitset abstract data type
>>>> + ****************************************************************/
>>>> +/* This could be a struct of scalars to support larger bit sets */
>>>> +
>>>> +#if ATOM_BITSET_SIZE <= 32
>>>
>>> Maybe I missed, where did you set this macro?
>> In odp_config_internal.h
>> It is a build time configuration.
>> 
>>>
>>> Also, why do you need several versions of bitset? Can you stick to one
>>> size that fits all?
>> Some 32-bit archs (ARMv7a, x86) will only support 64-bit atomics
>>(AFAIK).
>> Only x86-64 and ARMv8a supports 128-bit atomics (and compiler support
>>for
>>128-bit atomics for ARMv8a is a bit lacking...).
>> Other architectures might only support 32-bit atomic operations.
>
>What will be the major outcome of settling on the 64-bit atomics?
The size of the bitset determines the maximum number of threads, the
maximum number of scheduler groups and the maximum number of reorder
contexts (per thread).

>
>> I think the user should have control over this but if you think that we
>> should just select the max value that is supported by the architecture
>>in
>> question and thus skip one build configuration, I am open to this. We
>>will
>> still need separate versions for 32/64/128 bits because there are slight
>> differences in the syntax and implementation. Such are the vagaries of
>>the
>> C standard (and GCC extensions).
>> 
>> 
>>> Any real reason for the following defines? Why do you need them?
>> The functions were added as they were needed, e.g. in
>> odp_schedule_scalable.c.
>> I don't think there is any one that is not used anymore but I can
>> double-check that.
>
>Well. I maybe should rephrase my question: why do you think that it's
>better to have bitset_andn(a, b), rather than just a &~b ?
The atomic bitset is an abstract data type. The implementation does not
have to use a scalar word. Alternative implementation paths exist, e.g.
use a struct with multiple words and perform the requested operation one
word at a time (this is OK but perhaps not well documented).

>
>
>-- 
>With best wishes
>Dmitry



Re: [lng-odp] [API-NEXT PATCH v2 07/16] test: odp_scheduling: Handle dequeueing from a concurrent queue

2017-04-04 Thread Ola Liljedahl
On 4 April 2017 at 20:59, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
> On 04/04/17 21:47, Brian Brooks wrote:
>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
>> Reviewed-by: Kevin Wang <kevin.w...@arm.com>
>> ---
>>  test/common_plat/performance/odp_scheduling.c | 12 ++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/test/common_plat/performance/odp_scheduling.c 
>> b/test/common_plat/performance/odp_scheduling.c
>> index c74a0713..38e76257 100644
>> --- a/test/common_plat/performance/odp_scheduling.c
>> +++ b/test/common_plat/performance/odp_scheduling.c
>> @@ -273,7 +273,7 @@ static int test_plain_queue(int thr, test_globals_t 
>> *globals)
>>   test_message_t *t_msg;
>>   odp_queue_t queue;
>>   uint64_t c1, c2, cycles;
>> - int i;
>> + int i, j;
>>
>>   /* Alloc test message */
>>   buf = odp_buffer_alloc(globals->pool);
>> @@ -307,7 +307,15 @@ static int test_plain_queue(int thr, test_globals_t 
>> *globals)
>>   return -1;
>>   }
>>
>> - ev = odp_queue_deq(queue);
>> + /* When enqueue and dequeue are decoupled (e.g. not using a
>> +  * common lock), an enqueued event may not be immediately
>> +  * visible to dequeue. So we just try again for a while. */
>> + for (j = 0; j < 100; j++) {
>
> where 100 number comes from?
It's a nice round number.

https://www.google.se/webhp?sourceid=chrome-instant=1=2=UTF-8#q=nice+round+number&*

>
> Maxim.
>
>> + ev = odp_queue_deq(queue);
>> + if (ev != ODP_EVENT_INVALID)
>> + break;
>> + odp_cpu_pause();
>> + }
>>
>>   buf = odp_buffer_from_event(ev);
>>
>>
>


Re: [lng-odp] [API-NEXT PATCH v2 07/16] test: odp_scheduling: Handle dequeueing from a concurrent queue

2017-04-04 Thread Ola Liljedahl
On 5 April 2017 at 00:26, Bill Fischofer <bill.fischo...@linaro.org> wrote:
> On Tue, Apr 4, 2017 at 3:37 PM, Brian Brooks <brian.bro...@arm.com> wrote:
>> On 04/04 21:59:15, Maxim Uvarov wrote:
>>> On 04/04/17 21:47, Brian Brooks wrote:
>>> > Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>>> > Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>>> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
>>> > Reviewed-by: Kevin Wang <kevin.w...@arm.com>
>>> > ---
>>> >  test/common_plat/performance/odp_scheduling.c | 12 ++--
>>> >  1 file changed, 10 insertions(+), 2 deletions(-)
>>> >
>>> > diff --git a/test/common_plat/performance/odp_scheduling.c 
>>> > b/test/common_plat/performance/odp_scheduling.c
>>> > index c74a0713..38e76257 100644
>>> > --- a/test/common_plat/performance/odp_scheduling.c
>>> > +++ b/test/common_plat/performance/odp_scheduling.c
>>> > @@ -273,7 +273,7 @@ static int test_plain_queue(int thr, test_globals_t 
>>> > *globals)
>>> > test_message_t *t_msg;
>>> > odp_queue_t queue;
>>> > uint64_t c1, c2, cycles;
>>> > -   int i;
>>> > +   int i, j;
>>> >
>>> > /* Alloc test message */
>>> > buf = odp_buffer_alloc(globals->pool);
>>> > @@ -307,7 +307,15 @@ static int test_plain_queue(int thr, test_globals_t 
>>> > *globals)
>>> > return -1;
>>> > }
>>> >
>>> > -   ev = odp_queue_deq(queue);
>>> > +   /* When enqueue and dequeue are decoupled (e.g. not using a
>>> > +* common lock), an enqueued event may not be immediately
>>> > +* visible to dequeue. So we just try again for a while. */
>>> > +   for (j = 0; j < 100; j++) {
>>>
>>> where 100 number comes from?
>>
>> It is the retry count. Perhaps it could be a bit lower, or a bit higher, but
>> it works well.
>
> Actually, it's incorrect. What happens if all 100 retries fail? You'll
> call odp_buffer_from_event() for ODP_EVENT_INVALID, which is
> undefined.
That's what the code did before. And it is followed by code which
assumes ODP_EVENT_INVALID can be returned.
buf = odp_buffer_from_event(ev);

if (!odp_buffer_is_valid(buf)) {
        LOG_ERR("  [%i] Queue empty.\n", thr);
        return -1;
}

Some similar calls, e.g. odp_packet_from_event(), explicitly check
for ODP_EVENT_INVALID and return ODP_PACKET_INVALID. As
odp_buffer_from_event() is just a static cast, an invalid input just
generates some invalid output (it might still be an undefined
operation).

Possibly we need to fix both the ODP implementation (no check for
invalid event handles because operations on those are anyway undefined
and invalid handles should not be treated as valid inputs) and test
programs which need to check for invalid event handles before passing
handles to other ODP functions.
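
A sketch of that checking convention (LOG_ERR and thr are from the
quoted test program; the helper name is illustrative):

    static int recv_one(odp_queue_t queue, int thr)
    {
            odp_event_t ev = odp_queue_deq(queue);
            odp_buffer_t buf;

            /* Check the event handle *before* converting, so that
             * odp_buffer_from_event() is never handed ODP_EVENT_INVALID */
            if (ev == ODP_EVENT_INVALID) {
                    LOG_ERR("  [%i] Queue empty.\n", thr);
                    return -1;
            }
            buf = odp_buffer_from_event(ev);
            odp_buffer_free(buf);
            return 0;
    }
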


>
>>
>>> Maxim.
>>>
>>> > +   ev = odp_queue_deq(queue);
>>> > +   if (ev != ODP_EVENT_INVALID)
>>> > +   break;
>>> > +   odp_cpu_pause();
>>> > +   }
>>> >
>>> > buf = odp_buffer_from_event(ev);
>>> >
>>> >
>>>


Re: [lng-odp] [API-NEXT PATCH v2 06/16] Fix a locking bug

2017-04-04 Thread Ola Liljedahl
On 5 April 2017 at 00:23, Bill Fischofer <bill.fischo...@linaro.org> wrote:
> On Tue, Apr 4, 2017 at 5:20 PM, Bill Fischofer
> <bill.fischo...@linaro.org> wrote:
>> This is clearly orthogonal to this patch series. Ideally you should
>> (a) Create a Bug to represent this, (b) Post the fix patch noting the
>> Bug URL in the commit log, and (c) update the Bug entry with the URL
>> of the patch that fixes this bug.
>>
>> On Tue, Apr 4, 2017 at 1:47 PM, Brian Brooks <brian.bro...@arm.com> wrote:
>>> Signed-off-by: Kevin Wang <kevin.w...@arm.com>
>>> Reviewed-by: Ola Liljedahl <ola.liljed...@arm.com>
>>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>>> ---
>>>  platform/linux-generic/pktio/loop.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/platform/linux-generic/pktio/loop.c 
>>> b/platform/linux-generic/pktio/loop.c
>>> index 70962839..49d8a211 100644
>>> --- a/platform/linux-generic/pktio/loop.c
>>> +++ b/platform/linux-generic/pktio/loop.c
>>> @@ -176,6 +176,7 @@ static int loopback_send(pktio_entry_t *pktio_entry, 
>>> int index ODP_UNUSED,
>>> pktio_entry->s.stats.out_octets += bytes;
>>> } else {
>>> ODP_DBG("queue enqueue failed %i\n", ret);
>>> +   odp_ticketlock_unlock(&pktio_entry->s.txl);
>>> return -1;
>>
>> A better fix to this is to just delete the return -1 since that will
>> result in the following unlock being executed and ret being returned
>> as the return code from this routine.
I thought we had a coding style that says we shall embrace multiple
exits from a function. At least it looks like that.

However I think that this is a very bad coding style because it creates
a lot of more-or-less redundant code where this type of bug (forgetting
to undo some side effect performed earlier in the function) is basically
guaranteed to happen. Multiple exits with redundant code segments are
also fragile when you start changing the code: you easily forget to
update some exit path, and this is unlikely to be caught by testing; it
is very difficult to achieve 100% code coverage, including all error
handling. Especially error handling is undertested, since many code paths
are very implementation specific and not amenable to black box
testing. One project I worked on in the past (the OSEck RTOS) did
extensive white box testing; making additions or changes was very
cumbersome, but code quality was very high.
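
A sketch of the single-exit style argued for here, using the common
goto-cleanup idiom (a generic example, not taken from ODP code):

    #include <stdlib.h>

    static int process(size_t n)
    {
            char *a = NULL, *b = NULL;
            int ret = -1;

            a = malloc(n);
            if (a == NULL)
                    goto out;
            b = malloc(n);
            if (b == NULL)
                    goto out;
            /* ... the actual work ... */
            ret = 0;
    out:
            /* Side effects are undone in exactly one place, so no error
             * path can forget the cleanup (free(NULL) is a no-op). */
            free(b);
            free(a);
            return ret;
    }
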

>
> Correction, change return -1 to ret = -1 since the test on the if if
> (ret > 0) and we want to return a negative if ret = 0 from
> odp_enqueue_multi().
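
The tail of loopback_send() would then look something like this (a
sketch reconstructed from the quoted patch context, not the final
committed fix):

    } else {
            ODP_DBG("queue enqueue failed %i\n", ret);
            ret = -1; /* fall through to the common unlock */
    }

    odp_ticketlock_unlock(&pktio_entry->s.txl);

    return ret;
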
>
>>
>>> }
>>>
>>> --
>>> 2.12.2
>>>


Re: [lng-odp] [API-NEXT PATCH v2 15/16] Add llqueue, an unbounded concurrent queue

2017-04-04 Thread Ola Liljedahl
On 4 April 2017 at 21:25, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
> it's better to have 2 separate files for that. One for ODP_CONFIG_LLDSCD
"better"? In what way?

> defined and one for not. Also ODP_ prefix should not be used for
> internal things (not api).
OK this was not clear, some of the defines in odp_config_internal.h
use an ODP_ prefix, some not. You mean there is a system to that?

Shouldn't those defines that are part of the API be declared/described
in the API header files (located in include/odp/api/spec)? How else do
you know that they are part of the API? And if they are part of the
API, how does the application (the 'A' in API) access the definitions
*and* their values?

There are API's for querying about things like total number of queues
but those API's are separate and do not depend on some define with a
specific name.

>
> Maxim.
>
> On 04/04/17 21:48, Brian Brooks wrote:
>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>> ---
>>  platform/linux-generic/include/odp_llqueue.h | 285 +++
>>  1 file changed, 285 insertions(+)
>>  create mode 100644 platform/linux-generic/include/odp_llqueue.h
>>
>> diff --git a/platform/linux-generic/include/odp_llqueue.h 
>> b/platform/linux-generic/include/odp_llqueue.h
>> new file mode 100644
>> index ..aa46ace3
>> --- /dev/null
>> +++ b/platform/linux-generic/include/odp_llqueue.h
>> @@ -0,0 +1,285 @@
>> +/* Copyright (c) 2017, ARM Limited.
>> + * All rights reserved.
>> + *
>> + * SPDX-License-Identifier:BSD-3-Clause
>> + */
>> +
>> +#ifndef ODP_LLQUEUE_H_
>> +#define ODP_LLQUEUE_H_
>> +
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#include 
>> +#include 
>> +
>> +/****************************************************************
>> + * Linked list queues
>> + ****************************************************************/
>> +
>> +/* The scalar equivalent of a double pointer */
>> +#if __SIZEOF_PTRDIFF_T__ == 4
>> +typedef uint64_t dintptr_t;
>> +#endif
>> +#if __SIZEOF_PTRDIFF_T__ == 8
>> +typedef __int128 dintptr_t;
>> +#endif
>> +
>> +#define SENTINEL ((void *)~(uintptr_t)0)
>> +
>> +struct llnode {
>> + struct llnode *next;
>> +};
>> +
>> +union llht {
>> + struct {
>> + struct llnode *head, *tail;
>> + } st;
>> + dintptr_t ui;
>> +};
>> +
>> +struct llqueue {
>> + union llht u;
>> +#ifndef ODP_CONFIG_LLDSCD
>> + odp_spinlock_t lock;
>> +#endif
>> +};
>> +
>> +static inline struct llnode *llq_head(struct llqueue *llq)
>> +{
>> + return __atomic_load_n(&llq->u.st.head, __ATOMIC_RELAXED);
>> +}
>> +
>> +static inline void llqueue_init(struct llqueue *llq)
>> +{
>> + llq->u.st.head = NULL;
>> + llq->u.st.tail = NULL;
>> +#ifndef ODP_CONFIG_LLDSCD
>> + odp_spinlock_init(&llq->lock);
>> +#endif
>> +}
>> +
>> +#ifdef ODP_CONFIG_LLDSCD
>> +
>> +static inline void llq_enqueue(struct llqueue *llq, struct llnode *node)
>> +{
>> + union llht old, neu;
>> +
>> + ODP_ASSERT(node->next == NULL);
>> + node->next = SENTINEL;
>> + do {
>> + old.ui = lld(&llq->u.ui, __ATOMIC_RELAXED);
>> + neu.st.head = old.st.head == NULL ? node : old.st.head;
>> + neu.st.tail = node;
>> + } while (odp_unlikely(scd(&llq->u.ui, neu.ui, __ATOMIC_RELEASE)));
>> + if (old.st.tail != NULL) {
>> + /* List was not empty */
>> + ODP_ASSERT(old.st.tail->next == SENTINEL);
>> + old.st.tail->next = node;
>> + }
>> +}
>> +
>> +#else
>> +
>> +static inline void llq_enqueue(struct llqueue *llq, struct llnode *node)
>> +{
>> + ODP_ASSERT(node->next == NULL);
>> + node->next = SENTINEL;
>> +
>> + odp_spinlock_lock(&llq->lock);
>> + if (llq->u.st.head == NULL) {
>> + llq->u.st.head = node;
>> + llq->u.st.tail = node;
>> + } else {
>> + llq->u.st.tail->next = node;
>> + llq->u.st.tail = node;
>> + }
>> + odp_spinlock_unlock(&llq->lock);
>> +}
>> +
>> +#endif
>> +
>> +#ifdef O

Re: [lng-odp] [API-NEXT PATCH v2 12/16] Add LL/SC and signaling primitives

2017-04-04 Thread Ola Liljedahl
Sorry to say but I see the ARM disclaimer at the bottom even though I
requested the disclaimer not to be added to this specific email (there
is a trick for that). I need to figure out what went wrong.

-- Ola

On 4 April 2017 at 22:38, Ola Liljedahl <ola.liljed...@arm.com> wrote:
>
>
>
>
> On 04/04/2017, 22:14, "Dmitry Eremin-Solenikov"
> <dmitry.ereminsoleni...@linaro.org> wrote:
>
>>On 04.04.2017 21:48, Brian Brooks wrote:
>>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
>>> ---
>>>  platform/linux-generic/include/odp_llsc.h | 332 ++
>>>  1 file changed, 332 insertions(+)
>>>  create mode 100644 platform/linux-generic/include/odp_llsc.h
>>>
>>> diff --git a/platform/linux-generic/include/odp_llsc.h
>>>b/platform/linux-generic/include/odp_llsc.h
>>> new file mode 100644
>>> index ..ea60c54b
>>> --- /dev/null
>>> +++ b/platform/linux-generic/include/odp_llsc.h
>>> @@ -0,0 +1,332 @@
>>> +/* Copyright (c) 2017, ARM Limited
>>> + * All rights reserved.
>>> + *
>>> + * SPDX-License-Identifier:BSD-3-Clause
>>> + */
>>> +
>>> +#ifndef ODP_LLSC_H_
>>> +#define ODP_LLSC_H_
>>> +
>>> +#include 
>>> +
>>>
>>> +/****************************************************************
>>> + * LL/SC primitives
>>> + ****************************************************************/
>>> +
>>> +#if defined __ARM_ARCH
>>> +#if __ARM_ARCH == 7 || (__ARM_ARCH == 8 && __ARM_64BIT_STATE == 0)
>>> +static inline void dmb(void)
>>> +{
>>> +__asm__ volatile("dmb" : : : "memory");
>>> +}
>>> +
>>> +static inline uint32_t ll8(uint8_t *var, int mm)
>>> +{
>>> +uint8_t old;
>>> +
>>> +__asm__ volatile("ldrexb %0, [%1]"
>>> + : "=" (old)
>>> + : "r" (var)
>>> + : );
>>> +/* Barrier after an acquiring load */
>>> +if (mm == __ATOMIC_ACQUIRE)
>>> +dmb();
>>> +return old;
>>> +}
>>
>>
>>Hmm, I remember Ola's story about ipfrag and stdatomic not providing
>>enough support for 128-bit atomics. But why do you need to define
>>load/store for 8- and 32-bit variables? Why can not you use stdatomic
>>interface here?
> The usage here is actually not to perform atomic updates.
>
> Load-exclusive is used in ARMv8 to load the local "monitor" (which is used
> to check for atomicity in load-exclusive/store-exclusive sections).
> When the monitor is "lost" (there is a better formal word for that but I
> don't remember it now) because some other CPU obtained exclusive ownership
> of the cache line in order to write to it, an event is generated. This is
> the event that wakes up the CPU when it is sleeping in WFE (wait for
> event).
>
> This is all explained in the ARMv8 documentation. In ARMv7a, you had to
> perform DSB+SEV in order to signal/wake up waiting CPU's, this is much
> slower (you can still do it on ARMv8 CPUs).
>
>>
>>Not to mention that ll/ll8/etc macro names are not _that_ easy to
>>understand without any additional comments. Please expand the names.
>>
>>> +#endif
>>> +
>>> +#if __ARM_ARCH == 8 && __ARM_64BIT_STATE == 1
>>
>>#elif here please.
> Brian this one is for you! :-)
>
>>
>>> +static inline void sevl(void)
>>> +{
>>> +#if defined __ARM_ARCH
>>> +__asm__ volatile("sevl" : : : );
>>> +#endif
>>> +}
>>> +
>>> +static inline void sev(void)
>>> +{
>>> +#if defined __ARM_ARCH
>>> +__asm__ volatile("sev" : : : "memory");
>>> +#endif
>>> +}
>>> +
>>> +static inline int wfe(void)
>>> +{
>>> +#if defined __ARM_ARCH
>>> +__asm__ volatile("wfe" : : : "memory");
>>> +#endif
>>> +return 1;
>>> +}
>>
>>Ugh. And what about other architectures?
> I don't know if any other architectures support something like the ARM
> event mechanism. I have never seen anything like it in e.g. MIPS or POWER.
>
> We could remove the inline functions and have the WFE() macro translate
> directly to the inline asm statement. Better?
>
>

Re: [lng-odp] [API-NEXT PATCH v2 12/16] Add LL/SC and signaling primitives

2017-04-04 Thread Ola Liljedahl
I think I missed one comment.


On 04/04/2017, 22:14, "Dmitry Eremin-Solenikov"
<dmitry.ereminsoleni...@linaro.org> wrote:

>On 04.04.2017 21:48, Brian Brooks wrote:
>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
>> ---
>>  platform/linux-generic/include/odp_llsc.h | 332 ++
>>  1 file changed, 332 insertions(+)
>>  create mode 100644 platform/linux-generic/include/odp_llsc.h
>>
>> diff --git a/platform/linux-generic/include/odp_llsc.h
>>b/platform/linux-generic/include/odp_llsc.h
>> new file mode 100644
>> index ..ea60c54b
>> --- /dev/null
>> +++ b/platform/linux-generic/include/odp_llsc.h
>> @@ -0,0 +1,332 @@
>> +/* Copyright (c) 2017, ARM Limited
>> + * All rights reserved.
>> + *
>> + * SPDX-License-Identifier:BSD-3-Clause
>> + */
>> +
>> +#ifndef ODP_LLSC_H_
>> +#define ODP_LLSC_H_
>> +
>> +#include 
>> +
>>
>> +/****************************************************************
>> + * LL/SC primitives
>> + ****************************************************************/
There's your comment.

LL <=> load linked
SC <=> store conditional
The original name of this RISCy mechanism for implementing atomic
operations. Also the instruction names used on MIPS.
I did not want to use all caps for function names.

I actually prefer the MIPS abbreviations to the ARM names
(load-exclusive/store-exclusive).
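
For reference, the classic LL/SC retry loop that gives the mechanism its
name -- here an atomic increment built from hypothetical ll32()/sc32()
wrappers in the style of the ll8() above, with sc32() assumed to return
non-zero when the store-conditional fails, like scd() in the llqueue
patch:

    static inline uint32_t atomic_inc32(uint32_t *loc)
    {
            uint32_t old;

            do {
                    /* load-linked: also arms the exclusive monitor */
                    old = ll32(loc, __ATOMIC_RELAXED);
                    /* store-conditional: fails if the monitor was lost */
            } while (sc32(loc, old + 1, __ATOMIC_RELAXED));
            return old;
    }
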



>> +
>> +#if defined __ARM_ARCH
>> +#if __ARM_ARCH == 7 || (__ARM_ARCH == 8 && __ARM_64BIT_STATE == 0)
>> +static inline void dmb(void)
>> +{
>> +__asm__ volatile("dmb" : : : "memory");
>> +}
>> +
>> +static inline uint32_t ll8(uint8_t *var, int mm)
>> +{
>> +uint8_t old;
>> +
>> +__asm__ volatile("ldrexb %0, [%1]"
>> + : "=" (old)
>> + : "r" (var)
>> + : );
>> +/* Barrier after an acquiring load */
>> +if (mm == __ATOMIC_ACQUIRE)
>> +dmb();
>> +return old;
>> +}
>
>
>Hmm, I remember Ola's story about ipfrag and stdatomic not providing
>enough support for 128-bit atomics. But why do you need to define
>load/store for 8- and 32-bit variables? Why can not you use stdatomic
>interface here?
>
>Not to mention that ll/ll8/etc macro names are not _that_ easy to
>understand without any additional comments. Please expand the names.
See above.

>
>> +#endif
>> +
>> +#if __ARM_ARCH == 8 && __ARM_64BIT_STATE == 1
>
>#elif here please.
>
>> +static inline void sevl(void)
>> +{
>> +#if defined __ARM_ARCH
>> +__asm__ volatile("sevl" : : : );
>> +#endif
>> +}
>> +
>> +static inline void sev(void)
>> +{
>> +#if defined __ARM_ARCH
>> +__asm__ volatile("sev" : : : "memory");
>> +#endif
>> +}
>> +
>> +static inline int wfe(void)
>> +{
>> +#if defined __ARM_ARCH
>> +__asm__ volatile("wfe" : : : "memory");
>> +#endif
>> +return 1;
>> +}
>
>Ugh. And what about other architectures?
>
>> +
>> +#ifdef ODP_CONFIG_DMBSTR
>> +
>> +#if defined __ARM_ARCH && __ARM_ARCH == 8
>> +/* Only ARMv8 supports DMB ISHLD */
>> +/* A load only barrier is much cheaper than full barrier */
>> +#define _odp_release_barrier(ro) \
>> +do { \
>> +if (ro) \
>> +__asm__ volatile("dmb ishld" ::: "memory");  \
>> +else \
>> +__asm__ volatile("dmb ish" ::: "memory");\
>> +} while (0)
>> +#else
>> +#define _odp_release_barrier(ro) \
>> +__atomic_thread_fence(__ATOMIC_RELEASE)
>> +#endif
>> +
>> +#define atomic_store_release(loc, val, ro)\
>> +do {\
>> +_odp_release_barrier(ro);\
>> +__atomic_store_n(loc, val, __ATOMIC_RELAXED);   \
>> +} while (0)
>> +
>> +#else
>> +
>> +#define atomic_store_release(loc, val, ro) \
>> +__atomic_store_n(loc, val, __ATOMIC_RELEASE)
>> +
>> +#endif
>> +
>> +#ifdef ODP_CONFIG_USE_WFE
>
>Lack of documentation on those abbreviations is a very, very nice thing.
>
>> +#define SEVL() sevl()
>> +#define WFE() wfe()
>> +#define SEV() do { __asm__ volatile("dsb ish" ::: "memory"); sev(); }
>>while (0)
>
>So, if you need dsb here, why isn't it a part of

Re: [lng-odp] [API-NEXT PATCH v2 13/16] Add a bitset

2017-04-04 Thread Ola Liljedahl
Sending from my ARM email account, I hope Outlook does not mess up the
format.



On 04/04/2017, 22:21, "Dmitry Eremin-Solenikov"
<dmitry.ereminsoleni...@linaro.org> wrote:

>On 04.04.2017 21:48, Brian Brooks wrote:
>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
>
>>
>> +/****************************************************************
>> + * bitset abstract data type
>> + ****************************************************************/
>> +/* This could be a struct of scalars to support larger bit sets */
>> +
>> +#if ATOM_BITSET_SIZE <= 32
>
>Maybe I missed, where did you set this macro?
In odp_config_internal.h
It is a build time configuration.

>
>Also, why do you need several versions of bitset? Can you stick to one
>size that fits all?
Some 32-bit archs (ARMv7a, x86) will only support 64-bit atomics (AFAIK).
Only x86-64 and ARMv8a supports 128-bit atomics (and compiler support for
128-bit atomics for ARMv8a is a bit lacking...).
Other architectures might only support 32-bit atomic operations.

I think the user should have control over this but if you think that we
should just select the max value that is supported by the architecture in
question and thus skip one build configuration, I am open to this. We will
still need separate versions for 32/64/128 bits because there are slight
differences in the syntax and implementation. Such are the vagaries of the
C standard (and GCC extensions).


>
>> +typedef uint32_t bitset_t;
>> +
>> +static inline bitset_t bitset_mask(uint32_t bit)
>> +{
>> +return 1UL << bit;
>> +}
>> +
>> +/* Return first-bit-set with StdC ffs() semantics */
>> +static inline uint32_t bitset_ffs(bitset_t b)
>> +{
>> +return __builtin_ffsl(b);
>> +}
>> +
>> +/* Load-exclusive with memory ordering */
>> +static inline bitset_t bitset_ldex(bitset_t *bs, int mo)
>> +{
>> +return LDXR32(bs, mo);
>> +}
>> +
>> +#elif ATOM_BITSET_SIZE <= 64
>> +
>> +typedef uint64_t bitset_t;
>> +
>> +static inline bitset_t bitset_mask(uint32_t bit)
>> +{
>> +return 1ULL << bit;
>> +}
>> +
>> +/* Return first-bit-set with StdC ffs() semantics */
>> +static inline uint32_t bitset_ffs(bitset_t b)
>> +{
>> +return __builtin_ffsll(b);
>> +}
>> +
>> +/* Load-exclusive with memory ordering */
>> +static inline bitset_t bitset_ldex(bitset_t *bs, int mo)
>> +{
>> +return LDXR64(bs, mo);
>> +}
>> +
>> +#elif ATOM_BITSET_SIZE <= 128
>> +
>> +#if __SIZEOF_INT128__ == 16
>> +typedef unsigned __int128 bitset_t;
>> +
>> +static inline bitset_t bitset_mask(uint32_t bit)
>> +{
>> +if (bit < 64)
>> +return 1ULL << bit;
>> +else
>> +return (unsigned __int128)(1ULL << (bit - 64)) << 64;
>> +}
>> +
>> +/* Return first-bit-set with StdC ffs() semantics */
>> +static inline uint32_t bitset_ffs(bitset_t b)
>> +{
>> +if ((uint64_t)b != 0)
>> +return __builtin_ffsll((uint64_t)b);
>> +else if ((b >> 64) != 0)
>> +return __builtin_ffsll((uint64_t)(b >> 64)) + 64;
>> +else
>> +return 0;
>> +}
>> +
>> +/* Load-exclusive with memory ordering */
>> +static inline bitset_t bitset_ldex(bitset_t *bs, int mo)
>> +{
>> +return LDXR128(bs, mo);
>> +}
>> +
>> +#else
>> +#error __int128 not supported by compiler
>> +#endif
>> +
>> +#else
>> +#error Unsupported size of bit sets (ATOM_BITSET_SIZE)
>> +#endif
>> +
>> +/* Atomic load with memory ordering */
>> +static inline bitset_t atom_bitset_load(bitset_t *bs, int mo)
>> +{
>> +return __atomic_load_n(bs, mo);
>> +}
>> +
>> +/* Atomic bit set with memory ordering */
>> +static inline void atom_bitset_set(bitset_t *bs, uint32_t bit, int mo)
>> +{
>> +(void)__atomic_fetch_or(bs, bitset_mask(bit), mo);
>> +}
>> +
>> +/* Atomic bit clear with memory ordering */
>> +static inline void atom_bitset_clr(bitset_t *bs, uint32_t bit, int mo)
>> +{
>> +(void)__atomic_fetch_and(bs, ~bitset_mask(bit), mo);
>> +}
>> +
>> +/* Atomic exchange with memory ordering */
>> +static inline bitset_t atom_bitset_xchg(bitset_t *bs, bitset_t neu,
>>int mo)
>> +{
>> +return __atomic_exchange_n(bs, neu, mo);
>> +}
>> +
>
>Any real reason for the follo

Re: [lng-odp] [API-NEXT PATCH v2 12/16] Add LL/SC and signaling primitives

2017-04-04 Thread Ola Liljedahl




On 04/04/2017, 22:14, "Dmitry Eremin-Solenikov"
<dmitry.ereminsoleni...@linaro.org> wrote:

>On 04.04.2017 21:48, Brian Brooks wrote:
>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
>> ---
>>  platform/linux-generic/include/odp_llsc.h | 332 ++
>>  1 file changed, 332 insertions(+)
>>  create mode 100644 platform/linux-generic/include/odp_llsc.h
>>
>> diff --git a/platform/linux-generic/include/odp_llsc.h
>>b/platform/linux-generic/include/odp_llsc.h
>> new file mode 100644
>> index ..ea60c54b
>> --- /dev/null
>> +++ b/platform/linux-generic/include/odp_llsc.h
>> @@ -0,0 +1,332 @@
>> +/* Copyright (c) 2017, ARM Limited
>> + * All rights reserved.
>> + *
>> + * SPDX-License-Identifier:BSD-3-Clause
>> + */
>> +
>> +#ifndef ODP_LLSC_H_
>> +#define ODP_LLSC_H_
>> +
>> +#include 
>> +
>>
>>+/***
>>***
>> + * LL/SC primitives
>> +
>>*
>>/
>> +
>> +#if defined __ARM_ARCH
>> +#if __ARM_ARCH == 7 || (__ARM_ARCH == 8 && __ARM_64BIT_STATE == 0)
>> +static inline void dmb(void)
>> +{
>> +__asm__ volatile("dmb" : : : "memory");
>> +}
>> +
>> +static inline uint32_t ll8(uint8_t *var, int mm)
>> +{
>> +uint8_t old;
>> +
>> +__asm__ volatile("ldrexb %0, [%1]"
>> + : "=" (old)
>> + : "r" (var)
>> + : );
>> +/* Barrier after an acquiring load */
>> +if (mm == __ATOMIC_ACQUIRE)
>> +dmb();
>> +return old;
>> +}
>
>
>Hmm, I remember Ola's story about ipfrag and stdatomic not providing
>enough support for 128-bit atomics. But why do you need to define
>load/store for 8- and 32-bit variables? Why can not you use stdatomic
>interface here?
The usage here is actually not to perform atomic updates.

Load-exclusive is used in ARMv8 to load the local "monitor" (which is used
to check for atomicity in load-exclusive/store-exclusive sections).
When the monitor is "lost" (there is a better formal word for that but I
don't remember it now) because some other CPU obtained exclusive ownership
of the cache line in order to write to it, an event is generated. This is
the event that wakes up the CPU when it is sleeping in WFE (wait for
event).

This is all explained in the ARMv8 documentation. In ARMv7a, you had to
perform DSB+SEV in order to signal/wake up waiting CPU's, this is much
slower (you can still do it on ARMv8 CPUs).
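
A sketch of the idiom this enables (sevl()/wfe() are the primitives from
the patch; ll32() is a hypothetical 32-bit load-exclusive wrapper in the
style of ll8()):

    static inline void wait_until_nonzero(uint32_t *flag)
    {
            sevl(); /* prime the event register: the first wfe() falls through */
            while (wfe() && ll32(flag, __ATOMIC_ACQUIRE) == 0)
                    ; /* the load-exclusive arms the monitor; a write by
                       * another CPU generates the event that ends wfe() */
    }
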

>
>Not to mention that ll/ll8/etc macro names are not _that_ easy to
>understand without any additional comments. Please expand the names.
>
>> +#endif
>> +
>> +#if __ARM_ARCH == 8 && __ARM_64BIT_STATE == 1
>
>#elif here please.
Brian this one is for you! :-)

>
>> +static inline void sevl(void)
>> +{
>> +#if defined __ARM_ARCH
>> +__asm__ volatile("sevl" : : : );
>> +#endif
>> +}
>> +
>> +static inline void sev(void)
>> +{
>> +#if defined __ARM_ARCH
>> +__asm__ volatile("sev" : : : "memory");
>> +#endif
>> +}
>> +
>> +static inline int wfe(void)
>> +{
>> +#if defined __ARM_ARCH
>> +__asm__ volatile("wfe" : : : "memory");
>> +#endif
>> +return 1;
>> +}
>
>Ugh. And what about other architectures?
I don't know if any other architectures support something like the ARM
event mechanism. I have never seen anything like it in e.g. MIPS or POWER.

We could remove the inline functions and have the WFE() macro translate
directly to the inline asm statement. Better?
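
Something like this sketch, perhaps -- keeping the value 1 that the
"while (WFE() && ...)" usage depends on:

    #if defined __ARM_ARCH
    /* Statement expression (GCC extension) so WFE() still evaluates
     * to 1 and can be used as a loop condition like wfe() above */
    #define WFE() ({ __asm__ volatile("wfe" : : : "memory"); 1; })
    #else
    #define WFE() 1 /* no event mechanism: the wait loop just spins */
    #endif
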

>
>> +
>> +#ifdef ODP_CONFIG_DMBSTR
>> +
>> +#if defined __ARM_ARCH && __ARM_ARCH == 8
>> +/* Only ARMv8 supports DMB ISHLD */
>> +/* A load only barrier is much cheaper than full barrier */
>> +#define _odp_release_barrier(ro) \
>> +do { \
>> +if (ro) \
>> +__asm__ volatile("dmb ishld" ::: "memory");  \
>> +else \
>> +__asm__ volatile("dmb ish" ::: "memory");\
>> +} while (0)
>> +#else
>> +#define _odp_release_barrier(ro) \
>> +__atomic_thread_fence(__ATOMIC_RELEASE)
>> +#endif
>> +
>> +#define atomic_store_release(loc, val, ro)\
>> +do {\
>> +_odp_release_barrier(ro);\
>> +__atomic_store_n(lo

Re: [lng-odp] [API-NEXT 2/4] linux-generic: ring.c: use required memory orderings

2017-03-31 Thread Ola Liljedahl
On 31 March 2017 at 15:21, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
> On 03/28/17 22:23, Brian Brooks wrote:
>> From: Ola Liljedahl <ola.liljed...@arm.com>
>>
>> Signed-off-by: Ola Liljedahl <ola.liljed...@arm.com>
>> Reviewed-by: Brian Brooks <brian.bro...@arm.com>
>> ---
>>  platform/linux-generic/pktio/ring.c | 30 ++
>>  1 file changed, 14 insertions(+), 16 deletions(-)
>>  mode change 100644 => 100755 platform/linux-generic/pktio/ring.c
>>
>> diff --git a/platform/linux-generic/pktio/ring.c 
>> b/platform/linux-generic/pktio/ring.c
>> old mode 100644
>> new mode 100755
>
>
> no need of setting executable permissions to c file. And of course you
Very strange. I can assure you that I have not actively changed
permissions on this file (I made the original changes on our local
copy).

> have to run checkpatch.pl or push that code to github and it will do all
> required checks.
>
> I will fix it, no need to resend.
>
> Maxim.
>
>> index aeda04b2..e3c73d1c
>> --- a/platform/linux-generic/pktio/ring.c
>> +++ b/platform/linux-generic/pktio/ring.c
>> @@ -263,8 +263,8 @@ int ___ring_mp_do_enqueue(_ring_t *r, void * const 
>> *obj_table,
>>   /* Reset n to the initial burst count */
>>   n = max;
>>
>> - prod_head = r->prod.head;
>> - cons_tail = r->cons.tail;
>> + prod_head = __atomic_load_n(&r->prod.head, __ATOMIC_RELAXED);
>> + cons_tail = __atomic_load_n(&r->cons.tail, __ATOMIC_ACQUIRE);
>>   /* The subtraction is done between two unsigned 32bits value
>>* (the result is always modulo 32 bits even if we have
>>* prod_head > cons_tail). So 'free_entries' is always between 0
>> @@ -306,12 +306,12 @@ int ___ring_mp_do_enqueue(_ring_t *r, void * const 
>> *obj_table,
>>* If there are other enqueues in progress that preceded us,
>>* we need to wait for them to complete
>>*/
>> - while (odp_unlikely(r->prod.tail != prod_head))
>> + while (odp_unlikely(__atomic_load_n(&r->prod.tail, __ATOMIC_RELAXED) !=
>> + prod_head))
>>   odp_cpu_pause();
>>
>>   /* Release our entries and the memory they refer to */
>> - __atomic_thread_fence(__ATOMIC_RELEASE);
>> - r->prod.tail = prod_next;
>> + __atomic_store_n(&r->prod.tail, prod_next, __ATOMIC_RELEASE);
>>   return ret;
>>  }
>>
>> @@ -328,7 +328,7 @@ int ___ring_sp_do_enqueue(_ring_t *r, void * const 
>> *obj_table,
>>   int ret;
>>
>>   prod_head = r->prod.head;
>> - cons_tail = r->cons.tail;
>> + cons_tail = __atomic_load_n(&r->cons.tail, __ATOMIC_ACQUIRE);
>>   /* The subtraction is done between two unsigned 32bits value
>>* (the result is always modulo 32 bits even if we have
>>* prod_head > cons_tail). So 'free_entries' is always between 0
>> @@ -361,8 +361,7 @@ int ___ring_sp_do_enqueue(_ring_t *r, void * const 
>> *obj_table,
>>   }
>>
>>   /* Release our entries and the memory they refer to */
>> - __atomic_thread_fence(__ATOMIC_RELEASE);
>> - r->prod.tail = prod_next;
>> + __atomic_store_n(&r->prod.tail, prod_next, __ATOMIC_RELEASE);
>>   return ret;
>>  }
>>
>> @@ -385,8 +384,8 @@ int ___ring_mc_do_dequeue(_ring_t *r, void **obj_table,
>>   /* Restore n as it may change every loop */
>>   n = max;
>>
>> - cons_head = r->cons.head;
>> - prod_tail = r->prod.tail;
>> + cons_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);
>> + prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);
>>   /* The subtraction is done between two unsigned 32bits value
>>* (the result is always modulo 32 bits even if we have
>>* cons_head > prod_tail). So 'entries' is always between 0
>> @@ -419,12 +418,12 @@ int ___ring_mc_do_dequeue(_ring_t *r, void **obj_table,
>>* If there are other dequeues in progress that preceded us,
>>* we need to wait for them to complete
>>*/
>> - while (odp_unlikely(r->cons.tail != cons_head))
>> + while (odp_unlikely(__atomic_load_n(&r->cons.tail, __ATOMIC_RELAXED) !=
>> + cons_head))
>>   odp_cpu_pause();
>>
>>   /* Release our entries and the memory they refer to */
>> - __atomic_thread_fence(__ATOMIC_RELEASE);
>> - r->cons.tail = cons_next;
>> + __atomic_store_n(&r->cons.tail, cons_next, __ATOMIC_RELEASE);
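
To make the pairing in this patch explicit, a minimal sketch (the same
operations as above, stripped of the ring index arithmetic): the
RELEASE store of one side's tail synchronizes-with the other side's
ACQUIRE load of it, so slot contents can never be observed before the
tail update that publishes them.

/* enqueue */
cons_tail = __atomic_load_n(&r->cons.tail, __ATOMIC_ACQUIRE);
/* ... write objects into the free slots ... */
__atomic_store_n(&r->prod.tail, prod_next, __ATOMIC_RELEASE);

/* dequeue (symmetric) */
prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);
/* ... read objects out of the used slots ... */
__atomic_store_n(&r->cons.tail, cons_next, __ATOMIC_RELEASE);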

Re: [lng-odp] [API-NEXT 4/4] A scalable software scheduler

2017-03-30 Thread Ola Liljedahl
No build, no run.

We found several problems (memory ordering related so visible on ARM but not on 
x86) in the upstream code when running on ARM systems. Aren’t there any ARM 
systems to run on in the LNG lab?

-- Ola

Ola Liljedahl, Networking System Architect, ARM
Phone: +46 706 866 373  Skype: ola.liljedahl

From: Maxim Uvarov <maxim.uva...@linaro.org<mailto:maxim.uva...@linaro.org>>
Date: Thursday, 30 March 2017 at 16:56
To: Bill Fischofer <bill.fischo...@linaro.org<mailto:bill.fischo...@linaro.org>>
Cc: Brian Brooks <brian.bro...@arm.com<mailto:brian.bro...@arm.com>>, 
lng-odp-forward <lng-odp@lists.linaro.org<mailto:lng-odp@lists.linaro.org>>, 
Ola Liljedahl <ola.liljed...@arm.com<mailto:ola.liljed...@arm.com>>, Kevin Wang 
<kevin.w...@arm.com<mailto:kevin.w...@arm.com>>, Honnappa Nagarahalli 
<honnappa.nagaraha...@arm.com<mailto:honnappa.nagaraha...@arm.com>>
Subject: Re: [lng-odp] [API-NEXT 4/4] A scalable software scheduler

I think for now we do not have a build for ARMv8 & clang. At least it
did not capture the build error.

Maxim.

On 30 March 2017 at 17:45, Bill Fischofer 
<bill.fischo...@linaro.org<mailto:bill.fischo...@linaro.org>> wrote:
On Thu, Mar 30, 2017 at 8:56 AM, Brian Brooks 
<brian.bro...@arm.com<mailto:brian.bro...@arm.com>> wrote:
> On 03/28 18:50:32, Bill Fischofer wrote:
>>
>> This part generates numerous checkpatch warnings and errors. Please
>> run checkpatch and correct for v2.
>> 
>> Also, this part introduces a number of errors that result in failure
>> to compile using clang. Please test with both gcc and clang to ensure
>> that it compiles cleanly for both (gcc looks fine)
>> 
>> Specific clang issues:
>
> Does upstream CI build with Clang on ARMv8? I don't think so because
> the build is broken when I try it locally. I have the patch and will
> send shortly.

CI is supposed to be regression testing on ARM using clang as well as
gcc. Maxim may have insight here.



Re: [lng-odp] 32b support in ODP-Cloud

2017-03-29 Thread Ola Liljedahl
On 29 March 2017 at 13:25, Bill Fischofer <bill.fischo...@linaro.org> wrote:

>
>
> On Wed, Mar 29, 2017 at 5:47 AM, Ola Liljedahl <ola.liljed...@linaro.org>
> wrote:
>
>> On 29 March 2017 at 10:43, Francois Ozog <francois.o...@linaro.org>
>> wrote:
>>
>>> If there is a cost to get virtual address, then I assume translation is
>>> NOT just casting: correct?
>>>
>> Correct. linux-generic has a number of dereferences in the code that
>> returns e.g. the buffer address from a buffer handle. This is not optimised
>> for performance. The design does provide the ability to check buffer
>> handles for correctness/validity but I cannot see any code that actually
>> does this so an invalid buffer handle might crash the code (some out of
>> bounds memory access).
>>
>> I suspect that the hot spots are due to the fact that in many cases we
> are only using a 32-bit value and wrapping it in a 64-bit handle. This was
> originally done to make the strong typing 32/64 bit agnostic.  But this can
> change if we widen the linux-generic handles to use the full pointer width.
> Ola: you should no longer be seeing those hot spots in the packet code
> since with the more recent changes Petri introduced the odp_packet_t is now
> simply a pointer to an odp_packet_hdr_t, similar to how in odp-dpdk it is a
> pointer to an rte_mbuf.
>
This is/was a benchmark (odp_sched_latency) that uses buffers. But it is
good that we now use pointers for packet handles. Perhaps we can do the
same thing for buffers and other event types?


> Certainly in odp-cloud we should do a similar mapping for other key handle
> types.
>
> Another approach requires a bit more config/tools technology would be to
> support multiple type definitions as a performance tuning option. When
> developing you'd compile using an include structure that has handles
> defined as pointers to structs to get the strong typing and then support a
> compile option for production use that redefines the handles to be
> uint32_t. That would reduce their footprint to 32-bits but would lose
> strong type checking, however that's a trade-off an application writer
> could decide is worth while.
>
> Currently we support a -DDEBUG option that includes additional runtime
> checking. We could do this via a similar -DTYPE_CHECK option (default) and
> support a -DNO_TYPE_CHECK for the "compact" handles.
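
For illustration, the dual-typedef idea could be sketched like this
(hypothetical macro and struct names, not actual ODP code):

#ifdef NO_TYPE_CHECK
/* production build: compact scalar handles, minimal footprint */
typedef uint32_t odp_buffer_t;
#else
/* development build: strong typing, mixing handle types fails
 * to compile */
typedef struct odp_buffer_s *odp_buffer_t;
#endif
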
>
>
>>
>>> FF
>>>
>>> On 29 March 2017 at 10:00, Ola Liljedahl <ola.liljed...@linaro.org>
>>> wrote:
>>>
>>>> So there is a choice between
>>>> A) enabling static type checking in the compiler through strong typing
>>>> => requires (syntactical) pointers i C => handles are 64-bit on 64-bit
>>>> systems
>>>> B) optimise for size and cache efficiency by using 32-bit (scalar)
>>>> handles
>>>>
>>>> Currently this choice is hard-wired into the ODP linux-generic
>>>> implementation.
>>>>
>>>> When profiling some ODP examples, I can see hot spots in the functions
>>>> that convert "pointer"-handles into the actual object pointers
>>>> (virtual addresses). So we are paying a double price here, handles are
>>>> large (increases cache pressure) and we have to translate handles to
>>>> address before we can reference the objects in the ODP calls.
>>>>
>>>> On 29 March 2017 at 06:10, Bill Fischofer <bill.fischo...@linaro.org>
>>>> wrote:
>>>> >
>>>> > On Tue, Mar 28, 2017 at 10:47 PM Honnappa Nagarahalli
>>>> > <honnappa.nagaraha...@linaro.org> wrote:
>>>> >>
>>>> >> On 28 March 2017 at 22:27, Bill Fischofer <bill.fischo...@linaro.org
>>>> >
>>>> >> wrote:
>>>> >> >
>>>> >> >
>>>> >> > On Mon, Mar 27, 2017 at 10:11 PM, Honnappa Nagarahalli
>>>> >> > <honnappa.nagaraha...@linaro.org> wrote:
>>>> >> >>
>>>> >> >> On 27 March 2017 at 08:36, Ola Liljedahl <
>>>> ola.liljed...@linaro.org>
>>>> >> >> wrote:
>>>> >> >> > On 27 March 2017 at 07:58, Honnappa Nagarahalli
>>>> >> >> > <honnappa.nagaraha...@linaro.org> wrote:
>>>> >> >> >> My answers inline. I was confused as hell just a month back :)
>>>> >> >> >>
>>>> >> >> >> On 23 March 2017 at 06:28, Francois 

Re: [lng-odp] 32b support in ODP-Cloud

2017-03-29 Thread Ola Liljedahl
On 29 March 2017 at 10:43, Francois Ozog <francois.o...@linaro.org> wrote:

> If there is a cost to get virtual address, then I assume translation is
> NOT just casting: correct?
>
Correct. linux-generic has a number of dereferences in the code that
returns e.g. the buffer address from a buffer handle. This is not optimised
for performance. The design does provide the ability to check buffer
handles for correctness/validity but I cannot see any code that actually
does this so an invalid buffer handle might crash the code (some out of
bounds memory access).


> FF
>
> On 29 March 2017 at 10:00, Ola Liljedahl <ola.liljed...@linaro.org> wrote:
>
>> So there is a choice between
>> A) enabling static type checking in the compiler through strong typing
>> => requires (syntactical) pointers i C => handles are 64-bit on 64-bit
>> systems
>> B) optimise for size and cache efficiency by using 32-bit (scalar) handles
>>
>> Currently this choice is hard-wired into the ODP linux-generic
>> implementation.
>>
>> When profiling some ODP examples, I can see hot spots in the functions
>> that convert "pointer"-handles into the actual object pointers
>> (virtual addresses). So we are paying a double price here, handles are
>> large (increases cache pressure) and we have to translate handles to
>> address before we can reference the objects in the ODP calls.
>>
>> On 29 March 2017 at 06:10, Bill Fischofer <bill.fischo...@linaro.org>
>> wrote:
>> >
>> > On Tue, Mar 28, 2017 at 10:47 PM Honnappa Nagarahalli
>> > <honnappa.nagaraha...@linaro.org> wrote:
>> >>
>> >> On 28 March 2017 at 22:27, Bill Fischofer <bill.fischo...@linaro.org>
>> >> wrote:
>> >> >
>> >> >
>> >> > On Mon, Mar 27, 2017 at 10:11 PM, Honnappa Nagarahalli
>> >> > <honnappa.nagaraha...@linaro.org> wrote:
>> >> >>
>> >> >> On 27 March 2017 at 08:36, Ola Liljedahl <ola.liljed...@linaro.org>
>> >> >> wrote:
>> >> >> > On 27 March 2017 at 07:58, Honnappa Nagarahalli
>> >> >> > <honnappa.nagaraha...@linaro.org> wrote:
>> >> >> >> My answers inline. I was confused as hell just a month back :)
>> >> >> >>
>> >> >> >> On 23 March 2017 at 06:28, Francois Ozog <
>> francois.o...@linaro.org>
>> >> >> >> wrote:
>> >> >> >>
>> >> >> >>> The more I dig the less I understand ;-)
>> >> >> >>>
>> >> >> >>> Let me ask a few questions:
>> >> >> >>>
>> >> >> >>> - in the future, when selling 32 bit silicon, which architecture
>> >> >> >>> version
>> >> >> >>> will it be ARMv7 or ARMv8 ?
>> >> >> > AFAIK, future 32-bit ARM cores (from ARM) will be ARMv8. But
>> people
>> >> >> > are still building SoC's with e.g. ARM920 which is ARMv4T or
>> >> >> > something.
>> >> >> >
>> >> >> >>>
>> >> >> >>
>> >> >> >> What you are referring to is ISA version, not architecture.
>> AArch32
>> >> >> >> and
>> >> >> >> AArch64 are architectures. ARMv8 also supports AArch32 (i.e.
>> AArch32
>> >> >> >> with
>> >> >> >> ARMv8 ISA)
>> >> >> > ARMv8 has two architectural states, AArch32 and AArch64. An ARMv8
>> >> >> > implementation can implement either-or or both. There are already
>> >> >> > examples out there of all these different combinations.
>> >> >> >
>> >> >> > AAarch32 supports the A32 and T32 ISA's, these are closely
>> related to
>> >> >> > (basically extensions of) the corresponding ARMv7a ARM and
>> Thumb(-2)
>> >> >> > ISA's.
>> >> >> > The A32 (and T32?) ISA's have some of the ARMv8 extensions, e.g.
>> >> >> > load-acquire, store-release, crypto instructions, simplified WFE
>> >> >> > support etc.
>> >> >> > A user space ARMv7a image should run unmodified on ARMv8/AArch32,
>> I
>> >> >> > don't know about other privilege levels but I can imagine an
>> ARMv7a
>> >> >> > kernel running in AArch

Re: [lng-odp] [API-NEXT 3/4] api: queue: Add ring_size

2017-03-29 Thread Ola Liljedahl
On 29 March 2017 at 03:55, Brian Brooks  wrote:
> On 03/28 19:18:37, Bill Fischofer wrote:
>> 
>
> It is infinitely better to do patch review in plain text rather
> than HTML. I thought this was a plain text mailing list?
>
>>
>> On Tue, Mar 28, 2017 at 2:23 PM, Brian Brooks brian.bro...@arm.com wrote:
>>  Signed-off-by: Brian Brooks brian.bro...@arm.com
>>  ---
>>  include/odp/api/spec/queue.h | 5 +++++
>>  1 file changed, 5 insertions(+)
>> 
>>  diff --git a/include/odp/api/spec/queue.h 
>> b/include/odp/api/spec/queue.h
>>  index 7972feac..1cec4773 100644
>>  --- a/include/odp/api/spec/queue.h
>>  +++ b/include/odp/api/spec/queue.h
>>  @@ -124,6 +124,11 @@ typedef struct odp_queue_param_t {
>>   * the queue type. */
>>   odp_queue_type_t type;
>>  
>>  + /** Queue size
>>  +  *
>>  +  * Indicates the max ring size of the ring buffer. */
>>  + uint32_t ring_size;
>> 
>> ODP queues have historically been of unspecified size. If we're going
>> to introduce the notion of explicitly limited sized queues this has
>> additional implications.
Implementations have likely applied some internal limitation, unknown
to the application.
linux-generic's use of linked lists can't be seen as *the* way to do it.

>>
>> First, ring_size is an inappropriate choice of name here since a ring
>> is an implementation model, not a specification. The documentation
>> says Queue size, so
>> 
>> uint32_t size;
>> 
>> is sufficient here.
>
> Agree, will change 'ring_size' to a better name.
My suggestion is 'min_capacity' (or 'min_size').
The application specifies the minimum capacity it requires/desires.

>
>> We should document that size = 0 requests a queue
>> of default size (which may be unbounded).
>
> Unbounded size is not practical or possible. Can we agree that 0 means
> that the default aka ODP_CONFIG_QUEUE_SIZE is used? Should we allow for
> greater than ODP_CONFIG_QUEUE_SIZE? E.g. 10,000 queue depth is also not
> practical or possible. Perhaps we need a max which also acts as the default.
IMO, a 'min_capacity' of 0 should mean use default size (per the ODP
configuration file). Which is likely what implementations already do
when the application cannot specify the size.

>
>> 
>> Second, if we're going to allow a queue size to be specified then this
>> needs to be added as an output to odp_queue_capability()
>
> OK, will look into that.
>
>> so the
>> application knows the max_size supported (again 0 = unbounded).
>> 
>> A larger question, however, is why is this being introduced at all
>> since this field is only used in the modified
>> odph_cuckoo_table_create() helper routine and this, in turn, is only
>> used within the cuckootable test module? This seems an extraneous and
>> unnecessary change and has no relationship to the rest of this patch
>> series.
>
> AFAIK, the cuckoo unit test enqueues too many events (millions) to a queue.
> That sounds like it makes no sense, but is an example of how anything is
> possible.
The cuckoo hash table uses a queue to store unused "slots". The hash
table could be very large and so have a very large number of unused
slots that need to be saved on a queue. I think it is unreasonable of
the cuckoo hash table to expect infinite queue sizes. Better to specify
the required capacity and then fail in odp_queue_create().
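
A sketch of what that would look like for the application, assuming
the parameter ends up named 'size' as suggested above (0 meaning the
implementation default):

odp_queue_param_t qp;

odp_queue_param_init(&qp);
qp.type = ODP_QUEUE_TYPE_PLAIN;
qp.size = num_slots; /* minimum capacity the hash table needs */
odp_queue_t q = odp_queue_create("free_slots", &qp);
if (q == ODP_QUEUE_INVALID)
    abort(); /* capacity unavailable: fail here, not at some enqueue */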

>
>> 
>> So Parts 1 and 3 of this series don't seem to have anything to do with
>> the scalable scheduler. As a minor point, the order of these needs to
>> be reversed to preserve bisectability since Part 1 can't reference the
>> new field before Part 3 defines it.
>
> Agree that there are 2 independent sets of patches, but the order is needed
> in order to get a sane `make check' on ARM-based chips. Without these fixes
> going in first, we have no way of knowing whether the scalable scheduler 
> patches
> caused the issue or not. I can break these into 2 separate sets of patches.
>
>> 
>>  
>>  /** Enqueue mode
>>  *
>>  * Default value for both queue types is ODP_QUEUE_OP_MT. Application
>>  --
>>  2.12.1
>> 
>> 
>> 
>> 


Re: [lng-odp] 32b support in ODP-Cloud

2017-03-29 Thread Ola Liljedahl
So there is a choice between
A) enabling static type checking in the compiler through strong typing
=> requires (syntactical) pointers i C => handles are 64-bit on 64-bit
systems
B) optimise for size and cache efficiency by using 32-bit (scalar) handles

Currently this choice is hard-wired into the ODP linux-generic implementation.

When profiling some ODP examples, I can see hot spots in the functions
that convert "pointer"-handles into the actual object pointers
(virtual addresses). So we are paying a double price here, handles are
large (increases cache pressure) and we have to translate handles to
address before we can reference the objects in the ODP calls.

On 29 March 2017 at 06:10, Bill Fischofer <bill.fischo...@linaro.org> wrote:
>
> On Tue, Mar 28, 2017 at 10:47 PM Honnappa Nagarahalli
> <honnappa.nagaraha...@linaro.org> wrote:
>>
>> On 28 March 2017 at 22:27, Bill Fischofer <bill.fischo...@linaro.org>
>> wrote:
>> >
>> >
>> > On Mon, Mar 27, 2017 at 10:11 PM, Honnappa Nagarahalli
>> > <honnappa.nagaraha...@linaro.org> wrote:
>> >>
>> >> On 27 March 2017 at 08:36, Ola Liljedahl <ola.liljed...@linaro.org>
>> >> wrote:
>> >> > On 27 March 2017 at 07:58, Honnappa Nagarahalli
>> >> > <honnappa.nagaraha...@linaro.org> wrote:
>> >> >> My answers inline. I was confused as hell just a month back :)
>> >> >>
>> >> >> On 23 March 2017 at 06:28, Francois Ozog <francois.o...@linaro.org>
>> >> >> wrote:
>> >> >>
>> >> >>> The more I dig the less I understand ;-)
>> >> >>>
>> >> >>> Let me ask a few questions:
>> >> >>>
>> >> >>> - in the future, when selling 32 bit silicon, which architecture
>> >> >>> version
>> >> >>> will it be ARMv7 or ARMv8 ?
>> >> > AFAIK, future 32-bit ARM cores (from ARM) will be ARMv8. But people
>> >> > are still building SoC's with e.g. ARM920 which is ARMv4T or
>> >> > something.
>> >> >
>> >> >>>
>> >> >>
>> >> >> What you are referring to is ISA version, not architecture. AArch32
>> >> >> and
>> >> >> AArch64 are architectures. ARMv8 also supports AArch32 (i.e. AArch32
>> >> >> with
>> >> >> ARMv8 ISA)
>> >> > ARMv8 has two architectural states, AArch32 and AArch64. An ARMv8
>> >> > implementation can implement either-or or both. There are already
>> >> > examples out there of all these different combinations.
>> >> >
>> >> > AAarch32 supports the A32 and T32 ISA's, these are closely related to
>> >> > (basically extensions of) the corresponding ARMv7a ARM and Thumb(-2)
>> >> > ISA's.
>> >> > The A32 (and T32?) ISA's have some of the ARMv8 extensions, e.g.
>> >> > load-acquire, store-release, crypto instructions, simplified WFE
>> >> > support etc.
>> >> > A user space ARMv7a image should run unmodified on ARMv8/AArch32, I
>> >> > don't know about other privilege levels but I can imagine an ARMv7a
>> >> > kernel running in AArch32 with an AArch64 hypervisor.
>> >> >
>> >> > AArch64 supports the A64 ISA. This ISA actually supports both 32-bit
>> >> > and 64-bit operations (although all addresses are 64-bit AFAIK).
>> >> > 32-bit operations use Wn registers and 64-bit operations use Xn
>> >> > registers. It's the same register set, Wn just denotes the lower 32
>> >> > bits.
>> >> >
>> >> >>
>> >> >> - is the target solution will be running ALL in 32 bits? (boot in 32
>> >> >> bits,
>> >> >>> Linux 32 bits, 32 bits apps)?
>> >> >>> - or is the target solution will be hybrid (64 bits Linux and some
>> >> >>> 32
>> >> >>> bits
>> >> >>> apps).
>> >> > I think this is the more likely path. If you have >= than 4GB of RAM
>> >> > (and also other stuff that needs physical addressing), you want a
>> >> > 64-bit kernel.
>> >> >
>> >> >>>
>> >> >>
>> >> >> The target solution could be Hybrid. Linux could be 64b, the
>> >> >> applications
>> >> >> could be 32b. It is 

Re: [lng-odp] odp_queue_enq semantics

2017-03-28 Thread Ola Liljedahl
On 28 March 2017 at 20:32, Bill Fischofer <bill.fischo...@linaro.org> wrote:
> On Tue, Mar 28, 2017 at 11:18 AM, Honnappa Nagarahalli
> <honnappa.nagaraha...@linaro.org> wrote:
>> On 28 March 2017 at 07:21, Bill Fischofer <bill.fischo...@linaro.org> wrote:
>>> On Tue, Mar 28, 2017 at 7:19 AM, Verma, Shally <shally.ve...@cavium.com> 
>>> wrote:
>>>>
>>>>
>>>> -Original Message-
>>>> From: lng-odp [mailto:lng-odp-boun...@lists.linaro.org] On Behalf Of Bill 
>>>> Fischofer
>>>> Sent: 28 March 2017 16:44
>>>> To: Ola Liljedahl <ola.liljed...@linaro.org>
>>>> Cc: nd <n...@arm.com>; lng-odp@lists.linaro.org
>>>> Subject: Re: [lng-odp] odp_queue_enq semantics
>>>>
>>>> On Tue, Mar 28, 2017 at 4:10 AM, Ola Liljedahl <ola.liljed...@linaro.org> 
>>>> wrote:
>>>>> On 28 March 2017 at 10:41, Joe Savage <joe.sav...@arm.com> wrote:
>>>>>> Hey,
>>>>>>
>>>>>> I just wanted to clarify something about the expected behaviour of
>>>>>> odp_queue_enq. In the following code snippet, is it acceptable for
>>>>>> the assert to fire? (i.e. for a dequeue after a successful enqueue to
>>>>>> fail, with only a single thread of execution)
>>>>>>
>>>>>> odp_queue_t queue;
>>>>>> odp_event_t ev1, ev2;
>>>>>> /* ... */
>>>>>> if (odp_queue_enq(queue, ev1) == 0) {
>>>>>> ev2 = odp_queue_deq(queue);
>>>>>> assert(ev2 != ODP_EVENT_INVALID);
>>>>>> }
>>>>
>>>>
>>>> rc == 0 from odp_queue_enq() simply means that the enqueue request has 
>>>> been accepted.
>>>>
>>>> odp_queue_deq() removes the first element available on the specified queue.
>>>>
>>>> As Ola points out, depending on the implementation there may be some 
>>>> latency associated with queue operations so it is possible for the assert 
>>>> to fire. Of course, in a multi-threaded environment some other thread may 
>>>> have dequeued the event first as well, so this sort of code is inherently 
>>>> brittle.
>>>>
>>>> Shally-Sounds to me that based on implementation odp_queue_enq() can be 
>>>> async call so applications dequeuing events should always check against 
>>>> whether it is valid intended event? And if not intended, then app should 
>>>> put event back to queue? Is that understanding correct?
>>>
>>
>> odp_queue_enq() itself is not an async call. When it returns to the
>> application (that did the enqueue), the event is enqueued on the
>> queue.
>
> That is correct. If odp_queue_enq() returns success the event is
> enqueued and will appear ahead of any subsequent events added by other
> odp_queue_enq() calls.
Enqueue calls made from the same thread.

I don't think there is any guarantee of global ordering of enqueued
events from different threads, even if they synchronise through shared
memory.

> However that does not imply that it is
> instantly visible to odp_queue_deq().
Per the above:
Assume 'q' is an empty queue.
T0: odp_queue_enq(q, ev0);
T0: __atomic_store_n(&flag, 1, __ATOMIC_RELEASE); //Ensure prior memory
accesses are visible
T1: while (__atomic_load_n(&flag, __ATOMIC_ACQUIRE) != 1) ; //spin-wait
T1: ev1 = odp_queue_deq(q);
T1: assert(ev1 == ev0);

T1 is not guaranteed to be able to dequeue the event enqueued by T0 so
the assertion may fail (ev1 == ODP_EVENT_INVALID). Memory accesses and
ODP events may take different routes between the CPU's of T0 and T1
and I don't think any ordering between them is guaranteed.
Store-release does not order ODP queue operations.

But this one should hold:
Assume flag is initially 0.
T0: flag = 1;
T0: odp_queue_enq(q, ev0);
T1: ev1 = odp_queue_deq(q); //T1 dequeues the event enqueued by T0
T1: assert(flag == 1);

The assertion should hold. odp_queue_enq() should have release
ordering for all previous memory accesses by that thread and
odp_queue_deq() should have acquire ordering for all following memory
accesses by that thread.

>
>>
>>> You should check the return from odp_queue_deq() to see if it's
>>> ODP_EVENT_INVALID, but ODP does not define "peek" or "push" operations
>>> on queues, so there is no way to "put an event back" unless you call
>>> odp_queue_enq(), which adds it to the end of the queue.
>>>
>>>>
>>>>

Re: [lng-odp] odp_queue_enq semantics

2017-03-28 Thread Ola Liljedahl
On 28 March 2017 at 10:41, Joe Savage  wrote:
> Hey,
>
> I just wanted to clarify something about the expected behaviour of
> odp_queue_enq. In the following code snippet, is it acceptable for the assert
> to fire? (i.e. for a dequeue after a successful enqueue to fail, with only a
> single thread of execution)
>
> odp_queue_t queue;
> odp_event_t ev1, ev2;
> /* ... */
> if (odp_queue_enq(queue, ev1) == 0) {
> ev2 = odp_queue_deq(queue);
> assert(ev2 != ODP_EVENT_INVALID);
> }
>
> That is, can the "success" status code from odp_queue_enq be used to indicate
> a delayed enqueue ("This event will be added to the queue at some point
> soon-ish"), or should it only be used to communicate an immediate successful
> addition to the queue? The documentation seems unclear on this point while
> the validation tests suggest the latter, but I thought it worth checking up
> on.
Some code in odp_scheduling test/benchmark also expects an enqueued
event to be immediately dequeuable by the same thread.

I think this is not something you can require, either from a HW queue
manager or from a SW implementation. HW implementations can always
have associated latencies visible to the thread that
enqueues/dequeues. Also for a SW implementation with loosely coupled
parts (e.g. producer & consumer head & tail pointers) it can be
possible for an enqueued event to not be immediately available when
other threads are doing concurrent operations on the same queue. Only
if a lock protects the whole data structure can you enforce one global
view of this data structure. You don't want to use a lock.
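
A sketch of that window, using the MP enqueue from the ring.c patch
earlier in this archive: thread A has reserved its slots and written
the entries, but until it publishes prod.tail those entries are
invisible to any dequeuer, and later enqueuers must wait their turn.

prod_head = __atomic_load_n(&r->prod.head, __ATOMIC_RELAXED);
/* ... CAS on prod.head to reserve slots, write the entries ... */
while (odp_unlikely(__atomic_load_n(&r->prod.tail, __ATOMIC_RELAXED) !=
        prod_head))
    odp_cpu_pause(); /* earlier enqueuers have not published yet */
__atomic_store_n(&r->prod.tail, prod_next, __ATOMIC_RELEASE);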

>
> Thanks,
>
> Joe


Re: [lng-odp] 32b support in ODP-Cloud

2017-03-27 Thread Ola Liljedahl
On 27 March 2017 at 07:58, Honnappa Nagarahalli
 wrote:
> My answers inline. I was confused as hell just a month back :)
>
> On 23 March 2017 at 06:28, Francois Ozog  wrote:
>
>> The more I dig the less I understand ;-)
>>
>> Let me ask a few questions:
>>
>> - in the future, when selling 32 bit silicon, which architecture version
>> will it be ARMv7 or ARMv8 ?
AFAIK, future 32-bit ARM cores (from ARM) will be ARMv8. But people
are still building SoC's with e.g. ARM920 which is ARMv4T or
something.

>>
>
> What you are referring to is ISA version, not architecture. AArch32 and
> AArch64 are architectures. ARMv8 also supports AArch32 (i.e. AArch32 with
> ARMv8 ISA)
ARMv8 has two architectural states, AArch32 and AArch64. An ARMv8
implementation can implement either-or or both. There are already
examples out there of all these different combinations.

AAarch32 supports the A32 and T32 ISA's, these are closely related to
(basically extensions of) the corresponding ARMv7a ARM and Thumb(-2)
ISA's.
The A32 (and T32?) ISA's have some of the ARMv8 extensions, e.g.
load-acquire, store-release, crypto instructions, simplified WFE
support etc.
A user space ARMv7a image should run unmodified on ARMv8/AArch32, I
don't know about other privilege levels but I can imagine an ARMv7a
kernel running in AArch32 with an AArch64 hypervisor.

AArch64 supports the A64 ISA. This ISA actually supports both 32-bit
and 64-bit operations (although all addresses are 64-bit AFAIK).
32-bit operations use Wn registers and 64-bit operations use Xn
registers. It's the same register set, Wn just denotes the lower 32
bits.

>
> - is the target solution will be running ALL in 32 bits? (boot in 32 bits,
>> Linux 32 bits, 32 bits apps)?
>> - or is the target solution will be hybrid (64 bits Linux and some 32 bits
>> apps).
I think this is the more likely path. If you have >= than 4GB of RAM
(and also other stuff that needs physical addressing), you want a
64-bit kernel.

>>
>
> The target solution could be Hybrid. Linux could be 64b, the applications
> could be 32b. It is my understanding that everything 32b is also possible
> using AArch32.
>
>
>> When I read "AArch64 was designed to remove known implementation
>> challenges of AArch32 cores" on http://infocenter.arm.com/
>> help/index.jsp?topic=/com.arm.doc.dai0490a/ar01s01.html
>> I wonder if stating we support AArch32 is a good idea...
>>
>> So what is the best way to describe what we want?
>> -  ARMv8LP64 or ILP32 ?
>> - AArch64  LP64 or ILP32 ?
>> - LP64 or ILP32?
>>
>> I think the best way to say is 'we support AArch64 and AArch32'.
Re AArch64, LP64 or ILP32 applications?

AArch32 or ARMv7a?

>
>
>> FF
>>
>>
>> On 23 March 2017 at 04:57, Honnappa Nagarahalli <
>> honnappa.nagaraha...@linaro.org> wrote:
>>
>>> Hi Bill / Matt and others,
>>> What I was trying to say in our discussion is that, the
>>> ODP-Cloud code should not be pointer heavy.
>>>
>>> Please take a look at this video from BUD17:
>>> http://connect.linaro.org/resource/bud17/bud17-101/ (unfortunately
>>> there are no slides, I am trying to get them). This talks about the
>>> performance of the 32b application on AArch64. One of the
>>> applications, has huge performance improvement while running in 32b
>>> mode (ILP32 in this particular case) on AArch64 (when compared to the
>>> same application compiled for 64b mode running on AArch64 i.e. in 64b
>>> compilation it performed very poorly). My understanding is that this
>>> particular application is a pointer chasing application. Other
>>> applications which are not pointer heavy, do not have this behavior.
Isn't the problem with LP64 that if you have a lot of pointers stored
in data structures, these take 2x the space of ILP32 pointers and thus
increase the cache pressure?
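
A small illustration (generic C, not ODP code) of that space argument:

struct node {
    struct node *next; /* 4 bytes on ILP32, 8 bytes on LP64 */
    uint32_t key;
};
/* sizeof(struct node): 8 on ILP32, 16 on LP64 (8 + 4 + 4 padding),
 * so half as many nodes fit in each cache line */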

I don't think it is the pointer chasing itself that is penalised by
64-bit pointers. Pointer chasing apps are penalised by long
load-to-use latencies (L1 cache hit latency, L2/L3 latencies, DRAM
latency).

>>>
>>> So, we need to make sure ODP-Cloud is not pointer heavy and does not
>>> force the application to be pointer heavy, to get good performance out
>>> of 64b systems.
Even with LP64, ODP could use 32-bit handles for ODP objects. The
address lookup of the handle needs to be efficient (from a cache
perspective) though; already now I can see hotspots in the function
that returns an address from a handle.
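
If the handle were a plain index into a contiguous object table, the
lookup could be a single multiply-add that touches no extra cache line
(hypothetical layout, not what linux-generic does today):

static inline void *handle_to_addr(uint32_t hdl)
{
    /* pool_base is a char *, OBJ_SIZE a compile-time constant;
     * both are assumptions of this sketch */
    return pool_base + (uintptr_t)hdl * OBJ_SIZE;
}
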

>>>
>>> Thank you,
>>> Honnappa
>>>
>>
>>
>>
>> --
>> François-Frédéric Ozog | *Director Linaro Networking Group*
>> T: +33.67221.6485
>> francois.o...@linaro.org | Skype: ffozog
>>
>>


Re: [lng-odp] [PATCH] [RFC] API:update odp_schedle_multi to return events from multiple queues.

2017-02-07 Thread Ola Liljedahl
On 7 February 2017 at 14:22, Nikhil Agarwal <nikhil.agar...@linaro.org> wrote:
>
>
> On 3 February 2017 at 19:20, Ola Liljedahl <ola.liljed...@linaro.org> wrote:
>>
>> Do we have any performance comparison between using this new API
>> compared to using the existing API and the SW behind
>> odp_schedule_multi() sorts the events (if necessary) and only returns
>> events for one queue at a time (keeping the others as prescheduled
>> events)?
>
> Keeping highest priority packets on-hold while other cores are free to
> process them may add significant latency as you are not sure when the
> particular thread will call odp_schedule again.
>>
>>
>> If we don't know that this new API actually improves performance
>> (significantly) compared to using the existing API enhanced with some
>> under-the-hood fixup, I don't think we have a good case for changing
>> the API.
>
> Comparing one packet per call vs multiple(8 packets) gives a boost of around
> ~30% in performance.
But will these N (e.g. 8) packets come from N different queues so that
odp_schedule_multi() can only return 1 packet at a time to the
application?

What is the expected average number of events per source queue when
getting events from the HW?

If every event comes from a different queue, is there any benefit from
batching except to avoid some SW overhead in odp_schedule_multi()? It
seems like there would be little memory access locality if every event
comes from a new queue and thus requires a different type of
processing and accesses different tables.

>
>>
>>
>>
>> On 3 February 2017 at 12:50, Nikhil Agarwal <nikhil.agar...@linaro.org>
>> wrote:
>> > Signed-off-by: Nikhil Agarwal <nikhil.agar...@linaro.org>
>> > ---
>> >  include/odp/api/spec/schedule.h | 36
>> > +++-
>> >  1 file changed, 31 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/include/odp/api/spec/schedule.h
>> > b/include/odp/api/spec/schedule.h
>> > index f8fed17..6e8d759 100644
>> > --- a/include/odp/api/spec/schedule.h
>> > +++ b/include/odp/api/spec/schedule.h
>> > @@ -118,8 +118,8 @@ odp_event_t odp_schedule(odp_queue_t *from, uint64_t
>> > wait);
>> >   * originate from the same source queue and share the same scheduler
>> >   * synchronization context.
>> >   *
>> > - * @param fromOutput parameter for the source queue (where the
>> > event was
>> > - *dequeued from). Ignored if NULL.
>> > + * @param fromOutput parameter for the source queues array (where
>> > the event
>> > + *   were dequeued from). Ignored if NULL.
>> >   * @param waitMinimum time to wait for an event. Waits infinitely,
>> > if set to
>> >   *ODP_SCHED_WAIT. Does not wait, if set to
>> > ODP_SCHED_NO_WAIT.
>> >   *Use odp_schedule_wait_time() to convert time to other
>> > wait
>> > @@ -129,7 +129,7 @@ odp_event_t odp_schedule(odp_queue_t *from, uint64_t
>> > wait);
>> >   *
>> >   * @return Number of events outputted (0 ... num)
>> >   */
>> > -int odp_schedule_multi(odp_queue_t *from, uint64_t wait, odp_event_t
>> > events[],
>> > +int odp_schedule_multi(odp_queue_t from[], uint64_t wait, odp_event_t
>> > events[],
>> >int num);
>> >
>> >  /**
>> > @@ -170,6 +170,17 @@ void odp_schedule_resume(void);
>> >  void odp_schedule_release_atomic(void);
>> >
>> >  /**
>> > + * Release the atomic context associated with the events specified by
>> > events[].
>> > + *
>> > + * This call is similar to odp_schedule_release_atomic call which
>> > releases context
>> > + * associated with the events defined by events.
>> > + * @param events  Input event array for which atomic context is to be
>> > released
>> > + * @param num Number of events
>> > + *
>> > + */
>> > +void odp_schedule_release_atomic_contexts(odp_event_t events[], num);
>> > +
>> > +/**
>> >   * Release the current ordered context
>> >   *
>> >   * This call is valid only for source queues with ordered
>> > synchronization. It
>> > @@ -187,6 +198,17 @@ void odp_schedule_release_atomic(void);
>> >  void odp_schedule_release_ordered(void);
>> >
>> >  /**
>> > + * Release the ordered context associated with the events specified by
>> > evnets[].
>> > + *

Re: [lng-odp] [PATCH] [RFC] API:update odp_schedle_multi to return events from multiple queues.

2017-02-03 Thread Ola Liljedahl
Do we have any performance comparison between using this new API
compared to using the existing API and the SW behind
odp_schedule_multi() sorts the events (if necessary) and only returns
events for one queue at a time (keeping the others as prescheduled
events)?

If we don't know that this new API actually improves performance
(significantly) compared to using the existing API enhanced with some
under-the-hood fixup, I don't think we have a good case for changing
the API.
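
The "under-the-hood fixup" could look roughly like this (a
hypothetical sketch only: hw_dequeue_burst() is an assumed platform
primitive, the wait handling is omitted, and the latency concern about
holding back stashed events still applies):

#include <string.h>
#include <odp_api.h>

#define BURST 8 /* hypothetical batch size */

static __thread odp_event_t ev_stash[BURST];
static __thread odp_queue_t q_stash[BURST];
static __thread int n_stash;

int schedule_multi_fixup(odp_queue_t *from, odp_event_t events[], int num)
{
    int n = 0;

    if (n_stash == 0)
        n_stash = hw_dequeue_burst(ev_stash, q_stash, BURST);
    if (n_stash == 0)
        return 0;
    /* Return only the leading run of events sharing one source queue,
     * preserving the existing single-queue API contract */
    while (n < num && n < n_stash && q_stash[n] == q_stash[0]) {
        events[n] = ev_stash[n];
        n++;
    }
    if (from)
        *from = q_stash[0];
    memmove(ev_stash, ev_stash + n, (n_stash - n) * sizeof(ev_stash[0]));
    memmove(q_stash, q_stash + n, (n_stash - n) * sizeof(q_stash[0]));
    n_stash -= n;
    return n;
}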


On 3 February 2017 at 12:50, Nikhil Agarwal  wrote:
> Signed-off-by: Nikhil Agarwal 
> ---
>  include/odp/api/spec/schedule.h | 36 +++-
>  1 file changed, 31 insertions(+), 5 deletions(-)
>
> diff --git a/include/odp/api/spec/schedule.h b/include/odp/api/spec/schedule.h
> index f8fed17..6e8d759 100644
> --- a/include/odp/api/spec/schedule.h
> +++ b/include/odp/api/spec/schedule.h
> @@ -118,8 +118,8 @@ odp_event_t odp_schedule(odp_queue_t *from, uint64_t 
> wait);
>   * originate from the same source queue and share the same scheduler
>   * synchronization context.
>   *
> - * @param fromOutput parameter for the source queue (where the event was
> - *dequeued from). Ignored if NULL.
> + * @param fromOutput parameter for the source queues array (where the 
> event
> + *   were dequeued from). Ignored if NULL.
>   * @param waitMinimum time to wait for an event. Waits infinitely, if 
> set to
>   *ODP_SCHED_WAIT. Does not wait, if set to ODP_SCHED_NO_WAIT.
>   *Use odp_schedule_wait_time() to convert time to other wait
> @@ -129,7 +129,7 @@ odp_event_t odp_schedule(odp_queue_t *from, uint64_t 
> wait);
>   *
>   * @return Number of events outputted (0 ... num)
>   */
> -int odp_schedule_multi(odp_queue_t *from, uint64_t wait, odp_event_t 
> events[],
> +int odp_schedule_multi(odp_queue_t from[], uint64_t wait, odp_event_t 
> events[],
>int num);
>
>  /**
> @@ -170,6 +170,17 @@ void odp_schedule_resume(void);
>  void odp_schedule_release_atomic(void);
>
>  /**
> + * Release the atomic context associated with the events specified by 
> events[].
> + *
> + * This call is similar to odp_schedule_release_atomic call which releases 
> context
> + * associated with the events defined by events.
> + * @param events  Input event array for which atomic context is to be 
> released
> + * @param num Number of events
> + *
> + */
> +void odp_schedule_release_atomic_contexts(odp_event_t events[], num);
> +
> +/**
>   * Release the current ordered context
>   *
>   * This call is valid only for source queues with ordered synchronization. It
> @@ -187,6 +198,17 @@ void odp_schedule_release_atomic(void);
>  void odp_schedule_release_ordered(void);
>
>  /**
> + * Release the ordered context associated with the events specified by 
> events[].
> + *
> + * This call is similar to odp_schedule_release_ordered call which releases 
> context
> + * associated with the events defined by events.
> + * @param events  Input event array for which ordered context is to be 
> released
> + * @param num Number of events
> + *
> + */
> +void odp_schedule_release_ordered_contexts(odp_event_t events[], num);
> +
> +/**
>   * Prefetch events for next schedule call
>   *
>   * Hint the scheduler that application is about to finish processing the 
> current
> @@ -348,11 +370,13 @@ int odp_schedule_group_info(odp_schedule_group_t group,
>   * allowing order to maintained on a more granular basis. If an ordered lock
>   * is used multiple times in the same ordered context results are undefined.
>   *
> + * @param source_queue Queue handle from which event is received and lock to 
> be
> + *acquired.
>   * @param lock_index Index of the ordered lock in the current context to be
>   *   acquired. Must be in the range 0..odp_queue_lock_count()
>   *   - 1
>   */
> -void odp_schedule_order_lock(unsigned lock_index);
> +void odp_schedule_order_lock(odp_queue_t source_queue, unsigned lock_index);
>
>  /**
>   * Release ordered context lock
> @@ -360,12 +384,14 @@ void odp_schedule_order_lock(unsigned lock_index);
>   * This call is valid only when holding an ordered synchronization context.
>   * Release a previously locked ordered context lock.
>   *
> + * @param source_queue Queue handle from which event is received and lock to 
> be
> + *acquired.
>   * @param lock_index Index of the ordered lock in the current context to be
>   *   released. Results are undefined if the caller does not
>   *   hold this lock. Must be in the range
>   *   0..odp_queue_lock_count() - 1
>   */
> -void odp_schedule_order_unlock(unsigned lock_index);
> +void odp_schedule_order_unlock(odp_queue_t source_queue, unsigned 
> lock_index);
>
>  /**
>   * @}
> --
> 2.9.3
>


Re: [lng-odp] [PATCH v2] example: add IPv4 fragmentation/reassembly example

2017-02-02 Thread Ola Liljedahl

On 02/02/2017, 17:26, "Maxim Uvarov"  wrote:

>>>
>>>My point is:
>>>
>>>original packet -> wrongly fragmented -> original packet
>>>vs
>>>original packet -> good fragmented -> original packet
>>>
>>>result is the same, middle is different.
>>I don't understand. How do you define wrongly/goodly fragmented?
>
>Packet valid or not. All ip headers on their place or shifted, lengths
>and checksums are in right byte order, no trail bytes and etc.

I still don't understand what you mean. Is there something wrong or
missing in the patch?
The code should correctly reassemble a set of fragments that arrive
in any order. Duplicated fragments ignored. Incomplete datagrams will
be freed after a timeout.
Do you suspect some case is not properly handled?




Re: [lng-odp] [PATCH v2] example: add IPv4 fragmentation/reassembly example

2017-02-01 Thread Ola Liljedahl
On 1 February 2017 at 16:19, Maxim Uvarov  wrote:
> On 02/01/17 13:26, Joe Savage wrote:
>> Hey Maxim,
>>
>> I'm adding the mailing list to the CCs.
>>
>
> sorry, looks like I pressed reply instead of reply all.
>
> I raised question about coding style question on today’s arch call
> discussion. And agreement was:
>
> 1. variables are on top. (actually we discussed that but looks like we
> forgot to document it.) Some exceptions acceptable if you link to 3rd
> party code which you can not modify. Like netmap.
My interpretation was that variable declarations at the top of the
function are the default. If you want/need to deviate from the default,
you need to document why. One such reason is this reassembly example
where an inner block with local variables is simpler (as in less
complexity) than calling a separate function. The reason is that the
code has multiple outputs which is complicated to handle in C. Inner
blocks with local variable declarations is not some obscure C language
feature and solves the problem nicely.
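
An illustration of that argument (hypothetical code, not from the
patch): a const local can take a value that only exists mid-function,
which a declarations-at-top-only rule cannot express without dropping
the const or adding a function.

void handle(odp_packet_t pkt)
{
    if (!validate(pkt)) /* hypothetical early-out */
        return;
    {
        const uint32_t len = odp_packet_len(pkt);

        if (len > MTU) /* MTU as in the example app */
            fragment(pkt, len); /* hypothetical helper */
    }
}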

>
> 2. Empty braces {} are ok as long as the intent is clear.
>
> gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
> odp_ipfragreass-odp_ipfragreass_reassemble.o: In function
> `atomic_strong_cas_16':
> /opt/Linaro/odp3.git/example/ipfragreass/odp_ipfragreass_helpers.h:123:
> undefined reference to `__atomic_compare_exchange_16'
> /opt/Linaro/odp3.git/example/ipfragreass/odp_ipfragreass_helpers.h:123:
> undefined reference to `__atomic_compare_exchange_16'
> /opt/Linaro/odp3.git/example/ipfragreass/odp_ipfragreass_helpers.h:123:
> undefined reference to `__atomic_compare_exchange_16'
You may need to link with -latomic on certain targets to get 128-bit
atomics support.
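
For reference, the kind of construct that pulls in libatomic on such
targets (illustrative only):

#include <stdbool.h>

static bool cas_u128(__int128 *ptr, __int128 *expected, __int128 desired)
{
    /* GCC lowers this to __atomic_compare_exchange_16, which some
     * targets implement in libatomic: link with 'gcc ... -latomic' */
    return __atomic_compare_exchange_n(ptr, expected, desired, false,
                                       __ATOMIC_ACQ_REL,
                                       __ATOMIC_ACQUIRE);
}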

>
>
> answers on questions are bellow.
>
>
 +#include 
 +#include 
 +
 +#include "odp_ipfragreass_ip.h"
 +#include "odp_ipfragreass_fragment.h"
 +
 +int fragment_ipv4_packet(odp_packet_t orig_packet, odp_packet_t *out,
 +int *out_len)
 +{
 +   uint32_t orig_len = odp_packet_len(orig_packet);
 +   odph_ipv4hdr_t *orig_header = odp_packet_data(orig_packet);
 +   uint32_t header_len = ipv4hdr_ihl(*orig_header);
 +   uint32_t bytes_remaining = orig_len - header_len;
 +
 +   if (bytes_remaining <= MTU)
 +   return -1;
 +
 +   /*
 +* The main fragmentation loop (continue until all bytes from the
 +* original payload have been assigned to a fragment)
 +*/
 +   odp_packet_t frag = orig_packet;
>>>
>>> 1. why to you need second variable for the same packet handle?
>>
>> For readability! You're right that I could call the "orig_packet" parameter
>> "frag" and be done with it, but it's really much clearer if rather than
>> reusing the same variable for two effectively unrelated purposes, I use the
>> wonderful ability granted to us by the programming gods to associate values
>> with names that describe their purpose. And best of all, for all but the most
>> primitive or idiotic of compilers, it's completely free! (And, actually,
>> probably makes the optimiser's life easier.)
>>
>
> I think it decreases portability.
???
I would really like to hear the arguments for this opinion.

> It's very hard to argue because
> different people look from different angle. And what is nice for one
> looks bad for other.. But anyway if you will declare variable on top
> then this question will go away.
>
>
>>> 2. Variables should be declared at the beginning of the scope.
>>
>> Is this a hard requirement? I noticed that it wasn't picked up by checkpatch,
>> and I can't see it outlined in the kernel's coding style guidelines either. I
>> don't think this would cause a total readability disaster here, but I do
>> think it would make the code at least a little harder to read. Mentally, for
>> me, the variable declarations are clearly split between auxiliary data
>> required for the whole function (declared at the top), and data that is
>> instrumental in operating the big ol' while loop (declared just before the
>> loop).
>>
>> Admittedly, the function more or less /is/ the loop here though, and perhaps
>> you could argue that this function is too long as per the kernel style
>> guidelines anyway. I use a similar style in other places in the code, though,
>> where I think it's hugely more useful. For instance, in the function
>> "add_fraglist_to_fraglist". It's a big function, so is helped significantly
>> by this. You could definitely argue that it's too long, but it sort of needs
>> to be. (In some part, at least, because in my testing the optimiser failed to
>> perform tail call optimisation.)
>>
>
> this function also can be split.
Every function can be split into smaller functions until there is only
one C statement left in each function. This doesn't make it a good
idea though.

> And goto for hole function is ugly,
> which you used to save some space to avoid 

Re: [lng-odp] [PATCH v2] example: add IPv4 fragmentation/reassembly example

2017-01-30 Thread Ola Liljedahl
On 30 January 2017 at 23:50, Bill Fischofer  wrote:
> Checkpatch still has some issues with this:
>
bill@Ubuntu15:~/linaro/review$ ./scripts/checkpatch.pl
> 0001-example-add-IPv4-fragmentation-reassembly-example.patch
> WARNING: 'DONT' may be misspelled - perhaps 'DON'T'?
> #1056: FILE: example/ipfragreass/odp_ipfragreass_ip.h:20:
> +#define IP_FRAG_DONT 0x4000 /**< "Don't Fragment" (DF) fragment flag */
Seems like a design problem in checkpatch, breaking apart symbolic
names and testing the individual word fragments. Do all versions of
checkpatch behave like this? "Your symbolic names must consist of
properly spelled English words!".

I think we should ignore checkpatch. And IPv4 came before checkpatch.

-- Ola

>
> WARNING: 'dont' may be misspelled - perhaps 'don't'?
> #1122: FILE: example/ipfragreass/odp_ipfragreass_ip.h:86:
> +static inline bool ipv4hdr_dont_fragment(odph_ipv4hdr_t h)
>
> WARNING: 'DONT' may be misspelled - perhaps 'DON'T'?
> #1124: FILE: example/ipfragreass/odp_ipfragreass_ip.h:88:
> + return (h.frag_offset & odp_cpu_to_be_16(IP_FRAG_DONT));
>
> WARNING: 'dont' may be misspelled - perhaps 'don't'?
> #1133: FILE: example/ipfragreass/odp_ipfragreass_ip.h:97:
> +static inline void ipv4hdr_set_dont_fragment(odph_ipv4hdr_t *h, bool df)
>
> WARNING: 'DONT' may be misspelled - perhaps 'DON'T'?
> #1136: FILE: example/ipfragreass/odp_ipfragreass_ip.h:100:
> + h->frag_offset |=  odp_cpu_to_be_16(IP_FRAG_DONT);
>
> WARNING: 'DONT' may be misspelled - perhaps 'DON'T'?
> #1138: FILE: example/ipfragreass/odp_ipfragreass_ip.h:102:
> + h->frag_offset &= ~odp_cpu_to_be_16(IP_FRAG_DONT);
>
> WARNING: 'dont' may be misspelled - perhaps 'don't'?
> #1662: FILE: example/ipfragreass/odp_ipfragreass_reassemble.c:369:
> + * @param dont_assemble Whether reassembly should be attempted by default
>
> WARNING: 'dont' may be misspelled - perhaps 'don't'?
> #1668: FILE: example/ipfragreass/odp_ipfragreass_reassemble.c:375:
> +odp_queue_t out, bool dont_assemble)
>
> WARNING: 'dont' may be misspelled - perhaps 'don't'?
> #1686: FILE: example/ipfragreass/odp_ipfragreass_reassemble.c:393:
> + dont_assemble = false;
>
> WARNING: 'dont' may be misspelled - perhaps 'don't'?
> #1711: FILE: example/ipfragreass/odp_ipfragreass_reassemble.c:418:
> + if (newfl.part_len < newfl.whole_len || dont_assemble) {
>
> WARNING: 'dont' may be misspelled - perhaps 'don't'?
> #1764: FILE: example/ipfragreass/odp_ipfragreass_reassemble.c:471:
> + dont_assemble = 1;
>
> WARNING: 'dont' may be misspelled - perhaps 'don't'?
> #1774: FILE: example/ipfragreass/odp_ipfragreass_reassemble.c:481:
> + dont_assemble = 0;
>
> total: 0 errors, 12 warnings, 0 checks, 2175 lines checked
>
> NOTE: Ignored message types: BIT_MACRO COMPARISON_TO_NULL
> DEPRECATED_VARIABLE NEW_TYPEDEFS SPLIT_STRING SSCANF_TO_KSTRTO
>
> 0001-example-add-IPv4-fragmentation-reassembly-example.patch has style
> problems, please review.
>
> These are unfortunate since they are part of #defines. Perhaps a
> different choice of name here? I'm not sure what tweaks we can do to
> checkpatch to address this.
>
>
> On Mon, Jan 30, 2017 at 4:32 AM, Joe Savage  wrote:
>> Add an example application implementing lock-free IPv4 fragmentation
>> and reassembly functionality using ODP's packet "concat" and "split".
>>
>> Signed-off-by: Joe Savage 
>> ---
>> (This code contribution is provided under the terms of agreement 
>> LES-LTM-21309)
>>
>>  doc/application-api-guide/examples.dox   |   5 +
>>  example/Makefile.am  |   1 +
>>  example/ipfragreass/.gitignore   |   3 +
>>  example/ipfragreass/Makefile.am  |  22 +
>>  example/ipfragreass/odp_ipfragreass.c| 393 
>>  example/ipfragreass/odp_ipfragreass_atomics.h| 124 
>>  example/ipfragreass/odp_ipfragreass_fragment.c   |  99 +++
>>  example/ipfragreass/odp_ipfragreass_fragment.h   |  28 +
>>  example/ipfragreass/odp_ipfragreass_helpers.c| 121 
>>  example/ipfragreass/odp_ipfragreass_helpers.h| 129 
>>  example/ipfragreass/odp_ipfragreass_ip.h | 251 
>>  example/ipfragreass/odp_ipfragreass_reassemble.c | 772 
>> +++
>>  example/ipfragreass/odp_ipfragreass_reassemble.h | 211 +++
>>  example/m4/configure.m4  |   1 +
>>  14 files changed, 2160 insertions(+)
>>  create mode 100644 example/ipfragreass/.gitignore
>>  create mode 100644 example/ipfragreass/Makefile.am
>>  create mode 100644 example/ipfragreass/odp_ipfragreass.c
>>  create mode 100644 example/ipfragreass/odp_ipfragreass_atomics.h
>>  create mode 100644 example/ipfragreass/odp_ipfragreass_fragment.c
>>  create mode 100644 example/ipfragreass/odp_ipfragreass_fragment.h
>>  create mode 100644 example/ipfragreass/odp_ipfragreass_helpers.c
>>  create mode 100644 example/ipfragreass/odp_ipfragreass_helpers.h
>>  create mode 100644 

Re: [lng-odp] schedule_multi returning tasks from multiple queues

2017-01-30 Thread Ola Liljedahl
On 30 January 2017 at 05:18, Honnappa Nagarahalli
<honnappa.nagaraha...@linaro.org> wrote:
> On 26 January 2017 at 15:10, Bill Fischofer <bill.fischo...@linaro.org> wrote:
>> On Wed, Jan 25, 2017 at 11:13 PM, Honnappa Nagarahalli
>> <honnappa.nagaraha...@linaro.org> wrote:
>>> I agree, it needs additional operations. One could combine multiple
>>> atomic operations into a single stage where possible. If the scheduler
>>> is implemented in hardware, impact will be less.
>>>
>>> The other option is for the reordering logic to call per packet
>>> functions which run the atomic stage.
>>
>> The idea behind ordered locks is that while most of the processing for
>> a packet can be done in parallel, there are certain critical sections
>> that need to be performed in order, and these critical sections
>> typically fall somewhere in the middle of packet processing rather
>> than at the start or end where the scheduler might be able to fold
>> them in in some way as part of the odp_schedule() call.
>>
>
> Let us say that the packet processing pipeline is broken into 3
> stages. Stage 1, Critical Section stage and Stage 2. Critical section
> stage is accessible only through atomic queues. Packets, parallel
> processed in Stage 1 are ordered and enter the Critical Section stage,
> one at a time.This does not require ordered locking, scheduler ensures
> Critical Section stage is accessed by one packet at a time.
Yes but this repeated enqueue/dequeue (through the scheduler) likely
has a lot of overhead. The ordered locks were added to avoid that
overhead.
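
A sketch of the pattern the ordered locks enable (current API
signature; the stage functions are hypothetical):

odp_queue_t from;
odp_event_t ev = odp_schedule(&from, ODP_SCHED_WAIT);

stage1(ev);                  /* runs in parallel, unordered */
odp_schedule_order_lock(0);  /* serialises in source-queue order */
critical_section(ev);        /* the in-order part of the pipeline */
odp_schedule_order_unlock(0);
stage2(ev);                  /* parallel again, no re-enqueue needed */
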

>
>
>> So the concerns still exist if we allow events from multiple ordered
>> queues to be returned from a single odp_schedule_multi() call.
>>
>>>
>>> On 25 January 2017 at 04:16, Ola Liljedahl <ola.liljed...@linaro.org> wrote:
>>>>
>>>>
>>>> On 25 January 2017 at 06:34, Honnappa Nagarahalli
>>>> <honnappa.nagaraha...@linaro.org> wrote:
>>>>>
>>>>> On 24 January 2017 at 19:16, Bill Fischofer <bill.fischo...@linaro.org>
>>>>> wrote:
>>>>> > On Tue, Jan 24, 2017 at 8:30 AM, Nikhil Agarwal <nikhil.agar...@nxp.com>
>>>>> > wrote:
>>>>> >>
>>>>> >>
>>>>> >> -Original Message-
>>>>> >> From: lng-odp [mailto:lng-odp-boun...@lists.linaro.org] On Behalf Of
>>>>> >> Bill Fischofer
>>>>> >> Sent: Tuesday, January 24, 2017 1:15 AM
>>>>> >> To: Nikhil Agarwal <nikhil.agar...@linaro.org>
>>>>> >> Cc: Kevin Wang <kevin.w...@linaro.com>; lng-odp-forward
>>>>> >> <lng-odp@lists.linaro.org>; Yi He <yi...@linaro.com>
>>>>> >> Subject: Re: [lng-odp] schedule_multi returning tasks from multiple
>>>>> >> queues
>>>>> >>
>>>>> >> Moving this discussion on the ODP mailing list rather than the Internal
>>>>> >> list as that way it will be archived.
>>>>> >>
>>>>> >> The existing ODP controls over scheduling include schedule groups as
>>>>> >> well as queue priorities. The former is a strict requirement (threads 
>>>>> >> can
>>>>> >> only receive events from queues that belong to a matching scheduler 
>>>>> >> group).
>>>>> >> Queues can belong to only a single scheduler group that is set at
>>>>> >> odp_queue_create() time and is fixed for the life of the queue. 
>>>>> >> Threads can
>>>>> >> belong to multiple scheduler groups and may change membership in these
>>>>> >> groups dynamically via the
>>>>> >> odp_schedule_group_join() and odp_schedule_group_leave() APIs.
>>>>> >>
>>>>> >> The latter (queue priority) is advisory. It is expected that in general
>>>>> >> threads will receive events originating on higher-priority queues 
>>>>> >> ahead of
>>>>> >> those on lower-priority queues, but the default scheduler takes other
>>>>> >> factors into consideration to avoid starvation, etc. The "strict 
>>>>> >> priority"
>>>>> >> (SP) scheduler makes priorities strict, so higher priority queues will
>>>>> >> always be scheduled ahead of lower priority queues even at 

Re: [lng-odp] [PATCH] example: add IPv4 fragmentation/reassembly example

2017-01-30 Thread Ola Liljedahl
On 30 January 2017 at 02:28, Bill Fischofer <bill.fischo...@linaro.org>
wrote:

> As the maintainer of the ODP git repo, Maxim has final say on style
> questions for what is accepted into it, so I recommend deferring to
> him on this. This is true of all open source projects, so there's
> nothing new here.
>
I am pointing out that Maxim's knowledge and arguments about the C language
are limited.
"Create/call a separation function" is not always a meaningful alternative
that simplifies complexity.
An inner block with local declarations and a separate function are not
semantically equivalent.
Declaring variables at the top of the function scope could also make some C
language features impossible to use, e.g. use const variables that are
assigned the return value of a function call or some other value that
cannot be computed at the top of the function.
With this knowledge, perhaps the current rules can be improved.


>
> If we want to set up a separate ODP examples repo that can accommodate
> examples written in other languages, or that uses a different set of
> acceptable style rules, then final say on what would go into that repo
> would again be with that repo's maintainer. Actually, such an examples
> repo is probably a good idea as it would be very clear that these
> programs are designed to be compiled and run against many different
> ODP implementations.
>
> It would be great to have dozens of example programs and applications
> that are published this way. Linaro could certainly host such a repo
> to make it clear that it is vendor-neutral, but I suspect we'd want to
> have another maintainer for it as Maxim's plate is already very full.
> Any volunteers?
>
> On Fri, Jan 27, 2017 at 7:51 AM, Ola Liljedahl <ola.liljed...@linaro.org>
> wrote:
> > On 26 January 2017 at 16:56, Maxim Uvarov <maxim.uva...@linaro.org>
> wrote:
> >
> >> On 01/26/17 18:27, Ola Liljedahl wrote:
> >> >
> >> >
> >> > On 26 January 2017 at 15:19, Joe Savage <joe.sav...@arm.com
> >> > <mailto:joe.sav...@arm.com>> wrote:
> >> >
> >> > > >> It will be very helpful if there was some README with
> >> description about
> >> > > >> this app, run environments and some output. So people can
> learn
> >> > > >> something before looking to code.
> >> > > >
> >> > > > I can add one, but I don't think there's really that much to
> >> describe. Since
> >> > > > the example doesn't connect to the network, all that users
> >> really need to
> >> > > > know is that it fragments and reassembles IPv4 packets.
> >> > >
> >> > > you add that description here:
> >> > > ./doc/application-api-guide/examples.dox
> >> > >
> >> > > Description can be about application internals. Like you create
> N
> >> > > workers, use queue with nsize to reassembly packets. Use
> following
> >> > > algorithm.
> >> > >
> >> > > And why this app is not connected to network? I think it will be
> >> very
> >> > > useful if you can pass some pcap file and get pcap on output.
> And
> >> test
> >> > > this program work with other program which does reassembly. That
> >> looks
> >> > > like good proof that it works as expected.
> >> >
> >> > Adding the example to the list in examples.dox seems sensible,
> but I
> >> > think
> >> > the code and comments are probably the best description of the
> >> algorithm
> >> > itself.
> >> >
> >> > As for why it isn't network connected, I wanted to keep the
> example
> >> > somewhat
> >> > bare bones to its purpose. Dealing with a real network connection
> is
> >> > likely
> >> > to add clutter that doesn't really speak to the contents of this
> >> > specific
> >> > example. Anyone wanting to implement this kind of functionality
> >> > themselves
> >> > can simply glean this information from a different example
> focusing
> >> > around
> >> > the packet I/O interface.
> >> >
> >> > > >
> >> > > >> app naming might not be the best.
> >> > > >
> >> > > > Hmm... do you have any other ideas? I didn't want it to be too

Re: [lng-odp] [PATCH] example: add IPv4 fragmentation/reassembly example

2017-01-27 Thread Ola Liljedahl
On 26 January 2017 at 16:56, Maxim Uvarov <maxim.uva...@linaro.org> wrote:

> On 01/26/17 18:27, Ola Liljedahl wrote:
> >
> >
> > On 26 January 2017 at 15:19, Joe Savage <joe.sav...@arm.com
> > <mailto:joe.sav...@arm.com>> wrote:
> >
> > > >> It will be very helpful if there was some README with
> description about
> > > >> this app, run environments and some output. So people can learn
> > > >> something before looking at the code.
> > > >
> > > > I can add one, but I don't think there's really that much to
> describe. Since
> > > > the example doesn't connect to the network, all that users
> really need to
> > > > know is that it fragments and reassembles IPv4 packets.
> > >
> > > you add that description here:
> > > ./doc/application-api-guide/examples.dox
> > >
> > > Description can be about application internals. Like you create N
> > > workers, use queue with nsize to reassemble packets. Use following
> > > algorithm.
> > >
> > > And why this app is not connected to network? I think it will be
> very
> > > useful if you can pass some pcap file and get pcap on output. And
> test
> > > this program work with other program which does reassembly. That
> looks
> > > like good proof that it works as expected.
> >
> > Adding the example to the list in examples.dox seems sensible, but I
> > think
> > the code and comments are probably the best description of the
> algorithm
> > itself.
> >
> > As for why it isn't network connected, I wanted to keep the example
> > somewhat
> > bare bones to its purpose. Dealing with a real network connection is
> > likely
> > to add clutter that doesn't really speak to the contents of this
> > specific
> > example. Anyone wanting to implement this kind of functionality
> > themselves
> > can simply glean this information from a different example focusing
> > around
> > the packet I/O interface.
> >
> > > >
> > > >> app naming might not be the best.
> > > >
> > > > Hmm... do you have any other ideas? I didn't want it to be too
> long, and both
> > > > "fragmentation" and "reassembly" are unfortunately lengthy.
> > > >
> > >
> > > ipfrag or ipv4frag?
> >
> > Ehh, maybe. The fragmentation doesn't really play a huge role here
> > though,
> > the reassembly is really the star of the show. Perhaps ipfragreass
> > or just
> > ipreass?
> >
> >
> > > >> in early examples we defined max workers, but actually it's not
> needed
> > > >> because you can ask odp how many workers are there with 0.
> > > >> I.e. in your code it will be:
> > > >> *num_workers = odp_cpumask_default_worker(cpumask, 0)
> > > >
> > > > I did see this in the documentation, but since the functionality
> was present
> > > > in other examples I thought it might be worthwhile to allow a
> maximum number
> > > > of workers to be set. Happy to remove this to use all available
> CPUs if it
> > > > is desirable though.
> > >
> > > we used MAX_WORKERS before odp_cpumask_default_worker() was
> implemented,
> > > after that we did not rewrite old example. I think it's better to
> have
> > > less defines in code.
> >
> > Ok, will do.
> >
> >
> > > >>> +   /* ODP initialisation */
> > > >>> +   {
> > > >>
> > > >> usually we do not use additional brackets. I think here they
> also need
> > > >> to be removed to match common style.
> > > >
> > > > In that particular instance, I agree, they are unnecessary.
> Would you also
> > > > suggest the removal of these extra scoping brackets in other
> places within
> > > > this same function though? In my opinion, the extra scope they
> provide in the
> > > > other places make the code easier to read and reason about.
> > > >
> > >
> > > yes, in all other places also. Brackets might be separate static
> inline
> > > function.
> >
> > A static function is somethi

Re: [lng-odp] updated power API proposal

2017-01-27 Thread Ola Liljedahl
On 27 January 2017 at 14:22, Ola Liljedahl <ola.liljed...@linaro.org> wrote:

> Anyone with a linaro.org email address should now be able to access the
> document and make comments.
> I am revoking access for m...@holmesfamily.ws
>
Other people can request access as well, I will add them individually.



>
> -- Ola
>
> On 27 January 2017 at 14:06, Sergei Trofimov <sergei.trofi...@arm.com>
> wrote:
>
>> Hi Mike,
>>
>> The document is on Google Doc here:
>>
>> https://docs.google.com/document/d/1I7dmnbHRb1AYpHEibjfVvW1K
>> 2juHM0YB5MmPwS2T_98/
>>
>> Cheers,
>> Sergei
>>
>> On 27 Jan 2017, 08:01:36 -0500, Mike Holmes wrote:
>> > Hi Sergei
>> >
>> > Do you have a place to store this we can link to ? Then we use this as a
>> > live document perhaps a google doc ?
>> > Its not the end of the world but it would allow people to comment
>> > interactively and give feedback more easily rather than you collate
>> > comments and post new docs.
>> >
>> > Mike
>> >
>> > On 27 January 2017 at 04:11, Sergei Trofimov <sergei.trofi...@arm.com>
>> > wrote:
>> >
>> > > Hello all,
>> > >
>> > > Thank you for reviewing the proposed power API patches in the last
>> > > week's call and for all the feedback. It was clear that the proposed
>> > > design was not fit for purpose, and that I have skipped a step in
>> > > sending out the patches without sufficient prior discussion of the
>> > > assumptions that went into them.
>> > >
>> > > Please find attached a document with the updated proposal wherein I
>> > > have attempted to address the feedback from the call. This proposal is
>> > > significantly different from the previous one:
>> > >
>> > > 1. The governor implementation is assumed to encapsulate the knowledge
>> > >of power management capabilities of the platform, and so the
>> details
>> > >such as available power domains are no longer exposed in the API.
>> > > 2. The application is assumed to encapsulate the knowledge of its QoS
>> > >requirements, and so a way is provided for the application to
>> notify
>> > >the governor if it is not meeting the requirements.
>> > >
>> > > This document is also available on Google Docs:
>> > > https://docs.google.com/document/d/1I7dmnbHRb1AYpHEibjfVvW1K2juHM
>> > > 0YB5MmPwS2T_98
>> > >
>> > > I very much look forward to your comments.
>> > >
>> > > Cheers,
>> > > Sergei
>> > >
>> >
>> >
>> >
>> > --
>> > Mike Holmes
>> > Program Manager - Linaro Networking Group
>> > Linaro.org <http://www.linaro.org/> *│ *Open source software for ARM
>> SoCs
>> > "Work should be fun and collaborative, the rest follows"
>>
>
>


Re: [lng-odp] updated power API proposal

2017-01-27 Thread Ola Liljedahl
Anyone with a linaro.org email address should now be able to access the
document and make comments.
I am revoking access for m...@holmesfamily.ws

-- Ola

On 27 January 2017 at 14:06, Sergei Trofimov 
wrote:

> Hi Mike,
>
> The document is on Google Doc here:
>
> https://docs.google.com/document/d/1I7dmnbHRb1AYpHEibjfVvW1K2juHM
> 0YB5MmPwS2T_98/
>
> Cheers,
> Sergei
>
> On 27 Jan 2017, 08:01:36 -0500, Mike Holmes wrote:
> > Hi Sergei
> >
> > Do you have a place to store this we can link to ? Then we use this as a
> > live document perhaps a google doc ?
> > Its not the end of the world but it would allow people to comment
> > interactively and give feedback more easily rather than you collate
> > comments and post new docs.
> >
> > Mike
> >
> > On 27 January 2017 at 04:11, Sergei Trofimov 
> > wrote:
> >
> > > Hello all,
> > >
> > > Thank you for reviewing the proposed power API patches in the last
> > > week's call and for all the feedback. It was clear that the proposed
> > > design was not fit for purpose, and that I have skipped a step in
> > > sending out the patches without sufficient prior discussion of the
> > > assumptions that went into them.
> > >
> > > Please find attached a document with the updated proposal wherein I
> > > have attempted to address the feedback from the call. This proposal is
> > > significantly different from the previous one:
> > >
> > > 1. The governor implementation is assumed to encapsulate the knowledge
> > >of power management capabilities of the platform, and so the details
> > >such as available power domains are no longer exposed in the API.
> > > 2. The application is assumed to encapsulate the knowledge of its QoS
> > >requirements, and so a way is provided for the application to notify
> > >the governor if it is not meeting the requirements.
> > >
> > > This document is also available on Google Docs:
> > > https://docs.google.com/document/d/1I7dmnbHRb1AYpHEibjfVvW1K2juHM
> > > 0YB5MmPwS2T_98
> > >
> > > I very much look forward to your comments.
> > >
> > > Cheers,
> > > Sergei
> > >
> >
> >
> >
> > --
> > Mike Holmes
> > Program Manager - Linaro Networking Group
> > Linaro.org  *│ *Open source software for ARM
> SoCs
> > "Work should be fun and collaborative, the rest follows"
>


Re: [lng-odp] [PATCH] example: add IPv4 fragmentation/reassembly example

2017-01-26 Thread Ola Liljedahl
On 26 January 2017 at 15:19, Joe Savage  wrote:

> > >> It will be very helpful if there was some README with description about
> > >> this app, run environments and some output. So people can learn
> > >> something before looking at the code.
> > >
> > > I can add one, but I don't think there's really that much to describe.
> Since
> > > the example doesn't connect to the network, all that users really need
> to
> > > know is that it fragments and reassembles IPv4 packets.
> >
> > you add that description here:
> > ./doc/application-api-guide/examples.dox
> >
> > Description can be about application internals. Like you create N
> > workers, use queue with nsize to reassemble packets. Use following
> > algorithm.
> >
> > And why this app is not connected to network? I think it will be very
> > useful if you can pass some pcap file and get pcap on output. And test
> > this program work with other program which does reassembly. That looks
> > like good proof that it works as expected.
>
> Adding the example to the list in examples.dox seems sensible, but I think
> the code and comments are probably the best description of the algorithm
> itself.
>
> As for why it isn't network connected, I wanted to keep the example
> somewhat
> bare bones to its purpose. Dealing with a real network connection is likely
> to add clutter that doesn't really speak to the contents of this specific
> example. Anyone wanting to implement this kind of functionality themselves
> can simply glean this information from a different example focusing around
> the packet I/O interface.
>
> > >
> > >> app naming might not be the best.
> > >
> > > Hmm... do you have any other ideas? I didn't want it to be too long,
> and both
> > > "fragmentation" and "reassembly" are unfortunately lengthy.
> > >
> >
> > ipfrag or ipv4frag?
>
> Ehh, maybe. The fragmentation doesn't really play a huge role here though,
> the reassembly is really the star of the show. Perhaps ipfragreass or just
> ipreass?
>
>
> > >> in early examples we defined max workers, but actually it's not needed
> > >> because you can ask odp how many workers are there with 0.
> > >> I.e. in your code it will be:
> > >> *num_workers = odp_cpumask_default_worker(cpumask, 0)
> > >
> > > I did see this in the documentation, but since the functionality was
> present
> > > in other examples I thought it might be worthwhile to allow a maximum
> number
> > > of workers to be set. Happy to remove this to use all available CPUs
> if it
> > > is desirable though.
> >
> > we used MAX_WORKERS before odp_cpumask_default_worker() was implemented,
> > after that we did not rewrite old example. I think it's better to have
> > less defines in code.
>
> Ok, will do.
>
>
> > >>> +   /* ODP initialisation */
> > >>> +   {
> > >>
> > >> usually we do not use additional brackets. I think here they also need
> > >> to be removed to match common style.
> > >
> > > In that particular instance, I agree, they are unnecessary. Would you
> also
> > > suggest the removal of these extra scoping brackets in other places
> within
> > > this same function though? In my opinion, the extra scope they provide
> in the
> > > other places make the code easier to read and reason about.
> > >
> >
> > yes, in all other places also. Brackets might be separate static inline
> > function.
>
A static function is something very different from an inner scope in a
function.
The inner scope can directly access and modify variables in outer scopes.
A static function would have to take all of these as input and output
parameters. Messy.
Declaring all function variables at the top of the function regardless of
where they are (actually) used is just confusing and increases the risk for
errors. Why not use a standard feature of the language that is there to
help the programmer?
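
A small sketch of that difference (illustrative only, not code from the
patch): the inner block below reads and updates the outer variables
directly, while an equivalent static function would need them all passed
as in/out parameters:

#include <stdio.h>

/* Hypothetical input source, for illustration only */
static int get_batch(void)
{
        return 42;
}

int main(void)
{
        int count = 0;
        long sum = 0;

        {
                const int batch = get_batch();

                /* The inner block reads and updates the outer
                 * variables directly */
                count += 1;
                sum += batch;
        }

        /*
         * An equivalent static helper would instead need pointer
         * plumbing for every outer variable it touches, e.g.:
         *     static void accumulate(int *count, long *sum, int batch);
         */
        printf("%d batch(es), sum %ld\n", count, sum);
        return 0;
}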

Give up, Maxim, or we begin bombing in five minutes.


> I'm of the opinion that separate functions (inline or otherwise) are too
> heavy a tool for this purpose. If the goal is to introduce some additional
> scoping within a fairly uncomplicated function body, and not to allow for
> reuse or more granular abstractions, extra braces are the ideal solution.
> If these really are considered inconsistent for whatever reason though, I
> can
> just dump all the nicely scoped variables into one big list, making the
> code
> less readable should it be desired.
>
> > >>> +   for (int i = 0; i < FRAGLISTS; ++i)
> > >>> +   init_fraglist([i]);
> > >>> +
> > >>
> > >> no declaration for int inside loop.
>
Well this is an example, not part of the ODP implementation. Examples
should be able to use whatever language or standard they want. If someone
contributes an example of using ODP from Erlang, will we respond with "Not
welcome, only a subset of C89 or C99 allowed!" ? Or more realistically, an
example written in some C++ standard (perhaps C++11).


> > >
> > > It might be worth adding this one to the checkpatch script or to the
> compiler
> > > flags as a 

Re: [lng-odp] schedule_multi returning tasks from multiple queues

2017-01-25 Thread Ola Liljedahl
On 25 January 2017 at 13:39, Bill Fischofer <bill.fischo...@linaro.org>
wrote:

> On Wed, Jan 25, 2017 at 4:22 AM, Ola Liljedahl <ola.liljed...@linaro.org>
> wrote:
> >
> >
> > On 24 January 2017 at 15:30, Nikhil Agarwal <nikhil.agar...@nxp.com>
> wrote:
> >>
> >>
> >>
> >> -Original Message-
> >> From: lng-odp [mailto:lng-odp-boun...@lists.linaro.org] On Behalf Of
> Bill
> >> Fischofer
> >> Sent: Tuesday, January 24, 2017 1:15 AM
> >> To: Nikhil Agarwal <nikhil.agar...@linaro.org>
> >> Cc: Kevin Wang <kevin.w...@linaro.com>; lng-odp-forward
> >> <lng-odp@lists.linaro.org>; Yi He <yi...@linaro.com>
> >> Subject: Re: [lng-odp] schedule_multi returning tasks from multiple
> queues
> >>
> >> Moving this discussion on the ODP mailing list rather than the Internal
> >> list as that way it will be archived.
> >>
> >> The existing ODP controls over scheduling include schedule groups as
> well
> >> as queue priorities. The former is a strict requirement (threads can
> only
> >> receive events from queues that belong to a matching scheduler group).
> >> Queues can belong to only a single scheduler group that is set at
> >> odp_queue_create() time and is fixed for the life of the queue. Threads
> can
> >> belong to multiple scheduler groups and may change membership in these
> >> groups dynamically via the
> >> odp_schedule_group_join() and odp_schedule_group_leave() APIs.
> >>
> >> The latter (queue priority) is advisory. It is expected that in general
> >> threads will receive events originating on higher-priority queues ahead
> of
> >> those on lower-priority queues, but the default scheduler takes other
> >> factors into consideration to avoid starvation, etc. The "strict
> priority"
> >> (SP) scheduler makes priorities strict, so higher priority queues will
> >> always be scheduled ahead of lower priority queues even at the risk of
> >> starvation.
> >>
> >> What other scheduling controls are needed / desired?
> >>
> >> With regard to receiving events from multiple queues in response to
> >> odp_schedule_multi() there are several points that need clarification:
> >>
> >> 1. What is the use case for this capability? How many different
> >> events/queues would one expect to be eligible to be returned in a single
> >> call? Presumably this is a relatively small number (< 10). How does this
> >> compare with having multiple threads running in parallel servicing
> events
> >> from individual queues?
> >>
> >> This will give the benefit of burst processing and improve performance.
> >> Generally we do burst processing of 8 packets. This will reduce 16
> >> hardware interactions (in forwarding scenario, 8 RX and 8 TX) to 2 and
> gives
> >> significant boost in performance. Even the DPDK event scheduler gives
> such
> >> flexibility.
> >
> > I would like to see benchmarks (numbers!) that show the performance
> > difference between the approach favored by Petri (i.e. return only events
> > from one queue at a time, let the scheduler sit on any remaining events
> and
> > return (a subset of) them later, similarly the egress side could also
> buffer
> > outgoing events) compared to Nikhil's approach.
> >
> > Does it actually hurt performance to hide this HW characteristics in the
> > scheduler SW compared to handling it in the application SW?
>
> The argument advanced yesterday is that if the HW RX burst is hidden
> in a SW component of the scheduler this will mean that the effects of
> bursting cannot propagate to the TX side since the application would
> presumably receive the first set of packets, process them, and issue
> the TX calls for them before making another odp_schedule() call
> allowing it to see the next set of RX packets. I guess I still don't
> appreciate under what conditions an application will burst packets
> originating from different queues as part of a single TX operation.
>
That's why I wrote that ODP could buffer outgoing packets (in SW) until the
next forced HW interaction (e.g. the scheduler SW needs to ask the
scheduler HW for new packets). Then ODP would transmit the buffered packets
(possibly destined for different (pktout) queues).

This way the scheduler (HW+SW) controls the order in which events are
processed and any potential race condition due to use of ordered locks
could be avoided.
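
To make the alternative concrete, here is a minimal sketch of such a
forwarding loop using existing ODP calls; BURST, dst_queue and the
processing step are hypothetical application details:

#include <odp_api.h>

#define BURST 8

static void worker_loop(odp_pktout_queue_t dst_queue)
{
        odp_event_t evs[BURST];
        odp_packet_t pkts[BURST];

        for (;;) {
                odp_queue_t src;
                int i, num;

                /* One scheduler interaction for up to BURST events */
                num = odp_schedule_multi(&src, ODP_SCHED_WAIT, evs, BURST);

                for (i = 0; i < num; i++)
                        pkts[i] = odp_packet_from_event(evs[i]);

                /* ... application processing of pkts[0..num-1] ... */

                /* One TX interaction for the whole burst; a complete
                 * application would check the return value and free
                 * any packets that were not sent */
                odp_pktout_send(dst_queue, pkts, num);
        }
}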



> >
> >
> >>
> >> 2. While semantically

Re: [lng-odp] schedule_multi returning tasks from multiple queues

2017-01-25 Thread Ola Liljedahl
On 25 January 2017 at 06:34, Honnappa Nagarahalli <
honnappa.nagaraha...@linaro.org> wrote:

> On 24 January 2017 at 19:16, Bill Fischofer 
> wrote:
> > On Tue, Jan 24, 2017 at 8:30 AM, Nikhil Agarwal 
> wrote:
> >>
> >>
> >> -Original Message-
> >> From: lng-odp [mailto:lng-odp-boun...@lists.linaro.org] On Behalf Of
> Bill Fischofer
> >> Sent: Tuesday, January 24, 2017 1:15 AM
> >> To: Nikhil Agarwal 
> >> Cc: Kevin Wang ; lng-odp-forward <
> lng-odp@lists.linaro.org>; Yi He 
> >> Subject: Re: [lng-odp] schedule_multi returning tasks from multiple
> queues
> >>
> >> Moving this discussion on the ODP mailing list rather than the Internal
> list as that way it will be archived.
> >>
> >> The existing ODP controls over scheduling include schedule groups as
> well as queue priorities. The former is a strict requirement (threads can
> only receive events from queues that belong to a matching scheduler group).
> Queues can belong to only a single scheduler group that is set at
> odp_queue_create() time and is fixed for the life of the queue. Threads can
> belong to multiple scheduler groups and may change membership in these
> groups dynamically via the
> >> odp_schedule_group_join() and odp_schedule_group_leave() APIs.
> >>
> >> The latter (queue priority) is advisory. It is expected that in general
> threads will receive events originating on higher-priority queues ahead of
> those on lower-priority queues, but the default scheduler takes other
> factors into consideration to avoid starvation, etc. The "strict priority"
> (SP) scheduler makes priorities strict, so higher priority queues will
> always be scheduled ahead of lower priority queues even at the risk of
> starvation.
> >>
> >> What other scheduling controls are needed / desired?
> >>
> >> With regard to receiving events from multiple queues in response to
> >> odp_schedule_multi() there are several points that need clarification:
> >>
> >> 1. What is the use case for this capability? How many different
> events/queues would one expect to be eligible to be returned in a single
> call? Presumably this is a relatively small number (< 10). How does this
> compare with having multiple threads running in parallel servicing events
> from individual queues?
> >>
> >> This will give the benefit of burst processing and improve performance.
> Generally we do burst processing of 8 packets. This will reduce 16
> hardware interactions (in forwarding scenario, 8 RX and 8 TX) to 2 and
> gives significant boost in performance. Even the DPDK event scheduler gives
> such flexibility.
> >>
> >> 2. While semantically this would work for parallel queues, since the
> scheduler provides no synchronization context for events originating from
> parallel queues, is it acceptable / useful to have this restriction in the
> API?  If not, then it's not obvious how multiple atomic or ordered contexts
> are expected to be maintained in any coherent fashion. This would seem to
> add significant complexity to any scheduler design, so we'd need a
> convincing use case to justify this.
> >>
> >> IMO,  it should not be restricted to parallel queues only. It should be
> available to all queue types. For release context APIs we should pass
> event(or list of events) as arguments to cater to  context release
> requirements.
> >> Even in the current APIs, if the application wants to release the ordered context for a
> particular event (so that all later events can be processed), there is no
> API for that. DPDK takes care of the problem in similar way.
> >
> > As we discussed during today's call, if we wish to include atomic and
> > ordered contexts, there are two issues that need to be addressed:
> >
> > 1. Today when running under a single context it is unambiguous how new
> > packets allocated by odp_packet_alloc() within an atomic or ordered
> > context are handled. If we now support the notion of the caller
> > simultaneously holding multiple such contexts, then to which context
> > are new packets assigned since a packet can only be associated with a
> > single such context.
> >
> > 2. When multiple callers hold multiple ordered contexts it's not
> > obvious how they can safely use odp_schedule_order_lock() calls, even
> > when this call is extended to name the ordered context explicitly. To
> > see this consider the following:
> >
> > Thread A calls odp_schedule_multi() and receives three packets
> > belonging to ordered contexts C1 and C2. Suppose the Packets this
> > consists of two packets from C1 in order 3 and 4. and one packet from
> > C2 in order 6. Thread B also calls odp_schedule_multi() and receives
> > three other packets: Two from C2 in order 7 and 8, and one from C1 in
> > order 2.  Both threads decide to process the context they hold the
> > most packets from first and they require ordered critical sections. So
> > Thread A issues odp_schedule_order_lock() for C1 

Re: [lng-odp] 32-bit support in examples

2017-01-24 Thread Ola Liljedahl
On 24 January 2017 at 10:53, Ola Liljedahl <ola.liljed...@linaro.org> wrote:

>
>
> On 20 January 2017 at 13:15, Savolainen, Petri (Nokia - FI/Espoo) <
> petri.savolai...@nokia-bell-labs.com> wrote:
>
>>
>>
>> > -Original Message-
>> > From: Joe Savage [mailto:joe.sav...@arm.com]
>> > Sent: Friday, January 20, 2017 1:51 PM
>> > To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia-bell-
>> > labs.com>; Maxim Uvarov <maxim.uva...@linaro.org>; lng-
>> > o...@lists.linaro.org; Bill Fischofer <bill.fischo...@linaro.org>
>> > Cc: nd <n...@arm.com>
>> > Subject: Re: [lng-odp] 32-bit support in examples
>> >
>> > > Agree with Maxim. In which way is the application not 32-bit compliant?
>> >
>> > It uses 128-bit atomics, and so is really designed for execution on
>> 64-bit
>> > machines. It is possible to provide lockless 32-bit support in this
>> case,
>> > though, and I have an implementation that does so. Since the pointer
>> size
>> > is
>> > halved and there is a pointer in the 128-bit struct, I just have to
>> squash
>> > a
>> > few of the other fields down (managing them carefully) so that 64-bit
>> > atomics
>> > can be used instead.
>>
>> Unfortunately, ODP atomics API does not support 128 bit atomics - at
>> least currently. So, your example could not use those anyway. Not all
>> 64-bit CPUs have 128 bit atomic instructions.
>>
>> >
>> > On reflection, I think that providing 32-bit support is probably
>> > worthwhile
>> > here, so I will do so. It does add a little complexity to the code, but
>> > it's
>> > not actually that much, and there are clear benefits from having the
>> > example
>> > be better supported on different platforms.
>> >
>> > I do think that having a place for 64-bit only examples in the future
>> > (e.g.
>> > an "example_64" directory as Bill outlined) might be useful though. It
>> > isn't
>> > always so easy to add 32-bit support.
>>
>> Good. ODP provides 32 and 64 bit atomics (also on 32 bit CPUs), so you
>> can still utilize those. In addition, synchronization / critical sections
>> should touch only a small portion of the application code base and
>> preferably in a modular way (inside enqueue() / dequeue(), push() / pop(),
>> etc functions).
>>
> This code can't use the ODP atomics (at least not as defined today). We
> are not doing atomic operations on e.g. integer counters. The (preferably
> lock-free) atomic operations are done on a structure with multiple fields.
>
Well I need to correct myself. Using unions, we can use atomics that
operate on e.g. __int128 operands. This is how we utilise ARM load/store
exclusives (ldxp/stxp). But the actual operand is not a 128-bit scalar.

Should we decide to support 128-bit atomics in ODP, we need CAS and SWAP
but nothing much else (e.g. we are *not* using 128-bit counters). Load is
preferably done non-atomically as an atomic 128-bit load is rather
expensive (needs to be implemented using CAS).
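
A minimal sketch of the union technique, assuming a compiler with
__int128 and the __atomic built-ins on a 64-bit target (the field names
are illustrative):

#include <stdint.h>
#include <stdbool.h>

typedef union {
        struct {
                void *ptr;      /* e.g. a head pointer */
                uintptr_t tag;  /* e.g. an ABA-protection counter */
        } s;
        __int128 ui;
} pair_t;

/* CAS on the 128-bit view of the structure; lock-free where the target
 * provides cmpxchg16b (x86-64 with -mcx16) or LDXP/STXP (AArch64) */
static bool pair_cas(pair_t *loc, pair_t *exp, pair_t des)
{
        return __atomic_compare_exchange_n(&loc->ui, &exp->ui, des.ui,
                                           false /* strong */,
                                           __ATOMIC_ACQ_REL,
                                           __ATOMIC_ACQUIRE);
}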



>
>
>>
>> -Petri
>>
>>
>>
>


Re: [lng-odp] 32-bit support in examples

2017-01-24 Thread Ola Liljedahl
On 23 January 2017 at 09:50, Savolainen, Petri (Nokia - FI/Espoo) <
petri.savolai...@nokia-bell-labs.com> wrote:

>
>
> > -Original Message-
> > From: Brian Brooks [mailto:brian.bro...@linaro.org]
> > Sent: Friday, January 20, 2017 7:47 PM
> > To: Francois Ozog 
> > Cc: Bill Fischofer ; Joe Savage
> > ; Maxim Uvarov ;
> Savolainen,
> > Petri (Nokia - FI/Espoo) ; lng-
> > o...@lists.linaro.org; nd 
> > Subject: Re: [lng-odp] 32-bit support in examples
> >
> > CAS is a universal primitive in the sense that you can construct those
> > RMW ops by speculatively computing the updated value and the CAS to
> > atomically update the value (in a retry loop).  LL/SC also universal,
> > but different behavior.  Both are not the same as an atomic op
> > performed deeper in the memory system.
> >
> > To Petri's point about ODP not supporting 128b atomics, which compiler
> > does not support the __atomic_xxx built-ins or the __int128 128b
> > variable?  This has impact on portability and should be explicitly
> > known; is it the microblaze compiler?
>
>
>
> Any atomics can be emulated in SW (using compiler built-ins or locks
> directly). The point here is the missing HW support:
>
Atomic operations can be emulated/implemented but not lock-free behaviour.
GCC does provide a lock-based implementation of e.g. 128-bit atomics in
libatomic so functionally all targets should support 128-bit atomics.


>  * E.g. MIPS, Power, ARMv7 do not have 128 bit CAS
>  * 128 bit fetch-and-add is not supported in any of the architectures
>
MIPS64r6 has Load Linked DoubleWord Paired/Store Conditional DoubleWord
Paired (LLDP/SCDP) so identical to ARMv8/AArch64. This is all you need.


>
> We need to ensure on any operations added that those can be implemented
> efficiently on most of the targets.
>
I think we should let 32-bit platforms wither (e.g. suffer with non-ideal
performance). How relevant are they? Why should we be limited (in an ODP
example) in what we can do by targets that are less and less relevant?


>
> -Petri
>
>
> >
> > On Fri, Jan 20, 2017 at 7:36 AM, Francois Ozog  >
> > wrote:
> > > well, yes. But that is the only atomic operation supported. No add,
> sub,
> > > inc, xadd, bit operations
> > >
> > > Le ven. 20 janv. 2017 à 14:31, Joe Savage  a
> écrit :
> > >
> > >> > I wonder what processor supports 128 bits atomics. As far as I know
> > Intel
> > >>
> > >> > does not support it. Lock prefix is not allowed on SSE instructions.
> > >>
> > >>
> > >>
> > >> Actually, Intel does support them through a locked cmpxchg16b. And
> > ARMv8
> > >>
> > >> through load exclusive pair and store exclusive pair.
> > >>
> > >>
>


Re: [lng-odp] 32-bit support in examples

2017-01-24 Thread Ola Liljedahl
On 20 January 2017 at 14:36, Francois Ozog  wrote:

> well, yes. But that is the only atomic operation supported. No add, sub,
> inc, xadd, bit operations
>
Using lock cmpxchg16b (i.e. atomic CAS), GCC implements all of the __atomic
operations on 128-bit operands.
Intel cheats a little bit because 128-bit __atomic_load_n() is also
implemented using cmpxchg16b and so does a write to the location.
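
For illustration, this is the general shape of building any 128-bit
read-modify-write from CAS alone in a retry loop (a sketch assuming
__int128 and the __atomic built-ins):

#include <stdint.h>

static __int128 fetch_add_128(__int128 *loc, __int128 inc)
{
        __int128 old = __atomic_load_n(loc, __ATOMIC_RELAXED);

        /* Speculatively compute old + inc; on failure the CAS reloads
         * 'old' with the current value and we retry */
        while (!__atomic_compare_exchange_n(loc, &old, old + inc,
                                            true /* weak */,
                                            __ATOMIC_ACQ_REL,
                                            __ATOMIC_RELAXED))
                ;
        return old;
}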


>
> Le ven. 20 janv. 2017 à 14:31, Joe Savage  a écrit :
>
> > > I wonder what processor supports 128 bits atomics. As far as I know
> Intel
> >
> > > does not support it. Lock prefix is not allowed on SSE instructions.
> >
> >
> >
> > Actually, Intel does support them through a locked cmpxchg16b. And ARMv8
> >
> > through load exclusive pair and store exclusive pair.
> >
> >
>


Re: [lng-odp] 32-bit support in examples

2017-01-24 Thread Ola Liljedahl
On 20 January 2017 at 13:15, Savolainen, Petri (Nokia - FI/Espoo) <
petri.savolai...@nokia-bell-labs.com> wrote:

>
>
> > -Original Message-
> > From: Joe Savage [mailto:joe.sav...@arm.com]
> > Sent: Friday, January 20, 2017 1:51 PM
> > To: Savolainen, Petri (Nokia - FI/Espoo)  > labs.com>; Maxim Uvarov ; lng-
> > o...@lists.linaro.org; Bill Fischofer 
> > Cc: nd 
> > Subject: Re: [lng-odp] 32-bit support in examples
> >
> > > Agree with Maxim. In which way is the application not 32-bit compliant?
> >
> > It uses 128-bit atomics, and so is really designed for execution on
> 64-bit
> > machines. It is possible to provide lockless 32-bit support in this case,
> > though, and I have an implementation that does so. Since the pointer size
> > is
> > halved and there is a pointer in the 128-bit struct, I just have to
> squash
> > a
> > few of the other fields down (managing them carefully) so that 64-bit
> > atomics
> > can be used instead.
>
> Unfortunately, ODP atomics API does not support 128 bit atomics - at least
> currently. So, your example could not use those anyway. Not all 64-bit CPUs
> have 128 bit atomic instructions.
>
> >
> > On reflection, I think that providing 32-bit support is probably
> > worthwhile
> > here, so I will do so. It does add a little complexity to the code, but
> > it's
> > not actually that much, and there are clear benefits from having the
> > example
> > be better supported on different platforms.
> >
> > I do think that having a place for 64-bit only examples in the future
> > (e.g.
> > an "example_64" directory as Bill outlined) might be useful though. It
> > isn't
> > always so easy to add 32-bit support.
>
> Good. ODP provides 32 and 64 bit atomics (also on 32 bit CPUs), so you can
> still utilize those. In addition, synchronization / critical sections
> should touch only a small portion of the application code base and
> preferably in a modular way (inside enqueue() / dequeue(), push() / pop(),
> etc functions).
>
This code can't use the ODP atomics (at least not as defined today). We are
not doing atomic operations on e.g. integer counters. The (preferably
lock-free) atomic operations are done on a structure with multiple fields.


>
> -Petri
>
>
>


Re: [lng-odp] [API-NEXT PATCHv4 0/5] Add Packet Splice/Reference APIs

2016-10-18 Thread Ola Liljedahl
On 17 October 2016 at 16:34, Bill Fischofer <bill.fischo...@linaro.org>
wrote:

>
>
> On Mon, Oct 17, 2016 at 8:54 AM, Ola Liljedahl <ola.liljed...@linaro.org>
> wrote:
>
>> On 10 October 2016 at 17:50, Bill Fischofer <bill.fischo...@linaro.org>
>> wrote:
>>
>>> This patch adds support for packet references and splices following
>>> discussions at LAS16 on this subject.
>>>
>>> I've changed things around from Petri's original proposal by splitting
>>> this
>>> into two separate APIs: odp_packet_splice() and odp_packet_ref(), where
>>> the
>>> latter is just a splice of a zero-length header on to a base packet. The
>>> various odp packet manipulation APIs have also been enhanced to behave
>>> sensibly when presented with a spliced packet as input. Reference counts
>>> are
>>> used to enable odp_packet_free() to not free a packet until all splices
>>> based
>>> on it are also freed.
>>>
>> Seems OK.
>>
>>
>>>
>>> Also added are two new APIs for working with spliced packets as these
>>> seem
>>> necessary for completeness:
>>>
>>> - odp_packet_is_a_splice() tells whether an input packet is a splice,
>>> and if
>>> so how many spliced packets it contains
>>>
>>> - odp_packet_is_spliced() tells whether any splices have been created on
>>> this
>>> packet, and if so how many.
>>>
>> Is there any conceptual difference between the base packet and any
>> splices?
>>
>> Can you add a splice to a splice?
>>
>
> Architecturally, yes. Compound splices of arbitrary depth can be
> accommodated; however, not every implementation may be able to support
> this. I'd expect us to need to add some odp_packet_capability() info to
> allow implementation limits to be communicated to applications, but we need
> to specify what sort of things we want to be able to limit this way.
>
>
>>
>>
>>>
>>> Note that there is no odp_packet_unsplice() API. To remove a splice from
>>> a
>>> base packet currently requres that the splice be freed via an
>>> odp_packet_free() call. We should discuss and decide if such an API is
>>> warranted for symmetry.
>>>
>> As long as odp_packet_free() respects any additional references (to
>> segments)
>> caused by splicing, I don't think we need any free splice call. How would
>> it be
>> different from odp_packet_free()?
>>
>
> After splicing the header packet becomes the handle for the resulting
> splice, while the handle for the base packet is still valid. An unsplice
> operation would separate these so that what was the splice handle becomes
> just a handle to the header, which can then still be used. Not sure if
> there is a use case for this. It would basically be an "undo" on the splice.
>
Aha.


>
>
>>
>> What happens when you truncate the tail? E.g. of the base packet?
>> Does this affect the splices? (I prefer not).
>>
>
> Yes. After a successful splice, all offsets beyond the splice point are
> shared with all other splices so any change to that part of the splice is
> visible to all other splices on the same base packet. For this reason it's
> suggested that the shared portion be treated as read only. However, if the
> application changes the base packet or an offset in a splice that is
> visible to other splices, then it's the application responsibility to
> ensure that proper synchronization is performed to avoid unpredictable
> behavior.
>
Is it really unpredictable?
If you change packet data through one handle, all other handles referring
to the same segment should also see this change.

An implementation that implements splicing through copying will not see
such changes. So perhaps this is implementation-dependent behaviour.



>
>
>>
>> To use this splice API for fragmentation, you want to be able to add
>> splices with new
>> L2/IP headers at suitable places (at each MTU point) and then truncate
>> the base packet
>> so that only the just-added splice refers to the data beyond the new end
>> of the base
>> packet.
>>
>
> That's not a use case that's currently supported as a splice is created
> with an offset only rather than an offset and length, so splices go from
> the specified offset through the end of the base packet.
>
That's OK. That's what I want. What I don't want is that if I truncate the
length of the packet through one handle, that affects the other
handles. Is that behaviour really intrinsic to the splice API definition?


> If we 

Re: [lng-odp] [API-NEXT PATCHv4 0/5] Add Packet Splice/Reference APIs

2016-10-17 Thread Ola Liljedahl
On 10 October 2016 at 17:50, Bill Fischofer 
wrote:

> This patch adds support for packet references and splices following
> discussions at LAS16 on this subject.
>
> I've changed things around from Petri's original proposal by splitting this
> into two separate APIs: odp_packet_splice() and odp_packet_ref(), where the
> latter is just a splice of a zero-length header on to a base packet. The
> various odp packet manipulation APIs have also been enhanced to behave
> sensibly when presented with a spliced packet as input. Reference counts
> are
> used to enable odp_packet_free() to not free a packet until all splices
> based
> on it are also freed.
>
Seems OK.


>
> Also added are two new APIs for working with spliced packets as these seem
> necessary for completeness:
>
> - odp_packet_is_a_splice() tells whether an input packet is a splice, and
> if
> so how many spliced packets it contains
>
> - odp_packet_is_spliced() tells whether any splices have been created on
> this
> packet, and if so how many.
>
Is there any conceptual difference between the base packet and any splices?

Can you add a splice to a splice?


>
> Note that there is no odp_packet_unsplice() API. To remove a splice from a
> base packet currently requires that the splice be freed via an
> odp_packet_free() call. We should discuss and decide if such an API is
> warranted for symmetry.
>
As long as odp_packet_free() respects any additional references (to
segments)
caused by splicing, I don't think we need any free splice call. How would
it be
different from odp_packet_free()?

What happens when you truncate the tail? E.g. of the base packet?
Does this affect the splices? (I prefer not).

To use this splice API for fragmentation, you want to be able to add
splices with new
L2/IP headers at suitable places (at each MTU point) and then truncate the
base packet
so that only the just added splice refers the the data beyond the new end
of the base
packet.

It would be useful to use the API for reassembly. Splice fragment N onto
fragment N+1
(after the IP header of frag N+1) and free fragment N+1. Repeat for all
fragments from
last to first. The splice for the first fragment will then refer to the
complete reassembled
datagram.

It seems like these use cases should be supported by this API if the
specification is
written with this in mind.
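
To make the fragmentation use case concrete, a rough sketch in pseudo-C.
It assumes a proposal-style signature odp_packet_t
odp_packet_splice(odp_packet_t hdr, odp_packet_t base, uint32_t offset) -
splicing 'hdr' onto 'base' at 'offset', through the end of 'base' - and a
hypothetical make_frag_header() helper; the exact API is of course what
is being discussed here:

static int fragment(odp_packet_t base, uint32_t payload_start,
                    uint32_t mtu, odp_packet_t frags[], int max_frags)
{
        uint32_t off = payload_start;
        int n = 0;

        while (off < odp_packet_len(base) && n < max_frags) {
                /* Hypothetical helper: build a fresh L2/IP header for
                 * the fragment starting at 'off' */
                odp_packet_t hdr = make_frag_header(base, off, mtu);

                /* Assumed proposal-style call: the splice shares the
                 * payload from 'off' to the end of 'base' */
                frags[n++] = odp_packet_splice(hdr, base, off);
                off += mtu;
        }
        /* Finally truncate 'base' at the first MTU point; whether that
         * may affect the splices is exactly the semantic question
         * raised above */
        return n;
}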

I will check the later patches as well. We have a bunch of use cases defined
here:
https://docs.google.com/document/d/1r4olxr39fHvgFACQp_1RL_mh9736no3TFaVM7_XkUtM/edit



>
> Changes for v4:
> - Add negative tests to validation test suite
> - Fix implementation bugs relating to negative tests
>
> Changes for v3:
> - Bug fixes (detected by the validation tests)
> - Addition of validation tests for these new APIs
> - Diagrams and User Guide documentation for these new APIs
>
> Changes for v2:
> - Bug fixes
> - Enhance ODP packet segment APIs to behave properly with spliced packets
>
> Bill Fischofer (5):
>   api: packet: add support for packet splices and references
>   linux-generic: packet: implement splice/reference apis
>   validation: packet: add packet splice/reference tests
>   doc: images: add images for packet splice/reference documentation
>   doc: userguide: add user documentation for packet splice/reference
> APIs
>
>  doc/images/doublesplice.svg|  67 ++
>  doc/images/pktref.svg  |  49 +
>  doc/images/splice.svg  |  64 ++
>  doc/users-guide/users-guide-packet.adoc| 118 ++
>  include/odp/api/spec/packet.h  | 103 +
>  .../linux-generic/include/odp_packet_internal.h|  54 -
>  platform/linux-generic/odp_packet.c| 241
> ++---
>  test/common_plat/validation/api/packet/packet.c| 176 +++
>  test/common_plat/validation/api/packet/packet.h|   1 +
>  9 files changed, 836 insertions(+), 37 deletions(-)
>  create mode 100644 doc/images/doublesplice.svg
>  create mode 100644 doc/images/pktref.svg
>  create mode 100644 doc/images/splice.svg
>
> --
> 2.7.4
>
>


Re: [lng-odp] thread/shmem discussion summary V3

2016-06-02 Thread Ola Liljedahl
On 2 June 2016 at 11:08, Christophe Milard 
wrote:

> since V2: Update following Barry and Bill's comments
> since V1: Update following arch call 31 may 2016
>
> This is a tentative to sum up the discussions around the thread/process
> that have been happening these last weeks.
> Sorry for the formalism of this mail, but it seems we need accuracy here...
>
> This summary is organized as follows:
>
> It is a set of statements, each of them expecting a separate answer
> from you. When no specific ODP version is specified, the statement
> regards the"ultimate" goal (i.e what we want eventually to achieve).
> Each statement is prefixed with:
>   - a statement number for further reference (e.g. S1)
>   - a status word (one of 'agreed' or 'open', or 'closed').
> Agreed statements expect a yes/no answers: 'yes' meaning that you
> acknowledge that this is your understanding of the agreement and will
> not nack an implementation based on this statement. You can comment
> after a yes, but your comment will not block any implementation based
> on the agreed statement. A 'no' implies that the statement does not
> reflect your understanding of the agreement, or you refuse the
> proposal.
> Any 'no' received on an 'agreed' statement will push it back as 'open'.
> Open statements are fully open for further discussion.
>
> S1  -agreed: an ODP thread is an OS/platform concurrent execution
> environment object (as opposed to an ODP objects). No more specific
> definition is given by the ODP API itself.
>
> Barry: YES
> ---
>
> S2  -agreed: Each ODP implementation must tell what is allowed to be
> used as ODP thread for that specific implementation: a linux-based
> implementation, for instance, will have to state whether odp threads
> can be linux pthread, linux processes, or both, or any other type of
> concurrent execution environment. ODP implementations can put any
> restriction they wish on what an ODP thread is allowed to be. This
> should be documented in the ODP implementation documentation.
>
> Barry: YES
> ---
>
> S3  -agreed: in the linux generic ODP implementation a odpthread will be
> either:
> * a linux process descendant (or same as) the odp instantiation
> process.
> * a pthread 'member' of a linux process descendant (or same
> as) the odp instantiation process.
>
> Barry: YES
> ---
>
> S4  -agreed: For monarch, the linux generic ODP implementation only
> supports odp thread as pthread member of the instantiation process.
>
> Barry: YES
> ---
>
> S5  -agreed: whether multiple instances of ODP can be run on the same
> machine is left as a implementation decision. The ODP implementation
> document
> should state what is supported and any restriction is allowed.
>
> Barry: YES
> ---
>
> S6  -agreed: The l-g odp implementation will support multiple odp
> instances whose instantiation processes are different and not
> ancestor/descendant of each other. Different instances of ODP will,
> of course, be restricted in sharing common OS resources (the total
> amount of memory available for each ODP instance may decrease as the
> number of instances increases, the access to network interfaces will
> probably be granted to the first instance grabbing the interface and
> denied to others... some other rule may apply when sharing other
> common ODP resources.)
> ---
>
> S7  -agreed: the l-g odp implementation will not support multiple ODP
> instances initiated from the same linux process (calling multiple time
> odp_init_global).
> As an illustration, This means that a single process P is not allowed
> to execute the following calls (in any order)
> instance1 = odp_global_init()
> instance2 = odp_global_init()
> pthread_create (and, in that thread, run odp_local_init(instance1) )
> pthread_create (and, in that thread, run odp_local_init(instance2) )
> ---
>
> S8  -agreed: the l-g odp implementation will not support multiple ODP
> instances initiated from related linux processes (descendant/ancestor
> of each other), hence enabling ODP 'sub-instance'? As an illustration,
> this means that the following is not supported:
> instance1 = odp_global_init()
> pthread_create (and, in that thread, run odp_local_init(instance1) )
> if (fork()==0) {
> instance2 = odp_global_init()
> pthread_create (and, in that thread, run odp_local_init(instance2) )
> }
>
> 
> S9  -agreed: the odp instance passed as parameter to odp_local_init()
> must always be one of the odp_instance returned by odp_global_init()
>
> Barry: YES
> ---
>
> S10 -agreed: For l-g, if the answer to S7 and S8 are 'no', then due to S3,
> the odp_instance an odp_thread can attach to is completely defined by
> the ancestor of the thread, making the odp_instance parameter of
> odp_init_local redundant. The odp l-g 

Re: [lng-odp] lng-odp mailman settings

2016-05-31 Thread Ola Liljedahl
On 16 May 2016 at 11:52, Anders Roxell  wrote:

> On 2016-05-16 08:27, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> >
> >
> > From: lng-odp [mailto:lng-odp-boun...@lists.linaro.org] On Behalf Of
> Bill Fischofer
> > Sent: Saturday, May 14, 2016 1:55 AM
> > To: Brian Brooks 
> > Cc: lng-odp 
> > Subject: Re: [lng-odp] lng-odp mailman settings
> >
> >
> >
> > On Friday, May 13, 2016, Brian Brooks > wrote:
> > On 05/13 10:07:44, Bill Fischofer wrote:
> > > I don't think we need to impose those sort of restrictions. I
> personally
> > > use sylpheed for patches and GMail for everything else and they have no
> > > problem sorting things out.
> > >
> > > The ODP mailing list covers both discussions as well as patches, and
> HTML
> > > is useful for the former. The onus should be on those who have problems
> > > with this to upgrade their tools rather than sending everyone else
> back to
> > > the 1980s.
> > >
> > > On Fri, May 13, 2016 at 9:57 AM, Mike Holmes 
> > > >
> wrote:
> > >
> > > > All,
> > > >
> > > > A topic that comes up occasionally that affects some mail tools like
> > > > outlook is that html email on this list makes it harder for some
> folks to
> > > > process patches.
> > > >
> > > > I use mutt specifically to avoid using a fancy email client for all
> my
> > > > patch download/send work, it is powered by coal/steam and would work
> fine
> > > > on a 1970 VAX but there you go.
> > > >
> > > > Anyway, does anyone object to having mailman reject HTML mail ? One
> up
> > > > side is that I will not waste time blocking offers of cheap watches
> and fax
> > > > services from the list :)
> > > >
> > > > On the other hand I will miss bullet lists and the color red when I
> am
> > > > grumpy.
> > > >
> > > > Mike
> >
> > +1 for plain text
> >
> > I think this topic is more about simplicity and best practices rather
> than
> > anything else.
> >
> > Here is an example of why HTML doesn't work well with archives:
> > https://lists.linaro.org/pipermail/lng-odp/2016-May/023134.html
> > That link displays perfectly fine for me.  What problem do you see?
> >
> > Did you read it? It’s a mixture of paragraphs from two writers, without
> any indication who is writing what. The list archive drops out HTML
> decoration and at least in some cases will not convert (various blue or
> grey lines, or indentations) into ‘>’.
> >
> > As Brian highlights, it’s about simplicity. With one rule (no HTML) we
> can uniform output of all mail clients and their fonts/indentations/other
> settings, so that the list is clean and clear to read (both directly and
> from the archives) and reply (no need to struggle with mismatching styles
> inherited from the senders HTML settings, etc).
> >
> > +1 for plain text
>
> I agree with Brian and Petri.
>
> There shouldn't be any other option than plain text!
>
Is that an option then that I can choose not to choose?


> That works for discussions as well. "You can mail all those funny mails
> to your friends." =)
>
But can you send all those funny emails to your pets? I mean, if you don't
have friends?


>
> Just a reflection, since we are talking about plain text vs. html... Is
> there someone that can follow this thread easily? =)
>
> We have another problem with this thread as well, people top-post,
> which is insane! =)
>
How does forbidding HTML prevent top-posting? Perhaps we should filter out
top-posts as well? And why stop at that?


>
> Cheers,
> Anders
> ___
> lng-odp mailing list
> lng-odp@lists.linaro.org
> https://lists.linaro.org/mailman/listinfo/lng-odp
>
___
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp


Re: [lng-odp] [PATCH] linux-generic: test: fix ring resource leaks

2016-05-26 Thread Ola Liljedahl
On 26 May 2016 at 12:05, Yi He <yi...@linaro.org> wrote:

> Hi, Maxim, Ola and Bill
>
> Is it good to have a further investigation in RC4 to understand this? I
> can do this since it seems to have been introduced by my patch :). Maxim,
> did you mention this also happens with the old ringtest program (hang issue)?
>
No I don't think this is necessary. We see a problem only with older GCC
versions and using an unusual compiler flag (together with -O3). The code
(ring implementation and ring test code) works with newer GCC versions and
other compilers.

There could be a problem (race condition?) in the code (more likely ring
test, the actual ring implementation is copied from DPDK, I added the
memory ordering necessary for e.g. ARM) which is only exposed with this
compiler and version but then I would expect this problem to resurface at
some later time.

What we should do is to run the ringtest on serious multi (many) core
machines. That should be more prone to expose any race conditions.


>
> Best Regards, Yi
>
> On 26 May 2016 at 16:15, Ola Liljedahl <ola.liljed...@linaro.org> wrote:
>
>>
>>
>> On 26 May 2016 at 10:02, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
>>
>>>
>>>
>>> On 26 May 2016 at 10:33, Ola Liljedahl <ola.liljed...@linaro.org> wrote:
>>>
>>>>
>>>>
>>>> On 25 May 2016 at 17:18, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
>>>>
>>>>> issue is that -mcx16 and -O3 generate buggy code on gcc4.8 and
>>>>> gcc4.9, but works fine with gcc5.3 or on old gcc without optimization. 
>>>>> I.e.
>>>>> with -O0.
>>>>> So 2 options:
>>>>> 1) change configure.ac to not add -mcx16 flags;
>>>>> 2) fix ring code to support -mcx16 on old compilers;
>>>>>
>>>> I don't understand this one.
>>>> If (ring) code works without -mcx16 (which enables cmpxchg16b
>>>> instruction on x86), it should work also with the -mcx16 flag. I can't
>>>> understand how the code on its own could be wrong. Maybe the compiler is
>>>> doing something wrong when this option is specified (e.g. code generation gets
>>>> screwed up).
>>>>
>>>>
>>> Ola, I just wrote facts which I see on my 14.10 Ubuntu with gcc4.8 and
>>> gcc4.9. And Bill has gcc5.3 and he is unable to reproduce that bug. Also it
>>> works well with clang. I can spend some time installing gcc5.4,
>>> disassembling the 2 versions and comparing the difference, but I think we
>>> should not waste time on that if everything works well with a newer compiler
>>> version.
>>>
>> So you don't actually know that the ring code can be "fixed" (correction
>> or work-around) in order to avoid this problem?
>>
>> I do agree that if we only see a problem with an older version of a
>> specific compiler, then it is most likely caused by a bug in that compiler
>> and the best thing for us is to either not support that compiler version at
>> all or (in this case) avoid using a compiler option which triggers the
>> problem. This is not what I questioned. You proposed (as one alternative)
>> fixing the ring code, then you should have had some clue as to what kind of
>> fix or workaround would be necessary. As any problem or weakness in the
>> ring code wasn't obvious to me, I asked you about it.
>>
>> Maybe this was just a rhetorical trick: also suggest an impossible or
>> outrageous alternative so that we more readily accept the simpler but
>> perhaps more bitter pill, which as the only choice wouldn't have been
>> welcomed.
>>
>>
>>>
>>>
>>> Maxim.
>>>
>>>
>>>>
>>>>>
>>>>> I think we should go first path.
>>>>>
>>>>>
>>>>> Maxim.
>>>>>
>>>>> On 05/25/16 04:24, Yi He wrote:
>>>>>
>>>>>> Hi, sorry about the memory leak issue and thanks Maxim for your patch.
>>>>>>
>>>>>> Best Regards, Yi
>>>>>>
>>>>>> On 25 May 2016 at 09:05, Bill Fischofer <bill.fischo...@linaro.org
>>>>>> <mailto:bill.fischo...@linaro.org>> wrote:
>>>>>>
>>>>>> On Tue, May 24, 2016 at 3:46 PM, Maxim Uvarov
>>>>>> <maxim.uva...@linaro.org <mailto:maxim.uva...@linaro.org>>
>>>>>> wrote:
>>>>>>
>>>>>> > Make test a little bit simple. Add memory free and
>>>

Re: [lng-odp] [PATCH] linux-generic: test: fix ring resource leaks

2016-05-26 Thread Ola Liljedahl
On 25 May 2016 at 17:18, Maxim Uvarov  wrote:

> issue is that -mcx16 and -O3 generate buggy code on gcc4.8 and gcc4.9,
> but works fine with gcc5.3 or on old gcc without optimization. I.e. with
> -O0.
> So 2 options:
> 1) change configure.ac to not add -mcx16 flags;
> 2) fix ring code to support -mcx16 on old compilers;
>
I don't understand this one.
If (ring) code works without -mcx16 (which enables cmpxchg16b instruction
on x86), it should work also with the -mcx16 flag. I can't understand how
the code on its own could be wrong. Maybe the compiler is doing something
wrong when this option is specified (e.g. code generation gets screwed up).



>
> I think we should go first path.
>
>
> Maxim.
>
> On 05/25/16 04:24, Yi He wrote:
>
>> Hi, sorry about the memory leak issue and thanks Maxim for your patch.
>>
>> Best Regards, Yi
>>
>> On 25 May 2016 at 09:05, Bill Fischofer > > wrote:
>>
>> On Tue, May 24, 2016 at 3:46 PM, Maxim Uvarov
>> >
>> wrote:
>>
>> > Make test a little bit simple. Add memory free and
>> > take care about overflow using cast to int:
>> > (int)odp_atomic_load_u32(consume_count)
>> > Where number of consumer threads can dequeue from ring
>> > and decrease atomic u32.
>> >
>> > Signed-off-by: Maxim Uvarov > >
>> >
>>
>> Reviewed-and-tested-by: Bill Fischofer > >
>>
>>
>> > ---
>> >  platform/linux-generic/test/ring/ring_stress.c | 74
>> > --
>> >  1 file changed, 34 insertions(+), 40 deletions(-)
>> >
>> > diff --git a/platform/linux-generic/test/ring/ring_stress.c
>> > b/platform/linux-generic/test/ring/ring_stress.c
>> > index c68419f..a7e89a8 100644
>> > --- a/platform/linux-generic/test/ring/ring_stress.c
>> > +++ b/platform/linux-generic/test/ring/ring_stress.c
>> > @@ -156,12 +156,11 @@ void
>> ring_test_stress_N_M_producer_consumer(void)
>> > consume_count = retrieve_consume_count();
>> > CU_ASSERT(consume_count != NULL);
>> >
>> > -   /* in N:M test case, producer threads are always
>> > -* greater or equal to consumer threads, thus produce
>> > -* enought "goods" to be consumed by consumer threads.
>> > +   /* all producer threads try to fill ring to RING_SIZE,
>> > +* while consumers threads dequeue from ring with PIECE_BULK
>> > +* blocks. Multiply on 100 to add more tries.
>> >  */
>> > -   odp_atomic_init_u32(consume_count,
>> > -   (worker_param.numthrds) / 2);
>> > +   odp_atomic_init_u32(consume_count, RING_SIZE /
>> PIECE_BULK * 100);
>> >
>> > /* kick the workers */
>> > odp_cunit_thread_create(stress_worker, _param);
>> > @@ -202,8 +201,15 @@ static odp_atomic_u32_t
>> *retrieve_consume_count(void)
>> >  /* worker function for multiple producer instances */
>> >  static int do_producer(_ring_t *r)
>> >  {
>> > -   int i, result = 0;
>> > +   int i;
>> > void **enq = NULL;
>> > +   odp_atomic_u32_t *consume_count;
>> > +
>> > +   consume_count = retrieve_consume_count();
>> > +   if (consume_count == NULL) {
>> > +   LOG_ERR("cannot retrieve expected consume
>> count.\n");
>> > +   return -1;
>> > +   }
>> >
>> > /* allocate dummy object pointers for enqueue */
>> > enq = malloc(PIECE_BULK * 2 * sizeof(void *));
>> > @@ -216,26 +222,28 @@ static int do_producer(_ring_t *r)
>> > for (i = 0; i < PIECE_BULK; i++)
>> > enq[i] = (void *)(unsigned long)i;
>> >
>> > -   do {
>> > -   result = _ring_mp_enqueue_bulk(r, enq, PIECE_BULK);
>> > -   if (0 == result) {
>> > -   free(enq);
>> > -   return 0;
>> > -   }
>> > -   usleep(10); /* wait for consumer threads */
>> > -   } while (!_ring_full(r));
>> > +   while ((int)odp_atomic_load_u32(consume_count) > 0) {
>> > +   /* produce as much data as we can to the ring */
>> > +   if (!_ring_mp_enqueue_bulk(r, enq, PIECE_BULK))
>> > +   usleep(10);
>> > +   }
>> >
>> > +   free(enq);
>> > return 0;
>> >  }
>> >
>> >  /* worker function for multiple consumer instances */
>> >  static int do_consumer(_ring_t *r)
>> >  {
>> > -   int i, result = 0;
>> > +   int i;
>> > void **deq = NULL;
>> > -   odp_atomic_u32_t 

Re: [lng-odp] thread model and memory/address sharing

2016-05-25 Thread Ola Liljedahl
On 25 May 2016 at 13:11, Christophe Milard <christophe.mil...@linaro.org>
wrote:

>
>
> On 25 May 2016 at 12:18, Ola Liljedahl <ola.liljed...@linaro.org> wrote:
>
>> I have attempted to summarise and comment on the discussion. This is
>> partly
>> intended for an external audience so sorry if I seem to repeat things
>> already "solved". Unfortunately I couldn't attend yesterdays public call
>> but I read the notes.
>>
>> Does ODP define the threading model? The ODP documentation does refer to
>> “ODP threads” and the ODP API specifies thread identifiers (small
>> integers,
>> starting from 1 I think) and defines “thread masks” which define which
>> threads participate in certain processing (e.g. scheduler groups).
>>
>> The - mostly accepted answer - is no. ODP does not define the thread
>> model.
>> This is instead inherited from the execution environment (which the ODP
>> implementation in question is designed for). ODP will not have any thread
>> API’s (e.g. for setting CPU affinity for a “ODP” thread), the application
>> will have to use any native API but there might also be ODP helpers for
>> this (there are some new CPU pinning ODP helper functions for e.g.
>> pthreads and Linux processes). Such helpers are not proper ODP calls
>> because they might not be meaningful (or available) in some execution
>> environment (e.g. CPU pinning would probably work very differently in a
>> bare metal environment where thread equals CPU, helpers wouldn't be
>> necessary and could only serve to confuse the ODP implementation).
>>
>> There are two major flavours of the thread model: multiple threads in a
>> single memory space (“process” in Linux) - "single-process” - and multiple
>> threads in different memory spaces - “multi-process”. How can ODP
>> resources
>> be shared by ODP threads in different thread models and execution
>> environments?
>>
>> ODP handles are global and can be used by all ODP threads in the same
>> application (regardless of thread model). ODP shared memory regions are
>> also shared by all ODP threads. As ODP shared memory is really an OS
>> concept, the underlying OS may allow even wider sharing (e.g. sharing of
>> shared memory regions between different (ODP) applications).
>>
>> (Virtual) addresses derived from ODP handles (e.g. address for shared
>> memory region, address for buffer or packet data) may not be valid for all
>> threads. The ODP implementation specifies whether addresses derived from
>> ODP objects can be shared by threads.
>>
>
> I do not agree fully with that: I would rather say that the OS rules
> apply for these addresses: I don't think the linux-generic odp
> implementation should have to specify more than saying that pointers are
> sharable within the Linux rules (i.e. between Linux threads and "before
> fork"). It boils down to the same, probably.
>
The rules and limitations of memory and pointer sharability derive from the
execution environment (e.g. the OS). An ODP implementation is designed for
some specific execution environment (e.g. Linux user space) and inherits
those rules and limitations.


>  Having said that, we will surely have to specify on what objects the OS
> rules apply: if process A creates a pool P and then forks B and C, it is
> likely that addresses retrieved from packets allocated from P will be
> shareable as P was created before the fork. ODP will have to specify this
> kind of behaviour, as nothing says when the memory is actually allocated.
> Do we want these rules to be common across all ODP implementations (e.g. to
> be able to say that on any ODP, the OS rules apply to the pool rather than
> individual packets)?
>
I think we should make strong recommendations for the behaviour we see as
vital to ODP's success. An ODP implementation may still not follow those
recommendations but they better have good reasons for doing so (i.e. not
mainstream HW architecture or SW execution environment). We will not be
able to depend on such ODP implementations for ODP's mainstream success.

I didn't follow you here with regards to pools vs. packets. Both pools and
packets have ODP handles which should be globally sharable so I see no
difference between them.
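
To make the fork-ordering point concrete, a minimal sketch (assuming Linux
fork semantics and the odp_shm API; ODP init calls and error handling are
omitted, and the shm name is made up):

    #include <unistd.h>
    #include <odp.h>

    int main(void)
    {
        /* Region reserved before fork(): under Linux rules, the virtual
         * address below stays valid in the parent and in both children */
        odp_shm_t shm = odp_shm_reserve("pool_P_mem", 1024 * 1024,
                                        ODP_CACHE_LINE_SIZE, 0);
        void *base = odp_shm_addr(shm);

        if (fork() == 0) {
            /* child: 'base' maps the same region at the same address */
        }
        return base != NULL ? 0 : 1;
    }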



>
>> There is a recommendation (from ODP public call on 20160524) that "for
>> best
>> portability applications should not assume that such addresses have
>> validity beyond thread scope". I don’t agree with this recommendation as I
>> think it pushes application writers towards complicated designs were
>> pointers to data structures (which may be allocated from e.g. ODP shared
>> me

Re: [lng-odp] number of CPU's actually used by linux-generic

2016-05-24 Thread Ola Liljedahl
On 24 May 2016 at 03:34, Yi He <yi...@linaro.org> wrote:

> Hi,
>
> This is a policy coded in: platform/linux-generic/odp_cpumask.c
>
> init_default_worker_cpumask()
> init_default_control_cpumask():
>
> ...default mask initialization if not specified by odp_init_global()
> /*
>  * If three or more CPUs, reserve CPU 0 for kernel,
>  * reserve CPU 1 for control, and
>  * reserve remaining CPUs for workers
>  */
>
Thanks. I missed this change. Not paying 100% attention to the ODP list.
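
The defaults can of course be overridden by the application. A hedged
sketch of doing so through odp_init_global() (the odp_init_t field names
num_worker/worker_cpus are my reading of the current API; treat both them
and the call signature as assumptions):

    #include <string.h>
    #include <odp.h>

    /* Sketch: ask for CPUs 2 and 3 as workers instead of the default */
    static void init_with_explicit_workers(void)
    {
        odp_init_t init;
        odp_cpumask_t workers;

        memset(&init, 0, sizeof(init));
        odp_cpumask_zero(&workers);
        odp_cpumask_set(&workers, 2);
        odp_cpumask_set(&workers, 3);
        init.num_worker  = 2;
        init.worker_cpus = &workers;
        /* odp_init_global(&init, NULL); (signature assumed) */
    }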


>
> *Best Regards, Yi*
>
> On 23 May 2016 at 23:40, Ola Liljedahl <ola.liljed...@linaro.org> wrote:
>
>> Bill mentioned something in an email recently and I see the same.
>> On a four core x86-64 machine, I see only two ODP worker threads. Assuming
>> one CPU is allocated for the control thread, what is the fourth CPU doing
>> (not much I guess).
>>
>> -- Ola
>> ___
>> lng-odp mailing list
>> lng-odp@lists.linaro.org
>> https://lists.linaro.org/mailman/listinfo/lng-odp
>>
>
>
___
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp


Re: [lng-odp] [PATCH] helper: fixing doxygen comments for odpthread creation parameters

2016-05-23 Thread Ola Liljedahl
On 23 May 2016 at 17:34, Christophe Milard <christophe.mil...@linaro.org>
wrote:

>
>
> On 23 May 2016 at 17:31, Ola Liljedahl <ola.liljed...@linaro.org> wrote:
>
>>
>>
>> On 23 May 2016 at 17:28, Christophe Milard <christophe.mil...@linaro.org>
>> wrote:
>>
>>> This is for linux helpers: in odp linux, just pthreads and processes are
>>> supported as odpthreads. (Well, actually processes are not supported yet,
>>> but we are heading towards it.)
>>>
>> You can use C and C++ threads in Linux as well. You just need a
>> conforming compiler.
>>
>
> Not through the helper at this stage. When calling
> odph_odpthread_create(), the odpthread is created as either pthread
> (default) or forked process (--odph_proc option).
> So at this stage C and C++ threads are not supported by the helpers.
>
OK, that's fine if this is all encapsulated into the ODP/Linux helpers. That
should be possible to change without interfering too much with the actual
ODP implementation in use.


>
>
>>
>>
>>> If we support something else in the future, we'll update the comment
>>> then.
>>>
>>> Christophe
>>>
>>> On 23 May 2016 at 17:23, Ola Liljedahl <ola.liljed...@linaro.org> wrote:
>>>
>>>>
>>>>
>>>> On 23 May 2016 at 17:04, Christophe Milard <
>>>> christophe.mil...@linaro.org> wrote:
>>>>
>>>>> Signed-off-by: Christophe Milard <christophe.mil...@linaro.org>
>>>>> ---
>>>>>  helper/include/odp/helper/linux.h | 6 +++---
>>>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/helper/include/odp/helper/linux.h
>>>>> b/helper/include/odp/helper/linux.h
>>>>> index 01c348d..2e89833 100644
>>>>> --- a/helper/include/odp/helper/linux.h
>>>>> +++ b/helper/include/odp/helper/linux.h
>>>>> @@ -73,13 +73,13 @@ typedef struct {
>>>>>
>>>>>  /** The odpthread starting arguments, used both in process or thread
>>>>> mode */
>>>>>  typedef struct {
>>>>> -   odph_odpthread_linuxtype_t linuxtype;
>>>>> -   odph_odpthread_params_t thr_params; /*copy of thread start
>>>>> parameter*/
>>>>> +   odph_odpthread_linuxtype_t linuxtype; /**< process or pthread
>>>>> */
>>>>>
>>>> ODP threads might not be pthreads. There are many implementations of
>>>> threads.
>>>> Aren't we trying to tell the application if we are using a
>>>> single-process (memory space) or multi-process model?
>>>> Let's report this and only this.
>>>>
>>>> +   odph_odpthread_params_t thr_params; /**< odpthread start
>>>>> parameters */
>>>>>  } odph_odpthread_start_args_t;
>>>>>
>>>>>  /** Linux odpthread state information, used both in process or thread
>>>>> mode */
>>>>>  typedef struct {
>>>>> -   odph_odpthread_start_args_t start_args;
>>>>> +   odph_odpthread_start_args_t start_args; /**< start
>>>>> arguments */
>>>>> int cpu;/**< CPU ID */
>>>>> int last;   /**< true if last
>>>>> table entry */
>>>>> union {
>>>>> --
>>>>> 2.5.0
>>>>>
>>>>> ___
>>>>> lng-odp mailing list
>>>>> lng-odp@lists.linaro.org
>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp
>>>>>
>>>>
>>>>
>>>
>>
>
___
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp


[lng-odp] number of CPU's actually used by linux-generic

2016-05-23 Thread Ola Liljedahl
Bill mentioned something in an email recently and I see the same.
On a four core x86-64 machine, I see only two ODP worker threads. Assuming
one CPU is allocated for the control thread, what is the fourth CPU doing
(not much I guess).

-- Ola
___
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp


Re: [lng-odp] [PATCH] helper: fixing doxygen comments for odpthread creation parameters

2016-05-23 Thread Ola Liljedahl
On 23 May 2016 at 17:28, Christophe Milard <christophe.mil...@linaro.org>
wrote:

> This is for linux helpers: in odp linux, just pthreads and processes are
> supported as odpthreads. (Well, actually processes are not supported yet,
> but we are heading towards it.)
>
You can use C and C++ threads in Linux as well. You just need a conforming
compiler.
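
For example, a C11 threads version needs nothing but a conforming
toolchain; a standalone sketch (not an ODP helper):

    #include <threads.h>   /* C11 threads; needs a conforming libc */

    static int worker(void *arg)
    {
        (void)arg;
        return 0;
    }

    int main(void)
    {
        thrd_t t;
        int res;

        if (thrd_create(&t, worker, NULL) != thrd_success)
            return 1;
        thrd_join(&t, &res);
        return res;
    }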


> If we support something else in the future, we'll update the comment then.
>
> Christophe
>
> On 23 May 2016 at 17:23, Ola Liljedahl <ola.liljed...@linaro.org> wrote:
>
>>
>>
>> On 23 May 2016 at 17:04, Christophe Milard <christophe.mil...@linaro.org>
>> wrote:
>>
>>> Signed-off-by: Christophe Milard <christophe.mil...@linaro.org>
>>> ---
>>>  helper/include/odp/helper/linux.h | 6 +++---
>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/helper/include/odp/helper/linux.h
>>> b/helper/include/odp/helper/linux.h
>>> index 01c348d..2e89833 100644
>>> --- a/helper/include/odp/helper/linux.h
>>> +++ b/helper/include/odp/helper/linux.h
>>> @@ -73,13 +73,13 @@ typedef struct {
>>>
>>>  /** The odpthread starting arguments, used both in process or thread
>>> mode */
>>>  typedef struct {
>>> -   odph_odpthread_linuxtype_t linuxtype;
>>> -   odph_odpthread_params_t thr_params; /*copy of thread start
>>> parameter*/
>>> +   odph_odpthread_linuxtype_t linuxtype; /**< process or pthread */
>>>
>> ODP threads might not be pthreads. There are many implementations of
>> threads.
>> Aren't we trying to tell the application if we are using a single-process
>> (memory space) or multi-process model?
>> Let's report this and only this.
>>
>> +   odph_odpthread_params_t thr_params; /**< odpthread start
>>> parameters */
>>>  } odph_odpthread_start_args_t;
>>>
>>>  /** Linux odpthread state information, used both in process or thread
>>> mode */
>>>  typedef struct {
>>> -   odph_odpthread_start_args_t start_args;
>>> +   odph_odpthread_start_args_t start_args; /**< start arguments
>>> */
>>> int cpu;/**< CPU ID */
>>> int last;   /**< true if last table
>>> entry */
>>> union {
>>> --
>>> 2.5.0
>>>
>>> ___
>>> lng-odp mailing list
>>> lng-odp@lists.linaro.org
>>> https://lists.linaro.org/mailman/listinfo/lng-odp
>>>
>>
>>
>
___
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp


Re: [lng-odp] [PATCH] helper: fixing doxygen comments for odpthread creation parameters

2016-05-23 Thread Ola Liljedahl
On 23 May 2016 at 17:04, Christophe Milard 
wrote:

> Signed-off-by: Christophe Milard 
> ---
>  helper/include/odp/helper/linux.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/helper/include/odp/helper/linux.h
> b/helper/include/odp/helper/linux.h
> index 01c348d..2e89833 100644
> --- a/helper/include/odp/helper/linux.h
> +++ b/helper/include/odp/helper/linux.h
> @@ -73,13 +73,13 @@ typedef struct {
>
>  /** The odpthread starting arguments, used both in process or thread mode
> */
>  typedef struct {
> -   odph_odpthread_linuxtype_t linuxtype;
> -   odph_odpthread_params_t thr_params; /*copy of thread start
> parameter*/
> +   odph_odpthread_linuxtype_t linuxtype; /**< process or pthread */
>
ODP threads might not be pthreads. There are many implementations of
threads.
Aren't we trying to tell the application if we are using a single-process
(memory space) or multi-process model?
Let's report this and only this.

+   odph_odpthread_params_t thr_params; /**< odpthread start parameters
> */
>  } odph_odpthread_start_args_t;
>
>  /** Linux odpthread state information, used both in process or thread
> mode */
>  typedef struct {
> -   odph_odpthread_start_args_t start_args;
> +   odph_odpthread_start_args_t start_args; /**< start arguments */
> int cpu;/**< CPU ID */
> int last;   /**< true if last table
> entry */
> union {
> --
> 2.5.0
>
> ___
> lng-odp mailing list
> lng-odp@lists.linaro.org
> https://lists.linaro.org/mailman/listinfo/lng-odp
>
___
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp


Re: [lng-odp] [PATCH] helper: linux: thread and process cpu affinity APIs

2016-05-23 Thread Ola Liljedahl
On 23 May 2016 at 10:16, Yi He wrote:

> Hi, Christophe
>
> Here I met a difficulty, if I unified the API into
> odph_odpthread_set_affinity(),
> inside the function how can I determine whether the current context is a
> pthread or process? So I cannot decide to call sched_setaffinity() or
> pthread_setaffinity_np().
>
Perhaps the ODP implementation uses some other form of threads, e.g. C11
and C++11 have built-in support for threads.
Or you are running bare metal (which some vendors support) and the threads
are actually HW threads (logical CPU's). Possibly such an environment will
emulate some thread API (pthreads being the most common I believe).

It seems that although ODP is not an OS abstraction, it can't avoid
intersecting with some functionality that normally is provided by the OS or
language runtime. But where do you draw the line?
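
For the Linux helper specifically, one possible answer to Yi's question (a
sketch under stated assumptions, not the final helper API): on Linux,
sched_setaffinity() with pid 0 applies to the calling thread only, and
pthread_setaffinity_np() is a wrapper around the same syscall, so a single
code path can serve both the pthread and the forked-process mode:

    #define _GNU_SOURCE
    #include <sched.h>

    /* Affinity of the calling context, pthread or forked process alike */
    static int set_affinity_sketch(const cpu_set_t *set)
    {
        return sched_setaffinity(0 /* calling thread */,
                                 sizeof(cpu_set_t), set);
    }

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);
        return set_affinity_sketch(&set);
    }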


>
> thanks and best regards, Yi
>
>
>
> On 23 May 2016 at 14:53, Yi He wrote:
>
> > Hi, Christophe
> >
> > Yes, I'll apply your series and send a new one later.
> >
> > Best Regards, Yi
> >
> > On 23 May 2016 at 14:33, Christophe Milard wrote:
> >
> >>
> >>
> >> On 20 May 2016 at 10:48, Yi He wrote:
> >>
> >>> Set affinity to 1st available control cpu for all odp
> >>> validation programs in odp_cunit_common library.
> >>>
> >>> Signed-off-by: Yi He 
> >>> ---
> >>>  helper/include/odp/helper/linux.h | 47 +++
> >>>  helper/linux.c| 32 +
> >>>  helper/test/thread.c  | 76
> >>> +--
> >>>  test/validation/common/odp_cunit_common.c | 15 --
> >>>  4 files changed, 164 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/helper/include/odp/helper/linux.h
> >>> b/helper/include/odp/helper/linux.h
> >>> index e2dca35..fa815e1 100644
> >>> --- a/helper/include/odp/helper/linux.h
> >>> +++ b/helper/include/odp/helper/linux.h
> >>> @@ -84,6 +84,29 @@ int odph_linux_pthread_create(odph_linux_pthread_t
> >>> *pthread_tbl,
> >>>   */
> >>>  void odph_linux_pthread_join(odph_linux_pthread_t *thread_tbl, int
> num);
> >>>
> >>> +/**
> >>> + * Set CPU affinity of the current thread
> >>> + *
> >>> + * CPU affinity determines the set of CPUs on which the thread is
> >>> + * eligible to run.
> >>> + *
> >>> + * @param cpuset    A bitmask lists the affinity CPU cores
> >>> + *
> >>> + * @return 0 on success, -1 on failure
> >>> + */
> >>> +int odph_linux_pthread_setaffinity(const odp_cpumask_t *cpuset);
> >>>
> >>
> >> odph_odpthread_set_affinity() is, I guess, better, at least as long as we
> >> try to keep threads and processes together. Dropping the linux prefix also
> >> makes the code more portable (on some other OS, just provide new helpers
> >> and the app hopefully does not need to change).
> >>
> >> I guess you can apply the "running things in process mode" patch series
> >> to see what I am after...
> >>
> >> This comment applies to all your function names, of course.
> >>
> >> Christophe
> >>
> >> +
> >>> +/**
> >>> + * Get CPU affinity of the current thread
> >>> + *
> >>> + * CPU affinity determines the set of CPUs on which the thread is
> >>> + * eligible to run.
> >>> + *
> >>> + * @param cpuset[out]   A bitmask lists the affinity CPU cores
> >>> + *
> >>> + * @return 0 on success, -1 on failure
> >>> + */
> >>> +int odph_linux_pthread_getaffinity(odp_cpumask_t *cpuset);
> >>>
> >>>  /**
> >>>   * Fork a process
> >>> @@ -134,6 +157,30 @@ int odph_linux_process_fork_n(odph_linux_process_t
> >>> *proc_tbl,
> >>>  int odph_linux_process_wait_n(odph_linux_process_t *proc_tbl, int
> num);
> >>>
> >>>  /**
> >>> + * Set CPU affinity of the current process
> >>> + *
> >>> + * CPU affinity determines the set of CPUs on which the process is
> >>> + * eligible to run.
> >>> + *
> >>> + * @param cpuset    A bitmask lists the affinity CPU cores
> >>> + *
> >>> + * @return 0 on success, -1 on failure
> >>> + */
> >>> +int odph_linux_process_setaffinity(const odp_cpumask_t *cpuset);
> >>> +
> >>> +/**
> >>> + * Get CPU affinity of the current process
> >>> + *
> >>> + * CPU affinity determines the set of CPUs on which the process is
> >>> + * eligible to run.
> >>> + *
> >>> + * @param cpuset[out]   A bitmask lists the affinity CPU cores
> >>> + *
> >>> + * @return 0 on success, -1 on failure
> >>> + */
> >>> +int odph_linux_process_getaffinity(odp_cpumask_t *cpuset);
> >>> +
> >>> +/**
> >>>   * @}
> >>>   */
> >>>
> >>> diff --git a/helper/linux.c b/helper/linux.c
> >>> index 24e243b..6ce7e7d 100644
> >>> --- a/helper/linux.c
> >>> +++ b/helper/linux.c
> >>> @@ -114,6 +114,22 @@ void odph_linux_pthread_join(odph_linux_pthread_t
> >>> *thread_tbl, int num)
> >>> }
> >>>  }
> >>>
> >>> +int odph_linux_pthread_setaffinity(const odp_cpumask_t *cpuset)
> >>> +{
> >>> +   const cpu_set_t *_cpuset = &cpuset->set;
> >>> +
> >>> +   return (0 == 

Re: [lng-odp] [PATCHv2] linux-generic: timer fix odp_timer_pool_create return code

2016-05-23 Thread Ola Liljedahl
On 18 May 2016 at 16:24, Maxim Uvarov wrote:

> According to the API, the return code for the failure case is
> ODP_TIMER_POOL_INVALID and errno is set, even if that handle is defined
> to NULL. Also add a check on timer alloc that the input (pool) parameter
> is not invalid.
> https://bugs.linaro.org/show_bug.cgi?id=2139

If  odp_timer_pool is a pointer in linux-generic, then it is OK for the
implementation to return NULL for an invalid timer pool handle. The
implementation should be allowed to use constants (e.g. NULL) for types
which are under its control (e.g. odp_timer_pool) and not have to use
abstract values (ODP_TIMER_POOL_INVALID). This is the implementation; it is
not directly portable and reusable with other definitions of the timer
types.



>
> Signed-off-by: Maxim Uvarov 
> ---
>  v2: added missed ) on if statement. (looks like forgot to git ---ammend
> it )
>
>  platform/linux-generic/odp_timer.c | 24 
>  1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/platform/linux-generic/odp_timer.c
> b/platform/linux-generic/odp_timer.c
> index a6d3332..8696074 100644
> --- a/platform/linux-generic/odp_timer.c
> +++ b/platform/linux-generic/odp_timer.c
> @@ -227,16 +227,15 @@ static inline odp_timer_t tp_idx_to_handle(struct
> odp_timer_pool_s *tp,
>  static void itimer_init(odp_timer_pool *tp);
>  static void itimer_fini(odp_timer_pool *tp);
>
> -static odp_timer_pool *odp_timer_pool_new(
> -   const char *_name,
> -   const odp_timer_pool_param_t *param)
> +static odp_timer_pool_t odp_timer_pool_new(const char *_name,
> +  const odp_timer_pool_param_t
> *param)
>  {
> uint32_t tp_idx = odp_atomic_fetch_add_u32(&num_timer_pools, 1);
> if (odp_unlikely(tp_idx >= MAX_TIMER_POOLS)) {
> /* Restore the previous value */
> odp_atomic_sub_u32(&num_timer_pools, 1);
> __odp_errno = ENFILE; /* Table overflow */
> -   return NULL;
> +   return ODP_TIMER_POOL_INVALID;
> }
> size_t sz0 = ODP_ALIGN_ROUNDUP(sizeof(odp_timer_pool),
> ODP_CACHE_LINE_SIZE);
> @@ -804,10 +803,9 @@ odp_timer_pool_create(const char *name,
> /* Verify that we have a valid (non-zero) timer resolution */
> if (param->res_ns == 0) {
> __odp_errno = EINVAL;
> -   return NULL;
> +   return ODP_TIMER_POOL_INVALID;
> }
> -   odp_timer_pool_t tp = odp_timer_pool_new(name, param);
> -   return tp;
> +   return odp_timer_pool_new(name, param);
>  }
>
>  void odp_timer_pool_start(void)
> @@ -855,15 +853,17 @@ odp_timer_t odp_timer_alloc(odp_timer_pool_t tpid,
> odp_queue_t queue,
> void *user_ptr)
>  {
> +   odp_timer_t hdl;
> +
> +   if (odp_unlikely(tpid == ODP_TIMER_POOL_INVALID))
> +   ODP_ABORT("Invalid timer pool.\n");
>
Possibly it is helpful to check for an invalid timer pool handle and die in a
clean way.
But we will not check a billion other invalid timer pool handles and those
will likely generate undefined behaviour or (if we are lucky) a memory
access violation (sigsegv). So there is no guarantee that the use of an
invalid timer pool handle will be detected in any clean way.


> if (odp_unlikely(queue == ODP_QUEUE_INVALID))
> ODP_ABORT("%s: Invalid queue handle\n", tpid->name);
>
The queue handle we do want to verify now, while in the context of the
application. Later, when a timer expires, we don't want to get errors in the
timer background thread. That would likely be fatal and also difficult to
debug.


> /* We don't care about the validity of user_ptr because we will not
>  * attempt to dereference it */
> -   odp_timer_t hdl = timer_alloc(tpid, queue, user_ptr);
> -   if (odp_likely(hdl != ODP_TIMER_INVALID)) {
> -   /* Success */
> -   return hdl;
> -   }
> +   hdl = timer_alloc(tpid, queue, user_ptr);
> +   if (odp_likely(hdl != ODP_TIMER_INVALID))
> +   return hdl; /* Success */
>
As Bill wrote, this code can be simplified even more. Just call
timer_alloc() and return its value directly.
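
The general shape of that simplification, as a standalone sketch with
made-up names: when the callee already returns the sentinel with errno set,
the caller can forward its value unconditionally.

    #include <errno.h>

    #define INVALID (-1)

    static int alloc_thing(void)
    {
        errno = ENOMEM;       /* callee sets errno on failure... */
        return INVALID;       /* ...and returns the sentinel */
    }

    static int api_alloc(void)
    {
        return alloc_thing(); /* no need to test and re-return */
    }

    int main(void)
    {
        return api_alloc() == INVALID ? 0 : 1;
    }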


> /* errno set by timer_alloc() */
> return ODP_TIMER_INVALID;
>  }
> --
> 2.7.1.250.gff4ea60
>
> ___
> lng-odp mailing list
> lng-odp@lists.linaro.org
> https://lists.linaro.org/mailman/listinfo/lng-odp
>
___
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp


[lng-odp] [PATCH] linux-generic: timer: generalize arch-specific code path selection

2016-05-20 Thread Ola Liljedahl
Make architecture-specific code path selection generic, controlled
directly by compiler feature predefines.
Replace macro PREFETCH with intrinsic __builtin_prefetch.
Fixes https://bugs.linaro.org/show_bug.cgi?id=2235

Signed-off-by: Ola Liljedahl <ola.liljed...@linaro.org>
---
(This document/code contribution attached is provided under the terms of
agreement LES-LTM-21309)

 platform/linux-generic/odp_timer.c | 28 +---
 1 file changed, 9 insertions(+), 19 deletions(-)

diff --git a/platform/linux-generic/odp_timer.c 
b/platform/linux-generic/odp_timer.c
index a6d3332..4e56fb0 100644
--- a/platform/linux-generic/odp_timer.c
+++ b/platform/linux-generic/odp_timer.c
@@ -58,12 +58,6 @@
  * for checking the freshness of received timeouts */
 #define TMO_INACTIVE ((uint64_t)0x8000000000000000)
 
-#ifdef __ARM_ARCH
-#define PREFETCH(ptr) __builtin_prefetch((ptr), 0, 0)
-#else
-#define PREFETCH(ptr) (void)(ptr)
-#endif
-
 /**
  * Mutual exclusion in the absence of CAS16
  */
@@ -210,7 +204,7 @@ static inline uint32_t handle_to_idx(odp_timer_t hdl,
struct odp_timer_pool_s *tp)
 {
uint32_t idx = _odp_typeval(hdl) & ((1U << INDEX_BITS) - 1U);
-   PREFETCH(>tick_buf[idx]);
+   __builtin_prefetch(>tick_buf[idx], 0, 0);
if (odp_likely(idx < odp_atomic_load_u32(>high_wm)))
return idx;
ODP_ABORT("Invalid timer handle %#x\n", hdl);
@@ -395,7 +389,7 @@ static bool timer_reset(uint32_t idx,
tick_buf_t *tb = >tick_buf[idx];
 
if (tmo_buf == NULL || *tmo_buf == ODP_BUFFER_INVALID) {
-#ifdef ODP_ATOMIC_U128
+#ifdef ODP_ATOMIC_U128 /* Target supports 128-bit atomic operations */
tick_buf_t new, old;
do {
/* Relaxed and non-atomic read of current values */
@@ -422,9 +416,10 @@ static bool timer_reset(uint32_t idx,
(_uint128_t *),
_ODP_MEMMODEL_RLS,
_ODP_MEMMODEL_RLX));
-#else
-#ifdef __ARM_ARCH
-   /* Since barriers are not good for C-A15, we take an
+#elif __GCC_ATOMIC_LLONG_LOCK_FREE >= 2 && \
+   defined __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8
+   /* Target supports lock-free 64-bit CAS (and probably exchange) */
+   /* Since locks/barriers are not good for C-A15, we take an
 * alternative approach using relaxed memory model */
uint64_t old;
/* Swap in new expiration tick, get back old tick which
@@ -450,7 +445,7 @@ static bool timer_reset(uint32_t idx,
_ODP_MEMMODEL_RLX);
success = false;
}
-#else
+#else /* Target supports neither 128-bit nor 64-bit CAS => use lock */
/* Take a related lock */
while (_odp_atomic_flag_tas(IDX2LOCK(idx)))
/* While lock is taken, spin using relaxed loads */
@@ -470,7 +465,6 @@ static bool timer_reset(uint32_t idx,
/* Release the lock */
_odp_atomic_flag_clear(IDX2LOCK(idx));
 #endif
-#endif
} else {
/* We have a new timeout buffer which replaces any old one */
/* Fill in some (constant) header fields for timeout events */
@@ -655,13 +649,11 @@ static unsigned odp_timer_pool_expire(odp_timer_pool_t 
tpid, uint64_t tick)
 
ODP_ASSERT(high_wm <= tpid->param.num_timers);
for (i = 0; i < high_wm;) {
-#ifdef __ARM_ARCH
/* As a rare occurrence, we can outsmart the HW prefetcher
 * and the compiler (GCC -fprefetch-loop-arrays) with some
 * tuned manual prefetching (32x16=512B ahead), seems to
 * give 30% better performance on ARM C-A15 */
-   PREFETCH([i + 32]);
-#endif
+   __builtin_prefetch([i + 32], 0, 0);
/* Non-atomic read for speed */
uint64_t exp_tck = array[i++].exp_tck.v;
if (odp_unlikely(exp_tck <= tick)) {
@@ -691,13 +683,11 @@ static void timer_notify(odp_timer_pool *tp)
}
}
 
-#ifdef __ARM_ARCH
odp_timer *array = >timers[0];
uint32_t i;
/* Prefetch initial cache lines (match 32 above) */
for (i = 0; i < 32; i += ODP_CACHE_LINE_SIZE / sizeof(array[0]))
-   PREFETCH([i]);
-#endif
+   __builtin_prefetch([i], 0, 0);
prev_tick = odp_atomic_fetch_inc_u64(>cur_tick);
 
/* Scan timer array, looking for timers to expire */
-- 
2.5.0

___
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp


Re: [lng-odp] [PATCH] linux-generic: timer: disable 128 bit timer optimization for clang

2016-05-18 Thread Ola Liljedahl
Maxim,

I configured and built using clang and -m32 and it worked for me. clang
chooses the lock-based path.
This means __SIZEOF_INT128__ and __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 were
not defined so the ODP atomic support for 128-bit variables was not
enabled. The timer code fell back to the lock-based path.

I used clang 3.6.2-1 running on Ubuntu 15.10.

I think your clang 3.4 is buggy: it claims to support -mcx16 also for -m32
targets and then defines those preprocessor symbols above which are tested
by odp_atomic_internal.h. But when generating code, the compiler backend
realises that there is no suitable instruction (e.g. cmpxchg16) for
i386/686 targets so generates calls to external functions (e.g.
__atomic_exchange) instead. But those functions either do not exist or are
somehow not included in the linking (I would expect them to be located in
clang's equivalent to libgcc.a). There are some bug reports on missing
support for 128-bit atomics in clang and instead you get these calls to
non-existing functions.

Add an additional check to odp_atomic_internal.h that the clang version must
be >= 3.6 (don't know about 3.5) for the ODP support for 128-bit atomics to
be enabled.
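
Something along these lines in odp_atomic_internal.h should do it (a
sketch; the exact version cutoff is an assumption since 3.5 is untested):

    /* Enable 128-bit atomics only when the target claims support AND the
     * compiler is known to generate real code for it (clang < 3.6 emits
     * calls to library functions that may not exist) */
    #if defined __SIZEOF_INT128__ && \
        defined __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 && \
        (!defined __clang__ || __clang_major__ > 3 || \
         (__clang_major__ == 3 && __clang_minor__ >= 6))
    #define ODP_ATOMIC_U128
    #endif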

-- Ola


On 18 May 2016 at 20:53, Maxim Uvarov wrote:

> On 05/18/16 21:45, Mike Holmes wrote:
>
>> Maxim and I had a chat, I think this patch means to say, "for now clang
>> will use non-optimised code, but deeper analysis is needed to optimise with
>> clang"
>>
> Yes, looks like the comment under "---" was confusing. This patch is not a
> hack; it only routes clang-generated code to the lock path instead of the
> lock-free path with 128-bit instructions.
>
> Maxim.
>
> On 18 May 2016 at 14:27, Maxim Uvarov <maxim.uva...@linaro.org> wrote:
>>
>> On 05/18/16 19:48, Mike Holmes wrote:
>>
>>
>>
>> On 18 May 2016 at 11:56, Maxim Uvarov wrote:
>>
>> On 05/18/16 18:52, Mike Holmes wrote:
>>
>>
>>
>> On 18 May 2016 at 11:15, Maxim Uvarov wrote:
>> Fix compilation error for clang with disabling 128 bit
>> optimization.
>> In function `_odp_atomic_u128_xchg_mm':
>> undefined reference to `__atomic_exchange'
>>
>> Signed-off-by: Maxim Uvarov
>> ---
>>  I need some quick way to make clang build happy
>>
>>
>> Why not revert whatever introduced the issue ?
>>
>> . Clean patch can go later.
>>
>> When is "later" defined to be ?
>>
>>
>> Why dont we just wait for the correct fix ?
>>
>> to make -m32 work now.
>>
>>
>> why now, why do we need a fix so urgently that we don't fix it
>> properly.
>>
>>
>> There is no big urgency and the patch can wait the usual 24 hours. I might
>> not have been clear describing the problem this patch fixes in a way
>> understandable to people who did not look into the timer test.
>> not look into timer test.
>>
>> I think that Ola's two patches fix the issue with the 128-bit optimization
>> on gcc, but introduce some other things which we did not capture in the
>> review process:
>>
>> 1) clang (at least my Ubuntu clang version 3.4-1ubuntu3
>> (tags/RELEASE_34/final) (based on LLVM 3.4))
>> does not link against gcc's built-in __atomic_exchange. That means a clang
>> build should use the generic version, not the one optimized
>> for 128 bits.
>>
>> 2) The build for odp-linux has to be reproducible, and at the same
>> time run on any similar arch.
>> That means that all such optimizations should be under ./configure
>> options (for the timer it's #define ODP_ATOMIC_U128).
>> Only in that case can we be sure that a generic x86 build (which
>> will be in Ubuntu, Debian, Red Hat etc.) will run on
>> all machines, even those which do not support the intrinsics.
>>
>> 3) configure compiler detection has to be in:
>> platform/linux-generic/m4/configure.m4
>>
>>
>> The current patch fixes (1). But because (2) and (3) are only a dance
>> around configure.ac 

Re: [lng-odp] [PATCHv3 2/2] linux-generic: timer: fix failed static assert

2016-05-18 Thread Ola Liljedahl
On 18 May 2016 at 16:13, Maxim Uvarov <maxim.uva...@linaro.org> wrote:

> odp-check passed but looks like clang is not tested there, now I have
> errors:
>
Bill claimed to have tested 32- and 64-bit x86 using gcc *and* clang?

If your compiler defines __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16, then it
should implement __atomic_exchange and __atomic_compare_exchange on 128 bit
variables directly (e.g. not calling external functions, x86-64 has an
instruction for this).

See
http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20130930/090167.html


>
>   CCLD odp_crypto
> ../../lib/.libs/libodp-linux.a(odp_timer.o): In function
> `_odp_atomic_u128_xchg_mm':
> /opt/Linaro/odp3.git/platform/linux-generic/./include/odp_atomic_internal.h:619:
> undefined reference to `__atomic_exchange'
> ../../lib/.libs/libodp-linux.a(odp_timer.o): In function
> `_odp_atomic_u128_cmp_xchg_mm':
> /opt/Linaro/odp3.git/platform/linux-generic/./include/odp_atomic_internal.h:643:
> undefined reference to `__atomic_compare_exchange'
> clang: error: linker command failed with exit code 1 (use -v to see
> invocation)


> Does somebody else see this?
>
> Maxim.
>
>
> On 05/17/16 22:21, Ola Liljedahl wrote:
>
>> Fixes https://bugs.linaro.org/show_bug.cgi?id=2211
>> Ensure tick_buf_t structure is 128 bits large on all platforms,
>> regardless of support for 64-bit atomic operations.
>> Only assert that tick_buf_t is 128 bits large when performing
>> atomic operations on it (requires ODP and platform support for 128
>> bit atomics).
>>
>> Signed-off-by: Ola Liljedahl <ola.liljed...@linaro.org>
>> ---
>>   platform/linux-generic/odp_timer.c | 23 +++
>>   1 file changed, 23 insertions(+)
>>
>> diff --git a/platform/linux-generic/odp_timer.c
>> b/platform/linux-generic/odp_timer.c
>> index 41e7195..a6d3332 100644
>> --- a/platform/linux-generic/odp_timer.c
>> +++ b/platform/linux-generic/odp_timer.c
>> @@ -94,7 +94,15 @@ static odp_timeout_hdr_t *timeout_hdr(odp_timeout_t
>> tmo)
>>
>>  
>> */
>> typedef struct tick_buf_s {
>> +#if __GCC_ATOMIC_LLONG_LOCK_FREE < 2
>> +   /* No atomics support for 64-bit variables, will use separate
>> lock */
>> +   /* Use the same layout as odp_atomic_u64_t but without lock
>> variable */
>> +   struct {
>> +   uint64_t v;
>> +   } exp_tck;/* Expiration tick or TMO_xxx */
>> +#else
>> odp_atomic_u64_t exp_tck;/* Expiration tick or TMO_xxx */
>> +#endif
>> odp_buffer_t tmo_buf;/* ODP_BUFFER_INVALID if timer not active */
>>   #ifdef TB_NEEDS_PAD
>> uint32_t pad;/* Need to be able to access padding for successful
>> CAS */
>> @@ -105,7 +113,10 @@ ODP_ALIGNED(16) /* 16-byte atomic operations need
>> properly aligned addresses */
>>   #endif
>>   ;
>>   +#if __GCC_ATOMIC_LLONG_LOCK_FREE >= 2
>> +/* Only assert this when we perform atomic operations on tick_buf_t */
>>   ODP_STATIC_ASSERT(sizeof(tick_buf_t) == 16, "sizeof(tick_buf_t) == 16");
>> +#endif
>> typedef struct odp_timer_s {
>> void *user_ptr;
>> @@ -123,7 +134,11 @@ static void timer_init(odp_timer *tim,
>> /* All pad fields need a defined and constant value */
>> TB_SET_PAD(*tb);
>> /* Release the timer by setting timer state to inactive */
>> +#if __GCC_ATOMIC_LLONG_LOCK_FREE < 2
>> +   tb->exp_tck.v = TMO_INACTIVE;
>> +#else
>> _odp_atomic_u64_store_mm(>exp_tck, TMO_INACTIVE,
>> _ODP_MEMMODEL_RLS);
>> +#endif
>>   }
>> /* Teardown when timer is freed */
>> @@ -253,7 +268,11 @@ static odp_timer_pool *odp_timer_pool_new(
>> tp->timers[i].queue = ODP_QUEUE_INVALID;
>> set_next_free(>timers[i], i + 1);
>> tp->timers[i].user_ptr = NULL;
>> +#if __GCC_ATOMIC_LLONG_LOCK_FREE < 2
>> +   tp->tick_buf[i].exp_tck.v = TMO_UNUSED;
>> +#else
>> odp_atomic_init_u64(>tick_buf[i].exp_tck, TMO_UNUSED);
>> +#endif
>> tp->tick_buf[i].tmo_buf = ODP_BUFFER_INVALID;
>> }
>> tp->tp_idx = tp_idx;
>> @@ -935,7 +954,11 @@ int odp_timeout_fresh(odp_timeout_t tmo)
>> odp_timer_pool *tp = handle_to_tp(hdl);
>> uint32_t idx = handle_to_idx(hdl, tp);
>> tick_buf_t *tb = >tick_buf[idx];
>> +#if __GCC_ATOMIC_LLONG_LOCK_FREE < 2
>> +   uint64_t exp_tck = 

[lng-odp] [PATCHv3 2/2] linux-generic: timer: fix failed static assert

2016-05-17 Thread Ola Liljedahl
Fixes https://bugs.linaro.org/show_bug.cgi?id=2211
Ensure tick_buf_t structure is 128 bits large on all platforms,
regardless of support for 64-bit atomic operations.
Only assert that tick_buf_t is 128 bits large when performing
atomic operations on it (requires ODP and platform support for 128
bit atomics).

Signed-off-by: Ola Liljedahl <ola.liljed...@linaro.org>
---
 platform/linux-generic/odp_timer.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/platform/linux-generic/odp_timer.c 
b/platform/linux-generic/odp_timer.c
index 41e7195..a6d3332 100644
--- a/platform/linux-generic/odp_timer.c
+++ b/platform/linux-generic/odp_timer.c
@@ -94,7 +94,15 @@ static odp_timeout_hdr_t *timeout_hdr(odp_timeout_t tmo)
  */
 
 typedef struct tick_buf_s {
+#if __GCC_ATOMIC_LLONG_LOCK_FREE < 2
+   /* No atomics support for 64-bit variables, will use separate lock */
+   /* Use the same layout as odp_atomic_u64_t but without lock variable */
+   struct {
+   uint64_t v;
+   } exp_tck;/* Expiration tick or TMO_xxx */
+#else
odp_atomic_u64_t exp_tck;/* Expiration tick or TMO_xxx */
+#endif
odp_buffer_t tmo_buf;/* ODP_BUFFER_INVALID if timer not active */
 #ifdef TB_NEEDS_PAD
uint32_t pad;/* Need to be able to access padding for successful CAS */
@@ -105,7 +113,10 @@ ODP_ALIGNED(16) /* 16-byte atomic operations need properly 
aligned addresses */
 #endif
 ;
 
+#if __GCC_ATOMIC_LLONG_LOCK_FREE >= 2
+/* Only assert this when we perform atomic operations on tick_buf_t */
 ODP_STATIC_ASSERT(sizeof(tick_buf_t) == 16, "sizeof(tick_buf_t) == 16");
+#endif
 
 typedef struct odp_timer_s {
void *user_ptr;
@@ -123,7 +134,11 @@ static void timer_init(odp_timer *tim,
/* All pad fields need a defined and constant value */
TB_SET_PAD(*tb);
/* Release the timer by setting timer state to inactive */
+#if __GCC_ATOMIC_LLONG_LOCK_FREE < 2
+   tb->exp_tck.v = TMO_INACTIVE;
+#else
_odp_atomic_u64_store_mm(>exp_tck, TMO_INACTIVE, _ODP_MEMMODEL_RLS);
+#endif
 }
 
 /* Teardown when timer is freed */
@@ -253,7 +268,11 @@ static odp_timer_pool *odp_timer_pool_new(
tp->timers[i].queue = ODP_QUEUE_INVALID;
set_next_free(>timers[i], i + 1);
tp->timers[i].user_ptr = NULL;
+#if __GCC_ATOMIC_LLONG_LOCK_FREE < 2
+   tp->tick_buf[i].exp_tck.v = TMO_UNUSED;
+#else
odp_atomic_init_u64(>tick_buf[i].exp_tck, TMO_UNUSED);
+#endif
tp->tick_buf[i].tmo_buf = ODP_BUFFER_INVALID;
}
tp->tp_idx = tp_idx;
@@ -935,7 +954,11 @@ int odp_timeout_fresh(odp_timeout_t tmo)
odp_timer_pool *tp = handle_to_tp(hdl);
uint32_t idx = handle_to_idx(hdl, tp);
tick_buf_t *tb = >tick_buf[idx];
+#if __GCC_ATOMIC_LLONG_LOCK_FREE < 2
+   uint64_t exp_tck = tb->exp_tck.v;
+#else
uint64_t exp_tck = odp_atomic_load_u64(>exp_tck);
+#endif
/* Return true if the timer still has the same expiration tick
 * (ignoring the inactive/expired bit) as the timeout */
return hdr->expiration == (exp_tck & ~TMO_INACTIVE);
-- 
2.5.0

___
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp


[lng-odp] [PATCHv3 1/2] linux-generic: odp_atomic_internal.h: add 128-bit atomics

2016-05-17 Thread Ola Liljedahl
Add detection of availability of the -mcx16 compiler flag to
the configure script. This flag is necessary on x86-64 to enable
cpmxchg16.
Implement 128-bit atomics if natively supported by the platform.
128-bit atomics are used by linux-generic timer implementation
on certain targets (e.g. x86-64) for lock-free implementation.

Signed-off-by: Ola Liljedahl <ola.liljed...@linaro.org>
---
 configure.ac   | 13 +
 .../linux-generic/include/odp_atomic_internal.h| 62 ++
 platform/linux-generic/odp_timer.c |  4 +-
 3 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/configure.ac b/configure.ac
index c59d2d1..7cd6670 100644
--- a/configure.ac
+++ b/configure.ac
@@ -207,6 +207,19 @@ ODP_CFLAGS="$ODP_CFLAGS -std=c99"
 # Extra flags for example to suppress certain warning types
 ODP_CFLAGS="$ODP_CFLAGS $ODP_CFLAGS_EXTRA"
 
+#
+# Check if compiler supports cmpxchg16
+##
+my_save_cflags="$CFLAGS"
+CFLAGS=-mcx16
+AC_MSG_CHECKING([whether CC supports -mcx16])
+AC_COMPILE_IFELSE([AC_LANG_PROGRAM([])],
+   [AC_MSG_RESULT([yes])]
+   [ODP_CFLAGS="$ODP_CFLAGS -mcx16"],
+   [AC_MSG_RESULT([no])]
+)
+CFLAGS="$my_save_cflags"
+
 ##
 # Default include setup
 ##
diff --git a/platform/linux-generic/include/odp_atomic_internal.h 
b/platform/linux-generic/include/odp_atomic_internal.h
index 093280f..3c5606c 100644
--- a/platform/linux-generic/include/odp_atomic_internal.h
+++ b/platform/linux-generic/include/odp_atomic_internal.h
@@ -587,6 +587,68 @@ static inline void 
_odp_atomic_flag_clear(_odp_atomic_flag_t *flag)
__atomic_clear(flag, __ATOMIC_RELEASE);
 }
 
+/* Check if target and compiler supports 128-bit scalars and corresponding
+ * exchange and CAS operations */
+/* GCC on x86-64 needs -mcx16 compiler option */
+#if defined __SIZEOF_INT128__ && defined __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16
+
+/** Preprocessor symbol that indicates support for 128-bit atomics */
+#define ODP_ATOMIC_U128
+
+/** An unsigned 128-bit (16-byte) scalar type */
+typedef __int128 _uint128_t;
+
+/** Atomic 128-bit type */
+typedef struct {
+   _uint128_t v; /**< Actual storage for the atomic variable */
+} _odp_atomic_u128_t ODP_ALIGNED(16);
+
+/**
+ * 16-byte atomic exchange operation
+ *
+ * @param ptr   Pointer to a 16-byte atomic variable
+ * @param val   Pointer to new value to write
+ * @param old   Pointer to location for old value
+ * @param   mmodel Memory model associated with the exchange operation
+ */
+static inline void _odp_atomic_u128_xchg_mm(_odp_atomic_u128_t *ptr,
+   _uint128_t *val,
+   _uint128_t *old,
+   _odp_memmodel_t mm)
+{
+   __atomic_exchange(>v, val, old, mm);
+}
+
+/**
+ * Atomic compare and exchange (swap) of 16-byte atomic variable
+ * "Strong" semantics, will not fail spuriously.
+ *
+ * @param ptr   Pointer to a 16-byte atomic variable
+ * @param exp   Pointer to expected value (updated on failure)
+ * @param val   Pointer to new value to write
+ * @param succ  Memory model associated with a successful compare-and-swap
+ * operation
+ * @param fail  Memory model associated with a failed compare-and-swap
+ * operation
+ *
+ * @retval 1 exchange successful
+ * @retval 0 exchange failed and '*exp' updated with current value
+ */
+static inline int _odp_atomic_u128_cmp_xchg_mm(_odp_atomic_u128_t *ptr,
+  _uint128_t *exp,
+  _uint128_t *val,
+  _odp_memmodel_t succ,
+  _odp_memmodel_t fail)
+{
+   return __atomic_compare_exchange(>v, exp, val,
+   false/*strong*/, succ, fail);
+}
+#endif
+
+/**
+ * @}
+ */
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/platform/linux-generic/odp_timer.c 
b/platform/linux-generic/odp_timer.c
index 6b84309..41e7195 100644
--- a/platform/linux-generic/odp_timer.c
+++ b/platform/linux-generic/odp_timer.c
@@ -11,9 +11,7 @@
  *
  */
 
-/* Check if compiler supports 16-byte atomics. GCC needs -mcx16 flag on x86 */
-/* Using spin lock actually seems faster on Core2 */
-#ifdef ODP_ATOMIC_U128
+#if __SIZEOF_POINTER__ != 8
 /* TB_NEEDS_PAD defined if sizeof(odp_buffer_t) != 8 */
 #define TB_NEEDS_PAD
 #define TB_SET_PAD(x) ((x).pad = 0)
-- 
2.5.0

___
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp


[lng-odp] [PATCHv3 0/2] Enhanced and fixed atomics support

2016-05-17 Thread Ola Liljedahl
(This document/code contribution attached is provided under the terms of
agreement LES-LTM-21309)

Changes in v3:
Remove "contribution" sentence from commit messages.
Added reference to bug that was fixed.

Changes in v2:
Removed leading underscore from ODP_STATIC_ASSERT().

Enable and use (in linux-generic timer) 128-bit atomic support.
Fix a static assert (in linux-generic timer) when 64-bit atomics
are not supported.

Ola Liljedahl (2):
  linux-generic: odp_atomic_internal.h: add 128-bit atomics
  linux-generic: timer: fix failed static assert

 configure.ac   | 13 +
 .../linux-generic/include/odp_atomic_internal.h| 62 ++
 platform/linux-generic/odp_timer.c | 27 --
 3 files changed, 99 insertions(+), 3 deletions(-)

-- 
2.5.0

___
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp


Re: [lng-odp] [PATCHv2 0/2] Enhanced and fixed atomics support

2016-05-17 Thread Ola Liljedahl
Anders wanted me to send a v3 with the license contrib messages moved to
below the --- line or they will show up in the commit log.


On 17 May 2016 at 21:05, Bill Fischofer <bill.fischo...@linaro.org> wrote:

> For this series:
>
> Reviewed-and-tested-by: Bill Fischofer <bill.fischo...@linaro.org>
>
> On Tue, May 17, 2016 at 1:58 PM, Ola Liljedahl <ola.liljed...@linaro.org>
> wrote:
>
>> (This document/code contribution attached is provided under the terms of
>> agreement LES-LTM-21309)
>>
>> Enable and use (in linux-generic timer) 128-bit atomic support.
>> Fix a static assert (in linux-generic timer) when 64-bit atomics
>> are not supported.
>>
>> Ola Liljedahl (2):
>>   linux-generic: odp_atomic_internal.h: add 128-bit atomics
>>   linux-generic: timer: fix failed static assert
>>
>>  configure.ac   | 13 +
>>  .../linux-generic/include/odp_atomic_internal.h| 62
>> ++
>>  platform/linux-generic/odp_timer.c | 27 --
>>  3 files changed, 99 insertions(+), 3 deletions(-)
>>
>> --
>> 2.5.0
>>
>> ___
>> lng-odp mailing list
>> lng-odp@lists.linaro.org
>> https://lists.linaro.org/mailman/listinfo/lng-odp
>>
>
>
___
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp


[lng-odp] [PATCHv2 2/2] linux-generic: timer: fix failed static assert

2016-05-17 Thread Ola Liljedahl
(This document/code contribution attached is provided under the terms of
agreement LES-LTM-21309)

Ensure tick_buf_t structure is max 128 bits large on all platforms,
regardless of support for 64-bit atomic operations.
Only assert that tick_buf_t is 128 bits large when performing
atomic operations on it (requires ODP and platform support for 128
bit atomics).

Signed-off-by: Ola Liljedahl <ola.liljed...@linaro.org>
---
 platform/linux-generic/odp_timer.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/platform/linux-generic/odp_timer.c 
b/platform/linux-generic/odp_timer.c
index 41e7195..a6d3332 100644
--- a/platform/linux-generic/odp_timer.c
+++ b/platform/linux-generic/odp_timer.c
@@ -94,7 +94,15 @@ static odp_timeout_hdr_t *timeout_hdr(odp_timeout_t tmo)
  */
 
 typedef struct tick_buf_s {
+#if __GCC_ATOMIC_LLONG_LOCK_FREE < 2
+   /* No atomics support for 64-bit variables, will use separate lock */
+   /* Use the same layout as odp_atomic_u64_t but without lock variable */
+   struct {
+   uint64_t v;
+   } exp_tck;/* Expiration tick or TMO_xxx */
+#else
odp_atomic_u64_t exp_tck;/* Expiration tick or TMO_xxx */
+#endif
odp_buffer_t tmo_buf;/* ODP_BUFFER_INVALID if timer not active */
 #ifdef TB_NEEDS_PAD
uint32_t pad;/* Need to be able to access padding for successful CAS */
@@ -105,7 +113,10 @@ ODP_ALIGNED(16) /* 16-byte atomic operations need properly 
aligned addresses */
 #endif
 ;
 
+#if __GCC_ATOMIC_LLONG_LOCK_FREE >= 2
+/* Only assert this when we perform atomic operations on tick_buf_t */
 ODP_STATIC_ASSERT(sizeof(tick_buf_t) == 16, "sizeof(tick_buf_t) == 16");
+#endif
 
 typedef struct odp_timer_s {
void *user_ptr;
@@ -123,7 +134,11 @@ static void timer_init(odp_timer *tim,
/* All pad fields need a defined and constant value */
TB_SET_PAD(*tb);
/* Release the timer by setting timer state to inactive */
+#if __GCC_ATOMIC_LLONG_LOCK_FREE < 2
+   tb->exp_tck.v = TMO_INACTIVE;
+#else
_odp_atomic_u64_store_mm(>exp_tck, TMO_INACTIVE, _ODP_MEMMODEL_RLS);
+#endif
 }
 
 /* Teardown when timer is freed */
@@ -253,7 +268,11 @@ static odp_timer_pool *odp_timer_pool_new(
tp->timers[i].queue = ODP_QUEUE_INVALID;
set_next_free(>timers[i], i + 1);
tp->timers[i].user_ptr = NULL;
+#if __GCC_ATOMIC_LLONG_LOCK_FREE < 2
+   tp->tick_buf[i].exp_tck.v = TMO_UNUSED;
+#else
odp_atomic_init_u64(>tick_buf[i].exp_tck, TMO_UNUSED);
+#endif
tp->tick_buf[i].tmo_buf = ODP_BUFFER_INVALID;
}
tp->tp_idx = tp_idx;
@@ -935,7 +954,11 @@ int odp_timeout_fresh(odp_timeout_t tmo)
odp_timer_pool *tp = handle_to_tp(hdl);
uint32_t idx = handle_to_idx(hdl, tp);
tick_buf_t *tb = >tick_buf[idx];
+#if __GCC_ATOMIC_LLONG_LOCK_FREE < 2
+   uint64_t exp_tck = tb->exp_tck.v;
+#else
uint64_t exp_tck = odp_atomic_load_u64(>exp_tck);
+#endif
/* Return true if the timer still has the same expiration tick
 * (ignoring the inactive/expired bit) as the timeout */
return hdr->expiration == (exp_tck & ~TMO_INACTIVE);
-- 
2.5.0

___
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp

