On Fri, 27 Apr 2018 16:41:42 +0000
Jianfeng Tan <[email protected]> wrote:
> Below commit introduced pthread barrier for synchronization.
> But two IPC threads block on the barrier, and never wake up.
>
> (gdb) bt
> #0 futex_wait (private=0, expected=0, futex_word=0x7fffffffcff4)
> at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> #1 futex_wait_simple (private=0, expected=0, futex_word=0x7fffffffcff4)
> at ../sysdeps/nptl/futex-internal.h:135
> #2 __pthread_barrier_wait (barrier=0x7fffffffcff0) at
> pthread_barrier_wait.c:184
> #3 rte_thread_init (arg=0x7fffffffcfe0)
> at ../dpdk/lib/librte_eal/common/eal_common_thread.c:160
> #4 start_thread (arg=0x7ffff6ecf700) at pthread_create.c:333
> #5 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>
> Through analysis, we find the barrier defined on the stack could be the
> root cause. This patch will change to use heap memory as the barrier.
>
> Fixes: d651ee4919cd ("eal: set affinity for control threads")
>
> Cc: Olivier Matz <[email protected]>
> Cc: Anatoly Burakov <[email protected]>
>
> Signed-off-by: Jianfeng Tan <[email protected]>
> ---
> lib/librte_eal/common/eal_common_thread.c | 20 +++++++++++++-------
> 1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/lib/librte_eal/common/eal_common_thread.c
> b/lib/librte_eal/common/eal_common_thread.c
> index 4e75cb8..da2b84f 100644
> --- a/lib/librte_eal/common/eal_common_thread.c
> +++ b/lib/librte_eal/common/eal_common_thread.c
> @@ -166,17 +166,21 @@ rte_ctrl_thread_create(pthread_t *thread, const char
> *name,
> const pthread_attr_t *attr,
> void *(*start_routine)(void *), void *arg)
> {
> - struct rte_thread_ctrl_params params = {
> - .start_routine = start_routine,
> - .arg = arg,
> - };
> + struct rte_thread_ctrl_params *params;
> unsigned int lcore_id;
> rte_cpuset_t cpuset;
> int cpu_found, ret;
>
> - pthread_barrier_init(¶ms.configured, NULL, 2);
> + params = malloc(sizeof(*params));
> + if (!params)
> + return -1;
> +
> + params->start_routine = start_routine;
> + params->arg = arg;
>
> - ret = pthread_create(thread, attr, rte_thread_init, (void *)¶ms);
> + pthread_barrier_init(¶ms->configured, NULL, 2);
> +
> + ret = pthread_create(thread, attr, rte_thread_init, (void *)params);
> if (ret != 0)
> return ret;
>
> @@ -203,12 +207,14 @@ rte_ctrl_thread_create(pthread_t *thread, const char
> *name,
> if (ret < 0)
> goto fail;
>
> - pthread_barrier_wait(¶ms.configured);
> + pthread_barrier_wait(¶ms->configured);
> + free(params);
>
> return 0;
>
> fail:
> pthread_cancel(*thread);
> pthread_join(*thread, NULL);
> + free(params);
> return ret;
> }
This looks like a library bug. If there is a race on the configured barrier,
then
putting on heap is just moving problem. It still has bug where other thread is
referring to freed memory.