Hi Arrow developers, Antoine,

I'd like to kick off a discussion of the threading engine that Arrow could use underneath to implement multicore parallelism for execution nodes, kernels, and any other functions that can be optimized this way. I've documented some ideas on Arrow's Confluence wiki:
https://cwiki.apache.org/confluence/display/ARROW/Parallel+Execution+Engine

The bottom line is that while Arrow is moving in the right direction by introducing a shared thread pool, there are questions and concerns about the current implementation and about how it is supposed to coexist with other threaded libraries ("threading composability") while providing efficient, nestable, NUMA- and cache-aware data and data-flow parallelism. I suggest introducing threading layers, as libraries like MKL and Numba do, starting with a TBB-based layer, or maybe even using TBB directly. In short, the arguments for it are as follows (small code sketches for items 1-6 follow the list):

1. Designed for composability from day zero: TBB avoids mandatory parallelism, provides work stealing and FIFO scheduling, and is compatible with parallel depth-first scheduling, a promising line of composability research (see the nesting sketch after this list).

2. TBB Flow Graph fits nicely into the data-flow and execution-node model of SQL databases. Besides the basic nodes needed to implement an execution engine, it also provides a foundation for heterogeneous and distributed computing (async_node, opencl_node, distributed_node). A toy graph follows the list.

3. Arrow's ThreadPool, TaskGroup, and ParallelFor have direct equivalents in TBB: task_arena, task_group, and parallel_for (see the mapping sketch below). TBB brings a mature, performant implementation that resolves many, if not all, of the XXX/TODO notes in the comments, e.g. around exceptions, singletons and time of initialization, and lock-free code.

4. Concurrent hash tables, queues, vectors, and other concurrent containers. Concurrent hash tables are required for implementing parallel versions of joins, group-by, unique, and dictionary operations (see the group-by sketch below). There is also a contribution that integrates libcuckoo under the TBB interface.

5. TBB scalable malloc and memory pools, which can use any user-provided memory chunk for scalable allocation (sketch below). Arrow uses jemalloc, which is slower in some cases than tbbmalloc or tcmalloc.

6. OpenMP handles NUMA well with a static schedule, but it has no good answer for dynamic tasks and graphs. TBB provides the tools for implementing NUMA support: task_arena, task_scheduler_observer, task affinity and priorities (pinning sketch below), and the TBB team has committed to improving NUMA support for its other customers in 2019.

7. TBB is licensed under Apache 2.0, has a conda-forge feedstock, supports CMake, has been adopted for CPU scheduling by other industry players, and has multiple ports to other OSes and CPU architectures.
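To make item 1 concrete, here is a minimal nesting sketch (plain classic-TBB API; the column/table shape is made up for illustration). Because every parallel_for shares one work-stealing scheduler, inner loops borrow the same worker threads instead of spawning their own, so nesting does not oversubscribe the machine:

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <vector>

    // Inner parallelism: a hypothetical per-column kernel.
    void process_column(std::vector<double>& col) {
      tbb::parallel_for(tbb::blocked_range<size_t>(0, col.size()),
                        [&](const tbb::blocked_range<size_t>& r) {
        for (size_t i = r.begin(); i != r.end(); ++i) col[i] *= 2.0;
      });
    }

    // Outer parallelism over columns: the nested parallel_for calls
    // steal from the same thread pool, so there is no oversubscription.
    void process_table(std::vector<std::vector<double>>& table) {
      tbb::parallel_for(size_t(0), table.size(),
                        [&](size_t c) { process_column(table[c]); });
    }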
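For item 2, a toy flow graph shaped like a tiny execution plan, scan -> filter -> aggregate. The node bodies are placeholders, not Arrow APIs, and this uses the classic-TBB source_node (later renamed input_node in oneTBB):

    #include <tbb/flow_graph.h>
    #include <cstdio>

    int main() {
      tbb::flow::graph g;
      int next = 0;
      // "Scan": emits batch ids 0..7, then stops.
      tbb::flow::source_node<int> scan(g, [&](int& batch) {
        if (next >= 8) return false;
        batch = next++;
        return true;
      }, /*is_active=*/false);
      // "Filter": unlimited concurrency, any number of batches in flight.
      tbb::flow::function_node<int, int> filter(
          g, tbb::flow::unlimited, [](int batch) { return batch * 10; });
      // "Aggregate": serial, a natural spot for the final reduction.
      tbb::flow::function_node<int, tbb::flow::continue_msg> agg(
          g, tbb::flow::serial, [](int batch) {
            std::printf("aggregated batch %d\n", batch);
            return tbb::flow::continue_msg();
          });
      tbb::flow::make_edge(scan, filter);
      tbb::flow::make_edge(filter, agg);
      scan.activate();
      g.wait_for_all();
      return 0;
    }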
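For item 3, the rough correspondence on one screen (the Arrow-to-TBB mapping in the comments is my reading of the current code, simplified):

    #include <tbb/task_arena.h>
    #include <tbb/task_group.h>
    #include <tbb/parallel_for.h>

    int main() {
      tbb::task_arena arena(4);   // ~ arrow ThreadPool limited to 4 threads
      tbb::task_group tg;         // ~ arrow TaskGroup
      arena.execute([&] {
        tg.run([] { /* async task A */ });
        tg.run([] { /* async task B */ });
        // ~ arrow ParallelFor, running inside the same arena:
        tbb::parallel_for(0, 1000, [](int i) { /* process chunk i */ });
        tg.wait();                // ~ TaskGroup::Finish()
      });
      return 0;
    }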
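For item 4, a naive parallel group-by count over a key column with tbb::concurrent_hash_map. This is a sketch only; a production kernel would likely combine per-thread partial tables first, but it shows the concurrent container doing the heavy lifting:

    #include <tbb/concurrent_hash_map.h>
    #include <tbb/parallel_for.h>
    #include <cstdint>
    #include <vector>

    using CountMap = tbb::concurrent_hash_map<int64_t, int64_t>;

    void group_by_count(const std::vector<int64_t>& keys, CountMap& counts) {
      tbb::parallel_for(size_t(0), keys.size(), [&](size_t i) {
        CountMap::accessor a;       // holds a write lock on the entry
        counts.insert(a, keys[i]);  // creates {key, 0} if not present
        a->second += 1;
      });
    }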
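For item 5, the user-provided-chunk pool; note tbb::fixed_pool is a preview feature in classic TBB, hence the macro:

    #define TBB_PREVIEW_MEMORY_POOLS 1
    #include <tbb/memory_pool.h>

    int main() {
      static char buf[1 << 20];   // any user-provided chunk, e.g. carved
                                  // out of an Arrow MemoryPool allocation
      tbb::fixed_pool pool(buf, sizeof(buf));
      void* p = pool.malloc(256); // scalable allocation served from buf
      pool.free(p);
      pool.recycle();             // return everything to the chunk at once
      return 0;
    }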
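And for item 6, the usual pinning pattern: one task_arena per NUMA node plus a task_scheduler_observer bound to it that pins workers as they join. The affinity call is Linux-specific, the CPU numbering for node 0 is assumed, and older TBBs need the preview macro for arena-bound observers:

    #define TBB_PREVIEW_LOCAL_OBSERVER 1  // for arena-bound observers on older TBBs
    #include <tbb/task_arena.h>
    #include <tbb/task_scheduler_observer.h>
    #include <pthread.h>
    #include <sched.h>

    // Pins every worker entering the given arena to a fixed CPU set.
    class pinning_observer : public tbb::task_scheduler_observer {
      cpu_set_t cpus_;
     public:
      pinning_observer(tbb::task_arena& a, const cpu_set_t& cpus)
          : tbb::task_scheduler_observer(a), cpus_(cpus) { observe(true); }
      void on_scheduler_entry(bool) override {
        pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpus_);
      }
    };

    int main() {
      tbb::task_arena node0(8);           // one arena per NUMA node
      cpu_set_t cpus;
      CPU_ZERO(&cpus);
      for (int c = 0; c < 8; ++c) CPU_SET(c, &cpus);  // CPUs of node 0 (assumed)
      pinning_observer pin(node0, cpus);
      node0.execute([] { /* NUMA-local work */ });
      return 0;
    }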
Full disclosure: I was a TBB developer before its 1.0 release, responsible for several core components, such as the hash tables, adaptive partitioning, and the interfaces of the memory pools and task_arena, all of which are very relevant to Arrow. I have a background in scalability and NUMA-aware performance optimization, e.g. the work we did for the TBB-based OpenCL runtime for CPU, and I was behind the optimizations for Intel Distribution for Python and its threading composability story<https://software.intel.com/en-us/blogs/2016/04/04/unleash-parallel-performance-of-python-programs>. Thus, I sincerely hope to reuse all this work to deliver the best performance for Arrow.

Best regards,
Anton Malakhov<http://www.linkedin.com/in/antonmalakhov>
IAGS Scripting Analyzers & Tools
O: +1-512-3620-512
1300 S. MoPac Expy | Office: AN4-C1-D4 | Austin, TX 78746
Intel Corporation | www.intel.com