I would caution against committing to the MKL/BLAS model, in which the
library creates threads internally. It's a disaster for managing
oversubscription and affinity issues among groups of threads and/or
multiple processes (e.g., MPI). For example, a composable OpenMP
technique is for the caller who wants threaded execution to create the
thread team and explicitly lend it to the library.
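For instance, a minimal compilable sketch of that pattern (assuming
OpenMP 4.5 for taskloop; library_calls(), the doubling kernel, and the
grainsize are illustrative stand-ins, not a real API):

  // Library side: express the work as tasks so that it runs on whatever
  // team the caller provides, and degrades gracefully to a single
  // thread if the caller provides none.
  void library_calls(double *data, int n) {
    #pragma omp taskloop grainsize(1024)
    for (int i = 0; i < n; ++i)
      data[i] *= 2.0;  // stand-in for real work
  }

  int main() {
    enum { N = 1 << 20 };
    static double data[N];
    // Caller side: the application creates the thread team and
    // explicitly lends it to the library. Only one thread enters
    // library_calls(); the rest of the team waits at the implicit
    // barrier and picks up the tasks that taskloop generates.
    #pragma omp parallel
    #pragma omp single
    {
      library_calls(data, N);
    }
    return 0;
  }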
The library is then free to use constructs like omp taskgroup/taskloop
as granularity warrants; it will never utilize threads that the
application didn't explicitly give it.

I know Intel has put work into making TBB and OpenMP thread pools
interoperate, though this historically hasn't worked cleanly on other
platforms and may therefore be a liability for Arrow to consider.

"Malakhov, Anton" <anton.malak...@intel.com> writes:

> Thanks Wes!
>
> Sounds like a good way to go! We'll create a demo, as you suggested,
> implementing a parallel execution model for a simple analytics
> pipeline that reads and processes the files. My only concern is about
> adding more pipeline-breaker nodes and compute-intensive operations to
> this demo, because min/max are effectively no-ops fused into the I/O
> scan node. What do you think about adding group-by into this picture,
> effectively implementing the NY taxi and/or mortgage benchmarks?
> Ideally, I'd like to go even further and add scikit-learn-like
> processing of that data in order to demonstrate the coexistence side
> of the story. What do you think?
> So, the idea of the prototype is to validate the parallel execution
> model as the first step. After that, it will help shape the API for
> both the execution nodes and the threading backend. Does that sound
> right to you?
>
> P.S. I can well understand your hesitation about using TBB directly
> and as a non-optional dependency, which is why I'm suggesting the
> threading-layers approach here. Let me clarify: using TBB and nested
> parallelism is not a goal by itself. The goal is to build the
> components of an efficient execution model that coexist well with each
> other and with all the other components of an application external to
> Arrow. However, without a rich, composable, and mature parallel
> toolkit, it is hard to achieve this goal or even to focus on it. Thus,
> I wanted to check with the community whether this is an acceptable way
> forward at all, and what the roadmap is.
>
> Thanks,
> // Anton
>
>
> -----Original Message-----
> From: Wes McKinney [mailto:wesmck...@gmail.com]
> Sent: Thursday, May 2, 2019 13:52
> To: dev@arrow.apache.org
> Subject: Re: [DISCUSS][C++][Proposal] Threading engine for Arrow
>
> hi Anton,
>
> Thank you for bringing your expertise to the project -- this is a very
> useful discussion to have.
>
> Part of why the threading capabilities in the project are not further
> developed is that there has not been much that needs to be
> parallelized. It would be like designing a supercharger when you don't
> have a car yet. That being said, it is worthwhile to plan ahead so we
> aren't trying to retrofit significant pieces of software to be able to
> take advantage of a more advanced task scheduler.
>
> From my perspective, we have a few key practical areas of
> consideration:
>
> * Computational tasks that may offer nested parallelism (e.g. an
> Aggregation or Projection task may be able to execute in multiple
> threads)
> * IO operations performed from within tasks that appear to be
> computational in nature (example: in the course of reading a Parquet
> file, both computation -- decoding, decompression -- and IO -- local
> or remote filesystem operations -- must be performed). The status quo
> is that IO performed inside a task on the thread pool does not release
> any resources to other tasks.
>
> I believe that we should design and develop a sane programming model /
> API for implementing our software in the presence of these challenges.
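To make that second bullet concrete, here is a toy sketch (not Arrow
code) of keeping a CPU worker busy while a read is in flight; ReadBlock
stands in for a high-latency filesystem call, and std::async crudely
plays the role of a dedicated IO thread that a real scheduler would
manage:

  #include <chrono>
  #include <cstdint>
  #include <cstdio>
  #include <future>
  #include <thread>
  #include <vector>

  // Stand-in for a high-latency filesystem read (e.g. HDFS).
  std::vector<uint8_t> ReadBlock() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return std::vector<uint8_t>(4096, 0);
  }

  int main() {
    // Launch the IO leg on its own thread so no CPU worker blocks on it.
    auto pending = std::async(std::launch::async, ReadBlock);

    // Keep doing useful CPU work until the bytes arrive. A composable
    // scheduler would instead return this worker to the pool and resume
    // the decode as a continuation once the IO completes.
    while (pending.wait_for(std::chrono::seconds(0)) !=
           std::future_status::ready) {
      std::puts("decoding already-buffered data...");
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    std::vector<uint8_t> bytes = pending.get();
    std::printf("read %zu bytes; decode/decompress next\n", bytes.size());
    return 0;
  }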
> If the backend / implementation of this API uses TBB and that makes
> things more efficient than other approaches, then that sounds great to
> me. I would be hesitant to use TBB APIs directly in Arrow application
> code unless it can be clearly demonstrated that this is superior to
> the alternatives.
>
> It seems useful to validate the implementation approach by starting
> with some practical problems. Suppose, for the sake of argument, you
> want to read 10 Parquet files (constituting a single logical dataset)
> as fast as possible and perform some simple analytics on them -- let's
> take something very simple like computing the maximum and minimum
> values of each column in the dataset. This scenario features both of
> the problems listed above:
>
> * Reading a single Parquet file can be parallelized (by columns, since
> columns can be decoded independently) on the global thread pool, so
> reading multiple files in parallel would cause nested parallelism
> * Within the context of reading a single Parquet file column, IO calls
> are performed. CPU threads sit idle while this IO is taking place,
> particularly if the file system is high-latency (e.g. HDFS)
>
> What do you think about -- as a way of moving this project forward --
> developing a prototype threading backend and developer API (for people
> like me to use to develop libraries like the Parquet library) that
> addresses these issues? I think it would be difficult to build
> consensus around a threading backend developed in the abstract.
>
> Thanks
> Wes
>
> On Tue, Apr 30, 2019 at 9:28 PM Malakhov, Anton
> <anton.malak...@intel.com> wrote:
>>
>> Hi dear Arrow developers, Antoine,
>>
>> I'd like to kick off the discussion of the threading engine that
>> Arrow can use underneath for implementing multicore parallelism in
>> execution nodes, kernels, and any other functions that can be
>> optimized this way. I've documented some ideas on Arrow's Confluence
>> Wiki:
>> https://cwiki.apache.org/confluence/display/ARROW/Parallel+Execution+Engine
>> The bottom line is that while Arrow is moving in the right direction
>> by introducing a shared thread pool, there are questions and concerns
>> about the current implementation and about how it is supposed to
>> coexist with other threaded libraries ("threading composability")
>> while providing efficient, nestable, NUMA- and cache-aware data and
>> data-flow parallelism. I suggest introducing threading layers, as in
>> other libraries like MKL and Numba, starting with a TBB-based layer
>> -- or maybe even using TBB directly. In short, the arguments for it
>> are:
>>
>> 1. Designed for composability from day zero. Avoids mandatory
>> parallelism. Provides work stealing and FIFO scheduling. Compatible
>> with parallel depth-first scheduling (research into better
>> composability).
>>
>> 2. TBB Flow Graph. It fits nicely into the data-flow and
>> execution-node model of SQL databases. Besides the basic nodes needed
>> to implement an execution engine, it also provides a foundation for
>> heterogeneous and distributed computing (async_node, opencl_node,
>> distributed_node).
>>
>> 3. Arrow's ThreadPool, TaskGroup, and ParallelFor have direct
>> equivalents in TBB -- task_arena, task_group, and parallel_for --
>> which provide a mature and performant implementation that resolves
>> many if not all of the XXX/TODO notes in the comments (exceptions,
>> singletons and time of initialization, lock-freedom); a sketch of
>> this mapping appears at the end of this message.
>>
>> 4. Concurrent hash tables, queues, vectors, and other concurrent
>> containers.
>> Hash tables are required for implementing parallel versions of
>> joins, group-by, unique, and dictionary operations. There is a
>> contribution to integrate libcuckoo under the TBB interface.
>>
>> 5. TBB's scalable malloc and memory pools, which can use any
>> user-provided memory chunk for scalable allocation. Arrow uses
>> jemalloc, which is slower in some cases than tbbmalloc or tcmalloc.
>>
>> 6. OpenMP is good for NUMA with a static schedule; however, it has no
>> good answer for dynamic tasks and graphs. TBB provides tools for
>> implementing NUMA support -- task_arena, task_scheduler_observer,
>> task affinity & priorities -- and is committed to improving NUMA
>> support for its other customers in 2019.
>>
>> 7. TBB is licensed under Apache 2.0, has a conda-forge feedstock,
>> supports CMake, is adopted for CPU scheduling by other industry
>> players, and has ports to other OSes and CPU architectures.
>>
>> Full disclosure: I was a TBB developer before its 1.0 version,
>> responsible for multiple core components (hash tables, adaptive
>> partitioning, the interfaces of memory pools and task_arena), all of
>> which are very relevant to Arrow. I have a background in scalability
>> and NUMA-aware performance optimization, such as the work we did for
>> the TBB-based OpenCL runtime for CPU. I was also behind the
>> optimizations in the Intel Distribution for Python and its threading
>> composability
>> story<https://software.intel.com/en-us/blogs/2016/04/04/unleash-parallel-performance-of-python-programs>.
>> Thus, I sincerely hope to reuse all of this work to deliver the best
>> performance for Arrow.
>>
>>
>> Best regards,
>> Anton Malakhov<http://www.linkedin.com/in/antonmalakhov>
>> IAGS Scripting Analyzers & Tools
>>
>> O: +1-512-3620-512
>> 1300 S. MoPac Expy
>> Office: AN4-C1-D4
>> Austin, TX 78746
>> Intel Corporation | www.intel.com
>>
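To make point 3 above concrete, here is a minimal sketch of that
mapping, applied to the nested file/column parallelism in Wes's Parquet
example. It assumes only that TBB is available; DecodeColumn,
kNumFiles, and kNumColumns are illustrative stand-ins, not Arrow or
Parquet APIs:

  #include <tbb/parallel_for.h>
  #include <tbb/task_arena.h>
  #include <tbb/task_group.h>
  #include <cstdio>

  // Illustrative stand-in for decoding one column chunk of one file.
  void DecodeColumn(int file, int column) {
    std::printf("file %d, column %d\n", file, column);
  }

  int main() {
    const int kNumFiles = 10;
    const int kNumColumns = 8;

    // task_arena plays the role of Arrow's ThreadPool: it caps
    // concurrency for everything submitted into it, no matter how
    // deeply the work nests.
    tbb::task_arena arena(/*max_concurrency=*/8);
    arena.execute([&] {
      // task_group plays the role of Arrow's TaskGroup: one task per
      // file.
      tbb::task_group files;
      for (int f = 0; f < kNumFiles; ++f) {
        files.run([f] {
          // parallel_for plays the role of Arrow's ParallelFor: decode
          // this file's columns in parallel. Nesting is safe -- TBB's
          // work stealing schedules both levels on one pool without
          // oversubscription.
          tbb::parallel_for(0, kNumColumns,
                            [f](int c) { DecodeColumn(f, c); });
        });
      }
      files.wait();
    });
    return 0;
  }

Because both levels run on the same work-stealing pool, ten files times
eight columns never occupies more than the arena's eight threads --
which is exactly the composability property under discussion.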