Hi Li,
I’ll answer the questions in order:

1. Your guess is correct! The Hash Join may be used standalone (mostly in 
testing or benchmarking for now) or as part of the ExecNode. The ExecNode will 
pass the task to the Executor to be scheduled, or will run it immediately if 
it’s in sync mode (i.e. no executor). Our Hash Join benchmark uses OpenMP for 
scheduling instead: it passes the HashJoin a lambda that dispatches the tasks 
through OpenMP.

2. We might not have an executor if we want to execute synchronously. This is 
set during construction of the ExecContext, which is given to the ExecPlan 
during creation. If the ExecContext has a nullptr Executor, then we are in 
sync mode, otherwise we use the Executor to schedule. One confusing thing is 
that we also have a SerialExecutor - I’m actually not quite sure what the 
difference is between using that and setting the Executor to nullptr (might 
have something to do with testing?). @Weston probably knows.
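
To make the sync vs. executor distinction concrete, here is a minimal sketch of 
that rule. ToyExecutor and ScheduleTask are hypothetical names made up for this 
email, not Arrow classes or functions; the real code goes through the 
ExecContext and the scheduling callback in hash_join_node.cc.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for an executor; NOT Arrow's Executor class.
struct ToyExecutor {
  std::vector<std::future<void>> pending;

  // Hand the task to another thread and remember its future.
  void Spawn(std::function<void()> task) {
    pending.push_back(std::async(std::launch::async, std::move(task)));
  }

  // Block until every spawned task has finished.
  void WaitAll() {
    for (auto& f : pending) f.get();
    pending.clear();
  }
};

// Hypothetical helper mirroring the rule above: a null executor means
// "sync mode", so the task runs immediately on the calling thread.
void ScheduleTask(ToyExecutor* executor, std::function<void()> task) {
  assert(task && "task must be callable");
  if (executor == nullptr) {
    task();  // sync mode: run right here, right now
  } else {
    executor->Spawn(std::move(task));  // let the executor schedule it
  }
}
```

Calling ScheduleTask(nullptr, task) returns only after the task has run, while 
ScheduleTask(&executor, task) may return before it has.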

3. You can think of the TaskGroup as a “parallel for loop”. TaskImpl is the 
function that implements the work that needs to be split up, and 
TaskGroupContinuationImpl is what gets run once after the for loop has 
finished. TaskImpl receives the index of the thread running it and the index 
of the task. If you’re familiar with OpenMP, it’s equivalent to this:

#pragma omp parallel for
for(int i = 0; i < 100; i++)
    TaskImpl(omp_get_thread_num(), i);
TaskGroupContinuationImpl();

Examples of the two are here:
https://github.com/apache/arrow/blob/5a5d92928ccd438edf7ced8eae449fad05a7e71f/cpp/src/arrow/compute/exec/hash_join.cc#L416
https://github.com/apache/arrow/blob/5a5d92928ccd438edf7ced8eae449fad05a7e71f/cpp/src/arrow/compute/exec/hash_join.cc#L458
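
If you don’t have OpenMP at hand, the same “parallel for plus continuation” 
shape can be sketched with plain std::thread. This is only an illustration of 
the idea, not how the TaskScheduler is implemented: RunToyTaskGroup and its 
parameters are names invented for this example, and the real scheduler does 
not spawn its own threads like this.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper, not part of Arrow. Runs task_impl(thread_id, task_id)
// for every task_id in [0, num_tasks), then runs continuation exactly once.
void RunToyTaskGroup(
    std::size_t num_threads, std::size_t num_tasks,
    const std::function<void(std::size_t, std::size_t)>& task_impl,
    const std::function<void()>& continuation) {
  assert(num_threads > 0);
  std::atomic<std::size_t> next_task{0};
  std::vector<std::thread> workers;
  for (std::size_t thread_id = 0; thread_id < num_threads; ++thread_id) {
    workers.emplace_back([&, thread_id] {
      // Each worker claims the next unclaimed task index until none remain.
      for (std::size_t i = next_task.fetch_add(1); i < num_tasks;
           i = next_task.fetch_add(1)) {
        task_impl(thread_id, i);  // plays the role of TaskImpl
      }
    });
  }
  for (auto& worker : workers) worker.join();  // end of the "for loop"
  continuation();  // plays the role of TaskGroupContinuationImpl
}
```

A typical use is to have the task body process one slice of the input per task 
index and let the continuation kick off the next stage once all slices are 
done.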

Sasha

> On Apr 25, 2022, at 8:35 AM, Li Jin <ice.xell...@gmail.com> wrote:
> 
> Hello!
> 
> I am reading the use of TaskScheduler inside C++ compute code (reading hash
> join) and have some questions about it, in particular:
> 
> (1) What is the purpose of SchedulerTaskCallback defined here:
> https://github.com/apache/arrow/blob/5a5d92928ccd438edf7ced8eae449fad05a7e71f/cpp/src/arrow/compute/exec/hash_join_node.cc#L428
> (My guess is that the caller of TaskScheduler::StartTaskGroup needs to
> provide an implementation of a task executor, and the implementation of
> SchedulerTaskCallback inside hash_join_node.cc is just a vanilla
> implementation)
> 
> (2) When would this task context not have an executor?
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/exec/hash_join_node.cc#L581
> 
> (3) What's the difference between TaskImpl and TaskGroupContinuationImpl in
> TaskScheduler::RegisterTaskGroup? And how would one normally define
> TaskGroupContinuationImpl?
> 
> Sorry I am still learning the Arrow compute internals and appreciate help
> on understanding these.
> 
> Li
