westonpace commented on PR #13426:
URL: https://github.com/apache/arrow/pull/13426#issuecomment-1170685688

   Some initial questions; I'll take a more detailed look soon.
   
   > Here is a very primitive version of our Asof Join Benchmarks (asof_join_benchmark.cc). Our main goal is to benchmark on four qualities: the effect of table density (the frequency of rows, e.g. a row every 2s as opposed to every 1h over some time range), table width (# of columns), tids (# of keys), and multi-table joins. We also have a baseline comparison benchmark with hash joins (which is currently in this file).
   
   * Does the table density get applied uniformly over all inputs?  In other words, do you worry about cases where one input is very dense and the others are not so dense?
   * When you say multi-table joins, how many tables are you testing?  Or is that a parameter?
   * # of columns and # of keys are good.  Eventually, I would think, you will need to worry about data types as well (probably more for payload columns than for key columns).
   
   > I think this needs some work before it goes into arrow. We currently run this benchmark by generating .feather files with Python via bamboo-streaming's datagen.py to represent each table, and then reading them in through cpp (see make_arrow_ipc_reader_node). We perhaps want to write a utility that allows us to do this in cpp, while varying many of the metrics I've mentioned above, or to find a way to generate those files as part of the benchmark.
   
   Another potential approach is to create a custom source node that generates data.  We do something like this for our TPC-H benchmark.  This allows us to isolate the scanning from the actual compute operations.  However, this approach more or less requires you to write the data generator in C++, which is probably not ideal.
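   For illustration, here is a minimal sketch of such a generator in C++ (MakeTestTable and its parameters are hypothetical, not from the PR).  It builds a fixed-frequency "time" column plus N random payload columns, which could replace the datagen.py/.feather round trip:
   
   ```
   #include <memory>
   #include <random>
   #include <string>
   #include <vector>
   
   #include <arrow/api.h>
   
   // Hypothetical helper: builds a table with a fixed-frequency "time" column
   // plus num_payload_columns random double columns.
   arrow::Result<std::shared_ptr<arrow::Table>> MakeTestTable(
       int64_t num_rows, int64_t row_interval_seconds, int num_payload_columns) {
     arrow::Int64Builder time_builder;
     ARROW_RETURN_NOT_OK(time_builder.Reserve(num_rows));
     for (int64_t i = 0; i < num_rows; ++i) {
       time_builder.UnsafeAppend(i * row_interval_seconds);
     }
     ARROW_ASSIGN_OR_RAISE(auto time_array, time_builder.Finish());
   
     std::vector<std::shared_ptr<arrow::Field>> fields = {
         arrow::field("time", arrow::int64())};
     std::vector<std::shared_ptr<arrow::Array>> columns = {time_array};
   
     std::mt19937_64 rng(42);  // fixed seed for reproducible benchmark data
     std::uniform_real_distribution<double> dist(0.0, 1.0);
     for (int c = 0; c < num_payload_columns; ++c) {
       arrow::DoubleBuilder payload_builder;
       ARROW_RETURN_NOT_OK(payload_builder.Reserve(num_rows));
       for (int64_t i = 0; i < num_rows; ++i) {
         payload_builder.UnsafeAppend(dist(rng));
       }
       ARROW_ASSIGN_OR_RAISE(auto payload_array, payload_builder.Finish());
       fields.push_back(arrow::field("f" + std::to_string(c), arrow::float64()));
       columns.push_back(payload_array);
     }
     return arrow::Table::Make(arrow::schema(fields), columns, num_rows);
   }
   ```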
   
   > There are also quite a large number of BENCHMARK_CAPTURE statements, as an immediate workaround to some limitations in Google Benchmark. I haven't found a great non-verbose way to pass in the parameters needed (strings and vectors) while also having readable titles and details about the benchmark being written to the output file. Let me know if you have any advice about this / know someone who does.
   
   I'll think about this when I take a closer look.  Google Benchmark is not necessarily the perfect tool for every situation, but maybe there is something we can do.
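   One thing that might help (a sketch only, untested against this PR; AsofJoinCase and all the names below are made up): Google Benchmark's RegisterBenchmark accepts a runtime-built name and a lambda, so readable titles can be generated in a loop rather than spelled out with one BENCHMARK_CAPTURE per combination:
   
   ```
   #include <string>
   #include <vector>
   
   #include <benchmark/benchmark.h>
   
   // Hypothetical parameter bundle for one benchmark case.
   struct AsofJoinCase {
     std::string name;  // becomes the benchmark's reported title
     int num_tables;
     int num_columns;
     int64_t row_interval_seconds;
   };
   
   static void RunAsofJoinCase(benchmark::State& state, const AsofJoinCase& c) {
     for (auto _ : state) {
       // ... build the plan and run the join for this case ...
       benchmark::DoNotOptimize(c.num_columns);
     }
   }
   
   int main(int argc, char** argv) {
     std::vector<AsofJoinCase> cases = {
         {"asof_join/2_tables/20_cols/1s", 2, 20, 1},
         {"asof_join/4_tables/100_cols/2s", 4, 100, 2},
     };
     for (const auto& c : cases) {
       // RegisterBenchmark takes a runtime string, so readable names can be
       // built in a loop; the lambda captures the full parameter struct.
       benchmark::RegisterBenchmark(
           c.name.c_str(),
           [c](benchmark::State& state) { RunAsofJoinCase(state, c); });
     }
     benchmark::Initialize(&argc, argv);
     benchmark::RunSpecifiedBenchmarks();
     return 0;
   }
   ```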
   
   > While being processed, is a single source node dedicated a single thread?
   
   No.
   
   > How many threads can call InputReceived of the following node at once?
   
   That is mostly determined by the capacity of the executor, which for the CPU thread pool defaults to std::thread::hardware_concurrency (e.g. one thread per core, or two per core if you have hyperthreading).  At the moment the number of concurrent calls can be even greater than this, but that is an issue we are hoping to fix soon (limiting these calls to CPU thread pool calls only).
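   For reference, a minimal sketch (assuming only that Arrow's headers are available) that prints the default capacity; on most machines the two numbers match:
   
   ```
   #include <iostream>
   #include <thread>
   
   #include <arrow/util/thread_pool.h>
   
   int main() {
     // The CPU thread pool's default capacity follows the hardware concurrency.
     std::cout << "std::thread::hardware_concurrency(): "
               << std::thread::hardware_concurrency() << "\n";
     std::cout << "arrow::GetCpuThreadPoolCapacity(): "
               << arrow::GetCpuThreadPoolCapacity() << "\n";
     return 0;
   }
   ```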
   
   > I was also wondering if you could clarify how the arrow threading engine 
would work for a node that has multiple inputs (an asof join / hash join 
ingesting from multiple source nodes, for example).
   
   Even if the asof join node had a single source node, its InputReceived could still be called many times.  You can think of a source node as a parallel while loop across all the batches:
   
   ```
   // Conceptual sketch: each batch becomes a task on the shared executor, so
   // the downstream node's InputReceived may run on many threads at once.
   while (!source.empty()) {
     exec_context->executor()->Spawn([this] {
       ExecBatch next_batch = ReadNext();
       output->InputReceived(next_batch);
     });
   }
   ```
   
   If there are multiple sources, they are still all submitting tasks to the same common thread pool, so you won't see any more threads.  Also, in many unit tests and small use cases the source doesn't have enough data for more than one task, so you often won't see the full scale of parallelism until you test larger datasets.
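   To make that concrete, here is a toy model (plain std::thread rather than Arrow's scheduler; every name in it is made up for illustration) where two "sources" feed the same downstream InputReceived concurrently:
   
   ```
   #include <atomic>
   #include <cstdio>
   #include <thread>
   #include <vector>
   
   // Tracks how many InputReceived calls are running at this instant.
   std::atomic<int> in_flight{0};
   
   void InputReceived(int source_id, int batch_index) {
     int now = ++in_flight;
     std::printf("source %d batch %d (concurrent calls: %d)\n", source_id,
                 batch_index, now);
     --in_flight;
   }
   
   int main() {
     std::vector<std::thread> workers;
     for (int source_id = 0; source_id < 2; ++source_id) {
       // Each "source" submits its batches independently, so calls interleave.
       workers.emplace_back([source_id] {
         for (int batch = 0; batch < 4; ++batch) InputReceived(source_id, batch);
       });
     }
     for (auto& t : workers) t.join();
     return 0;
   }
   ```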
   
   There are some changes planned for the scheduler, but most of what I said above will remain more or less true.  The future scheduler could potentially prioritize one source above others (for example, it often makes sense for a hash-join node to prioritize the build-side input), so that is something to consider (for the as-of join node you probably want to read from all sources evenly, I think).

