westonpace commented on a change in pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#discussion_r804328939
##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -151,6 +151,7 @@ class HashJoinBasicImpl : public HashJoinImpl {
}
void InitLocalStateIfNeeded(size_t thread_index) {
+ DCHECK_LT(thread_index, local_states_.size());
ThreadLocalState& local_state = local_states_[thread_index];
Review comment:
> Even with Sys.setenv(OMP_THREAD_LIMIT = "1") this still occurs.
That isn't too surprising. `use_threads` triggers an entirely different
path in some places. So it is not entirely equivalent to `OMP_THREAD_LIMIT =
"1"`.
> I also tried writing a C++ unit test that did a join after a dataset scan,
> but I couldn't reproduce the problem. That leads me to think there may be some
> issue with how the R bindings are configuring things, but it could also be I
> just didn't reproduce it quite well enough.
How consistent is the R error?
> Despite use_threads = FALSE, it seems like there are quite a few threads
> spawned by the engine. While I'm learning, I'm just not familiar enough to know
> which parts seem weird.
`use_threads` generally does not control the I/O thread pool (which defaults
to 8 threads and is not controlled by `OMP_THREAD_LIMIT`). If someone were
really passionate about shoving everything onto the calling thread, there
is a way to do this, but it would be quite a bit of work.
In addition, jemalloc (if compiled in) will spawn some background cleanup
threads.
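To make the concern behind the added `DCHECK_LT` concrete, here is a minimal
sketch of the failure mode it guards against. This is not the Arrow
implementation: `LocalStates` and `InitIfNeeded` are hypothetical names, and
the assumption is only that per-thread state lives in a vector sized for the
threads the caller expects, while the index passed in may come from a thread
outside that set.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the per-thread state in the diff above.
struct ThreadLocalState {
  bool is_initialized = false;
};

// Hypothetical container: one slot per thread the caller expects to take part.
class LocalStates {
 public:
  explicit LocalStates(std::size_t num_threads) : states_(num_threads) {}

  // Checks the index in every build type. A DCHECK-style macro is compiled
  // out of release builds, where an out-of-range index would instead read
  // past the end of the vector.
  ThreadLocalState& InitIfNeeded(std::size_t thread_index) {
    if (thread_index >= states_.size()) {
      throw std::out_of_range("thread_index exceeds number of local states");
    }
    ThreadLocalState& state = states_[thread_index];
    state.is_initialized = true;
    return state;
  }

  std::size_t size() const { return states_.size(); }

 private:
  std::vector<ThreadLocalState> states_;
};
```

With `LocalStates states(1)`, which is roughly what a single-threaded
configuration would suggest, `states.InitIfNeeded(0)` succeeds and
`states.InitIfNeeded(1)` throws. If a task ever arrives from a thread that was
not counted when the vector was sized, this is the point where it would show
up.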