tmaxwell-anthropic commented on issue #39332:
URL: https://github.com/apache/arrow/issues/39332#issuecomment-1865267161
A variation is to define `right_keys` as follows:
```python
right_keys = pa.array([0, 0], pa.int64())
```
Instead of a SIGSEGV, this results in an infinite loop with the following
stack trace:
```
#0 0x00007ffff43a7a90 in arrow::compute::ResizableArrayData::ResizeVaryingLengthBuffer() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#1 0x00007ffff43ab901 in arrow::compute::ExecBatchBuilder::AppendSelected(std::shared_ptr<arrow::ArrayData> const&, arrow::compute::ResizableArrayData*, int, unsigned short const*, arrow::MemoryPool*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#2 0x00007ffff43acfa7 in arrow::compute::ExecBatchBuilder::AppendSelected(arrow::MemoryPool*, arrow::compute::ExecBatch const&, int, unsigned short const*, int, int const*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#3 0x00007ffff67d526b in arrow::acero::JoinResultMaterialize::AppendProbeOnly(arrow::compute::ExecBatch const&, int, unsigned short const*, int*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#4 0x00007ffff67e0953 in arrow::acero::JoinProbeProcessor::OnNextBatch(long, arrow::compute::ExecBatch const&, arrow::util::TempVectorStack*, std::vector<arrow::compute::KeyColumnArray, std::allocator<arrow::compute::KeyColumnArray> >*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#5 0x00007ffff6802721 in arrow::acero::SwissJoin::ProbeSingleBatch(unsigned long, arrow::compute::ExecBatch) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#6 0x00007ffff6825c07 in std::_Function_handler<arrow::Status (unsigned long, long), arrow::acero::HashJoinNode::Init()::{lambda(unsigned long, long)#8}>::_M_invoke(std::_Any_data const&, unsigned long&&, long&&) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#7 0x00007ffff67be225 in arrow::acero::TaskSchedulerImpl::ExecuteTask(unsigned long, int, long, bool*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#8 0x00007ffff67d5814 in std::_Function_handler<arrow::Status (unsigned long), arrow::acero::TaskSchedulerImpl::ScheduleMore(unsigned long, int)::{lambda(unsigned long)#1}>::_M_invoke(std::_Any_data const&, unsigned long&&) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#9 0x00007ffff67baf47 in std::_Function_handler<arrow::Status (), arrow::acero::QueryContext::ScheduleTask(std::function<arrow::Status (unsigned long)>, std::basic_string_view<char, std::char_traits<char> >)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#10 0x00007ffff67f9260 in arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, std::function<arrow::Status ()>)> >::invoke() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#11 0x00007ffff44d9505 in arrow::internal::FnOnce<void ()>::operator()() && () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#12 0x00007ffff44d5c38 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#13 0x00007ffff543b4a0 in execute_native_thread_routine () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#14 0x00007ffff7850ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#15 0x00007ffff78e2660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
```
We're looping through the following three instructions:
```
0x7ffff43a7a90 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+80> add %ebx,%ebx
0x7ffff43a7a92 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+82> cmp %ebx,%eax
0x7ffff43a7a94 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+84> jg 0x7ffff43a7a90 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+80>
```
Here `%eax=1082130432` and `%ebx=0`.
I believe this corresponds to
https://github.com/apache/arrow/blob/apache-arrow-12.0.1/cpp/src/arrow/compute/light_array.cc#L329-L331.
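Suppose `min_new_size` is slightly larger than 2**30, as the `%eax` value above
is. If `new_size` starts as some power of two, it doubles until it reaches
2**30, which is still too small. The next doubling overflows the signed 32-bit
integer to -2147483648, and the one after that yields 0. From then on, doubling
0 gives 0, so `new_size` never reaches `min_new_size` and the loop never exits.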
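To make the arithmetic concrete, here is a rough Python simulation of that doubling loop. It is only a sketch: `wrap_int32` and `simulate_doubling` are hypothetical helpers written for this comment, not Arrow code. The starting size of 8 and the `max_steps` cap are assumptions added so the demo terminates, and the wraparound mimics what the compiled code appears to do (signed overflow is formally undefined behaviour in C++).

```python
def wrap_int32(value: int) -> int:
    """Wrap a Python int into the range of a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def simulate_doubling(min_new_size: int, new_size: int, max_steps: int = 40) -> list[int]:
    """Record every value new_size takes while doubling toward min_new_size.

    The real loop has no step cap; max_steps exists only so this demo stops.
    """
    history = [new_size]
    while new_size < min_new_size and len(history) <= max_steps:
        new_size = wrap_int32(new_size * 2)  # 32-bit doubling with wraparound
        history.append(new_size)
    return history


# min_new_size is the %eax value from the debugger: 2**30 + 2**23.
# The starting size of 8 is an assumed power of two, not taken from Arrow.
history = simulate_doubling(min_new_size=1082130432, new_size=8)
print(history[-4:])
```

Once the value wraps to 0 it stays there, which matches `%ebx=0` in the debugger.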
Oddly, the duplicate value in `right_keys` is essential to reproduce the
bug. If I set `right_keys = pa.array([0, 1], pa.int64())` then it completes
successfully.
Replacing `pa.string()` with `pa.large_string()` also makes the bug go away.