tmaxwell-anthropic commented on issue #39332:
URL: https://github.com/apache/arrow/issues/39332#issuecomment-1865267161

   A variation is to define `right_keys` as follows:
   ```python
   right_keys = pa.array([0, 0], pa.int64())
   ```
   Instead of a SIGSEGV, this results in an infinite loop with the following 
stack trace:
   ```
   #0  0x00007ffff43a7a90 in arrow::compute::ResizableArrayData::ResizeVaryingLengthBuffer() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
   #1  0x00007ffff43ab901 in arrow::compute::ExecBatchBuilder::AppendSelected(std::shared_ptr<arrow::ArrayData> const&, arrow::compute::ResizableArrayData*, int, unsigned short const*, arrow::MemoryPool*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
   #2  0x00007ffff43acfa7 in arrow::compute::ExecBatchBuilder::AppendSelected(arrow::MemoryPool*, arrow::compute::ExecBatch const&, int, unsigned short const*, int, int const*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
   #3  0x00007ffff67d526b in arrow::acero::JoinResultMaterialize::AppendProbeOnly(arrow::compute::ExecBatch const&, int, unsigned short const*, int*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
   #4  0x00007ffff67e0953 in arrow::acero::JoinProbeProcessor::OnNextBatch(long, arrow::compute::ExecBatch const&, arrow::util::TempVectorStack*, std::vector<arrow::compute::KeyColumnArray, std::allocator<arrow::compute::KeyColumnArray> >*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
   #5  0x00007ffff6802721 in arrow::acero::SwissJoin::ProbeSingleBatch(unsigned long, arrow::compute::ExecBatch) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
   #6  0x00007ffff6825c07 in std::_Function_handler<arrow::Status (unsigned long, long), arrow::acero::HashJoinNode::Init()::{lambda(unsigned long, long)#8}>::_M_invoke(std::_Any_data const&, unsigned long&&, long&&) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
   #7  0x00007ffff67be225 in arrow::acero::TaskSchedulerImpl::ExecuteTask(unsigned long, int, long, bool*) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
   #8  0x00007ffff67d5814 in std::_Function_handler<arrow::Status (unsigned long), arrow::acero::TaskSchedulerImpl::ScheduleMore(unsigned long, int)::{lambda(unsigned long)#1}>::_M_invoke(std::_Any_data const&, unsigned long&&) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
   #9  0x00007ffff67baf47 in std::_Function_handler<arrow::Status (), arrow::acero::QueryContext::ScheduleTask(std::function<arrow::Status (unsigned long)>, std::basic_string_view<char, std::char_traits<char> >)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
   #10 0x00007ffff67f9260 in arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, std::function<arrow::Status ()>)> >::invoke() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
   #11 0x00007ffff44d9505 in arrow::internal::FnOnce<void ()>::operator()() && () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
   #12 0x00007ffff44d5c38 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
   #13 0x00007ffff543b4a0 in execute_native_thread_routine () from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
   #14 0x00007ffff7850ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
   #15 0x00007ffff78e2660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
   ```
   We're looping through the following three instructions:
   ```
   0x7ffff43a7a90 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+80>  add    %ebx,%ebx
   0x7ffff43a7a92 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+82>  cmp    %ebx,%eax
   0x7ffff43a7a94 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+84>  jg     0x7ffff43a7a90 <_ZN5arrow7compute18ResizableArrayData25ResizeVaryingLengthBufferEv+80>
   ```
   Here `%eax=1082130432` and `%ebx=0`.
   
   I believe this corresponds to https://github.com/apache/arrow/blob/apache-arrow-12.0.1/cpp/src/arrow/compute/light_array.cc#L329-L331. Suppose `min_new_size` is slightly larger than 2**30 (the `%eax` value above is 2**30 + 2**23). If `new_size` is initially some power of two, it will double until it hits 2**30; then it will try to double again, but overflow to -2147483648; and then double once more and become 0. From that point on, doubling 0 yields 0, which never reaches `min_new_size`, so the loop never exits. This matches the observed `%ebx=0`.
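
The arithmetic above can be checked with a minimal Python sketch. It assumes `new_size` behaves as a signed 32-bit integer that wraps on overflow, which is what the `add %ebx,%ebx` instruction does in practice (signed overflow is formally undefined behavior in C++). The helper names `wrap_int32` and `doubling_trace`, the starting value 64, and the step cap are hypothetical and exist only for illustration; they are not taken from the Arrow source.

```python
def wrap_int32(value):
    """Reduce a Python int to the two's-complement signed 32-bit range."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def doubling_trace(start, min_new_size, max_steps=40):
    """Double `start` with 32-bit wraparound while min_new_size > size.

    The real loop has no step cap; max_steps is added here only so the
    sketch terminates when the simulated loop would spin forever.
    """
    trace = [start]
    size = start
    while min_new_size > size and len(trace) <= max_steps:
        size = wrap_int32(size + size)  # mirrors `add %ebx,%ebx`
        trace.append(size)
    return trace


# 1082130432 is the %eax value reported above (2**30 + 2**23).
# The starting size of 64 is an arbitrary power of two.
trace = doubling_trace(start=64, min_new_size=1082130432)
hit = trace.index(1 << 30)
print(trace[hit:hit + 4])
```

In this sketch the values following 2**30 are -2147483648 and then 0, after which every further doubling stays at 0 until the artificial step cap stops the simulation.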
   
   Oddly, the duplicate value in `right_keys` is essential to reproduce the 
bug. If I set `right_keys = pa.array([0, 1], pa.int64())` then it completes 
successfully.
   
   Replacing `pa.string()` with `pa.large_string()` also makes the bug go away.

