[ https://issues.apache.org/jira/browse/ARROW-10799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243477#comment-17243477 ]

David Li commented on ARROW-10799:
----------------------------------

I didn't get very far at all. Here's the diff; I set it aside once I realized 
I needed to be generic over the index type without excessively bloating the 
generated code.
{code:cpp}
diff --git a/cpp/src/arrow/compute/kernels/vector_selection.cc b/cpp/src/arrow/compute/kernels/vector_selection.cc
index 1967ce727..c462984eb 100644
--- a/cpp/src/arrow/compute/kernels/vector_selection.cc
+++ b/cpp/src/arrow/compute/kernels/vector_selection.cc
@@ -1902,18 +1902,35 @@ Result<std::shared_ptr<ChunkedArray>> TakeCA(const ChunkedArray& values,
   // Case 1: `values` has a single chunk, so just use it
   if (num_chunks == 1) {
     current_chunk = values.chunk(0);
+    // Call Array Take on our single chunk
+    ARROW_ASSIGN_OR_RAISE(new_chunks[0], TakeAA(*current_chunk, indices, options, ctx));
   } else {
     // TODO Case 2: See if all `indices` fall in the same chunk and call Array Take on it
     // See
     // https://github.com/apache/arrow/blob/6f2c9041137001f7a9212f244b51bc004efc29af/r/src/compute.cpp#L123-L151
     // TODO Case 3: If indices are sorted, can slice them and call Array Take
 
-    // Case 4: Else, concatenate chunks and call Array Take
-    ARROW_ASSIGN_OR_RAISE(current_chunk,
-                          Concatenate(values.chunks(), ctx->memory_pool()));
+    // Case 4: for each run of indices that falls within a single
+    // chunk, call Array Take for that chunk
+    int64_t start = 0;
+    int64_t end = 0;
+    // An array of the (max index + 1) of each chunk
+    std::vector<int64_t> boundaries;
+    int64_t boundary = 0;
+    for (const auto& chunk : values.chunks()) {
+      boundary += chunk->length();
+      boundaries.push_back(boundary);
+    }
+
+    while (end < indices.length()) {
+      int64_t chunk_index = 0;
+      for (const auto& boundary : boundaries) {
+        if (end >= )
+      }
+
+      start = end;
+    }
   }
-  // Call Array Take on our single chunk
-  ARROW_ASSIGN_OR_RAISE(new_chunks[0], TakeAA(*current_chunk, indices, options, ctx));
   return std::make_shared<ChunkedArray>(std::move(new_chunks));
 }
{code}
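
For reference, a minimal sketch of what Case 4 above is aiming at: compute the cumulative end offset of each chunk, then split the indices into runs that each fall within a single chunk, so Array Take could be called once per run. This is only an illustration under stated assumptions, not the Arrow implementation. It uses plain std::vector rather than Arrow arrays, the names IndexRun, ChunkBoundaries and SplitIntoRuns are hypothetical, and it locates the chunk with std::upper_bound instead of the linear scan the diff started on. Each index would still need the previous chunk's boundary subtracted before being handed to a per-chunk Take.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

// A run of consecutive positions [start, end) in `indices` that all
// resolve to the same chunk. (Hypothetical helper type, not an Arrow API.)
struct IndexRun {
  int64_t chunk_index;
  int64_t start;
  int64_t end;
};

// Cumulative end offset (max logical index + 1) of each chunk.
std::vector<int64_t> ChunkBoundaries(const std::vector<int64_t>& chunk_lengths) {
  std::vector<int64_t> boundaries;
  // Reserve capacity only; a sized constructor followed by push_back
  // would leave zero-valued entries in front of the real boundaries.
  boundaries.reserve(chunk_lengths.size());
  int64_t boundary = 0;
  for (int64_t length : chunk_lengths) {
    boundary += length;
    boundaries.push_back(boundary);
  }
  return boundaries;
}

// Split `indices` into maximal runs whose elements fall in one chunk.
// Templated on the index type, as the real kernel would need to be.
template <typename IndexType>
std::vector<IndexRun> SplitIntoRuns(const std::vector<int64_t>& boundaries,
                                    const std::vector<IndexType>& indices) {
  std::vector<IndexRun> runs;
  const int64_t total = boundaries.empty() ? 0 : boundaries.back();
  for (int64_t i = 0; i < static_cast<int64_t>(indices.size()); ++i) {
    const int64_t index = static_cast<int64_t>(indices[i]);
    if (index < 0 || index >= total) {
      throw std::out_of_range("take index out of bounds");
    }
    // The first boundary strictly greater than `index` identifies the chunk
    // (this also skips over empty chunks).
    const int64_t chunk_index =
        std::upper_bound(boundaries.begin(), boundaries.end(), index) -
        boundaries.begin();
    if (!runs.empty() && runs.back().chunk_index == chunk_index) {
      runs.back().end = i + 1;  // extend the current run
    } else {
      runs.push_back({chunk_index, i, i + 1});  // start a new run
    }
  }
  return runs;
}
```

With chunk lengths {3, 0, 2} and indices {0, 2, 3, 4, 1}, this yields three runs: positions [0, 2) in chunk 0, positions [2, 4) in chunk 2, and positions [4, 5) in chunk 0. Unsorted indices that alternate between chunks degrade to one run per index, which is presumably why Cases 2 and 3 are listed as separate fast paths.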

> [C++] Take on string chunked arrays slow and fails
> --------------------------------------------------
>
>                 Key: ARROW-10799
>                 URL: https://issues.apache.org/jira/browse/ARROW-10799
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Maarten Breddels
>            Priority: Major
>
>  
> {code:python}
> import pyarrow as pa
> a = pa.array(['a'] * 2**26)
> c = pa.chunked_array([a] * 2*18)
> c.take([0, 1])
> {code}
> Gives
> {noformat}
> ----------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <ipython-input-4-57099ee02815> in <module>
> ----> 1 c.take([0, 1])
> ~/github/apache/arrow/python/pyarrow/table.pxi in pyarrow.lib.ChunkedArray.take()
> ~/github/apache/arrow/python/pyarrow/compute.py in take(data, indices, boundscheck, memory_pool)
>     421     """
>     422     options = TakeOptions(boundscheck=boundscheck)
> --> 423     return call_function('take', [data, indices], options, memory_pool)
>     424 
>     425 
> ~/github/apache/arrow/python/pyarrow/_compute.pyx in pyarrow._compute.call_function()
> ~/github/apache/arrow/python/pyarrow/_compute.pyx in pyarrow._compute.Function.call()
> ~/github/apache/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> ~/github/apache/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: offset overflow while concatenating arrays
> {noformat}
>  
> PS: I did not check master, only 3.0.0.dev238+gb0bc9f8d
>  
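
A quick arithmetic check of why concatenation overflows in the reproduction above, written as a compile-time sketch. It assumes each 'a' occupies one byte of string data and that the array uses the default string type with 32-bit offsets. The constant names are made up for illustration.

```cpp
#include <cstdint>
#include <limits>

// Numbers taken from the reproduction: 2**26 one-byte strings per chunk,
// and `[a] * 2*18` evaluates to a list of 36 chunks.
constexpr int64_t kBytesPerChunk = int64_t{1} << 26;  // 67,108,864
constexpr int64_t kNumChunks = 2 * 18;                // 36
constexpr int64_t kTotalBytes = kBytesPerChunk * kNumChunks;

// Largest value a 32-bit signed offset can hold.
constexpr int64_t kMaxInt32Offset = std::numeric_limits<int32_t>::max();

static_assert(kTotalBytes == 2415919104LL, "36 * 2^26 bytes of string data");
static_assert(kTotalBytes > kMaxInt32Offset,
              "the final offset of the concatenated array does not fit in int32");
```

So any approach that concatenates all chunks first must fail for this input regardless of how few indices are taken, which is what the per-chunk Take avoids.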



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
