[ https://issues.apache.org/jira/browse/ARROW-11989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465869#comment-17465869 ]
Eduardo Ponce edited comment on ARROW-11989 at 12/27/21, 9:29 PM: ------------------------------------------------------------------ [Chunks are linearly searched in {{GetScalar()}} and {{Slice()}} methods|https://github.com/apache/arrow/blob/master/cpp/src/arrow/chunked_array.cc#L157-L173], so if a faster approach is implemented then it should be applied to both cases. I implemented a binary search in C++ for {{GetScalar()}} and noticed that [pyarrow performs its own linear search in ChunkedArray|https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L162-L188], and [does not has a binding for {{GetScalar()}}|https://github.com/apache/arrow/blob/master/python/pyarrow/includes/libarrow.pxd#L727-L744]. Is there a particular reason why {{GetScalar()}} is not exposed in pyarrow? was (Author: edponce): [Chunks are linearly searched in {{GetScalar()}} and {{Slice()}} methods|https://github.com/apache/arrow/blob/master/cpp/src/arrow/chunked_array.cc#L157-L173], so if a faster approach is implemented then it should be applied to both cases. I implemented a binary search in C++ for {{GetScalar()}} and noticed that [pyarrow performs its own linear search in ChunkedArray|https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L162-L188], and [does not has a binding for {{GetScalar()}}|https://github.com/apache/arrow/blob/master/python/pyarrow/includes/libarrow.pxd#L727-L744]. > [C++][Python] Improve ChunkedArray's complexity for the access of elements > -------------------------------------------------------------------------- > > Key: ARROW-11989 > URL: https://issues.apache.org/jira/browse/ARROW-11989 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python > Affects Versions: 3.0.0 > Reporter: quentin lhoest > Priority: Major > > Chunked arrays are stored as a C++ vector of Arrays. > There is currently no indexing structure on top of the vector to allow for > anything better than O(chunk) to access an arbitrary element. > For example, with a Table consisting of 1 column “text” defined by: > - 1024 chunks > - each chunk is 1024 rows > - each row is a text of 1024 characters > Then the time it takes to access one example are: > {code:java} > Time to access example at i=0% : 6.7μs > Time to access example at i=10% : 7.2μs > Time to access example at i=20% : 9.1μs > Time to access example at i=30% : 11.4μs > Time to access example at i=40% : 13.8μs > Time to access example at i=50% : 16.2μs > Time to access example at i=60% : 18.7μs > Time to access example at i=70% : 21.1μs > Time to access example at i=80% : 26.8μs > Time to access example at i=90% : 25.2μs > {code} > The time measured are the average times to do `table[“text”][j]` depending on > the index we want to fetch (from the first example at 0% to the example at > 90% of the length of the table). > You can take a look at the code that produces this benchmark > [here|https://pastebin.com/pSkYHQn9]. > Some discussions in [this thread on the mailing > list|https://lists.apache.org/thread.html/r82d4cb40d72914977bf4c3c5b4c168ea03f6060d24279a44258a6394%40%3Cuser.arrow.apache.org%3E] > suggested different approaches to improve the complexity: > - use a contiguous array of chunk lengths, since having a contiguous array of > lengths makes the iteration over the chunks lengths faster; > - use a binary search, as in the Julia implementation > [here|https://github.com/JuliaData/SentinelArrays.jl/blob/fe14a82b815438ee2e04b59bf7f337feb1ffd022/src/chainedvector.jl#L14]; > - use interpolation search. > Apparently there is also a lookup structure in the compute layer > [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L94]. > cc [~emkornfield], [~wesm] > Thanks again for the amazing work ! -- This message was sent by Atlassian Jira (v8.20.1#820001)