[ 
https://issues.apache.org/jira/browse/ARROW-11989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465561#comment-17465561
 ] 

Eduardo Ponce edited comment on ARROW-11989 at 12/27/21, 7:12 AM:
------------------------------------------------------------------

Is the structure of the example ChunkedArray common (i.e., many chunks all of 
the same length)? If so, this access pattern would benefit from a 
"FixedSizeChunkedArray" (which doesn't exist) in which all chunks have the same 
length, so locating the chunk for an index would be O(1). This could be 
implemented without defining a new class, simply by having the ChunkedArray 
track a flag indicating whether all chunks have the same length.
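As a quick sketch of why the equal-length case is O(1) (the function name is hypothetical, not Arrow API): when every chunk has the same length, a single divmod replaces the linear walk over chunks.

```python
def locate_fixed(chunk_length, i):
    """Return (chunk_index, index_within_chunk) for flat index i,
    assuming every chunk has exactly chunk_length elements."""
    # One integer division + remainder: O(1), independent of chunk count.
    return divmod(i, chunk_length)

# With 1024 chunks of 1024 rows (as in this issue's benchmark),
# flat index 5000 lands in chunk 4 at offset 904.
chunk_idx, offset = locate_fixed(1024, 5000)
```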

Now, with respect to using a binary search instead of a linear search to find 
the chunk of interest: I expect binary search to improve access times for 
high-value indices but to worsen them for low-value indices, due to the 
overhead of the search itself. The overall access time will depend on the 
application and its access patterns; although a binary search would make chunk 
lookup more consistent (logarithmic in the number of chunks), it still has 
drawbacks for certain cases.
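For reference, a minimal sketch of the binary-search approach (helper names are hypothetical; this mirrors the Julia implementation linked in the issue, using prefix sums of chunk lengths plus {{bisect}}):

```python
import bisect
import itertools

def make_offsets(chunk_lengths):
    # Prefix sums: offsets[k] is the flat index of the first element of chunk k.
    # For lengths [3, 5, 2] this yields [0, 3, 8, 10].
    return list(itertools.accumulate(chunk_lengths, initial=0))

def locate_binary(offsets, i):
    # bisect_right returns the position of the first offset strictly greater
    # than i; the chunk containing i is the one just before that position.
    k = bisect.bisect_right(offsets, i) - 1
    return k, i - offsets[k]

offsets = make_offsets([3, 5, 2])
chunk_idx, offset = locate_binary(offsets, 4)  # element 4 lives in chunk 1
```

The prefix-sum array costs O(num_chunks) to build once, after which every lookup is O(log num_chunks) regardless of where the index falls.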

_Food for thought:_ An alternative solution is to allow client code to specify 
the direction of the linear search, which would help tune performance to the 
expected access patterns. The search direction could be specified as an object 
attribute or a function parameter.
* *forward* - begins the search at the first chunk; useful for low-value indices
* *backward* - begins the search at the last chunk; useful for high-value indices
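A sketch of the directional linear search described above (the function and its {{direction}} parameter are hypothetical illustrations, not Arrow API):

```python
def locate_linear(chunk_lengths, i, direction="forward"):
    """Return (chunk_index, index_within_chunk) for flat index i,
    scanning chunks from the chosen end."""
    if direction == "forward":
        # Walk from the first chunk, subtracting each length as we pass it.
        for k, n in enumerate(chunk_lengths):
            if i < n:
                return k, i
            i -= n
    else:
        # Walk from the last chunk, tracking each chunk's starting flat index.
        start = sum(chunk_lengths)
        for k in range(len(chunk_lengths) - 1, -1, -1):
            start -= chunk_lengths[k]
            if i >= start:
                return k, i - start
    raise IndexError("index out of range")

# Both directions agree; each is cheapest near its own end of the array.
assert locate_linear([3, 5, 2], 9, "backward") == (2, 1)
```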



> [C++][Python] Improve ChunkedArray's complexity for the access of elements
> --------------------------------------------------------------------------
>
>                 Key: ARROW-11989
>                 URL: https://issues.apache.org/jira/browse/ARROW-11989
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>    Affects Versions: 3.0.0
>            Reporter: quentin lhoest
>            Priority: Major
>
> Chunked arrays are stored as a C++ vector of Arrays.
> There is currently no indexing structure on top of the vector to allow for 
> anything better than O(num_chunks) access to an arbitrary element.
> For example, with a Table consisting of 1 column “text” defined by:
> - 1024 chunks
> - each chunk is 1024 rows
> - each row is a text of 1024 characters
> Then the times it takes to access one example are:
> {code:java}
> Time to access example at i=0%        : 6.7μs
> Time to access example at i=10%       : 7.2μs
> Time to access example at i=20%       : 9.1μs
> Time to access example at i=30%       : 11.4μs
> Time to access example at i=40%       : 13.8μs
> Time to access example at i=50%       : 16.2μs
> Time to access example at i=60%       : 18.7μs
> Time to access example at i=70%       : 21.1μs
> Time to access example at i=80%       : 26.8μs
> Time to access example at i=90%       : 25.2μs
> {code}
> The times measured are the average times to do `table[“text”][j]` depending 
> on the index we want to fetch (from the first example at 0% to the example at 
> 90% of the length of the table).
> You can take a look at the code that produces this benchmark 
> [here|https://pastebin.com/pSkYHQn9].
> Some discussions in [this thread on the mailing 
> list|https://lists.apache.org/thread.html/r82d4cb40d72914977bf4c3c5b4c168ea03f6060d24279a44258a6394%40%3Cuser.arrow.apache.org%3E]
>  suggested different approaches to improve the complexity:
> - use a contiguous array of chunk lengths, since having a contiguous array of 
> lengths makes the iteration over the chunks lengths faster;
> - use a binary search, as in the Julia implementation 
> [here|https://github.com/JuliaData/SentinelArrays.jl/blob/fe14a82b815438ee2e04b59bf7f337feb1ffd022/src/chainedvector.jl#L14];
> - use interpolation search.
> Apparently there is also a lookup structure in the compute layer 
> [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L94].
> cc [~emkornfield], [~wesm]
> Thanks again for the amazing work !



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
