[jira] [Commented] (ARROW-11989) [C++][Python] Improve ChunkedArray's complexity for the access of elements

2022-04-01 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515990#comment-17515990
 ] 

Eduardo Ponce commented on ARROW-11989:
---

While implementing lazy initialization of the internal offsets array used for 
the binary search in {{GetScalar()}}, I found that this approach requires a 
mutex to check/initialize the offsets array, because the immutability of 
{{ChunkedArray}} implies that concurrent const access must be thread-safe.
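To illustrate the synchronization the lazy approach would need (a hypothetical sketch, not the actual Arrow code), a thread-safe one-time build of the offsets could use {{std::call_once}}:

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch: lazily building the offsets array on first access
// requires synchronization, since a const ChunkedArray may be read from
// multiple threads concurrently.
class LazyOffsets {
 public:
  explicit LazyOffsets(std::vector<int64_t> chunk_lengths)
      : chunk_lengths_(std::move(chunk_lengths)) {}

  const std::vector<int64_t>& Get() {
    // std::call_once guarantees the offsets are built exactly once even
    // under concurrent calls; this synchronization is the cost the lazy
    // approach pays on every access.
    std::call_once(built_, [this] {
      offsets_.reserve(chunk_lengths_.size() + 1);
      offsets_.push_back(0);
      for (int64_t len : chunk_lengths_) {
        offsets_.push_back(offsets_.back() + len);
      }
    });
    return offsets_;
  }

 private:
  std::vector<int64_t> chunk_lengths_;
  std::vector<int64_t> offsets_;
  std::once_flag built_;
};
```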

So the current solution always initializes the offsets array during 
construction. This means the overhead of creating the offsets is always 
incurred, even if {{GetScalar()}} is never called or is called only a few 
times. This is the current performance:
{code}
Time to access example at i=0%  : 20.7μs
Time to access example at i=10% : 6.9μs
Time to access example at i=20% : 6.7μs
Time to access example at i=30% : 6.9μs
Time to access example at i=40% : 6.8μs
Time to access example at i=50% : 6.9μs
Time to access example at i=60% : 6.9μs
Time to access example at i=70% : 6.8μs
Time to access example at i=80% : 6.8μs
Time to access example at i=90% : 6.8μs
{code}

The initial overhead can be reduced, because at the moment the chunks are 
scanned twice during construction:
1. Constructing the offsets array for the binary search.
2. Calculating the total ChunkedArray length and null count. (The length can 
actually be computed in O(1) from the offsets.)

We could combine these scans by adding {{length()}} and {{null_count()}} 
methods to the {{ChunkResolver}} class. The drawback is that length and null 
count are not needed for the ChunkResolver's own operations, and storing them 
as members in both ChunkResolver and ChunkedArray would increase the data 
structures' size. Experiments do not show a significant difference between the 
two approaches, so the former was chosen as it is simpler.
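A minimal sketch of the combined approach (hypothetical names, not the actual {{ChunkResolver}}): one pass over the chunks builds the offsets array and accumulates the null count, and the total length then falls out of the last offset in O(1):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical single-pass resolver: one scan over the chunks builds the
// cumulative offsets and accumulates the null count; the total length is
// then just the last offset.
struct ChunkInfo {
  int64_t length;
  int64_t null_count;
};

class SimpleChunkResolver {
 public:
  explicit SimpleChunkResolver(const std::vector<ChunkInfo>& chunks) {
    offsets_.reserve(chunks.size() + 1);
    offsets_.push_back(0);
    for (const auto& c : chunks) {
      offsets_.push_back(offsets_.back() + c.length);
      null_count_ += c.null_count;
    }
  }

  int64_t length() const { return offsets_.back(); }  // O(1) via offsets
  int64_t null_count() const { return null_count_; }

 private:
  std::vector<int64_t> offsets_;
  int64_t null_count_ = 0;
};
```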

> [C++][Python] Improve ChunkedArray's complexity for the access of elements
> --
>
> Key: ARROW-11989
> URL: https://issues.apache.org/jira/browse/ARROW-11989
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 3.0.0
>Reporter: quentin lhoest
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: pull-request-available
> Fix For: 8.0.0
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> Chunked arrays are stored as a C++ vector of Arrays.
> There is currently no indexing structure on top of the vector to allow for 
> anything better than O(chunk) to access an arbitrary element.
> For example, with a Table consisting of 1 column “text” defined by:
> - 1024 chunks
> - each chunk is 1024 rows
> - each row is a text of 1024 characters
> Then the time it takes to access one example is:
> {code:java}
> Time to access example at i=0%: 6.7μs
> Time to access example at i=10%   : 7.2μs
> Time to access example at i=20%   : 9.1μs
> Time to access example at i=30%   : 11.4μs
> Time to access example at i=40%   : 13.8μs
> Time to access example at i=50%   : 16.2μs
> Time to access example at i=60%   : 18.7μs
> Time to access example at i=70%   : 21.1μs
> Time to access example at i=80%   : 26.8μs
> Time to access example at i=90%   : 25.2μs
> {code}
> The times measured are the average times to do `table["text"][j]`, depending 
> on the index we want to fetch (from the first example at 0% to the example 
> at 90% of the length of the table).
> You can take a look at the code that produces this benchmark 
> [here|https://pastebin.com/pSkYHQn9].
> Some discussions in [this thread on the mailing 
> list|https://lists.apache.org/thread.html/r82d4cb40d72914977bf4c3c5b4c168ea03f6060d24279a44258a6394%40%3Cuser.arrow.apache.org%3E]
>  suggested different approaches to improve the complexity:
> - use a contiguous array of chunk lengths, since having a contiguous array of 
> lengths makes the iteration over the chunks lengths faster;
> - use a binary search, as in the Julia implementation 
> [here|https://github.com/JuliaData/SentinelArrays.jl/blob/fe14a82b815438ee2e04b59bf7f337feb1ffd022/src/chainedvector.jl#L14];
> - use interpolation search.
> Apparently there is also a lookup structure in the compute layer 
> [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L94].
> cc [~emkornfield], [~wesm]
> Thanks again for the amazing work !



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-11989) [C++][Python] Improve ChunkedArray's complexity for the access of elements

2021-12-27 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465884#comment-17465884
 ] 

Eduardo Ponce commented on ARROW-11989:
---

Here are my local results for the PyArrow benchmark provided by [~lhoestq] 
using {{GetScalar()}} directly with a linear and binary search.

Note: The access pattern in this benchmark is particular in that it makes 
many consecutive accesses to the same data structure, in increasing index order.

1. No changes to Arrow
{code:shell}
Time to access example at i=0%  : 15.5μs
Time to access example at i=10% : 11.3μs
Time to access example at i=20% : 14.8μs
Time to access example at i=30% : 18.5μs
Time to access example at i=40% : 22.2μs
Time to access example at i=50% : 26.0μs
Time to access example at i=60% : 29.7μs
Time to access example at i=70% : 33.5μs
Time to access example at i=80% : 37.2μs
Time to access example at i=90% : 41.2μs
{code}

2. Exposing C++ {{GetScalar}} with linear search in PyArrow and using it 
directly instead of linear search in Python.
{code:shell}
Time to access example at i=0%  : 10.4μs
Time to access example at i=10% : 8.1μs
Time to access example at i=20% : 8.6μs
Time to access example at i=30% : 9.5μs
Time to access example at i=40% : 10.7μs
Time to access example at i=50% : 11.3μs
Time to access example at i=60% : 12.8μs
Time to access example at i=70% : 13.2μs
Time to access example at i=80% : 14.8μs
Time to access example at i=90% : 14.9μs
{code}

3. Exposing C++ {{GetScalar}} with binary search in PyArrow and using it 
directly instead of linear search in Python.
{code:shell}
Time to access example at i=0%  : 691.2μs
Time to access example at i=10% : 7.5μs
Time to access example at i=20% : 6.9μs
Time to access example at i=30% : 6.8μs
Time to access example at i=40% : 6.6μs
Time to access example at i=50% : 6.6μs
Time to access example at i=60% : 6.9μs
Time to access example at i=70% : 6.8μs
Time to access example at i=80% : 6.6μs
Time to access example at i=90% : 6.6μs
{code}



[jira] [Commented] (ARROW-11989) [C++][Python] Improve ChunkedArray's complexity for the access of elements

2021-12-27 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465869#comment-17465869
 ] 

Eduardo Ponce commented on ARROW-11989:
---

[Chunks are linearly searched in {{GetScalar()}} and {{Slice()}} 
methods|https://github.com/apache/arrow/blob/master/cpp/src/arrow/chunked_array.cc#L157-L173],
 so if a faster approach is implemented then it should be applied to both cases.
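A sketch of a shared lookup routine both methods could use (a hypothetical helper, not the actual Arrow code): binary search over the cumulative offsets via {{std::upper_bound}}:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: given cumulative chunk offsets, locate the chunk
// containing logical index i with binary search. Both GetScalar() and
// Slice() could share a routine like this instead of scanning linearly.
struct ChunkLocation {
  int64_t chunk_index;
  int64_t index_in_chunk;
};

ChunkLocation Resolve(const std::vector<int64_t>& offsets, int64_t i) {
  // upper_bound finds the first offset strictly greater than i; the chunk
  // containing i is the one just before it.
  auto it = std::upper_bound(offsets.begin(), offsets.end(), i);
  int64_t chunk = static_cast<int64_t>(it - offsets.begin()) - 1;
  return {chunk, i - offsets[chunk]};
}
```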



[jira] [Commented] (ARROW-11989) [C++][Python] Improve ChunkedArray's complexity for the access of elements

2021-12-26 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465561#comment-17465561
 ] 

Eduardo Ponce commented on ARROW-11989:
---

Is the structure of the example ChunkedArray common (i.e., having many chunks 
with the same number of rows)? If so, this type of access pattern would 
benefit from a "FixedSizeChunkedArray" (which does not exist) where all chunks 
have the same length, so chunk access would be O(1). This could be implemented 
without defining a new class, simply by having the ChunkedArray keep a flag 
that tracks whether all chunks are the same size.
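For illustration, the fixed-size case reduces chunk resolution to a division and a modulo (a hypothetical sketch, not an existing API):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: when every chunk has the same length, the chunk
// index and in-chunk position follow from a division and a modulo,
// giving O(1) access with no search at all.
struct FixedLocation {
  int64_t chunk_index;
  int64_t index_in_chunk;
};

FixedLocation ResolveFixed(int64_t chunk_length, int64_t i) {
  return {i / chunk_length, i % chunk_length};
}
```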

Now, with respect to using a binary search instead of a linear search to find 
the chunk of interest: I expect binary search to improve access times for 
high indices but worsen them for low indices, due to the overhead of the 
search itself. Overall access time will depend on the application and its 
access patterns; although a binary search would make chunk lookup more 
consistent, it still has drawbacks in certain cases.

_Food for thought:_ An alternative solution is to allow client code to specify 
the direction of the linear search. This would help control performance based 
on the expected access patterns. The search direction could be specified as an 
object attribute or a function parameter.
* *forward* - begins at the first chunk; useful for low indices
* *backward* - begins at the last chunk; useful for high indices
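The two directions could be sketched as follows (hypothetical names, not an actual Arrow API):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class SearchDirection { kForward, kBackward };

// Hypothetical sketch of a direction-aware linear chunk search: forward
// favors low indices, backward favors high ones. Returns -1 when i is
// out of bounds.
int64_t FindChunk(const std::vector<int64_t>& lengths, int64_t i,
                  SearchDirection dir) {
  if (dir == SearchDirection::kForward) {
    int64_t offset = 0;
    for (int64_t c = 0; c < static_cast<int64_t>(lengths.size()); ++c) {
      if (i < offset + lengths[c]) return c;
      offset += lengths[c];
    }
  } else {
    int64_t total = 0;
    for (int64_t len : lengths) total += len;
    for (int64_t c = static_cast<int64_t>(lengths.size()) - 1; c >= 0; --c) {
      total -= lengths[c];  // total is now the start offset of chunk c
      if (i >= total) return c;
    }
  }
  return -1;
}
```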



[jira] [Commented] (ARROW-11989) [C++][Python] Improve ChunkedArray's complexity for the access of elements

2021-03-18 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304638#comment-17304638
 ] 

Micah Kornfield commented on ARROW-11989:
-

{quote}AFAIU, interpolation search is a variation on binary search to try and 
improve the common case?

Since chunk offset comparisons are cheap (we're comparing integers) and 
interpolation is more expensive (it involves a division), I'm not sure it's a 
good idea here. Someone may want to benchmark for sure.
{quote}
Agreed, benchmarks would be useful. It certainly depends on the number of 
batches that need to be searched. I also found 
[this paper|http://pages.cs.wisc.edu/~chronis/files/efficiently_searching_sorted_arrays.pdf]
 on some improvements to interpolation search.
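For reference, a plain interpolation search over the cumulative offsets might look like this (a hedged sketch assuming a non-empty array; each probe costs a division, which is exactly the overhead questioned above):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of interpolation search over cumulative offsets:
// instead of halving the range like binary search, the probe position is
// estimated from the target's relative magnitude. Assumes offsets has at
// least two entries and a nonzero total length.
int64_t InterpolationFindChunk(const std::vector<int64_t>& offsets,
                               int64_t i) {
  int64_t lo = 0;
  int64_t hi = static_cast<int64_t>(offsets.size()) - 2;
  while (lo < hi) {
    // Estimate where i falls between offsets[lo] and offsets[hi + 1];
    // the division here is the per-probe cost discussed above.
    int64_t span = offsets[hi + 1] - offsets[lo];
    int64_t mid = lo + (i - offsets[lo]) * (hi - lo) / span;
    if (i < offsets[mid]) {
      hi = mid - 1;
    } else if (i >= offsets[mid + 1]) {
      lo = mid + 1;
    } else {
      return mid;
    }
  }
  return lo;
}
```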



[jira] [Commented] (ARROW-11989) [C++][Python] Improve ChunkedArray's complexity for the access of elements

2021-03-17 Thread Daniel Nugent (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303320#comment-17303320
 ] 

Daniel Nugent commented on ARROW-11989:
---

Saw this on the list and just wanted to point out that accessing an individual 
index, or an arbitrary vector of indices, might be less common than accessing 
a sorted vector of indices at a time. Sorted, contiguous vectors of indices 
may be the most common of all (for example, iterating across a table in 
batches of records not aligned to the chunk size).
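That access pattern admits a simple optimization (a hypothetical sketch, not an existing Arrow structure): when indices arrive in nondecreasing order, the chunk search can resume from the last chunk found, so a full pass over n indices costs O(n + chunks) overall:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch for the sorted-indices pattern: successive lookups
// resume from the previously found chunk instead of restarting the search.
class ResumableResolver {
 public:
  explicit ResumableResolver(std::vector<int64_t> offsets)
      : offsets_(std::move(offsets)) {}

  // Precondition: successive calls use nondecreasing i, with i in bounds.
  int64_t FindChunk(int64_t i) {
    // Advance past every chunk that ends at or before i.
    while (i >= offsets_[last_ + 1]) ++last_;
    return last_;
  }

 private:
  std::vector<int64_t> offsets_;
  int64_t last_ = 0;
};
```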
