Thanks for your response!

> This is accurate the O(1) access applies to a single Array.  Chunked arrays 
> are stored as a C++ vector of Arrays.  There is currently no indexing 
> structure on top of the vector to allow for anything better than O(chunk) to 
> an arbitrary element.


That makes sense, thanks.

> I am a little surprised that the difference is as pronounced in your 
> benchmark for only 1024 chunks though, I'm not familiar enough with the 
> memory mapped code path to be able to answer why that would be the case.


For reference, I get approximately the same results without memory 
mapping:
```
Time to access example at i=0%  : 9.2μs
Time to access example at i=10% : 7.6μs
Time to access example at i=20% : 8.7μs
Time to access example at i=30% : 10.8μs
Time to access example at i=40% : 13.0μs
Time to access example at i=50% : 15.4μs
Time to access example at i=60% : 17.6μs
Time to access example at i=70% : 20.5μs
Time to access example at i=80% : 22.9μs
Time to access example at i=90% : 24.1μs
```

From the previous code I just replaced `pa.memory_map(…)` with `open(…, "rb")`.

Also, regarding the time difference, it likely comes from iterating over the 
chunks until the one containing the requested example is reached.
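If it helps, here is roughly what I imagine the per-access cost looks like: a pure-Python sketch of a linear scan over chunk lengths, not Arrow's actual implementation (which may differ), just to illustrate why the cost grows with the chunk index:

```python
from typing import List, Tuple

def locate_linear(chunk_lengths: List[int], i: int) -> Tuple[int, int]:
    """Walk the chunks until the one containing global row i is found.

    Returns (chunk_index, offset_within_chunk). Cost is O(num_chunks) in
    the worst case, which matches the roughly linear timings above.
    """
    for chunk_idx, length in enumerate(chunk_lengths):
        if i < length:
            return chunk_idx, i
        i -= length
    raise IndexError(f"index {i} out of bounds")

# 1024 chunks of 1024 rows each, as in the benchmark above
lengths = [1024] * 1024
print(locate_linear(lengths, 0))              # (0, 0): first chunk, cheap
print(locate_linear(lengths, 1024 * 512 + 3)) # (512, 3): 512 iterations
```

Accessing an example at 90% of the table walks ~10x more chunks than one at 10%, which would explain the roughly linear slope in both benchmarks.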

In terms of timing, 20μs is plausible for iterating over a list of 1000 
elements in Python, since for loops are pretty slow.
I would expect it to be much faster if the iteration were done on the C++ side 
though.

Do you know if the iteration is done on the Python side or on the C++ side?

> This would probably be a good addition to the library.   In the short term as 
> a work-around you could build the index externally, and use that. (chunk 
> methods are exposed on ChunkedArray).


May I open a JIRA issue about adding an indexing structure for ChunkedArray?
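In the meantime, here is a sketch of the external-index workaround you suggested, using binary search over cumulative chunk lengths (the index class and its names are mine, not part of pyarrow; only `ChunkedArray.chunks` and `ChunkedArray.chunk(i)` are real pyarrow APIs):

```python
import bisect
from itertools import accumulate
from typing import Iterable, Tuple

class ChunkIndex:
    """O(log num_chunks) row lookup via binary search on cumulative lengths."""

    def __init__(self, chunk_lengths: Iterable[int]):
        # offsets[k] = total number of rows in chunks 0..k (cumulative ends)
        self.offsets = list(accumulate(chunk_lengths))

    def locate(self, i: int) -> Tuple[int, int]:
        """Return (chunk_index, offset_within_chunk) for global row i."""
        chunk_idx = bisect.bisect_right(self.offsets, i)
        if i < 0 or chunk_idx == len(self.offsets):
            raise IndexError(f"index {i} out of bounds")
        prev_end = self.offsets[chunk_idx - 1] if chunk_idx else 0
        return chunk_idx, i - prev_end

# Same shape as the benchmark: 1024 chunks of 1024 rows
index = ChunkIndex([1024] * 1024)
print(index.locate(0))                # (0, 0)
print(index.locate(1024 * 512 + 3))  # (512, 3)
```

With a real table this would presumably look like `index = ChunkIndex(len(c) for c in table["text"].chunks)` and then `table["text"].chunk(k)[offset]`, which should make access time independent of where the example sits in the table.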


Best,

Quentin

> On Mar 15, 2021, at 5:40 PM, Micah Kornfield <[email protected]> wrote:
> 
> I am a little surprised that the difference is as pronounced in your 
> benchmark for only 1024 chunks though, I'm not familiar enough with the 
> memory mapped code path to be able to answer why that would be the case.
> 
> On Mon, Mar 15, 2021 at 9:37 AM Micah Kornfield <[email protected]> wrote:
> Hi Quentin, 
> > More specifically, it looks like the time it takes to access an example of 
> > a ChunkedArray is a function of the index of the chunk in which the example 
> > is at.
>  
> This is accurate the O(1) access applies to a single Array.  Chunked arrays 
> are stored as a C++ vector of Arrays.  There is currently no indexing 
> structure on top of the vector to allow for anything better than O(chunk) to 
> an arbitrary element.
> 
> This would probably be a good addition to the library.   In the short term as 
> a work-around you could build the index externally, and use that. (chunk 
> methods are exposed on ChunkedArray).
> 
> -Micah
> 
> 
> 
> 
> On Mon, Mar 15, 2021 at 9:12 AM Quentin Lhoest <[email protected]> wrote:
> Hi !
> 
> I’m working on the huggingface/datasets library and we are using Arrow for 
> natural language processing datasets.
> Our datasets are usually text datasets, often bigger than memory.
> 
> We use memory mapping of arrow files to load the datasets.
> 
> > Unfortunately it looks like accessing examples is not O(1)
> 
> More specifically, it looks like the time it takes to access an example of a 
> ChunkedArray is a function of the index of the chunk in which the example is 
> at.
> If the example is in the last chunk, then it takes more time to access it 
> than accessing an example in the first chunk.
> 
> For example, with a Table consisting of 1 column “text” defined by:
> - 1024 chunks
> - each chunk is 1024 rows
> - each row is a text of 1024 characters
> 
> Then the time it takes to access one example are:
> 
> ```
> Time to access example at i=0%        : 6.7μs
> Time to access example at i=10%       : 7.2μs
> Time to access example at i=20%       : 9.1μs
> Time to access example at i=30%       : 11.4μs
> Time to access example at i=40%       : 13.8μs
> Time to access example at i=50%       : 16.2μs
> Time to access example at i=60%       : 18.7μs
> Time to access example at i=70%       : 21.1μs
> Time to access example at i=80%       : 26.8μs
> Time to access example at i=90%       : 25.2μs
> 
> ```
> 
> The time measured are the average times to do `table["text"][j]` depending on 
> the index we want to fetch (from the first example at 0% to the example at 
> 90% of the length of the table).
> 
> You can take a look at the code that produces this benchmark here.
> 
> > Why do I observe this behavior ? Is there a way to get O(1) access without 
> > bringing the whole table in memory ?
> 
> 
> Thanks in advance for your help!
> 
> Quentin
