Hi Quentin,

> > More specifically, it looks like the time it takes to access an example
> of a ChunkedArray is a function of the index of the chunk in which the
> example is at.


This is accurate the O(1) access applies to a single Array.  Chunked arrays
are stored as a C++ vector of Arrays.  There is currently no indexing
structure on top of the vector to allow for anything better than O(chunk)
to an arbitrary element.

This would probably be a good addition to the library.   In the short term
as a work-around you could build the index externally, and use that. (chunk
methods are exposed on ChunkedArray).

-Micah




On Mon, Mar 15, 2021 at 9:12 AM Quentin Lhoest <[email protected]>
wrote:

> Hi !
>
> I’m working on the huggingface/datasets library and we are using Arrow for
> natural language processing datasets.
> Our datasets are usually text datasets, often bigger than memory.
>
> We use memory mapping of arrow files to load the datasets.
>
> > Unfortunately it looks like accessing examples is not O(1)
>
> More specifically, it looks like the time it takes to access an example of
> a ChunkedArray is a function of the index of the chunk in which the example
> is at.
> If the example is in the last chunk, then it takes more time to access it
> than accessing an example in the first chunk.
>
> For example, with a Table consisting of 1 column “text” defined by:
> - 1024 chunks
> - each chunk is 1024 rows
> - each row is a text of 1024 characters
>
> Then the time it takes to access one example are:
>
> ```
>
> Time to access example at i=0%        : 6.7μs
> Time to access example at i=10%       : 7.2μs
> Time to access example at i=20%       : 9.1μs
> Time to access example at i=30%       : 11.4μs
> Time to access example at i=40%       : 13.8μs
> Time to access example at i=50%       : 16.2μs
> Time to access example at i=60%       : 18.7μs
> Time to access example at i=70%       : 21.1μs
> Time to access example at i=80%       : 26.8μs
> Time to access example at i=90%       : 25.2μs
>
> ```
>
> The time measured are the average times to do `table[“text”][j]` depending
> on the index we want to fetch (from the first example at 0% to the example
> at 90% of the length of the table).
>
> You can take a look at the code that produces this benchmark here
> <https://pastebin.com/pSkYHQn9>.
>
> > Why do I observe this behavior ? Is there a way to get O(1) access
> without bringing the whole table in memory ?
>
>
> Thanks in advance for you help !
>
> Quentin
>

Reply via email to