Hi Quentin, > > More specifically, it looks like the time it takes to access an example > of a ChunkedArray is a function of the index of the chunk in which the > example is at.
This is accurate the O(1) access applies to a single Array. Chunked arrays are stored as a C++ vector of Arrays. There is currently no indexing structure on top of the vector to allow for anything better than O(chunk) to an arbitrary element. This would probably be a good addition to the library. In the short term as a work-around you could build the index externally, and use that. (chunk methods are exposed on ChunkedArray). -Micah On Mon, Mar 15, 2021 at 9:12 AM Quentin Lhoest <[email protected]> wrote: > Hi ! > > I’m working on the huggingface/datasets library and we are using Arrow for > natural language processing datasets. > Our datasets are usually text datasets, often bigger than memory. > > We use memory mapping of arrow files to load the datasets. > > > Unfortunately it looks like accessing examples is not O(1) > > More specifically, it looks like the time it takes to access an example of > a ChunkedArray is a function of the index of the chunk in which the example > is at. > If the example is in the last chunk, then it takes more time to access it > than accessing an example in the first chunk. > > For example, with a Table consisting of 1 column “text” defined by: > - 1024 chunks > - each chunk is 1024 rows > - each row is a text of 1024 characters > > Then the time it takes to access one example are: > > ``` > > Time to access example at i=0% : 6.7μs > Time to access example at i=10% : 7.2μs > Time to access example at i=20% : 9.1μs > Time to access example at i=30% : 11.4μs > Time to access example at i=40% : 13.8μs > Time to access example at i=50% : 16.2μs > Time to access example at i=60% : 18.7μs > Time to access example at i=70% : 21.1μs > Time to access example at i=80% : 26.8μs > Time to access example at i=90% : 25.2μs > > ``` > > The time measured are the average times to do `table[“text”][j]` depending > on the index we want to fetch (from the first example at 0% to the example > at 90% of the length of the table). > > You can take a look at the code that produces this benchmark here > <https://pastebin.com/pSkYHQn9>. > > > Why do I observe this behavior ? Is there a way to get O(1) access > without bringing the whole table in memory ? > > > Thanks in advance for you help ! > > Quentin >
