Thanks for your response!

> This is accurate the O(1) access applies to a single Array. Chunked arrays
> are stored as a C++ vector of Arrays. There is currently no indexing
> structure on top of the vector to allow for anything better than O(chunk) to
> an arbitrary element.

That makes sense, thanks.

> I am a little surprised that the difference is as pronounced in your
> benchmark for only 1024 chunks though, I'm not familiar enough with the
> memory mapped code path to be able to answer why that would be the case.

For information, I'm getting approximately the same results without memory mapping:

```
Time to access example at i=0% : 9.2μs
Time to access example at i=10% : 7.6μs
Time to access example at i=20% : 8.7μs
Time to access example at i=30% : 10.8μs
Time to access example at i=40% : 13.0μs
Time to access example at i=50% : 15.4μs
Time to access example at i=60% : 17.6μs
Time to access example at i=70% : 20.5μs
Time to access example at i=80% : 22.9μs
Time to access example at i=90% : 24.1μs
```

Compared to the previous code, I simply replaced `pa.memory_map(…)` with `open(…, "rb")`.
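Roughly, the two loading paths I'm comparing look like this (just a sketch, not the exact benchmark code: the file name is a placeholder and I'm assuming the Arrow IPC stream format here):

```python
import pyarrow as pa

FILENAME = "dataset.arrow"  # placeholder, not the actual benchmark file

# Memory-mapped path: buffers reference the mapping, nothing is copied up front
with pa.memory_map(FILENAME) as source:
    table_mmap = pa.ipc.open_stream(source).read_all()

# Plain-file path: only the open() call changes, the reader code is identical
with open(FILENAME, "rb") as source:
    table_plain = pa.ipc.open_stream(source).read_all()

# In both cases the "text" column comes back as a ChunkedArray with many chunks
print(table_mmap["text"].num_chunks)
```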
Also, regarding the time difference: it must come from iterating over the chunks until the requested example is reached. In terms of timing, 20μs is plausible for looping over a list of 1,000 elements in Python, since Python for loops are fairly slow, but I would expect it to be much faster if the iteration were done on the C++ side. Do you know whether the iteration happens on the Python side or on the C++ side?

> This would probably be a good addition to the library. In the short term as
> a work-around you could build the index externally, and use that. (chunk
> methods are exposed on ChunkedArray).

May I create a JIRA issue about adding an indexing structure for ChunkedArray?
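In the meantime, the external index I have in mind looks roughly like this (only a sketch on a toy ChunkedArray standing in for `table["text"]`; the helper name is made up):

```python
import bisect
import itertools
import pyarrow as pa

# Toy ChunkedArray: 8 chunks of 1024 rows, standing in for the benchmark column
chunked = pa.chunked_array([[f"row {j}" for j in range(1024)] for _ in range(8)])

# Build the index once: cumulative end-offset of each chunk, e.g. [1024, 2048, ...]
offsets = list(itertools.accumulate(len(chunk) for chunk in chunked.chunks))

def get_example(arr, offsets, i):
    """O(log num_chunks) lookup: binary-search the chunk, then index inside it."""
    chunk_idx = bisect.bisect_right(offsets, i)
    start = offsets[chunk_idx - 1] if chunk_idx > 0 else 0
    return arr.chunk(chunk_idx)[i - start]

print(get_example(chunked, offsets, 5000))  # element 5000 lives in chunk 4 (0-indexed)
```

With the cumulative offsets precomputed, locating the right chunk becomes a binary search instead of a scan over all the chunks.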
Best,
Quentin

> On Mar 15, 2021, at 5:40 PM, Micah Kornfield <[email protected]> wrote:
>
> I am a little surprised that the difference is as pronounced in your
> benchmark for only 1024 chunks though, I'm not familiar enough with the
> memory mapped code path to be able to answer why that would be the case.
>
> On Mon, Mar 15, 2021 at 9:37 AM Micah Kornfield <[email protected]> wrote:
> Hi Quentin,
> > More specifically, it looks like the time it takes to access an example of
> > a ChunkedArray is a function of the index of the chunk in which the example
> > is at.
>
> This is accurate the O(1) access applies to a single Array. Chunked arrays
> are stored as a C++ vector of Arrays. There is currently no indexing
> structure on top of the vector to allow for anything better than O(chunk) to
> an arbitrary element.
>
> This would probably be a good addition to the library. In the short term as
> a work-around you could build the index externally, and use that. (chunk
> methods are exposed on ChunkedArray).
>
> -Micah
>
> On Mon, Mar 15, 2021 at 9:12 AM Quentin Lhoest <[email protected]> wrote:
> Hi !
>
> I’m working on the huggingface/datasets library and we are using Arrow for
> natural language processing datasets.
> Our datasets are usually text datasets, often bigger than memory.
>
> We use memory mapping of arrow files to load the datasets.
>
> Unfortunately it looks like accessing examples is not O(1)
>
> More specifically, it looks like the time it takes to access an example of a
> ChunkedArray is a function of the index of the chunk in which the example is
> at.
> If the example is in the last chunk, then it takes more time to access it
> than accessing an example in the first chunk.
>
> For example, with a Table consisting of 1 column “text” defined by:
> - 1024 chunks
> - each chunk is 1024 rows
> - each row is a text of 1024 characters
>
> Then the time it takes to access one example are:
>
> ```
> Time to access example at i=0% : 6.7μs
> Time to access example at i=10% : 7.2μs
> Time to access example at i=20% : 9.1μs
> Time to access example at i=30% : 11.4μs
> Time to access example at i=40% : 13.8μs
> Time to access example at i=50% : 16.2μs
> Time to access example at i=60% : 18.7μs
> Time to access example at i=70% : 21.1μs
> Time to access example at i=80% : 26.8μs
> Time to access example at i=90% : 25.2μs
> ```
>
> The time measured are the average times to do `table[“text”][j]` depending on
> the index we want to fetch (from the first example at 0% to the example at
> 90% of the length of the table).
>
> You can take a look at the code that produces this benchmark here.
>
> > Why do I observe this behavior ? Is there a way to get O(1) access without
> > bringing the whole table in memory ?
>
> Thanks in advance for you help !
>
> Quentin
