Hi there,

I have been experimenting with Tabular Datasets <https://arrow.apache.org/docs/python/dataset.html> for data that can be larger than memory, and I have a few questions about what is going on under the hood and how to work with it (I understand it is still experimental).

*Question 1: Reading Data from Azure Blob Storage*

I know the filesystems don't fully support this yet, but there is an fsspec-compatible library (adlfs), shown in the filesystem example <https://arrow.apache.org/docs/python/filesystems.html#using-fsspec-compatible-filesystems-with-arrow>, which I have used. Below is an example with the NYC taxi dataset, where I pull the whole dataset through and write it to disk in the Feather format:

    import adlfs
    import pyarrow.dataset as ds

    fs = adlfs.AzureBlobFileSystem(account_name='azureopendatastorage')

    dataset = ds.dataset('nyctlc/green/', filesystem=fs, format='parquet')
    scanner = dataset.scanner()
    ds.write_dataset(scanner, 'taxinyc/green/feather/', format='feather')

This could be something on the Azure side, but I find I am bottlenecked on download speed, and I have noticed that if I spin up multiple Python sessions (or, in my case, interactive windows) I can increase my throughput: I can download each year of the taxi dataset in a separate interactive window and increase the bandwidth consumed. The tabular dataset documentation <https://arrow.apache.org/docs/python/dataset.html> mentions 'optionally parallel reading'. Do you know how I can control this, or perhaps control the number of concurrent connections? Or does this have nothing to do with Arrow and sit purely on the Azure side? I have increased the IO thread count from the default 8 to 16 and saw no difference, but I could still spin up more interactive windows to maximise bandwidth.
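For reference, this is roughly how I changed the thread counts (I am assuming set_io_thread_count is the relevant knob for the filesystem reads, and that the CPU pool is controlled separately):

    import pyarrow as pa

    # IO thread pool: used for IO-bound work such as reading from the
    # (remote) filesystem. The default is 8.
    pa.set_io_thread_count(16)
    print(pa.io_thread_count())  # 16

    # CPU thread pool (decoding, compute) is a separate setting.
    pa.set_cpu_count(8)
    print(pa.cpu_count())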
*Question 2: Reading Filtered Data from Azure Blob Storage*

Unfortunately I don't quite have a repeatable example here. However, I am using the same data as above, only this time each year is a Feather file instead of a Parquet file, and I have uploaded it to my own Azure Blob Storage account. I am trying to read a subset of this data from blob storage by selecting columns and filtering the rows. The final result should be a dataframe that takes up around 240 MB of memory (I have tested this by working with the data locally). However, when I run this against Azure Blob Storage it takes over an hour, and it is clear it is downloading a lot more data than I would have thought. Given that the Feather format supports random access, I would have expected to only download roughly the 240 MB. Is there more going on in the background? Perhaps I am using this incorrectly?

    import adlfs
    import pyarrow.dataset as ds

    connection_string = ''
    fs = adlfs.AzureBlobFileSystem(connection_string=connection_string)

    ds_f = ds.dataset('taxinyc/green/feather/', filesystem=fs, format='feather')

    df = (
        ds_f
        .scanner(
            columns={
                # Selections and projections
                'passengerCount': ds.field('passengerCount') * 1000,
                'tripDistance': ds.field('tripDistance'),
            },
            filter=(ds.field('vendorID') == 1),
        )
        .to_table()
        .to_pandas()
    )

    df.info()

*Question 3: How is memory mapping being applied?*

Does the Dataset API make use of memory mapping? Do I have the correct understanding that memory mapping is only intended for dealing with large data stored on a local file system, whereas data stored on a cloud file system in the Feather format effectively cannot be memory mapped?
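For context, when I say memory mapping I am thinking of a local read along these lines (just a sketch with a hypothetical local file path; I am assuming memory_map is what lets only the requested columns be pulled into memory):

    import pyarrow.feather as feather

    # Local, uncompressed Feather (Arrow IPC) file; the path is hypothetical.
    # With memory_map=True I assume the file is mapped rather than read
    # eagerly, so only the selected columns are materialised.
    table = feather.read_table(
        'taxinyc/green/feather/part-0.feather',
        columns=['passengerCount', 'tripDistance'],
        memory_map=True,
    )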
*Question 4: Projections*

I noticed that in the scanner function, when projecting a column, I am unable to use any compute functions (I get a TypeError: only other expressions allowed as arguments), yet I am able to multiply the column using standard Python arithmetic:

    'passengerCount': ds.field('passengerCount') * 1000,              # works
    'passengerCount': pc.multiply(ds.field('passengerCount'), 1000),  # TypeError

Is this correct, or am I meant to process this using an iterator via record batches <https://arrow.apache.org/docs/python/dataset.html#iterative-out-of-core-or-streaming-reads> to do it out of core? And is it actually doing it out of core with "* 1000"?
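To clarify what I mean by out of core, a record-batch loop like the following is what I had in mind (a sketch against a local copy of the data; I am assuming pc.multiply is fine here because it operates on concrete arrays rather than on dataset expressions):

    import pyarrow.compute as pc
    import pyarrow.dataset as ds

    ds_f = ds.dataset('taxinyc/green/feather/', format='feather')

    total = 0
    for batch in ds_f.to_batches(columns=['passengerCount'],
                                 filter=ds.field('vendorID') == 1):
        # Only one record batch is held in memory at a time.
        scaled = pc.multiply(batch.column(0), 1000)  # column 0 is 'passengerCount'
        total += pc.sum(scaled).as_py() or 0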
Thanks in advance for your help. I have been following the Arrow project for the last two years but have only recently decided to dive into it in depth and explore it for various use cases. I am particularly interested in out-of-core data processing and the interaction with cloud storage to retrieve only a selection of data from Feather files. Hopefully at some point, when I have enough knowledge, I can contribute to this amazing project.

Kind regards,
Nikhil Makan