(oh, sorry I misread `pa.scalar` as `pc.scalar`, so please try `pyarrow.scalar` per the documentation)
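To spell that out: the documentation excerpt quoted below builds the operands with `pa.scalar`, so the thing to try in your projection would be something like the untested sketch here (column names taken from your example; whether `pc.multiply` accepts dataset expressions may depend on your pyarrow version):

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

# the documentation example: multiplying two pyarrow scalars
x, y = pa.scalar(7.8), pa.scalar(9.3)
pc.multiply(x, y)  # <pyarrow.DoubleScalar: 72.54>

# untested: wrap the literal with pa.scalar in the scanner projection
columns = {
    'passengerCount': pc.multiply(ds.field('passengerCount'), pa.scalar(1000)),
    'tripDistance': ds.field('tripDistance'),
}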

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Thu, Sep 15, 2022 at 5:26 PM Aldrin <[email protected]> wrote:

> For Question 2:
> At a glance, I don't see anything in adlfs or Azure that is able to do
> partial reads of a blob. If you're using block blobs, then likely you would
> want to store blocks of your file as separate blocks of a blob, and then
> you can do partial data transfers that way. I could be misunderstanding the
> SDKs or how Azure stores data, but my guess is that a whole blob is
> retrieved and then the local file is able to support partial, block-based
> reads as you expect from local filesystems. You may be able to double-check
> how much data is being retrieved by looking at where adlfs is mounting your
> blob storage.
>
> For Question 3:
> You can memory map remote files; it's just that every page fault will be
> even more expensive than for local files. I am not sure how to tell the
> dataset API to do memory mapping, and I'm not sure how well that would work
> over adlfs.
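(A concrete illustration of the local memory-mapping case I mean above, in case it helps; this is only a sketch, the path is a placeholder, and I haven't tried wiring memory mapping through the dataset API:)

import pyarrow as pa

# open the Feather (Arrow IPC) file as a memory map instead of reading it up front
source = pa.memory_map('taxinyc/green/feather/2018.feather', 'r')  # placeholder path

# Feather v2 is the Arrow IPC file format, so the IPC file reader can open it;
# for an uncompressed file the table's buffers stay backed by the mapping
table = pa.ipc.open_file(source).read_all()
print(table.num_rows)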
> For Question 4:
> Can you try using `pc.scalar(1000)` as shown in the first code excerpt in
> [1]:
>
> >> x, y = pa.scalar(7.8), pa.scalar(9.3)
> >> pc.multiply(x, y)
> <pyarrow.DoubleScalar: 72.54>
>
> [1]:
> https://arrow.apache.org/docs/python/compute.html#standard-compute-functions
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
>
> On Thu, Sep 8, 2022 at 8:26 PM Nikhil Makan <[email protected]>
> wrote:
>
>> Hi There,
>>
>> I have been experimenting with Tabular Datasets
>> <https://arrow.apache.org/docs/python/dataset.html> for data that can be
>> larger than memory, and I had a few questions related to what's going on
>> under the hood and how to work with it (I understand it is still
>> experimental).
>>
>> *Question 1: Reading Data from Azure Blob Storage*
>> Now I know the filesystems don't fully support this yet, but there is an
>> fsspec-compatible library (adlfs), which is shown in the filesystem
>> example
>> <https://arrow.apache.org/docs/python/filesystems.html#using-fsspec-compatible-filesystems-with-arrow>
>> and which I have used. Example below with the NYC taxi dataset, where I am
>> pulling the whole dataset through and writing it to disk in the Feather
>> format.
>>
>> import adlfs
>> import pyarrow.dataset as ds
>>
>> fs = adlfs.AzureBlobFileSystem(account_name='azureopendatastorage')
>>
>> dataset = ds.dataset('nyctlc/green/', filesystem=fs, format='parquet')
>>
>> scanner = dataset.scanner()
>> ds.write_dataset(scanner, 'taxinyc/green/feather/', format='feather')
>>
>> This could be something on the Azure side, but I find I am being
>> bottlenecked on the download speed and have noticed that if I spin up
>> multiple Python sessions (or in my case interactive windows) I can
>> increase my throughput. Hence I can download each year of the taxinyc
>> dataset in separate interactive windows and increase the bandwidth
>> consumed. The tabular dataset
>> <https://arrow.apache.org/docs/python/dataset.html> documentation notes
>> 'optionally parallel reading.' Do you know how I can control this, or
>> perhaps control the number of concurrent connections? Or has this got
>> nothing to do with Arrow and sits purely on the Azure side? I have
>> increased the io thread count from the default 8 to 16 and saw no
>> difference, but could still spin up more interactive windows to maximise
>> bandwidth.
>>
>> *Question 2: Reading Filtered Data from Azure Blob Storage*
>> Unfortunately I don't quite have a repeatable example here. However, I am
>> using the same data as above, only this time I have each year as a
>> Feather file instead of a Parquet file, and I have uploaded it to my own
>> Azure blob storage account.
>> I am trying to read a subset of this data from the blob storage by
>> selecting columns and filtering the data. The final result should be a
>> dataframe that takes up around 240 MB of memory (I have tested this by
>> working with the data locally). However, when I run this by connecting to
>> the Azure blob storage it takes over an hour to run, and it's clear it's
>> downloading a lot more data than I would have thought. Given the file
>> format is Feather, which supports random access, I would have thought I
>> would only have to download the 240 MB?
>>
>> Is there more going on in the background? Perhaps I am using this
>> incorrectly?
>>
>> import adlfs
>> import pyarrow.dataset as ds
>>
>> connection_string = ''
>> fs = adlfs.AzureBlobFileSystem(connection_string=connection_string)
>>
>> ds_f = ds.dataset("taxinyc/green/feather/", filesystem=fs, format='feather')
>>
>> df = (
>>     ds_f
>>     .scanner(
>>         columns={  # selections and projections
>>             'passengerCount': ds.field('passengerCount') * 1000,
>>             'tripDistance': ds.field('tripDistance')
>>         },
>>         filter=(ds.field('vendorID') == 1)
>>     )
>>     .to_table()
>>     .to_pandas()
>> )
>>
>> df.info()
>>
>> *Question 3: How is memory mapping being applied?*
>> Does the Dataset API make use of memory mapping? Do I have the correct
>> understanding that memory mapping is only intended for dealing with large
>> data stored on a local file system, whereas data stored on a cloud file
>> system in the Feather format effectively cannot be memory mapped?
>>
>> *Question 4: Projections*
>> I noticed in the scanner function that when projecting a column I am
>> unable to use any compute functions (I get a TypeError: only other
>> expressions allowed as arguments), yet I am able to multiply the column
>> using standard Python arithmetic.
>>
>> 'passengerCount': ds.field('passengerCount') * 1000,
>>
>> 'passengerCount': pc.multiply(ds.field('passengerCount'), 1000),
>>
>> Is this correct, or do I have to process this using an iterator via record
>> batches
>> <https://arrow.apache.org/docs/python/dataset.html#iterative-out-of-core-or-streaming-reads>
>> to do this out of core? Is it actually even doing it out of core with
>> "*1000"?
>>
>> Thanks for your help in advance. I have been following the Arrow project
>> for the last two years but have only recently decided to dive into it in
>> depth to explore it for various use cases. I am particularly interested in
>> the out-of-core data processing and the interaction with cloud storage to
>> retrieve only a selection of data from Feather files. Hopefully at some
>> point when I have enough knowledge I can contribute to this amazing
>> project.
>>
>> Kind regards
>> Nikhil Makan
>>
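(One last note on the parallel-reading part of Question 1: the "io thread count" mentioned there is pyarrow's I/O thread pool, which is separate from the CPU pool. I don't know whether raising it helps adlfs specifically, but for reference it is controlled like this:)

import pyarrow as pa

print(pa.io_thread_count())   # I/O pool used for reads/downloads (defaults to 8)
pa.set_io_thread_count(16)    # what was already tried above

print(pa.cpu_count())         # the separate CPU pool (decoding, filtering); pa.set_cpu_count() adjusts it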
