Hi Weston, thanks for the response!

> I would say that this is always a problem. In the datasets API the goal is to maximize the resource usage within a single process. Now, it may be a known or expected problem :)

The dataset API still makes use of multiple cores though, correct? How does this then relate to the filesystems interface and the native support for HDFS, GCS and S3? Do these exhibit the same issue? Further to this, as per my earlier discussions on this thread we are unable to do partial reads of a blob in Azure, so I wanted to know if that is possible with any of the other three that have native support, i.e. can we filter the data downloaded from these instead of downloading everything and then filtering?
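For example, something like the following is what I am hoping the natively supported filesystems allow. This is only a minimal sketch: the S3 bucket/prefix and region are placeholders, and my understanding is that it is the Parquet format (column chunks plus row-group statistics) that makes the pruning possible rather than the filesystem itself:

```
import pyarrow.dataset as ds
from pyarrow import fs

# Placeholder bucket/prefix and region; anonymous access assumed for the example.
s3 = fs.S3FileSystem(anonymous=True, region='us-east-2')

dataset = ds.dataset('some-bucket/nyctaxi/', filesystem=s3, format='parquet')

# With Parquet, the scan should only fetch the byte ranges for the selected
# columns, and row-group statistics allow skipping row groups that cannot
# match the filter, so far less than the full files comes over the wire.
table = dataset.to_table(
    columns=['passengerCount', 'tripDistance'],
    filter=ds.field('vendorID') == 1,
)
```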
> I think the benefits of memory mapping are rather subtle and often misleading. Datasets can make use of memory mapping for local filesystems. Doing so will, at best, have a slight performance benefit (avoiding a memcpy) but would most likely decrease performance (by introducing I/O where it is not expected) and it will have no effect whatsoever on the amount of RAM used.

I don't think I quite follow this. Happy to be pointed to some documentation to read more on this, by the way. I thought the basic idea behind memory mapping is that the data structure has the same representation on disk as it does in memory, therefore allowing it to not consume additional memory when reading it, which it would with normal I/O operations for reading files. So would the dataset API process multiple files potentially quicker without memory mapping? Also correct me if I am wrong, but memory mapping is related to the IPC format only; formats such as Parquet cannot take advantage of this?
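For example, this is roughly what I had in mind. A minimal sketch only, assuming an uncompressed local Feather (Arrow IPC) file; the path is a placeholder:

```
import pyarrow as pa
import pyarrow.ipc as ipc

# Placeholder path to an uncompressed local Feather / Arrow IPC file.
source = pa.memory_map('taxinyc/green/feather/part-0.feather', 'r')
table = ipc.open_file(source).read_all()

# The table's buffers reference the mapped file rather than freshly
# allocated copies, so pages are only faulted in as columns are touched
# and the Arrow allocator reports (almost) nothing allocated.
print(table.num_rows, pa.total_allocated_bytes())
```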
Kind regards
Nikhil Makan

On Tue, Sep 20, 2022 at 5:12 AM Weston Pace <[email protected]> wrote:
> Sorry for the slow reply.
>
> > This could be something on the Azure side but I find I am being bottlenecked on the download speed and have noticed if I spin up multiple Python sessions (or in my case interactive windows) I can increase my throughput. Hence I can download each year of the taxinyc dataset in separate interactive windows and increase my bandwidth consumed.
>
> I would say that this is always a problem. In the datasets API the goal is to maximize the resource usage within a single process. Now, it may be a known or expected problem :)
>
> > Does the Dataset API make use of memory mapping? Do I have the correct understanding that memory mapping is only intended for dealing with large data stored on a local file system, whereas data stored on a cloud file system in the feather format effectively cannot be memory mapped?
>
> I think the benefits of memory mapping are rather subtle and often misleading. Datasets can make use of memory mapping for local filesystems. Doing so will, at best, have a slight performance benefit (avoiding a memcpy) but would most likely decrease performance (by introducing I/O where it is not expected) and it will have no effect whatsoever on the amount of RAM used.
>
> > This works as well as noted previously, so I assume the python operators are mapped across, similar to what happens when you use the operators against a numpy or pandas series: it just executes a np.multiply or pd.multiply in the background.
>
> Yes. However the functions that get mapped can sometimes be surprising. Specifically, logical operations map to the _kleene variation and arithmetic maps to the _checked variation. You can find the implementation at [1]. For multiplication this boils down to:
>
> ```
> @staticmethod
> cdef Expression _expr_or_scalar(object expr):
>     if isinstance(expr, Expression):
>         return (<Expression> expr)
>     return (<Expression> Expression._scalar(expr))
>
> ...
>
> def __mul__(Expression self, other):
>     other = Expression._expr_or_scalar(other)
>     return Expression._call("multiply_checked", [self, other])
> ```
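A quick way to confirm this mapping from Python is simply to print the expression; a minimal sketch, with the printed form shown only approximately:

```
import pyarrow.dataset as ds

# The * operator on a field expression builds a call to the checked variant.
expr = ds.field('passengerCount') * 1000
print(expr)
# Prints something like: multiply_checked(passengerCount, 1000)
```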
> On Mon, Sep 19, 2022 at 12:52 AM Jacek Pliszka <[email protected]> wrote:
> >
> > Re 2. In Python Azure SDK there is logic for partial blob read:
> >
> > https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python#azure-storage-blob-blobclient-query-blob
> >
> > However I was unable to use it as it does not support parquet files with decimal columns and these are the ones I have.
> >
> > BR
> >
> > J
> >
> > On Fri, Sep 16, 2022 at 02:26 Aldrin <[email protected]> wrote:
> > >
> > > For Question 2:
> > > At a glance, I don't see anything in adlfs or azure that is able to do partial reads of a blob. If you're using block blobs, then likely you would want to store blocks of your file as separate blocks of a blob, and then you can do partial data transfers that way. I could be misunderstanding the SDKs or how Azure stores data, but my guess is that a whole blob is retrieved and then the local file is able to support partial, block-based reads as you expect from local filesystems. You may be able to double check how much data is being retrieved by looking at where adlfs is mounting your blob storage.
> > >
> > > For Question 3:
> > > You can memory map remote files, it's just that every page fault will be even more expensive than for local files. I am not sure how to tell the dataset API to do memory mapping, and I'm not sure how well that would work over adlfs.
> > >
> > > For Question 4:
> > > Can you try using `pc.scalar(1000)` as shown in the first code excerpt in [1]:
> > >
> > > >> x, y = pa.scalar(7.8), pa.scalar(9.3)
> > > >> pc.multiply(x, y)
> > > <pyarrow.DoubleScalar: 72.54>
> > >
> > > [1]: https://arrow.apache.org/docs/python/compute.html#standard-compute-functions
> > >
> > > Aldrin Montana
> > > Computer Science PhD Student
> > > UC Santa Cruz
> > >
> > > On Thu, Sep 8, 2022 at 8:26 PM Nikhil Makan <[email protected]> wrote:
> > >>
> > >> Hi There,
> > >>
> > >> I have been experimenting with Tabular Datasets for data that can be larger than memory and had a few questions related to what's going on under the hood and how to work with it (I understand it is still experimental).
> > >>
> > >> Question 1: Reading Data from Azure Blob Storage
> > >> Now I know the filesystems don't fully support this yet, but there is an fsspec-compatible library (adlfs) which is shown in the file system example, which I have used. Example below with the nyc taxi dataset, where I am pulling the whole dataset through and writing it to disk in the feather format.
> > >>
> > >> import adlfs
> > >> import pyarrow.dataset as ds
> > >>
> > >> fs = adlfs.AzureBlobFileSystem(account_name='azureopendatastorage')
> > >>
> > >> dataset = ds.dataset('nyctlc/green/', filesystem=fs, format='parquet')
> > >>
> > >> scanner = dataset.scanner()
> > >> ds.write_dataset(scanner, 'taxinyc/green/feather/', format='feather')
> > >>
> > >> This could be something on the Azure side but I find I am being bottlenecked on the download speed and have noticed that if I spin up multiple Python sessions (or in my case interactive windows) I can increase my throughput. Hence I can download each year of the taxinyc dataset in separate interactive windows and increase the bandwidth consumed. The tabular dataset documentation notes 'optionally parallel reading.' Do you know how I can control this? Or perhaps control the number of concurrent connections? Or has this got nothing to do with Arrow and sits purely on the Azure side? I have increased the io thread count from the default 8 to 16 and saw no difference, but could still spin up more interactive windows to maximise bandwidth.
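For reference on the knobs mentioned just above, a minimal sketch of inspecting and resizing Arrow's I/O and CPU thread pools; the values chosen are arbitrary examples, and whether larger pools actually help will depend on where the bottleneck really is (Arrow's scheduling versus the remote store or network):

```
import pyarrow as pa

# Inspect the current pool sizes (the I/O pool defaults to 8 threads).
print(pa.io_thread_count(), pa.cpu_count())

# Resize the pools before scanning; the numbers here are arbitrary examples.
pa.set_io_thread_count(32)
pa.set_cpu_count(8)
```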
> > >> Question 2: Reading Filtered Data from Azure Blob Storage
> > >> Unfortunately I don't quite have a repeatable example here. However, I am using the same data as above, only this time I have each year as a feather file instead of a parquet file. I have uploaded this to my own Azure blob storage account.
> > >> I am trying to read a subset of this data from the blob storage by selecting columns and filtering the data. The final result should be a dataframe that takes up around 240 MB of memory (I have tested this by working with the data locally). However, when I run this by connecting to the Azure blob storage it takes over an hour to run, and it's clear it's downloading a lot more data than I would have thought. Given the file format is feather, which supports random access, I would have thought I would only have to download the 240 MB?
> > >>
> > >> Is there more going on in the background? Perhaps I am using this incorrectly?
> > >>
> > >> import adlfs
> > >> import pyarrow.dataset as ds
> > >>
> > >> connection_string = ''
> > >> fs = adlfs.AzureBlobFileSystem(connection_string=connection_string)
> > >>
> > >> ds_f = ds.dataset("taxinyc/green/feather/", filesystem=fs, format='feather')
> > >>
> > >> df = (
> > >>     ds_f
> > >>     .scanner(
> > >>         columns={  # Selections and Projections
> > >>             'passengerCount': ds.field('passengerCount')*1000,
> > >>             'tripDistance': ds.field('tripDistance')
> > >>         },
> > >>         filter=(ds.field('vendorID') == 1)
> > >>     )
> > >>     .to_table()
> > >>     .to_pandas()
> > >> )
> > >>
> > >> df.info()
> > >>
> > >> Question 3: How is memory mapping being applied?
> > >> Does the Dataset API make use of memory mapping? Do I have the correct understanding that memory mapping is only intended for dealing with large data stored on a local file system, whereas data stored on a cloud file system in the feather format effectively cannot be memory mapped?
> > >>
> > >> Question 4: Projections
> > >> I noticed in the scanner function when projecting a column that I am unable to use any compute functions (I get a TypeError: only other expressions allowed as arguments), yet I am able to multiply the column using standard Python arithmetic.
> > >>
> > >> 'passengerCount': ds.field('passengerCount')*1000,
> > >>
> > >> 'passengerCount': pc.multiply(ds.field('passengerCount'), 1000),
> > >>
> > >> Is this correct, or am I to process this using an iterator via record batch to do this out of core? Is it actually even doing it out of core with "*1000"?
> > >>
> > >> Thanks for your help in advance. I have been following the Arrow project for the last two years but have only recently decided to dive into it in depth to explore it for various use cases. I am particularly interested in the out-of-core data processing and the interaction with cloud storage to retrieve only a selection of data from feather files. Hopefully at some point, when I have enough knowledge, I can contribute to this amazing project.
> > >>
> > >> Kind regards
> > >> Nikhil Makan
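On the record-batch question above, a minimal sketch of streaming the same projection batch by batch instead of materializing the whole table; the dataset path is the one from the example above and the batch size is an arbitrary placeholder:

```
import pyarrow.dataset as ds

dataset = ds.dataset('taxinyc/green/feather/', format='feather')

scanner = dataset.scanner(
    columns={'passengerCount': ds.field('passengerCount') * 1000},
    filter=(ds.field('vendorID') == 1),
    batch_size=64_000,  # rows per batch; tune to bound peak memory
)

# Only one batch is materialized at a time, so the filter and projection
# are applied without ever holding the full table in memory.
total_rows = 0
for batch in scanner.to_batches():
    total_rows += batch.num_rows
print(total_rows)
```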
