Thanks Weston. The data source is S3; all of the parquet files are stored there. So far I have only been able to load smaller datasets and am still figuring out a way to load the large dataset (4.2B rows spread across 1,000 parquet files), so for now I am only trying a subset of it. That is one of the reasons I am reaching out: to get some help, and also to understand the sizing requirements.
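For context, this is roughly how I am pulling a subset today. It is only a minimal sketch; the bucket/prefix, region, column names, and filter below are placeholders rather than the real layout:

    import datetime
    import pyarrow.dataset as ds
    import pyarrow.fs as fs

    # S3-backed parquet dataset; bucket/prefix and region are placeholders
    s3 = fs.S3FileSystem(region="us-east-1")
    dataset = ds.dataset("my-bucket/events/", format="parquet", filesystem=s3)

    # Pull only a slice of the data: project a few columns and push the
    # filter down so whole files / row groups can be skipped
    table = dataset.to_table(
        columns=["event_date", "account_id", "amount"],  # placeholder columns
        filter=ds.field("event_date") >= datetime.date(2021, 5, 1),
        use_threads=True,
    )
    print(table.num_rows)

For the full 4.2B rows I would drop the filter, and that is the part I have not managed yet.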
On Wed, Aug 4, 2021 at 7:38 PM Weston Pace <[email protected]> wrote:
>
> > So far, I tried using Arrow dataset with filters and a generator approach
> > within Arrow flight. I noticed that even with use_threads = True, the arrow
> > API does not use all the cores available in the system.
>
> It is possible you are I/O bound. What device are you reading from?
> Do you know if the data is cached in the kernel page cache already or
> not? If you are reading parquet from an HDD for example I wouldn't
> expect you to be able to utilize more than a few cores.
>
> > Sure, I will try your suggestion. Please, can you share or point me to
> > a mmap reference sample?
>
> There is a little bit of documentation (and a sample) that was
> recently added here (but won't be on the site until 6.0.0 releases):
>
> https://github.com/apache/arrow/blob/master/docs/source/python/ipc.rst#efficiently-writing-and-reading-arrow-data
>
> > The use case is to build an in-memory datastore
>
> Can you give a rough idea of what kind of performance you are getting
> and what kind of performance you would expect or need? For an
> in-memory datastore I wouldn't expect parquet I/O to be on the
> critical path.
>
> On Wed, Aug 4, 2021 at 10:28 AM Murugan Muthusamy <[email protected]> wrote:
> >
> > Thanks Chris. The use case is to build an in-memory datastore. After the
> > data gets loaded, the clients will query and get the results in sub-seconds
> > via an API. Mostly select queries and high-level aggregations. Yes, not
> > the entire dataset, just the last 90 days of data, but ideally it would be
> > nice to have the entire dataset.
> >
> > Sure, I will try your suggestion. Please, can you share or point me to
> > a mmap reference sample?
> >
> > On Wed, Aug 4, 2021 at 10:12 AM Chris Nuernberger <[email protected]> wrote:
> >>
> >> Murugan,
> >>
> >> Could you talk a bit more about what you intend to do with the dataset
> >> once loaded?
> >>
> >> A large dataset is often best represented by a sequence of smaller
> >> datasets, which sounds like how it is currently stored if I hear you
> >> correctly. If you are doing some large aggregation or something then you
> >> can feed the datasets one by one into your aggregation without needing to
> >> load all of them simultaneously.
> >>
> >> Are you trying to do some random access pathway across the entire
> >> dataset?
> >>
> >> One option is to convert each existing parquet file into an arrow table
> >> and then mmap the resulting tables all at once if you need to simulate
> >> having the entire system 'in memory'.
> >>
> >> On Wed, Aug 4, 2021 at 9:55 AM Murugan Muthusamy <[email protected]> wrote:
> >>>
> >>> Hi Team,
> >>>
> >>> I am trying to create a PyArrow table from Parquet data files (1K
> >>> files ~= 4.2B rows with 9 columns) but am facing challenges. I am
> >>> seeking some help and guidance to resolve it.
> >>>
> >>> So far, I tried using Arrow dataset with filters and a generator
> >>> approach within Arrow flight. I noticed that even with use_threads = True,
> >>> the arrow API does not use all the cores available in the system.
> >>>
> >>> I think one way to load all the data in parallel is to split the
> >>> parquet files and run them in multiple servers, but that is going to be
> >>> manual.
> >>>
> >>> I really appreciate any help you can provide to handle the large
> >>> datasets.
> >>>
> >>> Thank you,
> >>> Muru
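
P.S. Below is the parquet -> Arrow IPC -> mmap conversion I plan to try, based on the ipc.rst doc you linked and Chris's suggestion. It is only a rough sketch: the file names are placeholders, it assumes each parquet file fits in memory during the one-time conversion, and the IPC files are written uncompressed so the memory-mapped read stays zero-copy.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # One-time conversion: read one parquet file and write it back out as an
    # Arrow IPC file on local disk (file names are placeholders).
    table = pq.read_table("part-00000.parquet")
    with pa.OSFile("part-00000.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Later: memory-map the IPC file. The OS pages the data in on demand, so
    # the "load" is nearly instant and the memory can be shared across processes.
    with pa.memory_map("part-00000.arrow", "r") as source:
        mapped = pa.ipc.open_file(source).read_all()
    print(mapped.num_rows)

If that works per file, the idea would be to memory-map all 1,000 converted files and stitch the resulting tables together.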
