Murugan,

Could you talk a bit more about what you intend to do with the dataset once
loaded?

A large dataset is often best represented as a sequence of smaller datasets,
which sounds like how yours is currently stored, if I hear you correctly.  If
you are doing some large aggregation (or something along those lines) then you
can feed the datasets one by one into your aggregation without needing to load
all of them simultaneously.
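
Something like the following is what I have in mind (an untested sketch; the
directory path and column name are placeholders):

    import pyarrow.dataset as ds
    import pyarrow.compute as pc

    # Point a dataset at the directory of Parquet files; nothing is loaded yet.
    dataset = ds.dataset("/data/parquet_files", format="parquet")

    # Stream record batches instead of materializing one giant table; only the
    # running total is kept in memory, never the full 4.2B rows.
    total = 0
    for batch in dataset.to_batches(columns=["amount"]):
        # Only one column was projected, so it sits at index 0 in each batch.
        value = pc.sum(batch.column(0)).as_py()
        total += value or 0

    print(total)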

Are you trying to do some kind of random access across the entire dataset?

One option is to convert each existing Parquet file into an Arrow table, write
it back out as an Arrow IPC file, and then mmap the resulting files all at
once if you need to simulate having the entire dataset 'in memory'.
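
Roughly like this (again an untested sketch; the paths are hypothetical, and
the Arrow files need to be written uncompressed for the mmap reads to avoid
copies):

    import glob
    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # One-time conversion: rewrite each Parquet file as an uncompressed Arrow
    # IPC (Feather V2) file so it can be memory-mapped later.
    for path in glob.glob("/data/parquet_files/*.parquet"):
        table = pq.read_table(path)
        feather.write_feather(table, path + ".arrow", compression="uncompressed")

    # Later: memory-map every converted file.  The OS pages data in on demand,
    # so the concatenated table behaves as if it were in memory without the
    # whole dataset ever being loaded at once.
    tables = []
    for path in glob.glob("/data/parquet_files/*.arrow"):
        source = pa.memory_map(path, "r")
        tables.append(pa.ipc.open_file(source).read_all())

    big_table = pa.concat_tables(tables)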

On Wed, Aug 4, 2021 at 9:55 AM Murugan Muthusamy <[email protected]> wrote:

> Hi Team,
>
> I am trying to create a PyArrow table from Parquet data files (1K files ~=
> 4.2B rows with 9 columns) but am facing some challenges. I am seeking some
> help and guidance to resolve them.
>
> So far, I have tried using an Arrow dataset with filters and a generator
> approach within Arrow Flight. I noticed that even with use_threads = True,
> the Arrow API does not use all the cores available on the system.
>
> I think one way to load all the data in parallel is to split the Parquet
> files and run them on multiple servers, but that would be manual.
>
> I really appreciate any help you can provide on handling such large datasets.
>
> Thank you,
> Muru
>
