Hi Team,

I am trying to create a PyArrow table from Parquet data files (about 1K files,
roughly 4.2B rows across 9 columns), but I am running into some challenges. I
am seeking some help and guidance to resolve them.

So far, I have tried using an Arrow dataset with filters and a generator
approach within Arrow Flight. I noticed that even with use_threads=True, the
Arrow API does not use all the cores available on the system.
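For reference, here is a simplified sketch of the kind of thing I am doing;
the path, filter column, and server class below are placeholders, not my
actual code:

    import pyarrow.dataset as ds
    import pyarrow.flight as flight

    class ParquetFlightServer(flight.FlightServerBase):
        def do_get(self, context, ticket):
            # Placeholder path and filter; the real job scans ~1K Parquet files
            dataset = ds.dataset("/data/parquet", format="parquet")
            scanner = dataset.scanner(
                filter=ds.field("event_date") == "2024-01-01",
                use_threads=True,
            )
            # Stream record batches to the client as they are scanned
            return flight.GeneratorStream(scanner.projected_schema,
                                          scanner.to_batches())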

I think one way to load all the data in parallel is to split the Parquet files
and process them on multiple servers, but that would be a manual process.
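To illustrate what I mean by splitting, something along these lines (the path
and worker count are just examples):

    import pyarrow.dataset as ds

    # Placeholder path; .files lists the underlying Parquet file paths
    all_files = ds.dataset("/data/parquet", format="parquet").files
    n_workers = 4  # example number of servers/processes
    chunks = [all_files[i::n_workers] for i in range(n_workers)]
    # Each server would then scan only its own chunk:
    # worker_dataset = ds.dataset(chunks[worker_id], format="parquet")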

I would really appreciate any help you can provide on handling large datasets
like this.

Thank you,
Muru
