Thanks Weston. The data source is S3; all of the parquet files are stored there. So far I have only been able to load smaller datasets and am still figuring out a way to load the large dataset (4.2B rows spread across 1,000 parquet files), so for now I am only trying a subset of it. That is one of the reasons I am reaching out: to get some help, and also to understand the sizing requirements.
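For context, this is roughly how I am pulling a subset today. It is only a minimal sketch; the bucket/prefix, region, column names, and filter below are placeholders rather than the real layout:

    import datetime
    import pyarrow.dataset as ds
    import pyarrow.fs as fs

    # S3-backed parquet dataset; bucket/prefix and region are placeholders
    s3 = fs.S3FileSystem(region="us-east-1")
    dataset = ds.dataset("my-bucket/events/", format="parquet", filesystem=s3)

    # Pull only a slice of the data: project a few columns and push the
    # filter down so whole files / row groups can be skipped
    table = dataset.to_table(
        columns=["event_date", "account_id", "amount"],  # placeholder columns
        filter=ds.field("event_date") >= datetime.date(2021, 5, 1),
        use_threads=True,
    )
    print(table.num_rows)

For the full 4.2B rows I would drop the filter, and that is the part I have not managed yet.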
On Wed, Aug 4, 2021 at 7:38 PM Weston Pace <[email protected]> wrote:
>
> > So far, I tried using Arrow dataset with filters and a generator approach
> > within Arrow flight. I noticed that even with use_threads = True, the arrow
> > API does not use all the cores available in the system.
>
> It is possible you are I/O bound. What device are you reading from?
> Do you know if the data is cached in the kernel page cache already or
> not? If you are reading parquet from an HDD for example I wouldn't
> expect you to be able to utilize more than a few cores.
>
> > Sure, I will try your suggestion. Please, can you share or point me to
> > a mmap reference sample?
>
> There is a little bit of documentation (and a sample) that was
> recently added here (but won't be on the site until 6.0.0 releases):
>
> https://github.com/apache/arrow/blob/master/docs/source/python/ipc.rst#efficiently-writing-and-reading-arrow-data
>
> > The use case is to build an in-memory datastore
>
> Can you give a rough idea of what kind of performance you are getting
> and what kind of performance you would expect or need? For an
> in-memory datastore I wouldn't expect parquet I/O to be on the
> critical path.
>
> On Wed, Aug 4, 2021 at 10:28 AM Murugan Muthusamy <[email protected]> wrote:
> >
> > Thanks Chris. The use case is to build an in-memory datastore. After the
> > data gets loaded, the clients will query and get the results in sub-seconds
> > via an API. Mostly select queries and high-level aggregations. Yes, not
> > the entire dataset, just the last 90 days of data, but ideally it would be
> > nice to have the entire dataset.
> >
> > Sure, I will try your suggestion. Please, can you share or point me to
> > a mmap reference sample?
> >
> > On Wed, Aug 4, 2021 at 10:12 AM Chris Nuernberger <[email protected]> wrote:
> >>
> >> Murugan,
> >>
> >> Could you talk a bit more about what you intend to do with the dataset
> >> once loaded?
> >>
> >> A large dataset is often best represented by a sequence of smaller
> >> datasets, which sounds like how it is currently stored if I hear you
> >> correctly. If you are doing some large aggregation or something then you
> >> can feed the datasets one by one into your aggregation without needing to
> >> load all of them simultaneously.
> >>
> >> Are you trying to do some random access pathway across the entire
> >> dataset?
> >>
> >> One option is to convert each existing parquet file into an arrow table
> >> and then mmap the resulting tables all at once if you need to simulate
> >> having the entire system 'in memory'.
> >>
> >> On Wed, Aug 4, 2021 at 9:55 AM Murugan Muthusamy <[email protected]> wrote:
> >>>
> >>> Hi Team,
> >>>
> >>> I am trying to create a PyArrow table from Parquet data files (1K
> >>> files ~= 4.2B rows with 9 columns) but am facing challenges. I am
> >>> seeking some help and guidance to resolve it.
> >>>
> >>> So far, I tried using Arrow dataset with filters and a generator
> >>> approach within Arrow flight. I noticed that even with use_threads = True,
> >>> the arrow API does not use all the cores available in the system.
> >>>
> >>> I think one way to load all the data in parallel is to split the
> >>> parquet files and run them in multiple servers, but that is going to be
> >>> manual.
> >>>
> >>> I really appreciate any help you can provide to handle the large
> >>> datasets.
> >>>
> >>> Thank you,
> >>> Muru
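
P.S. Below is the parquet -> Arrow IPC -> mmap conversion I plan to try, based on the ipc.rst doc you linked and Chris's suggestion. It is only a rough sketch: the file names are placeholders, it assumes each parquet file fits in memory during the one-time conversion, and the IPC files are written uncompressed so the memory-mapped read stays zero-copy.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # One-time conversion: read one parquet file and write it back out as an
    # Arrow IPC file on local disk (file names are placeholders).
    table = pq.read_table("part-00000.parquet")
    with pa.OSFile("part-00000.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Later: memory-map the IPC file. The OS pages the data in on demand, so
    # the "load" is nearly instant and the memory can be shared across processes.
    with pa.memory_map("part-00000.arrow", "r") as source:
        mapped = pa.ipc.open_file(source).read_all()
    print(mapped.num_rows)

If that works per file, the idea would be to memory-map all 1,000 converted files and stitch the resulting tables together.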
