Maybe -- will give it a try; I've put a rough sketch of what I'm thinking below the quoted thread. Thanks for the suggestion.

On Thu, Feb 3, 2022 at 7:56 AM Partha Dutta <partha.du...@gmail.com> wrote:
> There is a parameter to iter_batches where you can pass in the row_group
> number, or a list of row groups. Would this help to read the Parquet file
> in parallel?
>
> On Thu, Feb 3, 2022 at 8:31 AM Cindy McMullen <cmcmul...@twitter.com> wrote:
>
>> Hi -
>>
>> I'd like to ingest batches within a Parquet file in parallel. The
>> client (DGLDataset) needs to be thread-safe. What's the best API for me
>> to use to do so?
>>
>> Here's the metadata for one example file:
>>
>> <pyarrow._parquet.FileMetaData object at 0x7fbb05c64050>
>>   created_by: parquet-mr version 1.12.0 (build db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20)
>>   num_columns: 4
>>   num_rows: 1000000
>>   num_row_groups: 9997
>>   format_version: 1.0
>>   serialized_size: 17824741
>>
>> I want the consumption of batches to be distributed among multiple
>> workers. I'm currently trying something like this:
>>
>> # Once per client
>> pqf = pq.ParquetFile(f, memory_map=True)
>>
>> # Ideally, each worker can do this, but ParquetFile.iter_batches is not
>> # thread-safe. This makes intuitive sense.
>> pq_batches = pqf.iter_batches(self.rows_per_batch, use_pandas_metadata=True)
>>
>> My workaround is to buffer these ParquetFile batches into a list of
>> DataFrames, but this is memory-intensive, so it will not scale to
>> multiple of these input files.
>>
>> What's a better PyArrow pattern to use so I can distribute batches to my
>> workers in a thread-safe manner?
>>
>> Thanks --
>
> --
> Partha Dutta
> partha.du...@gmail.com
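For reference, here's roughly what I plan to try with the row_groups parameter. This is only a sketch: the file path, worker count, batch size, and the striping of row groups across workers are placeholders, not anything from the thread. The idea is that each worker opens its own ParquetFile handle and iterates only over its assigned row groups, so no reader state is shared between threads and nothing needs to be locked around iter_batches.

import pyarrow.parquet as pq
from concurrent.futures import ThreadPoolExecutor

PATH = "example.parquet"   # placeholder path, not the real input file
NUM_WORKERS = 8            # placeholder worker count
ROWS_PER_BATCH = 1024      # placeholder batch size

def read_assigned_row_groups(worker_id):
    # Each worker opens its own ParquetFile, so no reader state is shared.
    pqf = pq.ParquetFile(PATH, memory_map=True)
    # Stripe row groups across workers: worker k reads groups k, k + N, k + 2N, ...
    my_groups = list(range(worker_id, pqf.num_row_groups, NUM_WORKERS))
    total_rows = 0
    for batch in pqf.iter_batches(batch_size=ROWS_PER_BATCH,
                                  row_groups=my_groups,
                                  use_pandas_metadata=True):
        total_rows += batch.num_rows   # hand each RecordBatch to the dataset here
    return total_rows

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    print(sum(pool.map(read_assigned_row_groups, range(NUM_WORKERS))))

If the DGLDataset workers turn out to be processes rather than threads, the same striping should still work: each process re-opens the file and reads only its own slice of row groups, so nothing has to be buffered up front.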