Hi Cindy, > I'd like to ingest batches within a Parquet file in parallel.
What is the motivation here? Is it speeding up Parquet reading or processing after the fact? Side note, the size of your row groups seems quite small (it might be right for your specific use-case). Cheers, Micah On Thu, Feb 3, 2022 at 8:01 AM Cindy McMullen <cmcmul...@twitter.com> wrote: > Maybe -- will give it a try. Thanks for the suggestion. > > On Thu, Feb 3, 2022 at 7:56 AM Partha Dutta <partha.du...@gmail.com> > wrote: > >> There is a parameter to iter_batches where you can pass in the row_group >> number, or a list of row groups. Would this help to read the Parquet file >> in parallel? >> >> On Thu, Feb 3, 2022 at 8:31 AM Cindy McMullen <cmcmul...@twitter.com> >> wrote: >> >>> Hi - >>> >>> I'd like to ingest batches within a Parquet file in parallel. The >>> client (DGLDataset) needs to be thread-safe. What's the best API for me to >>> use to do so? >>> >>> Here's the metadata for one example file: >>> >>> <pyarrow._parquet.FileMetaData object at 0x7fbb05c64050> >>> created_by: parquet-mr version 1.12.0 (build >>> db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20) >>> num_columns: 4 >>> num_rows: 1000000 >>> num_row_groups: 9997 >>> format_version: 1.0 >>> serialized_size: 17824741 >>> >>> I want the consumption of batches to be distributed among multiple >>> workers. I'm currently trying something like this: >>> >>> # Once per client >>> >>> pqf = pq.ParquetFile(f, memory_map=True) >>> >>> >>> # Ideally, each worker can do this, but ParquetFile.iter_batches is not >>> thread-safe. This makes intuitive sense. pq_batches = >>> pqf.iter_batches(self.rows_per_batch, use_pandas_metadata=True) >>> >>> >>> >>> My workaround is to buffer these ParquetFile batches into DataFrame [], >>> but this is memory-intensive, so will not scale to multiple of these input >>> files. >>> >>> What's a better PyArrow pattern to use so I can distribute batches to my >>> workers in a thread-safe manner? >>> >>> Thanks -- >>> >>> >>> >>> >>> >>> >> >> -- >> Partha Dutta >> partha.du...@gmail.com >> >