Hi - I'd like to ingest batches from a Parquet file in parallel. The client (a DGLDataset) needs to be thread-safe. What's the best PyArrow API to use for this?
Here's the metadata for one example file:

<pyarrow._parquet.FileMetaData object at 0x7fbb05c64050>
  created_by: parquet-mr version 1.12.0 (build db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20)
  num_columns: 4
  num_rows: 1000000
  num_row_groups: 9997
  format_version: 1.0
  serialized_size: 17824741

I want the consumption of batches to be distributed among multiple workers. I'm currently trying something like this:

import pyarrow.parquet as pq

# Once per client
pqf = pq.ParquetFile(f, memory_map=True)

# Ideally, each worker could do this, but ParquetFile.iter_batches is not
# thread-safe. This makes intuitive sense.
pq_batches = pqf.iter_batches(self.rows_per_batch, use_pandas_metadata=True)

My workaround is to buffer these ParquetFile batches into a list of DataFrames, but this is memory-intensive, so it won't scale to multiple of these input files. What's a better PyArrow pattern to use so I can distribute batches to my workers in a thread-safe manner?

Thanks --
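P.S. For concreteness, the buffering workaround looks roughly like this (a minimal sketch, not my exact code: the path, batch size, and the DGLDataset wiring are placeholders):

import pyarrow.parquet as pq

# Single reader: the iterator is consumed serially, since the iterator
# itself is the part that isn't thread-safe.
pqf = pq.ParquetFile("data.parquet", memory_map=True)

# Buffer every batch as a pandas DataFrame up front so workers can pull
# from a plain list. Memory grows with the whole file, which is the problem.
batch_buffer = [
    batch.to_pandas()
    for batch in pqf.iter_batches(batch_size=65536, use_pandas_metadata=True)
]

One direction I've considered, since the file has 9997 row groups, is giving each worker its own ParquetFile handle and a disjoint set of row groups, along these lines (again just a sketch; I don't know if this is the idiomatic pattern):

from concurrent.futures import ThreadPoolExecutor
import pyarrow.parquet as pq

PATH = "data.parquet"  # placeholder
NUM_WORKERS = 4

def read_slice(row_groups):
    # Each worker opens its own handle, so no reader state is shared
    # across threads.
    pqf = pq.ParquetFile(PATH, memory_map=True)
    return pqf.read_row_groups(row_groups, use_pandas_metadata=True)

num_row_groups = pq.ParquetFile(PATH).metadata.num_row_groups
# Assign each worker a disjoint, interleaved slice of row group indices.
slices = [list(range(i, num_row_groups, NUM_WORKERS)) for i in range(NUM_WORKERS)]

with ThreadPoolExecutor(NUM_WORKERS) as pool:
    tables = list(pool.map(read_slice, slices))

Is something like this reasonable, or is there a better-supported API (e.g. pyarrow.dataset) for this use case?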