Hi -

I'd like to ingest batches from a Parquet file in parallel.  The client
(DGLDataset) needs to be thread-safe.  What's the best API for me to use
to do this?

Here's the metadata for one example file:

  <pyarrow._parquet.FileMetaData object at 0x7fbb05c64050>
  created_by: parquet-mr version 1.12.0 (build db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20)
  num_columns: 4
  num_rows: 1000000
  num_row_groups: 9997
  format_version: 1.0
  serialized_size: 17824741
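
(For context, that output is just the repr of the file-level metadata; I
read it with something like the snippet below, where "example.parquet"
stands in for the real path.)

import pyarrow.parquet as pq

# Read only the Parquet footer metadata, not the data itself.
print(pq.read_metadata("example.parquet"))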

I want the consumption of batches to be distributed among multiple
workers.  I'm currently trying something like this:

import pyarrow.parquet as pq

# Once per client
pqf = pq.ParquetFile(f, memory_map=True)


# Ideally, each worker can do this, but ParquetFile.iter_batches is
# not thread-safe.  This makes intuitive sense.
pq_batches = pqf.iter_batches(self.rows_per_batch, use_pandas_metadata=True)
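
Concretely, the intent was roughly the sketch below, with several worker
threads pulling from the one shared generator, which is exactly where the
thread-safety problem shows up.  The path, worker count, and batch size
are made-up values, and the row counting is just a stand-in for the real
per-batch work:

import pyarrow.parquet as pq
from concurrent.futures import ThreadPoolExecutor

pqf = pq.ParquetFile("example.parquet", memory_map=True)

# One shared generator of record batches (1024 rows per batch is arbitrary).
pq_batches = pqf.iter_batches(1024, use_pandas_metadata=True)

def consume_batches():
    rows_seen = 0
    # Every worker advances the same generator, so the underlying reader
    # state is shared between threads; this is the unsafe part.
    for batch in pq_batches:
        rows_seen += batch.num_rows  # stand-in for real per-batch work
    return rows_seen

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(consume_batches) for _ in range(4)]
    totals = [fut.result() for fut in futures]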



My workaround is to buffer these ParquetFile batches into a list of
DataFrames, but this is memory-intensive, so it will not scale to
multiple input files of this size.
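
Roughly, the workaround looks like this (again, the path and batch size
are placeholders); the whole file ends up materialized as pandas
DataFrames before any worker starts:

import pyarrow.parquet as pq

pqf = pq.ParquetFile("example.parquet", memory_map=True)

# Materialize every batch as a DataFrame up front.  Workers can then
# index into this list safely, but the entire file is held in memory.
buffered_frames = [
    batch.to_pandas()
    for batch in pqf.iter_batches(1024, use_pandas_metadata=True)
]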

What's a better PyArrow pattern to use so I can distribute batches to my
workers in a thread-safe manner?

Thanks --
