Maybe -- will give it a try.  Thanks for the suggestion.

On Thu, Feb 3, 2022 at 7:56 AM Partha Dutta <partha.du...@gmail.com> wrote:

> There is a row_groups parameter to iter_batches where you can pass in a
> single row group index, or a list of row groups. Would this help to read
> the Parquet file in parallel?
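>
> Something along these lines might work (untested sketch; worker_id and
> num_workers are just placeholders for however you shard the work): each
> worker opens its own ParquetFile handle and passes a disjoint list of
> row groups to iter_batches, so no iterator state is shared.
>
> import pyarrow.parquet as pq
>
> def read_assigned_row_groups(path, row_groups, rows_per_batch):
>     # A private handle per worker, so nothing has to be thread-safe.
>     pqf = pq.ParquetFile(path, memory_map=True)
>     yield from pqf.iter_batches(batch_size=rows_per_batch,
>                                 row_groups=row_groups,
>                                 use_pandas_metadata=True)
>
> # Round-robin assignment of the file's row groups to this worker.
> num_row_groups = pq.ParquetFile(path).metadata.num_row_groups
> my_groups = list(range(worker_id, num_row_groups, num_workers))
> for batch in read_assigned_row_groups(path, my_groups, rows_per_batch):
>     ...  # hand the batch off to this worker's consumer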
>
> On Thu, Feb 3, 2022 at 8:31 AM Cindy McMullen <cmcmul...@twitter.com>
> wrote:
>
>> Hi -
>>
>> I'd like to ingest batches within a Parquet file in parallel.  The
>> client (DGLDataset) needs to be thread-safe.  What's the best API for me to
>> use to do so?
>>
>> Here's the metadata for one example file:
>>
>>   <pyarrow._parquet.FileMetaData object at 0x7fbb05c64050>
>>   created_by: parquet-mr version 1.12.0 (build 
>> db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20)
>>   num_columns: 4
>>   num_rows: 1000000
>>   num_row_groups: 9997
>>   format_version: 1.0
>>   serialized_size: 17824741
>>
>> I want the consumption of batches to be distributed among multiple
>> workers.  I'm currently trying something like this:
>>
>> import pyarrow.parquet as pq
>>
>> # Once per client
>> pqf = pq.ParquetFile(f, memory_map=True)
>>
>> # Ideally, each worker can do this, but ParquetFile.iter_batches is not
>> # thread-safe.  This makes intuitive sense.
>> pq_batches = pqf.iter_batches(self.rows_per_batch, use_pandas_metadata=True)
>>
>>
>>
>> My workaround is to buffer these ParquetFile batches into a list of
>> DataFrames, but this is memory-intensive, so it will not scale to multiple
>> of these input files.
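>>
>> Roughly, the workaround looks like this (simplified):
>>
>> # Buffer every batch as a DataFrame up front so workers can index into
>> # the list -- this holds the whole file in memory.
>> df_buffer = [b.to_pandas() for b in
>>              pqf.iter_batches(self.rows_per_batch, use_pandas_metadata=True)]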
>>
>> What's a better PyArrow pattern to use so I can distribute batches to my
>> workers in a thread-safe manner?
>>
>> Thanks --
>>
>>
>>
>>
>>
>>
>
> --
> Partha Dutta
> partha.du...@gmail.com
>
