The use case is for a GraphDataLoader
<https://docs.dgl.ai/en/0.6.x/api/python/dgl.dataloading.html#dgl.dataloading.pytorch.GraphDataLoader>
to run with multiple threads. GraphDataLoader invokes its DGLDataset
<https://docs.dgl.ai/en/0.6.x/api/python/dgl.data.html#dgl.data.DGLDataset>,
which loads these Parquet files and converts them into DGL-compatible graph
objects.
On Thu, Feb 3, 2022 at 10:00 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Cindy,
>
>> I'd like to ingest batches within a Parquet file in parallel.
>
> What is the motivation here? Is it speeding up Parquet reading or
> processing after the fact?
>
> Side note, the size of your row groups seems quite small (it might be
> right for your specific use-case).
>
> Cheers,
> Micah
>
> On Thu, Feb 3, 2022 at 8:01 AM Cindy McMullen <cmcmul...@twitter.com> wrote:
>
>> Maybe -- will give it a try. Thanks for the suggestion.
>>
>> On Thu, Feb 3, 2022 at 7:56 AM Partha Dutta <partha.du...@gmail.com> wrote:
>>
>>> There is a parameter to iter_batches where you can pass in the row_group
>>> number, or a list of row groups. Would this help to read the Parquet
>>> file in parallel?
>>>
>>> On Thu, Feb 3, 2022 at 8:31 AM Cindy McMullen <cmcmul...@twitter.com> wrote:
>>>
>>>> Hi -
>>>>
>>>> I'd like to ingest batches within a Parquet file in parallel. The
>>>> client (DGLDataset) needs to be thread-safe. What's the best API for
>>>> me to use to do so?
>>>>
>>>> Here's the metadata for one example file:
>>>>
>>>> <pyarrow._parquet.FileMetaData object at 0x7fbb05c64050>
>>>>   created_by: parquet-mr version 1.12.0 (build db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20)
>>>>   num_columns: 4
>>>>   num_rows: 1000000
>>>>   num_row_groups: 9997
>>>>   format_version: 1.0
>>>>   serialized_size: 17824741
>>>>
>>>> I want the consumption of batches to be distributed among multiple
>>>> workers. I'm currently trying something like this:
>>>>
>>>> # Once per client
>>>> pqf = pq.ParquetFile(f, memory_map=True)
>>>>
>>>> # Ideally, each worker can do this, but ParquetFile.iter_batches is
>>>> # not thread-safe. This makes intuitive sense.
>>>> pq_batches = pqf.iter_batches(self.rows_per_batch, use_pandas_metadata=True)
>>>>
>>>> My workaround is to buffer these ParquetFile batches into DataFrame [],
>>>> but this is memory-intensive, so will not scale to multiple of these
>>>> input files.
>>>>
>>>> What's a better PyArrow pattern to use so I can distribute batches to
>>>> my workers in a thread-safe manner?
>>>>
>>>> Thanks --
>>>
>>> --
>>> Partha Dutta
>>> partha.du...@gmail.com