If you are creating a separate ParquetFileReader for each file, then there
shouldn't be any concurrency issues.  Each reader should maintain its own
independent state.
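
For what it's worth, here is a minimal sketch of the "one reader per thread,
nothing shared" pattern. It is only an illustration under stated assumptions:
`std::ifstream` stands in for the Parquet reader so the example needs nothing
beyond the standard library, and `ReadResult`, `ReadOneFile`, and `ReadAll` are
hypothetical names, not Arrow APIs. In real code the open call inside the
worker would be something like `parquet::ParquetFileReader::OpenFile(path)`.
The sketch also catches exceptions inside each worker, because an exception
that escapes a `std::thread` function calls `std::terminate` instead of
reaching the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <fstream>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

// Outcome of reading one file: a byte count, or an error message on failure.
struct ReadResult {
  std::size_t bytes = 0;
  std::string error;  // empty on success
};

// Worker body. The reader is created, used, and destroyed inside this call,
// so no reader state is shared between threads.
// ASSUMPTION: std::ifstream is a stand-in for a per-thread ParquetFileReader.
ReadResult ReadOneFile(const std::string& path) {
  ReadResult result;
  try {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    char buf[4096];
    // Count bytes, including a final partial chunk.
    while (in.read(buf, sizeof buf) || in.gcount() > 0) {
      result.bytes += static_cast<std::size_t>(in.gcount());
    }
  } catch (const std::exception& e) {
    // Convert the failure into a value so it can be handled by the caller.
    result.error = e.what();
  }
  return result;
}

// Reads every file concurrently, one thread and one reader per file.
std::vector<ReadResult> ReadAll(const std::vector<std::string>& paths) {
  std::vector<ReadResult> results(paths.size());
  std::vector<std::thread> workers;
  workers.reserve(paths.size());
  for (std::size_t i = 0; i < paths.size(); ++i) {
    // Each thread writes only to its own slot, so no locking is needed.
    workers.emplace_back(
        [&paths, &results, i] { results[i] = ReadOneFile(paths[i]); });
  }
  assert(workers.size() == paths.size());
  for (auto& w : workers) w.join();
  return results;
}
```

Calling `ReadAll({"a.parquet", "b.parquet"})` returns one `ReadResult` per
path, and a file that fails to open shows up as a non-empty `error` in its own
slot rather than taking down the other reads.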

On Mon, Apr 4, 2022 at 1:17 PM McDonald, Ben <[email protected]> wrote:

> Thanks for the response. The code is quite complex, so coming up with a
> simple reproducer is not easy, but I think I am satisfied with your answer
> with respect to multi-process runs.
>
>
>
> This is making me wonder if there are any limitations on parallelism
> within a single process though. On top of the distribution I mentioned in
> the previous message, each process is also concurrently reading many
> different files, each with a unique instance of a `ParquetFileReader`.
>
>
>
> Is reading multiple files concurrently safe with `ParquetFileReader`s? Are
> there any additional considerations I need to be aware of with concurrent
> reads beyond not using a single `ParquetFileReader` instance for two
> different uses?
>
>
>
> Best,
>
> Ben McDonald
>
>
>
> *From: *Micah Kornfield <[email protected]>
> *Date: *Sunday, April 3, 2022 at 12:22 PM
> *To: *[email protected] <[email protected]>
> *Subject: *Re: [C++] Reading a single Parquet file from multiple processes
>
> So far, this code has worked as expected and I have been able to read in
> multiple files simultaneously across processes, but recently I hit a case
> where reading a file in a single process resulted in an error that could be
> handled gracefully (with an `Unexpected end of stream` error), but reading
> in that same file across multiple processes crashed the code, and I would
> like to be able to handle the errors rather than having it crash. Thanks.
>
>
>
> As long as there is no shared state (i.e. the processes aren't
> sharing a reader handle via fork), reading from multiple processes should
> be safe.
>  If there is a small reproducible example to show the error that only
> occurs when reading from multiple processes (and that doesn't
> reproduce when reading from a single process) it would be helpful to share
> this to help figure out what is going on.
>
>
>
> On Thu, Mar 31, 2022 at 3:33 PM McDonald, Ben <[email protected]>
> wrote:
>
> Hello,
>
>
>
> I am currently writing some distributed code where I am reading Parquet
> columns from the same file across multiple processes. I see that
> https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10FileReaderE seems
> to suggest that parallelism within a process would need to read at the row
> group granularity and that multiple file readers working independently on
> the same file in a single process would not be safe.
>
>
>
> Given that I haven’t been able to find anything suggesting the contrary, I
> was thinking that reading the same file from different processes would be
> allowed, but a recent crash I encountered made me question if that were
> true.
>
>
>
> Is it allowed to read a single Parquet file simultaneously from separate
> processes? I am currently using the low level `ReadBatch` API and, for
> example, if I were reading 1 file across 2 processes, I would have the
> first process read the first half of the elements and the second process
> read the second half of the elements, and both of these are happening
> simultaneously, but as I have mentioned, it is in different processes, so I
> wouldn’t expect there to be any conflict.
>
>
>
> So far, this code has worked as expected and I have been able to read in
> multiple files simultaneously across processes, but recently I hit a case
> where reading a file in a single process resulted in an error that could be
> handled gracefully (with an `Unexpected end of stream` error), but reading
> in that same file across multiple processes crashed the code, and I would
> like to be able to handle the errors rather than having it crash. Thanks.
>
>
>
> Best,
>
> Ben McDonald
>
>
