Thanks for the response. The code is quite complex, so coming up with a simple 
reproducer is not easy, but I think I am satisfied with your answer with 
respect to multi-process runs.

This is making me wonder if there are any limitations on parallelism within a 
single process though. On top of the distribution I mentioned in the previous 
message, each process is also concurrently reading many different files, each 
with a unique instance of a `ParquetFileReader`.

Is reading multiple files concurrently safe with `ParquetFileReader`s? Are 
there any additional considerations I need to be aware of with concurrent reads 
beyond not using a single `ParquetFileReader` instance for two different uses?
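
A minimal sketch of the pattern described above, using only the standard library: one task per file, with nothing shared between tasks. `ReadAllConcurrently`, `FileResult`, and the `read_one` callback are hypothetical names for illustration. `read_one` stands in for code that opens its own `ParquetFileReader` for the given path and returns a row count. The sketch does not call the Parquet API, and it makes no claim about the library's thread-safety guarantees.

```cpp
#include <cstddef>
#include <exception>
#include <functional>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Outcome of reading one file: a row count, or an error message on failure.
struct FileResult {
  std::string path;
  long long rows = 0;
  std::string error;  // empty on success
};

// Launches one asynchronous task per path. `read_one` is a hypothetical
// stand-in for code that creates its own reader for `path` and reads from it,
// so no reader instance is ever shared between tasks.
inline std::vector<FileResult> ReadAllConcurrently(
    const std::vector<std::string>& paths,
    const std::function<long long(const std::string&)>& read_one) {
  std::vector<std::future<long long>> pending;
  pending.reserve(paths.size());
  for (const auto& path : paths) {
    pending.push_back(std::async(std::launch::async, read_one, path));
  }

  std::vector<FileResult> results;
  results.reserve(paths.size());
  for (std::size_t i = 0; i < paths.size(); ++i) {
    FileResult r;
    r.path = paths[i];
    try {
      // get() rethrows any exception raised inside the task, so a failure
      // in one file is recorded here instead of terminating the process.
      r.rows = pending[i].get();
    } catch (const std::exception& e) {
      r.error = e.what();
    }
    results.push_back(std::move(r));
  }
  return results;
}
```

Because each future rethrows on `get()`, an exception such as `Unexpected end of stream` from one file is captured per file and the remaining reads still complete. Some toolchains need `-pthread` to build this.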

Best,
Ben McDonald

From: Micah Kornfield <[email protected]>
Date: Sunday, April 3, 2022 at 12:22 PM
To: [email protected] <[email protected]>
Subject: Re: [C++] Reading a single Parquet file from multiple processes
> So far, this code has worked as expected and I have been able to read in 
> multiple files simultaneously across processes, but recently I hit a case where 
> reading a file in a single process resulted in an error that could be handled 
> gracefully (with an `Unexpected end of stream` error), but reading in that same 
> file across multiple processes crashed the code, and I would like to be able to 
> handle the errors rather than having it crash. Thanks.

As long as there is no shared state (i.e., the processes aren't sharing a 
reader handle via fork), reading from multiple processes should be safe. If 
there is a small reproducible example that shows the error occurring only when 
reading from multiple processes (and not when reading from a single process), 
it would be helpful to share it so we can figure out what is going on.
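
To illustrate what "shared state via fork" can mean at the operating-system level, here is a small POSIX-only sketch. It shows that a file descriptor opened before `fork()` shares its file offset between parent and child. The function name `OffsetSeenByParentAfterChildRead` is made up for this example, and the sketch says nothing about how the Parquet readers manage their handles internally. It only shows why a handle inherited across a fork is not independent.

```cpp
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cstdlib>

// Opens a temporary file, forks, lets the child read 4 bytes through the
// inherited descriptor, and returns the offset the parent observes afterwards.
// Returns -1 if any step fails.
inline long OffsetSeenByParentAfterChildRead() {
  char path[] = "/tmp/fork_share_XXXXXX";
  int fd = mkstemp(path);  // descriptor opened BEFORE fork()
  if (fd < 0) return -1;
  unlink(path);  // file disappears once the descriptor is closed

  if (write(fd, "01234567", 8) != 8 || lseek(fd, 0, SEEK_SET) != 0) {
    close(fd);
    return -1;
  }

  pid_t pid = fork();
  if (pid < 0) {
    close(fd);
    return -1;
  }
  if (pid == 0) {
    // Child: read through the inherited descriptor, then exit immediately.
    char buf[4];
    ssize_t n = read(fd, buf, sizeof(buf));
    _exit(n == 4 ? 0 : 1);
  }

  int status = 0;
  if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
      WEXITSTATUS(status) != 0) {
    close(fd);
    return -1;
  }

  // Parent: the offset has moved even though the parent never read anything,
  // because both processes refer to the same open file description.
  long offset = static_cast<long>(lseek(fd, 0, SEEK_CUR));
  close(fd);
  return offset;
}
```

The parent observes an offset of 4 after the child's read. Opening the file separately in each process, after the fork, gives each one its own offset.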

On Thu, Mar 31, 2022 at 3:33 PM McDonald, Ben <[email protected]> wrote:
Hello,

I am currently writing some distributed code where I am reading Parquet columns 
from the same file across multiple processes. I see that 
https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10FileReaderE 
seems to suggest that parallelism within a process would need to read at the row 
group granularity and that multiple file readers working independently on the 
same file in a single process would not be safe.

Given that I haven’t been able to find anything suggesting the contrary, I was 
thinking that reading the same file from different processes would be allowed, 
but a recent crash I encountered made me question if that were true.

Is it allowed to read a single Parquet file simultaneously from separate 
processes? I am currently using the low level `ReadBatch` API and, for example, 
if I were reading 1 file across 2 processes, I would have the first process 
read the first half of the elements and the second process read the second half 
of the elements, and both of these are happening simultaneously, but as I have 
mentioned, it is in different processes, so I wouldn’t expect there to be any 
conflict.
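
The split described above can be sketched as a small helper that assigns each process a contiguous, non-overlapping range of element indices. `Range` and `RangeForProcess` are hypothetical names, and the sketch assumes the total element count is known up front. How each process then positions its column reader at its start index and calls `ReadBatch` is not shown.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

// Half-open range [first, second) of element indices owned by one process.
using Range = std::pair<std::int64_t, std::int64_t>;

// Splits `total` elements into `num_procs` contiguous ranges whose sizes
// differ by at most one, and returns the range for process `rank`
// (0-based). Assumes num_procs > 0 and 0 <= rank < num_procs.
inline Range RangeForProcess(std::int64_t total, int num_procs, int rank) {
  const std::int64_t base = total / num_procs;
  const std::int64_t extra = total % num_procs;  // first `extra` ranks get +1
  const std::int64_t begin =
      rank * base + std::min<std::int64_t>(rank, extra);
  const std::int64_t size = base + (rank < extra ? 1 : 0);
  return {begin, begin + size};
}
```

With 2 processes and 10 elements, rank 0 gets `[0, 5)` and rank 1 gets `[5, 10)`, which matches the first-half and second-half split in the example.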

So far, this code has worked as expected and I have been able to read in 
multiple files simultaneously across processes, but recently I hit a case where 
reading a file in a single process resulted in an error that could be handled 
gracefully (with an `Unexpected end of stream` error), but reading in that same 
file across multiple processes crashed the code, and I would like to be able to 
handle the errors rather than having it crash. Thanks.

Best,
Ben McDonald
