Hello,

I am currently writing some distributed code in which multiple processes read 
Parquet columns from the same file. The documentation at 
https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10FileReaderE 
seems to suggest that parallelism within a process needs to happen at row 
group granularity, and that multiple file readers working independently on the 
same file in a single process would not be safe.

Since I haven’t been able to find anything suggesting the contrary, I assumed 
that reading the same file from different processes would be allowed, but a 
recent crash has made me question whether that is true.

Is it allowed to read a single Parquet file simultaneously from separate 
processes? I am currently using the low-level `ReadBatch` API. For example, 
when reading 1 file across 2 processes, the first process reads the first half 
of the elements and the second process reads the second half, both at the same 
time. Since the reads happen in different processes, I wouldn’t expect there to 
be any conflict.
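
A minimal sketch of the split described above, for illustration only. It shows 
just the partitioning arithmetic, not the actual reading code: `ElementRange` 
and `SplitForProcess` are hypothetical names, and the Parquet calls themselves 
are omitted.

```cpp
#include <cassert>
#include <cstdint>

// Half-open range [begin, end) of element indices assigned to one process.
// (Hypothetical helper type, used only for this illustration.)
struct ElementRange {
  int64_t begin;
  int64_t end;
};

// Splits `total` elements as evenly as possible across `num_procs` processes.
// The first (total % num_procs) processes each receive one extra element.
ElementRange SplitForProcess(int64_t total, int num_procs, int rank) {
  assert(total >= 0 && num_procs > 0 && rank >= 0 && rank < num_procs);
  const int64_t base = total / num_procs;   // elements every process gets
  const int64_t extra = total % num_procs;  // leftover elements to distribute
  const int64_t begin = rank * base + (rank < extra ? rank : extra);
  const int64_t end = begin + base + (rank < extra ? 1 : 0);
  return {begin, end};
}
```

With 2 processes and 10 elements, process 0 would get [0, 5) and process 1 
would get [5, 10). Each process would then open its own reader and read only 
its own range.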

So far, this code has worked as expected, and I have been able to read 
multiple files simultaneously across processes. Recently, however, I hit a case 
where reading a file in a single process resulted in an error that could be 
handled gracefully (an `Unexpected end of stream` error), but reading that same 
file across multiple processes crashed the program. I would like to be able to 
handle such errors rather than have the program crash. Thanks.

Best,
Ben McDonald
