Thanks for the response. The code is quite complex, so coming up with a simple reproducer is not easy, but I think I am satisfied with your answer with respect to multi-process runs.
This is making me wonder if there are any limitations on parallelism within a single process, though. On top of the distribution I mentioned in the previous message, each process is also concurrently reading many different files, each with a unique instance of a `ParquetFileReader`. Is reading multiple files concurrently safe with `ParquetFileReader`s? Are there any additional considerations I need to be aware of with concurrent reads beyond not using a single `ParquetFileReader` instance for two different uses?

Best,
Ben McDonald

From: Micah Kornfield <[email protected]>
Date: Sunday, April 3, 2022 at 12:22 PM
To: [email protected] <[email protected]>
Subject: Re: [C++] Reading a single Parquet file from multiple processes

> So far, this code has worked as expected and I have been able to read in multiple files simultaneously across processes, but recently I hit a case where reading a file in a single process resulted in an error that could be handled gracefully (with an `Unexpected end of stream` error), but reading in that same file across multiple processes crashed the code, and I would like to be able to handle the errors rather than having it crash. Thanks.

As long as there is no shared state (i.e. the processes aren't sharing a reader handle via forks), reading from multiple processes should be safe. If there is a small reproducible example that shows the error occurring only when reading from multiple processes (and not when reading from a single process), it would be helpful to share it to help figure out what is going on.

On Thu, Mar 31, 2022 at 3:33 PM McDonald, Ben <[email protected]> wrote:

Hello,

I am currently writing some distributed code where I am reading Parquet columns from the same file across multiple processes.
I see that https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10FileReaderE seems to suggest that parallelism within a process would need to read at the row group granularity and that multiple file readers working independently on the same file in a single process would not be safe. Given that I haven’t been able to find anything suggesting the contrary, I was thinking that reading the same file from different processes would be allowed, but a recent crash I encountered made me question whether that is true.

Is it allowed to read a single Parquet file simultaneously from separate processes? I am currently using the low-level `ReadBatch` API and, for example, if I were reading 1 file across 2 processes, I would have the first process read the first half of the elements and the second process read the second half of the elements. Both reads happen simultaneously but, as I have mentioned, in different processes, so I wouldn’t expect there to be any conflict.

So far, this code has worked as expected and I have been able to read in multiple files simultaneously across processes, but recently I hit a case where reading a file in a single process resulted in an error that could be handled gracefully (with an `Unexpected end of stream` error), but reading in that same file across multiple processes crashed the code, and I would like to be able to handle the errors rather than having it crash. Thanks.

Best,
Ben McDonald
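
For readers following the thread, the "first half / second half" split described in the original message can be sketched as below. This is only an illustration of the index arithmetic, not the code used in the thread: `ElementRange` and `SplitRange` are hypothetical names, nothing here calls into the Parquet library, and how each process positions its own reader at `begin` before calling `ReadBatch` is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Half-open range [begin, end) of element indices assigned to one process.
// Hypothetical helper type, not part of the Parquet C++ API.
struct ElementRange {
  int64_t begin;
  int64_t end;
};

// Split `total` elements into `num_procs` contiguous chunks whose sizes
// differ by at most one, and return the chunk owned by process `rank`.
ElementRange SplitRange(int64_t total, int64_t num_procs, int64_t rank) {
  if (total < 0 || num_procs <= 0 || rank < 0 || rank >= num_procs) {
    throw std::invalid_argument("invalid split parameters");
  }
  const int64_t base = total / num_procs;   // minimum chunk size
  const int64_t extra = total % num_procs;  // first `extra` ranks get one more
  const int64_t begin = rank * base + std::min(rank, extra);
  const int64_t size = base + (rank < extra ? 1 : 0);
  assert(begin + size <= total);
  return {begin, begin + size};
}
```

With 2 processes and 10 elements, rank 0 would own `[0, 5)` and rank 1 would own `[5, 10)`. Each process would open its own reader and touch only its own range, so no reader state is shared between processes.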
