As long as there is no shared state (i.e. the processes aren't sharing a reader handle via fork), reading from multiple processes should be safe. If you have a small reproducible example that shows the error only when reading from multiple processes (and not when reading from a single process), please share it so we can help figure out what is going on.

On Thu, Mar 31, 2022 at 3:33 PM McDonald, Ben <[email protected]> wrote:

> Hello,
>
> I am currently writing some distributed code where I am reading Parquet
> columns from the same file across multiple processes. I see that
> https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10FileReaderE
> seems to suggest that parallelism within a process would need to read at
> the row group granularity and that multiple file readers working
> independently on the same file in a single process would not be safe.
>
> Given that I haven’t been able to find anything suggesting the contrary, I
> was thinking that reading the same file from different processes would be
> allowed, but a recent crash I encountered made me question whether that is
> true.
>
> Is it allowed to read a single Parquet file simultaneously from separate
> processes? I am currently using the low-level `ReadBatch` API. For example,
> if I were reading 1 file across 2 processes, the first process would read
> the first half of the elements and the second process would read the
> second half, both at the same time. Since the reads happen in different
> processes, I wouldn’t expect there to be any conflict.
>
> So far, this code has worked as expected and I have been able to read
> multiple files simultaneously across processes, but recently I hit a case
> where reading a file in a single process resulted in an error that could be
> handled gracefully (an `Unexpected end of stream` error), while reading
> that same file across multiple processes crashed the code. I would like to
> be able to handle the error rather than have it crash. Thanks.
>
> Best,
>
> Ben McDonald
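
To make the "shared reader handle via fork" caveat concrete, here is a minimal POSIX sketch. It is only an analogy using raw file descriptors, not the Parquet or Arrow API, and the helper names (`make_sample_file`, `read4`, `parent_read_shared_handle`, `parent_read_own_handle`) are hypothetical. It assumes a POSIX system with a writable `/tmp`. A descriptor opened before `fork()` shares one file offset between parent and child, while a descriptor opened after `fork()` is independent.

```cpp
#include <cstdlib>
#include <fcntl.h>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Creates a temporary file containing "ABCDEFGH" and returns its path
// (empty string on failure). The path template is an assumption.
std::string make_sample_file() {
    char path[] = "/tmp/fork_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return std::string();
    ssize_t written = write(fd, "ABCDEFGH", 8);
    close(fd);
    return written == 8 ? std::string(path) : std::string();
}

// Reads up to 4 bytes from fd, starting at its current offset.
std::string read4(int fd) {
    char buf[4];
    ssize_t got = read(fd, buf, sizeof buf);
    return got > 0 ? std::string(buf, static_cast<size_t>(got)) : std::string();
}

// Handle opened BEFORE fork: parent and child share one file offset,
// so the child's read moves the position the parent reads from.
std::string parent_read_shared_handle(const std::string& path) {
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return std::string();
    pid_t pid = fork();
    if (pid < 0) { close(fd); return std::string(); }
    if (pid == 0) {                 // child: consume the first 4 bytes
        read4(fd);
        _exit(0);
    }
    waitpid(pid, nullptr, 0);       // let the child finish first
    std::string seen = read4(fd);   // continues where the child stopped
    close(fd);
    return seen;
}

// Handle opened AFTER fork: each process has its own offset,
// so the child's read does not affect the parent.
std::string parent_read_own_handle(const std::string& path) {
    pid_t pid = fork();
    if (pid < 0) return std::string();
    if (pid == 0) {                 // child: open and read independently
        int child_fd = open(path.c_str(), O_RDONLY);
        if (child_fd >= 0) { read4(child_fd); close(child_fd); }
        _exit(0);
    }
    waitpid(pid, nullptr, 0);
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return std::string();
    std::string seen = read4(fd);   // starts at the beginning of the file
    close(fd);
    return seen;
}
```

With the sample file, the parent in the shared-handle case reads the second four bytes because the child already advanced the common offset, while in the own-handle case it reads the first four. By the same reasoning, a setup where each process opens its own reader after it has been created would avoid this particular kind of shared state; whether that explains the crash in question is not something this sketch can show.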
