OmBiradar commented on PR #50158: URL: https://github.com/apache/arrow/pull/50158#issuecomment-4827132870
Hey @wgtmac, I looked into the failing test, which is specifically `TestParquetFileFormatScan.ScanRecordBatchReaderProjectedNested/0Threaded16b1024r`. This test reads a nested Parquet file that has struct columns.

I used gdb to obtain a backtrace once I was confident the program had hit a "deadlock" of some sort. Analysing the deadlock, I found that:

1. The threads that are supposed to read the structs hand off the reading of their child fields to other threads and go into a "wait" state.
2. The child-field reading tasks cannot execute because the thread pool is fully saturated with threads that are in that "wait" state.

This creates the deadlock: the struct-reading threads wait on child tasks, and the child tasks wait for a free thread that the struct-reading threads never release.

Thus, I believe this is a sync-async problem: in general, a blocking thread should not spawn other threads and wait on them. An async style of thread management would be nice here.

There is also a note in `cpp/src/parquet/arrow/reader.cc`, in commit 80fe83a4fd2cc6d119eaf547cee24a2cdf1d28d8 by lidavidm and pitrou, which says:

> Making the Parquet reader truly asynchronous requires heavy refactoring, so the generator piggybacks on ReadRangeCache.

I believe that it enables multi-threaded reading of row groups, but it does not consider threads producing new threads to read the various fields in a struct.

I don't really have much idea of how to approach this. Could you please provide some help, @wgtmac?
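To illustrate the saturation pattern from points 1 and 2, here is a minimal sketch in Python (not Arrow's C++ thread pool). Every name in it (`count_starved_parents`, `read_struct_column`, `read_child_field`) is hypothetical. The waits use a timeout so the sketch terminates; in the hang described above the wait presumably has no timeout, so it never returns.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def count_starved_parents(pool_size, num_structs, wait_s=0.5):
    """Return how many parent tasks gave up waiting for their child task."""
    if num_structs > pool_size:
        raise ValueError("this sketch assumes num_structs <= pool_size")

    pool = ThreadPoolExecutor(max_workers=pool_size)
    # Ensure every parent occupies a worker before any child is submitted.
    all_parents_running = threading.Barrier(num_structs)
    # Keep every parent on its worker until all parents have an outcome,
    # so a parent that times out early cannot free a worker for the others.
    all_parents_decided = threading.Barrier(num_structs)

    def read_child_field():
        # Stand-in for reading one child field of a struct column.
        return "child column"

    def read_struct_column():
        all_parents_running.wait()
        # The parent hands work to the SAME pool and then blocks on it.
        child = pool.submit(read_child_field)
        try:
            child.result(timeout=wait_s)
            got_child = True
        except FutureTimeout:
            got_child = False
        all_parents_decided.wait()
        return got_child

    parents = [pool.submit(read_struct_column) for _ in range(num_structs)]
    outcomes = [p.result() for p in parents]
    pool.shutdown(wait=True)  # queued children finally run once parents return
    return outcomes.count(False)


if __name__ == "__main__":
    print("pool=2, structs=2:", count_starved_parents(2, 2))
    print("pool=3, structs=2:", count_starved_parents(3, 2))
```

When the parents fill the pool (`pool_size == num_structs`), no child can be scheduled and every parent times out. With even one spare worker the children run and no parent is starved, which would be consistent with the hang only showing up in some thread-count configurations.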
