Ziheng Wang created ARROW-17984:
-----------------------------------

             Summary: pq.read_table doesn't seem to be thread safe
                 Key: ARROW-17984
                 URL: https://issues.apache.org/jira/browse/ARROW-17984
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet
    Affects Versions: 9.0.0
            Reporter: Ziheng Wang

Until PR https://github.com/apache/arrow/pull/13799 is merged into master, I am using multithreading to improve read bandwidth from S3. Even after that PR is merged, I will probably still use multithreading to some extent.

However, pq.read_table from S3 doesn't seem to be thread safe. It seems to use the new dataset reader under the hood.

I cannot provide a stable reproduction, but this is roughly the script I have been using:

```
import concurrent.futures
import pyarrow.parquet as pq

def get_next_batch(self, mapper_id, pos=None):
    def download(file):
        # Each worker thread issues its own pq.read_table call against S3.
        return pq.read_table("s3://" + self.bucket + "/" + file,
                             columns=self.columns, filters=self.filters)

    executor = concurrent.futures.ThreadPoolExecutor(max_workers=self.workers)
    futures = {executor.submit(download, file): file for file in my_files}
    for future in concurrent.futures.as_completed(futures):
        yield future.result()
```

The errors are all malloc-related crashes (the process aborts inside malloc_consolidate), which makes me suspect that a connection object is being reused across pq.read_table invocations in different threads:

```
(InputReaderNode pid=25001, ip=172.31.60.29) malloc_consolidate(): invalid chunk size
(InputReaderNode pid=25001, ip=172.31.60.29) *** SIGABRT received at time=1665464922 on cpu 9 ***
(InputReaderNode pid=25001, ip=172.31.60.29) PC: @ 0x7f9a480a803b (unknown) raise
(InputReaderNode pid=25001, ip=172.31.60.29) @ 0x7f9a480a80c0 4160 (unknown)
(InputReaderNode pid=25001, ip=172.31.60.29) @ 0x7f9a480fa32c (unknown) (unknown)
```

Note that this multithreaded code is running inside a Ray actor process, but that shouldn't be a problem.
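
In case it helps with building a reproduction, here is a self-contained version of the threading pattern. It is only a sketch: `read_concurrently` and `read_one` are names made up for illustration, and the reader is passed in as a parameter so that pq.read_table or a local stand-in can be plugged in without changing the threading logic.

```python
import concurrent.futures


def read_concurrently(files, read_one, workers=4):
    """Yield read_one(file) for each file, in completion order.

    read_one is a hypothetical callable; for the S3 case it would wrap
    pq.read_table with the bucket, columns and filters.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        # Map each future back to its file so a failure can be traced.
        futures = {executor.submit(read_one, f): f for f in files}
        for future in concurrent.futures.as_completed(futures):
            # result() re-raises any exception raised inside the worker thread.
            yield future.result()
```

With a pure-Python stand-in such as `lambda f: f.upper()` this runs without S3. If the crash only appears once `read_one` calls pq.read_table on S3 paths, that would point at the reader rather than at the threading code.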