[ https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625884#comment-17625884 ]
Weston Pace commented on ARROW-17984:
-------------------------------------

Can you share the full output of {{thread apply all bt}}? The core dump itself isn't very portable without the binaries that generated it. E.g.:

{noformat}
warning: Can't open file /usr/bin/python3.8 during file-backed mapping note processing
warning: Can't open file /usr/lib/x86_64-linux-gnu/libresolv-2.31.so during file-backed mapping note processing
warning: Can't open file /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow_python_flight.so.900 during file-backed mapping note processing
{noformat}

So I just get:

{noformat}
(gdb) thread apply all bt

Thread 42 (LWP 10252):
#0  0x00007fd4adf28f37 in ?? ()
#1  0x0000000000000024 in ?? ()
#2  0x00007fd4a5ff31e0 in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 41 (LWP 10072):
#0  0x00007fd4ae05676d in ?? ()
#1  0x00007fd1f3864396 in ?? ()
#2  0x0000000000000000 in ?? ()
...
{noformat}

> pq.read_table doesn't seem to be thread safe
> --------------------------------------------
>
>                 Key: ARROW-17984
>                 URL: https://issues.apache.org/jira/browse/ARROW-17984
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet
>    Affects Versions: 9.0.0
>            Reporter: Ziheng Wang
>            Priority: Major
>         Attachments: _usr_bin_python3.8.1000.crash
>
>
> Before PR [https://github.com/apache/arrow/pull/13799] gets merged into master, I am using multithreading to improve read bandwidth from S3. Even after that PR gets merged, I will probably still use multithreading to some extent.
> However, pq.read_table from S3 doesn't seem to be thread safe. It seems to use the new dataset reader under the hood. I cannot provide a reproduction, not a stable one anyway.
> But this is roughly the script I have been using:
> ~~~
> import concurrent.futures
> import pyarrow.parquet as pq
>
> def get_next_batch(self, mapper_id, pos=None):
>     def download(file):
>         return pq.read_table("s3://" + self.bucket + "/" + file,
>                              columns=self.columns, filters=self.filters)
>
>     executor = concurrent.futures.ThreadPoolExecutor(max_workers=self.workers)
>     futures = {executor.submit(download, file): file for file in my_files}
>     for future in concurrent.futures.as_completed(futures):
>         yield future.result()
> ~~~
> The errors all have to do with malloc segfaults, which makes me suspect the connection object is being reused across different pq.read_table invocations in different threads:
> ```
> (InputReaderNode pid=25001, ip=172.31.60.29) malloc_consolidate(): invalid chunk size
> (InputReaderNode pid=25001, ip=172.31.60.29) *** SIGABRT received at time=1665464922 on cpu 9 ***
> (InputReaderNode pid=25001, ip=172.31.60.29) PC: @ 0x7f9a480a803b (unknown) raise
> (InputReaderNode pid=25001, ip=172.31.60.29) @ 0x7f9a480a80c0 4160 (unknown)
> (InputReaderNode pid=25001, ip=172.31.60.29) @ 0x7f9a480fa32c (unknown) (unknown)
> ```
> Note, this multithreaded code is running inside a Ray actor process, but that shouldn't be a problem.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
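As an aside, the submit/as_completed pattern in the quoted script can be exercised on its own by stubbing out the S3 read. This is a minimal sketch, not the reporter's actual code: the `fake_download` helper and the file names are hypothetical stand-ins for `pq.read_table` and the S3 keys, used only to show the thread-pool flow.

```python
import concurrent.futures

def fake_download(file):
    # Hypothetical stand-in for pq.read_table("s3://" + bucket + "/" + file);
    # returns a placeholder string instead of an Arrow table.
    return "table-for-" + file

my_files = ["a.parquet", "b.parquet", "c.parquet"]

# Same shape as the quoted script: map each future back to its file name,
# then consume results as they complete (completion order is arbitrary).
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(fake_download, f): f for f in my_files}
    results = [f.result() for f in concurrent.futures.as_completed(futures)]

print(sorted(results))
```

With a pure-Python stub this runs cleanly, which is consistent with the suspicion that the crash lives in the concurrent `pq.read_table` calls rather than in the executor pattern itself.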