[ https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625884#comment-17625884 ]

Weston Pace commented on ARROW-17984:
-------------------------------------

Can you share the full output of {{thread apply all bt}}?  The core dump itself 
isn't too portable without the binaries that generated it.  E.g...

{noformat}
warning: Can't open file /usr/bin/python3.8 during file-backed mapping note processing

warning: Can't open file /usr/lib/x86_64-linux-gnu/libresolv-2.31.so during file-backed mapping note processing

warning: Can't open file /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow_python_flight.so.900 during file-backed mapping note processing
{noformat}

So I just get:

{noformat}
(gdb) thread apply all bt

Thread 42 (LWP 10252):
#0  0x00007fd4adf28f37 in ?? ()
#1  0x0000000000000024 in ?? ()
#2  0x00007fd4a5ff31e0 in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 41 (LWP 10072):
#0  0x00007fd4ae05676d in ?? ()
#1  0x00007fd1f3864396 in ?? ()
#2  0x0000000000000000 in ?? ()
...
{noformat}
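
If you can run gdb on the same machine (or an identical image) that produced
the crash, so the original binaries and libraries are still in place, the
stacks should come out symbolized. Roughly like this (the core file path is a
placeholder):

{noformat}
$ gdb /usr/bin/python3.8 /path/to/core
(gdb) thread apply all bt
{noformat}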

> pq.read_table doesn't seem to be thread safe
> --------------------------------------------
>
>                 Key: ARROW-17984
>                 URL: https://issues.apache.org/jira/browse/ARROW-17984
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet
>    Affects Versions: 9.0.0
>            Reporter: Ziheng Wang
>            Priority: Major
>         Attachments: _usr_bin_python3.8.1000.crash
>
>
> Until PR [https://github.com/apache/arrow/pull/13799] gets merged into
> master, I am using multithreading to improve read bandwidth from S3. Even
> after that PR is merged, I will probably still use multithreading to some
> extent.
> However, pq.read_table from S3 doesn't seem to be thread safe. It appears to
> use the new dataset reader under the hood. I cannot provide a stable
> reproduction, but this is roughly the script I have been using:
> ~~~
> import concurrent.futures
> import pyarrow.parquet as pq
>
> def get_next_batch(self, mapper_id, pos=None):
>     def download(file):
>         return pq.read_table("s3://" + self.bucket + "/" + file,
>                              columns=self.columns, filters=self.filters)
>
>     executor = concurrent.futures.ThreadPoolExecutor(max_workers=self.workers)
>     futures = {executor.submit(download, file): file for file in my_files}
>     for future in concurrent.futures.as_completed(futures):
>         yield future.result()
> ~~~
> The errors all have to do with malloc segfaults, which makes me suspect the
> connection object is being reused across different pq.read_table invocations
> in different threads:
> ```
> (InputReaderNode pid=25001, ip=172.31.60.29) malloc_consolidate(): invalid chunk size
> (InputReaderNode pid=25001, ip=172.31.60.29) *** SIGABRT received at time=1665464922 on cpu 9 ***
> (InputReaderNode pid=25001, ip=172.31.60.29) PC: @     0x7f9a480a803b  (unknown)  raise
> (InputReaderNode pid=25001, ip=172.31.60.29)     @     0x7f9a480a80c0       4160  (unknown)
> (InputReaderNode pid=25001, ip=172.31.60.29)     @     0x7f9a480fa32c  (unknown)  (unknown)
> ```
> Note: this multithreaded code is running inside a Ray actor process, but
> that shouldn't be a problem.
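>
> A workaround I might try (just an untested sketch; the idea is to give each
> thread its own filesystem object so no S3 connection state is shared across
> pq.read_table calls, and the S3FileSystem defaults here are assumptions):
> ~~~
> import concurrent.futures
> import pyarrow.parquet as pq
> from pyarrow import fs
>
> def download(file, bucket, columns=None, filters=None):
>     # Fresh filesystem object per call, so threads share no S3 state.
>     s3 = fs.S3FileSystem()
>     # No "s3://" scheme here: the filesystem is passed explicitly.
>     return pq.read_table(bucket + "/" + file, filesystem=s3,
>                          columns=columns, filters=filters)
> ~~~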



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
