[jira] [Commented] (ARROW-17984) pq.read_table doesn't seem to be thread safe

2022-11-04 Thread Antoine Pitrou (Jira)


[ https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629119#comment-17629119 ]

Antoine Pitrou commented on ARROW-17984:


[~marsupialtail] Can you post the backtrace for _all_ threads? (using "thread 
apply all bt" as suggested by [~westonpace]).

Since that will probably be quite long, I suggest posting it on a site such as 
https://gist.github.com/ and sharing the link here.

> pq.read_table doesn't seem to be thread safe
> 
>
> Key: ARROW-17984
> URL: https://issues.apache.org/jira/browse/ARROW-17984
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet
>Affects Versions: 9.0.0
>Reporter: Ziheng Wang
>Priority: Major
> Attachments: _usr_bin_python3.8.1000.crash
>
>
> Before PR [https://github.com/apache/arrow/pull/13799] gets merged into 
> master, I am using multithreading to improve read bandwidth from S3. Even 
> after that PR is merged, I will probably still use multithreading to some 
> extent.
> However, pq.read_table from S3 doesn't seem to be thread safe. It appears to 
> use the new dataset reader under the hood. I cannot provide a stable 
> reproduction, but this is roughly the script I have been using:
> ~~~
> def get_next_batch(self, mapper_id, pos=None):
>     def download(file):
>         return pq.read_table("s3://" + self.bucket + "/" + file,
>                              columns=self.columns, filters=self.filters)
>
>     executor = concurrent.futures.ThreadPoolExecutor(max_workers=self.workers)
>     futures = {executor.submit(download, file): file for file in my_files}
>     for future in concurrent.futures.as_completed(futures):
>         yield future.result()
> ~~~
> The errors all have to do with malloc segfaults, which makes me suspect the 
> connection object is being reused across different pq.read_table invocations 
> in different threads:
> ```
> (InputReaderNode pid=25001, ip=172.31.60.29) malloc_consolidate(): invalid 
> chunk size
> (InputReaderNode pid=25001, ip=172.31.60.29) *** SIGABRT received at 
> time=1665464922 on cpu 9 ***
> (InputReaderNode pid=25001, ip=172.31.60.29) PC: @     0x7f9a480a803b  
> (unknown)  raise
> (InputReaderNode pid=25001, ip=172.31.60.29)     @     0x7f9a480a80c0       
> 4160  (unknown)
> (InputReaderNode pid=25001, ip=172.31.60.29)     @     0x7f9a480fa32c  
> (unknown)  (unknown)
> ```
> Note: this multithreaded code is running inside a Ray actor process, but that 
> shouldn't be a problem.





[jira] [Commented] (ARROW-17984) pq.read_table doesn't seem to be thread safe

2022-11-03 Thread Ziheng Wang (Jira)


[ https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628471#comment-17628471 ]

Ziheng Wang commented on ARROW-17984:
-------------------------------------

```

Thread 42 (Thread 0x7fd1d77fe700 (LWP 10252)):
#0  futex_abstimed_wait (private=0, abstime=0x0, clockid=0, expected=2, 
futex_word=) at ../sysdeps/nptl/futex-internal.h:284
#1  __pthread_rwlock_wrlock_full (abstime=0x0, clockid=0, rwlock=0x2fd02f0) at 
pthread_rwlock_common.c:830
#2  __GI___pthread_rwlock_wrlock (rwlock=0x2fd02f0) at 
pthread_rwlock_wrlock.c:27
#3  0x7fd4a586ee59 in CRYPTO_THREAD_write_lock () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#4  0x7fd4a586436f in ossl_namemap_add_names () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#5  0x7fd4a58512c0 in construct_evp_method () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#6  0x7fd4a5863e15 in ossl_method_construct_this () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#7  0x7fd4a5863c8e in algorithm_do_this () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#8  0x7fd4a586e1db in ossl_provider_doall_activated () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#9  0x7fd4a5863d4f in ossl_algorithm_do_all () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#10 0x7fd4a5864015 in ossl_method_construct () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#11 0x7fd4a5851876 in evp_generic_fetch () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#12 0x7fd4a584fd1a in EVP_CIPHER_fetch () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#13 0x7fd4a5c6ad22 in ssl_evp_cipher_fetch () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#14 0x7fd4a5c605b3 in ssl_load_ciphers () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#15 0x7fd4a5c6c1b1 in SSL_CTX_new_ex () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#16 0x7fd4a5c4e727 in ossl_connect_step1 () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#17 0x7fd4a5c51c7d in ossl_connect_common () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#18 0x7fd4a5c1a348 in Curl_ssl_connect_nonblocking () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#19 0x7fd4a5c32766 in Curl_http_connect () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#20 0x7fd4a5bfc2eb in multi_runsingle () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#21 0x7fd4a5bfd76b in curl_multi_perform () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#22 0x7fd4a5bec833 in easy_transfer () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#23 0x7fd4a5becde9 in curl_easy_perform () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#24 0x7fd4a5737984 in 
Aws::Http::CurlHttpClient::MakeRequest(std::shared_ptr 
const&, Aws::Utils::RateLimits::RateLimiterInterface*, 
Aws::Utils::RateLimits::RateLimiterInterface*) const () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#25 0x7fd4a56fa581 in 
Aws::Client::AWSClient::AttemptOneRequest(std::shared_ptr
 const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char 
const*) const () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#26 0x7fd4a56fb3d0 in 
Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, 
Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char 
const*, char const*) const () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#27 0x7fd4a57072f1 in Aws::Client::AWSXMLClient::MakeRequest(Aws::Http::URI 
const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char 
const*, char const*, char const*) const () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#28 0x7fd4a5219dd9 in arrow::fs::ResolveS3BucketRegion(std::string const&) 
() from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#29 0x7fd4a521b6a3 in arrow::fs::S3Options::FromUri(arrow::internal::Uri 
const&, std::string*) () from 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#30 0x7fd4a4528c4f in arrow::fs::(anonymous 
namespace)::FileSystemFromUriReal(arrow::internal::Uri const&, std::string 
const&, arrow::io::IOContext const&, 
```

[jira] [Commented] (ARROW-17984) pq.read_table doesn't seem to be thread safe

2022-10-28 Thread Weston Pace (Jira)


[ https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625884#comment-17625884 ]

Weston Pace commented on ARROW-17984:
-------------------------------------

Can you share the full output of {{thread apply all bt}}?  The core dump itself 
isn't too portable without the binaries that generated it.  E.g...

{noformat}
warning: Can't open file /usr/bin/python3.8 during file-backed mapping note 
processing

warning: Can't open file /usr/lib/x86_64-linux-gnu/libresolv-2.31.so during 
file-backed mapping note processing

warning: Can't open file 
/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow_python_flight.so.900
 during file-backed mapping note processing
{noformat}

So I just get:

{noformat}
(gdb) thread apply all bt

Thread 42 (LWP 10252):
#0  0x7fd4adf28f37 in ?? ()
#1  0x0024 in ?? ()
#2  0x7fd4a5ff31e0 in ?? ()
#3  0x in ?? ()

Thread 41 (LWP 10072):
#0  0x7fd4ae05676d in ?? ()
#1  0x7fd1f3864396 in ?? ()
#2  0x in ?? ()
...
{noformat}
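
For completeness: to analyze the core dump on another machine, gdb would also 
need the exact binaries and shared libraries from the crashing host. A rough 
sketch of pointing gdb at such copies (paths are hypothetical):

{noformat}
(gdb) set sysroot /path/to/copy/of/crashing/rootfs
(gdb) file /path/to/copy/of/usr/bin/python3.8
(gdb) core-file CoreDump
(gdb) thread apply all bt
{noformat}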



[jira] [Commented] (ARROW-17984) pq.read_table doesn't seem to be thread safe

2022-10-28 Thread Ziheng Wang (Jira)


[ https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625816#comment-17625816 ]

Ziheng Wang commented on ARROW-17984:
-------------------------------------

I have attached a crash file. You can unpack it with apport-unpack, then load 
the core dump with {{gdb `cat ExecutablePath` CoreDump}}. Once inside gdb, run 
{{thread apply all bt}}.
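
For reference, the full sequence would look roughly like this (assuming the 
attachment name from this issue):

{noformat}
$ apport-unpack _usr_bin_python3.8.1000.crash crashdir
$ cd crashdir
$ gdb `cat ExecutablePath` CoreDump
(gdb) thread apply all bt
{noformat}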



[jira] [Commented] (ARROW-17984) pq.read_table doesn't seem to be thread safe

2022-10-11 Thread Ziheng Wang (Jira)


[ https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17616016#comment-17616016 ]

Ziheng Wang commented on ARROW-17984:
-------------------------------------

Unfortunately I cannot figure out how to get a core dump. However, I can say 
with confidence that this is the line that triggers it:

{noformat}
"/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2353 in __init__

  filesystem, path_or_paths = FileSystem.from_uri(path_or_paths)
{noformat}

For context, this means different threads end up constructing their own 
S3FileSystem objects at the same time.
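
If the race really is in the concurrent {{FileSystem.from_uri}} call (and thus 
in concurrent {{S3FileSystem}} construction), one possible mitigation is to 
create the filesystem once, before the worker threads start, and pass it to 
{{pq.read_table}} explicitly. A minimal sketch, not a verified fix; the bucket, 
keys, region, and worker count are placeholders:

{code:python}
import concurrent.futures

import pyarrow.parquet as pq
from pyarrow import fs

# Construct the S3 filesystem once, on the main thread, so that
# FileSystem.from_uri is never invoked concurrently.
s3 = fs.S3FileSystem(region="us-east-1")  # placeholder region

def download(key):
    # Pass the shared filesystem and a bucket-relative path instead of
    # an s3:// URI, so read_table skips URI resolution entirely.
    return pq.read_table("my-bucket/" + key, filesystem=s3)

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    tables = list(pool.map(download, ["a.parquet", "b.parquet"]))
{code}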



[jira] [Commented] (ARROW-17984) pq.read_table doesn't seem to be thread safe

2022-10-11 Thread Antoine Pitrou (Jira)


[ https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17615523#comment-17615523 ]

Antoine Pitrou commented on ARROW-17984:


Can you enable core dumps and try to get a gdb backtrace of all threads?
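
For example, a rough sketch (on Ubuntu the core file may instead be captured 
by apport, depending on /proc/sys/kernel/core_pattern; the script name is a 
placeholder):

{noformat}
$ ulimit -c unlimited         # allow core files in this shell
$ python3 crashing_script.py  # run until it aborts
$ gdb /usr/bin/python3.8 core
(gdb) thread apply all bt
{noformat}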




--
This message was sent by Atlassian Jira
(v8.20.10#820010)