[jira] [Commented] (ARROW-17984) pq.read_table doesn't seem to be thread safe
[ https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629119#comment-17629119 ] Antoine Pitrou commented on ARROW-17984:

[~marsupialtail] Can you post the backtrace for _all_ threads? (Using "thread apply all bt" as suggested by [~westonpace].) Since that will probably be quite long, I suggest posting it on a site such as https://gist.github.com/ and linking it here.

> pq.read_table doesn't seem to be thread safe
>
> Key: ARROW-17984
> URL: https://issues.apache.org/jira/browse/ARROW-17984
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet
> Affects Versions: 9.0.0
> Reporter: Ziheng Wang
> Priority: Major
> Attachments: _usr_bin_python3.8.1000.crash
>
> Before PR [https://github.com/apache/arrow/pull/13799] gets merged into master, I am using multithreading to improve read bandwidth from S3. Even after that PR gets merged, I will probably still use multithreading to some extent.
> However, pq.read_table from S3 doesn't seem to be thread safe. It appears to use the new dataset reader under the hood. I cannot provide a reproduction (not a stable one, anyway), but this is roughly the script I have been using:
> ~~~
> def get_next_batch(self, mapper_id, pos=None):
>     def download(file):
>         return pq.read_table("s3://" + self.bucket + "/" + file,
>                              columns=self.columns, filters=self.filters)
>
>     executor = concurrent.futures.ThreadPoolExecutor(max_workers=self.workers)
>     futures = {executor.submit(download, file): file for file in my_files}
>     for future in concurrent.futures.as_completed(futures):
>         yield future.result()
> ~~~
> The errors all have to do with malloc segfaults, which makes me suspect the connection object is being reused across pq.read_table invocations in different threads:
> ```
> (InputReaderNode pid=25001, ip=172.31.60.29) malloc_consolidate(): invalid chunk size
> (InputReaderNode pid=25001, ip=172.31.60.29) *** SIGABRT received at time=1665464922 on cpu 9 ***
> (InputReaderNode pid=25001, ip=172.31.60.29) PC: @ 0x7f9a480a803b (unknown) raise
> (InputReaderNode pid=25001, ip=172.31.60.29) @ 0x7f9a480a80c0 4160 (unknown)
> (InputReaderNode pid=25001, ip=172.31.60.29) @ 0x7f9a480fa32c (unknown) (unknown)
> ```
> Note: this multithreaded code is running inside a Ray actor process, but that shouldn't be a problem.
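For anyone trying to reproduce this: the quoted script is bound to a class instance and is not runnable on its own. A self-contained sketch along the same lines (bucket and file names are placeholders, and since the crash is not stable it may take many iterations to trigger):

```
import concurrent.futures

import pyarrow.parquet as pq

BUCKET = "my-bucket"  # placeholder
FILES = ["part-0.parquet", "part-1.parquet"]  # placeholders

def download(file):
    # Each call parses the s3:// URI and constructs a fresh S3FileSystem
    # internally, which is the suspected source of the race.
    return pq.read_table("s3://" + BUCKET + "/" + file)

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
    futures = {executor.submit(download, f): f for f in FILES * 100}
    for future in concurrent.futures.as_completed(futures):
        future.result()
```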
[jira] [Commented] (ARROW-17984) pq.read_table doesn't seem to be thread safe
[ https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628471#comment-17628471 ] Ziheng Wang commented on ARROW-17984:

```
Thread 42 (Thread 0x7fd1d77fe700 (LWP 10252)):
#0  futex_abstimed_wait (private=0, abstime=0x0, clockid=0, expected=2, futex_word=) at ../sysdeps/nptl/futex-internal.h:284
#1  __pthread_rwlock_wrlock_full (abstime=0x0, clockid=0, rwlock=0x2fd02f0) at pthread_rwlock_common.c:830
#2  __GI___pthread_rwlock_wrlock (rwlock=0x2fd02f0) at pthread_rwlock_wrlock.c:27
#3  0x7fd4a586ee59 in CRYPTO_THREAD_write_lock () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#4  0x7fd4a586436f in ossl_namemap_add_names () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#5  0x7fd4a58512c0 in construct_evp_method () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#6  0x7fd4a5863e15 in ossl_method_construct_this () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#7  0x7fd4a5863c8e in algorithm_do_this () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#8  0x7fd4a586e1db in ossl_provider_doall_activated () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#9  0x7fd4a5863d4f in ossl_algorithm_do_all () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#10 0x7fd4a5864015 in ossl_method_construct () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#11 0x7fd4a5851876 in evp_generic_fetch () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#12 0x7fd4a584fd1a in EVP_CIPHER_fetch () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#13 0x7fd4a5c6ad22 in ssl_evp_cipher_fetch () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#14 0x7fd4a5c605b3 in ssl_load_ciphers () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#15 0x7fd4a5c6c1b1 in SSL_CTX_new_ex () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#16 0x7fd4a5c4e727 in ossl_connect_step1 () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#17 0x7fd4a5c51c7d in ossl_connect_common () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#18 0x7fd4a5c1a348 in Curl_ssl_connect_nonblocking () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#19 0x7fd4a5c32766 in Curl_http_connect () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#20 0x7fd4a5bfc2eb in multi_runsingle () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#21 0x7fd4a5bfd76b in curl_multi_perform () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#22 0x7fd4a5bec833 in easy_transfer () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#23 0x7fd4a5becde9 in curl_easy_perform () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#24 0x7fd4a5737984 in Aws::Http::CurlHttpClient::MakeRequest(std::shared_ptr const&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#25 0x7fd4a56fa581 in Aws::Client::AWSClient::AttemptOneRequest(std::shared_ptr const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#26 0x7fd4a56fb3d0 in Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#27 0x7fd4a57072f1 in Aws::Client::AWSXMLClient::MakeRequest(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#28 0x7fd4a5219dd9 in arrow::fs::ResolveS3BucketRegion(std::string const&) () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#29 0x7fd4a521b6a3 in arrow::fs::S3Options::FromUri(arrow::internal::Uri const&, std::string*) () from /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow.so.900
#30 0x7fd4a4528c4f in arrow::fs::(anonymous namespace)::FileSystemFromUriReal(arrow::internal::Uri const&, std::string const&, arrow::io::IOContext const&,
```
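The backtrace above blocks inside OpenSSL initialization reached through arrow::fs::S3Options::FromUri, i.e. during filesystem construction rather than during the reads themselves. A quick, unverified way to test that hypothesis from Python is to serialize just the filesystem construction behind a lock; the wrapper name here is illustrative, not an official API:

```
import threading

import pyarrow.parquet as pq
from pyarrow.fs import FileSystem

_construct_lock = threading.Lock()

def read_table_serialized_fs(uri, **kwargs):
    # Serialize only FileSystem.from_uri(); the reads themselves still
    # run in parallel across threads.
    with _construct_lock:
        filesystem, path = FileSystem.from_uri(uri)
    return pq.read_table(path, filesystem=filesystem, **kwargs)
```

If the crashes disappear with this wrapper, that would point squarely at concurrent filesystem construction rather than at the Parquet reader.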
[jira] [Commented] (ARROW-17984) pq.read_table doesn't seem to be thread safe
[ https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17625884#comment-17625884 ] Weston Pace commented on ARROW-17984:

Can you share the full output of {{thread apply all bt}}? The core dump itself isn't too portable without the binaries that generated it. E.g.:

{noformat}
warning: Can't open file /usr/bin/python3.8 during file-backed mapping note processing
warning: Can't open file /usr/lib/x86_64-linux-gnu/libresolv-2.31.so during file-backed mapping note processing
warning: Can't open file /home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/libarrow_python_flight.so.900 during file-backed mapping note processing
{noformat}

So I just get:

{noformat}
(gdb) thread apply all bt

Thread 42 (LWP 10252):
#0  0x7fd4adf28f37 in ?? ()
#1  0x0024 in ?? ()
#2  0x7fd4a5ff31e0 in ?? ()
#3  0x in ?? ()

Thread 41 (LWP 10072):
#0  0x7fd4ae05676d in ?? ()
#1  0x7fd1f3864396 in ?? ()
#2  0x in ?? ()
...
{noformat}
[jira] [Commented] (ARROW-17984) pq.read_table doesn't seem to be thread safe
[ https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17625816#comment-17625816 ] Ziheng Wang commented on ARROW-17984:

I have attached a crash file. You can unpack it with apport-unpack, then load the core dump with {{gdb `cat ExecutablePath` CoreDump}}. Once inside gdb, run {{thread apply all bt}}.
[jira] [Commented] (ARROW-17984) pq.read_table doesn't seem to be thread safe
[ https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17616016#comment-17616016 ] Ziheng Wang commented on ARROW-17984:

Unfortunately I cannot figure out how to get a core dump. However, I can say with confidence that this is the line that triggers it:

{noformat}
"/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2353, in __init__
    filesystem, path_or_paths = FileSystem.from_uri(path_or_paths)
{noformat}

For context, this ends up constructing a separate S3FileSystem object in each thread.
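Given that, a possible workaround sketch (untested): construct the S3 filesystem once up front and pass it to pq.read_table via its filesystem= argument, so worker threads never go through FileSystem.from_uri at all. Bucket, region, and key names below are placeholders:

```
import concurrent.futures

import pyarrow.parquet as pq
from pyarrow import fs

# Created once, on the main thread, before any workers start.
s3 = fs.S3FileSystem(region="us-east-1")  # placeholder region

def download(key):
    # With an explicit filesystem, the path carries no "s3://" scheme,
    # so read_table does not call FileSystem.from_uri().
    return pq.read_table(key, filesystem=s3)

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(download, "my-bucket/" + name)
               for name in ["a.parquet", "b.parquet"]]  # placeholder keys
    for future in concurrent.futures.as_completed(futures):
        table = future.result()
```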
[jira] [Commented] (ARROW-17984) pq.read_table doesn't seem to be thread safe
[ https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17615523#comment-17615523 ] Antoine Pitrou commented on ARROW-17984:

Can you enable core dumps and try to get a gdb backtrace of all threads?
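For reference, the core-dump limit can also be raised from inside the Python process before the crash, equivalent to running `ulimit -c` in the launching shell (this only raises the soft limit up to whatever hard limit the environment already allows):

```
import resource

# Raise the core-dump soft limit to the current hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
```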