[ https://issues.apache.org/jira/browse/ARROW-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599612#comment-17599612 ]

Antoine Pitrou commented on ARROW-17033:
----------------------------------------

cc [~benpharkins]

> [C++] Add GCS connection pool size option
> -----------------------------------------
>
>                 Key: ARROW-17033
>                 URL: https://issues.apache.org/jira/browse/ARROW-17033
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 8.0.0
>            Reporter: Leonhard Gruenschloss
>            Priority: Minor
>              Labels: GCP, good-first-issue, performance
>
> Multi-threaded read performance in Arrow's GCS file system implementation
> currently is relatively low. Given the high latency of cloud blob systems
> like GCS, a common strategy is to use many concurrent readers (if the system
> has enough memory to support that), e.g. using 100 threads.
> The GCS client library offers a [{{ConnectionPoolSize}}
> option|https://googleapis.dev/cpp/google-cloud-storage/latest/structgoogle_1_1cloud_1_1storage_1_1v1_1_1ConnectionPoolSizeOption.html].
> If this option is set to a value that's too low, concurrency is throttled.
> At the moment, this is not exposed in
> [{{GcsOptions}}|https://github.com/apache/arrow/blob/73cdd6a59b52781cc43e097ccd63ac36f705ee2e/cpp/src/arrow/filesystem/gcsfs.h#L59],
> consequently limiting multi-threaded throughput.
> Instead of exposing this option, an alternative implementation strategy could
> be to use the same value as set by {{arrow::io::SetIOThreadPoolCapacity}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)