[I] [C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK [arrow]


Tom-Newton opened a new issue, #40035:
URL: https://github.com/apache/arrow/issues/40035

### Describe the enhancement requested

Optimisation to https://github.com/apache/arrow/issues/37511
Child of https://github.com/apache/arrow/issues/18014

When reading from Azure blob storage, the bandwidth we get per connection is
very dependent on the latency to the filesystem. To achieve good bandwidth at
high latency, far greater concurrency is needed. For example, this is relevant
when reading from blob storage in a different region to your compute.
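
To make the latency argument concrete, here is a rough back-of-envelope sketch. It assumes a deliberately simplified model that is not taken from Arrow or the Azure SDK: each connection issues one chunk request at a time, and every request pays one full round trip before data flows at some peak per-connection rate. The function names and all numbers are hypothetical.

```cpp
#include <cmath>

// Simplified model (an assumption, not a measurement): one request in flight
// per connection, and each request costs one round trip plus transfer time.
double PerConnectionBytesPerSec(double chunk_bytes, double rtt_sec,
                                double peak_bytes_per_sec) {
  return chunk_bytes / (rtt_sec + chunk_bytes / peak_bytes_per_sec);
}

// Number of concurrent connections needed to reach a target aggregate
// bandwidth under the model above.
int ConnectionsNeeded(double target_bytes_per_sec, double chunk_bytes,
                      double rtt_sec, double peak_bytes_per_sec) {
  const double per_conn =
      PerConnectionBytesPerSec(chunk_bytes, rtt_sec, peak_bytes_per_sec);
  return static_cast<int>(std::ceil(target_bytes_per_sec / per_conn));
}
```

Under this model, with 4 MiB chunks and an illustrative 100 MB/s per-connection peak, reaching 1 GB/s takes roughly 11 connections at 2 ms round-trip time but roughly 34 at 100 ms, which is the kind of gap that motivates making the concurrency tunable.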

As an example, let's consider reading a parquet file. There are two levels of
parallelism that I'm aware of when using Arrow and the native `AzureFileSystem`:
1. Arrow will make concurrent calls to `ReadAt` for each column and row
group combination. At most we can have one concurrent connection per column and
row group combination, so for small parquet files this may be less than we
would like.
2. Within `ReadAt` the `AzureFileSystem` calls `BlobClient::DownloadTo`
which implements some extra concurrency internally
https://github.com/Azure/azure-sdk-for-cpp/blob/ddd0f4bd075d6715ac3004136a690445c4cde5c2/sdk/storage/azure-storage-blobs/src/blob_client.cpp#L516.
The purpose of this issue is to expose the [config options for this
parallelism](https://github.com/Azure/azure-sdk-for-cpp/blob/ddd0f4bd075d6715ac3004136a690445c4cde5c2/sdk/storage/azure-storage-blobs/inc/azure/storage/blobs/blob_options.hpp#L691-L709)
to the user.
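
As a rough illustration of what these options control: as I understand it, the SDK's `DownloadBlobToOptions::TransferOptions` has `InitialChunkSize`, `ChunkSize` and `Concurrency` fields, and `DownloadTo` fetches an initial chunk first and then splits the remainder into fixed-size chunks fetched in parallel. The sketch below models that splitting without depending on the SDK. `TransferConfig`, `PlanRequests` and `EffectiveParallelism` are hypothetical names for illustration only, not Arrow or Azure APIs, and the real SDK logic may differ in detail.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the SDK's transfer options; not a real Arrow or
// Azure type. Sizes are in bytes.
struct TransferConfig {
  int64_t initial_chunk_size;
  int64_t chunk_size;
  int32_t concurrency;
};

// Returns the (offset, length) of each request a single read of
// [offset, offset + length) would be split into under the assumed scheme:
// one initial chunk, then fixed-size follow-up chunks.
std::vector<std::pair<int64_t, int64_t>> PlanRequests(
    int64_t offset, int64_t length, const TransferConfig& cfg) {
  std::vector<std::pair<int64_t, int64_t>> requests;
  if (length <= 0 || cfg.initial_chunk_size <= 0 || cfg.chunk_size <= 0) {
    return requests;  // Nothing to plan, or invalid configuration.
  }
  const int64_t first = std::min(length, cfg.initial_chunk_size);
  requests.emplace_back(offset, first);
  const int64_t end = offset + length;
  for (int64_t pos = offset + first; pos < end;) {
    const int64_t n = std::min(cfg.chunk_size, end - pos);
    requests.emplace_back(pos, n);
    pos += n;
  }
  return requests;
}

// Follow-up chunks that can actually be in flight at once: capped both by
// the configured concurrency and by how many follow-up chunks exist.
int32_t EffectiveParallelism(
    const std::vector<std::pair<int64_t, int64_t>>& requests,
    const TransferConfig& cfg) {
  if (requests.size() <= 1) return 0;
  const int64_t follow_ups = static_cast<int64_t>(requests.size()) - 1;
  return static_cast<int32_t>(
      std::min<int64_t>(follow_ups, cfg.concurrency));
}
```

One thing this makes visible: if a `ReadAt` range fits entirely within the initial chunk, the plan contains a single request and the concurrency setting has no effect, so a user chasing throughput on a high-latency link would likely want to tune the chunk sizes as well as the concurrency.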

### Component(s)

C++

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.