Tom-Newton opened a new issue, #40035:
URL: https://github.com/apache/arrow/issues/40035

   ### Describe the enhancement requested
   
   Optimisation to https://github.com/apache/arrow/issues/37511
   Child of https://github.com/apache/arrow/issues/18014
   
   When reading from Azure Blob Storage, the bandwidth we get per connection is 
very dependent on the latency to the filesystem. To achieve good bandwidth at 
high latency, far greater concurrency is needed. This is relevant, for example, 
when reading from blob storage in a different region from your compute. 
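   
   To make the latency argument concrete, here is a rough back-of-the-envelope 
model in C++. It is only a sketch: it assumes each connection issues 
back-to-back requests, each paying one full round trip before data flows, and 
all function names and numbers are hypothetical rather than measured.

```cpp
#include <algorithm>
#include <cassert>

// Rough model (an assumption, not a measurement): one connection fetches
// `request_bytes` per request, waits `latency_s` for the round trip, then
// streams the payload at `link_bytes_per_s`.
double PerConnectionThroughput(double request_bytes, double latency_s,
                               double link_bytes_per_s) {
  assert(request_bytes > 0 && latency_s >= 0 && link_bytes_per_s > 0);
  return request_bytes / (latency_s + request_bytes / link_bytes_per_s);
}

// Aggregate throughput of `connections` independent connections, capped by
// the capacity of the shared link.
double AggregateThroughput(int connections, double request_bytes,
                           double latency_s, double link_bytes_per_s) {
  assert(connections > 0);
  const double uncapped =
      connections *
      PerConnectionThroughput(request_bytes, latency_s, link_bytes_per_s);
  return std::min(link_bytes_per_s, uncapped);
}
```

   Under this model, with 4 MB requests, 100 ms of latency and a 1 GB/s link, 
`PerConnectionThroughput(4e6, 0.1, 1e9)` comes out at roughly 38 MB/s, so many 
connections are needed before the link itself becomes the limit.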
   
   As an example, let's consider reading a Parquet file. There are two levels of 
parallelism that I'm aware of when using Arrow and the native `AzureFileSystem`:
   1. Arrow will make concurrent calls to `ReadAt` for each column and row 
group combination. At most we can have one connection at a time per column and 
row group combination, so for small Parquet files this may be fewer connections 
than we would like. 
   2. Within `ReadAt`, the `AzureFileSystem` calls `BlobClient::DownloadTo`, 
which implements some extra concurrency internally: 
https://github.com/Azure/azure-sdk-for-cpp/blob/ddd0f4bd075d6715ac3004136a690445c4cde5c2/sdk/storage/azure-storage-blobs/src/blob_client.cpp#L516
 The purpose of this issue is to expose the [options that control this 
parallelism](https://github.com/Azure/azure-sdk-for-cpp/blob/ddd0f4bd075d6715ac3004136a690445c4cde5c2/sdk/storage/azure-storage-blobs/inc/azure/storage/blobs/blob_options.hpp#L691-L709)
 to the user.  
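   
   As a sketch of why these options matter, the snippet below models how a 
single ranged download might be split into requests. It is not Arrow or Azure 
SDK code: `TransferOptionsSketch`, `DownloadPlan` and `PlanDownload` are 
hypothetical names, the three fields mirror `InitialChunkSize`, `ChunkSize` and 
`Concurrency` from the linked header, and the defaults (256 MiB, 4 MiB, 5) and 
the splitting rule are my reading of the SDK, so treat them as assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the SDK's transfer options. Defaults are assumed
// from the linked header, not guaranteed.
struct TransferOptionsSketch {
  int64_t initial_chunk_size = 256LL * 1024 * 1024;  // size of first request
  int64_t chunk_size = 4LL * 1024 * 1024;            // size of later requests
  int32_t concurrency = 5;                           // max parallel requests
};

struct DownloadPlan {
  int64_t request_count;          // total HTTP range requests issued
  int64_t max_parallel_requests;  // upper bound on simultaneous requests
};

// Assumed splitting rule: one initial request of up to `initial_chunk_size`
// bytes, then the remainder in `chunk_size` pieces fetched by at most
// `concurrency` workers.
DownloadPlan PlanDownload(int64_t length, const TransferOptionsSketch& opts) {
  assert(length > 0 && opts.initial_chunk_size > 0 && opts.chunk_size > 0 &&
         opts.concurrency > 0);
  const int64_t first = std::min(length, opts.initial_chunk_size);
  const int64_t remaining = length - first;
  // Ceiling division: number of follow-up chunks needed for the remainder.
  const int64_t extra = (remaining + opts.chunk_size - 1) / opts.chunk_size;
  const int64_t parallel =
      std::max<int64_t>(1, std::min<int64_t>(extra, opts.concurrency));
  return {1 + extra, parallel};
}
```

   If those assumptions hold, a 64 MiB `ReadAt` with the default options is 
served by a single request with no internal parallelism, whereas lowering the 
initial chunk size and chunk size to 1 MiB and raising the concurrency to 16 
would split it into 64 requests with up to 16 in flight. That is the kind of 
tuning this issue proposes to make available.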
   
   ### Component(s)
   
   C++

