Re: [I] [C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK [arrow]

via GitHub Tue, 13 Feb 2024 07:30:51 -0800


felipecrv commented on issue #40035:
URL: https://github.com/apache/arrow/issues/40035#issuecomment-1941785546


   I think exposing all these settings in `AzureOptions` would be premature. 
They are per-request settings, so putting them in `AzureOptions` would force 
the internal implementation to stick to one set of values for every 
`DownloadTo` request.
   
   My suggestion: we keep statistics about the `ReadAt` calls in a file handle 
and adjust the options as the calls come in. After this exercise we might 
expose a setting in `AzureOptions` that expresses which policy should be used 
(assuming we can't come up with a good adaptive policy). What the policies 
would be called depends on which workloads we can isolate in the benchmarks: 
latency vs throughput, sequential vs random.
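
   A rough sketch of what "keep statistics about the `ReadAt` calls" could 
look like, assuming a per-file-handle tracker. `ReadStats`, `TransferPlan`, 
and every threshold below are hypothetical names and placeholder values, not 
existing Arrow or Azure SDK code; benchmarks would have to pick the real 
numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical per-request settings. A real implementation would copy these
// into the options of each `DownloadTo` request.
struct TransferPlan {
  int64_t initial_chunk_size;
  int64_t chunk_size;
  int32_t concurrency;
};

// Hypothetical statistics kept by one file handle across its `ReadAt` calls.
class ReadStats {
 public:
  // Record one ReadAt(position, nbytes) call.
  void Record(int64_t position, int64_t nbytes) {
    assert(position >= 0 && nbytes >= 0);
    if (num_reads_ > 0 && position == next_sequential_position_) {
      ++num_sequential_;
    }
    next_sequential_position_ = position + nbytes;
    total_bytes_ += nbytes;
    ++num_reads_;
  }

  // Fraction of reads that started exactly where the previous one ended.
  double SequentialRatio() const {
    if (num_reads_ < 2) return 0.0;
    return static_cast<double>(num_sequential_) / (num_reads_ - 1);
  }

  int64_t MeanReadSize() const {
    return num_reads_ > 0 ? total_bytes_ / num_reads_ : 0;
  }

  // Choose settings for the next request (placeholder thresholds).
  TransferPlan Plan(int32_t max_concurrency) const {
    constexpr int64_t kMiB = 1024 * 1024;
    if (SequentialRatio() >= 0.75) {
      // Mostly sequential: favor throughput with large parallel chunks.
      return {8 * kMiB, 8 * kMiB, max_concurrency};
    }
    // Mostly random, or too few samples: favor latency with one request
    // sized to the typical read, clamped to [64 KiB, 4 MiB].
    const int64_t chunk = std::max<int64_t>(
        64 * 1024, std::min<int64_t>(MeanReadSize(), 4 * kMiB));
    return {chunk, chunk, 1};
  }

 private:
  int64_t num_reads_ = 0;
  int64_t num_sequential_ = 0;
  int64_t total_bytes_ = 0;
  int64_t next_sequential_position_ = 0;
};
```

   The file handle would call `Record` on every `ReadAt` and `Plan` right 
before issuing the request, so the settings can differ from one request to 
the next.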
   
   An alternative to the named policies would be to expose only 
`read_file_max_concurrency` [1] and keep learning a simple model that sets 
the best `initial_chunk_size` and `chunk_size` parameters adaptively. The 
goal becomes minimizing the latency of each individual `ReadAt` call, and the 
user can set a low max concurrency factor for latency-optimized workloads and 
a high one for throughput-optimized workloads. A low concurrency factor 
would also be useful when the user of the `FileSystem` interface is managing 
multiple threads themselves.
   
   [1] Set to some multiple of the number of CPU cores by default.
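
   For the default in [1], something along these lines might work. The 
helper names and the factor of 2 are assumptions for illustration; only 
`std::thread::hardware_concurrency()` is a real standard-library call, and it 
may return 0 when the core count is unknown.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>

// Hypothetical default for `read_file_max_concurrency`: a multiple of the
// number of CPU cores. The factor of 2 is a placeholder.
inline int32_t DefaultReadFileMaxConcurrency(int32_t per_core_factor = 2) {
  unsigned int cores = std::thread::hardware_concurrency();
  if (cores == 0) cores = 1;  // core count unknown: fall back to one core
  return static_cast<int32_t>(cores) * std::max<int32_t>(per_core_factor, 1);
}

// Hypothetical resolution of the user's setting: a non-positive value means
// "use the default"; anything else is taken as-is.
inline int32_t ResolveMaxConcurrency(int32_t user_value) {
  return user_value > 0 ? user_value : DefaultReadFileMaxConcurrency();
}
```

   A caller that already manages its own threads would pass a small value 
such as 1, while a throughput-oriented caller would leave the default or 
raise it.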
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

