felipecrv commented on issue #40035: URL: https://github.com/apache/arrow/issues/40035#issuecomment-1941785546
I think exposing all these settings in `AzureOptions` would be premature. They are per-request settings, so allowing them to be configured in `AzureOptions` would force the internal implementation to stick to one set of values for every `DownloadTo` request.

My suggestion: we keep statistics about the `ReadAt` calls in a file handle and adjust the options as the calls come in. After this exercise, we might expose a setting in `AzureOptions` that expresses which policy should be used (assuming we can't come up with a good adaptive policy). What the policies would be called depends on which workloads we can isolate in the benchmarks: latency vs. throughput, sequential vs. random.

An alternative to the named policies: we expose only `read_file_max_concurrency` [1] and keep learning a simple model that sets the best `initial_chunk_size` and `chunk_size` parameters adaptively. The goal becomes minimizing the latency of each individual `ReadAt` call, and the user can set a low max concurrency factor for latency-optimized workloads and a high one for throughput-optimized workloads. A low concurrency factor would also be useful when the user of the `FileSystem` interface is managing multiple threads themselves.

[1] Set to some multiple of the number of CPU cores by default.
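
To make the "keep statistics in the file handle and adapt" idea concrete, here is a rough sketch of what I have in mind. Everything in it is hypothetical: `TransferSettings`, `ReadAtStats`, `ChooseSettings`, the thresholds, and the chunk-size bounds are illustrative names and untuned placeholder values, not existing Arrow or Azure SDK APIs. The real policy would have to come out of the benchmarks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical per-request settings mirroring the knobs discussed above.
// This is NOT an Arrow or Azure SDK type.
struct TransferSettings {
  int64_t initial_chunk_size;
  int64_t chunk_size;
  int32_t concurrency;
};

// Hypothetical statistics that a file handle could keep about its ReadAt calls.
class ReadAtStats {
 public:
  void Record(int64_t offset, int64_t nbytes) {
    assert(offset >= 0 && nbytes >= 0);
    // A read counts as sequential if it starts where the previous one ended.
    if (num_reads_ > 0 && offset == next_expected_offset_) {
      ++num_sequential_;
    }
    next_expected_offset_ = offset + nbytes;
    total_bytes_ += nbytes;
    ++num_reads_;
  }

  int64_t num_reads() const { return num_reads_; }

  int64_t mean_read_size() const {
    return num_reads_ == 0 ? 0 : total_bytes_ / num_reads_;
  }

  // Fraction of reads (excluding the first) that were sequential.
  double sequential_fraction() const {
    return num_reads_ <= 1
               ? 0.0
               : static_cast<double>(num_sequential_) / (num_reads_ - 1);
  }

 private:
  int64_t num_reads_ = 0;
  int64_t num_sequential_ = 0;
  int64_t total_bytes_ = 0;
  int64_t next_expected_offset_ = 0;
};

// Pick settings for the next read of `nbytes`, never exceeding the
// user-provided `max_concurrency` (the only knob exposed to the user).
TransferSettings ChooseSettings(const ReadAtStats& stats, int64_t nbytes,
                                int32_t max_concurrency) {
  constexpr int64_t kMinChunk = 1 << 20;   // 1 MiB, placeholder value
  constexpr int64_t kMaxChunk = 16 << 20;  // 16 MiB, placeholder value
  assert(max_concurrency >= 1);
  // Placeholder rule: after a few reads, treat >= 75% sequential as a scan.
  const bool mostly_sequential =
      stats.num_reads() >= 4 && stats.sequential_fraction() >= 0.75;
  // Scans get large chunks; otherwise follow the observed mean read size.
  const int64_t chunk =
      mostly_sequential
          ? kMaxChunk
          : std::clamp(stats.mean_read_size(), kMinChunk, kMaxChunk);
  // No point in using more connections than there are chunks to fetch.
  const int64_t chunks_needed =
      std::max<int64_t>(1, (nbytes + chunk - 1) / chunk);
  const int32_t concurrency = static_cast<int32_t>(
      std::min<int64_t>(chunks_needed, max_concurrency));
  return {chunk, chunk, concurrency};
}
```

The file handle would call `Record()` on every `ReadAt` and `ChooseSettings()` right before issuing the download, so small random reads stay on a single request while large sequential reads fan out up to the user's concurrency limit.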
