Mayur Srivastava created ARROW-8562:
---------------------------------------

             Summary: [C++] IO: Parameterize I/O coalescing using S3 storage 
metrics
                 Key: ARROW-8562
                 URL: https://issues.apache.org/jira/browse/ARROW-8562
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Mayur Srivastava


Related to https://issues.apache.org/jira/browse/ARROW-7995

The adaptive I/O coalescing algorithm uses two parameters:
1. max_io_gap: maximum I/O gap (hole) size in bytes
2. ideal_request_size: ideal I/O request size in bytes

These parameters can be derived from S3 metrics as described below:

An S3-compatible storage service is characterized by two main metrics:
1. Seek time, or Time-To-First-Byte (TTFB), in seconds: the call setup latency 
of a new S3 request
2. Transfer bandwidth (BW) for data, in bytes/sec

1. Computing max_io_gap:

max_io_gap = TTFB * BW

This is also called the bandwidth-delay product (BDP).

Two byte ranges separated by a gap can still be served by the same read if the 
gap is smaller than the bandwidth-delay product [TTFB * BW], i.e. if the 
Time-To-First-Byte (the call setup latency of a new S3 request) is expected to 
exceed the time needed to read and discard the extra bytes on an existing HTTP 
request.
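
The gap rule above can be sketched as follows. This is a minimal illustration, 
not the coalescing code from ARROW-7995: ByteRange, MaxIoGap and CoalesceByGap 
are hypothetical names introduced here, and the sketch assumes ranges are 
merged whenever the gap between them is strictly smaller than max_io_gap.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical byte-range type for illustration (not an Arrow type).
struct ByteRange {
  int64_t offset;  // start of the range, in bytes
  int64_t length;  // size of the range, in bytes
};

// max_io_gap = TTFB * BW, i.e. the bandwidth-delay product in bytes.
inline int64_t MaxIoGap(double ttfb_seconds, double bw_bytes_per_sec) {
  assert(ttfb_seconds >= 0 && bw_bytes_per_sec >= 0);
  return std::llround(ttfb_seconds * bw_bytes_per_sec);
}

// Merge ranges whose gap is smaller than max_io_gap: reading and discarding
// the gap bytes is then expected to be cheaper than issuing a new request.
inline std::vector<ByteRange> CoalesceByGap(std::vector<ByteRange> ranges,
                                            int64_t max_io_gap) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ByteRange& a, const ByteRange& b) {
              return a.offset < b.offset;
            });
  std::vector<ByteRange> merged;
  for (const ByteRange& r : ranges) {
    if (!merged.empty()) {
      ByteRange& last = merged.back();
      const int64_t last_end = last.offset + last.length;
      if (r.offset - last_end < max_io_gap) {
        // Extend the previous read so that it also covers this range.
        const int64_t new_end = std::max(last_end, r.offset + r.length);
        last.length = new_end - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }
  return merged;
}
```

For example, with an illustrative (not measured) TTFB of 5 ms and BW of 
100 MB/s, MaxIoGap returns 500000 bytes, so two ranges 50 bytes apart are 
merged while a range roughly 1 MB further on stays a separate read.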

2. Computing ideal_request_size:

We want high bandwidth utilization per S3 connection, i.e. each request should 
transfer enough data to amortize the seek overhead.
But we also want to exploit parallelism by slicing very large I/O chunks. We 
define two more config parameters, with suggested default values, that control 
the slice size and balance the two effects, with the goal of maximizing net 
data load performance.

BW_util (ideal bandwidth utilization):
The fraction of the per-connection bandwidth that should be utilized to 
maximize net data load.
A good default value is 90% or 0.9.

MAX_IDEAL_REQUEST_SIZE:
The maximum size (in bytes) of a single request, chosen to maximize net 
data load.
A good default value is 64 MiB.

To achieve an effective bandwidth eff_BW = BW_util * BW, a single S3 
get_object request must transfer ideal_request_size bytes such that:
eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW)

Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get the 
following result:
ideal_request_size = max_io_gap * BW_util / (1 - BW_util)
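
The intermediate algebra can be written out as below. Here s is shorthand 
introduced only for this note and stands for ideal_request_size; the other 
symbols are the ones defined above.

```latex
\begin{align*}
\mathrm{BW\_util}\cdot \mathrm{BW}
  &= \frac{s}{\mathrm{TTFB} + s/\mathrm{BW}} \\
\mathrm{BW\_util}\cdot \mathrm{BW}\cdot \mathrm{TTFB} + \mathrm{BW\_util}\cdot s
  &= s \\
\mathrm{BW\_util}\cdot \mathrm{max\_io\_gap}
  &= s\,(1 - \mathrm{BW\_util}) \\
s &= \mathrm{max\_io\_gap}\cdot\frac{\mathrm{BW\_util}}{1 - \mathrm{BW\_util}}
\end{align*}
```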

Applying the MAX_IDEAL_REQUEST_SIZE, we get the following:
ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - 
BW_util))

The proposal is to add a named constructor to io::CacheOptions (PR: 
[https://github.com/apache/arrow/pull/6744] created by [~lidavidm]) that 
computes max_io_gap and ideal_request_size from TTFB and BW; the resulting 
options will then be passed to the reader to configure I/O coalescing.
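
A rough sketch of what such a named constructor might look like is given 
below. It is only an illustration of the formulas in this issue: 
CoalescingOptions and FromStorageMetrics are hypothetical names standing in 
for io::CacheOptions and its eventual factory, and the parameter units 
(seconds, bytes/sec, bytes) are an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for io::CacheOptions; field names follow this issue.
struct CoalescingOptions {
  int64_t max_io_gap;          // bytes
  int64_t ideal_request_size;  // bytes

  // Named constructor: derive both parameters from TTFB (seconds) and
  // BW (bytes/sec). Defaults follow the values suggested in this issue.
  static CoalescingOptions FromStorageMetrics(
      double ttfb_seconds, double bw_bytes_per_sec, double bw_util = 0.9,
      int64_t max_ideal_request_size = 64LL * 1024 * 1024) {
    assert(ttfb_seconds >= 0 && bw_bytes_per_sec > 0);
    assert(bw_util > 0 && bw_util < 1);
    // max_io_gap = TTFB * BW (bandwidth-delay product).
    const double bdp = ttfb_seconds * bw_bytes_per_sec;
    // ideal_request_size = max_io_gap * BW_util / (1 - BW_util), then capped.
    const double uncapped = bdp * bw_util / (1.0 - bw_util);
    const double capped =
        std::min(static_cast<double>(max_ideal_request_size), uncapped);
    return CoalescingOptions{std::llround(bdp), std::llround(capped)};
  }
};
```

With illustrative (not measured) inputs of TTFB = 5 ms and BW = 100 MB/s, 
this gives max_io_gap = 500000 bytes and ideal_request_size = 4500000 bytes; 
such a request takes 0.005 + 0.045 = 0.05 s, i.e. 90 MB/s effective 
bandwidth. With TTFB = 100 ms the uncapped size would be 90 MB, so the 
64 MiB cap applies.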



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
