Andrew Kyle Purtell created HBASE-30062:
-------------------------------------------

             Summary: Device layer simulator for MiniDFSCluster-based tests
                 Key: HBASE-30062
                 URL: https://issues.apache.org/jira/browse/HBASE-30062
             Project: HBase
          Issue Type: New Feature
          Components: HFile, integration tests, test, wal
            Reporter: Andrew Kyle Purtell
            Assignee: Andrew Kyle Purtell


On EBS-backed deployments in AWS, or equivalents in other cloud infrastructure 
providers, HBase compaction and replication throughput can be constrained by 
per-volume IOPS limits rather than bandwidth. A faithful device-level simulator 
within the test harness allows developers to reproduce, analyze, and validate 
fixes for such performance issues without requiring actual cloud infrastructure.

This proposed change adds a test-only EBS device layer that operates at the 
DataNode storage level within {{MiniDFSCluster}} by replacing the 
{{FsDatasetSpi}} implementation via Hadoop's pluggable factory mechanism. This 
allows HBase integration tests to simulate realistic cloud block storage 
characteristics, such as per-volume bandwidth budgets, IOPS limits, sequential 
IO coalescing, and per-IO device latency, enabling identification and 
reproduction of IO bottlenecks.

The simulator wraps the real {{FsDatasetImpl}} with a reflection proxy that 
intercepts the three SPI methods where DataNode local IO actually engages the 
underlying block device, without compile-time coupling to internal Hadoop 
classes.
On the read path, {{getBlockInputStream}} wraps the returned {{InputStream}} 
with {{ThrottledBlockInputStream}}, charging every byte against the volume's 
bandwidth and IOPS budgets with sequential IO coalescing. On the write path, 
{{submitBackgroundSyncFileRangeRequest}} charges {{nbytes}} against the 
bandwidth and IOPS budgets, modeling the async 
{{sync_file_range(SYNC_FILE_RANGE_WRITE)}} that the DataNode issues to flush 
dirty pages from the operating system's page cache to the block device. 
Finally, {{finalizeBlock}} charges the remaining unflushed delta (the block's 
bytes minus those already charged via {{sync_file_range}}) against the budgets, 
modeling the {{fsync()}} at block finalization.
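
The interception pattern described above can be sketched with a JDK dynamic proxy. This is a minimal illustration only, not the proposed implementation: {{BlockStore}}, {{InMemoryBlockStore}}, and {{ChargingHandler}} are hypothetical stand-ins, and the real {{FsDatasetSpi}} has a much larger surface. The sketch dispatches on method name so that only the methods of interest are charged and everything else passes straight through to the delegate.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the dataset SPI; NOT the real FsDatasetSpi.
interface BlockStore {
  byte[] readBlock(long blockId);
  void finalizeBlock(long blockId, long unflushedBytes);
  int blockCount();
}

// Trivial delegate playing the role of the real dataset implementation.
final class InMemoryBlockStore implements BlockStore {
  @Override public byte[] readBlock(long blockId) { return new byte[4096]; }
  @Override public void finalizeBlock(long blockId, long unflushedBytes) { }
  @Override public int blockCount() { return 1; }
}

// Forwards every call to the delegate and charges bytes for selected methods.
final class ChargingHandler implements InvocationHandler {
  private final Object delegate;
  final AtomicLong chargedBytes = new AtomicLong();

  ChargingHandler(Object delegate) {
    this.delegate = delegate;
  }

  @Override
  public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
    Object result;
    try {
      result = method.invoke(delegate, args);
    } catch (InvocationTargetException e) {
      throw e.getCause(); // surface the delegate's own exception
    }
    // Dispatch on method name: no compile-time reference to the delegate's class.
    switch (method.getName()) {
      case "readBlock":
        chargedBytes.addAndGet(((byte[]) result).length);
        break;
      case "finalizeBlock":
        chargedBytes.addAndGet((Long) args[1]);
        break;
      default:
        break; // all other methods pass through uncharged
    }
    return result;
  }

  // Builds a proxy that implements BlockStore and routes calls through this handler.
  BlockStore proxy() {
    return (BlockStore) Proxy.newProxyInstance(
        BlockStore.class.getClassLoader(), new Class<?>[] { BlockStore.class }, this);
  }
}
```

In this sketch a caller would obtain the wrapped store with {{new ChargingHandler(real).proxy()}} and use it exactly like the delegate.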

Each proxy gets its own set of {{EBSVolumeDevice}} instances with independent 
budgets. Block-to-volume resolution uses {{delegate.getVolume(block)}}, so 
charges follow the real HDFS placement decisions. A single configuration applies 
to all volumes, but each volume maintains its own token buckets. This matches 
production, where all block devices attached to a host share the same SKU but 
have independent throughput budgets, and where the host itself has a cap on 
maximum aggregate throughput.
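
As a rough sketch of the budgeting model described above, the following shows per-volume token buckets built from one shared configuration, plus a single host-level bucket for the aggregate cap. The class names ({{TokenBucket}}, {{HostBudgets}}), the one-second burst capacity, and the choice to return a wait time instead of sleeping are all assumptions made for illustration. They are not taken from the patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical token bucket; the clock is injected so tests need not sleep.
final class TokenBucket {
  private final double ratePerSec;
  private final double capacity;
  private final LongSupplier nanoClock;
  private double tokens;
  private long lastRefillNanos;

  TokenBucket(double ratePerSec, LongSupplier nanoClock) {
    this.ratePerSec = ratePerSec;
    this.capacity = ratePerSec; // assumption: burst is capped at one second of budget
    this.nanoClock = nanoClock;
    this.tokens = capacity;
    this.lastRefillNanos = nanoClock.getAsLong();
  }

  // Deducts n tokens, allowing debt, and returns the nanoseconds the caller should wait.
  synchronized long consume(long n) {
    long now = nanoClock.getAsLong();
    tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * ratePerSec / 1e9);
    lastRefillNanos = now;
    tokens -= n;
    return tokens >= 0 ? 0L : (long) Math.ceil(-tokens * 1e9 / ratePerSec);
  }
}

// One bandwidth bucket per volume, all built from the same configuration,
// plus one shared bucket modeling the host-level aggregate cap.
final class HostBudgets {
  private final double volumeBytesPerSec;
  private final LongSupplier nanoClock;
  private final TokenBucket instanceBucket;
  private final Map<String, TokenBucket> volumeBuckets = new ConcurrentHashMap<>();

  HostBudgets(double volumeBytesPerSec, double instanceBytesPerSec, LongSupplier nanoClock) {
    this.volumeBytesPerSec = volumeBytesPerSec;
    this.nanoClock = nanoClock;
    this.instanceBucket = new TokenBucket(instanceBytesPerSec, nanoClock);
  }

  // Charges bytes to the named volume and to the host; returns the longer wait in nanoseconds.
  long charge(String volumeId, long bytes) {
    TokenBucket volume = volumeBuckets.computeIfAbsent(
        volumeId, id -> new TokenBucket(volumeBytesPerSec, nanoClock));
    return Math.max(volume.consume(bytes), instanceBucket.consume(bytes));
  }
}
```

With a real clock, {{new HostBudgets(volumeRate, instanceRate, System::nanoTime)}} would be one way to wire this up; the caller then sleeps for the returned duration.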

EBS merges sequential IOs up to 1 MiB before counting them as a single IOPS 
token. The simulator tracks read streams and write streams independently.
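
One way to picture the coalescing rule is the sketch below, which counts how many IOPS tokens a sequence of IOs would cost. {{IoCoalescer}} is a hypothetical name, and charging the token when a merged IO begins is an assumption of this sketch. Following the description above, a separate instance would be kept for each read stream and each write stream.

```java
// Hypothetical helper: counts device IOs for a single stream of reads or writes.
final class IoCoalescer {
  private final long maxIoBytes;
  private long nextExpectedOffset = -1; // offset that would continue the current run
  private long bytesInCurrentIo = 0;    // bytes already merged into the open device IO

  IoCoalescer(long maxIoBytes) {
    this.maxIoBytes = maxIoBytes;
  }

  // Records one IO and returns the number of IOPS tokens it should be charged.
  long record(long offset, long length) {
    if (offset != nextExpectedOffset) {
      bytesInCurrentIo = 0; // non-sequential access ends the current merged IO
    }
    long ops = 0;
    long remaining = length;
    while (remaining > 0) {
      if (bytesInCurrentIo == 0) {
        ops++; // a new device IO begins here
      }
      long take = Math.min(maxIoBytes - bytesInCurrentIo, remaining);
      bytesInCurrentIo += take;
      remaining -= take;
      if (bytesInCurrentIo == maxIoBytes) {
        bytesInCurrentIo = 0; // merged IO is full; the next byte starts another
      }
    }
    nextExpectedOffset = offset + length;
    return ops;
  }
}
```

Under these assumptions, sixteen back-to-back 64 KiB reads cost one token with a 1 MiB limit, while the same sixteen reads at scattered offsets cost sixteen.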

After each IOPS token consumption, the simulator sleeps for a configurable 
duration (default 1 ms), modeling physical device service time.
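
To illustrate where the per-IO service time might be applied on the read path, here is a small sketch of an {{InputStream}} wrapper that sleeps once per device IO. {{LatencyChargingInputStream}} is a hypothetical class, not the {{ThrottledBlockInputStream}} from the patch. It omits the bandwidth and IOPS budgets entirely, and the sleep function is injected so the behavior can be checked without real delays.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

// Hypothetical read wrapper: one latency sleep per (coalesced) device IO.
final class LatencyChargingInputStream extends FilterInputStream {
  private final long maxIoBytes;
  private final long latencyMicros;
  private final LongConsumer sleeper;
  private long bytesInCurrentIo;
  long deviceOps;

  LatencyChargingInputStream(InputStream in, long maxIoBytes, long latencyMicros,
      LongConsumer sleeper) {
    super(in);
    this.maxIoBytes = maxIoBytes;
    this.latencyMicros = latencyMicros;
    this.sleeper = sleeper;
  }

  // Sleeper that really blocks the calling thread.
  static void sleepMicros(long micros) {
    try {
      TimeUnit.MICROSECONDS.sleep(micros);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  @Override
  public int read() throws IOException {
    int b = super.read();
    if (b >= 0) {
      charge(1);
    }
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    int n = super.read(buf, off, len);
    if (n > 0) {
      charge(n);
    }
    return n;
  }

  // A stream is sequential, so bytes merge until maxIoBytes is reached.
  private void charge(long n) {
    while (n > 0) {
      if (bytesInCurrentIo == 0) {
        deviceOps++;
        sleeper.accept(latencyMicros); // model device service time for this IO
      }
      long take = Math.min(maxIoBytes - bytesInCurrentIo, n);
      bytesInCurrentIo = (bytesInCurrentIo + take) % maxIoBytes;
      n -= take;
    }
  }
}
```

For real delays the wrapper could be constructed with {{LatencyChargingInputStream::sleepMicros}} as the sleeper, a 1 MiB {{maxIoBytes}}, and a latency of 1000 microseconds to mirror the stated default.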

Some names and concepts heavily favor Amazon EBS, but this can be addressed 
during review.

Test integration looks like:
{noformat}
Configuration conf = HBaseConfiguration.create();

// Sets dfs.datanode.fsdataset.factory so that each DataNode started by
// MiniDFSCluster wraps its real FsDatasetImpl with a throttling proxy that
// intercepts block-level IO.
EBSDevice.configure(conf, /*budgetMbps=*/500, /*budgetIops=*/500,
    /*deviceLatencyUs=*/1000, /*maxIoSizeKb=*/1024, /*instanceMbps=*/1250);

HBaseTestingUtility util = new HBaseTestingUtility(conf);
util.startMiniZKCluster();
MiniDFSCluster dfsCluster = new MiniDFSCluster.Builder(conf)
    .numDataNodes(1)
    .storagesPerDatanode(6)
    .build();
dfsCluster.waitClusterUp();
util.setDFSCluster(dfsCluster);
util.startMiniCluster(1);

// ... run workload ...

long bytesRead    = EBSDevice.getTotalBytesRead();
long deviceIops   = EBSDevice.getDeviceReadOps();
String perVolume  = EBSDevice.getPerVolumeStats();

EBSDevice.shutdown();
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
