Hi all,
I'd like to refactor the entire OSSFileIO implementation to improve its
performance and fix several bugs.

## Background

First, let me briefly explain how the following test results were obtained. I
implemented a FileIO benchmark that runs both S3FileIO and OSSFileIO against
the same Aliyun OSS bucket from the same VM for comparison (Aliyun OSS is S3
protocol compatible). I also ensured that disk, memory, CPU, and network
bandwidth were not bottlenecks, and used identical runtime parameters, so any
performance differences in the results should come from the FileIO
implementation itself.

## Issues

### 1. Random Read: Critical Performance Issue

The random read code has a serious problem that results in extremely poor
random read performance.

**Test Results**

```
Benchmark                   (bufferSizeKB)  (fileIOClass)                            (fileSizeKB)  Mode  Cnt  Score      Error        Units
FileIOBenchmark.randomRead  1024            org.apache.iceberg.aws.s3.S3FileIO       131072        avgt  4    1817.108   ± 37.337     ms/op
FileIOBenchmark.randomRead  1024            org.apache.iceberg.aliyun.oss.OSSFileIO  131072        avgt  5    27164.064  ± 24437.452  ms/op
```

With a buffer size of 1MB and total file size of 128MB, OSSFileIO is more than
10x slower than S3FileIO (about 15x by mean score).

**Analysis**

When a random read ends, `OSSInputStream` calls the underlying `close()`
method, which continues to consume the remaining TCP data, causing unnecessary
waiting. In contrast, `S3InputStream` calls `abort()`, which directly tears
down the TCP connection.

**Problems and Impact**

1. Calling `close()` results in wasted time and network bandwidth. This has
   significant impact: a slowdown of 10x or more may make OSSFileIO completely
   unusable in certain scenarios.
2. `OSSInputStream` does not implement `RangeReadable`, so every random read
   disrupts the sequential read stream. This has moderate impact.

### 2. Sequential Write: Poor Performance

**Test Results**

```
Benchmark                        (bufferSizeKB)  (fileIOClass)                            (fileSizeKB)  Mode  Cnt  Score     Error      Units
FileIOBenchmark.sequentialWrite  1024            org.apache.iceberg.aliyun.oss.OSSFileIO  1048576       avgt  5    4162.820  ± 162.809  ms/op
FileIOBenchmark.sequentialWrite  1024            org.apache.iceberg.aws.s3.S3FileIO       1048576       avgt  4    1615.085  ± 73.897   ms/op
```

With a buffer size of 1MB and total file size of 1GB, OSSFileIO is about 2.6x
slower. In terms of per-stream bandwidth, S3FileIO achieves roughly 640MB/s
while OSSFileIO achieves only about 249MB/s.

**Analysis**

The current OSSFileIO implementation writes data to a local file first, then
uploads the entire file via the `PutObject` API. For large files, S3FileIO
instead uploads in parts (32MB per part by default), asynchronously and with
multiple concurrent uploads, so the upload time overlaps with upper-layer
business logic.

**Problem List**

1. Sequential write performance is roughly 2.6x worse. Moderate impact: usable
   but suboptimal.
2. File size has an upper limit. The maximum file size for `PutObject` is 5GB,
   while multipart upload supports up to about 48TB. This may make OSSFileIO
   unusable in some scenarios.
3. Page cache thrashing. Since OSSFileIO accumulates data into a single local
   file, dirty pages in the page cache may trigger disk flushing. In contrast,
   S3FileIO's 32MB part files are deleted after upload, avoiding excessive page
   cache accumulation. In memory-constrained or disk-performance-constrained
   environments, this may become an upload throughput bottleneck.

### 3. OSS SDK Version Update

The OSS SDK now has a brand new V2 version (see
https://github.com/aliyun/alibabacloud-oss-java-sdk-v2), which offers
improvements in both community activity and performance.

## Plan

I propose to complete this work in two phases:

1. Refactor the entire OSSFileIO to fix the issues described above.
2. Continue with deeper performance optimizations based on Aliyun OSS-specific
   features and prefetch.

Looking forward to your feedback and suggestions!

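
P.S. To make the `close()` versus `abort()` point from issue 1 concrete, here
is a minimal, SDK-free sketch. It is not the actual `OSSInputStream` or
`S3InputStream` code. `CountingBody` and `CloseVsAbortSketch` are hypothetical
names, and the sketch assumes the response body covers the whole 128MB object
while the caller only wants 1MB of it. It only counts how many bytes each way
of ending the read pulls from the (simulated) connection.

```java
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical stand-in for an HTTP response body; counts bytes pulled from the "network". */
class CountingBody extends InputStream {
  private long remaining;
  long bytesPulled = 0;

  CountingBody(long length) {
    this.remaining = length;
  }

  @Override
  public int read() {
    if (remaining <= 0) {
      return -1;
    }
    remaining--;
    bytesPulled++;
    return 0;
  }

  @Override
  public int read(byte[] buf, int off, int len) {
    if (len == 0) {
      return 0;
    }
    if (remaining <= 0) {
      return -1;
    }
    int n = (int) Math.min(len, remaining);
    Arrays.fill(buf, off, off + n, (byte) 0); // synthetic payload, contents do not matter
    remaining -= n;
    bytesPulled += n;
    return n;
  }

  @Override
  public void close() {
    // nothing to release in this simulation
  }
}

class CloseVsAbortSketch {
  /** Models a close() that drains the rest of the body before releasing the connection. */
  static long drainingClose(CountingBody body) {
    byte[] scratch = new byte[8192];
    while (body.read(scratch, 0, scratch.length) != -1) {
      // every byte read here is transferred only to be thrown away
    }
    body.close();
    return body.bytesPulled;
  }

  /** Models abort(): drop the connection without reading what is left. */
  static long abort(CountingBody body) {
    body.close();
    return body.bytesPulled;
  }

  /** Reads `wanted` bytes of a `bodyLength`-byte body, then ends the read one way or the other. */
  static long bytesTransferred(long bodyLength, int wanted, boolean drain) {
    CountingBody body = new CountingBody(bodyLength);
    byte[] buf = new byte[wanted];
    int read = 0;
    while (read < wanted) {
      int n = body.read(buf, read, wanted - read);
      if (n < 0) {
        break;
      }
      read += n;
    }
    return drain ? drainingClose(body) : abort(body);
  }

  public static void main(String[] args) {
    long bodyLength = 128L * 1024 * 1024; // 128MB object, as in the benchmark
    int wanted = 1024 * 1024;             // one 1MB random read
    System.out.println("draining close: " + bytesTransferred(bodyLength, wanted, true) + " bytes");
    System.out.println("abort:          " + bytesTransferred(bodyLength, wanted, false) + " bytes");
  }
}
```

Under these assumptions the draining path pulls the full 128MB for a 1MB read,
while the abort path pulls only the 1MB that was requested. The trade-off is
that an aborted connection cannot be reused, so the next read pays for a new
connection; the benchmark numbers above suggest that cost is much smaller than
draining for this access pattern.
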
Thanks,
Liquan Liu