On Fri, 23 Dec 2022 22:28:34 GMT, Markus KARG <d...@openjdk.org> wrote:

> I/O has always been much slower than CPU and memory access and, thanks to 
> physical constraints, always will be.
> While CPUs can be shrunk more and more, and can hold more and more memory 
> cache on or near a CPU core, the distance between a CPU core and an I/O device 
> cannot be reduced much: it will stay "far" away.
> Due to this simple logic (and other factors), the spread between the performance 
> of CPU and memory access on one hand, and the performance of I/O on the other 
> hand, increases with every new CPU generation.
> As a consequence, internal adjustment factors of the JDK need to be revised 
> from time to time to ensure optimum performance on each hardware generation.
> 
> One such factor is the size of the temporary transfer buffer used internally 
> by `InputStream::transferTo`.
> Since its introduction in JDK 9, many years (and hence hardware generations) 
> have passed, so it is time to check whether that buffer's size is still 
> appropriate.
> 
> Using JMH on a typical, modern cloud platform, it was proven that the current 
> 8K buffer is (much) too small on modern hardware:
> The small buffer clearly stands in the way of faster transfers.
> The ops/s of a simple `FileInputStream.transferTo(ByteArrayOutputStream)` 
> operation on JDK 21 could be doubled (!) by only doubling the buffer size 
> from 8K to 16K, which seems to be a considerable and cheap deal.
> Doubling the buffer further shows only marginal improvements of approx. 1% 
> to 3% per doubling of size, which does not justify the additional memory 
> consumption.
> 
> 
> TransferToPerformance.transferTo    8192  1048576  thrpt  25  1349.929 ±  47.057  ops/s
> TransferToPerformance.transferTo   16384  1048576  thrpt  25  2633.560 ±  93.337  ops/s
> TransferToPerformance.transferTo   32768  1048576  thrpt  25  2721.025 ±  89.555  ops/s
> TransferToPerformance.transferTo   65536  1048576  thrpt  25  2855.949 ±  96.623  ops/s
> TransferToPerformance.transferTo  131072  1048576  thrpt  25  2903.062 ±  40.798  ops/s
> 
> 
> Even on small or limited platforms, an investment of 8K of additional temporary 
> buffer is very cheap and very useful, as it doubles the performance of 
> `InputStream::transferTo`, in particular for legacy (non-NIO) applications 
> still using `FileInputStream` and `ByteArrayOutputStream`.
> I dare to say, even if not proven, that this is a very considerable (possibly 
> the major) share of existing applications, as NIO was only adopted gradually by 
> programmers.
> 
> For the given reasons, it should be appropriate to change 
> `DEFAULT_BUFFER_SIZE` from 8192 to 16384.

Hello Markus,

Happy 2023!

I checked out your JMH code and I'm a little uncomfortable with two things:
- at least on Linux, the benchmark does not actually involve any input from the 
device (HD, SSD) at all.
- at least on Linux, the benchmark involves output to the device (HD, SSD) in a 
way that might interfere with the benchmark.

Let me explain. In the per-invocation @Setup method you create a new file with 
random bytes and close it. At least on Linux, this just writes the content to the 
write-back buffer cache; the disk is not touched yet. Writing happens 
asynchronously in the background while the benchmark is running. The benchmark 
reads from that file, but this does not involve reading from the device at all, 
since the whole content is already present in the buffer cache. So what this 
benchmark measures is the overhead of JNI invocations from Java to native code 
and the overhead of system calls from native code to the kernel, which are served 
from the buffer cache. All of this happens while asynchronous writes to disk are 
taking place, which might interfere with the benchmark as they compete with it 
for system resources.
While running your unchanged benchmark, I checked the output of `iostat 2`, which 
confirms the above claims:


Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme0n1          22.50         0.00       438.75         0.00          0        877          0


So I modified your benchmark a bit:
- I copied the InputStream.transferTo method into the benchmark itself so that 
the buffer size can be passed as an argument. I think this should not make any 
noticeable difference, since we are not measuring the quality of JIT-ed code but 
the overheads of JNI and system calls, which are of an entirely different 
magnitude.
- I moved the creation of the temporary file from the per-invocation to the 
per-trial @Setup method and added a "sync" command at the end, which flushes the 
data to the actual device and waits for the flushing to complete.
- I added a "drop_caches" command just before opening the file for reading in 
the per-invocation @Setup method. This drops the cached data of the file so that 
reads in the benchmark method are made from the actual device. Since the use of 
the buffer cache is an important use case, I added a boolean `dropCaches` 
parameter to measure both cases.

Both commands work on Linux, and I don't have equivalent commands for Windows. 
`sync` works on Mac OS too, but `drop_caches` is Linux-specific. Mac OS has a 
`purge` command for dropping caches which must be executed as root.
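For what it's worth, the Mac OS counterpart of dropCaches() would probably look 
something like the sketch below; the `purgeCaches` name is mine, I have not 
tested it, and it assumes `purge` is available and sudo does not prompt for a 
password:


    // Hypothetical Mac OS counterpart of dropCaches(); untested sketch.
    // Assumes the "purge" command is available and sudo needs no password prompt.
    private static void purgeCaches() throws IOException, InterruptedException {
        var proc = new ProcessBuilder("sudo", "purge")
            .inheritIO() // surface any sudo/purge errors on the console
            .start();
        proc.waitFor();
    }
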

Here are the results with dropCaches=true on a Linux PC with SSD:
https://jmh.morethan.io/?gist=abcba62638a1fa652fd597fa141a1a10

Here are the results with dropCaches=false:
https://jmh.morethan.io/?gist=ac8af4d8232b6f3a758491381d2fae31

And here's the modified benchmark:


import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import static java.nio.file.StandardOpenOption.*;

public class TransferToPerformance {

    @State(Scope.Benchmark)
    public static class Config {

        @Param({"8388608"}) // 8 MiB
        public int streamSize;

        @Param({"1024", "2048", "4096", "8192", "16384", "32768", "65536", 
"131072"})
        public int bufferSize;

        @Param({"false", "true"}) // just before opening file for reading
        public boolean dropCaches;

        public InputStream source;
        public OutputStream target;
        private Path path;

        @Setup(Level.Trial)
        public void setUpTrial() throws IOException, InterruptedException {
            path = Files.createTempFile("a-", ".bin");
            var bytes = createRandomBytes(streamSize);
            Files.deleteIfExists(this.path);
            Files.write(this.path, bytes, CREATE, TRUNCATE_EXISTING, WRITE);
            sync();
        }

        @TearDown(Level.Trial)
        public void tearDownTrial() throws IOException {
            Files.deleteIfExists(this.path);
        }

        @Setup(Level.Invocation)
        public void setUpInvocation() throws IOException, InterruptedException {
            if (dropCaches) {
                dropCaches();
            }
            source = new FileInputStream(this.path.toFile());
            target = new ByteArrayOutputStream();
        }

        @TearDown(Level.Invocation)
        public void tearDownInvocation() throws IOException {
            source.close();
            target.close();
            source = null;
            target = null;
        }
    }

    private static byte[] createRandomBytes(int size) {
        var bytes = new byte[size];
        ThreadLocalRandom.current().nextBytes(bytes);
        return bytes;
    }

    @Benchmark
    @Fork(warmups = 0, value = 1)
    @BenchmarkMode(Mode.AverageTime)
    @Warmup(iterations = 3, time = 2)
    @Measurement(iterations = 5, time = 4)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public final void transferTo(Config config, Blackhole blackhole) throws IOException {
        blackhole.consume(transferTo(config.bufferSize, config.source, config.target));
    }

    private static long transferTo(int bufferSize, InputStream in, OutputStream out) throws IOException {
        long transferred = 0;
        byte[] buffer = new byte[bufferSize];
        int read;
        while ((read = in.read(buffer, 0, bufferSize)) >= 0) {
            out.write(buffer, 0, read);
            transferred += read;
        }
        return transferred;
    }

    private static void sync() throws IOException, InterruptedException {
        var proc = new ProcessBuilder("sync").start();
        proc.waitFor();
    }

    private static void dropCaches() throws IOException, InterruptedException {
        var proc = new ProcessBuilder(
            "sudo",
            "bash", "-c",
            "echo 1 > /proc/sys/vm/drop_caches"
        ).start();
        proc.waitFor();
    }
}
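
As a side note, instead of shelling out to `sync`, the test file could probably 
be flushed from Java itself. The helper below is just a sketch of that idea, not 
part of the benchmark above; it assumes that forcing only this one file, rather 
than all dirty pages as `sync` does, is good enough here:


    // Possible Java-only alternative to the external "sync" command: force the
    // test file's content and metadata to the storage device (fsync under the hood).
    // Unlike "sync", this flushes only this one file, not all dirty pages.
    // Needs: import java.nio.channels.FileChannel; import java.nio.file.StandardOpenOption;
    private static void syncFile(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.WRITE)) {
            ch.force(true); // true = also flush file metadata
        }
    }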


When reading involves the actual device, at least on my computer, increasing the 
buffer size from 8K to 16K does not give any benefit. When reading involves 
cached data only, the benefit is marginal. I wonder what results you would get 
with this modified benchmark...

-------------

PR: https://git.openjdk.org/jdk/pull/11783
