On Fri, 23 Dec 2022 22:28:34 GMT, Markus KARG <d...@openjdk.org> wrote:
> I/O has always been much slower than CPU and memory access and, thanks to physical constraints, always will be. While CPUs can be shrunk more and more, and can hold more and more memory cache on or near a CPU core, the distance between a CPU core and an I/O device cannot be reduced much: it will stay "far" away. Due to this simple logic (and other factors), the spread between the performance of CPU and memory access on one hand, and the performance of I/O on the other, increases with every new CPU generation. As a consequence, internal tuning factors of the JDK need to be revised from time to time to ensure optimum performance on each hardware generation.
>
> One such factor is the size of the temporary transfer buffer used internally by `InputStream::transferTo`. Since its introduction in JDK 9, many years (hence hardware generations) have passed, so it is time to check the appropriateness of that buffer's size.
>
> Using JMH on a typical, modern cloud platform, it was shown that the current 8K buffer is (much) too small on modern hardware: the small buffer clearly stands in the way of faster transfers. The ops/s of a simple `FileInputStream.transferTo(ByteArrayOutputStream)` operation on JDK 21 could be doubled (!) just by doubling the buffer size from 8K to 16K, which seems to be a considerable and cheap deal. Doubling the buffer further shows only marginal improvements of approx. 1% to 3% per doubling of size, which does not justify the additional memory consumption.
>     TransferToPerformance.transferTo    8192  1048576  thrpt  25  1349.929 ±  47.057  ops/s
>     TransferToPerformance.transferTo   16384  1048576  thrpt  25  2633.560 ±  93.337  ops/s
>     TransferToPerformance.transferTo   32768  1048576  thrpt  25  2721.025 ±  89.555  ops/s
>     TransferToPerformance.transferTo   65536  1048576  thrpt  25  2855.949 ±  96.623  ops/s
>     TransferToPerformance.transferTo  131072  1048576  thrpt  25  2903.062 ±  40.798  ops/s
>
> Even on small or limited platforms, an investment of 8K of additional temporary buffer is very cheap and very useful, as it doubles the performance of `InputStream::transferTo`, in particular for legacy (non-NIO) applications still using `FileInputStream` and `ByteArrayOutputStream`. I dare say, even if not proven, that this is a very considerable (possibly the major) share of existing applications, as NIO was only adopted gradually by programmers.
>
> For the given reasons, it should be appropriate to change `DEFAULT_BUFFER_SIZE` from 8192 to 16384.

Hello Markus,

Happy 2023! I checked out your JMH code and I'm a little uncomfortable with two things:

- at least on Linux, the benchmark does not actually involve any input from the device (HD, SSD) at all;
- at least on Linux, the benchmark involves output to the device (HD, SSD) in a way that might interfere with the benchmark.

Let me explain. In the per-invocation @Setup method you create a new file with random bytes and close it. At least on Linux, this actually just writes the content to the write-back buffer cache; the disk is not touched yet. Writing happens asynchronously in the background while the benchmark is running. The benchmark reads from that file, but this does not involve reading from the device at all, since the whole content is already present in the buffer cache. So what this benchmark measures is the overhead of JNI invocations from Java to native code plus the overhead of system calls from native code to the kernel, which are served from the buffer cache.
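(As a reminder, the default `InputStream::transferTo` being tuned here is essentially the following loop over a fixed-size temporary buffer — a simplified sketch of the JDK 9+ code, not the exact source; `DEFAULT_BUFFER_SIZE` is currently 8192:)

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class TransferToSketch {

    // Mirrors java.io.InputStream.DEFAULT_BUFFER_SIZE (8192 as of today).
    static final int DEFAULT_BUFFER_SIZE = 8192;

    // Simplified version of the default InputStream::transferTo loop:
    // read up to DEFAULT_BUFFER_SIZE bytes at a time and copy them out.
    static long transferTo(InputStream in, OutputStream out) throws IOException {
        long transferred = 0;
        byte[] buffer = new byte[DEFAULT_BUFFER_SIZE];
        int read;
        while ((read = in.read(buffer, 0, DEFAULT_BUFFER_SIZE)) >= 0) {
            out.write(buffer, 0, read);
            transferred += read;
        }
        return transferred;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[20_000]; // larger than one buffer, so the loop runs three times
        var source = new ByteArrayInputStream(data);
        var target = new ByteArrayOutputStream();
        System.out.println(transferTo(source, target)); // prints 20000
    }
}
```

So the buffer size directly controls how many read/write round trips (and thus JNI and syscall crossings) a transfer of a given length needs.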
All this happens while asynchronous writes to disk are taking place, which might interfere with the benchmark as they compete with it for system resources. While running your unchanged benchmark, I checked the output of `iostat 2`, which confirms the above claims:

    Device     tps   kB_read/s  kB_wrtn/s  kB_dscd/s  kB_read  kB_wrtn  kB_dscd
    nvme0n1  22.50        0.00     438.75       0.00        0      877        0

So I modified your benchmark a bit:

- I copied the InputStream.transferTo method into the benchmark itself so that the buffer size can be passed as an argument. This should not make any noticeable difference, since we are not measuring the quality of JIT-ed code but the overheads of JNI and system calls, which are of an entirely different magnitude.
- I moved the creation of the temporary file from the per-invocation to the per-trial @Setup method and added a "sync" command at the end, which flushes the data to the actual device and waits for the flush to complete.
- I added a "drop_caches" command just before opening the file for reading in the per-invocation @Setup method. This drops the cached data of the file so that reads in the benchmark method are made from the actual device. Since use of the buffer cache is an important use case in its own right, I added a boolean `dropCaches` parameter to measure both cases.

Both commands work on Linux; I don't have equivalent commands for Windows. `sync` works on Mac OS too, but `drop_caches` is Linux specific. Mac OS has a `purge` command for dropping caches, which must be executed as root.
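As an aside, a more portable alternative to shelling out to `sync` would be to flush the file from Java itself via `FileChannel.force(true)`. Unlike the external command, it only flushes that one file rather than all dirty pages, but it needs no platform-specific tooling. A sketch of that idea (my own suggestion, not part of the benchmark below):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SyncSketch {

    // Flush one file's data and metadata to the storage device.
    // Per-file only, but portable across Linux, Mac OS and Windows.
    static void syncFile(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.WRITE)) {
            ch.force(true); // block until the data has reached the device
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sync-", ".bin");
        Files.write(tmp, new byte[8192]);
        syncFile(tmp); // data is now on the device, not just in the page cache
        System.out.println(Files.size(tmp)); // prints 8192
        Files.delete(tmp);
    }
}
```

There is no similarly portable Java-level equivalent of `drop_caches`, which is why the external command remains necessary for the cold-cache case.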
Here are the results with dropCaches=true on a Linux PC with SSD:
https://jmh.morethan.io/?gist=abcba62638a1fa652fd597fa141a1a10

Here are the results with dropCaches=false:
https://jmh.morethan.io/?gist=ac8af4d8232b6f3a758491381d2fae31

And here's the modified benchmark:

public class TransferToPerformance {

    @State(Scope.Benchmark)
    public static class Config {

        @Param({"8388608"}) // 8 MiB
        public int streamSize;

        @Param({"1024", "2048", "4096", "8192", "16384", "32768", "65536", "131072"})
        public int bufferSize;

        @Param({"false", "true"}) // just before opening file for reading
        public boolean dropCaches;

        public InputStream source;
        public OutputStream target;

        private Path path;

        @Setup(Level.Trial)
        public void setUpTrial() throws IOException, InterruptedException {
            path = Files.createTempFile("a-", ".bin");
            var bytes = createRandomBytes(streamSize);
            Files.deleteIfExists(this.path);
            Files.write(this.path, bytes, CREATE, TRUNCATE_EXISTING, WRITE);
            sync();
        }

        @TearDown(Level.Trial)
        public void tearDownTrial() throws IOException {
            Files.deleteIfExists(this.path);
        }

        @Setup(Level.Invocation)
        public void setUpInvocation() throws IOException, InterruptedException {
            if (dropCaches) {
                dropCaches();
            }
            source = new FileInputStream(this.path.toFile());
            target = new ByteArrayOutputStream();
        }

        @TearDown(Level.Invocation)
        public void tearDownInvocation() throws IOException {
            source.close();
            target.close();
            source = null;
            target = null;
        }
    }

    private static byte[] createRandomBytes(int size) {
        var bytes = new byte[size];
        ThreadLocalRandom.current().nextBytes(bytes);
        return bytes;
    }

    @Benchmark
    @Fork(warmups = 0, value = 1)
    @BenchmarkMode(Mode.AverageTime)
    @Warmup(iterations = 3, time = 2)
    @Measurement(iterations = 5, time = 4)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public final void transferTo(Config config, Blackhole blackhole) throws IOException {
        blackhole.consume(transferTo(config.bufferSize, config.source, config.target));
    }

    private static long transferTo(int bufferSize, InputStream in, OutputStream out)
            throws IOException {
        long transferred = 0;
        byte[] buffer = new byte[bufferSize];
        int read;
        while ((read = in.read(buffer, 0, bufferSize)) >= 0) {
            out.write(buffer, 0, read);
            transferred += read;
        }
        return transferred;
    }

    private static void sync() throws IOException, InterruptedException {
        var proc = new ProcessBuilder("sync").start();
        proc.waitFor();
    }

    private static void dropCaches() throws IOException, InterruptedException {
        var proc = new ProcessBuilder(
            "sudo", "bash", "-c", "echo 1 > /proc/sys/vm/drop_caches"
        ).start();
        proc.waitFor();
    }
}

When reading involves the actual device, at least on my computer, increasing the buffer size from 8k to 16k does not give any benefit. When reading involves cached data only, the benefit is marginal. I wonder what results you would get with this modified benchmark...

-------------

PR: https://git.openjdk.org/jdk/pull/11783