This is an automated email from the ASF dual-hosted git repository.
kou pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 10eaafd2b4 GH-49544: [Ruby] Add benchmark for readers (#49545)
10eaafd2b4 is described below
commit 10eaafd2b4af3ca3e68181636ba4087c236d0c1d
Author: Sutou Kouhei <[email protected]>
AuthorDate: Sat Mar 21 17:59:16 2026 +0900
GH-49544: [Ruby] Add benchmark for readers (#49545)
### Rationale for this change
Performance is important in Apache Arrow, so benchmarks are useful when
developing an Apache Arrow implementation.
### What changes are included in this PR?
* Add benchmarks for file and streaming readers.
* Add support for `mmap` in streaming reader.
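
As a rough illustration of the `mmap` support, the sketch below reads consecutive chunks from an `IO::Buffer` by advancing an offset, which is the same pattern the patched streaming reader uses for `File` and `String` inputs. The `read_chunks` helper, the payload and the chunk sizes are hypothetical and exist only for this example. The commit passes `nil` as the size to map the whole file; an explicit size is used here instead.

```ruby
require "tempfile"

Warning[:experimental] = false # IO::Buffer is still marked experimental

# Hypothetical helper (not part of red-arrow-format): cut consecutive
# chunks of the given sizes out of an IO::Buffer by advancing an offset.
def read_chunks(buffer, sizes)
  offset = 0
  sizes.collect do |size|
    chunk = buffer.slice(offset, size) # a view into the buffer, not a copy
    offset += size
    chunk.get_string
  end
end

payload = "0123456789abcdef" # stand-in for Arrow streaming data

# String input: wrap the String in an IO::Buffer.
from_string = read_chunks(IO::Buffer.for(payload), [4, 12])

# File input: map the file read-only instead of calling IO#read.
from_file = Tempfile.create("io-buffer-sketch") do |file|
  file.binmode
  file.write(payload)
  file.flush
  mapped = IO::Buffer.map(file, payload.bytesize, 0, IO::Buffer::READONLY)
  read_chunks(mapped, [4, 12])
end

p from_string
p from_file
```

Both inputs go through the same slicing code, so only inputs that are neither `File` nor `String` (for example a pipe) still need the `read`-based path.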
Here are benchmark results in my environment.
The pure Ruby implementation is about 5-6x slower than the release build C++
implementation but a bit faster than the debug build C++ implementation.
Release build C++/GLib:
File format:
```console
$ ruby -v -S benchmark-driver ruby/red-arrow-format/benchmark/file-reader.yaml
ruby 4.1.0dev (2026-02-19T09:04:23Z master 6bb0b6b16c) +PRISM [x86_64-linux]
Warming up --------------------------------------
           Arrow::Table.load    11.207k i/s -     12.188k times in 1.087487s (89.23μs/i)
Arrow::RecordBatchFileReader    19.724k i/s -     21.296k times in 1.079727s (50.70μs/i)
     ArrowFormat::FileReader     3.555k i/s -      3.883k times in 1.092223s (281.28μs/i)
Calculating -------------------------------------
           Arrow::Table.load    11.483k i/s -     33.622k times in 2.928024s (87.09μs/i)
Arrow::RecordBatchFileReader    19.673k i/s -     59.170k times in 3.007729s (50.83μs/i)
     ArrowFormat::FileReader     3.574k i/s -     10.665k times in 2.984214s (279.81μs/i)
Comparison:
Arrow::RecordBatchFileReader:     19672.6 i/s
           Arrow::Table.load:     11482.8 i/s - 1.71x slower
     ArrowFormat::FileReader:      3573.8 i/s - 5.50x slower
```
Streaming format:
```console
$ ruby -v -S benchmark-driver ruby/red-arrow-format/benchmark/streaming-reader.yaml
ruby 4.1.0dev (2026-02-19T09:04:23Z master 6bb0b6b16c) +PRISM [x86_64-linux]
Warming up --------------------------------------
             Arrow::Table.load    11.360k i/s -     12.485k times in 1.099067s (88.03μs/i)
Arrow::RecordBatchStreamReader    20.180k i/s -     21.857k times in 1.083126s (49.56μs/i)
  ArrowFormat::StreamingReader     3.398k i/s -      3.400k times in 1.000479s (294.26μs/i)
Calculating -------------------------------------
             Arrow::Table.load    11.397k i/s -     34.078k times in 2.990170s (87.74μs/i)
Arrow::RecordBatchStreamReader    20.039k i/s -     60.538k times in 3.020964s (49.90μs/i)
  ArrowFormat::StreamingReader     3.340k i/s -     10.195k times in 3.052059s (299.37μs/i)
Comparison:
Arrow::RecordBatchStreamReader:     20039.3 i/s
             Arrow::Table.load:     11396.7 i/s - 1.76x slower
  ArrowFormat::StreamingReader:      3340.4 i/s - 6.00x slower
```
Debug build C++/GLib:
File format:
```console
$ ruby -v -S benchmark-driver ruby/red-arrow-format/benchmark/file-reader.yaml
ruby 4.1.0dev (2026-02-19T09:04:23Z master 6bb0b6b16c) +PRISM [x86_64-linux]
Warming up --------------------------------------
           Arrow::Table.load     2.175k i/s -      2.200k times in 1.011375s (459.72μs/i)
Arrow::RecordBatchFileReader     3.129k i/s -      3.421k times in 1.093397s (319.61μs/i)
     ArrowFormat::FileReader     3.384k i/s -      3.430k times in 1.013625s (295.52μs/i)
Calculating -------------------------------------
           Arrow::Table.load     2.145k i/s -      6.525k times in 3.041760s (466.17μs/i)
Arrow::RecordBatchFileReader     3.020k i/s -      9.386k times in 3.108456s (331.18μs/i)
     ArrowFormat::FileReader     3.368k i/s -     10.151k times in 3.013576s (296.87μs/i)
Comparison:
     ArrowFormat::FileReader:      3368.4 i/s
Arrow::RecordBatchFileReader:      3019.5 i/s - 1.12x slower
           Arrow::Table.load:      2145.1 i/s - 1.57x slower
```
Streaming format:
```console
$ ruby -v -S benchmark-driver ruby/red-arrow-format/benchmark/streaming-reader.yaml
ruby 4.1.0dev (2026-02-19T09:04:23Z master 6bb0b6b16c) +PRISM [x86_64-linux]
Warming up --------------------------------------
             Arrow::Table.load     2.115k i/s -      2.140k times in 1.011815s (472.81μs/i)
Arrow::RecordBatchStreamReader     3.052k i/s -      3.355k times in 1.099273s (327.65μs/i)
  ArrowFormat::StreamingReader     3.283k i/s -      3.290k times in 1.002016s (304.56μs/i)
Calculating -------------------------------------
             Arrow::Table.load     2.198k i/s -      6.345k times in 2.886603s (454.94μs/i)
Arrow::RecordBatchStreamReader     3.105k i/s -      9.156k times in 2.948523s (322.03μs/i)
  ArrowFormat::StreamingReader     3.225k i/s -      9.850k times in 3.054339s (310.09μs/i)
Comparison:
  ArrowFormat::StreamingReader:      3224.9 i/s
Arrow::RecordBatchStreamReader:      3105.3 i/s - 1.04x slower
             Arrow::Table.load:      2198.1 i/s - 1.47x slower
```
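
As a quick check of the "about 5-6x slower" summary, the ratios can be recomputed from the i/s figures in the `Comparison:` sections above. This is only a sketch: it compares the pure Ruby reader with the fastest C++/GLib reader of each run, and the hash keys are labels made up for this example.

```ruby
# Iterations per second copied from the "Comparison:" sections above.
# :cpp is Arrow::RecordBatchFileReader or Arrow::RecordBatchStreamReader,
# :ruby is ArrowFormat::FileReader or ArrowFormat::StreamingReader.
results = {
  "release/file"      => {cpp: 19672.6, ruby: 3573.8},
  "release/streaming" => {cpp: 20039.3, ruby: 3340.4},
  "debug/file"        => {cpp: 3019.5,  ruby: 3368.4},
  "debug/streaming"   => {cpp: 3105.3,  ruby: 3224.9},
}

# How many times faster the C++/GLib reader is than the pure Ruby reader.
# A value below 1 means the pure Ruby reader is the faster one.
ratios = results.transform_values do |ips|
  (ips[:cpp] / ips[:ruby]).round(2)
end

ratios.each do |label, ratio|
  puts("#{label}: C++/GLib is #{ratio}x the speed of pure Ruby")
end
```

The release build ratios come out at roughly 5.5 and 6.0, and the debug build ratios fall just below 1, which matches the summary.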
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No.
* GitHub Issue: #49544
Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
---
.github/workflows/ruby.yml | 2 +-
ruby/Rakefile | 22 +++++-
ruby/red-arrow-format/Gemfile | 8 ++-
ruby/red-arrow-format/Rakefile | 22 ++++++
ruby/red-arrow-format/benchmark/file-reader.yaml | 53 ++++++++++++++
.../benchmark/streaming-reader.yaml | 53 ++++++++++++++
.../lib/arrow-format/streaming-reader.rb | 27 +++++--
ruby/red-arrow-format/test/test-reader.rb | 82 ++++++++++++++++++++--
8 files changed, 254 insertions(+), 15 deletions(-)
diff --git a/.github/workflows/ruby.yml b/.github/workflows/ruby.yml
index 1e91f62487..04d974f641 100644
--- a/.github/workflows/ruby.yml
+++ b/.github/workflows/ruby.yml
@@ -124,7 +124,7 @@ jobs:
run: archery docker push ubuntu-ruby
macos:
- name: ARM64 macOS 14 GLib & Ruby
+ name: ARM64 macOS GLib & Ruby
runs-on: macos-latest
if: ${{ !contains(github.event.pull_request.title, 'WIP') }}
timeout-minutes: 60
diff --git a/ruby/Rakefile b/ruby/Rakefile
index 7f26773403..6ff20915bc 100644
--- a/ruby/Rakefile
+++ b/ruby/Rakefile
@@ -35,9 +35,11 @@ end
packages.each do |package|
namespace package do
+ package_dir = File.join(base_dir, package)
+
desc "Run test for #{package}"
task :test do
- cd(File.join(base_dir, package)) do
+ cd(package_dir) do
if ENV["USE_BUNDLER"]
sh("bundle", "exec", "rake", "test")
else
@@ -46,9 +48,22 @@ packages.each do |package|
end
end
+ desc "Run benchmark for #{package}"
+ task :benchmark do
+ cd(package_dir) do
+ if File.directory?("benchmark")
+ if ENV["USE_BUNDLER"]
+ sh("bundle", "exec", "rake", "benchmark")
+ else
+ ruby("-S", "rake", "benchmark")
+ end
+ end
+ end
+ end
+
desc "Install #{package}"
task :install do
- cd(File.join(base_dir, package)) do
+ cd(package_dir) do
if ENV["USE_BUNDLER"]
sh("bundle", "exec", "rake", "install")
else
@@ -70,6 +85,9 @@ end
desc "Run test for all packages"
task test: sorted_packages.collect {|package| "#{package}:test"}
+desc "Run benchmark for all packages"
+task benchmark: sorted_packages.collect {|package| "#{package}:benchmark"}
+
desc "Install all packages"
task install: sorted_packages.collect {|package| "#{package}:install"}
diff --git a/ruby/red-arrow-format/Gemfile b/ruby/red-arrow-format/Gemfile
index 2307252d9e..296a7b4435 100644
--- a/ruby/red-arrow-format/Gemfile
+++ b/ruby/red-arrow-format/Gemfile
@@ -21,6 +21,10 @@ source "https://rubygems.org/"
gemspec
-gem "rake"
gem "red-arrow", path: "../red-arrow"
-gem "test-unit"
+
+group :development do
+ gem "benchmark-driver"
+ gem "rake"
+ gem "test-unit"
+end
diff --git a/ruby/red-arrow-format/Rakefile b/ruby/red-arrow-format/Rakefile
index f50f18f3b8..3671f35d6e 100644
--- a/ruby/red-arrow-format/Rakefile
+++ b/ruby/red-arrow-format/Rakefile
@@ -39,6 +39,28 @@ task :test do
end
end
+benchmark_tasks = []
+namespace :benchmark do
+ Dir.glob("benchmark/*.yaml").sort.each do |yaml|
+ name = File.basename(yaml, ".*")
+ command_line = [
+ RbConfig.ruby, "-v", "-S", "benchmark-driver", File.expand_path(yaml),
+ ]
+
+ desc "Run #{name} benchmark"
+ task name do
+ puts("```console")
+ puts("$ #{command_line.join(" ")}")
+ sh(*command_line, verbose: false)
+ puts("```")
+ end
+ benchmark_tasks << "benchmark:#{name}"
+ end
+end
+
+desc "Run all benchmarks"
+task :benchmark => benchmark_tasks
+
namespace :flat_buffers do
desc "Generate FlatBuffers code"
task :generate do
diff --git a/ruby/red-arrow-format/benchmark/file-reader.yaml b/ruby/red-arrow-format/benchmark/file-reader.yaml
new file mode 100644
index 0000000000..25e8c73ef1
--- /dev/null
+++ b/ruby/red-arrow-format/benchmark/file-reader.yaml
@@ -0,0 +1,53 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+prelude: |
+ Warning[:experimental] = false
+
+ require "arrow"
+ require "arrow-format"
+
+ seed = 29
+ random = Random.new(seed)
+
+ n_columns = 100
+ n_rows = 10000
+ max_uint32 = 2 ** 32 - 1
+ arrays = n_columns.times.collect do |i|
+ if i.even?
+ Arrow::UInt32Array.new(n_rows.times.collect {random.rand(max_uint32)})
+ else
+ Arrow::BinaryArray.new(n_rows.times.collect {random.bytes(random.rand(10))})
+ end
+ end
+ table = Arrow::Table.new(arrays.collect.with_index {|array, i| [i, array]})
+ buffer = Arrow::ResizableBuffer.new(4096)
+ table.save(buffer, format: :arrow_file)
+
+ GC.start
+ GC.disable
+benchmark:
+ "Arrow::Table.load": |
+ Arrow::Table.load(buffer, format: :arrow_file)
+ "Arrow::RecordBatchFileReader": |
+ Arrow::BufferInputStream.open(buffer) do |input|
+ Arrow::RecordBatchFileReader.new(input).each do
+ end
+ end
+ "ArrowFormat::FileReader": |
+ ArrowFormat::FileReader.new(buffer.data.to_s).each do
+ end
diff --git a/ruby/red-arrow-format/benchmark/streaming-reader.yaml b/ruby/red-arrow-format/benchmark/streaming-reader.yaml
new file mode 100644
index 0000000000..f1b383395b
--- /dev/null
+++ b/ruby/red-arrow-format/benchmark/streaming-reader.yaml
@@ -0,0 +1,53 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+prelude: |
+ Warning[:experimental] = false
+
+ require "arrow"
+ require "arrow-format"
+
+ seed = 29
+ random = Random.new(seed)
+
+ n_columns = 100
+ n_rows = 10000
+ max_uint32 = 2 ** 32 - 1
+ arrays = n_columns.times.collect do |i|
+ if i.even?
+ Arrow::UInt32Array.new(n_rows.times.collect {random.rand(max_uint32)})
+ else
+ Arrow::BinaryArray.new(n_rows.times.collect {random.bytes(random.rand(10))})
+ end
+ end
+ table = Arrow::Table.new(arrays.collect.with_index {|array, i| [i, array]})
+ buffer = Arrow::ResizableBuffer.new(4096)
+ table.save(buffer, format: :arrow_streaming)
+
+ GC.start
+ GC.disable
+benchmark:
+ "Arrow::Table.load": |
+ Arrow::Table.load(buffer, format: :arrow_streaming)
+ "Arrow::RecordBatchStreamReader": |
+ Arrow::BufferInputStream.open(buffer) do |input|
+ Arrow::RecordBatchStreamReader.new(input).each do
+ end
+ end
+ "ArrowFormat::StreamingReader": |
+ ArrowFormat::StreamingReader.new(buffer.data.to_s).each do
+ end
diff --git a/ruby/red-arrow-format/lib/arrow-format/streaming-reader.rb b/ruby/red-arrow-format/lib/arrow-format/streaming-reader.rb
index f81cfe8913..1a9f71ac9c 100644
--- a/ruby/red-arrow-format/lib/arrow-format/streaming-reader.rb
+++ b/ruby/red-arrow-format/lib/arrow-format/streaming-reader.rb
@@ -22,7 +22,17 @@ module ArrowFormat
include Enumerable
def initialize(input)
- @input = input
+ case input
+ when File
+ @input = IO::Buffer.map(input, nil, 0, IO::Buffer::READONLY)
+ @offset = 0
+ when String
+ @input = IO::Buffer.for(input)
+ @offset = 0
+ else
+ @input = input
+ end
+
@on_read = nil
@pull_reader = StreamingPullReader.new do |record_batch|
@on_read.call(record_batch) if @on_read
@@ -53,11 +63,18 @@ module ArrowFormat
next_size = @pull_reader.next_required_size
return false if next_size.zero?
- next_chunk = @input.read(next_size, @buffer)
- return false if next_chunk.nil?
+ if @input.is_a?(IO::Buffer)
+ next_chunk = @input.slice(@offset, next_size)
+ @offset += next_size
+ @pull_reader.consume(next_chunk)
+ true
+ else
+ next_chunk = @input.read(next_size, @buffer)
+ return false if next_chunk.nil?
- @pull_reader.consume(IO::Buffer.for(next_chunk))
- true
+ @pull_reader.consume(IO::Buffer.for(next_chunk))
+ true
+ end
end
def ensure_schema
diff --git a/ruby/red-arrow-format/test/test-reader.rb b/ruby/red-arrow-format/test/test-reader.rb
index d59a93ce18..c1c6b26288 100644
--- a/ruby/red-arrow-format/test/test-reader.rb
+++ b/ruby/red-arrow-format/test/test-reader.rb
@@ -26,9 +26,7 @@ module ReaderTests
else
table = data
end
- path = File.join(tmp_dir, "data.#{file_extension}")
- table.save(path)
- File.open(path, "rb") do |input|
+ open_input(table, tmp_dir) do |input|
reader = reader_class.new(input)
case data
when Arrow::Array
@@ -677,8 +675,42 @@ module ReaderTests
end
end
-class TestFileReader < Test::Unit::TestCase
+module FileInput
+ def open_input(table, tmp_dir, &block)
+ path = File.join(tmp_dir, "data.#{file_extension}")
+ table.save(path)
+ File.open(path, "rb", &block)
+ end
+end
+
+module PipeInput
+ def open_input(table, tmp_dir, &block)
+ buffer = Arrow::ResizableBuffer.new(4096)
+ table.save(buffer, format: format)
+ IO.pipe do |input, output|
+ write_thread = Thread.new do
+ output.write(buffer.data.to_s)
+ end
+ begin
+ yield(input)
+ ensure
+ write_thread.join
+ end
+ end
+ end
+end
+
+module StringInput
+ def open_input(table, tmp_dir)
+ buffer = Arrow::ResizableBuffer.new(4096)
+ table.save(buffer, format: format)
+ yield(buffer.data.to_s)
+ end
+end
+
+class TestFileReaderFileInput < Test::Unit::TestCase
include ReaderTests
+ include FileInput
def file_extension
"arrow"
@@ -689,8 +721,22 @@ class TestFileReader < Test::Unit::TestCase
end
end
-class TestStreamingReader < Test::Unit::TestCase
+class TestFileReaderStringInput < Test::Unit::TestCase
include ReaderTests
+ include StringInput
+
+ def format
+ :arrow_file
+ end
+
+ def reader_class
+ ArrowFormat::FileReader
+ end
+end
+
+class TestStreamingReaderFileInupt < Test::Unit::TestCase
+ include ReaderTests
+ include FileInput
def file_extension
"arrows"
@@ -700,3 +746,29 @@ class TestStreamingReader < Test::Unit::TestCase
ArrowFormat::StreamingReader
end
end
+
+class TestStreamingReaderPipeInupt < Test::Unit::TestCase
+ include ReaderTests
+ include PipeInput
+
+ def format
+ :arrow_streaming
+ end
+
+ def reader_class
+ ArrowFormat::StreamingReader
+ end
+end
+
+class TestStreamingReaderStringInupt < Test::Unit::TestCase
+ include ReaderTests
+ include StringInput
+
+ def format
+ :arrow_streaming
+ end
+
+ def reader_class
+ ArrowFormat::StreamingReader
+ end
+end