This is an automated email from the ASF dual-hosted git repository.

kou pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new 10eaafd2b4 GH-49544: [Ruby] Add benchmark for readers (#49545)
10eaafd2b4 is described below

commit 10eaafd2b4af3ca3e68181636ba4087c236d0c1d
Author: Sutou Kouhei <[email protected]>
AuthorDate: Sat Mar 21 17:59:16 2026 +0900

    GH-49544: [Ruby] Add benchmark for readers (#49545)
    
    ### Rationale for this change
    
    Performance is important in Apache Arrow, so benchmarks are useful when developing an Apache Arrow implementation.
    
    ### What changes are included in this PR?
    
    * Add benchmarks for file and streaming readers.
    * Add support for `mmap` in streaming reader.
    
    Here are benchmark results in my environment.
    
    The pure Ruby implementation is about 5-6x slower than the release build of the C++ implementation, but a bit faster than the debug build of the C++ implementation.
    
    Release build C++/GLib:
    
    File format:
    
    ```console
    $ ruby -v -S benchmark-driver ruby/red-arrow-format/benchmark/file-reader.yaml
    ruby 4.1.0dev (2026-02-19T09:04:23Z master 6bb0b6b16c) +PRISM [x86_64-linux]
    Warming up --------------------------------------
               Arrow::Table.load    11.207k i/s -     12.188k times in 1.087487s (89.23μs/i)
    Arrow::RecordBatchFileReader    19.724k i/s -     21.296k times in 1.079727s (50.70μs/i)
         ArrowFormat::FileReader     3.555k i/s -      3.883k times in 1.092223s (281.28μs/i)
    Calculating -------------------------------------
               Arrow::Table.load    11.483k i/s -     33.622k times in 2.928024s (87.09μs/i)
    Arrow::RecordBatchFileReader    19.673k i/s -     59.170k times in 3.007729s (50.83μs/i)
         ArrowFormat::FileReader     3.574k i/s -     10.665k times in 2.984214s (279.81μs/i)
    
    Comparison:
    Arrow::RecordBatchFileReader:     19672.6 i/s
               Arrow::Table.load:     11482.8 i/s - 1.71x  slower
         ArrowFormat::FileReader:      3573.8 i/s - 5.50x  slower
    
    ```
    
    Streaming format:
    
    ```console
    $ ruby -v -S benchmark-driver ruby/red-arrow-format/benchmark/streaming-reader.yaml
    ruby 4.1.0dev (2026-02-19T09:04:23Z master 6bb0b6b16c) +PRISM [x86_64-linux]
    Warming up --------------------------------------
                 Arrow::Table.load    11.360k i/s -     12.485k times in 1.099067s (88.03μs/i)
    Arrow::RecordBatchStreamReader    20.180k i/s -     21.857k times in 1.083126s (49.56μs/i)
      ArrowFormat::StreamingReader     3.398k i/s -      3.400k times in 1.000479s (294.26μs/i)
    Calculating -------------------------------------
                 Arrow::Table.load    11.397k i/s -     34.078k times in 2.990170s (87.74μs/i)
    Arrow::RecordBatchStreamReader    20.039k i/s -     60.538k times in 3.020964s (49.90μs/i)
      ArrowFormat::StreamingReader     3.340k i/s -     10.195k times in 3.052059s (299.37μs/i)
    
    Comparison:
    Arrow::RecordBatchStreamReader:     20039.3 i/s
                 Arrow::Table.load:     11396.7 i/s - 1.76x  slower
      ArrowFormat::StreamingReader:      3340.4 i/s - 6.00x  slower
    
    ```
    
    Debug build C++/GLib:
    
    File format:
    
    ```console
    $ ruby -v -S benchmark-driver ruby/red-arrow-format/benchmark/file-reader.yaml
    ruby 4.1.0dev (2026-02-19T09:04:23Z master 6bb0b6b16c) +PRISM [x86_64-linux]
    Warming up --------------------------------------
               Arrow::Table.load     2.175k i/s -      2.200k times in 1.011375s (459.72μs/i)
    Arrow::RecordBatchFileReader     3.129k i/s -      3.421k times in 1.093397s (319.61μs/i)
         ArrowFormat::FileReader     3.384k i/s -      3.430k times in 1.013625s (295.52μs/i)
    Calculating -------------------------------------
               Arrow::Table.load     2.145k i/s -      6.525k times in 3.041760s (466.17μs/i)
    Arrow::RecordBatchFileReader     3.020k i/s -      9.386k times in 3.108456s (331.18μs/i)
         ArrowFormat::FileReader     3.368k i/s -     10.151k times in 3.013576s (296.87μs/i)
    
    Comparison:
         ArrowFormat::FileReader:      3368.4 i/s
    Arrow::RecordBatchFileReader:      3019.5 i/s - 1.12x  slower
               Arrow::Table.load:      2145.1 i/s - 1.57x  slower
    
    ```
    
    Streaming format:
    
    ```console
    $ ruby -v -S benchmark-driver ruby/red-arrow-format/benchmark/streaming-reader.yaml
    ruby 4.1.0dev (2026-02-19T09:04:23Z master 6bb0b6b16c) +PRISM [x86_64-linux]
    Warming up --------------------------------------
                 Arrow::Table.load     2.115k i/s -      2.140k times in 1.011815s (472.81μs/i)
    Arrow::RecordBatchStreamReader     3.052k i/s -      3.355k times in 1.099273s (327.65μs/i)
      ArrowFormat::StreamingReader     3.283k i/s -      3.290k times in 1.002016s (304.56μs/i)
    Calculating -------------------------------------
                 Arrow::Table.load     2.198k i/s -      6.345k times in 2.886603s (454.94μs/i)
    Arrow::RecordBatchStreamReader     3.105k i/s -      9.156k times in 2.948523s (322.03μs/i)
      ArrowFormat::StreamingReader     3.225k i/s -      9.850k times in 3.054339s (310.09μs/i)
    
    Comparison:
      ArrowFormat::StreamingReader:      3224.9 i/s
    Arrow::RecordBatchStreamReader:      3105.3 i/s - 1.04x  slower
                 Arrow::Table.load:      2198.1 i/s - 1.47x  slower
    
    ```
    
    ### Are these changes tested?
    
    Yes.
    
    ### Are there any user-facing changes?
    
    No.
    * GitHub Issue: #49544
    
    Authored-by: Sutou Kouhei <[email protected]>
    Signed-off-by: Sutou Kouhei <[email protected]>
---
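The "about 5-6x slower" figure in the commit message can be checked from the iterations-per-second values in the release build "Comparison" sections. The following is only a small arithmetic sketch: the `RESULTS` constant and the `slowdown` helper are names made up for this note and are not part of the commit.

```ruby
# Iterations per second copied from the release build "Comparison" output
# in the commit message above.
RESULTS = {
  "file" => {
    "Arrow::RecordBatchFileReader" => 19672.6,
    "ArrowFormat::FileReader" => 3573.8,
  },
  "streaming" => {
    "Arrow::RecordBatchStreamReader" => 20039.3,
    "ArrowFormat::StreamingReader" => 3340.4,
  },
}

# Hypothetical helper: how many times slower `slow` is than `fast`,
# where both arguments are iterations per second.
def slowdown(fast, slow)
  (fast / slow).round(2)
end

RESULTS.each do |format, rates|
  # Each inner Hash lists the C++/GLib based reader first and the pure
  # Ruby reader second.
  fast, slow = rates.values
  puts("#{format}: #{slowdown(fast, slow)}x slower")
end
```
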
 .github/workflows/ruby.yml                         |  2 +-
 ruby/Rakefile                                      | 22 +++++-
 ruby/red-arrow-format/Gemfile                      |  8 ++-
 ruby/red-arrow-format/Rakefile                     | 22 ++++++
 ruby/red-arrow-format/benchmark/file-reader.yaml   | 53 ++++++++++++++
 .../benchmark/streaming-reader.yaml                | 53 ++++++++++++++
 .../lib/arrow-format/streaming-reader.rb           | 27 +++++--
 ruby/red-arrow-format/test/test-reader.rb          | 82 ++++++++++++++++++++--
 8 files changed, 254 insertions(+), 15 deletions(-)
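
The `mmap` support in the streaming reader maps a `File` with `IO::Buffer.map` and then takes slices at an advancing offset instead of calling `read`. Below is a minimal standalone sketch of that access pattern, assuming a recent Ruby in which `IO::Buffer.map` accepts `nil` as the size, as it is called in this commit. The `read_chunks_with_mmap` helper and the clamping of the last slice are illustrative assumptions and are not code from the commit.

```ruby
require "tempfile"

# IO::Buffer is still experimental in Ruby, so silence the warning the same
# way the benchmark prelude does.
Warning[:experimental] = false

# Hypothetical helper: walk a read-only memory map of `path` in
# `chunk_size`-byte slices and return each slice as a String.
def read_chunks_with_mmap(path, chunk_size)
  chunks = []
  File.open(path, "rb") do |file|
    # Same call shape as in the commit: map the whole file read-only.
    buffer = IO::Buffer.map(file, nil, 0, IO::Buffer::READONLY)
    offset = 0
    while offset < buffer.size
      # Clamp the last slice so that it never extends past the mapping.
      size = [chunk_size, buffer.size - offset].min
      # slice does not copy; get_string copies the bytes into a String.
      chunks << buffer.slice(offset, size).get_string
      offset += size
    end
  end
  chunks
end

Tempfile.create("mmap-example") do |file|
  file.binmode
  file.write("0123456789")
  file.flush
  p(read_chunks_with_mmap(file.path, 4))
end
```

Slicing avoids copying data into an intermediate `String` for each chunk, which is presumably why the reader takes this path for `File` and `String` inputs.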

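For inputs that are neither a `File` nor a `String`, such as the pipe used by the new `PipeInput` test module, the reader keeps calling `read(size, buffer)`. The sketch below shows that pattern on its own, without Arrow: a writer thread feeds a pipe and the reader pulls fixed-size chunks into a reused `String`. The helper names `read_chunks` and `read_through_pipe` are hypothetical, and unlike the test module this sketch closes the write end explicitly so that the reader sees end-of-file.

```ruby
# Hypothetical helper: read `input` until EOF in `chunk_size`-byte reads,
# reusing one String as the destination buffer.
def read_chunks(input, chunk_size)
  chunks = []
  buffer = +""
  # IO#read(length, outbuf) returns nil at EOF when length is positive.
  while input.read(chunk_size, buffer)
    chunks << buffer.dup
  end
  chunks
end

# Hypothetical helper: send `data` through a pipe from a writer thread and
# read it back in chunks on the calling thread.
def read_through_pipe(data, chunk_size)
  IO.pipe do |input, output|
    writer = Thread.new do
      output.write(data)
      # Close the write end so that the reader eventually gets EOF.
      output.close
    end
    begin
      read_chunks(input, chunk_size)
    ensure
      writer.join
    end
  end
end

p(read_through_pipe("0123456789", 4))
```

Writing from a separate thread keeps the example from blocking when the data is larger than the pipe's buffer.
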
diff --git a/.github/workflows/ruby.yml b/.github/workflows/ruby.yml
index 1e91f62487..04d974f641 100644
--- a/.github/workflows/ruby.yml
+++ b/.github/workflows/ruby.yml
@@ -124,7 +124,7 @@ jobs:
         run: archery docker push ubuntu-ruby
 
   macos:
-    name: ARM64 macOS 14 GLib & Ruby
+    name: ARM64 macOS GLib & Ruby
     runs-on: macos-latest
     if: ${{ !contains(github.event.pull_request.title, 'WIP') }}
     timeout-minutes: 60
diff --git a/ruby/Rakefile b/ruby/Rakefile
index 7f26773403..6ff20915bc 100644
--- a/ruby/Rakefile
+++ b/ruby/Rakefile
@@ -35,9 +35,11 @@ end
 
 packages.each do |package|
   namespace package do
+    package_dir = File.join(base_dir, package)
+
     desc "Run test for #{package}"
     task :test do
-      cd(File.join(base_dir, package)) do
+      cd(package_dir) do
         if ENV["USE_BUNDLER"]
           sh("bundle", "exec", "rake", "test")
         else
@@ -46,9 +48,22 @@ packages.each do |package|
       end
     end
 
+    desc "Run benchmark for #{package}"
+    task :benchmark do
+      cd(package_dir) do
+        if File.directory?("benchmark")
+          if ENV["USE_BUNDLER"]
+            sh("bundle", "exec", "rake", "benchmark")
+          else
+            ruby("-S", "rake", "benchmark")
+          end
+        end
+      end
+    end
+
     desc "Install #{package}"
     task :install do
-      cd(File.join(base_dir, package)) do
+      cd(package_dir) do
         if ENV["USE_BUNDLER"]
           sh("bundle", "exec", "rake", "install")
         else
@@ -70,6 +85,9 @@ end
 desc "Run test for all packages"
 task test: sorted_packages.collect {|package| "#{package}:test"}
 
+desc "Run benchmark for all packages"
+task benchmark: sorted_packages.collect {|package| "#{package}:benchmark"}
+
 desc "Install all packages"
 task install: sorted_packages.collect {|package| "#{package}:install"}
 
diff --git a/ruby/red-arrow-format/Gemfile b/ruby/red-arrow-format/Gemfile
index 2307252d9e..296a7b4435 100644
--- a/ruby/red-arrow-format/Gemfile
+++ b/ruby/red-arrow-format/Gemfile
@@ -21,6 +21,10 @@ source "https://rubygems.org/"
 
 gemspec
 
-gem "rake"
 gem "red-arrow", path: "../red-arrow"
-gem "test-unit"
+
+group :development do
+  gem "benchmark-driver"
+  gem "rake"
+  gem "test-unit"
+end
diff --git a/ruby/red-arrow-format/Rakefile b/ruby/red-arrow-format/Rakefile
index f50f18f3b8..3671f35d6e 100644
--- a/ruby/red-arrow-format/Rakefile
+++ b/ruby/red-arrow-format/Rakefile
@@ -39,6 +39,28 @@ task :test do
   end
 end
 
+benchmark_tasks = []
+namespace :benchmark do
+  Dir.glob("benchmark/*.yaml").sort.each do |yaml|
+    name = File.basename(yaml, ".*")
+    command_line = [
+      RbConfig.ruby, "-v", "-S", "benchmark-driver", File.expand_path(yaml),
+    ]
+
+    desc "Run #{name} benchmark"
+    task name do
+      puts("```console")
+      puts("$ #{command_line.join(" ")}")
+      sh(*command_line, verbose: false)
+      puts("```")
+    end
+    benchmark_tasks << "benchmark:#{name}"
+  end
+end
+
+desc "Run all benchmarks"
+task :benchmark => benchmark_tasks
+
 namespace :flat_buffers do
   desc "Generate FlatBuffers code"
   task :generate do
diff --git a/ruby/red-arrow-format/benchmark/file-reader.yaml b/ruby/red-arrow-format/benchmark/file-reader.yaml
new file mode 100644
index 0000000000..25e8c73ef1
--- /dev/null
+++ b/ruby/red-arrow-format/benchmark/file-reader.yaml
@@ -0,0 +1,53 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+prelude: |
+  Warning[:experimental] = false
+
+  require "arrow"
+  require "arrow-format"
+
+  seed = 29
+  random = Random.new(seed)
+
+  n_columns = 100
+  n_rows = 10000
+  max_uint32 = 2 ** 32 - 1
+  arrays = n_columns.times.collect do |i|
+    if i.even?
+      Arrow::UInt32Array.new(n_rows.times.collect {random.rand(max_uint32)})
+    else
+      Arrow::BinaryArray.new(n_rows.times.collect {random.bytes(random.rand(10))})
+    end
+  end
+  table = Arrow::Table.new(arrays.collect.with_index {|array, i| [i, array]})
+  buffer = Arrow::ResizableBuffer.new(4096)
+  table.save(buffer, format: :arrow_file)
+
+  GC.start
+  GC.disable
+benchmark:
+  "Arrow::Table.load": |
+    Arrow::Table.load(buffer, format: :arrow_file)
+  "Arrow::RecordBatchFileReader": |
+    Arrow::BufferInputStream.open(buffer) do |input|
+      Arrow::RecordBatchFileReader.new(input).each do
+      end
+    end
+  "ArrowFormat::FileReader": |
+    ArrowFormat::FileReader.new(buffer.data.to_s).each do
+    end
diff --git a/ruby/red-arrow-format/benchmark/streaming-reader.yaml b/ruby/red-arrow-format/benchmark/streaming-reader.yaml
new file mode 100644
index 0000000000..f1b383395b
--- /dev/null
+++ b/ruby/red-arrow-format/benchmark/streaming-reader.yaml
@@ -0,0 +1,53 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+prelude: |
+  Warning[:experimental] = false
+
+  require "arrow"
+  require "arrow-format"
+
+  seed = 29
+  random = Random.new(seed)
+
+  n_columns = 100
+  n_rows = 10000
+  max_uint32 = 2 ** 32 - 1
+  arrays = n_columns.times.collect do |i|
+    if i.even?
+      Arrow::UInt32Array.new(n_rows.times.collect {random.rand(max_uint32)})
+    else
+      Arrow::BinaryArray.new(n_rows.times.collect {random.bytes(random.rand(10))})
+    end
+  end
+  table = Arrow::Table.new(arrays.collect.with_index {|array, i| [i, array]})
+  buffer = Arrow::ResizableBuffer.new(4096)
+  table.save(buffer, format: :arrow_streaming)
+
+  GC.start
+  GC.disable
+benchmark:
+  "Arrow::Table.load": |
+    Arrow::Table.load(buffer, format: :arrow_streaming)
+  "Arrow::RecordBatchStreamReader": |
+    Arrow::BufferInputStream.open(buffer) do |input|
+      Arrow::RecordBatchStreamReader.new(input).each do
+      end
+    end
+  "ArrowFormat::StreamingReader": |
+    ArrowFormat::StreamingReader.new(buffer.data.to_s).each do
+    end
diff --git a/ruby/red-arrow-format/lib/arrow-format/streaming-reader.rb b/ruby/red-arrow-format/lib/arrow-format/streaming-reader.rb
index f81cfe8913..1a9f71ac9c 100644
--- a/ruby/red-arrow-format/lib/arrow-format/streaming-reader.rb
+++ b/ruby/red-arrow-format/lib/arrow-format/streaming-reader.rb
@@ -22,7 +22,17 @@ module ArrowFormat
     include Enumerable
 
     def initialize(input)
-      @input = input
+      case input
+      when File
+        @input = IO::Buffer.map(input, nil, 0, IO::Buffer::READONLY)
+        @offset = 0
+      when String
+        @input = IO::Buffer.for(input)
+        @offset = 0
+      else
+        @input = input
+      end
+
       @on_read = nil
       @pull_reader = StreamingPullReader.new do |record_batch|
         @on_read.call(record_batch) if @on_read
@@ -53,11 +63,18 @@ module ArrowFormat
       next_size = @pull_reader.next_required_size
       return false if next_size.zero?
 
-      next_chunk = @input.read(next_size, @buffer)
-      return false if next_chunk.nil?
+      if @input.is_a?(IO::Buffer)
+        next_chunk = @input.slice(@offset, next_size)
+        @offset += next_size
+        @pull_reader.consume(next_chunk)
+        true
+      else
+        next_chunk = @input.read(next_size, @buffer)
+        return false if next_chunk.nil?
 
-      @pull_reader.consume(IO::Buffer.for(next_chunk))
-      true
+        @pull_reader.consume(IO::Buffer.for(next_chunk))
+        true
+      end
     end
 
     def ensure_schema
diff --git a/ruby/red-arrow-format/test/test-reader.rb b/ruby/red-arrow-format/test/test-reader.rb
index d59a93ce18..c1c6b26288 100644
--- a/ruby/red-arrow-format/test/test-reader.rb
+++ b/ruby/red-arrow-format/test/test-reader.rb
@@ -26,9 +26,7 @@ module ReaderTests
       else
         table = data
       end
-      path = File.join(tmp_dir, "data.#{file_extension}")
-      table.save(path)
-      File.open(path, "rb") do |input|
+      open_input(table, tmp_dir) do |input|
         reader = reader_class.new(input)
         case data
         when Arrow::Array
@@ -677,8 +675,42 @@ module ReaderTests
   end
 end
 
-class TestFileReader < Test::Unit::TestCase
+module FileInput
+  def open_input(table, tmp_dir, &block)
+    path = File.join(tmp_dir, "data.#{file_extension}")
+    table.save(path)
+    File.open(path, "rb", &block)
+  end
+end
+
+module PipeInput
+  def open_input(table, tmp_dir, &block)
+    buffer = Arrow::ResizableBuffer.new(4096)
+    table.save(buffer, format: format)
+    IO.pipe do |input, output|
+      write_thread = Thread.new do
+        output.write(buffer.data.to_s)
+      end
+      begin
+        yield(input)
+      ensure
+        write_thread.join
+      end
+    end
+  end
+end
+
+module StringInput
+  def open_input(table, tmp_dir)
+    buffer = Arrow::ResizableBuffer.new(4096)
+    table.save(buffer, format: format)
+    yield(buffer.data.to_s)
+  end
+end
+
+class TestFileReaderFileInput < Test::Unit::TestCase
   include ReaderTests
+  include FileInput
 
   def file_extension
     "arrow"
@@ -689,8 +721,22 @@ class TestFileReader < Test::Unit::TestCase
   end
 end
 
-class TestStreamingReader < Test::Unit::TestCase
+class TestFileReaderStringInput < Test::Unit::TestCase
   include ReaderTests
+  include StringInput
+
+  def format
+    :arrow_file
+  end
+
+  def reader_class
+    ArrowFormat::FileReader
+  end
+end
+
+class TestStreamingReaderFileInupt < Test::Unit::TestCase
+  include ReaderTests
+  include FileInput
 
   def file_extension
     "arrows"
@@ -700,3 +746,29 @@ class TestStreamingReader < Test::Unit::TestCase
     ArrowFormat::StreamingReader
   end
 end
+
+class TestStreamingReaderPipeInupt < Test::Unit::TestCase
+  include ReaderTests
+  include PipeInput
+
+  def format
+    :arrow_streaming
+  end
+
+  def reader_class
+    ArrowFormat::StreamingReader
+  end
+end
+
+class TestStreamingReaderStringInupt < Test::Unit::TestCase
+  include ReaderTests
+  include StringInput
+
+  def format
+    :arrow_streaming
+  end
+
+  def reader_class
+    ArrowFormat::StreamingReader
+  end
+end
