This is an automated email from the ASF dual-hosted git repository.
kou pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new f51a70f654 GH-49656: [Ruby] Add benchmark for writers (#49657)
f51a70f654 is described below
commit f51a70f654b74714fa012d91dfc977c3d7ebd514
Author: Sutou Kouhei <[email protected]>
AuthorDate: Sun Apr 5 16:55:23 2026 +0900
GH-49656: [Ruby] Add benchmark for writers (#49657)
### Rationale for this change
Performance is important in Apache Arrow, so benchmarks are useful when
developing an Apache Arrow implementation.
### What changes are included in this PR?
* Add benchmarks for file and streaming writers.
* Remove redundant type arguments from array constructors.
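
The second item can be pictured with a minimal sketch. The `Sketch*` classes below are hypothetical stand-ins, not the real `ArrowFormat` classes, and the `singleton` implementation is an assumption made for illustration. The idea is that each concrete array class reports its own type through a class-level `type` method, so the constructor no longer takes a type argument.

```ruby
# Hypothetical stand-ins for illustration only; the real classes live in
# ruby/red-arrow-format/lib/arrow-format/type.rb and array.rb.
class SketchType
  class << self
    # Assumed implementation: one shared instance per concrete type class.
    def singleton
      @singleton ||= new
    end
  end
end

class SketchInt8Type < SketchType; end
class SketchUInt32Type < SketchType; end

class SketchArray
  attr_reader :type, :size

  def initialize(type, size, validity_buffer)
    @type = type
    @size = size
    @validity_buffer = validity_buffer
  end
end

class SketchIntArray < SketchArray
  # Callers no longer pass the type; it is looked up on the concrete class.
  def initialize(size, validity_buffer, values_buffer)
    super(self.class.type, size, validity_buffer)
    @values_buffer = values_buffer
  end
end

class SketchInt8Array < SketchIntArray
  class << self
    def type
      SketchInt8Type.singleton
    end
  end
end

class SketchUInt32Array < SketchIntArray
  class << self
    def type
      SketchUInt32Type.singleton
    end
  end
end

# Three little-endian 32-bit values packed into a binary string.
values = [1, 2, 3].pack("V*")
array = SketchUInt32Array.new(3, nil, values)
```

With this shape, a call site shrinks from `SomeArray.new(type, size, ...)` to `SomeArray.new(size, ...)`, and the type cannot disagree with the array class.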
Here are benchmark results from my environment.
The pure Ruby implementation is about 2-2.5x slower than the release build of
the C++ implementation but about 2-2.5x faster than the debug build.
Release build C++/GLib:
File format:
```console
$ ruby -v -S benchmark-driver ruby/red-arrow-format/benchmark/file-writer.yaml
ruby 4.1.0dev (2026-03-26T07:27:31Z master c5ab2114df) +PRISM [x86_64-linux]
Warming up --------------------------------------
Arrow::Table#save 348.499 i/s - 374.000 times in 1.073175s (2.87ms/i)
Arrow::RecordBatchFileWriter 353.426 i/s - 385.000 times in 1.089337s (2.83ms/i)
ArrowFormat::FileWriter 133.293 i/s - 140.000 times in 1.050314s (7.50ms/i)
Calculating -------------------------------------
Arrow::Table#save 336.984 i/s - 1.045k times in 3.101035s (2.97ms/i)
Arrow::RecordBatchFileWriter 338.695 i/s - 1.060k times in 3.129655s (2.95ms/i)
ArrowFormat::FileWriter 134.640 i/s - 399.000 times in 2.963462s (7.43ms/i)
Comparison:
Arrow::RecordBatchFileWriter: 338.7 i/s
Arrow::Table#save: 337.0 i/s - 1.01x slower
ArrowFormat::FileWriter: 134.6 i/s - 2.52x slower
```
Streaming format:
```console
$ ruby -v -S benchmark-driver ruby/red-arrow-format/benchmark/streaming-writer.yaml
ruby 4.1.0dev (2026-03-26T07:27:31Z master c5ab2114df) +PRISM [x86_64-linux]
Warming up --------------------------------------
Arrow::Table#save 356.995 i/s - 385.000 times in 1.078447s (2.80ms/i)
Arrow::RecordBatchStreamWriter 347.891 i/s - 374.000 times in 1.075050s (2.87ms/i)
ArrowFormat::StreamingWriter 156.709 i/s - 160.000 times in 1.021004s (6.38ms/i)
Calculating -------------------------------------
Arrow::Table#save 350.743 i/s - 1.070k times in 3.050665s (2.85ms/i)
Arrow::RecordBatchStreamWriter 345.821 i/s - 1.043k times in 3.016011s (2.89ms/i)
ArrowFormat::StreamingWriter 160.022 i/s - 470.000 times in 2.937090s (6.25ms/i)
Comparison:
Arrow::Table#save: 350.7 i/s
Arrow::RecordBatchStreamWriter: 345.8 i/s - 1.01x slower
ArrowFormat::StreamingWriter: 160.0 i/s - 2.19x slower
```
Debug build C++/GLib:
File format:
```console
$ ruby -v -S benchmark-driver ruby/red-arrow-format/benchmark/file-writer.yaml
ruby 4.1.0dev (2026-03-26T07:27:31Z master c5ab2114df) +PRISM [x86_64-linux]
Warming up --------------------------------------
Arrow::Table#save 63.290 i/s - 66.000 times in 1.042815s (15.80ms/i)
Arrow::RecordBatchFileWriter 62.655 i/s - 66.000 times in 1.053389s (15.96ms/i)
ArrowFormat::FileWriter 138.082 i/s - 140.000 times in 1.013891s (7.24ms/i)
Calculating -------------------------------------
Arrow::Table#save 63.165 i/s - 189.000 times in 2.992143s (15.83ms/i)
Arrow::RecordBatchFileWriter 61.773 i/s - 187.000 times in 3.027220s (16.19ms/i)
ArrowFormat::FileWriter 134.709 i/s - 414.000 times in 3.073285s (7.42ms/i)
Comparison:
ArrowFormat::FileWriter: 134.7 i/s
Arrow::Table#save: 63.2 i/s - 2.13x slower
Arrow::RecordBatchFileWriter: 61.8 i/s - 2.18x slower
```
Streaming format:
```console
$ ruby -v -S benchmark-driver ruby/red-arrow-format/benchmark/streaming-writer.yaml
ruby 4.1.0dev (2026-03-26T07:27:31Z master c5ab2114df) +PRISM [x86_64-linux]
Warming up --------------------------------------
Arrow::Table#save 63.252 i/s - 66.000 times in 1.043439s (15.81ms/i)
Arrow::RecordBatchStreamWriter 61.272 i/s - 66.000 times in 1.077162s (16.32ms/i)
ArrowFormat::StreamingWriter 152.598 i/s - 160.000 times in 1.048506s (6.55ms/i)
Calculating -------------------------------------
Arrow::Table#save 61.016 i/s - 189.000 times in 3.097525s (16.39ms/i)
Arrow::RecordBatchStreamWriter 63.024 i/s - 183.000 times in 2.903642s (15.87ms/i)
ArrowFormat::StreamingWriter 160.416 i/s - 457.000 times in 2.848846s (6.23ms/i)
Comparison:
ArrowFormat::StreamingWriter: 160.4 i/s
Arrow::RecordBatchStreamWriter: 63.0 i/s - 2.55x slower
Arrow::Table#save: 61.0 i/s - 2.63x slower
```
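
As a rough cross-check of the "about 2-2.5x" summary, the ratios can be recomputed from the "Calculating" numbers above. This is only a sketch: the `RESULTS` table and the `slowdown` helper are illustrative names, not part of the benchmark files. Each row pairs the faster of the two C++/GLib-backed results with the pure Ruby result.

```ruby
# Iterations per second, copied from the "Calculating" sections above.
# Each entry pairs the faster C++/GLib-backed result with the pure Ruby
# ArrowFormat result for the same build type and format.
RESULTS = {
  "release, file"      => {bindings: 338.695, pure_ruby: 134.640},
  "release, streaming" => {bindings: 350.743, pure_ruby: 160.022},
  "debug, file"        => {bindings: 63.165,  pure_ruby: 134.709},
  "debug, streaming"   => {bindings: 63.024,  pure_ruby: 160.416},
}

# Ratio of the faster rate to the slower one, rounded to two decimals as in
# the "Comparison" sections.
def slowdown(a, b)
  slow, fast = [a, b].minmax
  (fast / slow).round(2)
end

RESULTS.each do |label, ips|
  ratio = slowdown(ips[:bindings], ips[:pure_ruby])
  slower = ips[:pure_ruby] < ips[:bindings] ? "pure Ruby" : "C++/GLib bindings"
  puts "#{label}: #{slower} is #{ratio}x slower"
end
```

The ratios land between roughly 2.1x and 2.6x in both directions, consistent with the summary.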
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49656
Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
---
ruby/red-arrow-format/benchmark/file-writer.yaml | 89 ++++++++++++++++++
.../benchmark/streaming-writer.yaml | 89 ++++++++++++++++++
ruby/red-arrow-format/lib/arrow-format/array.rb | 101 ++++++++++++++++++++-
ruby/red-arrow-format/lib/arrow-format/type.rb | 39 ++++----
4 files changed, 293 insertions(+), 25 deletions(-)
diff --git a/ruby/red-arrow-format/benchmark/file-writer.yaml b/ruby/red-arrow-format/benchmark/file-writer.yaml
new file mode 100644
index 0000000000..37b89f5bff
--- /dev/null
+++ b/ruby/red-arrow-format/benchmark/file-writer.yaml
@@ -0,0 +1,89 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+prelude: |
+ Warning[:experimental] = false
+
+ require "arrow"
+ require "arrow-format"
+
+ seed = 29
+ random = Random.new(seed)
+
+ n_columns = 100
+ n_rows = 10000
+ max_uint32 = 2 ** 32 - 1
+ arrays = n_columns.times.collect do |i|
+ if i.even?
+ Arrow::UInt32Array.new(n_rows.times.collect {random.rand(max_uint32)})
+ else
+ Arrow::BinaryArray.new(n_rows.times.collect {random.bytes(random.rand(10))})
+ end
+ end
+ columns = arrays.collect.with_index {|array, i| [i, array]}
+ red_arrow_table = Arrow::Table.new(columns)
+
+ fields = arrays.collect.with_index do |array, i|
+ case array
+ when Arrow::UInt32Array
+ type = ArrowFormat::UInt32Type.singleton
+ when Arrow::BinaryArray
+ type = ArrowFormat::BinaryType.singleton
+ end
+ ArrowFormat::Field.new(i.to_s, type)
+ end
+ schema = ArrowFormat::Schema.new(fields)
+ def convert_buffer(buffer)
+ return nil if buffer.nil?
+ IO::Buffer.for(buffer.data.to_s.dup)
+ end
+ columns = fields.zip(arrays).collect do |field, array|
+ case array
+ when Arrow::UInt32Array
+ field.type.build_array(n_rows,
+ convert_buffer(array.null_bitmap),
+ convert_buffer(array.data_buffer))
+ when Arrow::BinaryArray
+ field.type.build_array(n_rows,
+ convert_buffer(array.null_bitmap),
+ convert_buffer(array.offsets_buffer),
+ convert_buffer(array.data_buffer))
+ end
+ end
+ red_arrow_format_record_batch =
+ ArrowFormat::RecordBatch.new(schema, n_rows, columns)
+
+ GC.start
+ GC.disable
+benchmark:
+ "Arrow::Table#save": |
+ buffer = Arrow::ResizableBuffer.new(4096)
+ red_arrow_table.save(buffer, format: :arrow_file)
+ "Arrow::RecordBatchFileWriter": |
+ buffer = Arrow::ResizableBuffer.new(4096)
+ Arrow::BufferOutputStream.open(buffer) do |output|
+ schema = red_arrow_table.schema
+ Arrow::RecordBatchFileWriter.open(output, schema) do |writer|
+ writer.write_table(red_arrow_table)
+ end
+ end
+ "ArrowFormat::FileWriter": |
+ output = +"".b
+ writer = ArrowFormat::FileWriter.new(output)
+ writer.start(red_arrow_format_record_batch.schema)
+ writer.write_record_batch(red_arrow_format_record_batch)
+ writer.finish
diff --git a/ruby/red-arrow-format/benchmark/streaming-writer.yaml b/ruby/red-arrow-format/benchmark/streaming-writer.yaml
new file mode 100644
index 0000000000..824e71dff6
--- /dev/null
+++ b/ruby/red-arrow-format/benchmark/streaming-writer.yaml
@@ -0,0 +1,89 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+prelude: |
+ Warning[:experimental] = false
+
+ require "arrow"
+ require "arrow-format"
+
+ seed = 29
+ random = Random.new(seed)
+
+ n_columns = 100
+ n_rows = 10000
+ max_uint32 = 2 ** 32 - 1
+ arrays = n_columns.times.collect do |i|
+ if i.even?
+ Arrow::UInt32Array.new(n_rows.times.collect {random.rand(max_uint32)})
+ else
+ Arrow::BinaryArray.new(n_rows.times.collect {random.bytes(random.rand(10))})
+ end
+ end
+ columns = arrays.collect.with_index {|array, i| [i, array]}
+ red_arrow_table = Arrow::Table.new(columns)
+
+ fields = arrays.collect.with_index do |array, i|
+ case array
+ when Arrow::UInt32Array
+ type = ArrowFormat::UInt32Type.singleton
+ when Arrow::BinaryArray
+ type = ArrowFormat::BinaryType.singleton
+ end
+ ArrowFormat::Field.new(i.to_s, type)
+ end
+ schema = ArrowFormat::Schema.new(fields)
+ def convert_buffer(buffer)
+ return nil if buffer.nil?
+ IO::Buffer.for(buffer.data.to_s.dup)
+ end
+ columns = fields.zip(arrays).collect do |field, array|
+ case array
+ when Arrow::UInt32Array
+ field.type.build_array(n_rows,
+ convert_buffer(array.null_bitmap),
+ convert_buffer(array.data_buffer))
+ when Arrow::BinaryArray
+ field.type.build_array(n_rows,
+ convert_buffer(array.null_bitmap),
+ convert_buffer(array.offsets_buffer),
+ convert_buffer(array.data_buffer))
+ end
+ end
+ red_arrow_format_record_batch =
+ ArrowFormat::RecordBatch.new(schema, n_rows, columns)
+
+ GC.start
+ GC.disable
+benchmark:
+ "Arrow::Table#save": |
+ buffer = Arrow::ResizableBuffer.new(4096)
+ red_arrow_table.save(buffer, format: :arrow_streaming)
+ "Arrow::RecordBatchStreamWriter": |
+ buffer = Arrow::ResizableBuffer.new(4096)
+ Arrow::BufferOutputStream.open(buffer) do |output|
+ schema = red_arrow_table.schema
+ Arrow::RecordBatchStreamWriter.open(output, schema) do |writer|
+ writer.write_table(red_arrow_table)
+ end
+ end
+ "ArrowFormat::StreamingWriter": |
+ output = +"".b
+ writer = ArrowFormat::StreamingWriter.new(output)
+ writer.start(red_arrow_format_record_batch.schema)
+ writer.write_record_batch(red_arrow_format_record_batch)
+ writer.finish
diff --git a/ruby/red-arrow-format/lib/arrow-format/array.rb b/ruby/red-arrow-format/lib/arrow-format/array.rb
index cb71a4d255..9a248d279f 100644
--- a/ruby/red-arrow-format/lib/arrow-format/array.rb
+++ b/ruby/red-arrow-format/lib/arrow-format/array.rb
@@ -140,8 +140,8 @@ module ArrowFormat
end
class NullArray < Array
- def initialize(type, size)
- super(type, size, nil)
+ def initialize(size)
+ super(NullType.singleton, size, nil)
end
def each_buffer
@@ -186,6 +186,10 @@ module ArrowFormat
end
class BooleanArray < PrimitiveArray
+ def initialize(size, validity_buffer, values_buffer)
+ super(BooleanType.singleton, size, validity_buffer, values_buffer)
+ end
+
def to_a
return [] if empty?
@@ -209,51 +213,120 @@ module ArrowFormat
end
class IntArray < PrimitiveArray
+ def initialize(size, validity_buffer, values_buffer)
+ super(self.class.type, size, validity_buffer, values_buffer)
+ end
end
class Int8Array < IntArray
+ class << self
+ def type
+ Int8Type.singleton
+ end
+ end
end
class UInt8Array < IntArray
+ class << self
+ def type
+ UInt8Type.singleton
+ end
+ end
end
class Int16Array < IntArray
+ class << self
+ def type
+ Int16Type.singleton
+ end
+ end
end
class UInt16Array < IntArray
+ class << self
+ def type
+ UInt16Type.singleton
+ end
+ end
end
class Int32Array < IntArray
+ class << self
+ def type
+ Int32Type.singleton
+ end
+ end
end
class UInt32Array < IntArray
+ class << self
+ def type
+ UInt32Type.singleton
+ end
+ end
end
class Int64Array < IntArray
+ class << self
+ def type
+ Int64Type.singleton
+ end
+ end
end
class UInt64Array < IntArray
+ class << self
+ def type
+ UInt64Type.singleton
+ end
+ end
end
class FloatingPointArray < PrimitiveArray
+ def initialize(size, validity_buffer, values_buffer)
+ super(self.class.type, size, validity_buffer, values_buffer)
+ end
end
class Float32Array < FloatingPointArray
+ class << self
+ def type
+ Float32Type.singleton
+ end
+ end
end
class Float64Array < FloatingPointArray
+ class << self
+ def type
+ Float64Type.singleton
+ end
+ end
end
class TemporalArray < PrimitiveArray
end
class DateArray < TemporalArray
+ def initialize(size, validity_buffer, values_buffer)
+ super(self.class.type, size, validity_buffer, values_buffer)
+ end
end
class Date32Array < DateArray
+ class << self
+ def type
+ Date32Type.singleton
+ end
+ end
end
class Date64Array < DateArray
+ class << self
+ def type
+ Date64Type.singleton
+ end
+ end
end
class TimeArray < TemporalArray
@@ -318,8 +391,8 @@ module ArrowFormat
end
class VariableSizeBinaryArray < Array
- def initialize(type, size, validity_buffer, offsets_buffer, values_buffer)
- super(type, size, validity_buffer)
+ def initialize(size, validity_buffer, offsets_buffer, values_buffer)
+ super(self.class.type, size, validity_buffer)
@offsets_buffer = offsets_buffer
@values_buffer = values_buffer
end
@@ -364,18 +437,38 @@ module ArrowFormat
end
class BinaryArray < VariableSizeBinaryArray
+ class << self
+ def type
+ BinaryType.singleton
+ end
+ end
end
class LargeBinaryArray < VariableSizeBinaryArray
+ class << self
+ def type
+ LargeBinaryType.singleton
+ end
+ end
end
class VariableSizeUTF8Array < VariableSizeBinaryArray
end
class UTF8Array < VariableSizeUTF8Array
+ class << self
+ def type
+ UTF8Type.singleton
+ end
+ end
end
class LargeUTF8Array < VariableSizeUTF8Array
+ class << self
+ def type
+ LargeUTF8Type.singleton
+ end
+ end
end
class FixedSizeBinaryArray < Array
diff --git a/ruby/red-arrow-format/lib/arrow-format/type.rb b/ruby/red-arrow-format/lib/arrow-format/type.rb
index 17674af30c..38523cf00b 100644
--- a/ruby/red-arrow-format/lib/arrow-format/type.rb
+++ b/ruby/red-arrow-format/lib/arrow-format/type.rb
@@ -33,7 +33,7 @@ module ArrowFormat
end
def build_array(size)
- NullArray.new(self, size)
+ NullArray.new(size)
end
def to_flatbuffers
@@ -56,7 +56,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, values_buffer)
- BooleanArray.new(self, size, validity_buffer, values_buffer)
+ BooleanArray.new(size, validity_buffer, values_buffer)
end
def to_flatbuffers
@@ -107,7 +107,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, values_buffer)
- Int8Array.new(self, size, validity_buffer, values_buffer)
+ Int8Array.new(size, validity_buffer, values_buffer)
end
end
@@ -131,7 +131,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, values_buffer)
- UInt8Array.new(self, size, validity_buffer, values_buffer)
+ UInt8Array.new(size, validity_buffer, values_buffer)
end
end
@@ -155,7 +155,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, values_buffer)
- Int16Array.new(self, size, validity_buffer, values_buffer)
+ Int16Array.new(size, validity_buffer, values_buffer)
end
end
@@ -179,7 +179,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, values_buffer)
- UInt16Array.new(self, size, validity_buffer, values_buffer)
+ UInt16Array.new(size, validity_buffer, values_buffer)
end
end
@@ -203,7 +203,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, values_buffer)
- Int32Array.new(self, size, validity_buffer, values_buffer)
+ Int32Array.new(size, validity_buffer, values_buffer)
end
end
@@ -227,7 +227,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, values_buffer)
- UInt32Array.new(self, size, validity_buffer, values_buffer)
+ UInt32Array.new(size, validity_buffer, values_buffer)
end
end
@@ -251,7 +251,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, values_buffer)
- Int64Array.new(self, size, validity_buffer, values_buffer)
+ Int64Array.new(size, validity_buffer, values_buffer)
end
end
@@ -275,7 +275,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, values_buffer)
- UInt64Array.new(self, size, validity_buffer, values_buffer)
+ UInt64Array.new(size, validity_buffer, values_buffer)
end
end
@@ -313,7 +313,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, values_buffer)
- Float32Array.new(self, size, validity_buffer, values_buffer)
+ Float32Array.new(size, validity_buffer, values_buffer)
end
end
@@ -337,7 +337,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, values_buffer)
- Float64Array.new(self, size, validity_buffer, values_buffer)
+ Float64Array.new(size, validity_buffer, values_buffer)
end
end
@@ -378,7 +378,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, values_buffer)
- Date32Array.new(self, size, validity_buffer, values_buffer)
+ Date32Array.new(size, validity_buffer, values_buffer)
end
end
@@ -402,7 +402,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, values_buffer)
- Date64Array.new(self, size, validity_buffer, values_buffer)
+ Date64Array.new(size, validity_buffer, values_buffer)
end
end
@@ -628,8 +628,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, offsets_buffer, values_buffer)
- BinaryArray.new(self,
- size,
+ BinaryArray.new(size,
validity_buffer,
offsets_buffer,
values_buffer)
@@ -660,8 +659,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, offsets_buffer, values_buffer)
- LargeBinaryArray.new(self,
- size,
+ LargeBinaryArray.new(size,
validity_buffer,
offsets_buffer,
values_buffer)
@@ -692,7 +690,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, offsets_buffer, values_buffer)
- UTF8Array.new(self, size, validity_buffer, offsets_buffer, values_buffer)
+ UTF8Array.new(size, validity_buffer, offsets_buffer, values_buffer)
end
def to_flatbuffers
@@ -720,8 +718,7 @@ module ArrowFormat
end
def build_array(size, validity_buffer, offsets_buffer, values_buffer)
- LargeUTF8Array.new(self,
- size,
+ LargeUTF8Array.new(size,
validity_buffer,
offsets_buffer,
values_buffer)