fpacanowski opened a new issue, #45117:
URL: https://github.com/apache/arrow/issues/45117
### Describe the usage question you have. Please include as many useful details as possible.
I'm experiencing performance issues when reading Parquet files in Ruby. I've
created a very simple synthetic benchmark: a table with a single `float` column
and 1,000,000 rows. I want to read this data into an array of hashes. Here's my
code:
```ruby
require 'arrow'
require 'parquet'
require 'benchmark'
def read_parquet
  table = Arrow::TableLoader.load('data.parquet', { format: :parquet })
  table.each_record(reuse_record: true).map(&:to_h)
end

Benchmark.bmbm do |x|
  x.report("read_parquet") { read_parquet }
end
```
And the report:
```
$ bundle exec ruby --yjit read.rb
Rehearsal ------------------------------------------------
read_parquet 20.894371 0.129733 21.024104 ( 21.066472)
-------------------------------------- total: 21.024104sec
user system total real
read_parquet 21.872816 0.079736 21.952552 ( 21.991875)
```
I also tested equivalent code in Python:
```python
import pyarrow.parquet as pq
import timeit

def read_parquet():
    table = pq.read_table('data.parquet')
    return table.to_pylist()

time_taken = timeit.timeit(read_parquet, number=10)  # Run 10 times
print(f"Average time: {time_taken / 10:.6f} seconds")
```
which yields:
```
$ poetry run python read.py
Average time: 0.610864 seconds
```
This means that the Ruby version is **30-40x** slower than pyarrow. Is there
anything I can do to improve the performance here?
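As a sanity check on that figure, the ratio can be computed directly from the two reports. This is a minimal sketch: the variable names are illustrative, and only the timings come from the output above.

```ruby
# Wall-clock timings copied from the two reports above.
ruby_seconds   = 21.991875 # "real" column of the second (non-rehearsal) Ruby run
python_seconds = 0.610864  # average per call reported by timeit

# Ratio of Ruby wall-clock time to Python wall-clock time.
slowdown = ruby_seconds / python_seconds
puts format("Ruby is about %.1fx slower", slowdown)
```

With these timings the ratio comes out at roughly 36x, in the middle of the 30-40x range.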
For completeness, here's a script that generates the test data
(`data.parquet` file):
```ruby
require 'arrow'
require 'parquet'
require 'benchmark'
schema = Arrow::Schema.new([Arrow::Field.new("foo", :float)])
data = 1_000_000.times.map { {foo: rand} }
table = Arrow::RecordBatchBuilder.build(schema, data).to_table
table.save('data.parquet', format: :parquet, compression: :uncompressed)
```
### Component(s)
Ruby