[I] [Ruby] Building an array of hashes from Table is very slow [arrow]

via GitHub Sat, 28 Dec 2024 09:13:57 -0800


fpacanowski opened a new issue, #45117:
URL: https://github.com/apache/arrow/issues/45117


   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I'm experiencing performance issues when reading Parquet files in Ruby. I've 
created a very simple synthetic benchmark: a table with a single `float` column 
and 1,000,000 rows. I want to read this data into an array of hashes. Here's my 
code:
   ```ruby
   require 'arrow'
   require 'parquet'
   require 'benchmark'
   
   def read_parquet
     table = Arrow::TableLoader.load('data.parquet', { format: :parquet })
     table.each_record(reuse_record: true).map(&:to_h)
   end
   
   Benchmark.bmbm do |x|
     x.report("read_parquet")  { read_parquet }
   end
   ```
   And the report:
   ```
   $ bundle exec ruby --yjit read.rb
   Rehearsal ------------------------------------------------
   read_parquet  20.894371   0.129733  21.024104 ( 21.066472)
   -------------------------------------- total: 21.024104sec
   
                      user     system      total        real
   read_parquet  21.872816   0.079736  21.952552 ( 21.991875)
   ```
   
   I also tested equivalent code in Python:
   ```python
   import pyarrow.parquet as pq
   import timeit
   
   def read_parquet():
       table = pq.read_table('data.parquet')
       return table.to_pylist()
   
   time_taken = timeit.timeit(read_parquet, number=10)  # Run 10 times
   print(f"Average time: {time_taken / 10:.6f} seconds")
   ```
   which yields:
   ```
   $ poetry run python read.py
   Average time: 0.610864 seconds
   ```
   
   This means that the Ruby version is **30-40x** slower than pyarrow (about 
22 s versus about 0.6 s per run). Is there anything I can do to improve the 
performance here?
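   
   One direction that might be worth measuring is to assemble the hashes 
column-wise in plain Ruby instead of creating one record object per row. The 
sketch below is only an illustration of that idea under stated assumptions: 
`rows_from_columns` is a hypothetical helper (not part of red-arrow), and the 
column arrays are stand-in data rather than values read from `data.parquet`. 
How best to extract plain Ruby arrays from the `Arrow::Table` is left open; 
red-arrow appears to offer `Arrow::Table#raw_records`, which returns rows as 
plain arrays, but whether either approach is actually faster here has not been 
measured.
   ```ruby
   # Hypothetical helper (an assumption, not a red-arrow API): build an array
   # of row hashes from column arrays that are already plain Ruby arrays.
   def rows_from_columns(names, columns)
     # transpose turns per-column arrays into per-row arrays;
     # zip + to_h then pairs each value with its column name.
     columns.transpose.map { |values| names.zip(values).to_h }
   end
   
   # Stand-in data for illustration only; with red-arrow these arrays would
   # have to be extracted from the loaded table first.
   names = ["foo", "bar"]
   columns = [[0.25, 0.5, 0.75], [1, 2, 3]]
   
   rows = rows_from_columns(names, columns)
   ```
   All columns must have the same length, otherwise `Array#transpose` raises 
an `IndexError`.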
   
   For completeness, here's a script that generates the test data 
(`data.parquet` file):
   ```ruby
   require 'arrow'
   require 'parquet'
   
   schema = Arrow::Schema.new([Arrow::Field.new("foo", :float)])
   
   data = 1_000_000.times.map { {foo: rand} }
   table = Arrow::RecordBatchBuilder.build(schema, data).to_table
   table.save('data.parquet', format: :parquet, compression: :uncompressed)
   ```
   
   ### Component(s)
   
   Ruby


