bilelomrani1 opened a new issue, #334: URL: https://github.com/apache/arrow-julia/issues/334
I have an `.arrow` file generated with `pyarrow` whose schema is the following:

```
input: struct<open: fixed_size_list<item: float>[512], high: fixed_size_list<item: float>[512], low: fixed_size_list<item: float>[512], close: fixed_size_list<item: float>[512]> not null
  child 0, open: fixed_size_list<item: float>[512]
      child 0, item: float
  child 1, high: fixed_size_list<item: float>[512]
      child 0, item: float
  child 2, low: fixed_size_list<item: float>[512]
      child 0, item: float
  child 3, close: fixed_size_list<item: float>[512]
      child 0, item: float
```

With `pyarrow`, I load and iterate over records with the following:

```python
import pyarrow as pa

with pa.memory_map('arraydata.arrow', 'r') as source:
    loaded_arrays = pa.ipc.open_file(source).read_all()

a = 0
for batch in loaded_arrays.to_batches():
    for input_candles in batch["input"]:
        a += 1
```

Iterating over my example file (~10,000 rows) takes 210 ms.

In Julia, I load and iterate over the same file with the following:

```julia
using Arrow, BenchmarkTools

stream = Arrow.Stream("./arraydata.arrow")

function bench_iteration(stream)
    a = 0
    for batch in stream
        for sample in batch.input
            a += 1
        end
    end
    return a
end

@btime bench_iteration($stream)
```

```
3.169 s (25272097 allocations: 1.70 GiB)
```

Iterating over the records takes about 15× longer with `Arrow.jl`. Am I doing something wrong?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
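For comparison, here is a sketch of a column-wise variant of the Julia benchmark. It is an assumption, not a confirmed fix: it presumes the same `arraydata.arrow` file and that `batch.input` (as iterated in the snippet above) also supports `length`, so the row count per batch can be read in one call instead of allocating a view for every element:

```julia
using Arrow

# Hypothetical sketch: count rows via the column length rather than
# iterating element by element over the fixed-size-list column.
function bench_columnwise(path)
    a = 0
    for batch in Arrow.Stream(path)
        # length(batch.input) is the number of rows in this record batch;
        # no per-row struct/list views are materialized.
        a += length(batch.input)
    end
    return a
end
```

If the slowdown comes from materializing a nested `struct`-of-`fixed_size_list` view per row, a length-based (or otherwise columnar) traversal like this should show whether per-element iteration is the bottleneck.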