Re: [I] Writing and Reading Random Access Files [arrow-julia]

via GitHub Sun, 20 Jul 2025 00:18:36 -0700


Yuan-Ru-Lin commented on issue #434:
URL: https://github.com/apache/arrow-julia/issues/434#issuecomment-3093807424


   > Is there a way to get the batch-offset table with Arrow.jl, if the data is written in "file" mode?
   
   Yes.
   
   Consider `test.arrow` generated by the following script.
   
   ```julia
   using Arrow
   using TypedTables
   using Tables
   
   t = Table(
       a=collect(1:10_000),
       b=rand(Float32, 10_000),
       c=rand(ComplexF32, 10_000),
   )
   
   # This would produce 10 RecordBatches
   Arrow.write("test.arrow", Tables.partitioner(Iterators.partition(t, 1_000)))
   ```
   
   Then one can get the offsets of all the `RecordBatch`es by `read`ing the footer bytes at the end of the file and parsing them with `Arrow.FlatBuffers.getrootas(Arrow.Meta.Footer, _footerbytes, 0)`:
   
   ```julia
   using Arrow
   
   f = open("test.arrow")
   
   # Check whether the magic number is there
   seekend(f)
   seek(f, position(f) - 6)
   @assert String(read(f, 6)) == "ARROW1"
   
   # Fetch footer size (a little-endian Int32 stored right before the trailing magic number)
   seekend(f)
   seek(f, position(f) - 6 - 4)
   footersize = ltoh(read(f, Int32))
   @assert footersize == 560  # value for this particular file
   
   # Fetch footer
   seekend(f)
   seek(f, position(f) - 6 - 4 - footersize)
   _footerbytes = read(f, footersize)
   _footer = Arrow.FlatBuffers.getrootas(Arrow.Meta.Footer, _footerbytes, 0)
   
   """
   julia> _footer.recordBatches
   10-element Arrow.FlatBuffers.Array{Arrow.Flatbuf.Block, NTuple{24, UInt8}, Arrow.Flatbuf.Footer}:
    Arrow.Flatbuf.Block(offset = 320, metaDataLength = 320, bodyLength = 20000)
    Arrow.Flatbuf.Block(offset = 20640, metaDataLength = 320, bodyLength = 20000)
    Arrow.Flatbuf.Block(offset = 40960, metaDataLength = 320, bodyLength = 20000)
    Arrow.Flatbuf.Block(offset = 61280, metaDataLength = 320, bodyLength = 20000)
    Arrow.Flatbuf.Block(offset = 81600, metaDataLength = 320, bodyLength = 20000)
    Arrow.Flatbuf.Block(offset = 101920, metaDataLength = 320, bodyLength = 20000)
    Arrow.Flatbuf.Block(offset = 122240, metaDataLength = 320, bodyLength = 20000)
    Arrow.Flatbuf.Block(offset = 142560, metaDataLength = 320, bodyLength = 20000)
    Arrow.Flatbuf.Block(offset = 162880, metaDataLength = 320, bodyLength = 20000)
    Arrow.Flatbuf.Block(offset = 183200, metaDataLength = 320, bodyLength = 20000)
   """
   
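   # Sanity check: fetch the first column in the first block using the above information
   seek(f, 320 + 320)           # offset + metaDataLength of the first block
   block1data = read(f, 20000)  # bodyLength of the first block
   reinterpret(Int64, block1data[1:8000])  # column `a`: 1000 Int64 values = 8000 bytes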
   
   """
   julia> reinterpret(Int64, block1data[1:8000])
   1000-element reinterpret(Int64, ::Vector{UInt8}):
       1
       2
       3
       4
       (omitted)
   """
   ```
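   
   The footer lookup above can be wrapped into a small helper that works on any seekable `IO`. This is only a minimal sketch using Base functions: `read_footer_bytes` is a hypothetical name, not part of Arrow.jl, and it assumes the trailing layout used in the steps above (footer, then a little-endian `Int32` footer size, then the 6-byte magic `ARROW1`).
   
   ```julia
   # Hypothetical helper (not an Arrow.jl API): return the raw footer bytes of an
   # Arrow "file"-format stream, assuming the trailing layout
   #   <footer> <footer size: little-endian Int32> <"ARROW1">
   function read_footer_bytes(io::IO)
       magic = b"ARROW1"
       seekend(io)
       total = position(io)
       total >= length(magic) + 4 || error("too short to be an Arrow file")
   
       # Trailing magic number
       seek(io, total - length(magic))
       read(io, length(magic)) == magic || error("trailing magic number not found")
   
       # Footer size sits right before the magic number
       seek(io, total - length(magic) - 4)
       footersize = Int(ltoh(read(io, Int32)))
       0 <= footersize <= total - length(magic) - 4 || error("invalid footer size")
   
       # The footer itself
       seek(io, total - length(magic) - 4 - footersize)
       return read(io, footersize)
   end
   ```
   
   The result could then be passed to the same call as above, e.g. `Arrow.FlatBuffers.getrootas(Arrow.Meta.Footer, open(read_footer_bytes, "test.arrow"), 0)`.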
   
   I accessed the first batch, but in principle one can access any block without reading the others.
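   
   A minimal sketch of that random access, under the same assumption as the sanity check above: a block's body starts `metaDataLength` bytes after its `offset` and spans `bodyLength` bytes. `read_block_body` is a hypothetical helper, not part of Arrow.jl. It accepts anything with those three fields, so the NamedTuple below stands in for an entry of `_footer.recordBatches` (assuming the entries expose these as properties, as the printed output suggests).
   
   ```julia
   # Hypothetical helper (not an Arrow.jl API): read only the body bytes of one block.
   # `block` is anything with `offset`, `metaDataLength` and `bodyLength` fields.
   function read_block_body(io::IO, block)
       seek(io, block.offset + block.metaDataLength)
       return read(io, block.bodyLength)
   end
   
   # Stand-in for the 4th entry printed above
   block4 = (offset = 61280, metaDataLength = 320, bodyLength = 20000)
   
   # With the file generated by the script above, one could then try:
   # body = open(io -> read_block_body(io, block4), "test.arrow")
   # reinterpret(Int64, body[1:8000])  # first column of the 4th batch, if laid out like block 1
   ```
   
   Only one `seek` and one `read` are issued per block, so the cost does not depend on how many batches precede it.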
   
   In order to come up with an API, I still need to know how to parse the bytes that make up a `RecordBatch`.
   
   By the way, this might provide a way to close https://github.com/apache/arrow-julia/issues/353


