Yuan-Ru-Lin commented on issue #434: URL: https://github.com/apache/arrow-julia/issues/434#issuecomment-3093807424
> Is there a way to get the batch-offset table with Arrow.jl, if the data is written in "file" mode?

Yes. Consider `test.arrow` generated by the following script.

```julia
using Arrow
using TypedTables
using Tables

t = Table(
    a=collect(1:10_000),
    b=rand(Float32, 10_000),
    c=rand(ComplexF32, 10_000),
)

# This would produce 10 RecordBatches
Arrow.write("test.arrow", Tables.partitioner(Iterators.partition(t, 1_000)))
```

Then one can get the offsets of all the `RecordBatch`es by `read`ing the relevant bytes and parsing them using `Arrow.FlatBuffers.getrootas(Arrow.Meta.Footer, _footerbytes, 0)`

```julia
using Arrow

f = open("test.arrow")

# Check whether the magic number is there
seekend(f)
seek(f, position(f) - 6)
@assert String(read(f, 6)) == "ARROW1"

# Fetch footer size (stored as an Int32 right before the trailing magic number)
seekend(f)
seek(f, position(f) - 6 - 4)
footersize = only(reinterpret(Int32, read(f, 4)))
@assert footersize == 560  # value for this particular file

# Fetch footer
seekend(f)
seek(f, position(f) - 6 - 4 - footersize)
_footerbytes = read(f, footersize)
_footer = Arrow.FlatBuffers.getrootas(Arrow.Meta.Footer, _footerbytes, 0)

"""
julia> _footer.recordBatches
10-element Arrow.FlatBuffers.Array{Arrow.Flatbuf.Block, NTuple{24, UInt8}, Arrow.Flatbuf.Footer}:
 Arrow.Flatbuf.Block(offset = 320, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 20640, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 40960, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 61280, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 81600, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 101920, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 122240, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 142560, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 162880, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 183200, metaDataLength = 320, bodyLength = 20000)
"""

# Sanity check: fetch the first column in the first block using the above information.
# The body starts at offset + metaDataLength; column `a` is 1000 Int64 values, i.e. the first 8000 bytes.
seek(f, 320 + 320)
block1data = read(f, 20000)
reinterpret(Int64, block1data[1:8000])

"""
julia> reinterpret(Int64, block1data[1:8000])
1000-element reinterpret(Int64, ::Vector{UInt8}):
 1
 2
 3
 4
 (omitted)
"""

close(f)
```

I accessed only the first batch, but in principle one can access any block without reading the others. To come up with an API, I still need to work out how to parse the bytes that make up a `RecordBatch`.

By the way, this might provide a way to close https://github.com/apache/arrow-julia/issues/353
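
To illustrate the point about reaching any block directly, here is a minimal sketch in plain Julia with no Arrow.jl dependency. `read_block_body` is a hypothetical helper, not part of Arrow.jl. It assumes, as in the sanity check above, that a block's body starts at `offset + metaDataLength`. The `IOBuffer` and the `blocks` vector are synthetic stand-ins for the file and for `_footer.recordBatches`.

```julia
# Hypothetical helper (not an Arrow.jl API): read the body of one block,
# given the three fields of its footer entry, without touching other blocks.
function read_block_body(io::IO, offset::Integer, metalen::Integer, bodylen::Integer)
    seek(io, offset + metalen)   # skip the block's metadata; the body follows it
    body = read(io, bodylen)
    length(body) == bodylen || error("truncated block body")
    return body
end

# Synthetic stand-in for a file: two "blocks", each 4 metadata bytes + 16 body bytes.
payload1 = collect(reinterpret(UInt8, Int64[1, 2]))
payload2 = collect(reinterpret(UInt8, Int64[3, 4]))
buf = IOBuffer(vcat(zeros(UInt8, 4), payload1, zeros(UInt8, 4), payload2))

# Stand-in for the footer's block table (same field names as in the output above).
blocks = [
    (offset = 0,  metaDataLength = 4, bodyLength = 16),
    (offset = 20, metaDataLength = 4, bodyLength = 16),
]

# Jump straight to the second block.
b = blocks[2]
second = read_block_body(buf, b.offset, b.metaDataLength, b.bodyLength)
vals = collect(reinterpret(Int64, second))
```

With the real file, the same call would take the fields of `_footer.recordBatches[k]` and the open file handle. Splitting the returned bytes into columns still requires parsing the `RecordBatch` metadata, which this sketch does not attempt.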