Why aren't all the compressed batches the chunk size I specified in write_feather (7000000)? How can I know which batch my slice resides in if the batch size is not constant? Using pyarrow 9.0.0.
This file contains 1.5 billion rows. I need a way to know where to look for, say, rows [780567127, 922022522).

0.7492516040802002 done 0 len 7000000
1.7520167827606201 done 1 len 7000000
3.302407741546631 done 2 len 4995912
5.16457986831665 done 3 len 7000000
6.0424370765686035 done 4 len 4706276
7.58642315864563 done 5 len 7000000
7.719322681427002 done 6 len 289636
8.705692291259766 done 7 len 5698775

On Wed, Sep 21, 2022 at 7:49 PM John Muehlhausen <j...@jgm.org> wrote:

> The following seems like good news... like I should be able to decompress
> just one column of a RecordBatch in the middle of a compressed feather v2
> file. Is there a Python API for this kind of access? C++?
>
> /// Provided for forward compatibility in case we need to support different
> /// strategies for compressing the IPC message body (like whole-body
> /// compression rather than buffer-level) in the future
> enum BodyCompressionMethod:byte {
>   /// Each constituent buffer is first compressed with the indicated
>   /// compressor, and then written with the uncompressed length in the first 8
>   /// bytes as a 64-bit little-endian signed integer followed by the compressed
>   /// buffer bytes (and then padding as required by the protocol). The
>   /// uncompressed length may be set to -1 to indicate that the data that
>   /// follows is not compressed, which can be useful for cases where
>   /// compression does not yield appreciable savings.
>   BUFFER
> }
>
> On Wed, Sep 21, 2022 at 7:03 PM John Muehlhausen <j...@jgm.org> wrote:
>
>> "Internal structure supports random access and slicing from the middle.
>> This also means that you can read a large file chunk by chunk without
>> having to pull the whole thing into memory."
>> https://ursalabs.org/blog/2020-feather-v2/
>>
>> For a compressed v2 file, can I decompress just one column of a batch in
>> the middle, or is the entire batch with all of its columns compressed as a
>> unit?
>>
>> Unfortunately reader.get_batch(i) seems like it is doing a lot of work.
>> Like maybe decompressing all the columns?
>>
>> Thanks,
>> John
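Given the batch lengths printed in the log above, the "which batch holds my slice?" part of the question can be answered with a cumulative-offset lookup once the per-batch row counts are known (however they are obtained, e.g. by scanning the file once and caching the counts). A minimal stdlib-only sketch, using the lengths from the log as sample data:

```python
from bisect import bisect_right
from itertools import accumulate

# Per-batch row counts, copied from the "done N len ..." lines above.
# In practice these would be collected from the file itself and cached.
lengths = [7000000, 7000000, 4995912, 7000000, 4706276, 7000000, 289636, 5698775]

# ends[i] = first global row index *after* batch i (exclusive end offset)
ends = list(accumulate(lengths))

def batch_for_row(row):
    """Map a global row index to (batch_index, row_within_batch)."""
    i = bisect_right(ends, row)  # first batch whose end offset exceeds `row`
    start = ends[i - 1] if i > 0 else 0
    return i, row - start

print(batch_for_row(0))         # first row of batch 0
print(batch_for_row(7000000))   # first row of batch 1
```

A slice [a, b) then spans the batches from `batch_for_row(a)[0]` through `batch_for_row(b - 1)[0]`, and only those batches need to be fetched.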