Hi,

AFAIK compressed Arrow IPC files do not support random access within a
batch (unlike their uncompressed counterparts) - you need to decompress
the whole batch, or at least the columns you need from it. A
"RecordBatch" is the compression unit of the file. Think of it like a
Parquet file in which every row group has a single data page.
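
Since compression is applied per buffer rather than to the whole batch,
reading with column projection should avoid decompressing the columns
you skip. A minimal pyarrow sketch - the file name and column names
here are made up:

import pyarrow.feather as feather

# Only the buffers of the selected columns should be decompressed;
# the other columns' buffers are never touched.
table = feather.read_table("data.feather", columns=["ts", "price"])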

With that said, the file does record the number of rows in each record
batch, so you can compute which record batch a given row falls in. The
row count is available in the RecordBatch message's "length" field [1].
Unfortunately you do need to read each record batch message to get it,
and these messages are located in the middle of the file:
[header][record_batch1_message][record_batch1_data][record_batch2_message][record_batch2_data]...[footer].
The positions of the messages are declared in the footer's
"record_batches" field.

[1] https://github.com/apache/arrow/blob/master/format/Message.fbs#L87
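
If it helps, here is a rough Python sketch of that idea. As far as I
know pyarrow does not expose the per-batch "length" field without
reading the batches, so this builds the row-count index by reading each
batch once - in practice you would persist the index, or record the
batch lengths at write time. The function names are my own:

import bisect
import pyarrow as pa
import pyarrow.ipc as ipc

def build_row_index(path):
    # One-time pass: cumulative row count at the end of each batch.
    with pa.memory_map(path) as source:
        reader = ipc.open_file(source)
        ends, total = [], 0
        for i in range(reader.num_record_batches):
            # get_batch() decompresses, so do this once and keep the index
            total += reader.get_batch(i).num_rows
            ends.append(total)
    return ends  # ends[i] == index of the first row after batch i

def batches_for_slice(ends, start_row, stop_row):
    # Indices of the batches that overlap [start_row, stop_row).
    first = bisect.bisect_right(ends, start_row)
    last = min(bisect.bisect_left(ends, stop_row), len(ends) - 1)
    return range(first, last + 1)

For your example, batches_for_slice(ends, 780567127, 922022522) gives
the batch indices to pass to get_batch().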

Best,
Jorge


On Thu, Sep 22, 2022 at 3:01 AM John Muehlhausen <j...@jgm.org> wrote:

> Why aren't all the compressed batches the chunk size I specified in
> write_feather (7000000)?  How can I know which batch my slice resides in if
> this is not a constant?  Using pyarrow 9.0.0
>
> This file contains 1.5 billion rows.  I need a way to know where to look
> for, say, [780567127,922022522)
>
> 0.7492516040802002 done 0 len 7000000
> 1.7520167827606201 done 1 len 7000000
> 3.302407741546631 done 2 len 4995912
> 5.16457986831665 done 3 len 7000000
> 6.0424370765686035 done 4 len 4706276
> 7.58642315864563 done 5 len 7000000
> 7.719322681427002 done 6 len 289636
> 8.705692291259766 done 7 len 5698775
>
> On Wed, Sep 21, 2022 at 7:49 PM John Muehlhausen <j...@jgm.org> wrote:
>
> > The following seems like good news... like I should be able to decompress
> > just one column of a RecordBatch in the middle of a compressed feather v2
> > file.  Is there a Python API for this kind of access?  C++?
> >
> > /// Provided for forward compatibility in case we need to support
> > /// different strategies for compressing the IPC message body (like
> > /// whole-body compression rather than buffer-level) in the future
> > enum BodyCompressionMethod:byte {
> >   /// Each constituent buffer is first compressed with the indicated
> >   /// compressor, and then written with the uncompressed length in the
> >   /// first 8 bytes as a 64-bit little-endian signed integer followed
> >   /// by the compressed buffer bytes (and then padding as required by
> >   /// the protocol). The uncompressed length may be set to -1 to
> >   /// indicate that the data that follows is not compressed, which can
> >   /// be useful for cases where compression does not yield appreciable
> >   /// savings.
> >   BUFFER
> > }
> >
> > On Wed, Sep 21, 2022 at 7:03 PM John Muehlhausen <j...@jgm.org> wrote:
> >
> >> ``Internal structure supports random access and slicing from the middle.
> >> This also means that you can read a large file chunk by chunk without
> >> having to pull the whole thing into memory.''
> >> https://ursalabs.org/blog/2020-feather-v2/
> >>
> >> For a compressed v2 file, can I decompress just one column of a batch
> >> in the middle, or is the entire batch with all of its columns
> >> compressed as a unit?
> >>
> >> Unfortunately reader.get_batch(i) seems like it is doing a lot of work.
> >> Like maybe decompressing all the columns?
> >>
> >> Thanks,
> >> John
> >>
> >
>
