Why aren't all the compressed batches the chunk size I specified in
write_feather (7000000)? How can I know which batch my slice resides in if
the batch size is not constant? I'm using pyarrow 9.0.0.
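
As far as I can tell, chunksize is an upper bound rather than an exact
size: the writer also starts a new batch at every existing chunk boundary
of the source table, so a table assembled from many pieces comes out as
irregular batches. A minimal sketch of forcing uniform batches by
collapsing the chunks first (the file name and toy table are illustrative,
not from the original post):

import pyarrow as pa
import pyarrow.feather as feather

# Toy stand-in for a table whose columns arrive pre-chunked, as they
# would after pa.concat_tables() or incremental construction.
table = pa.table({"x": pa.chunked_array([[1, 2, 3], [4, 5, 6, 7]])})

# combine_chunks() concatenates each column into a single contiguous
# chunk (at the cost of a full in-memory copy), so the writer's only
# remaining split points come from chunksize itself: every batch except
# the last is then exactly chunksize rows.
table = table.combine_chunks()
feather.write_feather(table, "uniform.feather", chunksize=7_000_000,
                      compression="zstd")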

This file contains 1.5 billion rows.  I need a way to know where to look
for a given row range, say [780567127, 922022522).  For reference, here
are the batch lengths I see (elapsed seconds, batch index, row count):

0.7492516040802002 done 0 len 7000000
1.7520167827606201 done 1 len 7000000
3.302407741546631 done 2 len 4995912
5.16457986831665 done 3 len 7000000
6.0424370765686035 done 4 len 4706276
7.58642315864563 done 5 len 7000000
7.719322681427002 done 6 len 289636
8.705692291259766 done 7 len 5698775
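
Given variable batch lengths, one workable answer to "which batch holds my
slice" is to index the batches once and binary-search the offsets. A
sketch, with a stand-in path; as far as I know pyarrow 9.0 exposes no way
to read a batch's row count without loading (and possibly decompressing)
it, so the scan below is a one-time cost worth caching alongside the file:

import bisect
import pyarrow as pa

# One-time scan to build cumulative row offsets per batch.
# "big.feather" stands in for the 1.5-billion-row file.
with pa.memory_map("big.feather") as source:
    reader = pa.ipc.open_file(source)
    offsets = [0]
    for i in range(reader.num_record_batches):
        offsets.append(offsets[-1] + reader.get_batch(i).num_rows)

def batches_for(start, stop):
    """Indices of the batches overlapping global row range [start, stop)."""
    first = bisect.bisect_right(offsets, start) - 1
    last = bisect.bisect_left(offsets, stop)
    return range(first, last)

print(list(batches_for(780_567_127, 922_022_522)))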

On Wed, Sep 21, 2022 at 7:49 PM John Muehlhausen <j...@jgm.org> wrote:

> The following seems like good news... like I should be able to decompress
> just one column of a RecordBatch in the middle of a compressed Feather v2
> file.  Is there a Python API for this kind of access?  C++?
>
> /// Provided for forward compatibility in case we need to support different
> /// strategies for compressing the IPC message body (like whole-body
> /// compression rather than buffer-level) in the future
> enum BodyCompressionMethod:byte {
>   /// Each constituent buffer is first compressed with the indicated
>   /// compressor, and then written with the uncompressed length in the
>   /// first 8 bytes as a 64-bit little-endian signed integer followed by
>   /// the compressed buffer bytes (and then padding as required by the
>   /// protocol). The uncompressed length may be set to -1 to indicate
>   /// that the data that follows is not compressed, which can be useful
>   /// for cases where compression does not yield appreciable savings.
>   BUFFER
> }
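
I'm not aware of a public pyarrow API (as of 9.0) that decompresses a
single buffer in place, but the per-buffer framing described above is
straightforward to decode by hand once the buffer's byte range has been
located. A sketch of just the framing step (the function name is
hypothetical; finding the byte range via the footer is the hard part and
is not shown):

import struct
import pyarrow as pa

def decode_ipc_buffer(raw, codec="zstd"):
    # raw: the exact byte range of one buffer inside a compressed IPC
    # message body.  Locating that range means walking the file footer
    # and the RecordBatch flatbuffer metadata, not shown here.
    # Per the comment above: the first 8 bytes hold the uncompressed
    # length as a little-endian int64, and -1 means the remaining bytes
    # are stored uncompressed.
    (n,) = struct.unpack_from("<q", raw, 0)
    body = raw[8:]
    if n == -1:
        return body
    return pa.decompress(body, decompressed_size=n, codec=codec,
                         asbytes=True)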
>
> On Wed, Sep 21, 2022 at 7:03 PM John Muehlhausen <j...@jgm.org> wrote:
>
>> ``Internal structure supports random access and slicing from the middle.
>> This also means that you can read a large file chunk by chunk without
>> having to pull the whole thing into memory.''
>> https://ursalabs.org/blog/2020-feather-v2/
>>
>> For a compressed v2 file, can I decompress just one column of a batch in
>> the middle, or is the entire batch with all of its columns compressed as a
>> unit?
>>
>> Unfortunately, reader.get_batch(i) seems to be doing a lot of work,
>> like maybe decompressing all of the columns?
>>
>> Thanks,
>> John
>>
>
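
On the single-column question: the BUFFER method quoted above compresses
each buffer independently, so per-column access is possible at the format
level. In Python the closest built-in lever I know of is column projection
at read time, which should avoid touching the unselected columns' buffers.
A sketch, reusing the stand-in names from above:

import pyarrow.feather as feather

# Project a single column at read time; unselected columns are not
# materialized, which as far as I can tell also skips decompressing
# them.  "big.feather" and column "x" are stand-ins.
tbl = feather.read_table("big.feather", columns=["x"])

# Table.slice is zero-copy, so carving out the row range of interest
# from the projected read adds no further cost.
rows = tbl.slice(780_567_127, 922_022_522 - 780_567_127)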
