Re: compressed feather v2 "slicing from the middle"

2022-09-22 Thread John Muehlhausen
If I'm understanding the below correctly, it seems that the file format
supports finding an arbitrary compressed buffer without decompressing
anything else.  Correct?

-John

/// --
/// A Buffer represents a single contiguous memory segment
struct Buffer {
  /// The relative offset into the shared memory page where the bytes for this
  /// buffer starts
  offset: long;

  /// The absolute length (in bytes) of the memory buffer. The memory is found
  /// from offset (inclusive) to offset + length (non-inclusive). When building
  /// messages using the encapsulated IPC message, padding bytes may be written
  /// after a buffer, but such padding bytes do not need to be accounted for in
  /// the size here.
  length: long;
}
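
If that reading is right, slicing one buffer's bytes out of a memory map should
touch nothing else in the file.  A minimal sketch (the offsets and file name are
hypothetical stand-ins; in a real reader, body_start would come from the
footer's record-batch Block, and (offset, length) from a Buffer entry like the
one above):

import pyarrow as pa

# Hypothetical values: a real reader gets body_start from the footer's
# record-batch Block, and (buf_offset, buf_length) from the RecordBatch
# message's Buffer entries.
body_start = 4096    # absolute file offset where the message body begins
buf_offset = 128     # Buffer.offset, relative to the body start
buf_length = 65536   # Buffer.length, in bytes

with pa.memory_map("data.feather") as f:
    f.seek(body_start + buf_offset)
    raw = f.read(buf_length)  # only these bytes of the map are touched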

On Thu, Sep 22, 2022 at 9:10 AM John Muehlhausen  wrote:

> Regarding tab=feather.read_table(fname, memory_map=True)
>
> Uncompressed: low-cost setup and len(tab), data is read when sections of
> the map are "paged in" by the OS
> Compressed (desired):
>   * low-cost setup
>   * read the length of the "table" without decompressing anything (len(tab))
>   * low-cost slicing
>   * data is decompressed only on dereference, such that there is a "batch
> column cache" corresponding to the lower level page cache concept; batch
> columns are only decompressed when they are accessed. Repeated access does
> not re-decompress (subject to LRU column cache capacity).
>
> It seems to me that the above design would allow compressed files to serve as
> drop-in replacements for uncompressed files in workloads that depend on the
> characteristics of uncompressed mmap.  The only difference would be two levels
> of "paging in" (page cache -> a new column cache on the heap) instead of
> just the OS page cache level.
>
> Is there anything about the file format that would make this impossible?
> *For example, do we know where each compressed buffer starts without
> needing to decompress buffers around it?*
>
> Thanks,
> John
>
> On Wed, Sep 21, 2022 at 11:53 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
>> Hi,
>>
>> AFAIK compressed IPC Arrow files do not support random access (unlike their
>> uncompressed counterparts) - you need to decompress the whole batch (or at
>> least the columns you need). A "RecordBatch" is the compression unit of the
>> file. Think of it like a Parquet file in which every row group has a single
>> data page.
>>
>> With that said, the file contains the number of rows in each record batch,
>> and thus you can compute which record batch to start from. This is
>> available in the RecordBatch message "length" [1]. Unfortunately you do
>> need to read the record batch message to know it, and these messages are
>> located in the middle of the file:
>>
>> [header][record_batch1_message][record_batch1_data][record_batch2_message][record_batch2_data]...[footer].
>> The positions of the messages are declared in the file's footer's
>> "record_batches".
>>
>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs#L87
>>
>> Best,
>> Jorge
>>
>>
>> On Thu, Sep 22, 2022 at 3:01 AM John Muehlhausen  wrote:
>>
>> > Why aren't all the compressed batches the chunk size I specified in
>> > write_feather (700)?  How can I know which batch my slice resides in if
>> > this is not a constant?  Using pyarrow 9.0.0
>> >
>> > This file contains 1.5 billion rows.  I need a way to know where to look
>> > for, say, [780567127,922022522)
>> >
>> > 0.7492516040802002 done 0 len 700
>> > 1.7520167827606201 done 1 len 700
>> > 3.302407741546631 done 2 len 4995912
>> > 5.16457986831665 done 3 len 700
>> > 6.0424370765686035 done 4 len 4706276
>> > 7.58642315864563 done 5 len 700
>> > 7.719322681427002 done 6 len 289636
>> > 8.705692291259766 done 7 len 5698775
>> >
>> > On Wed, Sep 21, 2022 at 7:49 PM John Muehlhausen  wrote:
>> >
>> > > The following seems like good news... like I should be able to decompress
>> > > just one column of a RecordBatch in the middle of a compressed feather v2
>> > > file.  Is there a Python API for this kind of access?  C++?
>> > >
>> > > /// Provided for forward compatibility in case we need to support different
>> > > /// strategies for compressing the IPC message body (like whole-body
>> > > /// compression rather than buffer-level) in the future
>> > > enum BodyCompressionMethod:byte {
>> > >   /// Each constituent buffer is first compressed with the indicated
>> > >   /// compressor, and then written with the uncompressed length in the first 8
>> > >   /// bytes as a 64-bit little-endian signed integer followed by the compressed
>> > >   /// buffer bytes (and then padding as required by the protocol). The
>> > >   /// uncompressed length may be set to -1 to indicate that the data that
>> > >   /// follows is not compressed, which can be useful for cases where
>> > >   /// compression does not yield appreciable savings.
>> > >   BUFFER
>> > > }

Re: compressed feather v2 "slicing from the middle"

2022-09-22 Thread John Muehlhausen
Regarding tab=feather.read_table(fname, memory_map=True)

Uncompressed: low-cost setup and len(tab), data is read when sections of
the map are "paged in" by the OS
Compressed (desired):
  * low-cost setup
  * read the length of the "table" without decompressing anything (len(tab))
  * low-cost slicing
  * data is decompressed only on dereference, such that there is a "batch
column cache" corresponding to the lower level page cache concept; batch
columns are only decompressed when they are accessed. Repeated access does
not re-decompress (subject to LRU column cache capacity).

It seems to me that the above design would allow compressed files to serve as
drop-in replacements for uncompressed files in workloads that depend on the
characteristics of uncompressed mmap.  The only difference would be two levels
of "paging in" (page cache -> a new column cache on the heap) instead of
just the OS page cache level.
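
A minimal sketch of the "batch column cache" idea (the file name is a
stand-in; note that as of pyarrow 9.0.0, get_batch() appears to decompress
the whole batch, so a cache miss pays batch-level rather than column-level
decompression):

from functools import lru_cache
import pyarrow as pa

reader = pa.ipc.open_file(pa.memory_map("data.feather"))

@lru_cache(maxsize=64)  # LRU capacity: decompressed columns kept on the heap
def column(batch_index, name):
    # Caveat: get_batch currently decompresses every column of the batch;
    # the cache only avoids repeated decompression on later access.
    batch = reader.get_batch(batch_index)
    return batch.column(batch.schema.get_field_index(name))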

Is there anything about the file format that would make this impossible?
*For example, do we know where each compressed buffer starts without
needing to decompress buffers around it?*

Thanks,
John

On Wed, Sep 21, 2022 at 11:53 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> AFAIK compressed IPC Arrow files do not support random access (unlike their
> uncompressed counterparts) - you need to decompress the whole batch (or at
> least the columns you need). A "RecordBatch" is the compression unit of the
> file. Think of it like a Parquet file in which every row group has a single
> data page.
>
> With that said, the file contains the number of rows in each record batch,
> and thus you can compute which record batch to start from. This is
> available in the RecordBatch message "length" [1]. Unfortunately you do
> need to read the record batch message to know it, and these messages are
> located in the middle of the file:
>
> [header][record_batch1_message][record_batch1_data][record_batch2_message][record_batch2_data]...[footer].
> The positions of the messages are declared in the file's footer's
> "record_batches".
>
> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs#L87
>
> Best,
> Jorge
>
>
> On Thu, Sep 22, 2022 at 3:01 AM John Muehlhausen  wrote:
>
> > Why aren't all the compressed batches the chunk size I specified in
> > write_feather (700)?  How can I know which batch my slice resides in if
> > this is not a constant?  Using pyarrow 9.0.0
> >
> > This file contains 1.5 billion rows.  I need a way to know where to look
> > for, say, [780567127,922022522)
> >
> > 0.7492516040802002 done 0 len 700
> > 1.7520167827606201 done 1 len 700
> > 3.302407741546631 done 2 len 4995912
> > 5.16457986831665 done 3 len 700
> > 6.0424370765686035 done 4 len 4706276
> > 7.58642315864563 done 5 len 700
> > 7.719322681427002 done 6 len 289636
> > 8.705692291259766 done 7 len 5698775
> >
> > On Wed, Sep 21, 2022 at 7:49 PM John Muehlhausen  wrote:
> >
> > > The following seems like good news... like I should be able to decompress
> > > just one column of a RecordBatch in the middle of a compressed feather v2
> > > file.  Is there a Python API for this kind of access?  C++?
> > >
> > > /// Provided for forward compatibility in case we need to support different
> > > /// strategies for compressing the IPC message body (like whole-body
> > > /// compression rather than buffer-level) in the future
> > > enum BodyCompressionMethod:byte {
> > >   /// Each constituent buffer is first compressed with the indicated
> > >   /// compressor, and then written with the uncompressed length in the first 8
> > >   /// bytes as a 64-bit little-endian signed integer followed by the compressed
> > >   /// buffer bytes (and then padding as required by the protocol). The
> > >   /// uncompressed length may be set to -1 to indicate that the data that
> > >   /// follows is not compressed, which can be useful for cases where
> > >   /// compression does not yield appreciable savings.
> > >   BUFFER
> > > }
> > >
> > > On Wed, Sep 21, 2022 at 7:03 PM John Muehlhausen  wrote:
> > >
> > >> ``Internal structure supports random access and slicing from the middle.
> > >> This also means that you can read a large file chunk by chunk without
> > >> having to pull the whole thing into memory.''
> > >> https://ursalabs.org/blog/2020-feather-v2/
> > >>
> > >> For a compressed v2 file, can I decompress just one column of a batch in
> > >> the middle, or is the entire batch with all of its columns compressed as a
> > >> unit?
> > >>
> > >> Unfortunately reader.get_batch(i) seems like it is doing a lot of work.
> > >> Like maybe decompressing all the columns?
> > >>
> > >> Thanks,
> > >> John
> > >>
> > >
> >
>


Re: compressed feather v2 "slicing from the middle"

2022-09-21 Thread Jorge Cardoso Leitão
Hi,

AFAIK compressed IPC Arrow files do not support random access (unlike their
uncompressed counterparts) - you need to decompress the whole batch (or at
least the columns you need). A "RecordBatch" is the compression unit of the
file. Think of it like a Parquet file in which every row group has a single
data page.

With that said, the file contains the number of rows in each record batch,
and thus you can compute which record batch to start from. This is
available in the RecordBatch message "length" [1]. Unfortunately you do
need to read the record batch message to know it, and these messages are
located in the middle of the file:
[header][record_batch1_message][record_batch1_data][record_batch2_message][record_batch2_data]...[footer].
The positions of the messages are declared in the file's footer's
"record_batches".

[1] https://github.com/apache/arrow/blob/master/format/Message.fbs#L87
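
For example, a sketch of that walk in Python (AFAIK pyarrow does not expose
the metadata-only "length", so this builds the index by reading each batch
once, paying the decompression cost up front; the file name is a stand-in):

import bisect
import pyarrow as pa

reader = pa.ipc.open_file(pa.memory_map("data.feather"))

# Cumulative row offsets: starts[i] is the first row of batch i.
starts = [0]
for i in range(reader.num_record_batches):
    starts.append(starts[-1] + reader.get_batch(i).num_rows)

def batch_for_row(row):
    return bisect.bisect_right(starts, row) - 1

# e.g. the batches covering rows [780567127, 922022522)
first = batch_for_row(780567127)
last = batch_for_row(922022522 - 1)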

Best,
Jorge


On Thu, Sep 22, 2022 at 3:01 AM John Muehlhausen  wrote:

> Why aren't all the compressed batches the chunk size I specified in
> write_feather (700)?  How can I know which batch my slice resides in if
> this is not a constant?  Using pyarrow 9.0.0
>
> This file contains 1.5 billion rows.  I need a way to know where to look
> for, say, [780567127,922022522)
>
> 0.7492516040802002 done 0 len 700
> 1.7520167827606201 done 1 len 700
> 3.302407741546631 done 2 len 4995912
> 5.16457986831665 done 3 len 700
> 6.0424370765686035 done 4 len 4706276
> 7.58642315864563 done 5 len 700
> 7.719322681427002 done 6 len 289636
> 8.705692291259766 done 7 len 5698775
>
> On Wed, Sep 21, 2022 at 7:49 PM John Muehlhausen  wrote:
>
> > The following seems like good news... like I should be able to decompress
> > just one column of a RecordBatch in the middle of a compressed feather v2
> > file.  Is there a Python API for this kind of access?  C++?
> >
> > /// Provided for forward compatibility in case we need to support different
> > /// strategies for compressing the IPC message body (like whole-body
> > /// compression rather than buffer-level) in the future
> > enum BodyCompressionMethod:byte {
> >   /// Each constituent buffer is first compressed with the indicated
> >   /// compressor, and then written with the uncompressed length in the first 8
> >   /// bytes as a 64-bit little-endian signed integer followed by the compressed
> >   /// buffer bytes (and then padding as required by the protocol). The
> >   /// uncompressed length may be set to -1 to indicate that the data that
> >   /// follows is not compressed, which can be useful for cases where
> >   /// compression does not yield appreciable savings.
> >   BUFFER
> > }
> >
> > On Wed, Sep 21, 2022 at 7:03 PM John Muehlhausen  wrote:
> >
> >> ``Internal structure supports random access and slicing from the middle.
> >> This also means that you can read a large file chunk by chunk without
> >> having to pull the whole thing into memory.''
> >> https://ursalabs.org/blog/2020-feather-v2/
> >>
> >> For a compressed v2 file, can I decompress just one column of a batch in
> >> the middle, or is the entire batch with all of its columns compressed as a
> >> unit?
> >>
> >> Unfortunately reader.get_batch(i) seems like it is doing a lot of work.
> >> Like maybe decompressing all the columns?
> >>
> >> Thanks,
> >> John
> >>
> >
>


Re: compressed feather v2 "slicing from the middle"

2022-09-21 Thread John Muehlhausen
Why aren't all the compressed batches the chunk size I specified in
write_feather (700)?  How can I know which batch my slice resides in if
this is not a constant?  Using pyarrow 9.0.0

This file contains 1.5 billion rows.  I need a way to know where to look
for, say, [780567127,922022522)

0.7492516040802002 done 0 len 700
1.7520167827606201 done 1 len 700
3.302407741546631 done 2 len 4995912
5.16457986831665 done 3 len 700
6.0424370765686035 done 4 len 4706276
7.58642315864563 done 5 len 700
7.719322681427002 done 6 len 289636
8.705692291259766 done 7 len 5698775
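
For reference, a self-contained way to inspect the per-batch layout a writer
actually produced (the table here is a stand-in; with a freshly built
single-chunk table like this I would expect every batch to be capped at 700
rows):

import pyarrow as pa
import pyarrow.feather as feather

t = pa.table({"x": list(range(10_000))})  # stand-in data
feather.write_feather(t, "t.feather", compression="lz4", chunksize=700)

reader = pa.ipc.open_file(pa.memory_map("t.feather"))
for i in range(reader.num_record_batches):
    print(i, "len", reader.get_batch(i).num_rows)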

On Wed, Sep 21, 2022 at 7:49 PM John Muehlhausen  wrote:

> The following seems like good news... like I should be able to decompress
> just one column of a RecordBatch in the middle of a compressed feather v2
> file.  Is there a Python API for this kind of access?  C++?
>
> /// Provided for forward compatibility in case we need to support different
> /// strategies for compressing the IPC message body (like whole-body
> /// compression rather than buffer-level) in the future
> enum BodyCompressionMethod:byte {
>   /// Each constituent buffer is first compressed with the indicated
>   /// compressor, and then written with the uncompressed length in the first 8
>   /// bytes as a 64-bit little-endian signed integer followed by the compressed
>   /// buffer bytes (and then padding as required by the protocol). The
>   /// uncompressed length may be set to -1 to indicate that the data that
>   /// follows is not compressed, which can be useful for cases where
>   /// compression does not yield appreciable savings.
>   BUFFER
> }
>
> On Wed, Sep 21, 2022 at 7:03 PM John Muehlhausen  wrote:
>
>> ``Internal structure supports random access and slicing from the middle.
>> This also means that you can read a large file chunk by chunk without
>> having to pull the whole thing into memory.''
>> https://ursalabs.org/blog/2020-feather-v2/
>>
>> For a compressed v2 file, can I decompress just one column of a batch in
>> the middle, or is the entire batch with all of its columns compressed as a
>> unit?
>>
>> Unfortunately reader.get_batch(i) seems like it is doing a lot of work.
>> Like maybe decompressing all the columns?
>>
>> Thanks,
>> John
>>
>


Re: compressed feather v2 "slicing from the middle"

2022-09-21 Thread John Muehlhausen
The following seems like good news... like I should be able to decompress
just one column of a RecordBatch in the middle of a compressed feather v2
file.  Is there a Python API for this kind of access?  C++?

/// Provided for forward compatibility in case we need to support different
/// strategies for compressing the IPC message body (like whole-body
/// compression rather than buffer-level) in the future
enum BodyCompressionMethod:byte {
  /// Each constituent buffer is first compressed with the indicated
  /// compressor, and then written with the uncompressed length in the first 8
  /// bytes as a 64-bit little-endian signed integer followed by the compressed
  /// buffer bytes (and then padding as required by the protocol). The
  /// uncompressed length may be set to -1 to indicate that the data that
  /// follows is not compressed, which can be useful for cases where
  /// compression does not yield appreciable savings.
  BUFFER
}
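
If so, decoding one buffer should reduce to something like this sketch
(assuming the file was written with LZ4 frame compression and using the
third-party lz4 package; raw must be exactly the Buffer.length bytes located
via the batch's metadata):

import struct
import lz4.frame  # third-party package; assumes compression="lz4"

def decompress_buffer(raw):
    # Layout per the schema above: 8-byte little-endian signed uncompressed
    # length, then the compressed bytes; -1 means stored uncompressed.
    (n,) = struct.unpack_from("<q", raw, 0)
    body = raw[8:]
    if n == -1:
        return body
    out = lz4.frame.decompress(body)
    assert len(out) == n  # sanity-check against the declared length
    return out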

On Wed, Sep 21, 2022 at 7:03 PM John Muehlhausen  wrote:

> ``Internal structure supports random access and slicing from the middle.
> This also means that you can read a large file chunk by chunk without
> having to pull the whole thing into memory.''
> https://ursalabs.org/blog/2020-feather-v2/
>
> For a compressed v2 file, can I decompress just one column of a batch in
> the middle, or is the entire batch with all of its columns compressed as a
> unit?
>
> Unfortunately reader.get_batch(i) seems like it is doing a lot of work.
> Like maybe decompressing all the columns?
>
> Thanks,
> John
>