Re: compressed feather v2 "slicing from the middle"

2022-09-22 Thread John Muehlhausen
If I'm understanding the below correctly, it seems that the file format supports finding an arbitrary compressed buffer without decompressing anything else. Correct? -John /// -- /// A Buffer represents a single contiguous

Re: compressed feather v2 "slicing from the middle"

2022-09-22 Thread John Muehlhausen
Regarding tab = feather.read_table(fname, memory_map=True)
Uncompressed: low-cost setup and len(tab); data is read when sections of the map are "paged in" by the OS.
Compressed (desired):
* low-cost setup
* read the length of the "table" without decompressing anything ( len(tab) )
*

Re: compressed feather v2 "slicing from the middle"

2022-09-21 Thread Jorge Cardoso Leitão
Hi, AFAIK compressed Arrow IPC files do not support random access (as their uncompressed counterparts do) - you need to decompress the whole batch (or at least the columns you need). A "RecordBatch" is the compression unit of the file. Think of it like a parquet file whose every row group has a single

Re: compressed feather v2 "slicing from the middle"

2022-09-21 Thread John Muehlhausen
Why aren't all the compressed batches the chunk size I specified in write_feather (700)? How can I know which batch my slice resides in if this is not a constant? Using pyarrow 9.0.0. This file contains 1.5 billion rows. I need a way to know where to look for, say, [780567127,922022522)
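
One possible way to answer the "which batch holds my slice" question when batch sizes are not constant is to keep a cumulative row-offset index and binary-search it. The sketch below is only an illustration, not something from the thread: the helper names batch_offsets and locate_slice are hypothetical, and it assumes the per-batch row counts have already been collected somehow (for example in a one-time pass over the file's record batches).

```python
from bisect import bisect_right
from itertools import accumulate


def batch_offsets(batch_lengths):
    """Return cumulative offsets: offsets[i] is the first global row of
    batch i, and offsets[-1] is the total number of rows."""
    return [0, *accumulate(batch_lengths)]


def locate_slice(offsets, start, stop):
    """Map the half-open global row range [start, stop) to a list of
    (batch_index, local_start, local_stop) pieces."""
    total = offsets[-1]
    if not 0 <= start <= stop <= total:
        raise IndexError(f"slice [{start}, {stop}) outside [0, {total})")
    pieces = []
    if start == stop:
        return pieces
    # Last batch whose first row is <= start (skips empty batches before it).
    i = bisect_right(offsets, start) - 1
    while i < len(offsets) - 1 and offsets[i] < stop:
        lo = max(start, offsets[i]) - offsets[i]
        hi = min(stop, offsets[i + 1]) - offsets[i]
        if hi > lo:  # ignore zero-length batches
            pieces.append((i, lo, hi))
        i += 1
    return pieces


# Hypothetical batch lengths; real ones would come from the file itself.
offsets = batch_offsets([700, 650, 700, 300])
print(locate_slice(offsets, 600, 1400))
```

With such an index, only the batches named in the result would need to be read and decompressed, and each local range says which rows to keep from that batch.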

Re: compressed feather v2 "slicing from the middle"

2022-09-21 Thread John Muehlhausen
The following seems like good news... like I should be able to decompress just one column of a RecordBatch in the middle of a compressed feather v2 file. Is there a Python API for this kind of access? C++? /// Provided for forward compatibility in case we need to support different ///

compressed feather v2 "slicing from the middle"

2022-09-21 Thread John Muehlhausen
"Internal structure supports random access and slicing from the middle. This also means that you can read a large file chunk by chunk without having to pull the whole thing into memory." https://ursalabs.org/blog/2020-feather-v2/ For a compressed v2 file, can I decompress just one column of a