On 2020/11/03 20:49:46, Micah Kornfield <emkornfi...@gmail.com> wrote:
> Hi Jason,
> At least as a first pass I would try to avoid the padding and storing the
> length separately in Parquet. Using one column for timestamp and one
> column of bytes for the data is what I would try first. If there is any
> structure to the packets splitting them into the structure could also help.
>
> -Micah
For the test cases I have, >99% of the packets are the same length, so there's
little to no benefit in removing the padding; the length field and zero padding
barely add anything once you factor compression into the mix.
I've tried use_dictionary=False and that does help somewhat.
But I'll post an updated example to back these statements up and see how much
better I can get.
I'm just surprised that HDF5 does a better job in this case; maybe I don't
understand the constraints the file format imposes on data compression.
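
The padding claim above can be sanity-checked without Parquet or HDF5 at all.
The sketch below is only an illustration under stated assumptions: the packets
are synthetic (64-byte payloads, roughly 1% shorter ones, random contents), the
2-byte length prefix and the helper names are hypothetical, and zlib stands in
for whatever codec the file format would actually apply. It compares a padded
fixed-width layout against a variable-width one after compression.

```python
import random
import struct
import zlib


def make_packets(n, full_len=64, short_fraction=0.01, seed=0):
    """Build synthetic packets: most are full_len bytes, a few are shorter.

    Assumption: payload bytes are random, which is a worst case for the
    compressor; real packet data will usually compress better.
    """
    rng = random.Random(seed)
    packets = []
    for _ in range(n):
        if rng.random() < short_fraction:
            length = rng.randrange(1, full_len)
        else:
            length = full_len
        packets.append(bytes(rng.getrandbits(8) for _ in range(length)))
    return packets


def padded_layout(packets, full_len=64):
    """Fixed-width records: 2-byte length, then the payload zero-padded."""
    return b"".join(
        struct.pack("<H", len(p)) + p.ljust(full_len, b"\x00") for p in packets
    )


def variable_layout(packets):
    """Variable-width records: 2-byte length, then the raw payload."""
    return b"".join(struct.pack("<H", len(p)) + p for p in packets)


def compressed_size(data, level=6):
    """Size in bytes after zlib compression (a stand-in codec)."""
    return len(zlib.compress(data, level))


if __name__ == "__main__":
    packets = make_packets(2000)
    padded = padded_layout(packets)
    variable = variable_layout(packets)
    print("raw bytes        padded:", len(padded), "variable:", len(variable))
    print("compressed bytes padded:", compressed_size(padded),
          "variable:", compressed_size(variable))
```

If the two compressed sizes come out within a percent or so of each other, the
padding is not what makes the Parquet file larger, and the difference is more
likely to come from encoding or page-level settings.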