Re: Best way to store ragged packet data in Parquet files

Jason Sachs Tue, 03 Nov 2020 13:27:24 -0800

On 2020/11/03 20:49:46, Micah Kornfield <[email protected]> wrote: 
> Hi Jason,
> At least as a first pass I would try to avoid the padding and storing the
> length separately in Parquet.  Using one column for timestamp and one
> column of bytes for the data is what I would try first.  If there is any
> structure to the packets splitting them into the structure could also help.
> 
> -Micah

For the test cases I have, >99% of the packets are the same length, so there's
little to no benefit from removing the padding; the length field and zero padding
barely add anything once you factor compression into the mix.
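
As a rough sanity check of that claim, here is a minimal sketch that uses plain
zlib on synthetic data rather than Parquet itself, so the numbers are only
illustrative. The 256-byte packet size, the 2-byte length field, the random
payloads, and the names make_packets and compare_layouts are all assumptions
made up for this example, not my real data or code.

```python
import random
import struct
import zlib

FULL_LEN = 256  # assumed nominal packet size in bytes (illustrative only)


def make_packets(n=2000, seed=0):
    """Synthetic packets: roughly 99% are FULL_LEN bytes, the rest are shorter."""
    rng = random.Random(seed)
    packets = []
    for _ in range(n):
        length = FULL_LEN if rng.random() < 0.99 else rng.randrange(1, FULL_LEN)
        # Random payload bytes are close to a worst case for the compressor.
        packets.append(bytes(rng.getrandbits(8) for _ in range(length)))
    return packets


def compare_layouts(packets):
    """Return raw and zlib sizes for unpadded payloads vs. length + padded rows."""
    # Lower bound for any ragged layout: payload bytes only, no framing at all.
    ragged = b"".join(packets)
    # Fixed-width layout: 2-byte little-endian length, payload, zero padding.
    padded = b"".join(
        struct.pack("<H", len(p)) + p.ljust(FULL_LEN, b"\x00") for p in packets
    )
    return {
        "ragged_raw": len(ragged),
        "padded_raw": len(padded),
        "ragged_zlib": len(zlib.compress(ragged, 6)),
        "padded_zlib": len(zlib.compress(padded, 6)),
    }


if __name__ == "__main__":
    sizes = compare_layouts(make_packets())
    overhead = sizes["padded_zlib"] / sizes["ragged_zlib"] - 1.0
    print(sizes)
    print(f"compressed overhead of length field + padding: {overhead:.2%}")
```

The unpadded side drops the packet boundaries entirely, so it understates what
a real ragged layout has to store; the comparison is therefore biased against
the padded layout, if anything.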

I've tried use_dictionary=False and that does help some.

But I'll post an updated example to back these statements up and see how much 
better I can get.

I'm just surprised that hdf5 does a better job in this case; maybe I don't 
understand the constraints the file format imposes on data compression.
