> For the test cases I have, >99% of the packets are the same length, so
> there's little-to-no benefit of removing the padding; the length field and
> zero padding barely adds anything once you factor compression into the mix.

Are you writing the data out as fixed-size byte arrays or as variable-length binary data?

On Tue, Nov 3, 2020 at 1:26 PM Jason Sachs <[email protected]> wrote:

> On 2020/11/03 20:49:46, Micah Kornfield <[email protected]> wrote:
> > Hi Jason,
> > At least as a first pass I would try to avoid the padding and storing the
> > length separately in Parquet. Using one column for timestamp and one
> > column of bytes for the data is what I would try first. If there is any
> > structure to the packets splitting them into the structure could also
> > help.
> >
> > -Micah
>
> For the test cases I have, >99% of the packets are the same length, so
> there's little-to-no benefit of removing the padding; the length field and
> zero padding barely adds anything once you factor compression into the mix.
>
> I've tried use_dictionaries=False and that does help some.
>
> But I'll post an updated example to back these statements up and see how
> much better I can get.
>
> I'm just surprised that hdf5 does a better job in this case; maybe I don't
> understand the constraints the file format imposes on data compression.
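
For anyone who wants to sanity-check the padding claim outside of Parquet, here is a rough sketch using only the Python standard library. It is not the real test case: the packets are synthetic random bytes, the 64-byte packet size and the 2-byte length field are assumptions picked for illustration, and zlib only stands in for whatever codec the Parquet writer actually uses. It builds a stream where 1% of the packets are short, encodes it once as zero-padded fixed-size records and once as variable-length records, and compares the compressed sizes.

```python
import random
import struct
import zlib

FULL_LEN = 64  # assumed size of a "normal" packet (hypothetical value)


def make_packets(count=2000, seed=0):
    """Synthetic packets: every 100th one is shorter than FULL_LEN."""
    rng = random.Random(seed)
    packets = []
    for i in range(count):
        size = rng.randint(1, FULL_LEN - 1) if i % 100 == 0 else FULL_LEN
        packets.append(bytes(rng.getrandbits(8) for _ in range(size)))
    return packets


def encode_padded(packets):
    """Length field plus payload zero-padded to FULL_LEN (fixed-size records)."""
    return b"".join(
        struct.pack("<H", len(p)) + p.ljust(FULL_LEN, b"\x00") for p in packets
    )


def encode_variable(packets):
    """Length field plus the raw payload, with no padding."""
    return b"".join(struct.pack("<H", len(p)) + p for p in packets)


def padding_overhead(packets):
    """Relative size increase of the padded layout after zlib compression."""
    padded = len(zlib.compress(encode_padded(packets)))
    variable = len(zlib.compress(encode_variable(packets)))
    return (padded - variable) / variable


if __name__ == "__main__":
    print(f"padding overhead after compression: {padding_overhead(make_packets()):.2%}")
```

Random payloads barely compress at all, so this only isolates the cost of the padding itself; real packets with internal structure would behave differently, and the Parquet encodings (dictionary, fixed-size versus variable-length binary) are not modelled here.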
