* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost <sfr...@snowman.net> writes:
> > What about considering how large the object is when we are analyzing
> > if it compresses well overall?
>
> Hmm, yeah, that's a possibility: we could redefine the limit at which
> we bail out in terms of a fraction of the object size instead of a
> fixed limit.  However, that risks expending a large amount of work
> before we bail, if we have a very large incompressible object --- which
> is not exactly an unlikely case.  Consider for example JPEG images
> stored as bytea, which I believe I've heard of people doing.  Another
> issue is that it's not real clear that that fixes the problem for any
> fractional size we'd want to use.  In Larry's example of a jsonb value
> that fails to compress, the header size is 940 bytes out of about 12K,
> so we'd be needing to trial-compress about 10% of the object before we
> reach compressible data --- and I doubt his example is worst-case.

Agreed - I tried to allude to that in my prior mail; there's clearly a
concern that we'd make things worse in certain situations.  Then again,
at least for that case, we could recommend changing the storage type to
EXTERNAL.
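To make the trade-off concrete, here's a rough sketch of the fractional
bail-out idea - purely illustrative, not a patch.  The fixed limit in
question is pglz's first_success_by (1024 bytes in the default strategy,
iirc), and the 10% figure below is just taken from your example:

    #include <stdint.h>

    /*
     * Illustrative sketch only, not the actual pglz code: derive the
     * "give up if nothing has compressed yet" point from the input size
     * instead of the fixed first_success_by limit.  The downside is
     * visible immediately: for a 10MB incompressible bytea (say, a
     * JPEG) we would trial-compress roughly 1MB before bailing instead
     * of 1KB.
     */
    #define FIRST_SUCCESS_BY 1024       /* today's fixed limit */

    static int32_t
    bailout_threshold(int32_t source_len)
    {
        /* hypothetical 10% fraction; picking this number is the hard part */
        int32_t frac = source_len / 10;

        return (frac > FIRST_SUCCESS_BY) ? frac : FIRST_SUCCESS_BY;
    }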
> >> 1. The real problem here is that jsonb is emitting quite a bit of
> >> fundamentally-nonrepetitive data, even when the user-visible input
> >> is very repetitive.  That's a compression-unfriendly transformation
> >> by anyone's measure.

> > I disagree that another algorithm wouldn't be able to manage better
> > on this data than pglz.  pglz, from my experience, is notoriously bad
> > at certain data sets which other algorithms are not as poorly
> > impacted by.

> Well, I used to be considered a compression expert, and I'm going to
> disagree with you here.  It's surely possible that other algorithms
> would be able to get some traction where pglz fails to get any, but
> that doesn't mean that presenting them with hard-to-compress data in
> the first place is a good design decision.  There is no scenario in
> which data like this is going to be friendly to a general-purpose
> compression algorithm.  It'd be necessary to have explicit knowledge
> that the data consists of an increasing series of four-byte integers
> to be able to do much with it.  And then such an assumption would
> break down once you got past the header ...

I've wondered previously whether we, perhaps, missed the boat pretty
badly by not providing an explicitly versioned per-type compression
capability, such that we wouldn't be stuck with one compression
algorithm for all types, and would be able to version compression types
in a way that would allow us to change them over time, provided the
newer code always understands how to decode X-4 (or whatever) versions
back (a rough sketch of what I have in mind is below).

I do agree that it'd be great to represent every type in a highly
compressible way for the sake of the compression algorithm, but I've
not seen any good suggestions for how to make that happen, and I've got
a hard time seeing how we could completely change the jsonb storage
format, retain the capabilities it has today, make it highly
compressible, and get 9.4 out this calendar year.  I expect we could
trivially add padding into the jsonb header to make it compress better,
for the sake of this particular check, but then we're going to always
be compressing jsonb, even when the user data isn't actually terribly
good for compression, spending a good bit of CPU time while we're at it.

> > Perhaps another option would be a new storage type which basically
> > says "just compress it, no matter what"?  We'd be able to make that
> > the default for jsonb columns too, no?

> Meh.  We could do that, but it would still require adding arguments to
> toast_compress_datum() that aren't there now.  In any case, this is a
> band-aid solution; and as Josh notes, once we ship 9.4 we are going to
> be stuck with jsonb's on-disk representation pretty much forever.

I agree that we need to avoid changing jsonb's on-disk representation.
Have I missed where a good suggestion has been made about how to do
that which preserves the binary-search capabilities and doesn't make
the code much more difficult?  Trying to move the header to the end
just for the sake of this doesn't strike me as a good solution, as it'll
make things quite a bit more complicated.  Is there a way we could
interleave the likely-compressible user data in with the header
instead?  I've not looked, but it seems like that's the only reasonable
approach to address this issue in this manner.  If that's simply done,
then great, but it strikes me as unlikely to be..
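Coming back to the versioned idea from above - just to be clearer about
what I mean, here's a purely hypothetical sketch; no such hook exists
today and all of the names are made up:

    #include <stdint.h>

    /*
     * Hypothetical only: a per-type compression routine that a data
     * type could register.  Every compressed datum would carry the
     * version that produced it, and newer code would be required to
     * still decode the last few versions (the "X-4" rule above).
     */
    typedef struct TypeCompressionRoutine
    {
        uint8_t     version;            /* written with every compressed datum */
        uint8_t     oldest_decodable;   /* must be >= version - 4 */

        /* returns the compressed length, or -1 to store the datum uncompressed */
        int32_t     (*compress) (const char *src, int32_t srclen,
                                 char *dst, int32_t dstcap);

        /* must handle any stored_version >= oldest_decodable */
        void        (*decompress) (uint8_t stored_version,
                                   const char *src, int32_t srclen,
                                   char *dst, int32_t rawlen);
    } TypeCompressionRoutine;

The point being that jsonb could then get a format-aware compressor (one
which knows the header is a run of increasing offsets) without dragging
every other type along with it.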
I'll just throw out a bit of a counter-point to all of this as well,
though: we don't try to focus on making our on-disk representation of
data, generally, very compressible, even though there are filesystems,
such as ZFS, which might benefit from certain rearrangements of our
on-disk formats (no, I don't have any specific recommendations in this
vein, but I certainly don't see anyone else asking after it or asking
for us to be concerned about it).  Compression is great and I'd hate
for us to end up with a format that doesn't work well with it when it
could be beneficial in many cases, but I feel the fault here is with
the compression algorithm and the decisions made as part of that
operation, and not really with this particular data structure.

	Thanks,

		Stephen