On 08/07/2014 11:17 PM, Tom Lane wrote:
I looked into the issue reported in bug #11109.  The problem appears to be
that jsonb's on-disk format is designed in such a way that the leading
portion of any JSON array or object will be fairly incompressible, because
it consists mostly of a strictly-increasing series of integer offsets.
This interacts poorly with the code in pglz_compress() that gives up if
it's found nothing compressible in the first first_success_by bytes of a
value-to-be-compressed.  (first_success_by is 1024 in the default set of
compression parameters.)
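
To make that concrete, here is a small standalone illustration (a bare array
of 32-bit end offsets standing in for the header; the real JEntry encoding
differs in detail) showing that such a stream contains no repeated 4-byte
window within the first 1024 bytes, which is all pglz_compress gets to look
at before it gives up:

    /*
     * Illustrative sketch only, not the actual jsonb code: fixed-length
     * elements give strictly increasing end offsets, and no 4-byte window
     * repeats within the first 1024 bytes of the resulting byte stream.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int
    main(void)
    {
        enum { NELEMS = 512, ELEMLEN = 10, PROBE = 1024 };
        unsigned char header[NELEMS * 4];
        int         repeats = 0;

        /* emit each end offset as 4 little-endian bytes */
        for (int i = 0; i < NELEMS; i++)
        {
            uint32_t    off = (uint32_t) (i + 1) * ELEMLEN;

            header[4 * i + 0] = off & 0xFF;
            header[4 * i + 1] = (off >> 8) & 0xFF;
            header[4 * i + 2] = (off >> 16) & 0xFF;
            header[4 * i + 3] = (off >> 24) & 0xFF;
        }

        /* brute-force search for any repeated 4-byte window */
        for (int i = 0; i + 4 <= PROBE; i++)
            for (int j = i + 1; j + 4 <= PROBE; j++)
                if (memcmp(header + i, header + j, 4) == 0)
                    repeats++;

        printf("repeated 4-byte windows in first %d bytes: %d\n",
               PROBE, repeats);     /* 0 for this stream */
        return 0;
    }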

[snip]

There is plenty of compressible data once we get into the repetitive
strings in the payload part --- but that starts at offset 944, and up to
that point there is nothing that pg_lzcompress can get a handle on.  There
are, by definition, no sequences of 4 or more repeated bytes in that area.
I think in principle pg_lzcompress could decide to compress the 3-byte
sequences consisting of the high-order 24 bits of each offset; but it
doesn't choose to do so, probably because of the way its lookup hash table
works:

  * pglz_hist_idx -
  *
  *             Computes the history table slot for the lookup by the next 4
  *             characters in the input.
  *
  * NB: because we use the next 4 characters, we are not guaranteed to
  * find 3-character matches; they very possibly will be in the wrong
  * hash list.  This seems an acceptable tradeoff for spreading out the
  * hash keys more.

For jsonb header data, the "next 4 characters" are *always* different, so
only a chance hash collision can result in a match.  There is therefore a
pretty good chance that no compression will occur before it gives up
because of first_success_by.
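
A toy model of that effect (the hash below is only loosely modeled on
pglz_hist_idx, not the exact macro from pg_lzcompress.c): two positions
whose next three bytes are identical but whose fourth byte differs land in
different history lists, so the 3-byte match is never even considered.

    #include <stdio.h>

    /* stand-in for the history-table hash: keyed by the next 4 bytes */
    static int
    hist_idx(const unsigned char *s)
    {
        return ((s[0] << 6) ^ (s[1] << 4) ^ (s[2] << 2) ^ s[3]) & 0x0FFF;
    }

    int
    main(void)
    {
        /*
         * Two 4-byte windows taken from an offset stream: the first three
         * bytes (the upper bytes of one offset) are identical, but the
         * fourth byte, the low-order byte of the following offset, differs.
         */
        unsigned char w1[4] = {0x03, 0x00, 0x00, 0xe8};
        unsigned char w2[4] = {0x03, 0x00, 0x00, 0xf2};

        printf("slot(w1) = %d, slot(w2) = %d\n", hist_idx(w1), hist_idx(w2));
        return 0;
    }

So unless two header windows happen to collide, the match loop never gets a
candidate to extend, which is exactly the behavior described above.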

I'm not sure if there is any easy fix for this.  We could possibly change
the default first_success_by value, but I think that'd just be postponing
the problem to larger jsonb objects/arrays, and it would hurt performance
for genuinely incompressible data.  A somewhat painful, but not yet
out-of-the-question, alternative is to change the jsonb on-disk
representation.  Perhaps the JEntry array could be defined as containing
element lengths instead of element ending offsets.  Not sure though if
that would break binary searching for JSON object keys.
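
For illustration only (this is just a sketch of the trade-off, not a worked-out
format): lengths make the header highly repetitive when elements are of similar
size, but locating element i then means summing the preceding lengths rather
than reading one slot.

    #include <stdio.h>
    #include <stdint.h>

    /*
     * Sketch: with ending offsets, element i's start is entry[i-1], an O(1)
     * lookup, but the values are strictly increasing and compress poorly.
     * With per-element lengths, equal-sized elements produce long runs of
     * identical 4-byte values, which pglz handles nicely, but finding where
     * element i starts needs a prefix sum:
     */
    static uint32_t
    element_start(const uint32_t *lengths, int i)
    {
        uint32_t    start = 0;

        for (int j = 0; j < i; j++)     /* O(i) walk instead of O(1) lookup */
            start += lengths[j];
        return start;
    }

    int
    main(void)
    {
        uint32_t    lengths[] = {8, 8, 8, 8, 8};

        printf("element 3 starts at offset %u\n", element_start(lengths, 3));
        return 0;
    }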

Ouch.

Back when this structure was first presented at PGCon 2013, I wondered whether we shouldn't extract the strings into a dictionary, because of key repetition, but I convinced myself that it wasn't necessary because in the significant cases TOAST would take care of it.

Maybe, for values larger than a certain size, we should have pglz_compress() look at the *last* 1024 bytes if it can't find anything worth compressing in the first 1024.
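
Something along these lines, purely as a sketch of the idea (window_has_repeats
is a hypothetical stand-in for whatever cheap test pglz would actually use; the
real code would reuse its existing history-table machinery):

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Hypothetical helper, not part of pg_lzcompress.c: does any 4-byte
     * sequence repeat inside the window?  (O(n^2), for illustration only.)
     */
    static bool
    window_has_repeats(const unsigned char *buf, int len)
    {
        for (int i = 0; i + 4 <= len; i++)
            for (int j = i + 1; j + 4 <= len; j++)
                if (memcmp(buf + i, buf + j, 4) == 0)
                    return true;
        return false;
    }

    /*
     * The suggested heuristic: if the first first_success_by bytes look
     * incompressible and the value is large, probe the tail before giving
     * up on the whole datum.
     */
    static bool
    worth_compressing(const unsigned char *src, int slen, int first_success_by)
    {
        int         head = (slen < first_success_by) ? slen : first_success_by;

        if (window_has_repeats(src, head))
            return true;
        if (slen > 2 * first_success_by &&
            window_has_repeats(src + slen - first_success_by, first_success_by))
            return true;
        return false;
    }

    int
    main(void)
    {
        unsigned char buf[4096];

        /* head: strictly increasing 16-bit values, so no 4-byte repeats */
        for (int i = 0; i < 512; i++)
        {
            buf[2 * i]     = (unsigned char) (i >> 8);
            buf[2 * i + 1] = (unsigned char) (i & 0xFF);
        }
        /* tail: highly repetitive payload */
        memset(buf + 1024, 'x', 3072);

        printf("worth compressing: %s\n",
               worth_compressing(buf, 4096, 1024) ? "yes" : "no");
        return 0;
    }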

It's worth noting that this is a fairly pathological case. AIUI, the example you constructed has an array with 100k string elements, which I don't think is typical. So unless I've misunderstood the statement of the problem, I suspect we're going to find that almost all the jsonb we will be storing is still compressible.

cheers

andrew

