Andrew Dunstan <and...@dunslane.net> writes:
> On 08/07/2014 11:17 PM, Tom Lane wrote:
>> I looked into the issue reported in bug #11109.  The problem appears to be
>> that jsonb's on-disk format is designed in such a way that the leading
>> portion of any JSON array or object will be fairly incompressible, because
>> it consists mostly of a strictly-increasing series of integer offsets.

> Ouch.

> Back when this structure was first presented at pgCon 2013, I wondered 
> if we shouldn't extract the strings into a dictionary, because of key 
> repetition, and convinced myself that this shouldn't be necessary 
> because in significant cases TOAST would take care of it.

That's not really the issue here, I think.  The problem is that a
relatively minor aspect of the representation, namely the choice to store
a series of offsets rather than a series of lengths, produces
nonrepetitive data even when the original input is repetitive.
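
To make the offsets-versus-lengths point concrete, here is a minimal sketch rather than the actual jsonb code. It assumes a header of little-endian 4-byte words, one per element, and uses zlib purely as a stand-in compressor (pglz behaves differently in detail). The function names and the 300 x 12-byte input are hypothetical.

```python
import struct
import zlib


def pack_u32(values):
    """Pack a list of integers as little-endian unsigned 32-bit words."""
    return struct.pack("<%dI" % len(values), *values)


def header_as_lengths(lengths):
    """Header that stores each element's length (repetitive for repetitive input)."""
    return pack_u32(lengths)


def header_as_end_offsets(lengths):
    """Header that stores running end offsets (strictly increasing, so no word repeats)."""
    offsets = []
    total = 0
    for n in lengths:
        total += n
        offsets.append(total)
    return pack_u32(offsets)


# Hypothetical input: 300 elements that are all 12 bytes long.
lengths = [12] * 300
as_lengths = header_as_lengths(lengths)
as_offsets = header_as_end_offsets(lengths)

# Same header size either way, but very different compressibility.
print("raw bytes:        ", len(as_lengths), len(as_offsets))
print("zlib, lengths:    ", len(zlib.compress(as_lengths)))
print("zlib, end offsets:", len(zlib.compress(as_offsets)))
```

The lengths header is one 4-byte word repeated 300 times, while the offsets header contains 300 distinct words even though the input was perfectly regular.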

> Maybe we should have pglz_compress() look at the *last* 1024 bytes if it 
> can't find anything worth compressing in the first, for values larger 
> than a certain size.

Not possible with anything like the current implementation, since it's
just an on-the-fly status check, not a trial compression.
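
A toy model of what "on-the-fly status check" means, to show why the tail never gets a look. This is not pglz's code: the function name, the 4-byte window, and the use of a plain set are all assumptions made for illustration. The scan runs front to back and gives up as soon as it has passed the threshold without seeing a repeat.

```python
def gives_up_early(data, first_success_by=1024, key_len=4):
    """Toy front-to-back scan (hypothetical, not pglz).

    Returns True if no repeated key_len-byte window is found before
    first_success_by bytes have been scanned, i.e. the scan would stop
    without ever examining the rest of the input.
    """
    seen = set()
    for pos in range(len(data) - key_len + 1):
        if pos >= first_success_by:
            return True          # threshold reached with no match: give up
        window = data[pos:pos + key_len]
        if window in seen:
            return False         # early match found: keep compressing
        seen.add(window)
    return True                  # input ended without any match


# Small hand-checkable inputs, with the threshold shrunk to 8 bytes.
late_repeat = b"abcdefghijklabcdefgh"   # first repeat starts at byte 12
early_repeat = b"abcdabcdxxxx"          # first repeat starts at byte 4

print(gives_up_early(late_repeat, first_success_by=8))
print(gives_up_early(early_repeat, first_success_by=8))
```

Because the decision is made while scanning forward, a compressible tail cannot rescue a value whose leading bytes show no repeats; checking the tail would need a separate pass.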

> It's worth noting that this is a fairly pathological case. AIUI the 
> example you constructed has an array with 100k string elements. I don't 
> think that's typical. So I suspect that unless I've misunderstood the 
> statement of the problem we're going to find that almost all the jsonb 
> we will be storing is still compressible.

Actually, the 100K-string example I constructed *did* compress.  Larry's
example that's not compressing is only about 12kB.  AFAICS, the threshold
for trouble is in the vicinity of 256 array or object entries (resulting
in a 1kB JEntry array).  That doesn't seem especially high.  There is a
probabilistic component as to whether the early-exit case will actually
fire, since any chance hash collision will probably result in some 3-byte
offset prefix getting compressed.  But the fact that a beta tester tripped
over this doesn't leave me with a warm feeling about the odds that it
won't happen much in the field.
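
The threshold arithmetic above, spelled out as a small sketch. The 4-byte JEntry size is inferred from "256 entries resulting in a 1kB array", and the 1024-byte window is the one mentioned earlier in this thread; the constant and function names are hypothetical.

```python
JENTRY_BYTES = 4          # assumption: one 4-byte header word per entry
EARLY_EXIT_WINDOW = 1024  # the "first 1024 bytes" window discussed above


def jentry_array_bytes(n_entries):
    """Size in bytes of the leading JEntry array for n_entries entries."""
    return n_entries * JENTRY_BYTES


def header_fills_window(n_entries):
    """True if the JEntry array alone covers the whole early-exit window."""
    return jentry_array_bytes(n_entries) >= EARLY_EXIT_WINDOW


for n in (100, 255, 256, 1000):
    print(n, jentry_array_bytes(n), header_fills_window(n))
```

Once the header alone fills the window, whether compression is attempted depends only on the header bytes, not on how repetitive the actual strings are.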

                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
