On Sat, Aug 9, 2014 at 6:15 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> Stephen Frost <sfr...@snowman.net> writes:
> > What about considering how large the object is when we are analyzing if
> > it compresses well overall?
>
> Hmm, yeah, that's a possibility: we could redefine the limit at which
> we bail out in terms of a fraction of the object size instead of a fixed
> limit.  However, that risks expending a large amount of work before we
> bail, if we have a very large incompressible object --- which is not
> exactly an unlikely case.  Consider for example JPEG images stored as
> bytea, which I believe I've heard of people doing.  Another issue is
> that it's not real clear that that fixes the problem for any fractional
> size we'd want to use.  In Larry's example of a jsonb value that fails
> to compress, the header size is 940 bytes out of about 12K, so we'd be
> needing to trial-compress about 10% of the object before we reach
> compressible data --- and I doubt his example is worst-case.
>
> >> 1. The real problem here is that jsonb is emitting quite a bit of
> >> fundamentally-nonrepetitive data, even when the user-visible input is very
> >> repetitive.  That's a compression-unfriendly transformation by anyone's
> >> measure.
>
> > I disagree that another algorithm wouldn't be able to manage better on
> > this data than pglz.  pglz, from my experience, is notoriously bad at
> > certain data sets which other algorithms are not as poorly impacted by.
>
> Well, I used to be considered a compression expert, and I'm going to
> disagree with you here.  It's surely possible that other algorithms would
> be able to get some traction where pglz fails to get any,

During my previous work in this area, I have seen that some algorithms
use skipping logic, which can be useful for incompressible data followed
by compressible data (or in general as well).  One such technique could
be: if we don't find any match in the first 4 bytes, skip 4 bytes; if we
don't find a match in the next 8 bytes, skip 8 bytes; and keep doubling
in that way until we find the first match, at which point we go back to
the beginning of the data and compress normally.  We could follow this
logic until we have actually compared a total of first_success_by bytes.
There may be caveats in this particular skipping scheme, but I just
wanted to mention the general idea of skipping to reduce the number of
situations where we bail out even though there is a lot of compressible
data.
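
To make the idea concrete, here is a rough, untested sketch in C.  It is
not the actual pglz code; find_match() is a placeholder for a hash-probe
routine such as pglz_find_match(), and its signature here is invented
purely for illustration.

#include <stdbool.h>
#include <stddef.h>

/* Placeholder: returns true if a back-reference starts at 'pos'. */
extern bool find_match(const char *data, size_t pos, size_t len);

/*
 * Probe the input with an exponentially growing skip distance.  Examine a
 * small window; on a miss, skip that many bytes and double the window; on
 * a hit, report success so the caller can restart real compression from
 * the beginning of the data.  Give up once first_success_by bytes have
 * actually been examined.
 */
static bool
looks_compressible(const char *data, size_t len, size_t first_success_by)
{
    size_t      pos = 0;
    size_t      window = 4;     /* bytes to examine in this probe */
    size_t      compared = 0;   /* total bytes actually trial-matched */

    while (pos < len && compared < first_success_by)
    {
        size_t      probe_end = pos + window;
        size_t      i;

        if (probe_end > len)
            probe_end = len;

        for (i = pos; i < probe_end; i++)
        {
            if (find_match(data, i, len))
                return true;    /* compressible data found */
        }

        compared += probe_end - pos;
        pos = probe_end + window;   /* miss: skip 'window' bytes ... */
        window *= 2;                /* ... and double for the next probe */
    }

    return false;               /* bail out without compressing */
}

In this particular sketch only the examined bytes count against
first_success_by, so the probe covers roughly twice as much of the input
before bailing out as a straight scan would, at the cost of possibly
jumping over a short compressible run.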

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
