* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost <sfr...@snowman.net> writes:
> > What about considering how large the object is when we are analyzing
> > if it compresses well overall?
>
> Hmm, yeah, that's a possibility: we could redefine the limit at which
> we bail out in terms of a fraction of the object size instead of a
> fixed limit.  However, that risks expending a large amount of work
> before we bail, if we have a very large incompressible object --- which
> is not exactly an unlikely case.  Consider for example JPEG images
> stored as bytea, which I believe I've heard of people doing.  Another
> issue is that it's not real clear that that fixes the problem for any
> fractional size we'd want to use.  In Larry's example of a jsonb value
> that fails to compress, the header size is 940 bytes out of about 12K,
> so we'd be needing to trial-compress about 10% of the object before we
> reach compressible data --- and I doubt his example is worst-case.

Agreed - I tried to allude to that in my prior mail; there's clearly a
concern that we'd make things worse in certain situations.  Then again,
at least for that case, we could recommend changing the storage type to
EXTERNAL.
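To make the trade-off concrete, here's a rough sketch of the fractional
bail-out idea - purely illustrative, not a patch.  The fixed limit in
question is pglz's first_success_by (1024 bytes in the default strategy,
iirc), and the 10% figure below is just taken from your example:

    #include <stdint.h>

    /*
     * Illustrative sketch only, not the actual pglz code: derive the
     * "give up if nothing has compressed yet" point from the input size
     * instead of the fixed first_success_by limit.  The downside is
     * visible immediately: for a 10MB incompressible bytea (say, a
     * JPEG) we would trial-compress roughly 1MB before bailing instead
     * of 1KB.
     */
    #define FIRST_SUCCESS_BY 1024       /* today's fixed limit */

    static int32_t
    bailout_threshold(int32_t source_len)
    {
        /* hypothetical 10% fraction; picking this number is the hard part */
        int32_t frac = source_len / 10;

        return (frac > FIRST_SUCCESS_BY) ? frac : FIRST_SUCCESS_BY;
    }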
> >> 1. The real problem here is that jsonb is emitting quite a bit of
> >> fundamentally-nonrepetitive data, even when the user-visible input
> >> is very repetitive.  That's a compression-unfriendly transformation
> >> by anyone's measure.

> > I disagree that another algorithm wouldn't be able to manage better
> > on this data than pglz.  pglz, from my experience, is notoriously bad
> > at certain data sets which other algorithms are not as poorly
> > impacted by.

> Well, I used to be considered a compression expert, and I'm going to
> disagree with you here.  It's surely possible that other algorithms
> would be able to get some traction where pglz fails to get any, but
> that doesn't mean that presenting them with hard-to-compress data in
> the first place is a good design decision.  There is no scenario in
> which data like this is going to be friendly to a general-purpose
> compression algorithm.  It'd be necessary to have explicit knowledge
> that the data consists of an increasing series of four-byte integers
> to be able to do much with it.  And then such an assumption would
> break down once you got past the header ...

I've wondered previously whether we, perhaps, missed the boat pretty
badly by not providing an explicitly versioned per-type compression
capability, such that we wouldn't be stuck with one compression
algorithm for all types, and would be able to version compression types
in a way that would allow us to change them over time, provided the
newer code always understands how to decode X-4 (or whatever) versions
back (a rough sketch of what I have in mind is below).

I do agree that it'd be great to represent every type in a highly
compressible way for the sake of the compression algorithm, but I've
not seen any good suggestions for how to make that happen, and I've got
a hard time seeing how we could completely change the jsonb storage
format, retain the capabilities it has today, make it highly
compressible, and get 9.4 out this calendar year.  I expect we could
trivially add padding into the jsonb header to make it compress better,
for the sake of this particular check, but then we're going to always
be compressing jsonb, even when the user data isn't actually terribly
good for compression, spending a good bit of CPU time while we're at it.

> > Perhaps another option would be a new storage type which basically
> > says "just compress it, no matter what"?  We'd be able to make that
> > the default for jsonb columns too, no?

> Meh.  We could do that, but it would still require adding arguments to
> toast_compress_datum() that aren't there now.  In any case, this is a
> band-aid solution; and as Josh notes, once we ship 9.4 we are going to
> be stuck with jsonb's on-disk representation pretty much forever.

I agree that we need to avoid changing jsonb's on-disk representation.
Have I missed where a good suggestion has been made about how to do
that which preserves the binary-search capabilities and doesn't make
the code much more difficult?  Trying to move the header to the end
just for the sake of this doesn't strike me as a good solution, as it'll
make things quite a bit more complicated.  Is there a way we could
interleave the likely-compressible user data in with the header
instead?  I've not looked, but it seems like that's the only reasonable
approach to address this issue in this manner.  If that's simply done,
then great, but it strikes me as unlikely to be..
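Coming back to the versioned idea from above - just to be clearer about
what I mean, here's a purely hypothetical sketch; no such hook exists
today and all of the names are made up:

    #include <stdint.h>

    /*
     * Hypothetical only: a per-type compression routine that a data
     * type could register.  Every compressed datum would carry the
     * version that produced it, and newer code would be required to
     * still decode the last few versions (the "X-4" rule above).
     */
    typedef struct TypeCompressionRoutine
    {
        uint8_t     version;            /* written with every compressed datum */
        uint8_t     oldest_decodable;   /* must be >= version - 4 */

        /* returns the compressed length, or -1 to store the datum uncompressed */
        int32_t     (*compress) (const char *src, int32_t srclen,
                                 char *dst, int32_t dstcap);

        /* must handle any stored_version >= oldest_decodable */
        void        (*decompress) (uint8_t stored_version,
                                   const char *src, int32_t srclen,
                                   char *dst, int32_t rawlen);
    } TypeCompressionRoutine;

The point being that jsonb could then get a format-aware compressor (one
which knows the header is a run of increasing offsets) without dragging
every other type along with it.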
I'll just throw out a bit of a counter-point to all of this as well,
though: we don't try to focus on making our on-disk representation of
data, generally, very compressible, even though there are filesystems,
such as ZFS, which might benefit from certain rearrangements of our
on-disk formats (no, I don't have any specific recommendations in this
vein, but I certainly don't see anyone else asking after it or asking
for us to be concerned about it).  Compression is great and I'd hate
for us to end up with a format that doesn't work well with it when it
could be beneficial in many cases, but I feel the fault here is with
the compression algorithm and the decisions made as part of that
operation, and not really with this particular data structure.

	Thanks,

		Stephen