Re: [HACKERS] Custom compression methods

2018-01-29 Thread Ildus Kurbangaliev
On Mon, 29 Jan 2018 17:29:29 +0300
Ildar Musin  wrote:

> 
> Patch applies cleanly, builds without any warnings, documentation
> builds ok, all tests pass.
> 
> A remark for the committers. The patch is quite big, so I really wish
> more reviewers looked into it for a more comprehensive review. Also a
> native English speaker should check the documentation and comments.
> Another thing is that the tests don't cover the cmdrop method because the
> built-in pglz compression doesn't use it (I know there is a jsonbd
> extension [1] based on this patch which should benefit from the
> cmdrop method, but it doesn't test it yet either).
> 
> I think I did what I could, so I'm passing this patch to the committers
> for review. I've changed the status to "Ready for committer".
> 
> 
> [1] https://github.com/postgrespro/jsonbd
> 

Thank you!

About cmdrop: I have checked manually that it is called, but I'm going to
test it more thoroughly in my extension.

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [HACKERS] Custom compression methods

2018-01-29 Thread Ildar Musin

Hello Ildus,

On 29.01.2018 14:44, Ildus Kurbangaliev wrote:


Thanks! Attached a new version of the patch.



Patch applies cleanly, builds without any warnings, documentation builds 
ok, all tests pass.


A remark for the committers. The patch is quite big, so I really wish
more reviewers looked into it for a more comprehensive review. Also a
native English speaker should check the documentation and comments.
Another thing is that the tests don't cover the cmdrop method because the
built-in pglz compression doesn't use it (I know there is a jsonbd
extension [1] based on this patch which should benefit from the cmdrop
method, but it doesn't test it yet either).


I think I did what I could, so I'm passing this patch to the committers
for review. I've changed the status to "Ready for committer".



[1] https://github.com/postgrespro/jsonbd

--
Ildar Musin
i.mu...@postgrespro.ru



Re: [HACKERS] Custom compression methods

2018-01-26 Thread Ildar Musin

Hello Ildus,

I continue reviewing your patch. Here are some thoughts.

1. When I set the column storage to EXTERNAL, I cannot set compression,
which seems reasonable:
create table test(id serial, msg text);
alter table test alter column msg set storage external;
alter table test alter column msg set compression pg_lz4;
ERROR:  storage for "msg" should be MAIN or EXTENDED

But if I reorder the commands, it is allowed:
create table test(id serial, msg text);
alter table test alter column msg set compression pg_lz4;
alter table test alter column msg set storage external;
\d+ test
Table "public.test"
 Column |  Type   |  ...  | Storage  | Compression
+-+  ...  +--+-
 id | integer |  ...  | plain|
 msg| text|  ...  | external | pg_lz4

So we should either allow the user to set compression even when storage
is EXTERNAL (perhaps with a warning), or prohibit setting compression and
external storage at the same time. The same applies to setting storage
PLAIN.
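
For illustration, a sketch of what the prohibiting variant could look like
(the error wording here is hypothetical, not taken from the patch):

create table test(id serial, msg text);
alter table test alter column msg set compression pg_lz4;
alter table test alter column msg set storage external;
ERROR:  cannot set storage to EXTERNAL for compressed column "msg"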



2. I think the TOAST_COMPRESS_SET_RAWSIZE macro could be rewritten as
follows to prevent overwriting the higher bits of 'info':

((toast_compress_header *) (ptr))->info = \
	(((toast_compress_header *) (ptr))->info & ~RAWSIZEMASK) | (len);

It may not matter at the moment since it is only used once, but it could
save some effort for other developers in the future.
In TOAST_COMPRESS_SET_CUSTOM(), instead of changing individual bits you
may do something like this:

#define TOAST_COMPRESS_SET_CUSTOM(ptr) \
do { \
	((toast_compress_header *) (ptr))->info = \
		(((toast_compress_header *) (ptr))->info & RAWSIZEMASK) | \
		((uint32) 0x02 << 30); \
} while (0)

Also, it would be nice if the bit flags were explained and maybe replaced
by named macros.


3. In AlteredTableInfo, BulkInsertStateData and some functions (e.g.
toast_insert_or_update) a hash table is used to keep the list of preserved
compression methods per attribute. I think a simple array of List*
would be sufficient in this case.


4. In optionListToArray() you can use list_qsort() to sort the options
list instead of converting it manually into an array and then back into a
list.



5. Redundant #includes:

In heap.c:
#include "access/reloptions.h"
In tsvector.c:
#include "catalog/pg_type.h"
#include "common/pg_lzcompress.h"
In relcache.c:
#include "utils/datum.h"

6. Just a minor thing: there's no reason to change the formatting in copy.c:
-   heap_insert(resultRelInfo->ri_RelationDesc, tuple, mycid,
-   hi_options, bistate);
+   heap_insert(resultRelInfo->ri_RelationDesc, tuple,
+   mycid, hi_options, bistate);

7. Also, in utility.c an extra newline was added which isn't relevant to
this patch.


8. In parse_utilcmd.h the 'extern' keyword was removed from the
transformRuleStmt declaration, which doesn't make sense in this patch.


9. Comments. Again, they should be read by a native speaker. So just a 
few suggestions:

toast_prepare_varlena() - comment needed
invalidate_amoptions_cache() - comment format doesn't match other 
functions in the file


In htup_details.h:
/* tuple contain custom compressed
 * varlenas */
should be "contains"

--
Ildar Musin
i.mu...@postgrespro.ru



Re: [HACKERS] Custom compression methods

2018-01-25 Thread Ildar Musin

Hello Ildus,

On 23.01.2018 16:04, Ildus Kurbangaliev wrote:

On Mon, 22 Jan 2018 23:26:31 +0300
Ildar Musin  wrote:

Thanks for the review! Attached a new version of the patch. Fixed a few
bugs, added more documentation, and rebased to the current master.


You need to rebase to the latest master, there are some conflicts.
I've applied it to a three-day-old master to try it.


Done.



As far as I can see, the documentation is not yet complete. For example, there
is no section for ALTER COLUMN ... SET COMPRESSION in ddl.sgml; and
section "Compression Access Method Functions" in compression-am.sgml
hasn't been finished.


Not sure about ddl.sgml, since it covers more general things; and given
that postgres ships only pglz by default, there is not much to show.



I've implemented an extension [1] to understand the way a developer
would work with the new infrastructure, and to me it seems clear.
(Except that it took me some effort to wrap my mind around the varlena
macros, but that is probably a different topic.)

I noticed that you haven't covered 'cmdrop' in the regression tests, and
I saw the previous discussion about it. Have you considered using
event triggers to handle the drop of column compression instead of
the 'cmdrop' function? This way you would kill two birds with one stone:
it still provides sufficient infrastructure to catch those events
(and it is something postgres already has for different kinds of DDL
commands), and it would be easier to test.


I have added support for event triggers for ALTER SET COMPRESSION in
the current version. An event trigger on ALTER could be used to replace
the cmdrop function, but it would be far from trivial: there is no easy
way to tell that an attribute's compression is really being dropped by
the command.



I've encountered unexpected behavior with the 'CREATE TABLE ... (LIKE
...)' command. It seems to copy the compression settings of the table
attributes no matter which INCLUDING options are specified. E.g.


create table xxx(id serial, msg text compression pg_lz4);
alter table xxx alter column msg set storage external;
\d+ xxx
Table "public.xxx"
 Column |  Type   |  ...  | Storage  | Compression |
+-+  ...  +--+-+
 id | integer |  ...  | plain| |
 msg| text|  ...  | external | pg_lz4  |

Now copy the table structure with "INCLUDING ALL":

create table yyy (like xxx including all);
\d+ yyy
Table "public.yyy"
 Column |  Type   |  ...  | Storage  | Compression |
+-+  ...  +--+-+
 id | integer |  ...  | plain| |
 msg| text|  ...  | external | pg_lz4  |

And now copy without "INCLUDING ALL":

create table zzz (like xxx);
\d+ zzz
Table "public.zzz"
 Column |  Type   |  ...  | Storage  | Compression |
+-+  ...  +--+-+
 id | integer |  ...  | plain| |
 msg| text|  ...  | extended | pg_lz4  |

As you can see, the compression option is copied anyway. I suggest adding
a new INCLUDING COMPRESSION option to let the user explicitly specify
whether or not they want to copy compression settings.
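
For example, with the proposed option (it does not exist yet, so this is
only a sketch of the suggested syntax):

-- copy the structure together with compression settings
create table yyy (like xxx including compression);

-- copy the structure only; the column falls back to the default pglz
create table zzz (like xxx);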



I found a few phrases in documentation that can be improved. But the 
documentation should be checked by a native speaker.


In compression-am.sgml:
"an compression access method" -> "a compression access method"
"compression method method" -> "compression method"
"compability" -> "compatibility"
Probably "local-backend cached state" would be better to replace with 
"per backend cached state"?
"Useful to store the parsed view of the compression options" -> "It 
could be useful for example to cache compression options"

"and stores result of" -> "and stores the result of"
"Called when CompressionAmOptions is creating." -> "Called when 
CompressionAmOptions is being initialized"


"Note that in any system cache invalidation related with 
pg_attr_compression relation the options will be cleaned" -> "Note that 
any pg_attr_compression relation invalidation will 
cause all the cached acstate options cleared."

"Function used to ..." -> "Function is used to ..."

I think it would be nice to mention custom compression methods in 
storage.sgml. At this moment it only mentions built-in pglz compression.


--
Ildar Musin
i.mu...@postgrespro.ru



Re: [HACKERS] Custom compression methods

2018-01-22 Thread Ildar Musin
Hello Ildus,


On 15/01/2018 00:49, Ildus Kurbangaliev wrote:
> Attached a new version of the patch. Main changes:
>
> * compression as an access method
> * pglz as default compression access method.
> * PRESERVE syntax for tables rewrite control.
> * pg_upgrade fixes
> * support partitioned tables.
> * more tests.
>
You need to rebase to the latest master, there are some conflicts. I've
applied it to a three-day-old master to try it.

As far as I can see, the documentation is not yet complete. For example, there
is no section for ALTER COLUMN ... SET COMPRESSION in ddl.sgml; and
section "Compression Access Method Functions" in compression-am.sgml
hasn't been finished.

I've implemented an extension [1] to understand the way a developer would
work with the new infrastructure, and to me it seems clear. (Except
that it took me some effort to wrap my mind around the varlena macros,
but that is probably a different topic.)

I noticed that you haven't covered 'cmdrop' in the regression tests, and I
saw the previous discussion about it. Have you considered using event
triggers to handle the drop of column compression instead of the 'cmdrop'
function? This way you would kill two birds with one stone: it still
provides sufficient infrastructure to catch those events (and it is
something postgres already has for different kinds of DDL commands), and
it would be easier to test.
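
Just to illustrate the idea, a minimal sketch using a plain ddl_command_end
trigger (the cleanup logic itself is hypothetical and would be
extension-specific):

create function compression_cleanup() returns event_trigger
language plpgsql as $$
declare
    cmd record;
begin
    for cmd in select * from pg_event_trigger_ddl_commands() loop
        -- an extension could check here whether its compression method
        -- was detached from a column and release its per-column state
        raise notice 'caught %: %', cmd.command_tag, cmd.object_identity;
    end loop;
end;
$$;

create event trigger compression_cleanup_trigger
    on ddl_command_end
    when tag in ('ALTER TABLE')
    execute procedure compression_cleanup();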

Thanks!

[1] https://github.com/zilder/pg_lz4

-- 
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company 




Re: [HACKERS] Custom compression methods

2017-12-18 Thread Robert Haas
On Mon, Dec 18, 2017 at 10:43 AM, Tomas Vondra
 wrote:
> I personally am not quite convinced about that, for the reason I tried
> to explain in my previous messages. I see it as a poor alternative to
> compression built into the data type. I do like the idea of compression
> with external dictionary, however.

I think that compression built into the datatype and what is proposed
here are both useful and everybody's free to work on either one as they
prefer, so I don't see that as a reason not to accept this patch.  And
I think this patch can be a stepping stone toward compression with an
external dictionary, so that seems like an affirmative reason to
accept this patch.

> But don't forget that it's not me in this thread - it's my evil twin,
> moonlighting as Mr. Devil's lawyer ;-)

Well, I don't mind you objecting to the patch under any persona, but
so far I'm not finding your reasons convincing...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-12-18 Thread Tomas Vondra


On 12/17/2017 04:32 AM, Robert Haas wrote:
> On Thu, Dec 14, 2017 at 12:23 PM, Tomas Vondra
>  wrote:
>> Can you give an example of such an algorithm? Because I haven't seen such
>> an example, and I find arguments based on hypothetical compression methods
>> somewhat suspicious.
>>
>> FWIW I'm not against considering such compression methods, but OTOH it
>> may not be such a great primary use case to drive the overall design.
> 
> Well it isn't, really.  I am honestly not sure what we're arguing
> about at this point.  I think you've agreed that (1) opening avenues
> for extensibility is useful, (2) substituting a general-purpose
> compression algorithm could be useful, and (3) having datatype
> compression that is enabled through TOAST rather than built into the
> datatype might sometimes be desirable.  That's more than adequate
> justification for this proposal, whether half-general compression
> methods exist or not.  I am prepared to concede that there may be no
> useful examples of such a thing.
> 

I don't think we're arguing - we're discussing whether a proposed patch is
the right design for solving the relevant use cases.

I personally am not quite convinced about that, for the reason I tried
to explain in my previous messages. I see it as a poor alternative to
compression built into the data type. I do like the idea of compression
with external dictionary, however.

But don't forget that it's not me in this thread - it's my evil twin,
moonlighting as Mr. Devil's lawyer ;-)

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-18 Thread Ildus Kurbangaliev
On Thu, 14 Dec 2017 10:29:10 -0500
Robert Haas  wrote:

> On Wed, Dec 13, 2017 at 7:18 AM, Ildus Kurbangaliev
>  wrote:
> > Since we agreed on the ALTER syntax, I want to clarify things about
> > CREATE. Should it be CREATE ACCESS METHOD .. TYPE COMPRESSION or
> > CREATE COMPRESSION METHOD? I like the access method approach, and it
> > simplifies the code, but I'm just not sure whether compression is an
> > access method or not.
> 
> +1 for ACCESS METHOD.

An access method then.

> 
> > Current implementation
> > --
> >
> > To avoid extra patches I also want to clarify things about the current
> > implementation. Right now there are two tables, "pg_compression" and
> > "pg_compression_opt". When a compression method is linked to a column,
> > it creates a record in pg_compression_opt. This record's Oid is
> > stored in the varlena. These Oids are kept in the first column so I can
> > move them in pg_upgrade, but in all other aspects they behave like
> > usual Oids. Also, it's easy to restore them.
> 
> pg_compression_opt -> pg_attr_compression, maybe.
> 
> > Compression options are linked to a specific column. When a tuple is
> > moved between relations it will be decompressed.
> 
> Can we do this only if the compression method isn't OK for the new
> column?  For example, if the old column is COMPRESS foo PRESERVE bar
> and the new column is COMPRESS bar PRESERVE foo, we don't need to
> force decompression in any case.

Thanks, sounds right, I will add it to the patch.

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [HACKERS] Custom compression methods

2017-12-16 Thread Robert Haas
On Thu, Dec 14, 2017 at 12:23 PM, Tomas Vondra
 wrote:
> Can you give an example of such an algorithm? Because I haven't seen such
> an example, and I find arguments based on hypothetical compression methods
> somewhat suspicious.
>
> FWIW I'm not against considering such compression methods, but OTOH it
> may not be such a great primary use case to drive the overall design.

Well it isn't, really.  I am honestly not sure what we're arguing
about at this point.  I think you've agreed that (1) opening avenues
for extensibility is useful, (2) substituting a general-purpose
compression algorithm could be useful, and (3) having datatype
compression that is enabled through TOAST rather than built into the
datatype might sometimes be desirable.  That's more than adequate
justification for this proposal, whether half-general compression
methods exist or not.  I am prepared to concede that there may be no
useful examples of such a thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-12-14 Thread Tomas Vondra
On 12/14/2017 04:21 PM, Robert Haas wrote:
> On Wed, Dec 13, 2017 at 5:10 AM, Tomas Vondra
>  wrote:
>>> 2. If several data types can benefit from a similar approach, it has
>>> to be separately implemented for each one.
>>
>> I don't think the current solution improves that, though. If you
>> want to exploit internal features of individual data types, it
>> pretty much requires code customized to every such data type.
>>
>> For example you can't take the tsvector compression and just slap
>> it on tsquery, because it relies on knowledge of internal tsvector
>> structure. So you need separate implementations anyway.
> 
> I don't think that's necessarily true. Certainly, it's true that
> *if* tsvector compression depends on knowledge of internal tsvector 
> structure, *then* you can't use the implementation for anything
> else (this, by the way, means that there needs to be some way for a 
> compression method to reject being applied to a column of a data
> type it doesn't like).

I believe such a dependency (on implementation details) is pretty much the
main benefit of datatype-aware compression methods. If you don't rely on
such an assumption, then I'd say it's a general-purpose compression method.

> However, it seems possible to imagine compression algorithms that can
> work for a variety of data types, too. There might be a compression
> algorithm that is theoretically a general-purpose algorithm but has
> features which are particularly well-suited to, say, JSON or XML
> data, because it looks for word boundaries to decide on what strings
> to insert into the compression dictionary.
> 

Can you give an example of such an algorithm? Because I haven't seen such
an example, and I find arguments based on hypothetical compression methods
somewhat suspicious.

FWIW I'm not against considering such compression methods, but OTOH it
may not be such a great primary use case to drive the overall design.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-14 Thread Robert Haas
On Wed, Dec 13, 2017 at 7:18 AM, Ildus Kurbangaliev
 wrote:
> Since we agreed on the ALTER syntax, I want to clarify things about CREATE.
> Should it be CREATE ACCESS METHOD .. TYPE COMPRESSION or CREATE
> COMPRESSION METHOD? I like the access method approach, and it
> simplifies the code, but I'm just not sure whether compression is an access
> method or not.

+1 for ACCESS METHOD.

> Current implementation
> --
>
> To avoid extra patches I also want to clarify things about the current
> implementation. Right now there are two tables, "pg_compression" and
> "pg_compression_opt". When a compression method is linked to a column it
> creates a record in pg_compression_opt. This record's Oid is stored in
> the varlena. These Oids are kept in the first column so I can move them in
> pg_upgrade, but in all other aspects they behave like usual Oids. Also,
> it's easy to restore them.

pg_compression_opt -> pg_attr_compression, maybe.

> Compression options are linked to a specific column. When a tuple is
> moved between relations it will be decompressed.

Can we do this only if the compression method isn't OK for the new
column?  For example, if the old column is COMPRESS foo PRESERVE bar
and the new column is COMPRESS bar PRESERVE foo, we don't need to
force decompression in any case.

> Also, in the current implementation SET COMPRESSION accepts a WITH clause
> which is used to provide extra options to the compression method.

Hmm, that's an alternative to using reloptions.  Maybe that's fine.

> What could be changed
> -
>
> As Alvaro mentioned, COMPRESSION METHOD is practically an access method,
> so it could be created as CREATE ACCESS METHOD .. TYPE COMPRESSION.
> This approach simplifies the patch, and the "pg_compression" table could
> be removed. So a compression method would be created with something like:
>
> CREATE ACCESS METHOD .. TYPE COMPRESSION HANDLER
> awesome_compression_handler;
>
> The syntax of SET COMPRESSION changes to SET COMPRESSION .. PRESERVE,
> which is useful to control rewrites and lets pg_upgrade create
> dependencies between moved compression options and compression methods
> from the pg_am table.
>
> The default compression is always pglz, and if users want to change it
> they run:
>
> ALTER COLUMN  SET COMPRESSION awesome PRESERVE pglz;
>
> Without PRESERVE it will rewrite the whole relation using the new
> compression. Also, the rewrite removes all unlisted compression options,
> so their compression methods can be safely dropped.

That all sounds good.

> "pg_compression_opt" table could be renamed to "pg_compression", and
> compression options will be stored there.

See notes above.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-12-14 Thread Robert Haas
On Wed, Dec 13, 2017 at 5:10 AM, Tomas Vondra
 wrote:
>> 2. If several data types can benefit from a similar approach, it has
>> to be separately implemented for each one.
>
> I don't think the current solution improves that, though. If you want to
> exploit internal features of individual data types, it pretty much
> requires code customized to every such data type.
>
> For example you can't take the tsvector compression and just slap it on
> tsquery, because it relies on knowledge of internal tsvector structure.
> So you need separate implementations anyway.

I don't think that's necessarily true.  Certainly, it's true that *if*
tsvector compression depends on knowledge of internal tsvector
structure, *then* you can't use the implementation for anything
else (this, by the way, means that there needs to be some way for a
compression method to reject being applied to a column of a data type
it doesn't like).  However, it seems possible to imagine compression
algorithms that can work for a variety of data types, too.  There
might be a compression algorithm that is theoretically a
general-purpose algorithm but has features which are particularly
well-suited to, say, JSON or XML data, because it looks for word
boundaries to decide on what strings to insert into the compression
dictionary.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-12-13 Thread Tomas Vondra


On 12/13/2017 05:55 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
> 
>> On 12/13/2017 01:54 AM, Robert Haas wrote:
> 
>>> 3. Compression is only applied to large-ish values.  If you are just
>>> making the data type representation more compact, you probably want to
>>> apply the new representation to all values.  If you are compressing in
>>> the sense that the original data gets smaller but harder to interpret,
>>> then you probably only want to apply the technique where the value is
>>> already pretty wide, and maybe respect the user's configured storage
>>> attributes.  TOAST knows about some of that kind of stuff.
>>
>> Good point. One such parameter that I really miss is compression level.
>> I can imagine tuning it through CREATE COMPRESSION METHOD, but it does
>> not seem quite possible with compression happening in a datatype.
> 
> Hmm, actually isn't that the sort of thing that you would tweak using a
> column-level option instead of a compression method?
>   ALTER TABLE ALTER COLUMN SET (compression_level=123)
> The only thing we need for this is to make tuptoaster.c aware of the
> need to check for a parameter.
> 

Wouldn't that require some universal compression level, shared by all
supported compression algorithms? I don't think there is such a thing.

Defining it should not be extremely difficult, although I'm sure there
will be some cumbersome cases. For example what if an algorithm "a"
supports compression levels 0-10, and algorithm "b" only supports 0-3?

You may define 11 "universal" compression levels, and map the four
levels for "b" to that (how). But then everyone has to understand how
that "universal" mapping is defined.

Another issue is that there are algorithms without a compression level
(e.g. pglz does not have one, AFAICS), or with a somewhat different
definition (lz4 does not have levels, and instead has "acceleration",
which may be an arbitrary positive integer, so it's not really compatible
with a "universal" compression level).

So to me the

ALTER TABLE ALTER COLUMN SET (compression_level=123)

seems more like an unnecessary hurdle ...

>>> I don't think TOAST needs to be entirely transparent for the
>>> datatypes.  We've already dipped our toe in the water by allowing some
>>> operations on "short" varlenas, and there's really nothing to prevent
>>> a given datatype from going further.  The OID problem you mentioned
>>> would presumably be solved by hard-coding the OIDs for any built-in,
>>> privileged compression methods.
>>
>> Stupid question, but what do you mean by "short" varlenas?
> 
> Those are varlenas with 1-byte header rather than the standard 4-byte
> header.
> 

OK, that's what I thought. But that is still pretty transparent to the
data types, no?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-13 Thread Ildus Kurbangaliev
On Tue, 12 Dec 2017 15:52:01 -0500
Robert Haas  wrote:

> 
> Yes.  I wonder if \d or \d+ can show it somehow.
> 

Yes, in the current version of the patch, \d+ shows the current
compression. It can be extended to show a list of the current compression
methods.

Since we agreed on the ALTER syntax, I want to clarify things about CREATE.
Should it be CREATE ACCESS METHOD .. TYPE COMPRESSION or CREATE
COMPRESSION METHOD? I like the access method approach, and it
simplifies the code, but I'm just not sure whether compression is an access
method or not.

Current implementation
--

To avoid extra patches I also want to clarify things about the current
implementation. Right now there are two tables, "pg_compression" and
"pg_compression_opt". When a compression method is linked to a column it
creates a record in pg_compression_opt. This record's Oid is stored in
the varlena. These Oids are kept in the first column so I can move them in
pg_upgrade, but in all other aspects they behave like usual Oids. Also,
it's easy to restore them.

Compression options are linked to a specific column. When a tuple is
moved between relations it will be decompressed.

Also, in the current implementation SET COMPRESSION accepts a WITH clause
which is used to provide extra options to the compression method.

What could be changed
-

As Alvaro mentioned, COMPRESSION METHOD is practically an access method,
so it could be created as CREATE ACCESS METHOD .. TYPE COMPRESSION.
This approach simplifies the patch, and the "pg_compression" table could
be removed. So a compression method would be created with something like:

CREATE ACCESS METHOD .. TYPE COMPRESSION HANDLER
awesome_compression_handler;

The syntax of SET COMPRESSION changes to SET COMPRESSION .. PRESERVE,
which is useful to control rewrites and lets pg_upgrade create
dependencies between moved compression options and compression methods
from the pg_am table.

The default compression is always pglz, and if users want to change it
they run:

ALTER COLUMN  SET COMPRESSION awesome PRESERVE pglz;

Without PRESERVE it will rewrite the whole relation using the new
compression. Also, the rewrite removes all unlisted compression options,
so their compression methods can be safely dropped.

"pg_compression_opt" table could be renamed to "pg_compression", and
compression options will be stored there.

I'd like to keep extra compression options; for example, pglz can be
configured with them. The syntax would change slightly:

SET COMPRESSION pglz WITH (min_comp_rate=25) PRESERVE awesome;

Setting the same compression method with different options will create a
new compression options record for future tuples but will not
rewrite the table.
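
To sum up, the whole flow under this proposal would look roughly like the
following (the "awesome" method and its handler are placeholders, of
course):

CREATE ACCESS METHOD awesome TYPE COMPRESSION
    HANDLER awesome_compression_handler;

-- switch a column incrementally, keeping old pglz values readable
ALTER TABLE t ALTER COLUMN msg SET COMPRESSION awesome PRESERVE pglz;

-- or force a full rewrite, so the old method can later be dropped
ALTER TABLE t ALTER COLUMN msg SET COMPRESSION awesome;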

-- 

Regards,
Ildus Kurbangaliev



Re: [HACKERS] Custom compression methods

2017-12-12 Thread Robert Haas
On Tue, Dec 12, 2017 at 5:07 PM, Tomas Vondra
 wrote:
>> I definitely think there's a place for compression built right into
>> the data type.  I'm still happy about commit
>> 145343534c153d1e6c3cff1fa1855787684d9a38 -- although really, more
>> needs to be done there.  But that type of improvement and what is
>> proposed here are basically orthogonal.  Having either one is good;
>> having both is better.
>>
> Why orthogonal?

I mean, they are different things.  Data types are already free to
invent more compact representations, and that does not preclude
applying pglz to the result.

> For example, why couldn't (or shouldn't) the tsvector compression be
> done by tsvector code itself? Why should we be doing that at the varlena
> level (so that the tsvector code does not even know about it)?

We could do that, but then:

1. The compression algorithm would be hard-coded into the system
rather than changeable.  Pluggability has some value.

2. If several data types can benefit from a similar approach, it has
to be separately implemented for each one.

3. Compression is only applied to large-ish values.  If you are just
making the data type representation more compact, you probably want to
apply the new representation to all values.  If you are compressing in
the sense that the original data gets smaller but harder to interpret,
then you probably only want to apply the technique where the value is
already pretty wide, and maybe respect the user's configured storage
attributes.  TOAST knows about some of that kind of stuff.

> It seems to me the main reason is that tsvector actually does not allow
> us to do that, as there's no good way to distinguish the different
> internal format (e.g. by storing a flag or format version in some sort
> of header, etc.).

That is also a potential problem, although I suspect it is possible to
work around it somehow for most data types.  It might be annoying,
though.

>> I think there may also be a place for declaring that a particular data
>> type has a "privileged" type of TOAST compression; if you use that
>> kind of compression for that data type, the data type will do smart
>> things, and if not, it will have to decompress in more cases.  But I
>> think this infrastructure makes that kind of thing easier, not harder.
>
> I don't quite understand how that would be done. Isn't TOAST meant to be
> entirely transparent for the datatypes? I can imagine custom TOAST
> compression (which is pretty much what the patch does, after all), but I
> don't see how the datatype could do anything smart about it, because it
> has no idea which particular compression was used. And considering the
> OIDs of the compression methods do change, I'm not sure that's fixable.

I don't think TOAST needs to be entirely transparent for the
datatypes.  We've already dipped our toe in the water by allowing some
operations on "short" varlenas, and there's really nothing to prevent
a given datatype from going further.  The OID problem you mentioned
would presumably be solved by hard-coding the OIDs for any built-in,
privileged compression methods.

> Well, it wasn't my goal to suddenly widen the scope of the patch and
> require it to add all these pieces. My intent was more to point to pieces
> that need to be filled in the future.

Sure, that's fine.  I'm not worked up about this, just explaining why
it seems reasonably well-designed to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-12-12 Thread Chapman Flack
On 12/12/2017 04:33 PM, Robert Haas wrote:

> you want to upgrade your existing tables to use lz4 rather than pglz,
> you can change the compression option for those columns to COMPRESS
> lz4 PRESERVE pglz if you want to do it incrementally or just COMPRESS

This is a thread I've only been following peripherally, so forgive
a question that's probably covered somewhere upthread: how will this
be done? Surely not with compression-type bits in each tuple? By
remembering a txid where the compression was changed, and the former
algorithm for older txids?

-Chap



Re: [HACKERS] Custom compression methods

2017-12-12 Thread Tomas Vondra


On 12/12/2017 10:33 PM, Robert Haas wrote:
> On Mon, Dec 11, 2017 at 2:53 PM, Tomas Vondra
>  wrote:
>> But let me play the devil's advocate for a while and question the
>> usefulness of this approach to compression. Some of the questions were
>> mentioned in the thread before, but I don't think they got the attention
>> they deserve.
> 
> Sure, thanks for chiming in.  I think it is good to make sure we are
> discussing this stuff.
> 
>> But perhaps we should simply make it an initdb option (in which case the
>> whole cluster would simply use e.g. lz4 instead of pglz)?
>>
>> That seems like a much simpler approach - it would only require some
>> ./configure options to add --with-lz4 (and other compression libraries),
>> an initdb option to pick compression algorithm, and probably noting the
>> choice in cluster controldata.
>>
>> No dependencies tracking, no ALTER TABLE issues, etc.
>>
>> Of course, it would not allow using different compression algorithms for
>> different columns (although it might perhaps allow different compression
>> level, to some extent).
>>
>> Conclusion: If we want to offer a simple cluster-wide pglz alternative,
>> perhaps this patch is not the right way to do that.
> 
> I actually disagree with your conclusion here.   I mean, if you do it
> that way, then it has the same problem as checksums: changing
> compression algorithms requires a full dump-and-reload of the
> database, which makes it more or less a non-starter for large
> databases.  On the other hand, with the infrastructure provided by
> this patch, we can have a default_compression_method GUC that will be
> set to 'pglz' initially.  If the user changes it to 'lz4', or we ship
> a new release where the new default is 'lz4', then new tables created
> will use that new setting, but the existing stuff keeps working.  If
> you want to upgrade your existing tables to use lz4 rather than pglz,
> you can change the compression option for those columns to COMPRESS
> lz4 PRESERVE pglz if you want to do it incrementally or just COMPRESS
> lz4 to force a rewrite of an individual table.  That's really
> powerful, and I think users will like it a lot.
> 
> In short, your approach, while perhaps a little simpler to code, seems
> like it is fraught with operational problems which this design avoids.
> 

I agree the checksum-like limitations are annoying and make it
impossible to change the compression algorithm after the cluster is
initialized (although I recall a discussion about addressing that).

So yeah, if such flexibility is considered valuable/important, then the
patch is a better solution.

>> Custom datatype-aware compression (e.g. the tsvector)
>> --
>>
>> Exploiting knowledge of the internal data type structure is a promising
>> way to improve compression ratio and/or performance.
>>
>> The obvious question of course is why shouldn't this be done by the data
>> type code directly, which would also allow additional benefits like
>> operating directly on the compressed values.
>>
>> Another thing is that if the datatype representation changes in some
>> way, the compression method has to change too. So it's tightly coupled
>> to the datatype anyway.
>>
>> This does not really require any new infrastructure, all the pieces are
>> already there.
>>
>> In some cases that may not be quite possible - the datatype may not be
>> flexible enough to support alternative (compressed) representation, e.g.
>> because there are no bits available for "compressed" flag, etc.
>>
>> Conclusion: IMHO if we want to exploit the knowledge of the data type
>> internal structure, perhaps doing that in the datatype code directly
>> would be a better choice.
> 
> I definitely think there's a place for compression built right into
> the data type.  I'm still happy about commit
> 145343534c153d1e6c3cff1fa1855787684d9a38 -- although really, more
> needs to be done there.  But that type of improvement and what is
> proposed here are basically orthogonal.  Having either one is good;
> having both is better.
> 

Why orthogonal?

For example, why couldn't (or shouldn't) the tsvector compression be
done by tsvector code itself? Why should we be doing that at the varlena
level (so that the tsvector code does not even know about it)?

For example we could make the datatype EXTERNAL and do the compression
on our own, using a custom algorithm. Of course, that would require
datatype-specific implementation, but tsvector_compress does that too.

It seems to me the main reason is that tsvector actually does not allow
us to do that, as there's no good way to distinguish the different
internal format (e.g. by storing a flag or format version in some sort
of header, etc.).

> I think there may also be a place for declaring that a particular data
> type has a "privileged" type of TOAST compression; if you use that
> kind of compression for that data type, the data type will do smart
> things, and if not, it will have to decompress in more cases.  But I
> think this infrastructure makes that kind of thing easier, not harder.

Re: [HACKERS] Custom compression methods

2017-12-12 Thread Robert Haas
On Mon, Dec 11, 2017 at 2:53 PM, Tomas Vondra
 wrote:
> But let me play the devil's advocate for a while and question the
> usefulness of this approach to compression. Some of the questions were
> mentioned in the thread before, but I don't think they got the attention
> they deserve.

Sure, thanks for chiming in.  I think it is good to make sure we are
discussing this stuff.

> But perhaps we should simply make it an initdb option (in which case the
> whole cluster would simply use e.g. lz4 instead of pglz)?
>
> That seems like a much simpler approach - it would only require some
> ./configure options to add --with-lz4 (and other compression libraries),
> an initdb option to pick compression algorithm, and probably noting the
> choice in cluster controldata.
>
> No dependencies tracking, no ALTER TABLE issues, etc.
>
> Of course, it would not allow using different compression algorithms for
> different columns (although it might perhaps allow different compression
> level, to some extent).
>
> Conclusion: If we want to offer a simple cluster-wide pglz alternative,
> perhaps this patch is not the right way to do that.

I actually disagree with your conclusion here.   I mean, if you do it
that way, then it has the same problem as checksums: changing
compression algorithms requires a full dump-and-reload of the
database, which makes it more or less a non-starter for large
databases.  On the other hand, with the infrastructure provided by
this patch, we can have a default_compression_method GUC that will be
set to 'pglz' initially.  If the user changes it to 'lz4', or we ship
a new release where the new default is 'lz4', then new tables created
will use that new setting, but the existing stuff keeps working.  If
you want to upgrade your existing tables to use lz4 rather than pglz,
you can change the compression option for those columns to COMPRESS
lz4 PRESERVE pglz if you want to do it incrementally or just COMPRESS
lz4 to force a rewrite of an individual table.  That's really
powerful, and I think users will like it a lot.
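
Concretely, the migration might look roughly like this (using the SET
COMPRESSION spelling from the patch; the default_compression_method GUC
and an lz4 method are hypothetical at this point):

-- new tables start picking up lz4
SET default_compression_method = 'lz4';

-- existing column: incremental switch, old pglz values stay readable
ALTER TABLE t ALTER COLUMN payload SET COMPRESSION lz4 PRESERVE pglz;

-- or force an immediate rewrite of the table
ALTER TABLE t ALTER COLUMN payload SET COMPRESSION lz4;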

In short, your approach, while perhaps a little simpler to code, seems
like it is fraught with operational problems which this design avoids.

> Custom datatype-aware compression (e.g. the tsvector)
> --
>
> Exploiting knowledge of the internal data type structure is a promising
> way to improve compression ratio and/or performance.
>
> The obvious question of course is why shouldn't this be done by the data
> type code directly, which would also allow additional benefits like
> operating directly on the compressed values.
>
> Another thing is that if the datatype representation changes in some
> way, the compression method has to change too. So it's tightly coupled
> to the datatype anyway.
>
> This does not really require any new infrastructure, all the pieces are
> already there.
>
> In some cases that may not be quite possible - the datatype may not be
> flexible enough to support alternative (compressed) representation, e.g.
> because there are no bits available for "compressed" flag, etc.
>
> Conclusion: IMHO if we want to exploit the knowledge of the data type
> internal structure, perhaps doing that in the datatype code directly
> would be a better choice.

I definitely think there's a place for compression built right into
the data type.  I'm still happy about commit
145343534c153d1e6c3cff1fa1855787684d9a38 -- although really, more
needs to be done there.  But that type of improvement and what is
proposed here are basically orthogonal.  Having either one is good;
having both is better.

I think there may also be a place for declaring that a particular data
type has a "privileged" type of TOAST compression; if you use that
kind of compression for that data type, the data type will do smart
things, and if not, it will have to decompress in more cases.  But I
think this infrastructure makes that kind of thing easier, not harder.

> Custom datatype-aware compression with additional column-specific
> metadata (e.g. the jsonb with external dictionary).
> --
>
> Exploiting redundancy in multiple values in the same column (instead of
> compressing them independently) is another attractive way to help the
> compression. It is inherently datatype-aware, but currently can't be
> implemented directly in datatype code as there's no concept of
> column-specific storage (e.g. to store dictionary shared by all values
> in a particular column).
>
> I believe any patch addressing this use case would have to introduce
> such column-specific storage, and any solution doing that would probably
> need to introduce the same catalogs, etc.
>
> The obvious disadvantage of course is that we need to decompress the
> varlena value before doing pretty much anything with it, because the
> datatype is not aware of the compression.
>
> So I wonder if the patch should instead provide infrastructure for
> doing that in the datatype code directly.

Re: [HACKERS] Custom compression methods

2017-12-12 Thread Robert Haas
On Mon, Dec 11, 2017 at 1:06 PM, Alexander Korotkov
 wrote:
> OK, but a NOTICE that a presumably unexpected table rewrite takes place
> could still be useful.

I'm not going to complain too much about that, but I think that's
mostly a failure of expectation rather than a real problem.  If the
documentation says what the user should expect, and they expect
something else, tough luck for them.

> Also, we should probably add some view that exposes the compression methods
> which are currently preserved for columns, so that the user can correctly
> construct a SET COMPRESSION query that doesn't rewrite the table, without
> digging into internals (like directly querying pg_depend).

Yes.  I wonder if \d or \d+ can show it somehow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-12-12 Thread Oleg Bartunov
On Tue, Dec 12, 2017 at 6:07 PM, Alexander Korotkov
 wrote:
> Hi!
>
> Let me add my two cents too.
>
> On Tue, Dec 12, 2017 at 2:41 PM, Ildus Kurbangaliev
>  wrote:
>>
>> On Mon, 11 Dec 2017 20:53:29 +0100 Tomas Vondra
>>  wrote:
>> > Replacing the algorithm used to compress all varlena values (in a way
>> > that makes it transparent for the data type code).
>> > --
>> >
>> > While pglz served us well over time, it was repeatedly mentioned that
>> > in some cases it becomes the bottleneck. So supporting other state of
>> > the art compression algorithms seems like a good idea, and this patch
>> > is one way to do that.
>> >
>> > But perhaps we should simply make it an initdb option (in which case
>> > the whole cluster would simply use e.g. lz4 instead of pglz)?
>> >
>> > That seems like a much simpler approach - it would only require some
>> > ./configure options to add --with-lz4 (and other compression
>> > libraries), an initdb option to pick compression algorithm, and
>> > probably noting the choice in cluster controldata.
>>
>> Replacing pglz for all varlena values wasn't the goal of the patch, but
>> it's possible to do with it and I think that's good. And as Robert
>> mentioned pglz could appear as a builtin undroppable compression method
>> so the others could be added too. And in the future it can open the
>> ways to specify compression for specific database or cluster.
>
>
> Yes, usage of custom compression methods to replace generic non
> type-specific compression method is really not the primary goal of this
> patch.  However, I would consider that as useful side effect.  However, even
> in this case I see some advantages of custom compression methods over initdb
> option.
>
> 1) In order to support alternative compression methods in initdb, we have to
> provide builtin support for them.  Then we immediately run into
> dependencies/incompatible-licenses problem.  Also, we tie appearance of new
> compression methods to our release cycle.  In real life, flexibility means a
> lot.  Giving users freedom to experiment with various compression libraries
> without having to recompile PostgreSQL core is great.
> 2) It's not necessary that users would be satisfied with applying single
> compression method to the whole database cluster.  Various columns may have
> different data distributions with different workloads.  Optimal compression
> type for one column is not necessary optimal for another column.
> 3) Possibility to change compression method on the fly without re-initdb is
> very good too.

I consider custom compression as the way towards custom TOAST. For
example, for optimal access to parts of a very long document we need to
compress slices of the document. Currently, a long jsonb document gets
compressed and then sliced, and that kills all the benefits of binary
jsonb. Also, we are thinking about "lazy" access to parts of a jsonb
from PLs, which is currently awfully inefficient.

>
>> > Custom datatype-aware compression (e.g. the tsvector)
>> > --
>> >
>> > Exploiting knowledge of the internal data type structure is a
>> > promising way to improve compression ratio and/or performance.
>> >
>> > The obvious question of course is why shouldn't this be done by the
>> > data type code directly, which would also allow additional benefits
>> > like operating directly on the compressed values.
>> >
>> > Another thing is that if the datatype representation changes in some
>> > way, the compression method has to change too. So it's tightly coupled
>> > to the datatype anyway.
>> >
>> > This does not really require any new infrastructure, all the pieces
>> > are already there.
>> >
>> > In some cases that may not be quite possible - the datatype may not be
>> > flexible enough to support alternative (compressed) representation,
>> > e.g. because there are no bits available for "compressed" flag, etc.
>> >
>> > Conclusion: IMHO if we want to exploit the knowledge of the data type
>> > internal structure, perhaps doing that in the datatype code directly
>> > would be a better choice.
>>
>> It could be, but let's imagine there will be internal compression for
>> tsvector. It means that tsvector has two formats now and minus one bit
>> somewhere in the header. After a while we found a better compression
>> but we can't add it because there is already one and it's not good to
>> have three different formats for one type. Or, the compression methods
>> were implemented and we decided to use dictionaries for tsvector (if
>> the user going to store limited number of words). But it will mean that
>> tsvector will go two compression stages (for its internal and for
>> dictionaries).
>
>
> I would like to add that even for a single datatype various compression
> methods may have different tradeoffs.  For instance, one compression
> method can have a better compression ratio, while another one has faster
> decompression.

Re: [HACKERS] Custom compression methods

2017-12-12 Thread Alexander Korotkov
Hi!

Let me add my two cents too.

On Tue, Dec 12, 2017 at 2:41 PM, Ildus Kurbangaliev <
i.kurbangal...@postgrespro.ru> wrote:

> On Mon, 11 Dec 2017 20:53:29 +0100 Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote:
> > Replacing the algorithm used to compress all varlena values (in a way
> > that makes it transparent for the data type code).
> > --
> >
> > While pglz served us well over time, it was repeatedly mentioned that
> > in some cases it becomes the bottleneck. So supporting other state of
> > the art compression algorithms seems like a good idea, and this patch
> > is one way to do that.
> >
> > But perhaps we should simply make it an initdb option (in which case
> > the whole cluster would simply use e.g. lz4 instead of pglz)?
> >
> > That seems like a much simpler approach - it would only require some
> > ./configure options to add --with-lz4 (and other compression
> > libraries), an initdb option to pick compression algorithm, and
> > probably noting the choice in cluster controldata.
>
> Replacing pglz for all varlena values wasn't the goal of the patch, but
> it's possible to do with it and I think that's good. And as Robert
> mentioned pglz could appear as a builtin undroppable compression method
> so the others could be added too. And in the future it can open the
> ways to specify compression for specific database or cluster.
>

Yes, using custom compression methods to replace the generic
non-type-specific compression method is really not the primary goal of this
patch.  However, I would consider that a useful side effect.  Even in this
case, though, I see some advantages of custom compression methods over an
initdb option.

1) In order to support alternative compression methods in initdb, we have
to provide builtin support for them.  Then we immediately run into the
dependencies/incompatible-licenses problem.  Also, we tie the appearance of
new compression methods to our release cycle.  In real life, flexibility
means a lot.  Giving users the freedom to experiment with various
compression libraries without having to recompile PostgreSQL core is great.
2) It's not necessarily the case that users would be satisfied with applying
a single compression method to the whole database cluster.  Various columns
may have different data distributions with different workloads.  The optimal
compression type for one column is not necessarily optimal for another.
3) The possibility to change the compression method on the fly without
re-initdb is very good too.

> Custom datatype-aware compression (e.g. the tsvector)
> > --
> >
> > Exploiting knowledge of the internal data type structure is a
> > promising way to improve compression ratio and/or performance.
> >
> > The obvious question of course is why shouldn't this be done by the
> > data type code directly, which would also allow additional benefits
> > like operating directly on the compressed values.
> >
> > Another thing is that if the datatype representation changes in some
> > way, the compression method has to change too. So it's tightly coupled
> > to the datatype anyway.
> >
> > This does not really require any new infrastructure, all the pieces
> > are already there.
> >
> > In some cases that may not be quite possible - the datatype may not be
> > flexible enough to support alternative (compressed) representation,
> > e.g. because there are no bits available for "compressed" flag, etc.
> >
> > Conclusion: IMHO if we want to exploit the knowledge of the data type
> > internal structure, perhaps doing that in the datatype code directly
> > would be a better choice.
>
> It could be, but let's imagine there will be internal compression for
> tsvector. It means that tsvector has two formats now and minus one bit
> somewhere in the header. After a while we found a better compression
> but we can't add it because there is already one and it's not good to
> have three different formats for one type. Or, the compression methods
> were implemented and we decided to use dictionaries for tsvector (if
> the user going to store limited number of words). But it will mean that
> tsvector will go two compression stages (for its internal and for
> dictionaries).


I would like to add that even for a single datatype various compression
methods may have different tradeoffs.  For instance, one compression method
can have a better compression ratio, while another one has faster
decompression.  And it's OK for a user to choose different compression
methods for different columns.

Making extensions depend on a datatype's internal representation doesn't
seem evil to me.  We already have a bunch of extensions depending on much
deeper guts of PostgreSQL.  On a major release of PostgreSQL, extensions
must adapt to the changes; that is the rule.  And note that the datatype
internal representation changes relatively rarely in comparison with other
internals, because it's related to on-disk 

Re: [HACKERS] Custom compression methods

2017-12-12 Thread Ildus Kurbangaliev
On Mon, 11 Dec 2017 20:53:29 +0100
Tomas Vondra  wrote:

> But let me play the devil's advocate for a while and question the
> usefulness of this approach to compression. Some of the questions were
> mentioned in the thread before, but I don't think they got the
> attention they deserve.

Hi. I will try to explain why this approach could be better than others.

> 
> 
> Replacing the algorithm used to compress all varlena values (in a way
> that makes it transparent for the data type code).
> --
> 
> While pglz served us well over time, it was repeatedly mentioned that
> in some cases it becomes the bottleneck. So supporting other state of
> the art compression algorithms seems like a good idea, and this patch
> is one way to do that.
> 
> But perhaps we should simply make it an initdb option (in which case
> the whole cluster would simply use e.g. lz4 instead of pglz)?
> 
> That seems like a much simpler approach - it would only require some
> ./configure options to add --with-lz4 (and other compression
> libraries), an initdb option to pick compression algorithm, and
> probably noting the choice in cluster controldata.

Replacing pglz for all varlena values wasn't the goal of the patch, but
it's possible to do with it, and I think that's good. And as Robert
mentioned, pglz could appear as a builtin undroppable compression method,
so the others could be added too. And in the future it can open the way
to specify compression for a specific database or cluster.

> 
> Custom datatype-aware compression (e.g. the tsvector)
> --
> 
> Exploiting knowledge of the internal data type structure is a
> promising way to improve compression ratio and/or performance.
> 
> The obvious question of course is why shouldn't this be done by the
> data type code directly, which would also allow additional benefits
> like operating directly on the compressed values.
> 
> Another thing is that if the datatype representation changes in some
> way, the compression method has to change too. So it's tightly coupled
> to the datatype anyway.
> 
> This does not really require any new infrastructure, all the pieces
> are already there.
> 
> In some cases that may not be quite possible - the datatype may not be
> flexible enough to support alternative (compressed) representation,
> e.g. because there are no bits available for "compressed" flag, etc.
> 
> Conclusion: IMHO if we want to exploit the knowledge of the data type
> internal structure, perhaps doing that in the datatype code directly
> would be a better choice.

It could be, but let's imagine there is internal compression for
tsvector. It means that tsvector now has two formats, and loses one bit
somewhere in the header. After a while we find a better compression,
but we can't add it because there is already one and it's not good to
have three different formats for one type. Or, the compression methods
were implemented and we decided to use dictionaries for tsvector (if
the user is going to store a limited number of words). But it would mean
that tsvector goes through two compression stages (one for its internal
format and one for dictionaries).

> 
> 
> Custom datatype-aware compression with additional column-specific
> metadata (e.g. the jsonb with external dictionary).
> --
> 
> Exploiting redundancy in multiple values in the same column (instead
> of compressing them independently) is another attractive way to help
> the compression. It is inherently datatype-aware, but currently can't
> be implemented directly in datatype code as there's no concept of
> column-specific storage (e.g. to store dictionary shared by all values
> in a particular column).
> 
> I believe any patch addressing this use case would have to introduce
> such column-specific storage, and any solution doing that would
> probably need to introduce the same catalogs, etc.
> 
> The obvious disadvantage of course is that we need to decompress the
> varlena value before doing pretty much anything with it, because the
> datatype is not aware of the compression.
> 
> So I wonder if the patch should instead provide infrastructure for
> doing that in the datatype code directly.
> 
> The other question is if the patch should introduce some
> infrastructure for handling the column context (e.g. column
> dictionary). Right now, whoever implements the compression has to
> implement this bit too.

Column-specific storage sounds optional to me. For example, compressing
timestamp[] using some form of delta compression will not require it.
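
For illustration, a minimal standalone sketch of such a delta encoding,
assuming timestamps are 64-bit microsecond counts (as with integer
datetimes); the function names are made up for this example and are not
part of the proposed compression handler API, whose compress callback
could wrap something like this and emit a varlena:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Store the first value as-is, then successive differences. */
static void
delta_encode(const int64_t *in, int64_t *out, size_t n)
{
    if (n == 0)
        return;
    out[0] = in[0];
    for (size_t i = 1; i < n; i++)
        out[i] = in[i] - in[i - 1];
}

static void
delta_decode(const int64_t *in, int64_t *out, size_t n)
{
    if (n == 0)
        return;
    out[0] = in[0];
    for (size_t i = 1; i < n; i++)
        out[i] = out[i - 1] + in[i];
}

int
main(void)
{
    /* one-second ticks: all deltas are 1000000, which a generic
     * byte-oriented coder then squeezes very well */
    int64_t     ts[4] = {0, 1000000, 2000000, 3000000};
    int64_t     enc[4], dec[4];

    delta_encode(ts, enc, 4);
    delta_decode(enc, dec, 4);
    for (int i = 0; i < 4; i++)
        printf("%lld %lld\n", (long long) enc[i], (long long) dec[i]);
    return 0;
}

All the state lives inside the single value being compressed, which is
why no per-column dictionary or other column-specific storage is needed.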

-- 

Regards,
Ildus Kurbangaliev



Re: [HACKERS] Custom compression methods

2017-12-11 Thread Tomas Vondra
Hi,

I see there's an ongoing discussion about the syntax and ALTER TABLE
behavior when changing a compression method for a column. So the patch
seems to be on its way to being ready in the January CF, I guess.

But let me play the devil's advocate for a while and question the
usefulness of this approach to compression. Some of the questions were
mentioned in the thread before, but I don't think they got the attention
they deserve.

FWIW I don't know the answers, but I think it's important to ask them.
Also, apologies if this post looks to be against the patch - that's part
of the "devil's advocate" thing.


The main question I'm asking myself is what use cases the patch
addresses, and whether there is a better way to do that. I see about
three main use-cases:

1) Replacing the algorithm used to compress all varlena types (in a way
that makes it transparent for the data type code).

2) Custom datatype-aware compression (e.g. the tsvector).

3) Custom datatype-aware compression with additional column-specific
metadata (e.g. the jsonb with external dictionary).

Now, let's discuss those use cases one by one, and see if there are
simpler (or better in some way) solutions ...


Replacing the algorithm used to compress all varlena values (in a way
that makes it transparent for the data type code).
--

While pglz served us well over time, it was repeatedly mentioned that in
some cases it becomes the bottleneck. So supporting other state of the
art compression algorithms seems like a good idea, and this patch is one
way to do that.

But perhaps we should simply make it an initdb option (in which case the
whole cluster would simply use e.g. lz4 instead of pglz)?

That seems like a much simpler approach - it would only require some
./configure options to add --with-lz4 (and other compression libraries),
an initdb option to pick compression algorithm, and probably noting the
choice in cluster controldata.

No dependencies tracking, no ALTER TABLE issues, etc.

Of course, it would not allow using different compression algorithms for
different columns (although it might perhaps allow different compression
level, to some extent).

Conclusion: If we want to offer a simple cluster-wide pglz alternative,
perhaps this patch is not the right way to do that.


Custom datatype-aware compression (e.g. the tsvector)
--

Exploiting knowledge of the internal data type structure is a promising
way to improve compression ratio and/or performance.

The obvious question of course is why shouldn't this be done by the data
type code directly, which would also allow additional benefits like
operating directly on the compressed values.

Another thing is that if the datatype representation changes in some
way, the compression method has to change too. So it's tightly coupled
to the datatype anyway.

This does not really require any new infrastructure, all the pieces are
already there.

In some cases that may not be quite possible - the datatype may not be
flexible enough to support alternative (compressed) representation, e.g.
because there are no bits available for "compressed" flag, etc.

Conclusion: IMHO if we want to exploit the knowledge of the data type
internal structure, perhaps doing that in the datatype code directly
would be a better choice.


Custom datatype-aware compression with additional column-specific
metadata (e.g. the jsonb with external dictionary).
--

Exploiting redundancy in multiple values in the same column (instead of
compressing them independently) is another attractive way to help the
compression. It is inherently datatype-aware, but currently can't be
implemented directly in datatype code as there's no concept of
column-specific storage (e.g. to store dictionary shared by all values
in a particular column).

I believe any patch addressing this use case would have to introduce
such column-specific storage, and any solution doing that would probably
need to introduce the same catalogs, etc.

The obvious disadvantage of course is that we need to decompress the
varlena value before doing pretty much anything with it, because the
datatype is not aware of the compression.

So I wonder if the patch should instead provide infrastructure for doing
that in the datatype code directly.

The other question is if the patch should introduce some infrastructure
for handling the column context (e.g. column dictionary). Right now,
whoever implements the compression has to implement this bit too.



regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-11 Thread Alexander Korotkov
On Mon, Dec 11, 2017 at 8:46 PM, Robert Haas  wrote:

> On Mon, Dec 11, 2017 at 12:41 PM, Alexander Korotkov
>  wrote:
> > Thus, in your example if user would like to further change awesome
> > compression for evenbetter compression, she should write.
> >
> > SET COMPRESSION evenbetter PRESERVE pglz, awesome; -- full list of
> previous
> > compression methods
>
> Right.
>
> > I wonder what should we do if user specifies only part of previous
> > compression methods?  For instance, pglz is specified but awesome is
> > missing.
> >
> > SET COMPRESSION evenbetter PRESERVE pglz; -- awesome is missing
> >
> > I think we should trigger an error in this case.  Because query is
> specified
> > in form that is assuming to work without table rewrite, but we're unable
> to
> > do this without table rewrite.
>
> I think that should just rewrite the table in that case.  PRESERVE
> should specify the things that are allowed to be preserved -- its mere
> presence should not be read to preclude a rewrite.  And it's
> completely reasonable for someone to want to do this, if they are
> thinking about de-installing awesome.
>

OK, but a NOTICE that a presumably unexpected table rewrite is taking
place could still be useful.

Also, we should probably add some view that exposes the compression
methods currently preserved for columns, so that the user can construct
a SET COMPRESSION query that doesn't rewrite the table without digging
into internals (like querying pg_depend directly).

> I also think that we need some way to change compression method for
> multiple
> > columns in a single table rewrite.  Because it would be way more
> efficient
> > than rewriting table for each of columns.  So as an alternative of
> >
> > ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome; -- first table
> > rewrite
> > ALTER TABLE tbl ALTER COLUMN c2 SET COMPRESSION awesome; -- second table
> > rewrite
> >
> > we could also provide
> >
> > ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome PRESERVE pglz;
> -- no
> > rewrite
> > ALTER TABLE tbl ALTER COLUMN c2 SET COMPRESSION awesome PRESERVE pglz;
> -- no
> > rewrite
> > VACUUM FULL tbl RESET COMPRESSION PRESERVE c1, c2; -- rewrite with
> > recompression of c1 and c2 and removing depedencies
> >
> > ?
>
> Hmm.  ALTER TABLE allows multi comma-separated subcommands, so I don't
> think we need to drag VACUUM into this.  The user can just say:
>
> ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome, ALTER COLUMN
> c2 SET COMPRESSION awesome;
>
> If this is properly integrated into tablecmds.c, that should cause a
> single rewrite affecting both columns.


OK.  Sorry, I didn't notice we can use multiple subcommands for ALTER TABLE
in this case...

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Custom compression methods

2017-12-11 Thread Robert Haas
On Mon, Dec 11, 2017 at 12:41 PM, Alexander Korotkov
 wrote:
> Thus, in your example if user would like to further change awesome
> compression for evenbetter compression, she should write.
>
> SET COMPRESSION evenbetter PRESERVE pglz, awesome; -- full list of previous
> compression methods

Right.

> I wonder what should we do if user specifies only part of previous
> compression methods?  For instance, pglz is specified but awesome is
> missing.
>
> SET COMPRESSION evenbetter PRESERVE pglz; -- awesome is missing
>
> I think we should trigger an error in this case.  Because query is specified
> in form that is assuming to work without table rewrite, but we're unable to
> do this without table rewrite.

I think that should just rewrite the table in that case.  PRESERVE
should specify the things that are allowed to be preserved -- its mere
presence should not be read to preclude a rewrite.  And it's
completely reasonable for someone to want to do this, if they are
thinking about de-installing awesome.

> I also think that we need some way to change compression method for multiple
> columns in a single table rewrite.  Because it would be way more efficient
> than rewriting table for each of columns.  So as an alternative of
>
> ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome; -- first table
> rewrite
> ALTER TABLE tbl ALTER COLUMN c2 SET COMPRESSION awesome; -- second table
> rewrite
>
> we could also provide
>
> ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome PRESERVE pglz; -- no
> rewrite
> ALTER TABLE tbl ALTER COLUMN c2 SET COMPRESSION awesome PRESERVE pglz; -- no
> rewrite
> VACUUM FULL tbl RESET COMPRESSION PRESERVE c1, c2; -- rewrite with
> recompression of c1 and c2 and removing depedencies
>
> ?

Hmm.  ALTER TABLE allows multiple comma-separated subcommands, so I don't
think we need to drag VACUUM into this.  The user can just say:

ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome, ALTER COLUMN
c2 SET COMPRESSION awesome;

If this is properly integrated into tablecmds.c, that should cause a
single rewrite affecting both columns.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-12-11 Thread Alexander Korotkov
On Mon, Dec 11, 2017 at 8:25 PM, Robert Haas  wrote:

> On Mon, Dec 11, 2017 at 7:55 AM, Ildus Kurbangaliev
>  wrote:
> > On Fri, 8 Dec 2017 15:12:42 -0500
> > Robert Haas  wrote:
> >> Maybe a better idea is ALTER COLUMN SET COMPRESSION x1, x2, x3 ...
> >> meaning that x1 is the default for new tuples but x2, x3, etc. are
> >> still allowed if present.  If you issue a command that only adds
> >> things to the list, no table rewrite happens, but if you remove
> >> anything, then it does.
> >
> > I like this idea, but maybe it should be something like ALTER COLUMN
> > SET COMPRESSION x1 [ PRESERVE x2, x3 ]? 'PRESERVE' is already used in
> > syntax and this syntax will show better which one is current and which
> > ones should be kept.
>
> Sure, that works.  And I think pglz should exist in the catalog as a
> predefined, undroppable compression algorithm.  So the default for
> each column initially is:
>
> SET COMPRESSION pglz
>
> And if you want to rewrite the table with your awesome custom thing, you
> can do
>
> SET COMPRESSION awesome
>
> But if you want to just use the awesome custom thing for new rows, you can
> do
>
> SET COMPRESSION awesome PRESERVE pglz
>
> Then we can get all the dependencies right, pg_upgrade works, users
> have total control of rewrite behavior, and everything is great.  :-)
>

Looks good.

Thus, in your example if user would like to further change awesome
compression for evenbetter compression, she should write.

SET COMPRESSION evenbetter PRESERVE pglz, awesome; -- full list of previous
compression methods

I wonder what we should do if the user specifies only part of the
previous compression methods?  For instance, pglz is specified but
awesome is missing.

SET COMPRESSION evenbetter PRESERVE pglz; -- awesome is missing

I think we should trigger an error in this case, because the query is
specified in a form that assumes no table rewrite, but we're unable to
do this without a table rewrite.

I also think that we need some way to change the compression method for
multiple columns in a single table rewrite, because it would be way more
efficient than rewriting the table once per column.  So as an alternative to

ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome; -- first table
rewrite
ALTER TABLE tbl ALTER COLUMN c2 SET COMPRESSION awesome; -- second table
rewrite

we could also provide

ALTER TABLE tbl ALTER COLUMN c1 SET COMPRESSION awesome PRESERVE pglz; --
no rewrite
ALTER TABLE tbl ALTER COLUMN c2 SET COMPRESSION awesome PRESERVE pglz; --
no rewrite
VACUUM FULL tbl RESET COMPRESSION PRESERVE c1, c2; -- rewrite with
recompression of c1 and c2 and removing dependencies

?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Custom compression methods

2017-12-11 Thread Robert Haas
On Mon, Dec 11, 2017 at 7:55 AM, Ildus Kurbangaliev
 wrote:
> On Fri, 8 Dec 2017 15:12:42 -0500
> Robert Haas  wrote:
>> Maybe a better idea is ALTER COLUMN SET COMPRESSION x1, x2, x3 ...
>> meaning that x1 is the default for new tuples but x2, x3, etc. are
>> still allowed if present.  If you issue a command that only adds
>> things to the list, no table rewrite happens, but if you remove
>> anything, then it does.
>
> I like this idea, but maybe it should be something like ALTER COLUMN
> SET COMPRESSION x1 [ PRESERVE x2, x3 ]? 'PRESERVE' is already used in
> syntax and this syntax will show better which one is current and which
> ones should be kept.

Sure, that works.  And I think pglz should exist in the catalog as a
predefined, undroppable compression algorithm.  So the default for
each column initially is:

SET COMPRESSION pglz

And if you want to rewrite the table with your awesome custom thing, you can do

SET COMPRESSION awesome

But if you want to just use the awesome custom thing for new rows, you can do

SET COMPRESSION awesome PRESERVE pglz

Then we can get all the dependencies right, pg_upgrade works, users
have total control of rewrite behavior, and everything is great.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-12-11 Thread Ildus Kurbangaliev
On Fri, 8 Dec 2017 15:12:42 -0500
Robert Haas  wrote:

> 
> Maybe a better idea is ALTER COLUMN SET COMPRESSION x1, x2, x3 ...
> meaning that x1 is the default for new tuples but x2, x3, etc. are
> still allowed if present.  If you issue a command that only adds
> things to the list, no table rewrite happens, but if you remove
> anything, then it does.
> 

I like this idea, but maybe it should be something like ALTER COLUMN
SET COMPRESSION x1 [ PRESERVE x2, x3 ]? 'PRESERVE' is already used
elsewhere in the grammar, and this form shows more clearly which method
is current and which ones should be kept.

-- 

Regards,
Ildus Kurbangaliev



Re: [HACKERS] Custom compression methods

2017-12-08 Thread Robert Haas
On Wed, Dec 6, 2017 at 10:07 AM, Ildus Kurbangaliev
 wrote:
> On Fri, 1 Dec 2017 21:47:43 +0100
> Tomas Vondra  wrote:
>> +1 to do the rewrite, just like for other similar ALTER TABLE commands
>
> Ok. What about the following syntax:
>
> ALTER COLUMN DROP COMPRESSION - removes compression from the column
> with the rewrite and removes related compression options, so the user
> can drop compression method.
>
> ALTER COLUMN SET COMPRESSION NONE for the cases when
> the users want to just disable compression for future tuples. After
> that they can keep compressed tuples, or in the case when they have a
> large table they can decompress tuples partially using e.g. UPDATE,
> and then use ALTER COLUMN DROP COMPRESSION which will be much faster
> then.
>
> ALTER COLUMN SET COMPRESSION  WITH  will change
> compression for new tuples but will not touch old ones. If the users
> want the recompression they can use DROP/SET COMPRESSION combination.
>
> I don't think that SET COMPRESSION with the rewrite of the whole table
> will be useful enough on any somewhat big tables and same time big
> tables is where the user needs compression the most.
>
> I understand that ALTER with the rewrite sounds logical and much easier
> to implement (and it doesn't require Oids in tuples), but it could be
> unusable.

The problem with this is that old compression methods can still be
floating around in the table even after you have done SET COMPRESSION
to something else.  The table still needs to have a dependency on the
old compression method, because otherwise you might think it's safe to
drop the old one when it really is not.  Furthermore, if you do a
pg_upgrade, you've got to preserve that dependency, which means it
would have to show up in a pg_dump --binary-upgrade someplace.  It's
not obvious how any of that would work with this syntax.

Maybe a better idea is ALTER COLUMN SET COMPRESSION x1, x2, x3 ...
meaning that x1 is the default for new tuples but x2, x3, etc. are
still allowed if present.  If you issue a command that only adds
things to the list, no table rewrite happens, but if you remove
anything, then it does.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-12-06 Thread Ildus Kurbangaliev
On Fri, 1 Dec 2017 21:47:43 +0100
Tomas Vondra  wrote:
 
> 
> +1 to do the rewrite, just like for other similar ALTER TABLE commands

Ok. What about the following syntax:

ALTER COLUMN DROP COMPRESSION - removes compression from the column
with a table rewrite and removes the related compression options, so the
user can then drop the compression method.

ALTER COLUMN SET COMPRESSION NONE - for the cases when the users just
want to disable compression for future tuples. After that they can keep
the compressed tuples, or, if they have a large table, they can
decompress tuples gradually using e.g. UPDATE and then run ALTER COLUMN
DROP COMPRESSION, which will be much faster at that point.

ALTER COLUMN SET COMPRESSION  WITH  will change
compression for new tuples but will not touch the old ones. If the users
want recompression, they can use a DROP/SET COMPRESSION combination.

I don't think that SET COMPRESSION with a rewrite of the whole table
will be useful enough on somewhat big tables, and at the same time big
tables are where the user needs compression the most.

I understand that ALTER with the rewrite sounds logical and much easier
to implement (and it doesn't require Oids in tuples), but it could be
unusable.

-- 

Regards,
Ildus Kurbangaliev



Re: [HACKERS] Custom compression methods

2017-12-02 Thread Alvaro Herrera
Tomas Vondra wrote:

> On 12/01/2017 08:48 PM, Alvaro Herrera wrote:

> > Maybe our dependency code needs to be extended in order to support this.
> > I think the current logic would drop the column if you were to do "DROP
> > COMPRESSION .. CASCADE", but I'm not sure we'd see that as a feature.
> > I'd rather have DROP COMPRESSION always fail instead until no columns
> > use it.  Let's hear other's opinions on this bit though.
> 
> Why should this behave differently compared to data types? Seems quite
> against POLA, if you ask me ...

OK, DROP TYPE sounds good enough precedent, so +1 on that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-02 Thread Tomas Vondra
On 12/02/2017 09:38 PM, Andres Freund wrote:
> Hi,
> 
> On 2017-12-02 16:04:52 +0100, Tomas Vondra wrote:
>> Firstly, it's going to be quite hard (or perhaps impossible) to find an
>> algorithm that is "universally better" than pglz. Some algorithms do
>> work better for text documents, some for binary blobs, etc. I don't
>> think there's a win-win option.
> 
> lz4 is pretty much there.
> 

That's a matter of opinion, I guess. It's a solid compression algorithm,
that's for sure ...

>> Secondly, all the previous attempts ran into some legal issues, i.e.
>> licensing and/or patents. Maybe the situation changed since then (no
>> idea, haven't looked into that), but in the past the "pluggable"
>> approach was proposed as a way to address this.
> 
> Those were pretty bogus.

IANAL, so I don't dare to judge the bogusness of such claims. I assume
that if we made it optional (e.g. a configure/initdb option), it'd be
much less of an issue. Of course, that has disadvantages too (because
when you compile/init with one algorithm, and then find something else
would work better for your data, you have to start from scratch).

>
> I think we're not doing our users a favor if they've to download
> some external projects, then fiddle with things, just to not choose
> a compression algorithm that's been known bad for at least 5+ years.
> If we've a decent algorithm in-core *and* then allow extensibility, 
> that's one thing, but keeping the bad and tell forks "please take
> our users with this code we give you" is ...
> 

I don't understand what exactly is your issue with external projects,
TBH. I think extensibility is one of the great strengths of Postgres.
It's not all rainbows and unicorns, of course, and it has costs too.

FWIW I don't think pglz is a "known bad" algorithm. Perhaps there are
cases where other algorithms (e.g. lz4) are running circles around it,
particularly when it comes to decompression speed, but I wouldn't say
it's "known bad".

Not sure which forks you're talking about ...

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-02 Thread Tomas Vondra

On 12/02/2017 09:24 PM, konstantin knizhnik wrote:
> 
> On Dec 2, 2017, at 6:04 PM, Tomas Vondra wrote:
> 
>> On 12/01/2017 10:52 PM, Andres Freund wrote:
>> ...
>>
>> Other algorithms (e.g. zstd) got significantly better compression (25%)
>> compared to pglz, but in exchange for longer compression. I'm sure we
>> could lower compression level to make it faster, but that will of course
>> hurt the compression ratio.
>>
>> I don't think switching to a different compression algorithm is a way
>> forward - it was proposed and explored repeatedly in the past, and every
>> time it failed for a number of reasons, most of which are still valid.
>>
>>
>> Firstly, it's going to be quite hard (or perhaps impossible) to
>> find an algorithm that is "universally better" than pglz. Some
>> algorithms do work better for text documents, some for binary
>> blobs, etc. I don't think there's a win-win option.
>>
>> Sure, there are workloads where pglz performs poorly (I've seen
>> such cases too), but IMHO that's more an argument for the custom
>> compression method approach. pglz gives you good default
>> compression in most cases, and you can change it for columns where
>> it matters, and where a different space/time trade-off makes
>> sense.
>>
>>
>> Secondly, all the previous attempts ran into some legal issues, i.e.
>> licensing and/or patents. Maybe the situation changed since then (no
>> idea, haven't looked into that), but in the past the "pluggable"
>> approach was proposed as a way to address this.
>>
>>
> 
> Maybe it will be interesting for you to see the following results
> of applying page-level compression (CFS in PgPro-EE) to pgbench
> data:
> 

I don't follow. If I understand what CFS does correctly (and I'm mostly
guessing here, because I haven't seen the code published anywhere, and I
assume it's proprietary), it essentially compresses whole 8kB blocks.

I don't know whether it reorganizes the data into a columnar format
first in some way (to make it more "columnar", and hence more
compressible), which would make it somewhat similar to page-level
compression in Oracle.

But it's clearly a very different approach from what the patch aims to
improve (compressing individual varlena values).

> 
> All algorithms (except zlib) were used with the best-speed option: using a
> better compression level usually does not have such a large impact on the
> compression ratio (<30%), but it can significantly increase the time
> (several times). Certainly pgbench is not the best candidate for
> testing compression algorithms: it generates a lot of artificial and
> redundant data. But we also measured it on real customers' data, and
> zstd still seems to be the best compression method: it provides good
> compression with the smallest CPU overhead.
> 

I think this really depends on the dataset, and drawing conclusions
based on a single test is somewhat crazy. Especially when it's synthetic
pgbench data with lots of inherent redundancy - sequential IDs, ...

My takeaway from the results is rather that page-level compression may
be very beneficial in some cases, although I wonder how much of that can
be gained by simply using a compressed filesystem (thus making it
transparent to PostgreSQL).


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-02 Thread Andres Freund
Hi,

On 2017-12-02 16:04:52 +0100, Tomas Vondra wrote:
> Firstly, it's going to be quite hard (or perhaps impossible) to find an
> algorithm that is "universally better" than pglz. Some algorithms do
> work better for text documents, some for binary blobs, etc. I don't
> think there's a win-win option.

lz4 is pretty much there.

> Secondly, all the previous attempts ran into some legal issues, i.e.
> licensing and/or patents. Maybe the situation changed since then (no
> idea, haven't looked into that), but in the past the "pluggable"
> approach was proposed as a way to address this.

Those were pretty bogus.  I think we're not doing our users a favor if
they have to download some external projects, then fiddle with things,
just to avoid a compression algorithm that's been known to be bad for at
least 5+ years.  If we have a decent algorithm in-core *and* then allow
extensibility, that's one thing, but keeping the bad one and telling
forks "please take our users with this code we give you" is ...

Greetings,

Andres Freund



Re: [HACKERS] Custom compression methods

2017-12-02 Thread Tomas Vondra
On 12/01/2017 10:52 PM, Andres Freund wrote:
> On 2017-12-01 16:14:58 -0500, Robert Haas wrote:
>> Honestly, if we can give everybody a 4% space reduction by
>> switching to lz4, I think that's totally worth doing -- but let's
>> not make people choose it, let's make it the default going forward,
>> and keep pglz support around so we don't break pg_upgrade
>> compatibility (and so people can continue to choose it if for some
>> reason it works better in their use case). That kind of improvement
>> is nothing special in a specific workload, but TOAST is a pretty
>> general-purpose mechanism. I have become, through a few bitter
>> experiences, a strong believer in the value of trying to reduce our
>> on-disk footprint, and knocking 4% off the size of every TOAST
>> table in the world does not sound worthless to me -- even though
>> context-aware compression can doubtless do a lot better.
> 
> +1. It's also a lot faster, and I've seen way way to many workloads
> with 50%+ time spent in pglz.
> 

TBH the 4% figure is something I mostly made up (I'm fake news!). On the
mailing list archive (which I believe is pretty compressible) I observed
something like 2.5% size reduction with lz4 compared to pglz, at least
with the compression levels I've used ...

Other algorithms (e.g. zstd) got significantly better compression (25%)
compared to pglz, but in exchange for longer compression. I'm sure we
could lower compression level to make it faster, but that will of course
hurt the compression ratio.

I don't think switching to a different compression algorithm is a way
forward - it was proposed and explored repeatedly in the past, and every
time it failed for a number of reasons, most of which are still valid.


Firstly, it's going to be quite hard (or perhaps impossible) to find an
algorithm that is "universally better" than pglz. Some algorithms do
work better for text documents, some for binary blobs, etc. I don't
think there's a win-win option.

Sure, there are workloads where pglz performs poorly (I've seen such
cases too), but IMHO that's more an argument for the custom compression
method approach. pglz gives you good default compression in most cases,
and you can change it for columns where it matters, and where a
different space/time trade-off makes sense.


Secondly, all the previous attempts ran into some legal issues, i.e.
licensing and/or patents. Maybe the situation changed since then (no
idea, haven't looked into that), but in the past the "pluggable"
approach was proposed as a way to address this.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-01 Thread Ildus Kurbangaliev
On Fri, 1 Dec 2017 16:38:42 -0300
Alvaro Herrera  wrote:

> 
> To me it makes sense to say "let's create this method which is for
> data compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION)
> followed by either "let's use this new compression method for the
> type tsvector" (ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's
> use this new compression method for the column tc" (ALTER TABLE ALTER
> COLUMN tc SET COMPRESSION hyperz).
> 

Hi, I think that if CREATE ACCESS METHOD can be used for compression, it
could be nicer than CREATE COMPRESSION METHOD. I just don't know whether
compression fits as an access method or not. Anyway, it's easy to change
the syntax, and I don't mind doing it if that's necessary for the patch
to be committed.

-- 

Regards,
Ildus Kurbangaliev



Re: [HACKERS] Custom compression methods

2017-12-01 Thread Andres Freund
On 2017-12-01 16:14:58 -0500, Robert Haas wrote:
> Honestly, if we can give everybody a 4% space reduction by switching
> to lz4, I think that's totally worth doing -- but let's not make
> people choose it, let's make it the default going forward, and keep
> pglz support around so we don't break pg_upgrade compatibility (and so
> people can continue to choose it if for some reason it works better in
> their use case).  That kind of improvement is nothing special in a
> specific workload, but TOAST is a pretty general-purpose mechanism.  I
> have become, through a few bitter experiences, a strong believer in
> the value of trying to reduce our on-disk footprint, and knocking 4%
> off the size of every TOAST table in the world does not sound
> worthless to me -- even though context-aware compression can doubtless
> do a lot better.

+1. It's also a lot faster, and I've seen way, way too many workloads with
50%+ time spent in pglz.

Greetings,

Andres Freund



Re: [HACKERS] Custom compression methods

2017-12-01 Thread Tomas Vondra

On 12/01/2017 08:38 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
> 
>> On 11/30/2017 09:51 PM, Alvaro Herrera wrote:
> 
>>> Just passing by, but wouldn't this fit in the ACCESS METHOD group of
>>> commands?  So this could be simplified down to
>>> CREATE ACCESS METHOD ts1 TYPE COMPRESSION
>>> we have that for indexes and there are patches flying for heap storage,
>>> sequences, etc.
>>
>> I think that would conflate two very different concepts. In my mind,
>> access methods define how rows are stored.
> 
> In mine, they define how things are accessed (i.e. more general than
> what you're thinking).  We *currently* use them to store rows [in
> indexes], but there is no reason why we couldn't expand that.
> 

Not sure I follow. My argument was not so much about whether the rows
are stored as rows or in some other (columnar) format, but that access
methods deal with "tuples" (i.e. rows in the "logical" sense). I assume
that even if we end up implementing other access method types, they will
still be tuple-based.

OTOH compression methods (at least as introduced by this patch) operate
on individual values, and have very little to do with access to the
value (in a sense it's a transparent thing).

>
> So we group access methods in "types"; the current type we have is for
> indexes, and methods in that type define how are indexes accessed.  This
> new type would indicate how would values be compressed.  I disagree that
> there is no parallel there.
> 
> I'm trying to avoid pointless proliferation of narrowly defined DDL
> commands.
> 

Of course, the opposite case is using the same DDL for very different
concepts (although I understand you don't see it that way).

But in fairness, I don't really care if we call this COMPRESSION METHOD
or ACCESS METHOD or DARTH VADER ...

>> Furthermore, the "TYPE" in CREATE COMPRESSION method was meant to
>> restrict the compression algorithm to a particular data type (so, if it
>> relies on tsvector, you can't apply it to text columns).
> 
> Yes, of course.  I'm saying that the "datatype" property of a
> compression access method would be declared somewhere else, not in the
> TYPE clause of the CREATE ACCESS METHOD command.  Perhaps it makes sense
> to declare that a certain compression access method is good only for a
> certain data type, and then you can put that in the options clause,
> "CREATE ACCESS METHOD hyperz TYPE COMPRESSION WITH (type = tsvector)".
> But many compression access methods would be general in nature and so
> could be used for many datatypes (say, snappy).
> 
> To me it makes sense to say "let's create this method which is for data
> compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION) followed by
> either "let's use this new compression method for the type tsvector"
> (ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's use this new
> compression method for the column tc" (ALTER TABLE ALTER COLUMN tc SET
> COMPRESSION hyperz).
> 

The WITH syntax does not seem particularly pretty to me, TBH. I'd be
much happier with "TYPE tsvector" and leaving WITH for the options
specific to each compression method.

FWIW I think syntax is the least critical part of this patch. It's ~1%
of the patch, and the gram.y additions are rather trivial.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-01 Thread Robert Haas
On Fri, Dec 1, 2017 at 4:06 PM, Tomas Vondra
 wrote:
>>> I agree with these thoughts in general, but I'm not quite sure
>>> what is your conclusion regarding the patch.
>>
>> I have not reached one. Sometimes I like to discuss problems before
>> deciding what I think. :-)
>
> That's lame! Let's make decisions without discussion ;-)

Oh, right.  What was I thinking?

>> It does seem to me that the patch may be aiming at a relatively narrow
>> target in a fairly large problem space, but I don't know whether to
>> label that as short-sightedness or prudent incrementalism.
>
> I don't know either. I don't think people will start switching their
> text columns to lz4 just because they can, or because they get 4% space
> reduction compared to pglz.

Honestly, if we can give everybody a 4% space reduction by switching
to lz4, I think that's totally worth doing -- but let's not make
people choose it, let's make it the default going forward, and keep
pglz support around so we don't break pg_upgrade compatibility (and so
people can continue to choose it if for some reason it works better in
their use case).  That kind of improvement is nothing special in a
specific workload, but TOAST is a pretty general-purpose mechanism.  I
have become, through a few bitter experiences, a strong believer in
the value of trying to reduce our on-disk footprint, and knocking 4%
off the size of every TOAST table in the world does not sound
worthless to me -- even though context-aware compression can doubtless
do a lot better.

> But the ability to build per-column dictionaries seems quite powerful, I
> guess. And I don't think that can be easily built directly into JSONB,
> because we don't have a way to provide information about the column
> (i.e. how would you fetch the correct dictionary?).

That's definitely a problem, but I think we should mull it over a bit
more before giving up.  I have a few thoughts, but the part of my life
that doesn't happen on the PostgreSQL mailing list precludes
expounding on them right this minute.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-12-01 Thread Robert Haas
On Fri, Dec 1, 2017 at 2:38 PM, Alvaro Herrera  wrote:
> In mine, they define how things are accessed (i.e. more general than
> what you're thinking).  We *currently* use them to store rows [in
> indexes], but there is no reason why we couldn't expand that.
>
> So we group access methods in "types"; the current type we have is for
> indexes, and methods in that type define how are indexes accessed.  This
> new type would indicate how would values be compressed.  I disagree that
> there is no parallel there.

+1.

> I'm trying to avoid pointless proliferation of narrowly defined DDL
> commands.

I also think that's an important goal.

> Yes, of course.  I'm saying that the "datatype" property of a
> compression access method would be declared somewhere else, not in the
> TYPE clause of the CREATE ACCESS METHOD command.  Perhaps it makes sense
> to declare that a certain compression access method is good only for a
> certain data type, and then you can put that in the options clause,
> "CREATE ACCESS METHOD hyperz TYPE COMPRESSION WITH (type = tsvector)".
> But many compression access methods would be general in nature and so
> could be used for many datatypes (say, snappy).
>
> To me it makes sense to say "let's create this method which is for data
> compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION) followed by
> either "let's use this new compression method for the type tsvector"
> (ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's use this new
> compression method for the column tc" (ALTER TABLE ALTER COLUMN tc SET
> COMPRESSION hyperz).

+1 to this, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-12-01 Thread Tomas Vondra
On 12/01/2017 08:20 PM, Robert Haas wrote:
> On Fri, Dec 1, 2017 at 10:18 AM, Tomas Vondra
>  wrote:
>> It has very little impact on this patch, as it has nothing to do with
>> columnar storage. That is, each value is compressed independently.
> 
> I understand that this patch is not about columnar storage, but I 
> think the idea that we may want to operate on the compressed data 
> directly is not only applicable to that case.
> 

Yeah. To clarify, my point was that column stores benefit from
compressing many values at once, and then operating on this compressed
vector. That is not what this patch is doing (or can do), of course.

But I certainly do agree that if the compression can be integrated into
the data type, allowing processing on compressed representation, then
that will beat whatever this patch is doing, of course ...

>>
>> I agree with these thoughts in general, but I'm not quite sure
>> what is your conclusion regarding the patch.
> 
> I have not reached one. Sometimes I like to discuss problems before 
> deciding what I think. :-)
> 

That's lame! Let's make decisions without discussion ;-)

>
> It does seem to me that the patch may be aiming at a relatively narrow
> target in a fairly large problem space, but I don't know whether to
> label that as short-sightedness or prudent incrementalism.
> 

I don't know either. I don't think people will start switching their
text columns to lz4 just because they can, or because they get 4% space
reduction compared to pglz.

But the ability to build per-column dictionaries seems quite powerful, I
guess. And I don't think that can be easily built directly into JSONB,
because we don't have a way to provide information about the column
(i.e. how would you fetch the correct dictionary?).


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-01 Thread Tomas Vondra


On 12/01/2017 08:48 PM, Alvaro Herrera wrote:
> Ildus Kurbangaliev wrote:
> 
>> If the table is big, decompression could take an eternity. That's why I
>> decided only to disable it, so the data can still be decompressed using
>> the compression options.
>>
>> My idea was to keep the compression options forever, since there will not
>> be many of them in one database. Still, that requires that the extension
>> is not removed.
>>
>> I will try to find a way to recompress the data first in case it moves
>> to another table.
> 
> I think what you should do is add a dependency between a column that
> compresses using a method, and that method.  So the method cannot be
> dropped and leave compressed data behind.  Since the method is part of
> the extension, the extension cannot be dropped either.  If you ALTER
> the column so that it uses another compression method, then the table is
> rewritten and the dependency is removed; once you do that for all the
> columns that use the compression method, the compression method can be
> dropped.
> 

+1 to do the rewrite, just like for other similar ALTER TABLE commands

>
> Maybe our dependency code needs to be extended in order to support this.
> I think the current logic would drop the column if you were to do "DROP
> COMPRESSION .. CASCADE", but I'm not sure we'd see that as a feature.
> I'd rather have DROP COMPRESSION always fail instead until no columns
> use it.  Let's hear other's opinions on this bit though.
> 

Why should this behave differently compared to data types? Seems quite
against POLA, if you ask me ...

If you want to remove the compression, you can do the SET NOT COMPRESSED
(or whatever syntax we end up using), and then DROP COMPRESSION METHOD.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-12-01 Thread Robert Haas
On Fri, Dec 1, 2017 at 10:18 AM, Tomas Vondra
 wrote:
> It has very little impact on this patch, as it has nothing to do with
> columnar storage. That is, each value is compressed independently.

I understand that this patch is not about columnar storage, but I
think the idea that we may want to operate on the compressed data
directly is not only applicable to that case.

> I agree with these thoughts in general, but I'm not quite sure what is
> your conclusion regarding the patch.

I have not reached one.  Sometimes I like to discuss problems before
deciding what I think.  :-)

It does seem to me that the patch may be aiming at a relatively narrow
target in a fairly large problem space, but I don't know whether to
label that as short-sightedness or prudent incrementalism.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-12-01 Thread Tomas Vondra
On 12/01/2017 03:23 PM, Robert Haas wrote:
> On Thu, Nov 30, 2017 at 2:47 PM, Tomas Vondra
>  wrote:
>> OK. I think it's a nice use case (and nice gains on the compression
>> ratio), demonstrating the datatype-aware compression. The question is
>> why shouldn't this be built into the datatypes directly?
> 
> Tomas, thanks for running benchmarks of this.  I was surprised to see
> how little improvement there was from other modern compression
> methods, although lz4 did appear to be a modest win on both size and
> speed.  But I share your intuition that a lot of the interesting work
> is in datatype-specific compression algorithms.  I have noticed in a
> number of papers that I've read that teaching other parts of the
> system to operate directly on the compressed data, especially for
> column stores, is a critical performance optimization; of course, that
> only makes sense if the compression is datatype-specific.  I don't
> know exactly what that means for the design of this patch, though.
> 

It has very little impact on this patch, as it has nothing to do with
columnar storage. That is, each value is compressed independently.

Column stores exploit the fact that they get a vector of values,
compressed in some data-aware way. E.g. some form of RLE or dictionary
compression, which allows them to evaluate expressions on the compressed
vector. But that's irrelevant here, we only get row-by-row execution.
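
As a toy illustration of what "evaluating expressions on the compressed
vector" means: with run-length encoding you can, for instance, count
matches without ever expanding the runs. This is purely to show the
column-store technique being contrasted with; the type and function
names are invented and have nothing to do with the patch's API.

#include <stddef.h>
#include <stdint.h>

typedef struct RLERun
{
    int32_t     value;          /* the repeated value */
    uint32_t    length;         /* how many times it repeats */
} RLERun;

/* Count elements equal to 'key' directly on the RLE representation. */
uint64_t
rle_count_equal(const RLERun *runs, size_t nruns, int32_t key)
{
    uint64_t    count = 0;

    for (size_t i = 0; i < nruns; i++)
        if (runs[i].value == key)
            count += runs[i].length;    /* whole run matches at once */

    return count;
}

That only works because the executor sees the compressed vector; with
row-by-row execution and opaque varlena compression, as here, the value
has to be decompressed before anything can be evaluated on it.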

Note: The idea of building a dictionary for the whole jsonb column (which
this patch should allow) does not make it "columnar compression" in the
"column store" sense. The executor will still get the decompressed value.

> As a general point, no matter which way you go, you have to somehow
> deal with on-disk compatibility.  If you want to build in compression
> to the datatype itself, you need to find at least one bit someplace to
> mark the fact that you applied built-in compression.  If you want to
> build it in as a separate facility, you need to denote the compression
> used someplace else.  I haven't looked at how this patch does it, but
> the proposal in the past has been to add a value to vartag_external.

AFAICS the patch does that by setting a bit in the varlena header, and
then adding the OID of the compression method after the varlena header.
So you get (varlena header + OID + data).
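
Purely as an illustration of that on-disk shape (the field names are
invented for this sketch and do not match the patch's actual structs):

#include <stdint.h>

typedef struct IllustrativeCompressedDatum
{
    uint32_t    va_header;      /* varlena length word; one bit flags
                                 * "custom compressed" */
    uint32_t    cm_oid;         /* OID of the compression method used */
    char        data[];         /* compressed payload follows */
} IllustrativeCompressedDatum;

So every custom-compressed value carries the method's OID inline, which
is presumably what lets detoasting find the right decompression routine
without the datatype knowing anything about it.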

This has good and bad consequences.

Good: It's transparent for the datatype, so it does not have to worry
about the custom compression at all (and it may change arbitrarily).

Bad: It's transparent for the datatype, so it can't operate directly on
the compressed representation.

I don't think this is an argument against the patch, though. If the
datatype can support intelligent compression (and execution without
decompression), it has to be done in the datatype anyway.

> One nice thing about the latter method is that it can be used for any
> data type generically, regardless of how much bit-space is available
> in the data type representation itself.  It's realistically hard to
> think of a data-type that has no bit space available anywhere but is
> still subject to data-type specific compression; bytea definitionally
> has no bit space but is also can't benefit from special-purpose
> compression, whereas even something like text could be handled by
> starting the varlena with a NUL byte to indicate compressed data
> following.  However, you'd have to come up with a different trick for
> each data type.  Piggybacking on the TOAST machinery avoids that.  It
> also implies that we only try to compress values that are "big", which
> is probably be desirable if we're talking about a kind of compression
> that makes comprehending the value slower. Not all types of
> compression do, cf. commit 145343534c153d1e6c3cff1fa1855787684d9a38,
> and for those that don't it probably makes more sense to just build it
> into the data type.
> 
> All of that is a somewhat separate question from whether we should
> have CREATE / DROP COMPRESSION, though (or Alvaro's proposal of using
> the ACCESS METHOD stuff instead).  Even if we agree that piggybacking
> on TOAST is a good way to implement pluggable compression methods, it
> doesn't follow that the compression method is something that should be
> attached to the datatype from the outside; it could be built into it
> in a deep way.  For example, "packed" varlenas (1-byte header) are a
> form of compression, and the default functions for detoasting always
> produced unpacked values, but the operators for the text data type
> know how to operate on the packed representation.  That's sort of a
> trivial example, but it might well be that there are other cases where
> we can do something similar.  Maybe jsonb, for example, can compress
> data in such a way that some of the jsonb functions can operate
> directly on the compressed representation -- perhaps the number of
> keys is easily visible, for example, or maybe more.  In this view of
> the world, each data type should get to define its own compression
> method (or methods) but they are hard-wired into the datatype and you
> can't add more later, or if you do, you lose the advantages of the
> hard-wired stuff.

Re: [HACKERS] Custom compression methods

2017-12-01 Thread Robert Haas
On Thu, Nov 30, 2017 at 2:47 PM, Tomas Vondra
 wrote:
> OK. I think it's a nice use case (and nice gains on the compression
> ratio), demonstrating the datatype-aware compression. The question is
> why shouldn't this be built into the datatypes directly?

Tomas, thanks for running benchmarks of this.  I was surprised to see
how little improvement there was from other modern compression
methods, although lz4 did appear to be a modest win on both size and
speed.  But I share your intuition that a lot of the interesting work
is in datatype-specific compression algorithms.  I have noticed in a
number of papers that I've read that teaching other parts of the
system to operate directly on the compressed data, especially for
column stores, is a critical performance optimization; of course, that
only makes sense if the compression is datatype-specific.  I don't
know exactly what that means for the design of this patch, though.

As a general point, no matter which way you go, you have to somehow
deal with on-disk compatibility.  If you want to build in compression
to the datatype itself, you need to find at least one bit someplace to
mark the fact that you applied built-in compression.  If you want to
build it in as a separate facility, you need to denote the compression
used someplace else.  I haven't looked at how this patch does it, but
the proposal in the past has been to add a value to vartag_external.
One nice thing about the latter method is that it can be used for any
data type generically, regardless of how much bit-space is available
in the data type representation itself.  It's realistically hard to
think of a data-type that has no bit space available anywhere but is
still subject to data-type specific compression; bytea definitionally
has no bit space but also can't benefit from special-purpose
compression, whereas even something like text could be handled by
starting the varlena with a NUL byte to indicate compressed data
following.  However, you'd have to come up with a different trick for
each data type.  Piggybacking on the TOAST machinery avoids that.  It
also implies that we only try to compress values that are "big", which
is probably desirable if we're talking about a kind of compression
that makes comprehending the value slower. Not all types of
compression do, cf. commit 145343534c153d1e6c3cff1fa1855787684d9a38,
and for those that don't it probably makes more sense to just build it
into the data type.

All of that is a somewhat separate question from whether we should
have CREATE / DROP COMPRESSION, though (or Alvaro's proposal of using
the ACCESS METHOD stuff instead).  Even if we agree that piggybacking
on TOAST is a good way to implement pluggable compression methods, it
doesn't follow that the compression method is something that should be
attached to the datatype from the outside; it could be built into it
in a deep way.  For example, "packed" varlenas (1-byte header) are a
form of compression, and the default functions for detoasting always
produced unpacked values, but the operators for the text data type
know how to operate on the packed representation.  That's sort of a
trivial example, but it might well be that there are other cases where
we can do something similar.  Maybe jsonb, for example, can compress
data in such a way that some of the jsonb functions can operate
directly on the compressed representation -- perhaps the number of
keys is easily visible, for example, or maybe more.  In this view of
the world, each data type should get to define its own compression
method (or methods) but they are hard-wired into the datatype and you
can't add more later, or if you do, you lose the advantages of the
hard-wired stuff.
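
For what it's worth, the existing varlena macros already support that
style: PG_GETARG_TEXT_PP() hands back the argument in whatever header
form it has (packed 1-byte or unpacked 4-byte), and VARSIZE_ANY_EXHDR()
/ VARDATA_ANY() work on either form, so a function can avoid forcing the
unpacked representation. A trivial sketch (the function name is made up):

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(byte_length_packed);

/*
 * Return the payload length in bytes without converting a packed
 * (short-header) varlena into the 4-byte-header form.
 */
Datum
byte_length_packed(PG_FUNCTION_ARGS)
{
    text       *t = PG_GETARG_TEXT_PP(0);      /* packed or unpacked */

    PG_RETURN_INT32((int32) VARSIZE_ANY_EXHDR(t));
}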

BTW, another related concept that comes up a lot in discussions of
this area is that we could do a lot better compression of columns if
we had some place to store a per-column dictionary.  I don't really
know how to make that work.  We could have a catalog someplace that
stores an opaque blob for each column configured to use a compression
method, and let the compression method store whatever it likes in
there.  That's probably fine if you are compressing the whole table at
once and the blob is static thereafter.  But if you want to update
that blob as you see new column values there seem to be almost
insurmountable problems.

To be clear, I'm not trying to load this patch down with a requirement
to solve every problem in the universe.  On the other hand, I think it
would be easy to beat a patch like this into shape in a fairly
mechanical way and then commit-and-forget.  That might be leaving a
lot of money on the table; I'm glad you are thinking about the bigger
picture and hope that my thoughts here somehow contribute.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-12-01 Thread Tomas Vondra


On 11/30/2017 09:51 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
> 
>> On 11/30/2017 04:20 PM, Ildus Kurbangaliev wrote:
> 
>>> CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER
>>> tsvector_compression_handler;
>>
>> Understood. Good to know you've considered it, and I agree it doesn't
>> need to be there from the start (which makes the patch simpler).
> 
> Just passing by, but wouldn't this fit in the ACCESS METHOD group of
> commands?  So this could be simplified down to
> CREATE ACCESS METHOD ts1 TYPE COMPRESSION
> we have that for indexes and there are patches flying for heap storage,
> sequences, etc.  I think that's simpler than trying to invent all new
> commands here.  Then (in a future patch) you can use ALTER TYPE to
> define compression for that type, or even add a column-level option to
> reference a specific compression method.
> 

I think that would conflate two very different concepts. In my mind,
access methods define how rows are stored. Compression methods are an
orthogonal concept, e.g. you can compress a value (using a custom
compression algorithm) and store it in an index (using whatever access
method it's using). So not only do access methods operate on rows (while
compression operates on varlena values), but you can combine those two
things together. I don't see how you could do that if both are defined
as "access methods" ...

Furthermore, the "TYPE" in CREATE COMPRESSION method was meant to
restrict the compression algorithm to a particular data type (so, if it
relies on tsvector, you can't apply it to text columns). Which is very
different from "TYPE COMPRESSION" in CREATE ACCESS METHOD.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-11-30 Thread Alvaro Herrera
Tomas Vondra wrote:

> On 11/30/2017 04:20 PM, Ildus Kurbangaliev wrote:

> > CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER
> > tsvector_compression_handler;
> 
> Understood. Good to know you've considered it, and I agree it doesn't
> need to be there from the start (which makes the patch simpler).

Just passing by, but wouldn't this fit in the ACCESS METHOD group of
commands?  So this could be simplified down to
CREATE ACCESS METHOD ts1 TYPE COMPRESSION
we have that for indexes and there are patches flying for heap storage,
sequences, etc.  I think that's simpler than trying to invent all new
commands here.  Then (in a future patch) you can use ALTER TYPE to
define compression for that type, or even add a column-level option to
reference a specific compression method.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-11-30 Thread Tomas Vondra


On 11/30/2017 04:20 PM, Ildus Kurbangaliev wrote:
> On Thu, 30 Nov 2017 00:30:37 +0100
> Tomas Vondra  wrote:
>
> ...
>
>> I can imagine other interesting use cases - for example values in
>> JSONB columns often use the same "schema" (keys, nesting, ...), so
>> can I imagine building a "dictionary" of JSON keys for the whole
>> column ...
>>
>> Ildus, is this a use case you've been aiming for, or were you aiming
>> to use the new API in a different way?
> 
> Thank you for such a good overview. I agree that pglz is a pretty good
> general compression method, and there's no point in changing it, at
> least for now.
> 
> I see a few useful use cases for compression methods: special
> compression methods for int[] and timestamp[] for time series and, yes,
> dictionaries for jsonb, for which I have already created an
> extension (https://github.com/postgrespro/jsonbd). It's working and
> giving promising results.
> 

I understand the reluctance to put everything into core, particularly
for complex patches that evolve quickly. Also, not having to put
everything into core is kinda why we have extensions.

But perhaps some of the simpler cases would be good candidates for core,
making it possible to test the feature?

>>
>> I wonder if the patch can be improved to handle this use case better.
>> For example, it requires knowledge of the actual data type, instead of
>> treating it as opaque varlena / byte array. I see tsvector compression
>> does that by checking typeid in the handler.
>>
>> But that fails for example with this example
>>
>> db=# create domain x as tsvector;
>> CREATE DOMAIN
>> db=# create table t (a x compressed ts1);
>> ERROR:  unexpected type 28198672 for tsvector compression handler
>>
>> which means it's a few bricks shy of properly supporting domains. But I
>> wonder if this should instead be specified in CREATE COMPRESSION
>> METHOD. I mean, something like
>>
>> CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>> TYPE tsvector;
>>
>> When no type is specified, it applies to all varlena values. Otherwise
>> only to that type. Also, why not allow setting the compression as
>> the default method for a data type, e.g.
>>
>> CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>> TYPE tsvector DEFAULT;
>>
>> would automatically add 'COMPRESSED ts1' to all tsvector columns in
>> new CREATE TABLE commands.
> 
> The initial version of the patch contained ALTER syntax to change the
> compression method for whole types, but I have decided to remove
> that functionality for now because the patch is already quite complex;
> it could be added later as a separate patch.
> 
> Syntax was:
> ALTER TYPE  SET COMPRESSION ;
> 
> Specifying the supported type for the compression method is a good idea.
> Maybe the following syntax would be better?
> 
> CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER
> tsvector_compression_handler;
> 

Understood. Good to know you've considered it, and I agree it doesn't
need to be there from the start (which makes the patch simpler).
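
Just to collect the variants discussed above in one place (all of this is
proposed syntax from this thread, nothing committed yet):

  -- Ildus' proposed spelling: restrict the method to one data type
  CREATE COMPRESSION METHOD ts1 FOR tsvector
      HANDLER tsvector_compression_handler;

  -- alternative spelling with an optional DEFAULT, which would add
  -- 'COMPRESSED ts1' to new tsvector columns automatically
  CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
      TYPE tsvector DEFAULT;

  -- per-type default via the syntax dropped from the current patch version
  ALTER TYPE tsvector SET COMPRESSION ts1;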

>>
>> BTW do you expect the tsvector compression to be generally useful, or
>> is it meant to be used only by the tests? If generally useful,
>> perhaps it should be created in pg_compression by default. If only
>> for tests, maybe it should be implemented in an extension in contrib
>> (thus also serving as example how to implement new methods).
>>
>> I haven't thought about the JSONB use case very much, but I suppose
>> that could be done using the configure/drop methods. I mean,
>> allocating the dictionary somewhere (e.g. in a table created by an
>> extension?). The configure method gets the Form_pg_attribute record,
>> so that should be enough I guess.
>>
>> But the patch is not testing those two methods at all, which seems
>> like something that needs to be addresses before commit. I don't
>> expect a full-fledged JSONB compression extension, but something
>> simple that actually exercises those methods in a meaningful way.
> 
> I will move tsvector_compression_handler to a separate extension in
> the next version. I added it more as an example, but it could also be
> used to achieve better compression for tsvectors. Tests on the mailing
> lists database ('archie' tables):
> 
> usual compression:
> 
> maillists=# select body_tsvector, subject_tsvector into t1 from
> messages; SELECT 1114213
> maillists=# select pg_size_pretty(pg_total_relation_size('t1'));
>  pg_size_pretty 
> 
>  1637 MB
> (1 row)
> 
> tsvector_compression_handler:
> maillists=# select pg_size_pretty(pg_total_relation_size('t2'));
>  pg_size_pretty 
> 
>  1521 MB
> (1 row)
> 
> lz4:
> maillists=# select pg_size_pretty(pg_total_relation_size('t3'));
>  pg_size_pretty 
> 
>  1487 MB
> (1 row)
> 
> I don't stick to tsvector_compression_handler, I think if there
> will some example that can use all the features then
> 

Re: [HACKERS] Custom compression methods

2017-11-29 Thread Michael Paquier
On Thu, Nov 30, 2017 at 8:30 AM, Tomas Vondra
 wrote:
> On 11/28/2017 02:29 PM, Ildus Kurbangaliev wrote:
>> On Mon, 27 Nov 2017 18:20:12 +0100
>> Tomas Vondra  wrote:
>>
>>> I guess the trick might be -DRANDOMIZE_ALLOCATED_MEMORY (I first
>>> tried without it, and it seemed working fine). If that's the case,
>>> I bet there is a palloc that should have been palloc0, or something
>>> like that.
>>
>> Thanks, that was it. I've been able to reproduce this bug. The
>> attached patch should fix this bug and I've also added recompression
>> when tuples moved to the relation with the compressed attribute.
>>
>
> I've done many tests with fulltext search on the mail archive, using
> different compression algorithms, and this time it worked fine. So I can
> confirm v7 fixes the issue.

Moved to next CF.
-- 
Michael



Re: [HACKERS] Custom compression methods

2017-11-27 Thread Tomas Vondra
Hi,

On 11/27/2017 04:52 PM, Ildus Kurbangaliev wrote:
> ...
>
> Hi. This looks like a serious bug, but I couldn't reproduce it yet.
> Did you upgrade some old database, or did this bug happen after
> inserting all the data into a new database? I tried using your 'archie'
> tool to download mailing lists and insert them into a database, but
> couldn't catch any errors.
> 

I can trigger it pretty reliably with these steps:

git checkout f65d21b258085bdc8ef2cc282ab1ff12da9c595c
patch -p1 < ~/custom_compression_methods_v6.patch
./configure --enable-debug --enable-cassert \
 CFLAGS="-fno-omit-frame-pointer -O0 -DRANDOMIZE_ALLOCATED_MEMORY" \
 --prefix=/home/postgres/pg-compress
make -s clean && make -s -j4 install
cd contrib/
make -s clean && make -s -j4 install

export PATH=/home/postgres/pg-compress/bin:$PATH
pg_ctl -D /mnt/raid/pg-compress init
pg_ctl -D /mnt/raid/pg-compress -l compress.log start
createdb archie
cd ~/archie/sql/
psql archie < create.sql

~/archie/bin/load.py --workers 4 --db archie */* > load.log 2>&1


I guess the trick might be -DRANDOMIZE_ALLOCATED_MEMORY (I first tried
without it, and it seemed working fine). If that's the case, I bet there
is a palloc that should have been palloc0, or something like that.

If you still can't reproduce that, I may give you access to this machine
so that you can debug it there.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-11-24 Thread Tomas Vondra
Hi,

I ran into another issue - after inserting some data into a table with a
tsvector column (without any compression defined), I can no longer read
the data.

This is what I get in the console:

db=# select max(md5(body_tsvector::text)) from messages;
ERROR:  cache lookup failed for compression options 6432

and the stack trace looks like this:

Breakpoint 1, get_cached_compression_options (cmoptoid=6432) at
tuptoaster.c:2563
2563elog(ERROR, "cache lookup failed for compression 
options %u",
cmoptoid);
(gdb) bt
#0  get_cached_compression_options (cmoptoid=6432) at tuptoaster.c:2563
#1  0x004bf3da in toast_decompress_datum (attr=0x2b44148) at
tuptoaster.c:2390
#2  0x004c0c1e in heap_tuple_untoast_attr (attr=0x2b44148) at
tuptoaster.c:225
#3  0x0083f976 in pg_detoast_datum (datum=) at
fmgr.c:1829
#4  0x008072de in tsvectorout (fcinfo=0x2b41e00) at tsvector.c:315
#5  0x005fae00 in ExecInterpExpr (state=0x2b414b8,
econtext=0x2b25ab0, isnull=) at execExprInterp.c:1131
#6  0x0060bdf4 in ExecEvalExprSwitchContext
(isNull=0x7fe9bd37 "", econtext=0x2b25ab0, state=0x2b414b8) at
../../../src/include/executor/executor.h:299

It seems the VARATT_IS_CUSTOM_COMPRESSED check incorrectly identifies the
value as custom-compressed for some reason.

Not sure why, but the tsvector column is populated by a trigger that
simply does

NEW.body_tsvector
:= to_tsvector('english', strip_replies(NEW.body_plain));

If needed, the complete tool is here:

https://bitbucket.org/tvondra/archie
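
For reference, a minimal sketch of such a trigger (the function and trigger
names are made up here; strip_replies() is a helper from the archie schema):

  CREATE FUNCTION messages_tsvector_update() RETURNS trigger AS $$
  BEGIN
      NEW.body_tsvector
          := to_tsvector('english', strip_replies(NEW.body_plain));
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER messages_tsvector_trg
      BEFORE INSERT OR UPDATE ON messages
      FOR EACH ROW EXECUTE PROCEDURE messages_tsvector_update();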


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-11-24 Thread Ildus Kurbangaliev
On Thu, 23 Nov 2017 21:54:32 +0100
Tomas Vondra  wrote:
 
> 
> Hmm, this seems to have fixed it, but only in one direction. Consider
> this:
> 
> create table t_pglz (v text);
> create table t_lz4 (v text compressed lz4);
> 
> insert into t_pglz select repeat(md5(i::text),300)
> from generate_series(1,10) s(i);
> 
> insert into t_lz4 select repeat(md5(i::text),300)
> from generate_series(1,10) s(i);
> 
> \d+
> 
>  Schema |  Name  | Type  | Owner | Size  | Description
> ++---+---+---+-
>  public | t_lz4  | table | user  | 12 MB |
>  public | t_pglz | table | user  | 18 MB |
> (2 rows)
> 
> truncate t_pglz;
> insert into t_pglz select * from t_lz4;
> 
> \d+
> 
>  Schema |  Name  | Type  | Owner | Size  | Description
> ++---+---+---+-
>  public | t_lz4  | table | user  | 12 MB |
>  public | t_pglz | table | user  | 18 MB |
> (2 rows)
> 
> which is fine. But in the other direction, this happens
> 
> truncate t_lz4;
> insert into t_lz4 select * from t_pglz;
> 
>  \d+
>List of relations
>  Schema |  Name  | Type  | Owner | Size  | Description
> ++---+---+---+-
>  public | t_lz4  | table | user  | 18 MB |
>  public | t_pglz | table | user  | 18 MB |
> (2 rows)
> 
> which means the data is still pglz-compressed. That's rather strange,
> I guess, and it should compress the data using the compression method
> set for the target table instead.

That's actually an interesting issue. It happens because if the tuple fits
on the page then postgres just moves it as is. I've just added
recompression when it contains custom-compressed datums, to keep the
dependencies right. But look:

  create table t1(a text);
  create table t2(a text);
  alter table t2 alter column a set storage external;
  insert into t1 select repeat(md5(i::text),300) from
generate_series(1,10) s(i);
  \d+

  List of relations 
   Schema | Name | Type  | Owner |Size| Description 
  +--+---+---++-
   public | t1   | table | ildus | 18 MB  | 
   public | t2   | table | ildus | 8192 bytes | 
  (2 rows)

  insert into t2 select * from t1;

  \d+

List of relations
   Schema | Name | Type  | Owner | Size  | Description 
  +--+---+---+---+-
   public | t1   | table | ildus | 18 MB | 
   public | t2   | table | ildus | 18 MB | 
  (2 rows)

That means compressed datums are now in a column whose storage is specified
as external. I'm not sure whether that's a bug or a feature. Let's insert
them the usual way:

  delete from t2;
  insert into t2 select repeat(md5(i::text),300) from
generate_series(1,10) s(i);
  \d+

 List of relations
   Schema | Name | Type  | Owner |  Size   | Description 
  +--+---+---+-+-
   public | t1   | table | ildus | 18 MB   | 
   public | t2   | table | ildus | 1011 MB | 

Maybe there should be a more general solution, like comparing attribute
properties?

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [HACKERS] Custom compression methods

2017-11-23 Thread Tomas Vondra
Hi,

On 11/23/2017 10:38 AM, Ildus Kurbangaliev wrote:
> On Tue, 21 Nov 2017 18:47:49 +0100
> Tomas Vondra  wrote:
> 
>>>   
>>
>> Hmmm, it still doesn't work for me. See this:
>>
>> test=# create extension pg_lz4 ;
>> CREATE EXTENSION
>> test=# create table t_lz4 (v text compressed lz4);
>> CREATE TABLE
>> test=# create table t_pglz (v text);
>> CREATE TABLE
>> test=# insert into t_lz4 select repeat(md5(1::text),300);
>> INSERT 0 1
>> test=# insert into t_pglz select * from t_lz4;
>> INSERT 0 1
>> test=# drop extension pg_lz4 cascade;
>> NOTICE:  drop cascades to 2 other objects
>> DETAIL:  drop cascades to compression options for lz4
>> drop cascades to table t_lz4 column v
>> DROP EXTENSION
>> test=# \c test
>> You are now connected to database "test" as user "user".
>> test=# insert into t_lz4 select repeat(md5(1::text),300);^C
>> test=# select * from t_pglz ;
>> ERROR:  cache lookup failed for compression options 16419
>>
>> That suggests no recompression happened.
> 
> Should be fixed in the attached patch. I've changed your extension a
> little bit according to the changes in the new patch (also in the attachments).
> 

Hmm, this seems to have fixed it, but only in one direction. Consider this:

create table t_pglz (v text);
create table t_lz4 (v text compressed lz4);

insert into t_pglz select repeat(md5(i::text),300)
from generate_series(1,10) s(i);

insert into t_lz4 select repeat(md5(i::text),300)
from generate_series(1,10) s(i);

\d+

 Schema |  Name  | Type  | Owner | Size  | Description
++---+---+---+-
 public | t_lz4  | table | user  | 12 MB |
 public | t_pglz | table | user  | 18 MB |
(2 rows)

truncate t_pglz;
insert into t_pglz select * from t_lz4;

\d+

 Schema |  Name  | Type  | Owner | Size  | Description
++---+---+---+-
 public | t_lz4  | table | user  | 12 MB |
 public | t_pglz | table | user  | 18 MB |
(2 rows)

which is fine. But in the other direction, this happens

truncate t_lz4;
insert into t_lz4 select * from t_pglz;

 \d+
   List of relations
 Schema |  Name  | Type  | Owner | Size  | Description
++---+---+---+-
 public | t_lz4  | table | user  | 18 MB |
 public | t_pglz | table | user  | 18 MB |
(2 rows)

which means the data is still pglz-compressed. That's rather strange, I
guess, and it should compress the data using the compression method set
for the target table instead.
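
(As a stop-gap, and purely as a sketch, forcing the values through
detoasting should make the toaster recompress them with the target
column's method, e.g.:

  truncate t_lz4;
  -- v || '' builds a fresh, uncompressed datum, so the target table's
  -- compression method gets applied on insert
  insert into t_lz4 select v || '' from t_pglz;

but obviously the plain INSERT ... SELECT above should just work.)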

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-11-21 Thread Tomas Vondra

On 11/21/2017 09:28 PM, Ildus K wrote:
>> Hmmm, it still doesn't work for me. See this:
>>
>> test=# create extension pg_lz4 ;
>> CREATE EXTENSION
>> test=# create table t_lz4 (v text compressed lz4);
>> CREATE TABLE
>> test=# create table t_pglz (v text);
>> CREATE TABLE
>> test=# insert into t_lz4 select repeat(md5(1::text),300);
>> INSERT 0 1
>> test=# insert into t_pglz select * from t_lz4;
>> INSERT 0 1
>> test=# drop extension pg_lz4 cascade;
>> NOTICE:  drop cascades to 2 other objects
>> DETAIL:  drop cascades to compression options for lz4
>> drop cascades to table t_lz4 column v
>> DROP EXTENSION
>> test=# \c test
>> You are now connected to database "test" as user "user".
>> test=# insert into t_lz4 select repeat(md5(1::text),300);^C
>> test=# select * from t_pglz ;
>> ERROR:  cache lookup failed for compression options 16419
>>
>> That suggests no recompression happened.
> 
> I will check that. Is your extension published somewhere?
> 

No, it was just an experiment, so I've only attached it to the initial
review. Attached is an updated version, with a fix or two.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pg_lz4.tgz


Re: [HACKERS] Custom compression methods

2017-11-21 Thread Ildus K
On Tue, 21 Nov 2017 18:47:49 +0100
Tomas Vondra  wrote:

 
> 
> I propose to use either
> 
>CompressionMethodOptions (and CompressionMethodRoutine)
> 
> or
> 
>CompressionOptions (and CompressionRoutine)

Sounds good, thanks.

> 
> OK. But then I don't understand why tsvector.c does things like
> 
> VARSIZE(data) - VARHDRSZ_CUSTOM_COMPRESSED - arrsize
> VARRAWSIZE_4B_C(data) - arrsize
> 
> instead of
> 
> VARSIZE_ANY_EXHDR(data) - arrsize
> VARSIZE_ANY(data) - arrsize
> 
> Seems somewhat confusing.
> 

VARRAWSIZE_4B_C returns the original size of the data before compression
(from va_rawsize in current postgres, and from va_info in my patch), not the
size of the already compressed data, so you can't use VARSIZE_ANY here.

VARSIZE_ANY_EXHDR in current postgres returns VARSIZE - VARHDRSZ regardless
of whether the varlena is compressed or not, so I just kept this behavior
for custom compressed varlenas too. If you look into tuptoaster.c you will
also see lines like 'VARSIZE(attr) - TOAST_COMPRESS_HDRSZ'. So I think that
if VARSIZE_ANY_EXHDR were to subtract different header sizes, then it should
subtract them for the usual compressed varlenas too.

> >   
> 
> Hmmm, it still doesn't work for me. See this:
> 
> test=# create extension pg_lz4 ;
> CREATE EXTENSION
> test=# create table t_lz4 (v text compressed lz4);
> CREATE TABLE
> test=# create table t_pglz (v text);
> CREATE TABLE
> test=# insert into t_lz4 select repeat(md5(1::text),300);
> INSERT 0 1
> test=# insert into t_pglz select * from t_lz4;
> INSERT 0 1
> test=# drop extension pg_lz4 cascade;
> NOTICE:  drop cascades to 2 other objects
> DETAIL:  drop cascades to compression options for lz4
> drop cascades to table t_lz4 column v
> DROP EXTENSION
> test=# \c test
> You are now connected to database "test" as user "user".
> test=# insert into t_lz4 select repeat(md5(1::text),300);^C
> test=# select * from t_pglz ;
> ERROR:  cache lookup failed for compression options 16419
> 
> That suggests no recompression happened.

I will check that. Is your extension published somewhere?




Re: [HACKERS] Custom compression methods

2017-11-20 Thread Ildus Kurbangaliev
On Mon, 20 Nov 2017 16:29:11 +0100
Tomas Vondra  wrote:

> On 11/20/2017 04:21 PM, Евгений Шишкин wrote:
> > 
> >   
> >> On Nov 20, 2017, at 18:18, Tomas Vondra wrote:
> >>
> >>
> >> I don't think we need to do anything smart here - it should behave
> >> just like dropping a data type, for example. That is, error out if
> >> there are columns using the compression method (without CASCADE),
> >> and drop all the columns (with CASCADE).  
> > 
> > What about instead of dropping column we leave data uncompressed?
> >   
> 
> That requires you to go through the data and rewrite the whole table.
> And I'm not aware of a DROP command doing that, instead they just drop
> the dependent objects (e.g. DROP TYPE, ...). So per PLOS the DROP
> COMPRESSION METHOD command should do that too.
> 
> But I'm wondering if ALTER COLUMN ... SET NOT COMPRESSED should do
> that (currently it only disables compression for new data).

If the table is big, decompression could take an eternity. That's why I
decided only to disable it; the data can still be decompressed using the
stored compression options.

My idea was to keep compression options forever, since there will not
be many of them in one database. Still, that requires that the extension is
not removed.

I will try to find a way to recompress the data first when it moves
to another table.
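
As a sketch of the intended behaviour (again, proposed syntax from this
thread):

  -- only newly written rows are stored uncompressed; existing rows keep
  -- their compressed form and are decompressed on read using the
  -- compression options kept in the catalog
  ALTER TABLE t ALTER COLUMN v SET NOT COMPRESSED;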

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [HACKERS] Custom compression methods

2017-11-19 Thread Tomas Vondra


On 11/15/2017 02:13 PM, Robert Haas wrote:
> On Wed, Nov 15, 2017 at 4:09 AM, Ildus Kurbangaliev
>  wrote:
>> So in the next version of the patch I can just unlink the options from
>> compression methods and dropping compression method will not affect
>> already compressed tuples. They still could be decompressed.
> 
> I guess I don't understand how that can work.  I mean, if somebody
> removes a compression method - i.e. uninstalls the library - and you
> don't have a way to make sure there are no tuples that can only be
> uncompressed by that library - then you've broken the database.
> Ideally, there should be a way to add a new compression method via an
> extension ... and then get rid of it and all dependencies thereupon.
> 

I share your confusion. Once you do DROP COMPRESSION METHOD, there must
be no remaining data compressed with it. But that's what the patch is
doing already - it enforces this using dependencies, as usual.

Ildus, can you explain what you meant? How could the data still be
decompressed after DROP COMPRESSION METHOD, and possibly after removing
the .so library?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-11-19 Thread Tomas Vondra
Hi,

On 11/14/2017 02:23 PM, Ildus Kurbangaliev wrote:
>
> ...
>
> Attached version 4 of the patch. Fixed pg_upgrade and few other bugs.
> 

I did a review of this today, and I think there are some things that
need improvement / fixing.

Firstly, some basic comments from just eye-balling the diff, then some
bugs I discovered after writing an extension adding lz4.

1) formatRelOptions/freeRelOptions are no longer needed (I see Ildar
already pointed that out)

2) There's unnecessary whitespace (extra newlines) on a couple of
places, which is needlessly increasing the size of the patch. Small
difference, but annoying.

3) tuptoaster.c

Why do you change 'info' from int32 to uint32? Seems unnecessary.

Adding a new 'att' variable in toast_insert_or_update is confusing, as
there already is 'att' in the very next loop. Technically it's correct,
but I'd bet it'll lead to some WTF?! moments later. I propose to just
use TupleDescAttr(tupleDesc,i) on the two places where it matters,
around line 808.

There are no comments for init_compression_options_htab and
get_compression_options_info, so that needs to be fixed. Moreover, the
names are confusing because what we really get is not just 'options' but
the compression routines too.

4) gen_db_file_maps probably shouldn't do the fprints, right?

5) not sure why you modify src/tools/pgindent/exclude_file_patterns

6) I'm rather confused by AttributeCompression vs. ColumnCompression. I
mean, attribute==column, right? Of course, one is for data from parser,
the other one is for internal info. But can we make the naming clearer?

7) The docs in general are somewhat unsatisfactory, TBH. For example the
ColumnCompression has no comments, unlike everything else in parsenodes.
Similarly for the SGML docs - I suggest expanding them to resemble the FDW
docs (https://www.postgresql.org/docs/10/static/fdwhandler.html), which
also follow the handler/routines pattern.

8) One of the unclear things is why we even need the 'drop' routine. It
seems that if it's defined, DropAttributeCompression does something. But
what should it do? I suppose dropping the options should be done using
dependencies (just like we drop columns in this case).

BTW why does DropAttributeCompression mess with att->attisdropped in
this way? That seems a bit odd.

9) configure routines that only check if (options != NIL) and then error
out (like tsvector_configure) seem a bit unnecessary. Just allow it to
be NULL in CompressionMethodRoutine, and throw an error if options is
not NIL for such compression method.

10) toast_compress_datum still does this:

if (!ac && (valsize < PGLZ_strategy_default->min_input_size ||
valsize > PGLZ_strategy_default->max_input_size))

which seems rather pglz-specific (the naming is a hint). Why shouldn't
this be specific to compression, exposed either as min/max constants, or
wrapped in another routine - size_is_valid() or something like that?

11) The comments in toast_compress_datum probably need updating, as they
still reference pglz specifically. I guess the new compression
methods do matter too.

12) get_compression_options_info organizes the compression info into a
hash table by OID. The hash table implementation assumes the hash key is
at the beginning of the entry, but AttributeCompression is defined like
this:

typedef struct
{
CompressionMethodRoutine *routine;
List  *options;
Oid cmoptoid;
} AttributeCompression;

Which means get_compression_options_info is busted, will never lookup
anything, and the hash table will grow by adding more and more entries
into the same bucket. Of course, this has extremely negative impact on
performance (pretty much arbitrarily bad, depending on how many entries
you've already added to the hash table).

Moving the OID to the beginning of the struct fixes the issue.

13) When writing the experimental extension, I was extremely confused
about the regular varlena headers, custom compression headers, etc. In
the end I stole the code from tsvector.c and whacked it a bit until it
worked, but I wouldn't dare to claim I understand how it works.

This needs to be documented somewhere. For example postgres.h has a
bunch of paragraphs about varlena headers, so perhaps it should be
there? I see the patch tweaks some of the constants, but does not update
the comment at all.

Perhaps it would be useful to provide some additional macros making
access to custom-compressed varlena values easier. Or perhaps the
VARSIZE_ANY / VARSIZE_ANY_EXHDR / VARDATA_ANY already support that? This
part is not very clear to me.



regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pg_lz4.tgz


Re: [HACKERS] Custom compression methods

2017-11-15 Thread Robert Haas
On Wed, Nov 15, 2017 at 4:09 AM, Ildus Kurbangaliev
 wrote:
> So in the next version of the patch I can just unlink the options from
> compression methods and dropping compression method will not affect
> already compressed tuples. They still could be decompressed.

I guess I don't understand how that can work.  I mean, if somebody
removes a compression method - i.e. uninstalls the library - and you
don't have a way to make sure there are no tuples that can only be
uncompressed by that library - then you've broken the database.
Ideally, there should be a way to add a new compression method via an
extension ... and then get rid of it and all dependencies thereupon.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


