Re: [HACKERS] Default Partition for Range
Hello Beena,

Thanks for the updated patch. It passes the basic tests which I performed. A few comments:

1. In check_new_partition_bound():

>    /* Default case is not handled here */
>    if (spec->is_default)
>        break;

The default partition check here can be performed in common for both range and list partitioning.

2.

  RANGE_DATUM_FINITE = 0,   /* actual datum stored elsewhere */
+ RANGE_DATUM_DEFAULT,      /* default */

Could this have a more elaborate comment?

3. Inside make_one_range_bound():

>+
>+  /* datums are NULL for default */
>+  if (datums == NULL)
>+      for (i = 0; i < key->partnatts; i++)
>+          bound->content[i] = RANGE_DATUM_DEFAULT;
>+

IMO, returning bound at this stage will make the code clearer, as further processing does not happen for the default partition.

4.

@@ -549,6 +568,7 @@ RelationBuildPartitionDesc(Relation rel)
                sizeof(RangeDatumContent *));
            boundinfo->indexes = (int *) palloc((ndatums + 1) * sizeof(int));
+           boundinfo->default_index = -1;

This should also be done commonly for both default list and range partitions.

5.

    if (!spec->is_default && partdesc->nparts > 0)
    {
        ListCell   *cell;

        Assert(boundinfo &&
               boundinfo->strategy == PARTITION_STRATEGY_LIST &&
               (boundinfo->ndatums > 0 ||
                partition_bound_accepts_nulls(boundinfo) ||
                partition_bound_has_default(boundinfo)));

The assertion condition partition_bound_has_default(boundinfo) is rendered useless by the changed if condition and can perhaps be removed.

6. I think it is better to have the following regression test coverage:

-- Test inserting one NULL and other non-NULL values into a multi-column range-partitioned table.

postgres=# CREATE TABLE range_parted (
postgres(#     a int,
postgres(#     b int
postgres(# ) PARTITION BY RANGE (a, b);
CREATE TABLE
postgres=# CREATE TABLE part1 (
postgres(#     a int NOT NULL CHECK (a = 1),
postgres(#     b int NOT NULL CHECK (b >= 1 AND b <= 10)
postgres(# );
CREATE TABLE
postgres=# ALTER TABLE range_parted ATTACH PARTITION part1 FOR VALUES FROM (1, 1) TO (1, 10);
ALTER TABLE
postgres=# CREATE TABLE partr_def1 PARTITION OF range_parted DEFAULT;
CREATE TABLE
postgres=# insert into range_parted values (1,NULL);
INSERT 0 1
postgres=# insert into range_parted values (NULL,9);
INSERT 0 1

-- Test boundary conditions: as in the above example, the following should pass.
postgres=# insert into partr_def1 values (1,10);
INSERT 0 1

Thank you,
Rahila Syed

On Mon, Jul 3, 2017 at 8:00 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Fri, Jun 30, 2017 at 11:56 AM, Beena Emerson <memissemer...@gmail.com> wrote:
> > Hello Dilip,
> >
> > On Wed, Jun 21, 2017 at 6:27 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> >> On Tue, Jun 20, 2017 at 6:57 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> >>> This is basically crashing in RelationBuildPartitionDesc, so I think
> >
> > Thank you for your review and analysis.
> >
> > I have updated the patch.
> > - bound->content is set to RANGE_DATUM_DEFAULT for each of the keys,
> >   not just the first one.
> > - Improved the way of handling DEFAULT bounds in RelationBuildPartitionDesc.
> >
> > There is a test for multiple-column range partitioning in alter_table.sql.
>
> Thanks for the updated patch. After this fix the segmentation fault is solved.
>
> I have some more comments.
>
> 1.
>
> lower = make_one_range_bound(key, -1, spec->lowerdatums, true);
> upper = make_one_range_bound(key, -1, spec->upperdatums, false);
>
> + /* Default case is not handled here */
> + if (spec->is_default)
> +     break;
> +
>
> I think we can move the default check just above making the range bounds;
> it will avoid unnecessary processing.
>
> 2.
>
> RelationBuildPartitionDesc
> {
>
>     /*
>      * If either of them has infinite element, we can't equate
>      * them. Even when both are infinite, they'd have
>      * opposite signs, because only one of cur and prev is a
>      * lower bound).
>      */
>     if (cur->content[j] != RANGE_DATUM_FINITE ||
>         prev->content[j] != RANGE_DATUM_FINITE)
>     {
>         is_distinct = true;
>         break;
>     }
>
> After the default range partitioning patch, the above comment doesn't sound
> very accurate and can lead to confusion: now we are not only considering
> infinite bounds, but there is also the possibility that the prev bound can
> be DEFAULT, so
Re: [HACKERS] Default Partition for Range
Hi Beena,

I started testing and reviewing the patch. Can you update it? The v5 patch does not apply cleanly on master.

Thank you,
Rahila Syed

On Wed, Jun 21, 2017 at 8:43 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Wed, Jun 21, 2017 at 8:08 PM, Robert Haas <robertmh...@gmail.com> wrote:
> > I think somebody should do some testing of the existing code with
> > valgrind. And then apply the list-partitioning patch and this patch,
> > and do some more testing with valgrind. It seems to be really easy to
> > miss these uninitialized access problems during code review.
>
> I think it will help, but it may not help in all the scenarios,
> because in most of the places we are allocating memory with palloc0
> (initialized with 0), but it is never initialized with RANGE_DATUM_DEFAULT
> except for the first element (in the case of a DEFAULT partition). And,
> later, the remaining elements may be considered as RANGE_DATUM_FINITE
> (because their value is 0).
>
> One solution can be: if the bound is DEFAULT, then initialize the complete
> content array for that partition bound with RANGE_DATUM_DEFAULT instead of
> just the first element. Otherwise, we need to be careful about early exits
> wherever we are looping over the content array of the DEFAULT bound.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Dropping partitioned table drops a previously detached partition
I reviewed and tested 0001-Dependency-between-partitioned-table-and-partition_v1.patch.
It applies cleanly on master and make check passes. Following are a few comments:

> /*
>  * Drop the dependency created by StoreCatalogInheritance1 (CREATE TABLE
>  * INHERITS/ALTER TABLE INHERIT -- refclassid will be RelationRelationId) or
>  * heap_create_with_catalog (CREATE TABLE OF/ALTER TABLE OF -- refclassid will
>  * be TypeRelationId). There's no convenient way to do this, so go trawling
>  * through pg_depend.
>  */
> static void
> drop_parent_dependency(Oid relid, Oid refclassid, Oid refobjid, DependencyType deptype)

This is not directly related to this patch, but I would still like to mention it:
drop_parent_dependency() is also being used to drop the dependency created by the
CREATE TABLE PARTITION OF command, so probably the comment needs to be modified to
reflect that.

>+/*
>+ * Partition tables are expected to be dropped when the parent partitioned
>+ * table gets dropped. Hence for partitioning we use AUTO dependency.
>+ * Otherwise, for regular inheritance use NORMAL dependency.

A minor cosmetic correction:

+ Hence for *declarative* partitioning we use AUTO dependency
+ * Otherwise, for regular inheritance *we* use NORMAL dependency.

I have added tests to the 0001-Dependency-between-partitioned-table-and-partition_v1.patch.
Please find attached the v2 patch.

On Tue, Jun 13, 2017 at 10:25 AM, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> Hi Ashutosh,
>
> On 2017/06/12 20:56, Ashutosh Bapat wrote:
> > Hi,
> > If we detach a partition and drop the corresponding partitioned table,
> > it drops the once-partition now-standalone table as well.
>
> Thanks for the report. Looks like 8b4d582d279d78 missed the bit about
> drop_parent_dependency() that you describe in your analysis below.
>
> > The reason for this is as follows:
> > An AUTO dependency is recorded between a partitioned table and a partition.
> > In case of inheritance we record a NORMAL dependency between parent and
> > child. While detaching a partition, we call RemoveInheritance(), which
> > should have taken care of removing this dependency. But it removes the
> > dependencies which are marked as NORMAL. When we changed the dependency for
> > the partitioned case from NORMAL to AUTO by updating
> > StoreCatalogInheritance1(), this function was not updated. This caused the
> > dependency to linger behind even after detaching the partition, thus
> > causing the now standalone table (which was once a partition) to be dropped
> > when the parent partitioned table is dropped. This patch fixes
> > RemoveInheritance() to choose the appropriate dependency.
> >
> > Attached patch 0001 to fix this.
>
> Looks good to me. Perhaps, the example in your email could be added as a
> test case.
>
> > I see a similar issue in the making in case we change the type of
> > dependency recorded between a table and the composite type associated
> > with it using the CREATE TABLE ... OF command. The 0002 patch addresses
> > the potential issue by using a macro while creating and dropping the
> > dependency in the corresponding functions. There might be more such
> > places, but I haven't searched for those.
>
> Might be a good idea too.
>
> Adding this to the open items list.
>
> Thanks,
> Amit

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b61fda9..9aef67b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -283,6 +283,14 @@ struct DropRelationCallbackState
 #define ATT_COMPOSITE_TYPE 0x0010
 #define ATT_FOREIGN_TABLE  0x0020

+/*
+ * Partition tables are expected to be dropped when the parent partitioned
+ * table gets dropped. Hence for partitioning we use AUTO dependency.
+ * Otherwise, for regular inheritance use NORMAL dependency.
+ */
+#define child_dependency_type(child_is_partition) \
+   ((child_is_partition) ? DEPENDENCY_AUTO : DEPENDENCY_NORMAL)
+
 static void truncate_check_rel(Relation rel);
 static List *MergeAttributes(List *schema, List *supers, char relpersistence,
                bool is_partition, List **supOids, List **supconstr,
@@ -439,7 +447,8 @@ static void ATExecEnableDisableRule(Relation rel, char *rulename,
 static void ATPrepAddInherit(Relation child_rel);
 static ObjectAddress ATExecAddInherit(Relation child_rel, RangeVar *parent, LOCKMODE lockmode);
 static ObjectAddress ATExecDropInherit(Relation rel, RangeVar *parent, LOCKMODE lockmode);
-static void drop_parent_dependency(Oid relid, Oid refclassid, Oid refobjid);
+static void drop_parent_dependency(Oid relid, Oid refclassid, Oid refobjid,
+                   DependencyType deptype);
 static ObjectAddress ATExecAddOf(Relation rel, const TypeName *ofTypename, LOCKMODE lockmode);
 static void
Re: [HACKERS] Adding support for Default partition in partitioning
Hello,

>(1) With the new patch, we allow new partitions when there is overlapping data
>with default partition. The entries in default are ignored when running
>queries satisfying the new partition.

This was introduced in the latest version. We are not allowing adding a
partition when entries with the same key value exist in the default partition,
so this scenario should not come into the picture. Please find attached an
updated patch which corrects this.

>(2) I get the following warning:
>partition.c: In function ‘check_new_partition_bound’:
>partition.c:882:15: warning: ‘boundinfo’ may be used uninitialized in this
>function [-Wmaybe-uninitialized]
>    && boundinfo->has_default)
>    ^
>preproc.y:3250.2-8: warning: type clash on default action: != <>

I failed to notice this warning. I will look into it.

>This is incredibly ugly. I don't know exactly what should be done
>about it, but I think PARTITION_DEFAULT is a bad idea and has got to
>go. Maybe add a separate isDefault flag to PartitionBoundSpec

Will look at other ways to do it.

>Doesn't it strike you as a bit strange that get_qual_for_default()
>doesn't return a qual? Functions should generally have names that
>describe what they do.

Will fix this.

>There's an existing function that you can use to concatenate two lists
>instead of open-coding it.

Will check this.

>you should really add the documentation and
>regression tests which you mentioned as a TODO. And run the code
>through pgindent

I will also update the next version with documentation and regression tests,
and run pgindent.

Thank you,
Rahila Syed

On Fri, May 12, 2017 at 4:33 PM, Beena Emerson <memissemer...@gmail.com> wrote:
>
> On Thu, May 11, 2017 at 7:37 PM, Rahila Syed <rahilasye...@gmail.com> wrote:
>
>> Hello,
>>
>> Please find attached an updated patch with review comments and bugs
>> reported till date implemented.
>
> Hello Rahila,
>
> Tested on "efa2c18 Doc fix: scale(numeric) returns integer, not numeric."
>
> (1) With the new patch, we allow new partitions when there is overlapping
> data with default partition. The entries in default are ignored when
> running queries satisfying the new partition.
>
> DROP TABLE list1;
> CREATE TABLE list1 (
>     a int,
>     b int
> ) PARTITION BY LIST (a);
> CREATE TABLE list1_1 (LIKE list1);
> ALTER TABLE list1 ATTACH PARTITION list1_1 FOR VALUES IN (1);
> CREATE TABLE list1_def PARTITION OF list1 DEFAULT;
> INSERT INTO list1 SELECT generate_series(1,2),1;
> -- Partition overlapping with DEF
> CREATE TABLE list1_2 PARTITION OF list1 FOR VALUES IN (2);
> INSERT INTO list1 SELECT generate_series(2,3),2;
>
> postgres=# SELECT * FROM list1 ORDER BY a,b;
>  a | b
> ---+---
>  1 | 1
>  2 | 1
>  2 | 2
>  3 | 2
> (4 rows)
>
> postgres=# SELECT * FROM list1 WHERE a=2;
>  a | b
> ---+---
>  2 | 2
> (1 row)
>
> This ignores the a=2 entries in the DEFAULT.
>
> postgres=# SELECT * FROM list1_def;
>  a | b
> ---+---
>  2 | 1
>  3 | 2
> (2 rows)
>
> (2) I get the following warning:
>
> partition.c: In function ‘check_new_partition_bound’:
> partition.c:882:15: warning: ‘boundinfo’ may be used uninitialized in this
> function [-Wmaybe-uninitialized]
>     && boundinfo->has_default)
>     ^
> preproc.y:3250.2-8: warning: type clash on default action: != <>
>
>> >1.
>> >In following block, we can just do with def_index, and we do not need
>> >found_def flag. We can check if def_index is -1 or not to decide if
>> >default partition is present.
>>
>> found_def is used to set boundinfo->has_default, which is used at a couple
>> of other places to check if a default partition exists. The implementation
>> is similar to has_null.
>>
>> >3.
>> >In following function isDefaultPartitionBound, first statement "return
>> >false" is not needed.
>>
>> It is needed to return false if the node is not a DefElem.
>>
>> Todo:
>> Add regression tests
>> Documentation
>>
>> Thank you,
>> Rahila Syed

default_partition_v11.patch
Description: application/download
Re: [HACKERS] Adding support for Default partition in partitioning
Hello,

Please find attached an updated patch with the review comments and bugs
reported till date implemented.

>1.
>In following block, we can just do with def_index, and we do not need
>found_def flag. We can check if def_index is -1 or not to decide if default
>partition is present.

found_def is used to set boundinfo->has_default, which is used at a couple of
other places to check if a default partition exists. The implementation is
similar to has_null.

>3.
>In following function isDefaultPartitionBound, first statement "return false"
>is not needed.

It is needed to return false if the node is not a DefElem.

Todo:
Add regression tests
Documentation

Thank you,
Rahila Syed

On Fri, May 5, 2017 at 1:30 AM, Jeevan Ladhe <jeevan.la...@enterprisedb.com> wrote:
> Hi Rahila,
>
> I have started reviewing your latest patch, and here are my initial comments:
>
> 1.
> In following block, we can just do with def_index, and we do not need found_def
> flag. We can check if def_index is -1 or not to decide if default partition is
> present.
>
> @@ -166,6 +172,8 @@ RelationBuildPartitionDesc(Relation rel)
>     /* List partitioning specific */
>     PartitionListValue **all_values = NULL;
>     bool found_null = false;
> +   bool found_def = false;
> +   int def_index = -1;
>     int null_index = -1;
>
> 2.
> In check_new_partition_bound, in case of PARTITION_STRATEGY_LIST, remove the
> following duplicate declaration of boundinfo, because it is confusing and,
> after your changes, it is not needed, as it is not getting overridden in the
> if block locally.
>
>     if (partdesc->nparts > 0)
>     {
>         PartitionBoundInfo boundinfo = partdesc->boundinfo;
>         ListCell *cell;
>
> 3.
> In following function isDefaultPartitionBound, first statement "return false"
> is not needed.
>
> + * Returns true if the partition bound is default
> + */
> +bool
> +isDefaultPartitionBound(Node *value)
> +{
> +   if (IsA(value, DefElem))
> +   {
> +       DefElem *defvalue = (DefElem *) value;
> +       if (!strcmp(defvalue->defname, "DEFAULT"))
> +           return true;
> +       return false;
> +   }
> +   return false;
> +}
>
> 4.
> As mentioned in my previous set of comments, the following if block inside a
> loop in get_qual_for_default needs a break:
>
> + foreach(cell1, bspec->listdatums)
> + {
> +     Node *value = lfirst(cell1);
> +     if (isDefaultPartitionBound(value))
> +     {
> +         def_elem = true;
> +         *defid = inhrelid;
> +     }
> + }
>
> 5.
> In the grammar the rule default_part_list is not needed:
>
> +default_partition:
> +   DEFAULT { $$ = (Node *)makeDefElem("DEFAULT", NULL, @1); }
> +
> +default_part_list:
> +   default_partition { $$ = list_make1($1); }
> +   ;
> +
>
> Instead you can simply declare default_partition as a list and write it as:
>
> default_partition:
>     DEFAULT
>     {
>         Node *def = (Node *)makeDefElem("DEFAULT", NULL, @1);
>         $$ = list_make1(def);
>     }
>
> 6.
> You need to change the output of the describe command, which is currently as
> below:
>
> postgres=# \d+ test;
>                                      Table "public.test"
>  Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
> --------+---------+-----------+----------+---------+---------+--------------+-------------
>  a      | integer |           |          |         | plain   |              |
>  b      | date    |           |          |         | plain   |              |
> Partition key: LIST (a)
> Partitions: pd FOR VALUES IN (DEFAULT),
>             test_p1 FOR VALUES IN (4, 5)
>
> What about changing the Partitions output as below:
>
> Partitions: *pd DEFAULT,* test_p1 FOR VALUES IN (4, 5)
>
> 7.
> You need to handle tab completion for DEFAULT.
> e.g.
> If I partially type the following command:
> CREATE TABLE pd PARTITION OF test DEFA
> and then press tab, I get the following completion:
> CREATE TABLE pd PARTITION OF test FOR VALUES
>
> I did some primary testing and did not find any problem so far.
>
> I will review and test further and let you know my comments.
>
> Regards,
> Jeevan Ladhe
>
> On Thu, May 4, 2017 at 6:09 PM, Rajkumar Raghuwanshi <
> rajkumar.raghuwan...@enterprisedb.com> wrote:
>
>> On Thu, May 4, 2017 at 5:14 PM, Rahila Syed <rahilasye...@gmail.com> wrote:
>>
>>> The syntax implemented in this patch is as follows,
>>>
>>> CREATE TABLE p11 PARTITION OF p1 DEFAULT;
>>>
>> Applied v9 patches, table description still showing old pattern of
>> default partition. Is it expected?
>>
>> create table lpd (a int, b int, c varchar) partition by list(a);
>> create table lpd_d partition of lpd DEFAULT;
>>
>> \d+ lpd
>> Table "publi
Re: [HACKERS] Adding support for Default partition in partitioning
>It seems that adding a new partition at the same level as the default
>partition will require scanning it or its (leaf) partitions if
>partitioned. Consider that p1, pd are partitions of a list-partitioned
>table p accepting 1 and everything else, respectively, and pd is further
>partitioned. When adding p2 of p for 2, we need to scan the partitions of
>pd to check if there are any (2, ...) rows.

This is a better explanation. Maybe the following sentence was confusing:
"That is, prohibit creation of a new partition after a default partition which
is further partitioned." Here, what I meant was that the default partition is
partitioned further.

>As for fixing the reported issue whereby the partitioned default
>partition's non-existent file is being accessed, it would help to take a
>look at the code in ATExecAttachPartition() starting at the following:

OK, I get it now. If ATTACH PARTITION already supports scanning all the
partitions before attaching, similar support should be provided in the case of
adding a partition after a default partition as well.

Thank you,
Rahila Syed

On Wed, May 10, 2017 at 6:42 AM, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/05/10 2:09, Robert Haas wrote:
> > On Tue, May 9, 2017 at 9:26 AM, Rahila Syed <rahilasye...@gmail.com> wrote:
> >>> Hi Rahila,
> >>
> >>> I am not able to add a new partition if the default partition is further
> >>> partitioned with a default partition.
> >>
> >>> Consider example below:
> >>
> >>> postgres=# CREATE TABLE test ( a int, b int, c int) PARTITION BY LIST (a);
> >>> CREATE TABLE
> >>> postgres=# CREATE TABLE test_p1 PARTITION OF test FOR VALUES IN(4, 5, 6, 7, 8);
> >>> CREATE TABLE
> >>> postgres=# CREATE TABLE test_pd PARTITION OF test DEFAULT PARTITION BY LIST(b);
> >>> CREATE TABLE
> >>> postgres=# CREATE TABLE test_pd_pd PARTITION OF test_pd DEFAULT;
> >>> CREATE TABLE
> >>> postgres=# INSERT INTO test VALUES (20, 24, 12);
> >>> INSERT 0 1
> >>> postgres=# CREATE TABLE test_p2 PARTITION OF test FOR VALUES IN(15);
> >> ERROR: could not open file "base/12335/16420": No such file or directory
> >>
> >> Regarding fix for this I think we need to prohibit this case. That is,
> >> prohibit creation of a new partition after a default partition which is
> >> further partitioned.
> >> Currently before adding a new partition after a default partition all the
> >> rows of the default partition are scanned and if a row which matches the
> >> new partition's constraint exists the new partition is not added.
> >>
> >> If we allow this for a default partition which is partitioned further, we
> >> will have to scan all the partitions of the default partition for matching
> >> rows which can slow down execution.
> >
> > I think this case should be allowed
>
> +1
>
> > and I don't think it should
> > require scanning all the partitions of the default partition. This is
> > no different than any other case where multiple levels of partitioning
> > are used. First, you route the tuple at the root level; then, you
> > route it at the next level; and so on. It shouldn't matter whether
> > the routing at the top level is to that level's default partition or
> > not.
>
> It seems that adding a new partition at the same level as the default
> partition will require scanning it or its (leaf) partitions if
> partitioned. Consider that p1, pd are partitions of a list-partitioned
> table p accepting 1 and everything else, respectively, and pd is further
> partitioned. When adding p2 of p for 2, we need to scan the partitions of
> pd to check if there are any (2, ...) rows.
>
> As for fixing the reported issue whereby the partitioned default
> partition's non-existent file is being accessed, it would help to take a
> look at the code in ATExecAttachPartition() starting at the following:
>
>     /*
>      * Set up to have the table be scanned to validate the partition
>      * constraint (see partConstraint above). If it's a partitioned table, we
>      * instead schedule its leaf partitions to be scanned.
>      */
>     if (!skip_validate)
>     {
>
> Thanks,
> Amit
Re: [HACKERS] Adding support for Default partition in partitioning
>Hi Rahila,
>I am not able to add a new partition if the default partition is further
>partitioned with a default partition.
>Consider example below:
>postgres=# CREATE TABLE test ( a int, b int, c int) PARTITION BY LIST (a);
>CREATE TABLE
>postgres=# CREATE TABLE test_p1 PARTITION OF test FOR VALUES IN(4, 5, 6, 7, 8);
>CREATE TABLE
>postgres=# CREATE TABLE test_pd PARTITION OF test DEFAULT PARTITION BY LIST(b);
>CREATE TABLE
>postgres=# CREATE TABLE test_pd_pd PARTITION OF test_pd DEFAULT;
>CREATE TABLE
>postgres=# INSERT INTO test VALUES (20, 24, 12);
>INSERT 0 1
>postgres=# CREATE TABLE test_p2 PARTITION OF test FOR VALUES IN(15);
>ERROR: could not open file "base/12335/16420": No such file or directory

Regarding a fix for this, I think we need to prohibit this case; that is,
prohibit creation of a new partition after a default partition which is further
partitioned. Currently, before adding a new partition after a default
partition, all the rows of the default partition are scanned, and if a row
which matches the new partition's constraint exists, the new partition is not
added.

If we allow this for a default partition which is partitioned further, we will
have to scan all the partitions of the default partition for matching rows,
which can slow down execution. So, to not hamper performance, an error should
be thrown in this case, and the user should be expected to change his schema to
avoid partitioning default partitions.

Kindly give your opinions.

On Fri, May 5, 2017 at 12:46 PM, Jeevan Ladhe wrote:
> Hi Rahila,
>
> I am not able to add a new partition if the default partition is further
> partitioned with a default partition.
>
> Consider example below:
>
> postgres=# CREATE TABLE test ( a int, b int, c int) PARTITION BY LIST (a);
> CREATE TABLE
> postgres=# CREATE TABLE test_p1 PARTITION OF test FOR VALUES IN(4, 5, 6, 7, 8);
> CREATE TABLE
> postgres=# CREATE TABLE test_pd PARTITION OF test DEFAULT PARTITION BY LIST(b);
> CREATE TABLE
> postgres=# CREATE TABLE test_pd_pd PARTITION OF test_pd DEFAULT;
> CREATE TABLE
> postgres=# INSERT INTO test VALUES (20, 24, 12);
> INSERT 0 1
> postgres=# CREATE TABLE test_p2 PARTITION OF test FOR VALUES IN(15);
> ERROR: could not open file "base/12335/16420": No such file or directory
>
> Thanks,
> Jeevan Ladhe
>
> On Fri, May 5, 2017 at 11:55 AM, Rajkumar Raghuwanshi <
> rajkumar.raghuwan...@enterprisedb.com> wrote:
>
>> Hi Rahila,
>>
>> pg_restore is failing for default partition, dump file still storing old
>> syntax of default partition.
>>
>> create table lpd (a int, b int, c varchar) partition by list(a);
>> create table lpd_d partition of lpd DEFAULT;
>>
>> create database bkp owner 'edb';
>> grant all on DATABASE bkp to edb;
>>
>> --take plain dump of existing database
>> \! ./pg_dump -f lpd_test.sql -Fp -d postgres
>>
>> --restore plain backup to new database bkp
>> \! ./psql -f lpd_test.sql -d bkp
>>
>> psql:lpd_test.sql:63: ERROR: syntax error at or near "DEFAULT"
>> LINE 2: FOR VALUES IN (DEFAULT);
>>                        ^
>>
>> vi lpd_test.sql
>>
>> --
>> -- Name: lpd; Type: TABLE; Schema: public; Owner: edb
>> --
>>
>> CREATE TABLE lpd (
>>     a integer,
>>     b integer,
>>     c character varying
>> )
>> PARTITION BY LIST (a);
>>
>> ALTER TABLE lpd OWNER TO edb;
>>
>> --
>> -- Name: lpd_d; Type: TABLE; Schema: public; Owner: edb
>> --
>>
>> CREATE TABLE lpd_d PARTITION OF lpd
>> FOR VALUES IN (DEFAULT);
>>
>> ALTER TABLE lpd_d OWNER TO edb;
>>
>> Thanks,
>> Rajkumar
Re: [HACKERS] Adding support for Default partition in partitioning
+1 for the AS DEFAULT syntax if it helps in improving readability, especially in the following case:

CREATE TABLE p1 PARTITION OF test AS DEFAULT PARTITION BY LIST(a);

Thank you,
Rahila Syed

On Tue, May 9, 2017 at 1:13 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, May 4, 2017 at 4:40 PM, Sven R. Kunze <srku...@mail.de> wrote:
> > It yields
> >
> > CREATE TABLE p1 PARTITION OF test DEFAULT PARTITION BY LIST(b);
> >
> > This reads to me like "DEFAULT PARTITION".
> >
> > I can imagine a lot of confusion when those queries are encountered in the
> > wild. I know this thread is about creating a default partition, but if I
> > were to propose a minor change in the following direction, I think
> > confusion would be greatly avoided:
> >
> > CREATE TABLE p1 PARTITION OF test AS DEFAULT PARTITIONED BY LIST(b);
> >
> > I know it's a bit longer but I think those 4 characters might serve
> > readability in the long term. It was especially confusing to see PARTITION
> > in two positions serving two different functions.
>
> Well, we certainly can't make that change just for default partitions.
> I mean, that would be non-orthogonal, right? You can't say that the
> way to subpartition is to write "PARTITION BY strategy" when the table is
> unpartitioned or is a non-default partition but "PARTITIONED BY
> strategy" when it is a default partition. That would certainly not be
> a good way of confusing users less, and would probably result in a
> variety of special cases in places like ruleutils.c or pg_dump, plus
> some weasel-wording in the documentation. We COULD do a general
> change from "CREATE TABLE table_name PARTITION BY strategy" to "CREATE
> TABLE table_name PARTITIONED BY strategy". I don't have any
> particular arguments against that except that the current syntax is
> more like Oracle, which might count for something, and maybe the fact
> that we're a month after feature freeze. Still, if we want to change
> that, now would be the time; but I favor leaving it alone.
>
> I don't have a big objection to adding AS. If that's the majority
> vote, fine; if not, that's OK, too. I can see it might be a bit more
> clear in the case you mention, but it might also just be a noise word
> that we don't really need. There don't seem to be many uses of AS
> that would pose a risk of actual grammar conflicts here. I can
> imagine someone wanting to use CREATE TABLE ... PARTITION BY ... AS
> SELECT ... to create and populate a partition in one command, but that
> wouldn't be a conflict because it'd have to go AFTER the partition
> specification. In the DEFAULT case, you'd end up with something like
>
> CREATE TABLE p1 PARTITION OF test AS DEFAULT AS
>
> ...which is neither great nor horrible syntax-wise and maybe not such
> a good thing to support anyway, since it would have to lock the parent
> to add the partition and then keep the lock on the parent while
> populating the new child (ouch).
>
> So I guess I'm still in favor of the CREATE TABLE p1 PARTITION OF test
> DEFAULT syntax, but if it ends up being AS DEFAULT instead, I can live
> with that.
>
> Other opinions?
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Re: [HACKERS] Adding support for Default partition in partitioning
>I am not able add a new partition if default partition is further partitioned
>with default partition.

Thanks for reporting. I will fix this.

>pg_restore is failing for default partition, dump file still storing old
>syntax of default partition.

Thanks for reporting. I will fix this once the syntax is finalized.

On Fri, May 5, 2017 at 12:46 PM, Jeevan Ladhe wrote:
> Hi Rahila,
>
> I am not able to add a new partition if the default partition is further
> partitioned with a default partition.
>
> Consider example below:
>
> postgres=# CREATE TABLE test ( a int, b int, c int) PARTITION BY LIST (a);
> CREATE TABLE
> postgres=# CREATE TABLE test_p1 PARTITION OF test FOR VALUES IN(4, 5, 6, 7, 8);
> CREATE TABLE
> postgres=# CREATE TABLE test_pd PARTITION OF test DEFAULT PARTITION BY LIST(b);
> CREATE TABLE
> postgres=# CREATE TABLE test_pd_pd PARTITION OF test_pd DEFAULT;
> CREATE TABLE
> postgres=# INSERT INTO test VALUES (20, 24, 12);
> INSERT 0 1
> postgres=# CREATE TABLE test_p2 PARTITION OF test FOR VALUES IN(15);
> ERROR: could not open file "base/12335/16420": No such file or directory
>
> Thanks,
> Jeevan Ladhe
>
> On Fri, May 5, 2017 at 11:55 AM, Rajkumar Raghuwanshi <
> rajkumar.raghuwan...@enterprisedb.com> wrote:
>
>> Hi Rahila,
>>
>> pg_restore is failing for default partition, dump file still storing old
>> syntax of default partition.
>>
>> create table lpd (a int, b int, c varchar) partition by list(a);
>> create table lpd_d partition of lpd DEFAULT;
>>
>> create database bkp owner 'edb';
>> grant all on DATABASE bkp to edb;
>>
>> --take plain dump of existing database
>> \! ./pg_dump -f lpd_test.sql -Fp -d postgres
>>
>> --restore plain backup to new database bkp
>> \! ./psql -f lpd_test.sql -d bkp
>>
>> psql:lpd_test.sql:63: ERROR: syntax error at or near "DEFAULT"
>> LINE 2: FOR VALUES IN (DEFAULT);
>>                        ^
>>
>> vi lpd_test.sql
>>
>> --
>> -- Name: lpd; Type: TABLE; Schema: public; Owner: edb
>> --
>>
>> CREATE TABLE lpd (
>>     a integer,
>>     b integer,
>>     c character varying
>> )
>> PARTITION BY LIST (a);
>>
>> ALTER TABLE lpd OWNER TO edb;
>>
>> --
>> -- Name: lpd_d; Type: TABLE; Schema: public; Owner: edb
>> --
>>
>> CREATE TABLE lpd_d PARTITION OF lpd
>> FOR VALUES IN (DEFAULT);
>>
>> ALTER TABLE lpd_d OWNER TO edb;
>>
>> Thanks,
>> Rajkumar
Re: [HACKERS] Adding support for Default partition in partitioning
Hello Amul, Thanks for reporting. Please find attached an updated patch which fixes the above. Also, the attached patch includes the syntax changes proposed upthread. The syntax implemented in this patch is as follows: CREATE TABLE p11 PARTITION OF p1 DEFAULT; Thank you, Rahila Syed On Thu, May 4, 2017 at 4:02 PM, amul sul <sula...@gmail.com> wrote: > On Tue, May 2, 2017 at 9:33 PM, Rahila Syed <rahilasye...@gmail.com> > wrote: > > Please find attached updated patch with review comments by Robert and > Jeevan > > implemented. > > > Patch v8 got clean apply on latest head but server got crash at data > insert in the following test: > > -- Create test table > CREATE TABLE test ( a int, b date) PARTITION BY LIST (a); > CREATE TABLE p1 PARTITION OF test FOR VALUES IN (DEFAULT) PARTITION BY > LIST(b); > CREATE TABLE p11 PARTITION OF p1 FOR VALUES IN (DEFAULT); > > -- crash > INSERT INTO test VALUES (210,'1/1/2002'); > > Regards, > Amul > Attachment: default_partition_v9.patch -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Adding support for Default partition in partitioning
Please find attached updated patch with review comments by Robert and Jeevan implemented. The newly proposed syntax CREATE TABLE .. PARTITION OF .. DEFAULT has got most votes on this thread. If there is no more objection, I will go ahead and include that in the patch. Thank you, Rahila Syed On Mon, Apr 24, 2017 at 2:40 PM, Rahila Syed <rahilasye...@gmail.com> wrote: > Hello, > > Thank you for reviewing. > > >But that's not a good idea for several reasons. For one thing, you > >can also write FOR VALUES IN (DEFAULT, 5) or which isn't sensible. > >For another thing, this kind of syntax won't generalize to range > >partitioning, which we've talked about making this feature support. > >Maybe something like: > > >CREATE TABLE .. PARTITION OF .. DEFAULT; > > I agree that the syntax should be changed to also support range > partitioning. > > Following can also be considered as it specifies more clearly that the > partition holds default values. > > CREATE TABLE ...PARTITION OF...FOR VALUES DEFAULT; > > >Maybe we should introduce a dedicated node type to > >represent a default-specification in the parser grammar. If not, then > >let's at least encapsulate the test a little better, e.g. by adding > >isDefaultPartitionBound() which tests not only IsA(..., DefElem) but > >also whether the name is DEFAULT as expected. BTW, we typically use > >lower-case internally, so if we stick with this representation it > >should really be "default" not "DEFAULT". > > isDefaultPartitionBound() function is created in the attached patch which > checks for both node type and name. > > >Why abbreviate "default" to def here? Seems pointless. > Corrected in the attached. > > >Consider && > Fixed. > > >+ * default partiton for rows satisfying the new partition > >Spelling. > Fixed. > > >Missing apostrophe > Fixed. > > >Definitely not safe against concurrency, since AccessShareLock won't > >exclude somebody else's update. 
In fact, it won't even cover somebody > >else's already-in-flight transaction > Changed it to AccessExclusiveLock > > >Normally in such cases we try to give more detail using > >ExecBuildSlotValueDescription. > This function is used in execMain.c and the error is being > reported in partition.c. > Do you mean the error reporting should be moved into execMain.c > to use ExecBuildSlotValueDescription? > > >This variable starts out true and is never set to any value other than > >true. Just get rid of it and, in the one place where it is currently > >used, write "true". That's shorter and clearer. > Fixed. > > >There's not really a reason to cast the result of stringToNode() to > >Node * and then turn around and cast it to PartitionBoundSpec *. Just > >cast it directly to whatever it needs to be. And use the new castNode > >macro > Fixed. castNode macro takes as input Node * whereas stringToNode() takes > string. > IIUC, castNode cant be used here. > > >The if (def_elem) test continues > >early, but if the point is that the loop using cell3 shouldn't execute > >in that case, why not just put if (!def_elem) { foreach(cell3, ...) { > >... } } instead of reiterating the ReleaseSysCache in two places? > Fixed in the attached. > > I will respond to further comments in following email. > > > On Thu, Apr 13, 2017 at 12:48 AM, Robert Haas <robertmh...@gmail.com> > wrote: > >> On Thu, Apr 6, 2017 at 7:30 AM, Rahila Syed <rahilasye...@gmail.com> >> wrote: >> > Thanks a lot for testing and reporting this. Please find attached an >> updated >> > patch with the fix. The patch also contains a fix >> > regarding operator used at the time of creating expression as default >> > partition constraint. This was notified offlist by Amit Langote. >> >> I think that the syntax for this patch should probably be revised. >> Right now the proposal is for: >> >> CREATE TABLE .. PARTITION OF ... FOR VALUES IN (DEFAULT); >> >> But that's not a good idea for several reasons. 
For one thing, you >> can also write FOR VALUES IN (DEFAULT, 5) or which isn't sensible. >> For another thing, this kind of syntax won't generalize to range >> partitioning, which we've talked about making this feature support. >> Maybe something like: >> >> CREATE TABLE .. PARTITION OF .. DEFAULT; >> >> This patch makes the assumption throughout that any DefElem represents >> the word DEFAULT, which is true in the patch as written but doesn't >> seem very future-proof. I think the "def" in "
Re: [HACKERS] Adding support for Default partition in partitioning
Hi, On Apr 27, 2017 18:37, "Robert Haas" <robertmh...@gmail.com> wrote: > > > Are you also working on extending this to work with range > partitioning? Because I think that would be good to do. > > > Currently I am working on review comments and bug fixes for the default list partitioning patch. After that I can start with default partition for range partitioning. Thank you, Rahila Syed
Re: [HACKERS] Adding support for Default partition in partitioning
>I suspect it could be done as of now, but I'm a little worried that it >might create grammar conflicts in the future as we extend the syntax >further. If we use CREATE TABLE ... PARTITION OF .. DEFAULT, then the >word DEFAULT appears in the same position where we'd normally have FOR >VALUES, and so the parser will definitely be able to figure out what's >going on. When it gets to that position, it will see FOR or it will >see DEFAULT, and all is clear. OTOH, if we use CREATE TABLE ... >DEFAULT PARTITION OF ..., then we have action at a distance: whether >or not the word DEFAULT is present before PARTITION affects which >tokens are legal after the parent table name. bison isn't always very >smart about that kind of thing. No particular dangers come to mind at >the moment, but it makes me nervous anyway. +1 for the CREATE TABLE..PARTITION OF...DEFAULT syntax. I think substituting DEFAULT for FOR VALUES is appropriate, as the two cases are mutually exclusive. One more thing that needs consideration is whether default partitions should be partitioned further. Other databases allow default partitions to be partitioned further, and I think it's natural for users to expect the data in default partitions to also be divided into subpartitions. So it should be supported. My colleague Rajkumar Raghuwanshi brought to my notice that the current patch does not handle this correctly. I will include this in the updated patch if there is no objection. On the other hand, if subpartitioning of a default partition is to be prohibited, an error should be thrown if PARTITION BY is specified after DEFAULT. 
Thank you, Rahila Syed On Tue, Apr 25, 2017 at 1:46 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Mon, Apr 24, 2017 at 8:14 AM, Ashutosh Bapat > <ashutosh.ba...@enterprisedb.com> wrote: > > On Mon, Apr 24, 2017 at 4:24 PM, Robert Haas <robertmh...@gmail.com> > wrote: > >> On Mon, Apr 24, 2017 at 5:10 AM, Rahila Syed <rahilasye...@gmail.com> > wrote: > >>> Following can also be considered as it specifies more clearly that the > >>> partition holds default values. > >>> > >>> CREATE TABLE ...PARTITION OF...FOR VALUES DEFAULT; > >> > >> The partition doesn't contain default values; it is itself a default. > > > > Is CREATE TABLE ... DEFAULT PARTITION OF ... feasible? That sounds more > natural. > > I suspect it could be done as of now, but I'm a little worried that it > might create grammar conflicts in the future as we extend the syntax > further. If we use CREATE TABLE ... PARTITION OF .. DEFAULT, then the > word DEFAULT appears in the same position where we'd normally have FOR > VALUES, and so the parser will definitely be able to figure out what's > going on. When it gets to that position, it will see FOR or it will > see DEFAULT, and all is clear. OTOH, if we use CREATE TABLE ... > DEFAULT PARTITION OF ..., then we have action at a distance: whether > or not the word DEFAULT is present before PARTITION affects which > tokens are legal after the parent table name. bison isn't always very > smart about that kind of thing. No particular dangers come to mind at > the moment, but it makes me nervous anyway. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
Re: [HACKERS] Adding support for Default partition in partitioning
Hello Jeevan, Thank you for the comments. I will include your comments in the updated patch. >7.The output of describe needs to be improved. The syntax for DEFAULT partitioning is still under discussion. This comment won't be applicable if the syntax is changed. >6. >I am wondering, isn't it possible to retrieve the has_default and default_index >here to find out if default partition exists and if exist then find it's oid >using rd_partdesc, that would avoid above(7) loop to check if partition bound is >default The checks are used to find the default partition bound and exclude it from the list of partition bounds to form the partition constraint. This can't be accomplished using the has_default flag alone; isDefaultPartitionBound() is written to accomplish that. >8. >Following call to find_inheritance_children() in generate_qual_for_defaultpart() >is an overhead, instead we can simply use an array of oids in rd_partdesc. I think using find_inheritance_children() will take into consideration a concurrent drop of a partition, which the value in rd_partdesc will not. Thank you, Rahila Syed
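The point in item 6 above — that the default partition's implicit constraint is formed by excluding the default bound and negating the values accepted by every other partition — can be sketched with a toy Python model. This is illustrative only; the function name is hypothetical and this is not the patch's actual C code, which builds an expression tree instead of a closure:

```python
# Toy model of a default LIST partition's implicit constraint:
# a key belongs to the default partition iff it matches none of the
# explicit partitions' FOR VALUES IN (...) lists. Illustrative only.

def default_partition_constraint(partition_value_lists):
    """partition_value_lists: the value lists of all non-default
    partitions (the default bound has already been excluded).
    Returns a predicate that accepts a partition-key value."""
    explicit = set()
    for values in partition_value_lists:
        explicit.update(values)
    return lambda key: key not in explicit

# Two explicit partitions: FOR VALUES IN (1, 2) and FOR VALUES IN (4, 5, 6).
accepts = default_partition_constraint([[1, 2], [4, 5, 6]])
print(accepts(3))   # True  -> would be routed to the default partition
print(accepts(4))   # False -> belongs to an explicit partition
```

The same negation is why the real code must skip the default partition's own bound when collecting the sibling bounds, which is the role of the isDefaultPartitionBound() check discussed above.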
Re: [HACKERS] Adding support for Default partition in partitioning
Hello, Thank you for reviewing. >But that's not a good idea for several reasons. For one thing, you >can also write FOR VALUES IN (DEFAULT, 5) or which isn't sensible. >For another thing, this kind of syntax won't generalize to range >partitioning, which we've talked about making this feature support. >Maybe something like: >CREATE TABLE .. PARTITION OF .. DEFAULT; I agree that the syntax should be changed to also support range partitioning. Following can also be considered as it specifies more clearly that the partition holds default values. CREATE TABLE ...PARTITION OF...FOR VALUES DEFAULT; >Maybe we should introduce a dedicated node type to >represent a default-specification in the parser grammar. If not, then >let's at least encapsulate the test a little better, e.g. by adding >isDefaultPartitionBound() which tests not only IsA(..., DefElem) but >also whether the name is DEFAULT as expected. BTW, we typically use >lower-case internally, so if we stick with this representation it >should really be "default" not "DEFAULT". isDefaultPartitionBound() function is created in the attached patch which checks for both node type and name. >Why abbreviate "default" to def here? Seems pointless. Corrected in the attached. >Consider && Fixed. >+ * default partiton for rows satisfying the new partition >Spelling. Fixed. >Missing apostrophe Fixed. >Definitely not safe against concurrency, since AccessShareLock won't >exclude somebody else's update. In fact, it won't even cover somebody >else's already-in-flight transaction Changed it to AccessExclusiveLock >Normally in such cases we try to give more detail using >ExecBuildSlotValueDescription. This function is used in execMain.c and the error is being reported in partition.c. Do you mean the error reporting should be moved into execMain.c to use ExecBuildSlotValueDescription? >This variable starts out true and is never set to any value other than >true. 
Just get rid of it and, in the one place where it is currently >used, write "true". That's shorter and clearer. Fixed. >There's not really a reason to cast the result of stringToNode() to >Node * and then turn around and cast it to PartitionBoundSpec *. Just >cast it directly to whatever it needs to be. And use the new castNode >macro Fixed. The castNode macro takes a Node * as input, whereas stringToNode() takes a string. IIUC, castNode can't be used here. >The if (def_elem) test continues >early, but if the point is that the loop using cell3 shouldn't execute >in that case, why not just put if (!def_elem) { foreach(cell3, ...) { >... } } instead of reiterating the ReleaseSysCache in two places? Fixed in the attached. I will respond to further comments in a following email. On Thu, Apr 13, 2017 at 12:48 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Thu, Apr 6, 2017 at 7:30 AM, Rahila Syed <rahilasye...@gmail.com> > wrote: > > Thanks a lot for testing and reporting this. Please find attached an > updated > > patch with the fix. The patch also contains a fix > > regarding operator used at the time of creating expression as default > > partition constraint. This was notified offlist by Amit Langote. > > I think that the syntax for this patch should probably be revised. > Right now the proposal is for: > > CREATE TABLE .. PARTITION OF ... FOR VALUES IN (DEFAULT); > > But that's not a good idea for several reasons. For one thing, you > can also write FOR VALUES IN (DEFAULT, 5) or which isn't sensible. > For another thing, this kind of syntax won't generalize to range > partitioning, which we've talked about making this feature support. > Maybe something like: > > CREATE TABLE .. PARTITION OF .. DEFAULT; > > This patch makes the assumption throughout that any DefElem represents > the word DEFAULT, which is true in the patch as written but doesn't > seem very future-proof. 
I think the "def" in "DefElem" stands for > "definition" or "define" or something like that, so this is actually > pretty confusing. Maybe we should introduce a dedicated node type to > represent a default-specification in the parser grammar. If not, then > let's at least encapsulate the test a little better, e.g. by adding > isDefaultPartitionBound() which tests not only IsA(..., DefElem) but > also whether the name is DEFAULT as expected. BTW, we typically use > lower-case internally, so if we stick with this representation it > should really be "default" not "DEFAULT". > > Useless hunk: > > +boolhas_def;/* Is there a default partition? > Currently false > + * for a range partitioned table */ > +intde
Re: [HACKERS] Adding support for Default partition in partitioning
Hello, Thanks a lot for testing and reporting this. Please find attached an updated patch with the fix. The patch also contains a fix regarding operator used at the time of creating expression as default partition constraint. This was notified offlist by Amit Langote. Thank you, Rahila Syed On Thu, Apr 6, 2017 at 12:21 AM, Keith Fiske <ke...@omniti.com> wrote: > > On Wed, Apr 5, 2017 at 11:19 AM, Robert Haas <robertmh...@gmail.com> > wrote: > >> On Wed, Apr 5, 2017 at 5:57 AM, Rahila Syed <rahilasye...@gmail.com> >> wrote: >> >>Could you briefly elaborate why you think the lack global index support >> >>would be a problem in this regard? >> > I think following can happen if we allow rows satisfying the new >> partition >> > to lie around in the >> > default partition until background process moves it. >> > Consider a scenario where partition key is a primary key and the data >> in the >> > default partition is >> > not yet moved into the newly added partition. If now, new data is added >> into >> > the new partition >> > which also exists(same key) in default partition there will be data >> > duplication. If now >> > we scan the partitioned table for that key(from both the default and new >> > partition as we >> > have not moved the rows) it will fetch the both rows. >> > Unless we have global indexes for partitioned tables, there is chance of >> > data duplication between >> > child table added after default partition and the default partition. >> >> Yes, I think it would be completely crazy to try to migrate the data >> in the background: >> >> - The migration might never complete because of a UNIQUE or CHECK >> constraint on the partition to which rows are being migrated. >> >> - Even if the migration eventually succeeded, such a design abandons >> all hope of making INSERT .. ON CONFLICT DO NOTHING work sensibly >> while the migration is in progress, unless the new partition has no >> UNIQUE constraints. 
>> >> - Partition-wise join and partition-wise aggregate would need to have >> special case handling for the case of an unfinished migration, as >> would any user code that accesses partitions directly. >> >> - More generally, I think users expect that when a DDL command >> finishes execution, it's done all of the work that there is to do (or >> at the very least, that any remaining work has no user-visible >> consequences, which would not be the case here). >> >> IMV, the question of whether we have efficient ways to move data >> around between partitions is somewhat separate from the question of >> whether partitions can be defined in a certain way in the first place. >> The problems that Keith refers to upthread already exist for >> subpartitioning; you've got to detach the old partition, create a new >> one, and then reinsert the data. And for partitioning an >> unpartitioned table: create a replacement table, insert all the data, >> substitute it for the original table. The fact that we have these >> limitation is not good, but we're not going to rip out partitioning >> entirely because we don't have clever ways of migrating the data in >> those cases, and the proposed behavior here is not any worse. >> >> Also, waiting for those problems to get fixed might be waiting for >> Godot. I'm not really all that sanguine about our chances of coming >> up with a really nice way of handling these cases. In a designed >> based on table inheritance, you can leave it murky where certain data >> is supposed to end up and migrate it on-line and you might get away >> with that, but a major point of having declarative partitioning at all >> is to remove that sort of murkiness. It's probably not that hard to >> come up with a command that locks the parent and moves data around via >> full table scans, but I'm not sure how far that really gets us; you >> could do the same thing easily enough with a sequence of commands >> generated via a script. 
And being able to do this in a general way >> without a full table lock looks pretty hard - it doesn't seem >> fundamentally different from trying to perform a table-rewriting >> operation like CLUSTER without a full table lock, which we also don't >> support. The executor is not built to cope with any aspect of the >> table definition shifting under it, and that includes the set of child >> tables with are partitions of the table mentioned in the query. Maybe >> the executor can be taught to survive such definitional changes at >> lea
Re: [HACKERS] Adding support for Default partition in partitioning
Hello Amit, >Could you briefly elaborate why you think the lack global index support >would be a problem in this regard? I think the following can happen if we allow rows satisfying the new partition to lie around in the default partition until a background process moves them. Consider a scenario where the partition key is a primary key and the data in the default partition is not yet moved into the newly added partition. If new data is now added into the new partition that also exists (same key) in the default partition, there will be data duplication. If we then scan the partitioned table for that key (from both the default and the new partition, as we have not moved the rows), it will fetch both rows. Unless we have global indexes for partitioned tables, there is a chance of data duplication between a child table added after the default partition and the default partition itself. >Scanning it and moving rows to the newly added partition while holding an >AccessExclusiveLock on the parent will block any and all of the concurrent >activity on it until the row-movement is finished. Can you explain why this will require an AccessExclusiveLock on the parent and not just on the default partition and the newly added partition? Thank you, Rahila Syed On Wed, Apr 5, 2017 at 1:22 PM, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote: > On 2017/04/05 14:41, Rushabh Lathia wrote: > > I agree about the future plan about the row movement, how that is I am > > not quite sure at this stage. > > > > I was thinking that CREATE new partition is the DDL command, so even > > if row-movement works with holding the lock on the new partition table, > > that should be fine. I am not quire sure, why row movement should be > > happen in the back-ground process. > > I think to improve the availability of access to the partitioned table. > > Consider that the default partition may have gotten pretty large. 
> Scanning it and moving rows to the newly added partition while holding an > AccessExclusiveLock on the parent will block any and all of the concurrent > activity on it until the row-movement is finished. One may be prepared to > pay this cost, for which there should definitely be an option to perform > the row-movement in the same transaction (also possibly the default > behavior). > > Thanks, > Amit > > >
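The duplication scenario described in this message can be made concrete with a toy model (hypothetical Python, no real PostgreSQL APIs involved): if rows matching a newly added partition are left behind in the default partition, a scan over all partitions returns the same key twice, and without a global unique index nothing prevents it:

```python
# Toy model: partitions are plain lists of (key, payload) rows.
default_part = [(10, 'old row')]          # key 10 was routed here earlier
partitions = {'default': default_part}

# A new partition for key 10 is added WITHOUT moving the old row out of
# the default partition -- the situation the thread warns against.
partitions['p10'] = [(10, 'new row')]

# A scan of the whole partitioned table now sees key 10 twice, even
# though the key was meant to be unique across the table.
matches = [row for part in partitions.values()
           for row in part if row[0] == 10]
print(matches)  # [(10, 'old row'), (10, 'new row')] -- duplicate key
```

This is why the thread settles on moving (or rejecting) conflicting rows synchronously while the DDL holds its lock, rather than migrating them in the background.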
Re: [HACKERS] Adding support for Default partition in partitioning
Hi, >However, running your original example, I'm getting this error Thank you for testing. Please find attached an updated patch which fixes the above. Thank you, Rahila Syed Attachment: default_partition_v5.patch
Re: [HACKERS] Adding support for Default partition in partitioning
Hello, Please find attached an updated patch. The following has been accomplished in this update: 1. A new partition can be added after the default partition if there are no conflicting rows in the default partition. 2. Solved the crash reported earlier. Thank you, Rahila Syed On Tue, Apr 4, 2017 at 6:22 PM, David Steele <da...@pgmasters.net> wrote: > On 3/31/17 10:45 AM, David Steele wrote: > > On 3/29/17 8:13 AM, Rahila Syed wrote: > > > >> Thanks for reporting. I have identified the problem and have a fix. > >> Currently working on allowing > >> adding a partition after default partition if the default partition does > >> not have any conflicting rows. > >> Will update the patch with both of these. > > > > The CF has been extended until April 7 but time is still growing > > short. Please respond with a new patch by 2017-04-04 00:00 AoE (UTC-12) > > or this submission will be marked "Returned with Feedback". > > This submission has been marked "Returned with Feedback". Please feel > free to resubmit to a future commitfest. > > Regards, > -- > -David > da...@pgmasters.net > Attachment: default_partition_v4.patch
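Point 1 in the message above — rejecting a new partition when the default partition already contains rows that would belong to it — amounts to a containment check over the default partition's rows. A simplified Python stand-in (hypothetical names; the patch does this with a table scan against the new partition's constraint, not like this):

```python
def check_new_list_partition(default_rows, new_partition_values):
    """Raise if any partition-key value currently stored in the default
    partition would satisfy the new partition's FOR VALUES IN (...) list."""
    accepted = set(new_partition_values)
    conflicts = [key for key in default_rows if key in accepted]
    if conflicts:
        raise ValueError(
            "new partition would overlap rows already in the default "
            "partition: %s" % conflicts)

# No overlap: the default partition holds keys 20 and 24, new partition
# takes key 15, so the DDL would be allowed to proceed.
check_new_list_partition([20, 24], new_partition_values=[15])

# Overlap: key 15 is already in the default partition, so adding a
# partition FOR VALUES IN (15) must be rejected.
try:
    check_new_list_partition([15, 20], new_partition_values=[15])
except ValueError as e:
    print("rejected:", e)
```

The real check must of course hold a strong enough lock on the default partition so that no conflicting row can be inserted between the scan and the catalog update, which is the concurrency concern raised elsewhere in the thread.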
Re: [HACKERS] Adding support for Default partition in partitioning
Thanks for reporting. I have identified the problem and have a fix. Currently working on allowing adding a partition after default partition if the default partition does not have any conflicting rows. Will update the patch with both of these. Thank you, Rahila Syed On Mon, Mar 27, 2017 at 12:10 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > I applied the patch and was trying to perform some testing, but its > ending up with server crash with the test shared by you in your starting > mail: > > postgres=# CREATE TABLE list_partitioned ( > postgres(# a int > postgres(# ) PARTITION BY LIST (a); > CREATE TABLE > postgres=# > postgres=# CREATE TABLE part_default PARTITION OF list_partitioned FOR > VALUES IN (DEFAULT); > CREATE TABLE > > postgres=# CREATE TABLE part_1 PARTITION OF list_partitioned FOR VALUES IN > (4,5); > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Failed. > > Apart from this, few more explanation in the patch is needed to explain the > changes for the DEFAULT partition. Like I am not quite sure what exactly > the > latest version of patch supports, like does that support the tuple row > movement, > or adding new partition will be allowed having partition table having > DEFAULT > partition, which is quite difficult to understand through the code changes. > > Another part which is missing in the patch is the test coverage, adding > proper test coverage, which explain what is supported and what's not. > > Thanks, > > On Fri, Mar 24, 2017 at 3:25 PM, Rahila Syed <rahilasye...@gmail.com> > wrote: > >> Hello Rushabh, >> >> Thank you for reviewing. >> Have addressed all your comments in the attached patch. The attached >> patch currently throws an >> error if a new partition is added after default partition. 
>> >> >Rather then adding check for default here, I think this should be >> handle inside >> >get_qual_for_list(). >> Have moved the check inside get_qual_for_partbound() as needed to do some >> operations >> before calling get_qual_for_list() for default partitions. >> >> Thank you, >> Rahila Syed >> >> On Tue, Mar 21, 2017 at 11:36 AM, Rushabh Lathia < >> rushabh.lat...@gmail.com> wrote: >> >>> I picked this for review and noticed that patch is not getting >>> cleanly complied on my environment. >>> >>> partition.c: In function ‘RelationBuildPartitionDesc’: >>> partition.c:269:6: error: ISO C90 forbids mixed declarations and code >>> [-Werror=declaration-after-statement] >>> Const*val = lfirst(c); >>> ^ >>> partition.c: In function ‘generate_partition_qual’: >>> partition.c:1590:2: error: ISO C90 forbids mixed declarations and code >>> [-Werror=declaration-after-statement] >>> PartitionBoundSpec *spec = (PartitionBoundSpec *)bound; >>> ^ >>> partition.c: In function ‘get_partition_for_tuple’: >>> partition.c:1820:5: error: array subscript has type ‘char’ >>> [-Werror=char-subscripts] >>> result = parent->indexes[partdesc->boundinfo->def_index]; >>> ^ >>> partition.c:1824:16: error: assignment makes pointer from integer >>> without a cast [-Werror] >>> *failed_at = RelationGetRelid(parent->reldesc); >>> ^ >>> cc1: all warnings being treated as errors >>> >>> Apart from this, I was reading patch here are few more comments: >>> >>> 1) Variable initializing happening at two place. 
>>> >>> @@ -166,6 +170,8 @@ RelationBuildPartitionDesc(Relation rel) >>> /* List partitioning specific */ >>> PartitionListValue **all_values = NULL; >>> boolfound_null = false; >>> +boolfound_def = false; >>> +intdef_index = -1; >>> intnull_index = -1; >>> >>> /* Range partitioning specific */ >>> @@ -239,6 +245,8 @@ RelationBuildPartitionDesc(Relation rel) >>> i = 0; >>> found_null = false; >>> null_index = -1; >>> +found_def = false; >>> +def_index = -1; >>> foreach(cell, boundspecs) >>> { >>> ListCell *c; >>> @@ -249,6 +257,15 @@ RelationBuildPartitionDesc(Relation rel) >>> >>> >>> 2) >>> >>> @@
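The `parent->indexes[partdesc->boundinfo->def_index]` line flagged in the compiler output quoted above is the default-partition fallback step of tuple routing. A loose Python model of that control flow (illustrative only; the real get_partition_for_tuple binary-searches bound datums rather than scanning lists):

```python
def route_tuple(key, bound_values, indexes, def_index):
    """bound_values: per-bound value lists of the explicit LIST partitions;
    indexes: partition slot for each bound; def_index: position of the
    default partition, or -1 if there is none."""
    for i, values in enumerate(bound_values):
        if key in values:
            return indexes[i]           # matched an explicit partition
    if def_index >= 0:
        return indexes[def_index]       # fall back to the default partition
    raise LookupError("no partition of relation found for row")

# Two explicit partitions plus a default partition in slot 2.
print(route_tuple(5, [[1, 2], [4, 5]], [0, 1, 2], def_index=2))  # 1
print(route_tuple(9, [[1, 2], [4, 5]], [0, 1, 2], def_index=2))  # 2 (default)
```

Without a default partition (`def_index = -1`), the unmatched case raises, which corresponds to the "no partition of relation ... found for row" error users see today.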
Re: [HACKERS] Adding support for Default partition in partitioning
Hello Rushabh, Thank you for reviewing. Have addressed all your comments in the attached patch. The attached patch currently throws an error if a new partition is added after default partition. >Rather then adding check for default here, I think this should be handle inside >get_qual_for_list(). Have moved the check inside get_qual_for_partbound() as needed to do some operations before calling get_qual_for_list() for default partitions. Thank you, Rahila Syed On Tue, Mar 21, 2017 at 11:36 AM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > I picked this for review and noticed that patch is not getting > cleanly complied on my environment. > > partition.c: In function ‘RelationBuildPartitionDesc’: > partition.c:269:6: error: ISO C90 forbids mixed declarations and code > [-Werror=declaration-after-statement] > Const *val = lfirst(c); > ^ > partition.c: In function ‘generate_partition_qual’: > partition.c:1590:2: error: ISO C90 forbids mixed declarations and code > [-Werror=declaration-after-statement] > PartitionBoundSpec *spec = (PartitionBoundSpec *)bound; > ^ > partition.c: In function ‘get_partition_for_tuple’: > partition.c:1820:5: error: array subscript has type ‘char’ > [-Werror=char-subscripts] > result = parent->indexes[partdesc->boundinfo->def_index]; > ^ > partition.c:1824:16: error: assignment makes pointer from integer without > a cast [-Werror] > *failed_at = RelationGetRelid(parent->reldesc); > ^ > cc1: all warnings being treated as errors > > Apart from this, I was reading patch here are few more comments: > > 1) Variable initializing happening at two place. 
> > @@ -166,6 +170,8 @@ RelationBuildPartitionDesc(Relation rel) > /* List partitioning specific */ > PartitionListValue **all_values = NULL; > bool found_null = false; > +bool found_def = false; > +int def_index = -1; > int null_index = -1; > > /* Range partitioning specific */ > @@ -239,6 +245,8 @@ RelationBuildPartitionDesc(Relation rel) > i = 0; > found_null = false; > null_index = -1; > +found_def = false; > +def_index = -1; > foreach(cell, boundspecs) > { > ListCell *c; > @@ -249,6 +257,15 @@ RelationBuildPartitionDesc(Relation rel) > > > 2) > > @@ -1558,6 +1586,19 @@ generate_partition_qual(Relation rel) > bound = stringToNode(TextDatumGetCString(boundDatum)); > ReleaseSysCache(tuple); > > +/* Return if it is a default list partition */ > +PartitionBoundSpec *spec = (PartitionBoundSpec *)bound; > +ListCell *cell; > +foreach(cell, spec->listdatums) > > More comment on above hunk is needed? > > Rather then adding check for default here, I think this should be handle > inside > get_qual_for_list(). > > 3) Code is not aligned with existing > > diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y > index 6316688..ebb7db7 100644 > --- a/src/backend/parser/gram.y > +++ b/src/backend/parser/gram.y > @@ -2594,6 +2594,7 @@ partbound_datum: > Sconst { $$ = makeStringConst($1, @1); } > | NumericOnly { $$ = makeAConst($1, @1); } > | NULL_P { $$ = makeNullAConst(@1); } > +| DEFAULT { $$ = (Node *)makeDefElem("DEFAULT", NULL, @1); } > ; > > > 4) Unnecessary hunk: > > @@ -2601,7 +2602,6 @@ partbound_datum_list: > | partbound_datum_list ',' partbound_datum > { $$ = lappend($1, $3); } > ; > - > > Note: this is just an initially review comments, I am yet to do the > detailed review > and the testing for the patch. > > Thanks. > > On Mon, Mar 20, 2017 at 9:27 AM, Rahila Syed <rahilasye...@gmail.com> > wrote: > >> Hello, >> >> Please find attached a rebased patch with support for pg_dump. 
I am >> working on the patch >> to handle adding a new partition after a default partition by throwing an >> error if >> conflicting rows exist in default partition and adding the partition >> successfully otherwise. >> Will post an updated patch by tomorrow. >> >> Thank you, >> Rahila Syed >> >> On Mon, Mar 13, 2017 at 5:42 AM, Robert Haas <robertmh...@gmail.com> >> wrote: >> >>> On Fri, Mar 10, 2017 at 2:17 PM, Peter Eisentraut >>> <peter.eisentr...@2ndquadrant.com> wrote: >>> > On 3/2/17 21:40, Robert Haas wrote: >>> >> On the point mentioned above, I >>>
Re: [HACKERS] Adding support for Default partition in partitioning
Hello, Please find attached a rebased patch with support for pg_dump. I am working on the patch to handle adding a new partition after a default partition by throwing an error if conflicting rows exist in default partition and adding the partition successfully otherwise. Will post an updated patch by tomorrow. Thank you, Rahila Syed On Mon, Mar 13, 2017 at 5:42 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Fri, Mar 10, 2017 at 2:17 PM, Peter Eisentraut > <peter.eisentr...@2ndquadrant.com> wrote: > > On 3/2/17 21:40, Robert Haas wrote: > >> On the point mentioned above, I > >> don't think adding a partition should move tuples, necessarily; seems > >> like it would be good enough - maybe better - for it to fail if there > >> are any that would need to be moved. > > > > ISTM that the uses cases of various combinations of adding a default > > partition, adding another partition after it, removing the default > > partition, clearing out the default partition in order to add more > > nondefault partitions, and so on, need to be more clearly spelled out > > for each partitioning type. We also need to consider that pg_dump and > > pg_upgrade need to be able to reproduce all those states. Seems to be a > > bit of work still ... > > This patch is only targeting list partitioning. It is not entirely > clear that the concept makes sense for range partitioning; you can > already define a partition from the end of the last partitioning up to > infinity, or from minus-infinity up to the starting point of the first > partition. The only thing a default range partition would do is let > you do is have a single partition cover all of the ranges that would > otherwise be unassigned, which might not even be something we want. > > I don't know how complete the patch is, but the specification seems > clear enough. If you have partitions for 1, 3, and 5, you get > partition constraints of (a = 1), (a = 3), and (a = 5). 
If you add a > default partition, you get a constraint of (a != 1 and a != 3 and a != > 5). If you then add a partition for 7, the default partition's > constraint becomes (a != 1 and a != 3 and a != 5 and a != 7). The > partition must be revalidated at that point for conflicting rows, > which we can either try to move to the new partition, or just error > out if there are any, depending on what we decide we want to do. I > don't think any of that requires any special handling for either > pg_dump or pg_upgrade; it all just falls out of getting the > partitioning constraints correct and consistently enforcing them, just > as for any other partition. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > default_partition_v2.patch Description: application/download -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
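[Editor's note] The constraint behaviour Robert describes above (partitions for 1, 3 and 5 give the default a constraint of `a != 1 AND a != 3 AND a != 5`, which tightens when a partition for 7 is added) can be sketched as follows. This is only an illustrative model, not PostgreSQL source code; the function name is hypothetical.

```python
# Model the implicit constraint of a default list partition: the row
# must not match any value bound to a non-default partition.
def default_partition_predicate(partition_value_lists):
    bound_values = {v for values in partition_value_lists for v in values}
    return lambda a: a not in bound_values

# Partitions for 1, 3 and 5: the default accepts anything else.
pred = default_partition_predicate([[1], [3], [5]])
assert pred(7) and not pred(3)

# Adding a partition for 7 tightens the default's constraint, so rows
# already stored in the default must be revalidated: conflicting rows
# are either moved to the new partition or reported as an error, per
# the discussion above.
default_rows = [7, 9]
new_pred = default_partition_predicate([[1], [3], [5], [7]])
conflicting = [r for r in default_rows if not new_pred(r)]
assert conflicting == [7]
```

As the thread notes, nothing more is needed for pg_dump or pg_upgrade as long as this constraint is computed and enforced consistently.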
Re: [HACKERS] wait events for disk I/O
Thank you for the updated patch. I have applied and tested it on latest sources and the patch looks good to me. >I am not quite sure about this, as this is for stat statements. Also part from the >place you found there are many other fwrite() call into pg_stat_statements, and >I intentionally haven't added event here as it is very small write about stat, and >doesn't look like we should add for those call. I agree that this writes less amount of data. Although tracking this can be useful too in scenarios where pg_stat_statements is lagging due to I/O bottleneck. I will leave this decision to the committer. Thank you, Rahila Syed On Wed, Mar 15, 2017 at 1:03 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > Thanks Rahila for reviewing this patch. > > On Tue, Mar 14, 2017 at 8:13 PM, Rahila Syed <rahilasye...@gmail.com> > wrote: > >> Hello, >> >> I applied and tested this patch on latest sources and it works fine. >> >> Following are some comments, >> >> >+ /* Wait event for SNRU */ >> >+ WAIT_EVENT_READ_SLRU_PAGE, >> Typo in the comment. >> >> > Fixed. > > >> >FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, >> WAIT_EVENT_FLUSH_DATA_BLOCK); >> This call is inside mdwriteback() which can flush more than one block so >> should WAIT_EVENT _FLUSH_DATA_BLOCK >> be renamed to WAIT_EVENT_FLUSH_DATA_BLOCKS? >> >> > Changed with WAIT_EVENT_FLUSH_DATA_BLOCKS. > > >> Should calls to write() in following functions be tracked too? >> qtext_store() - This is related to pg_stat_statements >> >> > > I am not quite sure about this, as this is for stat statements. Also part > from the > place you found there are many other fwrite() call into > pg_stat_statements, and > I intentionally haven't added event here as it is very small write about > stat, and > doesn't look like we should add for those call. > > > >> dsm_impl_mmap() - This is in relation to creating dsm segments. >> >> > Added new event here. Actually particular write call is zero-filling the > DSM file. 
> > >> write_auto_conf_file()- This is called when updated configuration >> parameters are >> written to a temp file. >> >> > write_auto_conf_file() is getting called during the ALTER SYSTEM call. > Here write > happen only when someone explicitly run the ALTER SYSTEM call. This is > administrator call and so doesn't seem like necessary to add separate wait > event > for this. > > PFA latest patch with other fixes. > > >> >> On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lat...@gmail.com> >> wrote: >> >>> >>> >>> On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmh...@gmail.com> >>> wrote: >>> >>>> On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapil...@gmail.com> >>>> wrote: >>>> > On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmh...@gmail.com> >>>> wrote: >>>> >> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapil...@gmail.com> >>>> wrote: >>>> >>> Sure, if you think both Writes and Reads at OS level can have some >>>> >>> chance of blocking in obscure cases, then we should add a wait event >>>> >>> for them. >>>> >> >>>> >> I think writes have a chance of blocking in cases even in cases that >>>> >> are not very obscure at all. >>>> > >>>> > Point taken for writes, but I think in general we should have some >>>> > criteria based on which we can decide whether to have a wait event for >>>> > a particular call. It should not happen that we have tons of wait >>>> > events and out of which, only a few are helpful in most of the cases >>>> > in real-world scenarios. >>>> >>>> Well, the problem is that if you pick and choose which wait events to >>>> add based on what you think will be common, you're actually kind of >>>> hosing yourself. Because now when something uncommon happens, suddenly >>>> you don't get any wait event data and you can't tell what's happening. >>>> I think the number of new wait events added by Rushabh's patch is >>>> wholly reasonable. Yeah, some of those are going to be a lot more >>>> common than others, but so what? 
We add wait events so that we can >>>> find out what's going on. I don't want to sometimes know when a >>>
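[Editor's note] The pattern argued for above (always advertise a wait event around a potentially blocking I/O call, so monitoring can tell what a backend is blocked on) can be sketched like this. The names here are hypothetical Python stand-ins, not the actual PostgreSQL C API (which sets the event in the backend's shared-memory slot around the syscall).

```python
# Sketch: advertise the current wait event before a possibly blocking
# I/O call and clear it afterwards, so a monitoring view always shows
# what a blocked backend is doing.
from contextlib import contextmanager

current_wait_event = None  # per-backend slot; in shared memory in reality
observed = []              # what a monitoring view would have seen

@contextmanager
def report_wait_event(name):
    global current_wait_event
    current_wait_event = name
    observed.append(name)
    try:
        yield
    finally:
        current_wait_event = None  # always cleared, even on error

def flush_data_blocks(buf):
    # analogous to FileWriteback(..., WAIT_EVENT_FLUSH_DATA_BLOCKS)
    with report_wait_event("FLUSH_DATA_BLOCKS"):
        pass  # the real code issues the (possibly blocking) syscall here

flush_data_blocks(b"...")
assert observed == ["FLUSH_DATA_BLOCKS"]
assert current_wait_event is None
```

This also illustrates Robert's point: because the event is set unconditionally around every such call, uncommon blocking cases are visible too, not just the ones predicted to be frequent.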
Re: [HACKERS] wait events for disk I/O
Hello, I applied and tested this patch on latest sources and it works fine. Following are some comments, >+ /* Wait event for SNRU */ >+ WAIT_EVENT_READ_SLRU_PAGE, Typo in the comment. >FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_FLUSH_DATA_BLOCK); This call is inside mdwriteback() which can flush more than one block so should WAIT_EVENT _FLUSH_DATA_BLOCK be renamed to WAIT_EVENT_FLUSH_DATA_BLOCKS? Should calls to write() in following functions be tracked too? qtext_store() - This is related to pg_stat_statements dsm_impl_mmap() - This is in relation to creating dsm segments. write_auto_conf_file()- This is called when updated configuration parameters are written to a temp file. Thank you, Rahila Syed On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmh...@gmail.com> wrote: > >> On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapil...@gmail.com> >> wrote: >> > On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmh...@gmail.com> >> wrote: >> >> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapil...@gmail.com> >> wrote: >> >>> Sure, if you think both Writes and Reads at OS level can have some >> >>> chance of blocking in obscure cases, then we should add a wait event >> >>> for them. >> >> >> >> I think writes have a chance of blocking in cases even in cases that >> >> are not very obscure at all. >> > >> > Point taken for writes, but I think in general we should have some >> > criteria based on which we can decide whether to have a wait event for >> > a particular call. It should not happen that we have tons of wait >> > events and out of which, only a few are helpful in most of the cases >> > in real-world scenarios. >> >> Well, the problem is that if you pick and choose which wait events to >> add based on what you think will be common, you're actually kind of >> hosing yourself. 
Because now when something uncommon happens, suddenly >> you don't get any wait event data and you can't tell what's happening. >> I think the number of new wait events added by Rushabh's patch is >> wholly reasonable. Yeah, some of those are going to be a lot more >> common than others, but so what? We add wait events so that we can >> find out what's going on. I don't want to sometimes know when a >> backend is blocked on an I/O. I want to ALWAYS know. >> >> > Yes, I agree with Robert. Knowing what we want and what we don't > want is difficult to judge. Something which we might think its not useful > information, and later of end up into situation where we re-think about > adding those missing stuff is not good. Having more information about > the system, specially for monitoring purpose is always good. > > I am attaching another version of the patch, as I found stupid mistake > in the earlier version of patch, where I missed to initialize initial > value to > WaitEventIO enum. Also earlier version was not getting cleanly apply on > the current version of sources. > > > > -- > Rushabh Lathia > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > >
Re: [HACKERS] Adding support for Default partition in partitioning
>I agree that adding a new partition should not move any data out of the default. It's easy enough to set up a monitor to watch for data existing in the >default. Perhaps also adding a column to pg_partitioned_table that contains the oid of the default partition so it's easier to identify from a system >catalog perspective and make that monitoring easier. Won't it incur the overhead of scanning the default partition for matching rows each time a select happens on any matching partition? This extra scan would be required as long as rows satisfying the newly added partition are left around in the default partition. >I don't even see a need for it to fail either and not quite sure how that would even work? If they can't add a necessary child due to data being in the >default, how can they ever get it out? Just leave it to the user to keep an eye on the default and fix it as necessary. This patch intends to provide a way to insert data that does not satisfy any of the existing partitions. For this patch, we can disallow adding a new partition when a default partition with conflicting rows exists. There can be many solutions to the problem of adding a new partition. One is to move the relevant tuples from the default to the new partition; another, as you suggest, is to keep monitoring the default partition until the user moves the rows out of it. Thank you, Rahila Syed On Tue, Mar 7, 2017 at 10:00 PM, Keith Fiske <ke...@omniti.com> wrote: > On Thu, Mar 2, 2017 at 9:40 PM, Robert Haas <robertmh...@gmail.com> wrote: > >> On Wed, Mar 1, 2017 at 6:29 AM, Rahila Syed <rahilasye...@gmail.com> >> wrote: >> > 3. Handling adding a new partition to a partitioned table >> >with default partition. >> >This will require moving tuples from existing default partition to >> > newly created partition if they satisfy its partition bound. >> >> Considering that this patch was submitted at the last minute and isn't >> even complete, I can't see this getting into v10. But that doesn't >> mean we can't talk about it. 
I'm curious to hear other opinions on >> whether we should have this feature. On the point mentioned above, I >> don't think adding a partition should move tuples, necessarily; seems >> like it would be good enough - maybe better - for it to fail if there >> are any that would need to be moved. >> >> -- >> Robert Haas >> EnterpriseDB: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company >> >> >> -- >> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-hackers >> > > I'm all for this feature and had suggested it back in the original thread > to add partitioning to 10. I agree that adding a new partition should not > move any data out of the default. It's easy enough to set up a monitor to > watch for data existing in the default. Perhaps also adding a column to > pg_partitioned_table that contains the oid of the default partition so it's > easier to identify from a system catalog perspective and make that > monitoring easier. I don't even see a need for it to fail either and not > quite sure how that would even work? If they can't add a necessary child > due to data being in the default, how can they ever get it out? Just leave > it to the user to keep an eye on the default and fix it as necessary. This > is what I do in pg_partman. > > -- > Keith Fiske > Database Administrator > OmniTI Computer Consulting, Inc. > http://www.keithf4.com >
[HACKERS] Adding support for Default partition in partitioning
Hello, Currently inserting the data into a partitioned table that does not fit into any of its partitions is not allowed. The attached patch provides a capability to add a default partition to a list partitioned table as follows. postgres=# CREATE TABLE list_partitioned ( a int ) PARTITION BY LIST (a); CREATE TABLE postgres=# CREATE TABLE part_default PARTITION OF list_partitioned FOR VALUES IN (DEFAULT); CREATE TABLE postgres=# CREATE TABLE part_1 PARTITION OF list_partitioned FOR VALUES IN (4,5); CREATE TABLE postgres=# insert into list_partitioned values (9); INSERT 0 1 postgres=# select * from part_default; a --- 9 (1 row) The attached patch is in a preliminary stage and has following ToDos: 1. Adding pg_dump support. 2. Documentation 3. Handling adding a new partition to a partitioned table with default partition. This will require moving tuples from existing default partition to newly created partition if they satisfy its partition bound. 4. Handling of update of partition key in a default partition. As per current design it should throw an error if the update requires the tuple to be moved to any other partition. But this can changed by the following proposal. https://www.postgresql.org/message-id/CAJ3gD9do9o2ccQ7j7+tSgiE1REY65XRiMb= yjo3u3qhyp8e...@mail.gmail.com I am adding it to the current commitfest with the status Waiting on Author as I will submit an updated patch with above ToDos. Kindly give your suggestions. Thank you, Rahila Syed default_list_partition_v1.patch Description: application/download -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
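[Editor's note] The tuple routing shown in the psql transcript above (part_1 FOR VALUES IN (4,5), everything else falling into part_default, and an error when no partition matches) can be modelled as a simple lookup. Illustrative only; the function name is hypothetical and this is not the server's routing code.

```python
# Route a value to the first list partition whose value list contains
# it; otherwise fall back to the default partition, if any.
def route_tuple(value, partitions, default_partition=None):
    """partitions: mapping of partition name -> accepted value list."""
    for name, accepted in partitions.items():
        if value in accepted:
            return name
    if default_partition is None:
        # mirrors today's behaviour without a default partition
        raise ValueError("no partition of relation found for row")
    return default_partition

parts = {"part_1": [4, 5]}
assert route_tuple(4, parts, "part_default") == "part_1"
# insert into list_partitioned values (9) lands in the default,
# as in the transcript above
assert route_tuple(9, parts, "part_default") == "part_default"
```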
Re: [HACKERS] Parallel Index-only scan
I reviewed the patch. Overall it looks fine to me. One comment: >- if (index->amcanparallel && >- !index_only_scan && >+ if ((index->amcanparallel || >+ index_only_scan) && Why do we need to check for index_only_scan in the above condition? IIUC, index->amcanparallel is necessary for a parallel index scan to be chosen. We cannot choose a parallel index-only scan based on index_only_scan alone, without considering the index->amcanparallel value. So OR-ing index->amcanparallel with index_only_scan is probably not correct. Thank you, Rahila Syed On Thu, Feb 16, 2017 at 1:06 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Thu, Jan 19, 2017 at 7:07 AM, Rafia Sabih > <rafia.sa...@enterprisedb.com> wrote: > > Please find the attached file rebased patch of parallel index-only > > scan on the latest Parallel index-scan patch [1]. > > This again needs minor rebasing but basically looks fine. It's a > pretty straightforward extension of the parallel index scan work. > > Please make sure that this is pgindent-clean - i.e. that when you > pgindent the files that it touches, pgindent doesn't change anything > of the same parts of the file that you've changed in the patch. Also, > I believe Amit may have made some adjustments to the logic in > nodeIndexScan.c; if so, it would be good to make sure that the > nodeIndexOnlyScan.c changes match what was done there. In particular, > he's got this: > > if (reset_parallel_scan && node->iss_ScanDesc->parallel_ > scan) > index_parallelrescan(node->iss_ScanDesc); > > And you've got this: > > + if (reset_parallel_scan) > + index_parallelrescan(node->ioss_ScanDesc); > > There might be some other inconsistencies as well that I didn't notice > on a quick look. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
Re: [HACKERS] Parallel Index Scans
Hello Robert, >I am a bit mystified about how this manages to work with array keys. >_bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE >unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but >_bt_parallel_advance_scan() won't do anything unless btps_pageStatus >is already BTPARALLEL_DONE. It seems like one of those two things has >to be wrong. btps_pageStatus is to be set to BTPARALLEL_DONE only by the first worker which is performing scan for latest array key and which has encountered end of scan. This is ensured by following check in _bt_parallel_done(), if (so->arrayKeyCount >= btscan->btps_arrayKeyCount && btscan->btps_pageStatus != BTPARALLEL_DONE) Thus, BTPARALLEL_DONE marks end of scan only for latest array keys. This ensures that when any worker reaches _bt_advance_array_keys() it advances latest scan which is marked by btscan->btps_arrayKeyCount only when latest scan has ended by checking if(btps_pageStatus == BTPARALLEL_DONE) in _bt_advance_array_keys(). Otherwise, the worker just advances its local so->arrayKeyCount. I hope this provides some clarification. Thank you, Rahila Syed On Wed, Feb 1, 2017 at 3:52 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Tue, Jan 24, 2017 at 4:51 PM, Robert Haas <robertmh...@gmail.com> > wrote: > > On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapil...@gmail.com> > wrote: > >> In spite of being careful, I missed reorganizing the functions in > >> genam.h which I have done in attached patch. > > > > Cool. Committed parallel-generic-index-scan.2.patch. Thanks. > > Reviewing parallel_index_scan_v6.patch: > > I think it might be better to swap the return value and the out > parameter for _bt_parallel_seize; that is, return a bool, and have > callers ignore the value of the out parameter (e.g. *pageno). > > I think _bt_parallel_advance_scan should be renamed something that > includes the word "keys", like _bt_parallel_advance_array_keys. 
> > The hunk in indexam.c looks like a leftover that can be omitted. > > +/* > + * Below flags are used to indicate the state of parallel scan. > > They aren't really flags any more; they're members of an enum. I > think you could just leave this sentence out entirely and start right > in with descriptions of the individual values. But maybe all of those > descriptions should end in a period (instead of one ending in a period > but not the other three) since they read like complete sentences. > > + * btinitparallelscan - Initializing BTParallelScanDesc for parallel > btree scan > > Initializing -> initialize > > + * btparallelrescan() -- reset parallel scan > > Previous two prototypes have one dash, this one has two. Make it > consistent, please. > > + * Ideally, we don't need to acquire spinlock here, but being > + * consistent with heap_rescan seems to be a good idea. > > How about: In theory, we don't need to acquire the spinlock here, > because there shouldn't be any other workers running at this point, > but we do so for consistency. > > + * _bt_parallel_seize() -- returns the next block to be scanned for > forward > + * scans and latest block scanned for backward scans. > > I think the description should be more like "Begin the process of > advancing the scan to a new page. Other scans must wait until we call > bt_parallel_release() or bt_parallel_done()." And likewise > _bt_parallel_release() should say something like "Complete the process > of advancing the scan to a new page. We now have the new value for > btps_scanPage; some other backend can now begin advancing the scan." > And _bt_parallel_done should say something like "Mark the parallel > scan as complete." > > I am a bit mystified about how this manages to work with array keys. 
> _bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE > unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but > _bt_parallel_advance_scan() won't do anything unless btps_pageStatus > is already BTPARALLEL_DONE. It seems like one of those two things has > to be wrong. > > _bt_readpage's header comment should be updated to note that, in the > case of a parallel scan, _bt_parallel_seize should have been called > prior to entering this function, and _bt_parallel_release will be > called prior to return (or this could alternatively be phrased in > terms of btps_pageStatus on entry/exit). > > _bt_readnextpage isn't very clear about the meaning of its blkno > argument. It looks like it always has to be valid when scanning > forward, but only in the parallel case when scanning backward? That > is a little odd. Another, possibly related
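[Editor's note] The interlock Rahila describes in her reply (only a worker on the latest array key may mark the scan DONE, and the shared key is advanced only once the current key's scan is DONE) can be modelled as a small state machine. This is a simplified illustration, not the nbtree C code; the status values follow the enum quoted elsewhere in the thread (BTPARALLEL_IDLE, BTPARALLEL_DONE), and locking is omitted.

```python
# Shared parallel-scan state (btps_* fields) and per-worker state.
shared = {"pageStatus": "BTPARALLEL_IDLE", "arrayKeyCount": 0}

def bt_parallel_done(worker_array_key_count):
    # Only a worker scanning the latest array key may end the scan.
    if (worker_array_key_count >= shared["arrayKeyCount"]
            and shared["pageStatus"] != "BTPARALLEL_DONE"):
        shared["pageStatus"] = "BTPARALLEL_DONE"

def bt_parallel_advance_array_keys(worker):
    # Advance the shared key only once the current key's scan is DONE;
    # a lagging worker merely catches up its local count.
    if (shared["pageStatus"] == "BTPARALLEL_DONE"
            and worker["arrayKeyCount"] == shared["arrayKeyCount"]):
        shared["arrayKeyCount"] += 1
        shared["pageStatus"] = "BTPARALLEL_IDLE"
    worker["arrayKeyCount"] += 1

w1 = {"arrayKeyCount": 0}  # finished scanning key 0 first
w2 = {"arrayKeyCount": 0}  # lagging worker

bt_parallel_done(w1["arrayKeyCount"])       # ends the scan for key 0
assert shared["pageStatus"] == "BTPARALLEL_DONE"
bt_parallel_advance_array_keys(w1)          # w1 moves the shared scan to key 1
assert shared == {"pageStatus": "BTPARALLEL_IDLE", "arrayKeyCount": 1}
bt_parallel_done(w2["arrayKeyCount"])       # stale worker: no-op
assert shared["pageStatus"] == "BTPARALLEL_IDLE"
bt_parallel_advance_array_keys(w2)          # w2 only catches up locally
assert w2["arrayKeyCount"] == 1 and shared["arrayKeyCount"] == 1
```

So the two conditions Robert quotes are not contradictory: DONE marks end-of-scan only for the latest key, and advancing consumes that DONE state.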
Re: [HACKERS] Parallel Index Scans
Hello, >Agreed, that it makes sense to consider only the number of pages to >scan for computation of parallel workers. I think for index scan we >should consider both index and heap pages that need to be scanned >(costing of index scan consider both index and heap pages). I thin >where considering heap pages matter more is when the finally selected >rows are scattered across heap pages or we need to apply a filter on >rows after fetching from the heap. OTOH, we can consider just pages >in the index as that is where mainly the parallelism works IMO, considering just index pages will give a better estimate of work to be done in parallel. As the amount of work/number of pages divided amongst workers is irrespective of the number of heap pages scanned. Thank you, Rahila Syed On Mon, Jan 30, 2017 at 2:52 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Sat, Jan 28, 2017 at 1:34 AM, Robert Haas <robertmh...@gmail.com> > wrote: > > On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapil...@gmail.com> > wrote: > >> parallel_index_opt_exec_support_v6 - Removed the usage of > >> GatherSupportsBackwardScan. Expanded the comments in > >> ExecReScanIndexScan. > > > > I looked through this and in general it looks reasonable to me. > > However, I did notice one thing that I think is wrong. In the > > parallel bitmap heap scan patch, the second argument to > > compute_parallel_worker() is the number of pages that the parallel > > scan is expected to fetch from the heap. In this patch, it's the > > total number of pages in the index. The former seems to me to be > > better, because the point of having a threshold relation size for > > parallelism is that we don't want to use a lot of workers to scan a > > small number of pages -- the distribution of work won't be even, and > > the potential savings are limited. 
If we've got a big index but are > > using a very selective qual to pull out only one or a small number of > > rows on a single page or a small handful of pages, we shouldn't > > generate a parallel path for that. > > > > Agreed, that it makes sense to consider only the number of pages to > scan for computation of parallel workers. I think for index scan we > should consider both index and heap pages that need to be scanned > (costing of index scan consider both index and heap pages). I thin > where considering heap pages matter more is when the finally selected > rows are scattered across heap pages or we need to apply a filter on > rows after fetching from the heap. OTOH, we can consider just pages > in the index as that is where mainly the parallelism works. In the > attached patch (parallel_index_opt_exec_support_v7.patch), I have > considered both index and heap pages, let me know if you think some > other way is better. I have also prepared a separate independent > patch (compute_index_pages_v1) on HEAD to compute index pages which > can be used by parallel index scan. There is no change in > parallel_index_scan (parallel btree scan) patch, so I am not attaching > its new version. > > > Now, against that theory, the GUC that controls the behavior of > > compute_parallel_worker() is called min_parallel_relation_size, which > > might make you think that the decision is supposed to be based on the > > whole size of some relation. But I think that just means we need to > > rename the GUC to something like min_parallel_scan_size. Possibly we > > also ought consider reducing the default value somewhat, because it > > seems like both sequential and index scans can benefit even when > > scanning less than 8MB. > > > > Agreed, but let's consider it separately. 
> > > The order in which patches needs to be applied: > compute_index_pages_v1.patch, parallel_index_scan_v6.patch[1], > parallel_index_opt_exec_support_v7.patch > > > [1] - https://www.postgresql.org/message-id/CAA4eK1J% > 3DLSBpDx7i_izGJxGVUryqPe-2SKT02De-PrQvywiMxw%40mail.gmail.com > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com >
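[Editor's note] The sizing argument above (don't parallelize small scans; pick workers from the number of pages actually divided among them) can be made concrete with a simplified model of compute_parallel_worker(): below a threshold no workers are used, and roughly one more worker is added per threefold increase in pages. The exact server formula, GUC defaults and caps may differ; this only shows why passing index pages rather than total heap pages changes the answer.

```python
# Simplified model of the parallel-worker heuristic discussed above.
def compute_parallel_workers(pages, min_parallel_size=1024, max_workers=8):
    if pages < min_parallel_size:
        return 0                      # small scan: parallelism not worth it
    workers = 1
    while pages >= min_parallel_size * 3 and workers < max_workers:
        pages //= 3                   # ~one more worker per tripling
        workers += 1
    return workers

assert compute_parallel_workers(512) == 0
assert compute_parallel_workers(1024) == 1
assert compute_parallel_workers(1024 * 9) == 3

# A big index scanned with a very selective qual touches few pages:
# feeding the pages-to-scan estimate (small) instead of the whole
# index size (large) correctly suppresses the parallel path.
assert compute_parallel_workers(16) == 0          # selective qual
assert compute_parallel_workers(1024 * 27) == 4   # large scan
```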
Re: [HACKERS] Improvements in psql hooks for variables
Hello, Please consider the following comments on the patch. In function ParseVariableNum, > if (!val || !val[0]) > return false; The check for 'val == NULL' in the above condition is also done in the callers of ParseVariableNum(). There should be only one such check. >+ psql_error("Invalid value \"%s\" for \"%s\"\nAn integer is expected\n", Can't we have this error in ParseVariableNum(), similar to ParseVariableBool()? >+ /* >+* Switching AUTOCOMMIT from OFF to ON is not allowed when inside a >+* transaction, because it cannot be effective until the current >+* transaction is ended. >+*/ >+ if (autocommit && !pset.autocommit && >+ pset.db && PQtransactionStatus(pset.db) == PQTRANS_INTRANS) >+ { >+ psql_error("cannot set AUTOCOMMIT to %s when inside a transaction\n", newval); >+ } The above change in autocommit behaviour needs to be documented. Thank you, Rahila Syed On Wed, Jan 25, 2017 at 5:48 PM, Rahila Syed <rahilasye...@gmail.com> wrote: > Hello, > > The patch works fine on applying on latest master branch and testing it > for various variables as listed in PsqlSettings struct. > I will post further comments on patch soon. > > Thank you, > Rahila Syed > > On Wed, Jan 25, 2017 at 1:33 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > >> "Daniel Verite" <dan...@manitou-mail.org> writes: >> > Here's an update with these changes: >> >> I scanned this patch very quickly and think it addresses my previous >> stylistic objections. Rahila is the reviewer of record though, so >> I'll wait for his comments before going further. >> >> regards, tom lane >> > >
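[Editor's note] The behaviour under review (reject an invalid numeric value for a psql variable with an error, instead of silently accepting it, leaving the previous value in place) can be sketched like this. This models the intent of ParseVariableNum(); it is not psql's C code, and the function signature is hypothetical.

```python
# Strict parsing of a numeric setting: on failure, report an error
# and keep the variable's previous value.
def parse_variable_num(val, name, current):
    err = f'invalid value "{val}" for "{name}": an integer is expected'
    if val is None or val == "":
        return False, current, err
    try:
        return True, int(val), None
    except ValueError:
        return False, current, err

ok, fetch_count, err = parse_variable_num("100", "FETCH_COUNT", 0)
assert ok and fetch_count == 100 and err is None

# An invalid value is rejected; FETCH_COUNT keeps its old value.
ok, fetch_count, err = parse_variable_num("1z", "FETCH_COUNT", fetch_count)
assert not ok and fetch_count == 100
assert "integer is expected" in err
```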
Re: [HACKERS] Improvements in psql hooks for variables
Hello, The patch works fine on applying on latest master branch and testing it for various variables as listed in PsqlSettings struct. I will post further comments on patch soon. Thank you, Rahila Syed On Wed, Jan 25, 2017 at 1:33 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > "Daniel Verite" <dan...@manitou-mail.org> writes: > > Here's an update with these changes: > > I scanned this patch very quickly and think it addresses my previous > stylistic objections. Rahila is the reviewer of record though, so > I'll wait for his comments before going further. > > regards, tom lane >
Re: [HACKERS] Parallel Index Scans
>+ /* Check if the scan for current scan keys is finished */ >+ if (so->arrayKeyCount < btscan->btps_arrayKeyCount) >+ *status = false; >I didn't clearly understand, in which scenario the arrayKeyCount is less >than btps_arrayKeyCount? Consider following array scan keys select * from test2 where j=ANY(ARRAY[1000,2000,3000]); By the time the current worker has finished reading heap tuples corresponding to array key 1000(arrayKeyCount = 0), some other worker might have advanced the scan to the array key 2000(btps_arrayKeyCount =1). In this case when the current worker fetches next page to scan, it must advance its scan keys before scanning the next page of parallel scan. I hope this helps. >+BlockNumber >+_bt_parallel_seize(IndexScanDesc scan, bool *status) >The return value of the above function is validated only in _bt_first >function, but in other cases it is not. In other cases it is validated in _bt_readnextpage() which is called after _bt_parallel_seize(). >From the function description, >it is possible to return P_NONE for the workers also with status as >true. I feel it is better to handle the P_NONE case internally only >so that callers just check for the status. Am i missing anything? In case of the next block being P_NONE and status true, the code calls _bt_parallel_done() to notify other workers followed by BTScanPosInvalidate(). Similar check for block = P_NONE also happens in existing code. See following in _bt_readnextpage(), if (blkno == P_NONE || !so->currPos.moreRight) { _bt_parallel_done(scan); BTScanPosInvalidate(so->currPos); return false; } So to keep it consistent with the existing code, the check is kept outside _bt_parallel_seize(). Thank you, Rahila Syed On Wed, Jan 18, 2017 at 6:25 AM, Haribabu Kommi <kommi.harib...@gmail.com> wrote: > > > On Mon, Jan 16, 2017 at 11:11 PM, Amit Kapila <amit.kapil...@gmail.com> > wrote: > >> >> Changed as per suggestion. 
>> >> >> I have also rebased the optimizer/executor support patch >> (parallel_index_opt_exec_support_v4.patch) and added a test case in >> it. > > > Thanks for the patch. Here are comments found during review. > > parallel_index_scan_v4.patch: > > > + amtarget = (char *) ((void *) target) + offset; > > The above calcuation can be moved after NULL check? > > + * index_beginscan_parallel - join parallel index scan > > The name and the description doesn't sync properly, any better description? > > + BTPARALLEL_DONE, > + BTPARALLEL_IDLE > +} PS_State; > > The order of above two enum values can be changed according to their use. > > + /* Check if the scan for current scan keys is finished */ > + if (so->arrayKeyCount < btscan->btps_arrayKeyCount) > + *status = false; > > I didn't clearly understand, in which scenario the arrayKeyCount is less > than btps_arrayKeyCount? > > > +BlockNumber > +_bt_parallel_seize(IndexScanDesc scan, bool *status) > > The return value of the above function is validated only in _bt_first > function, but in other cases it is not. From the function description, > it is possible to return P_NONE for the workers also with status as > true. I feel it is better to handle the P_NONE case internally only > so that callers just check for the status. Am i missing anything? > > > +extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool *status); > +extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber > scan_page); > > Any better names for the above functions, as these function will > provide/set > the next page that needs to be read. > > > parallel_index_opt_exec_support_v4.patch: > > +#include "access/parallel.h" > > Why is it required to be include nbtree.c? i didn't find > any code changes in the patch. > > > + /* reset (parallel) index scan */ > + if (node->iss_ScanDesc) > + { > > Why this if check required? There is an assert check in later function > calls. > > > Regards, > Hari Babu > Fujitsu Australia >
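[Editor's note] The seize/release protocol and the P_NONE handling discussed above can be illustrated with a toy model: a worker seizes the scan to learn the next block; a status of false means another worker already finished the scan for the current keys, while status true with P_NONE means this worker reached the end and (as in _bt_readnextpage) must mark the scan done itself. Hypothetical names and structure; the real logic lives in nbtree.

```python
# Toy model of the parallel btree page hand-off protocol.
P_NONE = None

class ParallelScan:
    def __init__(self, pages):
        self.pages = list(pages)
        self.done = False

    def seize(self):
        """Return (blkno, status). status=False: scan already ended."""
        if self.done:
            return P_NONE, False
        return (self.pages[0] if self.pages else P_NONE), True

    def release(self):
        # Publish the next page so another worker can advance the scan.
        self.pages.pop(0)

    def mark_done(self):
        self.done = True  # _bt_parallel_done()

scan = ParallelScan([10, 20])
read = []
while True:
    blkno, status = scan.seize()
    if not status:
        break                 # another worker ended the scan
    if blkno is P_NONE:
        scan.mark_done()      # caller handles P_NONE, as in _bt_readnextpage
        break
    scan.release()
    read.append(blkno)

assert read == [10, 20]
assert scan.seize() == (P_NONE, False)  # later workers see status=False
```

This mirrors why the P_NONE check is kept in the caller: it stays consistent with the existing `if (blkno == P_NONE || !so->currPos.moreRight)` path quoted above.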
Re: [HACKERS] Parallel Index-only scan
Hello,

On applying the patch on the latest master branch and running regression tests, the following failure occurs. I applied it on the latest parallel index scan patches as given in the link above.

*** /home/rahila/postgres/postgres/src/test/regress/expected/select_parallel.out 2017-01-03 14:06:29.122022780 +0530
--- /home/rahila/postgres/postgres/src/test/regress/results/select_parallel.out 2017-01-12 14:35:56.652712622 +0530
***************
*** 92,103 ****
  explain (costs off)
    select sum(parallel_restricted(unique1)) from tenk1
    group by(parallel_restricted(unique1));
!                 QUERY PLAN
! ------------------------------------------------
  HashAggregate
    Group Key: parallel_restricted(unique1)
!   ->  Index Only Scan using tenk1_unique1 on tenk1
! (3 rows)
  set force_parallel_mode=1;
  explain (costs off)
--- 92,105 ----
  explain (costs off)
    select sum(parallel_restricted(unique1)) from tenk1
    group by(parallel_restricted(unique1));
!                       QUERY PLAN
! ---------------------------------------------------------------
  HashAggregate
    Group Key: parallel_restricted(unique1)
!   ->  Gather
!         Workers Planned: 4
!         ->  Parallel Index Only Scan using tenk1_unique1 on tenk1
! (5 rows)

IIUC, the parallel operation being performed here is fine, as the parallel restricted function occurs above the Gather node, and just the expected output needs to be changed.

Thank you,
Rahila Syed

regression.diffs Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Assignment of valid collation for SET operations on queries with UNKNOWN types.
Thank you all for inputs. Kindly help me clarify the scope of the patch. >However, I thought the idea was to silently coerce affected columns from >unknown to text. This doesn't look like the behavior we want: This patch prevents creation of relation with unknown columns and in addition fixes the particular case of CREATE VIEW with literal columns by coercing unknown to text only in this particular case. Are you suggesting extending the patch to include coercing from unknown to text for all possible cases where a column of unknown type is being created? Thank you, Rahila Syed On Fri, Dec 30, 2016 at 7:21 PM, Michael Paquier <michael.paqu...@gmail.com> wrote: > On Fri, Dec 30, 2016 at 1:30 PM, Ashutosh Bapat > <ashutosh.ba...@enterprisedb.com> wrote: > > On Thu, Dec 29, 2016 at 8:18 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > >> Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> writes: > >>> The way this patch has been written, it doesn't allow creating tables > >>> with unknown type columns, which was allowed earlier. > >> > >> Yes, that's an intentional change; creating such tables (or views) has > >> never been anything but a foot-gun. > >> > >> However, I thought the idea was to silently coerce affected columns from > >> unknown to text. This doesn't look like the behavior we want: > > > > Do you mean to say that when creating a table with a column of unknown > > type, that column type should be silently converted (there's nothing > > to coerce when the table is being created) to text? instead of > > throwing an error? > > FWIW that's what I understood: the patch should switch unknown columns > to text. A bunch of side effects when converting types are avoided > this way. > -- > Michael >
Re: [HACKERS] Parallel Index Scans
>> 5. Comment for _bt_parallel_seize() says:
>> "False indicates that we have reached the end of scan for
>> current scankeys and for that we return block number as P_NONE."
>>
>> What is the reason to check (blkno == P_NONE) after checking (status == false)
>> in _bt_first() (see code below)? If comment is correct
>> we'll never reach _bt_parallel_done()
>>
>> + blkno = _bt_parallel_seize(scan, &status);
>> + if (status == false)
>> + {
>> + BTScanPosInvalidate(so->currPos);
>> + return false;
>> + }
>> + else if (blkno == P_NONE)
>> + {
>> + _bt_parallel_done(scan);
>> + BTScanPosInvalidate(so->currPos);
>> + return false;
>> + }
>>
>The first time master backend or worker hits last page (calls this
>API), it will return P_NONE and after that any worker tries to fetch
>next page, it will return status as false. I think we can expand a
>comment to explain it clearly. Let me know, if you need more
>clarification, I can explain it in detail.

Probably this was confusing because we have not mentioned that P_NONE can be returned even when status = TRUE, and not just when status is false. I think the comment above the function can be modified as follows:

+ /*
+  * True indicates that the block number returned is either valid (including
+  * P_NONE) and the scan is continued, or the block number is invalid and
+  * the scan has just begun.

Thank you,
Rahila Syed

On Thu, Dec 22, 2016 at 9:49 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Wed, Dec 21, 2016 at 8:46 PM, Anastasia Lubennikova
> <lubennikov...@gmail.com> wrote:
> > The following review has been posted through the commitfest application:
> > make installcheck-world: tested, passed
> > Implements feature: tested, passed
> > Spec compliant: tested, passed
> > Documentation: tested, passed
> >
> > Hi, thank you for the patch.
> > Results are very promising. Do you see any drawbacks of this feature or
> > something that requires more testing?
>
> I think you can focus on the handling of array scan keys for testing.
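The status/P_NONE contract being documented above can also be modelled as a small standalone C sketch of the caller-side checks in _bt_first(). This is an illustrative model with hypothetical names and a stand-in sentinel value, not the patch's code:

```c
#include <assert.h>
#include <stdbool.h>

#define P_NONE_MODEL 0          /* stand-in for nbtree's P_NONE sentinel */

typedef enum
{
    SEIZE_SCAN_FINISHED,        /* status == false: just invalidate local pos */
    SEIZE_END_OF_SCAN,          /* status == true, blkno == P_NONE: this caller
                                 * hit the end and must notify the others via a
                                 * _bt_parallel_done()-style call, then stop */
    SEIZE_CONTINUE              /* status == true, valid block: keep reading */
} SeizeOutcome;

/* Both "status == false" and "status == true with P_NONE" end the scan,
 * but only the latter must mark the shared parallel scan as done. */
static SeizeOutcome
classify_seize(bool status, unsigned blkno)
{
    if (!status)
        return SEIZE_SCAN_FINISHED;
    if (blkno == P_NONE_MODEL)
        return SEIZE_END_OF_SCAN;
    return SEIZE_CONTINUE;
}
```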
> In general, one of my colleagues has shown interest in testing this > patch and I think he has tested as well but never posted his findings. > I will request him to share his findings and what kind of tests he has > done, if any. > > > I'm willing to do a review. > > Thanks, that will be helpful. > > > > I saw the discussion about parameters in the thread above. And I agree > that we'd better concentrate > > on the patch itself and add them later if necessary. > > > > 1. Can't we simply use "if (scan->parallel_scan != NULL)" instead of > xs_temp_snap flag? > > > > + if (scan->xs_temp_snap) > > + UnregisterSnapshot(scan->xs_snapshot); > > > > I agree with what Robert has told in his reply. We do it the same way for > heap, refer heap_endscan(). > > > I must say that I'm quite new with all this parallel stuff. If you give > me a link, > > where to read about snapshots for parallel workers, my review will be > more helpful. > > > > You can read transam/README.parallel. Refer to the "State Sharing" portion > of the README to learn more about it. > > > Anyway, it would be great to have more comments about it in the code. > > > > We are sharing the snapshot to ensure that reads in both the master backend > and worker backend can use the same snapshot. There is no harm in > adding comments, but I think it is better to be consistent with > similar heapam code. After reading README.parallel, if you still feel > that we should add more comments in the code, then we can definitely > do that. > > > 2. Don't you mind to rename 'amestimateparallelscan' to let's say > 'amparallelscan_spacerequired' > > or something like this? > > Sure, I am open to other names, but IMHO, let's keep "estimate" in the > name to keep it consistent with other parallel stuff. Refer to > execParallel.c to see how widely this word is used. > > > As far as I understand there is nothing to estimate, we know this size > > for sure. I guess that you've chosen this name because of > 'heap_parallelscan_estimate'.
> > But now it looks similar to 'amestimate' which refers to indexscan cost > for optimizer. > > That leads to the next question. > > > > Do you mean 'amcostestimate'? If you want we can rename it > amparallelscanestimate to be consistent wi
Re: [HACKERS] Assignment of valid collation for SET operations on queries with UNKNOWN types.
Hello,

Thank you for the comments.

>There is a similar code pattern for materialized views, see
>create_ctas_nodata() where the attribute list is built

create_ctas_nodata() is for the creation of materialized views WITH NO DATA. For other materialized views and CREATE TABLE AS, column definitions are built in the intorel_startup() function, which has different code from that of CREATE VIEW, which the patch deals with. I am limiting the scope of the patch to changing the type of literal constants to text only for plain views. Also, it now errors out when a column with UNKNOWN type is being created for other relations like tables and materialized views.

>And actually, shouldn't this be just a plain error?

Changed it to an error in the attached patch.

>Your patch has no regression tests, surely you want some to stress
>this code path

Added regression tests in the attached patch. Also adding this patch to CF 2017-01.

Thank you,
Rahila Syed

unknown_view_column_to_text_v1.patch Description: application/download
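For clarity, my reading of the intended rule can be sketched as a standalone C model. The OID constants and relkind enum are hypothetical stand-ins, and the real patch raises an error where this model returns an invalid OID:

```c
#include <assert.h>

/* Hypothetical OID constants for illustration only. */
#define UNKNOWNOID_MODEL 705
#define TEXTOID_MODEL    25
#define INVALIDOID_MODEL 0

typedef enum { REL_VIEW, REL_TABLE, REL_MATVIEW } RelKindModel;

/* Sketch of the rule described above: a plain view silently resolves an
 * unknown-typed column to text; for tables and materialized views an
 * unknown column is rejected (modeled here by returning an invalid OID,
 * where the patch reports an error instead). */
static int
resolve_column_type(int coltype, RelKindModel relkind)
{
    if (coltype != UNKNOWNOID_MODEL)
        return coltype;             /* known types pass through unchanged */
    if (relkind == REL_VIEW)
        return TEXTOID_MODEL;       /* a literal 'abc' column becomes text */
    return INVALIDOID_MODEL;        /* caller reports "column has type unknown" */
}
```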
Re: [HACKERS] Assignment of valid collation for SET operations on queries with UNKNOWN types.
Hello,

>And ideally fix things so
>that the type of the view column will be resolved as text, so that you
>don't hit this condition in the first place; but there is no good that
>comes out of allowing a view to be created like this

Thank you for the suggestion. Attached is a patch which resolves columns with literal constants as TEXT for view creation. The following views can be created with literal columns resolved as TEXT:

postgres=# create view v as select 'abc' a;
CREATE VIEW
postgres=# create view v1 as select 'def' a;
CREATE VIEW
postgres=# \d+ v1;
                            View "public.v1"
 Column | Type | Collation | Nullable | Default | Storage  | Description
--------+------+-----------+----------+---------+----------+-------------
 a      | text |           |          |         | extended |
View definition:
 SELECT 'def'::text AS a;

This allows the following queries to run successfully, which wasn't the case earlier:

postgres=# select a from v UNION select a from v1;
  a
-----
 abc
 def
(2 rows)

AND

postgres=# select * from v order by 1;
  a
-----
 abc
(1 row)

Kindly give your opinion.

Thank you,
Rahila Syed.

On Thu, Nov 17, 2016 at 8:59 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Rahila Syed <rahilasye...@gmail.com> writes:
> > CASE 2:
> > postgres=# create view v as select 'abc' a;
> > 2016-11-16 15:28:48 IST WARNING: column "a" has type "unknown"
> > 2016-11-16 15:28:48 IST DETAIL: Proceeding with relation creation anyway.
> > WARNING: column "a" has type "unknown"
> > DETAIL: Proceeding with relation creation anyway.
> > CREATE VIEW
>
> We really ought to make that a hard error. And ideally fix things so
> that the type of the view column will be resolved as text, so that you
> don't hit this condition in the first place; but there is no good that
> comes out of allowing a view to be created like this.
>
> > Attached WIP patch does that. Kindly let me know your opinion.
>
> This is a seriously ugly kluge that's attacking the symptom not the
> problem. Or really, a symptom not the problem.
There are lots of
> other symptoms, for instance
>
> regression=# select * from v order by 1;
> ERROR: failed to find conversion function from unknown to text
>
> regards, tom lane
>

unknown_view_column_to_text_convert.patch Description: application/download
Re: [HACKERS] Improvements in psql hooks for variables
I applied and tested the patch on the latest master branch. Kindly consider the following comments.

 ParseVariableBool(const char *value, const char *name)
+ParseVariableBool(const char *value, const char *name, bool *valid)
 {
 	size_t len;

+	if (valid)
+		*valid = true;

 	psql_error("unrecognized value \"%s\" for \"%s\": boolean expected\n",
+			   value, name);
+	if (valid)
+		*valid = false;

Why do we need this? IMO, valid should always be set to true if the value is parsed to be correct. There should not be an option for the caller to not follow the behaviour of setting valid to either true/false. As it is in the current patch, all callers of ParseVariableBool follow the behaviour of setting valid to either true/false.

In the following examples, an incorrect error message is being displayed. “ON_ERROR_ROLLBACK” is an enum and also accepts the value 'interactive'. The error message says boolean expected.

postgres=# \set ON_ERROR_ROLLBACK eretere
unrecognized value "eretere" for "ON_ERROR_ROLLBACK": boolean expected
\set: error while setting variable

Similarly for ECHO_HIDDEN, which is also an enum and accepts the value 'no_exec':

postgres=# \set ECHO_HIDDEN NULL
unrecognized value "NULL" for "ECHO_HIDDEN": boolean expected
\set: error while setting variable

+		bool newval = ParseVariableBool(opt, "\\timing", &success);
+		if (success)
+			pset.timing = newval;
+	}
 	else
 		pset.timing = !pset.timing;
 	if (!pset.quiet)
@@ -2660,7 +2669,14 @@ do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet)
 	if (value && pg_strcasecmp(value, "auto") == 0)
 		popt->topt.expanded = 2;
 	else if (value)
-		popt->topt.expanded = ParseVariableBool(value, param);
+	{
+		bool valid;
+		bool newval = ParseVariableBool(value, param, &valid);
+		if (valid)

Should the same variable name (success / valid) be used in both places for consistency?
Thank you, Rahila Syed On Wed, Nov 23, 2016 at 5:47 PM, Daniel Verite <dan...@manitou-mail.org> wrote: >I wrote: > > > So I went through the psql commands which don't fail on parse errors > > for booleans > > [...] > > Here's a v5 patch implementing the suggestions mentioned upthread: > all meta-commands calling ParseVariableBool() now fail > when the boolean argument can't be parsed successfully. > > Also includes a minor change to SetVariableAssignHook() that now > returns the result of the hook it calls after installing it. > It doesn't make any difference in psql behavior since callers > of SetVariableAssignHook() ignore its return value, but it's > more consistent with SetVariable(). > > Best regards, > -- > Daniel Vérité > PostgreSQL-powered mailer: http://www.manitou-mail.org > Twitter: @DanielVerite >
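The always-set-valid contract suggested in the review above can be sketched as a standalone C model of ParseVariableBool. The accepted spellings here are illustrative and narrower than psql's actual parser (which also accepts prefixes):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Model of the reviewed contract: *valid is always set, unconditionally
 * (no "if (valid)" guard), so every caller can rely on it. On a parse
 * failure the caller keeps the variable's old setting. */
static bool
parse_variable_bool(const char *value, bool *valid)
{
    *valid = true;
    if (strcmp(value, "on") == 0 || strcmp(value, "true") == 0 ||
        strcmp(value, "yes") == 0 || strcmp(value, "1") == 0)
        return true;
    if (strcmp(value, "off") == 0 || strcmp(value, "false") == 0 ||
        strcmp(value, "no") == 0 || strcmp(value, "0") == 0)
        return false;
    *valid = false;     /* "boolean expected": reject, don't guess a value */
    return false;
}
```

The caller-side pattern from the review then becomes: parse into a temporary, and only assign to the real setting (e.g. pset.timing) when the valid flag comes back true.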
[HACKERS] Assignment of valid collation for SET operations on queries with UNKNOWN types.
Following UNION of two queries with constant literals runs successfully.

CASE 1:
postgres=# SELECT 'abc' UNION SELECT 'bcd' ;
 ?column?
----------
 abc
 bcd
(2 rows)

whereas when these literals are part of a view, the UNION fails.

CASE 2:
postgres=# create view v as select 'abc' a;
2016-11-16 15:28:48 IST WARNING: column "a" has type "unknown"
2016-11-16 15:28:48 IST DETAIL: Proceeding with relation creation anyway.
WARNING: column "a" has type "unknown"
DETAIL: Proceeding with relation creation anyway.
CREATE VIEW
postgres=# create view v1 as select 'bcd' a;
2016-11-16 15:28:56 IST WARNING: column "a" has type "unknown"
2016-11-16 15:28:56 IST DETAIL: Proceeding with relation creation anyway.
WARNING: column "a" has type "unknown"
DETAIL: Proceeding with relation creation anyway.
CREATE VIEW
postgres=# select a from v UNION select a from v1;
2016-11-16 15:25:28 IST ERROR: could not determine which collation to use for string comparison
2016-11-16 15:25:28 IST HINT: Use the COLLATE clause to set the collation explicitly.
2016-11-16 15:25:28 IST STATEMENT: select a from v UNION select a from v1;
ERROR: could not determine which collation to use for string comparison
HINT: Use the COLLATE clause to set the collation explicitly.

When a UNION of queries with constant literals as in CASE 1 is allowed, shouldn't a UNION of queries with literals in a view as in CASE 2 be allowed?

In transformSetOperationTree, while determining the result type of the merged output columns, if the left and right column types are UNKNOWNs, the result type is resolved to TEXT. The difference in behaviour in the above two cases arises because the result collation assigned is not valid in CASE 2. When the left and the right inputs are literal constants, i.e. UNKNOWN, as in Case 1, the collation of the result column is correctly assigned a valid value. Whereas when the left and the right inputs are columns of UNKNOWN type as in Case 2, the result collation is InvalidOid.
So if we ensure assignment of a valid collation when the left and the right columns/inputs are UNKNOWN, the above can be resolved. Attached WIP patch does that. Kindly let me know your opinion. Thank you, Rahila Syed invalid_collation_error.patch Description: application/download
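The fix direction described above can be modelled in a few lines of standalone C. The OID and collation constants are hypothetical stand-ins for illustration only, not PostgreSQL's actual values or code:

```c
#include <assert.h>

/* Hypothetical constants for illustration. */
#define UNKNOWNOID_M   705
#define TEXTOID_M      25
#define INVALID_COLL_M 0
#define DEFAULT_COLL_M 100

typedef struct { int type; int collation; } ResultColModel;

/* Model of the idea: when both set-operation inputs are of unknown type,
 * the merged column is resolved as text AND (the point of the patch)
 * receives a valid default collation rather than InvalidOid, so later
 * string comparisons have a collation to use. */
static ResultColModel
merge_setop_column(int ltype, int rtype)
{
    ResultColModel res;
    if (ltype == UNKNOWNOID_M && rtype == UNKNOWNOID_M)
    {
        res.type = TEXTOID_M;
        res.collation = DEFAULT_COLL_M;   /* valid collation, not InvalidOid */
    }
    else
    {
        /* grossly simplified: take the known side's type; the real code
         * consults the resolved type's collation support */
        res.type = (ltype == UNKNOWNOID_M) ? rtype : ltype;
        res.collation = DEFAULT_COLL_M;
    }
    return res;
}
```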
Re: [HACKERS] Improvements in psql hooks for variables
>ECHO_HIDDEN differs from the generic boolean case because it also >accepts "noexec" and pset.echo_hidden is an enum, not a boolean. When >considering refactoring echo_hidden_hook() to call >generic_boolean_hook() instead of ParseVariableBool() after >having established that the value is not "noexec", I don't see >any advantage in clarity or code size, so I'm not in favor of that change I agree if generic_boolean_hook() is used in its current form instead of ParseVariableBool() inside echo_hidden_hook or on_error_rollback_hook the code will not change much. I was suggesting change on the lines of extending generic_boolean_hook to include enum(part in enum_hooks which calls ParseVariableBool) and integer parsing as well. >Besides, I don't think it would go well with hooks. If we wanted one >big function that knows all about parsing all built-in variables, we >could just as well dispense with hooks, since their current purpose in >psql is to achieve this parsing, but in a decentralized way. >Or if we keep them, our various built-in variables would be >essentially tied to the same one-big-hook-that-does-all, but isn't >that an antipattern for hooks? I was suggesting something on the lines of having common parsing logic for each implementation of hook. This is similar to what ParseVariableBool does in the existing code. ParseVariableBool is being called from different hooks to parse the boolean value of the variable. Thus representing common code in various implementations of hook. >"hook" is changed by the patch from [pointer to function returning >void], to [pointer to function returning bool]. The cast to void is >not essential, it just indicates that we deliberately want to >ignore the return value here. I expect some compilers might >complain under a high level of warnings without this cast, although >TBH if you ask me, I wouldn't know which compiler with which flags >exactly. Thank you for explanation. 
Thank you, Rahila Syed On Fri, Nov 11, 2016 at 2:57 AM, Daniel Verite <dan...@manitou-mail.org> wrote: > Rahila Syed wrote: > > > I have applied this patch on latest HEAD and have done basic testing > which > > works fine. > > Thanks for reviewing this patch! > > > >if (current->assign_hook) > > >- (*current->assign_hook) (current->value); > > >- return true; > > >+ { > > >+ confirmed = (*current->assign_hook) (value); > > >+ } > > >+ if (confirmed) > > > > Spurious brackets > > OK. > > > >static bool > > >+generic_boolean_hook(const char *newval, const char* varname, bool > *flag) > > >+{ > > > > Contrary to what name suggests this function does not seem to have other > > implementations as in a hook. > > Also this takes care of rejecting a syntactically wrong value only for > > boolean variable hooks like autocommit_hook, > > on_error_stop_hook. However, there are other variable hooks which call > > ParseVariableBool. > > For instance, echo_hidden_hook which is handled separately in the patch. > > Thus there is some duplication of code between generic_boolean_hook and > > echo_hidden_hook. > > Similarly for on_error_rollback_hook. > > The purpose of generic_boolean_hook() is to handle the case of a > boolean variable that only accepts ON or OFF, and has its pset.varname > declared as bool. I thought of this case as "generic" because that's > the base case and several variables need no more than that. > > ECHO_HIDDEN differs from the generic boolean case because it also > accepts "noexec" and pset.echo_hidden is an enum, not a boolean. When > considering refactoring echo_hidden_hook() to call > generic_boolean_hook() instead of ParseVariableBool() after > having established that the value is not "noexec", I don't see > any advantage in clarity or code size, so I'm not in favor of that change. > > The same applies to on_error_rollback_hook(), which has to deal > with a specific enum as well. 
> > > >-static void > > >+static bool > > > fetch_count_hook(const char *newval) > > > { > > >pset.fetch_count = ParseVariableNum(newval, -1, -1, false); > > >+ return true; > > > } > > > > Shouldn't invalid numeric string assignment for numeric variables be > > handled too? > > Agreed. Assignments like "\set FETCH_COUNT bogus" don't provoke any > user feedback currently, which is not ideal. I'll add this in a > v3 of the patch tomorrow. > > > Instead of generic_boolean_hook cant we have something like follows which
Re: [HACKERS] Improvements in psql hooks for variables
Hello,

I have applied this patch on the latest HEAD and have done basic testing, which works fine. Some comments:

>if (current->assign_hook)
>- (*current->assign_hook) (current->value);
>- return true;
>+ {
>+ confirmed = (*current->assign_hook) (value);
>+ }
>+ if (confirmed)

Spurious brackets.

>static bool
>+generic_boolean_hook(const char *newval, const char* varname, bool *flag)
>+{

Contrary to what the name suggests, this function does not seem to have other implementations as in a hook. Also, this takes care of rejecting a syntactically wrong value only for boolean variable hooks like autocommit_hook, on_error_stop_hook. However, there are other variable hooks which call ParseVariableBool. For instance, echo_hidden_hook, which is handled separately in the patch. Thus there is some duplication of code between generic_boolean_hook and echo_hidden_hook. Similarly for on_error_rollback_hook.

>-static void
>+static bool
> fetch_count_hook(const char *newval)
> {
>pset.fetch_count = ParseVariableNum(newval, -1, -1, false);
>+ return true;
> }

Shouldn't invalid numeric string assignment for numeric variables be handled too?

Instead of generic_boolean_hook, can't we have something like the following which, like generic_boolean_hook, can be called from the specific variable assignment hooks:

static bool ParseVariable(newval, VariableName, )
{
    if (VariableName == ‘AUTOCOMMIT’ || ECHO_HIDDEN || other variable with hooks which call ParseVariableBool )
    else if (VariableName == ‘FETCH_COUNT’)
        ParseVariableNum();
}

This will help merge, in one place, the logic that checks for valid syntax before assigning and returns false on error.

>@@ -260,7 +276,7 @@ SetVariableAssignHook(VariableSpace space, const char *name, VariableAssignHook
>current->assign_hook = hook;
>current->next = NULL;
>previous->next = current;
>- (*hook) (NULL);
>+ (void)(*hook) (NULL); /* ignore return value */

Sorry for my lack of understanding, but can you explain why the above change is needed?
Thank you, Rahila Syed On Tue, Sep 20, 2016 at 11:16 PM, Daniel Verite <dan...@manitou-mail.org> wrote: > Ashutosh Bapat wrote: > > > You may want to add this to the November commitfest > > https://commitfest.postgresql.org/11/. > > Done. It's at https://commitfest.postgresql.org/11/799/ > > Best regards, > -- > Daniel Vérité > PostgreSQL-powered mailer: http://www.manitou-mail.org > Twitter: @DanielVerite > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
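The hooks-return-bool design discussed in this thread can be sketched as a standalone C model. Names and accepted values here are simplified assumptions, not psql's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* An assign hook parses the new value and returns false to reject it,
 * in which case the variable's user-visible value stays unchanged. */
typedef bool (*AssignHookModel)(const char *newval);

static bool model_autocommit = true;    /* stand-in for pset.autocommit */

static bool
autocommit_hook_model(const char *newval)
{
    if (strcmp(newval, "on") == 0)  { model_autocommit = true;  return true; }
    if (strcmp(newval, "off") == 0) { model_autocommit = false; return true; }
    return false;   /* reject: caller reports the error, keeps old value */
}

/* Stand-in for SetVariable(): the stored value is updated only when the
 * hook confirms the new value is acceptable. */
static bool
set_variable_model(AssignHookModel hook, const char *newval)
{
    return hook(newval);
}
```

With this shape, a bogus assignment such as \set AUTOCOMMIT bogus leaves both the internal state and the user-visible variable untouched, instead of assuming a value.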
Re: [HACKERS] Parallel Index Scans
>Another point which needs some thoughts is whether it is good idea to
>use index relation size to calculate parallel workers for index scan.
>I think ideally for index scans it should be based on number of pages
>to be fetched/scanned from index.

IIUC, it's not possible to know the exact number of pages scanned from an index in advance. What we are essentially making parallel is the scan of the leaf pages. So it would make sense to base the number of workers on the number of leaf pages. Having said that, I think it will not make much difference compared to the existing method, because currently the total number of index pages is used to calculate the number of workers. As far as I understand, in large indexes, the difference between the number of leaf pages and total pages is not significant. In other words, internal pages form a small fraction of total pages. Also, the calculation is based on the log of the number of pages, so it will make even less difference.

Thank you,
Rahila Syed

On Tue, Oct 18, 2016 at 8:38 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Oct 13, 2016 at 8:48 AM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
> > As of now, the driving table for parallel query is accessed by
> > parallel sequential scan which limits its usage to a certain degree.
> > Parallelising index scans would further increase the usage of parallel
> > query in many more cases. This patch enables the parallelism for the
> > btree scans. Supporting parallel index scan for other index types
> > like hash, gist, spgist can be done as separate patches.
>
> I would like to have an input on the method of selecting parallel
> workers for scanning index. Currently the patch selects number of
> workers based on size of index relation and the upper limit of
> parallel workers is max_parallel_workers_per_gather. This is quite
> similar to what we do for parallel sequential scan except for the fact
> that in parallel seq.
scan, we use the parallel_workers option if > provided by user during Create Table. User can provide > parallel_workers option as below: > > Create Table With (parallel_workers = 4); > > Is it desirable to have similar option for parallel index scans, if > yes then what should be the interface for same? One possible way > could be to allow user to provide it during Create Index as below: > > Create Index With (parallel_workers = 4); > > If above syntax looks sensible, then we might need to think what > should be used for parallel index build. It seems to me that parallel > tuple sort patch [1] proposed by Peter G. is using above syntax for > getting the parallel workers input from user for parallel index > builds. > > Another point which needs some thoughts is whether it is good idea to > use index relation size to calculate parallel workers for index scan. > I think ideally for index scans it should be based on number of pages > to be fetched/scanned from index. > > > [1] - https://www.postgresql.org/message-id/CAM3SWZTmkOFEiCDpUNaO4n9- > 1xcmWP-1NXmT7h0Pu3gM2YuHvg%40mail.gmail.com > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
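To make the log-style scaling concrete, here is a small standalone C sketch of a size-based worker-count heuristic of the kind discussed above. The threshold value and the tripling factor are illustrative assumptions, not the exact values or code PostgreSQL uses:

```c
#include <assert.h>

/* One additional worker for roughly every tripling of the page count
 * beyond min_pages, capped at max_workers. */
static int
workers_for_pages(long pages, long min_pages, int max_workers)
{
    int workers = 0;

    while (pages >= min_pages && workers < max_workers)
    {
        workers++;
        min_pages *= 3;     /* each extra worker needs ~3x more pages */
    }
    return workers;
}
```

Because each extra worker requires roughly three times as many pages, swapping total index pages for leaf-only pages changes the input by a small factor and therefore rarely changes the resulting worker count, which is the point made in the reply above.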
Re: [HACKERS] Optimization for lazy_scan_heap
Hello,

Some initial comments on optimize_lazy_scan_heap_v2.patch.

>- while (next_unskippable_block < nblocks)
>+ while (next_unskippable_block < nblocks &&
>+ !FORCE_CHECK_PAGE(next_unskippable_block))

Don't we need a similar check of !FORCE_CHECK_PAGE(next_unskippable_block) in the below code, which appears at line no 556 of vacuumlazy.c?

next_unskippable_block = 0;
if ((options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
{
    while (next_unskippable_block < nblocks)

>{
>if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
>break;
>+ n_all_frozen++;
>}
>else
>{
>if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
>break;
>+
>+ /* Count the number of all-frozen pages */
>+ if (vmstatus & VISIBILITYMAP_ALL_FROZEN)
>+ n_all_frozen++;
>}

I think it's better to have just one place where we increment n_all_frozen, before the if-else block.

>}
>vacuum_delay_point();
>next_unskippable_block++;
>+ n_skipped++;
>}
>}

Also, instead of incrementing n_skipped every time, it can be set to 'next_unskippable_block' or 'next_unskippable_block - blkno' once at the end of the loop.

Thank you,
Rahila Syed

On Thu, Aug 25, 2016 at 11:52 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Thu, Aug 25, 2016 at 1:41 AM, Anastasia Lubennikova
> <lubennikov...@gmail.com> wrote:
> > The following review has been posted through the commitfest application:
> > make installcheck-world: tested, passed
> > Implements feature: not tested
> > Spec compliant: not tested
> > Documentation: not tested
> >
> > Hi,
> > I haven't tested the performance yet, but the patch itself looks pretty good
> > and reasonable improvement.
> > I have a question about removing the comment. It seems to be really tricky
> > moment. How do we know that all-frozen block hasn't changed since the
> > moment we checked it?
>
> I think that we don't need to check whether all-frozen block hasn't
> changed since we checked it.
> Even if the all-frozen block has changed after we saw it as an
> all-frozen page, we can update the relfrozenxid.
> Because any new XIDs just added to that page are newer than the > GlobalXmin we computed. > > Am I missing something? > > In this patch, since we count the number of all-frozen page even > during not an aggresive scan, I thought that we don't need to check > whether these blocks were all-frozen pages. > > > I'm going to test the performance this week. > > I wonder if you could send a test script or describe the steps to test > it? > > This optimization reduces the number of scanning visibility map when > table is almost visible or frozen.. > So it would be effective for very large table (more than several > hundreds GB) which is almost all-visible or all-frozen pages. > > For example, > 1. Create very large table. > 2. Execute VACUUM FREEZE to freeze all pages of table. > 3. Measure the execution time of VACUUM FREEZE. > I hope that the second VACUUM FREEZE become fast a little. > > I modified the comment, and attached v2 patch. > > Regards, > > -- > Masahiko Sawada > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > >
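The restructuring suggested in the review above (incrementing n_all_frozen once, before the if-else block) can be sketched as a standalone C model. The flag constants and function shape are hypothetical simplifications of the vacuumlazy.c skip loop:

```c
#include <assert.h>
#include <stdbool.h>

#define VM_ALL_VISIBLE_M 0x01   /* stand-ins for the visibility-map bits */
#define VM_ALL_FROZEN_M  0x02

/* Count all-frozen pages once, before branching on scan mode. This relies
 * on the invariant that an all-frozen page is also all-visible, so a page
 * that stops the skip loop (returns false) is never one we just counted. */
static bool
page_skippable(int vmstatus, bool aggressive, long *n_all_frozen)
{
    if (vmstatus & VM_ALL_FROZEN_M)
        (*n_all_frozen)++;

    if (aggressive)
        return (vmstatus & VM_ALL_FROZEN_M) != 0;   /* skip only frozen pages */
    return (vmstatus & VM_ALL_VISIBLE_M) != 0;      /* skip all-visible pages */
}
```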
Re: [HACKERS] Improvements in psql hooks for variables
Hello,

I am beginning to review this patch. Initial comment: I got the following compilation errors when I applied the patch on the latest sources and ran make.

command.c: In function ‘exec_command’:
command.c:257:5: error: too few arguments to function ‘ParseVariableBool’
  ParseVariableBool(opt1 + sizeof(prefix) - 1, prefix) ?
  ^
In file included from settings.h:12:0, from command.c:50:
variables.h:38:7: note: declared here
 bool ParseVariableBool(const char *value, const char *name, bool *valid);
 ^
command.c:1551:4: error: too few arguments to function ‘ParseVariableBool’
  pset.timing = ParseVariableBool(opt, "\\timing");
  ^
In file included from settings.h:12:0, from command.c:50:
variables.h:38:7: note: declared here
 bool ParseVariableBool(const char *value, const char *name, bool *valid);
 ^
command.c: In function ‘do_pset’:
command.c:2663:4: error: too few arguments to function ‘ParseVariableBool’
  popt->topt.expanded = ParseVariableBool(value, param);
  ^
In file included from settings.h:12:0, from command.c:50:
variables.h:38:7: note: declared here
 bool ParseVariableBool(const char *value, const char *name, bool *valid);
 ^
command.c:2672:4: error: too few arguments to function ‘ParseVariableBool’
  popt->topt.numericLocale = ParseVariableBool(value, param);
  ^
In file included from settings.h:12:0, from command.c:50:
variables.h:38:7: note: declared here
 bool ParseVariableBool(const char *value, const char *name, bool *valid);
 ^
command.c:2727:4: error: too few arguments to function ‘ParseVariableBool’
  popt->topt.tuples_only = ParseVariableBool(value, param);
  ^
In file included from settings.h:12:0, from command.c:50:
variables.h:38:7: note: declared here
 bool ParseVariableBool(const char *value, const char *name, bool *valid);
 ^
command.c:2759:4: error: too few arguments to function ‘ParseVariableBool’
  if (ParseVariableBool(value, param))
  ^
In file included from settings.h:12:0, from command.c:50:
variables.h:38:7: note: declared here
 bool ParseVariableBool(const
char *value, const char *name, bool *valid); ^ *command.c:2781:4:* error: too few arguments to function ‘ParseVariableBool’ popt->topt.default_footer = ParseVariableBool(value, param); ^ In file included from settings.h:12:0, from command.c:50: variables.h:38:7: note: declared here bool ParseVariableBool(const char *value, const char *name, bool *valid); ^ Thank you, Rahila Syed On Mon, Sep 19, 2016 at 11:21 AM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote: > You may want to add this to the November commitfest > https://commitfest.postgresql.org/11/. > > On Fri, Sep 16, 2016 at 6:05 PM, Daniel Verite <dan...@manitou-mail.org> > wrote: > > Hi, > > > > Following the discussion on forbidding an AUTOCOMMIT off->on > > switch mid-transaction [1], attached is a patch that let the hooks > > return a boolean indicating whether a change is allowed. > > > > Using the hooks, bogus assignments to built-in variables can > > be dealt with more strictly. > > > > For example, pre-patch behavior: > > > > =# \set ECHO errors > > =# \set ECHO on > > unrecognized value "on" for "ECHO"; assuming "none" > > =# \echo :ECHO > > on > > > > which has two problems: > > - we have to assume a value, even though we can't know what the user > meant. > > - after assignment, the user-visible value of the variable diverges from > its > > internal counterpart (pset.echo in this case). > > > > > > Post-patch: > > =# \set ECHO errors > > =# \set ECHO on > > unrecognized value "on" for "ECHO" > > \set: error while setting variable > > =# \echo :ECHO > > errors > > > > Both the internal pset.* state and the user-visible value are kept > unchanged > > is the input value is incorrect. > > > > Concerning AUTOCOMMIT, autocommit_hook() could return false to forbid > > a switch when the conditions are not met. > > > > > > Another user-visible effect of the patch is that, using a bogus value > > for a built-in variable on the command-line becomes a fatal error > > that prevents psql to continue. 
> > This is not directly intended by the patch but is a consequence > > of SetVariable() failing. > > > > Example: > > $ ./psql -vECHO=bogus > > unrecognized value "bogus" for "ECHO" > > psql: could not set variable "ECHO" > > $ echo $? > > 1 > > > > The built-in vars c
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
>Have you considered expanding >the API for hook functions? Changing the hooks API to allow rejecting a setting and returning false is certainly useful for other psql variables that want to report an error and reject a value. I did not consider expanding the hook API because there was no requirement in sight for other variables to reject a setting. As far as autocommit is concerned, something in line with the current design can be implemented. In the current design, any unrecognisable/bad value is reinterpreted and the execution inside the hook is always successful. In keeping with the current design of hooks, instead of rejecting an autocommit 'ON' setting inside a transaction, the value can be set to 'ON' with a psql_error displaying that the value will take effect when the current transaction has ended. >Actually, it would make a lot more sense UI-wise if attempting to assign a >non-boolean value to a boolean variable resulted in an error and no change >to the variable, instead of what happens now. The hooks API can be expanded to implement this. The proposed feature is mainly to reduce ambiguity for the user when \set AUTOCOMMIT on is run within a transaction. Under the current behaviour, the variable is set immediately but is effective only once the current transaction has ended. It is good to notify the user of this. This ambiguity was highlighted because in AUTOCOMMIT off mode Postgres implicitly starts a transaction, and the behaviour of \set AUTOCOMMIT ON in such a scenario can be confusing. Thank you, Rahila Syed On Wed, Sep 14, 2016 at 8:06 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Rahila Syed <rahilasye...@gmail.com> writes: > >> Looking at the other variables hooks, they already emit errors and can > >> deny the effect of a change corresponding to a new value, without > >> informing the caller. Why would autocommit be different?
> > > These instances where psql_error occurs inside hooks the command is > > successful and the value supplied by user is reinterpreted to some other > > value as user had supplied an unrecognisable value. > > With psql_error_on_autocommit patch what was intended was to make > > the command unsuccessful and keep the previous setting of autocommit. > > Hence having it inside autocommit_hook did not seem appropriate to me. > > Nonetheless, asking all callers of SetVariable to deal with such cases > is entirely unmaintainable/unacceptable. Have you considered expanding > the API for hook functions? I'm not really sure why we didn't provide a > way for the hooks to reject a setting to begin with. > > Actually, it would make a lot more sense UI-wise if attempting to assign a > non-boolean value to a boolean variable resulted in an error and no change > to the variable, instead of what happens now. > > Anyway, I'm not very thrilled with the idea that AUTOCOMMIT is so special > that it should have a different behavior than any other built-in psql > variable. If we make them all throw errors and refuse to change to bad > values, that would be consistent and defensible IMO. But having > AUTOCOMMIT alone act that way is not a feature, it's a wart. > > regards, tom lane >
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
Hello, >Looking at the other variables hooks, they already emit errors and can >deny the effect of a change corresponding to a new value, without >informing the caller. Why would autocommit be different? The only type of psql_error inside hooks is as follows: psql_error("unrecognized value \"%s\" for \"%s\"; assuming \"%s\"\n", newval, "ECHO", "none"); In these instances where psql_error occurs inside hooks, the command is successful and the value supplied by the user is reinterpreted as some other value, because the user supplied an unrecognisable value. What the psql_error_on_autocommit patch intended was to make the command unsuccessful and keep the previous setting of autocommit. Hence having it inside autocommit_hook did not seem appropriate to me. But as pointed out by you, the other way of setting autocommit, i.e. *SELECT 'on' as "AUTOCOMMIT" \gset*, would not be handled by the patch. So I will change the patch to have the check in the autocommit_hook instead. This will mean that if *\set AUTOCOMMIT ON* or *SELECT 'on' as "AUTOCOMMIT" \gset* is run inside a transaction, it will take effect after the current transaction has ended. An appropriate message will be displayed notifying the user of this, so the user need not rerun the \set AUTOCOMMIT command. Thank you, Rahila Syed On Thu, Sep 8, 2016 at 9:55 PM, Daniel Verite <dan...@manitou-mail.org> wrote: > Rahila Syed wrote: > > > However, including the check here will require returning the status > > of command from this hook. i.e if we throw error inside this > > hook we will need to return false indicating the value has not changed. > > Looking at the other variables hooks, they already emit errors and can > deny the effect of a change corresponding to a new value, without > informing the caller. Why would autocommit be different? > > For example echo_hook does this: > > /* ...if the value is in (queries,errors,all,none) then >assign pset.echo accordingly ...
*/ > else > { > psql_error("unrecognized value \"%s\" for \"%s\"; assuming \"%s\"\n", >newval, "ECHO", "none"); > pset.echo = PSQL_ECHO_NONE; > } > > > If the user issues \set ECHO FOOBAR, then it produces the above error > message and makes the same internal change as if \set ECHO none > had been issued. > > But, the value of the variable as seen by the user is still FOOBAR: > > \set > [...other variables...] > ECHO = 'FOOBAR' > > The proposed patch doesn't follow that behavior, as it denies > a new value for AUTOCOMMIT. You might argue that it's better, > but on the other side, there are two reasons against it: > > - it's not consistent with the other uses of \set that affect psql, > such as ECHO, ECHO_HIDDEN, ON_ERROR_ROLLBACK, > COMP_KEYWORD_CASE... and even AUTOCOMMIT as > \set AUTOCOMMIT FOOBAR is not denied, just reinterpreted. > > - it doesn't address the case of another method than \set being used > to set the variable, as in : SELECT 'on' as "AUTOCOMMIT" \gset > whereas the hook would work in that case. > > Best regards, > -- > Daniel Vérité > PostgreSQL-powered mailer: http://www.manitou-mail.org > Twitter: @DanielVerite >
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
Thank you for the comments. >But there's a function in startup.c which might be the ideal location >for the check, as it looks like the one and only place where the >autocommit flag actually changes: >static void >autocommit_hook(const char *newval) Thank you for pointing this out. This is indeed the only place where the autocommit setting changes in psql. However, including the check here will require returning the status of the command from this hook, i.e. if we throw an error inside this hook we will need to return false, indicating the value has not changed. This will change the existing definition of the hook, which returns void. It will also lead to changes in other implementations of this hook apart from autocommit_hook. >There are other values than ON: true/yes and their abbreviations, >the value 1, and anything that doesn't resolve to OFF is taken as ON. >ParseVariableBool() in commands.c already does the job of converting >these to bool, the new code should probably just call that function >instead of parsing the value itself. I have included this in the attached patch. I have also included prior review comments by Rushabh Lathia and Amit Langote in the attached patch. Regarding the suggestion to move the code inside SetVariable(): I think it is not a good idea, because SetVariable() and the variable.c file in which it is present are about the handling of psql variables (checks for correct variable names, values, etc.). This check is about the correctness of the command, so it is better placed in exec_command(). Thank you, Rahila Syed On Sat, Sep 3, 2016 at 4:39 PM, Daniel Verite <dan...@manitou-mail.org> wrote: > Rushabh Lathia wrote: > > > It might happen that SetVariable() can be called from other places in > > future, > > so its good to have all the set variable related checks inside > > SetVariable().
> > There's already another way to skip the \set check: > select 'on' as "AUTOCOMMIT" \gset > > But there's a function in startup.c which might be the ideal location > for the check, as it looks like the one and only place where the > autocommit flag actually changes: > > static void > autocommit_hook(const char *newval) > { > pset.autocommit = ParseVariableBool(newval, "AUTOCOMMIT"); > } > > > > Best regards, > -- > Daniel Vérité > PostgreSQL-powered mailer: http://www.manitou-mail.org > Twitter: @DanielVerite > psql_error_on_autocommit_v2.patch Description: application/download -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
>Document doesn't say this changes are only for implicit BEGIN.. Correcting myself here. Not just implicit BEGIN: it will throw an error on \set AUTOCOMMIT inside any transaction, provided the earlier value of AUTOCOMMIT was OFF. Probably in the test in which you tried it, it was already ON. Thank you, Rahila Syed On Fri, Sep 2, 2016 at 3:17 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > > > On Fri, Sep 2, 2016 at 1:12 PM, Rahila Syed <rahilasye...@gmail.com> > wrote: > >> Hello, >> >> Thank you for comments. >> >> >Above test not throwing psql error, as you used strcmp(newval,"ON"). You >> >should use pg_strcasecmp. >> Corrected in the attached. >> >> >Above error coming because in below code block change, you don't have >> check for >> >command (you should check opt0 for AUTOCOMMIT command) >> Corrected in the attached. >> >> >postgres=# BEGIN; >> >BEGIN >> >postgres=# create table foo ( a int ); >> >CREATE TABLE >> >postgres=# \set AUTOCOMMIT ON >> >Don't you think, in above case also we should throw a psql error? >> IMO, in this case BEGIN is explicitly specified by user, so I think it is >> understood that a commit is required for changes to be effective. >> Hence I did not consider this case. >> >> > Document doesn't say this changes are only for implicit BEGIN.. > > + > + > + Autocommit cannot be set on inside a transaction, the ongoing > + transaction has to be ended by entering COMMIT or > + ROLLBACK before setting autocommit on. > + > + > > In my opinion, its good to be consistent, whether its implicit or > explicitly specified BEGIN. > > >postgres=# \set AUTOCOMMIT off >> >postgres=# create table foo ( a int ); >> >CREATE TABLE >> >postgres=# \set XYZ ON >> >\set: Cannot set XYZ to ON inside a transaction, either COMMIT or >> ROLLBACK and retry >> >> >May be you would like to move the new code block inside SetVariable(). >> So that >> >don't need to worry about the validity of the variable names.
>> >> I think validating variable names won't be required if we throw an error only >> if the command is \set AUTOCOMMIT. >> Validation can happen later, as in the existing code. >> >> > It might happen that SetVariable() can be called from other places in > future, > so its good to have all the set variable related checks inside > SetVariable(). > > Also you can modify the condition in such a way that we don't often end > up > calling PQtransactionStatus() even though it's not required. > > Example: > > if (!strcmp(opt0,"AUTOCOMMIT") && > !strcasecmp(newval,"ON") && > !pset.autocommit && > PQtransactionStatus(pset.db) == PQTRANS_INTRANS) > > > >Basically if I understand correctly, if we are within a transaction and >> someone >> >tries to set AUTOCOMMIT ON, it should throw a psql error. Correct me >> >if I am missing anything? >> >> Yes, the psql_error is thrown when AUTOCOMMIT is turned on inside a >> transaction. But only when there is an implicit BEGIN, as in the following case: >> >> postgres=# \set AUTOCOMMIT OFF >> postgres=# create table test(i int); >> CREATE TABLE >> postgres=# \set AUTOCOMMIT ON >> \set: Cannot set AUTOCOMMIT to ON inside a transaction, either COMMIT or >> ROLLBACK and retry >> postgres=# >> >> Thank you, >> Rahila Syed >> >> > > > Regards, > Rushabh Lathia > www.EnterpriseDB.com >
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
Hello, Thank you for the comments. >Above test not throwing psql error, as you used strcmp(newval,"ON"). You >should use pg_strcasecmp. Corrected in the attached. >Above error coming because in below code block change, you don't have check for >command (you should check opt0 for AUTOCOMMIT command) Corrected in the attached. >postgres=# BEGIN; >BEGIN >postgres=# create table foo ( a int ); >CREATE TABLE >postgres=# \set AUTOCOMMIT ON >Don't you think, in above case also we should throw a psql error? IMO, in this case BEGIN is explicitly specified by the user, so I think it is understood that a commit is required for the changes to be effective. Hence I did not consider this case. >postgres=# \set AUTOCOMMIT off >postgres=# create table foo ( a int ); >CREATE TABLE >postgres=# \set XYZ ON >\set: Cannot set XYZ to ON inside a transaction, either COMMIT or ROLLBACK and retry >May be you would like to move the new code block inside SetVariable(). So that >don't need to worry about the validity of the variable names. I think validating variable names won't be required if we throw an error only if the command is \set AUTOCOMMIT. Validation can happen later, as in the existing code. >Basically if I understand correctly, if we are within transaction and someone >tries the set the AUTOCOMMIT ON, it should throw a psql error. Correct me >if I am missing anything? Yes, the psql_error is thrown when AUTOCOMMIT is turned on inside a transaction, but only when there is an implicit BEGIN, as in the following case: postgres=# \set AUTOCOMMIT OFF postgres=# create table test(i int); CREATE TABLE postgres=# \set AUTOCOMMIT ON \set: Cannot set AUTOCOMMIT to ON inside a transaction, either COMMIT or ROLLBACK and retry postgres=# Thank you, Rahila Syed psql_error_on_autocommit_v1.patch Description: application/download
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
Ok. Please find attached a patch which introduces a psql error when autocommit is turned on inside a transaction. It also adds relevant documentation in psql-ref.sgml. The following is the output: bash-4.2$ psql -d postgres psql (10devel) Type "help" for help. postgres=# \set AUTOCOMMIT OFF postgres=# create table test(i int); CREATE TABLE postgres=# \set AUTOCOMMIT ON \set: Cannot set AUTOCOMMIT to ON inside a transaction, either COMMIT or ROLLBACK and retry postgres=# Thank you, Rahila Syed On Wed, Aug 17, 2016 at 6:31 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Tue, Aug 16, 2016 at 5:25 PM, Rahila Syed <rahilasye...@gmail.com> > wrote: > >>I think I like the option of having psql issue an error. On the > >>server side, the transaction would still be open, but the user would > >>receive a psql error message and the autocommit setting would not be > >>changed. So the user could type COMMIT or ROLLBACK manually and then > >>retry changing the value of the setting. > > > > Throwing psql error comes out to be most accepted outcome on this > thread. I > > agree it is safer than guessing user intention. > > > > Although according to the default behaviour of psql, error will abort the > > current transaction and roll back all the previous commands. > > A server error would do that, but a psql error won't. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > psql_error_on_autocommit.patch Description: application/download
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
>I think I like the option of having psql issue an error. On the >server side, the transaction would still be open, but the user would >receive a psql error message and the autocommit setting would not be >changed. So the user could type COMMIT or ROLLBACK manually and then >retry changing the value of the setting. Throwing a psql error appears to be the most accepted outcome on this thread. I agree it is safer than guessing the user's intention. However, according to the default behaviour of psql, the error will abort the current transaction and roll back all the previous commands. This can be user-unfriendly, making the user rerun all the commands just because of an autocommit switch. So probably the behaviour of 'ON_ERROR_ROLLBACK on' needs to be implemented along with the error display, which will roll back just the autocommit switch command. Also, a psql error instead of a simple commit will lead to script terminations, whereas issuing a COMMIT would be more viable for scripts. However, script termination can be avoided by the default behaviour of ON_ERROR_STOP, which will execute the subsequent commands. (The subsequent commands won't be executed in autocommit mode, which I think should be OK, as it will be notified via the ERROR.) So, summarizing my view of the discussion on this thread: issuing a psql error seems to be the best option. I will post a patch regarding this if there is no objection. Thank you, Rahila Syed
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
Thank you for your inputs, everyone. The opinions on this thread can be classified into the following: 1. Commit 2. Rollback 3. Error 4. Warning. As per an opinion upthread, issuing an implicit commit immediately after switching autocommit to ON can be unsafe if it was not desired. While I agree that it is difficult to judge the user's intention here, if we were to base it on some assumption, the closest would be an implicit COMMIT, in my opinion. There is a higher likelihood of a user being happy with issuing a commit when setting autocommit ON than with a transaction being rolled back. Also, there are quite a few interfaces which provide this. As mentioned upthread, issuing a warning on switching back to autocommit will not be effective inside a script: it won't allow subsequent commands to be committed, as the command setting autocommit to ON is itself not committed. Scripts will have to be rerun with changes, which will impact user-friendliness. While I agree that issuing an ERROR and rolling back the transaction ranks higher in safety, it is not as common (going by the instances stated upthread) as immediately committing any open transaction when switching back to autocommit. Thank you, Rahila Syed On Sun, Aug 7, 2016 at 4:42 AM, Matt Kelly <mkell...@gmail.com> wrote: > Its worth noting that the JDBC's behavior when you switch back to > autocommit is to immediately commit the open transaction. > > Personally, I think committing immediately or erroring are unsurprising > behaviors. The current behavior is surprising and obviously wrong. > Rolling back without an error would be very surprising (no other database > API I know of does that) and would take forever to debug issues around the > behavior. And committing after the next statement is definitely the most > surprising behavior suggested. > > IMHO, I think committing immediately and erroring are both valid.
I think > I'd prefer the error in principle, and per the law of bad code I'm sure, > although no one has ever intended to use this behavior, there is probably > some code out there that is relying on this behavior for "correctness". I > think a hard failure and making the dev add an explicit commit is least > likely to cause people serious issues. As for the other options, consider > me opposed. > > - Matt K. >
[HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
Hello, I need the community's opinion on the following behaviour of \set AUTOCOMMIT. If \set AUTOCOMMIT ON is issued after \set AUTOCOMMIT OFF, the commands which follow after AUTOCOMMIT is set ON are not committed until an explicit COMMIT is issued. It can be surprising to the user to not see the results of the commands fired after AUTOCOMMIT is set to ON. bash-4.2$ psql -d postgres -U rahila psql (9.6beta3) Type "help" for help. postgres=# \set AUTOCOMMIT OFF postgres=# create table test1(i int); CREATE TABLE postgres=# \set AUTOCOMMIT ON postgres=# create table test2(j int); CREATE TABLE postgres=# \c postgres rahila You are now connected to database "postgres" as user "rahila". postgres=# \dt; No relations found. The ongoing transaction is left running across this change of mode from AUTOCOMMIT OFF to AUTOCOMMIT ON. This happens because \set AUTOCOMMIT ON is fired within a transaction block, which was started implicitly when the first command after \set AUTOCOMMIT OFF was executed; hence an explicit COMMIT is required for it to be effective. Should changing the value from OFF to ON automatically either commit or roll back the transaction in progress? FWIW, running set autocommit through ecpg commits the ongoing transaction when autocommit is set to ON from OFF. Should such behaviour be implemented for \set AUTOCOMMIT ON as well? Thank you, Rahila Syed
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
>Oops. I forgot to credit you in the commit message. Sorry about that. :-( No problem :). Thanks for the commit. Thank you, Rahila Syed
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hello, A server crash was reported when running the vacuum progress checker view on a 32-bit machine. Please find attached a fix for the same. The crash occurred because a 32-bit machine treats int8 as passed by reference when creating the tuple descriptor. At the time of filling the tuple store, the code (heap_fill_tuple) checks this tuple descriptor before inserting the value. It finds the attribute type pass-by-reference and hence treats the value as a pointer when it is not, and thus it fails at the time of memcpy. This happens because the appropriate conversion function was not employed while storing the value of that particular attribute into the values array before copying it into the tuple store. - values[i+3] = UInt32GetDatum(beentry->st_progress_param[i]); + values[i+3] = Int64GetDatum(beentry->st_progress_param[i]); The attached patch fixes this. Thank you, Rahila Syed On Wed, Mar 16, 2016 at 11:30 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Wed, Mar 16, 2016 at 6:44 AM, Rahila Syed <rahilasye...@gmail.com> > wrote: > >>Sorta.
Committed after renaming what you called heap blocks vacuumed > >>back to heap blocks scanned, adding heap blocks vacuumed, removing the > >>overall progress meter which I don't believe will be anything close to > >>accurate, fixing some stylistic stuff, arranging to update multiple > >>counters automatically where it could otherwise produce confusion, > >>moving the new view near similar ones in the file, reformatting it to > >>follow the style of the rest of the file, exposing the counter > >>#defines via a header file in case extensions want to use them, and > >>overhauling and substantially expanding the documentation > > > > We have following lines, > > > > /* report that everything is scanned and vacuumed */ > > pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, > > blkno); > > pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, > > blkno); > > > > > > which appear before final vacuum cycle happens for any remaining dead > tuples > > which may span few pages if I am not mistaken. > > > > IMO, reporting final count of heap_blks_scanned is correct here, but > > reporting final heap_blks_vacuumed can happen after the final VACUUM > cycle > > for more accuracy. > > You are quite right. Good catch. Fixed that, and applied Vinayak's > patch too, and fixed another mistake I saw while I was at it. 
> > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >

diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0f6f891..64c4cc4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -612,7 +612,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 	{
 		values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
 		for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
-			values[i+3] = UInt32GetDatum(beentry->st_progress_param[i]);
+			values[i+3] = Int64GetDatum(beentry->st_progress_param[i]);
 	}
 	else
 	{
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
>Sorta. Committed after renaming what you called heap blocks vacuumed >back to heap blocks scanned, adding heap blocks vacuumed, removing the >overall progress meter which I don't believe will be anything close to >accurate, fixing some stylistic stuff, arranging to update multiple >counters automatically where it could otherwise produce confusion, >moving the new view near similar ones in the file, reformatting it to >follow the style of the rest of the file, exposing the counter >#defines via a header file in case extensions want to use them, and >overhauling and substantially expanding the documentation We have the following lines: /* report that everything is scanned and vacuumed */ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno); pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno); which appear before the final vacuum cycle happens for any remaining dead tuples, which may span a few pages, if I am not mistaken. IMO, reporting the final count of heap_blks_scanned is correct here, but reporting the final heap_blks_vacuumed should happen after the final VACUUM cycle, for more accuracy. Thank you, Rahila Syed
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hello, While I am still looking at this WIP patch, I had one suggestion. Instead of making changes in the index AM API, can we have a call to update the shared state using the pgstat_progress* API directly from the specific index-level code, as the pgstat_count_index_scan(rel) call from _bt_first does? Though that function basically updates local structures and sends the count to the stats collector via messages, we can have a function which instead modifies the shared state using the progress API committed recently. Thank you, Rahila Syed
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
>Okay, I agree that reporting just the current blkno is simple and good >enough. How about numbers of what we're going to report as the "Vacuuming >Index and Heap" phase? I guess we can still keep the scanned_index_pages >and index_scan_count. So we have: >+CREATE VIEW pg_stat_vacuum_progress AS >+ SELECT >+ S.pid, >+ S.relid, >+ S.phase, >+ S.total_heap_blks, >+ S.current_heap_blkno, >+ S.total_index_pages, >+ S.scanned_index_pages, >+ S.index_scan_count >+ S.percent_complete, >+ FROM pg_stat_get_vacuum_progress() AS S; >I guess it won't remain pg_stat_get_"vacuum"_progress(), though. Apart from these, as suggested in [1], finer-grained reporting from the index vacuuming phase can provide better insight. Currently we report the number of blocks processed only once, at the end of vacuuming each index. IIUC, what was suggested in [1] was instrumenting lazy_tid_reaped with a counter to count the number of index tuples processed so far, as lazy_tid_reaped is called for every index tuple to see if it matches any of the dead tuple TIDs. So additional parameters for each index can be: scanned_index_tuples, total_index_tuples (from the pg_class.reltuples entry). Thank you, Rahila Syed [1]. http://www.postgresql.org/message-id/56500356.4070...@bluetreble.com
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
>+if(!scan_all) >+scanned_heap_pages = scanned_heap_pages + >next_not_all_visible_block; >I don't want to be too much of a stickler for details here, but it >seems to me that this is an outright lie. Initially, scanned_heap_pages was meant to report just the scanned pages, and skipped pages were not added to the count. Instead, the skipped pages were deducted from the total number of heap pages to be scanned, so that the number of scanned pages would eventually add up to the total heap pages. As per comments received later, the total heap pages were kept constant and the skipped-pages count was added to the scanned pages, to make the count add up to the total heap pages at the end of the scan. That said, as suggested, scanned_heap_pages should be renamed to current_heap_page, reporting the current blkno in the lazy_scan_heap loop, which will add up to the total heap pages (nblocks) at the end of the scan. And scanned_heap_pages can be reported as a separate number which won't include skipped pages. >+/* >+ * Reporting vacuum progress to statistics collector >+ */ >This patch doesn't report anything to the statistics collector, nor should it. Yes. This was incorrectly added initially, by referring to similar pgstat_report interface functions. Thank you, Rahila Syed On Thu, Jan 28, 2016 at 2:27 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Tue, Jan 26, 2016 at 11:37 PM, Vinayak Pokale <vinpok...@gmail.com> > wrote: > > Hi, > > > > Please find attached updated patch with an updated interface. > > Well, this isn't right.
You've got this sort of thing: > > +scanned_index_pages += RelationGetNumberOfBlocks(Irel[i]); > +/* Report progress to the statistics collector */ > +pgstat_report_progress_update_message(0, progress_message); > +pgstat_report_progress_update_counter(1, scanned_heap_pages); > +pgstat_report_progress_update_counter(3, scanned_index_pages); > +pgstat_report_progress_update_counter(4, > vacrelstats->num_index_scans + 1); > > The point of having pgstat_report_progress_update_counter() is so that > you can efficiently update a single counter without having to update > everything, when only one counter has changed. But here you are > calling this function a whole bunch of times in a row, which > completely misses the point - if you are updating all the counters, > it's more efficient to use an interface that does them all at once > instead of one at a time. > > But there's a second problem here, too, which is that I think you've > got this code in the wrong place. The second point of the > pgstat_report_progress_update_counter interface is that this should be > cheap enough that we can do it every time the counter changes. That's > not what you are doing here. You're updating the counters at various > points in the code and just pushing new values for all of them > regardless of which ones have changed. I think you should find a way > that you can update the value immediately at the exact moment it > changes. If that seems like too much of a performance hit we can talk > about it, but I think the value of this feature will be greatly > weakened if users can't count on it to be fully and continuously up to > the moment. When something gets stuck, you want to know where it's > stuck, not approximately kinda where it's stuck. > > +if(!scan_all) > +scanned_heap_pages = scanned_heap_pages + > next_not_all_visible_block; > > I don't want to be too much of a stickler for details here, but it > seems to me that this is an outright lie. 
The number of scanned pages > does not go up when we decide to skip some pages, because scanning and > skipping aren't the same thing. If we're only going to report one > number here, it needs to be called something like "current heap page", > and then you can just report blkno at the top of each iteration of > lazy_scan_heap's main loop. If you want to report the numbers of > scanned and skipped pages separately that'd be OK too, but you can't > call it the number of scanned pages and then actually report a value > that is not that. > > +/* > + * Reporting vacuum progress to statistics collector > + */ > > This patch doesn't report anything to the statistics collector, nor should > it. > > Instead of making the SQL-visible function > pg_stat_get_vacuum_progress(), I think it should be something more > generic like pg_stat_get_command_progress(). Maybe VACUUM will be the > only command that reports into that feature for right now, but I'd > hope for us to change that pretty soon after we get the first patch > committed. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
>I suspect you need to create a new CF entry for this patch in CF 2016-01.

Unless I am missing something, there is no entry for this patch on the CF 2016-01 page: https://commitfest.postgresql.org/8/. Regrettably, we have missed the deadline to add the patch to this commitfest. Is there still some way to add it to commitfest 2016-01? As this feature received a lot of feedback in the previous commitfest, adding it to this commitfest will surely help in progressing it and making it ready for PostgreSQL 9.6.

Thank you,
Rahila Syed

On Mon, Dec 28, 2015 at 6:01 AM, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
>
> Hi Vinayak,
>
> On 2015/12/25 21:46, Vinayak Pokale wrote:
> > Hi,
> > Please find attached patch addressing following comments.
> >
> >> The relation OID should be reported and not its name. In case of a
> >> relation rename that would not be cool for tracking, and most users
> >> are surely going to join with other system tables using it.
> > The relation OID is reported instead of relation name.
> > The following interface function is called at the beginning to report the
> > relation OID once.
> > void pgstat_report_command_target(Oid relid)
> >
> >> Regarding pg_stat_get_vacuum_progress(): I think a backend can simply be
> >> skipped if (!has_privs_of_role(GetUserId(), beentry->st_userid)) (cannot
> >> put that in plain English, :))
> > Updated in the attached patch.
> >
> > In the previous patch, ACTIVITY_IS_VACUUM is set unnecessarily for
> > VACOPT_FULL and they are not covered by lazy_scan_heap().
> > I have updated it in attached patch and also renamed ACTIVITY_IS_VACUUM to
> > COMMAND_LAZY_VACUUM.
> >
> > Added documentation for view.
>
> Some more comments need to be addressed.
>
> I suspect you need to create a new CF entry for this patch in CF 2016-01.
>
> Thanks,
> Amit
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hello,

Thank you for your comments. Please find attached a patch addressing the following comments.

>- duplicate_oids error in HEAD.

Check.

>- a compiler warning:
>pgstat.c:2898: warning: no previous prototype for ‘pgstat_reset_activityflag’

Check.

>One more change you could do is 's/activityflag/activity_flag/g',

Check.

>Type of the flag of vacuum activity.

The flag variable is an integer so as to incorporate more commands in future.

>Type of st_progress_param and so.

st_progress_param is also given a generic name so as to incorporate different parameters reported from various commands.

>st_progress_param_float is currently totally useless.

The float parameter has been removed from the patch.

>Definition of progress_message.
>The definition of progress_message in lazy_scan_heap is "char
>[PROGRESS_MESSAGE_LENGTH][N_PROGRESS_PARAM]" which looks to be
>inversed.

Corrected.

>The current code subtracts the number of blocks when
>skipping_all_visible_blocks is set in two places. But I think
>it is enough to decrement when skipping.

In both places the pages are being skipped, hence the total count was decremented.

>He suggested to keep total_heap_pages fixed while adding number
>of skipped pages to that of scanned pages.

This has been done in the attached.

>snprintf(progress_message[0], PROGRESS_MESSAGE_LENGTH, "%s.%s",
> get_namespace_name(RelationGetNamespace(rel)),
> relname);

Check.

The previous implementation added the total number of pages across all indexes of a relation to total_index_pages on every scan of the indexes, to account for the total pages scanned. Thus, it was equal to number of scans * total_index_pages. In the attached patch, total_index_pages reflects the total number of pages across all indexes of a relation, and a column reporting passes through the indexes (phase 2) has been added to account for the number of passes for index and heap vacuuming. The number of scanned index pages is reset at the end of each pass. This makes the reporting clearer.
The percent complete does not account for index pages; it just denotes the percentage of heap scanned.

>Spotted a potential oversight regarding report of scanned_pages. It
>seems pages that are skipped because of not getting a pin, being new,
>being empty could be left out of the progress equation.

Corrected.

>It's better to
>report that some other way, like use one of the strings to report a
>"phase" of processing that we're currently performing.

Has been included in the attached.

Some more comments still need to be addressed: renaming the activity flag, reporting only changed parameters to shared memory, the ACTIVITY_IS_VACUUM flag being set unnecessarily for the ANALYZE and FULL commands, and documentation for the new view. Also, finer-grained reporting from the index and heap-truncate phases is yet to be incorporated into the patch.

Thank you,
Rahila Syed

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..d53833e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -631,6 +631,19 @@ CREATE VIEW pg_stat_activity AS
         WHERE S.datid = D.oid AND
               S.usesysid = U.oid;
 
+CREATE VIEW pg_stat_vacuum_progress AS
+    SELECT
+        S.pid,
+        S.table_name,
+        S.phase,
+        S.total_heap_pages,
+        S.scanned_heap_pages,
+        S.percent_complete,
+        S.total_index_pages,
+        S.scanned_index_pages,
+        S.index_scan_count
+    FROM pg_stat_get_vacuum_progress() AS S;
+
 CREATE VIEW pg_stat_replication AS
     SELECT
         S.pid,
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c4ef58..e27a8f3 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -284,6 +284,7 @@ vacuum(int options, RangeVar *relation, Oid relid, VacuumParams *params,
 	VacuumPageMiss = 0;
 	VacuumPageDirty = 0;
+	pgstat_report_activity_flag(ACTIVITY_IS_VACUUM);
 
 	/*
 	 * Loop to process each selected relation.
 	 */
@@ -325,6 +326,7 @@ vacuum(int options, RangeVar *relation, Oid relid, VacuumParams *params,
 	{
 		in_vacuum = false;
 		VacuumCostActive = false;
+		pgstat_reset_activity_flag();
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
@@ -355,6 +357,7 @@ vacuum(int options, RangeVar *relation, Oid relid, VacuumParams *params,
 		vac_update_datfrozenxid();
 	}
 
+	pgstat_reset_activity_flag();
 	/*
 	 * Clean up working storage --- note we must do this after
 	 * StartTransactionCommand, else we might be trying to delete the active
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2429889..1c74b51 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -439,9 +439,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, Relation *Irel, int nindexes, bool scan_
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hello Michael,

I am planning to continue contributing to this feature in any way, be it by reviewing the patch or making one, though I haven't been able to reply to the comments or post an updated patch lately. I plan to do that soon.

Thank you,
Rahila

On Thu, Nov 19, 2015 at 1:09 PM, Michael Paquier wrote:
> On Thu, Nov 19, 2015 at 4:30 PM, Amit Langote
> wrote:
> > On 2015/11/19 16:18, Amit Langote wrote:
> >> I'm marking this as "Waiting on author" in the commitfest app. Also, let's
> >> hear from more people.
> >
> > Well, it seems Michael beat me to it.
>
> Yeah, some other folks provided feedback that's why I did it.
> @Rahila: are you planning more work on this patch? It seems that at
> this point we're waiting for you. If not, and because I have a couple
> of times waited for long VACUUM jobs to finish on some relations
> without having much information about their progress, I would be fine
> to spend time hacking this thing. That's up to you.
> Regards,
> --
> Michael
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
>This doesn't seem to compile

Oh. It compiled successfully when applied to HEAD on my machine. Anyway, the OID is changed to 3309 in the attached patch. Both 3308 and 3309 are in the unused OID list.

Thank you,
Rahila Syed

Vacuum_progress_checker_v2.patch
Description: Binary data
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hello,

On Jul 16, 2015 1:48 AM, "Rahila Syed" <rahilasye...@gmail.com> wrote:
>
> Hello,
>
> Please find attached updated patch with an interface to calculate
> command progress in pgstat.c. This interface currently implements
> VACUUM progress tracking.

I have added this patch to CommitFest 2015-09. It is marked as "Waiting on author". I will post an updated patch as per the review comments soon.

Thank you,
Rahila Syed
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
>In case of vacuum, I think we need to track the number of scanned heap
>pages at least, and the information about index scan is the additional
>information

Actually, the progress of the heap page scan depends on index scans. So complete VACUUM progress needs to include a count of index pages scanned too. Progress can then be calculated by measuring index_pages_scanned + heap_pages_scanned against total_index_pages + total_heap_pages. This makes up the essential information, which can be followed by additional per-phase information.

The following fields, common across different commands, can be used to display progress:

Command    work done    total work    percent complete    message
VACUUM     x            y             z                   total progress
           u            v             w                   phase 1

The command code can be divided into distinct phases and each phase's progress can be represented separately, with a summary of the entire command's progress as the first entry. The summary can be the summation of the individual phase entries. If a phase repeats during command execution, the previous entry for that phase will be replaced (for example, the index scan in vacuum).

>Essential information has one numeric data, which is stored essentially
>information regarding of its processing.

We may need more than one numeric datum, as mentioned above, to represent scanned blocks versus total blocks.

>Additional information has two data: text and numeric. These data is
>free-style data which is stored by each backend as it like.

If I understand your point correctly, I think you are missing the following: the amount of additional information for each command can differ. We may need an array of text and numeric data to represent more additional information.

Thank you,
Rahila Syed
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hello,

>Say, 6 bigint counters, 6 float8 counters, and 3 strings up to 80
>characters each. So we have a fixed-size chunk of shared memory per
>backend, and each backend that wants to expose progress information can
>fill in those fields however it likes, and we expose the results. This
>would be sorta like the way pg_statistic works: the same columns can be
>used for different purposes depending on what estimator will be used to
>access them.

After thinking more on this suggestion, I came up with the following generic structure, which can be used to store the progress of any command per backend in shared memory.

struct PgBackendProgress
{
    int32 *counter[COMMAND_NUM_SLOTS];
    float8 *counter_float[COMMAND_NUM_SLOTS];
    char *progress_message[COMMAND_NUM_SLOTS];
};

COMMAND_NUM_SLOTS will define the maximum number of slots (phases) for any command. The progress of a command will be measured using the progress of each phase of the command. For some commands the number of phases can be singular and the rest of the slots will be NULL. Each phase will report n integer counters, n float counters, and a progress message. For some phases, any of the above fields can be NULL.

For VACUUM, there can be 3 phases, as discussed in the earlier mails:

Phase 1. Report 2 integer counters: heap pages scanned and total heap pages; 1 float counter: percentage_complete; and a progress message.
Phase 2. Report 2 integer counters: index pages scanned and total index pages (across all indexes); and a progress message.
Phase 3. Report 1 integer counter: heap pages vacuumed.

This structure can be accessed by the statistics collector to display progress via a new view.

Thank you,
Rahila Syed
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
>The total number of heap pages is known, and the total number of index
>pages is also known, so it's possible to derive a percentage out of this
>part.

The total number of index pages scanned during the entire vacuum will depend on the number of index scans that happen. In order to extrapolate percent complete for phase two (index scan), we need number of index scans * total index pages. The number of index scans can vary from 1 to n (n depending on maintenance_work_mem).

Summarizing the suggestions in previous mails, the following information can be reported:

Phase 1. heap pages scanned / total heap pages
Phase 2. index pages scanned / total index pages (across all indexes)
Phase 3. count of heap pages vacuumed

Additional info like the number of index scans so far and the number of dead tuples being vacuumed in one batch can also be provided. A combined estimate for vacuum percent complete can be provided by summing heap pages scanned and index pages scanned against total heap pages plus total index pages * number of index scans.

Thank you,
Rahila Syed
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
>I think it's related to the problem of figuring out how many dead tuples
>you expect to find in the overall heap, which you need to do to have any
>hope of this being a comprehensive estimate.

An estimate of the number of index scans while vacuuming can be made using an estimate of the total dead tuples in the relation and maintenance_work_mem. n_dead_tuples in pg_stat_all_tables can be used as an estimate of dead tuples. The following can be a way to estimate:

if nindexes == 0
    index_scans = 0
else if pages_all_visible
    index_scans = 0
else
    index_scans = Max((n_dead_tuples * space occupied by single dead tuple) / m_w_m, 1)

This estimates index_scans = 1 if n_dead_tuples = 0, assuming lazy_scan_heap is likely to find some dead tuples. If n_dead_tuples is non-zero, the above estimate gives a lower bound on the number of index scans possible.

Thank you,
Rahila Syed
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hello,

Please find attached an updated patch with an interface to calculate command progress in pgstat.c. This interface currently implements VACUUM progress tracking. A column named percent_complete has been added in pg_stat_activity to report progress.

VACUUM calls the progress calculation interface periodically, at an interval specified by the pgstat_track_progress GUC in ms. Progress calculation can be disabled by setting pgstat_track_progress to -1.

remaining_time for VACUUM is not included in the current patch, to avoid cluttering pg_stat_activity with too many columns. But the estimate, as seen from the previous implementation, seems reasonable enough to be included in the progress information, maybe as an exclusive view for vacuum progress information.

The GUC parameter pgstat_track_progress is currently PGC_SUSET, in line with the track_activities GUC parameter. Although IMO, pgstat_track_progress could be made PGC_USERSET in order to give any user the flexibility to enable/disable progress calculation, provided progress is tracked only when the track_activities GUC parameter is enabled.

In this patch, index scans are not taken into account for progress calculation as of now.

Thank you,
Rahila Syed

Vacuum_progress_checker_v1.patch
Description: Binary data
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hello,

Thank you for the suggestions.

>Yes, I suggest just a single column on pg_stat_activity called pct_complete

Reporting the remaining time can also be crucial for making decisions about continuing or aborting VACUUM. The same has been suggested in the thread below:
http://www.postgresql.org/message-id/13072.1284826...@sss.pgh.pa.us

>trace_completion_interval = 5s (default)
>Every interval, we report the current % complete for any operation that
>supports it. We just show NULL if the current operation has not reported
>anything or never will. We do this for VACUUM first, then we can begin
>adding other operations as we work out how (for that operation).

Thank you for explaining. This design seems good to me, except that adding more than one column (percent_complete, remaining_time), if required, to pg_stat_activity can be less intuitive for users than having a separate view for VACUUM.

-Rahila Syed

On Tue, Jun 30, 2015 at 2:02 PM, Simon Riggs si...@2ndquadrant.com wrote:
On 30 June 2015 at 08:52, Pavel Stehule pavel.steh...@gmail.com wrote:
I though about the possibilities of progress visualization - and one possibility is one or two special column in pg_stat_activity table - this info can be interesting for VACUUM started by autovacuum too.
Yes, I suggest just a single column on pg_stat_activity called pct_complete
trace_completion_interval = 5s (default)
Every interval, we report the current % complete for any operation that supports it. We just show NULL if the current operation has not reported anything or never will. We do this for VACUUM first, then we can begin adding other operations as we work out how (for that operation).
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hello, I though about the possibilities of progress visualization - and one possibility is one or two special column in pg_stat_activity table - this info can be interesting for VACUUM started by autovacuum too. Thank you for suggestion. The design with hooks and a separate view was mainly to keep most of the code outside core as the feature proposed is specific to VACUUM command. Also, having a separate view can give more flexibility in terms of displaying various progress parameters. FWIW ,there was resistance to include columns in pg_stat_activity earlier in the following thread, http://www.postgresql.org/message-id/AANLkTi=TcuMA38oGUKX9p5WVPpY+M3L0XUp7=plt+...@mail.gmail.com Thank you, Rahila Syed On Tue, Jun 30, 2015 at 1:22 PM, Pavel Stehule pavel.steh...@gmail.com wrote: Hi 2015-06-30 9:37 GMT+02:00 Rahila Syed rahilasye...@gmail.com: Hello Hackers, Following is a proposal for feature to calculate VACUUM progress. interesting idea - I like to see it integrated to core. Use Case : Measuring progress of long running VACUUMs to help DBAs make informed decision whether to continue running VACUUM or abort it. Design: A shared preload library to store progress information from different backends running VACUUM, calculate remaining time for each and display progress in the in the form a view. probably similar idea can be used for REINDEX, CREATE INDEX, COPY TO statements I though about the possibilities of progress visualization - and one possibility is one or two special column in pg_stat_activity table - this info can be interesting for VACUUM started by autovacuum too. Regards Pavel VACUUM needs to be instrumented with a hook to collect progress information (pages vacuumed/scanned) periodically. The patch attached implements a new hook to store vacuumed_pages and scanned_pages count at the end of each page scanned by VACUUM. This information is stored in a shared memory structure. 
In addition to measuring progress, this hook function also calculates the remaining time for VACUUM. The frequency of collecting progress information can be reduced by appending delays in between hook function calls. Also, a GUC parameter log_vacuum_min_duration can be used. This will cause VACUUM progress to be calculated only if VACUUM runs for more than the specified number of milliseconds. A value of zero calculates VACUUM progress for each page processed; -1 disables logging.

Progress calculation:

percent_complete = scanned_pages * 100 / total_pages_to_be_scanned;
remaining_time = elapsed_time * (total_pages_to_be_scanned - scanned_pages) / scanned_pages;

Shared memory struct:

typedef struct PgStat_VacuumStats
{
    Oid    databaseoid;
    Oid    tableoid;
    Int32  vacuumed_pages;
    Int32  total_pages;
    Int32  scanned_pages;
    double elapsed_time;
    double remaining_time;
} PgStat_VacuumStats[max_connections];

Reporting: A view named 'pg_maintenance_progress' can be created using the values in the struct above. pg_stat_maintenance can be called from any other backend and will display the progress of each running VACUUM.

Other uses of the hook in VACUUM: the cost of VACUUM in terms of pages hit, missed, and dirtied, same as autovacuum, can be collected using this hook. Autovacuum does it at the end of VACUUM for each table; it can instead be done while VACUUM on a table is in progress. This can be helpful for tracking manual VACUUMs too, not just autovacuum. Read/write (I/O) rates can be computed along the lines of autovacuum. Read rate patterns can be used to help tune future vacuums on the table (like shared buffers tuning). Other resource usages can also be collected using the progress checker hook.

Attached patch is a POC patch of progress calculation for a single backend. Also attached is a brief snapshot of the output log.
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello, I have some minor comments The comments have been implemented in the attached patch. I think that extra parenthesis should be used for the first expression with BKPIMAGE_HAS_HOLE. Parenthesis have been added to improve code readability. Thank you, Rahila Syed On Mon, Mar 9, 2015 at 5:38 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Mon, Mar 9, 2015 at 4:29 PM, Fujii Masao masao.fu...@gmail.com wrote: On Thu, Mar 5, 2015 at 10:08 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Mar 5, 2015 at 9:14 PM, Syed, Rahila rahila.s...@nttdata.com wrote: Please find attached a patch. As discussed, flag to denote compression and presence of hole in block image has been added in XLogRecordImageHeader rather than block header. Thanks for updating the patch! Attached is the refactored version of the patch. Cool. Thanks! I have some minor comments: +The default value is literaloff/ Dot at the end of this sentence. +Turning this parameter on can reduce the WAL volume without Turning valueon/ this parameter +but at the cost of some extra CPU time by the compression during +WAL logging and the decompression during WAL replay. Isn't a verb missing here, for something like that: but at the cost of some extra CPU spent on the compression during WAL logging and on the decompression during WAL replay. + * This can reduce the WAL volume, but at some extra cost of CPU time + * by the compression during WAL logging. Er, similarly some extra cost of CPU spent on the compression + if (blk-bimg_info BKPIMAGE_HAS_HOLE + (blk-hole_offset == 0 || +blk-hole_length == 0 || I think that extra parenthesis should be used for the first expression with BKPIMAGE_HAS_HOLE. + if (blk-bimg_info BKPIMAGE_IS_COMPRESSED + blk-bimg_len == BLCKSZ) + { Same here. + /* +* cross-check that hole_offset == 0 and hole_length == 0 +* if the HAS_HOLE flag is set. +*/ I think that you mean here that this happens when the flag is *not* set. 
+ /*
+  * If BKPIMAGE_HAS_HOLE and BKPIMAGE_IS_COMPRESSED,
+  * an XLogRecordBlockCompressHeader follows
+  */
Maybe a struct should be added for an XLogRecordBlockCompressHeader struct. And a dot at the end of the sentence should be added?

Regards,
--
Michael

Support-compression-full-page-writes-in-WAL_v25.patch
Description: Binary data
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello, When I test the WAL or replication related features, I usually run make installcheck and pgbench against the master at the same time after setting up the replication environment. I will conduct these tests before sending updated version. Seems this increases the header size of WAL record even if no backup block image is included. Right? Yes, this increases the header size of WAL record by 1 byte for every block reference even if it has no backup block image. Isn't it better to add the flag info about backup block image into XLogRecordBlockImageHeader rather than XLogRecordBlockHeader Yes , this will make the code extensible,readable and will save couple of bytes per record. But the current approach is to provide a chunk ID identifying different xlog record fragments like main data , block references etc. Currently , block ID is used to identify record fragments which can be either XLR_BLOCK_ID_DATA_SHORT , XLR_BLOCK_ID_DATA_LONG or actual block ID. This can be replaced by chunk ID to separate it from block ID. Block ID can be used to number the block fragments whereas chunk ID can be used to distinguish between main data fragments and block references. Chunk ID of block references can contain information about presence of data, image , hole and compression. Chunk ID for main data fragments remains as it is . This approach provides for readability and extensibility. Thank you, Rahila Syed On Mon, Mar 2, 2015 at 3:43 PM, Fujii Masao masao.fu...@gmail.com wrote: On Fri, Feb 27, 2015 at 12:44 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Feb 27, 2015 at 8:01 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Feb 27, 2015 at 6:54 AM, Rahila Syed rahilasye...@gmail.com wrote: Even this patch doesn't work fine. The standby emit the following error messages. Yes this bug remains unsolved. I am still working on resolving this. Following chunk IDs have been added in the attached patch as suggested upthread. 
+#define XLR_CHUNK_BLOCK_REFERENCE 0x10 +#define XLR_CHUNK_BLOCK_HAS_IMAGE 0x04 +#define XLR_CHUNK_BLOCK_HAS_DATA 0x08 XLR_CHUNK_BLOCK_REFERENCE denotes chunk ID of block references. XLR_CHUNK_BLOCK_HAS_IMAGE is a replacement of BKPBLOCK_HAS_IMAGE and XLR_CHUNK_BLOCK_HAS DATA a replacement of BKPBLOCK_HAS_DATA. Before sending a new version, be sure that this get fixed by for example building up a master with a standby replaying WAL, and running make installcheck-world or similar. If the standby does not complain at all, you have good chances to not have bugs. You could also build with WAL_DEBUG to check record consistency. +1 When I test the WAL or replication related features, I usually run make installcheck and pgbench against the master at the same time after setting up the replication environment. typedef struct XLogRecordBlockHeader { +uint8chunk_id;/* xlog fragment id */ uint8id;/* block reference ID */ Seems this increases the header size of WAL record even if no backup block image is included. Right? Isn't it better to add the flag info about backup block image into XLogRecordBlockImageHeader rather than XLogRecordBlockHeader? Originally we borrowed one or two bits from its existing fields to minimize the header size, but we can just add new flag field if we prefer the extensibility and readability of the code. Regards, -- Fujii Masao
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello,

Attached is a patch which has the following changes.

As suggested above, the block ID in the xlog structs has been replaced by a chunk ID. The chunk ID is used to distinguish between different types of xlog record fragments, like:

XLR_CHUNK_ID_DATA_SHORT
XLR_CHUNK_ID_DATA_LONG
XLR_CHUNK_BKP_COMPRESSED
XLR_CHUNK_BKP_WITH_HOLE

In block references, the block ID follows the chunk ID. Here the block ID retains its functionality. This approach increases the data by 1 byte for each block reference in an xlog record. It separates the ID referring to different fragments of an xlog record from the actual block ID, which is used to refer to block references in an xlog record.

Following are the WAL numbers for each scenario:

                       WAL
FPW compression on     121.652 MB
FPW compression off    148.998 MB
HEAD                   148.764 MB

Compression remains nearly the same as before. There is some difference in WAL between HEAD and HEAD+patch+compression OFF. This difference corresponds to the 1-byte increase with each block reference of an xlog record.

Thank you,
Rahila Syed

On Wed, Feb 18, 2015 at 7:53 PM, Syed, Rahila rahila.s...@nttdata.com wrote:

Hello,

I think we should change the xlog format so that the block_id (which currently is XLR_BLOCK_ID_DATA_SHORT/LONG or a actual block id) isn't the block id but something like XLR_CHUNK_ID. Which is used as is for XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks can be set to to XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED, XLR_CHUNK_BKP_REFERENCE... The BKP blocks will then follow, storing the block id following the chunk id. Yes, that'll increase the amount of data for a backup block by 1 byte, but I think that's worth it. I'm pretty sure we will be happy about the added extensibility pretty soon.

To clarify my understanding of the above change: instead of a block id to reference different fragments of an xlog record, a single byte field chunk_id should be used. chunk_id will be the same as XLR_BLOCK_ID_DATA_SHORT/LONG for main data fragments.
But for block references, it will store the following values in order to record information about the backup blocks.

#define XLR_CHUNK_BKP_COMPRESSED 0x01
#define XLR_CHUNK_BKP_WITH_HOLE 0x02
...

The new xlog format should look as follows:

Fixed-size header (XLogRecord struct)
Chunk_id (add a field before the id field in the XLogRecordBlockHeader struct)
XLogRecordBlockHeader
Chunk_id
XLogRecordBlockHeader
...
...
Chunk_id (rename the id field of the XLogRecordDataHeader struct)
XLogRecordDataHeader[Short|Long]
block data
block data
...
main data

I will post a patch based on this.

Thank you,
Rahila Syed

-----Original Message-----
From: Andres Freund [mailto:and...@2ndquadrant.com]
Sent: Monday, February 16, 2015 5:26 PM
To: Syed, Rahila
Cc: Michael Paquier; Fujii Masao; PostgreSQL mailing lists
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On 2015-02-16 11:30:20 +0000, Syed, Rahila wrote:
- * As a trivial form of data compression, the XLOG code is aware that
- * PG data pages usually contain an unused hole in the middle, which
- * contains only zero bytes. If hole_length > 0 then we have removed
- * such a hole from the stored data (and it's not counted in the
- * XLOG record's CRC, either). Hence, the amount of block data actually
- * present is BLCKSZ - hole_length bytes.
+ * Block images are able to do several types of compression:
+ * - When wal_compression is off, as a trivial form of compression, the
+ *   XLOG code is aware that PG data pages usually contain an unused hole
+ *   in the middle, which contains only zero bytes. If length < BLCKSZ
+ *   then we have removed such a hole from the stored data (and it is
+ *   not counted in the XLOG record's CRC, either). Hence, the amount
+ *   of block data actually present is length bytes. The hole offset
+ *   on page is defined using hole_offset.
+ * - When wal_compression is on, block images are compressed using a
+ *   compression algorithm without their hole to improve compression
+ *   process of the page. length corresponds in this case to the length
+ *   of the compressed block. hole_offset is the hole offset of the page,
+ *   and the length of the uncompressed block is defined by raw_length,
+ *   whose data is included in the record only when compression is enabled
+ *   and with_hole is set to true, see below.
+ *
+ * is_compressed is used to identify if a given block image is compressed
+ * or not. Maximum page size allowed on the system being 32k, the hole
+ * offset cannot be more than 15-bit long so the last free bit is used to
+ * store the compression state of block image. If the maximum page size
+ * allowed is increased to a value higher than that, we should consider
+ * increasing this structure size as well
Re: [HACKERS] Compression of full-page-writes
So this test can be used to evaluate how shorter records influence performance since the master waits for flush confirmation from the standby, right?

Yes. This test can help measure the performance improvement due to reduced I/O on the standby, as the master waits for WAL records to be flushed on the standby.

Isn't that GB and not MB?

Yes. That is a typo. It should be GB.

How many FPWs have been generated and how many dirty buffers have been flushed for the 3 checkpoints of each test? Any data about the CPU activity?

The above data is not available for this run. I will rerun the tests to gather it.

Thank you,
Rahila Syed

--
View this message in context: http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5833389.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Compression of full-page-writes
Hello,

Below are performance numbers for synchronous replication with and without FPW compression, using the latest version of the patch (version 14). The patch helps improve performance considerably.

Both master and standby are on the same machine in order to get numbers independent of network overhead. The compression patch helps increase tps by 10%. It also reduces disk I/O, latency, and total runtime for a fixed number of transactions, as shown below. The compression of WAL is quite high, around 40%.

pgbench scale: 1000
pgbench command: pgbench -c 16 -j 16 -r -t 25 -M prepared
To ensure that data is not highly compressible, empty filler columns were altered using:
alter table pgbench_accounts alter column filler type text using gen_random_uuid()::text
checkpoint_segments = 1024
checkpoint_timeout = 5min
fsync = on

                      Compression on            Compression off
WAL generated         23037180520 (~23.04MB)    38196743704 (~38.20MB)
TPS                   264.18                    239.34
Latency average       60.541 ms                 66.822 ms
Latency stddev        126.567 ms                130.434 ms
Total writes to disk  145045.310 MB             192357.250 MB
Runtime               15141.0 s                 16712.0 s

Server specifications:
Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Thank you,
Rahila Syed
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Isn't it better to allocate the memory for compression_scratch in InitXLogInsert() like hdr_scratch? I think making compression_scratch a statically allocated global variable is the result of following discussion earlier, http://www.postgresql.org/message-id/ca+tgmoaznbuwnls4bpwyqgqteeznoavy7kwdbm0a2-tbarn...@mail.gmail.com Thank you, Rahila Syed On Thu, Dec 18, 2014 at 1:57 PM, Fujii Masao masao.fu...@gmail.com wrote: On Thu, Dec 18, 2014 at 2:21 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Dec 17, 2014 at 11:33 PM, Rahila Syed rahilasye...@gmail.com wrote: I had a look at code. I have few minor points, Thanks! + bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE; + + if (is_compressed) { - rdt_datas_last-data = page; - rdt_datas_last-len = BLCKSZ; + /* compressed block information */ + bimg.length = compress_len; + bimg.extra_data = hole_offset; + bimg.extra_data |= XLR_BLCK_COMPRESSED_MASK; For consistency with the existing code , how about renaming the macro XLR_BLCK_COMPRESSED_MASK as BKPBLOCK_HAS_COMPRESSED_IMAGE on the lines of BKPBLOCK_HAS_IMAGE. OK, why not... + blk-hole_offset = extra_data ~XLR_BLCK_COMPRESSED_MASK; Here , I think that having the mask as BKPBLOCK_HOLE_OFFSET_MASK will be more indicative of the fact that lower 15 bits of extra_data field comprises of hole_offset value. This suggestion is also just to achieve consistency with the existing BKPBLOCK_FORK_MASK for fork_flags field. Yeah that seems clearer, let's define it as ~XLR_BLCK_COMPRESSED_MASK though. And comment typo +* First try to compress block, filling in the page hole with zeros +* to improve the compression of the whole. If the block is considered +* as incompressible, complete the block header information as if +* nothing happened. As hole is no longer being compressed, this needs to be changed. Fixed. As well as an additional comment block down. 
A couple of things noticed on the fly:
- Fixed pg_xlogdump being not completely correct to report the FPW information
- A couple of typos and malformed sentences fixed
- Added an assertion to check that the hole offset value does not use the bit used for compression status
- Reworked docs, mentioning as well that wal_compression is off by default.
- Removed stuff in pg_controldata and XLOG_PARAMETER_CHANGE (mentioned by Fujii-san)

Thanks!

+else
+    memcpy(compression_scratch, page, page_len);

I don't think the block image needs to be copied to the scratch buffer here. We can try to compress the page directly.

+#include "utils/pg_lzcompress.h"
 #include "utils/memutils.h"

pg_lzcompress.h should be after memutils.h.

+/* Scratch buffer used to store block image to-be-compressed */
+static char compression_scratch[PGLZ_MAX_BLCKSZ];

Isn't it better to allocate the memory for compression_scratch in InitXLogInsert() like hdr_scratch?

+uncompressed_page = (char *) palloc(PGLZ_RAW_SIZE(header));

Why don't we allocate the buffer for the uncompressed page only once and keep reusing it like XLogReaderState->readBuf? The size of the uncompressed page is at most BLCKSZ, so we can allocate the memory for it even before knowing the real size of each block image.

-printf(" (FPW); hole: offset: %u, length: %u\n",
-       record->blocks[block_id].hole_offset,
-       record->blocks[block_id].hole_length);
+if (record->blocks[block_id].is_compressed)
+    printf(" (FPW); hole offset: %u, compressed length %u\n",
+           record->blocks[block_id].hole_offset,
+           record->blocks[block_id].bkp_len);
+else
+    printf(" (FPW); hole offset: %u, length: %u\n",
+           record->blocks[block_id].hole_offset,
+           record->blocks[block_id].bkp_len);

We need to consider what info about FPW we want pg_xlogdump to report. I'd like to calculate how many bytes an FPW was compressed by, from the report of pg_xlogdump. So I'd like to also see both the length of the uncompressed FPW and that of the compressed one in the report. In pg_config.h, the comment of BLCKSZ needs to be updated?
Because the maximum size of BLCKSZ can be affected by not only itemid but also XLogRecordBlockImageHeader.

 bool    has_image;
+bool    is_compressed;

Doesn't ResetDecoder need to reset is_compressed?

+#wal_compression = off    # enable compression of full-page writes

Currently wal_compression compresses only FPW, so isn't it better to place it after full_page_writes in postgresql.conf.sample?

+uint16    extra_data;    /* used to store offset
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello,

Patches as well as results are attached.

I had a look at the code. I have a few minor points:

+ bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
+
+ if (is_compressed)
 {
- rdt_datas_last->data = page;
- rdt_datas_last->len = BLCKSZ;
+ /* compressed block information */
+ bimg.length = compress_len;
+ bimg.extra_data = hole_offset;
+ bimg.extra_data |= XLR_BLCK_COMPRESSED_MASK;

For consistency with the existing code, how about renaming the macro XLR_BLCK_COMPRESSED_MASK as BKPBLOCK_HAS_COMPRESSED_IMAGE on the lines of BKPBLOCK_HAS_IMAGE.

+ blk->hole_offset = extra_data & ~XLR_BLCK_COMPRESSED_MASK;

Here, I think that having the mask as BKPBLOCK_HOLE_OFFSET_MASK will be more indicative of the fact that the lower 15 bits of the extra_data field comprise the hole_offset value. This suggestion is also just to achieve consistency with the existing BKPBLOCK_FORK_MASK for the fork_flags field.

And a comment typo:

+ * First try to compress block, filling in the page hole with zeros
+ * to improve the compression of the whole. If the block is considered
+ * as incompressible, complete the block header information as if
+ * nothing happened.

As the hole is no longer being compressed, this needs to be changed.

Thank you,
Rahila Syed

On Tue, Dec 16, 2014 at 10:04 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Dec 17, 2014 at 12:00 AM, Michael Paquier michael.paqu...@gmail.com wrote: Actually, the original length of the compressed block is saved in PGLZ_Header, so if we are fine to not check the size of the block decompressed when decoding WAL we can do without the hole filled with zeros, and use only 1 bit to see if the block is compressed or not. And.. After some more hacking, I have been able to come up with a patch that is able to compress blocks without the page hole, and that keeps the WAL record length the same as HEAD when the compression switch is off.
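The masking discussed above — lower 15 bits of extra_data for the hole offset, the remaining bit for the compression flag — can be sketched as follows. The macro value is illustrative of the scheme described in the thread, not necessarily the value in the patch:

```c
#include <stdint.h>

/* High bit of the 16-bit extra_data marks a compressed image; the low
 * 15 bits hold the hole offset (a 32k max page size fits in 15 bits). */
#define XLR_BLCK_COMPRESSED_MASK 0x8000

static uint16_t
pack_extra_data(uint16_t hole_offset, int is_compressed)
{
    uint16_t v = hole_offset & ~XLR_BLCK_COMPRESSED_MASK;

    if (is_compressed)
        v |= XLR_BLCK_COMPRESSED_MASK;
    return v;
}

/* Decode-side accessors, mirroring blk->hole_offset = extra_data & ~MASK. */
static uint16_t
extract_hole_offset(uint16_t extra_data)
{
    return extra_data & ~XLR_BLCK_COMPRESSED_MASK;
}

static int
extract_is_compressed(uint16_t extra_data)
{
    return (extra_data & XLR_BLCK_COMPRESSED_MASK) != 0;
}
```

The renaming suggested above (BKPBLOCK_HOLE_OFFSET_MASK) would simply give the `~XLR_BLCK_COMPRESSED_MASK` expression its own name.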
The numbers are pretty good, CPU is saved in the same proportions as previous patches when compression is enabled, and there is zero delta with HEAD when compression switch is off. Here are the actual numbers: test_name | pg_size_pretty | user_diff | system_diff ---++---+- FPW on + 2 bytes, ffactor 50 | 582 MB | 42.391894 |0.807444 FPW on + 2 bytes, ffactor 20 | 229 MB | 14.330304 |0.729626 FPW on + 2 bytes, ffactor 10 | 117 MB | 7.335442 |0.570996 FPW off + 2 bytes, ffactor 50 | 746 MB | 25.330391 |1.248503 FPW off + 2 bytes, ffactor 20 | 293 MB | 10.537475 |0.755448 FPW off + 2 bytes, ffactor 10 | 148 MB | 5.762775 |0.763761 FPW on + 0 bytes, ffactor 50 | 582 MB | 42.174297 |0.790596 FPW on + 0 bytes, ffactor 20 | 229 MB | 14.424233 |0.770459 FPW on + 0 bytes, ffactor 10 | 117 MB | 7.057195 |0.584806 FPW off + 0 bytes, ffactor 50 | 746 MB | 25.261998 |1.054516 FPW off + 0 bytes, ffactor 20 | 293 MB | 10.589888 |0.860207 FPW off + 0 bytes, ffactor 10 | 148 MB | 5.827191 |0.874285 HEAD, ffactor 50 | 746 MB | 25.181729 |1.133433 HEAD, ffactor 20 | 293 MB | 9.962242 |0.765970 HEAD, ffactor 10 | 148 MB | 5.693426 |0.775371 Record, ffactor 50| 582 MB | 54.904374 |0.678204 Record, ffactor 20| 229 MB | 19.798268 |0.807220 Record, ffactor 10| 116 MB | 9.401877 |0.668454 (18 rows) The new tests of this patch are FPW off + 0 bytes. Patches as well as results are attached. Regards, -- Michael
[HACKERS] Possibly a comment typo in xlogrecord.h
Hello, The comment before declaration of XLogRecordBlockHeader says * 'data_length' is the length of the payload data associated with this, * and includes the possible full-page image, and rmgr-specific data. It IIUC, data_length does not include associated full page image length. Attached patch changes the comment as follows: - * and includes the possible full-page image, and rmgr-specific data. It - * does not include the XLogRecordBlockHeader struct itself. + * and includes the rmgr-specific data. It does not include the possible + * full page image and XLogRecordBlockHeader struct itself. Thank you, Rahila Syed correct_comment_typo_XLogRecordBlockHeader.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello, Well, the larger question is why wouldn't we just have the user compress the entire WAL file before archiving --- why have each backend do it? Is it the write volume we are saving? IIUC, the idea here is to not only save the on disk size of WAL but to reduce the overhead of flushing WAL records to disk in servers with heavy write operations. So yes improving the performance by saving write volume is a part of the requirement. Thank you, Rahila Syed On Fri, Dec 12, 2014 at 7:48 PM, Bruce Momjian br...@momjian.us wrote: On Fri, Dec 12, 2014 at 08:27:59AM -0500, Robert Haas wrote: On Thu, Dec 11, 2014 at 11:34 AM, Bruce Momjian br...@momjian.us wrote: compression = 'on' : 1838 secs = 'off' : 1701 secs Different is around 140 secs. OK, so the compression took 2x the cpu and was 8% slower. The only benefit is WAL files are 35% smaller? Compression didn't take 2x the CPU. It increased user CPU from 354.20 s to 562.67 s over the course of the run, so it took about 60% more CPU. But I wouldn't be too discouraged by that. At least AIUI, there are quite a number of users for whom WAL volume is a serious challenge, and they might be willing to pay that price to have less of it. Also, we have talked a number of times before about incorporating Snappy or LZ4, which I'm guessing would save a fair amount of CPU -- but the decision was made to leave that out of the first version, and just use pg_lz, to keep the initial patch simple. I think that was a good decision. Well, the larger question is why wouldn't we just have the user compress the entire WAL file before archiving --- why have each backend do it? Is it the write volume we are saving? I though this WAL compression gave better performance in some cases. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
What I would suggest is instrument the backend with getrusage() at startup and shutdown and have it print the difference in user time and system time. Then, run tests for a fixed number of transactions and see how the total CPU usage for the run differs.

Following are the numbers obtained on tests with absolute CPU usage, a fixed number of transactions, and longer duration, with the latest fpw compression patch.

pgbench command: pgbench -r -t 25 -M prepared
To ensure that data is not highly compressible, empty filler columns were altered using:
alter table pgbench_accounts alter column filler type text using gen_random_uuid()::text
checkpoint_segments = 1024
checkpoint_timeout = 5min
fsync = on

The tests ran for around 30 mins. A manual checkpoint was run before each test.

Compression  WAL generated  %compression  Latency-avg  CPU usage (seconds)                      TPS     Latency stddev
on           1531.4 MB      ~35 %         7.351 ms     user diff: 562.67s system diff: 41.40s   135.96  13.759 ms
off          2373.1 MB                    6.781 ms     user diff: 354.20s system diff: 39.67s   147.40  14.152 ms

The compression obtained is quite high, close to 35%. CPU usage at user level when compression is on is quite noticeably high as compared to that when compression is off. But the gain in terms of reduction of WAL is also high.

Server specifications:
Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Thank you,
Rahila Syed

On Fri, Dec 5, 2014 at 10:38 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Dec 5, 2014 at 1:49 AM, Rahila Syed rahilasyed...@gmail.com wrote: If that's really true, we could consider having no configuration any time, and just compressing always. But I'm skeptical that it's actually true.
I was referring to this for CPU utilization: http://www.postgresql.org/message-id/1410414381339-5818552.p...@n5.nabble.com http:// The above tests were performed on machine with configuration as follows Server specifications: Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos RAM: 32GB Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm I think that measurement methodology is not very good for assessing the CPU overhead, because you are only measuring the percentage CPU utilization, not the absolute amount of CPU utilization. It's not clear whether the duration of the tests was the same for all the configurations you tried - in which case the number of transactions might have been different - or whether the number of operations was exactly the same - in which case the runtime might have been different. Either way, it could obscure an actual difference in absolute CPU usage per transaction. It's unlikely that both the runtime and the number of transactions were identical for all of your tests, because that would imply that the patch makes no difference to performance; if that were true, you wouldn't have bothered writing it What I would suggest is instrument the backend with getrusage() at startup and shutdown and have it print the difference in user time and system time. Then, run tests for a fixed number of transactions and see how the total CPU usage for the run differs. Last cycle, Amit Kapila did a bunch of work trying to compress the WAL footprint for updates, and we found that compression was pretty darn expensive there in terms of CPU time. So I am suspicious of the finding that it is free here. It's not impossible that there's some effect which causes us to recoup more CPU time than we spend compressing in this case that did not apply in that case, but the projects are awfully similar, so I tend to doubt it. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
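The getrusage() instrumentation suggested above can be prototyped outside the backend. A hedged sketch using the POSIX getrusage() interface (so not portable to Windows), with hook points named illustratively rather than matching where PostgreSQL would actually call them:

```c
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

static struct rusage usage_at_start;

/* Call once at process startup. */
static void
usage_start(void)
{
    getrusage(RUSAGE_SELF, &usage_at_start);
}

/* Seconds elapsed between two timevals. */
static double
tv_diff(struct timeval end, struct timeval start)
{
    return (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
}

/* Call once at shutdown: print user/system CPU consumed since startup. */
static void
usage_report(void)
{
    struct rusage end;

    getrusage(RUSAGE_SELF, &end);
    printf("user diff: %.2fs, system diff: %.2fs\n",
           tv_diff(end.ru_utime, usage_at_start.ru_utime),
           tv_diff(end.ru_stime, usage_at_start.ru_stime));
}
```

Running a fixed number of transactions between the two calls, as done for the "user diff / system diff" numbers quoted above, yields absolute CPU cost per run rather than a utilization percentage, which is exactly the distinction Robert raises.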
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
I am sorry but I can't understand the above results due to wrapping. Are you saying compression was twice as slow?

CPU usage at user level (in seconds) for compression set 'on' is 562 secs, while that for compression set 'off' is 354 secs. As per the readings, it takes a little less than double the CPU time to compress. However, the total time taken to run 25 transactions for each scenario is as follows:

compression = 'on'  : 1838 secs
compression = 'off' : 1701 secs

The difference is around 140 secs.

Thank you,
Rahila Syed

On Wed, Dec 10, 2014 at 7:55 PM, Bruce Momjian br...@momjian.us wrote: On Wed, Dec 10, 2014 at 07:40:46PM +0530, Rahila Syed wrote: The tests ran for around 30 mins.Manual checkpoint was run before each test. Compression WAL generated%compressionLatency-avg CPU usage (seconds) TPS Latency stddev on 1531.4 MB ~35 % 7.351 ms user diff: 562.67s system diff: 41.40s 135.96 13.759 ms off 2373.1 MB 6.781 ms user diff: 354.20s system diff: 39.67s147.40 14.152 ms The compression obtained is quite high close to 35 %. CPU usage at user level when compression is on is quite noticeably high as compared to that when compression is off. But gain in terms of reduction of WAL is also high. I am sorry but I can't understand the above results due to wrapping. Are you saying compression was twice as slow? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
The important point to consider for this patch is the use of the additional 2 bytes as uint16 in the block information structure to save the length of a compressed block, which may be compressed without its hole to achieve a double level of compression (image compressed without its hole). We may use a simple flag on one or two bits using for example a bit from hole_length, but in this case we would need to always compress images with their hole included, something more expensive as the compression would take more time.

As you have mentioned here the idea to use bits from existing fields rather than adding an additional 2 bytes in the header, FWIW, elaborating slightly on the way it was done in the initial patches, we can use the following struct:

unsigned  hole_offset:15,
          compress_flag:2,
          hole_length:15;

Here compress_flag can be 0 or 1 depending on the status of compression. We can reduce compress_flag to just a 1-bit flag.

IIUC, the purpose of adding the compress_len field in the latest patch is to store the length of compressed blocks, which is used at the time of decoding the blocks. With this approach, the length of a compressed block can be stored in hole_length as hole_length = BLCKSZ - compress_len. Thus, hole_length can serve the purpose of storing the length of a compressed block without the need of an additional 2 bytes. In DecodeXLogRecord, hole_length can be used for tracking the length of data received in cases of both compressed as well as uncompressed blocks. As you already mentioned, this will need compressing images with the hole, but we can MemSet the hole to 0 in order to make compression of the hole less expensive and effective.
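The two encodings discussed above — a bit-field flag, and reusing hole_length to carry BLCKSZ - compress_len — can be sketched together. The struct layout follows the shape given in the mail (with compress_flag reduced to one bit as suggested), and is illustrative rather than the actual patch:

```c
#define BLCKSZ 8192

/* 15 + 1 + 15 bits fit in a single 32-bit word. */
typedef struct
{
    unsigned hole_offset:15,
             compress_flag:1,
             hole_length:15;
} BkpBlockInfo;

/* When compressed, overload hole_length so that
 * BLCKSZ - hole_length == compress_len; decoding then uses the same
 * expression for both compressed and uncompressed blocks.
 * 'len' is the hole length when uncompressed, the compressed length
 * when compressed. */
static BkpBlockInfo
make_info(unsigned hole_offset, int compressed, unsigned len)
{
    BkpBlockInfo info;

    info.hole_offset = hole_offset;
    info.compress_flag = compressed ? 1 : 0;
    info.hole_length = compressed ? BLCKSZ - len : len;
    return info;
}

/* Bytes of block data actually stored in the record, valid either way. */
static unsigned
stored_len(BkpBlockInfo info)
{
    return BLCKSZ - info.hole_length;
}
```

The point of the overloading is visible in stored_len(): the decoder can track the received data length with one formula, whether or not the image was compressed.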
Thank you, Rahila Syed On Sat, Dec 6, 2014 at 7:37 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Sat, Dec 6, 2014 at 12:17 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-12-06 00:10:11 +0900, Michael Paquier wrote: On Sat, Dec 6, 2014 at 12:06 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Dec 5, 2014 at 11:10 PM, Rahila Syed rahilasyed...@gmail.com wrote: I attempted quick review and could not come up with much except this + /* +* Calculate the amount of FPI data in the record. Each backup block +* takes up BLCKSZ bytes, minus the hole length. +* +* XXX: We peek into xlogreader's private decoded backup blocks for the +* hole_length. It doesn't seem worth it to add an accessor macro for +* this. +*/ + fpi_len = 0; + for (block_id = 0; block_id = record-max_block_id; block_id++) + { + if (XLogRecHasCompressedBlockImage(record, block_id)) + fpi_len += BLCKSZ - record-blocks[block_id].compress_len; IIUC, fpi_len in case of compressed block image should be fpi_len = record-blocks[block_id].compress_len; Yep, true. Patches need a rebase btw as Heikki fixed a commit related to the stats of pg_xlogdump. In any case, any opinions to switch this patch as Ready for committer? Needing a rebase is a obvious conflict to that... But I guess some wider looks afterwards won't hurt. Here are rebased versions, which are patches 1 and 2. And I am switching as well the patch to Ready for Committer. The important point to consider for this patch is the use of the additional 2-bytes as uint16 in the block information structure to save the length of a compressed block, which may be compressed without its hole to achieve a double level of compression (image compressed without its hole). We may use a simple flag on one or two bits using for example a bit from hole_length, but in this case we would need to always compress images with their hole included, something more expensive as the compression would take more time. 
Robert wrote: What I would suggest is instrument the backend with getrusage() at startup and shutdown and have it print the difference in user time and system time. Then, run tests for a fixed number of transactions and see how the total CPU usage for the run differs. That's a nice idea, which is done with patch 3 as a simple hack calling twice getrusage at the beginning of PostgresMain and before proc_exit, calculating the difference time and logging it for each process (used as well log_line_prefix with %p). Then I just did a small test with a load of a pgbench-scale-100 database on fresh instances: 1) Compression = on: Stop LSN: 0/487E49B8 getrusage: proc 11163: LOG: user diff: 63.071127, system diff: 10.898386 pg_xlogdump: FPI size: 122296653 [90.52%] 2) Compression = off Stop LSN: 0/4E54EB88 Result: proc 11648: LOG: user diff: 43.855212, system diff: 7.857965 pg_xlogdump: FPI size: 204359192 [94.10%] And the CPU consumption is showing quite some difference... I'd expect as well pglz_compress to show up high in a perf profile for this case (don't have the time to do that now, but a perf record -a -g would
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
I attempted a quick review and could not come up with much except this:

+ /*
+  * Calculate the amount of FPI data in the record. Each backup block
+  * takes up BLCKSZ bytes, minus the hole length.
+  *
+  * XXX: We peek into xlogreader's private decoded backup blocks for the
+  * hole_length. It doesn't seem worth it to add an accessor macro for
+  * this.
+  */
+ fpi_len = 0;
+ for (block_id = 0; block_id <= record->max_block_id; block_id++)
+ {
+     if (XLogRecHasCompressedBlockImage(record, block_id))
+         fpi_len += BLCKSZ - record->blocks[block_id].compress_len;

IIUC, fpi_len in case of a compressed block image should be

fpi_len += record->blocks[block_id].compress_len;

Thank you,
Rahila Syed
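The fix suggested above amounts to: a compressed image contributes its compressed length to the FPI total, while an uncompressed one contributes BLCKSZ minus the hole. A standalone sketch with simplified, hypothetical block bookkeeping (not xlogreader's real structs):

```c
#define BLCKSZ 8192

typedef struct
{
    int has_image;      /* block carries a full-page image */
    int is_compressed;  /* image stored compressed */
    int compress_len;   /* valid when is_compressed */
    int hole_length;    /* valid when !is_compressed */
} DecodedBlock;

/* Sum the FPI bytes actually present in a record. */
static int
record_fpi_len(const DecodedBlock *blocks, int nblocks)
{
    int fpi_len = 0;
    int i;

    for (i = 0; i < nblocks; i++)
    {
        if (!blocks[i].has_image)
            continue;
        if (blocks[i].is_compressed)
            fpi_len += blocks[i].compress_len;          /* the suggested fix */
        else
            fpi_len += BLCKSZ - blocks[i].hole_length;  /* original rule */
    }
    return fpi_len;
}
```

The buggy version in the quoted hunk added BLCKSZ - compress_len for compressed images, i.e. the number of bytes *saved* rather than the number of bytes *present*.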
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
If that's really true, we could consider having no configuration any time, and just compressing always. But I'm skeptical that it's actually true.

I was referring to this for CPU utilization:
http://www.postgresql.org/message-id/1410414381339-5818552.p...@n5.nabble.com

The above tests were performed on a machine with configuration as follows.

Server specifications:
Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Thank you,
Rahila Syed
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
if (!fullPageWrites)
{
    WALInsertLockAcquireExclusive();
    Insert->fullPageWrites = fullPageWrites;
    WALInsertLockRelease();
}

As fullPageWrites is not a boolean, isn't it better to change the if condition to fullPageWrites == FULL_PAGE_WRITES_OFF? As it is done in the if condition above, this seems to be a miss.

doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);

IIUC, doPageWrites is true when fullPageWrites is either 'on' or 'compress'. Considering Insert->fullPageWrites is an int now, I think it is better to explicitly write the above as:

doPageWrites = (Insert->fullPageWrites != FULL_PAGE_WRITES_OFF || Insert->forcePageWrites);

The patch attached has the above changes. Also, it initializes doPageCompression in InitXLOGAccess as per the earlier discussion. I have attached the changes separately as changes.patch.

Thank you,
Rahila Syed

changes.patch
Description: Binary data

0002-Support-compression-for-full-page-writes-in-WAL.patch
Description: Binary data
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
I think this was changed based on following, if I am not wrong. http://www.postgresql.org/message-id/54297A45.8080904@... Yes this change is the result of the above complaint. Attaching the compression status to XLogRecord is more in-line with the fact that all the blocks are compressed, and not each one individually, so we basically now duplicate an identical flag value in all the backup block headers, which is a waste IMO. Thoughts? If I understand your point correctly, as all blocks are compressed, adding compression attribute to XLogRecord surely makes more sense if the record contains backup blocks . But in case of XLOG records without backup blocks the compression attribute in record header might not make much sense. Attaching the status of compression to XLogRecord will mean that the status is duplicated across all records. It will mean that it is an attribute of all the records when it is only an attribute of records with backup blocks or the attribute of backup blocks. The current approach is adopted with this thought. Regards, Rahila Syed -- View this message in context: http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5826487.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello , Please find updated patch with the review comments given above implemented. The compressed data now includes all backup blocks and their headers except the header of first backup block in WAL record. The first backup block header in WAL record is used to store the compression information. This is done in order to avoid adding compression information in WAL record header. Memory allocation on SIGHUP in autovacuum is remaining. Working on it. Thank you, Rahila Syed On Tue, Oct 28, 2014 at 8:51 PM, Rahila Syed rahilasyed...@gmail.com wrote: Do we release the buffers for compressed data when fpw is changed from compress to on? The current code does not do this. Don't we need to do that? Yes this needs to be done in order to avoid memory leak when compression is turned off at runtime while the backend session is running. You don't need to make the processes except the startup process allocate the memory for uncompressedPages when fpw=on. Only the startup process uses it for the WAL decompression I see. fpw != on check can be put at the time of memory allocation of uncompressedPages in the backend code . And at the time of recovery uncompressedPages can be allocated separately if not already allocated. BTW, what happens if the memory allocation for uncompressedPages for the recovery fails? The current code does not handle this. This will be rectified. Which would prevent the recovery at all, so PANIC should happen in that case? IIUC, instead of reporting PANIC , palloc can be used to allocate memory for uncompressedPages at the time of recovery which will throw ERROR and abort startup process in case of failure. Thank you, Rahila Syed -- View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5824613.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. 
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers compress_fpw_v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Do we release the buffers for compressed data when fpw is changed from compress to on? The current code does not do this. Don't we need to do that? Yes this needs to be done in order to avoid memory leak when compression is turned off at runtime while the backend session is running. You don't need to make the processes except the startup process allocate the memory for uncompressedPages when fpw=on. Only the startup process uses it for the WAL decompression I see. fpw != on check can be put at the time of memory allocation of uncompressedPages in the backend code . And at the time of recovery uncompressedPages can be allocated separately if not already allocated. BTW, what happens if the memory allocation for uncompressedPages for the recovery fails? The current code does not handle this. This will be rectified. Which would prevent the recovery at all, so PANIC should happen in that case? IIUC, instead of reporting PANIC , palloc can be used to allocate memory for uncompressedPages at the time of recovery which will throw ERROR and abort startup process in case of failure. Thank you, Rahila Syed -- View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5824613.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello, Please find the updated patch attached. 1) I don't think it's a good idea to put the full page write compression into struct XLogRecord. 1. The compressed blocks is of varlena type. Hence, VARATT_IS_COMPRESSED can be used to detect if the datum is compressed. But, it can give false positive when blocks are not compressed because uncompressed blocks in WAL record are not of type varlena. If I understand correctly, VARATT_IS_COMPRESSED looks for particular bit pattern in the datum which when found it returns true irrespective of type of datum. 2. BkpBlock header of the first block in a WAL record can be copied as it is followed by compressed data including block corresponding to first header and remaining headers and blocks. This header can then be used to store flag indicating if the blocks are compressed or not. This seems to be a feasible option but will increase few bytes equivalent to sizeof(BkpBlock) in record when compared to the method of compressing all blocks and headers. Also , the full page write compression currently stored in WAL record occupies 1 byte of padding hence does not increase the overall size. But at the same timecompression attribute is related to backup up blocks hence it makes more sense to have it in BkpBlock header. Although, the attached patch does not include this yet as it will be better to get consensus first. Thoughts? 2) You've essentially removed a lot of checks about the validity of bkp blocks in xlogreader. I don't think that's acceptable Check to see if size of compressed blocks agrees with the total size stored on WAL record header is added in the patch attached. This serves as a check to validate length of record. 3) You have both FullPageWritesStr() and full_page_writes_str(). This has not changed for now reason being full_page_writes_str() is true/false version of FullPageWritesStr macro. It is implemented for backward compatibility with pg_xlogdump. 4)I don't like FullPageWritesIsNeeded(). 
>4) I don't like FullPageWritesIsNeeded(). For one it, at least to me, sounds grammatically wrong.

>7) Unless I miss something CompressBackupBlock should be plural, right? ATM it compresses all the blocks?

>8) I don't like tests like if (fpw >= FULL_PAGE_WRITES_COMPRESS). That relies on the, less than intuitive, ordering of FULL_PAGE_WRITES_COMPRESS (=1) before FULL_PAGE_WRITES_ON (=2).

>9) I think you've broken the case where we first think 1 block needs to be backed up, and another doesn't. If we then detect, after the START_CRIT_SECTION(), that we need to goto begin;, orig_len will still have its old content.

I have corrected these in the patch attached.

>5) CompressBackupBlockPagesAlloc is declared static but not defined as such.

I have made it global now in order to be able to access it from PostgresMain.

>6) You call CompressBackupBlockPagesAlloc() from two places. Neither is IIRC within a critical section. So you imo should remove the outOfMem handling and revert to palloc() instead of using malloc directly.

This has not been changed in the current patch, the reason being that the outOfMem handling is done in order to proceed without compression of FPW in case sufficient memory is not available for compression.

>One thing worthy of note is that I don't think you currently can legally check fullPageWrites == FULL_PAGE_WRITES_ON when calling it only during startup, as fullPageWrites can be changed at runtime.

In the attached patch, this check is also added in PostgresMain on SIGHUP, after processing the postgresql.conf file.

Thank you,
Rahila Syed

On Mon, Sep 29, 2014 at 6:06 PM, Andres Freund <and...@anarazel.de> wrote:

Hi,

On 2014-09-22 10:39:32 +, Syed, Rahila wrote:
>Please find attached the patch to compress FPW.

I've given this a quick look and noticed some things:

1) I don't think it's a good idea to put the full page write compression into struct XLogRecord.

2) You've essentially removed a lot of checks about the validity of bkp blocks in xlogreader. I don't think that's acceptable.

3) You have both FullPageWritesStr() and full_page_writes_str().
4) I don't like FullPageWritesIsNeeded(). For one it, at least to me, sounds grammatically wrong. More importantly, when reading it I'm thinking of it being about the LSN check. How about instead directly checking whatever != FULL_PAGE_WRITES_OFF?

5) CompressBackupBlockPagesAlloc is declared static but not defined as such.

6) You call CompressBackupBlockPagesAlloc() from two places. Neither is IIRC within a critical section. So you imo should remove the outOfMem handling and revert to palloc() instead of using malloc directly. One thing worthy of note is that I don't think you currently can legally check fullPageWrites == FULL_PAGE_WRITES_ON when calling it only during startup, as fullPageWrites can be changed at runtime.

7) Unless I miss something CompressBackupBlock should be plural, right? ATM it compresses all the blocks?

8) I don't like tests like if (fpw >= FULL_PAGE_WRITES_COMPRESS). That relies on the, less than intuitive, ordering of FULL_PAGE_WRITES_COMPRESS (=1) before FULL_PAGE_WRITES_ON (=2).
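Point 8 above is about tests whose correctness silently depends on the numeric ordering of enum members. A hedged sketch follows; the enum values mirror the ordering described in the review, but the names and definitions here are assumptions for illustration, not the exact PostgreSQL ones:

```c
#include <assert.h>

/* Assumed values, mirroring "COMPRESS (=1) before ON (=2)" above. */
typedef enum
{
    FULL_PAGE_WRITES_OFF = 0,
    FULL_PAGE_WRITES_COMPRESS = 1,
    FULL_PAGE_WRITES_ON = 2
} FullPageWritesLevel;

/* The objected-to style: works only because COMPRESS < ON numerically,
 * and silently breaks if the members are ever reordered. */
int fpw_enabled_by_ordering(FullPageWritesLevel fpw)
{
    return fpw >= FULL_PAGE_WRITES_COMPRESS;
}

/* An explicit test reads clearly and survives reordering of members. */
int fpw_enabled_explicit(FullPageWritesLevel fpw)
{
    return fpw == FULL_PAGE_WRITES_COMPRESS || fpw == FULL_PAGE_WRITES_ON;
}
```

With the current member values the two functions agree for every input; the point is that only the second one states its intent.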
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello,

Thank you for review.

>1) I don't think it's a good idea to put the full page write compression into struct XLogRecord.

Full page write compression information can be stored in the varlena struct of the compressed blocks, as done for toast data in the pluggable compression support patch. If I understand correctly, it can be done similarly to the manner in which a compressed Datum is modified to contain information about the compression algorithm in that patch.

>2) You've essentially removed a lot of checks about the validity of bkp blocks in xlogreader. I don't think that's acceptable.

To ensure this, the raw size stored in the first four bytes of the compressed datum can be used to perform error checking for backup blocks. Currently, the error checking for the size of backup blocks happens individually for each block. If backup blocks are compressed together, it can happen once for the entire set of backup blocks in a WAL record: the total raw size of the compressed blocks can be checked against the total size stored in the WAL record header.

>3) You have both FullPageWritesStr() and full_page_writes_str().

full_page_writes_str() is the true/false version of the FullPageWritesStr macro. It is implemented for backward compatibility with pg_xlogdump.

>4) I don't like FullPageWritesIsNeeded(). For one it, at least to me, sounds grammatically wrong. More importantly, when reading it I'm thinking of it being about the LSN check. How about instead directly checking whatever != FULL_PAGE_WRITES_OFF?

I will modify this.

>5) CompressBackupBlockPagesAlloc is declared static but not defined as such.

>7) Unless I miss something CompressBackupBlock should be plural, right? ATM it compresses all the blocks?

I will correct these.

>6) You call CompressBackupBlockPagesAlloc() from two places. Neither is IIRC within a critical section. So you imo should remove the outOfMem handling and revert to palloc() instead of using malloc directly.

Yes, neither is in a critical section.
outOfMem handling is done in order to proceed without compression of FPW in case sufficient memory is not available for compression. Thank you, Rahila Syed -- View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5822391.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
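The whole-record length check described in point 2 of the message above could look roughly like the following. The field names are assumptions for illustration; the real record layout differs:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-record sizes, for illustration only. */
typedef struct
{
    size_t tot_len;        /* total length from the WAL record header */
    size_t hdr_len;        /* fixed header plus rmgr-specific data */
    size_t raw_blocks_len; /* raw (uncompressed) size stored with the
                            * compressed backup blocks */
} RecordSizes;

/* One check for the entire set of backup blocks in the record: the raw
 * block payload must account exactly for the space the header claims
 * the backup blocks occupy. */
int backup_blocks_len_ok(const RecordSizes *r)
{
    return r->raw_blocks_len == r->tot_len - r->hdr_len;
}
```

The appeal of compressing all blocks together is visible here: a single comparison validates the whole set, instead of one length check per block.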
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello All,

>Well, from Rahila's point of view the patch is actually attached, but she's posting from the Nabble interface, which mangles it and turns it into a link instead.

Yes, but the end result is the same: to properly submit a patch, you need to send an email to the mailing list, not join a group/forum from some intermediary newsgroup site that mirrors the list.

Thank you. I will take care of it henceforth.

Please find attached the patch to compress FPW. The patch submitted by Fujii-san earlier in the thread is used to merge the compression GUC with full_page_writes. I am reposting the measurement numbers.

Server specification:
Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos (1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm)
Checkpoint segments: 1024
Checkpoint timeout: 5 mins

pgbench -c 64 -j 64 -r -T 900 -M prepared
Scale factor: 1000

                 WAL generated (MB)   Throughput (tps)   Latency (ms)
On               9235.43              979.03             65.36
Compress (pglz)  6518.68              1072.34            59.66
Off              501.04               1135.17            56.34

The results show around 30 percent decrease in WAL volume due to compression of FPW.

Thank you,
Rahila Syed

Tom Lane wrote:
>Rahila Syed <rahilasyed...@gmail.com> writes:
>>Please find attached patch to compress FPW using pglz compression.
>Patch not actually attached AFAICS (no, a link is not good enough).

Well, from Rahila's point of view the patch is actually attached, but she's posting from the Nabble interface, which mangles it and turns it into a link instead. Not her fault, really -- but the end result is the same: to properly submit a patch, you need to send an email to the pgsql-hackers@postgresql.org mailing list, not join a group/forum from some intermediary newsgroup site that mirrors the list.
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
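As a sanity check, the "around 30 percent decrease" figure in the message above follows directly from the measured WAL volumes:

```c
#include <assert.h>

/* Fractional WAL reduction: (uncompressed - compressed) / uncompressed. */
double wal_reduction_fraction(double on_mb, double compress_mb)
{
    return (on_mb - compress_mb) / on_mb;
}
```

With the measured volumes of 9235.43 MB (on) and 6518.68 MB (compress), the fraction is about 0.294, i.e. roughly 30 percent.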
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello,

>Maybe. Let's get the basic patch done first; then we can argue about that.

Please find attached patch to compress FPW using pglz compression. All backup blocks in a WAL record are compressed at once before inserting it into the WAL buffers. The full_page_writes GUC has been modified to accept three values:

1. on
2. compress
3. off

FPW are compressed when full_page_writes is set to compress. FPW generated forcibly during online backup, even when full_page_writes is off, are also compressed. When full_page_writes is set to on, FPW are not compressed.

Benchmark:

Server specification:
Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos (1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm)
Checkpoint segments: 1024
Checkpoint timeout: 5 mins

pgbench -c 64 -j 64 -r -T 900 -M prepared
Scale factor: 1000

                 WAL generated (MB)   Throughput (tps)   Latency (ms)
On               9235.43              979.03             65.36
Compress (pglz)  6518.68              1072.34            59.66
Off              501.04               1135.17            56.34

The results show around 30 percent decrease in WAL volume due to compression of FPW.

compress_fpw_v1.patch http://postgresql.1045698.n5.nabble.com/file/n5819645/compress_fpw_v1.patch

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5819645.html
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
>Please find attached patch to compress FPW using pglz compression.

Please refer to the updated patch attached. The earlier patch added a few duplicate lines of code in guc.c.

compress_fpw_v1.patch http://postgresql.1045698.n5.nabble.com/file/n5819659/compress_fpw_v1.patch

Thank you,
Rahila Syed

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5819659.html
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
>I will repeat the above tests with high load on CPU and using the benchmark given by Fujii-san and post the results.

Average % of CPU usage at user level for each of the compression algorithms is as follows:

Compression   Multiple   Single
Off           81.1338    81.1267
LZ4           81.0998    81.1695
Snappy        80.9741    80.9703
Pglz          81.2353    81.2753

http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png
http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png

The numbers show that the CPU utilization of Snappy is the least. CPU utilization in decreasing order is: pglz > No compression > LZ4 > Snappy. The variance of the average CPU utilization numbers is very low; however, Snappy seems to be best when it comes to lower utilization of CPU.

As per the measurement results posted to date, LZ4 outperforms Snappy and pglz in terms of compression ratio and performance. However, the CPU utilization numbers show Snappy utilizes the least amount of CPU, though the difference is not much. As there has been no consensus yet about which compression algorithm to adopt, is it better to make this decision independent of the FPW compression patch, as suggested earlier in this thread? FPW compression can be done using the built-in compression pglz, as it shows considerable performance gain over uncompressed WAL and a good compression ratio. Also, the patch to compress multiple blocks at once gives better compression than single-block compression. ISTM that the performance overhead introduced by multiple-block compression is slightly higher than single-block compression, which can be tested again after modifying the patch to use pglz. Hence, this patch can be built using multiple-block compression. Thoughts?

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5818552.html
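The "variance ... is very low" observation above can be quantified: across all eight averages in the table, the whole spread between the highest (pglz, single) and the lowest (Snappy, single) reading is only about 0.3 percentage points. A minimal sketch:

```c
#include <assert.h>

/* Spread (max - min) of a series of measurements. */
double spread(const double *v, int n)
{
    double lo = v[0], hi = v[0];
    for (int i = 1; i < n; i++)
    {
        if (v[i] < lo) lo = v[i];
        if (v[i] > hi) hi = v[i];
    }
    return hi - lo;
}
```

Feeding in the eight CPU-utilization averages from the table gives a spread of about 0.305 percentage points, which supports picking the algorithm on compression ratio rather than CPU cost here.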
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello,

>It'd be interesting to check avg cpu usage as well.

I have collected average CPU utilization numbers by collecting sar output at an interval of 10 seconds for the following benchmark:

Server specification:
Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos (1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm)

Benchmark:
Scale: 16
Command: java JR /home/postgres/jdbcrunner-1.2/scripts/tpcc.js -sleepTime 550,250,250,200,200
Warmup time: 1 sec
Measurement time: 900 sec
Number of tx types: 5
Number of agents: 16
Connection pool size: 16
Statement cache size: 40
Auto commit: false
Checkpoint segments: 1024
Checkpoint timeout: 5 mins

Average % of CPU utilization at user level for multiple-blocks compression:

Compression Off = 3.34133
Snappy = 3.41044
LZ4 = 3.59556
Pglz = 3.66422

The numbers show the average CPU utilization is in the following order: pglz > LZ4 > Snappy > No compression. Attached is a graph plotting % CPU utilization versus time elapsed for each of the compression algorithms. The overall CPU utilization during the tests is very low, i.e. below 10%; the CPU remained idle for a large (~90%) percentage of the time. I will repeat the above tests with high load on the CPU and using the benchmark given by Fujii-san and post the results.

Thank you,

On Wed, Aug 27, 2014 at 9:16 PM, Arthur Silva <arthur...@gmail.com> wrote:
On 26/08/2014 09:16, Fujii Masao <masao.fu...@gmail.com> wrote:
On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed <rahilasye...@gmail.com> wrote:
>Hello, Thank you for comments.
>>Could you tell me where the patch for single block in one run is?
>Please find attached patch for single block compression in one run.

Thanks! I ran the benchmark using pgbench and compared the results. I'd like to share the results.

[RESULT]
Amount of WAL generated during the benchmark. Unit is MB.
         Multiple   Single
off      202.0      201.5
on       6051.0     6053.0
pglz     3543.0     3567.0
lz4      3344.0     3485.0
snappy   3354.0     3449.5

Latency average during the benchmark. Unit is ms.

         Multiple   Single
off      19.1       19.0
on       55.3       57.3
pglz     45.0       45.9
lz4      44.2       44.7
snappy   43.4       43.3

These results show that FPW compression is really helpful for decreasing the WAL volume and improving the performance. The compression ratio by lz4 or snappy is better than that by pglz. But it's difficult to conclude which of lz4 or snappy is best, according to these results. ISTM that compression-of-multiple-pages-at-a-time approach can compress WAL more than compression-of-single-... does.

[HOW TO BENCHMARK]
Create a pgbench database with scale factor 1000. Change the data type of the column filler on each pgbench table from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's gen_random_uuid() in order to avoid empty columns, e.g.,

alter table pgbench_accounts alter column filler type text using gen_random_uuid()::text

After creating the test database, run pgbench as follows:

pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared

The number of transactions executed during the benchmark is almost the same between each run because the -R option is used. checkpoint_timeout is 5min, so it's expected that checkpoint was executed at least two times during the benchmark.

Regards,

--
Fujii Masao

It'd be interesting to check avg cpu usage as well.
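The "lz4 or snappy is better than pglz" observation can be checked from the multiple-blocks WAL volumes above, taking the "on" volume as the uncompressed baseline:

```c
#include <assert.h>

/* Effective FPW compression ratio: uncompressed WAL volume over
 * compressed WAL volume, using the figures (in MB) from the table. */
double wal_ratio(double uncompressed_mb, double compressed_mb)
{
    return uncompressed_mb / compressed_mb;
}
```

This gives roughly 1.81 for lz4 (6051/3344), 1.80 for snappy (6051/3354) and 1.71 for pglz (6051/3543), matching the conclusion that lz4 and snappy are close to each other and both ahead of pglz.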
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello,

Thank you for comments.

>Could you tell me where the patch for single block in one run is?

Please find attached patch for single block compression in one run.

Thank you,

On Tue, Aug 19, 2014 at 1:17 PM, Fujii Masao <masao.fu...@gmail.com> wrote:
On Tue, Aug 19, 2014 at 2:08 AM, Andres Freund <and...@2ndquadrant.com> wrote:
On 2014-08-18 13:06:15 -0400, Robert Haas wrote:
On Mon, Aug 18, 2014 at 7:19 AM, Rahila Syed <rahilasye...@gmail.com> wrote:

>>According to the measurement result, the amount of WAL generated in "Multiple Blocks in one run" is less than that in "Single Block in one run". So ISTM that compression of multiple blocks at one run can improve the compression ratio. Am I missing something?

Sorry for using unclear terminology. "WAL generated" here means WAL that gets generated in each run without compression. So, the value "WAL generated" in the above measurement is uncompressed WAL generated, to be specific. Uncompressed WAL = compressed WAL + bytes saved. Here, the measurements are done for a constant amount of time rather than a fixed number of transactions, hence the amount of WAL generated does not correspond to the compression ratios of each algorithm. I have therefore calculated bytes saved in order to get an accurate idea of the amount of compression in each scenario and for the various algorithms.

Compression ratio, i.e. uncompressed WAL / compressed WAL, in each of the above scenarios is as follows:

Compression algo   Multiple Blocks in one run   Single Block in one run
LZ4                1.21                         1.27
Snappy             1.19                         1.25
pglz               1.14                         1.16

This shows the compression ratios of both scenarios, multiple blocks and single block, are nearly the same for this benchmark.

I don't agree with that conclusion. The difference between 1.21 and 1.27, or between 1.19 and 1.25, is quite significant. Even the difference between 1.14 and 1.16 is not trivial. We should try to get the larger benefit, if it is possible to do so without an unreasonable effort.

Agreed.
>One more question: Do I see it right that multiple blocks compressed together compress *worse* than compressing individual blocks? If so, I have a rather hard time believing that the patch is sane.

Or the way of benchmarking might have some problems.

Rahila, I'd like to measure the compression ratio in both the multiple-blocks and single-block cases. Could you tell me where the patch for single block in one run is?

Regards,

--
Fujii Masao

CompressSingleBlock.patch
Description: Binary data
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
>So, it seems like you're basically using malloc to work around the fact that a palloc failure is an error, and we can't throw an error in a critical section. I don't think that's good; we want all of our allocations, as far as possible, to be tracked via palloc. It might be a good idea to add a new variant of palloc or MemoryContextAlloc that returns NULL on failure instead of throwing an error; I've wanted that once or twice. But in this particular case, I'm not quite seeing why it should be necessary.

I am using malloc to return NULL in case of failure, and proceed without compression of FPW if it returns NULL. Proceeding without compression seems more appropriate than throwing an error and exiting because of a failure to allocate memory for compression.

>the number of backup blocks per record is limited to some pretty small number, so it ought to be possible to preallocate enough memory to compress them all, perhaps just by declaring a global variable like char wal_compression_space[8192]; or whatever.

In the updated patch, a static global variable is added, to which memory is allocated from the heap using malloc outside the critical section. The size of the memory block is 4 * BkpBlock header + 4 * BLCKSZ.

Thank you,

On Mon, Aug 18, 2014 at 10:40 PM, Robert Haas <robertmh...@gmail.com> wrote:
On Thu, Jul 3, 2014 at 3:58 PM, Rahila Syed <rahilasye...@gmail.com> wrote:
>Updated version of patches are attached. Changes are as follows:
>1. Improved readability of the code as per the review comments.
>2. Addition of a block_compression field in the BkpBlock structure to store information about compression of the block. This provides for switching compression on/off and changing the compression algorithm as required.
>3. Handling of OOM in the critical section by checking the return value of malloc and proceeding without compression of FPW if the return value is NULL.
So, it seems like you're basically using malloc to work around the fact that a palloc failure is an error, and we can't throw an error in a critical section. I don't think that's good; we want all of our allocations, as far as possible, to be tracked via palloc. It might be a good idea to add a new variant of palloc or MemoryContextAlloc that returns NULL on failure instead of throwing an error; I've wanted that once or twice. But in this particular case, I'm not quite seeing why it should be necessary - the number of backup blocks per record is limited to some pretty small number, so it ought to be possible to preallocate enough memory to compress them all, perhaps just by declaring a global variable like char wal_compression_space[8192]; or whatever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
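A rough sketch of the allocate-once-outside-the-critical-section strategy described in the reply above, with graceful fallback when malloc fails. All names here are illustrative, not the patch's actual identifiers:

```c
#include <stdlib.h>
#include <stddef.h>

/* One scratch buffer for FPW compression, allocated once per session. */
static char *compression_scratch = NULL;
static int fpw_compression_enabled = 1;

/* Called outside any critical section.  On OOM we simply disable FPW
 * compression for the session instead of raising an error, so WAL
 * insertion can still proceed with uncompressed full-page images. */
void compression_scratch_alloc(size_t nblocks, size_t header_sz, size_t blcksz)
{
    if (compression_scratch != NULL || !fpw_compression_enabled)
        return;
    /* e.g. 4 * BkpBlock header + 4 * BLCKSZ, as in the patch above */
    compression_scratch = malloc(nblocks * (header_sz + blcksz));
    if (compression_scratch == NULL)
        fpw_compression_enabled = 0; /* proceed without compression */
}
```

Because the allocation happens before any critical section is entered, no error needs to be thrown where errors are forbidden; the only consequence of OOM is falling back to uncompressed FPW.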
Fwd: [HACKERS] [REVIEW] Re: Compression of full-page-writes
>According to the measurement result, the amount of WAL generated in "Multiple Blocks in one run" is less than that in "Single Block in one run". So ISTM that compression of multiple blocks at one run can improve the compression ratio. Am I missing something?

Sorry for using unclear terminology. "WAL generated" here means WAL that gets generated in each run without compression. So, the value "WAL generated" in the above measurement is uncompressed WAL generated, to be specific. Uncompressed WAL = compressed WAL + bytes saved. Here, the measurements are done for a constant amount of time rather than a fixed number of transactions, hence the amount of WAL generated does not correspond to the compression ratios of each algorithm. I have therefore calculated bytes saved in order to get an accurate idea of the amount of compression in each scenario and for the various algorithms.

Compression ratio, i.e. uncompressed WAL / compressed WAL, in each of the above scenarios is as follows:

Compression algo   Multiple Blocks in one run   Single Block in one run
LZ4                1.21                         1.27
Snappy             1.19                         1.25
pglz               1.14                         1.16

This shows the compression ratios of both scenarios, multiple blocks and single block, are nearly the same for this benchmark.

Thank you,
Rahila Syed
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Thank you for review.

>So, you're compressing backup blocks one by one. I wonder if that's the right idea and if we shouldn't instead compress all of them in one run to increase the compression ratio.

The idea behind compressing blocks one by one was to keep the code as similar to the original as possible. For instance, the easiest change I could think of is: if we compress all backup blocks of a WAL record together, the format of the WAL record might change from

Fixed-size header (XLogRecord struct)
rmgr-specific data
BkpBlock
backup block data
BkpBlock
backup block data
...

to

Fixed-size header (XLogRecord struct)
rmgr-specific data
BkpBlock
BkpBlock
...
backup blocks data

But at the same time, it can be worth giving it a try to see if there is significant improvement in compression.

>So why aren't we compressing the hole here instead of compressing the parts that the current logic deems to be filled with important information?

The entire full-page image in the WAL record is compressed. The unimportant part of the full-page image, the hole, is not WAL-logged in the original code. This patch compresses the entire full-page image, inclusive of the hole. This can be optimized by omitting the hole in the compressed FPW (in case the hole is filled with non-zeros), like the original uncompressed FPW, but this can lead to a change in the BkpBlock structure.

>This should be named 'compress_full_page_writes' or so, even if a temporary guc. There's the 'full_page_writes' guc and I see little reason to deviate from its name.

Yes. This will be renamed to full_page_compression according to suggestions earlier in the discussion.

Thank you,
Rahila Syed

On Fri, Jul 11, 2014 at 12:00 PM, Andres Freund <and...@2ndquadrant.com> wrote:
On 2014-07-04 19:27:10 +0530, Rahila Syed wrote:

+ /* Allocates memory for compressed backup blocks according to the compression
+  * algorithm used. Once per session at the time of insertion of first XLOG
+  * record.
+  * This memory stays till the end of session.
+  * OOM is handled by making the
+  * code proceed without FPW compression */
+ static char *compressed_pages[XLR_MAX_BKP_BLOCKS];
+ static bool compressed_pages_allocated = false;
+ if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF &&
+     compressed_pages_allocated != true)
+ {
+     size_t buffer_size = VARHDRSZ;
+     int j;
+     if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_SNAPPY)
+         buffer_size += snappy_max_compressed_length(BLCKSZ);
+     else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_LZ4)
+         buffer_size += LZ4_compressBound(BLCKSZ);
+     else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_PGLZ)
+         buffer_size += PGLZ_MAX_OUTPUT(BLCKSZ);
+     for (j = 0; j < XLR_MAX_BKP_BLOCKS; j++)
+     {
+         compressed_pages[j] = (char *) malloc(buffer_size);
+         if (compressed_pages[j] == NULL)
+         {
+             compress_backup_block = BACKUP_BLOCK_COMPRESSION_OFF;
+             break;
+         }
+     }
+     compressed_pages_allocated = true;
+ }

Why not do this in InitXLOGAccess() or similar?

 /*
  * Make additional rdata chain entries for the backup blocks, so that we
  * don't need to special-case them in the write loop. This modifies the
@@ -1015,11 +1048,32 @@ begin:;
 rdt->next = &(dtbuf_rdt2[i]);
 rdt = rdt->next;

+ if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF)
+ {
+     /* Compress the backup block before including it in rdata chain */
+     rdt->data = CompressBackupBlock(page, BLCKSZ - bkpb->hole_length,
+                                     compressed_pages[i], &(rdt->len));
+     if (rdt->data != NULL)
+     {
+         /*
+          * write_len is the length of compressed block and its varlena
+          * header
+          */
+         write_len += rdt->len;
+         bkpb->hole_length = BLCKSZ - rdt->len;
+         /* Adding information about compression in the backup block header */
+         bkpb->block_compression = compress_backup_block;
+         rdt->next = NULL;
+         continue;
+     }
+ }

So, you're compressing backup blocks one by one. I wonder if that's the right idea and if we shouldn't instead compress all of them in one run to increase the compression ratio.
+/*
 * Get a pointer to the right location in the WAL buffer containing the
 * given XLogRecPtr.
 *
@@ -4061,6 +4174,50 @@ RestoreBackupBlockContents(XLogRecPtr
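The two record layouts sketched earlier in this message can be compared with a toy size calculation. The header and data sizes below are placeholders, not the real XLogRecord/BkpBlock sizes:

```c
#include <assert.h>
#include <stddef.h>

#define N_BLOCKS 3   /* assumed number of backup blocks in the record */
#define HDR_SZ   16  /* placeholder BkpBlock header size */

/* One-by-one layout: each BkpBlock header is followed by its own
 * (individually compressed) block data. */
size_t per_block_layout_size(const size_t *data_len)
{
    size_t total = 0;
    for (int i = 0; i < N_BLOCKS; i++)
        total += HDR_SZ + data_len[i];
    return total;
}

/* All-in-one layout: all headers first, then a single compressed blob
 * covering every block's data. */
size_t combined_layout_size(size_t compressed_blob_len)
{
    return N_BLOCKS * HDR_SZ + compressed_blob_len;
}
```

The header overhead is identical in both layouts, so compressing all blocks in one run only pays off when the single blob comes out smaller than the sum of the individually compressed blocks, which is exactly what the benchmarks in this thread set out to measure.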