Re: [HACKERS] Default Partition for Range

2017-07-04 Thread Rahila Syed
Hello Beena,

Thanks for the updated patch. It passes the basic tests which I performed.

A few comments:
1. In check_new_partition_bound(),

> /* Default case is not handled here */
>if (spec->is_default)
>break;
The default partition check here can be performed commonly for both range
and list partitions.

2. RANGE_DATUM_FINITE = 0, /* actual datum stored elsewhere */
+   RANGE_DATUM_DEFAULT,/* default */

Could you add a more elaborate comment here?

3.  Inside make_one_range_bound(),

>+
>+   /* datums are NULL for default */
>+   if (datums == NULL)
>+   for (i = 0; i < key->partnatts; i++)
>+   bound->content[i] = RANGE_DATUM_DEFAULT;
>+
IMO, returning the bound at this stage would make the code clearer, as no
further processing happens for the default partition.
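
For illustration, something along these lines (a sketch only; the variable
names follow the hunk quoted above and the rest of the function body is
assumed):

    /* datums are NULL for the default partition */
    if (datums == NULL)
    {
        for (i = 0; i < key->partnatts; i++)
            bound->content[i] = RANGE_DATUM_DEFAULT;

        /* no further processing is needed for the default partition */
        return bound;
    }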

4.
@@ -549,6 +568,7 @@ RelationBuildPartitionDesc(Relation rel)
sizeof(RangeDatumContent
*));
boundinfo->indexes = (int *) palloc((ndatums + 1) *
sizeof(int));
+   boundinfo->default_index = -1;
This should also be done commonly for both default list and range partitions.

5.
  if (!spec->is_default && partdesc->nparts > 0)
{
ListCell   *cell;

Assert(boundinfo &&
   boundinfo->strategy == PARTITION_STRATEGY_LIST &&
   (boundinfo->ndatums > 0 ||
partition_bound_accepts_nulls(boundinfo) ||
partition_bound_has_default(boundinfo)));
The assertion condition partition_bound_has_default(boundinfo) is rendered
useless by the change to the if condition, and can perhaps be removed.

6. I think it's better to have the following regression test coverage:

-- Test inserting NULL in one column and a non-NULL value in the other for a
-- multi-column range-partitioned table.

postgres=# CREATE TABLE range_parted (
postgres(# a int,
postgres(# b int
postgres(# ) PARTITION BY RANGE (a, b);
CREATE TABLE
postgres=#
postgres=# CREATE TABLE part1 (
postgres(# a int NOT NULL CHECK (a = 1),
postgres(# b int NOT NULL CHECK (b >= 1 AND b <= 10)
postgres(# );
CREATE TABLE

postgres=# ALTER TABLE range_parted ATTACH PARTITION part1 FOR VALUES FROM
(1, 1) TO (1, 10);
ALTER TABLE
postgres=# CREATE TABLE partr_def1 PARTITION OF range_parted DEFAULT;
CREATE TABLE

postgres=# insert into range_parted values (1,NULL);
INSERT 0 1
postgres=# insert into range_parted values (NULL,9);
INSERT 0 1

-- Test boundary conditions: with the setup above, the following should pass.

postgres=# insert into partr_def1 values (1,10);
INSERT 0 1
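
A follow-up check could also be added to the tests, e.g. (a sketch; the
expectation being that rows with a NULL key column end up in the default
partition, assuming the patch routes them there like other unmatched rows):

    -- Verify which partition each row ended up in
    SELECT tableoid::regclass AS partition, a, b
    FROM range_parted
    ORDER BY a, b;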

Thank you,
Rahila Syed










On Mon, Jul 3, 2017 at 8:00 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:

> On Fri, Jun 30, 2017 at 11:56 AM, Beena Emerson <memissemer...@gmail.com>
> wrote:
> > Hello Dilip,
> >
> > On Wed, Jun 21, 2017 at 6:27 PM, Dilip Kumar <dilipbal...@gmail.com>
> wrote:
> >> On Tue, Jun 20, 2017 at 6:57 PM, Dilip Kumar <dilipbal...@gmail.com>
> wrote:
> >>> This is basically crashing in RelationBuildPartitionDesc, so I think
>
> >
> > Thank you for your review and analysis.
> >
> > I have updated the patch.
> > - bound->content is set to RANGE_DATUM_DEFAULT for each of the keys
> > and not just the first one.
> > - Improve the way of handling DEFAULT bounds in
> RelationBuildPartitionDesc,
> >
> > There is a test for multiple column range in alter_table.sql
>
> Thanks for update patch,  After this fix segmentation fault is solved.
>
> I have some more comments.
>
> 1.
>
> lower = make_one_range_bound(key, -1, spec->lowerdatums, true);
>   upper = make_one_range_bound(key, -1, spec->upperdatums, false);
>
> + /* Default case is not handled here */
> + if (spec->is_default)
> + break;
> +
>
> I think we can move default check just above the make range bound it
> will avoid unnecessary processing.
>
>
> 2.
> RelationBuildPartitionDesc
> {
> 
>
>/*
> * If either of them has infinite element, we can't equate
> * them.  Even when both are infinite, they'd have
> * opposite signs, because only one of cur and prev is a
> * lower bound).
> */
> if (cur->content[j] != RANGE_DATUM_FINITE ||
>   prev->content[j] != RANGE_DATUM_FINITE)
> {
> is_distinct = true;
> break;
> }
> After default range partitioning patch the above comment doesn't sound
> very accurate and
> can lead to the confusion, now here we are not only considering
> infinite bounds but
> there is also a possibility that prev bound can be DEFAULT, so
> 

Re: [HACKERS] Default Partition for Range

2017-06-28 Thread Rahila Syed
Hi Beena,

I have started testing and reviewing the patch. Could you update it, as the
v5 patch does not apply cleanly on master?

Thank you,
Rahila Syed

On Wed, Jun 21, 2017 at 8:43 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:

> On Wed, Jun 21, 2017 at 8:08 PM, Robert Haas <robertmh...@gmail.com>
> wrote:
> > I think somebody should do some testing of the existing code with
> > valgrind.  And then apply the list-partitioning patch and this patch,
> > and do some more testing with valgrind.  It seems to be really easy to
> > miss these uninitialized access problems during code review.
>
> I think it will help,  but it may not help in all the scenarios
> because most of the places we are allocating memory with palloc0 (
> initialized with 0) but it never initialized with RANGE_DATUM_DEFAULT
> except the first element (in the case of DEFAULT partition).  And,
> later they may be considered as RANGE_DATUM_FINITE (because its value
> is 0).
>
> One solution can be that if bound is DEFAULT then initialize with
> RANGE_DATUM_DEFAULT for the complete content array for that partition
> bound instead of just first.  Otherwise, we need to be careful of
> early exiting wherever we are looping the content array of the DEFAULT
> bound.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>
>
>


Re: [HACKERS] Dropping partitioned table drops a previously detached partition

2017-06-13 Thread Rahila Syed
I reviewed and tested
0001-Dependency-between-partitioned-table-and-partition_v1.patch;
it applies cleanly on master and make check passes.

Following are a few comments:

>/*
> * Drop the dependency created by StoreCatalogInheritance1 (CREATE TABLE
> * INHERITS/ALTER TABLE INHERIT -- refclassid will be RelationRelationId)
or
> * heap_create_with_catalog (CREATE TABLE OF/ALTER TABLE OF -- refclassid
will
> * be TypeRelationId).  There's no convenient way to do this, so go
trawling
> * through pg_depend.
> */
>static void
>drop_parent_dependency(Oid relid, Oid refclassid, Oid refobjid,
   DependencyType deptype)

This is not directly related to this patch, but I would still like to mention
it: drop_parent_dependency() is also used to drop the dependency created by
the CREATE TABLE ... PARTITION OF command, so the comment probably needs to
be modified to reflect that.

>+/*
>+ * Partition tables are expected to be dropped when the parent partitioned
>+ * table gets dropped. Hence for partitioning we use AUTO dependency.
>+ * Otherwise, for regular inheritance use NORMAL dependency.
A minor cosmetic correction:
+ Hence for *declarative* partitioning we use AUTO dependency
+ * Otherwise, for regular inheritance *we* use NORMAL dependency.

I have added tests to the
0001-Dependency-between-partitioned-table-and-partition_v1.patch. Please
find attached the v2 patch.
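
For reference, a minimal reproduction of the original issue, based on the
description quoted below (table names are made up):

    CREATE TABLE parted (a int) PARTITION BY LIST (a);
    CREATE TABLE part1 PARTITION OF parted FOR VALUES IN (1);

    -- Detach part1; it should now be an independent table.
    ALTER TABLE parted DETACH PARTITION part1;

    -- Without the fix, the lingering AUTO dependency causes part1
    -- to be dropped along with the parent.
    DROP TABLE parted;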



On Tue, Jun 13, 2017 at 10:25 AM, Amit Langote <
langote_amit...@lab.ntt.co.jp> wrote:

> Hi Ashutosh,
>
> On 2017/06/12 20:56, Ashutosh Bapat wrote:
> > Hi,
> > If we detach a partition and drop the corresponding partitioned table,
> > it drops the once-partition now-standalone table as well.
>
> Thanks for the report.  Looks like 8b4d582d279d78 missed the bit about
> drop_parent_dependency() that you describe in your analysis below.
>
> > The reason for this is as follows
> > An AUTO dependency is recorded between a partitioned table and
> partition. In
> > case of inheritance we record a NORMAL dependency between parent
> > and child. While
> > detaching a partition, we call RemoveInheritance(), which should
> have taken
> > care of removing this dependency. But it removes the dependencies
> which are
> > marked as NORMAL. When we changed the dependency for partitioned
> case from
> > NORMAL to AUTO by updating StoreCatalogInheritance1(), this function
> was not
> > updated. This caused the dependency to linger behind even after
> > detaching the
> > partition, thus causing the now standalone table (which was once a
> > partition)
> > to be dropped when the parent partitioned table is dropped. This
> patch fixes
> > RemoveInheritance() to choose appropriate dependency.
> >
> > Attached patch 0001 to fix this.
>
> Looks good to me.  Perhaps, the example in your email could be added as a
> test case.
>
> > I see a similar issue-in-baking in case we change the type of
> > dependency recorded between a table and the composite type associated
> > with using CREATE TABLE ... OF command. 0002 patch addresses the
> > potential issue by using a macro while creating and dropping the
> > dependency in corresponding functions. There might be more such
> > places, but I haven't searched for those.
>
> Might be a good idea too.
>
> Adding this to the open items list.
>
> Thanks,
> Amit
>
>
>
>
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b61fda9..9aef67b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -283,6 +283,14 @@ struct DropRelationCallbackState
 #define		ATT_COMPOSITE_TYPE		0x0010
 #define		ATT_FOREIGN_TABLE		0x0020
 
+/*
+ * Partition tables are expected to be dropped when the parent partitioned
+ * table gets dropped. Hence for partitioning we use AUTO dependency.
+ * Otherwise, for regular inheritance use NORMAL dependency.
+ */
+#define child_dependency_type(child_is_partition)	\
+	((child_is_partition) ? DEPENDENCY_AUTO : DEPENDENCY_NORMAL)
+
 static void truncate_check_rel(Relation rel);
 static List *MergeAttributes(List *schema, List *supers, char relpersistence,
 bool is_partition, List **supOids, List **supconstr,
@@ -439,7 +447,8 @@ static void ATExecEnableDisableRule(Relation rel, char *rulename,
 static void ATPrepAddInherit(Relation child_rel);
 static ObjectAddress ATExecAddInherit(Relation child_rel, RangeVar *parent, LOCKMODE lockmode);
 static ObjectAddress ATExecDropInherit(Relation rel, RangeVar *parent, LOCKMODE lockmode);
-static void drop_parent_dependency(Oid relid, Oid refclassid, Oid refobjid);
+static void drop_parent_dependency(Oid relid, Oid refclassid, Oid refobjid,
+	   DependencyType deptype);
 static ObjectAddress ATExecAddOf(Relation rel, const TypeName *ofTypename, LOCKMODE lockmode);
 static void 

Re: [HACKERS] Adding support for Default partition in partitioning

2017-05-12 Thread Rahila Syed
Hello,

>(1) With the new patch, we allow new partitions when there is overlapping
data with default partition. The entries in default are ignored when
running queries satisfying the new partition.
This was introduced in the latest version. We do not allow adding a partition
when entries with the same key value exist in the default partition, so this
scenario should not arise. Please find attached an updated patch which
corrects this.
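
With the corrected behaviour, the step in the report quoted below that creates
the overlapping partition should instead be rejected, roughly as follows (a
sketch of the expected behaviour; the exact error message is not final):

    -- list1_def already contains a row with a = 2 at this point, so:
    CREATE TABLE list1_2 PARTITION OF list1 FOR VALUES IN (2);
    -- is expected to fail instead of silently shadowing that row.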

>(2) I get the following warning:

>partition.c: In function ‘check_new_partition_bound’:
>partition.c:882:15: warning: ‘boundinfo’ may be used uninitialized in this
function [-Wmaybe-uninitialized]
>   && boundinfo->has_default)
   ^
>preproc.y:3250.2-8: warning: type clash on default action:  != <>
I failed to notice this warning. I will look into it.

>This is incredibly ugly.  I don't know exactly what should be done
>about it, but I think PARTITION_DEFAULT is a bad idea and has got to
>go.  Maybe add a separate isDefault flag to PartitionBoundSpec
Will look at other ways to do it.

>Doesn't it strike you as a bit strange that get_qual_for_default()
>doesn't return a qual?  Functions should generally have names that
>describe what they do.
Will fix this.

>There's an existing function that you can use to concatenate two lists
>instead of open-coding it.
Will check this.

>you should really add the documentation and
>regression tests which you mentioned as a TODO.  And run the code
>through pgindent
I will also update the next version with documentation and regression tests,
and run pgindent.

Thank you,
Rahila Syed

On Fri, May 12, 2017 at 4:33 PM, Beena Emerson <memissemer...@gmail.com>
wrote:

>
>
> On Thu, May 11, 2017 at 7:37 PM, Rahila Syed <rahilasye...@gmail.com>
> wrote:
>
>> Hello,
>>
>> Please find attached an updated patch with review comments and bugs
>> reported till date implemented.
>>
>
> Hello Rahila,
>
> Tested on "efa2c18 Doc fix: scale(numeric) returns integer, not numeric."
>
> (1) With the new patch, we allow new partitions when there is overlapping
> data with default partition. The entries in default are ignored when
> running queries satisfying the new partition.
>
> DROP TABLE list1;
> CREATE TABLE list1 (
> a int,
> b int
> ) PARTITION BY LIST (a);
> CREATE TABLE list1_1 (LIKE list1);
> ALTER TABLE  list1 ATTACH PARTITION list1_1 FOR VALUES IN (1);
> CREATE TABLE list1_def PARTITION OF list1 DEFAULT;
> INSERT INTO list1 SELECT generate_series(1,2),1;
> -- Partition overlapping with DEF
> CREATE TABLE list1_2 PARTITION OF list1 FOR VALUES IN (2);
> INSERT INTO list1 SELECT generate_series(2,3),2;
>
> postgres=# SELECT * FROM list1 ORDER BY a,b;
>  a | b
> ---+---
>  1 | 1
>  2 | 1
>  2 | 2
>  3 | 2
> (4 rows)
>
> postgres=# SELECT * FROM list1 WHERE a=2;
>  a | b
> ---+---
>  2 | 2
> (1 row)
>
> This ignores the a=2 entries in the DEFAULT.
>
> postgres=# SELECT * FROM list1_def;
>  a | b
> ---+---
>  2 | 1
>  3 | 2
> (2 rows)
>
>
> (2) I get the following warning:
>
> partition.c: In function ‘check_new_partition_bound’:
> partition.c:882:15: warning: ‘boundinfo’ may be used uninitialized in this
> function [-Wmaybe-uninitialized]
>&& boundinfo->has_default)
>^
> preproc.y:3250.2-8: warning: type clash on default action:  != <>
>
>
>> >1.
>> >In following block, we can just do with def_index, and we do not need
>> found_def
>> >flag. We can check if def_index is -1 or not to decide if default
>> partition is
>> >present.
>> found_def is used to set boundinfo->has_default which is used at couple
>> of other places to check if default partition exists. The implementation
>> is similar
>> to has_null.
>>
>> >3.
>> >In following function isDefaultPartitionBound, first statement "return
>> false"
>> >is not needed.
>> It is needed to return false if the node is not DefElem.
>>
>> Todo:
>> Add regression tests
>> Documentation
>>
>> Thank you,
>> Rahila Syed
>>
>>
>>


default_partition_v11.patch
Description: application/download



Re: [HACKERS] Adding support for Default partition in partitioning

2017-05-11 Thread Rahila Syed
Hello,

Please find attached an updated patch with the review comments and bugs
reported till date addressed.

>1.
>In following block, we can just do with def_index, and we do not need
found_def
>flag. We can check if def_index is -1 or not to decide if default partition
is
>present.
found_def is used to set boundinfo->has_default, which is used at a couple of
other places to check if a default partition exists. The implementation is
similar to has_null.

>3.
>In following function isDefaultPartitionBound, first statement "return
false"
>is not needed.
It is needed to return false if the node is not DefElem.

Todo:
Add regression tests
Documentation

Thank you,
Rahila Syed



On Fri, May 5, 2017 at 1:30 AM, Jeevan Ladhe <jeevan.la...@enterprisedb.com>
wrote:

> Hi Rahila,
>
> I have started reviewing your latest patch, and here are my initial
> comments:
>
> 1.
> In following block, we can just do with def_index, and we do not need
> found_def
> flag. We can check if def_index is -1 or not to decide if default
> partition is
> present.
>
> @@ -166,6 +172,8 @@ RelationBuildPartitionDesc(Relation rel)
>   /* List partitioning specific */
>   PartitionListValue **all_values = NULL;
>   bool found_null = false;
> + bool found_def = false;
> + int def_index = -1;
>   int null_index = -1;
>
> 2.
> In check_new_partition_bound, in case of PARTITION_STRATEGY_LIST, remove
> following duplicate declaration of boundinfo, because it is confusing and
> after
> your changes it is not needed as its not getting overridden in the if block
> locally.
> if (partdesc->nparts > 0)
> {
> PartitionBoundInfo boundinfo = partdesc->boundinfo;
> ListCell   *cell;
>
>
> 3.
> In following function isDefaultPartitionBound, first statement "return
> false"
> is not needed.
>
> + * Returns true if the partition bound is default
> + */
> +bool
> +isDefaultPartitionBound(Node *value)
> +{
> + if (IsA(value, DefElem))
> + {
> + DefElem *defvalue = (DefElem *) value;
> + if(!strcmp(defvalue->defname, "DEFAULT"))
> + return true;
> + return false;
> + }
> + return false;
> +}
>
> 4.
> As mentioned in my previous set of comments, following if block inside a
> loop
> in get_qual_for_default needs a break:
>
> + foreach(cell1, bspec->listdatums)
> + {
> + Node *value = lfirst(cell1);
> + if (isDefaultPartitionBound(value))
> + {
> + def_elem = true;
> + *defid  = inhrelid;
> + }
> + }
>
> 5.
> In the grammar the rule default_part_list is not needed:
>
> +default_partition:
> + DEFAULT  { $$ = (Node *)makeDefElem("DEFAULT", NULL, @1); }
> +
> +default_part_list:
> + default_partition { $$ = list_make1($1); }
> + ;
> +
>
> Instead you can simply declare default_partition as a list and write it as:
>
> default_partition:
> DEFAULT
> {
> Node *def = (Node *)makeDefElem("DEFAULT", NULL, @1);
> $$ = list_make1(def);
> }
>
> 6.
> You need to change the output of the describe command, which is currently
> as below: postgres=# \d+ test; Table "public.test" Column | Type |
> Collation | Nullable | Default | Storage | Stats target | Description
> +-+---+--+-+-+--+-
> a | integer | | | | plain | | b | date | | | | plain | | Partition key:
> LIST (a) Partitions: pd FOR VALUES IN (DEFAULT), test_p1 FOR VALUES IN (4,
> 5) What about changing the Paritions output as below: Partitions: *pd
> DEFAULT,* test_p1 FOR VALUES IN (4, 5)
>
> 7.
> You need to handle tab completion for DEFAULT.
> e.g.
> If I partially type following command:
> CREATE TABLE pd PARTITION OF test DEFA
> and then press tab, I get following completion:
> CREATE TABLE pd PARTITION OF test FOR VALUES
>
> I did some primary testing and did not find any problem so far.
>
> I will review and test further and let you know my comments.
>
> Regards,
> Jeevan Ladhe
>
> On Thu, May 4, 2017 at 6:09 PM, Rajkumar Raghuwanshi <
> rajkumar.raghuwan...@enterprisedb.com> wrote:
>
>> On Thu, May 4, 2017 at 5:14 PM, Rahila Syed <rahilasye...@gmail.com>
>> wrote:
>>
>>> The syntax implemented in this patch is as follows,
>>>
>>> CREATE TABLE p11 PARTITION OF p1 DEFAULT;
>>>
>>> Applied v9 patches, table description still showing old pattern of
>> default partition. Is it expected?
>>
>> create table lpd (a int, b int, c varchar) partition by list(a);
>> create table lpd_d partition of lpd DEFAULT;
>>
>> \d+ lpd
>>  Table "publi

Re: [HACKERS] Adding support for Default partition in partitioning

2017-05-10 Thread Rahila Syed
>It seems that adding a new partition at the same level as the default
>partition will require scanning it or its (leaf) partitions if
>partitioned.  Consider that p1, pd are partitions of a list-partitioned
>table p accepting 1 and everything else, respectively, and pd is further
>partitioned.  When adding p2 of p for 2, we need to scan the partitions of
>pd to check if there are any (2, ...) rows.

This is a better explanation. Maybe the following sentence of mine was
confusing: "That is, prohibit creation of a new partition after a default
partition which is further partitioned". What I meant was that the default
partition is itself partitioned further.

>As for fixing the reported issue whereby the partitioned default
>partition's non-existent file is being accessed, it would help to take a
>look at the code in ATExecAttachPartition() starting at the following:
OK, I get it now. If ATTACH PARTITION already supports scanning all the
partitions before attaching, similar support should be provided when adding
a partition after a default partition as well.

Thank you,
Rahila Syed

On Wed, May 10, 2017 at 6:42 AM, Amit Langote <langote_amit...@lab.ntt.co.jp
> wrote:

> On 2017/05/10 2:09, Robert Haas wrote:
> > On Tue, May 9, 2017 at 9:26 AM, Rahila Syed <rahilasye...@gmail.com>
> wrote:
> >>> Hi Rahila,
> >>
> >>> I am not able add a new partition if default partition is further
> >>> partitioned
> >>> with default partition.
> >>
> >>> Consider example below:
> >>
> >>> postgres=# CREATE TABLE test ( a int, b int, c int) PARTITION BY LIST
> (a);
> >>> CREATE TABLE
> >>> postgres=# CREATE TABLE test_p1 PARTITION OF test FOR VALUES IN(4, 5,
> 6, 7,
> >>> 8);
> >>> CREATE TABLE
> >>> postgres=# CREATE TABLE test_pd PARTITION OF test DEFAULT PARTITION BY
> >>> LIST(b);
> >>> CREATE TABLE
> >>> postgres=# CREATE TABLE test_pd_pd PARTITION OF test_pd DEFAULT;
> >>> CREATE TABLE
> >>> postgres=# INSERT INTO test VALUES (20, 24, 12);
> >>> INSERT 0 1
> >>> postgres=# CREATE TABLE test_p2 PARTITION OF test FOR VALUES IN(15);
> >> ERROR:  could not open file "base/12335/16420": No such file or
> directory
> >>
> >> Regarding fix for this I think we need to prohibit this case. That is
> >> prohibit creation
> >> of new partition after a default partition which is further partitioned.
> >> Currently before adding a new partition after default partition all the
> rows
> >> of default
> >> partition are scanned and if a row which matches the new partitions
> >> constraint exists
> >> the new partition is not added.
> >>
> >> If we allow this for default partition which is partitioned further, we
> will
> >> have to scan
> >> all the partitions of default partition for matching rows which can slow
> >> down execution.
> >
> > I think this case should be allowed
>
> +1
>
> > and I don't think it should
> > require scanning all the partitions of the default partition.  This is
> > no different than any other case where multiple levels of partitioning
> > are used.  First, you route the tuple at the root level; then, you
> > route it at the next level; and so on.  It shouldn't matter whether
> > the routing at the top level is to that level's default partition or
> > not.
>
> It seems that adding a new partition at the same level as the default
> partition will require scanning it or its (leaf) partitions if
> partitioned.  Consider that p1, pd are partitions of a list-partitioned
> table p accepting 1 and everything else, respectively, and pd is further
> partitioned.  When adding p2 of p for 2, we need to scan the partitions of
> pd to check if there are any (2, ...) rows.
>
> As for fixing the reported issue whereby the partitioned default
> partition's non-existent file is being accessed, it would help to take a
> look at the code in ATExecAttachPartition() starting at the following:
>
> /*
>  * Set up to have the table be scanned to validate the partition
>  * constraint (see partConstraint above).  If it's a partitioned
> table, we
>  * instead schedule its leaf partitions to be scanned.
>  */
> if (!skip_validate)
> {
>
> Thanks,
> Amit
>
>


Re: [HACKERS] Adding support for Default partition in partitioning

2017-05-09 Thread Rahila Syed
>Hi Rahila,

>I am not able add a new partition if default partition is further
partitioned
>with default partition.

>Consider example below:

>postgres=# CREATE TABLE test ( a int, b int, c int) PARTITION BY LIST (a);
>CREATE TABLE
>postgres=# CREATE TABLE test_p1 PARTITION OF test FOR VALUES IN(4, 5, 6,
7, 8);
>CREATE TABLE
>postgres=# CREATE TABLE test_pd PARTITION OF test DEFAULT PARTITION BY
LIST(b);
>CREATE TABLE
>postgres=# CREATE TABLE test_pd_pd PARTITION OF test_pd DEFAULT;
>CREATE TABLE
>postgres=# INSERT INTO test VALUES (20, 24, 12);
>INSERT 0 1
>postgres=# CREATE TABLE test_p2 PARTITION OF test FOR VALUES IN(15);
>ERROR:  could not open file "base/12335/16420": No such file or directory

Regarding a fix for this, I think we need to prohibit this case, that is,
prohibit creation of a new partition after a default partition which is
itself further partitioned. Currently, before adding a new partition after a
default partition, all the rows of the default partition are scanned, and if
a row matching the new partition's constraint exists, the new partition is
not added.

If we allow this for a default partition which is partitioned further, we
will have to scan all the partitions of the default partition for matching
rows, which can slow down execution.

So, to not hamper performance, an error should be thrown in this case and the
user should be expected to change the schema to avoid partitioning default
partitions.

Kindly give your opinions.



On Fri, May 5, 2017 at 12:46 PM, Jeevan Ladhe  wrote:

> Hi Rahila,
>
> I am not able add a new partition if default partition is further
> partitioned
> with default partition.
>
> Consider example below:
>
> postgres=# CREATE TABLE test ( a int, b int, c int) PARTITION BY LIST (a);
> CREATE TABLE
> postgres=# CREATE TABLE test_p1 PARTITION OF test FOR VALUES IN(4, 5, 6,
> 7, 8);
> CREATE TABLE
> postgres=# CREATE TABLE test_pd PARTITION OF test DEFAULT PARTITION BY
> LIST(b);
> CREATE TABLE
> postgres=# CREATE TABLE test_pd_pd PARTITION OF test_pd DEFAULT;
> CREATE TABLE
> postgres=# INSERT INTO test VALUES (20, 24, 12);
> INSERT 0 1
> *postgres=# CREATE TABLE test_p2 PARTITION OF test FOR VALUES IN(15);*
> *ERROR:  could not open file "base/12335/16420": No such file or directory*
>
>
> Thanks,
> Jeevan Ladhe
>
> On Fri, May 5, 2017 at 11:55 AM, Rajkumar Raghuwanshi <
> rajkumar.raghuwan...@enterprisedb.com> wrote:
>
>> Hi Rahila,
>>
>> pg_restore is failing for default partition, dump file still storing old
>> syntax of default partition.
>>
>> create table lpd (a int, b int, c varchar) partition by list(a);
>> create table lpd_d partition of lpd DEFAULT;
>>
>> create database bkp owner 'edb';
>> grant all on DATABASE bkp to edb;
>>
>> --take plain dump of existing database
>> \! ./pg_dump -f lpd_test.sql -Fp -d postgres
>>
>> --restore plain backup to new database bkp
>> \! ./psql -f lpd_test.sql -d bkp
>>
>> psql:lpd_test.sql:63: ERROR:  syntax error at or near "DEFAULT"
>> LINE 2: FOR VALUES IN (DEFAULT);
>>^
>>
>>
>> vi lpd_test.sql
>>
>> --
>> -- Name: lpd; Type: TABLE; Schema: public; Owner: edb
>> --
>>
>> CREATE TABLE lpd (
>> a integer,
>> b integer,
>> c character varying
>> )
>> PARTITION BY LIST (a);
>>
>>
>> ALTER TABLE lpd OWNER TO edb;
>>
>> --
>> -- Name: lpd_d; Type: TABLE; Schema: public; Owner: edb
>> --
>>
>> CREATE TABLE lpd_d PARTITION OF lpd
>> FOR VALUES IN (DEFAULT);
>>
>>
>> ALTER TABLE lpd_d OWNER TO edb;
>>
>>
>> Thanks,
>> Rajkumar
>>
>
>


Re: [HACKERS] Adding support for Default partition in partitioning

2017-05-09 Thread Rahila Syed
+1 for the AS DEFAULT syntax if it helps improve readability, especially in
the following case:

CREATE TABLE p1 PARTITION OF test AS DEFAULT PARTITION BY LIST(a);

Thank you,
Rahila Syed

On Tue, May 9, 2017 at 1:13 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Thu, May 4, 2017 at 4:40 PM, Sven R. Kunze <srku...@mail.de> wrote:
> > It yields
> >
> > CREATE TABLE p1 PARTITION OF test DEFAULT PARTITION BY LIST(b);
> >
> > This reads to me like "DEFAULT PARTITION".
> >
> > I can imagine a lot of confusion when those queries are encountered in
> the
> > wild. I know this thread is about creating a default partition but I
> were to
> > propose a minor change in the following direction, I think confusion
> would
> > be greatly avoided:
> >
> > CREATE TABLE p1 PARTITION OF test AS DEFAULT PARTITIONED BY LIST(b);
> >
> > I know it's a bit longer but I think those 4 characters might serve
> > readability in the long term. It was especially confusing to see
> PARTITION
> > in two positions serving two different functions.
>
> Well, we certainly can't make that change just for default partitions.
> I mean, that would be non-orthogonal, right?  You can't say that the
> way to subpartition is to write "PARTITION BY strategy" when the table
> unpartitioned or is a non-default partition but "PARTITIONED BY
> strategy" when it is a default partition.  That would certainly not be
> a good way of confusing users less, and would probably result in a
> variety of special cases in places like ruleutils.c or pg_dump, plus
> some weasel-wording in the documentation.  We COULD do a general
> change from "CREATE TABLE table_name PARTITION BY strategy" to "CREATE
> TABLE table_name PARTITIONED BY strategy".  I don't have any
> particular arguments against that except that the current syntax is
> more like Oracle, which might count for something, and maybe the fact
> that we're a month after feature freeze.  Still, if we want to change
> that, now would be the time; but I favor leaving it alone.
>
> I don't have a big objection to adding AS.  If that's the majority
> vote, fine; if not, that's OK, too.  I can see it might be a bit more
> clear in the case you mention, but it might also just be a noise word
> that we don't really need.  There don't seem to be many uses of AS
> that would pose a risk of actual grammar conflicts here.  I can
> imagine someone wanting to use CREATE TABLE ... PARTITION BY ... AS
> SELECT ... to create and populate a partition in one command, but that
> wouldn't be a conflict because it'd have to go AFTER the partition
> specification.  In the DEFAULT case, you'd end up with something like
>
> CREATE TABLE p1 PARTITION OF test AS DEFAULT AS 
>
> ...which is neither great nor horrible syntax-wise and maybe not such
> a good thing to support anyway since it would have to lock the parent
> to add the partition and then keep the lock on the parent while
> populating the new child (ouch).
>
> So I guess I'm still in favor of the CREATE TABLE p1 PARTITION OF test
> DEFAULT syntax, but if it ends up being AS DEFAULT instead, I can live
> with that.
>
> Other opinions?
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] Adding support for Default partition in partitioning

2017-05-08 Thread Rahila Syed
>I am not able add a new partition if default partition is further
partitioned
>with default partition.

Thanks for reporting. I will fix this.

>pg_restore is failing for default partition, dump file still storing old
syntax of default partition.
Thanks for reporting. I will fix this once the syntax is finalized.


On Fri, May 5, 2017 at 12:46 PM, Jeevan Ladhe  wrote:

> Hi Rahila,
>
> I am not able add a new partition if default partition is further
> partitioned
> with default partition.
>
> Consider example below:
>
> postgres=# CREATE TABLE test ( a int, b int, c int) PARTITION BY LIST (a);
> CREATE TABLE
> postgres=# CREATE TABLE test_p1 PARTITION OF test FOR VALUES IN(4, 5, 6,
> 7, 8);
> CREATE TABLE
> postgres=# CREATE TABLE test_pd PARTITION OF test DEFAULT PARTITION BY
> LIST(b);
> CREATE TABLE
> postgres=# CREATE TABLE test_pd_pd PARTITION OF test_pd DEFAULT;
> CREATE TABLE
> postgres=# INSERT INTO test VALUES (20, 24, 12);
> INSERT 0 1
> *postgres=# CREATE TABLE test_p2 PARTITION OF test FOR VALUES IN(15);*
> *ERROR:  could not open file "base/12335/16420": No such file or directory*
>
>
> Thanks,
> Jeevan Ladhe
>
> On Fri, May 5, 2017 at 11:55 AM, Rajkumar Raghuwanshi <
> rajkumar.raghuwan...@enterprisedb.com> wrote:
>
>> Hi Rahila,
>>
>> pg_restore is failing for default partition, dump file still storing old
>> syntax of default partition.
>>
>> create table lpd (a int, b int, c varchar) partition by list(a);
>> create table lpd_d partition of lpd DEFAULT;
>>
>> create database bkp owner 'edb';
>> grant all on DATABASE bkp to edb;
>>
>> --take plain dump of existing database
>> \! ./pg_dump -f lpd_test.sql -Fp -d postgres
>>
>> --restore plain backup to new database bkp
>> \! ./psql -f lpd_test.sql -d bkp
>>
>> psql:lpd_test.sql:63: ERROR:  syntax error at or near "DEFAULT"
>> LINE 2: FOR VALUES IN (DEFAULT);
>>^
>>
>>
>> vi lpd_test.sql
>>
>> --
>> -- Name: lpd; Type: TABLE; Schema: public; Owner: edb
>> --
>>
>> CREATE TABLE lpd (
>> a integer,
>> b integer,
>> c character varying
>> )
>> PARTITION BY LIST (a);
>>
>>
>> ALTER TABLE lpd OWNER TO edb;
>>
>> --
>> -- Name: lpd_d; Type: TABLE; Schema: public; Owner: edb
>> --
>>
>> CREATE TABLE lpd_d PARTITION OF lpd
>> FOR VALUES IN (DEFAULT);
>>
>>
>> ALTER TABLE lpd_d OWNER TO edb;
>>
>>
>> Thanks,
>> Rajkumar
>>
>
>


Re: [HACKERS] Adding support for Default partition in partitioning

2017-05-04 Thread Rahila Syed
Hello Amul,

Thanks for reporting. Please find attached an updated patch which fixes the
above. The attached patch also includes the syntax changes proposed upthread.

The syntax implemented in this patch is as follows:

CREATE TABLE p11 PARTITION OF p1 DEFAULT;

Thank you,
Rahila Syed

On Thu, May 4, 2017 at 4:02 PM, amul sul <sula...@gmail.com> wrote:

> On Tue, May 2, 2017 at 9:33 PM, Rahila Syed <rahilasye...@gmail.com>
> wrote:
> > Please find attached updated patch with review comments by Robert and
> Jeevan
> > implemented.
> >
> Patch v8 got clean apply on latest head but server got crash at data
> insert in the following test:
>
> -- Create test table
> CREATE TABLE test ( a int, b date) PARTITION BY LIST (a);
> CREATE TABLE p1 PARTITION OF test FOR VALUES IN  (DEFAULT) PARTITION BY
> LIST(b);
> CREATE TABLE p11 PARTITION OF p1 FOR VALUES IN (DEFAULT);
>
> -- crash
> INSERT INTO test VALUES (210,'1/1/2002');
>
> Regards,
> Amul
>


default_partition_v9.patch
Description: application/download



Re: [HACKERS] Adding support for Default partition in partitioning

2017-05-02 Thread Rahila Syed
Please find attached an updated patch with the review comments by Robert and
Jeevan implemented.

The newly proposed syntax
CREATE TABLE .. PARTITION OF .. DEFAULT has got the most votes on this thread.

If there are no more objections, I will go ahead and include it in the
patch.

Thank you,
Rahila Syed

On Mon, Apr 24, 2017 at 2:40 PM, Rahila Syed <rahilasye...@gmail.com> wrote:

> Hello,
>
> Thank you for reviewing.
>
> >But that's not a good idea for several reasons.  For one thing, you
> >can also write FOR VALUES IN (DEFAULT, 5) or which isn't sensible.
> >For another thing, this kind of syntax won't generalize to range
> >partitioning, which we've talked about making this feature support.
> >Maybe something like:
>
> >CREATE TABLE .. PARTITION OF .. DEFAULT;
>
> I agree that the syntax should be changed to also support range
> partitioning.
>
> Following can also be considered as it specifies more clearly that the
> partition holds default values.
>
> CREATE TABLE ...PARTITION OF...FOR VALUES DEFAULT;
>
> >Maybe we should introduce a dedicated node type to
> >represent a default-specification in the parser grammar.  If not, then
> >let's at least encapsulate the test a little better, e.g. by adding
> >isDefaultPartitionBound() which tests not only IsA(..., DefElem) but
> >also whether the name is DEFAULT as expected.  BTW, we typically use
> >lower-case internally, so if we stick with this representation it
> >should really be "default" not "DEFAULT".
>
> isDefaultPartitionBound() function is created in the attached patch which
> checks for both node type and name.
>
> >Why abbreviate "default" to def here?  Seems pointless.
> Corrected in the attached.
>
> >Consider &&
> Fixed.
>
> >+ * default partiton for rows satisfying the new partition
> >Spelling.
> Fixed.
>
> >Missing apostrophe
> Fixed.
>
> >Definitely not safe against concurrency, since AccessShareLock won't
> >exclude somebody else's update.  In fact, it won't even cover somebody
> >else's already-in-flight transaction
> Changed it to AccessExclusiveLock
>
> >Normally in such cases we try to give more detail using
> >ExecBuildSlotValueDescription.
> This function is used in execMain.c and the error is being
> reported in partition.c.
> Do you mean the error reporting should be moved into execMain.c
> to use ExecBuildSlotValueDescription?
>
> >This variable starts out true and is never set to any value other than
> >true.  Just get rid of it and, in the one place where it is currently
> >used, write "true".  That's shorter and clearer.
> Fixed.
>
> >There's not really a reason to cast the result of stringToNode() to
> >Node * and then turn around and cast it to PartitionBoundSpec *.  Just
> >cast it directly to whatever it needs to be.  And use the new castNode
> >macro
> Fixed. castNode macro takes as input Node * whereas stringToNode() takes
> string.
> IIUC, castNode cant be used here.
>
> >The if (def_elem) test continues
> >early, but if the point is that the loop using cell3 shouldn't execute
> >in that case, why not just put if (!def_elem) { foreach(cell3, ...) {
> >... } } instead of reiterating the ReleaseSysCache in two places?
> Fixed in the attached.
>
> I will respond to further comments in following email.
>
>
> On Thu, Apr 13, 2017 at 12:48 AM, Robert Haas <robertmh...@gmail.com>
> wrote:
>
>> On Thu, Apr 6, 2017 at 7:30 AM, Rahila Syed <rahilasye...@gmail.com>
>> wrote:
>> > Thanks a lot for testing and reporting this. Please find attached an
>> updated
>> > patch with the fix. The patch also contains a fix
>> > regarding operator used at the time of creating expression as default
>> > partition constraint. This was notified offlist by Amit Langote.
>>
>> I think that the syntax for this patch should probably be revised.
>> Right now the proposal is for:
>>
>> CREATE TABLE .. PARTITION OF ... FOR VALUES IN (DEFAULT);
>>
>> But that's not a good idea for several reasons.  For one thing, you
>> can also write FOR VALUES IN (DEFAULT, 5) or which isn't sensible.
>> For another thing, this kind of syntax won't generalize to range
>> partitioning, which we've talked about making this feature support.
>> Maybe something like:
>>
>> CREATE TABLE .. PARTITION OF .. DEFAULT;
>>
>> This patch makes the assumption throughout that any DefElem represents
>> the word DEFAULT, which is true in the patch as written but doesn't
>> seem very future-proof.  I think the "def" in "

Re: [HACKERS] Adding support for Default partition in partitioning

2017-04-27 Thread Rahila Syed
Hi,

On Apr 27, 2017 18:37, "Robert Haas" <robertmh...@gmail.com> wrote:

>
>
> Are you also working on extending this to work with range
> partitioning?  Because I think that would be good to do.
>
>
Currently I am working on review comments and bug fixes for the
default list partitioning patch. After that I can start with default
partition for range partitioning.

Thank you,
Rahila Syed


Re: [HACKERS] Adding support for Default partition in partitioning

2017-04-27 Thread Rahila Syed
>I suspect it could be done as of now, but I'm a little worried that it
>might create grammar conflicts in the future as we extend the syntax
>further.  If we use CREATE TABLE ... PARTITION OF .. DEFAULT, then the
>word DEFAULT appears in the same position where we'd normally have FOR
>VALUES, and so the parser will definitely be able to figure out what's
>going on.  When it gets to that position, it will see FOR or it will
>see DEFAULT, and all is clear.  OTOH, if we use CREATE TABLE ...
>DEFAULT PARTITION OF ..., then we have action at a distance: whether
>or not the word DEFAULT is present before PARTITION affects which
>tokens are legal after the parent table name.  bison isn't always very
>smart about that kind of thing.  No particular dangers come to mind at
>the moment, but it makes me nervous anyway.

+1 for the CREATE TABLE .. PARTITION OF .. DEFAULT syntax.
I think substituting DEFAULT for FOR VALUES is appropriate, as the two cases
are mutually exclusive.

One more thing that needs consideration is whether default partitions should
be allowed to be partitioned further. Other databases allow default
partitions to be partitioned further, and I think it is natural for users to
expect the data in a default partition to also be divisible into
sub-partitions, so it should be supported. My colleague Rajkumar Raghuwanshi
brought to my notice that the current patch does not handle this correctly.
I will include this in the updated patch if there is no objection.
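
For illustration, this is the kind of schema that would then be allowed (a
sketch using the CREATE TABLE ... PARTITION OF ... DEFAULT syntax under
discussion; table names are made up):

    CREATE TABLE test (a int, b int) PARTITION BY LIST (a);
    CREATE TABLE test_p1 PARTITION OF test FOR VALUES IN (1, 2, 3);
    -- The default partition is itself list-partitioned on b:
    CREATE TABLE test_def PARTITION OF test DEFAULT PARTITION BY LIST (b);
    CREATE TABLE test_def_p1 PARTITION OF test_def FOR VALUES IN (10);
    CREATE TABLE test_def_def PARTITION OF test_def DEFAULT;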

On the other hand, if sub-partitioning of a default partition is to be
prohibited, an error should be thrown if PARTITION BY is specified after
DEFAULT.


Thank you,
Rahila Syed



On Tue, Apr 25, 2017 at 1:46 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Mon, Apr 24, 2017 at 8:14 AM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
> > On Mon, Apr 24, 2017 at 4:24 PM, Robert Haas <robertmh...@gmail.com>
> wrote:
> >> On Mon, Apr 24, 2017 at 5:10 AM, Rahila Syed <rahilasye...@gmail.com>
> wrote:
> >>> Following can also be considered as it specifies more clearly that the
> >>> partition holds default values.
> >>>
> >>> CREATE TABLE ...PARTITION OF...FOR VALUES DEFAULT;
> >>
> >> The partition doesn't contain default values; it is itself a default.
> >
> > Is CREATE TABLE ... DEFAULT PARTITION OF ... feasible? That sounds more
> natural.
>
> I suspect it could be done as of now, but I'm a little worried that it
> might create grammar conflicts in the future as we extend the syntax
> further.  If we use CREATE TABLE ... PARTITION OF .. DEFAULT, then the
> word DEFAULT appears in the same position where we'd normally have FOR
> VALUES, and so the parser will definitely be able to figure out what's
> going on.  When it gets to that position, it will see FOR or it will
> see DEFAULT, and all is clear.  OTOH, if we use CREATE TABLE ...
> DEFAULT PARTITION OF ..., then we have action at a distance: whether
> or not the word DEFAULT is present before PARTITION affects which
> tokens are legal after the parent table name.  bison isn't always very
> smart about that kind of thing.  No particular dangers come to mind at
> the moment, but it makes me nervous anyway.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] Adding support for Default partition in partitioning

2017-04-26 Thread Rahila Syed
Hello Jeevan,

Thank you for comments.

I will include your comments in the updated patch.

>7.The output of describe needs to be improved.

The syntax for DEFAULT partitioning is still under discussion. This comment
won't be applicable if the syntax is changed.

>6.
>I am wondering, isn't it possible to retrieve the has_default and
default_index
>here to find out if default partition exists and if exist then find it's
oid
>using rd_partdesc, that would avoid above(7) loop to check if partition
bound is
>default
The checks are used to find the default partition bound and exclude it from
the list of partition bounds when forming the partition constraint. This
can't be accomplished using the has_default flag alone;
isDefaultPartitionBound() is written to accomplish that.


>8.
>Following call to find_inheritance_children() in
generate_qual_for_defaultpart()
>is an overhead, instead we can simply use an array of oids in rd_partdesc.
I think using find_inheritance_children() takes into consideration a
concurrent drop of a partition, which the value cached in rd_partdesc will
not.

Thank you,
Rahila Syed


Re: [HACKERS] Adding support for Default partition in partitioning

2017-04-24 Thread Rahila Syed
Hello,

Thank you for reviewing.

>But that's not a good idea for several reasons.  For one thing, you
>can also write FOR VALUES IN (DEFAULT, 5) or which isn't sensible.
>For another thing, this kind of syntax won't generalize to range
>partitioning, which we've talked about making this feature support.
>Maybe something like:

>CREATE TABLE .. PARTITION OF .. DEFAULT;

I agree that the syntax should be changed to also support range
partitioning.

Following can also be considered as it specifies more clearly that the
partition holds default values.

CREATE TABLE ...PARTITION OF...FOR VALUES DEFAULT;

>Maybe we should introduce a dedicated node type to
>represent a default-specification in the parser grammar.  If not, then
>let's at least encapsulate the test a little better, e.g. by adding
>isDefaultPartitionBound() which tests not only IsA(..., DefElem) but
>also whether the name is DEFAULT as expected.  BTW, we typically use
>lower-case internally, so if we stick with this representation it
>should really be "default" not "DEFAULT".

isDefaultPartitionBound() function is created in the attached patch which
checks for both node type and name.

>Why abbreviate "default" to def here?  Seems pointless.
Corrected in the attached.

>Consider &&
Fixed.

>+ * default partiton for rows satisfying the new partition
>Spelling.
Fixed.

>Missing apostrophe
Fixed.

>Definitely not safe against concurrency, since AccessShareLock won't
>exclude somebody else's update.  In fact, it won't even cover somebody
>else's already-in-flight transaction
Changed it to AccessExclusiveLock

>Normally in such cases we try to give more detail using
>ExecBuildSlotValueDescription.
This function is used in execMain.c and the error is being
reported in partition.c.
Do you mean the error reporting should be moved into execMain.c
to use ExecBuildSlotValueDescription?

>This variable starts out true and is never set to any value other than
>true.  Just get rid of it and, in the one place where it is currently
>used, write "true".  That's shorter and clearer.
Fixed.

>There's not really a reason to cast the result of stringToNode() to
>Node * and then turn around and cast it to PartitionBoundSpec *.  Just
>cast it directly to whatever it needs to be.  And use the new castNode
>macro
Fixed. The castNode macro takes a Node * as input, whereas stringToNode()
takes a string. IIUC, castNode can't be used here.

>The if (def_elem) test continues
>early, but if the point is that the loop using cell3 shouldn't execute
>in that case, why not just put if (!def_elem) { foreach(cell3, ...) {
>... } } instead of reiterating the ReleaseSysCache in two places?
Fixed in the attached.

I will respond to further comments in following email.


On Thu, Apr 13, 2017 at 12:48 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Thu, Apr 6, 2017 at 7:30 AM, Rahila Syed <rahilasye...@gmail.com>
> wrote:
> > Thanks a lot for testing and reporting this. Please find attached an
> updated
> > patch with the fix. The patch also contains a fix
> > regarding operator used at the time of creating expression as default
> > partition constraint. This was notified offlist by Amit Langote.
>
> I think that the syntax for this patch should probably be revised.
> Right now the proposal is for:
>
> CREATE TABLE .. PARTITION OF ... FOR VALUES IN (DEFAULT);
>
> But that's not a good idea for several reasons.  For one thing, you
> can also write FOR VALUES IN (DEFAULT, 5) or which isn't sensible.
> For another thing, this kind of syntax won't generalize to range
> partitioning, which we've talked about making this feature support.
> Maybe something like:
>
> CREATE TABLE .. PARTITION OF .. DEFAULT;
>
> This patch makes the assumption throughout that any DefElem represents
> the word DEFAULT, which is true in the patch as written but doesn't
> seem very future-proof.  I think the "def" in "DefElem" stands for
> "definition" or "define" or something like that, so this is actually
> pretty confusing.  Maybe we should introduce a dedicated node type to
> represent a default-specification in the parser grammar.  If not, then
> let's at least encapsulate the test a little better, e.g. by adding
> isDefaultPartitionBound() which tests not only IsA(..., DefElem) but
> also whether the name is DEFAULT as expected.  BTW, we typically use
> lower-case internally, so if we stick with this representation it
> should really be "default" not "DEFAULT".
>
> Useless hunk:
>
> +boolhas_def;/* Is there a default partition?
> Currently false
> + * for a range partitioned table */
> +intde

Re: [HACKERS] Adding support for Default partition in partitioning

2017-04-06 Thread Rahila Syed
Hello,

Thanks a lot for testing and reporting this. Please find attached an updated
patch with the fix. The patch also contains a fix regarding the operator used
when creating the expression for the default partition constraint. This was
reported offlist by Amit Langote.

Thank you,
Rahila Syed


On Thu, Apr 6, 2017 at 12:21 AM, Keith Fiske <ke...@omniti.com> wrote:

>
> On Wed, Apr 5, 2017 at 11:19 AM, Robert Haas <robertmh...@gmail.com>
> wrote:
>
>> On Wed, Apr 5, 2017 at 5:57 AM, Rahila Syed <rahilasye...@gmail.com>
>> wrote:
>> >>Could you briefly elaborate why you think the lack global index support
>> >>would be a problem in this regard?
>> > I think following can happen if we allow rows satisfying the new
>> partition
>> > to lie around in the
>> > default partition until background process moves it.
>> > Consider a scenario where partition key is a primary key and the data
>> in the
>> > default partition is
>> > not yet moved into the newly added partition. If now, new data is added
>> into
>> > the new partition
>> > which also exists(same key) in default partition there will be data
>> > duplication. If now
>> > we scan the partitioned table for that key(from both the default and new
>> > partition as we
>> > have not moved the rows) it will fetch the both rows.
>> > Unless we have global indexes for partitioned tables, there is chance of
>> > data duplication between
>> > child table added after default partition and the default partition.
>>
>> Yes, I think it would be completely crazy to try to migrate the data
>> in the background:
>>
>> - The migration might never complete because of a UNIQUE or CHECK
>> constraint on the partition to which rows are being migrated.
>>
>> - Even if the migration eventually succeeded, such a design abandons
>> all hope of making INSERT .. ON CONFLICT DO NOTHING work sensibly
>> while the migration is in progress, unless the new partition has no
>> UNIQUE constraints.
>>
>> - Partition-wise join and partition-wise aggregate would need to have
>> special case handling for the case of an unfinished migration, as
>> would any user code that accesses partitions directly.
>>
>> - More generally, I think users expect that when a DDL command
>> finishes execution, it's done all of the work that there is to do (or
>> at the very least, that any remaining work has no user-visible
>> consequences, which would not be the case here).
>>
>> IMV, the question of whether we have efficient ways to move data
>> around between partitions is somewhat separate from the question of
>> whether partitions can be defined in a certain way in the first place.
>> The problems that Keith refers to upthread already exist for
>> subpartitioning; you've got to detach the old partition, create a new
>> one, and then reinsert the data.  And for partitioning an
>> unpartitioned table: create a replacement table, insert all the data,
>> substitute it for the original table.  The fact that we have these
>> limitation is not good, but we're not going to rip out partitioning
>> entirely because we don't have clever ways of migrating the data in
>> those cases, and the proposed behavior here is not any worse.
>>
>> Also, waiting for those problems to get fixed might be waiting for
>> Godot.  I'm not really all that sanguine about our chances of coming
>> up with a really nice way of handling these cases.  In a designed
>> based on table inheritance, you can leave it murky where certain data
>> is supposed to end up and migrate it on-line and you might get away
>> with that, but a major point of having declarative partitioning at all
>> is to remove that sort of murkiness.  It's probably not that hard to
>> come up with a command that locks the parent and moves data around via
>> full table scans, but I'm not sure how far that really gets us; you
>> could do the same thing easily enough with a sequence of commands
>> generated via a script.  And being able to do this in a general way
>> without a full table lock looks pretty hard - it doesn't seem
>> fundamentally different from trying to perform a table-rewriting
>> operation like CLUSTER without a full table lock, which we also don't
>> support.  The executor is not built to cope with any aspect of the
>> table definition shifting under it, and that includes the set of child
>> tables with are partitions of the table mentioned in the query.  Maybe
>> the executor can be taught to survive such definitional changes at
>> lea

Re: [HACKERS] Adding support for Default partition in partitioning

2017-04-05 Thread Rahila Syed
Hello Amit,

>Could you briefly elaborate why you think the lack global index support
>would be a problem in this regard?
I think the following can happen if we allow rows satisfying the new
partition to remain in the default partition until a background process moves
them. Consider a scenario where the partition key is a primary key and the
data in the default partition has not yet been moved into the newly added
partition. If new data with the same key is now added into the new partition,
there will be data duplication, and a scan of the partitioned table for that
key (from both the default and the new partition, since we have not moved the
rows yet) will fetch both rows. Unless we have global indexes for partitioned
tables, there is a chance of data duplication between a child table added
after the default partition and the default partition itself.
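
To spell out the scenario being argued against (a purely hypothetical
timeline, assuming the new partition could be added while matching rows were
still sitting in the default partition; table names are made up):

    CREATE TABLE t (id int, val text) PARTITION BY LIST (id);
    CREATE TABLE t_def PARTITION OF t FOR VALUES IN (DEFAULT);
    INSERT INTO t VALUES (5, 'old');   -- routed to t_def

    -- Hypothetically, a partition for 5 is added before the old row is moved:
    CREATE TABLE t_5 PARTITION OF t FOR VALUES IN (5);
    INSERT INTO t VALUES (5, 'new');   -- routed to t_5

    -- Without a global unique index nothing prevents the duplicate, and
    -- SELECT * FROM t WHERE id = 5; would fetch both rows.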

>Scanning it and moving rows to the newly added partition while holding an
>AccessExclusiveLock on the parent will block any and all of the concurrent
>activity on it until the row-movement is finished.
Can you explain why this will require an AccessExclusiveLock on the parent
and not just on the default partition and the newly added partition?

Thank you,
Rahila Syed


On Wed, Apr 5, 2017 at 1:22 PM, Amit Langote <langote_amit...@lab.ntt.co.jp>
wrote:

> On 2017/04/05 14:41, Rushabh Lathia wrote:
> > I agree about the future plan about the row movement, how that is I am
> > not quite sure at this stage.
> >
> > I was thinking that CREATE new partition is the DDL command, so even
> > if row-movement works with holding the lock on the new partition table,
> > that should be fine. I am not quire sure, why row movement should be
> > happen in the back-ground process.
>
> I think to improve the availability of access to the partitioned table.
>
> Consider that the default partition may have gotten pretty large.
> Scanning it and moving rows to the newly added partition while holding an
> AccessExclusiveLock on the parent will block any and all of the concurrent
> activity on it until the row-movement is finished.  One may be prepared to
> pay this cost, for which there should definitely be an option to perform
> the row-movement in the same transaction (also possibly the default
> behavior).
>
> Thanks,
> Amit
>
>
>


Re: [HACKERS] Adding support for Default partition in partitioning

2017-04-05 Thread Rahila Syed
Hi,

>However, running your original example, I'm getting this error
Thank you for testing. Please find attached an updated patch which fixes
the above.


Thank you,
Rahila Syed


default_partition_v5.patch
Description: application/download



Re: [HACKERS] Adding support for Default partition in partitioning

2017-04-04 Thread Rahila Syed
Hello,

Please find attached an updated patch. The following has been accomplished in
this update:

1. A new partition can be added after the default partition if there are no
conflicting rows in the default partition.
2. The crash reported earlier has been fixed.
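
For example, with this update something along the following lines should now
work (a sketch; the tables follow the example reported earlier in the thread,
and FOR VALUES IN (DEFAULT) is the syntax the patch currently implements):

    CREATE TABLE list_partitioned (a int) PARTITION BY LIST (a);
    CREATE TABLE part_default PARTITION OF list_partitioned
        FOR VALUES IN (DEFAULT);
    -- part_default holds no rows with a IN (4, 5), so this should now
    -- succeed instead of crashing:
    CREATE TABLE part_1 PARTITION OF list_partitioned FOR VALUES IN (4, 5);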

Thank you,
Rahila Syed



On Tue, Apr 4, 2017 at 6:22 PM, David Steele <da...@pgmasters.net> wrote:

> On 3/31/17 10:45 AM, David Steele wrote:
> > On 3/29/17 8:13 AM, Rahila Syed wrote:
> >
> >> Thanks for reporting. I have identified the problem and have a fix.
> >> Currently working on allowing
> >> adding a partition after default partition if the default partition does
> >> not have any conflicting rows.
> >> Will update the patch with both of these.
> >
> > The CF has been extended but until April 7 but time is still growing
> > short.  Please respond with a new patch by 2017-04-04 00:00 AoE (UTC-12)
> > or this submission will be marked "Returned with Feedback".
>
> This submission has been marked "Returned with Feedback".  Please feel
> free to resubmit to a future commitfest.
>
> Regards,
> --
> -David
> da...@pgmasters.net
>


default_partition_v4.patch
Description: application/download



Re: [HACKERS] Adding support for Default partition in partitioning

2017-03-29 Thread Rahila Syed
Thanks for reporting. I have identified the problem and have a fix. I am
currently working on allowing a partition to be added after the default
partition if the default partition does not contain any conflicting rows.
I will update the patch with both of these changes.

Thank you,
Rahila Syed

On Mon, Mar 27, 2017 at 12:10 PM, Rushabh Lathia <rushabh.lat...@gmail.com>
wrote:

> I applied the patch and was trying to perform some testing, but its
> ending up with server crash with the test shared by you in your starting
> mail:
>
> postgres=# CREATE TABLE list_partitioned (
> postgres(# a int
> postgres(# ) PARTITION BY LIST (a);
> CREATE TABLE
> postgres=#
> postgres=# CREATE TABLE part_default PARTITION OF list_partitioned FOR
> VALUES IN (DEFAULT);
> CREATE TABLE
>
> postgres=# CREATE TABLE part_1 PARTITION OF list_partitioned FOR VALUES IN
> (4,5);
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> Apart from this, few more explanation in the patch is needed to explain the
> changes for the DEFAULT partition. Like I am not quite sure what exactly
> the
> latest version of patch supports, like does that support the tuple row
> movement,
> or adding new partition will be allowed having partition table having
> DEFAULT
> partition, which is quite difficult to understand through the code changes.
>
> Another part which is missing in the patch is the test coverage, adding
> proper test coverage, which explain what is supported and what's not.
>
> Thanks,
>
> On Fri, Mar 24, 2017 at 3:25 PM, Rahila Syed <rahilasye...@gmail.com>
> wrote:
>
>> Hello Rushabh,
>>
>> Thank you for reviewing.
>> Have addressed all your comments in the attached patch. The attached
>> patch currently throws an
>> error if a new partition is added after default partition.
>>
>> >Rather then adding check for default here, I think this should be
>> handle inside
>> >get_qual_for_list().
>> Have moved the check inside get_qual_for_partbound() as needed to do some
>> operations
>> before calling get_qual_for_list() for default partitions.
>>
>> Thank you,
>> Rahila Syed
>>
>> On Tue, Mar 21, 2017 at 11:36 AM, Rushabh Lathia <
>> rushabh.lat...@gmail.com> wrote:
>>
>>> I picked this for review and noticed that patch is not getting
>>> cleanly complied on my environment.
>>>
>>> partition.c: In function ‘RelationBuildPartitionDesc’:
>>> partition.c:269:6: error: ISO C90 forbids mixed declarations and code
>>> [-Werror=declaration-after-statement]
>>>   Const*val = lfirst(c);
>>>   ^
>>> partition.c: In function ‘generate_partition_qual’:
>>> partition.c:1590:2: error: ISO C90 forbids mixed declarations and code
>>> [-Werror=declaration-after-statement]
>>>   PartitionBoundSpec *spec = (PartitionBoundSpec *)bound;
>>>   ^
>>> partition.c: In function ‘get_partition_for_tuple’:
>>> partition.c:1820:5: error: array subscript has type ‘char’
>>> [-Werror=char-subscripts]
>>>  result = parent->indexes[partdesc->boundinfo->def_index];
>>>  ^
>>> partition.c:1824:16: error: assignment makes pointer from integer
>>> without a cast [-Werror]
>>>  *failed_at = RelationGetRelid(parent->reldesc);
>>> ^
>>> cc1: all warnings being treated as errors
>>>
>>> Apart from this, I was reading patch here are few more comments:
>>>
>>> 1) Variable initializing happening at two place.
>>>
>>> @@ -166,6 +170,8 @@ RelationBuildPartitionDesc(Relation rel)
>>>  /* List partitioning specific */
>>>  PartitionListValue **all_values = NULL;
>>>  boolfound_null = false;
>>> +boolfound_def = false;
>>> +intdef_index = -1;
>>>  intnull_index = -1;
>>>
>>>  /* Range partitioning specific */
>>> @@ -239,6 +245,8 @@ RelationBuildPartitionDesc(Relation rel)
>>>  i = 0;
>>>  found_null = false;
>>>  null_index = -1;
>>> +found_def = false;
>>> +def_index = -1;
>>>  foreach(cell, boundspecs)
>>>  {
>>>  ListCell   *c;
>>> @@ -249,6 +257,15 @@ RelationBuildPartitionDesc(Relation rel)
>>>
>>>
>>> 2)
>>>
>>> @@

Re: [HACKERS] Adding support for Default partition in partitioning

2017-03-24 Thread Rahila Syed
Hello Rushabh,

Thank you for reviewing.
I have addressed all your comments in the attached patch, which currently
throws an error if a new partition is added after the default partition.

>Rather then adding check for default here, I think this should be handle
inside
>get_qual_for_list().
I have moved the check inside get_qual_for_partbound(), as some operations
need to be done before calling get_qual_for_list() for default partitions.

Thank you,
Rahila Syed

On Tue, Mar 21, 2017 at 11:36 AM, Rushabh Lathia <rushabh.lat...@gmail.com>
wrote:

> I picked this for review and noticed that patch is not getting
> cleanly complied on my environment.
>
> partition.c: In function ‘RelationBuildPartitionDesc’:
> partition.c:269:6: error: ISO C90 forbids mixed declarations and code
> [-Werror=declaration-after-statement]
>   Const*val = lfirst(c);
>   ^
> partition.c: In function ‘generate_partition_qual’:
> partition.c:1590:2: error: ISO C90 forbids mixed declarations and code
> [-Werror=declaration-after-statement]
>   PartitionBoundSpec *spec = (PartitionBoundSpec *)bound;
>   ^
> partition.c: In function ‘get_partition_for_tuple’:
> partition.c:1820:5: error: array subscript has type ‘char’
> [-Werror=char-subscripts]
>  result = parent->indexes[partdesc->boundinfo->def_index];
>  ^
> partition.c:1824:16: error: assignment makes pointer from integer without
> a cast [-Werror]
>  *failed_at = RelationGetRelid(parent->reldesc);
> ^
> cc1: all warnings being treated as errors
>
> Apart from this, I was reading patch here are few more comments:
>
> 1) Variable initializing happening at two place.
>
> @@ -166,6 +170,8 @@ RelationBuildPartitionDesc(Relation rel)
>  /* List partitioning specific */
>  PartitionListValue **all_values = NULL;
>  boolfound_null = false;
> +boolfound_def = false;
> +intdef_index = -1;
>  intnull_index = -1;
>
>  /* Range partitioning specific */
> @@ -239,6 +245,8 @@ RelationBuildPartitionDesc(Relation rel)
>  i = 0;
>  found_null = false;
>  null_index = -1;
> +found_def = false;
> +def_index = -1;
>  foreach(cell, boundspecs)
>  {
>  ListCell   *c;
> @@ -249,6 +257,15 @@ RelationBuildPartitionDesc(Relation rel)
>
>
> 2)
>
> @@ -1558,6 +1586,19 @@ generate_partition_qual(Relation rel)
>  bound = stringToNode(TextDatumGetCString(boundDatum));
>  ReleaseSysCache(tuple);
>
> +/* Return if it is a default list partition */
> +PartitionBoundSpec *spec = (PartitionBoundSpec *)bound;
> +ListCell *cell;
> +foreach(cell, spec->listdatums)
>
> More comment on above hunk is needed?
>
> Rather then adding check for default here, I think this should be handle
> inside
> get_qual_for_list().
>
> 3) Code is not aligned with existing
>
> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> index 6316688..ebb7db7 100644
> --- a/src/backend/parser/gram.y
> +++ b/src/backend/parser/gram.y
> @@ -2594,6 +2594,7 @@ partbound_datum:
>  Sconst{ $$ = makeStringConst($1, @1); }
>  | NumericOnly{ $$ = makeAConst($1, @1); }
>  | NULL_P{ $$ = makeNullAConst(@1); }
> +| DEFAULT  { $$ = (Node *)makeDefElem("DEFAULT", NULL, @1); }
>  ;
>
>
> 4) Unnecessary hunk:
>
> @@ -2601,7 +2602,6 @@ partbound_datum_list:
>  | partbound_datum_list ',' partbound_datum
>  { $$ = lappend($1, $3); }
>  ;
> -
>
> Note: this is just an initially review comments, I am yet to do the
> detailed review
> and the testing for the patch.
>
> Thanks.
>
> On Mon, Mar 20, 2017 at 9:27 AM, Rahila Syed <rahilasye...@gmail.com>
> wrote:
>
>> Hello,
>>
>> Please find attached a rebased patch with support for pg_dump. I am
>> working on the patch
>> to handle adding a new partition after a default partition by throwing an
>> error if
>> conflicting rows exist in default partition and adding the partition
>> successfully otherwise.
>> Will post an updated patch by tomorrow.
>>
>> Thank you,
>> Rahila Syed
>>
>> On Mon, Mar 13, 2017 at 5:42 AM, Robert Haas <robertmh...@gmail.com>
>> wrote:
>>
>>> On Fri, Mar 10, 2017 at 2:17 PM, Peter Eisentraut
>>> <peter.eisentr...@2ndquadrant.com> wrote:
>>> > On 3/2/17 21:40, Robert Haas wrote:
>>> >> On the point mentioned above, I
>>> 

Re: [HACKERS] Adding support for Default partition in partitioning

2017-03-19 Thread Rahila Syed
Hello,

Please find attached a rebased patch with support for pg_dump. I am working on
handling the addition of a new partition after a default partition: the command
will throw an error if conflicting rows exist in the default partition and will
add the partition successfully otherwise.
I will post an updated patch by tomorrow.

Thank you,
Rahila Syed

On Mon, Mar 13, 2017 at 5:42 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Fri, Mar 10, 2017 at 2:17 PM, Peter Eisentraut
> <peter.eisentr...@2ndquadrant.com> wrote:
> > On 3/2/17 21:40, Robert Haas wrote:
> >> On the point mentioned above, I
> >> don't think adding a partition should move tuples, necessarily; seems
> >> like it would be good enough - maybe better - for it to fail if there
> >> are any that would need to be moved.
> >
> > ISTM that the uses cases of various combinations of adding a default
> > partition, adding another partition after it, removing the default
> > partition, clearing out the default partition in order to add more
> > nondefault partitions, and so on, need to be more clearly spelled out
> > for each partitioning type.  We also need to consider that pg_dump and
> > pg_upgrade need to be able to reproduce all those states.  Seems to be a
> > bit of work still ...
>
> This patch is only targeting list partitioning.   It is not entirely
> clear that the concept makes sense for range partitioning; you can
> already define a partition from the end of the last partitioning up to
> infinity, or from minus-infinity up to the starting point of the first
> partition.  The only thing a default range partition would do is let
> you do is have a single partition cover all of the ranges that would
> otherwise be unassigned, which might not even be something we want.
>
> I don't know how complete the patch is, but the specification seems
> clear enough.  If you have partitions for 1, 3, and 5, you get
> partition constraints of (a = 1), (a = 3), and (a = 5).  If you add a
> default partition, you get a constraint of (a != 1 and a != 3 and a !=
> 5).  If you then add a partition for 7, the default partition's
> constraint becomes (a != 1 and a != 3 and a != 5 and a != 7).  The
> partition must be revalidated at that point for conflicting rows,
> which we can either try to move to the new partition, or just error
> out if there are any, depending on what we decide we want to do.  I
> don't think any of that requires any special handling for either
> pg_dump or pg_upgrade; it all just falls out of getting the
> partitioning constraints correct and consistently enforcing them, just
> as for any other partition.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


default_partition_v2.patch
Description: application/download



Re: [HACKERS] wait events for disk I/O

2017-03-16 Thread Rahila Syed
Thank you for the updated patch.

I have applied and tested it on the latest sources, and the patch looks good
to me.

>I am not quite sure about this, as this is for stat statements. Also part
from the
>place you found there are many other fwrite() call into
pg_stat_statements, and
>I intentionally haven't added event here as it is very small write about
stat, and
>doesn't look like we should add for those call.
I agree that this writes only a small amount of data. Still, tracking it could
be useful in scenarios where pg_stat_statements is lagging due to an I/O
bottleneck. I will leave this decision to the committer.
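
Whichever individual calls end up instrumented, the new events would surface
through pg_stat_activity like any other wait event; for example (a standard
monitoring query, shown only for illustration):

-- Show backends currently waiting, with the wait event being reported.
SELECT pid, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;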

Thank you,
Rahila Syed

On Wed, Mar 15, 2017 at 1:03 PM, Rushabh Lathia <rushabh.lat...@gmail.com>
wrote:

> Thanks Rahila for reviewing this patch.
>
> On Tue, Mar 14, 2017 at 8:13 PM, Rahila Syed <rahilasye...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I applied and tested this patch on latest sources and it works fine.
>>
>> Following are some comments,
>>
>> >+   /* Wait event for SNRU */
>> >+   WAIT_EVENT_READ_SLRU_PAGE,
>> Typo in the comment.
>>
>>
> Fixed.
>
>
>> >FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush,
>> WAIT_EVENT_FLUSH_DATA_BLOCK);
>> This call is inside mdwriteback() which can flush more than one block so
>> should  WAIT_EVENT _FLUSH_DATA_BLOCK
>> be renamed to WAIT_EVENT_FLUSH_DATA_BLOCKS?
>>
>>
> Changed with WAIT_EVENT_FLUSH_DATA_BLOCKS.
>
>
>> Should calls to write() in following functions be tracked too?
>>  qtext_store()  - This is related to pg_stat_statements
>>
>>
>
> I am not quite sure about this, as this is for stat statements. Also part
> from the
> place you found there are many other fwrite() call into
> pg_stat_statements, and
> I intentionally haven't added event here as it is very small write about
> stat, and
> doesn't look like we should add for those call.
>
>
>
>> dsm_impl_mmap() - This is in relation to creating dsm segments.
>>
>>
> Added new event here. Actually particular write call is zero-filling the
> DSM file.
>
>
>> write_auto_conf_file()-  This is called when updated configuration
>> parameters are
>>  written to a temp file.
>>
>>
> write_auto_conf_file() is getting called during the ALTER SYSTEM call.
> Here write
> happen only when someone explicitly run the ALTER SYSTEM call. This is
> administrator call and so doesn't seem like necessary to add separate wait
> event
> for this.
>
> PFA latest patch with other fixes.
>
>
>>
>> On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lat...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmh...@gmail.com>
>>> wrote:
>>>
>>>> On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapil...@gmail.com>
>>>> wrote:
>>>> > On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmh...@gmail.com>
>>>> wrote:
>>>> >> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapil...@gmail.com>
>>>> wrote:
>>>> >>> Sure, if you think both Writes and Reads at OS level can have some
>>>> >>> chance of blocking in obscure cases, then we should add a wait event
>>>> >>> for them.
>>>> >>
>>>> >> I think writes have a chance of blocking in cases even in cases that
>>>> >> are not very obscure at all.
>>>> >
>>>> > Point taken for writes, but I think in general we should have some
>>>> > criteria based on which we can decide whether to have a wait event for
>>>> > a particular call. It should not happen that we have tons of wait
>>>> > events and out of which, only a few are helpful in most of the cases
>>>> > in real-world scenarios.
>>>>
>>>> Well, the problem is that if you pick and choose which wait events to
>>>> add based on what you think will be common, you're actually kind of
>>>> hosing yourself. Because now when something uncommon happens, suddenly
>>>> you don't get any wait event data and you can't tell what's happening.
>>>> I think the number of new wait events added by Rushabh's patch is
>>>> wholly reasonable.  Yeah, some of those are going to be a lot more
>>>> common than others, but so what?  We add wait events so that we can
>>>> find out what's going on.  I don't want to sometimes know when a
>>>

Re: [HACKERS] wait events for disk I/O

2017-03-14 Thread Rahila Syed
Hello,

I applied and tested this patch on the latest sources and it works fine.

Following are some comments:

>+   /* Wait event for SNRU */
>+   WAIT_EVENT_READ_SLRU_PAGE,
Typo in the comment.

>FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush,
WAIT_EVENT_FLUSH_DATA_BLOCK);
This call is inside mdwriteback() which can flush more than one block so
should  WAIT_EVENT _FLUSH_DATA_BLOCK
be renamed to WAIT_EVENT_FLUSH_DATA_BLOCKS?

Should calls to write() in following functions be tracked too?
 qtext_store()  - This is related to pg_stat_statements

dsm_impl_mmap() - This is in relation to creating dsm segments.

write_auto_conf_file()-  This is called when updated configuration
parameters are
 written to a temp file.

Thank you,
Rahila Syed

On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lat...@gmail.com>
wrote:

>
>
> On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmh...@gmail.com> wrote:
>
>> On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>> > On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmh...@gmail.com>
>> wrote:
>> >> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>> >>> Sure, if you think both Writes and Reads at OS level can have some
>> >>> chance of blocking in obscure cases, then we should add a wait event
>> >>> for them.
>> >>
>> >> I think writes have a chance of blocking in cases even in cases that
>> >> are not very obscure at all.
>> >
>> > Point taken for writes, but I think in general we should have some
>> > criteria based on which we can decide whether to have a wait event for
>> > a particular call. It should not happen that we have tons of wait
>> > events and out of which, only a few are helpful in most of the cases
>> > in real-world scenarios.
>>
>> Well, the problem is that if you pick and choose which wait events to
>> add based on what you think will be common, you're actually kind of
>> hosing yourself. Because now when something uncommon happens, suddenly
>> you don't get any wait event data and you can't tell what's happening.
>> I think the number of new wait events added by Rushabh's patch is
>> wholly reasonable.  Yeah, some of those are going to be a lot more
>> common than others, but so what?  We add wait events so that we can
>> find out what's going on.  I don't want to sometimes know when a
>> backend is blocked on an I/O.  I want to ALWAYS know.
>>
>>
> Yes, I agree with Robert. Knowing what we want and what we don't
> want is difficult to judge. Something which we might think its not useful
> information, and later of end up into situation where we re-think about
> adding those missing stuff is not good. Having more information about
> the system, specially for monitoring purpose is always good.
>
> I am attaching  another version of the patch, as I found stupid mistake
> in the earlier version of patch, where I missed to initialize initial
> value to
> WaitEventIO enum. Also earlier version was not getting cleanly apply on
> the current version of sources.
>
>
>
> --
> Rushabh Lathia
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>


Re: [HACKERS] Adding support for Default partition in partitioning

2017-03-10 Thread Rahila Syed
 >I agree that adding a new partition should not move any data out of the
default. It's easy enough to set up a monitor to watch for data existing in
the >default. Perhaps also adding a column to pg_partitioned_table that
contains the oid of the default partition so it's easier to identify from a
system >catalog perspective and make that monitoring easier.

Won't it incur the overhead of scanning the default partition for matching rows
each time a SELECT happens on any matching partition? This extra scan would be
required as long as rows satisfying the newly added partition's bound remain in
the default partition.

>I don't even see a need for it to fail either and not quite sure how that
would even work? If they can't add a necessary child due to data being in
the >default, how can they ever get it out? Just leave it to the user to
keep an eye on the default and fix it as necessary.
This patch intends to provide a way to insert data that does not satisfy any of
the existing partitions. For this patch, we can disallow adding a new partition
when the default partition contains conflicting rows. There can be many
solutions to the problem of adding a new partition: one is to move the relevant
tuples from the default partition to the new partition; another, as you suggest,
is to keep monitoring the default partition until the user moves the rows out
of it.
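
As a sketch of the disallow-on-conflict option (hypothetical output; the syntax
follows the patch in this thread and the error wording is made up):

CREATE TABLE list_t (a int) PARTITION BY LIST (a);
CREATE TABLE list_t_def PARTITION OF list_t FOR VALUES IN (DEFAULT);
INSERT INTO list_t VALUES (4);   -- only the default partition accepts it today
-- Adding a partition for 4 conflicts with the row already in the default
-- partition, so under this proposal the command would error out:
CREATE TABLE list_t_4 PARTITION OF list_t FOR VALUES IN (4);
-- ERROR:  new partition conflicts with rows in the default partition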

Thank you,
Rahila Syed

On Tue, Mar 7, 2017 at 10:00 PM, Keith Fiske <ke...@omniti.com> wrote:

> On Thu, Mar 2, 2017 at 9:40 PM, Robert Haas <robertmh...@gmail.com> wrote:
>
>> On Wed, Mar 1, 2017 at 6:29 AM, Rahila Syed <rahilasye...@gmail.com>
>> wrote:
>> > 3. Handling adding a new partition to a partitioned table
>> >with default partition.
>> >This will require moving tuples from existing default partition to
>> >   newly created partition if they satisfy its partition bound.
>>
>> Considering that this patch was submitted at the last minute and isn't
>> even complete, I can't see this getting into v10.  But that doesn't
>> mean we can't talk about it.  I'm curious to hear other opinions on
>> whether we should have this feature.  On the point mentioned above, I
>> don't think adding a partition should move tuples, necessarily; seems
>> like it would be good enough - maybe better - for it to fail if there
>> are any that would need to be moved.
>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>>
>> --
>> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-hackers
>>
>
> I'm all for this feature and had suggested it back in the original thread
> to add partitioning to 10. I agree that adding a new partition should not
> move any data out of the default. It's easy enough to set up a monitor to
> watch for data existing in the default. Perhaps also adding a column to
> pg_partitioned_table that contains the oid of the default partition so it's
> easier to identify from a system catalog perspective and make that
> monitoring easier. I don't even see a need for it to fail either and not
> quite sure how that would even work? If they can't add a necessary child
> due to data being in the default, how can they ever get it out? Just leave
> it to the user to keep an eye on the default and fix it as necessary. This
> is what I do in pg_partman.
>
> --
> Keith Fiske
> Database Administrator
> OmniTI Computer Consulting, Inc.
> http://www.keithf4.com
>


[HACKERS] Adding support for Default partition in partitioning

2017-02-28 Thread Rahila Syed
Hello,

Currently, inserting data that does not fit into any of a partitioned table's
partitions is not allowed.

The attached patch provides a capability to add a default partition to a
list
partitioned table as follows.

postgres=# CREATE TABLE list_partitioned (
a int
) PARTITION BY LIST (a);
CREATE TABLE

postgres=# CREATE TABLE part_default PARTITION OF list_partitioned FOR
VALUES IN (DEFAULT);
CREATE TABLE

postgres=# CREATE TABLE part_1 PARTITION OF list_partitioned FOR VALUES IN
(4,5);
CREATE TABLE

postgres=# insert into list_partitioned values (9);
INSERT 0 1

postgres=# select * from part_default;
 a
---
 9
(1 row)

The attached patch is in a preliminary stage and has the following ToDos:
1. Adding pg_dump support.
2. Documentation.
3. Handling the addition of a new partition to a partitioned table
   with a default partition.
   This will require moving tuples from the existing default partition to the
   newly created partition if they satisfy its partition bound.
4. Handling an update of the partition key in a default partition. As per the
current design it should throw an error if the update requires the tuple to
be moved to any other partition, but this can be changed by the following
proposal.

https://www.postgresql.org/message-id/CAJ3gD9do9o2ccQ7j7+tSgiE1REY65XRiMb=
yjo3u3qhyp8e...@mail.gmail.com

I am adding it to the current commitfest with the status Waiting on Author,
as I will submit an updated patch addressing the above ToDos.
Kindly give your suggestions.

Thank you,
Rahila Syed


default_list_partition_v1.patch
Description: application/download



Re: [HACKERS] Parallel Index-only scan

2017-02-15 Thread Rahila Syed
I reviewed the patch. Overall it looks fine to me.

One comment:

>-   if (index->amcanparallel &&
>-   !index_only_scan &&
>+   if ((index->amcanparallel ||
>+   index_only_scan) &&

Why do we need to check index_only_scan in the above condition? IIUC,
index->amcanparallel is necessary for a parallel index scan to be chosen, and we
cannot choose a parallel index-only scan just because index_only_scan is true
without also considering the value of index->amcanparallel. So OR-ing
index->amcanparallel with index_only_scan is probably not correct.

Thank you,
Rahila Syed




On Thu, Feb 16, 2017 at 1:06 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Thu, Jan 19, 2017 at 7:07 AM, Rafia Sabih
> <rafia.sa...@enterprisedb.com> wrote:
> > Please find the attached file rebased patch of parallel index-only
> > scan on the latest Parallel index-scan patch [1].
>
> This again needs minor rebasing but basically looks fine.  It's a
> pretty straightforward extension of the parallel index scan work.
>
> Please make sure that this is pgindent-clean - i.e. that when you
> pgindent the files that it touches, pgindent doesn't change anything
> of the same parts of the file that you've changed in the patch.  Also,
> I believe Amit may have made some adjustments to the logic in
> nodeIndexScan.c; if so, it would be good to make sure that the
> nodeIndexOnlyScan.c changes match what was done there.  In particular,
> he's got this:
>
> if (reset_parallel_scan && node->iss_ScanDesc->parallel_
> scan)
> index_parallelrescan(node->iss_ScanDesc);
>
> And you've got this:
>
> +   if (reset_parallel_scan)
> +   index_parallelrescan(node->ioss_ScanDesc);
>
> There might be some other inconsistencies as well that I didn't notice
> on a quick look.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] Parallel Index Scans

2017-01-31 Thread Rahila Syed
Hello Robert,

>I am a bit mystified about how this manages to work with array keys.
>_bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE
>unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but
>_bt_parallel_advance_scan() won't do anything unless btps_pageStatus
>is already BTPARALLEL_DONE.  It seems like one of those two things has
>to be wrong.

btps_pageStatus is set to BTPARALLEL_DONE only by the first worker that is
performing the scan for the latest array key and that has encountered the end
of the scan. This is ensured by the following check in _bt_parallel_done():

   if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
   btscan->btps_pageStatus != BTPARALLEL_DONE)

Thus, BTPARALLEL_DONE marks the end of the scan only for the latest array keys.
This ensures that when any worker reaches _bt_advance_array_keys(), it advances
the shared scan, which is marked by btscan->btps_arrayKeyCount, only when the
scan for the latest array key has ended; that is detected by the
if (btps_pageStatus == BTPARALLEL_DONE) check in _bt_advance_array_keys().
Otherwise, the worker just advances its local so->arrayKeyCount.

I hope this provides some clarification.

Thank you,
Rahila Syed





On Wed, Feb 1, 2017 at 3:52 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Tue, Jan 24, 2017 at 4:51 PM, Robert Haas <robertmh...@gmail.com>
> wrote:
> > On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
> >> In spite of being careful, I missed reorganizing the functions in
> >> genam.h which I have done in attached patch.
> >
> > Cool.  Committed parallel-generic-index-scan.2.patch.  Thanks.
>
> Reviewing parallel_index_scan_v6.patch:
>
> I think it might be better to swap the return value and the out
> parameter for _bt_parallel_seize; that is, return a bool, and have
> callers ignore the value of the out parameter (e.g. *pageno).
>
> I think _bt_parallel_advance_scan should be renamed something that
> includes the word "keys", like _bt_parallel_advance_array_keys.
>
> The hunk in indexam.c looks like a leftover that can be omitted.
>
> +/*
> + * Below flags are used to indicate the state of parallel scan.
>
> They aren't really flags any more; they're members of an enum.  I
> think you could just leave this sentence out entirely and start right
> in with descriptions of the individual values.  But maybe all of those
> descriptions should end in a period (instead of one ending in a period
> but not the other three) since they read like complete sentences.
>
> + * btinitparallelscan - Initializing BTParallelScanDesc for parallel
> btree scan
>
> Initializing -> initialize
>
> + *  btparallelrescan() -- reset parallel scan
>
> Previous two prototypes have one dash, this one has two.  Make it
> consistent, please.
>
> + * Ideally, we don't need to acquire spinlock here, but being
> + * consistent with heap_rescan seems to be a good idea.
>
> How about: In theory, we don't need to acquire the spinlock here,
> because there shouldn't be any other workers running at this point,
> but we do so for consistency.
>
> + * _bt_parallel_seize() -- returns the next block to be scanned for
> forward
> + *  scans and latest block scanned for backward scans.
>
> I think the description should be more like "Begin the process of
> advancing the scan to a new page.  Other scans must wait until we call
> bt_parallel_release() or bt_parallel_done()."  And likewise
> _bt_parallel_release() should say something like "Complete the process
> of advancing the scan to a new page.  We now have the new value for
> btps_scanPage; some other backend can now begin advancing the scan."
> And _bt_parallel_done should say something like "Mark the parallel
> scan as complete."
>
> I am a bit mystified about how this manages to work with array keys.
> _bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE
> unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but
> _bt_parallel_advance_scan() won't do anything unless btps_pageStatus
> is already BTPARALLEL_DONE.  It seems like one of those two things has
> to be wrong.
>
> _bt_readpage's header comment should be updated to note that, in the
> case of a parallel scan, _bt_parallel_seize should have been called
> prior to entering this function, and _bt_parallel_release will be
> called prior to return (or this could alternatively be phrased in
> terms of btps_pageStatus on entry/exit).
>
> _bt_readnextpage isn't very clear about the meaning of its blkno
> argument.  It looks like it always has to be valid when scanning
> forward, but only in the parallel case when scanning backward?  That
> is a little odd.  Another, possibly related 

Re: [HACKERS] Parallel Index Scans

2017-01-31 Thread Rahila Syed
Hello,

>Agreed, that it makes sense to consider only the number of pages to
>scan for computation of parallel workers.  I think for index scan we
>should consider both index and heap pages that need to be scanned
>(costing of index scan consider both index and heap pages).   I thin
>where considering heap pages matter more is when the finally selected
>rows are scattered across heap pages or we need to apply a filter on
>rows after fetching from the heap.  OTOH, we can consider just pages
>in the index as that is where mainly the parallelism works
IMO, considering just the index pages will give a better estimate of the work
to be done in parallel, since the number of pages divided among the workers
does not depend on the number of heap pages scanned.
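
For a feel of the difference between the two measures, the page counts can be
compared directly in pg_class (tenk1 and tenk1_unique1 are just the
regression-test table and index referenced elsewhere in this thread):

-- Compare heap pages vs. index pages; the worker-count heuristic under
-- discussion is driven by such page counts.
SELECT relname, relkind, relpages
FROM pg_class
WHERE relname IN ('tenk1', 'tenk1_unique1');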

Thank you,
Rahila Syed




On Mon, Jan 30, 2017 at 2:52 PM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Sat, Jan 28, 2017 at 1:34 AM, Robert Haas <robertmh...@gmail.com>
> wrote:
> > On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
> >> parallel_index_opt_exec_support_v6 - Removed the usage of
> >> GatherSupportsBackwardScan.  Expanded the comments in
> >> ExecReScanIndexScan.
> >
> > I looked through this and in general it looks reasonable to me.
> > However, I did notice one thing that I think is wrong.  In the
> > parallel bitmap heap scan patch, the second argument to
> > compute_parallel_worker() is the number of pages that the parallel
> > scan is expected to fetch from the heap.  In this patch, it's the
> > total number of pages in the index.  The former seems to me to be
> > better, because the point of having a threshold relation size for
> > parallelism is that we don't want to use a lot of workers to scan a
> > small number of pages -- the distribution of work won't be even, and
> > the potential savings are limited.  If we've got a big index but are
> > using a very selective qual to pull out only one or a small number of
> > rows on a single page or a small handful of pages, we shouldn't
> > generate a parallel path for that.
> >
>
> Agreed, that it makes sense to consider only the number of pages to
> scan for computation of parallel workers.  I think for index scan we
> should consider both index and heap pages that need to be scanned
> (costing of index scan consider both index and heap pages).   I thin
> where considering heap pages matter more is when the finally selected
> rows are scattered across heap pages or we need to apply a filter on
> rows after fetching from the heap.  OTOH, we can consider just pages
> in the index as that is where mainly the parallelism works.  In the
> attached patch (parallel_index_opt_exec_support_v7.patch), I have
> considered both index and heap pages, let me know if you think some
> other way is better.  I have also prepared a separate independent
> patch (compute_index_pages_v1) on HEAD to compute index pages which
> can be used by parallel index scan. There is no change in
> parallel_index_scan (parallel btree scan) patch, so I am not attaching
> its new version.
>
> > Now, against that theory, the GUC that controls the behavior of
> > compute_parallel_worker() is called min_parallel_relation_size, which
> > might make you think that the decision is supposed to be based on the
> > whole size of some relation.  But I think that just means we need to
> > rename the GUC to something like min_parallel_scan_size.  Possibly we
> > also ought consider reducing the default value somewhat, because it
> > seems like both sequential and index scans can benefit even when
> > scanning less than 8MB.
> >
>
> Agreed, but let's consider it separately.
>
>
> The order in which patches needs to be applied:
> compute_index_pages_v1.patch, parallel_index_scan_v6.patch[1],
> parallel_index_opt_exec_support_v7.patch
>
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1J%
> 3DLSBpDx7i_izGJxGVUryqPe-2SKT02De-PrQvywiMxw%40mail.gmail.com
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>


Re: [HACKERS] Improvements in psql hooks for variables

2017-01-29 Thread Rahila Syed
Hello,

Please consider the following comments on the patch.

In function ParseVariableNum,

> if (!val || !val[0])
>   return false;

The check for val == NULL, as in the above condition, is also done in the
callers of ParseVariableNum(). There should be only one such check.

>+   psql_error("Invalid value \"%s\" for \"%s\"\nAn integer is
expected\n",
Can't we emit this error in ParseVariableNum(), similar to
ParseVariableBool()?

>+   /*
>+* Switching AUTOCOMMIT from OFF to ON is not allowed when inside a
>+* transaction, because it cannot be effective until the current
>+* transaction is ended.
>+*/
>+   if (autocommit && !pset.autocommit &&
>+   pset.db && PQtransactionStatus(pset.db) == PQTRANS_INTRANS)
>+   {
>+   psql_error("cannot set AUTOCOMMIT to %s when inside a
transaction\n", newval);
>+   }
The above change in autocommit behaviour needs to be documented.
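
For instance, a quick psql session sketch of the behaviour introduced by the
hunk above (output is hypothetical and assumes the patch is applied, with a
pre-existing table t):

postgres=# \set AUTOCOMMIT off
postgres=# INSERT INTO t VALUES (1);   -- implicitly opens a transaction
INSERT 0 1
postgres=# \set AUTOCOMMIT on
cannot set AUTOCOMMIT to on when inside a transaction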


Thank you,
Rahila Syed




On Wed, Jan 25, 2017 at 5:48 PM, Rahila Syed <rahilasye...@gmail.com> wrote:

> Hello,
>
> The patch works fine on applying on latest master branch and testing it
> for various variables as listed in PsqlSettings struct.
> I will post further comments on patch soon.
>
> Thank you,
> Rahila Syed
>
> On Wed, Jan 25, 2017 at 1:33 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>
>> "Daniel Verite" <dan...@manitou-mail.org> writes:
>> > Here's an update with these changes:
>>
>> I scanned this patch very quickly and think it addresses my previous
>> stylistic objections.  Rahila is the reviewer of record though, so
>> I'll wait for his comments before going further.
>>
>> regards, tom lane
>>
>
>


Re: [HACKERS] Improvements in psql hooks for variables

2017-01-25 Thread Rahila Syed
Hello,

The patch applies cleanly on the latest master branch and works fine when
tested with the various variables listed in the PsqlSettings struct.
I will post further comments on the patch soon.

Thank you,
Rahila Syed

On Wed, Jan 25, 2017 at 1:33 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> "Daniel Verite" <dan...@manitou-mail.org> writes:
> > Here's an update with these changes:
>
> I scanned this patch very quickly and think it addresses my previous
> stylistic objections.  Rahila is the reviewer of record though, so
> I'll wait for his comments before going further.
>
> regards, tom lane
>


Re: [HACKERS] Parallel Index Scans

2017-01-17 Thread Rahila Syed
>+ /* Check if the scan for current scan keys is finished */
>+ if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
>+ *status = false;

>I didn't clearly understand, in which scenario the arrayKeyCount is less
>than btps_arrayKeyCount?
Consider the following array scan keys:

select * from test2 where j=ANY(ARRAY[1000,2000,3000]);

By the time the current worker has finished reading the heap tuples
corresponding to array key 1000 (arrayKeyCount = 0), some other worker might
have advanced the scan to array key 2000 (btps_arrayKeyCount = 1). In this
case, when the current worker fetches the next page to scan, it must advance
its scan keys before scanning the next page of the parallel scan.
I hope this helps.

>+BlockNumber
>+_bt_parallel_seize(IndexScanDesc scan, bool *status)

>The return value of the above function is validated only in _bt_first
>function, but in other cases it is not.
In the other cases it is validated in _bt_readnextpage(), which is called after
_bt_parallel_seize().

>From the function description,
>it is possible to return P_NONE for the workers also with status as
>true. I feel it is better to handle the P_NONE case internally only
>so that callers just check for the status. Am i missing anything?

In case the next block is P_NONE and status is true, the code calls
_bt_parallel_done() to notify the other workers, followed by
BTScanPosInvalidate(). A similar check for blkno == P_NONE also happens in the
existing code; see the following in _bt_readnextpage():


if (blkno == P_NONE || !so->currPos.moreRight)
{
    _bt_parallel_done(scan);
    BTScanPosInvalidate(so->currPos);
    return false;
}
So, to keep it consistent with the existing code, the check is kept outside
_bt_parallel_seize().

Thank you,
Rahila Syed


On Wed, Jan 18, 2017 at 6:25 AM, Haribabu Kommi <kommi.harib...@gmail.com>
wrote:

>
>
> On Mon, Jan 16, 2017 at 11:11 PM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
>
>>
>> Changed as per suggestion.
>>
>>
>> I have also rebased the optimizer/executor support patch
>> (parallel_index_opt_exec_support_v4.patch) and added a test case in
>> it.
>
>
> Thanks for the patch. Here are comments found during review.
>
> parallel_index_scan_v4.patch:
>
>
> + amtarget = (char *) ((void *) target) + offset;
>
> The above calcuation can be moved after NULL check?
>
> + * index_beginscan_parallel - join parallel index scan
>
> The name and the description doesn't sync properly, any better description?
>
> + BTPARALLEL_DONE,
> + BTPARALLEL_IDLE
> +} PS_State;
>
> The order of above two enum values can be changed according to their use.
>
> + /* Check if the scan for current scan keys is finished */
> + if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
> + *status = false;
>
> I didn't clearly understand, in which scenario the arrayKeyCount is less
> than btps_arrayKeyCount?
>
>
> +BlockNumber
> +_bt_parallel_seize(IndexScanDesc scan, bool *status)
>
> The return value of the above function is validated only in _bt_first
> function, but in other cases it is not. From the function description,
> it is possible to return P_NONE for the workers also with status as
> true. I feel it is better to handle the P_NONE case internally only
> so that callers just check for the status. Am i missing anything?
>
>
> +extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool *status);
> +extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber
> scan_page);
>
> Any better names for the above functions, as these function will
> provide/set
> the next page that needs to be read.
>
>
> parallel_index_opt_exec_support_v4.patch:
>
> +#include "access/parallel.h"
>
> Why is it required to be include nbtree.c? i didn't find
> any code changes in the patch.
>
>
> + /* reset (parallel) index scan */
> + if (node->iss_ScanDesc)
> + {
>
> Why this if check required? There is an assert check in later function
> calls.
>
>
> Regards,
> Hari Babu
> Fujitsu Australia
>


Re: [HACKERS] Parallel Index-only scan

2017-01-12 Thread Rahila Syed
Hello,

On applying the patch on the latest master branch and running the regression
tests, the following failure occurs.
I applied it on top of the latest parallel index scan patches as given in the
link above.

***
/home/rahila/postgres/postgres/src/test/regress/expected/select_parallel.out
2017-01-03 14:06:29.122022780 +0530
---
/home/rahila/postgres/postgres/src/test/regress/results/select_parallel.out
2017-01-12 14:35:56.652712622 +0530
***
*** 92,103 
  explain (costs off)
select  sum(parallel_restricted(unique1)) from tenk1
group by(parallel_restricted(unique1));
!  QUERY PLAN
! 
   HashAggregate
 Group Key: parallel_restricted(unique1)
!->  Index Only Scan using tenk1_unique1 on tenk1
! (3 rows)

  set force_parallel_mode=1;
  explain (costs off)
--- 92,105 
  explain (costs off)
select  sum(parallel_restricted(unique1)) from tenk1
group by(parallel_restricted(unique1));
! QUERY PLAN
! ---
   HashAggregate
 Group Key: parallel_restricted(unique1)
!->  Gather
!  Workers Planned: 4
!  ->  Parallel Index Only Scan using tenk1_unique1 on tenk1
! (5 rows)

IIUC, the parallel operation being performed here is fine, as the
parallel-restricted function occurs above the Gather node, so just the expected
output needs to be changed.

Thank you,
Rahila Syed


regression.diffs
Description: Binary data



Re: [HACKERS] Assignment of valid collation for SET operations on queries with UNKNOWN types.

2017-01-03 Thread Rahila Syed
Thank you all for inputs.
Kindly help me clarify the scope of the patch.

>However, I thought the idea was to silently coerce affected columns from
>unknown to text.  This doesn't look like the behavior we want:

This patch prevents the creation of relations with columns of unknown type
and, in addition, handles the particular case of CREATE VIEW with literal
columns by coercing unknown to text in that case only.
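
A sketch of the behaviour the current patch aims for (names are made up and the
output is hypothetical, assuming the patch is applied):

-- CREATE VIEW: the unknown literal column is silently coerced to text.
CREATE VIEW v_lit AS SELECT 'abc' AS a;    -- column a becomes text
-- Other relations: creating a column of unknown type is rejected.
CREATE TABLE t_lit AS SELECT 'abc' AS a;   -- ERROR: column "a" has type "unknown"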

Are you suggesting extending the patch to include coercing from unknown to
text for all possible cases where a column of unknown type is being created?

Thank you,
Rahila Syed


On Fri, Dec 30, 2016 at 7:21 PM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> On Fri, Dec 30, 2016 at 1:30 PM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
> > On Thu, Dec 29, 2016 at 8:18 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> >> Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> writes:
> >>> The way this patch has been written, it doesn't allow creating tables
> >>> with unknown type columns, which was allowed earlier.
> >>
> >> Yes, that's an intentional change; creating such tables (or views) has
> >> never been anything but a foot-gun.
> >>
> >> However, I thought the idea was to silently coerce affected columns from
> >> unknown to text.  This doesn't look like the behavior we want:
> >
> > Do you mean to say that when creating a table with a column of unknown
> > type, that column type should be silently converted (there's nothing
> > to coerce when the table is being created) to text? instead of
> > throwing an error?
>
> FWIW that's what I understood: the patch should switch unknown columns
> to text. A bunch of side effects when converting types are avoided
> this way.
> --
> Michael
>


Re: [HACKERS] Parallel Index Scans

2016-12-23 Thread Rahila Syed
>> 5. Comment for _bt_parallel_seize() says:
>> "False indicates that we have reached the end of scan for
>>  current scankeys and for that we return block number as P_NONE."
>>
>>  What is the reason to check (blkno == P_NONE) after checking (status ==
false)
>>  in _bt_first() (see code below)? If comment is correct
>>  we'll never reach _bt_parallel_done()
>>
>> +   blkno = _bt_parallel_seize(scan, );
>> +   if (status == false)
>> +   {
>> +   BTScanPosInvalidate(so->currPos);
>> +   return false;
>> +   }
>> +   else if (blkno == P_NONE)
>> +   {
>> +   _bt_parallel_done(scan);
>> +   BTScanPosInvalidate(so->currPos);
>> +   return false;
>> +   }
>>
>The first time master backend or worker hits last page (calls this
>API), it will return P_NONE and after that any worker tries to fetch
>next page, it will return status as false.  I think we can expand a
>comment to explain it clearly.  Let me know, if you need more
>clarification, I can explain it in detail.

Probably this was confusing because we have not mentioned
that P_NONE can be returned even when status = TRUE and
not just when status is false.

I think, the comment above the function can be modified as follows,

+ /*
+ * True indicates that the block number returned is either valid including
P_NONE
+ * and scan is continued or block number is invalid and scan has just
+ * begun.

Thank you,
Rahila Syed


On Thu, Dec 22, 2016 at 9:49 AM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Wed, Dec 21, 2016 at 8:46 PM, Anastasia Lubennikova
> <lubennikov...@gmail.com> wrote:
> > The following review has been posted through the commitfest application:
> > make installcheck-world:  tested, passed
> > Implements feature:   tested, passed
> > Spec compliant:   tested, passed
> > Documentation:tested, passed
> >
> > Hi, thank you for the patch.
> > Results are very promising. Do you see any drawbacks of this feature or
> something that requires more testing?
> >
>
> I think you can focus on the handling of array scan keys for testing.
> In general, one of my colleagues has shown interest in testing this
> patch and I think he has tested as well but never posted his findings.
> I will request him to share his findings and what kind of tests he has
> done, if any.
>
> > I'm willing to oo a review.
>
> Thanks, that will be helpful.
>
>
> > I saw the discussion about parameters in the thread above. And I agree
> that we'd better concentrate
> > on the patch itself and add them later if necessary.
> >
> > 1. Can't we simply use "if (scan->parallel_scan != NULL)" instead of
> xs_temp_snap flag?
> >
> > +   if (scan->xs_temp_snap)
> > +   UnregisterSnapshot(scan->xs_snapshot);
> >
>
> I agree with what Rober has told in his reply.  We do same way for
> heap, refer heap_endscan().
>
> > I must say that I'm quite new with all this parallel stuff. If you give
> me a link,
> > where to read about snapshots for parallel workers, my review will be
> more helpful.
> >
>
> You can read transam/README.parallel.  Refer "State Sharing" portion
> of README to learn more about it.
>
> > Anyway, it would be great to have more comments about it in the code.
> >
>
> We are sharing snapshot to ensure that reads in both master backend
> and worker backend can use the same snapshot.  There is no harm in
> adding comments, but I think it is better to be consistent with
> similar heapam code.  After reading README.parallel, if you still feel
> that we should add more comments in the code, then we can definitely
> do that.
>
> > 2. Don't you mind to rename 'amestimateparallelscan' to let's say
> 'amparallelscan_spacerequired'
> > or something like this?
>
> Sure, I am open to other names, but IMHO, lets keep "estimate" in the
> name to keep it consistent with other parallel stuff. Refer
> execParallel.c to see how widely this word is used.
>
> > As far as I understand there is nothing to estimate, we know this size
> > for sure. I guess that you've chosen this name because of
> 'heap_parallelscan_estimate'.
> > But now it looks similar to 'amestimate' which refers to indexscan cost
> for optimizer.
> > That leads to the next question.
> >
>
> Do you mean 'amcostestimate'?  If you want we can rename it
> amparallelscanestimate to be consistent wi

Re: [HACKERS] Assignment of valid collation for SET operations on queries with UNKNOWN types.

2016-12-14 Thread Rahila Syed
Hello,

Thank you for comments.

>There is a similar code pattern for materialized views, see
>create_ctas_nodata() where the attribute list is built
create_ctas_nodata() is for the creation of materialized views WITH NO DATA.
For other materialized views and CREATE TABLE AS, column definitions are built
in the intorel_startup() function, whose code differs from that of CREATE VIEW,
which is what the patch deals with.

I am limiting the scope of the patch to changing the type of literal constants
to text only for plain views. Also, it now errors out when a column of UNKNOWN
type is being created for other relations such as tables and materialized
views.

>And actually, shouldn't this be just a plain error?
Changed it to error in the attached patch.

>Your patch has no regression tests, surely you want some to stress
>this code path
Added regression tests in the attached patch.

Also adding this patch to CF 2017-01

Thank you,
Rahila Syed


unknown_view_column_to_text_v1.patch
Description: application/download



Re: [HACKERS] Assignment of valid collation for SET operations on queries with UNKNOWN types.

2016-12-06 Thread Rahila Syed
Hello,

>And ideally fix things so
>that the type of the view column will be resolved as text, so that you
>don't hit this condition in the first place; but there is no good that
>comes out of allowing a view to be created like this

Thank you for the suggestion. Attached is a patch which resolves columns with
literal constants as TEXT during view creation.

The following views can now be created with their literal columns resolved as
TEXT.

postgres=# create view v as select 'abc' a;
CREATE VIEW
postgres=# create view v1 as select 'def' a;
CREATE VIEW
postgres=# \d+ v1;
View "public.v1"
 Column | Type | Collation | Nullable | Default | Storage  | Description
+--+---+--+-+--+-
 a  | text |   |  | | extended |
View definition:
 SELECT 'def'::text AS a;

This allows the following queries to run successfully, which wasn't the case
earlier.

postgres=# select a from v UNION select a from v1;
  a
-
 abc
 def
(2 rows)

AND

postgres=# select * from v order by 1;
  a
-
 abc
(1 row)

Kindly give your opinion.

Thank you,
Rahila Syed.




On Thu, Nov 17, 2016 at 8:59 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Rahila Syed <rahilasye...@gmail.com> writes:
> > CASE 2:
> > postgres=# create view v as select 'abc' a;
> > 2016-11-16 15:28:48 IST WARNING:  column "a" has type "unknown"
> > 2016-11-16 15:28:48 IST DETAIL:  Proceeding with relation creation
> anyway.
> > WARNING:  column "a" has type "unknown"
> > DETAIL:  Proceeding with relation creation anyway.
> > CREATE VIEW
>
> We really ought to make that a hard error.  And ideally fix things so
> that the type of the view column will be resolved as text, so that you
> don't hit this condition in the first place; but there is no good that
> comes out of allowing a view to be created like this.
>
> > Attached WIP patch does that. Kindly let me know your opinion.
>
> This is a seriously ugly kluge that's attacking the symptom not the
> problem.  Or really, a symptom not the problem.  There are lots of
> other symptoms, for instance
>
> regression=# select * from v order by 1;
> ERROR:  failed to find conversion function from unknown to text
>
> regards, tom lane
>


unknown_view_column_to_text_convert.patch
Description: application/download



Re: [HACKERS] Improvements in psql hooks for variables

2016-12-02 Thread Rahila Syed
I applied and tested the patch on the latest master branch.

Kindly consider the following comments:

ParseVariableBool(const char *value, const char *name)
+ParseVariableBool(const char *value, const char *name, bool *valid)
 {
size_t  len;

+   if (valid)
+   *valid = true;


  psql_error("unrecognized value \"%s\" for \"%s\": boolean
expected\n",
+  value, name);
+   if (valid)
+   *valid = false;


Why do we need this check? IMO, valid should always be set to true if the value
parses correctly. There should not be an option for the caller to skip the
behaviour of setting valid to either true or false; as it stands in the current
patch, all callers of ParseVariableBool() already rely on valid being set to
true or false.

In the following examples, an incorrect error message is being displayed.
ON_ERROR_ROLLBACK is an enum and also accepts the value 'interactive', yet the
error message says a boolean is expected.

postgres=# \set ON_ERROR_ROLLBACK eretere
unrecognized value "eretere" for "ON_ERROR_ROLLBACK": boolean expected
\set: error while setting variable

Similarly for ECHO_HIDDEN, which is also an enum and accepts the value 'noexec':

  postgres=# \set ECHO_HIDDEN NULL
unrecognized value "NULL" for "ECHO_HIDDEN": boolean expected
\set: error while setting variable


+   boolnewval = ParseVariableBool(opt, "\\timing", );
+   if (success)
+   pset.timing = newval;
+   }
else
pset.timing = !pset.timing;
if (!pset.quiet)
@@ -2660,7 +2669,14 @@ do_pset(const char *param, const char *value,
printQueryOpt *popt, bool quiet)
if (value && pg_strcasecmp(value, "auto") == 0)
popt->topt.expanded = 2;
else if (value)
-   popt->topt.expanded = ParseVariableBool(value, param);
+   {
+   boolvalid;
+   boolnewval = ParseVariableBool(value, param, );
+       if (valid)

Should the same variable name (success / valid) be used in both places for
consistency?

Thank you,
Rahila Syed

On Wed, Nov 23, 2016 at 5:47 PM, Daniel Verite <dan...@manitou-mail.org>
wrote:

>I wrote:
>
> > So I went through the psql commands which don't fail on parse errors
> > for booleans
> > [...]
>
> Here's a v5 patch implementing the suggestions mentioned upthread:
> all meta-commands calling ParseVariableBool() now fail
> when the boolean argument can't be parsed successfully.
>
> Also includes a minor change to SetVariableAssignHook() that now
> returns the result of the hook it calls after installing it.
> It doesn't make any difference in psql behavior since callers
> of SetVariableAssignHook() ignore its return value, but it's
> more consistent with SetVariable().
>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>


[HACKERS] Assignment of valid collation for SET operations on queries with UNKNOWN types.

2016-11-16 Thread Rahila Syed
The following UNION of two queries with constant literals runs successfully.

CASE 1:
postgres=# SELECT 'abc' UNION SELECT 'bcd' ;
 ?column?
--
 abc
 bcd
(2 rows)

whereas when these literals are part of a view, the UNION fails.

CASE 2:
postgres=# create view v as select 'abc' a;
2016-11-16 15:28:48 IST WARNING:  column "a" has type "unknown"
2016-11-16 15:28:48 IST DETAIL:  Proceeding with relation creation anyway.
WARNING:  column "a" has type "unknown"
DETAIL:  Proceeding with relation creation anyway.
CREATE VIEW

postgres=# create view v1 as select 'bcd' a;
2016-11-16 15:28:56 IST WARNING:  column "a" has type "unknown"
2016-11-16 15:28:56 IST DETAIL:  Proceeding with relation creation anyway.
WARNING:  column "a" has type "unknown"
DETAIL:  Proceeding with relation creation anyway.
CREATE VIEW

postgres=# select a from v UNION select a from v1;
2016-11-16 15:25:28 IST ERROR:  could not determine which collation to use
for string comparison
2016-11-16 15:25:28 IST HINT:  Use the COLLATE clause to set the collation
explicitly.
2016-11-16 15:25:28 IST STATEMENT:  select a from v UNION select a from v1;
ERROR:  could not determine which collation to use for string comparison
HINT:  Use the COLLATE clause to set the collation explicitly.

When a UNION of queries with constant literals, as in CASE 1, is allowed, shouldn't
a UNION of queries with such literals wrapped in views, as in CASE 2, be allowed as
well?

In transformSetOperationTree(), while determining the result type of the merged
output columns, if both the left and right column types are UNKNOWN, the result
type is resolved to TEXT.

The difference in behaviour between the two cases arises because the result
collation assigned in CASE 2 is not valid.

When the left and right inputs are literal constants, i.e. UNKNOWN, as in CASE 1,
the collation of the result column is correctly assigned a valid value, whereas
when the left and right inputs are columns of UNKNOWN type, as in CASE 2, the
result collation is InvalidOid.

So if we ensure that a valid collation is assigned when both the left and right
columns/inputs are UNKNOWN, the above can be resolved.

Attached WIP patch does that. Kindly let me know your opinion.
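To illustrate the idea, here is a minimal sketch of the approach (this is not the
attached patch; the variable names are assumed from transformSetOperationTree()):

/* while merging the output columns of the set operation */
if (lcoltype == UNKNOWNOID && rcoltype == UNKNOWNOID)
{
    /* both sides are unresolved literals: resolve the type to text ... */
    rescoltype = TEXTOID;
    /* ... and also assign a valid collation instead of leaving InvalidOid */
    rescolcoll = DEFAULT_COLLATION_OID;
}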

Thank you,
Rahila Syed


invalid_collation_error.patch
Description: application/download



Re: [HACKERS] Improvements in psql hooks for variables

2016-11-13 Thread Rahila Syed
>ECHO_HIDDEN differs from the generic boolean case because it also
>accepts "noexec" and pset.echo_hidden is an enum, not a boolean. When
>considering refactoring echo_hidden_hook() to call
>generic_boolean_hook() instead of ParseVariableBool() after
>having established that the value is not "noexec", I don't see
>any advantage in clarity or code size, so I'm not in favor of that change

I agree that if generic_boolean_hook() is used in its current form instead of
ParseVariableBool() inside echo_hidden_hook or on_error_rollback_hook, the code
will not change much.

I was suggesting a change along the lines of extending generic_boolean_hook to
also cover enum parsing (the part of the enum hooks that calls ParseVariableBool)
and integer parsing.

>Besides, I don't think it would go well with hooks. If we wanted one
>big function that knows all about parsing all built-in variables, we
>could just as well dispense with hooks, since their current purpose in
>psql is to achieve this parsing, but in a decentralized way.
>Or if we keep them, our various built-in variables would be
>essentially tied to the same one-big-hook-that-does-all, but isn't
>that an antipattern for hooks?

I was suggesting something along the lines of having common parsing logic shared
by each hook implementation. This is similar to what ParseVariableBool does in the
existing code: it is called from the different hooks to parse a variable's boolean
value, and thus represents code that is common to the various hook
implementations.

>"hook" is changed by the patch from [pointer to function returning
>void], to [pointer to function returning bool]. The cast to void is
>not essential, it just indicates that we deliberately want to
>ignore the return value here. I expect some compilers might
>complain under a high level of warnings without this cast, although
>TBH if you ask me, I wouldn't know which compiler with which flags
>exactly.
Thank you for the explanation.

Thank you,
Rahila Syed




On Fri, Nov 11, 2016 at 2:57 AM, Daniel Verite <dan...@manitou-mail.org>
wrote:

> Rahila Syed wrote:
>
> > I have applied this patch on latest HEAD and have done basic testing
> which
> > works fine.
>
> Thanks for reviewing this patch!
>
> > >if (current->assign_hook)
> > >-   (*current->assign_hook) (current->value);
> > >-   return true;
> > >+   {
> > >+   confirmed = (*current->assign_hook) (value);
> > >+   }
> > >+   if (confirmed)
> >
> > Spurious brackets
>
> OK.
>
> > >static bool
> > >+generic_boolean_hook(const char *newval, const char* varname, bool
> *flag)
> > >+{
> >
> > Contrary to what name suggests this function does not seem to have other
> > implementations as in a hook.
> > Also this takes care of rejecting a syntactically wrong value only for
> > boolean variable hooks like autocommit_hook,
> > on_error_stop_hook. However, there are other variable hooks which call
> > ParseVariableBool.
> > For instance, echo_hidden_hook which is handled separately in the patch.
> > Thus there is some duplication of code between generic_boolean_hook and
> > echo_hidden_hook.
> > Similarly for on_error_rollback_hook.
>
> The purpose of generic_boolean_hook() is to handle the case of a
> boolean variable that only accepts ON or OFF, and has its pset.varname
> declared as bool. I thought of this case as "generic" because that's
> the base case and several variables need no more than that.
>
> ECHO_HIDDEN differs from the generic boolean case because it also
> accepts "noexec" and pset.echo_hidden is an enum, not a boolean. When
> considering refactoring echo_hidden_hook() to call
> generic_boolean_hook() instead of ParseVariableBool() after
> having established that the value is not "noexec", I don't see
> any advantage in clarity or code size, so I'm not in favor of that change.
>
> The same applies to on_error_rollback_hook(), which has to deal
> with a specific enum as well.
>
> > >-static void
> > >+static bool
> > > fetch_count_hook(const char *newval)
> > > {
> > >pset.fetch_count = ParseVariableNum(newval, -1, -1, false);
> > >+   return true;
> > > }
> >
> > Shouldn't invalid numeric string assignment for numeric variables be
> > handled too?
>
> Agreed. Assignments like "\set FETCH_COUNT bogus" don't provoke any
> user feedback currently, which is not ideal. I'll add this in a
> v3 of the patch tomorrow.
>
> > Instead of generic_boolean_hook cant we have something like follows which

Re: [HACKERS] Improvements in psql hooks for variables

2016-10-31 Thread Rahila Syed
Hello,

I have applied this patch on the latest HEAD and done some basic testing, which
works fine.

Some comments,

>if (current->assign_hook)
>-   (*current->assign_hook) (current->value);
>-   return true;
>+   {
>+   confirmed = (*current->assign_hook) (value);
>+   }
>+   if (confirmed)

Spurious brackets

>static bool
>+generic_boolean_hook(const char *newval, const char* varname, bool *flag)
>+{

Contrary to what the name suggests, this function does not seem to have other
implementations, as a hook would.
Also, it takes care of rejecting a syntactically wrong value only for boolean
variable hooks such as autocommit_hook and on_error_stop_hook. However, there are
other variable hooks which call ParseVariableBool, for instance echo_hidden_hook,
which is handled separately in the patch. Thus there is some duplication of code
between generic_boolean_hook and echo_hidden_hook, and similarly for
on_error_rollback_hook.

>-static void
>+static bool
> fetch_count_hook(const char *newval)
> {
>pset.fetch_count = ParseVariableNum(newval, -1, -1, false);
>+   return true;
> }

Shouldn't invalid numeric string assignment for numeric variables be
handled too?

Instead of generic_boolean_hook, can't we have something like the following,
which, like generic_boolean_hook, can be called from the specific variable
assignment hooks:

static bool
ParseVariable(const char *newval, const char *varname, void *result)
{
    if (strcmp(varname, "AUTOCOMMIT") == 0 ||
        strcmp(varname, "ECHO_HIDDEN") == 0
        /* ... or any other variable whose hook calls ParseVariableBool */)
        *(bool *) result = ParseVariableBool(newval, varname);
    else if (strcmp(varname, "FETCH_COUNT") == 0)
        *(int *) result = ParseVariableNum(newval, -1, -1, false);

    return true;                /* return false on a syntactically bad value */
}
This will help merge the logic of checking for valid syntax before assigning, and
of returning false on error, into one place.

>@@ -260,7 +276,7 @@ SetVariableAssignHook(VariableSpace space, const char
*name, VariableAssignHook
>current->assign_hook = hook;
>current->next = NULL;
>previous->next = current;
>-   (*hook) (NULL);
>+   (void)(*hook) (NULL);   /* ignore return value */

Sorry for my lack of understanding, but can you explain why the above change is
needed?

Thank you,
Rahila Syed








On Tue, Sep 20, 2016 at 11:16 PM, Daniel Verite <dan...@manitou-mail.org>
wrote:

> Ashutosh Bapat wrote:
>
> > You may want to add this to the November commitfest
> > https://commitfest.postgresql.org/11/.
>
> Done. It's at https://commitfest.postgresql.org/11/799/
>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] Parallel Index Scans

2016-10-18 Thread Rahila Syed
>Another point which needs some thoughts is whether it is good idea to
>use index relation size to calculate parallel workers for index scan.
>I think ideally for index scans it should be based on number of pages
>to be fetched/scanned from index.
IIUC, it's not possible to know the exact number of pages scanned from an index
in advance. What we are essentially parallelising is the scan of the leaf pages,
so it would make sense to base the number of workers on the number of leaf pages.
Having said that, I think it will not make much difference compared to the
existing method, because currently the total number of index pages is used to
calculate the number of workers. As far as I understand, in large indexes the
difference between the number of leaf pages and the total number of pages is not
significant; in other words, internal pages form a small fraction of the total
pages. Also, the calculation is based on the logarithm of the number of pages, so
the difference matters even less.
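To make the last point concrete, here is a rough sketch (not actual planner code;
the threshold value and the tripling rule are assumptions mirroring the
sequential-scan heuristic) of why using leaf pages instead of total pages barely
changes the result:

static int
estimate_parallel_workers(double pages, double threshold, int max_workers)
{
    int         workers;
    double      limit = threshold;

    if (pages < threshold)
        return 0;

    /* one more worker each time the page count triples past the threshold */
    for (workers = 1; workers < max_workers; workers++)
    {
        limit *= 3;
        if (pages < limit)
            break;
    }
    return workers;
}

With, say, a 1,000,000-page index of which roughly 990,000 are leaf pages, both
inputs fall into the same tripling bucket, so the computed worker count is the
same either way.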

Thank you,
Rahila Syed






On Tue, Oct 18, 2016 at 8:38 AM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Thu, Oct 13, 2016 at 8:48 AM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
> > As of now, the driving table for parallel query is accessed by
> > parallel sequential scan which limits its usage to a certain degree.
> > Parallelising index scans would further increase the usage of parallel
> > query in many more cases.  This patch enables the parallelism for the
> > btree scans.  Supporting parallel index scan for other index types
> > like hash, gist, spgist can be done as separate patches.
> >
>
> I would like to have an input on the method of selecting parallel
> workers for scanning index.  Currently the patch selects number of
> workers based on size of index relation and the upper limit of
> parallel workers is max_parallel_workers_per_gather.  This is quite
> similar to what we do for parallel sequential scan except for the fact
> that in parallel seq. scan, we use the parallel_workers option if
> provided by user during Create Table.  User can provide
> parallel_workers option as below:
>
> Create Table  With (parallel_workers = 4);
>
> Is it desirable to have similar option for parallel index scans, if
> yes then what should be the interface for same?  One possible way
> could be to allow user to provide it during Create Index as below:
>
> Create Index  With (parallel_workers = 4);
>
> If above syntax looks sensible, then we might need to think what
> should be used for parallel index build.  It seems to me that parallel
> tuple sort patch [1] proposed by Peter G. is using above syntax for
> getting the parallel workers input from user for parallel index
> builds.
>
> Another point which needs some thoughts is whether it is good idea to
> use index relation size to calculate parallel workers for index scan.
> I think ideally for index scans it should be based on number of pages
> to be fetched/scanned from index.
>
>
> [1] - https://www.postgresql.org/message-id/CAM3SWZTmkOFEiCDpUNaO4n9-
> 1xcmWP-1NXmT7h0Pu3gM2YuHvg%40mail.gmail.com
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] Optimization for lazy_scan_heap

2016-09-25 Thread Rahila Syed
Hello,

Some initial comments on optimize_lazy_scan_heap_v2.patch.

>-   while (next_unskippable_block < nblocks)
>+   while (next_unskippable_block < nblocks &&
>+  !FORCE_CHECK_PAGE(next_unskippable_block))

Don't we need a similar !FORCE_CHECK_PAGE(next_unskippable_block) check in the
code below, which appears at line 556 of vacuumlazy.c?

next_unskippable_block = 0;
if ((options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
{
while (next_unskippable_block < nblocks)


>   {
>if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
>break;
>+   n_all_frozen++;
>}
>else
>{
>if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
>break;
>+
>+   /* Count the number of all-frozen pages */
>+   if (vmstatus & VISIBILITYMAP_ALL_FROZEN)
>+   n_all_frozen++;
>}

I think it's better to have just one place where we increment n_all_frozen,
before the if-else block, as sketched below.
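A rough sketch of that restructuring, based on the quoted hunk (the branch
condition and variable names are assumed from the patch, so this is illustrative
only):

/* count all-frozen pages in one place, before the branch */
if (vmstatus & VISIBILITYMAP_ALL_FROZEN)
    n_all_frozen++;

if (scan_all)
{
    if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
        break;
}
else
{
    if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
        break;
}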

>}
>vacuum_delay_point();
>next_unskippable_block++;
>+   n_skipped++;
>}
>}

Also, instead of incrementing n_skipped every time, it can be set to
'next_unskippable_block' or 'next_unskippable_block - blkno' once at the end of
the loop.
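Something like the following after the loop should be equivalent (variable names
are taken from the quoted hunks; blkno is assumed to be the block the skipping
started from):

/* instead of n_skipped++ on every iteration */
n_skipped = next_unskippable_block - blkno;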

Thank you,
Rahila Syed



On Thu, Aug 25, 2016 at 11:52 AM, Masahiko Sawada <sawada.m...@gmail.com>
wrote:

> On Thu, Aug 25, 2016 at 1:41 AM, Anastasia Lubennikova
> <lubennikov...@gmail.com> wrote:
> > The following review has been posted through the commitfest application:
> > make installcheck-world:  tested, passed
> > Implements feature:   not tested
> > Spec compliant:   not tested
> > Documentation:not tested
> >
> > Hi,
> > I haven't tested the performance yet, but the patch itself looks pretty
> good
> > and reasonable improvement.
> > I have a question about removing the comment. It seems to be really
> tricky
> > moment. How do we know that all-frozen block hasn't changed since the
> > moment we checked it?
>
> I think that we don't need to check whether all-frozen block hasn't
> changed since we checked it.
> Even if the all-frozen block has changed after we saw it as an
> all-frozen page, we can update the relfrozenxid.
> Because any new XIDs just added to that page are newer than the
> GlobalXmin we computed.
>
> Am I missing something?
>
> In this patch, since we count the number of all-frozen page even
> during not an aggresive scan, I thought that we don't need to check
> whether these blocks were all-frozen pages.
>
> > I'm going to test the performance this week.
> > I wonder if you could send a test script or describe the steps to test
> it?
>
> This optimization reduces the number of scanning visibility map when
> table is almost visible or frozen..
> So it would be effective for very large table (more than several
> hundreds GB) which is almost all-visible or all-frozen pages.
>
> For example,
> 1. Create very large table.
> 2. Execute VACUUM FREEZE to freeze all pages of table.
> 3. Measure the execution time of VACUUM FREEZE.
> I hope that the second VACUUM FREEZE become fast a little.
>
> I modified the comment, and attached v2 patch.
>
> Regards,
>
> --
> Masahiko Sawada
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>


Re: [HACKERS] Improvements in psql hooks for variables

2016-09-19 Thread Rahila Syed
Hello,

I am beginning to review this patch. As an initial comment, I got the following
compilation errors when I applied the patch on the latest sources and ran make.

command.c: In function ‘exec_command’:
command.c:257:5: error: too few arguments to function ‘ParseVariableBool’
 ParseVariableBool(opt1 + sizeof(prefix) - 1, prefix) ?
 ^
In file included from settings.h:12:0,
 from command.c:50:
variables.h:38:7: note: declared here
 bool  ParseVariableBool(const char *value, const char *name, bool *valid);
   ^
command.c:1551:4: error: too few arguments to function ‘ParseVariableBool’
pset.timing = ParseVariableBool(opt, "\\timing");
^
In file included from settings.h:12:0,
 from command.c:50:
variables.h:38:7: note: declared here
 bool  ParseVariableBool(const char *value, const char *name, bool *valid);
   ^
command.c: In function ‘do_pset’:
command.c:2663:4: error: too few arguments to function ‘ParseVariableBool’
popt->topt.expanded = ParseVariableBool(value, param);
^
In file included from settings.h:12:0,
 from command.c:50:
variables.h:38:7: note: declared here
 bool  ParseVariableBool(const char *value, const char *name, bool *valid);
   ^
command.c:2672:4: error: too few arguments to function ‘ParseVariableBool’
popt->topt.numericLocale = ParseVariableBool(value, param);
^
In file included from settings.h:12:0,
 from command.c:50:
variables.h:38:7: note: declared here
 bool  ParseVariableBool(const char *value, const char *name, bool *valid);
   ^
command.c:2727:4: error: too few arguments to function ‘ParseVariableBool’
popt->topt.tuples_only = ParseVariableBool(value, param);
^
In file included from settings.h:12:0,
 from command.c:50:
variables.h:38:7: note: declared here
 bool  ParseVariableBool(const char *value, const char *name, bool *valid);
   ^
command.c:2759:4: error: too few arguments to function ‘ParseVariableBool’
if (ParseVariableBool(value, param))
^
In file included from settings.h:12:0,
 from command.c:50:
variables.h:38:7: note: declared here
 bool  ParseVariableBool(const char *value, const char *name, bool *valid);
   ^
command.c:2781:4: error: too few arguments to function ‘ParseVariableBool’
popt->topt.default_footer = ParseVariableBool(value, param);
^
In file included from settings.h:12:0,
 from command.c:50:
variables.h:38:7: note: declared here
 bool  ParseVariableBool(const char *value, const char *name, bool *valid);
   ^

Thank you,
Rahila Syed


On Mon, Sep 19, 2016 at 11:21 AM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> You may want to add this to the November commitfest
> https://commitfest.postgresql.org/11/.
>
> On Fri, Sep 16, 2016 at 6:05 PM, Daniel Verite <dan...@manitou-mail.org>
> wrote:
> >  Hi,
> >
> > Following the discussion on forbidding an AUTOCOMMIT off->on
> > switch mid-transaction [1], attached is a patch that let the hooks
> > return a boolean indicating whether a change is allowed.
> >
> > Using the hooks, bogus assignments to built-in variables can
> > be dealt with more strictly.
> >
> > For example, pre-patch behavior:
> >
> >   =# \set ECHO errors
> >   =# \set ECHO on
> >   unrecognized value "on" for "ECHO"; assuming "none"
> >   =# \echo :ECHO
> >   on
> >
> > which has two problems:
> > - we have to assume a value, even though we can't know what the user
> meant.
> > - after assignment, the user-visible value of the variable diverges from
> its
> > internal counterpart (pset.echo in this case).
> >
> >
> > Post-patch:
> >   =# \set ECHO errors
> >   =# \set ECHO on
> >   unrecognized value "on" for "ECHO"
> >   \set: error while setting variable
> >   =# \echo :ECHO
> >   errors
> >
> > Both the internal pset.* state and the user-visible value are kept
> unchanged
> > is the input value is incorrect.
> >
> > Concerning AUTOCOMMIT, autocommit_hook() could return false to forbid
> > a switch when the conditions are not met.
> >
> >
> > Another user-visible effect of the patch is that, using a bogus value
> > for a built-in variable on the command-line becomes a fatal error
> > that prevents psql to continue.
> > This is not directly intended by the patch but is a consequence
> > of SetVariable() failing.
> >
> > Example:
> >   $ ./psql -vECHO=bogus
> >   unrecognized value "bogus" for "ECHO"
> >   psql: could not set variable "ECHO"
> >   $ echo $?
> >   1
> >
> > The built-in vars c

Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-09-15 Thread Rahila Syed
>Have you considered expanding
>the API for hook functions?

Changing the hooks API to allow rejecting a setting and returning false is
certainly useful for other psql variables that want to report an error and reject
a value.

I did not consider expanding the hook API because there was no requirement in
sight for other variables to reject a setting. As far as autocommit is concerned,
something in line with the current design can be implemented.

In the current design, any unrecognisable/bad value is reinterpreted and the
execution inside the hook always succeeds.
In keeping with the current design of the hooks, instead of rejecting an
autocommit 'ON' setting inside a transaction, the value can be set to 'ON' with a
psql_error explaining that the new value will take effect once the current
transaction has ended.
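A minimal sketch of that approach, assuming the check lives in the existing
autocommit_hook in startup.c (the message wording here is only illustrative):

static void
autocommit_hook(const char *newval)
{
    bool        newsetting = ParseVariableBool(newval, "AUTOCOMMIT");

    if (newsetting && !pset.autocommit &&
        PQtransactionStatus(pset.db) == PQTRANS_INTRANS)
        psql_error("AUTOCOMMIT will take effect when the current transaction has ended\n");

    pset.autocommit = newsetting;
}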

>Actually, it would make a lot more sense UI-wise if attempting to assign a
>non-boolean value to a boolean variable resulted in an error and no change
>to the variable, instead of what happens now.
The hooks API can be expanded to implement this.

The proposed feature is mainly meant to reduce the ambiguity for the user when
\set AUTOCOMMIT ON is run within a transaction. Under the current behaviour, the
variable is set immediately but takes effect only once the current transaction
has ended, and it is good to notify the user of this.
This ambiguity was highlighted because, in AUTOCOMMIT off mode, Postgres
implicitly starts a transaction, and the behaviour of \set AUTOCOMMIT ON in that
scenario can be confusing.

Thank you,
Rahila Syed



On Wed, Sep 14, 2016 at 8:06 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Rahila Syed <rahilasye...@gmail.com> writes:
> >> Looking at the other variables hooks, they already emit errors and can
> >> deny the effect of a change corresponding to a new value, without
> >> informing the caller. Why would autocommit be different?
>
> > These instances where psql_error occurs inside hooks the command is
> > successful and the value supplied by user is reinterpreted to some other
> > value as user had supplied an unrecognisable value.
> > With psql_error_on_autocommit patch what was intended was to make
> > the command unsuccessful and keep the previous setting of autocommit.
> > Hence having it inside autocommit_hook did not seem appropriate to me.
>
> Nonetheless, asking all callers of SetVariable to deal with such cases
> is entirely unmaintainable/unacceptable.  Have you considered expanding
> the API for hook functions?  I'm not really sure why we didn't provide a
> way for the hooks to reject a setting to begin with.
>
> Actually, it would make a lot more sense UI-wise if attempting to assign a
> non-boolean value to a boolean variable resulted in an error and no change
> to the variable, instead of what happens now.
>
> Anyway, I'm not very thrilled with the idea that AUTOCOMMIT is so special
> that it should have a different behavior than any other built-in psql
> variable.  If we make them all throw errors and refuse to change to bad
> values, that would be consistent and defensible IMO.  But having
> AUTOCOMMIT alone act that way is not a feature, it's a wart.
>
> regards, tom lane
>


Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-09-14 Thread Rahila Syed
Hello,

>Looking at the other variables hooks, they already emit errors and can
>deny the effect of a change corresponding to a new value, without
>informing the caller. Why would autocommit be different?
The only kind of psql_error issued inside the hooks is as follows:

 psql_error("unrecognized value \"%s\" for \"%s\"; assuming \"%s\"\n",
   newval, "ECHO", "none");

In these instances where psql_error occurs inside a hook, the command is still
successful, and the value supplied by the user is reinterpreted to some other
value because the user supplied an unrecognisable one.

With the psql_error_on_autocommit patch, the intention was to make the command
unsuccessful and keep the previous autocommit setting. Hence, placing the check
inside autocommit_hook did not seem appropriate to me.

But as pointed out by you, the other way of setting autocommit, i.e.

SELECT 'on' as "AUTOCOMMIT" \gset

will not be handled by the patch. So I will change the patch to have the check in
the autocommit_hook instead.

This will mean that if \set AUTOCOMMIT ON or SELECT 'on' as "AUTOCOMMIT" \gset is
run inside a transaction, it will take effect after the current transaction has
ended. An appropriate message will be displayed notifying the user of this, and
the user need not rerun the \set AUTOCOMMIT command.

Thank you,
Rahila Syed



On Thu, Sep 8, 2016 at 9:55 PM, Daniel Verite <dan...@manitou-mail.org>
wrote:

> Rahila Syed wrote:
>
> > However, including the check here will require returning the status
> > of command from this hook. i.e if we throw error inside this
> > hook we will need to return false indicating the value has not changed.
>
> Looking at the other variables hooks, they already emit errors and can
> deny the effect of a change corresponding to a new value, without
> informing the caller. Why would autocommit be different?
>
> For example echo_hook does this:
>
> /* ...if the value is in (queries,errors,all,none) then
>assign pset.echo accordingly ... */
> else
> {
>   psql_error("unrecognized value \"%s\" for \"%s\"; assuming \"%s\"\n",
>newval, "ECHO", "none");
>   pset.echo = PSQL_ECHO_NONE;
> }
>
>
> If the user issues \set ECHO FOOBAR, then it produces the above error
> message and makes the same internal change as if \set ECHO none
> had been issued.
>
> But, the value of the variable as seen by the user is still FOOBAR:
>
> \set
> [...other variables...]
> ECHO = 'FOOBAR'
>
> The proposed patch doesn't follow that behavior, as it denies
> a new value for AUTOCOMMIT. You might argue that it's better,
> but on the other side, there are two reasons against it:
>
> - it's not consistent with the other uses of \set that affect psql,
> such as ECHO, ECHO_HIDDEN,  ON_ERROR_ROLLBACK,
>  COMP_KEYWORD_CASE... and even AUTOCOMMIT as
>  \set AUTOCOMMIT FOOBAR is not denied, just reinterpreted.
>
> - it doesn't address the case of another method than \set being used
>  to set the variable, as in : SELECT 'on' as "AUTOCOMMIT" \gset
>  whereas the hook would work in that case.
>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>


Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-09-08 Thread Rahila Syed
Thank you for comments.

>But there's a function in startup.c which might be the ideal location
>or the check, as it looks like the one and only place where the
>autocommit flag actually changes:

>static void
>autocommit_hook(const char *newval)

Thank you for pointing this out. This is indeed the only place where the
autocommit setting changes in psql. However, including the check here will
require returning the status of the command from this hook, i.e. if we throw an
error inside the hook we will need to return false to indicate that the value has
not changed. This changes the existing definition of the hook, which returns
void, and will also lead to changes in the other implementations of this hook
apart from autocommit_hook.

>There are other values than ON: true/yes and theirs abbreviations,
>the value 1, and anything that doesn't resolve to OFF is taken as ON.
>ParseVariableBool() in commands.c already does the job of converting
>these to bool, the new code should probably just call that function
>instead of parsing the value itself.

I have included this in the attached patch.

Also, I have addressed the prior review comments by Rushabh Lathia and Amit
Langote in the attached patch.

Regarding the suggestion to move the code inside SetVariable(), I don't think it
is a good idea, because SetVariable() and the variable.c file in which it lives
are about the handling of psql variables: checks for correct variable names,
values, etc. This check is about the correctness of the command, so it is better
placed in exec_command(), as sketched below.
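A sketch of the check as described (opt0, newval and success follow the names
used in the existing code and the review comments; the actual patch may differ):

/* in exec_command(), while handling \set */
if (strcmp(opt0, "AUTOCOMMIT") == 0 &&
    ParseVariableBool(newval, "AUTOCOMMIT") &&
    !pset.autocommit &&
    PQtransactionStatus(pset.db) == PQTRANS_INTRANS)
{
    psql_error("\\set: Cannot set AUTOCOMMIT to ON inside a transaction, either COMMIT or ROLLBACK and retry\n");
    success = false;
}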

Thank you,
Rahila Syed




On Sat, Sep 3, 2016 at 4:39 PM, Daniel Verite <dan...@manitou-mail.org>
wrote:

> Rushabh Lathia wrote:
>
> > It might happen that SetVariable() can be called from other places in
> > future,
> > so its good to have all the set variable related checks inside
> > SetVariable().
>
> There's already another way to skip the \set check:
>   select 'on' as "AUTOCOMMIT" \gset
>
> But there's a function in startup.c which might be the ideal location
> for the check, as it looks like the one and only place where the
> autocommit flag actually changes:
>
> static void
> autocommit_hook(const char *newval)
> {
>  pset.autocommit = ParseVariableBool(newval, "AUTOCOMMIT");
> }
>
>
>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>


psql_error_on_autocommit_v2.patch
Description: application/download



Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-09-02 Thread Rahila Syed
>Document doesn't say this changes are only for implicit BEGIN..
Correcting myself here: not just with an implicit BEGIN, it will throw an error on
\set AUTOCOMMIT ON inside any transaction, provided the earlier value of
AUTOCOMMIT was OFF. Probably in the test where you tried it, AUTOCOMMIT was
already ON.

Thank you,
Rahila Syed


On Fri, Sep 2, 2016 at 3:17 PM, Rushabh Lathia <rushabh.lat...@gmail.com>
wrote:

>
>
> On Fri, Sep 2, 2016 at 1:12 PM, Rahila Syed <rahilasye...@gmail.com>
> wrote:
>
>> Hello,
>>
>> Thank you for comments.
>>
>> >Above test not throwing psql error, as you used strcmp(newval,"ON"). You
>> >should use pg_strcasecmp.
>> Corrected in the attached.
>>
>> >Above error coming because in below code block change, you don't have
>> check for
>> >command (you should check opt0 for AUTOCOMMIT command)
>> Corrected in the attached.
>>
>> >postgres=# BEGIN;
>> >BEGIN
>> >postgres=# create table foo ( a int );
>> >CREATE TABLE
>> >postgres=# \set AUTOCOMMIT ON
>> >Don't you think, in above case also we should throw a psql error?
>> IMO, in this case BEGIN is explicitly specified by user, so I think it is
>> understood that a commit is required for changes to be effective.
>> Hence I did not consider this case.
>>
>>
> Document doesn't say this changes are only for implicit BEGIN..
>
> +
> +
> + Autocommit cannot be set on inside a transaction, the ongoing
> + transaction has to be ended by entering COMMIT or
> + ROLLBACK before setting autocommit on.
> +
> +
>
> In my opinion, its good to be consistent, whether its implicit or
> explicitly specified BEGIN.
>
> >postgres=# \set AUTOCOMMIT off
>> >postgres=# create table foo ( a int );
>> >CREATE TABLE
>> >postgres=# \set XYZ ON
>> >\set: Cannot set XYZ to ON inside a transaction, either COMMIT or
>> ROLLBACK and retry
>>
>> >May be you would like to move the new code block inside SetVariable().
>> So that
>> >don't need to worry about the validity of the variable names.
>>
>> I think validating variable names wont be required if we throw error only
>> if  command is \set AUTOCOMMIT.
>> Validation can happen later as in the existing code.
>>
>>
> It might happen that SetVariable() can be called from other places in
> future,
> so its good to have all the set variable related checks inside
> SetVariable().
>
> Also you can modify the condition in such way, so that we don't often end
> up
> calling PQtransactionStatus() even though its not require.
>
> Example:
>
> if (!strcmp(opt0,"AUTOCOMMIT") &&
> !strcasecmp(newval,"ON") &&
> !pset.autocommit &&
> PQtransactionStatus(pset.db) == PQTRANS_INTRANS)
>
>
> >Basically if I understand correctly, if we are within transaction and
>> someone
>> >tries the set the AUTOCOMMIT ON, it should throw a psql error. Correct me
>> >if I am missing anything?
>>
>> Yes the psql_error is thrown when AUTOCOMMIT is turned on inside a
>> transaction. But only when there is an implicit BEGIN as in following case,
>>
>> postgres=# \set AUTOCOMMIT OFF
>> postgres=# create table test(i int);
>> CREATE TABLE
>> postgres=# \set AUTOCOMMIT ON
>> \set: Cannot set AUTOCOMMIT to ON inside a transaction, either COMMIT or
>> ROLLBACK and retry
>> postgres=#
>>
>> Thank you,
>> Rahila Syed
>>
>>
>
>
> Regards,
> Rushabh Lathia
> www.EnterpriseDB.com
>


Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-09-02 Thread Rahila Syed
Hello,

Thank you for comments.

>Above test not throwing psql error, as you used strcmp(newval,"ON"). You
>should use pg_strcasecmp.
Corrected in the attached.

>Above error coming because in below code block change, you don't have
check for
>command (you should check opt0 for AUTOCOMMIT command)
Corrected in the attached.

>postgres=# BEGIN;
>BEGIN
>postgres=# create table foo ( a int );
>CREATE TABLE
>postgres=# \set AUTOCOMMIT ON
>Don't you think, in above case also we should throw a psql error?
IMO, in this case BEGIN is explicitly specified by the user, so I think it is
understood that a COMMIT is required for the changes to take effect. Hence I did
not consider this case.

>postgres=# \set AUTOCOMMIT off
>postgres=# create table foo ( a int );
>CREATE TABLE
>postgres=# \set XYZ ON
>\set: Cannot set XYZ to ON inside a transaction, either COMMIT or ROLLBACK
and retry

>May be you would like to move the new code block inside SetVariable(). So
that
>don't need to worry about the validity of the variable names.

I think validating variable names won't be required if we throw the error only
when the command is \set AUTOCOMMIT. Validation can happen later, as in the
existing code.

>Basically if I understand correctly, if we are within transaction and
someone
>tries the set the AUTOCOMMIT ON, it should throw a psql error. Correct me
>if I am missing anything?

Yes, the psql_error is thrown when AUTOCOMMIT is turned on inside a transaction,
but only when there is an implicit BEGIN, as in the following case:

postgres=# \set AUTOCOMMIT OFF
postgres=# create table test(i int);
CREATE TABLE
postgres=# \set AUTOCOMMIT ON
\set: Cannot set AUTOCOMMIT to ON inside a transaction, either COMMIT or
ROLLBACK and retry
postgres=#

Thank you,
Rahila Syed


psql_error_on_autocommit_v1.patch
Description: application/download



Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-09-01 Thread Rahila Syed
OK. Please find attached a patch which introduces a psql error when autocommit is
turned on inside a transaction. It also adds the relevant documentation to
psql-ref.sgml. The output is as follows:

bash-4.2$ psql -d postgres
psql (10devel)
Type "help" for help.

postgres=# \set AUTOCOMMIT OFF
postgres=# create table test(i int);
CREATE TABLE
postgres=# \set AUTOCOMMIT ON
\set: Cannot set AUTOCOMMIT to ON inside a transaction, either COMMIT or
ROLLBACK and retry
postgres=#


Thank you,
Rahila Syed

On Wed, Aug 17, 2016 at 6:31 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Tue, Aug 16, 2016 at 5:25 PM, Rahila Syed <rahilasye...@gmail.com>
> wrote:
> >>I think I like the option of having psql issue an error.  On the
> >>server side, the transaction would still be open, but the user would
> >>receive a psql error message and the autocommit setting would not be
> >>changed.  So the user could type COMMIT or ROLLBACK manually and then
> >>retry changing the value of the setting.
> >
> > Throwing psql error comes out to be most accepted outcome on this
> thread. I
> > agree it is safer than guessing user intention.
> >
> > Although according to the default behaviour of psql, error will abort the
> > current transaction and roll back all the previous commands.
>
> A server error would do that, but a psql errror won't.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


psql_error_on_autocommit.patch
Description: application/download



Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-08-16 Thread Rahila Syed
>I think I like the option of having psql issue an error.  On the
>server side, the transaction would still be open, but the user would
>receive a psql error message and the autocommit setting would not be
>changed.  So the user could type COMMIT or ROLLBACK manually and then
>retry changing the value of the setting.

Throwing a psql error turns out to be the most accepted outcome on this thread. I
agree it is safer than guessing the user's intention.

However, according to the default behaviour of psql, an error will abort the
current transaction and roll back all the previous commands. This can be
user-unfriendly, making the user rerun all the commands just because of the
autocommit switch. So the behaviour of 'ON_ERROR_ROLLBACK on' probably needs to
be implemented along with the error display; this would roll back just the
autocommit switch command.

Also, a psql error instead of a simple commit can lead to script terminations,
which makes issuing a COMMIT seem more viable here. However, script termination
can be avoided by the default behaviour of ON_ERROR_STOP, which lets subsequent
commands execute successfully. (Those subsequent commands won't run in autocommit
mode, which I think should be OK, as the user is notified via the ERROR.)

To summarize my view of the discussion on this thread, issuing a psql error seems
to be the best option. I will post a patch for this if there is no objection.


Thank you,

Rahila Syed


Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-08-08 Thread Rahila Syed
Thank you for inputs everyone.

The opinions on this thread can be classified into following
1. Commit
2. Rollback
3. Error
4. Warning

As per the opinions upthread, issuing an implicit commit immediately after
switching autocommit to ON can be unsafe if it was not desired. While I agree
that it is difficult to judge the user's intention here, if we were to base the
behaviour on some assumption, the closest would be an implicit COMMIT, in my
opinion. There is a higher likelihood of a user being happy with a commit being
issued when setting autocommit ON than with the transaction being rolled back.
Also, there are quite a few interfaces which provide this behaviour.

As mentioned upthread, issuing a warning on switching back to autocommit will not
be effective inside a script: it won't allow subsequent commands to be committed,
since the \set AUTOCOMMIT ON itself is not committed. Scripts would have to be
changed and rerun, which impacts user-friendliness.

While I agree that issuing an ERROR and rolling back the transaction ranks
higher in safe behaviour, it is not as common (according to instances
stated upthread) as immediately committing any open transaction when
switching back to autocommit.

Thank you,
Rahila Syed


On Sun, Aug 7, 2016 at 4:42 AM, Matt Kelly <mkell...@gmail.com> wrote:

> Its worth noting that the JDBC's behavior when you switch back to
> autocommit is to immediately commit the open transaction.
>
> Personally, I think committing immediately or erroring are unsurprising
> behaviors.  The current behavior is surprising and obviously wrong.
> Rolling back without an error would be very surprising (no other database
> API I know of does that) and would take forever to debug issues around the
> behavior.  And committing after the next statement is definitely the most
> surprising behavior suggested.
>
> IMHO, I think committing immediately and erroring are both valid.  I think
> I'd prefer the error in principle, and per the law of bad code I'm sure,
> although no one has ever intended to use this behavior, there is probably
> some code out there that is relying on this behavior for "correctness".  I
> think a hard failure and making the dev add an explicit commit is least
> likely to cause people serious issues.  As for the other options, consider
> me opposed.
>
> - Matt K.
>


[HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-08-03 Thread Rahila Syed
Hello,

Need community's opinion on following behaviour of  \set AUTOCOMMIT

If \set AUTOCOMMIT ON is issued after \set AUTOCOMMIT OFF, the commands which
follow after AUTOCOMMIT is set to ON are not committed until an explicit COMMIT
is issued.
It can be surprising to the user not to see the results of the commands fired
after AUTOCOMMIT is set to ON.

bash-4.2$ psql -d postgres -U rahila
psql (9.6beta3)
Type "help" for help.

postgres=# \set AUTOCOMMIT OFF
postgres=# create table test1(i int);
CREATE TABLE
postgres=# \set AUTOCOMMIT ON
postgres=# create table test2(j int);
CREATE TABLE
postgres=# \c postgres rahila
You are now connected to database "postgres" as user "rahila".
postgres=# \dt;
No relations found.

The ongoing transaction is left running when the mode changes from AUTOCOMMIT OFF
to AUTOCOMMIT ON.
This happens because \set AUTOCOMMIT ON is fired within the transaction block
that was started when the first command after \set AUTOCOMMIT OFF was executed.
Hence an explicit COMMIT is required for it to be effective.

Should changing the value from OFF to ON automatically either commit or roll back
the transaction in progress?
FWIW, running SET AUTOCOMMIT through ecpg commits the ongoing transaction when
autocommit is set to ON from OFF. Should such behaviour be implemented for
\set AUTOCOMMIT ON as well?

Thank you,
Rahila Syed


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2016-03-25 Thread Rahila Syed
>Oops.  I forgot to credit you in the commit message.  Sorry about that.
:-(
No problem :). Thanks for the commit.

Thank you,
Rahila Syed


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2016-03-24 Thread Rahila Syed
Hello,

A server crash was reported when running the vacuum progress checker view on a
32-bit machine.
Please find attached a fix for the same.

The crash occurs because a 32-bit machine considers int8 to be passed by
reference when creating the tuple descriptor. At the time of filling the tuple
store, the code (heap_fill_tuple) checks this tuple descriptor before inserting
the value into the tuple store; it finds the attribute type to be
pass-by-reference and hence treats the value as a pointer when it is not, and
thus it fails in the memcpy.

This happens because the appropriate conversion function is not used when storing
the value of that particular attribute into the values array before copying it
into the tuple store.

-   values[i+3] =
UInt32GetDatum(beentry->st_progress_param[i]);
+   values[i+3] =
Int64GetDatum(beentry->st_progress_param[i]);


Attached patch fixes this.

Thank you,
Rahila Syed

On Wed, Mar 16, 2016 at 11:30 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Wed, Mar 16, 2016 at 6:44 AM, Rahila Syed <rahilasye...@gmail.com>
> wrote:
> >>Sorta.  Committed after renaming what you called heap blocks vacuumed
> >>back to heap blocks scanned, adding heap blocks vacuumed, removing the
> >>overall progress meter which I don't believe will be anything close to
> >>accurate, fixing some stylistic stuff, arranging to update multiple
> >>counters automatically where it could otherwise produce confusion,
> >>moving the new view near similar ones in the file, reformatting it to
> >>follow the style of the rest of the file, exposing the counter
> >>#defines via a header file in case extensions want to use them, and
> >>overhauling and substantially expanding the documentation
> >
> > We have following lines,
> >
> > /* report that everything is scanned and vacuumed */
> > pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED,
> > blkno);
> > pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED,
> > blkno);
> >
> >
> > which appear before final vacuum cycle happens for any remaining dead
> tuples
> > which may span few pages if I am not mistaken.
> >
> > IMO, reporting final count of heap_blks_scanned is correct here, but
> > reporting final heap_blks_vacuumed can happen after the final VACUUM
> cycle
> > for more accuracy.
>
> You are quite right.  Good catch.  Fixed that, and applied Vinayak's
> patch too, and fixed another mistake I saw while I was at it.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0f6f891..64c4cc4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -612,7 +612,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 		{
 			values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
 			for(i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
-values[i+3] = UInt32GetDatum(beentry->st_progress_param[i]);
+values[i+3] = Int64GetDatum(beentry->st_progress_param[i]);
 		}
 		else
 		{



Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2016-03-16 Thread Rahila Syed
>Sorta.  Committed after renaming what you called heap blocks vacuumed
>back to heap blocks scanned, adding heap blocks vacuumed, removing the
>overall progress meter which I don't believe will be anything close to
>accurate, fixing some stylistic stuff, arranging to update multiple
>counters automatically where it could otherwise produce confusion,
>moving the new view near similar ones in the file, reformatting it to
>follow the style of the rest of the file, exposing the counter
>#defines via a header file in case extensions want to use them, and
>overhauling and substantially expanding the documentation

We have following lines,

/* report that everything is scanned and vacuumed */
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED,
blkno);
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED,
blkno);


which appear before the final vacuum cycle happens for any remaining dead tuples,
which may span a few pages, if I am not mistaken.

IMO, reporting the final count of heap_blks_scanned is correct here, but
reporting the final heap_blks_vacuumed should happen after the final vacuum
cycle, for more accuracy.

Thank you,
Rahila Syed


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2016-03-14 Thread Rahila Syed
Hello,

While I am still looking at this WIP patch, I had one suggestion.

Instead of making changes in the index AM API, can we have a call that updates
the shared state using the pgstat_progress* API directly from the index-specific
code, the way the pgstat_count_index_scan(rel) call in _bt_first() does? Although
that function basically updates local structures and sends the count to the stats
collector via messages, we could have a call which instead modifies the shared
state using the progress API committed recently.

Thank you,

Rahila Syed


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2016-01-29 Thread Rahila Syed
>Okay, I agree that reporting just the current blkno is simple and good
>enough. How about numbers of what we're going to report as the "Vacuuming
>Index and Heap" phase? I guess we can still keep the scanned_index_pages
>and index_scan_count So we have:
>+CREATE VIEW pg_stat_vacuum_progress AS
>+   SELECT
>+  S.pid,
>+  S.relid,
>+  S.phase,
>+  S.total_heap_blks,
>+  S.current_heap_blkno,
>+  S.total_index_pages,
>+  S.scanned_index_pages,
>+  S.index_scan_count
>+  S.percent_complete,
>+   FROM pg_stat_get_vacuum_progress() AS S;
>I guess it won't remain pg_stat_get_"vacuum"_progress(
>), though.

Apart from these, as suggested in [1], finer-grained reporting from the index
vacuuming phase can provide better insight. Currently we report the number of
blocks processed once, at the end of vacuuming each index.
IIUC, what was suggested in [1] was instrumenting lazy_tid_reaped with a counter
of the number of index tuples processed so far, since lazy_tid_reaped is called
for every index tuple to see whether it matches any of the dead tuple TIDs.

So the additional parameters for each index could be:
scanned_index_tuples
total_index_tuples (from pg_class.reltuples entry)
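A possible sketch of that instrumentation (the counter field and the progress
slot name are assumptions for illustration, not existing definitions):

static bool
lazy_tid_reaped(ItemPointer itemptr, void *state)
{
    LVRelStats *vacrelstats = (LVRelStats *) state;

    /* hypothetical: report one more index tuple checked against the dead TIDs */
    pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_TUPLES_SCANNED,
                                 ++vacrelstats->scanned_index_tuples);

    return bsearch((void *) itemptr,
                   (void *) vacrelstats->dead_tuples,
                   vacrelstats->num_dead_tuples,
                   sizeof(ItemPointerData),
                   vac_cmp_itemptr) != NULL;
}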

Thank you,
Rahila Syed

[1]. http://www.postgresql.org/message-id/56500356.4070...@bluetreble.com


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2016-01-28 Thread Rahila Syed
>+if(!scan_all)
>+scanned_heap_pages = scanned_heap_pages +
>next_not_all_visible_block;

>I don't want to be too much of a stickler for details here, but it
>seems to me that this is an outright lie.

Initially, scanned_heap_pages was meant to report just the scanned pages, and
skipped pages were not added to the count; instead, the skipped pages were
deducted from the total number of heap pages to be scanned, so that the number of
scanned pages would eventually add up to the total heap pages. As per comments
received later, the total heap pages were kept constant and the skipped page
count was added to the scanned pages to make the count add up to the total heap
pages at the end of the scan. That said, as suggested, scanned_heap_pages should
be renamed to current_heap_page and report the current blkno in the
lazy_scan_heap loop, which adds up to the total heap pages (nblocks) at the end
of the scan, and scanned_heap_pages can be reported as a separate number which
does not include skipped pages.


>+/*
>+ * Reporting vacuum progress to statistics collector
>+ */

>This patch doesn't report anything to the statistics collector, nor should
it.
Yes, this was incorrectly added initially, by analogy with the similar
pgstat_report* interface functions.

Thank you,
Rahila Syed

On Thu, Jan 28, 2016 at 2:27 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Tue, Jan 26, 2016 at 11:37 PM, Vinayak Pokale <vinpok...@gmail.com>
> wrote:
> > Hi,
> >
> > Please find attached updated patch with an updated interface.
>
> Well, this isn't right.  You've got this sort of thing:
>
> +scanned_index_pages += RelationGetNumberOfBlocks(Irel[i]);
> +/* Report progress to the statistics collector */
> +pgstat_report_progress_update_message(0, progress_message);
> +pgstat_report_progress_update_counter(1, scanned_heap_pages);
> +pgstat_report_progress_update_counter(3, scanned_index_pages);
> +pgstat_report_progress_update_counter(4,
> vacrelstats->num_index_scans + 1);
>
> The point of having pgstat_report_progress_update_counter() is so that
> you can efficiently update a single counter without having to update
> everything, when only one counter has changed.  But here you are
> calling this function a whole bunch of times in a row, which
> completely misses the point - if you are updating all the counters,
> it's more efficient to use an interface that does them all at once
> instead of one at a time.
>
> But there's a second problem here, too, which is that I think you've
> got this code in the wrong place.  The second point of the
> pgstat_report_progress_update_counter interface is that this should be
> cheap enough that we can do it every time the counter changes.  That's
> not what you are doing here.  You're updating the counters at various
> points in the code and just pushing new values for all of them
> regardless of which ones have changed.  I think you should find a way
> that you can update the value immediately at the exact moment it
> changes.  If that seems like too much of a performance hit we can talk
> about it, but I think the value of this feature will be greatly
> weakened if users can't count on it to be fully and continuously up to
> the moment.  When something gets stuck, you want to know where it's
> stuck, not approximately kinda where it's stuck.
>
> +if(!scan_all)
> +scanned_heap_pages = scanned_heap_pages +
> next_not_all_visible_block;
>
> I don't want to be too much of a stickler for details here, but it
> seems to me that this is an outright lie.  The number of scanned pages
> does not go up when we decide to skip some pages, because scanning and
> skipping aren't the same thing.  If we're only going to report one
> number here, it needs to be called something like "current heap page",
> and then you can just report blkno at the top of each iteration of
> lazy_scan_heap's main loop.  If you want to report the numbers of
> scanned and skipped pages separately that'd be OK too, but you can't
> call it the number of scanned pages and then actually report a value
> that is not that.
>
> +/*
> + * Reporting vacuum progress to statistics collector
> + */
>
> This patch doesn't report anything to the statistics collector, nor should
> it.
>
> Instead of making the SQL-visible function
> pg_stat_get_vacuum_progress(), I think it should be something more
> generic like pg_stat_get_command_progress().  Maybe VACUUM will be the
> only command that reports into that feature for right now, but I'd
> hope for us to change that pretty soon after we get the first patch
> committed.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2016-01-08 Thread Rahila Syed
>I suspect you need to create a new CF entry for this patch in CF 2016-01.

Unless I am missing something, there seems to be no entry for this patch on the
CF 2016-01 page: https://commitfest.postgresql.org/8/.
Regrettably, we have passed the deadline for adding the patch to this commitfest.
Is there still some way to add it to commitfest 2016-01? As this feature received
a lot of feedback in the previous commitfest, adding it to this one will surely
help in progressing it and making it ready for PostgreSQL 9.6.

Thank you,
Rahila Syed


On Mon, Dec 28, 2015 at 6:01 AM, Amit Langote <langote_amit...@lab.ntt.co.jp
> wrote:

>
> Hi Vinayak,
>
> On 2015/12/25 21:46, Vinayak Pokale wrote:
> > Hi,
> > Please find attached patch addressing following comments.
> >
> >> The relation OID should be reported and not its name. In case of a
> >> relation rename that would not be cool for tracking, and most users
> >> are surely going to join with other system tables using it.
> > The relation OID is reported instead of relation name.
> > The following interface function is called at the beginning to report the
> > relation OID once.
> > void pgstat_report_command_target(Oid relid)
> >
> >> Regarding pg_stat_get_vacuum_progress(): I think a backend can simply be
> >> skipped if (!has_privs_of_role(GetUserId(), beentry->st_userid)) (cannot
> >> put that in plain English, :))
> > Updated in the attached patch.
> >
> > In the previous patch, ACTIVITY_IS_VACUUM is set unnecessarily for
> > VACOPT_FULL and they are not covered by lazy_scan_heap().
> > I have updated it in attached patch and also renamed ACTIVITY_IS_VACUUM
> to
> > COMMAND_LAZY_VACUUM.
> >
> > Added documentation for view.
> > Some more comments need to be addressed.
>
> I suspect you need to create a new CF entry for this patch in CF 2016-01.
>
> Thanks,
> Amit
>
>
>


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-11-30 Thread Rahila Syed
Hello,

Thank you for your comments.
Please find attached a patch addressing the following comments:

>- duplicate_oids error in HEAD.
Check.

>- a compiler warning:
>pgstat.c:2898: warning: no previous prototype for
‘pgstat_reset_activityflag’
Check.

>One more change you could do is 's/activityflag/activity_flag/g',
Check.

>Type of the flag of vacuum activity.
The flag variable is an integer so that more commands can be incorporated in the future.

>Type of st_progress_param and so.
st_progress_param is also given a generic name so as to accommodate the different
parameters reported by various commands.

>st_progress_param_float is currently totally useless.
The float parameter has been removed from the patch for now.

>Definition of progress_message.
>The definition of progress_message in lazy_scan_heap is "char
>[PROGRESS_MESSAGE_LENGTH][N_PROGRESS_PARAM]" which looks to be
>inversed.
Corrected.

>The current code subtracts the number of blocks when
>skipping_all_visible_blocks is set in two places. But I think
>it is enough to decrement when skipping.
In both places the pages are being skipped, hence the total count was
decremented.

>He suggested to keep total_heap_pages fixed while adding number
>of skipped pages to that of scanned pages.
This has been done in the attached.

>snprintf(progress_message[0], PROGRESS_MESSAGE_LENGTH, "%s.%s",
>  get_namespace_name(RelationGetNamespace(rel)),
>  relname);
Check.

The previous implementation added the total number of pages across all indexes of
a relation to total_index_pages on every index scan, to account for the total
pages scanned; thus it was equal to number of scans * total_index_pages.

In the attached patch, total_index_pages reflects the total number of pages
across all indexes of the relation, and a column reporting the number of passes
through the indexes (phase 2) has been added to account for the number of passes
of index and heap vacuuming. The number of scanned index pages is reset at the
end of each pass, which makes the reporting clearer.
The percent complete does not account for index pages; it just denotes the
percentage of the heap scanned.

>Spotted a potential oversight regarding report of scanned_pages. It
>seems pages that are skipped because of not getting a pin, being new,
>being empty could be left out of the progress equation.
Corrected.


>It's better to
>report that some other way, like use one of the strings to report a
>"phase" of processing that we're currently performing.
Has been included in the attached.

Some more comments remain to be addressed, including the name change of the
activity flag, reporting only changed parameters to shared memory, the
ACTIVITY_IS_VACUUM flag being set unnecessarily for ANALYZE and FULL
commands, and documentation for the new view.
Also, finer-grained reporting from the index and heap-truncate phases is yet
to be incorporated into the patch.

Thank you,
Rahila Syed
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..d53833e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -631,6 +631,19 @@ CREATE VIEW pg_stat_activity AS
 WHERE S.datid = D.oid AND
 S.usesysid = U.oid;
 
+CREATE VIEW pg_stat_vacuum_progress AS
+SELECT
+			S.pid,
+			S.table_name,
+			S.phase,
+			S.total_heap_pages,
+			S.scanned_heap_pages,
+S.percent_complete,
+S.total_index_pages,
+S.scanned_index_pages,
+S.index_scan_count
+FROM pg_stat_get_vacuum_progress() AS S;
+
 CREATE VIEW pg_stat_replication AS
 SELECT
 S.pid,
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c4ef58..e27a8f3 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -284,6 +284,7 @@ vacuum(int options, RangeVar *relation, Oid relid, VacuumParams *params,
 		VacuumPageMiss = 0;
 		VacuumPageDirty = 0;
 
+		pgstat_report_activity_flag(ACTIVITY_IS_VACUUM);
 		/*
 		 * Loop to process each selected relation.
 		 */
@@ -325,6 +326,7 @@ vacuum(int options, RangeVar *relation, Oid relid, VacuumParams *params,
 	{
 		in_vacuum = false;
 		VacuumCostActive = false;
+		pgstat_reset_activity_flag();
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
@@ -355,6 +357,7 @@ vacuum(int options, RangeVar *relation, Oid relid, VacuumParams *params,
 		vac_update_datfrozenxid();
 	}
 
+	pgstat_reset_activity_flag();
 	/*
 	 * Clean up working storage --- note we must do this after
 	 * StartTransactionCommand, else we might be trying to delete the active
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2429889..1c74b51 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -439,9 +439,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 			   Relation *Irel, int nindexes, bool scan_

Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-11-19 Thread Rahila Syed
Hello Michael,

I am planning to continue contributing to this feature in any way, be it by
reviewing the patch or writing one, though I haven't been able to reply to
the comments or post an updated patch lately. I plan to do that soon.

Thank you,
Rahila

On Thu, Nov 19, 2015 at 1:09 PM, Michael Paquier 
wrote:

> On Thu, Nov 19, 2015 at 4:30 PM, Amit Langote
>  wrote:
> > On 2015/11/19 16:18, Amit Langote wrote:
> >> I'm marking this as "Waiting on author" in the commitfest app. Also,
> let's
> >> hear from more people.
> >
> > Well, it seems Michael beat me to it.
>
> Yeah, some other folks provided feedback that's why I did it.
> @Rahila: are you planning more work on this patch? It seems that at
> this point we're waiting for you. If not, and because I have a couple
> of times waited for long VACUUM jobs to finish on some relations
> without having much information about their progress, I would be fine
> to spend time hacking this thing. That's up to you.
> Regards,
> --
> Michael
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-09-11 Thread Rahila Syed
>This doesn't seem to compile
Oh. It compiled successfully when applied on HEAD on my machine. Anyway,
the OID is changed to 3309 in the attached patch. Both 3308 and 3309 are in
the unused OID list.

Thank you,
Rahila Syed


Vacuum_progress_checker_v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-08-31 Thread Rahila Syed
Hello,

On Jul 16, 2015 1:48 AM, "Rahila Syed" <rahilasye...@gmail.com> wrote:
>
> Hello,
>
> Please find attached updated patch with an interface to calculate
command progress in pgstat.c. This interface currently
implements VACUUM progress tracking.

I have added this patch to CommitFest 2015-09. It is marked as
"Waiting on author". I will post an updated patch as per the review comments
soon.

Thank you,
Rahila Syed


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-08-17 Thread Rahila Syed
In case of vacuum, I think we need to track the number of scanned heap
pages at least, and the information about index scan is the additional
information

Actually, the progress of the heap page scan depends on index scans. So
complete VACUUM progress needs to include a count of index pages scanned
too. Progress can then be calculated by measuring index_pages_scanned +
heap_pages_scanned against total_index_pages + total_heap_pages. This can
form the essential information.

This can be followed by additional individual phase information.

The following fields, common across different commands, can be used to
display progress:

Command   work done   total work   percent complete   message
VACUUM    x           y            z                  total progress
          u           v            w                  phase 1

The command code can be divided into distinct phases, and the progress of
each phase can be represented separately, with a summary of the entire
command's progress as the first entry. The summary can be the summation of
the individual phase entries.

If a phase repeats during command execution, the previous entry for the
phase will be replaced (for example, the index scan in vacuum).

Essential information has one numeric datum, which stores essential
information regarding its processing.
We may need more than one numeric datum, as mentioned above, to represent
scanned blocks versus total blocks.

Additional information has two data: text and numeric. These are
free-style data which each backend stores as it likes.
If I understand your point correctly, I think you are missing the following:
the amount of additional information for each command can be different. We
may need an array of text and numeric data to represent more additional
information.

Thank you,
Rahila Syed


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-08-09 Thread Rahila Syed
Hello,

Say, 6 bigint counters, 6 float8
counters, and 3 strings up to 80 characters each.  So we have a
fixed-size chunk of shared memory per backend, and each backend that
wants to expose progress information can fill in those fields however
it likes, and we expose the results.
This would be sorta like the way pg_statistic works: the same columns
can be used for different purposes depending on what estimator will be
used to access them.

After thinking more on this suggestion, I came up with following generic
structure which can be used to store progress of any command per backend in
shared memory.

struct PgBackendProgress
{
    int32   *counter[COMMAND_NUM_SLOTS];
    float8  *counter_float[COMMAND_NUM_SLOTS];
    char    *progress_message[COMMAND_NUM_SLOTS];
};

COMMAND_NUM_SLOTS will define the maximum number of slots (phases) for any
command.
The progress of a command will be measured using the progress of each of its
phases.
For some commands the number of phases can be singular, and the rest of the
slots will be NULL.

Each phase will report n integer counters, n float counters and a progress
message.
For some phases, any of the above fields can be NULL.

For VACUUM, there can be 3 phases as discussed in the earlier mails.

Phase 1. Report 2 integer counters: heap pages scanned and total heap
pages; 1 float counter: percentage_complete; and a progress message.
Phase 2. Report 2 integer counters: index pages scanned and total index
pages (across all indexes), and a progress message.
Phase 3. Report 1 integer counter: heap pages vacuumed.
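
To make the above concrete, here is a minimal, hypothetical sketch of how a
backend could fill the phase 1 slots of such a structure while scanning the
heap. The helper name is illustrative only; it assumes PostgreSQL's
int32/float8 typedefs, that the slot arrays have been allocated in shared
memory, and that PROGRESS_MESSAGE_LENGTH is the message buffer size used
elsewhere in this thread:

#include <stdio.h>

/* Sketch only: slot 0 is used for the heap-scan phase. */
static void
report_vacuum_heap_scan_progress(struct PgBackendProgress *progress,
                                 int scanned_heap_pages,
                                 int total_heap_pages)
{
    /* integer counters: pages scanned so far and total pages to scan */
    progress->counter[0][0] = scanned_heap_pages;
    progress->counter[0][1] = total_heap_pages;

    /* float counter: percentage of the heap scanned */
    progress->counter_float[0][0] = (total_heap_pages > 0) ?
        (double) scanned_heap_pages * 100.0 / total_heap_pages : 0.0;

    /* free-form progress message for this phase */
    snprintf(progress->progress_message[0], PROGRESS_MESSAGE_LENGTH,
             "scanning heap");
}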

This structure can be accessed by statistics collector to display progress
via new view.

Thank you,
Rahila Syed


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-08-01 Thread Rahila Syed
The total number of heap pages is known, and the total number of index
pages is also known, so it's possible to derive a percentage out of
this part.

The total number of index pages scanned during the entire vacuum will
depend on the number of index scans that happen.
In order to extrapolate percent complete for phase two (index scan) we need
number of index scans * total index pages.
The number of index scans can vary from 1 to n (n depending on
maintenance_work_mem).

Summarizing the suggestions in previous mails, the following information
can be reported:
Phase 1. heap pages scanned / total heap pages
Phase 2. index pages scanned / total index pages (across all indexes)
Phase 3. count of heap pages vacuumed

Additional info, like the number of index scans so far and the number of
dead tuples being vacuumed in one batch, can also be provided.

A combined estimate for vacuum percent complete can be provided by summing
up heap pages scanned and index pages scanned against total heap pages plus
total index pages * number of index scans, as sketched below.
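
Just to illustrate, a rough sketch of that combined estimate (all names are
mine, not from any patch; index_scan_count is the expected number of index
scans):

static double
vacuum_percent_complete(double heap_pages_scanned, double total_heap_pages,
                        double index_pages_scanned, double total_index_pages,
                        int index_scan_count)
{
    double total_work = total_heap_pages +
                        total_index_pages * index_scan_count;

    return (total_work > 0) ?
        100.0 * (heap_pages_scanned + index_pages_scanned) / total_work : 0.0;
}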

Thank you,
Rahila Syed


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-07-30 Thread Rahila Syed
I think it's related to the problem of figuring out how many dead tuples
you expect to find in the overall heap, which you need to do to have any
hope of this being a comprehensive estimate.

An  estimate of number of index scans while vacuuming can be done using
estimate of total dead tuples in the relation and maintenance work mem.
n_dead_tuples in pg_stat_all_tables can be used as an estimate of dead
tuples.

The following can be a way to estimate:

if nindexes == 0
    index_scans = 0
else if pages_all_visible
    index_scans = 0
else
    index_scans = Max((n_dead_tuples * space occupied by a single dead tuple) / m_w_m, 1)

This estimates index_scans = 1 if n_dead_tuples = 0, assuming lazy_scan_heap
is likely to find some dead tuples.
If n_dead_tuples is nonzero, the above estimate gives a lower bound on the
number of index scans possible.
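
A minimal C sketch of this estimate, assuming dead tuple TIDs are
accumulated as 6-byte ItemPointerData entries bounded by
maintenance_work_mem (PostgreSQL's c.h and storage/itemptr.h are assumed to
be included for Max, bool and ItemPointerData; the function and parameter
names are illustrative):

/* Sketch only: m_w_m_bytes is maintenance_work_mem expressed in bytes. */
static int
estimate_index_scans(int nindexes, bool pages_all_visible,
                     double n_dead_tuples, double m_w_m_bytes)
{
    double  tid_bytes;

    if (nindexes == 0 || pages_all_visible)
        return 0;

    /* each dead tuple costs one ItemPointerData (6 bytes) in the TID array */
    tid_bytes = n_dead_tuples * sizeof(ItemPointerData);

    /* at least one index scan if any dead tuples are expected at all */
    return Max((int) (tid_bytes / m_w_m_bytes), 1);
}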


Thank you,
Rahila Syed


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-07-15 Thread Rahila Syed
Hello,

Please find attached updated patch with an interface to calculate command
progress in pgstat.c. This interface currently implements VACUUM progress
 tracking .
A column named percent_complete has been added in pg_stat_activity to
report progress.

VACUUM calls the progress calculation interface periodically, at an interval
specified by the pgstat_track_progress GUC in milliseconds.
Progress calculation can be disabled by setting pgstat_track_progress to -1.

Remaining_time for VACUUM is not included in the current patch, to avoid
cluttering pg_stat_activity with too many columns.
But the estimate as seen in the previous implementation seems reasonable
enough to be included in the progress information, perhaps in a view
dedicated to vacuum progress information.

The GUC parameter 'pgstat_track_progress' is currently PGC_SUSET, in line
with the 'track_activities' GUC parameter. Although IMO pgstat_track_progress
could be made PGC_USERSET in order to give any user the flexibility to
enable/disable progress calculation, provided progress is tracked only if
the track_activities GUC parameter is enabled.

In this patch, index scans are not taken into account for progress
calculation as of now.

Thank you,
Rahila Syed.


Vacuum_progress_checker_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-07-01 Thread Rahila Syed
Hello,

Thank you for suggestions.

Yes, I suggest just a single column on pg_stat_activity called pct_complete

Reporting the remaining time can also be crucial for deciding whether to
continue or abort a VACUUM.
The same has been suggested in the thread below:

http://www.postgresql.org/message-id/13072.1284826...@sss.pgh.pa.us

trace_completion_interval = 5s (default)

Every interval, we report the current % complete for any operation that
supports it. We just show NULL if the current operation has not reported
anything or never will.

We do this for VACUUM first, then we can begin adding other operations as
we work out how (for that operation).

Thank you for explaining. This design seems good to me, except that adding
more than one column (percent_complete, remaining_time) to
pg_stat_activity, if required, may be less intuitive for users than having
a separate view for VACUUM.

-Rahila Syed

On Tue, Jun 30, 2015 at 2:02 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 30 June 2015 at 08:52, Pavel Stehule pavel.steh...@gmail.com wrote:


 I though about the possibilities of progress visualization - and one
 possibility is one or two special column in pg_stat_activity table - this
 info can be interesting for VACUUM started by autovacuum too.


 Yes, I suggest just a single column on pg_stat_activity called pct_complete

 trace_completion_interval = 5s (default)

 Every interval, we report the current % complete for any operation that
 supports it. We just show NULL if the current operation has not reported
 anything or never will.

 We do this for VACUUM first, then we can begin adding other operations as
 we work out how (for that operation).

 --
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-07-01 Thread Rahila Syed
Hello,

I though about the possibilities of progress visualization - and one
possibility is one or two special column in pg_stat_activity table - this
info can be interesting for VACUUM started by autovacuum too.

Thank you for the suggestion. The design with hooks and a separate view was
mainly to keep most of the code outside core, as the proposed feature is
specific to the VACUUM command. Also, having a separate view can give more
flexibility in terms of displaying various progress parameters.

FWIW, there was resistance to including columns in pg_stat_activity earlier
in the following thread:
http://www.postgresql.org/message-id/AANLkTi=TcuMA38oGUKX9p5WVPpY+M3L0XUp7=plt+...@mail.gmail.com

Thank you,
Rahila Syed

On Tue, Jun 30, 2015 at 1:22 PM, Pavel Stehule pavel.steh...@gmail.com
wrote:

 Hi

 2015-06-30 9:37 GMT+02:00 Rahila Syed rahilasye...@gmail.com:

 Hello Hackers,

 Following is a proposal for feature to calculate VACUUM progress.


 interesting idea - I like to see it integrated to core.



 Use Case : Measuring progress of long running VACUUMs to help DBAs make
 informed decision
 whether to continue running VACUUM or abort it.

 Design:

 A shared preload library to store progress information from different
 backends running VACUUM, calculate remaining time for each and display
 progress in the
 in the form a view.


 probably similar idea can be used for REINDEX, CREATE INDEX, COPY TO
 statements

 I though about the possibilities of progress visualization - and one
 possibility is one or two special column in pg_stat_activity table - this
 info can be interesting for VACUUM started by autovacuum too.

 Regards

 Pavel



 VACUUM  needs to be instrumented with a hook to collect progress
 information (pages vacuumed/scanned) periodically.

 The patch attached implements a new hook to store vacuumed_pages and
 scanned_pages count at the end of each page scanned by VACUUM.

 This information is stored in a shared memory structure.

 In addition to measuring progress this function using hook also
 calculates remaining time for VACUUM.



 The frequency of collecting progress information can be reduced by
 appending delays in between hook function calls.

 Also, a GUC parameter

 log_vacuum_min_duration can be used.

 This will cause VACUUM progress to be calculated only if VACUUM runs more
 than specified milliseconds.

 A value of zero calculates VACUUM progress for each page processed. -1
 disables logging.


 Progress calculation :


 percent_complete = scanned_pages * 100 / total_pages_to_be_scanned;

 remaining_time = elapsed_time * (total_pages_to_be_scanned -
 scanned_pages) / scanned_pages;


 Shared memory struct:

 typedef struct PgStat_VacuumStats
 {
   Oid     databaseoid;
   Oid     tableoid;
   int32   vacuumed_pages;
   int32   total_pages;
   int32   scanned_pages;
   double  elapsed_time;
   double  remaining_time;
 } PgStat_VacuumStats[max_connections];



 Reporting :

  A view named 'pg_maintenance_progress' can be created using the values
 in the struct above.

 pg_stat_maintenance can be called from any other backend and will display
 progress of

 each running VACUUM.


 Other uses of hook in VACUUM:


 The cost of VACUUM in terms of pages hit, missed and dirtied, the same as
 autovacuum, can be collected using this hook.

 Autovacuum does it at the end of VACUUM for each table. It can be done
 while VACUUM on a table is in progress.
 This can be helpful for tracking manual VACUUMs as well, not just autovacuum.

 Read/Write (I/O) rates can be computed along the lines of autovacuum.
 Read rate patterns can be used to help tune future vacuums on the
 table (like shared buffers tuning).
 Other resource usages can also be collected using the progress checker hook.


 Attached patch is POC patch of progress calculation for a single backend.

 Also attached is a brief snapshot of the output log.




 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers





Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-03-10 Thread Rahila Syed
Hello,

I have some minor comments

The comments have been implemented in the attached patch.

I think that extra parenthesis should be used for the first expression
with BKPIMAGE_HAS_HOLE.
Parentheses have been added to improve code readability.

Thank you,
Rahila Syed


On Mon, Mar 9, 2015 at 5:38 PM, Michael Paquier michael.paqu...@gmail.com
wrote:

 On Mon, Mar 9, 2015 at 4:29 PM, Fujii Masao masao.fu...@gmail.com wrote:
  On Thu, Mar 5, 2015 at 10:08 PM, Michael Paquier
  michael.paqu...@gmail.com wrote:
  On Thu, Mar 5, 2015 at 9:14 PM, Syed, Rahila rahila.s...@nttdata.com
 wrote:
  Please find attached  a patch. As discussed, flag to denote
 compression and presence of hole in block image has been added in
 XLogRecordImageHeader rather than block header.
 
  Thanks for updating the patch! Attached is the refactored version of the
 patch.

 Cool. Thanks!

 I have some minor comments:

 +The default value is literaloff/
 Dot at the end of this sentence.

 +Turning this parameter on can reduce the WAL volume without
 Turning valueon/ this parameter

 +but at the cost of some extra CPU time by the compression during
 +WAL logging and the decompression during WAL replay.
 Isn't a verb missing here, for something like that:
 but at the cost of some extra CPU spent on the compression during WAL
 logging and on the decompression during WAL replay.

 + * This can reduce the WAL volume, but at some extra cost of CPU time
 + * by the compression during WAL logging.
 Er, similarly some extra cost of CPU spent on the compression

 +   if (blk-bimg_info  BKPIMAGE_HAS_HOLE 
 +   (blk-hole_offset == 0 ||
 +blk-hole_length == 0 ||
 I think that extra parenthesis should be used for the first expression
 with BKPIMAGE_HAS_HOLE.

 +   if (blk-bimg_info 
 BKPIMAGE_IS_COMPRESSED 
 +   blk-bimg_len == BLCKSZ)
 +   {
 Same here.

 +   /*
 +* cross-check that hole_offset == 0
 and hole_length == 0
 +* if the HAS_HOLE flag is set.
 +*/
 I think that you mean here that this happens when the flag is *not* set.

 +   /*
 +* If BKPIMAGE_HAS_HOLE and BKPIMAGE_IS_COMPRESSED,
 +* an XLogRecordBlockCompressHeader follows
 +*/
 Maybe a struct should be added for an XLogRecordBlockCompressHeader
 struct. And a dot at the end of the sentence should be added?

 Regards,
 --
 Michael


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Support-compression-full-page-writes-in-WAL_v25.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-03-02 Thread Rahila Syed
Hello,

When I test the WAL or replication related features, I usually run
make installcheck and pgbench against the master at the same time
after setting up the replication environment.
I will conduct these tests before sending updated version.

Seems this increases the header size of WAL record even if no backup block
image is included. Right?
Yes, this increases the header size of WAL record by 1 byte for every block
reference even if it has no backup block image.

Isn't it better to add the flag info about backup block image into
XLogRecordBlockImageHeader rather than XLogRecordBlockHeader
Yes, this will make the code extensible and readable, and will save a couple
of bytes per record.
But the current approach is to provide a chunk ID identifying different
xlog record fragments, like main data, block references, etc.
Currently, the block ID is used to identify record fragments, which can be
either XLR_BLOCK_ID_DATA_SHORT, XLR_BLOCK_ID_DATA_LONG or an actual block ID.
This can be replaced by a chunk ID to separate it from the block ID. The
block ID can be used to number the block fragments, whereas the chunk ID can
be used to distinguish between main data fragments and block references. The
chunk ID of block references can contain information about the presence of
data, image, hole and compression.
The chunk ID for main data fragments remains as it is. This approach provides
for readability and extensibility.

Thank you,
Rahila Syed

On Mon, Mar 2, 2015 at 3:43 PM, Fujii Masao masao.fu...@gmail.com wrote:

 On Fri, Feb 27, 2015 at 12:44 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  On Fri, Feb 27, 2015 at 8:01 AM, Michael Paquier
  michael.paqu...@gmail.com wrote:
  On Fri, Feb 27, 2015 at 6:54 AM, Rahila Syed rahilasye...@gmail.com
 wrote:
 Even this patch doesn't work fine. The standby emit the following
 error messages.
 
  Yes this bug remains unsolved. I am still working on resolving this.
 
  Following chunk IDs have been added in the attached patch as suggested
  upthread.
  +#define XLR_CHUNK_BLOCK_REFERENCE  0x10
  +#define XLR_CHUNK_BLOCK_HAS_IMAGE  0x04
  +#define XLR_CHUNK_BLOCK_HAS_DATA   0x08
 
  XLR_CHUNK_BLOCK_REFERENCE denotes chunk ID of block references.
  XLR_CHUNK_BLOCK_HAS_IMAGE is a replacement of BKPBLOCK_HAS_IMAGE
  and XLR_CHUNK_BLOCK_HAS DATA a replacement of BKPBLOCK_HAS_DATA.
 
  Before sending a new version, be sure that this get fixed by for
  example building up a master with a standby replaying WAL, and running
  make installcheck-world or similar. If the standby does not complain
  at all, you have good chances to not have bugs. You could also build
  with WAL_DEBUG to check record consistency.

 +1

 When I test the WAL or replication related features, I usually run
 make installcheck and pgbench against the master at the same time
 after setting up the replication environment.

  typedef struct XLogRecordBlockHeader
  {
 +uint8chunk_id;/* xlog fragment id */
  uint8id;/* block reference ID */

 Seems this increases the header size of WAL record even if no backup block
 image is included. Right? Isn't it better to add the flag info about backup
 block image into XLogRecordBlockImageHeader rather than
 XLogRecordBlockHeader?
 Originally we borrowed one or two bits from its existing fields to minimize
 the header size, but we can just add new flag field if we prefer
 the extensibility and readability of the code.

 Regards,

 --
 Fujii Masao



Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-02-23 Thread Rahila Syed
Hello,

Attached is a patch which has following changes,

As suggested above, the block ID in the xlog structs has been replaced by a
chunk ID. The chunk ID is used to distinguish between different types of
xlog record fragments,
like:
XLR_CHUNK_ID_DATA_SHORT
XLR_CHUNK_ID_DATA_LONG
XLR_CHUNK_BKP_COMPRESSED
XLR_CHUNK_BKP_WITH_HOLE

In block references, block ID follows the chunk ID. Here block ID retains
its functionality.
This approach increases the data by 1 byte for each block reference in an
xlog record. It separates the ID referring to different fragments of an xlog
record from the actual block ID, which is used to refer to block references
in the xlog record.

Following are the WAL numbers for each scenario:

                          WAL generated
  FPW compression on      121.652 MB
  FPW compression off     148.998 MB
  HEAD                    148.764 MB

Compression remains nearly the same as before. There is some difference in
WAL volume between HEAD and HEAD + patch + compression off. This difference
corresponds to the 1-byte increase with each block reference of an xlog
record.

Thank you,
Rahila Syed





On Wed, Feb 18, 2015 at 7:53 PM, Syed, Rahila rahila.s...@nttdata.com
wrote:

 Hello,

 I think we should change the xlog format so that the block_id (which
 currently is XLR_BLOCK_ID_DATA_SHORT/LONG or a actual block id) isn't the
 block id but something like XLR_CHUNK_ID. Which is used as is for
 XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks can be set to to
 XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED,
 XLR_CHUNK_BKP_REFERENCE... The BKP blocks will then follow, storing the
 block id following the chunk id.

 Yes, that'll increase the amount of data for a backup block by 1 byte,
 but I think that's worth it. I'm pretty sure we will be happy about the
 added extensibility pretty soon.

 To clarify my understanding of the above change,

 Instead of a block id to reference different fragments of an xlog record,
 a single-byte field chunk_id should be used. chunk_id will be the same as
 XLR_BLOCK_ID_DATA_SHORT/LONG for main data fragments.
 But for block references, it will store the following values in order to
 hold information about the backup blocks.
 #define XLR_CHUNK_BKP_COMPRESSED  0x01
 #define XLR_CHUNK_BKP_WITH_HOLE 0x02
 ...

 The new xlog format should look like follows,

 Fixed-size header (XLogRecord struct)
 Chunk_id(add a field before id field in XLogRecordBlockHeader struct)
 XLogRecordBlockHeader
 Chunk_id
  XLogRecordBlockHeader
 ...
 ...
 Chunk_id ( rename id field of the XLogRecordDataHeader struct)
 XLogRecordDataHeader[Short|Long]
  block data
  block data
  ...
  main data

 I will post a patch based on this.

 Thank you,
 Rahila Syed

 -Original Message-
 From: Andres Freund [mailto:and...@2ndquadrant.com]
 Sent: Monday, February 16, 2015 5:26 PM
 To: Syed, Rahila
 Cc: Michael Paquier; Fujii Masao; PostgreSQL mailing lists
 Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

 On 2015-02-16 11:30:20 +, Syed, Rahila wrote:
  - * As a trivial form of data compression, the XLOG code is aware that
  - * PG data pages usually contain an unused hole in the middle,
  which
  - * contains only zero bytes.  If hole_length  0 then we have removed
  - * such a hole from the stored data (and it's not counted in the
  - * XLOG record's CRC, either).  Hence, the amount of block data
  actually
  - * present is BLCKSZ - hole_length bytes.
  + * Block images are able to do several types of compression:
  + * - When wal_compression is off, as a trivial form of compression,
  + the
  + * XLOG code is aware that PG data pages usually contain an unused
 hole
  + * in the middle, which contains only zero bytes.  If length  BLCKSZ
  + * then we have removed such a hole from the stored data (and it is
  + * not counted in the XLOG record's CRC, either).  Hence, the amount
  + * of block data actually present is length bytes.  The hole offset
  + * on page is defined using hole_offset.
  + * - When wal_compression is on, block images are compressed using a
  + * compression algorithm without their hole to improve compression
  + * process of the page. length corresponds in this case to the
  + length
  + * of the compressed block. hole_offset is the hole offset of the
  + page,
  + * and the length of the uncompressed block is defined by
  + raw_length,
  + * whose data is included in the record only when compression is
  + enabled
  + * and with_hole is set to true, see below.
  + *
  + * is_compressed is used to identify if a given block image is
  + compressed
  + * or not. Maximum page size allowed on the system being 32k, the
  + hole
  + * offset cannot be more than 15-bit long so the last free bit is
  + used to
  + * store the compression state of block image. If the maximum page
  + size
  + * allowed is increased to a value higher than that, we should
  + consider
  + * increasing this structure size as well

Re: [HACKERS] Compression of full-page-writes

2015-01-09 Thread Rahila Syed
So this test can be used to evaluate how shorter records influence 
performance since the master waits for flush confirmation from the 
standby, right? 

Yes. This test can help measure the performance improvement due to reduced
I/O on the standby, as the master waits for the WAL records to be flushed on
the standby.

Isn't that GB and not MB? 
Yes. That is a typo. It should be GB.

How many FPWs have been generated and how many dirty buffers have been 
flushed for the 3 checkpoints of each test? 

Any data about the CPU activity? 
The above data is not available for this run. I will rerun the tests to
gather it.

Thank you,
Rahila Syed





--
View this message in context: 
http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5833389.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2015-01-08 Thread Rahila Syed
Hello, 

Below are performance numbers for synchronous replication with and without
FPW compression, using the latest version of the patch (version 14). The
patch helps improve performance considerably.

Both master and standby are on the same machine in order to get numbers
independent of network overhead.
The compression patch helps to increase tps by 10%. It also helps reduce
I/O to disk, latency and total runtime for a fixed number of transactions,
as shown below.
The compression of WAL is quite high, around 40%.

pgbench scale :1000
pgbench command : pgbench  -c 16 -j 16 -r -t 25 -M prepared 

To ensure that data is not highly compressible, empty filler columns were
altered using 
alter table pgbench_accounts alter column filler type text using 
gen_random_uuid()::text  

checkpoint_segments = 1024  
checkpoint_timeout =  5min 
fsync = on

                        Compression on               Compression off

WAL generated           23037180520 (~23.04 GB)      38196743704 (~38.20 GB)
TPS                     264.18                       239.34
Latency average         60.541 ms                    66.822 ms
Latency stddev          126.567 ms                   130.434 ms
Total writes to disk    145045.310 MB                192357.250 MB
Runtime                 15141.0 s                    16712.0 s


Server specifications:
Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD  450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Thank you,
Rahila Syed



--
View this message in context: 
http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5833315.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-18 Thread Rahila Syed
Isn't it better to allocate the memory for compression_scratch in
InitXLogInsert()
like hdr_scratch?

I think making compression_scratch a statically allocated global variable
is the result of the following earlier discussion:

http://www.postgresql.org/message-id/ca+tgmoaznbuwnls4bpwyqgqteeznoavy7kwdbm0a2-tbarn...@mail.gmail.com


Thank you,
Rahila Syed



On Thu, Dec 18, 2014 at 1:57 PM, Fujii Masao masao.fu...@gmail.com wrote:

 On Thu, Dec 18, 2014 at 2:21 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 
 
  On Wed, Dec 17, 2014 at 11:33 PM, Rahila Syed rahilasye...@gmail.com
  wrote:
 
  I had a look at code. I have few minor points,
 
  Thanks!
 
  +   bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
  +
  +   if (is_compressed)
  {
  -   rdt_datas_last-data = page;
  -   rdt_datas_last-len = BLCKSZ;
  +   /* compressed block information */
  +   bimg.length = compress_len;
  +   bimg.extra_data = hole_offset;
  +   bimg.extra_data |= XLR_BLCK_COMPRESSED_MASK;
 
  For consistency with the existing code , how about renaming the macro
  XLR_BLCK_COMPRESSED_MASK as BKPBLOCK_HAS_COMPRESSED_IMAGE on the lines
 of
  BKPBLOCK_HAS_IMAGE.
 
  OK, why not...
 
 
  +   blk-hole_offset = extra_data 
 ~XLR_BLCK_COMPRESSED_MASK;
  Here , I think that having the mask as BKPBLOCK_HOLE_OFFSET_MASK will be
  more indicative of the fact that lower 15 bits of extra_data field
 comprises
  of hole_offset value. This suggestion is also just to achieve
 consistency
  with the existing BKPBLOCK_FORK_MASK for fork_flags field.
 
  Yeah that seems clearer, let's define it as ~XLR_BLCK_COMPRESSED_MASK
  though.
 
  And comment typo
  +* First try to compress block, filling in the page hole
 with
  zeros
  +* to improve the compression of the whole. If the block is
  considered
  +* as incompressible, complete the block header information
 as
  if
  +* nothing happened.
 
  As hole is no longer being compressed, this needs to be changed.
 
  Fixed. As well as an additional comment block down.
 
  A couple of things noticed on the fly:
  - Fixed pg_xlogdump being not completely correct to report the FPW
  information
  - A couple of typos and malformed sentences fixed
  - Added an assertion to check that the hole offset value does not the bit
  used for compression status
  - Reworked docs, mentioning as well that wal_compression is off by
 default.
  - Removed stuff in pg_controldata and XLOG_PARAMETER_CHANGE (mentioned by
  Fujii-san)

 Thanks!

 +else
 +memcpy(compression_scratch, page, page_len);

 I don't think the block image needs to be copied to scratch buffer here.
 We can try to compress the page directly.

 +#include utils/pg_lzcompress.h
  #include utils/memutils.h

 pg_lzcompress.h should be after meutils.h.

 +/* Scratch buffer used to store block image to-be-compressed */
 +static char compression_scratch[PGLZ_MAX_BLCKSZ];

 Isn't it better to allocate the memory for compression_scratch in
 InitXLogInsert()
 like hdr_scratch?

 +uncompressed_page = (char *) palloc(PGLZ_RAW_SIZE(header));

 Why don't we allocate the buffer for uncompressed page only once and
 keep reusing it like XLogReaderState-readBuf? The size of uncompressed
 page is at most BLCKSZ, so we can allocate the memory for it even before
 knowing the real size of each block image.

 -printf( (FPW); hole: offset: %u, length: %u\n,
 -   record-blocks[block_id].hole_offset,
 -   record-blocks[block_id].hole_length);
 +if (record-blocks[block_id].is_compressed)
 +printf( (FPW); hole offset: %u, compressed length
 %u\n,
 +   record-blocks[block_id].hole_offset,
 +   record-blocks[block_id].bkp_len);
 +else
 +printf( (FPW); hole offset: %u, length: %u\n,
 +   record-blocks[block_id].hole_offset,
 +   record-blocks[block_id].bkp_len);

 We need to consider what info about FPW we want pg_xlogdump to report.
 I'd like to calculate how much bytes FPW was compressed, from the report
 of pg_xlogdump. So I'd like to see also the both length of uncompressed FPW
 and that of compressed one in the report.

 In pg_config.h, the comment of BLCKSZ needs to be updated? Because
 the maximum size of BLCKSZ can be affected by not only itemid but also
 XLogRecordBlockImageHeader.

  boolhas_image;
 +boolis_compressed;

 Doesn't ResetDecoder need to reset is_compressed?

 +#wal_compression = off# enable compression of full-page writes

 Currently wal_compression compresses only FPW, so isn't it better to place
 it after full_page_writes in postgresql.conf.sample?

 +uint16extra_data;/* used to store offset

Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-17 Thread Rahila Syed
Hello,

Patches as well as results are attached.

I had a look at the code. I have a few minor points:

+   bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
+
+   if (is_compressed)
{
-   rdt_datas_last->data = page;
-   rdt_datas_last->len = BLCKSZ;
+   /* compressed block information */
+   bimg.length = compress_len;
+   bimg.extra_data = hole_offset;
+   bimg.extra_data |= XLR_BLCK_COMPRESSED_MASK;

For consistency with the existing code, how about renaming the macro
XLR_BLCK_COMPRESSED_MASK to BKPBLOCK_HAS_COMPRESSED_IMAGE, on the lines of
BKPBLOCK_HAS_IMAGE?

+   blk->hole_offset = extra_data & ~XLR_BLCK_COMPRESSED_MASK;
Here, I think that having the mask as BKPBLOCK_HOLE_OFFSET_MASK would be
more indicative of the fact that the lower 15 bits of the extra_data field
comprise the hole_offset value. This suggestion is also just to achieve
consistency with the existing BKPBLOCK_FORK_MASK for the fork_flags field.

And comment typo
+* First try to compress block, filling in the page hole with
zeros
+* to improve the compression of the whole. If the block is
considered
+* as incompressible, complete the block header information as
if
+* nothing happened.

As the hole is no longer being compressed, this needs to be changed.

Thank you,
Rahila Syed







On Tue, Dec 16, 2014 at 10:04 PM, Michael Paquier michael.paqu...@gmail.com
 wrote:



 On Wed, Dec 17, 2014 at 12:00 AM, Michael Paquier 
 michael.paqu...@gmail.com wrote:

 Actually, the original length of the compressed block in saved in
 PGLZ_Header, so if we are fine to not check the size of the block
 decompressed when decoding WAL we can do without the hole filled with
 zeros, and use only 1 bit to see if the block is compressed or not.

 And.. After some more hacking, I have been able to come up with a patch
 that is able to compress blocks without the page hole, and that keeps the
 WAL record length the same as HEAD when compression switch is off. The
 numbers are pretty good, CPU is saved in the same proportions as previous
 patches when compression is enabled, and there is zero delta with HEAD when
 compression switch is off.

 Here are the actual numbers:
test_name   | pg_size_pretty | user_diff | system_diff
 ---++---+-
  FPW on + 2 bytes, ffactor 50  | 582 MB | 42.391894 |0.807444
  FPW on + 2 bytes, ffactor 20  | 229 MB | 14.330304 |0.729626
  FPW on + 2 bytes, ffactor 10  | 117 MB |  7.335442 |0.570996
  FPW off + 2 bytes, ffactor 50 | 746 MB | 25.330391 |1.248503
  FPW off + 2 bytes, ffactor 20 | 293 MB | 10.537475 |0.755448
  FPW off + 2 bytes, ffactor 10 | 148 MB |  5.762775 |0.763761
  FPW on + 0 bytes, ffactor 50  | 582 MB | 42.174297 |0.790596
  FPW on + 0 bytes, ffactor 20  | 229 MB | 14.424233 |0.770459
  FPW on + 0 bytes, ffactor 10  | 117 MB |  7.057195 |0.584806
  FPW off + 0 bytes, ffactor 50 | 746 MB | 25.261998 |1.054516
  FPW off + 0 bytes, ffactor 20 | 293 MB | 10.589888 |0.860207
  FPW off + 0 bytes, ffactor 10 | 148 MB |  5.827191 |0.874285
  HEAD, ffactor 50  | 746 MB | 25.181729 |1.133433
  HEAD, ffactor 20  | 293 MB |  9.962242 |0.765970
  HEAD, ffactor 10  | 148 MB |  5.693426 |0.775371
  Record, ffactor 50| 582 MB | 54.904374 |0.678204
  Record, ffactor 20| 229 MB | 19.798268 |0.807220
  Record, ffactor 10| 116 MB |  9.401877 |0.668454
 (18 rows)

 The new tests of this patch are FPW off + 0 bytes. Patches as well as
 results are attached.
 Regards,
 --
 Michael



[HACKERS] Possibly a comment typo in xlogrecord.h

2014-12-16 Thread Rahila Syed
Hello,

The comment before declaration of XLogRecordBlockHeader says

* 'data_length' is the length of the payload data associated with this,
 * and includes the possible full-page image, and rmgr-specific data. It

IIUC, data_length does not include associated full page image length.
Attached patch changes the comment as follows:

- * and includes the possible full-page image, and rmgr-specific data. It
- * does not include the XLogRecordBlockHeader struct itself.
+ * and includes the rmgr-specific data. It does not include the possible
+ * full page image and XLogRecordBlockHeader struct itself.

Thank you,
Rahila Syed


correct_comment_typo_XLogRecordBlockHeader.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-12 Thread Rahila Syed
Hello,

Well, the larger question is why wouldn't we just have the user compress
the entire WAL file before archiving --- why have each backend do it?
Is it the write volume we are saving?

IIUC, the idea here is not only to save the on-disk size of WAL but also to
reduce the overhead of flushing WAL records to disk on servers with heavy
write activity. So yes, improving performance by saving write volume is
part of the requirement.

Thank you,
Rahila Syed


On Fri, Dec 12, 2014 at 7:48 PM, Bruce Momjian br...@momjian.us wrote:

 On Fri, Dec 12, 2014 at 08:27:59AM -0500, Robert Haas wrote:
  On Thu, Dec 11, 2014 at 11:34 AM, Bruce Momjian br...@momjian.us
 wrote:
   compression = 'on'  : 1838 secs
   = 'off' : 1701 secs
  
   Different is around 140 secs.
  
   OK, so the compression took 2x the cpu and was 8% slower.  The only
   benefit is WAL files are 35% smaller?
 
  Compression didn't take 2x the CPU.  It increased user CPU from 354.20
  s to 562.67 s over the course of the run, so it took about 60% more
  CPU.
 
  But I wouldn't be too discouraged by that.  At least AIUI, there are
  quite a number of users for whom WAL volume is a serious challenge,
  and they might be willing to pay that price to have less of it.  Also,
  we have talked a number of times before about incorporating Snappy or
  LZ4, which I'm guessing would save a fair amount of CPU -- but the
  decision was made to leave that out of the first version, and just use
  pg_lz, to keep the initial patch simple.  I think that was a good
  decision.

 Well, the larger question is why wouldn't we just have the user compress
 the entire WAL file before archiving --- why have each backend do it?
 Is it the write volume we are saving?  I though this WAL compression
 gave better performance in some cases.

 --
   Bruce Momjian  br...@momjian.ushttp://momjian.us
   EnterpriseDB http://enterprisedb.com

   + Everyone has their own god. +



Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-10 Thread Rahila Syed
What I would suggest is instrument the backend with getrusage() at
startup and shutdown and have it print the difference in user time and
system time.  Then, run tests for a fixed number of transactions and
see how the total CPU usage for the run differs.
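
For reference, the instrumentation used is roughly as follows (a minimal
standalone sketch; where exactly the two calls are hooked in, i.e. backend
start and exit, is an assumption for illustration):

#include <stdio.h>
#include <sys/resource.h>

/* Sketch only: getrusage() is sampled at start and at exit, and the
 * user/system CPU deltas are printed. */
static struct rusage startup_usage;

static void
remember_startup_usage(void)
{
    getrusage(RUSAGE_SELF, &startup_usage);
}

static void
print_usage_delta(void)
{
    struct rusage end_usage;
    double user_diff, system_diff;

    getrusage(RUSAGE_SELF, &end_usage);
    user_diff = (end_usage.ru_utime.tv_sec - startup_usage.ru_utime.tv_sec) +
        (end_usage.ru_utime.tv_usec - startup_usage.ru_utime.tv_usec) / 1e6;
    system_diff = (end_usage.ru_stime.tv_sec - startup_usage.ru_stime.tv_sec) +
        (end_usage.ru_stime.tv_usec - startup_usage.ru_stime.tv_usec) / 1e6;

    fprintf(stderr, "user diff: %f, system diff: %f\n", user_diff, system_diff);
}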

Following are the numbers obtained from tests with absolute CPU usage, a
fixed number of transactions and a longer duration, using the latest FPW
compression patch.

pgbench command : pgbench  -r -t 25 -M prepared

To ensure that data is not highly compressible, empty filler columns were
altered using

alter table pgbench_accounts alter column filler type text using
gen_random_uuid()::text

checkpoint_segments = 1024
checkpoint_timeout =  5min
fsync = on

The tests ran for around 30 mins. A manual checkpoint was run before each test.

Compression           on                                        off

WAL generated         1531.4 MB                                 2373.1 MB
%compression          ~35 %                                     -
Latency-avg           7.351 ms                                  6.781 ms
CPU usage (seconds)   user diff: 562.67s system diff: 41.40s    user diff: 354.20s system diff: 39.67s
TPS                   135.96                                    147.40
Latency stddev        13.759 ms                                 14.152 ms

The compression obtained is quite high, close to 35%.
CPU usage at the user level when compression is on is noticeably higher than
when compression is off. But the gain in terms of WAL reduction is also
high.

Server specifications:
Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD  450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm



Thank you,

Rahila Syed





On Fri, Dec 5, 2014 at 10:38 PM, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Dec 5, 2014 at 1:49 AM, Rahila Syed rahilasyed...@gmail.com
 wrote:
 If that's really true, we could consider having no configuration any
 time, and just compressing always.  But I'm skeptical that it's
 actually true.
 
  I was referring to this for CPU utilization:
 
 http://www.postgresql.org/message-id/1410414381339-5818552.p...@n5.nabble.com
 
  The above tests were performed on machine with configuration as follows
  Server specifications:
  Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
  RAM: 32GB
  Disk : HDD  450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
  1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

 I think that measurement methodology is not very good for assessing
 the CPU overhead, because you are only measuring the percentage CPU
 utilization, not the absolute amount of CPU utilization.  It's not
 clear whether the duration of the tests was the same for all the
 configurations you tried - in which case the number of transactions
 might have been different - or whether the number of operations was
 exactly the same - in which case the runtime might have been
 different.  Either way, it could obscure an actual difference in
 absolute CPU usage per transaction.  It's unlikely that both the
 runtime and the number of transactions were identical for all of your
 tests, because that would imply that the patch makes no difference to
 performance; if that were true, you wouldn't have bothered writing
 it

 What I would suggest is instrument the backend with getrusage() at
 startup and shutdown and have it print the difference in user time and
 system time.  Then, run tests for a fixed number of transactions and
 see how the total CPU usage for the run differs.

 Last cycle, Amit Kapila did a bunch of work trying to compress the WAL
 footprint for updates, and we found that compression was pretty darn
 expensive there in terms of CPU time.  So I am suspicious of the
 finding that it is free here.  It's not impossible that there's some
 effect which causes us to recoup more CPU time than we spend
 compressing in this case that did not apply in that case, but the
 projects are awfully similar, so I tend to doubt it.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-10 Thread Rahila Syed
I am sorry but I can't understand the above results due to wrapping.
Are you saying compression was twice as slow?

CPU usage at the user level (in seconds) with compression set to 'on' is 562
secs, while with compression set to 'off' it is 354 secs. As per the
readings, it takes a little less than double the CPU time to compress.
However, the total time taken to run the 25 transactions in each scenario
is as follows:

compression = 'on'  : 1838 secs
compression = 'off' : 1701 secs

The difference is around 140 secs.

Thank you,
Rahila Syed


On Wed, Dec 10, 2014 at 7:55 PM, Bruce Momjian br...@momjian.us wrote:

 On Wed, Dec 10, 2014 at 07:40:46PM +0530, Rahila Syed wrote:
  The tests ran for around 30 mins.Manual checkpoint was run before each
 test.
 
  Compression   WAL generated%compressionLatency-avg   CPU usage
  (seconds)  TPS
  Latency
  stddev
 
 
  on  1531.4 MB  ~35 %  7.351 ms

user diff: 562.67s system diff: 41.40s  135.96

13.759 ms
 
 
  off  2373.1 MB 6.781
 ms
user diff: 354.20s  system diff: 39.67s147.40

14.152 ms
 
  The compression obtained is quite high close to 35 %.
  CPU usage at user level when compression is on is quite noticeably high
 as
  compared to that when compression is off. But gain in terms of reduction
 of WAL
  is also high.

 I am sorry but I can't understand the above results due to wrapping.
 Are you saying compression was twice as slow?

 --
   Bruce Momjian  br...@momjian.ushttp://momjian.us
   EnterpriseDB http://enterprisedb.com

   + Everyone has their own god. +



Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-07 Thread Rahila Syed
The important point to consider for this patch is the use of the
additional 2-bytes as uint16 in the block information structure to save the
length of a compressed
block, which may be compressed without its hole to achieve a double level
of compression (image compressed without its hole). We may use a simple
flag on
one or two bits using for example a bit from hole_length, but in this case
we would need to always compress images with their hole included, something
more
 expensive as the compression would take more time.
As you have mentioned here the idea of using bits from existing fields
rather than adding an additional 2 bytes in the header, FWIW, elaborating
slightly on the way it was done in the initial patches, we can use the
following struct:

 unsigned    hole_offset:15,
             compress_flag:2,
             hole_length:15;

Here compress_flag can be 0 or 1 depending on the status of compression. We
can reduce compress_flag to just a 1-bit flag.

IIUC, the purpose of adding the compress_len field in the latest patch is
to store the length of compressed blocks, which is used at the time of
decoding the blocks.

With this approach, the length of a compressed block can be stored in
hole_length as:

 hole_length = BLCKSZ - compress_len

Thus, hole_length can serve the purpose of storing the length of a
compressed block without the need for the additional 2 bytes. In
DecodeXLogRecord, hole_length can be used for tracking the length of data
received for both compressed and uncompressed blocks.

As you already mentioned, this will need compressing images with their hole,
but we can MemSet the hole to 0 in order to make compression of the hole
less expensive and more effective.
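
For clarity, here is a tiny sketch of the encode/decode relationship under
this scheme; the macro names are illustrative only and not taken from any
posted patch:

/* Illustrative only: fold the compressed length into hole_length so that
 * no extra 2-byte field is needed in the block image header. */
#define BkpImageSetCompressedLen(hdr, len) \
    ((hdr)->hole_length = BLCKSZ - (len))
#define BkpImageCompressedLen(hdr) \
    (BLCKSZ - (hdr)->hole_length)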



Thank you,

Rahila Syed

On Sat, Dec 6, 2014 at 7:37 PM, Michael Paquier michael.paqu...@gmail.com
wrote:


 On Sat, Dec 6, 2014 at 12:17 AM, Andres Freund and...@2ndquadrant.com
 wrote:

 On 2014-12-06 00:10:11 +0900, Michael Paquier wrote:
  On Sat, Dec 6, 2014 at 12:06 AM, Michael Paquier 
 michael.paqu...@gmail.com
  wrote:
 
  
  
  
   On Fri, Dec 5, 2014 at 11:10 PM, Rahila Syed rahilasyed...@gmail.com
 
   wrote:
  
   I attempted quick review and could not come up with much except this
  
   +   /*
   +* Calculate the amount of FPI data in the record. Each backup
 block
   +* takes up BLCKSZ bytes, minus the hole length.
   +*
   +* XXX: We peek into xlogreader's private decoded backup blocks
 for
   the
   +* hole_length. It doesn't seem worth it to add an accessor
 macro for
   +* this.
   +*/
   +   fpi_len = 0;
   +   for (block_id = 0; block_id = record-max_block_id; block_id++)
   +   {
   +   if (XLogRecHasCompressedBlockImage(record, block_id))
   +   fpi_len += BLCKSZ -
 record-blocks[block_id].compress_len;
  
  
   IIUC, fpi_len in case of compressed block image should be
  
   fpi_len = record-blocks[block_id].compress_len;
  
   Yep, true. Patches need a rebase btw as Heikki fixed a commit related
 to
   the stats of pg_xlogdump.
  
 
  In any case, any opinions to switch this patch as Ready for committer?

 Needing a rebase is a obvious conflict to that... But I guess some wider
 looks afterwards won't hurt.


 Here are rebased versions, which are patches 1 and 2. And I am switching
 as well the patch to Ready for Committer. The important point to consider
 for this patch is the use of the additional 2-bytes as uint16 in the block
 information structure to save the length of a compressed block, which may
 be compressed without its hole to achieve a double level of compression
 (image compressed without its hole). We may use a simple flag on one or two
 bits using for example a bit from hole_length, but in this case we would
 need to always compress images with their hole included, something more
 expensive as the compression would take more time.

 Robert wrote:
  What I would suggest is instrument the backend with getrusage() at
  startup and shutdown and have it print the difference in user time and
  system time.  Then, run tests for a fixed number of transactions and
  see how the total CPU usage for the run differs.
 That's a nice idea, which is done with patch 3 as a simple hack calling
 twice getrusage at the beginning of PostgresMain and before proc_exit,
 calculating the difference time and logging it for each process (used as
 well log_line_prefix with %p).

 Then I just did a small test with a load of a pgbench-scale-100 database
 on fresh instances:
 1) Compression = on:
 Stop LSN: 0/487E49B8
 getrusage: proc 11163: LOG:  user diff: 63.071127, system diff: 10.898386
 pg_xlogdump: FPI size: 122296653 [90.52%]
 2) Compression = off
 Stop LSN: 0/4E54EB88
 Result: proc 11648: LOG:  user diff: 43.855212, system diff: 7.857965
 pg_xlogdump: FPI size: 204359192 [94.10%]
 And the CPU consumption is showing quite some difference... I'd expect as
 well pglz_compress to show up high in a perf profile for this case (don't
 have the time to do that now, but a perf record -a -g would

Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-05 Thread Rahila Syed
I attempted a quick review and could not come up with much except this:

+   /*
+* Calculate the amount of FPI data in the record. Each backup block
+* takes up BLCKSZ bytes, minus the hole length.
+*
+* XXX: We peek into xlogreader's private decoded backup blocks for the
+* hole_length. It doesn't seem worth it to add an accessor macro for
+* this.
+*/
+   fpi_len = 0;
+   for (block_id = 0; block_id <= record->max_block_id; block_id++)
+   {
+   if (XLogRecHasCompressedBlockImage(record, block_id))
+   fpi_len += BLCKSZ - record->blocks[block_id].compress_len;


IIUC, fpi_len in the case of a compressed block image should be

fpi_len = record->blocks[block_id].compress_len;


Thank you,
Rahila Syed 




--
View this message in context: 
http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5829403.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-04 Thread Rahila Syed
If that's really true, we could consider having no configuration any 
time, and just compressing always.  But I'm skeptical that it's 
actually true.

I was referring to this for CPU utilization: 
http://www.postgresql.org/message-id/1410414381339-5818552.p...@n5.nabble.com

The above tests were performed on machine with configuration as follows
Server specifications:
Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD  450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Thank you,
Rahila Syed



--
View this message in context: 
http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5829339.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-11-27 Thread Rahila Syed
if (!fullPageWrites)
{
    WALInsertLockAcquireExclusive();
    Insert->fullPageWrites = fullPageWrites;
    WALInsertLockRelease();
}

As fullPageWrites is not a boolean, isn't it better to change the if
condition to fullPageWrites == FULL_PAGE_WRITES_OFF? As it is done in the
if condition above, this seems to be a miss.

doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);

IIUC, doPageWrites is true when fullPageWrites is either 'on' or 'compress'.
Considering Insert->fullPageWrites is an int now, I think it is better to
explicitly write the above as:

doPageWrites = (Insert->fullPageWrites != FULL_PAGE_WRITES_OFF ||
                Insert->forcePageWrites);

The patch attached has the above changes. Also, it initializes
doPageCompression in InitXLOGAccess as per the earlier discussion.

I have attached the changes separately as changes.patch.
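
For reference, the intent of the two rewritten checks can be put in a small
self-contained sketch; the enum values below are illustrative stand-ins and
may not match the patch exactly:

/* Illustrative three-valued setting: doPageWrites must be true for both
 * 'compress' and 'on', so only the OFF / not-OFF distinction matters. */
typedef enum
{
    FULL_PAGE_WRITES_OFF = 0,
    FULL_PAGE_WRITES_COMPRESS,
    FULL_PAGE_WRITES_ON
} FullPageWritesSketch;

typedef struct
{
    FullPageWritesSketch fullPageWrites;
    int                  forcePageWrites;
} XLogInsertSketch;

static int
compute_doPageWrites(const XLogInsertSketch *Insert)
{
    /* explicit comparison instead of relying on zero/non-zero */
    return (Insert->fullPageWrites != FULL_PAGE_WRITES_OFF ||
            Insert->forcePageWrites);
}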

Thank you,

Rahila Syed


changes.patch
Description: Binary data


0002-Support-compression-for-full-page-writes-in-WAL.patch
Description: Binary data



Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-11-11 Thread Rahila Syed
I think this was changed based on following, if I am not wrong. 

http://www.postgresql.org/message-id/54297A45.8080904@...
Yes this change is the result of the above complaint.

Attaching the compression status to XLogRecord is more 
in-line with the fact that all the blocks are compressed, and not each 
one individually, so we basically now duplicate an identical flag 
value in all the backup block headers, which is a waste IMO. 
Thoughts? 

If I understand your point correctly, as all blocks are compressed, adding a
compression attribute to XLogRecord surely makes more sense if the record
contains backup blocks. But in the case of XLOG records without backup blocks,
the compression attribute in the record header might not make much sense.

Attaching the status of compression to XLogRecord will mean that the status
is duplicated across all records. It will mean that it is an attribute of
all the records when it is only an attribute of records with backup blocks
or the attribute of backup blocks. 
The current approach is adopted with this thought.

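To make the trade-off concrete, here is a rough sketch of the two placements
being compared; both structs are simplified illustrations, not the actual WAL
headers:

/* Option A: a flag in the record header -- every WAL record carries it,
 * including records that have no backup blocks at all. */
typedef struct
{
    unsigned int  xl_tot_len;
    unsigned char xl_info;      /* a spare bit here would mean "FPWs compressed" */
} RecordHeaderSketch;

/* Option B: a flag in each backup block header -- only records that actually
 * carry full-page images pay for it, which is roughly what the current
 * approach described above does. */
typedef struct
{
    unsigned short hole_offset;
    unsigned short hole_length;
    unsigned char  block_compression;   /* 'off', or the algorithm used */
} BkpBlockHeaderSketch;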

Regards,
Rahila Syed







Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-11-03 Thread Rahila Syed
Hello ,

Please find updated patch with the review comments given above implemented.
The compressed data now includes all backup blocks and their headers except
the header of first backup block in WAL record. The first backup block
header in WAL record is used to store the compression information. This is
done in order to avoid adding compression information in WAL record header.

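Roughly, the layout this describes looks like the following (illustrative
only; the exact field placement is in the patch):

 Fixed-size header (XLogRecord struct)
 rmgr-specific data
 BkpBlock header of block #1      <- kept uncompressed, carries the compression info
 compressed payload containing:
     data of backup block #1
     BkpBlock header of block #2
     data of backup block #2
     ...
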
Memory allocation on SIGHUP in autovacuum is still pending. I am working on it.


Thank you,
Rahila Syed





On Tue, Oct 28, 2014 at 8:51 PM, Rahila Syed rahilasyed...@gmail.com
wrote:


 Do we release the buffers for compressed data when fpw is changed from
 compress to on?
  The current code does not do this.
 Don't we need to do that?
 Yes this needs to be done in order to avoid memory leak when compression is
 turned off at runtime while the backend session is running.

 You don't need to make the processes except the startup process allocate
 the memory for uncompressedPages when fpw=on. Only the startup process
 uses it for the WAL decompression
 I see. fpw != on check can be put at the time of memory allocation of
 uncompressedPages in the backend code . And at the time of recovery
 uncompressedPages can be allocated separately if not already allocated.

 BTW, what happens if the memory allocation for uncompressedPages for
 the recovery fails?
 The current code does not handle this. This will be rectified.

 Which would prevent the recovery at all, so PANIC should
 happen in that case?
 IIUC, instead of reporting  PANIC , palloc can be used to allocate memory
 for uncompressedPages at the time of recovery which will throw ERROR and
 abort startup process in case of failure.


 Thank you,
 Rahila Syed






compress_fpw_v3.patch
Description: Binary data



Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-10-28 Thread Rahila Syed

Do we release the buffers for compressed data when fpw is changed from
compress to on? 
 The current code does not do this.
Don't we need to do that? 
Yes this needs to be done in order to avoid memory leak when compression is
turned off at runtime while the backend session is running.

You don't need to make the processes except the startup process allocate 
the memory for uncompressedPages when fpw=on. Only the startup process 
uses it for the WAL decompression
I see. fpw != on check can be put at the time of memory allocation of
uncompressedPages in the backend code . And at the time of recovery
uncompressedPages can be allocated separately if not already allocated.

BTW, what happens if the memory allocation for uncompressedPages for 
the recovery fails? 
The current code does not handle this. This will be rectified.

Which would prevent the recovery at all, so PANIC should 
happen in that case? 
IIUC, instead of reporting  PANIC , palloc can be used to allocate memory
for uncompressedPages at the time of recovery which will throw ERROR and
abort startup process in case of failure.

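A minimal sketch of the allocation rule being described; the names and sizes
are assumptions for illustration, and the real patch differs in detail:

/* Scratch space for decompressing full-page images.
 * - The startup process always needs it for WAL replay; palloc() is used
 *   there, so an allocation failure raises ERROR and aborts recovery.
 * - Regular backends only need it when full_page_writes is not plain 'on'.
 */
static char *uncompressedPages = NULL;

static void
alloc_uncompressed_pages(bool in_recovery, int fpw)
{
    if (uncompressedPages != NULL)
        return;                             /* already allocated */

    if (in_recovery)
        uncompressedPages = palloc(XLR_MAX_BKP_BLOCKS * BLCKSZ);
    else if (fpw != FULL_PAGE_WRITES_ON)
        uncompressedPages = malloc(XLR_MAX_BKP_BLOCKS * BLCKSZ);
}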

Thank you,
Rahila Syed







Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-10-16 Thread Rahila Syed
Hello,

Please find the updated patch attached.

1) I don't think it's a good idea to put the full page write compression
   into struct XLogRecord.

1. The compressed block data is of varlena type. Hence, VARATT_IS_COMPRESSED
can be used to detect if the datum is compressed. But it can give a false
positive when blocks are not compressed, because uncompressed blocks in a WAL
record are not of type varlena. If I understand correctly,
VARATT_IS_COMPRESSED looks for a particular bit pattern in the datum and,
when found, returns true irrespective of the type of the datum.

2. The BkpBlock header of the first block in a WAL record can be stored as is,
followed by the compressed data, which includes the block corresponding to the
first header plus the remaining headers and blocks. This header can then be
used to store a flag indicating whether the blocks are compressed. This seems
to be a feasible option, but it will increase the record by a few bytes,
equivalent to sizeof(BkpBlock), compared to the method of compressing all
blocks and headers.
Also, the full-page-write compression flag currently stored in the WAL record
occupies 1 byte of padding and hence does not increase the overall size. But at
the same time the compression attribute is related to backup blocks, hence it
makes more sense to have it in the BkpBlock header. The attached patch does not
include this yet, as it will be better to get consensus first.
Thoughts?

2) You've essentially removed a lot of checks about the validity of bkp
   blocks in xlogreader. I don't think that's acceptable

Check to see if size of compressed blocks agrees with the total size stored
on WAL record header is added in the patch attached. This serves as a check
to validate length of record.

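A simplified sketch of the kind of cross-check described; the names are
placeholders, and the real patch wires this into xlogreader's existing
record-validation path:

/* The compressed payload must fit inside the record, and after decompression
 * the recovered bytes must match the size implied by the record header
 * (the sum of BLCKSZ - hole_length over the backup blocks, plus headers). */
static bool
backup_blocks_length_ok(uint32 compressed_len, uint32 remaining_in_record,
                        uint32 decompressed_len, uint32 expected_raw_len)
{
    if (compressed_len > remaining_in_record)
        return false;
    return decompressed_len == expected_raw_len;
}
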
3) You have both FullPageWritesStr() and full_page_writes_str().

This has not been changed for now; the reason is that full_page_writes_str()
is the true/false version of the FullPageWritesStr macro. It is implemented
for backward compatibility with pg_xlogdump.

4) I don't like FullPageWritesIsNeeded(). For one it, at least to me,
7) Unless I miss something CompressBackupBlock should be plural, right?
   ATM it compresses all the blocks?
8) I don't like tests like if (fpw >= FULL_PAGE_WRITES_COMPRESS). That
   relies on the, less than intuitive, ordering of
   FULL_PAGE_WRITES_COMPRESS (=1) before FULL_PAGE_WRITES_ON (=2).
9) I think you've broken the case where we first think 1 block needs to
   be backed up, and another doesn't. If we then detect, after the
   START_CRIT_SECTION(), that we need to goto begin; orig_len will
   still have it's old content.
I have corrected these in the patch attached.

5) CompressBackupBlockPagesAlloc is declared static but not defined as
   such.
Have made it global now in order to be able to access it from PostgresMain.

6) You call CompressBackupBlockPagesAlloc() from two places. Neither is
   IIRC within a critical section. So you imo should remove the outOfMem
   handling and revert to palloc() instead of using malloc directly.
This has not been changed in the current patch; the reason is that the
out-of-memory handling is done in order to proceed without compression of FPW
in case sufficient memory is not available for compression.

 One
   thing worthy of note is that I don't think you currently can
   legally check fullPageWrites == FULL_PAGE_WRITES_ON when calling it
   only during startup as fullPageWrites can be changed at runtime

 In the attached patch, this check is also added in PostgresMain on SIGHUP
after processing postgresql.conf file.


Thank you,

Rahila Syed


On Mon, Sep 29, 2014 at 6:06 PM, Andres Freund and...@anarazel.de wrote:

 Hi,

 On 2014-09-22 10:39:32 +, Syed, Rahila wrote:
  Please find attached the patch to compress FPW.

 I've given this a quick look and noticed some things:
 1) I don't think it's a good idea to put the full page write compression
into struct XLogRecord.

 2) You've essentially removed a lot of checks about the validity of bkp
blocks in xlogreader. I don't think that's acceptable.

 3) You have both FullPageWritesStr() and full_page_writes_str().

 4) I don't like FullPageWritesIsNeeded(). For one it, at least to me,
sounds grammatically wrong. More importantly when reading it I'm
thinking of it being about the LSN check. How about instead directly
checking whatever != FULL_PAGE_WRITES_OFF?

 5) CompressBackupBlockPagesAlloc is declared static but not defined as
such.

 6) You call CompressBackupBlockPagesAlloc() from two places. Neither is
IIRC within a critical section. So you imo should remove the outOfMem
handling and revert to palloc() instead of using malloc directly. One
thing worthy of note is that I don't think you currently can
legally check fullPageWrites == FULL_PAGE_WRITES_ON when calling it
only during startup as fullPageWrites can be changed at runtime.

 7) Unless I miss something CompressBackupBlock should be plural, right?
ATM it compresses all the blocks?

 8) I don't like tests like if (fpw >= FULL_PAGE_WRITES_COMPRESS

Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-10-09 Thread Rahila Syed
Hello,

Thank you for review.

1) I don't think it's a good idea to put the full page write compression 
   into struct XLogRecord.

Full page write compression information can be stored in varlena struct of
compressed blocks as done for toast data in pluggable compression support
patch. If I understand correctly, it can be done similar to the manner in
which compressed Datum is modified to contain information about compression
algorithm in pluggable compression support patch.

2) You've essentially removed a lot of checks about the validity of bkp 
   blocks in xlogreader. I don't think that's acceptable

To ensure this, the raw size stored in the first four bytes of the compressed
datum can be used to perform error checking for backup blocks.
Currently, the error checking for the size of backup blocks happens
individually for each block.
If backup blocks are compressed together, it can happen once for the entire
set of backup blocks in a WAL record. The total raw size of the compressed
blocks can be checked against the total size stored in the WAL record header.

3) You have both FullPageWritesStr() and full_page_writes_str().

full_page_writes_str() is the true/false version of the FullPageWritesStr
macro. It is implemented for backward compatibility with pg_xlogdump.


4)I don't like FullPageWritesIsNeeded(). For one it, at least to me, 
   sounds grammatically wrong. More importantly when reading it I'm 
   thinking of it being about the LSN check. How about instead directly 
   checking whatever != FULL_PAGE_WRITES_OFF? 

I will modify this.

5) CompressBackupBlockPagesAlloc is declared static but not defined as 
   such. 
7) Unless I miss something CompressBackupBlock should be plural, right? 
   ATM it compresses all the blocks? 
I will correct these.

6)You call CompressBackupBlockPagesAlloc() from two places. Neither is 
  IIRC within a critical section. So you imo should remove the outOfMem 
   handling and revert to palloc() instead of using malloc directly. 

Yes, neither is in a critical section. The out-of-memory handling is done in
order to proceed without compression of FPW in case sufficient memory is not
available for compression.


Thank you,
Rahila Syed







Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-22 Thread Rahila Syed
Hello All,


Well, from Rahila's point of view the patch is actually attached, but
she's posting from the Nabble interface, which mangles it and turns into
a link instead.

Yes.



but the end result is the
same: to properly submit a patch, you need to send an email to the
pgsql-hackers@postgresql.org mailing list, not join a group/forum from
some intermediary newsgroup site that mirrors the list.


Thank you. I will take care of it henceforth.

Please find attached the patch to compress FPW.  Patch submitted by
Fujii-san earlier in the thread is used to merge compression GUC with
full_page_writes.



I am reposting the measurement numbers.

Server Specification:
Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Checkpoint segments: 1024
Checkpoint timeout: 5 mins

pgbench -c 64 -j 64 -r -T 900 -M prepared
Scale factor: 1000

                  WAL generated (MB)   Throughput (tps)   Latency (ms)
On                9235.43              979.03             65.36
Compress(pglz)    6518.68              1072.34            59.66
Off               501.04               1135.17            56.34

The results show around a 30 percent decrease in WAL volume due to
compression of FPW.

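(As a quick sanity check of the quoted numbers: (9235.43 - 6518.68) / 9235.43
is about 0.294, i.e. roughly a 29-30% reduction, which matches the claim above.)
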
Thank you ,

Rahila Syed

Tom Lane wrote:
 Rahila Syed rahilasyed...@gmail.com writes:
  Please find attached patch to compress FPW using pglz compression.

 Patch not actually attached AFAICS (no, a link is not good enough).

Well, from Rahila's point of view the patch is actually attached, but
she's posting from the Nabble interface, which mangles it and turns into
a link instead.  Not her fault, really -- but the end result is the
same: to properly submit a patch, you need to send an email to the
pgsql-hackers@postgresql.org mailing list, not join a group/forum from
some intermediary newsgroup site that mirrors the list.

--
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-19 Thread Rahila Syed
Hello,

Maybe.  Let's get the basic patch done first; then we can argue about that

Please find attached patch to compress FPW using pglz compression.
All backup blocks in a WAL record are compressed at once before the record is
inserted into WAL buffers. The full_page_writes GUC has been modified to accept
three values:
1.  On
2.  Compress
3.  Off
FPWs are compressed when full_page_writes is set to compress. FPWs generated
forcibly during online backup, even when full_page_writes is off, are also
compressed. When full_page_writes is set to on, FPWs are not compressed.
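
For illustration, the GUC side of this amounts to something like the sketch
below. It follows the usual config_enum_entry pattern in guc.c, but the exact
entries (in particular the hidden boolean aliases) are assumptions made for
the sake of the example, not necessarily what the patch does:

static const struct config_enum_entry full_page_writes_options[] = {
    {"off", FULL_PAGE_WRITES_OFF, false},
    {"compress", FULL_PAGE_WRITES_COMPRESS, false},
    {"on", FULL_PAGE_WRITES_ON, false},
    /* keep accepting the old boolean spellings */
    {"true", FULL_PAGE_WRITES_ON, true},
    {"false", FULL_PAGE_WRITES_OFF, true},
    {NULL, 0, false}
};
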
Benchmark:
Server Specification:
Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD  450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm
Checkpoint segments: 1024
Checkpoint timeout: 5 mins
pgbench -c 64 -j 64 -r -T 900 -M prepared
Scale factor: 1000

                  WAL generated (MB)   Throughput (tps)   Latency (ms)
On                9235.43              979.03             65.36
Compress(pglz)    6518.68              1072.34            59.66
Off               501.04               1135.17            56.34

The results show around a 30 percent decrease in WAL volume due to
compression of FPW.
compress_fpw_v1.patch
http://postgresql.1045698.n5.nabble.com/file/n5819645/compress_fpw_v1.patch  







Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-19 Thread Rahila Syed

Please find attached patch to compress FPW using pglz compression.
Please refer to the updated patch attached. The earlier patch added a few
duplicate lines of code in the guc.c file.

compress_fpw_v1.patch
http://postgresql.1045698.n5.nabble.com/file/n5819659/compress_fpw_v1.patch  


Thank you,
Rahila Syed







Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-10 Thread Rahila Syed
I will repeat the above tests with high load on CPU and using the benchmark
given by Fujii-san and post the results. 

Average % of CPU usage at user level for each of the compression algorithm
are as follows.

Compression    Multiple    Single

Off            81.1338     81.1267
LZ4            81.0998     81.1695
Snappy         80.9741     80.9703
Pglz           81.2353     81.2753
 
http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png
 
http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png
  
 
The numbers show that the CPU utilization of Snappy is the least. The CPU
utilization in decreasing order is
pglz > No compression > LZ4 > Snappy

The variance of the average CPU utilization numbers is very low. However,
snappy seems to be the best when it comes to lower CPU utilization.

As per the measurement results posted till date,

LZ4 outperforms snappy and pglz in terms of compression ratio and
performance. However, the CPU utilization numbers show snappy utilizes the
least amount of CPU. The difference is not much, though.

As there has been no consensus yet about which compression algorithm to
adopt, is it better to make this decision independent of the FPW compression
patch, as suggested earlier in this thread? FPW compression can be done
using the built-in pglz compression, as it shows a considerable gain over
uncompressed WAL and a good compression ratio.
Also, the patch to compress multiple blocks at once gives better compression
compared to a single block. ISTM that the performance overhead introduced by
multiple-block compression is slightly higher than single-block compression,
which can be tested again after modifying the patch to use pglz. Hence,
this patch can be built using multiple-block compression.

Thoughts?







Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-02 Thread Rahila Syed
Hello,

It'd be interesting to check avg cpu usage as well

I have collected average CPU utilization numbers by collecting sar output
at interval of 10 seconds  for following benchmark:

Server specifications:
Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD  450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Benchmark:

Scale : 16
Command  :java JR  /home/postgres/jdbcrunner-1.2/scripts/tpcc.js
 -sleepTime 550,250,250,200,200

Warmup time  : 1 sec
Measurement time : 900 sec
Number of tx types   : 5
Number of agents : 16
Connection pool size : 16
Statement cache size : 40
Auto commit  : false


Checkpoint segments:1024
Checkpoint timeout:5 mins


Average % of CPU utilization at user level for multiple blocks compression:

Compression Off  =  3.34133

Snappy = 3.41044

LZ4  = 3.59556

 Pglz = 3.66422


The numbers show that the average CPU utilization is in the following order:
pglz > LZ4 > Snappy > No compression.
Attached is the graph which gives plot of % CPU utilization versus time
elapsed for each of the compression algorithms.
Also, the overall CPU utilization during the tests is very low, i.e. below 10%.
The CPU remained idle for a large (~90%) percentage of the time. I will repeat
the above tests with high load on the CPU and using the benchmark given by
Fujii-san, and post the results.


Thank you,



On Wed, Aug 27, 2014 at 9:16 PM, Arthur Silva arthur...@gmail.com wrote:


 Em 26/08/2014 09:16, Fujii Masao masao.fu...@gmail.com escreveu:

 
  On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed rahilasye...@gmail.com
 wrote:
   Hello,
   Thank you for comments.
  
  Could you tell me where the patch for single block in one run is?
   Please find attached patch for single block compression in one run.
 
  Thanks! I ran the benchmark using pgbench and compared the results.
  I'd like to share the results.
 
  [RESULT]
  Amount of WAL generated during the benchmark. Unit is MB.
 
              Multiple    Single
  off          202.0       201.5
  on          6051.0      6053.0
  pglz        3543.0      3567.0
  lz4         3344.0      3485.0
  snappy      3354.0      3449.5
 
  Latency average during the benchmark. Unit is ms.
 
              Multiple    Single
  off           19.1        19.0
  on            55.3        57.3
  pglz          45.0        45.9
  lz4           44.2        44.7
  snappy        43.4        43.3
 
  These results show that FPW compression is really helpful for decreasing
  the WAL volume and improving the performance.
 
  The compression ratio by lz4 or snappy is better than that by pglz. But
  it's difficult to conclude which lz4 or snappy is best, according to
 these
  results.
 
  ISTM that compression-of-multiple-pages-at-a-time approach can compress
  WAL more than compression-of-single-... does.
 
  [HOW TO BENCHMARK]
  Create pgbench database with scale factor 1000.
 
  Change the data type of the column filler on each pgbench table
  from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's
  gen_random_uuid() in order to avoid empty column, e.g.,
 
   alter table pgbench_accounts alter column filler type text using
  gen_random_uuid()::text
 
  After creating the test database, run the pgbench as follows. The
  number of transactions executed during benchmark is almost same
  between each benchmark because -R option is used.
 
pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared
 
  checkpoint_timeout is 5min, so it's expected that checkpoint was
  executed at least two times during the benchmark.
 
  Regards,
 
  --
  Fujii Masao
 
 

 It'd be interesting to check avg cpu usage as well.




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-08-19 Thread Rahila Syed
Hello,
Thank you for comments.

Could you tell me where the patch for single block in one run is?
Please find attached patch for single block compression in one run.

Thank you,



On Tue, Aug 19, 2014 at 1:17 PM, Fujii Masao masao.fu...@gmail.com wrote:

 On Tue, Aug 19, 2014 at 2:08 AM, Andres Freund and...@2ndquadrant.com
 wrote:
  On 2014-08-18 13:06:15 -0400, Robert Haas wrote:
  On Mon, Aug 18, 2014 at 7:19 AM, Rahila Syed rahilasye...@gmail.com
 wrote:
  According to the measurement result, the amount of WAL generated in
   Multiple Blocks in one run is less than that in Single Block in one run.
  So ISTM that compression of multiple blocks at one run can improve
  the compression ratio. Am I missing something?
  
   Sorry for using unclear terminology. WAL generated here means WAL
 that gets
   generated in each run without compression.
   So, the value WAL generated in the  above measurement is uncompressed
 WAL
   generated to be specific.
   uncompressed WAL = compressed WAL  + Bytes saved.
  
   Here, the measurements are done for a constant amount of time rather
 than
   fixed number of transactions. Hence amount of WAL generated does not
   correspond to compression ratios of each algo. Hence have calculated
 bytes
   saved in order to get accurate idea of the amount of compression in
 each
   scenario and for various algorithms.
  
   Compression ratio i.e Uncompressed WAL/compressed WAL in each of the
 above
   scenarios are as follows:
  
   Compression algo   Multiple Blocks in one runSingle Block in
 one run
  
   LZ4  1.21
1.27
  
   Snappy1.19
  1.25
  
   pglz 1.14
1.16
  
   This shows compression ratios of both the scenarios Multiple blocks
 and
   single block  are nearly same for this benchmark.
 
  I don't agree with that conclusion.  The difference between 1.21 and
  1.27, or between 1.19 and 1.25, is quite significant.  Even the
  difference beyond 1.14 and 1.16 is not trivial.  We should try to get
  the larger benefit, if it is possible to do so without an unreasonable
  effort.
 
  Agreed.
 
  One more question: Do I see it right that multiple blocks compressed
  together compress *worse* than compressing individual blocks? If so, I
  have a rather hard time believing that the patch is sane.

 Or the way of benchmark might have some problems.

 Rahila,
 I'd like to measure the compression ratio in both multiple blocks and
 single block cases.
 Could you tell me where the patch for single block in one run is?

 Regards,

 --
 Fujii Masao



CompressSingleBlock.patch
Description: Binary data



Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-08-19 Thread Rahila Syed
So, it seems like you're basically using malloc to work around the
fact that a palloc failure is an error, and we can't throw an error in
a critical section.  I don't think that's good; we want all of our
allocations, as far as possible, to be tracked via palloc.  It might
be a good idea to add a new variant of palloc or MemoryContextAlloc
that returns NULL on failure instead of throwing an error; I've wanted
that once or twice.  But in this particular case, I'm not quite seeing
why it should be necessary

I am using malloc so that it returns NULL in case of failure, and proceeding
without compression of FPW if it returns NULL.
Proceeding without compression seems more appropriate than throwing an error
and exiting because of a failure to allocate memory for compression.

the number of backup blocks per record is
limited to some pretty small number, so it ought to be possible to
preallocate enough memory to compress them all, perhaps just by
declaring a global variable like char wal_compression_space[8192]; or
whatever.

In the updated patch, a static global variable is added, to which memory is
allocated from the heap using malloc outside the critical section. The size of
the memory block is 4 * (BkpBlock header) + 4 * BLCKSZ.

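A minimal sketch of that pre-allocation, with the size taken from the
description above (4 backup block headers plus 4 blocks); this is an
illustration, not the patch itself:

/* Allocated once, outside any critical section.  A NULL result simply means
 * "fall back to uncompressed full-page images", so failure is not an error. */
static char *compression_scratch = NULL;

static void
alloc_compression_scratch(void)
{
    if (compression_scratch == NULL)
        compression_scratch = malloc(4 * sizeof(BkpBlock) + 4 * BLCKSZ);
}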

Thank you,



On Mon, Aug 18, 2014 at 10:40 PM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Jul 3, 2014 at 3:58 PM, Rahila Syed rahilasye...@gmail.com
 wrote:
  Updated version of patches are attached.
  Changes are as follows
  1. Improved readability of the code as per the review comments.
  2. Addition of block_compression field in BkpBlock structure to store
  information about compression of block. This provides for switching
  compression on/off and changing compression algorithm as required.
  3.Handling of OOM in critical section by checking for return value of
 malloc
  and proceeding without compression of FPW if return value is NULL.

 So, it seems like you're basically using malloc to work around the
 fact that a palloc failure is an error, and we can't throw an error in
 a critical section.  I don't think that's good; we want all of our
 allocations, as far as possible, to be tracked via palloc.  It might
 be a good idea to add a new variant of palloc or MemoryContextAlloc
 that returns NULL on failure instead of throwing an error; I've wanted
 that once or twice.  But in this particular case, I'm not quite seeing
 why it should be necessary - the number of backup blocks per record is
 limited to some pretty small number, so it ought to be possible to
 preallocate enough memory to compress them all, perhaps just by
 declaring a global variable like char wal_compression_space[8192]; or
 whatever.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Fwd: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-08-18 Thread Rahila Syed
According to the measurement result, the amount of WAL generated in
Multiple Blocks in one run is less than that in Single Block in one run.
So ISTM that compression of multiple blocks at one run can improve
the compression ratio. Am I missing something?

Sorry for using unclear terminology. WAL generated here means WAL that gets
generated in each run without compression.
So, the value WAL generated in the  above measurement is uncompressed WAL
generated to be specific.
uncompressed WAL = compressed WAL  + Bytes saved.

Here, the measurements are done for a constant amount of time rather than
fixed number of transactions. Hence amount of WAL generated does not
correspond to compression ratios of each algo. Hence have calculated bytes
saved in order to get accurate idea of the amount of compression in each
scenario and for various algorithms.

Compression ratio i.e Uncompressed WAL/compressed WAL in each of the above
scenarios are as follows:

Compression algo   Multiple Blocks in one runSingle Block in one run


LZ4  1.21   1.27


Snappy1.19   1.25

pglz 1.14   1.16

This shows compression ratios of both the scenarios Multiple blocks and
single block  are nearly same for this benchmark.

Thank you,

Rahila Syed


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-07-11 Thread Rahila Syed
Thank you for review.

So, you're compressing backup blocks one by one. I wonder if that's the
right idea and if we shouldn't instead compress all of them in one run to
increase the compression ratio.
The idea behind compressing blocks one by one was to keep the code as similar
to the original as possible.
For instance, the easiest change I could think of is: if we compress all
backup blocks of a WAL record together, the format of the WAL record below
might change

Fixed-size header (XLogRecord struct)
 rmgr-specific data
 BkpBlock
 backup block data
 BkpBlock
 backup block data


to

 Fixed-size header (XLogRecord struct)
 rmgr-specific data
 BkpBlock
 BkpBlock
 backup blocks data
...

But at the same time, it can be worth giving it a try to see if there is a
significant improvement in compression.

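To make the alternative concrete, here is a rough sketch of the one-run
variant; compress_buffer() is a placeholder for whichever compressor is
chosen, and the buffer handling is deliberately simplified:

/* Concatenate the backup block images into one work buffer and compress the
 * concatenation with a single call. */
static int
compress_backup_blocks_together(char * const *images, const int *image_len,
                                int nblocks, char *work, char *out, int outsz)
{
    int     len = 0;
    int     i;

    for (i = 0; i < nblocks; i++)
    {
        memcpy(work + len, images[i], image_len[i]);
        len += image_len[i];
    }
    return compress_buffer(work, len, out, outsz);  /* placeholder compressor */
}
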
So why aren't we compressing the hole here instead of compressing the
parts that the current logic deems to be filled with important information?
The entire full-page image in the WAL record is compressed. The unimportant
part of the full-page image, which is the hole, is not WAL-logged in the
original code. This patch compresses the entire full-page image, inclusive of
the hole. This can be optimized by omitting the hole in the compressed FPW (in
case the hole is filled with non-zeros), like the original uncompressed FPW.
But this can lead to a change in the BkpBlock structure.

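A small sketch of the optimization mentioned, i.e. dropping the hole before
compressing (illustrative only):

/* Copy a page into 'dest' with its hole removed; the return value is the
 * number of bytes that would then be fed to the compressor. */
static int
copy_page_without_hole(const char *page, int hole_offset, int hole_length,
                       char *dest)
{
    memcpy(dest, page, hole_offset);
    memcpy(dest + hole_offset,
           page + hole_offset + hole_length,
           BLCKSZ - (hole_offset + hole_length));
    return BLCKSZ - hole_length;
}
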
This should be named 'compress_full_page_writes' or so, even if a
temporary guc. There's the 'full_page_writes' guc and I see little
reason to deviate from its name.

Yes. This will be renamed to full_page_compression according to suggestions
earlier in the discussion.


Thank you,

Rahila Syed


On Fri, Jul 11, 2014 at 12:00 PM, Andres Freund and...@2ndquadrant.com
wrote:

 On 2014-07-04 19:27:10 +0530, Rahila Syed wrote:
  + /* Allocates memory for compressed backup blocks according to the compression
  +  * algorithm used. Once per session at the time of insertion of first XLOG
  +  * record.
  +  * This memory stays till the end of session. OOM is handled by making the
  +  * code proceed without FPW compression */
  + static char *compressed_pages[XLR_MAX_BKP_BLOCKS];
  + static bool compressed_pages_allocated = false;
  + if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF &&
  +     compressed_pages_allocated != true)
  + {
  +     size_t buffer_size = VARHDRSZ;
  +     int j;
  +     if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_SNAPPY)
  +         buffer_size += snappy_max_compressed_length(BLCKSZ);
  +     else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_LZ4)
  +         buffer_size += LZ4_compressBound(BLCKSZ);
  +     else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_PGLZ)
  +         buffer_size += PGLZ_MAX_OUTPUT(BLCKSZ);
  +     for (j = 0; j < XLR_MAX_BKP_BLOCKS; j++)
  +     {
  +         compressed_pages[j] = (char *) malloc(buffer_size);
  +         if (compressed_pages[j] == NULL)
  +         {
  +             compress_backup_block = BACKUP_BLOCK_COMPRESSION_OFF;
  +             break;
  +         }
  +     }
  +     compressed_pages_allocated = true;
  + }

 Why not do this in InitXLOGAccess() or similar?

     /*
      * Make additional rdata chain entries for the backup blocks, so that we
      * don't need to special-case them in the write loop.  This modifies the
  @@ -1015,11 +1048,32 @@ begin:;
            rdt->next = &(dtbuf_rdt2[i]);
            rdt = rdt->next;

  +         if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF)
  +         {
  +             /* Compress the backup block before including it in rdata chain */
  +             rdt->data = CompressBackupBlock(page, BLCKSZ - bkpb->hole_length,
  +                                             compressed_pages[i], &(rdt->len));
  +             if (rdt->data != NULL)
  +             {
  +                 /*
  +                  * write_len is the length of the compressed block and its
  +                  * varlena header
  +                  */
  +                 write_len += rdt->len;
  +                 bkpb->hole_length = BLCKSZ - rdt->len;
  +                 /* Adding information about compression in the backup block header */
  +                 bkpb->block_compression = compress_backup_block;
  +                 rdt->next = NULL;
  +                 continue;
  +             }
  +         }
  +

 So, you're compressing backup blocks one by one. I wonder if that's the
 right idea and if we shouldn't instead compress all of them in one run to
 increase the compression ratio.


  +/*
* Get a pointer to the right location in the WAL buffer containing the
* given XLogRecPtr.
*
  @@ -4061,6 +4174,50 @@ RestoreBackupBlockContents(XLogRecPtr
