Re: pg_dump is broken for partition tablespaces

2019-03-05 Thread Michael Paquier
On Wed, Mar 06, 2019 at 07:45:06PM +1300, David Rowley wrote:
> Partitioned indexes have a similar inherit-tablespace-from-parent
> feature, so ca4103025dfe26 was intended to align the behaviour of the
> two. Partitioned indexes happen not to suffer from the same issue, as
> the indexes are attached after their creation, similar to what I
> propose above.
> 
> Can anyone see any fundamental reason that we should not create a
> partitioned table by doing CREATE TABLE followed by ATTACH PARTITION?
>  If not, I'll write a patch that fixes it that way.

The part for partitioned indexes is already battle-proven, so if the
part for partitioned tables can be consolidated the same way that
would be really nice.
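
For illustration, consolidating things that way would presumably mean pg_dump
emitting something like the following (object names borrowed from the example
elsewhere in this thread; the exact output of course depends on the eventual
patch):

SET default_tablespace = '';

CREATE TABLE public.listp1 (
    a integer
);

ALTER TABLE public.listp ATTACH PARTITION public.listp1 FOR VALUES IN (1);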

> As far as I can see, the biggest fundamental difference with doing
> things this way will be that the column order of partitions will be
> preserved, where before it would inherit the order of the partitioned
> table.  I'm a little unsure if doing this column reordering was an
> intended side-effect or not.

I don't see any direct issues with that, to be honest, now that I think
about it.
--
Michael




ECPG regression with DECLARE STATEMENT support

2019-03-05 Thread Rushabh Lathia
Hi,

Commit bd7c95f0c1a38becffceb3ea7234d57167f6d4bf added DECLARE
STATEMENT support to ECPG.  This introduced a new rule
for EXEC SQL CLOSE cur, which now gets transformed into
ECPGclose().

Prior to the above commit, one could declare a
cursor inside a prepared SQL statement, and "CLOSE cur_name"
could likewise be executed as a normal statement.

Example:

EXEC SQL PREPARE cur_query FROM "DECLARE cur1 CURSOR WITH HOLD FOR SELECT
count(*) FROM pg_class";
EXEC SQL PREPARE fetch_stmt FROM "FETCH next FROM cur1";
EXEC SQL EXECUTE cur_query;
EXEC SQL EXECUTE fetch_stmt INTO :c;
EXEC SQL CLOSE cur1;

With commit bd7c95f0c1, "EXEC SQL CLOSE cur1" will fail
and throw an error "sqlcode -245 The cursor is invalid".

I think the problem here is that ECPGclose() tries to find the
cursor in "connection->cursor_stmts" and, if it doesn't
find it there, just throws an error.  Maybe the fix belongs
in ECPGclose(): rather than throwing an error, continue
executing the "CLOSE cur_name" statement with ecpg_do().

Attaching the ECPG program for reference.

Thanks,

-- 
Rushabh Lathia
www.EnterpriseDB.com


test.pgc
Description: Binary data


Re: BUG #15668: Server crash in transformPartitionRangeBounds

2019-03-05 Thread Amit Langote
Hi,

On 2019/03/06 15:48, Michael Paquier wrote:
> On Tue, Mar 05, 2019 at 11:04:17PM +0900, Amit Langote wrote:
>> Maybe we should error out as follows in
>> transformPartitionRangeBounds(), although that means we'll get a
>> different error message than when using list partitioning syntax:
> 
> Hm.  I don't think that this is a good idea as you could lose some
> information for the expression transformation handling, and the error
> handling becomes inconsistent depending on the partition bound
> strategy.  It seems to me that if we cannot extract any special value
> from the ColumnRef expression generated, then we ought to let
> transformPartitionBoundValue() and particularly transformExprRecurse()
> do the analysis work and complain if needed:
> =# CREATE TABLE rp_part PARTITION OF range_parted FOR VALUES FROM
> (unknown.unknown) TO (1);
> ERROR:  42P01: missing FROM-clause entry for table "unknown"
> LINE 1: ...p_part PARTITION OF range_parted FOR VALUES FROM
> (unknown.un...
> =# CREATE TABLE rp_part PARTITION OF range_parted FOR VALUES FROM
> (a.a.a.a.a.a.a.a.a.a.a.a) TO (1);
> ERROR:  42601: improper qualified name (too many dotted names):
> a.a.a.a.a.a.a.a.a.a.a.a
> LINE 1: ...p_part PARTITION OF range_parted FOR VALUES FROM
> (a.a.a.a.a
> 
> What about something like the attached instead?  Minus the test
> cases which should go to create_table.sql of course.

Thanks for looking at this.  Your patch seems better, because it allows us
to keep the error message consistent with the message one would get with
list-partitioned syntax.
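
For anyone trying the quoted commands locally, the parent table is not shown
upthread; a minimal assumed setup would be something like:

CREATE TABLE range_parted (a int) PARTITION BY RANGE (a);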

Thanks,
Amit




Re: BUG #15668: Server crash in transformPartitionRangeBounds

2019-03-05 Thread Michael Paquier
On Tue, Mar 05, 2019 at 11:04:17PM +0900, Amit Langote wrote:
> Maybe we should error out as follows in
> transformPartitionRangeBounds(), although that means we'll get a
> different error message than when using list partitioning syntax:

Hm.  I don't think that this is a good idea as you could lose some
information for the expression transformation handling, and the error
handling becomes inconsistent depending on the partition bound
strategy.  It seems to me that if we cannot extract any special value
from the ColumnRef expression generated, then we ought to let
transformPartitionBoundValue() and particularly transformExprRecurse()
do the analysis work and complain if needed:
=# CREATE TABLE rp_part PARTITION OF range_parted FOR VALUES FROM
(unknown.unknown) TO (1);
ERROR:  42P01: missing FROM-clause entry for table "unknown"
LINE 1: ...p_part PARTITION OF range_parted FOR VALUES FROM
(unknown.un...
=# CREATE TABLE rp_part PARTITION OF range_parted FOR VALUES FROM
(a.a.a.a.a.a.a.a.a.a.a.a) TO (1);
ERROR:  42601: improper qualified name (too many dotted names):
a.a.a.a.a.a.a.a.a.a.a.a
LINE 1: ...p_part PARTITION OF range_parted FOR VALUES FROM
(a.a.a.a.a

What about something like the attached instead?  Minus the test
cases which should go to create_table.sql of course.
--
Michael
diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index a37d1f18be..ee129ba18f 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -3750,8 +3750,16 @@ transformPartitionRangeBounds(ParseState *pstate, List *blist,
 IsA(linitial(cref->fields), String))
 cname = strVal(linitial(cref->fields));
 
-			Assert(cname != NULL);
-			if (strcmp("minvalue", cname) == 0)
+			if (cname == NULL)
+			{
+/*
+ * No field names have been found, meaning that there
+ * is not much to do with special value handling.  Instead
+ * let the expression transformation handle any errors and
+ * limitations.
+ */
+			}
+			else if (strcmp("minvalue", cname) == 0)
 			{
 prd = makeNode(PartitionRangeDatum);
 prd->kind = PARTITION_RANGE_DATUM_MINVALUE;




pg_dump is broken for partition tablespaces

2019-03-05 Thread David Rowley
Over on [1] Andres pointed out that the pg_dump support for the
new-to-PG12 tablespace inheritance feature is broken.  This is the feature
added in ca4103025dfe26 to allow a partitioned table to have a
tablespace that acts as the default tablespace for newly attached
partitions. The idea being that you can periodically change the
default tablespace for new partitions to a tablespace that sits on a
disk partition with more free space without affecting the default
tablespace for normal non-partitioned tables. Anyway...

pg_dump is broken with this. Consider:

create tablespace newts location '/tmp/newts';
create table listp (a int) partition by list(a) tablespace newts;
create table listp1 partition of listp for values in(1) tablespace pg_default;
create table listp2 partition of listp for values in(2);

select relname,relkind,reltablespace from pg_class where relname like
'listp%' and relkind in('r','p') order by relname;
produces:

 relname | relkind | reltablespace
-+-+---
 listp   | p   | 16384
 listp1  | r   | 0
 listp2  | r   | 16384
(3 rows)

after dump/restore:

 relname | relkind | reltablespace
-+-+---
 listp   | p   | 16384
 listp1  | r   | 16384
 listp2  | r   | 16384
(3 rows)

Here the tablespace for listp1 was inherited from listp, but we really
should have restored it to use pg_default, as was specified.

The reason this occurs is that in pg_dump we do:

SET default_tablespace = '';

CREATE TABLE public.listp1 PARTITION OF public.listp
FOR VALUES IN (1);

so, since we're creating the table initially as a partition, the logic
that applies the default tablespace from the parent kicks in.

If we instead did:

CREATE TABLE public.listp1 (a integer
);

ALTER TABLE public.listp ATTACH PARTITION public.listp1 FOR VALUES IN (1);

then we'd have no issue, as the tablespace will be set to whatever
default_tablespace is set to.

Partitioned indexes have a similar inherit-tablespace-from-parent
feature, so ca4103025dfe26 was intended to align the behaviour of the
two. Partitioned indexes happen not to suffer from the same issue, as
the indexes are attached after their creation, similar to what I
propose above.

Can anyone see any fundamental reason that we should not create a
partitioned table by doing CREATE TABLE followed by ATTACH PARTITION?
 If not, I'll write a patch that fixes it that way.

As far as I can see, the biggest fundamental difference with doing
things this way will be that the column order of partitions will be
preserved, where before it would inherit the order of the partitioned
table.  I'm a little unsure if doing this column reordering was an
intended side-effect or not.
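
A contrived example of that difference (table and column names are purely
illustrative):

-- A partition's column order can legitimately differ from its parent's:
CREATE TABLE parent (a int, b text) PARTITION BY LIST (a);
CREATE TABLE part1 (b text, a int);   -- same columns, different order
ALTER TABLE parent ATTACH PARTITION part1 FOR VALUES IN (1);

-- Dumped as CREATE TABLE ... PARTITION OF, part1 would be recreated with
-- columns in the parent's order (a, b); dumped as CREATE TABLE plus ATTACH
-- PARTITION, its own column order (b, a) would be preserved.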

[1] 
https://www.postgresql.org/message-id/20190305060804.jv5mz4slrnelh...@alap3.anarazel.de

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] CLUSTER command progress monitor

2019-03-05 Thread Tatsuro Yamada

On 2019/03/05 17:56, Tatsuro Yamada wrote:

Hi Robert!

On 2019/03/05 11:35, Robert Haas wrote:

On Mon, Mar 4, 2019 at 5:38 AM Tatsuro Yamada
 wrote:

=== Current design ===

CLUSTER command uses Index Scan or Seq Scan when scanning the heap.
Depending on which one is chosen, the command will proceed in the
following sequence of phases:

    * Scan method: Seq Scan
  0. initializing (*2)
  1. seq scanning heap    (*1)
  3. sorting tuples   (*2)
  4. writing new heap (*1)
  5. swapping relation files  (*2)
  6. rebuilding index (*2)
  7. performing final cleanup (*2)

    * Scan method: Index Scan
  0. initializing (*2)
  2. index scanning heap  (*1)
  5. swapping relation files  (*2)
  6. rebuilding index (*2)
  7. performing final cleanup (*2)

VACUUM FULL command will proceed in the following sequence of phases:

  1. seq scanning heap    (*1)
  5. swapping relation files  (*2)
  6. rebuilding index (*2)
  7. performing final cleanup (*2)

(*1): increasing the value in heap_tuples_scanned column
(*2): only shows the phase in the phase column


All of that sounds good.


The view provides the details of CLUSTER command progress as follows:
# \d pg_stat_progress_cluster
           View "pg_catalog.pg_stat_progress_cluster"
        Column        |  Type   | Collation | Nullable | Default
----------------------+---------+-----------+----------+---------
 pid                  | integer |           |          |
 datid                | oid     |           |          |
 datname              | name    |           |          |
 relid                | oid     |           |          |
 command              | text    |           |          |
 phase                | text    |           |          |
 cluster_index_relid  | bigint  |           |          |
 heap_tuples_scanned  | bigint  |           |          |
 heap_tuples_vacuumed | bigint  |           |          |


Still not sure if we need heap_tuples_vacuumed.  We could try to
report heap_blks_scanned and heap_blks_total like we do for VACUUM, if
we're using a Seq Scan.


I have no strong opinion about adding heap_tuples_vacuumed, so I'll remove it
in the next patch.

Regarding heap_blks_scanned and heap_blks_total, I suppose those can be
obtained from initscan(). I'll investigate it more.

cluster.c
   copy_heap_data()
     heap_beginscan()
   heap_beginscan_internal()
     initscan()




=== Discussion points ===

   - Progress counter for "3. sorting tuples" phase
  - Should we add pgstat_progress_update_param() in tuplesort.c like a
    "trace_sort"?
    Thanks to Peter Geoghegan for the useful advice!


How would we avoid an abstraction violation?


Hmm... what do you mean by an abstraction violation?
If it is difficult to solve, I'd rather not add the progress counter for
the sorting tuples phase.



   - Progress counter for "6. rebuilding index" phase
  - Should we add "index_vacuum_count" in the view like a vacuum progress 
monitor?
    If yes, I'll add pgstat_progress_update_param() to reindex_relation() 
of index.c.
    However, I'm not sure whether it is okay or not.


Doesn't seem unreasonable to me.


I see, I'll add it later.



Attached is a revised WIP patch that includes:

  - Remove heap_tuples_vacuumed
  - Add heap_blks_scanned and heap_blks_total
  - Add index_vacuum_count

I tried to "add heap_blks_scanned and heap_blks_total" columns and I realized 
that
"heap_tuples_scanned" column is suitable as a counter when a scan method is
both index-scan and seq-scan because CLUSTER is on a tuple basis.
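
For reference, a monitoring query against the proposed view would look
something like this (column names per this WIP patch, so they may still
change):

SELECT pid, relid::regclass, command, phase,
       heap_blks_scanned, heap_blks_total, index_vacuum_count
FROM pg_stat_progress_cluster;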



Regards,
Tatsuro Yamada


diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d16c3d0ea5..a88e8d2492 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -51,6 +51,7 @@
 #include "catalog/storage.h"
 #include "commands/tablecmds.h"
 #include "commands/event_trigger.h"
+#include "commands/progress.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
@@ -58,6 +59,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
 #include "parser/parser.h"
+#include "pgstat.h"
 #include "rewrite/rewriteManip.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -3850,6 +3852,7 @@ reindex_relation(Oid relid, int flags, int options)
 	List	   *indexIds;
 	bool		is_pg_class;
 	bool		result;
+	int			i;
 
 	/*
 	 * Open and lock the relation.  ShareLock is sufficient since we only need
@@ -3937,6 +3940,7 @@ reindex_relation(Oid relid, int flags, int options)
 
 		/* Reindex all the indexes. */
 		doneIndexes = NIL;
+		i = 1;
 		foreach(indexId, indexIds)
 		{
 			Oid			indexOid = lfirst_oid(indexId);
@@ -3954,6 +3958,11 @@ 

Re: Update does not move row across foreign partitions in v11

2019-03-05 Thread Amit Langote
Fujita-san,

On 2019/03/06 15:10, Etsuro Fujita wrote:
> --- a/doc/src/sgml/ddl.sgml
> +++ b/doc/src/sgml/ddl.sgml
> @@ -3376,6 +3376,13 @@ ALTER TABLE measurement ATTACH PARTITION
> measurement_y2008m02
>    
>   
> 
> + 
> +  
> +   UPDATE row movement is not supported in the cases
> +   where the old row is contained in a foreign table partition.
> +  
> + 
> 
> ISTM that it's also a limitation that rows can be moved from a local
> partition to a foreign partition *if the FDW supports tuple routing*, so I
> would vote for mentioning that as well here.

Thanks for checking.

I have updated the patch to include a line about this in the same
paragraph, because maybe we don't need to make a new  for it.

Thanks,
Amit
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 8314fce78f..c6b5bd0e54 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3878,6 +3878,15 @@ ALTER TABLE measurement ATTACH PARTITION 
measurement_y2008m02
 
  
   
+   UPDATE row movement is not supported in the cases
+   where the old row is contained in a foreign table partition.  Also,
+   moving the new row into a foreign table partition is supported only
+   if the table's FDW supports tuple routing.
+  
+ 
+
+ 
+  
BEFORE ROW triggers, if necessary, must be defined
on individual partitions, not the partitioned table.
   


Re: [HACKERS] Block level parallel vacuum

2019-03-05 Thread Masahiko Sawada
On Mon, Mar 4, 2019 at 10:27 AM Masahiko Sawada  wrote:
>
> On Sat, Mar 2, 2019 at 3:54 AM Robert Haas  wrote:
> >
> > On Fri, Mar 1, 2019 at 12:19 AM Masahiko Sawada  
> > wrote:
> > > > I wonder if we really want this behavior.  Should a setting that
> > > > controls the degree of parallelism when scanning the table also affect
> > > > VACUUM?  I tend to think that we probably don't ever want VACUUM of a
> > > > table to be parallel by default, but rather something that the user
> > > > must explicitly request.  Happy to hear other opinions.  If we do want
> > > > this behavior, I think this should be written differently, something
> > > > like this: The PARALLEL N option to VACUUM takes precedence over this
> > > > option.
> > >
> > > For example, I can imagine a use case where a batch job does parallel
> > > vacuum on some tables in a maintenance window. The batch operation
> > > will need to compute and specify the degree of parallelism every time
> > > according to, for instance, the number of indexes, which would be
> > > troublesome. But if we can set the degree of parallelism for each
> > > table, it can just do 'VACUUM (PARALLEL)'.
> >
> > True, but the setting in question would also affect the behavior of
> > sequential scans and index scans.  TBH, I'm not sure that the
> > parallel_workers reloption is really a great design as it is: is
> > hard-coding the number of workers really what people want?  Do they
> > really want the same degree of parallelism for sequential scans and
> > index scans?  Why should they want the same degree of parallelism also
> > for VACUUM?  Maybe they do, and maybe somebody can explain why they do,
> > but as of now, it's not obvious to me why that should be true.
>
> I think that there are users who want to specify the degree of
> parallelism. I think that hard-coding the number of workers would be a
> good design for something like VACUUM, which is a simple operation on a
> single object; since there are no joins or aggregations, it'd be
> relatively easy to compute. That's why the patch introduces the
> PARALLEL N option as well. I think that a reloption for parallel
> vacuum would just be a way to save the degree of parallelism. And I
> agree that users don't want to use the same degree of parallelism for
> VACUUM, so maybe it'd be better to add a new reloption like
> parallel_vacuum_workers. On the other hand, that can be a separate
> patch; I can remove the reloption part from this patch and
> propose it when there are requests.
>

Okay, attached is the latest version of the patch set. I've incorporated all
the comments I got and separated out the patch that makes vacuum options a
Node (the 0001 patch). The patch no longer uses parallel_workers; that
might be proposed again in another form in the future if
requested.
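
For reference, the syntax added by these patches looks roughly like this
(option names as in this version of the patch set; they may change before
commit):

-- explicitly request a degree of parallelism
VACUUM (PARALLEL 4) some_table;
-- let the command pick the degree of parallelism itself
VACUUM (PARALLEL) some_table;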


Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


v16-0001-Make-vacuum-options-a-Node.patch
Description: Binary data


v16-0003-Add-paralell-P-option-to-vacuumdb-command.patch
Description: Binary data


v16-0002-Add-parallel-option-to-VACUUM-command.patch
Description: Binary data


Re: [HACKERS] CLUSTER command progress monitor

2019-03-05 Thread Tatsuro Yamada

On 2019/03/06 1:13, Robert Haas wrote:

On Tue, Mar 5, 2019 at 3:56 AM Tatsuro Yamada
 wrote:

=== Discussion points ===

- Progress counter for "3. sorting tuples" phase
   - Should we add pgstat_progress_update_param() in tuplesort.c like a
 "trace_sort"?
 Thanks to Peter Geoghegan for the useful advice!


How would we avoid an abstraction violation?


Hmm... what do you mean by an abstraction violation?
If it is difficult to solve, I'd rather not add the progress counter for
the sorting tuples phase.


What I mean is... I think it would be useful to have this counter, but
I'm not sure how the tuplesort code would know to update the counter
in this case and not in other cases.  The tuplesort code is used for
lots of things; we can't update a counter for CLUSTER if the tuplesort
is being used for CREATE INDEX or a Sort node in a query or whatever.
So my question is how we would indicate to the tuplesort that it needs
to do the counter update, and whether that would end up making for
ugly code.


Thanks for your explanation!
I understand now. I guess what's needed is an API to report the progress of
sort processing, so let us just put it aside for now. I'd like to leave that
as is for a more appropriate person.

Regards,
Tatsuro Yamada






Re: Update does not move row across foreign partitions in v11

2019-03-05 Thread Etsuro Fujita

(2019/03/06 13:53), Amit Langote wrote:

On 2019/03/06 13:30, David Rowley wrote:



I think you missed my point.  If there's no special support for "tuple
moving", as you say, then what help is it to tell the user "if the FDW
supports tuple routing"?   The answer is, it's not any help. How would
the user check such a fact?


Hmm, maybe getting the following error, like one would get in PG 10 when
using postgres_fdw-managed partitions:

ERROR:  cannot route inserted tuples to a foreign table

Getting the above error is perhaps not the best way for a user to learn of
this fact, but maybe we (and hopefully other FDW authors) should mention this
in the documentation?


+1


As far as I can tell, this is just the requirements as defined in
CheckValidResultRel() for CMD_INSERT. Fragments of which I pasted
above.


Only supporting INSERT doesn't suffice though.  An FDW which intends to
support tuple routing and hence 1-way tuple moving needs to be updated like
postgres_fdw was in PG 11.


That's right; the "if the FDW supports it" in the documentation refers 
to the FDW's support for the callback functions BeginForeignInsert() and 
EndForeignInsert() described in 57.2.4. FDW Routines For Updating 
Foreign Tables [1] in addition to ExecForeignInsert(), as stated there:


"Tuples inserted into a partitioned table by INSERT or COPY FROM are 
routed to partitions. If an FDW supports routable foreign-table 
partitions, it should also provide the following callback functions."


Best regards,
Etsuro Fujita

[1] 
https://www.postgresql.org/docs/current/fdw-callbacks.html#FDW-CALLBACKS-UPDATE





Re: Update does not move row across foreign partitions in v11

2019-03-05 Thread Etsuro Fujita

(2019/03/06 13:18), Amit Langote wrote:

The main problem here is indeed that the limitation is not listed under
the partitioning limitations in ddl.sgml, where it's easier to notice than
in the UPDATE's page.


Agreed.


I've updated my patch to remove the release-11.sgml
changes.


Thanks for the updated patch!

--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3376,6 +3376,13 @@ ALTER TABLE measurement ATTACH PARTITION 
measurement_y2008m02

   
  

+ 
+  
+   UPDATE row movement is not supported in the cases
+   where the old row is contained in a foreign table partition.
+  
+ 

ISTM that it's also a limitation that rows can be moved from a local 
partition to a foreign partition *if the FDW supports tuple routing*, so 
I would vote for mentioning that as well here.


Best regards,
Etsuro Fujita




Re: Tab completion for SKIP_LOCKED option

2019-03-05 Thread Masahiko Sawada
On Wed, Mar 6, 2019 at 2:46 PM Michael Paquier  wrote:
>
> On Wed, Mar 06, 2019 at 11:45:01AM +0900, Masahiko Sawada wrote:
> > I realized that the tab completions for SKIP_LOCKED option of both
> > VACUUM and ANALYZE are missing. Attached patch adds them.
>
> Thanks Sawada-san, committed.

Thank you!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Allowing extensions to supply operator-/function-specific info

2019-03-05 Thread Tom Lane
Paul Ramsey  writes:
>> On Mar 5, 2019, at 3:56 PM, Tom Lane  wrote:
>> Then you're at least missing adequate tests for the 3-arg functions...
>> 3 args with the index column second will not work as this stands.

> Some of the operators are indifferent to order (&&,  overlaps) and others are 
> not (@, within) (~, contains).

Right.

> The 3-arg functions fortunately all have && strategies.

Hm ... that probably explains why it's okay to apply the "expand"
behavior to the non-indexed argument regardless of which one that is.
I imagine the official definition of those functions isn't really
symmetrical about which argument the expansion applies to, though?

> The types on either side of the operators are always the same (geometry && 
> geometry), ST_Intersects(geometry, geometry).
> I could simply be getting a free pass from the simplicity of my setup?

Yeah, seems so.  The real reason I'm pestering you about this is that,
since you're the first outside user of the support-function
infrastructure, other people are likely to be looking at your code
to see how to do things.  So I'd like your code to not contain
unnecessary dependencies on accidents like that ...
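
To make the shape of the problem concrete for other readers, the 2- and
3-argument cases look roughly like this (PostGIS function names; the table,
column, and values are only illustrative):

-- 2-argument case: the indexed column can appear on either side
SELECT * FROM geoms g WHERE ST_Intersects(g.geom, ST_MakeEnvelope(0, 0, 10, 10));
SELECT * FROM geoms g WHERE ST_Intersects(ST_MakeEnvelope(0, 0, 10, 10), g.geom);

-- 3-argument case: the indexed column may also be the second argument
SELECT * FROM geoms g WHERE ST_DWithin(g.geom, ST_MakePoint(5, 5), 100);
SELECT * FROM geoms g WHERE ST_DWithin(ST_MakePoint(5, 5), g.geom, 100);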

regards, tom lane



Re: Tab completion for SKIP_LOCKED option

2019-03-05 Thread Michael Paquier
On Wed, Mar 06, 2019 at 11:45:01AM +0900, Masahiko Sawada wrote:
> I realized that the tab completions for SKIP_LOCKED option of both
> VACUUM and ANALYZE are missing. Attached patch adds them.

Thanks Sawada-san, committed.
--
Michael




Re: speeding up planning with partitions

2019-03-05 Thread Amit Langote
Imai-san,

Thanks for the review.

On 2019/03/06 11:09, Imai, Yoshikazu wrote:
> Here is the code review for previous v26 patches.
> 
> [0002]
> In expand_inherited_rtentry():
> 
> expand_inherited_rtentry()
> {
> ...
> +   RelOptInfo *rel = NULL;
> 
> can be declared later:
> 
> if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> ...
> else
> {   
> List   *appinfos = NIL;
> RangeTblEntry *childrte;
> Index   childRTindex;
> +RelOptInfo *rel = NULL;
> 
> 

This is fixed in v27.

> [0003]
> In inheritance_planner:
> 
> + rtable_with_target = list_copy(root->parse->rtable);
> 
> can be:
> 
> + rtable_with_target = list_copy(parse->rtable);

Sure.

> [0004 or 0005]
> There is redundant processing in add_appendrel_other_rels (or 
> expand_xxx_rtentry()?).
> I modified add_appendrel_other_rels like below and it actually worked.
> 
> 
> add_appendrel_other_rels(PlannerInfo *root, RangeTblEntry *rte, Index rti) 
> {
> ListCell   *l;
> RelOptInfo *rel;
> 
> /*   
>  * Add inheritance children to the query if not already done. For child
>  * tables that are themselves partitioned, their children will be added
>  * recursively.
>  */
> if (rte->rtekind == RTE_RELATION && !root->contains_inherit_children)
> {
> expand_inherited_rtentry(root, rte, rti);
> return;
> }
> 
> rel = find_base_rel(root, rti);
> 
> foreach(l, root->append_rel_list)
> {
> AppendRelInfo  *appinfo = lfirst(l);
> Index   childRTindex = appinfo->child_relid;
> RangeTblEntry  *childrte;
> RelOptInfo *childrel;
> 
> if (appinfo->parent_relid != rti) 
> continue;
> 
> Assert(childRTindex < root->simple_rel_array_size);
> childrte = root->simple_rte_array[childRTindex];
> Assert(childrte != NULL);
> build_simple_rel(root, childRTindex, rel);
> 
> /* Child may itself be an inherited relation. */
> if (childrte->inh)
> add_appendrel_other_rels(root, childrte, childRTindex);
> }
> }

If you don't initialize part_rels here, then it will break any code in the
planner that expects a partitioned rel with a non-zero number of partitions
to have part_rels set to non-NULL.  At the moment, that includes the code
that implements partitioning-specific optimizations such as partitionwise
aggregate and join, run-time pruning, etc.  No existing regression tests
cover the cases where source inherited rels participate in those
optimizations, so you didn't find anything that broke.  For an example,
consider the following update query:

update p set a = p1.a + 1 from p p1 where p1.a = (select 1);

Planner will create a run-time pruning aware Append node for p (aliased
p1), for which it will need to use the part_rels array.  Note that p in
this case is a source relation which the above code initializes.

Maybe we should add such regression tests.
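
Something along these lines, perhaps (partition bounds assumed; the UPDATE is
the one above):

CREATE TABLE p (a int) PARTITION BY LIST (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES IN (1);
CREATE TABLE p2 PARTITION OF p FOR VALUES IN (2);
EXPLAIN (COSTS OFF)
UPDATE p SET a = p1.a + 1 FROM p p1 WHERE p1.a = (SELECT 1);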

> I didn't investigate that problem, but there is another memory increase
> issue, which is because of 0003 patch I think. I'll try to solve the
> latter issue.

I'm interested in the details, as it seems to be a separate problem.

Thanks,
Amit




Re: Patch to document base64 encoding

2019-03-05 Thread Karl O. Pinc
On Wed, 6 Mar 2019 11:27:38 +0900
Michael Paquier  wrote:

> On Tue, Mar 05, 2019 at 07:55:22PM -0600, Karl O. Pinc wrote:
> > Attached: doc_base64_v4.patch  
> 
> Details about the "escape" mode are already available within the
> description of function "encode".  Wouldn't we want to consolidate a
> description for all the modes at the same place, including some words
> for hex?  Your patch only includes the description of base64, which is
> a good addition, still not consistent with the rest.  A paragraph
> after all the functions listed is fine I think as the description is
> long so it would bloat the table if included directly.

Makes sense.  (As did hyperlinking to the RFC.)

(No matter how simple I think a patch is going to be, it
always turns into a project.  :)

Attached: doc_base64_v5.patch

Made index entries for hex and escape encodings.

Added word "encoding" to index entries.

Made  entries with terms for
base64, hex, and escape encodings.

Added documentation for hex and escape encodings,
including output formats and what are acceptable
inputs.
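
For readers of the thread, the documented behaviors look like this in psql
(bytea outputs shown as with the default bytea_output = 'hex'):

SELECT encode('123\000\001', 'base64');    -- MTIzAAE=  (the existing doc example)
SELECT decode('MTIzAAE=', 'base64');       -- \x3132330001
SELECT encode('\x1234ab'::bytea, 'hex');   -- 1234ab  (a-f output in lower case)
SELECT encode('\x00ff'::bytea, 'escape');  -- \000\377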

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6765b0d584..bd337fd530 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1752,6 +1752,9 @@
 
  decode
 
+
+  base64 encoding
+
 decode(string text,
 format text)

@@ -1769,16 +1772,25 @@
 
  encode
 
+
+ base64 encoding
+
+
+ hex encoding
+
+
+ escape encoding
+
 encode(data bytea,
 format text)

text

 Encode binary data into a textual representation.  Supported
-formats are: base64, hex, escape.
-escape converts zero bytes and high-bit-set bytes to
-octal sequences (\nnn) and
-doubles backslashes.
+formats are:
+base64,
+hex,
+escape.

encode('123\000\001', 'base64')
MTIzAAE=
@@ -2365,6 +2377,84 @@
 format treats a NULL as a zero-element array.

 
+   
+ base64 encoding
+   
+   
+hex encoding
+   
+   
+escape encoding
+   
+
+   
+ The string and the binary encode
+ and decode functions support the following
+ encodings:
+
+ 
+   
+ base64
+ 
+   
+ The base64 encoding is that
+ of https://tools.ietf.org/html/rfc2045#section-6.8;>RFC
+ 2045 section 6.8.  As per the RFC, encoded lines are
+ broken at 76 characters.  However instead of the MIME CRLF
+ end-of-line marker, only a newline is used for end-of-line.
+   
+   
+ The carriage-return, newline, space, and tab characters are
+ ignored by decode.  Otherwise, an error is
+ raised when decode is supplied invalid
+ base64 data  including when trailing padding is incorrect.
+   
+ 
+   
+
+   
+ hex
+ 
+   
+ hex represents each 4 bits of data as a single
+ hexadecimal digit, 0
+ through f.  Encoding outputs
+ the a-f hex digits in lower
+ case.  Because the smallest unit of data is 8 bits there are
+ always an even number of characters returned
+ by encode.
+   
+   
+ The decode function
+ accepts a-f characters in
+ either upper or lower case.  An error is raised
+ when decode is supplied invalid hex data
+  including when given an odd number of characters.
+   
+ 
+   
+
+   
+ escape
+ 
+   
+ escape converts zero bytes and high-bit-set
+ bytes to octal sequences
+ (\nnn) and doubles
+ backslashes.  Encoding always produces 4 characters for each
+ high-bit-set input byte.
+   
+   
+ The decode function accepts any number of
+ octal digits after a \ character.  An error is
+ raised when decode is supplied a
+ single \ not followed by an octal digit.
+   
+ 
+   
+ 
+   
+

See also the aggregate function string_agg in
.
@@ -3577,16 +3667,25 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
 
  encode
 
+
+ base64 encoding
+
+
+ hex encoding
+
+
+ escape encoding
+
encode(data bytea,
format text)
   
   text
   
Encode binary data into a textual representation.  Supported
-   formats are: base64, hex, escape.
-   escape converts zero bytes and high-bit-set 

Re: Update does not move row across foreign partitions in v11

2019-03-05 Thread Amit Langote
On 2019/03/06 13:30, David Rowley wrote:
> On Wed, 6 Mar 2019 at 17:20, Amit Langote  
> wrote:
>>
>> On 2019/03/06 12:47, David Rowley wrote:
>>> It seems a bit light on detail to me. If I was a user I'd want to know
>>> what exactly the FDW needed to support this. Does it need a special
>>> partition move function?  Looking at ExecFindPartition(), this check
>>> seems to be done in CheckValidResultRel() and is basically:
>>>
>>> case RELKIND_FOREIGN_TABLE:
>>> /* Okay only if the FDW supports it */
>>> fdwroutine = resultRelInfo->ri_FdwRoutine;
>>> switch (operation)
>>> {
>>> case CMD_INSERT:
>>> if (fdwroutine->ExecForeignInsert == NULL)
>>> ereport(ERROR,
>>> (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>>> errmsg("cannot insert into foreign table \"%s\"",
>>> RelationGetRelationName(resultRel;
>>>
>>> Alternatively, we could just remove the mention about "if the FDW
>>> supports it", since it's probably unlikely for an FDW not to support
>>> INSERT.
>>
>> AFAIK, there's no special support in FDWs for "tuple moving" as such.  The
>> "if the FDW supports it" refers to the FDW's ability to handle tuple
>> routing.  Note that moving/re-routing involves calling
>> ExecPrepareTupleRouting followed by ExecInsert on the new tuple after the
>> old tuple is deleted.  If an FDW doesn't support tuple routing, then a
>> tuple cannot be moved into it.  That's what that text is talking about.
>>
>> Maybe, we should reword it as "if the FDW supports tuple routing", so that
>> a reader doesn't go looking around for "tuple moving support" in FDWs.
> 
> I think you missed my point.  If there's no special support for "tuple
> moving", as you say, then what help is it to tell the user "if the FDW
> supports tuple routing"?   The answer is, it's not any help. How would
> the user check such a fact?

Hmm, maybe getting the following error, like one would get in PG 10 when
using postgres_fdw-managed partitions:

ERROR:  cannot route inserted tuples to a foreign table

Getting the above error is perhaps not the best way for a user to learn of
this fact, but maybe we (and hopefully other FDW authors) should mention this
in the documentation?

> As far as I can tell, this is just the requirements as defined in
> CheckValidResultRel() for CMD_INSERT. Fragments of which I pasted
> above.

Only supporting INSERT doesn't suffice though.  An FDW which intends to
support tuple routing and hence 1-way tuple moving needs to be updated like
postgres_fdw was in PG 11.

Thanks,
Amit




Re: Inheriting table AMs for partitioned tables

2019-03-05 Thread David Rowley
On Tue, 5 Mar 2019 at 19:08, Andres Freund  wrote:
>
> On 2019-03-05 16:01:50 +1300, David Rowley wrote:
> > I'd suggest it's made to work the same way as ca4103025dfe26 made
> > tablespaces work.
>
> Hm, is that actually correct?  Because as far as I can tell that doesn't
> have the necessary pg_dump code to make this behaviour persistent:
>
> CREATE TABLESPACE frak LOCATION '/tmp/frak';
> CREATE TABLE test_tablespace (a text, b int) PARTITION BY list (a) TABLESPACE 
> frak ;
> CREATE TABLE test_tablespace_1 PARTITION OF test_tablespace FOR VALUES in 
> ('a');
> CREATE TABLE test_tablespace_2 PARTITION OF test_tablespace FOR VALUES in 
> ('b') TABLESPACE pg_default;
> CREATE TABLE test_tablespace_3 PARTITION OF test_tablespace FOR VALUES in 
> ('c') TABLESPACE frak;
>
> SELECT relname, relkind, reltablespace FROM pg_class WHERE relname LIKE 
> 'test_tablespace%' ORDER BY 1;
> ┌───┬─┬───┐
> │  relname  │ relkind │ reltablespace │
> ├───┼─┼───┤
> │ test_tablespace   │ p   │ 16384 │
> │ test_tablespace_1 │ r   │ 16384 │
> │ test_tablespace_2 │ r   │ 0 │
> │ test_tablespace_3 │ r   │ 16384 │
> └───┴─┴───┘

[pg_dump/pg_restore]

> ┌───┬─┬───┐
> │  relname  │ relkind │ reltablespace │
> ├───┼─┼───┤
> │ test_tablespace   │ p   │ 16384 │
> │ test_tablespace_1 │ r   │ 16384 │
> │ test_tablespace_2 │ r   │ 16384 │
> │ test_tablespace_3 │ r   │ 16384 │
> └───┴─┴───┘

frak... that's a bit busted. I can't instantly think of a fix, but I
see the same problem does not seem to exist for partitioned indexes, so
that's a relief since that's already in PG11.

I'll take this up on another thread once I have something good to report.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Update does not move row across foreign partitions in v11

2019-03-05 Thread David Rowley
On Wed, 6 Mar 2019 at 17:20, Amit Langote  wrote:
>
> On 2019/03/06 12:47, David Rowley wrote:
> > It seems a bit light on detail to me. If I was a user I'd want to know
> > what exactly the FDW needed to support this. Does it need a special
> > partition move function?  Looking at ExecFindPartition(), this check
> > seems to be done in CheckValidResultRel() and is basically:
> >
> > case RELKIND_FOREIGN_TABLE:
> > /* Okay only if the FDW supports it */
> > fdwroutine = resultRelInfo->ri_FdwRoutine;
> > switch (operation)
> > {
> > case CMD_INSERT:
> > if (fdwroutine->ExecForeignInsert == NULL)
> > ereport(ERROR,
> > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > errmsg("cannot insert into foreign table \"%s\"",
> > RelationGetRelationName(resultRel;
> >
> > Alternatively, we could just remove the mention about "if the FDW
> > supports it", since it's probably unlikely for an FDW not to support
> > INSERT.
>
> AFAIK, there's no special support in FDWs for "tuple moving" as such.  The
> "if the FDW supports it" refers to the FDW's ability to handle tuple
> routing.  Note that moving/re-routing involves calling
> ExecPrepareTupleRouting followed by ExecInsert on the new tuple after the
> old tuple is deleted.  If an FDW doesn't support tuple routing, then a
> tuple cannot be moved into it.  That's what that text is talking about.
>
> Maybe, we should reword it as "if the FDW supports tuple routing", so that
> a reader doesn't go looking around for "tuple moving support" in FDWs.

I think you missed my point.  If there's no special support for "tuple
moving", as you say, then what help is it to tell the user "if the FDW
supports tuple routing"?   The answer is, it's not any help. How would
the user check such a fact?

As far as I can tell, this is just the requirements as defined in
CheckValidResultRel() for CMD_INSERT. Fragments of which I pasted
above.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: New vacuum option to do only freezing

2019-03-05 Thread Masahiko Sawada
On Tue, Mar 5, 2019 at 8:01 PM Kyotaro HORIGUCHI
 wrote:
>
> Hello, I have some other comments.
>

Thank you for the comment!

On Tue, Mar 5, 2019 at 8:01 PM Kyotaro HORIGUCHI
 wrote:
>
>
> +   nleft;/* item pointers we left */
>
> The name seems like it should be something else, and the comment doesn't
> make sense, at least to me.  Looking below,
>
> +"%.0f tuples are left as dead.\n",
> +nleft),
> + nleft);
>
> How about "nleft_dead; /* iterm pointers left as dead */"?

Fixed.

>
>
>
> In this block:
>
> -if (nindexes == 0 &&
> +if ((nindexes == 0 || skip_index_vacuum) &&
>  vacrelstats->num_dead_tuples > 0)
>  {
>
> Is it right that vacuumed_pages is incremented and FSM is updated
> while the page is not actually vacuumed?

Good catch. I think the FSM stuff is right because we actually did HOT
pruning, but the increment of vacuumed_pages seems wrong to me. I think
we should not increment it and should not report the "removed XX row
versions in YY pages" message.

>
>
>  tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, false,
> - >latestRemovedXid);
> + >latestRemovedXid,
> + _pruned);
>
> tups_pruned looks as "HOT pruned tuples". It is named "unused" in
> the function's parameters. (But I think it is useless. Please see
> the details below.)
>
>
> I tested it with a simple data set.
>
> (autovacuum = off)
> drop table if exists t;
> create table t with (fillfactor=50) as select a, a % 3 as b from 
> generate_series(1, 9) a;
> create index on t(a);
> update t set a = -a where b = 0;
> update t set b = b + 1 where b = 1;
>
> We now have 9 tuples, 15 versions and 3 out of 6 "old" tuples are
> to be "left dead" by DISABLE_INDEX_CLEANUP vacuum. It means,
> three tuples ends with "left dead", three tuples are removed and
> 12 tuples will survive the vacuum below.
>
> vacuum (verbose, freeze ,disable_index_cleanup, disable_page_skipping) t;
>
> > INFO:  "t": removed 0 row versions in 1 pages
> > INFO:  "t": found 0 removable, 9 nonremovable row versions in 1 out of 1 
> > pages
> > DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 925
> > There were 0 unused item pointers.
> > Skipped 0 pages due to buffer pins, 0 frozen pages.
> > 0 pages are entirely empty.
> > 3 tuples are left as dead.
>
> Three tuple versions have vanished. Actually they were removed
> but not shown in the message.
>
> heap_prune_chain() doesn't count a live root entry of a chain as
> "unused (line pointer)" since it is marked as "redierected". As
> the result the vanished tuples are counted in tups_vacuumed, not
> in tups_pruned. Maybe the name tups_vacuumed is confusing.  After
> removing tups_pruned code it works correctly.
>
> > INFO:  "t": removed 6 row versions in 1 pages
> > INFO:  "t": found 6 removable, 9 nonremovable row versions in 1 out of 1 
> > pages
> > DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 932
> > There were 0 unused item pointers.
> > Skipped 0 pages due to buffer pins, 0 frozen pages.
> > 0 pages are entirely empty.
> > 3 tuples are left as dead.
>
> I see two choices of the second line above.
>
> 1> "t": found 6 removable, 9 nonremovable row versions in 1 out of 1 pages
>
>   "removable" includes "left dead" tuples.
>
> 2> "t": found 3 removable, 12 nonremovable row versions in 1 out of 1 pages
>
>   "removable" excludes "left dead" tuples.
>
> If you prefer the latter, removable and nonremovable need to be
> corrected using nleft.

I think that the first vacuum should report the former message because
it's true that the table has 6 removable tuples. We remove 6 tuples
but leave 3 item pointers. So in the second vacuum, it should be
"found 0 removable, 9 nonremovable row versions ..." and "3 tuples are
left as dead". But to report more precisely it'd be better to report
"0 tuples and 3 item identifiers are left as dead" here.

The attached updated patch incorporates all of the comments. Also I've added
a new reloption, vacuum_index_cleanup, as per the discussion on the "reloption
to prevent VACUUM from truncating empty pages at the end of relation"
thread. Autovacuum can also skip index cleanup when the reloption is
set to false. Since setting this to false might lead to some problems,
I've made autovacuum report the number of dead tuples and dead
itemids we left.
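
For reference, the knobs as they stand in this version look like this (option
and reloption names per this patch, so they may still change):

-- per-command: skip index cleanup for this run only
VACUUM (FREEZE, DISABLE_INDEX_CLEANUP) some_table;
-- per-table: also honored by autovacuum
ALTER TABLE some_table SET (vacuum_index_cleanup = false);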

Regards,

Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From 70358ea1ae7d7c8f78d9101176f4cc1652c3c978 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Fri, 1 Feb 2019 16:02:50 +0100
Subject: [PATCH v8 2/2] Add --diable-index-cleanup option to vacuumdb.

---
 doc/src/sgml/ref/vacuumdb.sgml| 15 +++
 src/bin/scripts/t/100_vacuumdb.pl |  9 -
 src/bin/scripts/vacuumdb.c| 29 

Re: Update does not move row across foreign partitions in v11

2019-03-05 Thread Amit Langote
On 2019/03/06 12:47, David Rowley wrote:
> On Wed, 6 Mar 2019 at 16:29, Etsuro Fujita  
> wrote:
>> That means that rows can be moved from a local partition to a foreign
>> partition if the FDW supports it.
> 
> It seems a bit light on detail to me. If I was a user I'd want to know
> what exactly the FDW needed to support this. Does it need a special
> partition move function?  Looking at ExecFindPartition(), this check
> seems to be done in CheckValidResultRel() and is basically:
> 
> case RELKIND_FOREIGN_TABLE:
> /* Okay only if the FDW supports it */
> fdwroutine = resultRelInfo->ri_FdwRoutine;
> switch (operation)
> {
> case CMD_INSERT:
> if (fdwroutine->ExecForeignInsert == NULL)
> ereport(ERROR,
> (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> errmsg("cannot insert into foreign table \"%s\"",
> RelationGetRelationName(resultRel;
> 
> Alternatively, we could just remove the mention about "if the FDW
> supports it", since it's probably unlikely for an FDW not to support
> INSERT.

AFAIK, there's no special support in FDWs for "tuple moving" as such.  The
"if the FDW supports it" refers to the FDW's ability to handle tuple
routing.  Note that moving/re-routing involves calling
ExecPrepareTupleRouting followed by ExecInsert on the new tuple after the
old tuple is deleted.  If an FDW doesn't support tuple routing, then a
tuple cannot be moved into it.  That's what that text is talking about.

Maybe, we should reword it as "if the FDW supports tuple routing", so that
a reader doesn't go looking around for "tuple moving support" in FDWs.

Thanks,
Amit




Re: Update does not move row across foreign partitions in v11

2019-03-05 Thread Amit Langote
Fujita-san,

On 2019/03/06 13:04, Etsuro Fujita wrote:
> (2019/03/06 11:34), Amit Langote wrote:
>> Ah, indeed.  In the documentation fix patch I'd posted, I also made
>> changes to release-11.sgml to link to the limitations section.  (I'm
>> attaching it here for your reference.)
> 
> I'm not sure it's a good idea to make changes to the release notes like
> that, because 1) that would make the release notes verbose, and 2) it
> might end up doing the same thing to items that have some limitations in
> the existing/future release notes (eg, FOR EACH ROW triggers on
> partitioned tables added to V11 has the limitation listed on the
> limitation section, so the same link would be needed.), for consistency.

OK, sure.  It just seemed to me that the original complainer found it
quite a bit surprising that such a limitation is not mentioned in the
release notes, but maybe that's fine.  It seems we don't normally list
feature limitations in the release notes, which as you rightly say, would
make them verbose.

The main problem here is indeed that the limitation is not listed under
the partitioning limitations in ddl.sgml, where it's easier to notice than
in the UPDATE's page.  I've updated my patch to remove the release-11.sgml
changes.

Thanks,
Amit
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 7bed4f56f0..3b20e73a61 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3376,6 +3376,13 @@ ALTER TABLE measurement ATTACH PARTITION 
measurement_y2008m02
   
  
 
+ 
+  
+   UPDATE row movement is not supported in the cases
+   where the old row is contained in a foreign table partition.
+  
+ 
+
  
   
BEFORE ROW triggers, if necessary, must be defined


Re: Converting NOT IN to anti-joins during planning

2019-03-05 Thread David Rowley
Hi Jim,

Thanks for replying here.

On Wed, 6 Mar 2019 at 16:37, Jim Finnerty  wrote:
>
> Actually, we're working hard to integrate the two approaches.  I haven't had
> time since I returned to review your patch, but I understand that you were
> checking for strict predicates as part of the nullness checking criteria,
> and we definitely must have that.  Zheng tells me that he has combined your
> patch with ours, but before we put out a new patch, we're trying to figure
> out how to preserve the existing NOT IN execution plan in the case where the
> materialized subplan fits in memory.  This (good) plan is effectively an
> in-memory hash anti-join.
>
> This is tricky to do because the NOT IN Subplan to anti-join transformation
> currently happens early in the planning process, whereas the decision to
> materialize is made much later, when the best path is being converted into a
> Plan.

I guess you're still going with the OR ... IS NULL in your patch then?
Otherwise, you'd likely find that the transformation (when NULLs are
not possible) is always a win since it'll allow hash anti-joins (see
#2 in the original email on this thread).  FWIW, I mentioned in [1] and
Tom confirmed in [2] that we both think hacking the join condition to
add an OR .. IS NULL is a bad idea. I guess you're not deterred by
that?
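
For anyone new to the thread, the transformation in question is roughly this
(stated with a sufficient condition only; the patches differ in how the
nullability is proven):

-- When neither t1.x nor the subquery output t2.y can be NULL,
SELECT * FROM t1 WHERE x NOT IN (SELECT y FROM t2);
-- can be planned as an anti-join, i.e. equivalent to:
SELECT * FROM t1 WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.y = t1.x);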

I'd say your next best move is, over on the other thread, to put up
your argument against what Tom and I mentioned, then detail out what
exactly you're planning. Likely this will save time. I personally
don't think that ignoring this part is going to allow you to progress
your patch too much further in PostgreSQL. Consensus about how $thing
works is something that's needed before the $thing can ever be
committed. Sometimes lack of objection can count, but an unaddressed
objection is not consensus. Not trying to make this hard, just trying
to explain the process.

[1] 
https://www.postgresql.org/message-id/CAKJS1f8q4S%2B5Z7WSRDWJd__SwqMr12JdWKXTDo35ptzneRvZnw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/5420.1551487529%40sss.pgh.pa.us

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Update does not move row across foreign partitions in v11

2019-03-05 Thread Etsuro Fujita

(2019/03/06 11:34), Amit Langote wrote:

Ah, indeed.  In the documentation fix patch I'd posted, I also made
changes to release-11.sgml to link to the limitations section.  (I'm
attaching it here for your reference.)


I'm not sure it's a good idea to make changes to the release notes like
that, because 1) that would make the release notes verbose, and 2) for
consistency we might end up having to do the same thing for items that have
limitations in existing/future release notes (e.g., FOR EACH ROW triggers on
partitioned tables, added in v11, have their limitation listed in the
limitations section, so the same link would be needed there).


Best regards,
Etsuro Fujita




Re: [PATCH v20] GSSAPI encryption support

2019-03-05 Thread Robbie Harwood
Stephen Frost  writes:

> * Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
>
>> Or maybe we avoid that, and you rename be-secure-gssapi.c to just
>> be-gssapi.c and also combine that with the contents of
>> be-gssapi-common.c.
>
> I don't know why we would need to, or want to, combine
> be-secure-gssapi.c and be-gssapi-common.c, they do have different
> roles in that be-gssapi-common.c is used even if you aren't doing
> encryption, while bs-secure-gssapi.c is specifically for handling the
> encryption side of things.
>
> Then again, be-gssapi-common.c does currently only have the one
> function in it, and it's not like there's an option to compile for
> *just* GSS authentication (and not encryption), or at least, I don't
> think that would be a useful option to have..  Robbie?

Yeah, I think I'm opposed to making that an option.

Worth pointing out here: up until v6, I had this structured differently,
with all the GSSAPI code in fe-gssapi.c and be-gssapi.c.  The current
separation was suggested by Michael Paquier for ease of reading and to
keep the code churn down.

>> (And similarly in libpq.)
>
> I do agree that we should try to keep the frontend and backend code
> structures similar in this area, that seems to make sense.

Well, I don't know if it's an argument in either direction, but: on the
frontend we have about twice as much shared code in fe-gssapi-common.c
(pg_GSS_have_ccache() and pg_GSS_load_servicename()).

>> I don't see any tests in the patch.  We have a Kerberos test suite at
>> src/test/kerberos/ and an SSL test suite at src/test/ssl/.  You can
>> get some ideas there.
>
> Yeah, I was going to comment on that as well.  We definitely should
> implement tests around this.

Attached.  Please note that I don't really speak perl.  There's a pile
of duplicated code in 002_enc.pl that probably should be shared between
the two.  (It would also, I think, be possible for 001_auth.pl to set up
the KDC and for 002_enc.pl to then use it.)

Thanks,
--Robbie


>From 42ab1ccae8e517934866ee923d80554ef1996709 Mon Sep 17 00:00:00 2001
From: Robbie Harwood 
Date: Tue, 5 Mar 2019 22:54:11 -0500
Subject: [PATCH] Add tests for GSSAPI/krb5 encryption

---
 src/test/kerberos/t/002_enc.pl | 197 +
 1 file changed, 197 insertions(+)
 create mode 100644 src/test/kerberos/t/002_enc.pl

diff --git a/src/test/kerberos/t/002_enc.pl b/src/test/kerberos/t/002_enc.pl
new file mode 100644
index 00..927abe15e4
--- /dev/null
+++ b/src/test/kerberos/t/002_enc.pl
@@ -0,0 +1,197 @@
+use strict;
+use warnings;
+use TestLib;
+use PostgresNode;
+use Test::More;
+use File::Path 'remove_tree';
+
+if ($ENV{with_gssapi} eq 'yes')
+{
+	plan tests => 5;
+}
+else
+{
+	plan skip_all => 'GSSAPI/Kerberos not supported by this build';
+}
+
+my ($krb5_bin_dir, $krb5_sbin_dir);
+
+if ($^O eq 'darwin')
+{
+	$krb5_bin_dir  = '/usr/local/opt/krb5/bin';
+	$krb5_sbin_dir = '/usr/local/opt/krb5/sbin';
+}
+elsif ($^O eq 'freebsd')
+{
+	$krb5_bin_dir  = '/usr/local/bin';
+	$krb5_sbin_dir = '/usr/local/sbin';
+}
+elsif ($^O eq 'linux')
+{
+	$krb5_sbin_dir = '/usr/sbin';
+}
+
+my $krb5_config  = 'krb5-config';
+my $kinit= 'kinit';
+my $kdb5_util= 'kdb5_util';
+my $kadmin_local = 'kadmin.local';
+my $krb5kdc  = 'krb5kdc';
+
+if ($krb5_bin_dir && -d $krb5_bin_dir)
+{
+	$krb5_config = $krb5_bin_dir . '/' . $krb5_config;
+	$kinit   = $krb5_bin_dir . '/' . $kinit;
+}
+if ($krb5_sbin_dir && -d $krb5_sbin_dir)
+{
+	$kdb5_util= $krb5_sbin_dir . '/' . $kdb5_util;
+	$kadmin_local = $krb5_sbin_dir . '/' . $kadmin_local;
+	$krb5kdc  = $krb5_sbin_dir . '/' . $krb5kdc;
+}
+
+my $host = 'auth-test-localhost.postgresql.example.com';
+my $hostaddr = '127.0.0.1';
+my $realm = 'EXAMPLE.COM';
+
+my $krb5_conf   = "${TestLib::tmp_check}/krb5.conf";
+my $kdc_conf= "${TestLib::tmp_check}/kdc.conf";
+my $krb5_log= "${TestLib::tmp_check}/krb5libs.log";
+my $kdc_log = "${TestLib::tmp_check}/krb5kdc.log";
+my $kdc_port= int(rand() * 16384) + 49152;
+my $kdc_datadir = "${TestLib::tmp_check}/krb5kdc";
+my $kdc_pidfile = "${TestLib::tmp_check}/krb5kdc.pid";
+my $keytab  = "${TestLib::tmp_check}/krb5.keytab";
+
+note "setting up Kerberos";
+
+my ($stdout, $krb5_version);
+run_log [ $krb5_config, '--version' ], '>', \$stdout
+  or BAIL_OUT("could not execute krb5-config");
+BAIL_OUT("Heimdal is not supported") if $stdout =~ m/heimdal/;
+$stdout =~ m/Kerberos 5 release ([0-9]+\.[0-9]+)/
+  or BAIL_OUT("could not get Kerberos version");
+$krb5_version = $1;
+
+append_to_file(
+	$krb5_conf,
+	qq![logging]
+default = FILE:$krb5_log
+kdc = FILE:$kdc_log
+
+[libdefaults]
+default_realm = $realm
+
+[realms]
+$realm = {
+kdc = $hostaddr:$kdc_port
+}!);
+
+append_to_file(
+	$kdc_conf,
+	qq![kdcdefaults]
+!);
+
+# For new-enough versions of krb5, use the _listen settings rather
+# than the _ports settings so that we can bind to localhost only.
+if 

Re: Update does not move row across foreign partitions in v11

2019-03-05 Thread David Rowley
On Wed, 6 Mar 2019 at 16:29, Etsuro Fujita  wrote:
>
> (2019/03/06 11:06), David Rowley wrote:
> > I don't quite understand what a "foreign table to some other
> > partition" is meant to mean. Partitions don't have foreign tables,
> > they can only be one themselves.
>
> I think "foreign table" is describing "partition" in front of that; "a
> partition that is a foreign table".

I think I was reading this wrong:

-   Currently, rows cannot be moved from a partition that is a
-   foreign table to some other partition, but they can be moved into a foreign
-   table if the foreign data wrapper supports it.

I parsed it as "cannot be moved from a partition, that is a foreign
table to some other partition"

and subsequently struggled with what "a foreign table to some other
partition" is.

but now looking at it, I think it's meant to mean:

"cannot be moved from a foreign table partition to another partition"

> > I've tried to put all this right again in the attached. However, I was
> > a bit unsure of what "but they can be moved into a foreign table if
> > the foreign data wrapper supports it." is referring to. Copying Robert
> > and Etsuro as this was all added in 3d956d9562aa. Hopefully, they can
> > confirm what is meant by this.
>
> That means that rows can be moved from a local partition to a foreign
> partition if the FDW supports it.

It seems a bit light on detail to me. If I was a user I'd want to know
what exactly the FDW needed to support this. Does it need a special
partition move function?  Looking at ExecFindPartition(), this check
seems to be done in CheckValidResultRel() and is basically:

case RELKIND_FOREIGN_TABLE:
/* Okay only if the FDW supports it */
fdwroutine = resultRelInfo->ri_FdwRoutine;
switch (operation)
{
case CMD_INSERT:
if (fdwroutine->ExecForeignInsert == NULL)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot insert into foreign table \"%s\"",
RelationGetRelationName(resultRel;

Alternatively, we could just remove the mention about "if the FDW
supports it", since it's probably unlikely for an FDW not to support
INSERT.

> IMO, I think the existing mention in [3] is good, so I would vote for
> putting the same mention in table 5.10.2.3 in [2] as well.

I think the sentence is unclear; at least, I struggled to parse it the
first time.  I'm happy for Amit to choose some better words and include
them in his patch. I think it should be done in the same commit.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Update does not move row across foreign partitions in v11

2019-03-05 Thread Etsuro Fujita

(2019/03/06 11:06), David Rowley wrote:

On Tue, 5 Mar 2019 at 03:01, Derek Hans  wrote:

Based on a reply to reporting this as a bug, moving rows out of foreign 
partitions is not yet implemented so this is behaving as expected. There's a 
mention of this limitation in the Notes section of the Update docs.


(Moving this discussion to -Hackers)

In [1], Derek reports that once a row is inserted into a foreign
partition, an UPDATE does not correctly route it back out into the
correct partition.

I didn't really follow the foreign partition code when it went in, but
do recall being involved in the documentation about the limitations of
partitioned tables in table 5.10.2.3 in [2].  Unfortunately, table
5.10.2.3 does not seem to mention this limitation at all.  As Derek
mentions, there is a brief mention in [3] in the form of:

"Currently, rows cannot be moved from a partition that is a foreign
table to some other partition, but they can be moved into a foreign
table if the foreign data wrapper supports it."

I don't quite understand what a "foreign table to some other
partition" is meant to mean. Partitions don't have foreign tables,
they can only be one themselves.


I think "foreign table" is describing "partition" in front of that; "a 
partition that is a foreign table".



I've tried to put all this right again in the attached. However, I was
a bit unsure of what "but they can be moved into a foreign table if
the foreign data wrapper supports it." is referring to. Copying Robert
and Etsuro as this was all added in 3d956d9562aa. Hopefully, they can
confirm what is meant by this.


That means that rows can be moved from a local partition to a foreign 
partition if the FDW supports it.


IMO, I think the existing mention in [3] is good, so I would vote for 
putting the same mention in table 5.10.2.3 in [2] as well.


Best regards,
Etsuro Fujita




Re: Rare SSL failures on eelpout

2019-03-05 Thread Thomas Munro
On Wed, Mar 6, 2019 at 4:07 PM Shawn Debnath  wrote:
> On Wed, Mar 06, 2019 at 11:13:31AM +1300, Thomas Munro wrote:
> > So... can anyone tell us what happens on Windows?

> C:\Users\Shawn Debnath\Desktop>c:\Python27\python.exe tmunro-ssl-test.py 
> --client
> Sending A...
> 2
> Sending B...
> [Errno 10054] An existing connection was forcibly closed by the remote host
> Sending C...
> [Errno 10054] An existing connection was forcibly closed by the remote host
> Traceback (most recent call last):
>   File "tmunro-ssl-test.py", line 57, in <module>
> client()
>   File "tmunro-ssl-test.py", line 51, in client
> print s.recv(1024)
> socket.error: [Errno 10054] An existing connection was forcibly closed by the 
> remote host

Thanks!  Wow, so not only can we not read the final message sent by
the server, if we try we get an error.  That's... not what we want.  I
wonder if there might be a way to put the socket into don't-do-that
mode...

-- 
Thomas Munro
https://enterprisedb.com



Re: few more wait events to add to docs

2019-03-05 Thread Michael Paquier
On Wed, Mar 06, 2019 at 02:21:15PM +1300, Thomas Munro wrote:
> Here's some missing documentation.

The addition looks fine to me.

> Hmm, yeah.  I wish these were alphabetised, I wish there was an
> automated warning about this, I wish these tranches were declared a
> better way that by adding code in RegisterLWLockTranches().

Why not use this occasion to reorganize the LWLock and Lock sections
so that their entries are fully alphabetized?  I have made an
effort in this direction in 5ef037c for all the sections except these
two.  And honestly getting all these organized would really help
documentation readers.
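
Getting them organized matters all the more because these are the names
readers will be looking up from pg_stat_activity, say:

SELECT pid, wait_event_type, wait_event
  FROM pg_stat_activity
  WHERE wait_event IS NOT NULL;
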
--
Michael


signature.asc
Description: PGP signature


Tab completion for SKIP_LOCKED option

2019-03-05 Thread Masahiko Sawada
Hi,

I realized that the tab completions for the SKIP_LOCKED option of both
VACUUM and ANALYZE are missing. The attached patch adds them.
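
For reference, these are the spellings the completion now covers
(some_table being just a hypothetical table here):

VACUUM (SKIP_LOCKED) some_table;
ANALYZE (SKIP_LOCKED) some_table;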

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


skip_locked.patch
Description: Binary data


Re: Online verification of checksums

2019-03-05 Thread Stephen Frost
Greetings,

On Tue, Mar 5, 2019 at 18:36 Michael Paquier  wrote:

> On Tue, Mar 05, 2019 at 02:08:03PM +0100, Tomas Vondra wrote:
> > Based on quickly skimming that thread the main issue seems to be
> > deciding which files in the data directory are expected to have
> > checksums. Which is a valid issue, of course, but I was expecting
> > something about partial read/writes etc.
>
> I remember complaining about partial write handling as well for the
> base backup checks...  There should be an email about it on the list,
> cannot find it now ;p
>
> > My understanding is that:
> >
> > (a) The checksum verification should not generate false positives (same
> > as for basebackup).
> >
> > (b) The partial reads do emit warnings, which might be considered false
> > positives I guess. Which is why I'm arguing for changing it to do the
> > same thing basebackup does, i.e. ignore this.
>
> Well, at least that's consistent...  Argh, I really think that we
> ought to make the failures reported more loudly because that's easier to
> detect within a tool and some deployments set log_min_messages >
> WARNING so checksum failures would just be lost.  For base backups we
> don't care much about that as files are just blindly copied so they
> could have torn pages, which is fine as that's fixed at replay.  Now
> we are talking about a set of tools which could have reliable
> detection mechanisms for those problems.


I’m traveling but will try to comment more in the coming days but in
general I agree with Tomas on these items. Also, pg_basebackup has to
handle torn pages when it comes to checksums just like the verify tool
does, and having them be consistent (along with external tools) would
really be for the best, imv.  I still feel like a retry of a short read
(try reading more to get the whole page..) would be alright and reading
until we hit eof and then moving on. I’m not sure it’s possible but I do
worry a bit that we might get a short read from a network file system or
something that isn’t actually at eof and then we would skip a significant
remaining portion of the file...  Another thought might be to stat the
file after we have opened it to see its length...

Just a few thoughts since I’m on my phone.  Will try to write up something
more in a day or two.

Thanks!

Stephen


Re: Online verification of checksums

2019-03-05 Thread Michael Paquier
On Tue, Mar 05, 2019 at 02:08:03PM +0100, Tomas Vondra wrote:
> Based on quickly skimming that thread the main issue seems to be
> deciding which files in the data directory are expected to have
> checksums. Which is a valid issue, of course, but I was expecting
> something about partial read/writes etc.

I remember complaining about partial write handling as well for the
base backup checks...  There should be an email about it on the list,
cannot find it now ;p

> My understanding is that:
> 
> (a) The checksum verification should not generate false positives (same
> as for basebackup).
> 
> (b) The partial reads do emit warnings, which might be considered false
> positives I guess. Which is why I'm arguing for changing it to do the
> same thing basebackup does, i.e. ignore this.

Well, at least that's consistent...  Argh, I really think that we
ought to make the failures reported more loudly because that's easier to
detect within a tool and some deployments set log_min_messages >
WARNING so checksum failures would just be lost.  For base backups we
don't care much about that as files are just blindly copied so they
could have torn pages, which is fine as that's fixed at replay.  Now
we are talking about a set of tools which could have reliable
detection mechanisms for those problems.
--
Michael


signature.asc
Description: PGP signature


Re: Pluggable Storage - Andres's take

2019-03-05 Thread Andres Freund
Hi,

Thanks for looking!

On 2019-03-05 18:27:45 -0800, Ashwin Agrawal wrote:
> While playing with the tableam, usage of which starts with commit
> v12-0023-tableam-Introduce-and-use-begin-endscan-and-do-i.patch, should we
> check for NULL function pointer before actually calling the same and ERROR
> out instead as NOT_SUPPORTED or something on those lines.

Scans seem like absolutely required part of the functionality, so I
don't think there's much point in that. It'd just bloat code and
runtime.


> Understand it's the kind of thing which should get caught during development.
> But still, currently it segfaults if some AM function is left undefined,

The segfault itself doesn't bother me at all; it's just a NULL pointer
dereference. If we were to put Asserts somewhere it'd crash very
similarly. I think you have a point in that:

> and it might be easier for iterative development to error out in a common place instead.

Would make it a tiny bit easier to implement a new AM.  We could
probably add a few asserts to GetTableAmRoutine(), to check that
required functions are implemented.  Don't think that'd make a meaningful
difference for something like the scan functions, but it'd probably make
it easier to forward port AMs to the next release - I'm pretty sure
we're going to add required callbacks in the next few releases.

Greetings,

Andres Freund



Re: Update does not move row across foreign partitions in v11

2019-03-05 Thread Amit Langote
On 2019/03/06 11:29, David Rowley wrote:
> On Wed, 6 Mar 2019 at 15:26, Amit Langote  
> wrote:
>>
>>> I've tried to put all this right again in the attached. However, I was
>>> a bit unsure of what "but they can be moved into a foreign table if
>>> the foreign data wrapper supports it." is referring to. Copying Robert
>>> and Etsuro as this was all added in 3d956d9562aa. Hopefully, they can
>>> confirm what is meant by this.
>>
>> Did you miss my reply on that thread?
>>
>> https://www.postgresql.org/message-id/CA%2BHiwqF3gma5HfCJb4_cOk0_%2BLEpVc57EHdBfz_EKt%2BNu0hNYg%40mail.gmail.com
> 
> Yes. I wasn't aware that there were two threads for this.

Ah, indeed.  In the documentation fix patch I'd posted, I also made
changes to release-11.sgml to link to the limitations section.  (I'm
attaching it here for your reference.)

Thanks,
Amit
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 7bed4f56f0..3b20e73a61 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3376,6 +3376,13 @@ ALTER TABLE measurement ATTACH PARTITION 
measurement_y2008m02
   
  
 
+ 
+  
+   UPDATE row movement is not supported in the cases
+   where the old row is contained in a foreign table partition.
+  
+ 
+
  
   
BEFORE ROW triggers, if necessary, must be defined
diff --git a/doc/src/sgml/release-11.sgml b/doc/src/sgml/release-11.sgml
index 14e2726f0c..d6fc5a3e31 100644
--- a/doc/src/sgml/release-11.sgml
+++ b/doc/src/sgml/release-11.sgml
@@ -2578,6 +2578,13 @@ Branch: REL9_3_STABLE [84261eb10] 2018-10-19 17:02:26 
-0400
 column now cause affected rows to be moved to the appropriate
 partitions (Amit Khandekar)

+
+   
+However, not all cases where such row movement would be necessary
+are handled currently; see
+declarative
+partitioning limitations for more information.
+   
   
 
   


Re: Update does not move row across foreign partitions in v11

2019-03-05 Thread David Rowley
On Wed, 6 Mar 2019 at 15:26, Amit Langote  wrote:
>
> > I've tried to put all this right again in the attached. However, I was
> > a bit unsure of what "but they can be moved into a foreign table if
> > the foreign data wrapper supports it." is referring to. Copying Robert
> > and Etsuro as this was all added in 3d956d9562aa. Hopefully, they can
> > confirm what is meant by this.
>
> Did you miss my reply on that thread?
>
> https://www.postgresql.org/message-id/CA%2BHiwqF3gma5HfCJb4_cOk0_%2BLEpVc57EHdBfz_EKt%2BNu0hNYg%40mail.gmail.com

Yes. I wasn't aware that there were two threads for this.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Pluggable Storage - Andres's take

2019-03-05 Thread Ashwin Agrawal
Hi,

While playing with the tableam, usage of which starts with commit
v12-0023-tableam-Introduce-and-use-begin-endscan-and-do-i.patch, should we
check for NULL function pointer before actually calling the same and ERROR
out instead as NOT_SUPPORTED or something on those lines.

Understand it's the kind of thing which should get caught during development.
But still, currently it segfaults if some AM function is left undefined,
and it might be easier for iterative development to error out in a common place instead.

Or should there be an upfront check for NULL somewhere, if all the AM
functions are mandatory and should not be NULL?


Re: Patch to document base64 encoding

2019-03-05 Thread Michael Paquier
On Tue, Mar 05, 2019 at 07:55:22PM -0600, Karl O. Pinc wrote:
> Attached: doc_base64_v4.patch

Details about the "escape" mode are already available within the
description of function "encode".  Wouldn't we want to consolidate a
description for all the modes at the same place, including some words
for hex?  Your patch only includes the description of base64, which is
a good addition, but still not consistent with the rest.  A paragraph
after all the listed functions is fine I think, as the description is
long and would bloat the table if included directly.
--
Michael


signature.asc
Description: PGP signature


Re: Prevent extension creation in temporary schemas

2019-03-05 Thread Michael Paquier
On Tue, Mar 05, 2019 at 12:47:54PM +, Chris Travers wrote:
> I tried installing a test extension into a temp schema.  I found
> this was remarkably difficult to do because pg_temp did not work (I
> had to create a temporary table and then locate the actual table it
> was created in).  While that might also be a bug it is not in the
> scope of this patch so mostly noting in terms of future work.

pgcrypto works in this case.

> After creating the extension I did as follows:
> \dx in the current session shows the extension
> \dx in a stock psql shows the extension in a separate session
> \dx with a patched psql in a separate session does not show the
> extension.
> 
> In terms of the scope of this patch, I think this correctly and
> fully solves the problem at hand. 

I was just looking at this patch this morning with fresh eyes, and I
think that I have found one argument to *not* apply it.  Imagine the
following in one session:
=# create extension pgcrypto with schema pg_temp_3;
CREATE EXTENSION
=# \dx
  List of installed extensions
   Name   | Version |   Schema   | Description
--+-++--
 pgcrypto | 1.3 | pg_temp_3  | cryptographic functions
 plpgsql  | 1.0 | pg_catalog | PL/pgSQL procedural language
(2 rows)

That's all good, we see that the session which created this extension
has it listed.  Now let's use in parallel a second session:
=# create extension pgcrypto with schema pg_temp_4;
ERROR:  42710: extension "pgcrypto" already exists
LOCATION:  CreateExtension, extension.c:1664
=# \dx
  List of installed extensions
   Name   | Version |   Schema   | Description
--+-++--
 plpgsql  | 1.0 | pg_catalog | PL/pgSQL procedural language
(1 row)

This is also good, because the extension in the temporary schema of
the first session does not show up.  Still, I think this can bring some
confusion to the user: the extension is not listed via \dx, and yet
trying to create it with a different schema fails.

Thoughts?
--
Michael


signature.asc
Description: PGP signature


RE: speeding up planning with partitions

2019-03-05 Thread Imai, Yoshikazu
Amit-san,

On Tue, Mar 5, 2019 at 10:24 AM, Amit Langote wrote:
> On 2019/03/05 9:50, Amit Langote wrote:
> > I'll post the updated patches after diagnosing what I'm suspecting a
> > memory over-allocation bug in one of the patches.  If you configure
> > build with --enable-cassert, you'll see that throughput decreases as
> > number of partitions run into many thousands, but it doesn't when
> > asserts are turned off.
> 
> Attached an updated version.  This incorporates fixes for both Jesper's
> and Imai-san's review.

Thanks for updating the patches!
Here is my code review of the previous v26 patches.

[0002]
In expand_inherited_rtentry():

expand_inherited_rtentry()
{
...
+   RelOptInfo *rel = NULL;

can be declared later, in the narrower scope:

if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
...
else
{   
List   *appinfos = NIL;
RangeTblEntry *childrte;
Index   childRTindex;
+RelOptInfo *rel = NULL;


[0003]
In inheritance_planner:

+   rtable_with_target = list_copy(root->parse->rtable);

can be:

+   rtable_with_target = list_copy(parse->rtable);

[0004 or 0005]
There is redundant processing in add_appendrel_other_rels() (or 
expand_xxx_rtentry()?).
I modified add_appendrel_other_rels() as below and it actually worked.


add_appendrel_other_rels(PlannerInfo *root, RangeTblEntry *rte, Index rti) 
{
ListCell   *l;
RelOptInfo *rel;

/*   
 * Add inheritance children to the query if not already done. For child
 * tables that are themselves partitioned, their children will be added
 * recursively.
 */
if (rte->rtekind == RTE_RELATION && !root->contains_inherit_children)
{
expand_inherited_rtentry(root, rte, rti);
return;
}

rel = find_base_rel(root, rti);

foreach(l, root->append_rel_list)
{
AppendRelInfo  *appinfo = lfirst(l);
Index   childRTindex = appinfo->child_relid;
RangeTblEntry  *childrte;
RelOptInfo *childrel;

if (appinfo->parent_relid != rti) 
continue;

Assert(childRTindex < root->simple_rel_array_size);
childrte = root->simple_rte_array[childRTindex];
Assert(childrte != NULL);
build_simple_rel(root, childRTindex, rel);

/* Child may itself be an inherited relation. */
if (childrte->inh)
add_appendrel_other_rels(root, childrte, childRTindex);
}
}

> and Imai-san's review.  I haven't been able to pin down the bug (or
> whatever) that makes throughput go down as the partition count increases,
> when tested with a --enable-cassert build.

I didn't investigate that problem, but there is another memory increase issue, 
which I think is caused by the 0003 patch. I'll try to solve the latter issue.

--
Yoshikazu Imai



Re: Update does not move row across foreign partitions in v11

2019-03-05 Thread David Rowley
On Tue, 5 Mar 2019 at 03:01, Derek Hans  wrote:
> Based on a reply to reporting this as a bug, moving rows out of foreign 
> partitions is not yet implemented so this is behaving as expected. There's a 
> mention of this limitation in the Notes section of the Update docs.

(Moving this discussion to -Hackers)

In [1], Derek reports that once a row is inserted into a foreign
partition, an UPDATE does not correctly route it back out into the
correct partition.

I didn't really follow the foreign partition code when it went in, but
do recall being involved in the documentation about the limitations of
partitioned tables in table 5.10.2.3 in [2].  Unfortunately, table
5.10.2.3 does not seem to mention this limitation at all.  As Derek
mentions, there is a brief mention in [3] in the form of:

"Currently, rows cannot be moved from a partition that is a foreign
table to some other partition, but they can be moved into a foreign
table if the foreign data wrapper supports it."

I don't quite understand what a "foreign table to some other
partition" is meant to mean. Partitions don't have foreign tables,
they can only be one themselves.

I've tried to put all this right again in the attached. However, I was
a bit unsure of what "but they can be moved into a foreign table if
the foreign data wrapper supports it." is referring to. Copying Robert
and Etsuro as this was all added in 3d956d9562aa. Hopefully, they can
confirm what is meant by this.

[1] 
https://www.postgresql.org/message-id/cagrp7a3xc1qy_b2wjcgad8uqts_ndcjn06o5mts_ne1nyhb...@mail.gmail.com
[2] 
https://www.postgresql.org/docs/devel/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE-LIMITATIONS
[3] https://www.postgresql.org/docs/devel/sql-update.html

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


doc_confirm_foreign_partition_limitations.patch
Description: Binary data


Re: Fix memleaks and error handling in jsonb_plpython

2019-03-05 Thread Michael Paquier
On Tue, Mar 05, 2019 at 02:10:01PM +0300, Nikita Glukhov wrote:
> I know about these volatility issues, but maybe I incorrectly understand what
> should be marked as volatile for pointer variables: the pointer itself and/or
> the memory referenced by it.  I thought that only the pointer needs to be marked,
> and also there is a message [1] clearly describing what needs to be marked.

Yeah, sorry for bringing some confusion.

> Previously in PLyMapping_ToJsonbValue() the whole contents of PyObject was
> marked as volatile, not the pointer itself which is not modified in PG_TRY:
> 
> - /* We need it volatile, since we use it after longjmp */
> - volatile PyObject *items_v = NULL;
> 
> So, I removed volatile qualifier here.

Okay, this one looks correct to me.  Well the whole variable has been
removed.

> Variable 'result' is also not modified in PG_TRY, it is also non-volatile.

Fine here as well.

> I marked only 'key' variable in PLyObject_FromJsonbContainer() as volatile,
> because it is really modified in the loop inside PG_TRY(), and
> PLyObject_FromJsonbValue() call after its assignment can throw PG
> exception:
> +  PyObject   *volatile key = NULL;

One thing that you are missing here is that key can become NULL when
reaching the catch block, so Py_XDECREF() should be called on it only
when the value is not NULL.  And actually, looking closer, you don't
need to have that volatile variable at all, no?  Why not just
declare it as a PyObject in the while loop?

Also here, key and val can be NULL, so we had better only call
Py_XDECREF() when they are not.  On top of that, potential errors on
PyDict_SetItem() should not simply be ignored, so the loop should only break
when the key or the value is NULL, but not when PyDict_SetItem() has a
problem.

> Also I have idea to introduce a global list of Python objects that need to be
> dereferenced in PG_CATCH inside PLy_exec_function() in the case of exception.
> Then typical code will be look like that:

Perhaps we could do that, but let's not juggle with the code more than
necessary for a bug fix.

> Yes, PyList_SetItem() and PyString_FromStringAndSize() should be checked,
> but CPython's PyList_SetItem() really should not fail because list storage
> is preallocated:

Hm.  We could add an elog() here for safety I think.  That's not a big
deal either.

Another thing is that you cannot just return within a try block with
what is added in PLyObject_FromJsonbContainer, or the error stack is
not reset properly.  So they should be replaced by breaks.
--
Michael


signature.asc
Description: PGP signature


Re: speeding up planning with partitions

2019-03-05 Thread Amit Langote
On 2019/03/06 0:57, Jesper Pedersen wrote:
> On 3/5/19 5:24 AM, Amit Langote wrote:
>> Attached an updated version.  This incorporates fixes for both Jesper's
>> and Imai-san's review.  I haven't been able to pin down the bug (or
>> whatever) that makes throughput go down as the partition count increases,
>> when tested with a --enable-cassert build.
>>
> 
> Thanks !
> 
> I'm seeing the throughput going down as well, but are you sure it isn't
> just the extra calls of MemoryContextCheck you are seeing ? A flamegraph
> diff highlights that area -- sent offline.
> 
> A non cassert build shows the same profile for 64 and 1024 partitions.

Thanks for testing.

I now see what's happening here.  I was aware that it's MemoryContextCheck
overhead but didn't quite understand why the time it takes should increase
with the number of partitions increasing, especially given that the patch
makes it so that only one partition is accessed if that's what the query
needs to do.  What I had forgotten however is that MemoryContextCheck
checks *all* contexts in every transaction, including the "partition
descriptor" context which stores a partitioned table's PartitionDesc.
PartitionDesc gets bigger as the number of partitions increases, and so does
the time to check the context it's allocated in.

So, the decrease in throughput in cassert build as the number of
partitions increases is not due to any fault of this patch series as I had
suspected.  Phew!

Thanks,
Amit




Re: Patch to document base64 encoding

2019-03-05 Thread Karl O. Pinc
Hi Fabien,

On Tue, 5 Mar 2019 23:02:26 +0100 (CET)
Fabien COELHO  wrote:

> > Attached: doc_base64_v3.patch  
> 
> I'm ok with referencing the historical MIME RFC.

For the record, RFC 2045 is updated but not
yet obsolete.  The updates don't invalidate
section 6.8.

> "RFC2045 section 6.8" -> "RFC 2045 Section 6.8"
> 
> you can link to the RFC directly with:
> 
> https://tools.ietf.org/html/rfc2045#section-6.8;>RFC 2045 
> Section 6.8

Done.

Attached: doc_base64_v4.patch

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6765b0d584..06dfd84d22 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1752,6 +1752,9 @@
 
  decode
 
+
+ base64
+
 decode(string text,
 format text)

@@ -1769,13 +1772,16 @@
 
  encode
 
+
+ base64
+
 encode(data bytea,
 format text)

text

 Encode binary data into a textual representation.  Supported
-formats are: base64, hex, escape.
+formats are: base64, hex, escape.
 escape converts zero bytes and high-bit-set bytes to
 octal sequences (\nnn) and
 doubles backslashes.
@@ -2365,6 +2371,23 @@
 format treats a NULL as a zero-element array.

 
+   
+ base64
+   
+
+   
+ The base64 encoding of the encode
+ and decode functions is that
+ of https://tools.ietf.org/html/rfc2045#section-6.8;>RFC 2045
+ section 6.8.  As per the RFC, encoded lines are broken at 76
+ characters.  However, instead of the MIME CRLF end-of-line marker, only a
+ newline is used for end-of-line.  The carriage-return, newline, space,
+ and tab characters are ignored by decode.
+ Otherwise, an error is raised when decode is
+ supplied invalid base64 data, including when trailing padding is
+ incorrect.
+   
+

See also the aggregate function string_agg in
.
@@ -3577,13 +3600,16 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
 
  encode
 
+
+ base64
+
encode(data bytea,
format text)
   
   text
   
Encode binary data into a textual representation.  Supported
-   formats are: base64, hex, escape.
+   formats are: base64, hex, escape.
escape converts zero bytes and high-bit-set bytes to
octal sequences (\nnn) and
doubles backslashes.
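
As a quick illustration of the documented behaviour (output shown
assuming the default bytea_output = 'hex'):

SELECT encode('\x0102030405'::bytea, 'base64');   -- AQIDBAU=
SELECT decode(E'AQID\nBAU=', 'base64');           -- \x0102030405, the newline is ignored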


Re: Delay locking partitions during query execution

2019-03-05 Thread Robert Haas
On Tue, Mar 5, 2019 at 8:04 PM David Rowley
 wrote:
> Actually, I'm not sure it could work at all.  It does not seem very
> safe to lookup a partition's parent without actually holding a lock on
> the partition and we can't lock the partition and then lock each
> parent in turn as that's the exact opposite locking order that we do
> when querying a partitioned table, so could result in deadlocking.
>
> Many moons ago in the 0002 patch in [1] I proposed to change the way
> we store partition hierarchies which involved ditching bool
> pg_class.relispartition in favour of Oid pg_class.relpartitionparent.
> That would make parent lookups pretty fast, but... given the above it
> seems like we couldn't do this at all.

One thing that is both kinda nice and kinda strange about our
heavyweight lock manager is that it has no idea what it is locking.
If you say "please lock OID 12345" for me, it does, but it doesn't
know anything about the relationship between that OID and any other
thing you might want to lock.  Compare that to what Gray and Reuter
describe in "Transaction Processing: Concepts and Techniques", Section
7.8, Granular Locking.  There, there is an idea that the set of
possible locks forms a tree, and that locking a node of the tree is
tantamount to locking all of its descendents.  Such a concept would be
useful here: you could take e.g. AccessShareLock on the root of the
partitioning hierarchy and that would in effect give you that lock
mode on every partition, but without needing to make a separate entry
for each partition lock.

Sadly, implementing such a thing for PostgreSQL seems extremely
challenging.  We'd somehow have to build up in shared memory an idea
of what the locking hierarchy was, and then update it as DDL happens,
and drop entries that aren't interesting any more so we don't run out
of shared memory.  Performance and concurrency would be really hard
problems, and assumptions about the current locking model are baked
into code in MANY parts of the system, so finding everything that
needed to be (or could be) changed would probably be extremely
challenging.  But if you could make it all work we might well end up
in a better state than we are today.  However, I'm leaving this
problem for a time when I have six months or a year to do nothing
else, because I'm pretty sure that's what it would take.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Allowing extensions to supply operator-/function-specific info

2019-03-05 Thread Paul Ramsey


> On Mar 5, 2019, at 3:56 PM, Tom Lane  wrote:
> 
> Paul Ramsey  writes:
>> On Mar 5, 2019, at 3:26 PM, Tom Lane  wrote:
>>> Hm, I think your addition of this bit is wrong:
>>> 
>>> +/*
>>> +* Arguments were swapped to put the index value on the
>>> +* left, so we need the commutated operator for
>>> +* the OpExpr
>>> +*/
>>> +if (swapped)
>>> +{
>>> +oproid = get_commutator(oproid);
>>> +if (!OidIsValid(oproid))
>>> PG_RETURN_POINTER((Node *)NULL);
>>> +}
>>> 
>>> We already did the operator lookup with the argument types in the desired
>>> order, so this is introducing an extra swap.  The only reason it appears
>>> to work, I suspect, is that all your index operators are self-commutators.
> 
>> I was getting regression failures until I re-swapped the operator… 
>>  SELECT * FROM foobar WHERE ST_Within(ConstA, VarB)
> 
> Ah ... so the real problem here is that *not* all of your functions
> treat their first two inputs alike, and the hypothetical future
> improvement I commented about is needed right now.  I should've
> looked more closely at the strategies in your table; then I would've
> realized the patch as I proposed it didn't work.
> 
> But this code isn't right either.  I'm surprised you're not getting
> crashes --- perhaps there aren't cases where the first and second args
> are of incompatible types?  Also, it's certainly wrong to be doing this
> sort of swap in only one of the two code paths.
> 
> There's more than one way you could handle this, but the way that
> I was vaguely imagining was to have two strategy entries in each
> IndexableFunction entry, one to apply if the first function argument
> is the indexable one, and the other to apply if the second function
> argument is the indexable one.  If you leave the operator lookup as
> I had it (using the already-swapped data types) then you'd have to
> make sure that the latter set of strategy entries are written as
> if the arguments get swapped before selecting the strategy, which
> would be confusing perhaps :-( --- for instance, st_within would
> use RTContainedByStrategyNumber in the first case but
> RTContainsStrategyNumber in the second.  But otherwise you need the
> separate get_commutator step, which seems like one more catalog lookup
> than you really need.
> 
>> I feel OK about it, if for no other reason than it passes all the tests :)
> 
> Then you're at least missing adequate tests for the 3-arg functions...
> 3 args with the index column second will not work as this stands.

Some of the operators are indifferent to order (&&,  overlaps) and others are 
not (@, within) (~, contains).

The 3-arg functions fortunately all have && strategies.

The types on either side of the operators are always the same (geometry && 
geometry), ST_Intersects(geometry, geometry).

I could simply be getting a free pass from the simplicity of my setup?

P.


Re: few more wait events to add to docs

2019-03-05 Thread Thomas Munro
On Wed, Mar 6, 2019 at 2:18 PM Alvaro Herrera  wrote:
> On 2019-Mar-05, Jeremy Schneider wrote:
> > It's possible I'm misreading this, but I'm thinking that commits
> > cc5f8136 and ab9e0e71 added a few tranches which we need to add to the docs.
> >
> > session_dsa, session_record_table, session_typmod_table, and
> > shared_tuplestore
> >
> > https://www.postgresql.org/docs/current/monitoring-stats.html#WAIT-EVENT-TABLE
>
> https://postgr.es/m/CAB7nPqSU1=-3fqvz+ncgm5vmro4lszjgqsmoygwizs2hvpy...@mail.gmail.com

Here's some missing documentation.

Hmm, yeah.  I wish these were alphabetised, I wish there was an
automated warning about this, I wish these tranches were declared a
better way than by adding code in RegisterLWLockTranches().

-- 
Thomas Munro
https://enterprisedb.com


missing-tranches.patch
Description: Binary data


Re: few more wait events to add to docs

2019-03-05 Thread Alvaro Herrera
On 2019-Mar-05, Jeremy Schneider wrote:

> It's possible I'm misreading this, but I'm thinking that commits
> cc5f8136 and ab9e0e71 added a few tranches which we need to add to the docs.
> 
> session_dsa, session_record_table, session_typmod_table, and
> shared_tuplestore
> 
> https://www.postgresql.org/docs/current/monitoring-stats.html#WAIT-EVENT-TABLE

https://postgr.es/m/CAB7nPqSU1=-3fqvz+ncgm5vmro4lszjgqsmoygwizs2hvpy...@mail.gmail.com
:-)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2019-03-05 Thread Alvaro Herrera
Jerry,

On 2019-Mar-05, Jerry Jelinek wrote:

> Thanks again for your review. I went through your proposed patch diffs and
> applied most of them to my original changes. I did a few things slightly
> differently since I wanted to keep to 80 columns for the source code,
> but I can revisit that if it is not an issue.

Yeah, in the places where I exceeded the limit, it is largely considered
not an issue.  Brace curling *is* an issue, though :-)  I would prefer
to go with my version (which is largely just stylistic changes over
yours), applying your subsequent changes on top of that.

I can't remember now if I pgindented it or just changed manually ... I
should do that.

> also cleaned up the confusing wording around "allocating blocks". I
> ran a clean build and make check passes. The new patch is attached.

Cool.  Can you confirm that it still fixes the performance issue for
you?  (It should, since functionally it should be the same thing as
yours.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] CLUSTER command progress monitor

2019-03-05 Thread Alvaro Herrera
On 2019-Mar-04, Robert Haas wrote:

> On Mon, Mar 4, 2019 at 5:38 AM Tatsuro Yamada
>  wrote:

> > === Discussion points ===
> >
> >   - Progress counter for "3. sorting tuples" phase
> >  - Should we add pgstat_progress_update_param() in tuplesort.c like a
> >"trace_sort"?
> >Thanks to Peter Geoghegan for the useful advice!
> 
> How would we avoid an abstraction violation?

The theory embodied in my patch at 
https://postgr.es/m/20190304204607.GA15946@alvherre.pgsql
is that we don't; tuplesort.c functions (index.c's IndexBuildHeapScan in
my case) would get a boolean parameter to indicate whether to update
some params or not -- the param number(s) to update are supposed to be
generic in the sense that it's not part of any individual command's
implementation (PROGRESS_SCAN_BLOCKS_DONE for what you call "blks
scanned", PROGRESS_SCAN_BLOCKS_TOTAL for "blks total"), but rather
defined by the "progress update provider" (index.c or tuplesort.c).

One, err, small issue with that idea is that we need the param numbers
not to conflict for any "progress update providers" that are to be used
simultaneously by any command.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Delay locking partitions during query execution

2019-03-05 Thread David Rowley
On Wed, 6 Mar 2019 at 04:46, Tomas Vondra  wrote:
>
> On 3/5/19 6:55 AM, David Rowley wrote:
> > The only way I can think to fix this is to just never lock partitions
> > at all, and if a lock is to be obtained on a partition, it must be
> > instead obtained on the top-level partitioned table.  That's a rather
> > large change that could have large consequences, and I'm not even sure
> > it's possible since we'd need to find the top-level parent before
> > obtaining the lock, by then the hierarchy might have changed and we'd
> > need to recheck, which seems like quite a lot of effort just to obtain
> > a lock... Apart from that, it's not this patch, so looks like I'll
> > need to withdraw this one :-(
> >
>
> So you're saying we could
>
> 1) lookup the parent and lock it
> 2) repeat the lookup to verify it did not change
>
> I think that could still be a win, assuming that most hierarchies will
> be rather shallow (I'd say 2-3 levels will cover like 95% of cases, and
> 4 levels would be 100% in practice). And the second lookup should be
> fairly cheap thanks to syscache and the fact that the hierarchies do not
> change very often.

Actually, I'm not sure it could work at all.  It does not seem very
safe to lookup a partition's parent without actually holding a lock on
the partition and we can't lock the partition and then lock each
parent in turn as that's the exact opposite locking order that we do
when querying a partitioned table, so could result in deadlocking.

Many moons ago in the 0002 patch in [1] I proposed to change the way
we store partition hierarchies which involved ditching bool
pg_class.relispartition in favour of Oid pg_class.relpartitionparent.
That would make parent lookups pretty fast, but... given the above it
seems like we couldn't do this at all.

[1] 
https://www.postgresql.org/message-id/CAKJS1f9QjUwQrio20Pi%3DyCHmnouf4z3SfN8sqXaAcwREG6k0zQ%40mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



fix for BUG #3720: wrong results at using ltree

2019-03-05 Thread Filip Rembiałkowski
Hi all,

Here is my attempt to fix a 12-year-old ltree bug (which is a todo item).

I see it's not backward-compatible, but in my understanding it matches
what is documented. The previous behavior was inconsistent with the
documentation (where a single asterisk should match zero or more
labels).

http://archives.postgresql.org/pgsql-bugs/2007-11/msg00044.php
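
As a concrete example of the documented semantics, once '*' may match
zero labels, '!d' can be matched against a label other than 'd', so the
following now returns true (see the expected-output changes below):

SELECT 'a.b.c.d.e'::ltree ~ '*.!d.*';   -- was false, now true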
diff --git a/contrib/ltree/expected/ltree.out b/contrib/ltree/expected/ltree.out
index 8226930905..b1878a99b3 100644
--- a/contrib/ltree/expected/ltree.out
+++ b/contrib/ltree/expected/ltree.out
@@ -676,7 +676,7 @@ SELECT 'a.b.c.d.e'::ltree ~ '*.a.*.d.*';
 SELECT 'a.b.c.d.e'::ltree ~ '*.!d.*';
  ?column? 
 --
- f
+ t
 (1 row)
 
 SELECT 'a.b.c.d.e'::ltree ~ '*.!d';
@@ -706,7 +706,7 @@ SELECT 'a.b.c.d.e'::ltree ~ '*.!e';
 SELECT 'a.b.c.d.e'::ltree ~ '*.!e.*';
  ?column? 
 --
- f
+ t
 (1 row)
 
 SELECT 'a.b.c.d.e'::ltree ~ 'a.*.!e';
@@ -724,7 +724,7 @@ SELECT 'a.b.c.d.e'::ltree ~ 'a.*.!d';
 SELECT 'a.b.c.d.e'::ltree ~ 'a.*.!d.*';
  ?column? 
 --
- f
+ t
 (1 row)
 
 SELECT 'a.b.c.d.e'::ltree ~ 'a.*.!f.*';
@@ -742,7 +742,7 @@ SELECT 'a.b.c.d.e'::ltree ~ '*.a.*.!f.*';
 SELECT 'a.b.c.d.e'::ltree ~ '*.a.*.!d.*';
  ?column? 
 --
- f
+ t
 (1 row)
 
 SELECT 'a.b.c.d.e'::ltree ~ '*.a.!d.*';
@@ -766,13 +766,13 @@ SELECT 'a.b.c.d.e'::ltree ~ 'a.!d.*';
 SELECT 'a.b.c.d.e'::ltree ~ '*.a.*.!d.*';
  ?column? 
 --
- f
+ t
 (1 row)
 
 SELECT 'a.b.c.d.e'::ltree ~ '*.!b.*';
  ?column? 
 --
- f
+ t
 (1 row)
 
 SELECT 'a.b.c.d.e'::ltree ~ '*.!b.c.*';
@@ -784,7 +784,7 @@ SELECT 'a.b.c.d.e'::ltree ~ '*.!b.c.*';
 SELECT 'a.b.c.d.e'::ltree ~ '*.!b.*.c.*';
  ?column? 
 --
- f
+ t
 (1 row)
 
 SELECT 'a.b.c.d.e'::ltree ~ '!b.*.c.*';
@@ -832,31 +832,31 @@ SELECT 'a.b.c.d.e'::ltree ~ '*{1}.!b.*.!c.*.e';
 SELECT 'a.b.c.d.e'::ltree ~ '*{1}.!b.*{1}.!c.*.e';
  ?column? 
 --
- t
+ f
 (1 row)
 
 SELECT 'a.b.c.d.e'::ltree ~ 'a.!b.*{1}.!c.*.e';
  ?column? 
 --
- t
+ f
 (1 row)
 
 SELECT 'a.b.c.d.e'::ltree ~ '!b.*{1}.!c.*.e';
  ?column? 
 --
- t
+ f
 (1 row)
 
 SELECT 'a.b.c.d.e'::ltree ~ '*.!b.*{1}.!c.*.e';
  ?column? 
 --
- t
+ f
 (1 row)
 
 SELECT 'a.b.c.d.e'::ltree ~ '*.!b.*.!c.*.e';
  ?column? 
 --
- f
+ t
 (1 row)
 
 SELECT 'a.b.c.d.e'::ltree ~ '!b.!c.*';
@@ -886,19 +886,19 @@ SELECT 'a.b.c.d.e'::ltree ~ '*{1}.!b.*.!c.*';
 SELECT 'a.b.c.d.e'::ltree ~ '*{1}.!b.*{1}.!c.*';
  ?column? 
 --
- t
+ f
 (1 row)
 
 SELECT 'a.b.c.d.e'::ltree ~ 'a.!b.*{1}.!c.*';
  ?column? 
 --
- t
+ f
 (1 row)
 
 SELECT 'a.b.c.d.e'::ltree ~ '!b.*{1}.!c.*';
  ?column? 
 --
- t
+ f
 (1 row)
 
 SELECT 'a.b.c.d.e'::ltree ~ '*.!b.*{1}.!c.*';
@@ -909,6 +909,18 @@ SELECT 'a.b.c.d.e'::ltree ~ '*.!b.*{1}.!c.*';
 
 SELECT 'a.b.c.d.e'::ltree ~ '*.!b.*.!c.*';
  ?column? 
+--
+ t
+(1 row)
+
+SELECT '5.0.1.0'::ltree ~ '5.!0.!0.0';
+ ?column? 
+--
+ f
+(1 row)
+
+SELECT 'a.b'::ltree ~ '!a.!a';
+ ?column? 
 --
  f
 (1 row)
diff --git a/contrib/ltree/lquery_op.c b/contrib/ltree/lquery_op.c
index b6d2deb1af..392c193fd8 100644
--- a/contrib/ltree/lquery_op.c
+++ b/contrib/ltree/lquery_op.c
@@ -19,16 +19,6 @@ PG_FUNCTION_INFO_V1(lt_q_rregex);
 
 #define NEXTVAL(x) ( (lquery*)( (char*)(x) + INTALIGN( VARSIZE(x) ) ) )
 
-typedef struct
-{
-	lquery_level *q;
-	int			nq;
-	ltree_level *t;
-	int			nt;
-	int			posq;
-	int			post;
-} FieldNot;
-
 static char *
 getlexeme(char *start, char *end, int *len)
 {
@@ -50,7 +40,7 @@ getlexeme(char *start, char *end, int *len)
 }
 
 bool
-			compare_subnode(ltree_level *t, char *qn, int len, int (*cmpptr) (const char *, const char *, size_t), bool anyend)
+compare_subnode(ltree_level *t, char *qn, int len, int (*cmpptr) (const char *, const char *, size_t), bool anyend)
 {
 	char	   *endt = t->name + t->len;
 	char	   *endq = qn + len;
@@ -108,6 +98,9 @@ checkLevel(lquery_level *curq, ltree_level *curt)
 	int			(*cmpptr) (const char *, const char *, size_t);
 	lquery_variant *curvar = LQL_FIRST(curq);
 	int			i;
+	boolsuccess;
+
+	success = (curq->flag & LQL_NOT) ? false : true;
 
 	for (i = 0; i < curq->numvar; i++)
 	{
@@ -115,8 +108,9 @@ checkLevel(lquery_level *curq, ltree_level *curt)
 
 		if (curvar->flag & LVAR_SUBLEXEME)
 		{
-			if (compare_subnode(curt, curvar->name, curvar->len, cmpptr, (curvar->flag & LVAR_ANYEND)))
-return true;
+			if (compare_subnode(curt, curvar->name, curvar->len, cmpptr,
+		(curvar->flag & LVAR_ANYEND)))
+return success;
 		}
 		else if (
  (
@@ -126,22 +120,12 @@ checkLevel(lquery_level *curq, ltree_level *curt)
  (*cmpptr) (curvar->name, curt->name, curvar->len) == 0)
 		{
 
-			return true;
+			return success;
 		}
 		curvar = LVAR_NEXT(curvar);
 	}
-	return false;
-}
-
-/*
-void
-printFieldNot(FieldNot *fn ) {
-	while(fn->q) {
-		elog(NOTICE,"posQ:%d lenQ:%d posT:%d lenT:%d", fn->posq,fn->nq,fn->post,fn->nt);
-		fn++;
-	}
+	return !success;
 }
-*/
 
 static struct
 {
@@ -154,7 +138,7 @@ static struct
 };

Re: dropdb --force

2019-03-05 Thread Filip Rembiałkowski
Here is Pavel's patch rebased onto the master branch, with the dropdb
--force option, a test case, and documentation added.
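
With the proposed syntax, SQL-level usage looks like this ("mydb" being
any database you own), and dropdb --force simply sends the same command:

DROP DATABASE mydb FORCE;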

I'm willing to work on it if needed. What are possible bad things that
could happen here? Is the documentation clear enough?

Thanks.


On Tue, Dec 18, 2018 at 4:34 PM Marti Raudsepp  wrote:
>
> Hi
>
> > út 18. 12. 2018 v 16:11 odesílatel Filip Rembiałkowski 
> >  napsal:
> >> Please share opinions if this makes sense at all, and has any chance
> >> going upstream.
>
> Clearly since Pavel has another implementation of the same concept,
> there is some interest in this feature. :)
>
> On Tue, Dec 18, 2018 at 5:20 PM Pavel Stehule  wrote:
> > Still one my customer use a patch that implement FORCE on SQL level. It is 
> > necessary under higher load when is not easy to synchronize clients.
>
> I think Filip's approach of setting pg_database.datallowconn='false'
> is pretty clever to avoid the synchronization problem. But it's also a
> good idea to expose this functionality via DROP DATABASE in SQL, like
> Pavel's patch, not just the 'dropdb' binary.
>
> If this is to be accepted into PostgreSQL core, I think the two
> approaches should be combined on the server side.
>
> Regards,
> Marti Raudsepp
diff --git a/doc/src/sgml/ref/drop_database.sgml b/doc/src/sgml/ref/drop_database.sgml
index 3ac06c984a..84df485e11 100644
--- a/doc/src/sgml/ref/drop_database.sgml
+++ b/doc/src/sgml/ref/drop_database.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-DROP DATABASE [ IF EXISTS ] name
+DROP DATABASE [ IF EXISTS ] name [ FORCE ]
 
  
 
@@ -32,9 +32,11 @@ DROP DATABASE [ IF EXISTS ] name
DROP DATABASE drops a database. It removes the
catalog entries for the database and deletes the directory
containing the data.  It can only be executed by the database owner.
-   Also, it cannot be executed while you or anyone else are connected
-   to the target database.  (Connect to postgres or any
-   other database to issue this command.)
+   Also, it cannot be executed while you are connected to the target database.
+   (Connect to postgres or any other database to issue this
+   command.)
+   If anyone else is connected to the target database, this command will fail
+   - unless you use the FORCE option described below.
   
 
   
@@ -64,6 +66,24 @@ DROP DATABASE [ IF EXISTS ] name
  
 

+
+   
+FORCE
+
+ 
+  Attempt to terminate all existing connections to the target database.
+ 
+ 
+  This will fail if the current user has no permission to terminate other
+  connections. Required permissions are the same as with 
+  pg_terminate_backend, described
+  in .
+
+  This will also fail if the connections do not terminate within 60 seconds.
+ 
+
+   
+
   
  
 
diff --git a/doc/src/sgml/ref/dropdb.sgml b/doc/src/sgml/ref/dropdb.sgml
index 38f38f01ce..365ba317c1 100644
--- a/doc/src/sgml/ref/dropdb.sgml
+++ b/doc/src/sgml/ref/dropdb.sgml
@@ -86,6 +86,20 @@ PostgreSQL documentation
   
  
 
+ 
+  -f
+  --force
+  
+   
+Force termination of connected backends before removing the database.
+   
+   
+This will add the FORCE option to the DROP
+DATABASE command sent to the server.
+   
+  
+ 
+
  
-V
--version
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index d207cd899f..2137bee3a7 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -487,7 +487,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 * potential waiting; we may as well throw an error first if we're gonna
 	 * throw one.
 	 */
-	if (CountOtherDBBackends(src_dboid, , ))
+	if (CountOtherDBBackends(src_dboid, , , false))
 		ereport(ERROR,
 (errcode(ERRCODE_OBJECT_IN_USE),
  errmsg("source database \"%s\" is being accessed by other users",
@@ -777,7 +777,7 @@ createdb_failure_callback(int code, Datum arg)
  * DROP DATABASE
  */
 void
-dropdb(const char *dbname, bool missing_ok)
+dropdb(const char *dbname, bool missing_ok, bool force)
 {
 	Oid			db_id;
 	bool		db_istemplate;
@@ -785,6 +785,7 @@ dropdb(const char *dbname, bool missing_ok)
 	HeapTuple	tup;
 	int			notherbackends;
 	int			npreparedxacts;
+	int			loops = 0;
 	int			nslots,
 nslots_active;
 	int			nsubscriptions;
@@ -868,12 +869,33 @@ dropdb(const char *dbname, bool missing_ok)
 	 *
 	 * As in CREATE DATABASE, check this after other error conditions.
 	 */
-	if (CountOtherDBBackends(db_id, , ))
-		ereport(ERROR,
-(errcode(ERRCODE_OBJECT_IN_USE),
- errmsg("database \"%s\" is being accessed by other users",
-		dbname),
- errdetail_busy_db(notherbackends, npreparedxacts)));
+	for (;;)
+	{
+		/*
+		 * CountOtherDBBackends checks usage of the database by other backends and tries
+		 * to wait 5 sec. We raise a warning after 1 minute and raise
+		 * an error after 5 minutes.
+		 */
+		if (!CountOtherDBBackends(db_id, , , force))
+			

Re: Ordered Partitioned Table Scans

2019-03-05 Thread David Rowley
On Wed, 6 Mar 2019 at 07:17, Robert Haas  wrote:
>
> On Wed, Dec 19, 2018 at 5:08 PM David Rowley
>  wrote:
> > With my idea for using live_parts, we'll process the partitions
> > looking for interleaved values on each query, after pruning takes
> > place. In this case, we'll see the partitions are naturally ordered. I
> > don't really foresee any issues with that additional processing since
> > it will only be a big effort when there are a large number of
> > partitions, and in those cases the planner already has lots of work to
> > do. Such processing is just a drop in the ocean when compared to path
> > generation for all those partitions.
>
> I agree that partitions_are_ordered() is cheap enough in this patch
> that it probably doesn't matter whether we cache the result.  On the
> other hand, that's mostly because you haven't handled the hard cases -
> e.g. interleaved list partitions.  If you did, then it would be
> expensive, and it probably *would* be worth caching the result.  Now
> maybe those hard cases aren't worth handling anyway.
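
To make "interleaved" concrete: list partitions like the sketch below,
where neither partition's values sort entirely before the other's, so
an ordered scan can't simply append one partition's tuples after the
other's:

CREATE TABLE l (a int) PARTITION BY LIST (a);
CREATE TABLE l13 PARTITION OF l FOR VALUES IN (1, 3);
CREATE TABLE l24 PARTITION OF l FOR VALUES IN (2, 4);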

I admit that I didn't understand the idea of the flag at the time,
having failed to see the point of it, since I had thought the flag
would be useless once partitions are plan-time pruned. However, as
Julien explained, it would be a flag where "Yes" means "Yes, it's okay
to do ordered scans", and "No" means "If some partitions were pruned,
recheck using only the non-pruned ones".  That seems fine and very sane
to me now that I understand it. FWIW, my moment of realisation came in [1].

However, my thoughts are that adding new flags and the live_parts
field in RelOptInfo raise the bar a bit for this patch.   There's
already quite a number of partition-related fields in RelOptInfo.
Understanding what each of those does is not trivial, so I figured
that this patch would be much easier to consider if I skipped that
part for the first cut version.  I also feared a lot of instability in
which fields exist, due to Amit's planner improvement patches, and I
didn't want to deal with dependencies on WIP. I had to deal with that last
year on run-time pruning and it turned out not to be fun.

> You also seem to be saying that since we run-time partitioning pruning
> might change the answer, caching the initial answer is pointless.  But
> I think Julien has made a good argument for why that's wrong: if the
> initial answer is that the partitions are ordered, which will often be
> true, then we can skip all later checks.
>
> So I am OK with the fact that this patch doesn't choose to cache it,
> but I don't really buy any of your arguments for why it would be a bad
> idea.

OK, good.  I agree.  For the record, I want to steer clear of the flag
in this first cut version, especially now given how late in the cycle it is.

[1] 
https://www.postgresql.org/message-id/cakjs1f_r51oapsn1oc4i36d7vznnihnk+1widfg0qrvb_eo...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Re: [HACKERS] [PROPOSAL] Temporal query processing with range types

2019-03-05 Thread Robert Haas
On Tue, Mar 5, 2019 at 3:46 AM David Steele  wrote:
> I have marked this entry as targeting PG13 since it is too late to
> consider for PG12.

Sounds right.  As Peter said himself, this patch is WIP, so it's too
soon to consider integrating it.  This is also fairly evident from the
content of the patch, which is full of comments marked XXX and PEMOSER
that obviously need to be addressed somehow.  For all of that, I'd say
this patch is much closer to being on the right track than the old
design, even though it's clear that a lot of work remains.

Some desultory review comments:

+#define setSweepline(datum) \
+ node->sweepline = datumCopy(datum, node->datumFormat->attbyval,
node->datumFormat->attlen)
+
+#define freeSweepline() \
+ if (! node->datumFormat->attbyval) pfree(DatumGetPointer(node->sweepline))

I am quite dubious about this. Almost everywhere in the executor we
rely on slots to keep track of tuples and free memory for us. It seems
unlikely that this should be the one place where we have code that
looks completely different.  Aside from that, this seems to presuppose
there is only one relevant column, which seems like an assumption that
we probably don't want to bake too deeply into the code.

+ ereport(ERROR,
+ (errcode(ERRCODE_NOT_NULL_VIOLATION),
+ errmsg("Attribute \"%s\" at position %d is null. Temporal " \
+ "adjustment not possible.",
+ NameStr(TupleDescAttr(slot->tts_tupleDescriptor, attnum - 1)->attname),
+ attnum)));

This error message is marked as translatable (ereport used rather than
elog) but the error-message text is unsuitable for a user-facing
error.  If this is a user-facing error, we need a better error, or
maybe we need to rethink the semantics altogether (e.g. just skip such
rows instead of erroring out, or something).  If it's an internal
error that should not be possible no matter what query the user
enters, and is only here as a sanity test, just simplify and use elog
(and maybe add some comments explaining why that's so).

+ * heapGetAttrNotNull

I may be a bit behind the times here, but it seems to me that this is
functionally equivalent to slotGetAttrNotNull and thus we shouldn't
need both.

+ boolempty = false;

Not much point in declaring a variable whose value is never changed
and whose value takes up exactly the same number of characters as the
variable name.

+ * temporalAdjustmentStoreTuple
+ *  While we store result tuples, we must add the newly calculated temporal
+ *  boundaries as two scalar fields or create a single range-typed field
+ *  with the two given boundaries.

This doesn't seem to describe what the function is actually doing.

+ * This should ideally be done with RangeBound types on the right-hand-side
+ * created during range_split execution. Otherwise, we loose information about
+ * inclusive/exclusive bounds and infinity. We would need to implement btree
+ * operators for RangeBounds.

This seems like an idea for future improvement, but it's unclear to me
how the proposed idea is different from the state created by the
patch.

Also, materializing the slot to a heap tuple so that we can modify it
seems inefficient.  I wonder if we can't use a virtual slot here.

+ if (qual->opno == OID_RANGE_EQ_OP) {
+ Oid rngtypid;
+
+ // XXX PEMOSER Change opfamily and opfunc
+ qual->opfuncid = F_RANGE_CONTAINS; //<<--- opfuncid can be 0 during planning
+ qual->opno = OID_RANGE_CONTAINS_ELEM_OP; //OID_RANGE_CONTAINS_OP;
+ clause->isnormalize = true;
+
+ // Attention: cannot merge using non-equality operator 3890 <---
OID_RANGE_CONTAINS_OP
+ opfamily = 4103; //range_inclusion_ops from pg_opfamily.h
+
+ rngtypid = exprType((Node*)clause->lexpr->expr);
+ clause->range_typcache = lookup_type_cache(rngtypid, TYPECACHE_RANGE_INFO);
+ testmytypcache = clause->range_typcache;
+ } else {
+ clause->isnormalize = false;
+ }

This is pretty obviously a mess of hardcoded constants which are,
furthermore, not explained.  I can't tell whether this is intended as
a dirty hack to make some cases work while other things remain broken,
or whether you believe that OID_RANGE_EQ_OP should really be handled this
way.  If it's the latter, this
needs a much better explanation.  You probably realize that
"Attention: cannot merge using non-equality operator 3890" is not a
compelling justification, and that hard-coding all of these things
here doesn't look good.

In general, this patch needs both user-facing documentation and more
and better code comments.  I would suggest writing the user-facing
documentation soon.  It is pretty clear that you've got the basics of
this working, but it's impossible to understand what the semantics are
supposed to be except by staring at the code until you figure it out,
or running experiments.  People who are interested in this
functionality are more likely to provide useful feedback if there's
something they can read and go "yeah, that looks right" or "wait, that
sounds wrong."  Also, there may be places where, in the process of
documenting, you realize that things need to work differently than they
currently do.

Re: NOT IN subquery optimization

2019-03-05 Thread David Rowley
On Wed, 6 Mar 2019 at 12:25, David Rowley  wrote:
> That sounds fine. I'll take mine elsewhere since I didn't start this thread.

Moved to 
https://www.postgresql.org/message-id/CAKJS1f82pqjqe3WT9_xREmXyG20aOkHc-XqkKZG_yMA7JVJ3Tw%40mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Allowing extensions to supply operator-/function-specific info

2019-03-05 Thread Tom Lane
Paul Ramsey  writes:
> On Mar 5, 2019, at 3:26 PM, Tom Lane  wrote:
>> Hm, I think your addition of this bit is wrong:
>> 
>> +/*
>> +* Arguments were swapped to put the index value on the
>> +* left, so we need the commutated operator for
>> +* the OpExpr
>> +*/
>> +if (swapped)
>> +{
>> +oproid = get_commutator(oproid);
>> +if (!OidIsValid(oproid))
>> PG_RETURN_POINTER((Node *)NULL);
>> +}
>> 
>> We already did the operator lookup with the argument types in the desired
>> order, so this is introducing an extra swap.  The only reason it appears
>> to work, I suspect, is that all your index operators are self-commutators.

> I was getting regression failures until I re-swapped the operator… 
>   SELECT * FROM foobar WHERE ST_Within(ConstA, VarB)

Ah ... so the real problem here is that *not* all of your functions
treat their first two inputs alike, and the hypothetical future
improvement I commented about is needed right now.  I should've
looked more closely at the strategies in your table; then I would've
realized the patch as I proposed it didn't work.

But this code isn't right either.  I'm surprised you're not getting
crashes --- perhaps there aren't cases where the first and second args
are of incompatible types?  Also, it's certainly wrong to be doing this
sort of swap in only one of the two code paths.

There's more than one way you could handle this, but the way that
I was vaguely imagining was to have two strategy entries in each
IndexableFunction entry, one to apply if the first function argument
is the indexable one, and the other to apply if the second function
argument is the indexable one.  If you leave the operator lookup as
I had it (using the already-swapped data types) then you'd have to
make sure that the latter set of strategy entries are written as
if the arguments get swapped before selecting the strategy, which
would be confusing perhaps :-( --- for instance, st_within would
use RTContainedByStrategyNumber in the first case but
RTContainsStrategyNumber in the second.  But otherwise you need the
separate get_commutator step, which seems like one more catalog lookup
than you really need.

> I feel OK about it, if for no other reason than it passes all the tests :)

Then you're at least missing adequate tests for the 3-arg functions...
3 args with the index column second will not work as this stands.

regards, tom lane



Converting NOT IN to anti-joins during planning

2019-03-05 Thread David Rowley
Way back in [1] I proposed that we allow NOT IN subqueries to be
converted into an anti-join where the subquery cannot return any NULL
values.   As Tom pointed out to me, I had neglected to consider that
the outer side producing NULLs can cause the anti-join plan to produce
incorrect results.  The difference is that a NOT IN where the subquery
returns no results filters nothing, otherwise it filters the nulls,
plus the records that exist in the subquery.
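
For example (table and column names invented purely for illustration):

SELECT * FROM t1 WHERE t1.x NOT IN (SELECT t2.y FROM t2);

-- If t2 is empty, the NOT IN filters nothing: every t1 row comes back,
-- including rows where t1.x IS NULL, and a plain anti-join gives the same
-- answer.  If t2 is non-empty (and t2.y is never NULL), the NOT IN filters
-- both the rows whose x matches some y and the rows where x IS NULL,
-- whereas an anti-join would still return the x IS NULL rows.  That is why
-- the transformation needs proof that the outer side cannot produce NULLs,
-- or some special handling for them.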

More recently over on [2], Jim and Zheng have re-proposed making
improvements in this area. Their ideas are slightly different from
mine as they propose to add an OR .. IS NULL clause to the join
condition to handle the outer side being NULL with empty subquery
problem.  Before Jim and Zheng's patch arrived I managed to fix the
known problems with my 4-year-old patch thinking it would have been
welcome, but it seems that's not the case, perhaps due to the
differing ideas we have on how this should work. At that time I didn't
think the other patch actually existed yet... oops

Anyway, I don't really want to drop my patch as I believe what it does
is correct and there's debate on the other thread about how good an
idea adding these OR clauses to the join quals is... (forces nested
loop plan (see [3])),  but it appears Jim and Zheng are fairly set on
that idea.  Hence...

I'm moving my patch here, so it can be debated without interfering
with the other work that's going on in this area.  There has also been
some review of my patch in [4], and of course, originally in [1].

The background is really:

1. Seems fine to do this transformation when there are no nulls.
2. We don't want to spend planning effort deciding whether to do the
transformation or not, i.e. do it regardless, in all possible cases
where it's valid to do so. We already do that for NOT EXISTS, no
apparent reason to think this case is any different.
3. Need to consider what planner overhead there is from doing this and
failing to do the conversion due to a lack of evidence for no NULLs.

I've not done #3, at least not with the latest patch.

There's already a CF entry [5] for this patch, although it's targeting PG13.

The latest patch is attached.

[1] 
https://www.postgresql.org/message-id/CAApHDvqRB-iFBy68%3DdCgqS46aRep7AuN2pou4KTwL8kX9YOcTQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/1550706289606-0.p...@n3.nabble.com
[3] 
https://www.postgresql.org/message-id/CAKJS1f_ZwXtzPz6wDpBXgAVYuxforsqpc6hBw05Y6aPGcOONfA%40mail.gmail.com
[4] https://www.postgresql.org/message-id/18203.1551543939%40sss.pgh.pa.us
[5] https://commitfest.postgresql.org/22/2020/

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


not_in_anti_join_v1.3.patch
Description: Binary data


Re: Allowing extensions to supply operator-/function-specific info

2019-03-05 Thread Paul Ramsey


> On Mar 5, 2019, at 3:26 PM, Tom Lane  wrote:
> 
> Paul Ramsey  writes:
>> Thanks for the patch, I’ve applied and smoothed and taken your advice on 
>> schema-qualified lookups as well.
> 
> Hm, I think your addition of this bit is wrong:
> 
> +/*
> +* Arguments were swapped to put the index value on the
> +* left, so we need the commutated operator for
> +* the OpExpr
> +*/
> +if (swapped)
> +{
> +oproid = get_commutator(oproid);
> +if (!OidIsValid(oproid))
> PG_RETURN_POINTER((Node *)NULL);
> +}
> 
> We already did the operator lookup with the argument types in the desired
> order, so this is introducing an extra swap.  The only reason it appears
> to work, I suspect, is that all your index operators are self-commutators.

I was getting regression failures until I re-swapped the operator… 

  SELECT * FROM foobar WHERE ST_Within(ConstA, VarB)

Place the indexed operator in the Left, now:

  Left == VarB
  Right == ConstA
  Strategy == Within
  get_opfamily_member(opfamilyoid, Left, Right, Within)

Unless we change the strategy number when we assign the left/right we’re 
looking up an operator for “B within A”, so we’re backwards.

I feel OK about it, if for no other reason than it passes all the tests :)

P


Re: Allowing extensions to supply operator-/function-specific info

2019-03-05 Thread Tom Lane
Paul Ramsey  writes:
> Thanks for the patch, I’ve applied and smoothed and taken your advice on 
> schema-qualified lookups as well.

Hm, I think your addition of this bit is wrong:

+/*
+* Arguments were swapped to put the index value on the
+* left, so we need the commutated operator for
+* the OpExpr
+*/
+if (swapped)
+{
+oproid = get_commutator(oproid);
+if (!OidIsValid(oproid))
 PG_RETURN_POINTER((Node *)NULL);
+}

We already did the operator lookup with the argument types in the desired
order, so this is introducing an extra swap.  The only reason it appears
to work, I suspect, is that all your index operators are self-commutators.

regards, tom lane



Re: NOT IN subquery optimization

2019-03-05 Thread David Rowley
On Wed, 6 Mar 2019 at 03:37, Tom Lane  wrote:
>
> David Steele  writes:
> > I'm not sure if I have an issue with competing patches on the same
> > thread.  I've seen that before and it can lead to a good outcome.  It
> > can, as you say, also lead to confusion.
>
> It's a bit of a shame that the cfbot will only be testing one of them
> at a time if we leave it like this.  I kind of lean towards the
> two-thread, two-CF-entry approach because of that.  The amount of
> confusion is a constant.

That sounds fine. I'll take mine elsewhere since I didn't start this thread.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: using index or check in ALTER TABLE SET NOT NULL

2019-03-05 Thread David Rowley
On Wed, 6 Mar 2019 at 08:41, Sergei Kornilov  wrote:
> In this case we need an extra argument to ConstraintImpliedByRelConstraint for
> some caller-provided existConstraint, right? Along with the Relation itself? Then
> I need to make a copy of existConstraint, append the relation constraints and call
> predicate_implied_by. If I understood correctly, the pseudocode would be:
>
> PartConstraintImpliedByRelConstraint(rel, testConstraint)
>   generate notNullConstraint from NOT NULL table attributes
>   call ConstraintImpliedByRelConstraint(rel, testConstraint, 
> notNullConstraint)
>
> ConstraintImpliedByRelConstraint(rel, testConstraint, existsConstraint)
>   copy existsConstraint to localExistsConstraint variable
>   append relation constraints to localExistsConstraint list
>   call predicate_implied_by
>
> I think the redundant IS NOT NULL check is not an issue here and we do not need
> extra arguments for ConstraintImpliedByRelConstraint. It was not changed in the
> attached version, but I will change it if you think that would be a better design.

I was really just throwing the idea out there for review. I don't
think that I'm insisting on it.

FWIW I think you could get away without the copy of the constraint
providing it was documented that the list is modified during the
ConstraintImpliedByRelConstraint call and callers must make a copy
themselves if they need an unmodified version. Certainly,
PartConstraintImpliedByRelConstraint won't need to make a copy, so
there's probably not much reason to assume that possible future
callers will.

Providing I'm imagining it correctly, I do think the patch is slightly
cleaner with that change.

It's:
a) slightly more efficient due to not needlessly checking a bunch of
IS NOT NULLs (imagine a 1000 column table with almost all NOT NULL and
a single CHECK constraint); and
b) patch has a smaller footprint (does not modify existing callers of
PartConstraintImpliedByRelConstraint()); and
c) does not modify an existing API function.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Should we increase the default vacuum_cost_limit?

2019-03-05 Thread David Rowley
Thanks for chipping in on this.

On Wed, 6 Mar 2019 at 01:53, Tomas Vondra  wrote:
> But on the other hand it feels a bit weird that we increase this one
> value and leave all the other (also very conservative) defaults alone.

Which others did you have in mind? Like work_mem, shared_buffers?  If
so, I mentioned in the initial post that I don't see vacuum_cost_limit
as in the same category as those.  It's not like PostgreSQL won't
start on a tiny server if vacuum_cost_limit is too high, but you will
have issues with too big a shared_buffers, for example.   I think if
we insist that this patch is a review of all the "how big is your
server" GUCs then that's raising the bar significantly and
unnecessarily for what I'm proposing here.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: patch to allow disable of WAL recycling

2019-03-05 Thread Jerry Jelinek
Alvaro,

Thanks again for your review. I went through your proposed patch diffs and
applied most of them to my original changes. I did a few things slightly
differently since I wanted to keep to 80 columns for the source code,
but I can revisit that if it is not an issue. I also cleaned up the
confusing wording around "allocating blocks". I ran a clean build and make
check passes. The new patch is attached.

Thanks,
Jerry


On Wed, Feb 27, 2019 at 4:12 PM Alvaro Herrera 
wrote:

> On 2019-Feb-05, Jerry Jelinek wrote:
>
> > First, since last fall, we have found another performance problem related
> > to initializing WAL files. I've described this issue in more detail
> below,
> > but in order to handle this new problem, I decided to generalize the
> patch
> > so the tunable refers to running on a Copy-On-Write filesystem instead of
> > just being specific to WAL recycling. Specifically, I renamed the GUC
> > tunable from 'wal_recycle' to 'wal_cow_fs'. Hopefully this will make it
> > more obvious what is being tuned and will also be more flexible if there
> > are other problems in the future which are related to running on a COW
> > filesystem. I'm happy to choose a different name for the tunable if
> people
> > don't like 'wal_cow_fs'.
>
> I think the idea of it being a generic tunable for assorted behavior
> changes, rather than specific to WAL recycling, is a good one.  I'm
> unsure about your proposed name -- maybe "wal_cow_filesystem" is better?
>
> I'm rewording your doc addition a little bit.  Here's my proposal:
>
>
> This parameter should only be set to on when the WAL
> resides on a Copy-On-Write (COW) filesystem.
> Enabling this option adjusts behavior to take advantage of the
> filesystem characteristics (for example, recycling WAL files and
> zero-filling new WAL files are disabled).
>
> This part sounds good enough to me -- further suggestions welcome.
>
> I'm less sure about this phrase:
>
> This setting is only appropriate for filesystems which
> allocate new disk blocks on every write.
>
> Is "... which allocate new disk blocks on every write" a technique
> distinct from CoW itself?  I'm confused as to what it means, or how can
> the user tell whether they are on such a filesystem.
>
> Obviously you're thinking that ZFS is such a filesystem and everybody
> who has pg_wal on ZFS should enable this option.  What about, say, Btrfs
> -- should they turn this option on?  Browsing the wikipedia, I find that
> Windows has this ReFS thing that apparently is also CoW, but NTFS isn't.
> I don't think either Btrfs or ReFS are realistic options to put pg_wal
> on, so let's just list the common filesystems for which users are
> supposed to enable this option ... which I think nowadays is just ZFS.
> All in all, I would replace this phrase with something like: "This
> setting should be enabled when pg_wal resides on a ZFS filesystem or
> similar." That should be weasely enough that it's clear that we expect
> users to do the homework when on unusual systems, while actively pointing
> out the most common use case.
>
> > Finally, the patch now includes bypassing the zero-fill for new WAL files
> > when wal_cow_fs is true.
>
> That makes sense.  I think all these benchmarks Tomas Vondra run are not
> valid anymore ...
>
> The attached v2 has assorted cosmetic cleanups.  If you can validate it,
> I would appreciate it.
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


0001-cow-filesystem.patch
Description: Binary data


Re: [HACKERS] Incomplete startup packet errors

2019-03-05 Thread Andrew Dunstan


On 3/4/19 7:42 AM, Christoph Berg wrote:
> Re: Andrew Dunstan 2019-03-04 
> <7cc6d2c1-bd87-9890-259d-36739c247...@2ndquadrant.com>
>> Looks good to me.
> +1.
>


OK, I think we have agreement on Tom's patch. Do we want to backpatch
it? It's a change in behaviour, but I find it hard to believe anyone
relies on the existence of these annoying messages, so my vote would be
to backpatch it.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: A separate table level option to control compression

2019-03-05 Thread Andrew Dunstan


On 2/6/19 2:32 AM, Pavan Deolasee wrote:
> Hello,
>
> Currently either the table level option `toast_tuple_target` or the
> compile time default `TOAST_TUPLE_TARGET` is used to decide whether a
> new tuple should be compressed or not. While this works reasonably
> well for most situations, at times the user may not want to pay the
> overhead of toasting, yet take benefits of inline compression.
>
> I would like to propose a new table level option,
> compress_tuple_target, which can be set independently of
> toast_tuple_target, and is checked while deciding whether to compress
> the new tuple or not.
>
> For example,
>
> CREATE TABLE compresstest250 (a int, b text) WITH
> (compress_tuple_target = 250);
> CREATE TABLE compresstest2040 (a int, b text) WITH
> (compress_tuple_target = 2040);
>
> -- shouldn't get compressed nor toasted
> INSERT INTO compresstest250 VALUES (1, repeat('1234567890',20));
>
> -- should get compressed, but not toasted
> INSERT INTO compresstest250 VALUES (2, repeat('1234567890',30));
>
> -- shouldn't get compressed nor toasted
> INSERT INTO compresstest2040 VALUES (1, repeat('1234567890',20));
> INSERT INTO compresstest2040 VALUES (2, repeat('1234567890',30));
>
> Without this patch, the second INSERT will not compress the tuple
> since its length is less than the toast threshold. With the patch and
> after setting table level option, one can compress such tuples.
>
> The attached patch implements this idea. 
>


This is a nice idea, and I'm a bit surprised it hasn't got more
attention. The patch itself seems very simple and straightforward,
although it could probably do with having several sets of eyeballs on it.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Should we increase the default vacuum_cost_limit?

2019-03-05 Thread Andres Freund
On 2019-03-05 17:14:55 -0500, Andrew Dunstan wrote:
> This patch is tiny, seems perfectly reasonable, and has plenty of
> support. I'm going to commit it shortly unless there are last minute
> objections.

+1



Re: Should we increase the default vacuum_cost_limit?

2019-03-05 Thread Andrew Dunstan


On 2/25/19 8:38 AM, David Rowley wrote:
> On Tue, 26 Feb 2019 at 02:06, Joe Conway  wrote:
>> On 2/25/19 1:17 AM, Peter Geoghegan wrote:
>>> On Sun, Feb 24, 2019 at 9:42 PM David Rowley
>>>  wrote:
 The current default vacuum_cost_limit of 200 seems to be 15 years old
 and was added in f425b605f4e.

 Any supporters for raising the default?
>>> I also think that the current default limit is far too conservative.
>> I agree entirely. In my experience you are usually much better off if
>> vacuum finishes quickly. Personally I think our default scale factors
>> are horrible too, especially when there are tables with large numbers of
>> rows.
> Agreed that the scale factors are not perfect, but I don't think
> changing them is as quite a no-brainer as the vacuum_cost_limit, so
> the attached patch just does the vacuum_cost_limit.
>
> I decided to do the times by 10 option that I had mentioned. Ensue
> debate about that...
>
> I'll add this to the March commitfest and set the target version as PG12.
>

This patch is tiny, seems perfectly reasonable, and has plenty of
support. I'm going to commit it shortly unless there are last minute
objections.


cheers


andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Rare SSL failures on eelpout

2019-03-05 Thread Thomas Munro
On Wed, Mar 6, 2019 at 9:21 AM Tom Lane  wrote:
> The bug #15598 report is more troublesome, as we don't have a strong
> reason to believe it's not common on Windows.  However, I wonder whether
> we can really do anything at all about that one.  If I understand what
> Andrew was hypothesizing in that thread, it was that Windows might be
> dropping undelivered data on the floor once the server closes its end
> of the connection.  That would be totally broken behavior, but I never
> expect anything else from Microsoft :-(.  If that is an accurate theory
> then rewriting libpq won't fix it.

Here is a stupid Python 2.7 program to try to test that.  Run one copy
of it like this:

$ python ./test.py --server

The server will wait for a client, send a message immediately, and
then close the socket after a second.  The client will connect and
send something once before and twice after the server closes the
socket, and finally see if it can read the message from the server.

Here's the output I get from the client on some different systems (the
server was running on the same system):

$ uname -a
Linux debian 4.18.0-3-amd64 #1 SMP Debian 4.18.20-2 (2018-11-23)
x86_64 GNU/Linux
$ python ./test.py --client
Sending A...
2
Sending B...
[Errno 104] Connection reset by peer
Sending C...
[Errno 32] Broken pipe
This is the server saying goodbye

$ uname -a
Darwin macaque.local 18.2.0 Darwin Kernel Version 18.2.0: Thu Dec 20
20:46:53 PST 2018; root:xnu-4903.241.1~1/RELEASE_X86_64 x86_64
$ python2.7 ./test.py --client
Sending A...
2
Sending B...
[Errno 32] Broken pipe
Sending C...
[Errno 32] Broken pipe
This is the server saying goodbye

$ uname -a
FreeBSD dogmatix 13.0-CURRENT FreeBSD 13.0-CURRENT c0873ea614a(master)
GENERIC  amd64
$ python2.7 ./test.py --client
Sending A...
2
Sending B...
2
Sending C...
2
This is the server saying goodbye

So... can anyone tell us what happens on Windows?

(A secondary question might be what happens if the server and client
are on different machines since I guess it could be different?)

-- 
Thomas Munro
https://enterprisedb.com
import socket
import sys
import time

address = ("localhost", 54321)  # any free local port will do

def server():
  ss = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  ss.bind(address)
  ss.listen(5)
  (s, other_address) = ss.accept()
  # Send a message, but then close the socket shortly after.
  # Will the client manage to read the goodbye message, if it tries to write
  # first?
  s.send("This is the server saying goodbye\n")
  time.sleep(1)
  s.close()

def client():
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.connect(address)
  # Try to send a message immediately.  At this point we expect the server
  # end to be waiting for a second before closing, so it should be OK.
  try:
print "Sending A..."
sent = s.send("A\n")
print sent
  except Exception, e:
# (Should not be reached)
print e
  # Wait until after the server has closed its socket, and try to send
  # another message.
  time.sleep(2)
  try:
print "Sending B..."
sent = s.send("B\n")
print sent
  except Exception, e:
# We expect an error either here or at the next write (?)
print e
  # Send one more message, just to see if perhaps an error will be reported
  # here rather than earlier...
  try:
print "Sending C..."
sent = s.send("C\n")
print sent
  except Exception, e:
# What about now?
print e
  # Can we read the goodbye message?
  print s.recv(1024)

if __name__ == "__main__":
  if sys.argv[1] == "--server":
server()
  elif sys.argv[1] == "--client":
client()


Re: Patch to document base64 encoding

2019-03-05 Thread Fabien COELHO



Hello Karl,


Attached: doc_base64_v3.patch


I'm ok with referencing the historical MIME RFC.

"RFC2045 section 6.8" -> "RFC 2045 Section 6.8"

you can link to the RFC directly with:

<ulink url="https://tools.ietf.org/html/rfc2045#section-6.8">RFC 2045
Section 6.8</ulink>


--
Fabien.



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2019-03-05 Thread Robert Haas
On Thu, Feb 28, 2019 at 3:27 PM Robert Haas  wrote:
> I'm not currently aware of any remaining correctness issues with this
> code, although certainly there may be some.  There has been a certain
> dearth of volunteers to review any of this.  I do plan to poke at it a
> bit to see whether it has any significant performance impact, but not
> today.

Today, did some performance testing.  I created a table with 100
partitions and randomly selected rows from it using pgbench, with and
without -M prepared.  The results show a small regression, but I
believe it's below the noise floor.  Five minute test runs.

with prepared queries

master:
tps = 10919.914458 (including connections establishing)
tps = 10876.271217 (including connections establishing)
tps = 10761.586160 (including connections establishing)

concurrent-attach:
tps = 10883.535582 (including connections establishing)
tps = 10868.471805 (including connections establishing)
tps = 10761.586160 (including connections establishing)

with simple queries

master:
tps = 1486.120530 (including connections establishing)
tps = 1486.797251 (including connections establishing)
tps = 1494.129256 (including connections establishing)

concurrent-attach:
tps = 1481.774212 (including connections establishing)
tps = 1472.159016 (including connections establishing)
tps = 1476.444097 (including connections establishing)

Looking at the total of the three results, that's about an 0.8%
regression with simple queries and an 0.2% regression with prepared
queries.  Looking at the median, it's about 0.7% and 0.07%.  Would
anybody like to argue that's a reason not to commit these patches?

Would anyone like to argue that there is any other reason not to
commit these patches?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: bgwriter_lru_maxpages limits in PG 10 sample conf

2019-03-05 Thread Bruce Momjian
On Tue, Mar  5, 2019 at 12:24:14AM -0300, Alvaro Herrera wrote:
> On 2019-Mar-04, Bruce Momjian wrote:
> 
> > On Thu, Feb 28, 2019 at 10:28:44AM +0300, Sergei Kornilov wrote:
> > > Hello
> > > 
> > > postgresql.conf.sample was changed recently in REL_10_STABLE (commit 
> > > ab1d9f066aee4f9b81abde6136771debe0191ae8)
> > > So the config will be changed in the next minor release anyway. Is there another
> > > reason not to fix the bgwriter_lru_maxpages comment?
> > 
> > Frankly, I was surprised postgresql.conf.sample was changed in a back
> > branch since it will cause people who diff $PGDATA/postgresql.conf with
> > share/postgresql.conf.sample to see differences they didn't make.
> 
> I think the set of people that execute diffs of their production conf
> file against the sample file to be pretty small -- maybe even empty.  If
> you're really interested in knowing the changes you've done, you're more
> likely to use a version control system on the file anyway.

Well, if this is true, then we should all agree to backpatch to
postgresql.conf.sample more often.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Fwd: [Issue] Can't recompile cube extension as PGXS, utils/float.h is not installed

2019-03-05 Thread Siarhei Siniak
-- Forwarded message -
From: Siarhei Siniak 
Date: Tue, 5 Mar 2019 at 23:31
Subject: Re: [Issue] Can't recompile cube extension as PGXS, utils/float.h
is not installed
To: Tom Lane 


>AFAICT, that file only exists in HEAD, not in any released branch, and
>it is installed during "make install" from HEAD.  Please be sure you
>are using installed files that match whatever branch you're trying
>to build from.
Yeah, June and July 2018, a month in between. I just thought the release was
not so long ago.
```
git log  REL_11_BETA2..6bf0bc842bd75 --format=oneline -- | wc -l
# 175
```
Ok, then probably no more questions.


Re: Rare SSL failures on eelpout

2019-03-05 Thread Tom Lane
Thomas Munro  writes:
> Bleugh.  I think this OpenSSL package might just be buggy on ARM.  On
> x86, apparently the same version of OpenSSL and all other details of
> the test the same, I can see that SSL_connect() returns <= 0
> (failure), and then we ask for that cert revoked message directly and
> never even reach the startup packet sending code.  On ARM,
> SSL_connect() returns 1 (success) and then we proceed as discussed and
> eventually get the error later (or not).  So I think I should figure
> out a minimal repro and report this to them.

Yeah, I've still been unable to reproduce even with the sleep idea,
so eelpout is definitely looking like a special snowflake from here.
In any case, there seems little doubt that getting past SSL_connect()
when the cert check has failed is an OpenSSL bug; I don't feel a need
to create a workaround for it.

The bug #15598 report is more troublesome, as we don't have a strong
reason to believe it's not common on Windows.  However, I wonder whether
we can really do anything at all about that one.  If I understand what
Andrew was hypothesizing in that thread, it was that Windows might be
dropping undelivered data on the floor once the server closes its end
of the connection.  That would be totally broken behavior, but I never
expect anything else from Microsoft :-(.  If that is an accurate theory
then rewriting libpq won't fix it.

I still think the redesign I suggested upthread would make things cleaner,
but I don't have time or interest to make it happen in the near future
if it's not fixing an observable bug.

regards, tom lane



Re: using index or check in ALTER TABLE SET NOT NULL

2019-03-05 Thread Sergei Kornilov
Hello, David!

> I've made a pass over v10. I think it's in pretty good shape, but I
> did end up changing a few small things.

Thank you! I merged your changes to new patch version.

> The only thing that I'm a bit unsure of is the tests. I've read the
> thread and I see the discussion above about it. I'd personally have
> thought INFO was fine since ATTACH PARTITION does that, but I see
> objections.

It is unclear to me whether we have consensus about the INFO ereport in ATTACH
PARTITION.

> It appears that all the tests just assume that the CHECK
> constraint was used to validate the SET NOT NULL. I'm not all that
> sure if there's any good reason not to set client_min_messages =
> 'debug1' just before your test then RESET client_min_messages;
> afterwards. No other tests seem to do it, but my only thoughts on the
> reason not to would be that it might fail if someone added another
> debug somewhere, but so what? they should update the expected results
> too.

I'm not sure we have consensus about debug messages in tests, so I made those
test changes in an additional patch.

> separate that part out and
> put back the function named PartConstraintImpliedByRelConstraint and
> have it do the IS NOT NULL building before calling
> ConstraintImpliedByRelConstraint(). No duplicate code that way and you
> also don't need to touch partbound.c

In this case we need an extra argument to ConstraintImpliedByRelConstraint for
some caller-provided existConstraint, right? Along with the Relation itself? Then I
need to make a copy of existConstraint, append the relation constraints and call
predicate_implied_by. If I understood correctly, the pseudocode would be:

PartConstraintImpliedByRelConstraint(rel, testConstraint)
  generate notNullConstraint from NOT NULL table attributes
  call ConstraintImpliedByRelConstraint(rel, testConstraint, notNullConstraint)

ConstraintImpliedByRelConstraint(rel, testConstraint, existsConstraint)
  copy existsConstraint to localExistsConstraint variable
  append relation constraints to localExistsConstraint list
  call predicate_implied_by
  
I think the redundant IS NOT NULL check is not an issue here and we do not need
extra arguments for ConstraintImpliedByRelConstraint. It was not changed in the
attached version, but I will change it if you think that would be a better design.
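
For reference, the user-visible case this speeds up is along these lines
(names invented for illustration):

CREATE TABLE t (a int, CONSTRAINT a_not_null CHECK (a IS NOT NULL));
-- The existing valid CHECK constraint already proves no NULL can be
-- present, so this SET NOT NULL can skip the full-table scan:
ALTER TABLE t ALTER COLUMN a SET NOT NULL;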

regards, Sergeidiff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 890b23afd6..926b3361ea 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -214,8 +214,17 @@ WITH ( MODULUS numeric_literal, REM
 
  
   These forms change whether a column is marked to allow null
-  values or to reject null values.  You can only use SET
-  NOT NULL when the column contains no null values.
+  values or to reject null values.
+ 
+
+ 
+  SET NOT NULL may only be applied to a column
+  providing none of the records in the table contain a
+  NULL value for the column.  Ordinarily this is
+  checked during the ALTER TABLE by scanning the
+  entire table, however if a valid CHECK constraint is
+  found which proves no NULL can exist, then the
+  table scan is skipped.
  
 
  
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index a93b13c2fe..b03e1bc875 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -158,7 +158,7 @@ typedef struct AlteredTableInfo
 	/* Information saved by Phases 1/2 for Phase 3: */
 	List	   *constraints;	/* List of NewConstraint */
 	List	   *newvals;		/* List of NewColumnValue */
-	bool		new_notnull;	/* T if we added new NOT NULL constraints */
+	bool		verify_new_notnull; /* T if we should recheck NOT NULL */
 	int			rewrite;		/* Reason for forced rewrite, if any */
 	Oid			newTableSpace;	/* new tablespace; 0 means no change */
 	bool		chgPersistence; /* T if SET LOGGED/UNLOGGED is used */
@@ -370,6 +370,7 @@ static ObjectAddress ATExecDropNotNull(Relation rel, const char *colName, LOCKMO
 static void ATPrepSetNotNull(Relation rel, bool recurse, bool recursing);
 static ObjectAddress ATExecSetNotNull(AlteredTableInfo *tab, Relation rel,
  const char *colName, LOCKMODE lockmode);
+static bool NotNullImpliedByRelConstraints(Relation rel, Form_pg_attribute attr);
 static ObjectAddress ATExecColumnDefault(Relation rel, const char *colName,
 	Node *newDefault, LOCKMODE lockmode);
 static ObjectAddress ATExecAddIdentity(Relation rel, const char *colName,
@@ -4506,10 +4507,11 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode)
 		else
 		{
 			/*
-			 * Test the current data within the table against new constraints
-			 * generated by ALTER TABLE commands, but don't rebuild data.
+			 * If required, test the current data within the table against new
+			 * constraints generated by ALTER TABLE commands, but don't rebuild
+			 * data.
 			 */
-			if (tab->constraints != NIL || tab->new_notnull ||
+			if (tab->constraints != 

Re: any plan to support shared servers like Oracle in PG?

2019-03-05 Thread legrand legrand
There already are solutions providing this feature for Postgres,
usually under the "connection pooler" wording.

See:

pgpool: http://www.pgpool.net/mediawiki/index.php/Main_Page

pgbouncer: https://pgbouncer.github.io/

There are also discussions about including this as a core feature:

https://www.postgresql.org/message-id/flat/4b971a8f-ff61-40eb-8f30-7b57eb0fdf9d%40postgrespro.ru



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



Re: Rare SSL failures on eelpout

2019-03-05 Thread Thomas Munro
On Wed, Mar 6, 2019 at 7:05 AM Thomas Munro  wrote:
> On Wed, Mar 6, 2019 at 6:07 AM Tom Lane  wrote:
> > Annoying.  I'd be happier about writing code to fix this if I could
> > reproduce it :-(
>
> Hmm.  Note that eelpout only started doing it with OpenSSL 1.1.1.

Bleugh.  I think this OpenSSL package might just be buggy on ARM.  On
x86, apparently the same version of OpenSSL and all other details of
the test the same, I can see that SSL_connect() returns <= 0
(failure), and then we ask for that cert revoked message directly and
never even reach the startup packet sending code.  On ARM,
SSL_connect() returns 1 (success) and then we proceed as discussed and
eventually get the error later (or not).  So I think I should figure
out a minimal repro and report this to them.

-- 
Thomas Munro
https://enterprisedb.com



Re: New vacuum option to do only freezing

2019-03-05 Thread Bossart, Nathan
On 3/5/19, 1:22 AM, "Masahiko Sawada"  wrote:
> On Tue, Mar 5, 2019 at 8:27 AM Bossart, Nathan  wrote:
>> +  VACUUM removes dead tuples and prunes HOT-updated
>> +  tuples chain for live tuples on table. If the table has any dead tuple
>> +  it removes them from both table and indexes for re-use. With this
>> +  option VACUUM doesn't completely remove dead tuples
>> +  and disables removing dead tuples from indexes.  This is suitable for
>> +  avoiding transaction ID wraparound (see
>> +  ) but not sufficient for 
>> avoiding
>> +  index bloat. This option is ignored if the table doesn't have index.
>> +  This cannot be used in conjunction with FULL
>> +  option.
>>
>> There are a couple of small changes I would make.  How does something
>> like this sound?
>>
>> VACUUM removes dead tuples and prunes HOT-updated tuple chains for
>> live tuples on the table.  If the table has any dead tuples, it
>> removes them from both the table and its indexes and marks the
>> corresponding line pointers as available for re-use.  With this
>> option, VACUUM still removes dead tuples from the table, but it
>> does not process any indexes, and the line pointers are marked as
>> dead instead of available for re-use.  This is suitable for
>> avoiding transaction ID wraparound (see Section 24.1.5) but not
>> sufficient for avoiding index bloat.  This option is ignored if
>> the table does not have any indexes.  This cannot be used in
>> conjunction with the FULL option.
>
> Hmm, that's a good idea but I'm not sure that users know the term 'line
> pointer' because it is used only in the pageinspect documentation. I wonder if
> the term 'item identifier' would be more appropriate here because it is
> used when describing the page layout (in storage.sgml). Thoughts?

That seems reasonable to me.  It seems like ItemIdData is referred to
as "item pointer," "line pointer," or "item identifier" depending on
where you're looking, but ItemPointerData is also referred to as "item
pointer."  I think using "item identifier" is appropriate here for
clarity and consistency with storage.sgml.

>> tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, 
>> false,
>> -
>> >latestRemovedXid);
>> +
>> >latestRemovedXid,
>> +
>> _pruned);
>>
>> Why do we need a separate tups_pruned argument in heap_page_prune()?
>> Could we add the result of heap_page_prune() to tups_pruned instead,
>> then report the total number of removed tuples as tups_vacuumed +
>> tups_pruned elsewhere?
>
> Hmm, I thought that we should report only the number of tuples
> completely removed, but we already count the tuples marked as
> redirected as tups_vacuumed. Let me summarize the fate of dead tuples.
> I think we can roughly classify dead tuples as follows.
>
> 1. root tuple of HOT chain that became dead
> 2. root tuple of HOT chain that became redirected
> 3. other tuples of HOT chain that became unused
> 4. tuples that became dead after HOT pruning
>
> The tuples of #1 through #3 either have only ItemIds left or have been
> completely removed, but the tuples of #4 still have their tuple storage because
> they are not processed by HOT pruning.
>
> Currently tups_vacuumed counts all of them, nleft (=
> vacrelstats->num_dead_tuples) counts #1 + #4. I think that the number
> of removed tuples being reported would be #1 + #2 + #3. Or should we
> use  #2 + #3 instead?

I think I'm actually more in favor of what was in v6.  IIRC that
version of the patch didn't modify how we tracked the "removed" tuples
at all, but it just added the "X item identifiers left marked dead"
metric.  Since even the tuples we are leaving marked dead lose
storage, that seems accurate enough to me.

>> Another interesting thing I noticed is that this "removed X row
>> versions" message is only emitted if vacuumed_pages is greater than 0.
>> However, if we only did HOT pruning, tups_vacuumed will be greater
>> than 0 while vacuumed_pages will still be 0, so some information will
>> be left out.  I think this is already the case, though, so this could
>> probably be handled in a separate thread.
>>
>
> Hmm, since this log message corresponds to the one that
> lazy_vacuum_heap emits and the total number of removed tuples is always
> reported, it seems consistent to me. Do you have another point?

Here's an example:

postgres=# CREATE TABLE test (a INT, b INT);
CREATE TABLE
postgres=# CREATE INDEX ON test (a);
CREATE INDEX
postgres=# INSERT INTO test VALUES (1, 2);
INSERT 0 1

After only HOT updates, the "removed X row versions in Y pages"
message is not emitted:

postgres=# UPDATE test SET b = 3;
UPDATE 1
postgres=# 

Windows 32 bit vs circle test

2019-03-05 Thread Andrew Dunstan


We don't currently have any buildfarm animals running 32 bit mingw
builds for releases > 10. As part of my testing of msys 2 I thought I
would try its 32 bit compiler and got this regression diff on HEAD


cheers


andrew


diff -w -U3
C:/tools/msys64/home/Administrator/bf/root/HEAD/pgsql/src/test/regress/expected/circle.out
C:/tools/msys64/home/Administrator/bf/root/HEAD/pgsql.build/src/test/regress/results/circle.out
---
C:/tools/msys64/home/Administrator/bf/root/HEAD/pgsql/src/test/regress/expected/circle.out
 
2019-03-03 17:14:38.648207000 +
+++
C:/tools/msys64/home/Administrator/bf/root/HEAD/pgsql.build/src/test/regress/results/circle.out

2019-03-04 19:33:37.576494900 +
@@ -111,8 +111,8 @@
   WHERE (c1.f1 < c2.f1) AND ((c1.f1 <-> c2.f1) > 0)
   ORDER BY distance, area(c1.f1), area(c2.f1);
  five |  one   |  two   | distance
---+++--
-  | <(3,5),0>  | <(1,2),3>  | 0.60555127546399
+--+++---
+  | <(3,5),0>  | <(1,2),3>  | 0.605551275463989
   | <(3,5),0>  | <(5,1),3>  | 1.47213595499958
   | <(100,200),10> | <(100,1),115>  |   74
   | <(100,200),10> | <(1,2),100>    | 111.370729772479



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Inheriting table AMs for partitioned tables

2019-03-05 Thread Robert Haas
On Tue, Mar 5, 2019 at 12:59 PM Andres Freund  wrote:
> Based on this mail I'm currently planning to simply forbid specifying
> USING for partitioned tables. Then we can argue about this later.

+1.  I actually think that might be the right thing in the long-term,
but it undeniably avoids committing to any particular decision in the
short term, which seems good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Ordered Partitioned Table Scans

2019-03-05 Thread Robert Haas
On Wed, Dec 19, 2018 at 5:08 PM David Rowley
 wrote:
> With my idea for using live_parts, we'll process the partitions
> looking for interleaved values on each query, after pruning takes
> place. In this case, we'll see the partitions are naturally ordered. I
> don't really foresee any issues with that additional processing since
> it will only be a big effort when there are a large number of
> partitions, and in those cases the planner already has lots of work to
> do. Such processing is just a drop in the ocean when compared to path
> generation for all those partitions.

I agree that partitions_are_ordered() is cheap enough in this patch
that it probably doesn't matter whether we cache the result.  On the
other hand, that's mostly because you haven't handled the hard cases -
e.g. interleaved list partitions.  If you did, then it would be
expensive, and it probably *would* be worth caching the result.  Now
maybe those hard cases aren't worth handling anyway.
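
To make "interleaved list partitions" concrete, here is a sketch (names
invented):

CREATE TABLE la (a int) PARTITION BY LIST (a);
CREATE TABLE la_p1 PARTITION OF la FOR VALUES IN (1, 3);
CREATE TABLE la_p2 PARTITION OF la FOR VALUES IN (2, 4);

-- Appending la_p1's output and then la_p2's cannot yield rows ordered by a
-- (you could get 1, 3, 2, 4), so proving orderedness for such a table means
-- examining how the list values interleave across partitions, which is the
-- expensive check in question.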

You also seem to be saying that since run-time partition pruning
might change the answer, caching the initial answer is pointless.  But
I think Julien has made a good argument for why that's wrong: if the
initial answer is that the partitions are ordered, which will often be
true, then we can skip all later checks.

So I am OK with the fact that this patch doesn't choose to cache it,
but I don't really buy any of your arguments for why it would be a bad
idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rare SSL failures on eelpout

2019-03-05 Thread Thomas Munro
On Wed, Mar 6, 2019 at 6:07 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > You can see that poll() already knew the other end had closed the
> > socket.  Since this is clearly timing... let's see, yeah, I can make
> > it fail every time by adding sleep(1) before the comment "Send the
> > startup packet.".  I assume that'll work on any Linux machine?
>
> Great idea, but no cigar --- doesn't do anything for me except make
> the ssl test really slow.  (I tried it on RHEL6 and Fedora 28 and, just
> for luck, current macOS.)  What this seems to prove is that the thing
> that's different about eelpout is the particular kernel it's running,
> and that that kernel has some weird timing behavior in this situation.
>
> I've also been experimenting with reducing libpq's SO_SNDBUF setting
> on the socket, with more or less the same idea of making the sending
> of the startup packet slower.  No joy there either.
>
> Annoying.  I'd be happier about writing code to fix this if I could
> reproduce it :-(

Hmm.  Note that eelpout only started doing it with OpenSSL 1.1.1.  But
I just tried the sleep(1) trick on an x86 box running the same version
of Debian, OpenSSL etc and it didn't work.  So eelpout (a super cheap
virtualised 4-core ARMv8 system rented from scaleway.com running
Debian Buster with a kernel identifying itself as 4.9.23-std-1 and
libc6 2.28-7) is indeed starting to look pretty weird.  Let me know if
you want to log in and experiment on that machine.

-- 
Thomas Munro
https://enterprisedb.com



Re: Inheriting table AMs for partitioned tables

2019-03-05 Thread Andres Freund
On 2019-03-04 22:08:04 -0800, Andres Freund wrote:
> Hi,
> 
> On 2019-03-05 16:01:50 +1300, David Rowley wrote:
> > On Tue, 5 Mar 2019 at 12:47, Andres Freund  wrote:
> > > CREATE TABLE tableam_parted_heap2 (a text, b int) PARTITION BY list (a) 
> > > USING heap2;
> > >
> > > SET default_table_access_method = 'heap';
> > > CREATE TABLE tableam_parted_a_heap2 PARTITION OF tableam_parted_heap2 FOR 
> > > VALUES IN ('a');
> > 
> > 
> > > But for tableam_parted_a_heap2 tableam_parted_b_heap2 the answer isn't
> > > quite as clear.  I think it'd both be sensible for new partitions to
> > > inherit the AM from the root, but it'd also be sensible to use the
> > > current default.
> > 
> > I'd suggest it's made to work the same way as ca4103025dfe26 made
> > tablespaces work.
> 
> Hm, is that actually correct?  Because as far as I can tell that doesn't
> have the necessary pg_dump code to make this behaviour persistent:
> 
> CREATE TABLESPACE frak LOCATION '/tmp/frak';
> CREATE TABLE test_tablespace (a text, b int) PARTITION BY list (a) TABLESPACE 
> frak ;
> CREATE TABLE test_tablespace_1 PARTITION OF test_tablespace FOR VALUES in 
> ('a');
> CREATE TABLE test_tablespace_2 PARTITION OF test_tablespace FOR VALUES in 
> ('b') TABLESPACE pg_default;
> CREATE TABLE test_tablespace_3 PARTITION OF test_tablespace FOR VALUES in 
> ('c') TABLESPACE frak;
> 
> SELECT relname, relkind, reltablespace FROM pg_class WHERE relname LIKE 
> 'test_tablespace%' ORDER BY 1;
> ┌───┬─┬───┐
> │  relname  │ relkind │ reltablespace │
> ├───┼─┼───┤
> │ test_tablespace   │ p   │ 16384 │
> │ test_tablespace_1 │ r   │ 16384 │
> │ test_tablespace_2 │ r   │ 0 │
> │ test_tablespace_3 │ r   │ 16384 │
> └───┴─┴───┘
> (4 rows)
> 
> but a dump outputs (abbreviated)
> 
> SET default_tablespace = frak;
> CREATE TABLE public.test_tablespace (
> a text,
> b integer
> )
> PARTITION BY LIST (a);
> CREATE TABLE public.test_tablespace_1 PARTITION OF public.test_tablespace
> FOR VALUES IN ('a');
> SET default_tablespace = '';
> CREATE TABLE public.test_tablespace_2 PARTITION OF public.test_tablespace
> FOR VALUES IN ('b');
> SET default_tablespace = frak;
> CREATE TABLE public.test_tablespace_3 PARTITION OF public.test_tablespace
> FOR VALUES IN ('c');
> 
> which restores to:
> 
> postgres[32125][1]=# SELECT relname, relkind, reltablespace FROM pg_class 
> WHERE relname LIKE 'test_tablespace%' ORDER BY 1;
> ┌───┬─┬───┐
> │  relname  │ relkind │ reltablespace │
> ├───┼─┼───┤
> │ test_tablespace   │ p   │ 16384 │
> │ test_tablespace_1 │ r   │ 16384 │
> │ test_tablespace_2 │ r   │ 16384 │
> │ test_tablespace_3 │ r   │ 16384 │
> └───┴─┴───┘
> (4 rows)
> 
> because public.test_tablespace_2 assumes it ought to inherit the
> tablespace from the partitioned table.
> 
> 
> I also find it far from clear that:
> 
>  
>   The tablespace_name is the 
> name
>   of the tablespace in which the new table is to be created.
>   If not specified,
>is consulted, or
>if the table is temporary.  For
>   partitioned tables, since no storage is required for the table itself,
>   the tablespace specified here only serves to mark the default tablespace
>   for any newly created partitions when no other tablespace is explicitly
>   specified.
>  
> 
> is handled correctly. The above says that the *specified* tablespaces -
> which seems to exclude the default tablespace - is what's going to
> determine what partitions use as their default tablespace. But in fact
> that's not true, the partitioned table's pg_class.retablespace is set to
> what default_tablespaces was at the time of the creation.
> 
> 
> > i.e. if they specify the storage type when creating
> > the partition, then always use that, unless they mention otherwise. If
> > nothing was mentioned when they created the partition, then use
> > default_table_access_method.
> 
> Hm. That'd be doable, but given the above ambiguities I'm not convinced
> that's the best approach.  As far as I can see that'd require:
> 
> 1) At relation creation, for partitioned tables only, do not take
>default_table_access_method into account.
> 
> 2) At partition creation, if the AM is not specified and if the
>partitioned table's relam is 0, use the default_table_access_method.
> 
> 3) At pg_dump, for partitioned tables only, explicitly emit a USING
>... rather than use the method of manipulating default_table_access_method.
> 
> As far as I can tell, the necessary steps are also what'd need to be
> done to actually implement the described behaviour for TABLESPACE (with
> s/default_table_access_method/default_tablespace/ and 

Re: insensitive collations

2019-03-05 Thread Daniel Verite
Peter Eisentraut wrote:

> Older ICU versions (<54) don't support all the locale customization
> options, so many of my new tests in collate.icu.utf8.sql will fail on
> older systems.  What should we do about that?  Have another extra test file?

Maybe stick to the old-style syntax for the regression tests?
The declarations that won't work as expected with older ICU versions
would be:

CREATE COLLATION case_insensitive (provider = icu, locale =
'und-u-ks-level2', deterministic = false);

'und-u-ks-level2' is equivalent to 'und@colStrength=secondary'

CREATE COLLATION ignore_accents (provider = icu, locale =
'und-u-ks-level1-kc-true', deterministic = false);

'und-u-ks-level1-kc-true' => 'und@colStrength=primary;colCaseLevel=yes'
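
i.e., the regression tests could declare them as (untested sketch):

CREATE COLLATION case_insensitive (provider = icu,
  locale = 'und@colStrength=secondary', deterministic = false);
CREATE COLLATION ignore_accents (provider = icu,
  locale = 'und@colStrength=primary;colCaseLevel=yes', deterministic = false);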


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: proposal: new polymorphic types - commontype and commontypearray

2019-03-05 Thread Pavel Stehule
út 5. 3. 2019 v 15:35 odesílatel Tom Lane  napsal:

> David Steele  writes:
> > This thread has been very quiet for a month.  I agree with Andres [1]
> > that we should push this to PG13.
>
> I think the main thing it's blocked on is disagreement on what the
> type name should be, which is kind of a silly thing to get blocked on,
> but nonetheless it's important ...
>

I sent some other possible names, but that mail was probably forgotten.

What about "ctype" as a shortcut for "common type"? Or carraytype, cnonarraytype?
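
For illustration only (using "ctype" here just because it is one of the
names being floated; the semantics are those of the proposed common-type
polymorphics):

CREATE FUNCTION add_common(a ctype, b ctype) RETURNS ctype
  LANGUAGE sql AS 'SELECT a + b';
-- add_common(1, 2.5) would resolve both parameters and the result to
-- numeric, the common type of the actual arguments.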

Regards

Pavel

>
> regards, tom lane
>


Re: Re: Optimze usage of immutable functions as relation

2019-03-05 Thread David Steele

On 2/28/19 4:27 PM, Alexander Kuzmenkov wrote:

On 2/18/19 03:20, Tom Lane wrote:

The dummy-relation stuff I referred to has now been merged, so there's
really no good reason not to revise the patch along that line.



I'll try to post the revised implementation soon.


I'll close this on March 8th if there is no new patch.

Regards,
--
-David
da...@pgmasters.net



Re: Re: \describe*

2019-03-05 Thread Corey Huinker
>
>
> I agree with Andres and Robert.  This patch should be pushed to PG13.
>
> I'll do that on March 8 unless there is a compelling argument not to.
>
>
No objection. I'll continue to work on it, though.


Re: Refactoring the checkpointer's fsync request queue

2019-03-05 Thread Tom Lane
Thomas Munro  writes:
> +#include "fmgr.h"
> +#include "storage/block.h"
> +#include "storage/relfilenode.h"
> +#include "storage/smgr.h"
> +#include "storage/sync.h"

> Why do we need to include fmgr.h in md.h?

More generally, any massive increase in an include file's inclusions
is probably a sign that you need to refactor.  Cross-header inclusions
are best avoided altogether if you can --- obviously that's not always
possible, but we should minimize them.  We've had some very unfortunate
problems in the past from indiscriminate #includes in headers.

regards, tom lane



Re: jsonpath

2019-03-05 Thread Robert Haas
On Mon, Mar 4, 2019 at 6:27 PM Tomas Vondra
 wrote:
> 11) Wording of some of the error messages in the execute methods seems a
> bit odd. For example executeNumericItemMethod may complain that it
>
> ... is applied to not a numeric value
>
> but perhaps a more natural wording would be
>
> ... is applied to a non-numeric value
>
> And similarly for the other execute methods. But I'm not a native
> speaker, so perhaps the original wording is just fine.

As a native speaker I can confirm that the first wording is definitely
not OK.  The second one is tolerable, but I wonder if there is
something better, like "can only be applied to a numeric value" or
maybe there's a way to rephrase it so that we complain about the
non-numeric value itself rather than the application, e.g. ERROR:
json_frobnitz can only frob numeric values or ERROR: argument to
json_frobnitz must be numeric.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Refactoring the checkpointer's fsync request queue

2019-03-05 Thread Thomas Munro
On Wed, Mar 6, 2019 at 5:07 AM Shawn Debnath  wrote:
> Confirmed. Patch shows 8900 ms vs 192 ms on master for the insert test.
> Interesting! It's reproducible so should be able to figure out what's
> going on. The only thing we do in ForwardSyncRequest() is split up the 8
> bits into 2x4 bits and copy the FileTagData structure to the
> checkpointer queue. Will report back what I find.

More review, all superficial stuff:

+typedef struct
+{
+   RelFileNode rnode;
+   ForkNumber  forknum;
+   SegmentNumber   segno;
+} FileTagData;
+
+typedef FileTagData *FileTag;

Even though I know I said we should take FileTag by pointer, and even
though there is an older tradition in the tree of having a struct
named "FooData" and a corresponding pointer typedef named "Foo", as
far as I know most people are not following the convention for new
code and I for one don't like it.  One problem is that there isn't a
way to make a pointer-to-const type given a pointer-to-non-const type,
so you finish up throwing away const from your programs.  I like const
as documentation and a tiny bit of extra compiler checking.  What do
you think about "FileTag" for the struct and eg "const FileTag *tag"
when receiving one as a function argument?

-/* internals: move me elsewhere -- ay 7/94 */

Aha, about time too!

+#include "fmgr.h"
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"

Why do we need to include fmgr.h in md.h?

+/* md storage manager funcationality */

Typo.

+/* md sync callback forward declarations */

These aren't "forward" declarations, they're plain old declarations.

+extern char* mdfilepath(FileTag ftag);

Doesn't really matter too much because all of this will get
pgindent-ed at some point, but FYI we write "char *md", not "char*
md".

 #include "storage/smgr.h"
+#include "storage/md.h"
 #include "utils/hsearch.h"

Bad sorting.

+   FileTagData tag;
+   tag.rnode = reln->smgr_rnode.node;
+   tag.forknum = forknum;
+   tag.segno = seg->mdfd_segno;

I wonder if it would be better practice to zero-initialise that
sucker, so that if more members are added we don't leave them
uninitialised.  I like the syntax "FileTagData tag = {{0}}".
(Unfortunately extra nesting required here because first member is a
struct, and C99 doesn't allow us to use empty {} like C++, even though
some versions of GCC accept it.  Rats.)
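
To make that concrete, a small self-contained sketch of both initialization
styles (the member types here are simplified stand-ins, not the real ones):

#include <string.h>

typedef struct
{
    int         relid;          /* stand-in for RelFileNode's contents */
} RelFileNodeStub;

typedef struct
{
    RelFileNodeStub rnode;
    int             forknum;
    unsigned        segno;
} FileTagData;

void
init_examples(void)
{
    /* C99: the first member is itself a struct, hence the nested braces;
     * all members not mentioned are implicitly zeroed. */
    FileTagData tag = {{0}};

    /* Equivalent alternative that stays correct no matter how many
     * members are added later. */
    FileTagData tag2;
    memset(&tag2, 0, sizeof(tag2));

    (void) tag;
    (void) tag2;
}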

-- 
Thomas Munro
https://enterprisedb.com



Re: Rare SSL failures on eelpout

2019-03-05 Thread Thomas Munro
On Wed, Mar 6, 2019 at 3:33 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > Disappointingly, that turned out to be just because 10 and earlier
> > didn't care what the error message said.
>
> That is, you can reproduce the failure on old branches?  That lets
> out a half-theory I'd had, which was that Andres' changes to make
> the backend always run its socket in nonblock mode had had something
> to do with it.  (Those changes do represent a plausible reason why
> SSL_shutdown might be returning WANT_READ/WANT_WRITE; but I'm not
> in a hurry to add such code without evidence that it actually
> happens and something useful would change if we retry.)

Yes, on REL_10_STABLE:

$ for i in `seq 1 1000 ` ; do
psql "host=localhost port=56024 dbname=certdb user=postgres
sslcert=ssl/client-revoked.crt sslkey=ssl/client-revoked.key"
  done
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
...
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
psql: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
could not send startup packet: Connection reset by peer
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked

Zooming in with strace:

sendto(3, 
"\27\3\3\2\356\r\214\352@\21\320\202\236}\376\367\262\227\177\255\212\204`q\254\108\326\201+c)"...,
1115, MSG_NOSIGNAL, NULL, 0) = 1115
ppoll([{fd=3, events=POLLOUT|POLLERR}], 1, NULL, NULL, 0) = 1 ([{fd=3,
revents=POLLOUT|POLLERR|POLLHUP}])
sendto(3, 
"\27\3\3\0cW_\210\337Q\227\360\216k\221\346\372pw\27\325P\203\357\245km\304Rx\355\200"...,
104, MSG_NOSIGNAL, NULL, 0) = -1 ECONNRESET (Connection reset by peer)

You can see that poll() already knew the other end had closed the
socket.  Since this is clearly timing... let's see, yeah, I can make
it fail every time by adding sleep(1) before the comment "Send the
startup packet.".  I assume that'll work on any Linux machine?

To set this test up, I ran a server with the following config:

ssl=on
ssl_ca_file='root+client_ca.crt'
ssl_cert_file='server-cn-only.crt'
ssl_key_file='server-cn-only.key'
ssl_crl_file='root+client.crl'

I copied those files out of src/test/ssl/ssl/.  Then I ran the psql
command shown earlier.  I think I had to chmod 600 the keys.

-- 
Thomas Munro
https://enterprisedb.com



Re: [HACKERS] CLUSTER command progress monitor

2019-03-05 Thread Robert Haas
On Tue, Mar 5, 2019 at 3:56 AM Tatsuro Yamada
 wrote:
> >> === Discussion points ===
> >>
> >>- Progress counter for "3. sorting tuples" phase
> >>   - Should we add pgstat_progress_update_param() in tuplesort.c like a
> >> "trace_sort"?
> >> Thanks to Peter Geoghegan for the useful advice!
> >
> > How would we avoid an abstraction violation?
>
> Hmm... What do you mean by an abstraction violation?
> If it is difficult to solve, I'd rather not add the progress counter for the
> sorting tuples phase.

What I mean is... I think it would be useful to have this counter, but
I'm not sure how the tuplesort code would know to update the counter
in this case and not in other cases.  The tuplesort code is used for
lots of things; we can't update a counter for CLUSTER if the tuplesort
is being used for CREATE INDEX or a Sort node in a query or whatever.
So my question is how we would indicate to the tuplesort that it needs
to do the counter update, and whether that would end up making for
ugly code.
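
One hypothetical shape for this -- the names and signatures below are
invented for illustration, not the real tuplesort API -- would be for the
caller to hand tuplesort an optional progress callback, so tuplesort itself
never needs to know which command is running:

typedef void (*tuplesort_progress_cb) (long tuples_sorted, void *arg);

typedef struct SortProgress
{
    tuplesort_progress_cb callback;     /* NULL means "do not report" */
    void       *arg;                    /* passed through to the callback */
} SortProgress;

static void
report_sort_progress(const SortProgress *progress, long tuples_sorted)
{
    /* Only the caller that asked for reporting (e.g. CLUSTER) pays for it. */
    if (progress != NULL && progress->callback != NULL)
        progress->callback(tuples_sorted, progress->arg);
}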

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Question about commit 11cf92f6e2e13c0a6e3f98be3e629e6bd90b74d5

2019-03-05 Thread Robert Haas
On Tue, Mar 5, 2019 at 3:00 AM Etsuro Fujita
 wrote:
> apply_projection_to_path() not only jams the given tlist into the
> existing path but updates its tlist eval costs appropriately except for
> the cases of Gather and GatherMerge:

I had forgotten that detail, but I don't think it changes the basic
picture.  Once you've added a bunch of Paths to a RelOptInfo, it's too
late to change their *relative* cost, because add_path() puts the list
in a certain order, and adjusting the path costs won't change that
ordering.  You've got to have the costs already correct at the time
add_path() is first called for any given Path.
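
Just to make the ordering point concrete, here is a tiny self-contained
illustration with made-up costs (this is not planner code):

#include <stdio.h>

int
main(void)
{
    double path_a = 10.0;       /* added first, currently cheapest */
    double path_b = 12.0;       /* added second */

    /* The add_path()-style comparison happens once, at insertion time. */
    const char *winner = (path_a <= path_b) ? "A" : "B";

    /* A later cost adjustment (e.g. for tlist evaluation) does not revisit
     * that decision, even though it changes which path is really cheaper. */
    path_a += 5.0;              /* now 15.0 > 12.0, yet "A" already won */

    printf("chosen at add_path time: %s (now %.1f vs %.1f)\n",
           winner, path_a, path_b);
    return 0;
}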

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: speeding up planning with partitions

2019-03-05 Thread Jesper Pedersen

On 3/5/19 5:24 AM, Amit Langote wrote:

> Attached an updated version.  This incorporates fixes for both Jesper's
> and Imai-san's review.  I haven't been able to pin down the bug (or
> whatever) that makes throughput go down as the partition count increases,
> when tested with a --enable-cassert build.



Thanks !

I'm seeing the throughput going down as well, but are you sure it isn't
just the extra calls of MemoryContextCheck you are seeing? A flamegraph
diff highlights that area -- sent offline.


A non-cassert build shows the same profile for 64 and 1024 partitions.

Best regards,
 Jesper



Re: Delay locking partitions during query execution

2019-03-05 Thread Tomas Vondra



On 3/5/19 6:55 AM, David Rowley wrote:
> On Sat, 2 Feb 2019 at 02:52, Robert Haas  wrote:
>> I think the key question here is whether or not you can cope with
>> someone having done arbitrary AEL-requiring modifications to the
>> delaylocked partitions.  If you can, the fact that the plan might
>> sometimes be out-of-date is an inevitable consequence of doing this at
>> all, but I think we can argue that it's an acceptable consequence as
>> long as the resulting behavior is not too bletcherous.  If you can't,
>> then this patch is dead.
> 
> I spent some time looking at this patch again and thinking about the
> locking. I ended up looking through each alter table subcommand by way
> of AlterTableGetLockLevel() and going through scenarios with each of
> them in my head to try to understand:
> 
> a) If the subcommand can even be applied to a leaf partition; and
> b) Can the subcommand cause a cached plan to become invalid?
> 
> I did all the easy ones first then started on the harder ones. I came
> to AT_DropConstraint and imagined the following scenario:
> 
> -- ** session 1
> create table listp (a int) partition by list(a);
> create table listp1 partition of listp for values in(1);
> create table listp2 partition of listp for values in(2);
> 
> create index listp1_a_idx on listp1 (a);
> create index listp2_a_idx on listp2 (a);
> 
> set enable_seqscan = off;
> set plan_cache_mode = force_generic_plan;
> prepare q1 (int) as select * from listp where a = $1;
> execute q1 (1);
> begin;
> execute q1 (1);
> 
> 
> -- ** session 2
> drop index listp2_a_idx;
> 
> -- ** session 1
> execute q1 (2);
> ERROR:  could not open relation with OID 16401
> 
> The only way I can think to fix this is to just never lock partitions
> at all, and if a lock is to be obtained on a partition, it must be
> instead obtained on the top-level partitioned table.  That's a rather
> large change that could have large consequences, and I'm not even sure
> it's possible since we'd need to find the top-level parent before
> obtaining the lock, by then the hierarchy might have changed and we'd
> need to recheck, which seems like quite a lot of effort just to obtain
> a lock... Apart from that, it's not this patch, so looks like I'll
> need to withdraw this one :-(
> 

So you're saying we could

1) look up the parent and lock it
2) repeat the lookup to verify it did not change

I think that could still be a win, assuming that most hierarchies will
be rather shallow (I'd say 2-3 levels will cover like 95% of cases, and
4 levels would be 100% in practice). And the second lookup should be
fairly cheap thanks to syscache and the fact that the hierarchies do not
change very often.
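
A rough sketch of that retry loop, with invented helper names standing in
for the real catalog lookup and locking primitives:

typedef unsigned int Oid;

extern Oid  get_top_parent(Oid partition);          /* assumed helper */
extern void lock_relation(Oid relid, int lockmode); /* assumed helper */
extern void unlock_relation(Oid relid, int lockmode);

static Oid
lock_top_parent(Oid partition, int lockmode)
{
    for (;;)
    {
        Oid         parent = get_top_parent(partition);

        lock_relation(parent, lockmode);

        /* The hierarchy may have changed before we acquired the lock; if
         * the parent is still the same we are done, otherwise release and
         * retry with the new parent. */
        if (get_top_parent(partition) == parent)
            return parent;

        unlock_relation(parent, lockmode);
    }
}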

I can't judge how invasive this patch would be, but I agree it's more
complex than the originally proposed patch.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



any plan to support shared servers like Oracle in PG?

2019-03-05 Thread Andy Fan
Currently there is one process per connection, which is not very good for
short-lived connections.  In Oracle Database, a shared server can serve more
than one user at the same time.

See
https://docs.oracle.com/cd/B28359_01/server.111/b28310/manproc001.htm#ADMIN11166


Do we have any plan to support something like this?


Re: Should we increase the default vacuum_cost_limit?

2019-03-05 Thread Robert Haas
On Tue, Mar 5, 2019 at 7:53 AM Tomas Vondra
 wrote:
> But on the other hand it feels a bit weird that we increase this one
> value and leave all the other (also very conservative) defaults alone.

Are you talking about vacuum-related defaults or defaults in general?
In 2014, we increased the defaults for work_mem and
maintenance_work_mem by 4x and the default for effective_cache_size by
32x; in 2012, we increased the default for shared_buffers by 4x.
It's possible some of those parameters should be further increased at
some point, but deciding not to increase any of them until we can
increase all of them is tantamount to giving up on changing anything
at all.  I think it's OK to be conservative by default, but we should
increase parameters where we know that the default is likely to be too
conservative for 99% of users.  My only question about this change is
whether to go for a lesser multiple (e.g. 4x) rather than the proposed
10x.  But I think even if 10x turns out to be too much for a few more
people than we'd like, we're still going to be better off increasing
it and having some people have to turn it back down again than leaving
it the way it is and having users regularly suffer vacuum-starvation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-05 Thread Robert Haas
On Mon, Mar 4, 2019 at 1:01 AM Masahiko Sawada  wrote:
> I think that there is no need to use the same key for both the spill
> files and WAL because only one process encrypt/decrypt spill files. We
> can use something like temporary key for that use case, which is used
> by only one process and lives during process lifetime (or transaction
> lifetime). The same is true for other temporary files such as
> tuplesort and tuplestore, although maybe we need tricks for shared
> tuplestore.

Agreed.  For a shared tuplestore you need a key that is shared between
the processes involved, but it doesn't need to be the same as any
other key.  For anything that is accessed by only a single process,
that process can just generate any old key and, as long as it's
secure, it's fine.

For the WAL, you could potentially create a new WAL record type that
is basically an encrypted wrapper around another WAL record.  So if
table X is encrypted with key K1, then all of the WAL records for
table X are wrapped inside of an encrypted-record WAL record that is
encrypted with key K1.  That's useful for people who want fine-grained
encryption only of certain data.
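
As a purely illustrative sketch, such a wrapper record's layout might look
something like this (the type and field names are invented, not part of the
actual WAL format):

typedef struct EncryptedWalWrapper
{
    unsigned int    key_id;         /* identifies the key (e.g. K1 for table X) */
    unsigned int    payload_len;    /* length of the encrypted inner record */
    unsigned char   nonce[16];      /* per-record IV for the cipher */
    /* payload_len bytes of ciphertext follow: the original WAL record,
     * encrypted with the key identified by key_id */
} EncryptedWalWrapper;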

But for people who want to just encrypt everything, you need to
encrypt the entire WAL stream, all SLRU data, etc. and that pretty
much all has to be one key (or sub-keys derived from that one key
somehow).

> > Or what do you do
> > about SLRUs or other global structures?  If you just exclude that
> > stuff from the scope of encryption, then you aren't helping the people
> > who want to Just Encrypt Everything.
>
> Why do people want to just encrypt everything? For satisfying some
> security compliance?

Yeah, I think so.  Perhaps an encrypted filesystem is a better way to
go, but some people want something that is built into the database
server.  The motivation seems to be mostly that they have a compliance
requirement -- either the database itself encrypts everything, or they
cannot use the software.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Index Skip Scan

2019-03-05 Thread Dmitry Dolgov
> On Thu, Feb 28, 2019 at 10:45 PM Jeff Janes  wrote:
>
> This version of the patch can return the wrong answer.

Yes, indeed. In fact it answers my previous question related to the backward
cursor scan, where, while going back, we didn't skip enough. Within the current
approach it can be fixed by proper skipping for backward scan, something like
in the attached patch.

Although there are still some rough edges, e.g. going forth, back and forth
again leads to a situation where `_bt_first` is not applied anymore and the
first element is wrongly skipped. I'll try to fix it in the next version of
the patch.

> If we accept this patch, I hope it would be expanded in the future to give
> similar performance as the above query does even when the query is written in
> its more natural way of:

Yeah, I hope the current approach with a new index am routine can be extended
for that.

> Without the patch, instead of getting a wrong answer, I get an error:

Right, as far as I can see, without a skip scan and SCROLL a unique + index
scan is used, where amcanbackward is false by default. So it looks like it's
not really related to this patch.


v10-0001-Index-skip-scan.patch
Description: Binary data


Re: Fsync-before-close thought experiment

2019-03-05 Thread Robert Haas
On Sun, Mar 3, 2019 at 6:31 PM Thomas Munro  wrote:
> A more obvious approach that probably moves us closer to the way
> kernel developers expect us to write programs is to call fsync()
> before close() (due to vfd pressure) if you've written.

Interesting

> The obvious
> problem with that is that you could finish up doing loads more
> fsyncing than we're doing today if you're regularly dirtying more than
> max_files_per_process in the same backend.

Check.

> I wonder if we could make
> it more acceptable like so:
>
> * when performing writes, record the checkpointer's cycle counter in the File
>
> * when closing due to vfd pressure, only call fsync() if the cycle
> hasn't advanced (that is, you get to skip the fsync() if you haven't
> written in this sync cycle, because the checkpointer must have taken
> care of your writes from the previous cycle)

Hmm, OK.

> * if you find you're doing this too much (by default, dirtying more
> than 1k relation segments per checkpoint cycle), maybe log a warning
> that the user might want to consider increasing max_files_per_process
> (and related OS limits)
>
> A possible improvement, stolen from the fd-passing patch, is to
> introduce a "sync cycle", separate from checkpoint cycle, so that the
> checkpointer can process its table of pending operations more than
> once per checkpoint if that would help minimise fsyncing in foreground
> processes.  I haven't thought much about how exactly that would work.

Yeah, that seems worth considering.  I suppose that a backend could
keep track of how many times it's recorded the current sync cycle in a
File that is still open -- this seems like it should be pretty simple
and cheap, provided we can find the right places to put the counter
adjustments.  If that number gets too big, like say greater than 80%
of the number of fds, it sends a ping to the checkpointer.  I'm not
sure if that would then immediately trigger a full sync cycle or if
there is something more granular we could do.
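
A hypothetical sketch of that counter -- the names and the ping mechanism
are invented for illustration:

extern void ping_checkpointer(void);        /* assumed signalling helper */

static int  files_in_current_sync_cycle = 0;    /* open Files stamped with
                                                 * the current sync cycle */
static int  max_files_per_process = 1000;       /* stand-in for the GUC */

static void
record_current_sync_cycle(void)
{
    files_in_current_sync_cycle++;

    /* If most of our fd budget is tied up in files we would otherwise have
     * to fsync ourselves on close, ask the checkpointer to start a new
     * sync cycle early. */
    if (files_in_current_sync_cycle > max_files_per_process * 8 / 10)
        ping_checkpointer();
}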

> There is a less obvious problem with this scheme though:
>
> 1.  Suppose the operating system has a single error flag for a file no
> matter how many fds have it open, and it is cleared by the first
> syscall to return EIO.  There is a broken schedule like this (B =
> regular backend, C = checkpointer):
>
> B: fsync() -> EIO # clears error flag
> C: fsync() -> success, log checkpoint
> B: PANIC!
>
> 2.  On an operating system that has an error counter + seen flag
> (Linux 4.14+), in general you receive the error in all fds, which is a
> very nice property, but we'd still have a broken schedule involving
> the seen flag:
>
> B: fsync() -> EIO # clears seen flag
> C: open() # opened after error
> C: fsync() -> success, log checkpoint
> B: PANIC!
>
> Here's one kind of interlocking that might work:  Hash pathnames and
> map to an array of lwlocks + error flags.  Any process trying to sync
> a file must hold the lock and check for a pre-existing error flag.
> Now a checkpoint cannot succeed if any backend has recently decided to
> panic.  You could skip that if data_sync_retry = on.

That might be fine, but I think it might be possible to create
something more light-weight.  Suppose we just decide that foreground
processes always win and the checkpointer has to wait before logging
the checkpoint.  To that end, a foreground process advertises in
shared memory whether or not it is currently performing an fsync.  The
checkpointer must observe each process until it sees that process in
the not-fsyncing state at least once.

If a process starts fsync-ing after being observed not-fsyncing, it
just means that the backend started doing an fsync() after the
checkpointer had completed the fsyncs for that checkpoint.  And in
that case the checkpointer would have observed the EIO for any writes
prior to the checkpoint, so it's OK to write that checkpoint; it's
only the next one that has an issue, and the fact that we're now
advertising that we are fsync()-ing again will prevent that one from
completing before we emit any necessary PANIC.
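
A hypothetical sketch of that interlock, with an invented shared-memory
layout and function names, purely to illustrate the observation rule:

#include <stdbool.h>
#include <unistd.h>

#define MAX_BACKENDS 128            /* stand-in for the real limit */

/* Stand-in for an array of per-backend flags that would live in shared
 * memory: set while that backend is performing an fsync. */
static volatile bool fsync_in_progress[MAX_BACKENDS];

/* Foreground process, around its fsync() call. */
static void
backend_fsync(int my_slot, int fd)
{
    fsync_in_progress[my_slot] = true;
    if (fsync(fd) != 0)
    {
        /* This is where the PANIC would happen, while the flag is still
         * set, so the checkpointer cannot log its checkpoint first. */
    }
    fsync_in_progress[my_slot] = false;
}

/* Checkpointer: before logging the checkpoint, observe every backend in
 * the not-fsyncing state at least once. */
static void
wait_for_foreground_fsyncs(void)
{
    for (int slot = 0; slot < MAX_BACKENDS; slot++)
    {
        while (fsync_in_progress[slot])
            ;                       /* in reality: sleep or wait on a latch */
    }
}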

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


