Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2019-09-01 Thread amul sul
Hi Fujita San,


Please find my comments inline below:

On Wed, Aug 28, 2019 at 3:52 PM Etsuro Fujita  wrote:

> On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita  wrote:
> >

[... skipped ..]

>
> About the attached:
>
> * The attached patch modified try_partitionwise_join() so that we call
> partition_bounds_equal() then partition_bounds_merge() (if necessary)
> to create the partition bounds for the join rel.  We don't support
> merging the partition bounds for the hash-partitioning case, so this
> makes the code that handles the hash-partitioning case in
> partition_bounds_merge() completely unnecessary, so I removed that
> entirely.
>

Yes, that makes sense.

On thinking further, it struck me to wonder why we can't join two
hash-partitioned tables that have the same modulus and partition key
specification, but where some of the partitions are missing from either
partitioned table.

For example, here is a smaller case where foo has two partitions and bar has
only one.

CREATE TABLE foo(a int) PARTITION BY HASH(a);
CREATE TABLE foo_p0 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 0);
CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 1);

CREATE TABLE bar(a int) PARTITION BY HASH(a);  <-- missing partition for remainder 1
CREATE TABLE bar_p0 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 0);

Explain:
postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a;
                                QUERY PLAN
---------------------------------------------------------------------------
 Merge Join  (cost=590.35..1578.47 rows=65025 width=8)
   Merge Cond: (p2.a = p1.a)
   ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
         Sort Key: p2.a
         ->  Seq Scan on bar_p0 p2  (cost=0.00..35.50 rows=2550 width=4)
   ->  Sort  (cost=410.57..423.32 rows=5100 width=4)
         Sort Key: p1.a
         ->  Append  (cost=0.00..96.50 rows=5100 width=4)
               ->  Seq Scan on foo_p0 p1  (cost=0.00..35.50 rows=2550 width=4)
               ->  Seq Scan on foo_p1 p1_1  (cost=0.00..35.50 rows=2550 width=4)
(10 rows)

The partition-wise join will be performed only if we fill the partition hole
in the joining table, i.e. by adding the missing partition to the bar table.

postgres=# CREATE TABLE bar_p1 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 1);
CREATE TABLE
postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a;
                                QUERY PLAN
---------------------------------------------------------------------------
 Append  (cost=359.57..2045.11 rows=65024 width=8)
   ->  Merge Join  (cost=359.57..860.00 rows=32512 width=8)
         Merge Cond: (p1.a = p2.a)
         ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
               Sort Key: p1.a
               ->  Seq Scan on foo_p0 p1  (cost=0.00..35.50 rows=2550 width=4)
         ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
               Sort Key: p2.a
               ->  Seq Scan on bar_p0 p2  (cost=0.00..35.50 rows=2550 width=4)
   ->  Merge Join  (cost=359.57..860.00 rows=32512 width=8)
         Merge Cond: (p1_1.a = p2_1.a)
         ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
               Sort Key: p1_1.a
               ->  Seq Scan on foo_p1 p1_1  (cost=0.00..35.50 rows=2550 width=4)
         ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
               Sort Key: p2_1.a
               ->  Seq Scan on bar_p1 p2_1  (cost=0.00..35.50 rows=2550 width=4)
(17 rows)

It would be nice if we could support this case, since we do allow partition
holes under the other partitioning schemes; but there wouldn't be much
objection if we don't add this support for now, since a hash-partitioned
table is less likely to have a hole, IMO.


> * I removed this assertion in partition_bounds_merge(), because I
> think this is covered by two assertions above this.
>
> +   Assert((*outer_parts == NIL || *inner_parts != NIL) &&
> +  (*inner_parts == NIL || *outer_parts != NIL));
>
> * (I forgot to mention this in a previous email, but) I removed this
> bit of generate_matching_part_pairs(), because we already have the
> same check in try_partitionwise_join(), so this bit would be redundant
> IIUC.
>

Looks good.


>
> * I added more comments.
>

Thanks.


> If there are no objections, I'll merge the attached with the base patch in
> [1].
>

The proposed enhancement in the patch is very good, and the patch is
reasonable enough to merge into the main patch.

Here are a few cosmetic fixes for this patch that I think are needed.  Feel
free to ignore any of them that don't make sense or are too obvious.

Note: the numbers on the left are the line numbers within the patch.

118 +   }
119 +
120 +   /*
121 +* Try to create the partition bounds along with join pairs.
122 +*/
123 +   if (boundinfo == NULL)
124 +   {

Can we add this block as the else part of the previous if condition that
checks for equal partition bounds?

133 +   

Re: Bug in GiST paring heap comparator

2019-09-01 Thread Andrey Borodin



> On 2 Sep 2019, at 9:54, Alexander Korotkov 
> wrote:
> 
> It appears to be related to the implementation of the comparison function in
> the pairing heap used as a priority queue for KNN.  It used plain float
> comparison, which doesn't handle Inf and NaN values well.  The attached
> patch replaces it with float8_cmp_internal().

Thanks! This patch fixes tests of my new GiST build :)

While the patch looks good to me, I want to add that there are a lot of <= and >
comparisons in gistproc.c, in this function:

static float8
computeDistance(bool isLeaf, BOX *box, Point *point)

Should we fix those too?  Or add a comment explaining why the current code is safe.

Thanks!

Best regards, Andrey Borodin.



Re: moonjelly vs tsearch

2019-09-01 Thread Fabien COELHO



News from Fabien's bleeding edge compiler zoo: apparently GCC 10 
20190831 thinks the tsearch test should produce different output than 
20190824 did.  It looks like the parsing of question marks changed...


Indeed, that looks pretty strange.


I can't wait for next week's installment.


I'll relaunch gcc trunk compilation before the end of the week.

--
Fabien.




Bug in GiST paring heap comparator

2019-09-01 Thread Alexander Korotkov
Hi!

Andrey Borodin pointed out to me that the results of some KNN-GiST tests are
obviously wrong and don't match the results of a sort node.

SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
        f1
------------------
 (10,10)
 (NaN,NaN)
 (0,0)
 (1e-300,-1e-300)
 (-3,4)
 (-10,0)
 (-5,-12)
 (5.1,34.5)

 (1e+300,Infinity)
(10 rows)

It appears to be related to the implementation of the comparison function in
the pairing heap used as a priority queue for KNN.  It used plain float
comparison, which doesn't handle Inf and NaN values well.  The attached
patch replaces it with float8_cmp_internal().

Also, note that with the patch, KNN results still don't fully match the
results of the sort node.

SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
        f1
------------------
 (0,0)
 (1e-300,-1e-300)
 (-3,4)
 (-10,0)
 (10,10)
 (-5,-12)
 (5.1,34.5)
 (1e+300,Infinity)

 (NaN,NaN)
(10 rows)

NULL and '(NaN,NaN)' are swapped.  That happens because we assume the
distance to NULL to be Inf, while float8_cmp_internal() assumes NaN to
be greater than Inf.  Even if we assumed the distance to NULL to be
NaN, that wouldn't guarantee that NULLs come last.  It looks like we
can handle this only by introducing an array of "distance is null" flags
in GISTSearchItem.  But is it worth it?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


gist-pairing-heap-cmp-fix-1.patch
Description: Binary data


Re: REL_12_STABLE crashing with assertion failure in ExtractReplicaIdentity

2019-09-01 Thread Amit Langote
On Mon, Sep 2, 2019 at 6:31 AM Tom Lane  wrote:
> I wrote:
> > As far as 4) goes, I think the code in ExtractReplicaIdentity is pretty
> > duff anyway, because it doesn't bother to check for the defined failure
> > return for RelationIdGetRelation.  But if we're concerned about the
> > cost of recalculating this stuff per-row, couldn't we cache it a little
> > better?  It should be safe to assume the set of index columns isn't
> > changing intra-query.
> > ... in fact, isn't all the infrastructure for that present already?
> > Why is this code looking directly at the index at all, rather than
> > using the relcache's rd_idattr bitmap?
>
> Here's a proposed patch along those lines.  It fixes Hadi's original
> crash case and passes check-world.

Agree that this patch would be a better solution for Hadi's report,
although I also agree that the situation with index locking for DELETE
isn't perfect.

> I'm a bit suspicious of the exclusion for idattrs being empty, but
> if I remove that, some of the contrib/test_decoding test results
> change.  Anybody want to comment on that?  If that's actually an
> expected situation, why is there an elog(DEBUG) in that path?

ISTM that the exclusion case may occur when the table's replica
identity is REPLICA_IDENTITY_DEFAULT and there is no primary key
defined, in which case nothing needs to get logged.

The elog(DEBUG) may just be a remnant from the days when this was
being developed.  I couldn't find any notes on it in the
archives [1], though.

Thanks,
Amit

[1] 
https://www.postgresql.org/message-id/flat/20131204155510.GO24801%40awork2.anarazel.de




pg_basebackup -F t fails when fsync spends more time than tcp_user_timeout

2019-09-01 Thread r.takahash...@fujitsu.com
Hi


pg_basebackup -F t fails when fsync takes more time than tcp_user_timeout in the
following environment.

[Environment]
Postgres 13dev (master branch)
Red Hat Enterprise Linux 7.4

[Error]
$ pg_basebackup -F t --progress --verbose -h  -D 
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/5A60 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_15647"
pg_basebackup: error: could not read COPY data: server closed the connection 
unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

[Analysis]
- pg_basebackup -F t creates a tar file and does fsync() for each tablespace.
  (Otherwise, -F p does fsync() only once at the end.)
- While doing fsync() for a tar file for one tablespace, wal sender sends the 
content of the next tablespace.
  When fsync() spends long time, the tcp socket of pg_basebackup returns "zero 
window" packets to wal sender.
  This means the tcp socket buffer of pg_basebackup is exhausted since 
pg_basebackup cannot receive during fsync().
- The socket of wal sender retries to send the packet, but resets connection 
after tcp_user_timeout.
  After wal sender resets connection, pg_basebackup cannot receive data and 
fails with above error.

[Solution]
I think fsync() for each tablespace is not necessary.
Like pg_basebackup -F p, I think fsync() is necessary only once at the end.


Could you give me any comment?


Regards,
Ryohei Takahashi





Re: safe to overload objectSubId for a type?

2019-09-01 Thread Tom Lane
Chapman Flack  writes:
> I don't mean "overload objectSubId" in any ObjectAddress that PG code
> would ever see. I am only thinking of a data structure of my own that
> is ObjectAddress-like and has all three components available all the
> time, and for an object that's a type, I would find it handy to stash
> a typmod there, and not have to carry those around separately.

If this is totally independent of ObjectAddress, why are you even
asking?  I assume that what you mean is you'd like these values to
bleed into ObjectAddresses or vice versa.

If we ever do make ObjectAddress.objectSubId mean something for types,
I'd expect --- based on the precedent of relation columns --- that it'd
specify a column number for a column of a composite type.  There are
fairly obvious use-cases for that, such as allowing a DROP of a column
type to not propagate to the whole composite type.  So I'd be pretty down
on allowing it to mean typmod instead.

regards, tom lane




Re: POC: Cleaning up orphaned files using undo logs

2019-09-01 Thread Thomas Munro
On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh  wrote:
> I'm getting the following assert failure while performing the recovery
> with the same.
> "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL",
> File: "undolog.c", Line: 997)"
>
> I found that we don't emit a WAL record when we update
> slot->meta.status to UNDO_LOG_STATUS_FULL. If we don't do that, after
> crash recovery, some new transaction may use that undo log, which is
> wrong, IMHO. Am I missing something?

Thanks, right, that status logging is wrong, will fix in next version.

-- 
Thomas Munro
https://enterprisedb.com




moonjelly vs tsearch

2019-09-01 Thread Thomas Munro
Hi,

News from Fabien's bleeding edge compiler zoo: apparently GCC 10
20190831 thinks the tsearch test should produce different output than
20190824 did.  It looks like the parsing of question marks changed...
I can't wait for next week's installment.

--
Thomas Munro
https://enterprisedb.com




Re: XPRS

2019-09-01 Thread Thomas Munro
On Sat, Aug 24, 2019 at 3:19 AM Tomas Vondra
 wrote:
> > Although “The Wei Hong Optimizer” was designed in the context of
> > Postgres, it became the standard approach for many of the parallel
> > query optimizers in industry."
>
> I assume this quote is from 30 years ago. I wonder if the claim is still
> true, on current hardware (including e.g. distributed databases).

The quote is from 2018, and appears in the article I linked (it's a
chapter from the book Making Databases Work: The Wisdom of Michael
Stonebraker), but I'm not sure which systems it's referring to.

Speculation:  Many other systems have what we call parallel-oblivious
operators only, and then insert various exchange operators to make a
parallel plan.  That is, they don't necessarily have a direct
equivalent of our "Parallel Sequential Scan", "Parallel Index Scan",
"Parallel Foreign Scan": they just use their regular scans, possibly
with the addition of some kind of "Parallel Scatter" node (that's my
made up name, it's the opposite of Gather, called various things like
"page supplier" or "block iterator") or "Parallel Repartition"
inserted in the right places.  Perhaps they create a serial plan
first, and then try to parallelise it by inserting various parallel
operators and then recomputing the costs?  Rather than considering the
separate paths in the first phase of the optimiser, as we do.  The
cases where Hong's best-parallel-plan hypothesis isn't true for us now
might go away if we had Parallel Repartition, so that each 'stream'
would be the complete set of tuples for some known partition.

To be clear, I'm not suggesting we do that necessarily, just pointing
out some interesting claims about ancient POSTGRES wisdom, in a highly
speculative water cooler thread.   Actually, this isn't the first time
it's occurred to me that elements of our design were falling out of
the order that we chose to implement things in.  Another example is
the "Shared Hash" that I had in an early version of my work on
Parallel Hash Join, where just one process would run a parallel-safe
but non-parallel-oblivious plan to build a shared hash table while
other workers twiddled their thumbs; I dropped it because our cost
model has no penalty for running N copies of the same plan rather than
just one so there was no way to pick that plan, and that's because we
don't have a cost model like Hong's that considers resource usage too.
Another more speculative observation: maybe no-partition/shared
Parallel Hash Join is only obvious if you already have the general
concept of parallel-aware executor nodes.  AFAIK Robert and Amit
invented those to be able to coordinate parallel scans between
processes, where thread-based systems might be able to share a single
scan state somehow under a Scatter-like operator.  If you had threads,
you might not need that concept that allows arbitrary executor nodes
to communicate with each other between workers, and then it might be
more obvious and natural to use repartitioning for parallelising hash
joins.

> FWIW I think we'll have to do something about resource acquisition, sooner
> or later. It was always quite annoying that we don't really consider
> memory consumption of the query as a whole during planning, and parallel
> query made it a bit more painful.

Agreed.

Here's an approach I have been wondering about to cap total executor
memory usage, which is a much more down-to-Earth idea than any of the
above space cadet planner problems.  Let's start at the other end of
the problem, by introducing admission control and memory quotas.  That
is, keep using work_mem with its current per-node-copy meaning at
planning time, for now, and then:

1.  Compute the peak amount of memory each plan thinks it will need.
Initially that could be done by summing estimates from all nodes
and considering workers.  A later refinement could deal with nodes
that give back memory early, if we get around to doing that.  The
estimate could be shown by EXPLAIN.  (Some details to work out: worst
case vs expected etc.)

2.  Introduce a new GUC global_work_mem, which caps the combined memory
estimates of the plans that are allowed to run concurrently.  Introduce
a shared-memory counter of currently allocated quota.

3.  Introduce a new GUC session_work_mem, which is the amount of quota
that every session tries to acquire when it connects or perhaps first
runs a query, and that it won't give back until the end of the
session.  Or perhaps they acquire less than that if they need less,
but that's the amount they never give back once they've got that much.
The idea is to allow queries with estimates under that limit, for
example high frequency OLTP queries, to avoid any extra communication
overhead from this scheme.

4.  To run queries that have estimates higher than the session's
current allocated quota, the session must acquire more quota for the
duration of the query.  If it can't be acquired right now without
exceeding global_work_mem, it has to 

safe to overload objectSubId for a type?

2019-09-01 Thread Chapman Flack
Hi,

I don't mean "overload objectSubId" in any ObjectAddress that PG code
would ever see. I am only thinking of a data structure of my own that
is ObjectAddress-like and has all three components available all the
time, and for an object that's a type, I would find it handy to stash
a typmod there, and not have to carry those around separately.

As far as I can tell, most objects (maybe all, except attributes
and attribute defaults?) have the objectSubId unused. But I would not
want to paint myself into a corner if anyone anticipates making
objectSubId mean something for type objects somewhere down the road.

Thanks,
-Chap




Re: [Patch] Add a reset_computed_values function in pg_stat_statements

2019-09-01 Thread Euler Taveira
On Sun, Sep 1, 2019 at 06:04, Pierre Ducroquet
 wrote:
>
> Hello
>
> pg_stat_statements is a great tool to track performance issues in live
> databases, especially when adding interfaces like PoWA on top of it.
> But so far, tools like PoWA cannot track the min_time, max_time, mean_time
> and sum_var_time of queries: these statistics are cumulated over time, so
> fetching points in time would be of little to no use, especially when looking
> at the impact of a DDL change.
> This patch thus introduces a simple pg_stat_statements_reset_computed_values
> function that resets the computed statistics, leaving all other information
> intact, thus allowing the aforementioned scenario.
>
Pierre, I don't understand why you want to reset only part of the statement
statistics. If you want the minimal query time every week, reset all
statistics of that statement (v12 or later). Postgres provides
histogram metrics (min, max, mean, stddev). AFAICS you want a metric
type called a timer (a combination of histogram and meter): for example,
calls, sum, min, max, mean, and standard deviation calculated each
hour. If I were to add a new metric to pg_stat_statements, it would be
last_time. Comparing the histogram metrics with the last statement time
is useful to check whether a certain optimization was effective.

Talking about your patch, the math is wrong. If you don't reset
total_time, all computed values will be incorrect. You are changing
the actual meaning of those metrics, and I think that is unacceptable
because it will break applications. Instead, you should provide (i)
new counters or (ii) a GUC to control this behavior. It is a
trade-off between incompatibility and too many metrics, though I
wouldn't like to break compatibility (as I said, you can always reset
all statement statistics). Why don't you provide
pg_stat_statements_reset_computed_values(userid Oid, dbid Oid, queryid
bigint)? You also forgot to provide documentation for the new function,
and you should revoke pg_stat_statements_reset_computed_values from PUBLIC.

Regards,


-- 
   Euler Taveira   Timbira -
http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento




Re: refactoring - share str2*int64 functions

2019-09-01 Thread Michael Paquier
On Sun, Sep 01, 2019 at 08:07:06PM +0200, Fabien COELHO wrote:
> They allow to quickly do performance tests, for me it is useful to keep it
> around, but you are the committer, you do as you feel.

If there are more voices for having that in core, we could consider
it.  For now I have added that into my own plugin repository with all
the functions discussed on this thread:
https://github.com/michaelpq/pg_plugins/

> The signed overflows are trickier still; I have not paid attention to the
> fallback code. I agree that it is better left untouched for now.

Thanks.

> Hmmm. I really did manual tests. Attached is a psql script replicating them.
> 
>   # with builtin overflow detection
>   sh> psql < oc.sql
>   NOTICE:  int 16 mul: 00:00:02.747269 # slow
>   NOTICE:  int 16 add: 00:00:01.83281
>   NOTICE:  int 16 sub: 00:00:01.8501
>   NOTICE:  uint 16 mul: 00:00:03.68362 # slower
>   NOTICE:  uint 16 add: 00:00:01.835294
>   NOTICE:  uint 16 sub: 00:00:02.136895 # slow
>   NOTICE:  int 32 mul: 00:00:01.828065
>   NOTICE:  int 32 add: 00:00:01.840269
>   NOTICE:  int 32 sub: 00:00:01.843557
>   NOTICE:  uint 32 mul: 00:00:02.447052 # slow
>   NOTICE:  uint 32 add: 00:00:01.849899
>   NOTICE:  uint 32 sub: 00:00:01.840773
>   NOTICE:  int 64 mul: 00:00:01.839051
>   NOTICE:  int 64 add: 00:00:01.839065
>   NOTICE:  int 64 sub: 00:00:01.838599
>   NOTICE:  uint 64 mul: 00:00:02.161346 # slow
>   NOTICE:  uint 64 add: 00:00:01.839404
>   NOTICE:  uint 64 sub: 00:00:01.838549

Actually that's much faster than a single core on my debian SID with
gcc 9.2.1.

Here are more results from me:
            Built-in         undef Built-in
int16 mul  00:00:05.425207  00:00:05.634417
int16 add  00:00:05.389738  00:00:06.38885 
int16 sub  00:00:05.446529  00:00:06.39569 
uint16 mul 00:00:05.499066  00:00:05.541617
uint16 add 00:00:05.281622  00:00:06.252511
uint16 sub 00:00:05.366424  00:00:05.457148
int32 mul  00:00:05.121209  00:00:06.154989
int32 add  00:00:05.228722  00:00:06.344721
int32 sub  00:00:05.237594  00:00:06.323543
uint32 mul 00:00:05.126339  00:00:05.921738
uint32 add 00:00:05.212085  00:00:06.183031
uint32 sub 00:00:05.201884  00:00:05.363667
int64 mul  00:00:05.136129  00:00:06.148101
int64 add  00:00:05.200201  00:00:06.329091
int64 sub  00:00:05.218028  00:00:06.313114
uint64 mul 00:00:05.444733  00:00:08.089742
uint64 add 00:00:05.603978  00:00:06.377753
uint64 sub 00:00:05.544838  00:00:05.490873

This part has been committed, now let's move to the next parts of your
patch.
--
Michael


signature.asc
Description: PGP signature


Re: row filtering for logical replication

2019-09-01 Thread Euler Taveira
On Sun, Sep 1, 2019 at 06:09, Erik Rijkers  wrote:
>
> The first 4 of these apply without error, but I can't get 0005 to apply.
> This is what I use:
>
Erik, I generated a new patch set with the patience diff algorithm. It
seems to apply cleanly.


-- 
   Euler Taveira   Timbira -
http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
From c07af2f00b7a72ba9660e389bb1392fc9e5d2688 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Wed, 24 Jan 2018 17:01:31 -0200
Subject: [PATCH 4/8] Rename a WHERE node

A WHERE clause will be used for row filtering in logical replication. We
already have a similar node: 'WHERE (condition here)'. Let's rename the
node to a generic name and use it for row filtering too.
---
 src/backend/parser/gram.y | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c97bb367f8..1de8f56794 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -476,7 +476,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	def_arg columnElem where_clause where_or_current_clause
 a_expr b_expr c_expr AexprConst indirection_el opt_slice_bound
 columnref in_expr having_clause func_table xmltable array_expr
-ExclusionWhereClause operator_def_arg
+OptWhereClause operator_def_arg
 %type 	rowsfrom_item rowsfrom_list opt_col_def_list
 %type  opt_ordinality
 %type 	ExclusionConstraintList ExclusionConstraintElem
@@ -3710,7 +3710,7 @@ ConstraintElem:
 	$$ = (Node *)n;
 }
 			| EXCLUDE access_method_clause '(' ExclusionConstraintList ')'
-opt_c_include opt_definition OptConsTableSpace  ExclusionWhereClause
+opt_c_include opt_definition OptConsTableSpace  OptWhereClause
 ConstraintAttributeSpec
 {
 	Constraint *n = makeNode(Constraint);
@@ -3812,7 +3812,7 @@ ExclusionConstraintElem: index_elem WITH any_operator
 			}
 		;
 
-ExclusionWhereClause:
+OptWhereClause:
 			WHERE '(' a_expr ')'	{ $$ = $3; }
 			| /*EMPTY*/{ $$ = NULL; }
 		;
-- 
2.11.0

From 367631ac4ba1e41170d59d39693e2eaf7c406621 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Tue, 27 Feb 2018 02:21:03 +
Subject: [PATCH 3/8] Refactor function create_estate_for_relation

Relation localrel is the only LogicalRepRelMapEntry structure member
that is useful for create_estate_for_relation.
---
 src/backend/replication/logical/worker.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 11e6331f49..d9952c8b7e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -173,7 +173,7 @@ ensure_transaction(void)
  * This is based on similar code in copy.c
  */
 static EState *
-create_estate_for_relation(LogicalRepRelMapEntry *rel)
+create_estate_for_relation(Relation rel)
 {
 	EState	   *estate;
 	ResultRelInfo *resultRelInfo;
@@ -183,13 +183,13 @@ create_estate_for_relation(LogicalRepRelMapEntry *rel)
 
 	rte = makeNode(RangeTblEntry);
 	rte->rtekind = RTE_RELATION;
-	rte->relid = RelationGetRelid(rel->localrel);
-	rte->relkind = rel->localrel->rd_rel->relkind;
+	rte->relid = RelationGetRelid(rel);
+	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = AccessShareLock;
 	ExecInitRangeTable(estate, list_make1(rte));
 
 	resultRelInfo = makeNode(ResultRelInfo);
-	InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0);
+	InitResultRelInfo(resultRelInfo, rel, 1, NULL, 0);
 
 	estate->es_result_relations = resultRelInfo;
 	estate->es_num_result_relations = 1;
@@ -589,7 +589,7 @@ apply_handle_insert(StringInfo s)
 	}
 
 	/* Initialize the executor state. */
-	estate = create_estate_for_relation(rel);
+	estate = create_estate_for_relation(rel->localrel);
 	remoteslot = ExecInitExtraTupleSlot(estate,
 		RelationGetDescr(rel->localrel),
 		);
@@ -696,7 +696,7 @@ apply_handle_update(StringInfo s)
 	check_relation_updatable(rel);
 
 	/* Initialize the executor state. */
-	estate = create_estate_for_relation(rel);
+	estate = create_estate_for_relation(rel->localrel);
 	remoteslot = ExecInitExtraTupleSlot(estate,
 		RelationGetDescr(rel->localrel),
 		);
@@ -815,7 +815,7 @@ apply_handle_delete(StringInfo s)
 	check_relation_updatable(rel);
 
 	/* Initialize the executor state. */
-	estate = create_estate_for_relation(rel);
+	estate = create_estate_for_relation(rel->localrel);
 	remoteslot = ExecInitExtraTupleSlot(estate,
 		RelationGetDescr(rel->localrel),
 		);
-- 
2.11.0

From e83de1cab559f4ca8f9a75356e220356814cd243 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Fri, 9 Mar 2018 18:39:22 +
Subject: [PATCH 1/8] Remove unused atttypmod column from initial table
 synchronization

 Since commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920, atttypmod was
 added but not 

RE: SIGQUIT on archiver child processes maybe not such a hot idea?

2019-09-01 Thread Tsunakawa, Takayuki
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> After investigation, the mechanism that's causing that is that the
> src/test/recovery/t/010_logical_decoding_timelines.pl test shuts
> down its replica server with a mode-immediate stop, which causes
> that postmaster to shut down all its children with SIGQUIT, and
> in particular that signal propagates to a "cp" command that the
> archiver process is executing.  The "cp" is unsurprisingly running
> with default SIGQUIT handling, which per the signal man page
> includes dumping core.

We experienced this (a core dump in the data directory produced by an archive
command) years ago.  Related to this, the example of using cp in the PostgreSQL
manual is misleading, because cp doesn't reliably persist the WAL archive file.


> This makes me wonder whether we shouldn't be using some other signal
> to shut down archiver subprocesses.  It's not real cool if we're
> spewing cores all over the place.  Admittedly, production servers
> are likely running with "ulimit -c 0" on most modern platforms,
> so this might not be a huge problem in the field; but accumulation
> of core files could be a problem anywhere that's configured to allow
> server core dumps.

We enable core dumps in production to help investigation, just in case.


> Ideally, perhaps, we'd be using SIGINT not SIGQUIT to shut down
> non-Postgres child processes.  But redesigning the system's signal
> handling to make that possible seems like a bit of a mess.
> 
> Thoughts?

We're using a shell script, plus a command that's called from the shell script.
That is:

archive_command = 'call some_shell_script.sh ...'

[some_shell_script.sh]
ulimit -c 0
trap SIGQUIT to just exit on receipt of the signal
call some_command to copy the file

some_command also catches SIGQUIT and just exits.  It copies and syncs the file.

I proposed something along these lines as below, but I couldn't respond to Peter's
review comments due to other tasks.  Does anyone think it's worth resuming this?

https://www.postgresql.org/message-id/7E37040CF3804EA5B018D7A022822984@maumau


Regards
Takayuki Tsunakawa







Re: REL_12_STABLE crashing with assertion failure in ExtractReplicaIdentity

2019-09-01 Thread Tom Lane
I wrote:
> As far as 4) goes, I think the code in ExtractReplicaIdentity is pretty
> duff anyway, because it doesn't bother to check for the defined failure
> return for RelationIdGetRelation.  But if we're concerned about the
> cost of recalculating this stuff per-row, couldn't we cache it a little
> better?  It should be safe to assume the set of index columns isn't
> changing intra-query.
> ... in fact, isn't all the infrastructure for that present already?
> Why is this code looking directly at the index at all, rather than
> using the relcache's rd_idattr bitmap?

Here's a proposed patch along those lines.  It fixes Hadi's original
crash case and passes check-world.

I'm a bit suspicious of the exclusion for idattrs being empty, but
if I remove that, some of the contrib/test_decoding test results
change.  Anybody want to comment on that?  If that's actually an
expected situation, why is there an elog(DEBUG) in that path?

regards, tom lane

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index cb811d3..8a3c7dc 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7594,18 +7594,20 @@ log_heap_new_cid(Relation relation, HeapTuple tup)
  *
  * Returns NULL if there's no need to log an identity or if there's no suitable
  * key in the Relation relation.
+ *
+ * *copy is set to true if the returned tuple is a modified copy rather than
+ * the same tuple that was passed in.
  */
 static HeapTuple
-ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed, bool *copy)
+ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed,
+	   bool *copy)
 {
 	TupleDesc	desc = RelationGetDescr(relation);
-	Oid			replidindex;
-	Relation	idx_rel;
 	char		replident = relation->rd_rel->relreplident;
-	HeapTuple	key_tuple = NULL;
+	Bitmapset  *idattrs;
+	HeapTuple	key_tuple;
 	bool		nulls[MaxHeapAttributeNumber];
 	Datum		values[MaxHeapAttributeNumber];
-	int			natt;
 
 	*copy = false;
 
@@ -7624,7 +7626,7 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed, bool *
 		if (HeapTupleHasExternal(tp))
 		{
 			*copy = true;
-			tp = toast_flatten_tuple(tp, RelationGetDescr(relation));
+			tp = toast_flatten_tuple(tp, desc);
 		}
 		return tp;
 	}
@@ -7633,41 +7635,37 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed, bool *
 	if (!key_changed)
 		return NULL;
 
-	/* find the replica identity index */
-	replidindex = RelationGetReplicaIndex(relation);
-	if (!OidIsValid(replidindex))
+	/* find out the replica identity columns */
+	idattrs = RelationGetIndexAttrBitmap(relation,
+		 INDEX_ATTR_BITMAP_IDENTITY_KEY);
+
+	if (bms_is_empty(idattrs))
 	{
 		elog(DEBUG4, "could not find configured replica identity for table \"%s\"",
 			 RelationGetRelationName(relation));
 		return NULL;
 	}
 
-	idx_rel = RelationIdGetRelation(replidindex);
-
-	Assert(CheckRelationLockedByMe(idx_rel, AccessShareLock, true));
-
-	/* deform tuple, so we have fast access to columns */
-	heap_deform_tuple(tp, desc, values, nulls);
-
-	/* set all columns to NULL, regardless of whether they actually are */
-	memset(nulls, 1, sizeof(nulls));
-
 	/*
-	 * Now set all columns contained in the index to NOT NULL, they cannot
-	 * currently be NULL.
+	 * Construct a new tuple containing only the replica identity columns,
+	 * with nulls elsewhere.  While we're at it, assert that the replica
+	 * identity columns aren't null.
 	 */
-	for (natt = 0; natt < IndexRelationGetNumberOfKeyAttributes(idx_rel); natt++)
-	{
-		int			attno = idx_rel->rd_index->indkey.values[natt];
+	heap_deform_tuple(tp, desc, values, nulls);
 
-		if (attno < 0)
-			elog(ERROR, "system column in index");
-		nulls[attno - 1] = false;
+	for (int i = 0; i < desc->natts; i++)
+	{
+		if (bms_is_member(i + 1 - FirstLowInvalidHeapAttributeNumber,
+		  idattrs))
+			Assert(!nulls[i]);
+		else
+			nulls[i] = true;
 	}
 
 	key_tuple = heap_form_tuple(desc, values, nulls);
 	*copy = true;
-	RelationClose(idx_rel);
+
+	bms_free(idattrs);
 
 	/*
 	 * If the tuple, which by here only contains indexed columns, still has
@@ -7680,7 +7678,7 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed, bool *
 	{
 		HeapTuple	oldtup = key_tuple;
 
-		key_tuple = toast_flatten_tuple(oldtup, RelationGetDescr(relation));
+		key_tuple = toast_flatten_tuple(oldtup, desc);
 		heap_freetuple(oldtup);
 	}
 


Re: REL_12_STABLE crashing with assertion failure in ExtractReplicaIdentity

2019-09-01 Thread Tom Lane
Andres Freund  writes:
> On 2019-08-17 01:43:45 -0400, Tom Lane wrote:
>> Yeah ... see the discussion leading up to 9c703c169,
>> https://www.postgresql.org/message-id/flat/19465.1541636036%40sss.pgh.pa.us
>> We didn't pull the trigger on removing the CMD_DELETE exception here,
>> but I think the handwriting has been on the wall for some time.

> ISTM there's a few different options here:

> 1a) We build all index infos, unconditionally. As argued in the thread
> you reference, future tableams may eventually require that anyway,
> by doing more proactive index maintenance somehow. Currently there's
> however no support for such AMs via tableam (mostly because I wasn't
> sure how exactly that'd look, and none of the already in-development
> AMs needed it).
> 2a) We separate acquisition of index locks from ExecOpenIndices(), and
> acquire index locks even for CMD_DELETE. Do so either during
> executor startup, or as part of AcquireExecutorLocks() (the latter
> on the basis that parsing/planning would have required the locks
> already).
> There's also corresponding *b) options, where we only do additional work
> for CMD_DELETE if wal_level = logical, and the table has a replica
> identity requiring use of the index during deletions. But I think
> that's clearly enough a bad idea that we can just dismiss it out of
> hand.
> 3)  Remove the CheckRelationLockedByMe() assert from
> ExtractReplicaIdentity(), at least for 12. I don't think this is an
> all that convincing option, but it'd reduce churn relatively late in
> beta.
> 4)  Add an index_open(RowExclusiveLock) to ExtractReplicaIdentity(). That
> seems very unconvincing to me, because we'd do so for every row.

As far as 4) goes, I think the code in ExtractReplicaIdentity is pretty
duff anyway, because it doesn't bother to check for the defined failure
return for RelationIdGetRelation.  But if we're concerned about the
cost of recalculating this stuff per-row, couldn't we cache it a little
better?  It should be safe to assume the set of index columns isn't
changing intra-query.

... in fact, isn't all the infrastructure for that present already?
Why is this code looking directly at the index at all, rather than
using the relcache's rd_idattr bitmap?

I suspect we'll have to do 1a) eventually anyway, but this particular
problem seems like it has a better solution.  Will try to produce a
patch in a bit.

regards, tom lane




Re: Proposal: roll pg_stat_statements into core

2019-09-01 Thread Pavel Stehule
ne 1. 9. 2019 v 20:48 odesílatel David Fetter  napsal:

> On Sun, Sep 01, 2019 at 08:12:15PM +0200, Pavel Stehule wrote:
> > Hi
> >
> > ne 1. 9. 2019 v 20:00 odesílatel David Fetter  napsal:
> >
> > > Folks,
> > >
> > > I'd like to $Subject, on by default, with a switch to turn it off for
> > > those really at the outer edges of performance. Some reasons include:
> > >
> > > - It's broadly useful.
> > > - Right now, the barrier for turning it on is quite high. In addition
> > >   to being non-core, which is already a pretty high barrier at a lot
> > >   of organizations, it requires a shared_preload_libraries setting,
> > >   which is pretty close to untenable in a lot of use cases.
> > > - The overhead for most use cases is low compared to the benefit.
> > >
> >
> > I don't have a strong opinion about it. pg_stat_statements is a really useful
> > extension; on the other hand
> >
> > 1. the API is not stabilized yet - there are some patches related to this
> > extension if I remember correctly
>
> You do.
>
> > 2. there are no numbers on what the overhead is
>
> What numbers would you suggest collecting?  We could get some by
> running them with the extension loaded and not, although this goes to
> your next proposal.
>

I would like to see some benchmarks (pgbench read-only, with a higher number of
connections)


> > Maybe a better solution would be some new API for shared memory that doesn't
> > need to use shared_preload_libraries.
>
> What would such an API look like?
>

the possibility to allocate large blocks of shared memory without the
necessity of doing it at startup time


> > It can be useful for a lot of other monitoring extensions, profilers,
> > and debuggers.
>
> It would indeed.
>
> Do you see this new API as a separate project, and if so, of
> approximately what size?  Are we talking about something about the
> size of DSM? Of JIT?
>

Probably about the size of DSM, plus an API for how other processes can connect
to some blocks of DSM. I remember a similar request related to shared full-text
dictionaries.

It can be part of some project like "enhancing pg_stat_statements - making it
possible to load this extension without a restart".


>
> Best,
> David.
> --
> David Fetter  http://fetter.org/
> Phone: +1 415 235 3778
>
> Remember to vote!
> Consider donating to Postgres: http://www.postgresql.org/about/donate
>


Re: Proposal: roll pg_stat_statements into core

2019-09-01 Thread Tom Lane
David Fetter  writes:
> - It's broadly useful.

Maybe.  Whether it can be argued that it's so broadly useful as
to justify being on-by-default is not clear.

> - Right now, the barrier for turning it on is quite high. In addition
>   to being non-core, which is already a pretty high barrier at a lot
>   of organizations, it requires a shared_preload_libraries setting,
>   which is pretty close to untenable in a lot of use cases.

That argument seems pretty weak.  It's part of contrib and therefore
maintained by the same people as the "core" code.  Also, I don't buy
for a minute that people who would need it don't also need a bunch of
other changes in default GUC settings (shared_buffers etc).

> - The overhead for most use cases is low compared to the benefit.

Please do not expect that we're going to accept such assertions
unsupported by evidence.  (As a very quick-n-dirty test, I tried
"pgbench -S" and got somewhere around 4% TPS degradation with
pg_stat_statements loaded.  That doesn't seem negligible.)

I think also that we would need to consider the migration penalty
for people who already have the contrib version installed.  To
judge by previous cases (I'm thinking tsearch2) that could be
pretty painful.  Admittedly, tsearch2 might not be a good comparison,
but points as simple as "the views/functions won't be in the same
schema as before" are enough to cause trouble.

regards, tom lane




Re: Proposal: roll pg_stat_statements into core

2019-09-01 Thread David Fetter
On Sun, Sep 01, 2019 at 08:12:15PM +0200, Pavel Stehule wrote:
> Hi
> 
> ne 1. 9. 2019 v 20:00 odesílatel David Fetter  napsal:
> 
> > Folks,
> >
> > I'd like to $Subject, on by default, with a switch to turn it off for
> > those really at the outer edges of performance. Some reasons include:
> >
> > - It's broadly useful.
> > - Right now, the barrier for turning it on is quite high. In addition
> >   to being non-core, which is already a pretty high barrier at a lot
> >   of organizations, it requires a shared_preload_libraries setting,
> >   which is pretty close to untenable in a lot of use cases.
> > - The overhead for most use cases is low compared to the benefit.
> >
> 
> I don't have a strong opinion about it. pg_stat_statements is a really useful
> extension; on the other hand
> 
> 1. the API is not stabilized yet - there are some patches related to this
> extension if I remember correctly

You do.

> 2. there are no numbers on what the overhead is

What numbers would you suggest collecting?  We could get some by
running them with the extension loaded and not, although this goes to
your next proposal.

> Maybe a better solution would be some new API for shared memory that doesn't
> need to use shared_preload_libraries.

What would such an API look like?

> It can be useful for a lot of other monitoring extensions, profilers,
> and debuggers.

It would indeed.

Do you see this new API as a separate project, and if so, of
approximately what size?  Are we talking about something about the
size of DSM? Of JIT?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Proposal: roll pg_stat_statements into core

2019-09-01 Thread Pavel Stehule
Hi

ne 1. 9. 2019 v 20:00 odesílatel David Fetter  napsal:

> Folks,
>
> I'd like to $Subject, on by default, with a switch to turn it off for
> those really at the outer edges of performance. Some reasons include:
>
> - It's broadly useful.
> - Right now, the barrier for turning it on is quite high. In addition
>   to being non-core, which is already a pretty high barrier at a lot
>   of organizations, it requires a shared_preload_libraries setting,
>   which is pretty close to untenable in a lot of use cases.
> - The overhead for most use cases is low compared to the benefit.
>

I don't have a strong opinion about it. pg_stat_statements is a really useful
extension; on the other hand

1. the API is not stabilized yet - there are some patches related to this
extension if I remember correctly
2. there are no numbers on what the overhead is

Maybe a better solution would be some new API for shared memory that doesn't
need to use shared_preload_libraries.

It can be useful for a lot of other monitoring extensions, profilers,
and debuggers.

Regards

Pavel


> Before I go do the patch, I'd like to see whether there's anything
> like a consensus as to the what and how of doing this.
>
> What say?
>
> Best,
> David.
> --
> David Fetter  http://fetter.org/
> Phone: +1 415 235 3778
>
> Remember to vote!
> Consider donating to Postgres: http://www.postgresql.org/about/donate
>
>
>


Re: refactoring - share str2*int64 functions

2019-09-01 Thread Fabien COELHO


Michaël,


I would put back unlikely() on overflow tests, as these are indeed unlikely
to occur, it may help some compilers, and it cannot be harmful. It also
helps the code reader know that these paths are not expected to be taken
often.


Hm.  I don't agree about that part, and the original signed portions
don't do that.  I think that we should let the callers of the routines
decide if a problem is unlikely to happen or not as we do now.


Hmmm. Maybe inlining propagates them, but otherwise they make sense from a
compiler's perspective.


I am still not convinced that this module is worth the extra cycles to 
justify its existence though.


It allows performance tests to be done quickly; for me it is useful to keep it
around, but you are the committer, so do as you see fit.



[...]
There is no doubt that dividing 64-bit integers is a very bad idea, at
least on my architecture!


That's surprising.  I cannot reproduce that.


It seems to me that somehow you can: you have a 5 to 18 second drop 
below. There are actual reasons why some processors are more expensive 
than others; it is not just marketing :-)



3-1) mul:
- built-in: 5.1s
- fallback (with uint128): 8.0s
- fallback (without uint128): 18.1s



So, the built-in option is always faster, and keeping the int128 path
if available for the multiplication makes sense, but not for the
subtraction and the addition.


Yep.

I am wondering if we should review further the signed part for add and 
sub, but I'd rather not touch it in this patch.


The signed overflows are even trickier; I have not paid attention to the 
fallback code. I agree that it is better left untouched for now.



If you have done any changes on my previous patch, or if you have a
script to share I could use to attempt to reproduce your results, I
would be happy to do so.


Hmmm. I really did manual tests. Attached is a psql script replicating them.

  # with builtin overflow detection
  sh> psql < oc.sql
  NOTICE:  int 16 mul: 00:00:02.747269 # slow
  NOTICE:  int 16 add: 00:00:01.83281
  NOTICE:  int 16 sub: 00:00:01.8501
  NOTICE:  uint 16 mul: 00:00:03.68362 # slower
  NOTICE:  uint 16 add: 00:00:01.835294
  NOTICE:  uint 16 sub: 00:00:02.136895 # slow
  NOTICE:  int 32 mul: 00:00:01.828065
  NOTICE:  int 32 add: 00:00:01.840269
  NOTICE:  int 32 sub: 00:00:01.843557
  NOTICE:  uint 32 mul: 00:00:02.447052 # slow
  NOTICE:  uint 32 add: 00:00:01.849899
  NOTICE:  uint 32 sub: 00:00:01.840773
  NOTICE:  int 64 mul: 00:00:01.839051
  NOTICE:  int 64 add: 00:00:01.839065
  NOTICE:  int 64 sub: 00:00:01.838599
  NOTICE:  uint 64 mul: 00:00:02.161346 # slow
  NOTICE:  uint 64 add: 00:00:01.839404
  NOTICE:  uint 64 sub: 00:00:01.838549
  DO


So, do you have more comments?


No other comments.

--
Fabien.

oc.sql
Description: application/sql


Proposal: roll pg_stat_statements into core

2019-09-01 Thread David Fetter
Folks,

I'd like to $Subject, on by default, with a switch to turn it off for
those really at the outer edges of performance. Some reasons include:

- It's broadly useful.
- Right now, the barrier for turning it on is quite high. In addition
  to being non-core, which is already a pretty high barrier at a lot
  of organizations, it requires a shared_preload_libraries setting,
  which is pretty close to untenable in a lot of use cases.
- The overhead for most use cases is low compared to the benefit.

Before I go do the patch, I'd like to see whether there's anything
like a consensus as to the what and how of doing this.

What say?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Yet another fast GiST build

2019-09-01 Thread Alexander Korotkov
On Fri, Aug 30, 2019 at 6:28 AM Peter Geoghegan  wrote:
> On Thu, Aug 29, 2019 at 8:22 PM Alexander Korotkov
>  wrote:
> > Alternatively you can encode size in Z-value.  But this increases
> > dimensionality of space and decreases efficiency of join.  Also,
> > spatial join can be made using two indexes, even just current GiST
> > without Z-values.  We've prototyped that, see [1].
>
> I'm pretty sure that spatial joins generally need two spatial indexes
(usually R-Trees). There seems to have been quite a lot of research on
it in the 1990s.

Sure, our prototype was an implementation of one such paper.  My
point is that the advantages of a Z-value-ordered GiST for spatial joins are
not yet clear to me, except for faster build and smaller index, which
are general advantages.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Yet another fast GiST build

2019-09-01 Thread Alexander Korotkov
On Fri, Aug 30, 2019 at 2:44 PM Andrey Borodin  wrote:
>
> On 30 Aug 2019, at 3:47, Alexander Korotkov
> wrote:
>
> 1) Binary search in non-leaf pages instead of probing each key is much faster.
>
>
> That's a neat idea, but key union breaks ordering, even for Z-order:
> for two sets of tuples X and Y,
> Xi < Yo for all i, o in N
> does not guarantee union(X) < union(Y).
>
> For example consider this z-ordered keyspace (picture attached)
>
> union(5, 9) is z-order-smaller than union(4,4)
>
> I'm not even sure we can use sorted search for choosing subtree for insertion.


Sorry, I didn't explain my proposal in enough detail.  I didn't mean
the B-tree separator keys would be the same as the union key (MBR).  I mean
a B-tree on Z-values, which maintains a union key in addition to separator
keys.  So, you select the downlink to insert into using separator Z-values and
then also extend the union key (if needed).  It's probably still not
enough detail yet.  I'll try to spend more time on a more detailed
description later.

>
> How do you think, should I supply GiST-build patch with docs and tests and 
> add it to CF? Or do we need more design discussion before?


+1 for adding to CF.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Write visibility map during CLUSTER/VACUUM FULL

2019-09-01 Thread Alexander Korotkov
On Sun, Sep 1, 2019 at 11:07 AM Alexander Korotkov
 wrote:
> This patch has to implement its own check of whether a tuple is all-visible.
> But it appears to be possible to simplify this check by assuming that all
> tuples have already passed HeapTupleSatisfiesVacuum(), which sets hint bits.

Forgot to check tuple xmin against oldest xmin.  Fixed.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-write-vm-during-cluster-2.patch
Description: Binary data


Re: range_agg

2019-09-01 Thread Paul A Jungwirth
> > Btw I have working multirange_{send,recv,in,out} now. . . .

Just about all the other operators are done too, but I wonder what
symbols people like for union and minus? Range uses + for union. I
have working code and tests that adds this:

r + mr = mr
mr + r = mr
mr + mr = mr

But I would like to use a different symbol instead, like ++, so I can
have all four:

r ++ r = mr
r ++ mr = mr
mr ++ r = mr
mr ++ mr = mr

(The existing r + r operator throws an error if the inputs have a gap.)

The trouble is that ++ isn't allowed. (Neither is --.) From
https://www.postgresql.org/docs/11/sql-createoperator.html :

> A multicharacter operator name cannot end in + or -, unless the name also 
> contains at least one of these characters:
> ~ ! @ # % ^ & | ` ?

So are there any other suggestions? I'm open to arguments that I
should just use +, but I think having a way to add two simple ranges
and get a multirange would be nice too, so my preference is to find a
new operator. It should work with minus and intersection (* for
ranges) too. Some proposals:

+* and -* and ** (* as in regex "zero or many" reminds me of how a
multirange holds zero or many ranges. ** is confusing though because
it's like exponentiation.)

@+ and @- and @* (I dunno why but I kind of like it. We already have @> and <@.)

<+> and <-> and <*> (I was hoping for (+) etc for the math connection
to circled operators, but this is close. Maybe this would be stronger
if multirange_{in,out} used <> delims instead of {}, although I also
like how {} is consistent with arrays.)

Anyone else have any thoughts?

Thanks,
Paul




Re: refactoring - share str2*int64 functions

2019-09-01 Thread Michael Paquier
On Sun, Sep 01, 2019 at 01:57:06PM +0200, Fabien COELHO wrote:
> I would put back unlikely() on overflow tests, as these are indeed unlikely
> to occur, it may help some compilers, and it cannot be harmful. It also
> helps the code reader know that these paths are not expected to be taken
> often.

Hm.  I don't agree about that part, and the original signed portions
don't do that.  I think that we should let the callers of the routines
decide if a problem is unlikely to happen or not as we do now.

> On reflection, I'm not sure that add_u64 and sub_u64 overflow with uint128
> are useful. The res < a or b > a tricks should suffice, just like for u16
> and u32 cases, and it may cost a little less anyway.

Actually, I agree and this is something I can see as well with some
extra measurements.  mul_u64 without int128 is twice slower, while
add_u64 and sub_u64 are 15~20% faster.

> I would suggest keeping the overflow extension as "contrib/overflow_test". For
> mul tests, I'd suggest trying not only min/max values like add/sub, but also
> "standard" multiplications that do or do not overflow. It would be good if "make
> check" could be made to work; for some reason it requires "installcheck".

Any extensions out of core can only work with "installcheck", and
check is not supported (see pgxs.mk).  I am still not convinced that
this module is worth the extra cycles to justify its existence
though.

> There is no doubt that dividing 64-bit integers is a very bad idea, at
> least on my architecture!

That's surprising.  I cannot reproduce that.  Are you sure that you
didn't just undefine HAVE_INT128?  This would cause
HAVE__BUILTIN_OP_OVERFLOW to still be active in all the code paths.
Here are a couple of results from my side with this query, FWIW, and
some numbers for all the compile flags (-O2 used):
select pg_overflow_check(1, 1, 20, 'XXX', 'XXX');
1) uint16:
1-1) mul:
- built-in: 5.5s
- fallback: 5.5s
1-2) sub:
- built-in: 5.3s
- fallback: 5.4s
1-3) add:
- built-in: 5.3s
- fallback: 6.2s
2) uint32:
2-1) mul:
- built-in: 5.1s
- fallback: 5.9s
2-2) sub:
- built-in: 5.2s
- fallback: 5.4s
2-3) add:
- built-in: 5.2s
- fallback: 6.2s
3) uint64:
3-1) mul:
- built-in: 5.1s
- fallback (with uint128): 8.0s
- fallback (without uint128): 18.1s
3-2) sub:
- built-in: 5.2s
- fallback (with uint128): 7.1s
- fallback (without uint128): 5.5s
3-3) add:
- built-in: 5.2s
- fallback (with uint128): 7.1s
- fallback (without uint128): 6.3s

So, the built-in option is always faster, and keeping the int128 path
if available for the multiplication makes sense, but not for the
subtraction and the addition.  I am wondering if we should review
further the signed part for add and sub, but I'd rather not touch it
in this patch.

> Note that the checks depend on the values, so actual performance may vary
> depending on the actual val1 and val2 passed. I used 1 and 1, like your example.

Sure.  Still that offers helpful hints as we do the same operations
for all code paths the same number of times.

If you have done any changes on my previous patch, or if you have a
script to share I could use to attempt to reproduce your results, I
would be happy to do so.

So, do you have more comments?
--
Michael


signature.asc
Description: PGP signature


Re: refactoring - share str2*int64 functions

2019-09-01 Thread Fabien COELHO


Bonjour Michaël,

You are right as well that having symmetry with the signed methods is 
much better.  In order to see the difference, you can just do that with 
the extension attached, after of course hijacking int.h with some undefs 
and recompiling the backend and the module: select 
pg_overflow_check(1, 1, 20, 'uint32', 'mul');


Ok.


still it is possible to trick things with signed integer arguments.


Is it?


If you pass -1 and then you can fall back to the maximum of each 16,
32 or 64 bits for the unsigned (see the regression tests I added with
the module).



Attached is also an updated version of the module I used to validate
this stuff.  Fabien, any thoughts?


Patch applies cleanly, compiles, "make check" ok (although the changes 
are untested).


I would put back unlikely() on overflow tests, as these are indeed 
unlikely to occur, it may help some compilers, and it cannot be harmful. 
It also helps the code reader know that these paths are not expected to 
be taken often.


On reflection, I'm not sure that add_u64 and sub_u64 overflow with uint128 
are useful. The res < a or b > a tricks should suffice, just like for u16 
and u32 cases, and it may cost a little less anyway.


I would suggest keeping the overflow extension as "contrib/overflow_test". 
For mul tests, I'd suggest trying not only min/max values like add/sub, 
but also "standard" multiplications that do or do not overflow. It would be 
good if "make check" could be made to work; for some reason it requires 
"installcheck".


I could not test performance directly, as the loops are optimized out by the 
compiler. I added "volatile" to the input value declarations to work around 
that. Over 2B iterations I got, on my laptop:


 int16: mul = 2770 ms, add = 1830 ms, sub = 1826 ms
 int32: mul = 1838 ms, add = 1835 ms, sub = 1840 ms
 int64: mul = 1836 ms, add = 1834 ms, sub = 1833 ms

 uint16: mul = 3670 ms, add = 1830 ms, sub = 2148 ms
 uint32: mul = 2438 ms, add = 1834 ms, sub = 1831 ms
 uint64: mul = 2139 ms, add = 1841 ms, sub = 1882 ms

Why int16 mul, uint* mul and uint16 sub are bad is unclear.

With fallback code triggered with:

 #undef HAVE__BUILTIN_OP_OVERFLOW

 int16: mul = 1433 ms, add = 1424 ms, sub = 1254 ms
 int32: mul = 1433 ms, add = 1425 ms, sub = 1443 ms
 int64: mul = 1430 ms, add = 1429 ms, sub = 1441 ms

 uint16: mul = 1445 ms, add = 1291 ms, sub = 1265 ms
 uint32: mul = 1419 ms, add = 1434 ms, sub = 1493 ms
 uint64: mul = 1266 ms, add = 1430 ms, sub = 1440 ms

For some unclear reason, 4 tests are significantly faster.

Forcing the fallback code further down with:

  #undef HAVE_INT128

 int64: mul = 1424 ms, add = 1429 ms, sub = 1440 ms
 uint64: mul = 24145 ms, add = 1434 ms, sub = 1435 ms

There is no doubt that dividing 64-bit integers is a very bad idea, at 
least on my architecture!


Note that the checks depend on the values, so actual performance may vary 
depending on the actual val1 and val2 passed. I used 1 and 1, like your 
example.


These results are definitely depressing, because the fallback code is 
nearly twice as fast as the builtin overflow detection version. For the 
record: gcc 7.4.0 on Ubuntu 18.04 LTS. Not sure what to advise; relying on 
the builtin should be the better idea…


--
Fabien.

Re: Yet another fast GiST build

2019-09-01 Thread Andrey Borodin


> On 30 Aug 2019, at 16:44, Andrey Borodin wrote:
> 
> 
> How do you think, should I supply GiST-build patch with docs and tests and 
> add it to CF? Or do we need more design discussion before?

PFA v2: now sort support is part of the opclass.
There's a problem with Z-ordering NaN which causes regression tests to fail, so 
I decided not to add the patch to the CF (despite having a few minutes to do so).
How do we correctly Z-order NaN, so that it is consistent with the semantics of 
the union() and consistent() functions? If one of the values is NaN, do we 
consider all its bits to be 1?

BTW, the patch uses
union {
    float f;
    uint32 i;
}
I hope that's OK, because AFAIK we do not support non-IEEE-754 platforms now.

Thanks!

Best regards, Andrey Borodin.



v2-0001-Add-sort-support-for-point-gist_point_sortsupport.patch
Description: Binary data


v2-0002-Implement-GiST-build-using-sort-support.patch
Description: Binary data


Re: row filtering for logical replication

2019-09-01 Thread Erik Rijkers

On 2019-09-01 02:28, Euler Taveira wrote:
Em dom, 3 de fev de 2019 às 07:14, Andres Freund  
escreveu:


As far as I can tell, the patch has not been refreshed since. So I'm
marking this as returned with feedback for now. Please resubmit once
ready.


I fixed all of the bugs pointed out in this thread. I decided to disallow



0001-Remove-unused-atttypmod-column-from-initial-table-sy.patch
0002-Store-number-of-tuples-in-WalRcvExecResult.patch
0003-Refactor-function-create_estate_for_relation.patch
0004-Rename-a-WHERE-node.patch
0005-Row-filtering-for-logical-replication.patch
0006-Print-publication-WHERE-condition-in-psql.patch
0007-Publication-where-condition-support-for-pg_dump.patch
0008-Debug-for-row-filtering.patch


Hi,

The first 4 of these apply without error, but I can't get 0005 to apply. 
This is what I use:


patch --dry-run -b -l -F 5 -p 1 < 
/home/aardvark/download/pgpatches/0130/logrep_rowfilter/20190901/0005-Row-filtering-for-logical-replication.patch



checking file doc/src/sgml/catalogs.sgml
Hunk #1 succeeded at 5595 (offset 8 lines).
checking file doc/src/sgml/ref/alter_publication.sgml
checking file doc/src/sgml/ref/create_publication.sgml
checking file src/backend/catalog/pg_publication.c
checking file src/backend/commands/publicationcmds.c
Hunk #1 succeeded at 352 (offset 8 lines).
Hunk #2 succeeded at 381 (offset 8 lines).
Hunk #3 succeeded at 539 (offset 8 lines).
Hunk #4 succeeded at 570 (offset 8 lines).
Hunk #5 succeeded at 601 (offset 8 lines).
Hunk #6 succeeded at 626 (offset 8 lines).
Hunk #7 succeeded at 647 (offset 8 lines).
Hunk #8 succeeded at 679 (offset 8 lines).
Hunk #9 succeeded at 693 (offset 8 lines).
checking file src/backend/parser/gram.y
checking file src/backend/parser/parse_agg.c
checking file src/backend/parser/parse_expr.c
Hunk #4 succeeded at 3571 (offset -2 lines).
checking file src/backend/parser/parse_func.c
Hunk #1 succeeded at 2516 (offset -13 lines).
checking file src/backend/replication/logical/tablesync.c
checking file src/backend/replication/logical/worker.c
checking file src/backend/replication/pgoutput/pgoutput.c
Hunk #1 FAILED at 12.
Hunk #2 succeeded at 60 (offset 2 lines).
Hunk #3 succeeded at 336 (offset 2 lines).
Hunk #4 succeeded at 630 (offset 2 lines).
Hunk #5 succeeded at 647 (offset 2 lines).
Hunk #6 succeeded at 738 (offset 2 lines).
1 out of 6 hunks FAILED
checking file src/include/catalog/pg_publication.h
checking file src/include/catalog/pg_publication_rel.h
checking file src/include/catalog/toasting.h
checking file src/include/nodes/nodes.h
checking file src/include/nodes/parsenodes.h
Hunk #1 succeeded at 3461 (offset -1 lines).
Hunk #2 succeeded at 3486 (offset -1 lines).
checking file src/include/parser/parse_node.h
checking file src/include/replication/logicalrelation.h
checking file src/test/regress/expected/publication.out
Hunk #1 succeeded at 116 (offset 9 lines).
checking file src/test/regress/sql/publication.sql
Hunk #1 succeeded at 69 with fuzz 1 (offset 9 lines).
checking file src/test/subscription/t/013_row_filter.pl


perhaps that can be fixed?

thanks,

Erik Rijkers




[Patch] Add a reset_computed_values function in pg_stat_statements

2019-09-01 Thread Pierre Ducroquet
Hello

pg_stat_statements is a great tool to track performance issues in live 
databases, especially when adding interfaces like PoWA on top of it.
But so far, tools like PoWA cannot track the min_time, max_time, mean_time 
and sum_var_time of queries: these statistics are cumulative over time, and 
fetching them at points in time would be of little to no use, especially when 
looking at the impact of a DDL change.
This patch thus introduces a simple pg_stat_statements_reset_computed_values 
function that resets the computed statistics, leaving all other information 
intact, thus allowing the aforementioned scenario.

Regards

 Pierre Ducroquet

From 5e6d16d738e5279968321c5f3695e72ded2432db Mon Sep 17 00:00:00 2001
From: Pierre 
Date: Thu, 14 Feb 2019 14:37:48 +0100
Subject: [PATCH] Add a function to reset the cumulative statistics

pg_stat_statements has two parts: raw statistics that are simple 'stable'
counters, and computed statistics (averages, min time, max time...)

When using pg_stat_statements to find and fix performance issues, being
able to reset the computed statistics can help track the issue and the
impact of the fixes.
This would also make it possible for tools like powa to collect these
statistics too and track them over time by resetting them after each
collection.
---
 .../pg_stat_statements--1.7--1.8.sql  | 11 
 .../pg_stat_statements/pg_stat_statements.c   | 52 +--
 .../pg_stat_statements.control|  2 +-
 3 files changed, 61 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql

diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql b/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql
new file mode 100644
index 00..7690a9ceba
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql
@@ -0,0 +1,11 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.8'" to load this file. \quit
+
+CREATE FUNCTION pg_stat_statements_reset_computed_values()
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C PARALLEL SAFE;
+
+
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 221b47298c..3a6c227a80 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -150,6 +150,7 @@ typedef struct Counters
 	double		max_time;		/* maximum execution time in msec */
 	double		mean_time;		/* mean execution time in msec */
 	double		sum_var_time;	/* sum of variances in execution time in msec */
+	int64		computed_calls;		/* # of times executed considered for the previous computed values */
 	int64		rows;			/* total # of retrieved or affected rows */
 	int64		shared_blks_hit;	/* # of shared buffer hits */
 	int64		shared_blks_read;	/* # of shared disk blocks read */
@@ -289,6 +290,7 @@ static bool pgss_save;			/* whether to save stats across shutdown */
 void		_PG_init(void);
 void		_PG_fini(void);
 
+PG_FUNCTION_INFO_V1(pg_stat_statements_reset_computed_values);
 PG_FUNCTION_INFO_V1(pg_stat_statements_reset);
 PG_FUNCTION_INFO_V1(pg_stat_statements_reset_1_7);
 PG_FUNCTION_INFO_V1(pg_stat_statements_1_2);
@@ -1252,8 +1254,9 @@ pgss_store(const char *query, uint64 queryId,
 			e->counters.usage = USAGE_INIT;
 
 		e->counters.calls += 1;
+		e->counters.computed_calls += 1;
 		e->counters.total_time += total_time;
-		if (e->counters.calls == 1)
+		if (e->counters.computed_calls == 1)
 		{
 			e->counters.min_time = total_time;
 			e->counters.max_time = total_time;
@@ -1268,7 +1271,7 @@ pgss_store(const char *query, uint64 queryId,
 			double		old_mean = e->counters.mean_time;
 
 			e->counters.mean_time +=
-(total_time - old_mean) / e->counters.calls;
+(total_time - old_mean) / e->counters.computed_calls;
 			e->counters.sum_var_time +=
 (total_time - old_mean) * (total_time - e->counters.mean_time);
 
@@ -1324,7 +1327,7 @@ pg_stat_statements_reset_1_7(PG_FUNCTION_ARGS)
 }
 
 /*
- * Reset statement statistics.
+ * Reset all statement statistics.
  */
 Datum
 pg_stat_statements_reset(PG_FUNCTION_ARGS)
@@ -1334,6 +1337,49 @@ pg_stat_statements_reset(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+/*
+ * Reset computed statistics from all statements.
+ */
+Datum
+pg_stat_statements_reset_computed_values(PG_FUNCTION_ARGS)
+{
+	if (!pgss || !pgss_hash)
+		ereport(ERROR,
+(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("pg_stat_statements must be loaded via shared_preload_libraries")));
+	entry_reset_computed();
+	PG_RETURN_VOID();
+}
+
+
+/*
+ * Reset statement computed statistics.
+ * This takes a shared lock only on the hash table, and a lock per entry
+ */
+static void
+entry_reset_computed(void)
+{
+	HASH_SEQ_STATUS hash_seq;
+	pgssEntry  *entry;
+
+	/* Lookup the hash table entry with 

Write visibility map during CLUSTER/VACUUM FULL

2019-09-01 Thread Alexander Korotkov
Hi!

I found it weird that CLUSTER/VACUUM FULL don't write the visibility map.
The attached patch implements writing the visibility map in
heapam_relation_copy_for_cluster().

I've studied the previous attempt to implement this [1].  The main problem
with that attempt was its use of the existing heap_page_is_all_visible()
and visibilitymap_set() functions.  These functions work through the
buffer manager, while heap rewriting bypasses the buffer manager.

In my patch, visibility map pages are handled in the same way as heap
pages.  RewriteStateData holds the contents of a single VM page.  Once a
heap page is finished, the corresponding bits are set in the VM page, and
so on.  VM pages are flushed one by one to smgr.

This patch has to implement its own check of whether a tuple is
all-visible.  But it appears possible to simplify this check by assuming
that all tuples have already passed HeapTupleSatisfiesVacuum(), which
sets hint bits.

Links
1. 
https://www.postgresql.org/message-id/flat/20131203162556.GC27105%40momjian.us#90e4a6e77d92076650dcf1d96b32ba38

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-write-vm-during-cluster-1.patch
Description: Binary data


Re: Should we add xid_current() or a int8->xid cast?

2019-09-01 Thread Thomas Munro
On Fri, Aug 2, 2019 at 10:42 PM Thomas Munro  wrote:
> On Thu, Jul 25, 2019 at 1:11 PM Thomas Munro  wrote:
> > On Thu, Jul 25, 2019 at 12:42 PM Tom Lane  wrote:
> > > Andres Freund  writes:
> > > > On 2019-07-24 20:34:39 -0400, Tom Lane wrote:
> > > >> Yeah, I would absolutely NOT recommend that you open that can of worms
> > > >> right now.  We have looked at adding unsigned integer types in the past
> > > >> and it looked like a mess.
> > >
> > > > I assume Thomas was thinking more of another bespoke type like xid, just
> > > > wider.  There's some notational advantage in not being able to
> > > > immediately do math etc on xids.
> > >
> > > Well, we could invent an xid8 type if we want, just don't try to make
> > > it part of the numeric hierarchy (as indeed xid isn't).
> >
> > Yeah, I meant an xid64/xid8/fxid/pg_something/... type that isn't a
> > kind of number.

I thought about how to deal with the transition to xid8 for the
txid_XXX() family of functions.  The best idea I've come up with so
far is to create a parallel xid8_XXX() family of functions, declare
the bigint-based functions deprecated, and threaten to drop them from
a future release.  The C code for the two families can be the same
(it's a bit of a dirty trick, but only until the txid_XXX() variants
go away).  Here's a PoC patch demonstrating that.  It's not tested
much yet and probably needs some more work, but I wanted to see if
others thought the idea was terrible first.

I wonder if there is a better way to share hash functions than the
hack in check_hash_func_signature(), which I had to extend to cover
xid8.

Adding to CF.

-- 
Thomas Munro
https://enterprisedb.com


0001-Add-SQL-type-xid8-to-expose-FullTransactionId-to-use.patch
Description: Binary data


0002-Introduce-xid8-variants-of-the-txid_XXX-fmgr-functio.patch
Description: Binary data