Re: Why overhead of SPI is so large?

2019-08-22 Thread Konstantin Knizhnik




On 22.08.2019 5:40, Kyotaro Horiguchi wrote:

Hello.

At Wed, 21 Aug 2019 19:41:08 +0300, Konstantin Knizhnik  
wrote in 

Hi, hackers.

One of our customers complains about slow execution of PL/pgSQL
functions compared with Oracle.
So he wants to compile PL/pgSQL functions (most likely just-in-time
compilation).
Certainly the interpreter adds quite a large overhead compared with native
code (~10 times), but
most PL/pgSQL functions just run some SQL queries and
iterate through the results.

I can not believe that JIT can significantly speed up such functions.
So I decided to make a simple experiment: I created a large enough table
and implemented functions
which calculate the norm of one column in different languages.

Results are frustrating (at least for me):

PL/pgSQL:   29044.361 ms
C/SPI:      22785.597 ms
C/coreAPI:   2873.072 ms
PL/Lua:     33445.520 ms
SQL:         7397.639 ms (with parallel execution disabled)

The fact that the difference between PL/pgSQL and the function implemented in
C using SPI is not so large was expected by me.
But why is it more than 3 times slower than the corresponding SQL query?
The answer seems to be in the third result: the same function in C
implemented without SPI (using table_beginscan/heap_getnext).
Looks like SPI adds quite significant overhead.
And since almost all PL languages are using SPI, they all suffer
from it.

As far as I can tell from the attached spitest.c, the
overhead comes from the cursor operation, not from SPI itself. As far as
spitest.c goes, the cursor is unnecessary.  "SQL" and C/coreAPI seem to
be scanning over the result from *a single* query. If that's
correct, why don't you use SPI_execute() and then scan over
SPI_tuptable?


The scanned table is very large and doesn't fit in memory.
This is why I am using SPI cursors.
Please let me know if there is a more efficient way to traverse a large
table using SPI.
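For reference, batching rows with SPI_cursor_fetch() is the usual way to stream a
large result through SPI without materializing it; a minimal sketch along those
lines is below (table/column names, batch size, and error handling are my
assumptions, not code from this message):

#include "postgres.h"
#include "executor/spi.h"

/* Minimal sketch: stream a large table through SPI in batches of 1000 rows. */
static double
sum_of_squares(void)
{
    double      result = 0.0;
    SPIPlanPtr  plan;
    Portal      portal;

    SPI_connect();
    plan = SPI_prepare("SELECT x FROM big_table", 0, NULL);
    portal = SPI_cursor_open(NULL, plan, NULL, NULL, true);

    for (;;)
    {
        uint64      i;

        SPI_cursor_fetch(portal, true, 1000);   /* next batch of rows */
        if (SPI_processed == 0)
            break;

        for (i = 0; i < SPI_processed; i++)
        {
            bool        is_null;
            Datum       val = SPI_getbinval(SPI_tuptable->vals[i],
                                            SPI_tuptable->tupdesc, 1,
                                            &is_null);

            if (!is_null)
            {
                double      x = DatumGetFloat8(val);

                result += x * x;
            }
        }
        SPI_freetuptable(SPI_tuptable);
    }

    SPI_cursor_close(portal);
    SPI_finish();
    return result;
}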





regards.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Optimization of vacuum for logical replication

2019-08-22 Thread Konstantin Knizhnik




On 22.08.2019 6:13, Kyotaro Horiguchi wrote:

Hello.

At Wed, 21 Aug 2019 18:06:52 +0300, Konstantin Knizhnik  
wrote in <968fc591-51d3-fd74-8a55-40aa770ba...@postgrespro.ru>

Ok, you convinced me that there are cases when people want to combine
logical replication with streaming replication without a slot.
But is it acceptable to have a GUC variable (disabled by default) which
allows using this optimization?

The odds are quite high. Couldn't we introduce a new wal_level
value instead?

wal_level = logical_only


I think this thread shows that logical replication no longer is a
superset(?) of physical replication.  I thought that we might be
able to change wal_level from a scalar to a bitmap but it breaks
backward compatibility..

regards.


I think that introducing a new wal_level is a good idea.
There are a lot of other places (besides vacuum) where we insert
information into the log which is not needed for logical decoding.
Instead of changing all the places in the code where this information is
inserted, we can filter it at the xlog level (xlog.c).
My only concern is how many incompatibilities will be caused by
introducing a new wal level.
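Just to make the shape of the idea concrete, here is a purely hypothetical sketch
of what the GUC side might look like; the value name, its ordering relative to the
existing levels, and how the many wal_level >= WAL_LEVEL_... checks would treat it
are all assumptions, which is exactly the compatibility concern above:

/*
 * Hypothetical sketch only, not part of any posted patch.  Distinct names
 * are used to emphasize this is an illustration, not the real enum.
 */
#include "postgres.h"
#include "utils/guc.h"

typedef enum HypotheticalWalLevel
{
	XWAL_LEVEL_MINIMAL = 0,
	XWAL_LEVEL_REPLICA,
	XWAL_LEVEL_LOGICAL,
	XWAL_LEVEL_LOGICAL_ONLY		/* assumed new value */
} HypotheticalWalLevel;

static const struct config_enum_entry hypothetical_wal_level_options[] = {
	{"minimal", XWAL_LEVEL_MINIMAL, false},
	{"replica", XWAL_LEVEL_REPLICA, false},
	{"logical", XWAL_LEVEL_LOGICAL, false},
	{"logical_only", XWAL_LEVEL_LOGICAL_ONLY, false},	/* assumed */
	{NULL, 0, false}
};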


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: FETCH FIRST clause PERCENT option

2019-08-22 Thread Surafel Temesgen
On Tue, Aug 20, 2019 at 9:10 AM Kyotaro Horiguchi 
wrote:

> Hi,
>
> At Wed, 7 Aug 2019 10:20:09 +0300, Surafel Temesgen 
> wrote in <
> calay4q98xbvhtz4yj9dccmg2-s1_jurr7fyanfw+bkmr22o...@mail.gmail.com>
> > Hi
> > On Wed, Aug 7, 2019 at 6:11 AM Kyotaro Horiguchi <
> horikyota@gmail.com>
> > wrote:
> >
> > >
> > > I have some comments.
> > >
> > > This patch uses distinct parameters for exact number and
> > > percentage. On the other hand the planner has a notion of
> > > tuple_fraction covering both. The same handling is also used
> > > for tuple number estimation. Using the same scheme will make
> > > things far simpler. See the comment of grouping_planner().
> > >
> > >
> > It's because of the data type difference. In the planner the data type is the same
>
> I meant that, with the usage of tuple_fraction, one variable can
> represent both the absolute limit and relative limit. This
> simplifies parse structs.
>
In grouping_planner the patch sets the tuple bound to -1 in create_ordered_paths,
because with a percentage it iterates until the end. Did that answer your
question?

>
>
> > > In the executor part, in addition to LimitState.position, this patch
> > > uses LimitState.backwardPosition to count how many tuples we're
> > > back from the end of the current tuplestore. But things will get
> > > simpler by just asking the tuplestore for the number of holding
> > > tuples.
> > >
> > >
> > backwardPosition holds how many tuples were returned in the backward scan
>
> I meant that we don't need to hold the number in estate.
>

I did it this way because I didn't find an API in tuplestore to do this


>
> > > +slot = node->subSlot;
> > >
> > > Why is this needed? The variable is properly set before use and
> > > the assignment is bogus in the first place.
> > >
> > >
> > > it's because the tuplestore needs an initialized slot.
>
> I meant that the initialized slot is overwritten before first use.
>
> > > The new code block in LIMIT_RESCAN in ExecLimit is useless since
> > > it is exactly the same as the existing code block. Why didn't you
> > > use the existing if block?
> > >
> >
> > But they test different scenarios
>
> I meant that the two different scenarios can share the code block.
>

Sorry for not clarifying well. One has to be checked before offsetting and
the other after offsetting.


>
> > > >if (node->limitOption == PERCENTAGE)
> > > +{
> > > +node->perExactCount = ceil(node->percent *
> > > node->position / 100.0);
> > > +
> > > +
> > >
> > > node->position is the number of tuples returned to upper node so
> > > far (not the number of tuples this node has read from the lower
> > > node so far).  I don't understand what the expression means.
> > >
> >
> > node->position holds the number of tuples this node has read from the
> > lower node so far. See the LIMIT_RESCAN state.
>
> Really? node->position is incremented when
> tuplestore_gettupleslot_heaptuple() succeeds and returns the
> tuple to the caller immediately...
>
> > > +if (node->perExactCount == node->perExactCount +
> 1)
> > > +node->perExactCount++;
> > >
> > > What? The condition never be true. As the result, the following
> > > if block below won't run.
> > >
> >
> > it becomes true according to the number of tuples returned from the
> > lower node so far and the percentage specification.
>
> Mmm. How do you think X can be equal to  (X + 1)?
>

Oops, my bad. The attached patch removes the need for doing that.

>
> > > >/*
> > > + * Return the tuple up to the number of exact count in
> > > OFFSET
> > > + * clause without percentage value consideration.
> > > + */
> > > +if (node->perExactCount > 0)
> > > +{
> > > +
> > >
> > >
> > >
> > >
> > > +/*
> > > + * We may needed this tuple in backward scan so put it
> into
> > > + * tuplestore.
> > > + */
> > > +if (node->limitOption == PERCENTAGE)
> > > +{
> > > +tuplestore_puttupleslot(node->tupleStore, slot);
> > > +tuplestore_advance(node->tupleStore, true);
> > > +}
> > >
> > > "needed"->"need" ? The comment says that this is needed for
> > > backward scan, but it is actually required even in forward
> > > scan. More to the point, tuplestore_advance lacks comment.
> >
> >
> > ok
> >
> >
> > >
> > > Anyway, the code in LIMIT_RESCAN is broken in some ways. For
> > > example, if it is used as the inner scan of a NestLoop, the
> > > tuplestore accumulates tuples by every scan. You will see that
> > > the tuplestore stores up to 1000 tuples (10 times of the inner)
> > > by the following query.
> > >
> >
> > It is because with a percentage we scan the whole table
>
> It's useless and rather harmful that the tuplestore holds an
> indefinite number of duplicate sets of the whole tuples from the
> lower node. We must reuse tuples already store

Refactoring of connection with password prompt loop for frontends

2019-08-22 Thread Michael Paquier
Hi all,

In six places of the code tree (+ one in psql which is a bit
different), we have the following pattern for frontend tools to
connect to a backend with a password prompt, roughly like this:
do
{
[...]
conn = PQconnectdbParams(keywords, values, true);

[...]

if (PQstatus(conn) == CONNECTION_BAD &&
PQconnectionNeedsPassword(conn) &&
!have_password)
{
PQfinish(conn);
simple_prompt("Password: ", password, sizeof(password), false);
have_password = true;
new_pass = true;
}
} while (new_pass);
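For orientation, here is a rough sketch of what such a consolidated helper could
look like; the real routine in the attached patch takes a couple more parameters
(see the oid2name hunk below), so treat the signature and error handling here as
assumptions rather than as the patch itself:

#include "postgres_fe.h"
#include "libpq-fe.h"

/* Sketch of a consolidated "connect with password prompt" helper. */
PGconn *
connect_with_password_prompt(const char *host, const char *port,
							 const char *user, const char *dbname,
							 const char *progname)
{
	PGconn	   *conn;
	bool		have_password = false;
	char		password[100];
	bool		new_pass;

	do
	{
		const char *keywords[] = {"host", "port", "user", "password",
								  "dbname", "fallback_application_name", NULL};
		const char *values[] = {host, port, user,
								have_password ? password : NULL,
								dbname, progname, NULL};

		new_pass = false;
		conn = PQconnectdbParams(keywords, values, true);

		if (PQstatus(conn) == CONNECTION_BAD &&
			PQconnectionNeedsPassword(conn) &&
			!have_password)
		{
			PQfinish(conn);
			simple_prompt("Password: ", password, sizeof(password), false);
			have_password = true;
			new_pass = true;
		}
	} while (new_pass);

	if (PQstatus(conn) == CONNECTION_BAD)
	{
		/* caller decides how to report PQerrorMessage(conn) */
		PQfinish(conn);
		return NULL;
	}
	return conn;
}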

Attached is a tentative patch to consolidate this logic across the
tree.  The patch is far from being in a committable state, and there
are a couple of gotchas:
- pg_dumpall merges connection string parameters, so it is much harder
to plug in than the others, and I left it out on purpose.
- Some code paths spawn a password prompt before the first connection
attempt.  For now I have left this first attempt out of the refactored
logic, but I think that it is possible to consolidate that a bit more
by enforcing a password prompt before doing the first connection
attempt (somewhat related to the XXX portion in the patch).  At the
end it would be nice to not have to have a 100-byte-long field for the
password buffer we have here and there.  Unfortunately this comes with
its limitations in pg_dump as savedPassword needs to be moved around
and may be reused in the context.
- I don't like the routine name connect_with_password_prompt() I
introduced.  Suggestions of better names are welcome :)
- This also brings up the point that some of our tools are not able to
handle tri-values for passwords, so we may want to consolidate that as
well.

Among the positive points, this brings a lot of consolidation in terms
of error handling, and this shaves a bit of code:
13 files changed, 190 insertions(+), 280 deletions(-) 

This moves the logic into src/fe_utils, which is natural as that's
aimed only for frontends and because this also links to libpq.

Please note that this links a bit with the refactoring of vacuumlo and
oid2name logging I proposed a couple of days back (applying one patch
or the other results in conflicts) because we need to have frontends
initialized for logging in order to be able to log errors in the
refactored routine:
https://www.postgresql.org/message-id/20190820012819.ga8...@paquier.xyz
This one should be merged first IMO, but that's a story for the other
thread.

This compiles, and passes all regression tests so it is possible to
play with it easily, still it needs much more testing and love.

Any thoughts?  I am adding that to the next commit fest.

Thanks,
--
Michael
diff --git a/contrib/oid2name/Makefile b/contrib/oid2name/Makefile
index 361a80a7a1..c6ba092f47 100644
--- a/contrib/oid2name/Makefile
+++ b/contrib/oid2name/Makefile
@@ -9,7 +9,7 @@ OBJS	= oid2name.o $(WIN32RES)
 TAP_TESTS = 1
 
 PG_CPPFLAGS = -I$(libpq_srcdir)
-PG_LIBS_INTERNAL = $(libpq_pgport)
+PG_LIBS_INTERNAL = $(libpq_pgport) -L$(top_builddir)/src/fe_utils -lpgfeutils
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/oid2name/oid2name.c b/contrib/oid2name/oid2name.c
index d3a4a50005..c38679e869 100644
--- a/contrib/oid2name/oid2name.c
+++ b/contrib/oid2name/oid2name.c
@@ -11,6 +11,7 @@
 
 #include "catalog/pg_class_d.h"
 
+#include "common/logging.h"
 #include "fe_utils/connect.h"
 #include "libpq-fe.h"
 #include "pg_getopt.h"
@@ -46,6 +47,8 @@ struct options
 	const char *progname;
 };
 
+const char *progname;
+
 /* function prototypes */
 static void help(const char *progname);
 void		get_opts(int, char **, struct options *);
@@ -82,9 +85,9 @@ get_opts(int argc, char **argv, struct options *my_opts)
 	};
 
 	int			c;
-	const char *progname;
 	int			optindex;
 
+	pg_logging_init(argv[0]);
 	progname = get_progname(argv[0]);
 
 	/* set the defaults */
@@ -284,65 +287,20 @@ PGconn *
 sql_conn(struct options *my_opts)
 {
 	PGconn	   *conn;
-	bool		have_password = false;
-	char		password[100];
-	bool		new_pass;
 	PGresult   *res;
 
-	/*
-	 * Start the connection.  Loop until we have a password if requested by
-	 * backend.
-	 */
-	do
+	/* grab a connection, with password prompting */
+	conn = connect_with_password_prompt(my_opts->hostname,
+		my_opts->port,
+		my_opts->username,
+		my_opts->dbname,
+		my_opts->progname,
+		NULL,
+		true);
+	if (conn == NULL)
 	{
-#define PARAMS_ARRAY_SIZE	7
-
-		const char *keywords[PARAMS_ARRAY_SIZE];
-		const char *values[PARAMS_ARRAY_SIZE];
-
-		keywords[0] = "host";
-		values[0] = my_opts->hostname;
-		keywords[1] = "port";
-		values[1] = my_opts->port;
-		keywords[2] = "user";
-		values[2] = my_opts->username;
-		keywords[3] = "password";
-		values[3] = have_password ? password : NULL;
-		keywords[4] = "dbname";
-		values[4] = my_opts->dbname;
-		keywords[5] = "fallback_application_name";
-		values[5] = my_opts->progname;
-		keywords[6] = NULL;
-		values[6] = NULL;
-
-		n

Comment in ginpostinglist.c doesn't match code

2019-08-22 Thread Heikki Linnakangas

Hi,

While merging Greenplum with 9.4, we ran into problems with the GIN 
posting list encoding, because Greenplum sometimes uses ItemPointers 
with offset numbers up to 32768. The GIN posting list code was written 
with the assumption that the maximum is MaxHeapTuplesPerPage, and it 
uses only 11 bits for the offset number. The fix was simple: we just 
modified ginpostinglist.c to use the full 16 bits instead of 11.


However, I noticed that this comment in ginpostinglist.c that explains 
the encoding doesn't match the code:



 * These 43-bit integers are encoded using varbyte encoding. In each byte,
 * the 7 low bits contain data, while the highest bit is a continuation bit.
 * When the continuation bit is set, the next byte is part of the same
 * integer, otherwise this is the last byte of this integer.  43 bits fit
 * conveniently in at most 6 bytes when varbyte encoded (the 6th byte does
 * not need a continuation bit, because we know the max size to be 43 bits):
 *
 * 0XXXXXXX
 * 1XXXXXXX 0XXXXYYY
 * 1XXXXXXX 1XXXXYYY 0YYYYYYY
 * 1XXXXXXX 1XXXXYYY 1YYYYYYY 0YYYYYYY
 * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
 * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY YYYYYYYY
 *
 * X = bits used for offset number
 * Y = bits used for block number


The code doesn't actually give the 6th byte any special treatment. If 
the input integer has the 43rd bit set, the encoding function will put a 
continuation bit on the 6th byte, and generate a 7th byte. And the 
decoding function will correctly decode that, too. So to my surprise, 
the implementation actually works for integers up to 49 bits wide. However, 
there is an overflow check in the encoding function that assumes max 6 
bytes per integer. That needs to be fixed, along with the comment.
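To make the behavior concrete, here is a generic varbyte encoder/decoder sketch
(not the actual ginpostinglist.c code, which packs the offset and block number as
shown above): an n-bit value needs ceil(n / 7) bytes under this scheme, so a value
with the 43rd bit set needs 7 bytes, which is why the encoder emits a continuation
bit on the 6th byte.

#include <stdint.h>

/* Encode 'val' as varbyte: 7 data bits per byte, high bit = continuation.
 * Returns the number of bytes written to 'buf'. */
static int
varbyte_encode(uint64_t val, unsigned char *buf)
{
    int     n = 0;

    while (val > 0x7F)
    {
        buf[n++] = 0x80 | (val & 0x7F);     /* continuation bit set */
        val >>= 7;
    }
    buf[n++] = (unsigned char) val;         /* last byte, no continuation */
    return n;                               /* 7 for a 43-bit value */
}

/* Decode one varbyte-encoded integer starting at *ptr; advance *ptr past it. */
static uint64_t
varbyte_decode(const unsigned char **ptr)
{
    uint64_t    val = 0;
    int         shift = 0;
    const unsigned char *p = *ptr;

    do
    {
        val |= (uint64_t) (*p & 0x7F) << shift;
        shift += 7;
    } while (*p++ & 0x80);

    *ptr = p;
    return val;
}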


Fitting any item pointer into 6 bytes was an important property when 
this was written, because in the old pre-9.4 format, posting lists were 
stored as plain arrays, with 6 bytes per item pointer. The maximum of 6 bytes per 
integer in the new format guaranteed that we could convert any page from 
the old format to the new format, after pg_upgrade, so that the new 
format was never larger than the old format. But I don't think we need 
to worry much about that anymore. Luckily, no one has run into this 
while trying to upgrade. It would require having a 9.3 cluster with a 
table larger than 16 TB (with 8k block size), with a GIN index on it, 
and a posting list with TIDs more than 2^31 blocks apart, on a full 
page. So, not a problem in practice.


In summary, the comment in ginpostinglist.c is wrong, and the overflow 
check needs to be fixed. Patch attached.


The patch also includes a little unit test module to test this without 
creating a 16 TB table. A whole new test module seems a bit like 
overkill just for this, but clearly we were missing test coverage here. 
And it will come in handy if we want to invent a new, better posting list 
format in the future. Thoughts on whether to include the test module or not?


- Heikki
>From 3178fe98ca93c52bc8678953b328f70d7ed16b5c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 22 Aug 2019 11:06:13 +0300
Subject: [PATCH 1/1] Fix overflow check and comment in GIN posting list
 encoding.

The comment did not match what the code actually did for integers with
the 43rd bit set. You get an integer like that, if you have a posting
list with two adjacent TIDs that are more than 2^31 blocks apart.
According to the comment, we would store that in 6 bytes, with no
continuation bit on the 6th byte, but in reality, the code encodes it
using 7 bytes, with a continuation bit on the 6th byte as normal.

The decoding routine also handled these 7-byte integers correctly, except
for an overflow check that assumed that one integer needs at most 6 bytes.
Fix the overflow check, and fix the comment to match what the code
actually does.

Fitting any item pointer into max 6 bytes was an important property when
this was written, because in the old pre-9.4 format, item pointers were
stored as plain arrays, with 6 bytes for every item pointer. The maximum
of 6 bytes per integer in the new format guaranteed that we could convert
any page from the old format to the new format after upgrade, so that the
new format was never larger than the old format. But we hardly need to
worry about that anymore, and running into that problem during upgrade,
where an item pointer is expanded from 6 to 7 bytes such that the data
doesn't fit on a page anymore, is implausible in practice anyway.

Backpatch to all supported versions.

This also includes a little test module to test these large distances
between item pointers, without requiring a 16 TB table. It is not
backpatched, I'm including it more for the benefit of future development
of new posting list formats.

Discussion: XXX
---
 src/backend/access/gin/ginpostinglist.c   | 33 +--
 src/test/modules/Makefile |  1 +
 src/test/modules/test_ginpostinglist/Makefile | 21 
 src/test/modules/test

Take skip header out of a loop in COPY FROM

2019-08-22 Thread Surafel Temesgen
Hello,

Even though skipping the header is done only once, it is checked and skipped
inside the loop. If I am not missing something, it can be done outside the loop,
as in the attached patch

regards

Surafel
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 4f04d122c3..4e7709d7bf 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -3004,6 +3004,13 @@ CopyFrom(CopyState cstate)
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
 
+	/* on input just throw the header line away */
+	if ( cstate->header_line)
+	{
+		cstate->cur_lineno++;
+		(void) CopyReadLine(cstate);
+	}
+
 	for (;;)
 	{
 		TupleTableSlot *myslot;
@@ -3642,14 +3649,6 @@ NextCopyFromRawFields(CopyState cstate, char ***fields, int *nfields)
 	/* only available for text or csv input */
 	Assert(!cstate->binary);
 
-	/* on input just throw the header line away */
-	if (cstate->cur_lineno == 0 && cstate->header_line)
-	{
-		cstate->cur_lineno++;
-		if (CopyReadLine(cstate))
-			return false;		/* done */
-	}
-
 	cstate->cur_lineno++;
 
 	/* Actually read the line into memory here */


Re: Take skip header out of a loop in COPY FROM

2019-08-22 Thread Heikki Linnakangas

On 22/08/2019 11:31, Surafel Temesgen wrote:

Hello,

Even though skipping the header is done only once, it is checked and skipped
inside the loop. If I am not missing something, it can be done outside the loop,
as in the attached patch


You may be on to something, but if we move it to CopyFrom(), as in your 
patch, then it won't get executed e.g. from the calls in file_fdw. 
file_fdw calls BeginCopyFrom(), followed by NextCopyFrom(); it doesn't 
use CopyFrom().


- Heikki




Re: Make SQL/JSON error code names match SQL standard

2019-08-22 Thread Peter Eisentraut
On 2019-08-20 13:48, Alexander Korotkov wrote:
> On Tue, Aug 20, 2019 at 11:49 AM Peter Eisentraut
>  wrote:
>> I propose the attached patch to make the new SQL/JSON error code names
>> match the SQL standard.  The existing minor differences don't seem
>> necessary.
> 
> Thank you for noticing!
> +1 for pushing this

done

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: when the IndexScan reset to the next ScanKey for in operator

2019-08-22 Thread Kyotaro Horiguchi
Hi,

At Wed, 21 Aug 2019 20:58:47 +0800, Alex  wrote in 

> postgres=# select * from t2 where a in (1, 10);
...
> I can see the plan stores the "1 and 10" information in
> IndexScan->indexqual, which is an SCALARARRAYOPEXPR expression.
...
> suppose the executor should scan 1 first. If all the tuples for 1 have
> been scanned, then **it should be reset to 10** and scanned again.
> However, I can't find the code for that. It looks like index_rescan is not for
> this. Am I missing something?

Perhaps _bt_advance_array_keys() and btgettuple() would be what
you are looking for.

>/* ... otherwise see if we have more array keys to deal with */
>} while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Email to hackers for test coverage

2019-08-22 Thread movead...@highgo.ca
Hello hackers,

One of the areas that didn't get much attention in the community
recently is analysing and increasing the code coverage of the
PostgreSQL regression test suite. I have started working on the
code coverage by running the GCOV code coverage analysis tool in
order to analyse the current code coverage and come up with test
cases to increase it. This is going to be a long
exercise, so my plan is to do it incrementally. I will be analysing
some area of untested code and then coming up with test cases to
test those lines of code in regression, and then moving on to the next
area of untested code, and so on.

So far I have come up with 3 test cases to increase the code
coverage of PostgreSQL regression test suite.

I have performed the regression run for this exercise on this commit
(75c1921cd6c868c5995b88113b4463a4830b9a27).

The regression tests are executed with the 'make check-world' command and the
results are gathered using the 'make coverage-html' command.

Below are the lines of untested code that I have analysed and the
test cases added to the regression suite to cover them.

1. src/include/utils/float.h:140

Analyze: 
This is the error report line reached when converting a float8 value
that is too big for float4 to store.

Test case:
Add a test case as below in file float4.sql:
select float4(1234567890123456789012345678901234567890::float8);

2. src/include/utils/float.h:145

Analyze:
This is the error report line reached when converting a float8 value
that is too small for float4 to store.

Test case:
Add a test case as below in file float4.sql:
select float4(0.01::float8);

3.src/include/utils/sortsupport.h:264

Analyze:
It is the reverse-sorting path for data types that have abbreviated keys for
sorting, for example macaddr, uuid, numeric and network types; I chose
numeric to exercise it.

Test case:
Add a test case as below in file numeric.sql:
INSERT INTO num_input_test(n1) values('99.998');
INSERT INTO num_input_test(n1) values('99.997');
SELECT * FROM num_input_test ORDER BY n1 DESC;

Result and patch

By adding the test cases, the test coverage of float.h increased from
97.7% to 100% and that of sortsupport.h increased from 76.7% to 80.0%.
 
The increase in code coverage can be seen in the before and after
pictures of GCOV test coverage analysis summary.

The attached patch contain the test cases added in regression for
increasing the coverage.


--
Movead Li


before.png
Description: Binary data


after.png
Description: Binary data


regression_20190820.patch
Description: Binary data


Re: Take skip header out of a loop in COPY FROM

2019-08-22 Thread Adam Lee
On Thu, Aug 22, 2019 at 11:48:31AM +0300, Heikki Linnakangas wrote:
> On 22/08/2019 11:31, Surafel Temesgen wrote:
> > Hello,
> > 
> > Even though skipping the header is done only once, it is checked and skipped
> > inside the loop. If I am not missing something, it can be done outside the
> > loop, as in the attached patch
> 
> You may be on to something, but if we move it to CopyFrom(), as in your
> patch, then it won't get executed e.g. from the calls in file_fdw. file_fdw
> calls BeginCopyFrom(), followed by NextCopyFrom(); it doesn't use
> CopyFrom().
> 
> - Heikki

Yes.

My next thought is to call unlikely() here, but we don't have it...
https://www.postgresql.org/message-id/CABRT9RC-AUuQL6txxsoOkLxjK1iTpyexpbizRF4Zxny1GXASGg%40mail.gmail.com

-- 
Adam Lee




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Amit Kapila
On Thu, Aug 22, 2019 at 11:04 AM Dilip Kumar  wrote:
>
> On Thu, Aug 22, 2019 at 10:24 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote:
> > > On Thu, Aug 22, 2019 at 9:58 AM Andres Freund  wrote:
> > > >
> > > > Hi,
> > > >
> > > > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> > > > > We can not know the complete size of the record even by reading the
> > > > > header, because we have a payload that is a variable part, and the payload
> > > > > length is stored in the payload header, which again can be at a random
> > > > > offset.
> > > >
> > > > Wait, but that's just purely self inflicted damage, no? The initial
> > > > length just needs to include the payload. And all this is not an issue
> > > > anymore?
> > > >
> > > Actually, we store the undo length only at the end of the record and
> > > that is for traversing the transaction's undo record chain during bulk
> > > fetch.  As such, in the beginning of the record we don't have the undo
> > > length.  We do have uur_info but that just tells us which optional
> > > headers are included in the record.
> >
> > But why? It makes a *lot* more sense to have it in the beginning. I
> > don't think bulk-fetch really requires it to be in the end - we can
> > still process records forward on a page-by-page basis.
>
> Yeah, we can handle the bulk fetch as you suggested and it will make
> it a lot easier.  But, currently while registering the undo request
> (especially during the first pass) we need to compute the from_urecptr
> and the to_urecptr. And,  for computing the from_urecptr,  we have the
> end location of the transaction because we have the uur_next in the
> transaction header and that will tell us the end of our transaction
> but we still don't know the undo record pointer of the last record of
> the transaction.  As of now, we read the previous 2 bytes from the end of
> the transaction to know the length of the last record and from there
> we can compute the undo record pointer of the last record and that is
> our from_urecptr.
>

How about if we store the location of the last record of the
transaction instead of the location of the next transaction in the
transaction header?  I think if we do that then the discard worker might
need to do some additional work in some cases, as it needs to tell the
location up to which discard is possible, however, many other cases
might get simplified.  With this also, when the log is switched while
writing records for the same transaction, the transaction header in
the first log will store the start location of the same transaction's
records in the next log.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Take skip header out of a loop in COPY FROM

2019-08-22 Thread Heikki Linnakangas

On 22/08/2019 12:54, Adam Lee wrote:

My next thought is to call unlikely() here, but we don't have it...
https://www.postgresql.org/message-id/CABRT9RC-AUuQL6txxsoOkLxjK1iTpyexpbizRF4Zxny1GXASGg%40mail.gmail.com


We do, actually, since commit aa3ca5e3dd in v10.

Not sure it's worth the trouble here. Optimizing COPY in general would 
be good, even small speedups there are helpful because everyone uses 
COPY, but without some evidence I don't believe particular branch is 
even measurable.


- Heikki




Re: Why overhead of SPI is so large?

2019-08-22 Thread Konstantin Knizhnik



On 22.08.2019 3:27, Tsunakawa, Takayuki wrote:

From: Konstantin Knizhnik [mailto:k.knizh...@postgrespro.ru]

PL/pgSQL:   29044.361 ms
C/SPI:  22785.597 ms

The fact that difference between PL/pgSQL and function implemented in C
using SPI is not so large  was expected by me.

This PL/pgSQL overhead is not so significant compared with the three-times 
difference, but it makes me desire some feature like Oracle's ALTER PROCEDURE 
... COMPILE, which compiles the PL/SQL logic to native code.  I've seen a few 
dozen percent speedup.



Actually my implementation of the C/SPI version is not optimal: it is better 
to fetch several records at a time:


    while (true)
    {
        SPI_cursor_fetch(portal, true, 100);
        if (SPI_processed)
        {
            for (i = 0; i < SPI_processed; i++)
            {
                HeapTuple spi_tuple = SPI_tuptable->vals[i];
                Datum val = SPI_getbinval(spi_tuple, SPI_tuptable->tupdesc,
                                          1, &is_null);
                double x = DatumGetFloat8(val);

                result += x * x;
                SPI_freetuple(spi_tuple);
            }
            SPI_freetuptable(SPI_tuptable);
        }
        else
            break;
    }

This version shows a result of 9405.694 ms, which is comparable with the result 
of the SQL query.
Unfortunately (or fortunately) PL/pgSQL is already using prefetch. If it 
is disabled (when iterating through an explicitly created cursor), the time of 
query execution is increased almost twice (46552.935 ms).


So the PL/SPI ratio is more than three times.

Updated results are the following:


Implementation    time (ms)
PL/Lua                32220
PL/pgSQL              29044
PL/pgSQL (JIT)        27594
C/SPI                  9406
SQL                    7399
SQL (JIT)              5532
C/coreAPI              2873

--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-22 Thread Kyotaro Horiguchi
Hello.

At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch  wrote in 
<20190820060314.ga3086...@rfd.leadboat.com>
> On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch  wrote in 
> > <20190818035230.gb3021...@rfd.leadboat.com>
> > > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
> > 
> > Now TwoPhaseFileHeader has two new members for (commit-time)
> > pending syncs. Pending-syncs are useless on wal-replay, but that
> > is needed for commit-prepared.
> 
> There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql
> command, which is far too late to be syncing new relation files.  (A crash may
> have already destroyed their data.)  PrepareTransaction(), which implements
> the PREPARE TRANSACTION command, is the right place for these syncs.
> 
> A failure in these new syncs needs to prevent the transaction from being
> marked committed.  Hence, in CommitTransaction(), these new syncs need to

Agreed.

> happen after the last step that could create assign a new relfilenode and
> before RecordTransactionCommit().  I suspect it's best to do it after
> PreCommit_on_commit_actions() and before AtEOXact_LargeObject().

I don't find an obvious problem there. Since pending deletes and
pending syncs are separately processed, I'm planning to make a
separate list for syncs from deletes.

> > > This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL 
> > > for
> > > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There 
> > > may be
> > > no bug today, but it's conceptually wrong to make RelationNeedsWAL() 
> > > return
> > > false due to this code, use RelationNeedsWAL() for multiple forks, and 
> > > then
> > > not actually sync all forks.
> > 
> > I agree that all forks need syncing, but FSM and VM are checking
> > RelationNeedsWAL(modified). To make sure, are you suggesting to
> > sync all forks instead of emitting WAL for them, or suggesting
> > that VM and FSM to emit WALs even when the modified
> > RelationNeedsWAL returns false (+ sync all forks)?
> 
> I hadn't thought that far.  What do you think is best?

As in the latest patch, sync ALL forks then no WALs. We could
skip syncing the FSM but I'm not sure it's worth doing.


> > > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another 
> > > component
> > > not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the 
> > > relation,
> > > or if it's smaller than some threshold, WAL-log the contents of the whole 
> > > file
> > > at that point."  Please write the part to WAL-log the contents of small 
> > > files
> > > instead of syncing them.
> > 
> > I'm not sure of the point of the behavior. I suppose that the "log"
> > is a sequence of new_page records. It also needs to be synced and
> > it is always larger than the file to be synced. I can't think of
> > an appropriate threshold without the point.
> 
> Yes, it would be a sequence of new-page records.  FlushRelationBuffers() locks
> every buffer header containing a buffer of the current database.  The belief
> has been that writing one page to xlog is cheaper than FlushRelationBuffers()
> in a busy system with large shared_buffers.

I'm at a loss.. The decision between WAL and sync is made at
commit time, when we no longer have a pin on a buffer. When
emitting WAL, opposite to the assumption, lock needs to be
re-acquired for every page to emit log_new_page. What is worse,
we may need to reload evicted buffers.  If the file has been
CopyFrom'ed, the ring buffer strategy makes the situation even
worse. That doesn't seem cheap at all..

If there were any chance of WAL winning for smaller files here, it would
be for files smaller than the ring size of the bulk-write
strategy (16MB).

If we pick up every buffer page of the file instead of scanning
through all buffers, that makes things worse by conflicts on
partition locks.

Any thoughts?



# Sorry time's up today.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Remove page-read callback from XLogReaderState.

2019-08-22 Thread Heikki Linnakangas

On 22/08/2019 04:43, Kyotaro Horiguchi wrote:

At Mon, 29 Jul 2019 22:39:57 +0300, Heikki Linnakangas  wrote in 


On 12/07/2019 10:10, Kyotaro Horiguchi wrote:
* XLogReaderState->readBuf is now allocated and controlled by the
* caller, not by xlogreader.c itself. When XLogReadRecord() needs data,
* the caller makes the data available in readBuf, which can point to the
* same buffer in all calls, or the caller may allocate a new buffer, or
* it may point to a part of a larger buffer, whatever is convenient for
* the caller. (Currently, all callers just allocate a BLCKSZ'd buffer,
* though). The caller also sets readPagPtr, readLen and readPageTLI to
* tell XLogReadRecord() what's in the buffer. So all these read* fields
* are now set by the caller, XLogReadRecord() only reads them.


The caller knows how many bytes need to be read. So having the caller provide
the required buffer seems reasonable.


I also had in mind that the caller could provide a larger buffer, 
spanning multiple pages, in one XLogReadRecord() call. It might be 
convenient to load a whole WAL file in memory and pass it to 
XLogReadRecord() in one call, for example. I think the interface would 
now allow that, but the code won't actually take advantage of that. 
XLogReadRecord() will always ask for one page at a time, and I think it 
will ask the caller for more data between each page, even if the caller 
supplies more than one page in one call.



I'm not sure how intelligible this patch is in its current state. But
I think the general idea is good. I plan to clean it up further next
week, but feel free to work on it before that, either based on this
patch or by starting afresh from your patch set.


I think your diff is intelligible enough for me. I'll take this if
you haven't done it already. Anyway, I'm staring at this.


Thanks! I did actually spend some time on this last week, but got 
distracted by something else before finishing it up and posting a patch. 
Here's a snapshot of what I have in my local branch. It seems to pass 
"make check-world", but probably needs some more cleanup.


Main changes since last version:

* I changed the interface so that there is a new function to set the 
starting position, XLogBeginRead(), and XLogReadRecord() always 
continues from where it left off. I think that's more clear than having 
a starting point argument in XLogReadRecord(), which was only set on the 
first call. It makes the calling code more clear, too, IMO.


* Refactored the implementation of XLogFindNextRecord(). 
XLogFindNextRecord() is now a sibling function of XLogBeginRead(). It 
sets the starting point like XLogBeginRead(). The difference is that 
with XLogFindNextRecord(), the starting point doesn't need to point to a 
valid record, it will "fast forward" to the next valid record after the 
point. The "fast forwarding" is done in an extra state in the state 
machine in XLogReadRecord().


* I refactored XLogReadRecord() and the internal XLogNeedData() 
function. XLogNeedData() used to contain logic for verifying segment and 
page headers. That works quite differently now. Checking the segment 
header is now a new state in the state machine, and the page header is 
verified at the top of XLogReadRecord(), whenever the caller provides 
new data. I think that makes the code in XLogReadRecord() more clear.


- Heikki
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 477709bbc23..8ecb5ea55c5 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1386,15 +1386,21 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
 
-	xlogreader = XLogReaderAllocate(wal_segment_size, &read_local_xlog_page,
-	NULL);
+	xlogreader = XLogReaderAllocate(wal_segment_size);
 	if (!xlogreader)
 		ereport(ERROR,
 (errcode(ERRCODE_OUT_OF_MEMORY),
  errmsg("out of memory"),
  errdetail("Failed while allocating a WAL reading processor.")));
+	xlogreader->readBuf = palloc(XLOG_BLCKSZ);
+
+	XLogBeginRead(xlogreader, lsn);
+	while (XLogReadRecord(xlogreader, &record, &errormsg) ==
+		   XLREAD_NEED_DATA)
+	{
+		read_local_xlog_page(xlogreader);
+	}
 
-	record = XLogReadRecord(xlogreader, lsn, &errormsg);
 	if (record == NULL)
 		ereport(ERROR,
 (errcode_for_file_access(),
@@ -1416,6 +1422,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	*buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
 	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
 
+	pfree(xlogreader->readBuf);
 	XLogReaderFree(xlogreader);
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e651a841bbe..1bb303a90dc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -803,13 +803,6 @@ static XLogSource readSource = 0;	/* XLOG_FROM_* code */
 static XLogSource currentSource = 0;	/* XLOG_FROM_* code */
 static bool

mingw32 floating point diff

2019-08-22 Thread Peter Eisentraut
On 2019-08-20 14:59, Peter Eisentraut wrote:
> Running the regression tests on mingw32, I get the following diff in
> circle.out:
> 
> @@ -111,8 +111,8 @@
>WHERE (c1.f1 < c2.f1) AND ((c1.f1 <-> c2.f1) > 0)
>ORDER BY distance, area(c1.f1), area(c2.f1);
>   five |  one   |  two   | distance
> ---+++--
> -  | <(3,5),0>  | <(1,2),3>  | 0.60555127546399
> +--+++---
> +  | <(3,5),0>  | <(1,2),3>  | 0.605551275463989
>| <(3,5),0>  | <(5,1),3>  | 1.47213595499958
>| <(100,200),10> | <(100,1),115>  |   74
>| <(100,200),10> | <(1,2),100>| 111.370729772479
> 
> I only get this on master/PG12, but not on PG11, so I suspect that the
> new floating-point output routines could be the root of the issue.
> 
> This happens only with the 32-bit build (mingw32), but not with a 64-bit
> build (mingw64).

OK, the problem isn't the new output routines.  The result of the
computations is actually different.  The test itself is new in PG12.

The difference in output is due to the mingw32 target using -mfpmath=387
by default.  If you build with -mfpmath=sse -msse, the tests pass again.

Do we care to do anything about this?  Pick slightly different test data
perhaps?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services






Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2019-08-22 Thread Paul Guo
Thanks. I updated the patch to v5. It passes install-check testing and
recovery testing.

On Fri, Aug 2, 2019 at 6:38 AM Thomas Munro  wrote:

> On Mon, Jul 15, 2019 at 10:52 PM Paul Guo  wrote:
> > Please see the attached v4 patch.
>
> While moving this to the next CF, I noticed that this needs updating
> for the new pg_list.h API.
>
> --
> Thomas Munro
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__enterprisedb.com&d=DwIBaQ&c=lnl9vOaLMzsy2niBC8-h_K-7QJuNJEsFrzdndhuJ3Sw&r=Usi0ex6Ch92MsB5QQDgYFw&m=1zhC6VaaS7Ximav7vaUXMUt6EGjrVZpNZut32ug7LDI&s=jSDXnTPIW4WNZCCZ_HIbu7gZ3apEBx36DCeNeNuhLpY&e=
>


v5-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch
Description: Binary data


Re: mingw32 floating point diff

2019-08-22 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-08-20 14:59, Peter Eisentraut wrote:
>> Running the regression tests on mingw32, I get the following diff in
>> circle.out:
...
> OK, the problem isn't the new output routines.  The result of the
> computations is actually different.  The test itself is new in PG12.
> The difference in output is due to the mingw32 target using -mfpmath=387
> by default.  If you build with -mfpmath=sse -msse, the tests pass again.

Hm, so presumably we could replicate this on other Intel-oid platforms
by changing compiler switches?  (I haven't tried.)

> Do we care to do anything about this?  Pick slightly different test data
> perhaps?

Picking different test data might be a good "fix".  Alternatively, we
could try to figure out where the discrepancy is arising and adjust
the code --- but that might be a lot more work than it's worth.

regards, tom lane




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Robert Haas
On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar  wrote:
> Yeah, we can handle the bulk fetch as you suggested and it will make
> it a lot easier.  But, currently while registering the undo request
> (especially during the first pass) we need to compute the from_urecptr
> and the to_urecptr. And,  for computing the from_urecptr,  we have the
> end location of the transaction because we have the uur_next in the
> transaction header and that will tell us the end of our transaction
> but we still don't know the undo record pointer of the last record of
> the transaction.  As of now, we read the previous 2 bytes from the end of
> the transaction to know the length of the last record and from there
> we can compute the undo record pointer of the last record and that is
> our from_urecptr.

I don't understand this.  If we're registering an undo request at "do"
time, we don't need to compute the starting location; we can just
remember the UndoRecPtr of the first record we inserted.  If we're
reregistering an undo request after a restart, we can (and, I think,
should) work forward from the discard location rather than backward
from the insert location.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Fault injection framework

2019-08-22 Thread Asim R P
Hello

Fault injection was discussed a few months ago at PGCon in Ottawa.  At
least a few folks showed interest and so I would like to present what we
have been using in Greenplum.

The attached patch set contains the fault injector framework ported to
PostgreSQL master.  It  provides ability to define points of interest in
backend code and then inject faults at those points from SQL.  Also
included is an isolation test to simulate a speculative insert conflict
scenario that was found to be rather cumbersome to implement using advisory
locks, see [1].  The alternative isolation spec using fault injectors seems
much simpler to understand.

Asim

[1] CAAKRu_a7hbyrk=wveHYhr4LbcRnRCG=ypuvoqyb9yo1cdub...@mail.gmail.com



0001-Framework-to-inject-faults-from-SQL-tests.patch
Description: Binary data


0003-Speculative-insert-isolation-test-spec-using-fault-i.patch
Description: Binary data


0004-Run-tests-with-faults-if-faultinjector-was-compiled-.patch
Description: Binary data


0002-Add-syntax-to-declare-a-step-that-is-expected-to-blo.patch
Description: Binary data


0005-Isolation-schedule-for-tests-that-inject-faults.patch
Description: Binary data


Re: Why overhead of SPI is so large?

2019-08-22 Thread Konstantin Knizhnik

Some more information...
First of all, I found out that marking a PL/pgSQL function as immutable
significantly increases the speed of its execution:
19808 ms vs. 27594 ms. It happens because exec_eval_simple_expr takes a
snapshot if the function is volatile (the default).
I wonder if the PL/pgSQL compiler can detect that the evaluated expression
itself is actually immutable and that there is no need to take a snapshot
for each invocation of this function.
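Roughly, the relevant logic in pl_exec.c's exec_eval_simple_expr() looks like this
(a paraphrased sketch, not the verbatim source):

	/*
	 * Paraphrased sketch: a snapshot is pushed around every evaluation of a
	 * "simple" expression unless the function was declared STABLE or
	 * IMMUTABLE (estate->readonly_func), which is why the IMMUTABLE variant
	 * is measurably faster here.
	 */
	if (!estate->readonly_func)
	{
		CommandCounterIncrement();
		PushActiveSnapshot(GetTransactionSnapshot());
	}

	/* ... evaluate the expression with the executor ... */

	if (!estate->readonly_func)
		PopActiveSnapshot();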
Also I have tried yet another PL language, JavaScript, which is now the new
outsider, despite the fact that the v8 JIT compiler is very good.

Implementation    time (ms)
PL/v8                 41550
PL/Lua                32220
PL/pgSQL              19808
C/SPI                  9406
SQL                    7399
SQL (JIT)              5532
C/coreAPI              2873

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Asymmetric partition-wise JOIN

2019-08-22 Thread Kohei KaiGai
Hello,

Even though nobody has responded to the thread, I tried to make a prototype of
the asymmetric partition-wise join support.
This feature tries to join a non-partitioned and a partitioned relation
before the Append.

See the example below:

create table ptable (dist int, a int, b int) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3,
remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3,
remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3,
remainder 2);
create table t1 (aid int, label text);
create table t2 (bid int, label text);
insert into ptable (select x, (1000*random())::int,
(1000*random())::int from generate_series(1,100) x);
insert into t1 (select x, md5(x::text) from generate_series(1,50) x);
insert into t2 (select x, md5(x::text) from generate_series(1,50) x);
vacuum analyze ptable;
vacuum analyze t1;
vacuum analyze t2;

ptable.a has values between 0 and 1000, and t1.aid has values between 1 and 50.
Therefore, a join of ptable and t1 on a = aid can reduce the rows by almost 95%.
On the other hand, t1 is not partitioned and the join keys are not partition keys.
So, Append must process a million rows first, then HashJoin processes
the rows read from the partitioned table, and 95% of them are eventually dropped.
In other words, 95% of the work done by Append is a waste of time and CPU cycles.

postgres=# explain select * from ptable, t1 where a = aid;
  QUERY PLAN
--
 Hash Join  (cost=2.12..24658.62 rows=49950 width=49)
   Hash Cond: (ptable_p0.a = t1.aid)
   ->  Append  (cost=0.00..20407.00 rows=100 width=12)
 ->  Seq Scan on ptable_p0  (cost=0.00..5134.63 rows=333263 width=12)
 ->  Seq Scan on ptable_p1  (cost=0.00..5137.97 rows=333497 width=12)
 ->  Seq Scan on ptable_p2  (cost=0.00..5134.40 rows=333240 width=12)
   ->  Hash  (cost=1.50..1.50 rows=50 width=37)
 ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
(8 rows)

The asymmetric partitionwise join allows to join non-partitioned tables and
partitioned tables prior to Append.

postgres=# set enable_partitionwise_join = on;
SET
postgres=# explain select * from ptable, t1 where a = aid;
  QUERY PLAN
--
 Append  (cost=2.12..19912.62 rows=49950 width=49)
   ->  Hash Join  (cost=2.12..6552.96 rows=16647 width=49)
 Hash Cond: (ptable_p0.a = t1.aid)
 ->  Seq Scan on ptable_p0  (cost=0.00..5134.63 rows=333263 width=12)
 ->  Hash  (cost=1.50..1.50 rows=50 width=37)
   ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
   ->  Hash Join  (cost=2.12..6557.29 rows=16658 width=49)
 Hash Cond: (ptable_p1.a = t1.aid)
 ->  Seq Scan on ptable_p1  (cost=0.00..5137.97 rows=333497 width=12)
 ->  Hash  (cost=1.50..1.50 rows=50 width=37)
   ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
   ->  Hash Join  (cost=2.12..6552.62 rows=16645 width=49)
 Hash Cond: (ptable_p2.a = t1.aid)
 ->  Seq Scan on ptable_p2  (cost=0.00..5134.40 rows=333240 width=12)
 ->  Hash  (cost=1.50..1.50 rows=50 width=37)
   ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
(16 rows)

We can consider the table join ptable X t1 above to be equivalent to:
  (ptable_p0 + ptable_p1 + ptable_p2) X t1
= (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
It returns an equivalent result; however, rows are already reduced by the HashJoin
in each individual leaf of the Append, so the CPU cycles consumed by the Append
node can be cheaper.

On the other hand, it has a downside: t1 must be read 3 times and the
hash table must also be built 3 times. This increases the expected cost,
so the planner
may not choose the asymmetric partition-wise join plan.

One idea I have is that sibling HashJoins share a hash table that was built once
by any of the sibling Hash plans. Right now, it is not implemented yet.

How about your thought for this feature?

Best regards,

2019年8月12日(月) 15:03 Kohei KaiGai :
>
> Hello,
>
> The PostgreSQL optimizer right now considers join pairs on only
> non-partition - non-partition or
> partition-leaf - partition-leaf relations. On the other hand, it is
> harmless and makes sense to
> consider a join pair on non-partition - partition-leaf.
>
> See the example below. ptable is partitioned by hash, and contains 10M
> rows. ftable is not
> partitioned and contains 50 rows. Most of ptable::fkey shall not have
> matched rows in this
> join.
>
> create table ptable (fkey int, dist text) partition by hash (dist);
> create table ptable_p0 partition of ptable for values with (modulus 3,
> remainder 0);
> create table ptable_p1 partition of ptable for values with (modulus 3,
> remainder 1);
> create table ptable_p2 partition of ptable for values with (modulu

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2019-08-22 Thread Anastasia Lubennikova

22.08.2019 16:13, Paul Guo wrote:
Thanks. I updated the patch to v5. It passes install-check testing and 
recovery testing.

Hi,
Thank you for working on this fix.
The overall design of the latest version looks good to me.
But during the review, I found a bug in the current implementation.
The new behavior must apply to crash recovery only; now it applies to
archive recovery too.
That can cause a silent loss of a tablespace during regular standby
operation,
since it never calls CheckRecoveryConsistency().

Steps to reproduce:
1) run master and replica
2) create dir for tablespace:
mkdir  /tmp/tblspc1

3) create tablespace and database on the master:
create tablespace tblspc1 location '/tmp/tblspc1';
create database db1 tablespace tblspc1 ;

4) wait for replica to receive this changes and pause replication:
select pg_wal_replay_pause();

5) move replica's tablespace symlink to some empty directory, i.e. 
/tmp/tblspc2

mkdir  /tmp/tblspc2
ln -sfn /tmp/tblspc2 postgresql_data_replica/pg_tblspc/16384

6) create another database in tblspc1 on master:
create database db2 tablespace tblspc1 ;

7) resume replication on standby:
select pg_wal_replay_resume();

8) try to connect to db2 on standby

It's expected that dbase_redo() will fail because the directory on the
standby is not found.
With the patch, however, the error is suppressed until we attempt to connect
to db2 on the standby:


2019-08-22 18:34:39.178 MSK [21066] HINT:  Execute 
pg_wal_replay_resume() to continue.
2019-08-22 18:42:41.656 MSK [21066] WARNING:  Skip creating database 
directory "pg_tblspc/16384/PG_13_201908012". The dest tablespace may 
have been removed before abnormal shutdown. If the removal is illegal 
after later checking we will panic.
2019-08-22 18:42:41.656 MSK [21066] CONTEXT:  WAL redo at 0/3027738 for 
Database/CREATE: copy dir base/1 to pg_tblspc/16384/PG_13_201908012/16390
2019-08-22 18:42:46.096 MSK [21688] FATAL: 
"pg_tblspc/16384/PG_13_201908012/16390" is not a valid data directory
2019-08-22 18:42:46.096 MSK [21688] DETAIL:  File 
"pg_tblspc/16384/PG_13_201908012/16390/PG_VERSION" is missing.


Also some nitpicking about code style:
1) Please add a comment to forget_missing_directory().

2) +   elog(LOG, "Directory \"%s\" was missing during 
directory copying "

I think we'd better update this message elevel to WARNING.

3) Shouldn't we also move the FlushDatabaseBuffers(xlrec->src_db_id) call under
    the if (do_copydir) clause?
I don't see a reason to flush pages that we won't use later.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 9:21 PM Dilip Kumar  wrote:
>
> On Thu, Aug 22, 2019 at 7:34 PM Robert Haas  wrote:
> >
> > On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar  wrote:
> > > Yeah, we can handle the bulk fetch as you suggested and it will make
> > > it a lot easier.  But, currently while registering the undo request
> > > (especially during the first pass) we need to compute the from_urecptr
> > > and the to_urecptr. And,  for computing the from_urecptr,  we have the
> > > end location of the transaction because we have the uur_next in the
> > > transaction header and that will tell us the end of our transaction
> > > but we still don't know the undo record pointer of the last record of
> > > the transaction.  As of now, we read the previous 2 bytes from the end of
> > > the transaction to know the length of the last record and from there
> > > we can compute the undo record pointer of the last record and that is
> > > our from_urecptr.
> >
> > I don't understand this.  If we're registering an undo request at "do"
> > time, we don't need to compute the starting location; we can just
> > remember the UndoRecPtr of the first record we inserted.  If we're
> > reregistering an undo request after a restart, we can (and, I think,
> > should) work forward from the discard location rather than backward
> > from the insert location.
>
> Right, we work forward from the discard location.  So after the
> discard location, while traversing the undo log, when we encounter an
> aborted transaction we need to register its rollback request.  And
> for doing that we need 1) the start location of the first undo record, and 2)
> the start location of the last undo record (the last undo record pointer).
>
> We already have 1).  But we have to compute 2).   For doing that if we
> unpack the first undo record we will know the start of the next
> transaction.  From there if we read the last two bytes then that will
> have the length of the last undo record of our transaction.  So we can
> compute 2) with below formula
>
> start of the last undo record = start of the next transaction - length
> of our transaction's last record.

Maybe I am saying that because I am just thinking how the requests are
registered as per the current code.  But, those requests will
ultimately be used for collecting the record by the bulk fetch.  So if
we are planning to change the bulk fetch to read forward then maybe we
don't need the valid last undo record pointer because that we will
anyway get while processing forward.  So just knowing the end of the
transaction is sufficient for us to know where to stop.  I am not sure
if this solution has any problem.  Probably  I should think again in
the morning when my mind is well-rested.
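For what it's worth, the "start of the last undo record = start of the next
transaction - length of our transaction's last record" computation quoted above
boils down to something like this sketch (the types, names, and whether the
stored length includes the trailing length field itself are assumptions, not the
actual patch code):

#include <stdint.h>
#include <string.h>

typedef uint64_t UndoRecPtr;	/* assumed: an undo record pointer is a 64-bit location */

/*
 * Hypothetical sketch: recover the pointer of a transaction's last undo
 * record from the start of the *next* transaction, using the 2-byte record
 * length stored at the very end of each record.  'next_xact_bytes' is the
 * in-memory address corresponding to next_xact_start.
 */
static UndoRecPtr
last_record_of_xact(UndoRecPtr next_xact_start, const char *next_xact_bytes)
{
	uint16_t	last_rec_len;

	/* the two bytes just before the next transaction hold the last record's length */
	memcpy(&last_rec_len, next_xact_bytes - sizeof(uint16_t), sizeof(uint16_t));

	return next_xact_start - last_rec_len;	/* the from_urecptr */
}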

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: mingw32 floating point diff

2019-08-22 Thread Tom Lane
I wrote:
> Peter Eisentraut  writes:
>> Do we care to do anything about this?  Pick slightly different test data
>> perhaps?

> Picking different test data might be a good "fix".  Alternatively, we
> could try to figure out where the discrepancy is arising and adjust
> the code --- but that might be a lot more work than it's worth.

I poked at this a bit more.  I can reproduce the problem by using 
-mfpmath=387 on dromedary's host (fairly old 32-bit macOS); although
I also get half a dozen *other* failures in the core regression tests,
mostly around detection of float overflow.  So I'm not quite sure that
this is comparable.  But at any rate, I tracked the core of the problem
to pg_hypot:

/* Determine the hypotenuse */
yx = y / x;
result = x * sqrt(1.0 + (yx * yx));

With -mfpmath=387, these calculations are done to more-than-64-bit
precision, yielding a different end result --- note in particular
that sqrt() is a hardware instruction on this platform, so it's
not rounding either.

I experimented with preventing that by using volatile intermediate
variables (cf comments in float.c); but it seemed like a mess,
and it would likely pessimize the code more than we want for other
platforms, and it's kind of hard to argue that deliberately sabotaging
the more-accurate computation is an improvement.
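
For concreteness, that kind of change would look roughly like the sketch
below -- only an illustration of the volatile-intermediate trick used
elsewhere in float.c, with the NaN/Inf handling and the swap that makes x
the larger operand omitted; not something being proposed for commit:

#include <math.h>

/*
 * Sketch only: force the intermediates in pg_hypot() to be rounded to
 * 64-bit doubles by storing them in volatile variables, defeating the
 * extended-precision x87 evaluation described above.
 */
static double
pg_hypot_sketch(double x, double y)
{
    volatile double yx = y / x;
    volatile double sum = 1.0 + yx * yx;

    return x * sqrt(sum);
}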

What I suggest doing is reducing extra_float_digits to -1 for this
specific test.  Changing the contents of circle_tbl seems like it'd have
more consequences than we want, in particular there's no guarantee that
we'd not hit similar issues in other tests if they're given different
inputs.

regards, tom lane




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Andres Freund
Hi

On August 22, 2019 9:14:10 AM PDT, Dilip Kumar  wrote:
> But, those requests will
>ultimately be used for collecting the record by the bulk fetch.  So if
>we are planning to change the bulk fetch to read forward then maybe we
>don't need the valid last undo record pointer because that we will
>anyway get while processing forward.  So just knowing the end of the
>transaction is sufficient for us to know where to stop.  I am not sure
>if this solution has any problem.  Probably  I should think again in
>the morning when my mind is well-rested.

I don't think we can easily do so for bulk apply without incurring significant
overhead. It's pretty cheap to read in forward order and then process backwards
on a page level - but for an entire transaction's undo the story is different.
We can't necessarily keep all of it in memory, so we'd have to read the undo
twice to find the end. Right?

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Robert Haas
On Thu, Aug 22, 2019 at 12:54 AM Andres Freund  wrote:
> But why? It makes a *lot* more sense to have it in the beginning. I
> don't think bulk-fetch really requires it to be in the end - we can
> still process records forward on a page-by-page basis.

There are two separate needs here: to be able to go forward, and to be
able to go backward.  We have the length at the end of each record not
because we're stupid, but so that we can back up.  If we have another
way of backing up, then the thing to do is not to move that to
beginning of the record but to remove it entirely as unnecessary
wastage.  We can also think about how to improve forward traversal.

Considering each problem separately:

For forward traversal, we could simplify things somewhat by having
only 3 decoding stages instead of N decoding stages.  We really only
need (1) a stage for accumulating bytes until we have uur_info, and
then (2) a stage for accumulating bytes until we know the payload and
tuple lengths, and then (3) a stage for accumulating bytes until we
have the whole record.  We have a lot more stages than that right now
but I don't think we really need them for anything. Originally we had
them so that we could do incremental decoding to find the transaction
header in the record, but now that the transaction header is at a
fixed offset, I think the multiplicity of stages is just baggage.

We could simplify things more by deciding that the first two bytes of
the record are going to contain the record size. That would increase
the size of the record by 2 bytes, but we could (mostly) claw those
bytes back by not storing the size of both uur_payload and uur_tuple.
The size of the other one would be computed by subtraction: take the
total record size, subtract the size of whichever of those two things
we store, subtract the mandatory and optional headers that are
present, and the rest must be the other value. That would still add 2
bytes for records that contain neither a payload nor a tuple, but that
would probably be OK given that (a) a lot of records wouldn't be
affected, (b) the code would get simpler, and (c) something like this
seems necessary anyway given that we want to make the record format
more generic. With this approach instead of 3 stages we only need 2:
(1) accumulating bytes until we have the 2-byte length word, and (2)
accumulating bytes until we have the whole record.
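
To illustrate that scheme (all structure and field names below are made up
for this sketch; they are not the actual undo patch definitions):

#include "postgres.h"

/*
 * Hypothetical record layout: the record leads with its total length,
 * only the payload length is stored, and the tuple length is derived
 * by subtraction.
 */
typedef struct UndoRecordSketch
{
    uint16      urec_totallen;      /* first 2 bytes: whole record size */
    uint8       urec_info;          /* which optional headers are present */
    /* ... mandatory and optional headers would follow ... */
    uint16      urec_payloadlen;    /* stored; tuple length is not stored */
    /* payload bytes, then tuple bytes */
} UndoRecordSketch;

static inline uint16
undo_tuple_len(uint16 totallen, uint16 headerlen, uint16 payloadlen)
{
    /* tuple length = total - headers - payload */
    return totallen - headerlen - payloadlen;
}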

For backward traversal, as I see it, there are basically two options.
One is to do what we're doing right now, and store the record length
at the end of the record. (That might mean that a record both begins
and ends with its own length, which is not a crazy design.) The other
is to do what I think you are proposing here: locate the beginning of
the first record on the page, presumably based on some information
stored in the page header, and then work forward through the page to
figure out where all the records start.  Then process them in reverse
order. That saves 2 bytes per record.  It's a little more expensive in
terms of CPU cycles, especially if you only need some of the records
on the page but not all of them, but that's probably not too bad.
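
As an illustration of that second option, the page-level pass could look
something like this (the page-header fields and the 2-byte length at the
start of each record are assumptions of this sketch, not the patch as it
stands):

#include "postgres.h"

#include <string.h>

/*
 * Sketch: collect the start offset of every record on an undo page by
 * walking it forward once; the caller then visits offsets[n-1] .. offsets[0]
 * to process the records in reverse order.
 */
static int
collect_record_offsets(const char *page, int first_record_off, int page_used,
                       uint16 *offsets, int max_offsets)
{
    int         off = first_record_off;
    int         n = 0;

    while (off < page_used && n < max_offsets)
    {
        uint16      reclen;

        memcpy(&reclen, page + off, sizeof(uint16));
        if (reclen == 0)
            break;              /* defensive: don't loop on garbage */
        offsets[n++] = (uint16) off;
        off += reclen;
    }

    return n;
}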

I think I'm basically agreeing with what you are proposing but I think
it's important to spell out the underlying concerns, because otherwise
I'm afraid we might think we have a meeting of the minds when we don't
really.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Why overhead of SPI is so large?

2019-08-22 Thread Pavel Stehule
On Thu, Aug 22, 2019 at 5:51 PM Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

> Some more information...
> First of all I found out that marking a PL/pgSQL function as immutable
> significantly increases the speed of its execution:
> 19808 ms vs. 27594 ms. It happens because exec_eval_simple_expr takes a
> snapshot if the function is volatile (the default).
> I wonder if the PL/pgSQL compiler can detect that the evaluated expression
> itself is actually immutable and that there is no need to take a snapshot
> for each invocation of this function. Also I have tried yet another PL
> language - JavaScript, which is now the new outsider, despite the fact that
> the v8 JIT compiler is very good.
>

I have a plan to do some work in this direction. A snapshot is not necessary
for almost all built-in functions. If an expression calls only built-in
functions, then it can probably be evaluated without taking a snapshot and
without any work with the plan cache.
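
A minimal sketch of that idea, assuming the PL/pgSQL compiler could mark an
expression as calling only built-in, snapshot-free functions (the flag and
the function name here are illustrative, not the real pl_exec.c code):

#include "postgres.h"

#include "executor/executor.h"
#include "utils/snapmgr.h"

/*
 * Illustrative only: evaluate a simple expression, taking a snapshot only
 * when the expression is not known to be snapshot-free.
 */
static Datum
eval_simple_expr_sketch(ExprState *expr, ExprContext *econtext,
                        bool expr_is_snapshot_free, bool *isnull)
{
    Datum       result;

    if (!expr_is_snapshot_free)
        PushActiveSnapshot(GetTransactionSnapshot());

    result = ExecEvalExpr(expr, econtext, isnull);

    if (!expr_is_snapshot_free)
        PopActiveSnapshot();

    return result;
}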

Pavel

>
> Implementation  | time (ms)
> ----------------+----------
> PL/v8           |     41550
> PL/Lua          |     32220
> PL/pgSQL        |     19808
> C/SPI           |      9406
> SQL             |      7399
> SQL (JIT)       |      5532
> C/coreAPI       |      2873
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 7:34 PM Robert Haas  wrote:
>
> On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar  wrote:
> > Yeah, we can handle the bulk fetch as you suggested and it will make
> > it a lot easier.  But, currently while registering the undo request
> > (especially during the first pass) we need to compute the from_urecptr
> > and the to_urecptr. And,  for computing the from_urecptr,  we have the
> > end location of the transaction because we have the uur_next in the
> > transaction header and that will tell us the end of our transaction
> > but we still don't know the undo record pointer of the last record of
> > the transaction.  As of know, we read previous 2 bytes from the end of
> > the transaction to know the length of the last record and from there
> > we can compute the undo record pointer of the last record and that is
> > our from_urecptr.=
>
> I don't understand this.  If we're registering an undo request at "do"
> time, we don't need to compute the starting location; we can just
> remember the UndoRecPtr of the first record we inserted.  If we're
> reregistering an undo request after a restart, we can (and, I think,
> should) work forward from the discard location rather than backward
> from the insert location.

Right, we work forward from the discard location.  So after the
discard location, while traversing the undo log, when we encounter an
aborted transaction we need to register its rollback request.  And
for doing that we need 1) the start location of the first undo record, and
2) the start location of the last undo record (the last undo record pointer).

We already have 1).  But we have to compute 2).  For doing that, if we
unpack the first undo record we will know the start of the next
transaction.  From there, if we read the last two bytes, they will
give us the length of the last undo record of our transaction.  So we can
compute 2) with the formula below:

start of the last undo record = start of the next transaction - length
of our transaction's last record.

Am I making sense here?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Cleanup isolation specs from unused steps

2019-08-22 Thread Melanie Plageman
On Wed, Aug 21, 2019 at 12:16 PM Alvaro Herrera 
wrote:

> On 2019-Aug-21, Melanie Plageman wrote:
>
> > In Greenplum, we mainly add new tests to a separate isolation
> > framework (called isolation2) which uses a completely different
> > syntax. It doesn't use isolationtester at all. So, I haven't had a use
> > case to add long options to isolationtester yet :)
>
> Is that other framework somehow more capable?
>

So, there is some historical context as to why it is a separate test suite.
And some of the differences are specific to Greenplum -- e.g. needing to
connect to a specific database in "utility mode" to do something.

However, the other differences are actually pretty handy and would be
applicable to upstream as well.
We use a different syntax than the isolation framework and have some nice
features. Most notably, explicit control over blocking.

The syntax for what would be a "step" in isolation is like this:

[<#>[flag]:]  | ! 

where # is the session number and flags include the following:

&: expect blocking behavior
>: running in background without blocking
<: join an existing session
q: quit the given session

See the script [1] for parsing the test cases for more details on the
syntax and
capabilities (it is in Python).

[1]
https://github.com/greenplum-db/gpdb/blob/master/src/test/isolation2/sql_isolation_testcase.py

-- 
Melanie Plageman


Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 9:55 PM Andres Freund  wrote:
>
> Hi
>
> On August 22, 2019 9:14:10 AM PDT, Dilip Kumar  wrote:
> > But, those requests will
> >ultimately be used for collecting the record by the bulk fetch.  So if
> >we are planning to change the bulk fetch to read forward then maybe we
> >don't need the valid last undo record pointer because that we will
> >anyway get while processing forward.  So just knowing the end of the
> >transaction is sufficient for us to know where to stop.  I am not sure
> >if this solution has any problem.  Probably  I should think again in
> >the morning when my mind is well-rested.
>
> I don't think we can easily do so for bulk apply without incurring 
> significant overhead. It's pretty cheap to read in forward order and then 
> process backwards on a page level - but for an entire transactions undo the 
> story is different. We can't necessarily keep all of it in memory, so we'd 
> have to read the undo twice to find the end. Right?
>
I was not talking about the entire transaction; I was also talking
about the page level, as you suggested.  I was just saying that we may
not need the start position of the last undo record of the transaction
for registering the rollback request (which we currently do).
However, we need to know the end of the transaction to know the last
page from which we need to start reading forward.

Let me explain with an example

Transaction1
first, undo start at 10
first, undo end at 100
second, undo start at 101
second, undo end at 200
..
last, undo start at 1000
last, undo end at  1100

Transaction2
first, undo start at 1101
first, undo end at 1200
second, undo start at 1201
second, undo end at 1300

Suppose we want to register the request for Transaction1.  Then
currently we need to know the start undo record pointer (10 as per the
above example) and the last undo record pointer (1000).  But we only
know the start undo record pointer (10) and the start of the next
transaction (1101).  So for calculating the start of the last record,
we use 1101 - 101 (the length of the last record is stored in the 2 bytes
before 1101).

So now I am saying that maybe we don't need to compute the start of the
last undo record (1000), because it's enough to know the end of the
last undo record (1100).  Because on whichever page the last undo
record ends, we can start from that page and read forward on that
page.

* All numbers I used in the above example can be considered as undo
record pointers.
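
In code form, the two quantities for Transaction1 come out roughly as below
(plain integers stand in for undo record pointers; the names are
illustrative):

#include "postgres.h"

/* start of the last record: 1101 - 101 = 1000 in the example above */
static uint64
last_record_start(uint64 next_xact_start, uint16 last_record_len)
{
    return next_xact_start - last_record_len;
}

/* end of the last record: the byte just before the next transaction, 1100 */
static uint64
last_record_end(uint64 next_xact_start)
{
    return next_xact_start - 1;
}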

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Preserving extension ownership in dump/restore/pg_upgrade

2019-08-22 Thread Tom Lane
Currently, we fail to do $SUBJECT: post-restore, an extension will
be owned by the role used to do the restore.  This is because pg_dump
does not consider that extensions have owners at all, so it doesn't
issue ALTER EXTENSION OWNER TO, which is a command that the server
doesn't have anyway.

We've not gotten complaints about this, AFAIR, suggesting it's only a
minor problem in practice.  Probably most extensions are superuser-owned
anyway.  However, we do theoretically support PL extensions that are
owned by database owners, and if the "trustable extension" idea that
I proposed yesterday[1] gets in, there might soon be a lot more cases
of non-superuser-owned extensions.  Moreover, although pg_upgrade will
preserve the ownership of individual objects within the extension,
a regular dump/restore will not.  This means a database owner would
lose the ability to adjust permissions on a PL after dump/restore,
which seems like a problem.  So I started thinking about how to fix it.

The obvious way to fix it is to implement ALTER EXTENSION OWNER TO,
but that has a big problem: what should we do with the ownership of
the contained objects?  Taking the example of one of the PL extensions,
we'd like it to also change ownership of the procedural language object,
but it *must not* give away ownership of the C-language handler functions,
at least not if the ownership recipient is not a superuser.  (A user who
owns a C-language function can alter its properties to do almost
anything.)  There doesn't seem to be any way to distinguish these cases
except with very ad-hoc, ugly, restrictive code.  I considered proposing
that we *only* change ownership of contained procedural language objects,
but ick.  That's ugly, and it would hobble the usefulness of the
@extowner@ mechanism I proposed in [1] (although possibly that's not
very useful for anything but PLs anyway?).

Maybe, instead of @extowner@ as proposed (which is certainly just a
quick-n-dirty mechanism), we could add some syntax that would explicitly
identify objects whose ownership ought to track that of the extension.
Seems like a lot of work though.

Another idea, which is much uglier conceptually but seems like it
could be done with not much code, is to teach pg_dump/pg_restore
that it must use SET SESSION AUTHORIZATION to set the ownership of
an extension even when it's using ALTER OWNER for everything else.
The main drawback that I can think of is that if the target user
lacks permissions to create the extension in the destination database,
CREATE EXTENSION will fail (and then probably later restore commands
will too), rather than leaving the extension in place with the wrong
owner.

I guess, if we're going to need custom restore code anyway,
we could imagine solving that problem by emitting

SET SESSION AUTHORIZATION joe;
CREATE EXTENSION someextension;
RESET SESSION AUTHORIZATION;
CREATE EXTENSION IF NOT EXISTS someextension;

but man is that ugly.

Binary-upgrade mode has an independent problem:
binary_upgrade_create_empty_extension has a hard-wired assumption that 
it should use GetUserId() for the extension owner.  We could imagine
fixing that by passing the owner role name as a separate argument;
though if we go with the SET SESSION AUTHORIZATION solution for normal
mode, I'm a bit inclined to use it for binary upgrade as well.

I don't find any of these approaches terribly appealing.
Thoughts, better ideas?

regards, tom lane

[1] https://www.postgresql.org/message-id/5889.1566415762%40sss.pgh.pa.us




Re: Allow to_date() and to_timestamp() to accept localized names

2019-08-22 Thread Juan José Santamaría Flecha
On Sun, Aug 18, 2019 at 10:42 AM Juan José Santamaría Flecha
 wrote:
>
> Going through the current items in the wiki's todo list, I have been
> looking into: "Allow to_date () and to_timestamp () to accept
> localized month names".
>

I have gone through a second take on this, trying to give it a better
shape but it would surely benefit from some review, so I will open an
item in the commitfest.

Regards,

Juan José Santamaría Flecha


0001-Allow-localized-month-names-to_date-v1.patch
Description: Binary data


Re: Cleanup isolation specs from unused steps

2019-08-22 Thread Robert Eckhardt
On Thu, Aug 22, 2019 at 1:45 PM Melanie Plageman
 wrote:
>
>
> On Wed, Aug 21, 2019 at 12:16 PM Alvaro Herrera  
> wrote:
>>
>> On 2019-Aug-21, Melanie Plageman wrote:
>>
>> > In Greenplum, we mainly add new tests to a separate isolation
>> > framework (called isolation2) which uses a completely different
>> > syntax. It doesn't use isolationtester at all. So, I haven't had a use
>> > case to add long options to isolationtester yet :)
>>
>> Is that other framework somehow more capable?
>
>
> So, there is some historical context as to why it is a separate test suite.
> And some of the differences are specific to Greenplum -- e.g. needing to 
> connect
> to a specific database in "utility mode" to do something.
>
> However, the other differences are actually pretty handy and would be 
> applicable
> to upstream as well.
> We use a different syntax than the isolation framework and have some nice
> features. Most notably, explicit control over blocking.

Asim submitted this framework just yesterday:
https://www.postgresql.org/message-id/CANXE4TdxdESX1jKw48xet-5GvBFVSq=4cgneiotqff372ko...@mail.gmail.com

-- Rob

>
> The syntax for what would be a "step" in isolation is like this:
>
> [<#>[flag]:]  | ! 
>
> where # is the session number and flags include the following:
>
> &: expect blocking behavior
> >: running in background without blocking
> <: join an existing session
> q: quit the given session
>
> See the script [1] for parsing the test cases for more details on the syntax 
> and
> capabilities (it is in Python).
>
> [1] 
> https://github.com/greenplum-db/gpdb/blob/master/src/test/isolation2/sql_isolation_testcase.py
>
> --
> Melanie Plageman




Infinite wait for SyncRep while handling USR1

2019-08-22 Thread Soumyadeep Chakraborty
Hello Hackers,

There is an edge case in 9_5_STABLE (it doesn't reproduce on 9_6+) that we
found in how backends handle the TERM signal while handling a USR1 signal,
and it can make them wait indefinitely. A backend can end up waiting forever
inside SyncRepWaitForLSN() at:

rc = WaitLatch(MyLatch, WL_LATCH_SET |
WL_POSTMASTER_DEATH, -1, WAIT_EVENT_SYNC_REP);

Even pg_terminate_backend() would not be able to terminate it after it
reaches this point.

Stack trace:

#0  0x7f39418d73c8 in poll () from /lib64/libc.so.6
#1  0x00753d7b in WaitLatchOrSocket (latch=0x7f3940d371cc,
wakeEvents=17, sock=-1, timeout=-1)
at pg_latch.c:333
#2  0x00753b8c in WaitLatch (latch=0x7f3940d371cc, wakeEvents=17,
timeout=-1) at pg_latch.c:197
#3  0x007a0e02 in SyncRepWaitForLSN (XactCommitLSN=167868344) at
syncrep.c:231
#4  0x00518b8a in RecordTransactionCommit () at xact.c:1349
#5  0x0051966d in CommitTransaction () at xact.c:2057
#6  0x0051a1f9 in CommitTransactionCommand () at xact.c:2769
#7  0x0054ff5e in RemoveTempRelationsCallback (code=1, arg=0) at
namespace.c:3878
#8  0x007be3d1 in shmem_exit (code=1) at ipc.c:228
#9  0x007be2c6 in proc_exit_prepare (code=1) at ipc.c:185
#10 0x007be234 in proc_exit (code=1) at ipc.c:102
#11 0x0093152e in errfinish (dummy=0) at elog.c:535
#12 0x007ea492 in ProcessInterrupts () at postgres.c:2913
#13 0x007e9f93 in die (postgres_signal_arg=15) at postgres.c:2682
#14 
#15 procsignal_sigusr1_handler (postgres_signal_arg=10) at procsignal.c:271
#16 
#17 SocketBackend (inBuf=0x7ffc8dde0f80) at postgres.c:353
#18 0x007e70ca in ReadCommand (inBuf=0x7ffc8dde0f80) at
postgres.c:510
#19 0x007eb990 in PostgresMain (argc=1, argv=0x1708cd0,
dbname=0x1708b38 "gpadmin",
username=0x1708b18 "gpadmin") at postgres.c:4032
#20 0x00769922 in BackendRun (port=0x172e060) at postmaster.c:4309
#21 0x00769086 in BackendStartup (port=0x172e060) at
postmaster.c:3983
#22 0x007657dc in ServerLoop () at postmaster.c:1706
#23 0x00764e16 in PostmasterMain (argc=3, argv=0x1707ce0) at
postmaster.c:1314
#24 0x006bcdb3 in main (argc=3, argv=0x1707ce0) at main.c:228

Root cause:

The reason why the backend waits forever in WaitLatch is that it expects a
USR1 signal to be delivered in order to wake it up from the wait (see the
WaitLatchOrSocket implementation) using latch_sigusr1_handler. Now, since it
is already inside a USR1 handler, USR1 is blocked by default. Thus, it never
receives a USR1 and never wakes up.

Reproduction: [9_5_STABLE, commit: 5a32fcd]

We must align the stars as follows:

1. Create a psql session and create a temp table. [temp table forces a Tx
during backend termination to drop the table => SyncRepWaitForLSN will be
called]

2. Since we don't have a fault injection framework, we have to rely on the
debugger: Attach to the following processes and install these breakpoints:

i) Attach to the Backend:
b procsignal_sigusr1_handler
b postgres.c:353
at: SocketBackend: ereport(DEBUG1,
(errcode(ERRCODE_CONNECTION_DOES_NOT_EXIST),
errmsg("unexpected EOF on client connection")));

ii) Attach to the Wal Sender, no need for a specific breakpoint [We need to
make the Wal Sender fall behind in order for the condition lsn <=
WalSndCtl->lsn[mode] to be false inside SyncRepWaitForLSN()]

3. kill -9 the psql process [This will send an EOF to the backend's
connection
to the front end, which will cause DoingCommandRead = true and
whereToSendOutput = DestNone (in SocketBackend()). This will later ensure
that
ProcessInterrupts() is called from within die: the backend's TERM handler]

4. Continue the backend process in the debugger; it should reach the
breakpoint we set inside SocketBackend.

5. Send a USR1 to the backend: kill -SIGUSR1. Continue the backend. We will
hit the breakpoint b procsignal_sigusr1_handler.

6. Send a TERM or perform pg_terminate_backend() on the backend.

7. Detach from all of the processes.

8. The backend will be waiting for replication forever inside
SyncRepWaitForLSN(). We can confirm this by pausing the backend and checking
the backtrace and also by looking at pg_stat_activity. Subsequent attempts
to terminate the backend would be futile.


Working fix:
At the start of SyncRepWaitForLSN(), check whether we are already in a USR1
handler. If yes, we return. We use sigprocmask() and sigismember() to perform
the detection.
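
A sketch of that detection (illustrative only, not the exact patch):

#include <signal.h>
#include <stdbool.h>

/*
 * If SIGUSR1 is currently blocked, we are presumably running inside the
 * SIGUSR1 handler, so SyncRepWaitForLSN() should bail out rather than
 * wait for a latch wakeup that can never be delivered.
 */
static bool
in_sigusr1_handler(void)
{
    sigset_t    cur_mask;

    /* with a NULL new-set argument, sigprocmask() only reports the mask */
    if (sigprocmask(SIG_SETMASK, NULL, &cur_mask) != 0)
        return false;

    return sigismember(&cur_mask, SIGUSR1) == 1;
}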

-
Ashwin Agrawal, Bhuvnesh Chaudhary, Jesse Zhang and Soumyadeep Chakraborty


Re: Comment in ginpostinglist.c doesn't match code

2019-08-22 Thread Ashwin Agrawal
On Thu, Aug 22, 2019 at 1:14 AM Heikki Linnakangas  wrote:

>
> The patch also includes a little unit test module to test this without
> creating a 16 TB table. A whole new test module seems a bit like
> overkill just for this, but clearly we were missing test coverage here.
> And it will come handy, if we want to invent a new better posting list
> format in the future. Thoughts on whether to include the test module or
> not?
>

I like the test, as it adds important missing coverage. It also really
simplifies the validation effort if a change is required in this area
anytime in the future. So, I would +1 keeping it.


Re: Cleanup isolation specs from unused steps

2019-08-22 Thread Michael Paquier
On Thu, Aug 22, 2019 at 10:20:48AM -0700, Melanie Plageman wrote:
> So, there is some historical context as to why it is a separate test suite.
> And some of the differences are specific to Greenplum -- e.g. needing to
> connect to a specific database in "utility mode" to do something.

What is "utility mode"?

> The syntax for what would be a "step" in isolation is like this:
> 
> [<#>[flag]:]  | ! 
> 
> where # is the session number and flags include the following:
> 
> &: expect blocking behavior
> >: running in background without blocking
> <: join an existing session
> q: quit the given session

These could be transposed as new meta commands for the existing
specs?  Of course not as "step" per-se, but new dedicated commands?

> See the script [1] for parsing the test cases for more details on the 
> syntax and capabilities (it is in Python).

Hmm.  The bar to add a new hard language dependency in the test
suites is very high.  I am not sure that we'd want something with a
python dependency for the tests, also knowing how Python likes
breaking compatibility (isolation2_main() also mentions a dependency
on Python).
--
Michael




Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2019-08-22 Thread Masahiko Sawada
On Thu, Aug 22, 2019 at 12:36 AM Masahiko Sawada  wrote:
>
> On Thu, Aug 22, 2019 at 12:19 AM Alvaro Herrera
>  wrote:
> >
> > Can I interest someone into updating this patch?  We now have (I think)
> > an agreed design, and I think the development work needed should be
> > straightforward.  We also already have the popcount stuff, so that's a
> > few lines to be removed from the patch ...
> >
>
> I will update the patch and register to the next Commit Fest tomorrow
> if nobody is interested in.
>

Attached is the updated patch. While updating the docs I realized that
perhaps we should have a new section for heap and put the
descriptions of the heap functions into it rather than having them as
general functions. If we need this change it is for PG12. I will
register only the new feature patch in the next Commit Fest.

Please review them.



Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 8d81f88..0b92025 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -127,6 +127,35 @@ test=# SELECT page_checksum(get_raw_page('pg_class', 0), 0);
 

 
+ fsm_page_contents(page bytea) returns text
+ 
+  fsm_page_contents
+ 
+
+
+
+ 
+  fsm_page_contents shows the internal node structure
+  of a FSM page. The output is a multiline string, with one line per
+  node in the binary tree within the page. Only those nodes that are not
+  zero are printed. The so-called "next" pointer, which points to the
+  next slot to be returned from the page, is also printed.
+ 
+ 
+  See src/backend/storage/freespace/README for more
+  information on the structure of an FSM page.
+ 
+
+   
+  
+ 
+
+ 
+  Heap Functions
+
+  
+   
+
  heap_page_items(page bytea) returns setof record
  
   heap_page_items
@@ -203,29 +232,6 @@ test=# SELECT * FROM heap_page_item_attrs(get_raw_page('pg_class', 0), 'pg_class
  
 

-
-   
-
- fsm_page_contents(page bytea) returns text
- 
-  fsm_page_contents
- 
-
-
-
- 
-  fsm_page_contents shows the internal node structure
-  of a FSM page. The output is a multiline string, with one line per
-  node in the binary tree within the page. Only those nodes that are not
-  zero are printed. The so-called "next" pointer, which points to the
-  next slot to be returned from the page, is also printed.
- 
- 
-  See src/backend/storage/freespace/README for more
-  information on the structure of an FSM page.
- 
-
-   
   
  
 
From 018077d786f874cb314b5f61b5ef85f42c62bbe5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Fri, 23 Aug 2019 10:36:13 +0900
Subject: [PATCH v3] Introduce heap_infomask_flags to decode infomask and
 infomask2

---
 contrib/pageinspect/Makefile  |   2 +-
 contrib/pageinspect/expected/page.out |  97 
 contrib/pageinspect/heapfuncs.c   | 104 ++
 contrib/pageinspect/pageinspect--1.7--1.8.sql |  13 
 contrib/pageinspect/pageinspect.control   |   2 +-
 contrib/pageinspect/sql/page.sql  |  26 +++
 doc/src/sgml/pageinspect.sgml |  33 
 7 files changed, 275 insertions(+), 2 deletions(-)
 create mode 100644 contrib/pageinspect/pageinspect--1.7--1.8.sql

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index e5a581f..cfe0129 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -5,7 +5,7 @@ OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \
 		  brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES)
 
 EXTENSION = pageinspect
-DATA =  pageinspect--1.6--1.7.sql \
+DATA =  pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \
 	pageinspect--1.5.sql pageinspect--1.5--1.6.sql \
 	pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
 	pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
diff --git a/contrib/pageinspect/expected/page.out b/contrib/pageinspect/expected/page.out
index 3fcd9fb..3b47599 100644
--- a/contrib/pageinspect/expected/page.out
+++ b/contrib/pageinspect/expected/page.out
@@ -82,6 +82,103 @@ SELECT * FROM fsm_page_contents(get_raw_page('test1', 'fsm', 0));
  
 (1 row)
 
+-- If we freeze the only tuple on test1, the infomask should
+-- always be the same in all test runs.
+VACUUM FREEZE test1;
+SELECT t_infomask, t_infomask2, flags
+FROM heap_page_items(get_raw_page('test1', 0)),
+ LATERAL heap_infomask_flags(t_infomask, t_infomask2, true) m(flags);
+ t_infomask | t_infomask2 |flags 
++-+--
+   2816 |   2 | {HEAP_XMAX_INVALID,HEAP_XMIN_FROZEN}
+(1 row)
+
+SELECT t_infomask, t_infomask2, flags
+FROM heap_page_items(get_raw_page('test1', 0)),
+ LATERAL heap_infomask_

Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2019-08-22 Thread Michael Paquier
On Fri, Aug 23, 2019 at 11:09:44AM +0900, Masahiko Sawada wrote:
> While updating the doc I realized that
> perhaps we should have the new section for heap and put the
> descriptions of heap functions into it rather than having them as
> general functions. If we need this change it is for PG12. I will
> register only the new feature patch to the next Commit Fest.

I agree with the new heap section, and your patch on that looks good.
While on it, I have one suggestion: fsm_page_contents does not have an
example of query.  Could we add one while on it?  An example
consistent with the other function's examples:
=# SELECT fsm_page_contents(get_raw_page('pg_class', 'fsm', 0));
--
Michael




Re: Cleanup isolation specs from unused steps

2019-08-22 Thread Melanie Plageman
On Thu, Aug 22, 2019 at 6:53 PM Michael Paquier  wrote:

> On Thu, Aug 22, 2019 at 10:20:48AM -0700, Melanie Plageman wrote:
> > So, there is some historical context as to why it is a separate test
> suite.
> > And some of the differences are specific to Greenplum -- e.g. needing to
> > connect to a specific database in "utility mode" to do something.
>
> What is "utility mode"?
>

It's a connection parameter that allows you to connect to a single Postgres
node in a Greenplum cluster. I only included it as an example of the kind of
"Greenplum-specific" things that are in the test framework.


>
> > The syntax for what would be a "step" in isolation is like this:
> >
> > [<#>[flag]:]  | ! 
> >
> > where # is the session number and flags include the following:
> >
> > &: expect blocking behavior
> > >: running in background without blocking
> > <: join an existing session
> > q: quit the given session
>
> These could be transposed as new meta commands for the existing
> specs?  Of course not as "step" per-se, but new dedicated commands?
>
>
Yes, I think you could definitely add some of the flags as meta-commands for
existing steps -- the current implementation of "blocking" for isolation is
really limiting.
However, features aside, I actually find the existing "step" syntax in
isolation clunkier than the syntax used in Greenplum's "isolation2" framework.


> > See the script [1] for parsing the test cases for more details on the
> > syntax and capabilities (it is in Python).
>
> Hmm.  The bar to add a new hard language dependency in the test
> suites is very high.  I am not sure that we'd want something with a
> python dependency for the tests, also knowing how Python likes
> breaking compatibility (isolation2_main() also mentions a dependency
> to Python).
>

Agreed, I don't think it needs to be in Python.

My point was that some of our "isolation2" framework has to be different
because it is enabling us to test features that are in Greenplum and not in
Postgres. However, many of the features it has would actually be really handy
to have for testing Postgres. It wasn't initially suggested upstream because
it is actually mainly ported from a separate standalone testing framework
that was written at Greenplum in Python.

I've heard Greenplum folks talk about re-writing our "isolation2" framework
in (probably) C and making it a better fit to contribute. It's definitely on
my wishlist.

-- 
Melanie Plageman


Hooks for session start and end, take two

2019-08-22 Thread Michael Paquier
Hi all,

Attached is a patch set to respawn the issue of $subject which has
been discussed here:
https://www.postgresql.org/message-id/20170720204733.40f2b7eb.nag...@sraoss.co.jp

The patch has been committed once as of cd8ce3a but it got shortly
reverted after with 98d54bb because of buildfarm failures.

The root of the buildfarm issues was that session hooks cannot be
tested with a simple LOAD, hence we need to mark the test with
NO_INSTALLCHECK.  Unfortunately we lacked support for that in MSVC
scripts, until I solved that with 431f1599 when refactoring PGXS
makefile rules for regression tests.

While on it, I have done a review over the patch, cleaning it up a bit
and I found some issues, so the initial patch was not fully baked
either:
- previously the hooks were only called for normal backends, which was
incorrect, as the hooks are defined so that no backend-type filtering is
applied to them.
- The README could be also more talkative.
- test_session_hooks--1.0.sql also got confused with the module name.
And actually there is no need to have the SQL and control files just
for a module loading a hook.  So it is enough to use MODULE_big for
this purpose.
- The query generated needs to use quote_literal_cstr for the string
values added to the query.
- sample_session_start_hook and sample_session_end_hook missed a
(void), causing a compiler warning on Windows. 

Attached is an updated patch set to reintroduce the hook, as there was
a ask for it recently, fixing also the issue that we previously tried
to deal with.  I have tested the patch on Windows to make sure that
the test gets correctly bypassed, so this time we should not have any
buildfarm failures.

I am adding that to next CF.  Credits go of course to the initial
authors and reviewers of this feature.

Any opinions?
--
Michael
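
For reference, a module would install the hooks from its _PG_init(), roughly
as in the sketch below (the hook types come from the attached patch; the
module body and log messages are made up):

#include "postgres.h"

#include "fmgr.h"
#include "tcop/tcopprot.h"

PG_MODULE_MAGIC;

static session_start_hook_type prev_session_start_hook = NULL;
static session_end_hook_type prev_session_end_hook = NULL;

static void
sketch_session_start_hook(void)
{
    if (prev_session_start_hook)
        prev_session_start_hook();
    elog(LOG, "session start hook fired");
}

static void
sketch_session_end_hook(void)
{
    if (prev_session_end_hook)
        prev_session_end_hook();
    elog(LOG, "session end hook fired");
}

void
_PG_init(void)
{
    prev_session_start_hook = session_start_hook;
    session_start_hook = sketch_session_start_hook;
    prev_session_end_hook = session_end_hook;
    session_end_hook = sketch_session_end_hook;
}
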
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e8d8e6f828..6d80cc2d64 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -171,6 +171,9 @@ static ProcSignalReason RecoveryConflictReason;
 static MemoryContext row_description_context = NULL;
 static StringInfoData row_description_buf;
 
+/* Hook for plugins to get control at start of session */
+session_start_hook_type session_start_hook = NULL;
+
 /* 
  *		decls for routines only used in this file
  * 
@@ -3968,6 +3971,9 @@ PostgresMain(int argc, char *argv[],
 	if (!IsUnderPostmaster)
 		PgStartTime = GetCurrentTimestamp();
 
+	if (session_start_hook)
+		(*session_start_hook) ();
+
 	/*
 	 * POSTGRES main processing loop begins here
 	 *
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 43b9f17f72..4037267be9 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -78,6 +78,8 @@ static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
 
+/* Hook for plugins to get control at end of session */
+session_end_hook_type session_end_hook = NULL;
 
 /*** InitPostgres support ***/
 
@@ -1197,6 +1199,10 @@ ShutdownPostgres(int code, Datum arg)
 	 * them explicitly.
 	 */
 	LockReleaseAll(USER_LOCKMETHOD, true);
+
+	/* Hook at session end */
+	if (session_end_hook)
+		(*session_end_hook) ();
 }
 
 
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index ec21f7e45c..63581048b2 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -30,6 +30,13 @@ extern PGDLLIMPORT const char *debug_query_string;
 extern int	max_stack_depth;
 extern int	PostAuthDelay;
 
+/* Hook for plugins to get control at start and end of session */
+typedef void (*session_start_hook_type) (void);
+typedef void (*session_end_hook_type) (void);
+
+extern PGDLLIMPORT session_start_hook_type session_start_hook;
+extern PGDLLIMPORT session_end_hook_type session_end_hook;
+
 /* GUC-configurable parameters */
 
 typedef enum
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 60d6d7be1b..066b98e114 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -18,6 +18,7 @@ SUBDIRS = \
 		  test_predtest \
 		  test_rbtree \
 		  test_rls_hooks \
+		  test_session_hooks \
 		  test_shm_mq \
 		  unsafe_tests \
 		  worker_spi
diff --git a/src/test/modules/test_session_hooks/.gitignore b/src/test/modules/test_session_hooks/.gitignore
new file mode 100644
index 00..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_session_hooks/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_session_hooks/Makefile b/src/test/modules/test_session_hooks/Makefile
new file mode 100644
index 00..f6fcdeaf20
--- /dev/null
+++ b/src/test/modules/test_session_hooks/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_ses

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-08-22 Thread Peter Geoghegan
On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova
 wrote:
> I'm going to look through the patch once more to update nbtxlog
> comments, where needed and
> answer to your remarks that are still left in the comments.

Have you been using amcheck's rootdescend verification? I see this
problem with v8, with the TPC-H test data:

DEBUG:  finished verifying presence of 150 tuples from table
"customer" with bitset 51.09% set
ERROR:  could not find tuple using search from root page in index
"idx_customer_nationkey2"

I've been running my standard amcheck query with these databases, which is:

SELECT bt_index_parent_check(index => c.oid, heapallindexed => true,
rootdescend => true),
c.relname,
c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
AND c.relpersistence != 't'
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;

There were many large indexes that amcheck didn't detect a problem
with. I don't yet understand what the problem is, or why we only see
the problem for a small number of indexes. Note that all of these
indexes passed verification with v5, so this is some kind of
regression.

I also noticed that there were some regressions in the size of indexes
-- indexes were not nearly as small as they were in v5 in some cases.
The overall picture was a clear regression in how effective
deduplication is.

I think that it would save time if you had direct access to my test
data, even though it's a bit cumbersome. You'll have to download about
10GB of dumps, which require plenty of disk space when restored:

regression=# \l+
                                               List of databases
    Name    | Owner | Encoding |  Collate   |   Ctype    | Access privileges |  Size   | Tablespace |                 Description
------------+-------+----------+------------+------------+-------------------+---------+------------+---------------------------------------------
 land       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 6425 MB | pg_default |
 mgd        | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 61 GB   | pg_default |
 postgres   | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 7753 kB | pg_default | default administrative connection database
 regression | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 886 MB  | pg_default |
 template0  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | unmodifiable empty database
            |       |          |            |            | pg=CTc/pg         |         |            |
 template1  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | default template for new databases
            |       |          |            |            | pg=CTc/pg         |         |            |
 tpcc       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 10 GB   | pg_default |
 tpce       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 26 GB   | pg_default |
 tpch       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 32 GB   | pg_default |
(9 rows)

I have found it very valuable to use this test data when changing
nbtsplitloc.c, or anything that could affect where page splits make
free space available. If this is too much data to handle conveniently,
then you could skip "mgd" and almost have as much test coverage. There
really does seem to be a benefit to using diverse test cases like
this, because sometimes regressions only affect a small number of
specific indexes for specific reasons. For example, only TPC-H has a
small number of indexes that have tuples that are inserted in order,
but also have many duplicates. Removing the BT_COMPRESS_THRESHOLD
stuff really helped with those indexes.

Want me to send this data and the associated tests script over to you?

-- 
Peter Geoghegan




Re: Cleanup isolation specs from unused steps

2019-08-22 Thread Michael Paquier
On Thu, Aug 22, 2019 at 09:19:47PM -0700, Melanie Plageman wrote:
> It's a connection parameter that allows you to connect to a single Postgres
> node in a Greenplum cluster. I only included it as an example of the kind of
> "Greenplum-specific" things that are in the test framework.

Ah, I see.  I had the same stuff for XC to connect to data nodes.

> I've heard Greenplum folks talk about re-writing our "isolation2" framework
> in (probably) C and making it a better fit to contribute. It's definitely on
> my wishlist.

Nice to hear that.
--
Michael




Re: when the IndexScan reset to the next ScanKey for in operator

2019-08-22 Thread Peter Geoghegan
On Wed, Aug 21, 2019 at 6:24 AM Alex  wrote:
> suppose the executor  should scan 1 first,  If all the tuples for 1 has been 
> scanned,  then **it should be reset to 10**  and scan again.

You might find my nbtree index scan test patch useful:

https://postgr.es/m/cah2-wzmrt_0ybhf05axqb2oituqiqakr0lznntj8x3kadkz...@mail.gmail.com

-- 
Peter Geoghegan




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Thomas Munro
On Wed, Aug 21, 2019 at 4:44 AM Andres Freund  wrote:
> On 2019-08-20 21:02:18 +1200, Thomas Munro wrote:
> > 1.  Anyone is allowed to try to read or write data at any UndoRecPtr
> > that has been allocated, through the buffer pool (though you'd usually
> > want to check it with UndoRecPtrIsDiscarded() first, and only rely on
> > the system I'm describing to deal with races).
> >
> > 2.  ReadBuffer() might return InvalidBuffer.  This can happen for a
> > cache miss, if the smgrread implementation wants to indicate that the
> > buffer has been discarded/truncated and that is expected (md.c won't
> > ever do that, but undofile.c can).
>
> Hm. This gives me a bit of a stomach ache. It somehow feels like a weird
> form of signalling.  Can't quite put my finger on why it makes me feel
> queasy.

Well, if we're going to allow concurrent access and discarding, there
has to be some part of the system where you can discover that the
thing you wanted is gone.  What's wrong with here?

Stepping back a bit... why do we have a new concept here?  The reason
we don't have anything like this for relations currently is that we
don't have live references to blocks that are expected to be
concurrently truncated, due to heavyweight locking.  But the whole
purpose of the undo system is to allow cheap truncation/discarding of
data that you *do* have live references to, and furthermore the
discarding is expected to be frequent.  The present experiment is
about trying to do so without throwing our hands up and using a
pessimistic lock.

At one point Robert and I discussed some kind of scheme where you'd
register your interest in a range of the log before you begin reading
(some kind of range locks or hazard pointers), so that you would block
discarding in that range, but the system would still allow someone to
read in the middle of the log while the discard worker concurrently
discards non-overlapping data at the end.  But I kept returning to the
idea that the buffer pool already has block-level range locking of
various kinds.  You can register your interest in a range by pinning
the buffers.  That's when you'll find out if the range is already
gone.  We could add an extra layer of range locking around that, but
it wouldn't be any better, it'd just thrash your bus a bit more, and
require more complexity in the discard worker (it has to defer
discarding a bit, and either block or go away and come back later).

> > 3.  UndoLogDiscard() uses DiscardBuffer() to invalidate any currently
> > unpinned buffers, and marks as BM_DISCARDED any that happen to be
> > pinned right now, so they can't be immediately invalidated.  Such
> > buffers are never written back and are eligible for reuse on the next
> > clock sweep, even if they're written into by a backend that managed to
> > do that when we were trying to discard.
>
> Hm. When is it legitimate for a backend to write into such a buffer? I
> guess that's about updating the previous transaction's next pointer? Or
> progress info?

Yes, previous transaction header's next pointer, and progress counter
during rollback.  We're mostly interested in the next pointer here,
because the progress counter would normally not be updated at a
time when the page might be concurrently discarded.  The exception to
that is a superuser running CALL pg_force_discard_undo() (a
data-eating operation designed to escape a system that can't
successfully roll back and gets stuck, blowing away
not-yet-rolled-back undo records).

Here are some other ideas about how to avoid conflicts between
discarding and transaction header update:

1.  Lossy self-update-only: You could say that transactions are only
allowed to write to their own transaction header, and then have them
periodically update their own length in their own transaction header,
and then teach the discard worker that the length information is only
a starting point for a linear search for the next transaction based on
page header information.  That removes complexity for writers, but
adds complexity and IO and CPU to the discard worker.  Bleugh.

2.  Strict self-update-only:  We could update it as part of
transaction cleanup.  That is, when you commit or abort, probably at
some time when your transaction is still advertised as running, you go
and update your own transaction header with the size.  If you
never reach that stage, I think you can fix it during crash recovery,
during the startup scan that feeds the rollback request queues.  That
is, if you encounter a transaction header with length == 0, it must be
the final one and its length is therefore known and can be updated,
before you allow new transactions to begin.  There are some
complications like backends that exit without crashing, which I
haven't thought about.  As Amit just pointed out to me, that means
that the update is not folded into the same buffer access as the next
transaction, but perhaps you can mitigate that by not updating your
header if the next header will be on the same p

Re: mingw32 floating point diff

2019-08-22 Thread Peter Eisentraut
On 2019-08-22 18:19, Tom Lane wrote:
> What I suggest doing is reducing extra_float_digits to -1 for this
> specific test.  Changing the contents of circle_tbl seems like it'd have
> more consequences than we want, in particular there's no guarantee that
> we'd not hit similar issues in other tests if they're given different
> inputs.

I agree that reducing the output precision is better than adjusting the
code.

The circle.sql file already has SET extra_float_digits TO 0, and a few
other files have other settings with different values.  Are we content
to use various numbers until it works in each case, or should we try to
use some consistency?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




pg_checksums --help synopsis is quite long

2019-08-22 Thread Peter Eisentraut
With the additional functionality, the --help synopsis of pg_checksums
has gotten quite long:

   pg_checksums enables, disables, or verifies data checksums in a
   PostgreSQL database cluster.

Can we try to shorten this a bit?  Maybe

   pg_checksums manages data checksums in a PostgreSQL database cluster.

Other ideas?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2019-08-22 Thread Masahiko Sawada
On Fri, Aug 23, 2019 at 12:27 PM Michael Paquier  wrote:
>
> On Fri, Aug 23, 2019 at 11:09:44AM +0900, Masahiko Sawada wrote:
> > While updating the doc I realized that
> > perhaps we should have the new section for heap and put the
> > descriptions of heap functions into it rather than having them as
> > general functions. If we need this change it is for PG12. I will
> > register only the new feature patch to the next Commit Fest.
>
> I agree with the new heap section, and your patch on that looks good.
> While on it, I have one suggestion: fsm_page_contents does not have an
> example of query.  Could we add one while on it?  An example
> consistent with the other function's examples:
> =# SELECT fsm_page_contents(get_raw_page('pg_class', 'fsm', 0));

Good idea. I've updated the doc update patch.


Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 8d81f88..e1753aa 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -127,6 +127,55 @@ test=# SELECT page_checksum(get_raw_page('pg_class', 0), 0);
 

 
+ fsm_page_contents(page bytea) returns text
+ 
+  fsm_page_contents
+ 
+
+
+
+ 
+  fsm_page_contents shows the internal node structure
+  of a FSM page. The output is a multiline string, with one line per
+  node in the binary tree within the page. For example:
+
+test=# SELECT fsm_page_contents(get_raw_page('pg_class', 'fsm', 0));
+ fsm_page_contents
+---
+ 0: 235   +
+ 1: 235   +
+ 3: 235   +
+ 7: 235   +
+ 15: 235  +
+ 31: 235  +
+ 63: 235  +
+ 127: 235 +
+ 255: 235 +
+ 511: 235 +
+ 1023: 235+
+ 2047: 235+
+ 4095: 235+
+ fp_next_slot: 0  +
+
+  Only those nodes that are not zero are printed. The so-called "next"
+  pointer, which points to the next slot to be returned from the page,
+  is also printed.
+ 
+ 
+  See src/backend/storage/freespace/README for more
+  information on the structure of an FSM page.
+ 
+
+   
+  
+ 
+
+ 
+  Heap Functions
+
+  
+   
+
  heap_page_items(page bytea) returns setof record
  
   heap_page_items
@@ -203,29 +252,6 @@ test=# SELECT * FROM heap_page_item_attrs(get_raw_page('pg_class', 0), 'pg_class
  
 

-
-   
-
- fsm_page_contents(page bytea) returns text
- 
-  fsm_page_contents
- 
-
-
-
- 
-  fsm_page_contents shows the internal node structure
-  of a FSM page. The output is a multiline string, with one line per
-  node in the binary tree within the page. Only those nodes that are not
-  zero are printed. The so-called "next" pointer, which points to the
-  next slot to be returned from the page, is also printed.
- 
- 
-  See src/backend/storage/freespace/README for more
-  information on the structure of an FSM page.
- 
-
-   
   
  
 


Re: pg_checksums --help synopsis is quite long

2019-08-22 Thread Fabien COELHO


Hello Peter,


With the additional functionality, the --help synopsis of pg_checksums
has gotten quite long:

  pg_checksums enables, disables, or verifies data checksums in a
  PostgreSQL database cluster.

Can we try to shorten this a bit?  Maybe

  pg_checksums manages data checksums in a PostgreSQL database cluster.

Other ideas?


My 0.02 €:

  pg_checksums triggers or verifies checksums in a Postgres cluster.

I like the somewhat detailed functionality list, if space allows, so I
tried to compress the other parts.


Not sure that there could be a checksum on anything else but data.

I have decided that PostgreSQL is a mouthful, thus I'm rather using 
"Postgres".


--
Fabien.

Re: FETCH FIRST clause PERCENT option

2019-08-22 Thread Surafel Temesgen
On Tue, Aug 20, 2019 at 9:10 AM Kyotaro Horiguchi 
wrote:

> Hi,
>
> At Wed, 7 Aug 2019 10:20:09 +0300, Surafel Temesgen 
> wrote in <
> calay4q98xbvhtz4yj9dccmg2-s1_jurr7fyanfw+bkmr22o...@mail.gmail.com>
> > Hi
> > On Wed, Aug 7, 2019 at 6:11 AM Kyotaro Horiguchi <
> horikyota@gmail.com>
> > wrote:
> >
> > >
> > > I have some comments.
> > >
> > > This patch uses distinct parameters for exact number and
> > > percentage. On the othe hand planner has a notion of
> > > tuple_fraction covering the both. The same handling is also used
> > > for tuple number estimation. Using the same scheme will make
> > > things far simpler. See the comment of grouping_planner().
> > >
> > >
> > Its because of data type difference .In planner the data type is the same
>
> I meant that, with the usage of tuple_fraction, one variable can
> represent both the absolute limit and relative limit. This
> simplifies parse structs.
>
In grouping_planner the patch sets the tuple bound to -1 in create_ordered_paths,
because with a percentage it iterates until the end. Does that answer your
question?

>
>
> > > In executor part, addition to LimiteState.position, this patch
> > > uses LimiteState.backwardPosition to count how many tuples we're
> > > back from the end of the current tuplestore. But things will get
> > > simpler by just asking the tuplestore for the number of holding
> > > tuples.
> > >
> > >
> > backwardPosition hold how many tuple returned in backward scan
>
> I meant that we don't need to hold the number in estate.
>

I did it this way because I didn't find a tuplestore API to do this.


>
> > > +slot = node->subSlot;
> > >
> > > Why is this needed? The variable is properly set before use and
> > > the assignment is bogus in the first place.
> > >
> > >
> > its because Tuplestore needs initialized slot.
>
> I meant that the initilized slot is overwritten before first use.
>
> > > The new code block in LIMIT_RESCAN in ExecLimit is useless since
> > > it is exatctly the same with existing code block. Why didn't you
> > > use the existing if block?
> > >
> >
> > But they test different scenario
>
> I meant that the two different scenario can share the code block.
>

Sorry for not clarifying well. One has to be checked before offsetting and
the other after offsetting.


>
> > > >if (node->limitOption == PERCENTAGE)
> > > +{
> > > +node->perExactCount = ceil(node->percent *
> > > node->position / 100.0);
> > > +
> > > +
> > >
> > > node->position is the number of tuples returned to upper node so
> > > far (not the number of tuples this node has read from the lower
> > > node so far).  I don't understand what the expression means.
> > >
> >
> > node->position hold the number of tuples this node has read from the
> lower
> > node so far. see LIMIT_RESCAN state
>
> Reallly? node->position is incremented when
> tuplestore_gettupleslot_heaptuple() succeeded and reutnrs the
> tuple to the caller immediately...
>
> > > +if (node->perExactCount == node->perExactCount +
> 1)
> > > +node->perExactCount++;
> > >
> > > What? The condition never be true. As the result, the following
> > > if block below won't run.
> > >
> >
> > it became true according to the number of tuple returned in from the
> lower
> > node so far
> > and percentage specification.
>
> Mmm. How do you think X can be equal to  (X + 1)?
>

Oops, my bad. The attached patch removes the need to do that.

>
> > > >/*
> > > + * Return the tuple up to the number of exact count in
> > > OFFSET
> > > + * clause without percentage value consideration.
> > > + */
> > > +if (node->perExactCount > 0)
> > > +{
> > > +
> > >
> > >
> > >
> > >
> > > +/*
> > > + * We may needed this tuple in backward scan so put it
> into
> > > + * tuplestore.
> > > + */
> > > +if (node->limitOption == PERCENTAGE)
> > > +{
> > > +tuplestore_puttupleslot(node->tupleStore, slot);
> > > +tuplestore_advance(node->tupleStore, true);
> > > +}
> > >
> > > "needed"->"need" ? The comment says that this is needed for
> > > backward scan, but it is actually required even in forward
> > > scan. More to the point, tuplestore_advance lacks comment.
> >
> >
> > ok
> >
> >
> > >
> > > Anyway, the code in LIMIT_RESCAN is broken in some ways. For
> > > example, if it is used as the inner scan of a NestLoop, the
> > > tuplestore accumulates tuples by every scan. You will see that
> > > the tuplestore stores up to 1000 tuples (10 times of the inner)
> > > by the following query.
> > >
> >
> > It this because in percentage we scan the whole table
>
> It's useless and rather harmful that the tuplestore holds
> indefinite number of duplicate set of the whole tuples from the
> lower node. We must reuse tuples already store

Refactoring of connection with password prompt loop for frontends

2019-08-22 Thread Michael Paquier
Hi all,

In six places of the code tree (+ one in psql which is a bit
different), we have the following pattern for frontend tools to
connect to a backend with a password prompt, roughly like that:
do
{
[...]
conn = PQconnectdbParams(keywords, values, true);

[...]

if (PQstatus(conn) == CONNECTION_BAD &&
PQconnectionNeedsPassword(conn) &&
!have_password)
{
PQfinish(conn);
simple_prompt("Password: ", password, sizeof(password), false);
have_password = true;
new_pass = true;
}
} while (new_pass);

Attached is a tentative patch to consolidate this logic across the
tree.  The patch is far from being in a committable state, and there
are a couple of gotchas:
- pg_dumpall merges connection string parameters, so it is much harder
to plug in than the others, and I left it out on purpose.
- Some code paths spawn a password prompt before the first connection
attempt.  For now I have left this first attempt out of the refactored
logic, but I think that it is possible to consolidate that a bit more
by enforcing a password prompt before doing the first connection
attempt (somewhat related to the XXX portion in the patch).  At the
end it would be nice to not have to have a 100-byte-long field for the
password buffer we have here and there.  Unfortunately this comes with
its limitations in pg_dump as savedPassword needs to be moved around
and may be reused in the context.
- I don't like the routine name connect_with_password_prompt() I
introduced.  Suggestions of better names are welcome :)
- This also brings up the point that some of our tools are not able to
handle tri-values for passwords, so we may want to consolidate that as
well.
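
To make the discussion a bit more concrete, here is roughly the shape the
helper takes, following the pattern above.  The routine name is the one I
used, but the parameter list below is simplified compared to the attached
diff, so treat it as a sketch rather than the patch's actual API:

#include <stdbool.h>
#include <stdio.h>
#include "libpq-fe.h"

/* from src/port, same signature as in the snippet above */
extern void simple_prompt(const char *prompt, char *destination,
						  size_t destlen, bool echo);

/*
 * Simplified sketch: connect, and retry once with a password prompt if
 * the server asks for one.  Returns NULL on failure.
 */
static PGconn *
connect_with_password_prompt(const char *host, const char *port,
							 const char *user, const char *dbname,
							 const char *progname, bool allow_prompt)
{
	PGconn	   *conn;
	char		password[100];
	bool		have_password = false;
	bool		new_pass;

	do
	{
		const char *keywords[] = {"host", "port", "user", "password",
			"dbname", "fallback_application_name", NULL};
		const char *values[] = {host, port, user,
			have_password ? password : NULL,
			dbname, progname, NULL};

		new_pass = false;
		conn = PQconnectdbParams(keywords, values, true);

		if (PQstatus(conn) == CONNECTION_BAD &&
			PQconnectionNeedsPassword(conn) &&
			!have_password && allow_prompt)
		{
			PQfinish(conn);
			simple_prompt("Password: ", password, sizeof(password), false);
			have_password = true;
			new_pass = true;
		}
	} while (new_pass);

	if (PQstatus(conn) == CONNECTION_BAD)
	{
		fprintf(stderr, "%s: could not connect to database \"%s\": %s",
				progname, dbname, PQerrorMessage(conn));
		PQfinish(conn);
		return NULL;
	}

	return conn;
}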

Among the positive points, this brings a lot of consolidation in terms
of error handling, and this shaves a bit of code:
13 files changed, 190 insertions(+), 280 deletions(-) 

This moves the logic into src/fe_utils, which is natural as that's
aimed only for frontends and because this also links to libpq.

Please note that this links a bit with the refactoring of vacuumlo and
oid2name logging I proposed a couple of days back (applying one patch
or the other results in conflicts) because we need to have frontends
initialized for logging in order to be able to log errors in the
refactored routine:
https://www.postgresql.org/message-id/20190820012819.ga8...@paquier.xyz
This one should be merged first IMO, but that's a story for the other
thread.

This compiles, and passes all regression tests so it is possible to
play with it easily, still it needs much more testing and love.

Any thoughts?  I am adding that to the next commit fest.

Thanks,
--
Michael
diff --git a/contrib/oid2name/Makefile b/contrib/oid2name/Makefile
index 361a80a7a1..c6ba092f47 100644
--- a/contrib/oid2name/Makefile
+++ b/contrib/oid2name/Makefile
@@ -9,7 +9,7 @@ OBJS	= oid2name.o $(WIN32RES)
 TAP_TESTS = 1
 
 PG_CPPFLAGS = -I$(libpq_srcdir)
-PG_LIBS_INTERNAL = $(libpq_pgport)
+PG_LIBS_INTERNAL = $(libpq_pgport) -L$(top_builddir)/src/fe_utils -lpgfeutils
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/oid2name/oid2name.c b/contrib/oid2name/oid2name.c
index d3a4a50005..c38679e869 100644
--- a/contrib/oid2name/oid2name.c
+++ b/contrib/oid2name/oid2name.c
@@ -11,6 +11,7 @@
 
 #include "catalog/pg_class_d.h"
 
+#include "common/logging.h"
 #include "fe_utils/connect.h"
 #include "libpq-fe.h"
 #include "pg_getopt.h"
@@ -46,6 +47,8 @@ struct options
 	const char *progname;
 };
 
+const char *progname;
+
 /* function prototypes */
 static void help(const char *progname);
 void		get_opts(int, char **, struct options *);
@@ -82,9 +85,9 @@ get_opts(int argc, char **argv, struct options *my_opts)
 	};
 
 	int			c;
-	const char *progname;
 	int			optindex;
 
+	pg_logging_init(argv[0]);
 	progname = get_progname(argv[0]);
 
 	/* set the defaults */
@@ -284,65 +287,20 @@ PGconn *
 sql_conn(struct options *my_opts)
 {
 	PGconn	   *conn;
-	bool		have_password = false;
-	char		password[100];
-	bool		new_pass;
 	PGresult   *res;
 
-	/*
-	 * Start the connection.  Loop until we have a password if requested by
-	 * backend.
-	 */
-	do
+	/* grab a connection, with password prompting */
+	conn = connect_with_password_prompt(my_opts->hostname,
+		my_opts->port,
+		my_opts->username,
+		my_opts->dbname,
+		my_opts->progname,
+		NULL,
+		true);
+	if (conn == NULL)
 	{
-#define PARAMS_ARRAY_SIZE	7
-
-		const char *keywords[PARAMS_ARRAY_SIZE];
-		const char *values[PARAMS_ARRAY_SIZE];
-
-		keywords[0] = "host";
-		values[0] = my_opts->hostname;
-		keywords[1] = "port";
-		values[1] = my_opts->port;
-		keywords[2] = "user";
-		values[2] = my_opts->username;
-		keywords[3] = "password";
-		values[3] = have_password ? password : NULL;
-		keywords[4] = "dbname";
-		values[4] = my_opts->dbname;
-		keywords[5] = "fallback_application_name";
-		values[5] = my_opts->progname;
-		keywords[6] = NULL;
-		values[6] = NULL;
-
-		n

Comment in ginpostinglist.c doesn't match code

2019-08-22 Thread Heikki Linnakangas

Hi,

While merging Greenplum with 9.4, we ran into problems with the GIN 
posting list encoding, because Greenplum sometimes uses ItemPointers 
with offset numbers up to 32768. The GIN posting list code was written 
with the assumption that the maximum is MaxHeapTuplesPerPage, and it 
uses only 11 bits for the offset number. The fix was simple, we just 
modified ginpostinglist.c to use full 16 bits instead of 11.


However, I noticed that this comment in ginpostinglist.c that explains 
the encoding doesn't match the code:



 * These 43-bit integers are encoded using varbyte encoding. In each byte,
 * the 7 low bits contain data, while the highest bit is a continuation bit.
 * When the continuation bit is set, the next byte is part of the same
 * integer, otherwise this is the last byte of this integer.  43 bits fit
 * conveniently in at most 6 bytes when varbyte encoded (the 6th byte does
 * not need a continuation bit, because we know the max size to be 43 bits):
 *
 * 0XXX
 * 1XXX 0YYY
 * 1XXX 1YYY 0YYY
 * 1XXX 1YYY 1YYY 0YYY
 * 1XXX 1YYY 1YYY 1YYY 0YYY
 * 1XXX 1YYY 1YYY 1YYY 1YYY 
 *
 * X = bits used for offset number
 * Y = bits used for block number


The code doesn't actually give the 6th byte any special treatment. If 
the input integer has the 43rd bit set, the encoding function will put a 
continuation bit on the 6th byte, and generate a 7th byte. And the 
decoding function will correctly decode that, too. So to my surprise, 
the implementation actually works for integers up to 49 bits wide. However, 
there is an overflow check in the encoding function that assumes max 6 
bytes per integer. That needs to be fixed, along with the comment.
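
To spell out the scheme, here is a self-contained illustration of this kind
of varbyte coding (not the ginpostinglist.c functions themselves); it shows
why a value with the 43rd bit set needs a 7th byte, since 6 bytes only carry
6 * 7 = 42 data bits and 7 bytes carry up to 49:

#include <stdint.h>

static unsigned char *
encode_varbyte_demo(uint64_t val, unsigned char *ptr)
{
	while (val > 0x7F)
	{
		*ptr++ = 0x80 | (unsigned char) (val & 0x7F);	/* continuation set */
		val >>= 7;
	}
	*ptr++ = (unsigned char) val;	/* last byte, continuation clear */
	return ptr;
}

static uint64_t
decode_varbyte_demo(unsigned char **ptr)
{
	uint64_t	val = 0;
	int			shift = 0;
	unsigned char b;

	do
	{
		b = *(*ptr)++;
		val |= (uint64_t) (b & 0x7F) << shift;
		shift += 7;
	} while (b & 0x80);

	return val;
}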


Fitting any item pointer into 6 bytes was an important property when 
this was written, because in the old pre-9.4 format, posting lists were 
stored as plain arrays, with 6 bytes per item pointer. The maximum of 6 bytes per 
integer in the new format guaranteed that we could convert any page from 
the old format to the new format, after pg_upgrade, so that the new 
format was never larger than the old format. But I don't think we need 
to worry much about that anymore. Luckily, no one has run into this 
while trying to upgrade. It would require having a 9.3 cluster with a 
table larger than 16 TB (with 8k block size), with a GIN index on it, 
and a posting list with TIDs more than 2^31 blocks distance, on a full 
page. So, not a problem in practice.


In summary, the comment in ginpostinglist.c is wrong, and the overflow 
check needs to be fixed. Patch attached.


The patch also includes a little unit test module to test this without 
creating a 16 TB table. A whole new test module seems a bit like 
overkill just for this, but clearly we were missing test coverage here. 
And it will come in handy if we want to invent a new, better posting list 
format in the future. Thoughts on whether to include the test module or not?


- Heikki
>From 3178fe98ca93c52bc8678953b328f70d7ed16b5c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 22 Aug 2019 11:06:13 +0300
Subject: [PATCH 1/1] Fix overflow check and comment in GIN posting list
 encoding.

The comment did not match what the code actually did for integers with
the 43rd bit set. You get an integer like that, if you have a posting
list with two adjacent TIDs that are more than 2^31 blocks apart.
According to the comment, we would store that in 6 bytes, with no
continuation bit on the 6th byte, but in reality, the code encodes it
using 7 bytes, with a continuation bit on the 6th byte as normal.

The decoding routine also handled these 7-byte integers correctly, except
for an overflow check that assumed that one integer needs at most 6 bytes.
Fix the overflow check, and fix the comment to match what the code
actually does.

Fitting any item pointer into max 6 bytes was an important property when
this was written, because in the old pre-9.4 format, item pointers were
stored as plain arrays, with 6 bytes for every item pointer. The maximum
of 6 bytes per integer in the new format guaranteed that we could convert
any page from the old format to the new format after upgrade, so that the
new format was never larger than the old format. But we hardly need to
worry about that anymore, and running into that problem during upgrade,
where an item pointer is expanded from 6 to 7 bytes such that the data
doesn't fit on a page anymore, is implausible in practice anyway.

Backpatch to all supported versions.

This also includes a little test module to test these large distances
between item pointers, without requiring a 16 TB table. It is not
backpatched, I'm including it more for the benefit of future development
of new posting list formats.

Discussion: XXX
---
 src/backend/access/gin/ginpostinglist.c   | 33 +--
 src/test/modules/Makefile |  1 +
 src/test/modules/test_ginpostinglist/Makefile | 21 
 src/test/modules/test

Take skip header out of a loop in COPY FROM

2019-08-22 Thread Surafel Temesgen
Hello,

Even though skipping the header is done only once, it is checked and skipped
inside a loop. Unless I'm missing something, it can be done outside the loop,
as in the attached patch.

regards

Surafel
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 4f04d122c3..4e7709d7bf 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -3004,6 +3004,13 @@ CopyFrom(CopyState cstate)
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
 
+	/* on input just throw the header line away */
+	if ( cstate->header_line)
+	{
+		cstate->cur_lineno++;
+		(void) CopyReadLine(cstate);
+	}
+
 	for (;;)
 	{
 		TupleTableSlot *myslot;
@@ -3642,14 +3649,6 @@ NextCopyFromRawFields(CopyState cstate, char ***fields, int *nfields)
 	/* only available for text or csv input */
 	Assert(!cstate->binary);
 
-	/* on input just throw the header line away */
-	if (cstate->cur_lineno == 0 && cstate->header_line)
-	{
-		cstate->cur_lineno++;
-		if (CopyReadLine(cstate))
-			return false;		/* done */
-	}
-
 	cstate->cur_lineno++;
 
 	/* Actually read the line into memory here */


Re: Take skip header out of a loop in COPY FROM

2019-08-22 Thread Heikki Linnakangas

On 22/08/2019 11:31, Surafel Temesgen wrote:

Hello,

Even if skipping header is done only once its checked and skipped in a 
loop. If I don’t miss something it can be done out side a loop like 
attached patch


You may be on to something, but if we move it to CopyFrom(), as in your 
patch, then it won't get executed e.g. from the calls in file_fdw. 
file_fdw calls BeginCopyFrom(), followed by NextCopyFrom(); it doesn't 
use CopyFrom().


- Heikki




Re: Make SQL/JSON error code names match SQL standard

2019-08-22 Thread Peter Eisentraut
On 2019-08-20 13:48, Alexander Korotkov wrote:
> On Tue, Aug 20, 2019 at 11:49 AM Peter Eisentraut
>  wrote:
>> I propose the attached patch to make the new SQL/JSON error code names
>> match the SQL standard.  The existing minor differences don't seem
>> necessary.
> 
> Thank you for noticing!
> +1 for pushing this

done

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: when the IndexScan reset to the next ScanKey for in operator

2019-08-22 Thread Kyotaro Horiguchi
Hi,

At Wed, 21 Aug 2019 20:58:47 +0800, Alex  wrote in 

> postgres=# select * from t2 where a in (1, 10);
...
> I can see the plan stores the "1 and 10" information in
> IndexScan->indexqual, which is an SCALARARRAYOPEXPR expression.
...
> suppose the executor  should scan 1 first,  If all the tuples for 1 has
> been scanned,  then **it should be reset to 10**  and scan again.
>  however I can't find out the code for that.  looks index_rescan is not for
> this.   am I miss something?

Perhaps _bt_advance_array_keys() and btgettuple() would be what
you are seeking for.

>/* ... otherwise see if we have more array keys to deal with */
>} while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Email to hackers for test coverage

2019-08-22 Thread movead...@highgo.ca
Hello hackers,

One of the areas that didn't get much attention in the community
recently is analysing and increasing the code coverage of the
PostgreSQL regression test suite. I have started working on the
code coverage by running the GCOV code coverage analysis tool in
order to analyse the current code coverage and come up with test
cases to increase it. This is going to be a long
exercise, so my plan is to do it incrementally. I will be analysing
some area of untested code, then coming up with test cases to
cover those lines of code in regression, and then moving on to the next
area of untested code, and so on.

So far I have come up with 3 test cases to increase the code
coverage of the PostgreSQL regression test suite.

I have performed the regression run for this exercise on commit
75c1921cd6c868c5995b88113b4463a4830b9a27.

The regression tests are executed with the make check-world command and the
results are gathered using the 'make coverage-html' command.

Below are the lines of untested code that I have analysed and the
test cases added to the regression suite to cover them.

1. src/include/utils/float.h:140

Analysis:
This is the error report line for converting a float8 value that is
too large to be stored in a float4.

Test case:
Add a test case as below in file float4.sql:
select float4(1234567890123456789012345678901234567890::float8);

2. src/include/utils/float.h:145

Analysis:
This is the error report line for converting a float8 value whose
magnitude is too small to be represented by a float4.

Test case:
Add a test case as below in file float4.sql:
select float4(0.01::float8);

3. src/include/utils/sortsupport.h:264

Analysis:
This covers reverse sorting for data types that use abbreviated keys
for sorting, for example macaddr, uuid, numeric and the network types;
I chose numeric for the test.

Test case:
Add a test case as below in file numeric.sql:
INSERT INTO num_input_test(n1) values('99.998');
INSERT INTO num_input_test(n1) values('99.997');
SELECT * FROM num_input_test ORDER BY n1 DESC;

Result and patch

By adding the test cases, the test coverage of  float.h increased from
97.7% to 100% and sortsupport.h increased from 76.7% to 80.0%.
 
The increase in code coverage can be seen in the before and after
pictures of GCOV test coverage analysis summary.

The attached patch contain the test cases added in regression for
increasing the coverage.


--
Movead Li


before.png
Description: Binary data


after.png
Description: Binary data


regression_20190820.patch
Description: Binary data


Re: Take skip header out of a loop in COPY FROM

2019-08-22 Thread Adam Lee
On Thu, Aug 22, 2019 at 11:48:31AM +0300, Heikki Linnakangas wrote:
> On 22/08/2019 11:31, Surafel Temesgen wrote:
> > Hello,
> > 
> > Even if skipping header is done only once its checked and skipped in a
> > loop. If I don’t miss something it can be done out side a loop like
> > attached patch
> 
> You may be on to something, but if we move it to CopyFrom(), as in your
> patch, then it won't get executed e.g. from the calls in file_fdw. file_fdw
> calls BeginCopyFrom(), followed by NextCopyFrom(); it doesn't use
> CopyFrom().
> 
> - Heikki

Yes.

My next thought is to call unlikely() here, but we don't have it...
https://www.postgresql.org/message-id/CABRT9RC-AUuQL6txxsoOkLxjK1iTpyexpbizRF4Zxny1GXASGg%40mail.gmail.com

-- 
Adam Lee




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Amit Kapila
On Thu, Aug 22, 2019 at 11:04 AM Dilip Kumar  wrote:
>
> On Thu, Aug 22, 2019 at 10:24 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote:
> > > On Thu, Aug 22, 2019 at 9:58 AM Andres Freund  wrote:
> > > >
> > > > Hi,
> > > >
> > > > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote:
> > > > > We can not know the complete size of the record even by reading the
> > > > > header because we have a payload that is variable part and payload
> > > > > length are stored in the payload header which again can be at random
> > > > > offset.
> > > >
> > > > Wait, but that's just purely self inflicted damage, no? The initial
> > > > length just needs to include the payload. And all this is not an issue
> > > > anymore?
> > > >
> > > Actually, we store the undo length only at the end of the record and
> > > that is for traversing the transaction's undo record chain during bulk
> > > fetch.  Ac such in the beginning of the record we don't have the undo
> > > length.  We do have uur_info but that just tell us which all optional
> > > header are included in the record.
> >
> > But why? It makes a *lot* more sense to have it in the beginning. I
> > don't think bulk-fetch really requires it to be in the end - we can
> > still process records forward on a page-by-page basis.
>
> Yeah, we can handle the bulk fetch as you suggested and it will make
> it a lot easier.  But, currently while registering the undo request
> (especially during the first pass) we need to compute the from_urecptr
> and the to_urecptr. And,  for computing the from_urecptr,  we have the
> end location of the transaction because we have the uur_next in the
> transaction header and that will tell us the end of our transaction
> but we still don't know the undo record pointer of the last record of
> the transaction.  As of know, we read previous 2 bytes from the end of
> the transaction to know the length of the last record and from there
> we can compute the undo record pointer of the last record and that is
> our from_urecptr.
>

How about if we store the location of the last record of the
transaction instead of the location of the next transaction in the
transaction header?  I think if we do that then the discard worker might
need to do some additional work in some cases as it needs to tell the
location up to which discard is possible, however, many other cases
might get simplified.  With this also, when the log is switched while
writing records for the same transaction, the transaction header in
the first log will store the start location of the same transaction's
records in the next log.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Take skip header out of a loop in COPY FROM

2019-08-22 Thread Heikki Linnakangas

On 22/08/2019 12:54, Adam Lee wrote:

My next thought is to call unlikely() here, but we don't have it...
https://www.postgresql.org/message-id/CABRT9RC-AUuQL6txxsoOkLxjK1iTpyexpbizRF4Zxny1GXASGg%40mail.gmail.com


We do, actually, since commit aa3ca5e3dd in v10.

Not sure it's worth the trouble here. Optimizing COPY in general would 
be good, even small speedups there are helpful because everyone uses 
COPY, but without some evidence I don't believe particular branch is 
even measurable.


- Heikki




Re: Why overhead of SPI is so large?

2019-08-22 Thread Konstantin Knizhnik



On 22.08.2019 3:27, Tsunakawa, Takayuki wrote:

From: Konstantin Knizhnik [mailto:k.knizh...@postgrespro.ru]

PL/pgSQL:   29044.361 ms
C/SPI:  22785.597 ms

The fact that difference between PL/pgSQL and function implemented in C
using SPI is not so large  was expected by me.

This PL/pgSQL overhead is not so significant compared with the three times, but 
makes me desire some feature like Oracle's ALTER PROCEDURE ... COMPILE; that 
compiles the PL/SQL logic to native code.  I've seen a few dozen percent speed 
up.



Actually my C/SPI implementation is not optimal: it is better
to fetch several records at a time:


    while (true)
    {
        SPI_cursor_fetch(portal, true, 100);
        if (SPI_processed)
        {
            for (i = 0; i < SPI_processed; i++)
            {
                HeapTuple spi_tuple = SPI_tuptable->vals[i];
                Datum val = SPI_getbinval(spi_tuple, SPI_tuptable->tupdesc,
                                          1, &is_null);
                double x = DatumGetFloat8(val);

                result += x * x;
                SPI_freetuple(spi_tuple);
            }
            SPI_freetuptable(SPI_tuptable);
        }
        else
            break;
    }

This version gives a result of 9405.694 ms, which is comparable with the
result of the SQL query.
Unfortunately (or fortunately) PL/pgSQL already uses prefetch. If it
is disabled (when iterating through an explicitly created cursor), the
query execution time almost doubles (46552.935 ms).


So the PL/pgSQL to C/SPI ratio is more than three.

Updated results are the following:


Implementation    time (ms)
PL/Lua                32220
PL/pgSQL              29044
PL/pgSQL (JIT)        27594
C/SPI                  9406
SQL                    7399
SQL (JIT)              5532
C/coreAPI              2873

--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-22 Thread Kyotaro Horiguchi
Hello.

At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch  wrote in 
<20190820060314.ga3086...@rfd.leadboat.com>
> On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch  wrote in 
> > <20190818035230.gb3021...@rfd.leadboat.com>
> > > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
> > 
> > Now TwoPhaseFileHeader has two new members for (commit-time)
> > pending syncs. Pending-syncs are useless on wal-replay, but that
> > is needed for commit-prepared.
> 
> There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql
> command, which is far too late to be syncing new relation files.  (A crash may
> have already destroyed their data.)  PrepareTransaction(), which implements
> the PREPARE TRANSACTION command, is the right place for these syncs.
> 
> A failure in these new syncs needs to prevent the transaction from being
> marked committed.  Hence, in CommitTransaction(), these new syncs need to

Agreed.

> happen after the last step that could create assign a new relfilenode and
> before RecordTransactionCommit().  I suspect it's best to do it after
> PreCommit_on_commit_actions() and before AtEOXact_LargeObject().

I don't find an obvious problem there. Since pending deletes and
pending syncs are separately processed, I'm planning to make a
separate list for syncs from deletes.

> > > This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL 
> > > for
> > > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There 
> > > may be
> > > no bug today, but it's conceptually wrong to make RelationNeedsWAL() 
> > > return
> > > false due to this code, use RelationNeedsWAL() for multiple forks, and 
> > > then
> > > not actually sync all forks.
> > 
> > I agree that all forks needs syncing, but FSM and VM are checking
> > RelationNeedsWAL(modified). To make sure, are you suggesting to
> > sync all forks instead of emitting WAL for them, or suggesting
> > that VM and FSM to emit WALs even when the modified
> > RelationNeedsWAL returns false (+ sync all forks)?
> 
> I hadn't thought that far.  What do you think is best?

As in the latest patch, sync ALL forks and emit no WAL. We could
skip syncing the FSM but I'm not sure it's worth doing.


> > > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another 
> > > component
> > > not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the 
> > > relation,
> > > or if it's smaller than some threshold, WAL-log the contents of the whole 
> > > file
> > > at that point."  Please write the part to WAL-log the contents of small 
> > > files
> > > instead of syncing them.
> > 
> > I'm not sure the point of the behavior. I suppose that the "log"
> > is a sequence of new_page records. It also needs to be synced and
> > it is always larger than the file to be synced. I can't think of
> > an appropriate threshold without the point.
> 
> Yes, it would be a sequence of new-page records.  FlushRelationBuffers() locks
> every buffer header containing a buffer of the current database.  The belief
> has been that writing one page to xlog is cheaper than FlushRelationBuffers()
> in a busy system with large shared_buffers.

I'm at a loss. The decision between WAL and sync is made at
commit time, when we no longer have a pin on the buffers. When
emitting WAL, contrary to that assumption, the lock needs to be
re-acquired for every page to emit log_newpage. What is worse,
we may need to reload evicted buffers.  If the file has been
CopyFrom'ed, the ring buffer strategy makes the situation even
worse. That doesn't seem cheap at all.

If WAL-logging had any chance of winning for smaller files here, it
would be for files smaller than the ring size of the bulk-write
strategy (16MB).

If we look up every buffer page of the file individually instead of
scanning through all buffers, that makes things worse due to conflicts
on the buffer partition locks.

Any thoughts?



# Sorry time's up today.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Remove page-read callback from XLogReaderState.

2019-08-22 Thread Heikki Linnakangas

On 22/08/2019 04:43, Kyotaro Horiguchi wrote:

At Mon, 29 Jul 2019 22:39:57 +0300, Heikki Linnakangas  wrote in 


On 12/07/2019 10:10, Kyotaro Horiguchi wrote:
* XLogReaderState->readBuf is now allocated and controlled by the
* caller, not by xlogreader.c itself. When XLogReadRecord() needs data,
* the caller makes the data available in readBuf, which can point to the
* same buffer in all calls, or the caller may allocate a new buffer, or
* it may point to a part of a larger buffer, whatever is convenient for
* the caller. (Currently, all callers just allocate a BLCKSZ'd buffer,
* though). The caller also sets readPagPtr, readLen and readPageTLI to
* tell XLogReadRecord() what's in the buffer. So all these read* fields
* are now set by the caller, XLogReadRecord() only reads them.


The caller knows how many bytes need to be read, so it seems
reasonable that the caller provides the required buffer.


I also had in mind that the caller could provide a larger buffer, 
spanning multiple pages, in one XLogReadRecord() call. It might be 
convenient to load a whole WAL file in memory and pass it to 
XLogReadRecord() in one call, for example. I think the interface would 
now allow that, but the code won't actually take advantage of that. 
XLogReadRecord() will always ask for one page at a time, and I think it 
will ask the caller for more data between each page, even if the caller 
supplies more than one page in one call.



I'm not sure how intelligible this patch is in its current state. But
I think the general idea is good. I plan to clean it up further next
week, but feel free to work on it before that, either based on this
patch or by starting afresh from your patch set.


I think your diff is intelligible enough for me. I'll take this if
you haven't done it already. Anyway I'm staring at this.


Thanks! I did actually spend some time on this last week, but got 
distracted by something else before finishing it up and posting a patch. 
Here's a snapshot of what I have in my local branch. It seems to pass 
"make check-world", but probably needs some more cleanup.


Main changes since last version:

* I changed the interface so that there is a new function to set the 
starting position, XLogBeginRead(), and XLogReadRecord() always 
continues from where it left off. I think that's more clear than having 
a starting point argument in XLogReadRecord(), which was only set on the 
first call. It makes the calling code more clear, too, IMO.


* Refactored the implementation of XLogFindNextRecord(). 
XLogFindNextRecord() is now a sibling function of XLogBeginRead(). It 
sets the starting point like XLogBeginRead(). The difference is that 
with XLogFindNextRecord(), the starting point doesn't need to point to a 
valid record, it will "fast forward" to the next valid record after the 
point. The "fast forwarding" is done in an extra state in the state 
machine in XLogReadRecord().


* I refactored XLogReadRecord() and the internal XLogNeedData() 
function. XLogNeedData() used to contain logic for verifying segment and 
page headers. That works quite differently now. Checking the segment 
header is now a new state in the state machine, and the page header is 
verified at the top of XLogReadRecord(), whenever the caller provides 
new data. I think that makes the code in XLogReadRecord() more clear.


- Heikki
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 477709bbc23..8ecb5ea55c5 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1386,15 +1386,21 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
 
-	xlogreader = XLogReaderAllocate(wal_segment_size, &read_local_xlog_page,
-	NULL);
+	xlogreader = XLogReaderAllocate(wal_segment_size);
 	if (!xlogreader)
 		ereport(ERROR,
 (errcode(ERRCODE_OUT_OF_MEMORY),
  errmsg("out of memory"),
  errdetail("Failed while allocating a WAL reading processor.")));
+	xlogreader->readBuf = palloc(XLOG_BLCKSZ);
+
+	XLogBeginRead(xlogreader, lsn);
+	while (XLogReadRecord(xlogreader, &record, &errormsg) ==
+		   XLREAD_NEED_DATA)
+	{
+		read_local_xlog_page(xlogreader);
+	}
 
-	record = XLogReadRecord(xlogreader, lsn, &errormsg);
 	if (record == NULL)
 		ereport(ERROR,
 (errcode_for_file_access(),
@@ -1416,6 +1422,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	*buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
 	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
 
+	pfree(xlogreader->readBuf);
 	XLogReaderFree(xlogreader);
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e651a841bbe..1bb303a90dc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -803,13 +803,6 @@ static XLogSource readSource = 0;	/* XLOG_FROM_* code */
 static XLogSource currentSource = 0;	/* XLOG_FROM_* code */
 static bool

mingw32 floating point diff

2019-08-22 Thread Peter Eisentraut
On 2019-08-20 14:59, Peter Eisentraut wrote:
> Running the regression tests on mingw32, I get the following diff in
> circle.out:
> 
> @@ -111,8 +111,8 @@
>WHERE (c1.f1 < c2.f1) AND ((c1.f1 <-> c2.f1) > 0)
>ORDER BY distance, area(c1.f1), area(c2.f1);
>   five |  one   |  two   | distance
> ---+++--
> -  | <(3,5),0>  | <(1,2),3>  | 0.60555127546399
> +--+++---
> +  | <(3,5),0>  | <(1,2),3>  | 0.605551275463989
>| <(3,5),0>  | <(5,1),3>  | 1.47213595499958
>| <(100,200),10> | <(100,1),115>  |   74
>| <(100,200),10> | <(1,2),100>| 111.370729772479
> 
> I only get this on master/PG12, but not on PG11, so I suspect that the
> new floating-point output routines could be the root of the issue.
> 
> This happens only with the 32-bit build (mingw32), but not with a 64-bit
> build (mingw64).

OK, the problem isn't the new output routines.  The result of the
computations is actually different.  The test itself is new in PG12.

The difference in output is due to the mingw32 target using -mfpmath=387
by default.  If you build with -mfpmath=sse -msse, the tests pass again.

Do we care to do anything about this?  Pick slightly different test data
perhaps?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services






Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2019-08-22 Thread Paul Guo
Thanks. I updated the patch to v5. It passes install-check testing and
recovery testing.

On Fri, Aug 2, 2019 at 6:38 AM Thomas Munro  wrote:

> On Mon, Jul 15, 2019 at 10:52 PM Paul Guo  wrote:
> > Please see the attached v4 patch.
>
> While moving this to the next CF, I noticed that this needs updating
> for the new pg_list.h API.
>
> --
> Thomas Munro
>
> https://enterprisedb.com
>


v5-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch
Description: Binary data


Re: mingw32 floating point diff

2019-08-22 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-08-20 14:59, Peter Eisentraut wrote:
>> Running the regression tests on mingw32, I get the following diff in
>> circle.out:
...
> OK, the problem isn't the new output routines.  The result of the
> computations is actually different.  The test itself is new in PG12.
> The difference in output is due to the mingw32 target using -mfpmath=387
> by default.  If you build with -mfpmath=sse -msee, the tests pass again.

Hm, so presumably we could replicate this on other Intel-oid platforms
by changing compiler switches?  (I haven't tried.)

> Do we care to do anything about this?  Pick slightly different test data
> perhaps?

Picking different test data might be a good "fix".  Alternatively, we
could try to figure out where the discrepancy is arising and adjust
the code --- but that might be a lot more work than it's worth.

regards, tom lane




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Robert Haas
On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar  wrote:
> Yeah, we can handle the bulk fetch as you suggested and it will make
> it a lot easier.  But, currently while registering the undo request
> (especially during the first pass) we need to compute the from_urecptr
> and the to_urecptr. And,  for computing the from_urecptr,  we have the
> end location of the transaction because we have the uur_next in the
> transaction header and that will tell us the end of our transaction
> but we still don't know the undo record pointer of the last record of
> the transaction.  As of know, we read previous 2 bytes from the end of
> the transaction to know the length of the last record and from there
> we can compute the undo record pointer of the last record and that is
> our from_urecptr.=

I don't understand this.  If we're registering an undo request at "do"
time, we don't need to compute the starting location; we can just
remember the UndoRecPtr of the first record we inserted.  If we're
reregistering an undo request after a restart, we can (and, I think,
should) work forward from the discard location rather than backward
from the insert location.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Fault injection framework

2019-08-22 Thread Asim R P
Hello

Fault injection was discussed a few months ago at PGCon in Ottawa.  At
least a few folks showed interest and so I would like to present what we
have been using in Greenplum.

The attached patch set contains the fault injector framework ported to
PostgreSQL master.  It  provides ability to define points of interest in
backend code and then inject faults at those points from SQL.  Also
included is an isolation test to simulate a speculative insert conflict
scenario that was found to be rather cumbersome to implement using advisory
locks, see [1].  The alternative isolation spec using fault injectors seems
much simpler to understand.

Asim

[1] CAAKRu_a7hbyrk=wveHYhr4LbcRnRCG=ypuvoqyb9yo1cdub...@mail.gmail.com



0001-Framework-to-inject-faults-from-SQL-tests.patch
Description: Binary data


0003-Speculative-insert-isolation-test-spec-using-fault-i.patch
Description: Binary data


0004-Run-tests-with-faults-if-faultinjector-was-compiled-.patch
Description: Binary data


0002-Add-syntax-to-declare-a-step-that-is-expected-to-blo.patch
Description: Binary data


0005-Isolation-schedule-for-tests-that-inject-faults.patch
Description: Binary data


Re: Why overhead of SPI is so large?

2019-08-22 Thread Konstantin Knizhnik

Some more information...
First of all I found out that marking a PL/pgSQL function as immutable
significantly increases the speed of its execution:
19808 ms vs. 27594 ms. It happens because exec_eval_simple_expr takes a
snapshot if the function is volatile (the default).
I wonder if the PL/pgSQL compiler could detect that the evaluated expression
itself is actually immutable, so that there is no need to take a snapshot
for each invocation of this function. Also I have tried yet another PL
language - JavaScript, which is now the new outsider, despite the fact that
the V8 JIT compiler is very good.

Implementation    time (ms)
PL/v8                 41550
PL/Lua                32220
PL/pgSQL              19808
C/SPI                  9406
SQL                    7399
SQL (JIT)              5532
C/coreAPI              2873

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Asymmetric partition-wise JOIN

2019-08-22 Thread Kohei KaiGai
Hello,

Even though nobody has responded to the thread, I tried to make a prototype
of asymmetric partition-wise join support.
This feature tries to join the non-partitioned and partitioned relations
before the Append.

See the example below:

create table ptable (dist int, a int, b int) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3,
remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3,
remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3,
remainder 2);
create table t1 (aid int, label text);
create table t2 (bid int, label text);
insert into ptable (select x, (1000*random())::int,
(1000*random())::int from generate_series(1,100) x);
insert into t1 (select x, md5(x::text) from generate_series(1,50) x);
insert into t2 (select x, md5(x::text) from generate_series(1,50) x);
vacuum analyze ptable;
vacuum analyze t1;
vacuum analyze t2;

ptable.a has values between 0 and 1000, and t1.aid has values between 1 and 50.
Therefore, a join of ptable and t1 on a = aid can eliminate almost 95% of the rows.
On the other hand, t1 is not partitioned and the join keys are not partition keys.
So, Append must process a million rows first, then HashJoin processes the rows read
from the partitioned table, and 95% of them are eventually dropped.
In other words, 95% of the work done by Append is a waste of time and CPU cycles.

postgres=# explain select * from ptable, t1 where a = aid;
  QUERY PLAN
--
 Hash Join  (cost=2.12..24658.62 rows=49950 width=49)
   Hash Cond: (ptable_p0.a = t1.aid)
   ->  Append  (cost=0.00..20407.00 rows=100 width=12)
 ->  Seq Scan on ptable_p0  (cost=0.00..5134.63 rows=333263 width=12)
 ->  Seq Scan on ptable_p1  (cost=0.00..5137.97 rows=333497 width=12)
 ->  Seq Scan on ptable_p2  (cost=0.00..5134.40 rows=333240 width=12)
   ->  Hash  (cost=1.50..1.50 rows=50 width=37)
 ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
(8 rows)

Asymmetric partition-wise join allows joining non-partitioned tables and
partitioned tables prior to the Append.

postgres=# set enable_partitionwise_join = on;
SET
postgres=# explain select * from ptable, t1 where a = aid;
  QUERY PLAN
--
 Append  (cost=2.12..19912.62 rows=49950 width=49)
   ->  Hash Join  (cost=2.12..6552.96 rows=16647 width=49)
 Hash Cond: (ptable_p0.a = t1.aid)
 ->  Seq Scan on ptable_p0  (cost=0.00..5134.63 rows=333263 width=12)
 ->  Hash  (cost=1.50..1.50 rows=50 width=37)
   ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
   ->  Hash Join  (cost=2.12..6557.29 rows=16658 width=49)
 Hash Cond: (ptable_p1.a = t1.aid)
 ->  Seq Scan on ptable_p1  (cost=0.00..5137.97 rows=333497 width=12)
 ->  Hash  (cost=1.50..1.50 rows=50 width=37)
   ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
   ->  Hash Join  (cost=2.12..6552.62 rows=16645 width=49)
 Hash Cond: (ptable_p2.a = t1.aid)
 ->  Seq Scan on ptable_p2  (cost=0.00..5134.40 rows=333240 width=12)
 ->  Hash  (cost=1.50..1.50 rows=50 width=37)
   ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
(16 rows)

We can consider the table join ptable X t1 above to be equivalent to:
  (ptable_p0 + ptable_p1 + ptable_p2) X t1
= (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
It returns an equivalent result; however, the rows are already reduced by the
HashJoin in each individual leaf of the Append, so the CPU cycles consumed by
the Append node can be cheaper.

On the other hand, it has a downside: t1 must be read 3 times and the
hash table must also be built 3 times. This increases the expected cost,
so the planner
may not choose the asymmetric partition-wise join plan.

One idea I have is to let sibling HashJoins share a hash table that is built
once by any of the sibling Hash plans. Right now, this is not implemented yet.

What are your thoughts on this feature?

Best regards,

2019年8月12日(月) 15:03 Kohei KaiGai :
>
> Hello,
>
> PostgreSQL optimizer right now considers join pairs on only
> non-partition - non-partition or
> partition-leaf - partition-leaf relations. On the other hands, it is
> harmless and makes sense to
> consider a join pair on non-partition - partition-leaf.
>
> See the example below. ptable is partitioned by hash, and contains 10M
> rows. ftable is not
> partitioned and contains 50 rows. Most of ptable::fkey shall not have
> matched rows in this
> join.
>
> create table ptable (fkey int, dist text) partition by hash (dist);
> create table ptable_p0 partition of ptable for values with (modulus 3,
> remainder 0);
> create table ptable_p1 partition of ptable for values with (modulus 3,
> remainder 1);
> create table ptable_p2 partition of ptable for values with (modulu

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2019-08-22 Thread Anastasia Lubennikova

22.08.2019 16:13, Paul Guo wrote:
Thanks. I updated the patch to v5. It passes install-check testing and 
recovery testing.

Hi,
Thank you for working on this fix.
The overall design of the latest version looks good to me.
But during the review, I found a bug in the current implementation.
The new behavior must apply to crash recovery only, but now it applies to
archive recovery too.
That can cause a silent loss of a tablespace during regular standby
operation, since that code path never calls CheckRecoveryConsistency().

Steps to reproduce:
1) run master and replica
2) create dir for tablespace:
mkdir  /tmp/tblspc1

3) create tablespace and database on the master:
create tablespace tblspc1 location '/tmp/tblspc1';
create database db1 tablespace tblspc1 ;

4) wait for replica to receive this changes and pause replication:
select pg_wal_replay_pause();

5) move replica's tablespace symlink to some empty directory, i.e. 
/tmp/tblspc2

mkdir  /tmp/tblspc2
ln -sfn /tmp/tblspc2 postgresql_data_replica/pg_tblspc/16384

6) create another database in tblspc1 on master:
create database db2 tablespace tblspc1 ;

7) resume replication on standby:
select pg_wal_replay_resume();

8) try to connect to db2 on standby

It's expected that dbase_redo() will fail because the directory on 
standby is not found.
While with the patch, the error is suppressed until we attempt to connect
to db2 on the standby:


2019-08-22 18:34:39.178 MSK [21066] HINT:  Execute 
pg_wal_replay_resume() to continue.
2019-08-22 18:42:41.656 MSK [21066] WARNING:  Skip creating database 
directory "pg_tblspc/16384/PG_13_201908012". The dest tablespace may 
have been removed before abnormal shutdown. If the removal is illegal 
after later checking we will panic.
2019-08-22 18:42:41.656 MSK [21066] CONTEXT:  WAL redo at 0/3027738 for 
Database/CREATE: copy dir base/1 to pg_tblspc/16384/PG_13_201908012/16390
2019-08-22 18:42:46.096 MSK [21688] FATAL: 
"pg_tblspc/16384/PG_13_201908012/16390" is not a valid data directory
2019-08-22 18:42:46.096 MSK [21688] DETAIL:  File 
"pg_tblspc/16384/PG_13_201908012/16390/PG_VERSION" is missing.


Also some nitpicking about code style:
1) Please, add comment to forget_missing_directory().

2) +   elog(LOG, "Directory \"%s\" was missing during 
directory copying "

I think we'd better update this message elevel to WARNING.

3) Shouldn't we also move FlushDatabaseBuffers(xlrec->src_db_id); call under
    if (do_copydir) clause?
I don't see a reason to flush pages that we won't use later.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 9:21 PM Dilip Kumar  wrote:
>
> On Thu, Aug 22, 2019 at 7:34 PM Robert Haas  wrote:
> >
> > On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar  wrote:
> > > Yeah, we can handle the bulk fetch as you suggested and it will make
> > > it a lot easier.  But, currently while registering the undo request
> > > (especially during the first pass) we need to compute the from_urecptr
> > > and the to_urecptr. And,  for computing the from_urecptr,  we have the
> > > end location of the transaction because we have the uur_next in the
> > > transaction header and that will tell us the end of our transaction
> > > but we still don't know the undo record pointer of the last record of
> > > the transaction.  As of know, we read previous 2 bytes from the end of
> > > the transaction to know the length of the last record and from there
> > > we can compute the undo record pointer of the last record and that is
> > > our from_urecptr.=
> >
> > I don't understand this.  If we're registering an undo request at "do"
> > time, we don't need to compute the starting location; we can just
> > remember the UndoRecPtr of the first record we inserted.  If we're
> > reregistering an undo request after a restart, we can (and, I think,
> > should) work forward from the discard location rather than backward
> > from the insert location.
>
> Right, we work froward from the discard location.  So after the
> discard location,  while traversing the undo log when we encounter an
> aborted transaction we need to register its rollback request.  And,
> for doing that we need 1) start location of the first undo record . 2)
> start location of the last undo record (last undo record pointer).
>
> We already have 1).  But we have to compute 2).   For doing that if we
> unpack the first undo record we will know the start of the next
> transaction.  From there if we read the last two bytes then that will
> have the length of the last undo record of our transaction.  So we can
> compute 2) with below formula
>
> start of the last undo record = start of the next transaction - length
> of our transaction's last record.

Maybe I am saying that because I am just thinking about how the requests are
registered as per the current code.  But those requests will
ultimately be used for collecting the records in the bulk fetch.  So if
we are planning to change the bulk fetch to read forward, then maybe we
don't need a valid last undo record pointer, because we will get that
anyway while processing forward.  So just knowing the end of the
transaction is sufficient for us to know where to stop.  I am not sure
whether this solution has any problem.  Probably I should think about it
again in the morning when my mind is well-rested.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: mingw32 floating point diff

2019-08-22 Thread Tom Lane
I wrote:
> Peter Eisentraut  writes:
>> Do we care to do anything about this?  Pick slightly different test data
>> perhaps?

> Picking different test data might be a good "fix".  Alternatively, we
> could try to figure out where the discrepancy is arising and adjust
> the code --- but that might be a lot more work than it's worth.

I poked at this a bit more.  I can reproduce the problem by using 
-mfpmath=387 on dromedary's host (fairly old 32-bit macOS); although
I also get half a dozen *other* failures in the core regression tests,
mostly around detection of float overflow.  So I'm not quite sure that
this is comparable.  But at any rate, I tracked the core of the problem
to pg_hypot:

/* Determine the hypotenuse */
yx = y / x;
result = x * sqrt(1.0 + (yx * yx));

With -mfpmath=387, these calculations are done to more-than-64-bit
precision, yielding a different end result --- note in particular
that sqrt() is a hardware instruction on this platform, so it's
not rounding either.

I experimented with preventing that by using volatile intermediate
variables (cf comments in float.c); but it seemed like a mess,
and it would likely pessimize the code more than we want for other
platforms, and it's kind of hard to argue that deliberately sabotaging
the more-accurate computation is an improvement.
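
For the record, the kind of change I experimented with looks roughly like
this; it is an illustration only, not the float.c code and not something
I'm proposing to commit:

#include <math.h>

/*
 * Volatile stores force each intermediate result to be rounded to a
 * 64-bit double, defeating x87 80-bit excess precision.
 */
static double
pg_hypot_strict(double x, double y)
{
	volatile double yx;
	volatile double sum;
	volatile double result;

	/* assumes the caller already ensured 0 <= y <= x and x != 0 */
	yx = y / x;
	sum = 1.0 + yx * yx;
	result = x * sqrt(sum);
	return result;
}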

What I suggest doing is reducing extra_float_digits to -1 for this
specific test.  Changing the contents of circle_tbl seems like it'd have
more consequences than we want, in particular there's no guarantee that
we'd not hit similar issues in other tests if they're given different
inputs.

regards, tom lane




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Andres Freund
Hi

On August 22, 2019 9:14:10 AM PDT, Dilip Kumar  wrote:
> But, those requests will
>ultimately be used for collecting the record by the bulk fetch.  So if
>we are planning to change the bulk fetch to read forward then maybe we
>don't need the valid last undo record pointer because that we will
>anyway get while processing forward.  So just knowing the end of the
>transaction is sufficient for us to know where to stop.  I am not sure
>if this solution has any problem.  Probably  I should think again in
>the morning when my mind is well-rested.

I don't think we can easily do so for bulk apply without incurring significant 
overhead. It's pretty cheap to read in forward order and then process backwards 
on a page level - but for an entire transaction's undo the story is different. 
We can't necessarily keep all of it in memory, so we'd have to read the undo 
twice to find the end. Right?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Robert Haas
On Thu, Aug 22, 2019 at 12:54 AM Andres Freund  wrote:
> But why? It makes a *lot* more sense to have it in the beginning. I
> don't think bulk-fetch really requires it to be in the end - we can
> still process records forward on a page-by-page basis.

There are two separate needs here: to be able to go forward, and to be
able to go backward.  We have the length at the end of each record not
because we're stupid, but so that we can back up.  If we have another
way of backing up, then the thing to do is not to move that to the
beginning of the record but to remove it entirely as unnecessary
wastage.  We can also think about how to improve forward traversal.

Considering each problem separately:

For forward traversal, we could simplify things somewhat by having
only 3 decoding stages instead of N decoding stages.  We really only
need (1) a stage for accumulating bytes until we have uur_info, and
then (2) a stage for accumulating bytes until we know the payload and
tuple lengths, and then (3) a stage for accumulating bytes until we
have the whole record.  We have a lot more stages than that right now
but I don't think we really need them for anything. Originally we had
them so that we could do incremental decoding to find the transaction
header in the record, but now that the transaction header is at a
fixed offset, I think the multiplicity of stages is just baggage.

We could simplify things more by deciding that the first two bytes of
the record are going to contain the record size. That would increase
the size of the record by 2 bytes, but we could (mostly) claw those
bytes back by not storing the size of both uur_payload and uur_tuple.
The size of the other one would be computed by subtraction: take the
total record size, subtract the size of whichever of those two things
we store, subtract the mandatory and optional headers that are
present, and the rest must be the other value. That would still add 2
bytes for records that contain neither a payload nor a tuple, but that
would probably be OK given that (a) a lot of records wouldn't be
affected, (b) the code would get simpler, and (c) something like this
seems necessary anyway given that we want to make the record format
more generic. With this approach instead of 3 stages we only need 2:
(1) accumulating bytes until we have the 2-byte length word, and (2)
accumulating bytes until we have the whole record.
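
To spell out the arithmetic (field names here are made up for illustration,
not taken from the patch):

    tuple_len = total_len - header_len - payload_len

where total_len is the 2-byte length word at the start of the record,
header_len covers the mandatory header plus whichever optional headers the
uur_info bits say are present, and payload_len is the one length we still
store explicitly.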

For backward traversal, as I see it, there are basically two options.
One is to do what we're doing right now, and store the record length
at the end of the record. (That might mean that a record both begins
and ends with its own length, which is not a crazy design.) The other
is to do what I think you are proposing here: locate the beginning of
the first record on the page, presumably based on some information
stored in the page header, and then work forward through the page to
figure out where all the records start.  Then process them in reverse
order. That saves 2 bytes per record.  It's a little more expensive in
terms of CPU cycles, especially if you only need some of the records
on the page but not all of them, but that's probably not too bad.
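
A rough sketch of that second approach, assuming each record starts with
the 2-byte total length proposed above (names and signature are invented
here, this is not meant as the real code):

static int
collect_record_offsets(const char *page, int first_off, int used_upto,
                       int *offsets, int max_offsets)
{
    int     n = 0;
    int     off = first_off;

    while (off < used_upto && n < max_offsets)
    {
        uint16  len;

        memcpy(&len, page + off, sizeof(len));  /* record's total length */
        offsets[n++] = off;
        off += len;
    }

    return n;       /* caller then walks offsets[n - 1] .. offsets[0] */
}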

I think I'm basically agreeing with what you are proposing but I think
it's important to spell out the underlying concerns, because otherwise
I'm afraid we might think we have a meeting of the minds when we don't
really.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Why overhead of SPI is so large?

2019-08-22 Thread Pavel Stehule
On Thu, Aug 22, 2019 at 17:51, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

> Some more information...
> First of all, I found out that marking a PL/pgSQL function as immutable
> significantly increases the speed of its execution:
> 19808 ms vs. 27594 ms. This happens because exec_eval_simple_expr takes a
> snapshot if the function is volatile (the default).
> I wonder if the PL/pgSQL compiler can detect that the evaluated expression
> itself is actually immutable and that there is no need to take a snapshot
> for each invocation of this function. I have also tried yet another PL
> language - JavaScript, which is the new outsider, despite the fact that
> the v8 JIT compiler is very good.
>

I have a plan to do some work in this direction. A snapshot is not necessary
for almost all built-in functions. If the expression calls only built-in
functions, then it can probably be evaluated without a snapshot and without
any work with the plan cache.

Pavel

>
> Implementation    time (ms)
> PL/v8                 41550
> PL/Lua                32220
> PL/pgSQL              19808
> C/SPI                  9406
> SQL                    7399
> SQL (JIT)              5532
> C/coreAPI              2873
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 7:34 PM Robert Haas  wrote:
>
> On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar  wrote:
> > Yeah, we can handle the bulk fetch as you suggested and it will make
> > it a lot easier.  But, currently while registering the undo request
> > (especially during the first pass) we need to compute the from_urecptr
> > and the to_urecptr. And,  for computing the from_urecptr,  we have the
> > end location of the transaction because we have the uur_next in the
> > transaction header and that will tell us the end of our transaction
> > but we still don't know the undo record pointer of the last record of
> > the transaction.  As of now, we read the previous 2 bytes from the end of
> > the transaction to know the length of the last record and from there
> > we can compute the undo record pointer of the last record and that is
> > our from_urecptr.
>
> I don't understand this.  If we're registering an undo request at "do"
> time, we don't need to compute the starting location; we can just
> remember the UndoRecPtr of the first record we inserted.  If we're
> reregistering an undo request after a restart, we can (and, I think,
> should) work forward from the discard location rather than backward
> from the insert location.

Right, we work forward from the discard location.  So, starting at the
discard location, while traversing the undo log, when we encounter an
aborted transaction we need to register its rollback request.  And for
doing that we need 1) the start location of the first undo record and
2) the start location of the last undo record (the last undo record pointer).

We already have 1), but we have to compute 2).  To do that, if we
unpack the first undo record we will know the start of the next
transaction.  The two bytes just before that point hold the length of
the last undo record of our transaction.  So we can compute 2) with the
formula below:

start of the last undo record = start of the next transaction - length
of our transaction's last record.

Am I making sense here?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Cleanup isolation specs from unused steps

2019-08-22 Thread Melanie Plageman
On Wed, Aug 21, 2019 at 12:16 PM Alvaro Herrera 
wrote:

> On 2019-Aug-21, Melanie Plageman wrote:
>
> > In Greenplum, we mainly add new tests to a separate isolation
> > framework (called isolation2) which uses a completely different
> > syntax. It doesn't use isolationtester at all. So, I haven't had a use
> > case to add long options to isolationtester yet :)
>
> Is that other framework somehow more capable?
>

So, there is some historical context as to why it is a separate test suite.
And some of the differences are specific to Greenplum -- e.g. needing to
connect to a specific database in "utility mode" to do something.

However, the other differences are actually pretty handy and would be
applicable to upstream as well.
We use a different syntax than the isolation framework and have some nice
features. Most notably, explicit control over blocking.

The syntax for what would be a "step" in isolation is like this:

[<#>[flag]:] <sql> | ! <shell command>

where # is the session number and flags include the following:

&: expect blocking behavior
>: running in background without blocking
<: join an existing session
q: quit the given session

See the script [1] for parsing the test cases for more details on the
syntax and
capabilities (it is in Python).

[1]
https://github.com/greenplum-db/gpdb/blob/master/src/test/isolation2/sql_isolation_testcase.py

-- 
Melanie Plageman


Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Dilip Kumar
On Thu, Aug 22, 2019 at 9:55 PM Andres Freund  wrote:
>
> Hi
>
> On August 22, 2019 9:14:10 AM PDT, Dilip Kumar  wrote:
> > But, those requests will
> >ultimately be used for collecting the record by the bulk fetch.  So if
> >we are planning to change the bulk fetch to read forward then maybe we
> >don't need the valid last undo record pointer because that we will
> >anyway get while processing forward.  So just knowing the end of the
> >transaction is sufficient for us to know where to stop.  I am not sure
> >if this solution has any problem.  Probably  I should think again in
> >the morning when my mind is well-rested.
>
> I don't think we can easily do so for bulk apply without incurring 
> significant overhead. It's pretty cheap to read in forward order and then 
> process backwards on a page level - but for an entire transaction's undo the 
> story is different. We can't necessarily keep all of it in memory, so we'd 
> have to read the undo twice to find the end. Right?
>
I was not talking about the entire transaction; I was also talking
about the page level, as you suggested.  I was just saying that we may
not need the start position of the transaction's last undo record
for registering the rollback request (which we currently do).
However, we do need to know the end of the transaction to know the last
page from which we need to start reading forward.

Let me explain with an example

Transaction1
first, undo start at 10
first, undo end at 100
second, undo start at 101
second, undo end at 200
..
last, undo start at 1000
last, undo end at  1100

Transaction2
first, undo start at 1101
first, undo end at 1200
second, undo start at 1201
second, undo end at 1300

Suppose we want to register the request for Transaction1.  Then
currently we need to know the start undo record pointer (10 in the
above example) and the last undo record pointer (1000).  But we only
know the start undo record pointer (10) and the start of the next
transaction (1101).  So, to calculate the start of the last record,
we use 1101 - 101 (the length of the last record is stored in the 2
bytes before 1101).

So now I am saying that maybe we don't need to compute the start of the
last undo record (1000), because it's enough to know the end of the
last undo record (1100): whichever page the last undo record ends on,
we can start from that page and read forward on that page.

* All numbers I used in the above example can be considered as undo
record pointers.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Preserving extension ownership in dump/restore/pg_upgrade

2019-08-22 Thread Tom Lane
Currently, we fail to do $SUBJECT: post-restore, an extension will
be owned by the role used to do the restore.  This is because pg_dump
does not consider that extensions have owners at all, so it doesn't
issue ALTER EXTENSION OWNER TO, which is a command that the server
doesn't have anyway.

We've not gotten complaints about this, AFAIR, suggesting it's only a
minor problem in practice.  Probably most extensions are superuser-owned
anyway.  However, we do theoretically support PL extensions that are
owned by database owners, and if the "trustable extension" idea that
I proposed yesterday[1] gets in, there might soon be a lot more cases
of non-superuser-owned extensions.  Moreover, although pg_upgrade will
preserve the ownership of individual objects within the extension,
a regular dump/restore will not.  This means a database owner would
lose the ability to adjust permissions on a PL after dump/restore,
which seems like a problem.  So I started thinking about how to fix it.

The obvious way to fix it is to implement ALTER EXTENSION OWNER TO,
but that has a big problem: what should we do with the ownership of
the contained objects?  Taking the example of one of the PL extensions,
we'd like it to also change ownership of the procedural language object,
but it *must not* give away ownership of the C-language handler functions,
at least not if the ownership recipient is not a superuser.  (A user who
owns a C-language function can alter its properties to do almost
anything.)  There doesn't seem to be any way to distinguish these cases
except with very ad-hoc, ugly, restrictive code.  I considered proposing
that we *only* change ownership of contained procedural language objects,
but ick.  That's ugly, and it would hobble the usefulness of the
@extowner@ mechanism I proposed in [1] (although possibly that's not
very useful for anything but PLs anyway?).

Maybe, instead of @extowner@ as proposed (which is certainly just a
quick-n-dirty mechanism), we could add some syntax that would explicitly
identify objects whose ownership ought to track that of the extension.
Seems like a lot of work though.

Another idea, which is much uglier conceptually but seems like it
could be done with not much code, is to teach pg_dump/pg_restore
that it must use SET SESSION AUTHORIZATION to set the ownership of
an extension even when it's using ALTER OWNER for everything else.
The main drawback that I can think of is that if the target user
lacks permissions to create the extension in the destination database,
CREATE EXTENSION will fail (and then probably later restore commands
will too), rather than leaving the extension in place with the wrong
owner.

I guess, if we're going to need custom restore code anyway,
we could imagine solving that problem by emitting

SET SESSION AUTHORIZATION joe;
CREATE EXTENSION someextension;
RESET SESSION AUTHORIZATION;
CREATE EXTENSION IF NOT EXISTS someextension;

but man is that ugly.

Binary-upgrade mode has an independent problem:
binary_upgrade_create_empty_extension has a hard-wired assumption that 
it should use GetUserId() for the extension owner.  We could imagine
fixing that by passing the owner role name as a separate argument;
though if we go with the SET SESSION AUTHORIZATION solution for normal
mode, I'm a bit inclined to use it for binary upgrade as well.

I don't find any of these approaches terribly appealing.
Thoughts, better ideas?

regards, tom lane

[1] https://www.postgresql.org/message-id/5889.1566415762%40sss.pgh.pa.us




Re: Allow to_date() and to_timestamp() to accept localized names

2019-08-22 Thread Juan José Santamaría Flecha
On Sun, Aug 18, 2019 at 10:42 AM Juan José Santamaría Flecha
 wrote:
>
> Going through the current items in the wiki's todo list, I have been
> looking into: "Allow to_date () and to_timestamp () to accept
> localized month names".
>

I have gone through a second take on this, trying to give it a better
shape but it would surely benefit from some review, so I will open an
item in the commitfest.

Regards,

Juan José Santamaría Flecha


0001-Allow-localized-month-names-to_date-v1.patch
Description: Binary data


Re: Cleanup isolation specs from unused steps

2019-08-22 Thread Robert Eckhardt
On Thu, Aug 22, 2019 at 1:45 PM Melanie Plageman
 wrote:
>
>
> On Wed, Aug 21, 2019 at 12:16 PM Alvaro Herrera  
> wrote:
>>
>> On 2019-Aug-21, Melanie Plageman wrote:
>>
>> > In Greenplum, we mainly add new tests to a separate isolation
>> > framework (called isolation2) which uses a completely different
>> > syntax. It doesn't use isolationtester at all. So, I haven't had a use
>> > case to add long options to isolationtester yet :)
>>
>> Is that other framework somehow more capable?
>
>
> So, there is some historical context as to why it is a separate test suite.
> And some of the differences are specific to Greenplum -- e.g. needing to 
> connect
> to a specific database in "utility mode" to do something.
>
> However, the other differences are actually pretty handy and would be 
> applicable
> to upstream as well.
> We use a different syntax than the isolation framework and have some nice
> features. Most notably, explicit control over blocking.

Asim submitted this framework just yesterday:
https://www.postgresql.org/message-id/CANXE4TdxdESX1jKw48xet-5GvBFVSq=4cgneiotqff372ko...@mail.gmail.com

-- Rob

>
> The syntax for what would be a "step" in isolation is like this:
>
> [<#>[flag]:] <sql> | ! <shell command>
>
> where # is the session number and flags include the following:
>
> &: expect blocking behavior
> >: running in background without blocking
> <: join an existing session
> q: quit the given session
>
> See the script [1] for parsing the test cases for more details on the syntax 
> and
> capabilities (it is in Python).
>
> [1] 
> https://github.com/greenplum-db/gpdb/blob/master/src/test/isolation2/sql_isolation_testcase.py
>
> --
> Melanie Plageman




Infinite wait for SyncRep while handling USR1

2019-08-22 Thread Soumyadeep Chakraborty
Hello Hackers,

We found an edge case in 9_5_STABLE (it doesn't reproduce on 9_6+) in how
backends handle the TERM signal while handling a USR1 signal, which can
cause them to wait forever. A backend can end up waiting forever inside
SyncRepWaitForLSN() at:

rc = WaitLatch(MyLatch, WL_LATCH_SET |
WL_POSTMASTER_DEATH, -1, WAIT_EVENT_SYNC_REP);

Even pg_terminate_backend() would not be able to terminate it after it
reaches this point.

Stack trace:

#0  0x7f39418d73c8 in poll () from /lib64/libc.so.6
#1  0x00753d7b in WaitLatchOrSocket (latch=0x7f3940d371cc,
wakeEvents=17, sock=-1, timeout=-1)
at pg_latch.c:333
#2  0x00753b8c in WaitLatch (latch=0x7f3940d371cc, wakeEvents=17,
timeout=-1) at pg_latch.c:197
#3  0x007a0e02 in SyncRepWaitForLSN (XactCommitLSN=167868344) at
syncrep.c:231
#4  0x00518b8a in RecordTransactionCommit () at xact.c:1349
#5  0x0051966d in CommitTransaction () at xact.c:2057
#6  0x0051a1f9 in CommitTransactionCommand () at xact.c:2769
#7  0x0054ff5e in RemoveTempRelationsCallback (code=1, arg=0) at
namespace.c:3878
#8  0x007be3d1 in shmem_exit (code=1) at ipc.c:228
#9  0x007be2c6 in proc_exit_prepare (code=1) at ipc.c:185
#10 0x007be234 in proc_exit (code=1) at ipc.c:102
#11 0x0093152e in errfinish (dummy=0) at elog.c:535
#12 0x007ea492 in ProcessInterrupts () at postgres.c:2913
#13 0x007e9f93 in die (postgres_signal_arg=15) at postgres.c:2682
#14 <signal handler called>
#15 procsignal_sigusr1_handler (postgres_signal_arg=10) at procsignal.c:271
#16 <signal handler called>
#17 SocketBackend (inBuf=0x7ffc8dde0f80) at postgres.c:353
#18 0x007e70ca in ReadCommand (inBuf=0x7ffc8dde0f80) at
postgres.c:510
#19 0x007eb990 in PostgresMain (argc=1, argv=0x1708cd0,
dbname=0x1708b38 "gpadmin",
username=0x1708b18 "gpadmin") at postgres.c:4032
#20 0x00769922 in BackendRun (port=0x172e060) at postmaster.c:4309
#21 0x00769086 in BackendStartup (port=0x172e060) at
postmaster.c:3983
#22 0x007657dc in ServerLoop () at postmaster.c:1706
#23 0x00764e16 in PostmasterMain (argc=3, argv=0x1707ce0) at
postmaster.c:1314
#24 0x006bcdb3 in main (argc=3, argv=0x1707ce0) at main.c:228

Root cause:

The reason the backend waits forever in WaitLatch is that it expects a
USR1 signal to be delivered in order to wake it up from the wait (see the
WaitLatchOrSocket implementation) using latch_sigusr1_handler. But since it
is already inside a USR1 handler, USR1 is blocked by default. Thus, it never
receives a USR1 and never wakes up.

Reproduction: [9_5_STABLE, commit: 5a32fcd]

We must align the stars as follows:

1. Create a psql session and create a temp table. [temp table forces a Tx
during backend termination to drop the table => SyncRepWaitForLSN will be
called]

2. Since we don't have a fault injection framework, we have to rely on the
debugger: Attach to the following processes and install these breakpoints:

i) Attach to the Backend:
b procsignal_sigusr1_handler
b postgres.c:353
at: SocketBackend: ereport(DEBUG1,
(errcode(ERRCODE_CONNECTION_DOES_NOT_EXIST),
errmsg("unexpected EOF on client connection")));

ii) Attach to the Wal Sender, no need for a specific breakpoint [We need to
make the Wal Sender fall behind in order for the condition lsn <=
WalSndCtl->lsn[mode] to be false inside SyncRepWaitForLSN()]

3. kill -9 the psql process [This will send an EOF to the backend's
connection to the front end, which will cause DoingCommandRead = true and
whereToSendOutput = DestNone (in SocketBackend()). This will later ensure
that ProcessInterrupts() is called from within die: the backend's TERM
handler]

4. Continue the backend process in the debugger; it should reach the
breakpoint we set inside SocketBackend.

5. Send a USR1 to the backend: kill -SIGUSR1. Continue the backend. We will
hit the breakpoint at procsignal_sigusr1_handler.

6. Send a TERM or perform pg_terminate_backend() on the backend.

7. Detach from all of the processes.

8. The backend will be waiting for replication forever inside
SyncRepWaitForLSN(). We can confirm this by pausing the backend and checking
the backtrace, and also by looking at pg_stat_activity. Subsequent attempts
to terminate the backend will be futile.


Working fix:
At the start of SyncRepWaitForLSN(), check if we are already in a USR1
handler. If yes, we return. We use sigprocmask() and sigismember() to
perform the detection.
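
A minimal sketch of that detection (the function name is ours, not the
patch's; it only relies on sigprocmask() and sigismember()):

static bool
already_in_sigusr1_handler(void)
{
    sigset_t    cur_mask;

    /* With a NULL "set", sigprocmask() just reports the current mask. */
    if (sigprocmask(SIG_BLOCK, NULL, &cur_mask) != 0)
        return false;           /* be conservative if we cannot tell */

    /* SIGUSR1 is blocked while its handler is running (default sigaction). */
    return sigismember(&cur_mask, SIGUSR1) == 1;
}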

-
Ashwin Agrawal, Bhuvnesh Chaudhary, Jesse Zhang and Soumyadeep Chakraborty


Re: Comment in ginpostinglist.c doesn't match code

2019-08-22 Thread Ashwin Agrawal
On Thu, Aug 22, 2019 at 1:14 AM Heikki Linnakangas  wrote:

>
> The patch also includes a little unit test module to test this without
> creating a 16 TB table. A whole new test module seems a bit like
> overkill just for this, but clearly we were missing test coverage here.
> And it will come handy, if we want to invent a new better posting list
> format in the future. Thoughts on whether to include the test module or
> not?
>

I like the test, as it adds missing coverage, which is important. Also, it
really simplifies the validation effort if a change is needed in this area
anytime in the future. So, I would +1 keeping it.


Re: Cleanup isolation specs from unused steps

2019-08-22 Thread Michael Paquier
On Thu, Aug 22, 2019 at 10:20:48AM -0700, Melanie Plageman wrote:
> So, there is some historical context as to why it is a separate test suite.
> And some of the differences are specific to Greenplum -- e.g. needing to
> connect to a specific database in "utility mode" to do something.

What is "utility mode"?

> The syntax for what would be a "step" in isolation is like this:
> 
> [<#>[flag]:] <sql> | ! <shell command>
> 
> where # is the session number and flags include the following:
> 
> &: expect blocking behavior
> >: running in background without blocking
> <: join an existing session
> q: quit the given session

These could be transposed as new meta commands for the existing
specs?  Of course not as "step" per-se, but new dedicated commands?

> See the script [1] for parsing the test cases for more details on the 
> syntax and capabilities (it is in Python).

Hmm.  The bar to add a new hard language dependency in the test
suites is very high.  I am not sure that we'd want something with a
python dependency for the tests, also knowing how Python likes
breaking compatibility (isolation2_main() also mentions a dependency
on Python).
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2019-08-22 Thread Masahiko Sawada
On Thu, Aug 22, 2019 at 12:36 AM Masahiko Sawada  wrote:
>
> On Thu, Aug 22, 2019 at 12:19 AM Alvaro Herrera
>  wrote:
> >
> > Can I interest someone into updating this patch?  We now have (I think)
> > an agreed design, and I think the development work needed should be
> > straightforward.  We also already have the popcount stuff, so that's a
> > few lines to be removed from the patch ...
> >
>
> I will update the patch and register to the next Commit Fest tomorrow
> if nobody is interested in.
>

Attached is the updated patch. While updating the docs I realized that
perhaps we should have a new section for heap and put the descriptions
of the heap functions into it rather than having them as general
functions. If we need this change, it is for PG12. I will register
only the new feature patch in the next Commit Fest.

Please review them.



Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 8d81f88..0b92025 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -127,6 +127,35 @@ test=# SELECT page_checksum(get_raw_page('pg_class', 0), 0);
 

 
+ fsm_page_contents(page bytea) returns text
+ 
+  fsm_page_contents
+ 
+
+
+
+ 
+  fsm_page_contents shows the internal node structure
+  of a FSM page. The output is a multiline string, with one line per
+  node in the binary tree within the page. Only those nodes that are not
+  zero are printed. The so-called "next" pointer, which points to the
+  next slot to be returned from the page, is also printed.
+ 
+ 
+  See src/backend/storage/freespace/README for more
+  information on the structure of an FSM page.
+ 
+
+   
+  
+ 
+
+ 
+  Heap Functions
+
+  
+   
+
  heap_page_items(page bytea) returns setof record
  
   heap_page_items
@@ -203,29 +232,6 @@ test=# SELECT * FROM heap_page_item_attrs(get_raw_page('pg_class', 0), 'pg_class
  
 

-
-   
-
- fsm_page_contents(page bytea) returns text
- 
-  fsm_page_contents
- 
-
-
-
- 
-  fsm_page_contents shows the internal node structure
-  of a FSM page. The output is a multiline string, with one line per
-  node in the binary tree within the page. Only those nodes that are not
-  zero are printed. The so-called "next" pointer, which points to the
-  next slot to be returned from the page, is also printed.
- 
- 
-  See src/backend/storage/freespace/README for more
-  information on the structure of an FSM page.
- 
-
-   
   
  
 
From 018077d786f874cb314b5f61b5ef85f42c62bbe5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Fri, 23 Aug 2019 10:36:13 +0900
Subject: [PATCH v3] Introduce heap_infomask_flags to decode infomask and
 infomask2

---
 contrib/pageinspect/Makefile  |   2 +-
 contrib/pageinspect/expected/page.out |  97 
 contrib/pageinspect/heapfuncs.c   | 104 ++
 contrib/pageinspect/pageinspect--1.7--1.8.sql |  13 
 contrib/pageinspect/pageinspect.control   |   2 +-
 contrib/pageinspect/sql/page.sql  |  26 +++
 doc/src/sgml/pageinspect.sgml |  33 
 7 files changed, 275 insertions(+), 2 deletions(-)
 create mode 100644 contrib/pageinspect/pageinspect--1.7--1.8.sql

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index e5a581f..cfe0129 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -5,7 +5,7 @@ OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \
 		  brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES)
 
 EXTENSION = pageinspect
-DATA =  pageinspect--1.6--1.7.sql \
+DATA =  pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \
 	pageinspect--1.5.sql pageinspect--1.5--1.6.sql \
 	pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
 	pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
diff --git a/contrib/pageinspect/expected/page.out b/contrib/pageinspect/expected/page.out
index 3fcd9fb..3b47599 100644
--- a/contrib/pageinspect/expected/page.out
+++ b/contrib/pageinspect/expected/page.out
@@ -82,6 +82,103 @@ SELECT * FROM fsm_page_contents(get_raw_page('test1', 'fsm', 0));
  
 (1 row)
 
+-- If we freeze the only tuple on test1, the infomask should
+-- always be the same in all test runs.
+VACUUM FREEZE test1;
+SELECT t_infomask, t_infomask2, flags
+FROM heap_page_items(get_raw_page('test1', 0)),
+ LATERAL heap_infomask_flags(t_infomask, t_infomask2, true) m(flags);
+ t_infomask | t_infomask2 |flags 
++-+--
+   2816 |   2 | {HEAP_XMAX_INVALID,HEAP_XMIN_FROZEN}
+(1 row)
+
+SELECT t_infomask, t_infomask2, flags
+FROM heap_page_items(get_raw_page('test1', 0)),
+ LATERAL heap_infomask_

Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2019-08-22 Thread Michael Paquier
On Fri, Aug 23, 2019 at 11:09:44AM +0900, Masahiko Sawada wrote:
> While updating the doc I realized that
> perhaps we should have the new section for heap and put the
> descriptions of heap functions into it rather than having them as
> general functions. If we need this change it is for PG12. I will
> register only the new feature patch to the next Commit Fest.

I agree with the new heap section, and your patch on that looks good.
While on it, I have one suggestion: fsm_page_contents does not have an
example query.  Could we add one?  An example consistent with the other
functions' examples:
=# SELECT fsm_page_contents(get_raw_page('pg_class', 'fsm', 0));
--
Michael


signature.asc
Description: PGP signature


Re: Cleanup isolation specs from unused steps

2019-08-22 Thread Melanie Plageman
On Thu, Aug 22, 2019 at 6:53 PM Michael Paquier  wrote:

> On Thu, Aug 22, 2019 at 10:20:48AM -0700, Melanie Plageman wrote:
> > So, there is some historical context as to why it is a separate test
> suite.
> > And some of the differences are specific to Greenplum -- e.g. needing to
> > connect to a specific database in "utility mode" to do something.
>
> What is "utility mode"?
>

It's a connection parameter that allows you to connect to a single Postgres
node in a Greenplum cluster. I only included it as an example of the kind of
"Greenplum-specific" things that are in the test framework.


>
> > The syntax for what would be a "step" in isolation is like this:
> >
> > [<#>[flag]:] <sql> | ! <shell command>
> >
> > where # is the session number and flags include the following:
> >
> > &: expect blocking behavior
> > >: running in background without blocking
> > <: join an existing session
> > q: quit the given session
>
> These could be transposed as new meta commands for the existing
> specs?  Of course not as "step" per-se, but new dedicated commands?
>
>
Yes, I think you could definitely add some of the flags as meta-commands for
existing steps -- the current implementation of "blocking" for isolation is
really limiting.
However, features aside, I actually find the existing "step" syntax in
isolation clunkier than the syntax used in Greenplum's "isolation2" framework.


> > See the script [1] for parsing the test cases for more details on the
> > syntax and capabilities (it is in Python).
>
> Hmm.  The bar to add a new hard language dependency in the test
> suites is very high.  I am not sure that we'd want something with a
> python dependency for the tests, also knowing how Python likes
> breaking compatibility (isolation2_main() also mentions a dependency
> to Python).
>

Agreed, I don't think it needs to be in Python.

My point was that some of our "isolation2" framework has to be different
because it enables us to test features that are in Greenplum and not in
Postgres. However, many of the features it has would actually be really
handy for testing Postgres as well. It wasn't initially suggested upstream
because it is actually mainly ported from a separate standalone testing
framework that was written at Greenplum in Python.

I've heard Greenplum folks talk about re-writing our "isolation2" framework
in (probably) C and making it a better fit to contribute. It's definitely
on my wishlist.

-- 
Melanie Plageman


Hooks for session start and end, take two

2019-08-22 Thread Michael Paquier
Hi all,

Attached is a patch set to respawn the issue of $subject which has
been discussed here:
https://www.postgresql.org/message-id/20170720204733.40f2b7eb.nag...@sraoss.co.jp

The patch was committed once as of cd8ce3a, but it got reverted shortly
afterwards with 98d54bb because of buildfarm failures.

The root of the buildfarm issues was that session hooks cannot be
tested with a simple LOAD, hence we need to mark the test with
NO_INSTALLCHECK.  Unfortunately we lacked support for that in MSVC
scripts, until I solved that with 431f1599 when refactoring PGXS
makefile rules for regression tests.

While on it, I have reviewed the patch, cleaning it up a bit, and I
found some issues, so the initial patch was not fully baked either:
- The hooks were previously only called for normal backends, which was
incorrect, as we apply no backend-related filtering for the hooks.
- The README could be also more talkative.
- test_session_hooks--1.0.sql also got confused with the module name.
And actually there is no need to have the SQL and control files just
for a module loading a hook.  So it is enough to use MODULE_big for
this purpose.
- The query generated needs to use quote_literal_cstr for the string
values added to the query.
- sample_session_start_hook and sample_session_end_hook missed a
(void), causing a compiler warning on Windows. 

Attached is an updated patch set to reintroduce the hooks, as there was
a request for them recently, also fixing the issue that we previously
tried to deal with.  I have tested the patch on Windows to make sure that
the test gets correctly bypassed, so this time we should not have any
buildfarm failures.
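
For reference, loading a module that uses these hooks boils down to
something like the following (a minimal sketch, not the attached test
module; only the hook types and variables come from the patch, the rest
is illustrative):

#include "postgres.h"
#include "fmgr.h"
#include "tcop/tcopprot.h"

PG_MODULE_MAGIC;

static session_start_hook_type prev_session_start_hook = NULL;
static session_end_hook_type prev_session_end_hook = NULL;

static void
my_session_start_hook(void)
{
    if (prev_session_start_hook)
        prev_session_start_hook();
    elog(LOG, "session starting");
}

static void
my_session_end_hook(void)
{
    if (prev_session_end_hook)
        prev_session_end_hook();
    elog(LOG, "session ending");
}

void
_PG_init(void)
{
    prev_session_start_hook = session_start_hook;
    session_start_hook = my_session_start_hook;
    prev_session_end_hook = session_end_hook;
    session_end_hook = my_session_end_hook;
}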

I am adding that to next CF.  Credits go of course to the initial
authors and reviewers of this feature.

Any opinions?
--
Michael
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e8d8e6f828..6d80cc2d64 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -171,6 +171,9 @@ static ProcSignalReason RecoveryConflictReason;
 static MemoryContext row_description_context = NULL;
 static StringInfoData row_description_buf;
 
+/* Hook for plugins to get control at start of session */
+session_start_hook_type session_start_hook = NULL;
+
 /* 
  *		decls for routines only used in this file
  * 
@@ -3968,6 +3971,9 @@ PostgresMain(int argc, char *argv[],
 	if (!IsUnderPostmaster)
 		PgStartTime = GetCurrentTimestamp();
 
+	if (session_start_hook)
+		(*session_start_hook) ();
+
 	/*
 	 * POSTGRES main processing loop begins here
 	 *
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 43b9f17f72..4037267be9 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -78,6 +78,8 @@ static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
 
+/* Hook for plugins to get control at end of session */
+session_end_hook_type session_end_hook = NULL;
 
 /*** InitPostgres support ***/
 
@@ -1197,6 +1199,10 @@ ShutdownPostgres(int code, Datum arg)
 	 * them explicitly.
 	 */
 	LockReleaseAll(USER_LOCKMETHOD, true);
+
+	/* Hook at session end */
+	if (session_end_hook)
+		(*session_end_hook) ();
 }
 
 
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index ec21f7e45c..63581048b2 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -30,6 +30,13 @@ extern PGDLLIMPORT const char *debug_query_string;
 extern int	max_stack_depth;
 extern int	PostAuthDelay;
 
+/* Hook for plugins to get control at start and end of session */
+typedef void (*session_start_hook_type) (void);
+typedef void (*session_end_hook_type) (void);
+
+extern PGDLLIMPORT session_start_hook_type session_start_hook;
+extern PGDLLIMPORT session_end_hook_type session_end_hook;
+
 /* GUC-configurable parameters */
 
 typedef enum
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 60d6d7be1b..066b98e114 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -18,6 +18,7 @@ SUBDIRS = \
 		  test_predtest \
 		  test_rbtree \
 		  test_rls_hooks \
+		  test_session_hooks \
 		  test_shm_mq \
 		  unsafe_tests \
 		  worker_spi
diff --git a/src/test/modules/test_session_hooks/.gitignore b/src/test/modules/test_session_hooks/.gitignore
new file mode 100644
index 00..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_session_hooks/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_session_hooks/Makefile b/src/test/modules/test_session_hooks/Makefile
new file mode 100644
index 00..f6fcdeaf20
--- /dev/null
+++ b/src/test/modules/test_session_hooks/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_ses

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-08-22 Thread Peter Geoghegan
On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova
 wrote:
> I'm going to look through the patch once more to update nbtxlog
> comments, where needed and
> answer to your remarks that are still left in the comments.

Have you been using amcheck's rootdescend verification? I see this
problem with v8, with the TPC-H test data:

DEBUG:  finished verifying presence of 150 tuples from table
"customer" with bitset 51.09% set
ERROR:  could not find tuple using search from root page in index
"idx_customer_nationkey2"

I've been running my standard amcheck query with these databases, which is:

SELECT bt_index_parent_check(index => c.oid, heapallindexed => true,
rootdescend => true),
c.relname,
c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
AND c.relpersistence != 't'
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;

There were many large indexes that amcheck didn't detect a problem
with. I don't yet understand what the problem is, or why we only see
the problem for a small number of indexes. Note that all of these
indexes passed verification with v5, so this is some kind of
regression.

I also noticed that there were some regressions in the size of indexes
-- indexes were not nearly as small as they were in v5 in some cases.
The overall picture was a clear regression in how effective
deduplication is.

I think that it would save time if you had direct access to my test
data, even though it's a bit cumbersome. You'll have to download about
10GB of dumps, which require plenty of disk space when restored:

regression=# \l+
                                                             List of databases
    Name    | Owner | Encoding |  Collate   |   Ctype    | Access privileges |  Size   | Tablespace |                 Description
------------+-------+----------+------------+------------+-------------------+---------+------------+---------------------------------------------
 land       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 6425 MB | pg_default |
 mgd        | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 61 GB   | pg_default |
 postgres   | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 7753 kB | pg_default | default administrative connection database
 regression | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 886 MB  | pg_default |
 template0  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | unmodifiable empty database
            |       |          |            |            | pg=CTc/pg         |         |            |
 template1  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | default template for new databases
            |       |          |            |            | pg=CTc/pg         |         |            |
 tpcc       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 10 GB   | pg_default |
 tpce       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 26 GB   | pg_default |
 tpch       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 32 GB   | pg_default |
(9 rows)

I have found it very valuable to use this test data when changing
nbtsplitloc.c, or anything that could affect where page splits make
free space available. If this is too much data to handle conveniently,
then you could skip "mgd" and almost have as much test coverage. There
really does seem to be a benefit to using diverse test cases like
this, because sometimes regressions only affect a small number of
specific indexes for specific reasons. For example, only TPC-H has a
small number of indexes that have tuples that are inserted in order,
but also have many duplicates. Removing the BT_COMPRESS_THRESHOLD
stuff really helped with those indexes.

Want me to send this data and the associated tests script over to you?

-- 
Peter Geoghegan




Re: Cleanup isolation specs from unused steps

2019-08-22 Thread Michael Paquier
On Thu, Aug 22, 2019 at 09:19:47PM -0700, Melanie Plageman wrote:
> It's a connection parameter that allows you to connect to a single Postgres
> node in a Greenplum cluster. I only included it as an example of the kind of
> "Greenplum-specific" things that are in the test framework.

Ah, I see.  I had the same stuff for XC to connect to data nodes.

> I've heard Greenplum folks talk about re-writing our "isolation2" framework
> in (probably) C and making it a better fit to contribute. It's definitely on
> my wishlist.

Nice to hear that.
--
Michael


signature.asc
Description: PGP signature


Re: when the IndexScan reset to the next ScanKey for in operator

2019-08-22 Thread Peter Geoghegan
On Wed, Aug 21, 2019 at 6:24 AM Alex  wrote:
> suppose the executor  should scan 1 first,  If all the tuples for 1 has been 
> scanned,  then **it should be reset to 10**  and scan again.

You might find my nbtree index scan test patch useful:

https://postgr.es/m/cah2-wzmrt_0ybhf05axqb2oituqiqakr0lznntj8x3kadkz...@mail.gmail.com

-- 
Peter Geoghegan




Re: POC: Cleaning up orphaned files using undo logs

2019-08-22 Thread Thomas Munro
On Wed, Aug 21, 2019 at 4:44 AM Andres Freund  wrote:
> On 2019-08-20 21:02:18 +1200, Thomas Munro wrote:
> > 1.  Anyone is allowed to try to read or write data at any UndoRecPtr
> > that has been allocated, through the buffer pool (though you'd usually
> > want to check it with UndoRecPtrIsDiscarded() first, and only rely on
> > the system I'm describing to deal with races).
> >
> > 2.  ReadBuffer() might return InvalidBuffer.  This can happen for a
> > cache miss, if the smgrread implementation wants to indicate that the
> > buffer has been discarded/truncated and that is expected (md.c won't
> > ever do that, but undofile.c can).
>
> Hm. This gives me a bit of a stomach ache. It somehow feels like a weird
> form of signalling.  Can't quite put my finger on why it makes me feel
> queasy.

Well, if we're going to allow concurrent access and discarding, there
has to be some part of the system where you can discover that the
thing you wanted is gone.  What's wrong with here?

Stepping back a bit... why do we have a new concept here?  The reason
we don't have anything like this for relations currently is that we
don't have live references to blocks that are expected to be
concurrently truncated, due to heavyweight locking.  But the whole
purpose of the undo system is to allow cheap truncation/discarding of
data that you *do* have live references to, and furthermore the
discarding is expected to be frequent.  The present experiment is
about trying to do so without throwing our hands up and using a
pessimistic lock.

At one point Robert and I discussed some kind of scheme where you'd
register your interest in a range of the log before you begin reading
(some kind of range locks or hazard pointers), so that you would block
discarding in that range, but the system would still allow someone to
read in the middle of the log while the discard worker concurrently
discards non-overlapping data at the end.  But I kept returning to the
idea that the buffer pool already has block-level range locking of
various kinds.  You can register your interest in a range by pinning
the buffers.  That's when you'll find out if the range is already
gone.  We could add an extra layer of range locking around that, but
it wouldn't be any better, it'd just thrash your bus a bit more, and
require more complexity in the discard worker (it has to defer
discarding a bit, and either block or go away and come back later).

> > 3.  UndoLogDiscard() uses DiscardBuffer() to invalidate any currently
> > unpinned buffers, and marks as BM_DISCARDED any that happen to be
> > pinned right now, so they can't be immediately invalidated.  Such
> > buffers are never written back and are eligible for reuse on the next
> > clock sweep, even if they're written into by a backend that managed to
> > do that when we were trying to discard.
>
> Hm. When is it legitimate for a backend to write into such a buffer? I
> guess that's about updating the previous transaction's next pointer? Or
> progress info?

Yes, previous transaction header's next pointer, and progress counter
during rollback.  We're mostly interested in the next pointer here,
because the progress counter update would normally not be updated at a
time when the page might be concurrently discarded.  The exception to
that is a superuser running CALL pg_force_discard_undo() (a
data-eating operation designed to escape a system that can't
successfully roll back and gets stuck, blowing away
not-yet-rolled-back undo records).

Here are some other ideas about how to avoid conflicts between
discarding and transaction header update:

1.  Lossy self-update-only: You could say that transactions are only
allowed to write to their own transaction header, and then have them
periodically update their own length in their own transaction header,
and then teach the discard worker that the length information is only
a starting point for a linear search for the next transaction based on
page header information.  That removes complexity for writers, but
adds complexity and IO and CPU to the discard worker.  Bleugh.

2.  Strict self-update-only:  We could update it as part of
transaction cleanup.  That is, when you commit or abort, probably at
some time when your transaction is still advertised as running, you go
and update your own transaction header with your the size.  If you
never reach that stage, I think you can fix it during crash recovery,
during the startup scan that feeds the rollback request queues.  That
is, if you encounter a transaction header with length == 0, it must be
the final one and its length is therefore known and can be updated,
before you allow new transactions to begin.  There are some
complications like backends that exit without crashing, which I
haven't thought about.  As Amit just pointed out to me, that means
that the update is not folded into the same buffer access as the next
transaction, but perhaps you can mitigate that by not updating your
header if the next header will be on the same p

Re: mingw32 floating point diff

2019-08-22 Thread Peter Eisentraut
On 2019-08-22 18:19, Tom Lane wrote:
> What I suggest doing is reducing extra_float_digits to -1 for this
> specific test.  Changing the contents of circle_tbl seems like it'd have
> more consequences than we want, in particular there's no guarantee that
> we'd not hit similar issues in other tests if they're given different
> inputs.

I agree that reducing the output precision is better than adjusting the
code.

The circle.sql file already has SET extra_float_digits TO 0, and a few
other files have other settings with different values.  Are we content
to use various numbers until it works in each case, or should we try to
use some consistency?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




pg_checksums --help synopsis is quite long

2019-08-22 Thread Peter Eisentraut
With the additional functionality, the --help synopsis of pg_checksums
has gotten quite long:

   pg_checksums enables, disables, or verifies data checksums in a
   PostgreSQL database cluster.

Can we try to shorten this a bit?  Maybe

   pg_checksums manages data checksums in a PostgreSQL database cluster.

Other ideas?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2019-08-22 Thread Masahiko Sawada
On Fri, Aug 23, 2019 at 12:27 PM Michael Paquier  wrote:
>
> On Fri, Aug 23, 2019 at 11:09:44AM +0900, Masahiko Sawada wrote:
> > While updating the doc I realized that
> > perhaps we should have the new section for heap and put the
> > descriptions of heap functions into it rather than having them as
> > general functions. If we need this change it is for PG12. I will
> > register only the new feature patch to the next Commit Fest.
>
> I agree with the new heap section, and your patch on that looks good.
> While on it, I have one suggestion: fsm_page_contents does not have an
> example of query.  Could we add one while on it?  An example
> consistent with the other function's examples:
> =# SELECT fsm_page_contents(get_raw_page('pg_class', 'fsm', 0));

Good idea. I've updated the doc update patch.


Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 8d81f88..e1753aa 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -127,6 +127,55 @@ test=# SELECT page_checksum(get_raw_page('pg_class', 0), 0);
 

 
+ fsm_page_contents(page bytea) returns text
+ 
+  fsm_page_contents
+ 
+
+
+
+ 
+  fsm_page_contents shows the internal node structure
+  of a FSM page. The output is a multiline string, with one line per
+  node in the binary tree within the page. For example:
+
+test=# SELECT fsm_page_contents(get_raw_page('pg_class', 'fsm', 0));
+ fsm_page_contents
+---
+ 0: 235   +
+ 1: 235   +
+ 3: 235   +
+ 7: 235   +
+ 15: 235  +
+ 31: 235  +
+ 63: 235  +
+ 127: 235 +
+ 255: 235 +
+ 511: 235 +
+ 1023: 235+
+ 2047: 235+
+ 4095: 235+
+ fp_next_slot: 0  +
+
+  Only those nodes that are not zero are printed. The so-called "next"
+  pointer, which points to the next slot to be returned from the page,
+  is also printed.
+ 
+ 
+  See src/backend/storage/freespace/README for more
+  information on the structure of an FSM page.
+ 
+
+   
+  
+ 
+
+ 
+  Heap Functions
+
+  
+   
+
  heap_page_items(page bytea) returns setof record
  
   heap_page_items
@@ -203,29 +252,6 @@ test=# SELECT * FROM heap_page_item_attrs(get_raw_page('pg_class', 0), 'pg_class
  
 

-
-   
-
- fsm_page_contents(page bytea) returns text
- 
-  fsm_page_contents
- 
-
-
-
- 
-  fsm_page_contents shows the internal node structure
-  of a FSM page. The output is a multiline string, with one line per
-  node in the binary tree within the page. Only those nodes that are not
-  zero are printed. The so-called "next" pointer, which points to the
-  next slot to be returned from the page, is also printed.
- 
- 
-  See src/backend/storage/freespace/README for more
-  information on the structure of an FSM page.
- 
-
-   
   
  
 


Re: pg_checksums --help synopsis is quite long

2019-08-22 Thread Fabien COELHO


Hello Peter,


> With the additional functionality, the --help synopsis of pg_checksums
> has gotten quite long:
>
>    pg_checksums enables, disables, or verifies data checksums in a
>    PostgreSQL database cluster.
>
> Can we try to shorten this a bit?  Maybe
>
>    pg_checksums manages data checksums in a PostgreSQL database cluster.
>
> Other ideas?


My 0.02 €:

  pg_checksums triggers or verifies checksums in a Postgres cluster.

I like the somewhat detailed functionality list, if space allows, so I
tried to compress the other parts.


Not sure that there could be a checksum on anything else but data.

I have decided that PostgreSQL is a mouthful, so I'd rather use
"Postgres".


--
Fabien.
