Re: WIP: Avoid creation of the free space map for small tables

2018-11-23 Thread Amit Kapila
On Fri, Nov 23, 2018 at 11:56 AM John Naylor  wrote:
>
> On 11/16/18, Amit Kapila  wrote:
>
> >
> > 3.
> >  static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
> > -uint8 newValue, uint8 minValue);
> > + uint8 newValue, uint8 minValue);
> >
> > This appears to be a spurious change.
>
> 2 and 3 are separated into 0001.
>

Is the point 3 change related to pgindent?  Even if you want these
changes, don't prepare other patches on top of them; keep them
entirely separate.

>
> Also, we don't quite have a consensus on the threshold value, but I
> have set it to 4 pages for v8. If this is still considered too
> expensive (and basic tests show it shouldn't be), I suspect it'd be
> better to interleave the available block numbers as described a couple
> days ago than lower the threshold further.
>

Can you please repeat the COPY test you did above, but with the
fillfactor set to 20 and 30?
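Something along these lines would do (just a sketch; the table, column,
and data-file names are placeholders for whatever you used before):

CREATE TABLE copy_ff20 (a int, b text) WITH (fillfactor = 20);
\timing on
COPY copy_ff20 FROM '/path/to/copy_test.data';
\timing off

CREATE TABLE copy_ff30 (a int, b text) WITH (fillfactor = 30);
\timing on
COPY copy_ff30 FROM '/path/to/copy_test.data';
\timing off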

> I have looked at zhio.c, and it seems trivial to adapt zheap to this patchset.
>

Cool, I also think so.

Few more comments:
---
1. I think we can add some test(s) for the new functionality, maybe
something along the lines of what Robert originally provided as an
example of this behavior [1].
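For instance, a sketch of such a test (the table name and row counts
here are only illustrative, and the expected output depends on the
chosen threshold):

CREATE TABLE fsm_check (a int, b text);
INSERT INTO fsm_check SELECT i, 'x' FROM generate_series(1, 100) i;
VACUUM fsm_check;
-- below the threshold, no FSM fork should have been created
SELECT pg_relation_size('fsm_check', 'fsm') AS fsm_size;
-- and freed space should still get reused instead of extending the relation
DELETE FROM fsm_check WHERE a <= 50;
VACUUM fsm_check;
INSERT INTO fsm_check SELECT i, 'x' FROM generate_series(1, 50) i;
SELECT pg_relation_size('fsm_check', 'main') /
       current_setting('block_size')::int AS blocks;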

2.
@@ -2554,6 +2555,12 @@ AbortTransaction(void)
  s->parallelModeLevel = 0;
  }

+ /*
+ * In case we aborted during RelationGetBufferForTuple(),
+ * clear the local map of heap pages.
+ */
+ FSMClearLocalMap();
+

A similar call is required in the AbortSubTransaction function as
well.  I suggest adding it after pgstat_progress_end_command in both
functions.
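i.e. something like this in AbortSubTransaction, mirroring the hunk
above (just a sketch of the placement):

 	pgstat_progress_end_command();

+	/*
+	 * In case we aborted during RelationGetBufferForTuple(),
+	 * clear the local map of heap pages.
+	 */
+	FSMClearLocalMap();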

3.
GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
{
..
+ if (target_block == InvalidBlockNumber &&
+ rel->rd_rel->relkind == RELKIND_RELATION)
+ {
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (nblocks > HEAP_FSM_CREATION_THRESHOLD)
+ {
+ /*
+ * If the FSM knows nothing of the rel, try the last page before
+ * we give up and extend.  This avoids one-tuple-per-page syndrome
+ * during bootstrapping or in a recently-started system.
+ */
+ target_block = nblocks - 1;
+ }
..
}

Moving this check inside GetPageWithFreeSpace has one disadvantage: we
will always consider the last block, which can have some inadvertent
effects.  Consider when this function gets called from
RelationGetBufferForTuple just before extension: it can consider
checking the last block even though that was already done at the
beginning, when GetPageWithFreeSpace was first called.  I am not
completely sure how much of a concern this is, because it does help to
check the last block when one has been concurrently added and the FSM
has not been updated for it yet.  I am slightly worried because the
unpatched code doesn't care about such a case and we have no intention
to change that behaviour.  What do you think?

4. You have mentioned above that "system catalogs created during
bootstrap still have a FSM if they have any data.", and I can also see
this behavior.  Have you investigated this point further?

5. Your logic to update the FSM on the standby seems okay, but can you
show some tests which prove its sanity?
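Perhaps something like this (only a sketch; the table name is a
placeholder and the expected size depends on the chosen threshold):

-- on the primary
CREATE TABLE fsm_standby_check (a int);
INSERT INTO fsm_standby_check SELECT generate_series(1, 1000);
DELETE FROM fsm_standby_check WHERE a % 2 = 0;
VACUUM fsm_standby_check;

-- on the standby, once the above has been replayed
SELECT pg_relation_size('fsm_standby_check', 'fsm');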

[1] - 
https://www.postgresql.org/message-id/CA%2BTgmoac%2B6qTNp2U%2BwedY8-PU6kK_b6hbdhR5xYGBG3GtdFcww%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] pgbench - allow to store select results into variables

2018-11-23 Thread Fabien COELHO


Hello Alvaro,


Please when you rebase, consider these (minor) changes.


Here it is.

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index b5e3a62a33..246944ea79 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -954,6 +954,87 @@ pgbench  options  d
   
 
   
+   
+
+ \cset [prefix]
+
+
+
+ 
+  This command may be used to end SQL queries, replacing an embedded
+  semicolon (\;) within a compound SQL command.
+ 
+
+ 
+  When this command is used, the preceding SQL query is expected to
+  return one row, the columns of which are stored into variables named after
+  column names, and prefixed with prefix if provided.
+ 
+
+ 
+  The following example sends three queries as one compound SQL command,
+  inducing one message sent at the protocol level.
+  The result of the second query is stored into variable two,
+  whereas the results of the other queries are discarded.
+
+-- compound of 3 queries
+SELECT 1 AS one \; SELECT 2 AS two \cset
+SELECT 2;
+
+ 
+
+ 
+  
+\cset does not work when empty SQL queries appear
+within a compound SQL command.
+  
+ 
+
+   
+
+   
+
+ \gset [prefix]
+
+
+
+ 
+  This command may be used to end SQL queries, replacing a final semicolon
+  (;). 
+ 
+
+ 
+  When this command is used, the preceding SQL query is expected to
+  return one row, the columns of which are stored into variables named after
+  column names, and prefixed with prefix if provided.
+ 
+
+ 
+  The following example puts the final account balance from the first query
+  into variable abalance, and fills variables
+  p_two and p_three 
+  with integers from the last query.
+  The result of the second query is discarded.
+
+UPDATE pgbench_accounts
+  SET abalance = abalance + :delta
+  WHERE aid = :aid
+  RETURNING abalance \gset
+-- compound of two queries
+SELECT 1 \;
+SELECT 2 AS two, 3 AS three \gset p_
+
+ 
+
+ 
+  
+\gset does not work when empty SQL queries appear
+within a compound SQL command.
+  
+ 
+
+   
+

 \if expression
 \elif expression
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index c64e16187a..fd0de34d49 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -501,12 +501,15 @@ static const char *QUERYMODE[] = {"simple", "extended", "prepared"};
 
 typedef struct
 {
-	char	   *line;			/* text of command line */
+	char	   *first_line;		/* first line for short display */
+	PQExpBufferData	lines;		/* full multi-line text of command */
 	int			command_num;	/* unique index of this Command struct */
 	int			type;			/* command type (SQL_COMMAND or META_COMMAND) */
 	MetaCommand meta;			/* meta command identifier, or META_NONE */
 	int			argc;			/* number of command words */
 	char	   *argv[MAX_ARGS]; /* command word list */
+	int			compound;   /* last compound command (number of \;) */
+	char	  **gset;   /* per-compound command prefix */
 	PgBenchExpr *expr;			/* parsed expression, if needed */
 	SimpleStats stats;			/* time spent in this command */
 } Command;
@@ -1732,6 +1735,107 @@ valueTruth(PgBenchValue *pval)
 	}
 }
 
+/* read all responses from backend, storing into variable or discarding */
+static bool
+read_response(CState *st, char **gset)
+{
+	PGresult   *res;
+	int			compound = 0;
+
+	while ((res = PQgetResult(st->con)) != NULL)
+	{
+		switch (PQresultStatus(res))
+		{
+            case PGRES_COMMAND_OK: /* non-SELECT commands */
+            case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */
+                if (gset[compound] != NULL)
+                {
+                    fprintf(stderr,
+                            "client %d file %d command %d compound %d: "
+                            "\\gset/cset expects a row\n",
+                            st->id, st->use_file, st->command, compound);
+                    st->ecnt++;
+                    return false;
+                }
+                break; /* OK */

+            case PGRES_TUPLES_OK:
+                if (gset[compound] != NULL)
+                {
+                    /* store result into variables if required */
+                    int         ntuples = PQntuples(res),
+                                nfields = PQnfields(res),
+                                f;

+                    if (ntuples != 1)
+                    {
+                        fprintf(stderr,
+                                "client %d file %d command %d compound %d: "
+                                "expecting one row, got %d\n",
+                                st->id, st->use_file, st->command, compound, ntuples);
+                        st->ecnt++;
+                        PQclear(res);
+                        discard_response(st);
+                        return false;
+                    }

+                    for (f = 0; f < nfields ; f++)
+                    {
+                        char       *varname = PQfname(res, f);
+                        /* prefix varname if required, will be freed below */
+                        if (*gset[compound] != '\0')
+                            varname = psprintf("%s%s", gset[compound], varname);

+                        /* store result as a string */
+                        if (!putVariable(st, "gset", varname,
+                                         PQgetvalue(res, 0, f)))
+                        {
+                            /* internal error, should it rather abort? */
+                            fprintf(stderr,
+	"client %d 

Re: Tid scan improvements

2018-11-23 Thread Edmund Horner
On Sat, 24 Nov 2018 at 15:46, Tomas Vondra  wrote:
> On 11/24/18 1:56 AM, Edmund Horner wrote:
> > On Fri, 23 Nov 2018 at 07:03, Tomas Vondra  
> > wrote:
> >> On 11/22/18 8:41 AM, David Rowley wrote:
> >>> ...
> >>> 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
> >>> allocation until it reaches the required size. See how
> >>> MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
> >>> we always have a power of two sized array which is much nicer if we
> >>> ever reach the palloc() limit as if the array is sized at the palloc()
> >>> limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
> >>> unlikely to be a problem here, but... the question would be how to
> >>> decide on the initial size.
> >>
> >> I think it kinda tries to do that in some cases, by doing this:
> >>
> >>  *numAllocRanges *= 2;
> >>  ...
> >>  tidRanges = (TidRange *)
> >>  repalloc(tidRanges,
> >>   *numAllocRanges * sizeof(TidRange));
> >>
> >> The problem here is that what matters is not numAllocRanges being 2^N,
> >> but the number of bytes allocated being 2^N. Because that's what ends up
> >> in AllocSet, which keeps lists of 2^N chunks.
> >>
> >> And as TidRange is 12B, this is guaranteed to waste memory, because
> >> no matter what the first factor is, the result will never be 2^N.
> >
> > For simplicity, I think making it a strict doubling of capacity each
> > time is fine.  That's what we see in numerous other places in the
> > backend code.
> >
>
> Sure.
>
> > What we don't really see is intentionally setting the initial capacity
> > so that each subsequent capacity is close-to-but-not-exceeding a power
> > of 2 bytes.  You can't really do that optimally if working in terms of
> > whole numbers of items that aren't each a power of 2 size.  This step,
> > there may be 2/3 of an item spare; next step, we'll have a whole item
> > spare that we're not going to use.
>
> Ah, I missed the detail with setting initial size.
>
> > So we could keep track in terms of bytes allocated, and then figure
> > out how many items we can fit at the current time.
> >
> > In my opinion, such complexity is overkill for Tid scans.
> >
> > Currently, we try to pick an initial size based on the number of
> > expressions.  We assume each expression will yield one range, and
> > allow that a saop expression might require us to enlarge the array.
> >
> > Again, for simplicity, we should scrap that and pick something like
> > floor(256/sizeof(TidRange)) = 21 items, with about 1.5% wastage.
> >
>
> Probably. I don't think it'd be a lot of code to do the exact sizing,
> but you're right 1.5% is close enough. As long as there is a comment
> explaining the initial sizing, I'm fine with that.
>
> If I could suggest one more thing, I'd define a struct combining the
> array of ranges, numRanges and numAllocRanges, like:
>
> typedef struct TidRanges
> {
>     int         numRanges;
>     int         numAllocRanges;
>     TidRange    ranges[FLEXIBLE_ARRAY_MEMBER];
> } TidRanges;
>
> and use that instead of the plain array. I find it easier to follow
> compared to passing the various fields directly (sometimes as a value,
> sometimes pointer to the value, etc.).

Ok, I've made rewritten it to use a struct:

typedef struct TidRangeArray {
    TidRange   *ranges;
    int         numRanges;
    int         numAllocated;
} TidRangeArray;

which is slightly different from the flexible array member version you
suggested.  The TidRangeArray is allocated on the stack in the
function that builds it, and then ranges and numRanges are copied into
the TidScanState before the function returns.

Any particular pros/cons of this versus your approach?  With yours, I
presume we'd have a pointer to TidRanges in TidScanState.
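For comparison, I imagine the flexible-array variant would be allocated
roughly like this (a sketch only, not tested):

/* allocate with room for n ranges; repalloc the whole struct to grow it */
TidRanges  *tidRanges;

tidRanges = (TidRanges *) palloc(offsetof(TidRanges, ranges) +
                                 n * sizeof(TidRange));
tidRanges->numRanges = 0;
tidRanges->numAllocRanges = n;

/* ... and TidScanState would then hold a single "TidRanges *" pointer */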

My other concern now is that EnsureTidRangeSpace needs a loop to
double the allocated size.  Most such arrays in the backend only ever
grow by 1, so a single doubling is fine, but the TID scan one can grow
by an arbitrary number with a scalar array op, and it's nice to not
have to check the space for each individual item.  Here's what I've
got.

void
EnsureTidRangeSpace(TidRangeArray *tidRangeArray, int numNewItems)
{
    int         requiredSpace = tidRangeArray->numRanges + numNewItems;

    if (requiredSpace <= tidRangeArray->numAllocated)
        return;

    /* it's not safe to double the size unless we're less than half MAX_INT */
    if (requiredSpace >= INT_MAX / 2)
        tidRangeArray->numAllocated = requiredSpace;
    else
        while (tidRangeArray->numAllocated < requiredSpace)
            tidRangeArray->numAllocated *= 2;

    tidRangeArray->ranges = (TidRange *)
        repalloc(tidRangeArray->ranges,
                 tidRangeArray->numAllocated * sizeof(TidRange));
}

If you're in danger of overflowing numAllocated with the number of
TIDs in your query, you're probably going to have other problems.  But
I'd prefer to at least not get stuck in an infinite doubling loop.


Re: SQL:2011 PERIODS vs Postgres Ranges?

2018-11-23 Thread Paul A Jungwirth
On Fri, Nov 23, 2018 at 3:41 PM Paul A Jungwirth
 wrote:
> Here is a patch for my progress on this so far.

Well this is embarrassing, but my last patch used the mistaken syntax
`PRIMARY KEY (cols, WITHOUT OVERLAPS col)`. Here is a new patch which
uses the correct syntax `PRIMARY KEY (cols, col WITHOUT OVERLAPS)`.
Sorry about that! Also I went ahead and rebased it off current master.
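For the record, the corrected syntax looks like this (table and column
names here are just for illustration):

CREATE TABLE booking (
    room    int,
    during  tsrange,
    CONSTRAINT booking_pk PRIMARY KEY (room, during WITHOUT OVERLAPS)
);

ALTER TABLE booking DROP CONSTRAINT booking_pk;
ALTER TABLE booking
    ADD CONSTRAINT booking_pk PRIMARY KEY (room, during WITHOUT OVERLAPS);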

Yours,
Paul


temporal_pks_v0002.patch
Description: Binary data


Re: Tid scan improvements

2018-11-23 Thread Tomas Vondra



On 11/24/18 1:56 AM, Edmund Horner wrote:
> On Fri, 23 Nov 2018 at 07:03, Tomas Vondra  
> wrote:
>> On 11/22/18 8:41 AM, David Rowley wrote:
>>> ...
>>> 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
>>> allocation until it reaches the required size. See how
>>> MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
>>> we always have a power of two sized array which is much nicer if we
>>> ever reach the palloc() limit as if the array is sized at the palloc()
>>> limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
>>> unlikely to be a problem here, but... the question would be how to
>>> decide on the initial size.
>>
>> I think it kinda tries to do that in some cases, by doing this:
>>
>>  *numAllocRanges *= 2;
>>  ...
>>  tidRanges = (TidRange *)
>>  repalloc(tidRanges,
>>   *numAllocRanges * sizeof(TidRange));
>>
>> The problem here is that what matters is not numAllocRanges being 2^N,
>> but the number of bytes allocated being 2^N. Because that's what ends up
>> in AllocSet, which keeps lists of 2^N chunks.
>>
>> And as TidRange is 12B, this is guaranteed to waste memory, because
>> no matter what the first factor is, the result will never be 2^N.
> 
> For simplicity, I think making it a strict doubling of capacity each
> time is fine.  That's what we see in numerous other places in the
> backend code.
> 

Sure.

> What we don't really see is intentionally setting the initial capacity
> so that each subsequent capacity is close-to-but-not-exceeding a power
> of 2 bytes.  You can't really do that optimally if working in terms of
> whole numbers of items that aren't each a power of 2 size.  This step,
> there may be 2/3 of an item spare; next step, we'll have a whole item
> spare that we're not going to use.

Ah, I missed the detail with setting initial size.

> So we could keep track in terms of bytes allocated, and then figure
> out how many items we can fit at the current time.
> 
> In my opinion, such complexity is overkill for Tid scans.
> 
> Currently, we try to pick an initial size based on the number of
> expressions.  We assume each expression will yield one range, and
> allow that a saop expression might require us to enlarge the array.
> 
> Again, for simplicity, we should scrap that and pick something like
> floor(256/sizeof(TidRange)) = 21 items, with about 1.5% wastage.
> 

Probably. I don't think it'd be a lot of code to do the exact sizing,
but you're right 1.5% is close enough. As long as there is a comment
explaining the initial sizing, I'm fine with that.

If I could suggest one more thing, I'd define a struct combining the
array of ranges, numRanges and numAllocRanges, like:

typedef struct TidRanges
{
    int         numRanges;
    int         numAllocRanges;
    TidRange    ranges[FLEXIBLE_ARRAY_MEMBER];
} TidRanges;

and use that instead of the plain array. I find it easier to follow
compared to passing the various fields directly (sometimes as a value,
sometimes pointer to the value, etc.).

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: COPY FROM WHEN condition

2018-11-23 Thread Tomas Vondra


On 11/23/18 12:14 PM, Surafel Temesgen wrote:
> 
> 
> On Sun, Nov 11, 2018 at 11:59 PM Tomas Vondra  wrote:
> 
> 
> So, what about using FILTER here? We already use it for aggregates when
> filtering rows to process.
> 
> I think it's a good idea and it describes the purpose better. Attached is
> a patch that uses FILTER instead.

Thanks, looks good to me. A couple of minor points:

1) While comparing this to the FILTER clause we already have for
aggregates, I've noticed the aggregate version is

FILTER '(' WHERE a_expr ')'

while here we have

FILTER '(' a_expr ')'

For a while I was thinking that maybe we should use the same syntax
here, but I don't think so. The WHERE bit seems rather unnecessary and
we probably implemented it only because it's required by SQL standard,
which does not apply to COPY. So I think this is fine.


2) The various parser checks emit errors like this:

case EXPR_KIND_COPY_FILTER:
err = _("cannot use subquery in copy from FILTER condition");
break;

I think the "copy from" should be capitalized, to make it clear that it
refers to a COPY FROM command and not a copy of something.


3) I think there should be regression tests for these prohibited things,
i.e. for a set-returning function, for a non-existent column, etc.
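For example, something along these lines (sketches only, using the
FILTER syntax from the patch; each of them should fail with a suitable
error):

COPY t FROM '/tmp/t.data' FILTER (generate_series(1, 2) = 1);  -- set-returning function
COPY t FROM '/tmp/t.data' FILTER (nosuchcolumn > 0);           -- non-existent column
COPY t FROM '/tmp/t.data' FILTER ((SELECT true));              -- subquery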


4) What might be somewhat confusing for users is that the filter uses a
single snapshot to evaluate the conditions for all rows. That is, one
might do this

create or replace function f() returns int as $$
select count(*)::int from t;
$$ language sql;

and hope that

copy t from '/...' filter (f() <= 100);

only ever imports the first 100 rows - but that's not true, of course,
because f() uses the snapshot acquired at the very beginning. For
example INSERT SELECT does behave differently:

test=# copy t from '/home/user/t.data' filter (f() < 100);
COPY 81
test=# insert into t select * from t where f() < 100;
INSERT 0 19

Obviously, this is not an issue when the filter clause references only
the input row (which I assume will be the primary use case).

Not sure if this is expected / appropriate behavior, and if the patch
needs to do something else here.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Should new partitions inherit their tablespace from their parent?

2018-11-23 Thread Michael Paquier
On Sat, Nov 24, 2018 at 12:32:36PM +1300, David Rowley wrote:
> On Fri, 23 Nov 2018 at 17:03, David Rowley  
> wrote:
> > Here's a patch for that.  Parking here until January's commitfest.
> 
> And another, now rebased atop of de38ce1b8 (which I probably should
> have waited for).

You have just gained a reviewer.  The current inconsistent state is
rather bad I think, so I'll try to look at that before January.  Now
there are other things waiting in the queue...
--
Michael


signature.asc
Description: PGP signature


Re: Add extension options to control TAP and isolation tests

2018-11-23 Thread Michael Paquier
On Fri, Nov 23, 2018 at 05:29:00PM +0300, Arthur Zakirov wrote:
> The patch is very useful. Using TAP_TESTS is more convenient and clearer
> than adding a wal-check target. Every time I added TAP tests for an
> extension I had to remember that I should add wal-check.

wal-check is a custom target of contrib/bloom/ which is not meant to
spread around.

> After applying the patch all tests pass; there weren't any errors.
> 
> Also I tested it in one of our extensions which has TAP tests. installcheck
> and check work as expected.

Thanks.

> But there is a problem: you need to copy your extension to the contrib
> directory if you want to run TAP tests. I tried to run the TAP tests of an
> extension outside of the PostgreSQL source directory, and it failed to run
> them. That is because `prove_installcheck` redefines `top_builddir` and
> `PG_REGRESS`:

I have tested that as well with one of my custom extensions, which has
some TAP tests, and the following Makefile additions:
TAP_TESTS = 1
ifdef USE_PGXS
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
else
subdir = contrib/EXTENSION_NAME_HERE
top_builddir = ../..
include $(top_builddir)/src/Makefile.global
include $(top_srcdir)/contrib/contrib-global.mk
endif

Running make clean at the root of the tree, then running make check from
contrib/EXTENSION_NAME_HERE works for me.

> Unfortunately I didn't find a way to run it; maybe I missed something. It
> can be fixed by the additional patch I attached. I think I can create an
> entry in a future commitfest, or it can be folded into your patch.

The previous version of the patch I sent made the build of
src/test/regress dependent on whether REGRESS is set, but I missed the
fact that TAP tests also call pg_regress, which is the error you are
seeing.  The attached patch should work.

Thanks all for the reviews.  I'll do a last look-over on Monday my time
and try to get that committed by then.  That's a nice cleanup of the
tree.
--
Michael
diff --git a/contrib/bloom/Makefile b/contrib/bloom/Makefile
index 13bd397b70..da9553a2d0 100644
--- a/contrib/bloom/Makefile
+++ b/contrib/bloom/Makefile
@@ -8,6 +8,7 @@ DATA = bloom--1.0.sql
 PGFILEDESC = "bloom access method - signature file based index"
 
 REGRESS = bloom
+TAP_TESTS = 1
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -19,6 +20,3 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
-
-wal-check: temp-install
-	$(prove_check)
diff --git a/contrib/oid2name/Makefile b/contrib/oid2name/Makefile
index 908e078714..361a80a7a1 100644
--- a/contrib/oid2name/Makefile
+++ b/contrib/oid2name/Makefile
@@ -6,11 +6,11 @@ PGAPPICON = win32
 PROGRAM = oid2name
 OBJS	= oid2name.o $(WIN32RES)
 
+TAP_TESTS = 1
+
 PG_CPPFLAGS = -I$(libpq_srcdir)
 PG_LIBS_INTERNAL = $(libpq_pgport)
 
-EXTRA_CLEAN = tmp_check
-
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
@@ -21,9 +21,3 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
-
-check:
-	$(prove_check)
-
-installcheck:
-	$(prove_installcheck)
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index afcab930f7..8cd83a763f 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -3,9 +3,20 @@
 MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
-# Note: because we don't tell the Makefile there are any regression tests,
-# we have to clean those result files explicitly
-EXTRA_CLEAN = $(pg_regress_clean_files)
+EXTRA_INSTALL=contrib/test_decoding
+
+REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
+	decoding_into_rel binary prepared replorigin time messages \
+	spill slot truncate
+ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
+	oldest_xmin snapshot_transfer
+
+REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
+ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
+
+# Disabled because these tests require "wal_level=logical", which
+# typical installcheck users do not have (e.g. buildfarm clients).
+NO_INSTALLCHECK = 1
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -18,52 +29,8 @@ include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
 
-# Disabled because these tests require "wal_level=logical", which
-# typical installcheck users do not have (e.g. buildfarm clients).
-installcheck:;
-
 # But it can nonetheless be very helpful to run tests on preexisting
 # installation, allow to do so, but only if requested explicitly.
-installcheck-force: regresscheck-install-force isolationcheck-install-force
-
-check: regresscheck isolationcheck
-
-submake-regress:
-	$(MAKE) -C $(top_builddir)/src/test/regress all
-
-submake-isolation:
-	$(MAKE) -C $(top_builddir)/src/test/isolation all
-
-submake-test_decoding:
-	

Re: Tid scan improvements

2018-11-23 Thread Edmund Horner
On Fri, 23 Nov 2018 at 07:03, Tomas Vondra  wrote:
> On 11/22/18 8:41 AM, David Rowley wrote:
> > ...
> > 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
> > allocation until it reaches the required size. See how
> > MakeSharedInvalidMessagesArray() does it.  Doing it this way ensures
> > we always have a power of two sized array which is much nicer if we
> > ever reach the palloc() limit as if the array is sized at the palloc()
> > limit / 2 + 1, then if we try to double it'll fail.  Of course, it's
> > unlikely to be a problem here, but... the question would be how to
> > decide on the initial size.
>
> I think it kinda tries to do that in some cases, by doing this:
>
>  *numAllocRanges *= 2;
>  ...
>  tidRanges = (TidRange *)
>  repalloc(tidRanges,
>   *numAllocRanges * sizeof(TidRange));
>
> The problem here is that what matters is not numAllocRanges being 2^N,
> but the number of bytes allocated being 2^N. Because that's what ends up
> in AllocSet, which keeps lists of 2^N chunks.
>
> And as TidRange is 12B, this is guaranteed to waste memory, because
> no matter what the first factor is, the result will never be 2^N.

For simplicity, I think making it a strict doubling of capacity each
time is fine.  That's what we see in numerous other places in the
backend code.

What we don't really see is intentionally setting the initial capacity
so that each subsequent capacity is close-to-but-not-exceeding a power
of 2 bytes.  You can't really do that optimally if working in terms of
whole numbers of items that aren't each a power of 2 size.  This step,
there may be 2/3 of an item spare; next step, we'll have a whole item
spare that we're not going to use.  So we could keep track in terms of
bytes allocated, and then figure out how many items we can fit at the
current time.

In my opinion, such complexity is overkill for Tid scans.

Currently, we try to pick an initial size based on the number of
expressions.  We assume each expression will yield one range, and
allow that a saop expression might require us to enlarge the array.

Again, for simplicity, we should scrap that and pick something like
floor(256/sizeof(TidRange)) = 21 items, with about 1.5% wastage.



Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath

2018-11-23 Thread David Rowley
On Mon, 2 Jul 2018 at 09:37, David Rowley  wrote:
> Now that the September 'fest is open for new patches I'm going to move
> this patch over there.  This patch has become slightly less important
> than some other stuff, but I'd still like to come back to it.

This had bit-rotted quite a bit, so I've rebased it on top of today's master.

There aren't really any other changes to speak of. I've not reevaluated
the patch to see if there is any simpler way to take care of this
problem yet.  However, I do recall getting off to quite a few false
starts with this patch, and the way I have it now was the only way I
saw to make it possible.  Although, perhaps some more experience will
show me something different.

Anyway, I've attached the rebased version.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v8-0001-Remove-single-subpath-Append-and-MergeAppend-node.patch
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-23 Thread Peter Geoghegan
On Fri, Nov 23, 2018 at 2:18 AM Andrey Borodin  wrote:
> I found no practical way to fix the concept of "subtree-lock". Pre-locking all
> left siblings for cleanup would suffice, but does not look very practical.
> Without a "subtree-lock", the GIN posting tree has none of the useful B-tree
> inventory, like left links and high keys, needed for concurrent deletes.

Significantly changing the design of GIN vacuuming in back branches
for a release that's more than a year old is absolutely out of the
question. I think that your patch should be evaluated as a new v12
feature, following the revert of 218f51584d5.

> Please note that the attached patch does not fix the 2nd bug found by Chen
> in page delete redo.
>
> I understand that reverting commit 218f51584d5 and returning to long but
> finite root locks is the safest solution

Whatever happens with the master branch, I think that reverting commit
218f51584d5 is the only option on the table for the back branches. Its
design is based on ideas on locking protocols that are fundamentally
incorrect and unworkable. I don't have a lot of faith in our ability
to retrofit a design that fixes the issue without causing problems
elsewhere.

-- 
Peter Geoghegan



Re: SQL:2011 PERIODS vs Postgres Ranges?

2018-11-23 Thread Paul A Jungwirth
Here is a patch for my progress on this so far. I'd love some comments
on the general approach, as I've never contributed anything this
involved before. It's not ready for a commitfest, but it would help me
to have some feedback. There are TODO comments with my major
questions.

This patch lets you say `CONSTRAINT foo PRIMARY KEY (cols, WITHOUT
OVERLAPS some_range_col)`, both in `CREATE TABLE` and `ALTER TABLE`.
It doesn't support foreign keys yet, and it only supports range
columns, not PERIODs. (I'm starting to realize that adding PERIODs
will be a lot of work, although I'm still up for it. :-) The approach
isn't exactly the #2+#3 approach I suggested previously, since
user-exposed functions seem like an odd fit with how things normally
flow out of the grammar, but it follows the goal of permitting either
ranges or PERIODs for temporal keys without breaking the SQL:2011
standard.

It adds regression and pg_dump tests, although no documentation yet. A
few of my new regress tests fail, but only the ones for PERIODs. I
don't know if I need to do anything for pg_dump's custom format. For
the SQL format it exports correct `ALTER TABLE ... ADD CONSTRAINT ...
(... WITHOUT OVERLAPS ...)` statements. Also I left a question in
bin/psql/describe.c about how to make \d show a PK WITHOUT OVERLAPS.

It is based on 3be97b97ed37b966173f027091f21d8a7605e2a5 from Nov 14,
but I can rebase it if you like.

If it's easier to read this in smaller bits, you can find my (somewhat
messy) commit history here:
https://github.com/pjungwir/postgresql/commits/temporal-pks

For a next step (assuming what I've done already isn't too bad): I
could either work on PERIODs (building on Vik Fearing's patch from a
few months ago), or add range-based temporal foreign keys. Any
suggestions?

Thanks!
Paul
On Sun, Oct 28, 2018 at 2:29 PM Paul Jungwirth
 wrote:
>
> Hi Jeff,
>
> Thanks for sharing your thoughts and encouragement! :-)
>
>  > The model in [7] is
>  > based heavily on pack/unpack operators, and it's hard for me to see
>  > how those fit into SQL. Also, the pack/unpack operators have some
>  > theoretical weirdness that the book does not make clear*.
>  >
>  > *: My question was about the significance
>  > of the order when packing on two intervals. Hugh Darwen was kind
>  > enough to reply at length, and offered a lot of insight, but was still
>  > somewhat inconclusive.
>
> I'd be interested in seeing that conversation if you ever find it again.
>
> I really like how Date/Darwen/Lorentzos use pack/unpack to explain
> temporal operations as operating on every concurrent "instant"
> separately, and then bringing the adjacent instants back together into
> ranges again. Even if you don't materialize that approach, conceptually
> it makes it easy to understand what's going on.
>
> So what is great about the patch from Anton Dignös
> (https://www.postgresql-archive.org/PROPOSAL-Temporal-query-processing-with-range-types-tt5913058.html)
> is that (like Date/Darwen/Lorentzos) you still have temporal variants
> for every operator in the relational algebra, but they give
> straightforward & efficient implementations of each based on traditional
> operators plus just their two new "normalize" and "align" operations. (I
> think they renamed these in later papers/patches though?) Their main
> paper is at https://files.ifi.uzh.ch/boehlen/Papers/modf174-dignoes.pdf
> if anyone wants to read it. It's short! :-)
>
> The biggest challenge implementing temporal operators in plain SQL is
> merging/splitting ranges from the left & right sides of an operator so
> they line up. A single row can get split into multiple rows, or several
> rows might be merged into one, etc. You can see how tricky Snodgrass's
> "coalesce" operation is in his book. I gave some example SQL to
> implement coalesce with UNNEST plus a range_agg function at
> https://github.com/pjungwir/range_agg but with the Dignös approach I
> don't think you'd need that. Normalize/align targets roughly the same
> problem.
>
> Anyway I'd be curious whether the theoretical weirdness you found in
> pack/unpack also applies to normalize/align.
>
> Yours,
>
> --
> Paul  ~{:-)
> p...@illuminatedcomputing.com


temporal_pks_v0001.patch
Description: Binary data


Re: Should new partitions inherit their tablespace from their parent?

2018-11-23 Thread David Rowley
On Fri, 23 Nov 2018 at 17:03, David Rowley  wrote:
> Here's a patch for that.  Parking here until January's commitfest.

And another, now rebased atop of de38ce1b8 (which I probably should
have waited for).

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v2-0001-Allow-newly-created-partitions-to-inherit-their-p.patch
Description: Binary data


Re: Inadequate executor locking of indexes

2018-11-23 Thread David Rowley
On Sat, 24 Nov 2018 at 05:25, Tom Lane  wrote:
> Now, after more thought, I believe that it's probably impossible
> to have a plan tree in which ExecRelationIsTargetRelation would
> return true at an indexscan node that's not underneath the ModifyTable
> node.  What *is* possible is that we have such an indexscan on a
> different RTE for the same table.  I constructed this admittedly
> artificial example in the regression database:
>
> # explain with x1 as (select * from tenk1 t1 where unique1 = 42),
> x2 as (update tenk1 t2 set two = 2 where unique2 = 11 returning *)
> select * from x1,x2 where x1.ten = x2.ten;

Well, that problem exists with more than just indexes. Relations could
be problematic too. An even more simple artificial example would be:

select * from t1 inner join t1 t2 on t1.a=t2.a for update of t2;

We could fix that in the executor by processing the rangetable in the
planner, first throwing the whole thing into a hash table by Oid and
finding the strongest lock level and applying that level to each
(non-dummy) matching RangeTblEntry's rellockmode.  While we're there
we could add a new field for indexlockmode and do the same process on
that.   However... there might not be much point as this does nothing
for the same problem that exists in parse analyze.  That may be much
harder or even impossible to fix.
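A rough sketch of that first idea, just to illustrate (the two helpers
here are made up; rellockmode is the real RTE field):

/* pass 1: remember the strongest lock level requested per relation OID */
ListCell   *lc;

foreach(lc, root->parse->rtable)
{
    RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);

    if (rte->rtekind != RTE_RELATION)
        continue;
    remember_strongest_lockmode(rte->relid, rte->rellockmode); /* made up */
}

/* pass 2: apply that level back to every RTE on the same relation */
foreach(lc, root->parse->rtable)
{
    RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);

    if (rte->rtekind == RTE_RELATION)
        rte->rellockmode = strongest_lockmode_for(rte->relid); /* made up */
}

An indexlockmode field could then be filled in the same second pass.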

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding vs. VACUUM FULL / CLUSTER on table with TOAST-ed data

2018-11-23 Thread Tomas Vondra
On 11/19/18 11:44 AM, Tomas Vondra wrote:
> On 11/19/18 10:28 AM, Masahiko Sawada wrote:
>> On Mon, Nov 19, 2018 at 6:52 AM Tomas Vondra
>>  wrote:
>>>
>>> ...
>>>
>>> This does fix the issue, because we still decode the TOAST changes but
>>> there are no data and so
>>>
>>>  if (change->data.tp.newtuple != NULL)
>>>  {
>>>  dlist_delete(>node);
>>>  dlist_delete(&change->node);
>>>    change);
>>>  }
>>>
>>> ends up not stashing the change in the hash table.
>>
>> With the below change you proposed can we remove the above condition
>> because toast-insertion changes being processed by the reorder buffer
>> always have a new tuple? If a toast-insertion record doesn't have a
>> new tuple it has already been ignored when decoding.
>>
> 
> Good point. I think you're right the reorderbuffer part may be
> simplified as you propose.
> 

OK, here's an updated patch, tweaking the reorderbuffer part. I plan to
push this sometime mid next week.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d5bd282f8c..44caeca336 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -659,12 +659,11 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			options |= HEAP_INSERT_SKIP_WAL;
 
 		/*
-		 * The new relfilenode's relcache entrye doesn't have the necessary
-		 * information to determine whether a relation should emit data for
-		 * logical decoding.  Force it to off if necessary.
+		 * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
+		 * for the TOAST table are not logically decoded.  The main heap is
+		 * WAL-logged as XLOG FPI records, which are not logically decoded.
 		 */
-		if (!RelationIsLogicallyLogged(state->rs_old_rel))
-			options |= HEAP_INSERT_NO_LOGICAL;
+		options |= HEAP_INSERT_NO_LOGICAL;
 
 		heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
 		 options);
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index afb497227e..e3b05657f8 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -665,6 +665,9 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 static void
 DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 {
+	Size		datalen;
+	char	   *tupledata;
+	Size		tuplelen;
 	XLogReaderState *r = buf->record;
 	xl_heap_insert *xlrec;
 	ReorderBufferChange *change;
@@ -672,6 +675,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	xlrec = (xl_heap_insert *) XLogRecGetData(r);
 
+	/*
+	 * Ignore insert records without new tuples (this does happen when
+	 * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
+	 */
+	if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
+		return;
+
 	/* only interested in our database */
 	XLogRecGetBlockTag(r, 0, &target_node, NULL, NULL);
 	if (target_node.dbNode != ctx->slot->data.database)
@@ -690,17 +700,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
 
-	if (xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE)
-	{
-		Size		datalen;
-		char	   *tupledata = XLogRecGetBlockData(r, 0, &datalen);
-		Size		tuplelen = datalen - SizeOfHeapHeader;
+	tupledata = XLogRecGetBlockData(r, 0, &datalen);
+	tuplelen = datalen - SizeOfHeapHeader;
 
-		change->data.tp.newtuple =
-			ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
+	change->data.tp.newtuple =
+		ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
 
-		DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
-	}
+	DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
 
 	change->data.tp.clear_toast_afterwards = true;
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bed63c768e..23466bade2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1598,17 +1598,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		 * transaction's changes. Otherwise it will get
 		 * freed/reused while restoring spooled data from
 		 * disk.
-		 *
-		 * But skip doing so if there's no tuple-data. That
-		 * happens if a non-mapped system catalog with a toast
-		 * table is rewritten.
 		 */
-		if (change->data.tp.newtuple != NULL)
-		{
-			dlist_delete(&change->node);
-			ReorderBufferToastAppendChunk(rb, txn, relation,
-		  change);
-		}
+		Assert(change->data.tp.newtuple != NULL);
+
+		dlist_delete(&change->node);
+		ReorderBufferToastAppendChunk(rb, txn, relation,
+	  change);
 	}
 
 			change_done:


Re: pg_upgrade supported versions policy

2018-11-23 Thread Andres Freund



On November 23, 2018 1:10:11 PM PST, David Fetter  wrote:
>On Wed, Nov 21, 2018 at 03:47:22PM -0800, Andres Freund wrote:
>> Hi,
>> 
>> I feel like we ought to trim the support for a few old versions from
>> pg_upgrade.  In my particular case I don't really think it's reasonable
>> to test < 9.0 versions for pg_largeobject_metadata migrations. But I
>> think we should create a policy that's the default, leaving individual
>> cases aside.
>> 
>> I can see a few possible policies:
>> 
>> 1) Support upgrading from the set of releases that were supported when
>>    the pg_upgrade target version was released. While some will argue that
>>    this is fairly short, people above it can still upgrade ~10 years
>>    worth of versions with a single intermediary upgrade.
>> 2) Same, but +2 releases or such.
>> 3) Support until it gets too uncomfortable.
>> 4) Support all versions remotely feasible (i.e. don't drop versions that
>>    still can be compiled)
>
>The way pg_upgrade works right now, 1), 2), and 4) basically make it
>impossible to change our storage format in any non-trivial way, and 3)
>is a "trivial case" in the sense that the first such non-trivial
>storage format change ends pg_upgrade support.
>
>Are we really that attached to how we store things?

I don't think this has anything to do with the storage format. See also my
answer to Stephen. The version we upgrade from does not guarantee that a page
was written by that version.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: tab-completion debug print

2018-11-23 Thread Tom Lane
Kyotaro HORIGUCHI  writes:
> I was reminded of how often I was annoyed by having to identify the code
> that made a word-completion, by hearing the same complaint from a
> colleague of mine just now.
> Something like the attached, which tweaks the completion_matches calls,
> lets psql emit the line number where a word-completion
> happens. The output can be split out using redirection so that it
> doesn't intrude on the conversation on the console.

> (make -s COPT=-DTABCOMPLETION_DEBUG install)
> $ psql postgres 2>~debug.out
> =# alt[tab]er [tab]t[tab]ab[tab] [tab]

> You can see the following output in another bash session.
> $ tail -f ~/debug.out
> [1414][1435][1435][1435][1431]
> Every number enclosed by brackets is the line number in
> tab-complete.c, where completion happens.

> Is this useful? Any suggestions, thoughts?

Hm.  I can see the value of instrumenting tab-complete when you're trying
to debug why it did something, but this output format seems pretty terse
and unreadable.  Can we get it to print the completion text as well?
I'm imagining something more like

1414: "er "
1435: ""
1435: "ab"
1435: ""
1431: ""

Perhaps there's room as well to print the context that the match looked
at:

1414: "alt" -> "er "
1435: "alter " -> ""
1435: "alter t" -> "ab"

etc.
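For instance, a wrapper along these lines inside tab-complete.c could
produce that kind of output (a sketch only, not the actual patch; it
prints the word being completed and the common substitution readline
settled on, which is close to, though not exactly, the format above):

#ifdef TABCOMPLETION_DEBUG
/* hypothetical helper; call sites would pass __LINE__ as lineno */
static char **
debug_completion_matches(int lineno, const char *text,
                         char *(*generator) (const char *, int))
{
    char      **matches = completion_matches(text, generator);

    fprintf(stderr, "%d: \"%s\" -> \"%s\"\n", lineno, text,
            (matches && matches[0]) ? matches[0] : "");
    return matches;
}
#endif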

regards, tom lane



Re: pg_upgrade supported versions policy

2018-11-23 Thread David Fetter
On Wed, Nov 21, 2018 at 03:47:22PM -0800, Andres Freund wrote:
> Hi,
> 
> I feel like we ought to trim the support for a few old versions from
> pg_upgrade.  In my particular case I don't really think it's reasonable
> to test < 9.0 versions for pg_largeobject_metadata migrations. But I
> think we should create a policy that's the default, leaving individual
> cases aside.
> 
> I can see a few possible policies:
> 
> 1) Support upgrading from the set of releases that were supported when
>    the pg_upgrade target version was released. While some will argue that
>    this is fairly short, people above it can still upgrade ~10 years
>    worth of versions with a single intermediary upgrade.
> 2) Same, but +2 releases or such.
> 3) Support until it gets too uncomfortable.
> 4) Support all versions remotely feasible (i.e. don't drop versions that
>    still can be compiled)

The way pg_upgrade works right now, 1), 2), and 4) basically make it
impossible to change our storage format in any non-trivial way, and 3)
is a "trivial case" in the sense that the first such non-trivial
storage format change ends pg_upgrade support.

Are we really that attached to how we store things?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Shared Memory: How to use SYSV rather than MMAP ?

2018-11-23 Thread Thomas Munro
On Sat, Nov 24, 2018 at 4:54 AM REIX, Tony  wrote:
> Here is a patch for enabling SystemV Shared Memory on AIX, for 64K or bigger 
> page size, rather than using MMAP shared memory, which is slower on AIX.

> We have tested this code with 64K pages and pgbench, on AIX 7.2 TL2 Power 9, 
> and it provided a maximum of +37% improvement.

You also mentioned changing from XLC to GCC.  Did you test the various
changes in isolation?  XLC->GCC, mmap->shmget, with/without
SHM_LGPAGE.  37% is a bigger performance change than I expected from
large pages, since reports from other architectures are single-digit
percentage increases with pgbench -S.

If just changing to GCC gives you a big speed-up, it could of course
just be different/better code generation (though that'd be a bit sad
for XLC), but I also wonder if the different atomics support in our
tree could be implicated.

> We'll test this code with Large Pages (SHM_LGPAGE | SHM_PIN | S_IRUSR | 
> S_IWUSR flags of shmget() ) ASAP.
>
>
> However, I wanted first to get your comments about this change in order to 
> improve it for acceptance.
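For reference, I imagine the large-page variant mentioned above boils
down to something like this (a sketch only; the flags are the
AIX-specific ones quoted above and request_size is a placeholder):

#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/stat.h>

static void *
try_large_page_sysv_shm(size_t request_size)
{
    int         shmid;

    /* ask for pinned large pages first ... */
    shmid = shmget(IPC_PRIVATE, request_size,
                   IPC_CREAT | SHM_LGPAGE | SHM_PIN | S_IRUSR | S_IWUSR);
    /* ... and fall back to normal pages if that is not possible */
    if (shmid < 0)
        shmid = shmget(IPC_PRIVATE, request_size,
                       IPC_CREAT | S_IRUSR | S_IWUSR);

    /* caller must check for (void *) -1, shmat's failure value */
    return (shmid < 0) ? (void *) -1 : shmat(shmid, NULL, 0);
}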

I think we should respect the huge_pages GUC, as we do on Linux and
Windows (since there are downsides to using large pages, maybe not
everyone would want that).  It could even be useful to allow different
page sizes to be requested by GUC (I see that DB2 has an option to use
16GB pages -- yikes).  It also seems like a good idea to have a
shared_memory_type GUC as Andres proposed (see his link), instead of
using a compile time option.  I guess it was made a compile time
option because nobody could imagine wanting to go back to SysV shm!
(I'm still kinda surprised that MAP_ANONYMOUS memory can't be coaxed
into large pages by environment variables or loader controls, since
other things like data segments etc. apparently can, though I can't
find any text that says that's the case and I have no AIX system).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: row filtering for logical replication

2018-11-23 Thread Tomas Vondra



On 11/23/18 8:14 PM, Petr Jelinek wrote:
> On 23/11/2018 19:29, Fabrízio de Royes Mello wrote:
>>
>> On Fri, Nov 23, 2018 at 4:13 PM Petr Jelinek  wrote:
>>>

 If carefully documented I see no problem with it... we already have an
 analogous problem with functional indexes.
>>>
>>> The difference is that with functional indexes you can recreate the
>>> missing object and everything is okay again. With logical replication
>>> recreating the object will not help.
>>>
>>
>> In this case with logical replication you should rsync the object. That
>> is the price of misunderstanding / bad use of the new feature.
>>
>> As usual, there are no free beer ;-)
>>
> 
> Yeah, but you have to resync the whole subscription, not just a single
> table (removing the table from the publication will also not help), which
> is a pretty severe punishment. What if you have triggers downstream that
> do calculations or logging which you can't recover by simply rebuilding
> the replica? I think it's better to err on the side of no data loss.
> 

Yeah, having to resync everything because you accidentally dropped a
function is quite annoying. Of course, you should notice that while
testing the upgrade in a testing environment, but still ...

> We could also try to figure out a way to recover from this that does not
> require resync, ie perhaps we could somehow temporarily force evaluation
> of the expression to have current snapshot.
> 

That seems like huge a can of worms ...


cheers

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: row filtering for logical replication

2018-11-23 Thread Tomas Vondra




On 11/23/18 8:03 PM, Stephen Frost wrote:
> Greetings,
> 
> * Fabrízio de Royes Mello (fabriziome...@gmail.com) wrote:
>> On Fri, Nov 23, 2018 at 4:13 PM Petr Jelinek 
>> wrote:
 If carefully documented I see no problem with it... we already have an
 analogous problem with functional indexes.
>>>
>>> The difference is that with functional indexes you can recreate the
>>> missing object and everything is okay again. With logical replication
>>> recreating the object will not help.
>>>
>>
>> In this case with logical replication you should rsync the object. That is
>> the price of misunderstanding / bad use of the new feature.
>>
>> As usual, there are no free beer ;-)
> 
> There's also certainly no shortage of other ways to break logical
> replication, including ways that would also be hard to recover from
> today other than doing a full resync.
> 

Sure, but that seems more like an argument against creating additional
ones (and for preventing those that already exist). I'm not sure this
particular feature is where we should draw the line, though.

> What that seems to indicate, to me at least, is that it'd be awfully
> nice to have a way to resync the data which doesn't necessarily
> involve transferring all of it over again.
> 
> Of course, it'd be nice if we could track those dependencies too,
> but that's yet another thing.

Yep, that seems like a good idea in general. Both here and for
functional indexes (although I suppose there is a technical reason why it
wasn't implemented right away for them).

> 
> In short, I'm not sure that I agree with the idea that we shouldn't
> allow this and instead I'd rather we realize it and put the logical
> replication into some kind of an error state that requires a resync.
> 

That would still mean a need to resync the data to recover, so I'm not
sure it's really an improvement. And I suppose it'd require tracking the
dependencies, because how else would you mark the subscription as
requiring a resync? At which point we could decline the DROP without a
CASCADE, just like we do elsewhere, no?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: row filtering for logical replication

2018-11-23 Thread Petr Jelinek
On 23/11/2018 19:29, Fabrízio de Royes Mello wrote:
> 
> On Fri, Nov 23, 2018 at 4:13 PM Petr Jelinek  wrote:
>>
>> >
>> > If carefully documented I see no problem with it... we already have an
>> > analogous problem with functional indexes.
>>
>> The difference is that with functional indexes you can recreate the
>> missing object and everything is okay again. With logical replication
>> recreating the object will not help.
>>
> 
> In this case with logical replication you should rsync the object. That
> is the price of misunderstanding / bad use of the new feature.
> 
> As usual, there are no free beer ;-)
> 

Yeah, but you have to resync the whole subscription, not just a single
table (removing the table from the publication will also not help), which
is a pretty severe punishment. What if you have triggers downstream that
do calculations or logging which you can't recover by simply rebuilding
the replica? I think it's better to err on the side of no data loss.

We could also try to figure out a way to recover from this that does not
require resync, ie perhaps we could somehow temporarily force evaluation
of the expression to have current snapshot.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: 64-bit hash function for hstore and citext data type

2018-11-23 Thread Tom Lane
I wrote:
> Andrew Gierth  writes:
>> I'm inclined to fix this in hstoreUpgrade rather than complicate
>> hstore_hash with historical trivia. Also there have been no field
>> complaints - I guess it's unlikely that there is much pg 8.4 hstore data
>> in the wild that anyone wants to hash.

> Changing hstoreUpgrade at this point seems like wasted/misguided effort.

Oh, cancel that --- I was having a momentary brain fade about how that
function is used.  I agree your proposal is sensible.

regards, tom lane



Re: row filtering for logical replication

2018-11-23 Thread Stephen Frost
Greetings,

* Fabrízio de Royes Mello (fabriziome...@gmail.com) wrote:
> On Fri, Nov 23, 2018 at 4:13 PM Petr Jelinek 
> wrote:
> > > If carefully documented I see no problem with it... we already have an
> > > analogous problem with functional indexes.
> >
> > The difference is that with functional indexes you can recreate the
> > missing object and everything is okay again. With logical replication
> > recreating the object will not help.
> >
> 
> In this case with logical replication you should rsync the object. That is
> the price of misunderstanding / bad use of the new feature.
> 
> As usual, there are no free beer ;-)

There's also certainly no shortage of other ways to break logical
replication, including ways that would also be hard to recover from
today other than doing a full resync.

What that seems to indicate, to me at least, is that it'd be awfully nice
to have a way to resync the data which doesn't necessarily involve
transferring all of it over again.

Of course, it'd be nice if we could track those dependencies too,
but that's yet another thing.

In short, I'm not sure that I agree with the idea that we shouldn't
allow this and instead I'd rather we realize it and put the logical
replication into some kind of an error state that requires a resync.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: 64-bit hash function for hstore and citext data type

2018-11-23 Thread Tom Lane
Andrew Gierth  writes:
> "Tomas" == Tomas Vondra  writes:
>  Tomas> The important question is - can there be two different encodings
>  Tomas> for the same hstore value?

> I was going to say "no", but in fact on closer examination there is an
> edge case caused by the fact that hstoreUpgrade allows an _empty_ hstore
> from pg_upgraded 8.4 data through without modifying it. (There's also a
> vanishingly unlikely case involving the pgfoundry release of hstore-new.)

Ugh.  Still, that's a pre-existing problem in hstore_hash, and so I don't
think it's a blocker for this patch.

> I'm inclined to fix this in hstoreUpgrade rather than complicate
> hstore_hash with historical trivia. Also there have been no field
> complaints - I guess it's unlikely that there is much pg 8.4 hstore data
> in the wild that anyone wants to hash.

Changing hstoreUpgrade at this point seems like wasted/misguided effort.
I don't doubt that there was a lot of 8.4 hstore data out there, but how
much remains unmigrated?  If we're going to take this seriously at all,
my inclination would be to change hstore_hash[_extended] to test for
the empty-hstore case and force the same value it gets for such an
hstore made today.

In the meantime, I went ahead and pushed these patches.  The only
non-cosmetic changes I made were to remove the changes in 
citext--unpackaged--1.0.sql/hstore--unpackaged--1.0.sql; those
were wrong, because the point of those files is to migrate pre-9.1
databases into the extension system.  Such a database would not
contain an extended hash function, and so adding an ALTER EXTENSION
command for that function would cause the script to fail.

regards, tom lane



Re: row filtering for logical replication

2018-11-23 Thread Fabrízio de Royes Mello
On Fri, Nov 23, 2018 at 4:13 PM Petr Jelinek 
wrote:
>
> >
> > If carefully documented I see no problem with it... we already have an
> > analogous problem with functional indexes.
>
> The difference is that with functional indexes you can recreate the
> missing object and everything is okay again. With logical replication
> recreating the object will not help.
>

In this case with logical replication you should rsync the object. That is
the price of misunderstanding / bad use of the new feature.

As usual, there are no free beer ;-)

Regards,

--
   Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento


Re: row filtering for logical replication

2018-11-23 Thread Petr Jelinek
On 23/11/2018 19:05, Fabrízio de Royes Mello wrote:
> On Fri, Nov 23, 2018 at 3:55 PM Petr Jelinek
> mailto:petr.jeli...@2ndquadrant.com>> wrote:
>>
>> On 23/11/2018 17:15, Euler Taveira wrote:
>> > Em qui, 22 de nov de 2018 às 20:03, Petr Jelinek
>> > mailto:petr.jeli...@2ndquadrant.com>>
> escreveu:
>> >> Firstly, I am not sure if it's wise to allow UDFs in the filter clause
>> >> for the table. The reason for that is that we can't record all
> necessary
>> >> dependencies there because the functions are black box for parser. That
>> >> means if somebody drops object that an UDF used in replication filter
>> >> depends on, that function will start failing. But unlike for user
>> >> sessions it will start failing during decoding (well processing in
>> >> output plugin). And that's not recoverable by reading the missing
>> >> object, the only way to get out of that is either to move slot forward
>> >> which means losing part of replication stream and need for manual
> resync
>> >> or full rebuild of replication. Neither of which are good IMHO.
>> >>
>> > It is a foot gun but there are several ways to do bad things in
>> > postgres. CREATE PUBLICATION is restricted to superusers and role with
>> > CREATE privilege in current database. AFAICS a role with CREATE
>> > privilege cannot drop objects whose owner is not himself. I wouldn't
>> > like to disallow UDFs in row filtering expressions just because
>> > someone doesn't set permissions correctly. Do you have any other case
>> > in mind?
>>
>> I don't think this has anything to do with security. Stupid example:
>>
>> user1: CREATE EXTENSION citext;
>>
>> user2: CREATE FUNCTION myfilter(col1 text, col2 text) returns boolean
>> language plpgsql as
>> $$BEGIN
>> RETURN col1::citext = col2::citext;
>> END;$$
>>
>> user2: ALTER PUBLICATION mypub ADD TABLE mytab WHERE (myfilter(a,b));
>>
>> [... replication happening ...]
>>
>> user1: DROP EXTENSION citext;
>>
>> And now replication is broken and unrecoverable without data loss.
>> Recreating extension will not help because the changes happening in
>> meantime will not see it in the historical snapshot.
>>
>> I don't think it's okay to do completely nothing about this.
>>
> 
> If carefully documented I see no problem with it... we already have an
> analogous problem with functional indexes.

The difference is that with functional indexes you can recreate the
missing object and everything is okay again. With logical replication
recreating the object will not help.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: row filtering for logical replication

2018-11-23 Thread Fabrízio de Royes Mello
On Fri, Nov 23, 2018 at 3:55 PM Petr Jelinek 
wrote:
>
> On 23/11/2018 17:15, Euler Taveira wrote:
> > Em qui, 22 de nov de 2018 às 20:03, Petr Jelinek
> >  escreveu:
> >> Firstly, I am not sure if it's wise to allow UDFs in the filter clause
> >> for the table. The reason for that is that we can't record all
necessary
> >> dependencies there because the functions are black box for parser. That
> >> means if somebody drops object that an UDF used in replication filter
> >> depends on, that function will start failing. But unlike for user
> >> sessions it will start failing during decoding (well processing in
> >> output plugin). And that's not recoverable by reading the missing
> >> object, the only way to get out of that is either to move slot forward
> >> which means losing part of replication stream and need for manual
resync
> >> or full rebuild of replication. Neither of which are good IMHO.
> >>
> > It is a foot gun but there are several ways to do bad things in
> > postgres. CREATE PUBLICATION is restricted to superusers and role with
> > CREATE privilege in current database. AFAICS a role with CREATE
> > privilege cannot drop objects whose owner is not himself. I wouldn't
> > like to disallow UDFs in row filtering expressions just because
> > someone doesn't set permissions correctly. Do you have any other case
> > in mind?
>
> I don't think this has anything to do with security. Stupid example:
>
> user1: CREATE EXTENSION citext;
>
> user2: CREATE FUNCTION myfilter(col1 text, col2 text) returns boolean
> language plpgsql as
> $$BEGIN
> RETURN col1::citext = col2::citext;
> END;$$
>
> user2: ALTER PUBLICATION mypub ADD TABLE mytab WHERE (myfilter(a,b));
>
> [... replication happening ...]
>
> user1: DROP EXTENSION citext;
>
> And now replication is broken and unrecoverable without data loss.
> Recreating extension will not help because the changes happening in
> meantime will not see it in the historical snapshot.
>
> I don't think it's okay to do completely nothing about this.
>

If carefully documented I see no problem with it... we already have an
analogous problem with functional indexes.

Regards,

--
   Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento


Re: row filtering for logical replication

2018-11-23 Thread Petr Jelinek
On 23/11/2018 17:15, Euler Taveira wrote:
> Em qui, 22 de nov de 2018 às 20:03, Petr Jelinek
>  escreveu:
>> Firstly, I am not sure if it's wise to allow UDFs in the filter clause
>> for the table. The reason for that is that we can't record all necessary
>> dependencies there because the functions are black box for parser. That
>> means if somebody drops object that an UDF used in replication filter
>> depends on, that function will start failing. But unlike for user
>> sessions it will start failing during decoding (well processing in
>> output plugin). And that's not recoverable by reading the missing
>> object, the only way to get out of that is either to move slot forward
>> which means losing part of replication stream and need for manual resync
>> or full rebuild of replication. Neither of which are good IMHO.
>>
> It is a foot gun but there are several ways to do bad things in
> postgres. CREATE PUBLICATION is restricted to superusers and role with
> CREATE privilege in current database. AFAICS a role with CREATE
> privilege cannot drop objects whose owner is not himself. I wouldn't
> like to disallow UDFs in row filtering expressions just because
> someone doesn't set permissions correctly. Do you have any other case
> in mind?

I don't think this has anything to do with security. Stupid example:

user1: CREATE EXTENSION citext;

user2: CREATE FUNCTION myfilter(col1 text, col2 text) returns boolean
language plpgsql as
$$BEGIN
RETURN col1::citext = col2::citext;
END;$$

user2: ALTER PUBLICATION mypub ADD TABLE mytab WHERE (myfilter(a,b));

[... replication happening ...]

user1: DROP EXTENSION citext;

And now replication is broken and unrecoverable without data loss.
Recreating extension will not help because the changes happening in
meantime will not see it in the historical snapshot.

I don't think it's okay to do completely nothing about this.

> 
>> Secondly, do we want to at least notify user on filters (or maybe even
>> disallow them) with combination of action + column where column value
>> will not be logged? I mean for example we do this when processing the
>> filter against a row:
>>
>>> + ExecStoreHeapTuple(new_tuple ? new_tuple : old_tuple, 
>>> ecxt->ecxt_scantuple, false);
>>
> We could emit a LOG message. That could possibly be an option but it
> could be too complex for the first version.
>

Well, it needs a walker which extracts the Vars from the expression and
checks them against the replica identity columns. We already have a way
to fetch the replica identity columns, and the walker could be something
like a simplified version of the find_expr_references_walker used by
recordDependencyOnSingleRelExpr (I don't think there is anything
ready-made for this already).
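For illustration, a minimal sketch of what such a walker could look like
(function and variable names here are mine, not from any posted patch; it
assumes the expression references only the published table and leans on
the existing expression_tree_walker() and RelationGetIndexAttrBitmap()
primitives):

#include "postgres.h"
#include "access/sysattr.h"
#include "nodes/bitmapset.h"
#include "nodes/nodeFuncs.h"
#include "utils/relcache.h"

/* Collect the attribute numbers of all Vars used by the filter. */
static bool
filter_collect_vars_walker(Node *node, Bitmapset **attnos)
{
	if (node == NULL)
		return false;
	if (IsA(node, Var))
	{
		Var		   *var = (Var *) node;

		*attnos = bms_add_member(*attnos,
								 var->varattno - FirstLowInvalidHeapAttributeNumber);
		return false;
	}
	return expression_tree_walker(node, filter_collect_vars_walker,
								  (void *) attnos);
}

/* Does the filter touch only replica identity columns? */
static bool
filter_covered_by_identity(Relation rel, Node *filter)
{
	Bitmapset  *used = NULL;
	Bitmapset  *idattrs;

	filter_collect_vars_walker(filter, &used);
	idattrs = RelationGetIndexAttrBitmap(rel, INDEX_ATTR_BITMAP_IDENTITY_KEY);

	return bms_is_subset(used, idattrs);
}

The attnos are offset by FirstLowInvalidHeapAttributeNumber so that the
bitmapset is directly comparable with what RelationGetIndexAttrBitmap()
returns.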

>> But if user has expression on column which is not part of replica
>> identity that expression will always return NULL for DELETEs because
>> only replica identity is logged with actual values and everything else
>> in NULL in old_tuple. So if publication replicates deletes we should
>> check for this somehow.
>>
> In this case, we should document this behavior. That is a recurring
> question in wal2json issues. Besides that we should explain that
> UPDATE/DELETE tuples doesn't log all columns (people think the
> behavior is equivalent to triggers; it is not unless you set REPLICA
> IDENTITY FULL).
> 
>> Not really sure that this is actually worth it given that we have to
>> allocate and free this in a loop now while before it was just sitting on
>> a stack.
>>
> That is a experimentation code that should be in a separate patch.
> Don't you think low memory use is a good goal? I also think that
> MaxTupleAttributeNumber is an extreme value. I didn't some preliminary
> tests and didn't notice overheads. I'll leave these modifications in a
> separate patch.
> 

It's static memory, and only a few KB of it (it's just a single array of
pointers, not an array of data, and it does not depend on the number of
rows). palloc() will definitely need more CPU cycles.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: row filtering for logical replication

2018-11-23 Thread Petr Jelinek
On 23/11/2018 17:39, David Fetter wrote:
> On Fri, Nov 23, 2018 at 12:03:27AM +0100, Petr Jelinek wrote:
>> On 01/11/2018 01:29, Euler Taveira wrote:
>>> Em qua, 28 de fev de 2018 às 20:03, Euler Taveira
>>>  escreveu:
 The attached patches add support for filtering rows in the publisher.

>>> I rebased the patch. I added row filtering for initial
>>> synchronization, pg_dump support and psql support. 0001 removes unused
>>> code. 0002 reduces memory use. 0003 passes only structure member that
>>> is used in create_estate_for_relation. 0004 reuses a parser node for
>>> row filtering. 0005 is the feature. 0006 prints WHERE expression in
>>> psql. 0007 adds pg_dump support. 0008 is only for debug purposes (I'm
>>> not sure some of these messages will be part of the final patch).
>>> 0001, 0002, 0003 and 0008 are not mandatory for this feature.
>>
>> Hi,
>>
>> I think there are two main topics that still need to be discussed about
>> this patch.
>>
>> Firstly, I am not sure if it's wise to allow UDFs in the filter clause
>> for the table. The reason for that is that we can't record all necessary
>> dependencies there because the functions are black box for parser.
> 
> Some UDFs are not a black box for the parser, namely ones written in
> SQL. Would it make sense at least not to foreclose the non-(black box)
> option?
> 

Yeah, inlinable SQL functions should be fine; we just need the ability to
extract their dependencies.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE

2018-11-23 Thread Tom Lane
Melanie Plageman  writes:
> Given that you have addressed all of my feedback and that it's a pretty
> low-risk change, I will change the status to "ready for committer".

Thanks for reviewing!

> Has there been discussion in the past about adding a planner test
> extension similar to those in src/test/modules for cardinality
> estimation? I am imagining something that is a "soft" check that either
> the rows estimation that comes out of calc_joinrel_size_estimate is
> within an expected range (given differing estimates across machines) or
> that the selectivity estimate that comes out of eqjoinsel is within an
> expected range. The former seems relatively easy to do in a manner
> similar to the test_predtest extension and the latter seems like it
> could be done even more trivially.

No, I don't recall any discussion about that.  The regression tests in
general embody a lot of checking that the planner makes expected plan
choices: obviously the cases where we do an explicit EXPLAIN do that,
but even where we don't, we'd be likely to get artifacts such as varying
row order if an unexpected plan were chosen.  Perhaps there's a use-case
for a lower-level test harness such as you suggest, but I haven't really
felt a need for it.

> On Sat, Nov 17, 2018 at 12:22 PM Tom Lane  wrote:
>> (Here, both "outer rel's size" and "inner rel's size" mean the size after
>> earlier filtering steps.)  So that's why we only clamp nd2 and only do so
>> in eqjoinsel_semi: in the other three cases, we'd be double-counting the
>> selectivity of earlier filters if we did that.

> I just want to make sure I am understanding what the comment is saying: So,
> after we calculate the selectivity for inner join, when we return from
> calc_joinrel_size_estimate we do this math:
> nrows = outer_rows * inner_rows * fkselec * jselec;
> and in that equation, the outer and inner rows have been adjusted to account
> for any restrictions on the tables, so we don't clamp the ndvs for inner
> join in eqjoinsel_inner. However, for semi-join, that calculation is
> nrows = outer_rows * fkselec * jselec;
> Which means that we have to adjust the rows of the inner side before we get
> here?

Yeah.  Basically the point is that if we have some WHERE clause that
eliminates rows from the inner side of a semijoin, we can expect that
that means the size of the semijoin result will be smaller than if the
WHERE clause hadn't been there --- because some of the outer-rel rows
only had matches among those excluded rows.  But the equation in
calc_joinrel_size_estimate provides no way to factor that in, except
by adjusting the selectivity value, so that's what we do.
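
To put rough numbers on that (invented for illustration, and ignoring
nulls and the MCV-based refinements): suppose the outer side has nd1 =
1000 distinct join values and the inner side has nd2 = 500, but a WHERE
clause keeps only 50 inner rows.  Clamping nd2 to 50 gives roughly

    jselec = nd2 / nd1 = 50 / 1000 = 0.05
    nrows  = outer_rows * fkselec * jselec

instead of the 500 / 1000 = 0.5 we would use without the clamp, which is
how the inner-side filtering shows up in the estimate even though the
inner row count never appears in the semijoin formula.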

> You wrote:
>> Hm.  Maybe the "Psemi" and "Pinner" notation is not helpful ... would
>> "Ssemi" and "Sinner" be better?

> I think Ssemi and Sinner might be more clear--mostly because we haven't used
> P/predicate here or in most of the other selectivity estimation comments
> that I
> read. Also, in some cases when we have super limited information and make a
> guess, the selectivity feels pretty detached from the join predicate.

OK, thanks --- I'll have another go at writing that comment.

regards, tom lane



Re: row filtering for logical replication

2018-11-23 Thread David Fetter
On Fri, Nov 23, 2018 at 12:03:27AM +0100, Petr Jelinek wrote:
> On 01/11/2018 01:29, Euler Taveira wrote:
> > Em qua, 28 de fev de 2018 às 20:03, Euler Taveira
> >  escreveu:
> >> The attached patches add support for filtering rows in the publisher.
> >>
> > I rebased the patch. I added row filtering for initial
> > synchronization, pg_dump support and psql support. 0001 removes unused
> > code. 0002 reduces memory use. 0003 passes only structure member that
> > is used in create_estate_for_relation. 0004 reuses a parser node for
> > row filtering. 0005 is the feature. 0006 prints WHERE expression in
> > psql. 0007 adds pg_dump support. 0008 is only for debug purposes (I'm
> > not sure some of these messages will be part of the final patch).
> > 0001, 0002, 0003 and 0008 are not mandatory for this feature.
> 
> Hi,
> 
> I think there are two main topics that still need to be discussed about
> this patch.
> 
> Firstly, I am not sure if it's wise to allow UDFs in the filter clause
> for the table. The reason for that is that we can't record all necessary
> dependencies there because the functions are black box for parser.

Some UDFs are not a black box for the parser, namely ones written in
SQL. Would it make sense at least not to foreclose the non-(black box)
option?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Inadequate executor locking of indexes

2018-11-23 Thread Tom Lane
David Rowley  writes:
> On Thu, 8 Nov 2018 at 13:14, Tom Lane  wrote:
>> There are several things we might do to fix this:
>> 
>> 1. Drop the "operation != CMD_DELETE" condition from the above-quoted bit
>> in ExecInitModifyTable.  We might be forced into that someday anyway if
>> we want to support non-heap-style tables, since most other peoples'
>> versions of indexes do want to know about deletions.
>> 
>> 2. Drop the target-table optimization from nodeIndexscan.c and friends,
>> and just always open the scan index with AccessShareLock.  (BTW, it's
>> a bit inconsistent that these nodes don't release their index locks
>> at the end; ExecCloseIndices does.)
>> 
>> 3. Teach nodeIndexscan.c and friends about the operation != CMD_DELETE
>> exception, so that they get the lock for themselves in that case.  This
>> seems pretty ugly/fragile, but it's about the only option that won't end
>> in adding index lock-acquisition overhead in some cases.  (But given the
>> planner's behavior, it's not clear to me how often that really matters.)

> Since I missed this and only discovered this was a problem when
> someone else reported it today, and since I already did my own
> analysis separately in [1], then my vote is for #2.

Thinking more about this, the problem I noted previously about two of
these solutions not working if the index scan node is not physically
underneath the ModifyTable node actually applies to all three :-(.
It's a slightly different issue for #2, namely that what we risk is
first taking AccessShareLock and then upgrading to RowExclusiveLock.
Since there are places (not many) that take ShareLock on indexes,
this would pose a deadlock risk.

Now, after more thought, I believe that it's probably impossible
to have a plan tree in which ExecRelationIsTargetRelation would
return true at an indexscan node that's not underneath the ModifyTable
node.  What *is* possible is that we have such an indexscan on a
different RTE for the same table.  I constructed this admittedly
artificial example in the regression database:

# explain with x1 as (select * from tenk1 t1 where unique1 = 42),
x2 as (update tenk1 t2 set two = 2 where unique2 = 11 returning *)
select * from x1,x2 where x1.ten = x2.ten;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Nested Loop  (cost=16.61..16.66 rows=1 width=488)
   Join Filter: (x1.ten = x2.ten)
   CTE x1
 ->  Index Scan using tenk1_unique1 on tenk1 t1  (cost=0.29..8.30 rows=1 
width=244)
   Index Cond: (unique1 = 42)
   CTE x2
 ->  Update on tenk1 t2  (cost=0.29..8.30 rows=1 width=250)
   ->  Index Scan using tenk1_unique2 on tenk1 t2  (cost=0.29..8.30 
rows=1 width=250)
 Index Cond: (unique2 = 11)
   ->  CTE Scan on x1  (cost=0.00..0.02 rows=1 width=244)
   ->  CTE Scan on x2  (cost=0.00..0.02 rows=1 width=244)
(11 rows)

Because the CTEs will be initialized in order, this presents a case
where the lock-upgrade hazard exists today: the x1 indexscan will
open tenk1_unique1 with AccessShareLock and then x2's ModifyTable
node will upgrade that to RowExclusiveLock.  None of the proposed
fixes improve this.

I'm beginning to think that postponing target-index locking to
ExecInitModifyTable was a damfool idea and we should undo it.

> For partitioned
> table updates, there may be many result relations which can cause
> ExecRelationIsTargetRelation() to become very slow, in such a case any
> additional redundant lock would be cheap by comparison.

Yeah, it'd be nice to get rid of the need for that.

regards, tom lane



Re: row filtering for logical replication

2018-11-23 Thread Alvaro Herrera
On 2018-Nov-23, Euler Taveira wrote:

> Em qui, 22 de nov de 2018 às 20:03, Petr Jelinek
>  escreveu:

> > Won't this leak memory? The list_free only frees the list cells, but not
> > the nodes you stored there before.
>
> Good catch. It should be list_free_deep.

Actually, if the nodes have more structure (say you palloc one list
item, but that list item also contains pointers to a Node), then a
list_free_deep won't be enough either.  I'd suggest creating a bespoke
memory context, which you can delete afterwards.
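
For what it's worth, a minimal sketch of that pattern, using the
allocator API from utils/memutils.h (the context name and the exact
placement are just an illustration):

	MemoryContext filter_cxt;
	MemoryContext oldcxt;

	filter_cxt = AllocSetContextCreate(CurrentMemoryContext,
									   "row filter expressions",
									   ALLOCSET_DEFAULT_SIZES);
	oldcxt = MemoryContextSwitchTo(filter_cxt);

	/* ... build the list and whatever nodes its items point to ... */

	MemoryContextSwitchTo(oldcxt);
	MemoryContextDelete(filter_cxt);	/* frees cells and nodes alike */

If the same code runs once per decoded row, keeping the context around
and calling MemoryContextReset() on it is cheaper than creating and
deleting it every time.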

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: row filtering for logical replication

2018-11-23 Thread Euler Taveira
Em qui, 22 de nov de 2018 às 20:03, Petr Jelinek
 escreveu:
> Firstly, I am not sure if it's wise to allow UDFs in the filter clause
> for the table. The reason for that is that we can't record all necessary
> dependencies there because the functions are black box for parser. That
> means if somebody drops object that an UDF used in replication filter
> depends on, that function will start failing. But unlike for user
> sessions it will start failing during decoding (well processing in
> output plugin). And that's not recoverable by reading the missing
> object, the only way to get out of that is either to move slot forward
> which means losing part of replication stream and need for manual resync
> or full rebuild of replication. Neither of which are good IMHO.
>
It is a foot gun, but there are several ways to do bad things in
postgres. CREATE PUBLICATION is restricted to superusers and to roles
with the CREATE privilege in the current database. AFAICS a role with
the CREATE privilege cannot drop objects it does not own. I wouldn't
like to disallow UDFs in row filtering expressions just because
someone doesn't set permissions correctly. Do you have any other case
in mind?

> Secondly, do we want to at least notify user on filters (or maybe even
> disallow them) with combination of action + column where column value
> will not be logged? I mean for example we do this when processing the
> filter against a row:
>
> > + ExecStoreHeapTuple(new_tuple ? new_tuple : old_tuple, 
> > ecxt->ecxt_scantuple, false);
>
We could emit a LOG message. That could possibly be an option but it
could be too complex for the first version.
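
If we did emit one, I imagine something along these lines (message
wording and the variable name are only a sketch, not taken from the
patch):

	ereport(LOG,
			(errmsg("row filter for table \"%s\" references columns not covered by the replica identity",
					RelationGetRelationName(relation)),
			 errdetail("Such columns are NULL in the old tuple for UPDATE and DELETE, so the filter may not match as intended.")));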

> But if user has expression on column which is not part of replica
> identity that expression will always return NULL for DELETEs because
> only replica identity is logged with actual values and everything else
> in NULL in old_tuple. So if publication replicates deletes we should
> check for this somehow.
>
In this case, we should document this behavior. That is a recurring
question in wal2json issues. Besides that, we should explain that
UPDATE/DELETE tuples don't log all columns (people think the
behavior is equivalent to triggers; it is not, unless you set REPLICA
IDENTITY FULL).

> Not really sure that this is actually worth it given that we have to
> allocate and free this in a loop now while before it was just sitting on
> a stack.
>
That is experimental code that should be in a separate patch.
Don't you think low memory use is a good goal? I also think that
MaxTupleAttributeNumber is an extreme value. I did some preliminary
tests and didn't notice any overhead. I'll leave these modifications in a
separate patch.

> Won't this leak memory? The list_free only frees the list cells, but not
> the nodes you stored there before.
>
Good catch. It should be list_free_deep.

> Also I think we should document here that the expression is run with the
> session environment of the replication connection (so that it's more
> obvious that things like CURRENT_USER will not return user which changed
> tuple but the replication user).
>
Sure.

> It would be nice if 0006 had regression test and 0007 TAP test.
>
Sure.

Besides the problem presented by Hironobu-san, I'm doing some cleanup
and improving the docs. I also forgot to declare the pg_publication_rel
TOAST table.

Thanks for your review.


-- 
   Euler Taveira   Timbira -
http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento



Re: prewarm compiler warnings

2018-11-23 Thread Alvaro Herrera
On 2018-Nov-23, Erik Rijkers wrote:

> gcc-6.3.0 complains about this:
> 
> autoprewarm.c: In function ‘autoprewarm_main’:
> autoprewarm.c:210:9: warning: variable ‘rc’ set but not used
> [-Wunused-but-set-variable]
>int   rc;
> 
> The warning comes only in master.

Yeah, it's fairly new.  I noticed a couple of other places that could
use the same adjustment and pushed them all together.  (I used a cast to
void as all other places do that, which appeases Coverity about the
discrepancy.)
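
That is, presumably something like this (using the same arguments as the
call Erik's patch touches):

	(void) WaitLatch(&MyProc->procLatch,
					 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
					 -1L,
					 PG_WAIT_EXTENSION);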

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Shared Memory: How to use SYSV rather than MMAP ?

2018-11-23 Thread REIX, Tony
Hi Andres, Thomas,


Here is a patch for enabling SystemV Shared Memory on AIX, for 64K or bigger 
page size, rather than using MMAP shared memory, which is slower on AIX.


We have tested this code with 64K pages and pgbench, on AIX 7.2 TL2 Power 9, 
and it provided a maximum of +37% improvement.


We'll test this code with Large Pages (SHM_LGPAGE | SHM_PIN | S_IRUSR | S_IWUSR 
flags of shmget() ) ASAP.


However, I wanted first to get your comments about this change in order to 
improve it for acceptance.


Thanks/Regards,


Cordialement,

Tony Reix

tony.r...@atos.net

ATOS / Bull SAS
ATOS Expert
IBM Coop Architect & Technical Leader
Office : +33 (0) 4 76 29 72 67
1 rue de Provence - 38432 Échirolles - France
www.atos.net

De : REIX, Tony
Envoyé : mercredi 21 novembre 2018 09:45:12
À : Thomas Munro; Andres Freund
Cc : Robert Haas; Pg Hackers; EMPEREUR-MOT, SYLVIE; BERGAMINI, DAMIEN
Objet : RE: Shared Memory: How to use SYSV rather than MMAP ?


Hi Thomas, Andres,


I still have to reread/study in depth the discussion in this thread in order to
understand all this information. However, we've already got a very good
performance improvement of pgbench on AIX 7.2 / Power9 with that change: about
+38% in the best case. See below for the details.


This +38% improvement has been measured against PostgreSQL v11.1 code built
with XLC -O2 plus power9 tuning, plus some changes to inlining for AIX and some
fixes dealing with issues between XLC and PostgreSQL #ifdefs. Maybe GCC
provides better results; we'll know later.


Once we are done with this performance analysis campaign, I'll have to submit 
patches.

Meanwhile, if anyone has ideas about where the choices made for PostgreSQL on 
Linux may have an impact on performance on AIX, I'm very interested!


Regards,


Tony



Changes in PostgreSQL11.1 sources for SYSV large pages (64K) support :

  *   Main shared memory segment in sysv_shmem.c

removal of #define USE_ANONYMOUS_SHMEM

  *   Dynamic shared memory implementations in src/include/storage/dsm_impl.h :

#define USE_DSM_POSIX
//  #define DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE  DSM_IMPL_POSIX
#define DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE  DSM_IMPL_SYSV
#endif

* Changes in the PostgreSQL 11.1 XCOFF binary with ldedit:

  *   ldedit -btextpsize=64K -bdatapsize=64K -bstackpsize=64K 
/opt/freeware/bin/postgres_64
  *   Env variable LDR_CNTRL=SHMPSIZE=64K



Cordialement,

Tony Reix

tony.r...@atos.net

ATOS / Bull SAS
ATOS Expert
IBM Coop Architect & Technical Leader
Office : +33 (0) 4 76 29 72 67
1 rue de Provence - 38432 Échirolles - France
www.atos.net
--- ./src/backend/port/sysv_shmem.c.ORIGIN	2018-11-23 11:05:31 +0100
+++ ./src/backend/port/sysv_shmem.c	2018-11-23 11:16:04 +0100
@@ -63,9 +63,14 @@
  * developer use, this shouldn't be a big problem.  Because of this, we do
  * not worry about supporting anonymous shmem in the EXEC_BACKEND cases below.
  */
+#if !defined(_AIX)
 #ifndef EXEC_BACKEND
 #define USE_ANONYMOUS_SHMEM
 #endif
+// On AIX, 64K pages can be used only with SysV shared memory.
+// Not defining USE_ANONYMOUS_SHMEM on AIX leads to SysV shared memory.
+#endif
+
 
 
 typedef key_t IpcMemoryKey;		/* shared memory key passed to shmget(2) */
@@ -125,7 +130,13 @@
 	}
 #endif
 
-	shmid = shmget(memKey, size, IPC_CREAT | IPC_EXCL | IPCProtection);
+	shmid = shmget(memKey, size, IPC_CREAT | IPC_EXCL | IPCProtection
+#if !defined(_AIX)
+		);
+#else
+	// On AIX, SHM_LGPAGE & SHM_PIN are required in order to be able to use Large Pages
+	  | SHM_LGPAGE | SHM_PIN | S_IRUSR | S_IWUSR);
+#endif
 
 	if (shmid < 0)
 	{
@@ -155,7 +166,13 @@
 		 */
 		if (shmget_errno == EINVAL)
 		{
-			shmid = shmget(memKey, 0, IPC_CREAT | IPC_EXCL | IPCProtection);
+			shmid = shmget(memKey, 0, IPC_CREAT | IPC_EXCL | IPCProtection
+#if !defined(_AIX)
+			);
+#else
+	// On AIX, SHM_LGPAGE & SHM_PIN are required in order to be able to use Large Pages
+			| SHM_LGPAGE | SHM_PIN | S_IRUSR | S_IWUSR);
+#endif
 
 			if (shmid < 0)
 			{
--- ./src/include/storage/dsm_impl.h.ORIGIN	2018-11-23 11:33:45 +0100
+++ ./src/include/storage/dsm_impl.h	2018-11-23 11:34:40 +0100
@@ -30,7 +30,12 @@
 #else
 #ifdef HAVE_SHM_OPEN
 #define USE_DSM_POSIX
+#if !defined(_AIX)
 #define DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE		DSM_IMPL_POSIX
+#else
+// On AIX, 64K pages can be used only with SysV shared memory
+#define DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE		DSM_IMPL_SYSV
+#endif // AIX
 #endif
 #define USE_DSM_SYSV
 #ifndef DEFAULT_DYNAMIC_SHARED_MEMORY_TYPE


prewarm compiler warnings

2018-11-23 Thread Erik Rijkers

gcc-6.3.0 complains about this:

autoprewarm.c: In function ‘autoprewarm_main’:
autoprewarm.c:210:9: warning: variable ‘rc’ set but not used 
[-Wunused-but-set-variable]

   int   rc;

The warning comes only in master.

The warnings are a bit annoying; the attached silences them by removing 
those assignments.


Thanks,

Erik Rijkers


--- ./contrib/pg_prewarm/autoprewarm.c.orig	2018-11-23 13:34:08.237110218 +0100
+++ ./contrib/pg_prewarm/autoprewarm.c	2018-11-23 13:34:38.333492968 +0100
@@ -207,8 +207,6 @@
 	/* Periodically dump buffers until terminated. */
 	while (!got_sigterm)
 	{
-		int			rc;
-
 		/* In case of a SIGHUP, just reload the configuration. */
 		if (got_sighup)
 		{
@@ -219,7 +217,7 @@
 		if (autoprewarm_interval <= 0)
 		{
 			/* We're only dumping at shutdown, so just wait forever. */
-			rc = WaitLatch(&MyProc->procLatch,
+			WaitLatch(&MyProc->procLatch,
 		   WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
 		   -1L,
 		   PG_WAIT_EXTENSION);
@@ -248,7 +246,7 @@
 			}
 
 			/* Sleep until the next dump time. */
-			rc = WaitLatch(&MyProc->procLatch,
+			WaitLatch(&MyProc->procLatch,
 		   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
 		   delay_in_ms,
 		   PG_WAIT_EXTENSION);


Re: row filtering for logical replication

2018-11-23 Thread Euler Taveira
Em sex, 23 de nov de 2018 às 11:40, Petr Jelinek
 escreveu:
> But given that this is happening inside output plugin which does not
> have full executor setup and has catalog-only snapshot I am not sure how
> feasible it is to try to merge these two things. As per my previous
> email it's possible that we'll have to be stricter about what we allow
> in expressions here.
>
This feature should be as simple as possible. I don't want to
introduce a huge overhead just for filtering some data. Data sharding
generally uses simple expressions.

> The other issue with merging this is that the use-case for filtering out
> the data in logical replication is not necessarily about security, but
> often about sending only relevant data. So it makes sense to have filter
> on publication without RLS enabled on table and if we'd force that, we'd
> limit usefulness of this feature.
>
Using the same infrastructure as RLS could be a good idea, but using RLS
itself for row filtering is not. RLS is complex.


-- 
   Euler Taveira   Timbira -
http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento



Index-only scan is slower than Index scan.

2018-11-23 Thread Konstantin Knizhnik

Hi hackers,

One of our customers noticed a strange thing: execution time of the
same query is about 25% slower with an index-only scan than with an
index scan plan (produced with enable_indexonlyscan = off).
The query is the following:

SELECT
    T1._Period,
    T1._RecorderTRef,
    T1._RecorderRRef,
    T1._LineNo,
    T1._Correspond,
    T1._KindRRef,
    T1._Value_TYPE,
    T1._Value_RTRef,
    T1._Value_RRRef
    FROM _AccRgED165 T1
    WHERE (T1._KindRRef = 
'\\217\\246\\000\\011\\017\\252\\000\\001\\021\\350 
\\204K\\226\\335\\225'::bytea) AND (T1._Value_TYPE = '\\010'::bytea AND 
T1._Value_RTRef = '\\000\\000\\000\\033'::bytea AND T1._Value_RRRef = 
'\\217\\246\\000\\011\\017\\252\\000\\001\\021\\350 
\\202O\\375/\\317'::bytea) AND ((T1._Period >= '2017-08-01 
00:00:00'::timestamp) AND (T1._Period <= '2017-09-01 00:00:00'::timestamp))

    ;

Most of the fetched fields have "bytea" type, so their offsets are not
cached in the tuple descriptor.
StoreIndexTuple in nodeIndexonlyscan.c uses index_getattr to
extract the index tuple components.
Since the field offsets cannot be cached, we have to scan the i-1 preceding
attributes to fetch the i-th attribute.
So StoreIndexTuple has complexity quadratic in the number of attributes. In
this query there are 9 attributes, and that is enough to make the
index-only scan 25% slower than the index scan, because the latter
extracts all attributes from the heap tuple using the slot_getsomeattrs()
function.


I have replaced the loop extracting attributes with index_getattr() in
StoreIndexTuple with an invocation of index_deform_tuple(),
and reimplemented the latter in the same way as heap_deform_tuple (extract
all attributes in one pass).
My small patch is attached to this mail. After applying it, the index-only
scan takes almost the same time as the index scan:



index scan  1.995
indexonly scan (original)   2.686
indexonly scan (patch)  2.005



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/common/indextuple.c b/src/backend/access/common/indextuple.c
index aa52a96259..b39545cb4b 100644
--- a/src/backend/access/common/indextuple.c
+++ b/src/backend/access/common/indextuple.c
@@ -423,14 +423,71 @@ void
 index_deform_tuple(IndexTuple tup, TupleDesc tupleDescriptor,
    Datum *values, bool *isnull)
 {
-	int			i;
+	int			hasnulls = IndexTupleHasNulls(tup);
+	int			natts = tupleDescriptor->natts;	/* number of atts to extract */
+	int			attnum;
+	char	   *tp;/* ptr to tuple data */
+	int			off;			/* offset in tuple data */
+	bits8	   *bp;/* ptr to null bitmap in tuple */
+	bool		slow = false;	/* can we use/set attcacheoff? */
 
 	/* Assert to protect callers who allocate fixed-size arrays */
-	Assert(tupleDescriptor->natts <= INDEX_MAX_KEYS);
+	Assert(natts <= INDEX_MAX_KEYS);
+
+	tp = (char *) tup + IndexInfoFindDataOffset(tup->t_info);
+	off = 0;
+	/* XXX "knows" t_bits are just after fixed tuple header! */
+	bp = (bits8 *) ((char *) tup + sizeof(IndexTupleData));
 
-	for (i = 0; i < tupleDescriptor->natts; i++)
+	for (attnum = 0; attnum < natts; attnum++)
 	{
-		values[i] = index_getattr(tup, i + 1, tupleDescriptor, &isnull[i]);
+		Form_pg_attribute thisatt = TupleDescAttr(tupleDescriptor, attnum);
+
+		if (hasnulls && att_isnull(attnum, bp))
+		{
+			values[attnum] = (Datum) 0;
+			isnull[attnum] = true;
+			slow = true;		/* can't use attcacheoff anymore */
+			continue;
+		}
+
+		isnull[attnum] = false;
+
+		if (!slow && thisatt->attcacheoff >= 0)
+			off = thisatt->attcacheoff;
+		else if (thisatt->attlen == -1)
+		{
+			/*
+			 * We can only cache the offset for a varlena attribute if the
+			 * offset is already suitably aligned, so that there would be no
+			 * pad bytes in any case: then the offset will be valid for either
+			 * an aligned or unaligned value.
+			 */
+			if (!slow &&
+off == att_align_nominal(off, thisatt->attalign))
+thisatt->attcacheoff = off;
+			else
+			{
+off = att_align_pointer(off, thisatt->attalign, -1,
+		tp + off);
+slow = true;
+			}
+		}
+		else
+		{
+			/* not varlena, so safe to use att_align_nominal */
+			off = att_align_nominal(off, thisatt->attalign);
+
+			if (!slow)
+thisatt->attcacheoff = off;
+		}
+
+		values[attnum] = fetchatt(thisatt, tp + off);
+
+		off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+		if (thisatt->attlen <= 0)
+			slow = true;		/* can't use attcacheoff anymore */
 	}
 }
 
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index d1201a1807..d6ac5055cf 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -267,11 +267,6 @@ IndexOnlyNext(IndexOnlyScanState *node)
 static void
 StoreIndexTuple(TupleTableSlot *slot, IndexTuple itup, TupleDesc itupdesc)
 {
-	int			nindexatts = itupdesc->natts;
-	Datum	   *values = slot->tts_values;
-	bool	   *isnull = slot->tts_isnull;
-	int			i;
-
 	/*
 	 * Note: we 

Re: Support custom socket directory in pg_upgrade

2018-11-23 Thread Tom Lane
Alvaro Herrera  writes:
> On 12/11/2018 20:00, Tom Lane wrote:
>>> Also, even if we had an arguably-better idea, I suspect that there would
>>> always be cases where it didn't work.

> Surely they can just set TMPDIR if /tmp is not writable?  If TMPDIR is
> set and not writable, bark at the user for it.

(1) There was nothing about TMPDIR in Peter's proposal, nor would an
implementation based on mkdtemp(3) automatically do that for us.

(2) If you accept the proposition that we must provide a user knob of
some sort, why shouldn't it be a command line switch rather than an
environment variable?  The former are much easier to document and to
discover.

regards, tom lane



Re: row filtering for logical replication

2018-11-23 Thread Petr Jelinek
On 23/11/2018 03:02, Stephen Frost wrote:
> Greetings,
> 
> * Euler Taveira (eu...@timbira.com.br) wrote:
>> 2018-02-28 21:54 GMT-03:00 Craig Ringer :
>>> Good idea. I haven't read this yet, but one thing to make sure you've
>>> handled is limiting the clause to referencing only the current tuple and the
>>> catalogs. user-catalog tables are OK, too, anything that is
>>> RelationIsAccessibleInLogicalDecoding().
>>>
>>> This means only immutable functions may be invoked, since a stable or
>>> volatile function might attempt to access a table. And views must be
>>> prohibited or recursively checked. (We have tree walkers that would help
>>> with this).
>>>
>>> It might be worth looking at the current logic for CHECK expressions, since
>>> the requirements are similar. In my opinion you could safely not bother with
>>> allowing access to user catalog tables in the filter expressions and limit
>>> them strictly to immutable functions and the tuple its self.
>>
>> IIRC implementation is similar to RLS expressions. I'll check all of
>> these rules.
> 
> Given the similarity to RLS and the nearby discussion about allowing
> non-superusers to create subscriptions, and probably publications later,
> I wonder if we shouldn't be somehow associating this with RLS policies
> instead of having the publication filtering be entirely independent..
> 
I do see the appeal here; if you consider logical replication to be a
streaming SELECT, it probably applies well.

But given that this is happening inside the output plugin, which does not
have a full executor setup and has a catalog-only snapshot, I am not sure
how feasible it is to try to merge these two things. As per my previous
email, it's possible that we'll have to be stricter about what we allow
in expressions here.

The other issue with merging this is that the use case for filtering out
data in logical replication is not necessarily about security, but
often about sending only relevant data. So it makes sense to have a filter
on a publication without RLS enabled on the table, and if we forced that,
we'd limit the usefulness of this feature.

We definitely want to eventually allow creating subscriptions as a
non-superuser, but that has zero effect on this, as everything here is
happening on a different server than where the subscription lives (we
already allow creation of publications with just the CREATE privilege on
the database and ownership of the table).

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: Add extension options to control TAP and isolation tests

2018-11-23 Thread Arthur Zakirov

Hello,

On 21.11.2018 03:39, Michael Paquier wrote:

I have added a reference to regress-tap in one of the new paragraphs.
Linking the existing stuff to point to "regress" is a separate issue
though, and while pointing to the TAP section is adapted as its
guidelines are rather general, I am not sure which one would make the
most sense though.
--
Michael


The patch is very useful. Using TAP_TESTS is more convenient and clearer
than adding a wal-check target. Every time I added TAP tests for an
extension I had to remember to add wal-check.


After applying the patch all tests pass; there weren't any errors.

Also, I tested it in one of our extensions which has TAP tests.
installcheck and check work as expected.


I think the patch can be marked as "Ready for Committer".

But there is a problem: you need to copy your extension into the
contrib directory if you want to run TAP tests. I tried to run the TAP
tests of an extension outside of the PostgreSQL source directory, and it
failed to run the test. That is because `prove_installcheck` redefines
`top_builddir` and `PG_REGRESS`:


cd ./ && TESTDIR='/home/artur/source/pg/rum' 
PATH="/home/artur/progs/pgsql/bin:$PATH" PGPORT='65432' 
top_builddir='/home/artur/source/pg/rum//home/artur/progs/pgsql/lib/pgxs/src/makefiles/../..' 
PG_REGRESS='/home/artur/source/pg/rum//home/artur/progs/pgsql/lib/pgxs/src/makefiles/../../src/test/regress/pg_regress' 
/usr/sbin/prove -I 
/home/artur/progs/pgsql/lib/pgxs/src/makefiles/../../src/test/perl/ -I 
./  t/*.pl
t/001_wal.pl .. Bailout called.  Further testing stopped:  system 
/home/artur/source/pg/rum//home/artur/progs/pgsql/lib/pgxs/src/makefiles/../../src/test/regress/pg_regress 
failed


Unfortunately I didn't find a way to run it; maybe I'm missing something.
It can be fixed by the additional patch I attached. I think I can create
an entry in a future commitfest, or it can be folded into your patch.


--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/makefiles/pgxs.mk b/src/makefiles/pgxs.mk
index 249061541e..c096f82d60 100644
--- a/src/makefiles/pgxs.mk
+++ b/src/makefiles/pgxs.mk
@@ -90,6 +90,9 @@ endif
 ifeq ($(FLEX),)
 FLEX = flex
 endif
+ifeq ($(PROVE),)
+PROVE = prove
+endif
 
 endif # PGXS
 
@@ -405,6 +408,12 @@ ifdef ISOLATION
 endif
 endif # PGXS
 
+# There are common routines in src/test/perl, and some test suites have
+# extra perl modules in their own directory.
+PG_PROVE_FLAGS = -I $(top_srcdir)/src/test/perl/ -I $(srcdir)
+# User-supplied prove flags such as --verbose can be provided in PROVE_FLAGS.
+PROVE_FLAGS =
+
 # Standard rules to run regression tests including multiple test suites.
 # Runs against an installed postmaster
 ifndef NO_INSTALLCHECK
@@ -416,7 +425,9 @@ ifdef ISOLATION
 	$(pg_isolation_regress_installcheck) $(ISOLATION_OPTS) $(ISOLATION)
 endif
 ifdef TAP_TESTS
-	$(prove_installcheck)
+	rm -rf '$(CURDIR)'/tmp_check
+	$(MKDIR_P) '$(CURDIR)'/tmp_check
+	cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(bindir):$$PATH" PGPORT='6$(DEF_PGPORT)' top_builddir='$(top_builddir)' PG_REGRESS='$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
 endif
 endif # NO_INSTALLCHECK
 
@@ -434,7 +445,9 @@ ifdef ISOLATION
 	$(pg_isolation_regress_check) $(ISOLATION_OPTS) $(ISOLATION)
 endif
 ifdef TAP_TESTS
-	$(prove_check)
+	rm -rf '$(CURDIR)'/tmp_check
+	$(MKDIR_P) '$(CURDIR)'/tmp_check
+	cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) PGPORT='6$(DEF_PGPORT)' PG_REGRESS='$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
 endif
 endif # PGXS
 


Re: Psql patch to show access methods info

2018-11-23 Thread Sergey Cherkashin
> \dA{f,p,fo,fp,oc}
>   Please explain what these are.
We adhere to the following logic
f  - families
fo - operators in families
fp - procedures in families
p  - access method properties
oc - operator classes

> I think this is two patches -- one being the \dip/\dicp part, the
> other
> the \dA additions.  Let's deal with them separately?

The attached patches are applied sequentially: first 0003-
psql_add_am_info.patch, then 0003-psql_add_index_info.patch.

Best regards,
Sergey Cherkashin.

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c134bca809..e25412b7ce 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -675,7 +675,7 @@
search and ordering purposes.)
   
 
-  
+  
pg_amop Columns
 

@@ -818,7 +818,7 @@
is one row for each support function belonging to an operator family.
   
 
-  
+  
pg_amproc Columns
 

@@ -4421,7 +4421,7 @@ SCRAM-SHA-256$iteration count:
Operator classes are described at length in .
   
 
-  
+  
pg_opclass Columns
 

@@ -4683,7 +4683,7 @@ SCRAM-SHA-256$iteration count:
Operator families are described at length in .
   
 
-  
+  
pg_opfamily Columns
 

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 6e6d0f42d1..fcde01b2d4 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1204,6 +1204,105 @@ testdb=
 
   
 
+  
+
+  \dAf
+  [ access-method-pattern
+[operator-family-pattern]]
+  
+
+
+
+
+Lists operator families ().
+If access-method-pattern
+is specified, only families whose access method name matches the pattern
+are shown. If
+operator-family-pattern
+is specified, only operator families associated with whose name matches
+the pattern are shown. If + is appended to the
+command name, each operator family is listed with it's owner.
+
+
+  
+
+  
+
+  \dAfo
+[access-method-pattern
+  [operator-family-pattern]]
+  
+
+
+
+
+Lists operators () associated
+with access method operator families. If
+access-method-patttern is
+specified, only operators associated with access method whose name
+matches pattern are shown. If
+operator-family-pattern is
+specified, only operators associated with families whose name matches
+the pattern are shown.
+
+
+  
+
+  
+
+  \dAfp
+[access-method-pattern
+  [operator-family-pattern]]
+  
+
+
+
+List procedures () associated
+with access method operator families.
+If access-method-patttern
+is specified, only procedures associated with access method whose name
+matches pattern are shown.
+If operator-family-pattern
+is specified, only procedures associated with families whose name
+matches the pattern are shown.
+
+
+  
+
+  
+
+  \dAoc
+[access-method-pattern
+  [operator-class-pattern]]
+  
+
+
+
+Shows index access method operator classes listed in
+.
+If access-method-patttern
+is specified, only operator classes associated with access method whose
+name matches pattern are shown.
+If operator-class-pattern
+is specified, only procedures associated with families whose name
+matches the pattern are shown.
+
+
+  
+
+  
+\dAp [ pattern ]
+
+
+
+Shows access method properties listed in
+.
+If pattern is specified,
+only access methods whose names match the pattern are shown.
+
+
+  
+
   
 \db[+] [ pattern ]
 
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index ee88e1ca5c..4d0f619186 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -719,7 +719,23 @@ exec_command_d(PsqlScanState scan_state, bool active_branch, const char *cmd)
 	success = listTables("tvmsE", NULL, show_verbose, show_system);
 break;
 			case 'A':
-success = describeAccessMethods(pattern, show_verbose);
+{
+	char	   *pattern2 = psql_scan_slash_option(scan_state, OT_NORMAL, NULL, true);
+
+	if (strncmp(cmd, "dAp", 3) == 0)
+		success = describeAccessMethodProperties(pattern);
+	else if (strncmp(cmd, "dAfo", 4) == 0)
+		success = listFamilyClassOperators(pattern, pattern2);
+	else if (strncmp(cmd, "dAfp", 4) == 0)
+		success = listOperatorFamilyProcedures(pattern, pattern2);
+	else if (strncmp(cmd, "dAf", 3) == 0)
+		success = listAccessMethodOperatorFamilies(pattern, pattern2, show_verbose);
+	else if 

Re: Support custom socket directory in pg_upgrade

2018-11-23 Thread Alvaro Herrera
On 2018-Nov-12, Tom Lane wrote:

> Peter Eisentraut  writes:
> > On 12/11/2018 20:00, Tom Lane wrote:
> >> Also, even if we had an arguably-better idea, I suspect that there would
> >> always be cases where it didn't work.  For example, one idea is to make
> >> a temporary directory under the installation's normal socket directory
> >> (thus, /tmp/pg/ or some such).  But, if the normal socket directory
> >> is not /tmp, we might find that pg_upgrade can't write there.
> 
> > We do exactly that in pg_regress and it's never been a problem.
> 
> Yeah, but pg_upgrade is used by a much wider variety of people
> than pg_regress.

Surely they can just set TMPDIR if /tmp is not writable?  If TMPDIR is
set and not writable, bark at the user for it.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Hitting CheckRelationLockedByMe() ASSERT with force_generic_plan

2018-11-23 Thread Rushabh Lathia
On Fri, Nov 23, 2018 at 3:33 AM David Rowley 
wrote:

> On Thu, 22 Nov 2018 at 22:33, Rushabh Lathia 
> wrote:
> > CREATE TABLE foo (x int primary key);
> > INSERT INTO foo VALUES (1), (2), (3), (4), (5);
> >
> > CREATE OR REPLACE FUNCTION f1(a int) RETURNS int
> > AS $$
> > BEGIN
> >  DELETE FROM foo where x = a;
> >  return 0;
> > END;
> > $$ LANGUAGE plpgsql;
> >
> > postgres@100858=#set plan_cache_mode = force_generic_plan;
> > SET
> > postgres@100858=#select f1(4);
> >  f1
> > 
> >   0
> > (1 row)
> >
> > postgres@100858=#select f1(4);
> > server closed the connection unexpectedly
>
>
> > Now, to fix this issue either we need to hold proper lock before reaching
> > to ExecInitIndexScan() or teach ExecInitIndexScan() to take
> AccessShareLock
> > on the scan coming from CMD_DELETE.
>
> I'd say the following comment and code in nodeIndexscan.c is outdated:
>
> /*
> * Open the index relation.
> *
> * If the parent table is one of the target relations of the query, then
> * InitPlan already opened and write-locked the index, so we can avoid
> * taking another lock here.  Otherwise we need a normal reader's lock.
> */
> relistarget = ExecRelationIsTargetRelation(estate, node->scan.scanrelid);
> indexstate->iss_RelationDesc = index_open(node->indexid,
>   relistarget ? NoLock : AccessShareLock);
>
> Despite what the above comment claims, these indexes have not been
> opened in InitPlan since 389af951552ff2. As you mentioned, they're
> opened in nodeModifyTable.c for non-delete statements.
>
>
+1.

I tried the same approach and with further testing I haven't found
any issues.


> I think we either need to update the above code to align it to what's
> now in nodeModifyTable.c, or just obtain an AccessShareLock
> regardless. I kinda think we should just take the lock regardless as
> determining if the relation is a result relation may be much more
> expensive than just taking another lower-level lock on the index.
>
> Anyway. I've attached a small patch to update the above fragment.
>
> --
>  David Rowley   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


-- 
Rushabh Lathia


Re: COPY FROM WHEN condition

2018-11-23 Thread Surafel Temesgen
On Sun, Nov 11, 2018 at 11:59 PM Tomas Vondra 
wrote:

>
> So, what about using FILTER here? We already use it for aggregates when
> filtering rows to process.
>
I think it's a good idea, and it describes the purpose better. Attached is
a patch that uses FILTER instead.
Regards,
Surafel
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 411941ed31..15b6ddebed 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -25,6 +25,7 @@ PostgreSQL documentation
 COPY table_name [ ( column_name [, ...] ) ]
 FROM { 'filename' | PROGRAM 'command' | STDIN }
 [ [ WITH ] ( option [, ...] ) ]
+[ FILTER ( condition ) ]
 
 COPY { table_name [ ( column_name [, ...] ) ] | ( query ) }
 TO { 'filename' | PROGRAM 'command' | STDOUT }
@@ -353,6 +354,30 @@ COPY { table_name [ ( 

 
+   
+FILTER
+
+   
+The optional FILTER clause has the general form
+
+FILTER condition
+
+where condition is
+any expression that evaluates to a result of type
+boolean.  Any row that does not satisfy this
+condition will not be inserted to the table.  A row satisfies the
+condition if it returns true when the actual row values are
+substituted for any variable references.
+   
+
+   
+Currently, FILTER expressions cannot contain
+subqueries.
+   
+
+
+   
+
   
  
 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 4aa8890fe8..bb260c41fd 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -40,7 +40,11 @@
 #include "miscadmin.h"
 #include "optimizer/clauses.h"
 #include "optimizer/planner.h"
+#include "optimizer/prep.h"
 #include "nodes/makefuncs.h"
+#include "parser/parse_coerce.h"
+#include "parser/parse_collate.h"
+#include "parser/parse_expr.h"
 #include "parser/parse_relation.h"
 #include "port/pg_bswap.h"
 #include "rewrite/rewriteHandler.h"
@@ -150,6 +154,7 @@ typedef struct CopyStateData
 	bool		convert_selectively;	/* do selective binary conversion? */
 	List	   *convert_select; /* list of column names (can be NIL) */
 	bool	   *convert_select_flags;	/* per-column CSV/TEXT CS flags */
+	Node	   *filterClause;	/* FILTER condition (or NULL) */
 
 	/* these are just for error messages, see CopyFromErrorCallback */
 	const char *cur_relname;	/* table name for error messages */
@@ -180,6 +185,7 @@ typedef struct CopyStateData
 	ExprState **defexprs;		/* array of default att expressions */
 	bool		volatile_defexprs;	/* is any of defexprs volatile? */
 	List	   *range_table;
+	ExprState  *qualexpr;
 
 	TransitionCaptureState *transition_capture;
 
@@ -801,6 +807,7 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
 	Relation	rel;
 	Oid			relid;
 	RawStmt*query = NULL;
+	Node*filterClause = NULL;
 
 	/*
 	 * Disallow COPY to/from file or program except to users with the
@@ -854,6 +861,25 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
 			NULL, false, false);
 		rte->requiredPerms = (is_from ? ACL_INSERT : ACL_SELECT);
 
+		if (stmt->filterClause)
+		{
+			/* add rte to column namespace  */
+			addRTEtoQuery(pstate, rte, false, true, true);
+
+			/* Transform the raw expression tree */
+			filterClause = transformExpr(pstate, stmt->filterClause, EXPR_KIND_COPY_FILTER);
+
+			/*  Make sure it yields a boolean result. */
+			filterClause = coerce_to_boolean(pstate, filterClause, "FILTER");
+			/* we have to fix its collations too */
+			assign_expr_collations(pstate, filterClause);
+
+			filterClause = eval_const_expressions(NULL, filterClause);
+
+			filterClause = (Node *) canonicalize_qual((Expr *) filterClause, false);
+			filterClause = (Node *) make_ands_implicit((Expr *) filterClause);
+		}
+
 		tupDesc = RelationGetDescr(rel);
 		attnums = CopyGetAttnums(tupDesc, rel, stmt->attlist);
 		foreach(cur, attnums)
@@ -1002,6 +1028,7 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
 
 		cstate = BeginCopyFrom(pstate, rel, stmt->filename, stmt->is_program,
 			   NULL, stmt->attlist, stmt->options);
+		cstate->filterClause = filterClause;
 		*processed = CopyFrom(cstate);	/* copy from file to database */
 		EndCopyFrom(cstate);
 	}
@@ -2531,6 +2558,10 @@ CopyFrom(CopyState cstate)
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
+	if (cstate->filterClause)
+		cstate->qualexpr = ExecInitQual(castNode(List, cstate->filterClause),
+>ps);
+
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2679,6 +2710,13 @@ CopyFrom(CopyState cstate)
 		slot = myslot;
 		ExecStoreHeapTuple(tuple, slot, false);
 
+		if (cstate->filterClause)
+		{
+			econtext->ecxt_scantuple = myslot;
+			if (!ExecQual(cstate->qualexpr, econtext))
+continue;
+		}
+
 		/* Determine the partition to heap_insert the tuple into */
 		if (proute)
 		{
diff --git a/src/backend/nodes/copyfuncs.c 

Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-23 Thread Andrey Borodin
Hi!

> 15 нояб. 2018 г., в 17:30, Andrey Borodin  написал(а):
> 
> I think I can compose patch for consideration.

I found no practical way to fix the concept of the "subtree lock". Pre-locking
all left siblings for cleanup would suffice, but does not look very practical.
The GIN posting tree lacks the useful B-tree inventory, like left links and
high keys, that would allow concurrent deletes without a "subtree lock".

Let's consider concurrent deletion.
As far as I understand, the absence of a high key makes concurrent deletion of
the highest keys from internal pages unsafe. Thus, internal pages cannot be
deleted at all.
PFA a patch with the simplest possible deletion algorithm:
1. Descend to the leftmost leaf
2. Scan by rightlinks until the rightmost page
3. Delete empty leaf pages, if they are not the highest keys in their
corresponding parent page

The parent page is found by a rightlink scan from the leftmost parent, with the
previous result cached.
A deleted page is stamped with ReadNewTransactionId()'s xid; the page cannot be
reused until that deletion xid is older than RecentGlobalDataXmin.
The DeleteXid is WAL-logged.
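
A sketch of the reuse test this implies (TransactionIdPrecedes() and
RecentGlobalDataXmin are real core symbols from access/transam.h and
utils/snapmgr.h; the wrapper function and its name are mine, not the
patch's):

static bool
gin_deleted_page_recyclable(TransactionId deleteXid)
{
	/*
	 * Once the deletion xid is older than RecentGlobalDataXmin, no scan
	 * that could still follow a rightlink onto the deleted page can be
	 * running, so the page may be handed out for reuse.
	 */
	return TransactionIdIsValid(deleteXid) &&
		TransactionIdPrecedes(deleteXid, RecentGlobalDataXmin);
}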
I've refactored btree->getLeftMostChild() and btree->findChildPtr() to reuse
their code (and removed an unnecessary parameter).

Please note that the attached patch does not fix the second bug found by Chen
in page-delete redo.

I understand that reverting commit 218f51584d5 and returning to long but finite
root locks is the safest solution. In this description of the proposed patch
I've tried to put the concepts in a minimum of words. Please ping me if some of
them sound cryptic and require more clarification.

Best regards, Andrey Borodin.


0001-Use-correct-locking-protocol-during-GIN-posting-tree.patch
Description: Binary data