Re: [PATCH v20] GSSAPI encryption support

2019-04-15 Thread Peter Eisentraut
On 2019-04-16 06:36, Michael Paquier wrote:
> +$node->append_conf('pg_hba.conf',
> +   qq{hostgssenc all all $hostaddr/32 gss map=mymap});
> +$node->restart;
> A reload should be enough but not race-condition free, which is why a
> set of restarts is done in this test right?  (I have noticed that it
> is done this way since the beginning.)

We got rid of all the reloads in these kinds of tests because, if the
configuration contains an error, a reload just ignores it.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Minimal logical decoding on standbys

2019-04-15 Thread Amit Khandekar
On Sat, 13 Apr 2019 at 00:57, Andres Freund  wrote:
>
> Hi,
>
> On 2019-04-12 23:34:02 +0530, Amit Khandekar wrote:
> > I tried to see if I can quickly understand what's going on.
> >
> > Here, master wal_level is hot_standby, not logical, though slave
> > wal_level is logical.
>
> Oh, that's well diagnosed.  Cool.  Also nicely tested - this'd be ugly
> in production.

Tushar had made me aware of the fact that this reproduces only when
master wal_level is hot_standby.

>
> I assume the problem isn't present if you set the primary to wal_level =
> logical?

Right.

>
>
> > Not sure why this is happening. On slave, wal_level is logical, so
> > logical records should have tuple data. Not sure what does that have
> > to do with wal_level of master. Everything should be there on slave
> > after it replays the inserts; and also slave wal_level is logical.
>
> The standby doesn't write its own WAL, only primaries do. I thought we
> forbade running with wal_level=logical on a standby, when the primary is
> only set to replica.  But that's not what we do, see
> CheckRequiredParameterValues().
>
> I've not yet thought this through, but I think we'll have to somehow
> error out in this case.  I guess we could just check at the start of
> decoding what ControlFile->wal_level is set to,

By "start of decoding", I didn't follow where exactly you mean. Do you
mean CheckLogicalDecodingRequirements()?

> and then raise an error
> in decode.c when we pass an XLOG_PARAMETER_CHANGE record that sets
> wal_level to something lower?

I didn't follow where exactly we should error out. We don't handle
XLOG_PARAMETER_CHANGE in decode.c, so you presumably meant something
else, which I didn't understand.

What I am thinking is: in CheckLogicalDecodingRequirements(), besides
checking wal_level, also check ControlFile->wal_level when InHotStandby.
That is, when we are InHotStandby, both wal_level and
ControlFile->wal_level should be >= WAL_LEVEL_LOGICAL. This will allow
us to error out when a logical slot is used while the master has an
incompatible wal_level.

ControlFile is not accessible outside xlog.c, so we would need an API to
extract this field.
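
Something like the following is what I have in mind -- just a rough
sketch, where GetActiveWalLevel() is a hypothetical new accessor in
xlog.c exposing ControlFile->wal_level (it does not exist today):

    /* in CheckLogicalDecodingRequirements() */
    if (wal_level < WAL_LEVEL_LOGICAL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("logical decoding requires wal_level >= logical")));

    /* additionally, on a standby, check the wal_level the master was run with */
    if (RecoveryInProgress() && GetActiveWalLevel() < WAL_LEVEL_LOGICAL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("logical decoding on a standby requires wal_level >= logical on the master")));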


>
> Could you try to implement that?
>
> Greetings,
>
> Andres Freund


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Commit message / hash in commitfest page.

2019-04-15 Thread Peter Eisentraut
On 2019-04-16 08:47, Magnus Hagander wrote:
> Unless we want to go all the way and have said bot actually close the CF
> entry. But the question is, do we?

I don't think so.  There are too many special cases that would make this
unreliable, like one commit fest thread consisting of multiple patches.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: REINDEX CONCURRENTLY 2.0

2019-04-15 Thread Peter Eisentraut
On 2019-04-16 08:19, Michael Paquier wrote:
> On Fri, Apr 12, 2019 at 12:11:12PM +0100, Dagfinn Ilmari Mannsåker wrote:
>> I don't have any comments on the code (but the test looks sensible, it's
>> the same trick I used to discover the issue in the first place).
> 
> After thinking some more on it, this behavior looks rather sensible to
> me.  Are there any objections?

Looks good to me.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Commit message / hash in commitfest page.

2019-04-15 Thread Magnus Hagander
On Sat, Apr 13, 2019 at 10:28 PM Tom Lane  wrote:

> Tomas Vondra  writes:
> > On Thu, Apr 11, 2019 at 02:55:10PM +0500, Ibrar Ahmed wrote:
> >> On Thu, Apr 11, 2019 at 2:44 PM Erikjan Rijkers  wrote:
>  Is it possible to have commit-message or at least git hash in
>  commitfest. It will be very easy to track commit against commitfest
>  item.
>
> >>> Commitfest items always point to discussion threads. These threads
> often
> >>> end with a message that says that the patch is pushed.  IMHO, that
> >>> message would be the place to include the commithash.   It would also
> be
> >>> easily findable via the commitfest application.
>
> > I think it might be useful to actually have that directly in the CF app,
> > not just in the thread. There would need to a way to enter multiple
> > hashes, because patches often have multiple pieces.
>
> > But maybe that'd be too much unnecessary burden. I don't remember when I
> > last needed this information. And I'd probably try searching in git log
> > first anyway.
>
> Yeah, I can't see committers bothering to do this.  Including the
> discussion thread link in the commit message is already pretty
> significant hassle, and something not everybody remembers/bothers with.
>
> But ... maybe it could be automated?  A bot looking at the commit log
> could probably suck out the thread links and try to match them up
> to CF entries.  Likely you could get about 90% right even without that,
> just by matching the committer's name and the time of commit vs time
> of CF entry closure.
>

Would you even need to match that? It would be easy enough to scan all git
commit messages for links to the archives and populate any CF entry that
attaches to that same thread.

Of course, that would be async, so you'd end up closing the CF entry and
then have it populate with the git information a bit later (in the simple
case where there is just one commit and then it's done).

Unless we want to go all the way and have said bot actually close the CF
entry. But the question is, do we?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

2019-04-15 Thread Michael Paquier
On Sun, Mar 24, 2019 at 09:47:58PM +0900, Michael Paquier wrote:
> The failure is a bit weird, as I would expect all those three actions
> to be sequential.  piculet is the only failure happening on the
> buildfarm and it uses --disable-atomics, so I am wondering if that is
> related and if 0dfe3d0 is part of that.  With a primary/standby set,
> it could be possible to test that scenario pretty easily.  I'll give
> it a shot.

The buildfarm has just failed with a similar failure, for another
test aka 009_twophase.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01

I think that we have race conditions with checkpointing and shutdown
on HEAD.
--
Michael




Re: REINDEX CONCURRENTLY 2.0

2019-04-15 Thread Michael Paquier
On Fri, Apr 12, 2019 at 12:11:12PM +0100, Dagfinn Ilmari Mannsåker wrote:
> I don't have any comments on the code (but the test looks sensible, it's
> the same trick I used to discover the issue in the first place).

After thinking some more on it, this behavior looks rather sensible to
me.  Are there any objections?

> However, the doc patch lost the trailing paren:

Fixed on my branch, thanks.
--
Michael




Re: Accidental setting of XLogReaderState.private_data ?

2019-04-15 Thread Michael Paquier
On Mon, Apr 15, 2019 at 11:06:18AM -0400, Tom Lane wrote:
> Hmm.  The second, duplicate assignment is surely pointless, but I think
> that setting the ctx as the private_data is a good idea.  It hardly seems
> out of the question that it might be needed in future.

Agreed that we should keep the assignment done with
XLogReaderAllocate().  I have committed the patch which removes the
useless assignment though.
--
Michael




Re: Autovacuum-induced regression test instability

2019-04-15 Thread Michael Paquier
On Mon, Apr 15, 2019 at 01:22:30PM -0400, Tom Lane wrote:
> In connection with the issue discussed at [1], I tried to run
> the core regression tests with extremely aggressive autovacuuming
> (I set autovacuum_naptime = 1s, autovacuum_vacuum_threshold = 5,
> autovacuum_vacuum_cost_delay = 0).  I found that the timestamp
> test tends to fail with diffs caused by unstable row order in
> timestamp_tbl.  This is evidently because it does a couple of
> DELETEs before inserting the table's final contents; if autovac
> comes along at the right time then some of those slots can get
> recycled in between insertions.  I'm thinking of committing the
> attached patch to prevent this, since in principle such failures
> could occur even without hacking the autovac settings.  Thoughts?

Aren't extra ORDER BY clauses the usual response to unstable tuple
ordering?  I really think that we should be more aggressive with that.
For table AMs, it can prove very useful to run the main regression test
suite with default_table_access_method enforced, and most other AMs will
likely not ensure the same tuple ordering as heap.
--
Michael




Re: extensions are hitting the ceiling

2019-04-15 Thread Noah Misch
On Mon, Mar 18, 2019 at 09:38:19PM -0500, Eric Hanson wrote:
> #1: Dependencies
> 
> Let's say we have two extensions, A and B, both of which depend on a third
> extension C, let's just say C is hstore.  A and B are written by different
> developers, and both contain in their .control file the line
> 
> requires = 'hstore'
> 
> When A is installed, if A creates a schema, it puts hstore in that schema.
> If not, hstore is already installed, it uses it in that location.  How does
> the extension know where to reference hstore?
> 
> Then, when B is installed, it checks to see if extension hstore is
> installed, sees that it is, and moves on.  What if it expects it in a
> different place than A does? The hstore extension can only be installed
> once, in a single schema, but if multiple extensions depend on it and look
> for it in different places, they are incompatible.
> 
> I have heard talk of a way to write extensions so that they dynamically
> reference the schema of their dependencies, but I sure don't know how that
> would work, if it's possible.  The @extschema@ variable references the
> *current* extension's schema, but there is no dynamic variable to
> reference the schema of a dependency.

If desperate, you can do it like this:

  DO $$ BEGIN EXECUTE format('SELECT %I.earth()',
(SELECT nspname FROM pg_namespace n
 JOIN pg_extension ON n.oid = extnamespace
 WHERE extname = 'earthdistance' )); END $$;

Needless to say, that's too ugly.  Though probably unimportant in practice, it
also has a race condition vs. ALTER EXTENSION SET SCHEMA.

> Also it is possible in theory to dynamically set search_path to contain
> every schema of every dependency in play and then just not specify a schema
> when you use something in a dependency.  But this ANDs together all the
> scopes of all the dependencies of an extension, introducing potential for
> collisions, and is generally kind of clunky.

That's how it works today, and it has the problems you describe.  I discussed
some solution candidates here:
https://www.postgresql.org/message-id/20180710014308.ga805...@rfd.leadboat.com

The @DEPNAME_schema@ thing was trivial to implement, but I shelved it.  I'm
attaching the proof of concept, for your information.

> #2:  Data in Extensions
> 
> Extensions that are just a collection of functions and types seem to be the
> norm.  Extensions can contain what the docs call "configuration" data, but
> rows are really second class citizens:  They aren't tracked with
> pg_catalog.pg_depend, they aren't deleted when the extension is dropped,
> etc.
> 
> Sometimes it would make sense for an extension to contain *only* data, or
> insert some rows in a table that the extension doesn't "own", but has as a
> dependency.  For example, a "webserver" extension might contain a
> "resource" table that serves up the content of resources in the table at a
> specified path. But then, another extension, say an app, might want to just
> list the webserver extension as a dependency, and insert a few resource
> rows into it.  This is, from what I can tell, beyond the scope of what
> extensions are capable of.

I never thought of this use case.  Interesting.
diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 2e45381..cd061ea 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -786,13 +786,14 @@ static void
 execute_extension_script(Oid extensionOid, ExtensionControlFile *control,
 						 const char *from_version,
 						 const char *version,
+						 List *requiredExtensions,
 						 List *requiredSchemas,
 						 const char *schemaName, Oid schemaOid)
 {
 	char	   *filename;
 	int			save_nestlevel;
 	StringInfoData pathbuf;
-	ListCell   *lc;
+	ListCell   *lc, *le;
 
 	/*
 	 * Enforce superuser-ness if appropriate.  We postpone this check until
@@ -907,6 +908,19 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control,
 	}
 
 	/*
+	 * For each dependency, substitute the dependency's schema for
+	 * @DEPNAME_schema@.  This is fishy for a relocatable dependency, but
+	 * we accept that risk.
+	 */
+	forboth (le, requiredExtensions, lc, requiredSchemas)
+	{
+		t_sql = DirectFunctionCall3(replace_text,
+									t_sql,
+									CStringGetTextDatum(psprintf("@%s_schema@", get_extension_name(lfirst_oid(le)))),
+									CStringGetTe

Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Ashwin Agrawal
On Mon, Apr 15, 2019 at 12:50 PM Peter Geoghegan  wrote:

> On Mon, Apr 15, 2019 at 9:16 AM Ashwin Agrawal 
> wrote:
> > Would like to know more specifics on this Peter. We may be having
> different context on hybrid row/column design.
>
> I'm confused about how close your idea of a TID is to the traditional
> definition from heapam (and even zheap). If it's a purely logical
> identifier, then why would it have two components like a TID? Is that
> just a short-term convenience or something?
>

The TID is a purely logical identifier. Hence, as stated in the initial
email, for a Zedstore TID the block number / offset split carries no
meaning at all. It is purely a 48-bit integer assigned to the datum of
the first column during insertion, based on where in the B-tree it gets
inserted. The rest of the column datums are inserted using this assigned
TID value. The system restrictions discussed by Heikki and Andres on the
table AM thread do limit the values it can currently carry, but from the
Zedstore design perspective it is just an integer.
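
As a purely illustrative sketch (not Zedstore's actual code), one can
think of the ItemPointer as nothing more than a carrier for that 48-bit
number, ignoring here the ip_posid restrictions imposed elsewhere in the
system:

    /* illustrative only: treat an ItemPointer as a 48-bit logical tuple id */
    static inline uint64
    itemptr_to_logical_tid(ItemPointer tid)
    {
        return ((uint64) ItemPointerGetBlockNumberNoCheck(tid) << 16) |
            ItemPointerGetOffsetNumberNoCheck(tid);
    }

    static inline void
    logical_tid_to_itemptr(uint64 ltid, ItemPointer tid)
    {
        ItemPointerSet(tid, (BlockNumber) (ltid >> 16),
                       (OffsetNumber) (ltid & 0xffff));
    }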



> > Yes, the plan to optimize out TID space per datum, either by prefix
> compression or delta compression or some other trick.
>
> It would be easier to do this if you knew for sure that the TID
> behaves almost the same as a bigserial column -- a purely logical
> monotonically increasing identifier. That's why I'm interested in what
> exactly you mean by TID, the stability of a TID value, etc. If a leaf
> page almost always stores a range covering no more than few hundred
> contiguous logical values,  you can justify aggressively compressing
> the representation in the B-Tree entries. Compression would still be
> based on prefix compression, but the representation itself can be
> specialized.
>

Yes, it is for sure a logically increasing number. With only inserts the
number is monotonically increasing. With deletes and updates, an insert
could reuse previously freed TID values. Since the TID is logical, datums
can easily be moved around to split or merge pages as required.


Re: Calling pgstat_report_wait_end() before ereport(ERROR)

2019-04-15 Thread Michael Paquier
On Fri, Apr 12, 2019 at 10:06:41PM +0900, Masahiko Sawada wrote:
> But I think that's not right, I've checked the code. If the startup
> process failed in that function it raises a FATAL and recovery fails,
> and if checkpointer process does then it calls
> pgstat_report_wait_end() in CheckpointerMain().

Well, the point is that the code raises an ERROR, then a FATAL because
it gets upgraded by recovery.  The intent, at least it seems to me, is
that if any new caller of the routine fails to clean up the event, the
event still gets cleared.  So it seems to me that the current coding is
aimed at being more defensive than anything else.  I agree that there is
perhaps little point in doing so.  In my experience a backend switches
very quickly back to ClientRead, cleaning up the previous event.
Looking around, we also have some code paths in slot.c and origin.c
which close a transient file, clear the event flag...  and then PANIC,
which makes even less sense.

In short, I tend to think that the attached is an acceptable cleanup.
Thoughts?
--
Michael
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f9a4960f8a..a399c0052d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1693,26 +1693,18 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	pgstat_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_WRITE);
 	if (write(fd, content, len) != len)
 	{
-		int			save_errno = errno;
-
-		pgstat_report_wait_end();
-		CloseTransientFile(fd);
-
 		/* if write didn't set errno, assume problem is no disk space */
-		errno = save_errno ? save_errno : ENOSPC;
+		if (errno == 0)
+			errno = ENOSPC;
 		ereport(ERROR,
 (errcode_for_file_access(),
  errmsg("could not write file \"%s\": %m", path)));
 	}
 	if (write(fd, &statefile_crc, sizeof(pg_crc32c)) != sizeof(pg_crc32c))
 	{
-		int			save_errno = errno;
-
-		pgstat_report_wait_end();
-		CloseTransientFile(fd);
-
 		/* if write didn't set errno, assume problem is no disk space */
-		errno = save_errno ? save_errno : ENOSPC;
+		if (errno == 0)
+			errno = ENOSPC;
 		ereport(ERROR,
 (errcode_for_file_access(),
  errmsg("could not write file \"%s\": %m", path)));
@@ -1725,15 +1717,9 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	 */
 	pgstat_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_SYNC);
 	if (pg_fsync(fd) != 0)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
 		ereport(ERROR,
 (errcode_for_file_access(),
  errmsg("could not fsync file \"%s\": %m", path)));
-	}
 	pgstat_report_wait_end();
 
 	if (CloseTransientFile(fd) != 0)
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 42fd8f5b6b..ff4d54d6ed 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -579,12 +579,9 @@ CheckPointReplicationOrigin(void)
 	errno = 0;
 	if ((write(tmpfd, &magic, sizeof(magic))) != sizeof(magic))
 	{
-		int			save_errno = errno;
-
-		CloseTransientFile(tmpfd);
-
 		/* if write didn't set errno, assume problem is no disk space */
-		errno = save_errno ? save_errno : ENOSPC;
+		if (errno == 0)
+			errno = ENOSPC;
 		ereport(PANIC,
 (errcode_for_file_access(),
  errmsg("could not write to file \"%s\": %m",
@@ -624,12 +621,9 @@ CheckPointReplicationOrigin(void)
 		if ((write(tmpfd, &disk_state, sizeof(disk_state))) !=
 			sizeof(disk_state))
 		{
-			int			save_errno = errno;
-
-			CloseTransientFile(tmpfd);
-
 			/* if write didn't set errno, assume problem is no disk space */
-			errno = save_errno ? save_errno : ENOSPC;
+			if (errno == 0)
+				errno = ENOSPC;
 			ereport(PANIC,
 	(errcode_for_file_access(),
 	 errmsg("could not write to file \"%s\": %m",
@@ -646,12 +640,9 @@ CheckPointReplicationOrigin(void)
 	errno = 0;
 	if ((write(tmpfd, &crc, sizeof(crc))) != sizeof(crc))
 	{
-		int			save_errno = errno;
-
-		CloseTransientFile(tmpfd);
-
 		/* if write didn't set errno, assume problem is no disk space */
-		errno = save_errno ? save_errno : ENOSPC;
+		if (errno == 0)
+			errno = ENOSPC;
 		ereport(PANIC,
 (errcode_for_file_access(),
  errmsg("could not write to file \"%s\": %m",
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 006446b663..057c5d7ab2 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1398,16 +1398,10 @@ RestoreSlotFromDisk(const char *name)
 	 */
 	pgstat_report_wait_start(WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC);
 	if (pg_fsync(fd) != 0)
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		errno = save_errno;
 		ereport(PANIC,
 (errcode_for_file_access(),
  errmsg("could not fsync file \"%s\": %m",
 		path)));
-	}
 	pgstat_report_wait_end();
 
 	/* Also sync the parent directory */
@@ -1421,10 +1415,6 @@ RestoreSlotFromDisk(const char *name)
 	pgstat_report_wait_end();
 	if (readBytes != 

Re: [Patch] Mingw: Fix import library extension, build actual static libraries

2019-04-15 Thread Noah Misch
On Thu, Mar 07, 2019 at 03:23:50PM +0100, Sandro Mani wrote:
> Related, no actual static libraries are produced alongside the respective
> dlls. The attached patch 0002-Build-static-libraries.patch addresses this,
> in a similar fashion as is already done for the AIX case in Makefile.shlib.

We don't build static libraries on AIX, though Makefile.shlib uses the
$(stlib) variable to get a name for the *.a shared library it makes.  Here's
an example of one AIX Makefile.shlib build sequence, from
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2019-04-15%2022%3A35%3A52&stg=make

rm -f libpq.a
ar crs libpq.a fe-auth.o fe-auth-scram.o fe-connect.o fe-exec.o fe-misc.o 
fe-print.o fe-lobj.o fe-protocol2.o fe-protocol3.o pqexpbuffer.o fe-secure.o 
libpq-events.o encnames.o wchar.o fe-secure-openssl.o fe-secure-common.o
touch libpq.a
../../../src/backend/port/aix/mkldexport.sh libpq.a libpq.so.5 >libpq.exp
xlc_r -qmaxmem=33554432 -D_LARGE_FILES=1  -qnoansialias -g -O2 -qmaxmem=16384 
-qsrcmsg -D_REENTRANT -D_THREAD_SAFE -D_POSIX_PTHREAD_SEMANTICS  -o libpq.so.5 
libpq.a -Wl,-bE:libpq.exp -L../../../src/port -L../../../src/common 
-lpgcommon_shlib -lpgport_shlib  -L/home/nm/sw/nopath/libxml2-64/lib 
-L/home/nm/sw/nopath/icu58.2-64/lib  -L/home/nm/sw/nopath/uuid-64/lib 
-L/home/nm/sw/nopath/openldap-64/lib -L/home/nm/sw/nopath/icu58.2-64/lib 
-L/home/nm/sw/nopath/libxml2-64/lib 
-Wl,-blibpath:'/home/nm/farm/xlc64/HEAD/inst/lib:/home/nm/sw/nopath/libxml2-64/lib:/home/nm/sw/nopath/icu58.2-64/lib:/home/nm/sw/nopath/uuid-64/lib:/home/nm/sw/nopath/openldap-64/lib:/home/nm/sw/nopath/icu58.2-64/lib:/home/nm/sw/nopath/libxml2-64/lib:/usr/lib:/lib'
  -Wl,-bnoentry -Wl,-H512 -Wl,-bM:SRE -lintl -lssl -lcrypto -lm -lldap_r -llber 
-lpthreads
rm -f libpq.a
ar crs libpq.a libpq.so.5




Re: [PATCH v20] GSSAPI encryption support

2019-04-15 Thread Michael Paquier
On Mon, Apr 15, 2019 at 08:24:52AM -0400, Stephen Frost wrote:
> The tests are really fast enough with one KDC that I don't think it
> makes sense to have two independent tests.

Perhaps you should add a comment about the need for uniqueness at the top
of 001_auth.pl with a short description of the test?

> Please find attached a patch which updates the protocol.sgml docs that
> Michael mentioned before, and merges the tests into one test file (while
> adding in some additional tests to make sure that the server also agrees
> with what our expectations are, using the pg_stat_gssapi view).

Thanks for addressing all that feedback.  Parallel runs look more
stable on my side.  At least it seems that I can re-enable it safely.

> I'll push this soon unless there are concerns.  If you get a chance to
> test the patch out, that would be great.  It's working happily for me
> locally.

+calling gss_init_sec_context() in a loop and sending the result to the
Some markup should be added here for all function names.  Not all the
clients use C either, so you may want to say "or equivalent"?

+test_access($node, 'test1', 'SELECT gss_authenticated AND encrypted
from pg_stat_gssapi where pid = pg_backend_pid();', 0, '', 'succeeds
with mapping with default gssencmode and host hba');
+test_access($node, "test1", 'SELECT gss_authenticated AND encrypted
from pg_stat_gssapi where pid = pg_backend_pid();', 0,
"gssencmode=prefer", "succeeds with GSS-encrypted access preferred
with host hba");
+test_access($node, "test1", 'SELECT gss_authenticated AND encrypted
from pg_stat_gssapi where pid = pg_backend_pid();', 0,
"gssencmode=require", "succeeds with GSS-encrypted access required
with host hba");
If you could rework the indentation of the new code added in
kerberos/t/001_auth.pl a bit, that would be nice.  I am afraid that the
current format makes debugging harder than necessary.

+$node->append_conf('pg_hba.conf',
+   qq{hostgssenc all all $hostaddr/32 gss map=mymap});
+$node->restart;
A reload should be enough but not race-condition free, which is why a
set of restarts is done in this test right?  (I have noticed that it
is done this way since the beginning.)
--
Michael




Re: New vacuum option to do only freezing

2019-04-15 Thread Michael Paquier
On Mon, Apr 15, 2019 at 09:07:16PM -0400, Tom Lane wrote:
> If we're failing to remove it, and it's below the desired freeze
> horizon, then we'd darn well better freeze it instead, no?
> 
> Since we know that the tuple only just became dead, I suspect
> that the case would be unreachable in practice.  But the approach
> you propose risks violating the invariant that all old tuples
> will either be removed or frozen.

Please note that I have added an open item for this investigation (see
"topminnow triggered assertion failure with vacuum_index_cleanup").
--
Michael




Re: finding changed blocks using WAL scanning

2019-04-15 Thread Michael Paquier
On Mon, Apr 15, 2019 at 09:04:13PM -0400, Robert Haas wrote:
> That is pretty much exactly what I was intending to propose.

Any caller of XLogWrite() could switch to a new segment once the
current one is done, and I am not sure that we would want some random
backend to potentially slow down to do that kind of operation.

Or would a separate background worker do this work by itself?  An
external tool can do that easily already:
https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks
--
Michael




Re: ExecForceStoreMinimalTuple leaks memory like there's no tomorrow

2019-04-15 Thread Michael Paquier
On Mon, Apr 15, 2019 at 10:46:56PM -0400, Tom Lane wrote:
> create table t1 as select generate_series(1,4000) id;
> vacuum analyze t1;
> explain select * from t1, t1 t1b where t1.id = t1b.id;
> -- should indicate a hash join
> explain analyze select * from t1, t1 t1b where t1.id = t1b.id;
> 
> ... watch the process's memory consumption bloat.  (It runs for
> awhile before that starts to happen, but eventually it goes to
> a couple of GB.)
> 
> It looks to me like the problem is that ExecHashJoinGetSavedTuple
> calls ExecForceStoreMinimalTuple with shouldFree = true, and
> ExecForceStoreMinimalTuple's second code branch simply ignores
> the requirement to free the supplied tuple.

Open item added, as the root comes from 4da597ed.
--
Michael




ExecForceStoreMinimalTuple leaks memory like there's no tomorrow

2019-04-15 Thread Tom Lane
Using HEAD,

create table t1 as select generate_series(1,4000) id;
vacuum analyze t1;
explain select * from t1, t1 t1b where t1.id = t1b.id;
-- should indicate a hash join
explain analyze select * from t1, t1 t1b where t1.id = t1b.id;

... watch the process's memory consumption bloat.  (It runs for
awhile before that starts to happen, but eventually it goes to
a couple of GB.)

It looks to me like the problem is that ExecHashJoinGetSavedTuple
calls ExecForceStoreMinimalTuple with shouldFree = true, and
ExecForceStoreMinimalTuple's second code branch simply ignores
the requirement to free the supplied tuple.
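
In other words, something along these lines seems to be missing at the
end of that branch (untested sketch, not a committed fix):

    /* honor the caller's request to release the supplied minimal tuple */
    if (shouldFree)
        pfree(mtup);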

regards, tom lane




Improve search for missing parent downlinks in amcheck

2019-04-15 Thread Alexander Korotkov
Hi!

Currently amcheck supports lossy checking for missing parent
downlinks.  It collects a bitmap of downlink hashes and uses it to check
the subsequent tree level.  We've seen some large corrupted indexes
which pass this check because of its lossiness.

However, it seems to me we can implement this check in a non-lossy
manner without making it significantly slower.  We already traverse
downlinks from parent to children in order to verify that high keys
correspond to downlink keys.  We can also traverse from one downlink to
the next using rightlinks.  So, if there are intermediate pages between
them, they are candidates for having missing parent downlinks.  The
patch is attached.

With this patch, amcheck successfully detected corruption for one of our
customers that unpatched amcheck couldn't find.

Opinions?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
commit f433c2cc817eb5b6b11e4974127ec3e6904bcec0
Author: Alexander Korotkov 
Date:   Thu Dec 6 01:04:33 2018 +0300

Improve checking for missing parent links

Instead of collecting lossy bitmap, traverse from one downlink to subsequent
using rightlinks.  Intermediate pages are candidates to have lost parent
downlinks.

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 591e0a3e46a..f2b08d5dbdb 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -92,6 +92,8 @@ typedef struct BtreeCheckState
 	BlockNumber targetblock;
 	/* Target page's LSN */
 	XLogRecPtr	targetlsn;
+	/* Previous downlink block number */
+	BlockNumber	prevdownlink;
 
 	/*
 	 * Mutable state, for optional heapallindexed verification:
@@ -99,8 +101,6 @@ typedef struct BtreeCheckState
 
 	/* Bloom filter fingerprints B-Tree index */
 	bloom_filter *filter;
-	/* Bloom filter fingerprints downlink blocks within tree */
-	bloom_filter *downlinkfilter;
 	/* Right half of incomplete split marker */
 	bool		rightsplit;
 	/* Debug counter */
@@ -137,7 +137,10 @@ static void bt_target_page_check(BtreeCheckState *state);
 static BTScanInsert bt_right_page_check_scankey(BtreeCheckState *state);
 static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
   BlockNumber childblock);
-static void bt_downlink_missing_check(BtreeCheckState *state);
+static void bt_downlink_connectivity_check(BtreeCheckState *state,
+   BlockNumber nextdownlink, uint32 parent_level);
+static void bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
+		  BlockNumber targetblock, Page target);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 		  Datum *values, bool *isnull,
 		  bool tupleIsAlive, void *checkstate);
@@ -420,23 +423,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		 errmsg("index \"%s\" cannot be verified using transaction snapshot",
 								RelationGetRelationName(rel))));
 		}
-		else
-		{
-			int64		total_pages;
-
-			/*
-			 * Extra readonly downlink check.
-			 *
-			 * In readonly case, we know that there cannot be a concurrent
-			 * page split or a concurrent page deletion, which gives us the
-			 * opportunity to verify that every non-ignorable page had a
-			 * downlink one level up.  We must be tolerant of interrupted page
-			 * splits and page deletions, though.  This is taken care of in
-			 * bt_downlink_missing_check().
-			 */
-			total_pages = (int64) state->rel->rd_rel->relpages;
-			state->downlinkfilter = bloom_create(total_pages, work_mem, seed);
-		}
 	}
 
 	Assert(!state->rootdescend || state->readonly);
@@ -514,16 +500,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		IndexInfo  *indexinfo = BuildIndexInfo(state->rel);
 		TableScanDesc scan;
 
-		/* Report on extra downlink checks performed in readonly case */
-		if (state->readonly)
-		{
-			ereport(DEBUG1,
-	(errmsg_internal("finished verifying presence of downlink blocks within index \"%s\" with bitset %.2f%% set",
-	 RelationGetRelationName(rel),
-	 100.0 * bloom_prop_bits_set(state->downlinkfilter))));
-			bloom_free(state->downlinkfilter);
-		}
-
 		/*
 		 * Create our own scan for table_index_build_scan(), rather than
 		 * getting it to do so for us.  This is required so that we can
@@ -626,6 +602,8 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
 		 level.istruerootlevel ?
 		 " (true root level)" : level.level == 0 ? " (leaf level)" : "");
 
+	state->prevdownlink = InvalidBlockNumber;
+
 	do
 	{
 		/* Don't rely on CHECK_FOR_INTERRUPTS() calls at lower level */
@@ -923,14 +901,14 @@ bt_target_page_check(BtreeCheckState *state)
 		(uint32) state->targetlsn)));
 		}
 
-		/* Fingerprint downlink blocks in heapallindexed + readonly case */
-		if (state->heapallindexed && state->readonly && !P_ISLEAF(topaque))
+		/*
+		 * Check connectivity of downlinks.
+		 */
+		if (!P_ISLEAF(topaque) && stat

Re: finding changed blocks using WAL scanning

2019-04-15 Thread Bruce Momjian
On Mon, Apr 15, 2019 at 09:04:13PM -0400, Robert Haas wrote:
> On Mon, Apr 15, 2019 at 4:31 PM Bruce Momjian  wrote:
> > Can I throw out a simple idea?  What if, when we finish writing a WAL
> > file, we create a new file 00010001.modblock which
> > lists all the heap/index files and block numbers modified in that WAL
> > file?  How much does that help with the list I posted earlier?
> >
> > I think there is some interesting complexity brought up in this 
> > thread.
> > Which options are going to minimize storage I/O, network I/O, have 
> > only
> > background overhead, allow parallel operation, integrate with
> > pg_basebackup.  Eventually we will need to evaluate the incremental
> > backup options against these criteria.
> >
> > I am thinking tools could retain modblock files along with WAL, could
> > pull full-page-writes from WAL, or from PGDATA.  It avoids the need to
> > scan 16MB WAL files, and the WAL files and modblock files could be
> > expired independently.
> 
> That is pretty much exactly what I was intending to propose.

OK, good.  Some of your wording was vague so I was unclear exactly what
you were suggesting.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




RE: Speedup of relation deletes during recovery

2019-04-15 Thread Jamison, Kirk
Hello Fujii-san, 

On April 18, 2018, Fujii Masao wrote: 

> On Fri, Mar 30, 2018 at 12:18 PM, Tsunakawa, Takayuki 
>  wrote:
>> Furthermore, TRUNCATE has a similar and worse issue.  While DROP TABLE 
>> scans the shared buffers once for each table, TRUNCATE does that for 
>> each fork, resulting in three scans per table.  How about changing the 
>> following functions
>>
>> smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber 
>> nblocks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
>>BlockNumber firstDelBlock)
>>
>> to
>>
>> smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber 
>> *nblocks, int nforks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, 
>> ForkNumber *forkNum,
>>BlockNumber *firstDelBlock, 
>> int nforks)
>>
>> to perform the scan only once per table?  If there doesn't seem to be a 
>> problem,
>> I think I'll submit the patch next month.  I'd welcome if anyone could do 
>> that.
>
> Yeah, it's worth working on this problem. To decrease the number of scans of
> shared_buffers, you would need to change the order of truncations of files 
> and WAL
> logging. In RelationTruncate(), currently WAL is logged after FSM and VM are 
> truncated.
> IOW, with the patch, FSM and VM would need to be truncated after WAL logging. 
> You would
> need to check whether this reordering is valid.

I know it's been almost a year since the previous message, but if you
could still recall your suggestion above, I'd like to ask a question.
Could you explain your idea a bit further, namely how the reordering of
WAL writing and file truncation would reduce the number of
shared_buffers scans?
Thanks a lot in advance.
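
For reference, my reading of the "scan only once per table" part of the
quoted proposal is roughly the following shape -- an illustrative sketch
only, not actual code (rnode, forkNum[], firstDelBlock[] and nforks are
the array-typed parameters proposed above):

    for (i = 0; i < NBuffers; i++)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(i);
        uint32      buf_state;
        int         j;

        /* cheap unlocked pre-check on the relation */
        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
            continue;

        buf_state = LockBufHdr(bufHdr);
        for (j = 0; j < nforks; j++)
        {
            if (bufHdr->tag.forkNum == forkNum[j] &&
                bufHdr->tag.blockNum >= firstDelBlock[j])
            {
                InvalidateBuffer(bufHdr);   /* releases the header lock */
                break;
            }
        }
        if (j == nforks)
            UnlockBufHdr(bufHdr, buf_state);
    }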

Regards,
Kirk Jamison


Re: New vacuum option to do only freezing

2019-04-15 Thread Tom Lane
Robert Haas  writes:
> On Mon, Apr 15, 2019 at 3:47 PM Tom Lane  wrote:
>> No.  I'm thinking there should be exactly one test of index_cleanup
>> in this logic, and what it would be is along the lines of ...

> I'm not sure that's correct.  If you do that, it'll end up in the
> non-tupgone case, which might try to freeze a tuple that should've
> been removed.  Or am I confused?

If we're failing to remove it, and it's below the desired freeze
horizon, then we'd darn well better freeze it instead, no?

Since we know that the tuple only just became dead, I suspect
that the case would be unreachable in practice.  But the approach
you propose risks violating the invariant that all old tuples
will either be removed or frozen.

regards, tom lane




Re: finding changed blocks using WAL scanning

2019-04-15 Thread Robert Haas
On Mon, Apr 15, 2019 at 4:31 PM Bruce Momjian  wrote:
> Can I throw out a simple idea?  What if, when we finish writing a WAL
> file, we create a new file 00010001.modblock which
> lists all the heap/index files and block numbers modified in that WAL
> file?  How much does that help with the list I posted earlier?
>
> I think there is some interesting complexity brought up in this 
> thread.
> Which options are going to minimize storage I/O, network I/O, have 
> only
> background overhead, allow parallel operation, integrate with
> pg_basebackup.  Eventually we will need to evaluate the incremental
> backup options against these criteria.
>
> I am thinking tools could retain modblock files along with WAL, could
> pull full-page-writes from WAL, or from PGDATA.  It avoids the need to
> scan 16MB WAL files, and the WAL files and modblock files could be
> expired independently.

That is pretty much exactly what I was intending to propose.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: New vacuum option to do only freezing

2019-04-15 Thread Robert Haas
On Mon, Apr 15, 2019 at 3:47 PM Tom Lane  wrote:
> No.  I'm thinking there should be exactly one test of index_cleanup
> in this logic, and what it would be is along the lines of
>
> if (HeapTupleIsHotUpdated(&tuple) ||
> HeapTupleIsHeapOnly(&tuple) ||
> +   params->index_cleanup == VACOPT_TERNARY_DISABLED)
> nkeep += 1;
> else

I'm not sure that's correct.  If you do that, it'll end up in the
non-tupgone case, which might try to freeze a tuple that should've
been removed.  Or am I confused?

My idea of the mechanism of action of this patch is that we accumulate
the tuples just as if we were going to vacuum them, but then at the
end of each page we forget them all, sorta like there are no indexes.
In that mental model, I don't really see why this part of this logic
needs any adjustment at all vs. pre-patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: COLLATE: Hash partition vs UPDATE

2019-04-15 Thread Amit Langote
On 2019/04/16 5:47, Tom Lane wrote:
> Amit Langote  writes:
>> Attached updated patch.
> 
> LGTM, pushed.

Thank you.

Regards,
Amit






Re: Strange coding in mvdistinct.c

2019-04-15 Thread Tomas Vondra

On Mon, Apr 15, 2019 at 06:12:24PM -0400, Tom Lane wrote:

Oh, and as I continue to grep, I found this in dependencies.c:

   dependencies = (MVDependencies *) repalloc(dependencies,
                                              offsetof(MVDependencies, deps)
                                              + dependencies->ndeps * sizeof(MVDependency));

I'm pretty sure this is an actual bug: the calculation should be

  offsetof(MVDependencies, deps)
  + dependencies->ndeps * sizeof(MVDependency *));

because deps is an array of MVDependency* not MVDependency.

This would lead to an overallocation not underallocation, and it's
probably pretty harmless because ndeps can't get too large (I hope;
if it could, this would have O(N^2) performance problems).  Still,
you oughta fix it.

(There's a similar calculation later in the file that gets it right.)



Thanks. I noticed some of the bugs while investigating the recent MCV
serialization, and I plan to fix them soon. This week, hopefully.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Strange coding in mvdistinct.c

2019-04-15 Thread Tom Lane
Oh, and as I continue to grep, I found this in dependencies.c:

dependencies = (MVDependencies *) repalloc(dependencies,
                                           offsetof(MVDependencies, deps)
                                           + dependencies->ndeps * sizeof(MVDependency));

I'm pretty sure this is an actual bug: the calculation should be

   offsetof(MVDependencies, deps)
   + dependencies->ndeps * sizeof(MVDependency *));

because deps is an array of MVDependency* not MVDependency.

This would lead to an overallocation not underallocation, and it's
probably pretty harmless because ndeps can't get too large (I hope;
if it could, this would have O(N^2) performance problems).  Still,
you oughta fix it.

(There's a similar calculation later in the file that gets it right.)

regards, tom lane




Strange coding in mvdistinct.c

2019-04-15 Thread Tom Lane
In the wake of the discussion at [1] I went looking for structs that
should be using FLEXIBLE_ARRAY_MEMBER and are not, by dint of grepping
for size calculations of the form "offsetof(struct,fld) + n * sizeof(...)"
and then seeing how "fld" is declared.  I haven't yet found anything
like that that I want to change, but I did come across this bit in
mvdistinct.c's statext_ndistinct_serialize():

len = VARHDRSZ + SizeOfMVNDistinct +
ndistinct->nitems * (offsetof(MVNDistinctItem, attrs) + sizeof(int));

Given the way that the subsequent code looks, I would argue that
offsetof(MVNDistinctItem, attrs) has got basically nothing to do with
this calculation, and that the right way to phrase it is just

len = VARHDRSZ + SizeOfMVNDistinct +
ndistinct->nitems * (sizeof(double) + sizeof(int));

Consider if there happened to be alignment padding in MVNDistinctItem:
as the code stands it'd overestimate the space needed.  (There won't be
padding on any machine we support, I believe, so this isn't a live bug ---
but it's overcomplicated code, and could become buggy if any
less-than-double-width fields get added to MVNDistinctItem.)

For largely the same reason, I do not think that SizeOfMVNDistinct is
a helpful way to compute the space needed for those fields --- any
alignment padding that might be included is irrelevant for this purpose.
In short I'd be inclined to phrase this just as

len = VARHDRSZ + 3 * sizeof(uint32) +
ndistinct->nitems * (sizeof(double) + sizeof(int));

It looks to me actually like all the uses of both SizeOfMVNDistinctItem
and SizeOfMVNDistinct are wrong, because the code using those symbols
is really thinking about the size of this serialized representation,
which is guaranteed not to have any inter-field padding, unlike the
structs.

Thoughts?

regards, tom lane

[1] https://postgr.es/m/a620f85a-42ab-e0f3-3337-b04b97e2e...@redhat.com




Re: [PATCH v20] GSSAPI encryption support

2019-04-15 Thread Robbie Harwood
Stephen Frost  writes:

> Please find attached a patch which updates the protocol.sgml docs that
> Michael mentioned before, and merges the tests into one test file
> (while adding in some additional tests to make sure that the server
> also agrees with what our expectations are, using the pg_stat_gssapi
> view).

I can't speak to the Perl, but the documentation matches what I think
the code does :)

Thanks,
--Robbie




Re: Adding Unix domain socket path and port to pg_stat_get_wal_senders()

2019-04-15 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-04-12 17:57, Tom Lane wrote:
>> Those are actually pretty common, for example if you use Red Hat's
>> packaging you will have both /var/run/postgresql and /tmp as socket
>> directories (since they consider use of /tmp deprecated, but getting
>> rid of all clients' use of it turns out to be really hard).  However,
>> it's definitely fair to question whether anyone *cares* which of
>> the server's socket directories a given connection used.  Aren't
>> they going to be pretty much all equivalent?

> So what is being asked here is really information about which end point
> on the server is being connected to.  That is also information for the
> TCP/IP case that we don't currently provide.

Good point.

> It's probably of marginal use, as you also say.

Yeah.  Per downthread discussion, what Tatsuo-san really wants to know
is not that at all, but which client (slave server) is connecting.
It's not very clear how to identify the client, but knowing which socket
it came through doesn't seem to help for that.

regards, tom lane




Re: Adding Unix domain socket path and port to pg_stat_get_wal_senders()

2019-04-15 Thread Peter Eisentraut
On 2019-04-12 17:57, Tom Lane wrote:
>> For a socket connection, directory is important and that
>> information I can get from unix_socket_directories parameter (I've
>> never seen a setup with multiple socket directories).
> Those are actually pretty common, for example if you use Red Hat's
> packaging you will have both /var/run/postgresql and /tmp as socket
> directories (since they consider use of /tmp deprecated, but getting
> rid of all clients' use of it turns out to be really hard).  However,
> it's definitely fair to question whether anyone *cares* which of
> the server's socket directories a given connection used.  Aren't
> they going to be pretty much all equivalent?

So what is being asked here is really information about which end point
on the server is being connected to.  That is also information for the
TCP/IP case that we don't currently provide.  It's probably of marginal
use, as you also say.

I don't get what this has to do with walsenders specifically.  Do they
have each walsender connect to a different socket?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: COLLATE: Hash partition vs UPDATE

2019-04-15 Thread Tom Lane
Amit Langote  writes:
> Attached updated patch.

LGTM, pushed.

regards, tom lane




Re: finding changed blocks using WAL scanning

2019-04-15 Thread Bruce Momjian
On Wed, Apr 10, 2019 at 08:11:11PM -0400, Robert Haas wrote:
> On Wed, Apr 10, 2019 at 5:49 PM Robert Haas  wrote:
> > There is one thing that does worry me about the file-per-LSN-range
> > approach, and that is memory consumption when trying to consume the
> > information.  Suppose you have a really high velocity system.  I don't
> > know exactly what the busiest systems around are doing in terms of
> > data churn these days, but let's say just for kicks that we are
> > dirtying 100GB/hour.  That means, roughly 12.5 million block
> > references per hour.  If each block reference takes 12 bytes, that's
> > maybe 150MB/hour in block reference files.  If you run a daily
> > incremental backup, you've got to load all the block references for
> > the last 24 hours and deduplicate them, which means you're going to
> > need about 3.6GB of memory.  If you run a weekly incremental backup,
> > you're going to need about 25GB of memory.  That is not ideal.  One
> > can keep the memory consumption to a more reasonable level by using
> > temporary files.  For instance, say you realize you're going to need
> > 25GB of memory to store all the block references you have, but you
> > only have 1GB of memory that you're allowed to use.  Well, just
> > hash-partition the data 32 ways by dboid/tsoid/relfilenode/segno,
> > writing each batch to a separate temporary file, and then process each
> > of those 32 files separately.  That does add some additional I/O, but
> > it's not crazily complicated and doesn't seem too terrible, at least
> > to me.  Still, it's something not to like.
> 
> Oh, I'm being dumb.  We should just have the process that writes out
> these files sort the records first.  Then when we read them back in to
> use them, we can just do a merge pass like MergeAppend would do.  Then
> you never need very much memory at all.

Can I throw out a simple idea?  What if, when we finish writing a WAL
file, we create a new file 00010001.modblock which
lists all the heap/index files and block numbers modified in that WAL
file?  How much does that help with the list I posted earlier?

I think there is some interesting complexity brought up in this thread.
Which options are going to minimize storage I/O, network I/O, have only
background overhead, allow parallel operation, integrate with
pg_basebackup.  Eventually we will need to evaluate the incremental
backup options against these criteria.

I am thinking tools could retain modblock files along with WAL, could
pull full-page-writes from WAL, or from PGDATA.  It avoids the need to
scan 16MB WAL files, and the WAL files and modblock files could be
expired independently.
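
To make that concrete, one entry in such a .modblock file would not need
to be more than something like this (purely illustrative; the name and
layout are hypothetical):

    /* hypothetical on-disk entry describing one modified block */
    typedef struct ModBlockEntry
    {
        RelFileNode rnode;      /* tablespace, database, and relation OIDs */
        ForkNumber  forknum;    /* main fork, FSM, VM, ... */
        BlockNumber blkno;      /* block touched by a record in this WAL file */
    } ModBlockEntry;

Tools could then sort, merge, and deduplicate these entries without
having to read the WAL itself.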

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Tomas Vondra

On Mon, Apr 15, 2019 at 11:57:49AM -0700, Ashwin Agrawal wrote:

  On Mon, Apr 15, 2019 at 11:18 AM Tomas Vondra
   wrote:

Maybe. I'm not going to pretend I fully understand the internals. Does
that mean the container contains ZSUncompressedBtreeItem as elements? Or
just the plain Datum values?

  First, your reading of code and all the comments/questions so far have
  been highly encouraging. Thanks a lot for the same.


;-)


  The container contains ZSUncompressedBtreeItem elements. Each item has
  to store metadata such as its size and undo information. We don't wish
  to restrict compression to items from the same insertion session only.
  So no, it doesn't just store plain Datum values. We want to treat these
  as tuple-level operations, keep per-tuple metadata, and be able to work
  at tuple-level granularity rather than block level.


OK, thanks for the clarification, that somewhat explains my confusion.
So if I understand it correctly, ZSCompressedBtreeItem is essentially a
sequence of ZSUncompressedBtreeItem(s) stored one after another, along
with some additional top-level metadata.


  Definitely many more tricks can and need to be applied to optimize the
  storage format; for example, for fixed-width columns there is no need
  to store the size in every item. "Keep it simple" is the theme we have
  been trying to maintain. Compression should ideally handle duplicate
  data easily and efficiently as well, but we will try to optimize as
  much as we can without relying on it.


I think there's plenty of room for improvement. The main problem I see
is that it mixes different types of data, which is bad for compression
and vectorized execution. I think we'll end up with a very different
representation of the container, essentially decomposing the items into 
arrays of values of the same type - array of TIDs, array of undo 
pointers, buffer of serialized values, etc.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Peter Geoghegan
On Mon, Apr 15, 2019 at 1:02 PM Andres Freund  wrote:
> There's not much of an alternative currently. Indexes require tid
> looking things, and as a consequence (and some other comparatively small
> changes that'd be required) tableam does too.

I'm trying to establish whether or not that's the only reason. It
might be okay to use the same item pointer struct as the
representation of an integer-like logical identifier. Even if it isn't,
I'm still interested in just how logical the TIDs are, because it's an
important part of the overall design.

> That's one of the reasons why I've been trying to get you to get on
> board with allowing different leaf-level "item pointer equivalents"
> widths inside nbtree...

Getting me to agree that that would be nice and getting me to do the
work are two very different things.  ;-)

-- 
Peter Geoghegan




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Andres Freund
Hi,

On 2019-04-15 12:50:14 -0700, Peter Geoghegan wrote:
> On Mon, Apr 15, 2019 at 9:16 AM Ashwin Agrawal  wrote:
> > Would like to know more specifics on this Peter. We may be having different 
> > context on hybrid row/column design.
> 
> I'm confused about how close your idea of a TID is to the traditional
> definition from heapam (and even zheap). If it's a purely logical
> identifier, then why would it have two components like a TID? Is that
> just a short-term convenience or something?

There's not much of an alternative currently. Indexes require TID-looking
things, and as a consequence (and some other comparatively small
changes that'd be required) tableam does too. And there are a few places
that imbue additional meaning into the higher bits of ip_posid too, so
not all values are valid (it can't currently be zero, or
ItemPointerIsValid fails, and it can't be larger than MaxOffsetNumber,
which is used to allocate things in e.g. indexes, tidbitmap.c, etc.).

That's one of the reasons why I've been trying to get you to get on
board with allowing different leaf-level "item pointer equivalents"
widths inside nbtree...

Greetings,

Andres Freund




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Peter Geoghegan
On Mon, Apr 15, 2019 at 9:16 AM Ashwin Agrawal  wrote:
> Would like to know more specifics on this Peter. We may be having different 
> context on hybrid row/column design.

I'm confused about how close your idea of a TID is to the traditional
definition from heapam (and even zheap). If it's a purely logical
identifier, then why would it have two components like a TID? Is that
just a short-term convenience or something?

> Yes, the plan to optimize out TID space per datum, either by prefix 
> compression or delta compression or some other trick.

It would be easier to do this if you knew for sure that the TID
behaves almost the same as a bigserial column -- a purely logical
monotonically increasing identifier. That's why I'm interested in what
exactly you mean by TID, the stability of a TID value, etc. If a leaf
page almost always stores a range covering no more than few hundred
contiguous logical values,  you can justify aggressively compressing
the representation in the B-Tree entries. Compression would still be
based on prefix compression, but the representation itself can be
specialized.

-- 
Peter Geoghegan




Re: New vacuum option to do only freezing

2019-04-15 Thread Tom Lane
Robert Haas  writes:
> On Mon, Apr 15, 2019 at 1:13 PM Tom Lane  wrote:
>> I have a very strong feeling that this patch was not fully baked.

> I think you're right, but I don't understand the comment in the
> preceding paragraph.  How does this problem prevent tupgone from
> getting set?

My point is that I suspect that tupgone *shouldn't* get set.
It's not (going to be) gone.

> It looks to me like nleft_dead_tuples should be ripped out.

That was pretty much what I was thinking too.  It makes more sense
just to treat this case identically to dead-but-not-yet-removable.
I have substantial doubts about nleft_dead_itemids being worth
anything, as well.

> We should do something like:
> if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
> {
> nkeep += tups_vacuumed;
> tups_vacuumed = 0;
> }

No.  I'm thinking there should be exactly one test of index_cleanup
in this logic, and what it would be is along the lines of

if (HeapTupleIsHotUpdated(&tuple) ||
HeapTupleIsHeapOnly(&tuple) ||
+   params->index_cleanup == VACOPT_TERNARY_DISABLED)
nkeep += 1;
else

In general, this thing has a strong whiff of "large patch
with a small patch struggling to get out".

regards, tom lane




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Andres Freund
Hi,

On 2019-04-15 15:19:41 -0400, Tom Lane wrote:
> Peter Geoghegan  writes:
> > I am also concerned by the broad scope of ZedStore, and I tend to
> > agree that it will be difficult to maintain in core. At the same time,
> > I think that Andres and Robert are probably right about the difficulty
> > of maintaining it outside of core -- that would be difficult to
> > impossible as a practical matter.
> 
> Perhaps, but we won't know if we don't try.  I think we should try,
> and be willing to add hooks and flexibility to core as needed to make
> it possible.  Adding such flexibility would be good for other outside
> projects that have no chance of (or perhaps no interest in) getting into
> core, even if we end up deciding that ZedStore or some other specific
> implementation is so useful that it does belong in core.

I don't think anybody argued against providing that flexibility. I think
we should absolutely do so - but that's imo not an argument against
integrating something like a hypothetical well developed columnstore to
core.  I worked on tableam, which certainly provides a lot of new
extensibility, because it was the sane architecture to be able to integrate
zheap.  The current set of UNDO patches (developed for zheap), while
requiring core integration for xact.c etc, co-initiated improvements to
bring the checkpointer fsync handling closer to extensible, and UNDO as
currently developed would be extensible if WAL were extensible (it's
currently tied to rmgrlist.h).  And the improvements necessary to make
query execution for an in-core columnar AM faster would largely also be
applicable to out-of-core columnar AMs, and I'm sure we'd try to make
the necessary decisions not hardcoded if reasonable.

I think it's actually really hard to make something non-trivial
extensible without there being a proper in-core user of most of that
infrastructure.

Greetings,

Andres Freund




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Tom Lane
Andres Freund  writes:
> We also have at times pretty explicitly resisted making crucial pieces
> of infrastructure usable outside of core. E.g. because it's legitimately
> hard (grammar extensibility), or because we had some concerns around
> stability and the exact approach (WAL - the generic stuff is usable for
> anything that wants to even be somewhat efficient, some xlog
> integration).  So there's several types of extensions that one
> realistically cannot do out of core, by our choice.

Well, the grammar issue comes from a pretty specific technical problem:
bison grammars don't cope with run-time extension, and moving off of bison
would cost a lot of work, and probably more than just work (i.e., probable
loss of ability to detect grammar ambiguity).  WAL extensibility likewise
has some technical issues that are hard to surmount (how do you find the
code for replaying an extension WAL record, when you can't read catalogs).
I think we could fix the latter, it's just that no one has yet troubled
to expend the effort.  Similarly, things like the planner's hard-wired
handling of physical-tlist optimization are certainly a problem for
column stores, but I think the way to solve that is to provide an actual
extension capability, not merely replace one hard-wired behavior with two.
As a counterpoint to my gripe about SP-GiST being a time sink, I do not
think I'll regret the time I spent a few months ago on implementing
"planner support function" hooks.  I'm all in favor of adding flexibility
like that.

regards, tom lane




Re: New vacuum option to do only freezing

2019-04-15 Thread Robert Haas
On Mon, Apr 15, 2019 at 1:13 PM Tom Lane  wrote:
> Masahiko Sawada  writes:
> >> Ugh, I think the assertion is right but the above condition is
> >> completely wrong. We should increment nleft_dead_tuples when index
> >> cleanup is *not* enabled.
>
> > Here is a draft patch to fix this issue.
>
> So the real issue here, I fear, is that we've got no consistent testing
> of the whole case block for HEAPTUPLE_DEAD, as you can easily confirm
> by checking the code coverage report at
> https://coverage.postgresql.org/src/backend/access/heap/vacuumlazy.c.gcov.html
>
> This is perhaps unsurprising, given the code comment that points out that
> we can only reach that block if the tuple's state changed since
> heap_page_prune() a few lines above.  Still, it means that this patch
> hasn't been tested in that scenario, until we were lucky enough for
> a slow buildfarm machine like topminnow to hit it.
>
> What's more, because that block is the only way for "tupgone" to be
> set, we also don't reach the "if (tupgone)" block at lines 1183ff.
> And I think this patch has probably broken that, too.  Surely, if we
> are not going to remove the tuple, we should not increment tups_vacuumed?
> And maybe we have to freeze it instead?  How is it that we can, or should,
> treat this situation as different from a dead-but-not-removable tuple?
>
> I have a very strong feeling that this patch was not fully baked.

I think you're right, but I don't understand the comment in the
preceding paragraph.  How does this problem prevent tupgone from
getting set?

It looks to me like nleft_dead_tuples should be ripped out.  It's
basically trying to count the same thing as tups_vacuumed, and there
doesn't seem to be any need to count that thing twice.  And then just
below this block:

/* save stats for use later */
vacrelstats->tuples_deleted = tups_vacuumed;
vacrelstats->new_dead_tuples = nkeep;
vacrelstats->nleft_dead_itemids = nleft_dead_itemids;

We should do something like:

if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
{
nkeep += tups_vacuumed;
tups_vacuumed = 0;
}

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Checksum errors in pg_stat_database

2019-04-15 Thread Julien Rouhaud
Sorry for late reply,

On Sun, Apr 14, 2019 at 7:12 PM Magnus Hagander  wrote:
>
> On Sat, Apr 13, 2019 at 8:46 PM Robert Treat  wrote:
>>
>> On Fri, Apr 12, 2019 at 8:18 AM Magnus Hagander  wrote:
>> ISTM the argument here is go with zero since you have zero connections
>> vs go with null since you can't actually connect, so it doesn't make
>> sense. (There is a third argument about making it -1 since you can't
>> connect, but that breaks sum(numbackends) so it's easily dismissed.) I
>> think I would have gone for 0 personally, but what ended up surprising
>> me was that a bunch of other stuff like xact_commit show zero when
>> AFAICT the above reasoning would apply the same to those columns.
>> (unless there is a way to commit a transaction in the global objects
>> that I don't know about).
>
>
> That's a good point. I mean, you can commit a transaction that involves 
> changes of global objects, but it counts in the database that you were 
> connected to.
>
> We should probably at least make it consistent and make it NULL in all or 0 
> in all.
>
> I'm -1 for using -1 (!), for the very reason that you mention. But either 
> changing the numbackends to 0, or the others to NULL would work for 
> consistency. I'm leaning towards the 0 as well.

+1 for 0 :)  Especially since it's less code in the view.

>> What originally got me looking at this was the idea of returning -1
>> (or maybe null) for checksum failures for cases when checksums are not
>> enabled. This seems a little more complicated to set up, but seems
>> like it might ward off people thinking they are safe due to no
>> checksum error reports when they actually aren't.
>
>
> NULL seems like the reasonable thing to return there. I'm not sure what 
> you're referring to with a little more complicated to set up, though? Do you 
> mean somehow for the end user?
>
> Code-wise it seems it should be simple -- just do an "if checksums disabled 
> then return null"  in the two functions.

That's indeed a good point!  A lack of checksum errors is distinct from
checksums not being activated, and we should make that obvious.

I don't know if that counts as an open item, but I attach a patch for
all points discussed here.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 547fe4cce9..bf122f861a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2600,13 +2600,14 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
  checksum_failures
  bigint
  Number of data page checksum failures detected in this
- database
+  database, or NULL if data checksums are not enabled.
 
 
  checksum_last_failure
  timestamp with time zone
  Time at which the last data page checksum failure was detected in
- this database, or on a shared object.
+  this database (or on a shared object), or NULL if data checksums are not
+  enabled.
 
 
  blk_read_time
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 161bad6c90..566100d6df 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -817,7 +817,7 @@ CREATE VIEW pg_stat_database AS
 D.oid AS datid,
 D.datname AS datname,
 CASE
-WHEN (D.oid = (0)::oid) THEN NULL::integer
+WHEN (D.oid = (0)::oid) THEN 0
 ELSE pg_stat_get_db_numbackends(D.oid)
 END AS numbackends,
 pg_stat_get_db_xact_commit(D.oid) AS xact_commit,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 97f41fb46c..05240bfd14 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/xlog.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -1526,6 +1527,9 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
 	int64		result;
 	PgStat_StatDBEntry *dbentry;
 
+	if (!DataChecksumsEnabled())
+		PG_RETURN_NULL();
+
 	if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
 		result = 0;
 	else
@@ -1541,6 +1545,9 @@ pg_stat_get_db_checksum_last_failure(PG_FUNCTION_ARGS)
 	TimestampTz result;
 	PgStat_StatDBEntry *dbentry;
 
+	if (!DataChecksumsEnabled())
+		PG_RETURN_NULL();
+
 	if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
 		result = 0;
 	else
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 30973904c5..0c392e51e2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1806,7 +1806,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
 pg_stat_database| SELECT d.oid AS datid,
 d.datname,
 CASE
-WHEN (d.oid = (0)::oid) THEN NULL::integer
+WHEN (d.oid = (0)::oid) THEN 0
  

Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Peter Geoghegan
On Mon, Apr 15, 2019 at 12:19 PM Tom Lane  wrote:
> Perhaps, but we won't know if we don't try.  I think we should try,
> and be willing to add hooks and flexibility to core as needed to make
> it possible.

We could approach it without taking a firm position on inclusion in
core until the project begins to mature. I have little faith in our
ability to predict which approach will be the least painful at this
early stage.

-- 
Peter Geoghegan




Re: minimizing pg_stat_statements performance overhead

2019-04-15 Thread Raymond Martin
Hi, 
Apologies, but I had already created a commitfest entry here: 
https://commitfest.postgresql.org/23/2080/ . 
Any preference on which to keep?

Thanks, 
Raymond Martin 
rama...@microsoft.com

Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Andres Freund
Hi,

On 2019-04-15 14:35:43 -0400, Tom Lane wrote:
> Yeah, and that's something I've regretted more than once; I think SP-GiST
> is a sterling example of something that isn't nearly useful enough in the
> real world to justify the amount of maintenance effort we've been forced
> to expend on it.  You might trawl the commit logs to get a sense of the
> amount of my own personal time --- not that of the original submitters ---
> that's gone into that one module.  Then ask yourself how much that model
> will scale, and what other more-useful things I could've accomplished
> with that time.

I do agree that the [group of] contributor's history of maintaining such
work should play a role. And I think that's doubly so with a piece as
crucial as a table AM.

But:

> We do need to limit what we accept into core PG.  I do not buy your
> argument that users expect everything to be in core.  Or more accurately,
> the people who do think that way won't be using PG anyway --- they'll
> be using MSSQL because it comes from their OS vendor.

I don't think anybody disagrees with that, actually. Including
Robert.

But I don't think it follows that we shouldn't provide things that are
either much more reasonably done in core like a pooler (authentication /
encryption; infrastructure for managing state like prepared statements,
GUCs; avoiding issues of explosion of connection counts with pooling in
other places), are required by a very significant portion of our users
(imo the case for a columnar store or a row store without the
architectural issues of heap), or where it's hard to provide the
necessary infrastructure without an in-core user (imo also the case with
columnar, due to the necessary planner / executor improvements for fast
query execution).

We also have at times pretty explicitly resisted making crucial pieces
of infrastructure usable outside of core. E.g. because it's legitimately
hard (grammar extensibility), or because we had some concerns around
stability and the exact approach (WAL - the generic stuff is usable for
anything that wants to even be somewhat efficient, some xlog
integration).  So there's several types of extensions that one
realistically cannot do out of core, by our choice.

Greetings,

Andres Freund




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Tom Lane
Peter Geoghegan  writes:
> I am also concerned by the broad scope of ZedStore, and I tend to
> agree that it will be difficult to maintain in core. At the same time,
> I think that Andres and Robert are probably right about the difficulty
> of maintaining it outside of core -- that would be difficult to
> impossible as a practical matter.

Perhaps, but we won't know if we don't try.  I think we should try,
and be willing to add hooks and flexibility to core as needed to make
it possible.  Adding such flexibility would be good for other outside
projects that have no chance of (or perhaps no interest in) getting into
core, even if we end up deciding that ZedStore or some other specific
implementation is so useful that it does belong in core.

regards, tom lane




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Robert Haas
On Mon, Apr 15, 2019 at 2:35 PM Tom Lane  wrote:
> Robert Haas  writes:
> > On Mon, Apr 15, 2019 at 11:10 AM Tom Lane  wrote:
> >> There is a finite limit to how much stuff we can maintain as part of core.
>
> > I don't agree with that at all.
>
> Really?  Let's have a discussion of how thermodynamics applies to
> software management sometime.

Sounds like an interesting discussion, perhaps for PGCon, but what I
was actually disagreeing with was the idea that we should add a table
AM interface and then not accept any new table AMs, which I think
would be silly.  And if we're going to accept any, a columnar one
seems like a strong candidate.

> Yeah, and that's something I've regretted more than once; I think SP-GiST
> is a sterling example of something that isn't nearly useful enough in the
> real world to justify the amount of maintenance effort we've been forced
> to expend on it.  You might trawl the commit logs to get a sense of the
> amount of my own personal time --- not that of the original submitters ---
> that's gone into that one module.  Then ask yourself how much that model
> will scale, and what other more-useful things I could've accomplished
> with that time.

Yep, that's fair.

> We do need to limit what we accept into core PG.  I do not buy your
> argument that users expect everything to be in core.  Or more accurately,
> the people who do think that way won't be using PG anyway --- they'll
> be using MSSQL because it comes from their OS vendor.

I agree that we need to be judicious in what we accept, but I don't
agree that we should therefore accept nothing.  There are lots of
things that we could put in core and users would like it that I'm glad
we haven't put in core.

I think you might be surprised at the number of people who normally
want everything from a single source but are still willing to consider
PostgreSQL; vendors like my employer help to smooth the road for such
people.  Still, I don't think there is any major database product
other than PostgreSQL that ships only a single table storage format
and just expects that it will be good enough for everyone.  Like
640kB, it just isn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Peter Geoghegan
On Mon, Apr 15, 2019 at 11:35 AM Tom Lane  wrote:
> We do need to limit what we accept into core PG.  I do not buy your
> argument that users expect everything to be in core.  Or more accurately,
> the people who do think that way won't be using PG anyway --- they'll
> be using MSSQL because it comes from their OS vendor.

I am also concerned by the broad scope of ZedStore, and I tend to
agree that it will be difficult to maintain in core. At the same time,
I think that Andres and Robert are probably right about the difficulty
of maintaining it outside of core -- that would be difficult to
impossible as a practical matter.

Unfortunately, you're both right. I don't know where that leaves us.
-- 
Peter Geoghegan




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Ashwin Agrawal
On Mon, Apr 15, 2019 at 11:18 AM Tomas Vondra 
wrote:

>
> Maybe. I'm not going to pretend I fully understand the internals. Does
> that mean the container contains ZSUncompressedBtreeItem as elements? Or
> just the plain Datum values?
>

First, your reading of the code and all the comments/questions so far have
been highly encouraging. Thanks a lot for that.

The container contains ZSUncompressedBtreeItem elements. Each item has to
store meta-data such as its size and undo information, and we don't wish to
restrict compression to items from the same insertion session only. Hence,
yes - it doesn't store just plain Datum values. We want to treat these as
tuple-level operations, with per-item meta-data, and to be able to work at
tuple-level granularity rather than block level.

Definitely many more tricks can be, and need to be, applied to optimize the
storage format; for example, for fixed-width columns there is no need to
store the size in every item. "Keep it simple" is the theme we have been
trying to maintain. Compression should ideally handle duplicate data pretty
easily and efficiently as well, but we will try to optimize as much as we
can without relying on it.


Re: block-level incremental backup

2019-04-15 Thread Robert Haas
On Mon, Apr 15, 2019 at 9:01 AM Stephen Frost  wrote:
> I love the general idea of having additional facilities in core to
> support block-level incremental backups.  I've long been unhappy that
> any such approach ends up being limited to a subset of the files which
> need to be included in the backup, meaning the rest of the files have to
> be backed up in their entirety.  I don't think we have to solve for that
> as part of this, but I'd like to see a discussion for how to deal with
> the other files which are being backed up to avoid needing to just
> wholesale copy them.

Ideas?  Generally, I don't think that anything other than the main
forks of relations are worth worrying about, because the files are too
small to really matter.  Even if they're big, the main forks of
relations will be much bigger.  I think.

> I'm quite concerned that trying to graft this on to pg_basebackup
> (which, as you note later, is missing an awful lot of what users expect
> from a real backup solution already- retention handling, parallel
> capabilities, WAL archive management, and many more... but also is just
> not nearly as developed a tool as the external solutions) is going to
> make things unnecessarily difficult when what we really want here is
> better support from core for block-level incremental backup for the
> existing external tools to leverage.
>
> Perhaps there's something here which can be done with pg_basebackup to
> have it work with the block-level approach, but I certainly don't see
> it as a natural next step for it and really does seem like limiting the
> way this is implemented to something that pg_basebackup can easily
> digest might make it less useful for the more developed tools.

I agree that there are a bunch of things that pg_basebackup does not
do, such as backup management.  I think a lot of users do not want
PostgreSQL to do backup management for them.  They have an existing
solution that they use to manage backups, and they want PostgreSQL to
interoperate with it. I think it makes sense for pg_basebackup to be
in charge of taking the backup, and then other tools can either use it
as a building block or use the streaming replication protocol to send
approximately the same commands to the server.  I certainly would not
want to expose server capabilities that let you take an incremental
backup and NOT teach pg_basebackup to use them -- then we'd be in a
situation of saying that PostgreSQL has incremental backup, but you
have to get external tool XYZ to use it.  That will be perceived as
PostgreSQL does NOT have incremental backup and this external tool
adds it.

> As an example, I believe all of the other tools mentioned (at least,
> those that are open source I'm pretty sure all do) support parallel
> backup and therefore having a way to get the block-level changes in a
> parallel fashion would be a pretty big thing that those tools will want
> and pg_basebackup is single-threaded today and this proposal doesn't
> seem to be contemplating changing that, implying that a serial-based
> block-level protocol would be fine but that'd be a pretty awful
> restriction for the other tools.

I mentioned this exact issue in my original email.  I spoke positively
of it.  But I think it is different from what is being proposed here.
We could have parallel backup without incremental backup, and that
would be a good feature.  We could have incremental backup without
parallel backup, and that would also be a good feature.  We could also have
both, which would be best of all.  I don't see that my proposal throws
up any architectural obstacle to parallelism.  I assume parallel
backup, whether full or incremental, would be implemented by dividing
up the files that need to be sent across the available connections; if
incremental backup exists, each connection then has to decide whether
to send the whole file or only part of it.
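
To illustrate the shape of that division (everything below is invented for
the sketch, none of it is pg_basebackup code), the assignment could be as
simple as handing files out round-robin and letting each worker decide how
much of each file to ship:

    #include <stdio.h>

    /* hypothetical stand-in for the real transfer logic on one connection */
    static void
    send_file_or_changed_blocks(int conn_id, const char *path)
    {
        printf("connection %d: sending %s (whole, or only changed blocks)\n",
               conn_id, path);
    }

    /* assign backup files to worker connections round-robin */
    static void
    assign_files(const char **files, int nfiles, int nworkers)
    {
        for (int i = 0; i < nfiles; i++)
            send_file_or_changed_blocks(i % nworkers, files[i]);
    }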

> This part of the discussion is a another example of how we're limiting
> ourselves in this implementation to the "pg_basebackup can work with
> this" case- by only consideration the options of "scan all the files" or
> "use the WAL- if the request is for WAL we have available on the
> server."  The other backup solutions mentioned in your initial email,
> and others that weren't, have a WAL archive which includes a lot more
> WAL than just what the primary currently has.  When I've thought about
> how WAL could be used to build a differential or incremental backup, the
> question of "do we have all the WAL we need" hasn't ever been a
> consideration- because the backup tool manages the WAL archive and has
> WAL going back across, most likely, weeks or even months.  Having a tool
> which can essentially "compress" WAL would be fantastic and would be
> able to be leveraged by all of the different backup solutions.

I don't think this is a case of limiting ourselves; I think it's a
case of keeping separate considerations properly separate.  As I said
in my original email, the client doesn't rea

Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Andres Freund
Hi,

On 2019-04-15 14:11:02 -0400, Robert Haas wrote:
> I hate to pick on any particular part of the tree, but it seems
> entirely plausible to me that a second columnar storage implementation
> could deliver more incremental value than spgist, an index AM you
> committed.  We should not move the goal posts into the stratosphere
> here.

Oh, I forgot: I agree that we don't need to be absurdly picky - but I
also think that table storage is much more crucial to get right than
index storage, which is already plenty crucial. Especially when that
type of index is not commonly usable for constraints. It really sucks to
get wrong query results due to a corrupted index / wrong index
implementation - but if you have table AM level corruption, you're *really*
in a dark place. There's no way to just REINDEX and potentially recover
most information with a bit of surgery. Sure there can be app level
consequences to wrong query results that can be really bad, and lead to
very permanent data loss.  On-disk compat is also much more important
for table level data - it's annoying to have to reindex indexes after an
upgrade, but at least it can be done concurrently after the most
important indexes are operational.

Greetings,

Andres Freund




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Tom Lane
Robert Haas  writes:
> On Mon, Apr 15, 2019 at 11:10 AM Tom Lane  wrote:
>> There is a finite limit to how much stuff we can maintain as part of core.

> I don't agree with that at all.

Really?  Let's have a discussion of how thermodynamics applies to
software management sometime.

>>> If we have multiple candidates with sufficient code quality, then we may
>>> consider including both.

>> Dear god, no.

> I hate to pick on any particular part of the tree, but it seems
> entirely plausible to me that a second columnar storage implementation
> could deliver more incremental value than spgist, an index AM you
> committed.

Yeah, and that's something I've regretted more than once; I think SP-GiST
is a sterling example of something that isn't nearly useful enough in the
real world to justify the amount of maintenance effort we've been forced
to expend on it.  You might trawl the commit logs to get a sense of the
amount of my own personal time --- not that of the original submitters ---
that's gone into that one module.  Then ask yourself how much that model
will scale, and what other more-useful things I could've accomplished
with that time.

We do need to limit what we accept into core PG.  I do not buy your
argument that users expect everything to be in core.  Or more accurately,
the people who do think that way won't be using PG anyway --- they'll
be using MSSQL because it comes from their OS vendor.

regards, tom lane




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Andres Freund
Hi,

On 2019-04-15 14:11:02 -0400, Robert Haas wrote:
> Furthermore, different table AMs are going to have different
> needs.  It has already been remarked by both Andres and on this thread
> that for columnar storage to really zip along, the executor is going
> to need to be much smarter about deciding which columns to request.
> Presumably there will be a market for planner/executor optimizations
> that postpone fetching columns for as long as possible.  It's not
> going to be maintainable to build that kind of infrastructure in core
> and then have no in-core user of it.

Right. Two notes on that: A lot of that infrastructure needed for fast
query execution (both plan time and execution time) is also going to be
useful for a row store like heap, even though it won't have the
~order-of-magnitude impacts it can have for column stores. Secondly,
even without those, the storage density alone can make column stores
worthwhile, even without query execution speedups (or even slowdowns).

Greetings,

Andres Freund




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Andres Freund
Hi,

On 2019-04-15 11:10:38 -0400, Tom Lane wrote:
> Stephen Frost  writes:
> > * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> >> I think having a colstore in core is important not just for adoption,
> >> but also for testing and development of the executor / planner bits.
> 
> > Agreed.
> 
> TBH, I thought the reason we were expending so much effort on a tableam
> API was exactly so we *wouldn't* have to include such stuff in core.

I think it's mostly orthogonal. We need something like tableam to have
multiple types of storage options for tables - independent of whether
they are in core. And sure, we could have maybe reduced the effort a bit
here and there by e.g. not allowing AMs to be dynamically loaded, or
writing fewer comments or such.


> There is a finite limit to how much stuff we can maintain as part of core.
> We should embrace the notion that Postgres is an extensible system, rather
> than build all the tooling for extension and then proceed to dump stuff
> into core anyway.

I totally agree that that's something we should continue to focus
on. I personally think we *already* *have* embraced that - pretty
heavily so.  And sometimes to the detriment of our users.

I think there's a pretty good case for e.g. *one* column store
in-core. For one there is a huge portion of existing postgres workloads
that benefit from them (often not for all tables, but some). Relatedly,
it's also one of the more frequent reasons why users can't migrate to
postgres / have to migrate off.  And from a different angle, there's
plenty planner and executor work to be done to make column stores fast -
and that can't really be done nicely outside of core; and doing the
improvements in core without a user there is both harder, less likely to
be accepted, and more likely to regress.


> >> If we have multiple candidates with sufficient code quality, then we may
> >> consider including both.
> 
> Dear god, no.

Yea, I don't see much point in that. Unless there's a pretty fundamental
reason why one columnar AM can't fulfill two different workloads
(e.g. by having options that define how things are laid out / compressed
/ whatnot), I think that'd be a *terrible* idea. By that logic we'd just
get a lot of AMs with a few differences in some workloads, and our users
would be unable to choose one, and all of them would suck.  I think one
such fundamental difference is e.g. the visibility management for an
in-line mvcc approach like heap, and an undo-based mvcc row-store (like
zheap) - it's very hard to imagine meaningful code savings by having
those combined into one AM. I'm sure we can find similar large
architectural issues for some types of columnar AMs - but I'm far far
from convinced that there's enough distinctive need for two different
approaches in postgres.  Without having heap historically and the desire
for on-disk compat, I can't quite see being convinced that we should
e.g. add a store like heap if we already had zheap.

I think it's perfectly reasonable to have in-core AMs try to optimize
~80% for a lot of different [sets of] workloads, even if a very
specialized AM could be optimized for it much further.

Greetings,

Andres Freund




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Tomas Vondra

On Mon, Apr 15, 2019 at 10:50:21AM -0700, Ashwin Agrawal wrote:

  On Mon, Apr 15, 2019 at 10:33 AM Tomas Vondra
   wrote:

...

I see. Perhaps it'd be better to call the flag ZSBT_CONTAINER, when it
means "this is a container". And then have another flag to track whether
the container is compressed or not. But as I suggested elsewhere in this
thread, I think it might be better to store some ID of the compression
algorithm used instead of a simple flag.

FWIW when I had to deal with incremental compression (adding data into
already compressed buffers), which is what seems to be happening here, I
found it very useful/efficient to allow partially compressed buffers and
only trigger recompression when absolutely needed.

Applied to this case, the container would first store compressed chunk,
followed by raw (uncompressed) data. Say, like this:

ZSContainerData {

    // header etc.

    int nbytes;         /* total bytes in data */
    int ncompressed;    /* ncompressed <= nbytes, fully compressed when
                         * (ncompressed == nbytes) */

    char data[FLEXIBLE_ARRAY_MEMBER];
}

When adding a value to the buffer, it'd be simply appended to the data
array. When the container would grow too much (can't fit on the page or
something), recompression is triggered.

  I think what you suggested here is exactly how it's handled currently,
  just with slightly different mechanics. Plain items are added to the page
  as insertions are performed. Then, when the page becomes full, compression
  is triggered and a container item is created to store the compressed data.
  New insertions are again stored as plain items and, once the page becomes
  full again, they are compressed and a container item is created for them.
  So compressed data is never compressed a second time. On the page, the
  plain items are acting as the data section you mentioned above, and a page
  can hold a mix of plain and container items.


Maybe. I'm not going to pretend I fully understand the internals. Does
that mean the container contains ZSUncompressedBtreeItem as elements? Or
just the plain Datum values?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: block-level incremental backup

2019-04-15 Thread Robert Haas
On Thu, Apr 11, 2019 at 1:29 PM Anastasia Lubennikova
 wrote:
> 2) There should be a command to get only a map of changes without actual
> data.

Good idea.

> 4) The algorithm of collecting changed blocks is another topic.
> Though, its API should be discussed here:
>
> Do we want to have multiple implementations?
> Personally, I think that it's good to provide several strategies,
> since they have different requirements and fit for different workloads.
>
> Maybe we can add a hook to allow custom implementations.

I'm not sure a hook is going to be practical, but I do think we want
more than one strategy.

> I'll be happy to help with design, code, review, and testing.
> Hope that my experience with pg_probackup will be useful.

Great, thanks!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Robert Haas
On Mon, Apr 15, 2019 at 11:10 AM Tom Lane  wrote:
> TBH, I thought the reason we were expending so much effort on a tableam
> API was exactly so we *wouldn't* have to include such stuff in core.
>
> There is a finite limit to how much stuff we can maintain as part of core.
> We should embrace the notion that Postgres is an extensible system, rather
> than build all the tooling for extension and then proceed to dump stuff
> into core anyway.

I don't agree with that at all.   I expect, and hope, that there will
be some table AMs maintained outside of core, and I think that's
great.  At the same time, it's not like we have had any great success
with out-of-core index AMs, and I don't see that table AMs are likely
to be any different in that regard; indeed, they may be quite a bit
worse.  Up until now an index has only had to worry about one kind of
a table, but now a table is going to have to worry about every kind of
index.  Furthermore, different table AMs are going to have different
needs.  It has already been remarked by both Andres and on this thread
that for columnar storage to really zip along, the executor is going
to need to be much smarter about deciding which columns to request.
Presumably there will be a market for planner/executor optimizations
that postpone fetching columns for as long as possible.  It's not
going to be maintainable to build that kind of infrastructure in core
and then have no in-core user of it.

But even if it were, it would be foolish from an adoption perspective
to drive away people who are trying to contribute that kind of
technology to PostgreSQL.  Columnar storage is a big deal.  Very
significant numbers of people who won't consider PostgreSQL today
because the performance characteristics are not good enough for what
they need will consider it if it's got something like what Ashwin and
Heikki are building built in.  Some of those people may be determined
enough that even if the facility is out-of-core they'll be willing to
download an extension and compile it, but others won't.  It's already
a problem that people have to go get pgbouncer and/or pgpool to do
something that they kinda think the database should just handle.
Columnar storage, like JSON, is not some fringe thing where we can say
that the handful of people who want it can go get it: people expect
that to be a standard offering, and they wonder why PostgreSQL hasn't
got it yet.

> >> If we have multiple candidates with sufficient code quality, then we may
> >> consider including both.
>
> Dear god, no.

I hate to pick on any particular part of the tree, but it seems
entirely plausible to me that a second columnar storage implementation
could deliver more incremental value than spgist, an index AM you
committed.  We should not move the goal posts into the stratosphere
here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Robert Haas
On Sun, Apr 14, 2019 at 12:22 PM Tomas Vondra
 wrote:
> It's not clear to me whether you're arguing for not having any such
> implementation in core, or having multiple ones? I think we should aim
> to have at least one in-core implementation, even if it's not the best
> possible one for all sizes. It's not like our rowstore is the best
> possible implementation for all cases either.

I'm mostly arguing that it's too early to decide anything at this
point.  I'm definitely not opposed to having a column store in core.

> I think having a colstore in core is important not just for adoption,
> but also for testing and development of the executor / planner bits.
>
> If we have multiple candidates with sufficient code quality, then we may
> consider including both. I don't think it's very likely to happen in the
> same release, considering how much work it will require. And I have no
> idea if zedstore or VOPS are / will be the only candidates - it's way
> too early at this point.
>
> FWIW I personally plan to focus primarily on the features that aim to
> be included in core, and that applies to colstores too.

I agree with all of that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Ashwin Agrawal
On Mon, Apr 15, 2019 at 10:33 AM Tomas Vondra 
wrote:

> On Mon, Apr 15, 2019 at 09:29:37AM -0700, Ashwin Agrawal wrote:
> >   On Sun, Apr 14, 2019 at 9:40 AM Tomas Vondra
> >wrote:
> >
> > On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote:
> > >On 11/04/2019 17:54, Tom Lane wrote:
> > >>Ashwin Agrawal  writes:
> > >>>Thank you for trying it out. Yes, noticed for certain patterns
> > pg_lzcompress() actually requires much larger output buffers. Like
> for
> > one 86 len source it required 2296 len output buffer. Current
> zedstore
> > code doesn’t handle this case and errors out. LZ4 for same patterns
> > works fine, would highly recommend using LZ4 only, as anyways speed
> is
> > very fast as well with it.
> > >>
> > >>You realize of course that *every* compression method has some
> inputs
> > that
> > >>it makes bigger.  If your code assumes that compression always
> > produces a
> > >>smaller string, that's a bug in your code, not the compression
> > algorithm.
> > >
> > >Of course. The code is not making that assumption, although clearly
> > >there is a bug there somewhere because it throws that error. It's
> > >early days..
> > >
> > >In practice it's easy to weasel out of that, by storing the data
> > >uncompressed, if compression would make it longer. Then you need an
> > >extra flag somewhere to indicate whether it's compressed or not. It
> > >doesn't break the theoretical limit because the actual stored length
> > >is then original length + 1 bit, but it's usually not hard to find a
> > >place for one extra bit.
> > >
> >
> > Don't we already have that flag, though? I see ZSCompressedBtreeItem
> has
> > t_flags, and there's ZSBT_COMPRESSED, but maybe it's more
> complicated.
> >
>   The flag ZSBT_COMPRESSED differentiates between a container (compressed)
>   item and a plain (uncompressed) item. The current code is written such
>   that within a container (compressed) item, all the data is compressed. If
>   there is a need to store some uncompressed data inside a container item,
>   then this additional flag would be required to indicate that. Hence it's
>   different from ZSBT_COMPRESSED. I am thinking one of the ways could be to
>   simply not store such a datum in a container item if it can't be
>   compressed, and just store it as a plain item with uncompressed data; then
>   this additional flag won't be required. I will know more once I write the
>   code for this.
>
> I see. Perhaps it'd be better to call the flag ZSBT_CONTAINER, when it
> means "this is a container". And then have another flag to track whether
> the container is compressed or not. But as I suggested elsewhere in this
> thread, I think it might be better to store some ID of the compression
> algorithm used instead of a simple flag.
>
> FWIW when I had to deal with incremental compression (adding data into
> already compressed buffers), which is what seems to be happening here, I
> found it very useful/efficient to allow partially compressed buffers and
> only trigger recompression when absolutely needed.
>
> Applied to this case, the container would first store compressed chunk,
> followed by raw (uncompressed) data. Say, like this:
>
> ZSContainerData {
>
> // header etc.
>
> int nbytes; /* total bytes in data */
> int ncompressed;/* ncompressed <= nbytes, fully compressed when
>  * (ncompressed == nbytes) */
>
> char data[FLEXIBLE_ARRAY_MEMBER];
> }
>
> When adding a value to the buffer, it'd be simply appended to the data
> array. When the container would grow too much (can't fit on the page or
> something), recompression is triggered.
>

I think what you suggested here is exactly how it's handled currently, just
with slightly different mechanics. Plain items are added to the page as
insertions are performed. Then, when the page becomes full, compression is
triggered and a container item is created to store the compressed data. New
insertions are again stored as plain items and, once the page becomes full
again, they are compressed and a container item is created for them. So
compressed data is never compressed a second time. On the page, the plain
items are acting as the data section you mentioned above, and a page can
hold a mix of plain and container items.


Re: Minor fix in reloptions.c comments

2019-04-15 Thread Robert Haas
On Fri, Apr 12, 2019 at 12:01 AM Michael Paquier  wrote:
> On Fri, Apr 12, 2019 at 02:41:37AM +, Jamison, Kirk wrote:
> > I found some minor grammar mistake while reading reloptions.c code comments.
> > Attached is the fix.
> > I just changed "affect" to "effect", for both n_distinct and 
> > vacuum_truncate.
> >   - * values has no affect until the ...
> >   + * values has no effect until the ...
>
> A lot of those parameter updates affect processing and still they have
> many side effects, as per those paragraphs.

Well, "has no affect" is clearly wrong here, and Kirk's fix is clearly
right.  I don't know what your point here is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Tomas Vondra

On Mon, Apr 15, 2019 at 09:29:37AM -0700, Ashwin Agrawal wrote:

  On Sun, Apr 14, 2019 at 9:40 AM Tomas Vondra
   wrote:

On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote:
>On 11/04/2019 17:54, Tom Lane wrote:
>>Ashwin Agrawal  writes:
>>>Thank you for trying it out. Yes, noticed for certain patterns
pg_lzcompress() actually requires much larger output buffers. Like for
one 86 len source it required 2296 len output buffer. Current zedstore
code doesn’t handle this case and errors out. LZ4 for same patterns
works fine, would highly recommend using LZ4 only, as anyways speed is
very fast as well with it.
>>
>>You realize of course that *every* compression method has some inputs
that
>>it makes bigger.  If your code assumes that compression always
produces a
>>smaller string, that's a bug in your code, not the compression
algorithm.
>
>Of course. The code is not making that assumption, although clearly
>there is a bug there somewhere because it throws that error. It's
>early days..
>
>In practice it's easy to weasel out of that, by storing the data
>uncompressed, if compression would make it longer. Then you need an
>extra flag somewhere to indicate whether it's compressed or not. It
>doesn't break the theoretical limit because the actual stored length
>is then original length + 1 bit, but it's usually not hard to find a
>place for one extra bit.
>

Don't we already have that flag, though? I see ZSCompressedBtreeItem has
t_flags, and there's ZSBT_COMPRESSED, but maybe it's more complicated.

  The flag ZSBT_COMPRESSED differentiates between a container (compressed)
  item and a plain (uncompressed) item. The current code is written such that
  within a container (compressed) item, all the data is compressed. If there
  is a need to store some uncompressed data inside a container item, then
  this additional flag would be required to indicate that. Hence it's
  different from ZSBT_COMPRESSED. I am thinking one of the ways could be to
  simply not store such a datum in a container item if it can't be
  compressed, and just store it as a plain item with uncompressed data; then
  this additional flag won't be required. I will know more once I write the
  code for this.


I see. Perhaps it'd be better to call the flag ZSBT_CONTAINER, when it
means "this is a container". And then have another flag to track whether
the container is compressed or not. But as I suggested elsewhere in this
thread, I think it might be better to store some ID of the compression
algorithm used instead of a simple flag.

FWIW when I had to deal with incremental compression (adding data into
already compressed buffers), which is what seems to be happening here, I
found it very useful/efficient to allow partially compressed buffers and
only trigger recompression when absolutely needed.

Applied to this case, the container would first store compressed chunk,
followed by raw (uncompressed) data. Say, like this:

ZSContainerData {

   // header etc.

   int nbytes; /* total bytes in data */
   int ncompressed;/* ncompressed <= nbytes, fully compressed when
* (ncompressed == nbytes) */

   char data[FLEXIBLE_ARRAY_MEMBER];
}

When adding a value to the buffer, it'd be simply appended to the data
array. When the container would grow too much (can't fit on the page or
something), recompression is triggered.
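
A very rough, self-contained sketch of that cycle, with a dummy compressor
standing in for whatever the AM actually uses (nothing here is zedstore
code):

    #include <string.h>

    #define CONTAINER_CAPACITY 8192      /* stand-in for "fits on a page" */

    typedef struct Container
    {
        int     nbytes;                  /* total bytes used in data[] */
        int     ncompressed;             /* compressed prefix; raw tail follows */
        char    data[CONTAINER_CAPACITY];
    } Container;

    /* dummy stand-in for the real compressor (pglz, lz4, ...); it only
     * marks the whole buffer as compressed instead of shrinking it */
    static void
    recompress_container(Container *c)
    {
        c->ncompressed = c->nbytes;
    }

    /* append one raw value; recompress only when the buffer would overflow */
    static int
    container_append(Container *c, const char *value, int len)
    {
        if (c->nbytes + len > CONTAINER_CAPACITY)
        {
            recompress_container(c);
            if (c->nbytes + len > CONTAINER_CAPACITY)
                return 0;                /* caller starts a new container */
        }
        memcpy(c->data + c->nbytes, value, len);
        c->nbytes += len;
        return 1;
    }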


cheers

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Autovacuum-induced regression test instability

2019-04-15 Thread Tom Lane
In connection with the issue discussed at [1], I tried to run
the core regression tests with extremely aggressive autovacuuming
(I set autovacuum_naptime = 1s, autovacuum_vacuum_threshold = 5,
autovacuum_vacuum_cost_delay = 0).  I found that the timestamp
test tends to fail with diffs caused by unstable row order in
timestamp_tbl.  This is evidently because it does a couple of
DELETEs before inserting the table's final contents; if autovac
comes along at the right time then some of those slots can get
recycled in between insertions.  I'm thinking of committing the
attached patch to prevent this, since in principle such failures
could occur even without hacking the autovac settings.  Thoughts?

regards, tom lane

[1] https://www.postgresql.org/message-id/15751.1555256860%40sss.pgh.pa.us
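
For reference, the aggressive settings described above amount to a
postgresql.conf fragment roughly like this (autovacuum itself left on, as
it is by default):

    autovacuum = on
    autovacuum_naptime = 1s
    autovacuum_vacuum_threshold = 5
    autovacuum_vacuum_cost_delay = 0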

diff --git a/src/test/regress/expected/timestamp.out b/src/test/regress/expected/timestamp.out
index 4a2fabd..b2b171f 100644
--- a/src/test/regress/expected/timestamp.out
+++ b/src/test/regress/expected/timestamp.out
@@ -74,7 +74,7 @@ SELECT count(*) AS two FROM TIMESTAMP_TBL WHERE d1 = timestamp(2) without time z
 (1 row)
 
 COMMIT;
-DELETE FROM TIMESTAMP_TBL;
+TRUNCATE TIMESTAMP_TBL;
 -- Special values
 INSERT INTO TIMESTAMP_TBL VALUES ('-infinity');
 INSERT INTO TIMESTAMP_TBL VALUES ('infinity');
diff --git a/src/test/regress/sql/timestamp.sql b/src/test/regress/sql/timestamp.sql
index b7957cb..150eb54 100644
--- a/src/test/regress/sql/timestamp.sql
+++ b/src/test/regress/sql/timestamp.sql
@@ -44,7 +44,7 @@ SELECT pg_sleep(0.1);
 SELECT count(*) AS two FROM TIMESTAMP_TBL WHERE d1 = timestamp(2) without time zone 'now';
 COMMIT;
 
-DELETE FROM TIMESTAMP_TBL;
+TRUNCATE TIMESTAMP_TBL;
 
 -- Special values
 INSERT INTO TIMESTAMP_TBL VALUES ('-infinity');


Re: New vacuum option to do only freezing

2019-04-15 Thread Tom Lane
Masahiko Sawada  writes:
>> Ugh, I think the assertion is right but the above condition is
>> completely wrong. We should increment nleft_dead_tuples when index
>> cleanup is *not* enabled.

> Here is a draft patch to fix this issue.

So the real issue here, I fear, is that we've got no consistent testing
of the whole case block for HEAPTUPLE_DEAD, as you can easily confirm
by checking the code coverage report at
https://coverage.postgresql.org/src/backend/access/heap/vacuumlazy.c.gcov.html

This is perhaps unsurprising, given the code comment that points out that
we can only reach that block if the tuple's state changed since
heap_page_prune() a few lines above.  Still, it means that this patch
hasn't been tested in that scenario, until we were lucky enough for
a slow buildfarm machine like topminnow to hit it.

What's more, because that block is the only way for "tupgone" to be
set, we also don't reach the "if (tupgone)" block at lines 1183ff.
And I think this patch has probably broken that, too.  Surely, if we
are not going to remove the tuple, we should not increment tups_vacuumed?
And maybe we have to freeze it instead?  How is it that we can, or should,
treat this situation as different from a dead-but-not-removable tuple?

I have a very strong feeling that this patch was not fully baked.

BTW, I can reproduce the crash fairly easily (within a few cycles
of the regression tests) by (a) inserting pg_usleep(1) after
the heap_page_prune call, and (b) making autovacuum much more
aggressive.

regards, tom lane




Re: plpgsql - execute - cannot use a reference to record field

2019-04-15 Thread Pavel Stehule
po 15. 4. 2019 v 18:07 odesílatel Tom Lane  napsal:

> Pavel Stehule  writes:
> > Is there reason why following code should not to work?
>
> > do $$
> > declare r record; result int;
> > begin
> >   select 10 as a, 20 as b into r;
> >   raise notice 'a: %', r.a;
> >   execute 'select $1.a + $1.b' into result using r;
> >   raise notice '%', result;
> > end;
> > $$
>
> You can't select fields by name out of an unspecified record.
> The EXECUTE'd query is not particularly different from
>
> regression=# prepare foo(record) as select $1.a + $1.b;
> psql: ERROR:  could not identify column "a" in record data type
> LINE 1: prepare foo(record) as select $1.a + $1.b;
>   ^
>
> and surely you wouldn't expect that to work.
> (The fact that either of the previous lines work is
> thanks to plpgsql-specific hacking.)
>

Yes. Looking at the code, I see that SPI_execute_with_args doesn't allow
pushing typmods there.
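
For what it's worth, a workaround that should behave as expected is to give
the variable a named composite type, so the parameter's fields can be
resolved at plan time; a small sketch (the type name is made up):

    create type pair_ab as (a int, b int);

    do $$
    declare r pair_ab; result int;
    begin
      select 10 as a, 20 as b into r;
      execute 'select $1.a + $1.b' into result using r;
      raise notice '%', result;   -- prints 30
    end;
    $$;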

Regards

Pavel




> regards, tom lane
>


Re: Multivariate MCV lists -- pg_mcv_list_items() seems to be broken

2019-04-15 Thread Tomas Vondra

On Mon, Apr 15, 2019 at 12:26:02PM -0400, Tom Lane wrote:

Dean Rasheed  writes:

SELECT pg_mcv_list_items(stxmcv) from pg_statistic_ext WHERE stxname = 'foo_s';
which fails with
ERROR:  cache lookup failed for type 0



That definitely used to work, so I'm guessing it got broken by the
recent reworking of the serialisation code, but I've not looked into
it.


Yeah, looks like sloppy thinking about whether or not the types array
participates in maxalign-forcing?



Actually, no. It seems aligned just fine, AFAICS. The bug is a bit more
embarrassing - the deserialization does

   memcpy(ptr, mcvlist->types, sizeof(Oid) * ndims);

while it should be doing

   memcpy(mcvlist->types, ptr, sizeof(Oid) * ndims);

Will fix.


cheers

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: block-level incremental backup

2019-04-15 Thread Bruce Momjian
On Mon, Apr 15, 2019 at 09:01:11AM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Robert Haas (robertmh...@gmail.com) wrote:
> > Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
> > developed technology that permits a block-level incremental backup to
> > be taken from a PostgreSQL server.  I believe the idea in all of those
> > cases is that non-relation files should be backed up in their
> > entirety, but for relation files, only those blocks that have been
> > changed need to be backed up.
> 
> I love the general idea of having additional facilities in core to
> support block-level incremental backups.  I've long been unhappy that
> any such approach ends up being limited to a subset of the files which
> need to be included in the backup, meaning the rest of the files have to
> be backed up in their entirety.  I don't think we have to solve for that
> as part of this, but I'd like to see a discussion for how to deal with
> the other files which are being backed up to avoid needing to just
> wholesale copy them.

I assume you are talking about non-heap/index files.  Which of those are
large enough to benefit from incremental backup?

> > I would like to propose that we should
> > have a solution for this problem in core, rather than leaving it to
> > each individual PostgreSQL company to develop and maintain their own
> > solution. 
> 
> I'm certainly a fan of improving our in-core backup solutions.
> 
> I'm quite concerned that trying to graft this on to pg_basebackup
> (which, as you note later, is missing an awful lot of what users expect
> from a real backup solution already- retention handling, parallel
> capabilities, WAL archive management, and many more... but also is just
> not nearly as developed a tool as the external solutions) is going to
> make things unnecessarily difficult when what we really want here is
> better support from core for block-level incremental backup for the
> existing external tools to leverage.

I think there is some interesting complexity brought up in this thread. 
Which options are going to minimize storage I/O and network I/O, have only
background overhead, allow parallel operation, and integrate with
pg_basebackup?  Eventually we will need to evaluate the incremental
backup options against these criteria.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Cleanup/remove/update references to OID column

2019-04-15 Thread Daniel Verite
Andres Freund wrote:

> Yes, I was planning to commit that soon-ish. There still seemed
> review / newer versions happening, though, so I was thinking of waiting
> for a bit longer.

You might want to apply this trivial one in the same batch:

index 452f307..7cfb67f 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -428,7 +428,7 @@ main(int argc, char **argv)
 
InitDumpOptions(&dopt);
 
-   while ((c = getopt_long(argc, argv,
"abBcCd:E:f:F:h:j:n:N:oOp:RsS:t:T:U:vwWxZ:",
+   while ((c = getopt_long(argc, argv,
"abBcCd:E:f:F:h:j:n:N:Op:RsS:t:T:U:vwWxZ:",
long_options,
&optindex)) != -1)
{
switch (c)

"o" in the options list is a leftover. Leaving it in getopt_long() has the 
effect that pg_dump -o fails (per the default case in the switch),
but it's missing the expected error message (pg_dump: invalid option -- 'o')


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Ashwin Agrawal
On Sun, Apr 14, 2019 at 9:40 AM Tomas Vondra 
wrote:

> On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote:
> >On 11/04/2019 17:54, Tom Lane wrote:
> >>Ashwin Agrawal  writes:
> >>>Thank you for trying it out. Yes, noticed for certain patterns
> pg_lzcompress() actually requires much larger output buffers. Like for one
> 86 len source it required 2296 len output buffer. Current zedstore code
> doesn’t handle this case and errors out. LZ4 for same patterns works fine,
> would highly recommend using LZ4 only, as anyways speed is very fast as
> well with it.
> >>
> >>You realize of course that *every* compression method has some inputs
> that
> >>it makes bigger.  If your code assumes that compression always produces a
> >>smaller string, that's a bug in your code, not the compression algorithm.
> >
> >Of course. The code is not making that assumption, although clearly
> >there is a bug there somewhere because it throws that error. It's
> >early days..
> >
> >In practice it's easy to weasel out of that, by storing the data
> >uncompressed, if compression would make it longer. Then you need an
> >extra flag somewhere to indicate whether it's compressed or not. It
> >doesn't break the theoretical limit because the actual stored length
> >is then original length + 1 bit, but it's usually not hard to find a
> >place for one extra bit.
> >
>
> Don't we already have that flag, though? I see ZSCompressedBtreeItem has
> t_flags, and there's ZSBT_COMPRESSED, but maybe it's more complicated.
>

The flag ZSBT_COMPRESSED differentiates between a container (compressed) item
and a plain (uncompressed) item. The current code is written such that within a
container (compressed) item, all the data is compressed. If there is a need to
store some part of the data uncompressed inside a container item, then this
additional flag would be required to indicate that; hence it's different from
ZSBT_COMPRESSED. I am thinking one way could be to simply not store such a
datum in a container item if it can't be compressed, and instead store it as a
plain item with uncompressed data, in which case this additional flag won't be
required. Will know more once I write the code for this.
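
For what it's worth, a minimal sketch of that fallback could look like the
following (zs_form_compressed_item and zs_form_plain_item are made-up names,
not actual zedstore functions): try pglz, and store the datum as a plain item
whenever the compressor refuses or the result would not be smaller.

#include "postgres.h"
#include "common/pg_lzcompress.h"

static void
zs_store_datum(const char *data, int32 len)
{
    char   *buf = palloc(PGLZ_MAX_OUTPUT(len));
    int32   clen = pglz_compress(data, len, buf, PGLZ_strategy_default);

    if (clen >= 0 && clen < len)
        zs_form_compressed_item(buf, clen, len);    /* sets ZSBT_COMPRESSED */
    else
        zs_form_plain_item(data, len);              /* stored uncompressed */

    pfree(buf);
}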


Re: Multivariate MCV lists -- pg_mcv_list_items() seems to be broken

2019-04-15 Thread Tom Lane
Dean Rasheed  writes:
> SELECT pg_mcv_list_items(stxmcv) from pg_statistic_ext WHERE stxname = 
> 'foo_s';
> which fails with
> ERROR:  cache lookup failed for type 0

> That definitely used to work, so I'm guessing it got broken by the
> recent reworking of the serialisation code, but I've not looked into
> it.

Yeah, looks like sloppy thinking about whether or not the types array
participates in maxalign-forcing?

> There should probably be regression test coverage of that function.

+1

regards, tom lane




Re: Multivariate MCV lists -- pg_mcv_list_items() seems to be broken

2019-04-15 Thread Tomas Vondra

On Mon, Apr 15, 2019 at 05:02:43PM +0100, Dean Rasheed wrote:

I just noticed the following:

CREATE TABLE foo (a int, b int);
INSERT INTO foo SELECT x/10, x/100 FROM generate_series(1, 100) x;
CREATE STATISTICS foo_s ON a,b FROM foo;
ANALYSE foo;

SELECT pg_mcv_list_items(stxmcv) from pg_statistic_ext WHERE stxname = 'foo_s';

which fails with

ERROR:  cache lookup failed for type 0

That definitely used to work, so I'm guessing it got broken by the
recent reworking of the serialisation code, but I've not looked into
it.



Yeah, that seems like a bug. I'll take a look.


There should probably be regression test coverage of that function.



Agreed. I plan to rework the existing tests to use the same approach as
the MCV, so I'll add a test for this function too.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Ashwin Agrawal
On Sat, Apr 13, 2019 at 4:22 PM Peter Geoghegan  wrote:

> On Thu, Apr 11, 2019 at 6:06 AM Rafia Sabih 
> wrote:
> > Reading about it reminds me of this work -- TAG column storage(
> http://www09.sigmod.org/sigmod/record/issues/0703/03.article-graefe.pdf
> ).
> > Isn't this storage system inspired from there, with TID as the TAG?
> >
> > It is not referenced here so made me wonder.
>
> I don't think they're particularly similar, because that paper
> describes an architecture based on using purely logical row
> identifiers, which is not what a TID is. TID is a hybrid
> physical/logical identifier, sometimes called a "physiological"
> identifier, which will have significant overhead.


The storage system wasn't inspired by that paper, but yes, it seems it also
talks about laying out column data in btrees, which is good to see. As pointed
out by Peter, though, the main aspect the paper focuses on, saving space for
the TAG, isn't something zedstore plans to leverage, since it is more
restrictive. As discussed below, we can use other alternatives to save space.


> Ashwin said that
> ZedStore TIDs are logical identifiers, but I don't see how that's
> compatible with a hybrid row/column design (unless you map heap TID to
> logical row identifier using a separate B-Tree).
>

Would like to know more specifics on this, Peter. We may have different
contexts on the hybrid row/column design. When we said the design supports
hybrid row/column families, we did not mean within the same table; a single
table cannot have some data stored row-oriented and some column-oriented, so
for a given table the structure is homogeneous. But the design can easily
support storing all the columns together, a subset of columns together, or
each column separately, all connected by TID.


> The big idea with Graefe's TAG design is that there is practically no
> storage overhead for these logical identifiers, because each entry's
> identifier is calculated by adding its slot number to the page's
> tag/low key. The ZedStore design, in contrast, explicitly stores TID
> for every entry. ZedStore seems more flexible for that reason, but at
> the same time the per-datum overhead seems very high to me. Maybe
> prefix compression could help here, which a low key and high key can
> do rather well.
>

Yes, the plan is to optimize away the TID space per datum, either by prefix
compression or delta compression or some other trick.
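
As a toy illustration of the delta idea (this is not zedstore code; the zstid
layout and the per-group encoding are assumptions), one could store the first
TID of a group in full and then only variable-length deltas, which stay very
small when logical TIDs are densely packed:

#include "postgres.h"

typedef uint64 zstid;           /* assumed 64-bit logical TID */

static int
encode_tid_deltas(const zstid *tids, int ntids, uint8 *out)
{
    uint8  *p = out;
    zstid   prev = 0;

    for (int i = 0; i < ntids; i++)
    {
        uint64  delta = tids[i] - prev;     /* tids assumed ascending */

        /* varint encoding: 7 bits per byte, high bit set on continuation */
        do
        {
            uint8   b = delta & 0x7F;

            delta >>= 7;
            *p++ = b | (delta ? 0x80 : 0);
        } while (delta);

        prev = tids[i];
    }

    return p - out;             /* encoded size in bytes */
}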


Re: plpgsql - execute - cannot use a reference to record field

2019-04-15 Thread Tom Lane
Pavel Stehule  writes:
> Is there reason why following code should not to work?

> do $$
> declare r record; result int;
> begin
>   select 10 as a, 20 as b into r;
>   raise notice 'a: %', r.a;
>   execute 'select $1.a + $1.b' into result using r;
>   raise notice '%', result;
> end;
> $$

You can't select fields by name out of an unspecified record.
The EXECUTE'd query is not particularly different from

regression=# prepare foo(record) as select $1.a + $1.b;
psql: ERROR:  could not identify column "a" in record data type
LINE 1: prepare foo(record) as select $1.a + $1.b;
  ^

and surely you wouldn't expect that to work.
(The fact that either of the previous lines work is
thanks to plpgsql-specific hacking.)

regards, tom lane




Multivariate MCV lists -- pg_mcv_list_items() seems to be broken

2019-04-15 Thread Dean Rasheed
I just noticed the following:

CREATE TABLE foo (a int, b int);
INSERT INTO foo SELECT x/10, x/100 FROM generate_series(1, 100) x;
CREATE STATISTICS foo_s ON a,b FROM foo;
ANALYSE foo;

SELECT pg_mcv_list_items(stxmcv) from pg_statistic_ext WHERE stxname = 'foo_s';

which fails with

ERROR:  cache lookup failed for type 0

That definitely used to work, so I'm guessing it got broken by the
recent reworking of the serialisation code, but I've not looked into
it.

There should probably be regression test coverage of that function.

Regards,
Dean




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Tom Lane
Stephen Frost  writes:
> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
>> I think having a colstore in core is important not just for adoption,
>> but also for testing and development of the executor / planner bits.

> Agreed.

TBH, I thought the reason we were expending so much effort on a tableam
API was exactly so we *wouldn't* have to include such stuff in core.

There is a finite limit to how much stuff we can maintain as part of core.
We should embrace the notion that Postgres is an extensible system, rather
than build all the tooling for extension and then proceed to dump stuff
into core anyway.

>> If we have multiple candidates with sufficient code quality, then we may
>> consider including both.

Dear god, no.

regards, tom lane




Re: Accidental setting of XLogReaderState.private_data ?

2019-04-15 Thread Tom Lane
Antonin Houska  writes:
> StartupDecodingContext() initializes ctx->reader->private_data with ctx, and
> it even does so twice. I couldn't find a place in the code where the
> (LogicalDecodingContext *) pointer is retrieved from the reader, and a simple
> test of logical replication works if the patch below is applied. Thus I assume
> that assignment is a thinko, isn't it?

Hmm.  The second, duplicate assignment is surely pointless, but I think
that setting the ctx as the private_data is a good idea.  It hardly seems
out of the question that it might be needed in future.

regards, tom lane




Re: partitioning performance tests after recent patches

2019-04-15 Thread Tom Lane
Amit Langote  writes:
> On 2019/04/15 4:19, Floris Van Nee wrote:
>> 2) Is running non-inlined SQL functions with a generic plan even the best 
>> option all the time? Wouldn't it be better to adopt a similar approach to 
>> what plpgsql does, where it tries to test if using a generic plan is 
>> beneficial? The non-inlineable SQL functions suffer a big performance hit 
>> for a large number of partitions, because they cannot rely on static 
>> planning-time pruning.

> I'd never noticed this before.  It indeed seems to be the case that SQL
> functions and plpgsql functions are handled using completely different
> code paths, and only with the latter is it possible to use static
> planning-time pruning.

Yeah.  Another big problem with the current implementation of SQL
functions is that there's no possibility of cross-query plan caching.
At some point I'd like to throw away functions.c and start over
with an implementation more similar to how plpgsql does it (in
particular, with persistent state and use of the plan cache).
It hasn't gotten to the top of the to-do queue though, mostly because
I think not many people use SQL-language functions except when they
want them to be inlined.

regards, tom lane




Re: hyrax vs. RelationBuildPartitionDesc

2019-04-15 Thread Amit Langote
On Mon, Apr 15, 2019 at 5:05 PM Amit Langote
 wrote:
> On 2019/04/15 2:38, Tom Lane wrote:
> > So the point here is that that reasoning is faulty.  You *cannot* assume,
> > no matter how strong a lock or how many pins you hold, that a relcache
> > entry will not get rebuilt underneath you.  Cache flushes happen
> > regardless.  And unless relcache.c takes special measures to prevent it,
> > a rebuild will result in moving subsidiary data structures and thereby
> > breaking any pointers you may have pointing into those data structures.
> >
> > For certain subsidiary structures such as the relation tupdesc,
> > we do take such special measures: that's what the "keep_xxx" dance in
> > RelationClearRelation is.  However, that's expensive, both in cycles
> > and maintenance effort: it requires having code that can decide equality
> > of the subsidiary data structures, which we might well have no other use
> > for, and which we certainly don't have strong tests for correctness of.
> > It's also very error-prone for callers, because there isn't any good way
> > to cross-check that code using a long-lived pointer to a subsidiary
> > structure is holding a lock that's strong enough to guarantee non-mutation
> > of that structure, or even that relcache.c provides any such guarantee
> > at all.  (If our periodic attempts to reduce lock strength for assorted
> > DDL operations don't scare the pants off you in this connection, you have
> > not thought hard enough about it.)  So I think that even though we've
> > largely gotten away with this approach so far, it's also a half-baked
> > kluge that we should be looking to get rid of, not extend to yet more
> > cases.
>
> Thanks for the explanation.
>
> I understand that simply having a lock and a nonzero refcount on a
> relation doesn't prevent someone else from changing it concurrently.
>
> I get that we want to get rid of the keep_* kludge in the long term, but
> is it wrong to think, for example, that having keep_partdesc today allows
> us today to keep the pointer to rd_partdesc as long as we're holding the
> relation open or refcnt on the whole relation such as with
> PartitionDirectory mechanism?

Ah, we're also trying to fix the memory leak caused by the current
design of PartitionDirectory.  AIUI, the design assumes that the leak
would occur in fairly rare cases, but maybe that's not so?  If partitions are
frequently attached/detached concurrently (which may not be too uncommon
if reduced lock levels encourage it), causing the PartitionDesc of
a given relation to change all the time, then a planning session that's
holding the PartitionDirectory containing that relation would leak as
many PartitionDescs as there were concurrent changes, I guess.

I see that you've proposed to change the PartitionDirectory design to
copy the PartitionDesc as a way of keeping it around instead of holding the
relation open, but having to resort to that would be unfortunate.

Thanks,
Amit




Re: jsonpath

2019-04-15 Thread Andrew Dunstan


On 4/14/19 10:43 PM, Alexander Korotkov wrote:
> On Tue, Apr 9, 2019 at 7:16 PM Tomas Vondra
>  wrote:
>> On Sun, Apr 07, 2019 at 03:03:58AM +0300, Alexander Korotkov wrote:
>>> On Sun, Apr 7, 2019 at 2:37 AM Tom Lane  wrote:
 Alexander Korotkov  writes:
> Thus, contents of unused function makes test fail or pass.  So far, it
> looks like a compiler bug.  Any thoughts?
 Yeah :-(.  The fact that we've not seen a similar failure on any other
 machines points in that direction, too.  Maybe it's some other aspect
 of the machine's toolchain, like flex or bison, but that's not that
 much different from our standpoint.

 There's a lot of stuff I don't especially like about jsonpath_scan,
 for instance I think the "init" arguments to resizeString etc are
 a pretty error-prone and unmaintainable way to do things.  But
 I don't see anything that looks like it'd be a portability hazard
 that would explain this.

 I still have a nagging feeling that there's a wild store somewhere
 in here, but I don't know how to find it based on the limited
 evidence we've got.
>>> Yeah, it might be not because compiler error.  It might depend on
>>> memory layout.  So existence of extra function changes memory layout
>>> and, in turn, causes an error.  I will try to disassemble
>>> jsonpath_scan.o and see whether content of yyparse2 influences
>>> assembly of other functions.
>>>
>> Have you tried other compiler version / different optimization level?
> Error goes away with -O0.  Or I just didn't manage to reproduce that.
> In my observation error depends on memory layout or something.  So, it
> might be that I just didn't manage to reproduce it with -O0 while it
> really still persists.  I didn't try other compilers yet.
>
>> Or running it under valgrind. Not sure how difficult that is on Windows.
> Valgrind isn't installed there.  I'm not sure how to do that, but I
> will probably try.
>
> The interesting thing is that on failure I got following backtrace.
>
> #0  0x7ff94f86a458 in ntdll!RtlRaiseStatus () from
> C:\WINDOWS\SYSTEM32\ntdll.dll
> #1  0x7ff94f87760e in ntdll!memset () from C:\WINDOWS\SYSTEM32\ntdll.dll
> #2  0x7ff94dc42e1a in msvcrt!_setjmpex () from
> C:\WINDOWS\System32\msvcrt.dll
> #3  0x0086a37a in pg_re_throw () at elog.c:1720
> #4  0x0086a166 in errfinish (dummy=) at elog.c:464
> #5  0x007c3d18 in jsonpath_yyerror (result=result@entry=0x0,
> message=message@entry=0xa87d38 <__func__.110220+1512>
> "unrecognized flag of LIKE_REGEX predicate") at jsonpath_scan.l:276
> #6  0x007c5f3d in makeItemLikeRegex (pattern=,
> pattern=, flags=, expr=0x7216760) at
> jsonpath_gram.y:500
> #7  jsonpath_yyparse (result=, result@entry=0x495e818)
> at jsonpath_gram.y:178
>
> So, error happens inside implementation of siglongjmp().  I've checked
> that contents of *PG_exception_stack didn't change since previous
> successfully thrown error.  Probably this implementation of long jump
> saves some part of state outside of sigjmp_buf and that part is
> corrupt.  Any ideas?
>

I have downgraded jacana to gcc 7.3.0, which has resolved the problem.
I'm still a bit worried that we're clobbering the stack somehow though.
I have no idea how to test that.


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: partitioning performance tests after recent patches

2019-04-15 Thread David Rowley
On Mon, 15 Apr 2019 at 19:33, Floris Van Nee  wrote:
> Here's the output of explain/analyze for HEAD. At run-time, technically all 
> partitions could be pruned directly. However, one partition remains in the 
> output of explain/analyze because of other difficulties with removing all of 
> them, if I remember correctly? Still, that partition is never executed.  The 
> only difference I can see is the Limit node on top, as well as apparently 
> another  partition appearing in the analyze output (4096_4096, last 
> partition, remains in the first plan. 4096_1, the first partition, remains 
> the second plan).
>
> -- select_now.sql
> explain(analyze, verbose, buffers on)
> select * from :tbl where a='abc' and updated_at between now() and 
> now()+interval '1d';
>
> Append  (cost=0.16..8949.61 rows=4096 width=112) (actual time=0.000..0.000 
> rows=0 loops=1)
>   Subplans Removed: 4095
>   ->  Index Scan using p4096_4096_a_updated_at_idx on public.p4096_4096  
> (cost=0.16..2.18 rows=1 width=112) (never executed)
> Output: p4096_4096.a, p4096_4096.b, p4096_4096.c, p4096_4096.d, 
> p4096_4096.updated_at
> Index Cond: ((p4096_4096.a = 'abc'::text) AND (p4096_4096.updated_at 
> >= now()) AND (p4096_4096.updated_at <= (now() + '1 day'::interval)))
> Planning Time: 237.603 ms
> Execution Time: 0.475 ms
>
> -- select_now_limit.sql
> explain(analyze, verbose, buffers on)
> select * from :tbl where a='abc' and updated_at between now() and 
> now()+interval '1d'
> order by a, updated_at desc limit 1;
>
> Limit  (cost=645.53..647.56 rows=1 width=112) (actual time=0.002..0.002 
> rows=0 loops=1)
>   Output: p4096_1.a, p4096_1.b, p4096_1.c, p4096_1.d, p4096_1.updated_at
>   ->  Append  (cost=645.53..8949.61 rows=4096 width=112) (actual 
> time=0.000..0.000 rows=0 loops=1)
> Subplans Removed: 4095
> ->  Index Scan using p4096_1_a_updated_at_idx on public.p4096_1  
> (cost=0.57..2.03 rows=1 width=54) (never executed)
>   Output: p4096_1.a, p4096_1.b, p4096_1.c, p4096_1.d, 
> p4096_1.updated_at
>   Index Cond: ((p4096_1.a = 'abc'::text) AND (p4096_1.updated_at 
> >= now()) AND (p4096_1.updated_at <= (now() + '1 day'::interval)))
> Planning Time: 3897.687 ms
> Execution Time: 0.491 ms

I had a look at this and it's due to get_eclass_for_sort_expr() having
a hard time because the EquivalenceClass has so many members.  This
must be done for each partition, so search time is quadratic in the
number of partitions.  We only hit this in the 2nd plan due to
build_index_paths() finding that there are useful pathkeys from
query_pathkeys.  Of course, this does not happen for the first query
since it has no ORDER BY clause.

Tom and I were doing a bit of work in [1] to speed up cases when there
are many EquivalenceClasses by storing a Bitmapset for each RelOptInfo
to mark the indexes of each eq_classes they have members in.  This
does not really help this case since we're slow due to lots of members
rather than lots of classes, but perhaps something similar can be done
to allow members to be found more quickly.  I'm not sure exactly how
that can be done without having something like an array of Lists
indexed by relid in each EquivalenceClasses. That does not sound great
from a memory consumption point of view.  Maybe having
EquivalenceMember in some data structure that we don't have to perform
a linear search on would be a better fix.  We don't currently have any
means to hash or binary search node types, though. Perhaps it's time we
did.
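
For illustration, one rough shape of such a structure could be a dynahash over
members keyed by their em_relids, along these lines (EquivalenceMemberEntry
and the surrounding code are just a sketch, nothing that exists today):

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "nodes/pg_list.h"
#include "utils/hsearch.h"

typedef struct EquivalenceMemberEntry
{
    Bitmapset  *relids;         /* hash key (must be first field) */
    List       *members;        /* EquivalenceMembers with these em_relids */
} EquivalenceMemberEntry;

static uint32
em_relids_hash(const void *key, Size keysize)
{
    return bms_hash_value(*(Bitmapset *const *) key);
}

static int
em_relids_match(const void *key1, const void *key2, Size keysize)
{
    return bms_equal(*(Bitmapset *const *) key1,
                     *(Bitmapset *const *) key2) ? 0 : 1;
}

static HTAB *
build_em_hash(void)
{
    HASHCTL     ctl;

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(Bitmapset *);
    ctl.entrysize = sizeof(EquivalenceMemberEntry);
    ctl.hash = em_relids_hash;
    ctl.match = em_relids_match;
    ctl.hcxt = CurrentMemoryContext;

    return hash_create("EquivalenceMember lookup", 64, &ctl,
                       HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
}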

[1] https://commitfest.postgresql.org/23/1984/

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Tomas Vondra

On Sun, Apr 14, 2019 at 06:39:47PM +0200, Tomas Vondra wrote:

On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote:

On 11/04/2019 17:54, Tom Lane wrote:

Ashwin Agrawal  writes:

Thank you for trying it out. Yes, noticed for certain patterns
pg_lzcompress() actually requires much larger output buffers. Like
for one 86 len source it required 2296 len output buffer. Current
zedstore code doesn’t handle this case and errors out. LZ4 for same
patterns works fine, would highly recommend using LZ4 only, as
anyways speed is very fast as well with it.


You realize of course that *every* compression method has some inputs
that it makes bigger.  If your code assumes that compression always
produces a smaller string, that's a bug in your code, not the
compression algorithm.


Of course. The code is not making that assumption, although clearly
there is a bug there somewhere because it throws that error. It's
early days..

In practice it's easy to weasel out of that, by storing the data
uncompressed, if compression would make it longer. Then you need an
extra flag somewhere to indicate whether it's compressed or not. It
doesn't break the theoretical limit because the actual stored length
is then original length + 1 bit, but it's usually not hard to find a
place for one extra bit.



Don't we already have that flag, though? I see ZSCompressedBtreeItem
has t_flags, and there's ZSBT_COMPRESSED, but maybe it's more
complicated.



After thinking about this a bit more, I think a simple flag may not be
enough. It might be better to have some sort of ID of the compression
algorithm in each item, which would allow switching algorithm for new
data (which may be useful e.g after we add new stuff in core, or when
the initial choice was not the best one).
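
For example, something as simple as reserving a few of the item flag bits for
an algorithm ID might be enough (names invented for illustration, not actual
zedstore code):

#include "postgres.h"

#define ZS_COMPRESSION_MASK     0x0007  /* low three bits of t_flags */
#define ZS_COMPRESSION_NONE     0x0000
#define ZS_COMPRESSION_PGLZ     0x0001
#define ZS_COMPRESSION_LZ4      0x0002

static inline int
zs_item_compression(uint16 t_flags)
{
    return t_flags & ZS_COMPRESSION_MASK;
}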

Of course, those are just wild thoughts at this point, it's not
something the current PoC has to solve right away.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: block-level incremental backup

2019-04-15 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
> developed technology that permits a block-level incremental backup to
> be taken from a PostgreSQL server.  I believe the idea in all of those
> cases is that non-relation files should be backed up in their
> entirety, but for relation files, only those blocks that have been
> changed need to be backed up.

I love the general idea of having additional facilities in core to
support block-level incremental backups.  I've long been unhappy that
any such approach ends up being limited to a subset of the files which
need to be included in the backup, meaning the rest of the files have to
be backed up in their entirety.  I don't think we have to solve for that
as part of this, but I'd like to see a discussion for how to deal with
the other files which are being backed up to avoid needing to just
wholesale copy them.

> I would like to propose that we should
> have a solution for this problem in core, rather than leaving it to
> each individual PostgreSQL company to develop and maintain their own
> solution. 

I'm certainly a fan of improving our in-core backup solutions.

I'm quite concerned that trying to graft this on to pg_basebackup
(which, as you note later, is missing an awful lot of what users expect
from a real backup solution already- retention handling, parallel
capabilities, WAL archive management, and many more... but also is just
not nearly as developed a tool as the external solutions) is going to
make things unnecessarily difficult when what we really want here is
better support from core for block-level incremental backup for the
existing external tools to leverage.

Perhaps there's something here which can be done with pg_basebackup to
have it work with the block-level approach, but I certainly don't see
it as a natural next step for it and really does seem like limiting the
way this is implemented to something that pg_basebackup can easily
digest might make it less useful for the more developed tools.

As an example, I believe all of the other tools mentioned (at least the
open source ones, I'm pretty sure) support parallel backup, so having a
way to get the block-level changes in a parallel fashion would be a pretty
big thing that those tools will want.  pg_basebackup is single-threaded
today and this proposal doesn't seem to contemplate changing that,
implying that a serial block-level protocol would be fine, but that'd be a
pretty awful restriction for the other tools.

> Generally my idea is:
> 
> 1. There should be a way to tell pg_basebackup to request from the
> server only those blocks where LSN >= threshold_value.  There are
> several possible ways for the server to implement this, the simplest
> of which is to just scan all the blocks and send only the ones that
> satisfy that criterion.  That might sound dumb, but it does still save
> network bandwidth, and it works even without any prior setup. It will
> probably be more efficient in many cases to instead scan all the WAL
> generated since that LSN and extract block references from it, but
> that is only possible if the server has all of that WAL available or
> can somehow get it from the archive.  We could also, as several people
> have proposed previously, have some kind of additional relation for
> that stores either a single is-modified bit -- which only helps if the
> reference LSN for the is-modified bit is older than the requested LSN
> but not too much older -- or the highest LSN for each range of K
> blocks, or something like that.  I am at the moment not too concerned
> with the exact strategy we use here. I believe we may want to
> eventually support more than one, since they have different
> trade-offs.

This part of the discussion is another example of how we're limiting
ourselves in this implementation to the "pg_basebackup can work with
this" case, by only considering the options of "scan all the files" or
"use the WAL, if the request is for WAL we have available on the
server."  The other backup solutions mentioned in your initial email,
and others that weren't, have a WAL archive which includes a lot more
WAL than just what the primary currently has.  When I've thought about
how WAL could be used to build a differential or incremental backup, the
question of "do we have all the WAL we need" hasn't ever been a
consideration- because the backup tool manages the WAL archive and has
WAL going back across, most likely, weeks or even months.  Having a tool
which can essentially "compress" WAL would be fantastic and would be
able to be leveraged by all of the different backup solutions.
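
As a rough sketch of what such WAL "compression" could look like (this is not
code from any existing tool, and remember_modified_block() is hypothetical),
it is essentially just walking the WAL with the existing xlogreader API and
collecting block references:

#include "postgres.h"
#include "access/xlogreader.h"

static void
collect_block_refs(XLogReaderState *xlogreader, XLogRecPtr startptr)
{
    XLogRecord *record;
    char       *errormsg;

    while ((record = XLogReadRecord(xlogreader, startptr, &errormsg)) != NULL)
    {
        for (int block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader); block_id++)
        {
            RelFileNode rnode;
            ForkNumber  forknum;
            BlockNumber blkno;

            if (!XLogRecGetBlockTag(xlogreader, block_id, &rnode, &forknum, &blkno))
                continue;

            /* hypothetical: add to the set of blocks the backup must copy */
            remember_modified_block(&rnode, forknum, blkno);
        }

        startptr = InvalidXLogRecPtr;   /* continue from the next record */
    }
}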

> 2. When you use pg_basebackup in this way, each relation file that is
> not sent in its entirety is replaced by a file with a different name.
> For example, instead of base/16384/16417, you might get
> base/16384/partial.16417 or however we decide to name them.  Each su

Re: PANIC: could not flush dirty data: Operation not permitted power8, Redhat Centos

2019-04-15 Thread Justin Pryzby
On Fri, Apr 12, 2019 at 08:04:00PM +0200, reiner peterke wrote:
> We build Postgres on Power and x86 With the latest Postgres 11 release (11.2) 
> we get error on
> power8 ppc64le (Redhat and CentOS).  No error on SUSE on power8
> 
> No error on x86_64 (RH, Centos and  SUSE)

So there's an error on power8 with RH but not SUSE.

What kernel versions are used for each of the successful and not successful ?

Justin




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On Tue, Apr 09, 2019 at 02:29:09PM -0400, Robert Haas wrote:
> >On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera  
> >wrote:
> >>This is not surprising, considering that columnar store is precisely the
> >>reason for starting the work on table AMs.
> >>
> >>We should certainly look into integrating some sort of columnar storage
> >>in mainline.  Not sure which of zedstore or VOPS is the best candidate,
> >>or maybe we'll have some other proposal.  My feeling is that having more
> >>than one is not useful; if there are optimizations to one that can be
> >>borrowed from the other, let's do that instead of duplicating effort.
> >
> >I think that conclusion may be premature.  There seem to be a bunch of
> >different ways of doing columnar storage, so I don't know how we can
> >be sure that one size will fit all, or that the first thing we accept
> >will be the best thing.
> >
> >Of course, we probably do not want to accept a ton of storage manager
> >implementations is core.  I think if people propose implementations
> >that are poor quality, or missing important features, or don't have
> >significantly different use cases from the ones we've already got,
> >it's reasonable to reject those.  But I wouldn't be prepared to say
> >that if we have two significantly different column store that are both
> >awesome code with a complete feature set and significantly disjoint
> >use cases, we should reject the second one just because it is also a
> >column store.  I think that won't get out of control because few
> >people will be able to produce really high-quality implementations.
> >
> >This stuff is hard, which I think is also why we only have 6.5 index
> >AMs in core after many, many years.  And our standards have gone up
> >over the years - not all of those would pass muster if they were
> >proposed today.
> 
> It's not clear to me whether you're arguing for not having any such
> implementation in core, or having multiple ones? I think we should aim
> to have at least one in-core implementation, even if it's not the best
> possible one for all sizes. It's not like our rowstore is the best
> possible implementation for all cases either.
> 
> I think having a colstore in core is important not just for adoption,
> but also for testing and development of the executor / planner bits.

Agreed.

> If we have multiple candidates with sufficient code quality, then we may
> consider including both. I don't think it's very likely to happen in the
> same release, considering how much work it will require. And I have no
> idea if zedstore or VOPS are / will be the only candidates - it's way
> too early at this point.

Definitely, but having as many different indexes as we have is certainly
a good thing and we should be looking to a future where we have multiple
in-core options for row and column-oriented storage.

> FWIW I personally plan to focus primarily on the features that aim to
> be included in core, and that applies to colstores too.

Yeah, same here.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [PATCH v20] GSSAPI encryption support

2019-04-15 Thread Stephen Frost
Greetings,

* Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
> On 2019-04-09 09:32, Peter Eisentraut wrote:
> > On 2019-04-09 04:51, Stephen Frost wrote:
> >>> Running just 002_enc.pl by itself passes the tests!
> >> Great!  I think what I'll do is work to incorporate the two tests back
> >> into one script, to avoid whatever the race condition or other confusion
> >> is happening on macOS here.
> > 
> > That seems reasonable.  It also avoids the large amount of duplicate
> > setup code.
> 
> Another problem is that the two test files cannot be run in parallel
> because they use the same hardcoded data directories.  That would have
> to be replaced by temporary directories.

The tests are really fast enough with one KDC that I don't think it
makes sense to have two independent tests.

> The race condition alluded to above appears to be simply that at the end
> of the test the kdc is shut down by kill -INT but nothing waits for that
> to finish before a new one is started by the next test.  So there is
> potential for all kinds of confusion.

Ah, good to know that's what was happening.

> Putting it all in one test file seems to be the easiest way out.

Please find attached a patch which updates the protocol.sgml docs that
Michael mentioned before, and merges the tests into one test file (while
adding in some additional tests to make sure that the server also agrees
with what our expectations are, using the pg_stat_gssapi view).

I'll push this soon unless there are concerns.  If you get a chance to
test the patch out, that would be great.  It's working happily for me
locally.

Thanks!

Stephen
From e0745810ba1582601f2c71db3b67a8cfc2480546 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Sun, 14 Apr 2019 21:14:42 -0400
Subject: [PATCH] GSSAPI: Improve documentation and tests

The GSSAPI encryption patch neglected to update the protocol
documentation to describe how to set up a GSSAPI encrypted connection
from a client to the server, so fix that by adding the appropriate
documentation to protocol.sgml.

The tests added for encryption support were overly long and couldn't be
run in parallel due to race conditions; this was largely because each
test was setting up its own KDC to perform the tests.  Instead, merge
the authentication tests and the encryption tests into one test file,
where we only create one KDC to run the tests with.  Also, have the
tests check what the server's opinion is of the connection and if it was
GSS authenticated or encrypted using the previously added pg_stat_gssapi
view.

In passing, fix the libpq label for GSSENC-Mode to be consistent with
the "PGGSSENCMODE" environment variable.

Missing protocol documentation pointed out by Michael Paquier.
Issues with the tests pointed out by Tom Lane and Peter Eisentraut.

Refactored tests and added documentation by me.
---
 doc/src/sgml/protocol.sgml| 103 
 src/interfaces/libpq/fe-connect.c |   2 +-
 src/test/kerberos/t/001_auth.pl   |  48 ++--
 src/test/kerberos/t/002_enc.pl| 197 --
 4 files changed, 141 insertions(+), 209 deletions(-)
 delete mode 100644 src/test/kerberos/t/002_enc.pl

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 09893d96b5..204dca7c33 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1479,6 +1479,72 @@ SELCT 1/0;
 of authentication checking.

   
+
+  
+   GSSAPI Session Encryption
+
+   
+If PostgreSQL was built with
+GSSAPI support, frontend/backend communications
+can be encrypted using GSSAPI.  This provides
+communication security in environments where attackers might be
+able to capture the session traffic. For more information on
+encrypting PostgreSQL sessions with
+GSSAPI, see .
+   
+
+   
+To initiate a GSSAPI-encrypted connection, the
+frontend initially sends a GSSENCRequest message rather than a
+StartupMessage.  The server then responds with a single byte
+containing G or N, indicating that it
+is willing or unwilling to perform GSSAPI encryption,
+respectively.  The frontend might close the connection at this point
+if it is dissatisfied with the response.  To continue after
+G, perform a GSSAPI initialization by
+calling gss_init_sec_context() in a loop and sending the result to the
+server, starting with an empty input and then with each result from the
+server, until it returns no output.  When sending the results of
+gss_init_sec_context() to the server, prepend the length of the message as a
+four byte integer in network byte order.  If this is successful, then use
+gss_wrap() to encrypt the usual StartupMessage and all subsequent data,
+prepending the length of the result from gss_wrap() as a four byte integer
+in network byte order to the actual encrypted payload.  Note that the server
+will only accept encrypted packets from the client which are less than 16KB;
+gss

Re: log bind parameter values on error

2019-04-15 Thread Alexey Bashtanov

On 05/04/2019 12:23, Peter Eisentraut wrote:

On 2019-03-29 15:55, Alexey Bashtanov wrote:

ERROR:  value "62812" is out of range for type smallint
STATEMENT:  SELECT abalance FROM pgbench_accounts WHERE aid = $1;

(In this case the error message contains the parameter value, so it's
not a very practical case, but it should work, it seems.)

I guess this error occurred /while/ binding, so the parameters probably
weren't yet all bound by the time of error reporting.
That's why the error message came without parameters.

I see.  But I think that could be fixed.  Change exec_bind_message() to
loop over the parameters twice: once to save them away, once to actually
process them.  I think the case of a faulty input value is probably very
common, so it would be confusing if that didn't work.

I understand this may sound lazy of me; however, let me take the risk
and try to explain why I think logging them here in the same fashion
would be inconsistent.

The original feature is intended to log textual representations of
the bind parameter values, whereas the problem suggested to be solved
together with it is verbose logging of the part of the bind message that
failed to get converted to a Datum.

These are equivalent only under the following conditions:
1) the rest of the message was correctly formatted, so we can extract
something for each of the parameter values
2) the values have been passed in text mode
(all of them, not only the one that failed to process)
3) the values have been passed in the server encoding

I think it would be a bit inconsistent to log the parts of the message
only when we are lucky enough to satisfy the three conditions above.

If we are to log bind parameters message parsing and processing problems
more verbosely, what do you think of rather wrapping calls to
OidInputFunctionCall, OidReceiveFunctionCall and pg_client_to_server
into PG_TRY blocks and logging their arguments individually?
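
Roughly something like the following (variable names loosely follow
exec_bind_message(); the placement and message wording are only an
illustration, not a finished patch):

#include "postgres.h"
#include "fmgr.h"

static Datum
input_param_safely(Oid typinput, Oid typioparam, char *pstring, int paramno)
{
    Datum       pval = (Datum) 0;

    PG_TRY();
    {
        pval = OidInputFunctionCall(typinput, pstring, typioparam, -1);
    }
    PG_CATCH();
    {
        /*
         * Conversion failed: log the raw client-supplied text for this
         * parameter before re-throwing.  Note that pstring may not even be
         * in the server encoding, which is part of the problem above.
         */
        ereport(LOG,
                (errmsg("failed to convert bind parameter $%d", paramno + 1),
                 errdetail("Raw value: \"%s\"", pstring)));
        PG_RE_THROW();
    }
    PG_END_TRY();

    return pval;
}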

In case it doesn't work for any reason, I have an alternative idea for
how to log half-processed parameters.
Instead of hasTextValues, store the number of parameters whose
textual representations were successfully saved.
This way, in the case of text mode, we will be able to
save the original value and increment the counter
before the call to OidInputFunctionCall. When logging without values
for some parameters, we can print an ellipsis in the log to indicate that
some of them are missing.


I think this patch needs some tests.  Manually testing it is cumbersome,
and as we are seeing now, it is not quite clear which cases it is
supposed to cover.  There are also additional cases for binary
parameters, and there are additions to the CSV output format.  We need
tests for all that so the behavior explains itself and doesn't have to
be rediscovered every time someone wants to look at this again.



I have added a section into src/test/examples/testlibpq3.c,
please see attached.

As far as I could see, these tests are never run by "make check" or
"make installcheck", hence the README change. Please correct me if I'm wrong.

I also failed to find an automatic way to test what actually gets logged
to the server log, at least not in the installcheck case.
I would appreciate it if you had any suggestions about it.

Best regards,
  Alex
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 75f9471..437dc44 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6418,6 +6418,23 @@ log_line_prefix = '%m [%p] %q%u@%d/%a '
   
  
 
+ 
+  log_parameters_on_error (boolean)
+  
+   log_parameters_on_error configuration parameter
+  
+  
+  
+   
+Controls whether the statement is logged with bind parameter values. 
+It adds some overhead, as postgres will cache textual
+representations of parameter values in memory for all statements,
+even if they eventually do not get logged. The default is
+off. Only superusers can change this setting.
+   
+  
+ 
+
  
   log_statement (enum)
   
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index f5d5613..5e5ccf1 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -45,6 +45,7 @@ makeParamList(int numParams)
 	retval->parserSetup = NULL;
 	retval->parserSetupArg = NULL;
 	retval->numParams = numParams;
+	retval->hasTextValues = false;
 
 	return retval;
 }
@@ -58,6 +59,10 @@ makeParamList(int numParams)
  * set of parameter values.  If dynamic parameter hooks are present, we
  * intentionally do not copy them into the result.  Rather, we forcibly
  * instantiate all available parameter values and copy the datum values.
+ *
+ * Since this function is never used for copying parameters into a top portal,
+ * we don't copy textual representations.  Nor do we set the hasTextValues
+ * flag, so no one will access them.
  */
 ParamListInfo
 copyParamList(ParamListInfo from)
diff --git a/src/backend/tcop/postgres.c b/src/back

Accidental setting of XLogReaderState.private_data ?

2019-04-15 Thread Antonin Houska
StartupDecodingContext() initializes ctx->reader->private_data with ctx, and
it even does so twice. I couldn't find a place in the code where the
(LogicalDecodingContext *) pointer is retrieved from the reader, and a simple
test of logical replication works if the patch below is applied. Thus I assume
that assignment is a thinko, isn't it?

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 424fe86a1b..fbca486b59 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -172,14 +172,12 @@ StartupDecodingContext(List *output_plugin_options,
 
 	ctx->slot = slot;
 
-	ctx->reader = XLogReaderAllocate(wal_segment_size, read_page, ctx);
+	ctx->reader = XLogReaderAllocate(wal_segment_size, read_page, NULL);
 	if (!ctx->reader)
 		ereport(ERROR,
 (errcode(ERRCODE_OUT_OF_MEMORY),
  errmsg("out of memory")));
 
-	ctx->reader->private_data = ctx;
-
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,


Re: PANIC: could not flush dirty data: Operation not permitted power8, Redhat Centos

2019-04-15 Thread zedaardv



sent by smoke signals at great danger to my self. 

> On 12 Apr 2019, at 23:16, Thomas Munro  wrote:
> 
>> On Sat, Apr 13, 2019 at 7:23 AM Andres Freund  wrote:
>>> On 2019-04-12 20:04:00 +0200, reiner peterke wrote:
>>> We build Postgres on Power and x86 With the latest Postgres 11 release 
>>> (11.2) we get error on
>>> power8 ppc64le (Redhat and CentOS).  No error on SUSE on power8
> 
> Huh,  I wonder what is different.  I don't see this on EDB's CentOS
> 7.1 POWER8 system with an XFS filesystem.  I ran it under strace -f
> and saw this:
> 
> [pid 51614] sync_file_range2(0x19, 0x2, 0x8000, 0x2000, 0x2, 0x8) = 0
> 
>>> 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  listening on IPv4 address 
>>> "0.0.0.0", port 5432
>>> 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  listening on IPv6 address 
>>> "::", port 5432
>>> 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  listening on Unix socket 
>>> "/tmp/.s.PGSQL.5432"
>>> 2019-04-09 12:30:10 UTC   pid:204 xid:0 ip: LOG:  database system was shut 
>>> down at 2019-04-09 12:27:09 UTC
>>> 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  database system is ready 
>>> to accept connections
>>> 2019-04-09 12:31:46 UTC   pid:203 xid:0 ip: LOG:  received SIGHUP, 
>>> reloading configuration files
>>> 2019-04-09 12:35:10 UTC   pid:205 xid:0 ip: PANIC:  could not flush dirty 
>>> data: Operation not permitted
>>> 2019-04-09 12:35:10 UTC   pid:203 xid:0 ip: LOG:  checkpointer process (PID 
>>> 205) was terminated by signal 6: Aborted
>> 
>> Any chance you can strace this? Because I don't understand how you'd get
>> a permission error here.
> 
> Me neither.  I hacked my tree so that it would use the msync() version
> instead of the sync_file_range() version but that worked too.
> 
> -- 
> Thomas Munro
> https://enterprisedb.com

I forgot to mention that this is happening in a docker container.
I want to test it on a VM to see if it is container related. I am sick at the
moment, so I'm unable to do the test right now.

Reiner



Re: partitioning performance tests after recent patches

2019-04-15 Thread Amit Langote
Hi,

Thanks a lot for very exhaustive testing.

David already replied to some points, but let me comment on a couple of
points.

Please be advised that these may not be the numbers (or scalability
pattern of these numbers) you'll see when PG 12 is actually released,
because we may end up changing something that makes performance suffer a
bit.  In particular, we are contemplating some changes around the safety
of planner's handling of cached partitioning metadata (in light of reduced
lock levels for adding/removing partitions) that might reduce the TPS
figure, the impact of which would worsen as the number of partitions
increases.  Although, nothing is final yet; if interested, you can follow
that discussion at [1].

On 2019/04/15 4:19, Floris Van Nee wrote:
> The test cases were (see benchmark.sql for the SQL commands for setup and 
> test cases):
> 1. Insert batches of 1000 rows per transaction
> 2. Simple SELECT query pruning on a static timestamp
> 3. The same SELECT query with static timestamp but with an added 'ORDER BY a, 
> updated_at DESC LIMIT 1', which matches the index defined on the table
> 4. The same simple SELECT query, but now it's wrapped inside an inlineable 
> SQL function, called with a static timestamp
> 5. The same simple SELECT query, but now it's wrapped inside a non-inlineable 
> SQL function, called with a static timestamp
> 6. The same simple SELECT query, but now it's wrapped inside a plpgsql 
> function, called with a static timestamp
> 7. Simple SELECT query pruning on a timestamp now()
> 8. The same SELECT query with dynamic timestamp but with an added 'ORDER BY 
> a, updated_at DESC LIMIT 1', which matches the index defined on the table
> 9. The same simple SELECT query, but now it's wrapped inside an inlineable 
> SQL function, called with a dynamic timestamp
> 10. The same simple SELECT query, but now it's wrapped inside a 
> non-inlineable SQL function, called with a dynamic timestamp
> 11. The same simple SELECT query, but now it's wrapped inside a plpgsql 
> function, called with a dynamic timestamp
> 12. The same query as 2) but then in an inlineable function
> 13. The same query as 3) but then in an inlineable function
> 14. A SELECT with nested loop (10 iterations) with opportunities for run-time 
> pruning - some rows from a table are selected and the timestamp from rows in 
> that table is used to join on another partitioned table
> 
> The full results can be found in the attached file (results.txt). I also 
> produced graphs of the results, which can be found on TimescaleDb's Github 
> page [1]. Please take a look at these figures for an easy overview of the 
> results. In general performance of HEAD looks really good.
> 
> While looking at these results, there were a few questions that I couldn't 
> answer.
> 1) It seems like the queries inside plpgsql functions (case 6 and 11) perform 
> relatively well in PG11 compared to a non-inlineable SQL function (case 5 and 
> 10), when a table consists of many partitions. As far as I know, both plpgsql 
> and non-inlineable SQL functions are executed with generic plans. What can 
> explain this difference? Are non-inlineable SQL function plans not reused 
> between transactions, while plpgsql plans are?
> 2) Is running non-inlined SQL functions with a generic plan even the best 
> option all the time? Wouldn't it be better to adopt a similar approach to 
> what plpgsql does, where it tries to test if using a generic plan is 
> beneficial? The non-inlineable SQL functions suffer a big performance hit for 
> a large number of partitions, because they cannot rely on static 
> planning-time pruning.

I'd never noticed this before.  It indeed seems to be the case that SQL
functions and plpgsql functions are handled using completely different
code paths, and only with the latter is it possible to use static
planning-time pruning.

> There was one other thing I noticed, and I believe it was raised by Tom in a 
> separate thread as well: the setup code itself is really slow. Creating of 
> partitions is taking a long time (it's taking several minutes to create the 
> 4096 partition table).

Yeah that's rather bad.  Thinking of doing something about that for PG 13.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/27380.1555270166%40sss.pgh.pa.us





Re: Issue in ExecCleanupTupleRouting()

2019-04-15 Thread Etsuro Fujita

(2019/04/15 13:32), David Rowley wrote:

On Fri, 12 Apr 2019 at 01:06, Etsuro Fujita  wrote:

While working on an update-tuple-routing bug in postgres_fdw [1], I
noticed this change to ExecCleanupTupleRouting() made by commit
3f2393edefa5ef2b6970a5a2fa2c7e9c55cc10cf:


Added to open items list.


Thanks!  I moved this item to resolved ones.

Best regards,
Etsuro Fujita





Re: Issue in ExecCleanupTupleRouting()

2019-04-15 Thread Etsuro Fujita

(2019/04/12 10:48), Amit Langote wrote:

On 2019/04/11 22:28, David Rowley wrote:

On Fri, 12 Apr 2019 at 01:06, Etsuro Fujita  wrote:

+   /*
+* Check if this result rel is one belonging to the node's subplans,
+* if so, let ExecEndPlan() clean it up.
+*/
+   if (htab)
+   {
+   Oid partoid;
+   bool    found;
+
+   partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+   (void) hash_search(htab, &partoid, HASH_FIND, &found);
+   if (found)
+   continue;
+   }

 /* Allow any FDWs to shut down if they've been exercised */
-   if (resultRelInfo->ri_PartitionReadyForRouting &&
-   resultRelInfo->ri_FdwRoutine != NULL &&
+   if (resultRelInfo->ri_FdwRoutine != NULL &&
 resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)

resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
resultRelInfo);

This skips subplan resultrels before calling EndForeignInsert() if they
are foreign tables, which I think causes an issue: the FDWs would fail
to release resources for their foreign insert operations, because
ExecEndPlan() and ExecEndModifyTable() don't do anything to allow them
to do that.  So I think we should skip subplan resultrels after
EndForeignInsert().  Attached is a small patch for that.


Oops. I had for some reason been under the impression that it was
nodeModifyTable.c, or whatever the calling code happened to be that
handles these ones, but this is not the case as we call
ExecInitRoutingInfo() from ExecFindPartition() which makes the call to
BeginForeignInsert. If that part is handled by the tuple routing code,
then the subsequent cleanup should be too, in which case your patch
looks fine.


That sounds right.


Pushed.  Thanks for reviewing, David and Amit!

Best regards,
Etsuro Fujita





Re: Attempt to consolidate reading of XLOG page

2019-04-15 Thread Antonin Houska
Alvaro Herrera  wrote:

> I agree that xlog reading is pretty messy.
> 
> I think ifdef'ing the way XLogRead reports errors is not great.  Maybe
> we can pass a function pointer that is to be called in case of errors?

I'll try a bit harder to evaluate the existing approaches to report the same
error on both backend and frontend side.

> Not sure about the walsize; maybe it can be a member in XLogReadPos, and
> given to XLogReadInitPos()?  (Maybe rename XLogReadPos as
> XLogReadContext or something like that, indicating it's not just the
> read position.)

As pointed out by others, XLogReadPos is not necessary. So if XLogRead()
receives XLogReaderState instead, it can get the segment size from there.

Thanks.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Mailing list not working

2019-04-15 Thread Magnus Hagander
On Mon, Apr 15, 2019 at 11:02 AM Ramanarayana  wrote:

> Hi,
>
> I also got same error couple of hours ago. Now it is working fine.
>


There were network issues in one of the Cisco firewalls for one of our
datacenters. It's the only "Enterprise technology" we use for pginfra I
think, and it does this every now and then. It was rebooted early this
morning and then services recovered.


It would be great to have alerting tools like prometheus to notify users
> when the server is down
>
>
There were hundreds of notifications. But PostgreSQL does not have 24/7
staff (since it's all volunteers), so it takes some time to get things
fixed. The issues started just past midnight, and were fixed at about
7:30AM.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Mailing list not working

2019-04-15 Thread Ramanarayana
Hi,

I also got same error couple of hours ago. Now it is working fine.

It would be great to have alerting tools like prometheus to notify users
when the server is down

Regards,
Ram.


Re: Attempt to consolidate reading of XLOG page

2019-04-15 Thread Antonin Houska
Michael Paquier  wrote:

> On Fri, Apr 12, 2019 at 12:27:11PM +0900, Kyotaro HORIGUCHI wrote:
> > This patch changes XLogRead to allow using other than
> > BasicOpenFile to open a segment, and use XLogReaderState.private
> > to hold a new struct XLogReadPos for the segment reader. The new
> > struct is heavily duplicated with XLogReaderState and I'm not
> > sure the rason why the XLogReadPos is needed.
> > Any other opinions, or thoughts?
> 
> The focus is on the stability of v12 for the next couple of months, so
> please make sure to register it to the next CF if you want feedback.

OK, will do. (A link to the mailing list thread is needed for the CF entry,
so I had to post something anyway :-) Since I don't introduce any kind of
"cool new feature" here, I believe it did not disturb much.)

> Here are some basic thoughts after a very quick lookup.
> ...

Thanks.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Attempt to consolidate reading of XLOG page

2019-04-15 Thread Antonin Houska
Kyotaro HORIGUCHI  wrote:

> Hello.
> 
> At Thu, 11 Apr 2019 18:05:42 +0200, Antonin Houska  wrote 
> in <14984.1554998...@spoje.net>
> > While working on the instance encryption I found it annoying to apply
> > decyption of XLOG page to three different functions. Attached is a patch 
> > that
> > tries to merge them all into one function, XLogRead(). The existing
> > implementations differ in the way new segment is opened. So I added a 
> > pointer
> > to callback function as a new argument. This callback handles the specific
> > ways to determine segment file name and to open the file.
> > 
> > I can split the patch into multiple diffs to make detailed review easier, 
> > but
> > first I'd like to hear if anything is seriously wrong about this
> > design. Thanks.
> 
> This patch changes XLogRead to allow using other than
> BasicOpenFile to open a segment,

Good point. The acceptable ways to open a file on both the frontend and
backend sides need to be documented.

> and use XLogReaderState.private to hold a new struct XLogReadPos for the
> segment reader. The new struct is heavily duplicated with XLogReaderState
> and I'm not sure the rason why the XLogReadPos is needed.

ok, I missed the fact that XLogReaderState already contains most of the info
that I put into XLogReadPos. So XLogReadPos is not needed.

> Anyway, in the first place, such two distinct-but-highly-related
> callbacks makes things too complex. Heikki said that the log
> reader stuff is better not using callbacks and I agree to that. I
> did that once for my own but the code is no longer
> applicable. But it seems to be the time to do that.
> 
> https://www.postgresql.org/message-id/47215279-228d-f30d-35d1-16af695e5...@iki.fi

Thanks for the link. My understanding is that the drawback of the
XLogReaderState.read_page callback is that it cannot easily switch between
XLOG sources in order to handle failure because the caller of XLogReadRecord()
usually controls those sources too.

However, the callback I pass to XLogRead() is different: if it fails, it
simply raises ERROR. Since this indicates a rather low-level problem, there's
no reason for this callback to try to recover from the failure.
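
To illustrate the shape I have in mind (the names here are only illustrative,
not necessarily what the final patch will use):

#include "postgres.h"
#include "access/xlogdefs.h"

/* callback that opens the requested WAL segment and returns a descriptor */
typedef int (*XLogOpenSegment) (XLogSegNo segno, TimeLineID tli, int segsize);

extern void XLogRead(char *buf, XLogRecPtr startptr, Size count,
                     TimeLineID tli, int wal_segment_size,
                     XLogOpenSegment openSegment);

The backend and frontend callers would then differ only in the callback they
pass.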

> That would seems like follows. That refactoring separates log
> reader and page reader.
> 
> 
> for(;;)
> {
> rc = XLogReadRecord(reader, startptr, errormsg);
> 
> if (rc == XLREAD_SUCCESS)
> {
> /* great, got record */
> }
> if (rc == XLREAD_INVALID_PAGE || rc == XLREAD_INVALID_RECORD)
> {
> elog(ERROR, "invalid record");
> }
> if (rc == XLREAD_NEED_DATA)
> {
> /*
>  * Read a page from disk, and place it into reader->readBuf
>  */
> XLogPageRead(reader->readPagePtr, /* page to read */
>  reader->reqLen   /* # of bytes to read */ );
> /*
>  * Now that we have read the data that XLogReadRecord()
>  * requested, call it again.
>  */
> continue;
> }
> }
> 
> DecodingContextFindStartpoint(ctx)
>   do
>   {
>  read_local_xlog_page();
>  rc = XLogReadRecord (reader);
>   while (rc == XLREAD_NEED_DATA);
> 
> I'm going to do that again.
> 
> 
> Any other opinions, or thoughts?

I don't see an overlap between what you do and what I do. It seems that even
if you change the XLOG reader API, you don't care what read_local_xlog_page()
does internally. What I try to fix is XLogRead(), and that is actually a
subroutine of read_local_xlog_page().
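
Under this scheme, read_local_xlog_page() would keep its existing timeline and
flush-position handling and only delegate the final byte-copying step to the
shared routine, roughly like this (the XLogRead() signature shown here is just
a sketch, and open_local_segment is the hypothetical callback from above):

/* at the end of read_local_xlog_page(), after the existing checks */
XLogRead(cur_page,             /* destination buffer */
         targetPagePtr,        /* starting LSN */
         XLOG_BLCKSZ,          /* number of bytes to read */
         *pageTLI,             /* timeline to read from */
         open_local_segment);  /* how to open each segment file */
return XLOG_BLCKSZ;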

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: hyrax vs. RelationBuildPartitionDesc

2019-04-15 Thread Amit Langote
On 2019/04/15 2:38, Tom Lane wrote:
> Amit Langote  writes:
>> On 2019/04/10 15:42, Michael Paquier wrote:
>>> It seems to me that Tom's argument to push in the way relcache
>>> information is handled by copying its contents sounds sensible to me.
>>> That's not perfect, but it is consistent with what exists (note
>>> PublicationActions for a rather-still-not-much-complex example of
>>> structure which gets copied from the relcache).
> 
>> I gave my vote for direct access of unchangeable relcache substructures
>> (TupleDesc, PartitionDesc, etc.), because they're accessed during the
>> planning of every user query and fairly expensive to copy compared to list
>> of indexes or PublicationActions that you're citing.  To further my point
>> a bit, I wonder why PublicationActions needs to be copied out of relcache
>> at all?  Looking at its usage in execReplication.c, it seems we can do
>> fine without copying, because we are holding both a lock and a relcache
>> reference on the replication target relation during the entirety of
>> apply_handle_insert(), which means that information can't change under us,
>> neither logically nor physically.
> 
> So the point here is that that reasoning is faulty.  You *cannot* assume,
> no matter how strong a lock or how many pins you hold, that a relcache
> entry will not get rebuilt underneath you.  Cache flushes happen
> regardless.  And unless relcache.c takes special measures to prevent it,
> a rebuild will result in moving subsidiary data structures and thereby
> breaking any pointers you may have pointing into those data structures.
> 
> For certain subsidiary structures such as the relation tupdesc,
> we do take such special measures: that's what the "keep_xxx" dance in
> RelationClearRelation is.  However, that's expensive, both in cycles
> and maintenance effort: it requires having code that can decide equality
> of the subsidiary data structures, which we might well have no other use
> for, and which we certainly don't have strong tests for correctness of.
> It's also very error-prone for callers, because there isn't any good way
> to cross-check that code using a long-lived pointer to a subsidiary
> structure is holding a lock that's strong enough to guarantee non-mutation
> of that structure, or even that relcache.c provides any such guarantee
> at all.  (If our periodic attempts to reduce lock strength for assorted
> DDL operations don't scare the pants off you in this connection, you have
> not thought hard enough about it.)  So I think that even though we've
> largely gotten away with this approach so far, it's also a half-baked
> kluge that we should be looking to get rid of, not extend to yet more
> cases.

Thanks for the explanation.

I understand that simply having a lock and a nonzero refcount on a
relation doesn't prevent its relcache entry from being rebuilt concurrently,
which can move the subsidiary data structures.

I get that we want to get rid of the keep_* kludge in the long term, but
is it wrong to think, for example, that having keep_partdesc today allows
us to keep the pointer to rd_partdesc for as long as we're holding the
relation open or holding a refcount on the whole relation, such as with the
PartitionDirectory mechanism?
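
For reference, my reading of how the v12 PartitionDirectory is meant to be used
(a sketch of the existing calls, not new code):

PartitionDirectory pdir;
PartitionDesc      pdesc;

pdir = CreatePartitionDirectory(CurrentMemoryContext);
pdesc = PartitionDirectoryLookup(pdir, rel);

/*
 * pdesc is now expected to stay valid until DestroyPartitionDirectory(pdir):
 * the directory takes a reference on the relation, and relcache.c keeps the
 * descriptor's memory around as long as such references remain.
 */

DestroyPartitionDirectory(pdir);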

> To my mind there are only two trustworthy solutions to the problem of
> wanting time-extended usage of a relcache subsidiary data structure: one
> is to copy it, and the other is to reference-count it.  I think that going
> over to a reference-count-based approach for many of these structures
> might well be something we should do in future, maybe even the very near
> future.  In the mean time, though, I'm not really satisfied with inserting
> half-baked kluges, especially not ones that are different from our other
> half-baked kluges for similar purposes.  I think that's a path to creating
> hard-to-reproduce bugs.

+1 to reference-count-based approach.

I've occasionally wondered why there is both the keep_tupdesc kludge and
refcounting for table TupleDescs.  I guess it's because *only* the
TupleTableSlot mechanism in the executor uses TupleDesc pinning (that is,
refcounting), while the rest of the sites depend on the shaky guarantee
provided by keep_tupdesc.
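
For example, the pinned usage looks roughly like this (illustrative, not a
quote of the actual slot code):

TupleDesc   tupdesc = RelationGetDescr(rel);

/*
 * Bump the descriptor's reference count (registered with the current
 * resource owner), so a concurrent relcache rebuild cannot free it out
 * from under us.
 */
PinTupleDesc(tupdesc);

/* ... use tupdesc for as long as needed ... */

ReleaseTupleDesc(tupdesc);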

Thanks,
Amit





Re: cache lookup failed for collation 0

2019-04-15 Thread Peter Eisentraut
On 2019-04-15 07:44, Jeevan Chalke wrote:
>> The root cause is that the same code, match_pattern_prefix(), is being
>> used for both text and bytea, but bytea does not use collations, so having
>> collation 0 there is expected, and we shouldn't call
>> get_collation_isdeterministic() in that case.
>>
>> Proposed patch attached.
>
> Looks fine to me.

Committed, thanks.
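
For the archives, the guard described above amounts to roughly the following
(a sketch, not the committed diff; the variable name is illustrative):

/*
 * bytea comparisons carry no collation, so only consult the collation
 * when one is actually attached; nondeterministic collations disable
 * the prefix optimization entirely.
 */
if (OidIsValid(expr_coll) && !get_collation_isdeterministic(expr_coll))
    return NIL;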

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: partitioning performance tests after recent patches

2019-04-15 Thread Floris Van Nee
Hi David,

Thanks for your reply. I really appreciate your work on run-time pruning!
Here's the EXPLAIN ANALYZE output for HEAD. At run time, technically all
partitions could be pruned directly. However, one partition remains in the
EXPLAIN ANALYZE output because of other difficulties with removing all of
them, if I remember correctly; still, that partition is never executed. The
only differences I can see are the Limit node on top and a different partition
appearing in the output (p4096_4096, the last partition, remains in the first
plan; p4096_1, the first partition, remains in the second plan).

-- select_now.sql
explain(analyze, verbose, buffers on)
select * from :tbl where a='abc' and updated_at between now() and now()+interval '1d';

Append  (cost=0.16..8949.61 rows=4096 width=112) (actual time=0.000..0.000 rows=0 loops=1)
  Subplans Removed: 4095
  ->  Index Scan using p4096_4096_a_updated_at_idx on public.p4096_4096  (cost=0.16..2.18 rows=1 width=112) (never executed)
        Output: p4096_4096.a, p4096_4096.b, p4096_4096.c, p4096_4096.d, p4096_4096.updated_at
        Index Cond: ((p4096_4096.a = 'abc'::text) AND (p4096_4096.updated_at >= now()) AND (p4096_4096.updated_at <= (now() + '1 day'::interval)))
Planning Time: 237.603 ms
Execution Time: 0.475 ms

-- select_now_limit.sql
explain(analyze, verbose, buffers on)
select * from :tbl where a='abc' and updated_at between now() and now()+interval '1d'
order by a, updated_at desc limit 1;

Limit  (cost=645.53..647.56 rows=1 width=112) (actual time=0.002..0.002 rows=0 loops=1)
  Output: p4096_1.a, p4096_1.b, p4096_1.c, p4096_1.d, p4096_1.updated_at
  ->  Append  (cost=645.53..8949.61 rows=4096 width=112) (actual time=0.000..0.000 rows=0 loops=1)
        Subplans Removed: 4095
        ->  Index Scan using p4096_1_a_updated_at_idx on public.p4096_1  (cost=0.57..2.03 rows=1 width=54) (never executed)
              Output: p4096_1.a, p4096_1.b, p4096_1.c, p4096_1.d, p4096_1.updated_at
              Index Cond: ((p4096_1.a = 'abc'::text) AND (p4096_1.updated_at >= now()) AND (p4096_1.updated_at <= (now() + '1 day'::interval)))
Planning Time: 3897.687 ms
Execution Time: 0.491 ms

Regarding the nested loops - thanks for your explanation. I can see this is 
more complicated than I initially thought. It may be doable to determine if 
your set of pruned partitions is still valid, but it's more difficult to 
determine if, on top of that, extra partitions must be included due to widening 
of the range. 

-Floris


From: David Rowley 
Sent: Monday, April 15, 2019 1:25 AM
To: Floris Van Nee
Cc: Pg Hackers
Subject: Re: partitioning performance tests after recent patches [External]

On Mon, 15 Apr 2019 at 07:19, Floris Van Nee  wrote:
> 3) What could be causing the big performance difference between case 7 
> (simple SELECT) and 8 (simple SELECT with ORDER BY  LIMIT 1)? For 4096 
> partitions, TPS of 7) is around 5, while adding the ORDER BY  LIMIT 1 
> makes TPS drop well below 1. In theory, run-time pruning of the right chunk 
> should take exactly the same amount of time in both cases, because both are 
> pruning timestamp now() on the same number of partitions. The resulting plans 
> are also identical with the exception of the top LIMIT-node (in PG11 they 
> differ slightly as a MergeAppend is chosen for the ORDER BY instead of an 
> Append, in HEAD with ordered append this is not necessary anymore). Am I 
> missing something here?

With the information provided, I don't really see any reason why the
ORDER BY LIMIT would slow it down if the plan is the same apart from
the LIMIT node. Please share the EXPLAIN ANALYZE output of each.

> 4) A more general question about run-time pruning in nested loops, like the 
> one for case 14. I believe I read in one of the previous threads that 
> run-time pruning only reoccurs if it determines that the value that 
> determines which partitions must be excluded has changed in between 
> iterations. How is this defined? Eg. let's say partitions are 1-day wide and 
> the first iteration of the loop filters on the partitioned table for 
> timestamp between 14-04-2019 12:00 and 14-04-2019 20:00 (dynamically 
> determined). Then the second iteration comes along and now filters on values 
> between 14-04-2019 12:00 and 14-04-2019 19:00. The partition that should be 
> scanned hasn't changed, because both timestamps fall into the same partition. 
> Is the full process of run-time pruning applied again, or is there some kind 
> of shortcut that first checks if the previous pruning result is still valid 
> even if the value has changed slightly? If not, would this be a possible 
> optimization, as I think it's a case that occurs very often? I don't know the 
> run-time pruning code very well though, so it may just be a crazy idea that 
> can't be practically achieved.

Currently, there's no shortcut. The executor knows which parameters partition
pruning depends on, and it performs pruning again whenever any of those
parameter values change between rescans, even if the new values would end up
selecting the same partitions as before.
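
From memory, the rescan handling is roughly the following (paraphrased from
nodeAppend.c, not an exact quote): the previously selected set of subplans is
simply thrown away whenever a parameter that pruning depends on has changed,
which forces a full re-prune on the next ExecAppend() call.

/* in ExecReScanAppend(), paraphrased */
if (node->as_prune_state &&
    bms_overlap(node->ps.chgParam, node->as_prune_state->execparamids))
{
    bms_free(node->as_valid_subplans);
    node->as_valid_subplans = NULL;     /* re-prune on next ExecAppend() */
}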

Re: Mailing list not working

2019-04-15 Thread Amit Langote
On 2019/04/15 15:37, Tatsuo Ishii wrote:
>> Hi,
>>
>> I am not able to access the mailing list archive. Is the mailing list
>> server down or something?
> 
> What kind of problem do you have?
> 
> I can see your posting in the archive.
> https://www.postgresql.org/message-id/CAKm4Xs5%2BD%2BgB4yCQtsfKmdTMqvido1k5Qz7iwPAQj8CM-ptiXw%40mail.gmail.com

There may have been some glitch for a limited time a couple of hours ago.  I
too was facing issues accessing the ML archive in the browser, such as
getting "Guru Meditation" / "Error 503 Service Unavailable" error pages, and
errors like "Timeout when talking to search server" when doing a keyword
search.

Thanks,
Amit