Fixes for missing schema qualifications

2018-03-08 Thread Michael Paquier
Hi all,

In light of CVE-2018-1058, users' applications need to be careful about
the use of schema-unqualified queries.  A look at the upstream code
shows four areas which are missing such handling:
- psql has one problem in get_create_object_cmd, which twice fails to
qualify array_remove().
- isolationtester is missing one qualification for a call to pg_backend_pid().
- information_schema.sql has one problem as well: the function
_pg_interval_type does not qualify upper().  Note that there is no need
to care about view bodies because those use OID references, so only the
function body needs to be taken care of.
- worker_spi scans pg_namespace and uses count() without schema
qualification.

Attached is a patch which fixes all four of them, and which should be
back-patched.  For information_schema.sql, users can always replace the
body of the function by redefining it (using SET search_path in CREATE
FUNCTION would work as well, however that is more costly than a simple
qualification).
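
For anyone wondering how an unqualified call can be captured even though
pg_catalog is implicitly first in the search path: function resolution
prefers the candidate whose argument types match best across all visible
schemas, so a user with the default CREATE right on "public" can still
hijack such calls.  A rough sketch, with made-up object names modeled on
the public advisory:

    -- As an unprivileged user holding CREATE on schema "public":
    CREATE FUNCTION public.upper(varchar) RETURNS text AS $$
    BEGIN
        -- Arbitrary statements would run here with the caller's privileges.
        RETURN pg_catalog.upper($1::text);
    END;
    $$ LANGUAGE plpgsql;

    -- An unqualified call on a varchar value made later by a privileged
    -- role resolves to public.upper(varchar), a better type match than
    -- pg_catalog.upper(text):
    SELECT upper(v) FROM (VALUES ('x'::varchar)) AS t(v);

    -- The schema-qualified form always resolves to the catalog function:
    SELECT pg_catalog.upper(v) FROM (VALUES ('x'::varchar)) AS t(v);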

Thoughts?
--
Michael
From 44fafa6076a2108043e67ac76777c1c20664ea78 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 9 Mar 2018 16:43:06 +0900
Subject: [PATCH] Fix missing schema qualifications in code

Per CVE-2018-1058, not using proper schema qualifications can allow an
attacker who has an account on the server to execute arbitrary code as a
superuser even without having such rights.  After scanning the whole
Postgres code, I have found four places that need to be
addressed:
- isolationtester is missing one qualification when calling
pg_backend_pid().
- psql is missing two qualifications for array_remove().
- information_schema.sql has one problem within the function
_pg_interval_type.
- worker_spi scans pg_namespace and uses count() without
qualification.
---
 src/backend/catalog/information_schema.sql | 2 +-
 src/bin/psql/command.c | 2 +-
 src/test/isolation/isolationtester.c   | 2 +-
 src/test/modules/worker_spi/worker_spi.c   | 4 +++-
 4 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/information_schema.sql b/src/backend/catalog/information_schema.sql
index f4e69f4a26..4f2af408ac 100644
--- a/src/backend/catalog/information_schema.sql
+++ b/src/backend/catalog/information_schema.sql
@@ -186,7 +186,7 @@ CREATE FUNCTION _pg_interval_type(typid oid, mod int4) RETURNS text
 AS
 $$SELECT
   CASE WHEN $1 IN (1186) /* interval */
-   THEN upper(substring(format_type($1, $2) from 'interval[()0-9]* #"%#"' for '#'))
+   THEN pg_catalog.upper(substring(pg_catalog.format_type($1, $2) from 'interval[()0-9]* #"%#"' for '#'))
ELSE null
   END$$;
 
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 3560318749..f345572c8c 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -4483,7 +4483,7 @@ get_create_object_cmd(EditableObjectType obj_type, Oid oid,
 printfPQExpBuffer(query,
   "SELECT nspname, relname, relkind, "
   "pg_catalog.pg_get_viewdef(c.oid, true), "
-  "array_remove(array_remove(c.reloptions,'check_option=local'),'check_option=cascaded') AS reloptions, "
+  "pg_catalog.array_remove(pg_catalog.array_remove(c.reloptions,'check_option=local'),'check_option=cascaded') AS reloptions, "
   "CASE WHEN 'check_option=local' = ANY (c.reloptions) THEN 'LOCAL'::text "
   "WHEN 'check_option=cascaded' = ANY (c.reloptions) THEN 'CASCADED'::text ELSE NULL END AS checkoption "
   "FROM pg_catalog.pg_class c "
diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c
index 4ecad038bd..64d666f5cd 100644
--- a/src/test/isolation/isolationtester.c
+++ b/src/test/isolation/isolationtester.c
@@ -184,7 +184,7 @@ main(int argc, char **argv)
 		PQclear(res);
 
 		/* Get the backend pid for lock wait checking. */
-		res = PQexec(conns[i], "SELECT pg_backend_pid()");
+		res = PQexec(conns[i], "SELECT pg_catalog.pg_backend_pid()");
 		if (PQresultStatus(res) == PGRES_TUPLES_OK)
 		{
 			if (PQntuples(res) == 1 && PQnfields(res) == 1)
diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index 3b98b1682b..547bdb26c4 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -115,7 +115,9 @@ initialize_worker_spi(worktable *table)
 
 	/* XXX could we use CREATE SCHEMA IF NOT EXISTS? */
 	initStringInfo(&buf);
-	appendStringInfo(&buf, "select count(*) from pg_namespace where nspname = '%s'",
+	appendStringInfo(&buf,
+	 "select pg_catalog.count(*) "
+	 "from pg_catalog.pg_namespace where nspname = '%s'",
 	 table->schema);
 
 	ret = SPI_execute(buf.data, true, 0);
-- 
2.16.2





Re: csv format for psql

2018-03-08 Thread Fabien COELHO


About "fieldsep_csv", I do not like much the principle of having 
different output variables to represent the same concept depending on 
the format. I would rather have reused fieldsep as in your previous 
submission and set it to "," when under --csv.


yes



how will it be possible to set a different separator such as ';'? I don't
see it with the described design


Indeed, it should be possible. I think that the following should be made 
to work:


  psql --csv -P fieldsep=; -c 'TABLE foo' > foo.csv

So that it can be changed to the semicolon (or tab or whatever) style if 
required.


--
Fabien.



Re: public schema default ACL

2018-03-08 Thread Noah Misch
On Thu, Mar 08, 2018 at 02:00:23PM -0500, Robert Haas wrote:
> I also wonder why we're all convinced that this urgently needs to be
> changed.  I agree that the default configuration we ship is not the
> most secure configuration that we could ship.  However, I think it's a
> big step from saying that the configuration is not as secure as it
> could be to saying that we absolutely must change it for v11.

Did someone say that?  I, for one, wanted to change it but didn't intend to
present it as a "must change".



Re: [bug fix] pg_rewind creates corrupt WAL files, and the standby cannot catch up the primary

2018-03-08 Thread Michael Paquier
On Thu, Mar 01, 2018 at 01:26:32AM +, Tsunakawa, Takayuki wrote:
> Our customer hit another bug of pg_rewind with PG 9.5.  The attached
> patch fixes this.

Sorry for my late reply.  Unfortunately it took me some time to come
back to this patch.

> PROBLEM
> 
> 
> After a long run of successful pg_rewind, the synchronized standby
> could not catch up the primary forever, emitting the following message
> repeatedly: 
> 
> LOG:  XX000: could not read from log segment 0006028A0031,
> offset 16384: No error

Oops.

> CAUSE
> 
> 
> If the primary removes WAL files that pg_rewind is going to get,
> pg_rewind leaves 0-byte WAL files in the target directory here: 
> 
> [libpq_fetch.c]
>   case FILE_ACTION_COPY:
>   /* Truncate the old file out of the way, if any 
> */
>   open_target_file(entry->path, true);
>   fetch_file_range(entry->path, 0, 
> entry->newsize);
>   break;
> 
> pg_rewind completes successfully, create recovery.conf, and then start
> the standby in the target cluster.  walreceiver receives WAL records
> and write them to the 0-byte WAL files.  Finally, xlog reader
> complains that he cannot read a WAL page.

This part qualifies as a data corruption bug as some records are lost by
the backend.

> FIX
> 
> 
> pg_rewind deletes the file when it finds that the primary has deleted
> it. 

The concept looks right to me.  I think that this does not apply only to
WAL segments though.  You could have a temporary WAL segment that has
been included in the chunk table, which is even more volatile, or a
multixact file, a snapshot file, etc.  In short, anything which is not a
relation file could disappear between the moment the file map is built
and the moment the data is fetched from the source server.

@@ -174,7 +173,7 @@ remove_target_file(const char *path)
return;
 
snprintf(dstpath, sizeof(dstpath), "%s/%s", datadir_target, path);
-   if (unlink(dstpath) != 0)
+   if (unlink(dstpath) != 0 && !ignore_error)
I see.  So the only reason why this flag exists is that if a file is
large enough that it is split into multiple chunks, then the first
unlink will work but not the successive ones.  One chunk can be at most
1 million bytes, which is why this is essential for WAL segments.  Instead
of ignoring *all* errors, let's ignore only ENOENT, and rename
ignore_error to missing_ok as well.

You need to update the comment in receiveFileChunks where an entry gets
deleted with basically what I mentioned above, say:
"If a file has been deleted on the source, remove it on the target as
well.  Note that multiple unlink() calls may happen on the same file if
multiple data chunks are associated with it, hence unconditionally ignore
anything missing.  If this file is not a relation data file, then it has
already been truncated when creating the file chunk list at the previous
execution of the filemap."

Adding also a comment on top of remove_target_file to explain what
missing_ok does would be nice to keep track of what the code should do.

The portion about whether data from pg_wal should be transferred or not
is really an improvement that could come in a future version, and what
you are proposing here is more general, so I think that we should go
ahead with it.
--
Michael




Re: Let's remove DSM_INPL_NONE.

2018-03-08 Thread Kyotaro HORIGUCHI
At Wed, 28 Feb 2018 10:08:59 +0900, Michael Paquier  wrote 
in <20180228010859.gb1...@paquier.xyz>
> On Tue, Feb 27, 2018 at 02:53:55PM -0500, Tom Lane wrote:
> > Andres Freund  writes:
> >> On 2018-02-27 14:41:47 -0500, Tom Lane wrote:
> >>> What I didn't understand about it was what kind of testing this'd make
> >>> harder.  If we desupport dynamic_shared_memory_type=none, there aren't
> >>> any code paths that need to cope with the case, and we should just
> >>> remove any code that thereby becomes unreachable.
> > 
> >> What I'm concerned about isn't so much testing paths specific to
> >> dynamic_shared_memory_type=none, but paths where we currently need
> >> fallbacks for the case we couldn't actually allocate dynamic shared
> >> memory. Which I think we at least somewhat gracefully need to deal with.
> > 
> > Ah.  That's a fair point, but I do not think
> > dynamic_shared_memory_type=none is a good substitute for having a way to
> > provoke allocation failures.  That doesn't let you test recovery from
> > situations where your first allocation works and second one fails, for
> > example; and cleanup from that sort of case is likely to be more
> > complicated than the trivial case.

On the other hand, dynamic_shared_memory_type=none is not a
proper means to silence DSM failures.  If this patch causes some
problem, exactly the same thing would have happened with
dynamic_shared_memory_type set to anything *other than* none.  If we
need more "graceful" behavior for some specific failure modes, that
should be addressed separately.

> dynamic_shared_memory_type is used in six code paths where it is aimed
> at doing sanity checks:
...
> The point is that there are other control mechanisms for each one of
> them.  Mainly, for the executor portion, the number of workers is
> controlled by other GUCs on planner-side.

If this just means that there should be some means other than this
variable to turn off the user-visible parallel features:

> - Mute DSM initialization at postmaster startup.
> - Mute cleanup of previous DSM segments from a past postmaster.
These should be allowed unconditionally and should succeed.

> - Block backend startup if there is no DSM.
This is the result of any parallel activity, so this also should
be allowed unconditionally and should succeed.

> - Mute parallel query in planner.
max_parallel_workers_per_gather=0 kills it.
(min_parallel_*_scan_size or parallel_*_cost can effectively do
something similar.)

> - Mute parallel query for parallel btree builds.
max_parallel_maintenance_workers = 0 does.

> - Mute creation of parallel contexts in executor.
Without a Gather or Gather Merge node this never happens, so
max_parallel_workers_per_gather=0 also forces this off (a combined
example follows below).
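
Putting those together, a minimal sketch of turning off the user-visible
parallel features per session, without resorting to
dynamic_shared_memory_type=none:

    SET max_parallel_workers_per_gather = 0;   -- no Gather / Gather Merge plans
    SET max_parallel_maintenance_workers = 0;  -- no parallel btree builds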


Attached is the rebased version, in which the symbols are
renumbered starting from 0 as per the comments.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 0672b1d80a08b0fb0a33a26aad8b1cc77bd83083 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 15 Feb 2018 11:22:50 +0900
Subject: [PATCH] Remove dynamic_shared_memory_type=none

Nowadays PostgreSQL offers some kind of dynamic shared memory feature on
all supported platforms.  Having to allow for "none" hinders us from
using it for core features.  Just removing the option leads to
intermittent regression test failures on some platforms due to an
insufficient max_connections value chosen by initdb.  This patch
therefore requires the patch that increases the minimum fallback value
of max_connections in initdb.
---
 doc/src/sgml/config.sgml  |  6 +++---
 src/backend/access/transam/parallel.c |  7 ---
 src/backend/optimizer/plan/planner.c  |  4 +---
 src/backend/storage/ipc/dsm.c | 15 ---
 src/backend/storage/ipc/dsm_impl.c|  3 ---
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/bin/initdb/initdb.c   | 21 ++---
 src/include/storage/dsm_impl.h|  9 -
 8 files changed, 22 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3a8fc7d803..932ce11a58 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1602,9 +1602,9 @@ include_dir 'conf.d'
 should use.  Possible values are posix (for POSIX shared
 memory allocated using shm_open), sysv
 (for System V shared memory allocated via shmget),
-windows (for Windows shared memory), mmap
-(to simulate shared memory using memory-mapped files stored in the
-data directory), and none (to disable this feature).
+windows (for Windows shared memory),
+and mmap (to simulate shared memory using
+memory-mapped files stored in the data directory).
 Not all values are supported on all platforms; the first supported
 option is the default for that platform.  The use of the
 mmap option, which is not the default on any platform,
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/

Re: public schema default ACL

2018-03-08 Thread Noah Misch
On Wed, Mar 07, 2018 at 07:14:43AM -0500, Stephen Frost wrote:
> * Noah Misch (n...@leadboat.com) wrote:
> > I like the idea of getting more SQL-compatible, if this presents a distinct
> > opportunity to do so.  I do think it would be too weird to create the schema
> > in one database only.  Creating it on demand might work.  What would be the
> > procedure, if any, for database owners who want to deny object creation in
> > their databases?
> 
> My suggestion was that this would be a role attribute.  If an
> administrator doesn't wish for that role to have a schema created
> on-demand at login time, they would set the 'SCHEMA_CREATE' (or whatever
> we name it) role attribute to false.

I had in mind a site with diverse database owners, where the administrators
(folks with CREATEROLE or superuser) don't know every database owner's
preference.  If we had a SCHEMA_CREATE like you describe, I expect its
documentation would say something like this:

  Since SCHEMA_CREATE provides the user one writable schema in each database,
  this allows the user to create permanent objects in any database that
  permits them to connect.  The database owner can prevent that by creating
  the schema in advance of the user's first login.  However, once the user has
  connected once, a non-superuser database owner cannot modify or drop it.

Is that good enough?
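
For concreteness, the pre-creation that blurb refers to would amount to
something like the following, run by the database owner before the user
ever connects (the role name is made up for the example):

    -- Owned by the database owner, so it is neither auto-created for nor
    -- writable by the connecting user "alice".
    CREATE SCHEMA alice;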



Re: public schema default ACL

2018-03-08 Thread Noah Misch
On Wed, Mar 07, 2018 at 09:22:16AM -0500, Peter Eisentraut wrote:
> On 3/6/18 15:20, Robert Haas wrote:
> > On Sat, Mar 3, 2018 at 4:56 AM, Noah Misch  wrote:
> >> I propose, for v11, switching to "GRANT USAGE ON SCHEMA
> >> public TO PUBLIC" (omit CREATE).  Concerns?  An alternative is to change 
> >> the
> >> default search_path to "$user"; that would be break more applications, and 
> >> I
> >> don't see an advantage to compensate for that.
> > 
> > Isn't this going to cause widespread breakage?  Unprivileged users
> > will suddenly find that they can no longer create tables, because
> > $user doesn't exist and they don't have permission on public.  That
> > seems quite unfriendly.
> 
> Moreover, the problem is that if you have database owners that are not
> superusers, they can't easily fix the issue themselves.  Since the
> public schema is owned by postgres, the database owner can't just go in
> and run GRANT CREATE ON SCHEMA PUBLIC TO whomever to restore the old
> behavior or grant specific access.  It would be simpler if we didn't
> install a public schema by default at all.

That's a good point.  Worse, a user with CREATEDB privilege would be able to
create new databases and immediately create and use any schema _except_
public.  That is rather silly.



Re: using worker_spi as pattern

2018-03-08 Thread Michael Paquier
On Thu, Mar 08, 2018 at 11:04:20PM -0600, Jeremy Finzel wrote:
> Since you mention, can anyone elaborate further on the memory leak danger
> here?
> 
> Line 193 in src/test/modules/worker_spi/worker_spi.c read:
> # Note some memory might be leaked here.
> 
> Is this any reason *not* to use this pattern in production?

quote_identifier() may palloc its result, so the quoted versions of the
"schema" and "table" strings pstrdup'd at the top can end up as palloc'd
copies whose pointers are simply dropped.  Those are just a couple of
bytes, so the code complication is not worth the cleanup IMO.
--
Michael




Re: disable SSL compression?

2018-03-08 Thread Gasper Zejn
On 09. 03. 2018 06:24, Craig Ringer wrote:
> I'm totally unconvinced by the threat posed by exploiting a client by
> tricking it into requesting protocol compression - or any other
> protocol change the client lib doesn't understand - with a connection
> option in PGOPTIONS or the "options" connstring entry. The attacker
> must be able to specify either environment variables (in which case I
> present "LD_PRELOAD") or the connstr. If they can set a connstr they
> can direct the client to talk to a different host that tries to
> exploit the connecting client in whatever manner they wish by sending
> any custom crafted messages they like.
>
If the attacker has access to the client process or environment, they
have already won, and that is not where the compression vulnerability lies.

The CRIME and BREACH attacks on (SSL) compression are known-plaintext
attacks, which require the attacker 1) to be able to observe the
encrypted data and 2) to have a way to influence the plaintext, in this
case the SQL query.  In the CRIME attack on HTTPS, compression state was
shared between page content and request headers, so by observing the
size of the encrypted traffic one could guess the cookie values carried
in the HTTP headers and steal credentials, even though the JavaScript
making the requests was running on a different domain.

So the vulnerability would lie in guessing some value in a request or
response which the application or protocol wants to keep hidden, while
somehow learning the size of the requests or responses exchanged with
the database.  Thus, sharing compression state too widely might not be wise.

Kind regards,
Gasper


Re: disable SSL compression?

2018-03-08 Thread Claudio Freire
On Thu, Mar 8, 2018 at 11:06 PM, Peter Eisentraut
 wrote:
> On 3/8/18 14:23, Claudio Freire wrote:
>> On Thu, Mar 8, 2018 at 3:40 PM, Peter Eisentraut
>>  wrote:
>>> It appears that SSL compression is nowadays deprecated as insecure.
>>> Yet, it is still enabled by libpq by default, and there is no way to
>>> disable it in the server.  Should we make some changes here?  Does
>>> anyone know more about this?
>>
>> Even if libpq enables it, it has to be enabled both in the client and
>> the server for it to work.
>>
>> OpenSSL disables the whole feature by default, and enabling it is
>> rather cumbersome. The result is that, at least with OpenSSL, the
>> server and client won't accept compression without extensive fiddling
>> by the user.
>
> But however that may be, libpq appears to enable it by default.  This is
> what I get from psql:
>
> SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384,
> bits: 256, compression: on)

I don't get that:

SSL connection (protocol: TLSv1.2, cipher:
ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)

Even if I set OPENSSL_DEFAULT_ZLIB=1 on the client, I get the same.
The server side refuses.
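
For what it's worth, the negotiated state can also be checked from the
server side in pg_stat_ssl (available in 9.5 and later); a quick sketch:

    SELECT pid, version, cipher, compression
      FROM pg_stat_ssl
     WHERE ssl;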



Re: disable SSL compression?

2018-03-08 Thread Craig Ringer
On 9 March 2018 at 03:23, Claudio Freire  wrote:

> On Thu, Mar 8, 2018 at 3:40 PM, Peter Eisentraut
>  wrote:
> > It appears that SSL compression is nowadays deprecated as insecure.
> > Yet, it is still enabled by libpq by default, and there is no way to
> > disable it in the server.  Should we make some changes here?  Does
> > anyone know more about this?
>
> Even if libpq enables it, it has to be enabled both in the client and
> the server for it to work.
>
> OpenSSL disables the whole feature by default, and enabling it is
> rather cumbersome. The result is that, at least with OpenSSL, the
> server and client won't accept compression without extensive fiddling
> by the user.
>
> So I don't think libpq has to change anything here.
>
>

So SSL compression is unsafe, deprecated and unavailable. Maybe it's time
to look at native protocol compression[1] again?

I'm totally unconvinced by the threat posed by exploiting a client by
tricking it into requesting protocol compression - or any other protocol
change the client lib doesn't understand - with a connection option in
PGOPTIONS or the "options" connstring entry. The attacker must be able to
specify either environment variables (in which case I present "LD_PRELOAD")
or the connstr. If they can set a connstr they can direct the client to
talk to a different host that tries to exploit the connecting client in
whatever manner they wish by sending any custom crafted messages they like.

However, I already proposed [2] a way to handle such protocol extensions
that would address this concern anyway, and more importantly would
present a more informative error to the user if the client doesn't support
the requested protocol option. We send what's effectively an ERROR with a
special SQLSTATE that clients which understand compression may ignore.
Anything that doesn't understand it will see it as a connection error.

I think that strategy would actually work well for a number of possible
protocol extensions that can be client-requested.

That doesn't automagically give us protocol compression. We'd still need a
way to indicate the start/end of a range of compressed payloads; we don't
want to compress each individual message payload independently, but also
need to be able to recover from errors. Presumably after a Sync the
receiver would be expected to treat subsequent message payloads as
uncompressed, or we'd extend Sync with a flag for whether compression in
subsequent messages is on or not. By never compressing the message type and
length we'd make sure we could always read the protocol stream, at the cost
of some waste. But we could still use the same deflate state across
successive message bodies for efficient compression of big CopyData
streams, result sets, etc.

Any thoughts on this model for client-requested protocol extensions? I've
been stymied a number of times by being unable to have the client ask the
server to send things that require a protocol change, and I think this is
possibly a solidly backward compatible way to handle it.

[1] https://www.postgresql.org/message-id/CADT4RqCKfawgwa735s_brELaJ8ySutCC-u3iyLL_EEsJQDYFrg%40mail.gmail.com

[2] https://www.postgresql.org/message-id/CAMsr%2BYH0N7TaAtvS2hu-y8LBkA25fJs0oungGe-tNhwr7scLSQ%40mail.gmail.com





-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-08 Thread Ashutosh Bapat
On Thu, Mar 8, 2018 at 8:02 PM, Jeevan Chalke
 wrote:
>
>
> On Thu, Mar 8, 2018 at 7:49 PM, Robert Haas  wrote:
>>
>> On Thu, Mar 8, 2018 at 9:15 AM, Jeevan Chalke
>>  wrote:
>> > I am not sure why we don't set reltarget into the grouped_rel too.
>> >
>> > But if we do so like we did in partially_grouped_rel, then it will be
>> > lot
>> > easier for partitionwise aggregate as then we don't have to pass target
>> > to
>> > functions creating paths like create_append_path. We now need to update
>> > generate_gather_paths() to take target too as it is now being called on
>> > grouped_rel in which reltarget is not set.
>> >
>> > But yes, if there is any specific reason we can't do so, then I think
>> > the
>> > same like Ashutosh Said. I didn't aware of such reason though.
>>
>> I see no problem with setting reltarget for the grouped_rel.  Before
>> we added partially_grouped_rel, that rel computed paths with two
>> different targets: partial paths had the partial grouping target, and
>> non-partial paths had the ordinary grouping target.  But that's fixed
>> now.
>
>
> OK.
> Will update my changes accordingly.
> If we set reltarget into the grouped_rel now, then I don't need one of the
> refactoring patch which is passing target to the path creation functions.
>

For some reason we do not set reltarget for any of the upper relations.
I don't know why, and browsing through the comments in
grouping_planner() didn't tell me either, including the one below, which
appears before the code that creates an array of upper relation targets.
/*
 * Save the various upper-rel PathTargets we just computed into
 * root->upper_targets[].  The core code doesn't use this, but it
 * provides a convenient place for extensions to get at the info.  For
 * consistency, we save all the intermediate targets, even though some
 * of the corresponding upperrels might not be needed for this query.
 */
Why don't we just set those in the corresponding RelOptInfos? Maybe
we should do that for all the upper rels and not just grouping_rel.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: using worker_spi as pattern

2018-03-08 Thread Jeremy Finzel
>
> If you look at the code of worker_spi.c closely the answer shows up by
> itself:
>
> appendStringInfo(&buf,
>  "CREATE SCHEMA \"%s\" "
>  "CREATE TABLE \"%s\" ("
>  "  type text CHECK (type IN ('total',
> 'delta')), "
>  "  value   integer)"
>  "CREATE UNIQUE INDEX \"%s_unique_total\" ON \"%s\" (type)
> "
>  "WHERE type = 'total'",
>
> In this case "total" is not a type, it is one of the authorized value in
> the value.  So just insert an initial tuple like that:
> INSERT INTO schema1.counted VALUES ('total', 1);
> And then insert periodically for example the following:
> INSERT INTO schema1.counted VALUES ('delta', 3);
> And then the background worker will sum up the values inserted in
> "delta" tuples to the actual "total".
>
>
I could not find the table schema1.counted.  What confused me is that I
ran SELECT worker_spi_launch(1); but it created the schema in the database
postgres instead of the current database I am in!  Doesn't that seem a bit
counter-intuitive?  Anyway, I found it now, so I am good to go!  Thank you!


> > I am trying to use this extension as a pattern for my own background
> > worker, but just trying to understand it.
>
> You are right to do so, this is a good learning step.
> --
> Michael
>

Since you mention, can anyone elaborate further on the memory leak danger
here?

Line 193 in src/test/modules/worker_spi/worker_spi.c read:
# Note some memory might be leaked here.

Is this any reason *not* to use this pattern in production?

Thanks,
Jeremy


Re: csv format for psql

2018-03-08 Thread Pavel Stehule
2018-03-09 3:13 GMT+01:00 Peter Eisentraut :

> On 3/8/18 05:05, Fabien COELHO wrote:
> > I'm in favor of having a simple psql way to generate a convenient and
> > compliant CSV output for export/import.
>
> yes
>
> > I also think that a short option brings little value, and "--csv" is good
> > enough for the purpose, so I would agree to remove the "-C" binding.
>
> yes


> > About "fieldsep_csv", I do not like much the principle of having
> different
> > output variables to represent the same concept depending on the format. I
> > would rather have reused fieldsep as in your previous submission and set
> > it to "," when under --csv.
>
> yes
>

how will it be possible to set a different separator such as ';'? I don't
see it with the described design

Regards

Pavel


>
> > The "\n" eol style is hardcoded. Should it use "recordsep"?
>
> yes
>
> > The "\pset format" error message in "do_pset" shows values in seemingly
> > random order. The situation is pre-existing but not really satisfactory.
> > I'd suggest to put all values in alphabetical order.
>
> yes
>
> > I'd suggest that tests should include more types, not just strings. I
> > would suggest int, float, timestamp, bytea, an array (which uses , as a
> > separator), json (which uses both " and ,)...
>
> sounds good
>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-08 Thread Peter Geoghegan
On Thu, Mar 8, 2018 at 4:54 AM, Pavan Deolasee  wrote:
> Thanks for the feedback. I've greatly refactored the code based on your
> comments and I too like the resultant code. What we have now have
> essentially is: two functions:

I really like how ExecMerge() now mostly just consists of this code
(plus a lot of explanatory comments):

> +   if (!matched ||
> +   !ExecMergeMatched(mtstate, estate, slot, junkfilter, tupleid))
> +   ExecMergeNotMatched(mtstate, estate, slot);

This is far easier to follow. In general, I'm now a lot less worried
about what was previously the #1 concern -- READ COMMITTED conflict
handling (EvalPlanQual() stuff). My #1 concern has become RLS, and
perhaps only because I haven't studied it in enough detail. It seems
like getting this patch committable is now a matter of verifying
details. Of course, there are a lot of details to verify.  :-)

> I've also added a bunch of comments, but it might still not be enough. Feel
> free to suggest improvements there.

I think that the ExecMerge() comments are very good, although they
could perhaps use some light copy-editing. Also, I noticed some
specific issues with comments in ExecMerge() and friends, as well as a
few code issues. Those are:

* MERGE code needs to handle cardinality violations the way ON CONFLICT
does.  Specifically, you need the TransactionIdIsCurrentTransactionId()
check, and a second, fallback "attempted to lock invisible tuple" error
(though this should say "attempted to update or delete invisible tuple"
instead). The extra check is redundant, but better safe than sorry. I
like defensive errors like this because they make the code easier to
read -- I don't get distracted by their absence.

* The last item on cardinality violations also implies this new merge
comment should be changed:

> +   /*
> +* The target tuple was already updated or deleted by the
> +* current command, or by a later command in the current
> +* transaction.
> +*/

It should be changed because it can't have been a later command in
this xact. We handled that case when we called ExecUpdate or
ExecDelete() (plus we now have the extra defensive "can't happen"
elog() error).

* This related comment shouldn't even talk about update/historic
behavior, now that it isn't in ExecUpdate() -- just MERGE:

> +   /*
> +* The former case is possible in a join UPDATE where
> +* multiple tuples join to the same target tuple. This is
> +* pretty questionable, but Postgres has always allowed
> +* it: we just execute the first update action and ignore
> +* additional update attempts.  SQLStandard disallows this
> +* for MERGE, so allow the caller to select how to handle
> +* this.
> +*/

* This wording should be tweaked:

> +* If the current tuple is that last tuple in the update
> +* chain, then we know that the tuple was concurrently
> +* deleted. We just switch to NOT MATCHED case and let the
> +* caller retry the NOT MATCHED actions.

This should say something like "caller can move on to NOT MATCHED
actions". They can never go back from there, of course, which I want
us to be clear on.

* This check of whether whenqual is set is unnecessary, and doesn't
match MATCHED code, or the accompanying comments:

> +   /*
> +* Test condition, if any
> +*
> +* In the absence of a condition we perform the action unconditionally
> +* (no need to check separately since ExecQual() will return true if
> +* there are no conditions to evaluate).
> +*/
> +   if (action->whenqual && !ExecQual(action->whenqual, econtext))
> +   continue;

* I think that this ExecMerge() assertion is not helpful, since you go
on to dereference the pointer in all cases anyway:

> +   Assert(junkfilter != NULL);

* Executor README changes, particularly about projecting twice, really
should be ExecMerge() comments. Maybe just get rid of these?

* Why are we using CMD_NOTHING at all? That constant has something to
do with user-visible rules, and there is no need to reuse it (make a
new CMD_* if you have to). More importantly, why do we even have the
corresponding DO NOTHING stuff in the grammar? Why would users want
that?

For quite a while, I thought that patch must have been support for ON
CONFLICT DO NOTHING within MERGE INSERTs (the docs don't even say what
DO NOTHING is). But that's not what it is at all. It seems like this
is a way of having an action that terminates early, so you don't have
to go on to evaluate other action quals. I can't see much point,
though. More importantly, supporting this necessitates code like the
following RLS code within ExecMergeMatched():

> +   if ((action->

Re: csv format for psql

2018-03-08 Thread Peter Eisentraut
On 3/8/18 05:05, Fabien COELHO wrote:
> I'm in favor of having a simple psql way to generate a convenient and 
> compliant CSV output for export/import.

yes

> I also think that a short option brings little value, and "--csv" is good 
> enough for the purpose, so I would agree to remove the "-C" binding.

yes

> About "fieldsep_csv", I do not like much the principle of having different 
> output variables to represent the same concept depending on the format. I 
> would rather have reused fieldsep as in your previous submission and set 
> it to "," when under --csv.

yes

> The "\n" eol style is hardcoded. Should it use "recordsep"?

yes

> The "\pset format" error message in "do_pset" shows values in seemingly 
> random order. The situation is pre-existing but not really satisfactory. 
> I'd suggest to put all values in alphabetical order.

yes

> I'd suggest that tests should include more types, not just strings. I 
> would suggest int, float, timestamp, bytea, an array (which uses , as a 
> separator), json (which uses both " and ,)...

sounds good
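
A sketch of a test query covering those suggestions (the constants are
made up and only chosen to exercise the quoting rules):

    SELECT 42 AS int_col,
           1.5 AS float_col,
           timestamp '2018-03-08 12:34:56' AS ts_col,
           '\xdeadbeef'::bytea AS bytea_col,
           ARRAY['a', 'b,c'] AS array_col,
           '{"key": "a,\"b\""}'::json AS json_col;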

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: disable SSL compression?

2018-03-08 Thread Peter Eisentraut
On 3/8/18 14:23, Claudio Freire wrote:
> On Thu, Mar 8, 2018 at 3:40 PM, Peter Eisentraut
>  wrote:
>> It appears that SSL compression is nowadays deprecated as insecure.
>> Yet, it is still enabled by libpq by default, and there is no way to
>> disable it in the server.  Should we make some changes here?  Does
>> anyone know more about this?
> 
> Even if libpq enables it, it has to be enabled both in the client and
> the server for it to work.
> 
> OpenSSL disables the whole feature by default, and enabling it is
> rather cumbersome. The result is that, at least with OpenSSL, the
> server and client won't accept compression without extensive fiddling
> by the user.

But however that may be, libpq appears to enable it by default.  This is
what I get from psql:

SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384,
bits: 256, compression: on)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Speed up the removal of WAL files

2018-03-08 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> Hm.  durable_xx should remain a sane operation as an isolated call as you
> still get the same problem if a crash happens before flushing the parent...
> Fujii-san idea also has value to speed up the end of recovery but this costs
> as well in extra recycling operations post promotion.  If the checkpoint
> was to happen a the end of recovery then that would be more logic, but we
> don't for performance reasons.  Let's continue to discuss on this thread.
> If you have any patch to offer, let's also look at them.
> 
> Anyway, as things are pretty much idle on this thread for a couple of days
> and that we are still discussing potential ideas, I think that this entry
> should be marked as returned with feedback.  Thoughts?

OK, I moved this to the next CF.  Thank you for your cooperation.



Regards
Takayuki Tsunakawa





Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-08 Thread Peter Geoghegan
On Thu, Mar 8, 2018 at 3:29 PM, Tomas Vondra
 wrote:
> The reason why the patch tried to prevent that is because the SQL
> standard says this (p. 1176 of SQL 2016):
>
> 15) The <search condition> immediately contained in a <merge statement>,
> the <search condition> immediately contained in a <merge when matched
> clause>, and the <search condition> immediately contained in a <merge
> when not matched clause> shall not generally contain a <routine
> invocation> whose subject routine is an SQL-invoked routine that
> possibly modifies SQL-data.
>
> I'm not quite sure what is required to be compliant with this rule. For
> example what does "immediately contained" or "shall not generally
> contain" mean? Does that mean user are expected not to do that because
> it's obviously silly, or do we need to implement some protection?

My impression is that this means that you shouldn't treat this as a
particularly likely case, or try to facilitate it. The "shall not generally contain" blurb is about intent, rather than actual restrictions
implementations must enforce, AFAICT. Though the UPDATE precedent is
what really matters here -- not the SQL standard. The SQL standard
doesn't say anything to make me doubt that that's the right precedent
to want to follow.

Close by, under "General Rules", rule #4 is: "The extent to which an
SQL-implementation may disallow independent changes that are not
significant is implementation-defined". This same sentence appears in
quite a few different places, including in the description of UPDATE.
ISTM that the SQL standard actually enforces that volatile qual
weirdness (and what to do about it) is a general
INSERT/UPDATE/DELETE/MERGE issue.

> That being said the volatility check seems reasonable to me (and i would
> not expect it to be a huge amount of code).

If we're going to do this, we'd have to do the same with UPDATE, IMV.
And, well, we're not gonna do that.

-- 
Peter Geoghegan



Re: Speed up the removal of WAL files

2018-03-08 Thread Michael Paquier
On Wed, Mar 07, 2018 at 06:15:24AM +, Tsunakawa, Takayuki wrote:
> Good point.  I understood you referred to PreallocXlogFiles(), which
> may create one new WAL file if RemoveNonParentXlogFiles() is not
> called or does not recycle WAL files in the old timeline. 
> 
> The best hack (or a compromise/kludge?) seems to be:
> 
> 1. Modify durable_xx() functions so that they don't fsync directory
> changes when enableFsync is false. 
> 
> 2. RemoveNonParentXlogFiles() sets enableFsync to false before the
> while loop, restores the original value of it after the while loop,
> and fsync pg_wal/ just once.   What do you think? 

Hm.  durable_xx() should remain a sane operation as an isolated call, as
you still get the same problem if a crash happens before flushing the
parent...  Fujii-san's idea also has value to speed up the end of
recovery, but it costs as well in extra recycling operations post
promotion.  If the checkpoint were to happen at the end of recovery then
that would be more logical, but we don't do that for performance
reasons.  Let's continue to discuss on this thread.  If you have any
patches to offer, let's also look at them.

Anyway, as things have been pretty much idle on this thread for a couple
of days and we are still discussing potential ideas, I think that this
entry should be marked as returned with feedback.  Thoughts?
--
Michael




Re: using worker_spi as pattern

2018-03-08 Thread Michael Paquier
On Thu, Mar 08, 2018 at 03:29:52PM -0600, Jeremy Finzel wrote:
> However, this raises many questions for me:
> 
>- Insert a value into what table?  I see the process referring to an
>object that doesn't exist in my database - schema1.counted
>- What is "total" type?  I don't see any type with this name in the
>database
>- Same question for "delta" type

If you look at the code of worker_spi.c closely the answer shows up by
itself: 

appendStringInfo(&buf,
 "CREATE SCHEMA \"%s\" "
 "CREATE TABLE \"%s\" ("
 "  type text CHECK (type IN ('total', 'delta')), "
 "  value   integer)"
 "CREATE UNIQUE INDEX \"%s_unique_total\" ON \"%s\" (type) "
 "WHERE type = 'total'",

In this case "total" is not a type, it is one of the authorized value in
the value.  So just insert an initial tuple like that:
INSERT INTO schema1.counted VALUES ('total', 1);
And then insert periodically for example the following:
INSERT INTO schema1.counted VALUES ('delta', 3);
And then the background worker will sum up the values inserted in
"delta" tuples to the actual "total".

> I am trying to use this extension as a pattern for my own background
> worker, but just trying to understand it.

You are right to do so, this is a good learning step.
--
Michael




Re: [HACKERS] GUC for cleanup indexes threshold.

2018-03-08 Thread Alexander Korotkov
Hi!

I'd like to propose a revised patch based on various ideas upthread.

This patch works as following.

1) The B-tree meta page is extended with 2 additional parameters:
 * btm_oldest_btpo_xact – oldest btpo_xact among the deleted pages,
 * btm_last_cleanup_num_heap_tuples – number of heap tuples during the last
cleanup scan.

2) These parameters are reset during btbulkdelete() and set during
btvacuumcleanup().

3) Index scans during the second and subsequent btvacuumcleanup() happen only if
   btm_oldest_btpo_xact is older than RecentGlobalXmin
   OR num_heap_tuples >= btm_last_cleanup_num_heap_tuples *
                         (1 + vacuum_cleanup_index_scale_factor).

In other words, btvacuumcleanup() scans the index only if there are
recyclable pages, or the index statistics are stale (the table has grown
by more than vacuum_cleanup_index_scale_factor since the last index
statistics collection).

4) vacuum_cleanup_index_scale_factor can be set either by GUC or reloption.
The default value is 0.1.  So, by default, a cleanup scan is triggered after
the table size has increased by 10%.

5) Since new fields are added to the metapage, BTREE_VERSION is bumped.
In order to support pg_upgrade, reading the previous metapage version is
supported.  On metapage rewrite, it's upgraded to the new version.

So, since we don't skip the scan of recyclable pages, there is no risk of
xid wraparound.  The risk of stale statistics is also small, because the
default value of vacuum_cleanup_index_scale_factor is quite low.  Users can
increase vacuum_cleanup_index_scale_factor at their own risk, trading a
larger gap in index statistics for a lighter B-tree cleanup scan load (an
example of setting it follows below).
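
For reference, with the GUC and reloption names used in this patch, tuning
it would look roughly like this (a sketch, assuming the names and the
reloption form stay as described above):

    -- Cluster- or session-wide:
    SET vacuum_cleanup_index_scale_factor = 0.5;

    -- Or per index, through a reloption:
    ALTER INDEX t_i_idx SET (vacuum_cleanup_index_scale_factor = 0.5);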

Some simple benchmark shows the effect.

Before patch.

# insert into t select i from generate_series(1,1) i;
# create index t_i_idx on t(i);
# vacuum t;
VACUUM
Time: 15639,822 ms (00:15,640)
# insert into t select i from generate_series(1,1000) i;
INSERT 0 1000
Time: 6,195 ms
# vacuum t;
VACUUM
Time: 1012,794 ms (00:01,013)
# insert into t select i from generate_series(1,1000) i;
INSERT 0 1000
Time: 5,276 ms
# vacuum t;
VACUUM
Time: 1013,254 ms (00:01,013)

After patch.

# insert into t select i from generate_series(1,1) i;
# create index t_i_idx on t(i);
# vacuum t;
VACUUM
Time: 15689,450 ms (00:15,689)
# insert into t select i from generate_series(1,1000) i;
INSERT 0 1000
Time: 5,585 ms
# vacuum t;
VACUUM
Time: 50,777 ms
# insert into t select i from generate_series(1,1000) i;
INSERT 0 1000
Time: 5,641 ms
# vacuum t;
VACUUM
Time: 46,997 ms

Thus, vacuum time for an append-only table drops from about 1000 ms to
50 ms (roughly 20x).

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-lazy-btree-cleanup-3.patch
Description: Binary data


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-08 Thread Tomas Vondra


On 03/08/2018 11:46 PM, Peter Geoghegan wrote:
> On Thu, Mar 8, 2018 at 6:52 AM, Robert Haas  wrote:
>>> I removed this code since it was wrong. We might want to add some basic
>>> checks for existence of volatile functions in the WHEN or SET clauses. But I
>>> agree, it's no different than regular UPDATEs. So may be not a big deal.
>>
>> I just caught up on this thread.  I'm definitely glad to see that code
>> go because, wow, that is all kinds of wrong.  I don't see a real need
>> to add any kind of replacement check, either.  Prohibiting volatile
>> functions here doesn't seem likely to accomplish anything useful.  It
>> seems like the most we'd want to do is mention this the documentation
>> somehow, and I'm not even sure we really need to do that much.
> 
> Thanks in large part to Pavan's excellent work, the situation in
> nodeModifyTable.c is much clearer than it was a few weeks ago. It's
> now obvious that MERGE is very similar to UPDATE ... FROM, which
> doesn't have any restrictions on volatile functions.
> 

Yeah, I agree Pavan did an excellent work on moving this patch forward.

> I don't see any sense in prohibiting volatile functions in either
> case, because it should be obvious to users that that's just asking
> for trouble. I can believe that someone would make that mistake, just
> about, but they'd have to be writing their DML statement on
> auto-pilot.
> 

The reason why the patch tried to prevent that is because the SQL
standard says this (p. 1176 of SQL 2016):

15) The <search condition> immediately contained in a <merge statement>,
the <search condition> immediately contained in a <merge when matched
clause>, and the <search condition> immediately contained in a <merge
when not matched clause> shall not generally contain a <routine
invocation> whose subject routine is an SQL-invoked routine that
possibly modifies SQL-data.

I'm not quite sure what is required to be compliant with this rule. For
example what does "immediately contained" or "shall not generally
contain" mean? Does that mean user are expected not to do that because
it's obviously silly, or do we need to implement some protection?

That being said, the volatility check seems reasonable to me (and I would
not expect it to be a huge amount of code).

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Testbed for predtest.c ... and some arguable bugs therein

2018-03-08 Thread Tom Lane
Robert Haas  writes:
> On Thu, Mar 8, 2018 at 12:33 AM, Tom Lane  wrote:
>> A bit of hacking later, I have the attached.  The set of test cases it
>> includes at the moment were mostly developed with an eye to getting to
>> full code coverage of predtest.c, but we could add more later.  What's
>> really interesting is that it proves that the "weak refutation" logic,
>> i.e. predicate_refuted_by() with clause_is_check = true, is not
>> self-consistent.

> Oops.

After further navel-gazing, I've concluded that my initial opinion of the
"correct" semantics for weak refutation was wrong.  When I revise the
test logic to follow my revised understanding, it no longer claims that
any proofs are mistaken, although there are still things that could be
proven and aren't being.  Let me summarize my thoughts here for the
record.

For predicate_implied_by, we have two plausible definitions of
implication:

* "strong" form: A implies B if truth of A implies truth of B.
We need this to prove that a row satisfying one WHERE clause or
index predicate must satisfy another one.

* "weak" form: A implies B if non-falsity of A implies non-falsity of B.
We need this to prove that a row satisfying one CHECK constraint
must satisfy another one.

We could conceivably use a third proof rule that tries to prove
non-falsity of B from truth of A, but AFAICS there are no places in
the current code that need exactly that behavior, and the "strong" rule
seems like an adequate substitute unless such a need becomes mainstream.
(There aren't that many cases where we could prove something per this rule
but not per the strong rule, anyway.)

The fourth possible proof rule is to prove truth of B from non-falsity
of A, but the number of cases where you can make such a proof is so
small that this option isn't actually interesting in practice.
(For instance, you can't even conclude "X implies X" with such a rule.)

On the predicate_refuted_by side, I was confused about how to choose
among some similar options:

* R1: A refutes B if truth of A implies falsity of B.

* R2: A refutes B if truth of A implies non-truth of B.

* R3: A refutes B if non-falsity of A implies non-truth of B.

* R4: A refutes B if non-falsity of A implies falsity of B.

We can discard R4, again because there are too few cases where we
could prove anything per that rule, eg NOT X wouldn't refute X
or vice versa.

R1 is the traditional behavior of predicate_refuted_by, and as its
comment says, is needed because we want to disprove a CHECK
constraint (ie prove it would be violated) given a WHERE clause.

R2 is of use for disproving one WHERE clause given another one.
It turns out that of the two extant calls to predicate_refuted_by,
one of them would actually like to have that semantics, because
it's looking for mutually contradictory WHERE clauses.  We've been
getting by with R1 for that, but R2 would let us optimize some
more cases.

R3 could be of use for disproving a WHERE clause given a CHECK constraint,
and is the one I'd supposed should be the alternate behavior of
predicate_refuted_by.  However, we have no actual use cases for that;
relation_excluded_by_constraints goes the other way.

Moreover, although R3 seems like it'd be the more tractable rule on
grounds of symmetry, it's looking to me like R2 is about equivalent
for proof purposes.  For instance see the code near predtest.c:700,

 * If A is a strong NOT-clause, A R=> B if B equals A's arg
 *
 * We cannot make the stronger conclusion that B is refuted if B
 * implies A's arg; that would only prove that B is not-TRUE, not
 * that it's not NULL either.  Hence use equal() rather than
 * predicate_implied_by_recurse().  We could do the latter if we
 * ever had a need for the weak form of refutation.

The previous patch to add weak implication rules missed the opportunity
for improvement here.  If we are using R1 or R2, then we have the
assumption "NOT A is true", allowing us to conclude A is false.  Then,
if we can prove B implies A under weak implication (not-false B implies
not-false A), we have proven B must be false, satisfying the proof rule for
R1.  But also, if we can prove B implies A under strong implication (true
B implies true A), we have proven B is not true, satisfying the proof rule
for R2.  So in either case there's potential to recurse to
predicate_implied_by rather than only being able to handle the tautology
"NOT A refutes A".

Now, the same logic would go through if we were using R3.  We'd be
starting with non-falsity of "NOT A", only allowing us to conclude
non-truth of A, but a proof of strong B => A would still let us conclude
B is not true, satisfying R3.

Alternatively, consider if we were starting with A and trying to refute
NOT B.  Intuitively you'd figure that if you can prove A => B, that oughta
be enough to refute NOT B.  That certainly works for R1 with strong
implication: truth of A implies truth of B im

Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-08 Thread Peter Geoghegan
On Thu, Mar 8, 2018 at 6:52 AM, Robert Haas  wrote:
>> I removed this code since it was wrong. We might want to add some basic
>> checks for existence of volatile functions in the WHEN or SET clauses. But I
>> agree, it's no different than regular UPDATEs. So may be not a big deal.
>
> I just caught up on this thread.  I'm definitely glad to see that code
> go because, wow, that is all kinds of wrong.  I don't see a real need
> to add any kind of replacement check, either.  Prohibiting volatile
> functions here doesn't seem likely to accomplish anything useful.  It
> seems like the most we'd want to do is mention this the documentation
> somehow, and I'm not even sure we really need to do that much.

Thanks in large part to Pavan's excellent work, the situation in
nodeModifyTable.c is much clearer than it was a few weeks ago. It's
now obvious that MERGE is very similar to UPDATE ... FROM, which
doesn't have any restrictions on volatile functions.

I don't see any sense in prohibiting volatile functions in either
case, because it should be obvious to users that that's just asking
for trouble. I can believe that someone would make that mistake, just
about, but they'd have to be writing their DML statement on
auto-pilot.

-- 
Peter Geoghegan



using worker_spi as pattern

2018-03-08 Thread Jeremy Finzel
Hello - I have compiled and installed the extension worker_spi.  I also
launched the process via SELECT worker_spi_launch(1);

I see this in pg_stat_activity:
WITH deleted AS (DELETE FROM schema1.counted WHERE type = 'delta' RETURNING
value), total AS (SELECT coalesce(sum(value), 0) as sum FROM deleted)
UPDATE schema1.counted SET value = counted.value + total.sum FROM total
WHERE type = 'total' RETURNING counted.value

However, I'm not sure what I am supposed to do next?  The docs at the top
of the module say:

To see it working, insert an initial value
 * with "total" type and some initial value; then insert some other rows
with
 * "delta" type.  Delta rows will be deleted by this worker and their values
 * aggregated into the total.

However, this raises many questions for me:

   - Insert a value into what table?  I see the process referring to an
   object that doesn't exist in my database - schema1.counted
   - What is "total" type?  I don't see any type with this name in the
   database
   - Same question for "delta" type

I am trying to use this extension as a pattern for my own background
worker, but just trying to understand it.

Thanks!
Jeremy


Re: Implementing SQL ASSERTION

2018-03-08 Thread David Fetter
On Thu, Mar 08, 2018 at 09:11:58PM +, Joe Wildish wrote:
> Hi David,
> 
> > 
> > This patch no longer applies.  Any chance of a rebase?
> 
> Of course. I’ll look at it this weekend,

Much appreciated!

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Implementing SQL ASSERTION

2018-03-08 Thread Joe Wildish
Hi David,

> 
> This patch no longer applies.  Any chance of a rebase?
> 



Of course. I’ll look at it this weekend,

Cheers,
-Joe




Re: JIT compiling with LLVM v11

2018-03-08 Thread Thomas Munro
On Fri, Mar 9, 2018 at 9:12 AM, Andres Freund  wrote:
> Or even in core LLVM, which has this nice comment:
>
>   // If user asked for the 'native' CPU, we need to autodetect features.
>   // This is necessary for x86 where the CPU might not support all the
>   // features the autodetected CPU name lists in the target. For example,
>   // not all Sandybridge processors support AVX.
>   if (MCPU == "native") {
>
> which pretty much describes the issue you're apparently hitting.
>
> I've pushed an attempted fix (needs a comment, but works here).

===
 All 186 tests passed.
===

That did the trick.  Thanks!

-- 
Thomas Munro
http://www.enterprisedb.com



Re: RFC: Add 'taint' field to pg_control.

2018-03-08 Thread Andres Freund
On 2018-03-07 23:34:37 -0500, Tom Lane wrote:
> Craig Ringer  writes:
> > As I understand it, because we allow multiple Pg instances on a system, we
> > identify the small sysv shmem segment we use by the postmaster's pid. If
> > you remove the DirLockFile (postmaster.pid) you remove the interlock
> > against starting a new postmaster. It'll think it's a new independent
> > instance on the same host, make a new shmem segment and go merrily on its
> > way mangling data horribly.
> 
> Yeah.  If we realized that the old shmem segment was associated with this
> data directory, we could check for processes still attached to it ... but
> the lock file is exactly where that association is kept.

I'd somehow remembered that we just took the path as the identifier, but
that's wrong...

Greetings,

Andres Freund



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread Andres Freund
On 2018-03-08 14:25:59 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On March 8, 2018 10:46:53 AM PST, Tom Lane  wrote:
> >> Breaking fundamental invariants like
> >> "ctid points to this tuple or its update successor" is going to cause
> >> trouble.  There's a lot of code that knows that; more than knows the
> >> details of what's in xmax, I believe.
> 
> > I don't think this is that big a problem. All code already needs to handle
> > the case where ctid points to an aborted update tuple, which might long
> > since have been replaced by an independent row. That's why we have all
> > these updated.xmax == new.xmin checks, which will, without any changes,
> > catch the proposed scheme, no?
> 
> No.  In those situations, the conclusion is that the current tuple is
> live, which is exactly the wrong conclusion for a cross-partition
> update.

I don't see the problem you're seeing here. Visibility decisions and
ctid chaining aren't really done in the same way. And in the cases we do
want different behaviour for updates that cross partition boundaries,
the patch adds the error messages. What I was trying to say is not that
we don't need to touch any of those paths, but that there's code to
handle bogus ctid values already. That really wasn't the case for
multixacts (in fact, they broke this check in multiple places).


> Or at least it might be the wrong conclusion ... I wonder how this patch
> works if the updating transaction aborted.

If the updating transaction aborted, HTSU will return
HeapTupleMayBeUpdated and we can just go ahead and allow an update?

Greetings,

Andres Freund



Re: JIT compiling with LLVM v11

2018-03-08 Thread Andres Freund
On 2018-03-08 11:58:41 -0800, Andres Freund wrote:
> I think we can easily fix this by behaving like clang, which uses
> llvm::sys::getHostCPUFeatures(HostFeatures) to build the feature list:
> 
>   // If -march=native, autodetect the feature list.
>   if (const Arg *A = Args.getLastArg(clang::driver::options::OPT_march_EQ)) {
> if (StringRef(A->getValue()) == "native") {
>   llvm::StringMap<bool> HostFeatures;
>   if (llvm::sys::getHostCPUFeatures(HostFeatures))
> for (auto &F : HostFeatures)
>   Features.push_back(
>   Args.MakeArgString((F.second ? "+" : "-") + F.first()));
> }
>   }
> 
> which seems easy enough.

Or even in core LLVM, which has this nice comment:

  // If user asked for the 'native' CPU, we need to autodetect features.
  // This is necessary for x86 where the CPU might not support all the
  // features the autodetected CPU name lists in the target. For example,
  // not all Sandybridge processors support AVX.
  if (MCPU == "native") {

which pretty much describes the issue you're apparently hitting.

I've pushed an attempted fix (needs a comment, but works here).

Greetings,

Andres Freund



Re: JIT compiling with LLVM v11

2018-03-08 Thread Andres Freund
Hi,

On 2018-03-09 00:33:03 +1300, Thomas Munro wrote:
> On Wed, Mar 7, 2018 at 3:49 PM, Thomas Munro
>  wrote:
> > make check at today's HEAD of your jit branch crashes on my FreeBSD
> > box.  The first thing to crash is this query from point.sql:
> >
> > LOG:  server process (PID 87060) was terminated by signal 4: Illegal 
> > instruction
> > DETAIL:  Failed process was running: SELECT '' AS thirtysix, p1.f1 AS
> > point1, p2.f1 AS point2, p1.f1 <-> p2.f1 AS dist
> >FROM POINT_TBL p1, POINT_TBL p2
> >ORDER BY dist, p1.f1[0], p2.f1[0];
> 
> Hmm.  It's trying to execute an AVX instruction.

Ah, that's interesting.



> I am not sure if that is real though, because the stack is immediately
> corrupted.

I don't think the stack is corrupted at all; it's just that lldb can't
unwind through functions it doesn't know about. I have a pending LLVM patch
to add that capability.


> So either func is not really a function, or it is but was
> compiled for the wrong target.  I see that you call
> LLVMCreateTargetMachine() with the result of LLVMGetHostCPUName() as
> cpu.  For me that's "ivybridge", so I tried hard coding "generic"
> instead and it didn't help.

Hm.


> I see that you say "" for features, where
> is where one would normally put "avx" to turn on AVX instructions, so
> I think perhaps that theory is entirely bogus.

Could you try adding -avx to the features string and see whether it fixes things?

This kinda suggests an LLVM bug or at least an oddity, but I'll try to
drill down more into this. Is this a native machine or a VM?

I think we can easily fix this by behaving like clang, which uses
llvm::sys::getHostCPUFeatures(HostFeatures) to build the feature list:

  // If -march=native, autodetect the feature list.
  if (const Arg *A = Args.getLastArg(clang::driver::options::OPT_march_EQ)) {
if (StringRef(A->getValue()) == "native") {
  llvm::StringMap<bool> HostFeatures;
  if (llvm::sys::getHostCPUFeatures(HostFeatures))
for (auto &F : HostFeatures)
  Features.push_back(
  Args.MakeArgString((F.second ? "+" : "-") + F.first()));
}
  }

which seems easy enough.

Greetings,

Andres Freund



Re: parallel append vs. simple UNION ALL

2018-03-08 Thread Robert Haas
On Thu, Mar 1, 2018 at 8:30 AM, Ashutosh Bapat
 wrote:
> The patches look clean. I particularly looked at 0003.
>
> patch 0001
> +/*
> + * Generate partial paths for final_rel, too, if outer query levels might
> + * be able to make use of them.
> + */
> I am not able to understand the construct esp. the if clause. Did you want to
> say "... if there are outer query levels. Those might ..." or something like
> that?

Well, that's what I meant, but I didn't think it was necessary to
spell it out in quite that much detail.

> 0002
> (op->all == top_union->all || op->all) &&
> This isn't really your change.  Checking op->all is cheaper than checking
> equality, so maybe we should check that first and take advantage of
> short-circuit condition evaluation. If we do that, the above condition
> reduces to (op->all || !top_union->all), which is two boolean conditions,
> even cheaper. But maybe the second optimization is not worth the loss of
> readability.

I doubt this makes any difference.  The compiler should be smart
enough to do this the same way regardless of exactly how we write it,
and if it's not, it can't make more than a few instructions worth of
difference.  This code is not nearly performance-critical enough for
that to matter.  Also, it's not the job of this patch to whack this
around.

> "identically-propertied UNIONs" may be "UNIONs with identical properties".

Likewise, I didn't write those words, so I don't plan to change them
just because I might have written them differently.

> 0003
> Probably we want to rename generate_union_path() as generate_union_rel() or
> generate_union_paths() since the function doesn't return a path anymore.
> Similarly for generate_nonunion_path().

Good point.  Changed.

> In recurse_set_operations()
> -return NULL;/* keep compiler quiet */
> This line is deleted and instead rel is initialized to NULL. That way we lose
> any chance to detect a future bug from a block leaving rel uninitialized
> via a compiler warning. Maybe we should replace "return NULL" with "rel =
> NULL", which will not be executed because of the error.

I don't think this is really a problem.  If some code path fails to
initialize rel, it's going to crash when postprocess_setop_rel() calls
set_cheapest().  Any warning the compiler gives is more likely to be a
false-positive than an actual problem.

> +/* Build path list and relid set. */
> +foreach(lc, rellist)
> +{
> With the changes in this patch, we could actually use 
> add_paths_to_append_rel()
> to create an append path. That function builds paths with different pathkeys,
> parameterization (doesn't matter here) and also handles parallel append. So we
> can avoid code duplication and also leverage more optimizations like using
> MergeAppend instead of overall sort etc. But that function doesn't have 
> ability
> to add a final node like make_union_unique(). A similar requirement has arisen
> in partition-wise join where we need to add a final node for finalising
> aggregate on top of paths created by add_paths_to_append_rel().  May be we can
> change that function to return a list of paths, which are then finalized by 
> the
> caller and added to "append" rel. But I don't think doing all that is in the
> scope of this patch set.

Yeah, I thought about all of that and came to similar conclusions.

> 0004
> +if (!op->all)
> +ppath = make_union_unique(op, ppath, tlist, root);
> We could probably push the grouping/sorting down to the parallel workers. But
> again not part of this patchset.

Yep.  There's a lot more work that could be done to improve setop
planning, but I think getting even this much done for v11 would be a
significant step forward.

Updated patches attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: parallel append vs. simple UNION ALL

2018-03-08 Thread Robert Haas
On Thu, Mar 8, 2018 at 2:46 AM, Rajkumar Raghuwanshi
 wrote:
> On Thu, Mar 8, 2018 at 12:27 AM, Robert Haas  wrote:
>>
>> New patches attached, fixing all 3 of the issues you reported:
>
> Thanks. The new patches applied cleanly on head and fix all the reported issues.

Great.  Committed 0001.  Are you planning any further testing of this
patch series?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread Tom Lane
Andres Freund  writes:
> On March 8, 2018 10:46:53 AM PST, Tom Lane  wrote:
>> Breaking fundamental invariants like
>> "ctid points to this tuple or its update successor" is going to cause
>> trouble.  There's a lot of code that knows that; more than knows the
>> details of what's in xmax, I believe.

> I don't think this is that big a problem. All code already needs to handle
> the case where ctid points to an aborted update tuple, which might long
> since have been replaced by an independent row. That's why we have all
> these updated.xmax == new.xmin checks, which will, without any changes,
> catch the proposed scheme, no?

No.  In those situations, the conclusion is that the current tuple is
live, which is exactly the wrong conclusion for a cross-partition update.
Or at least it might be the wrong conclusion ... I wonder how this patch
works if the updating transaction aborted.

regards, tom lane



Re: disable SSL compression?

2018-03-08 Thread Claudio Freire
On Thu, Mar 8, 2018 at 3:40 PM, Peter Eisentraut
 wrote:
> It appears that SSL compression is nowadays deprecated as insecure.
> Yet, it is still enabled by libpq by default, and there is no way to
> disable it in the server.  Should we make some changes here?  Does
> anyone know more about this?

Even if libpq enables it, it has to be enabled both in the client and
the server for it to work.

OpenSSL disables the whole feature by default, and enabling it is
rather cumbersome. The result is that, at least with OpenSSL, the
server and client won't accept compression without extensive fiddling
by the user.

So I don't think libpq has to change anything here.
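For completeness, one quick way to see whether any live connections actually
negotiated compression on the server side is the pg_stat_ssl view (available
since 9.5); this is only a sketch of the check, not anything proposed in the
thread:

SELECT pid, ssl, compression, version, cipher
FROM pg_stat_ssl
WHERE ssl;

On the client side, libpq's sslcompression connection parameter can likewise
be set to 0 explicitly rather than relying on the OpenSSL build to refuse it.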



Re: Handling better supported channel binding types for SSL implementations

2018-03-08 Thread Peter Eisentraut
On 1/23/18 21:27, Michael Paquier wrote:
> On Tue, Jan 23, 2018 at 12:08:37PM -0500, Peter Eisentraut wrote:
>> On 1/22/18 02:29, Michael Paquier wrote:
>>> However there is as well the argument that this list's contents are not
>>> directly used now, and based on what I saw from the MacOS SSL and GnuTLS
>>> patches that would not be the case after either.
>>
>> Right, there is no facility for negotiating the channel binding type, so
>> a boolean result should be enough.
> 
> I am not completely convinced either that we need to complicate the code
> to handle channel binding type negotiation.
> 
>> In which case we wouldn't actually need this for GnuTLS yet.
> 
> Sure. This depends mainly on how the patch for Mac's Secure Transport
> moves forward.

Moved to next CF along with those other two patches.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread Andres Freund


On March 8, 2018 10:46:53 AM PST, Tom Lane  wrote:
>Robert Haas  writes:
>> On Thu, Mar 8, 2018 at 12:03 PM, Tom Lane  wrote:
>>> FWIW, I would also vote for (1), especially if the only way to do (2)
>>> is stuff as outright scary as this.  I would far rather have (3) than
>>> this, because IMO, what we are looking at right now is going to make
>>> the fallout from multixacts look like a pleasant day at the beach.
>
>> Whoa.  Well, that would clearly be bad, but I don't understand why you
>> find this so scary.  Can you explain further?
>
>Possibly I'm crying wolf; it's hard to be sure.  But I recall that nobody
>was particularly afraid of multixacts when that went in, and look at all
>the trouble we've had with that.  Breaking fundamental invariants like
>"ctid points to this tuple or its update successor" is going to cause
>trouble.  There's a lot of code that knows that; more than knows the
>details of what's in xmax, I believe.

I don't think this is that big a problem. All code already needs to handle the
case where ctid points to an aborted update tuple, which might long since have
been replaced by an independent row. That's why we have all these
updated.xmax == new.xmin checks, which will, without any changes, catch the
proposed scheme, no?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-03-08 Thread Robert Haas
On Thu, Mar 8, 2018 at 1:09 PM, Tom Lane  wrote:
> We already have autovacuum taking care of that, and as I stated, asking
> backends to do it is provably insufficient.  The right path is to make
> autovacuum cover the cases it's missing today.

Well, your counter-proposal of teaching autovacuum to recognize the
case where the backend and the orphaned schema are in different
databases is also provably insufficient.  We have to make more than
one change here to really fix this, and the two changes that
Tsunakawa-san proposed, taken together, are sufficient; either one by
itself is not.  Doing only #2 out of his proposals, as you proposed,
is better than doing nothing, but it's not a complete fix.

>> I don't
>> really share your concern about performance; one extra syscache lookup
>> at backend startup isn't going to break the bank.
>
> If it were just that, I wouldn't be worried either.  But it's not.
> Once the pg_namespace row exists, which it will in an installation
> that's been in use for for any length of time, you're going to have
> to actively search for entries in that schema.  I'm not sure how
> expensive a performDeletion() scan that finds nothing to do really
> is, but for sure it's more than the syscache lookup you expended to
> find the pg_namespace row.

True, but that's a rare scenario.

Still, I'm not sure we're too far apart on the underlying issue here;
we both would prefer, for one reason or another, if autovacuum could
just take care of this.  But I think that Tsunakawa-san is correct to
guess that your proposal might have some race conditions that need to
be solved somehow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GnuTLS support

2018-03-08 Thread Peter Eisentraut
In the thread about Secure Transport we agreed to move the consideration
of new SSL libraries to PG12.

Here is my current patch, after all the refactorings.

The status is that it works fine and could be used.

There are two failures in the SSL tests that I cannot explain.  The
tests are for some rather obscure configurations, so the changed
behaviors are not obviously wrong, perhaps legitimate implementation
differences.  But someone wrote those tests with a purpose (probably),
so we should have some kind of explanation for the regressions.

Other non-critical, nice-to-have issues:

- Do something about sslinfo, perhaps fold into pg_stat_ssl view.
- Do something about pgcrypto.
- Add tests for load_dh_file().
- Implement channel binding tls-server-end-point.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From c76b6efca0f52fe4deb6003009f4f8730201f041 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 8 Mar 2018 14:05:33 -0500
Subject: [PATCH v6] GnuTLS support

---
 configure | 258 ++--
 configure.in  |  37 +-
 doc/src/sgml/client-auth.sgml |   2 +-
 doc/src/sgml/config.sgml  |  81 ++-
 doc/src/sgml/installation.sgml|  47 +-
 doc/src/sgml/libpq.sgml   |  54 +-
 doc/src/sgml/runtime.sgml |  13 +-
 doc/src/sgml/sslinfo.sgml |   1 +
 src/Makefile.global.in|   1 +
 src/backend/libpq/Makefile|   4 +-
 src/backend/libpq/be-secure-gnutls.c  | 820 +
 src/backend/libpq/be-secure-openssl.c |   4 +-
 src/backend/libpq/be-secure.c |   4 +
 src/backend/libpq/hba.c   |   2 +-
 src/backend/utils/misc/guc.c  |  38 ++
 src/backend/utils/misc/postgresql.conf.sample |   7 +-
 src/common/Makefile   |   4 +-
 src/common/sha2_gnutls.c  |  99 +++
 src/include/common/sha2.h |  14 +-
 src/include/libpq/libpq-be.h  |  13 +-
 src/include/libpq/libpq.h |   2 +
 src/include/pg_config.h.in|  17 +
 src/include/pg_config_manual.h|   2 +-
 src/interfaces/libpq/.gitignore   |   1 +
 src/interfaces/libpq/Makefile |  14 +-
 src/interfaces/libpq/fe-secure-gnutls.c   | 836 ++
 src/interfaces/libpq/fe-secure.c  |   2 +-
 src/interfaces/libpq/libpq-fe.h   |   2 +-
 src/interfaces/libpq/libpq-int.h  |  14 +-
 src/port/pg_strong_random.c   |  18 +-
 src/test/Makefile |   2 +-
 src/test/ssl/Makefile |   2 +-
 src/test/ssl/t/001_ssltests.pl|  65 +-
 src/test/ssl/t/002_scram.pl   |   2 +-
 src/tools/msvc/Mkvcbuild.pm   |  10 +
 src/tools/pgindent/typedefs.list  |   3 +
 36 files changed, 2360 insertions(+), 135 deletions(-)
 create mode 100644 src/backend/libpq/be-secure-gnutls.c
 create mode 100644 src/common/sha2_gnutls.c
 create mode 100644 src/interfaces/libpq/fe-secure-gnutls.c

diff --git a/configure b/configure
index 3943711283..f2d4aa502a 100755
--- a/configure
+++ b/configure
@@ -707,6 +707,7 @@ UUID_EXTRA_OBJS
 with_uuid
 with_systemd
 with_selinux
+with_gnutls
 with_openssl
 with_ldap
 with_krb_srvnam
@@ -838,6 +839,7 @@ with_bsd_auth
 with_ldap
 with_bonjour
 with_openssl
+with_gnutls
 with_selinux
 with_systemd
 with_readline
@@ -1532,6 +1534,7 @@ Optional Packages:
   --with-ldap build with LDAP support
   --with-bonjour  build with Bonjour support
   --with-openssl  build with OpenSSL support
+  --with-gnutls   build with GnuTLS support
   --with-selinux  build with SELinux support
   --with-systemd  build with systemd support
   --without-readline  do not use GNU Readline nor BSD Libedit for editing
@@ -1999,6 +2002,52 @@ $as_echo "$ac_res" >&6; }
 
 } # ac_fn_c_check_func
 
+# ac_fn_c_check_decl LINENO SYMBOL VAR INCLUDES
+# -
+# Tests whether SYMBOL is declared in INCLUDES, setting cache variable VAR
+# accordingly.
+ac_fn_c_check_decl ()
+{
+  as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack
+  as_decl_name=`echo $2|sed 's/ *(.*//'`
+  as_decl_use=`echo $2|sed -e 's/(/((/' -e 's/)/) 0&/' -e 's/,/) 0& (/g'`
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether $as_decl_name is 
declared" >&5
+$as_echo_n "checking whether $as_decl_name is declared... " >&6; }
+if eval \${$3+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+$4
+int
+main ()
+{
+#ifndef $as_decl_name
+#ifd

Re: public schema default ACL

2018-03-08 Thread Robert Haas
On Wed, Mar 7, 2018 at 5:11 PM, David G. Johnston
 wrote:
> I still feel like I want to mull this over more but auto-creating schemas
> strikes me as being "spooky action at a distance".

I don't think that it's a terrible proposal, but I don't see it as
fixing the real issue.  If we do something even as simple as removing
'public' from the default search_path, then I predict that a very
significant number of people will run pg_upgrade (or dump and restore,
or logically replicate the database), restart their application, and
find that it no longer works and they have absolutely no idea what has
gone wrong or how to fix it.  That seems like a major usability fail
to me, and auto-creating per-user schemas does absolutely nothing to
improve the situation.  That's not to say that it's a bad idea;
honestly, I think it's kind of nifty, and it certainly makes things a
lot nicer if there's no public schema any more because it makes CREATE
TABLE work out of the box again, something that we certainly want.
But if we don't have some solution to the problem of upgrade =>
everything breaks, then I don't think we really have a good solution
here.
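
For concreteness, the per-database lockdown that an installation can already
apply on its own today, independent of any change to the shipped defaults, is
roughly the following; the role name is only a placeholder:

REVOKE CREATE ON SCHEMA public FROM PUBLIC;
ALTER ROLE some_app_user SET search_path = "$user";  -- drop public from the path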

I also wonder why we're all convinced that this urgently needs to be
changed.  I agree that the default configuration we ship is not the
most secure configuration that we could ship.  However, I think it's a
big step from saying that the configuration is not as secure as it
could be to saying that we absolutely must change it for v11.  We have
shipped tons of releases with sslmode=prefer and a wide-open
pg_hba.conf, and those aren't the most secure default configurations
either.  And changing either of those things would probably break a
lot fewer users than the changes being proposed on this thread.  This
issue isn't something that is brand new in a recent release of
PostgreSQL, and a lot of users are unaffected by it.  People need to
be aware that having the Donald Trump campaign and the Hillary Clinton
campaign share access to the same public schema, to which both
campaigns have CREATE and USAGE access, is probably asking for
trouble, but to be honest I suspect a fair number of users had figured
that out well before this security release went out the door.

It's certainly worth considering ideas for improving PostgreSQL's
security out-of-the-box, but the sky isn't falling, and it appears to
me that the risk of collateral damage from changes in this area is
pretty high.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread Andres Freund


On March 8, 2018 10:46:53 AM PST, Tom Lane  wrote:
>Robert Haas  writes:
>> On Thu, Mar 8, 2018 at 12:03 PM, Tom Lane  wrote:
>>> FWIW, I would also vote for (1), especially if the only way to do (2)
>>> is stuff as outright scary as this.  I would far rather have (3) than
>>> this, because IMO, what we are looking at right now is going to make
>>> the fallout from multixacts look like a pleasant day at the beach.
>
>> Whoa.  Well, that would clearly be bad, but I don't understand why you
>> find this so scary.  Can you explain further?
>
>Possibly I'm crying wolf; it's hard to be sure.  But I recall that nobody
>was particularly afraid of multixacts when that went in, and look at all
>the trouble we've had with that.  Breaking fundamental invariants like
>"ctid points to this tuple or its update successor" is going to cause
>trouble.  There's a lot of code that knows that; more than knows the
>details of what's in xmax, I believe.
>
>I would've been happier about expending an infomask bit towards this
>purpose.  Just eyeing what we've got, I can't help noticing that
>HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
>in a partitioned table.  Perhaps making these tests depend on
>partitioned-ness would be unworkably messy, but it's worth thinking
>about.

We're pretty much doing so for speculative lock IDs/upsert already. Which 
doesn't seem to have caused a lot of problems.

Andres 
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread Tom Lane
Robert Haas  writes:
> On Thu, Mar 8, 2018 at 12:03 PM, Tom Lane  wrote:
>> FWIW, I would also vote for (1), especially if the only way to do (2)
>> is stuff as outright scary as this.  I would far rather have (3) than
>> this, because IMO, what we are looking at right now is going to make
>> the fallout from multixacts look like a pleasant day at the beach.

> Whoa.  Well, that would clearly be bad, but I don't understand why you
> find this so scary.  Can you explain further?

Possibly I'm crying wolf; it's hard to be sure.  But I recall that nobody
was particularly afraid of multixacts when that went in, and look at all
the trouble we've had with that.  Breaking fundamental invariants like
"ctid points to this tuple or its update successor" is going to cause
trouble.  There's a lot of code that knows that; more than knows the
details of what's in xmax, I believe.

I would've been happier about expending an infomask bit towards this
purpose.  Just eyeing what we've got, I can't help noticing that
HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
in a partitioned table.  Perhaps making these tests depend on
partitioned-ness would be unworkably messy, but it's worth thinking
about.

regards, tom lane



disable SSL compression?

2018-03-08 Thread Peter Eisentraut
It appears that SSL compression is nowadays deprecated as insecure.
Yet, it is still enabled by libpq by default, and there is no way to
disable it in the server.  Should we make some changes here?  Does
anyone know more about this?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Testbed for predtest.c ... and some arguable bugs therein

2018-03-08 Thread Tom Lane
Robert Haas  writes:
> On Thu, Mar 8, 2018 at 12:33 AM, Tom Lane  wrote:
>> I'm not sure that that's worth fixing right now.  Instead I'm tempted
>> to revert the addition of the clause_is_check argument to
>> predicate_refuted_by, on the grounds that it's both broken and currently
>> unnecessary.

> Hmm, I think you were the one who pushed for adding that argument in
> the first place: http://postgr.es/m/31878.1497389...@sss.pgh.pa.us

I'm kind of disappointed that you failed to take the *other* advice
in that message, as I still think that clause_is_check is a poor
choice of name for the flag.  It could have been salvaged with a
clear comment defining the semantics, but that's not there either.

> I have no problem with taking it back out, although I'm disappointed
> that I failed to find whatever was broken about it during review.

Maybe I'll spend a few minutes trying to isolate why the current
results are wrong.  However, it's certainly arguable that we shouldn't
spend much time on this with no use-case in sight.

regards, tom lane



Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-03-08 Thread Tom Lane
Robert Haas  writes:
> On Tue, Mar 6, 2018 at 6:39 PM, Tom Lane  wrote:
>> Now as for the problem originally stated, step 1 alone doesn't fix it,
>> and there's reason not to like that change much.  Forcing backends to
>> clear their temp schemas immediately on connection will slow down
>> connection times, and for applications that never make any temp tables,
>> that's just a dead loss (though admittedly it might not be too expensive
>> in that case).

> I think that's a little short-sighted.  I think we really want temp
> tables of no-longer-running backends to go away as soon as possible;
> that should be viewed as a gain in and of itself.

We already have autovacuum taking care of that, and as I stated, asking
backends to do it is provably insufficient.  The right path is to make
autovacuum cover the cases it's missing today.

> I don't
> really share your concern about performance; one extra syscache lookup
> at backend startup isn't going to break the bank.

If it were just that, I wouldn't be worried either.  But it's not.
Once the pg_namespace row exists, which it will in an installation
that's been in use for any length of time, you're going to have
to actively search for entries in that schema.  I'm not sure how
expensive a performDeletion() scan that finds nothing to do really
is, but for sure it's more than the syscache lookup you expended to
find the pg_namespace row.

It's perhaps also worth reminding people that this whole discussion
pertains only to crash recovery; if the previous owner of the temp
schema had exited cleanly, it would've cleaned things up.  I'm unexcited
about adding overhead for crash recovery to the normal connection
code path when it's not necessary, and even more so when it's neither
necessary nor sufficient.

regards, tom lane
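
A rough illustration of what is at stake: a query along these lines (a sketch
using only the standard catalogs; how to interpret the ages is up to the
reader) lets an operator spot temp-schema tables left behind by a crashed
backend and see how far their relfrozenxid has aged:

SELECT n.nspname, c.relname, age(c.relfrozenxid) AS xid_age
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname LIKE 'pg\_temp\_%'
  AND c.relkind = 'r'
ORDER BY xid_age DESC;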



Re: FOR EACH ROW triggers on partitioned tables

2018-03-08 Thread Alvaro Herrera
Thomas Munro wrote:

> What is this test for?
> 
> +create trigger failed after update on parted_trig
> +  referencing old table as old_table
> +  for each statement execute procedure trigger_nothing();
> 
> It doesn't fail as you apparently expected.  Perhaps it was supposed
> to be "for each row" so you could hit your new error with
> errdetail("Triggers on partitioned tables cannot have transition
> tables.")?

You're absolutely right.  Fixed in the attached version.

I also include two requisite fixes for missing CCI calls in existing
code: one is in StorePartitionBound, which I think is backpatchable to
pg10 (this is the one that was causing me to add the one Peter
complained about in [1]), and the others are in the partition indexing
code.  In terms of the current tests, the first one is necessary in
order for things to work after this patch; the ones in the second patch
I only added after code review in order to understand where the first
one was.  (In that second patch I also remove one which now seems
unnecessary and in hindsight was probably there because I was lacking
the others.)

Patch 0003 is the feature at hand.  Compared to v3, this version adds
some recursing logic during ENABLE/DISABLE TRIGGER, so the test that was
previously failing now works correctly.

I kept the test on "irregular" partitioning from v5, too; it works here
without any further changes.
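
As a concrete illustration of what the feature allows (reusing the table and
function names from the quoted test; this is only a sketch, not a test taken
from the patch):

CREATE TRIGGER parted_trig_row
  AFTER INSERT OR UPDATE ON parted_trig
  FOR EACH ROW EXECUTE PROCEDURE trigger_nothing();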

One thing I'd like to add before claiming this committable (on the
backend side) is support for constraint triggers.  AFAICT that requires a bit of
additional logic, but it shouldn't be too terrible.  This would allow
for deferrable unique constraints, for example.

[1] https://postgr.es/m/171cb95a-35ec-2ace-3add-a8d16279f...@2ndquadrant.com

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 994e69105fc341add1e5b5cc76e8fa039f81d6a4 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 8 Mar 2018 14:01:39 -0300
Subject: [PATCH v6 1/3] add missing CommandCounterIncrement in
 StorePartitionBound

---
 src/backend/catalog/heap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index cf36ce4add..2b5377bdf2 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -3299,6 +3299,9 @@ StorePartitionBound(Relation rel, Relation parent, 
PartitionBoundSpec *bound)
heap_freetuple(newtuple);
heap_close(classRel, RowExclusiveLock);
 
+   /* Make update visible */
+   CommandCounterIncrement();
+
/*
 * The partition constraint for the default partition depends on the
 * partition bounds of every other partition, so we must invalidate the
-- 
2.11.0

>From 4d68f3ef71667696c41ede27fe8d3fd0dcec7844 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 8 Mar 2018 14:04:13 -0300
Subject: [PATCH v6 2/3] Add missing CommandCounterIncrement() in partitioned
 index code

---
 src/backend/catalog/pg_constraint.c | 4 
 src/backend/commands/indexcmds.c| 6 ++
 src/backend/commands/tablecmds.c| 2 --
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/pg_constraint.c 
b/src/backend/catalog/pg_constraint.c
index 731c5e4317..38fdf72877 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -18,6 +18,7 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "access/xact.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
 #include "catalog/objectaccess.h"
@@ -781,6 +782,9 @@ ConstraintSetParentConstraint(Oid childConstrId, Oid 
parentConstrId)
recordDependencyOn(&depender, &referenced, DEPENDENCY_INTERNAL_AUTO);
 
heap_close(constrRel, RowExclusiveLock);
+
+   /* make update visible */
+   CommandCounterIncrement();
 }
 
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 504806b25b..9ca632865b 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1003,6 +1003,9 @@ DefineIndex(Oid relationId,
ReleaseSysCache(tup);
heap_close(pg_index, RowExclusiveLock);
heap_freetuple(newtup);
+
+   /* make update visible */
+   CommandCounterIncrement();
}
}
else
@@ -2512,5 +2515,8 @@ IndexSetParentIndex(Relation partitionIdx, Oid parentOid)
 
recordDependencyOn(&partIdx, &partitionTbl, 
DEPENDENCY_AUTO);
}
+
+   /* make our updates visible */
+   CommandCounterIncrement();
}
 }
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 74e020bffc..7ecfbc17a0 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/comman

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread Robert Haas
On Thu, Mar 8, 2018 at 12:25 PM, Pavan Deolasee
 wrote:
> I think the question is: isn't there an alternate way to achieve the same
> result? One alternate way would be to do what I suggested above i.e. free up
> more bits and use one of those.

That's certainly possible, but TBH the CTID field seems like a pretty
good choice for this particular feature.  I mean, we're essentially
trying to indicate that the CTID link is not valid, so using an
invalid value in the CTID field seems like a pretty natural choice.
We could use, say, an infomask bit to indicate that the CTID link is
not valid, but an infomask bit is more precious.  Any two-valued
property can be represented by an infomask bit, but using the CTID
field is only possible for properties that can't be true at the same
time that the CTID field needs to be valid.  So it makes sense that
this property, which can't be true at the same time the CTID field
needs to be valid, should try to use an otherwise-unused bit pattern
for the CTID field itself.

> Another way would be to add a hidden column
> to the partition table, when it is created or when it is attached as a
> partition. This only penalises the partition tables, but keeps rest of the
> system out of it. Obviously, if this column is added when the table is
> attached as a partition, as against at table creation time, then the old
> tuple may not have room to store this additional field. May be we can handle
> that by double updating the tuple? That seems bad, but then it only impacts
> the case when a partition key is updated. And we can clearly document
> performance implications of that operation. I am not sure how common this
> case is going to be anyways. With this hidden column, we can even store a
> pointer to another partition and do something with that, if at all needed.

Sure, but that would mean that partitioned tables would get bigger as
compared with unpartitioned tables, it would break backward
compatibility with v10, and it would require a major redesign of the
system -- the list of "system" columns is deeply embedded in the
system design and previous proposals to add to it have not been met
with wild applause.

> That's just one idea. Of course, I haven't thought about it for more than
> 10mins, so most likely I may have missed out on details and it's probably a
> stupid idea after all. But there could be other ideas too. And even if we
> can't find one, my vote would be to settle for #1 instead of trying to do
> #2.

Fair enough.  I don't really see a reason why we can't make #2 work.
Obviously, the patch touches the on-disk format and is therefore scary
-- that's why I thought it should be broken out of the main update
tuple routing patch -- but it's far less of a structural change than
Alvaro's multixact work or the WARM stuff, at least according to my
current understanding.  Tom said he thinks it's riskier than the
multixact stuff but I don't see why that should be the case.  That had
widespread impacts on vacuuming and checkpointing that are not at
issue here.  Still, there's no question that it's a scary patch and if
the consensus is now that we don't need it -- so be it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-03-08 Thread Robert Haas
On Tue, Mar 6, 2018 at 6:39 PM, Tom Lane  wrote:
> Now as for the problem originally stated, step 1 alone doesn't fix it,
> and there's reason not to like that change much.  Forcing backends to
> clear their temp schemas immediately on connection will slow down
> connection times, and for applications that never make any temp tables,
> that's just a dead loss (though admittedly it might not be too expensive
> in that case).

I think that's a little short-sighted.  I think we really want temp
tables of no-longer-running backends to go away as soon as possible;
that should be viewed as a gain in and of itself.  One, it saves disk
space.  Two, it prevents them from causing wraparound problems.  I
believe we've had people complain about both, but definitely the
latter.

> Now, you can argue that autovacuum's check can be fooled by an "owner"
> backend that is connected to the current DB but hasn't actually taken
> possession of its assigned temp schema (and hence the table in question
> really is orphaned after all).  This edge case could be taken care of by
> having backends clear their temp schema immediately, as in step 1 of the
> patch.  But I still think that that is an expensive way to catch what
> would be a really infrequent case.

I think we should try to do something about this case -- if not now,
then later.  I agree that it would be better if we could get
autovacuum to do it instead of doing it in the foreground.  I don't
really share your concern about performance; one extra syscache lookup
at backend startup isn't going to break the bank.  Rather, I'm
concerned about reliability.  As I said upthread:

"So it would be really bad if you had catalog corruption
preventing the removal of pg_temp_2, because then every time you
connect, it will try to remove that schema, fail, and disconnect you."

Now granted your database *shouldn't* have catalog corruption, but a
lot of things that shouldn't happen sometimes do, and it's better when
those problems don't cause cascading failures.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread Pavan Deolasee
On Thu, Mar 8, 2018 at 10:27 PM, Robert Haas  wrote:

>
> However, there's no such thing as a free lunch.  We can't use the CTID
> field to point to a CTID in another table because there's no room to
> include the identity of the other table in the field.  We can't widen
> it to make room because that would break on-disk compatibility and
> bloat our already-too-big tuple headers.  So, we cannot make it work
> like it does when the updates are confined to a single partition.
> Therefore, the only options are (1) ignore the problem, and let a
> cross-partition update look entirely like a delete+insert, (2) try to
> throw some error in the case where this introduces user-visible
> anomalies that wouldn't be visible otherwise, or (3) revert update
> tuple routing entirely.  I voted for (1), but the consensus was (2).
> I think that (3) will make a lot of people sad; it's a very good
> feature.


I am definitely not suggesting that we do #3, though I agree with Tom that
the option is on the table. Maybe it's that two back-to-back bugs in this
area make me worried and raise questions about the amount of testing the
feature has had. In addition, making such a significant on-disk change for
one corner case, for which even #1 might be acceptable, seems like a lot.
If we at all want to go in that direction, I would suggest considering a
patch that I wrote last year to free up additional bits from the ctid field
(as part of WARM). I know Tom did not like that either, but at the very
least, it provides us a lot more room for future work, with the same amount
of risk.


> If we want to have (2), then we've got to have some way to
> mark a tuple that was deleted as part of a cross-partition update, and
> that requires a change to the on-disk format.
>

I think the question is: isn't there an alternate way to achieve the same
result? One alternate way would be to do what I suggested above i.e. free
up more bits and use one of those. Another way would be to add a hidden
column to the partition table, when it is created or when it is attached as
a partition. This only penalises the partition tables, but keeps rest of
the system out of it. Obviously, if this column is added when the table is
attached as a partition, as against at table creation time, then the old
tuple may not have room to store this additional field. May be we can
handle that by double updating the tuple? That seems bad, but then it only
impacts the case when a partition key is updated. And we can clearly
document performance implications of that operation. I am not sure how
common this case is going to be anyways. With this hidden column, we can
even store a pointer to another partition and do something with that, if at
all needed.

That's just one idea. Of course, I haven't thought about it for more than
10mins, so most likely I may have missed out on details and it's probably a
stupid idea afterall. But there could be other ideas too. And even if we
can't find one, my vote would be to settle for #1 instead of trying to do
#2.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Testbed for predtest.c ... and some arguable bugs therein

2018-03-08 Thread Robert Haas
On Thu, Mar 8, 2018 at 12:33 AM, Tom Lane  wrote:
> A bit of hacking later, I have the attached.  The set of test cases it
> includes at the moment were mostly developed with an eye to getting to
> full code coverage of predtest.c, but we could add more later.  What's
> really interesting is that it proves that the "weak refutation" logic,
> i.e. predicate_refuted_by() with clause_is_check = true, is not
> self-consistent.

Oops.

> I'm not sure that that's worth fixing right now.  Instead I'm tempted
> to revert the addition of the clause_is_check argument to
> predicate_refuted_by, on the grounds that it's both broken and currently
> unnecessary.

Hmm, I think you were the one who pushed for adding that argument in
the first place: http://postgr.es/m/31878.1497389...@sss.pgh.pa.us

I have no problem with taking it back out, although I'm disappointed
that I failed to find whatever was broken about it during review.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread Robert Haas
On Thu, Mar 8, 2018 at 12:03 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> Therefore, the only options are (1) ignore the problem, and let a
>> cross-partition update look entirely like a delete+insert, (2) try to
>> throw some error in the case where this introduces user-visible
>> anomalies that wouldn't be visible otherwise, or (3) revert update
>> tuple routing entirely.  I voted for (1), but the consensus was (2).
>
> FWIW, I would also vote for (1), especially if the only way to do (2)
> is stuff as outright scary as this.  I would far rather have (3) than
> this, because IMO, what we are looking at right now is going to make
> the fallout from multixacts look like a pleasant day at the beach.

Whoa.  Well, that would clearly be bad, but I don't understand why you
find this so scary.  Can you explain further?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread Tom Lane
Robert Haas  writes:
> Therefore, the only options are (1) ignore the problem, and let a
> cross-partition update look entirely like a delete+insert, (2) try to
> throw some error in the case where this introduces user-visible
> anomalies that wouldn't be visible otherwise, or (3) revert update
> tuple routing entirely.  I voted for (1), but the consensus was (2).

FWIW, I would also vote for (1), especially if the only way to do (2)
is stuff as outright scary as this.  I would far rather have (3) than
this, because IMO, what we are looking at right now is going to make
the fallout from multixacts look like a pleasant day at the beach.

regards, tom lane



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-03-08 Thread Robert Haas
On Thu, Mar 8, 2018 at 11:45 AM, Tom Lane  wrote:
> Prabhat Sahu  writes:
>> On Wed, Mar 7, 2018 at 7:51 PM, Robert Haas  wrote:
>>> That looks like the background worker got killed by the OOM killer.  How
>>> much memory do you have in the machine where this occurred?
>
>> I have run the testcase on my local machine with the below configuration:
>> Environment: CentOS 7(64bit)
>> HD : 100GB
>> RAM: 4GB
>> Processor: 4
>
> If you only have 4GB of physical RAM, it hardly seems surprising that
> trying to use 8GB of maintenance_work_mem would draw the wrath of the
> OOM killer.

Yup.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread Robert Haas
On Thu, Mar 8, 2018 at 10:07 AM, Tom Lane  wrote:
> Pavan Deolasee  writes:
>> I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does
>> not do anything to deal with the fact that t_ctid may no longer point to
>> itself to mark end of the chain. I just can't see how that would work.
>> ...
>> I am actually worried that we're tinkering with ip_blkid to handle one
>> corner case of detecting partition key update. This is going to change
>> on-disk format and probably need more careful attention.
>
> You know, either one of those alone would be scary as hell.  Both in
> one patch seem to me to be sufficient reason to reject it outright.
> Not only will it be an unending source of bugs, but it's chewing up
> far too much of what few remaining degrees-of-freedom we have in the
> on-disk format ... for a single purpose that hasn't even been sold as
> something we have to have.

I agree that it isn't clear that it's worth making a change to the
on-disk format for this feature.  I made the argument when it was
first proposed that we should just document that there would be
anomalies with cross-partition updates that didn't occur otherwise.
However, multiple people thought that it was worth burning one of our
precious few remaining infomask bits in order to throw an error in
that case rather than just silently having an anomaly, and that's why
this patch got written.  It's not too late to decide that we'd rather
not do that after all.

However, there's no such thing as a free lunch.  We can't use the CTID
field to point to a CTID in another table because there's no room to
include the identity of the other table in the field.  We can't widen
it to make room because that would break on-disk compatibility and
bloat our already-too-big tuple headers.  So, we cannot make it work
like it does when the updates are confined to a single partition.
Therefore, the only options are (1) ignore the problem, and let a
cross-partition update look entirely like a delete+insert, (2) try to
throw some error in the case where this introduces user-visible
anomalies that wouldn't be visible otherwise, or (3) revert update
tuple routing entirely.  I voted for (1), but the consensus was (2).
I think that (3) will make a lot of people sad; it's a very good
feature.  If we want to have (2), then we've got to have some way to
mark a tuple that was deleted as part of a cross-partition update, and
that requires a change to the on-disk format.

In short, the two things that you are claiming are prohibitively scary
if done in the same patch look to me like they're actually just one
thing, and that one thing is something which absolutely has to be done
in order to implement the design most community members favored in the
original discussion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
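
For readers following along, the scenario under discussion is an UPDATE that
changes the partition key, so the row physically moves to another partition
and internally becomes a delete plus an insert. A minimal sketch of that
setup, with hypothetical table names (the UPDATE only succeeds with the
update tuple routing feature being discussed):

CREATE TABLE t (k int, v text) PARTITION BY RANGE (k);
CREATE TABLE t1 PARTITION OF t FOR VALUES FROM (0) TO (10);
CREATE TABLE t2 PARTITION OF t FOR VALUES FROM (10) TO (20);
INSERT INTO t VALUES (5, 'x');
UPDATE t SET k = 15 WHERE k = 5;  -- the row leaves t1 and is inserted into t2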



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-03-08 Thread Tom Lane
Prabhat Sahu  writes:
> On Wed, Mar 7, 2018 at 7:51 PM, Robert Haas  wrote:
>> That looks like the background worker got killed by the OOM killer.  How
>> much memory do you have in the machine where this occurred?

> I have run the testcase on my local machine with the below configuration:
> Environment: CentOS 7(64bit)
> HD : 100GB
> RAM: 4GB
> Processor: 4

If you only have 4GB of physical RAM, it hardly seems surprising that
trying to use 8GB of maintenance_work_mem would draw the wrath of the
OOM killer.

regards, tom lane
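
In other words, a setting sized to the machine should avoid the OOM kill
entirely; something along these lines, where the 1GB figure is only an
illustration for a 4GB box, not a recommendation:

SET maintenance_work_mem = '1GB';
CREATE INDEX pgb_acc_idx3 ON pgbench_accounts (aid, abalance, filler);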



Re: Server won't start with fallback setting by initdb.

2018-03-08 Thread Tom Lane
Robert Haas  writes:
> On Wed, Mar 7, 2018 at 6:43 PM, Michael Paquier  wrote:
>> On Wed, Mar 07, 2018 at 06:39:32PM -0500, Tom Lane wrote:
>>> OK, seems like I'm on the short end of that vote.  I propose to push the
>>> GUC-crosschecking patch I posted yesterday, but not the default-value
>>> change, and instead push Kyotaro-san's initdb change.  Should we back-patch
>>> these things to v10 where the problem appeared?

>> I would vote for a backpatch.  If anybody happens to run initdb on v10
>> and gets max_connections to 10, that would immediately cause a failure.
>> We could also wait for sombody to actually complain about that, but a
>> bit of prevention does not hurt to ease future user experience on this
>> released version.

> In theory, back-patching the GUC-crosschecking patch could cause the
> cluster to fail to restart after the upgrade.  It's pretty unlikely.
> We have to postulate someone with, say, default values but for
> max_connections=12.  But it's not impossible.  I would be inclined to
> back-patch the increase in the max_connections fallback from 10 to 20
> because that fixes a real, if unlikely, failure mode, but treat the
> GUC cross-checking stuff as a master-only improvement.  Although it's
> unlikely to hurt many people, there's no real upside.  Nobody is going
> to say "boy, it's a good thing they tidied that GUC cross-checking in
> the latest major release -- that really saved my bacon!".  Nothing is
> really broken as things stand.

Done that way.  I concur that there's little reason to back-patch
the cross-check change before v10, since the case was even less likely
to happen back when max_wal_senders defaulted to zero.  There's some
argument for changing it in v10, but avoiding thrashing translatable
strings in a released branch probably outweighs it.

regards, tom lane



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-03-08 Thread Prabhat Sahu
On Wed, Mar 7, 2018 at 7:51 PM, Robert Haas  wrote:

> On Wed, Mar 7, 2018 at 8:59 AM, Prabhat Sahu <
> prabhat.s...@enterprisedb.com> wrote:
>>
>> 2018-03-07 19:24:44.263 IST [54400] LOG:  background worker "parallel
>> worker" (PID 54482) was terminated by signal 9: Killed
>>
>
> That looks like the background worker got killed by the OOM killer.  How
> much memory do you have in the machine where this occurred?
>
I have run the testcase on my local machine with the below configuration:

Environment: CentOS 7(64bit)
HD : 100GB
RAM: 4GB
Processor: 4

I have narrowed down the testcase as below, which also reproduces the same
crash.

-- GUCs under postgresql.conf
maintenance_work_mem = 8GB

./pgbench -i -s 500 -d postgres

postgres=# create index pgb_acc_idx3 on pgbench_accounts(aid,
abalance,filler);
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

--

With Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation

The Postgres Database Company



On Thu, Mar 8, 2018 at 7:12 AM, Andres Freund  wrote:

>
>
> On March 7, 2018 5:40:18 PM PST, Peter Geoghegan  wrote:
> >On Wed, Mar 7, 2018 at 5:16 PM, Tomas Vondra
> > wrote:
> >> FWIW that's usually written to the system log. Does dmesg say
> >something
> >> about the kill?
> >
> >While it would be nice to confirm that it was indeed the OOM killer,
> >either way the crash happened because SIGKILL was sent to a parallel
> >worker. There is no reason to suspect a bug.
>
> Not impossible there's a leak somewhere though.
>
> Andres
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread Tom Lane
Robert Haas  writes:
> ... I suppose we could still decide that if we
> can't have that, we don't want update tuple routing at all, but I
> think that's an overreaction.

Between this thread and

I am getting the distinct impression that that feature wasn't ready
to be committed.  I think that reverting it for v11 is definitely
an option that needs to be kept on the table.

regards, tom lane



Re: unique indexes on partitioned tables

2018-03-08 Thread Alvaro Herrera
Hi,

Shinoda, Noriyoshi wrote:

> I tried this feature with the latest snapshot. When I executed the
> following SQL statement, multiple primary keys were created on the
> partition.  Is this the intended behavior?

Hahah.  Is that a serious question?  Of course it isn't.  I'll fix it:

> -- test
> postgres=> CREATE TABLE part1(c1 INT PRIMARY KEY, c2 INT, c3 VARCHAR(10)) 
> PARTITION BY RANGE(c1) ;
> CREATE TABLE
> postgres=> CREATE TABLE part1v1 (LIKE part1) ;
> CREATE TABLE
> postgres=> ALTER TABLE part1v1 ADD CONSTRAINT pk_part1v1 PRIMARY KEY (c1, c2) 
> ;
> ALTER TABLE
> postgres=> ALTER TABLE part1 ATTACH PARTITION part1v1 FOR VALUES FROM (100) 
> TO (200) ;
> ALTER TABLE

I think the correct behavior here is for ATTACH PARTITION to error out,
indicating that a primary key already exists.

Thanks for pointing this out.  Please keep testing,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread Tom Lane
Pavan Deolasee  writes:
> I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does
> not do anything to deal with the fact that t_ctid may no longer point to
> itself to mark end of the chain. I just can't see how that would work.
> ...
> I am actually worried that we're tinkering with ip_blkid to handle one
> corner case of detecting partition key update. This is going to change
> on-disk format and probably need more careful attention.

You know, either one of those alone would be scary as hell.  Both in
one patch seem to me to be sufficient reason to reject it outright.
Not only will it be an unending source of bugs, but it's chewing up
far too much of what few remaining degrees-of-freedom we have in the
on-disk format ... for a single purpose that hasn't even been sold as
something we have to have.

Find another way.

regards, tom lane



RE: unique indexes on partitioned tables

2018-03-08 Thread Shinoda, Noriyoshi
Hi.

I tried this feature with the latest snapshot. When I executed the following 
SQL statement, multiple primary keys were created on the partition. 
Is this the intended behavior?

-- test
postgres=> CREATE TABLE part1(c1 INT PRIMARY KEY, c2 INT, c3 VARCHAR(10)) 
PARTITION BY RANGE(c1) ;
CREATE TABLE
postgres=> CREATE TABLE part1v1 (LIKE part1) ;
CREATE TABLE
postgres=> ALTER TABLE part1v1 ADD CONSTRAINT pk_part1v1 PRIMARY KEY (c1, c2) ;
ALTER TABLE
postgres=> ALTER TABLE part1 ATTACH PARTITION part1v1 FOR VALUES FROM (100) TO 
(200) ;
ALTER TABLE
postgres=> \d part1v1
 Table "public.part1v1"
 Column | Type  | Collation | Nullable | Default
+---+---+--+-
 c1 | integer   |   | not null |
 c2 | integer   |   | not null |
 c3 | character varying(10) |   |  |
Partition of: part1 FOR VALUES FROM (100) TO (200)
Indexes:
"part1v1_pkey" PRIMARY KEY, btree (c1)
"pk_part1v1" PRIMARY KEY, btree (c1, c2)

Regards,

Noriyoshi Shinoda

-Original Message-
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp] 
Sent: Tuesday, February 20, 2018 6:24 PM
To: Alvaro Herrera ; Peter Eisentraut 
; Jaime Casanova 

Cc: Jesper Pedersen ; Pg Hackers 

Subject: Re: unique indexes on partitioned tables

Hi.

On 2018/02/20 5:45, Alvaro Herrera wrote:
> I pushed this now, with fixes for the last few comments there were.

I noticed with the commit that, while ON CONFLICT (conflict_target) DO UPDATE 
gives a less surprising error message by catching it in the parser, ON CONFLICT 
(conflict_target) DO NOTHING will go into the executor without the necessary 
code to handle the case.  Example:

create table p (a int primary key, b text) partition by list (a);
create table p12 partition of p for values in (1, 2);
create table p3 partition of p (a unique) for values in (3);

insert into p values (1, 'a') on conflict (a) do nothing;
ERROR:  unexpected failure to find arbiter index

Attached is a patch to fix that.  Actually, there are two -- one that adjusts 
the partitioned table tests in insert_conflict.sql to have a partitioned unique 
index and another that fixes the code.

I suppose we'd need to apply this temporarily until we fix the ON CONFLICT
(conflict_target) case to be able to use partitioned indexes.
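
To make the intent concrete, here is a hedged sketch of what should work once
the fix is applied (object names are illustrative; the result shown is the
intended behaviour rather than verified output):

create table q (a int, b text) partition by list (a);
create table q1 partition of q for values in (1, 2);
create unique index on q (a);  -- partitioned unique index on the conflict target
insert into q values (1, 'a');
insert into q values (1, 'b') on conflict (a) do nothing;
-- expected: INSERT 0 0, with the arbiter index found in the leaf partition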

Thanks,
Amit


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-08 Thread Robert Haas
On Thu, Mar 8, 2018 at 7:54 AM, Pavan Deolasee  wrote:
>> > +   /*
>> > +* SQL Standard says that WHEN AND conditions must not
>> > +* write to the database, so check we haven't written
>> > +* any WAL during the test. Very sensible that is, since
>> > +* we can end up evaluating some tests multiple times if
>> > +* we have concurrent activity and complex WHEN clauses.
>> > +*
>> > +* XXX If we had some clear form of functional labelling
>> > +* we could use that, if we trusted it.
>> > +*/
>> > +   if (startWAL < GetXactWALBytes())
>> > +   ereport(ERROR,
>> > +   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> > +errmsg("cannot write to database within WHEN
>> > AND condition")));
>>
>> This needs to go. Apart from the fact that GetXactWALBytes() is buggy
>> (it returns int64 for the unsigned type XLogRecPtr), the whole idea
>> just seems unnecessary. I don't see why this is any different to using
>> a volatile function in a regular UPDATE.
>
> I removed this code since it was wrong. We might want to add some basic
> checks for existence of volatile functions in the WHEN or SET clauses. But I
> agree, it's no different than regular UPDATEs. So maybe not a big deal.

I just caught up on this thread.  I'm definitely glad to see that code
go because, wow, that is all kinds of wrong.  I don't see a real need
to add any kind of replacement check, either.  Prohibiting volatile
functions here doesn't seem likely to accomplish anything useful.  It
seems like the most we'd want to do is mention this in the documentation
somehow, and I'm not even sure we really need to do that much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-08 Thread Jeevan Chalke
On Thu, Mar 8, 2018 at 7:49 PM, Robert Haas  wrote:

> On Thu, Mar 8, 2018 at 9:15 AM, Jeevan Chalke
>  wrote:
> > I am not sure why we don't set reltarget on the grouped_rel too.
> >
> > But if we do so, as we did for partially_grouped_rel, then it will be a lot
> > easier for partitionwise aggregate, since we then don't have to pass the
> > target to path-creating functions like create_append_path. We also now need
> > to update generate_gather_paths() to take a target, as it is now being
> > called on grouped_rel, in which reltarget is not set.
> >
> > But yes, if there is a specific reason we can't do so, then I agree with
> > what Ashutosh said. I am not aware of such a reason, though.
>
> I see no problem with setting reltarget for the grouped_rel.  Before
> we added partially_grouped_rel, that rel computed paths with two
> different targets: partial paths had the partial grouping target, and
> non-partial paths had the ordinary grouping target.  But that's fixed
> now.
>

OK. I will update my changes accordingly.
If we set reltarget on the grouped_rel now, then I don't need one of the
refactoring patches, which passes the target to the path-creation functions.


> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-08 Thread Robert Haas
On Thu, Mar 8, 2018 at 9:15 AM, Jeevan Chalke
 wrote:
> I am not sure why we don't set reltarget on the grouped_rel too.
>
> But if we do so, as we did for partially_grouped_rel, then it will be a lot
> easier for partitionwise aggregate, since we then don't have to pass the
> target to path-creating functions like create_append_path. We also now need
> to update generate_gather_paths() to take a target, as it is now being
> called on grouped_rel, in which reltarget is not set.
>
> But yes, if there is a specific reason we can't do so, then I agree with
> what Ashutosh said. I am not aware of such a reason, though.

I see no problem with setting reltarget for the grouped_rel.  Before
we added partially_grouped_rel, that rel computed paths with two
different targets: partial paths had the partial grouping target, and
non-partial paths had the ordinary grouping target.  But that's fixed
now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-08 Thread Jeevan Chalke
On Thu, Mar 8, 2018 at 1:15 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Wed, Mar 7, 2018 at 8:07 PM, Jeevan Chalke
>  wrote:
> Here are some more review comments esp. on
> try_partitionwise_grouping() function. BTW name of that function
> doesn't go in sync with enable_partitionwise_aggregation (which btw is
> in sync with enable_fooagg GUCs). But it goes in sync with
> create_grouping_paths(). It looks like we have already confused
> aggregation and grouping e.g. enable_hashagg may affect path creation
> when there is no aggregation involved i.e. only grouping is involved
> but create_grouping_paths will create paths when there is no grouping
> involved. Generally it looks like the function names use grouping to
> mean both aggregation and grouping but GUCs use aggregation to mean
> both of those. So, the naming in this patch looks consistent with the
> current naming conventions.
>
> +grouped_rel->part_scheme = input_rel->part_scheme;
> +grouped_rel->nparts = nparts;
> +grouped_rel->boundinfo = input_rel->boundinfo;
> +grouped_rel->part_rels = part_rels;
>
> You need to set the part_exprs which will provide partition keys for this
> partitioned relation. I think, we should include all the part_exprs of
> input_rel which are part of GROUP BY clause. Since any other expressions in
> part_exprs are not part of GROUP BY clause, they can not appear in the
> targetlist without an aggregate on top. So they can't be part of the
> partition
> keys of the grouped relation.
>
> In create_grouping_paths() we fetch both partial as well as fully grouped
> rel
> for given input relation. But in case of partial aggregation, we don't need
> fully grouped rel since we are not computing full aggregates for the
> children.
> Since fetch_upper_rel() creates a relation when one doesn't exist, we are
> unnecessarily creating fully grouped rels in this case. For thousands of
> partitions that's a lot of memory wasted.
>
> I see a similar issue with create_grouping_paths() when we are computing
> only
> full aggregates (either because partial aggregation is not possible or
> because
> parallelism is not possible). In that case, we unconditionally create
> partially
> grouped rels. That too would waste a lot of memory.
>
> I think that partially_grouped_rel, when required, is partitioned
> irrespective of whether we do full aggregation per partition or not. So we
> should have its part_rels and other partitioning properties set. I agree
> that right now we won't use this information anywhere. It may be useful in
> case we devise a way to use partially_grouped_rel directly, without using
> grouped_rel for planning beyond grouping, which seems unlikely.
>
> +
> +/*
> + * Parallel aggregation requires partial target, so compute it
> here
> + * and translate all vars. For partial aggregation, we need it
> + * anyways.
> + */
> +partial_target = make_partial_grouping_target(root, target,
> +  havingQual);
>
> Don't we have this available in partially_grouped_rel?
>
> That shows one asymmetry that Robert's refactoring has introduced. We
> don't set
> reltarget of grouped_rel but set reltarget of partially_grouped_rel. If
> reltarget of grouped_rel changes across paths (the reason why we have not
> set
> it in grouped_rel), shouldn't reltarget of partially grouped paths change
> accordingly?
>

I am not sure why we don't set reltarget on the grouped_rel too.

But if we do so, as we did for partially_grouped_rel, then it will be a lot
easier for partitionwise aggregate, since we then don't have to pass the
target to path-creating functions like create_append_path. We also now need
to update generate_gather_paths() to take a target, as it is now being
called on grouped_rel, in which reltarget is not set.

But yes, if there is a specific reason we can't do so, then I agree with
what Ashutosh said. I am not aware of such a reason, though.


> +
> +/*
> + * group_by_has_partkey
> + *
> + * Returns true, if all the partition keys of the given relation are part
> of
> + * the GROUP BY clauses, false otherwise.
> + */
> +static bool
> +group_by_has_partkey(RelOptInfo *input_rel, PathTarget *target,
> + List *groupClause)
>
> We could modify this function to return the list of part_exprs which are
> part
> of group clause and use that as the partition keys of the grouped_rel if
> required. If group by doesn't have all the partition keys, the function
> would
> return a NULL list.
>
> Right now, in case of full aggregation, partially_grouped_rel is populated
> with
> the partial paths created by adding partial aggregation to the partial
> paths of
> input relation. But we are not trying to create partial paths by (parallel)
> appending the (non)partial paths from the child partially_grouped_rel.
> Have we
> thought about that? Would such paths have different shapes from the ones
>

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread Robert Haas
On Thu, Mar 8, 2018 at 12:34 AM, Pavan Deolasee
 wrote:
> I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does not
> do anything to deal with the fact that t_ctid may no longer point to itself
> to mark end of the chain. I just can't see how that would work. But if it
> does, it needs good amount of comments explaining why and most likely
> updating comments at other places where chain following is done. For
> example, how's this code in heap_get_latest_tid() is still valid? Aren't we
> setting "ctid" to some invalid value here?

So the general idea of the patch is that this new kind of marking
marks the CTID chain as "broken" and that code which cares about
following CTID chains forward can see that it's reached a point where
the chain is broken and throw an error saying "hey, I can't do the
stuff we normally do in concurrency scenarios, because the CTID chain
got broken by a cross-partition update".

I don't think it's practical to actually make CTID links across
partitions work.  Certainly not in time for v11.  If somebody wants to
try that at some point in the future, cool.  But that's moving the
goalposts an awfully long way.  When this was discussed about a year
ago, my understanding is that there was a consensus that doing nothing
was not acceptable, but that throwing an error in the cases where
anomalies would have happened was good enough.  I don't think anyone
argued that we had to be able to perfectly mimic the usual EPQ
semantics as a condition of having update tuple routing.  That's
setting the bar at a level that we're not going to be able to reach in
the next couple of weeks.  I suppose we could still decide that if we
can't have that, we don't want update tuple routing at all, but I
think that's an overreaction.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Subscription code improvements

2018-03-08 Thread Masahiko Sawada
On Wed, Mar 7, 2018 at 11:13 PM, Alvaro Herrera 
wrote:
> 0002 looks like a good improvement to me.  The existing routine is
> messy, and apparently it's so just to save one LockSharedObject plus
> cache lookup; IMO it's not worth it.  Patched code looks simpler.  If
> there are cases where having the combined behavior is useful, it's not
> clear what they are.  (If I understand correctly, the reason is that a
> sync worker could try to insert-or-update the row after some other
> process deleted it [because of removing the table from subscription?]
> ... but that seems to work out *simpler* with the new code.  So what's
> up?)
>

The functions calling SetSubscriptionRelState() with update_only=false (i.e.
going to do insert-or-update) are two: CreateSubscription() and
AlterSubscription_refresh(). AFAICS these two functions don't actually need
such insert-or-update functionality, because it cannot happen that a backend
process creates or alters a subscription with a name that already exists.
Since CreateSubscription() inserts a heap tuple into the system catalog, one
transaction ends up with a "key already exists" error if two processes try
to create a subscription with the same name. Similarly for
AlterSubscription_refresh(): since we acquire AccessExclusiveLock on the
subscription object before getting the new table list from the publication,
updating an existing entry doesn't happen. So this patch changes the
SetSubscriptionRelState() calls with update_only=false to
AddSubscriptionRelState() and the others to UpdateSubscriptionRelState().

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-08 Thread Robert Haas
On Thu, Mar 8, 2018 at 2:45 AM, Ashutosh Bapat
 wrote:
> +grouped_rel->part_scheme = input_rel->part_scheme;
> +grouped_rel->nparts = nparts;
> +grouped_rel->boundinfo = input_rel->boundinfo;
> +grouped_rel->part_rels = part_rels;
>
> You need to set the part_exprs which will provide partition keys for this
> partitioned relation. I think, we should include all the part_exprs of
> input_rel which are part of GROUP BY clause. Since any other expressions in
> part_exprs are not part of GROUP BY clause, they can not appear in the
> targetlist without an aggregate on top. So they can't be part of the partition
> keys of the grouped relation.
>
> In create_grouping_paths() we fetch both partial as well as fully grouped rel
> for given input relation. But in case of partial aggregation, we don't need
> fully grouped rel since we are not computing full aggregates for the children.
> Since fetch_upper_rel() creates a relation when one doesn't exist, we are
> unnecessarily creating fully grouped rels in this case. For thousands of
> partitions that's a lot of memory wasted.
>
> I see a similar issue with create_grouping_paths() when we are computing only
> full aggregates (either because partial aggregation is not possible or because
> parallelism is not possible). In that case, we unconditionally create 
> partially
> grouped rels. That too would waste a lot of memory.

This kind of goes along with the suggestion I made yesterday to
introduce a new function, which at the time I proposed calling
initialize_grouping_rel(), to set up new grouped or partially grouped
relations.  By doing that it would be easier to ensure the
initialization is always done in a consistent way but only for the
relations we actually need.  But maybe we should call it
fetch_grouping_rel() instead.  The idea would be that instead of
calling fetch_upper_rel() we would call fetch_grouping_rel() when it
is a question of the grouped or partially grouped relation.  It would
either return the existing relation or initialize a new one for us.  I
think that would make it fairly easy to initialize only the ones we're
going to need.

Also, I don't think we should be paranoid about memory usage here.
It's good to avoid creating new rels that are obviously not needed,
not only because of memory consumption but also because of the CPU
consumption involved, but I don't want to contort the code to squeeze
every last byte of memory out of this.

On a related note, I'm not sure that this code is correct:

+   if (!isPartialAgg)
+   {
+   grouped_rel->part_scheme = input_rel->part_scheme;
+   grouped_rel->nparts = nparts;
+   grouped_rel->boundinfo = input_rel->boundinfo;
+   grouped_rel->part_rels = part_rels;
+   }

It's not obvious to me why this should be done only when
!isPartialAgg.  The comments claim that the partially grouped child
rels can't be considered partitions of the top-level partially
grouped rel, but it seems to me that we could consider them that way.
Maybe I'm missing something.
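
As an aside for readers of the thread, a minimal SQL illustration of the full
vs. partial partition-wise aggregation cases being discussed (table and column
names are assumptions made purely for illustration):

CREATE TABLE t (a int, b int) PARTITION BY LIST (a);
CREATE TABLE t1 PARTITION OF t FOR VALUES IN (1);
CREATE TABLE t2 PARTITION OF t FOR VALUES IN (2);

-- GROUP BY contains the partition key: each partition can be fully
-- aggregated on its own and the results appended (full partition-wise
-- aggregation; the grouped rel can be treated as partitioned).
SELECT a, count(*) FROM t GROUP BY a;

-- GROUP BY does not contain the partition key: only partial aggregates can
-- be computed per partition, with a finalization step above the append
-- (partial partition-wise aggregation).
SELECT b, count(*) FROM t GROUP BY b;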

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [FEATURE PATCH] pg_stat_statements with plans (v02)

2018-03-08 Thread Julian Markwort
I've goofed up now, sorry for failing to attach my updated patch.

Am Donnerstag, den 08.03.2018, 14:55 +0100 schrieb Julian Markwort:
> Tom Lane wrote on 2018-03-02:
> > You need to make your changes in a 1.5--1.6
> > upgrade file.  Otherwise there's no clean path for existing
> > installations
> > to upgrade to the new version.
> 
> I've addressed all the issues that were brought up so far:
> 1. There is now only an added 1.5--1.6.sql file.
> 2. The overhead, as previously discussed (as much as a 12% decrease in
> TPS during read-only tests), is now gone; the problem was that I was
> collecting the plan string before checking whether it needed to be
> stored at all. The patched version is now 99.95% as fast as unmodified
> pg_stat_statements.
> 3. I've cleaned up my own code and made sure it adheres to GNU C coding
> style; I was guilty of some // comments, and curly brackets were
> sometimes on the same line as my control structures.
> 
> I'd love to hear more feedback, here are two ideas to polish this
> patch:
> a) Right now, good_plan and bad_plan collection can be activated and
> deactivated with separate GUCs. I think it would be sensible to
> collect
> either both or none. (This would result in fewer convoluted
> conditionals.)
> b) Would you like to be able to tune the percentiles yourself, to
> adjust for the point at which a new plan is stored?
> 
> Greetings
> Julian

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 39b368b..49bb462 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -4,7 +4,8 @@ MODULE_big = pg_stat_statements
 OBJS = pg_stat_statements.o $(WIN32RES)
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.4--1.5.sql \
+DATA = pg_stat_statements--1.4.sql \
+	pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
 	pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
 	pg_stat_statements--1.1--1.2.sql pg_stat_statements--1.0--1.1.sql \
 	pg_stat_statements--unpackaged--1.0.sql
diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index 5318c35..3e79890 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -395,4 +395,40 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
  SELECT pg_stat_statements_reset() | 1 |1
 (8 rows)
 
+-- test to see if any plans have been recorded.
+SELECT
+  CASE WHEN good_plan_time > 0 THEN 1 ELSE 0 END,
+  CASE WHEN bad_plan_time > 0 THEN 1 ELSE 0 END,
+  CASE WHEN good_plan_timestamp >= timestamp '1970-01-01 00:00:00' THEN 1 ELSE 0 END,
+  CASE WHEN good_plan_timestamp >= timestamp '1970-01-01 00:00:00' THEN 1 ELSE 0 END
+FROM pg_stat_statements ORDER BY query COLLATE "C";
+ case | case | case | case 
+--+--+--+--
+0 |0 |0 |0
+0 |0 |0 |0
+0 |0 |0 |0
+0 |0 |0 |0
+0 |0 |0 |0
+0 |0 |0 |0
+1 |1 |1 |1
+1 |1 |1 |1
+1 |1 |1 |1
+(9 rows)
+
+-- test if there is some text in the recorded plans.
+select substr(good_plan, 0, 11), substr(bad_plan, 0, 11) from pg_stat_statements ORDER BY query COLLATE "C";
+   substr   |   substr   
++
+| 
+| 
+| 
+| 
+| 
+| 
+ Query Text | Query Text
+ Query Text | Query Text
+ Query Text | Query Text
+ Query Text | Query Text
+(10 rows)
+
 DROP EXTENSION pg_stat_statements;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.5--1.6.sql b/contrib/pg_stat_statements/pg_stat_statements--1.5--1.6.sql
new file mode 100644
index 000..5302d01
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.5--1.6.sql
@@ -0,0 +1,78 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.6'" to load this file. \quit
+
+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
+ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements(boolean);
+ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements_reset();
+
+/* Then we can drop them */
+DROP VIEW pg_stat_statements;
+DROP FUNCTION pg_stat_statements(boolean);
+DROP FUNCTION pg_stat_statements_reset();
+
+-- Register functions.
+CREATE FUNCTION pg_stat_statements_reset()
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE FUNCTION pg_stat_statements_good_plan_reset(IN queryid bigint)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE FUNCTION pg_stat_statements_bad_plan_reset(IN queryid big

Re: [HACKERS] [FEATURE PATCH] pg_stat_statements with plans (v02)

2018-03-08 Thread Julian Markwort
Tom Lane wrote on 2018-03-02:
> You need to make your changes in a 1.5--1.6
> upgrade file.  Otherwise there's no clean path for existing
> installations
> to upgrade to the new version.

I've addressed all the issues that were brought up so far:
1. There is now only an added 1.5--1.6.sql file.
2. The overhead, as previously discussed (as much as a 12% decrease in
TPS during read-only tests), is now gone; the problem was that I was
collecting the plan string before checking whether it needed to be stored
at all. The patched version is now 99.95% as fast as unmodified
pg_stat_statements.
3. I've cleaned up my own code and made sure it adheres to GNU C coding
style; I was guilty of some // comments, and curly brackets were
sometimes on the same line as my control structures.

I'd love to hear more feedback; here are two ideas to polish this
patch:
a) Right now, good_plan and bad_plan collection can be activated and
deactivated with separate GUCs. I think it would be sensible to collect
either both or none. (This would result in fewer convoluted
conditionals.)
b) Would you like to be able to tune the percentiles yourself, to
adjust for the point at which a new plan is stored?
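
In case it helps reviewers try the feature, a usage sketch based on the
columns and functions visible in the posted patch (good_plan, bad_plan and
the per-query reset function; the particular queries are just examples):

-- look at the stored plans next to the usual counters
SELECT query, calls,
       substr(good_plan, 1, 60) AS good_plan,
       substr(bad_plan, 1, 60) AS bad_plan
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 5;

-- discard the stored "good" plan for selected statements
SELECT pg_stat_statements_good_plan_reset(queryid)
FROM pg_stat_statements
WHERE query LIKE 'UPDATE%';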

Greetings
Julian



Re: Failed to request an autovacuum work-item in silence

2018-03-08 Thread Alvaro Herrera
Hi

I was thinking that the BRIN code requesting the workitem would print
the error message based on the return value.  There is no point to
returning a boolean indicator if the caller isn't going to do anything
with it ...  This means you don't need to convert the type to string in
autovacuum.c (which would defeat attempts at generalizing this code).

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE ADD COLUMN fast default

2018-03-08 Thread David Rowley
On 8 March 2018 at 18:40, Andrew Dunstan  wrote:
>  select * from t;
>  fastdef tps = 107.145811
>  master  tps = 150.207957
>
> "select * from t" used to be about a wash, but with this patch it's
> got worse. The last two queries were worse and are now better, so
> that's a win.

How does it compare to master if you drop a column out of the table?
Physical tlists will be disabled in that case too. I imagine the
performance of master will drop much lower than the all columns
missing case.
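
Something along these lines, and then re-run the same pgbench script; the
table shape and row count here are assumptions, the only point being that the
table ends up with a dropped column:

CREATE TABLE t2 AS SELECT g AS c1, g AS c2 FROM generate_series(1, 1000000) g;
ALTER TABLE t2 DROP COLUMN c2;
-- pgbench script: select * from t2;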



-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-03-08 Thread Shubham Barai
On 07-Mar-2018 11:00 PM, "Alvaro Herrera"  wrote:

I suggest to create a new function GinPredicateLockPage() that checks
whether fast update is enabled for the index.  The current arrangement
looks too repetitive and it seems easy to make a mistake.

Stylistically, please keep #include lines ordered alphabetically, and
cut long lines to below 80 chars.


Okay, I will update the patch.


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread amul sul
On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila  wrote:
> On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
>  wrote:
>>
[...]
>
>> But if it
>> does, it needs good amount of comments explaining why and most likely
>> updating comments at other places where chain following is done. For
>> example, how's this code in heap_get_latest_tid() is still valid? Aren't we
>> setting "ctid" to some invalid value here?
>>
>> 2302 /*
>> 2303  * If there's a valid t_ctid link, follow it, else we're done.
>> 2304  */
>> 2305 if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
>> 2306 HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
>> 2307 ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
>> 2308 {
>> 2309 UnlockReleaseBuffer(buffer);
>> 2310 break;
>> 2311 }
>> 2312
>> 2313 ctid = tp.t_data->t_ctid;
>>
>
> I have not tested, but it seems this could be problematic, but I feel
> we could deal with such cases by checking invalid block id in the
> above if check.  Another one such case is in EvalPlanQualFetch
>

I tried the following test to hit this code and found that the situation is
not as bad as it might seem.

heap_get_latest_tid() will follow the chain and return the latest tid iff the
current tuple satisfies the visibility check (via HeapTupleSatisfiesVisibility);
in our case it doesn't, so we are safe here, but I agree with Amit -- it is
better to add an invalid block id check.
In EvalPlanQualFetch() the invalid block id check is already there before the
ItemPointerEquals call.

=== TEST ==
create table foo (a int2, b text) partition by list (a);
create table foo1 partition of foo for values IN (1);
create table foo2 partition of foo for values IN (2);
insert into foo values(1, 'Initial record');
update foo set b= b || ' -> update1' where a=1;
update foo set b= b || ' -> update2' where a=1;

postgres=# select tableoid::regclass, ctid, * from foo;
 tableoid | ctid  | a |                  b
----------+-------+---+--------------------------------------
 foo1     | (0,3) | 1 | Initial record -> update1 -> update2
(1 row)

postgres=# select currtid2('foo1','(0,1)');
 currtid2
----------
 (0,3)
(1 row)

postgres=# select tableoid::regclass, ctid, * from foo;
 tableoid | ctid  | a |                      b
----------+-------+---+-----------------------------------------------
 foo2     | (0,1) | 2 | Initial record -> update1 -> update2-> moved
(1 row)

postgres=# select currtid2('foo1','(0,1)');
 currtid2
----------
 (0,1)
(1 row)

=== END ===


>> This is just one example. I am almost certain there are many such cases that
>> will require careful attention.
>>
>
> Right, I think we should be able to detect and fix such cases.
>

I will look carefully into the places where ItemPointerEquals() is called
for a heap tuple.

Regards,
Amul



Re: JIT compiling with LLVM v11

2018-03-08 Thread Thomas Munro
On Wed, Mar 7, 2018 at 3:49 PM, Thomas Munro
 wrote:
> make check at today's HEAD of your jit branch crashes on my FreeBSD
> box.  The first thing to crash is this query from point.sql:
>
> LOG:  server process (PID 87060) was terminated by signal 4: Illegal 
> instruction
> DETAIL:  Failed process was running: SELECT '' AS thirtysix, p1.f1 AS
> point1, p2.f1 AS point2, p1.f1 <-> p2.f1 AS dist
>FROM POINT_TBL p1, POINT_TBL p2
>ORDER BY dist, p1.f1[0], p2.f1[0];

Hmm.  It's trying to execute an AVX instruction.

* thread #1, stop reason = breakpoint 1.1
frame #0: llvmjit.so`ExecRunCompiledExpr(state=0x000801de4880,
econtext=0x000801de3560, isNull="") at llvmjit_expr.c:432
   429
   430  state->evalfunc = func;
   431
-> 432  return func(state, econtext, isNull);
   433  }
   434
   435  static void emit_lifetime_end(ExprState *state, LLVMModuleRef
mod, LLVMBuilderRef b);
(lldb) s
Process 44513 stopped
* thread #1, stop reason = signal SIGILL: privileged instruction
frame #0: 0x000801157193
->  0x801157193: vmovsd (%rax), %xmm0 ; xmm0 = mem[0],zero
0x801157197: vmovsd 0x8(%rax), %xmm1  ; xmm1 = mem[0],zero
0x80115719c: vsubsd (%rcx), %xmm0, %xmm2
0x8011571a0: vsubsd 0x8(%rcx), %xmm1, %xmm0
(lldb) bt
* thread #1, stop reason = signal SIGILL: privileged instruction
  * frame #0: 0x000801157193

This is running on a "Intel(R) Celeron(R) CPU G1610T @ 2.30GHz" with no AVX.

I am not sure if that is real though, because the stack is immediately
corrupted.  So either func is not really a function, or it is but was
compiled for the wrong target.  I see that you call
LLVMCreateTargetMachine() with the result of LLVMGetHostCPUName() as
cpu.  For me that's "ivybridge", so I tried hard coding "generic"
instead and it didn't help.  I see that you say "" for features, which
is where one would normally put "avx" to turn on AVX instructions, so
I think perhaps that theory is entirely bogus.

-- 
Thomas Munro
http://www.enterprisedb.com



Concurrency bug in UPDATE of partition-key

2018-03-08 Thread Amit Khandekar
(Mail subject changed; original thread : [1])

On 8 March 2018 at 11:57, Amit Khandekar  wrote:
> On 8 March 2018 at 09:15, Pavan Deolasee  wrote:
>>
>> CREATE TABLE pa_target (key integer, val text)
>> PARTITION BY LIST (key);
>> CREATE TABLE part1 PARTITION OF pa_target FOR VALUES IN (1);
>> CREATE TABLE part2 PARTITION OF pa_target FOR VALUES IN (2);
>> INSERT INTO pa_target VALUES (1, 'initial1');
>>
>> session1:
>> BEGIN;
>> UPDATE pa_target SET val = val || ' updated by update1' WHERE key = 1;
>> UPDATE 1
>> postgres=# SELECT * FROM pa_target ;
>>  key |             val
>> -----+-----------------------------
>>    1 | initial1 updated by update1
>> (1 row)
>>
>> session2:
>> UPDATE pa_target SET val = val || ' updated by update2', key = key + 1 WHERE
>> key = 1
>> 
>>
>> session1:
>> postgres=# COMMIT;
>> COMMIT
>>
>> 
>>
>> postgres=# SELECT * FROM pa_target ;
>>  key |             val
>> -----+-----------------------------
>>    2 | initial1 updated by update2
>> (1 row)
>>
>> Ouch. The committed updates by session1 are overwritten by session2. This
>> clearly violates the rules that rest of the system obeys and is not
>> acceptable IMHO.
>>
>> Clearly, ExecUpdate() while moving rows between partitions is missing out on
>> re-constructing the to-be-updated tuple, based on the latest tuple in the
>> update chain. Instead, it's simply deleting the latest tuple and inserting a
>> new tuple in the new partition based on the old tuple. That's simply wrong.
>
> You are right. This needs to be fixed. This is a different issue than
> the particular one that is being worked upon in this thread, and both
> these issues have different fixes.
>

As discussed above, there is a concurrency bug where during UPDATE of
a partition-key, another transaction T2 updates the same row.

> Like you said, the tuple needs to be reconstructed when ExecDelete()
> finds that the row has been updated by another transaction. We should
> send back this information from ExecDelete() (I think tupleid
> parameter gets updated in this case), and then in ExecUpdate() we
> should goto lreplace, so that the row is fetched back similar to
> how it happens when heap_update() knows that the tuple was updated.

The attached WIP patch is how the fix is turning out to be.
ExecDelete() passes back the epqslot returned by EvalPlanQual().
ExecUpdate() uses this re-fetched tuple to re-process the row, just
like it does when heap_update() returns HeapTupleUpdated.

I need to write an isolation test for this fix.
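
For clarity, the outcome the fix is aiming for with Pavan's example above
(the row shown is the intended EvalPlanQual behaviour, not verified output
from the WIP patch):

-- after session1 commits, session2's cross-partition UPDATE is re-evaluated
-- against the latest tuple version, so neither update is lost:
SELECT * FROM pa_target;
--  key |                      val
-- -----+------------------------------------------------
--    2 | initial1 updated by update1 updated by update2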

[1] : 
https://www.postgresql.org/message-id/CAJ3gD9d%3DwcbF-G00bYo4v5iWC69zm1UTSNxyRLpOgO9j4AtLWQ%40mail.gmail.com


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


fix_concurrency_bug_WIP.patch
Description: Binary data


Re: csv format for psql

2018-03-08 Thread Pavel Stehule
2018-03-08 11:17 GMT+01:00 Pavel Stehule :

>
>
> 2018-03-08 11:05 GMT+01:00 Fabien COELHO :
>
>>
>> Hello Daniel,
>>
>> PFA a v3 patch that implements that, along with regression tests this
>>> time.
>>>
>>
>> My 0.02 €:
>>
>> Patch applies cleanly, compiles, make check ok, doc generation ok.
>>
>> I'm in favor of having a simple psql way to generate a convenient and
>> compliant CSV output for export/import.
>>
>> I also think that a short option brings little value, and "--csv" is good
>> enough for the purpose, so I would agree to remove the "-C" binding.
>>
>> About "fieldsep_csv", I do not like much the principle of having
>> different output variables to represent the same concept depending on the
>> format. I would rather have reused fieldsep as in your previous submission
>> and set it to "," when under --csv. This is not a strong opinion and other
>> people may differ: the committer opinion is the one to follow:-)
>>
>
That looks like a complex rule, but maybe it is acceptable: if the format is
csv and fieldsep is not defined, then use ','; if it is defined, then use the
defined field separator. I am not sure whether it is possible to distinguish
between the default and an explicitly defined field separator.

If not, then I prefer fieldsep_csv.



>
>> The "\n" eol style is hardcoded. Should it use "recordsep"? For instance,
>> https://tools.ietf.org/html/rfc4180 seems to specify CRLF end of lines.
>> The definition is evolving, eg https://www.w3.org/TR/tabular-data-model/
>> accepts both "\r" and "\r\n". I do not like using windows eol, but I think
>> that it should be possible to do it, which is not the case with this
>> version.
>>
>
> In this case recordsep is not a problem - the default is ok.
>
>
>>
>> The "\pset format" error message in "do_pset" shows values in seemingly
>> random order. The situation is pre-existing but not really satisfactory.
>> I'd suggest to put all values in alphabetical order.
>>
>> In csv_print_field & csv_print_text, you are not consistent when putting
>> braces on blocks with only one instruction. I'd suggest not to put braces
>> in that case.
>>
>> I'd suggest that tests should include more types, not just strings. I
>> would suggest int, float, timestamp, bytea, an array (which uses , as a
>> separator), json (which uses both " and ,)...
>>
>> --
>> Fabien.
>
>
>


Re: csv format for psql

2018-03-08 Thread Pavel Stehule
2018-03-08 11:05 GMT+01:00 Fabien COELHO :

>
> Hello Daniel,
>
> PFA a v3 patch that implements that, along with regression tests this time.
>>
>
> My 0.02 €:
>
> Patch applies cleanly, compiles, make check ok, doc generation ok.
>
> I'm in favor of having a simple psql way to generate a convenient and
> compliant CSV output for export/import.
>
> I also think that a short option brings little value, and "--csv" is good
> enough for the purpose, so I would agree to remove the "-C" binding.
>
> About "fieldsep_csv", I do not like much the principle of having different
> output variables to represent the same concept depending on the format. I
> would rather have reused fieldsep as in your previous submission and set it
> to "," when under --csv. This is not a strong opinion and other people may
> differ: the committer opinion is the one to follow:-)
>
> The "\n" eol style is hardcoded. Should it use "recordsep"? For instance,
> https://tools.ietf.org/html/rfc4180 seems to specify CRLF end of lines.
> The definition is evolving, eg https://www.w3.org/TR/tabular-data-model/
> accepts both "\r" and "\r\n". I do not like using windows eol, but I think
> that it should be possible to do it, which is not the case with this
> version.
>

In this case recordsep is not a problem - the default is ok.


>
> The "\pset format" error message in "do_pset" shows values in seemingly
> random order. The situation is pre-existing but not really satisfactory.
> I'd suggest to put all values in alphabetical order.
>
> In csv_print_field & csv_print_text, you are not consistent when putting
> braces on blocks with only one instruction. I'd suggest not to put braces
> in that case.
>
> I'd suggest that tests should include more types, not just strings. I
> would suggest int, float, timestamp, bytea, an array (which uses , as a
> separator), json (which uses both " and ,)...
>
> --
> Fabien.


Re: csv format for psql

2018-03-08 Thread Fabien COELHO


Hello Daniel,

PFA a v3 patch that implements that, along with regression tests this 
time.


My 0.02 €:

Patch applies cleanly, compiles, make check ok, doc generation ok.

I'm in favor of having a simple psql way to generate a convenient and 
compliant CSV output for export/import.


I also think that a short option brings little value, and "--csv" is good 
enough for the purpose, so I would agree to remove the "-C" binding.


About "fieldsep_csv", I do not like much the principle of having different 
output variables to represent the same concept depending on the format. I 
would rather have reused fieldsep as in your previous submission and set 
it to "," when under --csv. This is not a strong opinion and other people 
may differ: the committer opinion is the one to follow:-)


The "\n" eol style is hardcoded. Should it use "recordsep"? For instance, 
https://tools.ietf.org/html/rfc4180 seems to specify CRLF end of lines. 
The definition is evolving, eg https://www.w3.org/TR/tabular-data-model/ 
accepts both "\r" and "\r\n". I do not like using windows eol, but I think 
that it should be possible to do it, which is not the case with this 
version.


The "\pset format" error message in "do_pset" shows values in seemingly 
random order. The situation is pre-existing but not really satisfactory. 
I'd suggest to put all values in alphabetical order.


In csv_print_field & csv_print_text, you are not consistent when putting 
braces on blocks with only one instruction. I'd suggest not to put braces 
in that case.


I'd suggest that tests should include more types, not just strings. I 
would suggest int, float, timestamp, bytea, an array (which uses , as a 
separator), json (which uses both " and ,)...
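
For the quoting cases, the existing server-side CSV output is a handy
reference point; presumably the option discussed here should produce
equivalent output on the client side (the sample table and data are mine):

CREATE TABLE foo (a int, b text);
INSERT INTO foo VALUES (1, 'plain'), (2, 'has "quotes", and commas');
COPY foo TO STDOUT WITH (FORMAT csv, HEADER);
-- a,b
-- 1,plain
-- 2,"has ""quotes"", and commas"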


--
Fabien.

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread amul sul
On Thu, Mar 8, 2018 at 3:01 PM, Amit Kapila  wrote:
> On Thu, Mar 8, 2018 at 12:52 PM, Pavan Deolasee
>  wrote:
>>
>> On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila 
>> wrote:
>>>
>>> On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
>>>  wrote:
>>> >
>>> > On Tue, Feb 13, 2018 at 12:41 PM, amul sul  wrote:
>>> >>
>>> >> Thanks for the confirmation, updated patch attached.
>>> >>
>>> >
>>> > I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does
>>> > not
>>> > do anything to deal with the fact that t_ctid may no longer point to
>>> > itself
>>> > to mark end of the chain. I just can't see how that would work.
>>> >
>>>
>>> I think it is not that patch doesn't care about the end of the chain.
>>>  For example, ctid pointing to itself is used to ensure that for
>>> deleted rows, nothing more needs to be done like below check in the
>>> ExecUpdate/ExecDelete code path.
>>
>>
>> Yeah, but it only looks for places where it needs to detect deleted tuples
>> and thus wants to throw an error. I am worried about other places where it
>> is assumed that the ctid points to a valid looking tid, self or otherwise. I
>> see no such places being either updated or commented.
>>
>> Now maybe there is no danger because of other protections in place, but it
>> looks hazardous.
>>
>
> Right, I feel we need some tests to prove it, I think as per code I
> can see we need checks in few more places (like the ones mentioned in
> the previous email) apart from where this patch has added.
>
>>>
>>>
>>> I have not tested, but it seems this could be problematic, but I feel
>>> we could deal with such cases by checking invalid block id in the
>>> above if check.  Another one such case is in EvalPlanQualFetch
>>>
>>
>> Right.
>>
>
> Amul, can you please look into the scenario being discussed and see if
> you can write a test to see the behavior.
>

Sure, I'll try.

Regards,
Amul



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread Amit Kapila
On Thu, Mar 8, 2018 at 12:52 PM, Pavan Deolasee
 wrote:
>
> On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila 
> wrote:
>>
>> On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
>>  wrote:
>> >
>> > On Tue, Feb 13, 2018 at 12:41 PM, amul sul  wrote:
>> >>
>> >> Thanks for the confirmation, updated patch attached.
>> >>
>> >
>> > I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does
>> > not
>> > do anything to deal with the fact that t_ctid may no longer point to
>> > itself
>> > to mark end of the chain. I just can't see how that would work.
>> >
>>
>> I think it is not that patch doesn't care about the end of the chain.
>>  For example, ctid pointing to itself is used to ensure that for
>> deleted rows, nothing more needs to be done like below check in the
>> ExecUpdate/ExecDelete code path.
>
>
> Yeah, but it only looks for places where it needs to detect deleted tuples
> and thus wants to throw an error. I am worried about other places where it
> is assumed that the ctid points to a valid looking tid, self or otherwise. I
> see no such places being either updated or commented.
>
> Now maybe there is no danger because of other protections in place, but it
> looks hazardous.
>

Right, I feel we need some tests to prove it, I think as per code I
can see we need checks in few more places (like the ones mentioned in
the previous email) apart from where this patch has added.

>>
>>
>> I have not tested, but it seems this could be problematic, but I feel
>> we could deal with such cases by checking invalid block id in the
>> above if check.  Another one such case is in EvalPlanQualFetch
>>
>
> Right.
>

Amul, can you please look into the scenario being discussed and see if
you can write a test to see the behavior.

>>
>>
>> > What happens if a partition key update deletes a row, but the operation
>> > is
>> > aborted? Do we need any special handling for that case?
>> >
>>
>> If the transaction is aborted than future updates would update the
>> ctid to a new row, do you see any problem with it?
>
>
> I don't know. Maybe there is none. But it needs to be explained why it's
> not a problem.
>

Sure, I guess in that case we need to explain in the comments why it would
be okay after abort.

>>
>>
>> > I am actually worried that we're tinkering with ip_blkid to handle one
>> > corner case of detecting partition key update. This is going to change
>> > on-disk format and probably need more careful attention. Are we certain
>> > that
>> > we would never require update-chain following when partition keys are
>> > updated?
>> >
>>
>> I think we should never need update-chain following when the row is
>> moved from one partition to another partition, otherwise, we don't
>> change anything on the tuple.
>
>
> I am not sure I follow. I understand that it's probably a tough problem to
> follow update chain from one partition to another. But why do you think we
> would never need that? What if someone wants to improve on the restriction
> this patch is imposing and actually implement partition key UPDATEs the way
> we do for regular tables i.e. instead of throwing error, we actually
> update/delete the row in the new partition?
>

I think even if we want to lift this restriction, storing the ctid link
of another partition appears to be a major change somebody might want
to take on for this feature.  We had some discussion on this matter earlier,
where Robert and Greg seem to have said something like that as well.
See [1][2].  I think one way could be that if updates/deletes encounter
InvalidBlkID, they can use the partitioned table's metadata to re-find the
row.  We already had a discussion on this point in the original thread
"UPDATE of partition key" and agreed that throwing an error is the better
way to deal with it.



[1] - 
https://www.postgresql.org/message-id/CAM-w4HPis7rbnwi%2BoXjnouqMSRAC5DeVcMdxEXTMfDos1kaYPQ%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CA%2BTgmoY1W-jaS0vH8f%3D5xKQB3EWj5L0XcBf6P7WB7JqbKB3tSQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [PATCH] btree_gin, add support for uuid, bool, name, bpchar and anyrange types

2018-03-08 Thread Matheus de Oliveira
Hi all.

On 4 Mar 2018 at 16:00, "Tomas Vondra" 
wrote:


1) I personally am not that sure GIN indexes on ranges are very useful,
considering those columns are usually queried for containment (i.e. is
this value contained in the range) rather than equality. And we already
have gist/spgist opclasses for ranges, which seems way more useful. We
seem to already have hash opclasses for ranges, but I'm not sure that's
a proof of usefulness.


So I'd cut this, although it's a tiny amount of code.


I pondered that too, and I also haven't thought of a good use case,
but since it has B-Tree support, I thought it should be included in
btree_gin as well, so I did.

If you all decide to remove it, I'm totally fine with that.



2) The patch tweaks a couple of .sql files from previous versions. It
modifies a comment in the 1.0--1.1 upgrade script from

-- macaddr8 datatype support new in 10.0.

to

-- macaddr8 datatype support new in 1.0.

which is obviously incorrect, because not only is that in the upgrade script
to 1.1 (so it should be "new in 1.1"), but the original comment probably
refers to PostgreSQL 10, not the btree_gin version.


I forgot I had changed that, sorry. I think, though, that 10.0 was a typo,
since it was introduced way before PostgreSQL 10. But you are right,
it should be 1.1.


It also tweaks \echo in 1.1--1.2 upgrade script to mention 1.2 instead
of 1.1. This change seems correct, but it seems more like a bugfix than
part of this patch.


I can send it later as a bugfix then. Sounds better indeed.



3) The documentation refers to range, which is bogus as
there is no such type. It should say anyrange instead.


I've just followed what has been done for the ENUM type; if we are going to
change it for range, we should also change to use anyenum, no?



4) The opclass is called "anyrange_ops", which is somewhat inconsistent
with the opclasses for btree, hash, gist and spgist. All those index
types use "range_ops" so I suggest using the same name.


Ok.


5) I've tweaked a comment in btree_gin.c a bit, the original wording
seemed a bit unclear to me. And I've moved part of the comment to the
following function (it wasn't really about the left-most value).


My English skills aren't very good, so feel free to tweak any comment or
documentation I have written ;)


Attached is a patch that does all of this, but it may be incomplete (I
haven't really checked if it breaks tests, for example).


I really appreciate your review. I'd like to know what you think about my
comments above.
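
For reference, the kind of usage the patch is meant to enable, here for the
new uuid support (index and query are illustrative and assume the patched
extension is installed):

CREATE EXTENSION btree_gin;
CREATE TABLE items (id uuid, tag text);
CREATE INDEX items_id_gin ON items USING gin (id);  -- uses the new uuid opclass
SELECT * FROM items WHERE id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11';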

Thanks a lot.

Best regards,


Re: Failed to request an autovacuum work-item in silence

2018-03-08 Thread Ildar Musin
Hello,

The patch applies and compiles cleanly, tests pass. The code is working
as well. I was able to check it by simply creating a BRIN index and
filling the table with data forcing the index to request lots of work items:

create table test (a serial, b text);
create index on test using brin (a) with (pages_per_range=1,
autosummarize=true);
insert into test select i, repeat(md5(random()::text), 30) from
generate_series(1, 3000) i;
LOG:  could not request an autovacuum work item "BrinSummarizeRange" for
"test_a_idx"
LOG:  could not request an autovacuum work item "BrinSummarizeRange" for
"test_a_idx"
...

Just a couple of remarks. I would rename the 'requested' variable in the
AutoVacuumRequestWork() function to something like 'success' or 'result',
because the request is something the caller does. I would also rephrase the
log message as follows:

   request for autovacuum work item "%s" for "%s" failed

Thanks!

-- 
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company 




Re: csv format for psql

2018-03-08 Thread Pavel Stehule
2018-03-08 9:29 GMT+01:00 Fabien COELHO :

>
> I.e. really generate some csv from the data in just one option, not many.
>>>
>>> But this is obviously debatable.
>>>
>>
>> I suspect we'll get requests for an all-JSON option, HTML tables,
>> etc., assuming we don't have them already.
>>
>
> I would definitely be fine with --html (which indeed already exists) &
> --json (which does not, but could some day) as long options.
>
> I'm hoping we can have that all in one framework.
>>
>
> ISTM that it is more or less the case if an option simply presets a bunch
> of existing table output options which is an existing framework.
>
> I get that setting each of tuples_only, fieldsep, recordsep, etc. might be
>> a bit of a lift for some users, but it's not clear how we'd make a sane
>> default that made choices among those correct for enough users. For
>> example, do we know that we want tuples_only behavior by default? A lot of
>> people's CSV tools assume a header row.
>>
>
> If there is a possible disagreement on one option, then let it out and use
> the corresponding short option if needed?
>
> Tuple only:
>
>   psql --csv -t -c 'TABLE foo' -o foo.csv
>
> With title headers:
>
>   psql --csv -c 'TABLE foo' -o foo.csv
>
> Would be okay.
>

+1

Pavel

>
> --
> Fabien.
>


Re: csv format for psql

2018-03-08 Thread Fabien COELHO



I.e. really generate some csv from the data in just one option, not many.

But this is obviously debatable.


I suspect we'll get requests for an all-JSON option, HTML tables,
etc., assuming we don't have them already.


I would definitely be fine with --html (which indeed already exists) & 
--json (which does not, but could some day) as long options.



I'm hoping we can have that all in one framework.


ISTM that it is more or less the case if an option simply presets a bunch 
of existing table output options which is an existing framework.


I get that setting each of tuples_only, fieldsep, recordsep, etc. 
might be a bit of a lift for some users, but it's not clear how we'd 
make a sane default that made choices among those correct for enough 
users. For example, do we know that we want tuples_only behavior by 
default? A lot of people's CSV tools assume a header row.


If there is a possible disagreement on one option, then let it out and use 
the corresponding short option if needed?


Tuple only:

  psql --csv -t -c 'TABLE foo' -o foo.csv

With title headers:

  psql --csv -c 'TABLE foo' -o foo.csv

Would be okay.

--
Fabien.



Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

2018-03-08 Thread David Gould
On Wed, 7 Mar 2018 21:39:08 -0800
Jeff Janes  wrote:

> As for preventing it in the first place, based on your description of your
> hardware and operations, I was going to say you need to increase the max
> number of autovac workers, but then I remembered you from "Autovacuum slows
> down with large numbers of tables. More workers makes it slower" (
> https://www.postgresql.org/message-id/20151030133252.3033.4249%40wrigleys.postgresql.org).
> So you are probably still suffering from that?  Your patch from then seemed
> to be pretty invasive and so controversial.

We have been building from source using that patch for the worker contention
since then. It's very effective, there is no way we could have continued to
rely on autovacuum without it. It's sort of a nuisance to keep updating it
for each point release that touches autovacuum, but here we are.

The current patch is motivated by the fact that even with effective workers
we still regularly find tables with inflated reltuples. I have some theories
about why, but not really proof. Mainly variants on "all the vacuum workers
were busy making their way through a list of 100,000 tables and did not get
back to the problem table before it became a problem."

I do have a design in mind for a larger more principled patch that fixes the
same issue and some others too, but given the reaction to the earlier one I
hesitate to spend a lot of time on it. I'd be happy to discuss a way to try
to move forward though if any one is interested.

Your patch helped, but mainly was targeted at the lock contention part of the
problem.

The other part of the problem was that autovacuum workers will force a rewrite
of the stats file every time they try to choose a new table to work on.
With large numbers of tables and many autovacuum workers this is a significant
extra workload.

-dg

-- 
David Gould   da...@sonic.net
If simplicity worked, the world would be overrun with insects.